Cloud Run - Too Many Instances Active

Hi folks!

Here are my Cloud Run settings:

  • Maximum concurrent requests per instance: 200
  • Execution environment: Second Generation
  • Revision scaling: min 0, max 100
  • Service scaling: min 1, max 100
  • Billing: Request-based
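In case it helps to reproduce this, a sketch of how to dump the current configuration from the CLI (the service name `my-service` and region `us-central1` are placeholders):

```shell
# Show the service's full configuration, including scaling,
# concurrency, and billing settings (names are placeholders).
gcloud run services describe my-service \
  --region=us-central1 \
  --format=yaml
```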

I’ve noticed that my instance count never drops below 2, and most of the time it stays at 4, even during cooldown periods. Here’s a 12-hour window where requests, CPU utilization, and memory utilization are all low, but the instance count is almost constantly at 4.

Could someone please advise me on how to investigate this further? What could be causing this behaviour?


Hey,

Hope you’re keeping well.

Cloud Run will keep instances warm if there’s residual request handling or if background processes prevent them from fully idling. Even with min-instances=0, traffic spikes or container startup latency can cause the autoscaler to hold extra instances for a while. Check for long-lived connections, streaming responses, or background threads in your app that might delay shutdown. You can view autoscaling decisions in Cloud Run > Metrics > Instance count alongside Request count to see correlation, and inspect logs for gaps between request end times and container termination.
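A quick way to pull those logs from the CLI, as a sketch (the service name is a placeholder; add a timestamp filter as needed):

```shell
# Read recent logs for the Cloud Run service's revisions
# (service name is a placeholder).
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="my-service"' \
  --limit=50
```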

Thanks and regards,
Taz


Hello @mikespy,

With the information that you provided, I would say that your Cloud Run is serving many HTTP calls that can take up to 5 minutes to complete.

Since the CPU and memory are not heavily used, I would try to raise the maximum concurrency (400 or 600) to use fewer Cloud Run instances.

Also, try enabling session affinity, so that requests from the same client are less likely to spin up multiple Cloud Run instances.

Last, what @iTazB said is very true: if your Cloud Run is doing its work, it won’t stop. Checking Log Explorer is always valuable.
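If it helps, both changes can be applied in one command, as a sketch (service name and region are placeholders):

```shell
# Raise per-instance concurrency and enable session affinity
# (service and region names are placeholders).
gcloud run services update my-service \
  --region=us-central1 \
  --concurrency=400 \
  --session-affinity
```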

Hi @LeoK ! Thanks for the tips. The 5 minute requests are actually the websocket connections. I’ll play with your suggestions.

Hi!

Thanks for the help! Those are the two metrics I compare, along with CPU and memory. A cool metric that I think would better indicate instance scaling due to traffic would be an “active connections” metric, showing the number of requests received but not yet responded to.

I am OK with instances being kept around with a delay, or pre-spun to handle traffic, but what concerned me was that the number doesn’t go down even over long time frames. That’s why I shared the graphs for a 12-hour window, to better illustrate this.

Here’s an interesting one: a 2-hour window where the instance count jumps regularly between 2 and 4. I don’t see any good cause for it, though.

Just a quick update!

I doubled the maximum concurrent requests from 200 to 400 and enabled session affinity, but the instance-count behaviour did not change. You can see in the following 2-hour window, where basically nothing happens (no socket connections, minimal requests), that the instances are stuck at 4, with at most 2 of them going idle.


@iTazB I checked the logs for the above 2-hour window and there are no scaling-up or scaling-down logs. As you can see, instances are stuck at 4, with up to 2 of them going idle.

For comparison, here’s some info from a 2-hour window during peak time:

And here are the relevant logs for the same window. It’s interesting that the first two scale-ups from 5 to 7 instances have the reason MANUAL_OR_CUSTOMER_MIN_INSTANCE. I did not manually increase the instances, and the minimum instances are set to 1.

But I think what the above showcases is that services correctly scale up and down as needed. It’s just that they don’t scale below 4.

Thanks for sharing more information :folded_hands:

The meaning of MANUAL_OR_CUSTOMER_MIN_INSTANCE from the docs:

Instance started because of customer-configured minimum instances or manual scaling.

I think you could lower the minimum instances to zero, as the doc says:

For example, if min-instances is 10, and the number of active instances is 0, then the number of idle instances is 10. When the number of active instances increases to 6, then the number of idle instances decreases to 4.

I have compared this to a Cloud Run instance in my project, and I still think that the problem is coming from the end-to-end latency.

Let’s say you have 20 requests per second, each able to keep a websocket open for up to 5 minutes: over those 5 minutes that’s 20 * 60 * 5 = 6000 requests potentially open at once. That’s very high, even with a concurrency of 400, because Cloud Run tends to scale at around 60% load (CPU, concurrency).
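As a rough back-of-envelope check (the 60% scaling target is the approximate figure mentioned above, not an exact contract):

```shell
# Steady-state open connections = arrival rate x connection duration
rate=20            # requests per second
duration=300       # seconds each websocket can stay open (5 minutes)
concurrency=400    # configured max concurrent requests per instance

concurrent=$(( rate * duration ))                    # open connections at once
target=$(( concurrency * 60 / 100 ))                 # autoscaler aims at ~60% load
instances=$(( (concurrent + target - 1) / target ))  # round up

echo "$concurrent open connections -> roughly $instances instances"
```

So long-lived sockets alone can justify dozens of instances, even when CPU looks idle.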

We still see that your CPU usage is very low, so I wouldn’t be afraid to use a concurrency of 1000 :eyes:

Also, would it be possible to lower the socket and/or the Cloud Run timeout to 1 minute instead of 5? It may give Cloud Run some breathing room to scale in and out properly.
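For the Cloud Run side of that, a sketch (service name and region are placeholders; note this caps every request at 60 seconds, so longer-lived sockets would be cut off):

```shell
# Lower the request timeout to 60 seconds (names are placeholders).
gcloud run services update my-service \
  --region=us-central1 \
  --timeout=60
```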

Thanks for looking into it!
I appreciate it!

Just two things, if you have additional thoughts:

I think you could lower the minimum instances to zero

The behaviour you describe makes sense, but changing the minimum instances from 1 to 0 shouldn’t fix the 4-instance issue I have, right? I’ll dive into it further, though; maybe a setting is stuck somewhere internally? I don’t even know if that’s possible.

I still think that the problem is coming from the end-to-end latency

Again, what you describe makes sense, but in the first 2-hour window (first graph of my previous reply) you can see latencies around 1 second, with no socket connections. So in this particular case I don’t see how the issue you describe correlates. Unless GCP keeps instances around to be prepared for regular traffic patterns? But 2 hours is a long time, and a constant 4 instances seems quite suspicious.

I would suggest finding ways to increase concurrency, or using Cloud Run jobs instead.
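For the jobs route, a minimal sketch (job name, image path, and region are all placeholders):

```shell
# Create a Cloud Run job for batch-style work that doesn't need
# to hold open connections (names and image are placeholders).
gcloud run jobs create my-job \
  --image=us-docker.pkg.dev/my-project/my-repo/my-image \
  --region=us-central1

# Run it on demand.
gcloud run jobs execute my-job --region=us-central1
```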