I’ve noticed that my instance count does not go below 2, and most of the time it stays at 4, even during cooldown periods. Here’s a 12 hour window where requests, CPU utilization, and memory utilization are all at low levels, but the instance count is almost constantly at 4.
Cloud Run will keep instances warm if there’s residual request handling or if background processes prevent them from fully idling. Even with min-instances=0, traffic spikes or container startup latency can cause the autoscaler to hold extra instances for a while. Check for long-lived connections, streaming responses, or background threads in your app that might delay shutdown. You can view autoscaling decisions in Cloud Run > Metrics > Instance count alongside Request count to see correlation, and inspect logs for gaps between request end times and container termination.
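If it helps, you can pull those container lifecycle logs from the command line too. A sketch of the query (the service name `my-service` is a placeholder, and you may want to narrow the filter further):

```shell
# Fetch recent logs for the Cloud Run service's revisions, so request-end
# timestamps can be compared against container termination events.
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="my-service"' \
  --limit=50 \
  --format="table(timestamp, textPayload)"
```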
Thanks for the help! Those are the two metrics I compare, along with CPU and memory. A useful metric that I think would better indicate instance scaling due to traffic would be an “active connections” metric, showing the number of requests received but not yet responded to.
I am fine with instances being kept around with a delay or pre-spun to handle traffic, but what concerned me was that the number doesn’t go down even over long time frames. That’s why I shared the graphs for a 12 hour window, to better illustrate this.
Here’s an interesting one, a 2 hour window, where instance count jumps regularly between 2 and 4. I don’t see any good cause for it though.
I doubled the maximum concurrent requests from 200 to 400 and enabled session affinity, but the instance count behaviour did not change. You can see this in the following 2 hour window, where basically nothing happens (no socket connections, minimal requests) and the instances are stuck at 4, with at most 2 of them going idle.
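For reference, this is roughly the change I applied (service name `my-service` is a placeholder):

```shell
# Raise the per-instance concurrency limit and turn on session affinity.
gcloud run services update my-service \
  --concurrency=400 \
  --session-affinity
```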
@iTazB I checked the logs for the above 2 hour window and there are no scale-up or scale-down logs. As you can see, the instances are stuck at 4, with up to 2 instances going idle.
Here’s some info in comparison from a 2 hour window during peak time:
And here are the relevant logs for the same window. It’s interesting that the first two scale ups from 5 to 7 instances have the reason MANUAL_OR_CUSTOMER_MIN_INSTANCE. I did not manually increase the instances and the minimum instances are set to 1.
I think you could lower the minimum instances to zero, as the doc says:
For example, if min-instances is 10, and the number of active instances is 0, then the number of idle instances is 10. When the number of active instances increases to 6, the number of idle instances decreases to 4.
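If you want to try it, the change is a one-liner (`my-service` is a placeholder here):

```shell
# Allow the service to scale all the way down to zero instances.
gcloud run services update my-service --min-instances=0
```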
I have compared this to a Cloud Run instance in my project, and I still think that the problem is coming from the end-to-end latency.
Let’s say you have 20 requests per second for 5 minutes; that’s 20 * 60 * 5 = 6000 requests, each of which can keep a websocket open for up to five minutes. That’s very high, even with a concurrency of 400, because Cloud Run tends to scale at around 60% load (CPU, concurrency).
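A quick back-of-the-envelope check of that arithmetic (the request rate, hold time, and concurrency limit are the values assumed above, not measured from your service):

```python
import math

# Concurrent open websockets ~= arrival rate * average hold time (Little's law).
arrival_rate = 20          # requests per second (assumed)
hold_time = 5 * 60         # each socket may stay open up to 5 minutes
concurrent = arrival_rate * hold_time
print(concurrent)          # 6000 sockets potentially open at once

# If Cloud Run scales out at roughly 60% of the concurrency limit,
# the effective per-instance target is lower than the configured 400.
concurrency_limit = 400
target = 0.6 * concurrency_limit
instances = math.ceil(concurrent / target)
print(instances)           # about 25 instances to absorb that load
```

So under these assumptions, long-lived sockets dominate the scaling math even when CPU stays low.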
We still see that your CPU usage is very low, so I wouldn’t be afraid to use a concurrency of 1000.
Also, would it be possible to lower the socket and/or the Cloud Run timeout to 1 minute instead of 5 minutes? It may help Cloud Run to breathe a little in order to let it scale in/out properly.
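On the Cloud Run side, lowering the request timeout looks like this (`my-service` is a placeholder; the socket timeout would be changed in your app itself):

```shell
# Cap the request timeout at 60 seconds instead of the default.
gcloud run services update my-service --timeout=60
```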
I think you could lower the minimum instances to zero
The behaviour you describe makes sense, but changing the minimum instances from 1 to 0 should not fix the 4-instance issue I have, right? I’ll dive into it further though; maybe a setting is stuck somewhere internally? I don’t even know if that’s possible.
I still think that the problem is coming from the end-to-end latency
Again, what you describe makes sense, but in the first 2 hour window (first graph of my previous reply), you can see the latencies hovering around 1 second, with no socket connections. So in this particular case I don’t see how the issue you describe correlates. Unless GCP keeps instances around preemptively, due to regular traffic patterns? But 2 hours is a long time, and always 4 instances seems quite suspicious.
I think I’ve found the reason for all of this because it just happened to me today. It was probably under our noses all along!
When I changed the billing for one of our most used Cloud Run APIs from instance-based to request-based, the number of containers increased, as did the idle ones.
Note that before being called “Billing: Request/Instance based,” it was called “Billing: CPU Allocation.” (see)
It may seem counterintuitive because setting the app to request-based billing appears to increase the number of containers, but in reality, you will pay only for the requests and not for the instances.
You may get a better-looking metrics graph using instance-based billing, but you may also pay more… which I’m not really certain about, since you seem to have steady activity and are using websockets. While reading Billing settings for services, I think instance-based billing may be more appropriate, as it mentions the following:
Instance-based billing is recommended when incoming traffic is steady, slowly varying.
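For anyone wanting to toggle this, the billing mode maps to the CPU allocation flags on `gcloud run services update` (`my-service` is a placeholder):

```shell
# Request-based billing: CPU is throttled outside of request handling.
gcloud run services update my-service --cpu-throttling

# Instance-based billing: CPU is always allocated while instances run.
gcloud run services update my-service --no-cpu-throttling
```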