I’ve noticed that my instance count does not go below 2, and most of the time it stays at 4, even during cooldown periods. Here’s a 12 hour window where requests, CPU utilization, and memory utilization are all at low levels, but the instance count is almost constantly at 4.
Cloud Run will keep instances warm if there’s residual request handling or if background processes prevent them from fully idling. Even with min-instances=0, traffic spikes or container startup latency can cause the autoscaler to hold extra instances for a while. Check for long-lived connections, streaming responses, or background threads in your app that might delay shutdown. You can view autoscaling decisions in Cloud Run > Metrics > Instance count alongside Request count to see correlation, and inspect logs for gaps between request end times and container termination.
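If it helps, you can pull those container lifecycle logs from the command line too. A sketch of the query (the service name `my-service` is a placeholder, and you may want to narrow the filter further):

```shell
# Fetch recent logs for the Cloud Run service's revisions, so request-end
# timestamps can be compared against container termination events.
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="my-service"' \
  --limit=50 \
  --format="table(timestamp, textPayload)"
```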
Thanks for the help! Those are the two metrics I compare, along with CPU and memory. A useful metric that I think would better indicate instance scaling due to traffic would be an “active connections” metric, showing the number of requests received but not yet responded to.
I am fine with instances being kept around with a delay or pre-spun to handle traffic, but what concerned me was that the number doesn’t go down even over long time frames. That’s why I shared the graphs for a 12 hour window, to better illustrate this.
Here’s an interesting one, a 2 hour window, where instance count jumps regularly between 2 and 4. I don’t see any good cause for it though.
I doubled the maximum concurrent requests from 200 to 400 and enabled session affinity, but the instance count behaviour did not change. You can see this in the following 2 hour window, where basically nothing happens (no socket connections, minimal requests) and the instances are stuck at 4, with at most 2 of them going idle.
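For reference, this is roughly the change I applied (service name `my-service` is a placeholder):

```shell
# Raise the per-instance concurrency limit and turn on session affinity.
gcloud run services update my-service \
  --concurrency=400 \
  --session-affinity
```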
@iTazB I checked the logs for the above 2 hour window and there are no scale-up or scale-down logs. As you can see, the instances are stuck at 4, with up to 2 instances going idle.
Here’s some info in comparison from a 2 hour window during peak time:
And here are the relevant logs for the same window. It’s interesting that the first two scale ups from 5 to 7 instances have the reason MANUAL_OR_CUSTOMER_MIN_INSTANCE. I did not manually increase the instances and the minimum instances are set to 1.
I think you could lower the minimum instances to zero, as the doc says:
For example, if min-instances is 10, and the number of active instances is 0, then the number of idle instances is 10. When the number of active instances increases to 6, the number of idle instances decreases to 4.
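If you want to try it, the change is a one-liner (`my-service` is a placeholder here):

```shell
# Allow the service to scale all the way down to zero instances.
gcloud run services update my-service --min-instances=0
```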
I have compared this to a Cloud Run instance in my project, and I still think that the problem is coming from the end-to-end latency.
Let’s say you have 20 requests per second for 5 minutes; that’s 20 * 60 * 5 = 6000 requests, each of which can keep a websocket open for up to five minutes. That’s very high, even with a concurrency of 400, because Cloud Run tends to scale at around 60% load (CPU, concurrency).
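A quick back-of-the-envelope check of that arithmetic (the request rate, hold time, and concurrency limit are the values assumed above, not measured from your service):

```python
import math

# Concurrent open websockets ~= arrival rate * average hold time (Little's law).
arrival_rate = 20          # requests per second (assumed)
hold_time = 5 * 60         # each socket may stay open up to 5 minutes
concurrent = arrival_rate * hold_time
print(concurrent)          # 6000 sockets potentially open at once

# If Cloud Run scales out at roughly 60% of the concurrency limit,
# the effective per-instance target is lower than the configured 400.
concurrency_limit = 400
target = 0.6 * concurrency_limit
instances = math.ceil(concurrent / target)
print(instances)           # about 25 instances to absorb that load
```

So under these assumptions, long-lived sockets dominate the scaling math even when CPU stays low.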
We still see that your CPU usage is very low, so I wouldn’t be afraid to use a concurrency of 1000.
Also, would it be possible to lower the socket and/or the Cloud Run timeout to 1 minute instead of 5 minutes? It may help Cloud Run to breathe a little in order to let it scale in/out properly.
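On the Cloud Run side, lowering the request timeout looks like this (`my-service` is a placeholder; the socket timeout would be changed in your app itself):

```shell
# Cap the request timeout at 60 seconds instead of the default.
gcloud run services update my-service --timeout=60
```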
I think you could lower the minimum instances to zero
The behaviour you describe makes sense, but changing the minimum instances from 1 to 0 should not fix the 4-instance issue I have, right? I’ll dive into it further though; maybe a setting is stuck somewhere internally? I don’t even know if that’s possible.
I still think that the problem is coming from the end-to-end latency
Again, what you describe makes sense, but in the first 2 hour window (first graph of my previous reply), you can see the latencies hovering around 1 second, with no socket connections. So in this particular case I don’t see how the issue you describe correlates. Unless GCP keeps instances around preemptively, due to regular traffic patterns? But 2 hours is a long time, and always 4 instances seems quite suspicious.
I think I’ve found the reason for all of this because it just happened to me today. It was probably under our noses all along!
When I changed the billing for one of our most used Cloud Run APIs from instance-based to request-based, the number of containers increased, as did the idle ones.
Note that before being called “Billing: Request/Instance based,” it was called “Billing: CPU Allocation.” (see)
It may seem counterintuitive because setting the app to request-based billing appears to increase the number of containers, but in reality, you will pay only for the requests and not for the instances.
You may get a better-looking metrics graph using instance-based billing, but you may also pay more… which I’m not really certain about, since you seem to have steady activity and are using websockets. While reading Billing settings for services, I think instance-based billing may be more appropriate, as it mentions the following:
Instance-based billing is recommended when incoming traffic is steady, slowly varying.
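For anyone wanting to toggle this, the billing mode maps to the CPU allocation flags on `gcloud run services update` (`my-service` is a placeholder):

```shell
# Request-based billing: CPU is throttled outside of request handling.
gcloud run services update my-service --cpu-throttling

# Instance-based billing: CPU is always allocated while instances run.
gcloud run services update my-service --no-cpu-throttling
```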