Hello community!
I’ve been running detailed Cloud Run GPU autoscaling tests on our scaling-test service (L4 GPU, europe-west1) to investigate why scale-out doesn’t trigger properly.
Here’s what I found:
Autoscaling works only when containerConcurrency = 1.
When it is set to 5, the autoscaler’s “Recommended Instances” metric increases (e.g., to 2), but the actual instance count stays at 1.
Quota and capacity are fine: manually setting min-instances=3 immediately spins up 3 GPU instances.
No readiness or quota errors in the logs; CPU ≈ 30%, GPU at 100%.
This points to a potential autoscaler scheduling or signal issue for GPU-bound workloads when CPU utilization is low.
I’ve prepared a short doc with setup details, metrics, and screenshots to include in the support ticket.
Would appreciate it if someone could confirm whether this behavior is known for L4 GPU services in Cloud Run or if there’s an open bug on concurrency-only scaling signals.
Autoscaling will work if you set containerConcurrency = 1 because Cloud Run then provisions another instance for each additional in-flight request. That’s how Maximum concurrent requests per instance works:
When you set it to 5, Cloud Run won’t start another instance until more than 5 concurrent requests are in flight or the CPU is overwhelmed. You can learn more by reading the documentation About instance autoscaling in Cloud Run services.
If you use the default Cloud Run autoscaling, Cloud Run automatically scales the number of instances of each revision based on factors such as CPU utilisation and request concurrency. However, Cloud Run does not automatically scale the number of instances based on GPU utilisation.
For a revision with a GPU, if the revision does not have significant CPU usage, Cloud Run scales out for request concurrency. To achieve optimal scaling for request concurrency, you must set an optimal maximum concurrent requests per instance, as described in the next section.
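To make that concrete, here is a minimal sketch of the concurrency-based signal (my own simplification for illustration; Cloud Run’s real autoscaler also weighs CPU utilization, scheduling, and other internal factors):

```python
import math

def instances_needed(concurrent_requests: int,
                     container_concurrency: int,
                     max_instances: int) -> int:
    """Rough illustration of concurrency-based scaling: enough instances so that
    no instance holds more than container_concurrency in-flight requests,
    capped at max_instances. Not Cloud Run's actual algorithm."""
    raw = math.ceil(concurrent_requests / container_concurrency)
    return max(1, min(raw, max_instances))

# containerConcurrency = 1: every extra in-flight request needs another instance.
print(instances_needed(concurrent_requests=3, container_concurrency=1, max_instances=10))  # 3
# containerConcurrency = 5: a second instance is only warranted once >5 requests are in flight.
print(instances_needed(concurrent_requests=4, container_concurrency=5, max_instances=10))  # 1
print(instances_needed(concurrent_requests=6, container_concurrency=5, max_instances=10))  # 2
```

With containerConcurrency = 1, every additional in-flight request maps to one more instance, which matches the immediate scale-out you saw in that configuration.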
I would say that you have just found out for yourself how Cloud Run and GPUs work together. Your workload probably saturates the GPU anyway, so I would keep containerConcurrency = 1 so that each instance’s GPU is 100% dedicated to one request and works through it as fast as possible.
Also, note that if min-instances=x with x >= 1 is used, you will always have a Cloud Run instance up and ready, which can be costly with a GPU given how Cloud Run billing works. You can learn more by reading the documentation on Set minimum instances for services. If your Cloud Run service is a high-demand API that receives queries 24/7, keeping warm instances can be beneficial, but if it serves sporadic workloads, it’s better to set min-instances to 0.
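To put a rough number on that, here is a back-of-the-envelope sketch; the one-busy-hour-per-day figure is a made-up example rather than your traffic, and it assumes instance-based billing where a warm instance is billed for its full uptime:

```python
hours_per_month = 24 * 30        # ~720 billable hours if one GPU instance stays warm all month
busy_hours_per_day = 1           # hypothetical sporadic workload: ~1 busy hour per day
serving_hours_per_month = busy_hours_per_day * 30

print(f"min-instances=1: ~{hours_per_month} billable GPU-instance hours/month")
print(f"min-instances=0: ~{serving_hours_per_month} billable GPU-instance hours/month")
print(f"always-on is ~{hours_per_month // serving_hours_per_month}x more instance time for this pattern")
```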
Just to clarify: in the stress tests I was sending ~60 concurrent client requests continuously. Each request normally takes around 300 ms inside the container, but under this sustained load the end-to-end latency rises to ~4 s because Cloud Run’s pending-request queue stays at around 50–55.
This level of traffic should be more than enough to trigger a scale-out event, and the “Recommended Instances” metric consistently shows 2, yet even after hours of sustained load no new instances are created (the instance count remains 1).
Quota and readiness are both healthy, so it looks like the concurrency-based scaling signal isn’t triggering instance creation when container concurrency is 5.
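For what it’s worth, those numbers are internally consistent with a single instance doing all the work. A quick back-of-the-envelope check using Little’s law, with the 60 concurrent clients, ~300 ms service time, and concurrency 5 from the test above:

```python
clients = 60              # sustained concurrent client requests
service_time_s = 0.30     # in-container time per request (~300 ms)
concurrency = 5           # containerConcurrency on the serving instance
instances = 1             # observed instance count

# One instance working 5 requests in parallel at ~300 ms each:
throughput_rps = instances * concurrency / service_time_s   # ~16.7 req/s

# Little's law: ~60 requests in the system leaving at ~16.7 req/s means
# each request spends roughly 60 / 16.7 seconds end to end.
end_to_end_latency_s = clients / throughput_rps              # ~3.6 s

queued = clients - instances * concurrency                   # ~55 pending requests

print(f"throughput ≈ {throughput_rps:.1f} req/s")
print(f"expected end-to-end latency ≈ {end_to_end_latency_s:.1f} s (observed ~4 s)")
print(f"expected pending queue ≈ {queued} (observed ~50–55)")
```

So the extra ~3 s of latency is almost entirely queueing in front of a single instance, which is exactly what additional instances should relieve.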
Thanks for the clarification; I understand your question better now. So we see that:
containerConcurrency = 1 + max instances 3 → 10 requests/second, recommended instances: 5
containerConcurrency = 5 + max instances 3 → 17 requests/second, recommended instances: 2
The second option processes more requests but may end up with worse latency.
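Both throughput figures are consistent with the ~300 ms per-request time reported earlier, if we assume (as the observed instance count suggests) that the concurrency-5 configuration was still effectively served by one instance. A quick sanity check:

```python
service_time_s = 0.30                            # ~300 ms per request inside the container

# containerConcurrency = 1 with 3 instances actually serving:
print(f"{3 * 1 / service_time_s:.1f} req/s")     # 10.0 req/s

# containerConcurrency = 5 but only 1 instance actually serving:
print(f"{1 * 5 / service_time_s:.1f} req/s")     # 16.7 req/s, i.e. the ~17 req/s above
```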
I would say that the page about Cloud Run Autoscaling and GPUs is still relevant because it mentions the following information:
If maximum concurrent requests is set too high, requests might end up waiting inside the instance for access to the GPU, which leads to increased latency. If maximum concurrent requests is set too low, the GPU might be underutilized, causing Cloud Run to scale out more instances than necessary.
A rule of thumb for configuring maximum concurrent requests for AI workloads is:
(Number of model instances * parallel queries per model) + (number of model instances * ideal batch size)
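As a worked example of that rule of thumb (all three numbers below are hypothetical placeholders, not values from this service):

```python
model_instances = 1     # hypothetical: one copy of the model loaded on the L4
parallel_queries = 2    # hypothetical: queries the model can serve in parallel
ideal_batch_size = 4    # hypothetical: batch size that keeps the GPU busy

max_concurrent_requests = (model_instances * parallel_queries) + (model_instances * ideal_batch_size)
print(max_concurrent_requests)  # 6 -> candidate value for maximum concurrent requests
```

The result would then be the starting point for maximum concurrent requests (containerConcurrency), to be validated against observed GPU utilization and latency.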
I see that your Cloud Run dashboard is customized; recommended_instances is an added widget that is not there by default. I suspect it is purely informational and acts as a recommendation rather than a scaling signal, since it is a beta metric for the moment.
I agree that higher concurrency can often lead to increased latency, but in my case, a concurrency of 5 is too low to justify such a large latency jump. As shown in the metrics from my shared report, the in-container p95 latency stays around 300 ms, while the end-to-end latency rises to about 4 s due to Cloud Run’s pending request queue.
This clearly shows that the model’s inference time isn’t affected by the concurrency of 5.
What I’d like to understand is how the “Recommended Instances” metric actually ties into the scaling trigger, and why we’re seeing this inconsistent behavior; ideally, someone from the Google Cloud team could provide internal clarification on this.