We are seeing repeated incidents where Cloud Tasks continues dispatching requests to a Vertex AI online prediction endpoint, but most attempts fail with UNAVAILABLE (HTTP 503). During the same windows, the deployed custom container continues to receive /health checks (200 OK), but no /predict requests reach the application. This strongly suggests the Vertex front-end/proxy is rejecting or not routing prediction requests upstream before they hit the container.
**Impact**
- Cloud Tasks queue backlog grows while prediction traffic to the container drops to zero.
- Transcription latency increases from ~1 minute to ~5–10+ minutes.
- Autoscaling does not scale up during the incident because the average GPU duty cycle stays at ~0 (requests are not reaching the GPU/container).
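The last point can be made concrete with a small sketch. Assuming an HPA-style proportional scaling rule (an assumption on our part — Vertex AI does not publish its exact algorithm), a duty cycle near zero keeps the desired replica count pinned at the minimum no matter how large the queue backlog grows:

```typescript
// Sketch of duty-cycle-based autoscaling intuition. The proportional rule
// below is an assumption (HPA-style), not Vertex AI's documented algorithm.
function desiredReplicas(
  current: number,
  observedDutyCycle: number, // percent, e.g. 60
  targetDutyCycle: number,   // percent, from autoscalingMetricSpecs target
): number {
  return Math.ceil(current * (observedDutyCycle / targetDutyCycle));
}

// During the incident: duty cycle ~0, so the rule never asks for more
// replicas (the result is clamped to minReplicaCount in practice).
console.log(desiredReplicas(2, 0, 60));  // 0
// Under real load the same rule does scale up: 90% observed vs 60% target.
console.log(desiredReplicas(2, 90, 60)); // 3
```

This is why the queue backlog alone never triggers scale-up: the autoscaler only sees GPU utilization, and the routing failure keeps utilization at zero.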
**Observed Behavior**

During multiple incident windows:
- Cloud Tasks continues dispatching to the Vertex predict URL, but the majority of task attempts fail with `status: UNAVAILABLE` (observed in the Cloud Tasks `task_operations_log`), typically failing quickly.
- Container logs show repeated /health requests returning 200, but zero /predict requests reach the application during the same window.
- Endpoint metrics show the average accelerator duty cycle dropping to ~0 during the incident window, consistent with predictions not being routed to the replica.
- After an extended gap (~20 minutes), prediction traffic resumes and the endpoint scales up because the queue backlog has grown.
- After scaling back down to minReplicaCount (1), the issue can recur: the container again stops receiving /predict traffic for ~20 minutes while /health checks continue to succeed.
**Evidence**
Average duty cycle pattern per replica
When the average duty cycle is zero, the endpoint logs show that no requests reach the container at all, even though /health requests are still processed and a 200 status code is returned correctly.
Queue metric
After a gap of about 20 minutes, the container starts receiving requests again and the endpoint scales up, since the queue size has already grown. Once it scales back down to minReplicaCount, the container again stops receiving requests for about 20 minutes.
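This is not a fix for the routing issue itself, but as a possible interim mitigation on our side (an assumption we have not validated), a gentler Cloud Tasks retry policy would stop attempts from failing fast against UNAVAILABLE and instead spread them across the ~20-minute gap. Field names follow the Cloud Tasks queue `retryConfig`; the values are illustrative:

```typescript
// Illustrative Cloud Tasks queue retryConfig (values are assumptions,
// tuned to ride out a ~20-minute routing gap rather than fail fast).
const queueRetryConfig = {
  maxAttempts: 20,     // enough attempts to span the observed gap
  minBackoff: '10s',   // first retry delay
  maxBackoff: '300s',  // cap individual waits at 5 minutes
  maxDoublings: 5,     // 10s -> 20s -> 40s -> 80s -> 160s -> 300s (capped)
};
console.log(queueRetryConfig);
```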
**Model config**

```typescript
model: {
  name: `whisper-x-${stage}`,
  resources: {
    machineSpec: {
      machineType: 'a2-highgpu-1g',
      acceleratorType: 'NVIDIA_TESLA_A100',
      acceleratorCount: 1,
    },
    minReplicaCount: 2,
    maxReplicaCount: 14,
    autoscalingMetricSpecs: [
      {
        metricName: 'aiplatform.googleapis.com/prediction/online/accelerator/duty_cycle',
        target: 60,
      },
    ],
  },
},
```
We’d like assistance identifying why Vertex is intermittently not routing prediction traffic to a deployed replica (while still routing /health checks), resulting in upstream UNAVAILABLE/503 errors. Specifically:
- Is this consistent with Vertex load-shedding, "no ready replicas," or a front-end routing issue?
- Are there known conditions where /health continues to succeed but /predict is rejected upstream?
- What additional logs/metrics can confirm the root cause (e.g., health probe status, backend readiness, routing decisions)?
- What configuration changes do you recommend to prevent recurrence (e.g., min replicas > 1, autoscaling settings, concurrency guidance for GPU endpoints, dedicated endpoint capacity constraints, etc.)?
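To make the last point concrete, here is one possible hardened version of the resources block from the deployment config above. The lower duty-cycle target and keeping minReplicaCount at 2 are assumptions for discussion, not settings we have validated:

```typescript
// Sketch: resources block that keeps more than one ready replica and
// scales up earlier. The target of 45 is an illustrative assumption.
const hardenedResources = {
  machineSpec: {
    machineType: 'a2-highgpu-1g',
    acceleratorType: 'NVIDIA_TESLA_A100',
    acceleratorCount: 1,
  },
  minReplicaCount: 2,  // avoid ever dropping to a single replica
  maxReplicaCount: 14,
  autoscalingMetricSpecs: [
    {
      metricName:
        'aiplatform.googleapis.com/prediction/online/accelerator/duty_cycle',
      target: 45,      // scale up before GPUs saturate
    },
  ],
};
console.log(hardenedResources.minReplicaCount);
```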


