Hello,
I’ve been pretty routinely sending jobs to the batch prediction API and a small percentage of the jobs will get stuck in the ‘running’ state for more than 3 days with no signs of stopping. The jobs are all relatively identical with about 1000 requests per job and generally complete in a few hours with all succeeding within 24H.
But every so often a job will just never leave the running state. I’d expect GCP to cancel the job after a certain period of time due to timeout or if something went wrong. To add to this, attempting to cancel the job via the GCP UI fails to cancel the job and it keeps running…
Any ideas?
Brandon
Hi @brandonsmith,
Welcome to Google Cloud Community!
Here are some potential causes of Vertex AI Batch Prediction jobs getting stuck in a ‘running’ state:
- Resource Issues:
  - Insufficient Compute: Although your jobs usually complete quickly, there might be subtle differences in the data for these stuck jobs that trigger higher resource demands. This can lead to a stalled process if the allocated resources are insufficient.
  - Quota Limits: While unlikely given that most jobs run fine, a quota limit could be hit during these specific runs. This can happen if there are many other concurrent jobs running or the machine types are large.
  - Underlying Infrastructure Problems: Less frequently, issues within Google’s infrastructure could cause compute instances to be unhealthy or get stuck.
- Data Issues:
  - Corrupted or Problematic Data: Some data rows might be causing a specific issue within your model prediction logic, leading to an infinite loop or unhandled exception.
  - Data Skew: Data for specific rows might create a bottleneck in the prediction pipeline.
  - Input Format Issues: Occasionally, there might be subtle input data format issues that your model doesn’t handle gracefully.
- Model Issues:
  - Infinite Loops or Bugs in Prediction Logic: If your prediction code has an infinite loop or a bug that doesn’t return a response, it could indefinitely hang.
  - Memory Leaks: Memory leaks can cause jobs to use more memory over time, eventually leading to performance degradation and potentially hanging.
- Vertex AI Service Issues:
  - Infrequent Service Bugs: Rarely, issues within the Vertex AI service itself could cause batch prediction jobs to hang.
  - Networking Issues: While less likely, intermittent networking issues could hinder communication within the job’s pipeline.
- Cancellation Issues:
  - Service Delays: The cancel signal might be experiencing delays in propagation, or the job might have progressed past the point where a clean cancel is possible.
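Whatever the cause turns out to be, it can help to flag stuck jobs automatically instead of noticing them days later. The following is a minimal sketch, not an official tool: `find_stuck_jobs` is a hypothetical helper, the 24-hour threshold is taken from the completion times described in the question, and the `(name, state, create_time)` tuples are assumed to be collected by you, for example from the Vertex AI SDK or the console.

```python
from datetime import datetime, timedelta, timezone


def find_stuck_jobs(jobs, now, max_running=timedelta(hours=24)):
    """Return the names of jobs that are still running after max_running.

    jobs is an iterable of (name, state, create_time) tuples, where state is
    a Vertex AI job state name such as "JOB_STATE_RUNNING" and create_time is
    a timezone-aware datetime. Creation time is used as an approximation of
    how long the job has been running.
    """
    return [
        name
        for name, state, create_time in jobs
        if state == "JOB_STATE_RUNNING" and now - create_time > max_running
    ]


# Example with made-up job records (names and times are illustrative only).
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
example_jobs = [
    ("job-a", "JOB_STATE_RUNNING", now - timedelta(days=3)),
    ("job-b", "JOB_STATE_RUNNING", now - timedelta(hours=2)),
    ("job-c", "JOB_STATE_SUCCEEDED", now - timedelta(days=5)),
]
stuck = find_stuck_jobs(example_jobs, now)
```

With the example records above, only `job-a` is reported, since it is the only job still running beyond the threshold.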
Here are some troubleshooting approaches that you can try:
- Logging and Monitoring:
  - Vertex AI Job Logs:
    - Cloud Logging: Go to the Google Cloud Console and check Cloud Logging for your Vertex AI batch prediction job.
    - Look for errors or warnings: Pay close attention to any exceptions, stack traces, or unusual behavior.
    - Check for stalled progress: If the logs stop updating, it’s a clear sign that the job is stuck.
  - Custom Logging: Make sure your prediction code includes sufficient logging.
    - Log the prediction input and output per row, especially on those problematic jobs.
    - Log any key steps within the prediction logic to pinpoint where the job might be stuck.
  - Cloud Monitoring:
    - Check resource usage for CPU, memory and disk I/O. See if there are signs of resource saturation (high resource usage for extended periods).
    - Monitor error metrics.
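The custom logging suggestion above could look roughly like the sketch below. It only applies if you control the prediction code (for example in a custom container). `predict_with_logging` and `predict_fn` are hypothetical names standing in for your own wrapper and model call; nothing here is part of the Vertex AI API.

```python
import logging
import time

logger = logging.getLogger("batch_prediction")


def predict_with_logging(predict_fn, rows):
    """Call predict_fn on each row, logging input, output, latency and errors.

    A row that raises is logged with its stack trace and recorded as None, so
    one bad row does not hide the remaining rows. The last "input" line
    without a matching "output" line points at the row where a hang started.
    """
    results = []
    for index, row in enumerate(rows):
        logger.info("row %d: input=%r", index, row)
        started = time.monotonic()
        try:
            output = predict_fn(row)
        except Exception:
            logger.exception("row %d: prediction failed", index)
            results.append(None)
            continue
        elapsed = time.monotonic() - started
        logger.info("row %d: output=%r elapsed=%.3fs", index, output, elapsed)
        results.append(output)
    return results
```

Logging the input before calling the model is the important part: if the job hangs inside `predict_fn`, the row that triggered it is already in the logs.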
- Data Analysis:
  - Isolate the problematic data: Try running the batch prediction job on smaller subsets of your input data.
  - Identify data patterns: Are there specific characteristics that identify the data rows where the job hangs?
  - Compare failed input vs successful: Is there a clear difference in distribution, type or quantity?
- Model Review:
  - Prediction Logic: Carefully review your prediction logic, looking for potential infinite loops or memory leaks.
  - Exception Handling: Make sure your code handles exceptions gracefully and logs them appropriately. Add more error handling if necessary.
  - Test Locally: If possible, try reproducing the prediction logic locally to help identify and debug code-level issues.
- Resource Configuration:
  - Increase Compute: Try increasing the resources for your batch prediction job (machine type, number of nodes).
  - Autoscaling: Consider using the autoscaling feature of the batch prediction jobs (if available).
  - Quota Limits: Verify your project’s Vertex AI quotas in the Google Cloud Console. If necessary, request a quota increase.
- Cancellation Attempts:
  - Programmatic Cancellation: Try canceling the job through the Vertex AI client libraries or using the gcloud CLI. This can sometimes be more reliable than the UI.
  - gcloud ai batch-predictions cancel: This command can be more efficient at triggering the cancellation.
  - Retry Cancel: You can try canceling repeatedly, with a few minutes in between.
- Vertex AI Service Status:
  - Check Google Cloud Status Dashboard: Keep an eye on the Google Cloud Status Dashboard for any known incidents that might be affecting Vertex AI.
- Contact GCP Support: If none of the above steps help, contact Google Cloud Support. They have better visibility into the underlying system and can assist you with specific issues. Also, I suggest filing a defect report. Because the issue tracker is public, you can follow the progress of your report there. Please note that I can’t provide any details or timelines at this moment. For future updates, I suggest keeping an eye on the issue tracker.
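The programmatic cancellation and retry suggestions above can be sketched as follows. `cancel_with_retries` is a hypothetical helper, and the attempt count and 3-minute wait are arbitrary choices rather than documented values. It takes two callables so it can be tried without any cloud access; the commented usage at the bottom assumes the Vertex AI SDK for Python (`google-cloud-aiplatform`), with placeholder project, location and job ID.

```python
import time

# Vertex AI job states after which the job will not change any further.
TERMINAL_STATES = {
    "JOB_STATE_SUCCEEDED",
    "JOB_STATE_FAILED",
    "JOB_STATE_CANCELLED",
    "JOB_STATE_EXPIRED",
}


def cancel_with_retries(cancel_fn, get_state_fn, attempts=5,
                        wait_seconds=180, sleep=time.sleep):
    """Request cancellation repeatedly until the job reaches a terminal state.

    cancel_fn sends one cancel request; get_state_fn returns the current job
    state name as a string. Returns the last observed state, which is still
    non-terminal if every attempt was used up without success.
    """
    for _ in range(attempts):
        state = get_state_fn()
        if state in TERMINAL_STATES:
            return state
        cancel_fn()
        sleep(wait_seconds)
    return get_state_fn()


# Assumed usage with the Vertex AI SDK (placeholders, not real values):
#
#   from google.cloud import aiplatform
#   aiplatform.init(project="PROJECT", location="LOCATION")
#   job = aiplatform.BatchPredictionJob("JOB_ID")
#   final_state = cancel_with_retries(job.cancel, lambda: job.state.name)
```

If the returned state is still `JOB_STATE_RUNNING` or `JOB_STATE_CANCELLING` after all attempts, that is useful evidence to attach to a support case or issue tracker report.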
Here’s a similar case that you may check and may find useful.
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.