I have a Batch job started by a GCP Workflow. I recently received a series of ZONE_RESOURCE_POOL_EXHAUSTED errors, and the workflow eventually timed out after 30 minutes. The current location is “us-central1”, and I assume that all zones in that region were checked for resources. I was hoping I could specify other locations with allowedLocations, but I see that specifying multiple regions is not permitted.
    call: googleapis.batch.v1.projects.locations.jobs.create
    args:
        parent: ${"projects/" + project + "/locations/" + location}
What is the best practice for avoiding or recovering from these errors? I see two possibilities.
- I could catch the timeout error from the create call:
"{"message":"Timeout of 1800 seconds exceeded. The timeout occurred during operation status polling.","tags":["TimeoutError","OSError"]}"
Then I could switch the region to us-east1 and/or change the machine type or similar and try again, but by that point I am already 30 minutes in the hole. I guess I could lower the timeout value…
- I could set up a reserved instance for the machine type and region for 730 hours a month, but then how do I handle concurrent requests?
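For the first option, this is roughly what I have in mind: a minimal, untested sketch rather than a known-good pattern. The region list, the 600-second value, and the `input.job` body are placeholders I made up for illustration. As far as I understand, `connector_params.timeout` bounds the whole polling period (including the time the job spends running), so lowering it also limits how long a healthy job may run. The sketch also does not delete the job left queued in the region that timed out.

```yaml
main:
  params: [input]          # input.job is a hypothetical Batch job body
  steps:
    - init:
        assign:
          - project: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
          - regions: ["us-central1", "us-east1"]   # assumed fallback order
          - job: null
    - try_regions:
        for:
          value: location
          in: ${regions}
          steps:
            - create_job:
                try:
                  call: googleapis.batch.v1.projects.locations.jobs.create
                  args:
                    parent: ${"projects/" + project + "/locations/" + location}
                    body: ${input.job}
                    connector_params:
                      timeout: 600   # seconds; assumed value, shorter than the 1800 s I hit
                  result: job
                except:
                  as: e
                  steps:
                    # Only swallow the polling timeout; re-raise anything else.
                    - rethrow_unless_timeout:
                        switch:
                          - condition: ${not("TimeoutError" in e.tags)}
                            steps:
                              - rethrow:
                                  raise: ${e}
            # job stays null after a timeout, so the loop moves to the next region.
            - stop_if_created:
                switch:
                  - condition: ${job != null}
                    next: break
    - finish:
        return: ${job}
```

If every region times out, this returns null, so the caller would still need to decide whether to fail or to retry with a different machine type.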
Can you point me to any resources? (I have looked, BTW.)