I ran a job with 200+ tasks using a Docker container runnable hosted on GCP's Artifact Registry. The job completed with all tasks succeeding except for 4 failed tasks. Looking at the batch_task_logs, I see the following exception listed 4 times (once for each failed task):
docker: Error response from daemon: Head "Artifact Registry documentation | Google Cloud": denied: Permission "artifactregistry.repositories.downloadArtifacts" denied on resource "projects/XXXX/locations/us-east4/repositories/XXXX" (or it may not exist).
The same Docker image and service account are used for all tasks. After re-running the job, all tasks succeeded with no issue.
This makes me think we're hitting some concurrency or rate-limiting issue when accessing Artifact Registry. Is there some sort of quota increase I need to request for Artifact Registry? Or could retries be configured on GCP's backend to attempt the image pull multiple times if it fails? Our jobs are time-critical and run as part of our production process, so I'm hesitant to enable retries more broadly: if there's an application failure in a task (i.e., a bad code push), the task would retry multiple times with no chance of succeeding and delay our production pipelines.
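(For context on the tradeoff I mentioned: from the Batch docs it looks like lifecycle policies can scope retries by exit code, so infrastructure-style failures could retry while an application failure fails fast. Here's a minimal sketch using the google-cloud-batch Python client; the exit code 1 and the image URI parameter are placeholders for whatever your application actually uses, not confirmed values.)

```python
from google.cloud import batch_v1


def build_task_spec(image_uri: str) -> batch_v1.TaskSpec:
    """Sketch of a TaskSpec that retries everything EXCEPT a known
    application failure.

    The exit code below is a placeholder: the idea is to FAIL_TASK
    immediately on our application's own error code (so a bad code
    push fails fast), while max_retry_count lets other failures,
    e.g. transient registry errors, retry.
    """
    runnable = batch_v1.Runnable(
        container=batch_v1.Runnable.Container(image_uri=image_uri)
    )

    return batch_v1.TaskSpec(
        runnables=[runnable],
        max_retry_count=3,  # upper bound on retries for other failures
        lifecycle_policies=[
            batch_v1.LifecyclePolicy(
                # Fail immediately when the app itself errors out;
                # exit code 1 is a stand-in for the app's failure code.
                action=batch_v1.LifecyclePolicy.Action.FAIL_TASK,
                action_condition=batch_v1.LifecyclePolicy.ActionCondition(
                    exit_codes=[1],
                ),
            ),
        ],
    )
```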
Google Artifact Registry has a default limit of 60,000 requests per minute (QPM) per project per region. So with 200+ tasks, each pulling ~10 packages within a minute, it could quite likely be a quota issue. If so, can you try requesting a quota increase following https://cloud.google.com/artifact-registry/quotas#request_a_quota_increase?
In the meantime, if you could provide your job UID to us, it would be easier for Batch to triage whether your job's tasks were likely to have hit the GAR quota limit.
Can you clarify what you mean by "package" when you say "10-ish packages each within a minute"? The way the job is configured, all tasks use the same single container image. Does each layer of the container's Docker image count as an individual pull from the registry?
Additionally, if Batch is pulling ~10 packages for each of the 200 tasks, wouldn't that still only be 2,000 pulls from Artifact Registry, which is well below the 60k limit?
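If it helps, here's the back-of-the-envelope math I'm using, under the assumption (which is exactly what I'm asking about) that each layer blob counts as its own registry request:

```python
def estimate_registry_qpm(tasks: int, layers: int) -> int:
    """Rough estimate of Artifact Registry requests generated by a job
    whose tasks all pull the same image within the same minute.

    Assumes, and this is an assumption rather than confirmed behavior,
    that each pull issues 1 manifest request + 1 config-blob request
    + 1 request per layer blob, with no caching on the VMs.
    """
    requests_per_pull = 1 + 1 + layers
    return tasks * requests_per_pull


# 200 tasks pulling a 10-layer image: 200 * 12 = 2,400 requests,
# still far below the default 60,000 QPM quota.
print(estimate_registry_qpm(tasks=200, layers=10))
```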
In the meantime, I'll go ahead and request a quota increase on Artifact Registry pulls to see if that resolves the issue. Lastly, here's the job UID and region in case it's helpful for debugging/diagnosing:
Sorry for the confusion about the quota limit. With the job information you provided, we found that the 4 tasks failed with a 403 error because Google Artifact Registry did not receive authentication when the docker pull request was made.
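If it's useful for confirming this on your side, one way to separate a real IAM problem from a transient authentication issue is to replicate the manifest HEAD request that docker pull makes, but with an explicitly refreshed token. This is only a sketch: the registry host, project, repository, image, and tag below are placeholders, and it assumes Artifact Registry accepts an OAuth2 access token as a Bearer token on its Docker v2 endpoint.

```python
import google.auth
import google.auth.transport.requests
import requests

# Placeholders: substitute your own project, repository, image, and tag.
REGISTRY = "us-east4-docker.pkg.dev"
PROJECT = "my-project"
REPO = "my-repo"
IMAGE = "my-image"
TAG = "latest"


def check_manifest_access() -> int:
    """HEAD the image manifest the way `docker pull` does, but with an
    explicitly refreshed access token, to confirm the service account
    can actually download artifacts from the repository."""
    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    credentials.refresh(google.auth.transport.requests.Request())

    url = f"https://{REGISTRY}/v2/{PROJECT}/{REPO}/{IMAGE}/manifests/{TAG}"
    resp = requests.head(
        url,
        headers={
            "Authorization": f"Bearer {credentials.token}",
            # Ask for a modern manifest type; some registries 404 otherwise.
            "Accept": "application/vnd.docker.distribution.manifest.v2+json",
        },
    )
    return resp.status_code  # 200 = OK, 403 = the denied error above


if __name__ == "__main__":
    print(check_manifest_access())
```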