However, while this gives me the stdout/stderr of a specific task, I don't see why the task failed. I'm guessing the task failed for a reason like running out of memory, or maybe (since the tasks are running on spot instances) the host instance got killed. Not sure, but just looking at the process's stderr/stdout won't be sufficient. How can I view the complete logs for the task, so I can figure out why it failed?
Exit code 125 usually means your container runnable failed at the docker command itself ("container failed to run"). There can be multiple causes. For example, if you are using GPUs, the GPU driver installation may not have succeeded; the container image you are using may have a problem; or your command may contain an error. The batch_task_logs stream gives you task-related log details, and batch_agent_logs gives more detail about the package installation that Batch performs. I would recommend combining these two types of logs when investigating.
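For reference, both streams can be queried with Cloud Logging filters keyed on the job UID. The helper below is a hypothetical sketch (the `labels.job_uid` label and log names reflect my understanding of Batch's logging setup; the actual query would be run via `gcloud logging read` or the Cloud Logging API):

```python
def batch_log_filter(job_uid: str, log_name: str, project: str = "my-project") -> str:
    """Build a Cloud Logging filter string for one of a Batch job's log streams.

    log_name is either "batch_task_logs" (your runnable's stdout/stderr)
    or "batch_agent_logs" (the Batch agent's setup/installation logs).
    """
    return (
        f'logName="projects/{project}/logs/{log_name}" '
        f'AND labels.job_uid="{job_uid}"'
    )

# Filters for both streams of the same job:
uid = "internvid-md5-v1-a9339995-3599-48f6-b0"
task_filter = batch_log_filter(uid, "batch_task_logs")
agent_filter = batch_log_filter(uid, "batch_agent_logs")
print(task_filter)
print(agent_filter)
```

You would paste each filter into the Logs Explorer (or pass it to `gcloud logging read`) to see the two streams side by side.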
In the meantime, you can also share your logs, your job UID, and the region with the Batch team, in case you want us to help investigate.
Sure. I'm running the job in us-central1 (not sure if the zone is us-central1-a, or multiple zones), and the UID is "internvid-md5-v1-a9339995-3599-48f6-b0".
One issue is that I think the 125 is caused by a spot preemption notice, but from reading the docs, Batch tasks should exit with 50001 following spot preemption. For now, I can configure the job to retry on exit code 125, but I wonder if I'm hiding a deeper bug.
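For reference, automated retries on specific exit codes are configured through the task's lifecycle policy in the job definition. A sketch of the relevant fragment, assuming the Batch v1 API field names (adjust the exit codes and retry count to your needs):

```json
{
  "taskGroups": [{
    "taskSpec": {
      "maxRetryCount": 3,
      "lifecyclePolicies": [{
        "action": "RETRY_TASK",
        "actionCondition": { "exitCodes": [50001, 125] }
      }]
    }
  }]
}
```

Listing both 50001 (the documented preemption code) and 125 here is a workaround for the mismatch described above; if the 125 turns out to be a genuine container failure, retrying on it would indeed mask that bug.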
Here are the log results from filtering on a hostname: you can see there's a spot preemption notice, and then the runnable exits with code 125.
Yes, you found the right info. With the job and task information you provided, I confirmed that your task failed due to spot preemption.
I would assume that if you make a Get Task API call for that task, internvid-md5-v1-a9339995-3599-48f6-b0-group0-305, at that point in time, you should be able to see the task's status event with error code 50001.
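For illustration, this is roughly the shape of the status event I'd expect in the Get Task response (the field values here are hypothetical; only the structure follows the Batch API's task status/StatusEvent fields):

```json
{
  "status": {
    "state": "FAILED",
    "statusEvents": [{
      "type": "STATUS_CHANGED",
      "description": "Task state changed from RUNNING to FAILED due to Spot VM preemption with exit code 50001.",
      "eventTime": "...",
      "taskExecution": { "exitCode": 50001 }
    }]
  }
}
```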