However, while this gives me the stdout/stderr of a specific task, I don't see why the task failed. I'm guessing the task failed for a reason like running out of memory, or maybe (since the tasks are running on spot instances) the host instance got killed. Not sure, but just looking at the process's stderr/stdout won't be sufficient. How can I view the complete logs for the task, so I can figure out why it failed?
Exit code 125 usually means your container runnable failed at the docker command itself ("container failed to run"). There can be multiple causes. For example, if you are using GPUs, the GPU driver installation may not have succeeded; the container image you are using may have a problem; or your command may contain an error. The batch_task_logs stream gives you task-related log details, and batch_agent_logs gives more detail about the package installation that Batch performs. I would recommend combining these two types of logs when investigating.
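For reference, both streams can be queried with Cloud Logging filters keyed on the job UID. The helper below is a hypothetical sketch (the `labels.job_uid` label and log names reflect my understanding of Batch's logging setup; the actual query would be run via `gcloud logging read` or the Cloud Logging API):

```python
def batch_log_filter(job_uid: str, log_name: str, project: str = "my-project") -> str:
    """Build a Cloud Logging filter string for one of a Batch job's log streams.

    log_name is either "batch_task_logs" (your runnable's stdout/stderr)
    or "batch_agent_logs" (the Batch agent's setup/installation logs).
    """
    return (
        f'logName="projects/{project}/logs/{log_name}" '
        f'AND labels.job_uid="{job_uid}"'
    )

# Filters for both streams of the same job:
uid = "internvid-md5-v1-a9339995-3599-48f6-b0"
task_filter = batch_log_filter(uid, "batch_task_logs")
agent_filter = batch_log_filter(uid, "batch_agent_logs")
print(task_filter)
print(agent_filter)
```

You would paste each filter into the Logs Explorer (or pass it to `gcloud logging read`) to see the two streams side by side.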
In the meantime, you can also share your logs, your job UID, and the region with the Batch team, in case you want us to help investigate.
Sure. I'm running the job in us-central1 (not sure if the zone is us-central1-a, or multiple zones), and the UID is "internvid-md5-v1-a9339995-3599-48f6-b0".
One issue is that I think the 125 is caused by a spot preemption notice, but from reading the docs, Batch tasks should exit with 50001 following spot preemption. For now, I can configure the job to retry on exit code 125, but I wonder if I'm hiding a deeper bug.
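For reference, automated retries on specific exit codes are configured through the task's lifecycle policy in the job definition. A sketch of the relevant fragment, assuming the Batch v1 API field names (adjust the exit codes and retry count to your needs):

```json
{
  "taskGroups": [{
    "taskSpec": {
      "maxRetryCount": 3,
      "lifecyclePolicies": [{
        "action": "RETRY_TASK",
        "actionCondition": { "exitCodes": [50001, 125] }
      }]
    }
  }]
}
```

Listing both 50001 (the documented preemption code) and 125 here is a workaround for the mismatch described above; if the 125 turns out to be a genuine container failure, retrying on it would indeed mask that bug.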
Here are the log results from filtering on a hostname: you can see there's a spot preemption notice, and then the runnable exits with code 125.
Yes, you found the right info. With the job and task information you provided, I confirmed that your task failed due to spot preemption.
I would assume that if you make a Get Task API call for that task, internvid-md5-v1-a9339995-3599-48f6-b0-group0-305, at that point in time, you should be able to see the task's status event with error code 50001.
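For illustration, this is roughly the shape of the status event I'd expect in the Get Task response (the field values here are hypothetical; only the structure follows the Batch API's task status/StatusEvent fields):

```json
{
  "status": {
    "state": "FAILED",
    "statusEvents": [{
      "type": "STATUS_CHANGED",
      "description": "Task state changed from RUNNING to FAILED due to Spot VM preemption with exit code 50001.",
      "eventTime": "...",
      "taskExecution": { "exitCode": 50001 }
    }]
  }
}
```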