Lots of Batch task interruptions with exit code 50002

We have many Batch jobs/tasks that use all computing resources available. Up until 24-48 hours ago, these jobs/tasks had zero issues.

But in the last two days, they are all failing with the following message:

Batch no longer receives VM updates with exit code 50002

I’ve checked the logs and there is nothing out of the ordinary. Essentially, the log sequence is…

  • a message indicating the step about to run
  • lots of “report agent state:” and “Server response for instance” messages
  • and then the log cuts out when Batch kills the task

Neither the code that launches these Batch jobs/tasks nor the code running in each Batch task has changed in the last week.

Is anyone else seeing this?

It is possible that the VM hangs and is not able to communicate with the Batch service, so you may want to inspect the VM instance for signs of health issues, such as high CPU usage, running out of memory, etc.
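For example, here is a minimal sketch of pulling recent CPU utilization for one of the Batch-managed VMs with the Cloud Monitoring Python client. The project and instance names are placeholders for your own values, and memory metrics would additionally require the Ops Agent on the VM:

```python
# Minimal sketch: query recent CPU utilization for a Batch-managed VM via
# Cloud Monitoring. PROJECT_ID and INSTANCE_NAME are placeholders.
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"      # placeholder: your project ID
INSTANCE_NAME = "my-batch-vm"  # placeholder: the VM created for the Batch job

client = monitoring_v3.MetricServiceClient()

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 3600},  # look back one hour
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'metric.type = "compute.googleapis.com/instance/cpu/utilization" '
            f'AND metric.labels.instance_name = "{INSTANCE_NAME}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# Each point is a fraction (0.0-1.0) of CPU utilization.
for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value.double_value)
```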

This thread has some additional info.

I opened a case with my support partners at DoiT and they’re confirming what I’m seeing. Across 1000+ jobs/tasks, the pattern is the same in the logs…

A steady stream of “report agent state” messages, then silence. After roughly 10 minutes of silence, the VM gets killed.

We checked the metrics of a few of the VMs, and during this period of log silence CPU utilization is still being reported and shows load (~30%-50% utilization, as expected).


FYI @Wen_gcp @wenyhu

Attaching some screenshots for visibility. Nothing sensitive here; they just show that the agent log messages go silent while the CPU still shows load.
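For anyone who wants to run the same check without screenshots, here is a rough sketch of listing the Batch agent log entries with the Cloud Logging Python client. The project ID and time window are placeholders; adjust the filter to match your own jobs:

```python
# Rough sketch: list recent entries from the Batch agent log (batch_agent_logs)
# to see when the "report agent state" messages stop. PROJECT_ID and the
# timestamp window are placeholders.
from google.cloud import logging

PROJECT_ID = "my-project"  # placeholder

client = logging.Client(project=PROJECT_ID)

log_filter = (
    f'logName="projects/{PROJECT_ID}/logs/batch_agent_logs" '
    'AND timestamp>="2024-01-01T00:00:00Z"'  # placeholder time window
)

# Newest entries first, so the last "report agent state" message is easy to spot.
for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    print(entry.timestamp, entry.payload)
```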

Hi @atlasai-bborie,

Exit code 50002 indicates a timeout: Batch is no longer receiving updates from a VM for the job. https://cloud.google.com/batch/docs/troubleshooting#vm_reporting_timeout_50002 has an explanation and possible solutions.

In the meantime, would you mind providing more information so that the Batch team can take a look on our side?

  1. When did it start happening?

  2. What is your project ID and your job’s region?

  3. Do you have an example job UID that hit the issue?

  4. If you call GetJob, is there any additional information in the status events that could help you investigate further? (See the example below.)
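For reference, here is a minimal sketch of calling GetJob with the Python client and printing the status events; the project, region, and job name are placeholders for your own values:

```python
# Minimal sketch: fetch a Batch job and print its status events, which often
# include the reason a task or VM was terminated. The project, region, and job
# name below are placeholders.
from google.cloud import batch_v1

PROJECT_ID = "my-project"  # placeholder
REGION = "us-central1"     # placeholder
JOB_NAME = "my-batch-job"  # placeholder

client = batch_v1.BatchServiceClient()
job = client.get_job(
    name=f"projects/{PROJECT_ID}/locations/{REGION}/jobs/{JOB_NAME}"
)

print("State:", job.status.state.name)
for event in job.status.status_events:
    print(event.event_time, event.type_, event.description)
```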

Thanks!

Wenyan

Posting to close this thread out. I found the cause of the issue: an OOM caused by bad configuration. The code hadn’t changed, but the configuration of the data to be processed had. Instead of the typical range of one day up to one year’s worth of data, someone requested almost a decade’s worth, and these jobs aren’t designed for that…

As such, the process ran out of memory and died. Sorry for the noise.


Glad you found the root cause!