Lots of Batch task interruptions with exit code 50002

We have many Batch jobs/tasks that use all computing resources available. Up until 24-48 hours ago, these jobs/tasks had zero issues.

But in the last two days, they are all failing with the following message:

Batch no longer receives VM updates with exit code 50002

I’ve checked the logs and there is nothing out of the ordinary. Essentially, the log sequence is…

  • a message indicating the step about to run
  • lots of “report agent state:” and “Server response for instance” messages
  • and then the log cuts out when Batch kills the task

Neither the code that launches these Batch jobs/tasks nor the code running in each Batch task has changed in the last week.

Is anyone else seeing this?

It is possible that the VM hangs and is not able to communicate with the Batch service, so you may want to inspect the VM instance for signs of health issues, such as high CPU usage, running out of memory, etc.
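For example, here is a minimal sketch of pulling recent CPU utilization for one of the Batch-managed VMs with the Cloud Monitoring Python client. The project and instance names are placeholders for your own values, and memory metrics would additionally require the Ops Agent on the VM:

```python
# Minimal sketch: query recent CPU utilization for a Batch-managed VM via
# Cloud Monitoring. PROJECT_ID and INSTANCE_NAME are placeholders.
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"      # placeholder: your project ID
INSTANCE_NAME = "my-batch-vm"  # placeholder: the VM created for the Batch job

client = monitoring_v3.MetricServiceClient()

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 3600},  # look back one hour
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'metric.type = "compute.googleapis.com/instance/cpu/utilization" '
            f'AND metric.labels.instance_name = "{INSTANCE_NAME}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# Each point is a fraction (0.0-1.0) of CPU utilization.
for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value.double_value)
```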

This thread has some additional info.

I opened a case with my support partners at DoiT and they’re confirming what I’m seeing. Across 1000+ jobs/tasks, the pattern is the same in the logs…

A steady stream of “report agent state” messages, then silence. After roughly 10 minutes of silence, the VM gets killed.

We checked the metrics of a few of the VMs, and during this period of log silence CPU utilization is still being reported and shows load (~30%-50% utilization, as expected).


FYI @Wen_gcp @wenyhu

Attaching some screenshots for visibility. Nothing sensitive here; they just show that the agent log messages go silent while the CPU still shows load.
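For anyone who wants to run the same check without screenshots, here is a rough sketch of listing the Batch agent log entries with the Cloud Logging Python client. The project ID and time window are placeholders; adjust the filter to match your own jobs:

```python
# Rough sketch: list recent entries from the Batch agent log (batch_agent_logs)
# to see when the "report agent state" messages stop. PROJECT_ID and the
# timestamp window are placeholders.
from google.cloud import logging

PROJECT_ID = "my-project"  # placeholder

client = logging.Client(project=PROJECT_ID)

log_filter = (
    f'logName="projects/{PROJECT_ID}/logs/batch_agent_logs" '
    'AND timestamp>="2024-01-01T00:00:00Z"'  # placeholder time window
)

# Newest entries first, so the last "report agent state" message is easy to spot.
for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    print(entry.timestamp, entry.payload)
```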

Hi @atlasai-bborie,

Exit code 50002 indicates a timeout: Batch is no longer receiving updates from a VM for the job. https://cloud.google.com/batch/docs/troubleshooting#vm_reporting_timeout_50002 has an explanation and possible solutions.

In the meantime, would you mind providing more information so that the Batch team can take a look on our side?

  1. When did it start happening?

  2. What is your project ID and your job’s region?

  3. Do you have an example job UID that hit the issue?

  4. If you call GetJob, is there any additional information in the status events that could help you investigate further? (See the example below.)
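For reference, here is a minimal sketch of calling GetJob with the Python client and printing the status events; the project, region, and job name are placeholders for your own values:

```python
# Minimal sketch: fetch a Batch job and print its status events, which often
# include the reason a task or VM was terminated. The project, region, and job
# name below are placeholders.
from google.cloud import batch_v1

PROJECT_ID = "my-project"  # placeholder
REGION = "us-central1"     # placeholder
JOB_NAME = "my-batch-job"  # placeholder

client = batch_v1.BatchServiceClient()
job = client.get_job(
    name=f"projects/{PROJECT_ID}/locations/{REGION}/jobs/{JOB_NAME}"
)

print("State:", job.status.state.name)
for event in job.status.status_events:
    print(event.event_time, event.type_, event.description)
```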

Thanks!

Wenyan

Posting to close this thread out. I found the cause of the issue: an OOM caused by bad configuration. The code hadn’t changed, but the configuration of the data to be processed had. Instead of the typical range of one day up to one year’s worth of data, someone requested almost a decade’s worth, and these jobs aren’t designed for that…

As such, the process ran out of memory and died. Sorry for the noise.


Glad you found the root cause!