It is possible that the VM is hanging and unable to communicate with the Batch service, so you may want to inspect the VM instance for signs of health issues such as high CPU usage or running out of memory.
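For example, if you can get onto the VM, a quick check along these lines can surface memory pressure. This is just a minimal sketch, assuming psutil is available on the instance; the 90% threshold is illustrative, not anything Batch-specific:

```python
# Minimal health snapshot from inside the VM.
# Assumes psutil is installed; the threshold below is illustrative only.
import psutil

def health_snapshot():
    cpu = psutil.cpu_percent(interval=1)  # sample CPU over one second
    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print(f"CPU: {cpu:.1f}%")
    print(f"Memory: {mem.percent:.1f}% used ({mem.available / 1e9:.2f} GB available)")
    print(f"Swap: {swap.percent:.1f}% used")
    if mem.percent > 90:
        print("WARNING: memory pressure; check dmesg for OOM-killer messages")

if __name__ == "__main__":
    health_snapshot()
```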
I opened a case with my support partners at DoiT and they’re confirming what I’m seeing. Across 1000+ jobs/tasks, the pattern is the same in the logs…
A steady stream of “report agent state” messages, then silence. After roughly 10 minutes of silence, the VM gets killed.
We checked the metrics on a few of the VMs, and during this period of log silence, CPU utilization was still being reported and the VMs were under load (~30-50% utilization, as expected).
Posting to close this thread out. I found the cause of the issue: an OOM caused by a bad configuration. The code hadn’t changed, but the configuration of the data to be processed was badly off. Instead of the typical daily run over at most one year’s worth of data, someone requested almost a decade’s worth of data, and these Jobs aren’t designed for that…
As such, the process ran out of memory and died. Sorry for the noise.
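For anyone who hits something similar: we will probably add a guard along these lines before submitting the Job. This is only a sketch with a hypothetical config shape; the field names (start_date, end_date) and the one-year cap are my own illustration, not part of the Batch API:

```python
# Hypothetical pre-submit sanity check: reject date ranges far larger than the
# Jobs were sized for. Names and the 366-day cap are illustrative assumptions.
from datetime import date

MAX_DAYS = 366  # these Jobs were designed for at most ~one year of data

def validate_date_range(start_date: date, end_date: date) -> None:
    span_days = (end_date - start_date).days
    if span_days < 0:
        raise ValueError("end_date is before start_date")
    if span_days > MAX_DAYS:
        raise ValueError(
            f"Requested {span_days} days of data; these Jobs are sized for "
            f"at most {MAX_DAYS} days. Split the range or resize the workers."
        )

if __name__ == "__main__":
    # A request like the one that caused the OOM would be rejected up front.
    try:
        validate_date_range(date(2015, 1, 1), date(2024, 12, 31))
    except ValueError as exc:
        print(f"Rejected: {exc}")
```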