We are using Google Batch to run long-running jobs. According to the documentation, jobs may run for up to 14 days, but the VMs running our jobs are terminated after 7 days, and the Batch job then fails with “Batch no longer receives VM updates with exit code 50002”. We have not set maxRunDuration in our task spec.
The logs show that a compute.instances.deferredDelete was issued, along with this message:
Instance Group Manager ‘ruisu-surf-vx3sg-1-8f7a5c2c-f9d7-4bca0-group0-0’ initiated recreateInstance on instance ‘projects/566203377590/zones/us-central1-b/instances/ruisu-surf-vx3sg-1-8f7a5c2c-f9d7-4bca0-group0-0-9945’. Reason: Instance eligible for repair: Instance passed it’s termination timestamp; termination_timestamp=2024-12-26T20:41:28.68375-08:00; current_time=2024-12-26T20:41:30.643952-08:00; current_status=STOPPING, target_status=STATUS_RUNNING.
Is there a way to prevent the VM from being killed after 7 days?
In case examples are helpful, these jobs were all killed in the same way after exactly 7 days:
- Batch job: projects/drailab/locations/us-central1/jobs/ruisu-surf-2hcg9-1732734838
  VM instance: projects/drailab/zones/us-central1-b/instances/ruisu-surf-2hcg9-1-8a08b338-9b84-40c30-group0-0-99r5
- Batch job: projects/drailab/locations/us-central1/jobs/ruisu-surf-94afp-1734797267
  VM instance: projects/drailab/zones/us-central1-b/instances/ruisu-surf-94afp-1-0621dbf0-a10a-4d380-group0-0-r5lz
- Batch job: projects/drailab/locations/us-central1/jobs/ruisu-surf-vx3sg-1734669578
  VM instance: projects/drailab/zones/us-central1-b/instances/ruisu-surf-vx3sg-1-8f7a5c2c-f9d7-4bca0-group0-0-9945
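For anyone who wants to check the timing, here is a minimal Python sketch that works only from the timestamps in the log message quoted above. Two things in it are assumptions rather than facts from the logs: that the VM was given a 7-day lifetime (used to back out an implied start time), and that the trailing number in the Batch job name (1734669578) is a Unix epoch in seconds marking roughly when the job was submitted. If the second assumption is wrong, the last calculation is meaningless.

```python
from datetime import datetime, timedelta, timezone

# Timestamps copied verbatim from the recreateInstance log message.
TERMINATION = "2024-12-26T20:41:28.68375-08:00"
CURRENT = "2024-12-26T20:41:30.643952-08:00"
FMT = "%Y-%m-%dT%H:%M:%S.%f%z"

termination = datetime.strptime(TERMINATION, FMT)
current = datetime.strptime(CURRENT, FMT)

# How long after the termination timestamp the repair was triggered.
repair_delay = current - termination

# Assumption: the VM had a 7-day lifetime, so it would have started here.
implied_start = termination - timedelta(days=7)

# Assumption (unverified): the job-name suffix is a Unix epoch in seconds.
job_suffix = 1734669578
job_submitted = datetime.fromtimestamp(job_suffix, tz=timezone.utc)
elapsed = termination - job_submitted

print("repair triggered after:", repair_delay)
print("implied VM start:", implied_start.isoformat())
print("job suffix as UTC time:", job_submitted.isoformat())
print("elapsed from suffix to termination:", elapsed)
```

Under those assumptions, the elapsed time comes out at 7 days plus a couple of minutes, which would fit a VM that started shortly after job submission and was stopped at a fixed 7-day mark.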