Example to show batch job retry in GCP Workflow

kiwicmc · February 11, 2024, 7:59am

Hi,

I have a Workflow that creates a batch job task that runs up a docker image to do image processing then exits. I use VM Spot instances. All is working fine.

Now I would like to implement and test a simple retry mechanism if the VM is pre-empted and warn DevOps if the task fails.

I need an example of the Workflow Task retry syntax, the ability to manually pre-empt the VM (when using the gcloud compute instances stop command I need to have the VM ID), and a way to query the exit code in a subsequent Workflow step.

Workflow snippet (deploys but not tested):

create_transcoding_job:

call: googleapis.batch.v1.projects.locations.jobs.create
args:
parent: ${“projects/” + project + “/locations/” + location}
jobId: “${jobId}”
body:
priority: 99
taskGroups:

taskCount: 1
parallelism: 1
taskSpec:
computeResource:
…
maxRetryCount: 3
lifecyclePolicies:

If VM preempted (error code 50001) retry 3 times

action: RETRY_TASK
actionCondition:
exitCodes: [ 50001 ]
allocationPolicy:
instances:
policy:
provisioningModel: SPOT
machineType: “${machineType}”
…

If retries exhausted, I’d like to notify DevOps

nomi · February 13, 2024, 5:52pm

Hello,

The retry syntax looks good to me, 50001 is the right code for preemption.
To test preemption, you can simulate maintenance event for VMs. What I did previously is

gcloud compute instances set-scheduling VM_NAME --maintenance-policy=TERMINATE --zone ZONE

gcloud compute instances simulate-maintenance-event VM_NAME --zone ZONE

Btw, you can find your VM names in your Cloud Console, they should be prefixed with the same job id.

One thing to mention is that you can not query exit code now. Exit code is only available in the task event description, so for now when a task is failed, you have to parse the description message. For example, description like “task is failed due to Spot preemption with code 50001”, you need to parse the code from the message.

kiwicmc · February 19, 2024, 3:21am

Thanks Nomi,

I have tested the retry and all seems to work.

A couple of things came up when I was implementing your suggestions that may be useful to others seeing this post:

I received this error when setting the maintenance policy

chris@Mando:> gcloud compute instances set-scheduling 'transcoding-dc2a7b-cd6428f6-36b1-4e930-group0-0-bvtk' --maintenance-policy=TERMINATE --zone us-central1-a
ERROR: (gcloud.compute.instances.set-scheduling) Could not fetch resource:
- Invalid value for field 'instance': 'transcoding-dc2a7b-cd6428f6-36b1-4e930-group0-0-bvtk --maintenance-policy=TERMINATE'. Must be a match of regex '[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?|[1-9][0-9]{0,19}'

However when I looked at the Management details of the VM in Cloud Console I saw that On host maintenance was set to terminate anyway. The simulate call worked as expected:

chris@Mando:> gcloud compute instances simulate-maintenance-event transcoding-dc2a7b-cd6428f6-36b1-4e930-group0-0-bvtk --zone us-central1-a
Simulating maintenance on instance(s) [https://compute.googleapis.com/compute/v1/projects/xxxxxxxx/zones/us-central1-a/instances/transcoding-dc2a7b-cd6428f6-36b1-4e930-group0-0-bvtk]...done.

and the logs showed the preeemption and the retry

2024-02-19 11:07:24.430 NZDT
Agent received Spot preemption notice.

2024-02-19 11:07:24.430 NZDT
preemption notice has received and will be processed.

I was looking at notifying DevOps when retries are exhausted, but I initially could not find the description message you talked about. The result: value from the task call had nothing interesting. I did note that a failed jobs.create call did throw an Exception that I was able to catch that I can do something with. Maybe that is the description you were talking about?

"status": {
"runDuration": "109.358548848s",
"state": "FAILED",
"statusEvents": [
…
{
"description": "Job state is set from RUNNING to FAILED for job projects/690235362025/locations/us-central1/jobs/transcoding-c70ec27b-4476-45ba-825c-5b6d5bd77ed8. Job failed due to task failures. For example, task with index 0 failed, failed task event description is Task state is updated from RUNNING to FAILED on zones/us-central1-a/instances/7192508307548879708 due to Spot VM preemption with exit code 50001.",
"eventTime": "2024-02-18T22:33:44.491834333Z",
"type": "STATUS_CHANGED"
}
],

Thanks again for your timely reply (last week I was otherwise occupied or I would have replied sooner)

C

One last thing, now I’ve worked out how this HTML editor works, here is the working Workflow yaml

- createFfmpegJob:
    call: googleapis.batch.v1.projects.locations.jobs.create
    args:
      parent: ${"projects/" + project + "/locations/" + location}
      jobId: "${jobId}"
      body:
        priority: 99
        taskGroups:
          taskCount: 1
          parallelism: 1
          taskSpec:
            …
            maxRetryCount: 1
            lifecyclePolicies:
              # If VM preempted (error code 50001) retry 1 times
              action: RETRY_TASK
              actionCondition:
                exitCodes: [ 50001 ]
        allocationPolicy:
          instances:
            policy:
              provisioningModel: SPOT
              machineType: "${machineType}"
        logsPolicy:
          destination: CLOUD_LOGGING
    result: ffmpegJobResponse

Pradeep_r · June 10, 2024, 5:53pm

I have issue in cloud batch job run

{
“taskGroups”: [
{
“taskSpec”: {
“runnables”: [
{
“container”: {
“imageUri”: “Google Cloud console”
}
}
],
“computeResource”: {
“cpuMilli”: 4000,
“memoryMib”: 20000
},
“maxRetryCount”: 2,
“maxRunDuration”: “43200s”
},
“taskCount”: 10,
“parallelism”: 10
}
],
“allocationPolicy”: {
“instances”: [
{ “installGpuDrivers”: true,
“policy”: {
“machineType”: “g2-standard-48”,
“accelerators”: [
{
“type”: “nvidia-l4”,
“count”: “4”
}
],
“provisioningModel”: “SPOT”,
“bootDisk”: {
“image”: “batch-debian”,
“sizeGb”: “200”,
“type”: “pd-standard”
}
}
}
],
“network”: {
“networkInterfaces”: [{
“network”: “projects/protein-fold-dev/global/networks/q-eb-vpc”,
“subnetwork”: “projects/protein-fold-dev/regions/us-east4/subnetworks/q-eb-vpc-subnet1”,
“noExternalIpAddress”: true}
]
}
},
“labels”: {
“department”: “embedding”,
“env”: “testing”
},
“logsPolicy”: {
“destination”: “CLOUD_LOGGING”
}
}

error:Job state is set from RUNNING to FAILED for job projects/855990010299/locations/us-east4/jobs/job-embedding-protein-gpu. Job failed due to task failures. For example, task with index 0 failed, failed task event description is Task state is updated from RUNNING to FAILED on zones/us-east4-c/instances/6044861851664856881 with exit code 1.

Topic		Replies	Views
What happens when a batch task is preempted on a spot machine? Compute Infrastructure batch	17	12	May 15, 2023
Batch - unknown exit code Compute Infrastructure batch	6	28	October 27, 2023
How can I figure out why a task failed on Google batch? Compute Infrastructure batch	5	10	April 3, 2024

Example to show batch job retry in GCP Workflow

If VM preempted (error code 50001) retry 3 times

AI Suggested topics