Example to show batch job retry in GCP Workflow

I have issue in cloud batch job run

{
“taskGroups”: [
{
“taskSpec”: {
“runnables”: [
{
“container”: {
“imageUri”: “Google Cloud console
}
}
],
“computeResource”: {
“cpuMilli”: 4000,
“memoryMib”: 20000
},
“maxRetryCount”: 2,
“maxRunDuration”: “43200s”
},
“taskCount”: 10,
“parallelism”: 10
}
],
“allocationPolicy”: {
“instances”: [
{ “installGpuDrivers”: true,
“policy”: {
“machineType”: “g2-standard-48”,
“accelerators”: [
{
“type”: “nvidia-l4”,
“count”: “4”
}
],
“provisioningModel”: “SPOT”,
“bootDisk”: {
“image”: “batch-debian”,
“sizeGb”: “200”,
“type”: “pd-standard”
}
}
}
],
“network”: {
“networkInterfaces”: [{
“network”: “projects/protein-fold-dev/global/networks/q-eb-vpc”,
“subnetwork”: “projects/protein-fold-dev/regions/us-east4/subnetworks/q-eb-vpc-subnet1”,
“noExternalIpAddress”: true}
]
}
},
“labels”: {
“department”: “embedding”,
“env”: “testing”
},
“logsPolicy”: {
“destination”: “CLOUD_LOGGING”
}
}

error:Job state is set from RUNNING to FAILED for job projects/855990010299/locations/us-east4/jobs/job-embedding-protein-gpu. Job failed due to task failures. For example, task with index 0 failed, failed task event description is Task state is updated from RUNNING to FAILED on zones/us-east4-c/instances/6044861851664856881 with exit code 1.

2 Likes