Thanks Nomi,
I have tested the retry and all seems to work.
A couple of things came up when I was implementing your suggestions that may be useful to others seeing this post:
I received this error when setting the maintenance policy:
chris@Mando:> gcloud compute instances set-scheduling 'transcoding-dc2a7b-cd6428f6-36b1-4e930-group0-0-bvtk' --maintenance-policy=TERMINATE --zone us-central1-a
ERROR: (gcloud.compute.instances.set-scheduling) Could not fetch resource:
- Invalid value for field 'instance': 'transcoding-dc2a7b-cd6428f6-36b1-4e930-group0-0-bvtk --maintenance-policy=TERMINATE'. Must be a match of regex '[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?|[1-9][0-9]{0,19}'
However, when I looked at the Management details of the VM in Cloud Console, I saw that "On host maintenance" was already set to terminate. The simulate call worked as expected:
chris@Mando:> gcloud compute instances simulate-maintenance-event transcoding-dc2a7b-cd6428f6-36b1-4e930-group0-0-bvtk --zone us-central1-a
Simulating maintenance on instance(s) [https://compute.googleapis.com/compute/v1/projects/xxxxxxxx/zones/us-central1-a/instances/transcoding-dc2a7b-cd6428f6-36b1-4e930-group0-0-bvtk]...done.
and the logs showed the preemption and the retry:
2024-02-19 11:07:24.430 NZDT  Agent received Spot preemption notice.
2024-02-19 11:07:24.430 NZDT  preemption notice has received and will be processed.
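Looking back at the set-scheduling error: the message quotes the rejected 'instance' value with " --maintenance-policy=TERMINATE" attached to the name, so it looks like the name and the flag reached gcloud as a single argument. I have not confirmed why that happened. The short Python check below only applies the regex copied from the error message to both strings; the helper name is my own.

```python
import re

# Regex copied verbatim from the gcloud error message.
INSTANCE_RE = re.compile(r"[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?|[1-9][0-9]{0,19}")


def is_valid_instance_value(value: str) -> bool:
    """Return True if the whole string matches the pattern from the error."""
    return INSTANCE_RE.fullmatch(value) is not None


name = "transcoding-dc2a7b-cd6428f6-36b1-4e930-group0-0-bvtk"

# The bare instance name satisfies the pattern.
print(is_valid_instance_value(name))
# The value quoted in the error (name plus flag) does not: it contains
# a space, '=' and upper-case letters.
print(is_valid_instance_value(name + " --maintenance-policy=TERMINATE"))
```

So the name itself is fine, and the value gcloud complained about is the name with the flag glued on.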
I was looking at notifying DevOps when retries are exhausted, but I initially could not find the description message you mentioned. The result: value from the task call contained nothing interesting. I did notice that a failed jobs.create call throws an exception, which I was able to catch and can act on. Maybe that is the description you were talking about?
"status": {
  "runDuration": "109.358548848s",
  "state": "FAILED",
  "statusEvents": [
    …
    {
      "description": "Job state is set from RUNNING to FAILED for job projects/690235362025/locations/us-central1/jobs/transcoding-c70ec27b-4476-45ba-825c-5b6d5bd77ed8. Job failed due to task failures. For example, task with index 0 failed, failed task event description is Task state is updated from RUNNING to FAILED on zones/us-central1-a/instances/7192508307548879708 due to Spot VM preemption with exit code 50001.",
      "eventTime": "2024-02-18T22:33:44.491834333Z",
      "type": "STATUS_CHANGED"
    }
  ],
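For the DevOps notification, one option is to scan the statusEvents descriptions for the preemption exit code. This is only a rough Python sketch. It assumes the job has already been loaded into a dict shaped like the JSON above, the function names are made up, and the sample description is a shortened copy of the real one. Matching on the description text is fragile, since the wording could change.

```python
import re

PREEMPTION_EXIT_CODE = 50001  # Spot VM preemption, as in the event above


def find_exit_code_events(job: dict, exit_code: int = PREEMPTION_EXIT_CODE) -> list:
    """Return the status events whose description mentions the exit code."""
    events = job.get("status", {}).get("statusEvents", [])
    pattern = re.compile(rf"exit code {exit_code}\b")
    return [e for e in events if pattern.search(e.get("description", ""))]


def should_notify(job: dict) -> bool:
    """True when the job ended FAILED and an event cites the preemption code."""
    failed = job.get("status", {}).get("state") == "FAILED"
    return failed and bool(find_exit_code_events(job))


# Shortened sample based on the status shown above.
sample_job = {
    "status": {
        "state": "FAILED",
        "statusEvents": [
            {
                "description": "Job failed due to task failures. Task state is "
                "updated from RUNNING to FAILED due to Spot VM preemption "
                "with exit code 50001.",
                "eventTime": "2024-02-18T22:33:44.491834333Z",
                "type": "STATUS_CHANGED",
            }
        ],
    }
}

if should_notify(sample_job):
    print("Retries exhausted after Spot preemption; notify DevOps")
```

The actual notification (email, chat, pager) would replace the print call.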
Thanks again for your timely reply. I was otherwise occupied last week or I would have replied sooner.
C
One last thing: now that I've worked out how this HTML editor works, here is the working Workflow YAML:
- createFfmpegJob:
    call: googleapis.batch.v1.projects.locations.jobs.create
    args:
      parent: ${"projects/" + project + "/locations/" + location}
      jobId: "${jobId}"
      body:
        priority: 99
        taskGroups:
          - taskCount: 1
            parallelism: 1
            taskSpec:
              …
              maxRetryCount: 1
              lifecyclePolicies:
                # If the VM is preempted (exit code 50001), retry the task once
                - action: RETRY_TASK
                  actionCondition:
                    exitCodes: [ 50001 ]
        allocationPolicy:
          instances:
            - policy:
                provisioningModel: SPOT
                machineType: "${machineType}"
        logsPolicy:
          destination: CLOUD_LOGGING
    result: ffmpegJobResponse