I submitted a batch task, and it will automatically fail when it is running. Check the events above and it will prompt an unknown error. I would like to ask, what is the reason for this?
details:
Job state is set from RUNNING to FAILED for job projects/647012610224/locations/us-central1/jobs/ifr-etl-218235706939801633. Job failed due to task failures. For example, task with index 0 failed, failed task event description is Task state is updated from RUNNING to FAILED on zones/us-central1-a/instances/952685742931573065 with error Batch no longer receives VM updates. with unknown exit code
From the information you provide, your tasks failed because Batch no longer receives the VM updates for some reason. Since you enabled CLOUD_LOGGING logs policy for your job, could you try troubleshooting with logs following https://cloud.google.com/batch/docs/analyze-job-using-logs to see whether there is any potential behavior happens during your job running that causes your VM no longer responses for a period?
I found the cause of the problem because my instance is of SPOT type and was preempted, but I did not find the relevant log in the GCP logger, but found it through the instance log. Can this be optimized?
I would expect you find some preemption related logs when on your pantheon UI, you click on Logs → LOGGING → batch_agent_logs, if spot instance preemption is the cause of your job failure.