Hi all,
I ran into an issue yesterday when submitting a custom job in Vertex AI. The job started successfully (as evidenced by the logs), but at some point, just before the script started using the GPU on the machine, we stopped receiving logs. I let the job run for 20 minutes, but it produced no further logs, and there was no indication that the machine had any issues. I then stopped the job manually and re-created the exact same job by running the same script (using the google-cloud-aiplatform package in Python) with the exact same parameters, and that job ran successfully.
Is there any way I can figure out what went wrong in the first job? I am looking for a stable solution for managing custom jobs, and the fact that this happened within my first 5 runs is very concerning, especially since there was no indication that the job was frozen: it could have run until it hit the maximum run time, which would have cost a lot of money.
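For illustration, a check for a silent job could look like the minimal sketch below. It only compares the time of the last log entry against a threshold. The function name, the 10-minute threshold, and the idea that the last log timestamp is fetched separately are all assumptions for this example. None of it is part of google-cloud-aiplatform.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical threshold: how long a job may stay silent before we treat it as stalled.
STALL_AFTER = timedelta(minutes=10)


def is_stalled(
    last_log_time: datetime,
    now: Optional[datetime] = None,
    stall_after: timedelta = STALL_AFTER,
) -> bool:
    """Return True if no log entry has arrived within `stall_after`.

    `last_log_time` is assumed to be a timezone-aware timestamp of the most
    recent log line, obtained by whatever means is available.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_log_time > stall_after
```

A monitoring loop could call something like this periodically and cancel the job when it returns True, so a frozen job does not run until the maximum run time.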
Thanks!