I am trying to create a custom training job in Vertex AI.
I can successfully use gcloud ai custom-jobs create and specify machine and GPU types, a custom container, etc. The job starts successfully. However, I haven't figured out how I can specify a checkpoint directory within my bucket for saving my model while training. Inside the training script, os.getenv("AIP_MODEL_DIR") returns None when no output directory is set.
When using the console there is an option to select ‘Model output directory’. Do you know how I can specify this with the gcloud ai custom-jobs command in the terminal? I think it should be the staging_bucket argument in the CustomTrainingJob class and/or the baseOutputDirectory in CustomJobSpec?
You are correct: you need to specify the baseOutputDirectory within CustomJobSpec when using gcloud ai custom-jobs create. This defines the location where your model checkpoints and other training artifacts will be saved.
The following command might help. It passes the job spec through a config file, and that file is where you set baseOutputDirectory.outputUriPrefix to a path in your bucket:
gcloud ai custom-jobs create \
--region=us-central1 \
--display-name="your-job-name" \
--config=your-custom-job-spec.json
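The contents of your-custom-job-spec.json are not shown above, so here is a minimal sketch of how it could be generated with Python. The field names (workerPoolSpecs, machineSpec, containerSpec, baseOutputDirectory, outputUriPrefix) follow the CustomJobSpec resource; the image URI, bucket path, and machine type are placeholders, and build_custom_job_spec is a hypothetical helper, not part of any Google library.

```python
import json


def build_custom_job_spec(image_uri, output_uri_prefix, machine_type="n1-standard-4"):
    """Return a CustomJobSpec-shaped dict (hypothetical helper for illustration)."""
    return {
        "workerPoolSpecs": [
            {
                # Add acceleratorType / acceleratorCount here if you need GPUs.
                "machineSpec": {"machineType": machine_type},
                "replicaCount": 1,
                "containerSpec": {"imageUri": image_uri},
            }
        ],
        # This is the field that corresponds to the console's
        # "Model output directory" option.
        "baseOutputDirectory": {"outputUriPrefix": output_uri_prefix},
    }


# Placeholder values: replace with your own image and bucket path.
spec = build_custom_job_spec(
    image_uri="gcr.io/your-project/your-image:latest",
    output_uri_prefix="gs://your-bucket/your-job-output",
)

# Write the file that the --config flag points to.
with open("your-custom-job-spec.json", "w") as f:
    json.dump(spec, f, indent=2)
```

Writing the file by hand works just as well; the script only makes the structure explicit.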
Here are some important notes to remember:
Permissions - Make sure that your service account has the necessary permissions to write to the specified Google Cloud Storage bucket.
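Environment Variable - Once baseOutputDirectory is set, Vertex AI sets the AIP_MODEL_DIR environment variable to the model/ subdirectory under the outputUriPrefix you specified, and AIP_CHECKPOINT_DIR to the checkpoints/ subdirectory, allowing your training script to find the correct location for saving checkpoints and other artifacts.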
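On the training-script side, a minimal sketch of reading these variables might look like the following. The function name resolve_output_dirs and the local fallback path are assumptions for illustration; the fallback only exists so the script still runs when no output directory was configured (for example, during local testing).

```python
import os


def resolve_output_dirs(env=None, fallback="./output"):
    """Return (model_dir, checkpoint_dir), preferring the Vertex AI variables.

    Hypothetical helper: falls back to local directories when the
    AIP_* variables are missing or empty.
    """
    if env is None:
        env = os.environ
    model_dir = env.get("AIP_MODEL_DIR") or f"{fallback}/model"
    checkpoint_dir = env.get("AIP_CHECKPOINT_DIR") or f"{fallback}/checkpoints"
    return model_dir, checkpoint_dir


model_dir, checkpoint_dir = resolve_output_dirs()
print("Saving model to:", model_dir)
print("Saving checkpoints to:", checkpoint_dir)
```

Note that the values are gs:// URIs, so the library you use to save checkpoints needs to be able to write to Cloud Storage paths.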