Hello,
I am trying to run a Custom Training Job in the Vertex AI Training service.
The job is based on a tutorial that fine-tunes a pre-trained BERT model from Hugging Face.
When I use the gcloud CLI tool to auto-package my training code into a Docker image and deploy it to the Vertex AI Training service like so:
BASE_GPU_IMAGE="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-7:latest"
BUCKET_NAME="my-bucket"

gcloud ai custom-jobs create \
  --region=us-central1 \
  --display-name=fine_tune_bert \
  --args="--job_dir=$BUCKET_NAME,--num-epochs=2,--model-name=finetuned-bert-classifier" \
  --worker-pool-spec="machine-type=n1-standard-4,replica-count=1,accelerator-type=NVIDIA_TESLA_V100,executor-image-uri=$BASE_GPU_IMAGE,local-package-path=.,python-module=trainer.task"
… I end up with a Docker image that is roughly 18 GB (!) and takes a very long time to upload to the GCP registry.
Granted, the base image is around 6.5 GB, but where do the additional 10+ GB come from? Is there a way for me to avoid this size increase?
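In case it helps, I was planning to inspect the locally built image to see which layer adds the bulk, something along these lines (the image tag below is just a placeholder for whatever tag gcloud assigns to the auto-packaged image on my machine):

# List local images and their reported sizes
docker images

# Show per-layer sizes for the auto-packaged image;
# the largest layers should point at the step that adds the extra gigabytes.
# Replace the tag below with the actual auto-packaged image tag.
docker history --no-trunc us-central1-docker.pkg.dev/MY_PROJECT/MY_REPO/fine-tune-bert:latest

I haven't dug through the layers yet, so I'd also appreciate pointers on what the auto-packaging step typically installs on top of the base image.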
Please note that my job loads the training data at run time using the "datasets" Python package; AFAIK the data is not included in the auto-packaged Docker image.
Thanks,
urig