Vertex AI Training: Auto-packaged Custom Training Job Yields Very Large Docker Image

Hello,

I am trying to run a Custom Training Job in the Vertex AI Training service.

The job is based on a tutorial for fine-tuning a pre-trained BERT model (from Hugging Face).

When I use the gcloud CLI tool to auto-package my training code into a Docker image and deploy it to the Vertex AI Training service like so:

BASE_GPU_IMAGE="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-7:latest"
BUCKET_NAME="my-bucket"

gcloud ai custom-jobs create \
  --region=us-central1 \
  --display-name=fine_tune_bert \
  --args="--job_dir=$BUCKET_NAME,--num-epochs=2,--model-name=finetuned-bert-classifier" \
  --worker-pool-spec="machine-type=n1-standard-4,replica-count=1,accelerator-type=NVIDIA_TESLA_V100,executor-image-uri=$BASE_GPU_IMAGE,local-package-path=.,python-module=trainer.task"

… I end up with a Docker image that is roughly 18GB (!) and takes a very long time to upload to the GCP registry.

Granted, the base image is around 6.5GB, but where do the additional 10+GB come from? Is there a way for me to avoid this added size?
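Since the command uses local-package-path=., everything in the current directory ends up in the build context. As a first diagnostic step (this is just a generic check, not anything Vertex-specific), listing the largest entries under the current directory shows what the auto-packager would pick up:

```shell
# List the 20 largest files/directories under the current directory,
# i.e. the content that local-package-path=. would package into the image.
du -ah . | sort -rh | head -n 20
```

If any single entry here accounts for multiple gigabytes, that would explain the gap between the base image size and the final image size.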

Please note that my job loads the training data using the datasets Python package at run time and AFAIK does not include it in the auto-packaged docker image.

Thanks,
urig

Hi Urig,

Is it possible that you have local files in the current directory, such as data or log files, that are getting picked up? Specifically, this part of your command: local-package-path=.

If this persists, I highly recommend filing a Public Issue, as you can have a private thread created for you and we would be able to support you further there.

Hello Ismail,

Thank you for your help.

I’ve checked and to the best of my knowledge there are no data or log files being picked up into my custom docker image.

According to an answer that I’ve received on stackoverflow.com, it’s likely that the 18GB size that I’m seeing is the size of my image after extraction. Apparently the ~6.8GB size is for the image compressed.
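For anyone hitting the same confusion: container registries typically report compressed layer sizes, while docker image ls reports the extracted size, so the same image can legitimately show ~6.8GB in one place and ~18GB in the other. The ratio is just gzip at work, which a small generic sketch (nothing Docker-specific, using a file of highly compressible zeros) illustrates:

```shell
# Compare raw vs gzip-compressed size of the same data, mirroring the
# extracted-vs-compressed gap seen with Docker image layers.
head -c 10000000 /dev/zero > sample.bin   # 10 MB of highly compressible data
raw=$(wc -c < sample.bin)
gzip -k sample.bin                        # -k keeps the original alongside sample.bin.gz
compressed=$(wc -c < sample.bin.gz)
echo "raw: $raw bytes, compressed: $compressed bytes"
```

To compare the two figures for an actual image locally, docker image ls shows the extracted size, while piping docker save through gzip and wc -c approximates what gets pushed over the wire.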

Cheers,

@urig