Extremely slow throughput (I/O bottleneck) in DreamBooth LoRA SDXL training on Vertex AI

Hi everyone,

I am training a DreamBooth LoRA for SDXL on Vertex AI using the pytorch-peft-train container from Model Garden. My dataset consists of 1,000 images stored in a GCS bucket.

The issue:
Training is extremely slow: ~20 seconds per iteration (batch size 1, 4 gradient accumulation steps), and 6 hours of training yields only 6% progress (600/10,000 steps).

My setup:

  • Model: Stable Diffusion XL Base 1.0

  • Environment: Vertex AI, machine_type=g2-standard-8 (L4 GPU).

  • Data Path: Accessing images via GCS FUSE (/gcs/...).

  • Observations: GPU utilization is low (10–20%) and the logs show high I/O wait times. I’ve already confirmed the file paths are correct and that training eventually starts, but throughput is clearly bottlenecked by GCS read latency.
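In case it helps with diagnosis, this is roughly how I compared raw read throughput between the FUSE mount and local disk (the directory paths in the usage comment are placeholders, not my exact bucket layout):

```python
import os
import time


def read_throughput_mb_s(directory: str, limit: int = 100) -> float:
    """Sequentially read up to `limit` files from `directory` and return MB/s."""
    paths = sorted(
        os.path.join(directory, name) for name in os.listdir(directory)
    )[:limit]
    total_bytes = 0
    start = time.perf_counter()
    for path in paths:
        with open(path, "rb") as fh:
            total_bytes += len(fh.read())
    elapsed = time.perf_counter() - start
    # Guard against a zero-duration measurement on an empty directory.
    return total_bytes / 1e6 / max(elapsed, 1e-9)


# Usage (placeholder paths):
#   read_throughput_mb_s("/gcs/my-bucket/dataset")  # GCS FUSE mount
#   read_throughput_mb_s("/tmp/dataset")            # local scratch disk
```

The gap between the two numbers is what made me conclude the bottleneck is the FUSE mount rather than the model or the dataloader itself.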

My Questions:

  1. Is GCS FUSE (/gcs/ path) known to be inefficient for SDXL training with datasets of 1,000+ images?

  2. What is the recommended strategy for training on larger datasets (up to 8,000 images)? Should I preload data to /tmp using a startup script, or is there a way to optimize the DreamBoothDataset loading process?

  3. Are there any configuration parameters I’ve missed for optimizing the dataloader or caching for large datasets on Vertex AI?
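For context on question 2, this is the kind of preload step I have in mind, run once at container startup before training kicks off (`gs://my-bucket/dataset` is a placeholder for my actual bucket path):

```python
import subprocess
from pathlib import Path


def preload_dataset(bucket_uri: str, local_dir: str, dry_run: bool = False) -> list:
    """Copy the training images from GCS to fast local disk once,
    so the dataloader never touches the FUSE mount during training."""
    Path(local_dir).mkdir(parents=True, exist_ok=True)
    # gsutil -m parallelizes the copy, which matters for many small files.
    cmd = ["gsutil", "-m", "cp", "-r", bucket_uri, local_dir]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


# Usage at startup (placeholder bucket path):
#   preload_dataset("gs://my-bucket/dataset", "/tmp/dataset")
```

Is this the recommended pattern, or is there a Vertex AI-native way to stage data onto the local SSD before the training loop starts?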
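And for question 3, one dataloader-side idea I'm wondering about: caching raw image bytes in RAM after the first read, so each image hits GCS at most once across all epochs (1,000 SDXL training images should fit comfortably in the 32 GB of a g2-standard-8). A minimal sketch of what I mean:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def load_image_bytes(path: str) -> bytes:
    """First call reads from disk (or the FUSE mount); subsequent calls
    are served from the in-memory cache."""
    with open(path, "rb") as fh:
        return fh.read()


# Inside a Dataset.__getitem__, decode the image from these cached bytes
# instead of re-opening the file on every epoch.
```

Would something like this play nicely with the container's DreamBoothDataset, or does it already do its own caching?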

Any insights or best practices for optimizing Vertex AI training pipelines for image-heavy datasets would be greatly appreciated!