Hi everyone,
I am training a DreamBooth LoRA for SDXL on Vertex AI using the `pytorch-peft-train` container (Model Garden). My dataset consists of 1,000 images stored in a GCS bucket.
The issue:
Training is extremely slow: ~20 seconds per iteration (batch size 1, 4 gradient accumulation steps), and 6 hours of training yields only 6% progress (600/10,000 steps).
My setup:
- Model: Stable Diffusion XL Base 1.0
- Environment: Vertex AI, `machine_type=g2-standard-8` (L4 GPU)
- Data path: accessing images via GCS FUSE (`/gcs/...`)
- Observations: GPU utilization is low (10-20%) and the logs show high I/O wait times. I've already confirmed the file paths are correct, and training does eventually start, but throughput is clearly bottlenecked by GCS read latency.
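To quantify the latency I'm describing, here is the kind of check I ran (a minimal sketch; the directory paths are hypothetical examples, and `time_reads` is a helper I wrote, not part of any library): time raw file reads from the FUSE mount and compare against local disk.

```python
import time
from pathlib import Path

def time_reads(directory: str, limit: int = 50) -> float:
    """Return the average seconds per full-file read over the first
    `limit` files found under `directory` (0.0 if it is empty)."""
    files = [f for f in Path(directory).rglob("*") if f.is_file()][:limit]
    if not files:
        return 0.0
    start = time.perf_counter()
    for f in files:
        f.read_bytes()  # read the whole file, as a dataloader worker would
    return (time.perf_counter() - start) / len(files)

# e.g. compare time_reads("/gcs/my-bucket/images") vs time_reads("/tmp/images")
```

Reads through the FUSE mount come back orders of magnitude slower than local disk for me, which matches the low GPU utilization.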
My questions:
- Is GCS FUSE (`/gcs/path`) known to be inefficient for SDXL training with datasets of 1,000+ images?
- What is the recommended strategy for training on larger datasets (up to 8,000 images)? Should I preload the data to `/tmp` using a startup script, or is there a way to optimize the `DreamBoothDataset` loading process?
- Are there any configuration parameters I missed that would optimize the `dataloader` or caching for large datasets on Vertex AI?
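To clarify what I mean by "preload to `/tmp`": something like the sketch below, run once before training starts. This is only a stdlib mock-up under my assumptions (paths are hypothetical, and `preload_dataset` is my own helper name); in practice I'd expect `gsutil -m cp -r` in a startup script to do the same job faster in parallel.

```python
import shutil
from pathlib import Path

def preload_dataset(gcs_mount: str, local_dir: str) -> int:
    """Recursively copy every file from the GCS FUSE mount to fast local
    disk, preserving the directory layout. Returns the file count."""
    src = Path(gcs_mount)
    dst = Path(local_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    for f in src.rglob("*"):
        if f.is_file():
            target = dst / f.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)  # pay the GCS latency once, up front
            copied += 1
    return copied

# e.g. preload_dataset("/gcs/my-bucket/dreambooth-images", "/tmp/dreambooth-images"),
# then point the training script's instance-data dir at /tmp/dreambooth-images
```

Is this the recommended pattern, or is there a built-in Vertex AI mechanism for staging data onto the local SSD?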
Any insights or best practices for optimizing Vertex AI training pipelines for image-heavy datasets would be greatly appreciated!