Extremely slow throughput (I/O bottleneck) in DreamBooth LoRA SDXL training on Vertex AI

Hi everyone,

I am training a DreamBooth LoRA for SDXL on Vertex AI using the pytorch-peft-train container from Model Garden. My dataset consists of 1,000 images stored in a GCS bucket.

The issue:
Training is extremely slow: ~20 seconds per iteration (batch size 1, 4 gradient accumulation steps), and 6 hours of training yields only 6% progress (600/10,000 steps).

My setup:

  • Model: Stable Diffusion XL Base 1.0

  • Environment: Vertex AI, machine_type=g2-standard-8 (L4 GPU).

  • Data Path: Accessing images via GCS FUSE (/gcs/...).

  • Observations: GPU utilization is low (10–20%) and the logs show high I/O wait times. I’ve already confirmed the file paths are correct and that training eventually starts, but throughput is clearly bottlenecked by GCS read latency.
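In case it helps with diagnosis, this is roughly how I compared raw read throughput between the FUSE mount and local disk (the directory paths in the usage comment are placeholders, not my exact bucket layout):

```python
import os
import time


def read_throughput_mb_s(directory: str, limit: int = 100) -> float:
    """Sequentially read up to `limit` files from `directory` and return MB/s."""
    paths = sorted(
        os.path.join(directory, name) for name in os.listdir(directory)
    )[:limit]
    total_bytes = 0
    start = time.perf_counter()
    for path in paths:
        with open(path, "rb") as fh:
            total_bytes += len(fh.read())
    elapsed = time.perf_counter() - start
    # Guard against a zero-duration measurement on an empty directory.
    return total_bytes / 1e6 / max(elapsed, 1e-9)


# Usage (placeholder paths):
#   read_throughput_mb_s("/gcs/my-bucket/dataset")  # GCS FUSE mount
#   read_throughput_mb_s("/tmp/dataset")            # local scratch disk
```

The gap between the two numbers is what made me conclude the bottleneck is the FUSE mount rather than the model or the dataloader itself.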

My Questions:

  1. Is GCS FUSE (/gcs/ path) known to be inefficient for SDXL training with datasets of 1,000+ images?

  2. What is the recommended strategy for training on larger datasets (up to 8,000 images)? Should I preload data to /tmp using a startup script, or is there a way to optimize the DreamBoothDataset loading process?

  3. Are there any configuration parameters I’ve missed for optimizing the dataloader or caching for large datasets on Vertex AI?
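For context on question 2, this is the kind of preload step I have in mind, run once at container startup before training kicks off (`gs://my-bucket/dataset` is a placeholder for my actual bucket path):

```python
import subprocess
from pathlib import Path


def preload_dataset(bucket_uri: str, local_dir: str, dry_run: bool = False) -> list:
    """Copy the training images from GCS to fast local disk once,
    so the dataloader never touches the FUSE mount during training."""
    Path(local_dir).mkdir(parents=True, exist_ok=True)
    # gsutil -m parallelizes the copy, which matters for many small files.
    cmd = ["gsutil", "-m", "cp", "-r", bucket_uri, local_dir]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


# Usage at startup (placeholder bucket path):
#   preload_dataset("gs://my-bucket/dataset", "/tmp/dataset")
```

Is this the recommended pattern, or is there a Vertex AI-native way to stage data onto the local SSD before the training loop starts?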
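And for question 3, one dataloader-side idea I'm wondering about: caching raw image bytes in RAM after the first read, so each image hits GCS at most once across all epochs (1,000 SDXL training images should fit comfortably in the 32 GB of a g2-standard-8). A minimal sketch of what I mean:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def load_image_bytes(path: str) -> bytes:
    """First call reads from disk (or the FUSE mount); subsequent calls
    are served from the in-memory cache."""
    with open(path, "rb") as fh:
        return fh.read()


# Inside a Dataset.__getitem__, decode the image from these cached bytes
# instead of re-opening the file on every epoch.
```

Would something like this play nicely with the container's DreamBoothDataset, or does it already do its own caching?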

Any insights or best practices for optimizing Vertex AI training pipelines for image-heavy datasets would be greatly appreciated!