Authors:
Junjie Qian: Staff Software Engineer, Google
Taka Kuwayama: Principal Architect, Google
In this fast-paced world of autonomous vehicles (AV) and advanced driver-assistance systems (ADAS) development, staying ahead of the curve requires constant innovation and a relentless pursuit of cutting-edge technology. The models behind these complex use cases demand substantial computational power and efficient data processing. In this blog, we provide a reproducible recipe that demonstrates how to fine-tune Google’s open-source PaliGemma2 model on the Waymo dataset while systematically evaluating the powerful Google Cloud A4 VMs powered by NVIDIA B200 GPUs.
Why vision-language models (VLMs)?
Accurate and real-time object recognition in images is a vital component of AV and ADAS. Detecting various objects, such as vehicles, pedestrians, cyclists, and traffic signs, is essential for safe navigation and effective decision-making. Fine-tuning VLMs can significantly enhance object recognition capabilities, ultimately leading to more efficient and reliable autonomous driving solutions.
PaliGemma2 is Google’s advanced open-source vision-language model, expertly designed to understand and process both images and text prompts simultaneously. We conducted a study on multimodal AI performance by fine-tuning the PaliGemma2 model to enhance its capabilities for robust object recognition using image data from the Waymo Perception Dataset. This work provides an educational demonstration of how the fine-tuned PaliGemma2 model can be effectively applied to real-world, high-stakes vision-language tasks. Concurrently, we systematically evaluated the efficiency of Google Cloud’s A4 Virtual Machines (VMs) across various configurations to determine optimal performance strategies for Vision-Language Models (VLM) of this type.
This process leverages the robust capabilities of Google Cloud A4 VMs, powered by NVIDIA HGX B200, and utilizes Google Kubernetes Engine (GKE) for streamlined orchestration.
Platform and configurations
To evaluate the platform’s scalability and performance, we executed experiments using various configurations, specifically 1, 2, 4, and 8 A4 VMs. Each A4 VM is a high-performance instance, equipped with:
- 8 NVIDIA B200 GPUs
- 3,600 Gbps network bandwidth
- Approximately 12 TB of local SSD storage
To understand the impact of data storage on training performance, we configured the training data to be accessible through both local SSD (for high-speed, proximate access) and GCSFuse (for flexible, cloud-based storage integration). This dual-storage approach allowed us to analyze the performance trade-offs in real-world scenarios.
Model and dataset
PaliGemma2, an open-source Google vision-language model, is a versatile architecture capable of handling a diverse range of computer vision tasks, including object detection, image segmentation, and visual question answering. Its robust design and pre-training on extensive datasets make it a good candidate for fine-tuning on domain-specific tasks. In this work, we specifically fine-tuned the paligemma2-3b-pt-224 variant with the Waymo Perception Dataset to demonstrate high performance and effective ADAS object recognition. This particular model leverages a transformer-based architecture combined with efficient attention mechanisms, allowing for effective processing of both visual and linguistic information. Its compact yet performant design makes it suitable for deployment in resource-constrained environments while maintaining high accuracy in complex perception tasks.
The Waymo Perception Dataset is renowned for its scale and high-quality annotations, offering a wealth of synchronized lidar and camera data with precise 3D bounding box annotations across a wide range of driving conditions. This makes it well suited for training and evaluating robust and safe autonomous driving systems. For our fine-tuning task, we used only the camera data from the dataset. Prior to training, the dataset underwent a preprocessing phase using our dedicated script to adapt it to the PaliGemma2 model’s input requirements.
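As an illustrative sketch of what such preprocessing has to produce (a hypothetical stand-in, not the recipe’s actual script), the snippet below converts a 2D camera-box annotation into the `<locXXXX>` location-token format that PaliGemma-style models use for detection targets, with box coordinates normalized to a 0–1023 grid in y_min, x_min, y_max, x_max order:

```python
def box_to_loc_tokens(box, img_w, img_h):
    """box = (x_min, y_min, x_max, y_max) in pixels."""
    x_min, y_min, x_max, y_max = box

    def bin_(v, size):
        # Normalize a pixel coordinate into one of 1024 location bins.
        return min(1023, max(0, round(v / size * 1023)))

    return (f"<loc{bin_(y_min, img_h):04d}>"
            f"<loc{bin_(x_min, img_w):04d}>"
            f"<loc{bin_(y_max, img_h):04d}>"
            f"<loc{bin_(x_max, img_w):04d}>")

def make_example(boxes, labels, img_w, img_h):
    # Pair a "detect <classes>" prompt with loc-token targets, one entry
    # per annotated object, joined with " ; " as in PaliGemma's format.
    suffix = " ; ".join(
        f"{box_to_loc_tokens(b, img_w, img_h)} {lab}"
        for b, lab in zip(boxes, labels))
    prefix = "detect " + " ; ".join(sorted(set(labels)))
    return {"prefix": prefix, "suffix": suffix}

# One hypothetical annotation on a 1920x1280 camera frame.
example = make_example([(100, 200, 500, 600)], ["vehicle"], 1920, 1280)
print(example["suffix"])
```

The actual recipe additionally packages the images and text pairs into shards consumable by the training script.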
Reproducible recipe
To ensure transparency and facilitate adoption, the entire fine-tuning implementation on A4 VMs has been open-sourced and shared as a comprehensive recipe on GitHub. This recipe provides detailed instructions and configurable parameters for users to replicate our results or adapt the methodology to their specific needs.
- PaliGemma2 Fine-tuning Recipe: This recipe demonstrates the fine-tuning implementation on single or multiple A4 VMs with easily configurable parameters, allowing users to scale their training according to their computational resources and dataset size. The fine-tuning updates all model parameters for this object recognition task.
- Waymo Perception Dataset Processing Recipe: This recipe demonstrates the dataset preprocessing for this fine-tuning task, which extracts the images and labels from the dataset and packages them in the format consumable by the fine-tuning script.
The recipe highlights the following key steps:
- Initial Setup: This one-time task involves configuring environment variables and secrets for the GKE cluster and Hugging Face, as well as preparing the GKE cluster and training environments.
- Recipe Execution: Once the setup is complete, this step details how to run the fine-tuning job using the provided training script. The fine-tuning workload within GKE is executed using Helm. An example command line is provided below:
helm install $USER-paligemma2 ${RECIPE_ROOT} -f ${RECIPE_ROOT}/values.yaml \
--set-file workload_launcher=${RECIPE_ROOT}/launcher.sh \
--set-file workload_config=${RECIPE_ROOT}/main.py \
--set workload.image=nvcr.io/nvidia/pytorch:25.01-py3 \
--set volumes.gcsMounts\[0\].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts\[0\].mountPath=/job-logs \
--set workload.envs\[0\].value=/job-logs/${USER}-paligemma2
- Job Monitoring: This phase demonstrates how to access job logs and retrieve training results, including guidance on troubleshooting potential issues. The job, by default, provides the following performance metrics:
{'train_runtime': 1203.7256, 'train_samples_per_second': 2041.661, 'train_steps_per_second': 0.083, 'train_loss': 0.43805614471435544, 'epoch': 3.1}
- Cleanup: Finally, this step outlines how to dismantle the deployed Helm job (the fine-tuning experiment) to release allocated resources.
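Because the job prints its final metrics as a Python-style dict, the line can be parsed directly for automated tracking or run-to-run comparison; a minimal example using the sample output above:

```python
import ast

# The metrics line printed at the end of the job (sample values from above).
metrics_line = ("{'train_runtime': 1203.7256, "
                "'train_samples_per_second': 2041.661, "
                "'train_steps_per_second': 0.083, "
                "'train_loss': 0.43805614471435544, 'epoch': 3.1}")

# literal_eval safely parses the dict without executing arbitrary code.
metrics = ast.literal_eval(metrics_line)
print(f"{metrics['train_samples_per_second']:.0f} samples/s "
      f"in {metrics['train_runtime'] / 60:.1f} min, "
      f"final loss {metrics['train_loss']:.3f}")
```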
Training performance results
To assess workload efficiency, we evaluate runtime performance using the following metrics:
- Train steps per second (or train samples per second): The primary indicator of training speed; higher is better.
- Train runtime: The duration of the training workload in seconds, which depends on the number of epochs or steps executed.
- Train loss: The loss should decrease consistently, indicating effective learning.
- GPU SM utilization: The percentage of time the GPU’s streaming multiprocessors are active, obtainable with nvidia-smi commands. Ideally, this should be close to 100%; low utilization often points to data loading or CPU bottlenecks.
- GPU memory usage: The memory consumed by your model and data on the GPU, also obtainable with nvidia-smi commands. High memory usage might require strategies like mixed precision training or gradient accumulation.
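As a sketch of how the two GPU metrics can be collected during a run, the helper below shells out to nvidia-smi with its standard query flags; the sample CSV values are made up for illustration, and the parsing is kept in its own function so it works without a GPU present:

```python
import csv
import io
import subprocess

# Standard nvidia-smi query flags for per-GPU SM utilization and memory.
QUERY = ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_csv(text):
    # Each record: index, SM utilization %, memory used MiB, memory total MiB.
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        idx, util, used, total = (field.strip() for field in rec)
        rows.append({"gpu": int(idx), "sm_util_pct": int(util),
                     "mem_used_mib": int(used), "mem_total_mib": int(total)})
    return rows

def snapshot():
    # Call this periodically during the training job.
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(out.stdout)

# Example of the CSV shape nvidia-smi emits with the flags above
# (hypothetical values for two GPUs):
sample = "0, 97, 151320, 183359\n1, 95, 151104, 183359\n"
print(parse_gpu_csv(sample))
```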
We report the training performance in samples per second for the published recipe on out-of-the-box A4 VMs, as shown below; the other metrics are collected for performance tuning purposes. The results indicate nearly linear scaling as the number of GPUs increases, and we also provide further optimization options to achieve better training performance with both compute and storage solutions.
Our tests highlight the critical impact of data source proximity on performance. We observed a performance difference between local SSD and GCSFuse, particularly as the number of VMs for training increased. The GCS bucket used in this experiment was located in the same zone as the compute resources. The GCSFuse driver, which is detailed in the published recipe, was used to mount the GCS bucket as a local directory on the VM, allowing for direct access to the data.
The performance disparity observed can be attributed to differences in data loading efficiency between local SSD and GCSFuse. We expect this gap to narrow with further optimization of our GCS storage settings. Look for upcoming updates on storage recommendations in our GPU recipes.
Best practices
To maximize the efficiency and performance of the AV/ADAS model training on A4 instances, it’s crucial to implement best practices for monitoring and optimization. These strategies help identify bottlenecks and optimize the training pipeline.
Please note that the metrics, tools, and parameters discussed in this recipe are specific to this experimental context; other options are omitted for brevity.
Utilize profiling tools
Beyond simply measuring performance, it’s crucial to understand if the system is consistently performing at its hardware’s peak capability. To assess this and identify potential optimization opportunities, we employ the following tools:
- NVIDIA Nsight Systems (nsys): Use nsys for low-overhead profiling to capture detailed data on GPU kernels, API calls, CPU activity, and network communication. This provides a comprehensive view of bottlenecks.
- PyTorch Profiler (torch.profiler): This tool helps identify expensive model operators and device kernel activity, and visualizes execution, providing insights into time spent on different tasks.
- nvidia-smi: Leverage nvidia-smi for real-time collection of metrics like GPU utilization, memory utilization, and power consumption during job execution.
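A minimal torch.profiler sketch is shown below, using a small stand-in model rather than the actual fine-tuning loop; in practice you would wrap a few representative training steps:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in model and batch; substitute a few steps of the real loop.
model = torch.nn.Linear(512, 512)
x = torch.randn(64, 512)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    model, x = model.cuda(), x.cuda()

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(5):
        model(x).sum().backward()

# Rank operators by total time to spot the expensive ones.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

The table output highlights which operators dominate each step, which is where the tuning below starts.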
Optimize training parameters
Parameter tuning was performed to enhance performance, drawing on optimization insights previously gathered from profiling tools. The result of this tuning was a tenfold increase in train samples per second.
- Batch size: Increase the micro-batch size to improve GPU memory and SM utilization, ensuring it fits within GPU memory and is divisible by 8 (or 64 for A100 GPUs).
- Data loading: Employ a DataLoader with multiple workers to load data in parallel, and pin host memory to minimize CPU-to-GPU transfer bottlenecks. For large datasets, shard the data into fewer, larger files to reduce network overhead. This is particularly important when GPU SM utilization is low but peak GPU memory utilization is high.
- Mixed precision (FP16 or BF16): Enable mixed precision training to accelerate computations and decrease memory usage when supported by your hardware and model. This can significantly improve training performance, though careful consideration of potential accuracy impacts is advised.
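These settings can be sketched in a short PyTorch loop; the dataset and model below are synthetic stand-ins for the preprocessed Waymo shards and PaliGemma2, and the batch size, worker count, and learning rate are illustrative only:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

# Synthetic stand-in for the preprocessed image/label shards.
dataset = TensorDataset(torch.randn(256, 3 * 224 * 224), torch.randn(256, 16))
loader = DataLoader(dataset,
                    batch_size=64,        # raise until GPU memory is nearly full
                    num_workers=2,        # load batches in parallel with compute
                    pin_memory=use_cuda)  # pinned host memory speeds H2D copies

model = torch.nn.Linear(3 * 224 * 224, 16).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

for images, targets in loader:
    images = images.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # BF16 autocast: compute-heavy ops run in bfloat16, reductions stay fp32.
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = torch.nn.functional.mse_loss(model(images), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.4f}")
```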
Advanced optimization techniques
- CUDA Graph-based optimization with torch.compile: Utilize torch.compile to dynamically identify and compile model segments, leveraging CUDA Graphs for improved performance. The “reduce-overhead” mode is beneficial, while “max-autotune” offers further kernel-level optimizations.
By systematically applying these best practices, we can effectively monitor the performance of the AV/ADAS training workloads on A4 VMs and implement targeted optimizations to achieve faster training times and more efficient resource utilization.
Getting started: Try the recipes today
Being able to test new accelerator-optimized instances quickly and continuously is crucial to keeping up with the pace of innovation in AV/ADAS models. This reproducible recipe makes it faster to get started benchmarking A4 instances on VLMs with consistent results. Try the recipe and explore more AI Hypercomputer recipes today!


