Last week I spent some time benchmarking LLMs on Vertex AI. I wanted to use vLLM’s bench library to get inference metrics, but I couldn’t find a tutorial on how to make it work with a Vertex AI endpoint.
So, I put together this example notebook. It’s a walkthrough that shows you how to (rough sketches of each step follow the list):
- Deploy two Llama 4 Scout models (baseline vs. EAGLE-enabled) on 8x H100s.
- Apply the necessary patch to vLLM’s bench library so it can call the Vertex AI API.
- Run a full concurrency sweep to measure TTFT (time to first token), TPOT (time per output token), and throughput.
- Analyze the results to see the real-world speedup from EAGLE.
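
To give a flavor of each step, here are some rough sketches (not the notebook’s exact code). First, deployment: a minimal sketch using the Vertex AI Python SDK, assuming a prebuilt vLLM serving container. The image URI, project, model ID, and EAGLE flags are all placeholders, and the exact speculative-decoding arguments depend on your vLLM version.

```python
# Sketch: deploy baseline and EAGLE-enabled Llama 4 Scout on 8x H100.
# Image URI, project, model ID, and EAGLE flags are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

def deploy(display_name: str, extra_args: list[str]):
    model = aiplatform.Model.upload(
        display_name=display_name,
        serving_container_image_uri="<your-vllm-serving-image>",  # placeholder
        serving_container_args=[
            "--model=meta-llama/Llama-4-Scout-17B-16E-Instruct",
            "--tensor-parallel-size=8",
            *extra_args,
        ],
        serving_container_ports=[8080],
        # Predict/health routes and env vars omitted; your image may require them.
    )
    return model.deploy(
        machine_type="a3-highgpu-8g",        # 8x H100 80GB
        accelerator_type="NVIDIA_H100_80GB",
        accelerator_count=8,
    )

baseline_endpoint = deploy("llama4-scout-baseline", [])
eagle_endpoint = deploy(
    "llama4-scout-eagle",
    # Hypothetical EAGLE config; the exact flag shape varies across vLLM versions.
    ['--speculative-config={"method": "eagle", "num_speculative_tokens": 3}'],
)
```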
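Second, the patch. I won’t reproduce the notebook’s diff here, but the core problem is authentication: Vertex AI endpoints expect an OAuth bearer token rather than a static API key. Since vLLM’s OpenAI-compatible bench backends send whatever is in OPENAI_API_KEY as a Bearer token, one minimal approach (an assumption on my part, not necessarily the notebook’s exact method) is to export a fresh GCP access token:

```python
# Sketch: authenticate vLLM's bench client against a Vertex AI endpoint.
# Assumes google-auth is installed and Application Default Credentials are set up.
import os

import google.auth
import google.auth.transport.requests

creds, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
creds.refresh(google.auth.transport.requests.Request())

# vLLM's OpenAI-style backends send "Authorization: Bearer $OPENAI_API_KEY".
# GCP access tokens expire after ~1 hour, so refresh before long sweeps.
os.environ["OPENAI_API_KEY"] = creds.token
```

Depending on how your endpoint is exposed (a dedicated endpoint vs. the shared aiplatform.googleapis.com path), you may also need to point the benchmark at a non-standard URL, which is where the patch to the bench library comes in.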
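Third, the sweep. Recent vLLM versions expose the benchmark as `vllm bench serve` (older releases ship it as `benchmarks/benchmark_serving.py`). Here’s a sketch of driving it from Python; BASE_URL and MODEL are placeholders, and you’d run it once per endpoint (shown with a “baseline” prefix, then repeated with “eagle” for the other deployment):

```python
# Sketch: sweep concurrency levels against one endpoint and save JSON results.
import subprocess

BASE_URL = "https://<your-vertex-endpoint-base-url>"  # placeholder
MODEL = "meta-llama/Llama-4-Scout-17B-16E-Instruct"   # placeholder

for concurrency in [1, 2, 4, 8, 16, 32, 64]:
    subprocess.run(
        [
            "vllm", "bench", "serve",
            "--backend", "openai-chat",
            "--base-url", BASE_URL,
            "--model", MODEL,
            "--dataset-name", "random",   # synthetic prompts; no dataset file needed
            "--random-input-len", "1024",
            "--random-output-len", "256",
            "--num-prompts", str(concurrency * 10),
            "--max-concurrency", str(concurrency),
            "--save-result",
            "--result-filename", f"baseline_c{concurrency}.json",
        ],
        check=True,
    )
```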
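Finally, the analysis. The saved result files include per-run latency and throughput stats; field names like median_ttft_ms, median_tpot_ms, and output_throughput match recent vLLM versions, but double-check against your own files. A sketch of computing the EAGLE speedup at each concurrency level:

```python
# Sketch: compare baseline vs. EAGLE results saved by the sweep above.
import json

def load(prefix: str, concurrency: int) -> dict:
    with open(f"{prefix}_c{concurrency}.json") as f:
        return json.load(f)

for c in [1, 2, 4, 8, 16, 32, 64]:
    base, eagle = load("baseline", c), load("eagle", c)
    tpot_speedup = base["median_tpot_ms"] / eagle["median_tpot_ms"]
    print(
        f"concurrency={c:>3}  "
        f"TPOT speedup: {tpot_speedup:.2f}x  "
        f"throughput: {base['output_throughput']:.0f} -> "
        f"{eagle['output_throughput']:.0f} tok/s"
    )
```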
If you’ve been wondering how to benchmark your models on Vertex AI, this should save you some time. Hope it’s helpful. Happy building!
