Last week I spent some time benchmarking LLMs on Vertex AI. I wanted to use vLLM’s bench library to get inference metrics, but I couldn’t find a tutorial on how to make it work with a Vertex AI endpoint.
So, I put together this example notebook. It’s a walkthrough that shows you how to (rough sketches of each step follow the list):
- Deploy two Llama 4 Scout models (baseline vs. EAGLE-enabled) on 8x H100s.
- Apply the necessary patch to vLLM’s bench library so it can call the Vertex AI API.
- Run a full concurrency sweep to measure TTFT (time to first token), TPOT (time per output token), and throughput.
- Analyze the results to see the real-world speedup from EAGLE.
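
To give a flavor of each step, here are some rough sketches (not the notebook’s exact code). First, deployment: a minimal sketch using the Vertex AI Python SDK, assuming a prebuilt vLLM serving container. The image URI, project, model ID, and EAGLE flags are all placeholders, and the exact speculative-decoding arguments depend on your vLLM version.

```python
# Sketch: deploy baseline and EAGLE-enabled Llama 4 Scout on 8x H100.
# Image URI, project, model ID, and EAGLE flags are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

def deploy(display_name: str, extra_args: list[str]):
    model = aiplatform.Model.upload(
        display_name=display_name,
        serving_container_image_uri="<your-vllm-serving-image>",  # placeholder
        serving_container_args=[
            "--model=meta-llama/Llama-4-Scout-17B-16E-Instruct",
            "--tensor-parallel-size=8",
            *extra_args,
        ],
        serving_container_ports=[8080],
        # Predict/health routes and env vars omitted; your image may require them.
    )
    return model.deploy(
        machine_type="a3-highgpu-8g",        # 8x H100 80GB
        accelerator_type="NVIDIA_H100_80GB",
        accelerator_count=8,
    )

baseline_endpoint = deploy("llama4-scout-baseline", [])
eagle_endpoint = deploy(
    "llama4-scout-eagle",
    # Hypothetical EAGLE config; the exact flag shape varies across vLLM versions.
    ['--speculative-config={"method": "eagle", "num_speculative_tokens": 3}'],
)
```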
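Second, the patch. I won’t reproduce the notebook’s diff here, but the core problem is authentication: Vertex AI endpoints expect an OAuth bearer token rather than a static API key. Since vLLM’s OpenAI-compatible bench backends send whatever is in OPENAI_API_KEY as a Bearer token, one minimal approach (an assumption on my part, not necessarily the notebook’s exact method) is to export a fresh GCP access token:

```python
# Sketch: authenticate vLLM's bench client against a Vertex AI endpoint.
# Assumes google-auth is installed and Application Default Credentials are set up.
import os

import google.auth
import google.auth.transport.requests

creds, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
creds.refresh(google.auth.transport.requests.Request())

# vLLM's OpenAI-style backends send "Authorization: Bearer $OPENAI_API_KEY".
# GCP access tokens expire after ~1 hour, so refresh before long sweeps.
os.environ["OPENAI_API_KEY"] = creds.token
```

Depending on how your endpoint is exposed (a dedicated endpoint vs. the shared aiplatform.googleapis.com path), you may also need to point the benchmark at a non-standard URL, which is where the patch to the bench library comes in.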
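Third, the sweep. Recent vLLM versions expose the benchmark as `vllm bench serve` (older releases ship it as `benchmarks/benchmark_serving.py`). Here’s a sketch of driving it from Python; BASE_URL and MODEL are placeholders, and you’d run it once per endpoint (shown with a “baseline” prefix, then repeated with “eagle” for the other deployment):

```python
# Sketch: sweep concurrency levels against one endpoint and save JSON results.
import subprocess

BASE_URL = "https://<your-vertex-endpoint-base-url>"  # placeholder
MODEL = "meta-llama/Llama-4-Scout-17B-16E-Instruct"   # placeholder

for concurrency in [1, 2, 4, 8, 16, 32, 64]:
    subprocess.run(
        [
            "vllm", "bench", "serve",
            "--backend", "openai-chat",
            "--base-url", BASE_URL,
            "--model", MODEL,
            "--dataset-name", "random",   # synthetic prompts; no dataset file needed
            "--random-input-len", "1024",
            "--random-output-len", "256",
            "--num-prompts", str(concurrency * 10),
            "--max-concurrency", str(concurrency),
            "--save-result",
            "--result-filename", f"baseline_c{concurrency}.json",
        ],
        check=True,
    )
```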
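Finally, the analysis. The saved result files include per-run latency and throughput stats; field names like median_ttft_ms, median_tpot_ms, and output_throughput match recent vLLM versions, but double-check against your own files. A sketch of computing the EAGLE speedup at each concurrency level:

```python
# Sketch: compare baseline vs. EAGLE results saved by the sweep above.
import json

def load(prefix: str, concurrency: int) -> dict:
    with open(f"{prefix}_c{concurrency}.json") as f:
        return json.load(f)

for c in [1, 2, 4, 8, 16, 32, 64]:
    base, eagle = load("baseline", c), load("eagle", c)
    tpot_speedup = base["median_tpot_ms"] / eagle["median_tpot_ms"]
    print(
        f"concurrency={c:>3}  "
        f"TPOT speedup: {tpot_speedup:.2f}x  "
        f"throughput: {base['output_throughput']:.0f} -> "
        f"{eagle['output_throughput']:.0f} tok/s"
    )
```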
If you’ve been wondering how to benchmark your models on Vertex AI, this should save you some time. Hope it’s helpful. Happy building!
