Running LLM inference benchmarks on Vertex AI

Last week I spent some time benchmarking LLMs on Vertex AI. I wanted to use vLLM’s bench suite (`vllm bench serve`) to collect inference metrics, but I couldn’t find a tutorial on pointing it at a Vertex AI endpoint.

So I put together this example notebook. It’s a walkthrough that shows you how to do the following (illustrative sketches of each step follow the list):

:small_blue_diamond: Deploy two Llama 4 Scout models (baseline vs. EAGLE-enabled) on 8x H100s.
:small_blue_diamond: Apply the patch vLLM’s bench suite needs to call the Vertex AI API.
:small_blue_diamond: Run a full concurrency sweep to measure time to first token (TTFT), time per output token (TPOT), and throughput.
:small_blue_diamond: Analyze the results to quantify the real-world speedup from EAGLE speculative decoding.
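
For the deployment step, here’s a minimal sketch using the `google-cloud-aiplatform` SDK. The project, region, container image, and model ID are placeholders rather than the notebook’s exact values; run it twice (once with and once without the EAGLE flags) to get the two endpoints.

```python
# Sketch: deploy a vLLM serving container to a Vertex AI endpoint.
# All identifiers below are placeholders -- swap in your own project,
# region, serving image, and model.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="llama4-scout-baseline",
    serving_container_image_uri="<your-vllm-serving-image>",  # e.g. a Model Garden vLLM image
    serving_container_args=[
        "--model=meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "--tensor-parallel-size=8",
        # add your EAGLE speculative-decoding flags here for the second deployment
    ],
)

endpoint = model.deploy(
    machine_type="a3-highgpu-8g",         # 8x H100 80GB
    accelerator_type="NVIDIA_H100_80GB",
    accelerator_count=8,
)
print(endpoint.resource_name)  # you'll need the endpoint ID for the benchmark
```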
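
For the patch, I won’t reproduce the exact diff here, but the core issue is authentication: vLLM’s OpenAI-style request functions read `OPENAI_API_KEY` from the environment and send it as a Bearer token, so one approach is to mint a short-lived Google access token and pass it through that variable (worth verifying against your vLLM version):

```python
# Sketch: mint a Google access token for the benchmark's Bearer header.
# Assumes vLLM's OpenAI-compatible backend reads OPENAI_API_KEY, which is
# true for recent versions -- check backend_request_func.py in yours.
import os

import google.auth
import google.auth.transport.requests

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())

# Sent as "Authorization: Bearer <token>"; note these tokens expire
# after about an hour, so refresh between long runs.
os.environ["OPENAI_API_KEY"] = credentials.token
```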
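
The sweep itself is just the serving benchmark run once per concurrency level. Here’s a sketch assuming the patched benchmark and an OpenAI-compatible chat route on the endpoint; the base URL is a placeholder, and flag names can shift between vLLM versions:

```python
# Sketch: concurrency sweep with `vllm bench serve`, saving one JSON
# result file per level. VERTEX_URL is a placeholder for the endpoint's
# OpenAI-compatible base URL (the part the patch takes care of).
import subprocess

VERTEX_URL = (
    "https://<region>-aiplatform.googleapis.com/v1/projects/<project>"
    "/locations/<region>/endpoints/<endpoint-id>"  # placeholder
)

for concurrency in [1, 2, 4, 8, 16, 32, 64]:
    subprocess.run(
        [
            "vllm", "bench", "serve",
            "--backend", "openai-chat",
            "--base-url", VERTEX_URL,
            "--model", "meta-llama/Llama-4-Scout-17B-16E-Instruct",
            "--dataset-name", "random",
            "--num-prompts", "200",
            "--max-concurrency", str(concurrency),
            "--save-result",
            "--result-filename", f"baseline_c{concurrency}.json",
        ],
        check=True,
    )
```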
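
For the analysis, the `--save-result` files make the comparison easy. A sketch, assuming you ran the sweep once against each endpoint and named the files accordingly; the metric keys match what recent vLLM versions write, but double-check yours:

```python
# Sketch: compare baseline vs. EAGLE results from the saved JSON files.
# Keys like mean_ttft_ms / mean_tpot_ms are what the serving benchmark
# writes with --save-result in recent vLLM versions.
import json
from pathlib import Path

for concurrency in [1, 2, 4, 8, 16, 32, 64]:
    base = json.loads(Path(f"baseline_c{concurrency}.json").read_text())
    eagle = json.loads(Path(f"eagle_c{concurrency}.json").read_text())
    speedup = base["mean_tpot_ms"] / eagle["mean_tpot_ms"]
    print(
        f"c={concurrency:>2}  "
        f"TTFT {base['mean_ttft_ms']:.0f} -> {eagle['mean_ttft_ms']:.0f} ms  "
        f"TPOT {base['mean_tpot_ms']:.1f} -> {eagle['mean_tpot_ms']:.1f} ms  "
        f"({speedup:.2f}x faster per token)"
    )
```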

If you’ve been wondering how to benchmark your models on Vertex AI, this should save you some time. Hope it’s helpful. Happy building!