Optimizing LLM Inference for Minimal Latency with vLLM

LLMs are becoming an integral part of today's AI-powered applications: customer-facing bots use reasoning and generative AI to help users solve problems, innovate, collaborate, and answer questions, requiring low-latency responses in real time, while batch processing powers bulk summarization, extraction, classification, and recommendation systems.

Low latency and high throughput are the critical metrics for these applications.

  1. Latency-Sensitive Inference: For real-time agentic LLM systems, latency is paramount. Latency refers to the end-to-end time a single request takes to receive results, from input to output. Examples include:
    1. Shopping assistants for retail and e-commerce.
    2. Live content and design creation assistants.
    3. Enterprise coding and debugging assistants.
  2. Offline Inference: For batch agent processing tasks such as summarization and classification, throughput is critical. Throughput measures the number of tokens a model can process within a given timeframe, typically expressed as tokens per second.

This article focuses on selecting the optimal accelerator for a given model, running inference benchmarking with the vLLM inference and serving engine for real-time, low-latency scenarios, and fine-tuning vLLM for peak performance to ensure low latency and cost optimization.

Key Considerations

Let us look at what is involved in selecting accelerators to achieve optimal performance. Here are the key considerations.

  1. Model - Llama-3.1-70B, a higher-parameter model offering high quality, complex reasoning, and robust output to enable a variety of real-time inference scenarios.
  2. Inference Engine (vLLM) - vLLM is a fast and user-friendly engine for LLM inference and serving. It is known for its state-of-the-art serving throughput and efficient management of attention key and value memory with PagedAttention. vLLM supports both GPUs and TPUs.
  3. Selecting the Accelerators (GPUs) - A3 Mega (H100) or A3 Ultra (H200).
  4. Running Benchmarks - Leveraging DWS Flex (Dynamic Workload Scheduler) to obtain GPUs for short-term reservations, from 1 hour up to 7 days.


Choosing Accelerators

The next step is selecting the optimal chip based on the model size and the performance requirements. For running Llama 3.1-70B, the candidate accelerators are the H100 and H200, each with 80 GB+ of VRAM. Here are the specifications of the VMs.

| Machine Type | GPU Chip | VRAM per GPU | Total VRAM (8-way) | Max Network BW |
|---|---|---|---|---|
| A3 Mega | NVIDIA H100 SXM | 80 GB HBM3 | 640 GB | 1,800 Gbps |
| A3 Ultra | NVIDIA H200 SXM | 141 GB HBM3e | 1,128 GB | 3,600 Gbps |

A3 Mega vs A3 Ultra:

  1. GPUs: H100 vs H200.
  2. Total VRAM: 640 GB vs 1,128 GB - the larger pool allows significantly larger KV caches to support longer context windows. In our case, for Llama 3.1-70B with a 1,000-5,000 token context length, A3 Mega's 640 GB of VRAM comfortably fits the ~140 GB model across 8 GPUs with TP=8 enabled (see the sizing sketch after this list).
  3. Network Bandwidth: 1,800 vs 3,600 Gbps - twice the bandwidth. This is critical when the model requires frequent communication with other resources (such as multi-host serving, large datasets, or storage access).
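
As a rough back-of-the-envelope check (assumptions of this sketch: bf16 weights at 2 bytes per parameter, a plain 8-way split, and 80 GB per H100; activation and framework overhead are ignored), the weights alone take about 140 GB, leaving roughly 62 GB per GPU for KV cache at TP=8:

# Back-of-the-envelope VRAM check for Llama 3.1-70B on an 8-GPU A3 Mega VM
awk 'BEGIN {
  params = 70e9; bytes_per_param = 2; gpus = 8; vram_per_gpu_gb = 80
  weights_gb = params * bytes_per_param / 1e9   # ~140 GB of weights in bf16
  per_gpu_gb = weights_gb / gpus                # ~17.5 GB per GPU with TP=8
  printf "weights: %.0f GB total, %.1f GB/GPU, ~%.1f GB/GPU left for KV cache\n", weights_gb, per_gpu_gb, vram_per_gpu_gb - per_gpu_gb
}'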

Considering the scenario of running Llama 3.1-70B with a 1,000-5,000 token context length on single-host inference, with no significant storage or data access, A3 Mega (H100) is the better choice, offering better price/performance.

Running Inference Benchmarking

Running inference benchmarking is crucial for validating that a GPU's performance meets the requirements of a specific business use case for large language models (LLMs), leading to improved efficiency, cost savings, and performance.

For real-time inference scenarios, latency is paramount. Here are the requirements.

  • Latency: < 2 seconds
  • Max Concurrent Requests: 1000
  • Input Context Length: 1000, Output: 100 (Total context = 1100)

Let’s delve into the steps involved in fine-tuning the vLLM inference engine’s configuration to achieve optimal performance.

Here is a high-level architecture diagram of the benchmarking setup: A3 Mega (H100) GPUs with the vLLM inference engine running the Llama 3.1-70B model.

Steps to set up vLLM and run a benchmark

Create the VMs

First, check your project's quota to ensure H100s are available in the specific region where the benchmark will run. The gcloud instances create command further below leverages DWS (Dynamic Workload Scheduler) Flex mode to obtain GPUs for short-term benchmarking.
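
As a quick sanity check, the region's GPU quota entries can be listed like this (a rough sketch; quota metric names and output format vary by project and gcloud version, so adjust the grep pattern as needed):

# List GPU-related quota entries for the target region
gcloud compute regions describe us-east4 | grep -i -B 1 -A 1 h100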

gcloud alpha compute instances create demo-h100-a3 \
    --zone=us-east4-a \
    --instance-termination-action=DELETE \
    --machine-type=a3-megagpu-8g \
    --maintenance-policy=TERMINATE \
    --provisioning-model=flex-start \
    --max-run-duration=4d \
    --create-disk=name=demo-h100-a3-pd,size=1TB,type=pd-ssd \
    --image=projects/ubuntu-os-accelerator-images/global/images/ubuntu-accelerator-2204-amd64-with-nvidia-570-v20250606 \
    --reservation-affinity=none

Output:

Created [https://www.googleapis.com/compute/v1/projects/gpu-launchpad-playground/zones/us-central1-b/instances/demo-h100-a3].
NAME: demo-h100-a3
ZONE: us-east4-a
MACHINE_TYPE: a3-megagpu-8g
PREEMPTIBLE:
INTERNAL_IP: 10.150.0.106
EXTERNAL_IP: 35.236.200.165
STATUS: RUNNING

SSH into the VM

gcloud compute ssh --zone "us-east4-a" "demo-h100-a3" --project <project-id>

Spin up vLLM Container / Deploy Model

export DOCKER_URI=vllm/vllm-openai:latest
sudo nvidia-docker run --rm -it --name $USER-vllm \
    --shm-size 10gb \
    -e NVIDIA_VISIBLE_DEVICES=all \
    -p 8000:8000 \
    --entrypoint /bin/bash \
    ${DOCKER_URI}
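
Once inside the container, it can help to confirm that all eight GPUs are visible before launching the server (an optional sanity check, not part of the original flow):

# Expect 8 x H100 with ~80 GB of memory each
nvidia-smi --query-gpu=index,name,memory.total --format=csv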

Optimizing vLLM parameters for Low-Latency Inference

To meet the target requirements for low-latency, the performance of the vLLM server must be precisely tuned. The following command-line arguments are critical levers for optimizing latency by controlling token batching, the number of concurrent sequences, and the request rate in the inference benchmarking process.

  1. max-num-batched-tokens: Maximum number of tokens processed in a single iteration.
  2. max-num-seqs: Maximum number of sequences processed in a single iteration. (Lowering this value can reduce overall throughput, but it is a key lever for improving latency through better memory utilization.)
  3. request-rate (vllm bench serve): Number of requests per second; this limits how many requests are sent at a time.

Refer to the vLLM documentation for more options.

For optimal latency reduction, try adjusting the max-num-seqs parameter across several values (e.g., 512, 256, 128) and note down the end-to-end latency for each.

Commands to Start vLLM for Latency Optimization

export TP=8
export HF_TOKEN=<HF_TOKEN>

VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct --seed 42 --disable-log-requests --tensor-parallel-size $TP --enable-prefix-caching --max-model-len=2048 --max-num-seqs=512 --max-num-batched-tokens=2048 &> serve.log &

export MAX_INPUT_LEN=1000
export MAX_OUTPUT_LEN=100
export MAX_PREFIX_LEN=500

vllm bench serve \
  --backend vllm \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --dataset-name random \
  --num-prompts 1000 \
  --random-prefix-len="$MAX_PREFIX_LEN" \
  --random-input-len="$MAX_INPUT_LEN" \
  --random-output-len="$MAX_OUTPUT_LEN" \
  --request-rate 5

Output from the benchmark results

============ Serving Benchmark Result ============
Successful requests:                     1000      
Request rate configured (RPS):           5.00      
Benchmark duration (s):                  201.22    
Total input tokens:                      1498021   
Total generated tokens:                  79901     
Request throughput (req/s):              4.97      
Output token throughput (tok/s):         397.09    
Total Token throughput (tok/s):          7841.92   
---------------Time to First Token----------------
Mean TTFT (ms):                          30.88     
Median TTFT (ms):                        30.57     
P99 TTFT (ms):                           40.34     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.94     
Median TPOT (ms):                        12.84     
P99 TPOT (ms):                           14.11     
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.92     
Median ITL (ms):                         12.77     
P99 ITL (ms):                            16.48   

Finding End-to-End Latency

Mean End-to-End Latency = TTFT + TPOT * (Average Output Tokens - 1)

Average Output Tokens = Total Generated Tokens / Successful Requests
                      = 79901 / 1000
                      = 79.901

e2e = 30.88 ms + (12.94 ms * (79.901 - 1))
    ≈ 1051.9 ms (≈ 1.05 sec)
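
The same arithmetic can be scripted so it is easy to rerun against each benchmark log; the numbers below are plugged in from the output above (this helper is a sketch, not part of vllm bench serve):

# Derive mean end-to-end latency from TTFT, TPOT, and token counts
awk 'BEGIN {
  ttft_ms = 30.88; tpot_ms = 12.94
  total_gen_tokens = 79901; requests = 1000
  avg_out = total_gen_tokens / requests          # 79.901
  e2e_ms = ttft_ms + tpot_ms * (avg_out - 1)
  printf "avg output tokens: %.3f, mean e2e latency: %.2f ms (%.2f s)\n", avg_out, e2e_ms, e2e_ms / 1000
}'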

Repeat the above steps with different max-num-seqs values and note down the results, as sketched below.
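
A minimal sketch of that sweep, assuming the serve and bench commands shown earlier (server readiness checks and teardown are left as comments):

# Sweep max-num-seqs and benchmark each configuration
for SEQS in 512 256 128; do
  VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct --seed 42 \
    --disable-log-requests --tensor-parallel-size $TP --enable-prefix-caching \
    --max-model-len=2048 --max-num-seqs=$SEQS --max-num-batched-tokens=2048 &> serve-$SEQS.log &
  # ...wait for the server to report readiness, run the vllm bench serve
  # command from above, record TTFT/TPOT, then shut the server down...
done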

Results for latency-sensitive inference with varying max-num-seqs:

| max-num-seqs | max-num-batched-tokens | req-rate | TTFT | TPOT | Avg output tokens | e2e |
|---|---|---|---|---|---|---|
| 512 | 2048 | 5 | 17.71 ms | 16.6 ms | 79.622 | 1.48 sec |
| 256 | 2048 | 5 | 29.67 ms | 12.70 ms | 79.546 | 1.06 sec |
| 128 | 2048 | 5 | 30.86 ms | 12.97 ms | 80.292 | 1.04 sec |

From the results, it is evident that the lower the max-num-seqs value, the lower the end-to-end latency.

Finding the number of Chips for 1000 concurrent requests

Once we know the optimal accelerator and the number of concurrent requests a single VM (8 GPUs) can handle, the next step is to find how many chips are needed for 1000 concurrent requests. On an A3 Mega VM with 8 GPUs, run the benchmarks with varying request-rate (vllm bench serve) values between 5 and 10, following the steps in the previous section, and note down the results.

| A3 Mega VMs | H100 GPUs | Request rate (req/s) | Context length | Latency | # of chips for 1000 req/s |
|---|---|---|---|---|---|
| 1 | 8 | 10 | 1100 | 1.15 sec | 100 * 8 = 800 |
| 1 | 8 | 5 | 1100 | 1.04 sec | 200 * 8 = 1600 |

From the results, a total of 200 A3 Mega VMs (1600 H100 chips) would be required to achieve latency close to 1 second at 1000 requests per second. If slightly higher latency is acceptable, half the chips (100 A3 Mega VMs, 800 H100s) can handle the same load. Optimally, having 100 reserved A3 nodes and scaling up based on demand or with DWS would be more cost effective and reliable.
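
The sizing arithmetic behind those figures can be expressed as a small helper (a sketch; the per-VM request rate is taken from the benchmark above, with 8 H100s per A3 Mega VM):

# VMs and chips needed to sustain a target aggregate request rate
awk 'BEGIN {
  target_rps = 1000; per_vm_rps = 5; gpus_per_vm = 8
  vms = target_rps / per_vm_rps
  printf "VMs: %d, H100 chips: %d\n", vms, vms * gpus_per_vm   # 200 VMs, 1600 chips
}'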

Fine-Tuning vLLM Parameters for High-Throughput Batch Inference

Similarly, for high-throughput batch use cases, the max-num-batched-tokens parameter can be tuned across values such as 1024, 2048, and 4096 to find the optimal setting for the required throughput. Typically, the higher the value, the better the throughput and GPU utilization, which is optimal for batch inference scenarios.
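
For illustration, a throughput-oriented serve invocation might look like the sketch below (the values are assumptions to sweep, not recommendations; only --max-num-batched-tokens changes relative to the latency-tuned command above):

# Throughput-oriented configuration; compare "Total Token throughput"
# reported by vllm bench serve for --max-num-batched-tokens 1024/2048/4096
VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct --seed 42 \
  --disable-log-requests --tensor-parallel-size $TP --enable-prefix-caching \
  --max-model-len=2048 --max-num-seqs=512 --max-num-batched-tokens=4096 &> serve-batch.log &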

Wrap-up

Broadly, there are two types of inference scenarios: real-time latency-sensitive and high-throughput batch. This article covered the key considerations for LLM inference and running benchmarks focused on latency-sensitive real-time scenarios, and explained the steps involved in selecting the right accelerator, benchmarking with vLLM, and tuning its parameters for low-latency, cost-optimized inference.
