As AI compute resources become increasingly constrained and expensive, efficiently serving powerful open-weights models like Google’s Gemma is a critical operational challenge. By pairing vLLM’s high-throughput serving engine with Dynamic Resource Allocation (DRA) and Custom Compute Classes, engineering teams can maximize utilization of their compute infrastructure and seamlessly scale workloads to meet real-time demand without relying on costly over-provisioning.
This guide shows how to run Gemma on vLLM using DRA and Custom Compute Classes.
Environment setup
Ensure your Google Cloud environment is ready. All steps in this walkthrough are tested in Google Cloud Shell. Cloud Shell has the Google Cloud CLI, kubectl, and Helm pre-installed.
1. Google Cloud project
Have a project with billing enabled.
export PROJECT_ID="your-project-id"
gcloud config set project $PROJECT_ID
2. Google Cloud CLI
Ensure gcloud is installed and updated. Run gcloud init if needed.
3. kubectl
Install kubectl: gcloud components install kubectl
4. Helm
Install Helm (Installation guide).
5. Enable APIs
Activate necessary Google Cloud services.
gcloud services enable \
container.googleapis.com \
compute.googleapis.com \
networkservices.googleapis.com \
monitoring.googleapis.com \
logging.googleapis.com \
--project=$PROJECT_ID
6. Configure permissions (IAM)
Grant required roles.
export USER_EMAIL=$(gcloud config get-value account)
gcloud projects add-iam-policy-binding $PROJECT_ID --member="user:${USER_EMAIL}" --role="roles/container.admin" --condition=None
7. Set region
Choose a region for the cluster and a zone with the GPUs you need for the node pool.
export LOCATION="us-central1" # Example region
gcloud config set compute/region $LOCATION
export NODEPOOL_LOCATION="us-central1-c" # Zone with L4 capacity
8. Hugging Face token
Obtain a Hugging Face access token (read permission minimum). If using Gemma models, accept the license terms on the Hugging Face model page.
export HF_TOKEN="your-huggingface-token"
Create a GKE cluster
We need a cluster for running vLLM, with a dedicated node pool for the GPU nodes.
1. Create the cluster
We will use a Standard cluster with autoprovisioning enabled.
export CLUSTER_NAME=vllm-gpu && \
export CLUSTER_VERSION=1.35.1-gke.1616000
gcloud container clusters create ${CLUSTER_NAME} \
--location=${LOCATION} \
--num-nodes=1 \
--cluster-version=${CLUSTER_VERSION} \
--no-enable-autoupgrade \
--enable-autoprovisioning \
--min-cpu 0 --max-cpu 1000 \
--min-memory 0 --max-memory 4000
2. Create a node pool with GPUs
We start with only a single node, but allow the cluster autoscaler to add another one.
gcloud container node-pools create gpu-pool \
--cluster=${CLUSTER_NAME} \
--location=${LOCATION} \
--node-locations=${NODEPOOL_LOCATION} \
--machine-type="g2-standard-12" \
--accelerator="type=nvidia-l4,count=1,gpu-driver-version=disabled" \
--enable-autoscaling \
--total-min-nodes=1 \
--total-max-nodes=2 \
--num-nodes=1 \
--node-labels=gke-no-default-nvidia-gpu-device-plugin=true,nvidia.com/gpu.present=true,cloud.google.com/compute-class=vllm-gpu-ccc,cloud.google.com/gke-nvidia-gpu-dra-driver=true \
--spot \
--node-taints="cloud.google.com/compute-class=vllm-gpu-ccc:NoSchedule"
Install the NVIDIA GPU driver and the GPU DRA driver
We need to install both the GPU driver for the OS and the GPU DRA driver.
1. Install the GPU driver
This is the same driver that is used with the Device Plugin.
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
2. Install the NVIDIA GPU DRA driver
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
--version="25.8.0" --create-namespace --namespace=nvidia-dra-driver-gpu \
--set nvidiaDriverRoot="/home/kubernetes/bin/nvidia/" \
--set gpuResourcesEnabledOverride=true \
--set resources.computeDomains.enabled=false \
--set kubeletPlugin.priorityClassName="" \
--set 'kubeletPlugin.tolerations[0].operator=Exists'
Run Gemma on vLLM
1. Create secret with Hugging Face token
kubectl create secret generic hf-secret \
--from-literal=hf_api_token=${HF_TOKEN}
2. Define the Custom Compute Class
The Custom Compute Class configures the cluster autoscaler. In this example we only reference a single node pool, but it is possible to reference other pools with different types of GPUs. These will then be used if the cluster autoscaler is unable to add additional nodes with L4 GPUs.
Save as ccc.yaml.
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
name: vllm-gpu-ccc
spec:
autoscalingPolicy:
consolidationDelayMinutes: 3
priorities:
- nodepools: ["gpu-pool"]
Apply the manifest to the cluster.
kubectl apply -f ccc.yaml
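To illustrate the fallback behavior described above, the priorities list can reference additional node pools that the autoscaler tries in order. A sketch, where gpu-pool-a100 is a hypothetical second pool and not part of this walkthrough:

```yaml
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: vllm-gpu-ccc
spec:
  autoscalingPolicy:
    consolidationDelayMinutes: 3
  priorities:
  # Preferred: the L4 spot pool created earlier.
  - nodepools: ["gpu-pool"]
  # Fallback, tried only if gpu-pool cannot scale up (hypothetical pool name).
  - nodepools: ["gpu-pool-a100"]
```

With this in place, stockouts of one GPU type no longer block scale-up as long as a lower-priority pool has capacity.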
3. Define the vLLM workload
Save as vllm.yaml.
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-gpu
spec:
replicas: 1
selector:
matchLabels:
app: vllm-gpu
template:
metadata:
labels:
app: vllm-gpu
spec:
nodeSelector:
cloud.google.com/compute-class: vllm-gpu-ccc
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
resourceClaims:
- name: gpu
resourceClaimTemplateName: gpu-claim-template
containers:
- name: vllm-gpu
image: vllm/vllm-openai:latest
command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
args:
- --host=0.0.0.0
- --port=8000
- --model=google/gemma-3-1b-it
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-secret
key: hf_api_token
ports:
- containerPort: 8000
resources:
claims:
- name: gpu
readinessProbe:
tcpSocket:
port: 8000
initialDelaySeconds: 15
periodSeconds: 10
volumeMounts:
- name: dshm
mountPath: /dev/shm
volumes:
- name: dshm
emptyDir:
medium: Memory
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
name: gpu-claim-template
spec:
spec:
devices:
requests:
- name: gpu
exactly:
deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Service
metadata:
name: vllm-service
spec:
selector:
app: vllm-gpu
type: LoadBalancer
ports:
- name: http
protocol: TCP
port: 8000
targetPort: 8000
Apply the manifest to the cluster.
kubectl apply -f vllm.yaml
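The ResourceClaimTemplate above accepts any device from the gpu.nvidia.com class. DRA also supports narrowing a request with CEL selectors over device attributes; a sketch, assuming the productName attribute exposed by the NVIDIA DRA driver:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: l4-claim-template
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          # Only match devices whose product name contains "L4"
          # (attribute name assumed from the NVIDIA DRA driver).
          selectors:
          - cel:
              expression: device.attributes["gpu.nvidia.com"].productName.contains("L4")
```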
4. Wait for the vLLM pods to be running and Ready
kubectl get pods -l app=vllm-gpu -w
Verify the deployment
Now that vLLM is running, let's send a request and make sure it is working.
1. Get the external IP of the load balancer
export vllm_service=$(kubectl get service vllm-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
2. Send a request
curl http://$vllm_service:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-3-1b-it",
"prompt": "Write a story about san francisco",
"max_tokens": 100,
"temperature": 0
}'
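The same request can be sent from Python using only the standard library. This is a minimal sketch: the EXTERNAL_IP placeholder and the build_request helper are ours, not part of vLLM, and the server simply exposes the standard OpenAI-compatible completions API.

```python
import json
import urllib.request

# Replace EXTERNAL_IP with the load balancer IP from the previous step.
VLLM_URL = "http://EXTERNAL_IP:8000/v1/completions"

payload = {
    "model": "google/gemma-3-1b-it",
    "prompt": "Write a story about san francisco",
    "max_tokens": 100,
    "temperature": 0,
}

def build_request(url: str, body: dict) -> urllib.request.Request:
    """Build the POST request that vLLM's OpenAI-compatible server expects."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request(VLLM_URL, payload)
# urllib.request.urlopen(req) returns an OpenAI-style JSON body;
# the generated text is at choices[0].text.
```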
Scale up to 2 replicas
We want to add a second replica. In production scenarios, this might be handled by the Horizontal Pod Autoscaler, but for this example we will just change the Deployment directly.
1. Change the desired number of replicas
Update the spec.replicas field value to 2.
kubectl edit Deployment vllm-gpu
2. Wait for the second vLLM pod to run
This might take some time, since the cluster autoscaler first has to add another node to the gpu-pool node pool before the pod can be scheduled.
kubectl get pods -l app=vllm-gpu -w
This article shows how to set up the Horizontal Pod Autoscaler to scale based on metrics from vLLM. Remember to update the label selectors when adapting it to this example.
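Adapted to this example, an HPA could look roughly like the following sketch. The metric name vllm:num_requests_waiting is one of vLLM's Prometheus metrics, but surfacing it to the HPA requires a custom metrics adapter (for example, the Custom Metrics Stackdriver Adapter), which this guide does not set up; treat the metric plumbing as an assumption.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-gpu
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-gpu       # matches the Deployment created above
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm:num_requests_waiting   # requires a custom metrics adapter
      target:
        type: AverageValue
        averageValue: "5"
```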
Clean up
Delete the cluster to avoid any further charges.
gcloud container clusters delete ${CLUSTER_NAME} --location=${LOCATION}