As AI compute resources become increasingly constrained and expensive, efficiently serving powerful open-weights models like Google’s Gemma is a critical operational challenge. By pairing vLLM’s high-throughput serving engine with Dynamic Resource Allocation (DRA) and Custom Compute Classes, engineering teams can maximize utilization of their compute infrastructure and seamlessly scale workloads to meet real-time demand without relying on costly over-provisioning.
This guide shows how to run Gemma on vLLM using DRA and Custom Compute Classes.
Environment setup
Ensure your Google Cloud environment is ready. All steps in this walkthrough are tested in Google Cloud Shell. Cloud Shell has the Google Cloud CLI, kubectl, and Helm pre-installed.
1. Google Cloud project
Have a project with billing enabled.
export PROJECT_ID="your-project-id"
gcloud config set project $PROJECT_ID
2. Google Cloud CLI
Ensure gcloud is installed and updated. Run gcloud init if needed.
3. kubectl
Install kubectl: gcloud components install kubectl
4. Helm
Install Helm (Installation guide).
5. Enable APIs
Activate necessary Google Cloud services.
gcloud services enable \
container.googleapis.com \
compute.googleapis.com \
networkservices.googleapis.com \
monitoring.googleapis.com \
logging.googleapis.com \
--project=$PROJECT_ID
6. Configure permissions (IAM)
Grant required roles.
export USER_EMAIL=$(gcloud config get-value account)
gcloud projects add-iam-policy-binding $PROJECT_ID --member="user:${USER_EMAIL}" --role="roles/container.admin" --condition=None
7. Set region
Choose a region for the cluster and a zone with the GPUs you need for the node pool.
export LOCATION="us-central1" # Example region
gcloud config set compute/region $LOCATION
export NODEPOOL_LOCATION="us-central1-c" # Zone with L4 capacity
8. Hugging Face token
Obtain a Hugging Face access token (read permission minimum). If using Gemma models, accept the license terms on the Hugging Face model page.
export HF_TOKEN="your-huggingface-token"
Create a GKE cluster
We need a cluster for running vLLM, with a dedicated node pool for the GPU nodes.
1. Create the cluster
We will use a Standard cluster with autoprovisioning enabled.
export CLUSTER_NAME=vllm-gpu && \
export CLUSTER_VERSION=1.35.1-gke.1616000
gcloud container clusters create ${CLUSTER_NAME} \
--location=${LOCATION} \
--num-nodes=1 \
--cluster-version=${CLUSTER_VERSION} \
--no-enable-autoupgrade \
--enable-autoprovisioning \
--min-cpu 0 --max-cpu 1000 \
--min-memory 0 --max-memory 4000
2. Create a node pool with GPUs
We start with only a single node, but allow the cluster autoscaler to add another one.
gcloud container node-pools create gpu-pool \
--cluster=${CLUSTER_NAME} \
--location=${LOCATION} \
--node-locations=${NODEPOOL_LOCATION} \
--machine-type="g2-standard-12" \
--accelerator="type=nvidia-l4,count=1,gpu-driver-version=disabled" \
--enable-autoscaling \
--total-min-nodes=1 \
--total-max-nodes=2 \
--num-nodes=1 \
--node-labels=gke-no-default-nvidia-gpu-device-plugin=true,nvidia.com/gpu.present=true,cloud.google.com/compute-class=vllm-gpu-ccc,cloud.google.com/gke-nvidia-gpu-dra-driver=true \
--spot \
--node-taints="cloud.google.com/compute-class=vllm-gpu-ccc:NoSchedule"
Install the NVIDIA GPU driver and the GPU DRA driver
We need to install both the GPU driver for the OS and the GPU DRA driver.
1. Install the GPU driver
This is the same driver that is used with the Device Plugin.
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
2. Install the NVIDIA GPU DRA driver
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
--version="25.8.0" --create-namespace --namespace=nvidia-dra-driver-gpu \
--set nvidiaDriverRoot="/home/kubernetes/bin/nvidia/" \
--set gpuResourcesEnabledOverride=true \
--set resources.computeDomains.enabled=false \
--set kubeletPlugin.priorityClassName="" \
--set 'kubeletPlugin.tolerations[0].operator=Exists'
Run Gemma on vLLM
1. Create secret with Hugging Face token
kubectl create secret generic hf-secret \
--from-literal=hf_api_token=${HF_TOKEN}
2. Define the Custom Compute Class
The Custom Compute Class configures the cluster autoscaler. In this example we only reference a single node pool, but it is possible to reference other pools with different types of GPUs. These will then be used if the cluster autoscaler is unable to add additional nodes with L4 GPUs.
Save as ccc.yaml.
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
name: vllm-gpu-ccc
spec:
autoscalingPolicy:
consolidationDelayMinutes: 3
priorities:
- nodepools: ["gpu-pool"]
Apply the manifest to the cluster.
kubectl apply -f ccc.yaml
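To illustrate the fallback behavior described above, the priorities list can reference additional node pools that the autoscaler tries in order. A sketch, where gpu-pool-a100 is a hypothetical second pool and not part of this walkthrough:

```yaml
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: vllm-gpu-ccc
spec:
  autoscalingPolicy:
    consolidationDelayMinutes: 3
  priorities:
  # Preferred: the L4 spot pool created earlier.
  - nodepools: ["gpu-pool"]
  # Fallback, tried only if gpu-pool cannot scale up (hypothetical pool name).
  - nodepools: ["gpu-pool-a100"]
```

With this in place, stockouts of one GPU type no longer block scale-up as long as a lower-priority pool has capacity.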
3. Define the vLLM workload
Save as vllm.yaml.
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-gpu
spec:
replicas: 1
selector:
matchLabels:
app: vllm-gpu
template:
metadata:
labels:
app: vllm-gpu
spec:
nodeSelector:
cloud.google.com/compute-class: vllm-gpu-ccc
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
resourceClaims:
- name: gpu
resourceClaimTemplateName: gpu-claim-template
containers:
- name: vllm-gpu
image: vllm/vllm-openai:latest
command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
args:
- --host=0.0.0.0
- --port=8000
- --model=google/gemma-3-1b-it
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-secret
key: hf_api_token
ports:
- containerPort: 8000
resources:
claims:
- name: gpu
readinessProbe:
tcpSocket:
port: 8000
initialDelaySeconds: 15
periodSeconds: 10
volumeMounts:
- name: dshm
mountPath: /dev/shm
volumes:
- name: dshm
emptyDir:
medium: Memory
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
name: gpu-claim-template
spec:
spec:
devices:
requests:
- name: gpu
exactly:
deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Service
metadata:
name: vllm-service
spec:
selector:
app: vllm-gpu
type: LoadBalancer
ports:
- name: http
protocol: TCP
port: 8000
targetPort: 8000
Apply the manifest to the cluster.
kubectl apply -f vllm.yaml
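The ResourceClaimTemplate above accepts any device from the gpu.nvidia.com class. DRA also supports narrowing a request with CEL selectors over device attributes; a sketch, assuming the productName attribute exposed by the NVIDIA DRA driver:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: l4-claim-template
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          # Only match devices whose product name contains "L4"
          # (attribute name assumed from the NVIDIA DRA driver).
          selectors:
          - cel:
              expression: device.attributes["gpu.nvidia.com"].productName.contains("L4")
```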
4. Wait for the vLLM pods to be running and Ready
kubectl get pods -l app=vllm-gpu -w
Verify the deployment
Now that vLLM is running, let's send a request and make sure it is working.
1. Get the external IP of the load balancer
export vllm_service=$(kubectl get service vllm-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
2. Send a request
curl http://$vllm_service:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-3-1b-it",
"prompt": "Write a story about san francisco",
"max_tokens": 100,
"temperature": 0
}'
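The same request can be sent from Python using only the standard library. This is a minimal sketch: the EXTERNAL_IP placeholder and the build_request helper are ours, not part of vLLM, and the server simply exposes the standard OpenAI-compatible completions API.

```python
import json
import urllib.request

# Replace EXTERNAL_IP with the load balancer IP from the previous step.
VLLM_URL = "http://EXTERNAL_IP:8000/v1/completions"

payload = {
    "model": "google/gemma-3-1b-it",
    "prompt": "Write a story about san francisco",
    "max_tokens": 100,
    "temperature": 0,
}

def build_request(url: str, body: dict) -> urllib.request.Request:
    """Build the POST request that vLLM's OpenAI-compatible server expects."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request(VLLM_URL, payload)
# urllib.request.urlopen(req) returns an OpenAI-style JSON body;
# the generated text is at choices[0].text.
```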
Scale up to 2 replicas
We want to add a second replica. In production scenarios, this might be handled by the Horizontal Pod Autoscaler, but for this example we will just change the Deployment directly.
1. Change the desired number of replicas
Update the spec.replicas field value to 2.
kubectl edit Deployment vllm-gpu
2. Wait for the second vLLM pod to run
This might take some time, since the cluster autoscaler first has to add another node to the gpu-pool node pool before the pod can be scheduled.
kubectl get pods -l app=vllm-gpu -w
This article shows how to set up the Horizontal Pod Autoscaler to scale based on metrics from vLLM. Remember to update the label selectors when adapting it to this example.
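Adapted to this example, an HPA could look roughly like the following sketch. The metric name vllm:num_requests_waiting is one of vLLM's Prometheus metrics, but surfacing it to the HPA requires a custom metrics adapter (for example, the Custom Metrics Stackdriver Adapter), which this guide does not set up; treat the metric plumbing as an assumption.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-gpu
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-gpu       # matches the Deployment created above
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm:num_requests_waiting   # requires a custom metrics adapter
      target:
        type: AverageValue
        averageValue: "5"
```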
Clean up
Delete the cluster to avoid any further charges.
gcloud container clusters delete ${CLUSTER_NAME} --location=${LOCATION}