Accelerating model loading from object storage into GPU memory for vLLM inference is dramatically simpler with the Run:ai Model Streamer. The streamer reads multiple files in parallel from a Cloud Storage bucket into CPU memory, then loads them into GPU memory. To use the model streamer from a Google Cloud server, all that's needed is passing in the appropriate authentication method and inserting the flag --load-format=runai_streamer, and, if tensor-parallel size is set greater than 1, --model-loader-extra-config='{"distributed":true}' to enable distributed streaming mode.
The following steps will guide you through installing the latest version of vLLM and deploying a model that’s loaded from Google Cloud Storage with the model streamer:
Stream a Model from Google Cloud Storage with Run:ai Model Streamer
To use the model streamer install vLLM with the Run:ai extension
pip3 install "vllm[runai]"
Next, you will need a model in Google Cloud Storage. If needed, you can use the script below to transfer a model from a Hugging Face repo to a bucket you have access to:
Transfer a Model From Hugging Face to Google Cloud Storage (Optional)
- You will need to make sure the CLI you run the script from is logged in to Google Cloud with gcloud init and to Hugging Face with hf auth login. You may also need to install these libraries:
pip3 install google-cloud-storage
pip3 install huggingface_hub
- Replace the variables at the top of the script below with your own values and save it as hf-2-gc.py
import os
from huggingface_hub import snapshot_download
from google.cloud import storage

# Change the values below to match your Google Cloud and Hugging Face model locations
repo_id = "google/gemma-3-4b-it"
local_dir = "/tmp"
gcs_bucket_name = ""
gcs_prefix = "gemma-3-4b-it"

def download_model_then_upload_model():
    # Ensure you're logged in to Hugging Face:
    # run hf auth login first or set HF_TOKEN

    # Download model
    print(f"Downloading model {repo_id}...")
    snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir,
        max_workers=10,  # Parallel downloads
    )
    upload_model()

def upload_model():
    # Upload model to GCS: initialize the Google Cloud Storage client
    storage_client = storage.Client()
    bucket = storage_client.bucket(gcs_bucket_name)
    print(f"Uploading to GCS bucket {gcs_bucket_name}...")
    for root, _, files in os.walk(local_dir):
        for file in files:
            local_path = os.path.join(root, file)
            relative_path = os.path.relpath(local_path, local_dir)
            blob_path = os.path.join(gcs_prefix, relative_path)
            blob = bucket.blob(blob_path)
            blob.upload_from_filename(local_path)
            print(f"Uploaded: {local_path} to {blob_path}")
    print("Model download and upload complete!")

download_model_then_upload_model()
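The upload loop above maps each local file path to a GCS object name by joining the bucket prefix with the file's path relative to local_dir. A minimal standalone sketch of that mapping logic (local_to_blob_path is a hypothetical helper name, not part of the script above):

```python
import os

def local_to_blob_path(local_path: str, local_dir: str, gcs_prefix: str) -> str:
    """Mirror the script's mapping from a local file to its GCS object name."""
    relative_path = os.path.relpath(local_path, local_dir)
    return os.path.join(gcs_prefix, relative_path)

# Files under /tmp land under the gemma-3-4b-it/ prefix in the bucket,
# with any subdirectory structure preserved.
print(local_to_blob_path("/tmp/config.json", "/tmp", "gemma-3-4b-it"))
# → gemma-3-4b-it/config.json
print(local_to_blob_path("/tmp/sub/model-00001.safetensors", "/tmp", "gemma-3-4b-it"))
# → gemma-3-4b-it/sub/model-00001.safetensors
```

Because the relative path is preserved, the bucket layout matches the local snapshot layout, which is what the model streamer expects when it lists the prefix.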
Stream Your Model in the CLI
- Run the command below to start a vLLM inference server that loads a model from Google Cloud Storage with the Run:ai Model Streamer:
vllm serve gs://models-usc/gemma-3-4b-it --load-format=runai_streamer
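Once the server is up (it listens on port 8000 by default), you can query it through vLLM's OpenAI-compatible API. A quick sketch that builds a chat-completion request for the streamed model; the host and model name match the serve command above and would change with your setup:

```python
import json
import urllib.request

# The model name is the same gs:// path passed to `vllm serve`.
payload = {
    "model": "gs://models-usc/gemma-3-4b-it",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment to send the request once the server is running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```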
Stream Your Model in GKE
- If using GKE, you will need to enable Workload Identity on the cluster.
- Next, you will need to create a Kubernetes service account:
export SERVICE_ACCOUNT="[service account name]"
kubectl create serviceaccount $SERVICE_ACCOUNT
- Grant the bucketViewer and objectUser IAM roles to this service account:
export BUCKET="[bucket name]"
export PROJECT_NUMBER="[project number]"
export PROJECT_ID="[project ID]"
gcloud storage buckets add-iam-policy-binding gs://$BUCKET --member principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/default/sa/$SERVICE_ACCOUNT --role roles/storage.bucketViewer
gcloud storage buckets add-iam-policy-binding gs://$BUCKET --member principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/default/sa/$SERVICE_ACCOUNT --role roles/storage.objectUser
- You can use the example deployment spec below as a guide to launch the model streamer in vLLM on GKE using Workload Identity. You will need to replace the serviceAccountName value (gcs-access in the example) with the value of your SERVICE_ACCOUNT variable. Note that if using a tensor-parallel-size greater than 1 you will want to add the flag --model-loader-extra-config='{"distributed":true}' to enable distributed streaming:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-streamer
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-streamer
  template:
    metadata:
      labels:
        app: vllm-streamer
    spec:
      serviceAccountName: gcs-access
      containers:
      - name: inference-server
        image: vllm/vllm-openai:nightly
        command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        args:
        - --model=gs://models-usc/gemma-3-4b-it
        - --load-format=runai_streamer
        - --disable-log-requests
        - --max-num-batched-tokens=512
        - --max-num-seqs=128
        - --max-model-len=2048
        - --tensor-parallel-size=1
        ports:
        - containerPort: 8000
          name: metrics
        readinessProbe:
          failureThreshold: 600
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 10
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      volumes:
      - emptyDir:
          medium: Memory
        name: dshm
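To reach the server from inside the cluster (or locally via kubectl port-forward), you can put a Service in front of the deployment. A minimal sketch, assuming the labels and port from the deployment above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-streamer
  namespace: default
spec:
  selector:
    app: vllm-streamer
  ports:
  - name: http
    port: 8000
    targetPort: 8000
```

With the Service in place, kubectl port-forward service/vllm-streamer 8000:8000 exposes the OpenAI-compatible API on localhost.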
Further Information