Accelerating model loading from object storage into GPU memory for vLLM inference is dramatically simpler with the Run:ai Model Streamer. The streamer reads multiple files in parallel from a Cloud Storage bucket into CPU memory, then loads them into GPU memory. To use the model streamer from a Google Cloud server, all that's needed is passing in the appropriate authentication method and inserting the flag --load-format=runai_streamer, and, if tensor-parallel size is set greater than 1, --model-loader-extra-config='{"distributed":true}' to enable distributed streaming mode.
The following steps will guide you through installing the latest version of vLLM and deploying a model that’s loaded from Google Cloud Storage with the model streamer:
Stream a Model from Google Cloud Storage with Run:ai Model Streamer
To use the model streamer install vLLM with the Run:ai extension
pip3 install "vllm[runai]"
Next, you will need a model in Google Cloud Storage. If needed, you can use the script below to transfer a model from a Hugging Face repo to a bucket you have access to:
Transfer a Model From Hugging Face to Google Cloud Storage (Optional)
- You will need to make sure the CLI you run the script from is logged in to Google Cloud with gcloud init and to Hugging Face with hf auth login. You may also need to install these libraries:
pip3 install google-cloud-storage
pip3 install huggingface_hub
- Replace the variables at the top of the script below with your own values and save it as hf-2-gc.py
import os
from huggingface_hub import snapshot_download
from google.cloud import storage

# Change the values below to match your Google Cloud and Hugging Face model locations
repo_id = "google/gemma-3-4b-it"
local_dir = "/tmp"
gcs_bucket_name = ""
gcs_prefix = "gemma-3-4b-it"

def download_model_then_upload_model():
    # Ensure you're logged in to Hugging Face:
    # run hf auth login first or set HF_TOKEN

    # Download model
    print(f"Downloading model {repo_id}...")
    snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir,
        max_workers=10,  # Parallel downloads
    )
    upload_model()

def upload_model():
    # Upload model to GCS: initialize the Google Cloud Storage client
    storage_client = storage.Client()
    bucket = storage_client.bucket(gcs_bucket_name)
    print(f"Uploading to GCS bucket {gcs_bucket_name}...")
    for root, _, files in os.walk(local_dir):
        for file in files:
            local_path = os.path.join(root, file)
            relative_path = os.path.relpath(local_path, local_dir)
            blob_path = os.path.join(gcs_prefix, relative_path)
            blob = bucket.blob(blob_path)
            blob.upload_from_filename(local_path)
            print(f"Uploaded: {local_path} to {blob_path}")
    print("Model download and upload complete!")

download_model_then_upload_model()
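The upload loop above maps each local file path to a GCS object name by joining the bucket prefix with the file's path relative to local_dir. A minimal standalone sketch of that mapping logic (local_to_blob_path is a hypothetical helper name, not part of the script above):

```python
import os

def local_to_blob_path(local_path: str, local_dir: str, gcs_prefix: str) -> str:
    """Mirror the script's mapping from a local file to its GCS object name."""
    relative_path = os.path.relpath(local_path, local_dir)
    return os.path.join(gcs_prefix, relative_path)

# Files under /tmp land under the gemma-3-4b-it/ prefix in the bucket,
# with any subdirectory structure preserved.
print(local_to_blob_path("/tmp/config.json", "/tmp", "gemma-3-4b-it"))
# → gemma-3-4b-it/config.json
print(local_to_blob_path("/tmp/sub/model-00001.safetensors", "/tmp", "gemma-3-4b-it"))
# → gemma-3-4b-it/sub/model-00001.safetensors
```

Because the relative path is preserved, the bucket layout matches the local snapshot layout, which is what the model streamer expects when it lists the prefix.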
Stream Your Model in the CLI
- Run the command below to start a vLLM inference server that loads a model from Google Cloud Storage with the Run:ai Model Streamer:
vllm serve gs://models-usc/gemma-3-4b-it --load-format=runai_streamer
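Once the server is up (it listens on port 8000 by default), you can query it through vLLM's OpenAI-compatible API. A quick sketch that builds a chat-completion request for the streamed model; the host and model name match the serve command above and would change with your setup:

```python
import json
import urllib.request

# The model name is the same gs:// path passed to `vllm serve`.
payload = {
    "model": "gs://models-usc/gemma-3-4b-it",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment to send the request once the server is running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```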
Stream Your Model in GKE
- If using GKE, you will need to enable Workload Identity on the cluster.
- Next, you will need to create a Kubernetes service account:
export SERVICE_ACCOUNT="[service account name]"
kubectl create serviceaccount $SERVICE_ACCOUNT
- Grant the bucketViewer and objectUser IAM roles to this service account:
export BUCKET="[bucket name]"
export PROJECT_NUMBER="[project number]"
export PROJECT_ID="[project ID]"
gcloud storage buckets add-iam-policy-binding gs://$BUCKET --member principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/default/sa/$SERVICE_ACCOUNT --role roles/storage.bucketViewer
gcloud storage buckets add-iam-policy-binding gs://$BUCKET --member principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/default/sa/$SERVICE_ACCOUNT --role roles/storage.objectUser
- You can use the example deployment spec below as a guide to launch the model streamer in vLLM on GKE using Workload Identity. You will need to replace the serviceAccountName value (gcs-access in the example) with the value of your SERVICE_ACCOUNT variable. Note that if using a tensor-parallel-size greater than 1 you will want to add the flag --model-loader-extra-config='{"distributed":true}' to enable distributed streaming:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-streamer
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-streamer
  template:
    metadata:
      labels:
        app: vllm-streamer
    spec:
      serviceAccountName: gcs-access
      containers:
      - name: inference-server
        image: vllm/vllm-openai:nightly
        command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        args:
        - --model=gs://models-usc/gemma-3-4b-it
        - --load-format=runai_streamer
        - --disable-log-requests
        - --max-num-batched-tokens=512
        - --max-num-seqs=128
        - --max-model-len=2048
        - --tensor-parallel-size=1
        ports:
        - containerPort: 8000
          name: metrics
        readinessProbe:
          failureThreshold: 600
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 10
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      volumes:
      - emptyDir:
          medium: Memory
        name: dshm
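To reach the server from inside the cluster (or locally via kubectl port-forward), you can put a Service in front of the deployment. A minimal sketch, assuming the labels and port from the deployment above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-streamer
  namespace: default
spec:
  selector:
    app: vllm-streamer
  ports:
  - name: http
    port: 8000
    targetPort: 8000
```

With the Service in place, kubectl port-forward service/vllm-streamer 8000:8000 exposes the OpenAI-compatible API on localhost.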
Further Information