Accelerate TPU model loading while saving RAM on GKE

Alexis_MacAskill · June 22, 2026, 8:33pm

As large language models (LLMs) grow, the time it takes to load them from storage to accelerator memory for inference can become a significant bottleneck. These “cold start” loading times can stall auto-scaling efforts, delay user requests, and leave high-value TPU resources sitting idle rather than generating tokens. Furthermore, the storage and memory requirements to stage these models locally can increase infrastructure costs and cause stability issues.

To help teams bypass these bottlenecks, the open-source Run:ai Model Streamer now natively supports TPUs with Google Cloud Storage starting in TPU vLLM 0.18.0. This integration accelerates vLLM inference pipelines on Google Kubernetes Engine (GKE), helping your TPU workloads start faster and scale more efficiently.

In this blog, we take a closer look at the cold start problem, and how the Run:ai Model Streamer changes the dynamic. We also share preliminary performance results, and show you how to get started. Let’s jump in.

Accelerating cold starts while saving RAM

When launching an inference server on Kubernetes, loading model weights is usually the most time-consuming and resource-intensive step. For PyTorch models running on TPU via TorchAX and using the default vLLM loader, the system suffers from a severe “double-buffering” effect. This double-buffering is unavoidable for models that lack unquantized FusedMoE and Linear layers, because the entire model must be loaded into CPU memory before it is sharded to the TPU.

The double-buffering trap

First, the model is downloaded to either system RAM (via an emptyDir backed by memory) or external storage. Because the TorchAX path needs to use existing upstream vLLM logic to parse the weights, the system explicitly forces the loading context to the host CPU. Then, the inference server reads the checkpoint files to initialize the model, allocating a second, model-sized chunk of process RAM to load all the parameters as PyTorch tensors. When the server reads these files from external storage, the Linux kernel automatically pulls the file data into the OS Page Cache in RAM to speed up future reads. Crucially, the system waits for the full model to be instantiated on the CPU before finally transferring and sharding it to the TPU.

This architectural separation means that the host machine must simultaneously accommodate both the stored model weights (held either in a tmpfs RAM disk or in the OS Page Cache) and the active PyTorch tensor buffers. For the default vLLM loader, this drives peak memory usage to roughly double the model size, plus overhead. A tmpfs RAM disk holds the weights in memory that cannot be dropped under pressure, which increases your risk of an out-of-memory (OOM) crash. Reading from persistent storage may survive the same memory spike, because the kernel can dynamically reclaim the OS Page Cache.
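As a rough illustration of this arithmetic, the sketch below estimates peak host memory for the default loader. The function name and the 30 GiB overhead are assumptions chosen for illustration, not measured values. The 449 GiB model size and 960 GiB of node RAM are the figures quoted in the benchmark captions of this post.

```python
def default_loader_peak_gib(model_gib: float, overhead_gib: float) -> float:
    """Estimate peak host memory: staged copy + PyTorch tensor copy + overhead."""
    staged_copy = model_gib   # weights held in tmpfs or the OS Page Cache
    tensor_copy = model_gib   # the same weights instantiated as CPU tensors
    return staged_copy + tensor_copy + overhead_gib


MODEL_GIB = 449.0      # checkpoint size from the benchmark captions
NODE_RAM_GIB = 960.0   # host RAM from the benchmark captions
OVERHEAD_GIB = 30.0    # assumed overhead, for illustration only

peak_gib = default_loader_peak_gib(MODEL_GIB, OVERHEAD_GIB)
headroom_gib = NODE_RAM_GIB - peak_gib
print(f"estimated peak: {peak_gib} GiB, headroom: {headroom_gib} GiB")
```

Under these assumptions the estimate leaves only about 32 GiB for everything else on the node. With a tmpfs RAM disk the staged copy cannot be evicted, so that margin is all there is. With persistent storage, the kernel can evict page cache to relieve the pressure.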

Note: For PyTorch models on TPU with unquantized FusedMoE and Linear layers, an incremental loader can intercept the loading sequence. It immediately shards weights to the TPU and frees the CPU memory, meaning the peak memory is limited to the size of the largest single layer or shard, plus some vLLM overhead.
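The difference between the two loading orders can be sketched as a small simulation. This is not the vLLM incremental loader itself, just a model of CPU tensor residency under each strategy. The shard sizes are hypothetical, and the simulation ignores the staged copy and vLLM overhead.

```python
def peak_load_all_then_shard(shard_sizes_gib):
    """Default order: every shard is resident on the CPU before any transfer."""
    return sum(shard_sizes_gib)


def peak_incremental(shard_sizes_gib):
    """Incremental order: load one shard, move it to the TPU, free the CPU copy."""
    resident = 0.0
    peak = 0.0
    for size in shard_sizes_gib:
        resident += size            # shard loaded into CPU memory
        peak = max(peak, resident)
        resident -= size            # shard transferred, CPU copy released
    return peak


shards_gib = [4.0, 4.0, 9.5, 4.0, 2.5]  # hypothetical shard sizes

print(peak_load_all_then_shard(shards_gib))  # whole model resident at once
print(peak_incremental(shards_gib))          # bounded by the largest shard
```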

The storage trade-off for TPUs

Deploying massive models on TPUs forces a difficult trade-off with this architecture. Because TPU nodes lack local SSDs, teams must either attach slower, costly external storage or rely on fast, RAM-backed emptyDir volumes. However, because these massive models consume significant host memory, using a RAM disk makes the double-buffering trap fatal and frequently leads to OOM crashes.

Bypassing the bottleneck with Run:ai Model Streamer

The Run:ai Model Streamer fundamentally changes this workflow. Rather than waiting for a complete sequential download to local disk, the streamer uses multiple concurrent threads to read model tensors from your Cloud Storage bucket directly into a CPU memory buffer. From this buffer, the inference server instantiates the PyTorch tensors, completely bypassing the local file system. Skipping the local disk shrinks this phase of the startup process.

By using this fixed CPU memory buffer (40 GB by default) instead of downloading the full model to disk, the streamer eliminates the need for expensive external storage and prevents the fatal double-buffering exhaustion of host RAM. To further reduce local storage dependencies, teams can point VLLM_XLA_CACHE_PATH at the same Cloud Storage bucket, caching the XLA compilation graph off-node without provisioning additional disks.
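The general pattern described here, concurrent readers feeding a bounded in-memory buffer, can be sketched in a few lines. This is not the Run:ai Model Streamer's implementation. The `fetch_chunk` function is a hypothetical stand-in for a ranged read from a bucket, and the chunk size and slot count are illustrative.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

CHUNK_BYTES = 1024  # illustrative; real reads would be far larger


def fetch_chunk(index: int) -> bytes:
    """Hypothetical stand-in for one ranged read from object storage."""
    return bytes([index % 256]) * CHUNK_BYTES


def stream_chunks(num_chunks: int, buffer_slots: int, num_readers: int):
    """Yield chunks in order, with at most buffer_slots reads pending or buffered."""
    with ThreadPoolExecutor(max_workers=num_readers) as pool:
        window = deque()
        next_index = 0
        while next_index < num_chunks or window:
            # Top up the window; each submitted read reserves one buffer slot.
            while next_index < num_chunks and len(window) < buffer_slots:
                window.append(pool.submit(fetch_chunk, next_index))
                next_index += 1
            # Hand the oldest chunk to the consumer, which frees its slot.
            yield window.popleft().result()


total_bytes = 0
for chunk in stream_chunks(num_chunks=16, buffer_slots=4, num_readers=4):
    total_bytes += len(chunk)  # a real consumer would build tensors here
print(total_bytes)
```

Because the window never holds more than `buffer_slots` chunks, memory used for buffering stays fixed regardless of how large the checkpoint is, while the thread pool keeps several reads in flight.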

Proven performance and efficiency

To illustrate the impact of this Run:ai Model Streamer integration, we benchmarked the time it takes to load the 480 billion parameter Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 model on TPUs. As shown in the chart below, the Run:ai Model Streamer loads the model substantially faster than the default Hugging Face (HF) model loader during cold starts.

Benchmarking cold-start loading times for the Qwen3-Coder-480B-A35B-Instruct-FP8 449 GiB model on a v7x-standard-4t TPU VM. The Run:ai Model Streamer completes the model loading process in under 280 seconds, significantly outperforming the default loader that caches the model in external storage, which takes over 630 seconds.

The memory savings are just as impressive as the speed improvements. For PyTorch models on TPU, the default loader requires host memory equal to twice the model size plus vLLM overhead, forcing the use of costly external storage. In contrast, by eliminating the local file cache, the Run:ai Model Streamer limits peak memory to roughly the model size itself, plus a 40 GB memory buffer and standard overhead. This completely bypasses the fatal double-buffering trap, allowing the 480B Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 model to load safely within the node’s native constraints.

Peak CPU memory usage during the loading phase of the Qwen3-Coder-480B-A35B-Instruct-FP8 449 GiB model on a v7x-standard-4t TPU VM with 960 GiB RAM, comparing the default loader to the Run:ai Model Streamer. The default loader using external storage peaks at 881 GiB, while the RAM disk approach causes a fatal Out-Of-Memory crash. The Run:ai Model Streamer peaks at 436 GiB, safely within the node’s native constraints.
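To put the reported figures side by side, the short calculation below derives the implied speedup and memory headroom. It uses only the numbers quoted in the two captions. Because the load times are given as bounds ("under 280" and "over 630" seconds), the speedup is a lower bound.

```python
STREAMER_LOAD_S = 280     # reported as "under 280 seconds"
DEFAULT_LOAD_S = 630      # reported as "over 630 seconds"
NODE_RAM_GIB = 960        # host RAM on the benchmark VM
DEFAULT_PEAK_GIB = 881    # default loader with external storage
STREAMER_PEAK_GIB = 436   # Run:ai Model Streamer

min_speedup = DEFAULT_LOAD_S / STREAMER_LOAD_S
peak_reduction = 1 - STREAMER_PEAK_GIB / DEFAULT_PEAK_GIB
default_headroom_gib = NODE_RAM_GIB - DEFAULT_PEAK_GIB
streamer_headroom_gib = NODE_RAM_GIB - STREAMER_PEAK_GIB

print(f"speedup: at least {min_speedup:.2f}x")
print(f"peak memory reduction: {peak_reduction:.1%}")
print(f"headroom: {default_headroom_gib} GiB vs {streamer_headroom_gib} GiB")
```

On these figures the streamer loads at least 2.25 times faster and roughly halves peak host memory, which raises free host RAM from 79 GiB to 524 GiB.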

Streamlined setup and deployment

Bringing these loading speeds and resource improvements to your infrastructure requires minimal configuration. If you are testing locally or using the CLI, you can activate the streamer by simply passing your Cloud Storage URI and the --load-format=runai_streamer flag to your launch command:

vllm serve gs://your-gcs-bucket/path/to/your/model --load-format=runai_streamer

When deploying to production, you typically invoke the Python API server directly rather than using the CLI wrapper. In this case, you pass that exact same Cloud Storage URI using the --model flag instead.
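The difference between the two invocations is only in how the model URI is passed. As a small sketch, the hypothetical helpers below build both argument lists using only the flags and module path that appear in this post. The bucket path is the same placeholder used above.

```python
def cli_command(model_uri: str) -> list[str]:
    """CLI wrapper form: the model URI is a positional argument."""
    return ["vllm", "serve", model_uri, "--load-format=runai_streamer"]


def api_server_command(model_uri: str) -> list[str]:
    """Direct API server form: the same URI is passed via --model."""
    return [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        f"--model={model_uri}",
        "--load-format=runai_streamer",
    ]


MODEL_URI = "gs://your-gcs-bucket/path/to/your/model"  # placeholder path

print(" ".join(cli_command(MODEL_URI)))
print(" ".join(api_server_command(MODEL_URI)))
```

Either list could be handed to `subprocess.run` on a machine with vLLM installed. The helpers only assemble the arguments and do not launch anything.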

Additionally, vLLM 0.18.0 introduces an improvement for the compilation phase. By defining the VLLM_XLA_CACHE_PATH environment variable and pointing it to your Cloud Storage URI, you can cache the model’s compiled graph. This drastically reduces XLA compilation times for subsequent pod startups of the same model, saving even more time during scaling events.

Deploying on GKE

For users deploying on GKE, the Run:ai Model Streamer natively integrates with Workload Identity. To set this up, you need to create a Kubernetes Service Account (KSA) and attach the necessary IAM policy bindings. The streamer requires the roles/storage.bucketViewer and roles/storage.objectUser roles so it can load the model weights and save the XLA compilation cache back to the bucket. Here is a deployment manifest showing how to securely enable the Run:ai Model Streamer and XLA caching on GKE:

apiVersion: apps/v1
kind: Deployment
…
 spec:
   serviceAccountName: gcs-access
   containers:
     - name: vllm-tpu
       image: vllm/vllm-tpu:v0.18.0
       env:
         - name: VLLM_XLA_CACHE_PATH
           value: "gs://your-gcs-bucket/path/to/xla/cache"
       args:
         - --model=gs://your-gcs-bucket/path/to/your/model
         - --load-format=runai_streamer
 …
       command:
         - python3
         - -m
         - vllm.entrypoints.openai.api_server
 …

Get started today

Faster model loading and better resource utilization translate directly to a smoother experience for your end users. The Run:ai Model Streamer helps teams bypass traditional TPU cold-start delays, making it easier to scale inference workloads rapidly.

To learn more about how to use the model streamer on GKE, see our GKE Run:ai Guide.
For detailed instructions on using the streamer with vLLM, see the official vLLM documentation.
To learn more and contribute to the model streamer’s ongoing development, check out the Run:ai Model Streamer project on GitHub.
