Tutorial: Scaling Reinforcement Learning with verl on GKE

Reinforcement Learning (RL) has moved from academic research to the center of the generative AI revolution. At its core, RL is about teaching a model through experience, exploration, and feedback rather than static imitation. While pre-training teaches a model what to say, Reinforcement Learning, and specifically Reinforcement Learning from Human Feedback (RLHF), teaches it how to be helpful, safe, and logical. It is the bridge between a raw, unpredictable base model and a refined, conversational assistant.

The Efficiency of GRPO

In this blog, we focus on Group Relative Policy Optimization (GRPO). While Proximal Policy Optimization (PPO) has been the industry standard, GRPO (popularized by DeepSeek) offers a more memory-efficient alternative for LLM alignment.

GRPO simplifies the RL pipeline by removing the Critic model entirely. In traditional PPO, a Critic network predicts rewards to create a baseline. GRPO instead generates a group of multiple responses for the same prompt. It uses the average reward of that specific group as the baseline.

  • If a response is better than the group average: It receives a positive advantage signal.
  • If it’s worse: It receives a negative signal.

By eliminating the Critic, we reduce the memory footprint by nearly 50%, allowing us to train larger models (like Qwen-32B) or use much larger batch sizes on the same hardware.
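This group-relative scoring fits in a few lines of plain Python. The sketch below is illustrative rather than verl's actual implementation; like DeepSeek's formulation, it also divides by the group's standard deviation:

```python
def grpo_advantages(group_rewards, eps=1e-6):
    """Compute group-relative advantages for one prompt's response group.

    Each response is scored against the mean reward of its own group,
    then scaled by the group's standard deviation.
    """
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = (sum((r - mean) ** 2 for r in group_rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]

# Example: 8 rollouts for one prompt, scored 1.0 (correct) or 0.0 (wrong).
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
advs = grpo_advantages(rewards)
# Correct responses receive a positive advantage, incorrect ones a
# negative advantage, with no Critic network involved.
```

Note that if every response in a group earns the same reward (all right or all wrong), the advantages collapse to zero and that prompt contributes no learning signal, which is why a mix of outcomes per group is valuable.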

Enter verl on GKE

That’s where verl (Volcano Engine Reinforcement Learning) comes in. verl is a high-performance framework designed specifically to handle the complex memory and compute patterns of LLM-based RL.

In this guide, we’ll walk you through orchestrating a sophisticated GRPO training pipeline on Google Kubernetes Engine (GKE). We will leverage the raw power of NVIDIA B200 GPUs and the distributed flexibility of Ray to turn a complex RL setup into a scalable, manageable workflow. To understand the moving pieces better, see the architecture overview in the verl documentation, along with its diagram of how the RL loop functions for GRPO.

The code and artifacts that we walk through today are also available in this GitHub repository: https://github.com/esaaren/verl-on-gke/tree/main


What We Are Building

We are setting up a distributed training environment where data, model weights, and the training engine are decoupled for maximum efficiency.

Key Components:

  • Infrastructure: A GKE cluster featuring NVIDIA’s latest B200 accelerators.
  • Orchestration: KubeRay to manage a distributed Ray cluster.
  • Storage: GCS Fuse to mount a Google Cloud Storage bucket directly across all nodes, ensuring seamless access to models and datasets.
  • Engine: verl, which optimizes the communication and sharding for the GRPO rollout and update phases.

Our architecture leverages a GKE-managed Ray cluster optimized for large-scale RL. The diagram here shows the two workers (VMs) we will use in this tutorial, but the design is extensible to hundreds of workers.

The compute backbone consists of Ray Worker nodes (utilizing H200/B200 GPUs), where each node orchestrates a high-performance feedback loop: vLLM engines handle the generation of multiple response candidates (the ‘Group’ in GRPO), while verl actors manage the policy training and rollout coordination across the 8-GPU topology. Our model is sharded across workers with FSDP.

We can also see the dedicated RDMA swim lanes that we build: each GPU NIC (Network Interface Card, one per GPU) gets a dedicated subnet and swim lane to facilitate GPU-to-GPU communication, and a subnet per GPU NIC is a requirement for RDMA to work on GCP. We also see subnets for our CPU NICs, which are used for data center networking (for example, to read from Cloud Storage). For more details on the foundational RDMA setup and how it works, refer to the previously linked tutorial.


0. Building the infrastructure

To keep this tutorial short, we will be drawing on the public RDMA documentation:

https://docs.cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute-custom#create-cluster

Simply follow the documentation above to get a working cluster with either B200 or H200 GPUs and RDMA set up. You can also follow the GitHub repository for this tutorial (https://github.com/esaaren/verl-on-gke/tree/main). The steps in this blog use Spot capacity and B200s because that is what was available to us, but you are free to modify this depending on your situation and what is available to you. Be mindful that small adjustments may be required in some of the config depending on how far you stray from this example. H200 on Spot should be fully compatible with this blog aside from small changes in naming conventions and one NCCL config (which we point out later).

If you follow the public RDMA documentation, you will need some additional pieces to make this tutorial work (the GitHub repository helps set most of this up for you). You are free to keep the generic export values below, but you may want to modify them to your liking.

Exports

export REGION="asia-southeast1"
export ZONE="asia-southeast1-b"
export PROJECT=""
export KSA_NAME="generic-ksa"
export GSBUCKET="generic-lab-testing"
export NAMESPACE='default'
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT} --format="value(projectNumber)")

# Your Hugging Face token for accessing models/datasets
export HF_TOKEN=""

Creation of a service account

kubectl create serviceaccount ${KSA_NAME} --namespace ${NAMESPACE}

Creation of a storage bucket

gcloud storage buckets create gs://${GSBUCKET} --location=${REGION} --enable-hierarchical-namespace --uniform-bucket-level-access

Creation of the HF secret and giving our KSA permission to use the bucket we created

kubectl create secret generic hf-secret \
--from-literal=hf_api_token=${HF_TOKEN}

gcloud storage buckets add-iam-policy-binding gs://${GSBUCKET} \
  --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${KSA_NAME}" \
  --role "roles/storage.objectUser"

Creation of the GKE storage objects to enable gcsFuse to work

apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-bucket-pv
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 768Gi
  persistentVolumeReclaimPolicy: Delete
  storageClassName: gcsfuse-sc
  mountOptions:
  - implicit-dirs
  - metadata-cache:negative-ttl-secs:0
  - metadata-cache:ttl-secs:0
  - metadata-cache:stat-cache-max-size-mb:-1
  - metadata-cache:type-cache-max-size-mb:-1
  - file-cache:max-size-mb:-1
  - file-cache:cache-file-for-range-read:true
  - file-cache:enable-parallel-downloads:true
  - read_ahead_kb=1024
  - write:enable-streaming-writes:true
  - write:global-max-blocks:200000
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeHandle: generic-lab-testing
    volumeAttributes:
      skipCSIBucketAccessCheck: "true"
      gcsfuseMetadataPrefetchOnMount: "true"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-bucket-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 768Gi
  storageClassName: gcsfuse-sc

Hopefully after all of these steps, you now have a working foundation we can build off of. You should have a GKE cluster built with Google’s RDMA stack and a Google Cloud Storage bucket that your cluster can access.

1. Preparing the Data and Models

Before we launch our Ray cluster or any jobs, we need to ensure our “workspace” (the GCS bucket) is ready. We just created it above, so now we can upload our data and code into it.

The Source Code

Clone the verl repository and upload it to the bucket we created earlier; you can do this from your local environment.

The Model Weights

Download the Qwen2.5-32B-Instruct weights using the Hugging Face CLI. You can do this locally and then upload them to your bucket under $GSBUCKET/huggingface_cache. Here is roughly how that might look (the paths will change based on your local environment, so copy-paste won’t work here):

curl -LsSf https://hf.co/cli/install.sh | bash

hf download Qwen/Qwen2.5-32B-Instruct

gcloud storage cp -r /Users/$USER/.cache/huggingface/hub/models--Qwen--Qwen2.5-32B-Instruct gs://${GSBUCKET}/huggingface_cache/hub

The Dataset

We’ll use the GSM8K dataset (grade school math word problems).

Why GSM8K?

GSM8K consists of high-quality, grade-school-level math word problems that require 2 to 8 steps of reasoning to solve. We use this dataset because:

  • Chain-of-Thought (CoT) Requirement: The model cannot simply know the answer; it must generate a sequence of intermediate steps (the “reasoning path”) to arrive at the correct final solution.
  • Verifiable Rewards (The Efficiency Hack): Unlike creative writing, math has ground truth. This allows us to use a Rule-based Reward Manager rather than a heavy, neural Reward Model (which is typically another LLM).
  • GRPO + Math = Maximum Efficiency: By using math, we eliminate the need for a Critic model and a second neural Reward Model. We aren’t scoring vibes or helpfulness; we are scoring correctness. If the final answer parsed from the CoT matches the ground truth, it gets a 1.0. If not, it gets a 0.0.
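For GSM8K, such a rule-based scorer really can be this simple. The sketch below is illustrative only (verl ships its own GSM8K reward function); it relies on the fact that GSM8K ground-truth solutions mark the final answer with “#### <number>”:

```python
import re

def gsm8k_reward(response: str, ground_truth: str) -> float:
    """Rule-based reward: 1.0 if the final number in the model's
    chain-of-thought matches the ground truth, else 0.0.

    Accepts either the GSM8K '#### <number>' marker or, failing
    that, the last number anywhere in the response.
    """
    match = re.search(r"####\s*(-?[\d,\.]+)", response)
    if match:
        pred = match.group(1)
    else:
        numbers = re.findall(r"-?[\d,\.]+", response)
        if not numbers:
            return 0.0
        pred = numbers[-1]
    pred = pred.replace(",", "").rstrip(".")
    return 1.0 if pred == ground_truth.replace(",", "") else 0.0

# A correct chain of thought scores 1.0; a wrong one scores 0.0.
good = "She buys 3 packs of 4 eggs, so 3 * 4 = 12 eggs #### 12"
bad = "3 + 4 = 7, so the answer is #### 7"
```

During training this scalar feeds straight into the group-relative advantage calculation, with no neural Reward Model in the loop.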

Our Training Goal: Aligning the model’s reasoning

Our objective isn’t just to make the model smarter at math; it’s to use GRPO to align its internal reasoning process.

During the Rollout phase, the model generates a group of n different responses for the same problem.

  1. The Good: A clear, step-by-step derivation leading to the correct answer.
  2. The Lucky: A fluke guess with nonsensical logic but a correct final digit.
  3. The Bad: A logical mess that ends in the wrong answer.

By applying GRPO, we compare these responses against each other. Even if several are correct, the model is rewarded more for the paths that are consistently better than the group average. Over many iterations, the model learns that articulating its steps clearly is the most reliable way to maximize reward. This effectively penalizes fluke answers and forces the model to develop a robust internal reasoning chain.

Processing for the Bucket

To make this data digestible for verl, we’ll use a preprocessing script (found in the verl repository) to convert the raw JSON/Hugging Face data into structured Parquet files.

git clone https://github.com/volcengine/verl.git && cd verl
python3 examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k
gcloud storage cp ~/data/gsm8k/* gs://${GSBUCKET}/gsm8k/

This script splits the data into train.parquet and test.parquet. We can then upload these directly to our bucket so they become instantly available to every GPU worker in our future Ray cluster, eliminating the need to download and pre-process the dataset on every single node at runtime.


2. Defining the Ray Cluster

First, install the KubeRay operator (you can also opt for the managed Ray operator that ships with GKE):

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Install both CRDs and KubeRay operator v1.5.1.
helm install kuberay-operator kuberay/kuberay-operator --version 1.5.1

We use a RayCluster YAML to define our compute. Notice the use of a multi-NIC configuration (eth0 through eth9) to take advantage of high-speed RDMA networking on GCP. Previous tutorials will give a better understanding of why we do this.

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: b200-ray-cluster
spec:
  rayVersion: '2.47.0'
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
      spec:
        serviceAccountName: generic-ksa
        nodeSelector:
          cloud.google.com/gke-nodepool: "default-pool"
        containers:
        - name: ray-head
          image: verlai/verl:vllm011.latest 
          ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
          resources:
            limits:
              cpu: "12"
              memory: "32G"
              ephemeral-storage: "9Gi"
            requests:
              cpu: "12"
              memory: "32G"
              ephemeral-storage: "9Gi"
          volumeMounts:
            - mountPath: /tmp/ray
              name: ray-logs
            - name: training-bucket-vol
              mountPath: /data
        volumes:
          - name: ray-logs
            emptyDir: {}
          - name: training-bucket-vol
            persistentVolumeClaim:
              claimName: training-bucket-pvc
  workerGroupSpecs:
  - replicas: 2
    minReplicas: 0
    maxReplicas: 4
    groupName: gpu-group
    rayStartParams:
      num-cpus: "220"
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          networking.gke.io/default-interface: 'eth0'
          networking.gke.io/interfaces: |
            [
              {"interfaceName":"eth0","network":"default"},
              {"interfaceName":"eth1","network":"gvnic-1"},
              {"interfaceName":"eth2","network":"rdma-0"},
              {"interfaceName":"eth3","network":"rdma-1"},
              {"interfaceName":"eth4","network":"rdma-2"},
              {"interfaceName":"eth5","network":"rdma-3"},
              {"interfaceName":"eth6","network":"rdma-4"},
              {"interfaceName":"eth7","network":"rdma-5"},
              {"interfaceName":"eth8","network":"rdma-6"},
              {"interfaceName":"eth9","network":"rdma-7"}
            ]
      spec:
        initContainers:
        - name: verl-setup
          image: verlai/verl:vllm011.latest
          command: ["/bin/bash", "-c"]
          args:
            - |
              echo "Performing local editable install..."
              cd /data/verl && pip3 install --no-deps -e .
          volumeMounts:
          - name: training-bucket-vol
            mountPath: /data
        serviceAccountName: generic-ksa
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-b200
        tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
        containers:
        - name: ray-worker
          image: verlai/verl:vllm011.latest 
          env:
           - name: LD_LIBRARY_PATH
             value: /usr/local/nvidia/lib64
          resources:
            limits:
              cpu: "220"
              memory: "2800Gi"
              nvidia.com/gpu: "8"
              ephemeral-storage: "1000Gi"
            requests:
              cpu: "220"
              memory: "2800Gi"
              nvidia.com/gpu: "8"
              ephemeral-storage: "1000Gi"
          volumeMounts:
          - name: nvidia
            mountPath: /usr/local/nvidia
          - name: gib
            mountPath: /usr/local/gib
          - name: shared-memory
            mountPath: /dev/shm
          - name: ray-tmp-storage
            mountPath: /tmp
          - name: training-bucket-vol
            mountPath: /data
        volumes:
        - name: gib
          hostPath:
            path: /home/kubernetes/bin/gib
        - name: nvidia
          hostPath:
            path: /home/kubernetes/bin/nvidia
        - name: lib64
          hostPath:
            path: /lib64
        - name: shared-memory
          emptyDir:
            medium: "Memory"
            sizeLimit: 250Gi
        - name: sys
          hostPath:
            path: /sys
        - name: proc-sys
          hostPath:
            path: /proc/sys
        - name: ray-tmp-storage
          emptyDir: {}
        - name: training-bucket-vol
          persistentVolumeClaim:
            claimName: training-bucket-pvc

Head Node

The head node manages the job and hosts the dashboard. It doesn’t need a GPU but needs access to our code via the GCS bucket.

Worker Nodes

This is where the heavy lifting happens. We are requesting 2 nodes, each with 8 B200 GPUs.

Note: We use an initContainer to install verl from the mounted /data directory. This ensures every worker node has the exact same environment and dependencies. We could simplify this by building our own Docker image, but pulling a public image keeps the tutorial’s step count down, so this approach suffices for our purposes.


3. Configuring the Runtime Environment

To make sure our workers can talk to each other and find the GPUs, we use a runtime-env.yaml. This file tells Ray which environment variables to set, such as NCCL_NET_PLUGIN for optimized GPU communication and the PYTHONPATH for our verl installation. The most critical pieces are the NCCL parameters. Here is the example yaml file:

py_modules: ["."]
working_dir: "."
py_executable: "uv run"
setup_hook: runtime_env.uv_runtime_env_hook.hook
env_vars:
  PYTHONPATH: "/data/verl"
  LD_LIBRARY_PATH: "/usr/local/nvidia/lib64"
  NCCL_DEBUG: "INFO"
  NUM_WORKERS: "2"
  CPUS_PER_WORKER: "192"
  GPUS_PER_WORKER: "8"
  NCCL_NET_PLUGIN: "/usr/local/gib/lib64/libnccl-net_internal.so"
  NCCL_CROSS_NIC: "0"
  NCCL_NET_GDR_LEVEL: "PIX"
  NCCL_P2P_NET_CHUNKSIZE: "131072"
  NCCL_NVLS_CHUNKSIZE: "524288"
  NCCL_IB_ADAPTIVE_ROUTING: "1"
  NCCL_IB_QPS_PER_CONNECTION: "4"
  NCCL_IB_TC: "52"
  NCCL_IB_FIFO_TC: "84"
  NCCL_TUNER_CONFIG_PATH: "/usr/local/gib/configs/tuner_config_a4.txtpb"
  HF_HOME: "/data/huggingface_cache"
  GLOO_SOCKET_IFNAME: "eth0" 
pip:
  packages:
    - torch
    - torchvision

Note: If you decided to use H200s for the tutorial, you must set the below NCCL configuration instead:

NCCL_TUNER_CONFIG_PATH: "/usr/local/gib/configs/tuner_config_a3u.txtpb"

This is ultimately the only difference between H200 and B200 on GCP aside from naming conventions.


4. Launching the GRPO Job

With the infrastructure in place, we submit the job from our local CLI using the Ray Jobs API. When you run the ray job submit command, your local machine communicates with the Ray head node, typically through a port-forward to localhost:8265 running in a separate terminal, e.g.:

kubectl port-forward svc/b200-ray-cluster-head-svc 8265:8265

Ray packages your runtime-env.yaml, uploads any necessary code, and schedules the main_ppo trainer across your GKE worker nodes. This push model ensures your training logic is executed in a consistent, distributed environment without you having to manually SSH into individual nodes.

Note on the Trainer: You’ll notice the command below calls verl.trainer.main_ppo: verl treats GRPO as a specialized advantage estimator within its unified PPO pipeline. By setting algorithm.adv_estimator=grpo, the framework automatically bypasses the Critic initialization and switches to the group-relative logic.

Your local setup may vary, but here we are using uv to launch the job from our local terminal. Feel free to launch this any way you would like.

uv tool run --index-url https://pypi.org/simple ray -- job submit \
    --address "http://localhost:8265" \
    --runtime-env runtime-env.yaml \
    -- \
    bash -c "
        # Launch GRPO training via the PPO entrypoint
        cd /data/verl && PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
        data.train_files=/data/gsm8k/train.parquet \
        data.val_files=/data/gsm8k/test.parquet \
        data.train_batch_size=256 \
        data.max_prompt_length=512 \
        data.max_response_length=512 \
        actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
        actor_rollout_ref.actor.optim.lr=1e-5 \
        actor_rollout_ref.actor.ppo_mini_batch_size=256 \
        actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=64 \
        actor_rollout_ref.rollout.name=vllm \
        actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
        actor_rollout_ref.rollout.tensor_model_parallel_size=8 \
        actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
        actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
        actor_rollout_ref.actor.strategy=fsdp2 \
        algorithm.kl_ctrl.kl_coef=0.001 \
        trainer.logger=console \
        trainer.val_before_train=False \
        trainer.n_gpus_per_node=8 \
        trainer.nnodes=2 \
        trainer.save_freq=10 \
        trainer.test_freq=10 \
        algorithm.adv_estimator=grpo \
        actor_rollout_ref.rollout.n=8 \
        trainer.total_epochs=2" 2>&1 | tee verl_demo.log

Note: If you are using H200 instead of B200, you will have less overall memory per GPU (B200 has 192 GB vs the H200’s 141 GB). These job arguments were run with B200. If you hit OOM (Out of Memory) issues on H200, try lowering actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu or actor_rollout_ref.rollout.gpu_memory_utilization. You could also use a smaller size of Qwen2.5.

How the GRPO Pipeline Executes

While the command looks like a single execution, verl is orchestrating a highly efficient distributed lifecycle across your cluster. Even though verl uses a training loop familiar to PPO users, it adapts the logic to support the critic-less nature of GRPO.

Here is the lifecycle of a single GRPO training step:

1. Group Generation (The Rollout)

The Actor model (our Qwen instance) generates a group of responses for every prompt in the batch. Unlike standard RL, where you might generate one response, GRPO generates n responses (e.g., actor_rollout_ref.rollout.n=8). This creates a group for comparison.

2. Relative Scoring (No Critic Required)

This is where GRPO departs from the Actor-Critic (PPO) model. Instead of a Critic model estimating the value of a state:

  • The Reward Model scores each of the n responses in the group.
  • The Advantage is calculated by comparing each response’s score to the mean reward of its group.
  • KL Divergence is calculated against the Reference model to ensure the Actor doesn’t cheat the reward system by drifting into gibberish.
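As a toy illustration of that penalty (using the simple sum-of-log-prob-differences estimator; verl supports several KL formulations, and the default kl_coef here mirrors the algorithm.kl_ctrl.kl_coef=0.001 we pass in the job command):

```python
def kl_penalized_reward(reward, logps_actor, logps_ref, kl_coef=0.001):
    """Apply a KL penalty to a response's scalar reward.

    The per-token KL estimate here is simply log pi_actor - log pi_ref,
    summed over the response tokens. Drifting far from the reference
    model shrinks the reward, discouraging reward hacking.
    """
    kl = sum(a - r for a, r in zip(logps_actor, logps_ref))
    return reward - kl_coef * kl

# Identical policies pay no penalty; a drifted actor pays for its KL.
on_policy = kl_penalized_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])
drifted = kl_penalized_reward(1.0, [-0.1, -0.1], [-5.0, -5.0])
```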

3. The Synchronized Update

The sampled trajectories are split into ppo_mini_batches. Even though we aren’t using a Critic, we still use a clipped surrogate update. This ensures the Actor’s weights are improved incrementally, preventing the drastic weight shifts that often destabilize RL training.
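A minimal, single-sample sketch of that clipped update (plain Python for illustration; verl's real loss is computed over batched token-level log-probabilities):

```python
import math

def clipped_surrogate_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped policy loss for a single sampled response.

    ratio compares the response's probability under the new policy vs.
    the policy that generated it; clipping caps how much credit (or
    blame) one update can assign, which stabilizes training.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    # We maximize the surrogate objective, i.e. minimize its negative.
    return -min(ratio * advantage, clipped * advantage)

# With a positive advantage, pushing the ratio past 1 + eps earns
# nothing extra -- the incentive to take one giant policy step is capped.
small_step = clipped_surrogate_loss(0.1, 0.0, advantage=1.0)
big_step = clipped_surrogate_loss(2.0, 0.0, advantage=1.0)
```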

4. Distributed Orchestration

With trainer.nnodes=2, verl automatically shards the three remaining logical components (Actor, Ref, Reward) across all 16 GPUs. By removing the Critic, we reduce the communication overhead and memory pressure on the NCCL and RDMA fabric.

What Success Looks Like

Verl is verbose, providing a detailed pulse on both model alignment and hardware efficiency. When your GPU cluster and job are running (verl will take a little while to initialize first; you can follow those logs too), you will start to see a progress update every step that looks something like this (among a bunch of other metrics):

(TaskRunner pid=2860) step:32 - global_seqlen/mean:37335.0 - actor/entropy:0.071 - actor/grad_norm:0.082 - critic/score/mean:0.959 - response_length/mean:184.58 - timing_s/step:46.59 - perf/throughput:801.42 - perf/mfu/actor:0.31

Navigating the Metrics

While the logs provide dozens of parameters, here are the key metrics for you to look out for:

Algorithm Health

  • critic/score/mean: Your north star. This is the average reward (accuracy) across your prompt groups. Seeing this near 1.0 indicates the model is successfully solving the majority of the reasoning tasks in the batch.
  • actor/grad_norm: A vital health check for stability. If this spikes suddenly, it indicates the model is experiencing gradient explosions. A steady, low value like this suggests smooth, stable learning.
  • actor/entropy: The measure of creativity. A slow, steady decline is normal. If this hits zero, the model has collapsed into repeating a single “safe” answer.

Efficiency & Timing

  • perf/throughput: Total tokens processed per second. This is your primary ROI metric for the GPU hardware.
  • perf/mfu/actor: Model FLOPs Utilization, showing how efficiently the actor is using your GPUs’ available FLOPs. A higher MFU means your training is less bottlenecked on non-GPU operations and you are getting better value out of your hardware.
  • timing_s/step: The total wall-clock time per training step (or RL loop).
  • perf/max_memory_allocated_gb: This shows how much VRAM you are using, giving you a clear picture of how much headroom you have left to increase your batch size or model parameters. This will be useful if you are adapting this tutorial to other hardware.
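If you want to track these over time without a logger backend, the key:value format of the progress line is easy to scrape. A small illustrative helper (it assumes the exact format of the sample line shown earlier):

```python
import re

def parse_verl_step(line: str) -> dict:
    """Extract 'key:value' numeric metric pairs from a verl progress line."""
    return {key: float(value)
            for key, value in re.findall(r"([\w/]+):(-?\d+(?:\.\d+)?)", line)}

# Sample progress line as printed by the TaskRunner.
line = ("(TaskRunner pid=2860) step:32 - global_seqlen/mean:37335.0 - "
        "actor/entropy:0.071 - critic/score/mean:0.959 - "
        "timing_s/step:46.59 - perf/throughput:801.42")
metrics = parse_verl_step(line)
```

Piping verl_demo.log through this gives you one dict per step that you can plot or diff across runs.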

Visualization

Here is a plotted graph for some of the metrics showing training over 58 steps on 16x B200 on GCP via spot instances.

While we’re not focused on the science for this blog, our indication that things are working is that critic/score/mean (training accuracy) climbs from 0% to well over 90% after the first few steps, showing that our model is learning from our dataset. We likely didn’t even need to train for so long! We can also see steady throughput and step time once the initial warmup step is out of the way. These metrics provide a great baseline for comparing training performance across frameworks, hardware, and configurations.


Conclusion

By combining the elasticity of GKE, the high-performance throughput of the latest generation GPUs on GCP, and the specialized LLM-RL optimization of verl, you can significantly reduce the wall-clock time required to align your models. This architecture allows you to focus on your RL logic while the underlying platform handles the heavy lifting of distributed orchestration.
