Reinforcement Learning (RL) has moved from academic research to the center of the generative AI revolution. At its core, RL is about teaching a model through experience, exploration, and feedback rather than static imitation alone. While pre-training teaches a model what to say, Reinforcement Learning from Human Feedback (RLHF) teaches it how to be helpful, safe, and logical. It is the bridge between a raw, unpredictable base model and a refined, conversational assistant.
The Efficiency of GRPO
In this blog, we focus on Group Relative Policy Optimization (GRPO). While Proximal Policy Optimization (PPO) has been the industry standard, GRPO (popularized by DeepSeek) offers a more memory-efficient alternative for LLM alignment.
GRPO simplifies the RL pipeline by removing the Critic model entirely. In traditional PPO, a Critic network predicts rewards to create a baseline. GRPO instead generates a group of multiple responses for the same prompt. It uses the average reward of that specific group as the baseline.
- If a response is better than the group average: It receives a positive advantage signal.
- If it's worse than the group average: It receives a negative signal.
By eliminating the Critic, we reduce the memory footprint by nearly 50%, allowing us to train larger models (like Qwen-32B) or use much larger batch sizes on the same hardware.
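To make the mechanics concrete, here is a minimal sketch of the group-relative advantage (ours, not verl's actual implementation): each response's reward is compared against its own group's mean, normalized by the group's standard deviation as in the GRPO formulation.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one prompt's group of rollout rewards.

    Each response is scored relative to its group's mean reward,
    normalized by the group's standard deviation.
    """
    baseline = mean(rewards)
    scale = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - baseline) / (scale + eps) for r in rewards]

# A group of 4 rollouts for the same prompt: two correct (1.0), two wrong (0.0).
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advs)  # correct answers get a positive advantage, wrong ones negative
```

Note that the advantages within a group always sum to (approximately) zero: the group is its own baseline, which is exactly what lets GRPO drop the Critic.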
Enter verl on GKE
That's where verl (Volcano Engine Reinforcement Learning) comes in. verl is a high-performance framework designed specifically to handle the complex memory and compute patterns of LLM-based RL.
In this guide, we'll walk you through orchestrating a sophisticated GRPO training pipeline on Google Kubernetes Engine (GKE). We will leverage the raw power of NVIDIA B200 GPUs and the distributed flexibility of Ray to turn a complex RL setup into a scalable, manageable workflow. To better understand the moving pieces, see the following from the verl documentation:
- GRPO implementation: https://verl.readthedocs.io/en/latest/algo/grpo.html
- HybridFlow programming guide: https://verl.readthedocs.io/en/latest/hybrid_flow.html
See also the diagram below, from their documentation, of how the RL loop functions for GRPO.
The code and artifacts that we walk through today are also available in this GitHub repository: https://github.com/esaaren/verl-on-gke/tree/main
What We Are Building
We are setting up a distributed training environment where data, model weights, and the training engine are decoupled for maximum efficiency.
Key Components:
- Infrastructure: A GKE cluster featuring NVIDIA's latest B200 accelerators.
- Orchestration: KubeRay to manage a distributed Ray cluster.
- Storage: GCS Fuse to mount a Google Cloud Storage bucket directly across all nodes, ensuring seamless access to models and datasets.
- Engine: verl, which optimizes the communication and sharding for the GRPO rollout and update phases.
Our architecture leverages a GKE-managed Ray cluster optimized for large-scale RL. The diagram here shows the two workers (VMs) we will use in the tutorial, but the setup is extensible to hundreds of workers.
The compute backbone consists of Ray worker nodes (using H200/B200 GPUs), where each node orchestrates a high-performance feedback loop: vLLM engines handle the generation of multiple response candidates (the "Group" in GRPO), while verl actors manage the policy training and rollout coordination across the 8-GPU topology. Our model is sharded across workers with FSDP.
We can also see the dedicated RDMA swim lanes we build: each NIC (Network Interface Card, one per GPU) has its own subnet and swim lane to facilitate GPU-to-GPU communication. A subnet per GPU NIC is a requirement for RDMA on GCP. There are also subnets for the CPU NICs, which handle data center networking (for example, reading from Cloud Storage). For more details on the foundational RDMA setup and how it works, refer to the previous tutorial we linked.
0. Building the infrastructure
To keep this tutorial short, we will be drawing on the public RDMA documentation:
https://docs.cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute-custom#create-cluster
Simply follow the documentation above to get a working cluster with either B200 or H200 GPUs and RDMA set up. You can also follow the GitHub repository for this tutorial (https://github.com/esaaren/verl-on-gke/tree/main). The steps in this blog use Spot capacity and B200s because that is what was available to us, but you are free to modify this depending on your situation and what you have access to. Be mindful that small adjustments may be required in some of the config depending on how far you stray from this example. H200 on Spot should be fully compatible with this blog aside from small changes in naming conventions and one NCCL config (which we point out later).
If you follow the public RDMA documentation, you will need a few additional pieces to make this tutorial work (the GitHub repository helps set most of this up for you). You can keep the same generic export values below, or modify them to your liking.
Exports
export REGION="asia-southeast1"
export ZONE="asia-southeast1-b"
export PROJECT=""
export KSA_NAME="generic-ksa"
export GSBUCKET="generic-lab-testing"
export NAMESPACE='default'
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT} --format="value(projectNumber)")
# Your Hugging Face token for accessing models/datasets
export HF_TOKEN=""
Creation of a service account
kubectl create serviceaccount ${KSA_NAME} --namespace ${NAMESPACE}
Creation of a storage bucket
gcloud storage buckets create gs://${GSBUCKET} --location=${REGION} --enable-hierarchical-namespace --uniform-bucket-level-access
Creation of the HF secret and giving our KSA permission to use the bucket we created
kubectl create secret generic hf-secret \
--from-literal=hf_api_token=${HF_TOKEN}
gcloud storage buckets add-iam-policy-binding gs://${GSBUCKET} \
--member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${KSA_NAME}" \
--role "roles/storage.objectUser"
Creation of the GKE storage objects to enable gcsFuse to work
apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-bucket-pv
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 768Gi
  persistentVolumeReclaimPolicy: Delete
  storageClassName: gcsfuse-sc
  mountOptions:
    - implicit-dirs
    - metadata-cache:negative-ttl-secs:0
    - metadata-cache:ttl-secs:0
    - metadata-cache:stat-cache-max-size-mb:-1
    - metadata-cache:type-cache-max-size-mb:-1
    - file-cache:max-size-mb:-1
    - file-cache:cache-file-for-range-read:true
    - file-cache:enable-parallel-downloads:true
    - read_ahead_kb=1024
    - write:enable-streaming-writes:true
    - write:global-max-blocks:200000
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeHandle: generic-lab-testing
    volumeAttributes:
      skipCSIBucketAccessCheck: "true"
      gcsfuseMetadataPrefetchOnMount: "true"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-bucket-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 768Gi
  storageClassName: gcsfuse-sc
Hopefully, after all of these steps, you now have a working foundation to build on. You should have a GKE cluster built with Google's RDMA stack and a Google Cloud Storage bucket that your cluster can access.
1. Preparing the Data and Models
Before we launch our Ray cluster or any jobs, we need to ensure our "workspace" (the GCS bucket) is ready. We just created it above, so now we can upload our data and code into it.
The Source Code
Clone the verl repository and upload it to the bucket we created earlier; you can do this from your local environment. The worker pods will see it at /data/verl via the GCS Fuse mount.
The Model Weights
Download the Qwen2.5-32B-Instruct weights using the Hugging Face CLI. You can do this locally and upload them to your bucket under $GSBUCKET/huggingface_cache. Here are some commands for how this might look (these will change based on your local environment, so copy and paste won't work here):
curl -LsSf https://hf.co/cli/install.sh | bash
hf download Qwen/Qwen2.5-32B-Instruct
gcloud storage cp -r /Users/$USER/.cache/huggingface/hub/models--Qwen--Qwen2.5-32B-Instruct gs://generic-lab-testing/huggingface_cache/hub
The Dataset
We'll use the GSM8K dataset (grade school math word problems).
Why GSM8K?
GSM8K consists of high-quality, grade-school-level math word problems that require 2 to 8 steps of reasoning to solve. We use this dataset because:
- Chain-of-Thought (CoT) Requirement: The model cannot simply know the answer; it must generate a sequence of intermediate steps (the "reasoning path") to arrive at the correct final solution.
- Verifiable Rewards (The Efficiency Hack): Unlike creative writing, math has ground truth. This allows us to use a Rule-based Reward Manager rather than a heavy, neural Reward Model (which is typically another LLM).
GRPO + Math = Maximum Efficiency. By using math, we eliminate the need for a Critic model and a second neural Reward Model. We aren't scoring vibes or helpfulness; we are scoring correctness. If the final answer parsed from the CoT matches the ground truth, it gets a 1.0. If not, it gets a 0.0.
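A rule-based reward for GSM8K can be only a few lines. The sketch below is illustrative, not verl's shipped scorer (verl includes its own GSM8K reward function); it relies on the GSM8K convention that solutions end with "#### <answer>", falling back to the last number in the text.

```python
import re

def gsm8k_rule_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the final answer in the response matches
    the ground truth, else 0.0. We first look for the GSM8K-style
    '#### <answer>' marker, then fall back to the last number found."""
    match = re.search(r"####\s*(-?[\d,\.]+)", response)
    if match is None:
        numbers = re.findall(r"-?\d[\d,]*\.?\d*", response)
        if not numbers:
            return 0.0
        answer = numbers[-1]
    else:
        answer = match.group(1)
    # Normalize thousands separators and trailing periods before comparing.
    return 1.0 if answer.replace(",", "").rstrip(".") == ground_truth else 0.0

print(gsm8k_rule_reward("She sold 48 + 24 = 72 clips. #### 72", "72"))  # 1.0
print(gsm8k_rule_reward("The answer is 71.", "72"))                     # 0.0
```

Because the reward is a pure function of strings, scoring a whole group of rollouts is trivially parallel and costs essentially nothing compared to a neural Reward Model forward pass.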
Our Training Goal: Aligning the model's reasoning
Our objective isn't just to make the model smarter at math; it's to use GRPO to align its internal reasoning process.
During the Rollout phase, the model generates a group of n different responses for the same problem.
- The Good: A clear, step-by-step derivation leading to the correct answer.
- The Lucky: A fluke guess with nonsensical logic but a correct final digit.
- The Bad: A logical mess that ends in the wrong answer.
By applying GRPO, we compare these responses against each other. Even if several are correct, the model is rewarded more for the paths that are consistently better than the group average. Over many iterations, the model learns that articulating its steps clearly is the most reliable way to maximize reward. This effectively penalizes fluke answers and forces the model to develop a robust internal reasoning chain.
Processing for the Bucket
To make this data digestible for verl, we'll use a preprocessing script (found in the verl repository) to convert the raw JSON/Hugging Face data into structured Parquet files.
git clone https://github.com/verl-project/verl.git && cd verl
python3 examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k
gcloud storage cp -r ~/data/gsm8k gs://${GSBUCKET}/
This script splits the data into train.parquet and test.parquet. We can then upload these directly to our bucket so they become instantly available to every GPU worker in our future Ray cluster, eliminating the need to download and pre-process the dataset on every single node at runtime.
2. Defining the Ray Cluster
First, install the KubeRay operator (you can also opt for the managed Ray operator as part of GKE):
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Install both CRDs and KubeRay operator v1.5.1.
helm install kuberay-operator kuberay/kuberay-operator --version 1.5.1
We use a RayCluster YAML to define our compute. Notice the use of a multi-NIC configuration (eth0 through eth9) to take advantage of high-speed RDMA networking on GCP. Previous tutorials will give a better understanding of why we do this.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: b200-ray-cluster
spec:
  rayVersion: '2.47.0'
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
      spec:
        serviceAccountName: generic-ksa
        nodeSelector:
          cloud.google.com/gke-nodepool: "default-pool"
        containers:
          - name: ray-head
            image: verlai/verl:vllm011.latest
            ports:
              - containerPort: 6379
                name: gcs-server
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
            resources:
              limits:
                cpu: "12"
                memory: "32G"
                ephemeral-storage: "9Gi"
              requests:
                cpu: "12"
                memory: "32G"
                ephemeral-storage: "9Gi"
            volumeMounts:
              - mountPath: /tmp/ray
                name: ray-logs
              - name: training-bucket-vol
                mountPath: /data
        volumes:
          - name: ray-logs
            emptyDir: {}
          - name: training-bucket-vol
            persistentVolumeClaim:
              claimName: training-bucket-pvc
  workerGroupSpecs:
    - replicas: 2
      minReplicas: 0
      maxReplicas: 4
      groupName: gpu-group
      rayStartParams:
        num-cpus: "220"
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            networking.gke.io/default-interface: 'eth0'
            networking.gke.io/interfaces: |
              [
                {"interfaceName":"eth0","network":"default"},
                {"interfaceName":"eth1","network":"gvnic-1"},
                {"interfaceName":"eth2","network":"rdma-0"},
                {"interfaceName":"eth3","network":"rdma-1"},
                {"interfaceName":"eth4","network":"rdma-2"},
                {"interfaceName":"eth5","network":"rdma-3"},
                {"interfaceName":"eth6","network":"rdma-4"},
                {"interfaceName":"eth7","network":"rdma-5"},
                {"interfaceName":"eth8","network":"rdma-6"},
                {"interfaceName":"eth9","network":"rdma-7"}
              ]
        spec:
          initContainers:
            - name: verl-setup
              image: verlai/verl:vllm011.latest
              command: ["/bin/bash", "-c"]
              args:
                - |
                  echo "Performing local editable install..."
                  cd /data/verl && pip3 install --no-deps -e .
              volumeMounts:
                - name: training-bucket-vol
                  mountPath: /data
          serviceAccountName: generic-ksa
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-b200
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
          containers:
            - name: ray-worker
              image: verlai/verl:vllm011.latest
              env:
                - name: LD_LIBRARY_PATH
                  value: /usr/local/nvidia/lib64
              resources:
                limits:
                  cpu: "220"
                  memory: "2800Gi"
                  nvidia.com/gpu: "8"
                  ephemeral-storage: "1000Gi"
                requests:
                  cpu: "220"
                  memory: "2800Gi"
                  nvidia.com/gpu: "8"
                  ephemeral-storage: "1000Gi"
              volumeMounts:
                - name: nvidia
                  mountPath: /usr/local/nvidia
                - name: gib
                  mountPath: /usr/local/gib
                - name: shared-memory
                  mountPath: /dev/shm
                - name: ray-tmp-storage
                  mountPath: /tmp
                - name: training-bucket-vol
                  mountPath: /data
          volumes:
            - name: gib
              hostPath:
                path: /home/kubernetes/bin/gib
            - name: nvidia
              hostPath:
                path: /home/kubernetes/bin/nvidia
            - name: lib64
              hostPath:
                path: /lib64
            - name: shared-memory
              emptyDir:
                medium: "Memory"
                sizeLimit: 250Gi
            - name: sys
              hostPath:
                path: /sys
            - name: proc-sys
              hostPath:
                path: /proc/sys
            - name: ray-tmp-storage
              emptyDir: {}
            - name: training-bucket-vol
              persistentVolumeClaim:
                claimName: training-bucket-pvc
Head Node
The head node manages the job and hosts the dashboard. It doesnât need a GPU but needs access to our code via the GCS bucket.
Worker Nodes
This is where the heavy lifting happens. We are requesting 2 nodes, each with 8 B200 GPUs.
Note: We use an initContainer to install verl from the mounted /data directory. This ensures every worker node has the exact same environment and dependencies. We could simplify this by just building our own Docker image and using that, but for our purposes this will suffice so we can pull a public image and reduce the number of steps for the tutorial.
3. Configuring the Runtime Environment
To make sure our workers can talk to each other and find the GPUs, we use a runtime-env.yaml. This file tells Ray which environment variables to set, such as NCCL_NET_PLUGIN for optimized GPU communication and the PYTHONPATH for our verl installation. The most critical components are the NCCL parameters. We can see this example yaml file below.
py_modules: ["."]
working_dir: "."
py_executable: "uv run"
setup_hook: runtime_env.uv_runtime_env_hook.hook
env_vars:
  PYTHONPATH: "/data/verl"
  LD_LIBRARY_PATH: "/usr/local/nvidia/lib64"
  NCCL_DEBUG: "INFO"
  NUM_WORKERS: "2"
  CPUS_PER_WORKER: "192"
  GPUS_PER_WORKER: "8"
  NCCL_NET_PLUGIN: "/usr/local/gib/lib64/libnccl-net_internal.so"
  NCCL_CROSS_NIC: "0"
  NCCL_NET_GDR_LEVEL: "PIX"
  NCCL_P2P_NET_CHUNKSIZE: "131072"
  NCCL_NVLS_CHUNKSIZE: "524288"
  NCCL_IB_ADAPTIVE_ROUTING: "1"
  NCCL_IB_QPS_PER_CONNECTION: "4"
  NCCL_IB_TC: "52"
  NCCL_IB_FIFO_TC: "84"
  NCCL_TUNER_CONFIG_PATH: "/usr/local/gib/configs/tuner_config_a4.txtpb"
  HF_HOME: "/data/huggingface_cache"
  GLOO_SOCKET_IFNAME: "eth0"
pip:
  packages:
    - torch
    - torchvision
Note: If you decided to use H200s for the tutorial, you must set the below NCCL configuration instead:
NCCL_TUNER_CONFIG_PATH: "/usr/local/gib/configs/tuner_config_a3u.txtpb"
This is ultimately the only difference between H200 and B200 on GCP aside from naming conventions.
4. Launching the GRPO Job
With the infrastructure in place, we submit the job from our local CLI using the Ray Jobs API. When you run the ray job submit command, your local machine communicates with the Ray head node, typically via a port-forward to localhost:8265 running in a separate terminal, e.g.:
kubectl port-forward svc/b200-ray-cluster-head-svc 8265:8265
Ray packages your runtime-env.yaml, uploads any necessary code, and schedules the main_ppo trainer across your GKE worker nodes. This push model ensures your training logic is executed in a consistent, distributed environment without you having to manually SSH into individual nodes.
Note on the Trainer: You'll notice the command below calls verl.trainer.main_ppo; verl treats GRPO as a specialized advantage estimator within its unified PPO pipeline. By setting algorithm.adv_estimator=grpo, the framework automatically bypasses the Critic initialization and switches to the group-relative logic.
Your local setup may vary, but here we are using uv to launch the job from our local terminal. Feel free to launch this any way you would like.
uv tool run --index-url https://pypi.org/simple ray -- job submit \
--address "http://localhost:8265" \
--runtime-env runtime-env.yaml \
-- \
bash -c "
# 3. Launch PPO Training
cd /data/verl && PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
data.train_files=/data/gsm8k/train.parquet \
data.val_files=/data/gsm8k/test.parquet \
data.train_batch_size=256 \
data.max_prompt_length=512 \
data.max_response_length=512 \
actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-5 \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=64 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=8 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.strategy=fsdp2 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.logger=console \
trainer.val_before_train=False \
trainer.n_gpus_per_node=8 \
trainer.nnodes=2 \
trainer.save_freq=10 \
trainer.test_freq=10 \
algorithm.adv_estimator=grpo \
actor_rollout_ref.rollout.n=8 \
trainer.total_epochs=2" 2>&1 | tee verl_demo.log
Note: If you are using H200 instead of B200, you will have less memory per GPU (the B200 has 192 GB vs the H200's 141 GB). These job arguments were run on B200. If you are using H200 and hit OOM (Out of Memory) issues, try lowering actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu or actor_rollout_ref.rollout.gpu_memory_utilization. You could also use a smaller size of Qwen2.5.
How the GRPO Pipeline Executes
While the command looks like a single execution, verl is orchestrating a highly efficient distributed lifecycle across your cluster. Even though verl uses a training loop familiar to PPO users, it adapts the logic to support the critic-less nature of GRPO.
Here is the lifecycle of a single GRPO training step:
1. Group Generation (The Rollout)
The Actor model (our Qwen instance) generates a group of responses for every prompt in the batch. Unlike standard RL, where you might generate one response per prompt, GRPO generates n responses (e.g., actor_rollout_ref.rollout.n=8). This creates a group for comparison.
2. Relative Scoring (No Critic Required)
This is where GRPO departs from the Actor-Critic (PPO) model. Instead of a Critic model estimating the value of a state:
- The Reward Model scores each of the n responses in the group.
- The Advantage is calculated by comparing each response's score to the mean reward of its group.
- KL Divergence is calculated against the Reference model to ensure the Actor doesn't cheat the reward system by drifting into gibberish.
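The KL term can be estimated directly from per-token log-probabilities. As a sketch (illustrating the idea rather than verl's exact configuration), here is a common low-variance estimator used in RLHF codebases, exp(r) - r - 1, where r is the reference-minus-actor log-prob gap:

```python
import math

def kl_penalty_per_token(actor_logps, ref_logps):
    """Low-variance per-token KL estimate: exp(r) - r - 1, with
    r = ref_logp - actor_logp. The estimate is always non-negative
    and exactly zero when the actor matches the reference."""
    kls = []
    for lp, ref_lp in zip(actor_logps, ref_logps):
        r = ref_lp - lp
        kls.append(math.exp(r) - r - 1.0)
    return kls

# Identical distributions -> zero penalty; drifted tokens -> positive penalty.
print(kl_penalty_per_token([-1.0, -2.0], [-1.0, -2.0]))  # [0.0, 0.0]
print(kl_penalty_per_token([-0.5, -3.0], [-1.0, -2.0]))
```

Scaled by algorithm.kl_ctrl.kl_coef (0.001 in the job below), this penalty nudges the Actor to stay close to the Reference wherever the reward gives it no reason to move.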
3. The Synchronized Update
The sampled trajectories are split into ppo_mini_batches. Even though we aren't using a Critic, we still use a clipped surrogate update. This ensures the Actor's weights are improved incrementally, preventing the drastic weight shifts that often destabilize RL training.
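For intuition, here is a scalar sketch of that clipped surrogate term (the real implementation operates on token-level tensors across the mini-batch; the 0.2 clip range is a conventional default, not necessarily this job's setting):

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped policy loss for a single action.

    The probability ratio is clipped to [1 - eps, 1 + eps] so that one
    update cannot move the policy too far from the policy that
    generated the rollouts.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    # We maximize the objective, i.e. minimize its negative.
    return -min(ratio * advantage, clipped * advantage)

# A large ratio gets clipped, capping the incentive to overshoot:
print(clipped_surrogate(logp_new=-0.1, logp_old=-1.0, advantage=1.0))  # -1.2
```

The only difference from textbook PPO is where the advantage comes from: here it is the group-relative signal rather than a Critic's value estimate.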
4. Distributed Orchestration
With trainer.nnodes=2, verl automatically shards the three remaining logical components (Actor, Ref, Reward) across all 16 GPUs. By removing the Critic, we reduce the communication overhead and memory pressure on the NCCL and RDMA fabric.
What Success Looks Like
verl is verbose, providing a detailed pulse on both model alignment and hardware efficiency. Once your GPU cluster and job are running (verl will take a little while to initialize first; you can follow those logs too), you will see a progress update every step that looks something like this (among a bunch of other metrics):
(TaskRunner pid=2860) step:32 - global_seqlen/mean:37335.0 - actor/entropy:0.071 - actor/grad_norm:0.082 - critic/score/mean:0.959 - response_length/mean:184.58 - timing_s/step:46.59 - perf/throughput:801.42 - perf/mfu/actor:0.31
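If you want to plot metrics straight from the console output (we use trainer.logger=console in this job), lines in this key:value format are easy to parse. A small sketch, assuming the " - "-separated format shown above:

```python
def parse_verl_step_line(line: str) -> dict:
    """Parse a verl console metrics line into {metric_name: float}."""
    # Drop the Ray actor prefix, e.g. "(TaskRunner pid=2860) ".
    if line.startswith("("):
        line = line.split(") ", 1)[1]
    metrics = {}
    for field in line.split(" - "):
        # Keys may contain "/", so split on the last ":" only.
        key, _, value = field.rpartition(":")
        metrics[key] = float(value)
    return metrics

line = ("(TaskRunner pid=2860) step:32 - global_seqlen/mean:37335.0 "
        "- actor/entropy:0.071 - critic/score/mean:0.959")
m = parse_verl_step_line(line)
print(m["critic/score/mean"])  # 0.959
```

Piping `ray job logs` through a parser like this gives you per-step time series without standing up a logging backend.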
Navigating the Metrics
While the logs provide dozens of parameters, here are the key metrics for you to look out for:
Algorithm Health
- critic/score/mean: Your north star. This is the average reward (accuracy) across your prompt groups. Seeing this near 1.0 indicates the model is successfully solving the majority of the reasoning tasks in the batch.
- actor/grad_norm: A vital health check for stability. If this spikes suddenly, it indicates the model is experiencing gradient explosions. A steady, low value like this suggests smooth, stable learning.
- actor/entropy: The measure of creativity. A slow, steady decline is normal. If this hits zero, the model has collapsed into repeating a single "safe" answer.
Efficiency & Timing
- perf/throughput: Total tokens processed per second. This is your primary ROI metric for the GPU hardware.
- perf/mfu/actor: Model FLOPs Utilization, showing how efficiently the GPUs' available FLOPs are being used by the actor. A higher MFU means your training is less bottlenecked on non-GPU operations and you are getting better value out of your hardware.
- timing_s/step: The total wall-clock time per training step (or RL loop).
- perf/max_memory_allocated_gb: This shows you how much VRAM you are pushing. This gives you a clear picture of how much headroom you have left to increase your batch size or model parameters. This will be useful if you are modifying this tutorial to other hardware.
Visualization
Here is a plotted graph for some of the metrics showing training over 58 steps on 16x B200 on GCP via spot instances.
While we're not focused on the science for this blog, our indication that things are working is that the critic/score/mean (training accuracy) increases from 0% to well over 90% within the first few steps, showing that the model is learning from our dataset. We likely didn't even need to train for so long! We can also see a steady throughput and a steady step time after the initial warmup step. These metrics provide a great baseline for comparing training performance across frameworks, hardware, and configurations.
Conclusion
By combining the elasticity of GKE, the high-performance throughput of the latest generation GPUs on GCP, and the specialized LLM-RL optimization of verl, you can significantly reduce the wall-clock time required to align your models. This architecture allows you to focus on your RL logic while the underlying platform handles the heavy lifting of distributed orchestration.


