The evolution of Large Language Models from simple next-token predictors to sophisticated reasoning engines has been driven significantly by advances in post-training techniques. While pre-training establishes the foundational capabilities of an LLM, it is the post-training phase that transforms these models into aligned, helpful, and capable AI assistants.
Reinforcement Learning (RL) sits at the heart of modern post-training pipelines. Unlike supervised learning, which requires labeled examples for every desired behavior, RL enables models to learn through trial and error, receiving scalar reward signals that guide policy optimization. This paradigm has proven particularly effective for alignment (RLHF) and reasoning enhancement (RLVR).
This blog focuses on implementing Group Relative Policy Optimization (GRPO) using NVIDIA's NeMo RL framework on Google Kubernetes Engine. Our target environment leverages Ray for distributed orchestration and Managed Lustre for high-throughput storage, creating a production-ready infrastructure for scalable RL training.
GitHub repository for the blog: GitHub - pmotgi/nemo-rl-on-gke
Understanding Reinforcement Learning for LLMs
The RL Loop
RL for LLMs follows a continuous feedback loop that combines elements of both training and inference:
- Generation: The LLM (policy) generates one or more responses to a given prompt
- Evaluation: A reward model or verifiable reward function assigns a score to each output
- Optimization: An RL algorithm uses reward signals to update the LLM's parameters
- Iteration: The process repeats with the updated policy
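The loop above can be sketched end-to-end with toy stand-ins: a biased coin plays the policy and an exact-match check plays the reward. The function names are illustrative, not a NeMo RL API.

```python
import random

def generate(prompt: str, policy_bias: float) -> str:
    # Generation: the "policy" samples a response; bias controls how
    # often it answers correctly (stand-in for LLM sampling).
    return "4" if random.random() < policy_bias else "5"

def evaluate(response: str) -> float:
    # Evaluation: a verifiable reward scores the output.
    return 1.0 if response == "4" else 0.0

def optimize(policy_bias: float, reward: float, lr: float = 0.05) -> float:
    # Optimization: nudge the policy toward rewarded behavior
    # (stand-in for a gradient update).
    return min(1.0, policy_bias + lr * reward)

def run_loop(steps: int = 200, seed: int = 0) -> float:
    random.seed(seed)
    policy_bias = 0.2
    for _ in range(steps):                          # Iteration
        response = generate("2+2=?", policy_bias)   # Generation
        reward = evaluate(response)                 # Evaluation
        policy_bias = optimize(policy_bias, reward) # Optimization
    return policy_bias

print(f"final policy bias: {run_loop():.2f}")
```

Because the update only ever reinforces rewarded behavior, the bias ratchets upward toward 1.0 over repeated iterations, which is the essence of the feedback loop.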
Key RL Paradigms
- RLHF (Reinforcement Learning from Human Feedback): Uses a learned reward model trained on human preference data to provide reward signals. This approach was foundational for models like ChatGPT and InstructGPT.
- RLVR (Reinforcement Learning from Verifiable Rewards): Uses programmatic, rule-based rewards where correctness can be objectively verified. This is particularly effective for mathematical reasoning and code generation tasks where outputs can be checked against ground truth.
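A verifiable reward can be as small as a string or number comparison. Below is a minimal illustrative verifier; NeMo RL's actual reward implementations handle answer formats far more robustly.

```python
def math_reward(response: str, ground_truth: str) -> float:
    """Rule-based verifiable reward: 1.0 if the last token of the
    response matches the ground-truth answer, else 0.0 (illustrative
    sketch, not NeMo RL's reward implementation)."""
    if not response.strip():
        return 0.0
    # Treat the last whitespace-separated chunk as the final answer.
    answer = response.strip().split()[-1].rstrip(".")
    try:
        return 1.0 if float(answer) == float(ground_truth) else 0.0
    except ValueError:
        return 1.0 if answer == ground_truth else 0.0

print(math_reward("The answer is 42.", "42"))  # 1.0
```

Because the check is programmatic, no human labeling or learned reward model is needed, which is what makes RLVR attractive for math and code tasks.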
NVIDIA NeMo RL Framework
NeMo RL is NVIDIA's open-source post-training library designed for scalable reinforcement learning. Part of the broader NeMo Framework ecosystem, it enables both small-scale experiments on a single GPU and multi-node deployments across thousands of GPUs.
GKE Architecture for Running NeMo RL
GKE provides the orchestration layer for our RL training infrastructure. GKE's integration with NVIDIA GPUs, Ray, and high-performance storage makes it ideal for demanding RL workloads.
Architecture Overview
The infrastructure consists of three layers: the Compute Layer with GPU node pools (A3/A4 with B200 GPUs), the Orchestration Layer powered by GKE with Kueue and JobSet for scheduling, and the Framework Layer running NeMo RL with Ray as the distributed orchestrator.
Storage: Managed Lustre
Managed Lustre on GCP, powered by DDN EXAScaler, provides the high-performance parallel file system essential for RL workloads. It delivers sub-millisecond latency, up to 1 TB/s throughput, and scales from terabytes to petabytes. For RL specifically, Managed Lustre excels at:
Training data loading: Fast access to datasets for prompt sampling
Checkpointing: Up to 15x faster than other storage solutions, minimizing GPU idle time
Model loading: Rapid weight distribution across nodes during policy updates
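A rough back-of-envelope shows why checkpoint write speed translates directly into GPU idle time. All sizes and the sustained throughput below are assumptions for illustration, not measured figures.

```python
def checkpoint_seconds(num_params: float, bytes_per_param: float,
                       throughput_gb_s: float) -> float:
    """Time to write a checkpoint at a sustained throughput (GB/s)."""
    return num_params * bytes_per_param / 1e9 / throughput_gb_s

# Llama 3.1-8B: ~2 bytes/param for bf16 weights, plus roughly 8 more
# bytes/param once fp32 Adam optimizer moments are included
# (assumed sizes, for illustration only).
weights_only = checkpoint_seconds(8e9, 2, 20)   # 16 GB at an assumed 20 GB/s
with_optim = checkpoint_seconds(8e9, 10, 20)    # 80 GB at an assumed 20 GB/s
print(f"{weights_only:.1f}s weights-only, {with_optim:.1f}s with optimizer state")
```

Since RL checkpoints include optimizer and dataloader state and are written frequently between rollout phases, a parallel file system that sustains high write throughput keeps the "checkpointing" slice of each step small.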
Ray on GKE
Ray serves as the distributed computing framework that coordinates NeMo RLβs workers. The KubeRay operator simplifies Ray cluster management within Kubernetes, with native integration for GKE monitoring and logging.
Implementation Guide: GRPO with Llama 3.1-8B
This section provides the technical implementation steps for running GRPO training on GKE using the NeMo RL framework. The complete recipe and scripts are available in the referenced GitHub repository.
This RL training pipeline is deployed on a Ray cluster on Google Kubernetes Engine (GKE) and structured for NeMo RL using the Group Relative Policy Optimization (GRPO) algorithm. In this setup, the Ray head node acts as the GRPO coordinator and driver, managing two distributed worker nodes. Node1 handles generation and training of the policy model via Policy Workers and Gen Workers, using sharded parameters and vLLM inference across its GPU array. In parallel, Node2 evaluates the policy's responses via Reward Workers and Reference Workers, hosting the reward and reference models on a separate GPU array. During training, Node1 sends policy outputs ("Responses") to Node2, which computes and returns the associated rewards and reference log probabilities ("Rewards"). This highly scalable architecture decouples model evaluation from generation for distributed training efficiency, using a Managed Lustre/PVC storage layer to load and save model checkpoints, reference weights, data, and logs via shared volume mounts.
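The rewards returned by Node2 are turned into group-relative advantages, the core of GRPO: each response sampled for a prompt is normalized against its own group's statistics, which removes the need for a separate learned critic model. A minimal sketch of that computation (illustrative, not NeMo RL's implementation):

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages for one prompt's group of sampled
    responses: standardize each reward against the group mean/std."""
    mean = statistics.mean(rewards)
    # Guard against a zero std when every response scored the same.
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

# 4 sampled responses to one prompt: two correct (1.0), two wrong (0.0)
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

Responses that beat their group's average get positive advantages and are reinforced; responses below it are suppressed, all without ever training a value model.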
Prerequisites
- GKE cluster (version 1.31.7 or later) with a GPU node pool. To get started, follow the steps documented in GitHub - esaaren/torch-distributed-training-gke: A simple recipe for deploying a distributed training job on GKE on GCP with DWS or Spot H200/B200 GPUs. Following that tutorial yields a working cluster with B200s and RDMA configured. The cluster comes with options for Spot and DWS Flex capacity; this tutorial uses Spot for simplicity and hardware obtainability. The same tutorial also introduces networking fundamentals (such as RDMA) that are critical for RL.
- This article uses a node pool of a4-highgpu-8g machines with NVIDIA B200 GPUs. For detailed information, see this previously published deep dive into A4 High machines: Tutorial: Making high performance LLM training easy on Google Cloud Platform
- Workload Identity Federation enabled
- Lustre CSI driver enabled for GKE
- Managed Lustre instance provisioned with PV, PVCs - Access existing Managed Lustre instances on GKE using the Managed Lustre CSI driver | Google Kubernetes Engine (GKE) | Google Cloud Documentation
Example Details:
Model
NeMo RL supports both LLMs and VLMs, including models from the Llama, Qwen, and Gemma (1B and 27B variants) families, as well as Mistral, DeepSeek-V3, and GPT-OSS. For VLMs, NeMo RL supports Qwen2.5-VL-3B and SmolVLM2-2.2B. We ran experiments on LLMs such as Llama 3.1-8B and Qwen2-1.5B; this example walks through the implementation steps for the Llama 3.1-8B model.
Dataset
NeMo RL supports various datasets such as aime24, clevr, dapo_math, deepscaler, geometry3k, helpsteer3, oai_format_dataset, oasst, openmathinstruct2, refcoco, squad, and tulu3. For this implementation we will use the DeepScaler dataset.
The DeepScaleR-Preview-Dataset by agentica-org is a training dataset designed to enhance mathematical reasoning in large language models. Hosted on Hugging Face, it contains approximately 40,000 unique problem-answer pairs compiled from high-difficulty sources including AIME (1984–2023), AMC, Omni-MATH, and the STILL dataset. This collection served as the foundation for training the DeepScaleR-1.5B-Preview model, which used reinforcement learning (specifically Group Relative Policy Optimization) to achieve reasoning performance comparable to much larger models.
Hugging Face link: agentica-org/DeepScaleR-Preview-Dataset
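Each record pairs a problem with a verifiable final answer, which is exactly what GRPO's rule-based reward needs. A sketch of shaping one record into a prompt/ground-truth pair; the "problem" and "answer" field names are assumed from the dataset card, so verify them against the actual schema before relying on this.

```python
def to_grpo_example(record: dict) -> dict:
    """Map a DeepScaleR-style record into a chat prompt plus the
    ground-truth answer the verifier will compare against
    (field names assumed, for illustration)."""
    return {
        "messages": [{"role": "user", "content": record["problem"]}],
        "ground_truth": str(record["answer"]),
    }

sample = {"problem": "What is 3^2 + 4^2?", "answer": 25}
print(to_grpo_example(sample)["ground_truth"])  # 25
```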
Step 1: Enable Ray on GKE or Install KubeRay Operator
To enable Ray on a running GKE cluster, follow the steps documented in this blog:
**If you want to manually deploy the KubeRay operator, follow the steps below:**
Install KubeRay operator for Ray cluster management
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator
Verify installation
kubectl get pods | grep kuberay-operator
Step 2: Configure Environment
Set environment variables
export PROJECT_ID=<YOUR_PROJECT_ID>
export CLUSTER_REGION=<YOUR_REGION>
export CLUSTER_NAME=<YOUR_CLUSTER_NAME>
export HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>
Get cluster credentials
gcloud container clusters get-credentials $CLUSTER_NAME \
--region $CLUSTER_REGION
Step 3: Launch Ray Cluster
The Ray cluster serves as the backbone for distributed RL training, managing worker processes for policy training, generation, and environment interaction.
Clone the gpu-recipes repository
git clone https://github.com/pmotgi/nemo-rl-on-gke
cd nemo-rl-on-gke
Configure the values.yaml for the Ray cluster with the necessary RDMA interfaces and Lustre PVCs
image:
repository: "nvcr.io/nvidia/nemo-rl"
tag: "v0.4.0"
pullPolicy: Always
nameOverride: "kuberay"
fullnameOverride: ""
common:
containerEnv: {}
configMap:
fluentbit:
data:
fluent-bit.conf: |
[INPUT]
Name tail
Path /tmp/ray/session_latest/logs/worker-*
Tag ray-worker
[INPUT]
Name tail
Path /tmp/ray/session_latest/logs/raylet*
Tag raylet
[INPUT]
Name tail
Path /tmp/ray/session_latest/logs/*
Exclude_Path /tmp/ray/session_latest/logs/debug_state.txt,/tmp/ray/session_latest/logs/raylet*,/tmp/ray/session_latest/logs/worker-*
Tag ray-misc
[OUTPUT]
Name stackdriver
Match *
resource gce_instance
labels_key labels
# --- Head Node Configuration ---
head:
enableInTreeAutoscaling: false
serviceAccountName: ""
rayStartParams:
dashboard-host: '0.0.0.0'
template:
metadata:
annotations:
gke-gcsfuse/volumes: "true"
networking.gke.io/default-interface: 'eth0'
containerEnv:
- name: RAY_GROUP
value: "head"
resources:
limits:
cpu: "206"
memory: "500G"
nvidia.com/gpu: 1
requests:
cpu: "206"
memory: "500G"
nvidia.com/gpu: 1
volumeMounts:
- mountPath: /data
name: lustre-data
volumes:
- name: log-volume
emptyDir: {}
- name: fluentbit-config-volume
configMap:
name: "ray-cluster-kuberay-fluentbit-config"
- name: lustre-data
persistentVolumeClaim:
claimName: lustre-pvc
sidecarContainers:
- name: fluent-bit
image: fluent/fluent-bit:latest
env:
- name: RAY_GROUP
value: "head"
volumeMounts:
- name: fluentbit-config-volume
mountPath: /fluent-bit/etc/
- mountPath: /tmp/ray
name: log-volume
# --- HEAD POD STARTUP SCRIPT ---
command:
- "bash"
- "-c"
- |
set -ex
echo "--- Head Pod Setup ---"
apt-get update
apt-get install -y sudo netcat-openbsd pciutils
cd /opt/nemo-rl
/usr/bin/python -m pip install uv
/usr/bin/python -m uv venv
echo "Head pod setup complete. Starting Ray..."
exec ${KUBERAY_GEN_RAY_START_CMD}
args: []
headService: {}
# --- Default Worker (Disabled) ---
worker:
disabled: true
# --- A4 GPU Worker Groups ---
additionalWorkerGroups:
worker-grp-0:
disabled: false
replicas: 2
annotations:
networking.gke.io/default-interface: 'eth0'
networking.gke.io/interfaces: |
[
{"interfaceName":"eth0","network":"default"},
{"interfaceName":"eth1","network":"gvnic-1"},
{"interfaceName":"eth2","network":"rdma-0"},
{"interfaceName":"eth3","network":"rdma-1"},
{"interfaceName":"eth4","network":"rdma-2"},
{"interfaceName":"eth5","network":"rdma-3"},
{"interfaceName":"eth6","network":"rdma-4"},
{"interfaceName":"eth7","network":"rdma-5"},
{"interfaceName":"eth8","network":"rdma-6"},
{"interfaceName":"eth9","network":"rdma-7"}
]
containerEnv:
- name: RAY_GROUP
valueFrom:
fieldRef:
fieldPath: metadata.labels['ray.io/group']
- name: NCCL_NET
value: "gIB"
- name: NCCL_IB_GID_INDEX
value: "3"
- name: GLOO_SOCKET_IFNAME
value: "eth0"
- name: NCCL_CROSS_NIC
value: "0"
- name: NCCL_SOCKET_IFNAME
value: "eth0"
- name: TP_SOCKET_IFNAME # Specific to DTensor/PyTorch Distributed
value: "eth0"
- name: NCCL_TUNER_CONFIG_PATH
value: "/usr/local/gib/configs/tuner_config_a4.txtpb"
- name: NCCL_NET_GDR_LEVEL
value: "PIX"
- name: LD_LIBRARY_PATH
value: /usr/local/nvidia/lib64
resources:
limits:
nvidia.com/gpu: 8
cpu: "206"
memory: "500Gi"
requests:
nvidia.com/gpu: 8
cpu: "206"
memory: "500Gi"
nodeSelector:
cloud.google.com/gke-accelerator: nvidia-b200
tolerations:
- operator: "Exists"
key: "nvidia.com/gpu"
- operator: "Exists"
key: "cloud.google.com/impending-node-termination"
- operator: "Exists"
key: "user-workload"
securityContext:
privileged: true
volumes:
- name: log-volume
emptyDir: {}
- name: shared-memory
emptyDir:
medium: "Memory"
sizeLimit: 240Gi
- name: ray-tmp
emptyDir:
medium: "Memory"
- name: fluentbit-config-volume
configMap:
name: "ray-cluster-kuberay-fluentbit-config"
- name: nvidia-install-dir-host
hostPath:
path: /home/kubernetes/bin/nvidia
- name: gib-nccl-plugin-volume
hostPath:
path: /home/kubernetes/bin/gib
- name: lustre-data
persistentVolumeClaim:
claimName: lustre-pvc
volumeMounts:
- mountPath: /tmp/ray
name: log-volume
- name: shared-memory
mountPath: /dev/shm
- name: nvidia-install-dir-host
mountPath: /usr/local/nvidia
- name: gib-nccl-plugin-volume
mountPath: /usr/local/gib
- mountPath: /data
name: lustre-data
# --- WORKER POD STARTUP SCRIPT ---
command:
- "bash"
- "-c"
- |
set -ex
echo "--- Worker Pod Setup ---"
apt-get update
apt-get install -y sudo netcat-openbsd pciutils
cd /opt/nemo-rl
/usr/bin/python -m pip install uv
/usr/bin/python -m uv venv
ldconfig /usr/local/nvidia/lib64/
ldconfig -p | grep libcuda | sed 's/^/ /'
export LD_LIBRARY_PATH="/usr/local/gib/lib64:$LD_LIBRARY_PATH"
source /usr/local/gib/scripts/set_nccl_env.sh
echo "Worker pod setup complete. Starting Ray..."
exec ${KUBERAY_GEN_RAY_START_CMD}
sidecarContainers:
- name: fluent-bit
env:
- name: RAY_GROUP
valueFrom:
fieldRef:
fieldPath: metadata.labels['ray.io/group']
image: fluent/fluent-bit:latest
volumeMounts:
- name: fluentbit-config-volume
mountPath: /fluent-bit/etc/
- mountPath: /tmp/ray
name: log-volume
# --- Service Config ---
service:
type: ClusterIP
Verify and edit the contents of launcher.sh – the Helm-based installation of your accelerated Ray cluster
#!/bin/bash
REPLICA_COUNT=2
helm install ray-cluster "<ABSOLUTE_PATH_TO_NeMo-RL-on-GKE>/nemo-rl-on-gke" \
--set values.additionalWorkerGroups.worker-grp-0.replicas=$REPLICA_COUNT
Launch Ray cluster using provided script
source launcher.sh
Verify the Ray installation
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
ray-cluster-kuberay-head-sw7dp 3/3 Running 0 33h
ray-cluster-kuberay-worker-grp-0-worker-gkbxw 3/3 Running 0 33h
ray-cluster-kuberay-worker-grp-0-worker-kdg62 3/3 Running 0 33h
$ kubectl ray get clusters
NAME NAMESPACE DESIRED WORKERS AVAILABLE WORKERS CPUS GPUS TPUS MEMORY CONDITION STATUS AGE
ray-cluster-kuberay default 2 2 618 17 0 1573741824k RayClusterProvisioned ready 33h
Step 3.2: Establish a secure local connection to Ray before launching training
The kubectl ray session command establishes a secure network tunnel (port forwarding) from your local machine directly to the Ray head pod inside your Kubernetes cluster. This lets you access the web-based Ray Dashboard to monitor jobs and submit Python scripts interactively from your laptop, as if the remote cluster were running locally, without configuring complex networking or logging into the remote server manually.
$ kubectl ray session ray-cluster-kuberay
Forwarding ports to service ray-cluster-kuberay-head-svc
Ray Dashboard: http://localhost:8265
Ray Interactive Client: http://localhost:10001
Forwarding from 127.0.0.1:8265 -> 8265
Forwarding from [::1]:8265 -> 8265
Forwarding from 127.0.0.1:10001 -> 10001
Forwarding from [::1]:10001 -> 10001
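With the tunnel active, the Ray Jobs API is reachable at http://localhost:8265, so you can also drive the cluster programmatically instead of exec-ing into the head pod (as Step 5 does). A sketch using Ray's job-submission client; the entrypoint mirrors the training command, and the override values shown are illustrative.

```python
def build_entrypoint(config: str, overrides: dict) -> str:
    """Compose the NeMo RL training command with dotted-key overrides."""
    flags = " ".join(f"{key}={value}" for key, value in overrides.items())
    return f"uv run python examples/run_grpo_math.py --config {config} {flags}"

def submit(entrypoint: str, dashboard: str = "http://localhost:8265") -> str:
    # Requires `pip install "ray[default]"` locally and the active
    # `kubectl ray session` port-forward from Step 3.2.
    from ray.job_submission import JobSubmissionClient
    return JobSubmissionClient(dashboard).submit_job(entrypoint=entrypoint)

entrypoint = build_entrypoint(
    "examples/configs/grpo_math_8B.yaml",
    {"cluster.num_nodes": 2, "cluster.gpus_per_node": 8},
)
print(entrypoint)
```

Calling submit(entrypoint) returns a job ID you can then watch in the Ray Dashboard at localhost:8265.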
Step 4: Configure GRPO Training
The GRPO configuration file defines all aspects of training. Key parameters include model selection, generation settings, and optimization hyperparameters. If you use the same model and dataset, you can directly use this optimized config file provided by NeMo RL.
Configuration file used for this example: RL/examples/configs/grpo_math_8B.yaml at main · NVIDIA-NeMo/RL · GitHub
grpo_llama8b.yaml - Key configuration sections
policy:
model_name: "meta-llama/Llama-3.1-8B-Instruct"
tokenizer:
name: ${policy.model_name} ## specify if you'd like to use a tokenizer different from the model's default
train_global_batch_size: 512
train_micro_batch_size: 1
generation_batch_size: 32 # Only used when generating using HF backend
logprob_batch_size: 2
max_total_sequence_length: 4096
precision: "bfloat16"
generation:
backend: "vllm"
max_new_tokens: ${policy.max_total_sequence_length}
temperature: 1.0
top_p: 1.0
top_k: null
stop_token_ids: null
stop_strings: null
vllm_cfg:
tensor_parallel_size: 1
gpu_memory_utilization: 0.6
max_model_len: ${policy.max_total_sequence_length}
enforce_eager: False
grpo:
num_prompts_per_step: 64
num_generations_per_prompt: 32
async_grpo:
enabled: false
max_trajectory_age_steps: 1
loss_fn:
clip_epsilon: 0.2
kl_penalty_coeff: 0.01
cluster:
num_nodes: 2
gpus_per_node: 8
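The batch-size knobs in this config compose as follows; the values are copied from the config above, and this is plain arithmetic, not a NeMo RL API.

```python
# Values from grpo_math_8B.yaml as shown above.
num_prompts_per_step = 64        # grpo.num_prompts_per_step
num_generations_per_prompt = 32  # grpo.num_generations_per_prompt
train_global_batch_size = 512    # policy.train_global_batch_size

# Each GRPO step samples a full group of generations for every prompt:
rollouts_per_step = num_prompts_per_step * num_generations_per_prompt
print(rollouts_per_step)  # 2048 -- the "batch of size 2048" seen in the training logs

# Those rollouts are then consumed in global training batches of 512:
print(rollouts_per_step // train_global_batch_size)  # 4 gradient batches per step
```

Raising num_generations_per_prompt gives GRPO a larger group to normalize rewards over (at more generation cost), while train_global_batch_size controls how those rollouts are chunked for optimizer updates.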
Step 5: Deploy NeMo RL Workload
This bash script automates submission of a distributed reinforcement learning job to the Ray cluster on Google Kubernetes Engine (GKE). It first dynamically identifies the Ray head pod using kubectl, then constructs and injects a shell script that sets the environment variables needed for Weights & Biases and Hugging Face authentication. Finally, it executes the NeMo RL training command via uv to train a Llama 3.1-8B model with the GRPO algorithm on the DeepScaler dataset, distributing the workload across two nodes with a total of 16 GPUs. Checkpoints are stored on the PVC provisioned from our Lustre storage class.
Filename: submit_llama3.1-8b-lustre.sh
#!/bin/bash
WANDB_API_KEY='WANDB_KEY' # Update this with your WANDB API key
HF_TOKEN='HF_KEY' # Update this with your HF token
WORLD_SIZE=16
# --- Step 1: Find the Ray Head Pod ---
echo "Finding Ray head pod..."
export HEAD_POD_NAME=$(kubectl get pods --selector=ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')
if [ -z "$HEAD_POD_NAME" ]; then
echo "Error: No running Ray head pod found. Please check your cluster."
exit 1
fi
echo "Found head pod: $HEAD_POD_NAME"
echo ""
# --- Step 2: Define the Job Script to Run ---
# This is the script that will be executed *inside* the head pod.
# It assumes the 'uv venv' setup from the values.yaml is already done.
JOB_SCRIPT=$(cat <<EOF
set -ex
echo "--- Running on Ray Head Pod ($HOSTNAME) ---"
cd /opt/nemo-rl
echo "Setting environment variables..."
export WANDB_API_KEY=$WANDB_API_KEY
export HF_TOKEN=$HF_TOKEN
export HF_HOME=/opt/nemo-rl/
###-----Example to launch Llama 3.1 8B on 2 nodes (16 GPUs)----------
uv run python examples/run_grpo_math.py \
--config examples/configs/grpo_math_8B.yaml \
logger.wandb_enabled=True \
cluster.num_nodes=2 \
cluster.gpus_per_node=8 \
logger.wandb.name='llama3.1-8b-deepscaler-grpo-2nodes' \
grpo.max_num_steps=100 \
checkpointing.checkpoint_dir=/data/nemo_rl_llama3_8b_ds_cp \
data.dataset_name='DeepScaler'
echo "--- Job Finished ---"
EOF
)
# --- Step 3: Execute the Job ---
echo "Submitting job to $HEAD_POD_NAME..."
echo "$JOB_SCRIPT" | tr -d '\r' | kubectl exec -i $HEAD_POD_NAME -c ray-head -- /bin/bash
echo ""
echo "Job submission complete."
Run this file as
source submit_llama3.1-8b-lustre.sh
Monitor the output on your console (successful completion after grpo.max_num_steps=100):
========================= Step 100/100 =========================
▶ Preparing batch...
▶ Generating responses for batch of size 2048...
(VllmGenerationWorker pid=260234, ip=10.4.2.6) INFO 01-23 04:20:02 [block_pool.py:321] Successfully reset prefix cache [repeated 30x across cluster]
(VllmGenerationWorker pid=259642, ip=10.4.1.6) INFO 01-23 04:20:35 [executor_base.py:203] It took 0.582119 seconds to wake up tags ['weights'].
(VllmGenerationWorker pid=259731, ip=10.4.1.6) INFO 01-23 04:20:03 [gpu_worker.py:104] Sleep mode freed 102.62 GiB memory, 4.79 GiB memory is still in use. [repeated 15x across cluster]
(VllmGenerationWorker pid=259731, ip=10.4.1.6) INFO 01-23 04:20:03 [executor_base.py:187] It took 1.359313 seconds to fall asleep. [repeated 15x across cluster]
(VllmGenerationWorker pid=259644, ip=10.4.1.6) INFO 01-23 04:20:35 [executor_base.py:203] It took 0.581814 seconds to wake up tags ['weights'].
(DTensorPolicyWorkerV2[rank=0] pid=262448, ip=10.4.1.6) DTensorPolicyWorkerV2[rank=0]: Packed 1 groups of tensors
(DTensorPolicyWorkerV2[rank=0] pid=262448, ip=10.4.1.6) GPU Memory after optimizer offload: 0.02GB allocated, 0.04GB reserved
(VllmGenerationWorker pid=260126, ip=10.4.2.6) INFO 01-23 04:20:37 [executor_base.py:203] It took 0.075903 seconds to wake up tags ['kv_cache'].
Adding requests: 100%|██████████| 128/128 [00:00<00:00, 15500.37it/s]
Processed prompts: 0%| | 0/128 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts: 100%|██████████| 128/128 [00:32<00:00, 3.91it/s, est. speed input: 588.03 toks/s, output: 4861.31 toks/s] [repeated 3x across cluster]
Processed prompts: 1%| | 1/128 [00:01<03:05, 1.46s/it, est. speed input: 113.18 toks/s, output: 102.20 toks/s]
Adding requests: 100%|██████████| 128/128 [00:00<00:00, 15729.72it/s] [repeated 15x across cluster]
Processed prompts: 0%| | 0/128 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s] [repeated 15x across cluster]
Processed prompts: 60%|██████    | 77/128 [00:06<00:03, 14.97it/s, est. speed input: 1440.55 toks/s, output: 5429.55 toks/s] [repeated 317x across cluster]
Processed prompts: 92%|██████████| 118/128 [00:07<00:00, 15.60it/s, est. speed input: 1720.48 toks/s, output: 7977.87 toks/s]
Processed prompts: 94%|██████████| 120/128 [00:07<00:00, 12.95it/s, est. speed input: 1684.62 toks/s, output: 7919.15 toks/s]
Processed prompts: 88%|█████████ | 113/128 [00:11<00:01, 8.34it/s, est. speed input: 1316.02 toks/s, output: 6764.95 toks/s] [repeated 267x across cluster]
Processed prompts: 94%|██████████| 120/128 [00:11<00:01, 4.40it/s, est. speed input: 1906.63 toks/s, output: 6585.21 toks/s] [repeated 9x across cluster]
Processed prompts: 82%|█████████ | 105/128 [00:16<00:07, 3.00it/s, est. speed input: 766.67 toks/s, output: 5213.57 toks/s] [repeated 84x across cluster]
Processed prompts: 92%|██████████| 118/128 [00:17<00:04, 2.30it/s, est. speed input: 874.85 toks/s, output: 4945.55 toks/s] [repeated 36x across cluster]
Processed prompts: 90%|█████████ | 115/128 [00:21<00:04, 2.97it/s, est. speed input: 678.24 toks/s, output: 5093.30 toks/s] [repeated 33x across cluster]
Processed prompts: 95%|██████████| 121/128 [00:22<00:07, 1.02s/it, est. speed input: 648.11 toks/s, output: 3876.12 toks/s] [repeated 21x across cluster]
Processed prompts: 100%|██████████| 128/128 [00:22<00:00, 5.57it/s, est. speed input: 758.07 toks/s, output: 4504.22 toks/s]
Processed prompts: 87%|█████████ | 111/128 [00:26<00:21, 1.26s/it, est. speed input: 438.47 toks/s, output: 3665.12 toks/s] [repeated 8x across cluster]
Processed prompts: 98%|██████████| 126/128 [00:27<00:04, 2.31s/it, est. speed input: 653.31 toks/s, output: 3041.30 toks/s] [repeated 32x across cluster]
Processed prompts: 100%|██████████| 128/128 [00:27<00:00, 4.61it/s, est. speed input: 728.16 toks/s, output: 3483.17 toks/s] [repeated 6x across cluster]
Processed prompts: 100%|██████████| 128/128 [00:30<00:00, 4.22it/s, est. speed input: 594.41 toks/s, output: 4538.78 toks/s]
Processed prompts: 90%|█████████ | 115/128 [00:31<00:16, 1.27s/it, est. speed input: 381.76 toks/s, output: 3506.59 toks/s] [repeated 4x across cluster]
Processed prompts: 97%|██████████| 124/128 [00:31<00:06, 1.61s/it, est. speed input: 490.68 toks/s, output: 4256.31 toks/s] [repeated 15x across cluster]
Processed prompts: 100%|██████████| 128/128 [00:32<00:00, 3.99it/s, est. speed input: 504.13 toks/s, output: 4730.26 toks/s] [repeated 7x across cluster]
(VllmGenerationWorker pid=260132, ip=10.4.2.6) INFO 01-23 04:21:14 [block_pool.py:321] Successfully reset prefix cache [repeated 2x across cluster]
(VllmGenerationWorker pid=260234, ip=10.4.2.6) INFO 01-23 04:20:35 [executor_base.py:203] It took 0.580567 seconds to wake up tags ['weights']. [repeated 14x across cluster]
(DTensorPolicyWorkerV2[rank=11] pid=262730, ip=10.4.2.6) GPU Memory after optimizer offload: 0.02GB allocated, 0.13GB reserved [repeated 15x across cluster]
(VllmGenerationWorker pid=259731, ip=10.4.1.6) INFO 01-23 04:20:37 [executor_base.py:203] It took 0.091887 seconds to wake up tags ['kv_cache']. [repeated 15x across cluster]
(VllmGenerationWorker pid=259733, ip=10.4.1.6) INFO 01-23 04:21:15 [gpu_worker.py:104] Sleep mode freed 102.60 GiB memory, 4.70 GiB memory is still in use.
(VllmGenerationWorker pid=259733, ip=10.4.1.6) INFO 01-23 04:21:15 [executor_base.py:187] It took 0.914538 seconds to fall asleep.
▶ Processing rewards...
▶ Computing advantages...
▶ Preparing for logprob inference...
▶ Computing logprobs...
▶ Preparing for training...
▶ Training policy...
(VllmGenerationWorker pid=259731, ip=10.4.1.6) INFO 01-23 04:21:14 [block_pool.py:321] Successfully reset prefix cache [repeated 30x across cluster]
(VllmGenerationWorker pid=260126, ip=10.4.2.6) INFO 01-23 04:21:44 [executor_base.py:203] It took 0.581929 seconds to wake up tags ['weights'].
(VllmGenerationWorker pid=260130, ip=10.4.2.6) INFO 01-23 04:21:15 [gpu_worker.py:104] Sleep mode freed 102.55 GiB memory, 4.77 GiB memory is still in use. [repeated 15x across cluster]
(VllmGenerationWorker pid=260130, ip=10.4.2.6) INFO 01-23 04:21:15 [executor_base.py:187] It took 1.343605 seconds to fall asleep. [repeated 15x across cluster]
(VllmGenerationWorker pid=260128, ip=10.4.2.6) INFO 01-23 04:21:44 [executor_base.py:203] It took 0.582409 seconds to wake up tags ['weights'].
(DTensorPolicyWorkerV2[rank=0] pid=262448, ip=10.4.1.6) DTensorPolicyWorkerV2[rank=0]: Packed 1 groups of tensors
(DTensorPolicyWorkerV2[rank=0] pid=262448, ip=10.4.1.6) GPU Memory after optimizer offload: 0.02GB allocated, 0.04GB reserved
▶ Starting validation at step 100...
(VllmGenerationWorker pid=260126, ip=10.4.2.6) INFO 01-23 04:21:46 [executor_base.py:203] It took 0.078912 seconds to wake up tags ['kv_cache'].
Adding requests: 100%|██████████| 16/16 [00:00<00:00, 11875.57it/s]
Processed prompts: 0%| | 0/16 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts: 91%|█████████ | 116/128 [00:33<00:17, 1.48s/it, est. speed input: 362.04 toks/s, output: 3412.34 toks/s]
Processed prompts: 91%|██████████| 117/128 [00:33<00:13, 1.19s/it, est. speed input: 360.14 toks/s, output: 3480.90 toks/s]
Processed prompts: 100%|██████████| 128/128 [00:33<00:00, 3.77it/s, est. speed input: 393.67 toks/s, output: 4759.65 toks/s]
Processed prompts: 6%|█         | 1/16 [00:01<00:28, 1.89s/it, est. speed input: 107.97 toks/s, output: 160.36 toks/s]
Processed prompts: 12%|██        | 2/16 [00:03<00:17, 1.27s/it, est. speed input: 83.14 toks/s, output: 313.49 toks/s]
Adding requests: 100%|██████████| 16/16 [00:00<00:00, 12241.68it/s] [repeated 15x across cluster]
Processed prompts: 0%| | 0/16 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s] [repeated 15x across cluster]
Processed prompts: 69%|███████   | 11/16 [00:06<00:02, 2.14it/s, est. speed input: 302.57 toks/s, output: 1218.42 toks/s] [repeated 138x across cluster]
Processed prompts: 94%|██████████| 15/16 [00:07<00:00, 2.25it/s, est. speed input: 308.54 toks/s, output: 1568.40 toks/s]
Processed prompts: 100%|██████████| 16/16 [00:07<00:00, 2.01it/s, est. speed input: 319.65 toks/s, output: 1689.95 toks/s]
Processed prompts: 81%|█████████ | 13/16 [00:11<00:03, 1.28s/it, est. speed input: 193.58 toks/s, output: 1041.41 toks/s] [repeated 41x across cluster]
Processed prompts: 94%|██████████| 15/16 [00:13<00:01, 1.53s/it, est. speed input: 205.27 toks/s, output: 1030.76 toks/s] [repeated 8x across cluster]
Processed prompts: 100%|██████████| 16/16 [00:10<00:00, 1.50it/s, est. speed input: 253.25 toks/s, output: 1233.89 toks/s] [repeated 3x across cluster]
Processed prompts: 88%|█████████ | 14/16 [00:13<00:02, 1.47s/it, est. speed input: 175.34 toks/s, output: 1061.80 toks/s]
Processed prompts: 94%|██████████| 15/16 [00:19<00:03, 3.56s/it, est. speed input: 107.08 toks/s, output: 715.25 toks/s] [repeated 3x across cluster]
Processed prompts: 100%|██████████| 16/16 [00:14<00:00, 1.08it/s, est. speed input: 167.17 toks/s, output: 1068.08 toks/s] [repeated 2x across cluster]
📊 Validation Results:
• Accuracy: 0.0898
• Average response length: 1067.4 tokens
• Samples processed: 256
⏱️ Validation Timing:
• Total validation time: 24.44s
(VllmGenerationWorker pid=259642, ip=10.4.1.6) INFO 01-23 04:22:11 [block_pool.py:321] Successfully reset prefix cache [repeated 2x across cluster]
(VllmGenerationWorker pid=259731, ip=10.4.1.6) INFO 01-23 04:21:44 [executor_base.py:203] It took 0.582607 seconds to wake up tags ['weights']. [repeated 14x across cluster]
(DTensorPolicyWorkerV2[rank=11] pid=262730, ip=10.4.2.6) GPU Memory after optimizer offload: 0.02GB allocated, 0.13GB reserved [repeated 15x across cluster]
(VllmGenerationWorker pid=259731, ip=10.4.1.6) INFO 01-23 04:21:46 [executor_base.py:203] It took 0.072994 seconds to wake up tags ['kv_cache']. [repeated 15x across cluster]
(VllmGenerationWorker pid=259733, ip=10.4.1.6) INFO 01-23 04:22:12 [gpu_worker.py:104] Sleep mode freed 102.43 GiB memory, 4.70 GiB memory is still in use.
(VllmGenerationWorker pid=259733, ip=10.4.1.6) INFO 01-23 04:22:12 [executor_base.py:187] It took 0.978071 seconds to fall asleep.
Saving checkpoint for step 100...
(DTensorPolicyWorkerV2[rank=0] pid=262448, ip=10.4.1.6) Saving tokenizer (or processor) to /data/nemo_rl_llama3_8b_ds_cp/tmp_step_100/policy/tokenizer
(VllmGenerationWorker pid=260234, ip=10.4.2.6) INFO 01-23 04:22:11 [block_pool.py:321] Successfully reset prefix cache [repeated 30x across cluster]
(VllmGenerationWorker pid=259731, ip=10.4.1.6) INFO 01-23 04:22:13 [gpu_worker.py:104] Sleep mode freed 102.45 GiB memory, 4.79 GiB memory is still in use. [repeated 15x across cluster]
(VllmGenerationWorker pid=259731, ip=10.4.1.6) INFO 01-23 04:22:13 [executor_base.py:187] It took 1.396513 seconds to fall asleep. [repeated 15x across cluster]
Removing checkpoint /data/nemo_rl_llama3_8b_ds_cp/step_90 due to being outside top-3
Logged data to logs/exp_002/train_data_step99.jsonl
Other metrics captured by NeMo RL include:
📊 Training Results:
• Loss: 0.0351
• Avg Reward: 0.2407
• Mean Generation Length: 903.4531
⏱️ Timing:
• Total step time: 122.87s
• generation: 39.05s (31.8%)
• checkpointing: 21.48s (17.5%)
• policy_training: 16.89s (13.7%)
• policy_and_reference_logprobs: 7.74s (6.3%)
• prepare_for_generation/total: 3.25s (2.6%)
• training_prep: 1.15s (0.9%)
• logprob_inference_prep: 1.10s (0.9%)
• prepare_for_generation/transfer_and_update_weights: 0.68s (0.6%)
• data_processing: 0.56s (0.5%)
• reward_calculation: 0.03s (0.0%)
📊 Performance Metrics:
• Mean Total Tokens per Sample: 906.36
• Throughputs (per GPU):
- E2E (Samples/sec/gpu): 1.04
- E2E (Tokens/sec/gpu): 1087.21
- Policy Training (Tokens/sec/gpu): 7908.96
- Policy and Reference Logprobs (Tokens/sec/gpu): 17251.90
- Training Worker Group (Tokens/sec/gpu): 5422.89
- Generation Worker Group (Tokens/sec/gpu): 3420.66
• Throughputs (per Group):
- E2E (Samples/sec): 16.67
- E2E (Tokens/sec): 17395.32
- Training Worker Group (Tokens/sec): 86766.26
- Generation Worker Group (Tokens/sec): 54730.51
• Training FLOPS: 5864.40 TFLOPS (366.53 TFLOPS per rank)
• Training Model Floating Point Utilization: 16.29%
Max number of steps has been reached, stopping training early
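As a quick sanity check on the step summary, the per-group throughputs line up with the per-GPU figures multiplied by the job's 16 GPUs; the small differences come from rounding in the logs.

```python
world_size = 16  # 2 nodes x 8 GPUs

# Figures copied from the step summary above.
e2e_tokens_per_gpu = 1087.21
e2e_samples_per_gpu = 1.04

print(round(e2e_tokens_per_gpu * world_size, 2))   # 17395.36, vs. reported 17395.32
print(round(e2e_samples_per_gpu * world_size, 2))  # 16.64, vs. reported 16.67
```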
Step 6: Verify the checkpoints
After Ray finishes the job, NeMo RL stores the checkpoints in the configured path. This can be verified by accessing the PVC via one of the Ray cluster's worker pods in GKE. Checkpoints under this folder are saved in NeMo 2 format and can be easily imported into other NeMo functionality such as fine-tuning and export, as well as NIMs.
$ kubectl exec -it ray-cluster-kuberay-worker-grp-0-worker-gkbxw -- bash
Defaulted container "ray-worker" out of: ray-worker, fluent-bit, fluentbit
root@ray-cluster-kuberay-worker-grp-0-worker-gkbxw:/opt/nemo-rl# tree /data/nemo_rl_llama3_8b_ds_cp/
/data/nemo_rl_llama3_8b_ds_cp/
|-- step_100
| |-- config.yaml
| |-- policy
| | |-- optimizer
| | | `-- optim
| | | |-- __0_0.distcp
| | | |-- __10_0.distcp
| | | |-- __11_0.distcp
| | | |-- __12_0.distcp
| | | |-- __13_0.distcp
| | | |-- __14_0.distcp
| | | |-- __15_0.distcp
| | | |-- __1_0.distcp
| | | |-- __2_0.distcp
| | | |-- __3_0.distcp
| | | |-- __4_0.distcp
| | | |-- __5_0.distcp
| | | |-- __6_0.distcp
| | | |-- __7_0.distcp
| | | |-- __8_0.distcp
| | | `-- __9_0.distcp
| | |-- tokenizer
| | | |-- chat_template.jinja
| | | |-- special_tokens_map.json
| | | |-- tokenizer.json
| | | `-- tokenizer_config.json
| | `-- weights
| | `-- model
| | |-- shard-00001-model-00001-of-00001.safetensors
| | |-- shard-00002-model-00001-of-00001.safetensors
| | |-- shard-00003-model-00001-of-00001.safetensors
| | |-- shard-00004-model-00001-of-00001.safetensors
| | |-- shard-00005-model-00001-of-00001.safetensors
| | |-- shard-00006-model-00001-of-00001.safetensors
| | |-- shard-00007-model-00001-of-00001.safetensors
| | |-- shard-00008-model-00001-of-00001.safetensors
| | |-- shard-00009-model-00001-of-00001.safetensors
| | |-- shard-00010-model-00001-of-00001.safetensors
| | |-- shard-00011-model-00001-of-00001.safetensors
| | |-- shard-00012-model-00001-of-00001.safetensors
| | |-- shard-00013-model-00001-of-00001.safetensors
| | |-- shard-00014-model-00001-of-00001.safetensors
| | |-- shard-00015-model-00001-of-00001.safetensors
| | `-- shard-00016-model-00001-of-00001.safetensors
| |-- train_dataloader.pt
| `-- training_info.json
|-- step_40
| |-- config.yaml
| |-- policy
| | |-- optimizer
| | | `-- optim
| | | |-- __0_0.distcp
| | | |-- __10_0.distcp
| | | |-- __11_0.distcp
| | | |-- __12_0.distcp
| | | |-- __13_0.distcp
| | | |-- __14_0.distcp
| | | |-- __15_0.distcp
| | | |-- __1_0.distcp
| | | |-- __2_0.distcp
| | | |-- __3_0.distcp
| | | |-- __4_0.distcp
| | | |-- __5_0.distcp
| | | |-- __6_0.distcp
| | | |-- __7_0.distcp
| | | |-- __8_0.distcp
| | | `-- __9_0.distcp
| | |-- tokenizer
| | | |-- chat_template.jinja
| | | |-- special_tokens_map.json
| | | |-- tokenizer.json
| | | `-- tokenizer_config.json
| | `-- weights
| | `-- model
| | |-- shard-00001-model-00001-of-00001.safetensors
| | |-- shard-00002-model-00001-of-00001.safetensors
| | |-- shard-00003-model-00001-of-00001.safetensors
| | |-- shard-00004-model-00001-of-00001.safetensors
| | |-- shard-00005-model-00001-of-00001.safetensors
| | |-- shard-00006-model-00001-of-00001.safetensors
| | |-- shard-00007-model-00001-of-00001.safetensors
| | |-- shard-00008-model-00001-of-00001.safetensors
| | |-- shard-00009-model-00001-of-00001.safetensors
| | |-- shard-00010-model-00001-of-00001.safetensors
| | |-- shard-00011-model-00001-of-00001.safetensors
| | |-- shard-00012-model-00001-of-00001.safetensors
| | |-- shard-00013-model-00001-of-00001.safetensors
| | |-- shard-00014-model-00001-of-00001.safetensors
| | |-- shard-00015-model-00001-of-00001.safetensors
| | `-- shard-00016-model-00001-of-00001.safetensors
| |-- train_dataloader.pt
| `-- training_info.json
`-- step_60
|-- config.yaml
|-- policy
| |-- optimizer
| | `-- optim
| | |-- __0_0.distcp
| | |-- __10_0.distcp
| | |-- __11_0.distcp
| | |-- __12_0.distcp
| | |-- __13_0.distcp
| | |-- __14_0.distcp
| | |-- __15_0.distcp
| | |-- __1_0.distcp
| | |-- __2_0.distcp
| | |-- __3_0.distcp
| | |-- __4_0.distcp
| | |-- __5_0.distcp
| | |-- __6_0.distcp
| | |-- __7_0.distcp
| | |-- __8_0.distcp
| | `-- __9_0.distcp
| |-- tokenizer
| | |-- chat_template.jinja
| | |-- special_tokens_map.json
| | |-- tokenizer.json
| | `-- tokenizer_config.json
| `-- weights
| `-- model
| |-- shard-00001-model-00001-of-00001.safetensors
| |-- shard-00002-model-00001-of-00001.safetensors
| |-- shard-00003-model-00001-of-00001.safetensors
| |-- shard-00004-model-00001-of-00001.safetensors
| |-- shard-00005-model-00001-of-00001.safetensors
| |-- shard-00006-model-00001-of-00001.safetensors
| |-- shard-00007-model-00001-of-00001.safetensors
| |-- shard-00008-model-00001-of-00001.safetensors
| |-- shard-00009-model-00001-of-00001.safetensors
| |-- shard-00010-model-00001-of-00001.safetensors
| |-- shard-00011-model-00001-of-00001.safetensors
| |-- shard-00012-model-00001-of-00001.safetensors
| |-- shard-00013-model-00001-of-00001.safetensors
| |-- shard-00014-model-00001-of-00001.safetensors
| |-- shard-00015-model-00001-of-00001.safetensors
| `-- shard-00016-model-00001-of-00001.safetensors
|-- train_dataloader.pt
`-- training_info.json
22 directories, 117 files
Step 7: Running Eval against the post-trained model
Once training is complete, the evaluation phase allows you to measure the performance of your policy against standard benchmarks (like AIME or MATH-500) or your own custom datasets.
The evaluation pipeline follows three main steps: Format Conversion, Configuration, and Execution.
1. Convert DCP to Hugging Face (Optional)
NeMo-RL often saves checkpoints in the PyTorch Distributed Checkpoint (DCP) format. However, the evaluation script requires the Hugging Face (HF) format. If you have a local checkpoint, convert it using the provided utility script:
# Example: Converting a GRPO checkpoint from step 60
uv run python examples/converters/convert_dcp_to_hf.py \
    --config nemo_rl_llama3_8b_ds_cp/step_60/config.yaml \
    --dcp-ckpt-path nemo_rl_llama3_8b_ds_cp/step_60/policy/weights/ \
    --hf-ckpt-path nemo_rl_llama3_8b_ds_cp/hf
2. Configure the Evaluation Environment
The evaluation suite is highly flexible. You can use the default settings (which target Qwen2.5-Math-1.5B-Instruct on AIME-2024) or override them for your specific needs.
Prompt Templates: Consistency is key. Always use the same chat_template used during training.
Open-Source Defaults: For most HF models, set tokenizer.chat_template=default and keep data.prompt_file as null to use the modelβs native formatting.
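Pulling these knobs together, the override keys might look like this in a custom eval config file. This is a sketch: the nesting is inferred from the dotted CLI paths accepted by run_eval.py, and the values here are illustrative, not defaults.

```yaml
generation:
  model_name: nemo_rl_llama3_8b_ds_cp/hf   # local HF checkpoint or a Hub ID
  temperature: 0.6
tokenizer:
  chat_template: default    # use the model's native HF chat template
data:
  dataset_name: math500
  prompt_file: null         # null = no extra prompt wrapping
eval:
  num_tests_per_prompt: 16
```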
3. Run the Evaluation Script
Use run_eval.py to initiate the process. You can point to models on the Hugging Face Hub or your newly converted local path.
Common Execution Commands:
Default Eval:
uv run python examples/run_eval.py
Local Model:
uv run python examples/run_eval.py generation.model_name=$PWD/results/grpo/hf
GPQA Benchmark:
uv run python examples/run_eval.py --config examples/configs/evals/gpqa_eval.yaml
Multi GPU Evaluation:
uv run python examples/run_eval.py \
--config examples/configs/evals/math_eval.yaml \
generation.model_name=nemo_rl_llama3_8b_ds_cp/hf \
generation.temperature=0.6 \
generation.vllm_cfg.max_model_len=32768 \
generation.vllm_cfg.tensor_parallel_size=$TP \
data.dataset_name=math500 \
eval.num_tests_per_prompt=16 \
cluster.gpus_per_node=8
4. Interpreting the Output
After the script finishes, you will see a summary block indicating the success rate of your model.
============================================================
model_name='nemo_rl_llama3_8b_ds_cp/hf' dataset_name='math500'
max_new_tokens=32768 temperature=0.6 top_p=1.0 top_k=-1 seed=42
metric=pass@1 num_tests_per_prompt=16
score=0.8981 (449.06250106170774/500)
============================================================
Score: your accuracy as a decimal (here 0.8981, i.e. roughly 89.8% on MATH-500).
Ratio: the correct-answer count over the total number of problems (here 449.06/500). The numerator can be fractional because each problem's score is averaged over its num_tests_per_prompt samples.
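The fractional numerator in the printed ratio comes from how pass@1 is averaged over multiple samples per prompt: each problem contributes the fraction of its num_tests_per_prompt generations that were correct. A minimal sketch of that arithmetic (not NeMo RL's actual implementation):

```python
def mean_pass_at_1(correct_counts, num_tests_per_prompt):
    """Average pass@1 over prompts.

    correct_counts: number of correct generations for each prompt.
    Each prompt contributes correct / num_tests_per_prompt, so the
    summed numerator (and hence the printed ratio) can be fractional.
    """
    per_prompt = [c / num_tests_per_prompt for c in correct_counts]
    return sum(per_prompt) / len(per_prompt)

# Two prompts, 16 samples each: one fully solved, one solved 8 of 16 times.
print(mean_pass_at_1([16, 8], num_tests_per_prompt=16))  # 0.75
```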
Conclusion
The combination of NVIDIA NeMo RL, Google Kubernetes Engine, Ray orchestration, and Managed Lustre storage provides a production-ready infrastructure for implementing GRPO at scale. This stack addresses the unique challenges of RL workloads: the hybrid nature of training and inference, high memory requirements, and the need for fast checkpointing and data access.
As LLMs continue to evolve toward more sophisticated reasoning capabilities, mastering RL techniques like GRPO, and the infrastructure to support them, will be essential for organizations building the next generation of AI systems.


