Implementing GRPO with NVIDIA NeMo-RL on Google Kubernetes Engine

The evolution of Large Language Models from simple next-token predictors to sophisticated reasoning engines has been driven significantly by advances in post-training techniques. While pre-training establishes the foundational capabilities of an LLM, it is the post-training phase that transforms these models into aligned, helpful, and capable AI assistants.

Reinforcement Learning (RL) sits at the heart of modern post-training pipelines. Unlike supervised learning, which requires labeled examples for every desired behavior, RL enables models to learn through trial and error, receiving scalar reward signals that guide policy optimization. This paradigm has proven particularly effective for alignment (RLHF) and reasoning enhancement (RLVR).

This blog focuses on implementing Group Relative Policy Optimization (GRPO) using NVIDIA’s NeMo RL framework on Google Kubernetes Engine. Our target environment leverages Ray for distributed orchestration and Managed Lustre for high-throughput storage, creating a production-ready infrastructure for scalable RL training.

GitHub repository for this blog: pmotgi/nemo-rl-on-gke

Understanding Reinforcement Learning for LLMs

The RL Loop

RL for LLMs follows a continuous feedback loop that combines elements of both training and inference:

  • Generation: The LLM (policy) generates one or more responses to a given prompt
  • Evaluation: A reward model or verifiable reward function assigns a score to each output
  • Optimization: An RL algorithm uses reward signals to update the LLM’s parameters
  • Iteration: The process repeats with the updated policy
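
The four stages above can be sketched in a few lines of Python. Everything here is a toy stand-in (the "policy" is a weighted sampler over two candidate answers and the reward is an exact-match check); the names generate, evaluate, and optimize are illustrative, not NeMo RL APIs.

```python
import random

random.seed(0)  # deterministic toy run

def generate(policy, prompt, k=4):
    # Generation: sample k candidate responses from the policy
    return [policy(prompt) for _ in range(k)]

def evaluate(responses, target):
    # Evaluation: a verifiable reward -- 1.0 if the response matches the target
    return [1.0 if r == target else 0.0 for r in responses]

def optimize(weights, responses, rewards):
    # Optimization: nudge the policy toward higher-reward responses
    for resp, rew in zip(responses, rewards):
        weights[resp] = weights.get(resp, 0.0) + rew
    return weights

weights = {}

def policy(prompt):
    # Sample an answer with probability proportional to its learned weight
    return random.choices(
        ["4", "5"],
        [1 + weights.get("4", 0.0), 1 + weights.get("5", 0.0)],
    )[0]

for step in range(20):  # Iteration: repeat with the updated policy
    responses = generate(policy, "2 + 2 = ?")
    rewards = evaluate(responses, target="4")
    weights = optimize(weights, responses, rewards)

print(weights.get("4", 0.0) > weights.get("5", 0.0))  # True: policy shifted toward "4"
```

Only the correct answer ever earns reward, so its sampling weight grows across iterations while the wrong answer's stays at zero.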

Key RL Paradigms

  • RLHF (Reinforcement Learning from Human Feedback): Uses a learned reward model trained on human preference data to provide reward signals. This approach was foundational for models like ChatGPT and InstructGPT.

  • RLVR (Reinforcement Learning from Verifiable Rewards): Uses programmatic, rule-based rewards where correctness can be objectively verified. This is particularly effective for mathematical reasoning and code generation tasks where outputs can be checked against ground truth.
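
A verifiable reward can be as simple as a string or number comparison. The sketch below is illustrative, not NeMo RL's actual reward implementation: it treats the last number in a response as the final answer and scores it against the ground truth.

```python
import re

def math_reward(response: str, ground_truth: str) -> float:
    # Take the last number in the response as the model's final answer;
    # this extraction convention is an illustrative assumption.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth else 0.0

print(math_reward("The answer is 42.", "42"))  # 1.0
print(math_reward("I think it's 41.", "42"))   # 0.0
```

Because correctness is checked programmatically rather than by a learned reward model, there is no reward model to train or drift from.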

NVIDIA NeMo RL Framework

NeMo RL is NVIDIA’s open-source post-training library designed for scalable reinforcement learning. Part of the broader NeMo Framework ecosystem, it enables both small-scale experiments on a single GPU and multi-node deployments across thousands of GPUs.

GKE Architecture for Running NeMo RL

GKE provides the orchestration layer for our RL training infrastructure. GKE’s integration with NVIDIA GPUs, Ray, and high-performance storage makes it ideal for demanding RL workloads.
Architecture Overview

The infrastructure consists of three layers: the Compute Layer with GPU node pools (A3/A4 with B200 GPUs), the Orchestration Layer powered by GKE with Kueue and JobSet for scheduling, and the Framework Layer running NeMo RL with Ray as the distributed orchestrator.

Storage: Managed Lustre
Managed Lustre on GCP, powered by DDN EXAScaler, provides the high-performance parallel file system essential for RL workloads. It delivers sub-millisecond latency, up to 1 TB/s throughput, and scales from terabytes to petabytes. For RL specifically, Managed Lustre excels at:
  • Training data loading: Fast access to datasets for prompt sampling
  • Checkpointing: Up to 15x faster than other storage solutions, minimizing GPU idle time
  • Model loading: Rapid weight distribution across nodes during policy updates

Ray on GKE
Ray serves as the distributed computing framework that coordinates NeMo RL’s workers. The KubeRay operator simplifies Ray cluster management within Kubernetes, with native integration for GKE monitoring and logging.
Implementation Guide: GRPO with Llama 3.1-8B
This section provides the technical implementation steps for running GRPO training on GKE using the NeMo RL framework. The complete recipe and scripts are available in the referenced GitHub repository.

This RL training pipeline is deployed on a Ray cluster on Google Kubernetes Engine (GKE) and structured for a framework like NeMo RL using the GRPO (Group Relative Policy Optimization) algorithm. In this setup, the Ray head node acts as the GRPO coordinator and driver, managing two distributed worker nodes. Node 1 handles generation and training of the policy model via Policy Workers and Gen Workers, using sharded parameters and vLLM inference across a GPU array. In parallel, Node 2 evaluates the policy's responses via Reward Workers and Reference Workers, hosting the reward and reference models on a separate GPU array. During training, Node 1 sends policy outputs ("Responses") to Node 2, which computes and returns the associated rewards and reference log probabilities ("Rewards"). This highly scalable architecture decouples model evaluation from generation for distributed training efficiency, while a Managed Lustre PVC storage layer loads and saves model checkpoints, reference weights, data, and logs via shared volume mounts.
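
The "group relative" part of GRPO refers to how advantages are computed: the rewards for the group of responses sampled for a single prompt are normalized against that group's own mean and standard deviation, which removes the need for a learned value network. A minimal sketch (the epsilon guard is an illustrative detail):

```python
def group_relative_advantages(rewards, eps=1e-6):
    # Normalize each reward against the group's mean and standard deviation
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, rewarded 1.0 when verifiably correct
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 2) for a in advs])  # [1.0, -1.0, -1.0, 1.0]
```

Correct responses get positive advantages and incorrect ones negative, purely from within-group comparison, which is why generating many responses per prompt (see num_generations_per_prompt later in the config) matters.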

Prerequisites

Example Details:

Model
NeMo RL supports both LLMs and VLMs, including models from the Llama, Qwen, and Gemma families (1B and 27B variants), as well as Mistral, DeepSeek-V3, GPT-OSS, and others. For VLMs, NeMo RL supports Qwen2.5-VL-3B and SmolVLM2-2.2B. We ran experiments on LLMs such as Llama3.1-8B and Qwen2-1.5B; this example walks through the implementation steps for the Llama3.1-8B model.

Dataset
NeMo RL supports various datasets such as aime24, clevr, dapo_math, deepscaler, geometry3k, helpsteer3, oai_format_dataset, oasst, openmathinstruct2, refcoco, squad, and tulu3. For this implementation we will use the DeepScaler dataset.

The DeepScaleR-Preview-Dataset by agentica-org is a training dataset designed to enhance mathematical reasoning in large language models. Hosted on Hugging Face, it contains approximately 40,000 unique problem-answer pairs compiled from high-difficulty sources including AIME (1984–2023), AMC, Omni-MATH, and the STILL dataset. This collection served as the foundation for training the DeepScaleR-1.5B-Preview model, using reinforcement learning (specifically Group Relative Policy Optimization) to achieve reasoning performance comparable to much larger models.

Hugging Face link: agentica-org/DeepScaleR-Preview-Dataset
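
To make the dataset's shape concrete, the sketch below formats a DeepScaleR-style problem/answer record into a generation prompt. The field names and the template are assumptions for illustration; consult the dataset card on Hugging Face for the actual schema.

```python
# Illustrative record in the shape of a DeepScaleR-style problem/answer pair
# (field names "problem" and "answer" are assumptions, not a verified schema).
sample = {
    "problem": "If x + 3 = 7, what is x?",
    "answer": "4",
}

def to_prompt(record):
    # Wrap the raw problem in a simple instruction template for generation
    return (
        "Solve the following math problem. "
        "Put your final answer after 'Answer:'.\n\n"
        f"Problem: {record['problem']}\n"
    )

print(to_prompt(sample))
```

During training, the prompt is what the policy sees, while the answer field feeds the verifiable reward function.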

Step 1: Enable Ray on GKE or Install KubeRay Operator

To enable Ray on a running GKE cluster, follow the steps documented in this blog:

If you want to manually deploy the KubeRay operator, follow the steps below:

Install KubeRay operator for Ray cluster management

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator

Verify installation

kubectl get pods | grep kuberay-operator

Step 2: Configure Environment

Set environment variables

export PROJECT_ID=<YOUR_PROJECT_ID>
export CLUSTER_REGION=<YOUR_REGION>
export CLUSTER_NAME=<YOUR_CLUSTER_NAME>
export HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>

Get cluster credentials

gcloud container clusters get-credentials $CLUSTER_NAME \
  --region $CLUSTER_REGION

Step 3: Launch Ray Cluster

The Ray cluster serves as the backbone for distributed RL training, managing worker processes for policy training, generation, and environment interaction.

Clone the gpu-recipes repository

git clone https://github.com/pmotgi/nemo-rl-on-gke

cd nemo-rl-on-gke

Configure the values.yaml for the Ray cluster with the necessary RDMA interfaces and Lustre PVCs

image:
 repository: "nvcr.io/nvidia/nemo-rl"
 tag: "v0.4.0"
 pullPolicy: Always


nameOverride: "kuberay"
fullnameOverride: ""


common:
 containerEnv: {}


configMap:
 fluentbit:
   data:
     fluent-bit.conf: |
        [INPUT]
           Name              tail
           Path              /tmp/ray/session_latest/logs/worker-*
           Tag               ray-worker
        [INPUT]
           Name              tail
           Path              /tmp/ray/session_latest/logs/raylet*
           Tag               raylet
        [INPUT]
           Name              tail
           Path              /tmp/ray/session_latest/logs/*
           Exclude_Path      /tmp/ray/session_latest/logs/debug_state.txt,/tmp/ray/session_latest/logs/raylet*,/tmp/ray/session_latest/logs/worker-*
           Tag               ray-misc
        [OUTPUT]
           Name              stackdriver
           Match             *
           resource          gce_instance
           labels_key        labels


# --- Head Node Configuration ---
head:
 enableInTreeAutoscaling: false
 serviceAccountName: ""
 rayStartParams:
   dashboard-host: '0.0.0.0'
 template:
   metadata:
     annotations:
       gke-gcsfuse/volumes: "true"
       networking.gke.io/default-interface: 'eth0'
 containerEnv:
 - name: RAY_GROUP
   value: "head"
 resources:
   limits:
     cpu: "206"
     memory: "500G"
     nvidia.com/gpu: 1
   requests:
     cpu: "206"
     memory: "500G"
     nvidia.com/gpu: 1


 volumeMounts:
   - mountPath: /data
     name: lustre-data


 volumes:
   - name: log-volume
     emptyDir: {}
   - name: fluentbit-config-volume
     configMap:
       name: "ray-cluster-kuberay-fluentbit-config"
   - name: lustre-data
     persistentVolumeClaim:
       claimName: lustre-pvc
 sidecarContainers:
   - name: fluent-bit
     image: fluent/fluent-bit:latest
     env:
     - name: RAY_GROUP
       value: "head"
     volumeMounts:
       - name: fluentbit-config-volume
         mountPath: /fluent-bit/etc/
       - mountPath: /tmp/ray
         name: log-volume
  # --- HEAD POD STARTUP SCRIPT ---
 command:
   - "bash"
   - "-c"
   - |
     set -ex
     echo "--- Head Pod Setup ---"
     apt-get update
     apt-get install -y sudo netcat-openbsd pciutils
     cd /opt/nemo-rl
     /usr/bin/python -m pip install uv
     /usr/bin/python -m uv venv
     echo "Head pod setup complete. Starting Ray..."

     exec ${KUBERAY_GEN_RAY_START_CMD}


 args: []
 headService: {}




# --- Default Worker (Disabled) ---
worker:
 disabled: true


# --- A4 GPU Worker Groups ---
additionalWorkerGroups:
 worker-grp-0:
   disabled: false
   replicas: 2
   annotations:
     networking.gke.io/default-interface: 'eth0'
     networking.gke.io/interfaces: |
       [
        {"interfaceName":"eth0","network":"default"},
        {"interfaceName":"eth1","network":"gvnic-1"},
        {"interfaceName":"eth2","network":"rdma-0"},
        {"interfaceName":"eth3","network":"rdma-1"},
        {"interfaceName":"eth4","network":"rdma-2"},
        {"interfaceName":"eth5","network":"rdma-3"},
        {"interfaceName":"eth6","network":"rdma-4"},
        {"interfaceName":"eth7","network":"rdma-5"},
        {"interfaceName":"eth8","network":"rdma-6"},
        {"interfaceName":"eth9","network":"rdma-7"}
      ]
   containerEnv:
     - name: RAY_GROUP
       valueFrom:
         fieldRef:
           fieldPath: metadata.labels['ray.io/group']
     - name: NCCL_NET 
       value: "gIB"
     - name: NCCL_IB_GID_INDEX
       value: "3"  
     - name: GLOO_SOCKET_IFNAME
       value: "eth0"
     - name: NCCL_CROSS_NIC
       value: "0"
     - name: NCCL_SOCKET_IFNAME
       value: "eth0"
     - name: TP_SOCKET_IFNAME # Specific to DTensor/PyTorch Distributed
       value: "eth0"
     - name: NCCL_TUNER_CONFIG_PATH
       value: "/usr/local/gib/configs/tuner_config_a4.txtpb"
     - name: NCCL_NET_GDR_LEVEL
       value: "PIX"
     - name: LD_LIBRARY_PATH
       value: /usr/local/nvidia/lib64
   resources:
     limits:
       nvidia.com/gpu: 8
       cpu: "206"
       memory: "500Gi"
     requests:
       nvidia.com/gpu: 8
       cpu: "206"
       memory: "500Gi"


   nodeSelector:
     cloud.google.com/gke-accelerator: nvidia-b200
   tolerations:
     - operator: "Exists"
       key: "nvidia.com/gpu"
     - operator: "Exists"
       key: "cloud.google.com/impending-node-termination"
     - operator: "Exists"
       key: "user-workload"
   securityContext:
     privileged: true
   volumes:
     - name: log-volume
       emptyDir: {}
     - name: shared-memory
       emptyDir:
         medium: "Memory"
         sizeLimit: 240Gi
     - name: ray-tmp
       emptyDir:
         medium: "Memory"
     - name: fluentbit-config-volume
       configMap:
         name: "ray-cluster-kuberay-fluentbit-config"
     - name: nvidia-install-dir-host
       hostPath:
         path: /home/kubernetes/bin/nvidia
     - name: gib-nccl-plugin-volume
       hostPath:
         path: /home/kubernetes/bin/gib
     - name: lustre-data
       persistentVolumeClaim:
         claimName: lustre-pvc
   volumeMounts:
     - mountPath: /tmp/ray
       name: log-volume
     - name: shared-memory
       mountPath: /dev/shm
     - name: nvidia-install-dir-host
       mountPath: /usr/local/nvidia
     - name: gib-nccl-plugin-volume
       mountPath: /usr/local/gib
     - mountPath: /data
       name: lustre-data  
    # --- WORKER POD STARTUP SCRIPT ---
    command:
      - "bash"
      - "-c"
      - |
        set -ex

        echo "--- Worker Pod Setup ---"
        apt-get update
        apt-get install -y sudo netcat-openbsd pciutils
        cd /opt/nemo-rl
        /usr/bin/python -m pip install uv
        /usr/bin/python -m uv venv

        ldconfig /usr/local/nvidia/lib64/
        ldconfig -p | grep libcuda | sed 's/^/  /'
        export LD_LIBRARY_PATH="/usr/local/gib/lib64:$LD_LIBRARY_PATH"
        source /usr/local/gib/scripts/set_nccl_env.sh

        echo "Worker pod setup complete. Starting Ray..."

        exec ${KUBERAY_GEN_RAY_START_CMD}


   sidecarContainers:
     - name: fluent-bit
       env:
         - name: RAY_GROUP
           valueFrom:
             fieldRef:
               fieldPath: metadata.labels['ray.io/group']
       image: fluent/fluent-bit:latest
       volumeMounts:
         - name: fluentbit-config-volume
           mountPath: /fluent-bit/etc/
         - mountPath: /tmp/ray
           name: log-volume


# --- Service Config ---
service:
 type: ClusterIP

Verify and edit the contents of launcher.sh, which performs the Helm-based installation of your accelerated Ray cluster:

#!/bin/bash
REPLICA_COUNT=2

helm install ray-cluster "<ABSOLUTE_PATH_TO_NeMo-RL-on-GKE>/nemo-rl-on-gke" \
  --set additionalWorkerGroups.worker-grp-0.replicas=$REPLICA_COUNT

Launch Ray cluster using provided script

source launcher.sh

Verify the Ray installation

$ kubectl get pods
NAME                                            READY   STATUS    RESTARTS   AGE
ray-cluster-kuberay-head-sw7dp                  3/3     Running   0          33h
ray-cluster-kuberay-worker-grp-0-worker-gkbxw   3/3     Running   0          33h
ray-cluster-kuberay-worker-grp-0-worker-kdg62   3/3     Running   0          33h

$ kubectl ray get clusters
NAME                  NAMESPACE   DESIRED WORKERS   AVAILABLE WORKERS   CPUS   GPUS   TPUS   MEMORY        CONDITION               STATUS   AGE
ray-cluster-kuberay   default     2                 2                   618    17     0      1573741824k   RayClusterProvisioned   ready    33h

Step 3.2: Establish a secure local connection to Ray before launching training

The kubectl ray session command establishes a secure network tunnel (port forwarding) from your local machine directly to the Ray head pod inside your Kubernetes cluster. This lets you access the web-based Ray Dashboard to monitor jobs, or submit Python scripts interactively from your laptop as if the remote cluster were running locally, without having to configure complex networking or log into the remote server manually.

$ kubectl ray session ray-cluster-kuberay
Forwarding ports to service ray-cluster-kuberay-head-svc
Ray Dashboard: http://localhost:8265
Ray Interactive Client: http://localhost:10001

Forwarding from 127.0.0.1:8265 -> 8265
Forwarding from [::1]:8265 -> 8265
Forwarding from 127.0.0.1:10001 -> 10001
Forwarding from [::1]:10001 -> 10001

Step 4: Configure GRPO Training

The GRPO configuration file defines all aspects of training. Key parameters include model selection, generation settings, and optimization hyperparameters. If you use the same model and dataset, you can directly use the optimized config file provided by NeMo RL.
Configuration file used for this example: examples/configs/grpo_math_8B.yaml in the NVIDIA-NeMo/RL GitHub repository

grpo_llama8b.yaml - Key configuration sections

policy:
  model_name: "meta-llama/Llama-3.1-8B-Instruct"
  tokenizer:
    name: ${policy.model_name} ## specify if you'd like to use a tokenizer different from the model's default
  train_global_batch_size: 512
  train_micro_batch_size: 1
  generation_batch_size: 32 # Only used when generating using HF backend
  logprob_batch_size: 2
  max_total_sequence_length: 4096
  precision: "bfloat16"

  generation:
    backend: "vllm"
    max_new_tokens: ${policy.max_total_sequence_length}
    temperature: 1.0
    top_p: 1.0
    top_k: null
    stop_token_ids: null
    stop_strings: null
    vllm_cfg:
      tensor_parallel_size: 1
      gpu_memory_utilization: 0.6
      max_model_len: ${policy.max_total_sequence_length}
      enforce_eager: False


grpo:
  num_prompts_per_step: 64
  num_generations_per_prompt: 32
  async_grpo:
    enabled: false
    max_trajectory_age_steps: 1


loss_fn:
  clip_epsilon: 0.2
  kl_penalty_coeff: 0.01

cluster:
  num_nodes: 2
  gpus_per_node: 8
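
To connect the loss_fn parameters above to the math, here is a scalar sketch of a GRPO-style per-token objective: a PPO-style clipped policy loss governed by clip_epsilon, plus a KL penalty toward the reference policy weighted by kl_penalty_coeff. Real training operates on batched tensors, and NeMo RL's exact loss and KL estimators may differ; this is an assumption-laden illustration.

```python
import math

def grpo_token_loss(logp, logp_old, logp_ref, advantage,
                    clip_epsilon=0.2, kl_penalty_coeff=0.01):
    # Importance ratio between current and rollout-time policy
    ratio = math.exp(logp - logp_old)
    # Clip the ratio to [1 - eps, 1 + eps], as in PPO
    clipped = max(min(ratio, 1 + clip_epsilon), 1 - clip_epsilon)
    policy_loss = -min(ratio * advantage, clipped * advantage)
    # Simple KL estimate against the reference model (illustrative choice)
    kl = logp - logp_ref
    return policy_loss + kl_penalty_coeff * kl

# Ratio above 1 + eps with positive advantage: the clipped term caps the update
loss = grpo_token_loss(logp=-1.0, logp_old=-1.5, logp_ref=-1.2, advantage=1.0)
print(round(loss, 3))  # -1.198
```

The clipping keeps any single update step bounded, while the KL term discourages the policy from drifting far from the reference model.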

Step 5: Deploy NeMo RL Workload

This bash script automates the submission of a distributed reinforcement learning job to a Ray cluster hosted on Google Kubernetes Engine (GKE). It first dynamically identifies the Ray head pod using kubectl, then constructs and injects a shell script that configures the environment variables needed for Weights & Biases and Hugging Face authentication. Finally, it executes a NeMo RL training command via uv to train a Llama 3.1 8B model with the GRPO algorithm on the DeepScaler dataset, distributing the workload across two nodes with a total of 16 GPUs. Checkpoints are stored on the PVC provisioned from our Lustre storage class.

Filename: submit_llama3.1-8b-lustre.sh

#!/bin/bash
WANDB_API_KEY='WANDB_KEY' # Update this with your WANDB API key
HF_TOKEN='HF_KEY' # Update this with your HF token
WORLD_SIZE=16


# --- Step 1: Find the Ray Head Pod ---
echo "Finding Ray head pod..."
export HEAD_POD_NAME=$(kubectl get pods --selector=ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')
if [ -z "$HEAD_POD_NAME" ]; then
   echo "Error: No running Ray head pod found. Please check your cluster."
   exit 1
fi
echo "Found head pod: $HEAD_POD_NAME"
echo ""


# --- Step 2: Define the Job Script to Run ---
# This is the script that will be executed *inside* the head pod.
# It assumes the 'uv venv' setup from the values.yaml is already done.
JOB_SCRIPT=$(cat <<EOF
set -ex


echo "--- Running on Ray Head Pod ($HOSTNAME) ---"
cd /opt/nemo-rl


echo "Setting environment variables..."
export WANDB_API_KEY=$WANDB_API_KEY
export HF_TOKEN=$HF_TOKEN
export HF_HOME=/opt/nemo-rl/


###-----Example to launch llama 3.1 8b on 2 nodes (16 GPUs)----------
uv run python examples/run_grpo_math.py \
--config examples/configs/grpo_math_8B.yaml \
logger.wandb_enabled=True \
cluster.num_nodes=2 \
cluster.gpus_per_node=8 \
logger.wandb.name='llama3.1-8b-deepscaler-grpo-2nodes' \
grpo.max_num_steps=100 \
checkpointing.checkpoint_dir=/data/nemo_rl_llama3_8b_ds_cp \
data.dataset_name='DeepScaler'




echo "--- Job Finished ---"
EOF
)


# --- Step 3: Execute the Job ---
echo "Submitting job to $HEAD_POD_NAME..."
echo "$JOB_SCRIPT" | tr -d '\r' | kubectl exec -i $HEAD_POD_NAME -c ray-head -- /bin/bash


echo ""
echo "Job submission complete."

Run this file as

source submit_llama3.1-8b-lustre.sh

Monitor the output on your console (successful output after grpo.max_num_steps=100):

========================= Step 100/100 =========================
▶ Preparing batch...
▶ Generating responses for batch of size 2048...
(VllmGenerationWorker pid=260234, ip=10.4.2.6) INFO 01-23 04:20:02 [block_pool.py:321] Successfully reset prefix cache [repeated 30x across cluster]
(VllmGenerationWorker pid=259642, ip=10.4.1.6) INFO 01-23 04:20:35 [executor_base.py:203] It took 0.582119 seconds to wake up tags ['weights'].
(VllmGenerationWorker pid=259731, ip=10.4.1.6) INFO 01-23 04:20:03 [gpu_worker.py:104] Sleep mode freed 102.62 GiB memory, 4.79 GiB memory is still in use. [repeated 15x across cluster]
(VllmGenerationWorker pid=259731, ip=10.4.1.6) INFO 01-23 04:20:03 [executor_base.py:187] It took 1.359313 seconds to fall asleep. [repeated 15x across cluster]
(VllmGenerationWorker pid=259644, ip=10.4.1.6) INFO 01-23 04:20:35 [executor_base.py:203] It took 0.581814 seconds to wake up tags ['weights'].
(DTensorPolicyWorkerV2[rank=0] pid=262448, ip=10.4.1.6) DTensorPolicyWorkerV2[rank=0]: Packed 1 groups of tensors
(DTensorPolicyWorkerV2[rank=0] pid=262448, ip=10.4.1.6) GPU Memory after optimizer offload: 0.02GB allocated, 0.04GB reserved
(VllmGenerationWorker pid=260126, ip=10.4.2.6) INFO 01-23 04:20:37 [executor_base.py:203] It took 0.075903 seconds to wake up tags ['kv_cache'].
Adding requests: 100%|██████████| 128/128 [00:00<00:00, 15500.37it/s]
Processed prompts:   0%|          | 0/128 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts: 100%|██████████| 128/128 [00:32<00:00,  3.91it/s, est. speed input: 588.03 toks/s, output: 4861.31 toks/s] [repeated 3x across cluster]
Processed prompts:   1%|          | 1/128 [00:01<03:05,  1.46s/it, est. speed input: 113.18 toks/s, output: 102.20 toks/s]
Adding requests: 100%|██████████| 128/128 [00:00<00:00, 15729.72it/s] [repeated 15x across cluster]
Processed prompts:   0%|          | 0/128 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s] [repeated 15x across cluster]
Processed prompts:  60%|██████    | 77/128 [00:06<00:03, 14.97it/s, est. speed input: 1440.55 toks/s, output: 5429.55 toks/s] [repeated 317x across cluster]
Processed prompts:  92%|█████████▏| 118/128 [00:07<00:00, 15.60it/s, est. speed input: 1720.48 toks/s, output: 7977.87 toks/s]
Processed prompts:  94%|█████████▍| 120/128 [00:07<00:00, 12.95it/s, est. speed input: 1684.62 toks/s, output: 7919.15 toks/s]
Processed prompts:  88%|████████▊ | 113/128 [00:11<00:01,  8.34it/s, est. speed input: 1316.02 toks/s, output: 6764.95 toks/s] [repeated 267x across cluster]
Processed prompts:  94%|█████████▍| 120/128 [00:11<00:01,  4.40it/s, est. speed input: 1906.63 toks/s, output: 6585.21 toks/s] [repeated 9x across cluster]
Processed prompts:  82%|████████▏ | 105/128 [00:16<00:07,  3.00it/s, est. speed input: 766.67 toks/s, output: 5213.57 toks/s] [repeated 84x across cluster]
Processed prompts:  92%|█████████▏| 118/128 [00:17<00:04,  2.30it/s, est. speed input: 874.85 toks/s, output: 4945.55 toks/s] [repeated 36x across cluster]
Processed prompts:  90%|████████▉ | 115/128 [00:21<00:04,  2.97it/s, est. speed input: 678.24 toks/s, output: 5093.30 toks/s] [repeated 33x across cluster]
Processed prompts:  95%|█████████▌| 121/128 [00:22<00:07,  1.02s/it, est. speed input: 648.11 toks/s, output: 3876.12 toks/s] [repeated 21x across cluster]
Processed prompts: 100%|██████████| 128/128 [00:22<00:00,  5.57it/s, est. speed input: 758.07 toks/s, output: 4504.22 toks/s]
Processed prompts:  87%|████████▋ | 111/128 [00:26<00:21,  1.26s/it, est. speed input: 438.47 toks/s, output: 3665.12 toks/s] [repeated 8x across cluster]
Processed prompts:  98%|█████████▊| 126/128 [00:27<00:04,  2.31s/it, est. speed input: 653.31 toks/s, output: 3041.30 toks/s] [repeated 32x across cluster]
Processed prompts: 100%|██████████| 128/128 [00:27<00:00,  4.61it/s, est. speed input: 728.16 toks/s, output: 3483.17 toks/s] [repeated 6x across cluster]
Processed prompts: 100%|██████████| 128/128 [00:30<00:00,  4.22it/s, est. speed input: 594.41 toks/s, output: 4538.78 toks/s]
Processed prompts:  90%|████████▉ | 115/128 [00:31<00:16,  1.27s/it, est. speed input: 381.76 toks/s, output: 3506.59 toks/s] [repeated 4x across cluster]
Processed prompts:  97%|█████████▋| 124/128 [00:31<00:06,  1.61s/it, est. speed input: 490.68 toks/s, output: 4256.31 toks/s] [repeated 15x across cluster]
Processed prompts: 100%|██████████| 128/128 [00:32<00:00,  3.99it/s, est. speed input: 504.13 toks/s, output: 4730.26 toks/s] [repeated 7x across cluster]
(VllmGenerationWorker pid=260132, ip=10.4.2.6) INFO 01-23 04:21:14 [block_pool.py:321] Successfully reset prefix cache [repeated 2x across cluster]
(VllmGenerationWorker pid=260234, ip=10.4.2.6) INFO 01-23 04:20:35 [executor_base.py:203] It took 0.580567 seconds to wake up tags ['weights']. [repeated 14x across cluster]
(DTensorPolicyWorkerV2[rank=11] pid=262730, ip=10.4.2.6) GPU Memory after optimizer offload: 0.02GB allocated, 0.13GB reserved [repeated 15x across cluster]
(VllmGenerationWorker pid=259731, ip=10.4.1.6) INFO 01-23 04:20:37 [executor_base.py:203] It took 0.091887 seconds to wake up tags ['kv_cache']. [repeated 15x across cluster]
(VllmGenerationWorker pid=259733, ip=10.4.1.6) INFO 01-23 04:21:15 [gpu_worker.py:104] Sleep mode freed 102.60 GiB memory, 4.70 GiB memory is still in use.
(VllmGenerationWorker pid=259733, ip=10.4.1.6) INFO 01-23 04:21:15 [executor_base.py:187] It took 0.914538 seconds to fall asleep.
▶ Processing rewards...
▶ Computing advantages...
▶ Preparing for logprob inference...
▶ Computing logprobs...
▶ Preparing for training...
▶ Training policy...
(VllmGenerationWorker pid=259731, ip=10.4.1.6) INFO 01-23 04:21:14 [block_pool.py:321] Successfully reset prefix cache [repeated 30x across cluster]
(VllmGenerationWorker pid=260126, ip=10.4.2.6) INFO 01-23 04:21:44 [executor_base.py:203] It took 0.581929 seconds to wake up tags ['weights'].
(VllmGenerationWorker pid=260130, ip=10.4.2.6) INFO 01-23 04:21:15 [gpu_worker.py:104] Sleep mode freed 102.55 GiB memory, 4.77 GiB memory is still in use. [repeated 15x across cluster]
(VllmGenerationWorker pid=260130, ip=10.4.2.6) INFO 01-23 04:21:15 [executor_base.py:187] It took 1.343605 seconds to fall asleep. [repeated 15x across cluster]
(VllmGenerationWorker pid=260128, ip=10.4.2.6) INFO 01-23 04:21:44 [executor_base.py:203] It took 0.582409 seconds to wake up tags ['weights'].
(DTensorPolicyWorkerV2[rank=0] pid=262448, ip=10.4.1.6) DTensorPolicyWorkerV2[rank=0]: Packed 1 groups of tensors
(DTensorPolicyWorkerV2[rank=0] pid=262448, ip=10.4.1.6) GPU Memory after optimizer offload: 0.02GB allocated, 0.04GB reserved
▶ Starting validation at step 100...
(VllmGenerationWorker pid=260126, ip=10.4.2.6) INFO 01-23 04:21:46 [executor_base.py:203] It took 0.078912 seconds to wake up tags ['kv_cache'].
Adding requests: 100%|██████████| 16/16 [00:00<00:00, 11875.57it/s]
Processed prompts:   0%|          | 0/16 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts:  91%|█████████ | 116/128 [00:33<00:17,  1.48s/it, est. speed input: 362.04 toks/s, output: 3412.34 toks/s]
Processed prompts:  91%|█████████▏| 117/128 [00:33<00:13,  1.19s/it, est. speed input: 360.14 toks/s, output: 3480.90 toks/s]
Processed prompts: 100%|██████████| 128/128 [00:33<00:00,  3.77it/s, est. speed input: 393.67 toks/s, output: 4759.65 toks/s]
Processed prompts:   6%|▋         | 1/16 [00:01<00:28,  1.89s/it, est. speed input: 107.97 toks/s, output: 160.36 toks/s]
Processed prompts:  12%|█▎        | 2/16 [00:03<00:17,  1.27s/it, est. speed input: 83.14 toks/s, output: 313.49 toks/s]
Adding requests: 100%|██████████| 16/16 [00:00<00:00, 12241.68it/s] [repeated 15x across cluster]
Processed prompts:   0%|          | 0/16 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s] [repeated 15x across cluster]
Processed prompts:  69%|██████▉   | 11/16 [00:06<00:02,  2.14it/s, est. speed input: 302.57 toks/s, output: 1218.42 toks/s] [repeated 138x across cluster]
Processed prompts:  94%|█████████▍| 15/16 [00:07<00:00,  2.25it/s, est. speed input: 308.54 toks/s, output: 1568.40 toks/s]
Processed prompts: 100%|██████████| 16/16 [00:07<00:00,  2.01it/s, est. speed input: 319.65 toks/s, output: 1689.95 toks/s]
Processed prompts:  81%|████████▏ | 13/16 [00:11<00:03,  1.28s/it, est. speed input: 193.58 toks/s, output: 1041.41 toks/s] [repeated 41x across cluster]
Processed prompts:  94%|█████████▍| 15/16 [00:13<00:01,  1.53s/it, est. speed input: 205.27 toks/s, output: 1030.76 toks/s] [repeated 8x across cluster]
Processed prompts: 100%|██████████| 16/16 [00:10<00:00,  1.50it/s, est. speed input: 253.25 toks/s, output: 1233.89 toks/s] [repeated 3x across cluster]
Processed prompts:  88%|████████▊ | 14/16 [00:13<00:02,  1.47s/it, est. speed input: 175.34 toks/s, output: 1061.80 toks/s]
Processed prompts:  94%|█████████▍| 15/16 [00:19<00:03,  3.56s/it, est. speed input: 107.08 toks/s, output: 715.25 toks/s] [repeated 3x across cluster]
Processed prompts: 100%|██████████| 16/16 [00:14<00:00,  1.08it/s, est. speed input: 167.17 toks/s, output: 1068.08 toks/s] [repeated 2x across cluster]

📊 Validation Results:
    • Accuracy: 0.0898
    • Average response length: 1067.4 tokens
    • Samples processed: 256

  ⏱️  Validation Timing:
    • Total validation time: 24.44s
(VllmGenerationWorker pid=259642, ip=10.4.1.6) INFO 01-23 04:22:11 [block_pool.py:321] Successfully reset prefix cache [repeated 2x across cluster]
(VllmGenerationWorker pid=259731, ip=10.4.1.6) INFO 01-23 04:21:44 [executor_base.py:203] It took 0.582607 seconds to wake up tags ['weights']. [repeated 14x across cluster]
(DTensorPolicyWorkerV2[rank=11] pid=262730, ip=10.4.2.6) GPU Memory after optimizer offload: 0.02GB allocated, 0.13GB reserved [repeated 15x across cluster]
(VllmGenerationWorker pid=259731, ip=10.4.1.6) INFO 01-23 04:21:46 [executor_base.py:203] It took 0.072994 seconds to wake up tags ['kv_cache']. [repeated 15x across cluster]
(VllmGenerationWorker pid=259733, ip=10.4.1.6) INFO 01-23 04:22:12 [gpu_worker.py:104] Sleep mode freed 102.43 GiB memory, 4.70 GiB memory is still in use.
(VllmGenerationWorker pid=259733, ip=10.4.1.6) INFO 01-23 04:22:12 [executor_base.py:187] It took 0.978071 seconds to fall asleep.
Saving checkpoint for step 100...
(DTensorPolicyWorkerV2[rank=0] pid=262448, ip=10.4.1.6) Saving tokenizer (or processor) to /data/nemo_rl_llama3_8b_ds_cp/tmp_step_100/policy/tokenizer
(VllmGenerationWorker pid=260234, ip=10.4.2.6) INFO 01-23 04:22:11 [block_pool.py:321] Successfully reset prefix cache [repeated 30x across cluster]
(VllmGenerationWorker pid=259731, ip=10.4.1.6) INFO 01-23 04:22:13 [gpu_worker.py:104] Sleep mode freed 102.45 GiB memory, 4.79 GiB memory is still in use. [repeated 15x across cluster]
(VllmGenerationWorker pid=259731, ip=10.4.1.6) INFO 01-23 04:22:13 [executor_base.py:187] It took 1.396513 seconds to fall asleep. [repeated 15x across cluster]
Removing checkpoint /data/nemo_rl_llama3_8b_ds_cp/step_90 due to being outside top-3
Logged data to logs/exp_002/train_data_step99.jsonl

Other metrics captured by NeMo RL include:

📊 Training Results:
  • Loss: 0.0351
  • Avg Reward: 0.2407
  • Mean Generation Length: 903.4531

⏱️  Timing:
  • Total step time: 122.87s
  • generation: 39.05s (31.8%)
  • checkpointing: 21.48s (17.5%)
  • policy_training: 16.89s (13.7%)
  • policy_and_reference_logprobs: 7.74s (6.3%)
  • prepare_for_generation/total: 3.25s (2.6%)
  • training_prep: 1.15s (0.9%)
  • logprob_inference_prep: 1.10s (0.9%)
  • prepare_for_generation/transfer_and_update_weights: 0.68s (0.6%)
  • data_processing: 0.56s (0.5%)
  • reward_calculation: 0.03s (0.0%)

🔍 Performance Metrics:
  • Mean Total Tokens per Sample: 906.36
  • Throughputs (per GPU):
    - E2E (Samples/sec/gpu): 1.04
    - E2E (Tokens/sec/gpu): 1087.21
    - Policy Training (Tokens/sec/gpu): 7908.96
    - Policy and Reference Logprobs (Tokens/sec/gpu): 17251.90
    - Training Worker Group (Tokens/sec/gpu): 5422.89
    - Generation Worker Group (Tokens/sec/gpu): 3420.66
  • Throughputs (per Group):
    - E2E (Samples/sec): 16.67
    - E2E (Tokens/sec): 17395.32
    - Training Worker Group (Tokens/sec): 86766.26
    - Generation Worker Group (Tokens/sec): 54730.51
  • Training FLOPS: 5864.40 TFLOPS (366.53 TFLOPS per rank)
  • Training Model Floating Point Utilization: 16.29%
Max number of steps has been reached, stopping training early
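The per-GPU and per-group throughput numbers above are consistent with the cluster size used in this run (16 GPUs: ranks 0-15 appear in the logs and 16 optimizer shards are written per checkpoint). A quick sanity check of how the summary numbers relate, assuming that 16-GPU layout:

```python
# Sanity-check the reported throughput numbers, assuming 16 GPUs for this run
# (DTensor ranks 0-15 and 16 .distcp optimizer shards appear in the output).
NUM_GPUS = 16

reported = {
    # metric: (per_group, per_gpu) values from the NeMo RL summary above
    "E2E (Samples/sec)": (16.67, 1.04),
    "E2E (Tokens/sec)": (17395.32, 1087.21),
    "Training Worker Group (Tokens/sec)": (86766.26, 5422.89),
    "Generation Worker Group (Tokens/sec)": (54730.51, 3420.66),
}

# Per-GPU throughput is simply per-group throughput divided by GPU count.
for name, (per_group, per_gpu) in reported.items():
    assert abs(per_group / NUM_GPUS - per_gpu) < 0.01, name

# The timing percentages are each phase's share of the total step time.
total_step_time = 122.87
assert round(39.05 / total_step_time * 100, 1) == 31.8  # generation
assert round(21.48 / total_step_time * 100, 1) == 17.5  # checkpointing

print("per-GPU = per-group / num_gpus; percentages = phase / step time")
```

This also explains the FLOPS line: 5864.40 TFLOPS across the cluster is 366.53 TFLOPS per rank over 16 ranks.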

Step 6: Verify the checkpoints

After Ray finishes the job, NeMo RL stores the checkpoints in the configured path. This can be verified by accessing the PVC via one of the Ray cluster's worker pods in GKE. Checkpoints under this folder are saved in the NeMo 2 format and can be easily imported into other NeMo functionalities such as NeMo tune and export, as well as NIMs.

$ kubectl exec -it ray-cluster-kuberay-worker-grp-0-worker-gkbxw -- bash
Defaulted container "ray-worker" out of: ray-worker, fluent-bit, fluentbit

root@ray-cluster-kuberay-worker-grp-0-worker-gkbxw:/opt/nemo-rl# tree /data/nemo_rl_llama3_8b_ds_cp/
/data/nemo_rl_llama3_8b_ds_cp/
|-- step_100
|   |-- config.yaml
|   |-- policy
|   |   |-- optimizer
|   |   |   `-- optim
|   |   |       |-- __0_0.distcp
|   |   |       |-- __10_0.distcp
|   |   |       |-- __11_0.distcp
|   |   |       |-- __12_0.distcp
|   |   |       |-- __13_0.distcp
|   |   |       |-- __14_0.distcp
|   |   |       |-- __15_0.distcp
|   |   |       |-- __1_0.distcp
|   |   |       |-- __2_0.distcp
|   |   |       |-- __3_0.distcp
|   |   |       |-- __4_0.distcp
|   |   |       |-- __5_0.distcp
|   |   |       |-- __6_0.distcp
|   |   |       |-- __7_0.distcp
|   |   |       |-- __8_0.distcp
|   |   |       `-- __9_0.distcp
|   |   |-- tokenizer
|   |   |   |-- chat_template.jinja
|   |   |   |-- special_tokens_map.json
|   |   |   |-- tokenizer.json
|   |   |   `-- tokenizer_config.json
|   |   `-- weights
|   |       `-- model
|   |           |-- shard-00001-model-00001-of-00001.safetensors
|   |           |-- shard-00002-model-00001-of-00001.safetensors
|   |           |-- shard-00003-model-00001-of-00001.safetensors
|   |           |-- shard-00004-model-00001-of-00001.safetensors
|   |           |-- shard-00005-model-00001-of-00001.safetensors
|   |           |-- shard-00006-model-00001-of-00001.safetensors
|   |           |-- shard-00007-model-00001-of-00001.safetensors
|   |           |-- shard-00008-model-00001-of-00001.safetensors
|   |           |-- shard-00009-model-00001-of-00001.safetensors
|   |           |-- shard-00010-model-00001-of-00001.safetensors
|   |           |-- shard-00011-model-00001-of-00001.safetensors
|   |           |-- shard-00012-model-00001-of-00001.safetensors
|   |           |-- shard-00013-model-00001-of-00001.safetensors
|   |           |-- shard-00014-model-00001-of-00001.safetensors
|   |           |-- shard-00015-model-00001-of-00001.safetensors
|   |           `-- shard-00016-model-00001-of-00001.safetensors
|   |-- train_dataloader.pt
|   `-- training_info.json
|-- step_40
|   |-- config.yaml
|   |-- policy
|   |   |-- optimizer
|   |   |   `-- optim
|   |   |       |-- __0_0.distcp
|   |   |       |-- __10_0.distcp
|   |   |       |-- __11_0.distcp
|   |   |       |-- __12_0.distcp
|   |   |       |-- __13_0.distcp
|   |   |       |-- __14_0.distcp
|   |   |       |-- __15_0.distcp
|   |   |       |-- __1_0.distcp
|   |   |       |-- __2_0.distcp
|   |   |       |-- __3_0.distcp
|   |   |       |-- __4_0.distcp
|   |   |       |-- __5_0.distcp
|   |   |       |-- __6_0.distcp
|   |   |       |-- __7_0.distcp
|   |   |       |-- __8_0.distcp
|   |   |       `-- __9_0.distcp
|   |   |-- tokenizer
|   |   |   |-- chat_template.jinja
|   |   |   |-- special_tokens_map.json
|   |   |   |-- tokenizer.json
|   |   |   `-- tokenizer_config.json
|   |   `-- weights
|   |       `-- model
|   |           |-- shard-00001-model-00001-of-00001.safetensors
|   |           |-- shard-00002-model-00001-of-00001.safetensors
|   |           |-- shard-00003-model-00001-of-00001.safetensors
|   |           |-- shard-00004-model-00001-of-00001.safetensors
|   |           |-- shard-00005-model-00001-of-00001.safetensors
|   |           |-- shard-00006-model-00001-of-00001.safetensors
|   |           |-- shard-00007-model-00001-of-00001.safetensors
|   |           |-- shard-00008-model-00001-of-00001.safetensors
|   |           |-- shard-00009-model-00001-of-00001.safetensors
|   |           |-- shard-00010-model-00001-of-00001.safetensors
|   |           |-- shard-00011-model-00001-of-00001.safetensors
|   |           |-- shard-00012-model-00001-of-00001.safetensors
|   |           |-- shard-00013-model-00001-of-00001.safetensors
|   |           |-- shard-00014-model-00001-of-00001.safetensors
|   |           |-- shard-00015-model-00001-of-00001.safetensors
|   |           `-- shard-00016-model-00001-of-00001.safetensors
|   |-- train_dataloader.pt
|   `-- training_info.json
`-- step_60
    |-- config.yaml
    |-- policy
    |   |-- optimizer
    |   |   `-- optim
    |   |       |-- __0_0.distcp
    |   |       |-- __10_0.distcp
    |   |       |-- __11_0.distcp
    |   |       |-- __12_0.distcp
    |   |       |-- __13_0.distcp
    |   |       |-- __14_0.distcp
    |   |       |-- __15_0.distcp
    |   |       |-- __1_0.distcp
    |   |       |-- __2_0.distcp
    |   |       |-- __3_0.distcp
    |   |       |-- __4_0.distcp
    |   |       |-- __5_0.distcp
    |   |       |-- __6_0.distcp
    |   |       |-- __7_0.distcp
    |   |       |-- __8_0.distcp
    |   |       `-- __9_0.distcp
    |   |-- tokenizer
    |   |   |-- chat_template.jinja
    |   |   |-- special_tokens_map.json
    |   |   |-- tokenizer.json
    |   |   `-- tokenizer_config.json
    |   `-- weights
    |       `-- model
    |           |-- shard-00001-model-00001-of-00001.safetensors
    |           |-- shard-00002-model-00001-of-00001.safetensors
    |           |-- shard-00003-model-00001-of-00001.safetensors
    |           |-- shard-00004-model-00001-of-00001.safetensors
    |           |-- shard-00005-model-00001-of-00001.safetensors
    |           |-- shard-00006-model-00001-of-00001.safetensors
    |           |-- shard-00007-model-00001-of-00001.safetensors
    |           |-- shard-00008-model-00001-of-00001.safetensors
    |           |-- shard-00009-model-00001-of-00001.safetensors
    |           |-- shard-00010-model-00001-of-00001.safetensors
    |           |-- shard-00011-model-00001-of-00001.safetensors
    |           |-- shard-00012-model-00001-of-00001.safetensors
    |           |-- shard-00013-model-00001-of-00001.safetensors
    |           |-- shard-00014-model-00001-of-00001.safetensors
    |           |-- shard-00015-model-00001-of-00001.safetensors
    |           `-- shard-00016-model-00001-of-00001.safetensors
    |-- train_dataloader.pt
    `-- training_info.json

22 directories, 117 files
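The tree above reflects the top-3 retention policy seen in the training log (step_90 was removed, leaving step_40, step_60, and step_100). A small helper can list which steps survived; the checkpoint root path below is the one used in this run:

```python
import os
import re

def retained_steps(ckpt_root: str) -> list:
    """Return the sorted step numbers of checkpoint directories under ckpt_root."""
    steps = []
    for entry in os.listdir(ckpt_root):
        m = re.fullmatch(r"step_(\d+)", entry)
        if m and os.path.isdir(os.path.join(ckpt_root, entry)):
            steps.append(int(m.group(1)))
    return sorted(steps)

ckpt_root = "/data/nemo_rl_llama3_8b_ds_cp"  # mount path used in this run
if os.path.isdir(ckpt_root):
    # Against the tree above this prints [40, 60, 100]
    print(retained_steps(ckpt_root))
```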

Step 7: Running Eval against the post-trained model

Once training is complete, the evaluation phase allows you to measure the performance of your policy against standard benchmarks (like AIME or MATH-500) or your own custom datasets.

The evaluation pipeline follows three main steps: Format Conversion, Configuration, and Execution.

1. Convert DCP to Hugging Face (Optional)

NeMo-RL often saves checkpoints in the PyTorch Distributed Checkpoint (DCP) format. However, the evaluation script requires the Hugging Face (HF) format. If you have a local checkpoint, convert it using the provided utility script:

# Example: Converting a GRPO checkpoint from step 60 
uv run python examples/converters/convert_dcp_to_hf.py \
    --config nemo_rl_llama3_8b_ds_cp/step_60/config.yaml \
    --dcp-ckpt-path nemo_rl_llama3_8b_ds_cp/step_60/policy/weights/ \
    --hf-ckpt-path nemo_rl_llama3_8b_ds_cp/hf
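If you want to convert every retained checkpoint rather than a single step, the same command can be assembled and launched from Python. This is only a convenience wrapper around the converter invocation above; the per-step output path (`hf_step_<N>`) is our own naming, not something NeMo-RL prescribes:

```python
import subprocess  # used by the optional run line below

ckpt_root = "nemo_rl_llama3_8b_ds_cp"  # checkpoint root from this run

def build_convert_cmd(step: int) -> list:
    """Build the convert_dcp_to_hf.py invocation for a given training step."""
    ckpt = f"{ckpt_root}/step_{step}"
    return [
        "uv", "run", "python", "examples/converters/convert_dcp_to_hf.py",
        "--config", f"{ckpt}/config.yaml",
        "--dcp-ckpt-path", f"{ckpt}/policy/weights/",
        # Output path naming is our assumption; pick any destination you like.
        "--hf-ckpt-path", f"{ckpt_root}/hf_step_{step}",
    ]

for step in (40, 60, 100):  # the retained checkpoints from Step 6
    cmd = build_convert_cmd(step)
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually convert
```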

2. Configure the Evaluation Environment

The evaluation suite is highly flexible. You can use the default settings (which target Qwen2.5-Math-1.5B-Instruct on AIME-2024) or override them for your specific needs.

Prompt templates: Consistency is key; always use the same chat_template that was used during training.
Open-source defaults: For most HF models, set tokenizer.chat_template=default and keep data.prompt_file as null to use the model’s native formatting.

3. Run the Evaluation Script

Use run_eval.py to initiate the process. You can point to models on the Hugging Face Hub or your newly converted local path.
Common Execution Commands:

Default Eval:
uv run python examples/run_eval.py

Local Model:
uv run python examples/run_eval.py generation.model_name=$PWD/results/grpo/hf

GPQA Benchmark:
uv run python examples/run_eval.py --config examples/configs/evals/gpqa_eval.yaml

Multi GPU Evaluation:
uv run python examples/run_eval.py \
   --config examples/configs/evals/math_eval.yaml \
   generation.model_name=nemo_rl_llama3_8b_ds_cp/hf \
   generation.temperature=0.6 \
   generation.vllm_cfg.max_model_len=32768 \
   generation.vllm_cfg.tensor_parallel_size=$TP \
   data.dataset_name=math500 \
   eval.num_tests_per_prompt=16 \
   cluster.gpus_per_node=8

4. Interpreting the Output

After the script finishes, you will see a summary block indicating the success rate of your model.

============================================================
model_name='nemo_rl_llama3_8b_ds_cp/hf' dataset_name='math500'
max_new_tokens=32768 temperature=0.6 top_p=1.0 top_k=-1 seed=42
metric=pass@1 num_tests_per_prompt=16
score=0.8981 (449.06250106170774/500)
============================================================

Score: The decimal representation of your accuracy (here 0.8981, i.e. roughly 89.8% pass@1).
Ratio: The raw count of correct answers over the total number of problems (here 449.06/500). With num_tests_per_prompt=16, each prompt contributes the fraction of its samples that passed, which is why the count can be fractional.
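The fractional raw count suggests per-prompt averaging over the 16 samples. A minimal sketch of that averaging (the function name and structure are illustrative, not NeMo-RL's actual implementation):

```python
def pass_at_1(results: list) -> tuple:
    """results[i] holds per-sample correctness (bools) for prompt i.

    Each prompt contributes the fraction of its samples that passed,
    so the raw count can be fractional when num_tests_per_prompt > 1.
    Returns (score, raw_count).
    """
    per_prompt = [sum(r) / len(r) for r in results]
    raw_count = sum(per_prompt)
    return raw_count / len(results), raw_count

# Toy example: 3 prompts, 4 samples each
prompts = [[True] * 4, [True, True, False, False], [False] * 4]
score, count = pass_at_1(prompts)
print(f"score={score:.4f} ({count}/{len(prompts)})")  # score=0.5000 (1.5/3)
```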

Conclusion

The combination of NVIDIA NeMo RL, Google Kubernetes Engine, Ray orchestration, and Managed Lustre storage provides a production-ready infrastructure for implementing GRPO at scale. This stack addresses the unique challenges of RL workloads: the hybrid nature of training and inference, high memory requirements, and the need for fast checkpointing and data access.
As LLMs continue to evolve toward more sophisticated reasoning capabilities, mastering RL techniques like GRPO, along with the infrastructure to support them, will be essential for organizations building the next generation of AI systems.