Accelerating Reinforcement Learning on Google Cloud using NVIDIA NeMo RL

Authors:

Deepak Patil, Group Product Manager, Google

Qiyue (Jennifer) Liang, Machine Learning Software Engineer, Google


Reinforcement Learning (RL) is rapidly becoming the essential training technique for complex AI agents and workflows requiring advanced reasoning. Unlike traditional methods that rely on static, labeled datasets, RL enables models to learn dynamically—much like humans—through a continuous loop of trial, error, and reward. This learning loop allows models to discover optimal strategies for intricate, sequential decision-making problems. RL is currently driving breakthroughs across every industry, from training robotic arms and optimizing financial trading strategies to discovering new molecules. Ultimately, RL’s ability to tackle problems where the “right answer” isn’t known in advance makes it the critical tool for building the next generation of intelligent, autonomous, and adaptive systems.

Why Reinforcement Learning is complex

Despite its immense potential, scaling RL is notoriously difficult due to the workload’s inherent distributed nature. The process is split between two components: the Sampler, which executes the current policy (model) in a simulated environment to gather new ‘experiences,’ and the Trainer, which updates the policy using those experiences via algorithms like Proximal Policy Optimization (PPO) or the memory-efficient Group Relative Policy Optimization (GRPO).
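To make this division of labor concrete, below is a minimal, hypothetical sketch of a single RL step in Python; the sampler, reward_fn, and trainer objects are illustrative placeholders, not NeMo RL APIs.

#-----Hypothetical sketch of one RL step (illustrative names, not NeMo RL APIs)-----
def rl_step(sampler, reward_fn, trainer, prompts):
    # 1. Sampler: run the current policy to gather new 'experiences' (rollouts).
    rollouts = [sampler.generate(p) for p in prompts]

    # 2. Score every rollout with a reward model or a simpler reward function.
    rewards = [reward_fn(p, r) for p, r in zip(prompts, rollouts)]

    # 3. Trainer: update the policy on the scored batch (e.g., with PPO or GRPO).
    metrics = trainer.update(prompts, rollouts, rewards)

    # 4. Sync the updated weights back to the Sampler before the next step.
    sampler.load_weights(trainer.get_weights())
    return metrics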

A major complexity lies in the reward signal: samples are scored by a separate Reward Model (often required for PPO) or a simpler Reward Function (a key feature of GRPO) before the Trainer can use them. In large-scale, decoupled setups, the Trainer often experiences ‘data starvation,’ sitting idle while large model weights and experience data are transmitted across the network. The delays inherent in this asynchronous process are the biggest inhibitor to rapid iteration, and the idle accelerators they leave behind translate directly into wasted spend.
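As a concrete example of the simpler reward-function path, the sketch below scores math rollouts by exact match against a reference answer and then normalizes the rewards within each prompt’s group of samples, which is the group-relative idea behind GRPO. The helper names and the '####' answer marker are assumptions for illustration, not part of the NeMo RL recipes.

#-----Hypothetical reward function and GRPO-style group normalization-----
import statistics

def math_reward(completion: str, reference_answer: str) -> float:
    # Toy programmatic reward: 1.0 if the final answer matches the reference, else 0.0.
    # Assumes the answer is delimited by '####', a common math-dataset convention.
    predicted = completion.split("####")[-1].strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    # Normalize rewards within a group of samples for the same prompt,
    # so no separately trained critic (value model) is needed.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Four sampled completions for one prompt, scored and normalized.
completions = ["... #### 42", "... #### 41", "... #### 42", "... #### 7"]
rewards = [math_reward(c, "42") for c in completions]
print(group_relative_advantages(rewards))  # above-average samples get positive advantages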

Optimal performance in most RL workloads hinges on infrastructure strategy. Colocating the Trainer and Sampler on multi-GPU nodes creates a tight, low-latency feedback loop. Alternatively, advanced decoupling allows these activities to overlap, achieving maximum concurrency and the fastest possible RL Step Time. Because of this architectural complexity, RL demands deep AI infrastructure system-level optimization combined with a flexible library of cutting-edge algorithms, unlike simpler post-training methods like Supervised Fine-Tuning (SFT).
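Because NeMo RL uses Ray for orchestration (see the recipe steps below), the colocation trade-off can be expressed directly with Ray placement groups. The snippet below is a simplified illustration of the colocated pattern, with placeholder actors standing in for the real Trainer and Sampler; it is not NeMo RL’s internal scheduling code.

#-----Simplified colocation sketch with Ray placement groups (placeholder actors)-----
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# Colocated pattern: STRICT_PACK places both GPU bundles on the same node, so weight
# updates and experience data never cross the network. A decoupled layout would use
# "SPREAD" instead and accept the transfer latency in exchange for overlap.
pg = placement_group(bundles=[{"GPU": 4}, {"GPU": 4}], strategy="STRICT_PACK")
ray.get(pg.ready())

@ray.remote(num_gpus=4)
class Trainer:  # placeholder; a real trainer wraps the training framework
    def ping(self):
        return "trainer up"

@ray.remote(num_gpus=4)
class Sampler:  # placeholder; a real sampler wraps the inference engine (e.g., vLLM)
    def ping(self):
        return "sampler up"

trainer = Trainer.options(scheduling_strategy=PlacementGroupSchedulingStrategy(
    pg, placement_group_bundle_index=0)).remote()
sampler = Sampler.options(scheduling_strategy=PlacementGroupSchedulingStrategy(
    pg, placement_group_bundle_index=1)).remote()
print(ray.get([trainer.ping.remote(), sampler.ping.remote()]))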

Figure 1 - RL Infrastructure View

NVIDIA NeMo RL

NVIDIA NeMo RL is a high-performance framework engineered specifically to address the core scaling and latency challenges inherent in modern RL. Its fundamental value lies in providing a clean, unified API and highly optimized, production-ready implementations of key algorithms, such as GRPO, DPO, and PPO.

The framework’s core strength is its design for massive-scale, multi-GPU, and multi-node training. It tackles the critical ‘data starvation’ problem through integrated orchestration. As a comprehensive toolkit—not just a library—NeMo RL is designed to handle the coordination of the sampler and trainer, particularly excelling in colocated architectures where both reside on the same GPU nodes. This tight integration creates a low-latency feedback loop, which is critical for accelerating the entire RL process.

The framework is optimized to leverage the full power of the NVIDIA compute stack, enabling developers to focus on defining their environment and reward functions rather than reinventing the distributed infrastructure. This abstraction is what makes it possible to tackle billion-parameter models and complex 3D simulation environments, turning a “what if” scenario into a deployable reality.

Experiment with NeMo RL recipes on Google Cloud

We have developed reproducible Reinforcement Learning (RL) recipes that significantly simplify the process of getting started with NVIDIA NeMo RL on Google Cloud. These recipes are designed to run on Google Cloud A4 VMs (powered by NVIDIA HGX B200), utilizing Google Kubernetes Engine (GKE) for seamless orchestration and vLLM as the high-performance inference engine.
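To give a feel for the Sampler side, here is a small, standalone vLLM snippet that generates a group of rollouts for one of the recipe models; the model variant and sampling parameters shown are illustrative choices, not the recipes’ tuned settings.

#-----Standalone vLLM rollout sketch (illustrative model and parameters)-----
from vllm import LLM, SamplingParams

# Load the policy model into vLLM, the inference engine used by the recipes.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", max_model_len=512)

# Sample several completions per prompt, since GRPO scores a group of rollouts together.
params = SamplingParams(n=4, temperature=1.0, max_tokens=256)
outputs = llm.generate(["Solve: 12 * 7 = ?"], params)

for completion in outputs[0].outputs:
    print(completion.text)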

These starting-point examples showcase standard RL infrastructure scaling and configuration:

  1. Qwen2.5 1.5B Model recipe: Demonstrates a colocated Trainer/Sampler configuration running on a single A4 VM with a sequence length of 512.

  2. Llama 3.1 8B Model recipe: Showcases scaling by deploying the colocated Trainer/Sampler configuration across four A4 VMs with an increased sequence length of 4K.

We plan to expand the recipe portfolio to support additional NVIDIA accelerated computing platforms, such as the Google Cloud A4X VMs (powered by NVIDIA GB200 NVL72), and integrate new inference engines like SGLang, while broadening model coverage.

The recipes highlight the following key steps:

  1. Perform initial setup - Create the GKE cluster and configure the environment variables and secrets it needs; this is a one-time step.

  2. NeMo RL requires Ray for job orchestration. Install the KubeRay operator, then configure a Ray cluster to run your NeMo RL workloads.

  3. Within the recipe, we provide example templates for setting up Ray clusters. Once KubeRay is installed and your Ray cluster is configured, you will see the Ray head and worker nodes come up.

  4. When the Ray cluster is operational, you can submit your GRPO workloads. An example launch command for the Llama 3.1 8B recipe is shown below; the same command can also be submitted through the Ray Jobs SDK, as sketched after this list:
###-----Example: launch Llama 3.1 8B GRPO on 4 nodes (32 GPUs)----------
uv run python examples/run_grpo_math.py \
  --config examples/configs/grpo_math_8B.yaml \
  logger.wandb_enabled=True \
  cluster.num_nodes=4 \
  cluster.gpus_per_node=8 \
  logger.wandb.name='llama3.1-8b-grpo-4nodes'

  5. Finally, you will see logs with detailed insights that break down step times across checkpointing, policy training, reward calculation, and other critical metrics.
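If you prefer to submit the GRPO workload programmatically instead of running the command from a shell on the Ray head node, the Ray Jobs SDK can wrap the same entrypoint; the head-service address below is a placeholder for your KubeRay cluster’s dashboard endpoint.

#-----Submitting the same GRPO launch command through the Ray Jobs SDK-----
from ray.job_submission import JobSubmissionClient

# Placeholder address: point this at your Ray head service's dashboard port (default 8265).
client = JobSubmissionClient("http://raycluster-head-svc:8265")

job_id = client.submit_job(
    entrypoint=(
        "uv run python examples/run_grpo_math.py "
        "--config examples/configs/grpo_math_8B.yaml "
        "cluster.num_nodes=4 cluster.gpus_per_node=8"
    ),
)
print("Submitted GRPO job:", job_id)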

Getting started: Accelerate your RL workflow today

The availability of these new reproducible recipes removes the complexity barrier to high-performance RL. This powerful combination of Google Cloud A4 VMs (powered by NVIDIA HGX B200), orchestrated seamlessly by GKE and KubeRay, and the highly optimized NVIDIA NeMo RL framework delivers an unparalleled environment for RL at scale.

We invite you to explore the Qwen2.5 and Llama 3.1 RL recipes now to experience firsthand how this integration simplifies your infrastructure setup and supercharges your RL development cycle. Begin experimenting today and share your feedback by replying at the bottom of this article!
