Training a Golang Expert SLM (Small Language Model) with NeMo RL (GRPO) & Ray on GKE


Objective: Fine-tune a base Small Language Model (SLM) to write Golang code by compiling and testing its output in real-time using Reinforcement Learning (GRPO) on GKE.

Target Model: Qwen2.5-Coder-1.5B-Instruct

Building a Compiler-in-the-Loop RL Pipeline

Training a model to write functional code is one of the most difficult challenges in AI. Traditional Supervised Fine-Tuning (SFT) teaches models through imitation, relying on static text datasets where the model learns what code looks like rather than how it works. This leads to a common “hallucination loop”: models that generate syntactically plausible code that fails to compile, misses brackets, or fails silently due to logic errors.

This project outlines a shift in that paradigm: building a domain-specific coding model for Golang by using Reinforcement Learning (RL) to fine-tune a foundational base model (Qwen2.5-Coder-1.5B-Instruct). By leveraging NeMo RL (v0.50), Ray, and Google Kubernetes Engine (GKE), we established an online feedback loop in which the Go compiler and unit tests act as the ultimate ground truth.

Instead of matching static strings, our model is rewarded for producing syntactically valid code (partial reward) and functionally correct code (full reward)—optimizing directly for functional correctness rather than just next-token prediction.

The SLM Strategic Edge: High-Velocity Rollouts for RL

Reinforcement Learning is fundamentally a high-throughput exploration challenge. To find the optimal policy, a model must generate thousands of experimental code samples (rollouts) per training step. While massive LLMs are often too sluggish for this density of generation, Small Language Models (SLMs) are the perfect fit. They are lightweight, exceptionally fast at inference, and highly responsive to targeted tuning—allowing us to execute massive parallel rollouts on G4 instances without the prohibitive overhead of larger architectures.
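To make the throughput argument concrete, here is a back-of-envelope sketch. The batch size and latencies below are illustrative assumptions, not measured figures from this pipeline; the point is that rollout volume scales multiplicatively, so per-sample generation speed dominates the training loop.

```python
# Back-of-envelope rollout volume for one GRPO step (illustrative numbers,
# not the repo's actual config).
prompts_per_step = 64        # prompt batch size (assumed)
group_size = 8               # GRPO samples per prompt (matches the G=8 used later)
rollouts_per_step = prompts_per_step * group_size  # every one must be compiled and tested

# Assumed per-sample decode latencies at equal parallelism: a 1.5B SLM vs a
# much larger LLM. The ratio is what matters for RL exploration density.
slm_latency_s, llm_latency_s = 2.0, 20.0
speedup = llm_latency_s / slm_latency_s
```

With 512 verifiable generations required per step, a 10x per-sample speedup translates directly into 10x more exploration per GPU-hour.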

Key Technologies:

  • Infrastructure: GKE, gVisor and Google Cloud Storage (HNS)

  • Orchestration: KubeRay for distributed training and rollout.

  • Framework: NeMo RL for the GRPO algorithm.

  • Dataset: TIGER-Lab/AceCode-87K (Adapted for Go).

Architecture Data Flow: From Tokens to Atomic Checkpoints

To safely bridge Python-heavy RL frameworks with compiled Go execution, reward computation is offloaded from the training cluster to a remote reward server. The pipeline is engineered so that untrusted AI-generated code never touches the primary training environment. The high-velocity data flow follows this path:

  • Inference (vLLM on G4 GPUs): The SLM generates multiple Go code candidates (rollouts) in parallel using high-throughput NVIDIA RTX Pro 6000 GPUs.
  • Orchestration (Ray Actors): A custom Golang Environment and Data Processor intercept these candidates, bundling the generated code with hidden unit tests from the dataset metadata.
  • Transport (Remote Reward Server): Ray workers ship this payload via a persistent HTTP session to the stateless Reward Server.
  • Isolation (gVisor Sandbox): The code enters a CPU-based gVisor sandbox, which intercepts system calls to ensure the AI-generated code cannot escape the container.
  • Verification (Go Compiler & Test Runner): The “ground truth” engine attempts to compile and run the code against unit tests.
  • Quantification (Tiered Reward Float): Results are translated into a tiered float (e.g., 0.1 for syntax, 0.3 for logic, 1.0 for success).
  • Optimization (GRPO Weight Update): Rewards are fed back into the GRPO optimizer to compute the advantage and update model weights.
  • Persistence (GCS HNS Bucket): Updated weights are saved as atomic checkpoints to a Hierarchical Namespace (HNS) bucket, ensuring POSIX-compliant directory renames.
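The Quantification step above can be sketched as a single mapping function. This `tiered_reward` helper is an illustration of the reward logic described in the list, not code from the repo:

```python
def tiered_reward(compiled: bool, tests_passed: bool, timed_out: bool = False) -> float:
    """Map sandbox verification outcomes to the tiered reward float."""
    if not compiled:
        return 0.1                         # partial credit: near-valid Go that does not build
    if timed_out:
        return 0.0                         # watchdog killed a runaway binary
    return 1.0 if tests_passed else 0.3    # full success vs. logic errors

# GRPO then prefers rollouts whose reward beats their group's average.
```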

Step 1: Infrastructure Provisioning

You can find the complete Terraform manifests and configuration files in the GitHub repository:

rl-pipeline/cluster-set-up/terraform

For a detailed, step-by-step walkthrough on authenticating with GCP and executing the Terraform plan, please follow the README.md located within the directory linked above.

What this will create:

By executing the provided Terraform plan, you will deploy a GKE cluster optimized for running GRPO jobs, with:

  • Ray Head Node: An n2-standard-32 instance for orchestration.

  • Ray Worker Node Pool: g4-standard-96 nodes equipped with NVIDIA RTX Pro 6000 GPUs.

  • Sandbox Pool: n2-standard-4 nodes with gVisor for secure reward evaluation.

  • GCS HNS Bucket: A POSIX-compliant bucket for atomic checkpointing.

  • KubeRay Operator

To begin the provisioning process, clone the repository and navigate to the infrastructure directory:

Clone and Switch to the Terraform folder

git clone https://github.com/IshmeetMehta/rl-pipeline.git

cd rl-pipeline/cluster-set-up/terraform

Configure Your Environment

Ensure you are authenticated with your Google Cloud Project and have the correct credentials available for Terraform.

# Set your terraform vars

export GOOGLE_PROJECT=$(gcloud config get-value project)

# Initialize Terraform

terraform init

Review and Apply

Generate a plan to verify the resources (G4 node pools, gVisor sandboxes, and HNS bucket) before provisioning.

# Review the plan

# Choose a region based on availability of G4 or machines of choice.

terraform plan -var="project_id=$GOOGLE_PROJECT" -var="region=us-east1"

# Provision the infrastructure

terraform apply -var="project_id=$GOOGLE_PROJECT" -var="region=us-east1"

Validate the provisioned resources by listing the Terraform state:

terraform state list

data.google_client_config.default

data.google_project.project

google_artifact_registry_repository.reward_repo

…..

Step 2: Data Preparation & Transpilation

The AceCode-87K dataset is a high-signal foundational corpus for coding RL, but it is predominantly Python-centric. To train a Golang expert, the ML Engineer must transpile the logic from Python to Go.

  • Utilize a “teacher” model (like Gemini or a larger Qwen model) to transpile the Python functions and their corresponding unit tests into idiomatic Go.

  • Convert 87,000 Python programming tasks into a format that the Go compiler can actually execute.

To streamline this process, we have developed a Google Colab notebook that demonstrates the batch transpilation logic, including the specific system prompts used to maintain code quality.

You can find the notebook in the repository: rl-pipeline/dataset/transpile_acecode_to_go.ipynb

To ensure seamless context travel from the data loader through the Ray workers and into the sandbox, each record in your golang_prompts.jsonl dataset must adhere to the following schema, strictly coupling the generation prompt with its corresponding ground-truth unit tests:

```json
{
  "input": "Write a Go function 'IsPrime(n int) bool' that returns true if a number is prime.",
  "extra_env_info": {
    "test_code": "package main\nimport \"testing\"\nfunc TestIsPrime(t *testing.T) {\n\tif !IsPrime(7) { t.Errorf(\"Expected true for 7\") }\n\tif IsPrime(4) { t.Errorf(\"Expected false for 4\") }\n}"
  }
}
```

The schema is chosen to map directly to Qwen’s internal attention mechanism:

  1. Input Mapping: The input field from our schema is injected into the User role of the Qwen prompt template.

  2. Response Capture: The RL loop captures the model’s generation (the Assistant role).

  3. Verification Packaging: Because the test_code was bundled in the original record’s extra_env_info, the Ray worker can instantly package the model’s response with the specific tests required for that prompt and send it to the remote reward server for reward computation.
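A minimal sketch of that three-stage flow for a single record (illustrative: the field names follow the schema above, while the `completion` string is a stand-in for real model output):

```python
# One dataset record, following the golang_prompts.jsonl schema above.
record = {
    "input": "Write a Go function 'IsPrime(n int) bool' that returns true if a number is prime.",
    "extra_env_info": {"test_code": "func TestIsPrime(t *testing.T) { /* ... */ }"},
}

# 1. Input Mapping: the prompt lands in the User role of the chat template.
messages = [{"role": "user", "content": record["input"]}]

# 2. Response Capture: stand-in for the Assistant-role generation.
completion = "func IsPrime(n int) bool { /* model output */ }"

# 3. Verification Packaging: the response plus its coupled tests, ready to POST to /verify.
payload = {"response": completion, "extra_env_info": record["extra_env_info"]}
```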

Golang prompt sets for both training and evaluation are available in the GitHub repo.

## Stage Data and Model

Configure your GCS storage by uploading the RL dataset and the base model weights.

1. **Install HuggingFace CLI**:

Ensure you have the CLI tools:

```bash

pip install "huggingface_hub[cli]"

```

2. **Stage Base Model**:

```bash

uvx hf download Qwen/Qwen2.5-Coder-1.5B-Instruct --local-dir ./qwen-gcs-upload

gsutil -m cp -r ./qwen-gcs-upload/* gs://nemo-rl-experiments-rl-pipeline/models/Qwen2.5-Coder-1.5B-Instruct/

```

3. **Upload RL Dataset**:

```bash

gcloud storage cp dataset/golang_prompts.jsonl gs://nemo-rl-experiments-rl-pipeline/datasets/golang_prompts.jsonl

gcloud storage cp dataset/golang_val.jsonl gs://nemo-rl-experiments-rl-pipeline/datasets/golang_val.jsonl

```

Step 3: Defining the Remote Reward Server

```python
@app.post("/verify", response_model=VerificationResponse)
async def verify_code(payload: VerificationRequest):
    """
    Main verification endpoint.
    Processes Go code in a sandboxed, unique environment with Tiered Rewards.
    """
    full_start_time = time.time()
    job_id = str(uuid.uuid4())[:8]
    raw_response = payload.response
    test_code = payload.extra_env_info.get("test_code", "")
    logger.info(f"[JOB {job_id}] Received verification request (Raw length: {len(raw_response)})")

    # 1. Pre-process and Clean
    code = clean_code(raw_response)
    if not code or len(code) < 15:
        logger.warning(f"[JOB {job_id}] REWARD: -1.0 | REASON: No valid code blocks extracted.")
        return VerificationResponse(
            reward=-1.0,
            log="Format Error: Empty or invalid code block.",
            metrics={"total_time": time.time() - full_start_time}
        )

    # 2. Workspace Setup (Sandbox)
    work_dir = Path("/tmp") / f"go_job_{job_id}_{int(time.time())}"
    work_dir.mkdir(parents=True, exist_ok=True)

    try:
        (work_dir / "main.go").write_text(code)
        if test_code:
            # Strip any model-emitted package clause; tests always live in package main.
            test_code = re.sub(r"(?m)^\s*package\s+\w+.*$", "", test_code).strip()
            (work_dir / "main_test.go").write_text(f"package main\n\n{test_code}")

        await run_command(["go", "mod", "init", f"verify/job_{job_id}"], cwd=work_dir)

        # B. Compilation Phase
        logger.info(f"[JOB {job_id}] Compiling Go code...")
        exit_code, stdout, stderr, build_time = await run_command(["go", "build", "."], cwd=work_dir)

        if exit_code != 0:
            error_msg = (stderr or stdout).strip()

            # --- Self-healing compiler loop ---
            missing = set(re.findall(r"undefined:\s+([a-zA-Z0-9_]+)", error_msg))
            unused = set(re.findall(r'imported and not used:\s+"([^"]+)"', error_msg))
            if missing or unused:
                logger.info(f"[JOB {job_id}] Auto-healing imports: +{list(missing)} -{list(unused)}. Retrying build...")
                # Inject missing imports directly as extracted from the error
                if missing:
                    import_block = "import (\n" + "\n".join([f'\t"{pkg}"' for pkg in missing]) + "\n)\n"
                    code = code.replace("package main", f"package main\n\n{import_block}", 1)
                # Comment out unused imports (both standalone and grouped blocks)
                for pkg in unused:
                    code = re.sub(rf'import\s+"{pkg}"', f'// import "{pkg}" removed', code)
                    code = re.sub(rf'(?m)^\s*"{pkg}"\s*$', f'// "{pkg}" removed', code)
                # Re-write and re-compile
                (work_dir / "main.go").write_text(code)
                exit_code, stdout, stderr, build_time2 = await run_command(["go", "build", "."], cwd=work_dir)
                build_time += build_time2
                error_msg = (stderr or stdout).strip()

        if exit_code != 0:
            short_err = error_msg[:250].replace('\n', ' ')
            logger.info(f"[JOB {job_id}] REWARD: 0.1 | REASON: Compilation Failed | ERR: {short_err} | TIME: {build_time:.2f}s")
            return VerificationResponse(
                reward=0.1,
                log=f"Compilation Error: {error_msg}",
                metrics={"build_time": build_time, "total_time": time.time() - full_start_time}
            )

        # C. Test Execution Phase
        logger.info(f"[JOB {job_id}] Compilation successful. Running test suite...")
        t_exit, t_stdout, t_stderr, t_time = await run_command(["go", "test", "-v", "."], cwd=work_dir)

        total_time = time.time() - full_start_time
        metrics = {"build_time": build_time, "test_time": t_time, "total_time": total_time}

        if t_exit == 0:
            logger.info(f"[JOB {job_id}] REWARD: 1.0 | REASON: Success | EXEC: {t_time:.2f}s | TOTAL: {total_time:.2f}s")
            return VerificationResponse(reward=1.0, log="Success: All tests passed.", metrics=metrics)
        elif t_exit == 124:
            logger.warning(f"[JOB {job_id}] REWARD: 0.0 | REASON: Watchdog Timeout | EXEC: {t_time:.2f}s")
            return VerificationResponse(reward=0.0, log="Error: Execution timed out (possible infinite loop).", metrics=metrics)
        else:
            logger.info(f"[JOB {job_id}] REWARD: 0.3 | REASON: Test Failures | EXEC: {t_time:.2f}s")
            return VerificationResponse(
                reward=0.3,
                log=f"Logic Error: Test failures detected.\n{t_stdout or t_stderr}",
                metrics=metrics
            )

    except Exception as e:
        logger.error(f"[JOB {job_id}] CRITICAL ERROR: {str(e)}", exc_info=True)
        return VerificationResponse(
            reward=0.0,
            log=f"Internal server error: {str(e)}",
            metrics={"total_time": time.time() - full_start_time}
        )
```

## Deploy the Auto-Healing Reward Server

To isolate the Go compilation environment and prevent RCE, we deploy a FastAPI service into the `gVisor` sandbox node pool.

1. **Build the Container Image**:

Replace `{REGION}` with your chosen region (e.g., `us-central1`):

```bash

cd reward-server

gcloud builds submit . --config=cloudbuild.yaml --substitutions=_DESTINATION="{REGION}-docker.pkg.dev/$(gcloud config get-value project)/reward-repo/go-reward-server:v1"

```

2. **Deploy the Service**:

Update the image in `reward-server/manifests/deployment.yaml` and apply:

```bash

kubectl apply -f manifests/deployment.yaml

```

3. **Monitor the reward server logs**:

```bash

kubectl logs -l app=go-reward-server -f

```

Step 4: Configuring GRPO Training

Launch the training loop using Group Relative Policy Optimization (GRPO) to optimize the Qwen2.5-Coder-1.5B-Instruct model for functional Go correctness.

Why GRPO?

GRPO is highly efficient for coding tasks because it eliminates the need for a separate Critic model. By computing the baseline from the mean reward of a group of rollouts for a single prompt, GRPO significantly reduces VRAM overhead on our NVIDIA RTX Pro 6000 nodes, allowing for more parallel exploration.
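A toy illustration of that group baseline, using assumed tiered rewards for one prompt's group of 8 rollouts (not output from a real run):

```python
import statistics

# Tiered rewards for one prompt's group of 8 rollouts (assumed values).
group_rewards = [0.1, 0.1, 0.3, 1.0, 0.1, 0.3, 1.0, 1.0]

# GRPO's critic-free advantage: score each rollout relative to its siblings,
# using the group mean as the baseline and the group std-dev as the scale.
baseline = statistics.mean(group_rewards)
scale = statistics.pstdev(group_rewards) or 1.0
advantages = [(r - baseline) / scale for r in group_rewards]

# Fully passing rollouts get a positive advantage; compile failures, negative.
```

Because the baseline is just the group mean, no separate value network needs to be trained or held in GPU memory.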

Bridging the Gaps in NeMo RL

NeMo RL (v0.50) is optimized for Python and standard NLP benchmarks. To adapt it for compiled Go execution and distributed reward scaling, we built a custom execution environment and data processor.

The Built-in Offerings & Gaps

NeMo RL provides specialized processors (like math_data_processor and sft_processor) and environments (like MathEnvironment or CodeExecutionWorker).

  • Data Gap: Built-in processors are tailored for static text or math. They lack the capability to package hidden execution contexts. For this example, the Golang unit tests (test_code) need to travel alongside the user prompt through the Ray cluster.

  • Execution Gap: The default CodeExecutionWorker is designed for Python. Compiling Go locally on expensive GPU nodes is inefficient (wasting GPU cycles on CPU tasks) and presents a Remote Code Execution (RCE) risk.

Custom Configuration Architecture

To bridge these gaps, the training job is configured with three critical custom components:

  1. Custom Data Processor (golang_processor): Dynamically injects the hidden test_code into the metadata of every prompt payload. This ensures the “Ground Truth” (test cases) for every specific Golang problem stays tethered to the prompt as it moves through the distributed Ray cluster.

  2. Custom Environment (GolangRemoteEnv): Replaces local execution with an HTTP client that captures the LLM’s generated Go code, pairs it with the unit tests, and ships the payload across the network to the remote reward evaluation pool.

  3. Remote Golang Reward Server: Deployed on GKE and secured by gVisor, this isolated server executes the go build and go test commands. It returns a deterministic float reward (0.1 for compilation failures, 0.3 for test failures, 1.0 for success) back to the training loop, keeping our primary cluster secure and performant.

```python
# =============================================================================
# CUSTOM GOLANG DATA PROCESSOR
# =============================================================================

def golang_processor(
    datum_dict: Dict[str, Any],
    task_data_spec: TaskDataSpec,
    tokenizer: TokenizerType,
    max_seq_length: int,
    idx: int,
) -> DatumSpec:
    problem = datum_dict["input"]
    extra_env_info = {"test_code": datum_dict.get("extra_env_info", {}).get("test_code", "")}
    message_log: LLMMessageLogType = []

    if task_data_spec.system_prompt:
        sys_formatted = tokenizer.apply_chat_template(
            [{"role": "system", "content": task_data_spec.system_prompt}],
            tokenize=False, add_generation_prompt=False, add_special_tokens=False,
        )
        message_log.append({
            "role": "system", "content": sys_formatted,
            "token_ids": tokenizer(sys_formatted, return_tensors="pt", add_special_tokens=False)["input_ids"][0]
        })

    user_content = task_data_spec.prompt.format(problem) if task_data_spec.prompt else problem
    user_formatted = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_content}],
        tokenize=False, add_generation_prompt=True, add_special_tokens=False,
    )
    message_log.append({
        "role": "user", "content": user_formatted,
        "token_ids": tokenizer(user_formatted, return_tensors="pt", add_special_tokens=False)["input_ids"][0]
    })

    length = sum(len(m["token_ids"]) for m in message_log)
    return {
        "message_log": message_log,
        "length": length,
        "extra_env_info": extra_env_info,
        "loss_multiplier": 1.0 if length <= max_seq_length else 0.0,
        "idx": idx,
        "task_name": datum_dict.get("task_name", "go_verify_task")
    }

register_processor("golang_processor", golang_processor)
```

```python
# =============================================================================
# CUSTOM GOLANG ENVIRONMENT
# =============================================================================

@ray.remote(max_restarts=-1, max_task_retries=-1)
class GolangRemoteEnv(EnvironmentInterface):
    def __init__(self, config):
        self.base_url = config.get("base_urls")[0]
        self.session = requests.Session()
        adapter = requests.adapters.HTTPAdapter(pool_connections=100, pool_maxsize=100)
        self.session.mount('http://', adapter)

    def shutdown(self) -> None:
        self.session.close()

    async def step(self, message_log_batch, metadata) -> EnvironmentReturn:
        results, observations = [], []
        url = f"{self.base_url.rstrip('/')}/verify"

        for i, log in enumerate(message_log_batch):
            raw_response = "".join([m["content"] for m in log if m["role"] == "assistant"])
            test_info = metadata[i].get("extra_env_info", {})

            try:
                response = self.session.post(
                    url,
                    json={"response": raw_response, "extra_env_info": test_info},
                    timeout=30
                )
                reward = float(response.json().get("reward", 0.0))
            except Exception as e:
                print(f"⚠️ Reward Server Error: {e}")
                reward = 0.0

            results.append(reward)
            label = "Environment: correct" if reward > 0.5 else "Environment: incorrect"
            observations.append({"role": "environment", "content": label})

        rewards = torch.tensor(results).cpu()
        return EnvironmentReturn(
            observations=observations, metadata=metadata, rewards=rewards,
            terminateds=torch.ones_like(rewards).cpu(),
            next_stop_strings=[None] * len(results),
            answers=[None] * len(results)
        )

    def global_post_process_and_metrics(self, batch):
        return batch, {"accuracy": batch["rewards"].mean().item()}

    def collect_rollout_metrics(self, message_log_batch, env_return):
        return {}
```

Hyperparameter Strategy for Golang Convergence

To successfully fine-tune the Qwen2.5-Coder-1.5B-Instruct model, the following key parameters were tuned in the NeMo RL config:

  • Group Size ($G=8$): It generates 8 rollouts per prompt. This provides enough variance for GRPO to calculate a meaningful “relative advantage,” distinguishing between code that simply looks like Go and code that actually works.

  • KL Divergence Coefficient ($\beta=0.01$): A strict penalty ensures the model doesn’t “reward hack” by generating gibberish that happens to bypass compiler checks, keeping the output idiomatic and readable.

  • Max Token Length (1024): High enough to handle complex Go concurrency patterns (goroutines/channels) while maintaining high rollout throughput on the G4 nodes.

Step 5: Training Execution & Monitoring

Deploy the Ray Cluster and trigger the GRPO training loop.

  1. Deploy the Ray Cluster

Utilize KubeRay to orchestrate the worker and head nodes. The manifests are optimized to ensure the GCS bucket is automatically mounted to every pod via the GCS Fuse CSI driver, providing a shared filesystem for checkpoints and code.

# Apply optimized manifests for NVIDIA RTX PRO 6000

kubectl apply -f cluster-set-up/ray-config/ray-deployment/manifests/deployment.yaml

  2. Stage and Execute Training

To maintain a stateless cluster, the runner and config files containing the training logic are uploaded to GCS, and the job is then executed directly on the Ray head node.

# 1. Stage files to the GCS bucket root

gcloud storage cp cluster-set-up/nemo-rl-config/golang-env/run_grpo_golang.py gs://$BUCKET_NAME/run_grpo_golang.py

gcloud storage cp cluster-set-up/nemo-rl-config/golang-env/debug_grpo.yaml gs://$BUCKET_NAME/configs/debug_grpo.yaml

# 2. Trigger execution on the Ray Head Pod

export HEAD_POD=$(kubectl get pods -l ray.io/node-type=head -o name | head -n 1)

kubectl exec -it $HEAD_POD -c ray-head -- bash -c " \

PYTHONPATH=$PYTHONPATH:/mount/gcs/ \

uv run python /mount/gcs/run_grpo_golang.py \

--config /mount/gcs/configs/debug_grpo.yaml"

Training has started

[INFO] Dataset processor golang_processor registered

2026-03-25 23:15:53,258	INFO worker.py:1630 – Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS

2026-03-25 23:15:53,258	INFO worker.py:1771 – Connecting to existing Ray cluster at address: 10.1.1.9:6379…

2026-03-25 23:15:53,270	INFO worker.py:1942 – Connected to Ray cluster. View the dashboard at http://10.1.1.9:8265

No chat template provided, using tokenizer’s default

📥 Loading full Production Train data: /mount/gcs/datasets/golang_prompts.jsonl

📥 Loading Production Val data: /mount/gcs/datasets/golang_val.jsonl

Initialized TensorboardLogger at /mount/gcs/logs/nemo_grpo_golang_final/exp_001/tensorboard

GPU monitoring started with collection interval=10s, flush interval=10s

✓ Training dataloader loaded with 9 samples

✓ Validation dataloader loaded with 5 samples

▶ Setting up compute cluster…

✓ Ray cluster for policy initialized with 2 nodes

▶ Setting up model and training…

⚙️  Using sequential worker initialization (colocated mode)

Initializing vllm_policy workers: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.09worker/s]

✓ 4 workers initialized in 3.69s

(VllmGenerationWorker pid=45201, ip=10.1.4.8) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.

(VllmGenerationWorker pid=45201, ip=10.1.4.8) INFO 03-25 23:16:06 [utils.py:253] non-default args: {'served_mo

Reward computation and validation

Review the reward server activity during training

kubectl logs -l app=go-reward-server -f

2026-03-25 23:17:10,579 - go-reward-server - INFO - [JOB 6bba69b6] Received verification request (Raw length: 1479)

2026-03-25 23:17:10,587 - go-reward-server - INFO - [JOB 6bba69b6] Compiling Go code…

2026-03-25 23:17:11,045 - go-reward-server - INFO - [JOB 6bba69b6] Compilation successful. Running test suite…

2026-03-25 23:17:11,397 - go-reward-server - INFO - [JOB 6bba69b6] REWARD: 1.0 | REASON: Success | EXEC: 0.35s | TOTAL: 0.82s

INFO:     10.1.1.9:40582 - “POST /verify HTTP/1.1” 200 OK

2026-03-25 23:17:11,401 - go-reward-server - INFO - [JOB 4cdfdd13] Received verification request (Raw length: 1323)

2026-03-25 23:17:11,410 - go-reward-server - INFO - [JOB 4cdfdd13] Compiling Go code…

2026-03-25 23:17:11,892 - go-reward-server - INFO - [JOB 4cdfdd13] Compilation successful. Running test suite…

2026-03-25 23:17:12,204 - go-reward-server - INFO - [JOB 4cdfdd13] REWARD: 1.0 | REASON: Success | EXEC: 0.31s | TOTAL: 0.80s

INFO:     10.1.1.9:40582 - “POST /verify HTTP/1.1” 200 OK

INFO:     10.1.1.9:38268 - “POST /verify HTTP/1.1” 200 OK

The last training step saves the computed weights to the GCS bucket.

Logged data to /mount/gcs/logs/nemo_grpo_golang_final/exp_001/train_data_step4.jsonl

📊 Training Results:

• Loss: 0.0000

• Generation KL Error: 0.0003

• Avg Reward: 1.0000

• Mean Generation Length: 325.0000

⏱️  Timing:

• Total step time: 256.04s

• checkpointing: 229.84s (89.8%)

• generation: 8.78s (3.4%)

• prepare_for_generation/total: 2.96s (1.2%)

• policy_and_reference_logprobs: 1.75s (0.7%)

• policy_training: 1.72s (0.7%)

• prepare_for_generation/transfer_and_update_weights: 1.25s (0.5%)

• logprob_inference_prep: 0.77s (0.3%)

• training_prep: 0.49s (0.2%)

• data_processing: 0.00s (0.0%)

• reward_calculation: 0.00s (0.0%)

🔍 Performance Metrics:

• Mean Total Tokens per Sample: 328.00

• Throughputs (per GPU):

- E2E (Samples/sec/gpu): 0.01

- E2E (Tokens/sec/gpu): 3.02

- Policy Training (Tokens/sec/gpu): 450.07

- Policy and Reference Logprobs (Tokens/sec/gpu): 441.77

- Training Worker Group (Tokens/sec/gpu): 222.94

- Generation Worker Group (Tokens/sec/gpu): 87.89

• Throughputs (per Group):

- E2E (Samples/sec): 0.03

- E2E (Tokens/sec): 12.06

- Training Worker Group (Tokens/sec): 891.76

- Generation Worker Group (Tokens/sec): 351.56

• Training FLOPS: 18.05 TFLOPS (4.51 TFLOPS per rank)

Warning: Skipping metric ‘train/vllm_logger_metrics’ for TensorBoard logging (unsupported type: dict)

✅ Training loop finished. Activating Ray cluster hold hook…

⏳ Holding Ray workers alive… 300 seconds remaining for GCS background upload.

…

…

⏳ Holding Ray workers alive… 10 seconds remaining for GCS background upload.

✅ GCS FUSE sync window complete!

(DTensorPolicyWorkerV2[rank=3] pid=46236, ip=10.1.3.12) Saving checkpoint to /mount/gcs/checkpoints/tmp_step_4/policy/weights [repeated 3x across cluster]

3. Monitoring & Observability

As the job runs, let’s track three critical signals in TensorBoard to verify the Qwen model is successfully navigating the Go language’s constraints:

  • reward_mean: This is the primary success metric. It should transition from 0.1 (pure syntax) toward 1.0, indicating the model is now passing functional unit tests.

  • compile_rate: Tracks the percentage of rollouts that successfully build. Look for a “Syntax Plateau” above 90% early in the run.

  • kl_divergence: Ensures the policy doesn’t collapse. If KL spikes too high, the model is likely “reward hacking” (e.g., writing infinite loops that bypass certain checks); increase the KL penalty coefficient if this occurs.

Ray Dashboard on GKE

Identify the name of the Ray Head node pod:

HEAD_POD=$(kubectl get pods -l ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')

Start the Port-Forward:

kubectl port-forward "$HEAD_POD" 8265:8265

Access the UI: Open your browser and navigate to: http://localhost:8265

Watch TensorBoard:

HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:.metadata.name --no-headers)

Start the Port-Forward

Run this command in the new terminal to tunnel port 6006 from your pod to your local machine:

kubectl port-forward "$HEAD_POD" 6006:6006

Open your web browser and navigate to: http://localhost:6006

You should immediately see the TensorBoard UI populated with the training metrics reading from your /mount/gcs/logs directory.

Step 6: Evaluation & Deployment

Deploying Your Fine-Tuned Golang Expert on GKE

Now that your Reinforcement Learning pipeline has completed, your fine-tuned weights are saved in your Google Cloud Storage (GCS) Hierarchical Namespace (HNS) bucket.

We serve the model with vLLM, a highly optimized inference engine that natively exposes an OpenAI-compatible REST API, allowing you to seamlessly integrate your new Golang expert into IDE extensions, CI/CD pipelines, or chat interfaces.

Deploy to GKE

Update the <YOUR_HNS_BUCKET_NAME> and the step_XXXX path in the YAML file to point to your final training checkpoint.

Apply the manifest to your cluster

cd rl-pipeline/cluster-set-up/nemo-rl-config/golang-env/deployment

kubectl apply -f deployment.model.yaml

Watch the pod spin up and load the consolidated weights from training:

kubectl logs -l app=golang-expert -f

(APIServer pid=1) INFO 03-25 22:41:48 [launcher.py:46] Route: /v1/responses, Methods: POST

(APIServer pid=1) INFO 03-25 22:41:48 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET

(APIServer pid=1) INFO 03-25 22:41:48 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST

(APIServer pid=1) INFO 03-25 22:41:48 [launcher.py:46] Route: /v1/completions, Methods: POST

(APIServer pid=1) INFO 03-25 22:41:48 [launcher.py:46] Route: /v1/messages, Methods: POST

(APIServer pid=1) INFO 03-25 22:41:48 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST

(APIServer pid=1) INFO 03-25 22:41:48 [launcher.py:46] Route: /inference/v1/generate, Methods: POST

(APIServer pid=1) INFO 03-25 22:41:48 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST

(APIServer pid=1) INFO 03-25 22:41:48 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST

(APIServer pid=1) INFO 03-25 22:41:48 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST

(APIServer pid=1) INFO 03-25 22:41:48 [launcher.py:46] Route: /v1/completions/render, Methods: POST

(APIServer pid=1) INFO:     Started server process [1]

(APIServer pid=1) INFO:     Waiting for application startup.

(APIServer pid=1) INFO:     Application startup complete.

Test the Deployed Model

# Port-forward the service to your local machine/cloud shell

kubectl port-forward svc/golang-expert-service 8080:80 &

# Send an OpenAI-compatible API request

curl http://localhost:8080/v1/chat/completions \

-H "Content-Type: application/json" \

-d '{

"model": "golang-expert",

"messages": [

  {"role": "system", "content": "You are a senior Go developer. Write functional Go code."},

  {"role": "user", "content": "Write a highly optimized Go function to reverse a string."}

],

"temperature": 0.2,

"max_tokens": 512

}'


Model Response

{
“id”: “chatcmpl-ae5cc0af9f4063b5”,
“object”: “chat.completion”,
“created”: 1774561177,
“model”: “golang-expert”,
“choices”: [
{
“index”: 0,
“message”: {
"role": "assistant",
"content": "Here is an optimized Go function that reverses a string using the built-in reverse package:\n\ngo\npackage main\n\nimport (\n\t\"fmt\"\n\t\"reflect\"\n)\n\n// ReverseString takes a string and returns its reversed version.\nfunc ReverseString(s string) string {\n\trunes := []rune(s)\n\tfor i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {\n\t\trunes[i], runes[j] = runes[j], runes[i]\n\t}\n\treturn string(runes)\n}\n\nfunc main() {\n\ttestCases := []struct {\n\t\tinput string\n\t\texpected string\n\t}{\n\t\t{\"hello\", \"olleh\"},\n\t\t{\"world\", \"dlrow\"},\n\t\t{\"\", \"\"},\n\t\t{\"a\", \"a\"},\n\t}\n\n\tfor _, tc := range testCases {\n\t\tresult := ReverseString(tc.input)\n\t\tif result != tc.expected {\n\t\t\tfmt.Printf(\"ReverseString(%q) = %q; want %q\\n\", tc.input, result, tc.expected)\n\t\t} else {\n\t\t\tfmt.Printf(\"ReverseString(%q) = %q; passed\\n\", tc.input, result)\n\t\t}\n\t}\n\n\t// Test the reverse function with reflection\n\ttype test struct {\n\t\tinput string\n\t\texpected string\n\t}\n\ttests = []test{\n\t\t{\"hello\", \"olleh\"},\n\t\t{\"world\", \"dlrow\"},\n\t\t{\"\", \"\"},\n\t\t{\"a\", \"a\"},\n\t}\n\n\tfor _, tt := range tests {\n\t\ttt := tt // Capture the current value of tt for the closure\n\t\tr := reflect.ValueOf(ReverseString).Call([]reflect.Value{reflect.ValueOf(tt.input)})\n\t\tif r[0].Interface() != tt.expected {\n\t\t\tfmt.Printf(\"ReverseString(%q) = %v; want %v\\n\", tt.input, r[0].Interface(), tt.expected)\n\t\t} else {\n\t\t\tfmt.Printf(\"ReverseString(%q) = %v; passed\\n\", tt.input, r[0].Interface())\n\t\t}\n\t}\n}\n\n\nThis function uses a simple two-pointer approach to swap characters from the start and end of the string, moving towards the center. This method has a time complexity of O(n), where n is the length of the string.\n\nThe main function includes several test cases to verify the correctness of the ReverseString function. It also demonstrates how to use reflection to call the",
"refusal": null,
"annotations": null,
"audio": null,
"function_call": null,
"tool_calls": null,
"reasoning": null
………
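Before the sandbox can grade a rollout like the one above, the Go source has to be pulled out of the assistant message. The post does not show the project's extraction logic, so the sketch below is purely illustrative: it prefers a fenced ```go block and falls back to scanning from the package clause.

```python
import re

def extract_go_code(assistant_content: str) -> str:
    """Pull the Go source out of an assistant message before it is
    handed to the compiler sandbox. Illustrative sketch only; the
    project's actual extraction code is not shown in the post."""
    # Prefer an explicit ```go fenced block if the model produced one.
    m = re.search(r"```go\n(.*?)```", assistant_content, re.DOTALL)
    if m:
        return m.group(1)
    # Fall back to everything from the package clause onward.
    m = re.search(r"package\s+\w+.*", assistant_content, re.DOTALL)
    return m.group(0) if m else ""
```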

Engineering Challenges & Solutions

  • Context Travel (Data Pipeline):

    • Challenge: RL evaluators usually expect pre-defined answers; here, each prompt’s unique Go unit tests must travel from the data loader all the way to an isolated sandbox.

    • Solution: Built a golang_processor to embed raw test_code into metadata payloads, ensuring prompts and verification scripts are never decoupled.

  • Namespace Collisions (func main):

    • Challenge: LLMs emit full package main programs, causing func main redeclaration errors under the Go test runner; deleting the function outright often breaks the imports it references.

    • Solution: Used a regex pipeline in clean_code to rebrand the entry point as func _llm_unused_main(), satisfying the compiler while preserving imports.

  • The “Syntax Wall” (Learning Stagnation):

    • Challenge: Models stall at a 0.1 reward because trivial errors (e.g., a missing import) zero out otherwise sound logic, punishing the model for minor formatting slips.

    • Solution: Engineered an “Auto-Healing” mechanism in the Reward Server to detect “undefined” errors and inject missing packages, forcing the model to optimize for logic over rote syntax.

  • System Stability (Zombies & Hacking):

    • Challenge: AI-generated infinite loops (for {}) or binary dumps can freeze evaluation infrastructure.

    • Solution: Deployed an async Watchdog with a terminate() -> kill() sequence and established payload size limits to block “reward hacking” via verbosity.

  • Concurrency & File Descriptors (Errno 24):

    • Challenge: Thousands of parallel GRPO rollouts exhausted TCP ports and OS File Descriptors, crashing Ray workers.

    • Solution: Implemented requests.Session() connection pooling on Ray actors and enforced strict subprocess pipe closure in the FastAPI server.

  • Distributed Checkpoint Failures:

    • Challenge: NeMo RL requires atomic directory renames (os.rename), which standard GCS buckets don’t support.

    • Solution: Migrated to GCS Hierarchical Namespace (HNS) with GCS FUSE to enable native POSIX-compliant folder operations.

  • RCE Vulnerabilities (Security):

    • Challenge: Executing untrusted, AI-generated code on the GPU cluster is a catastrophic security risk.

    • Solution: Decoupled the compiler into a stateless FastAPI service on a dedicated CPU pool, sandboxed via gVisor to secure all kernel system calls.
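The func main rebranding from the Namespace Collisions item above can be sketched as a single regex pass. This is a minimal illustration; the real clean_code pipeline presumably handles more edge cases (comments, build tags, multiple files):

```python
import re

def clean_code(code: str) -> str:
    """Rename the model's entry point so the Go test runner can define
    its own main without redeclaration errors, while keeping the
    imports the function body references alive. Minimal sketch of the
    regex rebranding described above."""
    return re.sub(r"\bfunc\s+main\s*\(\s*\)", "func _llm_unused_main()", code)
```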

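The Auto-Healing idea from the “Syntax Wall” item can be sketched as follows, assuming a small identifier-to-package map (the real Reward Server’s table and splicing logic are not shown in the post, so names here are illustrative). Go permits multiple import declarations, so appending a second block after the package clause is legal:

```python
import re

# Hypothetical identifier -> standard-library package map.
STD_PACKAGES = {"fmt": "fmt", "strings": "strings", "sort": "sort", "errors": "errors"}

def auto_heal(code: str, compiler_stderr: str) -> str:
    """Inject imports for identifiers the compiler reports as undefined,
    so rollouts are graded on logic rather than rote import syntax."""
    missing = sorted({m for m in re.findall(r"undefined: (\w+)", compiler_stderr)
                      if m in STD_PACKAGES})
    if not missing:
        return code
    block = "import (\n" + "".join(f'\t"{STD_PACKAGES[m]}"\n' for m in missing) + ")\n"
    # Splice the new import block in right after the package clause.
    return re.sub(r"(package\s+\w+\n)",
                  lambda g: g.group(1) + "\n" + block, code, count=1)
```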

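The Watchdog’s terminate() -> kill() escalation and the payload size cap from the System Stability item can be sketched with asyncio subprocesses. This is a simplified stand-in for the production Watchdog, assuming a POSIX host; the timeout sentinel and limits are illustrative:

```python
import asyncio

async def run_sandboxed(cmd, timeout=5.0, max_output=64_000):
    """Run untrusted code under a watchdog: SIGTERM first, then SIGKILL
    if the process ignores it, and truncate oversized output to block
    reward hacking via verbosity. Sketch only."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    try:
        out, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.terminate()                       # polite SIGTERM first
        try:
            await asyncio.wait_for(proc.wait(), timeout=1.0)
        except asyncio.TimeoutError:
            proc.kill()                        # then SIGKILL the zombie
            await proc.wait()
        return -1, b""                         # timeout -> zero reward
    return proc.returncode, out[:max_output]   # cap the payload size
```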
Congratulations! You have built a production-grade, distributed RL pipeline that uses the Go compiler as the ultimate judge of truth.
