Recently, we announced that Gemma 4 is now accessible through Vertex AI. As models like Gemma 4 push the boundaries of open-source AI, the infrastructure required to fine-tune them at scale must keep pace. In this guide, we walk through how to fine-tune Gemma 4 on Vertex AI Training Clusters (VTC) and deploy it for serving through Model Garden.
Training: SFT with NVIDIA NeMo Megatron on Vertex AI Training Clusters (VTC)
To fine-tune Gemma 4 across a multi-node cluster, we provide a supervised fine-tuning (SFT) recipe optimized for VTC. The recipe fine-tunes Gemma 4 31B on the Tulu 3 SFT dataset using NVIDIA NeMo.
What is Vertex AI Training Clusters (VTC)?
Vertex AI Training Clusters (VTC) is a managed service that takes you from reservation to production training in hours rather than days. VTC provides an industry-standard, open-source Slurm UX, giving your engineering team full cluster transparency and familiar tools for optimized GPU scheduling. Beyond setup, VTC keeps training workload uptime high through advanced resiliency features. In large-scale training, a single node failure can derail a weeks-long run; VTC mitigates this by automatically detecting and triaging failure modes (ECC errors, failed DCGM health checks, stragglers flagged by heartbeat monitors, and low disk capacity) and triggering remediation actions such as restarting, reimaging, or replacing faulty nodes.
Create a VTC cluster and access the login node
The cluster we create below provides two A3-Ultra nodes with 8x NVIDIA H200 GPUs each (16 GPUs total), sized for distributed fine-tuning of the Gemma 4 31B model. We use managed Lustre, which enables high-throughput reads and writes for loading training data and storing model checkpoints, and an NFS Filestore instance as the Slurm cluster’s shared home directory. Finally, we create two login nodes with a relatively large boot disk size; the login nodes are where we develop the training recipe, and they also handle job submission to the Slurm orchestrator.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://$REGION-aiplatform.googleapis.com/v1beta1/projects/$PROJECT_NAME/locations/$REGION/modelDevelopmentClusters?model_development_cluster_id=vtca3u" \
  -d @- <<EOF
{
  "display_name": "vtca3u",
  "name": "projects/$PROJECT_NAME/locations/$REGION/modelDevelopmentClusters/",
  "network": {
    "network": "projects/$PROJECT_NAME/global/networks/$NETWORK_NAME",
    "subnetwork": "projects/$PROJECT_NAME/regions/$REGION/subnetworks/$SUBNET_NAME"
  },
  "node_pools": [
    {
      "id": "a3u",
      "machine_spec": {
        "machine_type": "a3-ultragpu-8g",
        "accelerator_type": "NVIDIA_H200_141GB",
        "accelerator_count": 8,
        "reservation_affinity": {
          "reservation_affinity_type": 3,
          "key": "compute.googleapis.com/reservation-name",
          "values": [
            "projects/$PROJECT_NAME/zones/$ZONE/reservations/$RESERVATION_NAME"
          ]
        }
      },
      "scaling_spec": {
        "min_node_count": 2,
        "max_node_count": 2
      },
      "zone": "$ZONE",
      "enable_public_ips": true,
      "boot_disk": {
        "boot_disk_type": "hyperdisk-balanced",
        "boot_disk_size_gb": 512
      },
      "lustres": [
        "projects/$PROJECT_NAME/locations/$ZONE/instances/$LUSTRE_NAME"
      ]
    },
    {
      "id": "login",
      "machine_spec": {
        "machine_type": "n2-standard-8"
      },
      "scaling_spec": {
        "min_node_count": 2,
        "max_node_count": 2
      },
      "zone": "$ZONE",
      "enable_public_ips": true,
      "boot_disk": {
        "boot_disk_type": "pd-standard",
        "boot_disk_size_gb": 512
      },
      "lustres": [
        "projects/$PROJECT_NAME/locations/$ZONE/instances/$LUSTRE_NAME"
      ]
    }
  ],
  "orchestrator_spec": {
    "slurm_spec": {
      "partitions": [
        {
          "id": "a3u",
          "node_pool_ids": [
            "a3u"
          ]
        }
      ],
      "login_node_pool_id": "login",
      "home_directory_storage": "projects/$PROJECT_NAME/locations/$ZONE/instances/$FILESTORE_NAME"
    }
  }
}
EOF
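The create call returns a long-running operation rather than a finished cluster. A minimal polling sketch is below; the `operation_url` helper and the operation name format are assumptions based on standard Vertex AI long-running-operation semantics, so adjust them to the actual `name` field returned by your create call.

```python
import time

import requests


def operation_url(region: str, operation_name: str) -> str:
    # Assumption: the create response contains an operation "name" such as
    # "projects/PROJECT/locations/REGION/.../operations/OP_ID".
    return f"https://{region}-aiplatform.googleapis.com/v1beta1/{operation_name}"


def wait_for_operation(region: str, operation_name: str, token: str,
                       poll_seconds: int = 60) -> dict:
    """Poll the long-running operation until it reports done.

    `token` is an access token, e.g. from `gcloud auth print-access-token`.
    """
    while True:
        op = requests.get(
            operation_url(region, operation_name),
            headers={"Authorization": f"Bearer {token}"},
        ).json()
        if op.get("done"):
            return op
        time.sleep(poll_seconds)
```

Once the operation reports `done` without an `error` field, the login nodes are reachable and you can SSH in to continue with the steps below.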
Install NeMo-Run on the login node
pip install git+https://github.com/NVIDIA/NeMo-Run.git
Copy the NeMo container image to your training cluster
gcloud storage cp gs://vmds-containers-us/vmds_nemo_squashfs/nemo-20260401.sqsh .
Clone the vertex-oss-training repository and use the gemma4 branch
git clone -b gemma4 https://vertex-model-garden.googlesource.com/vertex-oss-training
Start training
Note: The command below assumes that you are using a cluster of A3-Ultra machines (as created above). To run this on a cluster with a different machine type:
- A3-Mega: set --slurm-type to hcc-a3m.
- A4: set --slurm-type to hcc-a4.
cd vertex-oss-training/nemo
export WORK_DIR=$(pwd)
export NEMORUN_HOME=${WORK_DIR}
log_dir="${WORK_DIR}/logs"
cache_dir="${WORK_DIR}/cache"
mkdir -p $log_dir
mkdir -p $cache_dir
export HF_TOKEN="<Your Hugging Face Token>"
# run the file from the cloned repo
python3 run.py -e slurm --slurm-type hcc-a3u -d $WORK_DIR -s sft/gemma4_text_sft.py \
--recipe-args="--model=gemma4_text_31b_pt --exp_name=gemma4-sft" \
--import-ckpt-script import_ckpt/import_gemma4_ckpt.py \
--import-ckpt-args="--model=google/gemma-4-31B" \
--export-ckpt-script=export_ckpt/export_ckpt.py \
-n=1 \
--partition=<Your cluster's targeted partition name> \
--image=<Path to the container image sqsh file> \
--experiment-name=gemma4-sft \
--log-dir=$log_dir \
--cache-dir=$cache_dir
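Once the job is queued, you can follow progress through the Slurm log files under the log directory configured above. A small helper like the one below pulls the reported loss values out of a run log; the line format in the regex is an assumption based on typical NeMo progress output (`global_step: ... | reduced_train_loss: ...`), so adjust it to what your logs actually print.

```python
import re
from pathlib import Path

# Assumption: NeMo-style progress lines such as
# "epoch 0 | global_step: 120 | reduced_train_loss: 1.234 | lr: 1e-5"
LOSS_RE = re.compile(r"global_step:\s*(\d+).*?reduced_train_loss:\s*([0-9.]+)")


def extract_losses(log_text: str) -> list[tuple[int, float]]:
    """Return (step, loss) pairs found in a training log."""
    return [(int(step), float(loss)) for step, loss in LOSS_RE.findall(log_text)]


def losses_from_file(path: str) -> list[tuple[int, float]]:
    return extract_losses(Path(path).read_text())
```

This is handy for a quick sanity check that the loss is trending down before committing to a full run.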
Serving: Deploy Gemma 4 on Vertex AI Model Garden
Beyond fine-tuning, you can use Vertex AI Model Garden for one-click deployment to self-hosted endpoints. You can deploy Gemma 4 using the Python SDK, CLI, REST API, or console.
To deploy the model you just fine-tuned, you can use the code snippet below or this deployment notebook:
import requests
from google import auth
import google.auth.transport.requests  # needed so auth.transport.requests.Request resolves below
from google.cloud import aiplatform
VLLM_DOCKER_URI = "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4"
PROJECT_ID = "your-project-id"
REGION = "your-region"

# Initialize the SDK so Endpoint.create and Model.upload target the right project and region.
aiplatform.init(project=PROJECT_ID, location=REGION)
machine_type = "a3-highgpu-1g"
accelerator_type = "NVIDIA_H100_80GB"
accelerator_count = 1
endpoint = aiplatform.Endpoint.create(
display_name="gemma-4-31b-finetuned",
dedicated_endpoint_enabled=True,
)
vllm_args = [
"python",
"-m",
"vllm.entrypoints.api_server",
"--host=0.0.0.0",
"--port=8080",
"--model=gs://your-gcs-bucket/gemma-4-31b-finetuned",
"--tensor-parallel-size=1",
"--max-model-len=16384",
"--gpu-memory-utilization=0.9",
"--max-num-seqs=128",
"--limit-mm-per-prompt.image=0",
"--enable-auto-tool-choice",
"--tool-call-parser=gemma4",
"--reasoning-parser=gemma4"
]
env_vars = {
"MODEL_ID": "google/gemma-4-31B",
"DEPLOY_SOURCE": "notebook",
}
model = aiplatform.Model.upload(
display_name="gemma-4-31b-finetuned",
serving_container_image_uri=VLLM_DOCKER_URI,
serving_container_args=vllm_args,
serving_container_ports=[8080],
serving_container_predict_route="/generate",
serving_container_health_route="/ping",
serving_container_environment_variables=env_vars,
serving_container_shared_memory_size_mb=(16 * 1024),
serving_container_deployment_timeout=7200,
model_garden_source_model_name="publishers/google/models/gemma4",
)
creds, _ = auth.default()
auth_req = auth.transport.requests.Request()
creds.refresh(auth_req)
url = f"https://{REGION}-aiplatform.googleapis.com/ui/projects/{PROJECT_ID}/locations/{REGION}/endpoints/{endpoint.name}:deployModel"
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {creds.token}",
}
data = {
"deployedModel": {
"model": model.resource_name,
"displayName": "gemma-4-31b-finetuned",
"dedicatedResources": {
"machineSpec": {
"machineType": machine_type,
"acceleratorType": accelerator_type,
"acceleratorCount": accelerator_count,
}
},
},
}
response = requests.post(url, headers=headers, json=data)
print(f"Deploy Model response: {response.json()}")
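After the deploy operation completes, it is worth smoke-testing the endpoint before wiring it into an application. The sketch below targets the container's /generate predict route through Vertex AI's rawPredict method; the payload field names (`prompt`, `max_tokens`) are assumptions about the vLLM server's request schema, so adjust them to your container's actual API.

```python
import requests


def build_generate_request(project_id: str, region: str, endpoint_id: str,
                           prompt: str, max_tokens: int = 256):
    """Assemble the rawPredict URL and a vLLM-style payload for the /generate route."""
    url = (
        f"https://{region}-aiplatform.googleapis.com/v1/projects/{project_id}"
        f"/locations/{region}/endpoints/{endpoint_id}:rawPredict"
    )
    # Assumption: the serving container accepts a vLLM-style generate payload.
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    return url, payload


def generate(project_id: str, region: str, endpoint_id: str, token: str,
             prompt: str) -> dict:
    """Send one test prompt to the deployed endpoint and return the JSON response."""
    url, payload = build_generate_request(project_id, region, endpoint_id, prompt)
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {token}"},
        json=payload,
    )
    resp.raise_for_status()
    return resp.json()
```

You can pass `endpoint.name` (the numeric endpoint ID) and a token from the refreshed credentials above to send a first test prompt.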
If you want to deploy the base model instead, you can use a single CLI command. Here is additional reference documentation for deploying the original Gemma 4 models.
gcloud ai model-garden models deploy \
--model=google/gemma4@gemma-4-31b
What’s next?
To access specialized tuning recipes for Gemma 4 and get custom support on how to train the model efficiently, please contact your Google Cloud sales representative to enable VTC access in your project. For more details on Vertex AI Training Clusters, see the VTC overview documentation. To learn more about Gemma 4 on Google Cloud, read the Gemma 4 announcement blog.