TL;DR
It’s not every day that you get to take an influential research paper and bring its core recipe to life with just a few API calls. This post provides a first look at a new tool that makes this possible: the managed fine-tuning service for open models on Vertex AI, which is currently in Preview on Google Cloud. We’ll walk through preparing the specialized MetaMathQA dataset, launching a managed LLM tuning job with the paper’s specific hyperparameters, deploying the resulting tuned model, and seeing it solve math problems in real time.
From paper to production with Vertex AI’s managed OSS tuning
We’ve all been there. You read an AI paper like MetaMath, see the incredible results, and think, “How can I build that?” The reality is that reproducing these results is often a monumental task involving complex infrastructure, dependency management, and deep expertise in distributed training. This friction can slow down innovation and keep powerful techniques locked away in the academic sphere.
This is where the new managed fine-tuning service for open models on Vertex AI steps in. By handling the underlying infrastructure, this managed service lets us, as developers, focus on the fine-tuning itself. We can take an open model, combine it with a high-quality dataset, and apply a specific training recipe, all without managing a single server or GPU cluster. This workflow makes it practical to turn research into production-ready, specialized LLMs.
Replicating the MetaMath recipe step-by-step
Here’s how we can replicate the MetaMath recipe. For all the code, check out the full Colab notebook.
Step 1: Prepare the MetaMathQA dataset
The Vertex AI tuning service requires data in a JSON Lines (JSONL) format, with each line representing a training example. We load the MetaMathQA dataset from Hugging Face, format it into the required messages array structure, and upload it to Google Cloud Storage (GCS).
import os
from datasets import load_dataset
BUCKET_URI = "gs://your-gcs-bucket-name"
# Load the MetaMathQA dataset
dataset = load_dataset("meta-math/MetaMathQA")['train']
# Define the instruction template from the paper
METAMATH_TEMPLATE = """
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
"""
# Format each example to match the required JSONL structure
def format_for_tuning(example):
    return {
        "messages": [
            {"role": "user", "content": METAMATH_TEMPLATE.format(instruction=example['query'])},
            {"role": "assistant", "content": f" {example['response']}"}
        ]
    }
# Apply the formatting
train_formatted = dataset.map(format_for_tuning, remove_columns=dataset.column_names)
# Save to a local JSONL file
train_file_path = "metamath_train.jsonl"
train_formatted.to_json(train_file_path, orient="records", lines=True)
# Upload the dataset to GCS
! gsutil cp {train_file_path} {BUCKET_URI}/datasets/
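The tuning job in Step 2 also references a validation file (validation_file_uri). One way to prepare it is sketched below, reusing the same formatting helper; the split size, file names, and the train_file_uri/validation_file_uri variable names are illustrative choices, and Vertex AI expects the validation set to stay under 5,000 rows.
# Carve out a small held-out split for validation (size is illustrative;
# the validation set must stay under 5,000 rows). For a strictly disjoint
# split, format split["train"] above instead of the full dataset.
split = dataset.train_test_split(test_size=1000, seed=42)
validation_formatted = split["test"].map(format_for_tuning, remove_columns=dataset.column_names)
validation_file_path = "metamath_validation.jsonl"
validation_formatted.to_json(validation_file_path, orient="records", lines=True)
! gsutil cp {validation_file_path} {BUCKET_URI}/datasets/
# GCS URIs referenced by the tuning job in Step 2
train_file_uri = f"{BUCKET_URI}/datasets/{train_file_path}"
validation_file_uri = f"{BUCKET_URI}/datasets/{validation_file_path}"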
Step 2: Configure and launch the tuning job
First, define the tuning parameters. To stay as close as possible to the MetaMath paper’s recipe for its 7B model, we’ll configure the job with 3 epochs and a learning rate of 2e-5 using full fine-tuning.
import uuid
from pydantic import BaseModel, Field
# This class groups all hyperparameters and provides documentation and default values.
class MetaMathTuningConfig(BaseModel):
    """Configuration settings for the MetaMath fine-tuning job."""
    base_model: str = Field(
        default="meta/llama3_1@llama-3.1-8b",
        description="The base model to fine-tune, corresponding to the 7B model in the paper."
    )
    tuning_mode: str = Field(
        default="FULL",
        description="The tuning mode. We use 'FULL' to replicate the paper's method for the 7B model."
    )
    epochs: int = Field(
        default=3,
        description="Number of training epochs, as specified in the MetaMath paper."
    )
    learning_rate: float = Field(
        default=2e-5,
        description="The learning rate for the optimizer, matching the paper's value for full fine-tuning."
    )
# Create an instance of the configuration class.
config = MetaMathTuningConfig()
# Dynamically create paths that depend on runtime variables.
output_uri = f"{BUCKET_URI}/tuning-output/{uuid.uuid4()}"
model_artifacts_gcs_uri = os.path.join(output_uri, "postprocess/node-0/checkpoints/final")
Once tuning parameters are set, you can run the tuning job using the Vertex AI SDK.
import vertexai
# Preview SFT tuning module (import path may vary slightly across SDK versions)
from vertexai.preview.tuning import sft
from vertexai.preview.tuning.sft import SourceModel

# Initialize the SDK
vertexai.init(project="your-project-id", location="us-central1", staging_bucket=BUCKET_URI)

# Configure and launch the supervised fine-tuning (SFT) job
source_model = SourceModel(base_model=config.base_model)
sft_tuning_job = sft.preview_train(
    source_model=source_model,
    tuning_mode=config.tuning_mode,
    epochs=config.epochs,
    learning_rate=config.learning_rate,
    train_dataset=train_file_uri,
    validation_dataset=validation_file_uri,
    output_uri=output_uri,
)
With that single command, Vertex AI automatically provisions the necessary hardware and initiates the training job. Once submitted, you can monitor the training process directly from the Tuning UI in the Google Cloud console or leverage the associated Vertex AI TensorBoard instance for detailed insights.
All logs are written to gs://your-bucket-name/tuning-output/some-id/oss_tuning_job_logs.txt.
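If you prefer to track progress from the notebook itself, a minimal sketch like the one below polls the job and tails those logs. It assumes the preview tuning job object exposes the same has_ended, refresh(), and state attributes as the SDK’s other supervised tuning jobs, and the polling interval is arbitrary.
import time
# Poll the tuning job until it completes (interval is arbitrary)
while not sft_tuning_job.has_ended:
    time.sleep(60)
    sft_tuning_job.refresh()
print("Job state:", sft_tuning_job.state)
# Peek at the last few lines of the tuning logs in GCS
! gsutil cat {output_uri}/oss_tuning_job_logs.txt | tail -n 20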
Step 3: Deploy your tuned LLM model on Vertex AI
After the fine-tuning job finishes, the model artifacts live in GCS. To serve predictions, we deploy the model to a Vertex AI Endpoint, a managed resource for serving your tuned LLM, using the Vertex AI Model Garden SDK as shown below.
from vertexai.preview import model_garden
# The output_uri is available from your completed sft_tuning_job object
tuned_model_gcs_uri = sft_tuning_job.output_uri
# Create a CustomModel object from the artifacts
tuned_model = model_garden.CustomModel(gcs_uri=tuned_model_gcs_uri)
# Deploy to an endpoint
endpoint = tuned_model.deploy(
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1
)
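Deployment can take several minutes. Once it completes, you can keep using the returned endpoint object, or record its resource name and reattach to it later without redeploying. The snippet below is a minimal sketch that assumes the deploy call returns a standard Vertex AI Endpoint object.
from google.cloud import aiplatform
# Record the endpoint's fully qualified resource name for later sessions
print(endpoint.resource_name)
# Reattach to the same endpoint in a new session without redeploying
endpoint = aiplatform.Endpoint(endpoint_name=endpoint.resource_name)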
Step 4: Test your model!
Let’s see our tuned model in action. We’ll send it a math problem and see how it performs.
# The inference prompt for MetaMath models
prompt_template = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\\n\\n### Instruction:\\n{instruction}\\n\\n### Response: Let's think step by step."
instruction = "James buys 5 packs of beef that are 4 pounds each. The price of beef is $5.50 per pound. How much did he pay?"
instances = [{"prompt": prompt_template.format(instruction=instruction), "max_tokens": 250}]
# Send the request to our endpoint
response = endpoint.predict(instances=instances, use_dedicated_endpoint=True)
print(response.predictions[0])
# Expected output: a step-by-step calculation resulting in $110.
# Response from the tuned model:
# James buys 5 packs of beef, and each pack is 4 pounds, so he buys a total of 5 * 4 = 20 pounds of beef.
# The price of beef is $5.50 per pound, so James pays 20 * $5.50 = $110.
# Therefore, James paid $110 for the beef.
# #### 110
# The answer is: 110.
Seeing the model reason through the problem step-by-step is exactly the outcome we were hoping for.
Bonus: Benchmarking your tuned model
A single prompt gives you a good gut check, but how does our model really stack up? To properly validate our work, we can perform two more levels of evaluation.
Qualitative check vs. the official model
First, we can compare our model’s output with the official MetaMath-7B-V1.0 model from Hugging Face. This provides a valuable qualitative benchmark.
Note: This step involves running a large model locally, which may require a machine with significant RAM and a powerful GPU. Be aware that downloading the model weights will also take some time.
# Enable hf_transfer for parallel downloads (set before importing transformers)
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from transformers import pipeline
# Load the official MetaMath 7B model from Hugging Face.
official_pipe = pipeline("text-generation", model="meta-math/MetaMath-7B-V1.0", device_map="auto")
# Use the same prompt and instruction for a fair comparison.
official_response = official_pipe(
    prompt_template.format(instruction=instruction),
    max_new_tokens=250,
    do_sample=False
)
print("Response from the Official MetaMath Model:")
print(official_response[0]['generated_text'])
# ### Response: Let's think step by step.
# James buys 5 packs of beef, and each pack is 4 pounds, so he buys a total of 5 * 4 = 20 pounds of beef.
# The price of beef is $5.50 per pound, so he pays 20 * $5.50 = $110.
# #### 110
# The answer is: 110
After downloading the model, you’ll see the output. The striking similarity between our fine-tuned model’s response and the official one confirms that our Vertex AI tuning recipe was successful!
Quantitative analysis with official evaluation scripts
To get the official pass@1 benchmark scores reported in the paper, run the evaluation scripts from the MetaMath GitHub repository against the full test dataset. These are the same steps used to produce the formally reported academic results.
First, download the official evaluation scripts and test data.
!git clone https://github.com/meta-math/MetaMath.git
The local evaluation script needs the model files. Copy them from your GCS bucket to your local environment.
# Create a local directory to store the model.
LOCAL_MODEL_PATH = "./my_tuned_metamath_model"
!mkdir -p {LOCAL_MODEL_PATH}
# Copy the model files from GCS to the local path. This can take several minutes.
!gsutil -m cp -r {model_artifacts_gcs_uri}/* {LOCAL_MODEL_PATH}/
Finally, execute the official evaluation script, pointing it to your locally downloaded model.
!python MetaMath/eval_gsm8k.py \
    --model {LOCAL_MODEL_PATH} \
    --data_file ./MetaMath/data/test/GSM8K_test.jsonl \
    --tensor_parallel_size 2 \
    --batch_size 32
After running, the script processes all of the test examples and prints the final pass@1 accuracy score.
You can compare this number directly to the results table in the MetaMath paper to see how well your model performed! In our test run, we achieved an accuracy of ~68.8%, which is a fantastic result that closely aligns with the paper’s findings!
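For intuition, pass@1 on GSM8K here boils down to exact-match accuracy on the final numeric answer after the "#### " marker, the same format visible in the outputs above. The helpers below are an illustrative sketch of that scoring logic, not the official script's code.
import re

def extract_answer(text):
    """Pull the final numeric answer following the '#### ' marker."""
    match = re.search(r"####\s*(-?[\d,\.]+)", text)
    return match.group(1).replace(",", "") if match else None

def pass_at_1(predictions, references):
    """Exact-match accuracy with one greedy completion per problem."""
    correct = sum(
        extract_answer(p) is not None and extract_answer(p) == extract_answer(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

# Example using the answer format shown earlier
print(pass_at_1(["... #### 110"], ["Ground truth solution ... #### 110"]))  # 1.0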
Things to know about OSS managed fine-tuning on Vertex AI
As you start your own projects, keep these points in mind:
- Preview Status: This feature is currently in Preview and subject to the Pre-GA Offerings Terms.
- Supported Models: The Vertex AI managed fine-tuning service on Google Cloud supports parameter-efficient fine-tuning (PEFT) and full fine-tuning for a variety of Llama models. This tutorial uses meta/llama3_1@llama-3.1-8b, but you can also tune the following open LLMs:
  - meta/llama3_1@llama-3.1-8b (PEFT / full fine-tuning)
  - meta/llama3_1@llama-3.1-8b-instruct (PEFT / full fine-tuning)
  - meta/llama3-2@llama-3.2-1b-instruct (supports only full fine-tuning)
  - meta/llama3-2@llama-3.2-3b-instruct (supports only full fine-tuning)
  - meta/llama3-3@llama-3.3-70b-instruct (PEFT / full fine-tuning)
- Validation limit: Vertex AI requires the validation dataset to contain fewer than 5,000 rows.
- Pricing: You are billed for tuning based on the hardware used during the job, plus costs for the prediction endpoint and Cloud Storage. Always check the official pricing page.
- Quotas and Limits: Your project has a default quota for concurrent tuning jobs. If you plan to run multiple jobs, you may need to request a quota increase.
What’s next?
We’ve successfully taken a research paper and implemented its LLM fine-tuning recipe in a scalable, managed environment on Google Cloud. This workflow opens up new possibilities for creating highly specialized LLMs.
- Get hands-on with the full Colab notebook.
- Read the official Vertex AI documentation to explore all the configuration options.
Thanks for following along on this first look. I’d love to hear from you! Share your feedback and connect on LinkedIn, X/Twitter.
Happy building!