How to fine-tune PaliGemma2 with Roboflow

Authors

Tai Conley

Aryan Vasudevan

Introduction

One of the most common tasks in computer vision is object detection. Models capable of object detection, like PaliGemma2, return “bounding boxes” – coordinates that correspond to the objects the model has been trained to identify.

For example, you can train a model to identify boats in water, species of fish in the ocean, defects in sheet metal on a manufacturing line, or cracks in ceramic.

Object detection models like PaliGemma perform best when trained to identify the objects you want to find, especially if you are looking for objects specific to the project you are working on (e.g. specific defects on products you manufacture). This is where Roboflow comes in.

With Roboflow, you can prepare a dataset for use in fine-tuning an object detection model. You can then train your own models and deploy them either in the cloud or at the edge with Roboflow Inference, our open source computer vision inference server.

In this blog post, we are going to walk through how to fine-tune a PaliGemma model. We will use Roboflow to prepare a dataset, then a Google Colab notebook to train and run inference with our PaliGemma model. We will also talk about how to deploy our model with Roboflow Workflows, which can run on your own hardware such as a GCP server.

Let’s get started!

What is PaliGemma

PaliGemma is an open-source vision-language model by Google that bridges the gap between visual understanding and natural language processing. Built on the foundation of the SigLIP vision model and the Gemma language model, PaliGemma can analyze images and respond to questions, generate captions, and perform complex visual reasoning tasks.

What sets PaliGemma apart from other vision models is its versatility. Unlike traditional computer vision models that are trained for specific tasks like object detection or image classification, PaliGemma can handle a wide variety of vision-language tasks through natural language prompts. This flexibility makes it particularly powerful for applications where you need to extract structured information from images or perform visual analysis/object detection.

Use cases for PaliGemma

PaliGemma excels in scenarios where you need to extract structured information from images, perform visual reasoning tasks, or run classic object detection. Common applications include document processing (extracting data from invoices, receipts, and forms), industrial quality control (identifying defects or safety compliance issues), medical image analysis (assisting with specimen identification), and much more.

The example we’re going to build in this guide is a custom, fine-tuned PaliGemma2 boat detection model, followed by integration with Roboflow Workflows to track boats as they move across the screen. The purple line in the screenshot below shows how the model was tracking these boats in real time.

Preparing a PaliGemma dataset with Roboflow

To train PaliGemma on a custom dataset, we need to use its expected training data format: JSONL. In a JSONL file, each line is its own JSON object. This format makes it easy to stream and process large datasets, which is why it is well suited to PaliGemma fine-tuning. Fortunately, Roboflow provides this export via the “PaliGemma” option, which is really a JSONL export, when downloading a dataset from your workspace:

If you’re unsure about how to obtain a dataset, you can follow this guide, which walks through creating an object detection model; skip the training steps and just use the dataset instructions. Additionally, Roboflow Universe hosts numerous datasets that are ready to fork into your own project and download. Here’s a link to the boat detection dataset on Universe if you’re interested.

Instead of downloading via the UI, we’re going to use a Colab notebook for this guide. The notebook keeps the snippets of code organized and easily accessible, and it also contains code for downloading the dataset directly into the notebook environment.

Now, let’s use the notebook to continue our building!

Environment setup

The first step in the notebook is to assign your environment variables. The notebook provides helpful instructions and links to both the Hugging Face settings and the Roboflow settings pages, where you need to get API keys/tokens and store them as Colab secrets:
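For example, with the keys stored as Colab secrets, you can authenticate with Hugging Face so the gated PaliGemma2 weights can be downloaded. This is a minimal sketch; the secret name HF_TOKEN is an assumption, so use whatever name the notebook expects:

from google.colab import userdata
from huggingface_hub import login

# Read the Hugging Face token from Colab secrets and log in so that
# the gated PaliGemma2 checkpoints can be pulled from the Hub.
HF_TOKEN = userdata.get('HF_TOKEN')
login(token=HF_TOKEN)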

The next step is to enable a GPU, which will provide the processing power to train the model from the notebook. Once you’ve enabled it by following the instructions in the notebook, you can confirm it by running the snippet:
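A quick check along these lines (the exact snippet in the notebook may differ) confirms that a GPU is visible to the Colab runtime:

!nvidia-smi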

Next, we’ll also need to use the pip install command to install the necessary dependencies for this project. It installs packages like the Roboflow pip package, Supervision, and Transformers.
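The exact cell in the notebook may pin specific versions, but it looks roughly like this (the package list is based on the description above; accelerate is included here because the Hugging Face Trainer used later requires it):

!pip install -q roboflow supervision transformers accelerate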

The next snippet uses Roboflow’s Python SDK to download a dataset in the PaliGemma format. Be sure to replace the workspace and project IDs with those of whatever dataset you’re using in your workspace:

from google.colab import userdata
from roboflow import Roboflow

ROBOFLOW_API_KEY = userdata.get('ROBOFLOW_API_KEY')
rf = Roboflow(api_key=ROBOFLOW_API_KEY)

project = rf.workspace("YOUR WORKSPACE ID").project("YOUR PROJECT ID")
version = project.version(3)
dataset = version.download("paligemma")

Once downloaded, you can verify the image paths and the prefix/suffix pairs used for VLM fine-tuning with the next line:

!head -n 5 {dataset.location}/dataset/_annotations.train.jsonl
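Each line should look roughly like the following (the values are illustrative). PaliGemma encodes each bounding box as four <loc####> tokens followed by the class name, and multiple objects are joined with " ; ", which is why the training code later shuffles on that separator:

{"image": "boat_0001.jpg", "prefix": "detect boat", "suffix": "<loc0123><loc0456><loc0789><loc0876> boat ; <loc0100><loc0200><loc0300><loc0400> boat"}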

Now, let’s prepare for training the model!

Training PaliGemma2 in Colab

The next few snippets split the data into the train, validation, and test sets needed for training the model:
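As a rough sketch of what those snippets do (the class and variable names here are illustrative, and the file names for the validation and test splits are assumed to follow the same pattern as the train split), each split is loaded from its JSONL file into a dataset that yields (image, annotation) pairs, which is what the collate function below expects:

import json
import os

from PIL import Image
from torch.utils.data import Dataset


class JSONLDataset(Dataset):
    # Minimal sketch: each item is a (PIL image, annotation dict) pair, where the
    # annotation dict keeps the "image", "prefix", and "suffix" keys from the JSONL.
    def __init__(self, jsonl_path, image_dir):
        self.image_dir = image_dir
        with open(jsonl_path) as f:
            self.entries = [json.loads(line) for line in f if line.strip()]

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, idx):
        entry = self.entries[idx]
        image = Image.open(os.path.join(self.image_dir, entry["image"])).convert("RGB")
        return image, entry


train_dataset = JSONLDataset(f"{dataset.location}/dataset/_annotations.train.jsonl", f"{dataset.location}/dataset")
valid_dataset = JSONLDataset(f"{dataset.location}/dataset/_annotations.valid.jsonl", f"{dataset.location}/dataset")
test_dataset = JSONLDataset(f"{dataset.location}/dataset/_annotations.test.jsonl", f"{dataset.location}/dataset")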

Next, use the transformers library (in the snippet below) to load PaliGemma. PaliGemma2 is available in nine pretrained checkpoints, with varying model sizes and input image resolutions. For this guide, we’ll be using google/paligemma2-3b-pt-448:

import torch
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration

MODEL_ID = "google/paligemma2-3b-pt-448"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

processor = PaliGemmaProcessor.from_pretrained(MODEL_ID)

# @title Freeze the image encoder


TORCH_DTYPE = torch.bfloat16

model = PaliGemmaForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype=TORCH_DTYPE).to(DEVICE)

for param in model.vision_tower.parameters():
    param.requires_grad = False

for param in model.multi_modal_projector.parameters():
    param.requires_grad = False

This snippet loads the PaliGemma multimodal model and freezes the vision encoder and multimodal projector components so they won’t be updated during training. This preserves the pre-trained visual understanding while only fine-tuning the language generation parts, which saves memory and prevents degradation of the visual features.

Running the next three snippets allows us to begin fine-tuning the model:

import random

from transformers import Trainer, TrainingArguments


def augment_suffix(suffix):
    parts = suffix.split(' ; ')
    random.shuffle(parts)
    return ' ; '.join(parts)


def collate_fn(batch):
    images, labels = zip(*batch)

    paths = [label["image"] for label in labels]
    prefixes = ["<image>" + label["prefix"] for label in labels]
    suffixes = [augment_suffix(label["suffix"]) for label in labels]

    inputs = processor(
        text=prefixes,
        images=images,
        return_tensors="pt",
        suffix=suffixes,
        padding="longest"
    ).to(TORCH_DTYPE).to(DEVICE)

    return inputs

args = TrainingArguments(
    num_train_epochs=16,
    remove_unused_columns=False,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    warmup_steps=2,
    learning_rate=2e-5,
    weight_decay=1e-6,
    adam_beta2=0.999,
    logging_steps=50,
    optim="adamw_hf",
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=1,
    output_dir="paligemma2_object_detection",
    bf16=True,
    report_to=["tensorboard"],
    dataloader_pin_memory=False
)

trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=collate_fn,
    args=args
)
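With the trainer configured, kick off fine-tuning with the standard Trainer entry point (in the notebook this typically runs as its own cell):

trainer.train()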

Training will take a while, so check back in when it has finished and it’s time to test.

You can then test the model with the next few snippets. Once you’re happy with the results, we can integrate the trained model into Workflows.
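As a rough illustration of what testing looks like (a minimal sketch assuming valid_dataset yields (image, label) pairs as defined earlier; the notebook’s own snippets may differ), you can run the fine-tuned model on a single validation image:

image, label = valid_dataset[0]
prefix = "<image>" + label["prefix"]

inputs = processor(text=prefix, images=image, return_tensors="pt").to(TORCH_DTYPE).to(DEVICE)
prefix_length = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Strip the prompt tokens and decode only the generated suffix
decoded = processor.decode(generation[0][prefix_length:], skip_special_tokens=True)
print(decoded)  # e.g. "<loc0123><loc0456><loc0789><loc0876> boat"

From here, the supervision package installed earlier can be used to parse these location tokens back into bounding boxes for visualization.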

Use model with Roboflow

To use this model in a Workflow, we’ll have to upload it to a project. Fortunately, Roboflow supports this, along with a variety of other model architectures.

Using this documentation, run the following snippet in the notebook:

from google.colab import userdata
from roboflow import Roboflow

ROBOFLOW_API_KEY = userdata.get('ROBOFLOW_API_KEY')

rf = Roboflow(api_key=ROBOFLOW_API_KEY)
workspace = rf.workspace("YOUR WORKSPACE ID")

workspace.deploy_model(
    model_type="paligemma2-3b-pt-448",
    model_path="PATH TO THE FINAL CHECKPOINT DIRECTORY FROM TRAINING", # copy the path in Colab
    project_ids=["YOUR PROJECT ID"],
    model_name="NAME FOR THE MODEL",
    filename="model-00001-of-00002.safetensors" # examine the checkpoint directory for the exact filename
)

To use this code, you’ll need to replace all of the values with the proper IDs/names. Uploading may take a while, so be patient:

Now, this model is ready to use for inference! If you would like to know how to do this, follow this guide walking through using object detection models.

For this guide we’ll be making a Workflow, allowing us to track the boats from the video. Head over to Workflows and create a new workflow:

From here, we’re going to include the model that was uploaded via the snippet. You can search for the model by clicking “Add a model”:

From here, we’re going to add two blocks that will enable us to track the boats from the model’s detections: a Byte Tracker and a Trace Visualization for the tracker. The trace visualization works by compiling the previous positions reported by the tracker (Byte Tracker) to leave a visible trail.

If you’re curious about Workflow blocks, check out the documentation for all of the notable blocks.

From here, all that’s left is to deploy the Workflow.

Deploying PaliGemma2 on GCP with Inference

To get started, first set up a GCP Compute Engine instance. We recommend setting up a server with an NVIDIA GPU for the best performance. With a GCP server ready, we can install Roboflow Inference. We will use Inference to run the multi-step vision workflow we built earlier.

pip install inference inference-sdk
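The pip package includes the Inference CLI as well as the SDK. If you want to run Inference as a standalone server on the GCP machine (this step assumes Docker is installed on the host, and it is optional when using the in-process InferencePipeline shown below), you can typically start it with:

inference server start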

Your Inference server will be available at http://localhost:9001. With an Inference server set up, we can start running our Workflow. Go back to your Roboflow Workflow, and click “Deploy” in the Roboflow web interface:

A window will appear in which you can find the code snippet you need to run your model on a video. Choose “Video” as your input type in the left sidebar, and choose “Local Server” as your deployment type.

You will then receive a code snippet that looks like this:

# Import the InferencePipeline object
from inference import InferencePipeline
import cv2

def my_sink(result, video_frame):
    if result.get("output_image"): # Display an image from the workflow response
        cv2.imshow("Workflow Image", result["output_image"].numpy_image)
        cv2.waitKey(1)
    print(result) # do something with the predictions of each frame


# initialize a pipeline object
pipeline = InferencePipeline.init_with_workflow(
    api_key="YOUR_ROBOFLOW_KEY",
    workspace_name="YOUR_WORKSPACE_NAME",
    workflow_id="YOUR_WORKFLOW_ID",
    video_reference="video.mp4",
    max_fps=30,
    on_prediction=my_sink
)
pipeline.start() #start the pipeline
pipeline.join() #wait for the pipeline thread to finish

This code snippet will have your API key, Workspace name, and Workflow ID pre-filled. The only change you need to make is to set the “video_reference” argument to either the name of a video file, an RTSP stream URL, or a webcam ID.
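For instance, any of the following values would work for video_reference (the values shown are illustrative):

video_reference="video.mp4"                              # a local video file
video_reference="rtsp://user:pass@192.168.1.10/stream"   # an RTSP stream URL
video_reference=0                                        # a webcam device ID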

For this example, we’ll run our model on a video.

Conclusion

In this guide, we walked through how to fine-tune, run inference on, and deploy a PaliGemma model using Roboflow and Google Cloud Platform.

First, we prepared a dataset to identify boats. We then used Roboflow to label data for use in fine-tuning PaliGemma. We exported our dataset in the PaliGemma JSONL format and used Google Colab to train the model. We tested the model in Colab, then uploaded the model weights to Roboflow for use in building a Workflow.

We built a multi-step vision Workflow that uses our PaliGemma model and an object tracking algorithm to both identify and track boats in a video. We then deployed the Workflow on a Google Cloud Platform server with an NVIDIA GPU.

If you are curious to learn more about Roboflow, the following resources may be helpful:
