1. The factory floor bottleneck: When “good enough” isn’t
Imagine a high-speed manufacturing line. Thousands of products roll by every hour. Today, quality control (QC) is often a choice between two imperfect options:
- Manual Inspection: Slow, expensive, and subject to human fatigue. An inspector might miss a subtle defect and wave through a product that should have failed.
 - Traditional Computer Vision (CV): A simple “Pass/Fail” classification model. It’s fast, but it’s a black box. It can’t tell you why a product failed. Was it a scratch? A crack? Discoloration? This lack of detail means a “Fail” still requires a human to investigate, defeating half the purpose of automation.
 
This is where traditional AI hits a wall. The real business value isn’t just in finding defects; it’s in understanding them to improve the manufacturing process.

What if your AI inspector could not only spot a defect but also describe it in plain English?
“Status: Defect - Scratch detected on product surface.”
“Status: Defect - Crack identified in main body.”
This is the power of multimodal foundation models. Today, we’ll show you how to build this exact solution. We’ll fine-tune Gemini 2.5 Flash, Google’s latest lightweight and powerful multimodal model, on Vertex AI to create a specialized visual defect inspector that can see, classify, and explain.
This notebook and blog post will walk you through the entire customer journey: from raw data to a scalable, serverless, “explainable QC” API.
What you’ll learn:
- How to structure multimodal data (images + text) for fine-tuning.
 - Why Google Cloud Storage (GCS) is your best friend for this process.
 - How to launch a Supervised Fine-Tuning job on Vertex AI with just a few lines of Python.
 - Why regex-based evaluation is critical for testing generative models.
 - The real-world challenges of multimodal tuning to watch out for.
 
2. The game plan: From raw images to a custom Gemini
Our workflow is a straightforward, repeatable MLOps pattern:
- Simulate & Store: Generate sample product images (“Pass” and “Defect”) and upload them to a GCS bucket.
 - Prepare the “Lesson Plan”: Convert our image URIs and text labels into the JSON Lines (JSONL) format Vertex AI requires for tuning.
 - Launch the Tuning Job: Point Vertex AI to our data and tell it to fine-tune Gemini 2.5 Flash.
 - Evaluate & Deploy: The job automatically creates a new, private model endpoint. We’ll test it with unseen images.
 
Prerequisites
Before we start, you’ll need:
- A Google Cloud project with the Vertex AI API enabled.
 - A GCS Bucket to store our images and training files.
 - Permissions to run Vertex AI jobs and GCS operations.
 
Step 1: Authentication and setup
First, we install our libraries and configure our environment. The key libraries are google-cloud-aiplatform (the Vertex AI SDK) and google-genai (the Gemini SDK, which we’ll configure for Vertex AI).
import sys

# gcsfs is added to allow pandas to write directly to GCS
# Pillow (PIL) is needed for image generation
!{sys.executable} -m pip install --upgrade --user --quiet \
    pandas numpy google-cloud-aiplatform google-genai \
    google-cloud-storage gcsfs Pillow
Next, we set up our project details and initialize the Vertex AI client. This is a crucial step: we’re telling the google-genai SDK to operate within our secure Vertex AI environment, not the public API.
import vertexai
from google.genai import (
    Client as VertexClient,  # This is for Vertex AI tuning/models client
)
from google.genai import types as genai_types  # Config/dataset types used by the tuning job in Step 3
# --- Vertex AI Configuration (Required for Fine-tuning Job) ---
PROJECT_ID = ""  # @param {type: "string", placeholder: "your-gcp-project-id"}
REGION = ""        # @param {type:"string"}
BUCKET_NAME = ""    # @param {type:"string", placeholder: "your-gcs-bucket-name"}
BUCKET_URI = f"gs://{BUCKET_NAME}"
# ... (Authentication logic from notebook) ...
# Initialize Vertex AI SDK (needed for launching the tuning job)
vertexai.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)
# Initialize the genai client specifically for Vertex AI operations (like tuning)
# This client will manage the tuning job
vertex_client = VertexClient(vertexai=True, project=PROJECT_ID, location=REGION)
print("Vertex AI SDK Initialized.")
3. Step 2: Generating and preparing the “lesson plan”
You can’t teach a model without data. For this demo, we’ll simulate it. We’ll create 200 simple images: some are “Pass” (a clean blue rectangle) and some are “Defect” (with a “Scratch,” “Crack,” or “Discoloration”).
The data (code deep dive: generate_and_upload_images)
The generate_and_upload_images function from the notebook does two things:
- Uses the Pillow (PIL) library to draw simple images in memory.
- Uses the google-cloud-storage client to upload each image directly to our GCS bucket as a PNG.
The most important part is the label it generates. This is our ground truth.
- Pass Image Label: Status: Pass
- Defect Image Label: Status: Defect - Scratch detected on product surface.
We store this information in a Pandas DataFrame that acts as our “manifest file.”
--- Image Manifest Sample Output ---
   image_name                    gcs_uri                                 status   defect_type   label
0  product_image_0.png           gs://diwali-111/visual_defect.../img_0.png   Pass     None          Status: Pass
1  product_image_1.png           gs://diwali-111/visual_defect.../img_1.png   Defect   Scratch       Status: Defect - Scratch detected...
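If you want to picture what that helper does without opening the notebook, here is a minimal, hypothetical sketch of the same idea. It only draws the “Scratch” case and reuses PROJECT_ID and BUCKET_NAME from Step 1; the notebook’s real function covers all defect types and also builds the manifest rows.

import io
from PIL import Image, ImageDraw
from google.cloud import storage

# A compressed, illustrative sketch of the generate_and_upload_images idea.
storage_client = storage.Client(project=PROJECT_ID)
bucket = storage_client.bucket(BUCKET_NAME)

def make_and_upload(image_name: str, defect_type=None) -> str:
    """Draw a simple product image, upload it to GCS, and return its gs:// URI."""
    # A clean blue rectangle represents a "Pass" product.
    img = Image.new("RGB", (256, 256), color=(40, 90, 200))
    if defect_type == "Scratch":
        # A thin light line across the product simulates a scratch.
        ImageDraw.Draw(img).line([(30, 40), (220, 200)], fill=(230, 230, 230), width=3)
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    blob = bucket.blob(f"visual_defect_tuning_data/images/{image_name}")
    blob.upload_from_string(buf.getvalue(), content_type="image/png")
    return f"gs://{BUCKET_NAME}/visual_defect_tuning_data/images/{image_name}"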
The “lesson plan” (code deep dive: create_tuning_jsonl_from_manifest)
This is the most critical part of the entire process. We must convert our manifest into a format that Gemini understands for multimodal tuning. This format is a JSON Lines (JSONL) file, where each line is a complete “lesson.”
From a research perspective, this is Supervised Fine-Tuning (SFT). We provide a prompt (the user’s request) and a perfect completion (the model’s ideal answer).
Here is the structure for a single lesson:
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "Analyze the following product image for manufacturing defects. Classify its status as 'Pass' or 'Defect' and provide a brief description if a defect is present."
        },
        {
          "fileData": {
            "mimeType": "image/png",
            "fileUri": "gs://diwali-111/visual_defect_tuning_data/images/product_image_93.png"
          }
        }
      ]
    },
    {
      "role": "model",
      "parts": [
        {
          "text": "Status: Pass"
        }
      ]
    }
  ]
}
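To make the conversion concrete, here is a hedged sketch of how a helper like create_tuning_jsonl_from_manifest could turn one manifest row into such a line. The column names follow the manifest sample above; the notebook’s actual implementation may differ in detail.

import json

BASE_PROMPT = (
    "Analyze the following product image for manufacturing defects. "
    "Classify its status as 'Pass' or 'Defect' and provide a brief "
    "description if a defect is present."
)

def manifest_row_to_example(row) -> str:
    """Turn one manifest row (gcs_uri + label) into a single JSONL 'lesson'."""
    example = {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {"text": BASE_PROMPT},
                    {"fileData": {"mimeType": "image/png", "fileUri": row["gcs_uri"]}},
                ],
            },
            {"role": "model", "parts": [{"text": row["label"]}]},
        ]
    }
    return json.dumps(example)

# One JSON object per line, written to the file we later upload to GCS:
# with open("train.jsonl", "w") as f:
#     for _, row in train_split.iterrows():
#         f.write(manifest_row_to_example(row) + "\n")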
Technical deep dive: Multimodal SFT in Gemini
Why this specific JSONL structure? It maps directly to how Gemini “thinks.”
- Sequence-to-Sequence: Gemini is a sequence-to-sequence model. It takes an input sequence and generates an output sequence.
- Multimodal Input Sequence: For SFT, the "role": "user" block defines the entire input sequence. The Vertex AI tuning service processes this by:
  - Taking the "text" part: “Analyze the following…”
  - Fetching the image from the "fileUri".
  - Passing the image through its built-in vision encoder to turn the image into a set of numerical representations (tokens).
  - Splicing these image tokens directly into the text tokens.
- The “Lesson”: The SFT process then trains the model (or its adapter) to learn that “when you see this specific combined (text + image) sequence, you must generate this exact output text sequence (the "role": "model" part).”
We are teaching the model a new, specialized skill: how to map a visual concept to a specific descriptive string.
4. Step 3: Launching the fine-tuning job on Vertex AI
Now for the magic. All our hard work in data preparation pays off. We don’t need to provision GPUs, manage clusters, or write complex training loops. We just point the vertex_client to our data.
Background: What is “adapter-based tuning”?
We’re not retraining the entire Gemini 2.5 Flash model from scratch. That would be colossally expensive and time-consuming. Instead, we’re using a technique called adapter-based tuning (like LoRA, or Low-Rank Adaptation).
Think of it this way:
- Freeze the Core: We “freeze” the original, massive model. This is critical because the core model’s vision encoder already knows what scratches, cracks, and colors are from its training on billions of web images.
 - Train the Adapter: We train a tiny, new “adapter” layer that plugs into the model. This adapter doesn’t have to re-learn vision; it only has to learn the mapping.
 - The Mapping: It learns to connect the core model’s existing visual knowledge (e.g., “that’s a line”) to our specific domain (e.g., “a line on this blue block”) and map it to our specific output format (e.g., “Status: Defect - Scratch…”).
 
This is faster, vastly cheaper, and prevents catastrophic forgetting—where a model forgets its original skills (like how to write a poem) because it was over-specialized on a new task.
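For the mathematically curious, the LoRA paper (reference [2] below) expresses this as a low-rank update to each frozen weight matrix:

W' = W + \Delta W = W + BA, \quad B \in \mathbb{R}^{d \times r}, \ A \in \mathbb{R}^{r \times k}, \ r \ll \min(d, k)

Only A and B receive gradient updates; W stays frozen, which keeps the trainable parameter count tiny and leaves the base model’s general skills intact.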
(Code deep dive: vertex_client.tunings.tune)
This one command kicks off the entire serverless job.
# Define where our data lives
training_dataset = {
    "gcs_uri": TRAIN_JSONL_GCS_URI,
}
validation_dataset = genai_types.TuningValidationDataset(
    gcs_uri=VALIDATION_JSONL_GCS_URI
)
# Launch the job!
sft_tuning_job = vertex_client.tunings.tune(
    base_model="gemini-2.5-flash",  # The base model we are tuning
    training_dataset=training_dataset,
    config=genai_types.CreateTuningJobConfig(
        adapter_size="ADAPTER_SIZE_FOUR", # A small, efficient adapter
        epoch_count=5, # How many times to "review" the lesson plan
        tuned_model_display_name=TUNED_MODEL_DISPLAY_NAME,
        validation_dataset=validation_dataset,
    ),
)
print("\nTuning job created:")
print(sft_tuning_job.name)
The job is now running on Google’s managed infrastructure. You can track its progress in the Vertex AI console or by polling the job object in the notebook. This will take some time (30 minutes to a few hours, depending on data size).
When it’s done, the tuning_job object gives us the single most valuable asset: TUNED_MODEL_ENDPOINT. This is our custom, private, scalable API.
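If you prefer to wait inside the notebook, a simple polling loop along these lines works. This is a sketch following the google-genai SDK’s polling pattern and reusing the job object and client from above; TUNED_MODEL_ENDPOINT is the variable name used in the rest of this post.

import time

# Poll the managed job until it leaves the running states.
running_states = {"JOB_STATE_PENDING", "JOB_STATE_RUNNING"}
while sft_tuning_job.state in running_states:
    time.sleep(60)  # check once a minute
    sft_tuning_job = vertex_client.tunings.get(name=sft_tuning_job.name)
    print("Current state:", sft_tuning_job.state)

# The tuned model's endpoint is the private API we will call in Step 4.
TUNED_MODEL_ENDPOINT = sft_tuning_job.tuned_model.endpoint
print("Tuned model endpoint:", TUNED_MODEL_ENDPOINT)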
5. Step 4: Evaluating our custom QC inspector
The moment of truth. Does our tuned model actually work? We’ll use our test_split (images the model never saw during training) to find out.
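Querying the tuned endpoint looks just like querying the base model. Here is a hedged sketch using the same google-genai client; the prompt text and image URI match the sample output below, while the notebook’s evaluation loop is more elaborate.

from google.genai import types as genai_types

PROMPT = (
    "Analyze the following product image for manufacturing defects. "
    "Classify its status as 'Pass' or 'Defect' and provide a brief "
    "description if a defect is present."
)

response = vertex_client.models.generate_content(
    model=TUNED_MODEL_ENDPOINT,  # the private endpoint created by the tuning job
    contents=[
        genai_types.Part.from_text(text=PROMPT),
        genai_types.Part.from_uri(
            file_uri="gs://diwali-111/visual_defect_tuning_data/images/product_image_106.png",
            mime_type="image/png",
        ),
    ],
)
print(response.text)  # e.g. "Status: Defect - Discoloration detected on product surface."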
Here is the actual output from our notebook’s evaluation step:
--- Qualitative Evaluation of Tuned Model (projects/551887116707/locations/us-central1/endpoints/7545261106659328000) ---
--- Sample 1 ---
Input Prompt Text: Analyze the following product image for manufacturing defects. Classify its status as 'Pass' or 'Defect' and provide a brief description if a defect is present.
Input Image URI: gs://diwali-111/visual_defect_tuning_data/images/product_image_106.png
Expected Output: Status: Defect - Discoloration spot found.
Predicted Output: Status: Defect - Discoloration detected on product surface.
Result: MATCH (Regex)
--- Sample 2 ---
Input Prompt Text: Analyze the following product image for manufacturing defects. Classify its status as 'Pass' or 'Defect' and provide a brief description if a defect is present.
Input Image URI: gs://diwali-111/visual_defect_tuning_data/images/product_image_16.png
Expected Output: Status: Pass
Predicted Output: Status: Pass
Result: MATCH
Why simple string-matching fails (and regex-based evaluation wins)
Look at Sample 1.
- Expected: Status: Defect - Discoloration spot found.
- Predicted: Status: Defect - Discoloration detected on product surface.
If our test was if predicted == expected:, this sample would have FAILED.
This is the biggest trap in evaluating generative models. A simple string match tests memorization. We want to test comprehension. The model’s response is semantically correct and more natural than our original label, which is a fantastic result!
This is why the notebook’s evaluate_qualitatively function uses a regex-based evaluation. The logic, seen in Step 6 of the notebook, is far more robust:
- Check for “Pass”: If the expected output is Status: Pass, check if the prediction is also Status: Pass. This is a simple, direct match.
- Check for “Defect”:
  - First, check if the prediction also contains the word "Defect". If it predicts “Pass” for a “Defect” image, it’s a clear failure.
  - If it correctly identifies a “Defect,” extract the key defect type (e.g., “Scratch,” “Crack,” or “Discoloration”) from the expected label.
  - Finally, use re.search(defect_type, predicted_output, re.IGNORECASE) to see if the predicted output also contains that key defect, ignoring case.
This regex approach correctly identifies Sample 1 as a “MATCH”. It confirms the model understood the task: 1) identify the defect, and 2) mention “Discoloration.” This is a production-ready evaluation strategy.
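In code, the core of that check can be as small as the following sketch; the notebook’s evaluate_qualitatively function implements its own version of this logic with logging around it.

import re

# The key defect types used in our synthetic labels.
DEFECT_TYPES = ["Scratch", "Crack", "Discoloration"]

def is_match(expected: str, predicted: str) -> bool:
    # Case 1: a "Pass" label requires a "Pass" prediction.
    if expected.strip() == "Status: Pass":
        return predicted.strip().startswith("Status: Pass")
    # Case 2: a "Defect" label requires the prediction to say "Defect"
    # and to mention the same defect type, case-insensitively.
    if "Defect" not in predicted:
        return False
    for defect_type in DEFECT_TYPES:
        if defect_type.lower() in expected.lower():
            return re.search(defect_type, predicted, re.IGNORECASE) is not None
    return False

# Sample 1 from above: semantically correct phrasing still counts as a MATCH.
print(is_match("Status: Defect - Discoloration spot found.",
               "Status: Defect - Discoloration detected on product surface."))  # True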
6. Bonus: AI for MLOps (using Gemini to report on itself)
Here’s a clever trick for your MLOps pipelines. How do you notify your team when a tuning job is done? You could send a raw JSON blob… or you could have AI summarize it for you.
We use the base gemini-2.5-flash model (via the standard Vertex AI GenerativeModel class) to generate a human-readable report about our fine-tuning job.
(Code deep dive: generate_tuning_summary_with_gemini)
We feed the final tuning_job object (which contains all the metadata) into a simple prompt.
# We ask the base Gemini model to summarize the job's metadata
prompt = f"""Generate a brief status report for a Gemini model fine-tuning job for a 'Visual Defect Detection' use case.
Job Name: {job_name}
Base Model: {base_model}
Tuned Model Display Name: {display_name}
Final Status: {job_state}
Tuned Model Endpoint: {tuned_endpoint}
Error (if any): {error_message}
Summarize the outcome of this tuning job in 1-2 sentences, specifically mentioning its readiness for the manufacturing defect analysis task."""
# ... (call reporting_client.generate_content) ...
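The call itself is a one-liner. Here is a hedged sketch assuming a google-genai client configured like the one in Step 1; the notebook wires up its own reporting_client, which may differ slightly.

response = vertex_client.models.generate_content(
    model="gemini-2.5-flash",  # the base model, not our tuned endpoint
    contents=prompt,
)
print("--- Gemini Tuning Job Summary ---")
print(response.text)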
And here’s the AI-generated summary from our notebook:
— Gemini Tuning Job Summary — The Gemini-2.5-flash fine-tuning job for visual defect detection has successfully completed with no errors. The resulting tuned model is now ready for immediate deployment in manufacturing defect analysis tasks.
This is perfect for an automated Slack message, an email alert, or a documentation entry.
7. From demo to production: Real-world tuning challenges
This notebook shows a successful “happy path.” In a real manufacturing setting, tuning image models requires navigating several challenges.
- Data Quality & Consistency: Garbage in, garbage out. If your “Pass” images are blurry, the model might learn that “blurry” means “Pass.” If your labels are inconsistent (e.g., “Scratch” vs. “Scuff”), the model will be confused. Your dataset must be clean and meticulously labeled.
- Class Imbalance: In the real world, you’ll have 10,000 “Pass” images for every 10 “Crack” images. If you train on this raw data, the model will become biased and always predict “Pass” because it’s the safest bet. You must use techniques like oversampling your rare defect images or undersampling your “Pass” images to create a balanced “lesson plan” (see the sketch after this list).
- Prompt Sensitivity: The base_prompt we used in our JSONL file (“Analyze the following…”) matters. A different prompt, like “You are a QC inspector. Your only job is to state the Status and Defect type,” might yield different, possibly better, results. This prompt engineering is a key part of the SFT process itself.
- Overfitting: We ran with an epoch_count of 5. What if we ran for 50? With a small dataset, the model might memorize all 160 training images perfectly. It would get 100% on the training set but fail miserably on any new image (like our test set). This is overfitting. The validation_dataset is your guardrail: it checks the model’s performance on unseen data after each epoch. If validation accuracy stops improving (or gets worse), it’s time to stop training.
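To make the class-imbalance point concrete, here is a minimal oversampling sketch. It assumes the manifest DataFrame from Step 2 with its status column; real pipelines often combine this with image augmentation.

import pandas as pd

def balance_manifest(manifest: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Upsample rare 'Defect' rows so both classes appear equally often."""
    passes = manifest[manifest["status"] == "Pass"]
    defects = manifest[manifest["status"] == "Defect"]
    # Sample the defect rows with replacement until they match the pass count.
    defects_upsampled = defects.sample(n=len(passes), replace=True, random_state=seed)
    # Shuffle so the tuning data isn't ordered by class.
    return pd.concat([passes, defects_upsampled]).sample(frac=1, random_state=seed)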
8. Conclusion: The future of quality control is here
In a single notebook, we’ve gone from a business problem (slow, “dumb” QC) to a sophisticated, scalable AI solution. We didn’t write a single line of complex model code. We just prepared a “lesson plan” and let Vertex AI and Gemini do the heavy lifting.
The “customer journey” is complete:
- We had a problem: Manual QC is a bottleneck.
 - We had data: Simple images of our products.
 - We built a solution: A custom-tuned Gemini 2.5 Flash model on a serverless endpoint.
 - We got value: The model doesn’t just say “Fail”; it says why it failed, providing actionable business intelligence.
 
This pattern isn’t limited to manufacturing. You can use this exact workflow for:
- Insurance: Analyzing photos of car damage (“Defect: Bumper scratch”).
 - Retail: Flagging damaged goods in warehouse photos.
 - Content Moderation: Identifying and classifying policy-violating images.
 
You’ve now seen how accessible and powerful supervised fine-tuning on Vertex AI has become. Your next-generation AI inspector is just a notebook away.
9. Further reading & citations
- [1] Vertex AI Documentation: Official Docs for Gemini Fine-Tuning
 - [2] Research Paper (LoRA): LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685)
 - [3] Blog: Introduction to Gemini 2.5 Flash
 - [4] Code: sft_gemini_visual_defect_detection.ipynb