A spooky day in the life: Veo reference images, Nano Banana, and the art of canine costumes

Introduction: A Halloween Doogler

In my last post, I showed how to use Gemini 2.5 Flash Image (aka Nano Banana) and Veo’s Image-to-Video feature to bring my dog, Remy, to life as a Doogler. For a fun Halloween edition, I’m further exploring the consistency challenge. The final video is similar, but the technical process is different: I am using Veo’s Reference-to-Video feature, which allows you to provide separate reference images for the subject and the scene. This is a crucial technique for high-fidelity character and background consistency. Follow along to see how I created this Halloween Doogler narrative!

Step 1: Generate the ‘ingredients’ for Reference-to-Video

Veo’s Reference-to-Video is powerful because it treats subjects, characters, products, and backgrounds as distinct inputs, or “ingredients.” You can provide up to three reference images to dictate the generated video’s content and preserve the subject’s appearance in the output. I used two types of reference images for each scene:

  1. Subject: Remy in a specific costume and pose, against a neutral background.
  2. Scene: The background setting, generated without the dog, to guide the environment.

I used Nano Banana to generate both asset types:

Scene assets: I first generated the three core office settings using text-only prompts in Nano Banana: a colorful café, a bright beanbag room, and a modern high-rise gym. These act as high-quality scene references.
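
As a minimal sketch, one of those text-only scene prompts might look like the sample below (the prompt wording and output file name are illustrative, not the exact ones I used):

from google import genai
from google.genai.types import GenerateContentConfig

# PROJECT_ID = "[your-project-id]"

client = genai.Client(vertexai=True, project=PROJECT_ID, location="global")

response = client.models.generate_content(
    model="gemini-2.5-flash-image",
    contents=["Generate a photorealistic image of a colorful, empty office cafe with warm lighting and no people or animals."],
    config=GenerateContentConfig(
        response_modalities=["IMAGE"],
    ),
)

# Save the generated scene so it can be used later as a Veo scene reference.
for part in response.candidates[0].content.parts:
    if part.inline_data:
        with open("cafe-scene.png", "wb") as f:
            f.write(part.inline_data.data)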

Subject assets: This is where character consistency is critical. I used my original image of Remy and prompted Nano Banana to place him in the three distinct Halloween costumes and poses against a clean white background. This step isolates the subject and his costume for Veo, ensuring his features are preserved regardless of the final video’s setting. For this task, I used the following code sample, swapping out only the pose and costume in the prompt.

from google import genai
from google.genai.types import GenerateContentConfig, Part

# PROJECT_ID = "[your-project-id]"

client = genai.Client(vertexai=True, project=PROJECT_ID, location="global")

subject_image = "remy.jpg"

with open(subject_image, "rb") as f:
    subject = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash-image",
    contents=[
        Part.from_bytes(
            data=subject,
            mime_type="image/jpeg",
        ),
        "Generate a photorealistic image of this dog sitting down against a white background wearing a bumble bee costume.",
    ],
    config=GenerateContentConfig(
        response_modalities=["IMAGE"],
    ),
)

# Extract the generated image bytes and save them so the costume asset can be used as a Veo reference in Step 2.
for part in response.candidates[0].content.parts:
    if part.inline_data:
        with open("costume-1.png", "wb") as f:
            f.write(part.inline_data.data)

Step 2: Combining assets with Veo Reference-to-Video

I then took the subject asset and the scene asset and passed them to Veo alongside a detailed text prompt. The reference_images parameter in GenerateVideosConfig is where you pass both images, setting reference_type to “asset” for each. This tells Veo to combine them and animate the result. I used the code sample below to generate a 1080p, 8-second video with audio:

import time
from google import genai
from google.genai.types import Image, GenerateVideosConfig, VideoGenerationReferenceImage

# PROJECT_ID = "[your-project-id]"

client = genai.Client(vertexai=True, project=PROJECT_ID, location="us-central1")

prompt = "A close-up cinematic zoom-in on the dog at a table in the coffee shop as it slowly lifts a white mug to its face and takes a sip. Distant chatter and coffee sounds are audible."

costume_image = "costume-1.png"
scene_image = "setting-1.jpg"

operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    prompt=prompt,
    config=GenerateVideosConfig(
        reference_images=[
            VideoGenerationReferenceImage(
                image=Image.from_file(location=costume_image), reference_type="asset",
            ),
            VideoGenerationReferenceImage(
                image=Image.from_file(location=scene_image), reference_type="asset",
            ),
        ],
        duration_seconds=8,
        resolution="1080p",
        generate_audio=True,
    ),
)

# Poll the long-running operation until video generation completes.
while not operation.done:
    time.sleep(15)
    operation = client.operations.get(operation)

if operation.response:
    # Save the clip locally (e.g., as scene-1.mp4) so it can be stitched with the other clips in Step 3.
    video_data = operation.result.generated_videos[0].video.video_bytes
    with open("scene-1.mp4", "wb") as f:
        f.write(video_data)

I then repeated the same process with the remaining reference image combinations to generate the other two videos.

Prompt: A front-facing shot where the dog is sitting in a large red beanbag. He decisively begins typing on a plain black laptop.

Prompt: The dog runs into the gym, hops on the treadmill and jumps up to press start. The dog then begins a comically fast jog while looking at the camera and panting.

Step 3: Crafting a narrative

Now that I had the three standalone 8-second videos, I stitched them together with FFmpeg, a leading open-source multimedia framework for processing media files; a minimal stitching sketch is shown below. Then I had Gemini write a script for the video: I prompted it with a text description plus the newly combined video, as shown in the code sample that follows the FFmpeg sketch. I specifically asked for each scene to be around 23 words, since that’s roughly the pacing that yields an 8-second narration clip when read aloud.
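
To make the stitching step concrete, here is a minimal sketch using FFmpeg’s concat demuxer, assuming the three clips were saved as scene-1.mp4 through scene-3.mp4 (placeholder names):

import subprocess

# Placeholder names for the three generated clips.
clips = ["scene-1.mp4", "scene-2.mp4", "scene-3.mp4"]

# The concat demuxer reads a text file listing the inputs and copies the streams without re-encoding.
with open("clips.txt", "w") as f:
    for clip in clips:
        f.write(f"file '{clip}'\n")

subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0", "-i", "clips.txt", "-c", "copy", "video.mp4"],
    check=True,
)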

from google import genai
from google.genai.types import Part

# PROJECT_ID = "[your-project-id]"

client = genai.Client(vertexai=True, project=PROJECT_ID, location="global")

video_file = "video.mp4"
with open(video_file, "rb") as f:
    video = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        Part.from_bytes(data=video, mime_type="video/mp4"),
        "Watch this video and create a narration about Remy the Doogler going to work on Halloween. Each scene should be as close to 23 words as possible.",
    ],
)

text_data = response.text

Script: It’s Halloween at Google, and Remy the Doogler has multiple costumes. He starts his morning buzzing with excitement at the cafe. Now time for serious work. Remy trades his antenna for a hard hat and settles in to tackle code emergencies and put out fires. Next, Remy heads to the gym dressed as the fastest pumpkin on the floor. Whether chasing a ball or chasing deadlines, Remy is always on the move.

Step 4: Add in the audio

Now that I had a script, it was time to generate an audio file with Chirp, Google’s Text-to-Speech model. I took the script from Gemini and entered it in the speech section of Vertex AI Media Studio, then configured the language and voice I wanted in the parameters panel on the right-hand side of the screen. Once the audio was generated, I exported the audio file.
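
I did this step in the Media Studio UI, but it can also be scripted. Here is a minimal sketch using the Cloud Text-to-Speech client library; the Chirp voice name below is an assumption, so substitute any voice available in your project:

from google.cloud import texttospeech

tts_client = texttospeech.TextToSpeechClient()

# The Gemini-written script from Step 3 (truncated here for brevity).
script = "It's Halloween at Google, and Remy the Doogler has multiple costumes..."

response = tts_client.synthesize_speech(
    input=texttospeech.SynthesisInput(text=script),
    # "en-US-Chirp3-HD-Charon" is an assumed voice name; list your project's voices to pick one.
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Chirp3-HD-Charon",
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16,
    ),
)

with open("narration.wav", "wb") as f:
    f.write(response.audio_content)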

Finally, I generated some background music with Lyria, Google’s music generation model. I also did this in the music section of Vertex AI Media Studio and downloaded the music file once generation was complete.

Then, to bring it all together, I used FFmpeg again to stitch and cut the narration, music, and video files, creating the final asset.
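
For reference, one possible shape of that final pass, assuming the stitched video, narration, and music were saved as video.mp4, narration.wav, and music.mp3 (the file names and music volume are illustrative):

import subprocess

# Lower the music under the narration, mix the two into a single track, and lay it over the stitched video.
# This replaces Veo's generated audio track with the narration-plus-music mix.
subprocess.run(
    [
        "ffmpeg",
        "-i", "video.mp4",
        "-i", "narration.wav",
        "-i", "music.mp3",
        "-filter_complex", "[2:a]volume=0.3[m];[1:a][m]amix=inputs=2:duration=first[a]",
        "-map", "0:v", "-map", "[a]",
        "-c:v", "copy",
        "-shortest",
        "final.mp4",
    ],
    check=True,
)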

Conclusion: Consistent generative storytelling

The ability to separate and control the inputs for generative video is a monumental step forward for professional media creation. By using Nano Banana to create dedicated assets, and then leveraging Veo’s Reference-to-Video capability, I gained granular control over character consistency, costume detail, and background setting. This two-step process allows for high-fidelity character development across multiple cuts, making complex narrative videos—like a Doogler’s three-costume Halloween story—easier and more reliable to produce than ever before.

What’s next?

To get started with generative media on Vertex AI, check out the following resources: