How to use Gemini Image for Veo character consistency: Doogler Edition

Introduction

In honor of National Dog Day and yesterday's release of Gemini 2.5 Flash Image (aka nano-banana), I was inspired to turn images of my dog into AI video creations. My vision was to create three standalone Veo 3 video scenes of my dog at a Google office, showing a day in the life of a Doogler. To pull this off, I needed to maintain character consistency across the videos: each clip had to preserve my dog's exact features, appearance, and coloring. Character consistency has long been a complicated problem in generative AI media, but Gemini 2.5 Flash Image makes the process much easier.

Step 1: Generate starting frames

Before I began, I outlined the three scenes I wanted to generate. First, I decided on a video of my Shih Tzu, Remy, riding into the office on a bike. Then, it would cut to a clip of him working at a desk, and finally it would conclude with him eating ice cream in a cafeteria. To bring each of these scenes to life, I needed to generate starting frames with Remy in each of these settings, which is where Gemini 2.5 Flash Image came in.

I started with a reference image of my dog and ran it through the following code sample.

from google import genai
from google.genai.types import GenerateContentConfig, Part

# PROJECT_ID = "[your-project-id]"

client = genai.Client(vertexai=True, project=PROJECT_ID, location="global")

subject_image = "remy.jpg"

with open(subject_image, "rb") as f:
    subject = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",
    contents=[
        Part.from_bytes(
            data=subject,
            mime_type="image/jpeg",
        ),
        "Generate a photorealistic image of this dog balancing on a green, yellow, red, and blue Google bicycle seat with its front paws on the handlebars in front of a Google office. It's wearing a backpack and has a lot of tennis balls in the basket.",
    ],
    config=GenerateContentConfig(
        response_modalities=["TEXT", "IMAGE"],
    ),
)

for part in response.candidates[0].content.parts:
    if part.inline_data:
        # Save the generated frame so Veo can use it as a starting image later.
        with open("frame1.jpg", "wb") as f:
            f.write(part.inline_data.data)

I then repeated this process to generate the next two frames by simply changing out the prompt within the same code sample.

Prompt: Generate a photorealistic image of this dog at a desk in a Google office frantically typing away on a keyboard with over-the-head headphones.

Prompt: Generate a photorealistic image of this dog sitting in a Google cafeteria eating ice cream with a propeller cap on that says ‘Doogler’.
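Since only the prompt changes between frames, the generation call can be factored into a loop. Below is a sketch that reuses the client, subject bytes, and model from the code sample above; `generate_frames`, `first_image_bytes`, and the `frameN.jpg` filenames are my own naming, and the SDK import is deferred so the parsing helper stays standard-library.

```python
def first_image_bytes(response) -> bytes:
    """Return the first inline image found in a generate_content response."""
    for part in response.candidates[0].content.parts:
        if part.inline_data:
            return part.inline_data.data
    raise ValueError("no image in response")


def generate_frames(client, subject: bytes, prompts) -> list:
    """Generate one starting frame per prompt, saved as frame1.jpg, frame2.jpg, ..."""
    # Deferred import so the helper above works without the SDK installed.
    from google.genai.types import GenerateContentConfig, Part

    paths = []
    for i, prompt in enumerate(prompts, start=1):
        response = client.models.generate_content(
            model="gemini-2.5-flash-image-preview",
            contents=[Part.from_bytes(data=subject, mime_type="image/jpeg"), prompt],
            config=GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
        )
        path = f"frame{i}.jpg"
        with open(path, "wb") as f:
            f.write(first_image_bytes(response))
        paths.append(path)
    return paths
```

Passing the three prompts above to `generate_frames` yields the three starting images for Step 2.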

Step 2: Generate videos from images

Next, my task was to take each of these starting images and generate 8 second videos with Veo 3. For each of these requests, I included the image generated from Gemini Image, as well as a detailed text prompt. I used Gemini to help generate each video text prompt by providing the starting image and some basic keywords that described Remy’s actions or specific camera movements. For more guidance on optimizing image-to-video prompts with Gemini, check out this notebook.

import time
from google import genai
from google.genai.types import Image, GenerateVideosConfig

# PROJECT_ID = "[your-project-id]"

client = genai.Client(vertexai=True, project=PROJECT_ID, location="us-central1")

prompt = "A cinematic zoom out on a charming Shih Tzu dog, wearing a small blue backpack, expertly perched on a vibrant green bicycle with a front basket overflowing with tennis balls, set against the backdrop of the modern Google headquarters entrance. With one quick leap, the dog jumps off of the bicycle, wagging its tail, and playfully dashes into the brightly lit Google office building."

starting_image = "frame1.jpg"

operation = client.models.generate_videos(
    model="veo-3.0-generate-preview",
    prompt=prompt,
    image=Image.from_file(location=starting_image),
    config=GenerateVideosConfig(
        aspect_ratio="16:9",
        number_of_videos=1,
        duration_seconds=8,
        resolution="1080p",
        person_generation="allow_adult",
        enhance_prompt=True,
        generate_audio=True,
    ),
)
while not operation.done:
    time.sleep(15)
    operation = client.operations.get(operation)
if operation.response:
    video_data = operation.result.generated_videos[0].video.video_bytes
    # Save the clip so it can be stitched with the other scenes later.
    with open("video1.mp4", "wb") as f:
        f.write(video_data)

Similarly, I repeated the same process with each of the starting images to generate the remaining videos.

Prompt: A Medium Shot captures a small, fluffy Shih Tzu dog, sporting black headphones, intensely focused as it frantically types on a computer keyboard with its paws at a sleek office workstation. In the background, a bright, modern Google office is visible with subtle out-of-focus elements like the Google logo, glass partitions, and other distant employees. The only sounds are a soft, quiet office hum, punctuated by an unexpected, high-pitched woof.

Prompt: Pedestal down, revealing a charming Shih Tzu dog wearing a vibrant ‘Doogler’ propeller hat, its front paws resting on the edge of a bright cafeteria table. The dog is focused, eating ice cream with a small spoon comically held in its mouth, from a clear cup filled with multi-flavored ice cream, colorful sprinkles, and a miniature waffle cone. The dog drops the spoon and begins to lick the ice cream from the cup. The background softly blurs into a bustling, sunlit modern office cafeteria filled with dining employees.
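Each scene repeats the same poll-and-save pattern, so it can be pulled into a helper. This is a sketch: `wait_for_video` is my own name, and the operation and result shapes follow the snippet above.

```python
import time


def wait_for_video(client, operation, out_path, poll_seconds=15):
    """Poll a long-running Veo operation, then save the first generated video."""
    while not operation.done:
        time.sleep(poll_seconds)
        operation = client.operations.get(operation)
    video_bytes = operation.result.generated_videos[0].video.video_bytes
    with open(out_path, "wb") as f:
        f.write(video_bytes)
    return out_path
```

Calling this once per starting frame, with the matching prompt, yields the three clips to stitch in Step 3.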

Step 3: Finishing touches

With all videos generated, it was time to stitch them together into one cohesive video. Since Veo 3 generates audio natively for each video, I could have stopped there, but I went a step further.

With the stitched 24-second video, I generated a narration of Remy's day with help from Gemini. I passed the entire video in the request along with a text prompt asking for a script, explaining that each scene's narration should last no longer than 8 seconds when read aloud. Gemini supplied a script that I then narrated with Chirp, Google's text-to-speech model. Finally, I generated some upbeat background music with Lyria, Google's music generation model.
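To keep each narration line inside its 8-second scene, a rough word-count check helps before synthesis. The sketch below assumes an average speaking rate of about 150 words per minute (my estimate, not a Chirp parameter) and calls the Cloud Text-to-Speech API; the Chirp voice name is an assumption, so list the voices available in your project first.

```python
def fits_in_scene(script: str, seconds: float = 8.0, words_per_minute: float = 150.0) -> bool:
    """Rough check that a narration line can be read aloud within one scene.
    150 wpm is an assumed average speaking rate, not a Chirp parameter."""
    return len(script.split()) <= words_per_minute / 60.0 * seconds


def synthesize_narration(script: str, out_path: str = "narration.wav") -> str:
    """Sketch of a Chirp request via the Cloud Text-to-Speech API."""
    # Deferred import so the pacing helper above works without this client installed.
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=script),
        voice=texttospeech.VoiceSelectionParams(
            language_code="en-US",
            name="en-US-Chirp3-HD-Charon",  # assumed voice name; check your project's list
        ),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16
        ),
    )
    with open(out_path, "wb") as f:
        f.write(response.audio_content)
    return out_path
```

A line that fails `fits_in_scene` can be sent back to Gemini with a request to shorten it before synthesis.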

At this point, I had the stitched video file, a voiceover audio file, and a background music file. I used video editing software to combine all of these components, cutting the audio file to the final video time, and spacing out the narration so that it fit perfectly within each scene.
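The final mix can also be scripted rather than done in an editor. Below is a sketch that builds an ffmpeg command to lay the narration and quieter background music over the stitched video; the filenames and the 0.3 music level are my own assumptions.

```python
def mux_command(video, narration, music, out, music_volume=0.3):
    """ffmpeg command: keep the video stream, mix its audio with narration and music."""
    filter_graph = (
        f"[2:a]volume={music_volume}[quiet];"  # duck the background music
        "[0:a][1:a][quiet]amix=inputs=3:duration=first[aout]"  # blend all three tracks
    )
    return [
        "ffmpeg", "-y",
        "-i", video, "-i", narration, "-i", music,
        "-filter_complex", filter_graph,
        "-map", "0:v", "-map", "[aout]",
        "-c:v", "copy",
        out,
    ]
```

Running the returned command with `subprocess.run(..., check=True)` produces the combined file; spacing the narration to fit each scene still benefits from a pass in an editor, as described above.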

Conclusion

Maintaining character consistency has been a real struggle in creating compelling stories with generative AI. The ability to generate multiple images of the same characters in different backgrounds and settings is a game changer in this space. I was pleasantly surprised by how simple this became with Gemini 2.5 Flash Image and can’t wait to see the stories created through this process!

What’s next?

To get started with generative media on Vertex AI, check out the following resources:


I’ve been experimenting with Gemini 2.5 Flash Image (aka Nano-Banana) specifically to solve the problem of character and environment consistency in AI-generated videos.

✅ Character Consistency:
I built structured prompts that lock in each character’s age, clothing, facial expressions, and emotional state across all frames. For example, Sofia’s braid, traditional pink embroidered dress, and timid-but-hopeful posture remain consistent without style drift. Squiggle always appears in his forest-green jumper and red scarf, while Bloom keeps her lavender petal dress and glowing daisy crown identical throughout all scenes.

🌍 Environment Consistency:
I used scene anchoring techniques where the background elements (e.g., lighting, atmosphere, color palette) are reinforced at every frame. This ensures continuity — the forest feels like the same forest across shots, and the emotional tone doesn’t break immersion.

By combining Gemini’s multi-frame generation capabilities with careful prompt engineering + iteration tracking, I was able to maintain both visual fidelity and emotional coherence across characters and settings.
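One way to implement this attribute locking is to keep each character's invariants in a single structured record and render them into every frame's prompt. A minimal sketch, with descriptions taken from the examples above and the function name my own:

```python
# Locked descriptions repeated verbatim in every frame's prompt to prevent drift.
CHARACTERS = {
    "Sofia": "wearing her braid and traditional pink embroidered dress, "
             "with a timid-but-hopeful posture",
    "Squiggle": "always in his forest-green jumper and red scarf",
    "Bloom": "in her lavender petal dress and glowing daisy crown",
}


def locked_prompt(name, action, environment):
    """Prefix the frame's action with the character's locked description."""
    return f"{name}, {CHARACTERS[name]}, {action}, in {environment}."
```

For example, `locked_prompt("Sofia", "waves shyly at the camera", "a misty forest clearing")` repeats the same descriptors in every frame, which is what keeps the character stable across generations.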

🔗 You can explore some of my experiments here: GitHub – RobinaMirbahar


Remy is a very good boy for teaching me how to code a consistent video. 🐾 Great work @Katie_Nguyen!


@Katie_Nguyen Every time I have a text overlay in the video, Veo 3 never gets it right. Did you explore any workaround for it?


How do you get a consistent 360° orbit around the subject in Veo 3 and 3.1?


No, it's fine on my end.

This is how I used to create videos on Veo:

Keep your character locked in place with phrasing like:

“Sofia remains steady at the center of the frame while the camera moves in a full circle around her.”
This tells Veo to treat the character as a fixed pivot point — otherwise it may interpret the scene as both subject and camera moving.

Define the Environment Clearly

The background helps Veo understand the rotational reference. Example:

“In a sunlit meadow with mountains in the distance, camera performs a full 360° orbit around Sofia standing in the grass.”
The stable horizon guides the AI to maintain a consistent parallax shift.
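Those two rules compose naturally into a reusable prompt template. A small sketch (the wording mirrors the examples above; the function name is my own):

```python
def orbit_prompt(subject, environment):
    """Combine a fixed-pivot subject clause with a stable environment anchor."""
    return (
        f"In {environment}, {subject} remains steady at the center of the frame "
        "while the camera performs a full 360° orbit around them."
    )
```

For example, `orbit_prompt("Sofia", "a sunlit meadow with mountains in the distance")` yields a prompt that names both the fixed pivot and the stable background in one sentence.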