Introduction
In honor of National Dog Day and yesterday’s release of Gemini 2.5 Flash Image (aka nano-banana), I was inspired to turn images of my dog into AI video creations. My vision was to create three standalone Veo 3 video scenes of my dog at a Google office, showing a day in the life of a Doogler. To accomplish this, I needed to maintain character consistency between the videos: each clip had to preserve my dog’s exact features, appearance, and coloring. This has been a complicated problem in the world of generative AI media, but with Gemini 2.5 Flash Image, the process has become much easier.
Step 1: Generate starting frames
Before I began, I outlined the three scenes I wanted to generate. First, I decided on a video of my Shih Tzu, Remy, riding into the office on a bike. Then, it would cut to a clip of him working at a desk, and finally it would conclude with him eating ice cream in a cafeteria. To bring each of these scenes to life, I needed to generate starting frames with Remy in each of these settings, which is where Gemini 2.5 Flash Image came in.
I started with a reference image of my dog and ran it through the following code sample.
```python
from google import genai
from google.genai.types import GenerateContentConfig, Part

# PROJECT_ID = "[your-project-id]"
client = genai.Client(vertexai=True, project=PROJECT_ID, location="global")

subject_image = "remy.jpg"
with open(subject_image, "rb") as f:
    subject = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",
    contents=[
        Part.from_bytes(
            data=subject,
            mime_type="image/jpeg",
        ),
        "Generate a photorealistic image of this dog balancing on a green, yellow, red, and blue Google bicycle seat with its front paws on the handlebars in front of a Google office. It's wearing a backpack and has a lot of tennis balls in the basket.",
    ],
    config=GenerateContentConfig(
        response_modalities=["TEXT", "IMAGE"],
    ),
)

# Save the generated frame so it can be used as the starting image for Veo.
for part in response.candidates[0].content.parts:
    if part.inline_data:
        image_data = part.inline_data.data
        with open("frame1.jpg", "wb") as out:
            out.write(image_data)
```
I then repeated this process to generate the next two frames by simply changing out the prompt within the same code sample.
Prompt: Generate a photorealistic image of this dog at a desk in a Google office frantically typing away on a keyboard wearing over-the-head headphones.
Prompt: Generate a photorealistic image of this dog sitting in a Google cafeteria eating ice cream with a propeller cap on that says ‘Doogler’.
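Rather than rerunning the snippet by hand for each prompt, the three starting frames can be generated in one loop. The following is a minimal sketch: the `PROMPTS` list reuses the prompts above, while `frame_path` and `generate_frame` are my own hypothetical helpers (the client setup is the same as in the earlier snippet).

```python
# Sketch: generate all three starting frames in one loop by swapping the
# prompt. frame_path/generate_frame are hypothetical helper names.
PROMPTS = [
    "Generate a photorealistic image of this dog balancing on a green, yellow, red, and blue Google bicycle seat with its front paws on the handlebars in front of a Google office. It's wearing a backpack and has a lot of tennis balls in the basket.",
    "Generate a photorealistic image of this dog at a desk in a Google office frantically typing away on a keyboard wearing over-the-head headphones.",
    "Generate a photorealistic image of this dog sitting in a Google cafeteria eating ice cream with a propeller cap on that says ‘Doogler’.",
]


def frame_path(i: int) -> str:
    """Output file for the i-th starting frame (1-based)."""
    return f"frame{i}.jpg"


def generate_frame(client, subject_bytes: bytes, prompt: str, out_path: str) -> None:
    """Generate one starting frame from the reference image and save it."""
    # Imported here so the helpers above stay importable without the SDK.
    from google.genai.types import GenerateContentConfig, Part

    response = client.models.generate_content(
        model="gemini-2.5-flash-image-preview",
        contents=[Part.from_bytes(data=subject_bytes, mime_type="image/jpeg"), prompt],
        config=GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
    )
    for part in response.candidates[0].content.parts:
        if part.inline_data:
            with open(out_path, "wb") as f:
                f.write(part.inline_data.data)
```

With the same client and reference-image bytes as above, calling `generate_frame(client, subject, p, frame_path(i))` for each prompt produces frame1.jpg through frame3.jpg.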
Step 2: Generate videos from images
Next, my task was to take each of these starting images and generate 8-second videos with Veo 3. For each request, I included the image generated by Gemini 2.5 Flash Image, as well as a detailed text prompt. I used Gemini to help write each video prompt by providing the starting image and some basic keywords describing Remy’s actions or specific camera movements. For more guidance on optimizing image-to-video prompts with Gemini, check out this notebook.
```python
import time

from google import genai
from google.genai.types import GenerateVideosConfig, Image

# PROJECT_ID = "[your-project-id]"
client = genai.Client(vertexai=True, project=PROJECT_ID, location="us-central1")

prompt = "A cinematic zoom out on a charming Shih Tzu dog, wearing a small blue backpack, expertly perched on a vibrant green bicycle with a front basket overflowing with tennis balls, set against the backdrop of the modern Google headquarters entrance. With one quick leap, the dog jumps off of the bicycle, wagging its tail, and playfully dashes into the brightly lit Google office building."
starting_image = "frame1.jpg"

operation = client.models.generate_videos(
    model="veo-3.0-generate-preview",
    prompt=prompt,
    image=Image.from_file(location=starting_image),
    config=GenerateVideosConfig(
        aspect_ratio="16:9",
        number_of_videos=1,
        duration_seconds=8,
        resolution="1080p",
        person_generation="allow_adult",
        enhance_prompt=True,
        generate_audio=True,
    ),
)

# Video generation is a long-running operation; poll until it completes.
while not operation.done:
    time.sleep(15)
    operation = client.operations.get(operation)

if operation.response:
    video_data = operation.result.generated_videos[0].video.video_bytes
    with open("scene1.mp4", "wb") as out:
        out.write(video_data)
```
Similarly, I repeated the same process with each of the starting images to generate the remaining videos.
Prompt: A Medium Shot captures a small, fluffy Shih Tzu dog, sporting black headphones, intensely focused as it frantically types on a computer keyboard with its paws at a sleek office workstation. In the background, a bright, modern Google office is visible with subtle out-of-focus elements like the Google logo, glass partitions, and other distant employees. The only sounds are a soft, quiet office hum, punctuated by an unexpected, high-pitched woof.
Prompt: Pedestal down, revealing a charming Shih Tzu dog wearing a vibrant ‘Doogler’ propeller hat, its front paws resting on the edge of a bright cafeteria table. The dog is focused, eating ice cream with a small spoon comically held in its mouth, from a clear cup filled with multi-flavored ice cream, colorful sprinkles, and a miniature waffle cone. The dog drops the spoon and begins to lick the ice cream from the cup. The background softly blurs into a bustling, sunlit modern office cafeteria filled with dining employees.
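The three Veo requests can likewise be batched. Below is a sketch where `SCENES` pairs each starting frame with its prompt; the file names (`frame2.jpg`, `scene2.mp4`) and the helper names are my assumptions, and `generate_scene` mirrors the polling loop shown above.

```python
# Sketch: batch the three image-to-video requests. SCENES pairs each
# Gemini-generated frame with its Veo prompt; names are hypothetical.
SCENES = [
    ("frame1.jpg", "A cinematic zoom out on a charming Shih Tzu dog, wearing a small blue backpack, expertly perched on a vibrant green bicycle with a front basket overflowing with tennis balls, set against the backdrop of the modern Google headquarters entrance. With one quick leap, the dog jumps off of the bicycle, wagging its tail, and playfully dashes into the brightly lit Google office building."),
    ("frame2.jpg", "A Medium Shot captures a small, fluffy Shih Tzu dog, sporting black headphones, intensely focused as it frantically types on a computer keyboard with its paws at a sleek office workstation. In the background, a bright, modern Google office is visible with subtle out-of-focus elements like the Google logo, glass partitions, and other distant employees. The only sounds are a soft, quiet office hum, punctuated by an unexpected, high-pitched woof."),
    ("frame3.jpg", "Pedestal down, revealing a charming Shih Tzu dog wearing a vibrant ‘Doogler’ propeller hat, its front paws resting on the edge of a bright cafeteria table. The dog is focused, eating ice cream with a small spoon comically held in its mouth, from a clear cup filled with multi-flavored ice cream, colorful sprinkles, and a miniature waffle cone. The dog drops the spoon and begins to lick the ice cream from the cup. The background softly blurs into a bustling, sunlit modern office cafeteria filled with dining employees."),
]


def scene_path(i: int) -> str:
    """Output file for the i-th scene clip (1-based)."""
    return f"scene{i}.mp4"


def generate_scene(client, frame: str, prompt: str, out_path: str) -> None:
    """Generate one 8-second Veo clip from a starting frame and save it."""
    # Imported here so the helpers above stay importable without the SDK.
    import time
    from google.genai.types import GenerateVideosConfig, Image

    operation = client.models.generate_videos(
        model="veo-3.0-generate-preview",
        prompt=prompt,
        image=Image.from_file(location=frame),
        config=GenerateVideosConfig(
            aspect_ratio="16:9",
            duration_seconds=8,
            resolution="1080p",
            generate_audio=True,
        ),
    )
    while not operation.done:
        time.sleep(15)
        operation = client.operations.get(operation)
    with open(out_path, "wb") as f:
        f.write(operation.result.generated_videos[0].video.video_bytes)
```

Calling `generate_scene(client, frame, prompt, scene_path(i))` for each pair yields scene1.mp4 through scene3.mp4, ready to be stitched together.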
Step 3: Finishing touches
With all three videos generated, it was time to stitch them together into one cohesive video. Since Veo 3 natively generates audio for each clip, I could have stopped there, but I went a step further.
With the stitched 24-second video in hand, I generated a narration of Remy’s day with help from Gemini. I passed the entire video in the request along with a text prompt asking for a script, explaining that each scene’s narration should last no longer than 8 seconds when read aloud. Gemini supplied a script that I then narrated with Chirp, Google’s text-to-speech model. Finally, I generated some upbeat background music with Lyria, Google’s music generation model.
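Two small helpers can support this step. `estimate_seconds` is a rough words-per-minute check that each scene’s script actually reads in at most 8 seconds; `synthesize_narration` is a hedged sketch of the Chirp call that assumes the Cloud Text-to-Speech client (`google-cloud-texttospeech`) and a Chirp 3 HD voice name, both of which are my assumptions rather than details from this project.

```python
def estimate_seconds(text: str, words_per_minute: float = 150.0) -> float:
    """Rough reading time for a script snippet at a typical narration pace."""
    return len(text.split()) * 60.0 / words_per_minute


def synthesize_narration(script: str, out_path: str) -> None:
    """Sketch: render the script to audio with a Chirp voice (assumed API)."""
    # Imported here so estimate_seconds stays usable without the library.
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=script),
        voice=texttospeech.VoiceSelectionParams(
            language_code="en-US",
            name="en-US-Chirp3-HD-Charon",  # assumed voice name
        ),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16
        ),
    )
    with open(out_path, "wb") as f:
        f.write(response.audio_content)
```

At 150 words per minute, a scene script over about 20 words would blow past the 8-second budget, which is an easy check to run before synthesizing anything.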
At this point, I had the stitched video file, a voiceover audio file, and a background music file. I used video editing software to combine all of these components, cutting the audio file to the final video time, and spacing out the narration so that it fit perfectly within each scene.
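I used a GUI editor, but the same assembly can be scripted. The sketch below assumes ffmpeg is installed: `concat_list` builds the list file for ffmpeg’s concat demuxer (which joins clips without re-encoding when they share codecs, as same-model Veo clips should), and `mix_command` builds a command that mixes the narration and music over the clips’ native audio. All file names and helper names here are hypothetical.

```python
import subprocess  # used by the commented driver lines below


def concat_list(paths):
    """Contents of the concat-demuxer list file, one clip per line."""
    return "".join(f"file '{p}'\n" for p in paths)


def mix_command(video, narration, music, out_path):
    """ffmpeg command mixing narration and music over the video's own audio."""
    return [
        "ffmpeg", "-y",
        "-i", video, "-i", narration, "-i", music,
        # amix blends the three audio streams; duration=first keeps the
        # result as long as the video's native track.
        "-filter_complex", "[0:a][1:a][2:a]amix=inputs=3:duration=first[a]",
        "-map", "0:v", "-map", "[a]",
        "-c:v", "copy",
        out_path,
    ]


# Driver (not run here): write the list file, stitch, then mix.
# with open("clips.txt", "w") as f:
#     f.write(concat_list(["scene1.mp4", "scene2.mp4", "scene3.mp4"]))
# subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
#                 "-i", "clips.txt", "-c", "copy", "stitched.mp4"], check=True)
# subprocess.run(mix_command("stitched.mp4", "voiceover.wav",
#                            "music.wav", "final.mp4"), check=True)
```

Timing the narration to each scene would still take manual adjustment (e.g. `adelay` offsets per segment), which is where an editor remains more convenient.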
Conclusion
Maintaining character consistency has been a real struggle in creating compelling stories with generative AI. The ability to generate multiple images of the same characters in different backgrounds and settings is a game changer in this space. I was pleasantly surprised by how simple this became with Gemini 2.5 Flash Image and can’t wait to see the stories created through this process!
What’s next?
To get started with generative media on Vertex AI, check out the following resources:
- Gemini 2.5 Flash Image notebook
- Veo 3 notebook with prompt guidance
- Vertex AI Studio where you can check out Gemini, Chirp, and Lyria