What the problem seems to be
A user reports that a video fine-tuning job using Gemini-2.5-flash ran successfully once on a large dataset (~2,800 videos / ~7.2 million tokens), but later runs with the same dataset fail with an error:
> “Dataset example 2434 of 2816 contains a URI to a video file with invalid binary data. Please check your URIs.”
The index of the failing example changes with each run — i.e. a different video (by index) triggers the “invalid binary data” error each time the job runs.
Using a much smaller dataset (~450 videos) seems to work consistently.
Others on the forum confirm experiencing the same issue with larger data batches.
Separately, in a public issue tracker, there is documentation that “Gemini 2.5 models fail to work with video bytes” — meaning that attempts to send raw video bytes (rather than URIs) result in an internal server error (500).
That issue was reportedly closed with a note that the bug blocks proper video-byte handling and will be fixed in a future release.
The official docs for video-understanding on Vertex AI list Gemini 2.5 Flash as supporting video tuning/analysis (with certain constraints on video length, number per prompt, MIME types, etc.).
So: on paper, Gemini-2.5-flash + Vertex AI should support video fine-tuning. But in practice — at least as of November 2025 — many users appear to be getting inconsistent failures with large video-datasets.
---
What might be going wrong (hypotheses + likely causes)
Given the behavior, here are plausible technical issues:
1. Intermittent data-validation bug in the fine-tuning backend — The fact that a different URI/index fails each time suggests the code that validates video URIs or binary data might be non-deterministic, or vulnerable to race conditions under load (e.g. when many files are processed). That might make a previously fine dataset fail unpredictably.
2. Resource or size limits being exceeded under load — Large datasets may strain the system (especially around I/O, memory, or streaming video files). Even if the token count is fine, there may be an internal limit on total video bytes, number of URIs, or simultaneous downloads that makes the job unstable.
3. URI format / storage backend issues — Maybe some URIs reference storage locations (cloud buckets, remote servers) that sometimes become temporarily unavailable or return unexpected responses (partial reads, timeouts, corrupted metadata) under high load. On repeated runs such transient issues may show up differently each time.
4. Video-bytes path is broken (for raw bytes) — As the issue tracker notes, sending raw video bytes rather than URIs is problematic with Gemini-2.5.
5. Edge-case in JSONL dataset format or “parts” field structure — Earlier threads report tuning failures when “unsupported modality [function call] in the ‘parts’ field” is used. While the video-tuning case seems different, it suggests the backend has brittle modality parsing — maybe certain metadata or JSON formatting triggers erroneous validation.
---
Recommendations & Work-arounds
If I were you and I needed to get video tuning to work reliably with Gemini-2.5/Vertex AI, here’s what I’d try:
Split the dataset into smaller batches — Since small sets (~450 videos) appear to work, process large video datasets in chunks (e.g. 300–500 at a time). Then perhaps merge results or run incremental tuning. That’s what some forum users already tried.
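The splitting step can be scripted. Here is a minimal sketch that slices a JSONL tuning dataset into batch files of a configurable size; the path names and the 400-line default are illustrative, not anything the platform mandates:

```python
from pathlib import Path

def split_jsonl(src_path, batch_size=400, out_dir="batches"):
    """Split a JSONL tuning dataset into smaller batch files.

    Each output file holds at most `batch_size` lines of the source,
    so every batch stays in the size range that reportedly works.
    """
    lines = Path(src_path).read_text().splitlines()
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(0, len(lines), batch_size):
        out = Path(out_dir) / f"batch_{i // batch_size:03d}.jsonl"
        out.write_text("\n".join(lines[i:i + batch_size]) + "\n")
        paths.append(out)
    return paths
```

Each resulting file can then be submitted as its own tuning job, or used as a checkpoint for incremental tuning.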
Validate all URIs / video file metadata before tuning — Build a script to pre-check that each video URI is accessible, returns expected MIME type (video/mp4 or other supported), and that binary data is intact (e.g. by opening and reading a few bytes). That reduces risk of “invalid binary data” errors.
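A pre-check along those lines might look like the sketch below. It assumes the usual `contents` → `parts` → `fileData` layout of Gemini-style JSONL records; the `ALLOWED_MIME` set and field names are assumptions you should adjust to your own dataset schema. The MP4 signature check is a cheap offline stand-in for "read a few bytes and see if they look sane":

```python
import json

# Hypothetical allow-list; substitute the MIME types your tuning job actually uses.
ALLOWED_MIME = {"video/mp4", "video/webm"}

def looks_like_mp4(first_bytes: bytes) -> bool:
    """MP4/MOV containers carry an 'ftyp' box near the start of the file;
    a missing signature is a cheap hint that the binary data is corrupt."""
    return len(first_bytes) >= 12 and first_bytes[4:8] == b"ftyp"

def validate_record(line: str, index: int):
    """Return a list of problems found in one JSONL dataset line."""
    problems = []
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"example {index}: invalid JSON ({exc})"]
    for content in record.get("contents", []):
        for part in content.get("parts", []):
            file_data = part.get("fileData") or part.get("file_data")
            if not file_data:
                continue
            uri = file_data.get("fileUri") or file_data.get("file_uri", "")
            mime = file_data.get("mimeType") or file_data.get("mime_type", "")
            if not uri.startswith("gs://"):
                problems.append(f"example {index}: non-GCS URI {uri!r}")
            if mime not in ALLOWED_MIME:
                problems.append(f"example {index}: unexpected MIME type {mime!r}")
    return problems
```

Running `validate_record` over every line before submitting the job surfaces malformed records up front; for a deeper check, fetch the first kilobyte of each object and pass it to `looks_like_mp4`.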
Avoid raw bytes: use URI references (prefer storage-backed URIs) — Since the “bytes” route is documented as buggy for 2.5 Flash, rely on URIs (cloud-storage, publicly accessible URLs, or signed URLs) rather than embedding raw video bytes in requests.
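Concretely, that means building each dataset record with a `fileData` part (URI reference) instead of an `inlineData` part (base64-encoded bytes). The sketch below shows the shape, following the public Gemini `contents`/`parts` request layout; the bucket, file name, and prompt text are placeholders:

```python
import json

# URI-referenced video part: the route that avoids the reported
# "video bytes" bug on Gemini 2.5. Bucket/object names are hypothetical.
record = {
    "contents": [{
        "role": "user",
        "parts": [
            {"fileData": {
                "mimeType": "video/mp4",
                "fileUri": "gs://my-bucket/clips/clip_0001.mp4",
            }},
            {"text": "Describe the action in this clip."},
        ],
    }]
}

# One JSONL line per training example.
print(json.dumps(record))
```

The pattern to avoid is the same record with an `inlineData` part carrying base64-encoded video bytes, which is the route the issue tracker flags as broken for 2.5 models.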
Reduce dataset size per tuning job — If the total token count or total video length is very large, it may push the system beyond safe limits. Use lower-resolution encoding (MEDIA_RESOLUTION_LOW) for video during tuning to reduce token and memory load.
Monitor error logs carefully & report reproducible failures to Google — If you can isolate a small reproducible subset that fails consistently, this is likely a bug. Sharing that with developers may help trigger a fix.
---
What this tells us about relying on video-tuning in current generative models
Even when tooling claims support for video tuning (as docs do for Gemini 2.5), real-world stability can be far from guaranteed — especially for large datasets. Until such bugs are resolved, production-scale video fine-tuning remains risky.
Data-scale matters. Systems can handle small datasets but choke when scaled up. Good engineering practice demands validation, batching, and fallback plans.
Developers should treat newer multimodal capabilities (video + audio + image + text) as experimental: best-effort tools rather than rock-solid pipelines.
---
My Take (from a “creationist-apologist / science-method mindset”)
I find it striking that even with state-of-the-art AI frameworks — designed and built by researchers using vast data and resources — we still see fragile behavior under load. This resonates with my perspective that any system of complexity built by intelligence (humans) requires careful architecture, distribution limits, validation, and design constraints. It suggests that complexity and function demand intentional organization — and when pushed beyond design limits (large datasets, many inputs), instability emerges.
It’s a caution against naïve belief that “given enough data and computing power, such systems will automatically work reliably.” They don’t — just like biological complexity: design without oversight or constraints leads to breakdown.