Vertex AI Batch Image Generation Failure (24h timeout / System error)

Hi,

I’m encountering a persistent issue with Vertex AI batch image generation.

The Problem

I am trying to generate images using a batch job, but:

  • The job runs for a long time and then fails with:

    Deadline exceeded due to job running for maximum allowed duration of 24 hours. Please retry the unprocessed rows or start over with a smaller batch size.

  • Even worse, the job processes only ~5–40 images out of 300 before failing

  • This has been happening repeatedly for about a week

I also tried reducing the batch size:

  • With 10 images, the job failed after about 4.5 hours with the following error:

    System error. Please try this operation again. If the issue persists please visit https://cloud.google.com/support-hub to view your support options.
    

Key Concern

The job is making almost no progress before failing, which suggests this is not simply a batch size limitation.

Environment

  • Models: gemini-3-pro-image-preview and gemini-3.1-flash-image-preview

  • Region: global

  • Execution method: Batch prediction

  • Authentication: Service account (API-based usage)

  • Input batch size:

    • 300 images → fails (only ~5 processed)

    • 10 images → fails early (~4.5h, system error)

What I’ve Tried

  • Reduced batch size (300 → 10)

  • Retried multiple times over several days

  • Verified API usage via service account
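
For reference, my submission path looks roughly like the sketch below (google-genai SDK; the project, bucket, and path names are placeholders):

    from google import genai
    from google.genai import types

    # Service-account credentials are picked up via GOOGLE_APPLICATION_CREDENTIALS.
    client = genai.Client(vertexai=True, project="my-project", location="global")

    # input.jsonl: one {"request": {...}} object per line, one line per image.
    job = client.batches.create(
        model="gemini-3-pro-image-preview",
        src="gs://my-bucket/batch/input.jsonl",
        config=types.CreateBatchJobConfig(dest="gs://my-bucket/batch/output/"),
    )
    print(job.name, job.state)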

Questions

  1. Is this expected behavior for batch image generation, or could this indicate a backend/service issue?

  2. Are there any hidden limits or constraints (throughput, concurrency, etc.) for batch image generation?

  3. Could this be related to region or resource allocation?

  4. Is there any way to identify which rows are failing internally?


1. No: this is NOT expected.

With Vertex AI batch prediction on image models like Gemini 3 Pro Image Preview or Gemini 3.1 Flash Image Preview:

  • 300 images should typically complete in minutes to ~1 hour

  • 10 images should complete in a few minutes

What you are seeing instead:

  • jobs hanging for hours

  • only 5–40 rows processed

  • the hard 24-hour timeout

Taken together, this strongly suggests a backend execution issue or request-level failures causing stalls, not a normal quota/batch-size limitation.
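
A quick way to confirm the stall (rather than waiting out the 24-hour deadline) is to poll the job state. A minimal sketch, assuming the google-genai SDK against Vertex AI; the project ID and job resource name are placeholders:

    import time

    from google import genai

    client = genai.Client(vertexai=True, project="my-project", location="global")

    # Placeholder: the full resource name returned when the batch job was created.
    job_name = "projects/my-project/locations/global/batchPredictionJobs/1234567890"

    TERMINAL = {"JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}
    while True:
        job = client.batches.get(name=job_name)
        # A job that sits in JOB_STATE_RUNNING for hours with no output points
        # to a stall rather than normal throughput limits.
        print(job.state)
        if job.state.name in TERMINAL:
            break
        time.sleep(60)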

2. Yes, there are hidden limits, but none of them fully explain your behavior.

Known constraints:

  • Concurrency/throughput limits: per-project quotas (requests/sec, tokens/sec) and internal model-serving capacity.

When you hit those, though, requests normally get throttled with a 429 rather than silently stalling for hours.

Your case ("processes very few rows, then stalls for hours") usually means:

  • certain rows are consistently failing,

  • the system keeps retrying them, and

  • the entire job gets blocked (one way to surface the failing rows is sketched below).
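
That would also answer your question 4: you can inspect the batch output itself. With a GCS destination, the job writes a predictions JSONL where, as far as I can tell, each line carries the original request plus either a response or an error status. A rough sketch; the exact field names here are an assumption, so verify them against your actual output files:

    import json

    # Assumes the output was downloaded locally first, e.g.:
    #   gsutil cp "gs://my-bucket/batch/output/*/predictions.jsonl" .
    failed = []
    with open("predictions.jsonl") as f:
        for line_no, line in enumerate(f, start=1):
            row = json.loads(line)
            # Assumption: failed rows carry a "status" error instead of a "response".
            if row.get("status") or "response" not in row:
                print(f"row {line_no} failed: {row.get('status')}")
                failed.append(row["request"])

    # Re-emit only the failed rows so they can be retried as a smaller batch.
    with open("retry_input.jsonl", "w") as f:
        for request in failed:
            f.write(json.dumps({"request": request}) + "\n")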

3. Yes, this is VERY relevant.

You are using region: global. For preview models like the Gemini image models, "global" routes requests dynamically, but:

  • capacity is not guaranteed

  • preview models often have unstable routing and limited backend allocation