The Temporal Precision / Target duration

Hello everyone,

I use the Google Cloud Text-to-Speech (TTS) API heavily, specifically the 2.5 Pro model, for use cases that require very strict timing precision (e.g., generated audio must last exactly 20, 25, or 30 seconds).

The main finding is this: the TTS API currently offers no way to define a specific target audio duration (Target Duration). The final duration is entirely determined by the input text, the selected voice, and the speaking rate.

Why Workarounds Fail

We rely on LLMs (via Chat Completion, for instance) to provide the source text. However, even with the strictest prompts:

  1. The LLM does not reliably hit the exact character count required for the target duration, so there is no way to guarantee text that produces the correct audio duration on the first attempt.

  2. Iterative speakingRate adjustment via the API is costly (at least two calls per clip) and yields inconsistent results from one generation to the next.

  3. SSML offers no supported duration attribute for setting the overall length of the audio.

Consequently, it is impossible to reliably guarantee the duration of the generated TTS audio, which blocks the automation of our production workflows.
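
For reference, the iterative speakingRate adjustment from point 2 can be sketched as below. This is a minimal illustration, not an official recipe: the function name and the rate bounds are assumptions (check the documented speakingRate range for your voice), and it assumes duration scales inversely with the rate, which is exactly the assumption that breaks down when output varies between generations.

```python
def corrected_speaking_rate(measured_s, target_s, current_rate=1.0,
                            min_rate=0.25, max_rate=2.0):
    """Estimate the speakingRate to use on the next synthesis call.

    Assumes audio duration is inversely proportional to speakingRate.
    min_rate and max_rate are assumed bounds, not values taken from the
    thread; verify them against the API documentation.
    """
    if measured_s <= 0 or target_s <= 0:
        raise ValueError("durations must be positive")
    # Audio that came out too long needs a faster rate, and vice versa.
    ideal = current_rate * measured_s / target_s
    # Clamp so the request stays inside the assumed valid range.
    return max(min_rate, min(max_rate, ideal))


# A first pass at rate 1.0 produced 27.5 s for a 25 s slot,
# so the second call would be made roughly 10% faster.
next_rate = corrected_speaking_rate(27.5, 25.0)
```

Even with this correction, the second call is a fresh generation, so the measured duration can still miss the target.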

Value Proposition: Integration and Automation

The addition of a Target Duration feature is a critical necessity for modern integrations:

  • Solves the LLM Problem: TTS would handle the temporal constraint, freeing the LLM to focus on text quality and content, which is its strength.

  • Enables Automation: This is the missing link for automating complex workflows that couple Generative AI (text) and speech synthesis (audio) with rigid time constraints (dynamic audio advertising, synchronized narratives, etc.).

  • API Minimalism: It would only require adding a simple parameter to the audioConfig object:

JSON

"voice": { ... },
"audioConfig": {
  "targetDurationSeconds": 25.0, // Proposed parameter: the desired target duration
  // ...
}

We request that the TTS engine intelligently adjust the pace (similar to a dynamic speakingRate) and internal pauses to synthesize an audio file that closely matches this target duration while maintaining voice quality.


Call to the Community

I am certain that other developers working on applications with strict time constraints share this need.

Are you interested in this feature? Are you facing the same difficulties in guaranteeing precise TTS timing? Please share your use cases to strengthen this request!

Thank you for your consideration.

You’re right that the current Text-to-Speech API doesn’t provide a targetDurationSeconds or equivalent parameter. The synthesis duration is governed entirely by the phonetic content, prosody model, and the speakingRate factor defined in audioConfig. As confirmed in the Text-to-Speech documentation, the API only exposes rate, pitch, and volume controls; the SSML duration attribute isn’t implemented for global timing control. Because the model’s prosody generation is non-deterministic across calls, iterative rate tuning produces small timing variation even with identical inputs.

The only reliable workaround today is to measure the output duration after synthesis and apply a time-stretching step externally with an audio processing library to match your required time slot (plain resampling would also shift the pitch). This keeps latency predictable and avoids repeated TTS calls. If you need native support for a target-duration synthesis mode, submitting a feature request through your Cloud Console under Text-to-Speech > Feedback ensures it's logged for product consideration. For production workflows with strict ad-length constraints, that's currently the most stable implementation path until the API adds duration control natively.
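
A minimal sketch of that measure-then-stretch step, under stated assumptions: the audio was requested as LINEAR16 so the returned bytes carry a WAV header with a correct frame count, and ffmpeg with its atempo filter is the external tool. The function names and file paths are illustrative, and the accepted atempo range depends on the ffmpeg build.

```python
import io
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of WAV audio held in memory, in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def tempo_factor(measured_s: float, target_s: float) -> float:
    """Tempo multiplier that maps the measured duration onto the target.

    A value above 1.0 speeds the audio up, below 1.0 slows it down.
    """
    if measured_s <= 0 or target_s <= 0:
        raise ValueError("durations must be positive")
    return measured_s / target_s


def ffmpeg_stretch_command(src: str, dst: str, factor: float) -> list:
    """Build (but do not run) an ffmpeg command that changes tempo
    while preserving pitch, using the atempo audio filter."""
    return ["ffmpeg", "-y", "-i", src,
            "-filter:a", f"atempo={factor:.4f}", dst]


# Example: a 27.5 s clip that must fit a 25 s slot.
cmd = ffmpeg_stretch_command("tts.wav", "tts_25s.wav",
                             tempo_factor(27.5, 25.0))
```

The resulting list can be passed to subprocess.run. Large factors are where the stretched voice starts to sound processed, so it is worth checking the factor before applying it.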

—Taz

Hi Taz

Thanks for your feedback.

Regarding external processing, yes, I coded a stretch API to transform the audio to the desired duration, with a tolerance threshold to avoid stretching the voice too much. The problem is that it sounds vocoded, and from one generation to the next, the audio output struggles to fall within the tolerance zone for processing. The only solution, in my opinion, is “target duration”…
I’ve already contacted Google; they’re open to creating this feature but are waiting for more demand.
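
The tolerance logic described above can be sketched roughly like this. It is not the actual stretch API mentioned in this post; the function name and both thresholds are hypothetical values chosen for illustration.

```python
def plan_adjustment(measured_s, target_s,
                    accept_tol_s=0.05, max_stretch=0.05):
    """Decide what to do with a freshly synthesized clip.

    accept_tol_s: how far from the target (in seconds) is still fine as-is.
    max_stretch:  largest relative tempo change tolerated before the voice
                  is expected to sound processed. Both are assumed values.
    Returns (action, tempo_factor).
    """
    if measured_s <= 0 or target_s <= 0:
        raise ValueError("durations must be positive")
    factor = measured_s / target_s
    if abs(measured_s - target_s) <= accept_tol_s:
        return ("accept", 1.0)
    if abs(factor - 1.0) <= max_stretch:
        return ("stretch", factor)
    # Too far off: stretching would degrade the voice, so synthesize again.
    return ("regenerate", factor)


# 26 s for a 25 s slot is a 4% change, inside the assumed 5% limit.
action, factor = plan_adjustment(26.0, 25.0)
```

Every "regenerate" result costs another TTS call, which is why a native target duration would remove the loop entirely.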

PS: murf.ai has target duration, but the voices are terrible…

Nab