The Temporal Precision Gap

Hello everyone,

I make heavy use of the Google Cloud Text-to-Speech (TTS) API, specifically the 2.5 Pro model, for use cases that require strict timing precision (e.g., the generated audio must last exactly 20, 25, or 30 seconds, with a maximum tolerance of about ±0.5 s).

The core problem is this: the TTS API currently offers no way to specify a target audio duration. The final duration is determined entirely by the input text, the selected voice, and the speaking rate.

Why Workarounds Fail

We rely on LLMs (via Chat Completion, for instance) to provide the source text. However, even with the strictest prompts:

  1. The LLM never respects the exact character count needed for a target duration. It is impossible to guarantee text that produces the correct audio length on the first attempt.

  2. Iteratively adjusting speakingRate via the API is not only costly (it doubles the number of calls) but also yields inconsistent results from one generation to the next.

  3. SSML offers no supported attribute for constraining the overall duration of the output.

Consequently, it is impossible to reliably guarantee the duration of the generated TTS audio, which blocks the automation of our production workflows.
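To illustrate why point 2 above is both costly and imprecise, here is a minimal sketch of the iterative speakingRate workaround. The helper names are my own, not part of any Google client library; it assumes LINEAR16 (WAV) output so the duration can be read from the container header, and it assumes the documented speakingRate range of [0.25, 4.0]:

```python
import io
import wave

# Assumed speakingRate limits per the AudioConfig documentation: [0.25, 4.0].
MIN_RATE, MAX_RATE = 0.25, 4.0

def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Measure the duration of LINEAR16 (WAV) audio returned by the API."""
    with wave.open(io.BytesIO(wav_bytes)) as wav:
        return wav.getnframes() / wav.getframerate()

def adjusted_rate(current_rate: float, measured_s: float, target_s: float) -> float:
    """Proportionally rescale speakingRate so a second synthesis lands nearer the target."""
    new_rate = current_rate * (measured_s / target_s)
    return min(MAX_RATE, max(MIN_RATE, new_rate))
```

In practice you synthesize once at rate 1.0, measure, compute `adjusted_rate`, and synthesize again. That is exactly the double-call cost described above, and even the second pass only lands approximately on target, because perceived pacing is not perfectly linear in speakingRate and varies between generations.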

Value Proposition: Integration and Automation

The addition of a Target Duration feature is a critical necessity for modern integrations:

  • Solves the LLM Problem: TTS would handle the temporal constraint, freeing the LLM to focus on text quality and content, which is its strength.

  • Enables Automation: This is the missing link for automating complex workflows that couple Generative AI (text) and speech synthesis (audio) with rigid time constraints (dynamic audio advertising, synchronized narratives, etc.).

  • API Minimalism: It would only require adding a simple parameter to the audioConfig object:

```json
"audioConfig": {
  "targetDurationSeconds": 25.0,
  "audioEncoding": "MP3"
}
```

(In the synthesize request, voice is a sibling of audioConfig, so the voice selection would be unchanged by this proposal.)

We request that the TTS engine intelligently adjust the pace (similar to a dynamic speakingRate) and internal pauses to synthesize audio that closely matches the target duration, all while maintaining voice quality.


Call to the Community

I am certain that other developers working on applications with strict time constraints share this need.

Are you interested in this feature? Are you facing the same difficulties in guaranteeing precise TTS timing? Please share your use cases to strengthen this request!

Thank you for your consideration.

The Text-to-Speech API doesn’t currently support a target duration parameter, and that is by design: the synthesis engine optimizes for natural prosody and phoneme pacing rather than fixed-length output. As the Text-to-Speech documentation notes, the only controllable temporal variable is speakingRate, which scales timing roughly linearly but does not guarantee reproducible length, due to model-level variation between sessions and voices. Exact second-level precision therefore cannot be enforced through SSML or API settings.

For now, the most reliable approach is to post-process the audio externally: measure the output length, then resample or time-stretch the waveform with standard audio tools before downstream use. If your application requires sub-second synchronization inside automated pipelines, a custom wrapper that iteratively adjusts speakingRate based on measured duration is the only viable workaround. Since this is a feature request rather than a configuration issue, please file it through the Google Cloud Issue Tracker under the Text-to-Speech component so product engineering can review the proposal for a targetDurationSeconds parameter in future models.
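To make the time-stretching suggestion concrete, one option is a uniform stretch via ffmpeg's atempo audio filter, which changes tempo without shifting pitch. This is a sketch under assumptions: ffmpeg must be installed, the function name is my own, and a single atempo stage accepts factors in roughly [0.5, 2.0], so larger corrections would need chained filters:

```python
import subprocess

def stretch_command(src: str, dst: str, measured_s: float, target_s: float) -> list[str]:
    """Build an ffmpeg command that time-stretches `src` to play in ~`target_s` seconds.

    atempo > 1.0 speeds playback up (shortens audio); < 1.0 slows it down.
    """
    tempo = measured_s / target_s  # e.g. 27.5 s measured / 25 s target -> 1.1x faster
    if not 0.5 <= tempo <= 2.0:
        raise ValueError("correction outside a single atempo stage; chain filters instead")
    return ["ffmpeg", "-y", "-i", src, "-filter:a", f"atempo={tempo}", dst]

# To actually run it:
# subprocess.run(stretch_command("out.wav", "out_25s.wav", 27.5, 25.0), check=True)
```

The trade-off is that a uniform stretch alters pacing everywhere, including pauses, so large corrections sound less natural than regenerating with an adjusted speakingRate.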

—Taz