Hello everyone,
I make heavy use of the Google Cloud Text-to-Speech (TTS) API, specifically the 2.5 Pro model, for use cases that require very strict timing precision (e.g., the generated audio must last exactly 20, 25, or 30 seconds).
The main finding is this: the TTS API currently offers no way to define a specific target audio duration (Target Duration). The final duration is entirely determined by the input text, the selected voice, and the speaking rate.
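For context, measuring the duration after synthesis is the easy part; controlling it beforehand is what is missing. A minimal sketch of the measurement, assuming the audio is requested as LINEAR16 so that the returned bytes carry a WAV header (worth verifying for your model). The names wav_duration_seconds and make_silent_wav are illustrative helpers, not part of any Google library:

```python
import io
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the playback duration of a WAV payload in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def make_silent_wav(seconds: float, sample_rate: int = 24000) -> bytes:
    """Build a mono 16-bit silent WAV, used here only as stand-in audio."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


if __name__ == "__main__":
    # Stand-in for the bytes a synthesis call would return.
    print(wav_duration_seconds(make_silent_wav(2.5)))
```

In a real pipeline the stand-in would be replaced by the audio content of the synthesis response.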
Why Workarounds Fail
We rely on LLMs (via Chat Completion, for instance) to provide the source text. However, even with the strictest prompts:
- The LLM never respects the exact character length required for the target duration. It is impossible to guarantee a text that will produce the correct audio duration on the first attempt.
- Iterative speakingRate adjustment via the API is not only costly (at least two calls per clip) but also yields inconsistent results from one generation to the next.
- The SSML duration attribute cannot be used to set the overall duration.
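To make the cost of the second workaround concrete, here is a rough sketch of the retry loop it implies. It assumes duration scales roughly as 1 / speakingRate, which is only an approximation. The function name fit_rate_to_target, the measure callback, and the default rate bounds are my own placeholders; check the documented speakingRate range for your model. Each call to measure stands for one billed synthesis request:

```python
from typing import Callable, Tuple


def fit_rate_to_target(
    measure: Callable[[float], float],
    target_seconds: float,
    tolerance_seconds: float = 0.25,
    max_calls: int = 4,
    min_rate: float = 0.25,  # assumed lower bound, not taken from the docs
    max_rate: float = 2.0,   # assumed upper bound, not taken from the docs
) -> Tuple[float, float, int]:
    """Search for a speaking rate whose audio lands near target_seconds.

    measure(rate) must synthesize the text at that rate and return the
    resulting duration in seconds. Returns (rate, duration, calls_used).
    """
    rate = 1.0
    duration = measure(rate)
    calls = 1
    while abs(duration - target_seconds) > tolerance_seconds and calls < max_calls:
        # Assume duration scales as 1 / rate, then clamp to the allowed range.
        new_rate = min(max_rate, max(min_rate, rate * duration / target_seconds))
        if new_rate == rate:
            break  # Stuck at a bound: the target is unreachable for this text.
        rate = new_rate
        duration = measure(rate)
        calls += 1
    return rate, duration, calls


if __name__ == "__main__":
    # Idealized stand-in synthesizer: 30 s of audio at rate 1.0.
    print(fit_rate_to_target(lambda r: 30.0 / r, target_seconds=25.0))
```

Even with this idealized stand-in the loop needs two calls. With a real engine whose output varies between generations, the second call may still miss the tolerance, and a text that is far too long hits the rate bound without ever reaching the target.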
Consequently, it is impossible to reliably guarantee the duration of the generated TTS audio, which blocks the automation of our production workflows.
Value Proposition: Integration and Automation
The addition of a Target Duration feature is a critical necessity for modern integrations:
- Solves the LLM Problem: TTS would handle the temporal constraint, freeing the LLM to focus on text quality and content, which is its strength.
- Enables Automation: This is the missing link for automating complex workflows that couple Generative AI (text) and speech synthesis (audio) with rigid time constraints (dynamic audio advertising, synchronized narratives, etc.).
- API Minimalism: It would only require adding a simple parameter to the audioConfig object:
{
  "input": { ... },
  "voice": { ... },
  "audioConfig": {
    "targetDurationSeconds": 25.0, // Proposed parameter: the desired target duration
    // ...
  }
}
We request that the TTS engine intelligently adjust the pace (similar to a dynamic speakingRate) and the internal pauses to synthesize audio that closely matches this target duration, all while maintaining voice quality.
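To illustrate the kind of adjustment we have in mind, here is a purely hypothetical sketch. It says nothing about how the actual engine works internally. It assumes the engine knows the duration of each spoken segment at rate 1.0 and the natural length of each pause. The function name plan_timing and the rate bounds (a range within which the voice is assumed to still sound natural) are invented for this example:

```python
from typing import List, Tuple


def plan_timing(
    speech_seconds: List[float],
    pause_seconds: List[float],
    target_seconds: float,
    min_rate: float = 0.9,   # assumed slowest pace that still sounds natural
    max_rate: float = 1.15,  # assumed fastest pace that still sounds natural
) -> Tuple[float, List[float]]:
    """Return (rate, adjusted_pauses) aiming at target_seconds in total.

    speech_seconds: durations of the spoken segments at rate 1.0.
    pause_seconds: natural durations of the pauses between them.
    """
    speech_total = sum(speech_seconds)
    pause_total = sum(pause_seconds)
    # Time left for speech if the pauses keep their natural length.
    budget = target_seconds - pause_total
    ideal_rate = speech_total / budget if budget > 0 else max_rate
    rate = min(max_rate, max(min_rate, ideal_rate))
    # Whatever the clamped pace cannot absorb is spread across the pauses.
    remaining = target_seconds - speech_total / rate
    if pause_total > 0:
        scale = max(0.0, remaining / pause_total)
        pauses = [p * scale for p in pause_seconds]
    else:
        pauses = list(pause_seconds)
    return rate, pauses


if __name__ == "__main__":
    rate, pauses = plan_timing([12.0, 8.0], [1.0, 1.0], target_seconds=25.0)
    print(rate, pauses, 20.0 / rate + sum(pauses))
```

When the text is too long even at the fastest allowed pace, the pauses shrink to zero and the target is still missed. In that case the API would have to report the achieved duration or return an error rather than degrade the voice.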
Call to the Community
I am certain that other developers working on applications with strict time constraints share this need.
Are you interested in this feature? Are you facing the same difficulties in guaranteeing precise TTS timing? Please share your use cases to strengthen this request!
Thank you for your consideration.