I have a requirement in an Android app to extract the audio from a video and upload it to Google Cloud Storage so it can be sent to the Speech-to-Text V2 API. I was able to do this by converting to MP3 with an FFmpeg library, but because of Google’s 16 KB page size requirement from Nov 2025, I can’t use it anymore. So I tried converting with the Android MediaTransformer API, which supports M4A. The converted file plays fine, but when I send it to the Speech-to-Text API I get this error:
“Audio data does not appear to be in a supported encoding. If you believe this to be incorrect, try explicitly specifying the decoding parameters.”
I have also tried explicitDecodingConfig, sending it like this:
{
"encoding": "M4A_AAC",
"audioChannelCount": 2,
"sampleRateHertz": 22050
}
I still get “Failed to transcode audio. Please ensure the audio file is valid and has the correct encoding”.
Hi locatoca,
The error message “Audio data does not appear to be in a supported encoding” is literal. The encoding value “M4A_AAC” you provided is not valid for the API. Google Speech-to-Text does not natively support AAC, the audio codec typically found in M4A files, as a direct input encoding. If you refer to the official documentation, you will see that formats like AAC, M4A, or “M4A_AAC” are not listed among the supported audio encodings. This is the primary cause of the error you’re encountering.
Regarding your “16KB page size” concern, this page details the actual limits for the Speech-to-Text API, which are typically based on audio duration (e.g., 60 seconds per synchronous request, or total audio processed per month for asynchronous) rather than a small file size. There is no mention of a 16KB file size limit for audio uploads.
Here is a suggestion that might help resolve the issue. Convert your audio into a format that Google Speech-to-Text explicitly supports and recommends, such as FLAC (lossless and highly recommended) or LINEAR16 (raw PCM, common in WAV files). Your existing FFmpeg setup is ideal for this: use FFmpeg to transcode the audio from your video into one of these formats (e.g., -c:a flac for FLAC or -c:a pcm_s16le for LINEAR16), and specify a suitable sample rate (such as 16000 Hz) and channel count (mono is often best for speech). Finally, update your Speech-to-Text API request so that the encoding (e.g., “FLAC”) and the audio parameters (sampleRateHertz, audioChannelCount) match the new file.
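To make the steps above concrete, here is a minimal Java sketch of the two pieces: the FFmpeg arguments for 16 kHz mono FLAC, and a decoding config that matches them. The class name, method names, and file paths are made up for illustration. The JSON field names are the ones used in the question, and the "FLAC" value follows the suggestion above, so please check it against the API reference for your version before relying on it.

```java
import java.util.List;

// Hypothetical helper, not part of any Google or FFmpeg library.
final class SpeechAudioSketch {

    // Builds an FFmpeg argument list that drops the video stream and
    // writes 16 kHz mono FLAC. Both paths are placeholders.
    static List<String> ffmpegFlacArgs(String inputVideo, String outputFlac) {
        return List.of(
            "ffmpeg",
            "-i", inputVideo,
            "-vn",           // skip the video stream
            "-ac", "1",      // one audio channel (mono)
            "-ar", "16000",  // 16000 Hz sample rate
            "-c:a", "flac",  // encode the audio as FLAC
            outputFlac);
    }

    // Builds the decoding config as a JSON string. The values must describe
    // the file that was actually produced, otherwise decoding can fail.
    static String explicitDecodingConfig(String encoding, int sampleRateHertz, int channels) {
        return "{\"encoding\": \"" + encoding + "\""
            + ", \"sampleRateHertz\": " + sampleRateHertz
            + ", \"audioChannelCount\": " + channels + "}";
    }

    public static void main(String[] args) {
        // Print the command and the matching config side by side.
        System.out.println(String.join(" ", ffmpegFlacArgs("input.mp4", "audio.flac")));
        System.out.println(explicitDecodingConfig("FLAC", 16000, 1));
    }
}
```

For LINEAR16 you would swap "flac" for "pcm_s16le", write to a .wav file, and pass "LINEAR16" as the encoding, keeping the same sample rate and channel count in both places.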
You may also want to check the Best Practices for Audio Input.