Hello team,
I am testing the Gemma-3n-e4b-it model for automatic speech recognition (ASR) tasks, where I provide an audio file as input and expect the spoken text transcription as output.
While the model performs well in general, I am consistently observing unexpected language-mixing behavior for some languages, such as Punjabi.
Issue details:
- I provide audio samples that contain only Punjabi speech.
- My prompt explicitly specifies Punjabi as the target transcription language, for example:
  "Transcribe this audio to Punjabi. Output only the transcription."
  "Transcribe ONLY the spoken Punjabi words exactly as heard. Stop immediately when the audio ends."
- However, for some segments, the model outputs text partially or entirely in Hindi (Devanagari script) instead of Punjabi (Gurmukhi script).
Examples:
🎧 File: Punjabi_audio_chunks/chunk_0005.wav
💬 Output: [Punjabi text in Gurmukhi script] ✅ (Correct, Punjabi - Gurmukhi)
🎧 File: Punjabi_audio_chunks/chunk_0007.wav
💬 Output: [Hindi text in Devanagari script] ❌ (Incorrect, Hindi - Devanagari)
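To quantify how often this happens, I flag outputs mechanically by checking which Unicode blocks the characters fall in (Gurmukhi is U+0A00-U+0A7F, Devanagari is U+0900-U+097F; these are standard Unicode block ranges, and the helper functions below are my own, not part of any Gemma API):

```python
# Standard Unicode block ranges for the two scripts.
GURMUKHI = range(0x0A00, 0x0A80)    # Punjabi script
DEVANAGARI = range(0x0900, 0x0980)  # Hindi script

def script_mix(text):
    """Return (gurmukhi_count, devanagari_count) for the codepoints in text."""
    g = sum(1 for ch in text if ord(ch) in GURMUKHI)
    d = sum(1 for ch in text if ord(ch) in DEVANAGARI)
    return g, d

def is_pure_gurmukhi(text):
    """True if a transcription contains Gurmukhi and no Devanagari at all."""
    g, d = script_mix(text)
    return g > 0 and d == 0
```

Running every chunk's transcription through `is_pure_gurmukhi` makes it easy to count the failure rate across the whole audio set instead of spot-checking by eye.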
Is there a way to force or lock the output language/script in Gemma-3n-e4b-it (for example, through language tokens or prompt parameters)? Please review this issue and help me resolve it.
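In the meantime, one workaround I am experimenting with is suppressing Devanagari-bearing tokens at decode time via the standard `bad_words_ids` argument of `model.generate` in Hugging Face transformers. This is a sketch under my own assumptions, not a confirmed Gemma feature; the toy vocabulary below is purely illustrative, and the real token list would come from `tokenizer.get_vocab()`:

```python
DEVANAGARI = range(0x0900, 0x0980)  # standard Unicode block for Devanagari

def devanagari_token_ids(vocab):
    """Collect the ids of all tokens containing any Devanagari codepoint.

    `vocab` maps token string -> id, in the shape returned by
    tokenizer.get_vocab() in Hugging Face transformers.
    """
    return sorted(
        tid for tok, tid in vocab.items()
        if any(ord(ch) in DEVANAGARI for ch in tok)
    )

# Toy vocabulary for illustration only; real ids come from the tokenizer.
toy_vocab = {"ਪੰਜਾਬੀ": 10, "अच्छा": 11, "hello": 12, "दो": 13}
bad_ids = devanagari_token_ids(toy_vocab)

# The resulting list would then be passed at generation time, e.g.:
# model.generate(..., bad_words_ids=[[i] for i in bad_ids])
```

This only blocks tokens whose surface form contains Devanagari characters, so it would not help if the tokenizer builds Devanagari text from byte-level pieces; a built-in language lock, if one exists, would be much cleaner.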