I am writing to inquire about the stability of the Google Cloud Speech-to-Text (STT) Chirp 3 model, specifically for the Cantonese language code (yue-Hant-HK).
We have noticed that the model occasionally produces “hallucinations” as well as transcription errors of varying severity. For example, when the input audio contains “汉堡” (hamburger), the transcription sometimes comes back as “港式奶茶包” (Hong Kong-style milk tea bun).
Currently, yue-Hant-HK is in Preview. For comparison, we have also integrated the Mandarin and English models, which are in General Availability (GA). Those models perform exceptionally well and do not exhibit these hallucination issues; they are on a completely different level from the current Cantonese model.
Therefore, I would like to ask:
- Are these hallucination and stability issues expected because the Cantonese model is still in Preview? Can we expect the frequency of these issues to significantly decrease once the model reaches General Availability (GA)?
- We have also observed that the model’s support for colloquial spoken Cantonese is currently quite unstable. Is this instability also related to its Preview status?
Looking forward to your insights. Thank you!