Speech-to-text in web application giving unexpected results

I am trying to implement Google Cloud’s Speech-to-Text API in a web application so that users can speak into a microphone and see what they say transcribed in real time. I am using React.js on the frontend and Express.js on the backend. I use the microphone-stream npm package to capture and stream user audio, and the @httptoolkit/websocket-stream npm package to pipe that audio over a WebSocket. Here is my frontend source code:

import { useRef, useState } from "react";
import MicrophoneStream from "microphone-stream";
import webSocketStream from "@httptoolkit/websocket-stream";

function Test() {
  const [isRecording, setIsRecording] = useState(false);
  const mediaStream = useRef(null);
  const micStream = useRef(null);
  const webSocket = useRef(null);

  const listen = async () => {
    if (isRecording) {
      micStream.current.stop();
      setIsRecording(false);
      return;
    }

    const sampleRate = 16000;

    // get media stream
    mediaStream.current = await navigator.mediaDevices.getUserMedia({
      audio: {
        deviceId: "default",
        sampleRate: sampleRate,
        sampleSize: 16,
        channelCount: 1,
      },
      video: false,
    });
    setIsRecording(true);
    micStream.current = new MicrophoneStream();
    micStream.current.setStream(mediaStream.current);
    micStream.current.on("data", (chunk) => {
      console.log("data received from mic stream");
    });
    micStream.current.on("error", (error) => {
      console.error(error);
    });
    micStream.current.on("close", () => {
      console.log("mic stream closed");
      mediaStream.current.getAudioTracks()[0].stop();
      setIsRecording(false);
    });

    webSocket.current = webSocketStream("ws://localhost:8000/ws/stt", {
      perMessageDeflate: false,
    });
    webSocket.current.on("data", (data) => {
      console.log("Data received:", data);
    });
    webSocket.current.on("error", (error) => {
      console.log(error);
    });
    webSocket.current.on("close", (error) => {
      console.log("web socket stream closed");
    });

    micStream.current.pipe(webSocket.current);

    setTimeout(() => {
      // micStream.current.unpipe(webSocket.current);
      micStream.current.stop();
    }, 3000);
  };

  // minimal UI to start/stop recording (assumed)
  return <button onClick={listen}>{isRecording ? "Stop" : "Start"}</button>;
}

export default Test;

And here is how I handle it on the backend:

import express from "express";
import { SpeechClient } from "@google-cloud/speech";

import websocketStream from "@httptoolkit/websocket-stream";

const router = express.Router();

const sttClient = new SpeechClient();

router.ws("/ws/stt", (ws, req) => {
  console.log("Client connected");

  const recognizeStream = sttClient
    .streamingRecognize({
      config: {
        encoding: "LINEAR16",
        sampleRateHertz: 16000,
        languageCode: "en-GB",
        enableAutomaticPunctuation: true,
      },
      interimResults: true,
    })
    .on("error", (error) => {
      console.log("Error:", error);
    })
    .on("data", (data) => {
      console.log("Received data:", data);
      console.log("transcript:", data.results[0].alternatives[0].transcript);
      ws.send(data.results[0].alternatives[0].transcript);
    });

  const wss = websocketStream(ws, { perMessageDeflate: false });
  wss.pipe(recognizeStream);

  ws.on("close", () => {
    console.log("Client disconnected");
    wss.end();
  });

  ws.on("message", async (message) => {
    console.log("Received message:", message);
  });
});
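
(For reference, the .ws route handler comes from express-ws. I have not shown the app entry point above; a simplified sketch of the wiring, with placeholder file names, looks like this.)

import express from "express";
import expressWs from "express-ws";

const app = express();
expressWs(app); // adds .ws() support

// the router shown above; imported after expressWs(app) so that router.ws exists
const { default: sttRouter } = await import("./routes/stt.js");
app.use(sttRouter);

app.listen(8000, () => console.log("Server listening on port 8000"));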

The data appears to be sent through correctly, but I am getting very unexpected results: I keep getting a transcript that reads “play” or “play radio”, even when I say nothing. Here is an example response:

Received data: {
  results: [
    {
      alternatives: [Array],
      isFinal: true,
      stability: 0,
      resultEndTime: [Object],
      channelTag: 0,
      languageCode: 'en-gb'
    }
  ],
  error: null,
  speechEventType: 'SPEECH_EVENT_UNSPECIFIED',
  totalBilledTime: { seconds: '18', nanos: 0 },
  speechAdaptationInfo: null,
  requestId: '4181042299479530578'
}
transcript: Play radio.

Am I approaching this correctly? Any help or advice would be greatly appreciated.

It looks right to my eyes after a few minutes’ examination. If I were in your shoes, I’d debug by breaking the pieces apart. On your backend, you are receiving an audio stream and piping it directly to Speech-to-Text. What about also writing it to a file and, when the transmission is over, downloading the file and validating that it contains what you expect? Next, write a test that reads the input data from that file and tries to convert it. Now you have separated the two parts of your story … maybe something interesting will show up. If nothing else, you have halved the problem. Assuming you end up with a good audio file, you can then package up the puzzle (sample audio + code) so that others can try to recreate it. Sadly, these forums aren’t great for “Here is my code … what is wrong?” posts; it helps if we can narrow things down to the simplest failing case, i.e. “I am using XXX Google Cloud API and expected AAA but got BBB, what might be wrong?”.
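
To make that concrete, here is roughly what I have in mind (untested, and /tmp/stt-capture.raw is just a placeholder path). On the server, tee the incoming WebSocket stream into a file in addition to the recognizer; the import goes at the top of the file, and the rest sits inside your existing "/ws/stt" handler next to wss.pipe(recognizeStream):

import fs from "fs";

// write the raw incoming bytes to disk so the capture can be inspected later
const dump = fs.createWriteStream("/tmp/stt-capture.raw");
wss.pipe(dump);            // same bytes the recognizer sees
wss.pipe(recognizeStream); // unchanged

Then a standalone script can replay that file through the same recognizer configuration, with no browser or WebSocket in the loop:

import fs from "fs";
import { SpeechClient } from "@google-cloud/speech";

const client = new SpeechClient();

const recognizeStream = client
  .streamingRecognize({
    config: {
      encoding: "LINEAR16",
      sampleRateHertz: 16000,
      languageCode: "en-GB",
      enableAutomaticPunctuation: true,
    },
    interimResults: true,
  })
  .on("error", console.error)
  .on("data", (data) =>
    console.log("transcript:", data.results[0]?.alternatives[0]?.transcript)
  );

// stream the captured audio exactly as the server received it
fs.createReadStream("/tmp/stt-capture.raw").pipe(recognizeStream);

If the saved file does not contain intelligible audio, the problem is on the capture/encoding side; if it does, but the replay still produces odd transcripts, the problem is in the Speech-to-Text request.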

That said … one of the best-written reports I have seen in a long time. Thank you for that!!!
