With the boom in artificial intelligence (AI), businesses are looking for a quick way to validate their machine learning models, APIs, or data science workflows. Chatbots, such as Google's Bard, are a popular application of Large Language Models (LLMs). As LLMs gain traction, businesses need lightweight chatbots to build conversational interfaces around them.
To make conversations feel even more natural, businesses are now beginning to use voice-based chatbots, or voice bots. Voice bots have been on the rise for several years because of the convenience they bring: it's much easier to speak than to type, so a voice-activated chatbot offers a frictionless experience for the end customer.
In this article, we'll learn how to quickly launch a bot application that understands not only keyboard input but also voice messages.
Considerations
- The bot's interface is built using the Gradio framework.
- Automatic speech recognition (ASR), the conversion of speech to text, is handled by Google's Speech-to-Text.
- For this article, the bot is built to converse in US English. However, the language code can be changed to match your locale.
- The steps presented in this article are for a Linux platform.
Prerequisites
Before you can send a request to the Speech-to-Text API, you must have completed the following actions:
1. Enable Speech-to-Text on a Google Cloud project.
- Make sure billing is enabled for Speech-to-Text.
- Create and/or assign one or more service accounts to Speech-to-Text.
- Download a service account credential key.
2. Set your authentication environment variable.
Note: You can skip creating a service account if you plan to use default login/application credentials.
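If you use a service account key, the Google client libraries find it through the GOOGLE_APPLICATION_CREDENTIALS environment variable. A sketch, assuming a placeholder key location (substitute the path where you actually saved your key):

```shell
# Point the Google client libraries at the downloaded service account key.
# The path below is a placeholder; use the location of your own key file.
export GOOGLE_APPLICATION_CREDENTIALS="/home/user/keys/service-account.json"
```

Add the export to your shell profile if you want it to persist across sessions.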
Install libraries
sudo apt-get install python3-pip python3-dev
sudo apt-get install ffmpeg
pip install gradio==3.38.0 --use-deprecated=legacy-resolver
pip install --upgrade google-cloud-speech==2.21.0
pip install torch
# If you encounter a disk space issue on your VM, create a temp folder
# (e.g. /home/user/tmp) and point pip's cache at it as shown below.
pip install --cache-dir=/home/user/tmp torch
Code sample
config.py, which stores the bot's static values:
bot = {
    "banner": """<h1 align="left" style="min-width:200px; margin-top:0;"> Chat with Expert Advisor </h1>""",
    "title": "Expert Advisor",
    "initial_message": "Hi, I'm your expert advisor. How may I help you today?",
    "temp_response": "Apologies, I'm not ready yet :(",
    "text_placeholder": "Enter Text"
}
main.py, the entry point for the application:
import time
import gradio as gr
import config as cfg
from google.cloud import speech
def transcribe_file(speech_file: str) -> str:
    """Transcribe the given audio file and return its text."""
    text = ""
    client = speech.SpeechClient()

    with open(speech_file, "rb") as audio_file:
        content = audio_file.read()

    audio = speech.RecognitionAudio(content=content)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        language_code="en-US",
    )

    response = client.recognize(config=config, audio=audio)

    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        text += result.alternatives[0].transcript
        print(f"Transcript: {text}")

    return text
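The RecognitionConfig above assumes 16-bit LINEAR16 (uncompressed WAV) audio. If transcription returns empty results, it can help to verify the recording's properties first. A minimal sketch using only Python's standard library (the helper name is our own, not part of the article's code):

```python
import wave


def audio_properties(path: str) -> dict:
    """Inspect a WAV file and report the fields that matter to Speech-to-Text."""
    with wave.open(path, "rb") as wf:
        return {
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),  # 2 bytes == 16-bit LINEAR16
            "frame_rate_hz": wf.getframerate(),
        }
```

A sample width of 2 bytes corresponds to the LINEAR16 encoding that transcribe_file declares in its RecognitionConfig.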
def add_user_input(history, text):
    """Add user input to the chat history."""
    history = history + [(text, None)]
    return history, gr.update(value="", interactive=False)
def bot_response(history):
    """Return the updated chat history with the bot's response."""
    # Integrate with your ML models here to generate a real response.
    response = cfg.bot["temp_response"]
    history[-1][1] = response
    time.sleep(2)
    return history
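To replace the canned temp_response with real answers, bot_response only needs a function that maps the user's last message to a reply. A purely illustrative stand-in (generate_response is a hypothetical helper, not part of any library or of this article's code) might look like:

```python
def generate_response(message: str) -> str:
    """Hypothetical stand-in for a call to an actual ML model or LLM API."""
    normalized = message.lower().strip()
    if any(greeting in normalized for greeting in ("hello", "hi", "hey")):
        return "Hello! How may I help you today?"
    if "bye" in normalized:
        return "Goodbye! Happy to help anytime."
    # Fall back to a canned response until a real model is wired in.
    return "Apologies, I'm not ready yet :("
```

Inside bot_response, you would then set `history[-1][1] = generate_response(history[-1][0])`, since the last history entry holds the user's most recent message.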
with gr.Blocks() as bot_interface:
    with gr.Row():
        gr.HTML(cfg.bot["banner"])
    with gr.Row(scale=1):
        chatbot = gr.Chatbot([(cfg.bot["initial_message"], None)], elem_id="chatbot").style(height=750)
    with gr.Row(scale=1):
        with gr.Column(scale=12):
            user_input = gr.Textbox(
                show_label=False, placeholder=cfg.bot["text_placeholder"],
            ).style(container=False)
        with gr.Column(min_width=70, scale=1):
            submitBtn = gr.Button("Send")
    with gr.Row(scale=1):
        audio_input = gr.Audio(source="microphone", type="filepath")

    input_msg = user_input.submit(
        add_user_input, [chatbot, user_input], [chatbot, user_input], queue=False
    ).then(bot_response, chatbot, chatbot)
    submitBtn.click(
        add_user_input, [chatbot, user_input], [chatbot, user_input], queue=False
    ).then(bot_response, chatbot, chatbot)
    input_msg.then(lambda: gr.update(interactive=True), None, [user_input], queue=False)

    inputs_event = audio_input.stop_recording(
        transcribe_file, audio_input, user_input
    ).then(
        add_user_input, [chatbot, user_input], [chatbot, user_input], queue=False
    ).then(bot_response, chatbot, chatbot)
    inputs_event.then(lambda: gr.update(interactive=True), None, [user_input], queue=False)

bot_interface.title = cfg.bot["title"]
bot_interface.launch(share=True)
If you want to expose the bot externally, change the launch call as shown below:
bot_interface.launch(
    server_name="0.0.0.0",
    share=False,
    ssl_certfile="localhost.crt",
    ssl_keyfile="localhost.key",
    ssl_verify=False,
)
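The external launch call expects a certificate and key named localhost.crt and localhost.key. For local testing, a self-signed pair can be generated with OpenSSL as sketched below (for production, use a certificate from a real certificate authority instead):

```shell
# Generate a self-signed certificate valid for one year (testing only).
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout localhost.key -out localhost.crt \
  -subj "/CN=localhost"
```

Because the certificate is self-signed, browsers will show a warning, which is why the launch call sets ssl_verify=False.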
Launch
python3 main.py
or
gradio main.py
Below is how the bot should appear in your browser. You can start chatting using the keyboard or the microphone. Feel free to integrate the bot with your own machine learning models to get the desired answers.
Congratulations! You have successfully launched a voice-based chatbot.
- Want to add streaming audio to this bot? Refer to Transcribe streaming audio.
- Want to convert speech as you speak? Refer to Real-time speech recognition.