Welcome to the third and final installment of this series on innovative AI-driven chatbots. Having journeyed from basic text-based interaction to text-to-voice capabilities, we now introduce a voice-to-voice PDF chatbot. This advanced system lets users speak their questions to their PDF documents and hear the answers read back, significantly enhancing accessibility and usability. Let’s explore how this chatbot works and its implications for users.
Evolution of Chatbots: From Text to Voice
In our initial blog post, we introduced a text-based chatbot capable of processing queries from uploaded documents. This laid the groundwork for seamless interaction with textual information.
http://techiemate.blogspot.com/2024/06/revolutionizing-document-interaction-ai.html
Building on this, our second post showcased text-to-voice integration: an interactive voice assistant capable of understanding your questions and reading out the answers from the content of your PDFs. This enhancement marked a significant leap towards intuitive user engagement, catering to diverse user preferences and accessibility needs.
http://techiemate.blogspot.com/2024/06/revolutionizing-document-interaction-ai_13.html
Introducing Voice-to-Voice Interaction with PDFs
Today, we introduce our latest innovation: a voice-enabled PDF chatbot capable of both transcribing spoken queries and delivering spoken responses directly from PDF documents. This breakthrough technology bridges traditional document interaction with modern voice-driven interfaces, offering a transformative user experience.
The Technical Backbone: Exploring the Codebase
Let’s delve into the technical components that power our voice-enabled PDF chatbot:
Setting Up Dependencies and Environment
import os
import faiss
import streamlit as st
from dotenv import load_dotenv
from langchain_core.messages import AIMessage, HumanMessage
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, load_index_from_storage
from llama_index.vector_stores.faiss import FaissVectorStore
import azure.cognitiveservices.speech as speechsdk
import speech_recognition as sr
# Initialize the Faiss index used for vectorization
d = 1536  # embedding dimension; 1536 matches OpenAI's default text embeddings
faiss_index = faiss.IndexFlatL2(d)  # flat L2 (Euclidean) similarity index
PERSIST_DIR = "./storage"  # where the vectorized index is persisted
# Load environment variables
load_dotenv()
The code snippet above sets up the necessary dependencies:
- Faiss: Utilized for efficient document vectorization, enabling similarity search based on content.
- Streamlit: Facilitates the user interface for seamless interaction with the chatbot and document upload functionality.
- LangChain: Provides the AIMessage and HumanMessage types used to track the conversation in the chatbot interface.
- LlamaIndex: Manages the storage and retrieval of vectorized document data, optimizing query performance.
- Azure Cognitive Services (Speech SDK): Provides capabilities for speech recognition and synthesis, enabling the chatbot to transcribe and respond to spoken queries.
- SpeechRecognition: Captures microphone input and converts speech to text using Google’s speech recognition API (the speech_recognition import above). An installation sketch follows this list.
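If you want to follow along, the imports above map roughly to the pip packages below (PyAudio is needed for microphone capture), plus a .env file holding the keys the code reads. Exact package names can vary with your llama-index version, so treat this as a sketch:

pip install streamlit python-dotenv faiss-cpu langchain-core llama-index llama-index-vector-stores-faiss azure-cognitiveservices-speech SpeechRecognition PyAudio

# .env (placeholder values)
AZURE_SPEECH_KEY=your-azure-speech-key
AZURE_SPEECH_REGION=your-azure-region
OPENAI_API_KEY=your-openai-api-key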
Document Handling and Vectorization
def saveUploadedFiles(pdf_docs):
    UPLOAD_DIR = 'uploaded_files'
    os.makedirs(UPLOAD_DIR, exist_ok=True)  # make sure the upload directory exists
    try:
        for pdf in pdf_docs:
            file_path = os.path.join(UPLOAD_DIR, pdf.name)
            with open(file_path, "wb") as f:
                f.write(pdf.getbuffer())  # write the uploaded PDF bytes to disk
        return "Done"
    except Exception:
        return "Error"
def doVectorization():
    try:
        vector_store = FaissVectorStore(faiss_index=faiss_index)
        storage_context = StorageContext.from_defaults(vector_store=vector_store)
        documents = SimpleDirectoryReader("./uploaded_files").load_data()
        index = VectorStoreIndex.from_documents(
            documents,
            storage_context=storage_context
        )
        index.storage_context.persist(persist_dir=PERSIST_DIR)  # save the index for later queries
        return "Done"
    except Exception:
        return "Error"
The saveUploadedFiles function saves the PDF documents uploaded by users into a designated directory (uploaded_files). The doVectorization function then uses Faiss to vectorize those documents, making them searchable by content similarity.
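Outside of Streamlit, you can smoke-test these two helpers with a small stand-in for Streamlit’s uploaded-file object. This is a minimal sketch; FakeUpload and sample.pdf are hypothetical stand-ins for illustration only:

# FakeUpload mimics the two members the code uses from Streamlit's
# UploadedFile: .name and .getbuffer()
class FakeUpload:
    def __init__(self, path):
        self.name = os.path.basename(path)
        self._path = path

    def getbuffer(self):
        with open(self._path, "rb") as f:
            return f.read()

if saveUploadedFiles([FakeUpload("sample.pdf")]) == "Done":
    print(doVectorization())  # expect "Done" once the index is persisted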
Speech Recognition and Transcription
def transcribe_audio():
    recognizer = sr.Recognizer()
    microphone = sr.Microphone()
    with microphone as source:
        recognizer.adjust_for_ambient_noise(source)  # calibrate for background noise
        audio = recognizer.listen(source, timeout=20)  # wait up to 20 seconds for speech
    st.write("🎙 Transcribing...")
    try:
        text = recognizer.recognize_google(audio)  # Google's speech recognition API
        return text
    except sr.RequestError:
        return "API unavailable or unresponsive"
    except sr.UnknownValueError:
        return "Unable to recognize speech"
The transcribe_audio function uses the speech_recognition library to capture spoken queries from the user's microphone. It adjusts for ambient noise and listens for up to 20 seconds before transcribing the speech into text using Google's speech recognition API.
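One handy variation: recognize_google accepts a language parameter, so the same function can transcribe other locales. A minimal sketch, assuming audio has been captured as above:

# e.g. transcribe Indian English instead of the default en-US
text = recognizer.recognize_google(audio, language="en-IN")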
Querying and Fetching Data
def fetchData(user_question):
    try:
        vector_store = FaissVectorStore.from_persist_dir("./storage")
        storage_context = StorageContext.from_defaults(
            vector_store=vector_store, persist_dir=PERSIST_DIR
        )
        index = load_index_from_storage(storage_context=storage_context)
        query_engine = index.as_query_engine()  # LLM-backed query engine over the index
        response = query_engine.query(user_question)
        return str(response)
    except Exception:
        return "Error"
The fetchData function retrieves relevant information from the vectorized documents based on user queries. It loads the persisted Faiss index from storage and queries it to find and return the information most relevant to the user's question.
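Once the documents have been processed, fetchData can be called directly. The question string here is purely illustrative:

user_question = "What are the key findings in the uploaded report?"  # hypothetical example
print(fetchData(user_question))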
Defining the Welcome Message
The WelcomeMessage variable contains a multi-line string that introduces users to the voice-enabled PDF chatbot, encouraging them to upload PDF documents and start asking questions:
WelcomeMessage = """
Hello, I am your PDF voice chatbot. Please upload your PDF documents and start asking questions to me.
I will try my best to answer your questions from the documents.
"""
This message serves as the initial greeting when users interact with the chatbot, providing clear instructions on how to proceed.
Initializing Chat History and Azure Speech SDK Configuration
The code block below initializes the chat history and sets up the Azure Speech SDK configuration:
if "chat_history" not in st.session_state:
st.session_state.chat_history = [
AIMessage(content=WelcomeMessage)
]
AZURE_SPEECH_KEY = os.getenv("AZURE_SPEECH_KEY")
AZURE_SPEECH_REGION = os.getenv("AZURE_SPEECH_REGION")
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")
speech_config = speechsdk.SpeechConfig(subscription=AZURE_SPEECH_KEY, region=AZURE_SPEECH_REGION)
speech_config.speech_synthesis_voice_name = "en-US-AriaNeural"
speech_config.speech_synthesis_language = "en-US"
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
- Chat History Initialization: Checks whether chat_history exists in the Streamlit session state. If not, it is initialized with the WelcomeMessage wrapped in an AIMessage object, ensuring the chat starts with the welcome message displayed to the user.
- Azure Speech SDK Configuration: Retrieves the Azure Speech API key and region from environment variables (AZURE_SPEECH_KEY and AZURE_SPEECH_REGION), then builds a SpeechConfig object and, from it, the speech_synthesizer. The voice name and language are configured to use the "en-US-AriaNeural" voice for English (US). A sketch of checking the synthesis result appears after this list.
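When synthesizing speech, it can help to check the result rather than assume success. The error handling below is an addition for illustration, not part of the original app, sketched with the SDK’s standard result reasons:

result = speech_synthesizer.speak_text_async("Hello from the PDF chatbot").get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesis succeeded")
elif result.reason == speechsdk.ResultReason.Canceled:
    details = result.cancellation_details  # carries the reason and error info
    print(f"Speech synthesis canceled: {details.reason}")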
Streamlit Integration: User Interface Design
def main():
    load_dotenv()
    st.set_page_config(
        page_title="Chat with multiple PDFs",
        page_icon=":sparkles:"
    )
    st.header("Chat with single or multiple PDFs :sparkles:")

    # Replay the conversation so far
    for message in st.session_state.chat_history:
        if isinstance(message, AIMessage):
            with st.chat_message("AI"):
                st.markdown(message.content)
        elif isinstance(message, HumanMessage):
            with st.chat_message("Human"):
                st.markdown(message.content)

    # Sidebar: upload PDFs and build the vector index
    with st.sidebar:
        st.subheader("Your documents")
        pdf_docs = st.file_uploader(
            "Upload your PDFs here and click on 'Process'",
            accept_multiple_files=True
        )
        if st.button("Process"):
            with st.spinner("Processing"):
                IsFilesSaved = saveUploadedFiles(pdf_docs)
                if IsFilesSaved == "Done":
                    IsVectorized = doVectorization()
                    if IsVectorized == "Done":
                        st.session_state.isPdfProcessed = "done"
                        st.success("Done!")
                    else:
                        st.error("Error in vectorization")
                else:
                    st.error("Error in saving the files")

    # Voice interaction: record a question, answer it, and speak the response
    if st.button("Start Asking Question"):
        st.write("🎤 Recording started... Ask your question")
        transcription = transcribe_audio()
        st.write("✅ Recording ended")
        st.session_state.chat_history.append(HumanMessage(content=transcription))
        with st.chat_message("Human"):
            st.markdown(transcription)
        with st.chat_message("AI"):
            with st.spinner("Fetching data ..."):
                response = fetchData(transcription)
                st.markdown(response)
                result = speech_synthesizer.speak_text_async(response).get()
        st.session_state.chat_history.append(AIMessage(content=response))

    # Speak the welcome message once, on first load only
    if "WelcomeMessage" not in st.session_state:
        st.session_state.WelcomeMessage = WelcomeMessage
        result = speech_synthesizer.speak_text_async(WelcomeMessage).get()

#============================================================================================================

if __name__ == '__main__':
    main()
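To try it out, save the script and launch it with Streamlit (app.py is just an assumed filename for this example):

streamlit run app.py

Streamlit opens the interface in your browser, where you can upload PDFs, click Process, and then press Start Asking Question to talk to your documents.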
User Experience: Seamless Interaction
Imagine uploading a collection of PDF documents (research papers, technical manuals, or reports) and simply speaking your questions aloud. The chatbot not only transcribes your speech but also responds audibly, providing immediate access to relevant information. This seamless interaction is particularly beneficial for users with visual impairments, or for anyone multitasking who prefers to take in information by ear.
Demo 🎬🎤🤖
Enhancing Accessibility and Efficiency
Our voice-enabled PDF chatbot represents a significant advancement in accessibility technology. By integrating speech recognition, document vectorization, and AI-driven query processing, we empower users to effortlessly interact with complex information. This technology not only enhances accessibility but also boosts efficiency by streamlining the process of retrieving information from documents.
Conclusion: Paving the Way Forward
As we continue to explore the capabilities of AI in enhancing user experiences, the voice-enabled PDF chatbot stands as a testament to innovation in accessibility and usability. Whether you’re a researcher seeking insights from academic papers or a professional referencing technical documents, this technology promises to revolutionize how we interact with information.
Stay tuned as we push the boundaries further, exploring new applications and advancements in AI-driven technology. Stay connected, and Happy Coding! 😊