Welcome to the third and final installment of this series on innovative AI-driven chatbots. Having journeyed from basic text-based interaction to text-to-voice capabilities, we now introduce a voice-to-voice PDF chatbot. This advanced system lets users speak their questions to their PDF documents and hear the answers read back, significantly enhancing accessibility and usability. Let’s explore how this chatbot works and its implications for users.
Evolution of Chatbots: From Text to Voice
In our initial blog post, we introduced a text-based chatbot capable of processing queries from uploaded documents. This laid the groundwork for seamless interaction with textual information.
http://techiemate.blogspot.com/2024/06/revolutionizing-document-interaction-ai.html
Building on this, our second post showcased text-to-voice integration: an interactive voice assistant capable of understanding your questions and reading out the answers from the content of your PDFs. This enhancement marked a significant leap towards intuitive user engagement, catering to diverse user preferences and accessibility needs.
http://techiemate.blogspot.com/2024/06/revolutionizing-document-interaction-ai_13.html
Introducing Voice-to-Voice Interaction with PDFs
Today, we introduce our latest innovation: a voice-enabled PDF chatbot capable of both transcribing spoken queries and delivering spoken responses directly from PDF documents. This breakthrough technology bridges traditional document interaction with modern voice-driven interfaces, offering a transformative user experience.
The Technical Backbone: Exploring the Codebase
Let’s delve into the technical components that power our voice-enabled PDF chatbot:
Setting Up Dependencies and Environment
import os
import faiss
import streamlit as st
from dotenv import load_dotenv
from langchain_core.messages import AIMessage, HumanMessage
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, load_index_from_storage
from llama_index.vector_stores.faiss import FaissVectorStore
import azure.cognitiveservices.speech as speechsdk
import speech_recognition as sr
# Initialize the Faiss index used for vectorization
d = 1536  # embedding dimension; 1536 matches OpenAI's default text embeddings
faiss_index = faiss.IndexFlatL2(d)  # flat L2 (Euclidean) similarity index
PERSIST_DIR = "./storage"  # where the vectorized index is persisted
# Load environment variables
load_dotenv()
The code snippet above sets up the necessary dependencies:
- Faiss: Utilized for efficient document vectorization, enabling similarity search based on content.
- Streamlit: Facilitates the user interface for seamless interaction with the chatbot and document upload functionality.
- LangChain: Provides the AIMessage and HumanMessage types used to track the conversation in the chatbot interface.
- LlamaIndex: Manages the storage and retrieval of vectorized document data, optimizing query performance.
- Azure Cognitive Services (Speech SDK): Provides capabilities for speech recognition and synthesis, enabling the chatbot to transcribe and respond to spoken queries.
- SpeechRecognition: Captures microphone input and converts speech to text using Google’s speech recognition API (the speech_recognition import above). An installation sketch follows this list.
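If you want to follow along, the imports above map roughly to the pip packages below (PyAudio is needed for microphone capture), plus a .env file holding the keys the code reads. Exact package names can vary with your llama-index version, so treat this as a sketch:

pip install streamlit python-dotenv faiss-cpu langchain-core llama-index llama-index-vector-stores-faiss azure-cognitiveservices-speech SpeechRecognition PyAudio

# .env (placeholder values)
AZURE_SPEECH_KEY=your-azure-speech-key
AZURE_SPEECH_REGION=your-azure-region
OPENAI_API_KEY=your-openai-api-key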
Document Handling and Vectorization
def saveUploadedFiles(pdf_docs):
    UPLOAD_DIR = 'uploaded_files'
    os.makedirs(UPLOAD_DIR, exist_ok=True)  # make sure the upload directory exists
    try:
        for pdf in pdf_docs:
            file_path = os.path.join(UPLOAD_DIR, pdf.name)
            with open(file_path, "wb") as f:
                f.write(pdf.getbuffer())  # write the uploaded PDF bytes to disk
        return "Done"
    except Exception:
        return "Error"
def doVectorization():
    try:
        vector_store = FaissVectorStore(faiss_index=faiss_index)
        storage_context = StorageContext.from_defaults(vector_store=vector_store)
        documents = SimpleDirectoryReader("./uploaded_files").load_data()
        index = VectorStoreIndex.from_documents(
            documents,
            storage_context=storage_context
        )
        index.storage_context.persist(persist_dir=PERSIST_DIR)  # save the index for later queries
        return "Done"
    except Exception:
        return "Error"
The saveUploadedFiles function saves the PDF documents uploaded by users into a designated directory (uploaded_files). The doVectorization function then uses Faiss to vectorize those documents, making them searchable by content similarity.
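Outside of Streamlit, you can smoke-test these two helpers with a small stand-in for Streamlit’s uploaded-file object. This is a minimal sketch; FakeUpload and sample.pdf are hypothetical stand-ins for illustration only:

# FakeUpload mimics the two members the code uses from Streamlit's
# UploadedFile: .name and .getbuffer()
class FakeUpload:
    def __init__(self, path):
        self.name = os.path.basename(path)
        self._path = path

    def getbuffer(self):
        with open(self._path, "rb") as f:
            return f.read()

if saveUploadedFiles([FakeUpload("sample.pdf")]) == "Done":
    print(doVectorization())  # expect "Done" once the index is persisted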
Speech Recognition and Transcription
def transcribe_audio():
    recognizer = sr.Recognizer()
    microphone = sr.Microphone()
    with microphone as source:
        recognizer.adjust_for_ambient_noise(source)  # calibrate for background noise
        audio = recognizer.listen(source, timeout=20)  # wait up to 20 seconds for speech
    st.write("🎙 Transcribing...")
    try:
        text = recognizer.recognize_google(audio)  # Google's speech recognition API
        return text
    except sr.RequestError:
        return "API unavailable or unresponsive"
    except sr.UnknownValueError:
        return "Unable to recognize speech"
The transcribe_audio function uses the speech_recognition library to capture spoken queries from the user's microphone. It adjusts for ambient noise and listens for up to 20 seconds before transcribing the speech into text using Google's speech recognition API.
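One handy variation: recognize_google accepts a language parameter, so the same function can transcribe other locales. A minimal sketch, assuming audio has been captured as above:

# e.g. transcribe Indian English instead of the default en-US
text = recognizer.recognize_google(audio, language="en-IN")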
Querying and Fetching Data
def fetchData(user_question):
    try:
        vector_store = FaissVectorStore.from_persist_dir("./storage")
        storage_context = StorageContext.from_defaults(
            vector_store=vector_store, persist_dir=PERSIST_DIR
        )
        index = load_index_from_storage(storage_context=storage_context)
        query_engine = index.as_query_engine()  # LLM-backed query engine over the index
        response = query_engine.query(user_question)
        return str(response)
    except Exception:
        return "Error"
The fetchData function retrieves relevant information from the vectorized documents based on user queries. It loads the persisted Faiss index from storage and queries it to find and return the information most relevant to the user's question.
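Once the documents have been processed, fetchData can be called directly. The question string here is purely illustrative:

user_question = "What are the key findings in the uploaded report?"  # hypothetical example
print(fetchData(user_question))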
Defining the Welcome Message
The WelcomeMessage variable contains a multi-line string that introduces users to the voice-enabled PDF chatbot, encouraging them to upload PDF documents and start asking questions:
WelcomeMessage = """
Hello, I am your PDF voice chatbot. Please upload your PDF documents and start asking questions to me.
I will try my best to answer your questions from the documents.
"""
This message serves as the initial greeting when users interact with the chatbot, providing clear instructions on how to proceed.
Initializing Chat History and Azure Speech SDK Configuration
The code block below initializes the chat history and sets up the Azure Speech SDK configuration:
if "chat_history" not in st.session_state:
st.session_state.chat_history = [
AIMessage(content=WelcomeMessage)
]
AZURE_SPEECH_KEY = os.getenv("AZURE_SPEECH_KEY")
AZURE_SPEECH_REGION = os.getenv("AZURE_SPEECH_REGION")
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")
speech_config = speechsdk.SpeechConfig(subscription=AZURE_SPEECH_KEY, region=AZURE_SPEECH_REGION)
speech_config.speech_synthesis_voice_name = "en-US-AriaNeural"
speech_config.speech_synthesis_language = "en-US"
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
- Chat History Initialization: Checks whether chat_history exists in the Streamlit session state. If not, it is initialized with the WelcomeMessage wrapped in an AIMessage object, ensuring the chat starts with the welcome message displayed to the user.
- Azure Speech SDK Configuration: Retrieves the Azure Speech API key and region from environment variables (AZURE_SPEECH_KEY and AZURE_SPEECH_REGION), then builds a SpeechConfig object and, from it, the speech_synthesizer. The voice name and language are configured to use the "en-US-AriaNeural" voice for English (US). A sketch of checking the synthesis result appears after this list.
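When synthesizing speech, it can help to check the result rather than assume success. The error handling below is an addition for illustration, not part of the original app, sketched with the SDK’s standard result reasons:

result = speech_synthesizer.speak_text_async("Hello from the PDF chatbot").get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesis succeeded")
elif result.reason == speechsdk.ResultReason.Canceled:
    details = result.cancellation_details  # carries the reason and error info
    print(f"Speech synthesis canceled: {details.reason}")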
Streamlit Integration: User Interface Design
def main():
    load_dotenv()
    st.set_page_config(
        page_title="Chat with multiple PDFs",
        page_icon=":sparkles:"
    )
    st.header("Chat with single or multiple PDFs :sparkles:")

    # Replay the conversation so far
    for message in st.session_state.chat_history:
        if isinstance(message, AIMessage):
            with st.chat_message("AI"):
                st.markdown(message.content)
        elif isinstance(message, HumanMessage):
            with st.chat_message("Human"):
                st.markdown(message.content)

    # Sidebar: upload PDFs and build the vector index
    with st.sidebar:
        st.subheader("Your documents")
        pdf_docs = st.file_uploader(
            "Upload your PDFs here and click on 'Process'",
            accept_multiple_files=True
        )
        if st.button("Process"):
            with st.spinner("Processing"):
                IsFilesSaved = saveUploadedFiles(pdf_docs)
                if IsFilesSaved == "Done":
                    IsVectorized = doVectorization()
                    if IsVectorized == "Done":
                        st.session_state.isPdfProcessed = "done"
                        st.success("Done!")
                    else:
                        st.error("Error in vectorization")
                else:
                    st.error("Error in saving the files")

    # Voice interaction: record a question, answer it, and speak the response
    if st.button("Start Asking Question"):
        st.write("🎤 Recording started... Ask your question")
        transcription = transcribe_audio()
        st.write("✅ Recording ended")
        st.session_state.chat_history.append(HumanMessage(content=transcription))
        with st.chat_message("Human"):
            st.markdown(transcription)
        with st.chat_message("AI"):
            with st.spinner("Fetching data ..."):
                response = fetchData(transcription)
                st.markdown(response)
                result = speech_synthesizer.speak_text_async(response).get()
        st.session_state.chat_history.append(AIMessage(content=response))

    # Speak the welcome message once, on first load only
    if "WelcomeMessage" not in st.session_state:
        st.session_state.WelcomeMessage = WelcomeMessage
        result = speech_synthesizer.speak_text_async(WelcomeMessage).get()

#============================================================================================================

if __name__ == '__main__':
    main()
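To try it out, save the script and launch it with Streamlit (app.py is just an assumed filename for this example):

streamlit run app.py

Streamlit opens the interface in your browser, where you can upload PDFs, click Process, and then press Start Asking Question to talk to your documents.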
User Experience: Seamless Interaction
Imagine uploading a collection of PDF documents (research papers, technical manuals, or reports) and simply speaking your questions aloud. The chatbot not only transcribes your speech but also responds audibly, providing immediate access to relevant information. This seamless interaction is particularly beneficial for users with visual impairments, or for anyone multitasking who prefers to take in information by ear.
Demo 🎬🎤🤖
Enhancing Accessibility and Efficiency
Our voice-enabled PDF chatbot represents a significant advancement in accessibility technology. By integrating speech recognition, document vectorization, and AI-driven query processing, we empower users to effortlessly interact with complex information. This technology not only enhances accessibility but also boosts efficiency by streamlining the process of retrieving information from documents.
Conclusion: Paving the Way Forward
As we continue to explore the capabilities of AI in enhancing user experiences, the voice-enabled PDF chatbot stands as a testament to innovation in accessibility and usability. Whether you’re a researcher seeking insights from academic papers or a professional referencing technical documents, this technology promises to revolutionize how we interact with information.
Stay tuned as we push the boundaries further, exploring new applications and advancements in AI-driven technology. Stay connected, and Happy Coding! 😊