The purpose of this blog is to demonstrate a RAG implementation using the LangChain framework to build a simple chatbot that can answer a series of questions about a custom document.
Overview:
In this blog, we will explore how RAG works and demonstrate its effectiveness through a practical example: using GPT-3.5 Turbo to answer questions about a healthcare manual supplied as an additional corpus.
By leveraging the advanced capabilities of language models and integrating information retrieval systems, businesses can provide their users with timely, accurate, and personalized information.
Use case:
Imagine you are tasked with developing a chatbot that can respond to queries about a particular product. This product has its own unique user manual, specific to the enterprise’s offerings. Traditional language models, like GPT-3, are typically trained on general data and may not have knowledge of this specific product. Fine-tuning the model with the new corpus might seem like a solution, but it comes with considerable costs and resource requirements.
Problems with LLMs:
Large Language Models (LLMs) like GPT-3 have transformed the field of AI with their ability to understand and generate human-like text. However, they do face certain challenges:
- Data Biases: LLMs can inherit biases present in their training data, leading to biased outputs.
- Inaccurate Content: They may generate content that is not accurate or appropriate, especially when dealing with topics outside their training data.
- Resource Intensity: Training LLMs requires significant computational resources and time, making it expensive and energy-intensive.
- Static Knowledge: LLMs’ knowledge is static and cut-off at the point of their last training, meaning they can’t incorporate new information post-training.
Retrieval-Augmented Generation (RAG) addresses some of these issues by combining the generative power of LLMs with the ability to pull in information from external databases or knowledge bases in real-time. Here’s why RAG is important:
- Up-to-Date Information: RAG allows models to access the most current data, ensuring responses are relevant and timely.
- Reduced Errors: It helps reduce the chances of “hallucination,” where models make incorrect guesses or assumptions.
- Cost-Effective: It’s a more cost-effective way to improve LLM outputs without the need for retraining the model with new data.
- Customization: RAG enables customization to specific domains or organizational knowledge without altering the underlying model.
In essence, RAG enhances LLMs by providing them with a mechanism to reference current, authoritative information, improving the quality and reliability of their outputs. This is particularly useful for applications that require up-to-date knowledge or domain-specific accuracy.
Introduction to RAG:
To address this problem, researchers at Meta published a paper on a technique called Retrieval Augmented Generation (RAG), which adds an information retrieval component to the text generation that LLMs are already good at. This lets the model draw on knowledge beyond its training data, making its answers more accurate and up-to-date without retraining.
Retrieval Augmented Generation (RAG) offers a more efficient and effective way to address the issue of generating contextually appropriate responses in specialized domains.
Instead of fine-tuning the entire language model with the new corpus, RAG leverages the power of retrieval to access relevant information on demand.
By combining retrieval mechanisms with language models, RAG enhances the responses by incorporating external context. This external context can be provided as a vector embedding.
How does RAG work?
Here’s how RAG works at a high level (a minimal code sketch follows the list):
- Prompt Entry: You type in a question or statement that you want the AI to respond to.
- Query Generation: The AI takes your prompt and formulates a specific question or set of questions that will help it find the best answer.
- Information Retrieval: The AI searches through a large database of information to find relevant facts or data related to the query.
- Context Creation: Using the information it has found, the AI builds a detailed context to understand the prompt better and provide a more accurate response.
- Response Generation: Finally, the AI uses the context it has created to compose a clear and informative answer to your original prompt, which is then sent back to you.
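To make these steps concrete, here is a minimal, self-contained Python sketch of the retrieve-augment-generate loop. The helper functions (embed, search_vector_db, call_llm) are hypothetical stand-ins, not a specific library API; the actual demo later uses OpenAIEmbeddings, ChromaDB, and GPT-3.5 Turbo for these roles.

# Illustrative stand-ins for a real embedding model, vector store, and LLM
def embed(text):
    return [float(ord(c)) for c in text[:8]]

def search_vector_db(query_embedding):
    return ["(retrieved chunk 1)", "(retrieved chunk 2)"]

def call_llm(prompt):
    return f"(LLM answer based on: {prompt[:60]}...)"

def answer_with_rag(prompt):
    # 1-2. Prompt entry and query generation (here the prompt itself is the query)
    query_embedding = embed(prompt)
    # 3. Information retrieval: find the most relevant chunks in the knowledge base
    relevant_chunks = search_vector_db(query_embedding)
    # 4. Context creation: build an augmented prompt from the retrieved chunks
    context = "\n".join(relevant_chunks)
    augmented_prompt = f"Answer using this context:\n{context}\n\nQuestion: {prompt}"
    # 5. Response generation: the LLM answers using the retrieved context
    return call_llm(augmented_prompt)

print(answer_with_rag("What does the exposure module cover?"))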
Retrieval Augmented Generation (RAG) – Amazon SageMaker
Prerequisites
PDF/Manual:
We will be using the HEALTHCARE PERSONNEL SAFETY COMPONENT PROTOCOL, Healthcare Personnel Exposure Module
You can download the user manual from the CDC website (see References).
OpenAI API key:
Register with OpenAI and create an API key.
Development IDE:
Visual Studio Code (VS Code)
Demo(Summary):
Setup virtual environment
Create an OpenAI key
Install Dependencies
Code: Create Vector Embeddings from the User Manual PDF and store them in ChromaDB
Code: Use the ConversationalRetrievalChain API in LangChain to initiate a chat history component.
Code: Pass the questions to the LLM, get the responses, and print them.
Demo(step-by-step):
Let’s start the step-by-step process.
Setup virtual environment
pip install virtualenv
python3 -m venv ./venv
source venv/bin/activate
Create an OpenAI key
We will need an OpenAI key to access the GPT models. You can create an OpenAI key for free by registering with OpenAI at https://platform.openai.com/apps
Once you register, log in and select the API option.
setx OPENAI_API_KEY "your-api-key-here"
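The setx command above is for Windows. On macOS/Linux (consistent with the source venv/bin/activate step earlier), a rough equivalent is:
export OPENAI_API_KEY="your-api-key-here"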
Install Dependencies
We need to install the various dependencies. Let's install the libraries below.
- LangChain: A framework for developing LLM applications
- ChromaDB: The vector DB used to persist vector embeddings
- unstructured: Used for preprocessing Word/PDF documents
- tiktoken: Tokenizer framework
- pypdf: Framework to read and process PDF documents
- openai: Framework to access OpenAI
Summary of these commands:
pip install langchain
pip install unstructured
pip install pypdf
pip install tiktoken
pip install chromadb
pip install openai
Let's explore and install these components:
pip install langchain
LangChain: Building Applications with Large Language Models (LLMs)
LangChain is a powerful library that enables developers to create applications using large language models (LLMs) by combining them with other sources of computation or knowledge. Whether you’re building chatbots, question-answering systems, or more complex agents, LangChain provides a flexible framework for integrating LLMs into your projects.
Here’s how you can get started with LangChain:
- Installation: To install LangChain, use either pip or conda:
  - Using pip: pip install langchain
  - Using conda: conda install langchain -c conda-forge
This will install the core LangChain package with the minimum requirements. Note that additional dependencies for specific integrations (such as model providers and datastores) are not installed by default. You’ll need to install those separately.
- Key Areas of LangChain:
- Models and Prompts: Manage prompts, optimize them, and work with various LLMs. LangChain provides a generic interface for LLMs and chat models.
- Chains: Go beyond single LLM calls and create sequences of calls. LangChain offers a standard interface for chains and integrations with other tools.
- Retrieval Augmented Generation: Fetch data from external sources for use in LLM-based generation (e.g., summarization, question-answering).
- Agents: Build decision-making agents that interact with LLMs, take actions, and observe results.
- Evaluation (BETA): Evaluate generative models using language models themselves.
- Documentation:
- For detailed information, explore the official documentation. It covers installation, examples, and API reference.
LangChain empowers you to harness the full potential of LLMs by integrating them seamlessly into your applications.
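As a quick illustration (separate from the demo that follows), here is a minimal sketch of calling an OpenAI chat model through LangChain. It assumes the langchain-openai package is installed and OPENAI_API_KEY is set in your environment:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
reply = llm.invoke("In one sentence, what is Retrieval Augmented Generation?")
print(reply.content)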
LlamaIndex and LangChain are related but distinct entities. LlamaIndex is a component that can be used within the LangChain framework. LangChain is a broader framework that facilitates the creation of applications using large language models, and it can integrate various components, including LlamaIndex, which specializes in data retrieval tasks.
Think of LangChain as a toolbox that contains a variety of tools (like LlamaIndex) designed for specific functions. LlamaIndex would be one of those specialized tools in the toolbox, particularly useful when you need to perform efficient data retrieval operations within your application.
pip install unstructured
Unstructured: Pre-Processing Tools for Unstructured Data
The Unstructured library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and more. Whether you’re working with raw data for downstream machine learning tasks or need to streamline your data processing workflow, Unstructured has got you covered.
Here’s how you can get started:
- Installation:
- To install Unstructured, use pip:
pip install unstructured
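For instance, a small sketch of partitioning a document with Unstructured might look like the following. The file name is a placeholder, and PDF support may require the unstructured[pdf] extra:

from unstructured.partition.auto import partition

elements = partition(filename="example.pdf")  # auto-detects the file type
for element in elements[:5]:
    print(type(element).__name__, ":", element.text)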
pip install pypdf
PyPDF: A Python Library for PDF Manipulation
PyPDF is a versatile Python library for working with PDF files. Whether you need to extract text, merge or split PDFs, or add custom data, PyPDF has got you covered. Here’s how you can get started:
- Installation:
- To install PyPDF using pip, simply run:
- pip install pypdf
- If you’re not a super-user (system administrator/root), you can install PyPDF for your current user:
pip install --user pypdf
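Here is a minimal sketch of reading a PDF with pypdf (the file name is a placeholder):

from pypdf import PdfReader

reader = PdfReader("example.pdf")
print(len(reader.pages), "pages")
print(reader.pages[0].extract_text()[:200])  # first 200 characters of page 1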
pip install tiktoken
pip install chromadb
ChromaDB: An Open-Source Embedding Database
ChromaDB is an open-source embedding database designed for building AI applications with embeddings. It provides a fast and efficient way to store and search embeddings for Python or JavaScript LLM apps. Here's how you can get started with ChromaDB:
- Installation:
- To install the ChromaDB Python package, run:
pip install chromadb
Note (Windows): ChromaDB builds native extensions, so make sure you have the latest Visual Studio updates and the C++ build toolset installed.
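As a quick standalone sketch of ChromaDB (separate from the LangChain integration used later in this demo), assuming chromadb 0.4+ with its default embedding function:

import chromadb

client = chromadb.PersistentClient(path="./chroma_demo")  # persists to disk
collection = client.get_or_create_collection("demo")
collection.add(ids=["doc1"],
               documents=["RAG combines retrieval with text generation."])
results = collection.query(query_texts=["What is RAG?"], n_results=1)
print(results["documents"])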
pip install openai
Installing the OpenAI Python SDK
To install the OpenAI Python SDK, you can use the following command:
pip install --upgrade openai
Once you’ve installed it, you’ll be ready to interact with OpenAI’s powerful language models and create intelligent applications.
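A minimal sketch of calling the chat completions endpoint with the OpenAI Python SDK (v1.x), assuming OPENAI_API_KEY is set in your environment:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(response.choices[0].message.content)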
Code: Create Vector Embeddings from the User Manual PDF and store them in ChromaDB
import os

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import TokenTextSplitter
from langchain.memory import ConversationBufferMemory
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import ConversationalRetrievalChain

# Load the user manual PDF
loader = PyPDFLoader("TheNationalHealthcareSafetyNetwork-Manual.pdf")
pdfData = loader.load()

# Split the document into chunks of 1,000 tokens
text_splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=0)
splitData = text_splitter.split_documents(pdfData)

# ChromaDB collection name and local directory used to persist the embeddings
collection_name = "NHSN_collection"
local_directory = "NHSN_vect_embedding"
persist_directory = os.path.join(os.getcwd(), local_directory)

# Create vector embeddings with OpenAI and store them in ChromaDB
openai_key = os.environ.get('OPENAI_API_KEY')
embeddings = OpenAIEmbeddings(openai_api_key=openai_key)
vectDB = Chroma.from_documents(splitData,
                               embeddings,
                               collection_name=collection_name,
                               persist_directory=persist_directory)
vectDB.persist()

Run the Program:
After you execute this code, you should see a local folder (NHSN_vect_embedding) created that stores the vector embeddings.
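If you want to sanity-check the persisted store, an optional snippet (reusing the collection_name, persist_directory, and embeddings objects from above) can reload it and count the stored chunks:

store = Chroma(collection_name=collection_name,
               persist_directory=persist_directory,
               embedding_function=embeddings)
print(len(store.get()["ids"]), "chunks stored")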
Now that we have the vector embeddings stored in ChromaDB, let's use the ConversationalRetrievalChain API in LangChain to add a chat history component.
We will pass it the chat model (GPT-3.5 Turbo), the vector DB we created, and a ConversationBufferMemory that stores the messages.
Code: Use the ConversationalRetrievalChain API in LangChain to initiate a chat history component.
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# gpt-3.5-turbo is a chat model, so we use ChatOpenAI (imported above)
# rather than the completion-style OpenAI wrapper
chatQA = ConversationalRetrievalChain.from_llm(
    ChatOpenAI(openai_api_key=openai_key,
               temperature=0,
               model_name="gpt-3.5-turbo"),
    vectDB.as_retriever(),
    memory=memory)
Code: Pass the questions to the LLM, get the responses, and print them.
while True:
    qry = input('Question: ')
    if qry.lower() in ('done', 'exit'):  # type 'done' or 'exit' to stop
        break
    # The ConversationBufferMemory attached to the chain keeps the chat history
    response = chatQA({"question": qry})
    print(response["answer"])
Complete Code:
Validate:
Run the program.
Ask a question related to this PDF/manual and you will see the answer printed, as in the screenshot below.
You can ask various questions and the solution will provide the answers.
Code Behind:
Below are the high-level activities for the code we wrote.
- Read the user manual PDF and split it into chunks of 1,000 tokens each.
- Create vector embeddings for these chunks using the OpenAIEmbeddings library.
- Store the vector embeddings locally. We use the simple ChromaDB as our vector DB; a more highly available, production-grade vector DB such as Pinecone could be used instead.
- The user issues a prompt with the query/question.
- This triggers a search and retrieval against the vector DB to fetch contextual data relevant to the question.
- The prompt is augmented with this contextual data; this is typically referred to as context enrichment.
- The augmented prompt (the query/question plus the retrieved context) is then passed to the LLM.
- The LLM responds based on this context.
OpenAI API rate limits:
You may receive a rate-limit error from the OpenAI API.
If so, you may need to add a payment method and upgrade your OpenAI API plan.
Conclusion:
Retrieval Augmented Generation (RAG) is an innovative approach that merges the capabilities of advanced language models such as GPT-3 with robust information retrieval systems.
This method enhances the input with pertinent, context-specific data, allowing language models to produce responses that are not only precise but also highly relevant to the given context. For businesses where bespoke fine-tuning may be impractical, RAG presents a viable and economical alternative for delivering customized and knowledgeable user engagements.
Stay tuned for more blogs in this field as I delve deeper into this exciting topic.
Happy learning!
📚💡
References:
RAG: https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models-customize-rag.html
https://www.promptingguide.ai/techniques/rag
Prompt Engineering: https://www.promptengineering.org/master-prompt-engineering-llm-embedding-and-fine-tuning/
CDC Document: https://www.cdc.gov/nhsn/pdfs/hps-manual/hps_manual-exp-plus-flu-portfolio.pdf

