In this blog post, we will demonstrate how to build a serverless AI chat experience with retrieval-augmented generation using LangChain.js and Azure. The application will be hosted on Azure Static Web Apps and Azure Functions, with Azure AI Search as the vector database.
Overview:
Building a serverless chatbot using generative AI is an exciting and powerful way to enhance user engagement and improve customer service.
By using LangChain.js to orchestrate the pipeline, Ollama for local development and testing, Azure for cloud deployment, and open models such as Mistral 7B, you can build a chatbot that answers questions through a RAG pipeline, providing engaging and personalized responses.
By leveraging these tools together, you can experiment and iterate quickly, validate your ideas with a prototype, and scale up to production if your chatbot proves successful.
Building AI Applications:
Building AI applications can be complex and time-consuming, but LangChain.js and Azure serverless technologies greatly simplify the process. This application is a chatbot that uses a set of enterprise documents to generate responses to user queries.
In my earlier posts, we explored various options for creating a chatbot. If you are new to these concepts or missed those posts, please refer to the following. Feel free to like and comment if you wish to explore this topic further.
Create your own Copilot that uses your own data with an Azure OpenAI Service Model
Build and deploy a Q&A Copilot with Prompt Flow
Getting started with Azure Open AI Services
Build Your Personal RAG Chatbot on a PDF document: Langchain, ChromaDB on GPT 3.5
In this post, we will take a closer look at how to build the chatbot using LangChain.js and Azure serverless technologies.
Architecture:
When deployed, the project architecture will look like this:
We’ll use Azure Functions to host the backend API and Azure Static Web Apps to host the frontend. Azure Blob Storage will store a copy of the original PDF documents, and Azure Cosmos DB for MongoDB vCore will serve as the vector database.
This application is made from multiple components:
- A web app made with a single chat web component built with Lit and hosted on Azure Static Web Apps. The code is located in the packages/webapp folder.
- A serverless API built with Azure Functions, using LangChain.js to ingest the documents and generate responses to the user chat queries. The code is located in the packages/api folder.
- A database to store the text extracted from the documents and the vectors generated by LangChain.js, using Azure AI Search.
- A file storage to store the source documents, using Azure Blob Storage.
The web app and the API communicate using the HTTP protocol for AI chat apps (the AI Chat Protocol).
Understanding the RAG pipeline
The chatbot we built uses a RAG (Retrieval-Augmented Generation) pipeline to answer questions. But what is RAG?
RAG is a method used in artificial intelligence, particularly in natural language processing, to generate text responses that are both contextually relevant and rich in content using AI models.
At its core, RAG involves two main components:
- Retriever: Think of it like a search engine: it finds relevant information in a knowledge base, usually a vector database. In this sample, we’re using Azure Cosmos DB for MongoDB vCore as our vector database.
- Generator: Acts like a writer, taking the prompt and the retrieved information to create a response. Here we use a Large Language Model (LLM) for this task.
To learn more about how RAG works, refer to this post: Build Your Personal RAG Chatbot on a PDF document: Langchain, ChromaDB on GPT 3.5
When you ask the chatbot a question, the RAG pipeline works like this (a short code sketch follows the list):
- The question is sent to the retriever, which finds the most relevant documents in the knowledge base.
- The top *N* retrieved documents are combined with the question to build a prompt for the generator.
- The generator (here, the AI model) then uses the prompt to generate a response, which is sent back to the user.
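To make this flow concrete, here is a minimal sketch of the retrieve-then-generate steps in TypeScript. The searchKnowledgeBase and generateAnswer helpers are hypothetical placeholders for the vector search and the LLM call, not the actual functions used later in this post:

// Hypothetical sketch of a RAG pipeline: retrieve, then generate.
// These stubs stand in for a vector search and an LLM call (not part of the sample code).
declare function searchKnowledgeBase(query: string, options: { topN: number }): Promise<string[]>;
declare function generateAnswer(prompt: string): Promise<string>;

async function answerWithRag(question: string): Promise<string> {
  // 1. Retriever: find the most relevant documents in the knowledge base
  const topDocuments = await searchKnowledgeBase(question, { topN: 3 });

  // 2. Build a prompt that contains both the retrieved context and the question
  const prompt = `Answer the question using only the sources below.\nSources:\n${topDocuments.join('\n')}\n\nQuestion: ${question}`;

  // 3. Generator: ask the LLM to produce the final answer
  return generateAnswer(prompt);
}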
Demo Summary:
Below is a summary of the steps.
1. Set up code
2. Install Ollama and use it to experiment with the Mistral 7B model on your local machine
3. Run the project locally to test the chatbot
4. Deploy the chatbot to Azure Functions, using Azure Cosmos DB for MongoDB vCore as the vector database (optional)
Prerequisites:
- Git
- A GitHub account
- An Azure account to create the resources and deploy the app.
- Azure OpenAI services
- A working Node.js v20+ environment
- A machine with a GPU supported by Ollama (recommended; Ollama can also run on CPU, but more slowly)
Demo(step-by-step):
1. Set up code
The first step is to get the code: fork the project repository and clone it to your machine.
- Open the following link, then select the Create fork button: Fork on GitHub
- On your forked repository, select the Code button, then the Local tab, and copy the URL of your forked repository.
- Open a terminal and run this command to clone the repo:
git clone <your-repo-url>
Then, in a terminal, navigate to the project folder and install the dependencies with:
npm install
It’s almost ready! Before running the app, we first have to set up Ollama to have a local AI playground.
2. Install Ollama
Ollama is a CLI tool that allows you to experiment with AI models and embeddings locally. It’s a great tool to test and validate your ideas before deploying them to the cloud.
Go to Ollama’s website and download the latest version for your platform. Once installed, you can use the ollama command in your terminal.
We’ll start by downloading the models we need for this project. Run the following commands in your terminal:
ollama pull mistral
ollama pull all-minilm:l6-v2
This will pull the Mistral 7B model, a powerful language model that we’ll use for the chatbot, and the All-MiniLM model, a small embedding model that we’ll use to generate the vectors from the text.
Note: The mistral model will download a few gigabytes of data, so it can take some time depending on your internet connection.
Once the models are downloaded, you can test that the Ollama server is working correctly by running:
ollama run mistral
You should get a prompt in your terminal, where you can chat with the AI model directly, like a minimal chatbot:
Try asking a few questions to the model, and see how it responds. This will give you a good idea of the capabilities of the model and how you can interact with it.
Once you’re done, you can stop the Ollama server by pressing Ctrl+D in your terminal.
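While the ollama CLI is great for quick tests, you can also call the same local model from LangChain.js code. Here is a minimal sketch, assuming the Ollama server is running locally with its default settings and that the @langchain/ollama package is installed (package names can vary between LangChain.js versions):

import { ChatOllama } from '@langchain/ollama';

// Talk to the locally running Mistral 7B model through LangChain.js
const model = new ChatOllama({ model: 'mistral' });
const response = await model.invoke('Summarize what a RAG pipeline is in one sentence.');
console.log(response.content);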
3. Running the project locally
Now that we have the models ready, we can run the project to test the chatbot. Open a terminal in the project root and run:
npm start
This will start a dev server for the frontend, and an Azure Functions runtime emulator for the backend. You can access and play with the chatbot at http://localhost:8000.
Note: The first time you run the project, it can take a few minutes to start the backend API as it will download the runtime dependencies.
You should end up with a familiar chat interface, where you can ask questions and get answers from the chatbot:
For this demo, we use a fictitious company called Contoso Real Estate, and the chatbot can answer support questions about the usage of its products. The sample data includes a set of documents describing its terms of service, privacy policy, and a support guide.
Here’s a preview of the final project:
You can try using the suggested questions, or ask your own questions to see how the chatbot responds. Try also asking something completely out of context, like “Who won the latest football world cup?” to see how the chatbot handles it.
4. Deploying the chatbot to Azure
Now that we have a working chatbot locally, we can deploy it to Azure to make it accessible to the world. As you’ll see in the code walkthrough below, the two endpoints need only minimal changes to work both locally and in the cloud.
We’ll use Azure Functions to host the backend API and Azure Static Web Apps to host the frontend. Azure Blob Storage will store a copy of the original PDF documents, and Azure Cosmos DB for MongoDB vCore will serve as the vector database.
Once you have everything ready, you can deploy the project with the following steps:
- Open a terminal at the root of the project.
- Authenticate with Azure by running azd auth login.
Note: if the Azure Developer CLI (azd) is not installed on your machine, download and install it first.
- Run azd up to deploy the application to Azure. This will provision Azure resources, deploy this sample, and build the search index based on the files found in the ./data folder.
- You will be prompted to select a base location for the resources. If you don’t know which one to choose, you can select eastus2.
The deployment process will take a few minutes. Once it’s done, you’ll see the URL of the web app in the terminal.
You can now open the resulting link for the web app in your browser and start chatting with the bot.
The following resources are created:
This project uses Infrastructure as Code to set up the resources; you can look into the infra folder to see the templates used to deploy them.
Cleaning up the resources
To clean up all the Azure resources created and stop incurring costs, you can run the following command:
- Run azd down --purge
- When asked if you are sure you want to continue, enter y
The resource group and all the resources will be deleted.
Code behind: Exploring the LangChain.js building blocks
Let’s see how to use the LangChain.js library to interact with large language models (LLMs) when building AI-powered applications. This section covers the following key components:
Large Language Models (LLMs): LLMs are capable of generating text based on input data and learned patterns. LangChain.js offers a simplified interface to work with different LLMs.
Vector Embeddings: Vector embeddings are numerical representations of objects such as text; they are essential because machines work with numbers rather than raw text (a short example follows below).
LangChain Modules: LangChain.js provides various modules, including document loaders and transformers, which help load and transform data for AI applications.
Generating Vector Embeddings: Generating vector embeddings for the documents is a crucial step for the functioning of the chatbot.
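As a quick illustration of vector embeddings, the snippet below turns a sentence into a vector using the same All-MiniLM model we pulled earlier with Ollama. This is a sketch only; the exact package name may differ depending on your LangChain.js version:

import { OllamaEmbeddings } from '@langchain/ollama';

// Generate a vector embedding for a piece of text using the local All-MiniLM model
const embeddings = new OllamaEmbeddings({ model: 'all-minilm:l6-v2' });
const vector = await embeddings.embedQuery('What is the cancellation policy?');
console.log(vector.length, vector.slice(0, 5)); // All-MiniLM produces 384-dimensional vectors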
Summary of code:
This code implements two main API endpoints using LangChain.js for the serverless AI chat system:
/documents Endpoint:
File Upload: Parses form data to retrieve the uploaded PDF file.
Text Extraction: Utilizes PDFLoader to extract text from the PDF without splitting pages.
Text Splitting: Employs RecursiveCharacterTextSplitter to divide text into smaller chunks for better retrieval performance and to manage AI model prompt size limitations.
Embeddings Generation: Depending on the environment, uses either AzureOpenAIEmbeddings or OllamaEmbeddings to generate embeddings for each text chunk.
Database Storage: Stores text and embeddings in the database, creating an index for the cloud or saving to a local folder.
File Storage: Optionally uploads the original PDF to Azure Blob Storage if running in the cloud.
/chat Endpoint:
Model and Database Initialization: Sets up embeddings, chat model, and vector store, with easy switching between cloud and local development.
Document Combination: Creates a chain with createStuffDocumentsChain to combine user queries with document content using templating features.
Document Retrieval: Implements createRetrievalChain to retrieve relevant documents from the database using vector search.
Response Generation: Streams the AI-generated response back to the client using the last user message as input, following the AI Chat Protocol with NDJSON format.
Overall, the code demonstrates how LangChain.js simplifies the creation of a RAG pipeline by abstracting complex processes and providing a unified interface for different components. The focus is on the seamless integration of AI models, document handling, and response generation for an AI-powered chat system.
Let’s now dive into the code:
/documents Endpoint:
Here we use LangChain.js to extract the text from the PDF file, split it into smaller chunks, and generate vectors for each chunk. We store the text and the vectors in the database for later use in our RAG pipeline.
1. postDocuments() function: packages/api/src/functions/documents-post.ts
Open the file packages/api/src/functions/documents-post.ts in your code editor.
Let’s analyse the main parts of the postDocuments() function:
We start by parsing the form data to get the uploaded PDF file name and content. We use the standard MIME type multipart/form-data to handle file uploads here.
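In the Azure Functions v4 programming model, the incoming request exposes a formData() helper for this. Below is a simplified sketch of that parsing step; the 'file' field name is illustrative and may not match the exact code in the sample:

// Simplified sketch: read the uploaded PDF from the multipart/form-data request
const parsedForm = await request.formData();
const file = parsedForm.get('file') as File;
if (!file) {
  return { status: 400, jsonBody: { error: 'No file was uploaded' } };
}
const filename = file.name;
const fileData = await file.arrayBuffer();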
Next we use LangChain.js components to perform the text extraction and splitting. We use the PDFLoader to extract the text from the PDF file, and the RecursiveCharacterTextSplitter to split the text into smaller chunks.
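Here is a minimal sketch of that extraction and splitting step, reusing the fileData buffer from above. Import paths and chunk sizes are illustrative and may differ from the sample depending on your LangChain.js version:

import { PDFLoader } from '@langchain/community/document_loaders/fs/pdf';
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

// Load the PDF content as a single document (no per-page splitting)
const loader = new PDFLoader(new Blob([fileData]), { splitPages: false });
const rawDocuments = await loader.load();

// Split the text into overlapping chunks to improve retrieval and fit prompt size limits
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 1500, chunkOverlap: 100 });
const documents = await splitter.splitDocuments(rawDocuments);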
2. Generate embeddings and save in database
Splitting the text into smaller chunks is important to improve the retrieval performance, as it allows the retriever to find more relevant information in the documents. It also helps with the AI models’ limitations on the prompt input size, and ultimately reduces the usage cost when using hosted cloud models.
// Generate embeddings and save in database
if (azureOpenAiEndpoint) {
  const credentials = getCredentials();
  const embeddings = new AzureOpenAIEmbeddings({ credentials });
  await AzureAISearchVectorStore.fromDocuments(documents, embeddings, { credentials });
} else {
  // If no environment variables are set, it means we are running locally
  context.log('No Azure OpenAI endpoint set, using Ollama models and local DB');
  const embeddings = new OllamaEmbeddings({ model: ollamaEmbeddingsModel });
  const store = await FaissStore.fromDocuments(documents, embeddings, {});
  await store.save(faissStoreFolder);
}
Now comes the most important part: we generate the embeddings for each chunk of text and store them in the database. LangChain.js abstracts a lot of the complexity here, allowing us to switch between different embeddings models easily.
3. Use Azure OpenAI embeddings for the cloud deployment and Ollama embeddings for local development
Here we use the Azure OpenAI embeddings for the cloud deployment, and the Ollama embeddings for the local development. You can see that it’s easy to switch between the two as LangChain.js provides a common interface for both.
if (connectionString && containerName) {
  // Upload the PDF file to Azure Blob Storage
  context.log(`Uploading file to blob storage: "${containerName}/${filename}"`);
  const blobServiceClient = BlobServiceClient.fromConnectionString(connectionString);
  const containerClient = blobServiceClient.getContainerClient(containerName);
  const blockBlobClient = containerClient.getBlockBlobClient(filename);
  const buffer = await file.arrayBuffer();
  await blockBlobClient.upload(buffer, file.size, {
    blobHTTPHeaders: { blobContentType: 'application/pdf' },
  });
} else {
  context.log('No Azure Blob Storage connection string set, skipping upload.');
}
4. Upload the original PDF file to Azure Blob Storage
Finally, we upload the original PDF file to Azure Blob Storage. This is useful to keep a copy of the documents, and also to provide a way to download them later if needed. But it’s not mandatory for the chatbot to work, so we can skip it entirely if we’re running locally.
The `/chat` endpoint
In this endpoint we use LangChain.js components to connect to the database, load the documents and perform a vector search after vectorizing the user query. After that, the most relevant documents are injected into the prompt, and we generate the response. While this process seems complex, LangChain.js does all the heavy lifting for us.
1. Use LangChain.js components to connect to the database, load the documents, and perform a vector search (packages/api/src/functions/chat-post.ts)
Open the file packages/api/src/functions/chat-post.ts in your code editor.
Let’s skip to the most interesting part of the postChat() function:
let embeddings: Embeddings;
let model: BaseChatModel;
let store: VectorStore;

if (azureOpenAiEndpoint) {
  // Initialize models and vector database
  embeddings = new AzureOpenAIEmbeddings();
  model = new AzureChatOpenAI();
  store = new AzureCosmosDBVectorStore(embeddings, {});
} else {
  // If no environment variables are set, it means we are running locally
  context.log('No Azure OpenAI endpoint set, using Ollama models and local DB');
  embeddings = new OllamaEmbeddings({ model: ollamaEmbeddingsModel });
  model = new ChatOllama({ model: ollamaChatModel });
  store = await FaissStore.load(faissStoreFolder, embeddings);
}
2. Create the chain that combines the prompt with the documents
We start by initializing the AI models and the database. As you can see, switching between the cloud and local models is straightforward, and the good news is that it’s the only change needed!
// Create the chain that combines the prompt with the documents
const combineDocsChain = await createStuffDocumentsChain({
  llm: model,
  prompt: ChatPromptTemplate.fromMessages([
    ['system', systemPrompt],
    ['human', '{input}'],
  ]),
  documentPrompt: PromptTemplate.fromTemplate('{filename}: {page_content}\n'),
});
3. Create the chain to retrieve the documents from the database (createRetrievalChain() function)
The first part of our RAG chain uses the createStuffDocumentsChain() function to combine the prompt with the documents. We use the templating features of LangChain.js to create the different parts of the prompt (a simplified sketch of the system prompt follows this list):
- The system prompt, which is a fixed message that we inject at the beginning of the prompt. You can have a look at it to understand what it does and what it contains. The most important part is the `{context}` placeholder at the end, which will be replaced by the retrieved documents.
- The human prompt, which is the user query. We use the `{input}` placeholder to inject the user query into the prompt.
- The document prompt, which is used to format how we inject the retrieved documents into the system prompt. Our format is simple here, we just prepend the document filename to the content of the page.
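For illustration, a simplified system prompt with the `{context}` placeholder could look like the sketch below; the wording is hypothetical and shorter than the actual prompt used in the sample:

// Hypothetical, simplified system prompt; the real one in the sample is more detailed
const systemPrompt = `You are an assistant helping Contoso Real Estate customers with support questions.
Answer ONLY with facts from the sources listed below. If the answer is not in the sources, say you don't know.

Sources:
{context}`;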
// Create the chain to retrieve the documents from the database
const chain = await createRetrievalChain({
  retriever: store.asRetriever(),
  combineDocsChain,
});
4. Retrieval of the documents from the database (createRetrievalChain())
The next part of our chain is the retrieval of the documents from the database. The createRetrievalChain() function abstracts the process entirely: we just need to provide the retriever and the combineDocsChain, and it will take care of the rest (a short example of invoking the resulting chain follows this list). Behind the scenes, it will:
- Convert the user query into a vector using the embeddings model.
- Perform a vector search in the database to find the most relevant documents.
- Inject the most relevant documents into the context of the chain.
- Pass the context to our previous combineDocsChain to generate the prompt.
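For a non-streaming call, the resulting chain can also be invoked directly; the result exposes both the generated answer and the documents that were retrieved as context. A minimal sketch:

// Invoke the full RAG chain with a user question (non-streaming variant)
const result = await chain.invoke({ input: 'How do I cancel a reservation?' });
console.log(result.answer);         // the generated response
console.log(result.context.length); // number of documents retrieved as context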
// Generate the response
const lastUserMessage = messages.at(-1)!.content;
const responseStream = await chain.stream({
  input: lastUserMessage,
});

return data(createStream(responseStream), {
  'Content-Type': 'application/x-ndjson',
  'Transfer-Encoding': 'chunked',
});
5. Generate the response using chain.stream()
Finally, the last part of the chain generates the response using chain.stream(), passing the last user message (which contains the question) as input. The response is then streamed back to the client.
We use a stream of newline-delimited JSON (NDJSON) for the response, following the AI Chat Protocol as our API contract between the frontend and the backend.
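On the frontend side, consuming an NDJSON stream boils down to reading the response body and parsing one JSON object per line. Here is a simplified, framework-agnostic sketch; the /api/chat URL and request body shape are assumptions based on this sample, not the exact frontend code:

// Simplified client-side sketch: read an NDJSON response stream line by line
const response = await fetch('/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ messages: [{ role: 'user', content: 'What is your refund policy?' }] }),
});

const reader = response.body!.pipeThrough(new TextDecoderStream()).getReader();
let buffer = '';
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  buffer += value;
  let newlineIndex: number;
  while ((newlineIndex = buffer.indexOf('\n')) >= 0) {
    const line = buffer.slice(0, newlineIndex).trim();
    buffer = buffer.slice(newlineIndex + 1);
    if (line) {
      const chunk = JSON.parse(line); // each line is a standalone JSON object
      console.log(chunk);
    }
  }
}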
Complete code ref: Build a serverless AI Chat with RAG using LangChain.js (microsoft.com)
Key takeaways
- LangChain.js provides abstraction over AI models and APIs, allowing you to switch between them easily. Built-in support for advanced chain components makes complex AI workflows like RAG easy to build.
- Ollama is a powerful tool to experiment with AI models and embeddings locally.
- Azure Cosmos DB for MongoDB vCore can be used as a vector database for AI workloads, in addition to your regular NoSQL storage.
Conclusion:
In this post we discussed the creation of a serverless AI chat application using Retrieval-Augmented Generation (RAG) with LangChain.js and Azure technologies.
The application is designed to provide a chat experience using enterprise documents to answer user queries.
It utilizes Azure Static Web Apps, Azure Functions, and Azure AI Search as the vector database. The chatbot, exemplified by a fictitious company called Contoso Real Estate, allows customers to inquire about product usage.
The setup includes a web app, a serverless API, a database for storing text and vectors, and file storage for source documents.
Key Points:
- Simplification: LangChain.js and Azure serverless technologies simplify AI application development.
- Components: The application consists of a web app, serverless API, database, and file storage.
- Prerequisites: Node.js LTS, Azure Developer CLI, Git, and an Azure account with specific permissions are required.
- Deployment: The project can be deployed on Azure and tested locally or via GitHub Codespaces.
In conclusion, this post provides a comprehensive walkthrough for building a serverless AI chatbot, highlighting the ease of use and integration of various Azure services and LangChain.js. It serves as a starting point for developers to create more complex AI applications and experiment with AI models and workflows.
If you want to explore more posts on AI, refer to the links below, and feel free to like and comment.
Create your own Copilot that uses your own data with an Azure OpenAI Service Model
Build and deploy a Q&A Copilot with Prompt Flow
Getting started with Azure Open AI Services
Build Your Personal RAG Chatbot on a PDF document: Langchain, ChromaDB on GPT 3.5

