This blog is the final post in the Enterprise Chat AI series.
In the first part, we delved into the architecture of Enterprise Chat AI and showcased a demo application to explore its functionality. Refer to this link for all the details: Part 1: Enterprise Chat AI: Architecture and Demo
In the second part, we explored the process of setting up DevOps pipelines and productionizing our Chat AI solution, reviewing scaling, cost, and security measures. Refer to this link for all the details: Part 2: Enterprise Chat AI: Guidelines for productionizing (DevOps setup, Scaling, Cost optimization, Security)
In this final part, we delve into the realm of GenAI Ops and customizing the data and UI (indexing process, chunking, integrated vectorization), focusing on load testing, evaluating, monitoring, and managing Enterprise Chat AI applications.
Customizing the Data
The Chat App is designed to work with any PDF documents. The sample data is provided to help you get started quickly, but you can easily replace it with your own data. You’ll want to first remove all the existing data, then add your own.
We will be using the prepdocs script to index documents for the Chat App.
In order to ingest a document format, we need a tool that can turn it into text. By default, the script uses Azure Document Intelligence, but local parsers are also available for several formats. The local parsers are not as sophisticated as Azure Document Intelligence, but they can be used to reduce charges.
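As a rough illustration of what a local parser does, here is a minimal sketch using the pypdf library to extract plain text from a PDF. The file path is a placeholder, and the parsers in prepdocs handle more cases than this:

```python
from pypdf import PdfReader

# Minimal local parsing: extract plain text from each page of a PDF.
# This is far less capable than Azure Document Intelligence (no tables,
# no layout awareness), but it avoids per-page charges.
reader = PdfReader("data/employee_handbook.pdf")  # placeholder path
pages = [page.extract_text() or "" for page in reader.pages]
full_text = "\n".join(pages)
print(f"Extracted {len(full_text)} characters from {len(reader.pages)} pages")
```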
Overview of the manual indexing process
The prepdocs.py script is responsible for both uploading and indexing documents. The typical usage is to call it using scripts/prepdocs.sh (Mac/Linux) or scripts/prepdocs.ps1 (Windows), as these scripts will set up a Python virtual environment and pass in the required parameters based on the current azd environment. Whenever azd up or azd provision is run, the script is called automatically.
The script uses the following steps to index documents:
- If it doesn’t yet exist, create a new index in Azure AI Search.
- Upload the PDFs to Azure Blob Storage.
- Split the PDFs into chunks of text.
- Upload the chunks to Azure AI Search. If using vectors (the default), also compute the embeddings and upload those alongside the text.
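To make those steps concrete, here is a minimal sketch (not the prepdocs implementation) of chunking extracted text, computing embeddings, and uploading the chunks to Azure AI Search. The environment variable names, index field names (id, content, embedding), and deployment name are assumptions, and the real script uses a more careful splitter and batching:

```python
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],          # placeholder env vars
    index_name=os.environ["AZURE_SEARCH_INDEX"],
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)
openai_client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
)

def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    # Naive fixed-size chunking with overlap; prepdocs splits on sentence
    # and section boundaries instead.
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def index_document(doc_id: str, text: str) -> None:
    documents = []
    for i, chunk in enumerate(chunk_text(text)):
        embedding = openai_client.embeddings.create(
            model="text-embedding-ada-002",                 # your embedding deployment name
            input=chunk,
        ).data[0].embedding
        documents.append({
            "id": f"{doc_id}-{i}",                          # field names assume the demo's schema
            "content": chunk,
            "embedding": embedding,
        })
    search_client.upload_documents(documents=documents)
```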
Refer to this post for more details on chunking, indexing additional documents, and removing documents.
Getting Started with Integrated Vectorization
Overview of Integrated Vectorization
Azure AI Search recently introduced an integrated vectorization feature in preview. This feature is a cloud-based approach to data ingestion that takes care of document format cracking, data extraction, chunking, vectorization, and indexing, all with Azure technologies.
See this notebook to understand the process of setting up integrated vectorization. This code is integrated into the prepdocs script in the solution.
This feature cannot be used on an existing index; you need to create a new index or drop and recreate an existing one. The newly created index schema includes a new field, parent_id, which is used internally by the indexer to manage the life cycle of chunks.
Let’s look at the major steps needed to set up Integrated Vectorization:
The azure-search-integrated-vectorization-sample.ipynb notebook in the Azure/azure-search-vector-samples GitHub repository provides a guide for setting up integrated vectorization for Chat AI using Azure AI Search. Here’s a summary of the major steps:
Prerequisites:
- Have an Azure subscription with access to Azure OpenAI.
- Ensure Azure AI Search service is available with enough capacity for the workload.
- Set up Azure Storage with a blob container containing documents.
- Deploy the text-embedding-ada-002 model in the Azure OpenAI service.
Environment Setup:
- Use Python 3.11.x or higher.
- Install Visual Studio Code with Python and Jupyter extensions.
- Create a .env file using .env-sample as a template, filling in values for Azure AI Search and Azure OpenAI.
Running the Notebook:
- Open the azure-search-integrated-vectorization-sample.ipynb file in Visual Studio Code.
- Optionally, create a virtual environment within the workspace.
- Execute the notebook cells sequentially to apply data chunking and vectorization in an indexer pipeline.
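Once the notebook has created the data source, skillset, index, and indexer, you can also trigger and check the indexer from Python. A minimal sketch, assuming placeholder environment variables and an illustrative indexer name:

```python
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient

indexer_client = SearchIndexerClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],             # placeholder env vars
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_ADMIN_KEY"]),
)

indexer_name = "my-integrated-vectorization-indexer"          # placeholder name
indexer_client.run_indexer(indexer_name)                      # re-run chunking + vectorization

status = indexer_client.get_indexer_status(indexer_name)
print(status.last_result.status if status.last_result else "No runs yet")
```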
Troubleshooting:
- If you encounter error 429 from Azure OpenAI (overcapacity), check the Activity Log and Tokens Per Minute (TPM) on the deployed model.
- Adjust TPM or try a model with more capacity if errors persist.
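If you hit 429s while the notebook (or your own ingestion code) computes embeddings, a simple retry with exponential backoff helps you stay under the TPM limit. A hedged sketch, assuming the OpenAI Python SDK v1 and placeholder configuration:

```python
import os
import time
from openai import AzureOpenAI, RateLimitError

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],      # placeholder env vars
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
)

def embed_with_retry(text: str, retries: int = 5, base_delay: float = 2.0) -> list[float]:
    # Back off exponentially on 429s instead of failing the whole run.
    for attempt in range(retries):
        try:
            return client.embeddings.create(
                model="text-embedding-ada-002",               # your embedding deployment name
                input=text,
            ).data[0].embedding
        except RateLimitError:
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("Still rate limited; consider raising the TPM quota on the deployment")
```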
Refer to this post for more details: Getting Started with Integrated Vectorization
Customizing the UI
Once you successfully deploy the app, you can start customizing it for your needs: changing the text, tweaking the prompts, and replacing the data. Consult the app customization guide as well as the data ingestion guide for more details.
Customizing the frontend
The frontend is built using React and Fluent UI components. The frontend components are stored in the app/frontend/src folder. The typical components you’ll want to customize are:
- app/frontend/index.html: To change the page title
- app/frontend/src/pages/layout/Layout.tsx: To change the header text and logo
- app/frontend/src/pages/chat/Chat.tsx: To change the large heading
- app/frontend/src/components/Example/ExampleList.tsx: To change the example questions
Customizing the backend
The backend is built using Quart, a Python framework for asynchronous web applications. The backend code is stored in the app/backend folder. The frontend and backend communicate using the AI Chat App HTTP Protocol.
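As a rough sketch of how the two halves talk to each other, here is a minimal Quart endpoint in the spirit of the AI Chat App HTTP Protocol. The request and response shapes are simplified assumptions; see the protocol spec and app/backend for the real handler:

```python
from quart import Quart, jsonify, request

app = Quart(__name__)

@app.route("/chat", methods=["POST"])
async def chat():
    body = await request.get_json()
    # The frontend sends a list of messages; only the latest user turn is used here.
    user_message = body["messages"][-1]["content"]
    # Placeholder for the real RAG pipeline (search + ChatCompletion).
    answer = f"You asked: {user_message}"
    return jsonify({"message": {"role": "assistant", "content": answer}})

if __name__ == "__main__":
    app.run()
```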
Improving answer quality
Once you are running the chat app on your own data and with your own tailored system prompt, the next step is to test the app with questions and note the quality of the answers. If you notice any answers that aren’t as good as you’d like, here’s a process for improving them.
Identify the problem point
The first step is to identify where the problem is occurring. For example, if using the Chat tab, the problem could be:
1. The OpenAI ChatCompletion API is not generating a good search query based on the user question.
2. Azure AI Search is not returning good search results for the query.
3. The OpenAI ChatCompletion API is not generating a good answer based on the search results and the user question.
You can look at the “Thought process” tab in the chat app to see each of those steps, and determine which one is the problem.
Improving OpenAI ChatCompletion results
If the problem is with the ChatCompletion API calls (steps 1 or 3 above), you can try changing the relevant prompt.
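One quick way to iterate on step 1 or step 3 is to experiment with the prompt outside the app before changing it in the code or via the prompt override setting. A hedged sketch, with a placeholder deployment name, prompt wording, and environment variables:

```python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],      # placeholder env vars
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="gpt-35-turbo",                                    # your chat deployment name
    temperature=0.0,
    messages=[
        # Candidate query-generation prompt (step 1); tweak the wording and compare outputs.
        {"role": "system", "content": "Generate a concise search query for the user's question. "
                                      "Return only the query, with no quotes or explanations."},
        {"role": "user", "content": "Does my plan cover annual eye exams?"},
    ],
)
print(response.choices[0].message.content)
```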
Improving Azure AI Search results
If the problem is with Azure AI Search (step 2 above), the first step is to check what search parameters you’re using. Generally, the best results are found with hybrid search (text + vectors) plus the additional semantic re-ranking step, and that’s what we’ve enabled by default. There may be some domains where that combination isn’t optimal, however.
Configuring parameters in the app
You can change many of the search parameters under “Developer settings” in the frontend and see if results improve for your queries. The most relevant options are the retrieval mode (text, vectors, or hybrid) and whether semantic re-ranking is enabled.
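You can also reproduce the retrieval step outside the app to compare text, vector, and hybrid results for a tricky query. A minimal sketch, assuming the azure-search-documents SDK (11.4+), placeholder environment variables, an embedding field named embedding, and a semantic configuration named default:

```python
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],            # placeholder env vars
    index_name=os.environ["AZURE_SEARCH_INDEX"],
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

query = "Does my plan cover annual eye exams?"
query_vector = [0.0] * 1536  # replace with a real embedding of the query

results = search_client.search(
    search_text=query,                                       # text part of the hybrid query
    vector_queries=[VectorizedQuery(vector=query_vector, k_nearest_neighbors=50, fields="embedding")],
    query_type="semantic",                                   # enable semantic re-ranking
    semantic_configuration_name="default",                   # assumed configuration name
    top=3,
)
for doc in results:
    print(doc["id"], doc["content"][:80])                    # field names assume the demo's schema
```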
Evaluating answer quality
Once you’ve made changes to the prompts or settings, you’ll want to rigorously evaluate the results to see if they’ve improved. You can use tools in the AI RAG Chat evaluator repository to run evaluations, review results, and compare answers across runs.
Evaluate your Chat App:
Let’s see how to evaluate a chat app’s answers against a set of correct or ideal answers (known as ground truth). Whenever you change your chat application in a way that affects the answers, run an evaluation to compare the changes. This demo application offers tools that make it easier to run evaluations.
By following the instructions below, you will:
- Use provided sample prompts tailored to the subject domain. These are already in the repository.
- Generate sample user questions and ground truth answers from your own documents.
- Run evaluations using a sample prompt with the generated user questions.
- Review analysis of answers.
Architectural overview
Key components of the architecture include:
- Azure-hosted chat app: The chat app runs in Azure App Service. The chat app conforms to the chat protocol, which allows the evaluations app to run against any chat app that conforms to the protocol.
- Azure AI Search: The chat app uses Azure AI Search to store the data from your own documents.
- Sample questions generator: Can generate a number of questions for each document along with the ground truth answer. The more questions, the longer the evaluation.
- Evaluator: Runs sample questions and prompts against the chat app and returns the results.
- Review tool: Allows you to review the results of the evaluations.
- Diff tool: Allows you to compare the answers between evaluations.
Prerequisites
You’ll need the following Azure resource information from your chat app deployment, which is referred to as the chat app in this article:
- Web API URI: The URI of the deployed chat app API.
- Azure AI Search. The following values are required:
- Resource name: The name of the Azure AI Search resource.
- Index name: The name of the Azure AI Search index where your documents are stored.
- Query key: The key to query your Search index.
- If you experimented with the chat app authentication, you need to disable user authentication so the evaluation app can access the chat app.
Prepare environment values and configuration information
azd env get-values > .env
AZURE_SEARCH_SERVICE="<service-name>"
AZURE_SEARCH_INDEX="<index-name>"
AZURE_SEARCH_KEY="<query-key>"
Generate sample data
python3 -m scripts generate --output=my_input/qa.jsonl --numquestions=14 --persource=2
Run first evaluation with a refined prompt
Run second evaluation with a weak prompt
Run third evaluation with a specific temperature
Review the evaluation results
You have performed three evaluations based on different prompts and app settings. The results are stored in the my_results folder. Review how the results differ based on the settings.
- Use the review tool to see the results of the evaluations:
python3 -m review_tools summary my_results

| Metric | Description |
| --- | --- |
| Groundedness | This refers to how well the model’s responses are based on factual, verifiable information. A response is considered grounded if it’s factually accurate and reflects reality. |
| Relevance | This measures how closely the model’s responses align with the context or the prompt. A relevant response directly addresses the user’s query or statement. |
| Coherence | This refers to how logically consistent the model’s responses are. A coherent response maintains a logical flow and doesn’t contradict itself. |
| Citation | This indicates whether the answer was returned in the format requested in the prompt. |
| Length | This measures the length of the response. |
Compare the answers
Compare the returned answers from the evaluations.
- Select two of the evaluations to compare, then use the same review tool to compare the answers:
Suggestions for further evaluations
- Edit the prompts in my_input to tailor the answers, such as subject domain, length, and other factors.
- Edit the my_config.json file to change parameters such as temperature and semantic_ranker, and rerun experiments.
- Compare different answers to understand how the prompt and question impact the answer quality.
- Generate a separate set of questions and ground truth answers for each document in the Azure AI Search index. Then rerun the evaluations to see how the answers differ.
- Alter the prompts to indicate shorter or longer answers by adding the requirement to the end of the prompt, for example: Please answer in about 3 sentences.
Load testing Python chat app using RAG with Locust
The primary objective of load testing is to ensure that the expected load on your chat application does not exceed the current Azure OpenAI Transactions Per Minute (TPM) quota.
By simulating user behavior under heavy load, you can identify potential bottlenecks and scalability issues in your application. This process is crucial for ensuring that your chat application remains responsive and reliable, even when faced with a high volume of user requests.
Load testing
We will perform load testing on a Python chat application using the RAG pattern with Locust, a popular open-source load testing tool.
It is recommended to run a load test for your expected number of users. You can use the Locust tool with the locustfile.py in this sample, or set up a load test with Azure Load Testing.
What is Locust?
Locust is an open-source performance/load testing tool for HTTP and other protocols. Its developer-friendly approach lets you define your tests in regular Python code.
Locust tests can be run from command line or using its web-based UI. Throughput, response times and errors can be viewed in real time and/or exported for later analysis.
You can import regular Python libraries into your tests, and with Locust’s pluggable architecture, it is infinitely expandable. Unlike when using most other tools, your test design will never be limited by a GUI or domain-specific language.
To use Locust, first install the dev requirements, which include locust:
python -m pip install -r requirements-dev.txt
Or manually install locust:
python -m pip install locust
Then run the locust command, specifying the name of the User class to use from locustfile.py.
We’ve provided a ChatUser class that simulates a user asking questions and receiving answers, as well as a ChatVisionUser to simulate a user asking questions with the GPT-4 vision mode enabled.
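As a rough idea of what such a user class looks like, here is a minimal sketch. The repository’s locustfile.py is the reference; the endpoint path, payload shape, and sample questions here are assumptions based on the chat protocol and the sample data:

```python
import random
from locust import HttpUser, between, task

class ChatUser(HttpUser):
    # Wait 5-20 seconds between requests to mimic a human reading the answer.
    wait_time = between(5, 20)

    @task
    def ask_question(self):
        question = random.choice([
            "What is included in my Northwind Health Plus plan?",
            "Does my plan cover annual eye exams?",
        ])
        self.client.post(
            "/chat",
            json={
                "messages": [{"role": "user", "content": question}],
                "context": {"overrides": {"retrieval_mode": "hybrid"}},
            },
        )
```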
locust ChatUser
Open the Locust UI at http://localhost:8089/, the URI displayed in the terminal.
Start a new test with the URI of your website, e.g. https://my-chat-app.azurewebsites.net. Do not end the URI with a slash.
You can start by pointing at your localhost if you’re concerned more about load on OpenAI/AI Search than the host platform.
For the number of users and spawn rate, we recommend starting with 20 users and a spawn rate of 1 user per second. From there, you can keep increasing the number of users to simulate your expected load.
For example, you can run a load test with 50 users and a spawn rate of 1 user per second.
Evaluation
Before you make your chat app available to users, you’ll want to rigorously evaluate the answer quality. You can use tools in the AI RAG Chat evaluator repository to run evaluations, review results, and compare answers across runs.
The GitHub repository for the Azure-Samples ai-rag-chat-evaluator provides tools for evaluating chat applications that use the Retrieval Augmented Generation (RAG) architecture. Here’s a summary of the major steps involved in the evaluation process:
Setting Up the Project:
- Install Python 3.10 or higher.
- Create a Python virtual environment and install the required packages with python -m pip install -r requirements.txt.
Deploying a GPT-4 Model:
- Use a GPT-4 model for evaluation, even if your chat app uses a different model.
- Deploy a new Azure OpenAI instance using the Azure Developer CLI (azd), or use an existing Azure OpenAI or openai.com instance.
Generating Ground Truth Data:
- Generate data that represents the expected outcomes for the chat app’s responses.
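For orientation, the generated ground truth is a JSONL file of question/answer pairs along the lines of the sketch below. The field names, example content, and file location are illustrative assumptions, so check the output of the generate script in the repository:

```python
import json

# Illustrative pair only; real ground truth comes from your own documents.
examples = [
    {"question": "What does the Northwind Health Plus plan cover?",
     "truth": "It covers medical, vision, and dental services for employees."},
]
with open("my_input/qa.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```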
Running an Evaluation:
- Execute the evaluation scripts to assess the chat app’s performance against the ground truth data.
Viewing the Results:
- Use the provided tools to view and analyze the results of the evaluation.
Measuring the App’s Ability to Say “I Don’t Know”:
- Evaluate how well the chat app can handle questions it should not answer by generating ground truth data for answer-less questions and running evaluations for them.
These steps are designed to help you measure and improve the quality of responses generated by a RAG chat application.
The repository also includes examples and scripts to facilitate the evaluation process. Remember to check the repository for detailed instructions and additional context.
Using the summary tool
To view a summary across all the runs, use the summary command with the path to the results folder:
python -m review_tools summary example_results
This will display an interactive table with the results for each run.
To see the parameters used for a particular run, select the folder name. A modal will appear with the parameters, including any prompt override.
Using the compare tool
To compare the answers generated for each question across 2 runs, use the compare command with 2 paths:
python -m review_tools diff example_results/baseline_1 example_results/baseline_2
This will display each question, one at a time, with the two generated answers in scrollable panes and the GPT metrics below each answer.
Monitoring with Application Insights
Monitoring a Chat AI application is crucial for understanding its performance and user interactions. The GitHub repository for the Azure-Samples azure-search-openai-demo outlines the steps for setting up monitoring with Application Insights. Here’s a summary of the major steps:
Integration of Application Insights:
- Ensure that Application Insights is integrated into your Chat AI application. This will enable the collection of telemetry data.
Instrumentation Key Setup:
- Obtain the Instrumentation Key from your Application Insights resource in Azure. This key is essential for connecting your application to Application Insights.
Adding Telemetry to the Application:
- Modify your application code to send telemetry data to Application Insights. This typically involves adding calls to track events, exceptions, dependencies, and other metrics.
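A minimal sketch of wiring Python telemetry into Application Insights using the azure-monitor-opentelemetry distro. The environment variable usage and span name are placeholders, and the demo app configures this for you when Application Insights is enabled:

```python
import os
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# Connection string comes from the Application Insights resource in Azure.
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
)

tracer = trace.get_tracer(__name__)

def handle_chat_request(question: str) -> None:
    # Each span shows up under the request's end-to-end trace in Application Insights.
    with tracer.start_as_current_span("chat_request"):
        ...  # call Azure AI Search and the ChatCompletion API here
```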
Configuring Telemetry Modules:
- Set up various telemetry modules to collect different types of data, such as requests, traces, and performance counters.
Setting Up Dashboards and Alerts:
- Create dashboards in the Azure portal to visualize the telemetry data.
- Configure alerts to notify you of any unusual activity or performance issues.
Analyzing Telemetry Data:
- Regularly review the collected telemetry data to gain insights into the application’s performance and usage patterns.
Performance Tracing:
- Utilize performance tracing features to diagnose and troubleshoot any potential bottlenecks or issues within the application.
By following these steps, you can effectively monitor your Chat AI application and ensure it is performing optimally. Remember to refer to the repository for detailed instructions and additional context on setting up and using Application Insights with your Chat AI.
Viewing the data:
By default, deployed apps use Application Insights for the tracing of each request, along with the logging of errors.
To see the performance data, go to the Application Insights resource in your resource group, click on the “Investigate -> Performance” blade, and navigate to any HTTP request to see the timing data. To inspect the performance of chat requests, use the “Drill into Samples” button to see end-to-end traces of all the API calls made for any chat request.
To see any exceptions and server errors, navigate to the “Investigate -> Failures” blade and use the filtering tools to locate a specific exception. You can see Python stack traces on the right-hand side.
You can also see chart summaries on a dashboard by running the following command:
azd monitor
Conclusion
In this blog post series, we have explored the various aspects of developing an Enterprise Chat AI solution. In the first part, we delved into the architecture of this AI technology and showcased a demo application to demonstrate its functionality.
In the second part, we focused on the process of setting up DevOps pipelines and productionizing our Chat AI solution. We reviewed various measures such as scaling, cost, and security to ensure that it is robust and scalable.
In the final part, we delved into the realm of GenAI Ops, focusing on load testing, evaluation, monitoring, and managing Enterprise Chat AI applications. By following this blog series, you will gain a comprehensive understanding of Enterprise Chat AI: its architecture, development process, operational practices, and best practices.
Stay tuned for more blogs in this field as I delve deeper into this exciting topic.

