
Part 3: Enterprise Chat AI: Customizing Data, UI, Document Security; Evaluation, Scaling, Load Testing, Monitoring

This blog is the continuation of the Enterprise Chat AI series and its final post.

In the first part, we delved into the architecture of Enterprise Chat AI and showcased a demo application to explore its functionality. Refer to this link for all the details: Part 1: Enterprise Chat AI: Architecture and Demo

In the second part, we explored the process of setting up DevOps pipelines and productionizing our Chat AI solution, and reviewed scaling, cost, and security measures. Refer to this link for all the details: Part 2: Enterprise Chat AI: Guidelines for productionizing (DevOps setup, Scaling, Cost optimization, Security)

In this final part, we delve into the realm of GenAI Ops and into customizing the data and UI (indexing process, chunking, integrated vectorization), with a focus on load testing, evaluation, and monitoring of Enterprise Chat AI applications.

Customizing the Data

The Chat App is designed to work with any PDF documents. The sample data is provided to help you get started quickly, but you can easily replace it with your own data. You’ll want to first remove all the existing data, then add your own.

We will be using the prepdocs script to index documents for the Chat App.

In order to ingest a given document format, we need a tool that can turn it into text. By default, the script uses Azure Document Intelligence, but local parsers are also available for several formats. The local parsers are not as sophisticated as Azure Document Intelligence, but they can be used to reduce charges.

Overview of the manual indexing process

The prepdocs.py script is responsible for both uploading and indexing documents. The typical usage is to call it using scripts/prepdocs.sh (Mac/Linux) or scripts/prepdocs.ps1 (Windows), as these scripts will set up a Python virtual environment and pass in the required parameters based on the current azd environment. Whenever azd up or azd provision is run, the script is called automatically.

The script uses the following steps to index documents:

  1. If it doesn’t yet exist, create a new index in Azure AI Search.
  2. Upload the PDFs to Azure Blob Storage.
  3. Split the PDFs into chunks of text.
  4. Upload the chunks to Azure AI Search. If using vectors (the default), also compute the embeddings and upload those alongside the text.
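
To make those steps concrete, here is a minimal Python sketch of the same flow. The helpers list_pdfs, upload_blob, extract_text, split_into_chunks and compute_embedding are hypothetical placeholders, and the field names are only illustrative; the real prepdocs.py organizes this work into dedicated parser, splitter and uploader classes.

import os
from azure.identity import DefaultAzureCredential
from azure.search.documents import SearchClient
from azure.storage.blob import BlobServiceClient

# Minimal sketch of the manual indexing flow (step 1, creating the index if it
# doesn't exist, is omitted here). Helper functions are hypothetical.
credential = DefaultAzureCredential()
blob_service = BlobServiceClient(
    f"https://{os.environ['AZURE_STORAGE_ACCOUNT']}.blob.core.windows.net", credential)
search_client = SearchClient(
    f"https://{os.environ['AZURE_SEARCH_SERVICE']}.search.windows.net",
    os.environ["AZURE_SEARCH_INDEX"], credential)

for pdf_path in list_pdfs("data/"):
    upload_blob(blob_service, pdf_path)                        # step 2: upload the PDF to Blob Storage
    pages = extract_text(pdf_path)                             # Document Intelligence or a local parser
    for i, chunk in enumerate(split_into_chunks(pages)):       # step 3: split into chunks of text
        search_client.upload_documents([{                      # step 4: upload chunks to Azure AI Search
            "id": f"{os.path.basename(pdf_path)}-{i}",
            "content": chunk,
            "embedding": compute_embedding(chunk),             # vectors are computed by default
            "sourcefile": os.path.basename(pdf_path),
        }])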

Refer to this post for more details on chunking, indexing additional documents, and removing documents.

Getting Started with Integrated Vectorization

Overview of Integrated Vectorization

Azure AI Search recently introduced an integrated vectorization feature in preview mode. This feature is a cloud-based approach to data ingestion, which takes care of document format cracking, data extraction, chunking, vectorization, and indexing, all with Azure technologies.

See this notebook to understand the process of setting up integrated vectorization. This code has been integrated into the solution's prepdocs script.

This feature cannot be used on an existing index; you need to create a new index, or drop and recreate an existing one. The newly created index schema adds a new field, ‘parent_id’, which is used internally by the indexer to manage the lifecycle of chunks.
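
For orientation, the chunk-level index used with integrated vectorization ends up with a shape roughly like the sketch below, written here as a Python dict mirroring a REST index definition. The field and property names are assumptions that vary by API version; the point to note is the parent_id field tying each chunk back to its source document.

# Illustrative only: rough shape of an index used with integrated vectorization.
index_definition = {
    "name": "gptkbindex",
    "fields": [
        {"name": "chunk_id", "type": "Edm.String", "key": True},
        {"name": "parent_id", "type": "Edm.String", "filterable": True},  # ties a chunk to its source document
        {"name": "chunk", "type": "Edm.String", "searchable": True},      # the chunked text
        {"name": "text_vector", "type": "Collection(Edm.Single)",         # embedding of the chunk
         "dimensions": 1536, "vectorSearchProfile": "vector-profile"},
    ],
    # A vectorSearch section (algorithm, profile, vectorizer) plus a skillset with a
    # split skill and an embedding skill, driven by an indexer, complete the setup.
}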

Let’s understand the major steps needed for setting up Integrated Vectorization:

The azure-search-integrated-vectorization-sample.ipynb notebook in the Azure/azure-search-vector-samples GitHub repository provides a guide for setting up integrated vectorization for Chat AI using Azure AI Search. Here’s a summary of the major steps:

  1. Prerequisites
  2. Environment Setup
  3. Running the Notebook
  4. Troubleshooting

Refer to this post to know more about this: Getting Started with Integrated Vectorization

Customizing the App

Once you successfully deploy the app, you can start customizing it for your needs: changing the text, tweaking the prompts, and replacing the data. Consult the app customization guide as well as the data ingestion guide for more details.

Customizing the UI

The frontend is built using React and Fluent UI components. The frontend components are stored in the app/frontend/src folder. The typical components you’ll want to customize include the page layout, the chat page, and the example prompts.

Customizing the backend

The backend is built using Quart, a Python framework for asynchronous web applications. The backend code is stored in the app/backend folder. The frontend and backend communicate using the AI Chat App HTTP Protocol.
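
As a rough illustration of that protocol, a client posts a chat turn to the backend along these lines. This is a minimal sketch, not the exact contract: the port, override names and response shape shown are assumptions, so consult the protocol documentation for specifics.

import requests

# Minimal sketch of a non-streaming request to the backend's /chat endpoint.
response = requests.post(
    "http://localhost:50505/chat",       # assumed local port for the demo backend
    json={
        "messages": [
            {"role": "user", "content": "What is included in my health plan?"}
        ],
        "context": {
            "overrides": {"retrieval_mode": "hybrid", "semantic_ranker": True}
        },
    },
)
# The assistant's reply is carried in a "message" object in the response body.
print(response.json()["message"]["content"])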

Improving answer quality

Once you are running the chat app on your own data and with your own tailored system prompt, the next step is to test the app with questions and note the quality of the answers. If you notice any answers that aren’t as good as you’d like, here’s a process for improving them.

Identify the problem point

The first step is to identify where the problem is occurring. For example, if using the Chat tab, the problem could be:

  1. OpenAI ChatCompletion API is not generating a good search query based on the user question
  2. Azure AI Search is not returning good search results for the query
  3. OpenAI ChatCompletion API is not generating a good answer based on the search results and user question

You can look at the “Thought process” tab in the chat app to see each of those steps, and determine which one is the problem.

Improving OpenAI ChatCompletion results

If the problem is with the ChatCompletion API calls (steps 1 or 3 above), you can try changing the relevant prompt.

Improving Azure AI Search results

If the problem is with Azure AI Search (step 2 above), the first step is to check what search parameters you’re using. Generally, the best results are found with hybrid search (text + vectors) plus the additional semantic re-ranking step, and that’s what we’ve enabled by default. There may be some domains where that combination isn’t optimal, however.
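
To make “hybrid search plus semantic re-ranking” concrete, here is a hedged sketch of such a query using the azure-search-documents Python SDK. The field names (embedding, content, sourcepage), the semantic configuration name and the compute_embedding helper are assumptions for illustration; the app builds the equivalent query internally.

import os
from azure.identity import DefaultAzureCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    f"https://{os.environ['AZURE_SEARCH_SERVICE']}.search.windows.net",
    os.environ["AZURE_SEARCH_INDEX"], DefaultAzureCredential())

question = "What is included in my health plan?"
question_vector = compute_embedding(question)    # hypothetical helper calling the embeddings model

results = search_client.search(
    search_text=question,                        # keyword half of the hybrid query
    vector_queries=[VectorizedQuery(
        vector=question_vector, k_nearest_neighbors=50, fields="embedding")],
    query_type="semantic",                       # turn on the semantic re-ranking step
    semantic_configuration_name="default",       # assumed configuration name
    top=3,
)
for doc in results:
    print(doc["sourcepage"], doc["content"][:100])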

Configuring parameters in the app

You can change many of the search parameters in the “Developer settings” in the frontend and see if results improve for your queries. The most relevant options include the retrieval mode (text, vectors, or hybrid), the semantic ranker toggle, and the number of search results retrieved.

Evaluating answer quality

Once you’ve made changes to the prompts or settings, you’ll want to rigorously evaluate the results to see if they’ve improved. You can use tools in the AI RAG Chat evaluator repository to run evaluations, review results, and compare answers across runs.

Evaluate your Chat App

Let’s see how to evaluate a chat app’s answers against a set of correct or ideal answers (known as ground truth). Whenever you change your chat application in a way which affects the answers, run an evaluation to compare the changes. This demo application offers tools you can use today to make it easier to run evaluations.

By following the instructions below, you will generate sample ground truth data, run evaluations with different prompts and settings, and review and compare the results.

Architectural overview

Key components of the architecture include the deployed chat app (the evaluation target), the evaluation tool itself, the ground truth data, and the Azure OpenAI GPT-4 deployment used to judge the generated answers.

Prerequisites

You’ll need the following Azure resource information from that deployment, which is referred to as the chat app in this post:

# Get environment values from the chat app deployment
azd env get-values > .env

# Azure AI Search information the evaluator needs from that deployment
AZURE_SEARCH_SERVICE="<service-name>"
AZURE_SEARCH_INDEX="<index-name>"
AZURE_SEARCH_KEY="<query-key>"

Generate sample data

# Generate 14 question/answer pairs of ground truth data, at most 2 per source document
python3 -m scripts generate --output=my_input/qa.jsonl --numquestions=14 --persource=2

Run first evaluation with a refined prompt

Run second evaluation with a weak prompt

Run third evaluation with a specific temperature
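
The three runs above differ only in the parameters handed to the evaluator, which are typically captured in a config file. The sketch below (a Python dict with assumed key names such as target_parameters and prompt_template) is only meant to show the kind of knobs involved, namely a prompt override and a temperature; check the ai-rag-chat-evaluator repository for the exact schema.

from pathlib import Path

# Illustrative config for the first (refined prompt) run; key names are assumptions.
refined_prompt_config = {
    "testdata_path": "my_input/qa.jsonl",                  # ground truth generated earlier
    "results_dir": "my_results/experiment_refined",        # where this run's results land
    "target_url": "http://localhost:50505/chat",           # the chat app being evaluated
    "target_parameters": {
        "overrides": {
            "prompt_template": Path("my_input/prompt_refined.txt").read_text(),
            "temperature": 0.3,
        }
    },
}

The weak-prompt and temperature runs would swap in a different prompt_template or temperature and write to a different results_dir, so the three result folders can be compared afterwards.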

Review the evaluation results

You have performed three evaluations based on different prompts and app settings. The results are stored in the my_results folder. Review how the results differ based on the settings.

Use the review tool to see the results of the evaluations:

python3 -m review_tools summary my_results

The summary reports the following metrics:

Groundedness: This refers to how well the model’s responses are based on factual, verifiable information. A response is considered grounded if it’s factually accurate and reflects reality.
Relevance: This measures how closely the model’s responses align with the context or the prompt. A relevant response directly addresses the user’s query or statement.
Coherence: This refers to how logically consistent the model’s responses are. A coherent response maintains a logical flow and doesn’t contradict itself.
Citation: This indicates whether the answer was returned in the format requested in the prompt.
Length: This measures the length of the response.

Compare the answers

Compare the returned answers from the evaluations.

Select two of the evaluations to compare, then use the same review tool to compare the answers, as shown below.
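
This mirrors the review_tools diff command shown later in this post; with the results generated above it would look something like the following, where the run folder names are placeholders for whichever folders exist under my_results:

python3 -m review_tools diff my_results/<run-folder-1> my_results/<run-folder-2>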

Suggestions for further evaluations

Load testing the Python chat app (RAG) with Locust

The primary objective of load testing is to ensure that the expected load on your chat application does not exceed the current Azure OpenAI Tokens Per Minute (TPM) quota.

By simulating user behavior under heavy load, you can identify potential bottlenecks and scalability issues in your application. This process is crucial for ensuring that your chat application remains responsive and reliable, even when faced with a high volume of user requests.
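
As a rough, illustrative calculation (the numbers are assumptions, not measurements): if each chat turn consumes about 1,000 tokens across query generation and answering, a deployment with a 30,000 TPM quota can sustain only around 30 chat turns per minute before throttling kicks in. Load testing shows where that ceiling actually sits for your application.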

Load testing

We will perform load testing on a Python chat application using the RAG pattern with Locust, a popular open-source load testing tool.

It is recommended to run a load test for your expected number of users.

You can use the locust tool with the locustfile.py in this sample or set up a load test with Azure Load Testing.

What is Locust?

Locust is an open-source performance/load testing tool for HTTP and other protocols. Its developer-friendly approach lets you define your tests in regular Python code.

Locust tests can be run from the command line or using its web-based UI. Throughput, response times and errors can be viewed in real time and/or exported for later analysis.

You can import regular Python libraries into your tests, and with Locust’s pluggable architecture, it is infinitely expandable. Unlike when using most other tools, your test design will never be limited by a GUI or domain-specific language.

To use Locust, first install the dev requirements, which include locust:

python -m pip install -r requirements-dev.txt

Or install locust manually:

python -m pip install locust

Then run the locust command, specifying the name of the User class to use from locustfile.py.

We’ve provided a ChatUser class that simulates a user asking questions and receiving answers, as well as a ChatVisionUser to simulate a user asking questions with the GPT-4 vision mode enabled.
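
For reference, a user class along the lines of the provided ChatUser might look like the sketch below. The request payload is an assumption based on the chat protocol, and the sample questions are placeholders; see the sample's locustfile.py for the real implementation.

import random
from locust import HttpUser, task, between

class ChatUser(HttpUser):
    # Pause between questions, like a real user reading the answer.
    wait_time = between(5, 20)

    @task
    def ask_question(self):
        question = random.choice([
            "What is included in my Northwind Health Plus plan?",
            "Does my plan cover eye exams?",
        ])
        # Payload shape follows the chat protocol sketch shown earlier.
        self.client.post("/chat", json={
            "messages": [{"role": "user", "content": question}],
            "context": {"overrides": {"retrieval_mode": "hybrid"}},
        })

With such a class defined, you point Locust at it by name: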

locust ChatUser

Open the Locust UI at http://localhost:8089/, the URI displayed in the terminal.

Start a new test with the URI of your website, e.g. https://my-chat-app.azurewebsites.net. Do not end the URI with a slash.

You can start by pointing at your localhost if you’re concerned more about load on OpenAI/AI Search than on the host platform.

For the number of users and spawn rate, we recommend starting with 20 users and 1 user per second. From there, you can keep increasing the number of users to simulate your expected load.

Here’s an example load test for 50 users and a spawn rate of 1 user per second.

Evaluation

Repo: Azure-Samples/ai-rag-chat-evaluator: Tools for evaluation of RAG Chat Apps using Azure AI Evaluate SDK and OpenAI (github.com)

Before you make your chat app available to users, you’ll want to rigorously evaluate the answer quality. You can use tools in the AI RAG Chat evaluator repository to run evaluations, review results, and compare answers across runs.

The GitHub repository for the Azure-Samples ai-rag-chat-evaluator provides tools for evaluating chat applications that use the Retrieval-Augmented Generation (RAG) architecture. Here’s a summary of the major steps involved in the evaluation process:

  1. Setting Up the Project
  2. Deploying a GPT-4 Model
  3. Generating Ground Truth Data
  4. Running an Evaluation
  5. Viewing the Results
  6. Measuring the App’s Ability to Say “I Don’t Know”

These steps are designed to help you measure and improve the quality of responses generated by a RAG chat application.

The repository also includes examples and scripts to facilitate the evaluation process. Remember to check the repository for detailed instructions and additional context.

Using the summary tool

To view a summary across all the runs, use the summary command with the path to the results folder:

python -m review_tools summary example_results

This will display an interactive table with the results for each run.

To see the parameters used for a particular run, select the folder name. A modal will appear with the parameters, including any prompt override.

Using the compare tool

To compare the answers generated for each question across 2 runs, use the compare command with 2 paths:

python -m review_tools diff example_results/baseline_1 example_results/baseline_2

This will display each question, one at a time, with the two generated answers in scrollable panes, and the GPT metrics below each answer.

Monitoring with Application Insights

Monitoring a Chat AI application is crucial for understanding its performance and user interactions. The GitHub repository for the Azure-Samples azure-search-openai-demo outlines the steps for setting up monitoring with Application Insights. Here’s a summary of the major steps:

  1. Integration of Application Insights
  2. Instrumentation Key Setup
  3. Adding Telemetry to the Application
  4. Configuring Telemetry Modules
  5. Setting Up Dashboards and Alerts
  6. Analyzing Telemetry Data
  7. Performance Tracing

By following these steps, you can effectively monitor your Chat AI application and ensure it is performing optimally. Remember to refer to the repository for detailed instructions and additional context on setting up and using Application Insights with your Chat AI.
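
As an illustration of what that wiring can look like in a Python backend, the azure-monitor-opentelemetry distro routes traces and logs to Application Insights roughly as follows. This is a minimal sketch assuming the connection string is exposed as an environment variable; the sample's actual setup lives in its backend configuration code.

import os
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# Route OpenTelemetry traces, metrics and logs to Application Insights.
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("chat-request"):
    # Handle the chat request here; the span shows up in end-to-end traces.
    pass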

Viewing the data

By default, deployed apps use Application Insights for the tracing of each request, along with the logging of errors.

To see the performance data, go to the Application Insights resource in your resource group, click on the “Investigate -> Performance” blade and navigate to any HTTP request to see the timing data. To inspect the performance of chat requests, use the “Drill into Samples” button to see end-to-end traces of all the API calls made for any chat request.

To see any exceptions and server errors, navigate to the “Investigate -> Failures” blade and use the filtering tools to locate a specific exception. You can see Python stack traces on the right-hand side.

You can also see chart summaries on a dashboard by running the following command:

azd monitor

Conclusion

In this blog post series, we have explored the various aspects of developing an Enterprise Chat AI solution. In the first part, we delved into the architecture of this AI technology and showcased a demo application to demonstrate its functionality.

In the second part, we focused on the process of setting up DevOps pipelines and productionizing our Chat AI solution. We reviewed various measures such as scaling, cost, and security to ensure that it is robust and scalable.

In the final part, we delved into the realm of GenAI Ops, focusing on load testing, evaluation, monitoring, and managing Enterprise Chat AI applications. By following this blog series, you will gain a comprehensive understanding of Enterprise Chat AI, its architecture, development process, monitoring process, and best practices.

Stay tuned for more blogs in this field as I delve deeper into this exciting topic.

References:

An Introduction to LLMOps: Operationalizing and Managing Large Language Models using Azure ML (microsoft.com)

azure-search-vector-samples/demo-python/code/integrated-vectorization/azure-search-integrated-vectorization-sample.ipynb at main · Azure/azure-search-vector-samples (github.com)

Azure-Samples/ai-rag-chat-evaluator: Tools for evaluation of RAG Chat Apps using Azure AI Evaluate SDK and OpenAI (github.com)

Azure-Samples/azure-search-openai-demo: A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences. (github.com)
