In this second part of the blog, we will explore the process of setting up DevOps pipelines, productionizing our Chat AI solution, and reviewing scaling, cost, and security measures.

Overview:

In the second part of the blog, the focus will be on the following key areas:

  1. Setting up DevOps Pipelines: This involves creating automated pipelines that facilitate continuous integration and continuous delivery (CI/CD) for the Chat AI solution.
  2. Productionizing the Chat AI Solution: This phase involves preparing the Chat AI solution for deployment in a production environment.
  3. Reviewing Scaling Measures: This involves assessing the ability of the Chat AI solution to handle increased loads. It includes strategies for scaling up (adding more resources) or scaling out (adding more instances).
  4. Cost Review: This involves analyzing the financial implications of deploying and maintaining the Chat AI solution. It includes considerations such as infrastructure costs, licensing fees, and operational expenses.
  5. Security Measures Review: This involves evaluating the security measures in place to protect the Chat AI solution. It includes aspects such as data security, user authentication, and system integrity.

We will now cover these topics in detail.

DevOps pipeline setup:

Let's explore how to use a GitHub Actions workflow with the Azure Developer CLI to deploy an AI RAG chat app to production.

The GitHub repo for this solution is here: GitHub Repo for DevOps pipeline

Below are the high-level steps of the DevOps pipeline.

  1. Install Tools: The GitHub Actions workflow installs the necessary tools for provisioning and deployment, including Node.js for building the front end.
  2. Login with Federated Credentials: The workflow logs into Azure using federated credentials tied to a service principal, which is linked to specific branches and actions in the repository.
  3. Setup Pipeline Config: The pipeline configuration is set up using azd pipeline config.
  4. Provision Infrastructure: The workflow provisions the infrastructure by comparing the Bicep files in the infra folder with what’s already provisioned. If there are no new changes in the Bicep files, this step is skipped.
  5. Build Front End Code: The front-end code is built, converting the TypeScript/React components into static JavaScript files.
  6. Package and Upload: The Python and JavaScript files are packaged together and uploaded as a zip to the app service.
  7. Deployment: The application is deployed, and the URL is provided. The deployment process takes about 5 minutes.
  8. Verification: The deployed application is verified by visiting the provided URL and checking the functionality of the custom chat app.

Local development of Chat App

azure-search-openai-demo/docs/localdev.md at main · Azure-Samples/azure-search-openai-demo (github.com)

You can only run locally after having successfully run the azd up command.

  1. Run azd auth login
  2. Change dir to app
  3. Run ./start.ps1 or ./start.sh or run the “VS Code Task: Start App” to start the project locally.

When you run ./start.ps1 or ./start.sh, the backend files will be watched and reloaded automatically.

Hot reloading frontend and backend files

To enable hot reloading of frontend files, open a new terminal and navigate to the frontend directory:

cd app/frontend

npm run dev

You should see:

> frontend@0.0.0 dev
> vite
  VITE v4.5.1  ready in 957 ms

  ➜  Local:   http://localhost:5173/
  ➜  Network: use --host to expose
  ➜  press h to show help

Deploying Chat AI

If you’ve only changed the backend/frontend code in the app folder, then you don’t need to re-provision the Azure resources.

You can just run:

azd deploy

If you’ve changed the infrastructure files (infra folder or azure.yaml), then you’ll need to re-provision the Azure resources. You can do that by running:

azd up

Productionizing

This solution is designed to be a starting point for your own production application, but you should do a thorough review of security and performance before deploying it to production.

In this section, we will analyze the points below.

  • Scaling: scaling Azure resources and Azure OpenAI, including scaling Azure OpenAI with Azure Container Apps and APIM.
  • Cost: cost considerations for these services.
  • Security: security measures such as authentication, networking, enabling login for document upload, and document security.

Let’s explore these areas in detail.

Scaling Resources

OpenAI Capacity

The default TPM (tokens per minute) quota is set to 30K. That is equivalent to approximately 30 conversations per minute (assuming 1K tokens per user message/response).

You can increase the capacity by changing the chatGptDeploymentCapacity and embeddingDeploymentCapacity parameters in infra/main.bicep to your account’s maximum capacity. 
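
As a rough illustration of how the TPM quota maps to chat throughput, here is a back-of-the-envelope calculation in Python; the 1K tokens-per-exchange figure is an assumption for sizing only, not an Azure limit.

# Rough capacity estimate: how many chat exchanges per minute a TPM quota supports.
# The tokens-per-exchange figure is an assumption for sizing purposes only.
def max_exchanges_per_minute(tpm_quota: int, avg_tokens_per_exchange: int = 1000) -> int:
    return tpm_quota // avg_tokens_per_exchange

for quota in (30_000, 60_000, 120_000):
    print(f"{quota:>7} TPM -> ~{max_exchanges_per_minute(quota)} exchanges/minute")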

Guidelines:

  • Options in case maximum TPM isn’t enough for your load
    • Use a backoff mechanism to retry the request. This is helpful if you’re running into a short-term quota due to bursts of activity but aren’t over long-term quota. The tenacity library is a good option for this, and this pull request shows how to apply it to this app.
    • If you are consistently going over the TPM, then consider implementing a load balancer between OpenAI instances.

You can implement the load balancer with Azure API Management or a container-based solution; a native Python approach that integrates with the OpenAI Python API library is also possible.
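
Here is a minimal sketch of the backoff option described above, wrapping an Azure OpenAI chat completion call with the tenacity library so that rate-limit (429) errors are retried with exponential backoff. The endpoint, deployment name, and retry parameters are placeholders, not the sample's actual configuration; the pull request linked above shows how the repo itself applies tenacity.

import os
from openai import AzureOpenAI, RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential

# Placeholder configuration: replace with your own endpoint, key, and API version.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

@retry(
    retry=retry_if_exception_type(RateLimitError),  # only retry on 429 throttling
    wait=wait_random_exponential(min=1, max=30),    # exponential backoff with jitter
    stop=stop_after_attempt(5),                     # give up after five attempts
)
def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="chat",  # your Azure OpenAI deployment name
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

print(ask("What is covered in the employee handbook?"))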

Azure Storage

The default storage account uses the Standard_LRS SKU.

To improve resiliency, we recommend using Standard_ZRS for production deployments, which you can specify using the sku property under the storage module in infra/main.bicep.

Azure AI Search

The default search service uses the Standard SKU with the free semantic search option, which gives you 1,000 free queries a month.

Assuming your app will experience more than 1,000 questions a month, you should either change semanticSearch to “standard” or disable semantic search entirely in the /app/backend/approaches files.
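
To make the distinction concrete, the sketch below shows a semantic query next to a plain keyword query using the azure-search-documents SDK; the index name and semantic configuration name are assumptions, so use the values from your own deployment.

import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="gptkbindex",  # assumed index name
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

# Semantic ranking: requires the semanticSearch SKU and a semantic configuration on the index.
semantic_results = search_client.search(
    search_text="What is the vacation policy?",
    query_type="semantic",
    semantic_configuration_name="default",  # assumed configuration name
    top=3,
)

# Plain keyword (BM25) search: no semantic quota is consumed.
plain_results = search_client.search(search_text="What is the vacation policy?", top=3)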

If you see errors about search service capacity being exceeded, you may find it helpful to increase the number of replicas by changing replicaCount in infra/core/search/search-services.bicep or manually scaling it from the Azure Portal.

Azure App Service

The default app service plan uses the Basic SKU with 1 CPU core and 1.75 GB RAM. We recommend using a Premium level SKU, starting with 1 CPU core. You can use auto-scaling rules or scheduled scaling rules, and adjust the maximum/minimum instance counts based on load.

Let’s summarize the scaling of each component:

  • OpenAI: currently 30K TPM. Recommendation: increase the quota (see the Quotas tab in Azure OpenAI Studio) or implement load balancing across OpenAI instances.
  • Azure Storage: currently Standard_LRS. Recommendation: Standard_ZRS.
  • Azure AI Search: currently the Standard SKU with free semantic search. Recommendation: Standard SKU with semanticSearch set to standard.
  • Azure App Service: currently the Basic SKU. Recommendation: a Premium SKU.

Scale Azure OpenAI with Azure Container Apps

You can add load balancing to your application to extend the chat app beyond the Azure OpenAI token and model quota limits. This approach uses Azure Container Apps to create three Azure OpenAI endpoints, as well as a primary container to direct incoming traffic to one of the three endpoints.

This option requires you to deploy two separate samples:

  • Chat app
    • If you haven’t deployed the chat app yet, wait until after the load balancer sample is deployed.
    • If you have already deployed the chat app once, you’ll change the environment variable to support a custom endpoint for the load balancer and redeploy it again.
  • Load balancer app

Architecture for load balancing Azure OpenAI with Azure Container Apps

Because the Azure OpenAI resource has specific token and model quota limits, a chat app using a single Azure OpenAI resource is prone to have conversation failures due to those limits.

To use the chat app without hitting those limits, use a load-balanced solution with Azure Container Apps. This solution seamlessly exposes a single endpoint from Azure Container Apps to your chat app server.

The Azure Container App sits in front of a set of Azure OpenAI resources. The Container App handles two scenarios: normal and throttled. During a normal scenario, where token and model quota is available, the Azure OpenAI resource returns a 200 response back through the Container App and app server.

When a resource is throttled, for example due to quota limits, the Azure Container App can immediately retry a different Azure OpenAI resource to fulfill the original chat app request.
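
The Container Apps sample implements this failover for you, but the idea can be illustrated with a simplified client-side sketch that tries each Azure OpenAI endpoint in turn and falls through to the next one when a request is throttled. The endpoint list and deployment name below are placeholders, not part of the sample.

import os
from openai import AzureOpenAI, RateLimitError

# Hypothetical pool of Azure OpenAI endpoints; in the sample, this logic lives in the
# Container Apps load balancer rather than in the chat app itself.
ENDPOINTS = [
    "https://openai-eastus.openai.azure.com",
    "https://openai-westus.openai.azure.com",
    "https://openai-northeurope.openai.azure.com",
]

def chat_with_failover(question: str) -> str:
    last_error = None
    for endpoint in ENDPOINTS:
        client = AzureOpenAI(
            azure_endpoint=endpoint,
            api_key=os.environ["AZURE_OPENAI_API_KEY"],
            api_version="2024-02-01",
        )
        try:
            response = client.chat.completions.create(
                model="chat",  # deployment name, assumed identical on every resource
                messages=[{"role": "user", "content": question}],
            )
            return response.choices[0].message.content
        except RateLimitError as err:  # 429: this resource is throttled
            last_error = err           # move on and try the next endpoint
    raise RuntimeError("All Azure OpenAI endpoints are throttled") from last_error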

Below are the high-level steps:

  1. Open the Container Apps load balancer sample app.
  2. Deploy the Azure Container Apps load balancer.
  3. Get the deployment endpoint.
  4. Redeploy the chat app with the load balancer endpoint.
  5. Stream logs to see the load balancer results.
  6. Configure the tokens per minute (TPM) quota.

Scale Azure OpenAI for Python with Azure API Management

This option shows how to add enterprise-grade load balancing to your application to extend the chat app beyond the Azure OpenAI token and model quota limits. It uses Azure API Management to intelligently direct traffic between three Azure OpenAI resources.

This approach requires you to deploy two separate samples:

  • Chat app
    • If you haven’t deployed the chat app yet, wait until after the load balancer sample is deployed.
    • If you have already deployed the chat app once, you’ll change the environment variable to support a custom endpoint for the load balancer and redeploy it again.
  • Load balancer with Azure API Management

Architecture for load balancing Azure OpenAI with Azure API Management

Because the Azure OpenAI resource has specific token and model quota limits, a chat app using a single Azure OpenAI resource is prone to have conversation failures due to those limits.

To use the chat app without hitting those limits, use a load balanced solution with Azure API Management. This solution seamlessly exposes a single endpoint from Azure API Management to your chat app server.

The Azure API Management resource, as an API layer, sits in front of a set of Azure OpenAI resources. The API layer applies to two scenarios: normal and throttled. During a normal scenario where token and model quota is available, the Azure OpenAI resource returns a 200 back through the API layer and backend app server.

When a resource is throttled due to quota limits, the API layer can retry a different Azure OpenAI resource immediately to fulfill the original chat app request.
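
Once the API Management load balancer is deployed, the chat app only needs to point its OpenAI client at the APIM gateway instead of an individual Azure OpenAI resource. The sketch below is a hedged illustration: the gateway URL and the use of a subscription key header are assumptions that depend on how your APIM instance and policies are configured.

import os
from openai import AzureOpenAI

# Assumed APIM gateway base URL; the real URL and auth scheme come from your APIM setup.
client = AzureOpenAI(
    azure_endpoint="https://my-apim-instance.azure-api.net",
    api_key=os.environ.get("AZURE_OPENAI_API_KEY", "placeholder"),
    api_version="2024-02-01",
    default_headers={
        # Many APIM policies authenticate callers with a subscription key header.
        "Ocp-Apim-Subscription-Key": os.environ["APIM_SUBSCRIPTION_KEY"],
    },
)

response = client.chat.completions.create(
    model="chat",  # deployment name exposed behind the APIM facade
    messages=[{"role": "user", "content": "Summarize the benefits policy."}],
)
print(response.choices[0].message.content)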

The high-level steps are similar:

  1. Open the Azure API Management load balancer sample app.
  2. Deploy the Azure API Management load balancer.
  3. Get the load balancer endpoint.
  4. Redeploy the chat app with the load balancer endpoint.
  5. Configure the tokens per minute (TPM) quota.

Cost

Most resources in this architecture use a basic or consumption pricing tier.

Consumption pricing is based on usage, which means you only pay for what you use. To complete this article, there will be a charge, but it will be minimal. When you’re done with the article, you can delete the resources to stop incurring charges.

Pricing varies per region and usage, so it isn’t possible to predict exact costs for your usage. However, you can try the Azure pricing calculator for the resources below.

  • Azure App Service: Basic Tier with 1 CPU core, 1.75 GB RAM. Pricing per hour. Pricing
  • Azure OpenAI: Standard tier, GPT and Ada models. Pricing per 1K tokens used, and at least 1K tokens are used per question. Pricing
  • Azure AI Document Intelligence: S0 (Standard) tier using pre-built layout. Pricing is per document page; the sample documents have 261 pages total. Pricing
  • Azure AI Search: Standard tier, 1 replica, free level of semantic search. Pricing per hour. Pricing
  • Azure Blob Storage: Standard tier with ZRS (Zone-redundant storage). Pricing per storage and read operations. Pricing
  • Azure Monitor: Pay-as-you-go tier. Costs based on data ingested. Pricing
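
Since exact costs can't be predicted, the sketch below only shows the shape of a back-of-the-envelope monthly estimate; every unit price in it is an illustrative placeholder, not a current Azure rate, so substitute figures from the pricing calculator.

# Back-of-the-envelope monthly cost estimate. All unit prices are illustrative
# placeholders; take real numbers from the Azure pricing calculator.
QUESTIONS_PER_DAY = 500
TOKENS_PER_QUESTION = 1_000          # prompt + completion, rough assumption
OPENAI_PRICE_PER_1K_TOKENS = 0.002   # placeholder, varies by model and region
APP_SERVICE_PER_HOUR = 0.10          # placeholder hourly rate for the chosen SKU
SEARCH_PER_HOUR = 0.34               # placeholder hourly rate for the search tier
HOURS_PER_MONTH = 730

openai_cost = QUESTIONS_PER_DAY * 30 * TOKENS_PER_QUESTION / 1_000 * OPENAI_PRICE_PER_1K_TOKENS
app_service_cost = APP_SERVICE_PER_HOUR * HOURS_PER_MONTH
search_cost = SEARCH_PER_HOUR * HOURS_PER_MONTH
total = openai_cost + app_service_cost + search_cost
print(f"OpenAI ~${openai_cost:,.2f}, App Service ~${app_service_cost:,.2f}, "
      f"AI Search ~${search_cost:,.2f}, total ~${total:,.2f} per month")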

Security measures

Authentication:

By default, the deployed app is publicly accessible.

It's recommended to enable authentication. You can enable authentication for your web app running on Azure App Service and limit access to users in your organization.

Refer to this post, where we saw how to enable authentication for a Chat AI deployed to Azure App Service: Create your own Copilot that uses your own data with an Azure OpenAI Service Model.

Refer to the section "Configure web app authentication".

Limit access to a specific set of users or Groups:

To limit access to a specific set of users or groups, you can follow the steps from Restrict your Microsoft Entra app to a set of users: change the "Assignment required?" option under the Enterprise Application, and then assign users/groups access.

Networking:

This solution is deployed with a public endpoint. You can use private endpoints or other solutions based on your requirements.

  1. Private Endpoints

With Azure Private Link, Azure customers can provide and consume services privately on the Azure platform. Services can be Azure PaaS services such as Storage or SQL, Marketplace services (a service provider offering its service on the Azure platform), or a customer's own service.

  2. Private DNS Zones

If the app is only for internal enterprise use, use a private DNS zone.

Azure DNS allows you to host your DNS domain in Azure, so you can manage your DNS records using the same credentials, billing, and support contract as your other Azure services. Zones can be either public or private; private DNS zones are only visible to VMs in your virtual network. Azure's global network of name servers uses Anycast routing to provide high performance and availability.

  3. Azure API Management (APIM)

You can consider using Azure API Management (APIM) for firewalls and other forms of protection. Azure Landing Zones provide a solid foundation for your cloud environment. When deploying complex AI services such as Azure OpenAI, using a Landing Zone approach helps you manage your resources in a structured, consistent manner, ensuring governance, compliance, and security are properly maintained.

We will cover Azure Landing Zone for deploying Azure OpenAI in another post.

Enabling login and document level access control

By default, the deployed Azure web app allows users to chat with all your indexed data. You can enable an optional login system using Azure Active Directory to restrict access to indexed data based on the logged in user. Enable the optional login and document level access control system by following this guide.

Enabling user document upload

You can enable an optional user document upload system to allow users to upload their own documents and chat with them. This feature requires you to first enable login and document level access control. Then you can enable the optional user document upload system by setting an azd environment variable:

azd env set USE_USER_UPLOAD true

Document Security:

When you build a chat application using the RAG pattern with your own data, make sure that each user receives an answer based on their permissions. Follow the process in this article to add document access control to your chat app.

An authorized user should only receive answers drawn from documents they have permission to access.

Architectural overview (document security)

Without the document security feature, the enterprise chat app has a simple architecture using Azure AI Search and Azure OpenAI. An answer is determined from queries to Azure AI Search, where the documents are stored, in combination with a response from an Azure OpenAI GPT model.

No user authentication is used in this simple flow.

To add security for the documents, you need to update the enterprise chat app:

  • Add client authentication to the chat app with Microsoft Entra.
  • Add server-side logic to populate the search index with the user or group identities that should have access to each document.

Azure AI Search doesn’t provide native document-level permissions and can’t vary search results from within an index by user permissions. Instead, your application can use search filters to ensure a document is accessible to a specific user or by a specific group. Within your search index, each document should have a filterable field that stores user or group identity information.

Because the authorization isn’t natively contained in Azure AI Search, you need to add a field to hold user or group information, then trim any documents which don’t match the user. To implement this technique, you need to:

  • Create a document access control field in your index dedicated to storing the details of users or groups with document access.
  • Populate the document’s access control field with the relevant user or group details.
  • Update this access control field whenever there are changes in user or group access permissions.
  • If your index updates are scheduled with an indexer, changes are picked up on the next indexer run. If you don’t use an indexer, you need to manually reindex.
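
To make the filter-based trimming concrete, here is a minimal sketch using the azure-search-documents SDK. The index name and the oids/groups field names are assumptions; the linked article and the sample's scripts define the exact schema, so adjust the names to match your index.

import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Assumed index and field names; the sample's scripts define the real schema.
search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="gptkbindex",
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

def search_as_user(query: str, user_oid: str, group_ids: list[str]):
    # Only return documents whose access-control fields contain the caller's
    # object id or one of their group ids.
    groups = ",".join(group_ids) if group_ids else "none"
    security_filter = (
        f"oids/any(g: search.in(g, '{user_oid}')) "
        f"or groups/any(g: search.in(g, '{groups}'))"
    )
    return search_client.search(search_text=query, filter=security_filter)

for doc in search_as_user("vacation policy", user_oid="<user-object-id>", group_ids=[]):
    print(doc["sourcepage"])  # assumed field name from the sample index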

In the linked article, securing documents in Azure AI Search is demonstrated with example scripts that you, as the search administrator, would run. The scripts associate a single document with a single user identity. You can take these scripts and apply your own security and productionizing requirements to scale to your needs.

Get started with chat document security trimming – Python on Azure | Microsoft Learn

Enabling CORS for an alternate frontend

By default, the deployed Azure web app will only allow requests from the same origin. To enable CORS for a frontend hosted on a different origin, run:

azd env set ALLOWED_ORIGIN https://<your-domain.com>

azd up

For the frontend code, change BACKEND_URI in api.ts to point at the deployed backend URL, so that all fetch requests will be sent to the deployed backend.

Conclusion

In this post, we explored the process of setting up DevOps pipelines, productionizing our Chat AI solution, and reviewing scaling, cost, and security measures.

Each of these areas represents a critical phase in the lifecycle of the Chat AI solution, ensuring its readiness for deployment and its ability to deliver value in a real-world setting.

In the final part of our blog series, we will delve into the realm of GenAI Ops, focusing on Load Testing, Evaluation, Monitoring, and managing Enterprise Chat AI applications.

References:

Get started with the Python enterprise chat sample using RAG – Python on Azure | Microsoft Learn

azure-search-openai-demo/docs/productionizing.md at main · Azure-Samples/azure-search-openai-demo (github.com)

Azure OpenAI Landing Zone reference architecture (microsoft.com)

openai-chat-app-quickstart/README.md at main · Azure-Samples/openai-chat-app-quickstart (github.com)

Access Control in Generative AI applications with Azure AI Search – Microsoft Community Hub

Build an enterprise-ready Azure OpenAI solution with Azure API Management – Microsoft Community Hub
