
Getting Started with Integrated Vectorization

The integrated vectorization feature in Azure AI Search marks a significant advancement in the field of search and retrieval.

By automating the chunking and vectorization processes, it not only simplifies the development of RAG applications but also enhances the overall efficiency and effectiveness of search functionalities within Azure’s ecosystem.

This innovation is poised to be a game-changer for developers and organizations looking to leverage the power of vector search in their applications.

Overview:

Building a Copilot involves several steps:

  1. Data Collection: Gather the necessary data and make sure it is clean and well-structured.
  2. Feature Engineering: Prepare the data by extracting relevant features and transforming them into a suitable format.
  3. Model Training: Train a machine learning model on the prepared data, using appropriate algorithms and hyperparameters.
  4. Model Evaluation: Evaluate the model’s performance using metrics such as accuracy, precision, recall, and F1-score.
  5. Deployment and Integration: Deploy the trained model in a production environment and integrate it with other relevant systems or applications.

Let's consider a scenario where you implement a chat app using Python, Azure OpenAI Service, and Retrieval Augmented Generation (RAG) in Azure AI Search to get answers about employee benefits at a fictitious company. The app is seeded with PDF files, including the employee handbook, a benefits document, and a list of company roles and expectations.

Architectural overview

A simple architecture of the chat app is shown in the following diagram:

Key components of the architecture include:

Customize Chat App Settings (Retrieval Mode)

The chat app is designed to work with any PDF documents. You can use the chat app settings to change the behavior of its responses.

The intelligence of the chat is determined by the OpenAI model and the settings used to interact with it, described below.

| Setting | Description |
| --- | --- |
| Override prompt template | This is the prompt that is used to generate the answer. |
| Temperature | The temperature used for the final Chat Completion API call, a number between 0 and 1 that controls the "creativity" of the model. |
| Minimum search score | The minimum score of the search results that are used to generate the answer. Range depends on the search mode used. |
| Minimum reranker score | The minimum score from the semantic ranker of the search results that are used to generate the answer. Ranges from 0-4. |
| Retrieve this many search results | The number of search results that are used to generate the answer. You can see these sources returned in the Thought process and Supporting content tabs of the citation. |
| Exclude category | The category of documents that are excluded from the search results. |
| Use semantic ranker for retrieval | A feature of Azure AI Search that uses machine learning to improve the relevance of search results. |
| Use query-contextual summaries instead of whole documents | When both Use semantic ranker and Use query-contextual summaries are checked, the LLM uses captions extracted from key passages, instead of all the passages, in the highest ranked documents. |
| Suggest follow-up questions | Have the chat app suggest follow-up questions based on the answer. |
| Retrieval mode | Vectors + Text means that the search results are based on both the text and the embeddings of the documents. Vectors means that the search results are based on the embeddings of the documents. Text means that the search results are based on the text of the documents. |
| Stream chat completion responses | Stream the response instead of waiting until the complete answer is available. |
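In the azure-search-openai-demo sample that this scenario is based on, these settings are sent to the backend as an overrides object with each chat request. Below is a hedged sketch of what such a request body might look like; the field names follow the sample's conventions, but check the repository for the exact schema.

{
    "messages": [
        { "role": "user", "content": "What is included in my health plan?" }
    ],
    "context": {
        "overrides": {
            "retrieval_mode": "hybrid",
            "semantic_ranker": true,
            "semantic_captions": false,
            "top": 3,
            "temperature": 0.3,
            "suggest_followup_questions": true
        }
    },
    "stream": false
}

Here, retrieval_mode "hybrid" corresponds to the Vectors + Text option, and top maps to Retrieve this many search results.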

As you can see, there are several options that control how your search results are retrieved.

These settings govern how documents are indexed for the chat app; in other words, they belong to the model training, fine-tuning, and knowledge retrieval stages (document indexing and vector embedding).

Let’s see the Development stage of a Chat Solution and where Vector Embedding is used.

| Stage | LLM-Based Solution | RAG-Based Solution | Vector Embedding Used |
| --- | --- | --- | --- |
| Data Collection | Collecting large datasets from diverse sources | Collecting large datasets from diverse sources | No |
| Data Cleaning | Preprocessing with tools like NLTK, regex | Preprocessing with tools like NLTK, regex | No |
| Model Training | Utilizing deep learning frameworks (TensorFlow, PyTorch) | Utilizing deep learning frameworks (TensorFlow, PyTorch) | Yes |
| Model Fine-Tuning | Applying transfer learning techniques | Applying transfer learning techniques | Yes |
| Knowledge Retrieval | Not applicable | Using database management systems (SQL, NoSQL) | Yes |
| Response Generation | Implementing natural language generation algorithms | Combining natural language generation algorithms with retrieval mechanisms | No |
| Evaluation | Employing automated metrics (BLEU, ROUGE) | Employing automated metrics (BLEU, ROUGE) and retrieval accuracy | No |

What is Vector Embedding?

Vector embedding is a critical component used during the model training and fine-tuning stages for both LLM and RAG solutions. Additionally, in RAG solutions, vector embedding plays a vital role in the knowledge retrieval stage to effectively match query embeddings with document embeddings. The tools and technologies mentioned are commonly used in these stages to develop a Copilot.

Machines don't understand human language, and that is where embeddings come in.

LLMs store the meaning and context of the data fed in a specialized format known as embeddings. Imagine capturing the essence of a word, image or video in a single mathematical equation. That’s the power of vector embeddings — one of the most fascinating and influential concepts in machine learning today.

For example, images of animals like cats and dogs are unstructured data and cannot be directly stored in a database. Hence, they are first converted into a machine-readable format, which is what we call embeddings, and then stored in a vector database.

By translating unstructured and high-dimensional data into a lower-dimensional space, embeddings make it possible to perform complex computations more efficiently.
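To make this concrete, here is a minimal Python sketch of how proximity between embeddings is typically measured with cosine similarity. The vectors are made-up toy values; real embedding models produce hundreds or thousands of dimensions.

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way (similar meaning); near 0 means unrelated
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" for illustration only
cat = [0.9, 0.1, 0.05, 0.3]
dog = [0.85, 0.15, 0.1, 0.35]
car = [0.05, 0.9, 0.8, 0.1]

print(cosine_similarity(cat, dog))  # high: semantically close
print(cosine_similarity(cat, car))  # low: semantically distant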

Types of Embeddings:
While text embeddings are the most commonly used, embeddings can also be created for other types of data, such as images, graphs, and more.

⮕ Word Embeddings: Embeddings of individual words. Models: Word2Vec, GloVe, and FastText.

⮕ Sentence Embeddings: Embeddings of entire sentences as vectors that capture the overall meaning and context of the sentences. Models: Universal Sentence Encoder (USE) and SkipThought.

⮕ Document Embeddings: Embeddings of entire documents, capturing the semantic information and context of the whole document. Models: Doc2Vec and Paragraph Vectors.

⮕ Image Embeddings: Capture different visual features. Models: CNNs, ResNet, and VGG.

⮕ User/Product Embeddings: Represent users/products in a system as vectors, capturing user/product preferences, behaviors, attributes, and characteristics. These are primarily used in recommendation systems.

Below are some common embedding models we can use.

⮕ Cohere Embeddings: Powerful for processing short texts of under 512 tokens.

⮕ Mistral Embeddings: Strong embeddings for AI/ML tasks like text classification, sentiment analysis, etc.

⮕ OpenAI Embeddings: OpenAI is currently one of the market leaders in embedding models. Its second-generation text embedding model, text-embedding-ada-002, has proven to give top-notch results across various use cases.
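As an illustration, here is a minimal sketch of generating an embedding with a deployed ada-002 model through the Azure OpenAI Service, using the openai Python SDK. The endpoint, key, and deployment name are placeholders you would replace with your own.

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<your-api-key>",                                   # placeholder
    api_version="2023-05-15",
)

response = client.embeddings.create(
    model="<your-ada-002-deployment>",  # name of your text-embedding-ada-002 deployment
    input="Employee health benefits include dental and vision coverage.",
)

vector = response.data[0].embedding
print(len(vector))  # text-embedding-ada-002 produces 1536-dimensional vectors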

Understanding Manual Indexing vs Integrated Vectorization

In order to ingest a document format, we need a tool that can turn it into text. By default, the sample uses Azure Document Intelligence (DI in the table below), but local parsers are also available for several formats. The local parsers are not as sophisticated as Azure Document Intelligence, but they can be used to reduce charges.

| Format | Manual Indexing | Integrated Vectorization |
| --- | --- | --- |
| PDF | Yes (DI or local with PyPDF) | Yes |
| HTML | Yes (DI or local with BeautifulSoup) | Yes |
| DOCX, PPTX, XLSX | Yes (DI) | Yes |
| Images (JPG, PNG, BMP, TIFF, HEIF) | Yes (DI) | Yes |
| TXT | Yes (local) | Yes |
| JSON | Yes (local) | Yes |
| CSV | Yes (local) | Yes |
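For example, here is a minimal sketch of local PDF parsing with the pypdf library, the cheaper alternative to Azure Document Intelligence mentioned above. The file path is a placeholder for one of your own documents.

from pypdf import PdfReader

# Local parsing: less sophisticated than Document Intelligence, but no per-page charges
reader = PdfReader("data/employee_handbook.pdf")  # placeholder path
pages = [page.extract_text() for page in reader.pages]
print(f"Extracted text from {len(pages)} pages")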

Overview of the manual indexing process

The prepdocs.py script is responsible for both uploading and indexing documents.

The typical usage is to call it using scripts/prepdocs.sh (Mac/Linux) or scripts/prepdocs.ps1 (Windows), as these scripts will set up a Python virtual environment and pass in the required parameters based on the current azd environment.

Whenever azd up or azd provision is run, the script is called automatically.

The script uses the following steps to index documents (a simplified sketch of these steps follows the list):

  1. If it doesn’t yet exist, create a new index in Azure AI Search.
  2. Upload the PDFs to Azure Blob Storage.
  3. Split the PDFs into chunks of text.
  4. Upload the chunks to Azure AI Search. If using vectors (the default), also compute the embeddings and upload those alongside the text.
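The prepdocs.py script handles all of this for you, but for intuition, here is a heavily simplified sketch of those four steps using the Azure SDKs. The connection strings, names, and the full_text/chunk_text/embed helpers are hypothetical placeholders, not the script's actual code.

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.storage.blob import BlobServiceClient

# Step 2: upload the PDF to Azure Blob Storage (placeholder names)
blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = blob_service.get_container_client("content")
with open("data/employee_handbook.pdf", "rb") as f:
    container.upload_blob("employee_handbook.pdf", f, overwrite=True)

# Step 3: split the extracted text into chunks and embed each one.
# full_text, chunk_text, and embed are hypothetical stand-ins for the real logic.
chunks = chunk_text(full_text)
documents = [
    {"id": f"chunk-{i}", "content": c, "embedding": embed(c)}
    for i, c in enumerate(chunks)
]

# Step 4: upload the chunks (text plus vectors) to the Azure AI Search index
search = SearchClient(
    "https://<service>.search.windows.net", "<index-name>",
    AzureKeyCredential("<search-api-key>"),
)
search.upload_documents(documents)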

Chunking

We’re often asked why we need to break up the PDFs into chunks when Azure AI Search supports searching large documents.

Chunking allows us to limit the amount of information we send to OpenAI, due to its token limits. Breaking up the content also makes it easy to find the most relevant chunks of text to inject into the prompt. The method of chunking we use leverages a sliding window of text, such that sentences that end one chunk will start the next. This reduces the chance of losing the context of the text.

If needed, you can modify the chunking algorithm in scripts/prepdocslib/textsplitter.py.
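For intuition, here is a simplified, character-based sketch of such a sliding-window splitter; the real textsplitter.py is more sophisticated (sentence- and token-aware), so treat this only as an illustration of the overlap idea.

def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks where consecutive chunks share `overlap` characters."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # slide back so the chunks overlap
    return chunks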

Indexing additional documents

To upload more PDFs, put them in the data/ folder and run ./scripts/prepdocs.sh or ./scripts/prepdocs.ps1.

A recent change added checks to see what has been uploaded before. The prepdocs script now writes an .md5 file containing an MD5 hash of each file that gets uploaded. Whenever the script is re-run, that hash is checked against the current hash, and the file is skipped if it hasn't changed.
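The idea behind that check can be sketched as follows; this is an illustration of the mechanism, not the script's actual code.

import hashlib
from pathlib import Path

def needs_upload(path: Path) -> bool:
    """Return True if the file changed since the last run, tracked via an .md5 sidecar."""
    current = hashlib.md5(path.read_bytes()).hexdigest()
    sidecar = Path(str(path) + ".md5")
    if sidecar.exists() and sidecar.read_text().strip() == current:
        return False  # hash matches the stored one: skip re-upload
    sidecar.write_text(current)  # record the new hash for next time
    return True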

Removing documents

You may want to remove documents from the index. For example, if you’re using the sample data, you may want to remove the documents that are already in the index before adding your own.

To remove all documents, use the --removeall flag. Open either scripts/prepdocs.sh or scripts/prepdocs.ps1 and add --removeall to the command at the bottom of the file. Then run the script as usual.

You can also remove individual documents by using the --remove flag. Open either scripts/prepdocs.sh or scripts/prepdocs.ps1, add --remove to the command at the bottom of the file, and replace /data/* with /data/YOUR-DOCUMENT-FILENAME-GOES-HERE.pdf. Then run the script as usual.

Overview of Integrated Vectorization

Azure AI Search recently introduced an integrated vectorization feature in preview. It is a cloud-based approach to data ingestion that takes care of document format cracking, data extraction, chunking, vectorization, and indexing, all with Azure technologies.

NOTE: This feature cannot be used on an existing index. You need to create a new index, or drop and recreate an existing one. The newly created index schema includes a new field, 'parent_id', which is used internally by the indexer to manage the life cycle of chunks.

Now that we have set up the background, let's understand what integrated vectorization is.

What is Integrated Vectorization?

Integrated vectorization is a new feature of Azure AI Search that enables chunking and vectorization of data during ingestion through built-in pull indexers, and vectorization of text queries through vectorizers. With a deployed Azure OpenAI Service embedding model or a custom embedding model, integrated vectorization facilitates automatic chunking and vectorization during data ingestion from various Azure sources such as Blob Storage, SQL, Cosmos DB, Data Lake Gen2, and more. Furthermore, Azure AI Search now incorporates vectorizers referencing your own embedding models that automatically vectorize text queries, effectively eliminating the need for vectorization logic in the client application.

Figure 1 – Integrated vectorization diagram

Key Concepts in Integrated Vectorization

Vector search: In Azure AI Search, this is a capability for indexing, storing, and retrieving vector embeddings from a search index. By representing text as vectors, vector search can identify the most similar documents based on their proximity in a vector space. In vector search, vectorization refers to the conversion of text data into vector embeddings.

Chunking: The process of dividing data into smaller, manageable parts (chunks) that can be processed independently. Chunking is required when source documents are too large for the maximum input size of embedding and/or large language models.

Retrieval Augmented Generation (RAG): Architecture that augments the capabilities of a Large Language Model (LLM) like ChatGPT by adding an information retrieval system (i.e., Azure AI Search) that provides the data.
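Putting these concepts together, here is a minimal sketch of the RAG pattern in Python: retrieve the most relevant chunks from Azure AI Search, then ground the LLM's answer in them. It reuses the `search` and `client` placeholders from the earlier sketches; the prompt and deployment name are illustrative.

question = "What is included in my health plan?"

# 1. Retrieval: find the chunks most relevant to the question
results = search.search(search_text=question, top=3)
sources = "\n".join(doc["content"] for doc in results)

# 2. Generation: ask the LLM to answer using only the retrieved sources
answer = client.chat.completions.create(
    model="<your-gpt-deployment>",  # placeholder chat model deployment
    messages=[
        {"role": "system", "content": "Answer using ONLY the provided sources."},
        {"role": "user", "content": f"Sources:\n{sources}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)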

Choosing Between Integrated Vectorization and Other Options in Azure

In the rapidly evolving field of RAG applications, there are various systems offering chunking and vectorization capabilities. Consequently, you might find yourself pondering the best choice for different scenarios.

For instance, Microsoft's Azure platform now provides a solution that facilitates the creation of end-to-end applications using the RAG pattern across multiple data sources, including Azure AI Search, all from the convenience of Azure AI Studio. In that case, it's advisable to use its built-in chunking and vectorization functionality, as it is optimized to work seamlessly with your chosen options.

However, if you aim to construct your RAG application or a traditional vector search application through other means, or if the built-in functionality does not align with your specific business requirements, integrated vectorization is an excellent alternative.

Benefits of Integrated Vectorization

Here are some of the key benefits of integrated vectorization:

Streamlined Maintenance: Users no longer need to maintain a separate data chunking and vectorization pipeline, reducing overhead and simplifying data maintenance.

Up-to-date Results: The feature works seamlessly with Azure AI Search pull indexers to handle incremental indexing, allowing your search service to deliver up-to-date results.

Reduced Complexity: Automatically vectorize your data to reduce complexity and increase accuracy.

Increased Relevancy: Generate index projections that map one source document to all corresponding vectorized chunks, enhancing the relevance of results. This also simplifies application development workflows for Retrieval-Augmented Generation (RAG) applications, where data chunking is required for retrieval.

DEMO: Getting Started with Integrated Vectorization

Getting started with integrated vectorization is easy. With just a few clicks in the Azure portal, you can automatically parse your data, divide it into chunks, vectorize them, project all these vectorized chunks into an index, and start taking advantage of the many benefits of Azure AI Search.

Announcing the Public Preview of Integrated Vectorization in Azure AI Search – Microsoft Community Hub

Data Ingestion Chunking and Vectorization

Follow these quick steps to import, chunk, vectorize and index your data:

1. In the Azure portal, navigate to the Overview section of your Azure AI Search service and choose Import and vectorize data from the menu.

Figure 2- Import and vectorize data wizard

 2. Select and configure your data source.

Figure 3 – Connect to your data

3. Go to the Vectorize and Enrich data section, add your Azure OpenAI Service enrichment model, and choose a schedule.

Figure 4 – Vectorize and enrich data

4. Add a prefix for the child objects that will be created as part of the deployment.

5. Review and create.

Figure 5 – Review and create child resources

The Import and vectorize data wizard creates the following AI Search child resources: a data source, a skillset, an index, and an indexer.

If you need more customization of the chunking or vectorization operations, you can take advantage of the 2023-10-01-Preview REST API or the latest Azure AI Search preview SDKs (.NET, Python, Java, and JavaScript) and modify any of the child resources listed above for integrated vectorization.

 

Vectorization at query time

When using the Import and vectorize data wizard, the selected embedding model is automatically included as a vectorizer in the index vector field. The vectorizer assigned to the index and linked to a vector field will automatically vectorize any text query submitted. 
 
Utilize the Search explorer within the Azure portal to execute text queries against vector fields with vectorizers. The system will automatically vectorize the query. There are multiple ways to access the Search Explorer within the portal: 

1. Once the Import and vectorize data wizard has finished and the indexing operation is complete, wait a few minutes and then click on Start Searching.

Figure 6 – Start Searching

2. Alternatively, in the Azure portal, under your search service Overview tab, select Search explorer.

Figure 7 – Overview – Search Explorer

3. You can also use the embedded Search Explorer within an index by selecting the Indexes tab and clicking on the index name.

Figure 8 – Index hyperlink

Upon accessing the Search explorer, you can simply enter a text query. If you have a vector field with an associated vectorizer, the query will be automatically vectorized when you click on Search, and the matching results will be displayed.

Figure 9 – Text query that will be automatically vectorized

To hide the vector fields and make it easier to view the matching results, click on Query options. Then, turn on the toggle for Hide vector values in search results. Close the Query options and resubmit the search query.

Figure 10 – Hide vector values in search results

Figure 11 – Results with the vector fields hidden

Here is a code sample of what a vectorizer looks like in the index JSON definition.

{ 
    "name": "vectorized-index, 
    "vectorSearch": { 
        "algorithms": [ 
            { 
                "name": "myalgorithm", 
                "kind": "hnsw" 
            } 
        ], 
        "vectorizers": [ 
            { 
                "name": "openai", 
                "kind": "azureOpenAI", 
                "azureOpenAIParameters": 
                { 
                    "resourceUri": "<AzureOpenAIURI>”, 
                    "apiKey": "<AzureOpenAIKey>", 
                    "deploymentId": "<modelDeploymentID>" 
                } 
            } 
        ], 
        "profiles": [ 
            { 
                "name": "myprofile", 
                "algorithm": "myalgorythm", 
                "vectorizer":"openai" 
            } 
        ] 
    }, 
    "fields": [ 
        { 
            "name": "chunkKey", 
            "type": "Edm.String", 
            "key": true, 
            "analyzer": "keyword" 
        }, 
        { 
            "name": "parentKey", 
            "type": "Edm.String" 
        }, 
        { 
            "name": "page", 
            "type": "Edm.String" 
        }, 
        { 
            "name": "vector", 
            "type": "Collection(Edm.Single)", 
            "dimensions": 1536, 
            "vectorSearchProfile": "myprofile", 
            "searchable": true, 
            "retrievable": true, 
            "filterable": false, 
            "sortable": false, 
            "facetable": false 
        } 
    ] 
} 
 

And here is a code sample of what a text query submitted against that vector field looks like.

POST <AISearchEndpoint>/indexes/<indexName>/docs/search?api-version=2023-10-01-Preview

{ 
    "vectorQueries": [ 
        { 
            "kind": "text", 
            "text":"<add your query>", 
            "fields": "vector" 
        } 
    ], 
    "select": "chunkKey, parentKey, page" 
} 
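The same call can be made from Python with the requests library; this is an equivalent sketch with placeholder endpoint, index name, and key.

import requests

endpoint = "https://<service>.search.windows.net"  # placeholder
url = f"{endpoint}/indexes/<indexName>/docs/search"

body = {
    "vectorQueries": [
        {"kind": "text", "text": "<add your query>", "fields": "vector"}
    ],
    "select": "chunkKey, parentKey, page",
}

resp = requests.post(
    url,
    params={"api-version": "2023-10-01-Preview"},
    headers={"api-key": "<search-api-key>", "Content-Type": "application/json"},
    json=body,
)

for doc in resp.json()["value"]:
    print(doc["chunkKey"], doc["page"])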

Skills that Make Integrated Vectorization Possible During Data Ingestion

Integrated vectorization during data ingestion is made possible by a combination of skills and configurations in Azure AI Search pull indexers. An Azure AI Search skill refers to a singular operation that modifies content in a particular manner. This often involves operations such as text recognition or extraction. However, it can also include a utility skill that refines or alters previously established enrichments. The output of this operation is typically text-based, thereby facilitating its use in comprehensive text queries. Skill operations are orchestrated in skillsets.

A skillset in Azure AI Search is a reusable resource linked to an indexer. It includes one or more skills, which invoke either built-in AI functions or external custom processing on documents fetched from an outside data source.

The key skills and configuration that make Integrated vectorization possible include:

  1. Text split cognitive skill: This skill is designed to break down your data into smaller, manageable chunks. This step is essential for adhering to the input constraints of the embedding model and ensuring the data can be efficiently processed. The smaller data pieces not only facilitate the vectorization process but also enhance the query process, especially for RAG applications. Additionally, the skill enables overlapping of data, which is instrumental in preserving the semantic meaning in multiple scenarios, thereby enhancing the accuracy and quality of search results.
  2. Azure OpenAI Embedding skill: This skill offers a reliable method to call upon your Azure OpenAI Service embedding model. It empowers generation of highly accurate and precise vectors, thereby improving the overall effectiveness for semantic queries.
  3. Custom Web API skill: This skill allows you to use your own custom embedding model for vectorization, giving you even more control over the vectorization process.
  4. Index Projections: This configuration allows you to map one source document to all of its associated chunks, either within a single index or across separate parent and child indexes (a sketch of a skillset using these skills follows below).
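To show how the pieces fit together, here is a hedged sketch of a skillset definition combining the Text Split skill and the Azure OpenAI Embedding skill. The shapes follow the 2023-10-01-Preview API, with the same placeholders as the index JSON above; the exact parameters for your scenario may differ.

{
    "name": "my-vectorization-skillset",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
            "textSplitMode": "pages",
            "maximumPageLength": 2000,
            "pageOverlapLength": 500,
            "inputs": [ { "name": "text", "source": "/document/content" } ],
            "outputs": [ { "name": "textItems", "targetName": "chunks" } ]
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
            "context": "/document/chunks/*",
            "resourceUri": "<AzureOpenAIURI>",
            "apiKey": "<AzureOpenAIKey>",
            "deploymentId": "<modelDeploymentID>",
            "inputs": [ { "name": "text", "source": "/document/chunks/*" } ],
            "outputs": [ { "name": "embedding", "targetName": "vector" } ]
        }
    ]
}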

Figure 12 – Diagrammatic View of Data Ingestion Skills in Azure AI Search’s Integrated Vectorization Process

In conclusion, integrated vectorization allows you to improve the accuracy and precision of your searches while reducing complexity and overhead. With just a few clicks, you can import your data into an Azure AI Search index and start taking advantage of the many benefits of Azure AI Search, including easy integration as a retriever in RAG applications.

Conclusion:

In conclusion, the integrated vectorization feature in Azure AI Search marks a significant advancement in the field of search and retrieval. By automating the chunking and vectorization processes, it simplifies the development and deployment of RAG applications while enhancing the overall performance and efficiency of search functionalities within Azure’s ecosystem. This innovation is poised to be a game-changer for developers and organizations looking to leverage the power of vector search in their applications.

References:

Announcing the Public Preview of Integrated Vectorization in Azure AI Search – Microsoft Community Hub

azure-search-openai-demo/docs/customization.md at main · Azure-Samples/azure-search-openai-demo (github.com)

azure-search-openai-demo/docs/data_ingestion.md at main · Azure-Samples/azure-search-openai-demo (github.com)
