Document Processing & Embeddings
In this module, we explore how Semantic Kernel leverages document processing and embeddings to enable powerful semantic search, summarization, classification, and more.
Document Processing in Semantic Kernel
Semantic Kernel enables you to:
- Ingest documents (PDFs, markdown, web pages, etc.)
- Split them into chunks using a chunking strategy (for example, Semantic Kernel's `TextChunker`)
- Generate embeddings for each chunk
- Store these in a vector store (like Azure Cognitive Search, Qdrant, or Pinecone)
Once embedded and stored, you can:
- Perform semantic search: find relevant content based on meaning, not keywords
- Use context-aware prompting: fetch relevant chunks for better LLM responses
- Build retrieval-augmented generation (RAG) workflows, chat with docs, summarize content, etc.
Labs
This hands-on workshop gradually builds understanding and experience with document processing and embeddings using Semantic Kernel.
Each lab builds on the previous one, creating a progressive learning path.
Lab 1: Chunking Documents
Goal: Learn how to process documents and prepare them for embedding.
Document chunking is the process of breaking down large documents into smaller, meaningful segments that can be effectively processed by LLMs and embedding models. This is a crucial step in building RAG (Retrieval-Augmented Generation) applications because:
- Context Window Limitations: LLMs have limited context windows, so we need to break documents into manageable chunks
- Semantic Coherence: Chunks should maintain semantic meaning and context
- Retrieval Efficiency: Smaller, focused chunks improve search relevance
Exercise: Document Chunking
In this lab, we’ll build up our understanding of document chunking by implementing different strategies using Semantic Kernel. We’ll start with basic chunking and progress to more sophisticated approaches.
Make sure to copy the starter files for this lab from code/module-02. You can also continue working on the files from the previous module.
- Download the sample content for the RAG system. We'll use the raw content of the book "Building Effective LLM-based Applications with Semantic Kernel". Download the book's raw markdown files as the dataset and place them in the `apps/indexer/InfoSupport.AgentWorkshop.Indexer/Content` directory.
- Search for content files in the `Content` directory. Modify the `apps/indexer/InfoSupport.AgentWorkshop.Indexer/Services/ContentIndexerService.cs` file and add the following code to the `ProcessContentFilesAsync` method:

  ```csharp
  var files = Directory.GetFiles(contentDirectory, "*.md", SearchOption.AllDirectories);
  ```

- Loop through all files and read them into memory. Add the following code to the `ProcessContentFilesAsync` method:

  ```csharp
  foreach (var file in files)
  {
      var text = await File.ReadAllTextAsync(file, cancellationToken);
      await ChunkFileContentAsync(file, text, cancellationToken);
  }
  ```

- Modify the `ChunkFileContentAsync` method to split the text into chunks:

  ```csharp
  var lines = fileContent.Split(["\r\n", "\n"], StringSplitOptions.None);
  var chunks = TextChunker.SplitMarkdownParagraphs(lines, ChunkSize, OverlapSize);

  foreach (var chunk in chunks)
  {
      await UpsertChunkAsync(fileName, chunk, cancellationToken);
  }
  ```
You now have all the logic required to split your documents into chunks that can be used for semantic search.
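The steps above can be combined into a single sketch of the chunking pipeline. This is a hypothetical consolidation, not the lab's actual starter code: the class name `ContentChunker` and the `ChunkSize`/`OverlapSize` values are assumptions for illustration, while `TextChunker.SplitMarkdownParagraphs` comes from Semantic Kernel itself:

```csharp
using Microsoft.SemanticKernel.Text;

// Hypothetical consolidated version of the lab's chunking pipeline.
public class ContentChunker
{
    private const int ChunkSize = 1000;  // max tokens per chunk (assumed value)
    private const int OverlapSize = 100; // tokens shared between adjacent chunks (assumed value)

    public IEnumerable<(string FileName, string Chunk)> ChunkDirectory(string contentDirectory)
    {
        // Find every markdown file below the content directory.
        var files = Directory.GetFiles(contentDirectory, "*.md", SearchOption.AllDirectories);

        foreach (var file in files)
        {
            var text = File.ReadAllText(file);
            var lines = text.Split(["\r\n", "\n"], StringSplitOptions.None);

            // Split along markdown paragraph boundaries, with overlap to
            // preserve context across chunk edges.
            var chunks = TextChunker.SplitMarkdownParagraphs(lines, ChunkSize, OverlapSize);

            foreach (var chunk in chunks)
            {
                yield return (file, chunk);
            }
        }
    }
}
```

The overlap matters: without it, a sentence that straddles a chunk boundary loses half its context in both chunks, which hurts retrieval quality.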
Lab 2: Generating Embeddings
Goal: Convert document chunks into embeddings.
In this lab you will learn how to prepare your chunks for embedding, and how to generate embeddings using Semantic Kernel.
What Are Embeddings?
Embeddings are numerical representations of text (words, sentences, or documents) that capture semantic meaning. They’re the output of AI models trained to understand language.
📊 For example, the sentence “The quick brown fox” could become a vector like:
[0.12, -0.98, 0.45, ..., 0.33]
Key property:
Texts with similar meanings get similar embeddings, even if they use different words.
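That "similar meanings, similar embeddings" property is what semantic search relies on, and similarity is typically measured with cosine similarity. A minimal sketch of the math (the helper class `VectorMath` is invented for illustration; real embeddings have hundreds or thousands of dimensions, not three):

```csharp
// Cosine similarity: 1.0 = same direction, 0 = unrelated, -1 = opposite.
public static class VectorMath
{
    public static double CosineSimilarity(float[] a, float[] b)
    {
        double dot = 0, normA = 0, normB = 0;

        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];     // how much the vectors agree
            normA += a[i] * a[i];   // squared length of a
            normB += b[i] * b[i];   // squared length of b
        }

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

With made-up toy vectors, an embedding for "The quick brown fox" would score close to 1.0 against "A fast auburn fox" and much lower against "Quarterly tax filing", even though the first pair shares almost no words.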
Embeddings vs. Vectors
It’s common to hear the terms used interchangeably, but here’s the difference:
| Term | Meaning |
|---|---|
| Vector | A list of numbers. Purely mathematical. |
| Embedding | A vector with meaning, produced by a model to represent some input (e.g. text, image). |
So:
- ✅ All embeddings are vectors
- ❌ Not all vectors are embeddings
Think of a vector as a format, and an embedding as meaningful content in that format.
Exercise: Configure the vector store
To work with a vector store in Semantic Kernel, you need to configure it in the application. Since we're using Aspire to orchestrate the various moving parts of the application, we can configure the vector store with just a few lines of code.
- Add the Qdrant connector to the `apps/indexer/InfoSupport.AgentWorkshop.Indexer` project:

  ```shell
  dotnet add package Microsoft.SemanticKernel.Connectors.Qdrant --prerelease
  dotnet add package Aspire.Qdrant.Client
  ```

- Add the following statement to `apps/indexer/InfoSupport.AgentWorkshop.Indexer/Program.cs` after the line where the Azure OpenAI client is added:

  ```csharp
  builder.AddQdrantClient("qdrant");
  ```

- Configure the vector store in the `Program.cs` file:

  ```csharp
  builder.Services.AddKernel()
      .AddAzureOpenAITextEmbeddingGeneration(
          builder.Configuration["LanguageModel:TextEmbeddingDeploymentName"]!);

  builder.Services.AddQdrantVectorStore();
  ```
With the vector store configured, we can start storing data in it.
Exercise: Generating and storing embeddings
There are many vector databases you can use with Semantic Kernel. They all follow a similar pattern: you store a record identified by a key, and the record holds an embedding vector plus additional metadata that must be serializable to JSON.
In C#, you must create a dedicated class to represent the data in a vector store. Semantic Kernel uses the term vector store for any database that can store vector data, whether that's a pure vector database or a relational database with vector support. If you plan to store vector data in a regular database, be aware that you can't combine the data structures offered by Semantic Kernel with other relational data processing such as Entity Framework Core, even though the database itself may support it.
- Add a new file `TextUnit.cs` to the directory `apps/indexer/InfoSupport.AgentWorkshop.Indexer/Models` and add a new class `TextUnit` in it:

  ```csharp
  using Microsoft.Extensions.VectorData;

  namespace InfoSupport.AgentWorkshop.Indexer.Models;

  public class TextUnit
  {
  }
  ```

- Add the filename and content to the class. To preserve information about the original file, we add a property for the filename and one for the content. Semantic Kernel uses the term `VectorStoreRecordData` for the metadata fields stored alongside the vector:

  ```csharp
  [VectorStoreRecordData]
  public string OriginalFileName { get; set; } = default!;

  [VectorStoreRecordData(IsFullTextIndexed = true)]
  public string Content { get; set; } = default!;
  ```

- Add the identifier and vector data to the class. A vector store record in Semantic Kernel requires a unique key and a vector field. The identifier for the `TextUnit` is a unique number marked with the `[VectorStoreRecordKey]` attribute. The vector field has to be of type `ReadOnlyMemory<float>` and is marked with the `[VectorStoreRecordVector]` attribute; the number of dimensions depends on the embedding model you use:

  ```csharp
  [VectorStoreRecordKey]
  public ulong Id { get; set; }

  [VectorStoreRecordVector(1536)] // Number of dimensions for the embedding
  public ReadOnlyMemory<float> Embedding { get; set; }
  ```

  The embedding size is usually listed in the documentation of the provider that offers the embedding model you're using. Although it's wise to use an embedding model from the same provider as your LLM, it's not required. Using an embedding model from another provider or an open-source embedding model requires extra maintenance, though, and may not yield higher-quality search results.

- Map the chunks to the data model. Modify the method `UpsertChunkAsync` in the `ContentIndexerService` class and add the following code to it:

  ```csharp
  var embeddingVector = await embeddingGenerationService.GenerateEmbeddingAsync(
      chunk, cancellationToken: cancellationToken);

  var textUnit = new TextUnit
  {
      Content = chunk,
      Embedding = embeddingVector,
      Id = _currentIdentifier++,
      OriginalFileName = fileName,
  };
  ```

- Ensure that the vector store collection exists. Add the following code to the end of the `UpsertChunkAsync` method:

  ```csharp
  var collection = vectorStore.GetCollection<ulong, TextUnit>("content");
  await collection.CreateCollectionIfNotExistsAsync(cancellationToken);
  ```

- Save the text unit to the vector store. Add the following code to the end of the `UpsertChunkAsync` method:

  ```csharp
  await collection.UpsertAsync(textUnit, cancellationToken);
  ```
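Putting the three fragments above together, the completed `UpsertChunkAsync` method might look roughly like this. Treat it as a sketch: the field names `embeddingGenerationService`, `vectorStore`, and `_currentIdentifier` follow the lab's starter code and may differ in your project:

```csharp
private async Task UpsertChunkAsync(
    string fileName, string chunk, CancellationToken cancellationToken)
{
    // Generate the embedding vector for this chunk of text.
    var embeddingVector = await embeddingGenerationService.GenerateEmbeddingAsync(
        chunk, cancellationToken: cancellationToken);

    // Map the chunk onto the vector store data model.
    var textUnit = new TextUnit
    {
        Content = chunk,
        Embedding = embeddingVector,
        Id = _currentIdentifier++,
        OriginalFileName = fileName,
    };

    // Make sure the target collection exists, then store the record.
    // Upsert means an existing record with the same Id is overwritten.
    var collection = vectorStore.GetCollection<ulong, TextUnit>("content");
    await collection.CreateCollectionIfNotExistsAsync(cancellationToken);
    await collection.UpsertAsync(textUnit, cancellationToken);
}
```

Creating the collection on every upsert is simple but wasteful; in a real indexer you would typically ensure the collection exists once, before the processing loop starts.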
Summary and next steps
In this module, we’ve explored the fundamentals of document processing and embeddings in Semantic Kernel. We’ve learned how to process documents and split them into chunks, and how to embed these chunks into a vector space. We’ve also learned how to store these embeddings in a vector store and perform semantic search on them.
In the next module, we'll extend the basic agent with the capability to use the vector store to answer questions about the documents.