Document Processing & Embeddings
In this module, we explore how Semantic Kernel leverages document processing and embeddings to enable powerful semantic search, summarization, classification, and more.
Document Processing in Semantic Kernel
Semantic Kernel enables you to:
- Ingest documents (PDFs, markdown, web pages, etc.)
- Split them into chunks using a chunking strategy (for example, Semantic Kernel's `TextChunker`)
- Generate embeddings for each chunk
- Store these in a vector store (like Azure Cognitive Search, Qdrant, or Pinecone)
Once embedded and stored, you can:
- Perform semantic search: find relevant content based on meaning, not keywords
- Use context-aware prompting: fetch relevant chunks for better LLM responses
- Build retrieval-augmented generation (RAG) workflows, chat with docs, summarize content, etc.
Labs
This hands-on workshop gradually builds understanding and experience with document processing and embeddings using Semantic Kernel.
Each lab builds on the previous one, creating a progressive learning path.
Lab 1: Chunking Documents
Goal: Learn how to process documents and prepare them for embedding.
Document chunking is the process of breaking down large documents into smaller, meaningful segments that can be effectively processed by LLMs and embedding models. This is a crucial step in building RAG (Retrieval-Augmented Generation) applications because:
- Context Window Limitations: LLMs have limited context windows, so we need to break documents into manageable chunks
- Semantic Coherence: Chunks should maintain semantic meaning and context
- Retrieval Efficiency: Smaller, focused chunks improve search relevance
Exercise: Document Chunking
In this lab, we’ll build up our understanding of document chunking by implementing different strategies using Semantic Kernel. We’ll start with basic chunking and progress to more sophisticated approaches.
Make sure to copy the starter files for this lab from code/module-02. You can also continue working on the files from the previous module.
- Download the sample content for the RAG system. We'll use the raw content of the book "Building Effective LLM-based Applications with Semantic Kernel". Download the book's raw markdown files as the dataset and place them in the `apps/indexer/InfoSupport.AgentWorkshop.Indexer/Content` directory.
- Search for content files in the `Content` directory. Modify the `apps/indexer/InfoSupport.AgentWorkshop.Indexer/Services/ContentIndexerService.cs` file and add the following code to the `ProcessContentFilesAsync` method:

  ```csharp
  var files = Directory.GetFiles(contentDirectory, "*.md", SearchOption.AllDirectories);
  ```

- Loop through all files and read them into memory. Add the following code to the `ProcessContentFilesAsync` method:

  ```csharp
  foreach (var file in files)
  {
      var text = await File.ReadAllTextAsync(file, cancellationToken);
      await ChunkFileContentAsync(file, text, cancellationToken);
  }
  ```

- Modify the `ChunkFileContentAsync` method to split the text into chunks:

  ```csharp
  var lines = fileContent.Split(["\r\n", "\n"], StringSplitOptions.None);
  var chunks = TextChunker.SplitMarkdownParagraphs(lines, ChunkSize, OverlapSize);

  foreach (var chunk in chunks)
  {
      await UpsertChunkAsync(fileName, chunk, cancellationToken);
  }
  ```
You now have all the logic required to split your documents into chunks that can be used for semantic search.
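The steps above can be combined into a single sketch of the chunking pipeline. This is a hypothetical consolidation, not the lab's actual starter code: the class name `ContentChunker` and the `ChunkSize`/`OverlapSize` values are assumptions for illustration, while `TextChunker.SplitMarkdownParagraphs` comes from Semantic Kernel itself:

```csharp
using Microsoft.SemanticKernel.Text;

// Hypothetical consolidated version of the lab's chunking pipeline.
public class ContentChunker
{
    private const int ChunkSize = 1000;  // max tokens per chunk (assumed value)
    private const int OverlapSize = 100; // tokens shared between adjacent chunks (assumed value)

    public IEnumerable<(string FileName, string Chunk)> ChunkDirectory(string contentDirectory)
    {
        // Find every markdown file below the content directory.
        var files = Directory.GetFiles(contentDirectory, "*.md", SearchOption.AllDirectories);

        foreach (var file in files)
        {
            var text = File.ReadAllText(file);
            var lines = text.Split(["\r\n", "\n"], StringSplitOptions.None);

            // Split along markdown paragraph boundaries, with overlap to
            // preserve context across chunk edges.
            var chunks = TextChunker.SplitMarkdownParagraphs(lines, ChunkSize, OverlapSize);

            foreach (var chunk in chunks)
            {
                yield return (file, chunk);
            }
        }
    }
}
```

The overlap matters: without it, a sentence that straddles a chunk boundary loses half its context in both chunks, which hurts retrieval quality.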
Lab 2: Generating Embeddings
Goal: Convert document chunks into embeddings.
In this lab you will learn how to prepare your chunks for embedding, and how to generate embeddings using Semantic Kernel.
What Are Embeddings?
Embeddings are numerical representations of text (words, sentences, or documents) that capture semantic meaning. They’re the output of AI models trained to understand language.
📊 For example, the sentence “The quick brown fox” could become a vector like:
[0.12, -0.98, 0.45, ..., 0.33]
Key property:
Texts with similar meanings get similar embeddings, even if they use different words.
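That "similar meanings, similar embeddings" property is what semantic search relies on, and similarity is typically measured with cosine similarity. A minimal sketch of the math (the helper class `VectorMath` is invented for illustration; real embeddings have hundreds or thousands of dimensions, not three):

```csharp
// Cosine similarity: 1.0 = same direction, 0 = unrelated, -1 = opposite.
public static class VectorMath
{
    public static double CosineSimilarity(float[] a, float[] b)
    {
        double dot = 0, normA = 0, normB = 0;

        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];     // how much the vectors agree
            normA += a[i] * a[i];   // squared length of a
            normB += b[i] * b[i];   // squared length of b
        }

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

With made-up toy vectors, an embedding for "The quick brown fox" would score close to 1.0 against "A fast auburn fox" and much lower against "Quarterly tax filing", even though the first pair shares almost no words.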
Embeddings vs. Vectors
It’s common to hear the terms used interchangeably, but here’s the difference:
| Term | Meaning |
|---|---|
| Vector | A list of numbers. Purely mathematical. |
| Embedding | A vector with meaning, produced by a model to represent some input (e.g. text, image). |
So:
- ✅ All embeddings are vectors
- ❌ Not all vectors are embeddings
Think of a vector as a format, and an embedding as meaningful content in that format.
Exercise: Configure the vector store
To work with a vector store in Semantic Kernel, you need to configure it in the application. Since we're using Aspire to orchestrate the various moving parts of the application, we can configure the vector store with just a few lines of code.
- Add the Qdrant connector to the `apps/indexer/InfoSupport.AgentWorkshop.Indexer` project:

  ```shell
  dotnet add package Microsoft.SemanticKernel.Connectors.Qdrant --prerelease
  dotnet add package Aspire.Qdrant.Client
  ```

- Add the following statement to `apps/indexer/InfoSupport.AgentWorkshop.Indexer/Program.cs` after the line where the Azure OpenAI client is added:

  ```csharp
  builder.AddQdrantClient("qdrant");
  ```

- Configure the vector store in the `Program.cs` file:

  ```csharp
  builder.Services.AddKernel()
      .AddAzureOpenAITextEmbeddingGeneration(
          builder.Configuration["LanguageModel:TextEmbeddingDeploymentName"]!);

  builder.Services.AddQdrantVectorStore();
  ```
With the vector store configured, we can start storing data in it.
Exercise: Generating and storing embeddings
There are many vector databases you can use with Semantic Kernel. They all follow a similar pattern: you store a record identified by a key, and the record holds an embedding vector plus additional metadata that must be serializable to JSON.
In C#, you must create a dedicated class to represent the data in a vector store. Semantic Kernel uses the term vector store for any database that can store vector data, whether that's a pure vector database or a relational database with vector support. If you plan to store vector data in a regular database, be aware that you can't combine the data structures offered by Semantic Kernel with other relational data processing such as Entity Framework Core, even though the database itself may support it.
- Add a new file `TextUnit.cs` to the directory `apps/indexer/InfoSupport.AgentWorkshop.Indexer/Models` and add a new class `TextUnit` in it:

  ```csharp
  using Microsoft.Extensions.VectorData;

  namespace InfoSupport.AgentWorkshop.Indexer.Models;

  public class TextUnit
  {
  }
  ```

- Add the filename and content to the class. To preserve information about the original file, we add a property for the filename and one for the content. Semantic Kernel uses the term `VectorStoreRecordData` for the metadata fields stored alongside the vector:

  ```csharp
  [VectorStoreRecordData]
  public string OriginalFileName { get; set; } = default!;

  [VectorStoreRecordData(IsFullTextIndexed = true)]
  public string Content { get; set; } = default!;
  ```

- Add the identifier and vector data to the class. A vector store record in Semantic Kernel requires a unique key and a vector field. The identifier for the `TextUnit` is a unique number marked with the `[VectorStoreRecordKey]` attribute. The vector field has to be of type `ReadOnlyMemory<float>` and is marked with the `[VectorStoreRecordVector]` attribute; the number of dimensions depends on the embedding model you use:

  ```csharp
  [VectorStoreRecordKey]
  public ulong Id { get; set; }

  [VectorStoreRecordVector(1536)] // Number of dimensions for the embedding
  public ReadOnlyMemory<float> Embedding { get; set; }
  ```

  The embedding size is usually listed in the documentation of the provider that offers the embedding model you're using. Although it's wise to use an embedding model from the same provider as your LLM, it's not required. Using an embedding model from another provider or an open-source embedding model requires extra maintenance, though, and may not yield higher-quality search results.

- Map the chunks to the data model. Modify the method `UpsertChunkAsync` in the `ContentIndexerService` class and add the following code to it:

  ```csharp
  var embeddingVector = await embeddingGenerationService.GenerateEmbeddingAsync(
      chunk, cancellationToken: cancellationToken);

  var textUnit = new TextUnit
  {
      Content = chunk,
      Embedding = embeddingVector,
      Id = _currentIdentifier++,
      OriginalFileName = fileName,
  };
  ```

- Ensure that the vector store collection exists. Add the following code to the end of the `UpsertChunkAsync` method:

  ```csharp
  var collection = vectorStore.GetCollection<ulong, TextUnit>("content");
  await collection.CreateCollectionIfNotExistsAsync(cancellationToken);
  ```

- Save the text unit to the vector store. Add the following code to the end of the `UpsertChunkAsync` method:

  ```csharp
  await collection.UpsertAsync(textUnit, cancellationToken);
  ```
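Putting the three fragments above together, the completed `UpsertChunkAsync` method might look roughly like this. Treat it as a sketch: the field names `embeddingGenerationService`, `vectorStore`, and `_currentIdentifier` follow the lab's starter code and may differ in your project:

```csharp
private async Task UpsertChunkAsync(
    string fileName, string chunk, CancellationToken cancellationToken)
{
    // Generate the embedding vector for this chunk of text.
    var embeddingVector = await embeddingGenerationService.GenerateEmbeddingAsync(
        chunk, cancellationToken: cancellationToken);

    // Map the chunk onto the vector store data model.
    var textUnit = new TextUnit
    {
        Content = chunk,
        Embedding = embeddingVector,
        Id = _currentIdentifier++,
        OriginalFileName = fileName,
    };

    // Make sure the target collection exists, then store the record.
    // Upsert means an existing record with the same Id is overwritten.
    var collection = vectorStore.GetCollection<ulong, TextUnit>("content");
    await collection.CreateCollectionIfNotExistsAsync(cancellationToken);
    await collection.UpsertAsync(textUnit, cancellationToken);
}
```

Creating the collection on every upsert is simple but wasteful; in a real indexer you would typically ensure the collection exists once, before the processing loop starts.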
Summary and next steps
In this module, we’ve explored the fundamentals of document processing and embeddings in Semantic Kernel. We’ve learned how to process documents and split them into chunks, and how to embed these chunks into a vector space. We’ve also learned how to store these embeddings in a vector store and perform semantic search on them.
In the next module, we'll extend the basic agent with the capability to use the vector store to answer questions about the documents.