Chunking Strategies for RAG - Breaking Down Documents for Better Retrieval
A comprehensive guide to chunking strategies for Retrieval-Augmented Generation, from basic splitting to advanced semantic and agentic approaches.
This is the first post in a comprehensive series on Retrieval-Augmented Generation (RAG). While RAG involves multiple complex components, from embeddings to retrieval to generation, we’ll start with one of the most critical aspects: chunking.
Why might we RAG?
Training recent State-of-the-Art Large Language Models demands substantial computational resources. This typically requires large clusters of high-performance GPUs (hundreds to thousands, like NVIDIA A100 or H100) working in parallel to manage the massive number of parameters and datasets involved. Consequently, this process is extremely time-consuming. For instance, the Llama 3 model family, released on April 18, 2024, required significant compute for training, as illustrated below:
| | Time (GPU hours) | Power Consumption (W) | Carbon Emitted (tCO2eq) |
|---|---|---|---|
| Llama 3 8B | 1.3M | 700 | 390 |
| Llama 3 70B | 6.4M | 700 | 1900 |
| Total | 7.7M | | 2290 |
An important limitation of these models, stemming from their training data, is the concept of knowledge cutoffs. This means they lack inherent awareness of events or information that emerged after their training data was finalized. For example, Llama 3 70B has a knowledge cutoff around December 2023, as indicated in its model card1:
| Context Window | Trained with Long Context | Knowledge Cutoff |
|---|---|---|
| 8k | Yes | December, 2023 |
One way to provide LLMs with more up-to-date or specific information is by directly including it within the context window of the prompt. By feeding relevant documents or pieces of information along with the user’s query, the LLM can leverage this provided context to generate more informed responses. However, this approach has limitations, particularly with very long documents or when needing to access a vast amount of information that exceeds the context window size.
To overcome these limitations and the issue of knowledge cutoffs, Retrieval-Augmented Generation (RAG) has emerged. RAG enhances LLMs by first retrieving relevant information from an external knowledge source. It then incorporates this retrieved context into the prompt before generation. This allows the LLM to access a broader and more current range of knowledge beyond its training data and the constraints of a single context window, leading to more accurate and grounded responses.
The Central Role of Chunking in RAG
Context windows of LLMs have been increasing quite rapidly.
Context Length Increase Over Time by Model
Despite these larger context windows, their full usability still presents challenges. For example, it’s a well-studied phenomenon that LLMs can suffer from issues like positional bias, where the placement of information within the context affects its consideration by the model. Another challenge is the “needle in the haystack” problem, where an important piece of information can be overlooked if buried within a large volume of text. For these reasons, among others, it’s often preferable not to feed entire documents directly into the LLM.
If we’re not feeding the entire document, then we’re naturally relying on chunking. Chunking is the process of splitting a document (or a long string of text) into smaller, separate blocks called chunks. These chunks are usually significantly smaller than the maximum supported context window of an LLM, due to the reasons stated above.23
The quality of your chunking strategy directly impacts the effectiveness of your entire RAG system. Poor chunking can lead to fragmented context, missed connections between related information, and ultimately, subpar generation quality. This makes understanding different chunking approaches essential for building robust RAG applications.
Basic Chunking Strategies
Fixed-Size Chunking: The Fast and Easy Way
There are many ways to create chunks from a document. The most straightforward method is to split the text based on predefined markers, such as periods (`.`), semicolons (`;`), or newline characters (`\n`). Let’s illustrate this with an example:
```python
doc = """To overcome these limitations and the issue of knowledge cutoffs,
**Retrieval-Augmented Generation (RAG)** has emerged"""
chunks = doc.split(",")
print(chunks)
```
Output:
```python
["To overcome these limitations and the issue of knowledge cutoffs",
 "**Retrieval-Augmented Generation (RAG)** has emerged"]
```
Now, suppose we ask the LLM: “Why has RAG emerged?” Even if we retrieve both chunks, we know that RAG emerged and that some limitations were being overcome, but we are missing the information that connects the two. This is because the split has broken the semantic link needed to answer the question comprehensively.
To mitigate this, it has become standard practice to create some overlap between chunks. By including a small portion of the preceding or succeeding text in each chunk, we can maintain contextual continuity. With overlap, our chunks might look like this:
An example of overlapping chunks
One issue with simple splitting approaches is that we could end up with chunks that are either too small to convey meaningful information or too large, causing the core meaning to become diluted. A common strategy is to set minimum and maximum chunk sizes (measured by characters, tokens, words, etc.). If a chunk exceeds the maximum size, it can be further split (e.g., into equally sized sub-chunks). If it’s below the minimum, it might be merged with a neighboring chunk.
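To make this concrete, here is a minimal sketch of character-based splitting with overlap and a minimum-size rule; the `chunk_size`, `overlap`, and `min_size` values are purely illustrative:

```python
def split_with_overlap(text, chunk_size=200, overlap=50, min_size=40):
    """Naive character-based splitting with overlap and a minimum-size rule."""
    step = chunk_size - overlap
    starts = list(range(0, len(text), step))
    # If the trailing chunk would be smaller than min_size, let the
    # previous chunk run to the end of the text instead.
    if len(starts) > 1 and len(text) - starts[-1] < min_size:
        starts.pop()
    chunks = []
    for i, start in enumerate(starts):
        end = len(text) if i == len(starts) - 1 else start + chunk_size
        chunks.append(text[start:end])
    return chunks

chunks = split_with_overlap(doc)
```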
These chunks will eventually be encoded as numerical vectors (embeddings) of a limited dimension. The longer the textual chunk, the more “diluted” or averaged its vector representation might become, potentially obscuring fine-grained details.
Recursive Character Text Splitting
A more sophisticated approach to fixed-size chunking is recursive character text splitting. This method attempts to split text at natural boundaries while respecting size constraints. It works through a hierarchy of separators: first paragraphs (`\n\n`), then sentences (`. `), then words (spaces), and finally individual characters if necessary.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = text_splitter.split_text(your_document)
```
This approach maintains more semantic coherence than naive character-based splitting while still providing predictable chunk sizes.
Token-Aware Chunking
Since LLMs work with tokens rather than characters, token-aware chunking ensures that chunks respect the actual token boundaries that the model will process. This is particularly important when working with models that have strict token limits.
```python
import tiktoken

def count_tokens(text, model="gpt-3.5-turbo"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def token_aware_split(text, max_tokens=500, overlap_tokens=50, model="gpt-3.5-turbo"):
    # One possible implementation: a sliding window over the token sequence
    # rather than over characters.
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    stride = max_tokens - overlap_tokens
    return [encoding.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), stride)]
```
Token-aware chunking prevents situations where a chunk appears to fit within limits but actually exceeds the model’s token capacity.
Document Structure-Based Chunking
Markdown and HTML Structure Chunking
For structured documents like Markdown or HTML, we can leverage the inherent document structure to create more meaningful chunks. Headers, sections, and other structural elements provide natural boundaries that preserve semantic relationships.
```python
from langchain.text_splitter import MarkdownTextSplitter

markdown_splitter = MarkdownTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)
# Respects markdown headers and structure
chunks = markdown_splitter.split_text(markdown_document)
```
Sentence-Based Chunking
Rather than splitting arbitrarily, sentence-based chunking uses natural language processing to identify sentence boundaries, ensuring that chunks contain complete thoughts.
```python
from langchain.text_splitter import NLTKTextSplitter
import nltk

nltk.download('punkt')

sentence_splitter = NLTKTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)
chunks = sentence_splitter.split_text(document)
```
This approach is particularly effective for maintaining grammatical coherence and avoiding mid-sentence breaks that can confuse retrieval systems.
Note that these implementations are often quite straightforward, so I’d strongly encourage you to check their source code. Sometimes they rely on third-party libraries; for example, the `NLTKTextSplitter` is mainly a wrapper around `nltk.tokenize.sent_tokenize`.
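For intuition, here is a rough sketch of that underlying idea, assuming `nltk` is available: tokenize into sentences with `nltk.tokenize.sent_tokenize`, then greedily pack whole sentences into chunks up to a character budget.

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')

def sentence_chunks(text, max_chars=1000):
    # Greedily pack whole sentences into chunks of at most max_chars characters
    chunks, current = [], ""
    for sentence in sent_tokenize(text):
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```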
Advanced Semantic Chunking
Embedding-Based Semantic Chunking
The methods presented so far follow relatively simple heuristics. Depending on the use case and resource constraints, we might consider embedding-based semantic chunking. The high-level idea is that instead of just splitting by a fixed number of tokens or characters, this approach uses the semantic similarity between consecutive segments of the text to decide where to split. The key is to identify points in the text where the meaning or topic shifts significantly. This is often achieved by embedding sentences or small groups of sentences and then calculating the similarity (e.g., cosine similarity) between adjacent embeddings. A split point is introduced where the similarity drops below a certain threshold, indicating a change in topic.
An example of how semantic chunking can be implemented
Here’s a simplified implementation concept:
```python
import nltk
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_chunking(text, similarity_threshold=0.7):
    model = SentenceTransformer('all-MiniLM-L6-v2')
    sentences = nltk.sent_tokenize(text)
    embeddings = model.encode(sentences)

    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        similarity = cosine_similarity(
            [embeddings[i - 1]],
            [embeddings[i]]
        )[0][0]

        if similarity > similarity_threshold:
            current_chunk.append(sentences[i])
            # Carry forward an averaged representation of the growing chunk
            embeddings[i] = np.mean([embeddings[i - 1], embeddings[i]], axis=0)
        else:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentences[i]]

    chunks.append(' '.join(current_chunk))
    return chunks
```
There’s a big computational benefit to averaging the embeddings, but it comes at the cost of less detailed representations; a simple heuristic is to limit the number of sentences merged into a single chunk.
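To illustrate that heuristic, here is a sketch of the same loop with a cap on chunk growth; the `max_sentences` parameter and the precomputed `similarities` list are assumptions introduced for this example:

```python
def semantic_chunking_capped(sentences, similarities, threshold=0.7, max_sentences=8):
    """similarities[i] holds the cosine similarity between sentences[i] and sentences[i + 1]."""
    chunks, current_chunk = [], [sentences[0]]
    for i in range(1, len(sentences)):
        same_topic = similarities[i - 1] > threshold
        if same_topic and len(current_chunk) < max_sentences:
            current_chunk.append(sentences[i])
        else:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentences[i]]
    chunks.append(' '.join(current_chunk))
    return chunks
```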
Multi-Level Hierarchical Chunking
Hierarchical chunking creates nested structures of chunks at different granularities. This approach maintains both fine-grained and broad contextual information, allowing the retrieval system to access information at multiple levels of detail.
```python
def hierarchical_chunking(document):
    # Level 1: Document sections (largest chunks)
    sections = split_by_headers(document)

    # Level 2: Paragraphs within sections
    paragraphs = []
    for section in sections:
        paragraphs.extend(split_by_paragraphs(section))

    # Level 3: Sentences within paragraphs (finest chunks)
    sentences = []
    for paragraph in paragraphs:
        sentences.extend(split_by_sentences(paragraph))

    return {
        'sections': sections,
        'paragraphs': paragraphs,
        'sentences': sentences
    }
```
This hierarchical approach enables more sophisticated retrieval strategies where different levels can be queried based on the type of information needed.
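The helpers above (`split_by_headers`, `split_by_paragraphs`, `split_by_sentences`) are placeholders; for a Markdown-style document they might look roughly like this (an assumption for illustration, not a prescribed implementation):

```python
import re
import nltk

def split_by_headers(document):
    # Treat Markdown headers (lines starting with '#') as section boundaries
    return [s.strip() for s in re.split(r'\n(?=#+ )', document) if s.strip()]

def split_by_paragraphs(section):
    return [p.strip() for p in section.split('\n\n') if p.strip()]

def split_by_sentences(paragraph):
    return nltk.sent_tokenize(paragraph)
```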
Cutting-Edge Chunking Approaches
Agentic Chunking
Agentic chunking represents one of the most advanced approaches, leveraging LLMs themselves to determine optimal chunk boundaries. Instead of relying on fixed rules or simple similarity measures, this method uses the reasoning capabilities of language models to identify semantically meaningful breakpoints.4
```python
def agentic_chunking(document, llm_client):
    prompt = f"""
    Analyze the following document and identify natural breakpoints
    where the topic or focus shifts significantly. Return the
    positions where the document should be split to create
    semantically coherent chunks.

    Document: {document}

    Return split positions as a list of character indices.
    """

    response = llm_client.generate(prompt)
    split_positions = parse_split_positions(response)

    chunks = create_chunks_from_positions(document, split_positions)
    return chunks
```
Agentic chunking excels at understanding context, recognizing topic transitions, and creating chunks that align with human intuition about document structure.
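`parse_split_positions` and `create_chunks_from_positions` are left undefined in the snippet; one hedged way to fill them in, assuming the model returns its split indices as plain integers somewhere in the reply:

```python
import re

def parse_split_positions(response: str) -> list[int]:
    # Pull every integer out of the model's reply and keep them sorted and unique
    return sorted({int(m) for m in re.findall(r"\d+", response)})

def create_chunks_from_positions(document: str, positions: list[int]) -> list[str]:
    # Slice the document between consecutive boundaries, dropping empty pieces
    boundaries = [0] + [p for p in positions if 0 < p < len(document)] + [len(document)]
    return [document[a:b] for a, b in zip(boundaries, boundaries[1:])
            if document[a:b].strip()]
```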
Late Chunking
Late chunking is an innovative approach that reverses the order of traditional document processing. Instead of chunking the document first and then generating embeddings for each chunk, late chunking first generates token-level embeddings for the entire document and subsequently derives chunk-level representations. This derivation often employs pooling or attention mechanisms over the token embeddings within identified chunk boundaries.
```python
def late_chunking(document, embedding_model):
    # Step 1: Generate token-level embeddings for the entire document
    token_embeddings = embedding_model.encode_tokens(document)

    # Step 2: Identify chunk boundaries using various strategies
    chunk_boundaries = identify_boundaries(document)

    # Step 3: Create chunk embeddings by pooling token embeddings
    chunk_embeddings = []
    for start, end in chunk_boundaries:
        chunk_embedding = pool_embeddings(
            token_embeddings[start:end],
            method='attention'
        )
        chunk_embeddings.append(chunk_embedding)

    return chunk_embeddings, chunk_boundaries
```
This method can capture more nuanced relationships across the document and often results in higher-quality chunk representations because each token’s embedding is informed by the context of the entire (or a large portion of the) document.
The authors of this approach5 advocate for the use of encoder-only models (like BERT). With such models, each token’s embedding benefits from the bidirectional context of the text. The process typically involves tokenizing (and potentially pre-chunking to fit the model’s context window, e.g., ~8k tokens), feeding these tokens into the encoder to obtain contextually rich token embeddings, and then splitting the document into chunks based on defined boundaries.
Comparing this to early chunking, the resulting representation of a sentence in late chunking will contain information influenced by the surrounding text, unlike early chunking where the sentence’s embedding is primarily based on the tokens within that sentence alone.
An illustration of the contextual representation of embeddings in late vs early chunking
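As a rough sketch of that pipeline (using `transformers` with mean pooling; the model name, pooling choice, and character-boundary format are illustrative rather than the authors’ exact setup):

```python
import torch
from transformers import AutoTokenizer, AutoModel

def late_chunk_embeddings(document, char_boundaries, model_name="bert-base-uncased"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Encode the whole document once so every token sees its surrounding context
    # (truncated to this model's window; a long-context encoder would be used in practice)
    enc = tokenizer(document, return_tensors="pt",
                    return_offsets_mapping=True, truncation=True)
    offsets = enc["offset_mapping"][0]                 # (num_tokens, 2) character spans
    inputs = {k: v for k, v in enc.items() if k != "offset_mapping"}
    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state[0]  # (num_tokens, dim)

    chunk_embeddings = []
    for start_char, end_char in char_boundaries:
        # Mean-pool the contextualized embeddings of tokens inside the chunk span,
        # skipping special tokens whose offsets are empty
        mask = ((offsets[:, 0] >= start_char) & (offsets[:, 1] <= end_char)
                & (offsets[:, 1] > offsets[:, 0]))
        if mask.any():
            chunk_embeddings.append(token_embeddings[mask].mean(dim=0))
    return chunk_embeddings
```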
Sliding Window with Context Enrichment
Context-enriched sliding window chunking extends the traditional sliding window approach by not just including overlapping text, but by summarizing or extracting key information from adjacent chunks to provide richer context.6
```python
def context_enriched_chunking(document, window_size=500, stride=250):
    basic_chunks = sliding_window_split(document, window_size, stride)
    enriched_chunks = []

    for i, chunk in enumerate(basic_chunks):
        enriched_chunk = ""

        # Add summary of previous chunk
        if i > 0:
            prev_summary = summarize_chunk(basic_chunks[i-1])
            enriched_chunk += f"Previous context: {prev_summary}\n"

        enriched_chunk += chunk

        # Add summary of next chunk
        if i < len(basic_chunks) - 1:
            next_summary = summarize_chunk(basic_chunks[i+1])
            enriched_chunk += f"Following context: {next_summary}\n"

        enriched_chunks.append(enriched_chunk)

    return enriched_chunks
```
Context-enriched chunks, built by summarizing neighboring segments
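`sliding_window_split` and `summarize_chunk` are placeholders in the snippet above. A deliberately simple stand-in could look like this; in practice, `summarize_chunk` would usually call a summarization model or an LLM:

```python
import nltk

def sliding_window_split(document, window_size=500, stride=250):
    # Overlapping fixed-size windows over the raw text
    return [document[i:i + window_size] for i in range(0, len(document), stride)]

def summarize_chunk(chunk, max_sentences=1):
    # Crude extractive "summary": keep only the first sentence(s) of the chunk
    sentences = nltk.sent_tokenize(chunk)
    return ' '.join(sentences[:max_sentences])
```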
Hybrid Multi-Strategy Chunking
In practice, the most effective systems often combine multiple chunking strategies based on document characteristics, content type, and specific use case requirements.
```python
class HybridChunker:
    def __init__(self):
        self.strategies = {
            'code': self.code_aware_chunking,
            'structured': self.structure_based_chunking,
            'narrative': self.semantic_chunking,
            'technical': self.hierarchical_chunking
        }

    def chunk_document(self, document, content_type='auto'):
        if content_type == 'auto':
            content_type = self.detect_content_type(document)

        primary_strategy = self.strategies[content_type]
        chunks = primary_strategy(document)

        # Apply post-processing based on chunk quality metrics
        optimized_chunks = self.optimize_chunks(chunks)
        return optimized_chunks
```
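The individual strategy methods and `detect_content_type` are left abstract here. As one purely illustrative heuristic (a real system might use a classifier), the content type could be guessed from surface features of the text and attached to `HybridChunker` as a method:

```python
def detect_content_type(document: str) -> str:
    # Very rough surface-feature heuristics, for illustration only
    if "```" in document or "def " in document or "{" in document:
        return 'code'
    if document.lstrip().startswith('#') or '<h1>' in document:
        return 'structured'
    if any(token in document.lower() for token in ('figure', 'table', 'equation')):
        return 'technical'
    return 'narrative'
```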
Evaluating Chunk Quality: Ensuring Your Chunks Hit the Mark
We’ve explored various strategies for breaking down documents, from simple splits to sophisticated agentic and late chunking methods. But how do we know if our chosen strategy is effective? How can we measure the “goodness” of our chunks? Evaluating chunk quality is a critical step, as it directly influences the relevance and accuracy of the information retrieved and, consequently, the final output of your RAG system. Poor chunks lead to poor retrieval, which inevitably results in subpar generated responses.
What Makes a “Good” Chunk? Key Criteria
Before diving into evaluation methods, let’s define what we’re looking for in a high-quality chunk:
- Relevance: This is paramount. A chunk is high quality if it contains information that is semantically relevant to potential user queries it’s meant to address. It should directly contribute to answering a question or providing necessary context.
- Conciseness: Chunks should be “just right” in size. They need to be substantial enough to carry meaning but not so verbose that they introduce noise or bury the key information. Remember the “diluted embeddings” issue with overly long chunks.
- Coherence: The content within a single chunk should be thematically unified and logically connected. A chunk that randomly jumps between disparate topics will confuse the retrieval system and the LLM.
- Contextual Completeness (to a degree): While chunks are pieces of a larger whole, an ideal chunk should be as self-contained as possible to be understandable. It shouldn’t excessively rely on other, unretrieved chunks for its core meaning to be grasped. This is where techniques like overlap or context enrichment (discussed earlier) play a role.
- Accuracy & Factuality: At the risk of stating the obvious, the information within the chunk must be accurate. Retrieving factual errors will lead the LLM to generate incorrect or misleading information.
Methods for Evaluating Chunk Quality
Evaluating something as nuanced as “chunk quality” often requires a multi-faceted approach, blending automated techniques with the irreplaceable judgment of human evaluators.
The Gold Standard: Humans
Despite being time-consuming and potentially costly, human evaluation is often the most reliable way to assess chunk quality, especially for nuances like coherence and true relevance.
- Direct Rating: Ask human evaluators to rate chunks based on the criteria above (e.g., on a scale of 1-5 for relevance, coherence, etc.) against a specific query or information need.
- Comparative Evaluation: Present evaluators with chunks generated by different strategies for the same document segment and ask them to choose the best one or rank them.
- Annotation Tasks: Evaluators can highlight relevant sentences within a chunk, identify irrelevant parts, or suggest better break points.
- Impact on Final Output: Ultimately, chunk quality affects the RAG system’s final answer. Human evaluators can assess the quality, relevance, and faithfulness of the LLM’s response, which indirectly reflects the quality of the retrieved chunks.
Automated & Proxy Metrics
While direct automated measurement of all quality aspects is hard, several metrics can serve as proxies or evaluate specific parts of the chunking and retrieval process:
- Chunk Length Distribution: Analyzing the distribution of chunk sizes (in tokens or characters) can ensure they fall within your desired `min_chunk_size` and `max_chunk_size` parameters. Drastic variations might indicate issues with the chunking logic.
- Overlap Analysis: If using overlap, verify that it’s being implemented correctly and assess whether the overlap percentage is optimal (too little might break context, too much increases redundancy).
Retrieval-Focused Metrics (Indirect evaluation of chunks):
- Embedding Similarity: For a given query, after you retrieve chunks, you can measure the semantic similarity (e.g., using cosine similarity between query embedding and chunk embeddings) as a proxy for relevance. This is often part of the retrieval process itself but can be analyzed.
- “Needle in a Haystack” Evaluation for Chunks: Adapt the “needle in a haystack” test. Intentionally insert a specific piece of information (the “needle”) into a document, chunk it, and then try to retrieve the chunk containing that needle using a relevant query. Success here indicates the chunk retained the key information and the chunking strategy didn’t obscure it.
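A minimal version of that test might look like the following sketch, where `chunk_fn` and `embed_fn` stand in for whatever chunking and embedding functions your pipeline already uses (names introduced here for illustration):

```python
import numpy as np

def needle_test(document, needle, query, chunk_fn, embed_fn, top_k=3):
    """Insert a known fact, chunk the document, and check that it is retrievable."""
    midpoint = len(document) // 2
    doc_with_needle = document[:midpoint] + " " + needle + " " + document[midpoint:]

    chunks = chunk_fn(doc_with_needle)
    chunk_vecs = np.array([embed_fn(c) for c in chunks])
    query_vec = np.array(embed_fn(query))

    # Rank chunks by cosine similarity to the query
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    top_chunks = [chunks[i] for i in np.argsort(sims)[::-1][:top_k]]

    # Success: the needle survived chunking and is retrieved among the top results
    return any(needle in c for c in top_chunks)
```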
Content-Based Heuristics:
- Topic Coherence Scores: If you want to assess how thematically focused chunks are, topic modeling methods like LDA can be used to compute coherence scores within each chunk. A coherent chunk will have a high average topic coherence, meaning the terms in the chunk consistently relate to a central theme. Implementation involves building topic models on the corpus and calculating metrics like UMass or NPMI within each chunk.
- Sentence Boundary Adherence: If your strategy is supposed to respect sentence boundaries (like NLTKTextSplitter), you can programmatically check how often chunks end mid-sentence.
Topic coherence measures how consistently words in a chunk relate to the same topic. Higher coherence usually means better chunk quality for thematic tasks like QA or summarization.
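For the sentence-boundary adherence check mentioned above, a crude but serviceable heuristic is to count how many chunks end in sentence-final punctuation:

```python
def sentence_boundary_adherence(chunks):
    # Fraction of chunks that end on sentence-final punctuation
    endings = ('.', '!', '?', '."', '!"', '?"')
    well_formed = sum(1 for chunk in chunks if chunk.rstrip().endswith(endings))
    return well_formed / len(chunks) if chunks else 0.0
```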
```python
# Example: Basic check for chunk sizes
from langchain.text_splitter import RecursiveCharacterTextSplitter

def evaluate_chunk_sizes(chunks, min_chars=50, max_chars=1000):
    violations = {"too_small": 0, "too_large": 0, "good": 0}
    for chunk_text in chunks:
        if len(chunk_text) < min_chars:
            violations["too_small"] += 1
        elif len(chunk_text) > max_chars:
            violations["too_large"] += 1
        else:
            violations["good"] += 1
    return violations

# Dummy document and splitter for illustration
doc_content = "This is a sample document that is long enough to be split into several pieces. We want to ensure our pieces are reasonably sized." * 10
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
chunks = text_splitter.split_text(doc_content)

chunk_size_evaluation = evaluate_chunk_sizes(chunks, min_chars=50, max_chars=150)
print(f"Chunk size evaluation: {chunk_size_evaluation}")
```
While the code above checks character length, remember that token count is often the more critical factor for LLMs, as discussed in “Token-Aware Chunking.” Your evaluation should ideally align with how your LLM processes text.
Embedding-Based Evaluation with IoU and Recall
Another approach is to compare the chunks produced by your strategy with an ideal or reference segmentation using semantic similarity. Metrics like Intersection-over-Union (IoU) and Recall can be used to quantify overlap between your chunks and ground truth labels.7
```python
from chunking_evaluation import BaseChunker, GeneralEvaluation
from chromadb.utils import embedding_functions

# Define a custom chunking class
class CustomChunker(BaseChunker):
    def split_text(self, text):
        # Custom chunking logic
        return [text[i:i+1200] for i in range(0, len(text), 1200)]

# Instantiate the custom chunker and evaluation
chunker = CustomChunker()
evaluation = GeneralEvaluation()

# Choose embedding function
default_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="OPENAI_API_KEY",
    model_name="text-embedding-3-large"
)

# Evaluate the chunker
results = evaluation.run(chunker, default_ef)
print(results)

# Example output:
# {'iou_mean': 0.177, 'iou_std': 0.106, 'recall_mean': 0.809, 'recall_std': 0.379}
```
These metrics provide an automatic, quantitative view of how well your chunks match an expected segmentation. IoU measures the overlap between retrieved and reference segments, while recall indicates how much of the relevant information is captured.
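For intuition, both metrics can be computed over sets of token positions; here is a hedged sketch of the underlying arithmetic (not the library’s internal implementation):

```python
def iou(retrieved: set, relevant: set) -> float:
    # Intersection-over-Union of token positions
    union = retrieved | relevant
    return len(retrieved & relevant) / len(union) if union else 0.0

def recall(retrieved: set, relevant: set) -> float:
    # Fraction of relevant token positions that were retrieved
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

# Example: relevant excerpt spans tokens 100..119, retrieved chunk spans 90..129
relevant = set(range(100, 120))
retrieved = set(range(90, 130))
print(iou(retrieved, relevant), recall(retrieved, relevant))  # 0.5 and 1.0
```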
Iterative Improvement
Evaluating chunk quality isn’t a one-time task. It’s an iterative process:
- Establish Baselines: Start with a basic chunking strategy and evaluate its performance.
- Experiment: Try different strategies, chunk sizes, and overlaps.
- Measure: Use a combination of human and automated evaluation to assess the impact of your changes.
- Analyze Failures: When the RAG system provides a poor answer, trace back to the retrieved chunks. Were they irrelevant, incoherent, or incomplete?
- Refine: Adjust your chunking strategy based on these findings.
By systematically evaluating and refining your chunking approach, you lay a strong foundation for a high-performing Retrieval-Augmented Generation system that can effectively leverage external knowledge.
Common Pitfalls and Best Practices
Pitfalls to Avoid
- Ignoring document structure: Treating all documents the same regardless of format
- Fixed-size obsession: Prioritizing uniform chunk sizes over semantic coherence
- Insufficient overlap: Creating disconnected chunks that lose context
- Over-chunking: Creating chunks so small they lack meaningful information
- No evaluation: Implementing chunking without measuring its impact on retrieval quality
Best Practices
- Start simple, iterate: Begin with basic methods and add complexity based on observed needs
- Content-aware strategies: Adapt your approach based on document type and domain
- Measure impact: Always evaluate chunking quality through downstream task performance
- Consider preprocessing: Clean and normalize text before chunking
- Document your choices: Keep track of parameters and rationale for future optimization
Looking Ahead: The RAG Series
These chunking strategies form the foundation of effective RAG systems. In the upcoming posts in this series, we’ll explore how these chunks are transformed into embeddings, stored in vector databases, retrieved based on user queries, and finally used to generate contextually informed responses.
Understanding chunking deeply will make the subsequent components of the RAG pipeline much clearer, as the quality of your chunks directly impacts every downstream process. The investment in getting chunking right pays dividends throughout your entire RAG system.
Most importantly, there’s no one-size-fits-all solution. One must carefully consider the nature of the documents, the context length of the LLM, the specific use case, and conduct thorough tests to find the optimal approach. The best chunking strategy is one that balances your specific requirements for accuracy, speed, and resource usage while maintaining semantic integrity and supporting effective retrieval.