
RAG Chunking Strategies Compared: Fixed-Size vs Semantic vs Agentic in 2026

Compare fixed-size, semantic, and agentic chunking for RAG pipelines. Benchmarks, trade-offs, and when each strategy actually improves retrieval quality.

By SouvenirList

You built a RAG pipeline — documents go in, embeddings come out, the LLM answers questions. It works on your test set. Then you throw real documents at it and retrieval quality falls off a cliff. The LLM hallucinates answers that are almost right, pulling fragments from the wrong section of the right document. The problem isn’t your embedding model or your vector database — it’s how you’re splitting documents into chunks. Chunking is the most under-discussed and highest-leverage decision in RAG pipeline design, and in 2026 the options have expanded from “split every 500 tokens” to sophisticated semantic and agentic approaches that understand document structure.


TL;DR

  • Fixed-size chunking (split every N tokens with overlap) is fast, simple, and surprisingly effective for homogeneous documents. Start here.
  • Semantic chunking (split at topic boundaries using embeddings) improves retrieval for documents with mixed topics — technical docs, long-form articles, legal contracts. 10–25% retrieval quality improvement in benchmarks, but 5–10x slower ingestion.
  • Agentic chunking (LLM decides chunk boundaries and metadata) produces the highest-quality chunks but costs 50–100x more to ingest. Worth it only for high-value, low-volume corpora.
  • The real answer: use fixed-size for prototyping and most production workloads, add semantic chunking when retrieval quality on multi-topic documents drops below acceptable thresholds, and reserve agentic chunking for specialized use cases where per-query accuracy justifies the cost.

Why Chunking Matters More Than Your Embedding Model

A RAG pipeline has three critical stages: chunking, embedding, and retrieval. Most teams spend weeks evaluating embedding models (OpenAI vs Cohere vs open-source) and vector databases (pgvector vs Pinecone) — then chunk their documents with text.split() every 512 tokens and call it done.

This is backwards. The embedding model can only embed what it receives. If a chunk contains half of Topic A and half of Topic B, the embedding will be a muddled average of both — and it won’t be a strong match for queries about either topic. Garbage chunks in, garbage retrieval out, regardless of how good your embeddings are.

The goal of chunking is to produce text segments where each chunk contains exactly one coherent idea or piece of information that can be meaningfully matched to a user query. How you achieve this depends on your documents, your budget, and your accuracy requirements.


Strategy 1: Fixed-Size Chunking

How It Works

Split the document into segments of N tokens (typically 256–1024) with an overlap of M tokens (typically 50–200) between consecutive chunks. Every chunk is the same size regardless of content.

import tiktoken

tokenizer = tiktoken.get_encoding("cl100k_base")

def fixed_size_chunk(text: str, chunk_size: int = 512, overlap: int = 128) -> list[str]:
    """Split the token stream into chunk_size windows, stepping by chunk_size - overlap."""
    tokens = tokenizer.encode(text)
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk_tokens = tokens[i:i + chunk_size]
        chunks.append(tokenizer.decode(chunk_tokens))
    return chunks

When It Works Well

  • Homogeneous documents: Technical documentation where each section covers one topic. API references, man pages, FAQ lists.
  • Short documents: Under 2,000 tokens, chunking strategy barely matters — the whole document fits in 2–4 chunks.
  • High-volume ingestion: Processing millions of documents where ingestion speed matters more than per-chunk quality.

When It Breaks Down

  • Long documents with topic transitions: A 50-page report where page 12 switches from financial analysis to legal disclaimers. A fixed-size chunk spanning that transition will embed poorly for both topics.
  • Documents with hierarchical structure: Code files, legal contracts, academic papers with sections/subsections. Fixed-size chunking ignores structure entirely.

Recommended starting points by document type:

Document Type                     Chunk Size    Overlap      Rationale
Short-form (blog posts, emails)   256 tokens    64 tokens    Small chunks = precise retrieval
Technical docs                    512 tokens    128 tokens   Balance between context and precision
Long-form (reports, books)        1024 tokens   256 tokens   Preserve paragraph-level context
Code files                        512 tokens    128 tokens   With line-boundary alignment

The overlap is non-negotiable. Without it, information that spans a chunk boundary is split across two chunks and neither chunk contains the full context. 20–25% overlap is the sweet spot — less loses boundary information, more wastes tokens and storage.
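To make the overlap arithmetic concrete, here is a small helper (the name chunk_bounds is illustrative) that computes the token ranges the 512/128 defaults produce — each chunk starts chunk_size − overlap = 384 tokens after the previous one, so consecutive chunks share 128 tokens:

```python
def chunk_bounds(n_tokens: int, chunk_size: int = 512, overlap: int = 128) -> list[tuple[int, int]]:
    """Return (start, end) token indices per chunk; stride = chunk_size - overlap."""
    stride = chunk_size - overlap
    return [(i, min(i + chunk_size, n_tokens)) for i in range(0, n_tokens, stride)]

# A 1,000-token document with the 512/128 defaults:
print(chunk_bounds(1000))  # [(0, 512), (384, 896), (768, 1000)]
```

Note that tokens 384–512 appear in both of the first two chunks — that shared region is what keeps boundary-spanning sentences retrievable.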


Strategy 2: Semantic Chunking

How It Works

Instead of splitting at fixed intervals, semantic chunking identifies topic boundaries within the document and splits there. The most common approach uses embedding similarity:

  1. Split the document into sentences (or small fixed segments).
  2. Compute embeddings for each sentence.
  3. Calculate cosine similarity between consecutive sentence embeddings.
  4. Where similarity drops below a threshold, insert a chunk boundary.
  5. Group sentences between boundaries into chunks.

import re
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # load once, not per call

def split_into_sentences(text: str) -> list[str]:
    # Naive splitter on sentence-ending punctuation; swap in spaCy or nltk for production.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def semantic_chunk(text: str, threshold: float = 0.5) -> list[str]:
    sentences = split_into_sentences(text)
    embeddings = model.encode(sentences)

    chunks, current_chunk = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine similarity between consecutive sentence embeddings.
        similarity = np.dot(embeddings[i], embeddings[i-1]) / (
            np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i-1])
        )
        if similarity < threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    chunks.append(" ".join(current_chunk))
    return chunks

The Sliding Window Variant

A more robust approach uses a sliding window of embeddings rather than comparing adjacent sentences. Compute the average embedding of the last 3–5 sentences and compare it to the next sentence’s embedding. This smooths out noise from individual sentence variations and produces more stable boundaries.
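A minimal sketch of the sliding-window boundary detector, using synthetic unit vectors in place of real sentence embeddings so the behavior is easy to see (the function name window_boundaries and the threshold values are illustrative, not from any library):

```python
import numpy as np

def window_boundaries(embeddings: np.ndarray, window: int = 3, threshold: float = 0.5) -> list[int]:
    """Return sentence indices where a new chunk should start.

    Compares each embedding to the mean of the previous `window` embeddings;
    cosine similarity below `threshold` marks a topic boundary.
    """
    boundaries = []
    for i in range(1, len(embeddings)):
        ctx = embeddings[max(0, i - window):i].mean(axis=0)
        sim = ctx @ embeddings[i] / (np.linalg.norm(ctx) * np.linalg.norm(embeddings[i]))
        if sim < threshold:
            boundaries.append(i)
    return boundaries

# Synthetic demo: five "sentences" on topic A, then five on topic B.
emb = np.vstack([np.tile([1.0, 0.0], (5, 1)), np.tile([0.0, 1.0], (5, 1))])
print(window_boundaries(emb, threshold=0.4))  # [5] — one boundary at the topic shift
```

Because the comparison is against a window average rather than a single neighbor, one off-topic sentence inside a coherent passage is much less likely to trigger a spurious split.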

When It Works Well

  • Multi-topic documents: Annual reports, research papers, long articles that cover several distinct subjects.
  • Conversational transcripts: Meeting notes, support tickets, interview transcripts where topics shift organically.
  • Documents where section headers are unreliable or absent.

When It’s Overkill

  • Already well-structured documents: If your documents have clear headers (Markdown H2s, HTML sections), structure-aware splitting (split at headers) is simpler and equally effective.
  • High-throughput pipelines: Semantic chunking requires embedding every sentence at ingestion time, adding 5–10x latency compared to fixed-size.

Performance Impact

In benchmarks on multi-topic document retrieval (using MTEB-style evaluation):

Chunking Strategy                 Recall@5    Precision@5    Ingestion Speed
Fixed-size (512 tokens)           0.72        0.68           1.0x (baseline)
Semantic (embedding similarity)   0.83        0.79           0.12x (~8x slower)
Semantic (sliding window)         0.86        0.81           0.10x (~10x slower)

The retrieval improvement is real — 15–20% better recall — but the ingestion cost is significant. For a corpus of 100,000 documents, fixed-size chunking takes minutes; semantic chunking takes hours.


Strategy 3: Agentic Chunking

How It Works

An LLM reads the document and decides:

  1. Where to place chunk boundaries
  2. What metadata to attach to each chunk (topic labels, key entities, summary)
  3. Whether a section should be kept as one chunk or split further

# Sketch only: `llm` stands in for whatever client library you use, and
# `parse_chunks` is a hypothetical parser for the model's structured output.
def agentic_chunk(text: str) -> list[dict]:
    response = llm.create(
        model="claude-sonnet-4-6",
        messages=[{
            "role": "user",
            "content": f"""Split this document into semantic chunks.
For each chunk, provide:
- The chunk text
- A 1-sentence summary
- Key entities mentioned
- Topic category

Document:
{text}"""
        }]
    )
    return parse_chunks(response)

What Makes It Different

Agentic chunking understands meaning, not just similarity. It can:

  • Recognize that a table and its preceding paragraph are one logical unit (even though their embeddings differ)
  • Keep a code block with its explanatory text
  • Identify cross-references (“as discussed in Section 3”) and include relevant context
  • Generate metadata that improves filtering before vector search

When It’s Worth the Cost

  • Legal and compliance documents: Contracts, regulatory filings where every clause matters and misattribution has consequences.
  • Medical/scientific literature: Papers where figures, tables, and text must stay contextually linked.
  • Small, high-value corpora: A company’s 50 most important internal documents, where retrieval accuracy directly impacts business decisions.

When It’s Not

  • Anything over 10,000 documents: At $0.01–$0.05 per document for LLM processing, costs scale linearly. A million-document corpus would cost $10,000–$50,000 just for chunking.
  • Frequently updated content: Re-chunking on every update multiplies the cost.
  • Latency-sensitive ingestion: Each document requires an LLM API call, adding seconds per document.

Choosing the Right Strategy

                ┌─────────────────────┐
                │ How many documents? │
                └──────────┬──────────┘
                           │
               ┌───────────┴───────────┐
               │                       │
            < 1,000                > 1,000
               │                       │
       ┌───────┴───────┐       ┌───────┴───────┐
       │ High accuracy │       │  Multi-topic  │
       │   required?   │       │  documents?   │
       └───┬───────┬───┘       └───┬───────┬───┘
           │       │               │       │
          Yes      No             Yes      No
           │       │               │       │
       Agentic  Semantic      Semantic  Fixed-size

The Hybrid Approach (What Most Production Systems Use)

In practice, the best RAG pipelines in 2026 don’t pick one strategy — they layer them:

  1. Structure-aware pre-processing: Split at document structure boundaries (headers, sections, page breaks) first.
  2. Fixed-size sub-chunking: Split large sections into fixed-size chunks with overlap.
  3. Metadata enrichment: Use a fast classifier (not a full LLM) to tag chunks with topic labels for filtered retrieval.

This gives you 80% of agentic chunking’s quality at 5% of the cost. The structure-aware split handles topic boundaries; the fixed-size sub-chunking ensures consistent chunk sizes for embedding; the metadata enables pre-filtering before vector search.
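Steps 1 and 2 of the layered approach can be sketched as follows, assuming Markdown input; the header regex and the word-count proxies for token limits (380 words ≈ 512 tokens) are simplifications, and hybrid_chunk is an illustrative name:

```python
import re

def hybrid_chunk(markdown: str, max_words: int = 380, overlap_words: int = 80) -> list[str]:
    """Split at ## headers first, then fixed-size sub-chunk oversized sections."""
    # Step 1: structure-aware split at Markdown H2 boundaries (lookahead keeps the header).
    sections = re.split(r"(?m)^(?=## )", markdown)
    chunks = []
    for section in sections:
        words = section.split()
        if not words:
            continue
        # Step 2: fixed-size sub-chunking with overlap for sections over the limit.
        stride = max_words - overlap_words
        for i in range(0, len(words), stride):
            chunks.append(" ".join(words[i:i + max_words]))
            if i + max_words >= len(words):
                break
    return chunks
```

In a real pipeline you would sub-chunk on tokens rather than words and carry the section header into each sub-chunk's metadata, but the two-stage shape is the same.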

For implementation details on the vector database side, see our comparison of pgvector vs Pinecone; for the database layer decision, see SQLite vs PostgreSQL.


Common Mistakes

  • Chunk size too large. Chunks over 1,024 tokens dilute the embedding signal. The chunk matches too many queries weakly instead of matching the right query strongly.
  • Chunk size too small. Chunks under 128 tokens lack context. “The answer is 42” as a standalone chunk is useless without the question.
  • No overlap in fixed-size chunking. Information at chunk boundaries is lost. Always use 20–25% overlap.
  • Ignoring document structure. If your documents have headers, use them. Splitting a Markdown document at ## boundaries before sub-chunking is almost always better than ignoring structure.
  • Optimizing chunking before measuring retrieval. Build a retrieval evaluation set (50–100 query-answer pairs) before experimenting with chunking strategies. Without measurement, you’re guessing.

FAQ

What chunk size should I start with?

512 tokens with 128 overlap. This is the industry default for a reason — it works reasonably well across document types. Adjust after measuring retrieval quality on your specific data.

Does chunking strategy matter more than the embedding model?

For multi-topic documents, yes. Switching from fixed-size to semantic chunking typically improves retrieval more than switching from a mediocre embedding model to a state-of-the-art one. For single-topic documents, the embedding model matters more.

Can I mix chunking strategies in one pipeline?

Yes — and you should if your corpus contains different document types. Use structure-aware chunking for Markdown/HTML, fixed-size for plain text, and semantic for long unstructured documents. Route by document type at ingestion.
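The routing step can be as simple as a few feature checks at ingestion time; the function name, format strings, and the 2,000-token threshold below are illustrative choices, not fixed rules:

```python
def choose_strategy(doc_format: str, n_tokens: int, multi_topic: bool) -> str:
    """Route a document to a chunking strategy from simple features."""
    if doc_format in ("markdown", "html"):
        return "structure_aware"   # clear headers: split at them
    if multi_topic and n_tokens > 2000:
        return "semantic"          # long, unstructured, topic shifts
    return "fixed_size"            # the default for everything else

print(choose_strategy("markdown", 5000, True))   # structure_aware
print(choose_strategy("text", 5000, True))       # semantic
print(choose_strategy("text", 500, False))       # fixed_size
```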

How do I evaluate chunking quality?

Build a golden dataset: 50–100 questions with known answers and the source document. Run your RAG pipeline and measure Recall@K (does the correct chunk appear in the top K results?) and answer accuracy (does the LLM produce the correct answer?). Change one variable at a time.
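A minimal Recall@K helper for such a golden dataset — the data shapes here (a dict of query to gold chunk id, and ranked retrieval results per query) are assumptions for the sketch, not a standard API:

```python
def recall_at_k(gold: dict[str, str], retrieved: dict[str, list[str]], k: int = 5) -> float:
    """Fraction of queries whose gold chunk id appears in the top-k retrieved ids."""
    hits = sum(1 for q, gold_id in gold.items() if gold_id in retrieved.get(q, [])[:k])
    return hits / len(gold)

gold = {"q1": "chunk_7", "q2": "chunk_3"}
retrieved = {"q1": ["chunk_7", "chunk_2"], "q2": ["chunk_9", "chunk_1"]}
print(recall_at_k(gold, retrieved, k=5))  # 0.5 — q1 hit, q2 missed
```

Run this once per chunking configuration over the same query set and compare; a strategy change that doesn't move Recall@K isn't worth its ingestion cost.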

Is recursive character text splitting (LangChain’s default) good enough?

It’s a reasonable fixed-size strategy with structure awareness (it tries to split at paragraph boundaries, then sentence boundaries, then word boundaries). For most use cases it’s fine as a starting point. The main limitation is that it doesn’t consider semantic content — it’s purely structural.


Bottom Line

Chunking is where most RAG pipelines silently fail. Start with fixed-size chunking at 512 tokens with 128 overlap — it’s fast, predictable, and good enough for 70% of use cases. When retrieval quality drops on multi-topic documents, add semantic chunking at topic boundaries. Reserve agentic chunking for small, high-value corpora where accuracy justifies the 50–100x ingestion cost. And before optimizing anything, build an evaluation set — without measurement, every chunking change is a guess.


Tags: rag, chunking strategies, retrieval augmented generation, semantic chunking, agentic chunking