
9. Advanced Text Splitting

Recursive splitting is a sensible default, but real documents have structure (headings, code, tables) that a generic splitter ignores. Format-aware splitting preserves that structure.

How recursive splitting actually works

It tries a priority list of separators, only moving to a finer one when a chunk is still over the size limit:

try "\n\n" (paragraphs)
  ─ still too big? ─▶ "\n" (lines)
  ─ still too big? ─▶ ". " (sentences)
  ─ still too big? ─▶ " " (words)

This keeps paragraphs and sentences intact whenever it can.
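To make the fallback order concrete, here is a minimal pure-Python sketch of the idea. It is not the library's implementation: the function name `recursive_split` is hypothetical, sizes are measured in characters, and overlap is left out to keep the logic visible.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most chunk_size characters.

    Coarse separators are tried first; a finer one is used only for
    parts that are still too large.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue
        # Greedily merge neighbouring parts while they still fit.
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # Still too big: retry this part with the next, finer separator.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = (
    "First paragraph.\n\n"
    "Second paragraph is here. It has two sentences.\n\n"
    "Third."
)
chunks = recursive_split(text, chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

The short paragraphs come through whole, and only the long middle paragraph is cut at its sentence boundary. Note that this sketch discards the separator at each cut point, so the period before a sentence cut is lost; a production splitter would typically offer a way to keep it.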

Format-aware splitters

from langchain_text_splitters import (
    Language,
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

# Markdown: split on headings so each chunk keeps its section context
md = MarkdownHeaderTextSplitter(headers_to_split_on=[
    ("#", "h1"), ("##", "h2"), ("###", "h3"),
])
# markdown_doc is a str of Markdown source; each result carries heading metadata
sections = md.split_text(markdown_doc)

# Code: prefer class/function boundaries as split points.
# chunk_size and chunk_overlap are measured in characters here.
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=512, chunk_overlap=64,
)
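To show what "carries heading metadata" means in practice, the following is a small standard-library sketch of heading-based splitting. It is an illustration, not the library's code: `split_by_headings` and the `{"text", "metadata"}` dictionary shape are assumptions, and it does not handle `#` characters inside fenced code blocks.

```python
import re

# Matches level 1 to 3 ATX headings such as "## Install".
HEADING = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headings(markdown_text):
    """Return one {"text", "metadata"} dict per section of the document."""
    sections = []
    trail = {}   # current heading trail, e.g. {"h1": "Guide", "h2": "Install"}
    body = []    # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"text": text, "metadata": dict(trail)})
        body.clear()

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match is None:
            body.append(line)
            continue
        flush()
        level = len(match.group(1))
        # A new heading replaces any heading at the same or a deeper level.
        for stale in range(level, 4):
            trail.pop(f"h{stale}", None)
        trail[f"h{level}"] = match.group(2).strip()
    flush()
    return sections


doc = (
    "# Guide\n"
    "Intro text.\n"
    "## Install\n"
    "Run the installer.\n"
    "## Usage\n"
    "Call the API."
)
sections = split_by_headings(doc)
for section in sections:
    print(section["metadata"], "->", section["text"])
```

Each section records the full heading trail above it, so a chunk under "Usage" still knows it belongs to "Guide".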

Token-based sizing

LLM limits are counted in tokens, not characters. Sizing chunks by tokens makes your budget predictable:

# Requires the tiktoken package to be installed
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512,  # tokens, not chars
    chunk_overlap=64,
)
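The sketch below illustrates how `chunk_size` and `chunk_overlap` interact when both are counted in tokens. It is a plain sliding window, not the recursive splitter, and `toy_tokenize` is a hypothetical stand-in for a real tokenizer (it treats words and punctuation marks as tokens, which real encodings do not).

```python
import re


def toy_tokenize(text):
    """Hypothetical tokenizer: each word or punctuation mark is one token."""
    return re.findall(r"\w+|[^\w\s]", text)


def chunk_by_tokens(text, chunk_size, chunk_overlap):
    """Return windows of at most chunk_size tokens that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    tokens = toy_tokenize(text)
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return windows


text = "one two three four five six seven eight nine ten"
windows = chunk_by_tokens(text, chunk_size=4, chunk_overlap=1)
for window in windows:
    print(len(window), window)
```

Every window stays within the token budget regardless of how long the individual words are, which is the property a character-based limit cannot guarantee.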

Practical notes

  • Headers as metadata: keeping the section title with each chunk improves both retrieval relevance and citation quality.
  • Don't split code mid-function: language-aware splitting prefers class and function boundaries as split points, so a chunk is less likely to cut through a definition.
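For Python sources specifically, one way to guarantee that no definition is cut is to split on the parsed syntax tree instead of on text separators. The following is a sketch of that alternative using the standard-library `ast` module; it is not how the separator-based splitter shown earlier works, `split_python_source` is a hypothetical name, and it assumes Python 3.8+ (for `end_lineno`). Comments that sit between top-level statements are dropped.

```python
import ast


def split_python_source(source):
    """Return one chunk per top-level statement, keeping each def/class whole."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # Decorators sit above the def/class line, so start at the first one.
        decorator_lines = [d.lineno for d in getattr(node, "decorator_list", [])]
        start = min([node.lineno] + decorator_lines)
        chunks.append("\n".join(lines[start - 1:node.end_lineno]))
    return chunks


source = (
    "import os\n"
    "\n"
    "def add(a, b):\n"
    "    return a + b\n"
    "\n"
    "class Greeter:\n"
    "    def hello(self):\n"
    "        return 'hi'\n"
)
code_chunks = split_python_source(source)
for chunk in code_chunks:
    print(chunk)
    print("---")
```

Because every chunk is a complete top-level statement, each one can be parsed on its own. A definition larger than the size budget would still need a fallback, such as splitting its body with a text-based splitter.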
