3. Data Ingestion Pipeline

Ingestion is the offline "indexing" half of RAG: turning raw documents into a searchable vector index. Get this pipeline solid and querying becomes easy.

The stages

load   ─▶ clean     ─▶ chunk  ─▶ embed   ─▶ store
PDF/      strip        split     vectors    vector
HTML/     noise,       into      per        index
MD/...    normalize    pieces    chunk
  1. Load: read source files (PDF, HTML, Markdown, DB rows).
  2. Clean: strip boilerplate, fix encoding, normalize whitespace.
  3. Chunk: split into retrievable pieces and attach metadata (parts 8–11).
  4. Embed: compute one vector per chunk with your embedding model.
  5. Store: write chunks and vectors to the index.
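
The Clean stage is the one most often left to the loader, so here is a minimal sketch of what it might look like by hand, using only the standard library. The helper name `clean_text` and the specific rules (NFKC normalization, dropping control characters, rejoining words hyphenated across line breaks) are illustrative assumptions, not a fixed recipe; real sources usually need their own boilerplate rules on top.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize Unicode and whitespace in extracted text (illustrative helper)."""
    # NFKC folds compatibility characters: ligatures, non-breaking spaces, etc.
    text = unicodedata.normalize("NFKC", raw)
    # Drop control characters (including carriage returns), keep newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Heuristic: rejoin words hyphenated across a line break ("inges-\ntion").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs, then trim every line.
    text = re.sub(r"[ \t]+", " ", text)
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Collapse three or more newlines into a single paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

The de-hyphenation rule is a heuristic: it will also join a genuine compound that happens to break at a line end, so it may be worth disabling for sources where that matters.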

Code — a complete ingestion pass

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

# 1–2. Load (loaders handle basic extraction)
docs = PyPDFLoader("handbook.pdf").load()

# 3. Chunk with metadata
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(docs)
for i, c in enumerate(chunks):
    c.metadata["chunk_id"] = i

# 4. Embed
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode([c.page_content for c in chunks],
                       normalize_embeddings=True)

# 5. Store (shape your index records however your store needs)
index = [
    {"id": c.metadata["chunk_id"], "text": c.page_content,
     "meta": c.metadata, "vector": v.tolist()}
    for c, v in zip(chunks, vectors)
]
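
The `chunk_size=512, chunk_overlap=64` settings in step 3 are easier to reason about with a stripped-down chunker. The sketch below is a plain fixed-size sliding window over characters; it is a simpler stand-in, not what `RecursiveCharacterTextSplitter` does internally (that splitter prefers to break on separators such as paragraph and line boundaries). The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 512,
                          chunk_overlap: int = 64) -> list[str]:
    """Split text into fixed-size character windows that overlap their neighbors."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Usage: windows of 4 characters, each sharing 1 character with the previous one.
pieces = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
```

With 512 and 64, each new chunk starts 448 characters after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.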

Practical notes

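  • Idempotency: re-running ingestion should update existing chunks, not duplicate them. Key each chunk by a stable id such as source + chunk index.
  • Metadata is gold: store source, section, and date with every chunk; you'll filter results and cite sources with it.
  • Batch the embedding call: pass the full list of chunk texts to the encoder in one call, as in step 4, rather than calling it once per chunk. The model then processes them in batches, which is far faster.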
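
One way to make the idempotency note concrete is sketched below, assuming the store behaves like a dictionary keyed by chunk id. The helpers `stable_chunk_id` and `upsert_chunks` are hypothetical names, and a real vector database will expose its own upsert and delete calls, so treat this as the shape of the logic rather than a specific API.

```python
import hashlib


def stable_chunk_id(source: str, chunk_index: int) -> str:
    """Derive a deterministic id from the source path and the chunk's position."""
    key = f"{source}::{chunk_index}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


def upsert_chunks(store: dict, source: str, texts: list[str],
                  vectors: list[list[float]]) -> None:
    """Insert or overwrite the records for one source, then drop stale ones."""
    new_ids = set()
    for i, (text, vector) in enumerate(zip(texts, vectors)):
        cid = stable_chunk_id(source, i)
        # Same source + index always maps to the same key, so re-runs overwrite.
        store[cid] = {
            "id": cid,
            "text": text,
            "vector": vector,
            "meta": {"source": source, "chunk_index": i},
        }
        new_ids.add(cid)

    # If the document got shorter, remove chunks left over from the old version.
    stale = [cid for cid, rec in store.items()
             if rec["meta"]["source"] == source and cid not in new_ids]
    for cid in stale:
        del store[cid]


# Usage: ingest a document, then re-ingest a shorter revision of it.
store = {}
upsert_chunks(store, "handbook.pdf", ["a", "b", "c"], [[0.1], [0.2], [0.3]])
upsert_chunks(store, "handbook.pdf", ["a", "B"], [[0.1], [0.25]])
```

After the second call the store holds two records for `handbook.pdf`, not five: the first two were overwritten in place and the third was removed as stale.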