3. Data Ingestion Pipeline

Ingestion is the offline "indexing" half of RAG: it turns raw documents into a searchable vector index. Retrieval can only return what ingestion stored, so a solid pipeline here makes the query side much simpler.

The stages

 load  ─▶  clean      ─▶  chunk   ─▶  embed    ─▶  store
 PDF/      strip          split       vectors      vector
 HTML/     noise,         into        per          index
 MD/...    normalize      pieces      chunk

Load: read source files (PDF, HTML, Markdown, DB rows).
Clean: strip boilerplate, fix encoding, normalize whitespace.
Chunk: split into retrievable pieces and attach metadata (parts 8–11).
Embed: compute one vector per chunk with your embedding model.
Store: write chunks and vectors to the index.
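The Clean stage is the easiest one to skip, so here is a minimal sketch of what it might look like using only the standard library. The function name clean_text and the specific rules are illustrative assumptions, not part of any loader's API. Boilerplate removal (headers, footers, nav bars) depends on the source and is not shown.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize extracted text before chunking (illustrative, not exhaustive)."""
    # NFKC folds compatibility characters (ligatures, full-width forms) into plain ones.
    text = unicodedata.normalize("NFKC", raw)
    # Drop soft hyphens and zero-width characters that extraction often leaves behind.
    text = re.sub(r"[\u00ad\u200b\u200c\u200d\ufeff]", "", text)
    # Rejoin words split across a line break: "inges-\ntion" -> "ingestion".
    # Caveat: this also merges real hyphenated compounds that happen to wrap.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces/tabs, trim spaces around newlines,
    # and allow at most one blank line in a row.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

With LangChain documents, one plausible place to apply it is right after loading, for example by assigning clean_text(d.page_content) back to each document's page_content before splitting.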

Code — a complete ingestion pass

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

# 1–2. Load and clean (the loader handles basic text extraction; no extra cleaning is applied here)
docs = PyPDFLoader("handbook.pdf").load()

# 3. Chunk with metadata
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(docs)
for i, c in enumerate(chunks):
    c.metadata["chunk_id"] = i

# 4. Embed
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode([c.page_content for c in chunks],
                       normalize_embeddings=True)

# 5. Store (shape your index records however your store needs)
index = [
    {"id": c.metadata["chunk_id"], "text": c.page_content,
     "meta": c.metadata, "vector": v.tolist()}
    for c, v in zip(chunks, vectors)
]
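The index list above lives only in memory. As a rough sketch of the Store step for a setup without a vector database, the records could be written to a JSON Lines file and read back later. The helper names save_index and load_index, the file name, and the demo records are assumptions for illustration; the sketch also assumes every metadata value is JSON-serializable.

```python
import json
import tempfile
from pathlib import Path


def save_index(records: list[dict], path: str) -> None:
    """Write one JSON object per line so the file can be streamed back later."""
    with Path(path).open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def load_index(path: str) -> list[dict]:
    """Read the records back, skipping blank lines."""
    with Path(path).open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Stand-in records with the same keys as `index` (the values are made up).
demo = [
    {"id": 0, "text": "first chunk", "meta": {"source": "handbook.pdf"}, "vector": [0.6, 0.8]},
    {"id": 1, "text": "second chunk", "meta": {"source": "handbook.pdf"}, "vector": [1.0, 0.0]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "index.jsonl")
    save_index(demo, path)
    restored = load_index(path)
```

A flat file like this is fine for small corpora and debugging; a real vector store adds approximate nearest-neighbor search and metadata filtering on top of the same record shape.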

Practical notes

Idempotency: re-running ingestion should update existing chunks, not duplicate them (key each chunk by a stable id such as source + chunk index).
Metadata is gold: store source, section, and date; you'll filter and cite with it.
Batch the embedding call: encoding all chunks in one call is far faster than encoding them one by one.
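The idempotency note can be sketched with a plain dict standing in for the store. The names stable_id and upsert and the "source#index" key format are hypothetical choices for illustration; vectors are left out to keep the example short, and a real vector store would expose its own upsert call.

```python
def stable_id(source: str, chunk_index: int) -> str:
    """The same source and chunk index always yield the same id across runs."""
    return f"{source}#{chunk_index}"


def upsert(store: dict, source: str, chunk_texts: list[str]) -> None:
    """Insert or overwrite all chunks of one source, keyed by stable id."""
    fresh = {
        stable_id(source, i): {"text": t, "meta": {"source": source, "chunk_index": i}}
        for i, t in enumerate(chunk_texts)
    }
    # Delete chunks left over from an earlier run (e.g. the document got shorter).
    stale = [k for k, rec in store.items()
             if rec["meta"]["source"] == source and k not in fresh]
    for k in stale:
        del store[k]
    # Overwrite existing ids and add new ones; nothing is duplicated.
    store.update(fresh)


store = {}
upsert(store, "handbook.pdf", ["intro", "policies", "appendix"])
upsert(store, "handbook.pdf", ["intro", "policies (revised)"])  # re-run after an edit
```

After the second run the store holds two chunks for handbook.pdf, with the revised text in place of the old one and the leftover third chunk removed. Storing source and chunk_index in the metadata is what makes the stale-chunk cleanup possible.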

Next: Document Retrieval →
