3. Data Ingestion Pipeline
Ingestion is the offline "indexing" half of RAG: turning raw documents into a searchable vector index. Retrieval at query time can only return what this pipeline stored, so clean text, sensible chunks, and good metadata here make querying much easier.
The stages
load   ─▶  clean      ─▶  chunk   ─▶  embed    ─▶  store
PDF/       strip          split       vectors      vector
HTML/      noise,         into        per          index
MD/...     normalize      pieces      chunk
- Load: read source files (PDF, HTML, Markdown, DB rows).
- Clean: strip boilerplate, fix encoding, normalize whitespace.
- Chunk: split the text into retrievable pieces and attach metadata (see parts 8–11).
- Embed: compute one vector per chunk with your embedding model.
- Store: write the chunks and their vectors to the index.
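As a rough illustration of the Clean stage, here is a minimal sketch using only the Python standard library. The function name `clean_text` and the specific rules (NFKC normalization, rejoining words hyphenated across line breaks, dropping control characters, collapsing whitespace) are assumptions chosen for illustration; real documents usually need source-specific rules on top.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize Unicode, drop control characters, and collapse whitespace."""
    # NFKC folds compatibility forms (ligatures, non-breaking spaces)
    # into their plain equivalents.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words split by a hyphen at a line break, a common PDF artifact.
    # Caveat: this also merges genuinely hyphenated words that wrap.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Remove control characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces/tabs, trim spaces around newlines,
    # and allow at most one blank line in a row.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "Data\u00a0 inges-\ntion\x00  \ufb01le\n\n\n\nend  "
cleaned = clean_text(sample)
```

Running the cleaner before chunking keeps stray characters and layout noise out of the chunks that get embedded.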
Code — a complete ingestion pass
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

# 1–2. Load (loaders handle basic extraction)
docs = PyPDFLoader("handbook.pdf").load()

# 3. Chunk with metadata
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(docs)
for i, c in enumerate(chunks):
    c.metadata["chunk_id"] = i

# 4. Embed
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode([c.page_content for c in chunks],
                       normalize_embeddings=True)

# 5. Store (shape your index records however your store needs)
index = [
    {"id": c.metadata["chunk_id"], "text": c.page_content,
     "meta": c.metadata, "vector": v.tolist()}
    for c, v in zip(chunks, vectors)
]
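The `index` list above lives only in memory. One simple way to persist it, sketched here as an assumption rather than a recommendation for any particular vector store, is a JSON Lines file with one record per line. The helper names `save_index` and `load_index`, the file name, and the toy records are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_index(records: list[dict], path: Path) -> None:
    """Write one JSON object per line so the file can be streamed back."""
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def load_index(path: Path) -> list[dict]:
    """Read the records back, skipping any blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Toy records in the same shape as the index entries built above.
records = [
    {"id": 0, "text": "first chunk",
     "meta": {"source": "handbook.pdf"}, "vector": [0.6, 0.8]},
    {"id": 1, "text": "second chunk",
     "meta": {"source": "handbook.pdf"}, "vector": [1.0, 0.0]},
]
path = Path(tempfile.gettempdir()) / "rag_index_demo.jsonl"
save_index(records, path)
restored = load_index(path)
```

A flat file like this is enough for small experiments; a dedicated vector store becomes worthwhile once you need fast similarity search over many chunks.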
Practical notes
- Idempotency: re-running ingestion should update existing chunks, not duplicate them (key each chunk by a stable id such as source + chunk index).
- Metadata is gold: store source, section, and date with every chunk; you'll filter and cite with it.
- Batch the embedding call: pass the whole list of chunks to encode() instead of calling it once per chunk; batched encoding is much faster.
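The idempotency and metadata notes can be sketched together as follows. This is a minimal illustration that uses a plain dict as a stand-in for the index; the helpers `stable_chunk_id`, `make_records`, and `upsert`, the `::` key separator, and the example dates are assumptions, not part of any library.

```python
import hashlib


def stable_chunk_id(source: str, chunk_index: int) -> str:
    """Derive a deterministic id from the source path and chunk position."""
    key = f"{source}::{chunk_index}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def make_records(source: str, texts: list[str], date: str) -> list[dict]:
    """Build index records that carry source, position, and date metadata."""
    return [
        {"id": stable_chunk_id(source, i), "text": t,
         "meta": {"source": source, "chunk_index": i, "date": date}}
        for i, t in enumerate(texts)
    ]


def upsert(store: dict[str, dict], records: list[dict]) -> None:
    """Insert new records and overwrite existing ones that share an id."""
    for rec in records:
        store[rec["id"]] = rec


store: dict[str, dict] = {}
# First ingestion run.
upsert(store, make_records("handbook.pdf", ["intro", "policy"], "2024-01-01"))
# Re-run after the document changed: same ids, so entries are replaced.
upsert(store, make_records("handbook.pdf", ["intro", "policy v2"], "2024-02-01"))

# Metadata makes filtering straightforward.
from_handbook = [r for r in store.values()
                 if r["meta"]["source"] == "handbook.pdf"]
```

One limitation of this sketch: if a re-ingested document produces fewer chunks than before, the leftover entries stay in the store, so in practice you may want to delete all records for a source before upserting its new chunks.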