4. Document Retrieval
Retrieval is the query-time step that picks which chunks the LLM gets to see. It's where most RAG quality is won or lost.
The flow
question ─▶ embed (same model) ─▶ compare to every chunk vector
                                            │ similarity score
                                            ▼
                                  sort, take top-k ─▶ context
Code — a simple retriever
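import numpy as np

def retrieve(question, index, model, k=4):
    """Return the k records in `index` most similar to `question`."""
    # Embed the question with the same model that embedded the chunks
    q = model.encode(question, normalize_embeddings=True)
    # vectors are normalized → dot product == cosine similarity
    # (this only holds if the stored chunk vectors were normalized too)
    scored = [(float(np.dot(q, r["vector"])), r) for r in index]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [r for _, r in scored[:k]]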
With LangChain
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

# `chunks` is a list of LangChain Document objects built earlier
emb = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
store = FAISS.from_documents(chunks, emb)

# Wrap the store as a retriever that returns the 4 closest chunks
retriever = store.as_retriever(search_kwargs={"k": 4})
results = retriever.invoke("How do I reset my password?")
The knobs that matter
- k (how many chunks) — too few misses context; too many adds noise and cost. Start at 4–5.
- Relevance threshold — if the best score is too low, return nothing so the app can say "I don't know" instead of forcing a weak answer.
- Metadata filters — restrict to a source, date range, or section before scoring.
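The three knobs above can be combined in one function. What follows is a minimal sketch under stated assumptions, not a definitive implementation: the name `retrieve_filtered`, the `min_score` default of 0.3, and the `meta` dictionary on each record are hypothetical and chosen for illustration; only the `vector` key comes from the simple retriever. It takes an already-embedded, normalized query vector so it runs without a model, and it uses plain Python arithmetic so the vectors can be lists or NumPy arrays.

```python
def dot(a, b):
    """Dot product of two equal-length vectors (cosine similarity if both are unit length)."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_filtered(q_vec, index, k=4, min_score=0.3, where=None):
    """Return up to k (score, record) pairs, best first.

    q_vec:     normalized query vector
    index:     list of dicts with "vector" and (assumed) "meta" keys
    min_score: relevance threshold; 0.3 is a placeholder, tune it on your data
    where:     optional metadata filter, e.g. {"source": "faq"}
    """
    # Metadata filter: drop non-matching records before scoring
    if where:
        candidates = [
            r for r in index
            if all(r.get("meta", {}).get(key) == val for key, val in where.items())
        ]
    else:
        candidates = list(index)

    scored = [(dot(q_vec, r["vector"]), r) for r in candidates]
    scored.sort(key=lambda x: x[0], reverse=True)

    # Relevance threshold: if even the best match is weak, return nothing
    if not scored or scored[0][0] < min_score:
        return []
    return scored[:k]


# Toy index with hand-written 2-D unit vectors (hypothetical data)
index = [
    {"text": "Reset your password from Settings.", "vector": [1.0, 0.0], "meta": {"source": "faq"}},
    {"text": "Password rules: 12+ characters.", "vector": [0.8, 0.6], "meta": {"source": "policy"}},
    {"text": "Our office is closed on Sundays.", "vector": [0.0, 1.0], "meta": {"source": "faq"}},
]

hits = retrieve_filtered([1.0, 0.0], index, k=2)
faq_only = retrieve_filtered([1.0, 0.0], index, k=2, where={"source": "faq"})
no_match = retrieve_filtered([0.0, -1.0], index)  # best score is 0.0, below the threshold
```

An empty list is the signal for the app to say "I don't know". Note that the threshold here checks only the best score, as described in the list, so weaker chunks can still ride along in the top-k; dropping every chunk below `min_score` is a stricter variant.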
When an answer is wrong, print the retrieved chunks first. If a human couldn't answer from them, the model can't either — fix retrieval before touching the model.
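One way to make that inspection routine is a small formatting helper. This is a sketch only: `show_chunks` is a hypothetical name, and it assumes each retrieved item is a `(score, record)` pair whose record carries its chunk under a `text` key, which the retriever code does not guarantee.

```python
def show_chunks(scored_chunks, width=80):
    """Format (score, record) pairs as one line each, for eyeballing retrieval."""
    lines = []
    for rank, (score, rec) in enumerate(scored_chunks, start=1):
        text = " ".join(rec["text"].split())  # collapse newlines and runs of spaces
        if len(text) > width:
            text = text[:width - 3] + "..."   # truncate long chunks to `width` characters
        lines.append(f"{rank}. [{score:.3f}] {text}")
    return "\n".join(lines)


# Hypothetical retrieval result for "How do I reset my password?"
sample = [
    (0.82, {"text": "Reset your password from\nSettings > Security."}),
    (0.41, {"text": "Our office is closed on Sundays."}),
]
report = show_chunks(sample)
print(report)
```

Reading the ranked lines next to the question usually shows quickly whether the right chunk was missing, ranked too low, or cut off by chunking.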