
4. Document Retrieval

Retrieval is the query-time step that picks which chunks the LLM gets to see. It's where most RAG quality is won or lost.

The flow

 question ─▶ embed (same model) ─▶ compare to every chunk vector
                                        │ similarity score
                                        ▼
                                  sort, take top-k ─▶ context

Code — a simple retriever

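import numpy as np

def retrieve(question, index, model, k=4):
    """Return the k records in `index` most similar to `question`."""
    # Embed the question with the same model that embedded the chunks.
    q = model.encode(question, normalize_embeddings=True)
    # If the stored chunk vectors are normalized too, dot product == cosine similarity.
    scored = [(float(np.dot(q, r["vector"])), r) for r in index]
    # Highest score first, then keep the top k.
    scored.sort(key=lambda x: x[0], reverse=True)
    return [r for _, r in scored[:k]]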
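
Scoring chunks one at a time in a Python loop can get slow as the index grows. Below is a minimal sketch of the same "compare to every chunk vector" step done as a single matrix-vector product. It assumes the chunk vectors have been stacked row-wise into a NumPy array ahead of time and that both sides are L2-normalized. The names `retrieve_vectorized`, `matrix`, and `records` are illustrative, and the 2-d vectors are toy stand-ins for real embeddings.

```python
import numpy as np

def retrieve_vectorized(q, matrix, records, k=4):
    """Score every chunk with one matrix-vector product.

    q:       query embedding, shape (d,), assumed L2-normalized
    matrix:  chunk embeddings stacked row-wise, shape (n, d), assumed L2-normalized
    records: list of n chunk records, in the same order as the rows of `matrix`
    """
    scores = matrix @ q                # one cosine similarity per chunk
    top = np.argsort(-scores)[:k]      # indices of the k highest scores
    return [(float(scores[i]), records[i]) for i in top]

# Toy 2-d vectors standing in for real embeddings.
records = [{"text": "reset password"}, {"text": "billing"}, {"text": "change password"}]
matrix = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
q = np.array([1.0, 0.0])

top2 = retrieve_vectorized(q, matrix, records, k=2)
```

This version returns the score alongside each record, which is handy when you want to apply a relevance cutoff or inspect results later.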

With LangChain

from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

emb = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# chunks: the list of LangChain Document objects produced during ingestion
store = FAISS.from_documents(chunks, emb)
retriever = store.as_retriever(search_kwargs={"k": 4})

results = retriever.invoke("How do I reset my password?")

The knobs that matter

k (how many chunks): too few and the passage that holds the answer may be missed; too many adds noise and token cost. Start at 4–5.
Relevance threshold: if the best score falls below a cutoff you choose, return nothing so the app can say "I don't know" instead of forcing a weak answer.
Metadata filters: restrict candidates to a source, date range, or section before scoring.
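
The threshold and filter knobs can be sketched on top of the simple retriever. This is a minimal illustration, not a definitive implementation: the `meta` field on each record, the `where` argument, and the `min_score=0.3` default are all assumptions made for the example, and a sensible cutoff depends on your embedding model and data.

```python
import numpy as np

def retrieve_filtered(q, index, k=4, min_score=0.3, where=None):
    """Top-k retrieval with a metadata pre-filter and a relevance cutoff.

    q:         L2-normalized query vector
    index:     list of records like {"vector": ..., "text": ..., "meta": {...}}
    min_score: hypothetical cutoff; tune it on your own data
    where:     optional dict of metadata key/value pairs that must all match
    """
    where = where or {}
    # 1. Metadata filter: drop records before any scoring happens.
    candidates = [
        r for r in index
        if all(r.get("meta", {}).get(key) == val for key, val in where.items())
    ]
    # 2. Score the survivors (dot product == cosine for normalized vectors).
    scored = [(float(np.dot(q, r["vector"])), r) for r in candidates]
    scored.sort(key=lambda x: x[0], reverse=True)
    # 3. Relevance threshold: if even the best hit is weak, return nothing.
    if not scored or scored[0][0] < min_score:
        return []
    return [r for _, r in scored[:k]]

# Toy index with 2-d vectors standing in for real embeddings.
index = [
    {"text": "Reset your password from Settings.",
     "vector": np.array([1.0, 0.0]), "meta": {"source": "faq"}},
    {"text": "Invoices are emailed monthly.",
     "vector": np.array([0.0, 1.0]), "meta": {"source": "billing"}},
]
q = np.array([1.0, 0.0])

faq_hits = retrieve_filtered(q, index, where={"source": "faq"})
billing_hits = retrieve_filtered(q, index, where={"source": "billing"})
```

Here `billing_hits` comes back empty: the only billing chunk scores below the cutoff, so the caller can fall back to an "I don't know" response.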

When an answer is wrong, print the retrieved chunks first. If a human couldn't answer from them, the model can't either — fix retrieval before touching the model.
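
A small helper makes that inspection routine. The sketch below assumes the retriever hands back `(score, record)` pairs and that each record stores its chunk under a `text` key; both are illustrative choices, so adapt the field names to whatever your index holds.

```python
def format_hits(question, hits, width=80):
    """Build a readable report of what the retriever returned.

    hits: list of (score, record) pairs; each record is assumed to have
          a "text" field (a hypothetical name for this example).
    """
    lines = [f"Q: {question}"]
    for rank, (score, record) in enumerate(hits, start=1):
        # Flatten newlines and truncate so each chunk fits on one line.
        snippet = record["text"].replace("\n", " ")[:width]
        lines.append(f"{rank}. score={score:.3f}  {snippet}")
    if not hits:
        lines.append("(no chunks retrieved)")
    return "\n".join(lines)

print(format_hits(
    "How do I reset my password?",
    [(0.82, {"text": "Go to Settings > Security and choose Reset password."})],
))
```

Reading the report next to the wrong answer usually shows quickly whether the right chunk was missing, ranked too low, or present but ignored.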

Next: Cosine Similarity →
