7. Conversational RAG

Single-shot RAG answers one isolated question. In a chat, questions depend on what came before — and that breaks naive retrieval.

The follow-up problem

 User: "What does the refund policy say?"
 Bot:  "...30 days..."
 User: "And for digital items?"      ← embed THIS alone → retrieves nothing useful

"And for digital items?" has no standalone meaning. Embedded on its own, it matches chunks that merely mention digital items, or nothing relevant at all, because the subject ("refund policy") lives in the previous turn.
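To make the failure concrete, here is a rough sketch that scores the raw follow-up and a rewritten version against two chunks. It uses a bag-of-words cosine as a crude stand-in for a real embedding model, and both chunks are invented for the illustration, so treat the numbers as directional only.

```python
import math
import re
from collections import Counter

def bow(text):
    """Lowercased bag-of-words vector; a crude stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical chunks, invented for this illustration.
chunks = [
    "Refund policy for digital items: refunds are available until the item is downloaded.",
    "Digital items are delivered by email and appear in your account.",
]

def best_chunk(query):
    """Return the chunk with the highest similarity to the query."""
    return max(chunks, key=lambda c: cosine(bow(query), bow(c)))

raw = "And for digital items?"
rewritten = "What is the refund policy for digital items?"

for query in (raw, rewritten):
    scores = [round(cosine(bow(query), bow(c)), 2) for c in chunks]
    print(query, scores)
```

With these toy chunks, the raw follow-up cannot tell the refund chunk from the delivery chunk, because both mention digital items and the query never says "refund". The rewritten question carries the subject, so the refund chunk wins clearly.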

The fix: history-aware query rewriting

Before retrieving, use the LLM to rewrite the follow-up into a standalone question using the chat history, then retrieve with that.

 history + follow-up ─▶ LLM rewrites ─▶ "What is the refund policy for digital items?"
                                              │
                                         retrieve ─▶ answer

Code

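# Assumes `llm(prompt) -> str` and `retrieve(query) -> (chunks, ...)` are already defined.

def condense(history, follow_up):
    """Rewrite a follow-up into a standalone question using the chat history."""
    # `history` is expected to be a plain-text transcript of the earlier turns.
    prompt = (
        "Given the conversation, rewrite the follow-up as a standalone question.\n\n"
        f"Conversation:\n{history}\n\nFollow-up: {follow_up}\n\nStandalone question:"
    )
    return llm(prompt).strip()                    # drop stray whitespace around the question

def chat_answer(history, follow_up):
    """Answer a follow-up: condense it, retrieve with the result, then generate."""
    standalone = condense(history, follow_up)     # history-aware
    top, _ = retrieve(standalone)                 # retrieve with the rewritten query
    context = "\n\n".join(c["text"] for c in top) # each chunk is a dict with a "text" field
    return llm(f"Context:\n{context}\n\nQuestion: {standalone}")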
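The code above depends on `llm` and `retrieve`, which are not defined in this section. As a sketch for trying the flow offline, the block below supplies stub versions of both and repeats the two functions so it runs on its own. The stubs, the `DOCS` list, and the `seen_queries` log are all hypothetical, and the canned rewrite stands in for whatever a real model would return.

```python
import re

# Hypothetical stand-ins, invented for this dry run. Replace `llm` with a real
# model call and `retrieve` with a real vector search.
DOCS = [
    {"text": "Refund policy: physical items can be returned within 30 days."},
    {"text": "Refund policy for digital items: refunds end once the item is downloaded."},
    {"text": "Digital items are delivered by email."},
]
seen_queries = []  # records what the retriever was actually asked

def tokens(text):
    """Set of lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def llm(prompt):
    """Fake model: canned rewrite for condense prompts, otherwise echo the prompt."""
    if prompt.endswith("Standalone question:"):
        return " What is the refund policy for digital items?\n"
    return "[stub answer]\n" + prompt

def retrieve(query, k=1):
    """Fake retriever: rank DOCS by shared words and return (chunks, scores)."""
    seen_queries.append(query)
    overlap = lambda d: len(tokens(query) & tokens(d["text"]))
    top = sorted(DOCS, key=overlap, reverse=True)[:k]
    return top, [overlap(d) for d in top]

def condense(history, follow_up):
    prompt = (
        "Given the conversation, rewrite the follow-up as a standalone question.\n\n"
        f"Conversation:\n{history}\n\nFollow-up: {follow_up}\n\nStandalone question:"
    )
    return llm(prompt).strip()

def chat_answer(history, follow_up):
    standalone = condense(history, follow_up)
    top, _ = retrieve(standalone)
    context = "\n\n".join(c["text"] for c in top)
    return llm(f"Context:\n{context}\n\nQuestion: {standalone}")

history = "User: What does the refund policy say?\nBot: ...30 days..."
answer = chat_answer(history, "And for digital items?")
print(seen_queries)  # the retriever sees the rewritten question, not the raw follow-up
print(answer)
```

Logging the query that reaches the retriever is a cheap way to check the rewrite step while debugging: if the raw follow-up ever shows up in that log, the condense step was skipped or failed.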

Practical notes

Only condense when needed — for a self-contained first question, skip the rewrite to save a call.
Cap history length — pass the last few turns, not the whole transcript, to stay inside the context window.
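Both notes can be sketched in a few lines. The helper names `recent_history` and `standalone_query`, the `(role, text)` turn format, and the default of four turns are assumptions made for this example, not part of the code above. The rewrite function is passed in as a parameter so the sketch runs without a model.

```python
def recent_history(turns, max_turns=4):
    """Render only the last `max_turns` (role, text) turns as a transcript string."""
    kept = turns[-max_turns:] if max_turns > 0 else []
    return "\n".join(f"{role}: {text}" for role, text in kept)

def standalone_query(turns, follow_up, condense, max_turns=4):
    """Return the query to retrieve with, calling `condense` only when there is history."""
    if not turns:
        return follow_up  # first question: nothing to resolve, so skip the extra call
    return condense(recent_history(turns, max_turns), follow_up)

# Usage with a fake rewrite function that records what it was given.
calls = []
def fake_condense(history, follow_up):
    calls.append(history)
    return f"(rewritten) {follow_up}"

print(standalone_query([], "What does the refund policy say?", fake_condense))

turns = [("User", f"message {i}") for i in range(10)]
print(standalone_query(turns, "And for digital items?", fake_condense, max_turns=2))
print(calls)  # only the last two turns were passed to the rewrite step
```

An empty history is the one case that is safe to detect without a model. Deciding whether a later question is already self-contained is harder, and a simple check like this one just rewrites every turn after the first.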

