1. What RAG Is

Retrieval-Augmented Generation (RAG) gives a language model access to knowledge it wasn't trained on by retrieving relevant text at question time and placing it into the prompt. The model then answers using that supplied context instead of relying only on its parameters.

Why it exists

A base LLM has three hard limits RAG addresses:

Stale knowledge — it only knows its training data; it can't see your latest docs.
No private data — it has never seen your company wiki, your PDFs, your tickets.
Hallucination — asked something it doesn't know, it confidently makes things up.

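RAG addresses all three by grounding the answer in real, retrieved source text, which also lets you cite where the answer came from. Grounding reduces hallucination rather than eliminating it: the model can still misread or ignore the context it is given.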
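To make the citation point concrete, here is a minimal sketch of a prompt that labels each retrieved chunk with a number and its source, so the model can refer back to it. The function name `build_cited_prompt`, the file names, and the instruction wording are all hypothetical choices for illustration, not a fixed convention.

```python
def build_cited_prompt(question, retrieved):
    """Build a prompt from (source_name, chunk_text) pairs, best match first."""
    # Number each chunk so the model can cite it as [1], [2], ...
    lines = [
        f"[{i}] ({source}) {text}"
        for i, (source, text) in enumerate(retrieved, start=1)
    ]
    sources = "\n".join(lines)
    return (
        "Answer using only the sources below and cite them by number, e.g. [1].\n"
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

# Hypothetical retrieved chunks; in a real system these come from the index.
retrieved = [
    ("refund-policy.md", "The refund window is 30 days from the delivery date."),
    ("support-hours.md", "Support is available on weekdays from 9am to 5pm."),
]
print(build_cited_prompt("How long is the refund window?", retrieved))
```

Because every chunk carries a label, a citation in the answer can be traced back to a specific source document.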

The two phases

 INDEXING (offline, once per corpus; rerun when documents change)
   documents ─▶ chunk ─▶ embed ─▶ store vectors in an index

 QUERYING (per question)
   question ─▶ embed ─▶ retrieve top chunks ─▶ prompt = question + chunks ─▶ LLM ─▶ answer

Everything in this course is one of these two phases: indexing (parts 3, 8–11) or querying (parts 4–7, 13–17).

Minimal mental model in code

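# chunk, embed, most_similar and llm are placeholders for components covered later in the course.

# Indexing (done once per corpus; rerun when documents change)
chunks = chunk(documents)
index = [(c, embed(c)) for c in chunks]        # (chunk text, vector) pairs

# Querying (per question)
def answer(question):
    q = embed(question)
    top = most_similar(q, index, k=4)          # retrieval: assumed to return the 4 closest chunk texts
    context = "\n\n".join(top)                 # join the chunks; formatting the list directly would insert its repr
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)                          # generation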
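The mental model above leaves `chunk`, `embed`, `most_similar` and `llm` undefined. The sketch below fills them in with toy stand-ins so the whole flow runs with only the standard library. It assumes word-count vectors in place of a real embedding model, cosine similarity for ranking, and an `llm` stub that simply returns the prompt. These are illustrative choices, not what a production system would use, and `build_index` and `cosine` are helper names introduced here.

```python
import math
import re
from collections import Counter

def chunk(documents, max_words=40):
    """Split each document into pieces of at most max_words words."""
    chunks = []
    for doc in documents:
        words = doc.split()
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks

def embed(text):
    """Toy embedding: lowercase word counts (a real system calls an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def most_similar(q, index, k=4):
    """Return the texts of the k chunks whose vectors are closest to q."""
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def llm(prompt):
    """Stand-in for a model call: returns the prompt so the flow can be inspected."""
    return prompt

# Indexing
def build_index(documents):
    return [(c, embed(c)) for c in chunk(documents)]

# Querying
def answer(question, index, k=4):
    q = embed(question)
    top = most_similar(q, index, k=k)
    context = "\n\n".join(top)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)

documents = [
    "The refund window is 30 days from the delivery date.",
    "Support is available on weekdays from 9am to 5pm.",
    "Shipping to Canada takes five to seven business days.",
]
index = build_index(documents)
print(answer("How long is the refund window?", index, k=1))
```

With `k=1`, the prompt contains only the refund sentence, since it shares the most words with the question. Swapping `embed` for a real embedding model and `llm` for a real model call would leave `answer` unchanged.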

When to use RAG

Use RAG when the model lacks knowledge (facts, private or recent data). If it instead lacks a behaviour or style, fine-tuning is the better fit; if the task needs multi-step tool use, an agent is.

Next: Embeddings & RAG Architecture →
