6. Your First RAG App
Now we connect indexing, retrieval, and generation into one runnable app.
End-to-end shape
docs ─▶ [index once] ─▶ vector store
question ─▶ retrieve top-k ─▶ build prompt ─▶ LLM ─▶ answer (+ citations)
Code — minimal but complete
from sentence_transformers import SentenceTransformer
import numpy as np

emb = SentenceTransformer("all-MiniLM-L6-v2")

# --- index once ---
# chunk() and load_documents() come from other parts of this series.
chunks = chunk(load_documents())  # parts 3 & 8
vectors = emb.encode([c["text"] for c in chunks], normalize_embeddings=True)
for c, v in zip(chunks, vectors):
    c["vector"] = v


def retrieve(question, k=4, min_score=0.25):
    # Vectors are normalized, so the dot product is the cosine similarity.
    q = emb.encode(question, normalize_embeddings=True)
    scored = sorted(((float(q @ c["vector"]), c) for c in chunks),
                    key=lambda x: x[0], reverse=True)
    # Keep at most k chunks, and only those that clear the threshold.
    top = [c for s, c in scored[:k] if s >= min_score]
    return top, (scored[0][0] if scored else 0)


def answer(question):
    top, best = retrieve(question)
    if not top:  # grounding guardrail
        return "I don't have information on that in my sources."
    context = "\n\n".join(c["text"] for c in top)
    prompt = (
        "Answer using ONLY the context. If it's not there, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)  # your LLM call
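To see how k and min_score interact without downloading a model, here is a toy version of the same ranking logic. It is only a sketch: the 2-D vectors, the chunk texts, and the names unit and toy_retrieve are invented for illustration, and a real app would get its vectors from emb.encode.

```python
import math


def unit(x, y):
    """Scale a 2-D vector to length 1 so a dot product equals cosine similarity."""
    n = math.hypot(x, y)
    return (x / n, y / n)


# Hypothetical chunks with hand-made vectors (stand-ins for real embeddings).
toy_chunks = [
    {"text": "Refunds are issued within 14 days.", "vector": unit(1.0, 0.1)},
    {"text": "Refund requests need an order number.", "vector": unit(0.9, 0.4)},
    {"text": "Our office is closed on Sundays.", "vector": unit(0.0, 1.0)},
]


def toy_retrieve(q, chunks, k=4, min_score=0.25):
    """Rank chunks by dot product, then keep the top-k that clear min_score."""
    scored = sorted(
        ((sum(a * b for a, b in zip(q, c["vector"])), c) for c in chunks),
        key=lambda x: x[0],
        reverse=True,
    )
    top = [c for s, c in scored[:k] if s >= min_score]
    return top, (scored[0][0] if scored else 0.0)


# A query pointing roughly the same way as the two refund chunks.
on_topic, best_on = toy_retrieve(unit(1.0, 0.2), toy_chunks)

# A query pointing away from everything: nothing clears the threshold,
# so the list is empty and answer() would return its fallback message.
off_topic, best_off = toy_retrieve(unit(-1.0, 0.1), toy_chunks)
```

The on-topic query keeps the two refund chunks and drops the office-hours chunk, which scores below 0.25. The off-topic query returns an empty list even though k=4 would allow more results, which is the case the guardrail in answer() handles.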
What makes it a product, not a demo
- Grounding guardrail: the min_score check + "say you don't know" prompt stops confident hallucinations when retrieval finds nothing.
- Citations: keep each chunk's source metadata and show it with the answer.
- Evaluation: keep a small set of real questions with known-good answers and re-run it after every change (see Reranking & Next Steps).
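The citation and evaluation points can be sketched in a few lines. Everything here is an assumption made for illustration: the "source" key on each chunk, the helper names with_citations and run_eval, the must_contain case format, and the fake_answer stand-in for the real answer(). Substring matching is a deliberately crude check, useful as a regression tripwire rather than a quality metric.

```python
def with_citations(answer_text, top):
    """Append a numbered, de-duplicated list of sources to an answer."""
    sources = []
    for c in top:
        src = c.get("source", "unknown source")  # "source" key is an assumption
        if src not in sources:
            sources.append(src)
    lines = [f"[{i}] {s}" for i, s in enumerate(sources, start=1)]
    return answer_text + "\n\nSources:\n" + "\n".join(lines)


def run_eval(answer_fn, cases):
    """Return the questions whose answer is missing an expected phrase."""
    failures = []
    for case in cases:
        got = answer_fn(case["question"]).lower()
        if not all(p.lower() in got for p in case["must_contain"]):
            failures.append(case["question"])
    return failures


# Stand-in retrieved chunks so the sketch runs on its own.
top = [
    {"text": "Refunds are issued within 14 days.", "source": "refund-policy.md"},
    {"text": "Refund requests need an order number.", "source": "refund-policy.md"},
    {"text": "Contact support by email.", "source": "support.md"},
]
cited = with_citations("Refunds take up to 14 days.", top)


def fake_answer(question):
    """Placeholder for the real answer() so no model is needed."""
    if "refund" in question.lower():
        return "Refunds take up to 14 days."
    return "I don't know."


cases = [
    {"question": "How long do refunds take?", "must_contain": ["14 days"]},
    {"question": "What are the office hours?", "must_contain": ["9am"]},
]
failed = run_eval(fake_answer, cases)
```

In the real app you would pass answer instead of fake_answer and keep cases in a file under version control, so a change to chunking, k, or min_score that breaks a known-good question shows up as a non-empty failed list.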
Next: Conversational RAG →