2. Embeddings & RAG Architecture
What an embedding is
An embedding is a fixed-length vector of numbers that represents the meaning of a piece of text. A model is trained so that texts with similar meaning land close together in this vector space, and unrelated texts land far apart. "How do I reset my password?" and "I forgot my login" produce nearby vectors even though they share almost no words.
The number of dimensions is fixed per model: for example, all-MiniLM-L6-v2 outputs 384-dimensional vectors, and several OpenAI embedding models, such as text-embedding-3-small, output 1536-dimensional vectors by default.
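"Close together" is commonly measured with cosine similarity. The sketch below uses hand-made 4-dimensional vectors as stand-ins for real embeddings; the numbers are invented for illustration and are not model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy vectors, NOT real model output.
reset_password = [0.9, 0.1, 0.0, 0.2]  # stands in for "How do I reset my password?"
forgot_login = [0.8, 0.2, 0.1, 0.3]    # stands in for "I forgot my login"
pizza_recipe = [0.0, 0.1, 0.9, 0.1]    # stands in for an unrelated text

sim_related = cosine_similarity(reset_password, forgot_login)
sim_unrelated = cosine_similarity(reset_password, pizza_recipe)
print(sim_related > sim_unrelated)  # True
```

With real embeddings the vectors have hundreds of dimensions, but the comparison works the same way.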
Why vectors instead of keywords
Keyword search matches literal terms; embeddings match concepts. That is what lets RAG retrieve the right passage even when the user's wording differs completely from the document's wording.
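To see why keyword matching struggles here, the sketch below compares the two example sentences by their content words. The stop-word list is a small illustrative assumption, not a standard one.

```python
import re

# Small illustrative stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"how", "do", "i", "my", "a", "the", "to"}

def content_words(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}

query = "How do I reset my password?"
doc = "I forgot my login"

shared = content_words(query) & content_words(doc)
print(shared)  # set()
```

The two sentences share no content words, so a purely term-based match has nothing to score, while an embedding model can still place them near each other.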
The full architecture
           ┌──────────────── INDEXING ────────────────┐
docs ─▶ chunker ─▶ embedding model ─▶ vector store (index)
                                           │
           ┌──────────────── QUERYING ─────┼───────────┐
question ─▶ embedding model ─▶ retriever ──┘─▶ top-k chunks
                                                    │
                                       prompt (question + chunks)
                                                    │
                                                    ▼
                                                   LLM ─▶ grounded answer
The components:
- Chunker — splits docs into retrievable pieces (parts 8–11).
- Embedding model — same model used for both docs and questions (must match!).
- Vector store / index — holds chunk vectors; supports fast similarity search.
- Retriever — embeds the query and returns the closest chunks (parts 4–5).
- LLM — writes the final answer from the retrieved context.
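The components above can be wired together in a few lines. This is a minimal sketch, not a production design: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model (it matches words, not meaning), `VectorStore` is a brute-force in-memory list, and the LLM call is left out.

```python
import math
from collections import Counter

def chunk(text, max_words=12):
    """Naive chunker: consecutive windows of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a word-count vector."""
    return Counter(w.strip(".,?!").lower() for w in text.split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class VectorStore:
    """In-memory index of (chunk, vector) pairs with brute-force search."""

    def __init__(self, embed):
        self.embed = embed
        self.items = []

    def add(self, chunks):
        for c in chunks:
            self.items.append((c, self.embed(c)))

    def search(self, question, k=2):
        q = self.embed(question)  # same embed function as at indexing time
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [c for c, _ in ranked[:k]]

def build_prompt(question, chunks):
    """Combine the question and retrieved chunks into one prompt string."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Indexing: docs -> chunker -> embedding -> vector store
docs = [
    "To reset your password open Settings and choose Reset password.",
    "Invoices are emailed on the first day of each month.",
]
store = VectorStore(toy_embed)
for d in docs:
    store.add(chunk(d))

# Querying: question -> embedding -> retriever -> prompt
question = "How do I reset my password?"
top_chunks = store.search(question, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)  # this string would then be sent to an LLM
```

The toy embedder only finds the right chunk because the question happens to share words with it; swapping in a real embedding model is what makes retrieval work across different wordings.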
Code — generating embeddings
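from sentence_transformers import SentenceTransformer

# Load the model once and reuse it; all-MiniLM-L6-v2 produces 384-dim vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")
# normalize_embeddings=True returns a unit-length vector, so a dot product equals cosine similarity.
vec = model.encode("How do I reset my password?", normalize_embeddings=True)
print(vec.shape)  # (384,)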
Use the same embedding model for your documents and your queries — vectors from different models aren't comparable.
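One way to enforce this rule is to record the model name alongside the index and reject vectors from anything else. The sketch below is a hypothetical guard, not part of any library; `EmbeddingIndex` and "some-other-model" are invented names. Checking the dimension alone is not enough, because two different models can share a dimension and still produce incomparable vectors.

```python
class EmbeddingIndex:
    """Stores vectors together with the name and dimension of the model that made them."""

    def __init__(self, model_name, dim):
        self.model_name = model_name
        self.dim = dim
        self.vectors = []

    def _check(self, vector, model_name):
        # The name check catches different models that happen to share a dimension.
        if model_name != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, got vector from {model_name!r}"
            )
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")

    def add(self, vector, model_name):
        self._check(vector, model_name)
        self.vectors.append(vector)

    def validate_query(self, vector, model_name):
        self._check(vector, model_name)
        return True

index = EmbeddingIndex("all-MiniLM-L6-v2", dim=384)
index.add([0.0] * 384, "all-MiniLM-L6-v2")

try:
    # Hypothetical model name, used only to trigger the guard.
    index.validate_query([0.0] * 1536, "some-other-model")
except ValueError as err:
    print("rejected:", err)
```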