2. Embeddings & RAG Architecture

What an embedding is

An embedding is a fixed-length vector of numbers that represents the meaning of a piece of text. A model is trained so that texts with similar meaning land close together in this vector space, and unrelated texts land far apart. "How do I reset my password?" and "I forgot my login" produce nearby vectors even though they share almost no words.
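"Close together" is usually measured with cosine similarity. The sketch below is a minimal pure-Python version; the three 3-dimensional vectors are made-up stand-ins for real embeddings (which would have hundreds of dimensions) and are there only to show the comparison.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", invented for illustration only.
reset_password = [0.9, 0.1, 0.0]   # "How do I reset my password?"
forgot_login = [0.8, 0.2, 0.1]     # "I forgot my login"
pizza_recipe = [0.0, 0.1, 0.9]     # an unrelated text

similar = cosine_similarity(reset_password, forgot_login)
unrelated = cosine_similarity(reset_password, pizza_recipe)
print(similar > unrelated)  # True
```

With a real model, the vectors come from the model itself, but the comparison step looks the same.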

The number of dimensions is fixed per model: for example, all-MiniLM-L6-v2 outputs 384-dimensional vectors, and several OpenAI embedding models (such as text-embedding-ada-002) output 1536.

Why vectors instead of keywords

Keyword search matches literal terms; embeddings match meaning. That's what lets RAG retrieve the right passage even when the user's wording is completely different from the document's wording.
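To see how little a keyword matcher has to work with, here is a small sketch that measures word overlap (Jaccard similarity) between the two example sentences from earlier. The helper names `word_set` and `keyword_overlap` are illustrative, not part of any library.

```python
import re

def word_set(text):
    """Lowercased set of words in the text, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def keyword_overlap(a, b):
    """Jaccard similarity: shared words divided by all distinct words."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

query = "How do I reset my password?"
doc = "I forgot my login"

shared = sorted(word_set(query) & word_set(doc))
print(shared)                       # ['i', 'my']
print(keyword_overlap(query, doc))  # 0.25
```

The only shared words are "i" and "my", which carry no topical signal, so a purely lexical matcher has nothing useful to rank on here.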

The full architecture

INDEXING:  docs ─▶ chunker ─▶ embedding model ─▶ vector store (index)
                                                       │
                                                       ▼
QUERYING:  question ─▶ embedding model ──────────▶ retriever ─▶ top-k chunks
                                                                     │
                                                                     ▼
                                                        prompt (question + chunks)
                                                                     │
                                                                     ▼
                                                                    LLM ─▶ grounded answer

The components:

  • Chunker — splits docs into retrievable pieces (parts 8–11).
  • Embedding model — same model used for both docs and questions (must match!).
  • Vector store / index — holds chunk vectors; supports fast similarity search.
  • Retriever — embeds the query and returns the closest chunks (parts 4–5).
  • LLM — writes the final answer from the retrieved context.
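The components above can be wired together in a few lines. The following is a rough sketch, not a production design: `make_toy_embedder` is a hypothetical stand-in for a real embedding model (it just counts words from a fixed vocabulary, so it is not semantic), the "vector store" is a plain list, and the LLM call is left out. All function names are assumptions made for this example.

```python
import math
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def make_toy_embedder(corpus):
    """Stand-in for an embedding model: normalized word counts over a fixed vocabulary."""
    vocab = {}
    for text in corpus:
        for word in tokenize(text):
            vocab.setdefault(word, len(vocab))

    def embed(text):
        vec = [0.0] * len(vocab)
        for word in tokenize(text):
            if word in vocab:          # words outside the vocabulary are ignored
                vec[vocab[word]] += 1.0
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    return embed

def chunk(doc, max_words=20):
    """Chunker: split a document into pieces of at most max_words words."""
    words = doc.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def build_index(docs, embed):
    """Indexing: chunk every doc and store (chunk text, chunk vector) pairs."""
    return [(piece, embed(piece)) for doc in docs for piece in chunk(doc)]

def retrieve(question, index, embed, k=2):
    """Retriever: embed the question with the SAME embedder and return the k closest chunks."""
    q = embed(question)
    scored = [(sum(a * b for a, b in zip(q, vec)), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

def build_prompt(question, chunks):
    """Prompt: the question plus the retrieved context, ready to send to an LLM."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

docs = [
    "To reset your password, open Settings and choose Reset password.",
    "Invoices are emailed on the first day of each month.",
]
embed = make_toy_embedder(docs)
index = build_index(docs, embed)

question = "How do I reset my password?"
top = retrieve(question, index, embed, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Swapping `embed` for a real model and the list for a vector database changes the quality and speed, but not the shape of the flow.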

Code — generating embeddings

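from sentence_transformers import SentenceTransformer

# Load the model once and reuse it; all-MiniLM-L6-v2 outputs 384-dim vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")

# normalize_embeddings=True returns a unit-length vector,
# so a plain dot product between two such vectors equals their cosine similarity.
vec = model.encode("How do I reset my password?", normalize_embeddings=True)
print(vec.shape)  # (384,)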

Use the same embedding model for your documents and your queries — vectors from different models aren't comparable.
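One way to enforce this rule is to have the index remember which model produced its vectors and reject anything else. The `VectorIndex` class below is a hypothetical illustration, not the API of any particular vector database; the placeholder vectors and the name "other-embedding-model" are made up. Note that it checks the model name as well as the dimension, since two different models can share a dimension and still be incomparable.

```python
class VectorIndex:
    """Minimal in-memory index that records which model produced its vectors."""

    def __init__(self, model_name, dim):
        self.model_name = model_name
        self.dim = dim
        self.items = []  # list of (text, vector) pairs

    def _check(self, model_name, vec):
        if model_name != self.model_name:
            raise ValueError(
                f"index was built with {self.model_name!r}, got vectors from {model_name!r}"
            )
        if len(vec) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vec)}")

    def add(self, text, vec, model_name):
        self._check(model_name, vec)
        self.items.append((text, list(vec)))

    def search(self, query_vec, model_name, k=3):
        self._check(model_name, query_vec)
        ranked = sorted(
            self.items,
            key=lambda item: sum(a * b for a, b in zip(query_vec, item[1])),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]

index = VectorIndex(model_name="all-MiniLM-L6-v2", dim=384)
# Placeholder vector; in practice this would come from model.encode(...).
index.add("Reset your password from Settings.", [1.0] + [0.0] * 383, "all-MiniLM-L6-v2")

try:
    # A query vector from a different (hypothetical) model is rejected up front.
    index.search([0.0] * 1536, model_name="other-embedding-model")
except ValueError as err:
    print(err)
```

Failing loudly at query time is usually preferable to silently returning nonsense rankings.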
