2. Embeddings & RAG Architecture

What an embedding is

An embedding is a fixed-length vector of numbers that represents the meaning of a piece of text. A model is trained so that texts with similar meaning land close together in this vector space, and unrelated texts land far apart. "How do I reset my password?" and "I forgot my login" produce nearby vectors even though they share almost no words.
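
"Close together" is commonly measured with cosine similarity, which compares the direction of two vectors. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings come from a model and have hundreds of dimensions, and the variable names here are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors for illustration only; not output from any real model.
reset_password = [0.9, 0.1, 0.2]   # "How do I reset my password?"
forgot_login = [0.8, 0.2, 0.3]     # "I forgot my login"
pizza_recipe = [0.1, 0.9, 0.1]     # an unrelated text

similar = cosine_similarity(reset_password, forgot_login)
unrelated = cosine_similarity(reset_password, pizza_recipe)
print(similar)     # close to 1
print(unrelated)   # much lower
```

A score near 1 means the vectors point in nearly the same direction; a score near 0 means they have little in common.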

The number of dimensions is fixed per model: for example, all-MiniLM-L6-v2 outputs 384-dimensional vectors, and OpenAI's text-embedding-3-small outputs 1536-dimensional vectors by default.

Why vectors instead of keywords

Keyword search matches literal words; embeddings match concepts. That's what lets RAG retrieve the right passage even when the user's wording is completely different from the document's wording.
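
To see why keyword matching struggles here, the sketch below compares the content words of the two example sentences from earlier. The stopword list and the `keywords` helper are assumptions made for this illustration, not part of any particular search engine.

```python
import re

# Small illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"how", "do", "i", "my", "the", "a", "to"}

def keywords(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOPWORDS}

query = "How do I reset my password?"
document = "I forgot my login"

# Words that appear in both texts after stopword removal.
shared = keywords(query) & keywords(document)
print(shared)   # set()
```

With no content words in common, a pure keyword match has nothing to score, even though the two sentences describe the same problem.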

The full architecture

            ┌──────────────── INDEXING ────────────────┐
 docs ─▶ chunker ─▶ embedding model ─▶ vector store (index)
                                            │
            ┌──────────────── QUERYING ─────┼───────────┐
 question ─▶ embedding model ─▶ retriever ──┘─▶ top-k chunks
                                                  │
                                       prompt (question + chunks)
                                                  │
                                                  ▼
                                                 LLM ─▶ grounded answer

The components:

Chunker — splits docs into retrievable pieces (parts 8–11).
Embedding model — same model used for both docs and questions (must match!).
Vector store / index — holds chunk vectors; supports fast similarity search.
Retriever — embeds the query and returns the closest chunks (parts 4–5).
LLM — writes the final answer from the retrieved context.
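
The components above can be wired together in a few lines. The following is a minimal in-memory sketch, not a production design: `toy_embed`, `VOCAB`, `chunk`, `VectorStore`, and `build_prompt` are all hypothetical names, and `toy_embed` is a word-count stand-in that does not capture meaning the way a real embedding model does. The LLM call is left out; the sketch stops at the prompt that would be sent to it.

```python
import math
import re
from typing import Callable, List, Tuple

Vector = List[float]

# Hypothetical toy vocabulary; a real embedding model replaces all of this.
VOCAB = ["password", "reset", "login", "refund", "invoice", "shipping"]

def toy_embed(text: str) -> Vector:
    """Stand-in embedder: unit-length word counts over VOCAB (not semantic)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = [float(words.count(term)) for term in VOCAB]
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else counts

def chunk(doc: str) -> List[str]:
    """Naive chunker: one chunk per non-empty line."""
    return [line.strip() for line in doc.splitlines() if line.strip()]

class VectorStore:
    """Holds (vector, text) pairs and does brute-force similarity search."""

    def __init__(self, embed: Callable[[str], Vector]):
        self.embed = embed  # the same function embeds docs and questions
        self.items: List[Tuple[Vector, str]] = []

    def add(self, chunks: List[str]) -> None:
        for text in chunks:
            self.items.append((self.embed(text), text))

    def search(self, question: str, k: int = 2) -> List[str]:
        q = self.embed(question)
        # Vectors are unit length, so the dot product is the cosine similarity.
        def score(item: Tuple[Vector, str]) -> float:
            return sum(a * b for a, b in zip(q, item[0]))
        ranked = sorted(self.items, key=score, reverse=True)
        return [text for _, text in ranked[:k]]

def build_prompt(question: str, chunks: List[str]) -> str:
    """Combine the question with the retrieved chunks."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

docs = """To reset your password, open Settings and choose Reset password.
Refunds are issued to the original payment method.
Shipping takes three to five business days."""

# Indexing: chunk, embed, store.
store = VectorStore(toy_embed)
store.add(chunk(docs))

# Querying: embed the question, retrieve top-k, build the prompt.
question = "How do I reset my password?"
top = store.search(question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Passing the embedding function into the store is one way to guarantee that documents and questions go through the same model; swapping `toy_embed` for a real model's encode function would leave the rest of the flow unchanged.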

Code — generating embeddings

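from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim
# normalize_embeddings=True returns a unit-length vector, so a plain
# dot product between two such vectors equals their cosine similarity.
vec = model.encode("How do I reset my password?", normalize_embeddings=True)
print(vec.shape)   # (384,)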

Use the same embedding model for your documents and your queries — vectors from different models aren't comparable.
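
One way to enforce this rule is to record which model built an index and reject vectors that come from anywhere else. The sketch below is an assumption about how such a guard might look; `TaggedIndex` and its methods are hypothetical and do not belong to any vector-store library.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TaggedIndex:
    """Hypothetical index wrapper that remembers which model produced its vectors."""
    model_name: str
    dim: int
    vectors: List[List[float]] = field(default_factory=list)

    def _check(self, vector: List[float], model_name: str) -> None:
        # Reject vectors from a different model or with the wrong length.
        if model_name != self.model_name:
            raise ValueError(
                f"index was built with {self.model_name!r}, got {model_name!r}"
            )
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")

    def add(self, vector: List[float], model_name: str) -> None:
        self._check(vector, model_name)
        self.vectors.append(vector)

    def check_query(self, vector: List[float], model_name: str) -> None:
        self._check(vector, model_name)

index = TaggedIndex(model_name="all-MiniLM-L6-v2", dim=384)
index.add([0.0] * 384, "all-MiniLM-L6-v2")   # placeholder vector, accepted

try:
    # A 1536-dim query vector from a different model is refused.
    index.check_query([0.0] * 1536, "some-other-model")
except ValueError as err:
    print(err)
```

Checking the model name matters even when the dimensions happen to match: two different 384-dimensional models still place texts in unrelated vector spaces.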

Next: Data Ingestion Pipeline →
