12. Multi-Modal RAG

Multi-modal RAG retrieves over more than text — images, charts, diagrams, scanned PDFs — so questions about visual content can be answered too.

Two architectures

A. Caption-then-embed (unified text space)

Turn every non-text item into a text description, then embed everything as text. Simple, works with any text vector store.

 image ─▶ vision model ─▶ caption ─┐
                                   ├─▶ text embeddings ─▶ one index
 text  ────────────────────────────┘

B. Multi-modal embeddings (shared space)

Use a model that embeds images and text into the same vector space (e.g. a CLIP-style model), so a text query can retrieve images directly, with no captioning step in between.

 image ─▶┐
          ├─▶ multimodal embedder ─▶ shared vector space ─▶ index
 text  ─▶┘                              ▲
 query (text) ─────────────────────────┘ retrieves text OR images
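
To make the shared-space idea concrete, here is a minimal sketch of retrieval over one mixed index. It assumes the vectors already come from a multimodal embedder; the three-dimensional vectors, file paths, and the `search` helper are made up for illustration and do not come from any real model or library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(index, query_vec, k=2):
    """Return the k items closest to the query, regardless of modality."""
    ranked = sorted(index, key=lambda item: cosine(item["vector"], query_vec), reverse=True)
    return ranked[:k]

# Toy vectors standing in for the output of a multimodal embedder (assumption).
index = [
    {"type": "image", "ref": "charts/q3_revenue.png", "vector": [0.9, 0.1, 0.0]},
    {"type": "text", "ref": "Q3 revenue grew on strong subscriptions.", "vector": [0.8, 0.3, 0.1]},
    {"type": "image", "ref": "photos/office.jpg", "vector": [0.0, 0.2, 0.9]},
]

# Pretend embedding of the text query "Q3 revenue chart".
query_vec = [1.0, 0.1, 0.0]
hits = search(index, query_vec, k=2)
```

Because images and text live in the same space, the search code never branches on `type`: the same similarity ranking returns a chart and a paragraph side by side.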

Code — caption-then-embed

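def index_image(path, vision_model, text_embedder):
    """Caption one image and embed the caption as ordinary text."""
    # vision_model.describe stands for any captioning call that returns a string.
    caption = vision_model.describe(path)          # "bar chart of Q3 revenue..."
    # Unit-normalized vectors let a dot product act as cosine similarity.
    vec = text_embedder.encode(caption, normalize_embeddings=True)
    # Keep the path so the original image can be shown or sent to a vision LLM later.
    return {"type": "image", "path": path, "text": caption, "vector": vec}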
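
One way to see the "one index" property end to end is to run the same pattern with stand-in models. The sketch below is an assumption-heavy toy: `StubVisionModel`, `StubTextEmbedder`, `build_index`, and `search` are hypothetical names, the captions are canned, and the embedder is just word counts over a tiny vocabulary. A real setup would swap in an actual captioning model and text embedder.

```python
import math

class StubVisionModel:
    """Hypothetical stand-in for a captioning model; returns canned captions."""
    def __init__(self, captions):
        self.captions = captions

    def describe(self, path):
        return self.captions[path]

class StubTextEmbedder:
    """Hypothetical stand-in for a text embedder: word counts over a tiny vocabulary."""
    def __init__(self, vocab):
        self.vocab = vocab

    def encode(self, text, normalize_embeddings=True):
        words = text.lower().split()
        vec = [float(words.count(w)) for w in self.vocab]
        norm = math.sqrt(sum(v * v for v in vec))
        if normalize_embeddings and norm:
            vec = [v / norm for v in vec]
        return vec

def build_index(text_chunks, image_paths, vision_model, text_embedder):
    """Embed text chunks directly and images via their captions, into one list."""
    index = []
    for chunk in text_chunks:
        vec = text_embedder.encode(chunk, normalize_embeddings=True)
        index.append({"type": "text", "path": None, "text": chunk, "vector": vec})
    for path in image_paths:
        caption = vision_model.describe(path)
        vec = text_embedder.encode(caption, normalize_embeddings=True)
        index.append({"type": "image", "path": path, "text": caption, "vector": vec})
    return index

def search(index, query, text_embedder, k=1):
    """Rank items by dot product with the query (cosine, since vectors are normalized)."""
    q = text_embedder.encode(query, normalize_embeddings=True)
    return sorted(
        index,
        key=lambda item: sum(a * b for a, b in zip(item["vector"], q)),
        reverse=True,
    )[:k]

vision = StubVisionModel({"q3.png": "bar chart of q3 revenue by region"})
embedder = StubTextEmbedder(["chart", "revenue", "region", "policy", "refund"])
index = build_index(["refund policy for annual plans"], ["q3.png"], vision, embedder)
top = search(index, "revenue chart by region", embedder)[0]
```

Here the query matches the image only through its caption, which is the main limitation of this architecture: anything the caption leaves out cannot be retrieved.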

Choosing

Caption-then-embed: easiest to add to an existing text RAG. Retrieval can only match what the caption mentions, so quality depends on how detailed the captions are. Good default.
Multi-modal embeddings: better for fine visual detail and image-to-image search; needs a multi-modal embedding model and an index built from its vectors.

For answers, pass the retrieved captions (and, with a vision-capable LLM, the images themselves) into the prompt.
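
The answer step can be sketched as a small prompt builder. This is a minimal illustration that assumes hits shaped like the index entries above (`type`, `path`, `text`); the `build_answer_request` name and the returned dictionary layout are hypothetical and not tied to any particular LLM API.

```python
def build_answer_request(question, hits, attach_images=False):
    """Assemble prompt text from retrieved items, optionally listing images to attach."""
    context_lines = []
    image_paths = []
    for i, hit in enumerate(hits, start=1):
        label = "image caption" if hit["type"] == "image" else "text"
        context_lines.append(f"[{i}] ({label}) {hit['text']}")
        # Only a vision-capable LLM can use the raw image, so attaching is opt-in.
        if attach_images and hit["type"] == "image":
            image_paths.append(hit["path"])
    prompt = (
        "Answer using only the context below.\n\n"
        "Context:\n" + "\n".join(context_lines) + f"\n\nQuestion: {question}"
    )
    return {"prompt": prompt, "image_paths": image_paths}

hits = [
    {"type": "image", "path": "q3.png", "text": "bar chart of Q3 revenue by region"},
    {"type": "text", "path": None, "text": "Revenue is reported per sales region."},
]
text_only = build_answer_request("Which region led in Q3?", hits)
with_images = build_answer_request("Which region led in Q3?", hits, attach_images=True)
```

With a text-only LLM the caption is all the model sees; with `attach_images=True` the caller would load the listed files and send them alongside the prompt in whatever format its model expects.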
