
12. Multi-Modal RAG

Multi-modal RAG extends retrieval beyond text to images, charts, diagrams, and scanned PDFs, so questions about visual content can be answered too.

Two architectures

A. Caption-then-embed (unified text space)

Turn every non-text item into a text description, then embed everything as text. Simple, works with any text vector store.

image ─▶ vision model ─▶ caption ─┐
text ─────────────────────────────┴─▶ text embeddings ─▶ one index

B. Multi-modal embeddings (shared space)

Use a model that embeds images and text into the same vector space (e.g. a CLIP-style model), so a text query can directly retrieve images.

image ─▶┐
        ├─▶ multi-modal embedder ─▶ shared vector space ─▶ index
text  ─▶┘                             ▲
query (text) ─────────────────────────┘  retrieves text OR images
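
As a rough sketch of the shared-space idea, the snippet below indexes text and image items side by side and ranks them by cosine similarity against a text query. The `SharedSpaceIndex` class and the `embed_text` / `embed_image` callables are hypothetical stand-ins for whatever CLIP-style model is used; the toy embedders in the usage lines are hand-picked 2-D vectors, not real model output.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class SharedSpaceIndex:
    """Hypothetical in-memory index holding text and image vectors together."""

    def __init__(self, embed_text, embed_image):
        self.embed_text = embed_text    # assumed: str -> list of floats
        self.embed_image = embed_image  # assumed: image path -> list of floats
        self.items = []                 # (kind, payload, unit vector)

    def add_text(self, text):
        self.items.append(("text", text, normalize(self.embed_text(text))))

    def add_image(self, path):
        self.items.append(("image", path, normalize(self.embed_image(path))))

    def search(self, query, k=3):
        """Embed a text query and return the k closest items of either kind."""
        q = normalize(self.embed_text(query))
        scored = [
            (sum(a * b for a, b in zip(q, vec)), kind, payload)
            for kind, payload, vec in self.items
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]


# Toy stand-ins for a real multi-modal model (illustration only).
TOY_TEXT = {
    "revenue chart": [1.0, 0.1],
    "Summary of Q3 revenue.": [0.9, 0.3],
    "Office dog policy.": [0.0, 1.0],
}
TOY_IMAGE = {"q3_revenue.png": [1.0, 0.0]}

index = SharedSpaceIndex(TOY_TEXT.__getitem__, TOY_IMAGE.__getitem__)
index.add_text("Summary of Q3 revenue.")
index.add_text("Office dog policy.")
index.add_image("q3_revenue.png")

hits = index.search("revenue chart", k=2)
for score, kind, payload in hits:
    print(f"{score:.3f}  {kind:5s}  {payload}")
```

The query only ever goes through the text side of the embedder, yet it can surface the image, because both kinds of vectors live in the same space.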

Code — caption-then-embed

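def index_image(path, vision_model, text_embedder):
    # describe() stands in for your captioning call; it should return
    # a text description such as "bar chart of Q3 revenue..."
    caption = vision_model.describe(path)
    # Embed the caption like any other text chunk. Unit-normalized
    # vectors let a plain dot product act as cosine similarity.
    vec = text_embedder.encode(caption, normalize_embeddings=True)
    return {"type": "image", "path": path, "text": caption, "vector": vec}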
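
Once images are indexed this way, they can be searched through the same path as ordinary text chunks. The sketch below assumes every record has the shape returned by `index_image` and that vectors are unit-normalized; `search` and `ToyEmbedder` are illustrative names, and the toy embedder simply counts a few keywords in place of a real model.

```python
import math


class ToyEmbedder:
    """Stand-in for a real text embedder: counts keywords. Illustration only."""

    VOCAB = ("revenue", "chart", "policy")

    def encode(self, text, normalize_embeddings=True):
        words = text.lower().split()
        vec = [float(words.count(term)) for term in self.VOCAB]
        if normalize_embeddings:
            norm = math.sqrt(sum(x * x for x in vec)) or 1.0
            vec = [x / norm for x in vec]
        return vec


def search(query, records, text_embedder, k=3):
    """Rank text and image records together by cosine similarity to the query."""
    q = text_embedder.encode(query, normalize_embeddings=True)
    scored = [
        (sum(a * b for a, b in zip(q, rec["vector"])), rec) for rec in records
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rec for _, rec in scored[:k]]


embedder = ToyEmbedder()
records = [
    {"type": "image", "path": "q3.png", "text": "bar chart of Q3 revenue"},
    {"type": "text", "path": None, "text": "travel policy for contractors"},
]
for rec in records:
    rec["vector"] = embedder.encode(rec["text"], normalize_embeddings=True)

top = search("revenue chart", records, embedder, k=1)
print(top[0]["type"], top[0]["path"])
```

Because the image is represented only by its caption, it is found here purely through the words "chart" and "revenue"; anything the caption omits is invisible to retrieval.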

Choosing

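  • Caption-then-embed: easiest to add to an existing text RAG pipeline. Retrieval can only match what the caption mentions, so quality depends on how detailed the captions are. Good default.
  • Multi-modal embeddings: better for fine visual detail and image-to-image search, but needs a multi-modal embedding model and a store that holds its image and text vectors.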

For answers, pass the retrieved captions (and, with a vision-capable LLM, the images themselves) into the prompt.
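
A minimal sketch of that last step, assuming retrieved records shaped like the ones produced by `index_image`. The `build_prompt` function and its return format are hypothetical; how image files are actually attached depends on the LLM client in use, so this only collects their paths.

```python
def build_prompt(question, hits, attach_images=False):
    """Assemble prompt text from retrieved records.

    When attach_images is True, also collect the paths of retrieved images
    so the caller can pass them to a vision-capable LLM.
    """
    context_lines = []
    image_paths = []
    for i, hit in enumerate(hits, start=1):
        label = "image caption" if hit["type"] == "image" else "text"
        context_lines.append(f"[{i}] ({label}) {hit['text']}")
        if attach_images and hit["type"] == "image":
            image_paths.append(hit["path"])
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n".join(context_lines) + "\n\n"
        f"Question: {question}"
    )
    return {"prompt": prompt, "image_paths": image_paths}


hits = [
    {"type": "image", "path": "q3.png", "text": "bar chart of Q3 revenue"},
    {"type": "text", "path": None, "text": "Revenue is reported per quarter."},
]
request = build_prompt("What does the Q3 chart show?", hits, attach_images=True)
print(request["prompt"])
print(request["image_paths"])
```

Labelling each context line as a caption or as source text lets the model treat captions as descriptions of an image rather than as quoted document text.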
