12. Multi-Modal RAG
Multi-modal RAG extends retrieval beyond plain text to images, charts, diagrams, and scanned PDFs, so that questions about visual content can be answered too.
Two architectures
A. Caption-then-embed (unified text space)
Turn every non-text item into a text description, then embed everything as text. Simple, works with any text vector store.
image ─▶ vision model ─▶ caption ─┐
text ─────────────────────────────┼─▶ text embeddings ─▶ one index
                                  ┘
B. Multi-modal embeddings (shared space)
Use a model that embeds images and text into the same vector space (e.g. a CLIP-style model), so a text query can directly retrieve images.
image ─▶┐
        ├─▶ multi-modal embedder ─▶ shared vector space ─▶ index
text  ─▶┘                                   ▲
query (text) ───────────────────────────────┘  retrieves text OR images
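As a rough sketch of architecture B, assume some embedder has already mapped images and text chunks into one vector space. The `search` helper, the `ref` field, and the toy 3-dimensional vectors below are illustrative assumptions; in practice the vectors would come from a CLIP-style model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, index, k=2):
    """Return the k items closest to the query, regardless of modality."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vector"]), reverse=True)
    return ranked[:k]

# Toy vectors standing in for real multi-modal embeddings (assumption).
index = [
    {"type": "image", "ref": "revenue_chart.png", "vector": [0.9, 0.1, 0.0]},
    {"type": "text", "ref": "Q3 revenue summary paragraph", "vector": [0.8, 0.2, 0.1]},
    {"type": "image", "ref": "team_photo.jpg", "vector": [0.0, 0.1, 0.9]},
]

query_vec = [1.0, 0.0, 0.0]  # pretend embedding of the text query "Q3 revenue chart"
hits = search(query_vec, index, k=2)
print([(h["type"], h["ref"]) for h in hits])
```

Because images and text share one space, the same ranking call returns both kinds of item; no captioning step is involved.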
Code — caption-then-embed
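def index_image(path, vision_model, text_embedder):
    """Caption an image and embed the caption as ordinary text."""
    # vision_model.describe() stands for whatever captioning call is available
    caption = vision_model.describe(path)  # e.g. "bar chart of Q3 revenue..."
    vec = text_embedder.encode(caption, normalize_embeddings=True)
    # Keep the path so the original image can be recovered at answer time
    return {"type": "image", "path": path, "text": caption, "vector": vec}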
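To illustrate how caption records and ordinary text chunks end up in one index, here is a minimal, self-contained sketch. The bag-of-words `toy_embed` is only a stand-in for a real text embedder, the caption is hard-coded rather than produced by a vision model, and `make_record` and `search` are hypothetical helper names.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real text embedder: a sparse bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def make_record(kind, source, text):
    """Text chunks and image captions share the same record shape."""
    return {"type": kind, "source": source, "text": text, "vector": toy_embed(text)}

# One index holding both a text chunk and an image caption (toy data).
index = [
    make_record("text", "report.txt", "the company opened two new offices in spring"),
    make_record("image", "q3_revenue.png", "bar chart of q3 revenue by region"),
]

def search(query, index, k=1):
    """Rank all records by similarity to the query text."""
    q = toy_embed(query)
    return sorted(index, key=lambda r: cosine(q, r["vector"]), reverse=True)[:k]

top = search("q3 revenue chart", index)[0]
print(top["type"], top["source"])
```

The query matches the image only through the words in its caption, which is why caption richness matters so much in this architecture.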
Choosing
- Caption-then-embed — easiest to add to an existing text RAG; quality depends on caption richness. Good default.
- Multi-modal embeddings — better for fine visual detail and image-to-image search; needs a multi-modal model and store.
At answer time, pass the retrieved captions into the prompt; with a vision-capable LLM, attach the images themselves as well.
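One possible way to assemble that prompt is sketched below. It assumes retrieved records carry the `type`, `path`, and `text` fields used in the indexing code; `build_prompt` is a hypothetical helper, the sample records are toy data, and how the collected image paths are attached depends on the LLM client in use.

```python
def build_prompt(question, hits):
    """Build a text prompt from retrieved records and collect image paths separately."""
    context_lines = []
    image_paths = []
    for i, hit in enumerate(hits, start=1):
        if hit["type"] == "image":
            # Captions go into the text prompt; the image itself is attached separately.
            context_lines.append(f"[{i}] (image: {hit['path']}) {hit['text']}")
            image_paths.append(hit["path"])
        else:
            context_lines.append(f"[{i}] {hit['text']}")
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n".join(context_lines) + f"\n\nQuestion: {question}"
    )
    return prompt, image_paths

# Toy retrieved records (assumption).
hits = [
    {"type": "image", "path": "q3_revenue.png", "text": "bar chart of Q3 revenue by region"},
    {"type": "text", "text": "Revenue figures are reported per region."},
]
prompt, image_paths = build_prompt("Which region led Q3 revenue?", hits)
print(prompt)
print(image_paths)
```

With a text-only LLM, `image_paths` can simply be ignored and the captions carry the visual information.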