Image-Based RAG for Medical Imaging: MetaCLIP, FAISS, and the Retrieval-Synthesis Pattern
Most RAG content is about text. For medical imaging, the same pattern applies to images — retrieve similar studies by visual similarity, surface their radiologist-written reports, then synthesize across them. Here's the architecture, the embedding-model choice, the FAISS index decisions, and the failure modes nobody covers.
Most retrieval-augmented generation content is about text — chunking documents, embedding sentences, top-K retrieval over a vector index. For medical imaging, the same pattern applies to images: embed the query image, retrieve similar past studies, pull their radiologist-written reports, and let an LLM synthesize across them. This is the writeup of how we built one of these systems over ~1.6M medical images, why MetaCLIP was the right embedding model, and what the FAISS index choice actually does at this scale.
The standard retrieval-augmented generation pattern is well-trodden: a user submits a query, the system embeds the query, looks up the most similar chunks in a vector index, and feeds those chunks to a language model as context. For text-over-text RAG, the embeddings come from sentence-transformer models and the chunks are paragraphs from a corpus.
For medical imaging, the same architecture works with two substitutions: the query is an image instead of text, and the indexed corpus is a set of past images with their associated metadata and radiologist-written reports. The retrieval step finds visually similar past cases. The synthesis step has an LLM read the retrieved reports and produce a structured summary of "what we've seen in similar studies."
This pattern is genuinely useful — for decision support (showing the radiologist similar cases with their resolutions), for unsupervised pathology exploration (discovering structure in unannotated cohorts), and for dataset curation (finding outliers or near-duplicates at scale). It's also one of the few places where the marketing pitch for "foundation models in medical imaging" turns into a system that actually ships.
This post is the writeup of one such system covering ~1.6M chest X-rays and knee X-rays. The technical choices that matter most are the embedding model, the FAISS index configuration, and the failure modes nobody covers in tutorials.
The use case
For a new query chest X-ray, find the top-N most visually similar studies from a reference set of past cases. Each retrieved case has a radiologist-written report attached. Aggregate the findings across the top-N reports to surface the most likely pathology combinations and frequencies — output a summary that says "of the 100 most similar cases, 73 mentioned pleural effusion, 41 mentioned cardiomegaly, 12 mentioned both."
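The aggregation itself is simple counting. Here is a minimal sketch, assuming each retrieved report has already been reduced to a set of pathology labels; the `findings_per_report` input and the label strings are hypothetical stand-ins, not the system's actual data format.

```python
from collections import Counter
from itertools import combinations


def aggregate_findings(findings_per_report):
    """Count how many reports mention each finding and each pair of findings.

    findings_per_report: list of sets of label strings, one set per retrieved
    report (hypothetical input; in practice the labels come from an
    extraction step run over the report text).
    """
    single = Counter()
    pairs = Counter()
    for findings in findings_per_report:
        single.update(findings)
        # sorted() makes ("a", "b") and ("b", "a") land on the same key
        pairs.update(combinations(sorted(findings), 2))
    return single, pairs


# Toy example with 5 retrieved reports instead of 100
reports = [
    {"pleural effusion", "cardiomegaly"},
    {"pleural effusion"},
    {"cardiomegaly"},
    set(),
    {"pleural effusion", "cardiomegaly"},
]
single, pairs = aggregate_findings(reports)
print(single["pleural effusion"], single["cardiomegaly"])  # 3 3
print(pairs[("cardiomegaly", "pleural effusion")])         # 2
```

Note that the single-finding counts include reports that mention several findings, which matches the "73 mentioned X, 41 mentioned Y, 12 mentioned both" style of summary.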
This isn't a classifier. It doesn't say "this image has pneumothorax." It says "this image looks like these other images, and here's what was reported in those." The interpretation is left to a clinician; the system's job is to surface relevant prior context fast.
Two properties of this approach matter:
- It scales across pathologies without per-pathology training. A new pathology that wasn't in your training set still surfaces in retrieval if it appears in the reference reports. Adding a new classifier head requires labeled data; adding a new condition to a retrieval system requires only that the reference reports mention it.
- The output is grounded in real cases. Every flag in the output summary traces to specific reports written by specific radiologists. Nothing in the summary should go beyond what the underlying reports contain, and anything that does can be caught by checking the cited reports.
The cost is that you need a large reference set with reports attached. For modalities where you have that — chest X-ray, knee X-ray, mammography — this is a useful tool. For modalities where you don't, you can't bootstrap this approach.
Why MetaCLIP for the embeddings
The whole system hinges on the embedding model. Get this wrong and every downstream choice — index structure, retrieval top-K, synthesis prompt — is irrelevant because the retrieved neighbors aren't actually similar in any clinically meaningful way.
The candidate pool, mid-2025, looked roughly like this:
- OpenAI CLIP (ViT-L/14) — the classic. Trained on 400M image-text pairs scraped from the web. Reasonable but not specialized for medical imaging.
- OpenCLIP variants — community reimplementations of CLIP with more diverse training distributions. LAION-2B and LAION-5B variants are stronger than OpenAI's.
- MetaCLIP — Meta's reimplementation focusing on data curation methodology. The PE-Core series is the strongest; the bigG-14-448 variant is trained on ~5.4B image-text pairs at 448×448 resolution.
- MedCLIP / BiomedCLIP — domain-specific medical fine-tunes of CLIP. Smaller training sets but presumably better domain alignment.
We picked MetaCLIP PE-Core-bigG-14-448 for three reasons:
1. Cross-domain generalization quality. Chest X-rays are not natural images, but they're also not pathology slides or microscopy. The visual statistics are different from web photos but not radically so. A model with strong general visual representation tends to transfer better than a smaller domain-specific model that hasn't seen comparable scale.
2. Embedding dimensionality at 1024 is a useful tradeoff. Smaller embeddings (256-d, 512-d) lose discriminative power on subtle visual differences. Larger embeddings (2048-d+) hurt index storage and query latency without commensurate retrieval improvements. At 1024-d, the storage math works (~6.5 GB for 1.6M vectors) and the retrieval quality is competitive.
3. Input resolution at 448 is closer to what medical images need than 224. Subtle pathologies — hairline fractures, small effusions, micro-calcifications — get lost at 224×224 input resolution because most of the signal is sub-pixel after downsampling. 448 preserves more of it.
The domain-specific medical CLIPs were tempting on paper but consistently underperformed in our retrieval-quality evaluation. The hypothesis: the medical CLIPs are smaller models trained on less data; what they gain in domain match they lose in capacity. For a generic visual-similarity retrieval task across multiple modalities (CXR, KXR), the more general model wins.
The pipeline
Five stages, three of them one-time (indexing) and two of them per-query, plus an optional synthesis step.
Indexing (one-time per dataset version):
- For each reference image, run MetaCLIP's preprocessing (resize, normalize) and forward pass through the encoder. Output: a 1024-dimensional embedding vector per image.
- L2-normalize each vector. On unit-length vectors, ranking by Euclidean distance is identical to ranking by cosine similarity (the meaningful metric for CLIP embeddings), so a FAISS L2 index returns the cosine nearest neighbors.
- Build a FAISS index from the L2-normalized vectors. Persist to disk.
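The normalization step relies on a standard identity: for unit vectors a and b, ||a - b||² = 2 - 2·cos(a, b). The sketch below checks that numerically with NumPy. The random vectors are stand-ins for real embeddings, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for embeddings; real vectors would come from the image encoder.
ref = rng.standard_normal((1000, 1024))
query = rng.standard_normal(1024)


def l2_normalize(x):
    """Scale each vector (along the last axis) to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


ref_n = l2_normalize(ref)
query_n = l2_normalize(query)

cosine = ref_n @ query_n                        # cosine similarity per reference vector
sq_l2 = np.sum((ref_n - query_n) ** 2, axis=1)  # squared Euclidean distance

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = np.allclose(sq_l2, 2.0 - 2.0 * cosine, atol=1e-9)

# So the nearest neighbors by L2 are exactly the most similar by cosine
top_by_cosine = np.argsort(-cosine)[:10]
top_by_l2 = np.argsort(sq_l2)[:10]
same_ranking = np.array_equal(top_by_cosine, top_by_l2)
```

Skipping the normalization on either the reference vectors or the query breaks this equivalence, which is why both sides of the pipeline apply it.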
Query (per inference):
- Run the query image through the same MetaCLIP preprocessing and encoder. Output: a 1024-dim embedding.
- L2-normalize the query embedding. Use FAISS to find the top-K (we used K=100) nearest neighbors in the index. Each neighbor has an associated study_id, age, sex, and report text from the reference dataset.
Synthesis (per inference, optional):
- Pass the top-K retrieved reports to a language model (we used Gemini 2.5 Flash) with a structured prompt that extracts pathology mentions, medical devices, and anatomical references from each report, then aggregates frequencies across the K.
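The synthesis step is mostly prompt assembly plus defensive parsing of the reply. The sketch below shows one possible shape. The prompt wording, the JSON schema, and the function names are all hypothetical, and the model call is replaced by a canned reply so the example runs offline.

```python
import json
from collections import Counter

FIELDS = ("pathologies", "devices", "anatomy")


def build_prompt(reports):
    """Number each report so extracted items can be traced back to their source."""
    numbered = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(reports, start=1)
    )
    return (
        "For each numbered radiology report below, list the pathologies, medical "
        "devices, and anatomical references it mentions. Respond with a JSON array "
        'of objects with keys "report", "pathologies", "devices", "anatomy".\n\n'
        + numbered
    )


def tally(response_text, n_reports):
    """Parse the model's JSON reply and count mentions per field across reports."""
    counts = {field: Counter() for field in FIELDS}
    for item in json.loads(response_text):
        # Ignore entries that point at a report that was never sent.
        if not 1 <= item.get("report", 0) <= n_reports:
            continue
        for field in FIELDS:
            # set() so a finding repeated within one report counts once
            counts[field].update(set(item.get(field, [])))
    return counts


reports = [
    "Small left pleural effusion. Pacemaker in place.",
    "Cardiomegaly. Small pleural effusion.",
]
prompt = build_prompt(reports)

# Canned reply standing in for the language model's output.
canned_reply = json.dumps([
    {"report": 1, "pathologies": ["pleural effusion"],
     "devices": ["pacemaker"], "anatomy": ["left lung"]},
    {"report": 2, "pathologies": ["cardiomegaly", "pleural effusion"],
     "devices": [], "anatomy": ["heart"]},
    {"report": 7, "pathologies": ["pneumothorax"], "devices": [], "anatomy": []},
])
counts = tally(canned_reply, len(reports))
print(counts["pathologies"]["pleural effusion"])  # 2
```

Dropping entries whose report number was never sent is a cheap guard against the model attributing findings to reports that do not exist.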
The total system is roughly 300 lines of Python plus the FAISS index file and a CSV with study_id, image_path, age, sex, observation, conclusion for the reference set. Most of the code is the embedding extraction loop; the FAISS calls themselves are five lines.
FAISS index choice at 1.6M scale
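This is the technical decision that most affects production characteristics. Three reasonable options: IndexFlatL2 (exact, brute-force search), IndexIVFFlat (approximate, cluster-based), and IndexHNSWFlat (approximate, graph-based).
A few things worth saying beyond the headline comparison: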
Storage math. 1.6M vectors × 1024 dimensions × 4 bytes per float = 6.55 GB for the raw embeddings. Index overhead varies by structure. IndexFlatL2 is essentially just the raw vectors; IndexIVFFlat adds the cluster index (a few MB); IndexHNSWFlat adds the graph links, which cost roughly 2 × M × 4 bytes per vector at the base layer regardless of dimension. That is about 50% overhead on 128-d vectors at M=32, but only around 6% on 1024-d vectors. For a 6.5 GB base, that means Flat at ~6.5 GB, IVF at ~6.5-7 GB, and HNSW at roughly 7 GB with M=32.
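The raw-vector arithmetic is easy to reproduce and to rerun for other embedding sizes. A small sketch, assuming float32 storage; the 512-d and 2048-d rows are included only for comparison with the dimensionality tradeoff discussed earlier.

```python
def raw_vector_bytes(n_vectors, dim, bytes_per_value=4):
    """Bytes needed to store n_vectors dense vectors of the given dimension."""
    return n_vectors * dim * bytes_per_value


N_VECTORS = 1_600_000

raw = raw_vector_bytes(N_VECTORS, 1024)
raw_gb = raw / 1e9  # decimal gigabytes, the unit used in the text

for dim in (512, 1024, 2048):
    gb = raw_vector_bytes(N_VECTORS, dim) / 1e9
    print(f"{dim}-d: {gb:.2f} GB")
# 512-d: 3.28 GB
# 1024-d: 6.55 GB
# 2048-d: 13.11 GB
```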
Latency context. For our use case — retrieval that runs alongside a clinical reading workflow, not in the hot path — a 200-400 ms FAISS query is comfortably acceptable. The MetaCLIP embedding extraction takes longer than the FAISS lookup. We landed on IndexFlatL2 for simplicity. If your latency budget is sub-100ms, IVF is the right starting point. If sub-30ms, HNSW.
Recall-vs-speed tradeoff. Both IVF and HNSW are approximate-nearest-neighbor algorithms. They trade some recall (the returned set occasionally misses a true top-K neighbor and includes a slightly more distant one instead) for dramatically faster query times. For our application, the "best" match isn't necessarily clinically more useful than the 5th best: a top-100 query with 95% recall is functionally indistinguishable from a top-100 query with 100% recall. Approximate is fine.
One thing about index updates worth knowing. None of these FAISS index structures handles incremental updates gracefully. Adding new vectors to a Flat index is cheap but requires the index to grow; IVF requires re-clustering occasionally for quality; HNSW supports inserts but they're expensive. For a system that adds ~10K new images per month, a periodic rebuild (weekly or monthly) is simpler operationally than maintaining incremental updates. We rebuild on a schedule.
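Recall at K is straightforward to measure if you keep an exact index around as the reference. A minimal sketch follows; the `recall_at_k` helper is illustrative, and the two id arrays are hand-written stand-ins for what the exact and approximate searches would return.

```python
import numpy as np


def recall_at_k(exact_ids, approx_ids):
    """Mean fraction of the exact top-K that the approximate search also returned.

    Both arrays have shape (n_queries, K), the layout a FAISS search call
    returns for neighbor ids.
    """
    hits = [
        len(set(exact_row.tolist()) & set(approx_row.tolist())) / len(exact_row)
        for exact_row, approx_row in zip(exact_ids, approx_ids)
    ]
    return float(np.mean(hits))


# Hand-written ids: the first query misses one of its four true neighbors.
exact = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])
approx = np.array([[0, 1, 2, 9], [4, 5, 6, 7]])
print(recall_at_k(exact, approx))  # 0.875
```

This version ignores rank order within the top-K, which matches the argument above: for frequency aggregation over 100 neighbors, membership matters more than position.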
Retrieval quality — the unsolved problem
How do you know if the retrieval is "good"? There's no ground truth for "the 5th most similar chest X-ray to this query."
Three proxies that are useful:
Pathology agreement. For a held-out set of query images with known pathology labels, check whether the retrieved neighbors' reports mention the same pathologies. If a query has known cardiomegaly, do more than 50% of the top-20 retrieved cases mention cardiomegaly? This is a noisy signal because reports vary in completeness and terminology — but it's automatic and scales to thousands of queries.
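A pathology-agreement check can be as small as the sketch below. It assumes plain substring matching on the report text, which is a deliberate simplification: it will count "no cardiomegaly" as a mention, so a real evaluation would need negation handling. The function name and sample reports are hypothetical.

```python
def pathology_agreement(query_label, neighbor_reports, top_n=20):
    """Fraction of the top_n neighbor reports that mention the query's known label.

    Uses case-insensitive substring matching, so negated mentions
    ("no cardiomegaly") are counted too. Treat the result as a noisy signal.
    """
    top = neighbor_reports[:top_n]
    if not top:
        return 0.0
    label = query_label.lower()
    mentions = sum(label in report.lower() for report in top)
    return mentions / len(top)


# Toy neighbor reports, ordered from most to least similar.
neighbor_reports = [
    "Mild cardiomegaly.",
    "Cardiomegaly with small effusion.",
    "Clear lungs.",
    "Stable cardiomegaly.",
]
score = pathology_agreement("cardiomegaly", neighbor_reports)
print(score)  # 0.75
```

Averaged over a held-out query set, a score like this gives a single number to track across embedding-model or index changes.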
Manual radiologist review. A radiologist looks at a query and its retrieved top-K, ranks the retrieval quality on a 1-5 scale. Expensive but gives the most reliable signal. For us, a couple of dozen queries reviewed per evaluation cycle was enough to catch obvious failure modes.
Round-trip retrieval. For each image in the reference set, run a retrieval query using that image. If the image's own report doesn't appear in the top-N, something's broken. This catches large failures (wrong embeddings, broken normalization) but doesn't measure subtle quality differences.
We use a combination: pathology-agreement metrics for continuous evaluation, manual review for weekly sanity checks, round-trip retrieval as an integration test.
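The round-trip check works well as an integration test when the query vectors come from the live query path and the reference vectors from the stored index. The sketch below uses brute-force NumPy on small random data; on the real index the same comparison would use the FAISS search call. The simulated bug (an off-by-one between vectors and ids) is an assumption chosen for illustration.

```python
import numpy as np


def round_trip_failures(query_vectors, ref_vectors, top_n=1, batch=256):
    """Return row indices i where query i does not retrieve reference i in its top_n.

    Both inputs must be L2-normalized and aligned row by row.
    """
    failures = []
    for start in range(0, len(query_vectors), batch):
        block = query_vectors[start:start + batch]
        sims = block @ ref_vectors.T                 # cosine similarities
        top = np.argsort(-sims, axis=1)[:, :top_n]   # best matches first
        for offset, row in enumerate(top):
            expected = start + offset
            if expected not in row.tolist():
                failures.append(expected)
    return failures


rng = np.random.default_rng(0)
ref = rng.standard_normal((300, 64))
ref /= np.linalg.norm(ref, axis=1, keepdims=True)

healthy = round_trip_failures(ref, ref)                     # no failures
shifted = round_trip_failures(np.roll(ref, 1, axis=0), ref)  # every row fails
print(len(healthy), len(shifted))  # 0 300
```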
Failure modes nobody talks about
Five things that bit us. In rough order of how much they cost:
1. The retrieved reports contain things you don't want to feed to the LLM directly. Radiology reports include patient identifiers, study IDs, scan timestamps, comparison references to prior studies the LLM doesn't have access to. Feeding raw reports as RAG context risks leaking identifiers into the prompt, prompt injection (rare but real), terminology mismatches, and irrelevant context bloating the prompt. Strip aggressively before synthesis, ideally when the reference set is built; only the findings and impression sections should reach the LLM.
2. Subtle pathologies are systematically under-represented in nearest-neighbor results. A query with a subtle finding (small effusion, early consolidation) tends to retrieve neighbors that visually look similar but where the report explicitly says "no acute finding." The MetaCLIP embeddings care about visual similarity, not pathology similarity. The downstream aggregation then under-counts the rare condition. Mitigation: include a per-report confidence weighting that gives more weight to reports mentioning the query's likely findings (chicken-and-egg, but partial fixes help).
3. Imaging quality variation drowns out clinical signal. A query from a portable bedside CXR retrieves a set of other portable bedside CXRs whose visual statistics are dominated by exposure, positioning, and artifact characteristics — not by pathology. The retrieval is "technically correct, clinically useless." Mitigation: pre-filter the retrieved set by quality flags before LLM synthesis.
4. Cross-modality contamination at the embedding level. Even with separate CXR and KXR cohorts in the reference set, MetaCLIP embeddings for a CXR can occasionally return KXR neighbors that look similar (similar contrast pattern, similar field-of-view). The fix is partitioning the index by modality so cross-modality matches are impossible by construction.
5. Storage and embedding compute are non-trivial at scale. 1.6M images × ~2 seconds per MetaCLIP forward pass = 3.2M GPU-seconds, or roughly 900 GPU-hours, for the initial index build. That's not cheap. Plan it as a one-time-per-year activity rather than something you re-run casually.
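For the first failure mode in the list above, the stripping step can be a small allow-list parser. The sketch assumes reports use upper-case section headers such as `FINDINGS:` and `IMPRESSION:`; those header names and the sample report are hypothetical. If the reference CSV already keeps observation and conclusion in separate columns, this parsing step is unnecessary.

```python
import re

KEEP = ("FINDINGS", "IMPRESSION")
# A header is an all-caps label followed by a colon at the start of a line.
HEADER_RE = re.compile(r"^([A-Z][A-Z ]*):\s*(.*)$")


def keep_sections(report):
    """Return only the allow-listed sections of a report, keyed by header."""
    kept = {}
    current = None
    for raw_line in report.splitlines():
        line = raw_line.strip()
        match = HEADER_RE.match(line)
        if match:
            name, rest = match.groups()
            # Any header outside the allow-list switches collection off.
            current = name if name in KEEP else None
            if current:
                kept[current] = [rest]
        elif current:
            kept[current].append(line)
    return {name: " ".join(p for p in parts if p) for name, parts in kept.items()}


sample = """PATIENT ID: 12345
COMPARISON: Chest radiograph from prior study.
FINDINGS: Small left pleural effusion.
Heart size is normal.
IMPRESSION: Small left pleural effusion."""

clean = keep_sections(sample)
print(clean["FINDINGS"])  # Small left pleural effusion. Heart size is normal.
```

An allow-list fails closed: an unexpected section is dropped rather than passed through, which is the safer default when the downstream consumer is an LLM prompt.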
When this approach makes sense and when it doesn't
Image-based RAG with the retrieval-synthesis pattern is a good fit for:
- Decision-support workflows where surfacing similar prior cases adds value beyond classification
- Unsupervised exploration of large image cohorts without labeled training data
- Dataset curation — finding near-duplicates, outliers, or systematically misrepresented subgroups
- Educational use cases — showing radiology trainees similar cases with confirmed diagnoses
- Modalities where you have reference data with attached reports — CXR, KXR, mammography, abdominal radiographs
It's a poor fit for:
- Real-time triage — the latency profile (embedding + retrieval + LLM synthesis) is multi-second
- Fine-grained classification — a trained classifier on labeled data outperforms retrieval-based aggregation for tasks with clear class boundaries
- Novel pathologies — if the condition isn't well represented in the reference set, retrieval surfaces nothing useful
- High-stakes diagnostic decisions — the system surfaces context but doesn't make decisions; treating it as a classifier is a misuse
The deepest lesson from building this: the embedding model and the FAISS index are the easy parts. The hard parts are figuring out what the reference set should contain, ensuring the retrieved reports are usable downstream, and validating that the retrieval matches clinical intuition rather than just visual statistics.
Closing
Image-based RAG isn't a new pattern, but it's underused in medical imaging relative to its potential. Most published medical-imaging AI work focuses on classifiers and detectors. Retrieval systems sit in a different lane — they don't outperform specialized classifiers on benchmark accuracy, but they do something the classifiers can't: they ground every output in real cases with real reports, providing the kind of explanation that builds clinician trust faster than any saliency map.
For systems where decision support matters as much as classification, building retrieval-first and adding classifiers on top tends to land in a better place than the reverse. The patterns are stable, the libraries are mature, and the data requirements (reference images with reports) are typically already in place for any imaging-AI team that's been operating for more than a year.
Part of an ongoing series on production medical imaging. The companion year-one reflection covers the broader CXR-AI engineering context; the Gemini-vs-CNN clinical-QC note covers the LLM side of the synthesis step; the SQLite queue post is the same operational shape applied to inference. If you're building image-retrieval over a clinical corpus, reach out.