Evaluating Google's CXR Foundation Model: Why 135 Seconds per Image Killed Our Production Plans
Hands-on evaluation of Google's CXR Foundation vision transformer for chest X-ray embeddings — what worked, what didn't, and why the latency made it unusable for our production inference pipeline.
Lab note. Setup, numbers, verdict. Short on prose, long on data. A longer narrative writeup of foundation-model comparisons across MedSigLIP, CheXFound, RAD-DINO, and others will follow.
TL;DR
- We evaluated Google's CXR Foundation model (`google/cxr-foundation` on Hugging Face), a pre-trained vision transformer for chest X-ray representations.
- Single-image embedding generation: 135 seconds on a Colab A100. Batch-of-10: 980 seconds total (~98 s/image, only a 27% per-image win from batching).
- Peak GPU memory: 32 GB. Sustained throughput: 0.85 images per minute (~51/hour).
- The embeddings themselves are strong. The latency is two orders of magnitude away from production-acceptable for our use case.
- Verdict: research-grade tool, not a production component. Lives in our experimentation pipeline; doesn't go anywhere near the inference path.
Setup
| Component | Detail |
|---|---|
| Model | google/cxr-foundation — pre-trained ViT |
| Dataset | CheXpert (publicly available) |
| Environment | Google Colab with A100 GPU |
| Reference notebooks | Google Health's quick-start with Hugging Face + train-data-efficient-classifier |
| Goal | Measure single-image and batch latency for embedding extraction |
The motivation was simple: Google's foundation model claims strong transfer-learning performance across chest X-ray tasks. If the embeddings were both high-quality and extractable at reasonable latency, we could replace several specialized classifiers in our pipeline with one embedding extractor + a few lightweight downstream heads. That's the value proposition of any foundation model — generality at acceptable cost.
We ran the standard Hugging Face inference path. No custom optimizations, no kernel surgery, no quantization. Stock setup, exactly as the documentation suggests.
Method
The embedding-generation code is straight off the Hugging Face quick-start, with timing instrumentation added:
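```python
import time

import torch
from transformers import AutoImageProcessor, AutoModel

# Load the processor and model once, outside the timed region.
processor = AutoImageProcessor.from_pretrained("google/cxr-foundation")
model = AutoModel.from_pretrained("google/cxr-foundation").cuda()
model.eval()


def generate_embeddings_with_timing(images):
    """Embed a list of images and print total and per-image wall-clock time."""
    start = time.perf_counter()
    # Preprocessing and host-to-device transfer are part of the timed region.
    inputs = processor(images=images, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state
    # CUDA kernels launch asynchronously; wait for them before reading the clock.
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"Processed {len(images)} images in {elapsed:.2f}s "
          f"({elapsed / len(images):.2f}s per image)")
    return embeddings
```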
We measured single-image runs and batch-of-10 runs across several iterations, recording wall-clock time, peak GPU memory, and per-image throughput. Numbers below are averages across the runs; variance was low.
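For anyone repeating the measurement, a minimal sketch of a repeat-and-average harness follows. It is not the exact harness we ran: `benchmark`, `run_batch`, and `sync` are hypothetical names, and a `time.sleep` stand-in replaces the model call so the sketch runs without a GPU.

```python
import statistics
import time
from typing import Callable, Optional, Sequence


def benchmark(
    run_batch: Callable[[Sequence], object],
    batches: Sequence[Sequence],
    warmup: int = 1,
    sync: Optional[Callable[[], None]] = None,
) -> dict:
    """Time run_batch over each batch and summarise per-image latency.

    run_batch and sync are placeholders: on a GPU, run_batch would wrap the
    model call and sync would be a device barrier.
    """
    if not batches:
        raise ValueError("need at least one batch")
    # Warm-up runs absorb one-off costs such as lazy initialisation.
    for batch in batches[:warmup]:
        run_batch(batch)
        if sync:
            sync()

    per_image = []
    for batch in batches:
        start = time.perf_counter()
        run_batch(batch)
        if sync:
            sync()  # wait for queued device work before stopping the clock
        elapsed = time.perf_counter() - start
        per_image.append(elapsed / len(batch))

    mean = statistics.mean(per_image)
    spread = statistics.stdev(per_image) if len(per_image) > 1 else 0.0
    return {
        "runs": len(per_image),
        "mean_s_per_image": mean,
        "stdev_s_per_image": spread,
        "images_per_hour": 3600.0 / mean,
    }


# Stand-in workload: three batches of ten dummy "images".
fake_batches = [[0] * 10 for _ in range(3)]
print(benchmark(lambda batch: time.sleep(0.001 * len(batch)), fake_batches))
```

On a CUDA device, passing `sync=torch.cuda.synchronize` and reading `torch.cuda.max_memory_allocated()` after a call to `torch.cuda.reset_peak_memory_stats()` would add the peak-memory figure to the same run.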
Results
| Metric | Value |
|---|---|
| Single-image latency | 135 s |
| Batch-of-10 total time | 980 s |
| Batch-of-10 per-image | ~98 s |
| Peak GPU memory | 32 GB |
| Sustained throughput | 0.85 images/min (~51/hour) |
| Per-image batching gain | ~27% (135 s → 98 s) |
For context, a production CNN classifier on the same chest X-ray task (DenseNet-121, ResNet50, EfficientNet-B3) processes one image in 5–10 seconds on 4-vCPU CPU-only deployments, and in under one second on equivalent GPU hardware. At 135 s per image on an A100, the Google CXR Foundation model is roughly 14–27× slower than those CPU-only figures and more than 100× slower than the GPU figure. That is before any of the standard production engineering work (TensorRT export, ONNX Runtime, kernel fusion), which the smaller models haven't received either.
The throughput gap is the actual story. 51 images/hour vs our production-target 1,000+/hour means this model is roughly 20× too slow. Even an A100-GPU-optimized CNN sits at ~3,600/hour — 70× faster than the foundation model on the same hardware class. The compute pattern of a large ViT applied to single radiographs is the bottleneck, not the hardware.
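The ratios quoted in this section can be re-derived from the Results table. The sketch below does only that arithmetic; the constant and function names are ours for illustration, and every input is a figure already reported above.

```python
# Figures copied from the Results section and the surrounding text.
SINGLE_S = 135.0            # single-image latency, seconds
BATCH10_TOTAL_S = 980.0     # batch-of-10 total time, seconds
BATCH_SIZE = 10
SUSTAINED_PER_MIN = 0.85    # reported sustained throughput, images/minute
TARGET_PER_HOUR = 1000.0    # production target
GPU_CNN_PER_HOUR = 3600.0   # A100-optimized CNN, about one image per second
CPU_CNN_S = (5.0, 10.0)     # CPU-only CNN latency range, seconds


def derived_metrics() -> dict:
    """Recompute the comparison ratios from the reported measurements."""
    batch_per_image = BATCH10_TOTAL_S / BATCH_SIZE
    sustained_per_hour = SUSTAINED_PER_MIN * 60
    return {
        "batch_per_image_s": batch_per_image,
        "batching_gain_pct": 100 * (SINGLE_S - batch_per_image) / SINGLE_S,
        # (best case, worst case) slowdown against the CPU-only CNN range
        "slowdown_vs_cpu_cnn": (SINGLE_S / CPU_CNN_S[1], SINGLE_S / CPU_CNN_S[0]),
        "sustained_per_hour": sustained_per_hour,
        "gap_vs_target": TARGET_PER_HOUR / sustained_per_hour,
        "gap_vs_gpu_cnn": GPU_CNN_PER_HOUR / sustained_per_hour,
    }


for name, value in derived_metrics().items():
    print(f"{name}: {value}")
```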
What worked
- The embeddings are good. A linear classifier trained on these features handles standard chest X-ray classification tasks at competitive accuracy. The transfer-learning story Google tells about this model is real — if you can afford the inference cost, the downstream quality is there.
- Hugging Face integration is clean. Standard `AutoModel.from_pretrained()` works. No custom dependencies, no opaque setup. The quick-start notebook runs end-to-end without modification. That's a meaningfully better experience than a lot of medical-imaging foundation models, which often require their own forks of Transformers or specific image-preprocessing libraries.
- Multimodal capability is present. The model can ingest text + image queries for visual question answering. We didn't use this for our primary embedding task, but it's a real capability that some downstream applications would benefit from.
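To make the "linear classifier on these features" idea concrete, here is a minimal linear-probe sketch. It is not our training code: the feature vectors are synthetic Gaussian clusters standing in for pooled embeddings (for example, a mean over the token dimension of `last_hidden_state`, which is an assumption about how a fixed-length vector would be obtained), all function names are hypothetical, and the probe is plain-Python logistic regression so the example has no dependencies.

```python
import math
import random


def train_linear_probe(features, labels, epochs=20, lr=0.1, seed=0):
    """Fit logistic regression with plain SGD on frozen feature vectors."""
    rng = random.Random(seed)
    weights = [0.0] * len(features[0])
    bias = 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            z = bias + sum(w * x for w, x in zip(weights, features[i]))
            z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
            pred = 1.0 / (1.0 + math.exp(-z))
            err = pred - labels[i]  # gradient of the log loss w.r.t. z
            weights = [w - lr * err * x for w, x in zip(weights, features[i])]
            bias -= lr * err
    return weights, bias


def predict(weights, bias, vector):
    """Return the predicted class (0 or 1) for one feature vector."""
    return 1 if bias + sum(w * x for w, x in zip(weights, vector)) > 0 else 0


def accuracy(weights, bias, features, labels):
    hits = sum(predict(weights, bias, f) == y for f, y in zip(features, labels))
    return hits / len(labels)


def make_synthetic_embeddings(n_per_class=100, dim=8, seed=0):
    """Two Gaussian clusters standing in for pooled image embeddings."""
    rng = random.Random(seed)
    features, labels = [], []
    for label, centre in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            features.append([rng.gauss(centre, 0.5) for _ in range(dim)])
            labels.append(label)
    return features, labels


train_x, train_y = make_synthetic_embeddings(seed=0)
test_x, test_y = make_synthetic_embeddings(seed=1)
w, b = train_linear_probe(train_x, train_y)
print(f"held-out accuracy on synthetic data: {accuracy(w, b, test_x, test_y):.3f}")
```

The point of the design is that the expensive model runs once per image and stays frozen; only the small head is trained, so each new downstream task costs seconds of CPU time once the embeddings exist.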
What didn't work
- Single-image inference is unshippable at this latency. 135 seconds per image means that for a clinical workflow expecting sub-second response on each study, this model is roughly 100× off. No amount of UX work hides a two-orders-of-magnitude latency gap.
- Batching helps less than you'd hope. Going from 1 to 10 images per call dropped per-image time from 135 s to 98 s — a 27% improvement, not the near-linear amortization you'd see with a well-behaved model. The forward pass isn't dominated by batch-overhead; it's dominated by the model's compute footprint per image. Larger batches will not save you.
- The memory ceiling is real. 32 GB peak for a batch of 10 puts this firmly in A100/H100 territory. Most production GPU options (T4, A10, L4, and the 16 GB V100) have 16–24 GB, and even the 32 GB V100 would leave no headroom. Deployment off A100 would require quantization or model surgery: research-level work, not a configuration change.
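The memory point can be checked with a small lookup. This is a sketch: the card capacities are the publicly listed per-card sizes, while the 10% headroom margin and the helper name `cards_that_fit` are our assumptions for illustration.

```python
# Per-card memory in GB for common data-centre GPUs (publicly listed sizes).
GPU_MEMORY_GB = {
    "T4": 16,
    "V100-16GB": 16,
    "V100-32GB": 32,
    "A10": 24,
    "L4": 24,
    "A100-40GB": 40,
    "A100-80GB": 80,
    "H100-80GB": 80,
}


def cards_that_fit(peak_gb, headroom=0.10):
    """Return cards whose memory covers peak_gb plus a safety margin.

    headroom is an assumed fractional margin for fragmentation and
    framework overhead, not a measured requirement.
    """
    required = peak_gb * (1 + headroom)
    return sorted(name for name, gb in GPU_MEMORY_GB.items() if gb >= required)


# 32 GB is the peak reported above for a batch of 10.
print(cards_that_fit(32))
print(cards_that_fit(32, headroom=0.0))
```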
Why this matters for production
Concrete arithmetic at our deployment scale:
- Our production workload is roughly 1,000 inference requests per hour.
- At 0.85 images/minute, that's ~20 hours of compute to clear one hour of demand. Impossible.
- Hypothetically running 24 parallel A100 instances would close the gap — at roughly $3–4/hour per A100, that's $72–96 per hour of throughput. A trained CNN classifier on a single CPU host does the same work for cents.
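- Optimization paths (ONNX + TensorRT, INT8 quantization, kernel fusion) might realistically bring per-image time down to 12–15 seconds. That is an estimate, not a measurement. It would still be roughly 3–4× slower than the 3.6 s per image that 1,000 requests/hour allows on a single instance, and more than 10× off a sub-second clinical response.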
The cost-per-prediction math is what kills it for us. The foundation model is delivering generality at a 100× inference-cost multiplier relative to specialized CNNs that already work. Generality is valuable; not at that multiplier, not for this workload.
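The fleet-size and cost arithmetic above can be written down as a short sketch. The function name and the `headroom` parameter are illustrative assumptions: with no headroom the minimum is 20 instances, and a 20% margin is one way to arrive at the 24 instances used in the estimate.

```python
import math


def fleet_plan(demand_per_hour, per_instance_per_hour, price_per_instance_hour,
               headroom=0.0):
    """Size a fleet for a steady request rate and price it per hour.

    headroom is an assumed fractional capacity margin (0.2 means 20% spare).
    Assumes throughput scales linearly with the number of instances.
    """
    needed = demand_per_hour / per_instance_per_hour * (1 + headroom)
    instances = math.ceil(needed)
    hourly_cost = instances * price_per_instance_hour
    return {
        "instances": instances,
        "hourly_cost": hourly_cost,
        "cost_per_image": hourly_cost / demand_per_hour,
    }


# 1,000 requests/hour against ~51 images/hour per A100, at $3 and $4 per hour.
for price in (3.0, 4.0):
    print(price, fleet_plan(1000, 51, price, headroom=0.2))
```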
Verdict
| Use case | Recommendation |
|---|---|
| Real-time clinical inference | No — latency too high |
| High-throughput batch processing | No — throughput too low |
| Cost-sensitive production | No — compute multiplier too steep |
| Research feature extraction | Yes — embeddings are solid |
| Pre-computing embeddings for retrieval index | Maybe — one-time cost, evaluate against retrieval-task latency requirements |
| Academic / offline analysis | Yes — strong features, integration is clean |
The model sits in our research toolkit and stays there. We're not putting it in the inference path.
Next steps
- Try ONNX export + TensorRT optimization. If the latency drops by 10× this becomes a different conversation. We haven't measured this yet; the priority has been other model evaluations.
- Benchmark MedSigLIP and CheXFound on the same setup. Both target similar use cases with different architectural choices. The interesting question is whether any current foundation model hits production latency, or whether the entire category is in the same boat.
- Evaluate offline embedding precomputation for our retrieval index. One-time computation of embeddings for 1.6M+ images at 0.85 images/minute takes roughly 3.6 years of A100 time on one machine. Probably not viable even with parallelism, but worth a quick feasibility pass.
- Compare embedding quality against in-house ResNet50 features on the same downstream classification tasks. If the foundation-model embeddings don't materially outperform features from a model we already run at 1 s/image, the latency case for using them at all weakens further.
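As a first feasibility pass on the precomputation item, the wall-clock estimate is simple division. The sketch below assumes throughput scales perfectly with GPU count, which is optimistic, and uses the reported 0.85 images/minute; the function name is hypothetical.

```python
def precompute_days(n_images, images_per_minute, n_gpus=1):
    """Wall-clock days to embed n_images, assuming perfect linear scaling."""
    minutes = n_images / (images_per_minute * n_gpus)
    return minutes / (60 * 24)


single = precompute_days(1_600_000, 0.85)
print(f"1 GPU:   {single:.0f} days (~{single / 365:.1f} years)")
print(f"24 GPUs: {precompute_days(1_600_000, 0.85, n_gpus=24):.0f} days")
```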
The general lesson, which a longer follow-up post will draw out across several foundation-model evaluations: publication-grade benchmarks rarely report inference latency at the granularity production engineers need. Most "state-of-the-art on chest X-rays" claims would benefit from a "per-image latency" column.
Part of an ongoing series on production medical imaging. The companion year-one reflection is here; the Gemini-vs-CNN clinical-QC lab note is here. If something here connects to work you're shipping, reach out.