Evaluating Google's CXR Foundation Model: Why 135 Seconds per Image Killed Our Production Plans

Hands-on evaluation of Google's CXR Foundation vision transformer for chest X-ray embeddings — what worked, what didn't, and why the latency made it unusable for our production inference pipeline.

June 10, 2026 · Saianiruth M

Lab note. Setup, numbers, verdict. Short on prose, long on data. A longer narrative writeup of foundation-model comparisons across MedSigLIP, CheXFound, RAD-DINO, and others will follow.

TL;DR

We evaluated Google's CXR Foundation model (google/cxr-foundation on Hugging Face) — a pre-trained vision transformer for chest X-ray representations.
Single-image embedding generation: 135 seconds on a Colab A100. Batch-of-10: 980 seconds total (~98 s/image, only a 27% per-image win from batching).
Peak GPU memory: 32 GB. Sustained throughput: 0.85 images per minute (~51/hour).
The embeddings themselves are strong. The latency is two orders of magnitude away from production-acceptable for our use case.
Verdict: research-grade tool, not a production component. Lives in our experimentation pipeline; doesn't go anywhere near the inference path.

Figure: Three headline numbers from the evaluation: 135 seconds per single-image embedding, 0.85 images per minute sustained throughput, and 32 GB peak GPU memory for batch-of-10 inference. Caption: The three numbers that decided this. Measured on Google Colab with an A100 GPU using CheXpert images.

Setup

Component	Detail
Model	`google/cxr-foundation` — pre-trained ViT
Dataset	CheXpert (publicly available)
Environment	Google Colab with A100 GPU
Reference notebooks	Google Health's quick-start with Hugging Face + train-data-efficient-classifier
Goal	Measure single-image and batch latency for embedding extraction

The motivation was simple: Google's foundation model claims strong transfer-learning performance across chest X-ray tasks. If the embeddings were both high-quality and extractable at reasonable latency, we could replace several specialized classifiers in our pipeline with one embedding extractor + a few lightweight downstream heads. That's the value proposition of any foundation model — generality at acceptable cost.

We ran the standard Hugging Face inference path. No custom optimizations, no kernel surgery, no quantization. Stock setup, exactly as the documentation suggests.

Method

The embedding-generation code is straight off the Hugging Face quick-start, with timing instrumentation added:

from transformers import AutoImageProcessor, AutoModel
import torch
import time

processor = AutoImageProcessor.from_pretrained("google/cxr-foundation")
model = AutoModel.from_pretrained("google/cxr-foundation").cuda()
model.eval()

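def generate_embeddings_with_timing(images):
    # CUDA kernels launch asynchronously, so synchronize before starting and
    # before stopping the clock; otherwise the timer can stop before the GPU does.
    torch.cuda.synchronize()
    start = time.perf_counter()
    inputs = processor(images=images, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = model(**inputs)
        # Per-token hidden states, typically shaped (batch, tokens, hidden).
        embeddings = outputs.last_hidden_state
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"Processed {len(images)} images in {elapsed:.2f}s "
          f"({elapsed / len(images):.2f}s per image)")
    return embeddings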

We measured single-image runs and batch-of-10 runs across several iterations, recording wall-clock time, peak GPU memory, and per-image throughput. Numbers below are averages across the runs; variance was low.
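The repeated-run procedure described above can be sketched as a small harness. This is a minimal illustration, not the notebook we ran: `benchmark`, `embed_fn`, and the `sync` hook are hypothetical names, and the stand-in workload exists only so the sketch runs without a GPU.

```python
import statistics
import time


def benchmark(embed_fn, batch, repeats=3, sync=None):
    """Time embed_fn(batch) `repeats` times and summarize per-image latency.

    `sync` is an optional zero-argument callable (for example
    torch.cuda.synchronize) called before each clock read so that
    asynchronous GPU work is included in the measurement.
    """
    per_image = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        embed_fn(batch)
        if sync is not None:
            sync()
        elapsed = time.perf_counter() - start
        per_image.append(elapsed / len(batch))
    mean_s = statistics.mean(per_image)
    return {
        "batch_size": len(batch),
        "mean_s_per_image": mean_s,
        "stdev_s_per_image": statistics.stdev(per_image) if repeats > 1 else 0.0,
        "images_per_hour": 3600.0 / mean_s,
    }


def fake_embed(images):
    # Stand-in workload (assumption): sleeps instead of running a model.
    time.sleep(0.002 * len(images))
    return [[0.0] * 4 for _ in images]


print(benchmark(fake_embed, batch=list(range(10)), repeats=3))
```

With a real model, passing `sync=torch.cuda.synchronize` and wrapping each run with `torch.cuda.reset_peak_memory_stats()` and `torch.cuda.max_memory_allocated()` is one way to capture peak GPU memory alongside the timings.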

Results

Metric	Value
Single-image latency	135 s
Batch-of-10 total time	980 s
Batch-of-10 per-image	~98 s
Peak GPU memory	32 GB
Sustained throughput	0.85 images/min (~51/hour)
Per-image batching gain	~27% (135 s → 98 s)
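The derived rows in the table follow from the measured ones. A short sketch of that arithmetic, with the measured values hard-coded as constants and hypothetical helper names:

```python
SINGLE_IMAGE_S = 135.0    # measured single-image latency (seconds)
BATCH_TOTAL_S = 980.0     # measured wall-clock time for a batch of 10 (seconds)
BATCH_SIZE = 10
SUSTAINED_PER_MIN = 0.85  # reported sustained throughput (images per minute)


def per_image_seconds(total_s, batch_size):
    """Wall-clock seconds attributed to each image in a batch."""
    return total_s / batch_size


def batching_gain(single_s, batched_s):
    """Fractional reduction in per-image time from batching."""
    return 1.0 - batched_s / single_s


batched_s = per_image_seconds(BATCH_TOTAL_S, BATCH_SIZE)
gain = batching_gain(SINGLE_IMAGE_S, batched_s)
per_hour = SUSTAINED_PER_MIN * 60.0

print(f"batched per-image time: {batched_s:.0f} s")
print(f"batching gain: {gain:.1%}")
print(f"sustained throughput: {per_hour:.0f} images/hour")
```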

For context, a production CNN classifier on the same chest X-ray task (DenseNet-121, ResNet50, EfficientNet-B3) processes one image in 5–10 seconds on 4-vCPU CPU-only deployments, and in under one second on equivalent GPU hardware. At 135 seconds per image on an A100, the Google CXR Foundation model is roughly 13–27× slower than those CNNs running on CPU alone, and more than 100× slower on like-for-like GPU hardware. Neither side of that comparison has received the standard production engineering work (TensorRT export, ONNX runtime, kernel fusion), so the gap is between stock configurations.

Figure: Horizontal bar chart comparing images-per-hour throughput: Google CXR Foundation on A100 at 51/hour, typical CXR CNN on 4-vCPU CPU at ~600/hour, production target at 1,000+/hour, typical CXR CNN on A100 at ~3,600/hour. Caption: Throughput at our deployment scale. The Foundation model is roughly 20× off our production target and 70× slower than a typical CXR CNN on the same A100-class hardware.

The throughput gap is the actual story. 51 images/hour vs our production-target 1,000+/hour means this model is roughly 20× too slow. Even an A100-GPU-optimized CNN sits at ~3,600/hour — 70× faster than the foundation model on the same hardware class. The compute pattern of a large ViT applied to single radiographs is the bottleneck, not the hardware.
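The ratios quoted above come straight from the throughput figures in the chart. A minimal sketch, with the dictionary keys chosen here as labels only:

```python
# Images per hour, as quoted in the chart and text above.
THROUGHPUT_PER_HOUR = {
    "CXR Foundation (A100)": 51,
    "CXR CNN (4-vCPU CPU)": 600,
    "Production target": 1000,
    "CXR CNN (A100)": 3600,
}


def times_faster(name, baseline="CXR Foundation (A100)"):
    """How many times more images per hour `name` handles than `baseline`."""
    return THROUGHPUT_PER_HOUR[name] / THROUGHPUT_PER_HOUR[baseline]


for name in THROUGHPUT_PER_HOUR:
    print(f"{name}: {times_faster(name):.1f}x the foundation model's throughput")
```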

What worked

The embeddings are good. A linear classifier trained on these features handles standard chest X-ray classification tasks at competitive accuracy. The transfer-learning story Google tells about this model is real — if you can afford the inference cost, the downstream quality is there.
Hugging Face integration is clean. Standard AutoModel.from_pretrained() works. No custom dependencies, no opaque setup. The quick-start notebook runs end-to-end without modification. That's a meaningfully better experience than a lot of medical-imaging foundation models, which often require their own forks of Transformers or specific image-preprocessing libraries.
Multimodal capability is present. The model can ingest text + image queries for visual question answering. We didn't use this for our primary embedding task, but it's a real capability that some downstream applications would benefit from.
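For readers unfamiliar with the linear-classifier setup mentioned in the first point, here is a dependency-free sketch of the idea: pool per-token embeddings into one vector per image, then fit a logistic-regression head on top. The mean pooling, the hyperparameters, and the toy two-dimensional "embeddings" are all assumptions made for illustration; this is not the classifier or the data from our evaluation.

```python
import math


def mean_pool(token_embeddings):
    """Average per-token vectors into one fixed-size image embedding."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit binary logistic regression with full-batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted prob minus label
            for d in range(dim):
                grad_w[d] += err * x[d]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 if the linear score is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Toy stand-ins for per-token embeddings (two tokens per "image").
toy_tokens = [
    [[1.0, 0.8], [0.6, 1.2]],
    [[0.9, 1.1], [1.1, 0.9]],
    [[-1.0, -0.8], [-0.6, -1.2]],
    [[-0.9, -1.1], [-1.1, -0.9]],
]
toy_labels = [1, 1, 0, 0]

features = [mean_pool(tokens) for tokens in toy_tokens]
w, b = train_linear_probe(features, toy_labels)
predictions = [predict(w, b, x) for x in features]
print(predictions)
```

The point of the pattern is that only the small head is trained; the expensive embedding extraction happens once per image.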

What didn't work

Single-image inference is unshippable at this latency. 135 seconds per image means that for a clinical workflow expecting sub-second response on each study, this model is roughly 100× off. No amount of UX work hides a two-orders-of-magnitude latency gap.
Batching helps less than you'd hope. Going from 1 to 10 images per call dropped per-image time from 135 s to 98 s, a 27% improvement rather than the near-linear amortization you'd see if fixed per-call overhead dominated. The forward pass isn't dominated by batch overhead; it's dominated by the model's compute footprint per image. Larger batches will not save you.
The memory ceiling is real. 32 GB peak for a batch of 10 puts this firmly in A100/H100 territory. Most production GPU options (T4, 16 GB V100, A10, L4) have 16–24 GB, and even the 32 GB V100 variant leaves no headroom. Deployment off A100 would require quantization or model surgery, which is research-level work, not a configuration change.
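The memory point can be checked against published card capacities. A small sketch, assuming a 10% headroom margin for allocator fragmentation and other processes (the margin is our assumption here, not a measured figure):

```python
PEAK_GB = 32  # measured peak GPU memory for a batch of 10

# Nominal memory capacities in GB for common data-center GPUs.
GPU_MEMORY_GB = {
    "T4": 16,
    "V100 (16 GB)": 16,
    "V100 (32 GB)": 32,
    "A10": 24,
    "L4": 24,
    "A100 (40 GB)": 40,
    "A100 (80 GB)": 80,
    "H100 (80 GB)": 80,
}


def fits(peak_gb, capacity_gb, headroom=0.10):
    """True if the peak fits while leaving `headroom` of the card free."""
    return peak_gb <= capacity_gb * (1.0 - headroom)


fitting = [name for name, gb in GPU_MEMORY_GB.items() if fits(PEAK_GB, gb)]
print(fitting)
```

Smaller batches would lower the peak, but at the cost of the already modest batching gain.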

Why this matters for production

Concrete arithmetic at our deployment scale:

Our production workload is roughly 1,000 inference requests per hour.
At 0.85 images/minute, that's ~20 hours of compute to clear one hour of demand. Impossible.
Hypothetically running 24 parallel A100 instances would close the gap — at roughly $3–4/hour per A100, that's $72–96 per hour of throughput. A trained CNN classifier on a single CPU host does the same work for cents.
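Optimization paths (ONNX + TensorRT, INT8 quantization, kernel fusion) might realistically bring per-image time down to 12–15 seconds. That is still roughly 3–4× over the 3.6 s/image that 1,000 requests per hour allows on a single GPU, and more than 10× over a sub-second interactive target.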
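The arithmetic in this list can be reproduced in a few lines. A sketch using only the figures quoted above; the helper names are hypothetical and the A100 price range is the rough one from the text, not a quote from any provider:

```python
import math

DEMAND_PER_HOUR = 1000          # production workload from the text
MODEL_PER_HOUR = 0.85 * 60      # reported sustained throughput, ~51 images/hour
A100_USD_PER_HOUR = (3.0, 4.0)  # rough hourly price range quoted in the text


def instances_for_rate(demand_per_hour, images_per_hour):
    """Minimum parallel instances needed to keep up with demand."""
    return math.ceil(demand_per_hour / images_per_hour)


def instances_for_latency(demand_per_hour, seconds_per_image):
    """Same question, starting from a per-image latency instead of a rate."""
    return math.ceil(demand_per_hour * seconds_per_image / 3600.0)


def hourly_cost(n_instances, usd_range):
    """Low and high hourly cost for n_instances at the given price range."""
    return tuple(n_instances * usd for usd in usd_range)


backlog_hours = DEMAND_PER_HOUR / MODEL_PER_HOUR
minimum_instances = instances_for_rate(DEMAND_PER_HOUR, MODEL_PER_HOUR)
low, high = hourly_cost(24, A100_USD_PER_HOUR)

print(f"compute hours per hour of demand: {backlog_hours:.1f}")
print(f"minimum instances: {minimum_instances}")
print(f"24 instances: ${low:.0f}-${high:.0f}/hour, "
      f"${low / DEMAND_PER_HOUR:.3f}-${high / DEMAND_PER_HOUR:.3f} per image")
print("instances at 12 s and 15 s per image:",
      instances_for_latency(DEMAND_PER_HOUR, 12),
      instances_for_latency(DEMAND_PER_HOUR, 15))
```

Twenty instances is the bare minimum at the measured rate; the 24 used in the text leaves some headroom for uneven load.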

The cost-per-prediction math is what kills it for us. The Foundation model is delivering generality at a 100× inference-cost multiplier relative to specialized CNNs that already work. Generality is valuable; not at that multiplier, not for this workload.

Verdict

Use case	Recommendation
Real-time clinical inference	No — latency too high
High-throughput batch processing	No — throughput too low
Cost-sensitive production	No — compute multiplier too steep
Research feature extraction	Yes — embeddings are solid
Pre-computing embeddings for retrieval index	Maybe — one-time cost, evaluate against retrieval-task latency requirements
Academic / offline analysis	Yes — strong features, integration is clean

The model sits in our research toolkit and stays there. We're not putting it in the inference path.

Next steps

Try ONNX export + TensorRT optimization. If the latency drops by 10× this becomes a different conversation. We haven't measured this yet; the priority has been other model evaluations.
Benchmark MedSigLIP and CheXFound on the same setup. Both target similar use cases with different architectural choices. The interesting question is whether any current foundation model hits production latency, or whether the entire category is in the same boat.
Evaluate offline embedding precomputation for our retrieval index. One-time computation of embeddings for 1.6M+ images at 51 images/hour takes roughly 31,000 A100-hours, or about 3.6 years on one machine. Probably not viable even with parallelism, but worth a quick feasibility pass.
Compare embedding quality against in-house ResNet50 features on the same downstream classification tasks. If the foundation-model embeddings don't materially outperform features from a model we already run at 1 s/image, the latency case for using them at all weakens further.
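For the precomputation item, a quick feasibility sketch using the measured throughput. The 24-machine case reuses the hypothetical fleet size from the production arithmetic, and the price range is the same rough $3–4 per A100-hour; both are illustrative rather than a plan.

```python
N_IMAGES = 1_600_000      # lower bound on the retrieval index size
IMAGES_PER_HOUR = 51.0    # measured sustained throughput on one A100
HOURS_PER_YEAR = 24 * 365
A100_USD_PER_HOUR = (3.0, 4.0)  # rough price range quoted earlier


def gpu_hours(n_images, images_per_hour):
    """Total A100-hours needed, independent of how the work is split."""
    return n_images / images_per_hour


def wall_clock_days(n_images, images_per_hour, machines):
    """Elapsed days when the work is split evenly across `machines`."""
    return gpu_hours(n_images, images_per_hour) / machines / 24.0


total_hours = gpu_hours(N_IMAGES, IMAGES_PER_HOUR)
years_single = total_hours / HOURS_PER_YEAR
days_24 = wall_clock_days(N_IMAGES, IMAGES_PER_HOUR, machines=24)
cost_low, cost_high = (total_hours * usd for usd in A100_USD_PER_HOUR)

print(f"A100-hours: {total_hours:,.0f}")
print(f"one machine: {years_single:.2f} years")
print(f"24 machines: {days_24:.1f} days")
print(f"compute cost: ${cost_low:,.0f}-${cost_high:,.0f}")
```

Parallelism shortens the calendar time but not the total A100-hours, so the compute bill stays the same however the work is split.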

The general lesson, which a longer follow-up post will draw out across several foundation-model evaluations: publication-grade benchmarks rarely report inference latency at the granularity production engineers need. Most "state-of-the-art on chest X-rays" claims would benefit from a "per-image latency" column.

Part of an ongoing series on production medical imaging. The companion year-one reflection is here; the Gemini-vs-CNN clinical-QC lab note is here. If something here connects to work you're shipping, reach out.