CheXagent-8b Zero-Shot: 70-75% F1 on Pleural Effusion Detection
Stanford AIMI's 8B chest X-ray vision-language model, evaluated zero-shot on pleural effusion detection. Strong prompt-only baseline, insufficient for clinical deployment without fine-tuning — and a useful data point on what 'foundation model for medical imaging' actually means right now.
Lab note. Setup, numbers, verdict. Third entry in the foundation-model evaluation series, after the Google CXR Foundation latency post and the Gemini-vs-CNN clinical-QC note.
TL;DR
- CheXagent-8b is Stanford AIMI's 8-billion-parameter vision-language foundation model, pre-trained on large-scale chest X-ray datasets with multimodal objectives.
- Zero-shot binary classification on pleural effusion detection (CheXpert): precision 70-75%, recall 70-75%, F1 ≈ 70-75%.
- The model handles natural-language disease queries cleanly — no task-specific fine-tuning required — and works across multiple pathologies the user can ask about by name.
- 70-75% is a strong zero-shot floor and a weak clinical-deployment ceiling. Useful as a rapid-prototyping baseline; insufficient as the primary signal for high-stakes triage.
- The interesting question isn't whether CheXagent is "good" — it's whether the gap from 70-75% to deployable (say 90%+ on both axes) closes with fine-tuning, prompt engineering, or ensembling. Our follow-up work suggests fine-tuning is the cheapest route.
Setup
| Component | Detail |
|---|---|
| Model | StanfordAIMI/CheXagent-8b |
| Architecture | Multimodal transformer (8B params), vision-language foundation model |
| Pre-training | Large-scale chest X-ray datasets with multimodal objectives |
| Reference paper | arXiv 2401.12208 |
| Evaluation dataset | CheXpert (publicly available) |
| Task | Binary classification — pleural effusion present / absent |
| Inference mode | Zero-shot via natural-language query |
| Hardware | NVIDIA GPU with 16-32 GB memory, float16 precision |
The motivation was practical: foundation-model VLMs for medical imaging have grown quickly in the last two years (Google CXR Foundation, MedSigLIP, CheXFound, PaliGemma, CheXagent, RAD-DINO, MedGemma). The marketing on each one promises "zero-shot clinical performance." We wanted to know what that actually means on a specific binary classification task we cared about, before committing to fine-tuning any of them.
CheXagent's pitch is particularly clean: a chest-X-ray-specialized VLM that you can prompt with natural language ("is there evidence of pneumothorax?") and get back a structured response. The Stanford AIMI paper reports strong performance across 8 chest X-ray interpretation tasks. Whether that performance holds up on one specific binary task, and at what threshold, is the question this evaluation set out to answer; production-distribution data is a separate question that this note does not settle.
Method
The model loads via standard Hugging Face APIs:
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

device = "cuda"
dtype = torch.float16  # half precision, matching the hardware row in the setup table

# trust_remote_code=True is needed because the model ships its own modeling code
processor = AutoProcessor.from_pretrained(
    "StanfordAIMI/CheXagent-8b", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "StanfordAIMI/CheXagent-8b",
    torch_dtype=dtype,
    trust_remote_code=True,
).to(device)
generation_config = GenerationConfig.from_pretrained(
    "StanfordAIMI/CheXagent-8b"
)
```
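Binary classification is then a single method call. Note that `chexagent` is not defined in the loading snippet above, which names its model object `model`; it presumably refers to the wrapper instance that exposes `binary_disease_classification`:
```python
# Single-image binary disease classification.
# `image_path` is the path to one chest X-ray image on disk.
result = chexagent.binary_disease_classification(
    [image_path],        # list of image paths
    "Pleural Effusion",  # target pathology, by name
)
```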
The inference path internally constructs the right prompt template, passes the image through the vision encoder, runs the text decoder against the prompt, and parses out a structured yes/no with confidence. The user never touches the prompt directly; that's intentional and useful — it removes prompt-engineering as a confounding variable in the evaluation.
We ran this against the CheXpert validation set's labeled pleural-effusion subset, computing precision, recall, and F1 against the ground-truth labels.
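The scoring step can be sketched in a few lines. This is a minimal illustration, not the evaluation script used for this note: `binary_metrics` is a hypothetical helper, it assumes predictions have already been mapped to 0/1, and the toy labels are made up.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, and F1 for binary labels (1 = effusion present)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only; not data from this evaluation.
toy_truth = [1, 1, 1, 1, 0, 0, 0, 0]
toy_preds = [1, 1, 1, 0, 1, 0, 0, 0]
toy_scores = binary_metrics(toy_truth, toy_preds)
print(toy_scores)
```

On a real run, `y_pred` would come from mapping each model response to 0 or 1, and `y_true` from the CheXpert labels.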
Results
| Metric | Value |
|---|---|
| Precision | 70-75% |
| Recall | 70-75% |
| F1 | ≈ 70-75% |
| Inference time per image | A few seconds on A100 (acceptable for batch eval; not benchmarked for real-time) |
| Pathology coverage tested | Pleural effusion (binary). Pneumothorax was also evaluated qualitatively. |
The 70-75% range is consistent across both precision and recall — the model isn't trading one off against the other. It's making both kinds of errors at roughly similar rates.
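A short worked example shows why it is worth reporting precision and recall separately rather than F1 alone. F1 is the harmonic mean of the two, so a balanced model and a skewed one can land on the same F1. The numbers below are illustrative, not measurements from this evaluation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Balanced: precision and recall are equal, so F1 equals both.
f1_balanced = f1_score(0.72, 0.72)

# Skewed: high precision, low recall, yet nearly the same F1.
f1_skewed = f1_score(0.90, 0.60)

print(round(f1_balanced, 4), round(f1_skewed, 4))
```

Both cases come out at about 0.72, but the second one misses 40% of positives. The balanced profile reported above rules that case out.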
For context, the Stanford AIMI paper reports that CheXagent improves on general-domain foundation models by 97.5% on visual tasks, and on medical-domain foundation models by 55.7%. Both are large relative improvements over weaker baselines. Whether those gains translate into the absolute accuracy real clinical workflows need is a separate question, and it is the kind this evaluation tries to answer.
For a binary screening task where the cost asymmetry of missed positives vs false alarms is real (missing a pneumothorax is much worse than over-calling one), 70-75% recall is the harder number. At that recall, 25-30% of positive cases are missed, somewhere between one in four and nearly one in three. That's enough to disqualify the model as the sole signal for triage; it's not enough to disqualify it as one signal among several in a more complex pipeline.
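The miss-rate arithmetic behind the recall figure is simple enough to write down. A minimal sketch, where `expected_missed` is a hypothetical helper and the cohort of 1,000 positive studies is an illustrative assumption:

```python
def expected_missed(n_positive: int, recall: float) -> float:
    """Expected number of positive cases missed at a given recall."""
    if not 0.0 <= recall <= 1.0:
        raise ValueError("recall must be between 0 and 1")
    return n_positive * (1.0 - recall)


# Hypothetical cohort of 1,000 truly positive studies.
missed_at_75 = expected_missed(1000, 0.75)  # upper end of the reported recall range
missed_at_70 = expected_missed(1000, 0.70)  # lower end of the reported recall range
print(round(missed_at_75), round(missed_at_70))
```

That works out to roughly 250 to 300 missed positives per 1,000, which is the figure that matters when the model is the only signal in the loop.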
What worked
- Zero-shot capability is real. No task-specific fine-tuning, no labeled training data for the target task, and the model returns coherent yes/no answers with reasonable accuracy. That's a meaningful capability for rapid prototyping — you can evaluate whether a problem is roughly tractable before investing in dataset curation and training.
- Multimodal query interface. Natural-language prompts let you swap target pathologies without code changes ("Pleural Effusion" → "Pneumothorax" → "Cardiomegaly"). For exploratory analysis or rapid sweeps, this is operationally clean compared to maintaining a fleet of specialized classifiers.
- Hugging Face integration is straightforward. Standard `AutoModelForCausalLM` and `AutoProcessor` calls, `trust_remote_code=True`, that's it. No custom dependencies, no model-specific build steps. Sets up in 15 minutes.
- Pathology coverage is broad. The same model handles questions across many chest X-ray pathologies. Compared to maintaining specialized binary classifiers per pathology, that breadth is the practical foundation-model promise actually delivered.
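The pathology sweep described in the list above can be sketched as a loop over disease names. This is an assumption-laden illustration: `sweep_pathologies` is a hypothetical helper, the classifier is passed in as a callable with the same `(paths, disease_name)` shape as the call shown earlier, and `fake_classify` is a stand-in so the sketch runs without the model. The return type of the real method is not documented in this note, so results are treated as opaque values.

```python
from typing import Any, Callable, Dict, List


def sweep_pathologies(
    classify: Callable[[List[str], str], Any],
    image_paths: List[str],
    pathologies: List[str],
) -> Dict[str, Dict[str, Any]]:
    """Run one binary query per (pathology, image) pair and collect the raw results."""
    results: Dict[str, Dict[str, Any]] = {}
    for name in pathologies:
        # Only the pathology string changes between sweeps; the code path is identical.
        results[name] = {path: classify([path], name) for path in image_paths}
    return results


def fake_classify(paths: List[str], disease_name: str) -> str:
    """Stand-in for the real model call; always answers 'yes' for effusion only."""
    return "yes" if disease_name == "Pleural Effusion" else "no"


sweep = sweep_pathologies(
    fake_classify,
    ["study_001.png", "study_002.png"],  # hypothetical file names
    ["Pleural Effusion", "Pneumothorax", "Cardiomegaly"],
)
print(sweep["Pneumothorax"])
```

With the real model, `fake_classify` would be replaced by the bound method from the wrapper object.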
What didn't
- Performance ceiling at the zero-shot threshold. 70-75% F1 is well short of what a serious clinical screening system can ship. The gap from "interesting baseline" to "production-deployable" is real, and zero-shot inference doesn't close it.
- Dataset-dependent variation. CheXpert is one specific dataset with specific demographics, hardware, and acquisition protocols. We expect performance to vary on other distributions — different scanner vendors, different patient populations, different institutional QC standards. Generalization claims for foundation models often quietly assume distribution match.
- Binary-classification framing limits the signal. A yes/no output, even with confidence, gives less information than a per-pathology probability distribution with localized findings. For pipelines that want to combine multiple signals downstream, the binary output is a lossy interface.
- Resource cost is real. 8 billion parameters means 16 GB GPU minimum, 32 GB recommended for batch inference. That's deployment-class hardware. Compared to specialized CNNs that fit comfortably on a T4 or even a CPU, foundation models carry an infrastructure-cost premium that needs to be earned by accuracy or capability gains. At 70-75% F1, this evaluation didn't fully earn it.
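The memory figure in the resource-cost point follows from parameter count times bytes per parameter. A back-of-the-envelope sketch, assuming a nominal 8 × 10^9 parameters and counting weights only (activations, generation cache, and framework overhead come on top):

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 8e9  # nominal parameter count; the exact figure may differ slightly

fp16_gb = weight_memory_gb(N_PARAMS, 2)  # float16: 2 bytes per parameter
fp32_gb = weight_memory_gb(N_PARAMS, 4)  # float32: 4 bytes per parameter
print(fp16_gb, fp32_gb)
```

At float16 the weights alone come to about 16 GB, which is why 16 GB is a tight floor and 32 GB leaves room for batching.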
Verdict
| Use case | Recommendation |
|---|---|
| High-stakes clinical screening (sole signal) | No — 70-75% recall is too low |
| One signal in a multi-model pipeline | Maybe — combine with specialized CNNs |
| Rapid prototyping on new pathologies | Yes — zero-shot lets you test feasibility fast |
| Exploratory data analysis | Yes — natural-language interface scales well across pathologies |
| Fine-tuning starting point | Yes — strong pre-training prior for a downstream task-specific head |
| Real-time inference | Probably no — at 8B params, latency is meaningful |
CheXagent's value is highest in the exploration and fine-tuning-starting-point roles, lowest in direct clinical deployment. The model is good at what foundation models are supposed to be good at — broad capability without task-specific training — and weak at what specialized models are good at — peak performance on a narrow task with curated data.
Next steps
- Try fine-tuning CheXagent's vision tower with a small labeled set for the target pathology. Hypothesis: the zero-shot 70-75% becomes 85%+ with a few thousand task-specific examples. This is the cheapest path to closing the gap.
- Prompt-engineering pass. The internal prompts CheXagent uses are reasonable defaults, but task-specific phrasing might extract another few F1 points. The trade-off: prompt variants add a tuning surface that is hard to benchmark fairly against fine-tuning.
- Ensemble CheXagent with specialized CNN classifiers. The two error patterns may be uncorrelated enough that a confidence-weighted ensemble improves on either alone. We didn't measure this rigorously yet; on the to-do list.
- Evaluate on out-of-distribution data. CheXpert is a clean public dataset. Performance on real production distributions (different scanner vendors, different patient populations, different institutional artifacts) is the actual question for deployment. The next foundation-model evaluation note will pull more datasets into the comparison.
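The confidence-weighted ensemble mentioned in the list above could look like the sketch below. This is one simple option, a weighted average of two probabilities, not something measured in this note. `weighted_ensemble`, the weight, and the threshold are all hypothetical, and it assumes both models expose a probability that the finding is present.

```python
from typing import Tuple


def weighted_ensemble(
    p_vlm: float,
    p_cnn: float,
    w_vlm: float = 0.5,
    threshold: float = 0.5,
) -> Tuple[float, bool]:
    """Blend two positive-class probabilities and apply a decision threshold."""
    if not 0.0 <= w_vlm <= 1.0:
        raise ValueError("w_vlm must be between 0 and 1")
    combined = w_vlm * p_vlm + (1.0 - w_vlm) * p_cnn
    return combined, combined >= threshold


# Illustrative values: the VLM is unsure, the CNN is fairly confident,
# and the CNN is trusted three times as much as the VLM.
score, is_positive = weighted_ensemble(p_vlm=0.4, p_cnn=0.8, w_vlm=0.25)
print(round(score, 3), is_positive)
```

In practice the weight and threshold would be tuned on a held-out set, and the blend only helps if the two models' errors are not strongly correlated.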
A longer narrative writeup synthesizing CheXagent, Google CXR Foundation, MedSigLIP, and the rest of the foundation-model evaluation set is on the writing list — once enough individual lab notes are in place to support a useful cross-comparison.
Part of an ongoing series on production medical imaging. The companion year-one reflection covers the broader CXR-AI engineering context; the Google CXR Foundation latency lab note and the Gemini-vs-CNN clinical-QC note are the other two entries in this evaluation series. If you're evaluating CheXagent or a sibling foundation model on your own data, reach out.