CheXagent-8b Zero-Shot: 70-75% F1 on Pleural Effusion Detection

Stanford AIMI's 8B chest X-ray vision-language model, evaluated zero-shot on pleural effusion detection. Strong prompt-only baseline, insufficient for clinical deployment without fine-tuning — and a useful data point on what 'foundation model for medical imaging' actually means right now.

June 24, 2026 · Saianiruth M

Lab note. Setup, numbers, verdict. Third entry in the foundation-model evaluation series, after the Google CXR Foundation latency post and the Gemini-vs-CNN clinical-QC note.

TL;DR

CheXagent-8b is Stanford AIMI's 8-billion-parameter vision-language foundation model, pre-trained on large-scale chest X-ray datasets with multimodal objectives.
Zero-shot binary classification on pleural effusion detection (CheXpert): precision 70-75%, recall 70-75%, F1 ≈ 70-75%.
The model handles natural-language disease queries cleanly — no task-specific fine-tuning required — and works across multiple pathologies the user can ask about by name.
70-75% is a strong zero-shot floor and a weak clinical-deployment ceiling. Useful as a rapid-prototyping baseline; insufficient as the primary signal for high-stakes triage.
The interesting question isn't whether CheXagent is "good" — it's whether the gap from 70-75% to deployable (say 90%+ on both axes) closes with fine-tuning, prompt engineering, or ensembling. Our follow-up work suggests fine-tuning is the cheapest route.

Figure: headline metrics from the evaluation (precision 70-75%, recall 70-75%, F1 ≈ 70-75%), all measured zero-shot on the CheXpert pleural effusion validation subset. Precision and recall sit in the same band: the model isn't trading one off against the other, just making both kinds of errors at similar rates.

Setup

Component	Detail
Model	`StanfordAIMI/CheXagent-8b`
Architecture	Multimodal transformer (8B params), vision-language foundation model
Pre-training	Large-scale chest X-ray datasets with multimodal objectives
Reference paper	arXiv 2401.12208
Evaluation dataset	CheXpert (publicly available)
Task	Binary classification — pleural effusion present / absent
Inference mode	Zero-shot via natural-language query
Hardware	NVIDIA GPU with 16-32 GB memory, `float16` precision

The motivation was practical: foundation-model VLMs for medical imaging have grown quickly in the last two years (Google CXR Foundation, MedSigLIP, CheXFound, PaliGemma, CheXagent, RAD-DINO, MedGemma). The marketing on each one promises "zero-shot clinical performance." We wanted to know what that actually means on a specific binary classification task we cared about, before committing to fine-tuning any of them.

CheXagent's pitch is particularly clean: a chest-X-ray-specialized VLM that you can prompt with natural language ("is there evidence of pneumothorax?") and get back a structured response. The Stanford AIMI paper reports strong performance across 8 chest X-ray interpretation tasks. Whether that performance survives outside the paper's own benchmarks, and at what threshold, is the question this evaluation set out to answer. It covers CheXpert only, not yet production-distribution data.

Method

The model loads via standard Hugging Face APIs:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

device = "cuda"
dtype = torch.float16  # half precision, 2 bytes per parameter

# trust_remote_code=True lets transformers run the model code shipped with the checkpoint
processor = AutoProcessor.from_pretrained(
    "StanfordAIMI/CheXagent-8b", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "StanfordAIMI/CheXagent-8b",
    torch_dtype=dtype,
    trust_remote_code=True,
).to(device)
generation_config = GenerationConfig.from_pretrained(
    "StanfordAIMI/CheXagent-8b"
)
```

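Binary classification is then a single method call. The snippet calls it on an object named `chexagent`; this note does not show how that object is constructed from the `model` and `processor` loaded above.

```python
# Single-image binary disease classification.
# `chexagent` and `image_path` are defined elsewhere (not shown in this note).
result = chexagent.binary_disease_classification(
    [image_path],
    "Pleural Effusion"
)
```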

The inference path internally constructs the right prompt template, passes the image through the vision encoder, runs the text decoder against the prompt, and parses out a structured yes/no with confidence. The user never touches the prompt directly; that's intentional and useful — it removes prompt-engineering as a confounding variable in the evaluation.

We ran this against the CheXpert validation set's labeled pleural-effusion subset, computing precision, recall, and F1 against the ground-truth labels.
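For readers who want to reproduce the scoring step, here is a minimal sketch of how precision, recall, and F1 fall out of binary predictions. The function name `binary_metrics` and the toy labels are illustrative assumptions, not the evaluation code or data used for this note.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, and F1 for binary labels (1 = effusion present)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the three outcomes that matter for these metrics.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only (hypothetical, not CheXpert data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With one missed positive and one false alarm out of four positives, the toy example lands at 0.75 on all three metrics, which is the same shape as the balanced error profile reported here.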

Results

Metric	Value
Precision	70-75%
Recall	70-75%
F1	≈ 70-75%
Inference time per image	A few seconds on A100 (acceptable for batch eval; not benchmarked for real-time)
Pathology coverage tested	Pleural effusion (binary). Pneumothorax was also evaluated qualitatively.

The 70-75% range is consistent across both precision and recall — the model isn't trading one off against the other. It's making both kinds of errors at roughly similar rates.

For context, the Stanford AIMI paper reports CheXagent improving 97.5% over general-domain foundation models on visual tasks, and 55.7% over medical-domain foundation models. Both are large relative improvements over weaker baselines. Whether those improvements translate to absolute accuracy at the level real clinical workflows need is a separate question, and it is the one this evaluation tries to answer.

For a binary screening task where the cost asymmetry of missed positives vs false alarms is real (missing an effusion is much worse than over-calling one), 70-75% recall is the harder number. Between one in four and nearly one in three positive cases is missed. That's enough to disqualify the model as the sole signal for triage; it's not enough to disqualify it as one signal among several in a more complex pipeline.
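The miss rate is just one minus recall, and it helps to see it as a case count. A small sketch, where `missed_positives` is a hypothetical helper and the 1,000-case cohort is an illustrative assumption rather than the size of the evaluation set:

```python
def missed_positives(recall: float, n_positive: int) -> int:
    """Expected number of truly positive cases the model misses at a given recall."""
    if not 0.0 <= recall <= 1.0:
        raise ValueError("recall must be between 0 and 1")
    # Miss rate (false-negative rate) is the complement of recall.
    return round((1.0 - recall) * n_positive)


# Hypothetical cohort of 1,000 positive cases, at both ends of the reported recall band.
for recall in (0.70, 0.75):
    print(f"recall={recall:.2f}: about {missed_positives(recall, 1000)} of 1000 positives missed")
```

At the two ends of the reported band that works out to roughly 300 and 250 missed cases per 1,000 positives.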

What worked

Zero-shot capability is real. No task-specific fine-tuning, no labeled training data for the target task, and the model returns coherent yes/no answers with reasonable accuracy. That's a meaningful capability for rapid prototyping — you can evaluate whether a problem is roughly tractable before investing in dataset curation and training.
Multimodal query interface. Natural-language prompts let you swap target pathologies without code changes ("Pleural Effusion" → "Pneumothorax" → "Cardiomegaly"). For exploratory analysis or rapid sweeps, this is operationally clean compared to maintaining a fleet of specialized classifiers.
Hugging Face integration is straightforward. Standard AutoModelForCausalLM and AutoProcessor calls, trust_remote_code=True, that's it. No custom dependencies, no model-specific build steps. Sets up in 15 minutes.
Pathology coverage is broad. The same model handles questions across many chest X-ray pathologies. Compared to maintaining specialized binary classifiers per pathology, that breadth is the practical foundation-model promise actually delivered.
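The pathology swap described above can be sketched as a loop over disease names. This is a hedged illustration: `sweep_pathologies` and `fake_classifier` are hypothetical names, the classifier is injected as a plain callable with an assumed `(image_paths, disease_name)` signature, and the stand-in returns canned answers rather than real CheXagent output.

```python
from typing import Callable, Dict, List, Sequence

# Assumed signature for illustration: (image_paths, disease_name) -> answer string.
ClassifyFn = Callable[[List[str], str], str]


def sweep_pathologies(
    classify_fn: ClassifyFn, image_path: str, pathologies: Sequence[str]
) -> Dict[str, str]:
    """Ask the same classifier about several pathologies for one image."""
    return {name: classify_fn([image_path], name) for name in pathologies}


def fake_classifier(image_paths: List[str], disease_name: str) -> str:
    """Stand-in for a real model call; returns canned answers only."""
    return "yes" if disease_name == "Pleural Effusion" else "no"


answers = sweep_pathologies(
    fake_classifier,
    "example.png",  # hypothetical path
    ["Pleural Effusion", "Pneumothorax", "Cardiomegaly"],
)
print(answers)
```

In a real sweep the stand-in would be replaced by whatever callable wraps the model, and only the list of names changes between runs.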

What didn't

Performance ceiling at the zero-shot threshold. 70-75% F1 is below what a serious clinical screening system can ship. The gap from "interesting baseline" to "production-deployable" is real, and zero-shot inference doesn't close it.
Dataset-dependent variation. CheXpert is one specific dataset with specific demographics, hardware, and acquisition protocols. We expect performance to vary on other distributions — different scanner vendors, different patient populations, different institutional QC standards. Generalization claims for foundation models often quietly assume distribution match.
Binary-classification framing limits the signal. A yes/no output, even with confidence, gives less information than a per-pathology probability distribution with localized findings. For pipelines that want to combine multiple signals downstream, the binary output is a lossy interface.
Resource cost is real. 8 billion parameters at float16 is roughly 16 GB for the weights alone, so a 16 GB GPU is the bare minimum and 32 GB is recommended for batch inference. That's deployment-class hardware. Compared to specialized CNNs that fit comfortably on a T4 or even a CPU, foundation models carry an infrastructure-cost premium that needs to be earned by accuracy or capability gains. At 70-75% F1, this evaluation didn't fully earn it.
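The memory figure in the last point is back-of-envelope arithmetic on parameter count and precision. A minimal sketch, assuming a nominal 8e9 parameters and counting weights only (activations, caches, and framework overhead come on top); `weight_memory_gb` is a hypothetical helper:

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory for model weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 8e9  # nominal parameter count, taken from the "8b" in the model name

fp16_gb = weight_memory_gb(N_PARAMS, 2)  # float16: 2 bytes per parameter
fp32_gb = weight_memory_gb(N_PARAMS, 4)  # float32: 4 bytes per parameter

print(f"float16 weights: {fp16_gb:.0f} GB, float32 weights: {fp32_gb:.0f} GB")
```

Since the float16 weights already account for about 16 GB, the extra headroom on a 32 GB card is what makes batch inference comfortable.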

Verdict

Use case	Recommendation
High-stakes clinical screening (sole signal)	No — 70-75% recall is too low
One signal in a multi-model pipeline	Maybe — combine with specialized CNNs
Rapid prototyping on new pathologies	Yes — zero-shot lets you test feasibility fast
Exploratory data analysis	Yes — natural-language interface scales well across pathologies
Fine-tuning starting point	Yes — strong pre-training prior for a downstream task-specific head
Real-time inference	Probably no: at 8B params, inference took a few seconds per image on an A100

CheXagent's value is highest in the exploration and fine-tuning-starting-point roles, lowest in direct clinical deployment. The model is good at what foundation models are supposed to be good at — broad capability without task-specific training — and weak at what specialized models are good at — peak performance on a narrow task with curated data.

Next steps

Try fine-tuning CheXagent's vision tower with a small labeled set for the target pathology. Hypothesis: the zero-shot 70-75% becomes 85%+ with a few thousand task-specific examples. This is the cheapest path to closing the gap.
Prompt-engineering pass. The internal prompts CheXagent uses are reasonable defaults, but task-specific phrasing might extract another few F1 points. The trade-off: prompt tuning adds another tuning surface, and its gains are hard to benchmark cleanly against fine-tuning.
Ensemble CheXagent with specialized CNN classifiers. The two error patterns may be uncorrelated enough that a confidence-weighted ensemble improves on either alone. We haven't measured this rigorously yet; it's on the to-do list.
Evaluate on out-of-distribution data. CheXpert is a clean public dataset. Performance on real production distributions (different scanner vendors, different patient populations, different institutional artifacts) is the actual question for deployment. The next foundation-model evaluation note will pull more datasets into the comparison.
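One simple form the confidence-weighted ensemble idea could take is a weighted average of two positive-class probabilities. This is a sketch under stated assumptions, not something we have built or measured: it assumes both models expose a probability in [0, 1], and the names `weighted_ensemble`, `w_vlm`, and `threshold` are hypothetical.

```python
from typing import Tuple


def weighted_ensemble(
    p_vlm: float, p_cnn: float, w_vlm: float = 0.5, threshold: float = 0.5
) -> Tuple[float, bool]:
    """Blend two positive-class probabilities and threshold the result.

    p_vlm: assumed probability from the vision-language model
    p_cnn: assumed probability from a specialized CNN
    w_vlm: weight on the VLM; the CNN gets 1 - w_vlm
    """
    for name, value in (("p_vlm", p_vlm), ("p_cnn", p_cnn), ("w_vlm", w_vlm)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    combined = w_vlm * p_vlm + (1.0 - w_vlm) * p_cnn
    return combined, combined >= threshold


# Hypothetical scores for a single image.
score, is_positive = weighted_ensemble(p_vlm=0.9, p_cnn=0.3, w_vlm=0.5)
print(f"combined={score:.2f}, positive={is_positive}")
```

The weight and threshold would need tuning on held-out data, and lowering the threshold is the obvious lever if recall matters more than precision.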

A longer narrative writeup synthesizing CheXagent, Google CXR Foundation, MedSigLIP, and the rest of the foundation-model evaluation set is on the writing list — once enough individual lab notes are in place to support a useful cross-comparison.

Part of an ongoing series on production medical imaging. The companion year-one reflection covers the broader CXR-AI engineering context; the Google CXR Foundation latency lab note and the Gemini-vs-CNN clinical-QC note are the other two entries in this evaluation series. If you're evaluating CheXagent or a sibling foundation model on your own data, reach out.