Gemini vs CNN for Clinical Quality Control: Where Each Wins on 3,566 Chest X-Rays

A head-to-head benchmark of Gemini 2.5 Flash, Gemini 3 Flash Preview, and 14 CNN classifiers on the same chest X-ray QC task. Per-task numbers, cost, and latency.

June 2, 2026 · Saianiruth M

Lab note. Setup, numbers, verdict — short on prose, long on data. A longer narrative post drawing on this evaluation will follow.

TL;DR

On 3,566 chest X-rays evaluated against the same QC task:

Gemini wins at study-level reasoning — view classification, metadata extraction. F1 ≥ 0.94 for AP/PA view, 100% accuracy on extracted DICOM fields (modality, body part, side, view, ID, age, sex).
CNNs win at pixel-level perception: exposure flags, rotation, artifacts. The gap is roughly 8–17× in F1 on exposure and cropping, and 2–3.5× on artifacts and rotation.
OR-aggregated quality flag — best CNN F1 = 0.911, Gemini 2.5 Flash F1 = 0.473.
Gemini 3 Flash Preview beats Gemini 2.5 Flash on some tasks (rotation, OR-aggregated quality), but its 5–8 minute turnaround time (TAT) made it unshippable.

The production system uses both: CNNs for pixel-level quality flags, Gemini for structured reasoning and metadata. Picking either alone would be the wrong move.

Setup

The task is chest X-ray quality control: flagging studies that are over- or under-exposed, rotated, cropped, flipped, contain artifacts, or were captured via mobile phone (a real failure mode in field deployments). It also includes structured reasoning: view classification (AP / PA / LAT), modality verification, age-group classification, and body-part validation.

Test set: 3,566 production-distribution chest X-rays. Held out from training. Stratified to capture realistic prevalence — most studies are clean, the quality-flag classes are heavily imbalanced.

Models under test:

LLM side: Gemini 2.5 Flash (full run, n=3,566), Gemini 3 Flash Preview (partial run, n=243 before we dropped it for TAT). Same structured-output prompt for both — a dual-verification scheme that cross-checks DICOM metadata against pixel-level visual evidence and returns a strict JSON schema.
CNN side: 14 single-binary classifiers (yolov8m/x-cls, ResNet50/101, EfficientNet-B0/B3, EfficientNetV2-S/M/L, DenseNet121, MobileNetV3-L, ConvNeXt-Base, ViT-Base, Swin-Base, RadDINO-MLP) and 7 multiclass-multilabel classifiers covering 7 QC categories simultaneously.
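
For the LLM side, the "strict JSON schema" and the header-versus-pixels cross-check can be illustrated with a small validator. This is only a sketch: the post does not publish its schema, so every field name below (`view_from_header`, `view_from_pixels`, and so on) is a hypothetical stand-in.

```python
import json

# Hypothetical field names; the actual schema is not given in this note.
REQUIRED_FIELDS = {
    "modality": str,
    "body_part": str,
    "view_from_header": str,   # what the DICOM header claims
    "view_from_pixels": str,   # what the model infers from the image
}


def parse_strict(raw: str) -> dict:
    """Parse a model reply and reject anything that deviates from the schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_FIELDS.keys() - data.keys()
    extra = data.keys() - REQUIRED_FIELDS.keys()
    if missing or extra:
        raise ValueError(
            f"schema mismatch: missing={sorted(missing)} extra={sorted(extra)}"
        )
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data[name], expected_type):
            raise ValueError(f"{name} should be {expected_type.__name__}")
    return data


def header_disagrees_with_pixels(data: dict) -> bool:
    """Dual-verification check: does the header view match the visual view?"""
    return data["view_from_header"] != data["view_from_pixels"]


reply = (
    '{"modality": "DX", "body_part": "CHEST", '
    '"view_from_header": "PA", "view_from_pixels": "AP"}'
)
parsed = parse_strict(reply)
mismatch = header_disagrees_with_pixels(parsed)
print(parsed["modality"], mismatch)
```

Rejecting extra keys as well as missing ones is a deliberate choice in this sketch: a reply that drifts from the schema is treated as an error rather than silently accepted.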

Both sides see the same images. Both sides are evaluated on the same labels. Best-F1 thresholds are picked per CNN on a validation split; Gemini is evaluated at the threshold its structured output already commits to (no post-hoc tuning).
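
Best-F1 threshold selection on a validation split can be sketched as a sweep over candidate thresholds. This is a minimal stdlib illustration, not the evaluation code used here; the function names and the toy scores are assumptions.

```python
from typing import Sequence, Tuple


def f1_at_threshold(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> float:
    """F1 of the rule `score >= threshold` against binary labels."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(
    scores: Sequence[float], labels: Sequence[int]
) -> Tuple[float, float]:
    """Try every observed score as a threshold; keep the one with the highest F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):
        f1 = f1_at_threshold(scores, labels, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy validation split (made-up numbers, for illustration only).
val_scores = [0.05, 0.10, 0.20, 0.35, 0.60, 0.80, 0.90]
val_labels = [0, 0, 0, 1, 0, 1, 1]
threshold, f1 = best_f1_threshold(val_scores, val_labels)
print(threshold, round(f1, 3))
```

The threshold is then frozen and applied unchanged to the held-out test set, which is what keeps the CNN numbers comparable to Gemini's untuned output.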

Results

Per-task F1 (best of each side)

Task	Gemini 2.5 Flash	Best CNN	Gemini 3 FP*
view_AP	0.943	(not run)	0.787
view_PA	0.955	(not run)	0.870
view_LAT	0.605	(not run)	0.583
exposure_OVEREXPOSED	0.033	0.349 (effnetv2_l)	0.168
exposure_UNDEREXPOSURE	0.049	0.405 (multilabel best)	0.154
artifact_present	0.393	0.800 (yolov8x)	0.376
coverage_wrongly_cropped	0.039	0.675 (multilabel best)	0.071
rotation_rotated	0.255	0.884 (yolov8m, internal)	0.714
flip_suspected	0.033	(anatomy-rule based)	0.150
mobile_capture	0.756	0.853 (multilabel best)	0.792
general_quality_low	0.079	0.619 (multilabel best)	0.159
OR-aggregated quality	0.473	0.911 (yolov8m_cls_multilabel_v2)	0.767

* Gemini 3 FP run is partial (n=243). Numbers are directional, not definitive.

Metadata extraction (Gemini-only — CNNs don't do this)

Field	Gemini 2.5 Flash	Gemini 3 FP
modality, body_part, side, view, id, age, sex	100% accuracy	100% accuracy
age_group (classification)	37% recall / 48% precision	12% recall / 33% precision

Cost and latency

Model	Cost per case (USD)	Cost per case (INR, at ₹83/USD)	TAT
Gemini 2.5 Flash	$0.01127	₹0.94	18 s
Gemini 3 Flash Preview	$0.03874	₹3.22	300–480 s
Gemini 2.5 Flash-Lite	$0.00160	₹0.13	not benchmarked
Gemini 3.1 Flash-Lite	$0.00425	₹0.35	not benchmarked

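CNN latency on 4-vCPU GCP (no GPU): yolov8m multilabel ~6–8 s, ResNet50 ~7–10 s, DenseNet121 ~18–23 s, EfficientNetV2-L ~30–40 s. The lighter CNNs are roughly 2–3× faster than Gemini 2.5 Flash (18 s), DenseNet121 is about on par, and EfficientNetV2-L is roughly 2× slower. Against Gemini 3 FP (300–480 s), the yolov8m multilabel model is roughly 40–80× faster.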
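
The rupee column in the cost table is plain arithmetic on the USD figures, and the same arithmetic gives a feel for cost at volume. The sketch below uses the table's numbers; the monthly case count is a hypothetical chosen only for illustration.

```python
USD_TO_INR = 83.0  # exchange rate used in the cost table

# Per-case USD costs, copied from the table.
COST_USD_PER_CASE = {
    "Gemini 2.5 Flash": 0.01127,
    "Gemini 3 Flash Preview": 0.03874,
    "Gemini 2.5 Flash-Lite": 0.00160,
    "Gemini 3.1 Flash-Lite": 0.00425,
}


def cost_inr(model: str, cases: int = 1) -> float:
    """Cost in rupees for `cases` studies on the given model."""
    return COST_USD_PER_CASE[model] * USD_TO_INR * cases


# Hypothetical volume; the note does not state a deployment volume.
HYPOTHETICAL_MONTHLY_CASES = 100_000

flash_monthly = cost_inr("Gemini 2.5 Flash", HYPOTHETICAL_MONTHLY_CASES)
lite_monthly = cost_inr("Gemini 2.5 Flash-Lite", HYPOTHETICAL_MONTHLY_CASES)
monthly_saving_inr = flash_monthly - lite_monthly

for model in COST_USD_PER_CASE:
    print(f"{model}: INR {cost_inr(model):.2f} per case")
print(f"Saving at {HYPOTHETICAL_MONTHLY_CASES:,} cases/month: INR {monthly_saving_inr:,.0f}")
```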

What worked

Gemini on structured reasoning. View classification at 0.943 / 0.955 F1 is excellent. Metadata extraction at 100% is a real result, not a typo — the model is reading the DICOM header values out as JSON, and once the prompt is right, it does this perfectly. This is the half of the QC task we'd previously been trying to handle with brittle string-parsing logic; Gemini replaced it cleanly.

CNN multilabel architectures on pixel-level tasks. A single yolov8m_cls_multilabel_v2 model handles 7 QC categories simultaneously and achieves OR-aggregated F1 of 0.911. Compared to running 7 separate binary classifiers (best single-binary F1 = 0.920 for yolov8x-cls), the multilabel approach loses about 1 F1 point but runs in roughly 1/7th the wall-clock time. Easy tradeoff.
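
As a rough illustration of the OR-aggregated metric, the sketch below assumes that a study counts as "quality flagged" when any per-category flag fires, and that F1 is then computed on that single study-level bit. That definition is an assumption; the note does not spell it out. The category names are borrowed from the results table and the toy data is made up.

```python
from typing import Dict, List


def or_aggregate(flags: Dict[str, bool]) -> bool:
    """Study-level flag: True if any per-category flag is set (assumed definition)."""
    return any(flags.values())


def f1_score(predicted: List[bool], actual: List[bool]) -> float:
    """Binary F1 from paired predictions and ground-truth labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy multilabel outputs for four studies (illustrative only).
model_flags = [
    {"rotation_rotated": True, "artifact_present": False},
    {"rotation_rotated": False, "artifact_present": False},
    {"rotation_rotated": False, "artifact_present": True},
    {"rotation_rotated": False, "artifact_present": False},
]
true_flags = [
    {"rotation_rotated": True, "artifact_present": False},
    {"rotation_rotated": False, "artifact_present": True},
    {"rotation_rotated": False, "artifact_present": False},
    {"rotation_rotated": False, "artifact_present": False},
]

predicted = [or_aggregate(f) for f in model_flags]
actual = [or_aggregate(f) for f in true_flags]
aggregated_f1 = f1_score(predicted, actual)
print(aggregated_f1)
```

Under this definition a model can get the study-level bit right while naming the wrong category, which is one reason the aggregated F1 can sit well above the per-category numbers.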

Custom dataset normalization. All CNN results used custom (mean, std) computed on our training distribution, not ImageNet defaults. The delta on heavily-imbalanced tasks like exposure was meaningful (1–3 F1 points on best-threshold metrics). Almost no paper documents this; almost every paper would have benefited from it.
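
Computing dataset-specific normalization statistics is a single pass over the training pixels. The sketch below is stdlib-only for clarity and assumes single-channel 8-bit images given as nested lists; a real pipeline would more likely use an array library, and the function names are hypothetical.

```python
import math
from typing import Iterable, List, Sequence, Tuple

Image = Sequence[Sequence[int]]  # rows of grayscale pixel values


def dataset_mean_std(images: Iterable[Image], max_value: int = 255) -> Tuple[float, float]:
    """Mean and standard deviation of all pixels, scaled to [0, 1]."""
    count = 0
    total = 0.0
    total_sq = 0.0
    for image in images:
        for row in image:
            for pixel in row:
                value = pixel / max_value
                count += 1
                total += value
                total_sq += value * value
    if count == 0:
        raise ValueError("no pixels provided")
    mean = total / count
    variance = max(total_sq / count - mean * mean, 0.0)  # clamp float noise
    return mean, math.sqrt(variance)


def normalize(image: Image, mean: float, std: float, max_value: int = 255) -> List[List[float]]:
    """Apply (x - mean) / std using the dataset statistics instead of ImageNet defaults."""
    if std == 0:
        raise ValueError("std is zero; cannot normalize")
    return [[(pixel / max_value - mean) / std for pixel in row] for row in image]


# Two tiny toy "images" (illustrative values only).
train_images = [
    [[0, 255], [255, 0]],
    [[0, 0], [255, 255]],
]
mean, std = dataset_mean_std(train_images)
normalized = normalize(train_images[0], mean, std)
print(mean, std, normalized)
```

The statistics should come from the training split only, so that nothing about the held-out test distribution leaks into preprocessing.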

What didn't

Gemini on exposure detection. F1 of 0.033 on over-exposure (923 false positives on 3,566 cases) and 0.049 on under-exposure. The model is confidently calling exposure problems on clean studies and missing real ones. The root issue: judging exposure requires reasoning about a pixel-luminance distribution, which is exactly the kind of perceptual judgment Gemini doesn't do well — even with explicit prompt instructions to look at the histogram. A small dedicated CNN with the right preprocessing crushes it on the same task.

Gemini on fine artifacts and cropping. F1 of 0.039 on cropping and 0.393 on artifacts. Same pattern: the model knows what the concepts mean but can't localize them visually with the consistency a trained CNN can.

Gemini 3 Flash Preview latency. The newer Gemini was better on rotation (0.714 vs 0.255), OR-aggregated quality (0.767 vs 0.473), and overall handling of edge cases. But a TAT of 5–8 minutes per case (vs 18 seconds for 2.5 Flash) made it impossible to use in production. The 4,000-token thinking budget cap we set wasn't honored by the API in our runs; thinking tokens averaged ~10,800 per case. We dropped Gemini 3 FP at n=243 and continued with 2.5 Flash for the full 3,566.

The 184-case error tail. Gemini 2.5 Flash errored on 184 out of 3,566 cases (5.2%) after 5 retries each. Causes were split among content-policy false positives (clinical chest content occasionally triggering safety filters), DICOM headers with unusual fields the prompt didn't anticipate, and transient API issues. The system is production-deployable, but the long tail of failures is real and needs a fallback path.
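
One possible shape for the retry-then-fallback path is sketched below. It assumes exponential backoff and a caller-supplied fallback (for example, routing the case to manual review); neither detail is stated in this note, and `call_with_retries` and `run_with_fallback` are hypothetical names. The `sleep` parameter is injectable so the sketch can be exercised without waiting.

```python
import time
from typing import Any, Callable, Optional, Tuple


def call_with_retries(
    fn: Callable[[], Any],
    max_retries: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Optional[Any], Optional[Exception]]:
    """Return (result, None) on success or (None, last_error) once retries run out."""
    last_error: Optional[Exception] = None
    for attempt in range(max_retries + 1):  # first attempt plus retries
        try:
            return fn(), None
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            if attempt < max_retries:
                sleep(base_delay_s * 2 ** attempt)
    return None, last_error


def run_with_fallback(
    primary: Callable[[], Any],
    fallback: Callable[[], Any],
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Use the primary call if it ever succeeds; otherwise take the fallback path."""
    result, error = call_with_retries(primary, sleep=sleep)
    return result if error is None else fallback()


# Demo with stand-in functions (no real API calls).
attempts = {"n": 0}


def flaky() -> dict:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient error")
    return {"view": "PA"}


def always_fails() -> dict:
    raise RuntimeError("blocked by content filter")


no_wait = lambda seconds: None
result, error = call_with_retries(flaky, sleep=no_wait)
fallback_result = run_with_fallback(
    always_fails, lambda: {"status": "needs_manual_review"}, sleep=no_wait
)
print(result, error, fallback_result)
```

Returning the last error alongside the result, rather than raising, makes it straightforward to log each exhausted case by cause.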

Verdict

Pick the right tool for the right axis:

Axis of judgment	Use
Metadata extraction (header → JSON)	Gemini 2.5 Flash
View classification (AP / PA / LAT)	Gemini 2.5 Flash
Pixel-level quality flags (exposure, rotation, artifact, crop, flip)	CNN multilabel
Age-group classification	Trained CNN (Gemini is unreliable)
Out-of-distribution detection	CNN with calibrated thresholds

The production stack now uses both. The integration cost is small (a single orchestrator routes the JSON outputs from each into the unified QC label); the win on each axis is large.
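
The routing step might look roughly like the following. This is a sketch under assumptions: the key names, the shape of each model's JSON output, and the `merge_qc` function are all hypothetical, since the note does not describe the orchestrator's interface.

```python
from typing import Any, Dict

# Hypothetical flag names for the pixel-level axis owned by the CNN.
PIXEL_FLAGS = ("exposure", "rotation", "artifact", "crop", "flip")


def merge_qc(cnn_output: Dict[str, Any], gemini_output: Dict[str, Any]) -> Dict[str, Any]:
    """Build one QC label: pixel flags from the CNN, metadata and view from Gemini."""
    flags = {name: bool(cnn_output.get(name, False)) for name in PIXEL_FLAGS}
    return {
        "metadata": gemini_output.get("metadata", {}),
        "view": gemini_output.get("view"),
        "quality_flags": flags,
        "quality_flagged": any(flags.values()),
    }


# Illustrative inputs only.
cnn_output = {"exposure": False, "rotation": True, "artifact": False}
gemini_output = {
    "metadata": {"modality": "DX", "body_part": "CHEST"},
    "view": "PA",
    "exposure": True,  # ignored: exposure is routed to the CNN in this sketch
}
qc_label = merge_qc(cnn_output, gemini_output)
print(qc_label)
```

Each field has exactly one owner in this sketch, so a Gemini opinion on exposure never overrides the CNN's, mirroring the axis split in the table above.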

A general claim worth making: foundation-model VLMs are at their best when the work is "extract structured information from clearly visible text and metadata" and at their worst when the work is "make a perceptual judgment from low-level pixel statistics." Picking between them based on the task — not based on hype — is the part most public discussions of "LLMs in medical imaging" skip.

Next steps

Re-evaluate Gemini 3 Flash Preview when the thinking-budget cap is honored end-to-end. If TAT drops below 60 s/case, the better accuracy might be worth it.
Try Gemini 2.5 Flash-Lite as a cost-reduction option for the metadata-only path. ₹0.13/case vs ₹0.94 would be material at deployment scale.
Add a dedicated under-exposure model. The current multilabel best (0.405 F1) is the second-weakest CNN number in the table, behind only over-exposure (0.349 F1).
Document the 184-error tail with a failure-class taxonomy. Knowing which 5% fails matters more than the headline 95% pass rate.

A longer narrative writeup of this work will follow as a /blog/ post, with reproducibility notes, the full prompt design, and the per-class confusion matrices.

Part of an ongoing series on production medical imaging. The companion year-one reflection post is here; the Windows-installer engineering deep-dive is here.