Three Detectors, One Ensemble: Inside Our Shoulder-Fracture Detection System
A walkthrough of the shoulder-fracture detection ensemble we recently put on arXiv — what each of the three detection architectures actually contributes, how we fused them, and what the numbers honestly say about when ensembles help.
Shoulder fractures get missed. Clinical literature puts the miss rate at 1–10% on initial radiologist review, and the failure modes are predictable: emergency-shift fatigue, subtle cortical disruptions, oblique views, post-surgical anatomy. AI-assisted detection isn't a replacement for the radiologist; it's a second pair of eyes that doesn't get tired at 3am.
Earlier this year, my team and I published a paper on arXiv describing the shoulder-fracture detection system we built to address this. It's an ensemble of three detection architectures (Faster R-CNN with FPN, EfficientDet with EfficientNet-B7 + BiFPN, and RF-DETR) trained on 10,000 annotated shoulder radiographs. This post is the engineer-readable companion to the paper: what we built, why we picked these three, and what we learned about ensembles in the process.
A note on credit: this work has many co-authors. The lead is Hemanth Kumar M; the radiologist guidance came from Dr. Vasanthakumar Venugopal; the broader team contributed across architecture, annotation, training, and validation. The paper has the full list. I'll use "we" throughout — that's accurate. Anywhere I use "I," that's a personal observation, not authorship credit.
The setup
Shoulder X-rays are a hard target for general-purpose detection models. Most public medical-imaging research focuses on chest X-rays (MIMIC-CXR, CheXpert, NIH ChestX-ray14) or on limb fractures in more standardized datasets like MURA. Shoulder-specific systems with benchmarked performance are rare in the literature.
We collected 10,000 anonymized shoulder radiographs from production clinical settings — anteroposterior (AP) and lateral projections, mixed digital radiography hardware, diverse patient demographics. Annotation was COCO-style with bounding boxes drawn by expert radiologists. Every label went through dual review with consensus resolution. Studies with severe quality degradation (motion blur, incomplete anatomy) were excluded.
The task is binary: fracture / non-fracture, with bounding-box localization for positive cases. We deliberately kept it binary because the deployment target is rapid screening and triage — a clinician confirming an AI flag and then doing detailed subtype classification themselves. Fine-grained classification (avulsion, comminuted, pathological) was out of scope for this system.
Three architectures, three failure modes
The thesis behind ensembling for detection is straightforward: different architectures fail at different things. If you pick architectures whose failures are uncorrelated, the combined system covers more of the failure space than any single model.
We picked three architectures with deliberately different inductive biases:
Faster R-CNN with FPN
A two-stage detector. Stage 1 generates region proposals (RPN); stage 2 classifies and refines them. The Feature Pyramid Network gives it multi-scale features — high-resolution spatial detail at the bottom layers, semantically rich coarse features at the top.
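What it's good at: precise localization on objects that are clearly visible. The two-stage nature gives it room to be careful: the RPN narrows the search, then the detection head refines. On our test set it had the highest AP@0.5 of the three models (0.9674), although its recall (0.9617) was the lowest of the three by a small margin.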
What it's not good at: subtle features that need global context. The per-region classification doesn't see the whole image when deciding.
We trained variants with ResNet-50 FPN, ResNeXt-101 FPN, and DenseNet-121 backbones. ResNet-50 FPN ended up in the final ensemble — the heavier backbones didn't justify their cost on this task.
EfficientDet with EfficientNet-B7
A one-stage detector. Class prediction and bounding-box regression happen in a single forward pass. The EfficientNet-B7 backbone uses compound scaling (depth × width × resolution) to wring more accuracy out of fewer FLOPs.
What we care about here is the BiFPN (Bi-directional Feature Pyramid Network). It's a refinement over standard FPN that fuses features in both directions (top-down and bottom-up), with a learnable weight on each input to every fusion node. The result is better representation of small and medium objects, which matters a lot when fractures are hairline cortical disruptions a few pixels wide.
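To make the weighted fusion concrete, here is a minimal sketch of the "fast normalized fusion" rule described in the EfficientDet paper: each input feature map gets a non-negative weight, and the weights are normalized to sum to roughly one. The function name, the flattened-list representation of feature maps, and the fixed weights are all assumptions for illustration; in a real model the weights are trained parameters and the inputs are resized tensors.

```python
def fast_normalized_fusion(features, raw_weights, eps=1e-4):
    """Fuse same-length feature vectors with non-negative, normalized weights."""
    # Clamp at zero (a ReLU) so no input can contribute negatively.
    weights = [max(0.0, w) for w in raw_weights]
    total = sum(weights) + eps  # eps avoids division by zero
    fused = [0.0] * len(features[0])
    for feat, w in zip(features, weights):
        for i, value in enumerate(feat):
            fused[i] += (w / total) * value
    return fused


# Two hypothetical, already-resized feature maps flattened to short lists.
top_down = [1.0, 2.0, 3.0]
lateral = [3.0, 2.0, 1.0]
fused = fast_normalized_fusion([top_down, lateral], raw_weights=[3.0, 1.0])
```

With weights of 3 and 1, the fused values sit three quarters of the way toward the first input, which is the sense in which the network can learn how much each scale should count.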
What it's good at: balanced precision/recall, fast inference (one-stage), strong on small lesions.
What it's not good at: cluttered scenes with many candidate regions. The single-shot prediction can saturate.
RF-DETR (Roboflow's detection transformer)
A transformer-based detector. No anchor boxes, no RPN: a pretrained backbone feeds a transformer decoder that emits a fixed-size set of predictions. During training, bipartite matching aligns predictions to ground truth.
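The bipartite matching step can be illustrated with a toy example. This sketch is a simplification under stated assumptions: it uses only an L1 distance between box coordinates as the matching cost (DETR-style detectors also include class and overlap terms), and it brute-forces the assignment with `itertools.permutations` as a stand-in for the Hungarian algorithm. The function names are hypothetical.

```python
from itertools import permutations


def l1_cost(pred_box, gt_box):
    """Sum of absolute coordinate differences between two (x1, y1, x2, y2) boxes."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


def match_predictions(pred_boxes, gt_boxes):
    """Return (pred_index, gt_index) pairs with minimal total cost.

    Assumes there are at least as many predictions as ground-truth boxes.
    Brute force is only practical for a handful of boxes.
    """
    best_cost, best_pairs = float("inf"), []
    # Each permutation assigns a distinct prediction to every ground-truth box.
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(l1_cost(pred_boxes[p], gt_boxes[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, g) for g, p in enumerate(perm)]
    return best_pairs


preds = [(10, 10, 50, 50), (200, 200, 260, 260), (0, 0, 5, 5)]
gts = [(205, 198, 262, 258), (12, 9, 48, 52)]
pairs = match_predictions(preds, gts)
```

Prediction 2 ends up unmatched, which in a set-based detector is what gets trained toward the "no object" class.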
What it's good at: global reasoning. The attention mechanism lets the decoder see the whole image when deciding on a single box. For subtle pathologies where the surrounding anatomy matters (greater-tuberosity fractures, oblique views), this is a real advantage.
What it's not good at: precise box geometry on clear-contrast objects. The set-based output doesn't always nail the tight bounding box.
The three are intentionally complementary: Faster R-CNN (two-stage CNN baseline) → EfficientDet (scaled CNN with cross-scale fusion) → RF-DETR (transformer with global context). When one misses, the others have a reasonable chance of catching.
Preprocessing and training
A few decisions that did most of the work:
- Resize to 1024×1024. Larger than typical natural-image detection inputs (640 or 800). Medical pathologies have signal at small spatial scales.
- CLAHE preprocessing. Contrast Limited Adaptive Histogram Equalization, applied per-tile. Normalizes exposure variance across the dataset — clinical radiographs come from mixed hardware with mixed acquisition parameters.
- Augmentation: horizontal flip, random crop, brightness scaling, small-angle rotation. Restrained on the rotation side because shoulder anatomy has handedness; large rotations introduce non-anatomical orientations.
- Stratified 80:20 split. Fracture / non-fracture prevalence preserved across train and validation.
- Adam optimizer, learning-rate warm restarts, early stopping on validation loss, max 100 epochs. Standard. The headline isn't the optimizer; it's that all three models trained with the same data and split, so the comparison is fair.
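The stratified split from the list above can be sketched in a few lines of plain Python. This is an illustrative version, not the code from the paper: the function name, the seed, and the 30/70 label prevalence in the example are assumptions.

```python
import random


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Split indices so each class keeps the same share in train and validation."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Hypothetical labels: 1 = fracture, 0 = non-fracture.
labels = [1] * 30 + [0] * 70
train_idx, val_idx = stratified_split(labels)
```

Both partitions keep the 30% fracture prevalence of the full set, which is the property the split is meant to preserve.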
The refined test set used for the paper's reported numbers is 207 radiographs (117 fracture, 90 non-fracture) — a held-out clinical evaluation set, separate from the train/validation split.
Four fusion strategies, one winner
With three trained models, the question is how to fuse their predictions. We tried four box-level strategies:
NMS (Non-Maximum Suppression). Standard approach: for each set of overlapping boxes, keep the one with the highest confidence, discard the rest. Loses information — the discarded boxes might have been right, the kept box might have wrong geometry.
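For reference, a minimal greedy NMS over the pooled detections of several models might look like the sketch below. The box format `(x1, y1, x2, y2)`, the 0.5 IoU threshold, and the example detections are assumptions for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """detections: list of (box, score). Returns the detections that survive."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest remaining score wins
        kept.append(best)
        # Drop every box that overlaps the winner too much.
        remaining = [d for d in remaining if iou(best[0], d[0]) < iou_threshold]
    return kept


pooled = [
    ((100, 100, 200, 200), 0.90),  # hypothetical box from one model
    ((105, 102, 205, 198), 0.85),  # overlapping box from another model
    ((400, 400, 450, 450), 0.60),  # separate finding
]
kept = nms(pooled)
```

The 0.85 box is discarded outright even though it agrees with the winner, which is exactly the information loss described above.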
Soft-NMS. Instead of discarding overlapping boxes outright, decay their confidence scores by their overlap with the highest-confidence box. Less destructive than NMS, but still based on suppression rather than fusion.
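The score decay itself is small enough to show in isolation. The sketch below follows the two decay rules from the original Soft-NMS paper (linear and Gaussian); the function name and the default `sigma` and threshold values are assumptions, and the IoU with the top box is passed in precomputed.

```python
import math


def soft_nms_decay(score, overlap, method="gaussian", sigma=0.5, iou_threshold=0.3):
    """Return the decayed score of a box whose IoU with the top box is `overlap`."""
    if method == "linear":
        # Only boxes above the threshold are penalized, in proportion to overlap.
        return score * (1.0 - overlap) if overlap > iou_threshold else score
    # Gaussian decay: a smooth penalty that grows with overlap.
    return score * math.exp(-(overlap ** 2) / sigma)


# A 0.85-confidence box overlapping the top box with IoU 0.87 (hypothetical).
decayed = soft_nms_decay(0.85, 0.87)  # about 0.19 with the default sigma
```

The box survives with a reduced score instead of being deleted, so a later score threshold decides whether it stays.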
WBF (Weighted Box Fusion). Combine overlapping boxes by computing a weighted average of their coordinates, weighted by confidence. This actually fuses the geometry from multiple models rather than picking a winner.
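Here is a hedged sketch of the fusion step for one cluster of overlapping boxes. It assumes the clustering by IoU has already happened, and it follows the published WBF description for the score: the mean confidence, scaled down when fewer models contributed than the ensemble contains. The function name and example values are hypothetical.

```python
def fuse_cluster(cluster, n_models):
    """cluster: list of (box, score) believed to cover the same finding."""
    total = sum(score for _, score in cluster)
    # Confidence-weighted average of each coordinate.
    fused_box = tuple(
        sum(box[k] * score for box, score in cluster) / total for k in range(4)
    )
    mean_score = total / len(cluster)
    # Penalize clusters that only some of the models voted for.
    fused_score = mean_score * min(len(cluster), n_models) / n_models
    return fused_box, fused_score


cluster = [
    ((100, 100, 200, 200), 0.9),
    ((110, 100, 210, 200), 0.6),
]
fused_box, fused_score = fuse_cluster(cluster, n_models=3)
# fused_box is approximately (104, 100, 204, 200); fused_score is about 0.5
```

The fused box lands between the two inputs, closer to the more confident one, and the score drops because only two of three models found the box.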
NMW (Non-Maximum Weighted). Similar to WBF but adds voting heuristics that penalize low-confidence overlaps. The fusion considers not just box coordinates but also which models contributed and how confidently.
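As a rough illustration only: NMW is commonly described as weighting each box in a cluster by its confidence multiplied by its IoU with the highest-confidence box, so low-confidence or poorly overlapping boxes pull the result less. The sketch below implements that weighting for a single cluster. It is an assumption about the general technique, it does not reproduce the voting heuristics mentioned above, and it is not the paper's implementation.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nmw_fuse(cluster):
    """cluster: list of (box, score) overlapping the same top-scoring box."""
    anchor_box, anchor_score = max(cluster, key=lambda d: d[1])
    # Weight = confidence x overlap with the top box (the top box gets IoU 1).
    weights = [score * iou(box, anchor_box) for box, score in cluster]
    total = sum(weights)
    fused_box = tuple(
        sum(w * box[k] for w, (box, _) in zip(weights, cluster)) / total
        for k in range(4)
    )
    return fused_box, anchor_score


cluster = [
    ((100, 100, 200, 200), 0.9),
    ((100, 100, 200, 250), 0.3),  # taller, low-confidence box
]
fused_box, fused_score = nmw_fuse(cluster)
```

In this example the bottom edge moves to about 209, whereas plain confidence weighting would put it at 212.5: the extra IoU factor discounts the box that agrees less with the top prediction.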
We also tested classification-level fusion (affirmative / unanimous / consensus voting) for the binary fracture/non-fracture decision. Consensus voting — majority rule — handled the cases where individual models disagreed.
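The three voting rules are simple to state in code. In this sketch each model contributes one boolean for the image (True meaning it flagged a fracture); the function name and the strict-majority reading of "consensus" are assumptions consistent with the description above.

```python
def ensemble_decision(votes, strategy="consensus"):
    """votes: one boolean per model. Returns the ensemble's fracture decision."""
    positives = sum(votes)
    if strategy == "affirmative":  # any single model is enough
        return positives >= 1
    if strategy == "unanimous":  # every model must agree
        return positives == len(votes)
    if strategy == "consensus":  # strict majority
        return positives > len(votes) / 2
    raise ValueError(f"unknown strategy: {strategy}")


votes = [True, False, True]  # hypothetical: two of three models flag a fracture
decision = ensemble_decision(votes)
```

Affirmative voting favors recall, unanimous voting favors precision, and consensus sits between the two.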
Results, honestly
Here are the actual numbers from the refined test set (207 images, 117 fracture, 90 non-fracture). I've put the individual-model and ensemble results side-by-side so you can see the comparison directly:
Individual models
| Model | Accuracy | Precision | Recall | F1 | AP@0.5 |
|---|---|---|---|---|---|
| Faster R-CNN (ResNet50) | 95.4% | 0.9532 | 0.9617 | 0.9572 | 0.9674 |
| EfficientDet-B7 | 96.86% | 0.9684 | 0.9647 | 0.9692 | 0.9626 |
| RF-DETR | 95.37% | 0.9574 | 0.9676 | 0.9631 | 0.9585 |
Ensemble strategies
| Method | Accuracy | Precision | Recall | F1 | AP@0.5 |
|---|---|---|---|---|---|
| NMS | 89.37% | 0.9195 | 0.8632 | 0.8903 | 0.8891 |
| Soft-NMS | 90.33% | 0.9271 | 0.8786 | 0.9022 | 0.9104 |
| WBF | 89.37% | 0.9068 | 0.8717 | 0.8889 | 0.8992 |
| NMW | 95.50% | 0.9589 | 0.9576 | 0.9610 | 0.9553 |
Two things jump out.
First, NMW wins decisively among the fusion strategies. NMS, Soft-NMS, and WBF all underperform it by roughly 6–7 points of F1 and 5–6 points of accuracy. If you're ensembling detection models, NMW (or something similar that includes confidence-weighted voting on top of geometric fusion) is clearly the right starting choice. Plain NMS as your ensemble is leaving real performance on the table.
Second, and this is the more honest read of the table: the NMW ensemble doesn't beat the best individual model on any of the five metrics. EfficientDet-B7 alone has higher F1 (0.9692 vs 0.9610) and accuracy (96.86% vs 95.50%), and it also leads on precision (0.9684 vs 0.9589). Faster R-CNN has the highest AP@0.5 (0.9674 vs 0.9553), and RF-DETR the highest recall (0.9676 vs 0.9576). The ensemble is competitive with the best individual model, not dominant over it.
So what does ensembling actually buy you here?
What ensembling actually buys you
The metric-by-metric view above understates the real value. Three reasons.
Reason 1: Robustness across the long tail. Headline metrics on a 207-image refined test set average over many cases. The interesting cases — subtle hairline fractures, oblique projections, post-surgical hardware confounding the image — are a small fraction. On those cases, the ensemble's coverage of three different inductive biases matters far more than the average F1 difference. The metric you'd want is "how often does at least one model catch the hard case," and ensembling improves that even when the average doesn't move much.
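The "at least one model catches it" metric is easy to compute once you have per-case hits for each model. The sketch below uses made-up outcomes on five hypothetical hard cases purely to show the calculation; the numbers are not results from the paper, and the function and model names are placeholders.

```python
def catch_rates(hits_by_model):
    """hits_by_model: {model_name: [bool per fracture case]}, same case order.

    Returns (per-model recall, fraction of cases caught by at least one model).
    """
    n_cases = len(next(iter(hits_by_model.values())))
    per_model = {name: sum(hits) / n_cases for name, hits in hits_by_model.items()}
    # zip(*...) regroups the hits case by case across models.
    any_model = sum(any(case) for case in zip(*hits_by_model.values())) / n_cases
    return per_model, any_model


# Illustrative data only.
hits = {
    "faster_rcnn": [True, True, False, True, False],
    "efficientdet": [True, False, True, True, False],
    "rf_detr": [False, True, True, True, False],
}
per_model, any_model = catch_rates(hits)
```

In this toy example every model catches 60% of the cases on its own, but 80% of cases are caught by at least one of them, because their misses fall on different cases.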
Reason 2: Reducing per-model failure correlations. EfficientDet's failure modes (e.g., on cluttered scenes) are different from RF-DETR's (on tight box geometry), which are different from Faster R-CNN's (on subtle features). A single-model deployment can't choose the best model per image; it runs the same model on every image. If you pick EfficientDet because of its best-on-average F1, you're vulnerable to its specific failure modes on the cases that matter. The ensemble pools the three models' errors the same way it pools their metrics, so no single model's blind spot decides the outcome on its own.
Reason 3: Confidence calibration. A single model can be confidently wrong. Three models agreeing at moderate confidence is a stronger signal than one model alone at high confidence. The fused confidence score reflects that agreement, which is useful downstream when a radiologist is deciding whether to verify or skip an AI flag.
The honest framing: the ensemble isn't strictly better than the best individual model on the curated test set. But it's the model you ship to production, because in production you don't get to pick which test case you're handling. You handle all of them. Ensembling is variance reduction more than bias reduction — it makes the worst case better, even when the average case is similar.
If we'd had to ship one of the three models alone, EfficientDet-B7 would have been the right pick on these numbers. We shipped the ensemble.
What this isn't (yet)
A few honest limitations worth surfacing:
Binary detection only. No subtype classification (avulsion, comminuted, pathological, etc.). The system is a triage assistant, not an orthopedic decision-support tool. A radiologist still confirms and subtypes.
Single-view input. The current system processes each X-ray independently. Multi-view fusion (combining AP and lateral projections of the same shoulder) is a clear next step — clinical radiologists routinely use both views together.
Adult anatomy. Pediatric shoulders have growth plates that look like cortical disruptions to an inexperienced model. We didn't validate on pediatric cases. A separate model or fine-tune would be needed.
No prospective trial yet. All results are on a retrospective held-out test set. Real prospective evaluation in a live clinical environment is in scope for the next phase of work but hasn't happened yet.
What's next
The team is working on:
- Multi-view fusion. Combining AP and lateral predictions for the same shoulder into a single, more confident decision.
- Subtype classification. Adding a second-stage classifier on positive boxes to suggest fracture subtype, while keeping the binary triage at the front for speed.
- Broader anatomy coverage. Extending the same ensemble approach to other underrepresented anatomies — clavicle, scapula, sternum.
A few follow-up posts I'm planning that drill into specific pieces of this work:
- A practitioner's comparison of object detectors for medical imaging. Faster R-CNN vs EfficientDet vs RF-DETR, in more depth than the architectural overview above. What I'd pick when.
- IoU-clustered, confidence-weighted ensemble fusion explained. A standalone walkthrough of NMW and why it beats NMS / Soft-NMS / WBF on this task, with code-level detail.
- When oriented bounding boxes beat axis-aligned ones. A different but related project, on rotated-fracture detection with YOLOv11-OBB.
Closing
Three architectures, one ensemble, ten thousand radiographs of source data, a refined test set that's small enough to honestly report on, and a result that's strong but doesn't need to be oversold to be useful. The paper is here if you want the full methodology, the ablations, and the related work.
If you're working on something similar — fracture detection, medical-imaging detection generally, or ensemble fusion in any domain — I'd genuinely like to compare notes. Reach out via the contact form.
Part of an ongoing series on production medical imaging. The companion year-one reflection post is here; the Gemini-vs-CNN clinical-QC research note is here; the Windows-installer engineering deep-dive is here.