IoU-Clustered, Confidence-Weighted Ensemble Fusion in Plain English
How NMW fuses object-detection boxes from multiple models — a walkthrough with a synthetic example, pseudocode, and an honest comparison to NMS, Soft-NMS, and WBF.
A walkthrough of how Non-Maximum Weighted (NMW) ensemble fusion works, with a synthetic example, Python pseudocode, and the honest answer to "but isn't WBF supposed to be better?"
In the shoulder-fracture ensemble post, I mentioned that we used Non-Maximum Weighted (NMW) fusion to combine predictions from three detectors. This post is the standalone explanation of what NMW actually does.
The short version: you have N object-detection models, each producing its own set of boxes for the same image. You need to collapse them into one set of final predictions. NMS picks winners. WBF averages clusters. NMW averages around the most confident box. Each has tradeoffs, and the right pick depends on your data more than the literature suggests.
The setup
Say you've run three detectors on a shoulder X-ray and each one produces a slightly different bounding box around what it thinks is a fracture. Maybe one of them also produces a low-confidence false positive elsewhere on the image.
A concrete (synthetic) cluster on the actual fracture:
| Source | Box (x1, y1, x2, y2) | Confidence |
|---|---|---|
| Faster R-CNN | (295, 398, 348, 458) | 0.85 |
| EfficientDet | (302, 405, 355, 465) | 0.92 |
| RF-DETR | (290, 395, 360, 470) | 0.78 |
Three boxes, overlapping but not identical, varying confidences. Ground truth fracture (which the algorithm doesn't see) is roughly at (300, 400, 350, 460).
Plus a spurious detection from Faster R-CNN elsewhere on the image:
| Source | Box | Confidence |
|---|---|---|
| Faster R-CNN | (200, 500, 220, 520) | 0.55 |
The question: what does the final ensemble output?
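Every method below leans on IoU (intersection over union) to decide whether two boxes are "the same detection". Here is a minimal sketch of that measure applied to the synthetic boxes above. The helper name `iou` and the variable names are my own, chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


faster_rcnn = (295, 398, 348, 458)
efficientdet = (302, 405, 355, 465)
rf_detr = (290, 395, 360, 470)
spurious = (200, 500, 220, 520)

print(round(iou(efficientdet, faster_rcnn), 2))  # 0.62
print(round(iou(efficientdet, rf_detr), 2))      # 0.61
print(iou(efficientdet, spurious))               # 0.0
```

The three fracture boxes all overlap each other above a 0.5 threshold, and the spurious box overlaps none of them, which is why every method below treats them as one cluster plus one loner.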
What each method actually does
NMS (Non-Maximum Suppression)
The crudest approach. Sort all boxes by confidence. Take the highest-confidence box. Discard every other box that overlaps with it above some IoU threshold (say 0.5). Repeat with the remaining boxes.
For our cluster:
- EfficientDet's box (0.92) wins.
- Faster R-CNN's box (0.85) overlaps → discarded.
- RF-DETR's box (0.78) overlaps → discarded.
- The spurious detection (0.55) doesn't overlap the winner, so NMS keeps it as a separate prediction; whether it reaches the final output depends on your confidence cutoff.
NMS output: (302, 405, 355, 465) at confidence 0.92.
The problem: NMS throws away two-thirds of the information you went to the trouble of generating. The discarded boxes might have had more accurate corners than the surviving one.
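The greedy loop is short enough to sketch in full. This is a minimal, single-class version written for this post's synthetic data; the `(box, confidence)` tuple layout and the function names are assumptions for illustration, not any library's API.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over (box, confidence) pairs; returns the pairs that survive."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest-confidence box still in play
        kept.append(best)
        # Drop everything that overlaps the winner above the threshold.
        remaining = [d for d in remaining if iou(best[0], d[0]) <= iou_threshold]
    return kept


detections = [
    ((295, 398, 348, 458), 0.85),  # Faster R-CNN
    ((302, 405, 355, 465), 0.92),  # EfficientDet
    ((290, 395, 360, 470), 0.78),  # RF-DETR
    ((200, 500, 220, 520), 0.55),  # spurious Faster R-CNN box
]
kept = nms(detections)
print(kept)  # EfficientDet's box at 0.92, then the spurious box at 0.55
```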
Soft-NMS
A gentler variant. Instead of discarding overlapping boxes, decay their confidence by a function of their overlap with the winning box. Then re-sort and continue.
For our cluster, Faster R-CNN's box gets its confidence decayed (say from 0.85 to 0.45). RF-DETR's confidence drops similarly. They're still in the candidate pool but at lower scores. The final output for this cluster is still the EfficientDet box, just with lower-confidence siblings hanging around.
Soft-NMS output: Still effectively (302, 405, 355, 465). Doesn't change the answer for this cluster, just retains more siblings for downstream filtering.
The improvement over NMS is mostly relevant in dense-object scenes (crowds, traffic) where you want to keep multiple nearby detections. For a single-object fracture on a clean X-ray, Soft-NMS and NMS are basically equivalent.
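For comparison, here is a sketch of the Gaussian flavor of Soft-NMS, where each overlapping score is multiplied by exp(-IoU² / σ). The value σ = 0.5 and the tuple layout are assumptions for illustration; the decayed scores depend on the decay function and on σ, so they need not match the illustrative 0.45 used above.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms_gaussian(detections, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS: decay overlapping scores instead of discarding boxes."""
    pool = list(detections)
    ranked = []
    while pool:
        pool.sort(key=lambda d: d[1], reverse=True)
        best_box, best_score = pool.pop(0)
        ranked.append((best_box, best_score))
        decayed = []
        for box, score in pool:
            new_score = score * math.exp(-(iou(best_box, box) ** 2) / sigma)
            if new_score >= score_threshold:
                decayed.append((box, new_score))
        pool = decayed
    return ranked


detections = [
    ((295, 398, 348, 458), 0.85),  # Faster R-CNN
    ((302, 405, 355, 465), 0.92),  # EfficientDet
    ((290, 395, 360, 470), 0.78),  # RF-DETR
    ((200, 500, 220, 520), 0.55),  # spurious Faster R-CNN box
]
ranked = soft_nms_gaussian(detections)
for box, score in ranked:
    print(box, round(score, 2))
# EfficientDet stays on top at 0.92; Faster R-CNN's box decays to about 0.39
# and RF-DETR's (decayed twice) to about 0.18, so both rank below the
# untouched spurious box at 0.55.
```

That last detail is worth noticing: with these settings the siblings of the winner end up scoring below the false positive, which is one reason Soft-NMS adds little for a single-object cluster like this one.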
WBF (Weighted Box Fusion)
A different philosophy. Instead of picking a box, fuse the cluster into one new box by taking a weighted average of the coordinates, weighted by confidence. Crucially, WBF also scales the final confidence by how many models contributed — three models agreeing at moderate confidence is more trustworthy than one model at high confidence.
For our cluster, weighted coordinate averages (rough math):
fused_x1 = (295 × 0.85 + 302 × 0.92 + 290 × 0.78) / (0.85 + 0.92 + 0.78) ≈ 296
fused_y1 = (398 × 0.85 + 405 × 0.92 + 395 × 0.78) / 2.55 ≈ 400
fused_x2 = (348 × 0.85 + 355 × 0.92 + 360 × 0.78) / 2.55 ≈ 354
fused_y2 = (458 × 0.85 + 465 × 0.92 + 470 × 0.78) / 2.55 ≈ 464
WBF output: (296, 400, 354, 464) at confidence 0.85. That is the cluster's mean confidence, (0.85 + 0.92 + 0.78) / 3, times an agreement factor of 3/3 because all three models contributed.
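Notice the fused box is closer to ground truth (300, 400, 350, 460) than the box NMS kept: its IoU with ground truth is about 0.80, versus about 0.75 for EfficientDet's box and 0.57 for RF-DETR's. It does not beat every individual box here (Faster R-CNN's scores about 0.82), but you can't know in advance which model will be closest on a given image. That's WBF's main pitch: when individual models miss slightly in different directions, the average tends to be more accurate than the single box you would otherwise have picked.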
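The arithmetic above is easy to check in a few lines. This sketch only fuses one already-formed cluster; the confidence rule (mean confidence times min(cluster size, number of models) / number of models) follows the WBF paper's default as I understand it, and the variable names are mine.

```python
cluster = [
    ((295, 398, 348, 458), 0.85),  # Faster R-CNN
    ((302, 405, 355, 465), 0.92),  # EfficientDet
    ((290, 395, 360, 470), 0.78),  # RF-DETR
]
n_models = 3

# Confidence-weighted average of each coordinate.
total_conf = sum(conf for _, conf in cluster)
fused_box = tuple(
    sum(box[c] * conf for box, conf in cluster) / total_conf for c in range(4)
)

# Mean confidence, scaled down when fewer models than n_models contributed.
mean_conf = total_conf / len(cluster)
fused_conf = mean_conf * min(len(cluster), n_models) / n_models

print([round(v) for v in fused_box])  # [296, 400, 354, 464]
print(round(fused_conf, 2))           # 0.85
```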
NMW (Non-Maximum Weighted)
A close cousin of WBF with one key difference: NMW uses the highest-confidence box as the anchor for the cluster, and weights neighbors by their IoU with that anchor (not their raw confidence). The final box is an IoU-weighted average around the anchor. The final confidence stays as the anchor's confidence (NMW doesn't scale by cluster size).
For our cluster, EfficientDet's box (0.92) is the anchor. Weights are IoU between each box and the anchor, computed from the coordinates in the table:
IoU(anchor, FasterRCNN) ≈ 0.62 (above the 0.5 clustering threshold)
IoU(anchor, RF-DETR) ≈ 0.61 (above the 0.5 clustering threshold)
IoU(anchor, anchor) = 1.00
NMW averages the coordinates weighted by these IoUs:
fused_x1 = (295 × 0.62 + 302 × 1.00 + 290 × 0.61) / (0.62 + 1.00 + 0.61) ≈ 297
... (similar for other coordinates)
NMW output: (~297, 400, 354, 464) at confidence 0.92 (anchor's confidence, unchanged).
The fused box is similar to WBF's but anchored more tightly to the highest-confidence prediction. The confidence reflects the anchor model's certainty, not a cluster-size adjustment.
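Here is the same cluster worked through in code, computing the IoU weights directly from the box coordinates. It follows this post's description of NMW (pure IoU-with-anchor weights); published NMW variants may also fold each box's confidence into the weight, which this sketch does not do. Names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


anchor = (302, 405, 355, 465)  # EfficientDet, confidence 0.92
cluster = [anchor, (295, 398, 348, 458), (290, 395, 360, 470)]

# Each box is weighted by how much it overlaps the anchor.
weights = [iou(anchor, box) for box in cluster]
total = sum(weights)
fused_box = tuple(
    sum(box[c] * w for box, w in zip(cluster, weights)) / total for c in range(4)
)
fused_conf = 0.92  # the anchor's confidence, carried over unchanged

print([round(w, 2) for w in weights])  # [1.0, 0.62, 0.61]
print([round(v) for v in fused_box])   # [297, 400, 354, 464]
```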
The pseudocode
NMW in clean Python:
from typing import NamedTuple


# Minimal helpers so nmw_fuse runs standalone.
class Prediction(NamedTuple):
    box: tuple  # (x1, y1, x2, y2)
    confidence: float


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nmw_fuse(predictions, iou_threshold=0.5):
    """
    Non-Maximum Weighted ensemble fusion.

    Args:
        predictions: list of Prediction objects with .box and .confidence
        iou_threshold: minimum IoU to group boxes into the same cluster
    Returns:
        list of fused Prediction objects
    """
    # 1. Sort by confidence descending
    predictions = sorted(predictions, key=lambda p: p.confidence, reverse=True)
    fused = []
    used = [False] * len(predictions)
    for i, anchor in enumerate(predictions):
        if used[i]:
            continue
        # 2. The anchor is the highest-confidence unprocessed prediction
        cluster = [anchor]
        used[i] = True
        # 3. Find all subsequent boxes overlapping with the anchor
        for j in range(i + 1, len(predictions)):
            if used[j]:
                continue
            if iou(anchor.box, predictions[j].box) >= iou_threshold:
                cluster.append(predictions[j])
                used[j] = True
        # 4. Weighted average using IoU-with-anchor as weights
        #    (the anchor's own weight is IoU(anchor, anchor) = 1.0)
        weights = [1.0] + [iou(anchor.box, p.box) for p in cluster[1:]]
        total = sum(weights)
        fused_box = tuple(
            sum(p.box[c] * w for p, w in zip(cluster, weights)) / total
            for c in range(4)
        )
        # 5. NMW keeps the anchor's confidence (does NOT scale by cluster size)
        fused.append(Prediction(box=fused_box, confidence=anchor.confidence))
    return fused
Five steps, each one independently easy. The whole method is a couple dozen lines.
Compare to WBF, which is structurally similar but with two differences:
# In WBF, the cluster's "reference box" updates as new boxes are added:
cluster_box = weighted_average(cluster_so_far)
# (instead of staying as the original anchor)
# And the final confidence is the cluster's mean confidence, scaled by cluster size:
mean_confidence = sum(p.confidence for p in cluster) / cluster_size
fused_confidence = mean_confidence * min(cluster_size, n_models) / n_models
# (instead of taking the anchor's confidence)
Those two changes, plus the choice of weights (confidence in WBF, IoU with the anchor in NMW), are the entire difference between NMW and WBF, mechanically.
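To see those differences in running form, here is a minimal single-class WBF sketch. It is simplified on purpose (no per-model weights, no class labels), the function names and tuple layout are my own, and the 0.5 threshold is just the value used elsewhere in this post.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse(cluster):
    """Confidence-weighted average box plus mean confidence for one cluster."""
    total = sum(conf for _, conf in cluster)
    box = tuple(sum(b[c] * conf for b, conf in cluster) / total for c in range(4))
    return box, total / len(cluster)


def wbf_fuse(detections, n_models, iou_threshold=0.5):
    clusters = []  # member (box, conf) pairs per cluster
    fused = []     # current fused (box, conf) per cluster
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        # Match against each cluster's *current fused box*, not a fixed anchor.
        best_idx, best_iou = -1, iou_threshold
        for idx, (fused_box, _) in enumerate(fused):
            overlap = iou(fused_box, det[0])
            if overlap > best_iou:
                best_idx, best_iou = idx, overlap
        if best_idx < 0:
            clusters.append([det])
            fused.append(det)
        else:
            clusters[best_idx].append(det)
            fused[best_idx] = fuse(clusters[best_idx])  # reference box evolves
    # Scale confidence by how many of the n_models contributed.
    return [
        (box, conf * min(len(members), n_models) / n_models)
        for (box, conf), members in zip(fused, clusters)
    ]


detections = [
    ((295, 398, 348, 458), 0.85),  # Faster R-CNN
    ((302, 405, 355, 465), 0.92),  # EfficientDet
    ((290, 395, 360, 470), 0.78),  # RF-DETR
    ((200, 500, 220, 520), 0.55),  # spurious Faster R-CNN box
]
result = wbf_fuse(detections, n_models=3)
for box, conf in result:
    print([round(v) for v in box], round(conf, 2))
# [296, 400, 354, 464] 0.85
# [200, 500, 220, 520] 0.18
```

The second line of output is the cluster-size scaling at work: the spurious box was seen by one model out of three, so its 0.55 is cut to about 0.18.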
So which one wins?
This is where it gets interesting.
The WBF paper (Solovyev et al., 2019) explicitly argues WBF generally beats NMW because (a) WBF's evolving cluster center captures geometric drift better, and (b) WBF's cluster-size confidence scaling rewards agreement among many models. The paper has experiments backing this on COCO and Open Images.
Our shoulder-fracture paper showed the opposite. On the refined test set, NMW reached F1=0.9610 while WBF only reached 0.8889. That's a 7-point gap in NMW's favor. The same library, the same models, the same data — different fusion method, very different result.
A few hypotheses for why NMW won on this task:
- High agreement among the three detectors. When all three models confidently agree on the fracture (which is the common case for non-subtle fractures), NMW's anchor-based averaging is near-optimal. WBF's evolving-center approach mostly adds variance for no benefit.
- Confidence calibration differs across models. WBF's confidence scaling assumes each model's confidence is comparably calibrated. Faster R-CNN, EfficientDet, and RF-DETR have different calibration characteristics, so combining their confidences via WBF's scheme might introduce noise that NMW's "trust the anchor" approach sidesteps.
- Implementation / parameter sensitivity. Both methods have IoU thresholds and per-model weights that need tuning. It's possible our WBF run wasn't fully parameter-tuned for this dataset and a different config would close the gap. We didn't do an exhaustive sweep.
The honest takeaway isn't "NMW is better than WBF." It's: the fusion method that wins depends on your data, your models' agreement structure, and your confidence calibration. The literature defaults assume general-domain detection (COCO-style). Medical-imaging detection — with fewer classes, fewer objects per image, and tighter cross-model agreement — can have different optimal choices.
When to pick what
A rough decision guide based on this work plus the literature:
| Situation | Best starting choice |
|---|---|
| Single model, dense scenes (crowds, traffic) | NMS or Soft-NMS |
| Single model, sparse scenes | NMS |
| Ensemble, models disagree often, general domain | WBF |
| Ensemble, high cross-model agreement (specialized domain) | NMW |
| Ensemble, calibration differs across models | NMW (avoids confidence-scaling artifacts) |
| You don't know which | Try all four with ensemble-boxes, pick on validation F1 |
In all cases: don't reimplement these from scratch. The ensemble-boxes Python package (pip install ensemble-boxes) has solid implementations of all four methods. The implementation pseudocode above is for understanding what they do, not for production use.
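A sketch of what "try all four" can look like with the package. The function names and keyword arguments are the ones I know from the ensemble-boxes README, so check them against your installed version. The 1024×1024 image size is a made-up assumption, needed only because the library expects box coordinates normalized to [0, 1], and `normalize` and `fuse_all` are hypothetical helpers.

```python
def normalize(boxes, width, height):
    """Scale pixel (x1, y1, x2, y2) boxes into the [0, 1] range."""
    return [
        [x1 / width, y1 / height, x2 / width, y2 / height]
        for x1, y1, x2, y2 in boxes
    ]


def fuse_all(boxes_list, scores_list, labels_list, weights=None,
             iou_thr=0.5, skip_box_thr=0.0001, sigma=0.1):
    """Run all four fusion methods; returns {name: (boxes, scores, labels)}."""
    # Imported lazily so normalize() is usable without the package installed.
    from ensemble_boxes import (
        nms, soft_nms, non_maximum_weighted, weighted_boxes_fusion,
    )
    return {
        "nms": nms(boxes_list, scores_list, labels_list,
                   weights=weights, iou_thr=iou_thr),
        "soft_nms": soft_nms(boxes_list, scores_list, labels_list,
                             weights=weights, iou_thr=iou_thr,
                             sigma=sigma, thresh=skip_box_thr),
        "nmw": non_maximum_weighted(boxes_list, scores_list, labels_list,
                                    weights=weights, iou_thr=iou_thr,
                                    skip_box_thr=skip_box_thr),
        "wbf": weighted_boxes_fusion(boxes_list, scores_list, labels_list,
                                     weights=weights, iou_thr=iou_thr,
                                     skip_box_thr=skip_box_thr),
    }


if __name__ == "__main__":
    W, H = 1024, 1024  # assumed image size, for illustration only
    boxes_list = [  # one list per model: Faster R-CNN, EfficientDet, RF-DETR
        normalize([(295, 398, 348, 458), (200, 500, 220, 520)], W, H),
        normalize([(302, 405, 355, 465)], W, H),
        normalize([(290, 395, 360, 470)], W, H),
    ]
    scores_list = [[0.85, 0.55], [0.92], [0.78]]
    labels_list = [[0, 0], [0], [0]]
    try:
        results = fuse_all(boxes_list, scores_list, labels_list,
                           weights=[1.0, 1.1, 1.0])
    except ImportError:
        print("pip install ensemble-boxes to run the comparison")
    else:
        for name, (boxes, scores, labels) in results.items():
            print(name, len(boxes), "boxes")
```

From here, score each method's output against your validation annotations and keep whichever gives the best F1.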
Practical notes from shipping this
A few things that aren't in any paper:
- Per-model weights matter more than you'd think. Both NMW and WBF support per-model weights. Setting weights proportional to each model's validation F1 (e.g., EfficientDet=1.1, Faster R-CNN=1.0, RF-DETR=1.0) gives a few F1 points over uniform weighting.
- IoU threshold needs tuning per dataset. The default 0.5 IoU threshold for clustering is reasonable but rarely optimal. For shoulder fractures (where boxes are small and one missed bone fragment is a clinical miss), we used a slightly lower threshold (0.45) so near-overlaps still clustered.
- You still need a final confidence threshold. Fusion gives you ranked predictions; you still pick a confidence cutoff at deployment. That cutoff is set by clinical sensitivity/specificity tradeoffs, not by the fusion method.
- Fusion isn't free. Running three models sequentially costs the sum of their inference times: at least 3× the cheapest model and at most 3× the slowest. Run in parallel, latency approaches that of the slowest model alone, though total compute is unchanged. Worth it for high-stakes tasks. Probably overkill for triage-only screening.
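On the confidence-cutoff point, one simple way to turn a sensitivity target into a number is to sweep the cutoff over fused validation predictions. The sketch below assumes you have already matched each fused prediction to ground truth; the data, the 0.8 target, and the function name `pick_threshold` are all hypothetical.

```python
def pick_threshold(scored_preds, n_positives, target_sensitivity):
    """Highest confidence cutoff whose sensitivity still meets the target.

    scored_preds: (confidence, is_true_positive) pairs for fused predictions.
    n_positives: number of ground-truth findings in the validation set.
    Returns (threshold, sensitivity, precision), or None if unreachable.
    """
    for threshold in sorted({conf for conf, _ in scored_preds}, reverse=True):
        kept = [is_tp for conf, is_tp in scored_preds if conf >= threshold]
        true_positives = sum(kept)
        sensitivity = true_positives / n_positives
        if sensitivity >= target_sensitivity:
            return threshold, sensitivity, true_positives / len(kept)
    return None


# Hypothetical validation output: 7 fused predictions, 5 real fractures.
validation = [
    (0.92, True), (0.88, True), (0.75, False), (0.71, True),
    (0.60, True), (0.40, False), (0.35, True),
]
choice = pick_threshold(validation, n_positives=5, target_sensitivity=0.8)
print(choice)  # (0.6, 0.8, 0.8)
```

The same sweep works whichever fusion method produced the predictions, which is the point: the cutoff is a clinical decision layered on top of fusion, not part of it.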
Closing
NMW and WBF are five-step algorithms that differ in two design choices (fixed anchor vs evolving center; anchor's confidence vs cluster-scaled confidence). They're more sensitive to your data than the papers suggest. The right move is rarely "default to whatever the recent paper says wins"; it's "try all four on your validation set with the ensemble-boxes library and pick the one that actually works."
If you're building an object-detection ensemble in any domain, that's a couple of evenings' worth of experimentation that pays for itself many times over.
Part of an ongoing series on production medical imaging. The companion shoulder-fracture ensemble post is here; the year-one reflection is here.