When Oriented Bounding Boxes Beat Axis-Aligned: Lessons From Rotated Fractures

Oriented bounding boxes (OBB) aren't just for aerial imagery and scene text. Rotated medical pathologies benefit too — here's what YOLOv11-OBB taught us on a shoulder-fracture dataset, and when OBB isn't worth the cost.

June 15, 2026 · Saianiruth M

Most object-detection content treats axis-aligned bounding boxes as the default. For natural images and most public benchmarks they are the default — for good reason. But for rotated pathologies in medical radiographs, axis-aligned boxes waste a startling fraction of their pixels on background. This is a writeup of what we measured when we switched to OBB on a shoulder-fracture detection task, and what the numbers do — and don't — say.

A bounding box has one job: enclose the object you want the model to find. Axis-aligned boxes do this well when the object is axis-aligned. When the object is rotated, an axis-aligned box has to expand in both dimensions to enclose the rotated extents. The bigger the rotation, the more background pixels you're forced to include.

For most natural-image detection tasks, this is fine. Cars on roads are roughly horizontal. People in photos are roughly vertical. The wasted background is small.

For medical imaging, it isn't fine. Fractures in shoulder radiographs follow the orientation of the bone they're in, which can be at any angle to the image axes depending on how the patient was positioned. Oblique pathologies, rotated anatomy, and angled fracture lines are common, not exceptions. The wasted-background ratio for axis-aligned boxes on these targets is consistently bad.

This post walks through what oriented bounding boxes (OBB) actually buy you in this setting — using results from our YOLOv8/v11/v12 experiments on a shoulder-fracture dataset — and where the tradeoffs flip the other way.

A two-minute OBB primer

An axis-aligned bounding box is parameterized by four numbers: (x_min, y_min, x_max, y_max). The box edges are parallel to the image axes by construction.

An oriented bounding box is parameterized by five numbers: (x_center, y_center, width, height, angle). The box can rotate around its center. When the object happens to be axis-aligned, OBB and axis-aligned boxes coincide. When it isn't, OBB fits the object more tightly.
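
A minimal sketch of the two parameterizations, assuming the angle is given in degrees and measured in image coordinates (x to the right, y down). The function names are illustrative and do not come from any library.

```python
import math


def obb_to_corners(cx, cy, w, h, angle_deg):
    """Return the four corners of an oriented box as (x, y) tuples."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half_w, half_h = w / 2.0, h / 2.0
    corners = []
    # Walk the corners of the unrotated box, rotate each offset, then translate.
    for dx, dy in ((-half_w, -half_h), (half_w, -half_h),
                   (half_w, half_h), (-half_w, half_h)):
        corners.append((cx + dx * cos_t - dy * sin_t,
                        cy + dx * sin_t + dy * cos_t))
    return corners


def enclosing_aabb(corners):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around a set of corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)


# At 0 degrees the two representations coincide; at 35 degrees the
# enclosing axis-aligned box is larger than the oriented one.
flat = enclosing_aabb(obb_to_corners(50, 50, 20, 10, 0))
tilted = enclosing_aabb(obb_to_corners(50, 50, 20, 10, 35))
```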

The cost is small but real:

One more parameter to predict (the angle). For a transformer-based detector this is a minor extension; for a YOLO-style detector it's a meaningful change to the detection head.
Angle ambiguity. A rectangle rotated by 0° and the same rectangle rotated by 180° are identical — the symmetry means the angle loss has to handle the wraparound. Common implementations parameterize angle as (cos(2θ), sin(2θ)) to avoid the discontinuity at 0/180°.
Annotation cost. Annotators draw four corner points instead of two opposite corners. Slightly more click work. Roughly 1.5× the labeling time per box in our experience.
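
The doubled-angle encoding mentioned in the angle-ambiguity item can be sketched as below. This only illustrates the (cos(2θ), sin(2θ)) idea, assuming angles in degrees; it is not the encoding of any particular detector, and the function names are hypothetical.

```python
import math


def encode_angle(theta_deg):
    """Encode an angle so that theta and theta + 180 map to the same point."""
    t = math.radians(theta_deg)
    return (math.cos(2.0 * t), math.sin(2.0 * t))


def decode_angle(c, s):
    """Recover an angle in [0, 180) degrees from its encoding."""
    return (math.degrees(math.atan2(s, c)) / 2.0) % 180.0


def encoding_distance(theta_a_deg, theta_b_deg):
    """Euclidean distance between two encoded angles."""
    return math.dist(encode_angle(theta_a_deg), encode_angle(theta_b_deg))


# 1 degree and 179 degrees describe nearly the same rectangle. Their raw
# difference is 178, but their encodings are close together, so a loss on
# the encoding does not see a jump at the 0/180 boundary.
near_wrap = encoding_distance(1.0, 179.0)
```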

Visually, the difference is what makes the case for OBB on rotated targets:

[Figure: Side-by-side comparison of axis-aligned vs oriented bounding boxes on a rotated fracture target. The axis-aligned box has to expand in both dimensions to enclose the 35-degree rotated pathology, capturing roughly 80% background pixels; the OBB hugs the pathology with about 17% background.]

For a 35° rotated pathology, the axis-aligned box wastes ~80% of its pixels on background. The OBB wastes ~17%. For downstream tasks that use the box content directly (cropping for refinement, attention masking, IoU calculation against ground truth), this matters.
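
The geometry behind numbers like these is easy to check. The sketch below computes how much of an enclosing axis-aligned box lies outside a tight rotated rectangle. The 8:1 aspect ratio is an assumption chosen for illustration, not the exact shape in the figure, so the result is indicative only.

```python
import math


def aabb_size(w, h, angle_deg):
    """Width and height of the axis-aligned box enclosing a w x h rectangle
    rotated by angle_deg."""
    t = math.radians(angle_deg)
    c, s = abs(math.cos(t)), abs(math.sin(t))
    return w * c + h * s, w * s + h * c


def background_fraction(w, h, angle_deg):
    """Fraction of the enclosing axis-aligned box NOT covered by the
    rotated rectangle."""
    big_w, big_h = aabb_size(w, h, angle_deg)
    return 1.0 - (w * h) / (big_w * big_h)


# Assumed 8:1 rectangle at 35 degrees: close to four fifths of the
# axis-aligned box falls outside the rotated rectangle.
wasted = background_fraction(8.0, 1.0, 35.0)
```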

Why this matters for medical imaging specifically

Three patterns in radiographic data make OBB worth considering:

Rotated anatomy. Shoulder X-rays in particular have wide patient-positioning variance. The humerus shaft is rarely parallel to the image y-axis. Fractures that follow the cortical bone orientation are therefore rarely parallel to either image axis. Axis-aligned boxes have to grow in both directions to enclose them.

Oblique pathologies. Linear pathologies — fracture lines, periosteal reactions, dissection planes — have a clear long axis. The aspect ratio of a tight OBB around a fracture line is often 6:1 or 8:1. The aspect ratio of the axis-aligned box around the same fracture is closer to 2:1 or 3:1, because both dimensions had to expand.

Tight packing. When multiple pathologies are close together (multiple fragments of a comminuted fracture, for instance), axis-aligned boxes overlap more than the underlying objects actually do. NMS and ensemble fusion both work on IoU, which means box-shape inaccuracy degrades the downstream pipeline even when the model's localization is correct.
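
To make the tight-packing point concrete, here is a sketch with two parallel fragments whose oriented boxes do not touch, yet whose axis-aligned boxes overlap heavily. The box sizes, angle, and spacing are made-up values for illustration.

```python
import math


def rotated_aabb(cx, cy, w, h, angle_deg):
    """Axis-aligned box (x1, y1, x2, y2) enclosing a rotated w x h rectangle."""
    t = math.radians(angle_deg)
    c, s = abs(math.cos(t)), abs(math.sin(t))
    half_w = (w * c + h * s) / 2.0
    half_h = (w * s + h * c) / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def aabb_iou(a, b):
    """Standard IoU of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


# Two 80 x 10 px boxes at 35 degrees. The second center is shifted 15 px along
# the boxes' short axis, so the oriented boxes are separated by a 5 px gap
# (rotated IoU is exactly 0 by construction).
angle = 35.0
t = math.radians(angle)
shift = 15.0
box_a = rotated_aabb(100.0, 100.0, 80.0, 10.0, angle)
box_b = rotated_aabb(100.0 - shift * math.sin(t), 100.0 + shift * math.cos(t),
                     80.0, 10.0, angle)
axis_aligned_iou = aabb_iou(box_a, box_b)
```

With these values the axis-aligned IoU comes out at roughly 0.5, above the 0.45 NMS threshold used in our experiments, so axis-aligned NMS would treat two distinct fragments as duplicates.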

None of these patterns appear in COCO or the natural-image benchmarks most detection research is tuned for. They're medical-imaging-specific reasons to revisit a default that's usually fine in the wider literature.

The experiment

We trained YOLOv8 and YOLOv11 in both detection (DET, axis-aligned) and oriented-bounding-box (OBB) configurations, and YOLOv12 in DET only, on the same shoulder-fracture dataset used in our arXiv ensemble paper repackage. That paper reported ensemble results on Faster R-CNN, EfficientDet, and RF-DETR; the YOLO-OBB work is a separate experimental line.

Setup:

Parameter	Value
Image size	640×640 (resized from 1024×1024)
Batch size	32
Epochs	100 (with early stopping)
Optimizer	AdamW / SGD (Ultralytics defaults)
Confidence threshold	0.25 at evaluation
NMS IoU threshold	0.45
Loss	Box loss + classification loss + DFL (Distribution Focal Loss for OBB regression)
Dataset sizes	2,000 images (initial) and 10,000 images (full)

The 2,000-image dataset gives a clean head-to-head between DET and OBB for YOLOv8 and YOLOv11, since both configurations were trained with identical settings (YOLOv12 was run in DET only). The 10,000-image dataset gives best-achievable numbers, but only OBB was run at this scale.
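
For readers who want to reproduce a comparable setup, the table maps onto Ultralytics arguments roughly as follows. This is a sketch, not our training script: the weights file, dataset YAML path, and patience value are placeholders you would supply, since the post does not state the model size or the early-stopping patience.

```python
# Values copied from the setup table; everything else stays at Ultralytics defaults.
TRAIN_ARGS = {"imgsz": 640, "batch": 32, "epochs": 100}
EVAL_ARGS = {"imgsz": 640, "conf": 0.25, "iou": 0.45}


def train_and_validate(weights, data_yaml, patience):
    """Train one configuration and return its validation metrics.

    weights   -- e.g. an OBB checkpoint for the OBB runs, a detection
                 checkpoint for the DET runs (placeholder, not from the post)
    data_yaml -- path to a dataset YAML in the matching label format
    patience  -- early-stopping patience in epochs (assumed, not from the post)
    """
    # Imported lazily so the settings above can be inspected without the package.
    from ultralytics import YOLO

    model = YOLO(weights)
    model.train(data=data_yaml, patience=patience, **TRAIN_ARGS)
    return model.val(data=data_yaml, **EVAL_ARGS)


# Hypothetical usage:
# metrics = train_and_validate("yolo11n-obb.pt", "shoulder_obb.yaml", patience=20)
```

Keeping the two dictionaries identical across the DET and OBB runs is what makes the head-to-head comparison meaningful; only the checkpoint and label format should differ.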

Results

[Figure: F1 across the YOLO variants. Bar chart comparing F1 scores across YOLOv8/v11/v12 in DET and OBB configurations on the 2,000-image dataset, and YOLOv8/v11-OBB on the 10,000-image dataset. YOLOv11-OBB on the full 10K dataset leads at F1 0.8335; on the 2K head-to-head, OBB beats DET on both YOLOv8 and YOLOv11.]

The actual numbers, broken out:

On the 2,000-image dataset (the clean head-to-head)

Variant	Accuracy	Precision	Recall	F1
YOLOv8-DET	76.97%	0.7731	0.7945	0.7836
YOLOv8-OBB	76.04%	0.7513	0.8300	0.7887
YOLOv11-DET	77.08%	0.7789	0.7866	0.7827
YOLOv11-OBB	78.30%	0.7929	0.7945	0.7937
YOLOv12-DET	77.96%	0.8162	0.7372	0.7747

OBB beats DET on F1 in both YOLOv8 (+0.005) and YOLOv11 (+0.011) on the 2,000-image dataset. The margin is small but consistent. YOLOv12 wasn't run with OBB here (the OBB head wasn't supported in our YOLOv12 setup at the time of these experiments).

A few things worth noticing in this table:

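OBB pushes recall up sharply on YOLOv8. YOLOv8-OBB hits recall 0.83 vs DET's 0.79, a meaningful jump for a screening application where missed fractures are the costlier error. Precision is slightly lower (0.75 vs 0.77), but for a triage tool that's the right side of the tradeoff. On YOLOv11 the pattern is different: OBB raises both precision (0.793 vs 0.779) and recall (0.795 vs 0.787), with the larger gain on precision.
YOLOv11-OBB is the best of the v8/v11 head-to-head pairs. DET F1 is essentially flat from v8 to v11 (0.7836 vs 0.7827), while OBB F1 rises from 0.7887 to 0.7937, so in these runs only the OBB configuration gained from the newer architecture.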
YOLOv12-DET has the highest precision (0.816) but the lowest recall (0.737). The newer architecture is more conservative. For screening this is the wrong direction; for a precision-critical confirmatory tool it might be the right choice. We didn't pursue it further.
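
As a quick consistency check, the F1 column in the 2,000-image table can be recomputed from the precision and recall columns. The sketch below does that; small differences in the last digit are expected because the inputs are already rounded to four decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall, reported F1) from the 2,000-image table.
RESULTS_2K = {
    "YOLOv8-DET": (0.7731, 0.7945, 0.7836),
    "YOLOv8-OBB": (0.7513, 0.8300, 0.7887),
    "YOLOv11-DET": (0.7789, 0.7866, 0.7827),
    "YOLOv11-OBB": (0.7929, 0.7945, 0.7937),
    "YOLOv12-DET": (0.8162, 0.7372, 0.7747),
}

# Largest gap between recomputed and reported F1 across all rows.
max_gap = max(abs(f1_score(p, r) - reported)
              for p, r, reported in RESULTS_2K.values())

for name, (p, r, reported) in RESULTS_2K.items():
    print(f"{name}: recomputed {f1_score(p, r):.4f}, reported {reported:.4f}")
```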

On the 10,000-image dataset (full data, OBB-only)

Variant	Accuracy	Precision	Recall	F1
YOLOv8-OBB	79.98%	0.7621	0.8992	0.8250
YOLOv11-OBB	82.26%	0.8215	0.8458	0.8335

YOLOv11-OBB on the full dataset is the best overall result we measured: F1 = 0.8335, accuracy 82.26%. Both OBB models improve with five times the training data (YOLOv11-OBB F1 goes from 0.7937 to 0.8335), but without matched DET-on-10K runs we can't say whether the +0.011 F1 advantage over DET seen on 2K data holds, grows, or shrinks at this scale. (A clean DET-on-10K experiment is on the list.)

Recall is the standout: 0.8992 for YOLOv8-OBB and 0.8458 for YOLOv11-OBB. For a screening application, roughly 90% recall on subtle fractures is the kind of number that translates into real clinical value when paired with radiologist confirmation. The precision cost (0.76 for YOLOv8-OBB, against 0.82 for the more conservative YOLOv12-DET on the 2,000-image set) is acceptable because false positives are easier to manage than false negatives.

Training gotchas

Things that bit us during the OBB training runs, in case they save you the same time:

Angle parameterization. Ultralytics' YOLO-OBB uses an angle representation that handles the 180° symmetry internally — but only if your annotations are consistent. If half your training labels parameterize the long axis as horizontal-first and half as vertical-first, the model learns nothing about angle, because the loss is averaging across two valid solutions. Normalize annotation conventions across the entire dataset before training.
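
One way to normalize conventions, assuming your labels are stored as (width, height, angle in degrees) before conversion to the training format: force the long side to be the width and wrap the angle into [0, 180). The function name is illustrative.

```python
def canonicalize(w, h, angle_deg):
    """Return an equivalent (w, h, angle) with w >= h and angle in [0, 180).

    A w x h box at angle a covers the same pixels as an h x w box at a + 90,
    and any box is unchanged by adding 180 to its angle.
    """
    if h > w:
        w, h = h, w
        angle_deg += 90.0
    return w, h, angle_deg % 180.0


# The same physical box written in three different conventions
# collapses to a single label.
label_a = canonicalize(10.0, 80.0, 20.0)    # vertical-first
label_b = canonicalize(80.0, 10.0, 110.0)   # horizontal-first
label_c = canonicalize(80.0, 10.0, -70.0)   # negative angle
```

Near-square boxes stay ambiguous under any such rule, since swapping sides changes almost nothing; those are worth flagging for manual review rather than trusting the canonical angle.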

DFL (Distribution Focal Loss) is the right default for OBB regression. OBB predicts continuous box coordinates plus a continuous angle. DFL recasts the box-coordinate regression as a soft classification over discrete bins, which trains more stably than direct regression. The Ultralytics implementation already uses DFL for OBB; don't override it unless you have a specific reason.

Augmentation policy matters more for OBB. Random rotation augmentation has to apply the same rotation to both the image and the labeled boxes' angles. Ultralytics handles this correctly for OBB by default — but if you're hand-rolling augmentation or using a library that wasn't OBB-aware, you can end up with images that have been rotated by 45° while the OBB angles haven't been updated. The result is a training set that teaches the model the rotation isn't real. Sanity-check augmented batches visually before kicking off a long training run.
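
If you are hand-rolling rotation augmentation, a check along these lines catches the stale-angle bug before training. It is a sketch under one assumed convention (labels as (cx, cy, w, h, angle in degrees), image and labels rotated with the same sign about the image center); a real augmentation library may use the opposite sign, so verify against rendered batches as well.

```python
import math


def rotate_point(x, y, cx, cy, angle_deg):
    """Rotate (x, y) about (cx, cy) by angle_deg."""
    t = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(t) - dy * math.sin(t),
            cy + dx * math.sin(t) + dy * math.cos(t))


def rotate_obb(obb, image_center, angle_deg):
    """Apply an image rotation to an OBB label (cx, cy, w, h, angle_deg)."""
    cx, cy, w, h, box_angle = obb
    new_cx, new_cy = rotate_point(cx, cy, image_center[0], image_center[1],
                                  angle_deg)
    # The center moves with the image AND the box angle advances by the same amount.
    return (new_cx, new_cy, w, h, (box_angle + angle_deg) % 180.0)


def obb_corners(cx, cy, w, h, angle_deg):
    """Corners of an OBB, used here only for the consistency check."""
    return [rotate_point(cx + dx, cy + dy, cx, cy, angle_deg)
            for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2),
                           (w / 2, h / 2), (-w / 2, h / 2))]


def label_matches_pixels(original, augmented, image_center, angle_deg, tol=1e-6):
    """True if the augmented label's corners coincide with the original
    corners after rotating them the way the image pixels were rotated."""
    expected = [rotate_point(x, y, image_center[0], image_center[1], angle_deg)
                for x, y in obb_corners(*original)]
    actual = obb_corners(*augmented)
    return all(min(math.dist(p, q) for q in actual) <= tol for p in expected)


center = (50.0, 50.0)
label = (30.0, 40.0, 20.0, 6.0, 35.0)
good = rotate_obb(label, center, 45.0)
# The bug described above: center moved, angle left untouched.
stale = (good[0], good[1], label[2], label[3], label[4])
```

Here `label_matches_pixels(label, good, center, 45.0)` holds while the same call with `stale` fails, which is exactly the mismatch a visual check of augmented batches would reveal.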

The OBB head needs a slightly slower learning rate. In early experiments the OBB head was unstable for the first few epochs — angle predictions oscillated wildly while box coordinates converged. Dropping the learning rate by 30-50% relative to the DET baseline fixed it. Ultralytics' default schedule mostly handles this with warmup, but worth knowing if you customize.

When OBB isn't worth it

The OBB advantage isn't universal. Cases where we'd default back to axis-aligned:

Targets that are inherently axis-aligned. Chest radiograph QC tasks (detecting whether the image is rotated, whether artifacts are present, whether the patient was correctly positioned) involve targets defined relative to the image axes by their nature. OBB adds parameters without adding signal. Use DET.
Very small targets (under 20 pixels on the long axis). The angle parameterization adds noise that overwhelms the geometric benefit. The IoU difference between a 5° rotation and a 10° rotation on a 15-pixel-long object is below the model's discrimination threshold anyway.
Tight runtime constraints. OBB inference is slightly slower than DET (an additional rotation matrix per prediction, plus rotated NMS rather than standard NMS). On a workload measured in milliseconds of headroom, this matters. On a workload measured in seconds, it doesn't.
Annotators who can't reliably mark angle. OBB labeling quality is more variable across annotators than axis-aligned labeling. If you're getting your annotations from a service that doesn't have explicit OBB expertise, you may be paying for inputs the model can't actually learn from. Spot-check OBB annotations on a 100-sample subset before committing to the labeling cost.
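
For the very-small-target case, a back-of-the-envelope calculation shows why angle carries so little signal. The sketch below computes how far the end of a box's long axis moves when the box rotates about its center; the function name is illustrative.

```python
import math


def endpoint_shift_px(length_px, delta_deg):
    """Distance (in pixels) an endpoint of the long axis moves when the box
    rotates by delta_deg about its center (chord length at radius length/2)."""
    radius = length_px / 2.0
    return 2.0 * radius * math.sin(math.radians(delta_deg) / 2.0)


# A 5-degree change moves the tip of a 15 px object by well under a pixel,
# but moves the tip of a 150 px object by several pixels.
small_target = endpoint_shift_px(15.0, 5.0)
large_target = endpoint_shift_px(150.0, 5.0)
```

At 15 pixels the shift is about two thirds of a pixel, which is within annotation noise; at 150 pixels the same angle error is clearly visible.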

Practical recommendations

If you're working on medical-imaging detection and your pathologies are at all rotated or oblique relative to the image axes:

Train both DET and OBB on the same data with the same model, measure the gap. If OBB beats DET by at least 1 F1 point on your task (about the gap we saw with YOLOv11 on shoulder fractures; YOLOv8 showed roughly half that), it's probably worth the ~1.5× annotation cost going forward.
Use YOLOv11-OBB as the default starting point. It outscored YOLOv8-OBB on F1 at both dataset sizes in our runs. v12-OBB wasn't viable when we ran these experiments; that may have changed.
Bias toward recall by tuning the confidence threshold during evaluation. In our runs OBB's largest recall gain came on YOLOv8 (0.83 vs 0.79) at a small precision cost; for screening applications, leaning further into recall through the threshold is the natural design.
Sanity-check annotations and augmentation pipelines on a small subset before committing to a 100-epoch training run. The most common failure mode for OBB isn't the model; it's an annotation or augmentation pipeline that's silently inconsistent about angle.
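
The threshold-tuning recommendation can be sketched as a simple sweep. This assumes you already have validation detections matched to ground truth (each detection flagged as a true or false positive); the function name and input format are illustrative, not part of any library.

```python
def pick_threshold(detections, n_ground_truth, target_recall):
    """Return (threshold, precision, recall) for the highest confidence
    threshold that reaches target_recall, or None if it is unreachable.

    detections     -- list of (confidence, is_true_positive) pairs
    n_ground_truth -- number of ground-truth objects in the validation set
    """
    if n_ground_truth <= 0:
        raise ValueError("n_ground_truth must be positive")
    # Try thresholds from strictest to loosest; stop at the first that
    # keeps enough true positives.
    for threshold in sorted({conf for conf, _ in detections}, reverse=True):
        kept = [is_tp for conf, is_tp in detections if conf >= threshold]
        true_positives = sum(kept)
        recall = true_positives / n_ground_truth
        if recall >= target_recall:
            return threshold, true_positives / len(kept), recall
    return None


# Toy validation output: 5 ground-truth fractures, 6 detections.
toy = [(0.9, True), (0.8, True), (0.7, False),
       (0.6, True), (0.3, True), (0.2, False)]
choice = pick_threshold(toy, n_ground_truth=5, target_recall=0.8)
```

On the toy data the sweep settles on a threshold of 0.3, where four of five fractures are found at a precision of 0.8. Choose the threshold on a validation split and report final numbers on a separate test split so the operating point is not tuned on the data it is evaluated on.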

The shorter summary: OBB is a small but real win on rotated medical pathologies. It's not a free lunch — annotation cost, training stability, and runtime are all slightly worse — but for the specific use case it was designed for, the win shows up consistently in the numbers.

Part of an ongoing series on production medical imaging. The companion shoulder-fracture ensemble paper repackage is here; the year-one reflection is here; the NMW ensemble fusion deep-dive is here. If you're weighing OBB on your own detection problem, reach out.