Feature-Level vs Prediction-Level Ensembles: When to Combine What
Most posts treat 'ensemble' as one technique. In practice there are three distinct patterns — feature-level fusion, prediction-level fusion, and rules-based orchestration — with very different cost profiles, training requirements, and failure modes. Here's how to pick.
Three places to combine models in a pipeline, when each one wins, and why most "ensemble" content conflates the first two with the third.
When someone says "we ensembled the models," they usually mean one of three structurally different things, and the choice between them is often more consequential than the choice of which models to ensemble. Most ML writing conflates the three, even though their cost profiles, training requirements, and failure modes all differ.
This post separates them out, with a practical bias toward "when would I actually pick each one." The companion posts on detection ensembles (the shoulder-fracture system and NMW box fusion) covered the prediction-level case in depth. This post zooms out to the broader taxonomy.
The three patterns
A. Feature-level fusion. Multiple encoders run on the input; their intermediate representations are combined (concatenated, attention-fused, or otherwise merged); a single classifier head consumes the fused representation. The system is trained end-to-end with one loss.
B. Prediction-level fusion. Each model runs independently end-to-end, producing its own predictions. Those predictions are combined post hoc: voting for classification, averaging for regression, box-fusion methods such as non-maximum weighted (NMW), weighted boxes fusion (WBF), or Soft-NMS for detection, and stacking with a meta-learner for more complex cases. Each base model is trained on its own, possibly at different times.
C. Rules-based orchestration. Multiple specialized models, each handling a sub-task, with their outputs feeding into an explicit decision DAG that combines them via domain-specific rules. Less statistically optimal than (A) or (B) on benchmark metrics; far more useful when audit trails matter.
The distinction sounds academic. It isn't. Each pattern has constraints that disqualify it from certain use cases entirely.
A. Feature-level fusion
The intuition: if model A and model B both produce useful features for the same task, the combined feature space is richer than either alone. Concatenate the features (or use cross-attention to fuse them), train one classifier on top of the combined representation, get a single end-to-end optimized system.
Variants you'll see:
- Late fusion (after encoders, before classifier head). Most common. Concat features from each encoder's final layer, project into a shared dimension, single classifier head. Cheap to implement, no architectural surgery on the encoders.
- Mid fusion (cross-attention between intermediate layers). More expressive — features at intermediate depths can talk to each other before the final layer. Requires architectural design and more memory at training time.
- Early fusion (at or near the input). Mostly for multi-modal cases — combining an image with tabular metadata, or two image modalities (CT + PET, e.g.). Less common for "two CNNs on the same image."
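As a rough illustration of the late-fusion variant, here is a minimal forward-pass sketch in plain Python. Everything in it is an assumption made for illustration: the feature vectors stand in for two encoders' final-layer outputs, the layer sizes are arbitrary, and the weights are random rather than trained. In a real system these would be framework modules trained jointly under one loss; the sketch only shows the data flow (concatenate, project, classify).

```python
import math
import random

def linear(x, weights, bias):
    """Dense layer: `weights` holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(rng, n_in, n_out):
    """Random untrained weights, zero bias (illustrative only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def late_fusion_forward(feats_a, feats_b, proj, head):
    fused = feats_a + feats_b              # concatenate final-layer features
    shared = relu(linear(fused, *proj))    # project into a shared dimension
    return softmax(linear(shared, *head))  # single classifier head

rng = random.Random(0)
feats_a = [rng.random() for _ in range(8)]  # stand-in for encoder A's features
feats_b = [rng.random() for _ in range(6)]  # stand-in for encoder B's features
proj = init_layer(rng, 8 + 6, 4)            # 14 fused dims -> 4 shared dims
head = init_layer(rng, 4, 3)                # 4 shared dims -> 3 classes
probs = late_fusion_forward(feats_a, feats_b, proj, head)
```

The point of the shape is that the head only ever sees the fused vector, which is why gradients from a single loss can reshape both encoders at once.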
Where it wins:
End-to-end training means the encoders learn features that are specifically useful in combination, not in isolation. Compared to running two independently-trained models and averaging their outputs, feature-level fusion can give you better F1 at the same parameter count, because the model is optimizing for what's actually used downstream.
Where it loses:
You give up independent auditability of the base models. After end-to-end training, encoder A's features are no longer "what encoder A would have learned on its own" — they're "what encoder A learned to produce given that encoder B is sitting next to it." Some interpretations of regulatory and clinical-trust requirements treat this as a single composite model, with all the validation requirements that implies. Some don't. Worth knowing what your regulatory framework says before committing.
You also give up modularity. Adding a third encoder requires retraining the fusion classifier (and probably the encoders too, depending on architecture). Swapping out encoder A for a newer version means retraining the whole stack.
One caveat on our own example: we combined DenseNet and EfficientNet backbones for pediatric chest X-ray classification in a final-semester industry project, reaching around 98% accuracy on a held-out test set. The fusion there was an average of softmax outputs, which strictly places it on the prediction-level side of the taxonomy. It appears here only because it illustrates the same underlying point about combining architectures with different strengths.
B. Prediction-level fusion
The intuition: each model is a black box that produces a prediction; combine the predictions with a separately-defined rule.
Variants you'll see by task type:
- Classification: hard voting (majority), soft voting (average probability vectors), stacking (meta-learner trained on base-model predictions).
- Detection: NMS, Soft-NMS, WBF, NMW — combining boxes with overlap-aware rules. The deep-dive on these is in the NMW post.
- Regression: simple or weighted average, or stacking with a regression meta-learner.
- Segmentation: per-pixel voting, or averaging probability maps then re-thresholding.
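For the classification case, the two voting rules are small enough to sketch directly. The function names and the example probabilities below are made up for illustration; the example is chosen so that the two rules disagree, which happens when one model is very confident and the others are lukewarm.

```python
from collections import Counter

def hard_vote(labels):
    """Majority vote over one predicted label per model.
    Ties go to the label that appears first in `labels`."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(prob_vectors, weights=None):
    """Average per-class probability vectors, optionally weighted per model."""
    if weights is None:
        weights = [1.0] * len(prob_vectors)
    total = sum(weights)
    n_classes = len(prob_vectors[0])
    return [
        sum(w * p[c] for w, p in zip(weights, prob_vectors)) / total
        for c in range(n_classes)
    ]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical outputs from three binary classifiers on one input.
model_probs = [
    [0.55, 0.45],  # model A: weakly class 0
    [0.60, 0.40],  # model B: weakly class 0
    [0.05, 0.95],  # model C: strongly class 1
]

hard_label = hard_vote([argmax(p) for p in model_probs])
avg_probs = soft_vote(model_probs)
soft_label = argmax(avg_probs)
```

Here the hard vote picks class 0 (two models out of three) while the soft vote picks class 1 (the averaged probabilities are 0.4 and 0.6). Which behavior you want depends on how much you trust each model's confidence scores.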
Where it wins:
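Each base model stays auditable. You can ship model A, swap it for a newer one, add a third model, or drop one, all without retraining the other base models. If the fusion step is a stacking meta-learner, that small model needs refitting; voting and box fusion need nothing. The fusion logic is a separately versioned component that operates on outputs.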
Models can be trained on different schedules, by different people, on different teams. Many production ensembles have base models trained months apart, by different authors, for different original purposes — combined post-hoc into a single pipeline. This works only because prediction-level fusion doesn't require joint training.
You get parallel inference. The base models can run on separate GPUs (or separate machines) simultaneously, with the fusion step being a tiny CPU computation at the end. For latency-sensitive workloads where total wall-clock matters more than total compute, this is the architectural win.
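A minimal sketch of that fan-out-then-fuse shape, using only the standard library. The three model functions are hypothetical stubs that return fixed probability vectors; in practice each would wrap a call into an inference runtime or a remote service. Threads only help here when the model call releases the GIL or waits on I/O, which is typically true of native inference runtimes and network calls but is worth verifying for your stack.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for three independently trained classifiers.
def model_a(image):
    return [0.7, 0.3]

def model_b(image):
    return [0.6, 0.4]

def model_c(image):
    return [0.2, 0.8]

def mean_fusion(outputs):
    """Average the probability vectors class by class."""
    n = len(outputs)
    return [sum(column) / n for column in zip(*outputs)]

def run_ensemble(models, image, fuse):
    """Run every base model concurrently, then apply the fusion rule."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = [pool.submit(model, image) for model in models]
        # Collect in submission order so outputs line up with `models`.
        outputs = [future.result() for future in futures]
    return fuse(outputs)

fused = run_ensemble([model_a, model_b, model_c], image=None, fuse=mean_fusion)
```

Keeping `fuse` as a plain function argument mirrors the point above: the fusion rule is its own component and can be swapped or versioned without touching the base models.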
Where it loses:
You pay roughly N× the inference compute for N models. Parallelism reduces wall-clock time, not total compute. Compared to a feature-level model of similar size to one base model, which needs a single forward pass, prediction-level fusion is more expensive at serve time. For a desktop deployment with limited compute, this can be the deciding factor against ensembling at all.
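The "modulo parallelism" caveat is easy to make concrete with back-of-the-envelope arithmetic. The function below is an illustrative sketch, and the latency figures are invented. It assumes one device per model with no contention, so parallel wall-clock time is set by the slowest model.

```python
def inference_cost(latencies_ms, fusion_ms=0.0, parallel=True):
    """Return (total_compute_ms, wall_clock_ms) for one ensemble prediction.

    Total compute is always the sum of the per-model latencies.
    Wall-clock is the slowest model if the models run in parallel on
    separate devices, otherwise the sum.
    """
    total_compute = sum(latencies_ms) + fusion_ms
    slowest_path = max(latencies_ms) if parallel else sum(latencies_ms)
    return total_compute, slowest_path + fusion_ms

# Hypothetical per-model latencies for a 3-model ensemble.
latencies = [40.0, 55.0, 70.0]
parallel_cost = inference_cost(latencies, fusion_ms=1.0, parallel=True)
sequential_cost = inference_cost(latencies, fusion_ms=1.0, parallel=False)
```

With these numbers the parallel deployment answers in 71 ms instead of 166 ms, but both deployments spend the same 166 ms of compute per prediction.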
You lose some statistical efficiency. End-to-end training would have shaped each encoder's features for the joint task; independent training doesn't. On benchmark F1 numbers, feature-level fusion often beats prediction-level fusion by a point or two when both are well-tuned — though the gap narrows or reverses on noisy or domain-shifted data, where the modularity benefit shows up.
A real example: our shoulder-fracture ensemble (Faster R-CNN, EfficientDet, RF-DETR fused with NMW) is prediction-level all the way through. We picked this over feature-level fusion because the three detectors were already trained and validated independently, each one's failure modes were understood, and we wanted the ability to swap any one of them out without invalidating the others. The 1–2 F1 points we might have gained from joint training weren't worth the architectural lock-in.
C. Rules-based orchestration
The intuition: instead of statistically combining model outputs, logically combine them with domain-specific rules. Each model has a defined role; the decision engine has explicit DAG nodes that determine how their outputs combine.
This is the pattern behind clinical decision-support systems that pass regulatory review. A radiology AI that outputs "fracture detected" while another outputs "post-surgical hardware present" should not be combined by averaging probability vectors — the correct combined output depends on a domain rule (post-surgical hardware doesn't preclude fracture; the AI should report both, and a radiologist sees both flags independently).
What this looks like in practice:
- Decision DAG with named nodes. Each node is a sub-model output (fracture classifier, hardware classifier, view classifier, exposure quality classifier). Edges encode the rules for how downstream decisions depend on upstream outputs.
- Per-classifier threshold tuning. Each sub-model has its own confidence threshold tuned to its operating point, so the orchestrator combines outputs that have each already been thresholded on their own terms.
- Explicit conflict resolution. When two classifiers disagree in domain-meaningful ways, the rules specify which signal wins, or whether the case escalates to human review.
- Full audit trail. Every output is traceable to which sub-model fired, at what confidence, against what threshold, with which downstream rule chosen.
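The four properties above can be sketched in a few dozen lines. This is a toy illustration, not the engine described later in this post: the classifier names, the thresholds, and both rules are hypothetical, and a real DAG would have many more nodes.

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    name: str
    confidence: float
    threshold: float

    @property
    def fired(self):
        return self.confidence >= self.threshold

@dataclass
class Decision:
    flags: list
    escalate: bool
    audit: list = field(default_factory=list)

# Hypothetical per-classifier thresholds, each tuned separately.
THRESHOLDS = {"fracture": 0.50, "hardware": 0.60, "exposure_ok": 0.70}

def orchestrate(scores):
    """Combine sub-model scores with explicit rules and record why."""
    findings = {n: Finding(n, scores[n], t) for n, t in THRESHOLDS.items()}
    audit = [
        f"{f.name}: conf={f.confidence:.2f} thr={f.threshold:.2f} fired={f.fired}"
        for f in findings.values()
    ]

    # Rule 1 (illustrative): hardware never suppresses fracture.
    # Both findings are reported independently.
    flags = [n for n in ("fracture", "hardware") if findings[n].fired]
    audit.append("rule:report_findings_independently")

    # Rule 2 (illustrative): a positive finding on a poorly exposed
    # image escalates to human review.
    escalate = bool(flags) and not findings["exposure_ok"].fired
    if escalate:
        audit.append("rule:low_exposure_escalates")

    return Decision(flags, escalate, audit)

decision = orchestrate({"fracture": 0.80, "hardware": 0.90, "exposure_ok": 0.40})
```

The audit list is the important part: every entry names a sub-model, its confidence, its threshold, or the rule that was applied, so the final output can be reconstructed line by line.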
Where it wins:
When the regulator, the clinician, or the end-user needs to know why a system produced a specific output, statistical ensembling can't provide that explanation. Orchestration can — each component's contribution to the final output is visible by construction.
When the right combination of outputs depends on domain knowledge that isn't in any single model's training data ("hardware shouldn't suppress fracture detection," "low exposure quality should downgrade confidence in any other finding"), encoding that domain knowledge as explicit rules is more maintainable than trying to bake it into a meta-learner.
When you want to add or modify a sub-model without retraining anything else — the same modularity benefit as prediction-level fusion, but with logic-aware rather than statistics-aware combination.
Where it loses:
You hit a complexity ceiling. With 5–10 sub-models, the DAG is manageable. With 30+, the rule space becomes its own engineering problem. We've found ~12–20 sub-models to be a practical upper bound before the orchestration logic itself starts dominating maintenance effort.
You can underperform statistically. A well-trained joint model often beats a heuristic combination of specialized models on benchmark metrics. Orchestration trades a few benchmark points for auditability — for many use cases this is the right trade; for some it isn't.
We've shipped a logic-based diagnosis orchestration engine that sits on top of 12+ specialized classifiers covering different body regions and screening tasks. The choice of orchestration over a single monolithic multi-output model was driven by clinician-trust requirements — radiologists wanted to see which sub-model fired what flag, at what confidence, against what threshold, with the option to override any single flag without invalidating the others.
How to choose
The decision typically reduces to three questions, in this order:
1. Do you need per-model auditability? If yes (clinical, regulated, high-stakes), you're choosing between prediction-level fusion (when models share the same output space) and orchestration (when they don't). Feature-level fusion is off the table because the joint training erases the boundary between models.
2. Are the models trained jointly? If yes, you're doing feature-level fusion — there's no separate "ensemble" step, the combination is part of the model. If no, you're at prediction-level or orchestration.
3. Do all models share the same output space? If yes (all detectors producing boxes, all classifiers producing the same set of class probabilities), prediction-level fusion with task-appropriate combination (NMW for detection, voting for classification, etc.). If no — different sub-models doing fundamentally different tasks whose outputs need to be combined with domain logic — orchestration.
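The three questions can be written down as a small function, mostly as a way to check that the logic is consistent. The function name and return strings are illustrative. The one combination the questions rule out (needing per-model auditability while training jointly) is raised as an error rather than silently resolved.

```python
def choose_pattern(needs_per_model_audit, trained_jointly, shared_output_space):
    """Apply the three questions in order and name the matching pattern."""
    # Questions 1 and 2: joint training erases the boundary between
    # models, so it cannot coexist with per-model auditability.
    if trained_jointly:
        if needs_per_model_audit:
            raise ValueError(
                "joint training is incompatible with per-model auditability"
            )
        return "feature-level fusion"

    # Question 3: independently trained models. The output space
    # decides between statistical and rule-based combination.
    if shared_output_space:
        return "prediction-level fusion"
    return "rules-based orchestration"

# Example: three independently trained detectors that all emit boxes.
detectors = choose_pattern(
    needs_per_model_audit=True, trained_jointly=False, shared_output_space=True
)
```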
Cost comparison
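Rough relative costs for a 3-model ensemble, where 1× is one base model and the feature-level column assumes a fused model of roughly that size: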
| Dimension | Feature-level | Prediction-level | Orchestration |
|---|---|---|---|
| Training cost | 1× (joint) | 3× (independent) | 3× (independent) |
| Inference cost | 1× (single pass) | 3× (parallelizable) | 3× (parallelizable) |
| Memory at inference | low-medium | high (3× weights) | high (3× weights) |
| Time to add a model | high (retrain) | low (refit meta-learner, if any) | low (add DAG node) |
| Auditability | low | medium | high |
| Maintenance | low | medium | high (DAG complexity) |
| Best benchmark F1 | often highest | close 2nd | often 3rd |
No universal winner. Different rows matter in different contexts. The right ensemble pattern is the one whose strengths align with your actual constraints, not the one that wins the most benchmarks.
The meta-lesson
When you read "we ensembled the models and saw a 2 F1-point lift," ask which pattern they used. The answer changes:
- whether the result generalizes to your setup,
- whether you'd need to retrain anything to adopt their approach,
- whether their system is auditable to the standard your application needs,
- whether their inference cost is acceptable on your hardware.
Most published ensemble results are feature-level (academic benchmarks reward joint optimization) or simple prediction-level (averaging is easy to implement and report). Production ensembles in regulated domains are disproportionately orchestration-flavored, because the requirements that motivate them aren't in the benchmarks. Knowing which pattern you're reading about — and which pattern your own problem actually wants — is more important than the F1 difference between any two specific architectures.
Part of an ongoing series on production medical imaging. The shoulder-fracture ensemble system is here; the NMW box-fusion deep-dive is here; the year-one reflection is here. If you're weighing one of these patterns for a real system, reach out.