Year One in Production Medical AI: The Honest Version
My first year as an AI Scientist — 24+ models, one arXiv paper, and a few lessons I wish I'd had on day one.
Most of what I expected to matter in this job didn't. Most of what actually mattered was invisible from the outside.
I joined 5C Network full-time as a Data Scientist on June 1, 2025, after a five-month internship there. Today, May 22, 2026, is my one-year mark, give or take a week. With the team, I've worked on 24+ production medical-imaging models, co-authored one arXiv paper, and helped build retrieval infrastructure over 1.6M+ medical images. I made the mistakes I would have predicted, plus a different and larger set I wouldn't have.
A note on pronouns: almost none of the work below was solo. I'll use "I" for personal mistakes and reflections, "we" for the work itself.
If you'd asked me on day one what would dominate my year, I would have said: better architectures, smarter training tricks, harder-to-find datasets. Those things mattered, but each by maybe 5-15% in the right direction. What dominated my year, in roughly decreasing order:
- Engineering — encryption, queues, streaming, installers, hot-reloadable config, monitoring. The half of the job nobody teaches.
- Preprocessing — CLAHE, LANCZOS, custom dataset statistics, sanitized DICOM handling. Two percentage points of accuracy nobody talks about.
- Ensembles and orchestration — combining specialized models with explicit decision logic, instead of trying to make one big model do everything.
- Honest evaluation — what your model actually does on the long tail of real cases, not on a clean held-out split.
- Knowing when to use which tool — when a CNN, when a transformer, when a VLM, when a structured-output LLM, and when none of those and you just need a rule.
- The actual model architecture choice came in around #6.
Lesson 1: Ensembles compound; single models plateau — and always check specificity
The first significant thing we shipped was a shoulder-fracture detection ensemble — three independently-trained detectors (Faster R-CNN, EfficientDet-D3, and RF-DETR) fused via IoU-clustered, confidence-weighted output combination. It became the subject of the arXiv paper I co-authored.
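A minimal sketch of what IoU-clustered, confidence-weighted fusion can look like. This is not the production code or the method from the paper: the `Box` type, the 0.5 IoU threshold, and the agreement-scaled score are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    score: float
    source: str  # name of the detector that produced this box


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    ih = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = iw * ih
    union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter
    return inter / union if union > 0 else 0.0


def fuse(boxes, iou_thr=0.5, n_models=3):
    """Cluster boxes by IoU, then merge each cluster with confidence weights."""
    clusters = []
    # Highest-confidence boxes seed the clusters; zero-score boxes are dropped.
    candidates = sorted((b for b in boxes if b.score > 0),
                        key=lambda b: b.score, reverse=True)
    for box in candidates:
        for cluster in clusters:
            if iou(cluster[0], box) >= iou_thr:
                cluster.append(box)
                break
        else:
            clusters.append([box])

    fused = []
    for cluster in clusters:
        total = sum(b.score for b in cluster)
        # Coordinates: confidence-weighted average of the cluster members.
        coords = [sum(getattr(b, k) * b.score for b in cluster) / total
                  for k in ("x1", "y1", "x2", "y2")]
        # Score: mean confidence, scaled by the fraction of detectors that agree
        # (an assumed scoring rule, chosen so lone detections are down-weighted).
        voters = len({b.source for b in cluster})
        score = (total / len(cluster)) * (voters / n_models)
        fused.append(Box(*coords, score=score, source="ensemble"))
    return fused


# Hypothetical detections: three detectors agree on one region, one fires alone.
detections = [
    Box(10, 10, 50, 50, 0.9, "faster_rcnn"),
    Box(12, 12, 52, 52, 0.8, "efficientdet"),
    Box(11, 9, 49, 51, 0.7, "rf_detr"),
    Box(200, 200, 220, 220, 0.6, "faster_rcnn"),
]
fused_boxes = fuse(detections)
```

Scaling the score by detector agreement is one way to make a box that only one model sees rank below a box all three see, which is the "covers their failure modes" behavior in miniature.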
The first version of the ensemble looked great on the headline numbers: 70% precision, 98.6% recall. Then radiologists started flagging that we were over-calling. We were all confused — precision was decent, recall was excellent, what was going wrong?
That's when I learned to look at specificity — recall's quieter, less-celebrated cousin. Our ensemble's specificity was 10%. Out of every 10 normal shoulders, the model was over-calling 9. The 70% precision had been masking the false-positive problem because the prevalence in our test set was high. The model was confidently shouting "fracture" at almost every normal anatomy it saw, and getting "rewarded" by metrics that didn't penalize it enough.
We retuned the ensemble: 81% precision, 95.76% recall, 60% specificity. On paper, recall dropped about 3 points. (Whether those 3 points actually mattered is its own lesson; see Lesson 6.)
The architecture-level finding still holds. Single detectors plateau in different failure modes: Faster R-CNN gives clean localization but misses subtle cortical disruptions; EfficientDet is fast but middling on small lesions; RF-DETR's global attention catches hairlines but its boxes drift on clear-contrast fractures. The ensemble doesn't just average their accuracy — it covers their failure modes. But the lesson I'd put first now is: every classification metric you ship needs all four corners of the confusion matrix, not just two. If your test set has imbalanced prevalence, precision and recall can both look respectable while specificity is quietly in the basement.
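To make the prevalence point concrete, here is a small sketch that reports all four corners and back-solves the prevalence implied by a precision/recall/specificity triple. It assumes the three numbers are image-level rates measured on the same test set; the counts in the example are hypothetical, picked only to land near the first ensemble's rates.

```python
def rates(tp, fp, fn, tn):
    """Metrics that together cover all four cells of the confusion matrix."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),
    }


def implied_prevalence(precision, recall, specificity):
    """Solve precision = r*p / (r*p + (1 - s)*(1 - p)) for the prevalence p."""
    fpr = 1.0 - specificity
    odds = precision * fpr / (recall * (1.0 - precision))  # p / (1 - p)
    return odds / (1.0 + odds)


# First ensemble: 70% precision, 98.6% recall, 10% specificity.
p = implied_prevalence(0.70, 0.986, 0.10)
print(f"implied prevalence: {p:.2f}")  # about 0.68

# Hypothetical counts with roughly that shape: 680 fractures, 320 normals.
print(rates(tp=670, fp=288, fn=10, tn=32))
```

With about two-thirds of the test images positive, 288 false alarms out of 320 normals still leaves precision near 70%. The same model on a low-prevalence population would look very different.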
The catch with ensembles: 3x training time, 2-3x inference latency, significantly more engineering to run reliably. Worth it for anything safety-critical. Probably not worth it for triage or screening with a human in the loop.
Lesson 2: Zero-shot medical VLMs aren't ready yet — and the data is mostly hidden
About three months in, the team spent serious time evaluating publicly-available medical vision-language models — Google's CXR Foundation, Stanford AIMI's CheXagent-8b, XRAYGPT, several sizes of PaliGemma, MedCLIP. We ran them zero-shot on standard CXR tasks (pneumothorax, pleural effusion, view classification) using public datasets, and compared them honestly to small task-specific CNNs.
The pattern was consistent: VLMs handle high-level reasoning well (view classification, modality detection, side identification, extraction of structured fields from DICOM headers), but they fail on the pixel-level perception that actually matters in clinical QC. Over-exposed image detection by Gemini 2.5 Flash: F1 of 0.033. By a basic yolov8m classifier: F1 of 0.327 at default threshold, 0.353 with tuned threshold. CheXagent zero-shot on pleural effusion: 70-75% precision and recall, which sounds okay until you remember what it means: 25-30% of its positive calls are wrong, and 25-30% of real effusions are missed.
The other thing I learned: most published medical-VLM evaluations skip the cases that actually matter. On clean balanced datasets the headline metrics look strong. Add real-world distribution shift — mobile-phone monitor captures, oblique anatomy, vendor-specific exposure scales — and the numbers fall apart.
One caveat: this is the zero-shot picture, and it's moving fast. Fine-tuned medical VLMs and grounded bounding-box architectures — PaliGemma with QLoRA, RAD-DINO, the newer MedSigLIP work — are closing the gap quickly for specific tasks, and some are already production-viable. The criticism above is aimed at the dropped-in, vendor-pitched foundation model, not at domain-fine-tuned ones. Those are a different conversation, and the post I'm writing six months from now will probably be more optimistic than this one.
If you're building a medical-imaging system today and a vendor pitches you a "foundation model that just works," ask them for zero-shot results on a held-out set from your own data distribution. If they can't show you those numbers, the model isn't ready.
Lesson 3: Preprocessing matters more than architecture choice
This one took me a humbling four months to internalize.
On the pediatric chest X-ray classifier — my final-semester industry project, completed during my internship here — I spent weeks comparing DenseNet-169 against EfficientNet-B2/B3 against fused variants. Eventually I landed on a 2432-d fused-feature representation that got 98.14% accuracy with 99% precision.
What I learned later: most of my accuracy gain across that whole study didn't come from the architecture. It came from CLAHE (Contrast Limited Adaptive Histogram Equalization) applied during preprocessing, LANCZOS resampling instead of bilinear, and computing the mean/std of my own dataset on GPU rather than using ImageNet defaults. Each of those three choices contributed somewhere between 0.5 and 2 percentage points. Stacked, they outweighed the difference between DenseNet and EfficientNet entirely.
This pattern repeated across nearly every project. CLAHE didn't always help — on high-contrast adult CXRs it sometimes amplified noise — but the act of checking whether it helped, with a real ablation, was always worth more than another epoch of hyperparameter search. Same for resize interpolation. Same for dataset-specific normalization. Same for handling rotation augmentation: on one rotation-detection task this year, just turning rotation augmentation on lifted our external F1 from 0.585 to 0.98. Same dataset, same model, same hardware. One config flag.
The frustrating part: almost no papers document their preprocessing decisions in enough detail to reproduce. They publish the architecture and skip the part that mattered most.
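As one example of writing a preprocessing decision down in reproducible form, here is a sketch of computing dataset-specific normalization statistics in a single streaming pass. It uses NumPy on the CPU rather than the GPU computation described above, and assumes grayscale images already scaled to [0, 1]; the function name is made up for illustration.

```python
import numpy as np


def dataset_mean_std(images):
    """Pixel mean and std over a whole dataset without loading it all at once.

    images: iterable of arrays (any shape, may differ per image), values in [0, 1].
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for img in images:
        arr = np.asarray(img, dtype=np.float64)
        n += arr.size
        total += arr.sum()
        total_sq += np.square(arr).sum()
    if n == 0:
        raise ValueError("no pixels seen")
    mean = total / n
    # Var[X] = E[X^2] - E[X]^2; clamp tiny negatives from rounding.
    var = max(total_sq / n - mean ** 2, 0.0)
    return float(mean), float(var ** 0.5)


# Stand-in data: two random "images" of different sizes.
rng = np.random.default_rng(0)
fake_images = [rng.random((4, 5)), rng.random((3, 3))]
mean, std = dataset_mean_std(fake_images)
print(f"normalize with mean={mean:.4f}, std={std:.4f}")
```

Accumulating sums per pixel rather than averaging per-image means keeps the result correct when images have different sizes.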
Lesson 4: Engineering is the other half of the job
A model in a notebook is not a product. Production engineering for a model a clinician actually uses is a different stack from what gets taught in ML courses: AES-256 encryption-at-rest for model weights, persistent SQLite inference queues with stuck-process detection (300s warn, 600s restart), SSE streaming for incremental results, hot-reloadable thresholds via a DB watcher (no service restart to retune), cross-platform installers, license validation over HTTPS, modular pipelines spanning 40+ pathologies and 16+ supportive devices, all serving ~1000+ predictions per hour at sub-second latency.
None of these are in any ML curriculum I know of. All of them mattered more than model accuracy for whether the system actually got used. A model that hits 95% F1 in your notebook and 60% in deployment isn't a 60% model — it's a 0% model. Trust depends on the full stack.
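The persistent queue with stuck-process detection is the easiest of these to show in a few lines. The sketch below uses Python's built-in `sqlite3` with the 300s and 600s thresholds from above; the table schema, the function names, and the choice to re-queue a job at the restart threshold (rather than restart a worker process) are assumptions, not the production design.

```python
import sqlite3
import time

WARN_AFTER_S = 300     # log a warning once a job has run this long
RESTART_AFTER_S = 600  # treat the job as stuck and hand it back to the queue


def open_queue(path=":memory:"):
    """Open (or create) the queue; pass a file path for real persistence."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               id INTEGER PRIMARY KEY,
               study_id TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'queued',
               started_at REAL)"""
    )
    return conn


def enqueue(conn, study_id):
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("INSERT INTO jobs (study_id) VALUES (?)", (study_id,))
    return cur.lastrowid


def claim_next(conn, now=None):
    """Mark the oldest queued job as running and return its id (or None)."""
    now = time.time() if now is None else now
    with conn:
        row = conn.execute(
            "SELECT id FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE jobs SET status = 'running', started_at = ? WHERE id = ?",
            (now, row[0]),
        )
    return row[0]


def check_stuck(conn, now=None):
    """Return (warn_ids, restarted_ids); restarted jobs go back to 'queued'."""
    now = time.time() if now is None else now
    warn, restart = [], []
    rows = conn.execute(
        "SELECT id, started_at FROM jobs WHERE status = 'running'"
    ).fetchall()
    for job_id, started_at in rows:
        age = now - started_at
        if age >= RESTART_AFTER_S:
            restart.append(job_id)
        elif age >= WARN_AFTER_S:
            warn.append(job_id)
    with conn:
        conn.executemany(
            "UPDATE jobs SET status = 'queued', started_at = NULL WHERE id = ?",
            [(job_id,) for job_id in restart],
        )
    return warn, restart
```

Because the state lives in SQLite rather than in memory, a crashed service can reopen the file and see exactly which jobs were mid-flight.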
One specific incident from this year: we refactored the inference pipeline from YAML-based config loading to DB-driven. Cleaner, hot-reloadable, no more service restarts to retune a threshold. Or so the design said.
The pipeline had two modules that loaded different sets of models. One I owned; the other was an older component built for a different workflow and reused. My module read config through the main entry-point's loader, and after the refactor it correctly read from the DB. The other module had its own internal settings file that hardcoded a YAML path. The main entry-point had a config line that looked like it controlled both modules — same import, same call — but the older module was overriding it internally and reading the hardcoded YAML. That line in main was, for the older module, dead code.
Model-loading behavior started doing things that didn't match what the DB said. It took longer than I'd like to admit to find the problem, because everywhere I looked the "right" code was there. What finally caught it: bumping the log level from INFO to DEBUG and re-running. The debug output spelled out exactly which path was being loaded for each model, and the mismatch jumped out within a single inference cycle.
Lesson: when you refactor a config or loading layer, audit every consumer of the old layer — and when nothing obvious is wrong, turn the logging up. INFO is what you ship; DEBUG is what you debug with. Most subtle config bugs hide in the gap between them.
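One way to make that gap smaller is to have every loader log where its config came from, and to audit the result. A minimal sketch with the standard `logging` module follows; the dictionaries, model names, and paths are hypothetical stand-ins for the DB and the legacy YAML file.

```python
import logging

logger = logging.getLogger("model_loader")


def resolve_model_path(model_name, db_config, yaml_config=None):
    """Return (source, path) for a model, logging the source at DEBUG."""
    if model_name in db_config:
        source, path = "db", db_config[model_name]
    elif yaml_config and model_name in yaml_config:
        source, path = "yaml", yaml_config[model_name]
    else:
        raise KeyError(f"no config found for model {model_name!r}")
    logger.debug("model=%s source=%s path=%s", model_name, source, path)
    return source, path


def audit_sources(model_names, db_config, yaml_config=None):
    """Return the models that are NOT being served from the DB."""
    return [
        name for name in model_names
        if resolve_model_path(name, db_config, yaml_config)[0] != "db"
    ]


# Hypothetical configs: 'effusion' was never migrated to the DB.
db = {"fracture": "/models/fracture_v2.bin"}
legacy_yaml = {
    "fracture": "/models/fracture_v1.bin",
    "effusion": "/models/effusion_v1.bin",
}

logging.basicConfig(level=logging.DEBUG)
print(audit_sources(["fracture", "effusion"], db, legacy_yaml))
```

At INFO the loader is silent; at DEBUG each model prints one line naming its source, so a stray YAML read shows up on the first inference cycle.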
Lesson 5: Use LLMs for what they're actually good at
In the second half of the year we built a clinical-QC pipeline that combines multiclass quality classifiers with a Gemini-based structured QC layer. This was the project where the LLMs-vs-CNNs question became concrete instead of philosophical.
The headline finding, after about a week of production data:
- For study-level reasoning — view classification, modality detection, side identification, extracting structured metadata from DICOM headers — Gemini 2.5 Flash was excellent. F1 around 0.94-0.96 for view classification. 100% accuracy on metadata extraction.
- For pixel-level perception — over/under-exposure detection, fine artifact detection, subtle cropping issues — Gemini failed sharply. F1 in the 0.03-0.08 range on exposure. Trained CNNs on the same task: 0.6-0.9 F1 at best-F1 threshold.
We ended up using both: CNNs OR-aggregated for the pixel-level quality flags, Gemini for the structured field extraction and the natural-language reasoning. The pipeline that wins almost always uses each tool for what it's actually good at, not what it's marketed for.
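A sketch of that split, assuming "OR-aggregated" means a study fails quality if any single CNN flag fires. The flag names, thresholds, and the shape of the LLM output are hypothetical; the LLM call itself is left out and represented by an already-parsed dictionary.

```python
def or_aggregate(cnn_scores, thresholds):
    """Fail the study if ANY pixel-level classifier crosses its threshold."""
    flags = {name: cnn_scores[name] >= thresholds[name] for name in cnn_scores}
    return any(flags.values()), flags


def qc_report(cnn_scores, thresholds, llm_fields):
    """Combine CNN quality flags with structured fields from the LLM layer."""
    failed, flags = or_aggregate(cnn_scores, thresholds)
    return {
        "quality_fail": failed,     # pixel-level perception: CNNs only
        "quality_flags": flags,
        "view": llm_fields.get("view"),          # study-level reasoning: LLM only
        "side": llm_fields.get("side"),
        "modality": llm_fields.get("modality"),
    }


# Hypothetical inputs for one study.
scores = {"over_exposed": 0.12, "under_exposed": 0.81, "cropped": 0.05}
cutoffs = {"over_exposed": 0.5, "under_exposed": 0.5, "cropped": 0.5}
parsed_llm = {"view": "PA", "side": None, "modality": "CR"}

report = qc_report(scores, cutoffs, parsed_llm)
print(report)
```

Keeping the two sources in separate keys means neither can silently overrule the other, and per-flag thresholds can be tuned independently.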
A related cost lesson: Gemini 2.5 Flash on this task was about ₹0.94 (~$0.011) per case at 18-second TAT. Gemini 3 Flash Preview was better on rotation and OR-aggregation but ran at 300-480 seconds per case — 5 to 8 minutes. Unshippable. The latest model is not always the right model.
Lesson 6: The radiologist is the user, not the benchmark
For this lesson I had to actively unlearn a research-paper habit. Coursework rewards you for held-out F1. Production rewards you for whether a tired radiologist at 11pm trusts your output enough to use it.
Remember the ~3-point recall drop from Lesson 1, the one I flagged as its own lesson? Here's why it didn't matter clinically.
When we sat down with a radiologist to walk through the cases the new ensemble was missing, the pattern jumped out immediately: almost all of them were old healed fractures. Calcified, remodeled, decades-old findings that show up on the radiograph but aren't actionable — generic findings to mention in the report, not anything that changes treatment. The previous version of the ensemble (the high-recall, low-specificity one) caught these and got credit for high recall, but those catches weren't clinically useful. The new ensemble missed them but caught the active fractures — the ones that matter for patient outcomes.
When you weight by clinical significance, effective recall on the new model was closer to 99%. The benchmark recall said 95.76%. The benchmark wasn't lying exactly — it was just answering a different question than the one we should have been training to.
I would never have known this from the data alone. We knew it because we asked.
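Once a radiologist has labeled which findings are actionable, the difference between benchmark recall and clinically weighted recall is a one-line change. The sketch below uses toy data invented for illustration (20 findings, not our test set), and a binary weight that ignores healed fractures; a graded weight would work the same way.

```python
def recall(findings, weight=lambda f: 1.0):
    """Weighted recall: detected weight divided by total weight."""
    total = sum(weight(f) for f in findings)
    hit = sum(weight(f) for f in findings if f["detected"])
    return hit / total if total else float("nan")


def actionable_only(finding):
    """Assumed weighting: healed fractures count for nothing."""
    return 0.0 if finding["healed"] else 1.0


# Toy ground truth: 17 active fractures (16 detected), 3 healed (1 detected).
findings = (
    [{"healed": False, "detected": True}] * 16
    + [{"healed": False, "detected": False}] * 1
    + [{"healed": True, "detected": True}] * 1
    + [{"healed": True, "detected": False}] * 2
)

print(f"benchmark recall:  {recall(findings):.3f}")                    # 0.850
print(f"actionable recall: {recall(findings, actionable_only):.3f}")   # 0.941
```

The extra label (`healed` here) is the part that cannot be derived from the images alone; it has to come from someone who reads them for a living.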
A 95% F1 model that gives no localization, no calibrated confidence, no per-pathology rationale is less useful than an 85% F1 model that overlays a bounding box on the suspicious region, surfaces a confidence with meaning, and says "here's the rule that flagged this." Trust is built from those — not from the headline metric.
This is also why we ended up building logic-based diagnosis engines on top of ML classifiers — orchestrating 12+ specialized classifiers with explicit decision rules, rather than training one big multi-output model. The big model performs better on paper. The orchestrated pipeline performs better in the room, because every output is auditable. A radiologist can trace exactly which sub-model flagged what, and override individual flags without invalidating the rest.
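A stripped-down sketch of the idea: per-classifier flags, an ordered list of explicit rules, and a trace that records every step. The classifier names, thresholds, and rules are hypothetical and not clinical guidance; the real engines orchestrate many more models.

```python
def diagnose(scores, thresholds, rules, overrides=None):
    """Apply explicit rules to per-classifier flags and keep an audit trail.

    scores:     {classifier_name: probability}
    thresholds: {classifier_name: cutoff}
    rules:      ordered list of (label, [required classifier names]); first match wins
    overrides:  {classifier_name: bool} set by a radiologist
    """
    overrides = overrides or {}
    flags, trace = {}, []
    for name, score in scores.items():
        if name in overrides:
            flags[name] = overrides[name]
            trace.append(f"{name}: overridden to {overrides[name]}")
        else:
            flags[name] = score >= thresholds[name]
            trace.append(
                f"{name}: score={score:.2f} thr={thresholds[name]:.2f} -> {flags[name]}"
            )
    for label, required in rules:
        if all(flags.get(r, False) for r in required):
            trace.append(f"rule '{label}' fired on {required}")
            return label, trace
    trace.append("no rule fired")
    return "no finding", trace


# Hypothetical sub-model outputs and rules.
scores = {"consolidation": 0.91, "effusion": 0.72, "cardiomegaly": 0.10}
thresholds = {"consolidation": 0.5, "effusion": 0.5, "cardiomegaly": 0.5}
rules = [
    ("consolidation with effusion", ["consolidation", "effusion"]),
    ("consolidation", ["consolidation"]),
    ("effusion", ["effusion"]),
]

label, trace = diagnose(scores, thresholds, rules)
print(label)
print("\n".join(trace))

# A radiologist rejects the effusion flag; the consolidation flag still stands.
label_after, _ = diagnose(scores, thresholds, rules, overrides={"effusion": False})
print(label_after)
```

The override changes one flag and the rules are simply re-evaluated, which is what "override individual flags without invalidating the rest" looks like in code.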
What I got wrong
A reflection post without admissions is just a brag with extra paragraphs. So:
The most specific mistake: the first time I trained VLMs on an A100, my teammates were being conservative and sizing their batches to stay under 35GB of GPU memory. I pushed mine to 39GB to save time and kicked off an overnight run. Sometime around midnight the GPU OOM'd. The training process halted. The Slack alert I'd set up inside the training process never fired, because the thing meant to send the alert was the thing that died. The GPU sat idle for the rest of the night. I burnt cloud credits for nothing.
The obvious takeaway was "use a smaller batch size." The real lesson was deeper: in-process alerts can't tell you about their own death. I now run a separate watcher process — separate PID, separate failure surface — whose only job is to monitor the training process and ping me on Slack if it stops. The watcher also auto-shuts down the VM past midnight if the training process is dead, so an unnoticed crash doesn't burn another six hours of credits. From my Mac terminal I can check the Slack channel at any time and see whether any of my processes are still alive, without SSHing into the VM.
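A minimal sketch of such a watcher, not the actual script. It assumes a POSIX system (signal 0 is used only as a liveness probe), and takes the notifier and the shutdown action as plain callables so that no Slack webhook or cloud API details need to be invented here.

```python
import os
import time


def pid_alive(pid: int) -> bool:
    """POSIX-only liveness check: signal 0 tests for existence without killing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def watch(pid, notify, is_alive=pid_alive, poll_s=30.0, sleep=time.sleep, on_dead=None):
    """Block until `pid` disappears, then alert and run an optional cleanup.

    notify:  callable taking a message string (e.g. a function that posts to Slack)
    on_dead: optional callable, e.g. one that shuts the VM down (hypothetical hook)
    """
    while is_alive(pid):
        sleep(poll_s)
    notify(f"training process {pid} is no longer running")
    if on_dead is not None:
        on_dead()


# Usage, run as its own process with the training PID as an argument:
#   watch(int(sys.argv[1]), notify=post_to_slack, on_dead=shutdown_vm_if_after_midnight)
# where post_to_slack and shutdown_vm_if_after_midnight are functions you supply.
```

The key property is that the watcher shares nothing with the training job except its PID, so an OOM in one cannot silence the other. A zombie process still counts as alive under this check, which is one reason to treat this as a starting point rather than a finished tool.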
A few less concrete ones:
I over-optimized on architecture choice for the first six months. I should have spent that time on preprocessing and evaluation infrastructure instead.
I trusted published medical-VLM numbers too uncritically at the start. I assumed headline F1 was indicative of out-of-distribution performance. It wasn't. I should have run our own zero-shot evaluations earlier.
I deferred too much to "what the paper said about hyperparameters" instead of doing ablations on our data. Medical imaging has so many distribution-specific quirks that paper defaults are almost never the right starting point.
What I'm working on next
Over the next twelve months, these are the questions I want to keep answering. When do ensembles stop adding value? When does a foundation model earn its license cost? When does an LLM actually beat a small CNN? Where does engineering matter more than model selection? I'll publish what I learn, as I learn it.
I'm also writing — properly this time. The post you're reading is the first of about 25 planned for the next year. Half will be production deep-dives on patterns I learned the hard way; half will be writeups of new work as we do it.
If you're working on something similar — production medical AI, medical imaging AI, ML system engineering — I'd genuinely like to compare notes. My contact is on the homepage. I'll respond to anyone who sends a thoughtful question.
Closing
If you're a year behind me on this path, the most useful thing I can tell you is: stop reading papers for a week and read your own production logs instead. They contain the lesson the next paper is going to confirm in twelve months.
To my team and everyone at 5C Network who taught me this year — the mentors, the radiologists who corrected our models, the engineers who reviewed the code — thank you. On to year two.
Find me on LinkedIn, GitHub, or Medium. The arXiv paper referenced above is here.