Server-Sent Events for Progressive ML Inference: A Pattern Worth Adopting

When ML inference takes 15-30 seconds and is composed of independent sub-models, you can deliver value at second 2 instead of second 18. The pattern is progressive results over SSE. Here's the architecture, the FastAPI implementation, and the messy parts nobody warns you about.

Saianiruth M

Long-running orchestrated ML inference looks bad on a UI by default. A 30-second spinner doesn't tell the user whether the system is working, stuck, or about to fail. Progressive results — delivered as each sub-model completes — fix this without changing the underlying compute. The pattern is straightforward; the implementation has a few sharp edges. This is the writeup.


In the year-one reflection I mentioned that we serve a fleet of medical imaging models with sub-second to multi-second per-model latency, orchestrated as a pipeline. A single user-facing prediction is the aggregate of many sub-model decisions: view classification, exposure quality, anatomical region detection, primary pathology, supportive findings, contraindications. End-to-end this takes 15-30 seconds on the worst inputs.

The naive way to serve this — POST /predict, wait, return the full result — gives the user a spinner for 30 seconds, no signal during the wait, and no way to start consuming partial results that are already available. The better pattern is server-sent events streaming the partial results back as each sub-model finishes.

This post is the writeup of why that pattern matters, how to implement it on top of an existing job queue (the SQLite queue from B16), and the surprisingly long list of operational gotchas that nobody warns you about.


The pattern, abstractly

Long-running compute split into independent sub-tasks can be delivered to a client incrementally. The user sees the first useful information as soon as any sub-task completes, rather than waiting for the slowest one. The total compute is unchanged; the perceived latency drops sharply.

For orchestrated ML inference this is a near-perfect fit because most multi-model pipelines have this shape:

  • The sub-models are independent (or partially independent) — they don't all need to wait on each other.
  • Each sub-model produces a self-contained partial result that's useful on its own (view classification, fracture detection, exposure quality).
  • The user has work they can start doing with the first partial result (begin reviewing the AP/PA classification while the fracture detector is still running).

The pattern: hold a long-lived connection from the client to the server. As each sub-model completes, push the partial result over the connection. Close the connection when the full inference is done.

Figure: Traditional REST (top) vs SSE streaming (bottom) for the same 18 seconds of compute. The REST timeline is a single blank bar from t=0 to t=18 with the result returned at the end. The SSE timeline splits the same compute into per-sub-model bars (view classification at t=2, exposure at t=5, fracture at t=12, supportive findings at t=18), with each partial result streamed back to the client as soon as it completes. Total wall-clock is unchanged.

In the example above, total compute is 18 seconds either way. But the user has actionable information at t = 2s with SSE versus t = 18s with traditional REST. A roughly 9× perceived-latency win, just from changing how results are delivered.
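A minimal sketch of the idea, assuming the sub-models can run concurrently (in the architecture described later, workers run them in sequence, but the delivery principle is the same). The sub-model names and the scaled-down delays are illustrative stand-ins, not real measurements.

```python
import asyncio
import time

# Hypothetical sub-models with scaled-down delays in seconds (assumed values).
SUB_MODELS = {
    "view_classification": 0.02,
    "exposure_quality": 0.05,
    "fracture_detection": 0.12,
}

async def run_sub_model(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stand-in for real inference
    return name

async def progressive() -> list:
    """Return (sub-model name, seconds since start) in completion order."""
    start = time.monotonic()
    tasks = [
        asyncio.create_task(run_sub_model(name, delay))
        for name, delay in SUB_MODELS.items()
    ]
    arrivals = []
    # as_completed hands back each result as soon as it is ready,
    # instead of waiting for the slowest task.
    for finished in asyncio.as_completed(tasks):
        name = await finished
        arrivals.append((name, time.monotonic() - start))
    return arrivals

arrivals = asyncio.run(progressive())
first_name, first_at = arrivals[0]
last_name, last_at = arrivals[-1]
print(f"first result ({first_name}) after {first_at:.2f}s; "
      f"last ({last_name}) after {last_at:.2f}s")
```

The gap between the first and last arrival time is the perceived-latency win; the total compute does not change.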


Why progressive matters for ML specifically

Three reasons this pattern is worth more for ML inference than for typical web work:

1. Variable per-model latency. Some sub-models in our pipeline run in 200ms; others take 4-5 seconds. Returning a uniform "we'll get back to you in 18 seconds" hides the fact that 60% of the answer was available at second 3. The user's mental model is "slow system." Progressive results change it to "system that's working and showing me what it has."

2. Degraded-mode usefulness. If a single sub-model errors or times out, the remaining sub-models' outputs are still useful. With batch responses, one failed sub-model often means returning either a partial response (awkward) or an error (worse). With progressive results, the failing sub-model just doesn't emit a result: the rest of the system continues, the user keeps everything that has already arrived, and the missing pieces are explicit rather than invisible.

3. UI responsiveness in clinical workflows. Radiologists open a study, scan the AI flags, and either confirm or override each one. The faster the first flag appears, the sooner the radiologist starts working with it. The full 18 seconds of compute might be needed for completeness, but the human's reading of the result starts in parallel as soon as the first flag is visible.

These are stronger for ML inference than for "show me a list of tweets." There, full-response latency is also the user's first-information latency. For orchestrated ML inference, the two are very different — and the difference is the whole point of the pattern.
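One way to keep a failed sub-model from hiding the others is to record an outcome for every sub-model, success or not. The sketch below is an assumption about how that could look, not the pipeline's actual code; the two sub-model functions and the report shape are hypothetical.

```python
import asyncio

async def view_classification() -> str:
    return "AP"  # hypothetical successful sub-model

async def fracture_detection() -> str:
    raise TimeoutError("model timed out")  # simulated failure

async def run_all(sub_models: dict) -> dict:
    """Run every sub-model and record a result or an explicit error for each."""
    names = list(sub_models)
    # return_exceptions=True keeps one failure from cancelling the rest.
    outcomes = await asyncio.gather(
        *(sub_models[name]() for name in names), return_exceptions=True
    )
    report = {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, Exception):
            report[name] = {"ok": False, "error": type(outcome).__name__}
        else:
            report[name] = {"ok": True, "result": outcome}
    return report

report = asyncio.run(run_all({
    "view_classification": view_classification,
    "fracture_detection": fracture_detection,
}))
print(report)
```

With a report like this, the UI can render the view classification normally and mark fracture detection as unavailable instead of silently omitting it.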


Why SSE rather than WebSockets or polling

There are three ways for a client to receive updates that originate on the server:

  • Polling. Client makes repeated GET /jobs/:id requests. Simple. Wastes bandwidth on no-change responses. Per-request authentication overhead. Polling intervals add latency to first event. Workable but not great.
  • WebSockets. Full-duplex, bidirectional. Overkill for this use case — the client doesn't need to send anything mid-stream. WebSockets also have more operational complexity around reverse-proxy support, upgrade handshakes, and library inconsistencies.
  • Server-Sent Events. Server-to-client only. Plain HTTP. Native browser API (EventSource). Works through most reverse proxies with minimal configuration. Built-in reconnection, with resumability via the Last-Event-ID header. Exactly what we need.

For this specific pattern (the server pushes events to a passive client over a single long-lived HTTP connection), SSE is the right answer. WebSockets are the right answer for chat, multiplayer games, collaborative editing. Polling is rarely the right answer; it only makes sense when events are very infrequent.

There are real cases where WebSockets win (bidirectional control, sub-second latency budgets, binary frames). None of them apply here.
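Part of SSE's appeal is how little there is on the wire: a few text field lines per event, ended by a blank line. The helper below is a hypothetical formatter written only to illustrate that framing; in practice a library should produce it for you.

```python
from typing import Optional

def format_sse(data: str, event: Optional[str] = None,
               event_id: Optional[str] = None) -> str:
    """Serialize one event in the text/event-stream format (illustration only)."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")   # event name the client listens for
    if event_id is not None:
        lines.append(f"id: {event_id}")   # echoed back as Last-Event-ID on reconnect
    # A payload containing newlines is sent as several data: lines.
    for chunk in data.split("\n"):
        lines.append(f"data: {chunk}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"

wire = format_sse('{"status": "running"}', event="update", event_id="3")
print(wire, end="")
```

The client's EventSource parses these fields and dispatches the event to whichever listener matches the event name.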


Architecture on top of the queue

The pattern composes cleanly with the SQLite-backed job queue from B16. The queue is the source of truth for job state; SSE is just a transport that reads from it.

Figure: Architecture, in three lanes. Client: posts to POST /predict, receives a job_id, then opens a long-lived SSE connection to GET /jobs/:id/stream. API server: polls the SQLite jobs table every 250ms and emits an event whenever the version changes. Workers: pick up jobs from the queue, process each sub-model in sequence, and write partial_results plus an incremented version column back to the same row. The SQLite jobs table is the only shared state.

The flow:

  1. Client posts an inference job (POST /predict returns a job_id immediately).
  2. Client opens a long-lived SSE connection: GET /jobs/:job_id/stream.
  3. Workers process the job in stages. Each completed sub-model writes its result to the job's partial_results column (JSON, accumulating).
  4. The API server's SSE handler polls the job row every 250ms. Whenever the version column has changed (workers bump it on every write to partial_results or status), it emits an event.
  5. When status flips to completed or failed, the handler emits a final event and closes the stream.

Two things make this composition work:

  • The queue is the only shared state. Workers don't need to know about SSE connections. The API server doesn't need to know about workers. Each component does its job; the queue mediates.
  • SQLite polling is cheap. Reading one row by id at 250ms intervals is below the noise floor on even modest hardware. We measured tens of concurrent SSE clients with no impact on inference throughput. (A change-notification system like PostgreSQL LISTEN/NOTIFY would be more elegant; for a single-host SQLite deployment, polling one small row is fine.)
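The worker side of this contract can be sketched as follows. This is a minimal illustration, not the B16 queue's real schema: the column defaults, the 'queued' and 'running' status values, and the record_partial helper are assumptions, and it assumes only one worker writes to a given job row at a time.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")  # in-memory database, for the example only
db.row_factory = sqlite3.Row
db.execute(
    """CREATE TABLE jobs (
           id INTEGER PRIMARY KEY,
           status TEXT NOT NULL DEFAULT 'queued',
           partial_results TEXT NOT NULL DEFAULT '{}',
           version INTEGER NOT NULL DEFAULT 0
       )"""
)

def record_partial(db: sqlite3.Connection, job_id: int,
                   sub_model: str, result: dict) -> None:
    """Merge one sub-model's result into the job row and bump its version."""
    with db:  # commits on success, rolls back if an exception is raised
        row = db.execute(
            "SELECT partial_results FROM jobs WHERE id = ?", (job_id,)
        ).fetchone()
        merged = json.loads(row["partial_results"])
        merged[sub_model] = result
        db.execute(
            """UPDATE jobs
               SET partial_results = ?, status = 'running', version = version + 1
               WHERE id = ?""",
            (json.dumps(merged), job_id),
        )

db.execute("INSERT INTO jobs (id) VALUES (1)")
record_partial(db, 1, "view_classification", {"label": "AP"})
record_partial(db, 1, "exposure_quality", {"score": 0.9})
row = db.execute("SELECT * FROM jobs WHERE id = 1").fetchone()
print(row["version"], row["partial_results"])
```

The worker never touches an SSE connection; it only writes to the row, and the version bump is what the streaming handler notices.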

The implementation

FastAPI + sse-starlette is a clean way to write this in Python:

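from fastapi import FastAPI, Request
from sse_starlette.sse import EventSourceResponse
import asyncio
import json
import sqlite3
import time

app = FastAPI()

# Queue connection. "jobs.db" is a placeholder path; point it at the queue's database file.
# sqlite3.Row enables the row["column"] lookups used below.
db = sqlite3.connect("jobs.db", check_same_thread=False)
db.row_factory = sqlite3.Row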

@app.get("/jobs/{job_id}/stream")
async def stream_results(job_id: int, request: Request):
    async def event_generator():
        last_seen_version = -1
        heartbeat_at = time.time()

        while True:
            # Bail if the client disconnected
            if await request.is_disconnected():
                return

            # Read current state from the queue
            row = db.execute(
                """SELECT status, partial_results, version
                   FROM jobs WHERE id = ?""",
                (job_id,)
            ).fetchone()

            if row is None:
                yield {"event": "error",
                       "data": json.dumps({"reason": "job_not_found"})}
                return

            # New state — push an event
            if row["version"] != last_seen_version:
                yield {
                    "event": "update",
                    "id": str(row["version"]),
                    "data": json.dumps({
                        "status": row["status"],
                        "partial_results": json.loads(row["partial_results"] or "{}"),
                        "ts": int(time.time()),
                    })
                }
                last_seen_version = row["version"]
                heartbeat_at = time.time()

            # Terminal state — send done, close
            if row["status"] in ("completed", "failed"):
                yield {"event": "done",
                       "data": json.dumps({"final_status": row["status"]})}
                return

            # Heartbeat every 15s to keep proxies from closing the connection
            if time.time() - heartbeat_at > 15:
                yield {"event": "heartbeat", "data": "ok"}
                heartbeat_at = time.time()

            await asyncio.sleep(0.25)

    return EventSourceResponse(event_generator())

The client side (browser) is even simpler:

const source = new EventSource(`/jobs/${jobId}/stream`);

source.addEventListener("update", (event) => {
    const payload = JSON.parse(event.data);
    renderPartial(payload.partial_results);
});

source.addEventListener("done", (event) => {
    const payload = JSON.parse(event.data);
    finalizeUI(payload.final_status);
    source.close();
});

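source.addEventListener("error", (event) => {
    // A server-sent event named "error" (e.g. job_not_found) also lands here,
    // and it carries data; a connection-level error does not.
    if (event.data !== undefined) {
        source.close();  // stop EventSource from reconnecting to a missing job
        return;
    }
    // Connection dropped: EventSource auto-reconnects with Last-Event-ID
    showReconnectingNotice();
});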

Worth noting from this code:

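  • Each event has an id (the version column from the job row). On reconnection, the browser sends it back in the Last-Event-ID header. The handler above does not read that header: because every update carries the full accumulated partial_results, a reconnecting client simply receives the current snapshot. A server that streams deltas would need Last-Event-ID to resume from that version rather than replaying from the start.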
  • The version column is just an integer that increments on every update to the job row. A simple UPDATE ... SET version = version + 1 in the worker's progress-write keeps it monotonic.
  • asyncio.sleep(0.25) is the polling interval. Tunable per workload. For tasks that produce updates every few seconds, 250ms is more than enough. For sub-second update cadences, lower it; for slower workloads, raise it to reduce SQLite read traffic.
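If you do want the server to honor Last-Event-ID, for example to avoid re-sending a snapshot the client already has, the starting version can be derived from the request headers. A small sketch, assuming a hypothetical resume_version helper and a mapping with lowercase header names (Starlette's request.headers does case-insensitive lookups, so it can be passed as-is):

```python
from typing import Mapping

def resume_version(headers: Mapping[str, str]) -> int:
    """Return the last version the client saw, or -1 to start from scratch."""
    raw = headers.get("last-event-id")
    if raw is None:
        return -1  # first connection: no Last-Event-ID header was sent
    try:
        version = int(raw)
    except ValueError:
        return -1  # malformed header: fall back to a full snapshot
    return version if version >= 0 else -1

print(resume_version({"last-event-id": "7"}))
print(resume_version({}))
```

In the handler, this would replace the hard-coded starting value: last_seen_version = resume_version(request.headers).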

The messy parts nobody warns you about

The pattern is simple. The deployment isn't. Five things that bit us, in case they save you the rediscovery:

1. Reverse-proxy buffering will silently break SSE. Nginx buffers proxied responses by default, which means events sit in the proxy's buffer instead of streaming to the client. If you control the proxy, set proxy_buffering off for the streaming location in the nginx site config; if you don't, send X-Accel-Buffering: no in the response headers, which nginx honors per response. AWS ALB, Cloudflare, and other intermediaries each have their own buffering and timeout behavior, so check every hop. Test in a production-like setup, not just locally.

2. The 6-connection-per-origin browser limit. Over HTTP/1.1, most browsers cap concurrent connections to the same origin at 6. If your user opens 7 SSE streams (one per inference job), the 7th hangs until one of the others closes. The workaround is HTTP/2, where streams are multiplexed over a single connection and the cap becomes a negotiated stream limit (100 by default), or distributing endpoints across subdomains. We hit this exactly once, in a multi-study viewer that opened streams for each visible study.

3. Heartbeats prevent idle-timeout disconnects. Load balancers, reverse proxies, and corporate firewalls often close idle TCP connections after 30-60 seconds of no traffic. An SSE stream with sparse events looks idle to them. Send a heartbeat event every 15 seconds (visible in the code above) so the connection stays warm. The client can ignore heartbeats; their only job is keeping the wire alive.

4. Reconnection logic is the client's job and the spec gets it almost-right. EventSource auto-reconnects with Last-Event-ID, but only on network errors — not on application-level errors (e.g., a 500 from the server). For "real" production resilience, wrap EventSource in a thin custom reconnect layer that catches both network and application errors, applies exponential backoff, and gives up after some bound.

5. Authentication has to live in the URL or in a cookie. EventSource doesn't support custom headers. So you can't put your Authorization: Bearer ... token in a request header — you have to use cookie-based auth, or accept the token as a query parameter (which is generally a bad idea because URLs land in logs and browser histories). The clean fix is short-lived signed URLs: server issues a stream-specific token in the POST /predict response, client uses it as a query parameter in the SSE URL, token expires when the job does.
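The signed-URL idea in item 5 can be sketched with the standard library's hmac module. This is an assumption about one workable token format, not a description of what we deployed: the SECRET placeholder, the fixed TTL (instead of tying expiry to the job), and both function names are hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET = b"replace-with-a-real-secret"  # placeholder; load from configuration

def _sign(job_id: int, expires: int) -> str:
    message = f"{job_id}.{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()

def issue_stream_token(job_id: int, ttl_seconds: int,
                       now: Optional[float] = None) -> str:
    """Create a token bound to one job that stops verifying after ttl_seconds."""
    current = time.time() if now is None else now
    expires = int(current + ttl_seconds)
    return f"{expires}.{_sign(job_id, expires)}"

def verify_stream_token(job_id: int, token: str,
                        now: Optional[float] = None) -> bool:
    """Check the signature and expiry of a token presented in the stream URL."""
    current = time.time() if now is None else now
    try:
        expires_text, signature = token.split(".", 1)
        expires = int(expires_text)
    except ValueError:
        return False  # not in "<expires>.<signature>" form
    expected = _sign(job_id, expires)
    # compare_digest avoids leaking information through comparison timing.
    same = hmac.compare_digest(signature.encode(), expected.encode())
    return same and current < expires

token = issue_stream_token(42, ttl_seconds=60, now=1000.0)
print(verify_stream_token(42, token, now=1030.0))
```

POST /predict would return the token alongside the job_id, and the stream endpoint would reject any request whose token fails verification for that job.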


When SSE isn't the right answer

SSE wins for server-to-client push of independent events over a single long-lived connection. Cases where you should reach for something else:

  • Bidirectional flow. If the client needs to send messages back mid-stream (mid-inference cancellation requests, parameter updates, etc.), WebSockets. SSE is one-way only.
  • Many concurrent consumers per event. If you're broadcasting the same event to thousands of clients, dedicated pub-sub (Redis pub-sub, Kafka, NATS) is more efficient than SSE-from-the-DB-per-client.
  • Very tight latency requirements. In this design the 250ms polling interval, more than the transport, sets the floor on event latency. If you need sub-100ms event delivery, a push-based backend with WebSockets is a tighter fit.
  • Polling-friendly workloads. For events that fire less than once per minute, plain polling is probably simpler and cheaper than a long-lived connection. SSE shines for "events every few seconds for the lifetime of a multi-second operation."

For orchestrated ML inference of the kind described here, none of the four exceptions applies, so SSE is the right answer.


Practical recommendations

If you're building this for the first time:

  1. Use sse-starlette if you're on FastAPI. Don't hand-roll the SSE framing; subtle bugs in the wire format will haunt you.
  2. Make the version column explicit. Don't depend on updated_at timestamps for ordering — they have collision and resolution issues. A monotonic integer per job row is two extra lines of code and saves a debugging session later.
  3. Test with reverse-proxy buffering on, not off. A perfectly-working local-dev SSE stream that fails in staging because of nginx buffering is a deeply confusing failure to diagnose. Add buffering to your local environment so you catch it before it ships.
  4. Send a heartbeat event every 10-15 seconds. Cheaper than diagnosing why some users' streams die after 30 seconds and others don't.
  5. Plan for reconnection. Even on the best networks, expect at least 1% of connections to drop mid-stream. The reconnection path needs to work without losing data, which means the server needs to persist enough state (the queue does this naturally) and the client needs to send Last-Event-ID on reconnect.
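For the reconnection plan, the retry schedule itself is simple arithmetic. The sketch below computes a bounded exponential backoff with jitter; the base, cap, and attempt count are assumed values, and the same calculation would live in the browser-side wrapper around EventSource.

```python
import random
from typing import List, Optional

def backoff_delays(max_attempts: int, base: float = 0.5, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return one randomized wait (in seconds) per reconnection attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts):
        # The ceiling doubles on each attempt until it reaches the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Picking a random wait below the ceiling spreads out clients
        # that all lost their connection at the same moment.
        delays.append(rng.uniform(0, ceiling))
    return delays

delays = backoff_delays(8, rng=random.Random(0))
print([round(d, 2) for d in delays])
```

Once the list is exhausted, the client gives up and shows an error instead of retrying forever.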

The pattern itself is barely a hundred lines of code. The value is what it changes about the user's perception of your system: "is this thing stuck?" becomes "I can see exactly where it is in the pipeline." For long-running ML inference, that's the difference between an app users tolerate and an app users trust.


Part of an ongoing series on production medical imaging engineering. The companion SQLite queue post is here; the Windows installer deep-dive is here; the year-one reflection is here. If you're streaming long-running inference and hit one of these gotchas, reach out.