How I Moved Real-Time Out of Django (and Made Everything Simpler)

Your users want live updates: an order flips to paid, a notification pops, a dashboard number ticks up without a refresh. The obvious answer in Django is Channels. I reached for it first too. But as our scale grew, the costs of keeping connections in-process pushed me to rethink the architecture. I let Django stop handling WebSockets entirely, and almost everything got simpler. Here's the pattern that replaced it.

The Problem

Django is a request/response framework. It processes a request, returns a response, and moves on. Real-time notifications — a dashboard ticking up, an order status flipping live — need a server that holds a connection open for every connected client. That is a fundamentally different job.

The standard answer in the Django ecosystem is Channels. It moves you to ASGI, gives you consumers, and works well, especially when your real-time needs are deep and bidirectional.

Ours weren't. We needed server→client push: notify a user when an order completed, or when a payment landed. The connections were mostly idle. Channels gave us that, but it came with costs that started hurting at scale.

The concrete problem: every open WebSocket connection consumed memory and a small slice of CPU time on our app servers. A thousand connected dashboards was fine. Ten thousand, spread across dozens of tenants and each holding a persistent socket, was not. Our workers were suddenly doing two very different jobs at once: serving fast API responses and babysitting long-lived, mostly quiet connections. Both suffered.

What made it brittle was the reconnect storm. Deploy a new build, restart a server, or hit a brief network partition, and every connected client disconnects and immediately reconnects. All at once. That thundering herd of TLS handshakes, authentication, and subscription requests would spike CPU to 100% on every app server as it tried to spin up thousands of ASGI consumer instances simultaneously, occasionally knocking them over entirely before the retry backoff kicked in.

There's also the smaller but telling friction of auth. Authenticating the WebSocket handshake means reaching your existing auth (DRF/JWT) from inside an ASGI scope. I ended up writing middleware that reconstructed a DRF Request out of the raw scope just to reuse the auth I already had. It worked, but it was a sign: I was bending the framework to do something it wasn't built for.

The turning point was asking: why is my web framework in the business of holding connections at all?

For a fire-and-forget server→client push, it shouldn't be. There are servers built specifically to hold millions of open connections and do nothing else. Your backend needs to do only two things: say who's allowed to listen, and publish events.

The Solution

Put a dedicated pub/sub gateway in front of your app. I used Centrifugo, a standalone real-time server. Clients connect to it, not to Django. Django goes back to being a plain, boring WSGI/sync app and plays two small roles:

Issue a short-lived connection token that embeds which channels this user may subscribe to.
Publish events to the gateway over its HTTP API.

The gateway holds the connections. Django holds nothing.

Step 1: Issue a token that carries its own permissions

When a client wants to connect, it asks your backend (over a normal authenticated request) for a connection token. The token is a JWT whose claims list the exact channels the user may subscribe to. The gateway verifies the signature and authorizes subscriptions without a single database lookup:

# realtime/auth.py
import time
import jwt
from django.conf import settings


class RealtimeTokenService:
    """Issues short-lived JWTs for connecting to the realtime gateway."""

    def generate_connection_token(self, user) -> str:
        tenant_id = user.tenant_id

        # Channel grants are baked into the token, so the gateway authorizes
        # subscriptions with zero database lookups on connect.
        channels = [
            f"user_{user.id}",
            f"critical:user_{user.id}",
            f"tenant_{tenant_id}",
            f"critical:tenant_{tenant_id}",
        ]

        claims = {
            "sub": str(user.id),
            "exp": int(time.time()) + settings.REALTIME_TOKEN_TTL_SECONDS,
            "info": {"name": user.get_full_name(), "email": user.email},
            "channels": channels,
        }
        return jwt.encode(claims, settings.REALTIME_HMAC_SECRET, algorithm="HS256")

This is the key idea: token-as-permission. The channel names encode your authorization model. A user gets their own user_{id} channel and their tenant's tenant_{id} channel, and nothing else. Multi-tenant isolation falls out of the channel naming: a user simply has no grant for another tenant's channel, so the gateway will refuse the subscription.
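To make token-as-permission concrete, here is a rough sketch of the check a gateway conceptually performs after it has verified the token's signature. The helper name is_subscription_allowed and the sample claims are hypothetical, made up for illustration; the real gateway does this internally and its exact rules depend on how it is configured.

```python
def is_subscription_allowed(claims: dict, channel: str) -> bool:
    """Return True only if the verified token explicitly grants `channel`."""
    # Hypothetical helper: assumes `claims` came from a token whose
    # signature and expiry were already checked.
    return channel in claims.get("channels", [])


# Claims shaped like the ones generate_connection_token() produces,
# using sample values: user id 7 in tenant 42.
claims = {
    "sub": "7",
    "channels": ["user_7", "critical:user_7", "tenant_42", "critical:tenant_42"],
}

print(is_subscription_allowed(claims, "tenant_42"))  # True: own tenant
print(is_subscription_allowed(claims, "tenant_43"))  # False: no grant
```

The point of the sketch is that the decision needs nothing but the token itself: no user table, no tenant membership query.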

Step 2: Hand the token out behind your normal auth

The endpoint that mints the token is an ordinary authenticated view. Your existing auth stack protects it; no ASGI gymnastics:

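# realtime/views.py
from django.conf import settings
from rest_framework.permissions import IsAuthenticated
from rest_framework.response import Response
from rest_framework.views import APIView

from .auth import RealtimeTokenService


class RealtimeTokenView(APIView):
    permission_classes = [IsAuthenticated]

    def post(self, request):
        token = RealtimeTokenService().generate_connection_token(request.user)
        return Response(
            {"token": token, "socket_url": settings.REALTIME_SOCKET_URL}
        )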

The client takes {token, socket_url}, opens a WebSocket to the gateway, and subscribes to its channels. Django is now out of the connection entirely.

Step 3: Publish events over HTTP

To push an update, your backend POSTs to the gateway's publish API. No connection state, no sockets, just an HTTP call:

# realtime/publisher.py
import requests
from django.conf import settings


def push_event(channel, event_type, payload, *, recoverable=False, timeout=10.0):
    # The "critical:" namespace is configured on the gateway with message
    # history + recovery, so clients can catch up after a reconnect.
    final_channel = f"critical:{channel}" if recoverable else channel

    resp = requests.post(
        f"{settings.REALTIME_API_URL}/publish",
        json={
            "channel": final_channel,
            "data": {"type": event_type, "payload": payload},
        },
        headers={"X-API-Key": settings.REALTIME_API_KEY},
        timeout=timeout,
    )
    resp.raise_for_status()

Notice the critical: channel prefix. On the gateway, that namespace is configured to keep a short history so a client that briefly drops can recover missed messages on reconnect:

{
  "namespaces": [
    { "name": "critical", "history_size": 50, "history_ttl": "120s", "force_recovery": true }
  ]
}

Ephemeral events (a "user is typing" blip) publish to the bare channel; events you can't afford to lose (a completed order) publish to critical: and survive a reconnect.
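If it helps to picture what history plus recovery buys you, here is a toy model of a bounded per-channel history. It is only a sketch under my own assumptions: the class ChannelHistory and its methods are invented for illustration, it ignores history_ttl entirely, and it is not how the gateway is actually implemented.

```python
from collections import deque


class ChannelHistory:
    """Toy model of a bounded per-channel history (illustrative only)."""

    def __init__(self, size: int = 50):
        self._messages = deque(maxlen=size)  # oldest entries fall off the front
        self._offset = 0                     # grows by one per publication

    def publish(self, data) -> int:
        self._offset += 1
        self._messages.append((self._offset, data))
        return self._offset

    def recover(self, last_seen_offset: int):
        """Return (missed_messages, fully_recovered) for a reconnecting client."""
        missed = [(o, d) for o, d in self._messages if o > last_seen_offset]
        # Recovery is complete only if nothing between the client's last
        # seen offset and the oldest retained message has been evicted.
        oldest = self._messages[0][0] if self._messages else self._offset + 1
        fully_recovered = last_seen_offset >= oldest - 1
        return missed, fully_recovered


history = ChannelHistory(size=3)
for n in range(1, 6):
    history.publish({"order_id": n})

# A client that saw offset 3 before dropping gets offsets 4 and 5 back.
missed, ok = history.recover(last_seen_offset=3)
```

A client that was away long enough for its gap to fall out of the window gets fully_recovered set to False, which is the cue to reload state from the normal API instead of trusting the stream.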

Step 4: Make publishing reliable, and don't block the request

Publishing is a network call to a separate service. You do not want it on the request's critical path, and you want it to survive a transient blip. Wrap it in a Celery task with retries and backoff:

# realtime/tasks.py
import logging
from requests.exceptions import RequestException, HTTPError
from celery import shared_task
from .publisher import push_event

logger = logging.getLogger(__name__)


@shared_task(bind=True, max_retries=3)
def publish_realtime_event(self, channel, event_type, payload, recoverable=False):
    # retry_backoff only takes effect together with autoretry_for, so for
    # manual self.retry() calls the delay is computed here: 5s, 10s, 20s.
    countdown = 5 * (2 ** self.request.retries)
    try:
        push_event(channel, event_type, payload, recoverable=recoverable)
    except HTTPError as exc:
        # A 4xx error (bad API key, malformed payload) is a deterministic
        # failure that retries can't fix. Log it and drop it.
        # A 5xx error might be transient, so retry it.
        if 400 <= exc.response.status_code < 500:
            logger.error("Deterministic 4xx error on %s, dropping.", channel)
            return
        raise self.retry(exc=exc, countdown=countdown)
    except RequestException as exc:
        # Pure network timeouts or connection errors: retry with backoff.
        raise self.retry(exc=exc, countdown=countdown)
    except Exception:
        # A serialization bug won't fix itself on a retry. Log and drop it,
        # rather than burning retries on a guaranteed failure.
        # By not re-raising, Celery marks the task as successful and removes
        # it from the queue, so there is no poison pill looping forever.
        logger.exception("Dropping unrecoverable realtime event on %s", channel)

That except split matters: retry the transient, drop the deterministic. A flaky network deserves another try; a payload that can't be JSON-encoded will fail identically on every attempt, so retrying it is just noise.
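The two decisions inside the task can be pulled out as small pure functions, which makes them easy to reason about and test. This is a minimal sketch, not part of the task itself: the names is_retryable_status and backoff_delay, and the 60-second cap, are my own assumptions. The 5-second base matches the delay used in the task.

```python
def is_retryable_status(status_code: int) -> bool:
    """4xx is deterministic (drop); anything else, notably 5xx, may be transient."""
    return not (400 <= status_code < 500)


def backoff_delay(retries: int, base: float = 5.0, cap: float = 60.0) -> float:
    """Delay before the next attempt: base, 2*base, 4*base, ... capped at `cap`."""
    return min(cap, base * (2 ** retries))


print([backoff_delay(n) for n in range(3)])  # [5.0, 10.0, 20.0]
```

The cap matters more as max_retries grows; with three retries the delays never reach it.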

Firing an event from anywhere in your app is now a one-liner:

publish_realtime_event.delay(
    channel=f"tenant_{order.tenant_id}",
    event_type="order.completed",
    payload={"order_id": order.id, "total": str(order.total)},
    recoverable=True,
)

Why This Works

Your web tier goes back to being stateless. Django stays on WSGI. No ASGI migration, no async server, no sockets held in your workers. You can scale, restart, and deploy the web tier like any other stateless service. The open connections live on the gateway and aren't disturbed.

Connections scale independently. Holding a lot of idle connections is a specialized job. A purpose-built gateway does it on its own box, tuned for that, while your app servers stay sized for request throughput.

Auth costs no database work at connect time. Because the token carries its channel grants, the gateway never calls back into your database to authorize a subscription. Ten thousand clients reconnecting after a blip is ten thousand signature checks, not ten thousand DB queries.

Multi-tenancy is just naming. Tenant isolation isn't a separate access-control layer. It's the channel names in the token. There's no code path by which a user can subscribe to a channel they weren't granted.

Delivery is fire-and-forget but durable. Publishing happens off the request path via Celery, retries on transient failure, and the critical: namespace lets clients recover what they missed across a reconnect.

Design Decision: The Token Is a Cache, So It Can Go Stale

Baking permissions into the token is what buys you DB-free connects. But a token is a snapshot. If a user's access changes (you remove them from a tenant, say) and their token still has 24 hours to live, the gateway will keep honoring the old grants until it expires. You've traded revocation freshness for connect-time performance.

That's usually the right trade for a notification stream, but be deliberate about it:

Keep the TTL short and have the client transparently re-fetch a token. Shorter tokens = smaller stale window.
For true revocation, the gateway can be told to force-disconnect a user or refuse channels server-side. Reach for that only when a stream is sensitive enough to need it.
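For the short-TTL option, the client has to know when to ask for a fresh token. Here is one rough way to work that out from the token's exp claim, sketched in Python for consistency with the rest of the post (a browser client would do the equivalent in JavaScript). The function seconds_until_refresh, the 30-second margin, and the _fake_token demo helper are all hypothetical.

```python
import base64
import json
import time
from typing import Optional


def seconds_until_refresh(token: str, now: Optional[float] = None,
                          margin: float = 30.0) -> float:
    """How long to wait before re-fetching a token, based on its `exp` claim."""
    # Read the payload segment WITHOUT verifying the signature. That is fine
    # for scheduling a refresh, but never for an authorization decision.
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    now = time.time() if now is None else now
    return max(0.0, claims["exp"] - margin - now)


def _fake_token(claims: dict) -> str:
    """Build an unsigned, JWT-shaped string for demonstration only."""
    def b64(obj) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    return ".".join([b64({"alg": "none"}), b64(claims), "sig"])


token = _fake_token({"sub": "7", "exp": 1_000_300})
print(seconds_until_refresh(token, now=1_000_000))  # 270.0
```

Refreshing a little before expiry, rather than at it, keeps the connection from ever presenting a dead token.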

One more operational gotcha: a single gateway is tempting to share across environments, but tenant_42 in staging and tenant_42 in production are the same channel name: cross-talk waiting to happen. If you must share one broker, prefix channels with the environment (prod_tenant_42, staging_tenant_42) so the namespaces can't collide.
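A tiny helper keeps that prefix from being forgotten at call sites. This is a sketch with a made-up name, env_channel; where the environment string comes from (a setting, an environment variable) is left as an assumption.

```python
def env_channel(environment: str, channel: str) -> str:
    """Prefix a channel with its environment, e.g. 'prod' + 'tenant_42'."""
    return f"{environment}_{channel}"


print(env_channel("prod", "tenant_42"))     # prod_tenant_42
print(env_channel("staging", "tenant_42"))  # staging_tenant_42
```

Apply the prefix to the base channel name before any critical: namespace is added, so the namespace stays at the front (critical:prod_tenant_42) and the gateway still matches it. The same prefix has to go into the channel list baked into the token.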

And to be honest about the path here: I shipped the Channels version first. It worked. But the friction of running an async server and bridging my existing auth into the ASGI scope is exactly what made the gateway approach worth it. If your real-time needs are deep and bidirectional (collaborative editing, presence), in-process Channels may genuinely fit better. For "push updates to subscribed clients," letting something else hold the sockets is the simpler system.

The Result

Adding a live update anywhere in the app is now one .delay() call. The web tier never learned what a WebSocket is. Connections, fan-out, history, and recovery are the gateway's problem; authorization is a signed list of channel names; delivery is a retrying background task. Django went back to doing the thing it's good at. And the real-time problem moved to a server built for it.

How are you doing real-time in your Django apps — in-process with Channels, or have you pushed connections out to a gateway? I'd love to hear what's worked for you in the comments.

¡Hasta luego!
