§1. What Real-Time Messaging IS
Real-time messaging is a class of infrastructure that maintains a persistent, low-latency, server-initiated channel from a backend to a connected client (browser, mobile app, IoT sensor, game client, trading terminal) so the server can push data the instant it changes. It is the in-application live data plane. When a Slack channel lights up the moment a teammate types, when a Robinhood ticker repaints sub-second on a price tick, when Figma cursors glide across a collaborator's canvas, when a Pokemon GO raid lobby fills with player positions, when a Twitch chat overlay scrolls past a streamer's screen at 30 messages per second — all of those experiences live or die on real-time messaging infrastructure.
The defining property is server-to-client push without the client asking first. The client opens one connection (typically a WebSocket, sometimes Server-Sent Events (SSE), sometimes MQTT (Message Queuing Telemetry Transport), occasionally HTTP long-poll for legacy paths), holds it open for minutes or hours, and the server pumps events down that connection as they happen.
It is essential to distinguish real-time messaging from four adjacent categories that often get confused with it:
Pull-based polling. A client asking "any new data?" every two seconds is fundamentally a different beast. Polling is request-driven, stateless, simple to operate, and catastrophic at scale: each poll is a fresh HTTP request with all its overhead (DNS, TCP handshake or HTTP/2 stream setup, TLS resume, request/response headers ≈ 1 KB minimum), it imposes a worst-case latency floor equal to the poll interval, and the cost scales linearly with the number of clients regardless of whether data changed. Real-time messaging exists to make the push path cheap and the idle path nearly free.
Mobile push notifications (Apple Push Notification service / Firebase Cloud Messaging). Documents in this series cover this elsewhere (see 12_notification.md). APNs (Apple Push Notification service) and FCM (Firebase Cloud Messaging) are out-of-band delivery: messages routed through Apple's or Google's gateway to a device that may or may not have the app running. They wake apps from background, render notification banners, deliver to suspended apps. They are decidedly NOT in-app live data — payload size is constrained (4 KB on APNs, 4 KB on FCM data messages), end-to-end latency is hundreds of milliseconds to tens of seconds, delivery is best-effort with carrier-dependent reliability, and the server never has a direct connection to the device. Real-time messaging is what you use when the app is in the foreground, has a chat window open, and needs to render a new message in 50 ms.
Message queue / event streaming (Kafka, RabbitMQ, Pulsar — see 02_message_queue.md). A message queue is broker-to-consumer. The consumers are application servers, batch workers, stream processors — not end-user devices. Queues are designed for durable, replayable, in-data-center transport with millisecond-to-second latency tolerances and consumer groups that pull at their own pace. Real-time messaging often uses a queue or pub-sub on the backend (for cross-gateway routing — see §13), but the customer-facing leg from gateway-to-device is a different protocol with different guarantees.
Web Real-Time Communication (WebRTC). WebRTC carries peer-to-peer media — video calls, voice, screen share — directly between user agents over UDP-based transports (Stream Control Transmission Protocol over Datagram Transport Layer Security, abbreviated SCTP-over-DTLS). It is built for tens-of-millisecond audio/video latency, traverses NAT (Network Address Translation) via ICE/STUN/TURN, and uses jitter buffers and codec-aware loss handling. It is not server-pushed event data; the server's role is signaling-only (and signaling itself often runs over WebSocket — see §15). When this doc talks about "real-time messaging," it means structured event delivery, not media planes.
What real-time messaging is NOT good for: durable history (no replay guarantees once delivered), exactly-once semantics (network is too lossy for that — at best, at-least-once with client dedup), strong ordering across topics or rooms (only within a single subscription, often only within a single connection), or as a system of record (a database or queue should hold the source of truth — the WebSocket gateway is a transport, not storage). It is also a poor fit for very large payloads — a 50 MB asset belongs in object storage with a URL pushed over the channel, not the asset itself.
§2. Inherent Guarantees
Real-time messaging gives you, by design:
Sub-second delivery to connected clients. The end-to-end push budget for a well-built gateway fleet is on the order of 10–80 ms within a region, dominated by network round trip and pub-sub propagation. Once a connection is established and a message is published, hitting all subscribed clients in under 100 ms is a reasonable expectation.
Broadcast to N subscribers. The protocol-level mechanic (write the frame N times to N sockets) is trivial; the architectural mechanic (route through a pub-sub fabric so all gateway nodes know who subscribes to which room) is the work, and is what differentiates a toy WebSocket server from a real platform.
Presence. "Who is currently online" is a first-class concept. A gateway knows which clients are connected because it holds their TCP sockets. Aggregating that view across a fleet (so that a UI can render "47 people in this channel") is presence-as-a-service, and most managed real-time platforms ship it.
Cheap idle. Once the connection is open and quiet, the steady-state cost per client is tiny — kernel-level TCP keepalive plus an occasional application-level heartbeat. A single Linux box with appropriate tuning routinely holds 1M+ concurrent idle WebSocket connections; the WhatsApp 2 million connections per FreeBSD server stat from 2014 is the canonical reference.
What real-time messaging does NOT give you, but is often naively assumed to:
Durable storage. If a gateway crashes and a client was disconnected at the moment a message published, the message is gone unless something else (a queue, a database, a replay buffer) wrote it down. Real-time messaging is a transport, not a journal. If your product needs "show me everything I missed while offline," you must layer a durable history on top — often a Kafka topic for fanout, a relational or document store for the user's personal history, and a sequence-number-based replay protocol so the client can ask "I last saw event 12,847, give me 12,848 onwards."
Exactly-once delivery. A TCP socket reset mid-frame, a reconnect after a wifi flip, a server failover — any of these can result in a duplicate frame, especially when combined with at-least-once retry semantics in the pub-sub backbone. Client-side dedup keyed on a server-issued message ID is mandatory.
Strong ordering across rooms or topics. Within a single subscription on a single connection, you typically get FIFO. Across two different rooms? No guarantee — they're going through different pub-sub channels that may have independent latency. Across two reconnects? Definitely no guarantee.
Authoritative state. The gateway doesn't know what the truth is. It is a delivery pipe. The application server publishes "user X said this" to the pub-sub; the gateway forwards it. If the application server's database write fails after the publish, the connected clients saw a phantom message. Production systems write to the database first, then publish; the database is the truth, the WebSocket is the notification.
Cross-region ordering. A user in Frankfurt and a user in São Paulo connected to different regional gateways will see globally-fanned events with regional latency skew. Geo-replication and conflict resolution (Conflict-free Replicated Data Types — CRDTs, or Operational Transformation — OT) are application-layer concerns, not gateway concerns.
The mental model: real-time messaging is a TCP-on-steroids primitive for hierarchical multicast to user devices. It is fast, it is broadcast-shaped, it is connection-stateful, and it requires everything else (durability, ordering, exactly-once) to be layered on by you.
§3. The Design Space
The variants split along three axes: protocol (the wire format), deployment model (self-hosted vs managed), and abstraction level (raw socket vs framework).
Protocols
| Protocol | Direction | Transport | Reconnect | Strengths | Weaknesses |
|---|---|---|---|---|---|
| WebSocket (RFC 6455) | Full-duplex | TCP, single connection after HTTP Upgrade | Manual (app/library) | Bidirectional, binary or text, low overhead per frame (2-14 bytes header), browser-native | Stateful proxies / corporate firewalls sometimes block, no built-in reconnect, requires sticky load balancing for affinity |
| Server-Sent Events (SSE) | Server → client only | HTTP/1.1 or HTTP/2 streaming, text/event-stream |
Automatic with Last-Event-ID |
Plays nice with HTTP infra (caches, proxies, CDNs), simple to implement, native browser EventSource API |
Text only, no client→server channel (must use separate POST), single connection per origin pre-HTTP/2 |
| HTTP long-poll | Pseudo-push (client polls, server holds) | HTTP/1.1 | Per-request | Works through every firewall and proxy ever built | High overhead per message (full HTTP cycle), high latency under load (request gap), expensive at scale |
| MQTT (Message Queuing Telemetry Transport) | Pub-sub | TCP (typical), or MQTT-over-WebSocket | Library-managed | Tiny header (~2 bytes), QoS (Quality of Service) levels 0/1/2, designed for low-power devices, retained messages, last-will-and-testament | Not browser-native, requires a broker (Mosquitto, EMQX, HiveMQ), pub-sub-shaped (not point-to-point) |
| gRPC streaming | Bidirectional or server-stream | HTTP/2 | Library-managed | Strong typing (Protobuf), multiplexed on HTTP/2 | Limited browser support (gRPC-Web is a degraded variant), not optimized for huge fanout |
WebSocket is the dominant choice in browser-driven chat and collaboration. SSE is the dominant choice for "live feed" use cases where the client doesn't need to push much — stock tickers, log streaming, AI chatbot token streaming, dashboards. Long-poll is legacy, retained mostly as a fallback for users behind hostile corporate proxies. MQTT is the IoT default — it ships on every Tesla, every commercial HVAC sensor network, AWS IoT Core, and Azure IoT Hub.
Deployment models
| Class | Examples | Operational profile | Cost shape |
|---|---|---|---|
| Self-hosted gateway, open source | Centrifugo (Go), socket.io (Node.js), Phoenix Channels (Elixir/Erlang), nchan (nginx module) | You run the fleet, the pub-sub broker, the LB. Maximum control, requires SRE depth. | Cap-ex on VMs/containers; cost per connection is small |
| Self-hosted, batteries-included framework | SignalR (.NET), Action Cable (Rails), Django Channels (Python) | Tied to a web framework. Productive for app teams, ceiling lower than dedicated gateways. | Same as above, plus framework coupling |
| Managed real-time platform | Ably, Pusher (now Beams/Channels), PubNub | They hold connections, fan out, persist history. You pay per connection-hour and per message. | Per-message and per-connection pricing; can balloon at scale |
| Cloud-provider native | AWS API Gateway WebSockets, Cloudflare Durable Objects with WebSocket, Azure Web PubSub | Tightly integrated with cloud functions (Lambda/Workers). Pay-per-event. | Pay per connection-minute and per message; cold-start latency on lambdas |
| Embedded in framework | SignalR for .NET, Action Cable for Rails, Phoenix Channels for Elixir | Ergonomic for the framework's developers; usually built on top of a transport (WebSocket fallback to long-poll) | Operational cost ≈ web fleet cost |
Comparison table — popular options
| System | Language | Protocol(s) | Pub-sub backbone | Sweet spot | Notable users / scale |
|---|---|---|---|---|---|
| socket.io | Node.js library | WebSocket + long-poll fallback | Redis adapter (or in-memory single node) | Small-to-mid apps that want auto-reconnect, rooms, fallback | Trello, early Microsoft Office for the web |
| Centrifugo | Go (single binary) | WebSocket, SSE, HTTP-stream, GRPC | Redis, Nats, Tarantool | Self-hosted, scriptable, free, ~1M conns per node | Mail.ru, many gaming and SaaS deployments |
| Phoenix Channels | Elixir/Erlang (BEAM VM) | WebSocket, long-poll | Phoenix.PubSub (in-VM clustering or Redis) | When you want supervisor trees and per-channel processes | Discord (until they replaced parts), bleacher report (2M concurrent in 2015) |
| Pusher Channels | Managed | WebSocket | Their internal fabric | App teams who want SDKs and to stop running infra | GitHub Issues live updates, many SaaS apps |
| Ably | Managed | WebSocket, SSE, MQTT, raw HTTP | Their global fabric | Multi-region, regulated industries needing guaranteed delivery | Live sports, fintech, big-event broadcast |
| PubNub | Managed | WebSocket, HTTP-stream | Global PoP fabric | Apps wanting in-region edge plus chat features | Curbside, telehealth, gaming |
| AWS API Gateway WebSockets | Managed | WebSocket | Triggers Lambda per event | Serverless WebSocket; modest scale; pay per event | Chat-as-feature in many enterprise apps |
| Cloudflare Durable Objects | Managed (Workers) | WebSocket | Object-as-actor (each room = 1 DO) | Per-room consistency, edge-near-user | Cloudflare's own dashboard, multi-player games like figma-style edge collab demos |
| MQTT broker (Mosquitto, EMQX, HiveMQ) | C/Erlang/Java | MQTT over TCP/TLS | Broker-internal | IoT, telemetry, sensor fanout | Tesla fleet, smart-home, industrial IoT |
Defended picks for common starting points:
- Browser chat or dashboard, you control the stack, modest scale (under a million concurrent): Centrifugo plus Redis Streams. Single Go binary, websockets and SSE both supported, easy ops, well documented, free.
- You want to ship features not run infra, sub-100K concurrent: Ably or Pusher. SDKs across every client platform, presence and history out of the box.
- Cloud-native, you live on AWS: API Gateway WebSockets + DynamoDB connection table + Lambda handlers. Cost ramps fast above ~100K concurrent but the operational story is zero.
- You're already on Cloudflare Workers: Durable Objects. One DO per room scales horizontally by room.
- IoT or sensor-heavy: MQTT with EMQX. Anything else is a wrong shape.
§4. Byte-Level Mechanics
A real-time messaging deep dive is incomplete without the wire format. Knowing what a WebSocket frame actually looks like, what text/event-stream does on the wire, and how MQTT spends two bytes saying "publish" separates someone who has shipped this from someone who has read about it.
WebSocket protocol — the handshake
A WebSocket connection begins as a plain HTTP request that is upgraded mid-stream:
GET /chat HTTP/1.1
Host: chat.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13
Origin: https://example.com
The client sends a random 16-byte nonce, base64-encoded, in Sec-WebSocket-Key. The server proves it speaks the protocol by computing:
accept = base64(SHA1(Sec-WebSocket-Key + "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"))
That magic GUID is hardcoded in RFC 6455 specifically so that a server that doesn't understand WebSockets cannot accidentally complete a handshake. Response:
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
After the 101, the TCP connection is no longer HTTP. It is a stream of WebSocket frames in both directions.
WebSocket protocol — the frame format
The frame layout, byte by byte:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-------+-+-------------+-------------------------------+
|F|R|R|R| opcode|M| Payload len | Extended payload length |
|I|S|S|S| (4) |A| (7) | (16/64) |
|N|V|V|V| |S| | (if payload len==126/127) |
| |1|2|3| |K| | |
+-+-+-+-+-------+-+-------------+ - - - - - - - - - - - - - - - +
| Extended payload length continued, if payload len == 127 |
+ - - - - - - - - - - - - - - - +-------------------------------+
| |Masking-key, if MASK set to 1 |
+-------------------------------+-------------------------------+
| Masking-key (continued) | Payload Data |
+-------------------------------- - - - - - - - - - - - - - - - +
: Payload Data continued ... :
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +
| Payload Data continued ... |
+---------------------------------------------------------------+
Field walk-through:
- FIN (1 bit):
1if this frame is the final fragment of a logical message. Long messages can be split across multiple frames withFIN=0until the last. - RSV1/2/3 (1 bit each): Reserved. Used by extensions (e.g.,
permessage-deflatesets RSV1 to mark a compressed frame). - Opcode (4 bits):
0x0= continuation,0x1= text (UTF-8),0x2= binary,0x8= close,0x9= ping,0xA= pong. - MASK (1 bit): Must be 1 on client-to-server frames, must be 0 on server-to-client frames. This is an anti-cache-poisoning measure baked into the spec.
- Payload length (7 bits, possibly +16 or +64): Three-tier encoding.
≤125means the value is the length.126means the next 16 bits are an unsigned big-endian length (up to 64 KB).127means the next 64 bits are an unsigned big-endian length (theoretically 2^63 bytes; in practice servers cap this). - Masking key (32 bits, only if MASK=1): Random per-frame. The client XORs payload bytes against this rotating key before sending.
- Payload: The actual data. If MASKed, recipients XOR back against the masking key.
A minimal "hello" text frame from client to server: 0x81 0x85 m1 m2 m3 m4 h⊕m1 e⊕m2 l⊕m3 l⊕m4 o⊕m1. That's 11 bytes on the wire for a 5-character message. The same message in HTTP polling would be ~600 bytes including headers. That ratio — roughly 50–100× less overhead per event in steady state — is the entire economic argument for WebSocket.
Control frames (close, ping, pong) carry no payload or a tiny one and must have FIN=1. Ping/pong is the protocol-level heartbeat; many implementations use it to detect dead connections within 30–60 seconds.
Server-Sent Events — on the wire
SSE is dramatically simpler. The client opens a long-lived HTTP GET with:
GET /events HTTP/1.1
Accept: text/event-stream
Cache-Control: no-cache
The server responds with 200 OK, Content-Type: text/event-stream, and a body that is never closed. The body format:
id: 12847
event: message
data: {"user":"alice","text":"hi"}
id: 12848
event: presence
data: {"user":"bob","status":"away"}
retry: 5000
:heartbeat
Each event is one or more key: value lines terminated by a blank line. id becomes Last-Event-ID on the next connect attempt — the browser sends it automatically as a header, and the server uses it to replay missed events. event is the named event type (default message). data is the payload (and is the only field the JavaScript receives via e.data; everything else is metadata). Lines starting with : are comments, often used as a heartbeat (e.g., :keepalive\n\n every 15 s to prevent intermediate proxies from killing an idle stream).
Automatic reconnect is built into the browser's EventSource API. If the connection drops, the client waits the retry value (defaulting to ~3 s) and reconnects with Last-Event-ID. This is why SSE is so often the right call for "feed" UX cases — the gnarly part of WebSocket apps (reconnect logic with backoff and replay) is in the platform, not the app.
MQTT — fixed header + variable header + payload
MQTT's frame is even tighter than WebSocket's:
Fixed header:
7 6 5 4 3 2 1 0
+---+---+---+---+
| Type | Flags | ← 1 byte
+---+---+---+---+
| Remaining Len | ← 1-4 bytes, variable-length integer
+---+---+---+---+
The Type (4 bits) names the packet — 1=CONNECT, 2=CONNACK, 3=PUBLISH, 4=PUBACK, 8=SUBSCRIBE, 13=PINGREQ, etc. Flags (4 bits) encode per-type modifiers (for PUBLISH: DUP=duplicate-of-prior-publish, QoS, RETAIN).
After the fixed header comes a remaining-length field encoded as a variable-byte integer (7 bits of payload per byte, top bit as continuation). A single byte handles up to 127, four bytes handle up to 256 MB.
A QoS 0 PUBLISH of {"temp":21.3} on topic sensor/42/temp is:
30 — type=PUBLISH (3<<4), flags=0
13 — remaining length = 19
00 0D — topic name length = 13
73 65 6E 73 6F 72 2F 34 32 2F 74 65 6D 70 — "sensor/42/temp"
{ " t e m p " : 2 1 . 3 } — payload (12 bytes)
About 21 bytes for the whole exchange. A web-style POST of the same data with JSON body, headers, and TLS overhead is comfortably 1 KB. For battery-powered sensors waking up to send a reading once a minute, that ratio determines whether the device lasts a year or a month.
MQTT QoS (Quality of Service) levels:
- QoS 0 (at-most-once): Fire and forget. No acknowledgment. Fast, but messages can be silently dropped. Used for telemetry where loss of a single sample is acceptable (CPU temperature every second).
- QoS 1 (at-least-once): PUBLISH followed by PUBACK. Sender retransmits until acked. Receiver may see duplicates. Used for events that must arrive but tolerate dupes (sensor exceeding threshold).
- QoS 2 (exactly-once): Four-step handshake: PUBLISH → PUBREC → PUBREL → PUBCOMP. Highest cost, used rarely (financial transactions over MQTT). The performance hit is meaningful — you've doubled the round trips.
HTTP long-polling — request held open
The client issues a regular HTTP GET with a since=<cursor> parameter. The server doesn't reply immediately; it parks the request — typically holding the response open for 25–60 seconds — and returns the moment new data arrives, or returns an empty 204 when the parking timeout fires. The client immediately reconnects. From the application's perspective it's push; from the protocol's, it's a sequence of polls where each request happens to be long. Overhead per message is high (1 full HTTP cycle per event); the only reason it survives is that it works through every firewall and proxy on Earth — even ones that strip the WebSocket Upgrade header and break SSE.
End-to-end walkthrough — one chat message
Concretely tracing a single chat send from user Alice to a room of 5,000 subscribers, on a fleet using WebSocket + Redis Pub/Sub:
- Alice's browser holds an open WebSocket to gateway node
gw-7. She types "hello" and the app callsws.send('{"room":"r-99","text":"hello"}'). The browser builds a text frame: opcode0x1, MASK bit set, masking key fromcrypto.getRandomValues(), masked payload. About 35 bytes go on the wire. gw-7reads the frame off the kernel socket (epoll wakeup), unmasks, parses JSON. Validates Alice's session (cached from the auth handshake at connect time — see §7). Calls into the application layer:chatService.publish(room='r-99', from='alice', text='hello').- Application layer writes the message to the persistent store (Cassandra or DynamoDB) with a monotonically-increasing sequence number for that room — say
seq=12,847. This is the durable write that makes the chat replayable. - Application layer publishes to Redis:
PUBLISH room:r-99 '{"seq":12847,"from":"alice","text":"hello","at":169...}'. The Redis primary fans this out over its pub-sub fabric to every Redis subscriber. - All gateway nodes (
gw-0throughgw-31) are subscribed toroom:r-99. Each receives the published payload via their long-livedPSUBSCRIBEorSUBSCRIBEconnection. - Each gateway looks up its in-memory map of
room → set<connection>. It builds a WebSocket text frame (server-to-client, no mask):0x81 + length + payload. It writes this frame to every relevant socket. For 5,000 subscribers spread across 32 gateways, each gateway pushes ~156 frames. The kernel handles this withsendmsg()calls and TCP windowing. - Each subscriber's browser receives the frame, the runtime delivers
onmessagewith the parsed payload, the chat UI renders.
End-to-end budget on a tight system: ~5 ms for steps 1-3, ~5 ms for the Redis publish + fanout, ~5–20 ms for each gateway to push to its sockets, ~10–30 ms network to the client. Total: 25–80 ms median for everyone in the room. The architecture is the deep dive — each component is responsible for a slice of that budget, and any one of them being slow shows up as users grumbling that "Slack feels laggy today."
§5. Capacity Envelope
The capacity story of real-time messaging is dominated by concurrent connections more than by message rate. The unit economics are: how many TCP sockets can one box hold? How many messages per second can the fleet move? How much memory and how many file descriptors per socket?
Small scale: one Node.js + socket.io box, ~1,000 concurrent. A side-project chat app, a niche dashboard, a hobby multiplayer game. The default Node.js + socket.io configuration with no tuning will happily hold 1–5K connections on a small VM. Memory per connection is ~5–10 KB at the application layer, ~10 KB more in the kernel. Messages-per-second is in the low thousands. No load balancer, single box, redeploy = everyone reconnects, life is fine. Latency is sub-50 ms because everything is one box.
Mid scale: a chat product with 100K concurrent. Typical SaaS messaging tier — a customer support chat embedded in a B2B product, a real-time dashboard for a fleet management company. Two to ten gateway nodes (Centrifugo, socket.io with Redis adapter, Phoenix Channels), a Redis cluster for pub-sub, a load balancer with WebSocket support (HAProxy, NGINX, Envoy, or AWS ALB), and an application fleet that publishes events. Per node: 20-50K connections, depending on language runtime (Erlang/Elixir hit 100K easily, Node.js needs more boxes). Messages per second: tens of thousands. Latency budget: 20-100 ms p99.
Large scale: Slack-tier, hundreds of millions of WebSocket connections per day, tens of millions concurrent. Slack famously runs a custom WebSocket gateway tier (formerly an EVS / Edge Vendor Server architecture, with later modifications). Each gateway is a stateful node holding hundreds of thousands of sockets, fronted by ELBs. The pub-sub backbone routes channel-bound messages across the gateway fleet. They publish channel events to a fleet of gateways, gateways subscribe to channels their clients are in (using lazy subscription — see §13), and presence updates are aggregated. Slack has publicly discussed throughput on the order of billions of messages per day and ~10M concurrent at peak. Discord runs Elixir/Phoenix, has reported holding millions of WebSocket connections concurrently and 5M+ concurrent voice users at peak, with a custom Erlang-based "guild" actor model where each Discord server is a process. WhatsApp ran on FreeBSD with custom Erlang, famously holding 2M connections per server on commodity hardware back in 2014; today's WhatsApp Web pushes similar numbers per box.
Giant scale: ten million concurrent on a single broadcast. Twitch real-time chat has handled streams with 1M+ concurrent chatters fanning out to overlay viewers. The 2019 Fortnite "The End" event saw 12M+ concurrent players in a coordinated event; Pokemon GO during a major raid event saw 50M+ concurrent active players, each with bidirectional channels for game state. At this scale, the bottleneck is not connection count — it's fanout amplification: 1 user types in a global chat with 1M subscribers, and the system must write 1M frames in milliseconds. Hierarchical broker tiers, regional clustering, and per-room sharding become structural. Twitter Spaces and major event broadcasts coordinate fanout via multi-tier fanout fabrics where one "publish" expands across regional clusters before hitting individual gateway nodes.
IoT scale: millions of MQTT sensors. A Tesla fleet, a smart-meter rollout in a national power grid, a connected-vehicle telematics platform. These systems hold hundreds of thousands of MQTT connections per broker node; brokers like EMQX have published benchmarks of 10M+ connections per node with appropriate tuning. AWS IoT Core reportedly handles hundreds of millions of devices for customers like John Deere and various utility cooperatives.
Bottleneck progression by scale tier:
- 1K connections: nothing breaks, you're fine.
- 10K connections: file descriptor limits, default 1024 per process — bump to 1M with
ulimit. - 100K connections per node: TCP buffer memory dominates — tune
tcp_rmem/tcp_wmemlower. - 1M connections per node: ephemeral port exhaustion if your gateway dials outbound (each in-bound socket needs a kernel struct but no port). Source ports for outbound pub-sub connections become a limit.
- 10M concurrent fleet-wide: cross-region routing, pub-sub fabric throughput, presence aggregation. Architecture changes shape.
§6. Architecture in Context
The canonical pattern, regardless of vendor:
Clients (browser, mobile, IoT)
│
│ TLS-terminated WebSocket / SSE / MQTT
▼
┌───────────────────────┐
│ L7 Load Balancer │ ← supports WebSocket upgrade,
│ (ALB, Envoy, HAProxy)│ passes Upgrade/Connection headers,
└───────────────────────┘ sticky by hash(client_ip) or cookie
│
┌─────────┬─┴─┬─────────┐
▼ ▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│gw-node │ │gw-node │ │gw-node │ ← stateful: each connection lives on one
│ 0 │ │ 1 │ │ N │ node. In-memory map: conn_id → socket
└────────┘ └────────┘ └────────┘ and room_id → set<conn_id>
│ │ │
│ SUBSCRIBE / PUBLISH on rooms / users / topics
▼ ▼ ▼
┌───────────────────────────┐
│ Pub-sub backbone │ ← Redis Pub/Sub, Redis Streams,
│ (cross-gateway routing) │ Kafka, NATS JetStream
└───────────────────────────┘
▲
│ PUBLISH events
│
┌───────────┴───────────┐
│ Application servers │ ← actual product logic; writes to DB,
│ (REST/gRPC handlers) │ then publishes the change as an event
└───────────────────────┘
│
▼
Durable stores
(Cassandra, DynamoDB,
Postgres, Kafka journal)
The structural facts:
- Each WebSocket lives on exactly one gateway node. That node holds the TCP socket, the per-connection state (auth, subscriptions, send buffer). If it dies, the socket dies, and the client must reconnect.
- The load balancer needs sticky affinity only for reconnect. A WebSocket connection is sticky-by-nature (the TCP connection is to one IP); the LB chooses the node at connect time. Sticky cookies matter for HTTP long-poll fallback (so successive polls hit the same node).
- The pub-sub layer is what makes scaling horizontal. A new gateway node can join the fleet, subscribe to the pub-sub fabric, and start accepting connections. It doesn't need to know about other gateway nodes — only the shared pub-sub does.
- The application servers don't talk to gateways directly. They publish events to the pub-sub layer with topic/channel/room as the key. The gateways subscribe to channels that their connected clients have joined. This decouples the application's write path from the connection topology.
For Cloudflare Durable Objects, the picture is different — the "gateway" is the Durable Object, one per room or per user. The pub-sub is implicit (clients all WebSocket to the same DO, which holds them in a single actor). Scaling shifts from "more gateway nodes" to "more rooms ⇒ more DOs spread across the edge." Each DO is single-threaded and consistent for its scope — a natural fit for collaborative editing of a single document where you want strong serialization.
For AWS API Gateway WebSockets, the gateway is stateless from your perspective — AWS holds the sockets, and each frame triggers a Lambda invocation. State (which connection IDs are in which room) lives in a DynamoDB table you maintain. Fanout to a room becomes "look up connections in DDB, then call PostToConnection for each."
§7. Hard Problems Inherent to This Technology
7.1 Connection state is stateful — each WebSocket is pinned to one node
The naïve fix: "just put a load balancer in front."
How it breaks: the load balancer chooses node gw-3 for Alice's initial connection. Now the application server publishes "new message in room r-99." Which node should receive it? Not just gw-3. Every node with a subscriber in r-99. The LB can't help with this; it only knows ingress.
The real fix: a pub-sub fabric across nodes. Each gateway, when a client joins a room, subscribes to room:r-99 on Redis or a similar channel-based broker. The application publishes once; Redis fans out to every gateway holding a subscriber. This decouples ingress from delivery.
The downside: every gateway has to maintain potentially thousands of channel subscriptions on Redis. Naïve implementations subscribe-per-room and run into Redis client subscription limits. The fix is "subscribe lazily" (gateway subscribes when first client joins a room, unsubscribes when last leaves) and "subscribe-by-shard" (subscribe to one of 256 hash-partitioned channels and filter in-process).
7.2 Connection storm on deploy
Naïve case: deploy a new version of your gateway. Each replaced node terminates its connections. 200K clients reconnect within 30 seconds.
How it breaks: a thundering-herd reconnect drives load balancer CPU through the roof, and the new gateway nodes get hit by a TLS handshake storm — 200K handshakes in 30 seconds is 6.7K TLS handshakes per second sustained, which is enough to topple a CPU-bound TLS terminator. Worse, all those clients re-fetch initial state (history, presence, room membership) simultaneously, hitting the database with a synchronized read storm.
The real fix: jittered reconnect. The client library, when it detects disconnect, waits a random delay drawn from an exponential distribution (e.g., min(retry_count^2 * 1s, 30s) * Math.random()). Browsers implementing this spread the reconnect storm across 30-60 seconds. The server controls this via a hint — Centrifugo sends a disconnect_advise with a recommended backoff; SSE uses the retry: field. Combined with a rolling deploy (1-5% of nodes per cycle, not all at once), the storm becomes a manageable trickle.
A real example: Slack's downtime stories mention reconnect storms after WebSocket fleet restarts. They mitigate with deliberate connection draining, advance signaling, and per-region rolling rollouts.
7.3 Presence at scale
Naïve: every client sends an app-level heartbeat every 5 seconds; gateway joins this across the fleet for the "who's online" view.
Breaks: 100M users × heartbeat/5s = 20M heartbeats/sec. Even at 50 bytes each, that's 1 GB/s of pure heartbeat, and cross-fleet aggregation dwarfs the actual chat traffic.
Fix: don't compute global presence — compute scoped presence. A user only needs presence of people relevant to them — direct contacts, current channel, current document. Gateway publishes "user X is on me" once at connect, refreshes via TTL (Time To Live) in Redis (key presence:user:42 with EX 60), any other gateway can GET presence:user:42. The heartbeat is in-band via WebSocket ping/pong — no application-level heartbeats. For mass-presence broadcasts like "X is typing" to a 1M-user room, throttle (one event per 5s per user per channel) and drop fanout entirely above ~1K subscribers — just show "lots of typing".
7.4 Backpressure — slow clients fill server memory
Naïve: gateway writes to socket, kernel buffers, life moves on.
Breaks: a client on a 56K connection or stalled iOS background socket reads slowly. Kernel send buffer fills, gateway userspace queue fills. Fire-and-forget write() accumulates GBs of buffered data per slow client. With 200K connections, even 10 KB per buffer = 2 GB of resident memory for the slow ones alone.
Fix: bounded per-connection queues with explicit overflow policy. Per-socket write queue; when it exceeds threshold (e.g., 1 MB or 100 messages), gateway either (a) drops oldest, (b) drops the connection, or (c) downgrades the client to summary subscription. Centrifugo, socket.io, and Phoenix all expose this as configurable. Discord publicly described a backpressure incident where slow consumers caused fanout-side memory pressure; their fix was strict per-connection caps with disconnect on overrun.
7.5 Reconnect with replay — what did I miss while offline?
Naïve: when client reconnects, send current state.
Breaks: "current state" alone misses in-between events. If Alice's wifi was dead for 30 seconds while 50 messages arrived, "current state" might be just the latest — she has a gap. For collaborative editing this is worse: missing one OT (Operational Transform) op means client state diverges.
Fix: server-assigned monotonic sequence numbers per room/topic + bounded replay buffer. Each event has seq=N; client tracks highest seq seen. On reconnect, client sends last_seq=12847; server replays 12848 through current. If buffer doesn't go back that far, server tells client to re-fetch full state from the database. Centrifugo calls this "history with positioning"; Ably calls it "channel rewind"; Phoenix calls it "presence diff plus event history" — same pattern, different names. Redis Streams (XADD + XREVRANGE + XREAD BLOCK) is the canonical primitive — both subscribe and replay on a bounded log.
7.6 Authentication of WebSocket connections
Naïve: send a JWT (JSON Web Token) in the first WebSocket frame after connecting.
Breaks: the connection is established before any frame arrives. Unauthenticated clients have consumed a slot, gotten through your firewall, completed TLS. Browsers also don't let you set arbitrary headers on a WebSocket — only Sec-WebSocket-Protocol is app-controllable.
Three established patterns: (1) Cookie-based auth on the HTTP Upgrade request — same-origin cookies ride along on the Upgrade as any HTTP request; Slack, Discord, embedded chats do this; fragile cross-origin. (2) Query parameter token — wss://chat.example.com/ws?token=<JWT>; easy but shows up in access logs (privacy); use one-time, short-lived tokens. (3) Auth-after-connect with grace period — open unauthenticated, expect an AUTH frame in 5 seconds, close otherwise; most managed platforms (Ably, Pusher) use this combined with per-IP connect quotas.
Whatever the pattern, identity is verified once at connect and cached for the connection's lifetime. Mid-connection revocation requires periodic re-check or short-enough lifetimes that token expiry forces reconnect.
7.7 Multi-region routing
Naïve: one region; users connect to the nearest gateway.
Breaks: Alice in Frankfurt is on eu-west-1. Bob in São Paulo is on sa-east-1. Same chat room. Bob publishes to São Paulo's Redis; Frankfurt gateways aren't subscribed there. Alice never sees the message.
Fix: cross-region replication of the pub-sub layer. Options: (1) Active-active Kafka with MirrorMaker — canonical heavyweight; each region has a Kafka cluster, MirrorMaker replicates topics. (2) NATS supercluster — built-in cross-cluster routing, milliseconds to propagate. (3) Vendor magic — Ably, PubNub, Cloudflare DOs all do this under the hood. Latency hit is physics: Frankfurt-to-São Paulo is ~110 ms RTT one-way. Fatal for real-time games (hence regional sharding), tolerable for chat.
7.8 Load balancer affinity vs node death
Naïve: sticky sessions via cookie — reconnects always go to the same node.
Breaks: gw-3 is dead. Cookie says "gw-3." Either LB respects it (client gets 502s in a loop) or ignores it (not actually sticky). Even on a healthy node, the new node lacks session state and must rebuild from auth + rooms + presence.
Fix: sticky-but-failover. LB hashes the client preferentially but falls back if preferred is gone. New gateway queries Redis on connect: "what state did this user have?" — auth token validated, last-known rooms listed. Gateway rejoins rooms on pub-sub, client continues. This is "session rehydration" — Phoenix Channels does it with Presence and Channel.join callbacks.
§8. Failure Mode Walkthrough
Gateway node crash. A gw-3 segfaults or the host kernel panics. All TCP sockets it held are dead. From the client's perspective, the WebSocket emits a close event (or a heartbeat timeout if the host vanished without RST). The client library backs off and reconnects to a different gateway via the LB. The new gateway authenticates, restores room subscriptions from session state (cookie + cached presence in Redis), subscribes on pub-sub. Lost data window: anything published between the old gateway's death and the new gateway's resubscribe. The replay protocol (§7.5) covers this if implemented.
Pub-sub broker outage. Redis cluster has a primary failover; for 5-30 seconds, publishes fail. Application servers see errors. Gateways see no events. From the client's perspective, the WebSocket is up but the room is silent. Local-only messages (gateway to a client connected to the same gateway, if the gateway short-circuits same-node publishes) still work. After Redis recovers, the gap is filled via the replay protocol (clients send last_seq, gateway pulls from a durable backing — Kafka, Postgres — replays). The durability point is the application's write to the database: the database has the truth even when Redis is degraded.
Slow client behind buggy NAT or carrier. Connection is technically alive but data isn't flowing. WebSocket pings time out. Gateway closes the socket. Client's library reconnects on the next heartbeat. This is the dominant failure mode in real-world deployments — far more common than crashes. Buggy NATs at hotels or budget carriers idle-timeout TCP connections at ~30 seconds; the fix is application-level keepalive every 25 seconds.
iOS / Android background. App goes background. iOS suspends the process within 30 seconds (sometimes faster). The WebSocket is killed by the OS. The server eventually notices via failed heartbeat. To wake the app for inbound messages, the system uses a push notification delivered via APNs/FCM — the out-of-band channel. On app foreground, the WebSocket reconnects and the replay protocol fills the gap. This is the canonical "real-time messaging plus push notifications" hybrid; neither one alone covers backgrounded mobile.
WiFi to cellular transition. Phone walks out of a coffee shop. WiFi network leaves; cellular takes over. TCP connection on old WiFi NIC is dead (RST when the carrier-grade NAT loses state). Client detects via heartbeat timeout or online/offline browser event, reconnects on the new network. Reconnect path uses replay (§7.5).
Network partition between gateway and pub-sub. Gateway can't reach Redis; clients connect but can't subscribe. Modern gateways treat this as a fatal local error and refuse new connections (drain mode) or shed existing connections to make clients reconnect to healthy nodes. Recovery: the operator fixes the network, then the gateway exits drain.
Permanent loss of a region. EU region goes dark for hours. Users connected there get retried by the client library; eventually their library tries a backup hostname or a regional DNS rotation routes them elsewhere. Cross-region pub-sub means messages in flight at the moment of partition are lost; the durability story is "the database has the truth and the user catches up on reconnect."
Durability point summary: the database write before publish, plus the replay buffer (Redis Streams, Kafka log) for in-flight events, plus the client-side last_seq cursor that lets clients ask "give me what I missed." A real-time message bus cannot be the truth — its job is to be a fast, lossy mirror of the durable journal underneath. Every failure-mode recovery routes through that fact.
§9. Why Not Just Polling?
The math is the argument. Consider a chat app with 1M concurrent users, average 1 message per user per minute incoming.
Polling at 5-second interval: - Each poll: full HTTP request (TCP+TLS handshake amortized via HTTP keep-alive, but headers + path + cookies still ≈ 800 bytes both ways). - Polls per user per minute: 12. - Polls per minute: 12M. - Bytes per minute: 12M × 1.6 KB = 19.2 GB/min = 320 MB/sec. - Server CPU: 12M requests/min ≈ 200K req/sec sustained — needs a fleet of HTTP servers just for polling. - Worst-case latency: 5 seconds (mid-poll). p50: 2.5 seconds. Catastrophic for chat UX.
WebSocket equivalent: - 1M sockets open, idle. - Per-message overhead: ~30 bytes (frame header + small payload). - Messages per minute: 1M (one per user) × ~30 bytes = 30 MB/min = 500 KB/sec. - Server CPU: handling 17K msg/sec across a fleet — modest. - Latency: 20-80 ms p99.
That's a ~600x reduction in bytes and ~10-100x in CPU, and the latency improves by two orders of magnitude. Polling at low scale is fine — under 10K users, it works. Past that, you're paying for HTTP overhead per idle user, and that gets ruinous fast.
There's a second mode of failure too: polling implicitly assumes the server can serve every poll. A burst of concurrent polling clients from the same network (an office, an apartment block) creates synchronized request spikes — every 5 seconds, a thundering herd of polls hits the LB at once. WebSocket completely sidesteps this: the LB is touched once, at connect time.
The exception that proves the rule: long-poll fallback. Even WebSocket-first stacks ship long-poll as a fallback for hostile network paths (corporate proxies, ancient firewalls, IE11). They tolerate the cost because it's a small minority of users. Polling-as-primary is what you choose only when WebSocket and SSE both fail or the use case is genuinely poll-shaped (a GET /jobs/status for a long-running job).
§10. Scaling Axes
Type 1 — uniform growth (more concurrent connections, same rate per connection). A chat app that grows from 100K to 10M users, each averaging the same volume per user. The fix is horizontal partitioning of gateways. Add nodes; the load balancer spreads ingress; the pub-sub fabric handles cross-node routing; per-user load stays constant.
Inflection points:
- Up to ~50K connections per node: anything works (Node.js, Python, Go).
- 50K-500K per node: language matters. Go, Rust, Java/Netty, Elixir/BEAM handle this without breaking a sweat. Node.js works but you'll have more nodes.
- 500K-2M per node: kernel tuning territory. Bump fs.file-max, lower net.ipv4.tcp_rmem/wmem, tune somaxconn, disable Nagle on small writes. Reference: WhatsApp's 2M-per-box configuration.
- Above ~2M per node: physics. PCIe bus contention, NIC queue depth, kernel scheduler overhead. Solutions: kernel-bypass (DPDK, user-space TCP stacks like F-Stack) — only Cloudflare, FB, and a few others have done this for WebSocket.
Pub-sub backbone scaling: - Single Redis primary: ~10K pub-sub fanouts/sec, ~100K subscribers per channel. - Redis cluster (hash-slotted): scales by sharding channels; ~100K fanouts/sec per slot. - Kafka: orders of magnitude more durable throughput, higher latency (~5-20 ms vs <1 ms for Redis). - Nats with JetStream: in the same band as Kafka, lighter ops.
Type 2 — hotspot intensification (one room has 1M subscribers — fanout amplification). A live event in a chat room, a celebrity's stream, a live trading dashboard during a market crash. Same number of users, but they're all watching one thing.
The fix: fanout-aware topology. Approaches:
- Hierarchical fanout. Instead of one publish-to-N-subscribers, publish to a tree: 1 → 32 → 1024 → final clients. Each layer fans out to ~32. Latency increases logarithmically, throughput per publisher stays constant.
- Regional sharding. A global room with 1M subscribers is split into 10 regional rooms of 100K each, with a regional "echo" service that copies events between regional buckets. Each gateway only fans out to its local subscribers.
- Sampling and rate-limiting at scale. For a stream chat with 1M viewers and 1K msg/sec, every viewer can't receive every message — even with cheap fanout, the network bandwidth is hostile. Twitch dynamically downsamples chat at high viewer counts; older messages are dropped on the floor, only "popular" ones (high engagement) are propagated to all.
- Server-broadcast over IP multicast (private networks). For datacenter-internal fanout, IP multicast lets one packet hit thousands of subscribers. Used in financial market data dissemination — not in public Internet WebSocket fanout, because the public Internet doesn't support multicast.
Inflection points for hot rooms: - 1K subscribers: trivial. - 100K subscribers: needs cross-gateway fanout via pub-sub. - 1M subscribers in one room: needs hierarchical fanout or hub-and-spoke regional structure. - 10M+ subscribers in one event: needs CDN-style edge fanout (Cloudflare for video chat overlays, Twitch's own edge tier).
§11. Decision Matrix
| Scenario | WebSocket | SSE | Long-poll | Push notification | Polling |
|---|---|---|---|---|---|
| Browser chat (in-app) | Best | OK if no client→server | Fallback only | No (out-of-app) | No |
| Stock ticker / price feed | OK | Best (server→client only, HTTP-friendly) | No | No | No |
| Collaborative editing | Best (needs bidirectional, low latency) | No (no upstream channel) | No | No | No |
| Multiplayer game state | Best (low-latency bidirectional) | No | No | No | No |
| Backgrounded mobile app | No (connection dies) | No | No | Best | No |
| IoT sensor telemetry | OK | OK | No | No | OK if cheap |
| MQTT-enabled embedded device | OK (over WS) | No | No | No | No (use MQTT direct) |
| Job status (one fetch when done) | Overkill | OK | OK | OK | Best (simple) |
| AI chatbot token streaming | OK | Best (server→client, simple HTTP) | No | No | No |
| Live sports score | OK | Best | OK | No | No |
| Trading order entry + market data | Best (needs client→server orders + server→client ticks) | No | No | No | No |
A useful heuristic: if the channel is genuinely bidirectional (chat where the user types AND receives) and you control the client, use WebSocket. If it's server-to-client only (a feed of updates, a notification stream, an AI response token-by-token), use SSE — you get reconnect, replay, and HTTP infrastructure for free. If you're targeting an app in the background or a phone that may not have the app running, use APNs/FCM via the platform-native push pipeline. Polling is a last resort or a degenerate-case fallback.
§12. Protocol Deep Dive — WebSocket vs SSE
The WebSocket-vs-SSE choice is the single most consequential protocol decision in a real-time messaging architecture. People default to WebSocket because it's familiar; many of those projects should have used SSE.
WebSocket is full-duplex. Same TCP connection carries client→server and server→client frames. Required when both sides need to push — chat, collaborative editing, gaming.
SSE is server-only. Client→server happens over separate HTTP POSTs. This sounds limiting and is, but it has structural advantages:
-
It's HTTP all the way down. Caches don't break it. Proxies don't break it. CDNs can handle it (some, like Fastly and Cloudflare, support SSE natively). Corporate firewalls that strip
Upgrade: websocketheaders don't stripContent-Type: text/event-stream. -
Reconnect is in the browser.
EventSourcereconnects automatically on disconnect, sendingLast-Event-ID. The server replays events from that point. The amount of application code needed to ship a reliable SSE client is approximately zero; the same code in WebSocket land is hundreds of lines. -
HTTP/2 multiplexing. With HTTP/2 or HTTP/3, multiple SSE streams share a single TCP connection. WebSocket cannot — each WebSocket is its own TCP connection. For browsers that have many tabs to many origins, SSE wins on connection count.
-
It plays nice with existing HTTP load balancers. ALB, NGINX, Envoy — all handle SSE correctly out of the box. WebSocket requires explicit support (the
Upgradeheader negotiation), which not every LB ships well.
The "we don't actually need bidirectional, just use SSE" simplification has been adopted by:
- OpenAI's streaming chat API. ChatGPT's token-by-token streaming is SSE, not WebSocket. The client sends one POST with the prompt; the server streams response tokens as SSE events. Reconnect-and-replay isn't really meaningful here (you wouldn't replay a partial response), but the simplicity wins.
- Cloudflare's dashboard live updates. They use SSE for streaming logs and live event panels.
- GitHub's live issue comment updates. SSE-based.
- Many AI-product builders post-2023. The streaming-response-as-tokens pattern is a near-universal SSE deployment.
When WebSocket wins: chat, collaborative editing, multiplayer games, signaling for WebRTC, anything where the client needs to push more than an occasional event.
When SSE wins: feed updates, ticker data, AI streaming, log tailing, dashboards, notifications inbox.
§13. Pub-Sub for Cross-Node Routing
The pub-sub backbone is the heart of every multi-node real-time messaging system. The choice of backbone determines durability semantics, latency, throughput, and operational complexity.
Redis Pub/Sub. The classical choice. PUBLISH ch payload to a channel; every subscriber receives it. Subscriptions are connection-bound (a Redis client with SUBSCRIBE becomes pub-sub-only until unsubscribed). Latency: sub-millisecond within a datacenter. Throughput per Redis primary: ~100K-500K publishes/sec. Durability: zero. A Redis primary that restarts loses every pub-sub message in flight; a subscriber that disconnects mid-publish loses any frames between disconnect and reconnect. Acceptable for chat (clients reconnect, replay from a separate journal), unacceptable for "missed events must not be lost" semantics.
Redis Streams. The modern Redis answer: a durable, ordered log per stream key. XADD stream * field val appends; XREAD BLOCK 0 STREAMS stream $ blocks for new entries. Streams have bounded length (MAXLEN ~ 10000 trims to the most recent 10K entries). This is the foundation of the replay protocol — gateway holds a cursor into the stream, on reconnect the client tells the gateway its last cursor, gateway reads forward. Latency: low single-digit ms. Throughput: similar to pub-sub; trimming and AOF (Append-Only File) writes add some cost.
Kafka. The heavyweight, durable-replayable choice. Topics, partitions, consumer groups. Latency: 5-20 ms typical, can be tuned down with linger.ms=0 and batch.size=1. Throughput: millions of messages per second per broker. Durability: full — events persist to disk with configurable replication. Used for the cross-region replication leg of large-scale real-time messaging (Slack, Discord pre-Elixir, LinkedIn's internal "courier" system for in-app messaging). Operational cost is real — Kafka clusters need careful tuning.
NATS (with JetStream). A middle ground: fast like Redis, durable like Kafka, lighter ops than either. Used by Centrifugo as an alternative backbone; emerging as a popular choice. JetStream is the durability layer; raw NATS Core is pub-sub without persistence.
Gateway-to-pub-sub mechanics — the "subscription per channel" problem. Naïve gateway: when client joins room r-99, gateway calls SUBSCRIBE room:r-99 on Redis. When client leaves, UNSUBSCRIBE. Result: gateway has thousands of subscriptions, which is fine on Redis. But: 200K gateways × 10K rooms each = 2M subscriptions; Redis can handle it, but the bookkeeping is finicky and disconnect-handling is error-prone.
The robust pattern: partitioned subscriptions. Hash room IDs into N buckets (e.g., 256 channels named bucket:0-bucket:255). Each gateway subscribes to all 256 buckets at startup and never unsubscribes. When a publish happens, the publisher computes bucket = hash(room_id) % 256 and publishes to bucket:<n> with the room ID in the payload. Each gateway receives every message in its 256 buckets, filters in-memory for rooms its connected clients are in, and forwards to relevant sockets.
This trades a small amount of wasted bandwidth (gateways receive messages for rooms they don't have subscribers for) for radically simpler subscription management. It's how Centrifugo and most mature gateways work internally.
§14. Presence System
Presence is "who is currently online" or "who is currently in this room." Conceptually simple, terrifying at scale.
Per-client mechanic. Client opens WebSocket. The WebSocket protocol's ping/pong frames carry the heartbeat — every 25-30 seconds, the gateway sends a ping, the client responds with a pong. If three pongs fail, gateway closes the connection. No application-level heartbeat needed.
Per-user presence record. When a client connects, gateway writes presence:user:42 = { gw: 'gw-7', session: 's-abc', joined: 169... } to Redis with a 60-second TTL. Gateway refreshes the TTL every 30 seconds while the connection lives. When the connection closes, gateway deletes the key.
Per-room presence. When a client joins room r-99, gateway adds 42 to the set presence:room:r-99. When they leave or disconnect, gateway removes. A query "who's in r-99" is SMEMBERS presence:room:r-99.
Cross-fleet aggregation. Each gateway publishes presence-change events to pub-sub: gw-7 announces user 42 joined r-99. Other gateways with subscribers interested in r-99 receive the event and update their local presence cache. This is fanout-amplified again — a 1M-subscriber room with churn (people joining/leaving every second) generates a flood of presence updates.
Throttling. No room with more than ~1000 subscribers gets fine-grained presence. Instead, presence is summarized: "1,247 people online" updated every 5 seconds; individual user-online events are not propagated. This is what Slack does for large channels — you don't see individual people leaving, just a count.
Mass-presence broadcasts — "X is typing". Naïve: every keystroke sends a TYPING event to all subscribers. Reality: throttle to "first keystroke of typing-session, then mute for 5 seconds, then re-fire if still typing." For huge rooms, drop entirely. Discord and Slack both do this.
TTL-based eventual cleanup. The TTL on presence:user:42 is the safety net. If a gateway dies without cleanly removing presence (server crash, network partition), the key expires after 60 seconds and presence is correct again. The exact value (60 seconds is common) trades stale-presence latency against heartbeat frequency.
§15. Use Case Gallery
Chat applications. Slack, Discord, WhatsApp Web, Telegram, Teams. Tens of millions concurrent globally, per-channel fanout (hundreds of msg/sec in busy channels), presence (typing, online), reconnect-with-replay for flaky mobile. Pattern: WebSocket + Redis or Kafka pub-sub + durable journal (Cassandra/ScyllaDB at Discord, custom log at Slack) + APNs/FCM for backgrounded mobile.
Live collaborative editing. Google Docs, Figma, Notion, Linear. Many clients simultaneously editing shared state. Transport: WebSocket. Application: CRDT (Conflict-free Replicated Data Type) or OT (Operational Transform) for convergence without locks. Figma uses Cloudflare Durable Objects (one DO per file) — single-threaded actor model serializes edits naturally. Google Docs uses OT with centralized server; Notion uses CRDT.
Live trading dashboards. Robinhood, Coinbase, Bloomberg. Sub-second price ticks and order-book deltas. Mostly server-to-client; SSE is increasingly popular for the HTTP-infra and auto-reconnect wins. Throughput-bound: fast symbols hit hundreds of book updates/sec; common pattern is "full updates for top symbols, batched 100ms summaries for the rest."
Multiplayer games. Fortnite, Roblox, MMOs. Server-authoritative state, tick-rate updates. WebSocket in browsers; raw UDP in native clients for lower latency tolerating packet loss. Spatial fanout (nearby players see you, not the world). Photon and Mirror dominate; commodity real-time platforms too generic.
Live sports / event broadcasts. ESPN, Twitter Live. Score updates fanned to millions. SSE or WebSocket with hierarchical fanout via managed platforms — Ably built much of their reputation here.
Social presence. LinkedIn "Active now", Facebook green dot, Slack presence dots. Low-volume per user, high read fan-in. Solution: only fetch presence for users currently in UI; cache "active in last 5 min" aggressively rather than real-time.
IoT device telemetry. Tesla fleet, smart home (Nest, Ring), connected cars, industrial sensors. MQTT dominates — per-message overhead matters when battery-powered devices wake briefly to send a reading. Brokers (EMQX, HiveMQ, AWS IoT Core) hold millions of devices. Pattern: sensor → MQTT broker → Kafka/Kinesis → time-series store (TimescaleDB, InfluxDB) → dashboards via SSE.
In-app notification inbox. Bell icon with unread count, list of recent activity. Distinct from APNs/FCM out-of-band push. WebSocket or SSE pushes new notifications live. LinkedIn's bell, Twitter notifications, GitHub notifications all run on this.
WebRTC signaling. Video calls need a signaling channel for SDP (Session Description Protocol) offer/answer and ICE candidate exchange. Always WebSocket — bidirectional, low latency, low volume. Twilio Video, Daily.co, LiveKit all run their signaling brokers on WebSocket.
§16. Real-World Implementations
Slack. Custom WebSocket gateway tier; each client holds one persistent WebSocket to a regional gateway. Stateful gateways (millions of connections per fleet, tens of thousands per node), channels subscribed via Kafka-like pub-sub backbone. Billions of messages/day, ~10M+ concurrent WebSockets globally. Edge Gateway architecture (multi-layer with regional caches) covered in a publicly-discussed SRECon talk.
Discord. Elixir + Phoenix Channels — BEAM VM gives them millions of concurrent processes naturally. Each guild is a process, each user is a process, channel subscriptions are message-passing. Tens of millions of WebSockets concurrently, 5M+ concurrent voice users on spikes. Cassandra/ScyllaDB for message persistence, Rust for hot paths.
WhatsApp. FreeBSD + custom Erlang; the 2014 acquisition revealed 2M connections per box. WhatsApp Web connects to the user's phone (relayed through WhatsApp's servers) for end-to-end encrypted delivery — truth on the phone, web client is a relay.
Pusher Channels. Managed WebSocket fleet for thousands of customer apps. Billions of messages per day. Used by GitHub Issues for live updates and many SaaS apps for "this changed" events.
Ably. Managed real-time platform with strong consistency claims. Hundreds of billions of messages per month. Used by BBC Sports, regulated industries, event broadcasters. Differentiation: multi-region durability, guaranteed delivery with replay buffers.
PubNub. Shape similar to Pusher/Ably with heavier edge-presence emphasis. Used by Curbside, telehealth, gaming.
Figma. Multiplayer cursors + document collab using Cloudflare Durable Objects + WebSocket. One DO per file, on edge near where editing happens. Cursor positions stream at ~30 Hz, model deltas as they happen — actor-per-document, not connection-per-node.
Phoenix Channels at scale. Bleacher Report ran 2M concurrent connections on a small cluster in 2015 — the scale-per-node story that put Elixir on the map for real-time. Phoenix's Presence module uses a CRDT (variant of OR-Sets — Observed-Remove Sets) to gossip presence across nodes without central authority.
Centrifugo. Standalone Go binary; the closest thing to "Redis for WebSockets." Used by Mail.ru, various crypto exchanges, gaming platforms. Single node holds ~1M concurrent; multi-node scales linearly via Redis or Nats. One binary, one config — operationally light.
AWS IoT Core. Massive MQTT broker fleet. John Deere (tractors), Goodyear (smart tires), utility cooperatives (smart meters) — hundreds of millions of devices. Bills per device-connection-minute and per message.
Cloudflare Workers + Durable Objects. Each DO holds WebSockets for one room or actor. Hosted on the edge near first connection. Cloudflare's own dashboard, plus customers building chat/gaming/collab. Scales by room count not user count.
Twitch Chat. Historically a custom IRC-based protocol (Internet Relay Chat), with WebSocket gateways translating IRC for browsers. 1M+ concurrent chatters per stream during major events, tens of thousands of messages/sec for popular streams.
§17. Summary
Real-time messaging is a transport class: it solves "deliver this event to all connected clients in under a second," with a connection-stateful, fan-out-shaped architecture pinned by a pub-sub fabric and supported by a durable journal underneath. Pick WebSocket for bidirectional channels, SSE for server-to-client streams, MQTT for IoT, and APNs/FCM for backgrounded mobile. The hard parts are not the protocol but the operational shape: connection storms on deploy, fanout amplification on a million-subscriber room, presence at planetary scale, and the reconnect-with-replay protocol that makes the lossy transport behave like a reliable one.