A reference about the technology class — durable, asynchronous message transport between producers and consumers — exemplified by Apache Kafka, Apache Pulsar, RabbitMQ, AWS SQS (Simple Queue Service), AWS Kinesis, Google Pub/Sub, NATS JetStream, and the newer S3-native variants (WarpStream, AutoMQ). The reader should finish understanding the design space deeply enough to apply it to any system — payments, click-stream analytics, IoT (Internet of Things) telemetry, CDC (Change Data Capture) pipelines, microservices fan-out — and defend a pick against the obvious alternatives.
§1. What message queues and distributed logs ARE
A message queue or distributed log is a durable, asynchronous transport between producers (who publish records) and consumers (who read records), inserted between two components that would otherwise talk synchronously over RPC (Remote Procedure Call). It buys what synchronous calls cannot: temporal decoupling (consumer doesn't need to be up when producer publishes), fan-out (one publish, many consumers), and back-pressure absorption (publish spikes buffered, not crashing the consumer).
The category splits into two structurally different primitives that docs and interviews often blur. Get the split right or every later choice is wrong.
The log model (Kafka, Pulsar, Kinesis). Broker maintains an append-only, retained, replayable log per partition. Producers append; broker never deletes individual records, only whole segments past retention. Consumers track their own read position (the "offset"). Broker doesn't know whether any consumer has read any record — it just holds the tape. Mental model: a replicated tail -f of an event tape. Fan-out is free (N consumer groups each track their own offset). Replay is built-in.
The queue model (RabbitMQ, SQS, ActiveMQ). Broker maintains a per-message state machine — each record is unacked, in-flight, acked, redelivered, or dead-lettered. On ack, broker deletes the record. Broker tracks delivery; consumer does not track position. Mental model: a shared work-stealing queue with redelivery and per-message routing. Fan-out is expensive (broker replicates into N per-subscriber queues). Replay is generally impossible.
Both sit between producer services (order-service of a payments platform, activity tracker of a social feed, IMU publisher of a vehicle fleet) and consumer services (billing pipeline, feed indexer, time-series store). Infrastructure, not application logic — a self-contained broker cluster with a consensus layer for metadata.
What they are NOT good for: synchronous request/reply with tight latency budgets (a queue adds a hop + tens of ms — for "must respond < 50 ms" use gRPC); strict global total ordering at high throughput (per-partition is achievable, cross-partition at millions/sec is not); a primary system of record needing point queries (a log is sequential — use a database, let the log be the change stream); tiny scale with rich per-message semantics (50 msg/sec with priority routing + reply-to: Postgres SKIP LOCKED or a single RabbitMQ exchange is simpler than a Kafka cluster).
§2. Inherent guarantees — what's provided vs what must be layered on
Provided by design: (1) Durability for acknowledged writes — all variants ack only after replication (or fsync, in BookKeeper-backed Pulsar) to enough nodes to survive single-node loss. (2) Per-partition/per-queue total order — within one partition or queue, records appear in the order the broker accepted them; cross-partition order is not guaranteed. (3) At-least-once delivery as the default — producer retries until ack; consumer commits only after processing; duplicates tolerated, loss is not. (4) Retention as a first-class concept (logs only) — records live for N days independent of consumption, enabling replay.
NOT provided (Staff-level traps): (1) Exactly-once without explicit opt-in — Kafka has a transactional protocol, Pulsar has dedup + cursors, SQS FIFO has a 5-minute dedup window. All require producer + consumer + sink to participate; none give end-to-end exactly-once across an external API call (Stripe, Postgres) without application-level idempotency keys. (2) Global ordering — per-partition only; order requirements must be encoded into the partition key (e.g., account_id for ledger events). (3) Atomicity with external systems — consumer-read + DB-write is not atomic; outbox or 2PC (Two-Phase Commit) must be layered on. (4) Schema evolution — broker doesn't know the payload; a Schema Registry + backward-compatible Avro/Protobuf is required. (5) Quality of service on shared clusters — a noisy-neighbor topic saturates broker IO; quotas or per-tenant clusters layered on.
What the system designer must add: ack/commit strategy, partition-key strategy, retention policy, schema-evolution policy, dead-letter strategy, replay procedure. A queue or log is not a "magic durable pipe" — it is a substrate; the application architecture is everything else.
§3. The design space
Five axes differentiate the named variants. The rest of the space is variation along them.
Axis 1: Log vs queue (storage model). Covered in §1. Log: retained segments. Queue: per-message state machine. Fan-out cost differs by orders of magnitude.
Axis 2: Broker-managed vs peer-to-peer. Broker-managed (Kafka, Pulsar, RabbitMQ, SQS) — dedicated broker processes own data. Peer-to-peer (NATS Core, gossip-based) — every node is a peer; lower latency, weaker durability. Mostly out of scope here — peer-to-peer is for ultra-low-latency ephemeral telemetry.
Axis 3: Sync vs async replication. Sync (Kafka acks=all, Pulsar Qa=Qw) — producer waits for replicas; ~5-15 ms, no loss on single-node failure. Async (Kafka acks=1, RabbitMQ default) — leader acks, replicas catch up later; ~1-3 ms, data-loss window on leader crash. None (acks=0) — fire-and-forget for non-critical telemetry. Pick is workload-driven: payments outbox always sync; click-stream analytics often async; Tesla fleet CAN-bus (Controller Area Network) telemetry sync for fault codes, async-sampled for routine sensor reads.
Axis 4: Local-disk vs object-storage durability. Local NVMe + replication (classic Kafka, BookKeeper bookies) — ~1 GB/sec/broker, p99 ~5-15 ms, ~$0.10/GB-month NVMe (Non-Volatile Memory express). Object storage primary (WarpStream, AutoMQ since ~2023) — brokers stateless, durability is S3's 11-nines, p99 ~200 ms (S3 multipart upload), ~$0.023/GB-month, 5-10x cheaper at high retention. This axis emerged after 2022 and is the active frontier of the field.
Axis 5: Single-cluster vs geo-replicated. Single cluster across 3 AZs (Availability Zones) is the default. Cross-region active-passive (MirrorMaker 2, Pulsar geo-rep) gives RPO (Recovery Point Objective) on the order of seconds. Cross-region active-active makes conflict resolution a system requirement; Pulsar supports it natively, Kafka requires careful idempotency design.
Comparison table
| Dimension | Kafka | Pulsar | RabbitMQ | SQS | Kinesis | WarpStream / AutoMQ |
|---|---|---|---|---|---|---|
| Primitive | Log | Log (segregated storage) | Queue | Queue | Log | Log |
| Storage | Local disk + replication | BookKeeper (fsync/entry) | Memory + optional disk | Managed | Managed | S3 |
| Throughput / node | ~1 GB/sec | ~1 GB/sec | ~50K msg/sec/queue | 3K msg/sec/queue | 1 MB/sec/shard | ~600 MB/sec, latency-bound |
| p99 produce latency | 5-15 ms | 5-15 ms | 1-10 ms | 20-100 ms | 20-50 ms | 200-500 ms |
| Replay | Yes (in retention) | Yes (tiered storage) | No | No | Yes (24h-365d) | Yes (S3-native) |
| Ordering | Per-partition | Per-partition | Per-queue, single consumer | None (FIFO variant exists) | Per-shard | Per-partition |
| Exactly-once | Idempotent + transactional | Dedup + cursors | At-least-once only | FIFO 5-min dedup | At-least-once | Yes (Kafka API) |
| Geo-replication | MirrorMaker 2 | Native multi-cluster | Federation (limited) | Cross-region (limited) | Cross-region streams | S3 cross-region |
| Best fit | High-throughput event sourcing, replay-heavy | Multi-tenant, geo-active-active | Complex routing < 50K/sec | AWS-native work queues | AWS streaming < 1 MB/sec/shard | High-retention, cost-sensitive |
The instinct is not "Kafka always wins." Match the workload's throughput, latency, retention, and fan-out shape to a point on this matrix: a payments outbox at 5K events/sec wants RabbitMQ + DLQ simplicity; a click-stream pipeline at 5M events/sec wants Kafka; a multi-tenant SaaS event bus with per-tenant isolation across regions wants Pulsar; a 30-day-retention telemetry archive on a tight budget wants WarpStream.
§4. Byte-level mechanics — the log on disk
This is the section where the depth lives. The Kafka log, since 2010, is the canonical realization of the distributed log primitive; every later log (Pulsar segments, Kinesis shards, S3-native variants) borrows from it. We walk Kafka because it's the most documented.
4a. Segmented append-only log + sparse offset index
A single partition's on-disk layout:
/kafka/data/clickstream-page-views-23/ (one directory per partition)
├── 00000000000000000000.log closed segment, 1 GB, offsets 0..2_048_000
├── 00000000000000000000.index sparse offset → byte position
├── 00000000000000000000.timeindex sparse timestamp → offset
├── 00000000000002048000.log closed segment, 1 GB
├── 00000000000002048000.index
├── 00000000000002048000.timeindex
├── 00000000000004096000.log ACTIVE segment, currently appending
├── 00000000000004096000.index
├── 00000000000004096000.timeindex
├── leader-epoch-checkpoint which epoch took over at which offset
└── partition.metadata
Segment rotation rule: when the active segment hits segment.bytes (default 1 GB) or segment.ms (default 7 days), it is closed and a new active segment opens. Closed segments are immutable. That immutability is the entire foundation — it enables zero-copy reads, lock-free deletion, lock-free compaction, and tiered-storage offload.
File naming: filename is the zero-padded base offset. To locate offset 3,500,000 (e.g., a Tesla fleet telemetry event from 3 weeks ago an analyst wants to replay): (1) binary-search segment filenames → largest base ≤ 3,500,000 is 00000000000002048000.log; (2) open sparse .index (one entry per index.interval.bytes = 4 KB, so ~256K entries for 1 GB segment, ~2 MB of index, kept in memory); (3) binary-search index → byte position within ~4 KB of target; (4) scan forward through .log parsing record-batch headers until exact offset.
The index is sparse on purpose — dense index would cost 64 bytes × billions of records, doesn't fit. Sparse + scan-forward gives O(log n) locate at 0.001x the memory cost of dense indexing.
4b. Record batch layout
The producer never sends one record at a time on the wire. It accumulates records into batches (default up to 16 KB or linger.ms worth), compresses the batch end-to-end with lz4 / zstd / snappy / gzip, and sends the compressed batch as one unit.
RecordBatch (one unit on disk):
┌───────────────────────────────────────────────────────────────────────────┐
│ baseOffset (8B) batchLength (4B) partitionLeaderEpoch (4B) │
│ magic (1B) crc (4B) attributes (2B) │
│ - bits: compression (none|gzip|snappy|lz4|zstd), timestamp type, │
│ transactional flag, control batch flag │
│ lastOffsetDelta (4B) firstTimestamp (8B) maxTimestamp (8B) │
│ producerId (8B) producerEpoch (2B) baseSequence (4B) │
│ recordsCount (4B) │
│ ┌─ compressed body ──────────────────────────────────────────────────┐ │
│ │ record 0: length, attributes, timestampDelta, offsetDelta, │ │
│ │ keyLength, key, valueLength, value, headers │ │
│ │ record 1: ... │ │
│ └────────────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────────┘
Things that matter: (1) Broker stores compressed batch verbatim — no decompression on hot path; consumer decompresses client-side. This is why one broker sustains ~1 GB/sec — CPU never touches payload. (2) producerId + baseSequence — foundation of idempotent producers. Broker tracks last sequence per (producerId, partition), rejects duplicates. Retries reuse the sequence; broker silently dedups. (3) CRC (Cyclic Redundancy Check) per batch — verified on every fetch and replication; detects bit rot, partial writes, network corruption.
4c. Why this structure suits the workload — sequential writes, page cache, sendfile
Three properties make this design hit hardware limits rather than software limits.
Sequential writes. Every append is to the tail of the active segment. NVMe sequential write hits 3-7 GB/sec — raw device throughput; random writes hit 100-200 MB/sec, 20-50x slower. Even on spinning rust (which LinkedIn ran for years), sequential gives ~200 MB/sec vs ~5 MB/sec random — a 40x gap. Append-only is the entire reason Kafka can use commodity disk at 1 GB/sec.
OS page cache as the read cache. Broker write() puts bytes in kernel page cache and returns immediately. Kernel writeback flushes dirty pages asynchronously (every ~30s via dirty_writeback_centisecs). Pages stay in page cache long after they're durable. Consumers reading recent offsets — the dominant pattern, since most consumers are within seconds of the log head — hit page cache, no disk IO. 64-256 GB RAM/broker effectively serves a multi-TB hot working set from RAM with no application cache.
sendfile() syscall for zero-copy reads. Broker doesn't read the file into user-space and write to socket. It calls sendfile(socket_fd, log_fd, &start, length); kernel copies from page cache directly to socket buffer. Modern NICs DMA (Direct Memory Access) page-cache pages straight to the wire.
Without sendfile (4 copies): disk → page cache → JVM heap → kernel socket → NIC
With sendfile (2 copies): disk → page cache → NIC
2-4x throughput multiplier at essentially zero broker CPU. Also why Kafka's broker can be JVM (Java Virtual Machine) without GC (Garbage Collection) pressure crushing the hot path — payload bytes never enter the heap. RabbitMQ cannot do this because it inspects every message to apply routing.
4d. Durability: fsync vs replication
Three durability points: (1) Page cache — write() returns; bytes in kernel memory, not disk. Kernel panic or power loss (no battery-backed cache) loses these. (2) Replication ack — followers' Fetch pulls bytes, write to their own page caches; once min.insync.replicas (default 2) ack, leader advances HW (High Watermark) and acks producer. (3) Disk fsync — kernel writeback flushes dirty pages, typically every ~30s.
Kafka's durability invariant is replication-based, not fsync-based. Default log.flush.interval.messages = Long.MAX_VALUE — Kafka does NOT fsync per write. Argument: correlated loss of leader + ≥1 follower's page cache (3+ machines losing power simultaneously) is rarer than per-broker fsync failure. With 3x replication across 3 AZs and independent power, this is sound.
Pulsar takes the opposite position. BookKeeper bookies fsync every entry before acking. Stronger per-node durability, so Pulsar can ack with fewer replicas (Qw=2, Qa=2). Trade-off: ~0.5 ms NVMe fsync, ~5 ms HDD per write.
Interview trap: candidates routinely say "Kafka fsyncs every write." It doesn't. For fsync-per-write semantics (financial regulation), set log.flush.interval.messages=1 and pay ~10x throughput. Or run Pulsar.
4e. Produce + consume walkthrough at byte level
Concrete example: payments service writes order_created with key=order_42; fraud-detection consumer (different team) reads it.
Produce path:
- T0: producer's send() buffers the record into a 16 KB batch destined for partition 7 (hash(order_42) % 256 = 7); lz4-compressed end-to-end.
- T1: after linger.ms or batch-full, producer sends Produce RPC to broker B-7 (leader of partition 7).
- T2: broker B-7 verifies CRC, appends to active segment via write() at current EOF. Bytes land in page cache; file position advances 16 KB. Adds sparse index entry if cumulative bytes since last index ≥ 4 KB. LEO advances 4,500,000 → 4,500,201 (201 records).
- T3: broker does NOT yet ack — waits for ISR followers.
- T4: followers B-31 and B-88 are long-polling Fetch from B-7. Next Fetch returns the batch; each follower write()s to its own active segment and sends a follow-up Fetch with fetchOffset=4,500,201, implicitly acking.
- T5: B-7 sees both caught up; advances HW (High Watermark) 4,500,000 → 4,500,201. HW is the offset up to which records are "committed" — replicated to all ISR.
- T6: B-7 acks producer; ~5 ms elapsed total.
- T7 (async, ~30s later): kernel writeback daemon fsyncs dirty pages from page cache to NVMe. Disk durability achieved here, not at T2.
What survives a crash at each step: - At T2 (mid-write): producer didn't get ack, retries. On restart, broker scans active segment, finds partial batch at tail (CRC fails), truncates back to last fully-written record boundary. Idempotent producer dedups the retry. - At T5 (leader before HW advances): producer didn't get ack. KRaft controller elects new leader from ISR (say B-31, which got the batch at T4). Producer retries against new leader; idempotent producer dedups. - At T6 (after ack, before T7 fsync): data in page cache on all three brokers. Simultaneous power loss on all three (regional disaster) would lose it. With 3 AZ independent power, this is the "regional event" failure mode — accepted.
Consume path:
- T0: consumer sends Fetch RPC: partition=7, fetchOffset=4,500,000, maxBytes=1 MB, maxWait=500ms, minBytes=1 KB.
- T1: broker binary-searches segment filenames: 4,500,000 ≥ base offset 4,096,000 → segment 00000000000004096000.log.
- T2: broker reads sparse .index: largest indexed offset ≤ 4,500,000 is 4,499,840 at byte position 67,108,000.
- T3: broker computes byte range: start=67,108,000 (aligned to batch boundary), length=min(1 MB, HW-fetchOffset).
- T4: broker calls sendfile(socket_fd, log_fd, &start, length). Kernel reads pages from page cache → DMA to NIC → wire. Bytes never enter JVM heap. ~0.3 ms broker CPU.
- T5: consumer receives ~1 MB of compressed batches; for each batch: verify CRC, lz4 decompress, iterate records from fetchOffset, hand to application.
- T6: consumer processes records.
- T7: consumer commits new offset to __consumer_offsets topic via OffsetCommit RPC to group coordinator (hash group.id to broker). Coordinator writes (group, topic, partition) → offset into __consumer_offsets.
- T8 (crash between T6 and T7): on restart, consumer re-fetches from last committed offset (4,500,000) and re-processes records. At-least-once. Idempotent processor must dedupe.
CPU cost on the broker for serving a 1 MB consume is ~0.3 ms (binary search + sendfile); disk + network dominate. A single broker on a 25 Gbps NIC can serve ~3 GB/sec of consumer fan-out before NIC saturation. This is why fan-out is "free" — adding a 10th consumer group costs zero IO unless they read the same hot offsets simultaneously.
4f. Log compaction — the LSM-tree-ish primitive
For topics with cleanup.policy=compact (state snapshots — CDC of current row state, __consumer_offsets itself, feature-store online caches), a background log cleaner scans closed segments and produces a new segment keeping only the latest value per key. Tombstones (null-value records) signal deletion. Cleaner builds an in-memory hash map of key → latest_offset, rewrites segments, atomically swaps old segments out. Throttled by log.cleaner.io.max.bytes.per.second. This is the LSM (Log-Structured Merge) tree-like primitive baked into Kafka, but only for compacted topics. Most production topics use cleanup.policy=delete (delete whole segments past retention).
4g. Tiered storage — offloading cold segments to S3
Closed segments older than local.retention.ms (e.g., 24h) are uploaded to S3 / HDFS / Azure Blob. Broker keeps only the local .index + metadata pointer; the .log lives in object storage. Consumer requesting an old offset: broker checks local first (~5 ms sendfile path); if not, ranged GetObject from S3 proxied to consumer (~100-200 ms, storage cost drops 10x). This is the bridge to S3-native designs (WarpStream, AutoMQ) that push the broker stateless and S3 to be the durable layer end-to-end. Trade-off: ~200 ms p99 produce latency vs ~5-15 ms on local disk.
§5. Producer-side mechanics — the half of the protocol clients implement
The broker is one half of the system; the producer client is the other half, and most of the "exactly-once," "lossless," and "high-throughput" guarantees the broker advertises depend on the producer being configured correctly. A naively configured KafkaProducer is fire-and-forget; a properly configured one is a small distributed protocol with its own state machine. This section covers the four knobs that must be known cold: idempotent producer, transactional producer, batching, compression, and acks.
5a. Idempotent producer — broker-side dedup via PID + sequence number
Default Kafka producers, on a network blip, retry the same record. Without protection, broker writes the duplicate; payments customer charged twice (§18.1). The idempotent producer (added in Kafka 0.11, 2017) closes this hole at the broker side.
Protocol. On startup, the producer client sends InitProducerId to a broker (transaction coordinator). The broker assigns a unique PID (Producer ID, 64-bit) and producer epoch (16-bit, zeroed for non-transactional). The producer then maintains a monotonically increasing sequence number per (PID, partition) and stamps every record-batch header with producerId, producerEpoch, baseSequence (§4b shows the field layout).
Broker-side state: for each (PID, partition), it tracks the last 5 sequence numbers it accepted (configurable via max.in.flight.requests.per.connection, default 5). On every incoming Produce request:
- If incoming.sequence == lastSequence + 1 → accept and append.
- If incoming.sequence <= lastSequence (and within the 5-slot window) → duplicate, return success but do NOT append again. Producer's view: the retry "succeeded."
- If incoming.sequence > lastSequence + 1 → out-of-order, return OutOfOrderSequenceException. Producer must reset and re-establish state.
Producer state machine (idempotent):
┌─────────────────┐
│ Init │── InitProducerId ──> Broker assigns PID=12345, epoch=0
│ PID=12345 │
│ epoch=0 │
│ seq[part-7]=0 │
└────────┬────────┘
│ send(record_A) → batch{PID:12345, epoch:0, baseSeq:0, ...}
▼
┌─────────────────┐
│ In-flight │ network blip; no ack received
│ seq=0,1,2,3,4 │
└────────┬────────┘
│ retry batch{baseSeq:0} → broker sees seq 0 already, returns success silently
▼
┌─────────────────┐
│ Acked │ producer thinks "first attempt succeeded"
│ seq[part-7]=5 │
└─────────────────┘
Cost. Practically free — one ID assignment per producer-process lifetime, ~24 bytes of metadata per record batch (PID + epoch + baseSequence). Broker-side state is bounded: 5 sequence numbers × ~1M PIDs × ~1K partitions = few MB. Why is it not the default? Historically it cost throughput because it forced max.in.flight.requests.per.connection=1; KIP-360 (2.5+) allows up to 5 in-flight with sequence-number-based reordering. Modern guidance: enable.idempotence=true is the production default. Confluent has made it the implicit default since 3.0.
What it does NOT give you. Idempotent producer dedups within a single producer's lifetime, per partition. It does not span producer restarts (new PID), does not span partitions (writes spanning P-7 and P-42 can each succeed independently), and does not give exactly-once into external sinks. For those, you need transactions.
5b. Transactional producer — atomic writes across partitions and topics
The use case: in §18.1's payments flow, the consumer reads payment.request, writes payment.charged to one topic, writes ledger.entry to another, then commits the source offset. If any of these three writes happens without the others, books are inconsistent. Transactional producers (Kafka 0.11+) wrap multi-partition, multi-topic writes plus consumer-offset commits in one atomic unit.
Protocol (two-phase commit with the transaction coordinator broker as TC):
Phase 0: Producer registers transactional.id="payments-tx-7"
Producer ──InitProducerId(transactional.id)──> TC
TC: looks up persistent (txid → PID, epoch) mapping in __transaction_state topic.
If exists: increment epoch (fencing — older zombie producer with prior epoch is dead).
Returns (PID=12345, epoch=2) to producer.
Phase 1: Producer.beginTransaction()
Producer ──AddPartitionsToTxn([orders-7, ledger-13])──> TC
TC: writes "ongoing transaction includes orders-7, ledger-13" to __transaction_state.
Phase 2: Producer sends data
Producer ──Produce(orders-7, batch{PID:12345, epoch:2, isTxn:true})──> broker B-7
Producer ──Produce(ledger-13, batch{PID:12345, epoch:2, isTxn:true})──> broker B-23
Producer ──AddOffsetsToTxn(consumer-group="payments-cg")──> TC
Producer ──SendOffsetsToTxn(payments-cg, payment-requests-3 → offset 5000)──> TC
Phase 3: Producer.commitTransaction()
Producer ──EndTxn(commit=true)──> TC
TC writes "PREPARE_COMMIT" to __transaction_state (durable decision point).
TC writes COMMIT_MARKER (control record, isTxn=true, commit=true)
to orders-7 and ledger-13 and __consumer_offsets.
TC writes "COMPLETE_COMMIT" to __transaction_state.
Phase 4: Cleanup
Consumer with isolation.level=read_committed will only see records
up to the latest COMMIT_MARKER for each partition.
Records past an ABORT_MARKER are skipped client-side (LSO < HW).
LSO (Last Stable Offset) is the new high watermark for transactional consumers: the highest offset at which no in-progress transaction exists below. Consumers with isolation.level=read_committed only read up to LSO; read_uncommitted reads up to HW (pre-existing behavior). Records inside aborted transactions are streamed to the consumer but filtered client-side using the broker-side aborted-transaction index — no broker-side rewriting of the log.
Producer fencing via epoch. The "zombie producer" problem: producer A starts a transaction, hangs (long GC pause), Kubernetes restarts the pod as producer A'. If both keep writing, you have two parallel transactions claiming the same transactional.id. Solution: TC bumps the epoch on every InitProducerId(txid). Producer A' gets epoch=3; producer A still has epoch=2. When A's pause ends and it sends batches with epoch=2, broker rejects with InvalidProducerEpochException. A's transaction is forcibly aborted.
Cost. ~50-100 ms added latency per transaction (two extra round-trips to TC, plus marker writes), ~3-5x throughput reduction vs non-transactional for small transactions. Amortize by batching many records into one transaction — Kafka Streams's default is one transaction per 100 ms commit interval. Don't open one transaction per record.
What it does NOT give you. Transactional producers give exactly-once into Kafka topics and offsets. They do NOT extend to external sinks (Postgres, Stripe API). For end-to-end exactly-once to those, you need either (a) idempotency keys at the external boundary (Stripe accepts Idempotency-Key header) or (b) the consumer must implement its own dedup table.
5c. Batching — linger.ms and batch.size
A producer that sends one record per send() call hits the network 100K times/sec at 100K msg/sec. Each network round trip is ~0.5-2 ms of RPC overhead, ~50-100 bytes of TCP/Kafka header per record-batch envelope. Batching is the entire reason Kafka producers scale.
Two interacting knobs:
- batch.size (default 16 KB) — maximum bytes per batch destined to one partition. When a batch fills, it ships immediately, regardless of time.
- linger.ms (default 0) — how long to wait for more records to fill the batch before sending. linger.ms=0 ships the first record immediately (no batching benefit unless multiple sends happen within one TCP RTT). linger.ms=10 waits up to 10 ms for more records.
Latency vs throughput tradeoff:
linger.ms=0, batch.size=16KB linger.ms=20, batch.size=64KB
───────────────────────────── ─────────────────────────────
p50 produce: 2 ms p50 produce: 22 ms
p99 produce: 8 ms p99 produce: 28 ms
throughput: 50 MB/sec/producer throughput: 800 MB/sec/producer
network RPCs: hundreds/sec network RPCs: tens/sec
compression: poor (small batches) compression: strong (large batches)
Rule of thumb by domain. Payments outbox (small volume, latency-sensitive): linger.ms=0, batch.size=16KB. Click-stream analytics (firehose, latency-tolerant): linger.ms=20-50, batch.size=128KB-1MB. CDC pipelines (sustained mid-throughput, ordered): linger.ms=5-10, batch.size=64KB. Tesla telemetry (high-volume burst): linger.ms=10, batch.size=256KB, with batch-per-key when one car is hot.
Internally, the producer maintains a RecordAccumulator — one buffer per (topic, partition) up to buffer.memory (default 32 MB) total. The Sender thread polls accumulators, picks batches that are full OR have aged past linger.ms, and ships them to brokers grouped by destination broker (one ProduceRequest per broker, containing batches for all partitions led by that broker). This is why partition count multiplies producer memory needs: 1000 partitions = up to 1000 batches = potentially 16 GB if all fill. buffer.memory is a hard cap; once full, send() blocks for max.block.ms.
5d. Compression — gzip vs lz4 vs zstd vs snappy
Batches are compressed end-to-end by the producer (broker stores compressed, consumer decompresses — §4b). Choice trades CPU for ratio for latency.
| Codec | Ratio (typical JSON) | Compress CPU | Decompress CPU | Notes |
|---|---|---|---|---|
| none | 1.0x | 0 | 0 | Only for tiny / pre-compressed payloads (Protobuf, encrypted) |
| snappy | ~2-3x | very fast | very fast | Google's default; lowest CPU overhead; smallest ratio |
| lz4 | ~2-3x | fastest | fastest | The Kafka default since 2.1; ~10% better ratio than snappy at similar CPU |
| gzip | ~5-7x | slow (~10x lz4) | medium | Best ratio, worst CPU; rare in modern Kafka |
| zstd | ~4-6x | medium | fast | Sweet spot since 2.1: 80% of gzip's ratio at 3-5x its CPU; the default recommendation |
At scale, compression dominates broker cost. Cloudflare's edge log shipping (mentioned §24) was reportedly 10-15% of edge CPU before they moved from gzip to zstd; post-migration, it dropped to ~3-4%. Same wire bytes, much less producer CPU. And because the broker stores compressed batches verbatim, choosing zstd over lz4 makes the broker disk footprint smaller, which compounds — storage cost is dominated by retention × ingest rate × replication factor (§24).
Caveat: if payload is already compressed (gzipped HTTP body, JPEG, Protobuf with packed fields), additional compression buys nothing and burns CPU. Default to none in those cases.
5e. acks — the durability dial
The single most consequential producer config. Three settings, wildly different semantics:
acks=0(fire-and-forget). Producer doesn't wait for broker ack. ~0.1-0.5 ms send latency. Loss on broker crash mid-write is silent. Use only for telemetry that's tolerant to small loss windows (metrics scraping, tracing samples, video playback heartbeats).acks=1(leader-only). Producer waits until leader has appended to its local log (page cache). ~1-3 ms latency. Data-loss window: if leader crashes after acking but before followers replicate, batch is lost. Per partition, this hits ~0.001% of writes under steady-state failures, possibly more during rolling restarts. Use for analytics workloads (click-stream §23) where 0.001% loss is acceptable.acks=all(full ISR). Producer waits untilmin.insync.replicasfollowers have caught up. ~5-15 ms latency. No loss tolerable under any single-node failure, and withmin.insync.replicas=2+ RF=3, can tolerate one broker loss and still serve writes. Default for payments, CDC, ledgers, audit logs.
min.insync.replicas interaction. A common bug: set acks=all but leave min.insync.replicas=1. Now acks=all means "ack when the leader's lone ISR has replicated" — but the leader IS the ISR if all followers fell behind. Effectively degenerates to acks=1 with no warning. LinkedIn standard: acks=all, min.insync.replicas=2, RF=3. Tolerates one broker loss without write outage; refuses writes if two brokers down (rather than silently lose durability).
acks=all failure mode: write unavailability. If 2 of 3 ISR brokers are down and min.insync.replicas=2, the leader rejects writes with NotEnoughReplicasException. Producer-side: retry or fail. This is by design — better to fail loud than silently accept writes the system can't durably store. The "unclean leader election" knob (§11c) is the escape hatch when uptime trumps durability.
§6. Consumer rebalancing — eager vs cooperative incremental, static membership, offset commit strategies
The consumer group protocol is where most production Kafka outages originate. The protocol is conceptually simple — N consumers, M partitions, assign partitions to consumers, when membership changes redistribute — but the realization is a state machine with surprising latency cliffs.
6a. Eager rebalancing (the legacy default, pre-2.4)
Before KIP-429 (Kafka 2.4, 2020), the only rebalance protocol was eager rebalancing, also called "stop-the-world." The protocol:
Eager rebalance, 64 consumers, 256 partitions:
T0: C5 leaves the group (deploy, crash, GC pause beyond session.timeout.ms).
T1: Group coordinator broadcasts "rebalance starting" to ALL 64 consumers.
T2: EVERY consumer (not just C5) revokes its current 4 partitions, commits offsets, returns.
T3: Coordinator collects "joins" from all 64; chooses one as leader.
T4: Group leader runs PartitionAssignor; produces new (consumer → partitions) map.
T5: Coordinator broadcasts new assignments to all 64.
T6: Each consumer resumes consumption on its new assignments.
Total stop-the-world: T2 → T6 ~ 3-10 seconds, ALL consumers idle.
At 1M msg/sec, that's 3-10M messages of lag building up.
The pathology compounds when consumers join one-at-a-time during a rolling deploy: 64 separate rebalances of 5 seconds each = 320 seconds (5+ minutes) of intermittent stalls. This was the dominant operational pain at LinkedIn through ~2018.
6b. Cooperative incremental rebalancing (CooperativeStickyAssignor, 2.4+)
KIP-429 introduced cooperative incremental rebalancing — only consumers losing partitions pause, others continue.
Cooperative rebalance, 64 consumers, 256 partitions, C5 leaves:
T0: C5 leaves.
T1: Coordinator notifies all 64 of rebalance.
T2: Each consumer reports its CURRENT assignments back.
T3: Group leader computes new assignment based on prior assignments,
aiming to MINIMIZE movement (sticky principle).
T4: Each consumer receives instruction:
"C18 and C42, revoke partitions P-200 and P-201 (formerly C5's)."
The other 62 consumers receive: "no change, continue."
T5: C18 and C42 revoke; THE OTHER 62 NEVER PAUSED. Coordinator reassigns.
T6: C18 and C42 receive new partitions; consume resumes.
Stop-the-world for 62/64 consumers: 0 ms.
Stop-the-world for the 2 reassigned consumers: ~100-300 ms.
Configuration: set partition.assignment.strategy = org.apache.kafka.clients.consumer.CooperativeStickyAssignor (also available: RangeAssignor, RoundRobinAssignor, StickyAssignor — the latter is sticky but still eager). Migration to cooperative requires a rolling restart through a hybrid period; Kafka 3.0+ defaults to including the cooperative protocol but not preferring it. Stateful consumers (Kafka Streams) benefit most — local state for old partitions doesn't have to be rebuilt.
6c. Static group membership — surviving rolling restarts without rebalance
The default protocol treats every consumer (re)connection as a new member with a fresh member.id. A rolling restart of 64 consumers triggers 64 rebalances (one per restart). KIP-345 (Kafka 2.3+) introduced static membership via group.instance.id:
Without static membership:
C5 restarts ─> coordinator sees member 'consumer-3a7f' leave (new memberId on rejoin)
─> rebalance (3-10 seconds for eager, 100ms for cooperative).
With group.instance.id="payments-cg-worker-5":
C5 restarts ─> coordinator sees instance "worker-5" reconnect within session.timeout.ms
─> matches by group.instance.id, restores PRIOR assignments
─> NO REBALANCE TRIGGERED.
Caveat: session.timeout.ms (default 45s) is the budget for restart. If the consumer takes longer (slow JVM warmup, k8s pull from cold registry), session expires and rebalance triggers normally. Production guidance: bump session.timeout.ms to ~5 minutes for stateful consumers, ensuring restarts complete within the window.
LinkedIn standard for stream processors: static membership + session.timeout.ms=300000 (5 min) + cooperative incremental. Combined, this eliminates ~95% of rebalance pauses observed pre-2020.
6d. Offset commit strategies — auto vs sync vs async, at-least-once vs at-most-once
How the consumer tells the broker "I've processed up to offset X" determines the semantic guarantees of the whole pipeline. Three strategies, four semantic regimes.
enable.auto.commit=true (default). Background thread commits the current consumer position every auto.commit.interval.ms (default 5 seconds). Hidden trap: the commit happens whether or not the records were successfully processed. If a consumer crashes 4 seconds in (between commits), the offsets are committed for records that were poll()d but not processed — silent at-most-once data loss. Worse, the application logic doesn't see this as an error.
Sync commit (consumer.commitSync()). Application explicitly calls commitSync() after processing each batch. Blocks until broker acks; throws on failure. Strong at-least-once: a crash post-commit replays nothing; a crash pre-commit replays the last batch. Latency cost: ~5-10 ms per commit. At 1M msg/sec processing through a 100-consumer fleet, that's manageable if commits happen per-batch (not per-record).
Async commit (consumer.commitAsync()). Application calls commitAsync() after processing; commit happens in background with callback. Lowest commit latency, but the callback must handle failures — if a commit fails and is never retried, the next commit will overwrite, losing the in-between processing window. Standard pattern: commitAsync() after each batch, commitSync() only at shutdown / partition revocation (so the final state is durable).
Manual offset storage (consumer-side). For exactly-once into a non-Kafka sink (Elasticsearch, MySQL), the consumer writes the offset alongside the data in the SAME transaction in the external store. On restart, consumer reads the offset from the external store, calls consumer.seek(offset). Now offsets and data are atomic with the external store; Kafka's __consumer_offsets is bypassed.
Semantic table:
| Commit pattern | After ack? | Crash here ⇒ | Semantic |
|---|---|---|---|
commitSync() BEFORE processing |
n/a | Reprocess never, lose the batch | At-most-once |
commitSync() AFTER processing |
Yes | Reprocess once on restart | At-least-once (default) |
commitSync() AFTER processing + idempotent consumer |
Yes | Reprocess but idempotent dedup | Effectively exactly-once |
| External store atomic write + offset | Yes | External store source of truth | Exactly-once across sink |
The rule: commits committed before processing = at-most-once (data loss). Commits after processing = at-least-once (duplicates). Exactly-once requires either transactional producer (§5b) OR external-store atomic commit. There is no "just enable exactly-once" knob.
§7. Schema Registry and data evolution
A Kafka broker has no idea what's in the byte arrays it routes. The payload is opaque bytes — a deliberate design choice (zero-copy reads depend on the broker never inspecting bytes). This is liberating until the day a producer team deploys a new schema and breaks every consumer. The Schema Registry exists to make that impossible.
7a. The problem — the "deploy producer, break consumers" disaster
Day 1: producer publishes {"user_id": 7, "event_type": "click", "url": "/foo"} as JSON to topic clicks. All five consumer services parse with their hand-coded JSON deserializer.
Day 30: producer team adds referrer field, accidentally removes url (renamed to path). The producer deploys at 2am. By 2:05am, every consumer is throwing NullPointerException on event.url. The consumer teams have no idea what changed; the producer team thought it was a minor rename. 5 services down for 4 hours while engineering hunts for the root cause.
A schema registry would have rejected the producer's schema registration at deploy time because removing a field violates backward compatibility. The bug is caught at the build step, not in production.
7b. Confluent Schema Registry — architecture
The Confluent Schema Registry (open-source, also Apicurio, Pulsar's built-in schema service, AWS Glue Schema Registry) is a small REST service backed by a Kafka topic (_schemas) for durability. It stores versioned schemas for subjects (typically <topic-name>-value and <topic-name>-key).
Producer write path with Schema Registry:
1. Producer has Avro/Protobuf/JSON-Schema definition for record class.
2. On first send, producer calls SR: registerSchema(subject="clicks-value", schema=...).
SR validates against compatibility mode; if OK, persists, returns schema ID (32-bit int).
Schema ID is cached in producer JVM forever.
3. Producer serializes record:
┌─────────────────────────────────────────────────────────┐
│ magic_byte (1B = 0x00) │ schema_id (4B) │ Avro payload │
└─────────────────────────────────────────────────────────┘
4. Producer ships to broker. Broker stores opaque bytes — has no idea Avro is in there.
Consumer read path:
1. Consumer reads byte array from Kafka.
2. Parse first 5 bytes: magic byte = 0x00, schema_id = 12345.
3. Look up schema 12345 in SR (cached). Returns the Avro schema definition.
4. Avro deserializer uses schema to decode binary payload into a record object.
The "magic byte + schema ID + payload" format is the wire format convention, not a Kafka protocol feature — schema IDs are inserted into the user's bytes. Any client supporting the convention can interoperate (most languages have Confluent serializer libraries).
7c. Compatibility modes — the rules that prevent disasters
Schema Registry enforces a compatibility mode per subject. When a new schema is registered, SR checks it against the existing latest (or all previous) schemas and rejects incompatible changes at the API call. The producer can't even publish until the schema is accepted.
| Mode | What's allowed (in new schema) | Use when |
|---|---|---|
| BACKWARD | Delete fields. Add fields with defaults. | Default — consumers upgrade first, then producers. The most common Kafka pattern. |
| FORWARD | Add fields. Delete fields with defaults. | Producers upgrade first, then consumers. Rare. |
| FULL | Both BACKWARD and FORWARD constraints. | Mixed upgrade order; safest, most restrictive. |
| NONE | Anything. | Never use in prod. |
| BACKWARD_TRANSITIVE / FORWARD_TRANSITIVE / FULL_TRANSITIVE | Compare against ALL prior schemas, not just latest. | When old consumers can't be guaranteed to upgrade. |
7d. Walk-through: backward-compatible schema evolution
Starting schema (Avro, v1):
record ClickEvent {
long user_id;
string event_type;
string url;
}
Bad change (rejected): rename url to path. This is delete url + add required path. Old consumers reading new data crash on missing url. Compatibility violation; SR rejects at registration.
Good change (accepted): add an optional referrer field with default.
record ClickEvent {
long user_id;
string event_type;
string url;
union {null, string} referrer = null; // new field with default
}
This is BACKWARD-compatible:
- Old consumer (using v1) reading new producer's (v2) data: Avro deserializer with v1 schema skips the referrer field (not in v1, but Avro can read v2-written-with-v1-schema because the v1 reader uses its own schema for projection). Old consumer continues to work, ignoring referrer.
- New consumer (v2) reading old producer's (v1) data: referrer field absent in payload, default value null used. New consumer works on old data.
Migration order: upgrade consumers to v2 first (they can read both v1 and v2 messages). Then upgrade producers to v2 (now publishing referrer). No outage window.
The bottom line. Schema Registry is not optional at scale. It is the contract layer that lets producer and consumer teams deploy independently without coordination meetings. LinkedIn has run a fork (Liquid) since ~2015; Confluent ships it; AWS embeds it in Glue. The cost is ~$0.50/sec per registry instance (3-5 instances for HA) and a few minutes of integration work. The benefit is zero "schema mismatch" outages.
§8. Kafka Connect and the connector ecosystem
The first time you stand up a Kafka cluster, you immediately ask: "how do I get the data INTO it from MySQL?" or "how do I write it OUT to S3?" The Kafka community's answer since 2015 is Kafka Connect — a separate cluster of workers running pre-built source and sink connectors. It exists because, by their count, ~80% of "consumers" people would otherwise write are stateless connectors moving bytes from system A to system B, and writing each from scratch is wasteful and error-prone.
8a. Architecture — Connect workers, tasks, configs
A Connect cluster runs alongside a Kafka cluster. Workers are JVM processes; each worker hosts one or more tasks. A task is a unit of parallel work (e.g., one MySQL table = one task; eight S3 prefixes = eight tasks). Configuration is via REST API; configs are stored in compacted Kafka topics (connect-configs, connect-offsets, connect-status) — Connect is self-bootstrapping.
┌────────────────────────────────────────────────────────────────┐
│ Kafka Connect Cluster (3-10 workers) │
│ │
│ Worker 1: ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Debezium-T1│ │ JDBC-Src-T2│ │ S3-Sink-T1 │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ Worker 2: ┌────────────┐ ┌────────────┐ │
│ │ Debezium-T2│ │ S3-Sink-T2 │ │
│ └────────────┘ └────────────┘ │
│ │
│ Stored in Kafka: connect-configs, connect-offsets, │
│ connect-status (compacted topics) │
└────────────────────────────────────────────────────────────────┘
│ │
▼ ▼
┌────────────┐ ┌────────────┐
│ MySQL │ ──Debezium tails binlog──▶ │ Kafka │
└────────────┘ │ topic │
│ cdc.users │
└─────┬──────┘
│
▼
┌────────────┐
│ S3-Sink │
│ writes to │
│ s3://... │
└────────────┘
8b. Source connectors — Debezium for CDC, JDBC source
Debezium (RedHat, OSS) is the de-facto CDC source connector. It tails the binary log of MySQL, PostgreSQL WAL via logical replication, MongoDB oplog, SQL Server transaction log, Oracle LogMiner. Output: one Kafka topic per source table (<server>.<schema>.<table>); each message is {before, after, source, op} (op = create / update / delete). Partition key is the row's primary key, preserving per-row order (the CDC pipeline pattern from §18.5).
Why Debezium beats writing a custom consumer: - Handles snapshot + binlog cutover (initial snapshot of full table, then transition to streaming binlog without gaps or duplicates). - Tracks binlog position in Connect's offsets topic — survives restart, resumes from last position. - Handles MySQL master failover via GTID (Global Transaction Identifier) tracking. - Schema discovery: emits schema-change events to a separate topic, integrates with Schema Registry. - Tens of thousands of production deployments — battle-tested.
JDBC source connector is the more naive alternative: polls a table with SELECT * WHERE updated_at > <last_seen> ORDER BY updated_at every N seconds. Pros: works on any JDBC database. Cons: misses deletes (no row visible), can miss updates if updated_at isn't updated, polling load on source DB. Use only when binlog access is unavailable.
8c. Sink connectors — S3, Elasticsearch, Snowflake, JDBC
Sink connectors consume from one or more Kafka topics and write to an external system. The standard ones:
- S3 sink: buffers records by partition, flushes on time (
rotate.schedule.interval.ms) or size (flush.size) into Parquet/JSON/Avro files ats3://bucket/<topic>/year=2026/month=05/day=22/hour=14/<file>. Used to build data lakes. - Elasticsearch sink: maps Kafka records to ES documents; supports upserts via document ID = Kafka key. Bulk-indexed in batches.
- Snowflake sink (Snowpipe Streaming or batched): loads records directly into Snowflake tables; the Snowflake-Kafka connector handles auth, partition file naming, retries.
- JDBC sink: writes to any JDBC database (Postgres, MySQL, ClickHouse). Standard for materializing a Kafka stream into a row-store.
Exactly-once delivery to sinks is a per-connector concern. S3 sink achieves it via "write-then-rename" + offset tracking in S3 itself. Elasticsearch sink relies on the document ID = Kafka key idempotency. JDBC sink usually relies on PK-based upsert. Not all sinks are exactly-once — read the connector docs.
8d. Single Message Transforms (SMTs)
A connector pipeline often needs light transformations: drop a PII (Personally Identifiable Information) field, route by header value, flatten a nested record, change topic name. Without SMTs, you'd need a Kafka Streams app in between. SMTs are inline, stateless, per-record transforms applied in the worker.
Examples:
- ExtractField$Value — pull a nested field up.
- MaskField$Value — replace email with "***@***" for PII compliance.
- RegexRouter — rewrite topic name (cdc.prod_db.users → users).
- TimestampConverter — coerce millis/Date/Iso8601 strings.
Don't write a custom consumer for "Kafka → S3 with field rename and PII mask." Use S3 sink + 2 SMTs. ~10 minutes of YAML vs days of consumer code.
8e. Why Connect exists vs writing custom consumers
The honest answer: writing one custom consumer is fine. Writing fifty is unsustainable. Kafka Connect amortizes the costs:
- Offset management — the SAME offset-commit logic for every connector, not reimplemented per consumer.
- Restart and recovery — every connector resumes from the last committed offset; you don't write that.
- Distributed task rebalancing — when one worker dies, tasks move to other workers automatically.
- Schema integration — connectors integrate with Schema Registry out of the box.
- Single Message Transforms — composable transform pipeline.
- REST API for operations — start/stop/restart connectors via HTTP, no JVM redeploys.
LinkedIn's Brooklin is conceptually Kafka Connect for CDC across heterogeneous sources (Espresso, MySQL, Oracle, Couchbase) — the same insight: most "consumers" are connectors, so build the connector framework once and reuse.
§9. Cross-region replication
A single Kafka cluster is bounded to a region — cross-AZ replication is fine (5-15 ms ISR latency), but cross-region replication via ISR is not (50-200 ms RTT collapses throughput). For multi-region setups, cross-region replication is a separate pipeline running between two single-region clusters.
9a. The shape — replicator topology, offset translation
The basic shape (active-passive):
REGION A (us-east-1) REGION B (us-west-2)
┌─────────────────────────┐ ┌─────────────────────────┐
│ Kafka cluster A │ │ Kafka cluster B │
│ │ │ │
│ orders.created ──┐ │ │ ┌──> orders.created │
│ payments.charged │ │ │ │ payments.charged │
│ │ │ │ │ │
└─────────────────────┼───┘ └───┼─────────────────────┘
│ │
▼ │
┌─────────────────────────────┐
│ Cross-Region Replicator │
│ (MirrorMaker 2 / Brooklin │
│ / Confluent Replicator / │
│ Cluster Linking) │
│ │
│ Reads from A, writes to B │
│ ~30-200 ms replication lag│
└────────────────────────────┘
9b. Tools — MirrorMaker 2, Brooklin, Confluent Replicator, Cluster Linking
MirrorMaker 2 (MM2) is Kafka's built-in, runs as Kafka Connect connectors. Topic-renaming convention: source topic orders in cluster A becomes A.orders in cluster B (prefix with source cluster alias). Offsets are NOT preserved (different cluster = different offsets). MM2 maintains a __consumer_offsets-translation topic in B mapping (source_offset → target_offset) for migration scenarios.
Brooklin (LinkedIn, open-sourced 2019) replaced LinkedIn's first-gen replicator. Designed for heterogeneous sources (not just Kafka — also Espresso, Oracle, MySQL). Decoupled controller for orchestration, datastream-per-pipeline abstraction, runs at LinkedIn-scale across hundreds of source/destination pairs.
Confluent Replicator is the proprietary equivalent of MM2 with offset translation as a service and integrated with Schema Registry. Used widely in Confluent Cloud.
Cluster Linking (Confluent Platform 6.0+, 2020) is the newest model: instead of running connectors, brokers in cluster B directly mirror partitions from cluster A via the same ISR-fetch protocol used internally. Offsets are preserved 1:1, eliminating offset translation entirely. Lower operational complexity; tighter coupling. This is where the industry is moving.
9c. Active-passive vs active-active
Active-passive (one cluster serves writes, the other is read-only): - Producer writes go to A. - B is a read-only mirror; reads can happen there for geo-locality. - DR (Disaster Recovery): on A's failure, redirect producers to B; consumers continue from B (using offset translation if running MM2). - RPO (Recovery Point Objective) = replication lag, typically 30-200 ms. - RTO (Recovery Time Objective) = depends on producer reconnect logic, typically 30s-5min.
Active-active (both clusters accept writes; replicate to each other):
- More complex: must avoid replication loops (B writes a copy of A's data, which then mirrors back).
- Solution: source-cluster prefix or a header replicated.from=A that the replicator filters on.
- Conflict resolution: typically last-write-wins by timestamp, but if two regions both write to the same key, order is impossible to reconcile globally. Either accept the conflict (eventual consistency) or route producers by key (e.g., users in EU write to EU, users in US write to US, no cross-region writes to the same key).
- Pulsar's native geo-replication is the cleanest active-active story; Kafka requires careful application-level design.
9d. Offset translation and the topic-renaming pattern
The painful detail: offset 5,000,000 in topic orders on cluster A does NOT correspond to offset 5,000,000 in topic A.orders on cluster B. They might differ wildly because B was created later, retains less data, or had different batching.
MM2's translation: every replicated batch produces a (source_topic, source_partition, source_offset) → (target_offset) mapping written to <cluster>.checkpoints.internal. A consumer migrating from A to B reads the translation topic to compute "I was at offset 5M on A; on B that's offset 4,827,331." Then consumer.seek(target_offset) and resume.
The topic-renaming pattern (A.orders in B) is the industry standard for unambiguous identification. Don't replicate orders from A onto orders in B without a prefix — if B later runs its own producer onto orders, the merge is irreversible.
RPO/RTO budgets for cross-region: - High-availability service (consumer app): RPO ~seconds, RTO ~minute. MM2 + automated failover. - Compliance-grade backup: RPO ~minutes, RTO ~hours. Daily snapshot to S3 + MM2 streaming. - True active-active (e.g., global multi-tenant SaaS): RPO ~0 (both regions take writes), but expect ~0.01% conflict rate to resolve.
§10. Security — SASL, mTLS, ACLs, encryption
A Kafka cluster carrying payments events or PII traffic without authentication is a CVE waiting to happen. Production security has four orthogonal axes: authentication (who you are), authorization (what you can do), encryption in transit (wire), encryption at rest (disk). Mix and match.
10a. Authentication — SASL variants and mTLS
Kafka brokers accept authenticated connections via the SASL (Simple Authentication and Security Layer) framework, layered atop SSL/TLS. Variants:
- SASL/PLAIN. Username + password in cleartext over the TLS channel. Easiest to set up; passwords live in
jaas.confon brokers and clients. Acceptable behind a private network, audited credential storage. The "PLAIN" refers to the wire-format, not the security level — credentials are encrypted by the underlying TLS. - SASL/SCRAM (Salted Challenge Response Authentication Mechanism, RFC 5802). Challenge-response so the password is never sent on the wire. SCRAM-SHA-256, SCRAM-SHA-512. Credentials managed via
kafka-configs.shand stored in ZooKeeper/KRaft. Better than PLAIN because brokers never see the plaintext password. - SASL/OAUTHBEARER. OAuth2 / OIDC (OpenID Connect) tokens. Client obtains a JWT (JSON Web Token) from an identity provider (Okta, Auth0, internal SSO), passes it to broker; broker validates signature and claims. Standard in cloud-native deployments where service identities are managed centrally.
- SASL/GSSAPI (Kerberos). Enterprise-grade, integrates with Active Directory / corporate Kerberos. Heaviest setup; common in large traditional enterprises.
- mTLS (mutual TLS). Client certificate-based authentication. Each producer/consumer/broker has a certificate signed by a corporate CA (Certificate Authority); broker validates client cert on TLS handshake. No password traffic at all. Common in cloud-native (mTLS at the service-mesh layer extended to Kafka).
Brokers can offer multiple mechanisms simultaneously via separate listeners — e.g., internal traffic over mTLS, external traffic over SASL/SCRAM. Broker-to-broker traffic typically uses mTLS or SASL/SCRAM for the simplest mutual authentication.
10b. Authorization — ACLs per topic, per principal
After authentication establishes identity, ACLs (Access Control Lists) determine what that identity can do. Kafka's ACL model:
ACL = (principal, resource, operation, host, permission)
principal: User:CN=payments-service.prod (from cert) or User:payments-svc (from SASL)
resource: Topic:orders.created
operation: Read | Write | Create | Delete | Alter | Describe | ClusterAction | ...
host: wildcard or specific IP
permission: Allow | Deny
Example ACL setup:
- Producer payments-service.prod can Write to Topic:orders.* (prefix match).
- Consumer fraud-detector.prod can Read from Topic:orders.* AND Read/Describe on Group:fraud-detector.
- Admin kafka-ops.prod has * operations on Cluster:kafka-cluster-prod.
- Default allow.everyone.if.no.acl.found=false → unknown principals are denied.
Stored in ZooKeeper (legacy) or KRaft metadata. Managed via kafka-acls.sh CLI or programmatic admin API. Per-topic, per-principal isolation is the standard for multi-tenant clusters.
10c. Encryption — TLS in transit, at-rest options
In transit: TLS for client-broker AND broker-broker. Inter-broker traffic crosses AZs and replicates data — must be encrypted at minimum at corporate-network boundary. Choice: TLS 1.2 or 1.3, with corporate CA-signed certs. Performance cost: ~5-10% throughput hit pre-AES-NI (Advanced Encryption Standard New Instructions) hardware acceleration, ~1-3% modern. Not a real cost.
At rest: Kafka itself does not encrypt log segments on disk. Three approaches: - Disk-level encryption (LUKS on Linux, BitLocker on Windows, AWS EBS encryption). Transparent to Kafka; encrypts everything on disk including pages, indexes, KRaft logs. Most common production setup. Cost: ~1% performance with AES-NI. - Filesystem encryption (eCryptfs, fscrypt). Per-directory; finer-grained but less common. - Application-level encryption. Producer encrypts payload before publishing; consumer decrypts. Brokers see only ciphertext. Strongest model; keys managed by KMS (Key Management Service — AWS KMS, HashiCorp Vault, GCP KMS). Compatible with multi-tenant brokers where ops shouldn't see payload. Cost: producer/consumer CPU; complicates Kafka Streams (can't filter on encrypted fields without per-message decryption).
10d. PII redaction strategies
Even with encryption, PII (Personally Identifiable Information) regulations (GDPR, CCPA, HIPAA) demand additional handling:
- Field-level masking via SMT or producer logic. Strip or hash
email,ssn,credit_cardbefore publish. Keeps PII out of Kafka entirely. - Tokenization. Replace PII with stable tokens (
user_email → user_token_a3f9); tokens map back to real values via a separate vault service. Kafka holds only tokens. - TTL (Time-To-Live)-bounded retention. Set
retention.msto GDPR deletion budget (e.g., 30 days) so personal data ages out. Avoid compacted topics for PII — compacted topics retain forever. - Right-to-be-forgotten (RTBF) handling. Hard problem for immutable logs — you can't surgically delete one record. The standard pattern: encrypt PII per-user with user-specific keys, then destroy the user's key on RTBF request, making historical encrypted records permanently unreadable ("crypto-shredding"). LinkedIn, Twilio, and others use this pattern.
§11. Monitoring, operations, and the unclean leader election dilemma
The infrastructure team's day-to-day. Knowing which metrics matter and what they imply is table stakes for any data-platform conversation.
11a. Broker JMX metrics — the canonical signals
Kafka brokers expose hundreds of metrics via JMX (Java Management Extensions, the JVM's standard monitoring interface). The ones operations teams alert on:
BytesInPerSec/BytesOutPerSec(per broker, per topic). Wire throughput. Trends drive capacity planning.RequestQueueSize/NetworkProcessorAvgIdlePercent. Inbound request queue depth and network thread idle. High queue / low idle = broker bottlenecked on CPU.UnderReplicatedPartitions. Count of partitions where one or more replicas have fallen out of ISR. Should be 0 in steady state. Non-zero indicates replication lag, broker disk/network saturation, or broker failure.UnderMinIsrPartitions. Partitions where ISR size has dropped belowmin.insync.replicas, meaningacks=allwrites are being rejected. Critical alert.IsrShrinksPerSec/IsrExpandsPerSec. Rate at which ISRs change. A high shrink rate without matching expand rate is bad; high steady rate of both indicates marginally unhealthy followers.LeaderElectionRateAndTimeMs. Frequency of leader elections. Spikes during broker restarts, network issues, controller failover.ProduceRequestRateAndTimeMsp99. End-to-end produce latency for clients. The user-visible SLA metric.FetchRequestRateAndTimeMsp99. Consumer fetch latency.LogFlushRateAndTimeMs. How often and how long page cache flushes take. Spikes correlate with disk pressure.- JVM heap usage and GC pause time. GC pause > 1s is a partition leadership stability risk.
11b. Consumer lag monitoring — Burrow, Cruise Control
The single most important metric for downstream teams: consumer lag = log_end_offset - committed_offset per (consumer group, partition). Tells you "is the consumer keeping up with the producer?"
Naive monitoring: poll __consumer_offsets, compare to broker offset. Problems: doesn't scale to 1000s of groups; stale committed offsets when consumer pauses but isn't broken (e.g., between batches); doesn't differentiate "consumer fell behind 10K messages and is catching up" from "consumer has been at 10K lag for 30 minutes."
Burrow (LinkedIn, OSS) addresses this with statefulness over the window: - Tracks the trend of committed offsets per (group, partition). - Distinguishes "lag is high but decreasing" (consumer catching up) from "lag is high and stable" (consumer stuck). - Emits status: OK / WARNING / ERR / STALL / STOP. - Used to set up alerts that don't fire on benign lag spikes.
Cruise Control (also LinkedIn, OSS) is broader: it's the auto-balancer for the cluster. Monitors broker resource utilization, runs partition reassignments to rebalance disk, network, CPU when brokers are skewed. Without Cruise Control, balancing is manual via kafka-reassign-partitions.sh — painful and error-prone at LinkedIn scale.
11c. The unclean leader election dilemma
The big operational decision: when all in-sync replicas fail, do you sacrifice durability for availability?
Scenario: Topic 'payments' partition 7. ISR was {B-7, B-31, B-88}.
T0: B-7 (leader) has offset 5,000,000. B-31 and B-88 caught up.
T1: B-88 falls behind. ISR shrinks to {B-7, B-31}.
T2: B-31 falls behind. ISR shrinks to {B-7}.
T3: B-7 hard-crashes (disk failure, kernel panic).
T4: Controller looks at ISR = {} (empty — leader was alone).
OPTION A: "Clean" — wait for B-7 to come back.
- Partition is UNAVAILABLE until B-7 (or another ISR-tracked replica) returns.
- If B-7's disk is dead, partition might be down for hours.
- Guarantees: no data loss; offsets remain consistent.
OPTION B: "Unclean leader election" — elect from non-ISR replicas (B-31, B-88).
- Partition becomes AVAILABLE quickly.
- Cost: B-31 was behind by some N messages; those messages are LOST.
- Producers and consumers see offset rollback: HW was 5,000,000, now 4,990,000.
The config: unclean.leader.election.enable per topic. Default since Kafka 0.11: false (prefer correctness over availability). Pre-0.11 default: true (prefer availability over correctness).
The bottom line: payments / ledgers / CDC = false. Ad-impressions / heartbeats / non-critical analytics = true. The decision is per-topic, per-data-criticality. There is no single right answer; it's a business-driven durability vs availability tradeoff.
11d. Operational rhythms
- Rolling broker restarts — patch broker, restart, wait for ISR to catch up, repeat. With 4000 brokers at LinkedIn, this takes weeks; automation is required.
- Partition reassignment — move partitions between brokers (capacity rebalancing, decommissioning). Cruise Control automates; manual
kafka-reassign-partitions.shis dangerous because throttling is per-broker not per-partition, easy to saturate network. - Topic creation — must respect cluster-wide partition count budgets. KRaft scales past 1M, but rebalance and metadata propagation still cost.
- Capacity planning — projection:
daily_msg_rate × avg_size × RF × retention_days. Add 30% headroom for hot-key skew. Add 50% for replica catch-up burst capacity.
§12. Cost economics — storage, network, S3-native
Cost analysis is half the conversation in real designs and rarely on the slide deck of younger engineers. Kafka's two dominant costs at scale: storage and cross-AZ bandwidth.
12a. Storage — retention dominates, replication multiplies
The arithmetic: storage cost = daily_ingest × retention_days × replication_factor / compression_ratio.
LinkedIn-scale example: 7T msg/day × 1 KB avg × 7-day retention × RF=3 / 3x compression = ~49 PB of raw on-disk data. At ~$0.10/GB-month on local NVMe, that's ~$5M/month just for the disks holding the message log. (LinkedIn's real cost is lower because they own hardware; cloud equivalents are 2-3x higher.)
Knobs to attack storage cost: - Shorter retention. Default 7 days; can it be 3? 1? Tradeoff: shorter replay window. Many production workloads only need 2-3 days for operational replay; longer retention is "just in case" cruft. - Better compression. lz4 → zstd typically gives 30-40% additional reduction. Free win. - Tiered storage (§4g). Local NVMe for recent (24h) data, S3 for the rest. S3 ~$0.023/GB-month vs $0.10 NVMe → 4x cheaper on the offloaded data. At 7-day retention with 24h local, ~85% of data moves to S3 — ~70% cost reduction overall. - Drop RF from 3 to 2 for non-critical topics. Saves 33% but gives up single-broker-loss tolerance. Rarely worth it.
12b. Cross-AZ traffic — the hidden expense
Cloud providers charge ~$0.01/GB for cross-AZ traffic. Every ISR replication crosses AZs (replicas spread across 3 AZs intentionally). Every consumer fetching from a leader in another AZ does too.
Worked example: 500 MB/sec sustained write to a topic, RF=3 across 3 AZs. - Leader is in AZ-1. Followers in AZ-2 and AZ-3. - Replication traffic: 500 MB/sec × 2 followers = 1 GB/sec cross-AZ on the replication path. - 1 GB/sec × 86400 sec × 30 days = ~2.6 PB/month of cross-AZ traffic. - At $0.01/GB = ~$26K/month just for replication egress.
Then consumers: if a consumer in AZ-2 fetches from a leader in AZ-1, every fetch crosses AZ ($0.01/GB). Same scale of cost.
Mitigations:
- Rack awareness. Configure brokers with broker.rack=us-east-1a etc. Kafka's RackAwareReplicaSelector (consumer-side) prefers reading from a replica in the SAME rack/AZ. Cuts consumer cross-AZ traffic in half or better.
- AWS announced free cross-AZ for MSK in 2024. Changes the calculus dramatically for AWS users. Not universal.
- Compress harder. Same BytesIn reduction reduces both storage AND cross-AZ proportionally.
12c. S3-native economics — WarpStream / AutoMQ
The S3-native variants (WarpStream, AutoMQ, RedPanda's BYOC) restructure the cost equation entirely:
Classic Kafka cost split (LinkedIn-style):
45% NVMe storage (replicated 3x)
30% Cross-AZ networking (replication + consumer fetch)
20% Compute (brokers)
5% Operations + control plane
S3-native Kafka cost split (WarpStream-style):
5% Local cache (small, ephemeral)
55% S3 storage (single-copy durability, 11-nines)
10% S3 PUT/GET costs
25% Compute (stateless brokers, smaller and elastic)
5% Operations + control plane
Net: 30-50% cost reduction at high retention; 200ms p99 produce latency (vs 5-15ms).
The trade is latency for cost: S3 multipart upload is ~150-300ms p99 vs ~5-15ms for local NVMe ISR replication. If your workload tolerates 200-500ms produce p99 (most analytics, log shipping, telemetry archive), S3-native is dramatically cheaper. If you need <50ms (payments, real-time bidding), classic NVMe-replicated Kafka is still the answer.
The bottom line: the S3-native frontier is real and rapidly evolving (WarpStream acquired by Confluent in 2024). Expect "classic" Kafka and S3-native to coexist on different workloads through the late 2020s; new high-retention deployments default to S3-native.
§13. KRaft mode — life after ZooKeeper
The single biggest architectural change in Kafka since the original 2011 release. Worth understanding deeply because the migration is ongoing across most production clusters as of 2026.
13a. The problem ZooKeeper created
Pre-KRaft Kafka used ZooKeeper for cluster metadata: broker membership, topic configs, partition leadership, ACLs, controller election. ZooKeeper was an external dependency — separate cluster, separate ops team, separate failure modes.
ZooKeeper's pain points at scale: - ZNode count limit ~200K. LinkedIn approached the limit at ~7M partitions across clusters; each partition needed multiple ZNodes (leader, ISR, config). - Watcher fan-out scaled poorly. Every broker watched dozens of ZNodes; 4000 brokers watching the same parent ZNode caused thundering-herd issues. - Controller failover took minutes. On controller broker death, new controller reloaded ALL partition state from ZooKeeper — at 7M partitions, this was ~5-15 minutes of metadata-change unavailability. - Operational complexity. Separate ZK cluster to provision, monitor, upgrade, secure. Common source of incidents.
13b. KRaft architecture — controller quorum, metadata log
KRaft (Kafka Raft, KIP-500) replaces ZooKeeper with a Raft-based controller quorum embedded in Kafka itself. Architecture:
┌──────────────────────────────────────────────────────────────────┐
│ KRaft-based Kafka Cluster │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Controller Quorum (3 or 5 nodes, Raft-replicated) │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Ctlr-1 │ │ Ctlr-2 │ │ Ctlr-3 │ │ │
│ │ │ (leader) │ │ (follower│ │ (follower│ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ │ │
│ │ Metadata stored in: __cluster_metadata topic │ │
│ │ (the controller quorum IS the Raft log holding metadata)│ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ MetadataFetch RPC │
│ ▼ (brokers tail the metadata log) │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ B-1 │ │ B-2 │ │ B-3 │ │ B-4 │ ... brokers │
│ └────────┘ └────────┘ └────────┘ └────────┘ │
└──────────────────────────────────────────────────────────────────┘
Key shifts:
- The metadata log is a regular Kafka topic (__cluster_metadata). Every metadata change is an append; controllers are the producers; brokers are consumers tailing this topic. Eats own dogfood.
- Controller failover via Raft. The 3 or 5 controllers hold a Raft quorum; if the leader dies, a follower wins the next election in ~1-3 seconds.
- Brokers no longer connect to ZooKeeper. They tail the metadata topic from the controllers via the standard fetch protocol.
- Modes: controllers can be combined with brokers in small clusters (process.roles=broker,controller) or split off in large clusters (process.roles=controller for dedicated controller nodes).
13c. Why KRaft matters — controller failover, partition scaling
Controller failover from minutes to seconds. Old: controller broker dies → new controller broker reads ALL metadata from ZooKeeper → takes 5-15 minutes at 7M partitions. New: new controller leader has already replicated metadata via Raft → instantly takes over.
Scales past 1M partitions. ZooKeeper's ZNode limit was the hard wall; KRaft's controller quorum scales with regular Kafka topic semantics (compacted, segmented). Confluent reports KRaft handling 2M+ partitions per cluster in benchmarks.
Operational simplification. No more ZooKeeper cluster to operate. Faster cluster spinup (no ZK cluster to provision first). Tighter security model (one auth/ACL system, not two). Smaller surface area for incidents.
13d. Migration from ZooKeeper to KRaft
Migration is not a flag flip — it's a multi-step process:
- Upgrade brokers to 3.4+ which supports both ZK and KRaft (dual-write mode).
- Provision KRaft controller quorum (3 or 5 nodes) alongside the existing ZooKeeper cluster.
- Migrate metadata. Controllers read existing metadata from ZooKeeper, write it to
__cluster_metadata. Both ZK and KRaft hold the same metadata for the migration window. - Rolling restart brokers in
migrationmode — each broker switches from ZK-based metadata to KRaft-based. - Cut over. Disable ZooKeeper dependency in broker configs; ZK cluster is now decommissioned.
Kafka 4.0 (released 2024) removed ZooKeeper support entirely; KRaft is the only mode. The migration path is well-documented but operationally significant — most large production clusters were mid-migration through 2024-2025.
§14. Poison pill messages and Dead Letter Queues
§18.2 covers the basic idea of a DLQ. This section drills into the patterns, the failure modes of naive DLQ implementations, and the alternatives.
14a. The poison pill — what makes a message unprocessable
A "poison pill" is a record that the consumer cannot successfully process, no matter how many times it retries. Sources:
- Malformed payload. Schema-mismatch (e.g., producer wrote v2 schema before consumer upgraded), corrupted bytes, encoding bugs.
- Logical violation. Foreign key reference to a deleted entity, invalid state transition, business rule violation that wasn't caught upstream.
- External system failure. Stripe API returning 400 Bad Request because of a stale ID; Postgres returning constraint violations.
- Resource exhaustion. Record too large to process, requires N hops of enrichment that have permanently failed.
The pathology: consumer poll() returns the poison pill; processor throws; consumer retries; same exception; partition lag grows indefinitely while every consumer in the group waits on this one bad record. Backpressure to producers, eventually broker-side disk fills as retention extends.
14b. DLQ basics — bounded retry + redirect
The standard pattern:
for record in consumer.poll():
try:
process(record)
commit_offset(record.offset)
except RetriableError as e:
if retries(record) < MAX_RETRIES:
sleep(backoff(retries(record)))
retry
else:
dlq_producer.send("orders.dlq", record, headers={"error": str(e)})
commit_offset(record.offset) # source partition moves on
except PermanentError as e:
dlq_producer.send("orders.dlq", record, headers={"error": str(e)})
commit_offset(record.offset)
Operators monitor orders.dlq; on triage they can fix the data (correct the row, fix Stripe ID), re-publish to orders.created, and processing resumes. DLQ is an essential ops surface.
14c. Why DLQs are sometimes wrong — order loss and the "parking lot" pattern
The downside the basic DLQ pattern hides: DLQs lose per-key ordering forever.
Example: CDC pipeline (per-key ordered). Record 1: update user_42 set email=.... Record 2: same user, update user_42 set name=.... Record 1 fails (transient Postgres outage), goes to DLQ. Record 2 succeeds normally. Operator later replays Record 1 from DLQ. User-42's row now reflects record 1 applied AFTER record 2 — wrong order, wrong final state.
For order-sensitive workloads (CDC, event sourcing), DLQ-and-skip is incorrect. Alternatives:
Parking-lot pattern. Instead of routing to DLQ-and-continue, the consumer pauses the partition (consumer.pause(partition)) and parks the bad record in a side store (Redis, a status DB, or a manual review queue). The partition holds at that offset. Order is preserved; downstream processing of other partitions continues. Ops fixes the bad record; consumer resumes (consumer.resume(partition)) and re-tries.
Partition order-events-7 stalls at offset 5_000_000 (bad record):
Pre-bad-record: 5_000_000 (committed) ──> continuous flow
Bad record: 5_000_001 (parked in Redis, partition paused)
Post-bad-record: 5_000_002 ─ 5_000_500 (queued in broker, not consumed)
Other partitions (0, 1, 2, ..., 6, 8, ..., N) continue processing normally.
Operator review of parked record → fix → resume partition → records 5_000_001 onward.
Cost. Lag on the parked partition grows; SLA on partition-7 records is broken; partition-7's consumer is functionally idle. Worth it for ordered workloads; wrong choice for unordered (DLQ-and-skip is fine for telemetry, click-stream, etc.).
Retry topics with cascading TTL. Spring's pattern: orders.retry.5min, orders.retry.30min, orders.retry.2hr, orders.dlq. Failed records cascade through retry topics with increasing delays. Useful when failures are mostly transient (external service flapping) and order doesn't matter — gets to "wait then retry without blocking the source partition" semantics. Loses order, like DLQ; differs only in the retry shape.
The bottom line. "DLQ" is shorthand; the right design depends on ordering requirements. Ordered workloads use parking-lot. Unordered workloads use DLQ. Mixing them without thought leads to silent data corruption — the worst kind.
§15. The transactional outbox pattern — solving the dual-write problem
Mentioned in §20 (DB-as-queue) and §23 use-cases. This section covers the canonical version because the pattern is a recurring building block.
15a. The dual-write problem
The naive approach when a service needs to update its database AND publish to Kafka:
def create_order(order):
db.insert("orders", order) # write to DB
kafka.send("orders.created", order) # publish event
The failure modes: - DB write succeeds; Kafka send fails (broker down, network blip). DB has the row, but no event. Downstream consumers (billing, fraud, search) NEVER see the order. - Kafka send succeeds; DB write rolls back (constraint violation, transaction failure). Downstream consumers process a "created" event for an order that doesn't exist in the DB. They'll fail trying to look it up later. - Both succeed; service crashes between them; some duplicate write logic now has to reconcile.
There is no way to make two writes to two independent systems atomic without 2PC (Two-Phase Commit), and 2PC across Kafka and Postgres is not supported and would be unreasonably slow if it were. The right answer is to make it ONE write, not two.
15b. The outbox pattern
The atomic write is to the database only. The outbox table sits in the same database, written in the same transaction as the business data:
BEGIN;
INSERT INTO orders (id, customer_id, total, ...) VALUES (42, ..., ...);
INSERT INTO outbox (event_id, aggregate_type, aggregate_id, event_type, payload, created_at)
VALUES (uuid_v4(), 'order', 42, 'order.created',
'{"order_id":42, "customer_id":..., "total":...}',
NOW());
COMMIT;
Both rows commit atomically because they're in the same Postgres transaction. The outbox row is now a durable record that an event must be published. It cannot be lost without losing the order itself.
Then a separate process — typically Debezium (§8b) — tails the binary log of the database and publishes outbox rows to Kafka:
┌──────────┐ ┌──────────┐ ┌────────────────┐
│ App │──BEGIN──│ Postgres │ │ Kafka topic │
│ │ INSERT │ │ │ orders.created│
│ │ orders │ ┌──────┐ │ │ │
│ │ INSERT │ │ WAL │ │ └────────▲───────┘
│ │ outbox │ │ │ │ │
│ │ COMMIT │ └───┬──┘ │ │
└──────────┘ └─────┼────┘ │
│ │
│ tails WAL │
▼ │
┌────────────┐ │
│ Debezium │──publishes────┘
│ CDC │
└────────────┘
After publish, Debezium commits its WAL position to its own Kafka offsets topic. On restart, it resumes from the last published WAL position — at-least-once delivery (idempotent producer + Kafka transactions get to exactly-once-into-Kafka).
15c. Why outbox beats "write to DB then write to Kafka"
The dual-write problem disappears because there is no dual write. Concretely: - If the order INSERT fails, the outbox INSERT is in the same transaction and rolls back. No phantom events. - If the order INSERT succeeds but the service crashes before "writing to Kafka," it doesn't matter — the outbox row is durable, Debezium will pick it up on its own schedule. - If Kafka is down for an hour, the outbox grows but nothing is lost. When Kafka recovers, Debezium catches up.
Latency cost: the event reaches Kafka ~50-500ms after the business write (Debezium's polling lag + WAL append delay). For event-driven architectures where consumers process asynchronously, this is invisible.
Operational cost: Debezium is a piece of infrastructure to operate (Connect cluster, plugin, schema management). Worth it because the alternative is permanent data inconsistency.
Storage cost: the outbox table grows. Standard practice: a separate process (or Debezium's built-in outbox.event.deletion.strategy=delete) deletes outbox rows after publish-acknowledgment, keeping the table bounded.
15d. Variations and alternatives
- Outbox with built-in Debezium support — Debezium has an
outboxSMT that extractsaggregate_idas Kafka key, routes to the right topic, drops the outbox-table envelope. Saves boilerplate. - Polling-based outbox — without Debezium, a worker polls
SELECT * FROM outbox WHERE published=false ORDER BY created_at LIMIT 100 FOR UPDATE SKIP LOCKED, publishes, setspublished=true. Simpler ops; misses the binlog-tailing efficiency; can produce duplicates on poll-then-crash without idempotent producer. - Listen/Notify channels (Postgres LISTEN/NOTIFY) — DB pushes notifications to a listener that publishes to Kafka. Lower latency than polling, but no durability guarantee on notify (if listener is down, notify is lost). Combine with outbox table as the source of truth.
- Saga pattern alternative — instead of outbox + events, the saga executes a workflow: each step writes a "pending" record, dispatches the next step. Cleaner for long-running multi-service workflows; outbox is cleaner for "atomic write + event publish" semantics.
Used by: Stripe (mentioned §24), Shopify, Airbnb, LinkedIn (Brooklin is essentially a managed outbox infrastructure across all services). The bottom line: the outbox pattern is the canonical, no-debate solution to "write to DB + publish event" atomicity. Any production system that does both must have an outbox or equivalent; the alternative is silent data drift.
§16. Capacity envelope across scales
Throughput, latency, and storage envelope varies by 6+ orders of magnitude. The bottleneck shifts at each tier.
Small team — 10K msg/sec (3-broker cluster). Startup logging activity events for search + billing. 3 brokers across 3 AZs, RF=3. 500 B avg → 5 MB/sec write, 15 MB/sec with replication. 7-day retention = ~9 TB. p99 produce 5-10 ms. Bottleneck: nothing technical. Operating Kafka at 10K msg/sec costs the same as 1M; below ~100K msg/sec the right answer is usually "use managed (MSK, Confluent Cloud)."
Mid-scale event bus — 1M msg/sec (15-30 brokers). E-commerce microservices on Black Friday, 12 consumer groups. 500 MB/sec write, 1.5 GB/sec with replication, ~3.5 GB/sec broker traffic at 5x fan-out. 30-day retention via tiered storage. p99 produce 5-15 ms. Bottleneck: partition count discipline + schema registry.
LinkedIn scale — 80M msg/sec (4,000 brokers, ~100 clusters). Public numbers (2019-2024): ~7T msg/day, ~18 GB/sec write, ~70 GB/sec read, ~100K topics, ~7M partitions, p99 ~10 ms. Hot volume ~32 PB. Bottleneck shifts to storage and partition metadata. Pre-KRaft, ZooKeeper hit a ceiling at ~200K partitions/cluster; KRaft + tiered storage pushed it.
Hyperscale — Netflix Keystone, 10 PB/day. ~700B events/day, ~8M events/sec, multiple regional clusters with cross-region replication, strict Protobuf via Schema Registry. Bottleneck: stream-processing layer (Flink state, sink backpressure), not Kafka itself.
S3-native frontier — WarpStream / AutoMQ (2023-2024). ~1 GB/sec sustained on small clusters (10-30 nodes), p99 ~200-500 ms, 30-50% cost reduction at high retention. Bottleneck: S3 prefix throughput (~5,500 GET / 3,500 PUT per prefix per second) drives aggressive 1-2 second batching to amortize S3 PUT cost.
The bottleneck moves: NIC at small scale, partition metadata at mid-scale, storage volume at LinkedIn-scale, S3 PUT rate at S3-native scale. At every tier the limit is different.
§17. Architecture in context
The canonical integration pattern, independent of product. Every variant fits this shape with names swapped.
┌──────────────────────────────────────────────────────────┐
│ PRODUCERS │
│ Services publishing events (payments-service, tracker, │
│ vehicle-uplink). Batch by partition key, compress, retry│
└────────────────────────┬─────────────────────────────────┘
│ binary protocol (Kafka wire, AMQP,
│ Pulsar binary, HTTP for SQS)
▼
┌──────────────────────────────────────────────────────────┐
│ BROKER CLUSTER │
│ │
│ Topic: orders.created │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Partition 0 (leader B-7, replicas {B-31,B-88}) │ │
│ │ Partition 1 (leader B-42, replicas {B-58,B-91}) │ │
│ │ ... Partition N (leader B-119, replicas {...}) │ │
│ └───────────────────────────────────────────────────┘ │
│ Leaders serve writes; followers replicate via Fetch. │
│ │
│ ┌──────────────────┐ ┌─────────────────────────────┐ │
│ │ Metadata Layer │ │ Offset / Cursor Storage │ │
│ │ (KRaft, ZooKeeper│ │ (__consumer_offsets, Pulsar │ │
│ │ Pulsar mDB, │ │ cursors, RabbitMQ acks) │ │
│ │ RabbitMQ Mnesia)│ │ │ │
│ └──────────────────┘ └─────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐│
│ │ Tiered Storage (S3 / HDFS / Azure Blob) ││
│ │ Closed segments offloaded; broker keeps metadata ││
│ └─────────────────────────────────────────────────────┘│
└────────────────────────┬─────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ CONSUMER GROUPS │
│ │
│ Group "billing" Group "fraud" Group "search" │
│ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ │
│ │C1│ │C2│ │C3│ │C1│ │C2│ │C3│ │C1│ │C2│ │
│ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ │
│ │
│ Each group reads the SAME log independently. │
│ Within a group, each partition is assigned to one │
│ consumer — that's how parallelism scales. │
└──────────────────────────────────────────────────────────┘
Key annotations: Partition key on every write (hash(key) % num_partitions) — for an order event key is order_id, for vehicle telemetry vehicle_id, for click-stream user_id or session_id; the key choice determines what is ordered and where hot keys land. Replication factor 3 is standard — survives any single-node loss; single-AZ loss with replicas spread across AZs. Consumer groups are independent — new analytics team subscribes as its own group, reads from offset 0, doesn't disturb anyone. The metadata layer is small but critical — owns broker membership, partition leadership, topic config; if it goes down, steady-state IO continues but cluster changes pause.
§18. Hard problems inherent to logs and queues
Six problems any non-trivial deployment hits, illustrated with different domains.
18.1 Exactly-once semantics — payments outbox
In a payments platform, every payment.charged event must trigger exactly one customer charge and exactly one ledger update.
Naive fix: producer retries until ack, consumer commits offset after processing.
Why it breaks: at T0 Stripe service publishes payment.charged{order=42}; broker writes, replicates, acks. At T1 producer's TCP connection drops before receiving ack; producer retries. Without idempotent producer, broker writes the duplicate — offsets 100 and 101 now hold the same event; customer charged twice. Separately, consumer reads offset 100, charges, crashes before committing offset. On restart, last committed offset is 99; consumer re-reads, charges again. Customer charged three times.
Real fix: three pieces required: (1) Idempotent producer (enable.idempotence=true) — broker tracks (producerId, partition) → highest sequence, rejects duplicates. (2) Transactional producer (transactional.id) — atomically writes to N partitions and commits consumer offsets in one transaction. (3) Consumer isolation.level=read_committed — skip aborted or in-progress transactions.
Caveat: this is exactly-once within Kafka. End-to-end to Stripe's external API requires Stripe to accept an idempotency key. The bottom line: "Kafka exactly-once gives exactly-once into a Kafka sink; external boundaries still need application-level idempotency."
18.2 Head-of-line blocking — click-stream analytics
Click-stream ingests web events. One event at offset 5,000,000 has a malformed payload; consumer throws on every retry; entire partition stalls.
Naive fix: exponential backoff retry.
Why it breaks: retries forever on a poison message. All records after offset 5,000,000 on that partition are starved. Lag grows linearly.
Real fix: Dead Letter Queue (DLQ) + bounded retry. After N retries, publish failed record to clickstream.dlq with error in headers, commit source offset, move on. Operator-driven flow triages the DLQ.
Kafka has no DLQ primitive — application logic (or Kafka Connect's built-in DLQ). RabbitMQ has DLX (Dead Letter eXchange) first-class. SQS has per-queue DLQ. Forgetting DLQ is one of the most common Kafka production outages.
18.3 Hot partition — IoT telemetry
A Tesla-style fleet telemetry topic has 256 partitions keyed by vehicle_id. One car has a bug emitting 1000x normal rate. One partition does 50K msg/sec; the other 255 do hundreds.
Naive fix: add more partitions.
Why it breaks: the hot key still hashes to one partition. Repartitioning a live keyed topic is painful — existing data stays on old partitions, consumers must handle both.
Real fix — three options: (1) Salt the key — vehicle_id || ":" || (event_seq % 8) spreads one vehicle across 8 partitions. Lose strict per-vehicle order; downstream timestamp-sorts. (2) Two-tier topic — raw partitioned coarsely (geo), stream processor re-keys with dynamic shard suffixes for hot keys. (3) Carve out hot keys to dedicated pipelines — power-law distributions are common; top 100 of 1M vehicles often emit 50% of events.
The instinct: "what's the actual order requirement?" Usually "order per vehicle" is over-specified; "eventually consistent per vehicle with timestamps for reconciliation" works.
18.4 Consumer rebalance pauses — microservices fan-out
An Airbnb-style booking-events consumer group has 64 consumers reading 256 partitions. One dies. Group coordinator triggers rebalance: all 64 stop, return assignments, get reassigned, resume. Old-style "eager" rebalance is stop-the-world for 5-10 seconds. At 1M msg/sec, that's 5-10M messages of lag.
Naive fix: make consumers more reliable. Doesn't help — scaling events (deploy, autoscaling) cause rebalances regardless.
Real fix: cooperative incremental rebalance (KIP-429) — only partitions that need to move actually move; other consumers don't pause. Pause time per consumer drops from seconds to ~100 ms.
Deeper fix: static group membership (group.instance.id). Stable identity; restart within session.timeout.ms (default 45s) gets the same partitions back. Critical for stateful consumers (stream processors with local state).
18.5 Ordering vs scaling tension — CDC pipelines
A CDC pipeline (Debezium → Kafka → warehouse) needs row changes applied in order so the warehouse row matches the source DB row.
Naive fix: "one partition, one consumer."
Why it breaks: single partition ~100 MB/sec write, single consumer ~50-200K msg/sec. A DB doing 1M row-changes/sec is 5-10x too fast.
Real fix: partition by primary key — all changes for one row land on one partition, preserving per-row order. Cross-row order isn't needed; the warehouse reconstructs rows independently. The reality: "what is the actual ordering scope?" — almost always per-key, not global.
LinkedIn's Brooklin partitions by source-table PK so per-row order is preserved across hundreds of partitions.
18.6 Producer back-pressure — log shipping
Cloudflare-style edge log shipping: 200 POPs producing 500K msg/sec each into central Kafka. The cluster has a brownout (controller GC pause).
Naive fix: "producer send is async, it'll buffer locally."
Why it breaks: producer's buffer.memory (default 32 MB) fills. Then send() blocks or throws. OOM (Out-Of-Memory) is minutes away.
Real fix — four layers: (1) Producer: cap buffer.memory, set max.block.ms to ~100 ms, back-pressure upstream caller on block. (2) Broker quotas (quota.producer.default) throttle exceeding producers via response delays. (3) Upstream circuit breaker — if p99 produce > 100 ms for 30s, skip Kafka (fail open/closed by data class). (4) Outbox pattern as durable fallback — write events to DB outbox in same transaction as business write; CDC tails binlog to Kafka. If Kafka is down, outbox accumulates; nothing is lost.
§19. Failure mode walkthroughs
Broker crash mid-write (mid-batch). Leader B-7 has written half of a 16 KB batch; followers haven't pulled. Controller detects B-7 down (~10s timeout), elects new leader from ISR (say B-31, which hadn't pulled the partial). Producer's in-flight to B-7 times out, retries against B-31; idempotent producer dedups. B-7 restarts, scans active segment, finds partial record at tail (CRC fails), truncates to last fully-written record boundary, becomes follower, catches up. Durability point: last fully-written batch before the partial.
Crash between produce ack and replication. With acks=all this shouldn't happen — leader doesn't ack until ISR replicates. With acks=1 it's the classic data-loss window: leader writes, acks, crashes before replicating; new leader from a follower that didn't see the batch; batch lost. Defense: never use acks=1 for important data; acks=all + min.insync.replicas=2 is the LinkedIn default.
Metadata leader death (KRaft / ZooKeeper / Pulsar metadata). 5-node KRaft quorum, controller leader dies. Remaining 4 controllers hold a Raft election; new leader emerges in ~5-10s, replays metadata log, broadcasts to brokers. Durability point: Raft log on controller quorum — Raft majority guarantees committed entries (≥3 of 5 ack'd) survive any minority failure. During the gap, brokers continue serving in-progress produces and consumes — controller is only needed for metadata changes.
Network partition (split-brain potential). AZ-1 has leader B-7 + ISR {B-7, B-31} + controller majority (3 of 5). AZ-2 has B-88 + 2 controllers. Network partitions. AZ-1 continues serving writes; ISR shrinks to {B-7, B-31}. AZ-2's B-88 can't fetch from B-7; after 30s controller removes it from ISR. AZ-2's 2 controllers are minority — cannot elect new leader. No split-brain possible. Durability point: Raft majority on controller quorum — why 5-node quorums across 3+ AZs is standard.
Permanent participant loss (broker disk dies). B-7 dies, disk unrecoverable. New broker B-7' provisioned with same ID. Controller sees B-7 absent; leadership shifts to other ISR members. B-7' joins with no data. For each partition where B-7 was follower, leader sends Fetch from offset 0 — B-7' replays entire log. Multi-TB partitions take hours of network IO. Tiered storage helps — only local-tier data copies; old data lives in S3 and is accessible to all brokers (Confluent reports 5-10x faster broker replacement with tiered storage).
Consumer group rebalance during failure. Consumer C5 hangs. Group coordinator initiates rebalance. Cooperative incremental rebalance: only C5's partitions reassign; others continue. Subtle issue: if C5 was processing a record but hadn't committed offset, the new consumer re-reads from C5's last committed offset → duplicate processing. Idempotent consumer is mandatory; you cannot architect this away.
§20. Why not just use a database table as a queue?
The most common L4/L5 design suggestion; the failure modes show exactly what message queues exist to prevent.
CREATE TABLE events (
id BIGSERIAL PRIMARY KEY, type TEXT, payload JSONB, created_at TIMESTAMPTZ,
consumed_by_billing BOOLEAN DEFAULT FALSE,
consumed_by_fraud BOOLEAN DEFAULT FALSE
);
-- Consumer polls:
SELECT * FROM events WHERE consumed_by_billing = FALSE
ORDER BY created_at LIMIT 100 FOR UPDATE SKIP LOCKED;
-- ... process ... then:
UPDATE events SET consumed_by_billing = TRUE WHERE id IN (...);
Failure 1: Write throughput cap. At 1M msg/sec × 5 consumer groups = 5M state updates/sec. Postgres tops out ~50K writes/sec on a single primary — 20-100x too slow. Sharded at 100 shards still 5x too slow.
Failure 2: Index bloat from in-place updates. Postgres MVCC creates a new tuple version per update; autovacuum can't keep up at high rates. Table becomes mostly dead tuples; queries scan 100x more pages; performance collapses within hours. Kafka avoids this entirely — broker doesn't track per-message consumption. Consumer offset is one integer per (group, topic, partition), updated once per second. State is O(groups × partitions), not O(messages).
Failure 3: No replay. Delete rows after consumption → new consumer can't read old data; bug in consumer logic can't be fixed by replay. Keep rows forever → table grows unboundedly, index doesn't fit in RAM. Kafka stores 1000x more data with no degradation because reads are sequential and old data lives on cheap storage.
Failure 4: No fan-out. Five consumers = five consumed_by_x columns. New consumer = schema migration. In Kafka, new consumer group is one subscribe() call.
Failure 5: Locking under competing consumers. FOR UPDATE SKIP LOCKED (added to Postgres precisely because people kept trying this) gives mutual exclusion but row-level locks at 100K/sec cause contention. Kafka's "one partition per consumer in a group" eliminates locking.
When DB-as-queue actually works: modest scale (< 1K msg/sec), single consumer, no replay, no fan-out. Small SaaS outbound webhook queue with FOR UPDATE SKIP LOCKED + simple worker is fine. This is exactly why Kafka exists — past ~10K msg/sec or multi-consumer, the mismatch becomes fatal.
The related pattern — outbox: events written to a DB outbox table in the same transaction as the business write, then CDC tails the binlog to Kafka — is good and standard. Outbox is transactional staging, not the queue itself. Distinguish in interviews.
§21. Scaling axes
Two structurally different growth modes with different fixes; conflating them is a common mistake.
Type 1 (uniform growth): more topics, more groups, more partitions. New product feature → new event type → new topic. Progression: 10K msg/sec on 3 brokers → 100K msg/sec same cluster with more partitions → 1M msg/sec dedicated cluster of 15-30 brokers + schema registry → 10M msg/sec multiple clusters split by domain + KRaft + tiered storage → 80M msg/sec (LinkedIn) dozens of clusters with fleet-wide control plane, self-service onboarding, quota enforcement.
Inflection points: ~1K partitions → ZooKeeper struggles, move to KRaft. ~50 brokers/cluster → rebalance takes hours, split by domain. 10x storage growth → tiered storage essential. Cross-region → MirrorMaker 2 or Pulsar geo-rep; conflict resolution becomes a system requirement.
Type 2 (hotspot intensification): one topic ramps 10-100x. A viral feature, a runaway bug, Black Friday. Baseline: 10K msg/sec on 32 partitions = ~300 msg/sec/partition, fine. 10x uniform = ~3K/partition, fine. 10x with skew (top 1% of keys = 50%): one partition does 50K msg/sec, others idle. Adding partitions does NOT help — hot key still hashes to one. Fix: re-key with salt, sacrifice strict per-key order. 100x uniform approaches per-partition ceiling; fix is more partitions, but repartitioning a live keyed topic requires creating a new topic + dual-write + cutover — painful. Inflection point: over-provision partitions at topic creation 4-10x what you think you need; treat partition count as a one-shot decision.
The asymmetry: Type 1 is solved by adding capacity. Type 2 is solved by re-engineering the partition key.
§22. Decision matrix vs adjacent technology categories
| Technology | When to pick | When NOT to pick |
|---|---|---|
| Distributed log (Kafka/Pulsar/Kinesis) | Fan-out > 1 consumer group. Replay needed (audit, backfill). Throughput > 50K msg/sec. Per-partition order is sufficient. | Sub-ms latency. < 1K msg/sec with no replay (overkill). Strict global order at high throughput (impossible). |
| Queue (RabbitMQ/SQS) | < 50K msg/sec. Rich per-message semantics (priority, TTL, headers). RPC-style request/reply. Acked = gone is acceptable. | High throughput. Fan-out > 1. Need replay. |
DB outbox (SKIP LOCKED) |
< 1K msg/sec. Single consumer. Already have the DB. No fan-out, no replay. | Multi-consumer. > 10K msg/sec. Need replay. |
| Synchronous RPC (gRPC, HTTP) | < 50 ms latency required. Caller needs response value. Few hops. | Async fire-and-forget. Fan-out. Back-pressure absorption. Publish even if consumer is down. |
| Workflow engine (Temporal, Cadence) | Multi-step processes with retries, timers, human-in-the-loop, compensations. Days-to-weeks workflows. | Simple fan-out (over-engineered). Sub-second latency between steps. |
| Stream processor (Flink, Kafka Streams) | Stateful computation across events (windowing, joins, aggregation). | Pure transport (log alone). Stateless transforms (consumer is enough). |
Thresholds: Log vs queue — fan-out ≥ 2 or replay needed or > 50K msg/sec → log; otherwise queue is simpler. Log vs DB outbox — multi-consumer or > 10K msg/sec → log; otherwise outbox is cheaper. Queue vs RPC — caller doesn't need response or must succeed when consumer is down → queue; otherwise RPC.
§23. Use case gallery — six domains, different variants
Same primitive, wildly different demands.
Event sourcing in payments (Stripe outbox). Payments service writes payments row inside a transaction + an outbox row in the same transaction. Debezium tails the binlog and publishes payment.charged to Kafka. Downstream: ledger, notifications, fraud, reporting. Demands: strict atomicity (no lost events), replay for ledger reconciliation. Fit: Kafka acks=all, transactional consumers, read_committed. Not a queue because ledger + reporting need replay.
Real-time analytics (Uber click-stream). Mobile clients emit clicks tagged with session ID. Kafka partitioned by session_id. Flink computes session metrics; sinks to Druid (dashboards) + S3 (long-term). Demands: 1-10M events/sec, sub-second freshness, tolerates 0.1% loss, per-session order. Fit: Kafka acks=1, lz4 compression — async replication accepted because analytics is loss-tolerant.
CDC pipelines (LinkedIn Brooklin → Iceberg). Brooklin tails MySQL/Espresso binlogs, writes change events to Kafka, consumed by Iceberg lake tables, search index updates, Espresso secondary indices. Demands: lossless row-change transport, per-row order, multi-sink fan-out. Fit: Kafka log + per-PK (Primary Key) partitioning + acks=all + compacted topics for snapshot state.
Microservices async (Airbnb booking events). booking.created triggers payment, calendar, notifications, search update, host notification. Demands: ~50K msg/sec, each consumer its own bounded context. Fit: Kafka or Pulsar. RabbitMQ viable but adding a 6th consumer requires a new queue; with a log it's a subscribe() call.
IoT telemetry (Tesla fleet uplink). Millions of vehicles emit CAN-bus events, GPS, sensors, fault codes. Uplinked to regional brokers, forwarded to central Kafka for analytics, predictive maintenance, OTA (Over-The-Air) tracking. Demands: burst-tolerant, loss tolerable for routine sensor reads but NOT for fault codes, geo-distributed. Fit: Kafka central + regional clusters with cross-region rep, or MQTT (Message Queuing Telemetry Transport) at edge bridging to Kafka. Safety-critical uses acks=all, bulk uses acks=1.
Log shipping (Cloudflare edge logs). 200+ POPs (Points of Presence) emit request logs, security events, edge worker logs to central Kafka. Downstream: SIEM (Security Information and Event Management), warehouses, customer log delivery. Demands: 1T+ msg/day, 30+ day retention, cost-sensitive, ~100 ms latency tolerated. Fit: Kafka + tiered storage to S3/R2, aggressive zstd; evaluating WarpStream-style S3-native.
Same primitive — durable, partitioned, replayable log — serves all six. The knobs (replication mode, compression, partition key, retention, tiered storage) move; the structure is the same.
§24. Real-world implementations with numbers
- LinkedIn (Kafka origin, 7T msg/day). Kafka was invented at LinkedIn in 2010 to replace ActiveMQ. ~7T msg/day, peak ~80M msg/sec, ~4,000 brokers across ~100 clusters, ~100K topics, ~7M partitions, p99 ~10 ms. Brooklin (CDC) feeds Iceberg, Espresso secondary indices, derived stores.
- Netflix Keystone (10 PB/day). Unified telemetry. ~700B events/day, ~8M events/sec sustained, multiple regional clusters, Flink stream processing. Feeds S3, Druid, Elasticsearch, ML feature pipelines.
- Cloudflare (1T+ msg/day). Edge log shipping (2023 blog). Aggressive zstd compression. Driver behind their evaluation of S3-native log designs.
- Uber (4T msg/day Cherami → Kafka). Initially built Cherami (custom queuing); migrated to Kafka for replay + throughput. uReplicator (open-sourced) for cross-region. uForwarder + Flink.
- Stripe (~5M events/sec). Outbox pattern: business write paired with outbox row; CDC publishes to Kafka. Multi-region with strict exactly-once into ledger via transactional producers +
read_committed. - Confluent Cloud benchmarks. Single c5n.4xlarge broker sustains ~605 MB/sec write at
acks=all, RF=3, lz4. End-to-end p99 ~5 ms at moderate load. - WarpStream (acquired Confluent 2024) / AutoMQ. S3-native Kafka-compatible. ~1 GB/sec on 10-30 node clusters, p99 ~200-500 ms, 30-50% cost reduction at high retention.
- Pulsar at Tencent. 10M msg/sec across a single cluster claimed. Multi-tenant isolation as the differentiator. BookKeeper fsync-per-entry, fewer replicas than Kafka.
Steady-state production across wildly different access patterns — activity events (LinkedIn), telemetry firehose (Netflix), edge logs (Cloudflare), task graphs (Uber), transactional outbox (Stripe), high-retention cost-sensitive (WarpStream/AutoMQ), multi-tenant SaaS (Pulsar). Same primitive, tuned differently.
§25. Summary
The distributed log is the substrate primitive that converts "send this message" into "append this byte range to a replicated, segmented file" — and the rest of the design space (queue vs log, sync vs async replication, local-disk vs S3-native, broker-managed vs peer-to-peer) is variation on how aggressively you optimize that one operation. Pick log when fan-out and replay matter at scale; pick queue when per-message semantics matter at modest scale; pick S3-native when retention dominates cost; pick DB outbox when you're below the scale where introducing a broker pays for itself.
Appendix A: Acronyms expanded
ACL (Access Control List), AES-NI (Advanced Encryption Standard New Instructions), AMQP (Advanced Message Queuing Protocol), AZ (Availability Zone), BookKeeper (Pulsar's distributed durable storage layer), BYOC (Bring Your Own Cloud), CA (Certificate Authority), CAN-bus (Controller Area Network bus), CDC (Change Data Capture), CRC (Cyclic Redundancy Check), DLQ/DLX (Dead Letter Queue / Dead Letter eXchange), DMA (Direct Memory Access), DR (Disaster Recovery), FIFO (First-In-First-Out), GC (Garbage Collection), GTID (Global Transaction Identifier), HW (High Watermark — highest offset replicated to all ISR members), IoT (Internet of Things), ISR (In-Sync Replica), JMX (Java Management Extensions), JWT (JSON Web Token), JVM (Java Virtual Machine), KIP (Kafka Improvement Proposal), KMS (Key Management Service), KRaft (Kafka Raft, post-ZooKeeper metadata layer), LEO (Log End Offset), LSM (Log-Structured Merge tree), LSO (Last Stable Offset — transactional consumer's high-watermark equivalent), LUKS (Linux Unified Key Setup, disk encryption), MM2 (MirrorMaker 2), MQTT (Message Queuing Telemetry Transport), MSK (AWS Managed Streaming for Kafka), mTLS (mutual TLS), MVCC (Multi-Version Concurrency Control), NIC (Network Interface Card), NVMe (Non-Volatile Memory express), OIDC (OpenID Connect), OOM (Out-Of-Memory), OTA (Over-The-Air), PID (Producer ID — idempotent producer), PII (Personally Identifiable Information), PK (Primary Key), POP (Point of Presence), RPC (Remote Procedure Call), RPO (Recovery Point Objective), RTBF (Right To Be Forgotten), RTO (Recovery Time Objective), RTT (Round Trip Time), SASL (Simple Authentication and Security Layer), SCRAM (Salted Challenge Response Authentication Mechanism), SIEM (Security Information and Event Management), SMT (Single Message Transform — Kafka Connect), SQS (AWS Simple Queue Service), SR (Schema Registry), SSO (Single Sign-On), TC (Transaction Coordinator), TTL (Time-To-Live), WAL (Write-Ahead Log), 2PC (Two-Phase Commit).
Appendix B: Companion patterns worth a name-check
- Transactional outbox: events to a DB outbox table in the same transaction as the business write; CDC reads the binlog and publishes to Kafka. Eliminates dual-write atomicity. Used by Stripe, Shopify.
- Idempotency key at the API boundary: client-facing API accepts idempotency-key headers (Stripe-style). Defends against retries above the queue layer.
- Compacted state topics:
cleanup.policy=compact+ key per state object = materialized current-state snapshot. Used for__consumer_offsets, CDC of current row values, feature-store online caches, stream-processor state backing. - MirrorMaker 2 / uReplicator: cross-cluster replication for DR or geo. Offset translation is the tricky bit — offsets don't transfer 1:1 across clusters.
- Stream processing on top: Kafka is the substrate; Flink / Kafka Streams / Spark Structured Streaming are the compute layer. Exactly-once typically lives here, not in raw consumers.
- Schema Registry (Confluent, Apicurio, Pulsar built-in): every topic has a registered schema (Avro / Protobuf / JSON Schema). Backward-compatibility enforcement prevents producer changes from breaking consumers.
End of doc.