Notification systems take a domain event ("Alice replied to your comment", "your payment of $43.21 succeeded", "PagerDuty incident #4421 escalated", "your Stripe payout posted") and turn it into one or more asynchronous, channel-routed, user-targeted nudges. They are a class of infrastructure that sits between event producers (any service in the business) and delivery endpoints: APNs (Apple Push Notification service), FCM (Firebase Cloud Messaging), SES (Simple Email Service), SendGrid, Twilio, web push, and the in-app inbox.
This is a reference about that class — what it is, what it inherently guarantees, the design space of fanout patterns, the byte-level wire mechanics of the channels, the hard problems, the failure modes, and how the same technology serves transactional payments, social feed, marketing campaigns, alerting, 2FA (Two-Factor Authentication) SMS, and in-game pushes. Multiple domains use it differently; we cover all of them.
§1. What notification systems ARE (and what they are NOT)
A notification system is an asynchronous fan-out delivery platform that accepts one logical event from an upstream service and produces N targeted messages routed across heterogeneous channels (push, email, SMS, in-app inbox, web push, voice). It owns four problems no individual channel API solves:
- Targeting: mapping an event to a recipient set (one user, a fanout list, a segment).
- Routing: picking channel(s) per recipient based on preferences, device availability, urgency, cost.
- Throttling / dedup / scheduling: per-user budgets, duplicate suppression, quiet hours.
- External-channel adaptation: HTTP/2 to APNs, SMTP/HTTPS to email, REST to SMS providers — managing connection pools, retries, backpressure.
It sits between event producers (feed, payments, identity, monitoring, marketing automation) and delivery endpoints (APNs, FCM, SES, Twilio) plus an internal inbox store.
Distinguish from adjacent technology categories
- vs. distributed message queue (Kafka, RabbitMQ): a queue moves bytes with no notion of "user". A notification system uses a queue inside, but its abstraction is "user × channel × content". You build a notification system ON a queue, not as one.
- vs. in-app realtime (WebSocket, SSE, XMPP): realtime channels deliver to currently-connected clients only. A complete platform uses BOTH — WebSocket when client is connected, APNs/FCM when offline. WebSocket is one channel among several.
- vs. transactional email service (SES, SendGrid alone): SES knows how to send one email. The notification platform orchestrates SES + APNs + Twilio + inbox + preferences + dedup. SES is a primitive; the platform composes.
- vs. marketing automation (Mailchimp, Iterable, Braze): marketing platforms layer campaign segmentation and A/B tests on top of the same fanout-route-throttle substrate.
- vs. alerting (PagerDuty, Opsgenie): alerting is a notification platform specialized for high-urgency, on-call rotation, escalation policies. Same primitives.
What notification systems are NOT good for
- Synchronous in-band confirmation. Channel ACKs mean "we accepted it", not "user saw it".
- Strict exactly-once across all channels. Push is best-effort; email goes to spam; SMS providers retry on their own. The platform delivers at-least-once; device or inbox does dedup.
- Cross-channel ordering. Push may arrive before email even when dispatched in order — no global clock across APNs and SES.
- Long-form rich content. APNs caps payloads at 4KB; SMS at 160 chars per segment. Send the link, host the content elsewhere.
- Two-way conversations. Notifications are one-way. Use a messaging system for replies.
§2. Inherent guarantees (and what must be layered on)
By design, the platform provides:
- At-least-once delivery per channel, where "delivery" means "accepted by the channel's API" — not "rendered on screen". An APNs 200 means APNs took it; the device may still be offline.
- Durable inbox as source of truth. The in-app inbox row, once written, survives crashes. Push and email are hints.
- Idempotency via event/dispatch keys. The same event_id + recipient_id resolves to one inbox row even on retry.
- Channel-specific reliability:
  - Push (APNs/FCM): best-effort. The provider holds an undelivered push for a bounded period (until apns-expiration on APNs; until ttl on FCM, 4 weeks by default), then drops it. apns-collapse-id ensures one banner even if ten pushes are sent.
  - Email (SES/SendGrid): ~99% deliverable for clean senders; latency seconds to minutes; content can land in spam.
  - SMS (Twilio): ~95-99% delivery to MNOs (mobile network operators); operator's queue is opaque.
  - In-app inbox: 100% (we wrote it durably).
- Back-pressure: the queue absorbs producer bursts; dispatcher rate decoupled.
By design, the platform does NOT provide:
- Exactly-once across channels. Network flakes between dispatcher and APNs cause duplicate pushes; device-side collapse_id collapses them. Layered on, not inherent.
- Strict cross-channel ordering. "Push arrives before email" is convention. The dispatcher fans out in parallel; APNs ~200ms, email ~5s. Only the inbox (ordered by event timestamp) is the ordered source.
- Real-time delivery confirmation to producer. Producer publishes and is done. Whether user saw it = async callback (Twilio webhook, SES SNS bounce) hours later.
- Spam compliance. CAN-SPAM, GDPR, CCPA require unsubscribe handling, consent tracking. The platform exposes mechanics; the business owns the policy.
Must be layered on by the system designer:
- Producer-side idempotency keys: if the payment service retries "payment succeeded", we'd see two events. A producer-generated event_id deduped in Redis is mandatory.
- Cross-channel coordination: "push first, email only if no ack in 5 min", built via a delay queue.
- Compliance auditing — "did the GDPR data-export notification go on 2025-08-13?" must be queryable for 90+ days.
§3. The design space: fanout patterns
The dominant axis of variation in notification systems is how the recipient set is computed and materialized. Three patterns, with a sharp tradeoff.
Fanout-on-write (push model)
When the producer emits an event, the system immediately writes a row to every recipient's inbox. Read is trivial — the user opens the app, fetches SELECT * FROM inbox WHERE user_id=X LIMIT 50 from their own partition.
- Write cost: O(F) where F = recipient count. For a 1:1 transactional event (payment receipt), F=1 → cheap. For a 1:N social event (your friend posted), F = follower count. For a celebrity, F can be 100M.
- Read cost: O(1) page-load — one partition lookup.
- Storage: O(events × avg_followers) — duplicated per recipient.
- Where it shines: transactional 1:1 (payment receipts, password resets, alerts), small-fanout social.
- Where it breaks: high-fanout social (celebrity posts), viral content.
Fanout-on-read (pull model)
The producer writes only to the producer's own timeline / event log. At read time, the recipient's client/server queries all relevant producers' timelines and merges.
- Write cost: O(1) per event.
- Read cost: O(F') where F' = number of producers the recipient subscribes to. For a normal user with 300 followees, every inbox open triggers 300 partition reads.
- Storage: O(events) — single copy.
- Where it shines: very high-fanout sources (celebrity timelines, RSS-style feeds), write-heavy workloads.
- Where it breaks: read-heavy workloads, recipients who follow many sources, latency-sensitive UIs.
Hybrid (the standard pick for non-trivial systems)
Pick per-source. Below a celebrity threshold (~10k–100k followers), use fanout-on-write. Above the threshold, use fanout-on-read — don't pre-populate inboxes; at read time, fetch from celebrity's recent timeline and merge.
# Writer path:
if poster.follower_count < THRESHOLD:
    fanout_on_write(poster, followers)
else:
    fanout_on_read_only(poster)

# Reader path:
inbox  = read_inbox(user, limit=N)        # pre-pushed rows (top N)
celebs = user.celebrity_followees         # bounded, ~10-30
fresh  = [read_recent_timeline(c, since="24h") for c in celebs]
return merge(inbox, *fresh).sorted_by_ts().take(50)
Twitter moved to this approach around 2010-2012 after celebrity write-storms caused cascading failures. Facebook, Instagram, and LinkedIn use variations.
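The reader path can be made concrete with a small runnable sketch. It assumes in-memory lists stand in for the pre-pushed inbox and the celebrity timelines, and that each list arrives newest-first (as a DESC-clustered store would return it). The names `Notification` and `read_feed` are illustrative, not taken from any real system.

```python
import heapq
from dataclasses import dataclass
from itertools import islice


@dataclass(frozen=True)
class Notification:
    ts_ms: int      # event timestamp in milliseconds
    source: str     # who produced the event
    text: str


def read_feed(inbox, celeb_timelines, limit=50):
    """Merge the pre-pushed inbox with celebrity timelines, newest first.

    Every input list must already be sorted newest-first; heapq.merge then
    performs a lazy k-way merge without re-sorting the whole set.
    """
    merged = heapq.merge(inbox, *celeb_timelines,
                         key=lambda n: n.ts_ms, reverse=True)
    return list(islice(merged, limit))


inbox = [Notification(300, "alice", "replied"),
         Notification(100, "bob", "liked")]
celeb = [Notification(250, "celebrity", "posted"),
         Notification(50, "celebrity", "posted")]

feed = read_feed(inbox, [celeb], limit=3)
print([n.ts_ms for n in feed])
```

Because the merge is lazy, taking 50 items touches only the head of each timeline, which is why the celebrity set needs to stay small and bounded.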
Comparison
| Dimension | Fanout-on-write | Fanout-on-read | Hybrid |
|---|---|---|---|
| Write cost | O(F) — explodes for celebrities | O(1) — flat | O(F) for non-celebs, O(1) for celebs |
| Read cost | O(1) | O(F') — sometimes hundreds of reads | O(1) + O(small celeb-set) |
| Storage | F× duplication | 1× | Mixed; mostly 1× for celebs, F× for others |
| Read latency p99 | ~10ms (one partition) | ~100ms+ (multi-partition merge) | ~20ms (one + small merge) |
| Write latency p99 | Fine for normal posts; minutes for celebs | Fast always | Fast always |
| Code complexity | Simple | Simple | Two paths + merge logic |
| When to pick | Median fanout < 10k; transactional 1:1 | Write-heavy with very high fanout; few sources per reader | Real-world social with skewed follower distribution |
For transactional notifications (payments, security, alerts) the question doesn't arise — fanout is 1, so fanout-on-write is trivially correct. The hybrid question only matters for social fanout.
§4. Byte-level mechanics
This is where shallow docs stop and the real depth begins. The notification platform has four critical storage/transport structures: the inbox store (LSM-tree in Cassandra), the device-token store (B+ tree in MySQL), the dedup/rate-limit store (Redis hash table), and the APNs/FCM wire protocol (HTTP/2 framing with HPACK header compression). Each is covered below.
4.1. Inbox store: Cassandra LSM tree, partitioned by user_id
Access pattern:
- Write-heavy: every notification = ≥1 inbox write. At 5B/day = ~58k writes/sec sustained, ~1.75M/sec viral burst.
- Read pattern: "last 50 notifications for user X" — a range scan within one partition.
- Cross-user queries: ~never. 99.9% of inbox queries scope to one user_id.
LSM (Log-Structured Merge) tree is the textbook choice. A B+ tree (MySQL/InnoDB) would hit page-split storms at 58k writes/sec across 1B users; the random write pattern would thrash the buffer pool. LSM accepts ~10x write amplification (compaction overhead) in exchange for purely sequential writes, sustaining ~100k writes/sec per SSD-backed node.
Schema
CREATE TABLE notifications_inbox (
user_id bigint,
created_at_ms bigint,
notification_id uuid,
type text,
payload blob, -- protobuf-encoded
read_at_ms bigint,
PRIMARY KEY ((user_id), created_at_ms, notification_id)
) WITH CLUSTERING ORDER BY (created_at_ms DESC, notification_id ASC)
AND default_time_to_live = 2592000; -- 30 days
- (user_id) is the partition key → all of one user's notifications co-locate on the same RF=3 (replication factor) replicas.
- created_at_ms DESC clustering → newest first; "last 50" reads the first 50 cells of the partition.
- default_time_to_live = 30d → Cassandra writes tombstones at expiry; old data is reclaimed by compaction.
LSM layout on disk
Each Cassandra node holds data as immutable SSTables (Sorted String Tables) on disk plus a mutable in-memory memtable.
WRITE PATH
─────────
client write
│
▼
commit log (append-only, fsync periodic 10ms or per-batch) ── DURABILITY POINT
│
▼
memtable (skip-list or trie-map in heap, sorted by clustering key)
│
│ when memtable hits flush threshold (~256 MB or 15 min)
▼
flush to disk as a new SSTable (immutable, sorted)
│
▼
background compaction merges SSTables (size-tiered or leveled)
READ PATH
─────────
read for (user_id=12345, last 50)
│
▼
1. memtable lookup
2. bloom filter check on each SSTable for partition (user_id=12345)
3. for SSTables that pass: read partition index → seek to offset
4. merge results in clustering-key order, take top 50
5. apply tombstones (expired TTLs)
SSTable file layout
┌─────────────────────────────────────────────────────────────────┐
│ Data.db (rows, sorted by partition_key then clustering) │
│ Index.db (partition-key → byte offset into Data.db) │
│ Filter.db (bloom filter for partition keys, ~1% FP) │
│ Summary.db (sampled Index.db entries, kept in heap) │
│ Statistics.db (min/max clustering keys, tombstone counts) │
│ CompressionInfo.db │
└─────────────────────────────────────────────────────────────────┘
For SELECT * FROM notifications_inbox WHERE user_id=X LIMIT 50:
1. Hash user_id → identify home replicas via consistent-hashing token ring.
2. On a replica, bloom filter on Filter.db says "yes, this SSTable might have user_id=X" — false positive ~1%.
3. Index.db (or Summary.db, in-heap, sampled) gives byte offset in Data.db.
4. Seek into Data.db, scan forward in clustering order, take 50 cells.
Two seeks worst case per SSTable. If user X's data spans 4 SSTables (some rows recently flushed, others from older compactions), we touch 4 SSTables and merge in memory. Leveled compaction keeps the count low: above L0, SSTables within a level do not overlap, so a partition appears in at most one SSTable per level. Size-tiered compaction, the Cassandra default unless tuned otherwise, can spread a partition across more SSTables.
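The read path can be illustrated with a toy model. This is a sketch under simplifying assumptions: a Python set stands in for the bloom filter (so it never yields false positives), dictionaries stand in for Data.db, and tombstone handling is left out. `SSTable` and `read_latest` are hypothetical names for illustration only.

```python
import heapq
from itertools import islice


class SSTable:
    """Immutable sorted run: partition key -> rows sorted newest-first."""

    def __init__(self, partitions):
        self.partitions = partitions
        # Stand-in for Filter.db. A real bloom filter may also answer
        # "maybe" for absent keys, which only costs a wasted lookup.
        self.filter = set(partitions)

    def might_contain(self, user_id):
        return user_id in self.filter

    def read(self, user_id):
        return self.partitions.get(user_id, [])


def read_latest(user_id, memtable, sstables, limit=50):
    """Return the newest `limit` (created_at_ms, payload) rows for a user."""
    runs = [memtable.get(user_id, [])]
    # Only SSTables whose filter passes are consulted at all.
    runs += [t.read(user_id) for t in sstables if t.might_contain(user_id)]
    merged = heapq.merge(*runs, key=lambda row: row[0], reverse=True)
    return list(islice(merged, limit))


memtable = {12345: [(900, "newest")]}
sstables = [
    SSTable({12345: [(700, "c"), (400, "a")]}),
    SSTable({99999: [(800, "other user")]}),   # skipped by the filter
    SSTable({12345: [(500, "b")]}),
]

rows = read_latest(12345, memtable, sstables, limit=3)
print(rows)
```

The cost of a read grows with the number of runs that pass the filter, which is the quantity compaction exists to keep small.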
Tradeoff vs B+ tree
| Dimension | LSM (Cassandra) | B+ tree (MySQL) |
|---|---|---|
| Write throughput | ~100k/sec/node (sequential) | ~20k/sec/node (random splits) |
| Read latency p99 | ~5-10ms (multi-SSTable merge) | ~1-2ms (single tree traversal) |
| Compaction overhead | ~10x write amplification | ~2-3x via splits + buffer pool |
| Range scan within partition | Fast | Fast (leaf-chain) |
| Cross-partition query | Terrible (no global index) | OK via secondary indexes |
| TTL expiry | Native | Manual cleanup job |
For 1B-user inbox with high write rate and partition-local reads, LSM wins. Cross-user analytics ("notifications about post X across users") = add Pinot/Druid as secondary store.
Durability and recovery
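Commit log is the durability point. Cassandra appends to commit log BEFORE updating memtable. On crash: restart → replay commit log → serve reads. Fsync is either periodic, where writes are ACKed before the sync (this walkthrough assumes a 10ms period; the stock commitlog_sync_period default in cassandra.yaml is 10s), or batch, where the write waits for the fsync. Because commitlog_sync is a node-level setting, stricter durability for transactional notifications means a separate cluster running batch mode, not just a separate keyspace.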
One operation, end-to-end
T+0: Dispatcher constructs CQL INSERT (user_id=12345, TTL=30d).
T+1: Coordinator receives at CQL layer.
T+2: Coordinator forwards to all 3 replicas; awaits QUORUM (2 of 3).
T+3: Each replica: append to commit log (sequential, ~1µs to page cache;
fsync at next 10ms tick), update memtable skip-list at
(user_id=12345) sorted by created_at_ms DESC, ACK coordinator.
T+5: Coordinator ACKs dispatcher. p50 ~5ms, p99 ~20ms.
──── [crash on one replica here] ────
Recovery: replay commit log to last fsync. If a write ACKed but lost in
10ms window, other 2 replicas have it → read-repair on next read, or
anti-entropy (Merkle-tree, nodetool repair) later.
DURABILITY POINT: commit log entries across the QUORUM.
T+15min: Memtable hits 256 MB → flush to SSTable (sequential, ~0.4s).
Once SSTable durable, commit log segment released.
Later: Background compaction (leveled: L0→L1 in 160 MB chunks).
Tombstones (expired TTLs) dropped during compaction.
4.2. Device-token store: MySQL B+ tree, sharded by user_id
CREATE TABLE device_tokens (
user_id BIGINT NOT NULL,
device_id VARCHAR(64) NOT NULL,
platform ENUM('IOS','ANDROID','WEB') NOT NULL,
token VARCHAR(256) NOT NULL,
app_version VARCHAR(32),
locale VARCHAR(16),
last_seen_ms BIGINT,
active BOOLEAN NOT NULL DEFAULT TRUE,
PRIMARY KEY (user_id, device_id),
INDEX idx_token (token(64))
) ENGINE=InnoDB;
Why B+ tree:
- Point or narrow-range lookups (≤10 rows per user). B+ tree branching factor ~100 → O(log_100 N) lookups. 1B rows in a single tree would be ~5 levels deep; a ~31M-row shard (1B ÷ 32) is ~4. With the upper 3 levels in the InnoDB buffer pool → 1 disk seek p99.
- Predictable mutations: token rotation = UPSERT on (user_id, device_id); no churn explosion.
- Sharded by user_id across 32 MySQL clusters → ~40 GB per shard. Each shard = 1 master + 2 replicas. Trivial ops.
Token-rotation feedback loop (the 410/404 path described in §8) writes back to this table to deactivate dead tokens.
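The depth figures are plain logarithm arithmetic and can be checked in a few lines. This is a back-of-envelope sketch; `btree_levels` is a hypothetical helper, and the branching factor of 100 is the document's own approximation rather than a measured InnoDB value.

```python
import math


def btree_levels(rows, branching=100):
    """Approximate number of levels needed to index `rows` entries."""
    return math.ceil(math.log(rows, branching))


total_rows = 1_000_000_000
shards = 32

print(btree_levels(total_rows))            # one unsharded tree
print(btree_levels(total_rows // shards))  # one of 32 shards
```

Sharding removes roughly one level here, which is what lets the cached upper levels leave a single uncached page read per lookup.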
4.3. Redis: dedup, rate-limit, preference cache
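Dedup: SET dedup:{event_id}:{user_id} 1 NX EX 86400 (plain SETNX takes no expiry argument, so the combined SET form is needed). A nil reply means the key already existed → already dispatched → drop. At 58k events/sec × ~50 deliveries/event ≈ 2.9M ops/sec → ~45k/sec/node across a 64-node Redis cluster (well within ~100k ops/sec/core ceiling). Memory scales with rate × TTL: ~250M live keys × 50 bytes ≈ 12 GB fits easily, but a full 24h of keys at the rates above is far more than 250M, so size the TTL (or the cluster) from the measured delivery rate.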
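A minimal sketch of the dedup check follows. It relies on the `set(name, value, nx=True, ex=...)` call that redis-py exposes, which returns `True` when the key was created and `None` when it already existed. So that the example runs without a server, `InMemoryRedis` is a hypothetical stand-in implementing only that one call; `first_dispatch` is likewise an illustrative name.

```python
import time


class InMemoryRedis:
    """Tiny stand-in exposing only set(name, value, nx=..., ex=...)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}          # name -> (value, expires_at or None)
        self._clock = clock

    def set(self, name, value, nx=False, ex=None):
        now = self._clock()
        entry = self._data.get(name)
        live = entry is not None and (entry[1] is None or entry[1] > now)
        if nx and live:
            return None          # key exists: mirrors redis-py's reply
        expires = None if ex is None else now + ex
        self._data[name] = (value, expires)
        return True


def first_dispatch(client, event_id, user_id, ttl_s=86400):
    """True only for the first caller within the TTL window."""
    key = f"dedup:{event_id}:{user_id}"
    return client.set(key, 1, nx=True, ex=ttl_s) is True


client = InMemoryRedis()
print(first_dispatch(client, "event-abc", 42))   # first attempt
print(first_dispatch(client, "event-abc", 42))   # retry of the same delivery
```

With a real deployment the same `first_dispatch` function would be handed a `redis.Redis` client instead of the stand-in.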
Rate limit (sliding token bucket): ratelimit:push:{user_id} → { tokens, last_refill_ms }. Lua script atomically refills by elapsed time, decrements 1, returns allowed. Per-channel buckets independent (push 5/min, email 3/hr, SMS 1/hr).
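The refill-then-decrement logic can be sketched in plain Python. This version is single-process and takes the clock as an argument; in the design described above the same steps would run inside a Lua script so that they execute atomically in Redis. `TokenBucket` and its field names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float          # maximum burst size
    refill_per_sec: float    # steady-state rate
    tokens: float            # tokens currently available
    last_refill_ms: int      # timestamp of the last refill

    def allow(self, now_ms):
        """Refill by elapsed time, then try to consume one token."""
        elapsed_s = max(0, now_ms - self.last_refill_ms) / 1000.0
        self.tokens = min(self.capacity,
                          self.tokens + elapsed_s * self.refill_per_sec)
        self.last_refill_ms = now_ms
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Push channel: 5 per minute, starting with a full bucket.
push = TokenBucket(capacity=5, refill_per_sec=5 / 60,
                   tokens=5, last_refill_ms=0)

burst = [push.allow(now_ms=0) for _ in range(6)]
print(burst)
```

One bucket per (user, channel) keeps the limits independent, so exhausting the push budget does not block email.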
Preference cache: prefs:{user_id} → { push, email, sms, quiet_hours, tz, muted_types }. Read-through, TTL=60s. Unsubscribe API writes MySQL AND DEL prefs:{user_id} → next read is fresh.
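The read-through-plus-invalidate behaviour can be sketched as below. An in-process dictionary stands in for Redis and a callable stands in for the MySQL lookup; `PreferenceCache`, `load_from_db` and `invalidate` are hypothetical names chosen for the example.

```python
import time


class PreferenceCache:
    """Read-through cache with a TTL plus explicit invalidation."""

    def __init__(self, load_from_db, ttl_s=60, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_s
        self._clock = clock
        self._entries = {}            # user_id -> (prefs, expires_at)

    def get(self, user_id):
        now = self._clock()
        hit = self._entries.get(user_id)
        if hit is not None and hit[1] > now:
            return hit[0]
        prefs = self._load(user_id)   # miss or expired: read source of truth
        self._entries[user_id] = (prefs, now + self._ttl)
        return prefs

    def invalidate(self, user_id):
        """Counterpart of DEL prefs:{user_id} after a preference write."""
        self._entries.pop(user_id, None)


db = {42: {"push": True}}
cache = PreferenceCache(lambda uid: dict(db[uid]), ttl_s=60)

print(cache.get(42))          # loaded from the database
db[42] = {"push": False}      # user unsubscribes
cache.invalidate(42)          # the write path evicts the cached entry
print(cache.get(42))          # fresh value on the next read
```

Without the explicit invalidation, the stale value would be served until the TTL expired, which is the window the unsubscribe path needs to close.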
4.4. APNs/FCM wire protocol: HTTP/2 with HPACK and persistent connections
You don't just "POST to APNs": you maintain persistent HTTP/2 connections, multiplex up to ~1000 concurrent pushes through each connection, manage HPACK header compression, and handle GOAWAY for graceful connection lifecycle. Persistent connections are how senders amortize TLS setup and dramatically increase throughput.
APNs over HTTP/2
TCP connection [persistent, keep-alive]
└─ TLS 1.3 handshake (or session resumption) [client cert or JWT auth]
└─ HTTP/2 connection
├─ Stream 1: POST /3/device/{token1} payload1
├─ Stream 3: POST /3/device/{token2} payload2
├─ Stream 5: POST /3/device/{token3} payload3
├─ ... (up to ~1000 concurrent streams per connection)
└─ Stream 2001: POST /3/device/{tokenN} payloadN
Responses arrive interleaved on the same connection:
Stream 1: 200 OK
Stream 3: 410 Unregistered ← device gone, mark token dead
Stream 5: 429 Too Many Requests ← back off
Stream 2001: 200 OK
Frame-level mechanics:
- HEADERS frame: HPACK-compressed. After the first request on a connection, repeated headers like apns-topic: com.foo.app, apns-push-type: alert, authorization: bearer <JWT> cost only a few bytes each (an HPACK dynamic table index, not the full string). At scale this matters: 10k pushes × 200 bytes/header saved = 2 MB/connection/sec saved.
- DATA frame: the JSON payload (≤4KB for normal pushes, 5KB for the voip push type).
- WINDOW_UPDATE frame: HTTP/2 flow control. APNs advertises a window (e.g., 64KB initial); if your in-flight data exceeds it, you wait for WINDOW_UPDATE. Backpressure-aware clients respect this; aggressive ones cause TCP-level stalls.
- GOAWAY frame: APNs sends this when it wants to gracefully close, typically during Apple's edge deploys or when the connection has been alive "too long". Your client must drain in-flight streams and reconnect on a fresh connection. Failing to handle GOAWAY is a common bug: sends silently fail or hang.
- PING frame: HTTP/2 liveness check. Clients send periodically; a missing PING ACK (HTTP/2 has no separate PONG frame) → connection dead → reconnect.
Throughput math: why persistent connections matter
Cold connection (per-push reconnect):
TLS handshake = ~50-100ms
One push per handshake → ~10 pushes/sec/conn
Persistent connection with multiplexing:
~1000 concurrent streams per HTTP/2 connection
Each stream completes in ~80ms (Apple round-trip)
Throughput per connection ≈ 1000 streams / 0.08s = ~12.5k pushes/sec
Sustained throughput limited by APNs server-side ≈ 1000-2000 pushes/sec/conn
At 1.75M push/sec viral burst:
1.75M/sec ÷ 1000/sec/conn = 1,750 persistent APNs connections
Spread across ~50 dispatcher hosts, each holding 30-40 connections. Apple permits up to ~100 connections per provider certificate; you negotiate higher for large customers (Twitter, LinkedIn, Meta have higher caps).
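The sizing above is simple enough to reproduce as a worked calculation. The figures are the ones quoted in this section; the helper names `multiplexed_ceiling` and `connections_needed` are illustrative.

```python
import math


def multiplexed_ceiling(concurrent_streams, round_trip_ms):
    """Pushes/sec one connection could carry if only latency limited it."""
    return concurrent_streams * 1000 / round_trip_ms


def connections_needed(peak_pushes_per_sec, sustained_per_conn):
    """Connections required at the sustained server-side per-connection rate."""
    return math.ceil(peak_pushes_per_sec / sustained_per_conn)


ceiling = multiplexed_ceiling(concurrent_streams=1000, round_trip_ms=80)
conns = connections_needed(peak_pushes_per_sec=1_750_000,
                           sustained_per_conn=1000)
per_host = math.ceil(conns / 50)   # spread over ~50 dispatcher hosts

print(ceiling, conns, per_host)
```

The gap between the latency-only ceiling and the sustained rate is why the connection count is driven by the server-side limit, not by round-trip time.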
FCM specifics
FCM (Firebase Cloud Messaging) similarly speaks HTTP/2 but offers explicit batch endpoints:
POST https://fcm.googleapis.com/v1/projects/{pid}/messages:send (single)
POST https://fcm.googleapis.com/batch (batch, up to 500)
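Batch endpoint accepts a multipart body with up to 500 distinct messages and returns an array of responses in a single HTTPS round-trip. Google deprecated this batch endpoint in 2023 in favor of individual HTTP v1 sends multiplexed over HTTP/2, so check the current FCM documentation before depending on it. APNs has no equivalent batch endpoint; you rely on HTTP/2 stream multiplexing.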
Application-level batching tip: even on APNs, group pushes into ~500 streams per ~10ms flush window. This amortizes encryption and improves HPACK reuse.
Concrete walkthrough: one event → device, end-to-end
T+0ms: Producer (feed-service) publishes NotificationEvent to events.raw
partition (hashed by event_id).
T+1ms: Kafka durable: leader appends; 2 ISR followers replicate (acks=all, no per-message fsync).
T+5ms: Fanout-service consumes event, queries follower-graph store.
T+8ms: Fanout-service writes per-user deliveries to deliveries.queued
(partitioned by recipient user_id).
T+10ms: Dispatcher worker (handling partition for user 42) consumes.
T+11ms: Redis SETNX dedup:event-abc:42 → 1 (first time).
T+12ms: Redis GET prefs:42 → push=true, not in quiet hours.
T+13ms: Redis Lua rate-limit check → allowed; consume 1 token.
T+14ms: Cassandra INSERT into notifications_inbox; ACKs at QUORUM (~5ms).
T+19ms: MySQL SELECT device_tokens WHERE user_id=42 AND active=TRUE
→ 3 tokens (iOS, Android, web).
T+21ms: Dispatcher enqueues 3 channel-specific sends.
T+22ms: APNs router picks an idle HTTP/2 stream on its persistent
connection to api.push.apple.com.
T+23ms: HEADERS + DATA frame (TCP_NODELAY, ~1KB payload).
T+50ms: Apple's edge ACKs at TCP level, begins processing.
T+80ms: APNs returns 200 OK on the stream.
T+81ms: Dispatcher updates delivery_status → DELIVERED (channel=APNS).
T+200ms: Apple forwards to user's phone over persistent push channel.
T+300ms: Phone displays banner.
p50 event→banner: ~300ms.
p99: ~3-5s (Apple's queue introduces delay; FCM similar).
Internal infra latency (T+0 to T+80): ~80ms. Apple-side: ~220ms. We have no control over Apple's share, so internal speed-ups yield diminishing returns once the Apple leg dominates.
§5. Capacity envelope: small to giant
The same architecture spans 6+ orders of magnitude depending on scale. The substrate (queue → fanout → dispatcher → channel routers) is the same; the sizing isn't.
| Scale | Throughput | Footprint | Real example |
|---|---|---|---|
| Tiny | ~5 notif/sec | Single SES account + AWS SNS. No queue. Sync send. Few hundred users. | Indie SaaS, side-project. |
| Small | ~500 notif/sec | One Postgres (preferences + inbox), one Redis (dedup), SES + Twilio. ~10k DAU. | Early startup, B2B SaaS notifications. |
| Mid | ~10k notif/sec | Kafka cluster (5 brokers), Cassandra inbox (10 nodes), Redis cluster (8 nodes), 100s of persistent APNs/FCM connections. Slack-shape. ~10M users. | Slack notifications, Discord (early-mid), Notion. |
| Large | ~500k notif/sec sustained, ~5M peak | Kafka 50+ brokers, Cassandra 100+ nodes, multi-region. Hybrid fanout (celebrities). Twitter 300k followers/celebrity, Facebook 10B/day, LinkedIn 1B users. | Twitter, LinkedIn, Facebook, Instagram, Snap. |
| Giant | 1M+ msgs/sec sustained, ~10M+ peak | Per-region Cassandra, dedicated DC fleet, custom hardware for persistent connections, Erlang VMs holding millions of long-lived sockets. | WhatsApp (100B+ msgs/day), iMessage, WeChat. |
At each tier the architecture is the same shape but the next bottleneck moves:
- Tiny/Small → bottleneck is producer rate.
- Mid → bottleneck is the single relational inbox (Postgres/MySQL); move to Cassandra.
- Large → bottleneck is celebrity write storms; move to hybrid fanout-on-write/read.
- Giant → bottleneck is persistent connection management (millions of sockets); custom socket-tier in Erlang/Go.
§6. Architecture in context (canonical pattern)
EVENT PRODUCERS (payments, feed, identity, chat, monitoring, marketing, ...)
│ publishes NotificationEvent
▼
┌──────────────────────────┐
│ Kafka: events.raw │ partitioned by event_id, retention 7d, repl=3
└────────────┬─────────────┘
▼
┌───────────────────────────────────┐
│ FANOUT SERVICE (stateless) │
│ - lookup recipient set │
│ - apply visibility / mute rules │
│ - branch hot path vs cold path │
└──┬──────────────────────────┬─────┘
│ normal: fanout-on-write │ celebrity: fanout-on-read
▼ ▼
┌─────────────────────────────┐ ┌─────────────────────────────┐
│ Kafka: deliveries.queued │ │ No fanout write │
│ partitioned by user_id │ │ Recipients query celebrity │
│ │ │ timeline at read-time │
└────────────┬────────────────┘ └─────────────────────────────┘
▼
┌────────────────────────────────────────────────────────────────┐
│ DISPATCHER WORKER POOL (stateless, autoscaling) │
│ consume by user_id; per-user serial processing │
│ for each delivery: │
│ 1. dedup (Redis SETNX, TTL 24h) │
│ 2. preference (Redis cache, MySQL fallback) │
│ 3. quiet-hours (user's IANA TZ) │
│ 4. rate-limit (Redis token bucket, multi-layer) │
│ 5. template render (locale + personalization) │
│ 6. write inbox (Cassandra UPSERT by user_id) │
│ 7. enqueue channel sends │
└──┬──────────┬──────────┬──────────┬────────────────────────────┘
│ HTTP/2 │ HTTPS │ HTTPS │ HTTPS
▼ ▼ ▼ ▼
APNs FCM SES/ Twilio/
(Apple) (Google) SendGrid MessageBird
SUPPORTING STORES:
Cassandra: notifications_inbox (PK=user_id, CK=created_at DESC)
Cassandra: delivery_status (PK=event_id, retention=90d)
MySQL: user_preferences, device_tokens (sharded by user_id)
Redis: dedup, rate-limit, preference cache
S3: long-term audit (compacted from delivery_status)
FEEDBACK LOOP:
APNs/FCM 410/404 → token dead
SES SNS bounce/complaint webhooks
Twilio delivery callbacks
↓ Kafka: feedback.events
Feedback processor → device_tokens (deactivate)
→ user_preferences (mark email undeliverable)
Partition keys, repeated explicitly:
- events.raw: by event_id → uniform producer load.
- deliveries.queued: by user_id → dispatcher worker can serialize per-user, batch Redis hits.
- notifications_inbox: user_id partition, created_at_ms clustering DESC.
- delivery_status: by event_id → "show me everything that happened for event X".
- device_tokens: by user_id.
This is the canonical pattern, not one product's full system. Slack, LinkedIn, Twitter, Discord, Stripe all use variations of this same shape.
§7. Hard problems inherent to this technology
Problem 1: Celebrity fanout (one source, 100M recipients)
Appears in social (celebrity post), marketing (broadcast to all users), alerting (incident paging every on-call), transactional (regulatory notice to every user in a region).
Naïve fix: fanout-on-write to all 100M. Synchronous = post hangs for hours. Async via Kafka = queue backlogs, normal users wait.
Why it breaks: source publishes at T=0; fanout enqueues 100M deliveries. At 1M deliveries/sec system-wide = 100s of pure source load. Dispatcher pool sized for 200k/sec = 500s (~8 min). Normal users' events stuck behind in the same Kafka partition (if partitioned naively) or starved of dispatcher capacity.
Real fix: hybrid fanout-on-write + fanout-on-read with a source threshold (see §3). When source.recipient_count > 10k, skip inbox pre-population; recipients query source's recent timeline at read time and merge.
For push notifications, sample: engagement scoring narrows the push audience to recipients who actually engage with this source (1-10% of total). Per-user notification budgets (≤5 push/day) trim further. Net: a celebrity post pushes to ~1M, not 100M.
For marketing broadcast, shard over time: drip 100M emails over 30 minutes (~55k/sec) via a delay queue. This protects SES/SendGrid quotas and avoids overwhelming receiving mail servers.
Problem 2: Exactly-once vs at-least-once
Appears in payments (don't double-send receipts), 2FA (don't send two codes), alerting (don't page on-call twice), social (don't double-notify "Alice replied").
Naïve fix: distributed locking or 2PC (Two-Phase Commit) between dispatcher and APNs.
Why it breaks: APNs is external; cannot participate in 2PC. APNs returns 200 → maybe delivered, maybe APNs crashed. Even if APNs ACKs at T=80ms, dispatcher could crash at T=81 before recording DELIVERED — on restart, reprocess from Kafka offset, dispatch again, two pushes.
Real fix: at-least-once delivery, idempotency at the device level.
- Device-side dedup via apns-collapse-id / FCM collapse_key: pushes with same collapse ID supersede; user sees one banner.
- Inbox idempotency by event_id (Cassandra UPSERT): row written once regardless of retries.
- For transactional events (password reset, payment receipts), producer attaches idempotency_key; same key within 24h returns the same delivery. This mirrors the Idempotency-Key pattern in Stripe's API.
For most non-transactional notifications, at-least-once + device-side dedup is correct; exactly-once is the wrong goal.
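Inbox idempotency depends on a retry producing the same primary key as the first attempt. One way to get that, sketched here under stated assumptions, is to derive the notification ID deterministically from the event and recipient with a name-based UUID, and to take the timestamp from the event itself. `INBOX_NAMESPACE`, `inbox_key` and `write_inbox` are hypothetical names, and a dictionary stands in for the Cassandra table.

```python
import uuid

# Hypothetical fixed namespace; any UUID works as long as it never changes.
INBOX_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "inbox.example.invalid")


def inbox_key(user_id, event_id, event_ts_ms):
    """Primary key that is identical on every retry of the same delivery."""
    notification_id = uuid.uuid5(INBOX_NAMESPACE, f"{event_id}:{user_id}")
    return (user_id, event_ts_ms, notification_id)


def write_inbox(store, user_id, event_id, event_ts_ms, payload):
    """Upsert: writing the same key twice leaves exactly one row."""
    store[inbox_key(user_id, event_id, event_ts_ms)] = payload


store = {}
write_inbox(store, 42, "event-abc", 1_700_000_000_000, b"payload")
write_inbox(store, 42, "event-abc", 1_700_000_000_000, b"payload")  # retry
print(len(store))
```

A random UUID or a dispatch-time timestamp would instead create a second row on every retry, defeating the upsert.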
Problem 3: Cross-channel ordering
Appears in chat (1→3→7 unread; user must see "7", not "1"), alerting (alert open → resolved; user must not see "resolved" first), e-commerce (placed → shipped → delivered).
Naïve fix: rely on Kafka partition ordering by user_id.
Why it breaks: T=10 "1 new message" and T=12 "3 new messages" both in same partition. Fanout reads in order; dispatcher reads in order. But dispatcher splits each into push/email/in-app. Push #1 queued behind 50 others on connection A; push #2 lands on idle connection B. Phone receives "3" at T=15, "1" at T=17.
Real fix:
1. Deterministic per-(user_id, channel) shard inside the dispatcher so all pushes for user 42 go through the same APNs router — FIFO per user.
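2. apns-collapse-id / collapse_key: a push with the same collapse ID replaces the earlier one on the device, so the user sees one banner. Replacement follows arrival order, so this relies on step 1's per-user FIFO for the surviving banner to be the latest state.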
3. Inbox Cassandra clustering by created_at_ms guarantees chronological order regardless of dispatch parallelism, provided each row carries the event's timestamp, not dispatcher's write time.
App-layer per-user serialization + device collapse_id + storage event-time ordering = effective in-channel ordering.
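The deterministic per-(user_id, channel) shard can be sketched with a stable hash. The function name `router_index` and the router count are assumptions for illustration; the point is that the mapping must be identical on every dispatcher host.

```python
import hashlib


def router_index(user_id, channel, num_routers):
    """Stable shard for a (user, channel) pair.

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and would route differently on each host.
    """
    digest = hashlib.sha256(f"{user_id}:{channel}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_routers


first = router_index(42, "push", num_routers=16)
second = router_index(42, "push", num_routers=16)
print(first == second)   # same user and channel always map to one router
```

Plain modulo reshuffles most users when `num_routers` changes; a deployment that resizes often might prefer consistent hashing, which is outside the scope of this sketch.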
Problem 4: Multi-layer rate limiting and spam prevention
Appears in marketing (campaign quota), transactional (don't blast 50 receipts for 50 retries), alerting (alert flapping → suppression), social (mass-tag spam).
Naïve fix: trust producers to rate-limit themselves.
Why it breaks: producer bug retries in a loop, sends 10k events for one user. Each → ~5 notifications. User spammed, uninstalls.
Real fix: multi-layer rate limit at the dispatcher.
Layer 1: per-user, per-channel: 5 push/min, 10 push/hr
Layer 2: per-user, per-type: 3 NEW_FOLLOWER/day
Layer 3: per-user global cap: 20 notif/day across channels
Layer 4: per-campaign kill switch (operator-controlled)
Layer 5: provider circuit breaker (APNs/FCM err > 5% → open)
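Layers 1-3 = Redis token buckets; a Lua script atomically checks-and-decrements. Layers 4-5 are an operator flag and a circuit breaker, checked before send. Rejection records the layer (e.g., RATE_LIMITED_LAYER_2) in delivery_status.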
Deeper anti-spam: rolling notification quality score — high open-rate types count cheaper; low-engagement throttled harder. Feedback from feedback.events → preference scoring.
Problem 5: Time-zone-aware quiet hours
Appears in marketing (don't email at 3am local), social ("liked your post" deferred while chat always allowed), alerting (low-severity deferred to on-call's quiet hours).
Naïve fix: store quiet hours as UTC offsets; check now_utc.
Why it breaks: user in Tokyo (UTC+9), quiet hours "22:00-08:00 local". Storing as UTC = "13:00-23:00". User flies to NY (UTC-5); same local hours, but UTC now "03:00-13:00". UTC representation requires updating on travel.
Real fix:
- Store TZ as IANA name (Asia/Tokyo). App SDK refreshes on device-tz change.
- Store quiet hours as local time ({start_local: "22:00", end_local: "08:00"}).
- Dispatcher computes now_in_user_tz. If in window: schedule for end via delay queue (Redis ZSET, scheduler polls due items, re-enqueues).
DST (Daylight Saving Time) handled by the IANA TZ database. Edge case: traveler crossing zones may briefly experience inconsistencies; accept this.
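The quiet-hours check can be sketched with the standard-library `zoneinfo` module, which reads the IANA database. The function `next_send_time` and its parameters are hypothetical; it returns the current instant when sending is allowed, or the UTC instant at which the quiet window ends.

```python
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo


def next_send_time(now_utc, tz_name, start_local, end_local):
    """Return now_utc if allowed, else the UTC instant quiet hours end."""
    tz = ZoneInfo(tz_name)
    local = now_utc.astimezone(tz)
    t = local.time()
    if start_local <= end_local:
        quiet = start_local <= t < end_local
        end_date = local.date()
    else:
        # Window wraps past midnight, e.g. 22:00-08:00.
        quiet = t >= start_local or t < end_local
        after_start = t >= start_local
        end_date = local.date() + timedelta(days=1) if after_start else local.date()
    if not quiet:
        return now_utc
    end = datetime.combine(end_date, end_local, tzinfo=tz)
    return end.astimezone(timezone.utc)


now = datetime(2025, 1, 15, 14, 0, tzinfo=timezone.utc)   # 23:00 in Tokyo
send_at = next_send_time(now, "Asia/Tokyo", time(22, 0), time(8, 0))
print(send_at.isoformat())
```

The returned instant is what would be used as the score when parking the delivery in the delay queue.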
Problem 6: Unsubscribe propagation latency
Appears in marketing (GDPR/CAN-SPAM legal violation), transactional (user opts out of receipts), social (user mutes thread).
Naïve fix: write to MySQL, let cache TTL (60s) catch up.
Why it breaks: user unsubscribes at T=0. Notification fires at T=10; dispatcher reads stale cache; sends. For regulated channels, legal violation.
Real fix:
1. Unsubscribe API writes MySQL AND DEL prefs:{user_id} in Redis. Next read fresh.
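2. Publish unsubscribe.events to Kafka. Fanout-side consumer updates an in-memory bloom filter of unsubscribed users; the system stops creating deliveries for that user. Bloom filters can return false positives, so confirm a hit against the preference store before suppressing.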
3. For in-flight in deliveries.queued: dispatcher re-checks preferences at dispatch time after the Redis DEL.
Combined: median propagation <500ms; worst case <5s.
§8. Failure mode walkthrough
8.1. Dispatcher crashes mid-processing
Dispatcher consumed event from Kafka, wrote inbox to Cassandra, ENQUEUED push to APNs router, then crashed before committing Kafka offset.
Recovery:
- Kafka consumer restarts from last committed offset.
- Reprocesses event.
- Inbox INSERT: idempotent on (user_id, event_id) → no double row.
- Push to APNs: redispatched; APNs receives duplicate; apns-collapse-id suppresses second banner on device.
- delivery_status UPSERT; end state DELIVERED.
Durability point: Kafka offset. Commit only after Cassandra write AND APNs send return success. Commit too early = lose work; too late = duplicate work (at-least-once contract handles this).
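A toy model of this recovery, assuming a Python list in place of a Kafka partition and a dict in place of the Cassandra inbox (class and field names are hypothetical):

```python
class CrashBeforeCommit(Exception):
    """Simulates the dispatcher dying after side effects, before the commit."""


class Dispatcher:
    def __init__(self, log):
        self.log = log            # list of (user_id, event_id, body)
        self.committed = 0        # last committed offset
        self.inbox = {}           # (user_id, event_id) -> body
        self.push_attempts = []   # every push sent, duplicates included

    def run(self, crash_at=None):
        offset = self.committed   # restart from the last committed offset
        while offset < len(self.log):
            user_id, event_id, body = self.log[offset]
            # Idempotent inbox insert: a replay leaves the existing row alone.
            self.inbox.setdefault((user_id, event_id), body)
            # The push is re-sent on replay; the device collapses it by id.
            self.push_attempts.append((user_id, event_id))
            if crash_at == offset:
                raise CrashBeforeCommit(offset)
            offset += 1
            self.committed = offset   # commit only after both side effects
```

After a crash at offset 1 and a restart, the inbox holds one row per event while the push list shows the second event twice, which is the at-least-once behaviour the collapse id is there to absorb.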
8.2. APNs throttling / unavailable
APNs returns 429 / 503. Circuit breaker on APNs router opens if 429/503 rate > 5% over 30s. Open: stop sending; HTTP/2 connections idle but open. Half-open after 30s: 10 test pushes; if 9/10 succeed → close.
While open: pushes queue in deliveries.apns.retry. After recovery, drain at 20% capacity, ramp up. Inbox already written; user sees on next app open. If push was the only channel and freshness window exceeded (e.g., chat >1h), escalate to email/SMS if opted in.
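The breaker described above can be sketched as a small state machine. The thresholds are the ones from the text; min_samples is an added assumption so a single early failure does not trip the breaker, and in-flight probes are not tracked in this simplified version.

```python
from collections import deque


class CircuitBreaker:
    """closed -> open on high error rate; open -> half_open after a cooldown;
    half_open -> closed if enough probe sends succeed, else back to open."""

    def __init__(self, threshold=0.05, window_s=30.0, cooldown_s=30.0,
                 probes=10, probes_ok=9, min_samples=20):
        self.threshold = threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.probes = probes
        self.probes_ok = probes_ok
        self.min_samples = min_samples   # assumption, not from the text
        self.state = "closed"
        self.events = deque()            # (timestamp, ok) inside the window
        self.opened_at = 0.0
        self.probe_results = []

    def allow(self, now):
        """Should the router send a push right now?"""
        if self.state == "open" and now - self.opened_at >= self.cooldown_s:
            self.state = "half_open"
            self.probe_results = []
        if self.state == "half_open":
            return len(self.probe_results) < self.probes
        return self.state == "closed"

    def record(self, now, ok):
        """Report the outcome of one send (ok=False for a 429 or 503)."""
        if self.state == "open":
            return
        if self.state == "half_open":
            self.probe_results.append(ok)
            if len(self.probe_results) >= self.probes:
                if sum(self.probe_results) >= self.probes_ok:
                    self.state = "closed"
                    self.events.clear()
                else:
                    self.state, self.opened_at = "open", now
            return
        self.events.append((now, ok))
        # Drop outcomes that have aged out of the sliding window.
        while self.events and self.events[0][0] <= now - self.window_s:
            self.events.popleft()
        failures = sum(1 for _, good in self.events if not good)
        if (len(self.events) >= self.min_samples
                and failures / len(self.events) > self.threshold):
            self.state, self.opened_at = "open", now
```

The clock is passed in rather than read from the system so the transitions can be exercised deterministically; a production router would also need the throttled drain after closing, which is omitted here.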
8.3. Device token rotation
User reinstalls app → old token dead. APNs returns 410 Gone; FCM returns 404 (UNREGISTERED) on the HTTP v1 API, or 200 with error NotRegistered on the legacy API.
Dispatcher publishes feedback.events: { TOKEN_DEAD, ... }. Feedback processor: UPDATE device_tokens SET active=FALSE WHERE token=?. Next push for that user: active=TRUE filter excludes it. Weekly cleanup hard-deletes tokens inactive >30d. Reinstalled app registers new token via POST /devices on first launch.
8.4. Bad email but good push
SES bounce → SNS webhook → feedback.events: { EMAIL_BOUNCE, hard }. UPDATE user_emails SET bounced=TRUE; preference cache evicted; future notifications skip email; push and in-app continue. In-app banner asks user to update.
Soft bounces (mailbox full): retry with backoff (1h, 4h, 24h); 3 soft → treat as hard.
8.5. Cassandra node death
Cluster continues at effective RF=2 for partitions on that node. Operator runs nodetool removenode, adds fresh node. New node bootstraps: gossips, gets token range, streams from surviving 2 replicas (~hours for 1TB). During bootstrap, writes go to new node AND old replicas; reads use old replicas until bootstrap completes.
Durability point: RF=3 with QUORUM. Losing 1 of 3 ≠ losing data.
8.6. Kafka broker death / leadership change
Standard Kafka recovery: ISR (in-sync replica) promoted; producers re-discover leader; consumers continue from last committed offset. Producers idempotent via event_id; a retry after leader change does not double-create (dedup layer catches it).
8.7. Network partition (split-brain)
Cassandra is AP. 3-replica partition splits 2+1; majority side serves QUORUM reads/writes; minority refuses writes. On heal: hinted handoff + read-repair.
Kafka uses controller quorum (KRaft/ZooKeeper). Minority brokers lose leadership; producers redirect to majority. No split-brain.
Dispatcher pods on the minority side stall as local Kafka loses leadership. No double-delivery thanks to dedup.
§9. Why not sync in-handler fanout?
The naïve alternative: when the producer service handles its request (Alice posts), the same handler synchronously calls APNs/SES/Twilio for every recipient before returning.
- T=0: Alice POSTs a status update.
- T=1ms: feed-service inserts post. Resolves 5,000 followers, 3 devices each.
- T=2ms: Handler issues 15,000 HTTP/2 streams to APNs. Each ~100ms p99.
- With 1000 concurrent → 15 batches × 100ms = 1500ms synchronous wait. Alice sees a 1.5s spinner for what should be a 50ms post. If APNs hiccups, her post times out at 5s.
Celebrity case: @cristiano with 100M followers → 100,000 batches × 100ms = 10,000s = 2.8 hours. Tweet "didn't post" (timeout).
The naïve version conflates the write path with the fanout path:
- Write path: synchronous, ms-budget, only writes Alice's post + emits an event to Kafka.
- Fanout path: asynchronous, seconds-to-minutes budget, handles N-recipient expansion offline.
It also has no rate limiting, dedup, retry, quiet hours, preference check, inbox persistence. Every one of these is forced into the architecture once fanout is off the critical path. Same logic for transactional 1:1: even a payment receipt should not put SES latency on the payment API. Decouple producer from delivery via a queue, always.
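The arithmetic behind the two timelines above, as a small sketch (the function name is hypothetical; it models the handler as issuing calls in fixed-size batches that each take the p99 latency):

```python
import math


def sync_fanout_wait_ms(recipients: int, devices_per_recipient: int,
                        concurrency: int, per_call_ms: int) -> int:
    """Time the request handler spends blocked if it fans out synchronously."""
    calls = recipients * devices_per_recipient
    batches = math.ceil(calls / concurrency)
    return batches * per_call_ms


# Alice: 5,000 followers x 3 devices, 1000 concurrent streams, 100 ms each.
alice_ms = sync_fanout_wait_ms(5_000, 3, 1_000, 100)

# Celebrity: 100M followers, counted as one device each as in the text.
celebrity_ms = sync_fanout_wait_ms(100_000_000, 1, 1_000, 100)
celebrity_hours = celebrity_ms / 1000 / 3600
```

With three devices per follower the celebrity number would triple, which only strengthens the conclusion.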
§10. Scaling axes
Type 1: uniform expansion (more users, same rate per user)
| Scale | Topology |
|---|---|
| 1M users (~5k/sec) | Single MySQL + read replicas; 3-broker Kafka; single Redis. |
| 10M (~50k/sec) | Shard MySQL (4); Kafka 20 brokers; Redis cluster (8); introduce Cassandra inbox. |
| 100M (~500k/sec) | Cassandra 64 nodes; Kafka 100 brokers; Redis 32; ~500 APNs/FCM conns; multi-region. |
| 1B (~5M/sec) | Per-region Cassandra with async cross-region replication; Kafka MirrorMaker; per-region dispatcher pool. |
Inflection points:
- ~10M: shard MySQL preference table.
- ~50M: move inbox off MySQL onto Cassandra/LSM.
- ~200M: per-region deployment (APAC↔US latency too high for single-region fanout).
- ~500M: introduce hybrid fanout because follower distribution has heavy-tail outliers.
Type 2: hotspot intensification (same entities, more rate per entity)
The harder axis — celebrity going viral, campaign blasting all users, alert storm.
| Scale | Topology change |
|---|---|
| Max fanout 10k | Fanout-on-write universal. |
| 100k | Add per-partition load shedding so high-fanout event doesn't starve same Kafka partition. |
| 1M | Celebrity threshold + hybrid fanout. Above-threshold = fanout-on-read for inbox; push limited to engaged recipients. |
| 10M | Engagement sampling for push (top-K engaged based on past interaction). |
| 100M (Cristiano scale) | Engagement sampling + per-user budget enforcement. Even if 10M "want" the push, per-user-per-day cap ≤5 means most never receive this specific event. |
| 1B (hypothetical) | Shard source's own timeline by (source_id, hour_bucket) so reading last 24h doesn't hit one Cassandra partition. |
Inflection points: 10k = celebrity threshold (the Twitter moment, ~2010-2012); 1M = engagement sampling; 100M = source-timeline sharding.
Type 2 is harder because it doesn't show in averages. p50 fine; only the celebrity's post or their followers' inbox loads show pain. Need percentile-by-source-size monitoring.
§11. Decision matrix
Fanout pattern
| Approach | Pros | Cons | Pick when |
|---|---|---|---|
| Fanout-on-write | Read O(1). Predictable inbox latency. Scales horizontally with replica count. | Write O(F). Celebrity write storms. Storage = F × content. | Median fanout < 10k; transactional 1:1; read latency tight. |
| Fanout-on-read | Write O(1). Celebrity-friendly. Storage 1×. | Read O(F'). Cache misses cascade. Inbox latency variable. | Write-heavy; long-tail fanout; tolerable read latency. |
| Hybrid | Best of both. | Two code paths + merge logic; threshold tuning. | Any non-trivial social/multi-source system. Default pick. |
| Polling-only | Trivial backend. No APNs/FCM dependency. | Bad UX. Server load ∝ active user count. | Internal admin tools only. |
Channel choice (per notification type)
| Channel | Latency | Cost / send | Reliability | Reach | Pick for |
|---|---|---|---|---|---|
| Push (APNs/FCM) | ~300ms | ~free | ~95% (device may be off) | App-installed users | Real-time engagement, chat, alerts. |
| In-app inbox | App-open | ~free | 100% | App-installed users | Source of truth; user can review later. |
| Email (SES/SendGrid) | seconds-minutes | ~$0.0001 | ~99% (clean senders) | Universal | Long-form; transactional receipts; marketing. |
| SMS (Twilio) | seconds | ~$0.01 | ~95-99% | Universal | 2FA codes, high-urgency, OTP. |
| Voice call (Twilio) | seconds | ~$0.02 | Variable | Universal | Critical alerts (PagerDuty escalation). |
| Web push | seconds | ~free | ~80% (browser caveats) | Web users | Browser engagement when app isn't installed. |
| Chat-app delivery (WhatsApp Business, iMessage Business) | seconds | ~$0.01-0.05 | High | Region-dependent | Where SMS is expensive/blocked. |
Build vs buy
| Stage | Pick | Reason |
|---|---|---|
| < 10 notif/sec | Buy (OneSignal, Iterable, Knock, Courier) | Fanout problem doesn't exist yet. |
| 10–1k notif/sec | Mostly buy, build glue | Managed platform handles channels; you write preference + dedup. |
| 1k–100k notif/sec | Build platform, rent channels | Platform owns fanout, dedup, scheduling. APNs mandatory; SES/Twilio replaceable. |
| 100k+ notif/sec | Build platform, rent channels with failover | Add SES + SendGrid redundancy; Twilio + MessageBird for SMS. |
| 1M+ notif/sec | Build everything custom; persistent-connection tier in Erlang/Go | WhatsApp scale; vendor pricing dominates. |
Defensible answer: at most companies, build the platform, rent the channels. APNs is irreducible (Apple controls the device); SES/Twilio are replaceable adapters.
§12. Use case gallery (different domains, same tech)
Social feed/notifications (Twitter, Facebook, LinkedIn, Instagram, Snap): high-fanout, skewed follower distribution. Push-heavy with in-app inbox as source of truth. Engagement filtering critical (~1% of candidate notifications actually delivered). Hybrid fanout-on-write/read indispensable above ~10k followers. Push collapse_id used for "you have N new likes" coalescing.
Chat (WhatsApp, Slack, Discord, iMessage): persistent connection for in-app; APNs/FCM for offline. Per-message push for 1:1; per-thread for groups. Coalescing critical ("12 new messages in 'team-android'", not 12 banners). WhatsApp runs Erlang VMs holding millions of persistent sockets; the notification system is per-recipient routing on top. Slack/Discord support per-channel mute, do-not-disturb, keyword highlights — enforced at the dispatcher.
Transactional emails (Stripe receipts, GitHub PR notifications, AWS billing): 1:1 fanout. Idempotency mandatory (don't double-send a $100 receipt). Email-heavy. Webhook callbacks (Stripe Connect) hang off the same pipeline. SES + SendGrid as redundant providers. Audit retention: 7 years for financial.
Marketing campaigns (Mailchimp, Iterable, Braze, Klaviyo): multi-million-recipient broadcast. Sharded over time (drip over 30 min, not instant blast). Per-recipient personalization. CAN-SPAM/GDPR unsubscribe propagation. A/B testing on top. Multi-channel orchestration (email → SMS reminder if no open in 24h).
Alert delivery (PagerDuty, Opsgenie, VictorOps, Datadog): low-volume, high-urgency, on-call routing. Escalation policies (SMS → call → manager after 5 min). Multi-channel mandatory (don't rely on push alone for "production down at 3am"). Acknowledgment loop ("ACK" via SMS reply cancels the page). Quiet hours bypassed for sev-1.
2FA SMS (Twilio Verify, Authy, Okta): 1:1 code delivery. Sub-second latency required (users abandon if >30s). Idempotency at the request level (don't send 3 codes if user clicks "send code" 3 times — rate-limit and return cached code). Per-region carrier-quality routing.
In-game notifications (Riot, Epic, Blizzard): mix of in-app (player got an item) and push (your tournament starts in 5 min). Real-time persistent connection for in-game events; APNs/FCM for offline. Per-tournament fanout. Heavy use of collapse_id for "5 new gifts" rollup.
E-commerce order updates (Amazon, Shopify merchants, DoorDash): multi-stage transactional (placed → confirmed → shipped → out-for-delivery → delivered). Cross-channel orchestration (email receipt, push "package nearby", SMS "delivery attempt failed"). Idempotency on order_id.
§13. Real-world implementations with numbers
- Twitter Timeline fanout: flipped from fanout-on-write to hybrid around 2010-2012 after celebrity accounts (Lady Gaga, Justin Bieber crossing 10M+) caused write storms. Documented in Twitter eng blog and "Designing Data-Intensive Applications" Ch. 1. Current threshold reportedly ~1M followers above which fanout-on-read for inbox.
- Facebook notifications: ~10B+ notifications/day across push, email, in-app — ~115k notif/sec sustained, peaks 5-10x. Internal service "Notify". Multi-tier fanout with engagement filtering — only ~1-5% of candidate notifications actually delivered. Cassandra-based inbox.
- LinkedIn ATC (Air Traffic Controller): 1B+ users; 5-50B notification deliveries/day. Kafka → dispatcher (preference + rate-limit) → APNs/FCM/SES. Inbox stored in Espresso (LinkedIn's sharded distributed DB on MySQL). Documented in LinkedIn engineering blog "Air Traffic Controller for notifications".
- WhatsApp: 100B+ messages/day = ~1.16M msgs/sec sustained, multi-million/sec peak. Each potentially a push. Custom Erlang VMs handle millions of persistent connections; APNs/FCM for offline.
- Snap: ~10B snaps/day at peak. Pipeline on AWS (DynamoDB + SQS + SNS), custom dedup and engagement filtering.
- Discord: ~4B+ events/day. One of the largest Cassandra deployments — "How Discord Stores Trillions of Messages". Per-user partition with rolling time-bucket clustering.
- Slack: Kafka + custom dispatcher + SES/APNs/FCM. ~25M DAU. Custom mute/DND/keyword logic at the dispatcher.
- PagerDuty: hundreds of millions of alerts/year. SMS + voice + push + email + escalation. <10s p99 alert-to-page. Twilio for SMS/voice; APNs/FCM for push.
- Mailchimp: ~1.5B marketing emails/day. Custom SMTP fleet (SES insufficient at this scale) plus reputation-management infrastructure. Multi-stage sending (warm-up, ramp).
- Stripe: ~10M+ transactional emails/day. SES + SendGrid redundancy. Idempotency mandatory on every receipt.
§14. Email deliverability deep dive
Sending an email through SES (Simple Email Service) or SendGrid is the easy part. Getting that email to land in the recipient's inbox instead of the spam folder — at scale, sustainably, across Gmail / Outlook / Yahoo Mail / corporate mail servers — is the hard part. This is the deliverability discipline, and it's invisible to most engineers until 90% of a campaign lands in spam.
The substrate rests on three authentication standards, layered: SPF, DKIM, DMARC. Then on top of that: sender reputation (per-IP, per-domain), feedback loops, and warm-up regimens. Without all of these, a high-volume sender is treated like a spammer by default.
14.1. SPF — Sender Policy Framework
SPF is a DNS TXT record on your sending domain that lists which IPs are allowed to send mail purporting to be from that domain. A receiving MTA (Mail Transfer Agent) checks: the connecting SMTP server's IP is 1.2.3.4; the MAIL FROM envelope is bounce@example.com; the receiver looks up example.com TXT and sees:
example.com. IN TXT "v=spf1 ip4:54.240.0.0/18 include:_spf.mailprovider.com -all"
-all (hardfail) means "if the connecting IP isn't in this list, reject." ~all (softfail) means "accept but mark suspicious." Most large senders use -all because softfail is treated as spam-leaning by Gmail anyway.
The footgun: SPF only validates the envelope MAIL FROM (the bounce address), not the header From: the user sees. A spammer can pass SPF for bounce.spammer.com while showing From: ceo@yourbank.com in the visible header — SPF doesn't protect against that. This is why SPF alone isn't enough; you need DKIM and DMARC layered on top.
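A deliberately partial sketch of the receiver-side evaluation, using only Python's ipaddress module. It handles ip4:/ip6: mechanisms and the trailing all; include:, a, mx and redirect= need DNS lookups and are skipped, so this is an illustration of the matching logic rather than a conforming SPF evaluator.

```python
import ipaddress

QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def spf_check(record: str, client_ip: str) -> str:
    """Evaluate ip4/ip6 mechanisms and 'all' left to right; first match wins."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"
    ip = ipaddress.ip_address(client_ip)
    for term in terms[1:]:
        qualifier = "+"                    # default qualifier is pass
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        mechanism, _, arg = term.partition(":")
        mechanism = mechanism.lower()
        if mechanism in ("ip4", "ip6"):
            network = ipaddress.ip_network(arg, strict=False)
            if ip.version == network.version and ip in network:
                return QUALIFIERS[qualifier]
        elif mechanism == "all":
            return QUALIFIERS[qualifier]
        # Other mechanisms (include, a, mx, exists) are ignored in this sketch.
    return "neutral"


RECORD = "v=spf1 ip4:54.240.0.0/18 include:_spf.mailprovider.com -all"
```

With the record from the text, an address inside 54.240.0.0/18 passes and 1.2.3.4 falls through to -all and fails. Note that the input is the envelope MAIL FROM domain's record; nothing here looks at the visible From: header.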
14.2. DKIM — DomainKeys Identified Mail
DKIM signs the email body + selected headers with a private key. The recipient fetches the public key from DNS at selector._domainkey.example.com and verifies the signature. If the body or signed headers were modified in transit, the signature fails.
Wire mechanics: the outgoing MTA inserts a DKIM-Signature: header:
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=example.com; s=s2025;
h=from:to:subject:date:message-id;
bh=MTIzNDU2Nzg5MGFiY2RlZg==;
b=Dw1z8e7K6L3M2N1O5P4Q9R8S7T6U5V4W3X2Y1Z0A9B8C7D6E5F4G3H2I1J0...
- d= signing domain
- s= selector; lets you rotate keys (s2025 this year, s2026 next year) without breaking older messages
- h= headers covered
- bh= body hash
- b= signature
Verifier looks up s2025._domainkey.example.com TXT → public RSA key (or Ed25519 for newer setups). RFC 8463 allows Ed25519-SHA256.
Footgun: relaxed canonicalization (c=relaxed/relaxed) is the usual choice because mail servers fiddle with whitespace; simple/simple (the RFC 6376 default when c= is omitted) breaks under most relays. The body hash is computed over the canonicalized body exactly as transmitted, after any Content-Transfer-Encoding has been applied and with CRLF line endings; getting this wrong = "I see no problem, but Gmail rejects every message."
Key rotation: every 6-12 months. New keypair, new selector, publish public key, switch signing, retire old selector after 1 month of no signed messages. Some compliance regimes require quarterly rotation.
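To make the bh= tag concrete, here is a sketch of relaxed body canonicalization followed by the SHA-256 body hash, following the rules in RFC 6376 section 3.4.4. It assumes the body already uses CRLF line endings; the function names are illustrative.

```python
import base64
import hashlib
import re


def relaxed_body(body: bytes) -> bytes:
    """Relaxed body canonicalization: collapse runs of spaces/tabs to one
    space, strip trailing whitespace per line, drop trailing empty lines."""
    lines = []
    for line in body.split(b"\r\n"):
        line = re.sub(rb"[ \t]+", b" ", line)
        lines.append(line.rstrip(b" "))
    while lines and lines[-1] == b"":
        lines.pop()
    if not lines:
        return b""                       # an empty body stays empty
    return b"\r\n".join(lines) + b"\r\n"


def body_hash(body: bytes) -> str:
    """Value of the bh= tag for a=rsa-sha256 with relaxed body canonicalization."""
    digest = hashlib.sha256(relaxed_body(body)).digest()
    return base64.b64encode(digest).decode("ascii")
```

Two bodies that differ only in whitespace runs or trailing blank lines produce the same bh= value, which is why relaxed survives relays that simple does not.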
14.3. DMARC — Domain-based Message Authentication, Reporting, and Conformance
DMARC ties SPF + DKIM together and adds a policy + reporting layer. The receiver checks: did SPF pass AND was the SPF domain "aligned" with the visible From: domain? Did DKIM pass with a signing d= aligned with From:? If either authenticated-and-aligned check passes, DMARC passes. If both fail, apply policy.
DNS record:
_dmarc.example.com. IN TXT "v=DMARC1; p=reject; rua=mailto:dmarc-agg@example.com;
ruf=mailto:dmarc-forensic@example.com; pct=100; aspf=s; adkim=s"
- p=none (monitor only; start here)
- p=quarantine (spam folder)
- p=reject (drop entirely)
- rua= aggregate report (daily XML summary of who tried to send as you)
- ruf= forensic per-message reports (rare, privacy-sensitive)
- aspf=s, adkim=s: strict alignment; subdomain.example.com doesn't satisfy alignment for example.com. Most senders relax this with r.
Operationalization sequence: start p=none for 30-60 days; analyze rua reports for forgotten legitimate senders (marketing tool, HR vendor, finance ERP); add them to SPF / DKIM; tighten to p=quarantine; finally p=reject. Some Fortune-500s take 18 months to reach p=reject because of the "shadow IT" that emails as their domain.
The disaster scenario: a financial firm at p=none discovers, via DMARC reports, that 40k phishing emails/day are being sent claiming to be from them. They tighten to p=reject overnight — and discover their HR vendor sending password-reset emails was never SPF-aligned. Password resets bounce. The fix: p=reject requires complete inventory of legitimate senders first.
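The pass/fail rule can be sketched in a few lines. One loud assumption: relaxed alignment needs the organizational domain, which real receivers derive from the Public Suffix List; the org_domain helper below just takes the last two labels and is wrong for suffixes like co.uk.

```python
def org_domain(domain: str) -> str:
    """Naive organizational domain: last two labels (assumption, see lead-in)."""
    return ".".join(domain.lower().rstrip(".").split(".")[-2:])


def aligned(auth_domain: str, from_domain: str, mode: str) -> bool:
    a = auth_domain.lower().rstrip(".")
    f = from_domain.lower().rstrip(".")
    if mode == "s":                       # strict: exact match only
        return a == f
    return org_domain(a) == org_domain(f)  # relaxed: same organizational domain


def dmarc_disposition(from_domain: str,
                      spf_pass: bool, spf_domain: str,
                      dkim_pass: bool, dkim_domain: str,
                      policy: str = "none",
                      aspf: str = "r", adkim: str = "r") -> str:
    """Return 'pass', or the published policy (none/quarantine/reject)."""
    spf_ok = spf_pass and aligned(spf_domain, from_domain, aspf)
    dkim_ok = dkim_pass and aligned(dkim_domain, from_domain, adkim)
    return "pass" if (spf_ok or dkim_ok) else policy
```

The HR-vendor failure in the scenario above is the first test case: SPF passes, but for the vendor's own envelope domain, so nothing is aligned and p=reject applies.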
14.4. Sender reputation — per-IP and per-domain
Gmail, Outlook, Yahoo each maintain internal reputation scores for every sending IP and every sending domain. Inputs:
- bounce rate (>5% over 24h = bad)
- spam complaint rate (>0.1% = bad; this is the "this is spam" button)
- spamtrap hits (mail to addresses that aren't real users, used by anti-spam vendors to catch list buyers)
- engagement signals (opens, replies, "not spam" clicks)
- volume consistency (sudden 10x spike from a new IP = suspicious)
Gmail's Postmaster Tools exposes per-domain reputation as High / Medium / Low / Bad. Microsoft's SNDS (Smart Network Data Services) gives per-IP reputation. Yahoo has limited but functional feedback loops.
Bad reputation = silent throttling. You send 1M emails, MTAs return 250 OK on every one, but only 200k actually reach the inbox. The other 800k go to spam, or "deliverability throttling" where Gmail accepts at 10% rate. No error — just slow delivery and bad open rates.
14.5. Warm-up of new sending IPs
A brand-new IP with zero history sending 1M emails on day 1 → treated as a spam attack by every receiver. The standard warm-up regimen:
Day 1: 100 emails
Day 2: 200
Day 3: 500
Day 4: 1,000
Day 5: 2,000
Week 2: 10,000-20,000/day
Week 3: 50,000-100,000/day
Week 4: 200,000-500,000/day
Week 5+: ramp to target volume; multiple IPs in pool
During warm-up, send only to highly engaged users (users who recently opened mail) so engagement signals look good. Mix transactional (high engagement) with marketing. Monitor Postmaster Tools daily. Pause if reputation slips.
The "we sent 1M emails from a cold IP and 90% went to spam" disaster: a startup launches its marketing campaign on a fresh Mailgun IP, blasts 1M users they have addresses for (including dormant addresses from years ago). Result: 8% bounce rate, 2% complaint rate, IP blacklisted at Spamhaus, all future mail blocked. Recovery: weeks of remediation, often impossible — you abandon the IP and start over.
Established sending platforms (SES, SendGrid, Mailgun) maintain warmed-up IP pools and round-robin new tenants onto them, but if you reserve dedicated IPs, you own the warm-up.
14.6. Feedback loops and bounce processing
- Hard bounce: permanent failure (550 5.1.1 User unknown). Suppress immediately. Re-sending hard bounces = spam signal.
- Soft bounce: temporary (452 4.2.2 Mailbox full). Retry with backoff (1h, 4h, 24h, 72h). 3-5 consecutive soft bounces = treat as hard.
- Complaint (FBL, feedback loop): receiver tells you "user marked this as spam." AOL, Yahoo, Microsoft offer FBL programs. Gmail doesn't expose per-message complaints, only aggregate reputation. Process: the receiver sends an ARF (Abuse Reporting Format) report, normally by email, to your registered address. Suppress that recipient immediately.
- Unsubscribe: see §17.
Bounce processing must update the user_emails.bounced=TRUE flag and propagate to caches within seconds. A common bug: bounces processed by a nightly batch job → 24-hour window where bouncing addresses keep getting retried → reputation tanks.
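A sketch of the per-event bounce handling, assuming classification by the SMTP reply code's first digit and a soft-bounce limit of 3 (the low end of the range above). The BounceTracker class is hypothetical; its suppressed set stands in for user_emails.bounced=TRUE.

```python
def classify_reply(code: int) -> str:
    """5xx is a permanent failure, 4xx is transient, anything else is fine."""
    if 500 <= code <= 599:
        return "hard"
    if 400 <= code <= 499:
        return "soft"
    return "ok"


class BounceTracker:
    def __init__(self, soft_limit: int = 3):
        self.soft_limit = soft_limit
        self.soft_counts = {}    # address -> consecutive soft bounces
        self.suppressed = set()  # stand-in for user_emails.bounced=TRUE

    def record(self, address: str, code: int) -> bool:
        """Apply one delivery outcome; return True if the address is suppressed."""
        kind = classify_reply(code)
        if kind == "hard":
            self.suppressed.add(address)
        elif kind == "soft":
            count = self.soft_counts.get(address, 0) + 1
            self.soft_counts[address] = count
            if count >= self.soft_limit:
                self.suppressed.add(address)   # repeated soft = treat as hard
        else:
            self.soft_counts.pop(address, None)  # success resets the streak
        return address in self.suppressed
```

Calling record from the feedback consumer on every event, rather than from a nightly job, is what keeps the suppression window at seconds instead of a day.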
§15. SMS economics and carrier filtering
SMS looks like the same primitive as push, but the economics, regulation, and delivery mechanics are fundamentally different. Push is essentially free (you pay for infrastructure, not per-message); SMS costs money per message and is regulated as a common-carrier service.
15.1. Per-message cost
| Provider | US (rough) | Mexico | UK | Brazil | India |
|---|---|---|---|---|---|
| Twilio | $0.0075-$0.0085 | $0.03-$0.05 | $0.04 | $0.05-$0.08 | $0.02-$0.04 |
| AWS SNS | $0.00645 | $0.05+ | $0.0468 | $0.06+ | $0.04+ |
| MessageBird | $0.007 | $0.04 | $0.04 | $0.07 | $0.03 |
At LinkedIn / Twitter scale (~10M SMS/day for 2FA + alerts), SMS cost = $75k+/day = $27M+/year. Optimization battle:
- Route via cheapest provider per country (Twilio US, MessageBird EU, regional providers in Latin America).
- For 2FA, prefer authenticator app or email when possible; push for app adoption.
- Negotiate volume tiers (Twilio gives ~30-40% discount at 100M+/month).
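The spend figure can be reproduced with a few lines of decimal arithmetic. The price used is the low end of the Twilio US range in the table; the helper name and the flat-discount model are assumptions for illustration.

```python
from decimal import Decimal


def sms_spend(messages_per_day: int, price_per_message: str,
              discount: str = "0"):
    """Return (daily, annual) spend in dollars for a flat per-message price."""
    price = Decimal(price_per_message) * (1 - Decimal(discount))
    daily = price * messages_per_day
    return daily, daily * 365


daily, annual = sms_spend(10_000_000, "0.0075")
discounted_daily, _ = sms_spend(10_000_000, "0.0075", discount="0.30")
```

At list price this gives $75,000/day and about $27.4M/year; a 30% volume discount brings the daily figure to $52,500.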
International ranges from $0.02 (most of EU) to $0.20+ (premium destinations like Cuba, North Korea, satellite). Some countries (China, India) have aggressive operator filtering that loses 20%+ of messages with no refund.
15.2. Short codes vs long codes vs toll-free vs alphanumeric
Long codes (regular 10-digit phone numbers, e.g., +1-415-555-0100): cheap to provision, but throughput-limited (~1 message/sec/number in US under "A2P 10DLC", below). Used for transactional 1:1, replies, customer support.
Short codes (5-6 digits, e.g., 32665 = "FBOOK"): $500-$1000/month per code, throughput up to 100+ messages/sec, pre-approved by carriers for high-volume traffic. Mandatory for marketing in US. Approval takes 4-8 weeks (carriers vet the use case). Highly trusted; messages rarely filtered.
Toll-free SMS (e.g., +1-800-...): middle ground. Approved by Toll-Free Messaging Verification (TFMV) process. Up to 10-25 msg/sec. Used for transactional + light marketing.
Alphanumeric sender ID (e.g., "Uber", "Stripe" as sender name, not a number): supported in most of EU, India, Brazil. Not supported in the US or Canada. It is one-way: reply-by-text flows break because there is no number to reply to. Pre-registration is required per country; an ID that is not registered in a given country is blocked there.
15.3. Carrier filtering and A2P 10DLC
US carriers (AT&T, Verizon, T-Mobile) aggressively filter messages they think look like spam, phishing, prohibited content (gambling, cannabis, predatory loans). Filtering is opaque — you get delivered=true from Twilio but the message never lands.
In 2020-2021 US carriers rolled out A2P 10DLC (Application-to-Person 10-Digit Long Code) — every business sending on long codes must register their brand and use case with The Campaign Registry (TCR). Without registration, long-code messages are heavily throttled or blocked. With registration, you get tiered throughput (1 msg/sec → 75 msg/sec depending on trust score).
Registration components:
- Brand: company identity, EIN/DUNS, website. Vet level (Standard/Sole Proprietor/Charity/Government) determines throughput.
- Campaign: per use-case approval (2FA, marketing, account notification, customer service). Cannot use a campaign for unrelated content.
- Per-message tagging: each SMS includes the campaign ID; carriers verify alignment.
Carrier filtering rejects:
- URL shorteners (bit.ly, tinyurl) — appear in phishing; use your own branded shortener.
- "STOP/HELP" not honored — instant block.
- Predatory keywords (heavy use of $$, "winner", "click here").
- Content not matching campaign category.
15.4. TCPA — US compliance
TCPA (Telephone Consumer Protection Act, 1991) governs unsolicited communications. For marketing SMS, you need prior express written consent — the user must have actively opted in to receive marketing texts from your specific brand. Inferred consent from "they gave us their number to ship a package" is NOT consent for marketing.
Penalties: $500-$1,500 per violation. Class actions ruinous: a 100k-recipient TCPA violation = $50M-$150M.
Required elements in marketing SMS:
- Brand identification ("Acme:")
- "Reply STOP to opt out" in the first message
- "Msg & data rates may apply" disclosure
- Opt-out must be honored within 24h (industry norm is faster: instant)
For transactional SMS (2FA codes, shipping updates, fraud alerts) the TCPA bar is lower — implied consent from the transaction relationship is generally sufficient — but rules vary by state.
15.5. "We got 1M SMS sent but only 80% delivered" reconciliation
Twilio's "delivered" status callback fires when the carrier accepts the message. But the user-perceived delivery rate is lower because:
- Carrier silently drops (delivered=true but never lands): 5-15% for unregistered 10DLC, near 0 for short codes.
- User has phone off / out of range for >72h: carrier expires the message. Twilio reports undelivered.
- Invalid number: failed=true, refunded.
- Region-blocked: some destinations (Cuba, parts of Africa) are filtered by US carriers.
Reconciliation: pull Twilio delivery reports, join with internal click/open data (link in SMS → URL with tracking token). True "user saw it" rate often 60-80% of "sent" — significantly worse than email or push.
§16. Push token lifecycle
The device token (APNs token, FCM registration token) is the addressing primitive. It is not a permanent identifier; it rotates more than developers expect, and stale tokens accumulate fast.
16.1. Token issuance
1. App launches. Calls platform SDK.
iOS: UIApplication.shared.registerForRemoteNotifications()
Android: FirebaseMessaging.getInstance().getToken()
2. SDK negotiates with APNs / FCM (TLS connection to push platform).
3. Platform returns a token string (~64-char hex for APNs, ~150+ chars for FCM).
4. App POSTs token to your backend with user_id + device_id + app_version + platform.
5. Backend UPSERTs into device_tokens table (user_id, device_id) primary key.
Token issuance happens on every cold launch (cached locally; re-confirmed with platform). If the cached token differs from what platform returns, register the new one with backend.
16.2. Token rotation triggers
Tokens rotate for many reasons. The app SDK detects the rotation event and the developer must re-register:
- App reinstall — old token immediately dead.
- iOS update (major version bumps often rotate).
- APNs/FCM-side rotation — platform may rotate tokens every 6-12 months even without user action.
- Backup restore — restoring to a new device or after OS reset rotates tokens.
- User signs out and back in (some apps tie token to user identity).
- App data clear (Android — settings → clear app data).
- FCM specifically rotates more aggressively than APNs.
Rotation contract: the app's onNewToken (Android, FirebaseMessagingService) / application:didRegisterForRemoteNotificationsWithDeviceToken: (iOS) callback fires with the new token. Every cold launch should re-POST the token to backend (idempotent UPSERT); relying on "only register if changed" misses cases where the local cache diverges from platform reality.
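The backend side of that contract is an idempotent UPSERT keyed on (user_id, device_id). A minimal in-memory sketch, with a dict standing in for the device_tokens table and all names hypothetical:

```python
class DeviceTokenRegistry:
    def __init__(self):
        self.rows = {}  # (user_id, device_id) -> row dict

    def register(self, user_id, device_id, token, platform, app_version):
        """UPSERT one device. Safe to call on every cold launch.
        Returns True if the stored token actually changed."""
        key = (user_id, device_id)
        previous = self.rows.get(key)
        changed = previous is None or previous["token"] != token
        self.rows[key] = {
            "token": token,
            "platform": platform,
            "app_version": app_version,
            "active": True,          # a fresh registration reactivates the row
        }
        return changed

    def mark_dead(self, token):
        """Feedback path: deactivate every row holding this token."""
        for row in self.rows.values():
            if row["token"] == token:
                row["active"] = False

    def active_tokens(self, user_id):
        return [row["token"] for (uid, _), row in self.rows.items()
                if uid == user_id and row["active"]]
```

Re-posting an unchanged token is a no-op apart from refreshing app_version and the active flag, so the client never needs to reason about whether its local cache is stale.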
16.3. Dead token detection
APNs returns:
- 410 Gone + body {"reason":"Unregistered"} → token permanently dead. Mark inactive.
- 400 + {"reason":"BadDeviceToken"} → token was never valid. Mark inactive.
- 413 payload too large → not a token issue.
FCM returns:
- 404 NOT_FOUND or 400 with error.details[].errorCode = UNREGISTERED → token dead.
- 400 INVALID_ARGUMENT errorCode = INVALID_REGISTRATION → token never valid.
These responses fire the feedback loop documented in §8.3: feedback.events Kafka message → device_tokens.active=FALSE → next push skips this token.
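For the APNs half, the mapping from response to action can be sketched as a pure function. The action labels are hypothetical; the retry branch uses the 429/503 handling from §8.2, and FCM is left out because its error shape differs between API versions.

```python
def classify_apns(status: int, reason: str = "") -> str:
    """Map an APNs HTTP/2 response to a dispatcher action."""
    if status == 200:
        return "delivered"
    if status == 410 and reason == "Unregistered":
        return "token_dead"        # publish TOKEN_DEAD, mark inactive
    if status == 400 and reason == "BadDeviceToken":
        return "token_invalid"     # never valid; mark inactive
    if status == 413:
        return "payload_error"     # fix the payload; the token is fine
    if status in (429, 503):
        return "retry"             # counts toward the circuit breaker
    return "other"


def should_deactivate(status: int, reason: str = "") -> bool:
    return classify_apns(status, reason) in ("token_dead", "token_invalid")
```

Keeping the classification separate from the side effects makes it easy to unit-test the table of responses without a live connection.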
16.4. The "50% dead tokens" problem
A common production state: an app has 100M registered tokens in MySQL but only 50M users are active. The other 50M tokens are dead (uninstalled, reinstalled, OS-reset). Without diligent cleanup, every push fanout sends 100M API calls and pays the cost (APNs is free, but FCM HTTP/2 connection load is real; SES costs money per email send).
Sources of accumulation:
- Backend never receives dead-token feedback (silent suppression in some FCM paths: only "delivered" callbacks come back, not "this token was dead").
- Backend receives feedback but doesn't process it (feedback loop bug).
- App fails to deregister token on logout (user logs out, token still associated with their account, but the app is now showing a different user).
- Multi-account apps don't bind token → account properly.
Mitigations:
- Idle cleanup: tokens with no successful push in 60 days → mark inactive. If user re-launches the app, fresh registration reactivates.
- Active probe: send a silent push (content-available: 1) once a week; if it fails (410), mark dead.
- Per-platform metrics: track dead_token_rate per platform per app version. Spike = SDK bug / migration issue.
- App-side teardown: on logout, app calls DELETE /devices/{device_id} on backend.
Acceptable steady state: <5% dead tokens. >20% = serious accumulation. >50% = years of neglect.
§17. Unsubscribe and preference center
Unsubscribe handling is where notification systems collide with compliance law. Doing it wrong = regulatory fines, brand damage, and angry users marking as spam (which tanks deliverability).
17.1. One-click unsubscribe
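RFC 8058 and the List-Unsubscribe + List-Unsubscribe-Post headers define one-click unsubscribe. Required by Gmail (since Feb 2024) and Yahoo for senders of >5k messages/day. CAN-SPAM, enforced by the FTC, separately requires a working opt-out mechanism in every commercial email, though it does not mandate these specific headers.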
Email headers:
List-Unsubscribe: <https://example.com/unsub?token=abc123>, <mailto:unsub+abc123@example.com>
List-Unsubscribe-Post: List-Unsubscribe=One-Click
The mail client sees both headers and renders a one-click "Unsubscribe" button. On click, the mail client issues a POST to the HTTPS URL (not a GET — POST avoids prefetchers / link scanners unsubscribing the user). The POST body includes the List-Unsubscribe=One-Click form parameter.
Backend handler must:
1. Validate the token (HMAC signature, includes user_id + list_id + timestamp).
2. Update preferences synchronously: user MUST be unsubscribed by the time the POST returns.
3. Confirm with HTTP 200.
4. Future emails to this address suppressed within minutes.
Common bug: backend asynchronously enqueues the unsubscribe → returns 200 → 5 minutes later marketing batch sends another email to this user. RFC 8058 implies, and Gmail enforces, that unsubscribe must take effect promptly. CAN-SPAM's 10 business days is the legal ceiling, not the target; in practice, make it instant.
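Token validation (handler step 1) can be sketched with the standard hmac module. The token layout, the one-year maximum age, and the function names are assumptions for illustration; user_id is assumed numeric and the timestamp is Unix seconds.

```python
import base64
import hashlib
import hmac
from typing import Optional, Tuple


def make_token(secret: bytes, user_id: int, list_id: str, issued_at: int) -> str:
    """Build '<payload>.<signature>' where payload = user_id:list_id:issued_at."""
    payload = f"{user_id}:{list_id}:{issued_at}".encode()
    signature = hmac.new(secret, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "." + base64.urlsafe_b64encode(signature).decode())


def verify_token(secret: bytes, token: str, now: int,
                 max_age_s: int = 365 * 24 * 3600) -> Optional[Tuple[int, str]]:
    """Return (user_id, list_id) if the token is authentic and not too old."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
        expected = hmac.new(secret, payload, hashlib.sha256).digest()
        # Constant-time comparison avoids leaking how many bytes matched.
        if not hmac.compare_digest(signature, expected):
            return None
        user_part, rest = payload.decode().split(":", 1)
        list_id, issued_part = rest.rsplit(":", 1)
        if now - int(issued_part) > max_age_s:
            return None
        return int(user_part), list_id
    except ValueError:
        return None   # malformed token, bad base64, or non-numeric fields
```

Because the token carries everything needed to identify the subscription, the POST handler can unsubscribe without a session or login, which is what one-click requires. The maximum age should be generous, since people unsubscribe from old mail.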
17.2. The CAN-SPAM "10 business days" violation
CAN-SPAM Act (US) requires honoring opt-out within 10 business days. Many companies built their unsubscribe pipeline as a chain of delays: click at T+0 → nightly batch updates the DB → cache invalidates the next day → mail service sees fresh state at T+1 to T+2 days.
The violation pattern: bug in the batch job, or batch fails for a week, → unsubscribed users continue receiving mail past the 10-day mark. FTC has settled cases for $1M+ per incident. Class actions for high-volume marketers can reach $10M-$50M.
The fix is straightforward: write-through cache invalidation. Click → DB write + Redis DEL → future reads see fresh state immediately.
17.3. Granular preference center
A "one giant unsubscribe button" is hostile to deliverability. Users who unsubscribe entirely → no engagement → eventually mark future legitimate mail as spam → reputation drops.
Granular preferences let users dial down without leaving:
Per-channel: [push] [email] [SMS] [in-app]
Per-category:
- Transactional [always on, cannot opt out]
- Account & security [always on for some, opt-in for others]
- Product updates [user controlled]
- Marketing [user controlled]
- Recommendations [user controlled]
- Social [user controlled, granular by type]
- Digest [daily / weekly / off]
Per-frequency: [real-time] [daily summary] [weekly digest]
Per-time: [9am-5pm] [respects quiet hours]
Per-source: [mute user X] [mute thread Y]
The schema:
CREATE TABLE notification_preferences (
user_id BIGINT NOT NULL,
channel VARCHAR(16) NOT NULL, -- push/email/sms/inapp
category VARCHAR(32) NOT NULL, -- marketing/security/social/...
enabled BOOLEAN NOT NULL,
frequency ENUM('realtime','daily','weekly','off') DEFAULT 'realtime',
updated_at BIGINT NOT NULL,
PRIMARY KEY (user_id, channel, category)
);
Each notification has a (channel, category) tag at creation. Dispatcher does a single lookup per (user_id, channel, category) → enabled/disabled. Marketing teams cannot send "marketing" to a user who muted that category, regardless of organizational pressure.
17.4. Unsubscribe propagation: eventual vs strong consistency
Marketing infrastructure runs across many services: A/B test platform, journey orchestrator, email service, SMS service, in-app inbox. An unsubscribe must propagate to all of them.
Eventual consistency (typical): unsubscribe event → Kafka → each downstream service consumes → updates its local cache. Median propagation 1-5s, p99 30s.
Strong consistency (required for legal): preference center is the source of truth, and every send-time decision must read it (Redis cache with write-through invalidation, not async replication). Each delivery handler does:
1. Read preferences from Redis (cache).
2. If cache miss → read from MySQL → repopulate cache.
3. If preferences indicate suppression → drop, log delivery_status=SUPPRESSED_PREFERENCE.
For high-compliance channels (marketing email, marketing SMS), eventual consistency is a legal risk. The fix: write-through invalidation makes the cache strong-read. Some companies go further and do a synchronous MySQL read in the marketing path (sacrifice 1-2ms latency for compliance certainty).
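The read path and write-through invalidation can be sketched as follows. Plain dicts stand in for MySQL and Redis, and `PreferenceStore`, `dispatch`, and the "no row means not opted in" default are assumptions for illustration rather than the platform's actual API.

```python
from typing import Dict, Tuple

Key = Tuple[int, str, str]  # (user_id, channel, category)


class PreferenceStore:
    """Dicts stand in for MySQL (source of truth) and Redis (cache)."""

    def __init__(self) -> None:
        self.db: Dict[Key, bool] = {}
        self.cache: Dict[Key, bool] = {}

    def set_enabled(self, key: Key, enabled: bool) -> None:
        # Write-through invalidation: the DB write and the cache delete both
        # happen in the request path, before the unsubscribe handler returns.
        self.db[key] = enabled
        self.cache.pop(key, None)

    def is_enabled(self, key: Key) -> bool:
        if key in self.cache:
            return self.cache[key]
        # Cache miss: read the source of truth, then repopulate the cache.
        enabled = self.db.get(key, False)  # assumption: no row means not opted in
        self.cache[key] = enabled
        return enabled


def dispatch(store: PreferenceStore, user_id: int, channel: str, category: str) -> str:
    """Send-time check: every delivery reads preferences before sending."""
    if not store.is_enabled((user_id, channel, category)):
        return "SUPPRESSED_PREFERENCE"
    return "SENT"
```

Because the cache entry is deleted in the same request as the DB write, the next send-time read cannot see the stale "enabled" value.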
§18. Compliance: GDPR / CAN-SPAM / TCPA / CASL
Notification systems sit on a regulatory landmine. Different jurisdictions impose different rules; the platform must enforce them at send time.
18.1. GDPR (EU)
GDPR (General Data Protection Regulation, EU, 2018) governs personal data of EU residents. Notification implications:
- Lawful basis required for processing email/phone (Article 6). For marketing notifications, the basis is almost always consent — explicit, granular, freely given, withdrawable.
- Consent must be specific — "I agree to marketing" is too broad. "I agree to receive monthly product updates by email" is granular.
- Right to erasure (Article 17) — user requests deletion → all PII must be removed including from notification logs. Audit trail must show compliance.
- Audit trail of consent — when did the user opt in? IP? Form version? Required to defend compliance. Store in append-only log with 7+ year retention.
- Cross-border transfer (Chapter V) — if your notification platform is US-based and sends to EU users, you need a transfer mechanism (Standard Contractual Clauses, adequacy decision, or Data Privacy Framework). Sending an email from US infrastructure to EU residents is a transfer.
Penalties: up to 4% of global annual revenue or €20M, whichever higher. Real cases: Meta €1.2B (2023), Amazon €746M (2021).
18.2. CAN-SPAM (US)
CAN-SPAM (Controlling the Assault of Non-Solicited Pornography And Marketing Act, 2003) governs commercial email. Requirements:
- No false/deceptive From, To, Reply-To, or Subject — the brand must be identifiable.
- Identify message as ad — for marketing, must be clear it's an advertisement.
- Physical postal address required — every commercial email must include a valid postal address in the footer. Many companies miss this on transactional emails that include "while you're here, check out..." cross-sell — once you add marketing, you must add the address.
- Opt-out mechanism — clear, conspicuous, easy. Honor within 10 business days. Cannot charge for opt-out. Cannot require login.
- Monitor third parties acting on your behalf — if an agency sends spam claiming to be you, you're liable.
Penalties: $51,744 per violation (2024). 1M-recipient spam = theoretically $51B; practical settlements much lower.
18.3. TCPA (US — SMS and voice)
TCPA (Telephone Consumer Protection Act, 1991) governs phone-based communications. SMS implications:
- Prior express written consent for marketing SMS. The bar is high: signed agreement (digital sig OK) acknowledging the user will receive marketing SMS at a specific number.
- Implied consent sufficient for transactional (you placed an order → we text shipping updates).
- Opt-out — STOP, UNSUBSCRIBE, CANCEL, END, QUIT, OPTOUT must all be honored.
- Caller-ID requirements — no spoofing.
Penalties: $500 per call/text, $1,500 per willful violation. Class actions ruinous because of low per-violation bar — accidentally texting 100k unconsenting users = $50M-$150M.
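Matching the opt-out keywords on inbound SMS is a small normalization problem. A minimal sketch, assuming exact-match semantics after trimming whitespace and trailing punctuation; the function name is hypothetical and a real deployment may need looser matching or carrier-level handling.

```python
OPT_OUT_KEYWORDS = {"STOP", "UNSUBSCRIBE", "CANCEL", "END", "QUIT", "OPTOUT"}


def is_opt_out(body: str) -> bool:
    """True if the inbound SMS body is one of the opt-out keywords."""
    # Trim whitespace, then trailing/leading '.' and '!', then ignore case.
    normalized = body.strip().strip(".!").upper()
    return normalized in OPT_OUT_KEYWORDS
```

Exact matching keeps false positives low ("please don't stop the updates" is not an opt-out), at the cost of missing free-form requests, which still need a human or classifier path.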
18.4. CASL (Canada)
CASL (Canada's Anti-Spam Legislation, 2014) is stricter than CAN-SPAM:
- Express opt-in mandatory — implied consent narrowly allowed for existing business relationships, expires after 2 years inactivity.
- Identification — must clearly identify sender, include contact info.
- Unsubscribe — must function for 60 days after sending; honored within 10 business days.
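Penalties: up to CAD $10M per violation for businesses. The statute also defines a private right of action (individuals can sue), but its entry into force was suspended in 2017, so enforcement currently runs through regulators such as the CRTC.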
18.5. The "marketing wants to send, legal won't let them" reality
This is the perennial tension. Marketing org wants to send the campaign to a 50M-recipient list. Legal looks at the list:
- 8M users opted into "product updates" not "marketing."
- 5M users are EU residents — was their consent properly recorded under GDPR? Is the marketing wording consistent with what they consented to?
- 12M users haven't engaged in 24+ months — under CASL their implied consent has lapsed.
- 1M users have a phone number on the list that was given for 2FA only — TCPA blocks marketing use.
Result: legal-approved list = 24M, not 50M. Marketing pushes back. Engineering builds the platform such that the preference center is the source of truth and the suppression list is enforced at send time — marketing cannot route around it without changing the platform.
The opposite failure: legal isn't involved, marketing sends to everyone, an EU regulator audits, GDPR fines follow. The platform must enforce policy, not trust intent.
§19. A/B testing notification content
Notifications are some of the most measurable things in any product — every send has clean attribution (sent/delivered/opened/clicked). This is fertile ground for experimentation, and most companies underuse it.
19.1. What to test
- Subject lines (email): "Your weekly summary" vs "Yifan, your week in review" vs "5 highlights from last week" → 10-30% open rate difference.
- Preheader (the first ~100 chars below subject) — major impact on open rate.
- Send time — "9am local Tuesday" vs "11am local Tuesday" vs "model-predicted optimal time per user."
- From name — "Acme" vs "Yifan from Acme" vs "Acme Notifications."
- Push title and body — "New like" vs "Alice liked your post" vs "Alice loved your idea" — 3-10% open rate difference.
- Channel mix — push only vs push + email after 1h if not opened.
- Frequency caps — 5/day vs 10/day → measure long-term retention.
19.2. Send-time optimization
Per-user STO (Send-Time Optimization): predict when a specific user is most likely to open. Model trained on per-user engagement timestamps over weeks. Output: "User 12345 opens email best at 7:42am Tue-Thu."
Implementation: nightly batch updates user_preferences.optimal_send_window per user-channel. At dispatch, if the campaign supports STO, schedule for the user's window using a delay queue.
The trap: STO works great for marketing where 1-hour latency is acceptable. Don't apply STO to transactional or alerting.
19.3. The experimentation pipeline
1. Define experiment:
experiment_id = "weekly_digest_subject_v3"
variants = ["control: Your weekly summary",
"A: Yifan, your week in review",
"B: 5 highlights from last week"]
traffic_allocation = [0.5, 0.25, 0.25]
metric = open_rate (primary), click_rate (secondary), unsubscribe_rate (guardrail)
duration = 14 days
2. At send time:
variant = hash(user_id || experiment_id) % 100
if variant < 50: control
elif variant < 75: A
else: B
3. Log to experiment log:
{ experiment_id, user_id, variant, sent_ts }
4. Outcome tracking:
{ experiment_id, user_id, event_type, event_ts }
events: SENT, DELIVERED, OPENED, CLICKED, UNSUBSCRIBED
5. Analysis:
per-variant rates with confidence intervals
chi-square or proportion-z test
guardrail check: did the variant raise the unsubscribe rate?
Bucketing by hash(user_id || experiment_id): same user always gets the same variant for this experiment, even across sessions. This is deterministic bucketing — critical for clean attribution.
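Deterministic bucketing needs a hash that is stable across processes. A minimal sketch, assuming SHA-256 as the hash (Python's built-in `hash()` is salted per process for strings, so it would not be stable); the function names and the `|` separator are illustrative choices.

```python
import hashlib
from typing import List, Tuple


def bucket(user_id: int, experiment_id: str) -> int:
    """Map (user, experiment) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{user_id}|{experiment_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def assign_variant(user_id: int, experiment_id: str,
                   allocation: List[Tuple[str, int]]) -> str:
    """allocation is a list of (variant_name, percent) that sums to 100."""
    b = bucket(user_id, experiment_id)
    upper = 0
    for name, pct in allocation:
        upper += pct
        if b < upper:
            return name
    raise ValueError("allocation must sum to 100")
```

For the experiment above, `assign_variant(user_id, "weekly_digest_subject_v3", [("control", 50), ("A", 25), ("B", 25)])` returns the same variant for a given user on every send. Including the experiment id in the hash input keeps bucket assignments independent across experiments.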
19.4. Open rate vs click rate as success metrics
- Open rate (email): user opened the email (tracking pixel loaded). Inflated by mail client previews and by Apple Mail Privacy Protection (MPP), which pre-fetches all images, so opens are reported even if the user never looked. Since Sept 2021 (MPP launch), open rate for Apple Mail users is essentially noise.
- Click rate: user clicked a link in the email. Stronger signal, less inflated.
- Conversion rate: user did the action you wanted (purchase, return, etc.). Stronger signal than clicks, but lower volume, so it needs larger sample sizes.
- Long-term engagement / retention: did this campaign increase 30-day-retained users? The strongest signal, but it requires careful experiment design (don't compare opens; compare cohorts).
For push: open rate = user opened the app from the push notification. Direct attribution via launch URL.
For SMS: open rate ≈ delivery rate (almost everyone reads SMS). Click rate via tracking URL.
19.5. Guardrail metrics
Optimizing open rate without guardrails leads to disasters: "URGENT: open immediately" subject lines, manipulative urgency, deceptive previews — all boost opens, all destroy long-term trust. Guardrail metrics block:
- Unsubscribe rate
- Spam complaint rate
- Long-term retention
- NPS or sentiment polls
- Cross-product engagement (did this notification cannibalize others?)
If a variant wins on open rate but increases unsubscribe rate by 0.5 percentage points (absolute), abandon it.
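The analysis step in 19.3 (proportion z-test) and the guardrail rule above can be sketched together. This is a minimal sketch using the pooled two-proportion z-test with a two-sided p-value; the function names and the 0.005 default threshold (0.5 percentage points) are assumptions taken from the example numbers in this section.

```python
import math
from typing import Tuple


def two_proportion_z(success_a: int, n_a: int,
                     success_b: int, n_b: int) -> Tuple[float, float]:
    """Pooled two-proportion z-test. Returns (z, two-sided p-value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value for a standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def passes_guardrail(unsub_control: float, unsub_variant: float,
                     max_abs_increase: float = 0.005) -> bool:
    """False if the variant raises unsubscribe rate by the threshold or more."""
    return (unsub_variant - unsub_control) < max_abs_increase
```

A variant ships only if it wins on the primary metric and `passes_guardrail` is true; a significant open-rate lift does not override a failed guardrail.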
§20. Notification templating
Sending 10M notifications a day in 30 languages, with personalization, A/B variants, and brand-consistent design requires a templating system. Hand-crafting HTML strings doesn't scale.
20.1. Template engines
- Liquid (Shopify, GitHub) — open syntax with filters: {{ user.first_name | default: "there" }}. Designed for end-user authoring; sandbox-safe.
- Handlebars (Mailchimp, common in JavaScript ecosystems) — logic-less templates with helpers. {{#if user.is_premium}}...{{/if}}. Familiar to web developers.
- Jinja2 / Twig — full programming language in templates; powerful but easy to introduce bugs and security holes if you let untrusted authors write templates.
- MJML (Mailjet Markup Language) — declarative responsive email markup. <mj-section><mj-column><mj-text>Hi</mj-text></mj-column></mj-section>. Compiles to bulletproof HTML that renders in Outlook 2007, Apple Mail, Gmail, mobile clients. Removes the email-HTML nightmare.
- MJML + Liquid combined — MJML for layout, Liquid for personalization. Common stack.
For push and SMS, templates are simpler — usually just a few {{ variable }} substitutions in plain text.
20.2. Template versioning and rollback
Templates change frequently (marketing tweaks, A/B variants, product updates). Versioning is essential:
template:weekly_digest
├── v1 (2025-01-15) [active until 2025-04-22]
├── v2 (2025-04-22) [active until 2025-09-10]
├── v3 (2025-09-10) [active]
└── v4 (2026-01-08) [staging — A/B at 5% traffic]
Each version is immutable once published. Production traffic routes to a specific version (typically "latest stable" + A/B traffic to "candidate"). Rollback = redirect traffic back to vN-1.
20.3. The "we shipped a template change that broke 100k emails" rollback
Disaster pattern: someone edits the template directly in production (no version control), pushes "Save," and the template now has a malformed {{ user.first_name } (missing brace). Next 100k sends render with the literal string {{ user.first_name } in the body. Users complain. Brand embarrassment.
Worse pattern: the template references a field that doesn't exist on some users (user.subscription.tier when free users have no subscription). Half of recipients get an email saying "Welcome, premium member null!"
Defenses:
1. Template editing is via PR, not in-place. Reviewed, tested, merged.
2. Lint all templates for common errors before deploy (missing braces, unknown variables, malformed HTML).
3. Render against test fixtures — every template has 5-10 example users (free, premium, suspended, etc.); CI renders against all, checks for missing fields and broken HTML.
4. Email test — every template change auto-sends to a template-qa@company.com mailbox; team reviews before promoting.
5. Canary rollout — new template at 1% traffic for 1 hour, monitor render error rate, then 100%.
6. One-click rollback — UI button reverts to previous version instantly. Time-to-recover < 1 minute.
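Defenses 2 and 3 (lint plus rendering against fixtures) can be sketched as a CI check. The renderer below is a toy `{{ dotted.name }}` substituter written for illustration, not Liquid or Handlebars; `render_strict`, `check_fixtures`, and `TemplateError` are hypothetical names. The point is the strictness: a missing or null field and a leftover brace both fail the build instead of reaching users.

```python
import re
from typing import Dict, List

PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


class TemplateError(Exception):
    pass


def lookup(context: dict, dotted: str):
    """Resolve 'user.subscription.tier'; missing or None values are errors."""
    value = context
    for part in dotted.split("."):
        if not isinstance(value, dict) or value.get(part) is None:
            raise TemplateError(f"missing field: {dotted}")
        value = value[part]
    return value


def render_strict(template: str, context: dict) -> str:
    out = PLACEHOLDER.sub(lambda m: str(lookup(context, m.group(1))), template)
    # Anything that still looks like a delimiter was malformed in the template.
    if "{{" in out or "}}" in out:
        raise TemplateError("unbalanced or malformed placeholder")
    return out


def check_fixtures(template: str, fixtures: Dict[str, dict]) -> List[str]:
    """Render against every fixture user; return one message per failure."""
    failures = []
    for name, ctx in fixtures.items():
        try:
            render_strict(template, ctx)
        except TemplateError as exc:
            failures.append(f"{name}: {exc}")
    return failures
```

With a "free" fixture whose `subscription` is null, a template that references `user.subscription.tier` fails in CI rather than rendering "null" to half the recipients.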
20.4. Template testing infrastructure
- Render preview: in a dev tool, pick a template + a sample user → render HTML. For cross-client previews, services such as Litmus and Email on Acid render the message in real clients (roughly $200-500/month).
- Litmus testing: renders the email in 90+ email clients (Outlook 2007 through Gmail iOS) and screenshots. Catches Outlook table-layout breaks.
- Spam scoring: run the email through SpamAssassin / MailTester before publish; with SpamAssassin, aim for a score below 5 (its default spam threshold).
- Accessibility check: alt text on images, sufficient color contrast, plaintext alternative.
§21. Multi-language and localization
For a global product, every notification must be localized. Sending English to 100M users where 60M don't read English = low engagement + complaints + cultural insensitivity.
21.1. Language detection
Inputs (in priority order, typically):
- User-set preference — explicit, highest priority. "User chose Spanish in settings."
- App locale — what the user's device is set to.
- Browser locale (Accept-Language header) — for web push, web email rendering.
- IP geolocation — fallback for users with no preference; less accurate than language settings.
- Default — English (US), typically.
Stored per-user in MySQL users.locale = "es-MX" (BCP 47 language tag). Refreshed on every login.
21.2. Translation pipeline
1. Engineer adds English string to source.
message_key = "notif.new_follower.body"
en: "{name} started following you"
2. Build extracts strings → POEditor / Lokalise / Crowdin / internal TMS.
3. Translators (human, sometimes machine-assisted) localize:
es: "{name} empezó a seguirte"
ja: "{name}さんがあなたをフォローし始めました"
ar: "بدأ {name} في متابعتك"
4. Translations checked back into repo.
5. Build packages translations into delivery service.
6. At dispatch time, lookup string by (key, locale).
For high-volume content (chat messages, user-generated), human translation isn't feasible; machine translation (Google Translate API, DeepL, internal models) used with quality warnings.
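Step 6 of the pipeline above, lookup by (key, locale), usually needs a fallback chain so that "es-MX" can use a generic "es" string. A minimal sketch, assuming a simple truncate-the-tag fallback and an in-memory `CATALOG`; both are illustrative, and real catalogs typically come from the TMS export.

```python
from typing import List

# Hypothetical in-memory catalog; keys and strings mirror the example above.
CATALOG = {
    "notif.new_follower.body": {
        "en": "{name} started following you",
        "es": "{name} empezó a seguirte",
    }
}


def fallback_chain(locale: str, default: str = "en") -> List[str]:
    """'es-MX' -> ['es-MX', 'es', 'en']: drop subtags right to left."""
    parts = locale.split("-")
    chain = ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]
    if default not in chain:
        chain.append(default)
    return chain


def translate(key: str, locale: str, **params: str) -> str:
    strings = CATALOG[key]
    for candidate in fallback_chain(locale):
        if candidate in strings:
            return strings[candidate].format(**params)
    raise KeyError(f"no translation for {key}")
```

Falling back to the default language is a product decision: for some notifications it may be better to skip the send than to deliver an untranslated string.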
21.3. Variable substitution with grammar
This is where most localization breaks. English: "1 message" / "2 messages" — singular vs plural, two forms.
Slavic languages (Polish, Russian) have 3-4 plural forms:
- 1 message
- 2-4 messages (paucal — special form)
- 5+ messages (genitive plural)
- 0 messages (sometimes special)
Arabic has 6 plural forms (zero, one, two, few, many, other).
Use ICU MessageFormat or Fluent (Mozilla) for plural-aware templates:
{count, plural,
=0 {No new messages}
one {1 new message}
few {# new messages}
many {# new messages}
other {# new messages}
}
Localizer fills in per-language plural rules. At render time, library picks the right branch.
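What "the library picks the right branch" means can be shown for Polish. A minimal sketch of the integer plural categories as commonly documented in CLDR (one / few / many); it ignores fractional numbers, the function names are hypothetical, and in production an ICU or Fluent library should do this rather than hand-written rules.

```python
def polish_plural_category(n: int) -> str:
    """Plural category for a non-negative integer count in Polish."""
    if n == 1:
        return "one"
    # 2-4, 22-24, 32-34, ... but not 12-14.
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return "few"
    return "many"


# Illustrative strings for "new message(s)".
FORMS = {
    "one": "{n} nowa wiadomość",
    "few": "{n} nowe wiadomości",
    "many": "{n} nowych wiadomości",
}


def new_messages_pl(n: int) -> str:
    return FORMS[polish_plural_category(n)].format(n=n)
```

Note that 12 and 22 land in different branches, which is why a simple "n == 1 else plural" check cannot be reused across languages.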
Gender agreement (Romance languages, Slavic, German, Hebrew, Arabic) requires similar branching; the example below is German:
{user_gender, select,
female {Es ist deine Freundin}
male {Es ist dein Freund}
other {Es ist deine Person}
}
21.4. RTL (right-to-left)
Arabic, Hebrew, Farsi, Urdu render right-to-left. Email HTML and push UI both need RTL handling:
- HTML: <html dir="rtl" lang="ar"> and text-align: right on body.
- Push: most platforms render based on system locale, but punctuation and embedded English/numbers need bidirectional handling (Unicode Bidi algorithm — RLM, LRM marks for clarity).
- Image direction: arrows and icons may need to flip for RTL contexts (a "forward" arrow points right in LTR, left in RTL).
- Tables and layouts mirror — first column becomes the rightmost.
Test RTL with a native RTL speaker on real devices. Pseudo-locale (xx-RTL) for engineering QA.
21.5. Localization-related notification disasters
- Currency formatting: "$1,000.00" (US) vs "1.000,00 €" (Germany). Wrong format = confusion or worse, wrong amount.
- Date formatting: "11/04/2025" — US reads November 4, EU reads April 11. Use ISO 8601 (2025-04-11) or localized format library.
- Time formatting: 14:30 vs 2:30 PM. 12h/24h preferences vary.
- Address format: country/state/zip order varies. Don't hardcode US layouts.
- Cultural taboos: certain colors, numbers (4 in Chinese contexts, 13 in Western contexts), images can be offensive in some cultures.
§22. Time-zone-aware quiet hours in depth
§7 problem 5 covered the basics. The deeper concerns:
22.1. IANA TZ database
Not "UTC-8". The IANA (Internet Assigned Numbers Authority) TZ database (also called the Olson database, tzdata) is the canonical mapping of named zones to offsets across time, including DST transitions, historical changes, and political adjustments.
Example: America/Los_Angeles is currently UTC-8 in winter and UTC-7 in summer. In 1942-1945, it observed "War Time" with different offsets. In 2022, the US Senate passed a bill to make DST permanent; it then stalled in the House. If such a bill were enacted, tzdata would record the change.
Store user's TZ as the IANA name. At dispatch, compute now_in_user_tz using a library (Java ZoneId, Python zoneinfo, JS Intl.DateTimeFormat). Never store a numeric offset that ignores DST and political shifts.
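The dispatch-time check can be sketched with Python's zoneinfo. A minimal sketch: the 22:00-08:00 default window and the function names are assumptions, and `zone_for` needs a tz database on the host (system tzdata or the `tzdata` package).

```python
from datetime import datetime, time, tzinfo
from zoneinfo import ZoneInfo


def zone_for(iana_name: str) -> tzinfo:
    """Resolve the stored IANA name, e.g. 'Asia/Tokyo', at dispatch time."""
    return ZoneInfo(iana_name)


def in_quiet_hours(now_utc: datetime, tz: tzinfo,
                   start: time = time(22, 0), end: time = time(8, 0)) -> bool:
    """True if the user's local wall-clock time falls in [start, end)."""
    local = now_utc.astimezone(tz).time()
    if start <= end:
        return start <= local < end
    # Window wraps past midnight (e.g. 22:00 -> 08:00).
    return local >= start or local < end
```

Usage: `in_quiet_hours(datetime.now(timezone.utc), zone_for(user.tz))` with `timezone` imported from `datetime`. Because the offset is resolved from the zone name at call time, DST transitions are handled by the tz database instead of by stored arithmetic.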
22.2. Per-user TZ storage
Storage:
ALTER TABLE users ADD COLUMN tz VARCHAR(64) DEFAULT 'UTC';
-- e.g. 'Asia/Tokyo', 'Europe/Berlin', 'America/Los_Angeles'
Sources:
- User-explicit setting (account preferences).
- App SDK reports device TZ on every backend session (cache for 24h).
- IP geolocation as fallback for unregistered users.
Refresh on every login. Travelers can override manually.
22.3. The "user travels" decision
User normally in America/Los_Angeles. Travels to Tokyo. Device TZ now Asia/Tokyo. App reports this on next session.
Do we use the new TZ for quiet hours?
- For most notifications: yes. Quiet hours follow the device — user doesn't want a 2am Tokyo wakeup.
- For scheduled notifications: it depends. "Remind me at 9am tomorrow" — 9am Pacific (original setup) or 9am Tokyo (current)? Usually: store an explicit (tz_at_time_of_scheduling, local_time) pair. Display "9am Pacific (1am Tokyo)" so the user can re-schedule.
- For 2FA: ignore quiet hours. Security beats sleep.
- For alerting (on-call): the on-call rotation schedule decides who gets paged, not the user's TZ.
22.4. Batched send during user's morning at scale
Marketing wants to send the morning digest at 9am local time to 100M users. Naive: cron job at 9am UTC sends 100M emails — but that's not "user morning" for most.
Better: per-user schedule. Every minute, scheduler picks up the 100M / (24*60) = ~70k users whose 9am_local == now_utc and dispatches.
Trap: users are not spread evenly over the day's 1,440 minutes; time zones cluster around populous areas. 9am Pacific = 5pm UTC (4pm UTC during DST). If a large share of the 100M users sits in one zone, tens of millions of sends land in a single minute: a thundering herd.
Mitigations:
- Drip within window: "send between 9-10am local" → distribute over 60 minutes; even the worst case of all 100M users in one zone becomes ~1.7M users/min sustained.
- Capacity reservation: scale the dispatcher fleet up ahead of each populous zone's 9am (EU-Western, then US-Eastern, then US-Pacific) and down once the peak has passed.
- Prioritization: transactional dispatchers separate from marketing dispatchers — marketing scheduler queues into a dedicated pool that won't crowd out 2FA codes.
- Pre-warm cache: at T-15min, pre-load user preferences for the upcoming window into Redis.
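The drip-within-window mitigation can be sketched as a deterministic per-user offset. A minimal sketch, assuming a SHA-256 hash of the user id as the jitter source and a 9:00 local window start; the function names are hypothetical.

```python
import hashlib
from datetime import date, datetime, time, timedelta, timezone, tzinfo


def drip_offset_minutes(user_id: int, window_minutes: int = 60) -> int:
    """Stable per-user offset so each user keeps the same slot every day."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:4], "big") % window_minutes


def scheduled_send_utc(user_id: int, local_day: date, tz: tzinfo,
                       window_start: time = time(9, 0),
                       window_minutes: int = 60) -> datetime:
    """UTC instant for this user's send inside the local morning window."""
    start_local = datetime.combine(local_day, window_start, tzinfo=tz)
    offset = timedelta(minutes=drip_offset_minutes(user_id, window_minutes))
    return (start_local + offset).astimezone(timezone.utc)
```

In production `tz` would be the user's `zoneinfo.ZoneInfo`. A stable offset also means a given user's digest arrives at the same local minute each day instead of wandering around the window.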
Real example: Mailchimp's "Send at the right time" feature predicts per-user-optimal time within a daily window; it deliberately staggers sends to avoid simultaneous burst.
§23. Multi-device sync
Modern users have multiple devices — iPhone, iPad, Mac, Apple Watch, web browsers across multiple machines. A single notification logically targets the user but physically must be delivered to all their devices, and read state must sync.
23.1. The "all devices" delivery
When a notification fires for user 42:
1. Look up active devices for user 42:
{ iPhone (token_A), iPad (token_B), Mac (token_C), Apple Watch (token_D) }
2. Send to APNs with each token.
3. Inbox row written once (per user, not per device).
APNs has related headers and payload keys, but none of them address a user instead of a device:
- apns-push-type: alert with no special grouping → every device you send to buzzes.
- thread-id (a key in the aps payload) → groups related notifications on the device; the apns-collapse-id header replaces an earlier notification with a newer one.
- APNs has no "all devices of this Apple ID" addressing. Providers send one request per device token, so apps manage per-device tokens.
Android FCM is similar. Its device group messaging can map one notification_key to several registration tokens, but most apps still manage tokens per device and send N times.
23.2. Read state propagation
User reads on iPhone. Mac, iPad, Watch should all clear the badge.
1. iPhone: user taps notification or opens app inbox.
2. App sends POST /notifications/{id}/read to backend.
3. Backend: UPDATE notifications_inbox SET read_at_ms = NOW() WHERE user_id=42 AND notification_id=X.
4. Backend publishes 'read.events' to Kafka: { user_id, notification_id }.
5. Per-device WebSocket workers consume; push state-change to connected clients.
6. iPad / Mac (if app open with WebSocket) update UI immediately.
7. Watch: send a silent (background) push, which APNs delivers on a best-effort basis:
{ aps: { content-available: 1 }, notification_id, action: "mark_read" }
Watch processes silently, badge decrements.
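The hard part: the watch isn't always connected. iPhone is offline (battery dead). Mac is asleep. State-change pushes queue at APNs/FCM for the TTL the sender sets (e.g. ~4h via apns-expiration or the FCM ttl field), and APNs keeps only the most recent notification per app for an offline device. If the device comes online within TTL, state syncs. After TTL, the only sync is "next app open" → fetch unread count from backend.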
23.3. The "dismissed on phone, still showing on watch" UX failure
Common failure: user dismisses on phone, watch still shows the banner. Causes:
- Watch wasn't reachable via APNs at the time of dismiss (Bluetooth disconnected from phone, no Wi-Fi).
- Notification was delivered to phone+watch independently; dismissal on phone didn't propagate.
- App on watch doesn't have background app refresh enabled to process the silent push.
Fix path: in-app inbox as the universal read-state source of truth. When the watch app opens later, it fetches inbox → sees notification X is read → clears.
23.4. The in-app inbox as the universal read-state truth
Push notifications are ephemeral hints. The durable, canonical "what's unread" is the inbox row's read_at_ms field. Every device, on every app open, syncs against the inbox. Even if pushes are lost, missed, or out of order, the inbox is authoritative.
This is why the inbox-write must happen BEFORE the push send. If push goes out and inbox write fails, the user sees a banner that they cannot find later. If inbox write succeeds and push fails, the user sees the notification on next app open — degraded UX, but no information loss.
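The ordering rule can be sketched as a small delivery function. `inbox_write` and `push_send` are hypothetical callables injected by the caller, and the status strings are illustrative.

```python
from typing import Callable


def deliver(notification: dict,
            inbox_write: Callable[[dict], None],
            push_send: Callable[[dict], None]) -> str:
    """Write the durable inbox row first; push is only a hint on top of it."""
    try:
        inbox_write(notification)
    except Exception:
        # No inbox row means no push: a banner the user cannot find later
        # is worse than a delayed notification. The caller retries.
        return "FAILED_INBOX"
    try:
        push_send(notification)
    except Exception:
        # Degraded but safe: the user sees the row on next app open.
        return "INBOX_ONLY"
    return "DELIVERED"
```

Retries of the whole function are only safe if `inbox_write` is idempotent (for example, keyed on notification_id), which this sketch assumes.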
§24. Rich push notifications
Push has evolved well beyond "title + body". Rich features differentiate engagement.
24.1. Image attachments
APNs: mutable-content: 1 + Notification Service Extension. The extension fetches an image URL and attaches it. Image is downloaded on the device.
{
"aps": {
"alert": { "title": "New photo from Alice", "body": "Check it out!" },
"mutable-content": 1
},
"image_url": "https://cdn.example.com/photos/abc123.jpg"
}
FCM: image field in notification payload natively supported.
{
"message": {
"token": "...",
"notification": {
"title": "New photo",
"body": "From Alice",
"image": "https://cdn.example.com/photos/abc123.jpg"
}
}
}
Constraints: image ≤10MB for APNs, ≤1MB for FCM. CDN-hosted (cached, fast). JPEG/PNG/GIF.
24.2. Action buttons
In-notification action buttons let the user reply, mark read, delete, like, etc. without opening the app.
APNs: define a category in the app:
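import UserNotifications

// Text-input action: lets the user reply inline.
let replyAction = UNTextInputNotificationAction(
    identifier: "REPLY",
    title: "Reply",
    options: [],
    textInputButtonTitle: "Send",
    textInputPlaceholder: "Type a message...")
// Plain action: handled without bringing the app to the foreground.
let markReadAction = UNNotificationAction(
    identifier: "MARK_READ",
    title: "Mark as Read",
    options: [])
let category = UNNotificationCategory(
    identifier: "MESSAGE",
    actions: [replyAction, markReadAction],
    intentIdentifiers: [],
    options: [])
// Register the category so the OS can match it to incoming payloads.
UNUserNotificationCenter.current().setNotificationCategories([category])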
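Push payload includes "category": "MESSAGE" inside the aps dictionary; OS renders the buttons. On user tap, the delegate's userNotificationCenter(_:didReceive:withCompletionHandler:) fires; actions without the .foreground option are handled in the background without bringing the app to the front.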
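FCM: on Android there is no action-button field in the notification payload; apps typically send a data message and build the buttons client-side (NotificationCompat.Builder.addAction). FCM's web push config can carry an actions array in the payload.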
24.3. Critical alerts
iOS 12+: Critical alerts override Do Not Disturb and silent mode. The system plays sound regardless of user settings. Requires:
- Special entitlement from Apple (request via developer portal; granted for legitimate use cases — health, safety, emergency).
- User opt-in dialog separate from regular push permission.
- Volume controllable in payload (the sound dictionary takes critical: 1 plus a volume between 0.0 and 1.0).
Use cases: hospital pagers, smoke alarms, security alerts. Misuse = Apple revokes the entitlement.
24.4. Provisional notifications (iOS)
iOS 12+ also: provisional authorization. Apps can show notifications without first asking permission — but they appear quietly in Notification Center, no banner / sound. User sees them and can promote to full notifications, or disable.
Use case: AI-based decisions about visibility. Send 100 candidate notifications quietly; system learns which the user interacts with; promote those styles to full alerts.
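Provisional authorization itself is requested in the app (the .provisional authorization option), not in the payload. Separately, iOS 15+ lets the payload set an interruption level: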
{
"aps": {
"alert": "...",
"interruption-level": "passive"
}
}
Levels: passive, active (default), time-sensitive (bypasses focus modes), critical (above).
24.5. Web push and VAPID
Web push reaches browsers through each vendor's push service: FCM for Chrome, Mozilla's push service for Firefox, APNs for Safari (macOS, and iOS/iPadOS since 16.4 for web apps added to the Home Screen). User grants permission → browser registers with the push service → app stores the subscription endpoint.
VAPID (Voluntary Application Server Identification, RFC 8292) is the auth: server signs requests to the push service with a private key; service verifies with the registered public key. No per-user token in your auth — just per-server identity.
Subscription:
{
"endpoint": "https://fcm.googleapis.com/fcm/send/eXaMpLeToKeN",
"expirationTime": null,
"keys": {
"p256dh": "BNcRdreALRFXTkOOUHK1EtK2wtaz5Ry4YfYCA_0QTpQtUbVlUls0VJXg7A8u-Ts1XbjhazAkj7I99e8QcYP7DkM=",
"auth": "tBHItJI5svbpez7KI4CCXg=="
}
}
Server posts payload + VAPID headers to endpoint. Push service forwards to user's browser. See §27 for full web push detail.
§25. Anti-abuse and spam prevention
Notification fatigue is a real ecosystem-level problem. Users uninstall, mute, or train themselves to ignore notifications when oversaturated.
25.1. Per-user rate limits
Per-channel (push, email, SMS independently):
- 5 push / hour
- 30 push / day
- 10 email / day
- 3 SMS / day
Per-category:
- Marketing: 1/day max, 5/week max
- Social: 10/day max
- Transactional: no cap
- Security: no cap (but throttle bots)
Enforced via Redis token bucket (§4.3). Each delivery attempt checks the bucket. What happens when the limit is reached depends on category:
- Transactional: never rate-limited (defeats purpose).
- Social: dropped silently or rolled into a digest.
- Marketing: dropped silently; counted against engagement metric.
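The bucket check and the per-category handling can be sketched in-process. This is a minimal sketch of the token-bucket technique only: production state would live in Redis (updated atomically), and the class name, `decide`, and the outcome strings are assumptions for illustration.

```python
import time
from typing import Callable

UNCAPPED = {"transactional", "security"}  # categories listed above as "no cap"


class TokenBucket:
    """capacity tokens, refilled continuously at refill_per_second."""

    def __init__(self, capacity: int, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def decide(category: str, bucket: TokenBucket) -> str:
    if category in UNCAPPED:
        return "SEND"  # never consults the bucket
    if bucket.allow():
        return "SEND"
    return "DIGEST" if category == "social" else "DROP"
```

For "5 push / hour", construct `TokenBucket(5, 5 / 3600)` per (user, channel). The injectable clock exists only to make the refill logic testable.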
25.2. Per-sender reputation
Internal: each notification type / campaign has a quality score:
quality_score = w1 * open_rate + w2 * click_rate - w3 * unsubscribe_rate - w4 * complaint_rate
Low-quality campaigns throttled harder. High-quality (high engagement) campaigns get priority.
Marketing team's instinct: send to everyone, daily. Engineering's enforcement: quality score below threshold → reduce reach for next campaign. Pushes the team to send better content.
25.3. The "marketing wants to blast everyone daily" vs "users uninstall" tension
This tension has been studied extensively. Industry norms:
- Push frequency: above ~5/day for non-essential content, uninstall rate climbs sharply. Power users tolerate 10-20/day if all are high-quality (chat apps); marketing-heavy apps see uninstalls at 2-3/day.
- Email frequency: 1-2/week is the "set it and forget it" cadence. 5+/week → unsubscribe rate climbs.
- SMS frequency: 1/week max for marketing. 2+/week triggers complaints.
Counterintuitive: reducing frequency often increases total engagement. Less notification fatigue → users more likely to open the ones you do send.
25.4. Notification fatigue research
Studies (Google research, Facebook internal, university HCI groups) consistently show:
- Users have a small "attention budget" for notifications — once exceeded, all further notifications get dismissed without inspection.
- The first notification of the day has highest engagement; the 10th has near-zero.
- Notifications received during "downtime" (evening, weekend) have lower engagement than work-hours.
- "Important looking" notifications (red badges, exclamation marks) used promiscuously train users to ignore them.
Design implications: ration carefully, prioritize ruthlessly, batch low-value updates, optimize for quality over quantity.
25.5. Quiet periods
Beyond per-user quiet hours: system-wide quiet periods for non-critical content. Examples:
- No marketing pushes between 9pm-9am recipient-local time, in every TZ.
- No marketing emails on weekends.
- "New year detox" — reduce marketing volume in early January when inbox fatigue is high.
- During major news events (election day, disaster) — pause marketing to avoid appearing tone-deaf.
These are policy decisions, not technical limits, but the platform exposes the controls.
§26. In-app inbox patterns
The in-app inbox (the bell icon, the notifications screen) is the durable source of truth. There are two architectural patterns.
26.1. Fanout-on-write to per-user inbox
Covered extensively in §3 and §4.1. Inbox rows are pre-written; user opens inbox; fast partition scan; rows displayed.
Pros: O(1) read, fast UI. Cons: O(F) write cost; storage scales with users × events × duplication.
26.2. Lazy hydration on inbox open
Alternative: don't pre-write inbox rows. Maintain only the event log (which user did what to which post). At inbox open, query "all events relevant to user X in last 30 days" and synthesize the inbox.
1. Event log: per-event row keyed by (source_id, event_id, ts).
2. Subscription log: which user follows which sources, with start_ts.
3. Inbox open:
- Get user's subscriptions.
- For each subscription, scan events since (user's last_seen OR 30d ago).
- Merge, sort, take top 50.
4. Cache result for ~5min per user.
Pros: O(1) write (no per-user duplication). Cheap storage.
Cons: O(F') read per inbox open, where F' is the number of sources the user follows. Cache misses are expensive. Real-time updates (new notification while user is on the inbox screen) require WebSocket or polling.
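The inbox-open steps above can be sketched as a k-way merge over per-source event lists. This is an illustrative sketch only: the in-memory dict shapes and the `hydrate_inbox` name are hypothetical, the "last_seen OR 30d ago" rule is interpreted here as "last_seen, bounded by the 30-day window", and the ~5 minute result cache is left out.

```python
import heapq
from itertools import islice

def hydrate_inbox(subscriptions, event_log, last_seen_ts, now_ts,
                  limit=50, window_s=30 * 24 * 3600):
    """Synthesize an inbox at open time.

    subscriptions: {source_id: start_ts} for one user.
    event_log: {source_id: [(ts, event_id), ...]} sorted by ts, newest first.
    """
    window_floor = now_ts - window_s
    cutoff = window_floor if last_seen_ts is None else max(last_seen_ts, window_floor)
    streams = []
    for source_id, start_ts in subscriptions.items():
        lo = max(cutoff, start_ts)  # never show events from before the follow
        rows = event_log.get(source_id, [])
        streams.append([(ts, source_id, eid) for ts, eid in rows if ts >= lo])
    # Each stream is already newest-first, so a lazy merge yields global order.
    merged = heapq.merge(*streams, key=lambda row: row[0], reverse=True)
    return list(islice(merged, limit))
```

The merge is lazy, so only `limit` rows are materialized after filtering; the per-source scans are where the O(F') cost shows up.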
26.3. Hybrid: pre-write for hot, hydrate for cold
Most systems use hybrid:
- Pre-write for active users (logged in within 30 days).
- Lazy-hydrate for dormant users (logged in >30 days ago).
When dormant user returns, hydrate inbox in background while showing a loading state, then transition.
Storage savings: ~70% of users in a typical product are dormant; pre-writing for them is mostly wasted storage.
26.4. Read-state propagation
Already discussed in §23. Inbox row's read_at_ms field is canonical. Updated when:
- User opens inbox (mark all visible as "seen", not necessarily "read").
- User taps a specific notification (mark that one "read").
- User triggers "Mark all as read."
Propagated to other devices via WebSocket + silent push.
26.5. Retention
Inbox can't grow unbounded. Industry norms:
- Cap at 30 days (Twitter, Facebook, Instagram).
- Cap at 100-500 notifications per user (Slack, Discord).
- Cap at N per category (e.g., 50 social, 100 transactional, unlimited critical).
Implementation:
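- Cassandra TTL (default_time_to_live on the table, or USING TTL per write): the expiry is fixed at write time and rows age out automatically.
- Periodic cleanup job: for each user over the cap, delete rows older than the 100th-newest (more complex, runs nightly).
- Per-category retention: different TTLs per category, set per write; per-category count caps need category in the partition or clustering key, or a post-filter.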
Trade-off: longer retention = more storage, slower inbox queries (more rows to scan), more compaction overhead. Shorter retention = bad UX when users want to find old notifications.
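The count-based caps can be illustrated with a small selection routine that a nightly cleanup job might run per user. This is a sketch under assumptions: the `(ts, category, row_id)` row shape, the `rows_to_delete` name, and the cap values (taken from the "N per category" example above) are illustrative, and unknown categories are treated as uncapped.

```python
from collections import defaultdict

# Per-category caps; None means unlimited.
CAPS = {"social": 50, "transactional": 100, "critical": None}

def rows_to_delete(rows, caps=CAPS):
    """Return row_ids that exceed their category cap, keeping the newest rows.

    rows: iterable of (ts, category, row_id) for a single user.
    """
    kept = defaultdict(int)
    doomed = []
    for ts, category, row_id in sorted(rows, reverse=True):  # newest first
        cap = caps.get(category)
        if cap is not None and kept[category] >= cap:
            doomed.append(row_id)
        else:
            kept[category] += 1
    return doomed
```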
§27. Web push and progressive web apps
Web push extends notifications to browsers, including for sites with no installed app.
27.1. Service Worker subscription
A web push capability requires:
- Service Worker registered for the origin: a background JS script that the browser can wake to handle events (such as an incoming push) even when no tab is open.
- Push permission: user opts in via Notification.requestPermission(). UI prompt; ~5-20% acceptance rate typically.
- Subscription: serviceWorkerRegistration.pushManager.subscribe({ userVisibleOnly: true, applicationServerKey: VAPID_PUBLIC_KEY }). Returns the subscription endpoint + keys.
App POSTs subscription to backend. Backend stores in device_tokens table (platform = WEB).
27.2. VAPID keys
VAPID (Voluntary Application Server Identification, RFC 8292) — the auth model for web push. No platform-issued provider certificate (unlike APNs). Instead, your server generates an ECDSA P-256 keypair:
private key (kept on server)
public key (registered with subscription; embedded in subscribe call)
Each push request includes a JWT signed with the private key. Push service verifies with the public key on file.
JWT claims:
{
"aud": "https://fcm.googleapis.com",
"exp": 1716393600,
"sub": "mailto:admin@example.com"
}
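sub is your contact (browser vendors use this to reach you if your server misbehaves). aud is the origin of the push service that issued the subscription endpoint, and exp must be no more than 24 hours in the future (RFC 8292).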
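As a rough illustration of how these claims are assembled, the sketch below builds the unsigned `header.claims` portion of a VAPID JWT from a subscription endpoint. The function names are hypothetical, and the ES256 signature itself is deliberately omitted because it needs an ECDSA P-256 library outside the standard library.

```python
import base64
import json
import time
from typing import Optional
from urllib.parse import urlsplit

def b64url(data: bytes) -> str:
    """Base64url without padding, as used by JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def vapid_signing_input(endpoint: str, contact: str,
                        now: Optional[int] = None, ttl_s: int = 12 * 3600) -> str:
    """Return the base64url 'header.claims' string that the server then signs."""
    if ttl_s > 24 * 3600:
        raise ValueError("exp may be at most 24 hours ahead (RFC 8292)")
    parts = urlsplit(endpoint)
    now = int(time.time()) if now is None else now
    header = {"typ": "JWT", "alg": "ES256"}
    claims = {
        "aud": f"{parts.scheme}://{parts.netloc}",  # origin of the push service
        "exp": now + ttl_s,
        "sub": contact,
    }
    encode = lambda obj: b64url(json.dumps(obj, separators=(",", ":")).encode())
    return f"{encode(header)}.{encode(claims)}"
```

The finished token is this string plus `.` and the base64url signature; it is sent together with the public key in the Authorization header of each push request.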
27.3. Payload encryption
RFC 8291: web push payloads are end-to-end encrypted between server and browser. The push service (Mozilla Push, FCM, etc.) cannot read content — it just forwards encrypted bytes.
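Encryption:
- Server generates an ECDH ephemeral keypair per push.
- Derives a shared secret with the browser's subscription public key (P-256 ECDH).
- HKDF, keyed with the subscription's auth secret and a random per-message salt, derives the content-encryption key and nonce.
- Encrypts the payload with AES-128-GCM.
- Sends the ciphertext with Content-Encoding: aes128gcm; the salt and the ephemeral public key travel in a header block at the start of the body (the older draft aesgcm encoding carried them in HTTP headers instead).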
Browser:
- Decrypts using its private key + server's ephemeral public key.
- Fires the Service Worker's push event handler.
- Handler calls self.registration.showNotification(title, options).
This is one of the rare end-to-end encrypted notification channels — APNs/FCM operators can read your payload; web push operators cannot.
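With the aes128gcm content coding that RFC 8291 builds on (RFC 8188), the encrypted body starts with a small binary header: a 16-byte salt, a 4-byte record size, a 1-byte key-id length, and the key id, which for web push is the sender's 65-byte uncompressed P-256 public key. The sketch below only frames and parses that header; the function names are hypothetical, and the ECDH, HKDF, and AES-GCM steps are omitted because they need a cryptography library.

```python
import struct

def pack_aes128gcm_header(salt: bytes, sender_public_key: bytes,
                          record_size: int = 4096) -> bytes:
    """Frame the header: salt(16) | record size(4) | idlen(1) | keyid(idlen)."""
    if len(salt) != 16:
        raise ValueError("salt must be 16 bytes")
    if len(sender_public_key) != 65 or sender_public_key[0] != 0x04:
        raise ValueError("expected an uncompressed P-256 point (65 bytes)")
    return salt + struct.pack("!IB", record_size, len(sender_public_key)) + sender_public_key

def unpack_aes128gcm_header(body: bytes):
    """Split a received body into (salt, record_size, keyid, ciphertext)."""
    salt = body[:16]
    record_size, idlen = struct.unpack("!IB", body[16:21])
    keyid = body[21:21 + idlen]
    return salt, record_size, keyid, body[21 + idlen:]
```

For a web push message the header is therefore 86 bytes, followed by the ciphertext and its authentication tag.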
27.4. The "browser push acceptance rate is low" reality
Web push permission prompts are highly disruptive. Chrome and Firefox have made permission UIs increasingly user-protective:
- Chrome 80+: quieter UI for repeat-asks. The default permission rate has been reported at ~10-20% in industry analyses.
- Safari (iOS 16.4+): requires the site to be a PWA installed to home screen.
- Firefox: aggressive prompt-frequency limits.
Best practices:
- Don't ask on first page load; ask after the user demonstrates interest.
- Frame the value: "Get notified when your friend replies" not "We want to send you notifications."
- Don't re-prompt after rejection; it is counter-productive.
Acceptance rate of 5% on a 1M-MAU site = 50k web push subscribers. Better than nothing, but small fraction of audience.
27.5. Browser quirks
- Chrome / Edge: subscriptions managed by FCM under the hood; reliable.
- Firefox: own push service (Mozilla); generally reliable.
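- Safari: APNs under the hood. Standard Web Push with VAPID is supported since Safari 16 on macOS 13 and since iOS 16.4 (Home Screen web apps only); the older macOS-only Safari Push Notifications used APNs certificates.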
- Mobile browsers: Chrome Android supports web push; Safari iOS only for PWAs.
§28. Notification observability
Without metrics, the notification platform is opaque. Observability is essential.
28.1. Per-channel delivery rates
| Channel | Delivery rate target | Definition |
|---|---|---|
| Push (APNs/FCM) | ~90% | API returns 2xx |
| Email (SES/SendGrid) | ~95% | No hard bounce within 24h |
| SMS (Twilio) | ~95-99% | Carrier-delivered status |
| In-app inbox | ~100% | Row written to Cassandra |
Anything below target is investigated. Common causes of drops:
- Push: token rotation, dead tokens accumulating.
- Email: deliverability issues (DKIM rotation incomplete, IP reputation slipping).
- SMS: carrier filtering, registration lapsed.
- In-app: rare — Cassandra outage.
28.2. Open rate / read rate
The "user actually saw it" funnel:
sent (100%) → delivered (95%) → opened (30%) → clicked (3%) → converted (0.5%)
Per-channel and per-campaign breakdowns. Compare to baselines; alert on regression.
Tracking mechanics:
- Email open: tracking pixel (1x1 image hosted on your CDN with ?token=msg_id). Loaded = open. Apple Mail Privacy Protection (MPP) pre-fetches, inflating opens since 2021.
- Email click: rewrite all links through a tracking redirect (example.com/track?url=...&msg=...). Click = redirect logged.
- Push open: when user taps the push, app records launch source.
- SMS click: tracking URL in the message.
- In-app: read state from inbox row.
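The click-tracking redirect can be sketched as a link rewriter plus the matching decoder on the redirect endpoint. This is an illustration under assumptions: `TRACK_BASE`, `rewrite_links`, and `resolve_click` are hypothetical names, and the regex only handles double-quoted `href` attributes.

```python
import re
from urllib.parse import parse_qs, urlencode, urlsplit

TRACK_BASE = "https://example.com/track"  # hypothetical redirect endpoint

def rewrite_links(html: str, msg_id: str) -> str:
    """Route every http(s) link in an email body through the tracking redirect."""
    def wrap(match):
        target = match.group(2)
        if not target.startswith(("http://", "https://")):
            return match.group(0)  # leave mailto:, anchors, etc. untouched
        query = urlencode({"url": target, "msg": msg_id})
        return f"{match.group(1)}{TRACK_BASE}?{query}{match.group(3)}"
    return re.sub(r'(href=")([^"]+)(")', wrap, html)

def resolve_click(tracking_url: str):
    """On the redirect endpoint: recover (destination, msg_id) to log and redirect."""
    query = parse_qs(urlsplit(tracking_url).query)
    return query["url"][0], query["msg"][0]
```

A real redirect endpoint should sign the query or allowlist destinations; an unauthenticated `url` parameter is an open redirect.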
28.3. The funnel from sent to action
Most product analytics platforms (Mixpanel, Amplitude) integrate with the notification platform. Each sent, delivered, opened, clicked, unsubscribed event flows through to the analytics pipeline tagged with user_id, campaign_id, variant.
KPIs by stakeholder:
- Eng: delivery rate, latency, error rates, infra cost.
- Product: open rate, click rate, retention impact.
- Marketing: conversion, attributed revenue.
- Legal/compliance: unsubscribe latency, bounce processing, complaint rate.
Alerts:
- Delivery rate drops >2% absolute → page on-call.
- Spam complaint rate >0.1% (Gmail threshold) → block campaign, page marketing ops.
- IP reputation drops in Postmaster Tools → page deliverability team.
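The first two alert rules reduce to simple threshold checks over a window of counters. A minimal sketch, assuming a hypothetical `ChannelStats` record and alert names, and computing the complaint rate against sent volume (mailbox providers may use a different denominator):

```python
from dataclasses import dataclass

@dataclass
class ChannelStats:
    sent: int
    delivered: int
    complaints: int = 0

def evaluate_alerts(current: ChannelStats, baseline_rate: float,
                    drop_threshold: float = 0.02,
                    complaint_threshold: float = 0.001) -> list:
    """Return the alert actions triggered by one window of channel stats."""
    alerts = []
    if current.sent == 0:
        return alerts  # nothing sent in this window, nothing to judge
    rate = current.delivered / current.sent
    if baseline_rate - rate > drop_threshold:  # absolute drop, not relative
        alerts.append("page_oncall:delivery_rate_drop")
    if current.complaints / current.sent > complaint_threshold:
        alerts.append("block_campaign:spam_complaints")
    return alerts
```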
28.4. Per-recipient delivery audit
For regulated content (HIPAA-adjacent, financial, GDPR data subject requests), you need to prove "we sent notification X to user Y at time Z, via channel C." This requires:
- delivery_status table with per-(event, recipient, channel) row, retained 7+ years for finance, 1+ year for general.
- Append-only audit log (S3 with compliance lock or equivalent).
- Searchable by user_id (for GDPR access requests).
The audit log is separate from the operational stores (which have shorter retention) — S3 / object storage is cheap.
§29. Failure modes not covered earlier
§8 covered the common ones. Some more:
29.1. APNs throttling — IP quarantine
If you exceed APNs' throughput limits or send to many dead tokens, Apple throttles your IP. Throttle persists for hours to days. Recovery: exponential backoff, drain queue slowly, monitor success rate.
Worse: repeated abuse → APNs can quarantine the IP for weeks. Then you have to rotate to a new outbound IP and restart connections. Mitigations:
- Strict 410-handling (immediately remove dead tokens).
- Backoff on 429.
- Stay well under per-IP throughput limits (~10k req/sec rough).
- Spread across multiple outbound IPs.
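The token-hygiene and backoff mitigations come down to how the dispatcher classifies each APNs response. The sketch below is illustrative only: the action names are hypothetical, and it covers just the status codes and the `BadDeviceToken` reason discussed here, not the full APNs error table.

```python
def classify_apns_response(status: int, reason: str = "") -> str:
    """Map an APNs HTTP/2 response to a dispatcher action."""
    if status == 200:
        return "delivered"
    if status == 410 or reason == "BadDeviceToken":
        return "remove_token"        # dead token: stop sending immediately
    if status in (429, 500, 503):
        return "retry_with_backoff"  # throttled or transient server error
    return "drop"                    # other 4xx: retrying will not help
```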
29.2. FCM upstream changes
FCM has had multiple breaking changes:
- 2016: FCM (Firebase Cloud Messaging) introduced as the successor to GCM (Google Cloud Messaging).
- 2018: GCM server and client APIs deprecated; shut down in 2019.
- 2023-2024: Legacy HTTP and XMPP APIs deprecated (2023) and then shut down (2024); FCM HTTP v1 (the newer API with Google service account auth) required.
Each migration broke apps that hadn't updated their server code. Operational learning: monitor Google announcements; subscribe to firebase-talk; budget yearly migration time.
29.3. Email bounce types
- Hard bounce: permanent (550 5.1.1 User unknown, 550 5.7.1 Blocked). Suppress immediately. Retrying is a spam signal that drops your reputation.
- Soft bounce: temporary (452 4.2.2 Mailbox full, 421 4.7.0 Temporary system problem). Retry with backoff: 1h, 4h, 24h, 72h. If still bouncing after 5 attempts over 3 days → treat as hard.
- Spam bounce: bounced because target server thinks you're spam (550 5.7.1 Message rejected as spam). Categorize separately; investigation needed (not just suppress).
- Block bounce: target server has explicitly blocked your IP. Hard to recover; rotate IP after warming new one.
Bounce processing must be reliable — a backed-up bounce queue means you keep sending to known-bad addresses, which destroys your sender reputation.
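A bounce processor can be sketched as a classifier that maps an SMTP reply to an action using the retry schedule above. This is a simplified illustration: the function and action names are hypothetical, spam bounces are detected with a crude keyword match, and block bounces are not distinguished from other hard bounces.

```python
RETRY_DELAYS_H = [1, 4, 24, 72]  # soft-bounce retry schedule, in hours

def classify_bounce(smtp_code: int, text: str, retries_done: int):
    """Return (action, delay_hours) for a bounce; delay is None unless retrying."""
    if 500 <= smtp_code <= 599:
        if "spam" in text.lower():
            return ("investigate", None)  # spam bounce: needs a human look
        return ("suppress", None)         # hard bounce: never send again
    if 400 <= smtp_code <= 499:
        if retries_done < len(RETRY_DELAYS_H):
            return ("retry", RETRY_DELAYS_H[retries_done])
        return ("suppress", None)         # schedule exhausted: treat as hard
    return ("ignore", None)               # not a bounce code
```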
29.4. SES / SendGrid outages
SES has had partial outages (specific regions, specific recipient domains). Mitigation:
- Multi-provider redundancy: SES primary, SendGrid backup. Switch via config flag.
- Per-recipient-domain failover: if SES → @gmail.com is failing but SendGrid works, route Gmail addresses to SendGrid.
- Queue + retry: outage = queue accumulates → drain when recovered. Track queue depth.
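Per-recipient-domain failover can be approximated by tracking recent send outcomes per (provider, domain) pair. The class below is a sketch, not a production router: the class name, provider labels, window size, and thresholds are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque

class DomainFailoverRouter:
    """Pick the primary provider unless it is failing for the recipient's domain."""

    def __init__(self, primary="ses", backup="sendgrid", window=100,
                 min_samples=20, max_failure_rate=0.5):
        self.primary, self.backup = primary, backup
        self.min_samples = min_samples
        self.max_failure_rate = max_failure_rate
        self._results = defaultdict(lambda: deque(maxlen=window))

    def record(self, provider: str, domain: str, ok: bool) -> None:
        self._results[(provider, domain.lower())].append(bool(ok))

    def _healthy(self, provider: str, domain: str) -> bool:
        recent = self._results[(provider, domain)]
        if len(recent) < self.min_samples:
            return True  # not enough evidence to fail over
        return recent.count(False) / len(recent) <= self.max_failure_rate

    def choose(self, address: str) -> str:
        domain = address.rsplit("@", 1)[-1].lower()
        if self._healthy(self.primary, domain) or not self._healthy(self.backup, domain):
            return self.primary
        return self.backup
```

Once traffic for a domain has moved to the backup, nothing in this sketch refreshes the primary's window, so a real router would also send periodic probe traffic or expire old samples.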
29.5. Provider API rate limits
Every external API has rate limits:
- APNs: ~10k req/sec/IP/cert.
- FCM: default quota of 600k messages per minute per project (high; rarely hit).
- SES: per-account TPS (transactions per second), upgradeable from 1 to 10k+.
- Twilio: 1 msg/sec/long-code (US 10DLC) up to 75 msg/sec for high-trust tiers. Short codes 100+/sec.
Hit the limit = 429 / 4xx. Backoff. Stay below 80% of the limit in steady state. Spread across multiple sender accounts if needed.
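Both halves of that advice, staying at 80% of the limit and backing off after a 429, can be sketched with a token bucket and full-jitter exponential backoff. The names and default values here are illustrative assumptions; the caller passes the clock in so the behaviour is deterministic.

```python
import random

def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 300.0,
                  rng=random.random) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap_s, base_s * (2 ** attempt))

class TokenBucket:
    """Client-side limiter that refills at a fraction of the provider's limit."""

    def __init__(self, provider_limit_per_s: float, headroom: float = 0.8,
                 now: float = 0.0):
        self.rate = provider_limit_per_s * headroom
        self.capacity = self.rate  # allow at most one second of burst
        self.tokens = self.capacity
        self.last = now

    def try_acquire(self, now: float) -> bool:
        # Refill for the elapsed time, then spend one token if available.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A dispatcher would call `try_acquire(time.monotonic())` before each request and sleep for `backoff_delay(attempt)` after a 429.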
29.6. DNS failures
Provider hostnames (api.push.apple.com, fcm.googleapis.com, email-smtp.us-east-1.amazonaws.com) resolved via DNS. DNS outages cause cascading failures. Mitigations:
- Long-lived DNS cache locally (resolver running on dispatcher hosts).
- Multiple resolvers configured (your VPC's DNS + Google 8.8.8.8 fallback).
- Pin IPs as fallback (with caution — providers do rotate IPs).
- Service mesh DNS (consul, internal DNS) with health checks.
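The local-cache mitigation can be illustrated with a resolver wrapper that falls back to the last known-good answer when a lookup fails. This is a sketch with assumptions: the `CachingResolver` class is hypothetical, the resolver is injectable so it can be exercised without a network, and cached entries never expire here, which a real deployment should bound because providers rotate IPs.

```python
import socket

class CachingResolver:
    """Resolve hostnames, serving the last good answer if DNS is failing."""

    def __init__(self, resolve=None):
        self._resolve = resolve or self._system_resolve
        self._last_good = {}

    @staticmethod
    def _system_resolve(host: str) -> list:
        infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})

    def lookup(self, host: str) -> list:
        try:
            addrs = self._resolve(host)
            if addrs:
                self._last_good[host] = addrs
                return addrs
        except OSError:  # socket.gaierror is a subclass of OSError
            pass
        if host in self._last_good:
            return self._last_good[host]  # stale but usable
        raise LookupError(f"no DNS answer and no cached address for {host}")
```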
§30. Cost economics
The bottom line. Costs by channel at scale:
30.1. Push
- APNs: free. You pay for infrastructure (dispatcher hosts, persistent connection capacity).
- FCM: free. Same.
- Self-hosted infra cost: ~$0.05-0.10 per 1M push at scale (compute + network).
- SaaS push providers (OneSignal, Airship, Iterable, Braze): $1-10 per 1M push depending on volume tier and features (often bundled with other channels and analytics).
For a 1B-pushes/month product, self-hosting saves $1k-$10k/month vs SaaS — but you need 1-3 engineers maintaining it. Break-even around 100M-1B push/month.
30.2. Email
| Provider | $/1k emails (typical) | Best for |
|---|---|---|
| AWS SES | $0.10 | High volume, low cost; you handle deliverability |
| SendGrid | $0.30-1.00 | Mid volume; managed deliverability; templates |
| Mailgun | $0.30-0.80 | Developer-friendly |
| Postmark | $1.25 | Transactional only; obsessive deliverability |
| SparkPost | $0.20-0.80 | High volume |
| Internal SMTP fleet | $0.01-0.05 + infra | >1B/month; Mailchimp-scale |
For 1B emails/month:
- SES: ~$100k/month
- SendGrid: ~$300k/month
- Internal SMTP: ~$10-50k/month + 5-10 engineers managing deliverability
Inflection: ~100M emails/month is where you start considering internal SMTP. At 1B+/month most large senders build their own.
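The build-versus-buy inflection can be worked through as a break-even calculation. The sketch below uses the per-1k rates from the table; the fixed monthly cost is a purely hypothetical figure (5 engineers at an assumed $20k/month each) and is the main driver of the result.

```python
def monthly_cost(volume: float, rate_per_1k: float, fixed: float = 0.0) -> float:
    """Variable cost at a per-1k rate plus any fixed monthly cost."""
    return volume / 1000 * rate_per_1k + fixed

def break_even_volume(provider_rate_per_1k: float, internal_rate_per_1k: float,
                      internal_fixed_monthly: float) -> float:
    """Monthly volume at which internal sending costs the same as the provider."""
    saving_per_1k = provider_rate_per_1k - internal_rate_per_1k
    if saving_per_1k <= 0:
        raise ValueError("internal sending is not cheaper per message")
    return internal_fixed_monthly / saving_per_1k * 1000

# SES at $0.10/1k vs internal at $0.03/1k, with an assumed $100k/month team cost.
volume = break_even_volume(0.10, 0.03, 5 * 20_000)
```

Under these assumptions the break-even lands at roughly 1.4B emails/month, which is in line with the observation that the largest senders are the ones that build their own fleets.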
30.3. SMS
| Provider | $/1k SMS (US) | International ($/1k) | Notes |
|---|---|---|---|
| Twilio | $7.50-8.50 | $30-200+ | Most common; broad coverage |
| AWS SNS | $6.45 | varies | Cheaper US; less polished |
| MessageBird | $7-9 | varies | EU strong |
| Bandwidth | $4-7 | US-only | Direct carrier relationships |
| Sinch | $6-8 | varies | Volume-focused |
For 10M SMS/month US (e.g., 10M 2FA codes):
- Twilio: ~$75k/month
- Bandwidth: ~$40-70k/month (volume tier negotiated)
- AWS SNS: ~$65k/month
International doubles or triples cost. Premium destinations (Russia, Cuba, parts of Africa) can be 10x US.
30.4. Volume tier negotiations
All major providers offer volume tiers, but the published rates are list prices. Real enterprise rates after negotiation:
- SES: list $0.10/1k → enterprise as low as $0.04-0.06/1k at billions/month.
- SendGrid: list $1/1k at small scale → ~$0.15/1k at billions/month.
- Twilio: list $0.0075/SMS → ~$0.005-0.006 at hundreds of millions/month.
Negotiation levers:
- Multi-year commitment (1-3 years).
- Multi-channel bundle (email + SMS + voice).
- Public case study / co-marketing.
- Migration from competitor (provider will discount to win the deal).
30.5. Hidden costs
- Dedicated IPs (for sender reputation control): $50-500/month per IP, plus warm-up time.
- Compliance tooling: DMARC monitoring (Valimail, EasyDMARC) $500-5000/month.
- Deliverability consulting: $5-50k/month at large scale.
- Analytics & A/B platform: Iterable / Braze / Klaviyo add $5-50k/month bundled.
- Engineering time: dedicated notification platform team = $1-5M/year fully-loaded.
For a Series-B company, "build vs buy" is rarely about per-message cost — it's about engineering opportunity cost. For a public company at $100M+ ARR with notifications central to product, build wins on cost and control.
§31. Summary
Notification delivery is a fanout amplifier with a back-pressure-aware bus: events on a durable queue, a fanout service explodes them into per-user deliveries (sharded by recipient_id), dispatcher workers apply dedup + preference + rate-limit + quiet-hours in a tight Redis loop, persist the inbox to an LSM-tree store (Cassandra, partitioned by user_id) as the source of truth, then push to channel routers that maintain persistent HTTP/2 connections to APNs/FCM with collapse_id for device-side dedup. The central design pivot is the recipient-set distribution: fanout-on-write for normal sources, fanout-on-read for celebrity-shaped sources, engagement-based push sampling on top — turning a 100M-recipient post into ~1M actual pushes. Delivery is at-least-once with idempotency at the inbox row and the device; exactly-once is the wrong contract. The inbox is the source of truth, the push is a hint, and the producer must never see the channels — they emit events to a queue and the platform owns everything that happens after.