Distributed Cache

A technology reference on distributed caches — what they are, how they work at the byte level, where the design space splits, when to pick which variant. Use cases (session stores, feed timelines, rate limiters, leaderboards, ML feature serving, social graphs) appear throughout as illustrations of the same class of technology bent to fit different workloads.

§1. What Distributed Cache Is

A distributed cache is an in-memory key-value store that sits between application servers and slower backing stores, returning sub-millisecond responses to lookups whose result is derivable from somewhere else. The defining property is that no piece of data the cache holds is only there — every entry can be reconstructed from a database, an object store, a service call, or a computation. The cache exists to make access fast, not to remember things.

Distributed caches sit at one corner of a multi-way design space:

Network-attached in-memory key-value store (the canonical distributed cache) — Redis, Memcached, Aerospike, Hazelcast, KeyDB, DragonflyDB. RAM-resident, accessed over TCP, shared across many application instances. The category this doc focuses on.
In-process / embedded cache — Caffeine in Java, lru_cache in Python, Ristretto in Go, Guava Cache (legacy). Lives inside the application process. Nanosecond access, but capacity is per-process and the cache is not shared.
CDN / edge cache — CloudFront, Akamai, Cloudflare, Fastly. Geographically distributed, terminates at the network edge, serves HTTP responses to external clients. Same algorithmic family (LRU, TTL — Time-To-Live, key→value) but optimized for HTTP semantics and edge POPs (Points of Presence), not for backend services calling each other.
Materialized view / derived store — pre-computed query results in PostgreSQL materialized views, Elasticsearch indexes, Pinot tables. Looks cache-like (derived from a primary) but is queried as a system of its own and updated on a different cadence.
Database buffer pool — InnoDB buffer pool, PostgreSQL shared_buffers. Caches pages in the database process itself. Specialized; never shared across instances; not the subject here.

This doc focuses on the network-attached distributed cache because that is where the byte-level mechanics and cross-instance patterns live. Embedded caches and CDNs borrow much of the machinery (LRU, TTL, hash table) but pivot the integration story; we'll contrast them in §3 and §11.

What distributed cache is NOT good for:

Source of truth for anything you cannot regenerate. A Redis instance crashes; its memory is gone; the default fsync=everysec means up to one second of writes might never have hit disk. If "we lost the last second" is a bug, you've put authoritative state in the wrong layer.
Cross-key atomic transactions at scale. A single-shard MULTI/EXEC in Redis is atomic; once you shard into Redis Cluster, a MULTI spanning two slots will not run. Compensation logic ends up in the application.
Storing huge values. Redis hot-path single-threaded loop processes one command at a time; a 100 MB value takes ~10ms to ship over a gigabit link and blocks every other request behind it. Keep value sizes under ~64 KB; ideally under 1 KB.
Workloads where freshness is the contract. A cache that says "I'll give you something recent" is fine; a cache that pretends to be a database with TTL is a foot-gun. If your invariant is "the response always reflects the latest write," the cache is the wrong primitive.
Long-term storage. RAM at AWS list price is ~$3.50/GB-month; SSD is ~$0.08; S3 is ~$0.023. Putting cold data in Redis is paying 150x for memory you're not using.

Mental model: a distributed cache is for the derived, hot, regenerable layer that lets the system of truth run at 1k QPS while the user-facing path runs at 1M QPS. Get the cache wrong and you over-spend on memory or under-cache; get the invalidation wrong and you serve stale state. The cache is the cheapest performance technology in the stack — and the easiest to misuse.

§2. Inherent Guarantees

What the technology provides by design, and what must still be layered above it.

Provided by design

Sub-millisecond reads on hot keys. A GET key on a hot Redis instance with a local connection returns in ~50–200µs end-to-end including TCP round-trip. Memcached at a similar number. The whole reason this technology exists.
Shared visibility across application instances. Any application server connected to the same cache cluster sees the same value (within seconds, modulo replication). A SET from app server A is visible to app server B's GET ~50µs later. In-process caches give you neither.
TTL-based expiry. Set a value to expire in 60 seconds; the cache returns nothing after the deadline. Critical primitive — most stampede problems flow from how this is implemented.
Eviction under memory pressure. When memory fills, the cache evicts according to a configured policy (LRU, LFU, TTL-first, random, no-eviction-rejection). The application doesn't crash; old data drops.
Atomic operations on a single key. INCR, DECR, HSET, LPUSH, ZADD, SETNX, Lua scripts — each runs to completion against the key without interleaving. This is the load-bearing primitive behind rate limiters, distributed locks, and counters.
Pipelining. Multiple commands sent in one TCP write, multiple responses returned in one read. Amortizes RTT (Round Trip Time) over hundreds of operations — moves throughput from 100k ops/sec to 1M ops/sec on a single connection.

Must be layered above

Durability across restart. Default Redis with appendfsync everysec loses up to one second of writes on crash. Memcached has no persistence at all — restart is a clean slate. If a write must survive, write it to a database and either let the cache lag or use write-through with synchronous DB write.
Strong consistency on multi-key operations after sharding. Redis Cluster splits keys into 16384 slots; keys in different slots cannot be combined in a MULTI/EXEC. Hash tags ({user:42}:profile, {user:42}:cart) co-locate keys but cap parallelism. Cross-slot transactions belong in a database.
Cache-database consistency. The cache and the database are two stores updated by application code; the system designer owns the protocol — cache-aside, write-through, write-behind, event-driven invalidation — and the failure modes of each (§7, §12).
Replication and failover. Redis Sentinel monitors primaries and promotes replicas; Redis Cluster has built-in primary-replica failover; Memcached has no built-in replication. The failure model is the designer's job — replication topology, quorum for failover, whether writes during failover are accepted or rejected.
Multi-region coherence. A Redis cluster in us-east-1 and another in us-west-2 do not coordinate. The application or a layered tool (Redis Enterprise Active-Active CRDTs, MySQL-CDC fanout) handles cross-region staleness.
Authentication and tenancy. Vanilla Redis up to v5 had only requirepass. ACLs since v6 give per-user permission grants — but multi-tenant isolation is still weak; one tenant's bug DoSing the cache hurts everyone.
Capacity planning per workload. The cache does not auto-grow. Hot key, hot shard, eviction cascades are operator problems.

Synthesis: the cache guarantees fast, atomic-per-key, shared, RAM-bounded access. Everything around it — durability, multi-key transactions, invalidation, consistency, multi-region — is the system designer's job.

§3. The Design Space

Variants of distributed cache differ along several orthogonal axes. They interact; you don't pick freely.

Data model: simple vs rich

Memcached is the minimalist. Keys map to opaque byte values (up to 1 MB) with TTL and CAS (Check-And-Set) tokens. No data structures; if you want a list, serialize it client-side and replace the whole thing on each update. Wire protocol is text-based and trivial.

Redis is the maximalist. Keys map to one of a dozen typed values: strings, hashes, lists, sets, sorted sets (ZSETs), streams, geospatial indexes, bitmaps, HyperLogLog (HLL — probabilistic cardinality), bit-counting, JSON (via RedisJSON module), time series, vector indexes (via RediSearch). INCR counter, LPUSH queue value, ZADD leaderboard 5000 player_42, XADD events * key val — each is one atomic command. The reason Redis ate Memcached's lunch in greenfield projects post-2012 is the richness of these primitives.

Aerospike is the hybrid. Key-value with secondary indexes, queries, user-defined functions (UDFs in Lua), and crucially persistent indexes on flash (NVMe) — terabyte-class working sets without paying RAM prices. Distributed by design with strong consistency optionally available. Used by AdTech for billion-row user profiles served at sub-millisecond.

Hazelcast is the JVM-native distributed grid. In-memory data grid with distributed maps, queues, topics, locks, and CP subsystem (Raft for strong consistency). Embedded into the application as a JVM library OR run as a separate cluster. Strong fit when the application is Java-heavy and wants near-cache semantics.

Concurrency model: single-threaded vs multi-threaded core

Single-threaded core (classic Redis 6.x and earlier on the command loop). One CPU thread accepts a command, executes it, writes the response, moves to the next. Beautiful for atomicity reasoning (every command is atomic; no internal locks needed) but caps single-node throughput at whatever one core can do — ~100k commands/sec on a modern x86 server. Redis 6+ added multi-threaded I/O (network read/write parsed in parallel) but command execution remains single-threaded.

Multi-threaded core (KeyDB, DragonflyDB, Memcached). Multiple threads execute commands concurrently. KeyDB forks Redis with command-level parallelism via locks. DragonflyDB is the modern rewrite — wire-compatible with Redis but built with shared-nothing per-thread shards and io_uring. Benchmarks show ~3M ops/sec on a single 64-core machine vs ~200k for vanilla Redis. The price is cluster mode and module ecosystem maturity.

Topology: standalone vs Sentinel vs Cluster

Standalone Redis. One primary, optional replicas. Application connects directly. Failover is operator-driven or scripted. Fine to ~100k ops/sec.
Redis Sentinel. Sentinel processes (typically 3 or 5) monitor the primary and replicas; on primary loss, quorum vote promotes a replica. Application connects via Sentinel to discover the current primary. Capacity ceiling is still one primary.
Redis Cluster. 16384 hash slots distributed across primaries; each primary has 0+ replicas; client library understands slot→node mapping and routes commands directly. Adds nodes by reslotting; failover is built-in. Caps multi-key operations to keys in the same slot (use hash tags).
Twemproxy / Envoy / mcrouter. Memcached and pre-Cluster Redis fronted by a proxy that handles routing and connection pooling. Twitter's Twemproxy and Facebook's mcrouter are the canonical examples.

Persistence: ephemeral vs durable

Pure ephemeral (Memcached). No disk. Restart = empty cache. Cold-start storm to the database is the operator's problem.
RDB (Redis Database) snapshots. Periodic point-in-time dump via fork() + copy-on-write. Compact, fast to load, but minutes of writes can be lost. Default fallback.
AOF (Append-Only File). Every write command logged to disk. fsync policies: always (every command, kills throughput), everysec (default — up to 1s of writes lost), no (OS flushes when it feels like it). RDB and AOF can coexist; on restart AOF wins.
Disk-backed values (Aerospike). Indexes in RAM; values on NVMe. Working set lives on flash; capacity per node grows from ~100 GB RAM to ~10 TB SSD with single-digit-microsecond read penalty.

Comparison table

Dimension	Redis (classic)	Memcached	Aerospike	Hazelcast	DragonflyDB	Caffeine (in-process)	TAO (Facebook)
Data model	Rich (12+ types)	Strings only	KV + secondary indexes	Distributed Java maps	Redis-compatible	Java objects	Social-graph objects
Storage	RAM (+ RDB/AOF disk)	RAM only	RAM index + SSD values	RAM (+ persistence)	RAM (io_uring fast)	JVM heap	RAM (over MySQL)
Core threading	Single-threaded exec	Multi-threaded	Multi-threaded	JVM multi-threaded	Multi-threaded shared-nothing	JVM threads	Multi-threaded
Sharding	Cluster (16384 slots)	Client-side hashing	Native distributed	Native distributed	Native distributed	N/A (per-process)	Sharded across regions
Replication	Async, semi-sync	None native	Built-in cross-DC	Built-in cross-DC	Built-in	N/A	Leader/follower regions
Persistence	RDB + AOF	None	SSD durable	Configurable	RDB+AOF compatible	N/A	MySQL durable
Throughput/node	100–200k ops/sec	500k+ ops/sec	1M+ reads/sec	100–500k ops/sec	1–3M ops/sec	100M+ ops/sec/JVM	Internal
Latency p99	<1ms	<1ms	<1ms	<1ms	<1ms	<1µs	<1ms
Atomic primitives	INCR, ZADD, Lua, etc.	INCR, CAS	UDF, CAS	Distributed locks, CP	Redis-compatible	Reference-based	Custom
Best fit	Rich data structures	Pure KV simplicity	Big cache on flash	JVM grid + CP	Modern multi-core	Process-local hot path	Social graph

The synthesis: each row of "best fit" is a consequence of the column entries. Memcached ends up at "pure KV simplicity" because that's all it offers. Redis ends up at "rich data structures" because the typed primitives turn many feature implementations into one-liners. DragonflyDB ends up at "modern multi-core" because the entire architecture is rebuilt around the shared-nothing-per-thread model that classic Redis was not.

A defended pick

If the workload is greenfield, mixed (rate limit + leaderboard + session + ad-hoc lookup), and per-node throughput < ~150k ops/sec, pick Redis. Industry mind-share, library quality, primitives, ops experience — the dominant choice for a reason.

If the workload is pure key-value, you control the client serialization, and you need maximum throughput from minimum memory overhead, pick Memcached. The multi-threaded core and slab allocator are tighter than Redis's general-purpose footprint. Facebook serves the largest cache deployment in the world on Memcached; that's not nostalgia.

If the workload is billion-row user-profile-class with values that don't fit in pure RAM at sane cost, pick Aerospike. AdTech bid-time lookups live here.

If the workload is a single JVM-based service that wants a distributed Java map with near-cache and CP options, pick Hazelcast.

If the workload is bottlenecked by single-thread Redis and you want a drop-in upgrade, pick DragonflyDB. The wire compatibility is the whole pitch.

§4. Underlying Data Structures: Byte-Level Mechanics

The depth section. We zoom into the structures inside a distributed cache — the dictionary, the skip list, the memory allocator, the eviction policy, the persistence pipeline — and walk a single SET + GET byte by byte over the wire.

4.1 The dictionary: incremental rehashing in Redis

Every Redis key lookup goes through dict.c, the hash table implementation. Each database holds a dict mapping sds (Simple Dynamic String) keys to robj values. The structure:

struct dict {
    dictht ht[2];      // two hash tables
    long rehashidx;    // -1 if not rehashing, else current bucket
    ...
};

struct dictht {
    dictEntry **table;  // array of bucket pointers
    unsigned long size; // bucket count (power of 2)
    unsigned long sizemask; // size - 1, for hash & mask
    unsigned long used; // total entries
};

struct dictEntry {
    void *key;
    union { void *val; uint64_t u64; int64_t s64; double d; } v;
    dictEntry *next;   // collision chain
};

The trick is the two hash tables, ht[0] and ht[1]. Normally only ht[0] is live. When used/size > 1.0 (load factor exceeds 1), Redis allocates ht[1] at double the size and enters rehashing mode. From that moment, every command that touches the dict moves a handful of buckets from ht[0] to ht[1] (incremental — typically 1 bucket per operation, with a watchdog limiting elapsed work to 1ms). Lookups during rehashing check both tables. When ht[0] is empty, rehashing completes, ht[0] is freed, and ht[1] becomes the new ht[0].

Why incremental? A naive rehash of a 100M-entry dict would walk 100M buckets and allocate 200M new slots in one operation — hundreds of milliseconds blocking the single-threaded command loop. Incremental rehashing turns that O(n) work into O(1) per command amortized.

Collision resolution is chaining — each bucket holds a linked list of entries that hash to the same slot. Load factor 1.0 with uniform hashing gives an average chain length of ~1; p99 chain length stays under 5. Hash function is SipHash-2-4 since Redis 4.0 (replacing MurmurHash to defend against algorithmic-complexity attacks where an adversary crafts colliding keys).

ASCII view of a rehashing dict mid-flight:

ht[0] (old, size=4)              ht[1] (new, size=8)
┌─────┐                          ┌─────┐
│  0  │ → emptied                │  0  │ → entry_A
├─────┤                          ├─────┤
│  1  │ → emptied                │  1  │ → null
├─────┤                          ├─────┤
│  2  │ → entry_X → entry_Y      │  2  │ → null
├─────┤  (still here)            ├─────┤
│  3  │ → entry_Z                │  3  │ → null
└─────┘                          ├─────┤
   ↑                             │  4  │ → entry_B → entry_C
   rehashidx = 2                 ├─────┤
   (next bucket to move)         │  5  │ → null
                                 ├─────┤
                                 │  6  │ → null
                                 ├─────┤
                                 │  7  │ → null
                                 └─────┘

A GET entry_X traverses ht[0][2] first, finds entry_X, returns. A GET entry_A misses ht[0][?] and finds it in ht[1][0]. After the lookup, Redis moves bucket 2 from ht[0] to ht[1], advances rehashidx to 3, then ultimately to "no rehashing" once ht[0] is empty.

4.2 Skip list for sorted sets

A Redis sorted set (ZSET) is a (member, score) pair where members are sorted by score with ties broken by lexicographic member order. The data structure underneath is a skip list paired with a hash table — the hash table indexes by member for O(1) lookup, the skip list is for ordered traversal and range queries (ZRANGE, ZRANGEBYSCORE, ZRANK).

A skip list is a probabilistic balanced search structure. Each node has a height drawn from a geometric distribution: level 1 with probability 1, level 2 with probability 0.25, level 3 with probability 0.0625, and so on (Redis uses p=0.25 and max level 32). Higher levels are sparser "express lanes" through the list.

Level 4:  HEAD ───────────────────────────────► TAIL
Level 3:  HEAD ────────────► [50] ────────────► TAIL
Level 2:  HEAD ───► [20] ──► [50] ─► [80] ───► TAIL
Level 1:  HEAD ─► [10] [20] [35] [50] [65] [80] [95] ► TAIL

ZRANGEBYSCORE leaderboard 50 80 starts at HEAD at the top level, drops down until the next pointer would overshoot 50, descends to the next level, walks rightward, and ultimately lands at score 50 in O(log n) expected time. The range scan from 50 to 80 then walks level 1's linked list — sequential, cache-friendly.

Why skip list over a balanced tree (red-black, AVL, B+ tree)?

Simpler code. No rebalancing rotations. Insert: roll a random level, fix up forward pointers. Delete: fix up forward pointers. ~150 LOC vs ~600 LOC for a red-black tree. Bugs are cheaper.
Same asymptotic complexity (O(log n) for search/insert/delete) with comparable constant factor.
Range scans are linear — once you find the start, you walk a doubly linked list.
Concurrent variants (not relevant in single-threaded Redis but used in JVM concurrent skip-list map) are easier than concurrent trees.

The price is slightly worse worst-case (skip list is probabilistic; with very bad luck, the structure can degrade) and slightly more memory (each level adds a forward pointer per node). Redis's leaderboard primitive — ZADD leaderboard 5000 player_42, ZRANGE leaderboard 0 9 WITHSCORES, ZRANK leaderboard player_42 — runs on this structure. A 10M-entry leaderboard delivers a top-10 query in ~20µs.

4.3 Memory allocators: jemalloc vs slab

A cache lives or dies by its memory allocator. Two dominant patterns.

jemalloc (Redis default). General-purpose thread-cached allocator. Maintains size classes from 8 bytes to multi-MB; each class has thread-local free lists. Allocations round up to the nearest class. A SET key value allocating 73 bytes lands in the 80-byte class — 7 bytes of internal fragmentation. Aggregated across millions of objects, fragmentation is ~5–15%. Redis exposes this via INFO memory: used_memory_rss / used_memory = mem_fragmentation_ratio. Targets: ~1.05–1.5; > 1.5 means too much fragmentation; < 1.0 means data has been swapped out (configuration emergency).

Slab allocator (Memcached's choice). Pre-allocate large slabs of memory, each carved into fixed-size chunks of a specific size class. A chunk is the unit of value storage. When a slab class is full, allocate another slab to it.

Slab class 1: chunks of 96 bytes  → slab 1 of 1MB = 10922 chunks
Slab class 2: chunks of 120 bytes → slab 2 of 1MB = 8738 chunks
Slab class 3: chunks of 152 bytes → ...
...
Slab class 42: chunks of 1 MB    → one chunk per slab

Storing a 100-byte value picks the 120-byte class — 20 bytes of internal fragmentation. Worse than jemalloc's typical fragmentation but with two compensating wins: (a) deallocation is O(1) — return the chunk to the slab's free list; no coalescing, and (b) write-amplification is zero — values don't migrate.

The slab allocator's failure mode is slab calcification. Imagine you start by writing 1 KB session blobs. Slab class for 1 KB fills with millions of slabs. Then your workload shifts to writing 256 KB images. Slab class for 256 KB is empty. To allocate a 256 KB chunk, Memcached must allocate a fresh slab — but the system has no free pages, all memory is owned by the 1 KB class. Pre-1.4.11 Memcached, you were stuck — even if 1 KB items had expired, their slabs stayed assigned. Modern Memcached has slab rebalancing to migrate empty slabs between classes, but the operator must enable and tune it.

Trade-off summary: jemalloc absorbs variable workloads gracefully but has higher internal fragmentation and longer-tail latencies under memory pressure. Slab is optimal for fixed-size objects (Memcached is excellent for "many small items of similar size" like session blobs and rendered HTML fragments) and degrades under polymorphic workloads.

4.4 Eviction policies: where the cache earns its keep

When memory hits maxmemory, the cache must drop something. The policy choice can change hit rate by 20%+. The four families.

LRU (Least Recently Used). Track access order; evict the entry not used for the longest time. Standard implementation: doubly linked list with HEAD (most recent) and TAIL (oldest). Every access moves the entry to HEAD; eviction takes from TAIL.

HEAD ─► [key_A] ─► [key_B] ─► [key_C] ─► ... ─► [key_Y] ─► [key_Z] ─► TAIL
                                                              ↑
                                                  next victim if eviction triggers

After GET key_C:

HEAD ─► [key_C] ─► [key_A] ─► [key_B] ─► ... ─► [key_Y] ─► [key_Z] ─► TAIL

Real Redis does approximate LRU — sampling 5 random keys (configurable maxmemory-samples) and evicting the least-recently-accessed of the sample. Sampling avoids maintaining a perfect doubly linked list of millions of entries (which would double the memory footprint per entry). With 10 samples, approximation accuracy is ~99% of perfect LRU.

LRU's weakness: scan resistance is zero. A long scan touching every key once shifts the entire access frontier; hot keys get evicted. The streaming SELECT * FROM events problem reappears in cache form.

LFU (Least Frequently Used). Track per-key access count; evict the least-frequently-used. Naive implementation explodes memory (a counter per key). Redis uses a probabilistic counter (LFU since v4.0) — a logarithmic counter Math.log(N+1) stored in 8 bits per key, with decay so old activity fades.

Decay logic: every minute, halve the counter of every visited key
             (probabilistically — sample, decay, sleep).

Increment logic: P(increment) = 1 / (factor * counter + 1)
                 — so counter grows logarithmically with hits.
                 — 8-bit counter reaches max ~10M hits.

LFU is scan-resistant (a one-touch scan doesn't bump counters past hot keys) and adapts well to skewed traffic — but slow to adapt to shifting workloads (yesterday's hot key takes hours to age out).

W-TinyLFU (Window TinyLFU) — the modern choice. Combines LRU's recency-awareness with LFU's frequency-awareness via an admission filter based on the count-min sketch — a probabilistic frequency estimator. Pioneered in academic literature, popularized by Caffeine (Java in-process cache, used by Netflix EVCache hot tier) and EVCache (Memcached + Caffeine layered).

The architecture:

                  ┌──────────────────────────────────────┐
   New key ──────►│  Count-Min Sketch                    │
                  │  (frequency estimator, 4 bits/entry, │
                  │   ~1% of cache memory)               │
                  └─────────────┬────────────────────────┘
                                │ estimated_freq(new) > estimated_freq(victim)?
                                │
        ┌───────────────────────┴───────────────────────┐
        │ Admission filter:                             │
        │ if new key's freq > LRU-victim's freq, admit. │
        │ else reject (keep the existing entry).        │
        └─────────────────────┬─────────────────────────┘
                              │ admit
                              ▼
                  ┌──────────────────────────────────────┐
                  │  Main cache (LRU window + Segmented  │
                  │   LRU with probation and protected   │
                  │   regions)                           │
                  └──────────────────────────────────────┘

The count-min sketch is a 2D array (depth × width) of counters with multiple hash functions; increments are added to one cell per row; estimated frequency is min(counter[row_i][hash_i(key)]). The min part bounds overestimation; the sketch never underestimates. A 4-bit counter saturates at 15 — combined with periodic halving (aging) this tracks "recently frequent." Memory cost: ~4 bits/key.

W-TinyLFU's hit-rate advantage over LRU is 5–20% on production traces (Caffeine benchmarks across YCSB, search-engine, web traces). For a cache at 1M QPS, a 5% hit rate improvement is 50k fewer database reads per second. That's why Caffeine became the default JVM in-process cache, why Aerospike, EVCache, ScyllaDB, Hazelcast moved toward TinyLFU variants.

FIFO with TTL filter (Memcached classic). Evict whatever is oldest in the slab class regardless of access. Cheap and simple. Worst hit rate but works for fixed-size append-only patterns.

Eviction matrix (defended picks):

Workload	Pick	Why
Power-law popular keys, stable	LFU	Frequency dominates
Recent activity, sliding hot set	LRU	Recency dominates
Mixed; scan resistance matters	W-TinyLFU	Best hit rate across benchmarks
Fixed-size session blobs, ample memory	FIFO + TTL	Simplicity; capacity isn't the binding constraint
Append-only events	No eviction + TTL	If eviction is hit, system is misconfigured

4.5 Redis persistence: RDB and AOF mechanics

The cache is in RAM, but Redis persists for restart recovery. Two mechanisms.

RDB (Redis Database) — point-in-time snapshot. Triggered by SAVE (foreground, blocks), BGSAVE (background fork), or schedule (save 60 10000 = snapshot every 60s if ≥10000 keys changed). The mechanism:

Redis calls fork(). The OS creates a child process with the parent's memory map as copy-on-write — both processes see the same physical pages until one writes.
The child walks the in-memory dictionary, dumps every key+value to a temp file in RDB format (binary, length-prefixed).
The parent continues serving commands. Writes touch pages → kernel copies the page to a new physical frame for the parent → child still sees the original.
The child renames the temp file to dump.rdb and exits.

The memory overhead is the copy-on-write fault rate × page size during the snapshot. On a 32 GB Redis with 100k writes/sec, ~10–20% of pages get touched during a 60-second snapshot — peak RAM usage spikes from 32 GB to ~36 GB. Capacity planning must headroom for this; otherwise the OOM-killer takes Redis.

RDB strengths: compact (binary), fast to load (just read sequentially and rebuild dicts — ~100 MB/sec). Weaknesses: minutes of writes lost if the box dies between snapshots.

AOF (Append-Only File) — write log. Every write command appended to appendonly.aof as a RESP-encoded line. fsync policies:

appendfsync always: fsync on every command. Bulletproof durability. ~1ms per command — kills throughput.
appendfsync everysec (default): background thread fsyncs once per second. Up to 1s of writes lost on crash. The common choice.
appendfsync no: never explicitly fsync; let the OS flush when it wants. ~30s of writes at risk.

The AOF grows unboundedly — every SET key val is 50+ bytes — so Redis periodically rewrites it. Rewrite forks a child that walks the live dictionary and writes a fresh AOF containing one logical command per key (SET key current_value instead of the history of 100 SETs that arrived at this value). New writes during rewrite go to a buffer that's appended after the child finishes. Same copy-on-write fork mechanic as RDB.

Combined RDB+AOF. Best practice. RDB for fast cold-start; AOF for the recent writes since the last snapshot. On restart, Redis prefers AOF.

Memcached has no persistence. Restart = empty cache. The justification: caches should be regenerable. The reality: cold caches DoS the database. Production Memcached deployments at Facebook scale solved this with custom warming pipelines.

4.6 One concrete operation: SET + GET byte by byte

Walk a single SET user:42:profile {...json...} EX 300 followed by a GET user:42:profile end to end.

Wire protocol (RESP — REdis Serialization Protocol).

The client serializes SET user:42:profile <value> EX 300 as:

*5\r\n
$3\r\nSET\r\n
$15\r\nuser:42:profile\r\n
$N\r\n<value bytes>\r\n
$2\r\nEX\r\n
$3\r\n300\r\n

Five-element array, each element prefixed with its length. Plain text framing — readable in tcpdump, trivial to implement in any language. The cost is ~25–40 bytes of framing per command; pipelining amortizes this over many commands.

1. TCP socket arrival. The TCP socket layer assembles the bytes and signals readability to Redis's epoll loop. Redis 6+ uses a small pool of I/O threads to parse RESP — multi-threaded I/O — while the command execution itself stays single-threaded.

2. RESP parse. The parser walks the buffer, validates lengths, builds a client->argv array of [SET, user:42:profile, <value>, EX, 300]. ~1µs of CPU.

3. Command dispatch. Redis looks up SET in the command table — itself a hash table keyed on command name. Finds the function pointer setCommand.

4. Key hashing. hash(user:42:profile) via SipHash → bucket index in db->dict.

5. Dict insert (incremental rehashing). If currently rehashing, move 1 bucket from ht[0] to ht[1]. Then insert. If the key existed, replace; if not, allocate a new dictEntry, set key (an sds of the key string) and value (an robj pointing to the value sds).

6. TTL bookkeeping. Insert/update an entry in the db->expires dict — a parallel hash table keyed on the same key, holding the expiration timestamp in milliseconds. Expiration checking is a hybrid: every command checks the expires dict on access (passive); a background cron samples random keys and lazy-evicts expired ones (active). Together, ~1% of expired keys live past their deadline in steady state.

7. AOF append. If AOF is on, serialize the command back into RESP and append to the in-memory AOF buffer. Will be fsync'd by the background thread.

8. RDB no-op. RDB doesn't react to individual writes.

9. Replication. If a replica is connected, queue the command for the per-replica replication output buffer. The replica's slave_io_thread (analogous to MySQL's IO thread) pulls bytes over TCP, replica's replication thread executes the command locally. Async by default.

10. Response. Redis writes +OK\r\n (3 bytes) to the client's output buffer, scheduled for sending by the I/O thread.

Total wire-to-ack time: typically 80–150µs. Of that, ~70% is network — kernel TCP stack, NIC roundtrip — and ~30% is in-Redis CPU.

Now the GET.

Client sends:

*2\r\n
$3\r\nGET\r\n
$15\r\nuser:42:profile\r\n

Parse same way. ~600ns.
Dispatch to getCommand.
Lookup in db->expires: is this key past its deadline? If yes, delete it now (lazy expiration), respond with $-1\r\n (nil).
Lookup in db->dict: walk the chain at the hashed bucket. Compare keys (memcmp on the sds bytes). Find it.
Reply with ${len}\r\n<value>\r\n. Schedule send.

Total: 40–100µs for a hot key.

Crash scenarios for SET:

Crash between step 5 (dict updated) and step 7 (AOF appended): the value is in RAM only. RAM is gone with the crash. Restart loads from AOF — which lacks the command. Result: the write didn't survive. For caches, this is acceptable; for source-of-truth, it would be a bug.
Crash between step 7 (AOF in memory) and the background fsync (default 1s later): AOF buffer is lost. Up to 1s of writes vanish. The "everysec" durability guarantee.
Crash after fsync but before the replica got the command: the primary recovers from AOF, replica is behind; promotion of replica during the gap would silently lose this write. The trade-off of async replication.

These are the boundaries of "cache durability." Plan around them.

§5. Capacity Envelope

What this technology can do, illustrated across very different scales.

Single-node Redis — startup scale. A vanilla Redis on AWS r6i.2xlarge (8 vCPU, 64 GB RAM) sustains ~100k ops/sec at p99 <1ms for mixed reads/writes with values <1 KB. Plenty for most production services through ~100M monthly active users. The single-threaded command loop is the bottleneck; one core pegs at 100% before any other limit. Hacker News, Basecamp-style sites, and most early-stage SaaS run their entire cache on one Redis primary + hot standby.

Sharded Redis — mid scale. Twitter's home timeline cache historically ran on Redis Cluster with hundreds of nodes; ~3M reads/sec aggregate; per-user timeline as a Redis list. GitHub runs Redis at the heart of Rails sessions and ActionCable — ~1M ops/sec across multiple clusters. Snapchat runs messaging on Redis Cluster, low-millisecond delivery worldwide.

Mid-large Memcached — Pinterest. Billions of pin reads per day from a Memcached fleet across thousands of machines. Hit rate >99%; MySQL serves <1% of read traffic. Cache is the primary read path; database is the consistency anchor only.

Giant — Facebook. Largest cache in the world. Public papers (NSDI 2013 "Scaling Memcache at Facebook"; 2020 "CacheLib") describe hundreds of millions of requests/sec across thousands of nodes globally — mcrouter, lease-based gets, regional clusters. Database reads are anomalies, not the norm.

Twitter Twemcache. Memcached fork with B-tree storage and segmented LRU per slab class. Trillions of requests/day across thousands of nodes. Twemproxy is one of the most influential pieces of cache routing software in industry.

Netflix EVCache. Memcached fork with cross-AZ replication, async mirroring to a secondary cluster, Caffeine L1 in front. ~30M+ ops/sec global aggregate; sub-millisecond from any region; backs movie metadata, A/B tests, recommendation candidates.

Facebook TAO. Write-through cache between web tier and sharded MySQL. Billions of reads/sec; tens of millions of writes/sec. Models the social graph. Region-leader/region-follower architecture; writes go to leader, propagate via MySQL replication.

LinkedIn caching. Couchbase and Memcached for profile data, sessions, feed materialization. The profile cache fronts Espresso at ~10M+ QPS aggregate. Sub-millisecond profile lookups load-bear every page render.

The range — startup 100k ops/sec, Twitter 3M, Pinterest billions/day, Facebook hundreds of millions/sec — is four orders of magnitude wide for the same technology class. What scales is the partitioning and multi-tier topology around the cache nodes; the per-node store (hash table + skip list + slab/jemalloc) is the same on a single Redis as on Facebook Memcached.

§6. Architecture in Context

The canonical pattern for a distributed cache in a production system. The shape recurs across session stores, feed timelines, leaderboards, ML feature serving, rate limiters.

                              ┌──────────────────────────────────┐
                              │   Client (browser, mobile)       │
                              └──────────────┬───────────────────┘
                                             │
                                             ▼
                              ┌──────────────────────────────────┐
                              │   CDN / Edge cache               │
                              │   (CloudFront, Akamai, Fastly)   │
                              │   HTTP-level, geographically      │
                              │   distributed                    │
                              └──────────────┬───────────────────┘
                                             │ (cache miss)
                                             ▼
                              ┌──────────────────────────────────┐
                              │   Application servers            │
                              │   (stateless, horizontal)        │
                              │                                  │
                              │   L1: in-process cache           │
                              │       (Caffeine, ~10ms TTL,      │
                              │       per-instance, ~100MB)      │
                              └──────────────┬───────────────────┘
                                             │ (L1 miss)
                                             ▼
                              ┌──────────────────────────────────┐
                              │   L2: distributed cache cluster  │
                              │   (Redis Cluster, Memcached,     │
                              │    Aerospike)                    │
                              │                                  │
                              │   Sharded by hash(key)           │
                              │   Per-shard primary + replicas   │
                              │   ~minutes-hours TTL             │
                              └──────────────┬───────────────────┘
                                             │ (L2 miss)
                                             ▼
                              ┌──────────────────────────────────┐
                              │   Source of truth (database)     │
                              │   (MySQL, PostgreSQL, Cassandra) │
                              │   Sharded, replicated, durable   │
                              └──────────────────────────────────┘

   Invalidation paths:
   ──────────────────
   Write → DB → CDC (Debezium / Brooklin) → Kafka → cache invalidator
                                                       │
                                                       └─► DEL L2:key
                                                       └─► broadcast invalidate L1

Three levels of cache; one level of source-of-truth. Each layer has:

L0 (CDN / Edge). Geographically distributed; serves HTTP responses to external clients. Very long TTLs (hours to days) for public content. Bypasses the application entirely.
L1 (in-process). Caffeine, Guava, Ristretto. Nanosecond access. Per-instance, capped at a few hundred MB. Short TTL (10s of ms to a few seconds) to bound staleness — and because cluster invalidation is harder for L1 than L2 (would need to broadcast invalidation to every instance).
L2 (network distributed cache). Redis or Memcached cluster. Sub-millisecond. Shared across all instances; minutes-to-hours TTLs.
L3 (database). Source of truth. p99 ~5–50ms. The cold path.

The annotations:

Sharding at the cache router by hashing the cache key. Consistent hash (Memcached client libraries, mcrouter), 16384 slot ring (Redis Cluster), or proxy-mediated (Twemproxy, Envoy).
Replication within a shard group. Redis primary fans out to ≥1 replica; promotion on primary loss.
Invalidation at the CDC (Change Data Capture) layer. A DB write fans out to Kafka; a cache-invalidator service consumes and issues DEL key to L2; optionally broadcasts a per-key invalidate to L1 via pub/sub.

This shape is the same whether the application is a social feed, a payment dashboard, a leaderboard, an ML feature serving layer, or a rate limiter. The variant (Redis vs Memcached, Cluster vs Sentinel, L1 in front or not) changes by workload; the layout doesn't.

Note that CDN and the cache here are different beasts: CDN is HTTP-aware and lives at the edge for external client traffic; the distributed cache is RESP/binary-aware and lives at the data center for backend-internal traffic. They cooperate but solve different problems.

§7. Hard Problems Inherent to Distributed Cache

Six fundamental challenges. Each shows up regardless of use case; illustrating examples come from across domains.

7.1 Cache stampede (thundering herd)

One line. A popular cached entry expires; many concurrent requests miss simultaneously; all hit the database in parallel; database melts.

Where it shows up. A trending news article's cached HTML expires; 50k QPS of concurrent requests miss; the DB query that takes 200ms gets called 10,000 times in parallel. A Twitter feed for a celebrity expires; 10k follower-fetch operations hit the graph database simultaneously. A featured product page's price-and-inventory cache expires during a flash sale.

Naive fix: longer TTL. But longer TTL = staler data, and the stampede just happens less often, equally violently.

Why the naive fix breaks. With TTL=300s and 50k QPS, at every 300-second boundary you get a burst of 50k concurrent DB calls. The database p99 inflates from 5ms to 5000ms during the burst; user requests time out; the cache stays empty because the slow DB calls haven't returned to populate it; more requests pile up missing; cascading failure.

Real fixes (layered, choose the right one):

Single-flight / request coalescing. Only one request per key fetches from the DB at a time; concurrent requesters for the same key wait on the same future. In-process with sync.Once (Go) or Guava Loading Cache. Across instances, use a distributed lock — SETNX cache:lock:key 1 EX 30 — only the lock holder calls the DB; others retry-with-backoff or read stale.
Probabilistic early expiration. Each requester computes a probability of recomputing slightly before TTL: if random() < e^(-(remaining_ttl - some_constant)/beta) then recompute. Spreads the refresh across a window before the hard expiration, smoothing the herd. Implemented in many cache libraries; called XFetch in the literature.
Stale-while-revalidate. Serve the stale value beyond TTL while asynchronously fetching the fresh one. Two TTLs — fresh_ttl and stale_ttl > fresh_ttl. Between them, the entry is served stale but the cache triggers a background refresh. Latency for the user stays low; database load stays bounded.
Hard cap on concurrent DB calls per key. A semaphore — at most N concurrent requesters fetch from the DB per key — the rest wait or read stale.

The Facebook TAO paper formalized leases for this: on a cache miss, the cache issues a lease token to one requester ("you fetch and SET"); concurrent missers receive a "wait — someone is fetching" signal. Pinterest's Memcached layer implements a similar protocol.

7.2 Hot key

One line. One specific key receives so much traffic that the single node holding it saturates while the rest of the cluster idles.

Where it shows up. Elon Musk's profile cache key receives 50k QPS at peak; that one Redis shard has its CPU pegged at 100%; the other 99 shards are at 10%. A trending ZSET leaderboard gets 30k ZADDs/sec; one shard saturates. A "current global counter" key (active users count, total likes) gets every page's GET.

Naive fix: add more cache nodes. Doesn't help — the key is still on one node.

Why the naive fix breaks. Consistent hashing puts a key on exactly one node. Adding cluster capacity doesn't move that key; it just spreads the cold keys further. The hot key's owner stays saturated. p99 latency on the hot key inflates to 50–200ms because of queueing on the single node's command loop.

Real fixes:

Local in-process L1. Each application instance caches the hot key locally with a short TTL (1s, 5s). With 100 instances and 50k QPS aggregate, each instance serves 500 requests/sec from its own L1; L2 sees only 100 RPS (one refresh per instance per second). Cost: ~1s staleness, which is usually fine.
Replicate the hot key to N shadow keys. hot_key becomes hot_key:0, hot_key:1, ... hot_key:9. Each client picks a shadow at random for reads. Writes fan-out to all shadows. Spreads load 10x. Requires application awareness — the cache doesn't know which keys are hot.
Read from replicas. If the shard has replicas, route GETs across primary + replicas. With 1 primary + 3 replicas, that's 4x read capacity for the hot key — at the cost of replication lag visibility (~1ms typical, but writes look stale on replicas briefly).
Promote the key to a CDN. If the value is HTTP-shaped and public, push it to the edge. Now millions of clients hit the CDN, not the cache.

Twitter publicly described shadow key replication as the core of their celebrity-feed handling. Pinterest combines L1 + sharding + CDN for pin metadata. There's no single fix; the right answer is layered.

7.3 Invalidation correctness

One line. "There are only two hard things in Computer Science: cache invalidation and naming things." — Phil Karlton. The cache says X; the database says Y; one of them is lying.

Where it shows up. A user updates their profile bio; the DB has the new bio, the cache has the old; subsequent reads serve the stale bio. A product's price changes from $50 to $40; the cache still says $50; customers checkout at $50. A user is banned; the auth cache still says "active"; the banned user keeps browsing.

Naive fix: "I'll just DELETE the cache key on every write." Welcome to the race condition.

Why the naive fix breaks. Concrete race condition on cache-aside with DELETE:

t=0  Client A: SET cache[K] = old_value (after a previous read).
t=1  Client B: UPDATE DB[K] = new_value
t=2  Client B: DELETE cache[K]   (cache is now empty)
t=3  Client C: GET cache[K] → miss
t=4  Client C: SELECT DB[K] → reads in transaction snapshot before t=1
                (or via a replica that hasn't caught up)
                returns old_value
t=5  Client C: SET cache[K] = old_value   ← cache now permanently holds old_value

The cache is poisoned with stale data, and there's no further invalidation event coming. Could stay wrong until TTL.

Real fixes (a layered story):

TTL as the safety net. Always set a TTL; never trust your invalidation pipeline to be perfectly delivered. A 5-minute TTL bounds the maximum staleness even if a DELETE event is lost.
Write-through. Updates write through the cache to the DB synchronously; cache and DB are updated in the same code path; no race. Cost: cache must understand DB write semantics; double write latency.
Write-around with versioned keys. Cache key is user:42:profile:v17; an update writes a new row + increments the version; subsequent reads ask for user:42:profile:v18. The old key just ages out via TTL; never invalidated. Used by ad-bidding systems where freshness > write coordination.
CDC-driven invalidation. Debezium reads MySQL binlog or PostgreSQL WAL; emits events to Kafka; cache-invalidator service consumes and issues DELs. The CDC stream is strictly ordered per row, so the t=4 scenario above is avoided: the DELETE arrives after the DB write that triggered it, so a re-read post-DELETE sees the new value. LinkedIn (Brooklin → Kafka → cache invalidators) and Pinterest both use this pattern at scale.
Negative caching. Cache "doesn't exist" for keys that miss the DB. Prevents stampede on adversarial enumeration ("does account_12345 exist? no? does account_12346 exist?"). Set short TTL because new accounts get created.

The honest truth: perfect invalidation is impossible without a transaction across cache and DB. Settle for "bounded staleness" via TTL, and use CDC to drive most invalidations correctly.

7.4 Consistent hashing under churn

One line. When a cache node is added or removed, only 1/N of keys should remap; with naive hash(key) % N, all keys remap.

Where it shows up. Memcached cluster grows 100 → 101 nodes; modulo changes for ~99% of keys; cache cold-starts. A node fails (100 → 99). Doubling the cluster (10 → 20). All invalidate everything.

Why the naive fix breaks. A 100-node cluster at 10M QPS with 99% hit rate: a remap drops hit rate to ~1%; database load spikes from 100k QPS to 9.9M QPS; cascading outage.

Real fixes:

Ring hashing (Karger 1997, original consistent hash). Hash each node to multiple points on a 0..2^32 ring; hash key to a point; key goes to next node clockwise. Adding a node steals a slice from each neighbor; removing distributes to next-clockwise. Only 1/N keys move. Virtual nodes (each physical node hashed to ~100–500 ring points) smooth distribution.

   Ring (0..2^32, simplified):
   0  10  20    35  50    65   80   95   100
   ●───●───●────●───●─────●────●────●────●
       ↑ key_X hashes here, goes clockwise
   Adding node at position 22: keys between 20 and 22 (previously → 35) now go to new node.

Rendezvous hashing (HRW — Highest Random Weight). For each key, compute hash(key, node_id) for every node; key goes to node with highest hash. Only keys that scored highest on the removed node reassign. Simpler than ring; better distribution. Used by mcrouter and many internal load balancers.
Jump consistent hash (Lamping & Veach 2014). O(log N), no per-node state, perfect uniformity. Returns bucket 0..N-1; increasing N moves only the correct fraction. Limitation: only supports adding/removing at the end. Cassandra uses jump hash.
Slot-based (Redis Cluster, Cassandra v3+). Pre-allocate fixed slot count (16384 in Redis Cluster). Assign slots → nodes. Adding a node moves slot ownership; slot count stays the same. Slot map is small (16384 entries) and gossiped. Operational simplicity > theoretical minimum.

The era of "modulo by N" is over.

7.5 Memory pressure and eviction policy choice

One line. Memory fills; the cache evicts; the wrong policy evicts the wrong things; hit rate collapses.

Where it shows up. Cache filled with low-value scan-once data from a batch job; the hot working set gets evicted; user-facing reads start missing; database melts. Cache used for both hot-and-tiny (session blobs) and cold-and-huge (rendered HTML); evicting one starves the other.

Why the naive fix breaks. Default LRU is the canonical bad case under scan workloads. Default LFU is the canonical bad case under shifting popularity. Default no-eviction (maxmemory-policy noeviction) just refuses new writes once full — application errors out.

Real fix:

Workload-aware policy choice. Per §4.4: LFU for stable power-law popularity (most Redis production workloads); LRU for sliding-window patterns (chat history); W-TinyLFU for mixed (Caffeine, EVCache).
Separate caches for distinct workloads. Don't put session blobs and analytics-precompute results in the same Redis. They have different TTLs, different value sizes, different access patterns. Each cache gets its own policy and size.
Monitor eviction rate. Healthy steady state: eviction rate ≈ insert rate of cold data. A sudden eviction-rate spike signals workload shift or memory leak. Page the on-call.

7.6 Multi-region cache coherence

One line. A write in region A lands in region A's cache; region B's cache still serves the old value; users in region B see stale data after a global change.

Where it shows up. A user updates their LinkedIn headline; the change writes to the US-East primary. The US-East cache invalidates correctly. The US-West cache, fronting the read-replica DB in US-West, still has the old headline. The user (or their viewers) hit US-West and see stale data for the 30 seconds of replication + invalidation lag.

Naive fix: "Just propagate every cache invalidate to every region."

Why the naive fix breaks. Cross-region latency is 50–200ms. If every cache write triggers cross-region invalidation, write latency balloons. And during a network partition between regions, you either block writes (CP) or accept divergence (AP).

Real fix:

Accept staleness, bound it with TTL. Cross-region cache is allowed to lag by ~5–30s; TTL caps the maximum staleness window.
Eventually-consistent invalidation via Kafka. Same CDC pipeline as §7.3, but the Kafka topic mirrors cross-region (MirrorMaker, Brooklin). Cache-invalidator in region B consumes the event and invalidates locally. Lag is the Kafka mirror lag (~hundreds of ms) plus invalidator processing.
Facebook TAO's read-after-write pattern. Writes go to the leader region (one designated per shard). Other regions' caches are followers; they pick up the change via MySQL replication. A user immediately reading their own write is routed back to the leader region briefly (session-stickiness or token-based read-from-primary).
Active-active CRDT (Conflict-free Replicated Data Types). Redis Enterprise's "Active-Active" feature uses CRDTs (counters, sets, last-writer-wins registers) for keys that need to merge concurrent writes across regions. Useful for counters, leaderboards, status flags; not for general key-value where "merge" has no semantic meaning.

The honest reality: multi-region caching is multi-region replication wearing a different hat. Same CAP tradeoffs apply.

§8. Failure Modes

Systematically cover what breaks and how to recover.

8.1 Primary node crash mid-operation

A Redis primary at 50k ops/sec crashes. OS process gone; RAM wiped; on-disk RDB from 60s ago; AOF from up to 1s ago.

Standalone: restart loads from AOF (preferred) or RDB. Up to 1s of writes lost (default appendfsync everysec). Brief window of cache misses spikes DB load.

Sentinel: Sentinel quorum (3 or 5) detects primary unreachable, declares it down; replica is promoted; clients redirected. Promoted replica's state is ~10–500ms behind crashed primary (async replication); writes in that gap are lost (RPO > 0). Old primary returning becomes a replica; post-divergence writes discarded.

Redis Cluster: gossip detects FAIL state; replica in same shard auto-promoted; failover typically <30s. If no replica up, slot range stays read-only until operator intervenes.

The durability point. AOF fsync. appendfsync always if you can't afford 1s loss — ~10x throughput hit. Most cache deployments accept the 1s loss because data is regenerable from DB.

8.2 Cluster split (network partition)

Two halves of a Redis Cluster lose visibility. Each shard has a primary in one partition; the other cannot reach it. Replicas in the majority partition initiate failover and promote their own — meanwhile minority's primary continues accepting writes (default Redis Cluster is AP). Divergent writes on both sides; when partition heals, older replica is demoted and its divergent writes lost.

Mitigations. cluster-require-full-coverage yes (default) — if any slot has no available primary, the entire cluster goes read-only; safer but blocks writes during partition. min-replicas-to-write N — primary accepts writes only if ≥N replicas reachable; quorum-like protection against split-brain at the cost of write availability during replica loss.

8.3 Memory pressure → eviction storm

A burst of writes pushes past maxmemory. Eviction kicks in at hundreds of thousands of keys/sec. Eviction itself burns CPU (~30% of single-threaded command loop at 100k evictions/sec); p99 inflates; hot working set may get partially evicted if LRU + scan-like burst; subsequent reads miss; DB spikes.

Recovery. Throttle the offending write source — the cache can't shed load itself. maxmemory-policy noeviction stops accepting writes — app errors, but existing hot data preserved. Often the safest production choice.

8.4 Expired-but-not-yet-evicted serving stale

Redis lazy expiration means an entry past TTL might still be in the dict until next access (passive) or active eviction (~25 keys sampled 10x/sec). A key with TTL=300s might live at 305s if untouched — but the next GET checks expiry first and returns nil. Stale-serving risk is only via KEYS */SCAN (which enumerate without expiry check) or buggy clients. Academic concern for properly-written cache clients.

8.5 Unbounded keys without TTL

A bug ships: code SETs without TTL. Over weeks, billions of stale entries accumulate. Eviction (allkeys-lru) starts evicting hot keys because cold-but-recently-touched dominate. Slow disaster. Symptom: hit rate degrades over weeks. Mitigation: always set TTL on cache writes unless the key is explicitly permanent.

8.6 Replication broken silently

A replica loses connection to primary; still serves reads from stale state; app doesn't know. Monitor master_link_status (Redis), lag in bytes between primary and replica's last received offset. Alert on master_link_status != up for >30s. Recovery: drop the replica from the read pool; let it resync. Resync of a multi-TB cache can take hours; read capacity degraded meanwhile.

§9. Why Not Just Use the Database

A natural question: why not point the application directly at the database and let the database's own buffer pool handle hot data?

Concrete scenario. A profile-rendering service serves 1M reads/sec at peak. The DB primary is a single MySQL r6i.8xlarge (32 vCPU). The hot working set is 50 GB; comfortably fits in the InnoDB buffer pool.

Without cache: The DB takes the full 1M QPS. Even with everything in the buffer pool, each read is a B+ tree traversal, a row lock acquisition (shared), MVCC visibility check, undo-log walk for snapshot-old reads, network framing, etc. Per-read CPU cost ~10–50µs of database CPU. At 1M QPS, the DB needs ~10–50 CPU-cores of compute just for cache-like work. p99 latency from a hot DB is ~1–5ms (network + B+ tree + InnoDB locking). The DB also runs the write workload (10k commits/sec), which contends for CPU and locks.

With cache (cache-aside): The cache takes 990k ops/sec (99% hit rate). Each Redis GET is ~80µs total wall-clock; ~10µs of CPU. The DB takes ~10k QPS of cold reads plus its 10k commits/sec. p99 read for hot keys drops from 5ms to 0.5ms. DB CPU drops by an order of magnitude.

The arithmetic is overwhelming once you cross ~50k QPS on any single shard's hot working set. The cache exists because:

In-memory KV with hash-table lookup is structurally faster than a B+ tree traversal. ~10x at the byte level.
The cache's job is just "key → value" — no transactions, no MVCC, no joins. The minimal work per request is smaller.
Adding cache nodes is independent of database capacity. You scale read throughput in front of a fixed database size.
The cache absorbs the traffic shape the database is bad at. Spiky, repetitive reads of the same small set of hot rows are exactly the workload a B+ tree handles worst from a CPU-per-request standpoint.

The cache is the cheapest performance technology in the stack. Adding a Redis cluster to absorb 99% of reads is a ~5x cost reduction for a service over scaling the database to absorb the same read volume.

§10. Scaling Axes

How does a distributed cache scale? Two distinct axes; each needs different fixes.

Type 1: Uniform growth (more keys, more apps)

The total working set grows from 100 GB to 1 TB. Total QPS grows from 100k to 1M. Each key still gets roughly the same per-key traffic.

The fix: add cache nodes; reshard. Consistent hashing means adding nodes moves only 1/N of keys; hit rate dips briefly then recovers as the new nodes warm.

Inflection points: - ~32 GB working set → one Redis node can hold it; standalone or single-shard cluster is fine. - ~256 GB working set → multi-shard cluster (Redis Cluster, mcrouter-routed Memcached). One shard per ~32–64 GB. - ~10 TB working set → consider Aerospike (RAM index + SSD values) or multi-region multi-cluster. - ~1 PB working set → custom-built tooling territory (Facebook TAO, EVCache with regional replicas).

Type 2: Hotspot intensification (one key very hot)

One specific key receives 50k QPS while the rest of the cluster averages 100 QPS/key. Adding nodes doesn't help; the hot key stays on one node.

The fix: see §7.2. Layer in L1 in-process caching, replicate the key to multiple cluster slots, route reads across replicas, push to CDN if HTTP-shaped.

Inflection points: - ~5k QPS per key → consider L1 in-process cache (sub-second TTL, regenerate from L2). - ~20k QPS per key → mandatory L1 + read-from-replica. - ~100k QPS per key → key-replication (N shadow keys) and CDN if applicable.

The two axes interact. Pinterest's pin-render path uses both — uniform shard scaling (Type 1) for the long tail of pins, plus L1 + popular-pin replication for the celebrity pins (Type 2).

§11. Decision Matrix vs Adjacent Categories

Side-by-side along named dimensions.

Dimension	Distributed cache (Redis/Memcached)	In-process cache (Caffeine)	CDN / Edge cache	Database + buffer pool
Latency p99	<1ms (network)	<1µs (in-memory)	~10–50ms (geographic, edge)	~5–50ms (B+ tree + locks)
Capacity per node	~64 GB – 1 TB	~hundreds of MB (heap)	~petabytes globally	~hundreds of GB (RAM-bound)
Shared across instances	Yes	No	Yes (HTTP clients)	Yes
Durability	Optional (RDB/AOF)	None	None (rebuild from origin)	Yes (WAL)
Invalidation	Manual / TTL / CDC	TTL or process-local broadcast	TTL / surrogate-key purge	Native (DB is source of truth)
Data structures	Rich (Redis) or KV (Memcached)	Whatever the language supports	HTTP responses (opaque)	Rich (relational)
Best for	Backend hot path	Sub-microsecond critical reads	External client traffic	Transactional state
Cost / QPS	Lowest of these	Lowest (no network)	Low for repeat-read traffic	Highest

Thresholds for picking which:

< 100k QPS aggregate read traffic with hot working set fitting in a single DB's buffer pool. Skip the distributed cache. Just use the database. Don't over-engineer.
100k – 10M QPS, internal backend traffic, hot keys identifiable. Distributed cache (Redis or Memcached). Cache-aside or write-through depending on consistency need.
> 1M QPS, latency-critical (microsecond-scale), per-instance-deduplicated traffic. Add L1 in-process (Caffeine) on top of L2 distributed cache.
External client traffic, public content, HTTP-shaped. CDN. The L0 in front of everything else.
Multi-region read with bounded global staleness. Cache cluster per region + Kafka-mediated invalidation.

Don't pick one and call it done; production caching is always multi-tier.

§12. Cache Patterns Deep Dive

The cache is a primitive. Which pattern you use to integrate it with the source of truth determines correctness and latency.

12.1 Cache-aside (look-aside)

Most common. App reads cache; on miss, reads DB and SETs cache; on write, writes DB and DELETEs the key.

Read:                              Write:
  v = cache.get(K)                   db.write(K, new_value)
  if v is None:                      cache.delete(K)
      v = db.read(K)
      cache.set(K, v, ttl=300)
  return v

Strengths: simple; cache is independent of database; cache failures don't take down writes. Weaknesses: race condition between concurrent reader and writer (§7.3); stale-cache poisoning is real; TTL is the safety net. Default for most general-purpose caching — LinkedIn, GitHub, Reddit, most Rails-era apps.

12.2 Write-through

Writes go through cache, which writes to DB synchronously before acking.

Write:                                  Read:
  cache.set(K, new_value)                 v = cache.get(K) || db.read(K)
  db.write(K, new_value)                  if from DB: cache.set(K, v)
  return ack                              return v

Strengths: cache and DB always consistent (if cache.set + db.write are atomic together); no staleness window. Weaknesses: cache becomes a write hot path; cache failure means writes fail; latency = cache write + DB write; failure modes (cache OK, DB fail; DB OK, cache fail) require compensation logic. Used by Facebook TAO — writes go to cache leader region first then MySQL, strict ordering keeps them in sync.

12.3 Write-behind (write-back)

Writes go to cache; cache asynchronously batches writes to DB.

Write: cache.set(K, new_value); enqueue_to_writeback(K, new_value); return ack
(Background: every 100ms, flush a batch of pending writes to DB.)

Fastest write latency; batch writes reduce DB load. Durability risk — a cache crash before flush loses writes. Specialized cases — write-coalescing for high-frequency counters (page views, ad impressions) where ~seconds of loss is tolerable.

12.4 Read-through

Cache library transparently fetches from DB on miss; from the app's perspective the cache is the data source.

def get(K):
    if K in store: return store[K]
    v = db_loader_fn(K); store[K] = v; return v

Cleaner code than cache-aside; same race-condition characteristics. Hibernate L2 cache, MyBatis cache, EhCache, Guava LoadingCache, Caffeine all expose read-through APIs.

12.5 Refresh-ahead

Identify hot keys; refresh proactively before TTL expiry. Each access checks "is entry less than X% of TTL remaining?" — if yes, asynchronously trigger refresh (single-flight to coalesce). Current request still gets cached value with no extra latency. Useful for predictable hot keys (top news article, home page config); wasted effort on cold keys.

12.6 Negative caching

Cache "key doesn't exist" results to prevent stampede on missing keys. Imagine enumeration attack — GET /api/user/12345, 12346, ... — for non-existent users. Without negative caching, every request misses cache and hits DB. With short-TTL negative entries, load is bounded. Used by every public API serving lookups (LinkedIn member ID lookup, Pinterest pin lookup, GitHub repo lookup).

§13. Advanced Patterns

13.1 Multi-tier cache

L1 (in-process) + L2 (distributed) + L3 (database). The canonical production architecture.

                  ┌─────────────────────────────────────┐
   GET key ───────►   L1: Caffeine in-process           │
                  │   TTL = 1s, capacity = 1 GB         │
                  │   ~100ns access                     │
                  └────────────┬────────────────────────┘
                               │ miss
                               ▼
                  ┌─────────────────────────────────────┐
                  │   L2: Redis / Memcached cluster     │
                  │   TTL = 300s, capacity = TBs        │
                  │   ~100µs access                     │
                  └────────────┬────────────────────────┘
                               │ miss
                               ▼
                  ┌─────────────────────────────────────┐
                  │   L3: Database (MySQL/PG/Cass)      │
                  │   Source of truth                   │
                  │   ~5ms access                       │
                  └─────────────────────────────────────┘

L1 absorbs the hottest fraction of keys; L2 absorbs the long tail; L3 sees only the cold path.

Numbers from a typical multi-tier deployment: - L1 hit rate: 60% (limited by capacity) - L2 hit rate: 99% of L1 misses = ~39.6% of total - L3 receives: 0.4% of original traffic

For a service at 10M reads/sec, L3 sees 40k QPS — easily handled by a sharded database. Without the cache layers, L3 would face 10M QPS and need 250x the database capacity.

L1 invalidation is the hard part. Options: - Short TTL (1s, accept staleness). - L2-driven broadcast: when the cache-invalidator deletes from L2, it also publishes to a Redis pub/sub channel; each app instance subscribes and invalidates its L1. - Per-instance polling of a version number stored in L2.

TAO is a write-through, leader-follower cache for the social graph at Facebook. Data model: objects (users, posts, photos, comments) and associations (likes, friendships, follows, comments-on).

   Web tier ──► TAO follower (region B) ──┐
                TAO follower (region C) ──┼──► TAO leader (region A) ──► sharded MySQL
                TAO follower (region D) ──┘

Reads in any region: hit local TAO. Writes: forwarded to TAO leader for that shard, which writes to MySQL and propagates to followers via MySQL replication. Read-after-write is routed back to leader briefly.

Cache layer (~hundreds of TB of RAM across thousands of nodes) absorbs billions of reads/sec; MySQL does ~tens of millions of writes/sec; hit rate >99%.

Why custom? Generic Memcached doesn't know about associations — "give me latest 10 likes for this post" against Memcached requires expensive client coordination. TAO understands the graph schema, fetches association lists efficiently, has built-in lease mechanics for stampede control, and tracks invalidation for parent objects whose associations changed. The textbook example of "build a specialized cache when the generic data model doesn't fit."

13.3 Discovery, sharding, cluster modes

DNS-based discovery. DNS record → list of cache hosts; client-side consistent hashing. Memcached classic. Topology changes wait for DNS TTL.
Cluster mode (Redis Cluster). Each node holds a topology map; clients fetch and refresh. Routing in client library.
Sentinel mode (Redis Sentinel). 3–5 Sentinel processes monitor primaries. Clients ask "who is current primary for shard X?" Sentinel handles failover.
Proxy mode (Twemproxy, mcrouter, Envoy). A proxy tier between app and cache. App connects to proxy as if it were a single cache; proxy handles routing, replication, failure. +200µs network hop; dramatically simpler clients. Facebook (mcrouter), Twitter (Twemproxy).

13.4 Sharding strategies

Consistent hash on key. Default. Adding/removing nodes moves 1/N of keys.
hash(key) % N. Don't.
Slot-based. Pre-allocated slots → nodes. Redis Cluster's 16384 slots. Operational simplicity.
Geo-partitioning. Regions own keyspaces. Compliance (EU data in EU caches) and latency (user's profile in their home region).

13.5 Cache warming

Cold cache is a database stampede waiting to happen. Mitigations: bulk pre-populate (replay top-N popular keys from a warm peer before a new node joins); gradual ramp (10% → 25% → 50% → 100% traffic, letting the cache fill organically); replication-based warming (new node joins as replica, syncs, then promoted). Twitter and Facebook publicly describe warming pipelines as load-bearing for any cluster expansion.

§14. Use Case Gallery

Five different domains; same technology, different bend.

14.1 Web session store

User logs in; server creates a session record, generates a session ID, sets a cookie. The cookie's session ID is the cache key; the cache value is the serialized session (user_id, permissions, last_activity).

Variant: Redis (HSET for sparse session fields) or Memcached (opaque blob).
TTL = session duration (~30 min; sliding renewal on activity).
Failure mode: cache loss = user logged out. Accept (small UX hit) or persist sessions to DB with cache in front.
GitHub serves Rails session reads from Redis at hundreds of thousands of QPS.

14.2 Rate limiting

Per-user token bucket: each user gets N tokens per minute; a request consumes one; if 0 tokens, reject. Implementation in Redis:

-- Atomic Lua: refill bucket based on time elapsed, take one token if available
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])  -- tokens/sec
local now = tonumber(ARGV[3])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)

if tokens >= 1 then
    redis.call('HMSET', key, 'tokens', tokens - 1, 'last_refill', now)
    redis.call('EXPIRE', key, 3600)
    return 1  -- allowed
else
    return 0  -- denied
end

Atomic because Lua scripts run as a unit on the single-threaded command loop. No interleaving, no race condition.

Used by GitHub, Stripe, every public API with quota enforcement.

14.3 Feed timeline (Twitter, Instagram)

Each user has a per-user feed list — the IDs of the most recent N posts from people they follow:

LPUSH timeline:user_42 post_99001 post_99000 post_98999  -- on fanout
LTRIM timeline:user_42 0 999  -- keep most recent 1000
LRANGE timeline:user_42 0 19  -- read top 20 for first page

On a new post by user X, the fanout service pushes the post ID into the timelines of X's followers (normal users) or pulls at read time (celebrities with millions of followers — Twitter's "celebrity fan-out" hybrid). Twitter publicly described millions of LPUSH/sec at peak; tens of millions of LRANGE/sec from the home timeline cache.

14.4 Leaderboard

Real-time ranking with O(log n) insert and O(log n + k) range. Redis sorted set is purpose-built:

ZADD leaderboard 5000 player_42       -- 50µs
ZREVRANGE leaderboard 0 9 WITHSCORES  -- top 10
ZRANK leaderboard player_42            -- player's current rank

A 10M-player leaderboard delivers all of the above in <100µs. Every gaming service with ranking uses this — mobile games, Discord activity scores, World of Warcraft ladders.

14.5 Pub/sub for real-time

Redis Pub/Sub: channel-based fanout. Publishers PUBLISH channel msg; subscribers SUBSCRIBE channel receive any messages published. No durability, no replay — fire and forget.

Redis Streams (since v5): durable log-shaped pub/sub with replay. Consumer groups handle partitioning. Lighter-weight Kafka for in-cluster use.

Used by Discord (real-time chat), Slack (presence updates), Trello (collaborative document updates).

14.6 Application configuration / feature flags

Cached config-DB rows. Application servers fetch their config (rate limits, feature flags, kill switches) from Redis on startup and on every change-event from a config service.

A feature flag system might store ~1k flags × ~100 application services = 100k cache entries; reads at 10M QPS aggregate (every request might check flags); writes at <1 QPS. Read-heavy by 10^7. L1 + L2 with TTL=10s.

14.7 Anti-abuse / fraud

Sliding-window counters and recent-events lookup for IP-based abuse detection — Redis sorted sets with timestamps as scores:

ZADD recent_events:ip_1.2.3.4 1715000000 event_id_1
ZREMRANGEBYSCORE recent_events:ip_1.2.3.4 -inf 1714996400  -- evict > 1h old
ZCARD recent_events:ip_1.2.3.4  -- count remaining

ZADD (push), ZREMRANGEBYSCORE (sliding window cleanup), ZCARD (count) handle per-IP rate-limit-style anti-abuse efficiently.

14.8 ML feature serving

Low-latency feature lookup for online model inference. A recommender needs 200 features × 100 candidate items = 20,000 feature lookups per request in <50ms. Multi-GET pipelined into Redis hits hundreds of thousands of lookups/sec per host. Tecton, Feast, and LinkedIn's online feature store use Redis (or Aerospike at large scale) as the online tier; offline tier is a parquet warehouse with point-in-time correctness.

§15. Real-World Implementations With Numbers

Named systems shipping at scale across varied use cases.

Facebook Memcached + CacheLib. Public NSDI papers (2013 "Scaling Memcache at Facebook"; 2020 "The CacheLib Caching Engine"). Hundreds of millions of requests/sec across tens of thousands of nodes globally. Innovations: mcrouter (protocol router with consistent hashing, replication, prefix routing), leases for stampede control, regional clusters, CacheLib as the unified library.

Twitter Twemcache and Twemproxy. Custom Memcached fork with B-tree storage per slab class and segmented LRU. Twemproxy as the routing tier. Trillions of cache ops/day across thousands of nodes; backs home timeline, profiles, ads serving.

Pinterest. Memcached fleet serving billions of pin reads/day. Hit rate >99%. Multi-tier: per-instance L1 + cluster L2 + MySQL L3.

Netflix EVCache. Memcached fork with cross-region async replication, Caffeine L1 with W-TinyLFU. Backs movie metadata, A/B tests, recommendation candidates. ~30M+ ops/sec aggregate; sub-millisecond worldwide.

Snapchat. Redis at scale for messaging delivery; ephemeral by design fits the cache's natural model. Multi-region Redis Clusters.

GitHub. Redis for Rails sessions, ActionCable real-time delivery, Sidekiq job queues. ~1M ops/sec aggregate; the backbone of every page render.

Discord. Cassandra for messages (LSM source-of-truth at petabyte scale), Redis for presence, ScyllaDB for hot real-time data. Trillions of messages; billions of presence updates/day.

Stripe. Redis for idempotency-key dedup, rate limiting, request coalescing. The hot path of payment processing relies heavily on Redis primitives.

LinkedIn caching. Couchbase and Memcached fronting Espresso (NoSQL source-of-truth) and Pinot (analytics store). Profile cache, feed cache, search-result cache. ~10M+ QPS aggregate; sub-millisecond profile lookups load-bear every page render. CDC fan-out via Brooklin → Kafka invalidates the cache.

Facebook TAO. Specialized social-graph cache, write-through over sharded MySQL, leader-follower regions. Billions of reads/sec; tens of millions of writes/sec; hit rate >99%.

Twitter (X) Cache. Rebuilt multiple times — Memcached → Twemcache → Redis-fork "Nighthawk" (PMEM-backed Redis-like semantics). Powers timeline materialization, ads, search ranking.

Aerospike at AdTech. Real-time bidding (The Trade Desk, AppNexus, etc.) uses Aerospike for sub-millisecond user-profile lookups. Tens of millions of profiles, hundreds of features each, ~5ms total bid budget.

The range — single-node Redis at a startup (100k QPS) to Facebook CacheLib (hundreds of millions of ops/sec) — is five orders of magnitude wide for the same technology class. The data structures (hash table + skip list + slab allocator + LRU/LFU eviction) are the same. What scales is the topology around the cache: shard count, multi-tier layering, multi-region replication, custom routing.

§16. Summary

A distributed cache is the cheapest performance technology in the stack — a network-attached in-memory key-value store with sub-millisecond access, atomic per-key operations, and TTL-bounded staleness — designed to let derivable hot data be served at orders of magnitude lower cost than the source of truth. What scales is not the cache itself but the topology around it: multi-tier (L1 in-process + L2 distributed + L3 database), multi-region with eventual coherence via CDC, hot-key replication via sharding or shadow keys, stampede control via single-flight or leases. Pick Redis for rich primitives, Memcached for raw throughput on opaque blobs, Aerospike for SSD-backed working sets at terabyte scale, DragonflyDB for modern multi-core throughput, and a custom cache (TAO-style) only when the generic primitives don't fit your data model. Cache invalidation is the system designer's problem, not the cache's — TTL is the safety net, CDC is the correctness lever, and write-through is the consistency hammer. The cache promises fast, atomic, shared, RAM-bounded access; everything else — durability, multi-key transactions, cross-region consistency, hot-key handling, invalidation correctness — must be layered above by the system that uses it.