Load Balancer — Yifan Li

Load balancing is a class of infrastructure that distributes incoming work across a pool of backends with health awareness, traffic policy, and failure isolation. It is the layer that turns "a fleet of servers" into "a service." Its variants are radically different — a Layer 4 (L4) packet forwarder doing 10 million packets per second (Mpps) per box in the kernel looks almost nothing like a Layer 7 (L7) HTTP proxy doing 100 thousand requests per second (RPS) of TLS-terminated routing in user space. They share a name and a purpose; they share little code. This doc treats load balancing as a technology class, walks the design space, and uses examples from public web, internal microservices, database pools, video streaming, real-time gaming, and read-replica routing.

§1. What load balancers ARE — and are NOT

A load balancer (LB) is a traffic distributor: given an incoming unit of work (TCP packet, HTTP request, UDP datagram, DB query), it picks a backend from a pool and forwards. It keeps the pool clean — backends that fail health checks are evicted, new ones added smoothly, traffic reshuffled on membership change without resetting every live connection.

Two big axes split the design space.

L4 vs L7. L4 LBs operate on the transport layer — they see source/destination IP, source/destination port, and protocol (the 5-tuple). They do not parse HTTP, do not terminate TLS, do not understand routes. They are fast (sub-microsecond per packet on modern XDP — eXpress Data Path — paths), nearly stateless (a connection table mapping 5-tuples to backends), and protocol-agnostic (TCP, UDP, QUIC). Google Maglev, Facebook Katran, Cloudflare Unimog, AWS NLB (Network Load Balancer), and Linux IPVS are L4. L7 LBs terminate the transport (typically TCP+TLS) and operate on the application protocol — HTTP, gRPC, Kafka wire protocol, Redis protocol. They route by path, header, body, JWT (JSON Web Token) claim; rewrite headers; retry; do canary; run a WAF (Web Application Firewall). NGINX, HAProxy, Envoy, Traefik, AWS ALB (Application Load Balancer), GCP HTTP(S) LB are L7. Slower per request (milliseconds, not microseconds), enormously more capable.

Hardware vs software. The legacy lineage is hardware appliances (F5 Big-IP, Citrix NetScaler, A10): boxes with ASIC packet engines and proprietary OSes. The modern lineage is software on commodity x86/ARM: NGINX, HAProxy, and Envoy in user space; IPVS in the kernel's netfilter path; Katran and Unimog in the kernel via XDP/eBPF (extended Berkeley Packet Filter). Hardware wins on watts-per-bit; software wins on iteration speed and cost. By 2018 the industry's verdict was clear: the hyperscalers had all built software L4 LBs on eBPF, DPDK (Data Plane Development Kit), or a similar kernel-bypass path. Hardware appliances still ship to enterprises that want one vendor's throat to choke.

What load balancers are NOT:

Not a database. No persistent state. Lose the LB and connections reset, but no business data is lost.
Not a service mesh. A mesh is an architecture (sidecar proxies + control plane + mTLS — mutual Transport Layer Security — everywhere); it uses an LB (typically Envoy) as a building block.
Not a CDN (Content Delivery Network). A CDN caches; an LB routes. A CDN has an LB inside (Cloudflare runs Unimog), but the CDN's value is the cache hierarchy.
Not DNS. DNS-based load balancing (multiple A records) is a poor substitute — see §9.
Not a retry framework. End-to-end retry semantics (idempotency keys, saga compensation, exponential backoff) belong in the client and the application.

The category matters because engineers reach for LBs to solve problems LBs don't solve (durability, exactly-once delivery, cross-region consistency) and ignore them for problems they do solve cleanly (slow-start protection, hot-route shedding, blue/green deploy gating).

§2. Inherent guarantees

Load balancers provide by design:

Traffic distribution across the pool, approximately even by some weighting policy (round-robin, least-connections, P2C — Power of Two Choices — random, consistent hash).
Health awareness within the check window (2–10s active, sub-second passive).
Failover within the pool — backends that die stop receiving new traffic; some idempotent requests can be retried.
Flow stickiness for live connections — once a TCP flow is pinned to backend B, it stays on B even as the pool churns.
Single coherent VIP (Virtual IP) identity to clients — membership churn is invisible.

What they do NOT provide:

Durability. In-flight requests on a crashed backend are lost. The LB does not buffer or persist.
Exactly-once. Retries to a new backend may double-execute; idempotency is the application's job.
End-to-end correctness. LB sees "backend returned 200"; it does not know the DB write committed.
Ordering across flows. Two requests from the same client may land on different backends in any order.
Strongly-consistent routing. Two LB boxes may briefly disagree on which backend is healthy. Routing tolerates this — either by being deterministic from a shared table (Maglev) or by the application surviving either choice.
Transactional drain. Draining is best-effort; no two-phase commit on connection migration.

Same pattern as TCP: the layer below delivers bytes, the application above builds the correctness story. Anyone treating an LB as a durability layer will be surprised.

§3. Design space

Variant	Layer	Throughput per box	Latency added	Strengths	Examples
Hardware appliance	L4+L7	10–40 Gbps line rate	µs	Predictable, vendor-supported	F5 Big-IP, Citrix NetScaler
Kernel L4 (IPVS, LVS)	L4	1–2 Mpps	5–20 µs	Native to Linux, simple	IPVS, LVS-DR (Direct Routing)
Kernel-bypass L4 (DPDK)	L4	10–14 Mpps line rate	~100 ns	Max performance	Google Maglev (DPDK)
XDP/eBPF L4	L4	5–10 Mpps per core	~50–200 ns	Kernel-resident, programmable	Katran, Unimog, Cilium
User-space L7	L7	50–200k RPS HTTPS	0.5–2 ms	Battle-tested, rich config	NGINX, HAProxy
Modern L7 sidecar	L7	50–150k RPS HTTPS	1–3 ms	Dynamic xDS, observability	Envoy at Lyft, Stripe
Cloud managed	L7 (mostly)	Cloud-scaled	1–10 ms	Zero ops	AWS ALB/NLB, GCP HTTP LB
API gateway	L7+	1–50k RPS	5–50 ms	Auth, transformation built-in	Kong, AWS API Gateway
DB proxy	L7 protocol-specific	50k+ QPS	0.5–2 ms	Connection pooling, query routing	PgBouncer, ProxySQL, vtgate

The crucial split is L4 (fast, dumb) vs L7 (slow, smart). Most real stacks chain them: L4 absorbs packet-per-second budget, then forwards into L7 for TLS termination, path routing, policy. Cloudflare: Unimog → NGINX-variant. Google: Maglev → GFE (Google Front-End). AWS: NLB + ALB chained.

The other crucial dimension is kernel-bypass vs kernel-resident. DPDK gives the floor on latency and the ceiling on PPS (packets per second) at the cost of losing every kernel networking facility (iptables, tcpdump, eBPF tracing). XDP/eBPF is the modern compromise: stay in the kernel, run a verified bytecode program ahead of the rest of the stack, get roughly 10x the per-core packet rate of the regular kernel socket path (~500k pps vs 5–10 Mpps in §4e), keep all the kernel tooling. By 2026, new L4 LB projects almost universally pick XDP/eBPF over DPDK unless they need absolute line-rate determinism.

§4. Byte-level mechanics

This is where the technology becomes concrete: load-distribution algorithms, data structures, kernel-bypass paths, and one packet end-to-end.

4a. Consistent hashing variants

The naive backend_id = hash(5-tuple) mod N works when N is fixed and fails catastrophically when N changes: going from N to N+1 backends re-maps about N/(N+1) of all flows (99% for 100 → 101), so almost every live TCP connection RESETs. Consistent hashing solves this. Three variants ship in production.
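The remap fraction is easy to measure. The sketch below is illustrative only: `stable_hash` and `remapped_fraction` are hypothetical helper names, and SHA-256 stands in for whatever hash the data plane actually uses.

```python
import hashlib

def stable_hash(key: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() of str is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

def remapped_fraction(num_keys: int, n_before: int, n_after: int) -> float:
    """Fraction of keys whose backend changes under plain hash-mod-N."""
    moved = 0
    for i in range(num_keys):
        h = stable_hash(f"flow-{i}")
        if h % n_before != h % n_after:
            moved += 1
    return moved / num_keys

# Going from 100 to 101 backends moves nearly every flow.
print(remapped_fraction(20_000, 100, 101))
```

A key keeps its backend only when `h % 100 == h % 101`, which happens for about 1 in 101 hash values, so the printed fraction should land close to 0.99.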

Ring-based (DynamoDB-style, Cassandra-style). Each backend hashes to K positions on a 2^32 ring (virtual nodes, K=100–500 to flatten variance). For key X: hash to H(X), walk clockwise to the first vnode, return that backend. Lookup is O(log(K·N)) via binary search on a sorted vnode array. K=200 and N=1000 give 200k entries, which fits in L2 cache, ~100 ns lookups. Adding a backend places K new vnodes; ~1/N of the key space moves. The relative standard deviation of per-backend load scales as K^(-1/2): K=100 gives ~10% load standard deviation. Used by databases (Cassandra, DynamoDB, Riak) where the LB layer is co-designed with storage placement.
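A minimal sketch of the ring, assuming SHA-256 truncated to 32 bits as the ring hash. The class name `HashRing` and the `backend#k` vnode naming are illustrative choices, not taken from any of the systems named above.

```python
import bisect
import hashlib

def h32(s: str) -> int:
    """Position on the 2^32 ring."""
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:4], "big")

class HashRing:
    def __init__(self, backends, vnodes=200):
        self.vnodes = vnodes
        self._points = []   # sorted vnode positions
        self._owners = []   # backend owning each position, same order
        for b in backends:
            self.add(b)

    def add(self, backend):
        """Place `vnodes` virtual nodes for this backend on the ring."""
        for k in range(self.vnodes):
            pos = h32(f"{backend}#{k}")
            i = bisect.bisect_left(self._points, pos)
            self._points.insert(i, pos)
            self._owners.insert(i, backend)

    def lookup(self, key):
        """Binary-search for the first vnode clockwise of the key's position."""
        i = bisect.bisect_right(self._points, h32(key))
        return self._owners[i % len(self._owners)]   # wrap past the top of the ring

ring = HashRing([f"b{i}" for i in range(10)])
keys = [f"flow-{i}" for i in range(3000)]
before = {k: ring.lookup(k) for k in keys}
ring.add("b10")
moved = sum(1 for k in keys if ring.lookup(k) != before[k])
print(f"{moved / len(keys):.1%} of keys moved")
```

With 10 backends growing to 11, roughly 1/11 of the keys should move, and every key that moves lands on the new backend.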

Rendezvous (HRW, Highest Random Weight). For key X and N backends: compute H(X, B_i) for every backend and pick the highest. O(N) per lookup, which is too slow for the data plane at large N. In exchange, only the minimum ~1/N of keys move on add (all of them to the new backend), there is no vnode tuning, and the distribution is naturally tight. Right pick when N is small (<1000). Used by some CDNs and distributed caches for object placement.
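Rendezvous hashing fits in a few lines. This is a sketch under the assumption that SHA-256 over `backend|key` is an acceptable scoring function; `score` and `hrw_pick` are hypothetical names.

```python
import hashlib

def score(key: str, backend: str) -> int:
    """H(X, B_i): a pseudo-random weight for this (key, backend) pair."""
    digest = hashlib.sha256(f"{backend}|{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def hrw_pick(key: str, backends: list[str]) -> str:
    """Return the backend with the highest score; O(N) per lookup."""
    return max(backends, key=lambda b: score(key, b))

pool = [f"b{i}" for i in range(8)]
print(hrw_pick("flow-42", pool))
```

Because each backend's score for a key is independent of pool membership, adding a backend can only steal the keys for which it now has the top score; no key moves between two existing backends.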

Maglev (Google's lookup-table variant). Used by Maglev, Katran, Unimog, and most modern XDP LBs. Precompute a fixed-size lookup table M of size T (a prime, typically 65537), where M[i] = backend assigned to slot i. To pick: slot = hash mod T, return M[slot]. A single array index: ~10 ns, no data-dependent branches. At 65537 entries of 1–2 bytes each the table is 64–128 KB, so it sits in L2 cache rather than L1.

The slot-assignment algorithm (NSDI 2016 §3.4):

For each backend B_i, precompute a permutation P_i of [0, T) using two hashes:
- offset = H1(B_i) mod T
- skip = H2(B_i) mod (T-1) + 1 (T prime → gcd(skip, T) = 1 → the permutation covers [0, T))
- P_i[j] = (offset + j * skip) mod T
Fill M in rounds. In each round every backend, in turn, walks forward through its own preference list P_i, skipping slots that are already taken, and claims the first free one. Each backend therefore claims exactly one slot per round, and the loop stops as soon as all T slots are filled.

Worked example, T=7, backends {B1, B2, B3}:

P_1 = [3, 0, 4, 1, 5, 2, 6]    // B1's preferred slot order (offset 3, skip 4)
P_2 = [0, 2, 4, 6, 1, 3, 5]    // offset 0, skip 2
P_3 = [3, 4, 5, 6, 0, 1, 2]    // offset 3, skip 1

Round 0: B1 claims 3 → M[3]=B1; B2 claims 0 → M[0]=B2;
         B3 tries 3, taken; claims 4 → M[4]=B3
Round 1: B1 tries 0, taken; tries 4, taken; claims 1 → M[1]=B1;
         B2 claims 2 → M[2]=B2; B3 claims 5 → M[5]=B3
Round 2: B1 tries 5, taken; tries 2, taken; claims 6 → M[6]=B1;
         table full, stop

Final: M = [B2, B1, B2, B1, B3, B3, B1]

Lookup for a 5-tuple hashing to 1234567:
  slot = 1234567 mod 7 = 5 → backend = M[5] = B3

When B2 dies and we rebuild with {B1, B3}, the new table is [B1, B1, B1, B1, B3, B3, B3]. B2's slots (0, 2) are reassigned, slots 1, 3, 4, and 5 stay put, and slot 6 moves from B1 to B3: a surviving flow perturbed as a side effect, which is why Maglev's disruption is minimal but not zero. With T=65537 and a real backend count, ~T/N slots change per backend change, almost exactly the flows that had to move (the ones on the dying backend), with only a small fraction of surviving flows perturbed.

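Why Maglev wins: O(1) lookup vs O(log N), even distribution by construction (each backend claims one slot per round, so slot counts differ by at most one), single-array data structure. Rebuild costs O(T log T) on average and O(T^2) in the worst case per backend change, done once on the control plane, pushed to every data-plane node as a ~64KB blob.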
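A minimal sketch of the table-population loop as described in the Maglev paper. The function name `maglev_table` and the choice to pass (offset, skip) pairs directly instead of hashing backend names are simplifications for illustration; the pairs in the usage line correspond to the permutations P_1, P_2, P_3 of the T=7 example.

```python
def maglev_table(prefs: list[tuple[int, int]], table_size: int) -> list[int]:
    """Build a Maglev lookup table of backend indices.

    prefs[i] = (offset, skip) for backend i. table_size must be prime so that
    every skip in [1, table_size) generates a full permutation of the slots.
    """
    if not prefs:
        raise ValueError("need at least one backend")
    table = [-1] * table_size
    cursor = [0] * len(prefs)   # how far each backend has walked its permutation
    filled = 0
    while True:
        for i, (offset, skip) in enumerate(prefs):
            # Walk backend i's preference list until a free slot turns up.
            slot = (offset + cursor[i] * skip) % table_size
            while table[slot] != -1:
                cursor[i] += 1
                slot = (offset + cursor[i] * skip) % table_size
            table[slot] = i
            cursor[i] += 1
            filled += 1
            if filled == table_size:
                return table

# Backends 0, 1, 2 stand for B1, B2, B3.
print(maglev_table([(3, 4), (0, 2), (3, 1)], 7))
# Rebuild without B2 to see which slots change owner.
print(maglev_table([(3, 4), (3, 1)], 7))
```

Since every backend claims exactly one slot per pass of the outer loop, per-backend slot counts can differ by at most one.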

4b. Weighted round-robin internals

When backends have different capacities, the LB uses a weight scheduler. The smooth weighted round-robin used by NGINX (other LBs ship comparable weighted schedulers):

On each request:
  for each backend i:
    b[i].current_weight += b[i].effective_weight
  pick k = argmax(b[i].current_weight)
  b[k].current_weight -= sum(b[i].effective_weight)
  return k

Weights 5:1:1 produce the interleaved sequence A A B A C A A (repeating) rather than the bursty A A A A A B C, which gives a better latency distribution. O(N) per request, fine for N < 10,000.
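The pseudocode above translates almost line for line into Python. This is a sketch, not any LB's actual source: `SmoothWRR` is a hypothetical class name and `effective_weight` is taken to equal the configured weight (no failure-based weight decay).

```python
class SmoothWRR:
    """Smooth weighted round-robin over a fixed set of named backends."""

    def __init__(self, weights: dict[str, int]):
        self.weights = dict(weights)
        self.current = {name: 0 for name in weights}
        self.total = sum(weights.values())

    def pick(self) -> str:
        # Every backend earns its weight, the richest one pays for the pick.
        for name, w in self.weights.items():
            self.current[name] += w
        # max() keeps the first maximal key, so ties go to the earliest backend.
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= self.total
        return chosen

scheduler = SmoothWRR({"A": 5, "B": 1, "C": 1})
print(" ".join(scheduler.pick() for _ in range(14)))
```

After one full cycle (as many picks as the sum of the weights) every `current` value returns to zero, so the sequence repeats exactly.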

4c. Least-connections (min-heap)

Naive O(N) scan is too slow at 5M decisions/sec. Real implementation: min-heap keyed by active_connections + inverted index backend_id → heap_index. On new flow: peek top, increment, sift O(log N). On flow end: decrement, sift O(log N). For N=1000, log N = 10 ops per decision.

Production refinement: P2C (Power of Two Choices). Pick two random backends, send to the less-loaded. O(1) per request. The expected maximum load drops from O(log N / log log N) for pure random placement to O(log log N / log d) with d choices, which is O(log log N) for d=2. It also avoids the "thundering herd" pathology where every LB box independently picks the same "least-loaded" backend at the same instant. Envoy's LEAST_REQUEST policy uses P2C by default.
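The effect is easy to see in a balls-into-bins simulation. The sketch below is a simplification: requests never complete, so it measures cumulative placement, not live connection counts, and `max_load` is a hypothetical helper.

```python
import random

def max_load(num_backends: int, num_requests: int, choices: int, seed: int = 1) -> int:
    """Place requests one by one and return the busiest backend's load.

    choices=1 is pure random placement; choices=2 is power-of-two-choices.
    """
    rng = random.Random(seed)
    load = [0] * num_backends
    for _ in range(num_requests):
        candidates = [rng.randrange(num_backends) for _ in range(choices)]
        target = min(candidates, key=lambda b: load[b])   # least-loaded candidate
        load[target] += 1
    return max(load)

print("random:", max_load(100, 10_000, choices=1))
print("p2c:   ", max_load(100, 10_000, choices=2))
```

With an average of 100 requests per backend, the P2C maximum should sit within a few requests of the mean, while pure random placement overshoots by a couple of standard deviations.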

4d. Connection tracking table (5-tuple → backend)

The L4 LB does NOT rely on the consistent hash alone. It also keeps a flow table that maps live 5-tuples to backends, so mid-stream packets always hit the same backend even if the consistent-hash table just changed:

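#include <stdint.h>

struct flow_entry {
    uint64_t five_tuple_hash[2];  /* 128-bit hash of the 5-tuple (standard C has no uint128_t) */
    uint32_t backend_id;
    uint64_t last_seen_ns;        /* refreshed on every packet; drives idle reclaim */
    uint32_t packets, bytes;
};
#define FLOW_TABLE_SLOTS (16u << 20)  /* 16M entries, open-addressed, 2x expected concurrent flows */
static struct flow_entry table[FLOW_TABLE_SLOTS];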

On packet arrival: compute 5-tuple hash, probe at hash mod table_size, if found+fresh use cached backend, otherwise fall through to Maglev. Stale entries (idle >60s) reclaimed.

The flow table is the stickiness layer. The Maglev table is the correctness layer (new flows + flows after churn). Together: stable mid-flow stickiness + bounded reshuffle on churn + O(1) lookup in the hot path.
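The two-layer lookup can be sketched as follows. `FlowTable` is a hypothetical class, a Python dict stands in for the open-addressed array, and the built-in `hash()` of an integer tuple stands in for the data plane's 5-tuple hash.

```python
class FlowTable:
    """Soft-state 5-tuple -> backend cache in front of a consistent-hash table."""

    def __init__(self, lookup_table: list[str], idle_timeout_s: float = 60.0):
        self.lookup_table = lookup_table       # for example a Maglev table
        self.idle_timeout_s = idle_timeout_s
        self.flows = {}                        # five_tuple -> (backend, last_seen)

    def route(self, five_tuple: tuple, now: float) -> str:
        entry = self.flows.get(five_tuple)
        if entry is not None and now - entry[1] <= self.idle_timeout_s:
            backend = entry[0]                 # live flow: stay pinned
        else:
            slot = hash(five_tuple) % len(self.lookup_table)
            backend = self.lookup_table[slot]  # new or stale flow: consult the table
        self.flows[five_tuple] = (backend, now)
        return backend

ft = FlowTable(["b1"] * 7)
flow = (0x0A000001, 40000, 0xC6336401, 443, 6)   # src_ip, src_port, dst_ip, dst_port, proto
print(ft.route(flow, now=0.0))
ft.lookup_table = ["b2"] * 7                     # table rebuilt after backend churn
print(ft.route(flow, now=10.0))                  # still pinned to the original backend
```

A production flow table would also bound memory by evicting stale entries in the background; here a stale entry is simply overwritten the next time its flow is seen.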

4e. Kernel-bypass paths

Three options. The first two trade operational simplicity for raw performance; the third recovers most of that performance while staying inside the kernel:

Option 1: kernel networking stack. NIC → kernel ring buffer → softirq → IP layer → netfilter → user-space socket → LB → socket → IP → NIC. ~10 µs per packet, ~500k pps per core, ~6 syscall/context-switch boundaries. This is the stock NGINX and HAProxy path; IPVS forwards inside the kernel at the netfilter hook and skips the user-space hop. Fine for L7 where ms of parsing dominate µs of socket I/O. Inadequate for L4 at scale.

Option 2: DPDK kernel-bypass. DPDK takes the NIC away from the kernel entirely. A user-space process polls the NIC ring buffer directly via PMD (Poll Mode Driver). Packets never touch the kernel. Hugepages (2 MB or 1 GB) avoid TLB (Translation Lookaside Buffer) misses. Each LB thread pinned to a CPU core; each NIC queue mapped to one core (RSS — Receive Side Scaling — distributes packets by 5-tuple hash on the NIC ASIC). ~100 ns per packet, 10 Mpps per core, line rate at 100 Gbps. Operational cost: no tcpdump, no iptables; the LB must reimplement ARP (Address Resolution Protocol), ICMP, fragmentation.

Option 3: XDP/eBPF. A kernel hook before the rest of the network stack. The eBPF program inspects the packet and returns XDP_PASS (continue up the stack), XDP_DROP (discard — DDoS shedding), XDP_TX (re-send out the same NIC after rewrite), or XDP_REDIRECT (different NIC). ~50–200 ns per packet, 5–10 Mpps per core. eBPF is verified-safe bytecode (no unbounded loops, kernel verifier rejects anything it can't prove terminates within ~1M instructions). The Maglev table lives in a BPF map (kernel-resident hash/array map); user space updates via syscall, the eBPF program reads it during packet processing. Operational win: still the kernel — tcpdump, perf, ftrace work on traffic the XDP program PASSes. The 2026 default for new L4 LB projects.

4f. One packet end-to-end

A TCP SYN arrives at an XDP-based L4 LB:

T=0    ns   Packet hits NIC. Ethernet header parsed by NIC ASIC.
            RSS: NIC hashes the 5-tuple via Toeplitz hash, picks
            NIC queue 7 of 16. Queue 7 IRQ-bound to CPU core 7.

T=50   ns   Core 7 NAPI poll reads packet descriptor from queue 7's
            RX ring buffer.

T=80   ns   XDP eBPF program invoked. Walks headers:
            - bytes [0..14): Ethernet. Read ethertype at offset 12;
              0x0800 (IPv4) or 0x86DD (IPv6), else XDP_PASS.
            - bytes [14..34): IPv4. Read proto at +23 (TCP=6),
              dst_ip at +30. If dst_ip != VIP, XDP_PASS.
            - bytes [34..54): TCP. Read src_port, dst_port, flags.

T=90   ns   Compute 5-tuple hash over (src_ip, src_port, dst_ip,
            dst_port, proto) with a jhash-style mix written in eBPF
            (eBPF bytecode cannot issue SSE4.2 CRC32C instructions).

T=100  ns   Flow table lookup (BPF map). Probe at hash mod tbl_size.
            Compare full 5-tuple. Hit: use cached backend_id, skip
            to the backend info lookup (T=150). Miss (new SYN): continue.

T=130  ns   Maglev lookup:
              slot = hash mod 65537
              backend_id = maglev_tbl[slot]  // single mem access
            Write flow table entry (5-tuple_hash, backend_id, now()).

T=150  ns   Backend info lookup (BPF map): dest_mac, dest_ip.

T=170  ns   Rewrite packet:
            - DSR (Direct Server Return): rewrite dest_mac to
              backend_mac, leave IP unchanged. Backend has VIP bound
              on loopback, accepts the packet, sends response
              directly to client. Huge return-bandwidth savings.
            - IPIP / GUE (Generic UDP Encapsulation) / GENEVE tunnel:
              prepend outer IP with dest = backend_ip. More flexible
              (crosses L3 boundaries), more overhead.

T=180  ns   Return XDP_TX. NIC transmits out the same port.

T=200  ns   Packet on wire to backend. Total LB hop ~200 ns, no
            user-space copy, no syscall, no context switch.

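At ~200 ns per packet one core forwards ~5 Mpps, so a 16-core box with 16 NIC queues forwards ~80 Mpps. A 100 Gbps NIC carries ~148.8 Mpps of minimum-size (64-byte) frames but only ~12 Mpps at a 1,000-byte average, so for realistic packet-size mixes the box saturates the NIC long before it saturates CPU; only a flood of minimum-size packets (a SYN flood, for example) makes it CPU-bound. The L4 LB tier in modern hyperscale systems is normally bandwidth-bound, not CPU-bound.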
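The header walk at T=80 ns reads fixed offsets from the start of the frame. The sketch below mirrors it in user-space Python purely for illustration (a real XDP program does this in eBPF with bounds checks the verifier can follow); `parse_ipv4_tcp` is a hypothetical name and IPv6 and VLAN tags are ignored.

```python
import struct

ETH_LEN = 14

def parse_ipv4_tcp(frame: bytes):
    """Return (src_ip, src_port, dst_ip, dst_port, proto), or None if not IPv4/TCP."""
    if len(frame) < ETH_LEN + 20:
        return None
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != 0x0800:               # IPv4 only in this sketch
        return None
    ihl = (frame[ETH_LEN] & 0x0F) * 4     # IPv4 header length in bytes
    proto = frame[23]                     # protocol field: Ethernet (14) + 9
    if proto != 6 or ihl < 20 or len(frame) < ETH_LEN + ihl + 4:
        return None
    src_ip, dst_ip = struct.unpack_from("!II", frame, 26)   # src at +26, dst at +30
    src_port, dst_port = struct.unpack_from("!HH", frame, ETH_LEN + ihl)
    return (src_ip, src_port, dst_ip, dst_port, proto)

# Hand-built 54-byte frame: 10.0.0.1:40000 -> 198.51.100.1:443, TCP.
eth = bytes(12) + b"\x08\x00"
ip = (bytes([0x45, 0]) + struct.pack("!H", 40) + bytes(4) + bytes([64, 6])
      + bytes(2) + bytes([10, 0, 0, 1]) + bytes([198, 51, 100, 1]))
tcp = struct.pack("!HH", 40000, 443) + bytes(16)
print(parse_ipv4_tcp(eth + ip + tcp))
```

The TCP offset is derived from the IPv4 header-length field, not hard-coded to 34, so packets carrying IP options are still parsed correctly.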

4g. Trade-offs vs structures we did not pick

Maglev vs ring-based: Maglev wins on lookup speed (O(1) vs O(log N)) and provably-even distribution. Ring wins when LB and storage are co-designed (Cassandra). Pick Maglev for the data plane, ring for storage placement.
Maglev vs Rendezvous: Maglev's O(1) is 1000x faster than Rendezvous's O(N) at N=1000. Rendezvous is fine for control-plane decisions made once per minute; Maglev is required for data-plane decisions made once per microsecond.
Maglev vs hash-mod-N: hash-mod-N is O(1) and simple but reshuffles nearly all flows (about N/(N+1)) on a backend change. Never use for stateful LB.
L4 vs L7 forwarding: L4 is faster (1 memory access vs HTTP parse) and protocol-agnostic. L7 can route on path/header. We chain them.

4h. Soft-state durability

The L4 LB has no persistent state in the request path. The Maglev table is reconstructed deterministically from the backend list, which lives in a strongly-consistent control plane (etcd, Consul, ZooKeeper). The flow table is soft state: if an LB box dies, the surviving LBs (which get those flows via ECMP, Equal-Cost Multi-Path, rehash) compute the same Maglev table and route to the same backend, provided the backend set has not changed since the flow started. Flows keep working with a handful of packets dropped during the ~100 ms ECMP rehash window. This is the architectural trick that makes the LB tier stateless from a durability perspective: no replication, no failover dance.

The L7 LB is approximately stateless too: what it holds (TLS session state, HTTP/2 connection state) is soft state in process memory. If an L7 box dies, in-flight requests RESET; clients retry; traffic shifts via ECMP from below. Worst case: 5–10 seconds of client-visible degradation for the in-flight requests on the dead box.

§5. Capacity envelope

The numbers span nearly four orders of magnitude in request rate. Real deployments anchored to the envelope:

Tier	Example	Reported scale
Hobby / small shop	Single NGINX	~10k RPS
Mid-size SaaS	HAProxy at Stack Overflow (2016)	~30k RPS sustained, 70k peak
GitHub	HAProxy + GLB (GitHub LB)	~1M RPS peak
Stripe / Lyft	Envoy fleet (mesh + edge)	~250k RPS per Envoy, fleet-wide millions
Netflix Open Connect	NGINX CDN appliances	100+ Gbps per appliance, multi-Tbps fleet
Facebook Katran	XDP/eBPF L4 LB	~5 Mpps per server, tens of Mpps per cluster
Google Maglev	DPDK L4 LB (2016)	~10 Mpps per box (a 10 Gbps link saturated with small packets), >1 Tbps per cluster
Cloudflare edge	Unimog (XDP) + modified NGINX	>1T req/day, 60M RPS peak; mitigates multi-Tbps DDoS
AWS Hyperplane (NLB)	Internal AWS L4 fleet	Millions of RPS per NLB, 20M+ concurrent conns per AZ, <100µs added

The envelope spans 10k RPS (single nginx) to 60M RPS (Cloudflare edge during DDoS), packet rates from ~50k pps to 100+ Mpps per box. Architecture changes at inflection points: single nginx works until ~100k RPS, then ECMP'd cluster; ~1M RPS needs L4+L7 split with kernel-bypass at L4; ~10M RPS needs multi-region anycast; ~100M RPS needs partitioned Maglev tables across regions. Per-layer latency: L4 adds ~0.1 ms median, ~1 ms p99; L7 with TLS adds ~1 ms median, ~10 ms p99 (TLS handshake dominates); geographic anycast adds ~5–30 ms RTT depending on PoP density.

§6. Architecture in context

The canonical multi-tier topology, generalized across hyperscalers:

              Client (browser, mobile, IoT, gRPC client)
                              │
                              │ DNS query
                              ▼
                   ┌──────────────────────┐
                   │  GeoDNS / Anycast DNS │  returns anycast VIP
                   └──────────┬───────────┘
                              │ TCP SYN / UDP / QUIC to anycast IP
                              ▼
              BGP advertises same VIP from N PoPs
                  Internet routes to NEAREST PoP
                              ▼
        ┌────────────────────────────────────────┐
        │   PoP Edge Router (ECMP, 5-tuple hash) │
        └─────┬─────────────┬─────────────┬──────┘
              ▼             ▼             ▼
        ┌─────────┐   ┌─────────┐   ┌─────────┐
        │ L4 LB#1 │   │ L4 LB#2 │   │ L4 LB#M │
        │ XDP/eBPF│   │ XDP/eBPF│   │ XDP/eBPF│
        │Maglev tbl│  │Maglev tbl│  │Maglev tbl│
        │flow tbl │   │flow tbl │   │flow tbl │
        └────┬────┘   └────┬────┘   └────┬────┘
             │ DSR or IPIP/GUE tunnel    │
             └────────┬──────────────────┘
                      ▼
        ┌──────────────────────────────────┐
        │ L7 LB tier (Envoy / NGINX)       │
        │ - TLS terminate (TLS 1.3, HTTP/3)│
        │ - Path/header routing            │
        │ - WAF, rate limit, retry, circ b │
        └────────────┬─────────────────────┘
                     ▼ HTTP/2 + mTLS
        ┌──────────────────────────────────┐
        │ Backend pool (1000s of pods)     │
        │ /healthz, /ready                 │
        └──────────────────────────────────┘

Control plane (out of band):
  ┌───────────────────────────────────────────────┐
  │ Health checker + config (etcd / xDS)          │
  │ - polls /healthz every 2–5s                   │
  │ - rebuilds Maglev table on backend churn      │
  │ - pushes table to all L4 LBs (deterministic   │
  │   from backend list, ~64KB blob)              │
  │ - pushes route configs to all L7 (xDS)        │
  └───────────────────────────────────────────────┘

This pattern repeats whether the workload is public web (Cloudflare/Google), internal microservices (Envoy mesh at Lyft), DB connection routing (PgBouncer + ProxySQL in front of MySQL), or real-time video (Netflix Open Connect). At the small end it's just NGINX; at the large end it's anycast + L4 + L7 + mesh. Shape is identical.

Two architectural levers:

Anycast makes a single VIP globally addressable. BGP advertises the same IP from every PoP; routing delivers each client to the topologically nearest. Failover is automatic on BGP withdrawal (~10–60s with tuning).
ECMP makes a single anycast PoP capacity-scaled. N L4 LB boxes behind the edge router; ECMP spreads packets by 5-tuple hash. When an LB dies, ECMP rehashes; surviving LBs run the same Maglev table so flows route to the same backend without state migration. This is what makes the LB tier durably stateless.

§7. Hard problems

Problem 1: Stable hashing under churn

Naive: backend_id = hash(5-tuple) mod N. Fails: N changes 100→101 on deploy, 99% of flows get a different backend_id, all RESET — site appears to go down on every deploy. Fix: Maglev lookup table (§4) — only ~1/N of flows perturb on N→N+1. Plus the flow tracker that pins live 5-tuples to their original backend even when the table changes.

State walkthrough: 1M live TCP connections across 100 backends, table size T=65537, ~655 slots per backend. At t=10s, B73 is gracefully removed and the table is rebuilt with N=99. The ~655 slots that pointed to B73 are spread across the other 99 backends. Existing flows in the flow table → original backend until they drain. New flows → new table. Total affected: the ~10k flows that were on B73 (which drain rather than reset), plus near-zero on other backends. ~1% perturbation on a 1% pool change.

Problem 2: Slow-start protection — new backend gets crushed

Naive: New backend B100 joins; table rebuilt; B100 immediately gets its ~655 slots and 1% of all traffic. Fails: B100 has cold caches, cold JITs (Just-In-Time compilers), empty connection pools. p99 spikes to 5 seconds; B100 trips its own circuit breakers; health check fails; B100 ejected. The pool reverts to 99 backends even though it was sized assuming B100 was up. Cascading failure. Fix: weighted ramp-up. The table is built with B100 at weight 0.01 → 0.1 → 0.5 → 1.0 over 60s. Envoy expresses this declaratively with slow_start_config; for Maglev, B100 claims a fractional slot count proportional to its current weight. Common in payments (a new gateway warming its connection to a slow upstream like the card network) and feed-ranking (an in-memory ML model taking 30s to JIT-compile hot paths).
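One simple way to produce such a ramp is a linear schedule with a floor. This is a hypothetical sketch, not Envoy's slow-start formula: `ramp_weight`, the 60s window, and the 0.01 floor are all illustrative assumptions.

```python
def ramp_weight(age_s: float, window_s: float = 60.0, floor: float = 0.01) -> float:
    """Fraction of its full weight a backend receives `age_s` seconds after joining."""
    if age_s >= window_s:
        return 1.0
    # Linear ramp, never below the floor so the backend still sees warm-up traffic.
    return max(floor, age_s / window_s)

for age in (0, 6, 30, 60):
    print(age, ramp_weight(age))
```

The control plane would re-evaluate the weight every few seconds and rebuild the table (or update the scheduler's weights) with the new value until the backend reaches full weight.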

Problem 3: Hot backend — one box overloaded despite balanced LB

Naive: LB does even Maglev hashing or round-robin. Fails: One backend hosts a celebrity user getting 5M req/sec, or has a degrading disk, or is on a noisy hypervisor. LB has no idea; keeps sending its share. Fix: P2C + outlier detection. P2C: pick two random backends per request, send to the less-loaded; max-load variance drops dramatically. Outlier detection (Envoy outlier_detection): eject backends whose error rate or latency is >3σ above cluster mean for >30s, probe them back after a recovery window. Search workloads use this heavily — one slow shard blows up the p99 of a fan-out query. ML inference services do the same for GPU boxes hitting thermal throttling.
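The σ-based ejection rule can be sketched in a few lines. `find_outliers` is a hypothetical helper that looks only at one latency snapshot; a production detector would also require the condition to persist (the >30s above) and cap how much of the pool it may eject.

```python
import statistics

def find_outliers(latency_ms: dict[str, float], sigmas: float = 3.0) -> set[str]:
    """Backends whose latency exceeds the pool mean by more than `sigmas` std devs."""
    values = list(latency_ms.values())
    if len(values) < 2:
        return set()
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)      # population standard deviation
    if stdev == 0:
        return set()
    return {b for b, v in latency_ms.items() if v > mean + sigmas * stdev}

pool = {f"b{i}": 10.0 for i in range(30)}
pool["slow"] = 200.0
print(find_outliers(pool))
```

One caveat worth knowing: with n samples, no single value can sit more than sqrt(n-1) population standard deviations from the mean, so a strict 3σ rule computed this way can never fire in a pool of ten or fewer backends.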

Problem 4: Health-check accuracy — false positive ejects everyone

Naive: Every backend's /healthz checks "can I reach the DB?" The DB has a 2-min outage. Fails: All backends simultaneously unhealthy. LB ejects all. Site dark even after DB recovers. Fix: layered health + panic mode. /healthz = process alive (TCP accept works). /ready = ready to serve. LB uses /healthz; orchestrator (Kubernetes) uses both. Panic mode (Envoy healthy_panic_threshold, default 50%): if more than half are unhealthy, route to ALL of them including the "unhealthy" — better to send traffic to backends that might serve it than to a guaranteed-empty pool. State walkthrough on an observability backend fronting a Prometheus ingestion fleet: 100 ingesters, all healthy; downstream storage hiccups; by T=30s all 100 report unhealthy. Without panic mode: 503 on every scrape, alerts fire on the wrong thing. With panic mode: traffic still flows, error rate caps at maybe 30% instead of 100%.
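Panic mode reduces to one comparison at host-selection time. The sketch below is a simplification of the idea, not Envoy's implementation; `routable_hosts` is a hypothetical name.

```python
def routable_hosts(hosts: dict[str, bool], panic_threshold: float = 0.5) -> list[str]:
    """hosts maps name -> healthy?. Returns the hosts eligible for traffic."""
    if not hosts:
        return []
    healthy = [h for h, ok in hosts.items() if ok]
    if len(healthy) / len(hosts) < panic_threshold:
        return list(hosts)     # panic: ignore health status, spread across everyone
    return healthy

ingesters = {f"ing{i}": i < 40 for i in range(100)}   # only 40 of 100 report healthy
print(len(routable_hosts(ingesters)))
```

The design choice is that a mass health-check failure is more likely to be a shared dependency or a bad check than 60 independent crashes, so routing to everyone loses less traffic than routing to an overloaded remainder.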

Problem 5: Sticky session vs failover

Naive: User session in backend memory; LB uses cookie-stickiness to pin user to B7. Fails: B7 dies, user loses cart. Fix: move session state to Redis/DynamoDB; backends become stateless. Sticky sessions become a performance optimization (warm L1 cache), not a correctness requirement. For workloads that genuinely need sticky — WebSocket (long-lived TCP that can't be reopened transparently), real-time gaming (each player's UDP flow must reach the same game-state authority for the match duration), interactive video conferencing (same SFU — Selective Forwarding Unit — mixes audio for a room) — failover semantics must be explicit: LB sends RST, client reconnects, new backend handles the reconnect (potentially with state-recovery from a session store). The application protocol must tolerate this.

Problem 6: TLS termination cost

Naive: Every request does a fresh TLS 1.2 RSA-2048 handshake. Fails: At 1M handshakes/sec and ~1 ms of RSA private-key work each, that is ~1,000 CPU cores of crypto. The LB becomes CPU-bound on TLS, not routing. Fix: stack of optimizations. (1) TLS 1.3 with X25519 ECDHE: 1-RTT handshake, ECDSA P-256 signing at ~0.1 ms vs RSA at ~1 ms, ~10x cheaper per handshake. (2) Session resumption via PSK (Pre-Shared Key) tickets: returning clients skip the full handshake; if ~95% resume (0-RTT or 1-RTT abbreviated), ~20x fewer full handshakes. (3) Hardware acceleration: AES-NI, QAT (Intel QuickAssist), HSM (Hardware Security Module), up to ~5x. (4) Connection multiplexing: HTTP/2 reuses one TLS connection for many requests, so each handshake is amortized over 100s of requests. Combined: (1) and (2) alone cut the crypto load of 1M handshakes/sec by ~200x, to the equivalent of ~5k full RSA handshakes/sec, easily handled. Public web relies on this heavily; mesh mTLS would melt CPUs without HTTP/2 multiplexing and certificate caching.

Problem 7: SYN floods and L3/L4 DDoS

Naive: Attacker sends 1M SYN/sec with spoofed source IPs; LB allocates a flow entry per SYN. Fails: Flow table fills, legit flows displaced, site dies. Fix: SYN cookies at L4 — encode connection state in the SYN-ACK's initial sequence number; LB stores zero state for half-open connections. Verify the cookie on ACK arrival before allocating. Linux kernel does this natively; XDP-based LBs implement it in eBPF. For volumetric DDoS (100 Gbps of garbage), the LB triggers BGP blackholing of the attacker's prefix upstream; Cloudflare does this autonomously through its global DDoS detector.
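The stateless idea behind SYN cookies can be sketched with a keyed hash. This is a simplified illustration, not the Linux kernel's cookie layout (which also packs an MSS index and a counter into specific bits); `SECRET`, `syn_cookie`, and `ack_is_valid` are hypothetical names.

```python
import hashlib
import hmac
import struct

SECRET = b"rotate-me-regularly"   # hypothetical per-LB secret

def syn_cookie(five_tuple: tuple, client_isn: int, time_slot: int) -> int:
    """32-bit value to use as the SYN-ACK initial sequence number."""
    msg = struct.pack("!IHIHBII", *five_tuple, client_isn, time_slot & 0xFFFFFFFF)
    return int.from_bytes(hmac.new(SECRET, msg, hashlib.sha256).digest()[:4], "big")

def ack_is_valid(five_tuple: tuple, client_isn: int, ack_number: int, now_slot: int) -> bool:
    """The final ACK must acknowledge cookie+1 for the current or previous time slot."""
    for slot in (now_slot, now_slot - 1):
        expected = (syn_cookie(five_tuple, client_isn, slot) + 1) & 0xFFFFFFFF
        if ack_number == expected:
            return True
    return False

flow = (0x0A000001, 40000, 0xC6336401, 443, 6)   # src_ip, src_port, dst_ip, dst_port, proto
cookie = syn_cookie(flow, client_isn=12345, time_slot=100)
print(ack_is_valid(flow, 12345, (cookie + 1) & 0xFFFFFFFF, now_slot=100))
```

Nothing is stored between the SYN and the ACK: the LB recomputes the cookie from the ACK's own headers, so a spoofed-source flood consumes CPU for hashing but no flow-table entries.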

§8. Failure modes

Mode A: Single L4 LB box crashes

10 L4 LB boxes, ECMP across them, each holding ~10% of flows. LB-7 fails. Edge router's ECMP detects it dead (BFD — Bidirectional Forwarding Detection — link probe at 100 ms intervals or BGP session drop). ECMP rehashes across the 9. Flows that hit LB-7 now hit LB-3; LB-3 has no flow table entry, falls through to Maglev — same Maglev table on every LB → same backend. Flow continues. ~5–10 packets lost during transition. Durability point: the Maglev table itself, replicated by control plane to every LB; no per-flow state needs cross-replication.

Mode B: Entire PoP goes dark

PoP fiber cut. BGP detects no peer (or external monitor withdraws the route). Within 10–60s (BGP convergence), traffic to the anycast VIP shifts to the next-nearest PoP. All in-flight TCP at the dead PoP RESETs; clients retry, retry lands at new PoP; new connections work immediately. Durability point: BGP routes + the backend pool reachable from every PoP.

Mode C: Control-plane death

etcd/Consul cluster crashes. The data plane keeps running with the last known Maglev table: it never reads from the control plane in the hot path; the table is pre-pushed. Backend additions/removals stall. A backend that dies still looks healthy to the L4 LB, so the L4 LB keeps forwarding to it; the backend doesn't respond, and the L7 LB notices via passive health check (timeouts) and ejects it. Durability point: the data plane's local cached table; the control plane is not in the request path. This is a load-bearing decision: control planes routinely have higher unavailability than data planes, and the data plane must ride out 30-minute control-plane outages.

Mode D: Partition between L4 and L7

L4 LB can reach 30% of L7 LBs (rest are isolated). L4's Maglev table still lists all L7s as alive. 70% of flows time out. Fix: L4 has its own active health check on the L7 tier (separate from L7's check on backends). Within ~5s the L4 notices the dead L7s, rebuilds its table, continues on the remaining 30%. Durability point: each L4 makes its own eviction decision based on local probes — no cross-LB coordination.

Mode E: Permanent loss + simultaneous deploy

100 backends; 30 are out of rotation for a deploy; mid-deploy a hardware failure takes out 15 more. Effective pool 100 → 55. Each surviving backend now sees ~80% more traffic (100/55 ≈ 1.82x). If sized at 60% baseline utilization, this pushes them to ~109% → spiral. Fix: roll in smaller batches (10 at a time leaves 75 in service after the same failure, ~80% utilization), keep an N+2 capacity buffer (one for failure, one for deploy), and add deploy gates that pause if the observed error rate spikes. Combine with slow-start (Problem 2) so warming boxes don't get crushed during recovery.
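The utilization arithmetic is worth keeping as a one-line check in deploy tooling. `utilization_after_loss` is a hypothetical helper that assumes traffic redistributes evenly across the survivors.

```python
def utilization_after_loss(total: int, out_of_service: int, baseline_util: float) -> float:
    """Per-backend utilization once `out_of_service` of `total` backends are gone."""
    remaining = total - out_of_service
    if remaining <= 0:
        raise ValueError("no backends left")
    return baseline_util * total / remaining

print(round(utilization_after_loss(100, 45, 0.60), 3))   # 30 deploying + 15 failed
print(round(utilization_after_loss(100, 25, 0.60), 3))   # 10 deploying + 15 failed
```

A deploy gate can refuse to take the next batch out of rotation whenever this projection, with one extra failure added, would exceed a chosen ceiling.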

§9. Why not round-robin DNS

The classic junior answer: "Just give DNS multiple A records." It's almost right and totally wrong.

dig api.example.com → [198.51.100.1, ..., .10]. Client picks one (typically first), opens TCP.

Failure 1 — no health awareness. Backend .5 dies. DNS still returns it. Until DNS is updated AND TTLs expire on every resolver in the world, ~10% of traffic hits a dead backend. TTL of 60s is the floor; ISPs and mobile carriers cache well beyond TTL.

Failure 2 — no stickiness. Client may pick .3 first request and .7 second (resolver behavior is implementation-defined). In-memory session, sticky upload chunk, WebSocket — all break.

Failure 3 — uneven distribution. Round-robin DNS distributes across resolvers, not clients. If 80% of users are behind one big ISP resolver that cached .1, you get 80% on .1.

Failure 4 — no L7 features. No path routing, no TLS termination, no WAF, no rate limit, no canary.

Failure 5 — minutes-to-hours failover. DNS-driven failover is bounded by the worst-cached TTL on the planet. Anycast + LB failover is bounded by BGP convergence — 10–60s, sometimes sub-10s with reflectors. For 99.99% SLA, DNS-only is unviable.

DNS load balancing is fine as a layer above the LB tier — picking which region's anycast VIP to send the client to. Bad substitute for the LB tier.

§10. Scaling axes

Type 1: uniform growth (more backends, more traffic)

Scale	Topology
1k RPS	Single NGINX, no L4 layer
100k RPS	NGINX cluster (3 boxes), ECMP'd, active-active
1M RPS	L4 + L7 split; L4 with kernel bypass underneath L7
10M RPS	Multi-region anycast, 3 regions, BGP-driven failover
100M RPS	Anycast across 50+ PoPs; Maglev table sizing matters — 65537 slots, 1k backends, 65 slots each

Inflection points: ~100k RPS (outgrow single NGINX → cluster + ECMP); ~1M RPS (outgrow user-space-only → L4 with kernel bypass); ~10M RPS (outgrow single region → multi-region); ~100M RPS (outgrow single Maglev table → partitioned tables per region or larger T).
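
To make the table-sizing remark concrete, here is a minimal sketch of Maglev-style lookup-table population. It follows the published round-robin fill over per-backend permutations, but the hash choice (SHA-256 with a salt) and the names build_maglev_table and pick_backend are assumptions for illustration, not any production implementation.

```python
import hashlib
from typing import List, Optional, Tuple


def _h(name: str, salt: str) -> int:
    """Stable 64-bit hash (illustrative choice; real LBs use faster hashes)."""
    digest = hashlib.sha256(f"{salt}:{name}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def _is_prime(n: int) -> bool:
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))


def build_maglev_table(backends: List[str], table_size: int = 65537) -> List[str]:
    """Fill a lookup table so each backend owns floor or ceil of M/N slots."""
    if not backends:
        raise ValueError("need at least one backend")
    if not _is_prime(table_size):
        # A prime size guarantees every (offset, skip) walk visits all slots.
        raise ValueError("table_size must be prime")
    n = len(backends)
    offsets = [_h(b, "offset") % table_size for b in backends]
    skips = [_h(b, "skip") % (table_size - 1) + 1 for b in backends]
    next_idx = [0] * n
    table: List[Optional[str]] = [None] * table_size
    filled = 0
    while True:
        for i in range(n):
            # Backend i claims the next free slot in its own permutation.
            slot = (offsets[i] + next_idx[i] * skips[i]) % table_size
            while table[slot] is not None:
                next_idx[i] += 1
                slot = (offsets[i] + next_idx[i] * skips[i]) % table_size
            table[slot] = backends[i]
            next_idx[i] += 1
            filled += 1
            if filled == table_size:
                return table  # type: ignore[return-value]


def pick_backend(table: List[str], five_tuple: Tuple) -> str:
    """Hash the flow's 5-tuple into the table; same flow, same backend."""
    key = "|".join(map(str, five_tuple))
    return table[_h(key, "flow") % len(table)]
```

With the default 65537 slots and 1000 backends, the round-robin fill gives every backend 65 or 66 slots, which is the "65 slots each" figure in the table above. A larger table smooths the distribution further at the cost of memory and rebuild time.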

Type 2: hot route concentrated

The fleet might be sized for 10M RPS uniformly, but one route — /api/feed for a feed-heavy product, /api/checkout for a flash sale, /v1/inference/recommend for a hot new model — takes 80% of traffic. Adding boxes doesn't help; the hot route concentrates on a fixed backend pool.

Fixes:

Per-route connection pools sized by traffic share, not equally.
Per-route circuit breakers: when feed saturates, shed feed with 503s, protect checkout (Envoy cluster.max_pending_requests).
Per-route priority: checkout requests get queueing priority over feed.
Request coalescing: when a single user dominates (a celebrity's profile gets 5M req/sec), recognize identical concurrent requests in a small window and fan one backend response back to many clients (Cloudflare, Varnish).
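
The request-coalescing fix can be sketched as a "single flight" wrapper: concurrent requests for the same key share one in-flight backend call. This is a minimal asyncio sketch; the class name Coalescer and the fetch callable are hypothetical, and real proxies add TTLs, size limits, and cache-key normalization.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


class Coalescer:
    """Collapse identical concurrent requests into one backend call."""

    def __init__(self, fetch: Callable[[str], Awaitable[Any]]):
        self._fetch = fetch                      # async: key -> response
        self._inflight: Dict[str, asyncio.Task] = {}

    async def get(self, key: str) -> Any:
        task = self._inflight.get(key)
        if task is None:
            # First caller for this key starts the real backend request.
            task = asyncio.create_task(self._run(key))
            self._inflight[key] = task
        # shield: one cancelled waiter must not cancel the shared request.
        return await asyncio.shield(task)

    async def _run(self, key: str) -> Any:
        try:
            return await self._fetch(key)
        finally:
            # Later requests (after completion) trigger a fresh fetch.
            self._inflight.pop(key, None)
```

Only requests that overlap in time are merged; once the shared call finishes, the next request goes to the backend again, so this complements a cache rather than replacing one.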

The deepest hot-route scenarios (celebrity hotspot, sale-day overload) are architectural problems above the LB, but the LB enforces shedding and prioritization.

§11. Decision matrix vs adjacent categories

Need	Right tool	Why not the alternative
Distribute TCP/UDP across N backends with stickiness, sub-ms hop, packet-level perf	L4 LB (Maglev / Katran / Unimog / NLB)	L7 is 100x slower per packet
HTTP path routing, TLS terminate, per-route rate limit, retries	L7 LB (Envoy / NGINX / HAProxy / ALB)	L4 can't see paths; API gateway adds 5–50ms not justified hot-path
Inter-service mTLS mesh with observability, retries, canary	Service mesh (Envoy sidecars + Istio/Linkerd)	Standalone L7 works but mesh embeds policy at every pod
Cache content at the edge	CDN (Cloudflare, Akamai, Fastly, Open Connect)	LB doesn't cache; CDN has its own LB inside
Auth, schema validation, throttling, transformation at API edge	API gateway (Kong, AWS API Gateway, Apigee)	L7 LB does some, but gateways add features at higher latency
Pool DB connections, route reads/writes	DB proxy (PgBouncer, ProxySQL, vtgate)	Generic L7 doesn't speak Postgres wire protocol
First-hop coarse geographic distribution	GeoDNS	DNS-as-LB broken at scale (§9); DNS-on-top fine

Thresholds: single backend hits ~30k RPS → add L4 LB. HTTPS terminate burns >30% backend CPU → move to L7 LB. Global clients → anycast + multi-PoP L4. Microservice topology >20 services with mTLS → service mesh. Public clients globally → CDN. API consumers need auth/throttling → API gateway in front of L7.

§12. Use case gallery

Public web traffic (Cloudflare, Akamai, Fastly)

The textbook stack. Anycast VIPs from 300+ PoPs. L4 (Unimog at Cloudflare, custom forwarders at Akamai) does packet-level forwarding with Maglev hashing in XDP/eBPF at ~5 Mpps per box. L7 (NGINX-variant at Cloudflare, Apache Traffic Server at Akamai) does TLS 1.3 terminate, HTTP/3 (QUIC), WAF, rate limit, CDN cache lookup. The hard problem is DDoS — Cloudflare mitigates multi-Tbps attacks by dropping packets in XDP before they reach L7.

Internal microservices (Envoy in a service mesh)

At Lyft, Stripe, Airbnb, every pod has an Envoy sidecar that handles outgoing requests with mTLS, retries, circuit breaking, observability (per-call tracing + metrics), connection pooling to N upstream pods. Cluster membership pushed via xDS from a control plane (Istio Pilot, custom). The hard problem is fan-out — one user request touches 50 services, so a 1% per-hop failure rate produces a 40% end-to-end failure rate; retries and hedging are essential.
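
The fan-out arithmetic above is easy to check. The sketch below assumes hops fail independently and that each retry is an independent attempt, which real correlated failures violate; the function name is illustrative.

```python
def end_to_end_failure(per_hop_failure: float, hops: int, attempts: int = 1) -> float:
    """Probability that at least one of `hops` calls fails after `attempts` tries each."""
    # A hop fails only if every attempt fails (independence assumed).
    per_hop_after_retries = per_hop_failure ** attempts
    # The request succeeds only if every hop succeeds.
    return 1.0 - (1.0 - per_hop_after_retries) ** hops


no_retry = end_to_end_failure(0.01, hops=50)               # about 0.395
one_retry = end_to_end_failure(0.01, hops=50, attempts=2)  # about 0.005
```

Under these assumptions a single retry per hop takes the 50-hop failure rate from roughly 40% to roughly 0.5%, which is why retries and hedging carry so much weight in a mesh.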

Database connection pooling (PgBouncer, ProxySQL, vtgate)

Postgres connections are expensive (fork-per-connection, ~10 MB RSS each, ~ms to establish). A direct topology with 10k pods × 10 conns/pod = 100k connections explodes the DB. PgBouncer pools and shares ~100 backend connections among 100k client connections via transaction-level multiplexing. ProxySQL/vtgate go further — parse SQL, route reads to replicas, writes to primary, support read-after-write via session pinning, shard queries to the right keyspace.

API gateway routing (Kong, AWS API Gateway, Apigee)

External developers hit one endpoint; gateway authenticates (OAuth2, API key, JWT), enforces per-customer quotas, applies request transformation (REST→gRPC, version mapping), routes to the right microservice. Higher latency budget (5–50 ms) than a pure LB because of policy layers. L7 LB with extensive plugin architecture optimized for policy enforcement.

Video streaming routing (Netflix Open Connect, YouTube CDN)

Open Connect appliances (OCAs) run NGINX-based servers inside ISP networks. Client requests for a video segment go through Netflix's control plane to pick the best OCA — same-ISP if available, nearest otherwise, with capacity checks. The LB is split: a global service decides which OCA owns the user's session, then DNS or HTTP redirect sends the client there. OCAs handle 100k+ concurrent HTTPS, multi-Gbps each.

Real-time gaming (low-latency UDP LB)

A multiplayer match has 64 players each sending UDP every 20–50 ms to the game-state authority. The LB must route all 64 players' UDP flows to the same backend (the authoritative game server) for the match's duration. UDP has no connection; "stickiness" is (src_ip, src_port) → game_id mapping that lasts the match. Hard problems: sub-50 ms p99 LB hop latency, sticky failover (if the server dies, a new server must inherit game state — usually outside LB scope). L4 UDP LB with a custom session table keyed on game_id rather than 5-tuple.

Database read-replica routing

A primary + 10 read replicas. The LB (often inside the DB driver, or ProxySQL) routes reads to a least-loaded healthy replica and writes to the primary. Replica lag is the constant problem: a read right after a write may hit a replica that hasn't caught up. Solutions: session-pin reads-after-writes to the primary for N seconds, or LSN (Log Sequence Number) tracking, where the client supplies the minimum LSN it must observe and the LB picks a replica that has replayed at least that far. This is a protocol-aware L7 LB with replica-lag awareness.
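
A minimal sketch of the LSN-tracking variant follows. The Replica fields and the pick_replica function are hypothetical; it assumes the LB already knows each replica's replayed LSN and current load from its own polling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    replayed_lsn: int      # how far this replica has applied the log
    active_queries: int    # load signal used for least-loaded choice
    healthy: bool = True


def pick_replica(replicas: List[Replica], min_lsn: int) -> Optional[Replica]:
    """Least-loaded healthy replica that has replayed at least min_lsn."""
    eligible = [r for r in replicas if r.healthy and r.replayed_lsn >= min_lsn]
    if not eligible:
        return None  # caller falls back to the primary
    return min(eligible, key=lambda r: r.active_queries)
```

The client would pass the LSN returned by its last write as min_lsn; a None result means no replica is fresh enough yet and the read should go to the primary.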

§13. Real-world implementations with numbers

Google Maglev (NSDI 2016). Distributed software L4 LB. ~10 Mpps per box (10 GbE line rate with small packets) using kernel-bypass packet I/O; cluster >1 Tbps fronting Search, YouTube, GCP. Introduced the lookup-table consistent-hash algorithm every modern L4 copies.
Facebook Katran (open-sourced 2018). XDP/eBPF L4 LB. ~5 Mpps per server. Fronts every Meta property. Maglev-style consistent hashing in eBPF maps.
Cloudflare Unimog (~2020). XDP-based L4 LB. Trillions of requests per day, ~60M RPS peak, mitigates multi-Tbps DDoS (see §12).
AWS Hyperplane (NLB data plane). Fleet of compute nodes with shared-memory flow tables. Millions of RPS per NLB, 20M+ concurrent connections per AZ, sub-100µs added latency.
AWS ALB. Managed L7. Scales by LCUs (Load Balancer Capacity Units). Sustained ~25–100k RPS per ALB.
NGINX at Netflix Open Connect. Every OCA appliance runs NGINX as HTTPS terminator + HTTP range-byte responder. Hundreds of thousands of concurrent connections per appliance, multi-Gbps.
HAProxy at GitHub. Fronts github.com, ~1M+ RPS, Lua-based routing for Git over HTTPS, Git over SSH proxy, API, web UI.
HAProxy at Stack Overflow (2016). ~30k RPS sustained, 70k peak, ~6 HAProxy boxes fronting the entire site.
Envoy at Lyft. Designed and open-sourced 2016, both edge and sidecar. ~250k RPS per proxy at peak. xDS dynamic config protocol originated here.
Envoy at Stripe. mTLS everywhere, sophisticated retry + idempotency-key handling. Millions of RPS fleet-wide.
GLB (GitHub Load Balancer) (open-sourced 2018). L4 LB with a DPDK-based "director" tier that consistently hashes flows to proxy boxes, handling backend failures without ECMP rehash issues.
PgBouncer at Heroku, Instagram (early). Multiplexes thousands of client connections onto tens of backend connections — 100x cut.
Vitess vtgate at YouTube, Slack. MySQL L7 LB that shards queries, hides shard topology. Millions of QPS at YouTube.
Linkerd at Microsoft, Buoyant customers. Rust-based sidecar proxy as a service-mesh alternative to Envoy.

The spread runs from ~10k RPS to ~60M RPS, almost four orders of magnitude of capacity envelope, with very different architectural choices at each tier.

§14. L7 routing rules in depth

L4 routes by 5-tuple; L7 routes by everything in the request. The L7 routing language is what separates a packet forwarder from a programmable application gateway. Envoy's HTTP route configuration (RouteConfiguration → VirtualHost → Route) is the canonical model that most of the industry now mirrors (Istio VirtualService, Gloo, AWS App Mesh, Contour HTTPProxy, Linkerd ServiceProfile). Walking the dimensions:

Host-based routing. The first cut. Host: api.example.com lands on one VirtualHost (VH), Host: admin.example.com on another. Wildcards: *.example.com matches subdomains; longest-match wins (api.foo.example.com beats *.example.com). At Cloudflare or any multi-tenant edge, host-based routing scales to millions of VirtualHosts; the routing data structure is a radix trie on the reversed FQDN (Fully Qualified Domain Name) so longest-suffix-match is O(label count). One IP + one cert (via SNI — Server Name Indication, see §16) + one Envoy fleet serves the entire tenant base.
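
The reversed-label lookup can be sketched as below. This uses a plain nested-dict trie rather than a compressed radix trie, treats "*." as matching one or more leading labels, and ignores ports; HostTrie and its methods are illustrative names, not Envoy's API.

```python
from typing import Dict, Optional


class HostTrie:
    """Longest-suffix host matching over reversed DNS labels."""

    def __init__(self) -> None:
        self._root: Dict = {}

    def add(self, pattern: str, vhost: str) -> None:
        labels = pattern.lower().rstrip(".").split(".")
        wildcard = labels[0] == "*"
        if wildcard:
            labels = labels[1:]
        node = self._root
        for label in reversed(labels):           # com -> example -> api
            node = node.setdefault("children", {}).setdefault(label, {})
        node["wild" if wildcard else "exact"] = vhost

    def match(self, host: str) -> Optional[str]:
        labels = host.lower().rstrip(".").split(".")
        node = self._root
        best = None                               # deepest wildcard seen so far
        for depth, label in enumerate(reversed(labels)):
            node = node.get("children", {}).get(label)
            if node is None:
                break
            remaining = len(labels) - depth - 1
            if remaining == 0 and "exact" in node:
                return node["exact"]              # exact host beats any wildcard
            if remaining > 0 and "wild" in node:
                best = node["wild"]               # longer suffix overrides shorter
        return best
```

The walk touches one node per label, so lookup cost depends on the number of labels in the hostname and not on the number of configured virtual hosts.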

Path-based routing. Within a VirtualHost, match by URL (Uniform Resource Locator) path. Three modes: exact (/health), prefix (/api/v1/), regex (^/users/(\d+)$). Performance ordering, fastest first: exact (hash lookup, ~10 ns), prefix (trie walk, ~100 ns), regex (RE2, Google's safe regex engine, ~µs). Order matters in Envoy: routes are evaluated top-to-bottom and the first match wins, so a catch-all / placed ahead of /api/v1/users shadows the specific route and it never matches. Common pattern: top-level catch-all routes per service prefix → service-internal routes refine further.
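
The first-match ordering hazard is easy to reproduce in a toy route table. This sketch uses a linear scan and Python's re module (not RE2) purely for illustration; Route and route are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Route:
    kind: str      # "exact", "prefix", or "regex"
    pattern: str
    cluster: str

    def matches(self, path: str) -> bool:
        if self.kind == "exact":
            return path == self.pattern
        if self.kind == "prefix":
            return path.startswith(self.pattern)
        if self.kind == "regex":
            return re.fullmatch(self.pattern, path) is not None
        raise ValueError(f"unknown match kind: {self.kind}")


def route(path: str, routes: List[Route]) -> Optional[str]:
    """Evaluate top-to-bottom; the first matching route wins."""
    for r in routes:
        if r.matches(path):
            return r.cluster
    return None


good = [
    Route("exact", "/health", "health"),
    Route("regex", r"/users/\d+", "user-profile"),
    Route("prefix", "/api/v1/users", "users"),
    Route("prefix", "/", "web"),            # catch-all goes last
]
bad = [Route("prefix", "/", "web")] + good   # catch-all first shadows everything
```

With the good ordering, /api/v1/users/42 reaches the users cluster; with the bad ordering every path, including /health, lands on web.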

Header-based routing (the canary primitive). Match on arbitrary request headers. The canonical canary case: x-canary: true routes to the new version's cluster, everything else to stable. Combined with header injection at the entry point (mobile app sends x-canary: true for internal employees by user-ID lookup), you get employee-only canary before opening to the public. The same primitive does dark launches (read-only canary that doesn't affect user state), A/B tests (header set by an experiment-assignment service upstream), and traffic replay (header added by a replay tool that mirrors prod traffic to staging).

Weight-based routing (gradual rollout). Within one route match, split traffic by weighted clusters: 99% to v1, 1% to v2 → 95/5 → 50/50 → 1/99 → 0/100, over hours or days. The weights are a control-plane knob (xDS push or Istio VirtualService update); Envoy converges within milliseconds. The probabilistic split is per-request and stateless; if a user must consistently see one variant, combine with a hash policy (e.g., hash on cookie userId, modulo 100, compare to weight) to make the choice sticky. This is the bedrock of progressive delivery — kubectl apply on a Flagger or Argo Rollouts CRD (Custom Resource Definition) drives the weights automatically based on observed error rate and latency from a metrics provider (Prometheus / Datadog).
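
A sticky variant of the weighted split can be sketched by hashing a stable user identifier into a fixed number of buckets. The salt, the bucket scheme, and the function name pick_variant are assumptions for illustration; Envoy's own hash policies differ in detail.

```python
import hashlib
from typing import List, Tuple


def pick_variant(user_id: str, weights: List[Tuple[str, int]], salt: str = "rollout-1") -> str:
    """Deterministically map a user to a variant according to integer weights."""
    total = sum(w for _, w in weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    cumulative = 0
    for name, weight in weights:
        cumulative += weight
        if bucket < cumulative:
            return name
    raise AssertionError("unreachable: bucket is always below total")


stage_1 = [("v1", 95), ("v2", 5)]
stage_2 = [("v1", 50), ("v2", 50)]
```

Because the canary is listed last and the total stays at 100, the v2 bucket range only grows from stage to stage, so a user who has seen v2 keeps seeing v2 as the rollout widens. Changing the salt reshuffles users for the next experiment.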

Query-parameter-based routing. Match on ?version=v2 or ?region=eu. Less common than header-based at the edge (URL parameters are user-visible and cacheable, which can be wrong for canary), but heavily used by internal API gateways where the query string is part of the contract.

Mirroring / shadowing (the test-in-prod primitive). Send N% of real traffic to a secondary cluster, ignore the response. The client sees only the primary's response, so user-visible behavior is unchanged. The shadow cluster (typically a new version) sees real production load and emits its own logs/metrics. Differential analysis compares primary-vs-shadow latency, error rate, response bodies (modulo non-determinism). Envoy request_mirror_policies does this natively — fire-and-forget, response discarded. Caveats: side effects double-execute (do not mirror writes against the real DB; the shadow needs a sandboxed datastore or read-only mode), and the shadow's load doubles the upstream pressure on dependencies (cache, search) unless they're sandboxed too.

Composability. All these dimensions stack. A real Envoy route might say "if Host is api.example.com AND path matches /v2/feed AND header x-experiment-bucket is in {7..9}, then 95% to stable, 5% to canary, mirroring 10% of traffic to integration-test cluster." Envoy compiles the whole tree into a route-matching DAG (Directed Acyclic Graph) and evaluates per request at ~100k RPS per core. Service mesh control planes (Istio Pilot, Linkerd's destination controller) own the high-level intent (canary at 5%) and translate to xDS RouteConfiguration blobs pushed to data planes.

§15. Health check strategies

The LB's view of backend health drives every routing decision; health-check accuracy is therefore load-bearing for site availability. Three families, plus several wire-protocol choices.

Active health checks. LB polls each backend on a schedule (/healthz every 2–5 seconds is typical). Pass → mark healthy and routable; fail → eject after a threshold (3 consecutive failures is a common unhealthy_threshold setting in Envoy); pass after eviction → re-add after a recovery threshold. Pros: fast detection of fully dead boxes (the box that stopped responding to TCP), independent of real traffic, evenly probes the fleet. Cons: at fleet scale this becomes a health-check storm. If 1000 LBs each probe every backend every 10 seconds, each backend absorbs 1000 / 10 = 100 probes/sec, and across 1000 backends the mesh generates 100k probes/sec just for liveness; at the typical 2–5 second interval it is 2–5x worse. Mitigation: sample (each LB checks a random subset of backends), share results across LBs via the control plane, or use lazy probing where backends are only probed if they haven't received real traffic in N seconds.
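
The threshold behaviour and the probe-storm arithmetic can be sketched as follows. The class and function names are hypothetical, and starting every backend as healthy is a simplifying assumption.

```python
class HealthState:
    """Per-backend hysteresis: flip state only after N consecutive contrary probes."""

    def __init__(self, unhealthy_threshold: int = 3, healthy_threshold: int = 2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True      # assumption: new backends start routable
        self._streak = 0         # consecutive probes contradicting current state

    def observe(self, probe_ok: bool) -> bool:
        if probe_ok == self.healthy:
            self._streak = 0     # probe agrees with current state
        else:
            self._streak += 1
            limit = self.unhealthy_threshold if self.healthy else self.healthy_threshold
            if self._streak >= limit:
                self.healthy = not self.healthy
                self._streak = 0
        return self.healthy


def probes_per_backend_per_sec(num_lbs: int, interval_s: float) -> float:
    """Every LB probes every backend once per interval."""
    return num_lbs / interval_s


def fleet_probes_per_sec(num_lbs: int, num_backends: int, interval_s: float) -> float:
    return num_lbs * num_backends / interval_s
```

The hysteresis keeps one dropped probe from ejecting a backend, at the cost of a detection delay of roughly threshold times interval.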

Passive health checks (outlier detection). LB watches real traffic and ejects backends whose error rate or latency exceeds a threshold — Envoy's outlier_detection ejects a backend that returns 5xx for consecutive_5xx requests in a row, with optional success-rate-based ejection (eject backends >3σ below cluster mean). Pros: no extra probing traffic, observes real workload mix, detects "degrading" backends (still responding to /healthz but slow on real requests). Cons: requires actual traffic to detect failures (a backend with zero traffic looks healthy forever), and can be fooled by transient downstream failures (DB blip → all backends look bad → ejection storm → site dies).
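
Success-rate ejection can be sketched as a comparison against the cluster mean. The function name, the min_hosts guard, and the use of population standard deviation are assumptions for illustration; the 3σ default mirrors the figure in the text, not any specific proxy's default.

```python
from statistics import mean, pstdev
from typing import Dict, List


def success_rate_outliers(
    success_rates: Dict[str, float],
    stdev_factor: float = 3.0,
    min_hosts: int = 5,
) -> List[str]:
    """Hosts whose success rate is more than stdev_factor sigmas below the mean."""
    if len(success_rates) < min_hosts:
        return []  # too few hosts for the statistics to mean anything
    values = list(success_rates.values())
    threshold = mean(values) - stdev_factor * pstdev(values)
    return sorted(h for h, rate in success_rates.items() if rate < threshold)
```

Because the threshold is relative to the cluster, a shared dependency failure that drags every backend down equally produces no outliers in this sketch, whereas a consecutive-5xx rule would eject everyone. Note that with very small pools no host can be 3σ from the mean, so a lower factor or a minimum pool size is needed.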

Hybrid (active + passive). The production default. Active probing finds completely broken backends (TCP RST, /healthz returns 500, process gone). Passive ejection finds newly broken backends in the few-second window between brokenness onset and the next active probe — and finds subtly broken backends that pass /healthz but fail real requests. Envoy and modern L7 LBs run both simultaneously: active determines the eligible pool; passive prunes from that pool based on observed behavior.

HTTP health checks. The richest variant. LB sends a request (typically GET /healthz), expects a 2xx or 3xx status; can require specific headers (e.g., x-app-version: v3.2.1 to detect mis-deployed boxes), can match a substring in the response body. Cost: an HTTP request per backend per interval. Best for L7 LB checking L7 backends.

TCP health checks. Just open a TCP connection, send optional bytes, expect optional response, close. Cheaper than HTTP, but coarser — verifies the process accepts TCP, says nothing about whether the app behind it is healthy. Right for L4 LB, or as a sanity check ahead of HTTP.

gRPC health checks. Standardized gRPC service (grpc.health.v1.Health/Check) that returns SERVING / NOT_SERVING / SERVICE_UNKNOWN. Streaming variant (Watch) avoids the per-probe handshake by maintaining a long-lived stream. Native for gRPC services; Envoy grpc_health_check config understands it directly.

The "/healthz returns 200 but service is broken" disaster. Classic outage pattern. /healthz returns 200 if the process is alive — it does not check the DB pool, doesn't run a query, doesn't verify the auth token cache is populated. Process is alive, real requests fail with 500. Mitigations:

Deep health checks that touch downstream dependencies. /health/deep runs a 1-row SELECT against the primary DB, pings Redis, checks the auth cert is unexpired. Run sparingly (every 30 s, not every 2 s) to avoid amplifying load on dependencies.
Synthetic checks run a real user-facing transaction (login + read a feed + log out) and require it succeed end-to-end. Run from a separate prober (Pingdom, Datadog Synthetics, in-house) outside the LB, results feed into deploy gates and alerting, not directly into LB routing.
Liveness vs readiness separation. Liveness (= "is the process alive?") drives restart decisions (Kubernetes kubelet restarts the pod if liveness fails). Readiness (= "is the process ready to serve?") drives routing decisions (LB sends traffic only to ready pods). Conflating them causes a degraded backend to be restarted in a loop instead of just being de-routed temporarily.
Panic mode (Problem 4 in §7). When >50% of the pool is unhealthy, route to everyone anyway — a downstream-shared failure looks like a fleet-wide outage to /healthz, and you'd rather try than guarantee a black hole.
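
The panic-mode rule in the last item can be sketched in a few lines. The function name and the 0.5 default are illustrative; the threshold is the fraction of healthy hosts below which health is ignored.

```python
from typing import Dict, List


def routable_hosts(health: Dict[str, bool], panic_threshold: float = 0.5) -> List[str]:
    """Return the hosts eligible for traffic, ignoring health when in panic."""
    if not health:
        return []
    healthy = [h for h, ok in health.items() if ok]
    if len(healthy) / len(health) < panic_threshold:
        # Too much of the pool looks dead: more likely a shared failure or a
        # broken check than a truly dead fleet, so spread load over everyone.
        return sorted(health)
    return sorted(healthy)
```

Without this guard, the few backends still marked healthy would receive the whole load and likely fail too.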

Health-check accuracy is fundamentally hard because "is this backend healthy?" is application-defined and the LB is generic. The honest answer in production is "no single check is sufficient; layer them."

§16. TLS termination deep dive

TLS termination is where cryptography meets traffic engineering. The choices are far from neutral.

Where to terminate. Three options.

Edge LB termination. The outermost LB (L7 at the PoP) terminates TLS, decrypts, routes plaintext (or re-encrypts with internal certs) to backends. Pros: TLS work centralized on a few boxes with hardware acceleration (AES-NI, QAT — Intel QuickAssist Technology, dedicated cards); backends speak plain HTTP; observability/WAF/inspection all work on plaintext. Cons: plaintext in the data center is a compliance issue for regulated workloads; an internal compromise reads bare traffic.
L7 LB termination + re-encryption. Edge L4 forwards to L7 LB which terminates TLS, runs L7 policies, then re-encrypts with internal mTLS to the backend. Pros: WAF/canary/etc. still work; backends still authenticated via mTLS. Cons: double the TLS work — once at edge, once at L7-to-backend. Usually fine because internal certs use ECDSA and session resumption is near-100%.
Pass-through (TLS passthrough straight to the backend, as opposed to TLS bridging, which decrypts and re-encrypts). L4 LB forwards the encrypted bytes; only the backend has the cert. Pros: end-to-end encryption, simplest compliance story. Cons: L7 LB can't inspect or route by path/header (the bytes are encrypted); WAF impossible at the LB; SNI is the only signal available, so routing decisions degrade to one route per hostname. Used for special cases: WebSocket-heavy traffic where L7 features aren't needed, or strict regulatory requirements.

The mainstream pattern is edge termination at the LB closest to the public, mTLS re-encryption from there inward.

SNI (Server Name Indication). Without SNI, one IP can host one cert (the cert is sent during the TLS handshake before the request, so the server doesn't know which domain the client wants). SNI is a TLS extension where the client sends the target hostname in the ClientHello, in plaintext. The server picks the matching cert from a SNI map. This is what makes anycast multi-tenant possible — one VIP at Cloudflare's edge serves millions of tenants' certs via SNI lookup. The SNI map at scale is a hash table keyed on hostname; cert objects (private key, cert chain, OCSP staple) are cached in memory and rotated independently. The plaintext SNI is also a leak — middleboxes (ISPs, governments) can see which domain you're hitting even over HTTPS. Encrypted ClientHello (ECH) is the post-2024 fix: the ClientHello is encrypted to a public key published in DNS, so the SNI is no longer visible. Cloudflare and Google enable ECH on supporting clients; rollout has been gradual.

ALPN (Application-Layer Protocol Negotiation). Another TLS extension. Client advertises supported protocols (h2, http/1.1) in the ClientHello; server picks one and includes the choice in the ServerHello. This is how a single TLS endpoint serves HTTP/1.1 and HTTP/2 on the same TCP port (443); HTTP/3 negotiates h3 the same way inside the QUIC handshake on UDP 443. The LB negotiates ALPN per connection; if the client supports h2, the LB selects it, multiplexing many requests over one TCP connection and amortizing one handshake across all of them. ALPN identifiers also exist beyond plain HTTP (some gRPC stacks advertise grpc-exp alongside h2, since gRPC rides on HTTP/2). WebSocket is not an ALPN protocol: it is negotiated after the handshake via HTTP Upgrade, typically on a connection whose ALPN result was http/1.1.
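
The server-side choice reduces to "first protocol in my preference order that the client also offered". A minimal sketch, with select_alpn as a hypothetical helper (real TLS libraries perform this inside the handshake):

```python
from typing import Iterable, Optional, Sequence


def select_alpn(
    client_offered: Iterable[str],
    server_preference: Sequence[str] = ("h2", "http/1.1"),
) -> Optional[str]:
    """Pick the server's most preferred protocol that the client advertised."""
    offered = set(client_offered)
    for proto in server_preference:
        if proto in offered:
            return proto
    return None  # no overlap: the server may abort or continue without ALPN
```

Putting h2 first in server_preference is what makes capable clients land on HTTP/2 while older clients fall back to http/1.1 on the same port.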

0-RTT in TLS 1.3. TLS 1.3 lets returning clients send application data in the very first flight, before the handshake completes, encrypted with a PSK (Pre-Shared Key) derived from a previous session. The latency win is significant: a mobile client on a 100ms-RTT cellular link saves one round trip (~100 ms) on the first request of every resumed connection. The cost: 0-RTT data is replayable; an attacker who captured a 0-RTT flight can replay it later, and the server cannot tell from the TLS layer alone. Mitigation: only allow 0-RTT for safe, idempotent requests (GET, HEAD, OPTIONS); for anything else arriving as early data (POSTs), have the LB hold the request until the handshake completes or reject it with 425 Too Early (RFC 8470) so the client retries after the full handshake.
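
The replay mitigation amounts to a small gate at the LB. This sketch assumes the TLS layer tells the proxy whether a request arrived as early data; the function name and return values are hypothetical, and 425 is the Too Early status from RFC 8470.

```python
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


def early_data_decision(method: str, arrived_in_early_data: bool) -> str:
    """Decide what to do with a request that may be replayable."""
    if not arrived_in_early_data:
        return "accept"          # normal post-handshake request
    if method.upper() in SAFE_METHODS:
        return "accept"          # replaying a safe request changes no state
    return "reject_425"          # client should retry after the handshake
```

A safe method is only harmless to replay if the application really treats it as read-only, so services with side effects on GET need a stricter policy.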

Cert rotation under load. Certs expire (Let's Encrypt: 90 days; commercial certs: 1 year), and the rotation must not drop traffic. Process: the new cert is provisioned in parallel (ACME — Automated Certificate Management Environment — protocol talks to a CA like Let's Encrypt or an internal CA, completes the challenge, issues the cert). The LB loads the new cert into the SNI map alongside the old. A graceful flip swaps the default for new handshakes to the new cert. Old TLS sessions resuming via PSK keep working until their tickets expire. After a buffer period (24–48 hours) the old cert is removed. Throughout this, no connection is dropped. Envoy reloads cert configs from an SDS (Secret Discovery Service) endpoint on a control-plane push, atomically swapping per-listener cert references. Production rotation cadence is daily-to-weekly at Cloudflare scale (millions of customer certs); the LB tier has zero downtime.

HSM vs KMS-managed key story. TLS requires a private key. Where it lives matters.

In-process file. Easy and fast: the private key is in the LB's memory; signing operations happen in user space. Risk: a memory disclosure (Heartbleed, a process dump) leaks the key, and an attacker can impersonate the site until the cert is revoked or expires.
HSM (Hardware Security Module). A physical or cloud-virtual device holds the private key inside tamper-resistant silicon. The LB sends a "sign this hash" request to the HSM; the HSM signs and returns the signature; the LB never touches the key. Pros: a complete memory disclosure leaks no key. Cons: signing latency goes from microseconds to milliseconds (HSM round trip), throughput capped by HSM signing rate (a few thousand signs/sec for a single HSM). Used by certificate authorities, financial institutions, regulated workloads. Common deployment: HSM in a separate VLAN, accessed via PKCS#11.
KMS (Key Management Service). The cloud equivalent — AWS KMS, GCP Cloud KMS, Azure Key Vault. The key never leaves the service; sign operations are HTTP API calls. Latency similar to HSM (ms per sign), but the LB also benefits from KMS-driven rotation (the KMS rotates underlying keys; clients see only key IDs). Used by AWS ALB internally — ALB private keys live in AWS KMS.
Keyless TLS (Cloudflare's pattern). Cloudflare terminates TLS at its edge for customers who refuse to share private keys (banks, governments). Cloudflare's edge runs the TLS handshake but forwards every signing operation to the customer's "key server" over a custom protocol; the customer's HSM signs; Cloudflare's edge completes the handshake. Adds one customer-side RTT to every handshake but Cloudflare never holds the key. Active in production for high-regulation tenants.

For 99% of services, the right answer is private key in memory on the LB box, file-system permissions to lock it down, KMS for the encryption-at-rest key that decrypts the cert file at boot. HSM only when compliance or threat model demands it.

§17. Service mesh integration

A service mesh is the realization "if Envoy is the LB primitive for everything, put one next to every pod." That makes the LB ubiquitous and the control plane the most important component.

Envoy as the data plane. Every pod gets an Envoy sidecar — a second container in the same Pod network namespace, intercepting all inbound and outbound traffic via iptables or eBPF (Cilium's approach). Outbound: the application calls http://feed-service; iptables redirects to the local Envoy on 127.0.0.1:15001; Envoy looks up feed-service in its cluster table (populated by xDS), picks a healthy endpoint via Maglev or P2C, opens an mTLS connection, sends. Inbound: traffic to the Pod IP is intercepted by Envoy on 15006, which terminates mTLS, applies inbound policy (rate limit, ACL — Access Control List), forwards to the application on its loopback. Each Envoy is the LB for that pod's interactions; the mesh is the union of all the Envoys' decisions.

Istio / Linkerd as the control plane. The Envoys are deaf without configuration. The control plane is what turns intent ("route 5% to v2 of feed-service") into the cluster/route configs every Envoy needs. Components: a discovery service (watches Kubernetes for service/endpoint changes), a config translation layer (CRDs → xDS resources), a cert authority (issues short-lived mTLS certs to every workload; rotation cadence typically 24 hours). Linkerd's control plane is Go-based and minimalist (the Rust part is its data-plane micro-proxy); Istio's is bigger (historically Pilot, Citadel, Galley, now consolidated into istiod) but feature-richer.

xDS — eXtensible Discovery Service. The wire protocol between control plane and Envoy. A gRPC bidirectional stream where the control plane pushes resource updates; Envoy ACKs/NACKs. The xDS namespace decomposes:

LDS (Listener Discovery Service). What ports/addresses Envoy listens on, what filter chain processes each.
RDS (Route Discovery Service). For each HTTP listener, what RouteConfiguration (host/path/header rules — see §14).
CDS (Cluster Discovery Service). What logical upstream clusters exist (e.g., "feed-service-v2") and their load-balancing config (P2C, Maglev, weights).
EDS (Endpoint Discovery Service). Per cluster, what concrete endpoints (pod IPs and ports) are members. EDS updates are the most frequent — every pod birth/death emits an EDS push.
SDS (Secret Discovery Service). Per workload, what mTLS cert/key/CA bundle. SDS updates drive cert rotation.
ADS (Aggregated Discovery Service). Single stream that multiplexes all the above — preferred because it preserves causal ordering (cluster appears before its endpoints).

At scale, the xDS push rate dominates the control plane's cost. A 5000-pod mesh with 100 services emits dozens of EDS pushes per minute under steady churn, multiplied by the number of Envoys subscribing. Optimizations: incremental xDS (push only the delta, not the full snapshot), scoped xDS (each Envoy subscribes only to clusters it cares about), federated control planes (one per region, each handling its local fleet).
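
The incremental-xDS idea can be illustrated with a plain set difference over a cluster's endpoints. This is a conceptual sketch only; eds_delta is a hypothetical helper, not the xDS wire format.

```python
from typing import Dict, List, Set


def eds_delta(old: Set[str], new: Set[str]) -> Dict[str, List[str]]:
    """Compute what an incremental push would carry for one cluster."""
    return {
        "added": sorted(new - old),
        "removed": sorted(old - new),
    }


# One pod replaced in a 500-endpoint cluster.
before = {f"10.0.{i // 250}.{i % 250}:8080" for i in range(500)}
after = (before - {"10.0.0.7:8080"}) | {"10.0.9.9:8080"}
delta = eds_delta(before, after)
```

A state-of-the-world push would resend all 500 endpoints to every subscriber, while the delta carries two entries, which is why incremental and scoped xDS matter as endpoint churn grows.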

The "L7 proxy attached to every pod" picture. This is conceptually load balancing per-call rather than per-edge — every call between two services hops through both ends' Envoys, gets mTLS, retries, tracing, rate limit, circuit breaking, all without the application doing any of it. The cost is two Envoy hops per call (each adds ~1 ms p50, ~3–5 ms p99). For a 50-service fan-out request, that's 100 hops; the total mesh tax is real (often 10–30% of end-to-end latency). The benefit is total uniformity: every call obeys the same rules, observability is end-to-end, mTLS is the floor.

Ambient mesh (post-sidecar evolution). Istio Ambient (2022+) replaces the per-pod sidecar with two shared components:

A node-level ztunnel (zero-trust tunnel) — one per node, handles all the mTLS/identity work for every pod on the node. Implemented in Rust with eBPF redirect from Pods to ztunnel. No HTTP parsing here — it's an L4 tunnel.
A waypoint proxy — a per-namespace or per-identity Envoy that handles L7 policy (the routing rules from §14). Requests that need L7 features traverse a waypoint; pure L4 mTLS traffic skips it.

Wins: no Envoy in every pod (saving ~50–100 MB RSS per pod, which is ~0.5–1 TB across a fleet of 10000 pods), faster Pod startup (no sidecar init), simpler upgrades (upgrade the waypoint once, not 10000 sidecars). Trade-offs: less per-call isolation (a node-level ztunnel handling all Pods on the node is a shared failure domain), more network hops for L7 features (Pod → ztunnel → waypoint → ztunnel → Pod). Linkerd has so far kept its per-pod sidecar model, relying on a lightweight Rust micro-proxy to keep the overhead down; Cilium Service Mesh skips sidecars entirely with eBPF in the kernel (see §25).

The arc of mesh design has been "the LB is everywhere → the LB is too expensive everywhere → keep the LB only where it adds value." Ambient and eBPF meshes are still load-balancing technology, but reshape where the proxy sits.

§18. Geo-routing and anycast depth

The LB tier's first job at planet scale is "get this client to a nearby PoP." Three mechanisms ship in production; large CDNs use all three.

GeoDNS with ECS (EDNS Client Subnet). DNS resolvers traditionally hide the client IP from the authoritative DNS server: the auth sees only the resolver's IP, which may be in a different country (Google Public DNS 8.8.8.8 serves users globally). GeoDNS routes by IP; without the client's IP, it routes by the resolver's location, often badly. ECS (RFC 7871) lets the resolver pass a truncated form of the client's IP (e.g., the /24 prefix) inside the DNS query as an EDNS option. The auth uses that prefix to look up geographic location in a database (MaxMind GeoLite, or in-house IP-to-region maps built from RUM, covered below), and returns the IP of the nearest PoP. ECS is opt-in per resolver and per zone (privacy concern: it leaks more about the client than the resolver alone). Support varies across major public resolvers: Google Public DNS and OpenDNS can send ECS, while Cloudflare's 1.1.1.1 deliberately does not. The accuracy ceiling of GeoDNS is the IP-to-region database's accuracy; mobile carriers' IP blocks often misgeolocate (a Verizon block registered in Texas may serve users in California).
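
The ECS flow can be sketched with the standard ipaddress module: the resolver truncates the client address, and the authoritative side does a longest-prefix lookup. The /24 and /56 lengths follow the RFC 7871 recommendations; the geo table, PoP names, and function names are made up for illustration.

```python
import ipaddress
from typing import Dict, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def ecs_prefix(client_ip: str, v4_len: int = 24, v6_len: int = 56) -> Network:
    """Resolver side: truncate the client address before sending it upstream."""
    ip = ipaddress.ip_address(client_ip)
    prefix_len = v4_len if ip.version == 4 else v6_len
    return ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)


def nearest_pop(client_subnet: Network, geo_table: Dict[Network, str], default_pop: str) -> str:
    """Authoritative side: longest-prefix match of the ECS subnet in a geo map."""
    best = None
    for net, pop in geo_table.items():
        if net.version != client_subnet.version:
            continue  # subnet_of() raises on mixed IPv4/IPv6
        if client_subnet.subnet_of(net):
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, pop)
    return best[1] if best else default_pop


geo = {
    ipaddress.ip_network("203.0.113.0/24"): "pop-fra",   # hypothetical mapping
    ipaddress.ip_network("203.0.0.0/8"): "pop-sin",
}
```

A production authoritative server would use a prefix trie instead of a linear scan, and would fall back to the resolver's own address when no ECS option is present.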

BGP anycast. Skip DNS as the routing layer; let the network itself route. One IP (e.g., 1.1.1.1) is announced via BGP from many PoPs simultaneously. The internet's BGP machinery picks one PoP per source AS (Autonomous System) based on AS path length, local preference, and other BGP attributes; this is typically the topologically nearest PoP, which usually correlates with the geographically nearest. Pros: no DNS to mess with; failover is automatic (PoP withdraws the BGP announcement, traffic shifts in seconds to minutes); attack surface for DDoS is diluted across all PoPs (a 10 Gbps attack spread evenly across 50 PoPs is 200 Mbps each, manageable). Cons: BGP picks per-route, not per-flow, and BGP routes can change mid-flow (route update from upstream peer). Anycast for single-packet UDP exchanges (DNS) is trivial; anycast for long-lived stateful flows (TCP, and QUIC unless the LB routes on connection ID) is hard.

RUM-based routing. Real User Monitoring data — actual latency measurements from real users, collected via JavaScript beacons or app SDKs — fed back into the DNS or BGP layer. Cloudflare and Akamai use RUM to learn "users in carrier X consistently get better latency from PoP B than PoP A even though PoP A looks closer." The decision becomes "for queries from this carrier's IP space, return PoP B." More accurate than IP-to-region database alone, especially for mobile and emerging markets. Feedback loop: RUM data updates every few hours; sustained changes (new submarine cable, peering shift) update the maps; transient anomalies are smoothed.

The "anycast island" pathology. A client's BGP route to anycast IP changes mid-flow (e.g., a peering link goes down, upstream router re-runs best-path selection, picks a different next-hop, traffic now lands at a different anycast PoP). The new PoP has no connection state for that flow — its kernel sends a TCP RST. The connection breaks visibly. With long-lived connections (HTTP/2, gRPC, WebSocket, large downloads), this is intolerable. Mitigations:

Flow-aware anycast steering. When a route changes, the new PoP recognizes "I don't have state for this flow" and forwards the packet via an internal tunnel to the old PoP (which still has the state), rather than RST'ing. The old PoP serves the flow until it ends; new flows from the same client land at the new PoP. Cloudflare's Unimog architecture builds this in: a "flow steering" component on every anycast machine keeps a flow-to-machine table, and machines forward packets among themselves to preserve state under route churn.
Connection ID-based routing (QUIC). QUIC's connection ID is independent of the 5-tuple. The LB at the new PoP can read the connection ID, look it up in a fleet-wide table, forward to whichever machine actually owns the connection. This is QUIC's killer feature for anycast (NAT — Network Address Translation — rebinding also benefits; mobile IP changes don't break QUIC flows).
Migrate proactively before the route changes. Hard in practice — the LB doesn't get advance notice of upstream BGP changes.
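
The first two mitigations share one decision procedure, sketched below under stated assumptions: the flow key is either a TCP 5-tuple or a QUIC connection ID, and `fleet_owners` is a hypothetical fleet-wide ownership table (how such a table is replicated is out of scope here). The function and enum names are illustrative only.

```python
from enum import Enum


class Action(Enum):
    SERVE_LOCALLY = 1
    FORWARD_TO_OWNER = 2
    ACCEPT_AS_NEW = 3
    RESET = 4


def steer(flow_key, opens_flow, local_flows, fleet_owners, me):
    """Decide what a PoP does with a packet that anycast just delivered to it.

    flow_key     : TCP 5-tuple or QUIC connection ID (any hashable value)
    opens_flow   : True for a TCP SYN or a QUIC Initial packet
    local_flows  : set of flow keys this PoP already holds state for
    fleet_owners : hypothetical fleet-wide map, flow key -> owning PoP name
    me           : this PoP's name
    Returns (action, target_pop).
    """
    if flow_key in local_flows:
        return Action.SERVE_LOCALLY, me
    owner = fleet_owners.get(flow_key)
    if owner is not None and owner != me:
        # Route churn moved the flow here; tunnel it to the owner instead of resetting.
        return Action.FORWARD_TO_OWNER, owner
    if opens_flow:
        local_flows.add(flow_key)
        fleet_owners[flow_key] = me
        return Action.ACCEPT_AS_NEW, me
    # Mid-flow packet that no PoP owns: there is no state left to preserve.
    return Action.RESET, None
```

The RST that the pathology describes only survives as the last branch, for packets whose state exists nowhere in the fleet.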

Cloudflare's Unimog publication describes the flow-steering approach in detail. The "any" of anycast becomes "the right one" by introducing intra-fleet flow forwarding.

§19. Connection draining

Deploys are the most frequent failure mode the LB has to mask. Connection draining is the protocol.

The graceful-shutdown dance. When a backend (or LB itself) is going down for a deploy:

Drain signal. Orchestrator sends a "preStop" hook or the equivalent (Kubernetes' terminationGracePeriodSeconds window starts; Envoy receives /healthcheck/fail admin endpoint; HAProxy receives SIGUSR1).
Stop accepting new traffic. The backend marks itself unhealthy (its /healthz starts returning 503), and/or its registration in the LB pool is removed. The LB notices on the next active probe (or immediately if the LB has a control-plane push channel) and stops sending new connections.
Existing connections finish. In-flight requests complete naturally — the response is sent, the HTTP/1.1 keep-alive idle timeout fires, the HTTP/2 stream closes, the gRPC RPC completes.
Forceful close after drain timeout. If anything is still hanging on after, say, 30 seconds, the backend sends a GOAWAY on its HTTP/2 connections (asking clients to migrate gracefully) and closes whatever remains. Once the orchestrator's grace window expires (Kubernetes' terminationGracePeriodSeconds defaults to 30s, counted from the start of termination), the process gets SIGKILL.
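
A minimal sketch of the backend's side of this dance, assuming an in-process request counter and a health endpoint the LB polls. The class and method names are hypothetical; real servers hook the same states into their HTTP framework and signal handlers.

```python
import threading


class DrainableServer:
    """Tracks in-flight requests and exposes the drain states from the steps above."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False
        self._closed = False

    def healthz(self):
        # Step 2: report unhealthy as soon as the drain begins.
        with self._cond:
            return 503 if self._draining else 200

    def start_request(self):
        # Stragglers that arrive before the LB notices are still served.
        with self._cond:
            if self._closed:
                return False
            self._in_flight += 1
            return True

    def finish_request(self):
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def begin_drain(self):
        # Step 1: the drain signal arrived.
        with self._cond:
            self._draining = True

    def wait_drained(self, timeout_s):
        # Step 3: wait for in-flight work; False means the drain timeout fired.
        with self._cond:
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout_s)

    def close(self):
        # Step 4: stop serving entirely, whether or not the drain completed.
        with self._cond:
            self._closed = True
```

The design choice worth noting is that `start_request` keeps accepting during the drain: rejecting immediately would turn the gap between the drain signal and the LB's next probe into client-visible errors.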

Done right, a deploy is invisible — no 5xx blip, no latency spike. Done wrong, every deploy is a small outage and the team learns to deploy at low-traffic hours rather than fix it.

The "30s isn't enough" problem. Long-poll connections, server-sent events (SSE), WebSocket, gRPC streaming RPCs, file uploads, and slow backend dependencies can all keep a connection alive far longer than 30 seconds. A WebSocket pinned to chat-service-pod-7 might live for hours. Forcing the close after 30 seconds means dropping every active chat session every deploy. Options:

Longer drain timeout. terminationGracePeriodSeconds: 600 (10 minutes). Costs slower deploys (waiting on long-lived connections to finish), and might be unbounded for some traffic patterns.
Drain signal to clients. Send a WebSocket close frame or an HTTP/2 GOAWAY whose Last-Stream-ID names the highest stream the server will still process: in-flight streams finish, and the client opens new streams on a fresh connection. Clients reconnect to a different backend; the old one drains faster. Slack and Discord do this on chat servers.
Connection rebalancing. Periodically (independent of deploys), the LB or backend signals clients to reconnect, naturally migrating traffic and avoiding any one backend amassing too many long-lived connections. Avoids the deploy-time pileup.
Sticky-session-aware draining. If sessions are user-pinned and stored in state, drain by session count, not connection age — let the user finish their current task, but redirect on the next reconnect.

The other failure mode: a backend SIGTERM'd without a drain period (the "panic kill") resets its in-flight connections and refuses new ones (ECONNREFUSED at the LB). The LB surfaces the failure as a 5xx, marks the backend unhealthy, and retries on another backend where the request is safe to retry, so most cases recover transparently. But the customer-visible cost during a global deploy is real: rolling 1000 pods with no drain costs hundreds of 5xx per minute. Drain is the difference between zero-impact deploys and noticeable ones.

§20. Sticky sessions

Stickiness pins a logical session (user, customer, game match) to a specific backend across requests. The technique is older than load balancing as we know it.

Cookie-based stickiness. The LB sets a cookie on the first request — Set-Cookie: lb-backend=B7; Max-Age=3600 — and on subsequent requests reads the cookie and forwards to B7 directly, bypassing the load-balancing decision. Envoy and HAProxy do this with cookie hash policies. The cookie can be:

Opaque (LB internal: a hash mapping to backend ID; clients don't see meaningful content).
Signed (HMAC over backend ID + expiry, prevents tampering).
Encrypted (full encryption with LB-only key, prevents inference).
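
The signed variant can be sketched with the standard library's HMAC support. This is an illustration under assumptions, not any proxy's actual cookie format: the `backend.expiry.signature` layout, the secret, and the function names are all made up for the example.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET = b"lb-only-secret"  # hypothetical key, known only to the LB tier


def _sign(payload: str) -> str:
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()


def make_cookie(backend_id: str, ttl_s: int, now: Optional[float] = None) -> str:
    """Build 'backend.expiry.signature', with the HMAC covering backend ID and expiry."""
    expiry = int((time.time() if now is None else now) + ttl_s)
    payload = f"{backend_id}.{expiry}"
    return f"{payload}.{_sign(payload)}"


def read_cookie(cookie: str, now: Optional[float] = None) -> Optional[str]:
    """Return the backend ID if the cookie is authentic and unexpired, else None."""
    try:
        backend_id, expiry, sig = cookie.rsplit(".", 2)
    except ValueError:
        return None
    expected = _sign(f"{backend_id}.{expiry}")
    # Constant-time comparison avoids leaking how many signature bytes matched.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    current = time.time() if now is None else now
    return backend_id if int(expiry) >= current else None
```

On a `None` result the LB would fall back to its normal balancing decision and issue a fresh cookie.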

IP-based stickiness. The LB hashes the client IP and routes consistently to a backend. No cookies, works for non-HTTP, but breaks behind NAT (many clients share one IP and get pinned to one backend; one client roaming between Wi-Fi and cellular gets re-pinned). IP stickiness with a hash policy is L4-LB-compatible; cookie stickiness requires L7.

Why most modern designs avoid stickiness. Stickiness is state leakage from the app to the LB. As soon as the LB knows "this user belongs on B7," the app has lost the freedom to:

Restart B7 — the user's traffic is pinned there and won't easily move.
Scale the pool — adding B11 has no effect on the user's experience because they're stuck on B7.
Failover B7 — when B7 dies, the user's session must explicitly migrate; the LB can do this but the application must support session re-establishment.

The modern pattern is the opposite: stateless backends, state in a shared store. Sessions, carts, user preferences, in-flight game state — all live in Redis, DynamoDB, or a session service. Any backend can serve any request. The LB does pure round-robin or P2C. Scaling, restart, deploy, failover all become trivial.

When sticky is right. A short list of valid cases:

WebSocket apps. The TCP connection is the session; the LB has no choice but to pin it for the connection's life. (Note: this is stickiness to a connection, not stickiness to a user — different problem; reconnects can land elsewhere.)
Server-sent events (SSE), long-poll, streaming HTTP. Same as WebSocket — long-lived TCP that can't be transparently moved.
Very latency-sensitive caches at the LB layer. The LB can be smarter if it routes a user consistently to the backend that already has their L1 cache warm. The cost of state in the LB is small if it's purely a performance hint (the system stays correct if the hint is wrong).
Game matches with strict liveness. A 64-player UDP match must route every player to the authoritative game server for the match duration. The LB maintains a (game_id → backend) table; players' UDP flows all hash to the same game_id.
Streaming media with large in-memory buffers. A video session that loaded the first 10 GB of a movie's segments in memory benefits from sticky; reload-elsewhere costs a 10-GB re-fetch.

When stickiness is justified, design failover behavior explicitly: when B7 dies, what happens to its sticky users? The application must tolerate "your session ended, reconnect to a new backend, recover state from Redis" as a normal event.

§21. Connection pooling at the LB

At hyperscale, "one TCP per request" is wasteful even for the LB-to-backend leg. The math (concurrency = request rate × latency): 100k RPS at 1 ms per request requires 100 concurrent connections, which is manageable. 1M RPS at 10 ms requires 10k concurrent connections; at 100M RPS that's 1M concurrent connections, blowing past the ~65k source ports a single source IP can offer (TCP source ports are 16-bit) and making connection establishment (TLS handshakes, mTLS cert checks, socket-table churn) the dominant cost.
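
The concurrency figures above follow Little's law (items in flight = arrival rate × time in system). A small helper makes the arithmetic checkable; the function name is illustrative, and latency is taken in whole milliseconds to keep the arithmetic exact.

```python
import math


def concurrent_connections(rps: int, latency_ms: int) -> int:
    """Little's law: in-flight requests = request rate x time each request is in flight."""
    return math.ceil(rps * latency_ms / 1000)
```

For the three cases in the text this gives 100, 10,000, and 1,000,000 connections, assuming one request per connection at a time.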

Backend connection pools. The LB maintains a pool of pre-opened TCP connections to each backend. A new request grabs an idle connection from the pool (an O(1) operation), sends, and parks the connection back. Pool size is set to peak concurrent requests per backend. NGINX's upstream { keepalive 100; } retains up to 100 idle connections per upstream group in each worker process; Envoy bounds the pool per cluster with max_connections under the circuit_breakers thresholds. The pool drains slowly (connections idle past the configured idle timeout are closed by the LB, or by the backend's keep-alive timeout, often around 60s), refilling on demand.

HTTP/2 multiplexing to backends. HTTP/1.1 can't pipeline reliably (head-of-line blocking); each request needs its own connection or its own slot in a serial keep-alive chain. HTTP/2 multiplexes many streams over one TCP connection. The LB opens, say, 2 HTTP/2 connections to backend B7 and pipes 1000 concurrent streams down them (subject to the backend's advertised concurrent-stream limit). Sockets: 2 instead of 1000. Connection setup cost: amortized over thousands of requests. This is the single biggest scalability win for LB-to-backend; most modern proxies support HTTP/2 backend connections.

HTTP/3 (QUIC) backend connections. Even better for some patterns. QUIC over UDP avoids TCP head-of-line blocking between streams (a lost packet for stream 7 doesn't block stream 8), useful when the LB-to-backend RTT is non-trivial (cross-region pools) or when packet loss is non-zero. Adoption lags HTTP/2 — backends must support h3, observability tools must understand UDP flows.

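The "1M sockets from LB to backend pods" exhaustion. At very high scale, even pooled connections exhaust resources. A source IP has at most ~65k TCP source ports (about 28k in Linux's default ephemeral range). Strictly, the kernel only requires the full 4-tuple to be unique, so that ceiling applies per (source IP, destination IP, destination port). But when the LB binds a source port before connecting, or its traffic passes through SNAT, ports come from one pool shared across all destinations: with 1 source IP and 1000 backends, that leaves ~65 source ports per backend (fewer with reservation overhead). Solutions: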

More source IPs on the LB. Assign 10 IPs, get 650 ports per backend. NICs and kernels handle this.
HTTP/2 multiplexing, as above — collapses 1000 streams into 1 socket.
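Source-port sharing for outgoing connections. Linux lets outbound sockets to different destinations share a source IP/port as long as the full 4-tuple stays unique; when the LB binds a source address first, IP_BIND_ADDRESS_NO_PORT (Linux 4.2+) defers port selection to connect() so the sharing still applies. SO_REUSEPORT is the listening-socket analogue and is rarely the right tool for outbound.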
Connection-pool sharing across backend replicas. If the LB serves N backends and each only sees ~10 RPS, opening 1 connection per backend is enough. Sharing is the default.

A second related exhaustion: conntrack table on stateful firewalls. Linux's conntrack (NF_CONNTRACK) keeps per-flow state for connection tracking and NAT. Default table size is ~256k; at high LB scale, conntrack fills, new connections drop. Mitigations: disable conntrack on LB outbound (set notrack rules), bypass with XDP, or raise the table size. Cloudflare and Facebook have documented running their LBs with conntrack disabled entirely on the data plane path.

§22. Autoscaling interaction

The LB's metrics — request rate, latency, error rate, connection count — are the input signal for autoscaling. Done right, this closes the loop and makes capacity self-managing.

LB metrics feeding HPA (Horizontal Pod Autoscaler). The Kubernetes HPA reads metrics (default: CPU, memory; with custom metrics: anything) and scales pod counts up or down. Sourcing from the LB:

Requests per second per pod. If request_rate > 200 RPS/pod, scale up. The LB exposes per-cluster RPS in Prometheus; an adapter (Prometheus Adapter, KEDA — Kubernetes Event-Driven Autoscaling) feeds it to the HPA.
p95 latency. If p95 > 100 ms, scale up — the pool is saturating.
Active connections per pod. Bypasses RPS variability; "how busy is the average pod?" is a steadier signal.
Error rate. Scale up if errors are correlated with saturation, not with bugs.

Composite scaling: HPA combines multiple metrics, uses the highest-recommended desired-replicas. Avoids oscillation around a single noisy metric.
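
The composite rule can be sketched from the HPA's documented formula, desired = ceil(current × currentMetric / targetMetric), evaluated per metric with the largest proposal winning. The function name and the metric pairs are illustrative; the 10% tolerance band mirrors the HPA's default and is exposed as a parameter.

```python
import math


def desired_replicas(current_replicas, metrics, tolerance=0.10):
    """metrics: list of (current_value, target_value) pairs, one per metric."""
    proposals = []
    for value, target in metrics:
        ratio = value / target
        if abs(ratio - 1.0) <= tolerance:
            # Within the tolerance band: this metric proposes no change.
            proposals.append(current_replicas)
        else:
            proposals.append(math.ceil(current_replicas * value / target))
    # The highest recommendation wins, so no single metric can starve the pool.
    return max(proposals)
```

With 10 pods at 300 RPS/pod against a 200 RPS target and p95 of 120 ms against a 100 ms target, the proposals are 15 and 12, and the pool scales to 15.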

The "thundering herd on scale-up" problem. Scale-up adds new pods. The LB sees them, treats them as full-capacity members of the pool, sends them their share of traffic immediately. The new pods are cold: JIT-compiled code paths aren't hot, connection pools to dependencies are empty, OS page caches are unwarmed, ML model weights aren't paged in. Cold p99 might be 5–10x the warm p99. Traffic ramp at full throttle pushes the cold pods into trip-circuit-breaker territory, the HPA sees saturation again, scales up further, and the cascade unfolds.

Warmup periods. Three layers fix this.

Slow start at the LB. Envoy's slow_start_config ramps the weight of a newly added endpoint from a small floor (min_weight_percent, 10% by default) up to 100% over a configured slow_start_window (for example 60s, extendable to 5+ minutes). The pod gets a fraction of normal traffic in the first minute; by the end of a 5-minute window it's at full weight.
Readiness probe with progressive criteria. Kubernetes' readiness probe returns "ready" only after the pod has finished its warmup (model loaded, caches primed, dependencies pinged). The LB doesn't see the pod until it's actually ready.
Pre-warming hooks. The deploy pipeline replays a synthetic traffic profile against new pods (recorded production requests, 1% playback rate) before promoting them to live traffic. By the time the pod is in the LB's pool, its JITs and caches are warm.
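
The slow-start ramp can be modeled as a weight function of endpoint age. The sketch below is modeled on the formula in Envoy's slow-start documentation (weight scaled by the larger of a minimum percentage and a time factor raised to 1/aggression); treat the exact shape and the default values as assumptions to verify against the Envoy version in use.

```python
def slow_start_weight(full_weight, seconds_since_added, window_s,
                      min_weight_percent=0.10, aggression=1.0):
    """Effective LB weight of an endpoint that joined `seconds_since_added` ago."""
    if seconds_since_added >= window_s:
        return float(full_weight)  # ramp finished: full share of traffic
    # Fraction of the window elapsed, floored at one second to avoid a zero weight.
    time_factor = max(seconds_since_added, 1.0) / window_s
    scale = max(min_weight_percent, time_factor ** (1.0 / aggression))
    return full_weight * scale
```

With a 60-second window and aggression 1.0 the ramp is linear: a brand-new endpoint sits at the 10% floor, reaches half weight at 30 seconds, and full weight at 60.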

The combination eliminates the cold-start cliff. Without it, autoscaling becomes anti-scaling — adding pods makes things worse before they get better.

Predictive scaling. Reactive scaling (wait for saturation, scale up) loses by the warmup window. Predictive scaling (anticipate the upcoming load and pre-warm) wins. Used heavily for Black Friday, sports event surges, scheduled deployments. Inputs: time-of-day patterns, marketing-campaign schedules, traffic seasonality. The LB doesn't drive this directly, but its metrics feed the predictor.

§23. DDoS protection layers

The LB tier is the front line of denial-of-service defense. Attacks split into two regimes by layer.

Layer 3 / 4 attacks (volumetric). The attacker generates raw packet volume — SYN floods, UDP amplification (small DNS query, large reflected response), TCP reflection, ICMP floods. Defended at the edge before the LB processes anything.

Anycast dilution. As mentioned in §18, a multi-PoP anycast network splits an attack across N PoPs. A 1 Tbps attack against Cloudflare's IP shows up as ~5 Gbps per PoP at 200 PoPs — handleable. The bigger the network, the more dilution.
ACLs (Access Control Lists) at the network layer. Router ACLs drop traffic matching known-bad patterns (spoofed source IPs from bogon ranges, IPv4 fragments, malformed packets).
Rate limiting at L4. Per source IP, per source AS, per ASN region. XDP/eBPF programs maintain a sliding-window counter per source IP in a BPF map, drop packets above threshold. The decision is sub-microsecond.
SYN cookies. See §7 Problem 7 — handle SYN floods without flow-table memory.
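Upstream BGP coordination. On a multi-Tbps attack, the LB operator asks its upstream transit providers to drop attack traffic before it arrives, either via BGP communities (RTBH, Remote Triggered Black Hole, which in its common destination-based form blackholes the attacked prefix) or via BGP FlowSpec (which can match narrower patterns such as source prefix, protocol, and port). Cloudflare's Magic Transit packages the idea as a service: Cloudflare announces the customer's prefixes from its own anycast network and scrubs traffic before forwarding the clean remainder to the customer.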

Layer 7 attacks (application abuse). Smaller traffic volume, but designed to exhaust application resources. Slowloris (open many TCP connections, send headers one byte at a time, hold sockets open), HTTP slow POST (similar with body), HEAD floods, expensive query patterns (?sort=relevance&fuzzy=true over a search endpoint that doesn't cache), credential stuffing (login attempts with breached credentials). Defended at the L7 LB or upstream WAF.

WAF (Web Application Firewall). Rule-based and ML-based filtering. Known patterns (SQL injection, XSS, path traversal) blocked outright. Anomalous request patterns (large numbers of failed logins from one IP, unusual user-agent strings, requests with bot signatures) blocked or challenged. ModSecurity, Cloudflare WAF, AWS WAF.
Rate limiting at L7. Per user, per API key, per endpoint, per geo. Token-bucket or sliding-window counters per key. Envoy's local_ratelimit filter handles per-instance limits; its ratelimit filter calls out to an external rate-limit service (RLS, Rate Limit Service) for global limits.
Bot management. JavaScript challenges (force the client to execute code, weed out simple scrapers), CAPTCHAs (force a human-in-the-loop, useful but high-friction), TLS fingerprinting (JA3/JA4 hashes identify clients by their TLS handshake; bots have different fingerprints than browsers), behavioral analysis (mouse movements, request timing patterns).
Connection limits per source. Cap the number of concurrent connections from one IP to mitigate slowloris — 100 conns/IP is a reasonable default for human traffic.
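
A per-key token bucket, the first of the two counter styles named above, can be sketched as follows. Timestamps are passed in explicitly so the behavior is deterministic; the class names and the choice of key (user, API key, or source IP) are illustrative.

```python
class TokenBucket:
    """Refills at `rate_per_s` tokens per second, holding at most `burst` tokens."""

    def __init__(self, rate_per_s, burst, now=0.0):
        self.rate = rate_per_s
        self.burst = burst
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.last = now

    def allow(self, now):
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class KeyedLimiter:
    """One bucket per key, created lazily on first sight of the key."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.burst = burst
        self.buckets = {}

    def allow(self, key, now):
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = self.buckets[key] = TokenBucket(self.rate, self.burst, now)
        return bucket.allow(now)
```

A production limiter would also evict idle keys; this sketch keeps every bucket forever, which is fine for illustration but unbounded under an attack with many distinct keys.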

The "Cloudflare absorbs Tbps" math. Cloudflare's published reports describe absorbing multi-Tbps attacks. The capacity math: ~300 PoPs × ~400 Gbps capacity each = ~120 Tbps total. Anycast spreads an attack across the PoPs (give or take, since botnets are not evenly distributed), so a 5 Tbps attack averages ~17 Gbps per PoP, a small fraction of what each can absorb. The attacker can't concentrate the attack on one PoP because the routing decision is the internet's, not theirs. Cloudflare's Q4 2024 report described a record 5.6 Tbps UDP flood that was detected and mitigated automatically.

The DDoS defense stack is a layered onion: edge anycast → L4 ACL → L4 rate limit → L7 rate limit → WAF → application. The closer to the edge, the cheaper to drop; the deeper, the more semantic context for the decision. Modern defense is automated — the network detects an attack pattern within seconds (typically via traffic anomaly detection on netflow data) and rolls out countermeasures across the fleet via BGP, control-plane updates, and WAF rule pushes.

§24. Failure modes not covered earlier

The §8 failure modes covered LB box death, PoP failure, control-plane death, and partition. Three more deserve coverage.

Mode F: The LB itself crashes (need redundant LBs via ECMP). A single L4 or L7 LB box is not a deployment unit. The deployment unit is at least two LBs behind ECMP. Each of the N LB boxes announces the VIP to the edge router (typically over BGP); the router installs N equal-cost next-hops and spreads flows across them by hash. When one box dies (process crash, kernel panic, hardware fault), the router detects it via BFD (Bidirectional Forwarding Detection, a sub-second liveness probe) or a BGP session drop, removes the dead next-hop, and rehashes onto N-1. The dead box's flow table is gone and a few packets are dropped, but after the rehash the surviving LB runs the same Maglev lookup on the flow's 5-tuple, picks the same backend, and the connection continues. Without redundancy, the LB is a SPOF (Single Point Of Failure); with ECMP redundancy, LB failure is transparent within ~100 ms.

Mode G: LB to backend partition (LB thinks backends healthy, backends can't reach DB). A network partition isolates the backends from a critical downstream (DB, cache, message broker) while keeping the LB↔backend path healthy. The LB's /healthz probe to the backend succeeds (the backend's HTTP server is up). Real requests fail because the backend can't reach the DB. The LB has no signal — its probes pass, its passive ejection might catch some 5xx but slowly. Mitigations:

Deep health checks. /healthz includes a real DB ping (and Redis, and other criticals). Costs more probe traffic; pays off in detection accuracy. Make the deep checks rate-limited so the LB's probes don't take down the DB on a partial outage.
Synthetic transactions. A scheduled prober (separate from the LB) runs end-to-end user transactions every few seconds; results feed into LB pool exclusion. Detects DB problems even if /healthz is shallow.
Backend self-reporting unhealthy. When the backend detects its DB connections are timing out, it returns 503 on /healthz proactively. The LB ejects on the next probe. Slow (one probe interval), but no additional probe traffic.
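Adaptive ejection. The LB tracks a sliding-window error rate per backend; if requests to B7 fail at 50% over 30 seconds, eject regardless of /healthz. Envoy's outlier_detection does this with consecutive_5xx for runs of failures and with failure_percentage_threshold (or success-rate ejection, gated by success_rate_minimum_hosts) for rate-based ejection per detection interval.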
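
Adaptive ejection reduces to a per-backend sliding window of request outcomes. The sketch below is a standalone model, not Envoy's implementation; the class name, the minimum-sample guard, and the default thresholds (taken from the 50%-over-30-seconds example) are assumptions.

```python
from collections import deque


class ErrorRateEjector:
    """Eject a backend when its failure rate over a sliding window crosses a threshold."""

    def __init__(self, window_s=30.0, threshold=0.5, min_requests=20):
        self.window_s = window_s
        self.threshold = threshold
        self.min_requests = min_requests  # avoid ejecting on a handful of samples
        self._events = {}                 # backend -> deque of (timestamp, failed)

    def _trim(self, q, now):
        # Drop outcomes that have aged out of the window.
        while q and q[0][0] <= now - self.window_s:
            q.popleft()

    def record(self, backend, failed, now):
        q = self._events.setdefault(backend, deque())
        q.append((now, failed))
        self._trim(q, now)

    def should_eject(self, backend, now):
        q = self._events.get(backend)
        if not q:
            return False
        self._trim(q, now)
        if len(q) < self.min_requests:
            return False
        failures = sum(1 for _, failed in q if failed)
        return failures / len(q) >= self.threshold
```

Because the signal comes from real traffic, this catches the partition case even when /healthz keeps returning 200.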

This is the most subtle LB failure mode because "healthy from the LB's perspective" is divorced from "healthy from the user's perspective."

Mode H: Health-check storm (1000 backends × 200 LBs, each checking every 10s = 20 checks/sec per backend). At fleet scale, naive active health checking generates substantial load on the backends. 1000 backends, 200 LBs (mesh sidecars), each checking each backend every 10 seconds = 20000 checks/sec across the fleet, 20 checks/sec per backend. If each check is HTTP and costs ~1 ms, that's ~20 ms/sec of backend CPU (about 2% of a core) just on health checks. The load doubles or triples at mesh scale, where every Envoy probes every cluster member it might route to.
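
The probe-load arithmetic is easy to parameterize. A small sketch, with an illustrative function name, that reproduces the fleet-wide and per-backend figures from the numbers in the paragraph above:

```python
def health_check_load(backends, lbs, interval_s, cost_ms_per_check):
    """Return (fleet checks/sec, checks/sec per backend, backend CPU ms per second)."""
    per_backend = lbs / interval_s           # every LB probes every backend once per interval
    fleet = per_backend * backends
    cpu_ms_per_s = per_backend * cost_ms_per_check
    return fleet, per_backend, cpu_ms_per_s
```

Plugging in 1000 backends, 200 LBs, a 10-second interval, and 1 ms per check gives 20,000 checks/sec fleet-wide, 20 per backend, and 20 ms of CPU per second per backend.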

Mitigations:

Sampling. Each LB checks a random subset of backends; the control plane combines the signals. Every backend is still probed by some LBs, but total checks are halved or quartered.
Hierarchical health. A small "health aggregator" service checks every backend, exposes aggregated health to all LBs via xDS push. LBs trust the aggregator, never probe directly. The aggregator is a SPOF, so deploy it replicated.
Passive-only at the high end. Skip active checks entirely; rely on real traffic + outlier detection. Works when traffic is high enough that any broken backend reveals itself within seconds.
Long intervals plus event-driven. Active checks every 30 seconds (not 2 seconds) for coarse liveness; event-driven exclusion (on observed errors) for fast detection.

The lesson: every probe is a request. At fleet scale, the LB's monitoring traffic is itself a workload to manage.

§25. Observability of the LB

The LB sees every request — that makes it both the most important observability vantage point and the noisiest one. Production observability has three pillars and one investigation pattern.

Request logs (sampled, structured). Every request through the LB could log a row (timestamp, source IP, method, path, status, latency, upstream backend, bytes in/out, TLS cipher, retry count). At 100k RPS, full logging is 100k rows/sec: about 8.6 billion rows per day, or roughly 8-9 TB per day at ~1 KB per row. Production approach: structured + sampled.

Structured — JSON or protobuf rows shipped to a log pipeline (Kafka → S3 → Athena/Trino), with consistent field names. Searchable; not free-text.
Sampled — log 1% uniformly + 100% of errors + 100% of slow requests (>p99). Captures the tail without paying for the median.
Trace-linked — every log row has a trace ID (e.g., W3C traceparent), so you can join LB logs to backend traces, DB traces, and downstream.
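
The sampling rule can be sketched as a pure function of the request's outcome. Two assumptions go beyond the text: "errors" is taken to mean 5xx, and the 1% uniform sample is keyed on a hash of the trace ID so that every hop makes the same keep-or-drop decision for a given trace.

```python
import zlib


def should_log(trace_id: str, status: int, latency_ms: float,
               p99_ms: float, sample_percent: int = 1) -> bool:
    """Keep all errors and slow requests, plus a deterministic sample of the rest."""
    if status >= 500:
        return True                      # 100% of errors (assumed to mean 5xx)
    if latency_ms > p99_ms:
        return True                      # 100% of slow requests
    # Hash-based sampling: the same trace ID always lands in the same bucket.
    return zlib.crc32(trace_id.encode()) % 100 < sample_percent
```

The p99 threshold would come from a rolling estimate maintained elsewhere; here it is simply an argument.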

The four golden signals. SRE canon — every LB exposes these:

Rate — requests per second. The pulse. Sudden drop = upstream broken or client gone; sudden spike = surge or attack.
Errors — 5xx rate, 4xx rate. Distinguish: 5xx = our problem, 4xx = client problem (mostly). Watch the trend; absolute count varies by traffic.
Latency — distribution (p50, p95, p99, p99.9). Median is the experience; p99 is the worst experience that's still common.
Saturation — connection count, queue depth, CPU usage, memory usage. The "how close are we to failing under sustained load" signal. If saturation is creeping up, scale before latency or errors crack.

The signals are reported per cluster (e.g., the feed cluster), per route, per upstream backend (so you can spot the one slow box), and overall.

Per-backend metrics. A backend's view from the LB might disagree with the backend's own metrics. The LB measures wall-clock time as seen at its socket; the backend measures application time. Network latency, queueing at the backend's accept queue, and TLS overhead all show up at the LB but not in the backend's logs. Per-backend LB metrics:

Active connections — how many open right now.
RPS to this backend — per-second.
p50/p95/p99 latency to this backend.
Error rate to this backend.
Pending requests — sometimes the LB has a queue of requests waiting for a free connection in the pool; this depth is a saturation indicator.

The "backend p99 is fine but LB p99 is bad" investigation. Classic. Backend reports p99 of 50 ms. LB reports p99 of 500 ms. Where did the extra 450 ms come from?

Queue buildup at the LB. Requests are arriving faster than the LB can dispatch (CPU saturation on the LB process). Each request waits in the LB's accept queue; queue time is invisible to the backend. Check: LB CPU > 80%, accept queue depth > 0, request rate increasing.
Connection-pool exhaustion. The LB has 100 connections to the backend; 101 simultaneous requests arrive; the 101st waits. The wait time shows in LB latency but not backend latency. Fix: bigger pool, HTTP/2 multiplexing.
Retries. The LB retried the request after a backend failure. The backend sees only the second (successful) attempt; the LB sees the total time. Filter logs by retry-count > 0.
TLS handshake overhead. New connections cost 1-2 RTTs of TLS handshake. The LB measures from client's first byte; the backend measures only after handshake. Look at session resumption rates.
Slow-start ramp. A new backend is at 1% weight; rate-of-traffic feels normal but a small spike pushes it over its slowed capacity, queueing builds, latency spikes. Look at slow-start state.

The recipe is "ask the LB and the backend for separate latency distributions, find the gap, and trace which queue or pool is the culprit." Sometimes the gap reveals a missing layer of metrics — adding pool-wait-time and accept-queue-depth metrics often reveals the answer.

Distributed tracing. OpenTelemetry / Zipkin / Jaeger. Every request gets a trace ID; the LB emits a span ("LB hop"), the backend emits its span ("backend processing"), downstream calls emit their spans. Visualized as a flame graph, you see exactly which hop took which time. This is the gold standard for cross-system latency investigations and the LB is the place to inject the trace ID if the client didn't.

§26. Kernel-level and eBPF

The state of the art in L4 LB is migrating from user-space DPDK to in-kernel eBPF. The shift is a few years old and accelerating.

XDP (eXpress Data Path). A Linux kernel hook at the NIC driver level — the earliest possible point. An XDP program is eBPF bytecode loaded into the kernel; it inspects each incoming packet and returns one of:

XDP_PASS — continue up the normal kernel stack (eventually delivered to a socket).
XDP_DROP — discard the packet immediately. The fastest possible drop — used for DDoS mitigation.
XDP_TX — re-transmit out the same NIC after rewriting headers. Used by L4 LBs to bounce the rewritten or encapsulated packet back toward the chosen backend.
XDP_REDIRECT — send to a different NIC or to a different CPU. Used for multi-NIC LBs.

XDP runs before kernel networking (no allocation of sk_buff, no routing decision, no netfilter), so packet processing is dramatically faster than the regular kernel stack feeding a user-space socket (1-2 orders of magnitude). Maglev table lookup, flow table lookup, header rewrite, retransmit: all happen in the XDP program, at roughly 100-200 ns per packet, or 5-10 Mpps per core.

The kernel ecosystem stays alive: tcpdump works on packets the XDP program PASSes, ftrace can attach to XDP entry/exit, perf events are emitted, bpftool introspects loaded programs.

eBPF for L7 routing without sidecars (Cilium). Cilium is a Kubernetes networking layer and service mesh whose data plane is implemented in eBPF. Instead of an Envoy sidecar per pod, Cilium loads eBPF programs into the kernel (at socket, tc, and XDP hooks) that do:

L4 load balancing — replaces kube-proxy's iptables/IPVS with eBPF service load balancing (Maglev consistent hashing is an option).
Network policy — drops disallowed packets in the kernel based on identity (not IP).
L7 routing — eBPF redirects traffic that needs L7 handling to a per-node Envoy proxy, which does the HTTP routing.
Observability — eBPF programs emit metrics and trace events without a sidecar.

The win: no sidecar means no per-pod Envoy, no resource overhead per pod, no startup latency, no upgrade dance. The constraint: eBPF can't do everything Envoy can (extensions, custom Lua filters, etc.) — Cilium handles the common cases in the kernel and falls back to Envoy for the rest.

The modern Linux load-balancing stack. A 2026-era L4 LB stack looks like:

NIC offload — RSS hashes 5-tuples and dispatches packets to N RX queues; each queue's interrupts are pinned to one CPU core. Packet rate scales with cores.
XDP eBPF program on each NIC RX queue. Runs Maglev lookup, flow tracking, header rewrite, drops bad traffic.
BPF maps (kernel hash tables) hold the Maglev table, flow table, backend mappings. Updated by user-space control plane via syscall.
Optional sk_buff path for traffic XDP can't decide (returned via XDP_PASS); standard kernel networking handles it.

User-space DPDK code is moving to museum status for new projects. The maintenance cost of staying in DPDK (no kernel tooling, custom NIC driver, manual ARP/ICMP) is too high when XDP delivers ~80% of the perf with 100% of the kernel ecosystem.

§27. Proxy-less load balancing

The mesh pattern (Envoy at every pod, all calls through the proxy) has a tax. Proxy-less LB asks: what if the client picks the backend directly, no LB hop in between?

gRPC's client-side load balancing. gRPC has built-in support for client-side LB. The client is told (via a name resolver, e.g., DNS, xDS, or Consul) the set of backend endpoints and a load-balancing policy (round-robin, P2C, custom). On each call, the client picks an endpoint and opens (or reuses) a connection. The server-side LB is bypassed for internal calls.

xDS-discovered backends. The client subscribes to xDS — same protocol as Envoy uses. The control plane pushes endpoint updates; the client's resolver updates. The same xDS push that updates Envoy's CDS/EDS now updates every gRPC client. Consistent picture, no extra infrastructure.
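Picker policy. gRPC's built-in default is pick_first; round_robin is the usual choice once load balancing is wanted, and it is the default cluster policy under xDS. Custom pickers (P2C, locality-aware, latency-based) are pluggable.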
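
A P2C (power of two choices) picker is small enough to sketch in full. This is a standalone model of the policy, not gRPC's picker API: the function name and the dict of outstanding-request counts are assumptions made for illustration.

```python
import random


def pick_p2c(in_flight, rng=random):
    """Sample two distinct endpoints and return the one with fewer outstanding requests.

    in_flight: dict mapping endpoint -> current outstanding request count.
    rng: the random module or a random.Random instance (injectable for tests).
    """
    endpoints = list(in_flight)
    if not endpoints:
        raise ValueError("no endpoints available")
    if len(endpoints) == 1:
        return endpoints[0]
    a, b = rng.sample(endpoints, 2)
    return a if in_flight[a] <= in_flight[b] else b
```

The most loaded endpoint can never win a comparison, which is what keeps P2C from piling onto a slow backend the way pure random or round-robin selection can.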

Why this avoids the LB-hop tax for internal traffic. A typical mesh call costs:

Client → sidecar Envoy. ~0.5 ms hop. (Same machine, but still a context switch.)
Sidecar → remote backend. Network RTT (~0.5 ms data center).
Remote backend → its sidecar. Same-machine hop.
Sidecar → application. Same-machine hop.

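Total: ~2 ms per call before the application even sees the request (assuming each same-machine hop costs about the same ~0.5 ms), of which only the network RTT is unavoidable. Multiplied across 50 sequential calls in a request path, that's 100 ms; calls fanned out in parallel pay the tax once per level, not once per call.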

Proxy-less gRPC:

Client (application) → remote backend (application). Network RTT only. Maybe ~0.5 ms.

The mesh tax is gone for internal calls. Production data from Google (which uses proxy-less gRPC heavily internally) shows latency reductions of 10-30% for high-fan-out workloads.

The trade-offs.

Clients become smarter. Every client (every language: Go, Java, Python, C++, etc.) needs the LB library, the xDS resolver, the connection-pool management. The "all logic in Envoy" benefit is gone — you reimplement it in every language.
Policy distribution harder to centralize. With a sidecar, "everyone retries 3x with exponential backoff" is a single Envoy config. With proxy-less, each client library's behavior must be consistent. xDS helps (the control plane pushes the same config), but library-version skew is real ("our Go client is on gRPC 1.50, Java client is on 1.55; their retry semantics differ slightly").
Operational complexity. A misconfigured client is harder to fix than a misconfigured sidecar (you can update the sidecar without touching the application; you can't update the client library without an application redeploy in most cases).
Observability harder. Without a sidecar, the trace/metric instrumentation must live in the client. gRPC has good built-in support for this, but it's another thing to keep consistent.
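
As an illustration of the policy-distribution point, the "retry 3x with exponential backoff" rule can be written as a gRPC service config that every client language parses the same way. This is a sketch under assumptions: the field names follow gRPC's service config schema as commonly documented, the service name `example.Inventory` is hypothetical, and under xDS the retry policy would normally arrive from the control plane rather than from a hand-written config like this.

```python
import json

# Assumption: one shared config rendered centrally and handed to every client.
SERVICE_CONFIG = {
    "loadBalancingConfig": [{"round_robin": {}}],
    "methodConfig": [
        {
            "name": [{"service": "example.Inventory"}],  # hypothetical service
            "retryPolicy": {
                "maxAttempts": 4,  # 1 original attempt + 3 retries
                "initialBackoff": "0.1s",
                "maxBackoff": "2s",
                "backoffMultiplier": 2,
                "retryableStatusCodes": ["UNAVAILABLE"],
            },
        }
    ],
}


def render_service_config(config: dict) -> str:
    """Serialize the config to the JSON string a client library would consume."""
    return json.dumps(config, sort_keys=True)


print(render_service_config(SERVICE_CONFIG))
```

Keeping the config as data in one place is what limits drift; the remaining risk is the one noted above, that different library versions interpret the same fields slightly differently.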

The 2026 picture: proxy-less is winning for gRPC-heavy internal architectures (Google, ByteDance, large financial firms), and sidecars remain dominant for polyglot environments with mixed protocols (HTTP/1.1, gRPC, WebSocket, custom). Ambient mesh (Istio) and Cilium's eBPF mesh blur the line further: there is no sidecar, but traffic still crosses a node-level proxy, so the hop tax remains in a less acute form.

§28. Cost and economics

The LB's cost shows up everywhere — instance hours, bandwidth, license fees, engineer time. Worth understanding before adopting a managed service or building in-house.

Managed LB pricing. AWS ALB pricing (current as of 2026):

Fixed hourly: $0.0225/hour per ALB (~$16.40/month standing). 50 ALBs = $820/month base.
Per LCU (Load Balancer Capacity Unit): $0.008/LCU-hour. An LCU is "the max of new connections, active connections, processed bytes, rule evaluations." One LCU sustained for a month costs about $5.84, so a lightly loaded ALB averaging under 2 LCUs adds only ~$2-11/month, while a medium-traffic ALB consuming 10-50 LCUs continuously adds ~$58-292/month. At peaks (Black Friday on a retail ALB), an LCU spike to 1000+ for an hour = $8 for one hour.
Free data in, $0.01-0.09/GB out (varies by region).
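
The pricing components above combine into a simple monthly estimate. The sketch below uses the hourly and LCU rates quoted above and assumes a 730-hour month; it leaves out data-transfer charges, and the function names are illustrative.

```python
HOURS_PER_MONTH = 730   # assumption: average month length in hours
ALB_HOURLY_USD = 0.0225  # fixed charge per ALB, from the list above
LCU_HOURLY_USD = 0.008   # charge per LCU-hour, from the list above


def alb_monthly_usd(avg_lcus: float, albs: int = 1) -> float:
    """Monthly cost of `albs` load balancers, each averaging `avg_lcus` LCUs."""
    per_alb = (ALB_HOURLY_USD + avg_lcus * LCU_HOURLY_USD) * HOURS_PER_MONTH
    return albs * per_alb


def lcu_spike_usd(lcus: float, hours: float) -> float:
    """LCU charge for a burst of `lcus` capacity units lasting `hours`."""
    return lcus * LCU_HOURLY_USD * hours


print(f"idle ALB:        ${alb_monthly_usd(0):.2f}/month")
print(f"50 idle ALBs:    ${alb_monthly_usd(0, albs=50):.2f}/month")
print(f"1000-LCU hour:   ${lcu_spike_usd(1000, 1):.2f}")
```

Because the fixed charge applies per load balancer, the ALB count matters as much as traffic for small deployments.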

For a small SaaS with one ALB doing 1k RPS sustained: ~$25-50/month. Affordable; just turn it on.

For a hyperscaler doing 1M RPS on the same managed LB: tens of thousands per month per LB; multiply by hundreds of LBs and the bill is millions per year. The math starts to make in-house Envoy attractive.

NLB pricing has a similar shape but uses NLCUs (Network Load Balancer Capacity Units) with different per-unit costs. GCP and Azure are broadly comparable.

Self-managed Envoy/HAProxy TCO (Total Cost of Ownership). What does running your own LB actually cost?

Hardware/cloud instances. An Envoy fleet doing 250k RPS per instance needs N instances + headroom. Let's say a fleet of 50 c5.4xlarge instances ($0.68/hour each, $24,500/month all up).
Bandwidth. Same as managed — the data is leaving the data center either way.
Engineering. Building, operating, on-call. A team of 3-5 engineers ($1-2M/year fully loaded) is the realistic minimum to operate a custom LB tier at scale. Includes incident response, capacity planning, upgrade testing.
Observability infrastructure. Metrics ingest, log storage, tracing systems. Maybe $50-200k/year for a medium fleet.
Software/maintenance. Envoy is open source, but security patches, version upgrades, integration testing all cost time. HAProxy Enterprise is paid ($5-20k/year per pair of instances).

Total: ~$2-3M/year for a self-managed fleet that replaces ~$5M/year of managed LB at scale. The "tipping point" depends on traffic volume.
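
For a rough cross-check, the line items that carry explicit figures can be summed as ranges. This sketch assumes 30-day months (which matches the ~$24,500/month instance figure) and includes only instances, engineering, and observability; bandwidth, licenses, and upgrade effort are not priced in the list, so they are left out here.

```python
HOURS_PER_YEAR = 720 * 12  # assumption: 30-day months


def instances_usd_per_year(count: int, hourly_usd: float) -> float:
    """Annual on-demand cost for a fleet of identical instances."""
    return count * hourly_usd * HOURS_PER_YEAR


# (low, high) annual ranges in USD, copied from the line items above.
LINE_ITEMS = {
    "instances": (instances_usd_per_year(50, 0.68),) * 2,
    "engineering": (1_000_000, 2_000_000),
    "observability": (50_000, 200_000),
}


def tco_range(items: dict) -> tuple:
    """Sum the low ends and the high ends of every line item."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


low, high = tco_range(LINE_ITEMS)
print(f"itemized TCO: ${low:,.0f} to ${high:,.0f} per year")
```

The itemized sum comes out somewhat below the ~$2-3M headline; the gap is presumably the unpriced items (software maintenance, licenses, capacity headroom). Engineering dominates in every scenario, which is why the tipping point is set by team cost rather than instance cost.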

The "we save $X by running our own LB" tipping point. Rough rule of thumb:

Under $500k/year on managed LBs: keep using managed. The engineering cost of in-house is higher than the LB savings. Spend the engineer-cycles on the product instead.
$500k to $3M/year: it depends. Specialized requirements (custom WAF rules, internal-only protocols, regulatory constraints) might force in-house; otherwise managed is still cheaper.
Over $3M/year: in-house pays off. The hyperscalers all run their own LBs (Cloudflare Unimog, AWS Hyperplane, Google Maglev, Facebook Katran) for exactly this reason — and because at their scale, they need features no managed LB provides.

There's a non-monetary tipping point too: feature constraint. If a managed LB doesn't support a feature you need (HTTP/3 to backends before the provider offers it, a specific TLS extension, a custom protocol), you build your own regardless of cost.
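
The rule of thumb, including the feature override, fits in a small decision function. This is a sketch of the heuristic as stated above, not a costing model; the function name, parameter names, and return strings are illustrative.

```python
def build_or_buy(managed_spend_usd_per_year: float,
                 needs_unsupported_feature: bool = False) -> str:
    """Apply the managed-vs-in-house rule of thumb from the thresholds above."""
    # A hard feature gap overrides the cost comparison entirely.
    if needs_unsupported_feature:
        return "in-house"
    if managed_spend_usd_per_year < 500_000:
        return "managed"
    if managed_spend_usd_per_year <= 3_000_000:
        return "depends"
    return "in-house"


for spend in (200_000, 1_500_000, 5_000_000):
    print(f"${spend:,}/year -> {build_or_buy(spend)}")
```

The "depends" band is where the specialized requirements mentioned above (custom WAF rules, internal-only protocols, regulatory constraints) decide the outcome.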

The "managed LBs hide the cost" pattern. A common surprise: the managed LB looks cheap until traffic grows or feature creep adds LCUs. A team starts at $50/month per ALB; six months later they have 200 ALBs (one per service per environment) and $200k/year in LB bills, and nobody planned for it. Practice: budget for managed LB cost growth at the same rate as traffic growth. Consider one ALB serving many backends (path-based routing) rather than one ALB per backend.

The big architectural truth: load balancing has become so commoditized that the wrong question is "Envoy vs HAProxy vs NGINX." The right question is "managed or in-house," and the answer flips at predictable scale milestones.

§29. Summary

Load balancing is a class of traffic-distribution infrastructure. At the line-rate end sit kernel-bypass and in-kernel XDP/eBPF packet forwarders (Maglev, Katran, Unimog: Maglev-hash lookup tables, ~200 ns per packet); at the policy-rich end sit user-space application-aware proxies (Envoy, NGINX, HAProxy: TLS termination, path routing, retries, mesh policy). Both layers chain in the canonical anycast + ECMP + L4 + L7 + backend topology. The Maglev table is the only replicated state, and because it is deterministic from the backend list, the data plane holds no durable state. The design's hardest problems (stable hashing under churn, slow-start, hot keys, layered health, sticky failover, TLS cost, SYN floods) all trace back to the same root: an LB is a routing decision repeated billions of times per day with no second chances.