RPC

A technology reference on RPC (Remote Procedure Call) and the broader class of inter-service communication substrates: gRPC, Apache Thrift, REST (Representational State Transfer), GraphQL, Connect/Twirp, and the service-mesh primitives (Envoy sidecars, mTLS, xDS configuration streaming) that make any of them survive at fleet scale. This is not a doc about one product's mesh; it is a doc about the underlying technology — what it is, what guarantees it gives you by design, where its hard problems live, and what bytes are actually on the wire. Use cases from payments, search, feed, ML inference, mobile, and public APIs appear throughout as illustrations.

The depth comes from: the Protobuf wire format at the byte level, HTTP/2 framing and stream multiplexing, and the cascading-failure mechanics (retry storms, head-of-line blocking, connection-pool exhaustion) that show up in any RPC fabric regardless of which framework you picked.

§1. What RPC Is, and What It Is Not

RPC is a class of inter-service communication where a process invokes a procedure on another process with synchronous (or streaming) request-response semantics, over a typed schema, with explicit error codes, deadlines, and (usually) authentication built into the wire protocol. That last clause is what separates RPC from "just send HTTP and parse the body." RPC frameworks bake the operational contract into the protocol.

Three properties make something an RPC framework rather than a transport:

Schema is the contract. A .proto (Protobuf), .thrift, or OpenAPI document defines the operations, the message types, and the error space. Both client and server compile code from the same schema, and the wire bytes are interpreted relative to that schema. This is fundamentally different from raw HTTP, where the body is opaque bytes and the contract is "whatever the docs say."
Error codes are typed. A gRPC call returns OK, DEADLINE_EXCEEDED, UNAVAILABLE, PERMISSION_DENIED, etc. — 17 well-known codes. Each tells the client a different recovery action. Compare to raw HTTP, where you get one of ~50 status codes that mostly compress to "200 / 4xx / 5xx" and the actual meaning is hidden in a JSON error body whose shape varies per service.
Deadlines, not timeouts. A deadline is an absolute timestamp ("must complete by 14:00:00.500 UTC") that propagates across hops. Each hop subtracts the time it has already spent and forwards the smaller remaining budget downstream. Per-hop timeouts do not propagate, and they are the root of multi-hour incidents where leaf services are still working after the user request was abandoned at the edge.
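The deadline property can be sketched in a few lines. This is a minimal illustration, not gRPC's implementation: the `Deadline` class and `budget_for_next_hop` are hypothetical names, and the clock is injected through `now` so the arithmetic is reproducible.

```python
import time
from typing import Optional


class Deadline:
    """Absolute deadline on a local monotonic clock (hypothetical helper, not a gRPC API)."""

    def __init__(self, expires_at: float) -> None:
        self.expires_at = expires_at

    @classmethod
    def from_timeout(cls, timeout_s: float, now: Optional[float] = None) -> "Deadline":
        # Convert the relative budget received from the caller into an absolute instant.
        start = time.monotonic() if now is None else now
        return cls(start + timeout_s)

    def remaining(self, now: Optional[float] = None) -> float:
        current = time.monotonic() if now is None else now
        return max(0.0, self.expires_at - current)


def budget_for_next_hop(deadline: Deadline, now: Optional[float] = None) -> float:
    """Return the seconds to forward downstream, or fail fast without dispatching."""
    budget = deadline.remaining(now)
    if budget <= 0.0:
        raise TimeoutError("DEADLINE_EXCEEDED: budget spent before dispatch")
    return budget


# The edge grants 500 ms at t=100.0; this hop spends 120 ms before calling downstream.
edge = Deadline.from_timeout(0.5, now=100.0)
print(round(budget_for_next_hop(edge, now=100.12), 3))  # 0.38
```

Each process keeps the deadline as an absolute instant on its own clock and converts it back to a relative budget only when it crosses the wire, which avoids depending on synchronized clocks between hosts.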

Where RPC sits in the stack. RPC is the east-west fabric of a microservices architecture. A request enters at the edge (a load balancer, API gateway, or CDN), the edge runs some logic and then calls 5–50 internal services, each of which calls 5–50 more services, each of which talks to databases, caches, search indices, and ML feature stores. The RPC layer is the connective tissue of that tree. It is not the database, it is not the queue, it is the means by which one stateless service asks another stateless service for an answer.

Distinguish from adjacent technology categories:

vs. Message queues (Kafka, RabbitMQ, Pulsar). Queues are asynchronous and durable. You produce a message; some consumer eventually drains it. Failure semantics are "at-least-once delivery with replay." RPC is synchronous and non-durable. If the server crashes mid-call, the message is gone; the client has to retry against a different instance. Use queues for fanout work that can tolerate seconds-to-minutes of latency and must survive crashes. Use RPC for "I need the answer now, in this user request."
vs. Raw HTTP. Raw HTTP gives you a transport with status codes and headers. RPC gives you a transport plus a schema plus an error taxonomy plus deadlines plus codegen plus streaming. You can build "an RPC framework" on top of raw HTTP — that's what REST is, sort of — but you'll be reinventing each of those primitives in user code, and ten teams will reinvent each differently.
vs. REST. REST is resource-oriented (GET /users/42, POST /payments) and HTTP-native. It's a style of using HTTP, not really a separate protocol. REST is the right answer for public APIs (browsers, third parties, CLI debuggability with curl). It is the wrong answer for internal hot paths at fleet scale, for reasons we'll dissect in §23.
vs. GraphQL. GraphQL is a query language over a typed schema, optimized for client-driven response shape (a mobile app asks for exactly the fields it wants). It is best used at the API gateway / BFF (Backend For Frontend) layer, not as an internal east-west substrate. Internally GraphQL becomes a thick translation layer with no real benefits — internal callers know exactly what fields they want.

What RPC is NOT good for:

Long-running operations (workflows). If your "RPC" runs for 10 minutes, you've built a workflow engine on top of RPC and you'll fight reconnect, idempotency, and progress-tracking forever. Use a workflow primitive (Temporal, Cadence, Step Functions, durable saga) instead.
Durable event distribution. RPC is a hand-off, not a record. Use a log (Kafka).
Bulk data movement. Streaming gigabytes through gRPC works but is wasteful; use direct object storage or a data pipeline.
Anywhere the consumer count is unknown or changes constantly. RPC is point-to-point; pub/sub is many-to-many.

§2. Inherent Guarantees: What RPC Provides by Design

A clear-eyed list of what the technology gives you for free and what it does NOT give you (so you stop expecting it).

Provided by design:

Guarantee	Mechanism
Typed messages	Schema in `.proto` / `.thrift`; codegen produces typed structs in every supported language. Wire bytes are decoded relative to the schema.
Versioned error codes	gRPC defines 17 named status codes (`OK`, `CANCELLED`, `UNKNOWN`, `INVALID_ARGUMENT`, `DEADLINE_EXCEEDED`, `NOT_FOUND`, `ALREADY_EXISTS`, `PERMISSION_DENIED`, `UNAUTHENTICATED`, `RESOURCE_EXHAUSTED`, `FAILED_PRECONDITION`, `ABORTED`, `OUT_OF_RANGE`, `UNIMPLEMENTED`, `INTERNAL`, `UNAVAILABLE`, `DATA_LOSS`). Each prescribes a different client behavior.
Deadline propagation	`grpc-timeout` metadata header is a relative duration; each hop subtracts elapsed time and forwards a smaller value. Calls that have already exceeded the deadline fail immediately without dispatch.
Streaming	gRPC supports unary, server-streaming, client-streaming, and bidi-streaming modes. The wire framing (HTTP/2 DATA frames + length prefix per message + TRAILERS) makes this first-class rather than bolted-on.
Multiplexing	HTTP/2 lets one TCP connection carry hundreds-to-thousands of in-flight RPCs interleaved at the frame level. No connection-per-call.
Per-stream flow control	HTTP/2 `WINDOW_UPDATE` frames mean a slow consumer naturally throttles its sender without starving sibling streams on the same connection.
Schema evolution rules	Protobuf's field-number contract: add fields safely, reserve removed numbers, never reuse a number. Old binaries skip unknown tags. Forward and backward compatibility by construction if you follow the rules.
Compact wire format	Protobuf and Thrift TCompact are typically 30–35% the size of equivalent JSON, with ~10x faster encode/decode.
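To make the "each code prescribes a different client behavior" row concrete, here is a sketch of a client-side policy table. The numeric values are gRPC's status codes; the action strings and the `client_action` function are an illustrative assumption, not something the framework defines.

```python
from enum import IntEnum


class Status(IntEnum):
    OK = 0
    CANCELLED = 1
    UNKNOWN = 2
    INVALID_ARGUMENT = 3
    DEADLINE_EXCEEDED = 4
    NOT_FOUND = 5
    ALREADY_EXISTS = 6
    PERMISSION_DENIED = 7
    RESOURCE_EXHAUSTED = 8
    FAILED_PRECONDITION = 9
    ABORTED = 10
    OUT_OF_RANGE = 11
    UNIMPLEMENTED = 12
    INTERNAL = 13
    UNAVAILABLE = 14
    DATA_LOSS = 15
    UNAUTHENTICATED = 16


# Hypothetical policy table: an assumption for illustration, not part of gRPC.
_ACTIONS = {
    Status.OK: "done",
    Status.UNAVAILABLE: "retry the call with backoff",
    Status.ABORTED: "retry the enclosing transaction",
    Status.RESOURCE_EXHAUSTED: "back off, then retry if the call is idempotent",
    Status.DEADLINE_EXCEEDED: "give up: the budget is spent",
    Status.CANCELLED: "give up: the caller went away",
    Status.UNAUTHENTICATED: "refresh credentials, then retry once",
}


def client_action(status: Status) -> str:
    # Everything else (bad arguments, missing permissions, bugs) fails again unchanged.
    return _ACTIONS.get(status, "do not retry: surface the error")


print(client_action(Status.UNAVAILABLE))       # retry the call with backoff
print(client_action(Status.INVALID_ARGUMENT))  # do not retry: surface the error
```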

NOT provided — must be layered on top:

Not provided	Why naive expectation breaks
Durability	RPC is a hand-off. If the server crashes after acknowledging but before persisting, the action is lost. Want durability? Persist to a database in the handler before returning, or use a queue.
Transactional semantics across services	A successful RPC to service A does not "commit" anything cross-service. Two-phase commit, sagas, and outbox patterns must be layered on top. RPC is one of the primitives those patterns are built from, not a substitute.
Automatic retry safety (idempotency)	The framework will happily retry a `POST /payment` and charge twice. Idempotency is the application's responsibility (idempotency keys, content-addressed identifiers). RPC retries are safe only on idempotent calls.
Service discovery	gRPC does not tell you where service B lives. You need a discovery layer (DNS, Consul, etcd, Eureka, Kubernetes endpoints) underneath.
Authentication / authorization	gRPC carries credentials in headers (`authorization`, mTLS client cert) but does not enforce authn/authz. That's policy code in the server or in the sidecar.
Cross-region routing or fallback	RPC does what you tell it. Multi-region failover, locality-aware routing, and "if region A is down send to B" are all higher-layer concerns.
Backpressure across the application	HTTP/2 flow control covers one hop. Backpressure across a 5-deep fan-out is your design problem (token budgets, admission control).
Exactly-once delivery	Like all RPC, gRPC is best-effort with retries. Exactly-once is application-layer (dedup + idempotency keys + sequencing).

The summary: the RPC framework gives you the conduit, the schema, the error taxonomy, the deadlines, and the streaming. Everything else — durability, transactions, idempotency, fanout, fallback — is your job to layer on top.
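The idempotency row is the one most often learned the hard way, so a sketch may help. This toy handler is an assumption for illustration: the class name, the in-memory dict, and the response shape are all hypothetical, and a real service would keep the key-to-response mapping in durable storage.

```python
from typing import Dict


class AuthorizeHandler:
    """Toy handler; the dict stands in for a durable table keyed by idempotency key."""

    def __init__(self) -> None:
        self._responses: Dict[str, dict] = {}
        self.charges_made = 0

    def authorize(self, idempotency_key: str, amount_minor_units: int) -> dict:
        # A retry of a request we already processed returns the stored response.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.charges_made += 1  # the side effect that must not happen twice
        response = {"status": "AUTHORIZED", "amount_minor_units": amount_minor_units}
        self._responses[idempotency_key] = response
        return response


handler = AuthorizeHandler()
first = handler.authorize("abc", 4999)
retry = handler.authorize("abc", 4999)  # e.g. the client retried after UNAVAILABLE
print(handler.charges_made)  # 1
```

In production the lookup, the side effect, and the write of the stored response would need to be atomic (for example, one database transaction); otherwise two concurrent retries can both pass the check.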

§3. The Design Space: Variants of the Technology

The RPC class has several major variants. They differ along wire format, transport, schema strictness, browser support, and ecosystem.

Dimension	gRPC	Apache Thrift	REST + JSON	GraphQL	Connect / Twirp
Wire format	Protobuf (binary)	TBinary or TCompact	JSON (text)	JSON over HTTP	Protobuf or JSON
Transport	HTTP/2 only	TBinary over TCP, framed; THttp variants	HTTP/1.1 or HTTP/2	HTTP/1.1 or HTTP/2	HTTP/1.1 or HTTP/2
Schema	`.proto`, mandatory	`.thrift`, mandatory	OpenAPI optional	SDL (Schema Definition Language) mandatory	`.proto`, mandatory
Streaming	unary + 3 stream modes	unary + streams (modern)	SSE / chunked / WebSocket (ad hoc)	subscriptions (ad hoc)	server-streaming over HTTP/2
Browser native	no (gRPC-Web proxy)	no	yes	yes	yes (no proxy needed)
Trailers required	yes (status in TRAILERS)	no	no	no	no (status in headers or body)
Deadline in protocol	yes (`grpc-timeout`)	yes (some transports)	no	no	yes
Error codes	17 typed codes	typed exceptions	HTTP status + body convention	error array in body	typed codes (mirrors gRPC)
Wire size vs JSON	~30%	~35%	100%	100%	~30% (Protobuf mode)
Encode CPU vs JSON	~10x faster	~8x faster	baseline	baseline	~10x faster
Codegen	yes (15+ langs)	yes (20+ langs)	optional (OpenAPI Generator)	yes (Apollo, urql, etc.)	yes (Buf)
Debuggability	`grpcurl` + reflection	`thrift CLI`	`curl`	playground UI	`curl` (when JSON mode)
Typical use	east-west microservices	east-west (Meta, Uber legacy)	north-south, public APIs	mobile/web BFF	mobile + east-west
Origin	Google (2015, public)	Facebook (2007, public)	conventional HTTP usage	Facebook (2015, public)	Twitch (Twirp, 2018); Buf (Connect, 2022)

Defended picks:

gRPC + Protobuf is the default for east-west internal traffic at any scale where wire and CPU efficiency matter. The combination of compact format, mandatory schema, codegen, deadline propagation, and streaming is unmatched. Google, Square, Lyft, Cloudflare, Dropbox, and increasingly LinkedIn use it for hot internal paths.
REST + JSON is the right answer for north-south traffic (public APIs, browser-callable endpoints, partner integrations). Stripe's public API, GitHub's REST API, Twilio's API — all REST + JSON because the caller is some random startup with curl and Python.
GraphQL fits the BFF (Backend For Frontend) layer where one client (a mobile app) needs exactly the fields it wants and no more, and where different clients want different shapes. Shopify's storefront API, GitHub's GraphQL API, Facebook's mobile data layer. Bad fit internally where callers know exactly what they need and over-fetching isn't a thing.
Thrift is the legacy answer at Facebook, Twitter, and Uber. Functionally equivalent to gRPC for most workloads. If your company is on Thrift, stay on Thrift; the migration is rarely worth the cost.
Connect / Twirp are the answer when you want Protobuf-quality schemas but need to call from a browser without a proxy. They drop the trailers requirement (gRPC's biggest browser problem) and put status codes in the response headers/body.

§4. Underlying Data Structure — The Wire Formats

The depth here is what bytes are actually on the wire, top down: Protobuf message → gRPC framing → HTTP/2 frames → TCP. Several different domain examples are used here (a payment authorization, a user profile lookup, a feed item) to keep the abstraction honest.

4a. Protobuf wire format: Tag-Length-Value with varint encoding

A Protobuf message is a flat sequence of (field_tag, value) pairs. There is no top-level length, no field names, no schema embedded — only numbered fields decoded against a schema both ends already have.

Field tag = (field_number << 3) | wire_type, encoded as a varint. Wire types:

Wire type	Code	Used for
VARINT	0	int32, int64, uint32, uint64, sint32, sint64, bool, enum
FIXED64	1	fixed64, sfixed64, double
LEN	2	string, bytes, embedded messages, packed repeated
FIXED32	5	fixed32, sfixed32, float

(SGROUP=3 and EGROUP=4 are deprecated proto2 group markers.)
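The tag formula is small enough to check by hand. The sketch below builds and splits tags for the four wire types in the table; `make_tag` and `split_tag` are illustrative names, not part of any Protobuf library.

```python
WIRE_TYPE_NAMES = {0: "VARINT", 1: "FIXED64", 2: "LEN", 5: "FIXED32"}


def make_tag(field_number: int, wire_type: int) -> int:
    if wire_type not in WIRE_TYPE_NAMES:
        raise ValueError(f"unsupported wire type {wire_type}")
    return (field_number << 3) | wire_type


def split_tag(tag: int):
    # Low 3 bits are the wire type; the remaining bits are the field number.
    return tag >> 3, WIRE_TYPE_NAMES[tag & 0x07]


print(hex(make_tag(1, 2)))  # 0xa
print(split_tag(0x20))      # (4, 'VARINT')
```

Because the tag is itself a varint, field numbers 1 through 15 fit in a single tag byte, while field 16 and above need two or more. That is one reason to give the most frequently set fields the lowest numbers.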

Varint encoding is a variable-length integer. Each byte uses its high bit as a continuation flag and the low 7 bits as data, little-endian:

1 → 0x01 (1 byte)
150 → 0x96 0x01 (binary: 10010110 00000001 → strip MSB, swap groups: 0000001 0010110 = 150)
300 → 0xAC 0x02
2^32 - 1 → 5 bytes
Negative integers in int32 / int64 are sign-extended to 64 bits and encoded as unsigned, so -1 takes 10 bytes. This is the classic foot-gun. Use sint32 / sint64 with zigzag encoding ((n << 1) ^ (n >> 31) for sint32, with >> 63 for sint64) when you have negatives, so -1 becomes 1 byte.
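The varint examples above can be reproduced with a short sketch. These functions are a from-scratch illustration of the encoding rules, not the API of a Protobuf library; `zigzag64` is the 64-bit form of the zigzag formula.

```python
MASK64 = (1 << 64) - 1


def encode_varint(value: int) -> bytes:
    if value < 0:
        value &= MASK64  # int32/int64 negatives go out as 64-bit two's complement
    out = bytearray()
    while True:
        low7, value = value & 0x7F, value >> 7
        if value:
            out.append(low7 | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0):
    """Return (value, next_position)."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # least-significant group first
        if not byte & 0x80:
            return result, pos
        shift += 7


def zigzag64(n: int) -> int:
    # Maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
    return ((n << 1) ^ (n >> 63)) & MASK64


print(encode_varint(150).hex())           # 9601
print(len(encode_varint(-1)))             # 10
print(encode_varint(zigzag64(-1)).hex())  # 01
```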

Concrete walkthrough 1 — a user profile lookup (search-style read):

message GetUserProfileRequest {
  string user_id = 1;
}

Encoded GetUserProfileRequest { user_id = "42" }:

0x0A 0x02 0x34 0x32

Breakdown:

0x0A = tag for field 1, wire type 2 (LEN): (1 << 3) | 2 = 10 = 0x0A.
0x02 = length of the string in bytes (varint).
0x34 0x32 = "42" in ASCII.

Total: 4 bytes. Equivalent JSON {"user_id":"42"} is 16 bytes.

The response UserProfile { user_id = "42", display_name = "Yifan", created_at_ms = 1714400000000, roles = ["admin", "staff"] }:

field 1 (user_id, LEN):
  0x0A 0x02 0x34 0x32                                    -> "42"

field 2 (display_name, LEN):
  0x12 0x05 0x59 0x69 0x66 0x61 0x6E                     -> "Yifan"

field 4 (created_at_ms, VARINT int64):
  0x20 0x80 0xF0 0xCF 0xD1 0xF2 0x31                     -> 1714400000000

field 5 (roles, LEN) - repeated string, two entries:
  0x2A 0x05 0x61 0x64 0x6D 0x69 0x6E                     -> "admin"
  0x2A 0x05 0x73 0x74 0x61 0x66 0x66                     -> "staff"

Total: 4 + 7 + 7 + 7 + 7 = 32 bytes on the wire. The equivalent JSON with no whitespace is 95 bytes, about 3x larger.
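As a cross-check on the byte counts, the message can be encoded by hand. This is a sketch that hard-codes the field numbers used in the walkthrough (1, 2, 4, 5); the helper names are hypothetical and no Protobuf library is involved.

```python
import json


def varint(value: int) -> bytes:
    out = bytearray()
    while True:
        low7, value = value & 0x7F, value >> 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def len_field(field_number: int, payload: bytes) -> bytes:
    # Wire type 2 (LEN): tag, byte length, then the raw bytes.
    return varint((field_number << 3) | 2) + varint(len(payload)) + payload


def varint_field(field_number: int, value: int) -> bytes:
    # Wire type 0 (VARINT): tag, then the value.
    return varint((field_number << 3) | 0) + varint(value)


def encode_user_profile(user_id, display_name, created_at_ms, roles) -> bytes:
    out = len_field(1, user_id.encode("utf-8"))
    out += len_field(2, display_name.encode("utf-8"))
    out += varint_field(4, created_at_ms)
    for role in roles:  # unpacked repeated field: one tag per element
        out += len_field(5, role.encode("utf-8"))
    return out


profile = {"user_id": "42", "display_name": "Yifan",
           "created_at_ms": 1714400000000, "roles": ["admin", "staff"]}
wire = encode_user_profile(**profile)
compact_json = json.dumps(profile, separators=(",", ":"))
print(len(wire), len(compact_json))  # 32 95
```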

Concrete walkthrough 2 — a payment authorization (different domain, same encoding rules):

message AuthorizeRequest {
  string idempotency_key = 1;
  int64 amount_minor_units = 2;
  string currency = 3;
}

Encoded AuthorizeRequest { idempotency_key = "abc", amount_minor_units = 4999, currency = "USD" }:

0x0A 0x03 0x61 0x62 0x63          -> field 1, "abc"
0x10 0x87 0x27                    -> field 2 (VARINT), 4999 = 0x1387 -> varint 0x87 0x27
0x1A 0x03 0x55 0x53 0x44          -> field 3, "USD"

13 bytes. The point: Protobuf is denser because (a) field names are integers, not strings; (b) integers use varint and small integers take one byte; (c) no delimiters, whitespace, or quotes; (d) the schema (not the wire) carries the field semantics.
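Reading the bytes back shows how little the wire carries without the schema. The sketch below is a schema-less decoder that returns (field number, wire type, raw value) triples; the names are illustrative, and mapping field 2 to "amount_minor_units" would still require the .proto.

```python
def read_varint(buf: bytes, pos: int):
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_fields(buf: bytes):
    """Walk a flat message and return [(field_number, wire_type, value), ...]."""
    fields, pos = [], 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # VARINT
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # FIXED64
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # LEN
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # FIXED32
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((number, wire_type, value))
    return fields


authorize_request = bytes.fromhex("0a03616263" "108727" "1a03555344")
print(decode_fields(authorize_request))
# [(1, 2, b'abc'), (2, 0, 4999), (3, 2, b'USD')]
```

The wire type alone tells a reader how many bytes to consume, which is what lets an old binary step over a field number it has never heard of.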

Schema evolution rules — the field-number contract

The wire format is identified by integer field number, not name:

Adding fields: pick an unused number; old readers ignore unknown tags (proto3 default retains them in an unknown-fields set so they round-trip). Backward + forward compatible. Always optional.
Removing fields: stop writing them and reserve the number: message User { reserved 4; reserved "old_field_name"; } Without reserved, a future developer reuses field 4 with a different type, and old data on disk now decodes catastrophically wrong.
Renaming fields: free; the name is a codegen artifact, not on the wire.
Changing types: only between wire-compatible types — int32 ↔ int64 ↔ uint32 ↔ uint64 ↔ bool (all VARINT); string ↔ bytes (both LEN). NOT string → int32.
repeated ↔ scalar: proto3 repeated of scalars uses packed encoding by default (one LEN tag, all values concatenated). Switching between repeated and single scalar is mostly safe with caveats.
required does not exist in proto3 because once marked required, you can never remove the field without breaking every reader. proto2 had it; proto3 dropped it.
oneof: only one field in the set can be set (a tagged union). Adding to a oneof is safe; moving an existing field into a oneof is breaking.

Enforcement: buf breaking and protolock run on every PR that touches .proto. At fleet scale (hundreds of services, thousands of engineers), schema review is a Staff-engineer-level responsibility.
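A few of the rules above are mechanical enough to check in code. The sketch below is a toy checker over a hypothetical dict representation of a message (field number mapped to name and type, plus a set of reserved numbers). It is not how buf or protolock work internally; it only illustrates the field-number contract.

```python
VARINT_TYPES = {"int32", "int64", "uint32", "uint64", "bool"}
LEN_TYPES = {"string", "bytes"}


def wire_compatible(old_type: str, new_type: str) -> bool:
    if old_type == new_type:
        return True
    pair = {old_type, new_type}
    return pair <= VARINT_TYPES or pair <= LEN_TYPES


def breaking_changes(old: dict, new: dict) -> list:
    """old/new look like {"fields": {number: (name, type)}, "reserved": {numbers}}."""
    problems = []
    for number, (name, old_type) in old["fields"].items():
        if number in new["fields"]:
            _, new_type = new["fields"][number]  # a rename alone is not a wire change
            if not wire_compatible(old_type, new_type):
                problems.append(f"field {number}: {old_type} -> {new_type} is not wire-compatible")
        elif number not in new["reserved"]:
            problems.append(f"field {number} ({name}) removed without reserving its number")
    for number in new["fields"]:
        if number in old["reserved"]:
            problems.append(f"field {number} reuses a reserved number")
    return problems


v1 = {"fields": {1: ("user_id", "string"), 4: ("created_at_ms", "int64")}, "reserved": set()}
v2 = {"fields": {1: ("id", "string")}, "reserved": set()}  # renamed 1, dropped 4
print(breaking_changes(v1, v2))
# ['field 4 (created_at_ms) removed without reserving its number']
```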

4b. Thrift wire formats

Thrift offers two binary formats:

TBinaryProtocol: fixed-width and simple. Each field has [type_byte (1)] [field_id (2)] [value]. An int32 with value 1: 0x08 0x00 0x01 0x00 0x00 0x00 0x01 — 7 bytes for what Protobuf does in 2.
TCompactProtocol: varints plus delta-encoded field IDs (each field stores delta from previous). An int32 value 1 with delta-1 field id: 0x15 0x02 — 2 bytes (close to Protobuf). Most production Thrift uses TCompact.

Trade-off vs Protobuf: Thrift IDL distinguishes set, map, list natively and has built-in service definitions and typed exceptions. Protobuf treats collections as repeated. The practical choice today is mostly historical (Facebook chose Thrift; Google chose Protobuf; companies adopt whichever ecosystem they're in).
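The two Thrift byte counts can be reproduced with a short sketch. It covers only a single i32 field and, for TCompact, only the short-form header (field-id delta between 1 and 15); the function names are hypothetical and this is not the Thrift library's API.

```python
import struct

T_I32_BINARY = 8   # TBinaryProtocol type code for i32
T_I32_COMPACT = 5  # TCompactProtocol type code for i32


def tbinary_i32_field(field_id: int, value: int) -> bytes:
    # 1-byte type, 2-byte field id, 4-byte value, all big-endian.
    return struct.pack(">bhi", T_I32_BINARY, field_id, value)


def zigzag32(n: int) -> int:
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def varint(value: int) -> bytes:
    out = bytearray()
    while True:
        low7, value = value & 0x7F, value >> 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def tcompact_i32_field(field_id: int, previous_field_id: int, value: int) -> bytes:
    delta = field_id - previous_field_id
    if not 1 <= delta <= 15:
        raise ValueError("this sketch handles only the short form (delta 1..15)")
    # High nibble: field-id delta. Low nibble: type. Value: zigzag varint.
    return bytes([(delta << 4) | T_I32_COMPACT]) + varint(zigzag32(value))


print(tbinary_i32_field(1, 1).hex())      # 08000100000001
print(tcompact_i32_field(1, 0, 1).hex())  # 1502
```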

4c. JSON

JSON is text. A field "user_id":"42" is 14 bytes. Parsing requires a tokenizer at ~30–80 ns/byte (Jackson, in Java) versus Protobuf's generated parser at ~2–5 ns/byte. JSON's only real advantages are human-readability, schema-optional flexibility, browser-native parsing, and infinite extensibility, all of which are features at the public edge and liabilities internally.

JSON Schema can enforce types post-hoc but is rarely enforced at runtime in service-to-service paths.

4d. HTTP/2 framing

A single TCP connection multiplexes many concurrent HTTP/2 streams. Each stream is a sequence of frames, where each frame has a 9-byte header:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 Length (24)                   |
+---------------+---------------+---------------+
|   Type (8)    |   Flags (8)   |
+-+-+-----------+---------------+-------------------------------+
|R|                 Stream Identifier (31)                      |
+=+=============================================================+
|                   Frame Payload (0...)                      ...
+---------------------------------------------------------------+

Frame types relevant to RPC:

HEADERS (0x1): contains HPACK-compressed pseudo-headers (:method, :path, :scheme, :authority, :status) and regular headers (content-type, grpc-timeout, grpc-encoding, authorization, traceparent).
DATA (0x0): the request/response body. For gRPC, body is a sequence of length-prefixed Protobuf messages — 1 byte (compression flag) + 4 bytes (message length, big-endian uint32) + N bytes (Protobuf payload).
TRAILERS (HEADERS frame with END_STREAM set, sent after DATA): contains grpc-status, grpc-message. This is why gRPC needs HTTP/2 — HTTP/1.1 trailers are nearly unusable (browsers don't surface them; intermediaries strip them). gRPC puts call-result status in trailers because a streaming RPC may emit many DATA frames before discovering an error.
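SETTINGS (0x4): connection setup. MAX_CONCURRENT_STREAMS (commonly quoted as 100, which is the minimum the HTTP/2 spec recommends; the protocol itself imposes no limit until a peer advertises one, and gRPC servers usually advertise 1000+), INITIAL_WINDOW_SIZE (per-stream flow-control credits, default 65,535 bytes), MAX_FRAME_SIZE (default 16 KB).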
WINDOW_UPDATE (0x8): per-stream and per-connection flow-control. The receiver tells the sender "I have N more bytes of buffer; you can send N more."
RST_STREAM (0x3): cancel one stream without tearing down the connection. gRPC uses this when a client deadline expires or the client cancels.
GOAWAY (0x7): "I will close this connection after stream N; do not start new streams." Used for graceful shutdown.
PING (0x6): keepalive and RTT measurement.
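The 9-byte frame header is easy to pack and parse by hand. The sketch below uses the type codes listed above and the END_STREAM / END_HEADERS flag bits; the function names are illustrative rather than taken from an HTTP/2 library.

```python
import struct

FRAME_TYPES = {0x0: "DATA", 0x1: "HEADERS", 0x3: "RST_STREAM", 0x4: "SETTINGS",
               0x6: "PING", 0x7: "GOAWAY", 0x8: "WINDOW_UPDATE"}
END_STREAM = 0x1
END_HEADERS = 0x4


def pack_frame_header(length: int, frame_type: int, flags: int, stream_id: int) -> bytes:
    if not 0 <= length < (1 << 24):
        raise ValueError("length must fit in 24 bits")
    # 24-bit length, 8-bit type, 8-bit flags, reserved bit + 31-bit stream id.
    return struct.pack(">I", length)[1:] + struct.pack(">BBI", frame_type, flags,
                                                       stream_id & 0x7FFFFFFF)


def parse_frame_header(header: bytes):
    if len(header) != 9:
        raise ValueError("an HTTP/2 frame header is exactly 9 bytes")
    length = int.from_bytes(header[0:3], "big")
    frame_type, flags = header[3], header[4]
    stream_id = int.from_bytes(header[5:9], "big") & 0x7FFFFFFF  # drop reserved bit
    return length, FRAME_TYPES.get(frame_type, hex(frame_type)), flags, stream_id


header = pack_frame_header(9, 0x0, END_STREAM, 5)
print(header.hex())                # 000009000100000005
print(parse_frame_header(header))  # (9, 'DATA', 1, 5)
```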

HPACK header compression: a dynamic table of recently-seen header name/value pairs on both ends. Once authorization: Bearer <very long JWT> has been seen, subsequent references use a 1-byte index. The static table covers common headers like :method GET for free. On a connection that has served 1000 requests with the same auth token, headers cost ~5 bytes per request after the first, instead of ~800 in ASCII.

Flow control: every stream starts with a 65 KB receive window. If the receiver doesn't drain, the sender stops sending DATA frames for that stream. Per-stream — a slow stream doesn't block sibling streams on the same connection (modulo TCP-layer head-of-line; see §13 problem 5).
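The per-stream credit accounting can be sketched as a counter. This is a simplified model under stated assumptions: it tracks only the stream-level window (real HTTP/2 also enforces a connection-level window), and `StreamSendWindow` is a hypothetical name.

```python
class StreamSendWindow:
    """Sender-side view of one stream's flow-control credits (stream level only)."""

    def __init__(self, initial_window: int = 65_535) -> None:
        self.available = initial_window

    def send(self, payload_len: int) -> int:
        """Return how many DATA bytes may go out now; the rest must wait for credit."""
        allowed = min(payload_len, self.available)
        self.available -= allowed
        return allowed

    def on_window_update(self, increment: int) -> None:
        # The receiver drained its buffer and granted more credit.
        if increment <= 0:
            raise ValueError("WINDOW_UPDATE increment must be positive")
        self.available += increment


window = StreamSendWindow()
print(window.send(100_000))  # 65535: the window is now empty
print(window.send(1))        # 0: blocked until the receiver grants credit
window.on_window_update(10_000)
print(window.send(34_465))   # 10000
```

Each stream owns its own counter, so a receiver that stops draining one stream stalls only that stream's sender-side credits.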

4e. gRPC over HTTP/2: end-to-end byte walkthrough of one unary call

Walk through stub.GetUserProfile(GetUserProfileRequest { user_id: "42" }) with a 500 ms deadline.

Step 1 — Client stub serialization. Generated code calls request.toByteArray(). Bytes: 0x0A 0x02 0x34 0x32. CPU cost: ~50 ns.

Step 2 — gRPC framing. The 4-byte Protobuf message is wrapped in a 5-byte gRPC frame:

0x00                       <- compression flag (0 = uncompressed)
0x00 0x00 0x00 0x04        <- message length (big-endian uint32 = 4)
0x0A 0x02 0x34 0x32        <- Protobuf payload

Total: 9 bytes.
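The 5-byte prefix is simple enough to write and read by hand. This sketch handles uncompressed and flagged messages alike but does no decompression; `frame_message` and `read_messages` are illustrative names, not gRPC library calls.

```python
import struct


def frame_message(payload: bytes, compressed: bool = False) -> bytes:
    # 1-byte compression flag + 4-byte big-endian length + payload.
    return struct.pack(">BI", 1 if compressed else 0, len(payload)) + payload


def read_messages(body: bytes):
    """Split a DATA-frame body into [(compressed, payload), ...]."""
    messages, pos = [], 0
    while pos < len(body):
        if len(body) - pos < 5:
            raise ValueError("truncated length prefix")
        flag, length = struct.unpack_from(">BI", body, pos)
        pos += 5
        if len(body) - pos < length:
            raise ValueError("truncated message")
        messages.append((bool(flag), body[pos:pos + length]))
        pos += length
    return messages


framed = frame_message(bytes.fromhex("0a023432"))
print(framed.hex())           # 00000000040a023432
print(read_messages(framed))  # [(False, b'\n\x0242')]
```

The same loop handles streaming RPCs: a stream body is several of these length-prefixed messages back to back, possibly split across DATA frames.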

Step 3 — HTTP/2 HEADERS frame. The gRPC client opens stream 5 (odd-numbered, client-initiated) on the existing HTTP/2 connection:

HEADERS frame, stream 5, END_HEADERS=1, END_STREAM=0
  :method = POST
  :scheme = https
  :path = /user.v1.UserService/GetUserProfile
  :authority = user-service:9090
  content-type = application/grpc+proto
  grpc-encoding = identity
  grpc-accept-encoding = gzip
  grpc-timeout = 500m                  <- deadline as relative time
  te = trailers                         <- MUST be present for gRPC
  user-agent = grpc-java/1.62.0
  authorization = Bearer eyJhbGciOi...
  traceparent = 00-abc...-def...-01     <- W3C trace context

After HPACK compression, ~50–100 bytes on the wire (less with prior history).
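The grpc-timeout value in the headers above is a number of at most 8 digits followed by a unit letter (H, M, S, m, u, n). A sketch of parsing it and computing the value to forward to the next hop follows; the function names are hypothetical, and a real gRPC library does this internally.

```python
import re

# Nanoseconds per unit letter used by the grpc-timeout header.
UNIT_NANOS = {"H": 3_600_000_000_000, "M": 60_000_000_000, "S": 1_000_000_000,
              "m": 1_000_000, "u": 1_000, "n": 1}
_TIMEOUT_RE = re.compile(r"(\d{1,8})([HMSmun])")


def parse_grpc_timeout(value: str) -> int:
    """Return the timeout in nanoseconds."""
    match = _TIMEOUT_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"malformed grpc-timeout: {value!r}")
    digits, unit = match.groups()
    return int(digits) * UNIT_NANOS[unit]


def format_grpc_timeout(nanos: int) -> str:
    # Pick the finest unit whose value still fits in 8 digits.
    for unit in ("n", "u", "m", "S", "M", "H"):
        value = nanos // UNIT_NANOS[unit]
        if value <= 99_999_999:
            return f"{value}{unit}"
    raise ValueError("timeout too large to encode")


def timeout_for_next_hop(incoming: str, elapsed_nanos: int) -> str:
    remaining = parse_grpc_timeout(incoming) - elapsed_nanos
    if remaining <= 0:
        raise TimeoutError("DEADLINE_EXCEEDED: no budget left to forward")
    return format_grpc_timeout(remaining)


print(parse_grpc_timeout("500m"))                 # 500000000
print(timeout_for_next_hop("500m", 120_000_000))  # 380000u
```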

Step 4 — HTTP/2 DATA frame.

DATA frame, stream 5, END_STREAM=1
  payload: 0x00 0x00 0x00 0x00 0x04 0x0A 0x02 0x34 0x32  (9 bytes)

HTTP/2 frame header: 9 bytes. Total DATA frame on the wire: 18 bytes.

Step 5 — Kernel and network. HTTP/2 codec writes ~70–120 bytes total. TLS wraps that in TLS records (~30 bytes record overhead; a full TLS 1.3 handshake costs one round trip on first connection, and a resumed session can send early data with 0-RTT). TCP adds a 20-byte header per segment, before options. Then NIC, switch, NIC, and DMA into the server's ring buffers.

Step 6 — Server-side decode. Server kernel delivers bytes via read(). TLS decrypts. HTTP/2 codec parses frames, demultiplexes by stream ID. HEADERS into request metadata, DATA into the body. gRPC framing reads [0x00][0x00 0x00 0x00 0x04] — 4-byte Protobuf message follows. Hands 0x0A 0x02 0x34 0x32 to the generated parser.

Step 7 — Protobuf decode. Generated GetUserProfileRequest.parseFrom(bytes):

Read tag byte 0x0A → field 1, wire type LEN.
Read length varint 0x02.
Read 2 bytes 0x34 0x32 → "42".
End of buffer. Returns { user_id = "42" }.

CPU cost: ~30 ns.

Step 8 — Deadline enforcement. gRPC server registers a deadline-expired callback. From grpc-timeout: 500m, server knows it must respond within 500 ms minus elapsed. If the handler hasn't called onCompleted() by then, server sends RST_STREAM and closes with grpc-status: DEADLINE_EXCEEDED (4).

Step 9 — Handler. Application code runs. Hits a database, builds UserProfile.

Step 10 — Response path. Mirror in reverse. Server emits HEADERS (:status 200, content-type), DATA (length-prefixed Protobuf), and finally TRAILERS with grpc-status: 0 and END_STREAM=1. Trailers (not initial HEADERS) carry status because streaming RPCs may emit many DATA frames before knowing the final outcome.

Step 11 — Client receives. HTTP/2 codec routes frames to stream 5. DATA goes to gRPC framing; framing parses length-prefix, hands bytes to UserProfile.parseFrom. Trailers arrive: grpc-status: 0. Stub returns the message.

End-to-end: ~1.3 ms p50 if everything is hot. Most of that is the database lookup inside the handler — the framework adds ~250 µs at p99 for a small message.

The same call over REST + JSON would: serialize JSON at ~10x the CPU, send HTTP/1.1 request-line + headers in ASCII (~250 bytes versus ~50 compressed), occupy one TCP connection for the duration (no multiplexing), require you to bake your own deadline header convention (no grpc-timeout), and lose the trailer-status pattern so a mid-stream error has no clean way to be reported.

§5. Capacity Envelope Across Scales

RPC fabrics span roughly seven orders of magnitude, from ~1k QPS at a startup to ~10B RPCs/sec at the largest fleets. Different scales hit different bottlenecks; the same technology serves all of them but with different topologies.

Scale tier	Typical deployment	RPS / fan-out	Bottleneck	Topology
Small (startup)	~10 services, ~1k QPS total	~100 QPS per service, fan-out depth 2	nothing real	direct HTTP/1.1 + JSON, load balancer in front, no mesh
Mid (Stripe-like)	~100 services, ~50k QPS internal	1k–10k QPS per hot service, fan-out depth 3–5	retry storms, connection pool exhaustion	gRPC adopted for hot paths; client-side LB; per-team retry policies (drifting); idempotency keys on the public API
Large (Netflix-like)	~1000 services, ~5M QPS	500k QPS per hot service, fan-out depth 5–8	cascading failure; sidecar overhead becomes real	service mesh (Envoy + control plane), circuit breakers (Hystrix → resilience4j), schema registry, breaking-change CI
Giant (Google internal)	~10000+ services, ~10B RPC/sec	1M+ QPS per hot service, fan-out depth 10+ (one search query touches ~1000 backends)	sidecar CPU rivals app CPU; control-plane scalability	ambient mesh / proxyless gRPC, hierarchical service catalog, region-aware routing, hedged requests, dedicated schema-review org

Specific deployments anchoring the range:

Small / mid: Stripe's public API runs REST + JSON over standard HTTP/2; their internal services are partly Ruby on Rails, partly Go services talking over gRPC and Twirp. Tens of thousands of internal RPCs per second per service. Stripe famously made Idempotency-Key first-class in their public API — the textbook example of "retries safe on writes."
Mid / large: Netflix's home-page render makes 100+ internal RPC calls per device session. At 250M members and ~1B daily streaming starts, that's several million RPCs/sec across the mesh. Hystrix (now resilience4j) was born here; the original retry-storm post-mortems are the canonical reference for circuit breakers.
Large: LinkedIn's Rest.li framework serves ~10M QPS across ~100k service instances. The Espresso storage layer alone is reached at ~3M QPS over Rest.li. D2 (Dynamic Discovery) handles client-side LB. Rest.li started as REST-flavored over HTTP/1.1 and is moving to gRPC for internal hot paths. Mid-tier services like the feed mid-tier and profile mid-tier individually sustain 500k+ QPS.
Large / giant: Uber's internal mesh runs Apache Thrift over TChannel transport (and YARPC, their RPC framework on top). ~10M QPS internal. A single trip-request fans out to ~30 sync services plus ~40 async, p99 budget 300 ms.
Giant: Google's internal RPC volume is ~10 billion RPCs/sec over Stubby (gRPC's internal ancestor). A single search query fans out to ~1000 backends, each call deadline-bounded at 10–50 ms. They run a proxyless mesh (gRPC libraries consume xDS directly and there is no sidecar) because sidecar CPU at that scale would be a separate cost line in the billions.
Giant (different shape): Meta runs Thrift end-to-end. Volume not publicly disclosed but assumed similar order to Google. Thrift's typed-exception model is preferred there over gRPC's flat status codes.

Per-connection capacity (the numbers a Staff engineer should know):

HTTP/1.1: 1 in-flight request per connection (pipelining is dead in practice). For 10k QPS at 50 ms latency, Little's Law gives 500 concurrent in-flight = 500 sockets per client. 1000 clients × 500 = 500k sockets on the server — exhausts ephemeral port ranges and file descriptors fast.
HTTP/2: default SETTINGS_MAX_CONCURRENT_STREAMS = 100 (servers often negotiate 1000+). One TCP connection carries 100–1000 concurrent RPCs interleaved. Same 10k QPS / 50 ms: 500 in-flight / 100 streams per conn = 5 connections per client. 1000 clients × 5 = 5000 sockets — two orders of magnitude better.
HTTP/3 (QUIC): same multiplexing math as HTTP/2, but streams are multiplexed inside QUIC, which runs over UDP, so a lost packet stalls only the streams whose data it carried rather than the whole connection. Tail-latency advantage in lossy networks (mobile, cross-region).
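The socket arithmetic above follows directly from Little's Law. A small sketch, with hypothetical function names and the same inputs as the text (10k QPS, 50 ms latency, 1000 clients):

```python
import math


def in_flight(qps: int, latency_ms: int) -> float:
    # Little's Law: concurrency = arrival rate x time in system.
    return qps * latency_ms / 1000


def connections_per_client(qps: int, latency_ms: int, streams_per_connection: int) -> int:
    return math.ceil(in_flight(qps, latency_ms) / streams_per_connection)


CLIENTS = 1000
http1 = connections_per_client(10_000, 50, streams_per_connection=1)
http2 = connections_per_client(10_000, 50, streams_per_connection=100)
print(http1, http1 * CLIENTS)  # 500 500000
print(http2, http2 * CLIENTS)  # 5 5000
```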

Latency budget (one intra-DC unary call, small message):

Stage	p50 (µs)	p99 (µs)
Client Protobuf encode (1 KB)	20	100
Client kernel write	5	30
Network in-DC (NIC → switch → NIC)	100	300
Server kernel read	5	30
HTTP/2 frame decode	5	20
Server Protobuf decode	30	150
Handler logic (DB lookup etc.)	1000	3000
Response path mirror	165	630
Total	~1.3 ms	~4.3 ms

The handler dominates. RPC framework overhead is ~250 µs at p99 for a small message. Doubling framework efficiency saves ~100 µs of a 4 ms budget — relevant at fleet scale (CPU and bandwidth) but not in any individual call's latency.
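The table's totals can be recomputed to see how much of the budget the handler takes. The sketch below copies the per-stage numbers from the table and, like the table, simply adds the p99 values stage by stage, which is the table's own simplification rather than a true end-to-end p99.

```python
# (p50_us, p99_us) per one-way stage, copied from the latency budget table.
ONE_WAY_STAGES = {
    "client Protobuf encode": (20, 100),
    "client kernel write": (5, 30),
    "network in-DC": (100, 300),
    "server kernel read": (5, 30),
    "HTTP/2 frame decode": (5, 20),
    "server Protobuf decode": (30, 150),
}
HANDLER = (1000, 3000)
P50, P99 = 0, 1


def total_us(percentile: int) -> int:
    one_way = sum(stage[percentile] for stage in ONE_WAY_STAGES.values())
    # Request path + handler + response path (the "mirror" row).
    return 2 * one_way + HANDLER[percentile]


def handler_share(percentile: int) -> float:
    return HANDLER[percentile] / total_us(percentile)


print(total_us(P50), total_us(P99))     # 1330 4260
print(round(handler_share(P50), 2))     # 0.75
print(round(handler_share(P99), 2))     # 0.7
```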

Serialization CPU at fleet scale:

Protobuf encode: ~2–5 ns/byte (generated code, no reflection). 1 KB ≈ 2–5 µs.
JSON (Jackson): ~30–80 ns/byte. 1 KB ≈ 30–80 µs.
At 1M RPCs/sec with 1 KB messages: JSON costs ~30–80 cores just for parsing; Protobuf costs ~2–5 cores. At the LinkedIn / Netflix scale, that's the difference between a $1M/yr and $10M/yr serialization bill.
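The core counts follow from multiplying out the ns/byte figures. A sketch of that arithmetic, assuming 1 KB means 1000 bytes and that one core delivers one second of CPU time per second (the function name is hypothetical):

```python
def cores_needed(rpcs_per_sec: int, message_bytes: int, ns_per_byte: float) -> float:
    # CPU-seconds consumed per wall-clock second equals the number of busy cores.
    return rpcs_per_sec * message_bytes * ns_per_byte / 1e9


RPS, SIZE = 1_000_000, 1000
print(cores_needed(RPS, SIZE, 30), cores_needed(RPS, SIZE, 80))  # 30.0 80.0  (JSON)
print(cores_needed(RPS, SIZE, 2), cores_needed(RPS, SIZE, 5))    # 2.0 5.0    (Protobuf)
```

This counts one serialization pass per RPC; a call that encodes and decodes on both sides pays it several times over, so the absolute numbers scale up while the ratio stays the same.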

Connection-pool exhaustion math (the hidden killer that recurs in every fan-out):

Service B normally responds in 10 ms. Service A has a 200-connection pool and sends 5,000 QPS to B. Little's Law: in-flight = QPS × latency = 5000 × 0.010 = 50 in-flight, well under the pool limit. B degrades to 1,000 ms (not down, just slow). In-flight needed = 5,000 × 1.0 = 5,000. Pool is 200. At 5,000 new calls per second the pool's 200 connections are all checked out within ~40 ms (200 / 5,000); from then on every A thread blocks on pool acquisition, A stops responding to its callers, and A is now degraded. The wave propagates outward. We come back to this in §13.
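The pool arithmetic is worth keeping as a reusable check. This sketch uses the same numbers as the paragraph above; the function names are hypothetical, and the fill-time estimate assumes no call completes while the pool is filling (true here, since 40 ms is far below the degraded 1,000 ms latency).

```python
def in_flight(qps: int, latency_ms: int) -> float:
    # Little's Law: concurrent calls = arrival rate x latency.
    return qps * latency_ms / 1000


def pool_fill_time_ms(pool_size: int, qps: int) -> float:
    # Time until every connection is checked out if nothing returns meanwhile.
    return pool_size * 1000 / qps


def sustainable_qps(pool_size: int, latency_ms: int) -> float:
    # Highest throughput the pool can carry at a given downstream latency.
    return pool_size * 1000 / latency_ms


POOL, QPS = 200, 5_000
print(in_flight(QPS, 10))             # 50.0   healthy: well under the pool size
print(in_flight(QPS, 1000))           # 5000.0 degraded: 25x the pool size
print(pool_fill_time_ms(POOL, QPS))   # 40.0
print(sustainable_qps(POOL, 1000))    # 200.0  the other 4,800 calls/sec queue up
```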

§6. Architecture in Context

The canonical RPC fabric in a modern microservices architecture looks like this — regardless of whether the apps are gRPC, Thrift, or REST:

                           ┌─────────────────────────────────────────────────┐
                           │  CONTROL PLANE                                   │
                           │  ┌──────────────┐   ┌────────────────────────┐  │
                           │  │ Service      │   │ xDS / config server    │  │
                           │  │ Registry     │──▶│ (Istio Pilot,          │  │
                           │  │ (Consul/     │   │  Envoy Gateway, etc.)  │  │
                           │  │  etcd/k8s)   │   │  pushes routes, mTLS   │  │
                           │  └──────────────┘   │  certs, retry/CB cfg   │  │
                           │                     └────────┬───────────────┘  │
                           └──────────────────────────────┼──────────────────┘
                                                          │ xDS
                                                          │ (ADS streams)
                                                          ▼
   ┌──────────────────────────────────┐         ┌──────────────────────────────────┐
   │  CLIENT POD (Service A)          │         │  SERVER POD (Service B)          │
   │  ┌────────────────────────────┐  │         │  ┌────────────────────────────┐  │
   │  │  Application                │  │         │  │  Application                │  │
   │  │  (Java, Go, ...)            │  │         │  │  (gRPC / Thrift handler)    │  │
   │  │                             │  │         │  │                             │  │
   │  │  stub.Method(req,           │  │         │  │  onRequest(...)             │  │
   │  │      deadline=500ms)        │  │         │  │                             │  │
   │  └──────────┬─────────────────┘  │         │  └────────────────────────────┘  │
   │             │ localhost:9090      │         │             ▲                    │
   │             ▼                     │         │             │ localhost:9090     │
   │  ┌────────────────────────────┐  │         │  ┌──────────┴─────────────────┐  │
   │  │ Sidecar proxy (Envoy)      │  │         │  │ Sidecar proxy (Envoy)      │  │
   │  │  • outbound TLS originate  │  │         │  │  • inbound TLS terminate   │  │
   │  │  • client-side LB          │  │         │  │  • authZ check             │  │
   │  │  • retry policy + budget   │  │         │  │  • rate limit              │  │
   │  │  • circuit breaker         │  │         │  │  • trace + metric emit     │  │
   │  │  • outlier detection       │  │         │  │  • access log              │  │
   │  │  • header injection (trace)│  │         │  │                             │  │
   │  └──────────┬─────────────────┘  │         │  └────────────────────────────┘  │
   └─────────────┼────────────────────┘         └──────────────────────────────────┘
                 │                                              ▲
                 │   HTTP/2 over mTLS (TCP:9090)                │
                 │   ┌───────────────────────────────────────┐  │
                 │   │ HEADERS  (stream 5, deadline=500ms)   │  │
                 │   │   ──────────────────────────────────▶ │  │
                 │   │ DATA     (Protobuf-encoded request)   │  │
                 │   │   ──────────────────────────────────▶ │  │
                 │   │ HEADERS  (:status=200)                │  │
                 │   │   ◀────────────────────────────────── │  │
                 │   │ DATA     (Protobuf-encoded response)  │  │
                 │   │   ◀────────────────────────────────── │  │
                 │   │ TRAILERS (grpc-status=0)              │  │
                 │   │   ◀────────────────────────────────── │  │
                 │   └───────────────────────────────────────┘  │
                 │                                              │
                 └──────────────────────────────────────────────┘

Key abstractions, regardless of product:

Sidecar pattern. Each pod runs a proxy (Envoy is the dominant implementation; Linkerd is the lighter-weight alternative) that handles cross-cutting concerns: mTLS, retries, circuit breakers, traffic shaping, observability. Application code is plain gRPC. The sidecar lives on localhost so the network hop is loopback (~5 µs).
Control plane / data plane split. The control plane (Istio Pilot, Consul Connect, Linkerd's controller, AWS App Mesh controllers) pushes configuration — routes, cluster definitions, TLS certs, retry policies, CB thresholds — via xDS, a family of gRPC streaming APIs (LDS=Listeners, RDS=Routes, CDS=Clusters, EDS=Endpoints, SDS=Secrets, ADS=Aggregated). The data plane never asks "where is service B?" — it already knows from its config.
Connection per upstream. A's sidecar maintains 2–4 HTTP/2 connections per backend (typically one per CPU thread), each carrying ~100–1000 multiplexed streams.
Retry boundary. Retries happen at the sidecar, not in app code, so retry policy is uniform fleet-wide and can be globally rate-limited via retry budgets.
Proxyless variant. At Google scale, the sidecar CPU becomes a separate budget line, so gRPC libraries themselves consume xDS directly, with no sidecar process, at the cost of mesh features the library doesn't implement. Open-source adoption is growing through gRPC's xDS-enabled clients.
Ambient mesh variant. Instead of a per-pod sidecar, a shared per-node data plane handles traffic: Cilium does this with eBPF programs in the kernel (plus a per-node Envoy for L7), while Istio Ambient uses a per-node ztunnel proxy for L4 mTLS and optional waypoint proxies for L7. Lower per-pod overhead. Less mature but rapidly improving.

The naïve REST-over-HTTP/1.1 topology — app → load balancer → app, with retries/circuit breakers/mTLS baked into each app's libraries — is what every company starts with and what nobody can sustain past ~50 services. Different teams pick different retry libraries, different timeouts, different auth schemes, and the mesh emerges as a forcing function for uniformity.

§7. Authentication and Service Identity

The first cross-cutting concern an RPC fabric must solve, before it can solve anything else, is "who is calling whom?" At 10 services, every team puts a static API key in a config file and calls it a day. At 10,000 services with quarterly key rotation and SOC 2 audit requirements, that approach fails catastrophically. The modern answer is mTLS (mutual TLS) backed by SPIFFE (Secure Production Identity Framework For Everyone) workload identities, and the depth comes from walking through why JWT-as-service-auth is worse than mTLS at scale.

What mTLS actually is

Standard TLS authenticates the server to the client — the client checks that the server's certificate is signed by a trusted CA (Certificate Authority) and that the certificate's Subject Alternative Name (SAN) matches the hostname being connected to. Mutual TLS adds the reverse: the client also presents a certificate during the TLS handshake, and the server validates it against its own trust root. After the handshake completes, both sides have cryptographically verified each other's identity, and the rest of the connection (HTTP/2 frames, gRPC payloads) flows over the same encrypted channel.

The handshake costs are real (a full TLS 1.3 handshake is one round trip; TLS 1.2 is two), but the cost is amortized over a long-lived HTTP/2 connection that may carry millions of RPCs before being torn down. Per-call overhead from mTLS at steady state is ~0 — the TLS record layer adds ~30 bytes of MAC/padding per record, and AES-NI hardware acceleration makes encryption itself essentially free on modern CPUs.

The hard part isn't the protocol — it's identity provisioning and rotation. Every workload needs a certificate, certificates expire (best practice: 1–24 hours for short-lived; the shorter, the smaller the blast radius of compromise), and revocation must propagate quickly.

SPIFFE and SPIRE: workload identity as infrastructure

SPIFFE (Secure Production Identity Framework For Everyone) is an open standard for workload identity. Its concrete artifacts:

SPIFFE ID: a URI of the form spiffe://trust-domain/path, e.g., spiffe://prod.example.com/ns/payments/sa/balance-service. The trust-domain is your organization's identity boundary; the path encodes whatever you want (namespace, service account, environment).
SVID (SPIFFE Verifiable Identity Document): the cryptographic proof of a SPIFFE ID. Comes in two flavors — X.509-SVID (a standard X.509 certificate with the SPIFFE ID encoded in the SAN) and JWT-SVID (a JWT whose sub claim is the SPIFFE ID, used where TLS termination is impossible).
Trust bundle: the set of CA certs you use to verify SVIDs from a given trust domain. Distributed via the Workload API.
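A SPIFFE ID is an ordinary URI, so a standard URI parser is enough to split it into its trust domain and path. The sketch below is illustrative only: the `/ns/<namespace>/sa/<service-account>` layout is the convention used in the example above, not something SPIFFE itself mandates, and `k8s_identity` is a hypothetical helper name.

```python
from typing import Optional
from urllib.parse import urlsplit


def parse_spiffe_id(spiffe_id: str) -> tuple[str, str]:
    """Split a SPIFFE ID into (trust_domain, path), rejecting malformed input."""
    parts = urlsplit(spiffe_id)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    if parts.query or parts.fragment:
        raise ValueError("SPIFFE IDs carry no query or fragment")
    return parts.netloc, parts.path


def k8s_identity(path: str) -> Optional[dict]:
    """Decode the /ns/<namespace>/sa/<service-account> convention, if the path uses it."""
    segs = path.strip("/").split("/")
    if len(segs) == 4 and segs[0] == "ns" and segs[2] == "sa":
        return {"namespace": segs[1], "service_account": segs[3]}
    return None


domain, path = parse_spiffe_id("spiffe://prod.example.com/ns/payments/sa/balance-service")
print(domain, path, k8s_identity(path))
```

An authorization layer would normally compare the whole ID, or the trust domain plus a path prefix, rather than trusting any single path segment on its own.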

SPIRE is the canonical SPIFFE implementation. SPIRE has two components:

SPIRE Server: the CA. Issues certificates to workloads after attestation.
SPIRE Agent: runs on every node. Workloads ask the agent (via a Unix-domain socket) for their identity; the agent attests them (using Kubernetes Service Account, AWS Instance Metadata, Docker labels, etc.) and returns an SVID.

The attestation step is the magic: SPIRE doesn't trust "the workload says it's the balance service." It verifies, via the orchestrator (Kubernetes), that the calling Unix process is in fact a pod with the balance-service service account, in the payments namespace, on a node that itself passed node attestation. The identity is grounded in the runtime substrate, not in a config file the workload could lie about.

Rotation: SVIDs are typically issued with a 1-hour TTL (Time To Live). The agent renews them in the background well before expiry. Workloads receive a stream of certificates via the Workload API and rotate seamlessly. A compromised SVID is useful for less than an hour.

Istio's automatic mTLS

Istio (the dominant service mesh) implements mTLS transparently between sidecars. The application code uses plain HTTP or gRPC; the sidecar proxies (Envoy) terminate plaintext from the app on localhost and re-encrypt with mTLS over the network. Both sidecars present certificates issued by Istio's CA (Citadel, now part of Istiod). The application never sees the certificate; the application never has to know mTLS exists.

The data plane:

App → Envoy: localhost:9090 plaintext.
Envoy → Envoy: mTLS over the network.
Envoy → App: localhost:9090 plaintext on the receiving side.

The control plane:

Istiod is the CA, signs SVIDs (Istio uses SPIFFE-compatible IDs of the form spiffe://cluster.local/ns/payments/sa/balance-service).
Certificates pushed via SDS (Secret Discovery Service), one of the xDS variants. Envoy receives new certs via gRPC stream, reloads in-memory without process restart.
TTL: 24 hours by default; rotation kicks in at ~50% of TTL.

The end-state: every pod-to-pod TCP connection is mTLS-encrypted and mutually-authenticated, and no application engineer wrote a single line of TLS code.

Why JWT-as-service-auth is worse than mTLS at scale

Many companies have a legacy pattern where service A calls service B with Authorization: Bearer <JWT> in the HTTP header, signed by a shared secret or a central auth service. This is worse than mTLS at scale, and you should be able to enumerate why:

Bearer tokens are stealable. A JWT in a header is the proverbial cookie — anyone with the bytes can replay it. mTLS requires the private key, which never leaves the pod. A compromised log line containing a JWT compromises the caller; a compromised log line with TLS record bytes is useless without the session keys.
Per-call validation cost. Validating a JWT means parsing the JSON, verifying the signature (RSA or ECDSA, ~50–200 µs per call), checking the expiry, possibly hitting a key-rotation endpoint. mTLS does this once per connection, then nothing per call. At fleet scale (millions of RPCs per second per service), JWT validation can consume 5–10% of CPU; mTLS at steady state is ~0%.
Token TTLs are stuck between bad and worse. Short TTL (5 minutes) means callers constantly refresh from the auth service, which is now a critical-path dependency for every RPC. Long TTL (1 day) means a stolen token is usable all day. mTLS sidesteps this by binding identity to a key pair on the host.
No mutual authentication by construction. The server validates the JWT; the client validates the server's TLS cert. The protocols are asymmetric. mTLS is symmetric — both sides cryptographically proven.
No transport-layer encryption guarantee. A JWT in a header over plain HTTP is exposed. mTLS encrypts the whole channel.
Token propagation across hops is error-prone. If service A→B→C, does B forward A's JWT, or use its own? If B forwards A's, then a compromised B can replay it. If B uses its own, you've lost the original caller's identity. With mTLS, identity is per-hop by construction, and the original-caller identity propagates in a separate, signed, narrow-scoped header (see §11 on header propagation).
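The per-call validation cost in the list above can be sanity-checked with back-of-envelope arithmetic. The sketch assumes, purely for illustration, that each core handles 1,000 RPCs per second; the per-validation costs are the ~50–200 µs range quoted above.

```python
def validation_cpu_fraction(rpcs_per_sec_per_core: float, validation_us: float) -> float:
    """Fraction of one core spent on per-call token validation."""
    return rpcs_per_sec_per_core * validation_us / 1_000_000


# Assumption for illustration: each core handles 1,000 RPCs/sec.
RPCS_PER_CORE = 1_000

for cost_us in (50, 100, 200):
    share = validation_cpu_fraction(RPCS_PER_CORE, cost_us)
    print(f"{cost_us} µs per validation -> {share:.0%} of a core")
```

Under that assumption, 50 to 100 µs per validation works out to 5 to 10% of a core, which is where the fleet-level figure comes from. A connection-level handshake pays its cost once and then contributes nothing per call.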

The reasonable place for JWT in this fabric is the user identity that propagates through the call tree (the human's session), not the service-to-service identity. mTLS authenticates the workload; JWT carries who the workload is acting on behalf of.

§8. Authorization Policy

Authentication answers "who are you?" Authorization answers "what are you allowed to do?" Even when you've solved identity with mTLS + SPIFFE, you still face the question: service A is calling BalanceService.Debit(account=X, amount=$50) — is service A allowed to call that method on that account? The answer is a policy decision, and the modern pattern is to externalize policy from application code.

The "I have a JWT, what can I do?" question

The naïve pattern: every service implements its own authorization checks in its own handlers. if (caller_role == "admin") allow(); scattered across 800 services in 8 languages. Six months later, nobody knows what the global authorization rules are, three services have subtly different definitions of "admin," and a compliance audit demands a single source of truth. This is the authorization sprawl anti-pattern.

The fix: a policy decision point (PDP) evaluates rules; policy enforcement points (PEPs) in services (or sidecars) consult the PDP. The policy is declarative, versioned, and reviewable.

OPA (Open Policy Agent)

OPA (Open Policy Agent) is the de-facto open standard for general-purpose policy. Its design:

Rego: a Datalog-flavored declarative policy language. Rules describe what is allowed (or what is denied) based on inputs.
Sidecar pattern: OPA runs as a sidecar (or local process) alongside each service. The service makes an authorization decision by calling OPA on localhost with a JSON document describing the request. OPA returns allow: true/false plus optional context.
Bundle distribution: policies (Rego files plus data) are packaged into bundles and distributed from a central server. OPA polls or streams updates. Policy is versioned, signed, and rolled out like a deployment.
Decision logging: OPA logs every decision (input + result) to a sink. Auditors get a complete trail.

Sketch of a Rego rule:

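package payments.authz

import rego.v1  # needed on OPA releases before 1.0; a no-op on 1.x

# Deny unless the rule below matches.
default allow := false

# Every condition in the body must hold (logical AND).
allow if {
    input.method == "BalanceService.Debit"
    input.caller.spiffe_id == "spiffe://prod.example.com/ns/payments/sa/checkout"
    input.request.amount <= 10000
    input.account.owner == input.user.id
}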

The service (or its sidecar) constructs the input document from request metadata (the caller's SPIFFE ID, the RPC method, the request fields, the resolved user) and asks OPA. OPA evaluates the Rego policy against the input, entirely in memory, and for a simple rule like this one typically returns the decision in well under a millisecond.
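To make the input document concrete, here is a minimal sketch of how a service might assemble it, together with a plain-Python mirror of the Rego rule above. The mirror exists only to show what the rule evaluates; a real deployment would send `{"input": ...}` to OPA (for example, as the JSON body of a request to its Data API) rather than re-implement policy in application code. `build_input` and `allow` are illustrative names.

```python
import json

CHECKOUT_ID = "spiffe://prod.example.com/ns/payments/sa/checkout"


def build_input(method: str, caller_spiffe_id: str, amount: int,
                account_owner: str, user_id: str) -> dict:
    """Assemble the document that the Rego rule reads as `input`."""
    return {
        "method": method,
        "caller": {"spiffe_id": caller_spiffe_id},
        "request": {"amount": amount},
        "account": {"owner": account_owner},
        "user": {"id": user_id},
    }


def allow(inp: dict) -> bool:
    """Plain-Python mirror of the Rego rule, for illustration only."""
    return (
        inp["method"] == "BalanceService.Debit"
        and inp["caller"]["spiffe_id"] == CHECKOUT_ID
        and inp["request"]["amount"] <= 10000
        and inp["account"]["owner"] == inp["user"]["id"]
    )


doc = build_input("BalanceService.Debit", CHECKOUT_ID, 50, "user-42", "user-42")
print(json.dumps({"input": doc}, indent=2))  # the shape of the body sent to OPA
print("allow:", allow(doc))
```

Note that every field in the input comes from something the caller or the edge has already authenticated; the policy is only as trustworthy as the document it is handed.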

The win: policy is one artifact, in one language, reviewed by security engineers, deployed independently of the services it governs. A new rule (e.g., "no service may debit more than $10K without an explicit override") is one PR to the policy repo, not 800 PRs to 800 services.

Cedar (AWS): a lighter alternative

Cedar is AWS's newer policy language, used in Amazon Verified Permissions and AWS Verified Access. Design choices that differ from Rego:

Statically analyzable: Cedar policies can be checked for "is this policy ever satisfiable?" and "is policy A strictly more permissive than policy B?" Rego is deliberately not Turing-complete either, but it is far more expressive and correspondingly harder to reason about formally.
Schema-aware: Cedar policies reference a schema of entities (User, Account, Resource) and actions (Read, Debit, Transfer). The policy language enforces that you only reference existing types.
Performance: Cedar is designed for sub-millisecond evaluation in the AWS authorization plane.

Cedar is lighter and more constrained than Rego — good for "who can do what to which resource" authorization, less good for arbitrary policy beyond authorization. For a typical RPC fabric, either works.

Centralized vs distributed authorization

Two deployment models:

Mode	Architecture	Trade-off
Centralized PDP	One (HA) authz service. Every authz check is a network call to it.	Single source of truth for policy state. Coupled latency (every call adds ~1 ms). PDP becomes a critical-path dependency.
Distributed PDP (sidecar)	OPA runs alongside each service. Policy bundle pushed to every instance. Decisions are local.	Sub-millisecond decisions. Policy updates have propagation lag (seconds). Each PDP has the full policy set.

Production at scale almost always picks distributed — local OPA sidecars, central bundle server. Policy propagation lag (seconds) is acceptable; per-call latency is not.

Where the PEP lives

Three plausible locations for the policy enforcement point:

In the application handler. The application explicitly calls opa.check(request) before doing work. Pro: fine-grained, business-context-aware. Con: every developer has to remember.
In the sidecar. Envoy's ext_authz filter calls OPA per-request before forwarding to the app. Pro: enforced uniformly without app cooperation. Con: only sees what's in the request metadata; can't enforce business rules that depend on database state.
Both, layered. Sidecar enforces coarse-grained (this service can call this method); application enforces fine-grained (this user can read this row).

The Staff-level answer is "layered" — coarse-grained authorization at the sidecar protects against a compromised service calling methods it shouldn't; fine-grained authorization in the handler protects against well-formed-but-unauthorized business operations.

§9. Observability and Distributed Tracing

When a user request fans out across 50 services and the p99 latency spikes, "which hop is slow?" is the question that defines incident response. The answer comes from distributed tracing, and the modern standard is W3C Trace Context propagated via OpenTelemetry instrumentation, with sampling decisions taken at multiple layers.

What a trace is

A trace is a tree of spans. A span represents one unit of work (one RPC call, one database query, one expensive computation). Each span has:

trace_id (16 bytes, hex-encoded as 32 chars): identifies the whole trace tree.
span_id (8 bytes, hex-encoded as 16 chars): identifies this specific span.
parent_span_id: links to the calling span.
start_time, end_time, duration.
attributes: key-value tags (http.method, db.statement, grpc.code, user.id).
events: timestamped log lines within the span.
status: OK / ERROR / UNSET.

When a request enters the edge, it gets a fresh trace_id. Every downstream RPC creates a new span with a new span_id but inherits the trace_id and points to its parent. The result is a tree; rendered as a flame graph, it shows exactly which leaves consumed the budget.

W3C Trace Context: the `traceparent` header

The W3C Trace Context specification standardizes how trace IDs propagate over HTTP. The relevant header:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             │  │                                │                │
             │  │                                │                └── trace-flags (sampled=1)
             │  │                                └── parent-span-id
             │  └── trace-id (16 bytes)
             └── version

A second header, tracestate, carries vendor-specific data (tracestate: dd=s:1;p:xyz,opentelemetry=...). Multiple tracing systems coexist on the same request by writing into their own slot of tracestate.
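Parsing the header is mechanical. The following is a minimal sketch of a strict parser for the version-00 layout shown above; the class and function names are illustrative, and a production parser would also need the specification's rules for tolerating future versions.

```python
import re
from typing import NamedTuple

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: int

    @property
    def sampled(self) -> bool:
        """Lowest bit of trace-flags is the sampled flag."""
        return bool(self.flags & 0x01)


def parse_traceparent(header: str) -> TraceParent:
    m = _TRACEPARENT.match(header.strip())
    if not m:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = m.groups()
    # The spec forbids version ff and all-zero trace-id / parent-id values.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError(f"invalid traceparent: {header!r}")
    return TraceParent(version, trace_id, parent_id, int(flags, 16))


tp = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(tp.trace_id, tp.parent_id, tp.sampled)
```

On a parse failure the usual behavior is to start a fresh trace rather than fail the request: tracing should never be able to break the call it is observing.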

B3 propagation is the legacy format from Zipkin. Header names: X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId, X-B3-Sampled. Older systems (Brave, Finagle, Linkerd-1) still emit B3. Modern systems emit W3C. Bridges exist; the Staff-level reality is "your mesh emits both for a transition window."

OpenTelemetry: the unified instrumentation API

OpenTelemetry (OTel) is the umbrella project unifying tracing, metrics, and logs under a single API and SDK. It replaced the earlier rivals OpenTracing and OpenCensus (which merged into OTel in 2019). The architecture:

API: language-agnostic surface that application code calls (tracer.startSpan(...)). Same shape in every language.
SDK: implements the API. Configures samplers, processors, exporters.
Exporters: serialize spans/metrics to a backend (Jaeger, Zipkin, Tempo, Datadog, Honeycomb, X-Ray, Cloud Trace).
OTLP (OpenTelemetry Protocol): the wire format. Protobuf messages carried over gRPC or HTTP. Most backends accept OTLP natively.
Collector: a sidecar or daemon that receives OTLP from applications, applies processing (sampling, redaction, batching), and forwards to backends.

The key idea: instrument once against the OTel API, swap backends without touching application code. Before OTel, every backend (Jaeger, Zipkin, Datadog, ...) had its own client library; switching backends meant rewriting instrumentation across the fleet.

Span propagation across async boundaries

The hard case: service A produces a Kafka message; service B consumes it 10 seconds later. The user's request to A has completed; the trace has "ended" in A's view. But B's processing of that message is logically part of the same workflow. How do you stitch them together?

Kafka headers carry trace context. When A produces, it injects traceparent (and tracestate) into the Kafka message headers. When B consumes, it extracts them and creates a new span as a child of the original, even though seconds (or, for a backlogged consumer, hours) have passed and the original trace was "closed." OpenTelemetry has specific instrumentation for Kafka, RabbitMQ, SQS, etc. that does this automatically.
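The inject/extract pair is small enough to sketch without any Kafka client. Here a message is modeled as a plain dict whose headers are a list of `(str, bytes)` pairs, which is the shape common Python Kafka clients use; the helper names and the dict-based message are assumptions for illustration, not a real client API.

```python
import secrets


def new_span_id() -> str:
    return secrets.token_hex(8)  # 8 random bytes, 16 hex chars


def inject(headers: list, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Producer side: write the current span's context into the message headers."""
    value = f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
    headers.append(("traceparent", value.encode("ascii")))


def extract(headers: list):
    """Consumer side: recover (trace_id, parent_span_id, flags), or None if absent."""
    for key, value in headers:
        if key == "traceparent":
            _version, trace_id, parent_id, flags = value.decode("ascii").split("-")
            return trace_id, parent_id, flags
    return None


# Service A: produce a message inside an active span.
trace_id = secrets.token_hex(16)
producer_span_id = new_span_id()
message = {"value": b"order-created", "headers": []}
inject(message["headers"], trace_id, producer_span_id)

# Service B: consume later and continue the same trace.
ctx = extract(message["headers"])
consumer_span = {
    "trace_id": ctx[0],          # same trace as the producer
    "parent_span_id": ctx[1],    # points back at the producer's span
    "span_id": new_span_id(),
}
print(consumer_span)
```

The elapsed time between produce and consume never appears in the headers; it shows up only as the gap between the producer span's end and the consumer span's start.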

Within a service, async propagation across thread boundaries is solved by language-specific mechanisms:

Go: context.Context carries the active span; every async call must thread the context through. The classic "we lost the span at hop 4" bug is forgetting to pass context.
Java: ThreadLocal carries the span by default; switching to a different thread (CompletableFuture, reactor) requires explicit propagation. Frameworks like Reactor have a Hooks.enableAutomaticContextPropagation() to do this.
Node.js: AsyncLocalStorage carries it. Promises and async/await are well-instrumented; raw callbacks are not.

Sampling: where to drop spans

Tracing every request at scale is infeasible. A 10M-QPS service emitting a span per RPC produces 10M spans/sec; at roughly 1 KB per span that is ~10 GB/sec, or close to a petabyte per day, bigger than most company-wide telemetry budgets combined. Therefore: sample.

Strategy	Where it runs	Trade-off
Head-based sampling (SDK level)	The first service to see the request flips a coin (e.g., 1% sampled). The decision travels in the sampled bit of `trace-flags`, and every downstream service respects it.	Cheap. But you've decided to drop before knowing whether the request will be interesting (slow, error). Most slow requests are dropped.
Head-based sampling (collector level)	The collector receives every span, then drops a fixed fraction (typically keyed on trace_id so a trace is kept or dropped whole) before forwarding to the backend.	Slightly more flexibility (per-service rates). Still drops "before knowing it was interesting."
Tail-based sampling (backend level)	Backend buffers all spans for a few seconds. After the trace is complete, applies rules: "keep all error traces, keep the slowest 1%, sample 0.1% of the rest."	Catches the interesting traces. Requires keeping every span in memory for the buffering window — expensive. Common in commercial backends (Honeycomb, Datadog, Lightstep).

Production at scale typically uses head-based with high probability on errors ("if I see a 5xx, set traceflags=01 so downstream all sample") plus tail-based for slow-trace capture in the backend.

The synthesis: "head-based throws away the slow requests; tail-based catches them at the cost of buffering memory; pick based on whether your goal is volume reduction or signal preservation."
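The two strategies differ mainly in what information is available when the decision is made, which a short sketch makes visible. Everything here is illustrative: the span dictionaries, the function names, and the fixed latency threshold (standing in for "the slowest 1%", which would need a running percentile) are assumptions.

```python
def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide from the trace_id alone, before any work has happened.

    Deterministic, so every service that sees this trace_id reaches the same answer.
    """
    return int(trace_id[-16:], 16) < rate * 2**64


def tail_keep(spans: list, slow_ms: float, baseline_rate: float) -> bool:
    """Tail-based: decide after the whole trace has been buffered."""
    if any(s["status"] == "ERROR" for s in spans):
        return True                                   # keep every error trace
    root = next(s for s in spans if s["parent_span_id"] is None)
    if root["duration_ms"] >= slow_ms:
        return True                                   # keep slow traces
    return head_sample(root["trace_id"], baseline_rate)  # thin out the rest


trace = [
    {"trace_id": "f" * 32, "span_id": "a" * 16, "parent_span_id": None,
     "status": "OK", "duration_ms": 1800.0},
    {"trace_id": "f" * 32, "span_id": "b" * 16, "parent_span_id": "a" * 16,
     "status": "OK", "duration_ms": 1750.0},
]

print("head-based keeps it:", head_sample(trace[0]["trace_id"], 0.01))
print("tail-based keeps it:", tail_keep(trace, slow_ms=1000.0, baseline_rate=0.001))
```

`head_sample` never sees `duration_ms` or `status`, which is exactly why it discards most slow requests; `tail_keep` sees both, but only because something buffered every span of the trace first.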

Metrics and logs alongside traces

OpenTelemetry also unifies metrics and logs. The pattern:

Metrics: RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors). Per-RPC-method counter and histogram. Exported as OTLP or Prometheus.
Logs: structured log lines auto-correlated to the active span (the SDK injects trace_id and span_id into log records).

The three signals (metrics, traces, logs) are now correlated by trace_id. Incident response: "p99 latency on BalanceService.Debit spiked → find the slow trace → see which span is slow → see the logs from that span." This is the post-Google SRE Book observability model.

§10. Rate Limiting at the RPC Layer

Authentication says "who you are." Authorization says "what you can do." Rate limiting says "how often you can do it." It is the fabric's defense against abusive callers, runaway loops, and noisy-neighbor tenants.

Local vs distributed rate limits

Two flavors:

Local (per-instance) rate limit. Each server instance tracks a token bucket per caller. If a server has 100 replicas, the global rate limit is 100x whatever each instance allows. Simple, but the "global rate" is hard to reason about and drifts as replicas scale.
Distributed (global) rate limit. All replicas share a counter, typically backed by Redis or a dedicated service. Every call increments a counter; if over the limit, reject. Single source of truth, but adds latency and a new dependency.

Token bucket: the canonical algorithm

The token bucket: a bucket holds up to B tokens. Tokens refill at rate R per second. Every call consumes one token. If the bucket is empty, the call is rejected (or queued).

Burst capacity: B controls how big a burst you tolerate. R = 100/sec, B = 200 means sustained 100/sec but a sudden burst of 200 is allowed.

Redis implementation: a single Redis key per (caller, route). A Lua script atomically: (a) computes how many tokens have refilled since the last call (based on stored timestamp), (b) tops up the bucket to at most B, (c) checks if there's at least one token, (d) decrements and returns success, or rejects. Atomic via Lua + Redis single-threaded execution.
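Steps (a) through (d) are easier to follow in code. The sketch below is an in-memory, single-process version of the same logic, not the Lua script itself; in the Redis variant the `tokens` and `last` fields would live in the Redis key and the whole `allow` body would run atomically on the server. The class name and the injectable `now` parameter are illustrative choices.

```python
import time
from typing import Optional


class TokenBucket:
    def __init__(self, rate: float, burst: float, now: Optional[float] = None):
        self.rate = rate                  # R: tokens added per second
        self.burst = burst                # B: bucket capacity
        self.tokens = float(burst)        # start full
        self.last = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last)
        # (a) + (b): refill for the elapsed time, capped at B.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        # (c): is there at least one token?
        if self.tokens >= 1.0:
            self.tokens -= 1.0            # (d): spend it and admit the call
            return True
        return False                      # bucket empty: reject


bucket = TokenBucket(rate=100, burst=200, now=0.0)
admitted = sum(bucket.allow(now=0.0) for _ in range(250))
print("admitted from a 250-call burst:", admitted)
```

Passing `now` explicitly keeps the logic testable and mirrors the Redis version, where the timestamp is supplied to (or read inside) the script rather than taken from each caller's local clock.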

Latency cost: ~0.5–1 ms per Redis round trip. At 1M QPS, the Redis cluster needs careful sharding (shard by caller key) and the network has to be reliable. Most fabrics deploy a Redis cluster per region to keep the round trip in-region.

Envoy's ratelimit service: the sidecar enforces global rate

Envoy doesn't implement distributed rate limiting itself; it integrates with an external service via the envoy.filters.http.ratelimit filter. The pattern:

Request arrives at server's sidecar (Envoy).
Envoy makes a gRPC call to the rate-limit service with a descriptor — e.g., [("caller_spiffe_id", "spiffe://prod/ns/payments/sa/checkout"), ("route", "BalanceService.Debit")].
Rate-limit service consults its config (per-descriptor limits and intervals) and Redis for current counts.
Returns OK or OVER_LIMIT.
Envoy forwards the request or rejects with 429 Too Many Requests (or RESOURCE_EXHAUSTED for gRPC).

The reference implementation is envoyproxy/ratelimit (Lyft's open-source service). Lyft runs it at hundreds of thousands of QPS in production.

The win: the application never implements rate limiting. It's a config in the mesh:

domain: payments-mesh          # hypothetical domain name
descriptors:
  - key: caller_spiffe_id
    value: spiffe://prod/ns/marketing/sa/email-blast
    rate_limit:
      unit: second
      requests_per_unit: 100
A new caller? Update the config. A burst? The config changes are versioned and reviewed.

Per-method, per-priority, per-tenant limits

At scale, the dimensions multiply:

Per-method: BalanceService.Debit is more expensive than BalanceService.GetBalance; limit them differently.
Per-priority: critical-path callers (the checkout flow) get higher quotas than batch jobs (the nightly reconciliation).
Per-tenant: in a multi-tenant SaaS, customer A's noisy retry loop must not starve customer B. Tenant ID becomes part of the rate-limit descriptor.

Stripe famously rate-limits per merchant account at the public API edge, so one merchant's misbehaving integration cannot degrade another merchant's traffic. The rate-limit descriptor is (merchant_id, endpoint), and the limit is per merchant. Public APIs commonly surface the limit to the caller via X-RateLimit-Limit / X-RateLimit-Remaining / X-RateLimit-Reset response headers (see §18).

Rate limit vs admission control vs load shedding

Easy to conflate; distinct primitives:

Rate limit: based on a caller's quota. Reject if caller has exceeded their share.
Admission control: based on the server's instantaneous load. Reject if the server is at, say, 90% CPU or queue depth N. Has no notion of "fair share."
Load shedding: drop the lowest-priority work when overloaded. Requires priority tagging on requests.

Production deploys all three layered. Rate limit catches abusive callers before they reach the server. Admission control catches load the rate limit didn't predict (a thundering herd of nominally-compliant callers). Load shedding ensures the most important work survives.
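The layering can be sketched as a single decision function. The thresholds, the integer priority scheme (higher means more important), and the result strings are all hypothetical; the point is only which input each check looks at.

```python
def decide(caller_over_quota: bool, queue_depth: int, priority: int,
           soft_limit: int = 80, hard_limit: int = 100, shed_below: int = 2) -> str:
    # 1. Rate limit: about this caller's quota, independent of server load.
    if caller_over_quota:
        return "RATE_LIMITED"
    # 2. Admission control: about server load, blind to who is calling.
    if queue_depth >= hard_limit:
        return "OVERLOADED"
    # 3. Load shedding: under pressure, drop low-priority work first.
    if queue_depth >= soft_limit and priority < shed_below:
        return "SHED"
    return "ADMIT"


print(decide(caller_over_quota=True, queue_depth=10, priority=5))    # quota exceeded
print(decide(caller_over_quota=False, queue_depth=120, priority=5))  # server saturated
print(decide(caller_over_quota=False, queue_depth=90, priority=1))   # low priority under pressure
print(decide(caller_over_quota=False, queue_depth=90, priority=5))   # important work survives
```

Only the first check needs per-caller state, which is why it can run in a sidecar or at the edge; the other two need the server's own load signal and therefore run in or next to the server.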

§11. Header Propagation

A user request entering the edge will fan out across 5–10 hops before it returns an answer. Along the way, the downstream services need to know things about the original context: who the user is, which tenant they belong to, which trace they're part of, what their session ID is. Header propagation is the discipline of keeping that context alive across hops, and the canonical failure mode is "we lost the user context at hop 4."

What gets propagated

Common headers on every internal RPC:

Header	Purpose	Set by	Read by
`traceparent`, `tracestate`	W3C trace context	Tracing SDK / sidecar	Tracing SDK / collector
`x-request-id`	Unique request identifier (one per user request)	Edge ingress	Every service, for log correlation
`x-user-id`	The authenticated user (the human, not the service)	Edge auth layer	Every service, for authorization checks
`x-tenant-id`	The customer/organization context	Edge	Tenant-scoped services (DB sharding, billing)
`x-session-id`	The user's session	Edge	Services that gate by session validity
`x-locale`, `x-timezone`	Localization context	Edge (from client)	Services that render user-visible content
`x-experiment-bucket`	A/B test assignments	Edge experiment evaluator	Services that vary behavior by experiment
`grpc-timeout`	The remaining deadline	Each hop, recomputed	gRPC framework
`authorization` (mTLS or token)	Caller identity (workload + sometimes user)	mTLS / auth layer	Service handlers, OPA

Propagation rule: every service must read these headers from inbound requests and re-attach them to outbound requests it makes. Forget one and the downstream services have no idea who the user is.

The "we lost the user context at hop 4" failure mode

You ship a new mid-tier service in Go. Six months later, an auditor asks: "show me every request your billing service received in February that was attributed to user X." You query the logs. Half the requests have a user_id field; half don't. Investigation: the new mid-tier doesn't propagate x-user-id because the developer used a raw http.Client instead of the standard internal client, and forgot to copy headers from the inbound context to the outbound request.

This is the classic propagation bug. It is invisible to functional testing (the request works; the downstream service just doesn't know who the user is) and only shows up when someone tries to audit or attribute.

The standard fixes by language

Go: context.Context pattern (ContextProp). The idiomatic Go pattern is to thread context.Context through every function call. Headers are stored as context values; HTTP/gRPC client middleware automatically extracts them from the context and adds them to outbound headers. The discipline:

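func (h *Handler) Process(ctx context.Context, req *Request) (*Response, error) {
    // ctx has x-request-id, x-user-id, etc. bound to it by server middleware.
    // Pass the same ctx through to every outbound call.
    balanceReq := &BalanceRequest{AccountID: req.AccountID} // hypothetical types and fields
    result, err := h.balanceClient.GetBalance(ctx, balanceReq)
    if err != nil {
        return nil, err // includes context deadline and cancellation errors
    }
    // Client middleware reads ctx and propagates the headers automatically.
    return &Response{Balance: result.Balance}, nil // hypothetical field
}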

The rule: never use context.Background() in a handler. That breaks the chain. Linters can catch this; many teams enforce it.

Java: MDC (Mapped Diagnostic Context). The SLF4J/Logback MDC is a ThreadLocal map of correlation IDs. Server-side filters extract headers from inbound requests and call MDC.put("user.id", userId). Loggers include MDC keys in every log line. Outbound HTTP/gRPC client interceptors read MDC and re-attach as headers. Modern frameworks (Spring Cloud Sleuth, OpenTelemetry Java agent) automate this. Async boundaries (CompletableFuture, @Async) require explicit MDC propagation (the framework usually provides a wrapper).

Other languages:

Python: contextvars (PEP 567) is the modern equivalent of a ThreadLocal. Each thread and each asyncio task sees its own context.
Node.js: AsyncLocalStorage (Node 13+) for async-aware context.
Rust: the tracing crate (tokio-rs/tracing) provides Span context with attribute propagation.
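The same bind-on-inbound, read-on-outbound discipline can be sketched with Python's contextvars. This is a minimal illustration rather than any framework's middleware: the allowlist, the function names, and the simulated downstream call are assumptions.

```python
import asyncio
import contextvars

# Headers this service promises to carry from inbound to outbound requests.
PROPAGATED = ("x-request-id", "x-user-id", "x-tenant-id", "traceparent")

_inbound = contextvars.ContextVar("inbound_headers", default={})


def bind_inbound(headers: dict) -> None:
    """Server middleware: stash the allowlisted inbound headers for this request."""
    _inbound.set({k.lower(): v for k, v in headers.items() if k.lower() in PROPAGATED})


def outbound_headers(extra: dict = None) -> dict:
    """Client middleware: start from the bound context, then add call-specific headers."""
    merged = dict(_inbound.get())
    merged.update(extra or {})
    return merged


async def call_downstream() -> dict:
    await asyncio.sleep(0)  # yield, so concurrent requests interleave
    return outbound_headers({"content-type": "application/grpc"})


async def handle(headers: dict) -> dict:
    bind_inbound(headers)
    return await call_downstream()


async def main():
    # Two concurrent requests: each asyncio task gets its own copy of the context.
    return await asyncio.gather(
        handle({"X-User-Id": "u1", "Cookie": "secret"}),
        handle({"X-User-Id": "u2", "X-Tenant-Id": "t9"}),
    )


first, second = asyncio.run(main())
print(first)
print(second)
```

The handler never passes headers explicitly, yet the two concurrent requests do not see each other's values, and the non-allowlisted `Cookie` header is dropped rather than forwarded.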

Sidecar-assisted propagation

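A well-configured mesh can reduce the application's burden. The sidecar can stamp headers on the way in (request IDs, trace context), and where the mesh provides a header-forwarding filter it can copy an allowlist of headers from inbound to outbound. One caveat: a stock Envoy sidecar cannot by itself tell which outbound call belongs to which inbound request, so the application (or an in-process agent) usually still has to carry at least a correlation header across. For the headers the sidecar does handle, the application code can omit explicit propagation.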

Limits: the sidecar copies exactly the inbound header. It cannot transform (e.g., decrypt a user token, extract user_id, re-encode for downstream). Anything that requires application logic still has to be done in the application.

Anti-patterns to call out

Re-deriving user identity at every hop: don't re-authenticate the user 6 times in 6 services. The edge authenticates once and signs an internal token that downstream services trust. The Staff-level pattern is "edge-signed identity headers" — the edge stamps x-user-id and a short-lived HMAC over it; downstream services verify the HMAC cheaply (~10 µs vs ~1 ms for a JWT). This is what GitHub, Shopify, and Slack-like systems do internally.
Putting too much in headers: every header costs HPACK table space and bandwidth. Don't ship the whole user object — ship the ID and let services fetch full state if needed.
Forgetting to scrub on egress: internal headers (x-internal-debug, x-canary-shard) must be stripped at the edge before responses go to clients. Otherwise you leak internal architecture to the public.
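The edge-signed identity header can be sketched with the standard library's HMAC. This is an illustration under assumptions: the token layout (userID.expiry.mac), the function names, and the single shared key are all hypothetical, and a real deployment would also need key rotation and distribution.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// mac computes a hex-encoded HMAC-SHA256 over payload.
func mac(key []byte, payload string) string {
	h := hmac.New(sha256.New, key)
	h.Write([]byte(payload))
	return hex.EncodeToString(h.Sum(nil))
}

// Sign is what the edge would do: bind the user ID to a short expiry and sign both.
func Sign(key []byte, userID string, expiryUnix int64) string {
	payload := userID + "." + strconv.FormatInt(expiryUnix, 10)
	return payload + "." + mac(key, payload)
}

// Verify is what each downstream would do: check the MAC first, then the expiry.
func Verify(key []byte, token string, nowUnix int64) (string, bool) {
	i := strings.LastIndex(token, ".")
	if i < 0 {
		return "", false
	}
	payload, gotMAC := token[:i], token[i+1:]
	// hmac.Equal compares in constant time to avoid leaking the MAC via timing.
	if !hmac.Equal([]byte(gotMAC), []byte(mac(key, payload))) {
		return "", false
	}
	j := strings.LastIndex(payload, ".")
	if j < 0 {
		return "", false
	}
	expiry, err := strconv.ParseInt(payload[j+1:], 10, 64)
	if err != nil || nowUnix >= expiry {
		return "", false
	}
	return payload[:j], true
}

func main() {
	key := []byte("demo-key-rotate-me") // placeholder secret for the example only
	tok := Sign(key, "u-42", 1700000060)
	user, ok := Verify(key, tok, 1700000000)
	fmt.Println(user, ok)
}
```

Verification is one hash computation and one comparison, which is why it is cheap enough to run at every hop.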

§12. Connection Management

The "physical" layer that the RPC fabric rides on is TCP connections. How those connections are pooled, multiplexed, kept alive, and reaped is invisible most of the time and the cause of the worst outages the rest of the time.

Client-side connection pooling

A naïve client opens a TCP connection per RPC call. At 5,000 QPS, that's 5,000 TCP handshakes per second per server, each costing one round trip (~500 µs in-DC) plus a TLS handshake (one more round trip for TLS 1.3, two for TLS 1.2). The per-call latency penalty is 1–2 ms before any application work happens, and the kernel buckles under ephemeral port pressure (a host has ~28,000 ephemeral ports by default).

Connection pooling is the fix: maintain a pool of long-lived connections per upstream, reuse for many calls. Pool sizing rule of thumb:

For HTTP/1.1: pool_size >= concurrent_in_flight = QPS × latency. At 5k QPS and 10 ms latency, pool ~= 50.
For HTTP/2: one connection handles many concurrent streams. Pool sizing reduces to "enough connections to use available CPUs" — typically 2–4 connections per upstream per pod (one per CPU thread).
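The sizing arithmetic can be written down as a small calculator. This is a sketch of the rule of thumb only; the headroom multiplier and the per-connection stream cap passed in are assumptions you would replace with your own numbers.

```go
package main

import (
	"fmt"
	"math"
)

// RequiredInFlight applies Little's Law: concurrency = arrival rate x time in system.
func RequiredInFlight(qps, latencyMs float64) int {
	return int(math.Ceil(qps * latencyMs / 1000))
}

// HTTP1PoolSize sizes an HTTP/1.1 pool: one connection per in-flight call, times headroom.
func HTTP1PoolSize(qps, latencyMs, headroom float64) int {
	return int(math.Ceil(qps * latencyMs / 1000 * headroom))
}

// HTTP2Connections returns how many connections keep concurrent streams
// under the per-connection cap (never fewer than one).
func HTTP2Connections(qps, latencyMs float64, maxStreams int) int {
	n := int(math.Ceil(float64(RequiredInFlight(qps, latencyMs)) / float64(maxStreams)))
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	fmt.Println(RequiredInFlight(5000, 10))      // steady-state in-flight calls
	fmt.Println(HTTP1PoolSize(5000, 10, 1.5))    // HTTP/1.1 pool with 50% headroom
	fmt.Println(HTTP2Connections(5000, 10, 100)) // HTTP/2 connections at a 100-stream cap
}
```

Running the same functions with a degraded latency shows how quickly the required concurrency outgrows a pool sized for the healthy case.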

The connection pool itself is a resource that can be exhausted (see §13 problem 6). Bounded queue depth on pool acquisition is mandatory.

HTTP/2 multiplexing

The major advantage of HTTP/2: one TCP connection multiplexes 100–1000 concurrent streams. Stream 7 (in flight for 3 seconds) does not block stream 8 (starting now). Frames from different streams interleave on the wire.

The implications for pool sizing:

One HTTP/2 connection per upstream is often enough at low/mid scale.
Multiple connections become useful for scalability of stream-creation (HTTP/2 SETTINGS_MAX_CONCURRENT_STREAMS caps per connection; opening a second connection doubles the cap) and CPU parallelism (the HTTP/2 codec can be a per-connection bottleneck; spreading across connections distributes CPU).
gRPC's official guidance: 1 connection per upstream is fine until you saturate MAX_CONCURRENT_STREAMS or hit codec CPU limits, then add more.

Connection pinning hazards

Connection pinning: a long-lived stream (especially a server-streaming RPC like "stream all log lines forever") sticks to one connection. If that connection's underlying TCP stalls, all streams sharing it stall (the TCP-layer head-of-line issue from §13 problem 5).

The hazard: a single slow-but-not-broken stream (say, a streaming RPC where the client is reading slowly) leaves data unread, which fills that stream's flow-control window and can eat into the connection-level window that all streams share, briefly stalling sibling streams. Symptoms: random p99 latency spikes correlated with which connection the call landed on.

Mitigations:

Aggressive connection rotation (close and reopen every N minutes or M requests).
Separate connections for "control plane" (small RPCs) and "data plane" (large streaming RPCs).
Per-stream flow-control tuning (smaller windows force the slow consumer to throttle quickly without bloating the connection buffer).

Keep-alive tuning

Three layers of keep-alive:

TCP keep-alive: kernel sends a probe every N seconds (default Linux: 7200 s = 2 hours) to detect dead peers. Too slow for production; tune to ~30–60 s.
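HTTP/2 PING frames: with keepalive enabled (it is off by default in most gRPC client libraries), gRPC sends a PING every ~10–60 s. If no PING ACK arrives within the timeout, the connection is considered dead, closed, and replaced. Tunable via the keepalive time and keepalive timeout settings (keepalive_time and keepalive_timeout; exact option names vary by language).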
Application-layer health checks: out-of-band /healthz polls. Independent of any in-flight traffic.

Misconfigured keep-alive is a frequent cause of mysterious connection failures. Too aggressive: PING storms across thousands of connections cost CPU. Too lenient: dead connections sit in the pool for hours, every call to them fails until the kernel notices.

Idle connection reaping

A connection with no traffic for N minutes should be closed to free resources. Server-side reaping (gRPC's MAX_CONNECTION_IDLE, Envoy's idle_timeout, or similar) is good hygiene. Client-side: the pool reaps idle connections.

The hazard: a client that should call B once per hour finds the previous connection reaped. Next call pays the handshake cost. For ultra-low-frequency callers, consider keeping the connection warm with periodic PING.

Connection draining on shutdown

When a server pod is being terminated (rolling deploy, scale-down, node drain):

Server sends GOAWAY with last-stream-id = N.
Existing streams (≤ N) complete normally.
Clients see GOAWAY, stop opening new streams on this connection, open new connections to other instances.
Server waits for grace period (typically 30–60 s) for streams to drain.
Server closes.

Without GOAWAY, server termination kills in-flight requests; clients see UNAVAILABLE and retry. With GOAWAY, the shutdown is graceful and clients seamlessly fail over.

The same pattern applies to Envoy / sidecars: on shutdown, send GOAWAY, drain, then exit. Kubernetes preStop hooks and termination grace periods enable this.

§13. Hard Problems Inherent to RPC

These problems are inherent to the technology. They show up regardless of which framework you picked, in payments, search, feed, and ML inference alike.

Problem 1: Retry storms (retry amplification)

Naïve solution. "Retries are cheap. Three retries on every client with exponential backoff. We get more reliability for free."

How it breaks (payments fan-out illustration). Payment authorization at Stripe-like scale: a request flows edge → risk service → balance service → ledger. Each hop makes up to 3 attempts. Balance service GC pauses for 2 seconds; latency spikes to 800 ms; its callers begin to time out at 500 ms. Risk service retries: 3x. Edge retries the risk service: 3x. That is already 3 × 3 = 9x at the balance service, and with two more retrying layers above the edge (say, a mobile client and an API gateway) four layers give 3^4 = 81x amplification. Balance service is now being hammered at many times normal load and never recovers until traffic is shed.

Trace at t=0 in a different domain (search fan-out): Search root → 100 leaf shards. One shard GC'd. p99 latency spikes. Root retries the query (3x), and each retry fans out to all 100 shards again. One user query now produces up to 300 leaf calls instead of 100: the whole leaf cluster sees 3x load because a single shard was slow, on a fraction of its capacity → cluster-wide brownout.

Actual fix: retry budgets. Envoy and modern gRPC client libs enforce a budget per upstream: for example, at most 10% of successful traffic can be retries (configurable). If B is serving 10k QPS successfully, the retry budget allows 1k retries/sec; everything beyond that is rejected at the client side. Critically, the budget is per-upstream and not per-call; otherwise concurrent failing calls each get their own 10% and you're back to amplification.
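A per-upstream budget can be sketched as a counter pair. This is a simplified illustration, not Envoy's or gRPC's actual implementation: the fixed window that the caller rolls (say, once per second), the floor parameter, and all names are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// RetryBudget is shared by all calls to one upstream, so failing calls
// compete for the same allowance instead of each getting their own.
type RetryBudget struct {
	mu        sync.Mutex
	percent   int // retries allowed as a percentage of successes in the window
	floor     int // small fixed allowance so low-traffic clients can still retry
	successes int
	retries   int
}

func NewRetryBudget(percent, floor int) *RetryBudget {
	return &RetryBudget{percent: percent, floor: floor}
}

// RecordSuccess counts one successful call in the current window.
func (b *RetryBudget) RecordSuccess() {
	b.mu.Lock()
	b.successes++
	b.mu.Unlock()
}

// AllowRetry spends one unit of budget, or refuses if the window's budget is gone.
func (b *RetryBudget) AllowRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	limit := b.successes*b.percent/100 + b.floor
	if b.retries >= limit {
		return false // reject client-side; do not add load to the upstream
	}
	b.retries++
	return true
}

// Roll starts a new window; the caller would invoke it on a timer.
func (b *RetryBudget) Roll() {
	b.mu.Lock()
	b.successes, b.retries = 0, 0
	b.mu.Unlock()
}

func main() {
	budget := NewRetryBudget(10, 0)
	for i := 0; i < 100; i++ {
		budget.RecordSuccess()
	}
	allowed := 0
	for i := 0; i < 50; i++ {
		if budget.AllowRetry() {
			allowed++
		}
	}
	fmt.Println("retries allowed out of 50 attempts:", allowed)
}
```

Production implementations typically use a sliding window or token bucket rather than a hard window reset, which avoids a burst of retries at each window boundary.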

Layered defenses around retry:

Exponential backoff with jitter. backoff = min(base * 2^attempt, cap) + uniform(0, jitter). Jitter is non-negotiable; without it, all retries land in a thundering herd at the same millisecond.
Only retry idempotent calls. Reads (getUserProfile) yes; writes (POST /payments) no, unless the call carries an idempotency key.
Respect the deadline. Retrying past the parent's deadline is pointless.
Status-code-aware retry. NOT_FOUND, INVALID_ARGUMENT, and PERMISSION_DENIED are deterministic; never retry them. UNAVAILABLE, DEADLINE_EXCEEDED, and RESOURCE_EXHAUSTED are candidates.
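The backoff formula and the retry-eligibility rules can be sketched together. Status codes are plain strings here to keep the example free of a gRPC dependency, and the random source is injected so the arithmetic is easy to check; both choices, and the function names, are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"math/rand"
)

// BackoffMs computes min(base * 2^attempt, cap) + uniform[0, jitter) in milliseconds.
// randn(n) must return a value in [0, n); rand.Int63n fits that contract.
// Assumes baseMs is small enough that base << attempt cannot overflow for attempt < 32.
func BackoffMs(attempt int, baseMs, capMs, jitterMs int64, randn func(int64) int64) int64 {
	d := capMs
	if attempt >= 0 && attempt < 32 {
		if exp := baseMs << uint(attempt); exp < capMs {
			d = exp
		}
	}
	if jitterMs > 0 {
		d += randn(jitterMs)
	}
	return d
}

// Retryable applies the layered rules: idempotent only, budget left, transient code.
func Retryable(code string, idempotent bool, remainingMs int64) bool {
	if !idempotent || remainingMs <= 0 {
		return false
	}
	switch code {
	case "UNAVAILABLE", "DEADLINE_EXCEEDED", "RESOURCE_EXHAUSTED":
		return true
	}
	return false // NOT_FOUND, INVALID_ARGUMENT, PERMISSION_DENIED, etc. are deterministic
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Println(attempt, BackoffMs(attempt, 100, 2000, 100, rand.Int63n))
	}
	fmt.Println(Retryable("UNAVAILABLE", true, 250))
}
```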

Problem 2: Cascading failure / circuit breaking

Naïve solution. "If B is slow, A will wait. Eventually B recovers and so does A."

How it breaks (ML inference fanout illustration). A recommendation service calls 50 ML inference endpoints in parallel, p99 budget 100 ms. One inference endpoint starts taking 5 seconds (GPU thermal throttle, or a bad model deploy). The 49 healthy endpoints respond in 80 ms. The recommendation service waits on the 50th. Its connection pool to that endpoint fills. Its thread pool fills (each thread blocked on the slow endpoint). The recommendation service is now degraded for every user, not just users whose request happened to land on the slow inference endpoint. "Slow" is worse than "down".

The math (Little's Law, again): pool 200, normal in-flight 50. Latency jumps 50x → in-flight needed jumps to 2500. Pool is 200. Every additional caller blocks on pool acquisition. Application threads pile up on the semaphore. App stops accepting new requests.

Actual fix: circuit breakers. Track the upstream's error rate or latency over a rolling window. When threshold exceeded (e.g., 50% of last 100 requests failing, or p99 > 5x normal), open the circuit:

States:    [CLOSED] --errors above threshold--> [OPEN] --after timeout--> [HALF_OPEN]
              ^                                                                |
              |                                                                |
              +------------success on probe call-------------------------------+

CLOSED: normal operation, track errors/latency.
OPEN: fast-fail all calls with UNAVAILABLE for 30s. Don't even open a connection. Errors are returned in microseconds instead of seconds. This frees the pool immediately.
HALF_OPEN: after timeout, allow a few probe calls. If they succeed, close; if they fail, back to OPEN.

Hystrix (Netflix, 2012) popularized this; resilience4j is the modern Java successor; Envoy implements it natively. The deeper insight: the circuit breaker protects the caller, not the callee. A failed-fast caller can still serve its own callers (returning a default, a cached response, or a graceful error). A blocked-on-slow-callee caller has nothing left to give.
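The three states can be sketched as a small state machine. This is a deliberately simplified illustration, not how Hystrix, resilience4j, or Envoy implement it: it uses a count-based window instead of a rolling one, a single probe in HALF_OPEN, and caller-supplied millisecond timestamps so the behavior is deterministic.

```go
package main

import (
	"fmt"
	"sync"
)

type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

type Breaker struct {
	mu        sync.Mutex
	state     State
	window    int   // calls per evaluation window while CLOSED
	threshold int   // failure percentage that trips the breaker
	openMs    int64 // how long to fast-fail before probing
	total     int
	failures  int
	openedAt  int64
}

func NewBreaker(window, thresholdPct int, openMs int64) *Breaker {
	return &Breaker{window: window, threshold: thresholdPct, openMs: openMs}
}

// Allow reports whether a call may be dispatched at time nowMs.
func (b *Breaker) Allow(nowMs int64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	switch b.state {
	case Open:
		if nowMs-b.openedAt < b.openMs {
			return false // fast-fail; the caller maps this to UNAVAILABLE
		}
		b.state = HalfOpen
		return true // timeout elapsed: let exactly one probe through
	case HalfOpen:
		return false // a probe is already in flight
	}
	return true
}

// Record feeds the outcome of a dispatched call back into the breaker.
func (b *Breaker) Record(success bool, nowMs int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	switch b.state {
	case HalfOpen:
		if success {
			b.state, b.total, b.failures = Closed, 0, 0
		} else {
			b.state, b.openedAt = Open, nowMs
		}
	case Closed:
		b.total++
		if !success {
			b.failures++
		}
		if b.total >= b.window {
			if b.failures*100 >= b.threshold*b.total {
				b.state, b.openedAt = Open, nowMs
			}
			b.total, b.failures = 0, 0
		}
	}
}

// State returns the current state (0 = CLOSED, 1 = OPEN, 2 = HALF_OPEN).
func (b *Breaker) State() State {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.state
}

func main() {
	b := NewBreaker(10, 50, 30000)
	for i := 0; i < 10; i++ {
		b.Record(false, 0) // ten straight failures trip the breaker
	}
	fmt.Println(b.State(), b.Allow(1000), b.Allow(30000), b.State())
}
```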

Problem 3: Deadline (not timeout) propagation

Naïve solution. "1-second timeout on every RPC call. If anything doesn't return in 1 s, fail."

How it breaks (feed render illustration). User opens the LinkedIn feed; edge service has a 500 ms budget. Feed mid-tier (called from edge) has its own 1 s timeout to the ranking service. Ranking has 1 s to the candidate-generation service. Candidate-gen has 1 s to the feature store. Feature store has 1 s to several upstream databases. Each hop says "I'm fine, I returned in under 1 s," but the cumulative depth blows the 500 ms user budget by 4–5 seconds, and the user gives up before any of this finishes. Worse: while the user-facing edge already killed the parent request, the entire fan-out subtree is still running, wasting CPU computing results nobody will read.

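Actual fix: deadline (not timeout) propagation. A deadline is an absolute timestamp ("must complete by 14:00:00.500 UTC"). When A calls B, A sends the budget that remains: timeout = deadline_received_from_parent - now - safety_margin. On the wire, gRPC carries this in the grpc-timeout header as a relative value recomputed at each hop (the receiver converts it back into a local absolute deadline, so clocks do not need to be synchronized):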

Edge:     receives request at t=0, total budget 500 ms.
Edge → A: grpc-timeout = 450m (after subtracting internal time + safety)
A → B:    A spent 50ms internally, propagates grpc-timeout = 400m
B → C:    B spent 30ms, propagates grpc-timeout = 370m
...
At any hop, if (deadline - now) <= 0, the call fails immediately
with DEADLINE_EXCEEDED without being dispatched.

The deadline is part of the call. Every hop subtracts elapsed time. The fan-out tree self-truncates when the deadline runs out; downstream calls don't waste CPU computing results that will be discarded. Same primitive solves two problems: long-running calls don't blow user-facing budgets, and zombie calls don't waste resources after their parent gave up.

REST + JSON does not give you this; you must bake it into every service yourself, and ten teams will bake it ten different ways.
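The per-hop arithmetic in the trace above can be sketched as follows. Timestamps are plain milliseconds on each host's own clock, and the function names are illustrative; the header encoding follows the grpc-timeout format (an integer of at most 8 digits followed by a unit, where m means milliseconds and S means seconds).

```go
package main

import (
	"fmt"
	"strconv"
)

// DeadlineFrom converts an inbound relative timeout into an absolute local deadline.
func DeadlineFrom(arrivalMs, timeoutMs int64) int64 {
	return arrivalMs + timeoutMs
}

// NextTimeout returns the budget to send downstream, or false if none is left,
// in which case the call should fail locally with DEADLINE_EXCEEDED.
func NextTimeout(deadlineMs, nowMs, safetyMs int64) (int64, bool) {
	remaining := deadlineMs - nowMs - safetyMs
	if remaining <= 0 {
		return 0, false
	}
	return remaining, true
}

// EncodeGrpcTimeout renders a millisecond budget as a grpc-timeout header value.
// Values that need more than 8 digits fall back to whole seconds (rounded down,
// which only ever shortens the deadline).
func EncodeGrpcTimeout(ms int64) string {
	if ms <= 99999999 {
		return strconv.FormatInt(ms, 10) + "m"
	}
	return strconv.FormatInt(ms/1000, 10) + "S"
}

func main() {
	// Edge: 500 ms total budget, 30 ms spent, 20 ms safety margin.
	toA, _ := NextTimeout(DeadlineFrom(0, 500), 30, 20)
	// A: its own clock reads 1000 on arrival; it spends 50 ms before calling B.
	toB, _ := NextTimeout(DeadlineFrom(1000, toA), 1050, 0)
	// B: its own clock reads 7000 on arrival; it spends 30 ms before calling C.
	toC, _ := NextTimeout(DeadlineFrom(7000, toB), 7030, 0)
	fmt.Println(EncodeGrpcTimeout(toA), EncodeGrpcTimeout(toB), EncodeGrpcTimeout(toC))
}
```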

Problem 4: Schema evolution at fleet scale

Naïve solution. "We coordinate proto changes manually."

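How it breaks (multi-tenant illustration). 800 services. A developer renames user_id → member_id in a shared .proto because "names are codegen artifacts." They also reuse field number 1 for a new is_premium bool field. Old senders write a string into field 1; new receivers expect a bool. Because the wire types differ (length-delimited vs. varint), most decoders quietly set the bytes aside as an unknown field and is_premium reads as false; had the two types shared a wire type (say, int64 → bool), the decoder would read garbage. Either way, nothing fails loudly at the decode step.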

Replay: the issue ships at 02:00 UTC. Old caller binaries in region A still have queued requests targeting field-1-as-string. New server binaries that just deployed expect field-1-as-bool. New server decodes field 1, gets nonsense. State corruption. Rollback the .proto, wait for binaries to redeploy across the fleet, post-mortem.

This is not hypothetical. Google has publicly post-mortemed at least one fleet-wide 4-hour partial outage caused by a single proto field-number reuse.

Actual fix: schema is the contract. Enforce in CI:

buf breaking or protolock compares the proto in the current PR to the committed-in-main version. Changes to existing field numbers, field types, or removals without reserved are CI-blocking errors.
Field numbers are immutable once assigned and shipped. Treat them like primary keys.
Reserve field numbers and names whenever you remove a field.
Schema registry: every binary on every host pulls the same proto definitions, versioned, from a central registry. Deploys validated against the registry.
Compatibility tests in CI: spin up a real client built against version N, hit a real server built against version N+1, run integration tests. And vice versa.

At fleet scale (hundreds of services, thousands of engineers), "I own the schema-review process for shared services" is a credible Staff-level accomplishment.

Problem 5: Head-of-line (HoL) blocking

Naïve solution. "HTTP/1.1 with keep-alive is fine."

How it breaks. HTTP/1.1 pipelining is broken in practice; proxies and servers serialize requests on a connection in order. If request 1 takes 3 seconds, requests 2, 3, 4 on the same connection block. Application-layer HoL. Workaround: open many connections (50+ per backend). Socket overhead explodes; the kernel groans under ephemeral-port pressure; ALB / load-balancer connection limits start to bite.

HTTP/2 fix. Streams are multiplexed at the frame layer. Stream 7 can stall without blocking stream 8, 9, 10 on the same connection. 100–1000 streams per connection.

But HTTP/2 still has HoL blocking at the TCP layer. TCP guarantees in-order byte delivery. If a single TCP packet carrying a fragment of stream 7 is lost, the receiver cannot deliver any later bytes until the retransmission arrives, including bytes for streams 8, 9, 10 that have already arrived and are sitting in the kernel's receive buffer behind the gap. A packet loss on one stream silently delays unrelated streams sharing the connection.

HTTP/3 (QUIC) fix. QUIC runs over UDP and manages ordering itself, per stream, in user space. Each QUIC stream has its own ordered delivery. Packet loss on stream 7 only stalls stream 7. Streams 8, 9, 10 deliver immediately. Tail-latency benefits are substantial in lossy environments (mobile networks, cross-region links). Cloudflare and Google use HTTP/3 at the edge for exactly this reason. Internal east-west is still mostly HTTP/2 because intra-DC links are clean and HTTP/2 multiplexing is enough.

The point: "TCP-layer head-of-line blocking" is specifically the reason HTTP/3 over QUIC matters for fan-out latency.

Problem 6: Server slow but not down (connection-pool exhaustion)

We touched this in problem 2; here is the explicit mechanism that recurs everywhere.

Naïve solution. "Pool sized for peak. We're fine."

How it breaks (Little's Law, again). in_flight = QPS × latency. Pool 200, normal 4k QPS at 10 ms = 40 in-flight, 5x headroom. Latency degrades 50x to 500 ms (not "down", just unhealthy). In-flight needed = 4k × 0.5 s = 2,000. Pool size 200. Every additional acquire blocks indefinitely. App threads pile up on the semaphore. App stops accepting new requests. Caller of the app sees it as "down."

Actual fix layered:

Bound queue depth at the pool acquire. If more than N threads are waiting on the pool, fail new acquisitions immediately with RESOURCE_EXHAUSTED. Fail fast; let circuit breaker open.
Circuit breaker on the slow upstream — opens, fast-fails further calls, drains the pool.
Per-upstream bulkheads. Separate thread pools / connection pools per upstream. If B is sick, A's calls to C are unaffected. Hystrix called this "bulkheading" — separate watertight compartments so one flooded compartment doesn't sink the ship.
Server-side load shedding. Server sets SETTINGS_MAX_CONCURRENT_STREAMS low enough that overloaded servers stop accepting new streams instead of taking them and being slow.
Deadline enforcement at every layer. Calls that have already exceeded their parent's deadline are killed locally before they consume more resources.

The order matters: deadline + retry budget + circuit breaker + bulkhead + load shedding are layered defenses. Each catches what the previous misses.
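The first defense in the list, a bounded wait queue on pool acquisition, can be sketched with two buffered channels. The Pool type, its constructor, and the error text are illustrative; a real pool would hand out connections rather than anonymous slots.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

// ErrPoolExhausted is what a caller would map to RESOURCE_EXHAUSTED.
var ErrPoolExhausted = errors.New("RESOURCE_EXHAUSTED: pool wait queue full")

type Pool struct {
	slots   chan struct{} // capacity = pool size
	waiters chan struct{} // capacity = maximum callers allowed to wait
}

func NewPool(size, maxWaiters int) *Pool {
	return &Pool{slots: make(chan struct{}, size), waiters: make(chan struct{}, maxWaiters)}
}

// Acquire takes a slot, waits in a bounded queue, or fails fast.
func (p *Pool) Acquire(ctx context.Context) error {
	select {
	case p.slots <- struct{}{}:
		return nil // fast path: a slot was free
	default:
	}
	select {
	case p.waiters <- struct{}{}: // joined the bounded wait queue
	default:
		return ErrPoolExhausted // queue full: fail immediately instead of piling up
	}
	defer func() { <-p.waiters }()
	select {
	case p.slots <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err() // the caller's deadline or cancellation bounds the wait
	}
}

// Release returns a slot to the pool.
func (p *Pool) Release() { <-p.slots }

// Demo exercises the three outcomes and returns them as one string.
func Demo() string {
	var out []string
	note := func(err error) {
		if err == nil {
			out = append(out, "ok")
		} else {
			out = append(out, err.Error())
		}
	}
	ctx := context.Background()

	p := NewPool(1, 1)
	note(p.Acquire(ctx)) // slot free
	cancelled, cancel := context.WithCancel(ctx)
	cancel()
	note(p.Acquire(cancelled)) // queues, then gives up because ctx is already done

	strict := NewPool(1, 0) // no waiting allowed at all
	note(strict.Acquire(ctx))
	note(strict.Acquire(ctx)) // rejected immediately
	strict.Release()
	note(strict.Acquire(ctx))
	return strings.Join(out, "; ")
}

func main() {
	fmt.Println(Demo())
}
```

Passing the request context into Acquire ties the wait to the caller's deadline, so the deadline defense and the bounded-queue defense reinforce each other.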

Problem 7: Server-side timeout shorter than client deadline

Naïve solution. "I'll just set timeouts everywhere; if anything is slow, it'll fail fast."

How it breaks. Client A calls server B with a 1-second deadline. B has its own internal timeout of 500 ms on the database call it makes. The database is slow (800 ms). At t=500 ms, B's database timeout fires, B returns DEADLINE_EXCEEDED to A. A still has 500 ms of budget left, so A retries the call. B starts processing the retry — except the original work didn't actually stop because B's database call was already in flight, and the database is still slowly returning at t=800 ms (now for the original, abandoned call) and again at t=1300 ms for the retry. Double the work, double the database load, and A might retry a third time.

The pathology compounds: a slow downstream causes upstream retries that amplify load on the slow downstream, prolonging the slowness. This is the same shape as the retry storm, but caused by internal timeout misconfiguration rather than naive client retries.

Actual fix: Server-side timeouts should be longer than the client's deadline, or — better — derived from grpc-timeout directly. The pattern: B reads its grpc-timeout from the incoming request and uses that as the budget for all of its internal operations. No independent internal timeout that's shorter than the client's deadline.

Stripe's reliability engineering wrote about this exact pattern: their internal timeouts are always ≥ caller's deadline + safety margin, so if the caller is willing to wait, the server doesn't preemptively kill work.

Problem 8: Deadline propagation forgotten

Naïve solution. "Each service has its own timeout config; that's fine."

How it breaks. Service A is called with grpc-timeout: 500m. A makes 5 downstream calls. Each downstream is configured with its own static timeout of "1 s" because that was the team default at the time. Even though A's caller will give up at 500 ms, the downstream services happily spend up to 1 second computing answers nobody will read. The fan-out subtree continues consuming CPU after the parent request has been abandoned.

This is endemic in REST-over-HTTP/1.1 fabrics where deadline propagation isn't built into the protocol. People reinvent it with X-Deadline headers, half implement it, and the half that doesn't keeps the zombie work alive.

Actual fix: Make deadline propagation a fabric-level invariant. Every gRPC server's framework automatically reads grpc-timeout from inbound, computes now + remaining as the deadline, and (if it makes downstream calls) propagates (deadline - now - safety_margin) to each downstream. Most gRPC stub libraries do this by default if you pass the request context through. The discipline is "never start a new deadline tree mid-fabric."

For REST, the convention is X-Request-Deadline: <unix-millis> or X-Request-Timeout: <ms>. Both work; uniform enforcement is what matters.

Problem 9: Circuit breaker oscillation

Naïve solution. "The circuit breaker handles failure. Set 50% error threshold, 30 s open, done."

How it breaks. Upstream B is degraded but not down: 40% of calls succeed, 60% fail with DEADLINE_EXCEEDED. Circuit breaker opens after the error threshold is breached. After 30 s, it transitions to half-open and sends one probe call. The probe succeeds (one of the 40%). Circuit breaker closes. Within seconds, the failure rate again exceeds 50%, and the circuit opens again. The breaker oscillates between OPEN and CLOSED every ~30 s. Every closed period, callers slam B with full traffic and get hammered by failures; every open period, callers fast-fail and don't put load on B (which would help B recover) but also can't drain queued work.

The breaker is neither protecting B (B sees bursts of full traffic) nor serving callers well (callers oscillate between "everything fails fast" and "everything works partially").

Actual fixes:

Half-open requires multiple consecutive successes before fully closing: N successes (typically 5) rather than one lucky probe. Envoy's outlier detection exposes related knobs (consecutive_5xx, success_rate_minimum_hosts), though it ejects and re-admits hosts on a timer rather than running a half-open state.
Hysteresis: thresholds for opening and closing differ. Open when error rate > 50%; close only when error rate < 10%. Without the gap, the breaker flaps on any rate near the threshold.
Gradual re-admission: instead of "open → closed = full traffic," ramp traffic up. 1% probe, then 10%, then 50%, then 100%. Resilience4j's permittedNumberOfCallsInHalfOpenState covers the first step (a bounded number of probe calls); the ramp itself is usually built on top.
Per-instance, not per-service. A circuit breaker per upstream instance is more useful than per-service. One bad pod doesn't open the circuit to all replicas; outlier detection ejects just that pod.
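Hysteresis and gradual re-admission are small enough to sketch directly. The type and function names are illustrative, and the thresholds and ramp steps are the example values from the list above, not defaults of any library.

```go
package main

import "fmt"

// Hysteresis opens above one error rate and closes only below a lower one.
type Hysteresis struct {
	OpenAbove  int // open when error percentage exceeds this
	CloseBelow int // close only when error percentage drops under this
	open       bool
}

// Observe feeds the latest window's error percentage and reports whether the breaker is open.
func (h *Hysteresis) Observe(errPct int) bool {
	if !h.open && errPct > h.OpenAbove {
		h.open = true
	} else if h.open && errPct < h.CloseBelow {
		h.open = false
	}
	return h.open
}

// RampPercent returns the share of traffic admitted at a given re-admission step.
func RampPercent(step int) int {
	ramp := []int{1, 10, 50, 100}
	if step < 0 {
		return ramp[0]
	}
	if step >= len(ramp) {
		return 100
	}
	return ramp[step]
}

func main() {
	h := &Hysteresis{OpenAbove: 50, CloseBelow: 10}
	for _, pct := range []int{45, 55, 48, 12, 9, 30} {
		fmt.Println(pct, h.Observe(pct))
	}
	fmt.Println(RampPercent(0), RampPercent(1), RampPercent(2), RampPercent(3))
}
```

With a single 50% threshold, the readings 55, 48 in the sequence above would already have flipped the breaker open and closed again; the gap between the two thresholds is what keeps it steady.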

The Staff-level insight: a circuit breaker that flaps is worse than no circuit breaker, because callers can't reason about the behavior. Stability under partial failure is the actual goal, not just opening/closing fast.

§14. Failure Mode Walkthrough

Server crashes mid-request (between DATA and TRAILERS)

Client sent HEADERS + DATA. Server began processing, started writing response DATA, then process died (OOM, segfault, kill -9).

TCP RST or FIN reaches the client. HTTP/2 codec sees connection break. All in-flight streams on that connection notified.
gRPC stub sees UNAVAILABLE with reason "connection closed without trailers."
Recovery: client retries (if idempotent, within retry budget) against another instance. Mesh's outlier detection ejects the dead host within ~10s on consecutive failures.
Durability point: nothing in the RPC layer. The response was never committed anywhere. Client is on the hook. For non-idempotent operations, the client must have written an idempotency key so the retry doesn't double-execute the side effect.

Server crashes between requests (idle)

Same TCP RST. Pool sees the connection dead, removes it. Opens a new connection to a healthy instance.
Outlier detection ejects the host. After probation, the mesh may re-add it.

Sidecar (Envoy) crashes

Application's gRPC stub points at localhost:9090. Localhost connection refused.
App should fail fast with UNAVAILABLE. Kubernetes restarts the sidecar within ~5s.
Defense: Envoy is statically linked, supervisor-restarted, has its own watchdog. The more common failure is misconfigured xDS push causing Envoy to reject config and serve stale.
Durability point: the pod is the unit. If the sidecar is unrecoverable, kill the pod, scheduler brings up a new one.

Control plane (xDS) outage

Sidecars cache last-known-good config. Existing traffic keeps routing.
What breaks: new services can't register, certificates can't rotate, new routes can't propagate.
Recovery: control plane is HA-replicated (etcd / Raft underneath). Failover within seconds.
Critical principle: the data plane MUST continue to work without the control plane. This is the whole point of the control-plane / data-plane split. A 30-minute control-plane outage taking down all RPC traffic is a design failure.

Network partition between client and server racks

mTLS connections in the partition fail. Outlier detection ejects servers on the unreachable side.
Service discovery sees the partition via health checks and stops routing across it.
Calls fail fast with UNAVAILABLE. Circuit breakers open. Retries fail-fast. Callers degrade gracefully.
Split-brain consideration: relevant only if the called service is stateful. Stateless RPC has no split-brain — both sides can independently serve whoever can reach them.
Durability point: any persisted state lives in the database, not the RPC layer. RPC is conduit; data lives elsewhere.

Permanent loss of a server instance

Mesh detects via health checks (passive: outlier ejection from consecutive 5xx; active: periodic L7 health check on /healthz or the gRPC health-checking protocol, grpc.health.v1.Health/Check).
Service registry removes the instance. Other instances absorb the load.
Auto-scaler may bring up a replacement.

Slow but not down (the worst case)

Recovery via circuit breakers + outlier detection + load shedding. The mesh must distinguish "slow" from "down" and apply different remedies: dead instance ejected; slow instance shielded by circuit breaker. This is the case the framework gives you primitives for but does not solve automatically — see §13 problem 2.

§15. Streaming Patterns

gRPC supports four call modes, and three of them are streaming. Knowing which mode fits which problem — and when to reach for something other than gRPC streaming — separates "I understand the semantics" from "I just know the protocol."

The four gRPC call modes

Defined in .proto:

service ExampleService {
  rpc UnaryCall(Request) returns (Response);
  rpc ServerStreaming(Request) returns (stream Response);
  rpc ClientStreaming(stream Request) returns (Response);
  rpc BidiStreaming(stream Request) returns (stream Response);
}

The wire mechanics: in all four modes, the HTTP/2 stream is opened once, and each Request or Response is one length-prefixed Protobuf message carried in DATA frames (a 1-byte compression flag, a 4-byte big-endian length, then the serialized message). The client ends its half by setting END_STREAM on its final frame; the server ends the call by emitting TRAILERS with grpc-status.
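The length-prefixed framing is simple enough to sketch by hand. This covers only the 5-byte message prefix inside DATA frames, not HTTP/2 framing itself; the payloads here are arbitrary bytes standing in for serialized Protobuf messages, and the function names are illustrative.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Frame prepends the gRPC message prefix: 1-byte compressed flag + 4-byte big-endian length.
func Frame(payload []byte, compressed bool) []byte {
	buf := make([]byte, 5+len(payload))
	if compressed {
		buf[0] = 1
	}
	binary.BigEndian.PutUint32(buf[1:5], uint32(len(payload)))
	copy(buf[5:], payload)
	return buf
}

// Unframe splits buf into complete messages. Bytes of a trailing partial
// message are returned as rest so the caller can wait for more data.
func Unframe(buf []byte) (msgs [][]byte, rest []byte) {
	for len(buf) >= 5 {
		n := int(binary.BigEndian.Uint32(buf[1:5]))
		if len(buf)-5 < n {
			break // the message body has not fully arrived yet
		}
		msgs = append(msgs, buf[5:5+n])
		buf = buf[5+n:]
	}
	return msgs, buf
}

func main() {
	// Two messages back to back, as a streaming RPC would send them.
	stream := append(Frame([]byte("hello"), false), Frame([]byte("world"), false)...)
	fmt.Printf("% x\n", stream)
	msgs, rest := Unframe(stream)
	fmt.Println(len(msgs), string(msgs[0]), string(msgs[1]), len(rest))
}
```

Because message boundaries come from this prefix and not from DATA frame boundaries, one message may span several DATA frames and one DATA frame may carry several messages.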

Server-streaming: one request, many responses

Shape: client sends one Request, server sends a sequence of Response messages, then TRAILERS.

Use cases:

Push notifications: client opens a long-lived stream; server pushes notifications as they arrive. Slack, Discord, and competitive trading desks use this pattern.
Log/event streaming: tail logs from a server. Client says Tail(service="api", since=10min), server emits log lines as a stream.
Database change streams: replication clients subscribe to a stream of WAL (Write-Ahead Log) events from the leader.
Search-style "all results": a query that produces thousands of matches. Server streams as it finds them; client renders progressively.

Why streaming beats polling for these:

Push semantics: no "tail latency" of polling interval.
HTTP/2 flow control naturally throttles the producer if the consumer is slow (per-stream WINDOW_UPDATE).
One TCP connection, one HTTP/2 stream, codec efficiency.
The cancellation path is clean: client closes the stream, server gets CANCELLED, releases resources immediately.

Client-streaming: many requests, one response

Shape: client sends a sequence of Request messages, server processes them all, returns one Response.

Use cases:

Bulk upload: streaming a large file as a sequence of chunks. The server can begin processing earlier chunks before later ones arrive (lower wall time).
Sensor data aggregation: thousands of IoT devices report metrics; an aggregator endpoint receives each device's readings as a stream and responds once at the end with "received up to sequence N" (periodic acks would make this bidi-streaming).
Batch transcription: client streams audio frames, server returns one final transcription result.

Pitfall: client-streaming is most useful when the response depends on all inputs. If each input could have its own response, that's bidi-streaming or unary-loop, not client-streaming.

Bidirectional streaming: many requests, many responses

Shape: both sides send messages independently, in interleaved order. Either side can end its half of the stream.

Use cases:

Chat: messages flow in both directions.
Real-time games: client sends inputs; server sends world state updates.
Voice/video signaling: WebRTC negotiation control channels.
Live coding / collaboration: edits stream both directions.
Interactive ML inference: client sends queries, server streams partial completions (LLM streaming responses).

Wire semantics: the order of messages within each direction is preserved (HTTP/2 streams are ordered). The relative order between client→server and server→client messages is not synchronized — both sides can speak at any time.

Critical correctness consideration: bidi-streaming makes it tempting to encode complex state machines into the stream. Be careful — a long-lived bidi stream is stateful at the connection. If the connection breaks (server crash, network partition, sidecar restart), you have to rebuild that state. Many bidi-streaming designs assume liveness and break under partial failure. Prefer to checkpoint state to durable storage; treat the stream as transport, not as the system of record.

When streaming is right vs when polling / SSE / WebSocket

Pattern	Use when	Why
gRPC server-streaming	Internal service-to-service push, < 100k concurrent streams per service, both ends speak gRPC	Compact, flow-controlled, cancellable, multiplexed over HTTP/2
gRPC bidi-streaming	Internal bidirectional flow, low-frequency control plane	All of the above, both directions
Polling	Updates arrive infrequently (minutes apart), simple stateless servers, lossy networks	No stream state to lose; each poll is a fresh attempt; trivially scalable
SSE (Server-Sent Events)	Browser needs server push, one-way only, HTTP/1.1 OK	Browser-native; works through any HTTP/1.1 proxy; auto-reconnect built into the spec
WebSocket	Browser needs bidirectional, very frequent messages	Browser-native bidi; protocol upgrade from HTTP/1.1; ubiquitous proxy support
gRPC-Web (streaming)	Browser ↔ gRPC backend, server push needed	Requires a proxy (Envoy with gRPC-Web filter); only server-streaming is well-supported, not bidi
Webhook	"Tell me when X happens" across the internet (different security domains, low frequency)	Decoupled, no long-lived connections, retry-friendly
Long-polling	Legacy / fallback for environments with no streaming support	Behaves like polling on the wire but feels like push to the app

Anti-patterns:

Streaming for fan-out request-response: if A wants to ask B "give me users 1, 2, 3" and get three responses, this is not server-streaming — this is three unary calls. Streaming is for unbounded or open-ended sequences. Don't shoehorn batch into streams.
Long-lived bidi as the connection-state: see above. State belongs in storage.
gRPC streaming for low-frequency push to a browser: use SSE or WebSocket instead. gRPC-Web bidi is half-baked.

§16. API Versioning Strategies

Once an API has external callers, you can never break it. Even internally, with 800 services owned by 200 teams, you can't atomically deploy a breaking change. Versioning strategies are the disciplines for shipping new API behavior without breaking old callers.

URL versioning: `/v1/users`, `/v2/users`

The classic REST approach. Each major version is a separate URL prefix. Clients pin to a specific version explicitly.

Pros:
- Simple, obvious, debuggable via curl.
- Versions can run side-by-side in different services if needed.
- Cache layers see different URLs, so caching just works.

Cons:
- Path explosion: 5 versions × 50 endpoints = 250 active routes.
- Hard to evolve a single endpoint independently: bumping the version typically means bumping the whole API.

Used by: Twilio (URL-versioned), most public REST APIs, and Stripe for its major version only (the /v1 prefix; its finer-grained date-based versions such as 2024-12-15 travel in a header, see below).

Header versioning: `Accept` content-type

The IETF-blessed approach: Accept: application/vnd.example+json;version=2. The URL stays stable; the version is in the negotiated content type. Stripe famously versions by date in a header: Stripe-Version: 2024-12-15.

Pros:
- Cleaner URLs (one resource, many representations).
- The server can choose what version to apply per-request, even mixing versions.
- Date-based versioning maps to "compatibility snapshot" semantics elegantly.

Cons:
- Less obvious to debug: you have to know to set a header to see the right version.
- Cache layers may need to vary on the header (Vary: Accept or Vary: Stripe-Version).
- Tooling support is uneven (CDNs, gateways).
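A minimal sketch of per-request version resolution for a header-versioned API. The header name `Api-Version`, the function name, and the list of supported versions are all hypothetical; the precedence (explicit header, then the account's pinned version, then latest) is one reasonable choice, not a description of any particular vendor's behavior.

```python
from typing import Mapping, Optional

# Hypothetical date-based versions this server can still serve, oldest first.
SUPPORTED_VERSIONS = ["2023-08-16", "2024-06-20", "2024-12-15"]


def resolve_version(headers: Mapping[str, str], account_pinned: Optional[str]) -> str:
    """Pick the API version for one request.

    Precedence: explicit header, then the account's pinned version, then latest.
    """
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    requested = normalized.get("api-version") or account_pinned or SUPPORTED_VERSIONS[-1]
    if requested not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported API version: {requested}")
    return requested
```

Whatever this function returns is also the value worth emitting in logs and metrics, and the header it reads is the one a cache would need to vary on.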

gRPC: field number stability + new RPC methods

gRPC fundamentally doesn't have "API versions" in the URL sense. The version discipline is:

Field numbers in a message are immutable. Once shipped, never reuse, never repurpose.
Adding fields is always safe (old clients ignore unknown tags).
Removing a field: mark its number as reserved in the schema and never reuse it.
Changing a method's semantics: don't. Add a new method (GetUserV2) and deprecate the old one.

The Protobuf field-number contract (§4a) is the entire versioning discipline. There's no version in the URL. The implicit version is "whatever this binary's schema says."
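To make "old clients ignore unknown tags" concrete, here is a hand-rolled sketch of a reader that decodes only the field numbers it knows and parses past the rest. It is not the Protobuf runtime: it handles only varint and length-delimited wire types, it drops unknown fields (real runtimes typically preserve them), and the message layout (id = 1, name = 2, a newer field 3) is a made-up example.

```python
from typing import Dict, Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_known(buf: bytes, known: Dict[int, str]) -> Dict[str, object]:
    """Decode only the field numbers listed in `known`, skipping everything else."""
    out: Dict[str, object] = {}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:      # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:    # length-delimited (bytes, string, sub-message)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if field_number in known:
            out[known[field_number]] = value
        # Unknown field numbers are parsed past and dropped; no error is raised.
    return out


# Bytes a newer writer might emit: id=42 (field 1), name="Ada" (field 2), new field 3 = 7.
NEW_WRITER_BYTES = bytes([0x08, 0x2A, 0x12, 0x03]) + b"Ada" + bytes([0x18, 0x07])
OLD_SCHEMA = {1: "id", 2: "name"}  # the old reader has never heard of field 3
```

Decoding `NEW_WRITER_BYTES` with `OLD_SCHEMA` yields only `id` and `name`. The same bytes would be misread, silently, if field 3 were later repurposed with a different meaning, which is why numbers are never reused.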

For breaking schema changes at the service level, the pattern is:
- Add a new service definition with a new package name (user.v2.UserService).
- Old callers continue calling user.v1.UserService. The server implements both.
- After all callers have migrated, remove v1.

Deprecation timelines

The "we run v1, v2, v3 simultaneously for 18 months" reality. Real deprecation timelines at large companies:

Phase	Timeline	Actions
Launch v2	t=0	Announce. Migration guide. Both versions live. Logs of v1 callers visible.
Soft deprecation	t+3 months	Docs mark v1 as deprecated. New customers default to v2. v1 still fully supported.
Warnings	t+6 months	v1 responses include `Deprecation` HTTP header. Outreach to known v1 callers.
Hard deprecation	t+12 months	v1 responses begin returning errors for new accounts. Existing callers grandfathered.
Sunset	t+18 months	v1 returns 410 Gone. Migration complete or customer is broken.

Stripe's "API versions" are essentially infinite: every customer is pinned to the version they signed up with, and Stripe maintains compatibility for years. Internal services bump them lazily. AWS APIs are similar — you can still call s3:PutObject against versions from 2006.

For internal east-west traffic, the timeline is months, not years. But the same discipline applies: never break a contract without warning; provide a migration path; keep the old version alive long enough that every caller can migrate without an emergency.

Versioning pitfalls

Coupled releases: "we'll deprecate v1 when we deploy v2 next sprint" is fantasy. You will run both for many months.
Implicit version drift: code that branches on if (version == "v1") {} else {} accumulates spaghetti. Prefer a clean v1 handler and a clean v2 handler in separate files; one routes to the other.
Forgetting to log version: when an incident hits, knowing which API version produced the bad response is critical. Always emit the resolved version in metrics and logs.
"v2 is just v1 with a few fixes": don't ship a "v2" for cosmetic changes. Adding fields to v1 is non-breaking. Only ship a new major version when you have a truly breaking change.

§17. GraphQL Depth

GraphQL is mentioned in §3 and §25 as a fit at the BFF (Backend For Frontend) layer. The deeper question is why — and the answer requires understanding GraphQL's specific machinery: the N+1 query problem, DataLoader batching, persistent queries, and Apollo Federation.

What GraphQL is, briefly

A query language over a typed schema (SDL — Schema Definition Language):

type User {
  id: ID!
  displayName: String!
  posts(limit: Int = 10): [Post!]!
}

type Post {
  id: ID!
  title: String!
  author: User!
  comments: [Comment!]!
}

type Query {
  user(id: ID!): User
}

Client sends a query specifying exactly the fields it wants:

query {
  user(id: "42") {
    displayName
    posts(limit: 3) {
      title
      comments {
        text
      }
    }
  }
}

Server resolves the query by calling resolvers — one function per field. The result tree matches the query tree exactly.

The N+1 problem

The canonical performance pathology. Given the query above, naïve resolver execution:

user(id: "42") resolver runs → 1 database call: SELECT * FROM users WHERE id = '42'.
user.posts resolver runs → 1 database call: SELECT * FROM posts WHERE author_id = '42' LIMIT 3. Returns 3 posts.
For each of 3 posts, post.comments resolver runs → 3 database calls: SELECT * FROM comments WHERE post_id = '...'. Returns N comments total.

Total: 1 + 1 + 3 = 5 database calls for a single user. Scale to 50 posts each with comments → 1 + 1 + 50 = 52 calls. N+1: one "parent" query produces N "child" queries that should have been batched.

The root cause: GraphQL resolvers are written per field, with no visibility into siblings. The post.comments resolver runs once per post, with no idea that 49 other posts are about to do the same thing.

DataLoader: batching and per-request caching

DataLoader is the canonical fix, originally from Facebook (Lee Byron). It's a per-request batching primitive:

Within a single request, when 50 resolvers each call commentsLoader.load(postId), DataLoader buffers the IDs.
At the end of the current tick (event loop), DataLoader calls a single user-defined batch function: batchCommentsByPostId([p1, p2, ..., p50]).
The batch function makes one database call: SELECT * FROM comments WHERE post_id IN (...).
DataLoader returns results to each waiting resolver.

The N+1 collapses to one batched call per "level" of the query tree (three calls for the query above: user, posts, comments), no matter how many siblings exist.

DataLoader also caches per-request: if userLoader.load(42) is called from two places in the same request, it's deduplicated. Across requests, no caching — each request gets a fresh DataLoader to avoid stale cache hazards.
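The buffering-then-dispatch behavior can be sketched with asyncio. This is a toy in the spirit of DataLoader, not the DataLoader library's API: the class name `TinyLoader`, the `demo` function, and the fake batch function are all hypothetical, and error handling (a failing batch function) is left out.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Hashable, List


class TinyLoader:
    """Per-request batching and caching, in the spirit of DataLoader."""

    def __init__(self, batch_fn: Callable[[List[Hashable]], Awaitable[list]]):
        self._batch_fn = batch_fn
        self._cache: Dict[Hashable, asyncio.Future] = {}
        self._pending: List[Hashable] = []

    def load(self, key: Hashable) -> asyncio.Future:
        if key in self._cache:                  # dedupe repeated keys within the request
            return self._cache[key]
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        self._cache[key] = future
        if not self._pending:                   # first key of this tick schedules one dispatch
            loop.call_soon(lambda: loop.create_task(self._dispatch()))
        self._pending.append(key)
        return future

    async def _dispatch(self) -> None:
        keys, self._pending = self._pending, []
        values = await self._batch_fn(keys)     # one backend call for every buffered key
        for key, value in zip(keys, values):
            self._cache[key].set_result(value)


async def demo() -> tuple:
    batches = []

    async def batch_comments_by_post_id(post_ids):
        batches.append(list(post_ids))          # stands in for one SELECT ... WHERE post_id IN (...)
        return [f"comments-for-{pid}" for pid in post_ids]

    loader = TinyLoader(batch_comments_by_post_id)  # create a fresh loader per request
    results = await asyncio.gather(*(loader.load(pid) for pid in range(50)))
    return batches, results
```

Running `asyncio.run(demo())` issues fifty `load` calls but records a single batch containing all fifty post IDs. The batch function must return values in the same order as the keys it was given, which is the contract this sketch relies on.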

The Staff-level discipline: every database/RPC call in a GraphQL resolver must go through DataLoader. If a developer adds a raw query, the N+1 returns.

Persistent queries (Automatic Persisted Queries — APQ)

GraphQL queries can be large. A single mobile screen's query might be 5–10 KB of GraphQL text per request. At scale, that's bandwidth + parse cost on every call.

Persistent queries: the client sends a hash of the query instead of the query text. The server has the query pre-registered (by build tooling, or via a "first request seeds the cache" protocol). The server looks up the hash, executes the registered query.

Apollo's APQ flow:
1. Client computes SHA256(query).
2. First request: sends only the hash. Server returns PersistedQueryNotFound.
3. Client retries with both hash and query. Server stores the mapping.
4. Subsequent requests: hash only. Server executes the stored query.
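The handshake can be simulated in a few lines. This is an in-memory sketch of the flow just described, not Apollo's implementation or wire format: the class `ApqServer`, the `send` helper, and the response dictionaries are hypothetical stand-ins.

```python
import hashlib
from typing import Dict, Optional


class ApqServer:
    """In-memory stand-in for a server-side persisted-query cache."""

    def __init__(self) -> None:
        self._queries: Dict[str, str] = {}

    def handle(self, sha256_hash: str, query: Optional[str] = None) -> dict:
        if query is not None:
            # Verify the hash really matches the text before storing the mapping.
            if hashlib.sha256(query.encode("utf-8")).hexdigest() != sha256_hash:
                return {"error": "hash does not match query"}
            self._queries[sha256_hash] = query
        stored = self._queries.get(sha256_hash)
        if stored is None:
            return {"error": "PersistedQueryNotFound"}
        return {"data": f"executed {len(stored)} bytes of query text"}


def send(server: ApqServer, query: str) -> tuple:
    """Client side: try hash-only first, fall back to hash + query once."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    attempts = 1
    response = server.handle(digest)
    if response.get("error") == "PersistedQueryNotFound":
        attempts += 1
        response = server.handle(digest, query)
    return attempts, response
```

The first `send` for a given query costs two round trips; every later one costs a single hash-only request. A safelist deployment would simply refuse the second step and accept only hashes registered at build time.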

Bandwidth saving: 5 KB query → 64-byte hash. At 1M QPS, 5 GB/s → 64 MB/s. Huge win for mobile.

Persistent queries also act as a security control — if the server only accepts pre-registered queries, an attacker can't probe the schema with novel queries. This is the "operations safelist" pattern used by Shopify, GitHub, and Netflix.

Apollo Federation: distributed GraphQL

Naïve GraphQL has one server implementing all resolvers. At scale, that becomes a monolith — every team contributes code to one process. Apollo Federation is the answer for "GraphQL across many services":

Each service is a subgraph owning some types. Service users owns User; service posts owns Post.
A gateway (Apollo Router or older Apollo Gateway) composes the subgraphs into one supergraph schema.
Client sends one GraphQL query to the gateway. Gateway plans which subgraphs to call for each piece of the query, calls them (in parallel where possible), and composes the result.
Types can be extended across subgraphs: Post (owned by posts) can have an author: User field whose User.displayName is resolved by the users subgraph. The gateway handles cross-subgraph stitching via @key directives.

Federation lets each team own its own subgraph and ship independently. The downside: the gateway is a new piece of critical infrastructure with non-trivial query-planning cost.

When GraphQL is right vs wrong

Right:
- Mobile apps: cellular bandwidth is precious; over-fetching costs real money. The screen specifies exactly what it needs.
- BFFs (Backend For Frontend): one BFF per client (web BFF, iOS BFF, Android BFF) is a viable architecture; GraphQL fits the "compose many backends into a client-shaped response" job.
- External APIs with diverse clients: GitHub's GraphQL API was added because the REST API was over-fetching for tools like dashboards. GraphQL clients ask for what they want.
- Federated organizations: many teams, one cohesive external API. Federation lets each team own a piece.

Wrong:
- Internal east-west high-throughput: GraphQL adds parsing, query planning, and resolver dispatch, overhead that doesn't matter at low scale and is unacceptable at high scale. Internal callers already know what fields they need; the query language gives them nothing.
- CRUD-heavy applications: writes in GraphQL (mutations) are awkward and don't compose. REST + JSON is fine.
- Caching-heavy applications: HTTP caching works on URLs; GraphQL queries are POST bodies. There are workarounds (persistent queries become GETs), but the impedance mismatch is real.
- When the schema isn't actually unified: federation only helps if the types are shared. If services own disjoint data, you don't need federation, just a thin BFF that calls each independently.

§18. REST Design at Scale

REST is the right answer for north-south traffic, but "REST" covers a wide range of designs from "pretty good" to "actively painful." A Staff-level REST API has discipline around resource modeling, idempotency, pagination, and response headers.

Resource modeling: nouns, not verbs

The core REST discipline: URLs name resources; HTTP methods name operations.

Good: - GET /users/42 — fetch the user. - POST /users — create a user. - PUT /users/42 — replace the user. - PATCH /users/42 — partial update. - DELETE /users/42 — delete. - GET /users/42/posts — collection of posts owned by user 42. - POST /users/42/posts — create a post for user 42.

Bad (RPC-shaped over REST): - POST /createUser, POST /getUser, POST /updateUser.

The win of resource modeling: HTTP caching works (GET on a stable URL is cacheable), idempotency is well-defined (PUT/DELETE are idempotent by spec), composition is natural (collections nest cleanly).

Idempotency-key header: Stripe's pattern

Problem: POST is not idempotent. A client retrying a POST /payments after a network blip may charge the customer twice.

Solution: an Idempotency-Key header. The client generates a UUID; the server stores (idempotency_key) → (response). On retry, the server detects the key was seen, returns the same response without re-executing.

The wire pattern:

POST /v1/charges
Idempotency-Key: 8e7c2a5d-3f4b-4abc-9d12-c1e5b6f8a907
Content-Type: application/json

{
  "amount": 4999,
  "currency": "usd",
  "source": "tok_visa"
}

Server-side:
1. Look up the idempotency key. If found, return stored response. Done.
2. If not found, acquire a lock on the key, execute the operation.
3. Store (key) → (request_hash, response).
4. Return the response.
5. Future calls with the same key see the stored response.

Stripe's specific discipline: idempotency keys are valid for 24 hours; if the request body differs from the original, return an error (so a client can't accidentally reuse a key for a different operation); keys are scoped to API endpoint.
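A minimal in-memory sketch of the server-side flow, assuming a single process. The names `IdempotencyStore` and `IdempotencyConflict` are hypothetical, the 24-hour expiry is omitted, and the sketch takes its lock before the lookup so that two concurrent retries cannot both miss and both execute.

```python
import hashlib
import json
import threading
from typing import Callable, Dict, Tuple


class IdempotencyConflict(Exception):
    """Raised when a key is reused with a different request body."""


class IdempotencyStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._records: Dict[str, Tuple[str, dict]] = {}  # key -> (request_hash, response)

    def execute(self, key: str, body: dict, operation: Callable[[dict], dict]) -> dict:
        # Canonical JSON so that key order in the body does not change the hash.
        request_hash = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode("utf-8")
        ).hexdigest()
        # One coarse lock keeps the sketch short; it serializes all keys, which a
        # real service would avoid with a per-key lock or a unique-key insert.
        with self._lock:
            if key in self._records:
                stored_hash, stored_response = self._records[key]
                if stored_hash != request_hash:
                    raise IdempotencyConflict(key)
                return stored_response          # replay: do not re-execute
            response = operation(body)
            self._records[key] = (request_hash, response)
            return response
```

A retry with the same key and body returns the stored response without calling `operation` again; the same key with a different body raises, which corresponds to the "return an error" rule above.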

This pattern makes POST safely retryable, which is the foundation of resilience on any write-heavy API. It is the canonical real-world example of "retry safety on writes."

Pagination: offset vs cursor

Offset pagination: GET /posts?offset=1000&limit=20. Trivial to implement; familiar to every developer. Breaks catastrophically at scale:

offset=1000000 requires the database to scan 1M rows and discard them. O(offset+limit) per page.
If new rows are inserted between page 1 and page 2, the offset shifts and the user sees duplicates or misses items.
Cache-unfriendly: the rows behind a given offset change with every insert, so a cached page goes stale almost immediately.

Cursor pagination: GET /posts?cursor=opaque_token&limit=20. The server returns rows after the cursor; the response includes a next_cursor. The cursor encodes the last-row's sort key (e.g., created_at + id).

Wire:

GET /v1/posts?limit=20
→ { "data": [...], "next_cursor": "eyJ0IjoxNzE0..." }

GET /v1/posts?cursor=eyJ0IjoxNzE0...&limit=20
→ { "data": [...], "next_cursor": "eyJ0IjoxNzE1..." }

Pros: - O(limit) per page regardless of how deep you've paged. - Insert-safe: new rows don't shift the cursor. - Cacheable: each cursor uniquely identifies a slice.

Cons: - No "jump to page 47" — strictly sequential. - Cursor opacity is critical (clients shouldn't parse cursors; the server can change cursor format).
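A sketch of keyset (cursor) pagination over an in-memory list, assuming rows sorted by (created_at, id). The cursor layout (base64 of a small JSON object) and the function names are illustrative choices; the list comprehension stands in for an indexed range query and is itself O(n).

```python
import base64
import json
from typing import List, Optional, Tuple

Row = Tuple[int, int]  # (created_at, id); rows are kept sorted by this pair


def encode_cursor(row: Row) -> str:
    payload = json.dumps({"t": row[0], "id": row[1]}).encode("utf-8")
    return base64.urlsafe_b64encode(payload).decode("ascii")


def decode_cursor(cursor: str) -> Row:
    payload = json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))
    return (payload["t"], payload["id"])


def list_page(rows: List[Row], limit: int, cursor: Optional[str] = None) -> dict:
    """Return up to `limit` rows strictly after the cursor position."""
    after = decode_cursor(cursor) if cursor else None
    # In SQL this would be roughly:
    #   WHERE (created_at, id) > (:t, :id) ORDER BY created_at, id LIMIT :limit
    page = [row for row in rows if after is None or row > after][:limit]
    has_more = bool(page) and page[-1] != rows[-1]
    return {"data": page, "next_cursor": encode_cursor(page[-1]) if has_more else None}
```

Because the cursor names a position in the sort order rather than a row count, inserting a row between two page fetches does not shift what the next page returns. Including `id` as a tiebreaker matters whenever `created_at` values can collide.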

Production at scale uses cursor pagination always. Twitter's API, Stripe's API, GitHub's API — all cursor-based. The "offset breaks at scale" failure mode shows up when a large customer paginates deep into a list and the database starts taking 10 seconds per page.

Rate-limit response headers (X-RateLimit-*)

Returning rate-limit metadata in response headers lets clients self-regulate without trial-and-error:

HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 847
X-RateLimit-Reset: 1716163200

This is GitHub's convention. The fields:
- X-RateLimit-Limit: the cap.
- X-RateLimit-Remaining: how many you have left in the current window.
- X-RateLimit-Reset: Unix timestamp when the window resets.
- Retry-After: 60 on a 429 response: how long to wait before retrying.

Modern IETF draft headers: RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset (without the X- prefix). Some APIs serve both for transition.
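On the client side, self-regulation amounts to a small decision function. The sketch below assumes the X-RateLimit-* names shown above and a Retry-After value in seconds; the function name is hypothetical, and the HTTP-date form of Retry-After is not handled.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(status: int, headers: Mapping[str, str],
                    now: Optional[float] = None) -> float:
    """How long a client should pause before its next request."""
    now = time.time() if now is None else now
    h = {name.lower(): value for name, value in headers.items()}
    if status == 429 and "retry-after" in h:
        # Retry-After may also be an HTTP date; this sketch handles only delta-seconds.
        return float(h["retry-after"])
    remaining = int(h.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0
    # Budget exhausted: sleep until the window resets.
    reset_at = float(h.get("x-ratelimit-reset", now))
    return max(0.0, reset_at - now)
```

A client that calls this after every response never needs to discover the limit by hitting it; the 429 path is only the fallback.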

HATEOAS in theory vs practice

HATEOAS (Hypermedia As The Engine Of Application State) is the "true REST" doctrine: responses include links to related resources, and clients navigate the API by following links rather than constructing URLs.

Theoretical pattern:

{
  "id": 42,
  "name": "Yifan",
  "_links": {
    "self": "/users/42",
    "posts": "/users/42/posts",
    "delete": "/users/42"
  }
}

In practice, almost nobody does this seriously. The reasons: - Clients are written against the docs, not by exploring the API at runtime. The link is dead weight. - Tooling (SDKs, codegen) doesn't benefit from HATEOAS. - Pure REST adherents are a minority; "HTTP API with JSON" is what people mean by REST in 2026.

The pragmatic exception: pagination cursors ("next_cursor") are a HATEOAS-flavored design that does survive at scale, because the cursor is opaque and the client really doesn't construct it. Other than that, HATEOAS remains aspirational.

§19. Connect RPC and Twirp

Between gRPC's full feature set and REST's lowest-common-denominator simplicity sits a class of frameworks that get most of gRPC's benefits — Protobuf schemas, typed errors, codegen — without the operational headaches of gRPC's HTTP/2-only, trailers-required wire format. The two notable members are Connect RPC (Buf, 2022) and Twirp (Twitch, 2018).

Twirp: Twitch's simpler protocol

Twirp was Twitch's response to gRPC for internal services. The design choices:

Protocol: HTTP/1.1 OR HTTP/2.
Wire format: Protobuf (binary) OR JSON, chosen via Content-Type header.
No trailers: status codes are returned via HTTP status code + a structured JSON error body.
No streaming: unary RPC only.
No deadlines in the wire protocol: relies on HTTP-layer timeouts.

The URL convention: POST /twirp/<package>.<Service>/<Method>. Both server and client are codegen from .proto.

What you give up vs gRPC: - Streaming. - HTTP/2 multiplexing (when using HTTP/1.1). - grpc-timeout propagation.

What you gain: - Works over HTTP/1.1 — every proxy, every CDN, every browser supports it. - Debuggable with curl when in JSON mode. - No HTTP/2 negotiation headaches.
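Because a Twirp call is an ordinary POST, a JSON-mode request can be assembled with nothing but the standard library. The sketch below builds the request without sending it; the package `example.users`, the service `UserService`, the method `GetUser`, and the localhost address are hypothetical.

```python
import json
import urllib.request


def build_twirp_request(base_url: str, package: str, service: str, method: str,
                        message: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON-mode Twirp call."""
    url = f"{base_url.rstrip('/')}/twirp/{package}.{service}/{method}"
    return urllib.request.Request(
        url,
        data=json.dumps(message).encode("utf-8"),
        # application/json selects JSON mode; application/protobuf selects binary mode.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical service, used only to show the URL shape.
req = build_twirp_request("http://localhost:8080", "example.users", "UserService",
                          "GetUser", {"id": "42"})
```

The same request is what `curl -X POST -H 'Content-Type: application/json' -d '{"id":"42"}' <url>` would send, which is the debuggability argument in practice.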

Connect RPC: gRPC-compatible without trailers

Connect (from Buf, 2022) is a newer entrant that aimed to fix gRPC's browser problem while keeping gRPC interop. Design choices:

Protocol: HTTP/1.1 or HTTP/2.
Wire format: Protobuf OR JSON.
gRPC wire-protocol compatible: a Connect server can speak vanilla gRPC for callers that want it.
Connect-specific protocol for non-gRPC callers: status codes mapped to HTTP status + headers, no trailers required.
Streaming: server-streaming is supported (over HTTP/2 with the same length-prefix framing as gRPC); bidi requires HTTP/2.
Deadline propagation: works similarly to gRPC.

Connect's killer feature: a Connect-codegen client running in a browser can call a Connect server without a separate gRPC-Web proxy. The protocol is browser-fetch-compatible.

When these matter

Use Connect or Twirp when:

Browsers without gRPC-Web: native gRPC requires HTTP/2 trailers, which browser fetch doesn't expose. gRPC-Web works but requires a separate proxy (Envoy or Buf Proxy). Connect skips the proxy.
Simpler debugging: JSON mode lets you curl requests during development. Switch to Protobuf in production for efficiency.
You're not using streaming: if your services are 95% unary calls, the protocol simplification is worth more than the streaming features you don't use.
Heterogeneous proxy chains: corporate networks where intermediate proxies do weird things to HTTP/2 may favor HTTP/1.1 + simpler semantics.

Use plain gRPC when:

Streaming is heavy: bidi voice/chat, log-streaming, etc.
You're already on Istio/Envoy: the trailers issue is solved by the sidecar; no extra benefit from Connect.
Highest-throughput east-west: gRPC's HTTP/2 multiplexing is mature and squeezed for performance.

Twirp is in maintenance mode in 2026; Connect is the more active project. Both are valid choices; Connect is the more conservative bet for new code.

§20. HTTP/3 and QUIC

HTTP/3 — HTTP over QUIC — is the next-generation transport. The Staff-level depth signal here is being able to name TCP-layer head-of-line blocking as the specific motivation, and connection migration as the killer feature for mobile.

QUIC: UDP-based transport

QUIC (Quick UDP Internet Connections) is a transport-layer protocol developed at Google (2012, IETF-standardized 2021). The architecture flip: instead of TCP + TLS + HTTP/2 in three layers, QUIC combines transport, encryption, and multiplexing into one protocol over UDP.

Properties:
- UDP-based: QUIC is a transport protocol implemented in user space on top of UDP. No kernel TCP stack involved.
- Built-in encryption: TLS 1.3 is mandatory and baked in. Not a layer; integral.
- Stream multiplexing at the transport layer: each QUIC stream has its own ordering and flow control. Independent of every other stream.
- 0-RTT handshake: on reconnection, the client can send application data with the very first packet (using a previous session's keys). 1-RTT is also faster than TLS-over-TCP because the encryption and transport handshakes merge.
- Connection migration: the connection is identified by a Connection ID (variable-length, up to 20 bytes in RFC 9000; 64 bits in Google's original design), not by the (src_ip, src_port, dst_ip, dst_port) 4-tuple. If the client's IP changes (mobile switching from Wi-Fi to LTE), the connection survives.

No head-of-line blocking at the transport layer

The killer architectural feature, repeated from §13 problem 5:

HTTP/2 over TCP: TCP guarantees in-order byte delivery. If one packet is lost, the kernel withholds all subsequent bytes until the lost packet is retransmitted. Even if those subsequent bytes belong to a different HTTP/2 stream, they sit in the kernel buffer, blocked. HTTP/2 multiplexing is undone by TCP-layer head-of-line blocking under packet loss.
HTTP/3 over QUIC: each QUIC stream is independently ordered. Packet loss on stream 7 only blocks stream 7. Streams 8, 9, 10 deliver immediately. The multiplexing gain HTTP/2 promised is actually realized end-to-end.

In a clean intra-datacenter network (loss ~0.001%), this barely matters. In a lossy mobile network (loss 1–5% during cell handoff), the p99 latency difference can be substantial.
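A toy probability model shows why the gap opens up only under loss. It assumes independent per-packet loss, equal-sized streams sharing one connection, and no congestion-control effects; the TCP figure is a worst case (a stream that sits behind all other traffic in the byte stream). The function names are illustrative.

```python
def p_stream_delayed_tcp(loss: float, packets_per_stream: int, streams: int) -> float:
    """Worst case over one TCP connection: a loss anywhere stalls every stream behind it."""
    return 1.0 - (1.0 - loss) ** (packets_per_stream * streams)


def p_stream_delayed_quic(loss: float, packets_per_stream: int) -> float:
    """Independent streams: only a loss among this stream's own packets stalls it."""
    return 1.0 - (1.0 - loss) ** packets_per_stream
```

With 2% loss, 10 packets per stream, and 20 multiplexed streams, the model gives roughly 0.18 for QUIC and roughly 0.98 for the TCP worst case. At the 0.001% loss quoted for a clean datacenter, both values are close to zero, which matches the "barely matters" claim.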

Connection migration: the mobile killer

A mobile user is on the subway. Their phone disconnects from one cell tower, reconnects to the next, and their IP address changes. With TCP, the connection (identified by 4-tuple) dies; every in-flight request fails; the client has to reconnect and retry. The user sees a brief stall.

With QUIC, the Connection ID stays the same. Packets arriving from the client's new IP carry the same Connection ID, so the server recognizes them as belonging to the existing connection (after validating the new path) and continues. Streams in flight survive the IP change with no application-visible failure.
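The difference comes down to which key the server uses to find connection state. A deliberately simplified sketch, with hypothetical class names, documentation-range IP addresses, and a made-up Connection ID; real QUIC also validates the new path before migrating, which is not modeled here.

```python
from typing import Dict, Optional, Tuple

FourTuple = Tuple[str, int, str, int]  # (src_ip, src_port, dst_ip, dst_port)


class TcpStyleTable:
    """Connections keyed by the 4-tuple, as a TCP stack does."""

    def __init__(self) -> None:
        self.conns: Dict[FourTuple, str] = {}

    def lookup(self, four_tuple: FourTuple, connection_id: bytes) -> Optional[str]:
        return self.conns.get(four_tuple)       # connection_id plays no role


class QuicStyleTable:
    """Connections keyed by Connection ID, independent of the packet's addresses."""

    def __init__(self) -> None:
        self.conns: Dict[bytes, str] = {}

    def lookup(self, four_tuple: FourTuple, connection_id: bytes) -> Optional[str]:
        return self.conns.get(connection_id)    # four_tuple plays no role


wifi = ("192.0.2.10", 50000, "203.0.113.5", 443)
lte = ("198.51.100.77", 41234, "203.0.113.5", 443)  # same client after the network switch
cid = bytes.fromhex("a1b2c3d4e5f60718")             # hypothetical Connection ID

tcp, quic = TcpStyleTable(), QuicStyleTable()
tcp.conns[wifi] = "session-1"
quic.conns[cid] = "session-1"
```

A packet arriving from the `lte` address finds nothing in the TCP-style table and finds `session-1` in the QUIC-style one.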

Cloudflare published numbers showing 0-RTT QUIC's effect on mobile page-load time: median ~0% change, p99 ~30% better. The tail is where the win is.

Production adoption status

In 2026:

Cloudflare: HTTP/3 is the default on their edge network. ~50% of HTTP/3-capable browsers connecting to Cloudflare-hosted sites speak HTTP/3.
Google: serves search, YouTube, and most of their properties over HTTP/3 by default.
Meta (Facebook, Instagram): HTTP/3 for the mobile apps; web is mixed.
Apple: iOS uses HTTP/3 where the server supports it.
AWS, Azure, GCP: HTTP/3 termination at their CDN/edge products. Internal services largely still on HTTP/2.

For internal east-west traffic, HTTP/2 is still the dominant choice. Reasons:
- Intra-DC networks are clean (near-zero packet loss); QUIC's tail-latency advantage doesn't manifest.
- UDP-based transport interacts awkwardly with some load balancers and firewalls.
- QUIC implementations are more CPU-hungry than mature TCP (less hardware offload).
- HTTP/2's mature tooling and observability haven't been matched yet.

For edge / public-facing / mobile-heavy traffic, HTTP/3 is the default in 2026.

The Staff-level rule: HTTP/3 for the last mile to clients; HTTP/2 (or even HTTP/1.1 internally with mesh-managed connection pools) for east-west.

§21. Ambient Mesh and Cilium

The sidecar pattern (§6) has been the dominant service-mesh data-plane architecture since Istio 1.0 and Linkerd 2.0 (both 2018). The per-pod proxy is also the pattern's primary cost: every pod runs its own copy of the proxy (Envoy, in Istio's case), consuming 100–300 MB of RAM and 5–15% of CPU. At 10,000 pods, that's 1–3 TB of RAM and 500–1,500 cores just for sidecars. The next-generation answer is ambient mesh (Istio) and eBPF-based meshes (Cilium).

Istio Ambient Mode: ztunnel + waypoint

Istio's ambient mode (introduced 2022, GA in 2024) splits the data plane in two:

ztunnel: a per-node (not per-pod) component running on every Kubernetes node. Handles the L4 layer: mTLS termination, L4 authorization, transport-level telemetry. One process per node serves all pods on that node.
waypoint proxy: an optional per-namespace (or per-service-account) component that handles L7 features when needed: HTTP routing, retries, request-level authorization. Deployed only where L7 features are configured; many namespaces never need a waypoint.

The architectural saving: - 10,000 pods × Envoy sidecar = 10,000 Envoy instances. - 500 nodes × ztunnel + 50 namespaces × waypoint = 550 instances. - Roughly 20x fewer proxy instances.

Trade-off: ztunnel is a smaller, more limited proxy (it's L4 + mTLS only). L7 features pay an extra hop through the waypoint proxy. For pure L4 + mTLS, ambient is strictly cheaper. For complex L7 routing, the latency is slightly higher than per-pod Envoy (~100 µs).

The migration story: Istio supports both modes simultaneously. A namespace can be "ambient" while another is "sidecar." Move gradually.

Cilium Service Mesh: eBPF-based, no sidecar at all

Cilium is a CNI (Container Network Interface) plugin that uses eBPF (extended Berkeley Packet Filter) — programs running in the Linux kernel — to implement networking, security, and observability. Cilium Service Mesh extends this with mesh features.

The architecture:

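Kernel-level data plane: eBPF programs attached to the kernel's network stack handle packet processing, L4 LB, traffic policy, and observability. Mutual authentication and encryption are also handled below the application (in current releases, an identity handshake performed outside the data path, paired with WireGuard or IPsec encryption between nodes, rather than per-connection TLS in a sidecar).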
No sidecar process: there is no per-pod Envoy. The application talks directly to the network; eBPF intercepts and applies policy.
L7 features via embedded Envoy when needed: for HTTP routing or complex L7 policy, Cilium runs Envoy as a per-node process (similar in spirit to Istio ambient's ztunnel + waypoint split).

The savings are even larger than ambient Istio: - No per-pod proxy: zero per-pod proxy CPU/RAM overhead. - In-kernel forwarding: avoids the userspace ↔ kernel context switches that sidecars require, lowering latency.

Trade-offs: - Kernel version requirements: eBPF features evolve with kernel versions. Cilium pushes the boundary; older kernels can't run all features. - Less mature: fewer organizations have deployed Cilium Service Mesh at extreme scale compared to Istio. - Different mental model: kernel-resident networking is harder to debug for developers who think in terms of processes.

When the post-sidecar shift makes sense

The Staff-level position in 2026:

Small/mid scale (< 100 services): sidecars are fine. The overhead is real but the operational simplicity wins.
Large scale (100–10000 services): ambient mesh is the strong default. The sidecar CPU/RAM tax is a budgetary line item; ambient delivers the same features at ~20% the resource cost.
Hyperscale (Google, Meta, Cloudflare): proxyless or eBPF. Sidecar resource cost at this scale is in the millions of dollars annually; engineering investment in proxyless gRPC libraries or eBPF data planes pays back.

The historical arc: from "one library per language" (early microservices) → "sidecar proxy" (Istio era) → "per-node proxy + optional L7" (ambient) → "kernel-resident eBPF" (Cilium). Each step trades implementation complexity for resource efficiency.

§22. Cost Economics

A Staff engineer should be able to discuss the RPC fabric in dollars, not just in milliseconds. The major cost lines:

Serialization CPU

Protobuf decode vs JSON parse: - Protobuf: ~2–5 ns/byte (generated code, no reflection). 1 KB ≈ 2–5 µs. - JSON (Jackson, Java): ~30–80 ns/byte. 1 KB ≈ 30–80 µs. - Protobuf is ~10x cheaper than JSON at decode.

At 1M RPCs/sec with 1 KB messages, the CPU difference is: - JSON: 30–80 µs × 1M = 30–80 seconds of CPU per second of wall time = 30–80 cores. - Protobuf: 2–5 µs × 1M = 2–5 cores.

At AWS rates (~$0.04 per core-hour for steady on-demand), the annual difference per million-QPS service, taking the high end of the gap (80 − 5 = 75 cores):
- 75 cores × 24 × 365 × $0.04 ≈ $26,000/year per million QPS for the JSON tax, just on parsing.
- Across the fleet at LinkedIn or Netflix scale, this aggregates to millions per year.

Encoding is similar (Protobuf encode ~5–10 ns/byte; JSON serialize ~50 ns/byte). Total round-trip CPU savings: ~10x.
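The arithmetic above reduces to two small functions, sketched here so the inputs can be swapped for your own measurements. The function names are illustrative, 1 KB is taken as 1,000 bytes to match the round figures in the text, and the per-byte costs are the document's estimates rather than benchmarks.

```python
def cores_needed(ns_per_byte: float, message_bytes: int, rpcs_per_sec: float) -> float:
    """CPU-seconds consumed per wall-clock second, i.e. fully busy cores."""
    return ns_per_byte * message_bytes * rpcs_per_sec / 1e9


def annual_cost_usd(cores: float, usd_per_core_hour: float) -> float:
    """Cost of keeping that many cores busy for a year."""
    return cores * 24 * 365 * usd_per_core_hour
```

For example, `cores_needed(80, 1000, 1e6)` is 80 cores for JSON at the high end and `cores_needed(5, 1000, 1e6)` is 5 for Protobuf; `annual_cost_usd(75, 0.04)` is the ~$26,000/year figure.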

mTLS overhead

A single TLS 1.3 handshake costs ~500 µs of CPU + 1 round trip. Per-connection, not per-call.

For a long-lived HTTP/2 connection carrying 1M RPCs before being recycled: - Handshake amortized cost per RPC: ~0.5 ns. Negligible. - Per-RPC encryption (AES-NI hardware accelerated): ~1–2 ns per byte. 1 KB ≈ 1–2 µs. Smaller than the framework decode cost.
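The amortization is a single division, but writing it down makes the sensitivity to connection lifetime obvious. A sketch using the document's ~500 µs handshake estimate; the function name is illustrative.

```python
def handshake_ns_per_rpc(handshake_us: float, rpcs_per_connection: int) -> float:
    """Handshake CPU spread evenly over every RPC the connection carries."""
    return handshake_us * 1000.0 / rpcs_per_connection
```

At 1M RPCs per connection this is 0.5 ns per RPC. At 10 RPCs per connection it is 50,000 ns (50 µs) per RPC, which is larger than the JSON parse cost estimated earlier.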

The cost matters only if connection turnover is high (short-lived clients, frequent reaping). At scale, keep connections warm; mTLS overhead is then noise.

Bandwidth and egress

Cross-AZ (within-region) data transfer on AWS: $0.01/GB. Cross-region data transfer: $0.02/GB. Egress to the internet: $0.05–$0.09/GB depending on region.

At 10M RPCs/sec with 2 KB JSON payloads: - 10M × 2 KB = 20 GB/s = 1.7 PB/day in each direction. - At cross-AZ ($0.01/GB), 20 GB/s × 86400 s × $0.01 = ~$17,000/day = $6M/year just for bandwidth.

Switch to Protobuf with ~30% the wire size: - 6 GB/s = 500 TB/day. - Cost: ~$5,000/day = $1.8M/year. - Annual bandwidth saving: ~$4M for that one service.

Cross-region adds a multiplier: a chat service that has cross-region replication is paying $0.02/GB; at the same volume, ~$12M/year on JSON vs ~$3.6M on Protobuf.
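The bandwidth figures can be reproduced with one function. A sketch with an illustrative name, counting one direction only and taking 1 GB as 10^9 bytes and 2 KB as 2,000 bytes, as the round numbers in the text do; the prices are the per-GB rates quoted above.

```python
def annual_transfer_cost_usd(rpcs_per_sec: float, payload_bytes: float,
                             usd_per_gb: float) -> float:
    """Annual data-transfer cost for one direction of traffic."""
    gb_per_sec = rpcs_per_sec * payload_bytes / 1e9
    return gb_per_sec * 86_400 * 365 * usd_per_gb
```

`annual_transfer_cost_usd(10e6, 2000, 0.01)` comes to about $6.3M for JSON and `annual_transfer_cost_usd(10e6, 600, 0.01)` to about $1.9M for Protobuf, a difference of roughly $4.4M. Doubling the per-GB price for cross-region doubles both.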

This is the "Google publishes Protobuf benchmarks at 10x" reality: those benchmarks aren't academic. The annual savings at scale fund entire engineering teams.

Sidecar overhead

Per-pod Envoy: ~100–300 MB RAM, ~5–15% CPU.

At 10,000 pods: - Cumulative RAM: 1–3 TB. - Cumulative CPU: 500–1500 cores.

At AWS EC2 m5.xlarge rates ($0.192/hour for 4 vCPU + 16 GB RAM): - 1000 cores × 24 × 365 × ($0.192/4) = $420,000/year just on sidecar CPU. - Plus the RAM. Plus the per-pod overhead of running an extra process (operational complexity).
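The same calculation as a function, so that pod count and overhead can be varied. This is a sketch with an illustrative name; it assumes each pod is sized at about one core (so "10% CPU" means 0.1 core) and prices a core as an equal share of the instance's hourly rate.

```python
def sidecar_cpu_cost_usd(pods: int, cpu_fraction_per_pod: float,
                         usd_per_instance_hour: float, vcpus_per_instance: int) -> float:
    """Annual cost of the CPU that sidecars consume across the fleet."""
    cores = pods * cpu_fraction_per_pod
    usd_per_core_hour = usd_per_instance_hour / vcpus_per_instance
    return cores * 24 * 365 * usd_per_core_hour
```

`sidecar_cpu_cost_usd(10_000, 0.10, 0.192, 4)` reproduces the ~$420,000/year figure; the 5% and 15% ends of the range give about half and one and a half times that.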

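This is why ambient mesh and eBPF-based meshes (§21) exist: at 10,000 pods the sidecar CPU alone is roughly $0.4M/year before RAM, and at fleets several times that size, cutting proxy count by 20x removes a multi-million-dollar annual cost line.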

Connection count

Each TCP connection holds kernel buffers (~64 KB) and a file descriptor. At 1M concurrent connections, that's ~64 GB of kernel memory and file-descriptor pressure.

HTTP/2 multiplexing (100–1000 streams per connection) reduces connection count by ~100x compared to HTTP/1.1. The savings are infrastructure-wide.

Egress to public clients

For public APIs, egress to the internet is the dominant cost line. JSON is unavoidable here (browsers, third parties). The mitigation is compression (gzip, Brotli) — JSON compresses to 20–30% of its raw size, putting it in the same ballpark as Protobuf's wire size.

A public REST API serving 100 GB/s of compressed JSON moves about 8.6 PB/day; at list-price egress of $0.05/GB that is roughly $430,000/day. The framework choice (REST vs gRPC) is small compared to whether you compress.

The headline numbers

For a Staff engineer to commit to memory:

Cost line	Order of magnitude
Protobuf vs JSON CPU	~10x cheaper for Protobuf
Protobuf vs JSON bandwidth	~3x smaller for Protobuf
Cross-AZ bandwidth	$0.01/GB (~$860/day per GB/s sustained)
Cross-region bandwidth	$0.02/GB
Sidecar overhead per pod	~10% CPU, ~200 MB RAM
TLS overhead per long-lived connection	negligible (~0%)
Ambient mesh vs sidecar savings	~20x fewer proxy instances

When discussing "why we picked gRPC east-west" at fleet scale, the answer is rarely "because it's fast" — it's "because of the millions of dollars in CPU, bandwidth, and infra we don't have to spend."

§23. Why Not Just REST + JSON for Everything

The naïve answer to "how do we do inter-service communication?" is "just use REST + JSON. Every language has an HTTP library. Every developer understands curl." Let's dissect why this breaks at scale.

1. Wire size. A typical internal RPC payload, fully verbose JSON, is 3–10x the size of the equivalent Protobuf. At 10M QPS × 2 KB JSON vs 600 bytes Protobuf, the bandwidth is ~20 GB/s vs ~6 GB/s per direction across the fleet, a ~14 GB/s difference. Cross-AZ bandwidth is $0.01/GB. That's a ~$4M/year cost line for just the JSON tax. At Google scale (10B RPCs/sec), the math is orders of magnitude worse.

2. Serialization CPU. JSON parsing in Java (Jackson) is ~30–80 ns/byte. Protobuf parsing is ~2–5 ns/byte. At fleet scale, that's the difference between 50 cores and 5 cores per million QPS just for parsing. At LinkedIn scale, hundreds of host-years per year of saved CPU. Real money.

3. No streaming. REST has chunked transfer encoding, SSE (Server-Sent Events), and WebSockets — none of them are a streaming RPC abstraction. You bake your own framing, your own backpressure, your own keep-alive, your own reconnect. With gRPC, server-streaming is returns (stream Post) in the .proto and one method on the stub. With REST, hundreds of lines of framing + cancellation logic per service.

4. No deadline propagation. HTTP has Connection: keep-alive and whatever timeout the load balancer enforces. You can put an X-Deadline header in your application code, but every service must agree on convention and propagate correctly. Half of them won't, and you discover it during the next incident. gRPC's grpc-timeout is built into the protocol and enforced by the library.

5. No multiplexing in HTTP/1.1. One in-flight request per connection. At high RPS, huge connection pools. HTTP/2 fixes this, but REST-over-HTTP/2 still has client libraries and intermediate proxies that downgrade or assume HTTP/1.1 semantics. gRPC enforces HTTP/2 end-to-end.

6. No built-in flow control per call. HTTP/2 provides per-stream flow control, and gRPC exposes it to every call. With REST over HTTP/1.1 you usually get TCP-level flow control only; one slow consumer fills your network buffers, and you don't know which client is to blame.

7. No schema evolution discipline. JSON is "anything goes" — clients send what they want, servers parse what they recognize. Six months later, 14 different shapes of the "same" payload across your fleet. Protobuf field-number rules + buf breaking force discipline.

8. Lossy status semantics. HTTP has 200 OK, 4xx, 5xx. gRPC has 17 named status codes, each prescribing a different client recovery action. With REST you convey this in error bodies and clients parse them — yet another inconsistent convention.
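The idea that each status code prescribes a recovery action can be made concrete with a small lookup table. This is a minimal sketch: the Action names and the specific code-to-action mapping are illustrative assumptions, not a policy defined by gRPC, and a production table would also consider whether the method is idempotent.

```python
from enum import Enum


class Action(Enum):
    RETRY_WITH_BACKOFF = "retry with backoff"
    FAIL_FAST = "fail fast and surface the error"
    FIX_REQUEST = "do not retry; the request itself is wrong"
    REAUTHENTICATE = "refresh credentials, then retry once"


# Hypothetical client policy keyed by gRPC status code name.
RECOVERY = {
    "UNAVAILABLE": Action.RETRY_WITH_BACKOFF,
    "RESOURCE_EXHAUSTED": Action.RETRY_WITH_BACKOFF,
    "ABORTED": Action.RETRY_WITH_BACKOFF,
    "DEADLINE_EXCEEDED": Action.FAIL_FAST,   # the budget is already spent
    "CANCELLED": Action.FAIL_FAST,
    "INTERNAL": Action.FAIL_FAST,
    "INVALID_ARGUMENT": Action.FIX_REQUEST,
    "NOT_FOUND": Action.FIX_REQUEST,
    "PERMISSION_DENIED": Action.FIX_REQUEST,
    "UNAUTHENTICATED": Action.REAUTHENTICATE,
}


def recovery_action(status: str) -> Action:
    # Codes missing from the table fail fast rather than retrying blindly.
    return RECOVERY.get(status, Action.FAIL_FAST)
```

With REST, the equivalent table has to be keyed on whatever error-body convention each service invented, which is why it rarely exists.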

Counterpoint: REST IS the right answer for:
- External APIs (public, partner integrations). Customer uses curl; reads JSON.
- Web browsers (without a gRPC-Web proxy). Browser APIs do not expose HTTP/2 trailers, which gRPC uses to carry its status.
- Low-volume, debuggability-heavy interfaces (admin APIs).
- Migrating off something worse (SOAP, RPC over JMS).

The right answer at scale is both: REST + JSON north-south, a typed binary RPC (gRPC + Protobuf by default) east-west. Google, LinkedIn, Netflix, Uber, and Stripe have all converged on some version of this split.
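The wire-size and serialization estimates in items 1 and 2 are back-of-envelope arithmetic, and it can help to see them written out. The sketch below uses the figures quoted in the text (10M QPS, 2 KB JSON, 600 B Protobuf, $0.01/GB cross-AZ, 30 ns/byte for JSON and 5 ns/byte for Protobuf). It assumes every byte crosses an AZ boundary and that parsing cost is linear in payload size; both are simplifications.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def bandwidth_gb_per_s(qps, payload_bytes):
    """Sustained one-direction bandwidth in GB/s."""
    return qps * payload_bytes / 1e9


def annual_transfer_cost(gb_per_s, dollars_per_gb):
    """Yearly cost of a sustained transfer rate at a flat per-GB price."""
    return gb_per_s * SECONDS_PER_YEAR * dollars_per_gb


def parse_cores(qps, payload_bytes, ns_per_byte):
    """CPU cores kept busy by parsing alone (core-seconds per second)."""
    return qps * payload_bytes * ns_per_byte / 1e9


json_bw = bandwidth_gb_per_s(10_000_000, 2000)    # 20 GB/s
proto_bw = bandwidth_gb_per_s(10_000_000, 600)    # 6 GB/s
json_tax = annual_transfer_cost(json_bw - proto_bw, 0.01)

json_cores = parse_cores(1_000_000, 2000, 30)     # per million QPS
proto_cores = parse_cores(1_000_000, 600, 5)

print(f"extra bandwidth: {json_bw - proto_bw:.0f} GB/s, ~${json_tax / 1e6:.1f}M/year")
print(f"parse cores per 1M QPS: JSON ~{json_cores:.0f}, Protobuf ~{proto_cores:.0f}")
```

The exact outputs move with the assumed payload sizes and per-byte costs; the point is that both gaps are roughly an order of magnitude, not a few percent.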

§24. Scaling Axes

Type 1: Uniform expansion (more services, more calls)

Topology evolution as services multiply:

Scale	Topology
1x (10 services, 10k QPS)	Direct service-to-service over HTTP/1.1 + JSON. LBs in front. No mesh.
10x (100 services, 100k QPS)	Adopt gRPC for hot internal paths. Service registry (Consul). Client-side LB in libraries. Per-team retry policies (already drifting).
100x (1000 services, 1M QPS)	Service mesh becomes mandatory. Sidecar proxies (Envoy) handle mTLS, retries, circuit breakers. Control plane pushes config via xDS. Schema registry. CI breaking-change checks.
1000x (10000 services, 100M QPS)	Mesh becomes a critical resource sink (sidecar CPU rivals app CPU). Sidecarless mesh (per-node proxies or eBPF) or proxyless (xDS into a thick client library). Hierarchical service catalog. Region-aware routing. Hedged requests.

Inflection points:
- ~50–100 services: "We can't reason about retry/timeout policies anymore." Adopt mesh.
- ~1000 services: "Sidecar CPU is 30% of pod cost." Move to proxyless or eBPF.
- ~10000 services: "Schema registry is critical infra. Schema review is a Staff-level role."

Type 2: One service becomes hot

Same N services, but service X grows from 1k QPS to 1M QPS.

Stage	Adaptation
1x	One instance, no problem.
10x	Scale horizontally. Round-robin LB. Stateless ideal.
100x	Connection pool exhaustion at clients (each maintaining N connections to X). Subsetting: each client connects to a random subset of 20 X instances rather than all of them. This is exactly the failure mode that LinkedIn's D2 (Dynamic Discovery) was built to solve: at 1000 instances on each side, naïve any-to-any LB is 1M connections for a single caller/callee pair (1000 sockets per host, per dependency).
1000x	Caller-side hot-key problems: one user generates 30% of QPS to X. Caller-side rate limit + admission control at X.
10000x	X must shard internally. Mesh routes by request key (consistent hash). Hedged requests for the tail. Read-through cache in front for read-heavy workloads.

Subsetting math (the explicit case): N=1000 clients × M=1000 X instances. Naïve all-to-all: 1M client→server connection pairs. With subset size 20: 20k pairs. That is a 50x reduction in socket counts, file descriptors, and HTTP/2 connections.
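A minimal sketch of random subsetting with the numbers above. The function name pick_subset and the backend naming are illustrative assumptions. Seeding the generator with the client id keeps each client's subset stable across restarts.

```python
import random


def pick_subset(client_id: int, backends: list[str], subset_size: int) -> list[str]:
    """Return a stable pseudo-random subset of backends for one client."""
    rng = random.Random(client_id)  # per-client seed: same subset every time
    return rng.sample(backends, min(subset_size, len(backends)))


backends = [f"x-{i}" for i in range(1000)]
subsets = {client: pick_subset(client, backends, 20) for client in range(1000)}

all_to_all = 1000 * len(backends)
with_subsetting = sum(len(s) for s in subsets.values())
print(all_to_all, with_subsetting, all_to_all // with_subsetting)
```

Purely random subsets spread load unevenly across backends (some are picked more often than others), which is why production systems often prefer a deterministic scheme that guarantees each backend roughly the same number of clients.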

Consistent-hash routing (the explicit case): if X is internally sharded, route by key (user_id → shard). Same user always hits the same shard, enabling local caches and reducing cross-shard joins. Used by feature stores, video metadata services (Vitess), and anywhere session affinity matters.
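A minimal consistent-hash ring, as a sketch of how a router might map user_id to a shard. The class name HashRing, the shard names, and the choice of 100 virtual nodes per shard are assumptions for illustration; real meshes implement their own ring or Maglev-style hashing policies.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # First 8 bytes of SHA-256 as an integer position on the ring.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, shards, vnodes=100):
        # Each shard owns many points so load spreads evenly.
        self._ring = sorted(
            (_hash(f"{shard}#{i}"), shard) for shard in shards for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["shard-0", "shard-1", "shard-2", "shard-3"])
print(ring.shard_for("user-42"))
```

The property that matters operationally: removing one shard only remaps the keys that lived on it, so the local caches on every other shard stay warm.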

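Hedged requests (the explicit case): send the same idempotent read to two replicas, return whichever responds first. Google's "Tail at Scale" paper reports dramatically better tail latency (p99 and beyond) with little effect on the median. Cost: up to 2x QPS at the callee if the second request is sent immediately; the paper's variant sends it only after the first has been outstanding longer than the expected p95 latency, which limits the extra load to roughly 5%. Enable only on read-heavy idempotent paths.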
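A minimal asyncio sketch of a hedged read. It uses the delayed variant (the backup request goes out only if the primary has not answered within hedge_delay_s) rather than firing both immediately, which keeps the extra load well under 2x. The names hedged_call and fake_replica are hypothetical, and error handling is deliberately left out: if the first call to finish raised, the exception propagates.

```python
import asyncio


async def hedged_call(call, replicas, hedge_delay_s):
    """Call replicas[0]; if it is still pending after hedge_delay_s, race replicas[1]."""
    primary = asyncio.create_task(call(replicas[0]))
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay_s)
    if done:
        return primary.result()  # fast path: no hedge was needed

    backup = asyncio.create_task(call(replicas[1]))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop paying for the loser
    return done.pop().result()


async def fake_replica(name):
    # Hypothetical backend: "slow" takes 0.5 s, anything else 10 ms.
    await asyncio.sleep(0.5 if name == "slow" else 0.01)
    return name


result = asyncio.run(hedged_call(fake_replica, ["slow", "fast"], hedge_delay_s=0.05))
print(result)
```

Setting hedge_delay_s near the callee's observed p95 means only about one call in twenty sends a second request, which is the trade the text describes.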

The trap at every scaling step: people keep adding capacity to the existing topology rather than restructuring. The 100x → 1000x step often requires a topology change (subsetting, consistent-hash routing, sharding the callee), not just more pods.

Type 2 mostly exposes the callee — hot service becomes the bottleneck. Type 1 exposes the fabric — mesh control plane, schema registry, observability.

§25. Decision Matrix

Scenario	Pick	Threshold / Why
Internal microservices, high RPS, multi-language	gRPC + Protobuf	Default east-west. Wire/CPU efficient; streaming; codegen; deadline propagation.
Public external API (third parties, partners)	REST + JSON	Customer uses `curl`; reads JSON. Cacheable.
Web ↔ backend (browser caller)	REST + JSON or Connect / GraphQL	Browser-native; gRPC-Web needs a proxy. Connect avoids trailers.
Mobile ↔ backend (diverse clients)	GraphQL or gRPC	GraphQL when different clients want different field shapes. gRPC for tight binary efficiency on cellular.
Inter-service streaming (logs, telemetry, live updates)	gRPC streaming	Built-in flow control; bidi mode.
Server push to browser	SSE or WebSockets	gRPC-Web doesn't do bidi well from browser side.
Existing Facebook/Apache ecosystem	Thrift	Inherited. Functionally similar to gRPC. Don't migrate without business reason.
Kafka producer/consumer payloads	Avro	Tighter integration with schema registry; better evolution rules for stored data.
Browser-callable Protobuf	Connect RPC (Buf)	gRPC-compatible without trailers; works over HTTP/1.1 from browsers without a proxy.
Long-running, durable, retryable workflow	NOT RPC — use a workflow engine	Temporal, Cadence, AWS Step Functions. RPC is synchronous; workflows are not.
Asynchronous fanout, durable	NOT RPC — use a queue	Kafka, RabbitMQ, Pulsar. Producer / consumer decoupled in time.

Decision thresholds:
- >1M QPS internal: must adopt schema discipline (Protobuf + CI breaking-change checks). REST + JSON at this scale is a CPU and bandwidth bonfire.
- >50 services: must adopt service mesh. Per-team retry/timeout policies are already drifting; uniformity must be enforced at infra layer.
- >10 hop fan-out depth: must enforce deadline propagation. Per-hop timeouts will compound and blow user-facing budgets.
- Browser caller: must NOT use vanilla gRPC. Either gRPC-Web (proxy), Connect (no proxy), GraphQL, or REST.
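The deadline-propagation threshold is easier to reason about with the mechanics spelled out. The sketch below parses a grpc-timeout header value (an integer of up to 8 digits followed by a unit: H, M, S, m, u, or n), tracks the remaining budget, and re-encodes what is left for the next hop. The Deadline class and its method names are assumptions for illustration; gRPC libraries do this internally.

```python
import time

# Nanoseconds per grpc-timeout unit.
_UNIT_NANOS = {
    "H": 3600 * 10**9, "M": 60 * 10**9, "S": 10**9,
    "m": 10**6, "u": 10**3, "n": 1,
}


def parse_grpc_timeout(value: str) -> float:
    """Convert a grpc-timeout value such as '250m' into seconds."""
    digits, unit = value[:-1], value[-1:]
    if not digits.isdigit() or len(digits) > 8 or unit not in _UNIT_NANOS:
        raise ValueError(f"malformed grpc-timeout: {value!r}")
    return int(digits) * _UNIT_NANOS[unit] / 1e9


class Deadline:
    """Absolute deadline derived from the budget the caller sent us."""

    def __init__(self, budget_s, now=time.monotonic):
        self._now = now
        self._expires_at = now() + budget_s

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._now())

    def outgoing_header(self) -> str:
        """grpc-timeout for the next hop: whatever budget is left, in ms."""
        left = self.remaining()
        if left <= 0:
            raise TimeoutError("deadline exceeded; do not call downstream")
        return f"{max(1, int(left * 1000))}m"  # round down: never extend the budget
```

Each hop subtracts its own elapsed time before calling downstream, so a request that has already burned its budget is dropped at the hop that notices, instead of doing work nobody will wait for.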

The defensible Staff-level rule: gRPC east-west; REST north-south; GraphQL at BFF when client shapes vary; Avro for stored event streams; never RPC for things that should be a queue or a workflow. Anything else needs explicit justification.

§26. Use Case Gallery

Same technology, different domains, different specialization points:

Microservices mesh — Netflix-style consumer-facing fanout. A device session's home-page render makes 100+ internal RPC calls. Hystrix (then resilience4j) circuit breakers, Eureka for service discovery, Ribbon for client-side LB. The canonical "every leaf failure must be tolerated" workload. Variant of the tech: synchronous gRPC fanout with aggressive circuit breaking, fallback responses, hedged requests, and pre-rendered cache fallbacks. The defining demand: 99.9% of user requests must succeed even if 5% of leaf services are sick at any moment.
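A minimal circuit breaker with a fallback, as a sketch of the pattern this workload depends on. It is not Hystrix's or resilience4j's implementation: it counts consecutive failures rather than using a rolling error-rate window, it is not thread-safe, and the parameter names are assumptions.

```python
import time


class CircuitBreaker:
    """Closed -> open after N consecutive failures -> half-open after a cool-down."""

    def __init__(self, failure_threshold=5, reset_after_s=30.0, now=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._now = now
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self._opened_at is not None:
            if self._now() - self._opened_at < self.reset_after_s:
                return fallback()  # open: shed the call, serve the fallback
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._now()  # open, or re-open after a failed trial
            return fallback()
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result
```

The fallback is what makes "5% of leaf services are sick" survivable: the page renders with a cached or default row instead of failing the whole request.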

Microservices mesh — Lyft-style operational backbone. Lyft built Envoy precisely because their existing mesh (Python services + various clients) couldn't enforce uniform retry/CB/observability behavior. The mesh becomes a substrate not just for RPC but for the entire operational story: traffic shifting, canary deploys, blue-green via header-based routing.

Microservices mesh — Stripe-style transactional fanout. Payment auth fans out to risk, balance, ledger, fraud, FX. Each call must succeed or be retried safely. Idempotency keys on every public POST. Internal services Twirp or gRPC. The defining demand: every retry must be safe; no double charges; deadline propagation prevents long tails from blowing user-facing budgets.
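A minimal sketch of how an idempotency key makes a retried write safe. The class IdempotentCharges and its in-memory dictionary are hypothetical stand-ins; a real service would use a durable store with an atomic insert, compare the retried request body against the original, and expire old keys.

```python
class IdempotentCharges:
    """Toy charge endpoint that executes each idempotency key at most once."""

    def __init__(self):
        self._results = {}     # idempotency key -> stored response
        self.charges_made = 0  # counts real side effects

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._results:
            # Retry of a request we already processed: replay the response.
            return self._results[idempotency_key]
        self.charges_made += 1  # the side effect happens exactly once
        response = {"charge_id": f"ch_{self.charges_made}", "amount_cents": amount_cents}
        self._results[idempotency_key] = response
        return response


service = IdempotentCharges()
first = service.charge("key-123", 5000)
retry = service.charge("key-123", 5000)  # e.g. the client timed out and retried
print(first == retry, service.charges_made)
```

The client generates the key once per logical operation and reuses it on every retry, so a timeout followed by a retry cannot produce a second charge.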

Mobile API — GraphQL BFF. Shopify's storefront API and Facebook's mobile data layer use GraphQL because mobile clients fetch wildly different field combinations and over-fetching costs cellular bytes. The mobile app sends one query specifying exactly the fields it wants; the BFF resolves it by calling 5–20 internal gRPC services and composing the response. Variant of the tech: GraphQL at the edge, gRPC internally.

Public API — Stripe / Twilio / GitHub REST. Stripe, Twilio, and GitHub's classic REST API are REST + JSON because the caller is some random startup with Python + requests or a Ruby app or a Slack bot. The defining demand: zero-tooling debuggability, cacheable GET responses, predictable pagination semantics, and a CLI (curl) story.

Public API — GitHub GraphQL. GitHub also exposes GraphQL because integrators (CI tools, code analyzers, dashboards) want exactly the fields they need and the REST API's response shapes were over-fetching. Same data, two surfaces.

Internal high-throughput — LinkedIn Rest.li. ~10M QPS across ~100k instances. Rest.li is REST-flavored over HTTP/1.1 historically; the schema is a .pdl file (similar to Protobuf in intent), and the framework codegens Java stubs. D2 handles client-side load balancing. Espresso (the distributed document store) is reached at ~3M QPS over Rest.li. Variant of the tech: REST shape with a Protobuf-equivalent schema discipline + a custom client-side discovery + LB layer. The defining demand: 10M QPS, 100k instances, and a Java ecosystem that pre-dates gRPC's maturity.

Internal high-throughput — Google Stubby / gRPC. ~10B RPCs/sec, ~10000+ services, a single search query fans out to ~1000 backends, each call deadline-bounded at 10–50 ms. Proxyless mesh (gRPC libraries consume xDS directly, no sidecar) because sidecar CPU at 10B RPCs/sec would be a separate billion-dollar cost line. The defining demand: minimum framework overhead per call; ruthless schema discipline; fan-out depth 10+ with deadline self-truncation.

ML inference fanout — Uber Michelangelo. A trip-pricing request invokes 5–10 ML models in parallel (demand prediction, fraud, ETA, surge). Each inference is a gRPC call to a Triton or TorchServe endpoint, p99 budget tens of milliseconds. The defining demand: highly parallel fanout with hedging on the slowest call; "slow" model instances ejected by circuit breaker so one bad GPU doesn't poison the prediction.

Database wire protocols — Postgres, MySQL. Not gRPC, but conceptually an RPC: the client sends a typed request (a SQL statement, a prepared statement, a bind), the server sends a typed response (rows, status). Postgres wire protocol has explicit message types (Q = simple query, P = parse, B = bind, E = execute, S = sync), length prefixes, and well-defined error semantics. The same primitive (typed request-response over TCP) shows up everywhere; gRPC is one polished implementation.
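The Postgres framing is simple enough to build by hand, which makes the "typed request-response over TCP" point concrete. A minimal sketch, with hypothetical helper names: a simple-query message is the type byte Q, a 4-byte big-endian length that counts itself but not the type byte, and the null-terminated SQL text.

```python
import struct


def build_simple_query(sql: str) -> bytes:
    """Frame a Postgres simple-query ('Q') message."""
    body = sql.encode("utf-8") + b"\x00"  # query text is null-terminated
    length = 4 + len(body)                # length includes itself, not the type byte
    return b"Q" + struct.pack("!i", length) + body


def read_message(buf: bytes):
    """Split one framed message off the front of buf: (type, payload, rest)."""
    msg_type = buf[0:1]
    (length,) = struct.unpack("!i", buf[1:5])
    end = 1 + length
    return msg_type, buf[5:end], buf[end:]


wire = build_simple_query("SELECT 1")
print(wire)
```

The same shape (type tag, length prefix, payload) is what gRPC does per message inside an HTTP/2 DATA frame; the difference is how much of the operational contract rides alongside it.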

Streaming — gRPC bidi for chat / live updates. Bidirectional gRPC streams power: Slack's WebSocket-replacement initiatives (where they can avoid browsers), Discord's voice signaling, Zoom's call coordination, financial market-data feeds. The defining demand: long-lived streams (minutes to hours), in-order delivery within the stream, per-stream flow control via HTTP/2 WINDOW_UPDATE.

§27. Real-World Implementations with Numbers

Google internal: ~10 billion RPCs/sec on Stubby (now mostly gRPC). Borg + service discovery, deadline propagation built into the protocol. One search query fans out to ~1,000 backends, each call 10–50 ms deadline. Proxyless mesh — gRPC libraries consume xDS directly, no sidecar process.
LinkedIn Rest.li: ~10M QPS across ~100k service instances. Espresso (storage) reached at ~3M QPS over Rest.li. D2 (Dynamic Discovery) handles client-side LB. Rest.li is REST-flavored over HTTP/1.1 historically, moving to gRPC for internal hot paths. The mid-tier services (feed mid-tier, profile mid-tier) individually sustain 500k+ QPS.
Netflix: Hystrix and the original circuit-breaker pattern, born from a 2012 retry-storm incident that brought down a substantial chunk of the personalization tier when one backend GC'd. Eureka for service discovery. 100+ call fan-out per device session. Migrated to resilience4j (Hystrix is in maintenance) and gRPC for newer services.
Uber: Apache Thrift, TChannel transport, then YARPC (their RPC abstraction). ~10M QPS internal. One ride request: ~30 sync calls + ~40 async, p99 budget 300 ms.
Stripe: idempotency keys are first-class in their public API. The Idempotency-Key header is the canonical real-world example of "retries safe even on writes." Internal services are a mix of Ruby, Go, and increasingly gRPC + Twirp for hot paths.
Envoy / Istio: deployed at LinkedIn, Google, eBay, Salesforce, Yelp, and most of the Fortune 100. Envoy alone handles tens of millions of requests/sec in large production deployments. The dominant data-plane proxy.
Cloudflare: gRPC over HTTP/3 at the edge. They have publicly written about HTTP/3 head-of-line benefits at p99 in lossy mobile networks.
Meta (Facebook): Thrift everywhere. Internal RPC volume not publicly disclosed but estimated similar order to Google.
Buf (the company): maintains buf breaking and Connect RPC. The fact that breaking-change detection in CI is now table-stakes is largely because of their work normalizing it.
Discord: gRPC for inter-service communication, with bidi streams for voice coordination. WebSockets for the browser-facing chat protocol (gRPC-Web doesn't bidi cleanly from browsers).
Square / Block: built Wire (a lean Protobuf and gRPC implementation for Android, Kotlin, Swift, and Java) and contributed to Connect. Internal services mix Wire, gRPC, and standard Protobuf-over-HTTP.

War-story-grade numbers

One Google SRE talk: a single proto field-number reuse (someone overwrote field 5 with a different type) caused a fleet-wide 4-hour partial outage on a top service. Recovery: roll back the .proto, wait for binaries to redeploy. Schema review is not optional.
Netflix's original retry storm (circa 2012, pre-Hystrix) brought down a substantial chunk of their personalization tier when one backend GC'd. The post-mortem birthed Hystrix.
LinkedIn's adoption of D2 + Rest.li was partly motivated by the failure mode of naïve client-side load balancing, where 1000 client instances each hold a connection to all 1000 instances of a callee (1M connections per service pair, multiplied again by every dependency each host talks to). Subsetting solved it.
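The field-number reuse story is the kind of change a CI check can catch mechanically. The sketch below is a toy stand-in, not buf's algorithm: schemas are modelled as plain dictionaries of field number to (name, type), and only one rule is checked (a field number must keep its type). Real breaking-change detectors apply many more rules.

```python
def breaking_changes(old: dict, new: dict) -> list:
    """Compare two schema versions modelled as {field_number: (name, type)}."""
    problems = []
    for number, (old_name, old_type) in sorted(old.items()):
        if number not in new:
            continue  # deletions pass here; a real checker would also require `reserved`
        new_name, new_type = new[number]
        if new_type != old_type:
            problems.append(
                f"field {number}: type changed {old_type} -> {new_type} "
                f"({old_name} -> {new_name})"
            )
    return problems


v1 = {1: ("id", "int64"), 5: ("email", "string")}
v2 = {1: ("id", "int64"), 5: ("score", "double")}  # field 5 reused with a new type
print(breaking_changes(v1, v2))
```

Old binaries keep decoding field 5 as a string while new ones write a double into it, which is why the fix in the story was a rollback and redeploy rather than a quick patch.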

Bonus depth beats (call these out if probed)

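Hedged requests. Same request to two replicas, return whichever responds first. p99 improves dramatically at little cost to the median. Cost: up to 2x QPS if the hedge is sent immediately, far less if it is delayed until the first attempt is already slow. Read-heavy idempotent paths only.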
HTTP/2 flow control as backpressure. WINDOW_UPDATE frames per stream — slow consumer naturally throttles its sender without affecting siblings. With gRPC streaming, this is the actual backpressure mechanism — not "the client polls less often."
mTLS rotation via SDS. Secret Discovery Service pushes new certs to sidecars every few hours; application never sees the cert. SPIFFE identities (spiffe://example.com/sa/service-a) are the modern service-identity standard.
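W3C trace context. traceparent (2-hex-char version, 32-hex-char trace id, 16-hex-char parent span id, 2-hex-char trace-flags) is injected by the client sidecar and propagated by every hop, each hop substituting its own span id. The sampling decision (the sampled bit in trace-flags) must be propagated for the trace to be coherent.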
oneof and Any in Protobuf. oneof is a tagged union; only one field in the set can be set. Any carries an arbitrary message by type URL + bytes — flexible but discards the schema benefit at that field.
Server-side reflection. grpc.reflection.v1alpha lets grpcurl discover service definitions at runtime. Useful in dev, dangerous in prod (leaks API surface). Disable past staging.
xDS variants: LDS (Listeners), RDS (Routes), CDS (Clusters), EDS (Endpoints), SDS (Secrets), ADS (Aggregated — one stream for all). Knowing the lingo signals "I've run a mesh."
Proxyless mesh. gRPC libraries consume xDS directly. Saves sidecar CPU. Costs you mesh features the library doesn't implement.
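Ambient mesh / Cilium. Sidecar replaced by shared per-node infrastructure: a node-level proxy (Istio ambient's ztunnel, plus optional waypoint proxies for L7) or eBPF programs in the kernel (Cilium). Lower per-pod overhead. Less mature; rapidly maturing.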
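The traceparent header from the W3C trace context item above is small enough to parse by hand. A minimal sketch, handling only version 00 of the format; the function names are assumptions, and the example value is the one used in the W3C specification.

```python
import re

# version "00" - 32 hex trace id - 16 hex parent span id - 2 hex trace-flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> dict:
    match = _TRACEPARENT.match(header)
    if not match:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero ids are invalid")
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest bit is the sampled flag
    }


def child_traceparent(header: str, new_span_id: str) -> str:
    """Header to send downstream: same trace id and flags, our span as parent."""
    parsed = parse_traceparent(header)
    return f"00-{parsed['trace_id']}-{new_span_id}-{parsed['flags']}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(child_traceparent(incoming, "b7ad6b7169203331"))
```

Keeping the trace id and flags unchanged while swapping in the current span id is what lets a backend stitch the hops of one request into a single trace.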

§28. Summary

"RPC is the typed, schema-bound, deadline-propagating, error-coded inter-service communication substrate that turns a fan-out tree of services into something engineers can reason about — gRPC and Thrift and Connect over HTTP/2 with Protobuf payloads on the inside, REST + JSON at the public edge, GraphQL at the BFF — and the technology's real value isn't 'can two services talk' (anything can do that) but the contract it imposes: typed messages with field-number evolution rules, deadlines that subtract across hops, status codes that prescribe recovery actions, and HTTP/2 multiplexing that lets one connection carry thousands of in-flight calls — because the actual failure modes at fleet scale aren't 'service down' but 'service slow but not down,' and the layered defenses that stop the wave (deadline + retry budget + circuit breaker + bulkhead + load shedding) are exactly the primitives that a real RPC fabric standardizes so every team gets them for free instead of reinventing each one wrong."