Auth & Identity

§1. What auth and identity IS

Authentication and identity is the class of technology that answers two distinct questions on every interaction:

  • Authentication (AuthN) — "who is this?" Prove identity via something known (password), held (TOTP — Time-based One-Time Password, hardware token), inherent (biometric), or by chained trust (a previously-issued token).
  • Authorization (AuthZ) — "is this identity allowed to do that?" Decide whether the action is permitted: RBAC (Role-Based), ABAC (Attribute-Based), or ReBAC (Relationship-Based Access Control).

The same token (e.g., a signed JWT — JSON Web Token) usually carries both: sub says who; scope/role/aud constrain what. AuthN's hard problems are credential storage, MFA (Multi-Factor Authentication), and revocation. AuthZ's hard problems are policy evaluation at scale and cross-tenant blast-radius isolation.

The technology sits between the public network and business logic — fronted by an API gateway, fanned out into every service via a token validator or sidecar. Two distinct sub-problems:

  • User auth — humans logging in via password + MFA, browser cookies, mobile redirects, social login. Occasional logins, ~10-15 min access tokens, refresh tokens for continuity.
  • Service auth — machines authenticating to other machines in a microservice mesh. Short-lived (~1h) auto-rotated credentials, cryptographic identity at the TCP layer (mTLS — mutual TLS, SPIFFE — Secure Production Identity Framework For Everyone). No human in the loop.

NOT in scope: end-to-end encryption (Signal protocol, X3DH, double ratchet — separate category, protecting message content from the platform); key management for data at rest (KMS for DEKs/KEKs — Data/Key Encryption Keys; auth uses KMS to custody signing keys, but data encryption is its own problem); fine-grained policy languages (OPA — Open Policy Agent, Rego — the AuthZ engine downstream of identity).

Where this tech is NOT good: high-frequency intra-app trust where every method call needs its own crypto check (auth gates the request boundary, not function calls); plaintext-content protection (JWT signing protects integrity, not confidentiality — claims are readable to anyone holding the token).

The defining property: auth events are rare; auth checks are constant. A user logs in maybe ten times a week, but their identity is verified on thousands of API calls per day. The technology must collapse the verification path to microseconds while keeping the issuance path correct.


§2. Inherent guarantees per scheme

Each scheme makes a different bargain.

JWT (signed bearer token). Provides stateless verification — any party with the IdP's (Identity Provider) public key can verify without contacting the issuer. Integrity, tamper detection, built-in expiration. Does NOT provide revocation — once issued, valid until exp no matter what. Does NOT provide confidentiality — claims are base64url-encoded, not encrypted. Must layer: short TTL + denylist for revocation; aud audience separation to prevent confused deputy; key rotation infrastructure.

Session cookie + server store. Provides instant revocation (delete the session row). Updatable claims mid-session. Easy audit. Does NOT provide cheap scale-out — every API call hits Redis. Does NOT provide cross-domain validity without explicit federation. Must layer: server-side store, session-fixation protection (rotate ID at login), CSRF (Cross-Site Request Forgery) tokens.

mTLS. Provides cryptographic identity at the TCP layer, established during the TLS handshake. The identity IS the connection — can't be stolen from a captured payload because TLS session keys aren't transferable. Does NOT provide human identity, semantic scopes, or fine-grained authZ. Must layer: PKI (Public Key Infrastructure) ops — short-lived certs (e.g., 1h SPIFFE SVIDs — SPIFFE Verifiable Identity Documents), automated rotation, CRL (Certificate Revocation List) or OCSP (Online Certificate Status Protocol).

OAuth 2.0 / OIDC (OpenID Connect). Provides delegated authorization (OAuth) and federated identity (OIDC) — let User log into App A using their account at IdP B without giving A the password. Standardised flows. Does NOT provide implementation-grade defaults (footguns: implicit flow leaks tokens, missing PKCE on public clients, mis-validated aud). OAuth is a flow framework, not a revocation system.

SAML (Security Assertion Markup Language). XML-based browser federation, mature in enterprise SSO (Single Sign-On). Not mobile- or service-friendly. Modern shops migrate toward OIDC for new builds; SAML stays for legacy enterprise B2B.

Passwords + KDF (Key Derivation Function). Argon2id/bcrypt/scrypt provide work-factor protection against offline cracking after a DB leak. Salt defeats rainbow tables. Does NOT protect against phishing, credential stuffing, password reuse — KDFs only matter once the leak has happened. Must layer: MFA, breached-password checks, rate limiting, ideally passkeys.

Production systems compose these — JWT for the hot path, session cookie for browser revocation, mTLS for service mesh, OAuth flow for cross-app login, Argon2id for the once-per-login password check.


§3. The design space

Stateful vs stateless vs hybrid

Pattern | Token contents | Per-request work | Revocation | Best for
Stateful (session cookie) | Opaque ID (32 random bytes) | Redis/DB lookup ~1ms | Delete the row | Browser apps on one domain (banking portal, GitHub web)
Stateless (signed JWT) | Signed claims | Local crypto verify ~80-150µs | Hard: denylist + short TTL | Microservice mesh, mobile apps
Hybrid (opaque + cache) | Opaque ID, claims in cache | Cache hit ~1ms | Cache DEL | Stripe API keys, PayPal, Google LOAS

The hybrid pattern is common where you want JWT-like portability and instant revocation: PayPal, Stripe, Google's internal LOAS (Low Overhead Authentication System) use short opaque tokens with claims in a fast distributed cache. Warm cache hit; rare miss falls back to introspection.

Signing algorithms

Algo | Type | Key size | Sign | Verify | Sig size | Use
HS256 (HMAC-SHA256) | Symmetric | 256-bit | ~2µs | ~2µs | 32 B | Single trust boundary
RS256 (RSA-SHA256) | Asymmetric | 2048-bit | ~3ms | ~80µs | 256 B | Federated: IdP signs, many verifiers
ES256 (ECDSA P-256) | Asymmetric | 256-bit | ~150µs | ~150µs | 64 B | Mobile, IoT, wire-size-sensitive
EdDSA (Ed25519) | Asymmetric | 256-bit | ~80µs | ~150µs | 64 B | Modern default: deterministic, immune to nonce reuse

Default pick: RS256 or EdDSA. HS256 fails when more than one party needs to verify (sharing secrets is fatal). ECDSA's nonce-reuse footgun caused real disasters (Sony PS3, 2010: the console's signing key was recovered because the same nonce was reused across signatures); HSMs (Hardware Security Modules) avoid it, but EdDSA's determinism kills the whole bug class.

OAuth 2.0 flow selection

Flow | Client type | When
Authorization Code | Confidential (server-side webapp with secret) | Classic webapp login (Atlassian, Salesforce)
Authorization Code + PKCE | Public (mobile, SPA) | Mobile apps, SPAs (Apple Sign-In, Sign in with Google mobile)
Client Credentials | Service account | Backend-to-backend
Device Code | Smart TVs, CLI, IoT with no browser | Apple TV Netflix, AWS CLI SSO, gh auth login
Refresh Token | Any with stored refresh | Continuous session for mobile apps, long-running CLIs
Implicit (deprecated) | SPA historical | DEAD: leaks token in URL fragment
ROPC (Resource Owner Password Credentials, deprecated) | Legacy migration | DEAD: defeats the entire point of OAuth

Recommended pick: Authorization Code + PKCE (Proof Key for Code Exchange) for all new flows. OAuth 2.1 makes PKCE mandatory for all clients.

Token format, storage, MFA

Token formats: JWT (self-contained), opaque (server lookup), PASETO (JWT competitor without algorithm-confusion footguns), Macaroons (delegated, attenuatable caveats — Google internal, Tarsnap). Session storage: in-memory single instance (~1µs), Redis/Memcached (~1ms, default), MySQL/Postgres (~5-50ms, audit-trail), cookie-only signed (0ms, stateless edge).

MFA factor ranking (weakest → strongest): SMS OTP (weak — SIM swapping), TOTP (RFC 6238 — Google Authenticator), push (Duo, Okta Verify), hardware tokens (YubiKey, Titan), WebAuthn/FIDO2/passkeys (phishing-resistant — credential scoped to origin). Modern default: WebAuthn/passkeys. Apple, Google, Microsoft all ship them natively.


§4. Byte-level mechanics

This is where the depth lives. Anyone can name "JWT" or "OAuth"; the real reference walks the bytes.

4a. JWT anatomy

Three base64url-encoded segments joined with dots: header.payload.signature.

Wire format (line-broken for readability):

eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6ImtleS0yMDI2LTA1In0
.
eyJpc3MiOiJodHRwczovL2lkcC5leGFtcGxlLmNvbSIsInN1YiI6InVzZXItNDIi
LCJhdWQiOiJhcGkuZXhhbXBsZS5jb20iLCJleHAiOjE3MTUwMDA5MDAsImlhdCI6
MTcxNTAwMDAwMCwianRpIjoiYjM2ZDQ4MSJ9
.
VsbGgZkYZGW5XYsTuYg5fkH3vY3p9_lKqGo7n8sZ9wW...

Decoded header (JSON): {"alg":"RS256","typ":"JWT","kid":"key-2026-05"}

kid (key ID) tells the verifier which public key to use.

Decoded payload (claims, RFC 7519 §4.1; nbf and scope are shown for illustration and are not present in the wire sample above):

{
  "iss": "https://idp.example.com",   // issuer — required, validated
  "sub": "user-42",                    // subject — user ID
  "aud": "api.example.com",            // audience — required, MUST be validated
  "exp": 1715000900,                   // expiration epoch
  "iat": 1715000000,                   // issued at — useful for per-user revocation
  "nbf": 1715000000,                   // not valid before
  "jti": "b36d481",                    // unique token ID — key for denylist
  "scope": "read:profile write:posts"
}

Signature: RS256_sign(SHA256(header_b64 + "." + payload_b64), private_key), then base64url-encoded. With a 2048-bit key, RS256 produces exactly 256 bytes pre-base64; ES256 exactly 64 bytes (R || S, per RFC 7518 §3.4); HS256 32 bytes.
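A minimal sketch of why signed claims are not confidential: anyone holding the token can decode the first two segments with nothing but base64url. The header segment is the one from the wire sample above; the payload is rebuilt locally and the signature is a placeholder, so the helper names (b64url_decode, peek) are illustrative rather than any library's API.

```python
import base64
import json


def b64url_encode(raw: bytes) -> str:
    """base64url without '=' padding, as JWTs use on the wire."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding JWTs strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def peek(token: str):
    """Return (header, claims) WITHOUT verifying the signature."""
    header_b64, payload_b64, _signature_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    claims = json.loads(b64url_decode(payload_b64))
    return header, claims


# Header segment copied from the wire sample above.
HEADER_B64 = "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6ImtleS0yMDI2LTA1In0"

# Payload rebuilt locally; the signature is a placeholder, not a real one.
claims = {"iss": "https://idp.example.com", "sub": "user-42", "aud": "api.example.com"}
token = ".".join([HEADER_B64, b64url_encode(json.dumps(claims).encode()), "sig-placeholder"])

header, decoded = peek(token)
print(header["alg"], header["kid"], decoded["sub"])
```

Nothing here touches a key, which is the point: decoding is not verifying, and a verifier must never act on claims obtained this way.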

4b. The RS256 verify path, byte by byte

Receive: "eyJ...header.eyJ...payload.VsbGg...signature"

Step 1: split on "." → header_b64, payload_b64, signature_b64
Step 2: base64url-decode header → JSON {"alg":"RS256","kid":"key-2026-05"}
Step 3: look up public key by kid in JWKS cache (JSON Web Key Set,
        fetched from https://idp.example.com/.well-known/jwks.json,
        refreshed hourly) → RSA public key (n, e)
Step 4: compute SHA256(header_b64 + "." + payload_b64) → 32-byte digest
Step 5: base64url-decode signature_b64 → 256-byte signature bytes
Step 6: RSA verify: signature^e mod n == PKCS#1-v1.5-padded digest?
Step 7: base64url-decode payload → JSON claims
Step 8: validate claims:
   - iss == expected_issuer (constant-string compare)
   - aud includes this service
   - exp > now() - leeway (typical leeway: 60s)
   - nbf <= now() + leeway
   - jti not in revocation denylist (hashset lookup, ~50ns)
   - iat >= min_iat[sub] (per-user revocation cache lookup)
Step 9: return claims to application

Total CPU: ~80µs RS256 verify + ~5µs JSON parse + ~50ns hashset lookup ≈ ~100µs per validation.
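The steps above can be sketched end to end with the standard library. One deliberate substitution: this sketch signs and verifies with HS256 (HMAC-SHA256) instead of RS256, because the Python stdlib has no RSA verifier; the split, decode, verify-before-trust, and claim-validation order is the same. Function names and the denylist parameter are assumptions for illustration.

```python
import base64
import hashlib
import hmac
import json


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def sign_hs256(claims: dict, key: bytes) -> str:
    """Mint a token for the demo (HS256 stands in for the IdP's RS256)."""
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(key, f"{header}.{payload}".encode("ascii"), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url_encode(sig)}"


def verify_jwt(token, key, issuer, audience, now, leeway=60, denylist=frozenset()):
    """Steps 1-9: split, check signature, then validate claims."""
    header_b64, payload_b64, sig_b64 = token.split(".")              # step 1
    header = json.loads(b64url_decode(header_b64))                   # step 2
    if header.get("alg") != "HS256":     # pin the algorithm; never let the token choose
        raise ValueError("unexpected alg")
    expected = hmac.new(key, f"{header_b64}.{payload_b64}".encode("ascii"),
                        hashlib.sha256).digest()                     # steps 4-6
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(b64url_decode(payload_b64))                  # step 7
    aud = claims.get("aud")                                          # step 8
    if claims.get("iss") != issuer:
        raise ValueError("bad iss")
    if audience not in (aud if isinstance(aud, list) else [aud]):
        raise ValueError("bad aud")
    if claims.get("exp", 0) <= now - leeway:
        raise ValueError("expired")
    if claims.get("nbf", 0) > now + leeway:
        raise ValueError("not yet valid")
    if claims.get("jti") in denylist:
        raise ValueError("revoked")
    return claims                                                    # step 9


demo_key = b"demo-only-shared-secret"  # hypothetical key for the sketch
demo_claims = {"iss": "https://idp.example.com", "sub": "user-42",
               "aud": "api.example.com", "exp": 1715000900,
               "iat": 1715000000, "jti": "b36d481"}
demo_token = sign_hs256(demo_claims, demo_key)
print(verify_jwt(demo_token, demo_key, "https://idp.example.com",
                 "api.example.com", now=1715000100)["sub"])
```

The ordering matters: the payload is only parsed as trusted claims after the signature check passes, and the accepted algorithm is fixed by the verifier rather than read from the header.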

4c. RSA math (one paragraph)

Pick two large primes p, q (1024 bits each for RSA-2048). Compute n = p · q, φ(n) = (p-1)(q-1), pick e (almost always 65537: small, prime, sparse binary), compute d = e⁻¹ mod φ(n). Public key = (n, e); private = d. Sign: s = digest^d mod n. Verify: digest =? s^e mod n. Security rests on integer factorization being hard: the best-known algorithm (NFS, the Number Field Sieve) needs on the order of 2¹¹² operations for a 2048-bit n, i.e. roughly 112-bit security. HSMs store d in tamper-resistant silicon; signing happens inside the HSM, never exposing d. For RS256 specifically: SHA-256 to a 32-byte digest, PKCS#1 v1.5 pad to 256 bytes, RSA-sign the padded digest.
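The same arithmetic at toy scale, as a sketch only: the primes 61 and 53 and the exponent 17 are textbook values chosen so the numbers fit on one line, and there is no padding, so this illustrates the math and nothing about real key sizes.

```python
# Toy RSA: illustrative only, far too small to be secure and with no padding.
p, q = 61, 53
n = p * q                      # modulus: 3233
phi = (p - 1) * (q - 1)        # 3120
e = 17                         # small public exponent coprime to phi
d = pow(e, -1, phi)            # modular inverse (Python 3.8+): 2753


def sign(digest: int) -> int:
    """s = digest^d mod n, using the private exponent."""
    return pow(digest, d, n)


def verify(digest: int, signature: int) -> bool:
    """Recover signature^e mod n with the public key and compare."""
    return pow(signature, e, n) == digest


digest = 65                    # stand-in for a (padded) hash value, must be < n
signature = sign(digest)
print(verify(digest, signature), verify(digest + 1, signature))
```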

4d. Fleet math

A system at 2M token validations/sec across the mesh:

  • RS256 verify: 2,000,000 × 80µs = 160 CPU-seconds per wall second = ~160 cores spread across the fleet. At 64-core hosts, ~2.5 hosts' worth of CPU consumed by JWT verification — totally tractable.
  • ES256 verify: 2M × 150µs = ~300 cores.
  • HS256 verify: 2M × 2µs = ~4 cores — 40× cheaper, but requires every verifier to hold the signing key. Poison pill for blast radius.

The fact that RS256 is 40× slower than HMAC but still only 160 cores at 2M QPS is why asymmetric crypto is the default for distributed validation: the CPU is a rounding error; the architectural simplification (only the IdP holds the private key) is the win.
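The fleet numbers above are one multiplication each. A small sketch that reproduces them, with the per-verify costs taken from the signing-algorithm table (the function name is illustrative):

```python
def cores_needed(validations_per_sec: int, verify_microseconds: float) -> float:
    """CPU-seconds consumed per wall-clock second, i.e. busy cores."""
    return validations_per_sec * verify_microseconds / 1_000_000


QPS = 2_000_000
for algo, cost_us in [("RS256", 80), ("ES256", 150), ("HS256", 2)]:
    cores = cores_needed(QPS, cost_us)
    print(f"{algo}: {cores:.0f} cores, {cores / 64:.2f} hosts at 64 cores each")
```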

4e. JWT revocation invariants

The fundamental tension: a JWT is verifiable without contacting the issuer. That's what makes it scalable. But once issued, it's valid until exp — the IdP cannot "take it back" without breaking the no-server-lookup model.

Production layered design:

  1. Short access TTL + refresh tokens. Access tokens get 5-15 min expiry. After expiry, the client uses a long-lived refresh token (stored server-side with state) to get a new access token. To revoke a user: invalidate their refresh token; the access token works for up to 15 min more, then locked out.

  2. jti denylist via Kafka. Maintain a Redis set of revoked JWT IDs. When immediate revocation is needed, publish jti to Kafka; every verifier subscribes and updates its local in-memory denylist within ~1s. Verifiers check the denylist on each request (~50ns hashset lookup). Bounded size: once past exp, entries fall off, so the set stays in the hundreds of entries even at 50k revocations/day.

  3. Per-user min_iat stamp. Redis: user:42 → min_iat: 1715000500. Any token with iat < min_iat is rejected. To revoke all of user 42's tokens (the "log me out everywhere" button), bump min_iat to now. One Redis lookup per request (cached at the verifier; LRU of ~1M entries ≈ 100MB).

  4. Key rotation (nuclear). Rotate the signing key. All tokens signed with the old key are invalidated. Reserved for catastrophic compromise.

The complete answer combines (1) + (2) + (3): short access TTL is the safety net; Kafka-fanned jti denylist handles per-token revocation; min_iat handles per-user "logout everywhere."
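A minimal in-memory sketch of what each verifier holds locally for layers (2) and (3). The class and method names are hypothetical; in the design above, the updates would arrive over Kafka and be backed by Redis rather than being called directly.

```python
class RevocationCache:
    """Verifier-local view of revoked tokens and per-user min_iat stamps."""

    def __init__(self):
        self._denied = {}    # jti -> exp of the revoked token
        self._min_iat = {}   # sub -> earliest iat still accepted

    def revoke_token(self, jti: str, exp: int) -> None:
        self._denied[jti] = exp

    def revoke_user(self, sub: str, now: int) -> None:
        """'Log me out everywhere': reject every token issued before now."""
        self._min_iat[sub] = now

    def prune(self, now: int) -> None:
        """Entries past exp can be dropped: the token is dead anyway."""
        self._denied = {jti: exp for jti, exp in self._denied.items() if exp > now}

    def is_revoked(self, claims: dict) -> bool:
        if claims["jti"] in self._denied:
            return True
        return claims["iat"] < self._min_iat.get(claims["sub"], 0)

    def __len__(self) -> int:
        return len(self._denied)


cache = RevocationCache()
token_a = {"sub": "user-42", "jti": "a1", "iat": 1715000000, "exp": 1715000900}
token_b = {"sub": "user-42", "jti": "b2", "iat": 1715000600, "exp": 1715001500}

cache.revoke_token("a1", exp=token_a["exp"])   # per-token revocation
cache.revoke_user("user-42", now=1715000500)   # per-user revocation
print(cache.is_revoked(token_a), cache.is_revoked(token_b))
```

Pruning on exp is what keeps the denylist bounded: a revoked token only needs to be remembered for as long as it would otherwise have been valid.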

4f. Refresh token storage

Refresh tokens are long-lived bearer credentials. Treat as carefully as passwords.

CREATE TABLE refresh_tokens (
  token_id        CHAR(36)  PRIMARY KEY,    -- visible part
  user_id         BIGINT    NOT NULL,
  hashed_secret   CHAR(64)  NOT NULL,        -- HMAC-SHA256 of secret part
  family_id       CHAR(36)  NOT NULL,        -- rotation chain
  generation      INT       NOT NULL,        -- rotation counter
  expires_at      TIMESTAMP,
  revoked_at      TIMESTAMP NULL,
  INDEX (user_id, revoked_at), INDEX (family_id)
);

Token shown to client: token_id.secret. Server stores token_id indexed and HMAC(server_key, secret). On use: split on ., look up by token_id, constant-time-compare HMAC(server_key, secret) to hashed_secret (non-constant-time leaks timing), check revoked_at IS NULL and expires_at > now(), issue new access token AND rotate (new generation, same family_id, mark old revoked).

family_id + generation detects stolen refresh tokens. If an attacker steals and uses one, the server issues a new one and revokes the old. When the legitimate client then presents the original (now-revoked) refresh token, the server detects "family already advanced past this generation" and revokes the entire family. This is the refresh token rotation of OAuth 2.1 §6.1.
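The rotation and reuse-detection logic can be sketched against an in-memory dict standing in for the refresh_tokens table. SERVER_KEY, the class name, and the method names are assumptions for illustration, and expiry checks are left out to keep the family logic visible.

```python
import hashlib
import hmac
import secrets
import uuid

SERVER_KEY = b"demo-only-hmac-key"  # hypothetical; a real key would live in a KMS/HSM


class RefreshStore:
    def __init__(self):
        self.rows = {}  # token_id -> row, standing in for the SQL table

    @staticmethod
    def _hash(secret: str) -> str:
        return hmac.new(SERVER_KEY, secret.encode(), hashlib.sha256).hexdigest()

    def issue(self, family_id=None, generation=0) -> str:
        token_id, secret = str(uuid.uuid4()), secrets.token_urlsafe(32)
        self.rows[token_id] = {
            "family_id": family_id or str(uuid.uuid4()),
            "generation": generation,
            "hashed_secret": self._hash(secret),
            "revoked": False,
        }
        return f"{token_id}.{secret}"   # only the client ever sees the secret

    def rotate(self, presented: str) -> str:
        token_id, _, secret = presented.partition(".")
        row = self.rows.get(token_id)
        # Constant-time compare of the HMAC, never of the raw secret.
        if row is None or not hmac.compare_digest(row["hashed_secret"], self._hash(secret)):
            raise PermissionError("unknown refresh token")
        if row["revoked"]:
            # A rotated token came back: someone holds a stolen copy.
            for other in self.rows.values():
                if other["family_id"] == row["family_id"]:
                    other["revoked"] = True
            raise PermissionError("reuse detected; family revoked")
        row["revoked"] = True
        return self.issue(row["family_id"], row["generation"] + 1)


store = RefreshStore()
first = store.issue()
second = store.rotate(first)        # normal rotation: generation 0 -> 1
try:
    store.rotate(first)             # replay of the old token
except PermissionError as err:
    print(err)
```

After the replay, the second token is revoked as well, so whichever party held the stolen copy is locked out along with the legitimate client, who must log in again.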

4g. OAuth 2.0 authorization code + PKCE, byte by byte

1. User clicks "Login" in webapp/mobile app.

2. Client generates:
     code_verifier  = random 43-128 char string
                      e.g., "dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk"
     code_challenge = base64url(SHA256(code_verifier))
     state          = random nonce (anti-CSRF)

3. Client 302-redirects browser to IdP /authorize?
     response_type=code&client_id=webapp&
     redirect_uri=https://app.example.com/callback&
     code_challenge=<challenge>&code_challenge_method=S256&
     state=<nonce>&scope=openid profile email

4. IdP shows login page; user submits credentials + MFA.

5. IdP:
   - verifies Argon2id hash (~80ms); MFA challenge if step-up
   - generates auth_code (1 min, single-use)
   - stores: auth_code → {user_id, client_id, code_challenge, scope}
   - 302-redirects browser to:
       https://app.example.com/callback?code=<auth_code>&state=<nonce>

6. Client callback:
   - verifies state matches stored value (anti-CSRF)
   - server-side POSTs to IdP /token:
       grant_type=authorization_code, code=<auth_code>,
       code_verifier=<original verifier>, client_id=webapp,
       client_secret=<secret> (confidential clients only),
       redirect_uri=https://app.example.com/callback

7. IdP /token:
   - retrieves stored code_challenge for auth_code
   - computes base64url(SHA256(code_verifier))
   - MUST match stored code_challenge
   - deletes auth_code (one-time use)
   - mints:
       access_token  (JWT, 15 min, aud=api.example.com)
       id_token      (JWT, OIDC identity claims about user)
       refresh_token (opaque, 30 days, server-side row)
   - returns { access_token, id_token, refresh_token, expires_in: 900 }

8. Client creates session:abc123 → {user_id, tokens} in Redis.
   Set-Cookie: session=abc123; HttpOnly; Secure; SameSite=Lax

9. Subsequent API calls: browser sends cookie → gateway resolves
   session → forwards Authorization: Bearer <access_token> →
   API validates JWT locally (no IdP call).

Why every piece is non-optional:

  • PKCE: prevents code interception. If an attacker steals the auth code (browser history, malicious app on phone, shared computer), they can't redeem it without the original code_verifier. Replaces the older client_secret-only flow for public clients.
  • state parameter: anti-CSRF. Otherwise an attacker can log a victim into the attacker's account.
  • Auth code, not token-in-URL: code traverses the user's browser (untrusted, logs, Referer headers); the token only goes server-to-server.
  • One-time use, 1-min auth code TTL: small leak window; one use closes it.
  • id_token vs access_token (OIDC): id_token is for the client app to verify who logged in. access_token is for the API to verify what's allowed. Mixing them up is a common bug class.

4h. Password storage — why salt + slow KDF

Threat model: your user table eventually leaks (LinkedIn 2012, Adobe 2013, Yahoo 2013, RockYou 2009, Equifax 2017). Assume hashes are public. What must remain unrecoverable?

Bad: SHA-256(password). GPUs do ~10 billion SHA-256/sec. 8-char password = 95^8 ≈ 6.6 × 10^15 candidates → crackable in ~7 days on one rig. Rainbow tables make precomputed dictionaries trivial.

Better: SHA-256(salt + password). Salt defeats rainbow tables. But still 10 GH/s; dictionary passwords crack in seconds.

Good: bcrypt(password, cost=12). ~250ms/hash on 2010 hardware, ~50ms on 2026 hardware. GPU attacks run at ~20 hashes/sec/core, so 95^8 takes ~10M core-years (6.6 × 10^15 / 20 per second ≈ 3.3 × 10^14 s). Cost factor tunable up; re-hash on next login. Limitation: bcrypt is time-hard, not memory-hard, so ASICs (Application-Specific Integrated Circuits) and FPGAs (Field-Programmable Gate Arrays) parallelize it cheaply.

Best: Argon2id with m=64MiB, t=3, p=1. Memory-hard: each hash needs 64MiB of RAM, so ASICs hit a memory bandwidth wall. Argon2 won the 2015 Password Hashing Competition. The id variant is hybrid: resistant to side-channel attacks (the i lineage) AND GPU parallel attacks (the d lineage).

Math: at 64MiB/hash, a 24GB GPU runs ~375 hashes in parallel. At 80ms each, ~4700 hashes/sec/GPU. Cracking a 10-char password (95^10 ≈ 6 × 10^19 candidates) takes ~400M years on one such GPU, or ~400,000 years on a 1000-GPU farm.
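Every crack-time estimate in this section is the same calculation: keyspace divided by guess rate. A small sketch for checking or re-running them with different assumptions; the worked call uses the unsalted SHA-256 figures quoted earlier (95 printable characters, 8 positions, ~10 billion guesses per second).

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25


def crack_days(alphabet: int, length: int, guesses_per_sec: float) -> float:
    """Worst-case days to exhaust every password of the given length."""
    return alphabet ** length / guesses_per_sec / SECONDS_PER_DAY


def crack_years(alphabet: int, length: int, guesses_per_sec: float) -> float:
    return crack_days(alphabet, length, guesses_per_sec) / DAYS_PER_YEAR


# Plain SHA-256 on one GPU rig: about a week for all 8-char passwords.
print(f"{crack_days(95, 8, 10e9):.1f} days")
```

These are exhaustive-search ceilings; dictionary and reuse-based guessing succeed far earlier, which is why the KDF is only one of the layers listed in §2.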

Salt is automatic in the encoded output: $argon2id$v=19$m=65536,t=3,p=1$<base64-salt>$<base64-hash>. Each user gets a unique salt.
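A sketch of the salt-in-the-encoded-output idea using only the standard library. One substitution to note plainly: Python's stdlib has no Argon2id, so this uses hashlib.scrypt (also memory-hard, and one of the KDFs named in §2) as a stand-in. The "scrypt$..." string layout is made up for this example, not a standard encoding, and the cost parameters are illustrative rather than a recommendation.

```python
import hashlib
import hmac
import os

# Illustrative cost parameters: n=2**14, r=8 needs roughly 16 MiB per hash.
N, R, P = 2 ** 14, 8, 1
MAXMEM = 64 * 1024 * 1024


def hash_password(password: str) -> str:
    """Return a self-describing record: scheme, parameters, salt, hash."""
    salt = os.urandom(16)                       # unique per user, stored in the record
    derived = hashlib.scrypt(password.encode(), salt=salt, n=N, r=R, p=P, maxmem=MAXMEM)
    return f"scrypt${N}${R}${P}${salt.hex()}${derived.hex()}"


def verify_password(password: str, encoded: str) -> bool:
    """Re-derive with the stored salt and parameters, compare in constant time."""
    _scheme, n, r, p, salt_hex, derived_hex = encoded.split("$")
    derived = hashlib.scrypt(password.encode(), salt=bytes.fromhex(salt_hex),
                             n=int(n), r=int(r), p=int(p), maxmem=MAXMEM)
    return hmac.compare_digest(derived.hex(), derived_hex)


record = hash_password("correct horse battery staple")
print(record.split("$")[:4])
print(verify_password("correct horse battery staple", record),
      verify_password("Tr0ub4dor&3", record))
```

Storing the parameters next to the hash is what makes "re-hash on next login" possible: the verifier can tell which records were produced with an outdated cost setting.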

Fleet cost: at 6,000 logins/sec peak, 6000 × 80ms = 480 CPU-seconds per wall second = 480 cores for password verification. With 480 hashes in flight at any instant, that is also 480 × 64MiB = 30GiB of resident RAM, churned at 6000 × 64MiB ≈ 375GiB of allocations per second. That's why login flows are aggressively rate-limited: an attacker probing usernames can OOM your auth tier.

When the interviewer asks "MD5 vs SHA-256 for passwords?" — neither. KDFs only. That's the entire answer.

4i. End-to-end login walkthrough

Composing §4b + §4g + §4h with timings:

  • T=0: user submits credentials.
  • T=4ms: IdP reads user row (~2ms MySQL shard read).
  • T=84ms: Argon2id verify completes (~80ms compute).
  • T=85-2086ms: MFA challenge (~2s for the human to type the TOTP).
  • T=2089ms: IdP HSM-signs the JWT (~3ms RS256 sign).
  • T=2092ms: IdP inserts refresh_token row (~3ms).
  • T=2096ms: webapp creates Redis session.
  • T=2098ms: Set-Cookie returned.

Then per-API-call: ~1ms session lookup at gateway; ~80µs RS256 verify at the backend; ~5µs claim validation; ~50ns denylist check. Per-request overhead after login: ~1.5ms total, ~100µs of crypto. Password verify is the only slow thing (80ms) and happens once per ~10,000 API calls. That asymmetry is what the technology buys.


§5. WebAuthn and passkeys in depth

Passwords are the worst credential humans have ever invented. They are phishable, reusable, leakable, forgettable, and stored in every breach corpus on the dark web. The 2024-2026 industry shift to passkeys is the first credible attempt at killing them at consumer scale. Understanding the actual protocol — not just the marketing — is required.

5.1 FIDO2 in two layers

FIDO2 (Fast IDentity Online 2) is two specs working together:

  • CTAP2 (Client to Authenticator Protocol v2) — talks between the client (browser, OS) and the authenticator (the device holding the private key: YubiKey over USB/NFC, Apple Secure Enclave, Windows Hello TPM). CTAP2 carries the challenge from the relying party to the device and the signed assertion back. Wire formats: CBOR (Concise Binary Object Representation) over USB-HID, NFC, or Bluetooth Low Energy.
  • WebAuthn (Web Authentication, W3C spec) — JavaScript API exposed by the browser to web pages. navigator.credentials.create() for registration, navigator.credentials.get() for login. WebAuthn talks to the OS/browser, which talks CTAP2 to the authenticator. Server side, the relying party receives a JSON blob with the signed assertion.

A web page never sees raw CTAP2; an authenticator never sees WebAuthn JSON. The browser is the bridge.

5.2 Public-key crypto on the device

Registration creates a fresh key pair on the authenticator:

1. RP (Relying Party) server sends: rp_id="example.com", user_id, challenge (random 32B).
2. Browser calls navigator.credentials.create({publicKey: {...}}).
3. Browser asks OS/authenticator to:
     - generate a fresh ECDSA P-256 (or Ed25519) key pair
     - bind the pair to (rp_id, user_id) inside the secure element
     - return: { credential_id, public_key, attestation_object,
                client_data_json (contains the challenge) }
4. Authenticator may also prompt for user verification: Face ID, Touch ID, PIN.
5. Server stores: (user_id, credential_id, public_key).
   Private key NEVER LEAVES the authenticator.

Login is challenge-response:

1. Server generates challenge (32 bytes), sends to browser.
2. navigator.credentials.get(); authenticator looks up credential_id
   bound to rp_id, prompts user verification, signs:
     signature = ECDSA_sign(SHA256(authenticator_data || client_data_hash))
3. Server verifies signature with stored public_key.
4. Done. No password ever transmitted, ever.

The "private key never leaves the device" property is enforced by the secure element. On Apple, it's the Secure Enclave (a separate ARM core with isolated memory and a hardware AES engine). On Android, it's StrongBox (a discrete chip on Pixel/Samsung high-end) or TEE (Trusted Execution Environment) as a fallback. On Windows, it's the TPM (Trusted Platform Module). On a YubiKey, it's the YubiKey's secure element silicon. The OS gets a signed assertion; the OS cannot read the key.

5.3 No shared secret, phish-resistant

The "phish-resistant" claim rests on the rp_id binding. When the authenticator signs, it signs over rp_id_hash, the SHA-256 of the RP ID (the site's domain), and the browser only lets a page use an RP ID that matches the origin the user is actually on. If a user visits https://exarnple.com (typo-squat, the m is rn), the browser passes rp_id=exarnple.com to the authenticator. The authenticator looks up credentials for exarnple.com: none exist. No credential, no signature, no login. The user literally cannot be phished into using their example.com credential on exarnple.com.
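The server-side counterpart of that binding can be sketched as follows. This is a partial check under stated assumptions: it covers the clientDataJSON fields (type, challenge, origin) and the first bytes of authenticatorData (32-byte rpIdHash, then a flags byte), as laid out in the WebAuthn spec, and it deliberately omits the signature verification, which needs an ECDSA library outside the stdlib. The function name is hypothetical.

```python
import base64
import hashlib
import hmac
import json

FLAG_USER_PRESENT = 0x01  # bit 0 of the authenticatorData flags byte


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def binding_checks_pass(client_data_json: bytes, authenticator_data: bytes,
                        expected_challenge: bytes, expected_origin: str,
                        rp_id: str) -> bool:
    """Origin/RP-ID checks only; signature verification is NOT done here."""
    client = json.loads(client_data_json)
    if client.get("type") != "webauthn.get":
        return False
    if client.get("challenge") != b64url_encode(expected_challenge):
        return False
    if client.get("origin") != expected_origin:          # set by the browser, not the page
        return False
    if len(authenticator_data) < 37:                     # rpIdHash + flags + signCount
        return False
    expected_hash = hashlib.sha256(rp_id.encode("utf-8")).digest()
    if not hmac.compare_digest(authenticator_data[:32], expected_hash):
        return False
    return bool(authenticator_data[32] & FLAG_USER_PRESENT)


challenge = b"\x01" * 32  # stand-in for the server's random 32-byte challenge


def fake_assertion(origin: str, rp_id: str):
    """Build the two blobs a browser/authenticator would return (test data)."""
    client_data = json.dumps({"type": "webauthn.get",
                              "challenge": b64url_encode(challenge),
                              "origin": origin}).encode()
    auth_data = hashlib.sha256(rp_id.encode()).digest() + bytes([0x05]) + (1).to_bytes(4, "big")
    return client_data, auth_data


good = fake_assertion("https://example.com", "example.com")
phished = fake_assertion("https://exarnple.com", "exarnple.com")
print(binding_checks_pass(*good, challenge, "https://example.com", "example.com"))
print(binding_checks_pass(*phished, challenge, "https://example.com", "example.com"))
```

In the typo-squat scenario the flow normally stops earlier, at the authenticator, because no credential exists for the look-alike RP ID; these server checks are the second line of defence.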

Compare to passwords: a phishing site shows a pixel-perfect login page; the user types their password; the attacker now has the password. The user is the weak link. Passkeys remove the user from the trust loop — the browser enforces origin binding cryptographically.

This is also why passkeys defeat credential stuffing. There's no shared secret to stuff with. Each (user, rp) pair has a unique key. A breach of one site's database leaks only public keys, which are useless to attackers.

5.4 Platform vs roaming authenticators

  • Platform authenticator — the authenticator is built into the device the user is on. Apple Keychain on iOS/macOS, Windows Hello on Windows 10+, Android Smart Lock. UX is seamless (Face ID prompt, done). Lives in the secure element. Cannot be used from a different device unless cloud-synced (see 5.5).
  • Roaming authenticator — separate hardware that the user carries. YubiKey 5 series, Google Titan key, Feitian. Connects via USB-A/C, NFC, or BLE. Same rp_id binding model, but the credential is portable: a single YubiKey can log into the user's account from any device with a USB port. Enterprise security teams prefer roaming authenticators because the credential is air-gapped from the OS — even total OS compromise doesn't leak it.

Mixed-mode is the production answer: platform for everyday convenience, roaming as the fallback / recovery / step-up factor.

5.5 Passkey sync across devices

The original FIDO2 model was non-syncable: lose the device, lose the key. That ergonomic was poison for consumer adoption — losing your iPhone meant losing your bank account. Apple, Google, and Microsoft all introduced synced passkeys in 2022-2023:

  • iCloud Keychain (Apple): passkey private keys are encrypted for the sync fabric with a key derived from the iCloud password + device PINs. Apple cannot decrypt them. A new device joining the iCloud account performs an attested key-exchange handshake with an existing device to retrieve the sync keys.
  • Google Password Manager — similar architecture, sync-fabric encryption with end-to-end encryption (E2EE) keys derived from the user's Google account secrets + device locks.
  • 1Password / Dashlane / Bitwarden — third-party password managers added passkey storage in 2023-2024, syncing via the same vault that holds passwords.

The trade-off: synced passkeys are no longer hardware-bound. The private key exists in multiple devices (and in the encrypted cloud blob). Stronger than passwords (still phish-resistant, still per-origin), but weaker than non-syncable hardware keys (now susceptible to cloud-account compromise). Enterprise compliance regimes (FedRAMP High, PCI Level 1) still require hardware-bound (non-syncable) keys for privileged access; consumer flows accept synced.

5.6 When to migrate from passwords (the 2024 shift)

The industry inflection point: in 2024-2025 Apple, Google, Microsoft, GitHub, Amazon, eBay, Best Buy, Adobe, and PayPal all enabled passkeys as a primary credential. By 2026, "create a passkey" is the default sign-up CTA at most major consumer sites. The migration plays out as:

  1. Phase 1 — additive: passkeys offered alongside passwords; users opt in.
  2. Phase 2 — default: new signups use passkeys; existing users prompted to add one.
  3. Phase 3 — passwordless: password becomes a recovery factor only; primary auth is passkey.
  4. Phase 4 — passwords removed: account can be created and used with no password ever set.

Design call: most new B2C systems should start at Phase 2, with passkeys as the default and passwords as the fallback. For B2B/enterprise, follow what the customer's IdP supports (Okta and Azure AD both support WebAuthn now). For very high-security flows (admin, financial), hardware-bound roaming authenticators are still the right answer.

The cost: WebAuthn is more code than <input type=password>, and registration UX has device-discoverability quirks (in the non-synced model the credential lives on the device you registered with, not on your account, which confuses users). Recovery requires a fallback channel (email, second device, IT helpdesk). Adoption requires user education. But the security upside (eliminating phishing, credential stuffing, and password reuse in one stroke) has finally tilted the calculation.


§6. Passwordless beyond passkeys

Passkeys are the strongest passwordless option, but the broader passwordless landscape contains several weaker — and one acceptable — alternatives. The hierarchy of MFA strength runs roughly: passkey (FIDO2/WebAuthn) > hardware OTP token > push with number-matching > push without number-matching > authenticator app TOTP > magic link > SMS OTP > knowledge-based (security questions).

6.1 Magic links

A magic link is a single-use URL emailed to the user's address: https://example.com/auth/magic?token=<random-64-char>. Click the link, the server validates the token (one-time, ~10 min TTL), and the user is logged in.

  • Security: depends entirely on the security of the user's email account. If email is itself behind a passkey/2FA, this is acceptable. If email is password-only, magic links inherit that weakness.
  • UX win: zero-friction onboarding. No password to remember, no app to install. Substack, Notion, Medium, Slack all use magic links for sign-in.
  • Failure modes: email delivery delays (SPF/DKIM misconfigured); link tokens leak via shared mailboxes, corporate email filters that pre-fetch URLs (Microsoft Defender expanding magic links and consuming the one-time use); user clicks link from a different device than they started on, leaving a half-authenticated session.
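A minimal sketch of the issue/redeem cycle described above, with the two properties that matter: single use and a ~10 minute TTL. The class name and in-memory dict are assumptions for illustration; storing only a hash of the token is a common hardening choice added here, not something the text above prescribes.

```python
import hashlib
import secrets
import time


class MagicLinks:
    TTL_SECONDS = 600  # ~10 min, as above

    def __init__(self):
        self._pending = {}  # sha256(token) -> (email, expires_at)

    @staticmethod
    def _digest(token: str) -> str:
        return hashlib.sha256(token.encode("ascii")).hexdigest()

    def issue(self, email: str, now: float = None) -> str:
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(48)        # 48 random bytes -> 64 URL-safe chars
        self._pending[self._digest(token)] = (email, now + self.TTL_SECONDS)
        return f"https://example.com/auth/magic?token={token}"

    def redeem(self, token: str, now: float = None):
        """Return the email on success, None otherwise. pop() enforces single use."""
        now = time.time() if now is None else now
        entry = self._pending.pop(self._digest(token), None)
        if entry is None:
            return None
        email, expires_at = entry
        return email if now < expires_at else None


links = MagicLinks()
url = links.issue("user@example.com", now=1_000)
token = url.split("token=")[1]
print(links.redeem(token, now=1_100), links.redeem(token, now=1_100))
```

The pop-on-redeem behaviour is also what makes the link pre-fetching failure mode bite: whichever client fetches the URL first, human or mail scanner, consumes the token.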

Caveat: magic links are sometimes positioned as "passwordless" but they're really "email-as-the-credential." If the threat model includes email account compromise, magic links don't help.

6.2 SMS OTP (One-Time Password)

A 6-digit code sent via SMS. User types the code into the login page. Server validates (one-time, ~5 min TTL).

  • Security: weak. SMS is not end-to-end encrypted; the carrier sees every message. Worse, SIM-swap attacks let an attacker bribe or social-engineer a carrier into transferring the victim's phone number to an attacker-controlled SIM. The 2019 Twitter CEO hack (Jack Dorsey) and the 2022 wave of crypto-exchange compromises (Coinbase, FTX users) were SIM swaps.
  • UX: universal — every phone receives SMS. No app install. Still the most widely-deployed MFA factor by volume.
  • Why it persists: even weak MFA is much better than no MFA against bulk credential stuffing. NIST (National Institute of Standards and Technology) SP 800-63B has discouraged SMS since 2017 but didn't ban it because alternatives weren't universally available.

Verdict: acceptable as a fallback factor, never as the primary factor for high-value accounts. Banks pretending SMS is "two-factor" are gambling on bulk attackers being lazier than the SIM-swap rings.

6.3 Authenticator app TOTP (RFC 6238)

Time-based One-Time Password. The server and the device share a secret (a base32-encoded ~20-byte key, provisioned via QR code at enrollment). Every 30 seconds, both sides compute:

TOTP = truncate(HMAC-SHA1(secret, floor(now / 30))) mod 10^6

The 6-digit code is displayed in Google Authenticator, Authy, 1Password, etc. User types it; server computes the same value and compares.

  • Security: stronger than SMS — no carrier in the loop, no SIM swap. Weaker than passkeys — phishable (attacker proxies the code in real time via a phishing page), and the shared secret can be exfiltrated from the device (rooted phone, malicious Authenticator app).
  • UX cost: requires the user to install an app and scan a QR code at enrollment. Switching phones requires re-enrollment or a sync (Authy and 1Password offer cloud sync; Google Authenticator added it in 2023).
  • Clock skew: TOTP requires the device and server clocks to agree to within about one 30s step. Devices with bad NTP fail silently. The server should accept the current window plus one neighbor on each side (±30s) by default.
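The formula above fits in a dozen lines of standard-library Python. This sketch follows RFC 4226 (dynamic truncation) and RFC 6238 (time step), and the usage values are the RFCs' own test vectors for the ASCII secret "12345678901234567890"; the verify_totp helper and its window parameter are illustrative names for the clock-skew allowance.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """RFC 4226: HMAC-SHA1 over the 8-byte counter, then dynamic truncation."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                                   # low nibble picks the window
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, now: float, step: int = 30, digits: int = 6) -> str:
    """RFC 6238: HOTP with counter = floor(now / step)."""
    return hotp(secret, int(now // step), digits)


def verify_totp(secret: bytes, submitted: str, now: float,
                step: int = 30, window: int = 1) -> bool:
    """Accept the current step and `window` neighbours on each side."""
    counter = int(now // step)
    return any(
        hmac.compare_digest(hotp(secret, counter + drift), submitted)
        for drift in range(-window, window + 1)
        if counter + drift >= 0
    )


SECRET = b"12345678901234567890"          # RFC test secret, never use in production
print(totp(SECRET, now=59))               # RFC 6238 vector at T=59s, last 6 digits
print(verify_totp(SECRET, "287082", now=89), verify_totp(SECRET, "287082", now=120))
```

Widening the window trades security for tolerance: each extra accepted step gives a real-time phishing proxy another 30 seconds to replay a captured code.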

Position: TOTP is the sensible default MFA factor when passkeys aren't available — phishable but far better than SMS or no MFA.
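
The formula and the ±1 step tolerance above can be sketched with the standard library alone. This is a minimal illustration, not a production verifier: the function names and the `window` parameter are assumptions made for this example, and a real server would also record the last accepted counter per user so a code cannot be replayed within its window.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HOTP (RFC 4226): HMAC-SHA1 over the 8-byte big-endian counter."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation: low nibble of the last byte
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, now: float, step: int = 30, digits: int = 6) -> str:
    """TOTP (RFC 6238): HOTP with counter = floor(now / step)."""
    return hotp(secret, int(now // step), digits)


def verify_totp(secret: bytes, submitted: str, now: Optional[float] = None,
                step: int = 30, window: int = 1) -> bool:
    """Accept the current time step plus `window` steps on either side."""
    now = time.time() if now is None else now
    counter = int(now // step)
    for drift in range(-window, window + 1):
        if hmac.compare_digest(hotp(secret, counter + drift), submitted):
            return True
    return False


if __name__ == "__main__":
    secret = b"12345678901234567890"  # RFC 6238 test secret (ASCII)
    print(totp(secret, 59))           # first RFC 6238 test time, 6 digits
```

Widening `window` trades security for tolerance: each extra step is another 30 seconds during which a phished code stays usable.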

6.4 Push notifications and number-matching

The IdP sends a push to a previously-enrolled mobile app (Okta Verify, Duo, Microsoft Authenticator). User taps "Approve" — done.

  • MFA fatigue: by 2021-2022 attackers learned to spam approval prompts in the middle of the night, hoping the user would tap "approve" to make it stop. Uber's 2022 breach was via MFA fatigue against a contractor.
  • Number-matching mitigation: the login page shows a 2-digit number; the push notification asks the user to type that number into the app, not just tap. Microsoft now enforces this by default, and Duo offers it as Verified Push. Eliminates blind-tap fatigue attacks.
  • Phishability: still phishable via real-time proxy (the attacker forwards the user's typed number to the legitimate login page).

Position: push with number-matching is acceptable; push without is a known vulnerability.

The 2025-2026 best practice is to combine: passkey as primary credential, magic-link to email or SMS to a paired device as recovery only, TOTP or hardware OTP for step-up to admin actions. This builds the MFA hierarchy into the flow, not the factor — the strong factor for sensitive operations, the weak factor only for the long-tail of recovery cases.


§7. OAuth 2.0 security best practices

OAuth 2.0 is a flow framework, not a turnkey security solution. It ships with footguns; the practitioner is expected to name them on sight.

7.1 PKCE (Proof Key for Code Exchange, RFC 7636) — mandatory

PKCE answers: how do we know the client that started the authorization flow is the same client redeeming the code?

The legacy flow used client_secret. That works for confidential clients (server-side webapps with a real secret backend). It fails for public clients (mobile apps, SPAs) because they can't keep a secret — any value baked into the binary can be extracted.

PKCE replaces the secret with a dynamic per-flow proof:

At /authorize:
  code_verifier  = random 43-128 char URL-safe string  (kept by client)
  code_challenge = base64url(SHA256(code_verifier))    (sent to IdP)
  → IdP stores code_challenge alongside the authorization code

At /token:
  client sends code_verifier
  IdP computes base64url(SHA256(code_verifier)), compares to stored code_challenge
  → match: issue token. Mismatch: reject.

Why it works: an attacker can steal the authorization code (browser history, log files, malicious app intercepting the redirect URI on mobile), but without the original code_verifier (kept in client memory, never sent to the IdP at the first step), they can't redeem it. The verifier is bound to the flow.

OAuth 2.1 makes PKCE mandatory for all clients, public and confidential. The S256 challenge method (SHA256(verifier)) is required; the older plain method is deprecated.
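
A minimal sketch of both halves of the S256 exchange, assuming nothing beyond the standard library. The function names are hypothetical; the encoding (base64url without padding over the SHA-256 digest of the ASCII verifier) follows RFC 7636.

```python
import base64
import hashlib
import hmac
import secrets


def make_code_verifier(n_bytes: int = 32) -> str:
    """Client side: 32 random bytes encode to 43 URL-safe characters."""
    raw = secrets.token_bytes(n_bytes)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def s256_challenge(code_verifier: str) -> str:
    """code_challenge = base64url(SHA256(code_verifier)), padding stripped."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def idp_verify(stored_challenge: str, presented_verifier: str) -> bool:
    """IdP side at /token: recompute the challenge and compare in constant time."""
    return hmac.compare_digest(stored_challenge, s256_challenge(presented_verifier))


if __name__ == "__main__":
    verifier = make_code_verifier()          # kept by the client
    challenge = s256_challenge(verifier)     # sent at /authorize
    print(idp_verify(challenge, verifier))   # checked at /token
```

The verifier never leaves the client until the token request, so a stolen authorization code alone fails `idp_verify`.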

7.2 state parameter — CSRF protection

The state parameter is a random nonce the client generates at the start of the OAuth flow and verifies at the callback. Without it, an attacker can initiate an OAuth flow themselves, capture the authorization code midway, and redirect a victim to the client's callback URL with the attacker's code. The victim's browser presents the code; the client redeems it and creates a session — but the session is for the attacker's identity. The victim is now logged into the attacker's account, and any data they enter goes into the attacker's account.

Fix: generate state (random, 32+ bytes) at the start of the flow; store it in the client's session storage; on callback, compare state from the URL against the stored value. Mismatch → abort.
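
The fix can be sketched as two small functions. The `session` dict stands in for whatever server-side or cookie-backed session the client uses, and the key name `oauth_state` is an assumption of this example.

```python
import hmac
import secrets
from typing import Optional


def begin_flow(session: dict) -> str:
    """Generate a fresh state value and remember it for this browser session."""
    state = secrets.token_urlsafe(32)  # 32 random bytes, URL-safe encoded
    session["oauth_state"] = state
    return state                       # append to the /authorize URL as ?state=...


def check_callback(session: dict, returned_state: Optional[str]) -> bool:
    """Compare the state on the callback URL with the stored one (single use)."""
    expected = session.pop("oauth_state", None)
    if expected is None or returned_state is None:
        return False
    return hmac.compare_digest(expected, returned_state)


if __name__ == "__main__":
    session = {}
    state = begin_flow(session)
    print(check_callback(session, state))
```

Popping the stored value makes each state single-use, so a callback URL cannot be replayed against the same session.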

7.3 nonce parameter — OIDC replay protection

For OpenID Connect, the client also passes nonce at /authorize. The IdP echoes it into the id_token claims. The client verifies id_token.nonce == stored_nonce. This binds the identity assertion to the specific login flow — preventing an attacker from replaying an old id_token against a fresh session.

7.4 Redirect URI validation — exact match

The IdP must validate that the redirect_uri in the request matches a pre-registered URI for the client. Common bugs:

  • Wildcards — IdP allows https://app.example.com/*. Attacker picks any path on that origin they can influence (user-generated content, or an open redirect) as the redirect_uri and has the authorization code delivered there.
  • Prefix match — IdP allows any URI starting with https://app.example.com. Attacker uses https://app.example.com.evil.com.
  • Scheme mismatch — IdP allows app.example.com/callback (no scheme); attacker uses http://... and intercepts in transit.
  • Open redirector on the client — client has a /redirect?to=... endpoint; attacker chains the authorization flow through it to exfiltrate the code.

Fix: exact match on the full URI string, including scheme. No wildcards. No partial matches. Open redirects on the client domain are independently exploitable.
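
A short sketch contrasting exact matching with the prefix bug from the list above. The registry structure, client ID, and function names are hypothetical.

```python
# Hypothetical registry: client_id -> set of pre-registered redirect URIs.
REGISTERED_REDIRECTS = {
    "demo-client": {"https://app.example.com/callback"},
}


def redirect_uri_allowed(client_id: str, redirect_uri: str) -> bool:
    """Exact string comparison against the registered set; nothing is normalized."""
    return redirect_uri in REGISTERED_REDIRECTS.get(client_id, set())


def prefix_match_broken(redirect_uri: str,
                        allowed_prefix: str = "https://app.example.com") -> bool:
    """The buggy variant, kept only for contrast. Do not use."""
    return redirect_uri.startswith(allowed_prefix)


if __name__ == "__main__":
    evil = "https://app.example.com.evil.com/callback"
    print(redirect_uri_allowed("demo-client", evil))  # exact match rejects it
    print(prefix_match_broken(evil))                  # prefix match lets it through
```

Deliberately skipping URL parsing and normalization keeps the check simple: the string either is the registered one or it is not.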

7.5 Refresh token rotation

Covered in §4f. Each use of a refresh token returns a new one and invalidates the old. Detect family reuse to catch theft. OAuth 2.1 §6.1.
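
As a rough illustration of rotation with family reuse detection, the sketch below keeps everything in memory. The class and field names are assumptions of this example; a real store would persist the records and expire them.

```python
import secrets
from typing import Optional


class RefreshStore:
    """In-memory refresh-token store with rotation and family reuse detection."""

    def __init__(self) -> None:
        self._tokens = {}              # token -> {"family", "user", "used"}
        self._revoked_families = set()

    def issue(self, user_id: str, family_id: Optional[str] = None) -> str:
        """Mint a token; a login starts a new family, a rotation continues one."""
        family_id = family_id or secrets.token_hex(8)
        token = secrets.token_urlsafe(32)
        self._tokens[token] = {"family": family_id, "user": user_id, "used": False}
        return token

    def rotate(self, token: str) -> Optional[str]:
        """Exchange a refresh token for a new one, or return None on rejection."""
        record = self._tokens.get(token)
        if record is None or record["family"] in self._revoked_families:
            return None
        if record["used"]:
            # A token was presented twice: either the client or a thief holds
            # a stale copy, so the whole family is revoked.
            self._revoked_families.add(record["family"])
            return None
        record["used"] = True
        return self.issue(record["user"], record["family"])


if __name__ == "__main__":
    store = RefreshStore()
    first = store.issue("alice")
    second = store.rotate(first)
    print(second is not None, store.rotate(first), store.rotate(second))
```

Revoking the whole family on reuse forces both the legitimate client and the thief back to a full login, which is the point: the server cannot tell which of them presented the stale token.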

7.6 OAuth 2.1 — the cleanup

OAuth 2.1 (draft RFC ongoing through 2024-2026) consolidates a decade of best practices into a single normative spec:

  • Implicit flow — removed. The implicit flow returned the access token directly in the URL fragment. Token in URL was always a footgun (browser history, Referer, server logs). PKCE-augmented authorization code is the only flow for public clients now.
  • ROPC (Resource Owner Password Credentials) — removed. The flow where the client collects the user's password directly and sends it to the IdP. Defeats the entire point of OAuth (the password should never touch the client). Removed.
  • PKCE — required for all clients.
  • Bearer tokens in URL query — forbidden. Tokens go in the Authorization header (or a form-encoded body), never the query string.
  • Redirect URIs — exact match required. No more wildcards.
  • Refresh tokens for public clients — must be sender-constrained or rotated on every use.

Recommendation: design new systems against OAuth 2.1 from day one, even if the spec hasn't been published as final. None of the deprecated flows are worth supporting in 2026.


§8. SAML vs OIDC tradeoffs

SAML (Security Assertion Markup Language) and OIDC (OpenID Connect) solve the same problem — federated identity, an IdP vouches for a user to a downstream Service Provider (SP). They differ in vintage, wire format, and ecosystem.

8.1 The two protocols at a glance

Dimension           | SAML 2.0                                 | OIDC 1.0
--------------------|------------------------------------------|-------------------------------------
Year                | 2005                                     | 2014
Wire format         | XML, signed with XML-DSIG                | JSON / JWT, signed with JWS
Transport           | Browser POST or redirect                 | HTTPS, REST-ish
Carries             | XML <Assertion> with attributes          | id_token (JWT) + UserInfo endpoint
Mobile support      | Painful (XML in mobile webview)          | First-class (mobile SDKs)
API/service support | Not designed for it                      | First-class (access tokens for APIs)
Discovery           | Manual cert exchange or SAML metadata XML | /.well-known/openid-configuration
Library maturity    | Java EE world; C# WIF; Python pysaml2    | Universal: every language has it
Initiated by        | SP-initiated or IdP-initiated            | Almost always SP-initiated

8.2 When you must support SAML

If you sell B2B SaaS to mid-large enterprises, you must support SAML. Every Fortune 500 has an Active Directory or Okta with SAML federation. Procurement checklists say "must support SAML 2.0 SSO" — without it, you don't get past the security review. The enterprise IdP-of-record is usually Active Directory Federation Services (ADFS), Okta, Azure AD, Ping, or OneLogin, all SAML-native.

For B2C, mobile-first, or API-first products, you can skip SAML entirely and ship OIDC. The customer base never asks.

8.3 Common SAML pitfalls

SAML's threat model is the worst part. The protocol involves parsing untrusted XML, validating signatures over selected XML subtrees, and resolving attribute claims — and every step has had CVEs:

  • XML signature wrapping — the assertion is signed, but the signature covers only the subtree it references by ID. An attacker takes a legitimately signed assertion and wraps it in a new XML envelope: the signed subtree is still present and still referenced by ID, but the assertion the SP actually parses is a second, unsigned, attacker-controlled one. The SP validates the signature (it matches), reads the unsigned assertion (attacker's content), and accepts the attacker's chosen identity. CVE-2011-1411 (Shibboleth OpenSAML) is an early example, and variants have kept surfacing in SAML libraries since. Microsoft Azure AD, Okta, and OneLogin have all shipped patches for variants of this bug.
  • Comment injection in NameID — the NameID is a text element, and XML canonicalization drops comments before the signature is checked. An attacker who controls the account victim@example.com.evil.com takes their own validly signed assertion and edits the NameID to victim@example.com<!-- -->.evil.com. The signature still verifies, but a vulnerable SP library reads only the text node before the comment and sees victim@example.com. Cross-account impersonation. Duo Security disclosed this in 2018.
  • XXE (XML External Entity) — older SAML parsers loaded external XML entities, enabling SSRF and file-read attacks from a malicious assertion.
  • Time-bound assertions — the SAML assertion has NotBefore and NotOnOrAfter, but old SP libraries didn't enforce them strictly, allowing replay of stale assertions.
  • AssertionConsumerService URL spoofing — analogous to OAuth's redirect_uri validation. SP must validate.

The defensive playbook: use a battle-hardened library (Shibboleth, OneLogin's php-saml, Spring SAML), keep it patched, prefer EncryptedAssertion (signed and encrypted with the SP's public key) over plain signed, validate the entire signed tree as the only source of identity claims, and never trust unsigned XML siblings.

8.4 The migration story

The 2020-2026 trend: enterprise IdPs (Okta, Azure AD) speak both SAML and OIDC. New integrations use OIDC. Legacy integrations use SAML. Customer requests for OIDC grow every year. Most SaaS products support both; identity platforms such as Auth0 abstract both behind a single API. Advice: ship OIDC as the primary, SAML as the alternate, and don't write your own SAML parser. Use a library.


§9. API key vs OAuth tradeoffs

API keys and OAuth scopes solve a related but distinct problem: how does a machine (or sometimes a human in a script) authenticate to your API?

9.1 API keys — simple wins, scaling losses

An API key is an opaque bearer token, typically generated at signup and pasted into config:

curl -H "Authorization: Bearer sk_live_AbCdEfGhIjKlMnOp..." \
     https://api.example.com/v1/charges

  • Pros: trivial to issue, trivial to use. Every dev tool supports them. Stripe, Twilio, SendGrid, Mailchimp all started with API keys. Time-to-first-API-call is one minute.
  • Cons at scale:
      • Hard to scope — typically each key has full account access. A leaked key from a build server can read all customer data.
      • Hard to rotate — keys often end up in config files, CI variables, customer-side third-party integrations. Rotating is a coordinated migration.
      • Hard to audit — without further infrastructure, server logs see "key_xxx" but not "which engineer issued this for which purpose."
      • The "100k key" trap — a successful product accumulates keys across years of integrations. A company we won't name discovered 800k unrevoked Stripe-style keys outstanding, with no idea which were active. Mass-rotation requires customer-side coordination, often impossible.

9.2 OAuth scopes — proper grant model

OAuth 2.0 with scopes provides:

  • Per-grant scoping — the OAuth client requests specific scopes (read:profile, write:posts); the user (or admin) grants them; the token is limited to those scopes. The API rejects out-of-scope calls.
  • Revocable per-grant — the user can revoke a specific integration without affecting others.
  • Audit trail — every access token is traceable back to a specific OAuth client (= integration partner) + specific user grant.
  • Refresh tokens — short-lived access tokens with explicit refresh rotation.

Cost: setup is more code. Each integration partner registers an OAuth client. The OAuth dance is ~5 endpoints. Customer-facing integrations need consent screens. Time-to-first-API-call is closer to a week.

9.3 GitHub fine-grained PATs — the modern hybrid

GitHub Personal Access Tokens (PATs) shipped in 2013 as classic API keys (account-scoped, no expiry, no resource limits). The 2022 "fine-grained PATs" redesign keeps the UX of an API key (you generate it in settings, paste it into config) but adds the guardrails of OAuth:

  • Required expiry — minimum 7 days, maximum 1 year. No "forever" tokens.
  • Resource scopes — per-repository (not just per-account) access. "This token can read repo X but nothing else."
  • Permission scopes — read/write granularity on specific resources (contents, issues, pull requests).
  • Org admin policy — orgs can require fine-grained PATs and forbid classic PATs.

This is the recommendation for "we shipped with API keys, now we want OAuth-grade safety without breaking the simple integration UX": short-lived, resource-scoped, expiry-required, audit-logged. Keep the paste-into-config ergonomic; remove the grenade-pin nature of forever-tokens.

9.4 The decision

Scenario                                                       | Pick
---------------------------------------------------------------|--------------------------------------------------
Server-side automation, single org, devs trust each other      | API keys (with expiry + scopes)
Third-party integrations, customer-installed apps, marketplace | OAuth 2.0 with PKCE
Internal CLI / CI scripts, want simplicity                     | Fine-grained PAT pattern
Highly sensitive (payments, healthcare PHI), regulated         | OAuth + DPoP / mTLS-bound tokens
Mixed: public API + partner integrations                       | Both: API keys for vanilla, OAuth for marketplace

§10. Capacity envelope

Real deployments at very different scales:

Small: single Keycloak instance, ~100k users. A SaaS startup runs Keycloak (open-source IdP) on a single 4-core VM with PostgreSQL. Workload: 100k users, 1k DAU, ~10k logins/day → ~0.1 logins/sec average. Validations: ~10k API calls/day × 1k DAU ≈ 120/sec. Single instance at ~10% CPU. Bottleneck appears around 100 logins/sec — Argon2id saturates cores. Cost: ~$50/month.

Mid: Auth0/Okta enterprise tier, ~5M users per tenant. Mid-market SaaS uses Auth0 (now part of Okta). 5M users, 500k DAU, ~250k logins/day → ~30/sec peak. ~5B API calls/year → ~150/sec average. Multi-tenant shared infrastructure; each tenant gets its own JWKS endpoint but shares the validation tier. Bottleneck: per-tenant rate limits (~1000 logins/min). Cost: $5k-50k/month depending on M2M (machine-to-machine) volume.

Large: Google Identity, ~3B users. ~150M logins/day → ~1700/sec average, ~6000/sec peak (3.5× diurnal). ~30B API calls/day across Google services → ~350k/sec average validation, ~2M/sec peak across the global fleet. Validation-to-login ratio: roughly 200:1 on these figures (30B validations against 150M logins per day). That ratio is the number that defines the technology choice.

To handle 2M validations/sec:

  • JWT validation distributed to every service via library + JWKS cache; no central validation tier. Roughly 160-200 cores in total (2M/sec at ~80-100µs each), spread across thousands of hosts.
  • Public JWKS endpoint cached at edge with hour-scale TTL.
  • Per-user revocation propagation via internal pub/sub, <1s fanout.
  • User store on Bigtable (sharded by user_id hash) for cold-path lookups.
  • Argon2id verification on a dedicated cluster with ~5000 cores reserved.
  • BeyondCorp zero-trust gates internal services — every request, every hop, verifies. No VPN, no implicit trust.

Other anchors: LinkedIn (1B+ members, hundreds of millions of session validations/sec, hybrid session-cookie+JWT). Apple Sign-In (hundreds of millions of devices, ES256). AWS STS (Security Token Service — probably the highest-QPS auth surface on Earth; every AWS API call validates an STS-issued credential via custom SigV4). GitHub OAuth Apps + Apps (100M+ users, billions of API calls/day).

The ~200:1 ratio defines the technology: anything you put per-request against a central store doesn't survive scale.


§11. Architecture in context

Canonical pattern, not specific to one product:

                                  ┌─────────────────────┐
                                  │ Public JWKS endpoint│
                                  │ /.well-known/...    │
                                  │ (public keys, kid)  │
                                  └──────────▲──────────┘
                                             │ fetched hourly
   ┌─────────┐       ┌────────────┐   ┌──────┴──────┐       ┌─────────┐
   │ Browser │──────▶│ API Gateway│──▶│ IdP / Auth  │──HSM─▶│  KMS    │
   │ / Mobile│ HTTPS │ (Envoy/Kong│   │  Service    │ sign  │  / HSM  │
   └─────────┘       │  /custom)  │   │  (issuer)   │       │ (priv k)│
        │            └──────┬─────┘   └─────┬───────┘       └─────────┘
        │ Set-Cookie        │                │
        │ session_id        │                ▼
        │                   │         ┌──────────────┐
        │                   │         │ User store   │
        │                   │         │ (MySQL,      │
        │                   │         │  sharded by  │
        │                   │         │  user_id)    │
        │                   │         │  - argon2id  │
        │                   │         │  - mfa_secret│
        │                   │         └──────────────┘
        │                   │
        │                   │ JWT in Authorization header
        │                   ▼
        │            ┌──────────────────┐    miss     ┌─────────────┐
        │            │ Token Validator  │────────────▶│ Session     │
        │            │ (gateway lib,    │             │ Cache       │
        │            │  sidecar, or     │◀────────────│ (Redis,     │
        │            │  service lib)    │   hit       │  sharded by │
        │            │  - verify sig    │   <1ms      │  user_id)   │
        │            │  - check exp/aud │             └─────────────┘
        │            │  - check denylist│◀────┐
        │            └────────┬─────────┘     │ subscribes
        │                     │               │
        │                     ▼               │ ┌──────────────────┐
        │            ┌─────────────────┐      └─┤ Revocation Bus   │
        │            │ Business        │        │ (Kafka)          │
        │            │ services        │        │ - revoked-jti    │
        │            │ (re-validate,   │        │ - revoked-user   │
        │            │  defense-in-    │        │ - rotated-key    │
        │            │  depth)         │        └──────▲───────────┘
        │            └─────────────────┘               │
        │ logout / pw change / breach reset            │ produces
        └──────────────────────────────────────────────┘

Service-to-service (east-west):
   [Svc A]──mTLS handshake──▶[Svc B]
       │      service identity from SPIFFE ID in cert
       └─JWT in metadata─▶(user identity, signed by IdP)

   ┌──────────────────────────┐
   │ Workload identity (SPIRE,│  issues short-lived (1h) certs
   │  cert-manager, Istio CA) │  to each service automatically
   └──────────────────────────┘

Annotations:

  • Sharding: user store by hash(user_id). Session cache by user_id. Revocation Kafka topic partitioned by user_id. JWKS is a single global CDN-fronted endpoint (~10KB, trivially cacheable).
  • Crucial topology choice: JWT validation happens at every hop, not centrally. Validators are libraries embedded in the gateway, sidecars, and services. They share only JWKS public keys and the Kafka revocation feed. No service-to-IdP RPC on the hot path. That's how 2M QPS validation works without overwhelming the IdP.
  • The IdP is on the cold path: logins, refreshes, MFA. ~6k/sec at peak — well within a single regional cluster's capacity.
  • East-west service identity: mTLS with SPIFFE-issued certs gives every workload a cryptographic identity at the connection layer. User identity layered on top as a JWT in request metadata.

§12. Hard problems inherent to auth tech

12.1 JWT revocation (the central problem)

Naive: "JWTs are stateless. Just don't revoke. Users wait until exp."

Failure: User's laptop stolen 9:00am, active JWT valid till 9:15am. Thief has 15 min of full access. For a banking app: catastrophic. For corporate SSO: thief drains mailboxes. Regulators (HIPAA, PCI DSS, SOX, GDPR) require demonstrable immediate revocation.

Fix: layered (§4e) — short access TTL caps worst case, jti denylist on Kafka closes the per-token gap (<1s globally), per-user min_iat for "logout everywhere," refresh tokens server-side.
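
A minimal sketch of the verifier-side state behind the jti denylist and the per-user min_iat cutoff. The event shapes (`type`, `jti`, `sub`, `not_before`) are assumptions of this example, not a real bus schema; the check runs after signature and exp validation have already passed.

```python
from dataclasses import dataclass, field


@dataclass
class RevocationState:
    """Local view of the revocation feed, held in memory by each verifier."""
    revoked_jti: set = field(default_factory=set)
    min_iat: dict = field(default_factory=dict)  # sub -> earliest acceptable iat

    def apply(self, event: dict) -> None:
        """Fold one event from the revocation bus into local state."""
        if event["type"] == "revoked-jti":
            self.revoked_jti.add(event["jti"])
        elif event["type"] == "revoked-user":
            # "Log out everywhere": reject every token issued before this instant.
            current = self.min_iat.get(event["sub"], 0)
            self.min_iat[event["sub"]] = max(current, event["not_before"])

    def is_revoked(self, claims: dict) -> bool:
        """True if this token was revoked individually or by a per-user cutoff."""
        if claims["jti"] in self.revoked_jti:
            return True
        return claims["iat"] < self.min_iat.get(claims["sub"], 0)


if __name__ == "__main__":
    state = RevocationState()
    state.apply({"type": "revoked-user", "sub": "alice", "not_before": 1_700_000_500})
    print(state.is_revoked({"jti": "t1", "sub": "alice", "iat": 1_700_000_000}))
```

Because access tokens are short-lived, a denylist entry only needs to be kept until the token's own exp passes, which keeps the in-memory set small.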

Multi-domain: healthcare (revoke fired clinician in seconds, HIPAA), consumer social ("log me out everywhere" from settings), B2B SaaS (deprovision departing employee). Same problem class, different urgency tier.

12.2 Session fixation

Naive: "Generate session ID on visit; keep the same one after login."

Failure: Attacker visits app.example.com, gets session_id=ABC. Phishes victim with app.example.com/?session=ABC. Victim logs in. Now session_id=ABC is authenticated. Attacker uses ABC. Attacker is the victim.

Fix: always rotate session ID at login. Old ID dies; new random ID issued (Set-Cookie). Never accept session_id from URL parameters — only from HTTP-only cookies set by the server.
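
The rotation rule can be illustrated with a toy in-memory store. The class and method names are hypothetical; in a real framework this is the built-in session-regeneration call.

```python
import secrets
from typing import Optional


class SessionStore:
    """Toy session store that issues a fresh ID whenever a user logs in."""

    def __init__(self) -> None:
        self._sessions = {}  # session_id -> {"user": user_id or None}

    def create_anonymous(self) -> str:
        """Pre-login session, e.g. for a CSRF token or a shopping cart."""
        sid = secrets.token_urlsafe(32)
        self._sessions[sid] = {"user": None}
        return sid

    def login(self, old_sid: Optional[str], user_id: str) -> str:
        """Discard the pre-login ID and return a new random one for Set-Cookie."""
        self._sessions.pop(old_sid, None)
        new_sid = secrets.token_urlsafe(32)
        self._sessions[new_sid] = {"user": user_id}
        return new_sid

    def user_for(self, sid: str) -> Optional[str]:
        """Return the authenticated user for this session ID, if any."""
        session = self._sessions.get(sid)
        return session["user"] if session else None


if __name__ == "__main__":
    store = SessionStore()
    planted = store.create_anonymous()      # the ID the attacker knows
    fresh = store.login(planted, "victim")  # victim logs in
    print(store.user_for(planted), store.user_for(fresh))
```

The ID the attacker planted never becomes authenticated, because authentication always lands on an ID the attacker has not seen.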

Multi-domain: appears in older WordPress, some PHP apps, early Rails before reset_session became the default. One-liner in any modern framework, but it should be named on sight.

12.3 Replay / token theft

Naive: "Token is signed; just check the signature."

Failure: Attacker on coffee-shop wifi captures Authorization: Bearer <jwt> from victim's traffic (mis-configured TLS, MITM). Even minutes later, the JWT is still within its 15-min TTL. Attacker replays the same request from their own laptop. Signature valid, token valid, so the server allows it.

Fixes (defense in depth):

  • TLS everywhere (no HTTP in transit).
  • HTTP-only, Secure, SameSite=Lax cookies.
  • DPoP (Demonstrating Proof of Possession, RFC 9449) — the client generates a key pair; each request carries a DPoP header, a signed JWT covering the HTTP method, URL, a timestamp, a unique ID, and a hash of the access token. The server checks that the token is bound to that key and rejects stale or replayed proofs.
  • mTLS-bound tokens (RFC 8705) — the token is bound to the client's TLS certificate thumbprint, so it is useless on a connection that presents a different certificate.

Multi-domain: payment platforms (Stripe, PayPal) use sender-constrained tokens for sensitive ops. Mobile banking pins certs and rotates aggressively. Consumer social accepts the weaker TLS + short-TTL story because UX cost outweighs the marginal threat.

12.4 OAuth confused deputy

Naive: "We issue a token to the user. They send it to whatever API. APIs validate the signature."

Failure: API A and API B both validate tokens from the same IdP. A token issued for A is sent to B; B trusts the signature and accepts. Now an app authorized only for "read calendar" (A) can read banking data (B). The deputy (B) is confused about who is authoritative.

Fix: aud claim mandatory and verified. Every token carries aud naming the API it's for. API B rejects any token whose aud doesn't include itself. IdP mints different tokens for different audiences. RFC 8707 standardizes the resource parameter for requesting audience-scoped tokens.
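
A sketch of the audience check each API would run after signature verification. The function name and the example audience URLs are assumptions; the one detail taken from the JWT spec (RFC 7519) is that `aud` may be a single string or a list.

```python
def audience_ok(claims: dict, expected_audience: str) -> bool:
    """Accept a token only if its aud claim names this API."""
    aud = claims.get("aud")
    if aud is None:
        return False            # a missing aud is a rejection, not a wildcard
    if isinstance(aud, str):
        aud = [aud]             # normalize the single-string form
    return expected_audience in aud


if __name__ == "__main__":
    calendar_token = {"sub": "alice", "aud": "https://calendar.example.com"}
    # The banking API (hypothetical URL) refuses a token minted for the calendar API.
    print(audience_ok(calendar_token, "https://bank.example.com"))
```

Treating a missing claim as a failure matters: an API that skips the check when `aud` is absent is back to trusting any token the IdP ever signed.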

Multi-domain: matters most in enterprise SSO where one IdP fronts dozens of internal apps — an HR Self-Service token replayed against the payroll admin API. Real-world: multiple Salesforce CVEs, BetterHelp 2023.

12.5 Signing key compromise

Naive: "Key in HSM. Can't be stolen."

Failure: insider attack, HSM firmware bug, supply-chain compromise (real example: Infineon TPM ROCA bug, CVE-2017-15361, weakened millions of RSA keys). Or simpler: a poorly-written CI/CD pipeline briefly extracts the key. Once the private key is out, attacker mints arbitrary tokens for arbitrary users. Total compromise.

Fix: key rotation as routine, not emergency. JWKS advertises multiple keys (different kids); verifiers cache all of them. Sign with the new key; old key valid for the longest token TTL + buffer. Rotate quarterly under normal conditions. Have a "panic rotation" runbook — drop old key from JWKS within minutes; tokens signed with it stop verifying. Use different keys per audience so one compromise doesn't burn the whole fabric. HSM + audit logs on every signing; anomaly detection ("why did 10M tokens get minted in 30 seconds?").

Multi-domain: AWS, Microsoft, and other large identity providers rotate token-signing keys on a schedule; public CAs manage their signing material under the CA/Browser Forum baseline requirements. Universal.

12.6 Distributed validation latency

Naive: "Every API call hits Redis to look up session + user DB. ~5ms."

Failure: At 2M QPS, even 1ms Redis latency is fine — but cross-region (US west user, EU service) eats 100ms RTT per request. Worse: Redis is a single point of failure for every request.

Fix: stateless validation at the request edge. JWT validated locally with cached JWKS keys (~100µs CPU). State lookups only when revocation evidence is needed (rare). Revocation denylist is eventually consistent via Kafka with ~1s convergence — for the 1s window after revocation, some verifiers might still accept the token. Explicit trade-off: 1s of stale validation in exchange for not putting Redis on every hot path.

For applications where 1s is too slow (banking, admin operations), wrap critical operations with a "fresh check" — Redis call to confirm session is still active. Explicit cost of high-security flows, accepted only on cold paths.

Multi-domain: feed API tolerates 1s staleness; wire-transfer confirmation does a fresh Redis check; microservice mesh runs entirely stateless; internal admin console double-checks.

12.7 Clock skew killing exp

Naive: "Check exp > now()."

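Failure: server clocks drift. Verifier B runs 90s behind the IdP, so a freshly minted token's iat/nbf sits 90s in B's future and B rejects it as not yet valid. Verifier C runs 90s ahead, so every token looks expired 90s early. Either way: a spurious rejection cascade on a fraction of the fleet.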

Fix: clock skew tolerance + NTP discipline. JWT libraries accept a leeway parameter (typically 60-120s) that adds to exp and subtracts from nbf/iat. NTP (Network Time Protocol) on every host with <1s drift policy. Monitoring alerts on host drift >5s. Trap: leeway too high (say, 600s) defeats exp for short-lived tokens. 60-120s is the sweet spot.
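
The leeway rule is small enough to sketch directly. The function name is hypothetical; JWT libraries expose the same idea through their own leeway option.

```python
def time_claims_valid(claims: dict, now: float, leeway: int = 60) -> bool:
    """Check exp and nbf against the verifier's clock, tolerating `leeway` seconds."""
    if now >= claims["exp"] + leeway:
        return False  # expired, even after allowing for skew
    if "nbf" in claims and now < claims["nbf"] - leeway:
        return False  # not yet valid, even after allowing for skew
    return True


if __name__ == "__main__":
    # Token minted at IdP time 1000 with a 15-minute lifetime.
    claims = {"nbf": 1000, "exp": 1900}
    print(time_claims_valid(claims, now=910, leeway=0))    # verifier 90s behind
    print(time_claims_valid(claims, now=910, leeway=120))  # same verifier, with leeway
    print(time_claims_valid(claims, now=2400, leeway=600)) # the over-generous trap
```

The last call shows the trap in numbers: with 600s of leeway, a token is still accepted more than eight minutes after its stated expiry.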


§13. Authorization frameworks deep dive

AuthN proves who; AuthZ decides what. Three families dominate at scale, each with its own data model, evaluation cost, and operational characteristics.

13.1 RBAC (Role-Based Access Control)

The simplest model: users have roles; roles have permissions; permissions gate actions. Standardized from the NIST RBAC model as ANSI/INCITS 359, it has been the bread and butter of enterprise applications since the 1990s.

user:alice → role:editor
role:editor → permission:document.write

The check is a join: "does Alice have any role that has permission document.write?" — usually a SQL query against three tables (users, user_roles, role_permissions) or a denormalized lookup in Redis.
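
The join can be sketched with two in-memory mappings standing in for the user_roles and role_permissions tables. The data and the function name are illustrative only.

```python
# Stand-ins for the user_roles and role_permissions tables.
USER_ROLES = {
    "alice": {"editor"},
    "bob": {"viewer"},
}
ROLE_PERMISSIONS = {
    "editor": {"document.read", "document.write"},
    "viewer": {"document.read"},
}


def has_permission(user: str, permission: str) -> bool:
    """True if any of the user's roles grants the permission."""
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )


if __name__ == "__main__":
    print(has_permission("alice", "document.write"))
    print(has_permission("bob", "document.write"))
```

Note what the function signature lacks: there is no resource argument, which is exactly the "no resource-level granularity" limitation of plain RBAC.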

  • Pros: easy to model, easy to explain to compliance auditors ("show me everyone with admin role"), supported by every IdP and access management platform out of the box. Hierarchical roles (manager ⊇ employee) compose nicely.
  • Cons:
      • Role explosion — for any reasonably complex business, "Editor" isn't enough; you end up with "Editor for Documents in Group A," "Editor in Region EU," "Editor for Customer X's tenant," etc. A real-world enterprise can accumulate 50,000+ roles.
      • No resource-level granularity — Alice is editor. On which document? RBAC alone can't answer; you bolt on object-level checks separately.
      • Inheritance gets weird — if manager inherits employee, what happens when employee has access to "all employees in my team" — does manager get a recursive expansion? RBAC isn't designed for it.

When to pick: simple internal tools, small set of roles (<20), no per-resource permissions. Google Workspace admin console, Atlassian admin tier.

13.2 ABAC (Attribute-Based Access Control)

Attributes of the subject, resource, action, and environment combine in a policy expression. Policies are code — Open Policy Agent (OPA) with Rego, AWS IAM policies in JSON, XACML in XML.

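# OPA Rego example (OPA 1.0 syntax; on older releases add `import rego.v1`)
package authz

# Deny unless every condition in the rule body holds.
default allow := false

allow if {
  input.user.department == input.resource.department
  input.user.clearance >= input.resource.classification
  input.action == "read"
  time.now_ns() < time.parse_rfc3339_ns(input.resource.expiry)
}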

  • Pros:
      • Resource-level, condition-aware — "Alice can read this doc if she's in the same department AND her clearance ≥ doc classification AND the current time is before doc expiry."
      • Externalized policy — security team writes Rego; engineering team writes code; policies can be reviewed, versioned, and deployed independently.
      • Real-world fits — government clearance hierarchies, healthcare PHI (Protected Health Information) access based on care relationship + role + time.
  • Cons:
      • Cost of evaluation — pulling subject/resource/environment attributes for every check requires fetches. Naive ABAC at 10k QPS chokes on the attribute lookup.
      • Hard to audit — "show me everyone who can read doc X" requires running the policy against every user. Not a SELECT statement.
      • Policy bugs — Rego is a real language; bugs can over-grant or under-grant silently. Testing matters.

When to pick: complex conditional policies (clearance, time-of-day, geofencing, regulatory), policy ownership by a security team, willingness to invest in OPA infrastructure.

13.3 ReBAC (Relationship-Based Access Control)

Permissions flow through relationships between users and resources. Google Zanzibar formalized this for Drive, Calendar, YouTube: a user has access to a resource if there exists a chain of relationships from user to resource that the policy graph approves.

alice → member of team:platform
team:platform → editor of folder:project-x
folder:project-x → parent of doc:design.pdf

Check: "can Alice edit doc:design.pdf?" Walk: doc inherits from parent folder; folder has team:platform as editor; team:platform has Alice as member. Answer: yes.

  • Pros:
      • Natural model for "shared documents" — Google Drive's "anyone with the link can comment, plus these specific people can edit, plus the folder owner inherits" is exactly ReBAC.
      • Recursive evaluation — a single check naturally traverses arbitrary depth of inheritance.
      • Resource-instance granularity — every object's permissions are computed from the same engine.
  • Cons:
      • Operational complexity — running Zanzibar/SpiceDB is a real infrastructure investment.
      • Latency budget — recursive walks can amplify. The Zanzibar paper reports sub-10ms p95 latency and better than 99.999% availability, but achieving that requires planet-scale infrastructure with consistent caching.
      • Audit — "show me everyone who can edit doc X" requires expanding the relationship graph from the doc backward to users (the "expand" API).

When to pick: any product where users share resources with other users (collaboration, social, B2B SaaS with team hierarchies), and the sharing graph is non-trivial. Drive, Slack, Notion, Figma, Asana — all reach for ReBAC.

13.4 The recursive evaluation example

The canonical illustration: "Alice is editor on doc X if Alice is in team Y and team Y has editor on folder containing doc X."

1. Check: can alice EDIT doc:X?
2. doc:X has parent folder:F. Recurse: does anyone with editor on folder:F
   give editor on doc:X? YES (folders propagate to children by default).
3. folder:F has explicit editor: team:Y. Does alice belong to team:Y?
4. team:Y has member: alice. YES.
5. Return: allow.

This recursion happens on every request. The naive implementation pulls 4 rows from the database, one round-trip each, for 5-20ms in total. The Zanzibar implementation walks the same path against a sharded in-memory graph cache, ~1ms total.
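
The walk can be sketched over three small dictionaries that stand in for the relationship store. The data layout, the depth limit, and the function name are assumptions of this example.

```python
from typing import Optional

# Stand-ins for the relationship store.
PARENT = {"doc:X": "folder:F"}          # object -> parent folder
EDITORS = {"folder:F": {"team:Y"}}      # object -> subjects with editor
MEMBERS = {"team:Y": {"user:alice"}}    # team -> member users

MAX_DEPTH = 16  # guard against cycles in the parent chain


def can_edit(user: str, obj: str, depth: int = 0) -> bool:
    """Editor on obj if granted directly, via a team, or via any ancestor folder."""
    if depth > MAX_DEPTH:
        return False
    for subject in EDITORS.get(obj, set()):
        if subject == user or user in MEMBERS.get(subject, set()):
            return True
    parent: Optional[str] = PARENT.get(obj)
    return parent is not None and can_edit(user, parent, depth + 1)


if __name__ == "__main__":
    print(can_edit("user:alice", "doc:X"))
    print(can_edit("user:bob", "doc:X"))
```

Each dictionary lookup here corresponds to one database round-trip in the naive implementation, which is where the latency in the walk comes from.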

13.5 Picking the right model

Need                                                              | Pick
------------------------------------------------------------------|---------------------------------
<20 roles, no per-resource permissions                            | RBAC
Conditional policies (time, location, clearance)                  | ABAC
Users share resources with users (any collaboration product)      | ReBAC
Hybrid: roles for coarse access + resource-level for fine-grained | RBAC + ReBAC (most real products)
Regulated, must be auditable as policy code                       | ABAC with OPA

Many real systems combine: RBAC for the macro (admin / editor / viewer organization-wide), ReBAC for the micro (this specific doc is shared with this specific user), and ABAC layers for compliance conditions (no PHI access outside business hours from non-corp networks).


§14. Zanzibar / SpiceDB architecture

Google Zanzibar (2019 paper "Zanzibar: Google's Consistent, Global Authorization System") is the reference for planetary-scale ReBAC. Anyone working on collaboration/sharing products is expected to know it.

14.1 The data model

Zanzibar represents permissions as relation tuples:

<object>#<relation>@<user>

Examples:
doc:readme#owner@user:alice
doc:readme#viewer@group:engineering#member
folder:project-x#editor@team:platform
team:platform#member@user:bob

Each tuple says: <user> has <relation> on <object>. The @group:engineering#member form means "any user that has member relation on group:engineering" — userset rewrites give Zanzibar its compositional power.

The database is (namespace, object_id, relation, user) rows. At Google's scale: ~2 trillion tuples, replicated globally across Spanner.

14.2 Namespaces (object types)

A namespace defines an object type and its relations. Drive's doc namespace might define:

namespace doc {
  relation owner: user
  relation editor: user | doc#owner   // owners are also editors
  relation viewer: user | doc#editor | doc#parent->viewer
  // permission viewer is granted by direct grant,
  // OR by being editor, OR by being viewer on the parent (folder)
}

Three primitives:

  • Direct relations — explicit tuples in the DB.
  • Userset rewrites — doc#editor includes everyone who is doc#owner. Composition.
  • Computed userset via "tuple-to-userset" (TTU) — doc#viewer includes everyone who is viewer on the doc's parent folder. Recursion across object types.

This 3-rule system is expressive enough to model Drive, Calendar, Cloud IAM, YouTube channel permissions, and Google Photos shared albums — all in the same engine.

14.3 The Check API

The hot path. Check(object, relation, user) returns true|false:

Check("doc:design.pdf", "edit", "user:alice")

Server walks the relation graph: direct grant? userset rewrite? TTU to parent? Answer is true if any path yields a match.

Latency budget: the Zanzibar paper reports 95th-percentile latency under 10ms and availability above 99.999% over three years of production use, even when traversal hits multiple data centers. This comes partly from aggressive caching (warm cache hit rate ~95%), partly from "zookies" (opaque tokens encoding a snapshot timestamp, which ensure the check sees a causally-consistent view of the graph without globally synchronous reads).

14.4 The Expand API

For audit and admin UIs. Expand(object, relation) returns the full set of users with that relation:

Expand("doc:design.pdf", "edit") →
  { user:alice, user:bob, members-of(team:platform) }

The output is a tree of usersets, not a flat list — leaving subset materialization to the caller (the "list everyone" might be huge). The admin console then walks the tree.

Expand is the "audit" mode. Check is the "decide" mode.

14.5 The 5-nines latency budget

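The Zanzibar paper reports 95th-percentile latency under 10ms and availability above 99.999% (the "5 nines" figure refers to availability, not latency). Achieving that at trillion-tuple scale required: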

  • Spanner backend — global consistency, but read latency optimized via local replicas in every region.
  • Aggressive caching — Zanzibar's "Leopard index" precomputes flattened group memberships for common usersets. A check on an "engineering" group of 50k engineers doesn't walk the graph; it hits a precomputed set index.
  • Zookies for snapshot reads — when checking permissions for a page load that has already snapshot-read the resource, the check uses the resource's snapshot timestamp. Avoids round trips to verify "is this answer still current?"
  • Replication and read locality — every check served by the closest region's replica.

14.6 Open-source clones

  • SpiceDB (AuthZed) — closest open-source Zanzibar clone. Same (namespace, relation, user) tuple model; same Check/Expand API; same userset rewrite + TTU primitives. PostgreSQL or CockroachDB backend. Used by Reddit, Turo, Adobe, Netflix.
  • OpenFGA (Auth0/Okta) — similar design, CNCF (Cloud Native Computing Foundation) sandbox project. JSON-based schema instead of Zanzibar's Protobuf.
  • Permify — newer entrant, supports OpenFGA's model + a query language for ad-hoc policy.
  • Warrant — commercial SaaS offering of a Zanzibar-like service.

Advice: if building a collaboration product greenfield in 2026, start with SpiceDB or OpenFGA. Don't roll your own — the recursion semantics are subtle and easy to get wrong (cycles, eventual consistency, snapshot reads).

14.7 Operational gotchas

  • Schema evolution — adding a new relation requires backfill of derived tuples. Plan migrations like database schema changes.
  • Caching staleness — when a tuple is written, dependent caches must invalidate. SpiceDB uses watch APIs; Zanzibar uses zookies.
  • Listing performance — "show me all docs Alice can see" is the inverse of Check. Some implementations support ListObjects, but at scale this is expensive — it's an expand-all operation. Use carefully or push into a separate read model.

§15. Session revocation at scale

Revocation is the recurring theme. §4e gave the conceptual primitives; this section is the implementation playbook for an organization actually running this at 100M+ users.

15.1 The three tools and how they compose

  1. Short-lived JWT + Kafka denylist — for per-token revocation (specific JWT exposed, leaked, etc.).
  2. Per-user min_iat cache — for "log me out everywhere" (password change, account compromise reset).
  3. Session-store fallback with cache — for high-revocation-frequency surfaces (admin tools, banking).

In production, all three are deployed simultaneously, layered by use case.

15.2 Implementation: Kafka-backed JTI denylist

The architecture (per data center):

[Revoke API] → write to MySQL (durable) → publish to Kafka (revoked-jti)
                                              │
                                              ▼
[Verifier pods] ─ subscribe ─ maintain in-memory hashset
        │
        └─ per request: hashset.contains(jti)?  ← ~50ns

Sizing: at 50k revocations/day with a 15min access token TTL, the denylist holds about 520 entries in steady state (50k/86400 × 900 ≈ 520). 520 × ~50 bytes ≈ 26 KB per pod. Once a token's exp passes, its entry can be evicted; an expiry sweep runs every 60s.
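The sizing arithmetic and the evict-on-exp behavior can be sketched as follows. This is an illustrative in-process structure only: the `JtiDenylist` class and `steady_state_entries` helper are hypothetical names, and the Kafka consumer that would call `revoke` is left out.

```python
import time
from typing import Dict, Optional


def steady_state_entries(revocations_per_day: int, ttl_seconds: int) -> float:
    """Average number of revoked-but-unexpired tokens held at any moment."""
    return revocations_per_day / 86_400 * ttl_seconds


class JtiDenylist:
    """In-memory denylist keyed by jti; each entry remembers the token's exp."""

    def __init__(self) -> None:
        self._exp_by_jti: Dict[str, float] = {}

    def revoke(self, jti: str, exp: float) -> None:
        # Would be called by the consumer of the revoked-jti topic.
        self._exp_by_jti[jti] = exp

    def is_revoked(self, jti: str) -> bool:
        # Per-request check: a single hash lookup.
        return jti in self._exp_by_jti

    def evict_expired(self, now: Optional[float] = None) -> int:
        """Drop entries whose token has expired anyway; returns how many were dropped."""
        now = time.time() if now is None else now
        expired = [j for j, exp in self._exp_by_jti.items() if exp <= now]
        for j in expired:
            del self._exp_by_jti[j]
        return len(expired)
```

Running `evict_expired` on a timer keeps the set bounded by the token TTL rather than by the total number of revocations ever issued.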

Convergence: Kafka end-to-end latency p99 ~500ms within a data center, ~1s cross-region (with MirrorMaker 2). So "revoke" is visible everywhere within ~1s globally.

What if a verifier hasn't received the Kafka message yet? For the ~1s window, that verifier still accepts the JTI. The accepted blast radius: 1s of one access token's usage. For consumer-grade revocation (a user clicks "Log Out"), this is acceptable. For high-stakes (a fired employee tries to delete prod data), wrap critical endpoints with a fresh Redis check that bypasses the local cache.

15.3 Per-user min_iat for "logout everywhere"

Redis:  user:42:min_iat → 1715000500

Every JWT carries iat. Verifier reads min_iat for the token's sub; if iat < min_iat, reject. Effect: a single Redis write to user:42:min_iat = now() invalidates all of user 42's outstanding tokens in one shot.

Caching strategy: verifiers cache min_iat per user in an LRU. At 1M cached users × ~50 bytes = ~50MB per pod. TTL on cache entries: 60s, with invalidation on user-targeted events (subscribe to user-min-iat-updates Kafka topic). For users not in cache, fall through to Redis (~1ms).
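A minimal sketch of the verifier-side check, assuming a `fetch_min_iat` callable that stands in for the Redis read. The `MinIatChecker` class is hypothetical, and the cache here is a plain dict with a TTL rather than a bounded LRU, to keep the example short.

```python
from typing import Callable, Dict, Optional, Tuple


class MinIatChecker:
    def __init__(self, fetch_min_iat: Callable[[str], Optional[int]],
                 cache_ttl: float = 60.0) -> None:
        self._fetch = fetch_min_iat          # stand-in for the Redis lookup
        self._ttl = cache_ttl
        self._cache: Dict[str, Tuple[int, float]] = {}  # sub -> (min_iat, cached_at)

    def _min_iat(self, sub: str, now: float) -> int:
        hit = self._cache.get(sub)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]
        value = self._fetch(sub) or 0        # no entry means nothing was revoked
        self._cache[sub] = (value, now)
        return value

    def invalidate(self, sub: str) -> None:
        # Would be called when an invalidation event for this user arrives.
        self._cache.pop(sub, None)

    def accepts(self, claims: dict, now: float) -> bool:
        # Reject any token issued before the user's min_iat.
        return claims["iat"] >= self._min_iat(claims["sub"], now)
```

The `invalidate` hook is what closes the gap between a Redis write and the cache TTL: without it, a cached verifier keeps accepting old tokens until the entry ages out.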

When to bump min_iat:

  • Password change.
  • MFA reset / device removed.
  • Admin-forced logout.
  • Suspected compromise detected by anti-abuse.
  • User-initiated "log me out from all devices."

15.4 Session-store fallback with cache

For surfaces where 1s of denylist propagation is too slow, fall back to the classic session model: every request checks a server-side session row. Redis cache fronts MySQL. Cache hit ~1ms; cache miss ~5ms.

Used for:

  • Admin consoles (force fresh check; never trust JWT alone for org-wide settings changes).
  • Wire-transfer confirmation pages.
  • Anything where the next request after revocation must respect it.

Cost: 1ms added latency, an extra Redis cluster to run. Worth it only on critical paths.

15.5 The trade-offs at a glance

Tool | Latency to take effect | Per-request cost | Capability
Short TTL alone | Up to TTL (15 min) | None (already verifying JWT) | Cheapest, weakest
JTI denylist | ~1s (Kafka fanout) | ~50ns hashset | Per-token revocation
min_iat | ~1s with Kafka invalidation; up to 60s (cache TTL) if the event is missed | ~50ns cached / ~1ms miss | User-level "logout everywhere"
Session-store check | Immediate | ~1ms Redis | Surface-specific guarantee
Key rotation | Immediate, but global blast radius | None per-request | Nuclear

Real production stacks at LinkedIn/Stripe/PayPal: all of the above, layered.


§16. Step-up auth and ACR

Sometimes a session is "authenticated enough" for routine browsing but not for sensitive operations. Step-up auth lets the server demand stronger authentication mid-session, scoped to specific actions.

16.1 The pattern

A user logged in 4 hours ago with password + TOTP. They now click "Initiate a $50,000 wire transfer." The server's check:

1. Is the session authenticated? Yes.
2. Does this endpoint require step-up? Yes — banking compliance: MFA within 15 min.
3. Was MFA performed within 15 min? No (it was 4 hours ago).
4. Return: 401 with WWW-Authenticate: Bearer error="insufficient_user_authentication" (the step-up challenge defined in RFC 9470).
5. Client redirects user to MFA challenge.
6. After MFA, server issues a *new* token with elevated ACR claim.
7. Client retries the wire transfer with the new token.
8. Server accepts.

16.2 ACR — Authentication Context Class Reference

OIDC standardizes the acr claim (Authentication Context Class Reference) for exactly this. Values are URI strings that describe how the user authenticated:

{
  "sub": "user-42",
  "acr": "urn:mace:incommon:iap:silver",
  "amr": ["pwd", "totp"],
  "auth_time": 1715000000,
  "iat": 1715014400,
  "exp": 1715015300
}
  • acr — broad authentication context class. NIST AAL (Authenticator Assurance Level) 1, 2, 3 — or vendor-defined values.
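  • amr (Authentication Methods References) — specific methods used: pwd, otp, hwk (hardware key), face, fpt (fingerprint), mfa. These are values registered by RFC 8176; the examples in this section use the non-registered value totp to single out TOTP.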
  • auth_time — when the user last authenticated. Used to check freshness for step-up.

16.3 The wire-transfer endpoint in code

@require_acr(min_acr="aal3", max_age_seconds=900)
@require_amr_any_of(["hwk", "totp"])
def initiate_wire_transfer(request):
    ...

The decorator inspects the JWT's acr, amr, auth_time. If acr < aal3 OR now - auth_time > 900s OR amr lacks a strong factor, return 401 with a header indicating what step-up is needed.
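One way such a decorator might look, as a sketch under stated assumptions: `request` is a plain dict carrying already-verified claims under a `"claims"` key, acr values are short labels ranked by the hypothetical `ACR_RANK` table (real acr values are URIs that each deployment maps to its own ordering), and `StepUpRequired` is an illustrative exception that a framework layer would translate into the 401 response.

```python
import functools
import time

# Assumed ranking of assurance levels; not defined by OIDC itself.
ACR_RANK = {"aal1": 1, "aal2": 2, "aal3": 3}


class StepUpRequired(Exception):
    """Raised when the token's assurance level or freshness is insufficient."""


def require_acr(min_acr: str, max_age_seconds: int):
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(request, *args, **kwargs):
            claims = request["claims"]
            # Unknown or missing acr ranks below every known level (fail closed).
            rank = ACR_RANK.get(claims.get("acr"), 0)
            too_weak = rank < ACR_RANK[min_acr]
            # Missing auth_time is treated as "authenticated at the epoch".
            too_old = time.time() - claims.get("auth_time", 0) > max_age_seconds
            if too_weak or too_old:
                raise StepUpRequired(min_acr, max_age_seconds)
            return handler(request, *args, **kwargs)
        return wrapper
    return decorator
```

Comparing ranks instead of raw strings matters because acr values carry no inherent order; the table makes the ordering an explicit policy decision.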

16.4 Critical use cases

  • Financial operations — funds transfer, ACH initiation, account closure.
  • Admin actions — terminate employee, change org-wide settings, rotate signing keys.
  • PII export — GDPR data export, HIPAA medical records download.
  • Account recovery — adding a new device, changing recovery email.
  • High-trust API calls — third-party app requesting expanded scope.

16.5 The UX pattern

Step-up should feel targeted, not painful. Banking apps: tap "Send $5000" → fingerprint prompt → done. The token before the fingerprint had acr=aal2; the new one has acr=aal3. The user perceives one biometric prompt; the system has bumped the assurance level.

What you don't want: re-running the whole login flow. That trains users to type passwords on every action, leading to password-everywhere phishing risk.


§17. Delegation and impersonation

Real systems have flows where one identity acts on behalf of another. Two main classes:

  1. Admin impersonation — customer support agent impersonates a user to debug.
  2. Service-to-service on-behalf-of — Service A calls Service B with the user's identity preserved.

17.1 Admin impersonation with audit

A support engineer needs to "see what the customer sees." Designs:

  • Bad: support engineer has root credentials, logs in as the customer. No audit; can't tell which actions were customer vs support.
  • Better: support engineer requests impersonation token via an internal admin tool. Tool checks support's role, generates a JWT with both subject + actor claims:
{
  "sub": "user-42",          // the impersonated user
  "act": {
    "sub": "support:carol",  // the actual actor
    "role": "support-tier-2"
  },
  "scope": "impersonate:read", // restricted scope
  "exp": 1715015300,           // short expiry (15 min)
  "amr": ["pwd", "totp"]       // support's own MFA, not user's
}

The act claim (RFC 8693 §4.1) carries the actor identity. Downstream services see sub=user-42 (so the UI looks normal) but the audit log records act.sub=support:carol (so the action is traceable).

Defensive checks:

  • Restricted scope — impersonate:read only allows read operations. Wire transfers, password changes, etc. require the original user's credentials. Hardcoded in critical endpoints.
  • Short TTL — 15 min max.
  • Audit log every request — separate logging stream for impersonation, retained 7 years.
  • User notification — user gets an email "support engineer Carol viewed your account at 14:35."
  • Reason required — support must enter a Jira ticket or reason before the token is issued.

17.2 Token exchange (RFC 8693)

OAuth 2.0 Token Exchange standardizes the on-behalf-of flow. Service A holds a user token; Service A wants to call Service B in a way that B sees user identity, but the call is attributable to A.

POST /token HTTP/1.1
grant_type=urn:ietf:params:oauth:grant-type:token-exchange
subject_token=<user's access token>
subject_token_type=urn:ietf:params:oauth:token-type:access_token
audience=https://api.b.example.com
actor_token=<A's service token>            // optional, for on-behalf-of
actor_token_type=urn:ietf:params:oauth:token-type:access_token

IdP validates both tokens, mints a new token with:

{
  "sub": "user-42",
  "aud": "https://api.b.example.com",
  "act": {
    "sub": "service:a"
  },
  "scope": "...",
  "iat": ...,
  "exp": ...
}

Service B sees sub=user-42 (correct authZ) and act.sub=service:a (so it knows A is the conduit). If A is later compromised, B's audit log shows exactly which calls came through A on behalf of which users.

17.3 The "customer support acting as user X" distinction

The trap: downstream services must distinguish the authentic user X from user X as impersonated by support. Critical for:

  • Risk scoring — fraud models should not learn that "user X logs in from corp IP at 3am" is normal user behavior.
  • Notifications — don't email user "you logged in!" when it was support.
  • Privacy-restricted operations — some operations (export your personal data) should be user-only, not support-impersonated.

Implementation: every downstream service must read the act claim and gate features on act being absent (or matching specific allowed actor types).
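A small sketch of that gate, assuming actor identifiers follow the `type:name` convention used in the examples above (`support:carol`, `service:a`). The function names are hypothetical.

```python
from typing import AbstractSet, Optional


def actor_type(claims: dict) -> Optional[str]:
    """Return the actor's type prefix, or None when the user is acting directly."""
    act = claims.get("act")
    if not act:
        return None
    # "support:carol" -> "support"; a malformed act yields "" and is denied below.
    return act.get("sub", "").split(":", 1)[0]


def allow_operation(claims: dict,
                    allowed_actor_types: AbstractSet[str] = frozenset()) -> bool:
    """Allow direct user calls; allow delegated calls only for listed actor types."""
    kind = actor_type(claims)
    return kind is None or kind in allowed_actor_types
```

A user-only operation such as a personal data export would call `allow_operation(claims)` with the default empty set, while an internal read endpoint might pass `frozenset({"service"})`.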

17.4 SAML / OIDC delegation claims

  • OIDC — RFC 8693 act claim (above).
  • SAML — <saml:SubjectConfirmation> inside <saml:Subject> carrying delegate semantics, sometimes vendor-specific.
  • JWT — act (actor) and may_act (names the party allowed to act for the subject, used to restrict who can be a delegate).

Always carry the actor; never collapse to "just the user." It's the single most-violated authZ-audit principle in real systems.


§18. Multi-tenant identity

B2B SaaS lives or dies on multi-tenancy. Each customer (tenant) is its own walled garden of users, with its own IdP, its own admin policies, its own data. Auth in this world has to be carefully scoped or you get cross-tenant data leaks.

18.1 Per-tenant IdP configuration

Each tenant configures their own IdP-of-record. The SaaS supports federation per tenant:

  • Tenant Acme uses Okta SAML.
  • Tenant Beta uses Azure AD OIDC.
  • Tenant Gamma uses no SSO; users sign up with email/password (managed by the SaaS itself).
  • Tenant Delta uses a customer-private OpenID Connect endpoint.

When a user visits https://acme.example.com/login (tenant-subdomain), the SaaS reads the tenant config, redirects to the correct IdP, accepts the SAML assertion or OIDC token, and creates a local session.

Implementation: a tenant_idp_config table:

CREATE TABLE tenant_idp_config (
  tenant_id     BIGINT PRIMARY KEY,
  idp_type      ENUM('saml','oidc','none'),
  metadata_url  TEXT,
  cert_pem      TEXT,
  client_id     VARCHAR(255),
  client_secret_kms_arn  VARCHAR(255),
  attribute_map JSON
);

The attribute map is critical: how does the IdP's "email" claim map to the SaaS's email? Most IdPs send email, but some send mail or urn:oid:0.9.2342.19200300.100.1.3. Normalize.
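Normalization can be as simple as the sketch below. It assumes the stored attribute_map goes from IdP attribute name to local field name; the `normalize_attributes` function is illustrative, and failing loudly when no email is mapped is a design choice of this sketch, not something the schema above enforces.

```python
from typing import Dict, Mapping


def normalize_attributes(idp_attrs: Mapping[str, str],
                         attribute_map: Mapping[str, str]) -> Dict[str, str]:
    """Translate IdP attribute names into the SaaS's local field names."""
    out: Dict[str, str] = {}
    for idp_name, local_name in attribute_map.items():
        # First matching IdP attribute wins for each local field.
        if idp_name in idp_attrs and local_name not in out:
            out[local_name] = idp_attrs[idp_name]
    if "email" not in out:
        # Refuse the login instead of creating a user with a missing identity field.
        raise ValueError("IdP assertion has no attribute mapped to email")
    return out
```

For a tenant whose IdP sends `mail`, a map of `{"email": "email", "mail": "email", "urn:oid:0.9.2342.19200300.100.1.3": "email"}` yields the same local `email` field regardless of which name arrives.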

18.2 Tenant-scoped tokens

Every token carries tenant_id. Every endpoint enforces it.

{
  "sub": "user-42",
  "tenant_id": "tenant-acme",
  "aud": "https://api.example.com",
  "scope": "read:projects"
}

When tenant-acme's user submits a token to https://api.example.com/projects/proj-123, the server must:

  1. Verify the token.
  2. Load proj-123's tenant_id.
  3. Reject if proj-123.tenant_id != token.tenant_id.

Step 3 is non-optional. Every multi-tenant SaaS that has shipped a cross-tenant data leak has shipped it via step 3 being missing on one endpoint.
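Steps 2 and 3 can be sketched as a single helper that every resource handler calls after token verification. The `PROJECT_TENANT` dict stands in for the database lookup and is hypothetical, as is the function name.

```python
# Hypothetical stand-in for "load the project's tenant_id from the database".
PROJECT_TENANT = {"proj-123": "tenant-acme", "proj-999": "tenant-beta"}


def authorize_project_access(claims: dict, project_id: str) -> None:
    """Raise PermissionError unless the project belongs to the token's tenant."""
    owner = PROJECT_TENANT.get(project_id)
    token_tenant = claims.get("tenant_id")
    # Fail closed: unknown project, missing claim, or mismatch are all rejected.
    if owner is None or token_tenant is None or owner != token_tenant:
        raise PermissionError(f"tenant mismatch for {project_id}")
```

Putting the check in one shared function (or, better, in the data-access layer itself) reduces the chance that a new endpoint ships without it.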

18.3 The cross-tenant impersonation attack

The classic bug: an attacker is a legitimate user of tenant-acme. They notice that api.example.com/projects/proj-999 returns proj-999's data — and proj-999 belongs to tenant-beta. The server validated the token (signature good, not expired) but didn't check tenant ownership of the requested resource.

Variant: aud confusion. If tenant-acme's IdP issues tokens with aud=https://api.example.com (the shared API), an attacker can take their own legitimate token and send it to tenant-beta's admin API — same aud, no tenant constraint. Mitigation: scope aud per tenant: aud=https://api.example.com/t/tenant-acme, and reject mismatched tenants.

18.4 Tenant isolation patterns

  • Subdomain per tenant — acme.app.example.com, beta.app.example.com. Tenant context inferred from hostname.
  • Path prefix per tenant — /t/acme/..., /t/beta/.... Same domain, easier for cookies; harder to do per-tenant TLS pinning.
  • Header per tenant — X-Tenant-Id: acme. Common in B2B APIs. The IdP issues a token; the API checks X-Tenant-Id matches token.tenant_id.

The cleanest: subdomain + tenant_id claim + endpoint-level enforcement. Belt and suspenders.

18.5 Custom SSO per tenant — the operational cost

Supporting SAML for every B2B customer means:

  • Per-tenant IdP metadata URLs that may break/rotate.
  • Per-tenant certs that expire on the customer's schedule (not yours). Expired SAML cert → tenant locked out → 2am support page.
  • Per-tenant attribute mapping bugs. Customer's IdP sends "department" but the SaaS expects "dept" — silent permission misassignment.

Tools to invest in: cert expiry monitoring (alert 30 days before expiry, automated email to tenant admin), attribute-map validator on configuration, "try-as-user" debug tool for tenant admins to test SSO without affecting prod.


§19. Anti-abuse and rate limiting on auth

The auth surface is a magnet for abuse: credential stuffing, brute force, account takeover, denial-of-service via lockout, bot signups, OAuth-grant spam. Every shipped auth product needs defenses; many candidates underestimate how much code this is.

19.1 Login rate limiting — three axes

  • Per-user — block more than N failed logins per user per minute. Threshold ~5 fails/min.
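  • Per-IP — block more than N login attempts per IP per minute. Threshold ~20 fails/min. (Higher than per-user because many legitimate users can share one IP behind NAT; this axis exists because attackers cycle usernames and so never trip the per-user limit.)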
  • Per-ASN (Autonomous System Number) — when one network (cloud provider, residential ISP) starts a burst, throttle the ASN entirely. Used during credential-stuffing campaigns originating from compromised botnets.

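Implementation: window counters in Redis. INCR login_fail:user:42, INCR login_fail:ip:1.2.3.4, INCR login_fail:asn:12345, each with TTL = window size set when the key is first created. (INCR plus a TTL is a fixed-window counter; a true sliding window needs a sorted set of timestamps or two weighted adjacent buckets.) On request, read all three; if any exceeds its threshold, reject with 429 Too Many Requests.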
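The three-axis check can be sketched with an in-memory stand-in for Redis. This models fixed-window counting (which is what INCR plus a TTL provides); the `FixedWindowCounter` class, the helper functions, and the ASN limit of 200 are assumptions for illustration, and old buckets are never cleaned up here because Redis would expire them via TTL.

```python
from typing import Dict, Tuple


class FixedWindowCounter:
    """Counts events per key within fixed, aligned time windows."""

    def __init__(self, window_seconds: int) -> None:
        self.window = window_seconds
        self._counts: Dict[Tuple[str, int], int] = {}

    def _bucket(self, key: str, now: float) -> Tuple[str, int]:
        return (key, int(now // self.window))

    def incr(self, key: str, now: float) -> int:
        bucket = self._bucket(key, now)
        self._counts[bucket] = self._counts.get(bucket, 0) + 1
        return self._counts[bucket]

    def get(self, key: str, now: float) -> int:
        return self._counts.get(self._bucket(key, now), 0)


# Per-minute thresholds; user and ip follow the text, asn is hypothetical.
LIMITS = {"user": 5, "ip": 20, "asn": 200}


def _keys(user_id: str, ip: str, asn: str) -> Dict[str, str]:
    return {"user": f"login_fail:user:{user_id}",
            "ip": f"login_fail:ip:{ip}",
            "asn": f"login_fail:asn:{asn}"}


def record_failure(c: FixedWindowCounter, user_id: str, ip: str, asn: str, now: float) -> None:
    for key in _keys(user_id, ip, asn).values():
        c.incr(key, now)


def is_throttled(c: FixedWindowCounter, user_id: str, ip: str, asn: str, now: float) -> bool:
    # Reject (HTTP 429) if any axis has exceeded its threshold in this window.
    return any(c.get(key, now) > LIMITS[axis]
               for axis, key in _keys(user_id, ip, asn).items())
```

Fixed windows allow a burst of up to twice the limit across a window boundary, which is the usual reason to move to a sliding variant on sensitive endpoints.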

19.2 Account lockout — the DoS anti-pattern

The naive defense: after 5 failed logins, lock the account for 15 minutes. This seems reasonable.

The DoS abuse: an attacker submits 5 wrong passwords for every legitimate user's account. Every user is now locked out. The attacker has DoS'd your whole user base without ever cracking a password.

Better defenses:

  • Exponential backoff per IP, not lockout per user — after each failed attempt, the IP must wait 2× longer before its next attempt. Legitimate user from same IP slightly inconvenienced; attacker on one IP is dead-stopped.
  • CAPTCHA after N failures — show CAPTCHA after 3-5 fails. Bots can't pass; humans can. Doesn't lock the user.
  • Step-up MFA after N failures — if the user has MFA, demand MFA even though they typed the right password (suggesting possible credential leak).
  • Lockout only on success-after-many-fails — if the attacker eventually guesses right, then lock the account and notify the user. Don't punish the user for being a target.
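The per-IP exponential backoff from the first bullet might look like this. It is a sketch only: the 1-second base and 15-minute cap are assumed values, the `IpBackoff` class is hypothetical, and decay of old failures is deliberately left out.

```python
from typing import Dict, Tuple


def backoff_seconds(failures: int, base: float = 1.0, cap: float = 900.0) -> float:
    """Wait time after the Nth consecutive failure: base, 2*base, 4*base, ... up to cap."""
    if failures <= 0:
        return 0.0
    # Clamp the exponent so very large failure counts cannot overflow.
    return min(cap, base * 2 ** min(failures - 1, 32))


class IpBackoff:
    def __init__(self) -> None:
        self._state: Dict[str, Tuple[int, float]] = {}  # ip -> (failures, next_allowed_at)

    def may_attempt(self, ip: str, now: float) -> bool:
        return now >= self._state.get(ip, (0, 0.0))[1]

    def record_failure(self, ip: str, now: float) -> None:
        failures = self._state.get(ip, (0, 0.0))[0] + 1
        self._state[ip] = (failures, now + backoff_seconds(failures))
```

Because the state is keyed by IP and not by account, an attacker hammering one victim's username slows only their own address; the victim logging in from elsewhere is unaffected.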

The answer: never lock an account based purely on failed attempts against it. Always combine with IP/ASN signals. Account lockout was a 2005 idea; it's a denial-of-service vector in 2026.

19.3 Bot detection layers

  • CAPTCHA — reCAPTCHA v2 (checkbox), v3 (invisible scoring), hCaptcha, Cloudflare Turnstile. Annoying to users, but the strongest bot-stopper at the top of the funnel.
  • Behavioral biometrics — mouse-movement patterns, typing cadence, scroll behavior. Vendors: Akamai BMP, F5 Shape, BioCatch. Pricey, opaque to legitimate users, but stops sophisticated bots that pass CAPTCHA.
  • Device fingerprinting — User-Agent + screen size + timezone + installed fonts + WebGL hash + Canvas hash. Bot farms tend to have identical fingerprints; humans don't. FingerprintJS is the popular library.
  • TLS fingerprinting (JA3/JA4) — TLS ClientHello byte patterns reveal the underlying client library (curl, Node.js, headless browser). Because it fingerprints the TLS stack rather than the User-Agent string, it is much harder to spoof. Cloudflare and Akamai use this extensively.
  • Honeypot fields — hidden form fields named email that real browsers don't fill but naive bots do. Cheap and surprisingly effective.

19.4 Credential stuffing detection

Distinct from brute force: in credential stuffing, the attacker has a list of leaked (email, password) pairs from another site's breach. They try them against your site, hoping users reused passwords.

Detection signals:

  • High volume of (email, password) attempts that succeed — but each from a different new IP, often residential proxies. The "high success rate from new IPs" signal is the giveaway.
  • Failed-login-from-new-IP-with-known-password — the password is in a breach corpus (Have I Been Pwned, Specops, vendor feeds). When you see this, it's almost certainly credential testing; alert and step-up.
  • Login from new device + new IP + new geography — for established users, this combination triggers email "did you just log in from Russia?" and step-up MFA.

Layered responses:

  1. Known-breached password at login: force password change.
  2. New-device-new-geo successful login: send email; require MFA; optionally session-expire the new device until confirmed.
  3. Burst of successful logins across many accounts from one ASN: throttle the ASN; alert on-call security.

19.5 Sign-up abuse

Less obvious but equally important: bots create fake accounts for spam, scraping, fraud. Mitigations:

  • Email verification — require email click-through before account becomes usable. Cuts ~80% of throwaway abuse.
  • Phone verification — paid by the abuser per number; raises the cost. Used by Twitter, Discord, ChatGPT.
  • Payment verification — for high-value tenants, charge $1 (refundable). Bots don't have credit cards.
  • Hold period — new accounts have limited capabilities for 24-72 hours. Slows spam campaigns.

§20. Audit logging requirements

For any regulated industry — finance, healthcare, government, payments — auth events MUST be logged in detail, retained for years, and stored immutably. Auditors will ask, and "we didn't log that" is a finding.

20.1 What must be logged

At minimum:

  • Login success/fail — user_id, timestamp, ip, user_agent, mfa_used, result.
  • MFA challenge — user_id, factor (totp/sms/push/webauthn), result, timestamp.
  • Password change — user_id, timestamp, channel (self-service / admin-reset / forced).
  • Scope/permission grant — when the user grants an OAuth scope to a third-party app: user_id, client_id, scopes, timestamp.
  • Token issued — user_id, jti, aud, scopes, expiry, issuer_kid.
  • Token revoked — user_id, jti or family, reason, actor (user / admin / system).
  • Role change — user_id, old_role, new_role, actor, reason, timestamp.
  • Impersonation start/end — actor_user_id, impersonated_user_id, ticket, start, end.
  • Admin actions — every action under the admin role.
  • Account creation / deletion — user_id, timestamp, source (signup / import / scim).
  • MFA enrollment / removal — every factor change.
  • Session creation / destruction — at least the metadata.

20.2 Retention

  • Finance (SOX — Sarbanes-Oxley) — 7 years.
  • PCI-DSS (Payment Card Industry Data Security Standard) — minimum 1 year, 3 months immediately available.
  • HIPAA (Health Insurance Portability and Accountability Act) — 6 years.
  • GDPR (General Data Protection Regulation) — purpose-limited; auth logs typically 1-2 years.
  • SOC 2 (Service Organization Control 2) Type II — minimum 1 year, customer often demands 3.
  • FedRAMP (Federal Risk and Authorization Management Program) — 3 years on the system; can require longer for moderate/high.

Practical answer: budget for 7 years on cold storage. It's the longest common requirement and the marginal cost of cold storage is low.

20.3 Immutability — write-once storage

Audit logs must be tamper-evident. An attacker with admin privileges should not be able to retroactively edit the log to cover their tracks.

Patterns:

  • WORM (Write Once Read Many) storage — AWS S3 Object Lock in compliance mode, Azure Blob Immutable storage, on-prem WORM tape. Cannot delete or modify objects until the retention period expires.
  • Hash-chained logs — each log entry includes the hash of the previous entry. Any tampering breaks the chain. Hyperledger / blockchain-style audit logs.
  • Append-only Kafka — logs published to a Kafka topic with infinite retention, replicated across regions. Operators can't delete individual messages without losing all replicas.
  • External audit log service — Splunk Cloud, Datadog, AWS CloudTrail (for AWS API calls), Vanta — outside the application's blast radius.

Design pattern: dual-write the audit event to (a) Kafka for real-time consumption by security tools and (b) S3 with Object Lock for compliance retention.
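The hash-chained pattern is small enough to sketch directly. This is an illustration, not a production log format: the entry layout, the all-zero genesis value, and the function names are assumptions.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes to the same value.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: List[dict], event: dict) -> dict:
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"event": event, "prev_hash": prev_hash, "hash": _digest(prev_hash, event)}
    log.append(entry)
    return entry


def verify_chain(log: List[dict]) -> bool:
    """Recompute every link; any edited, removed, or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain alone only detects partial edits: someone who can rewrite the entire log can recompute every hash, so the latest hash should be anchored periodically somewhere outside their reach (for example, in the WORM bucket).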

20.4 What NOT to log

PII and credentials in the audit log are a liability:

  • Never log passwords, even on failure. "Password incorrect" suffices.
  • Never log full session IDs or full tokens — log a hash or the last 4 characters. Otherwise the audit log itself becomes a credential database.
  • Personally identifiable information beyond user_id — minimize. GDPR's data minimization principle.
  • MFA codes — never log the actual TOTP / SMS code (replay risk).
  • API keys / secrets — never. Logging vendor SDKs sometimes leak this; review.
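For the token and session-ID rule, a short sketch of logging a fingerprint instead of the secret. The `token_fingerprint` name and the 16-hex-character truncation are illustrative choices.

```python
import hashlib


def token_fingerprint(token: str) -> str:
    """Return a short, non-reversible identifier that is safe to write to logs."""
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    # 16 hex chars (64 bits) is enough to correlate log lines for one token.
    return f"sha256:{digest[:16]}"
```

The same token always produces the same fingerprint, so an investigator can follow one token across log lines without the log ever holding a usable credential. This assumes tokens are high-entropy; a fingerprint of a guessable value could be brute-forced.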

20.5 Compliance frameworks that drive audit

  • SOX — finance, US public companies. Section 302 + 404 audit requirements.
  • PCI-DSS — card data; very specific log content requirements (PCI 10.x).
  • HIPAA — health; access logs for PHI must be reviewable on demand.
  • SOC 2 — auditor-driven; flexible but extensive.
  • ISO 27001 — global; Annex A control A.12.4 in the 2013 edition (8.15 in the 2022 edition) requires event logging.

The auditor's question every time: "show me who accessed user X's record between dates A and B." If you can't answer in 5 minutes, you fail.


§21. Zero-trust architecture

The 2010s assumed corporate networks were trusted (VPN inside = good; outside = bad). Zero-trust dropped that assumption: every request, regardless of network origin, is authenticated and authorized. Network-based trust is dead.

21.1 BeyondCorp — Google's reference architecture

Google published BeyondCorp in 2014 after the Aurora attacks (2009) showed that compromising one corporate laptop on the trusted network was a foothold for everything. BeyondCorp:

  • No VPN, no network perimeter. Engineers work directly from coffee shops, home, anywhere.
  • Every request authenticated. A Google employee opening an internal tool from a browser hits an Access Proxy that verifies (a) the device is corporate-issued and healthy, (b) the user is authenticated, (c) the user is authorized for that resource.
  • Device trust as a first-class signal. Devices have certificates issued at enrollment; certificate is on the device's secure element. Device health (patches, anti-malware) is checked continuously.
  • Continuous re-evaluation. Trust scores drop if anomalies show up; access can be revoked mid-session.

The architecture:

[Engineer's laptop] ─ HTTPS ─▶ [Access Proxy] ─ inspect ─▶ [Internal app]
        │                            │
   device cert                       ├─▶ check IdP (user identity)
   user creds                        ├─▶ check device inventory (corp-issued)
                                     ├─▶ check policy engine (allowed?)
                                     └─▶ continuous risk score

21.2 Commercial zero-trust gateways

Cloudflare Access, Cisco Duo, Tailscale, Zscaler — all implement BeyondCorp-style architecture as a service:

  • Cloudflare Access — TCP/HTTP reverse proxy; integrates with the customer's IdP (Okta, GSuite, AzureAD); per-app policies ("only engineers can access GitHub Enterprise from corp-issued laptops"). Subscription replaces traditional VPN.
  • Tailscale — VPN replacement built on WireGuard. Mesh networking. Identity-based access (Tailscale ACLs in HuJSON), no shared keys. Identity ties to Google/Microsoft/Okta IdP.
  • Cisco Duo — primarily MFA, but expanded into zero-trust gateway with device posture checks.
  • Zscaler Private Access (ZPA) — enterprise-targeted; positioned as a replacement for traditional VPNs.

21.3 The VPN replacement story

Traditional VPN: tunnel everything corp-to-corp; the tunnel grants implicit network access. Lateral movement after compromise is trivial. Patching the VPN appliance (Pulse Secure, Fortinet) is a security drama.

Zero-trust replaces the network with the identity. Connection is end-to-end TLS to the app, not site-to-site VPN. Each connection is authenticated; lateral movement requires fresh authN per app.

Advice: for any new infrastructure project in 2026, zero-trust gateway is the default; VPN is the legacy. Existing VPNs are increasingly a liability (slow patch cycles, large attack surface), and many large organizations plan to retire them over the next few years.

21.4 What this means for application authZ

Apps assume nothing about the network. Every request must:

  1. Bear the user's identity (JWT in header or session cookie).
  2. Bear the device's identity (mTLS cert thumbprint, BeyondCorp header).
  3. Be evaluated against per-request policy ("can this user, on this device, do this action, on this resource, at this time?").

Applications don't have a "trusted internal" mode. There is no internal.


§22. Compliance frameworks affecting auth design

Compliance regimes are not optional in regulated industries. They directly constrain auth design choices.

22.1 PCI-DSS (Payment Card Industry Data Security Standard)

Applies to: any system that stores, processes, or transmits cardholder data.

Auth-relevant requirements:

  • Req 8.4 — MFA for all non-console administrative access AND all remote access to the Cardholder Data Environment (PCI-DSS 4.0 numbering; this was Req 8.3 in 3.2.1). SMS OTP is widely discouraged for new implementations.
  • Req 8.3 — password length, complexity, and rotation rules (Req 8.2 in 3.2.1); or use of stronger factors such as hardware tokens.
  • Req 10 — comprehensive audit logging, including all authN events.
  • Req 8.2 — unique user IDs; shared accounts forbidden for the CDE except as tightly controlled exceptions.

Design impact: passwordless / passkey deployment for admin access to payment processing; per-account MFA enrollment; immutable audit log; never share service accounts among engineers.

22.2 HIPAA (US Healthcare)

Applies to: Covered Entities (hospitals, clinics, payers) and Business Associates (SaaS handling PHI).

Auth-relevant safeguards:

  • §164.312(a)(2)(i) — Unique user identification.
  • §164.312(a)(2)(iii) — Automatic logoff. Session timeout for unattended sessions.
  • §164.312(d) — Person or entity authentication. Reasonable verification before access.
  • §164.312(b) — Audit controls. Record activity in PHI-handling systems.

Design impact: short session timeouts (15-30 min); detailed audit logs of every PHI access; per-user accounts (no shared logins in clinical settings); break-glass procedures for emergency access.

22.3 GDPR (EU privacy)

Applies to: any system processing personal data of EU residents.

Auth-relevant articles:

  • Article 5 — Data minimization. Don't collect more identity data than needed.
  • Article 17 — Right to erasure ("right to be forgotten"). User can demand account deletion; you must actually delete it (with caveats for legal retention).
  • Article 32 — Security of processing. Encryption, MFA, access controls.
  • Articles 33-34 — Breach notification. 72-hour reporting to the supervisory authority if breached.

Design impact: account deletion must actually delete (not just disable); audit logs are kept under "legitimate interest" with documented retention; MFA recommended; encryption of credentials at rest is effectively mandatory under Article 32's "appropriate measures" standard.

The hardest GDPR-auth interaction: audit logs contain PII (user IDs, IPs, actions). When a user invokes right-to-erasure, do you delete their audit log? Generally, no — audit logs are kept under "legal obligation" or "legitimate interest," but the user's user_id must be pseudonymized or replaced with a hash so the log can't be tied back without separate consent.
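The pseudonymization step could look like the following sketch. It assumes a keyed hash (HMAC-SHA256) rather than a plain hash, because a plain hash over a small user-ID space can be reversed by brute force; the key would live in a separate secret store. The function names and the event shape (dicts with a `user_id` field) are hypothetical, and other PII fields such as IPs would need the same treatment.

```python
import hashlib
import hmac
from typing import Dict, List


def pseudonymize_user_id(user_id: str, key: bytes) -> str:
    """Keyed, deterministic replacement for a user ID in audit records."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def erase_user_from_audit_log(events: List[Dict], user_id: str, key: bytes) -> List[Dict]:
    """Return a copy of the log with the erased user's ID replaced.

    Events stay in place (the audit trail is preserved); only the link
    back to the person is removed for anyone who does not hold `key`.
    """
    token = pseudonymize_user_id(user_id, key)
    return [
        {**event, "user_id": token} if event.get("user_id") == user_id else event
        for event in events
    ]
```

Because the mapping is deterministic, all of one user's events still correlate with each other after erasure, which keeps the log useful for incident investigation.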

22.4 CCPA (California Consumer Privacy Act)

GDPR-lite for California. Right to know, right to delete, right to opt out of sale of personal information. Auth design impact similar to GDPR but less prescriptive.

22.5 SOC 2 (System and Organization Controls 2)

Customer-facing audit; mainly a B2B SaaS requirement. Trust Services Criteria: Security, Availability, Processing Integrity, Confidentiality, Privacy.

Auth-relevant controls:

  • CC6.1 — Logical access controls. MFA, role-based access, least privilege.
  • CC6.2-6.3 — Provisioning and deprovisioning users.
  • CC7.2 — Monitoring for security events.

Design impact: SCIM provisioning to enterprise tenants; quarterly access reviews; MFA on all admin and engineering access; detailed audit logs.

22.6 FedRAMP (US Government)

Applies to: SaaS sold to US federal agencies.

Levels: Low, Moderate, High. Auth requirements get progressively stricter:

  • Moderate — MFA required; ~325 controls.
  • High — Hardware-based MFA (PIV/CAC cards, FIDO2 hardware keys); ~421 controls.

Design impact: support hardware-bound credentials (PIV smart cards), no synced passkeys for FedRAMP High, all auth events to immutable government-approved storage.

22.7 The audit playbook

Every regulated SaaS keeps:

  • Auth event log with 7-year retention.
  • Quarterly access review with sign-off.
  • MFA enrollment report (% of users with MFA).
  • Annual penetration test of auth surfaces.
  • Documented incident response plan with auth-compromise scenarios.
  • Customer-facing compliance docs (SOC 2 report, HIPAA BAA, GDPR DPA).

Advice: design auth to be evidenceable. The audit isn't "we have MFA"; it's "show me 100 random user accounts and prove their MFA enrollment date."


§23. Cost of auth

Auth runs at high QPS. Cost matters, especially for KMS-backed signing and IdP federation.

23.1 KMS / HSM signing costs

If JWTs are signed using a managed KMS (AWS KMS, GCP Cloud KMS, Azure Key Vault HSM), every signing call costs money:

  • AWS KMS — $0.03 per 10,000 sign operations for asymmetric (RSA/ECC) keys. At 1M logins/sec × 86400s × 365 = 31.5 trillion signs/year ≈ $94M/year if every JWT were KMS-signed.

That's absurd. The fix: don't sign every JWT in KMS. Use KMS only for key wrapping: the IdP holds a signing key in process memory, and KMS is called only on rotation (once per quarter, hundreds of dollars per year total). Runtime signing uses the in-memory key and incurs no per-call KMS charge.
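The per-sign arithmetic behind the KMS figure can be reproduced with a few lines. This is a sketch that takes the $0.03 per 10,000 operations price quoted above as an input; the function name is illustrative.

```python
SECONDS_PER_YEAR = 86_400 * 365


def kms_signing_cost_per_year(signs_per_second: float, usd_per_10k: float = 0.03) -> float:
    """Annual cost in USD if every signature is a billable KMS call."""
    signs_per_year = signs_per_second * SECONDS_PER_YEAR
    return signs_per_year / 10_000 * usd_per_10k


if __name__ == "__main__":
    # 1M signs/sec sustained, the worst case discussed in the text.
    print(f"${kms_signing_cost_per_year(1_000_000):,.0f} per year")
```

The cost is linear in signing rate, so the same function can be used to check any other traffic assumption before committing to per-token KMS signing.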

For ultra-high-security flows, the trade-off is signing inside the HSM (every signature produced in tamper-resistant silicon) at ~$0.03/10k. At 100k signs/sec sustained, that's ~3.15 trillion signs/year, or ~$9.5M/year. Worth it only for low-volume, highest-stakes keys (e.g., root signing keys for a CA); not worth it for everyday access tokens.

23.2 IdP federation calls

Every login federated to a third-party IdP costs:

  • Okta — pricing scales with monthly active users; rough $2-4/user/year for B2C.
  • Auth0 — similar tier model; ~$0.023/login above free tier.
  • AWS Cognito — $0.0055/MAU (Monthly Active User) above free tier (50k MAU); cheaper but fewer enterprise features.

At 100M MAU: $200M+/year on Okta-grade IdP. The pure self-host (Keycloak on a small cluster) at the same scale is $50k/year in compute + ~5 SRE (Site Reliability Engineer) FTE. The buy-vs-build calculation flips around 10-30M MAU for B2C, sooner for B2B with enterprise IdP needs.

23.3 The cache hit rate target

For session-store-backed validation, hit rate must be >99%. Numbers at 1M QPS:

  • 100% cache hit: 1M Redis reads/sec, ~$1-2k/month for a small Redis cluster (e.g., a few ElastiCache nodes).
  • 99% hit, 1% miss to MySQL: 10k MySQL reads/sec — borderline saturating a single MySQL primary. Shard or read-replica.
  • 90% hit, 10% miss: 100k MySQL reads/sec — needs 4-8 shards.

Engineering effort to push cache hit from 99.0% to 99.9% (eviction policy tuning, warmer queries, longer TTL): tractable. Effort to push from 90% to 99%: huge. The cliff is steep.
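The miss-rate cliff is simple arithmetic and can be sketched as below. The per-shard capacity is an assumption supplied by the caller, not a measured MySQL limit, and the function names are illustrative.

```python
import math


def db_reads_per_second(qps: int, hit_rate: float) -> int:
    """Reads that fall through the cache to the database each second."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return round(qps * (1.0 - hit_rate))


def shards_needed(qps: int, hit_rate: float, reads_per_shard: int) -> int:
    """Minimum shard count for an assumed per-shard read capacity."""
    return max(1, math.ceil(db_reads_per_second(qps, hit_rate) / reads_per_shard))


if __name__ == "__main__":
    for hit_rate in (0.999, 0.99, 0.90):
        misses = db_reads_per_second(1_000_000, hit_rate)
        print(f"hit rate {hit_rate:.1%}: {misses:,} DB reads/sec")
```

Going from 99% to 90% multiplies database load by ten at the same QPS, which is why the hit rate is treated as a capacity-planning input rather than a tuning detail.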

Design principle: assume revenue grows with usage; engineer cache hit rate for the 10× growth, not today's load.

23.4 The math at a glance

At 1M QPS sustained auth validation:

  • JWT local verify (RS256): ~80 cores, ~$300/month compute. Cheap.
  • Redis session cache: ~$2k/month for a cluster.
  • MySQL session-store with 99% cache hit: 10k/sec, ~$5k/month for an Aurora cluster.
  • KMS-signed token (every token): $94M/year. NEVER.
  • IdP federation per login: at 1M logins/day and the ~$0.023/login rate above, ~$23k/day, or ~$8.4M/year on Auth0. Compare to Keycloak self-host at $50k/year + 1 SRE.

The takeaway: know the unit economics. Many systems lose money on auth because they didn't realize KMS calls or per-MAU IdP costs scale linearly with growth.


§24. Failure mode walkthrough

Crash mid-operation (IdP crashes during login): Argon2id verify returned true but INSERT INTO refresh_tokens hadn't committed. Client retries; Argon2id is idempotent; transaction never committed, no orphan row. Durability point: only the DB commit makes a refresh token usable.
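A small sketch of that durability point, using an in-memory SQLite database as a stand-in for the real store. The table layout and function names are hypothetical, and the "crash" is simulated with an exception: in a real crash the process dies and the database discards the uncommitted transaction on its own.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE refresh_tokens (user_id TEXT NOT NULL, token_hash TEXT UNIQUE)"
    )
    conn.commit()
    return conn


def issue_refresh_token(conn: sqlite3.Connection, user_id: str, token_hash: str,
                        crash_before_commit: bool = False) -> None:
    """Insert the token row; it only becomes usable once commit() returns."""
    try:
        conn.execute(
            "INSERT INTO refresh_tokens (user_id, token_hash) VALUES (?, ?)",
            (user_id, token_hash),
        )
        if crash_before_commit:
            raise RuntimeError("simulated IdP crash before commit")
        conn.commit()
    except Exception:
        conn.rollback()  # nothing from the failed attempt survives
        raise


def token_count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM refresh_tokens").fetchone()[0]
```

A retry after the simulated crash inserts exactly one row, so the client-visible behaviour is a failed login followed by a clean successful one, with no orphan token.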

Crash between operations (key rotation interrupted): step 1 (new key published to JWKS) succeeded; step 2 (switch signer to new key) hadn't. On restart, IdP reads "current signing key" pointer; if not flipped, signing continues with the old key. New key is in JWKS but unused — verifiers accept both old and new during rotation. No client impact.

IdP unavailable (cluster partitioned, 5 min): new logins fail, refresh fails, but verification continues — JWKS is CDN-cached, verifiers don't need the IdP on the hot path. Recovery: multi-region IdP with anycast or GeoDNS failover. Durability point: JWKS on CDN with long TTL keeps verification alive through IdP outage.

Revocation list out of sync: Kafka consumer lag of 10 min on a subset of verifiers — revoked tokens work for those 10 min. Mitigation: monitor lag per partition, alert >5s. Fail-closed mode for high-security paths (if feed is stale beyond X seconds, switch to "consult Redis denylist directly"). Durability point: Kafka's replicated log is source of truth; local denylist is a materialized view, rebuildable by replay.
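The fail-closed switch can be sketched as a local denylist that tracks when it last heard from the feed. This is a sketch under assumptions: `authoritative_lookup` stands in for the direct Redis check, `heartbeat()` is assumed to be called whenever the consumer confirms it is caught up, and all names are illustrative.

```python
import time
from typing import Callable, Set


class LocalDenylist:
    """Materialized view of revoked token IDs, with a staleness guard."""

    def __init__(self, max_staleness_s: float,
                 authoritative_lookup: Callable[[str], bool],
                 clock: Callable[[], float] = time.monotonic):
        self._max_staleness_s = max_staleness_s
        self._authoritative_lookup = authoritative_lookup
        self._clock = clock
        self._revoked: Set[str] = set()
        self._last_update = clock()

    def apply_event(self, jti: str) -> None:
        """Called by the feed consumer for each revocation event."""
        self._revoked.add(jti)
        self._last_update = self._clock()

    def heartbeat(self) -> None:
        """Called when the consumer is caught up but no events arrived."""
        self._last_update = self._clock()

    def is_revoked(self, jti: str) -> bool:
        stale = self._clock() - self._last_update > self._max_staleness_s
        if stale:
            # Feed is behind: do not trust the local view.
            return self._authoritative_lookup(jti)
        return jti in self._revoked
```

The trade-off is latency for correctness: while the feed is stale, every check on a high-security path pays a remote lookup, which is the intended cost of failing closed.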

Signing key compromise (detected post-hoc): audit log shows the key was briefly accessible last week. Publish new kid to JWKS, switch signing to it, drop the old key after a 10-min grace to invalidate everything signed by it, force re-authentication of all users, audit-log all activity in the suspected window.

Network partition (auth service ↔ user store): auth reaches Redis but not MySQL. New logins fail (need to read user row); validation continues (uses JWKS only). Recovery: MySQL read replicas in the auth region; failover to read-only mode. Durability point: MySQL binlog is source of truth; replicas reapply on recovery.

Permanent loss of a user-store shard: one MySQL shard destroyed (hardware fire + replica unavailable). ~4M users cannot log in; existing tokens work until expiry; refresh fails when MySQL is consulted. Recovery: restore from last full backup + binlog replay to point-of-failure. Data loss window = backup interval (typically 15 min) minus replicated binlog reach. Durability point: daily full backups + continuous binlog shipping to a separate region; synchronous replication for the most critical events (password changes, MFA enrollments).

IdP downtime cascading to total outage: when the IdP goes down — and every service requires a fresh token validation that calls the IdP introspection endpoint — every service fails authentication, every dependent service downstream fails, the entire product is unreachable. The 2020 Cloudflare-Okta partial outage showed this pattern: a localized IdP issue cascaded into hours of customer-side login failures. Mitigation: cached public keys (JWKS at CDN with hours-long TTL); stateless JWT validation that works without the IdP for the validity window of issued tokens; aggressive in-memory caching of session lookups so the user-perceived blast radius is "no new logins for 5 minutes" instead of "everything is broken." Durability point: the IdP must be on the cold path. Anything that puts the IdP on the hot path of every API call will take the whole product down when the IdP hiccups.

Signing key rotation incident (old tokens stop validating mid-deploy): the rotation procedure rolled out the new kid to only half the verifier fleet, so verifiers disagreed about which keys were valid. Pods whose in-memory JWKS was loaded before the new key was published reject tokens signed with the new kid; pods that picked up a JWKS from which the old kid had already been dropped reject tokens still signed with the old key. Symptom: ~50% of API requests intermittently fail with invalid_token. Diagnosis: the verifier library's JWKS cache TTL was 6 hours, so cache hits keep serving a stale key set that does not include both keys, and only cache misses fetch a fresh JWKS. Fix: on key rotation, force-bust the JWKS cache across all verifiers (rolling restart, or a pub/sub "JWKS updated" event), and always keep overlap: the old key remains in JWKS for at least 2× the access token TTL after signing switches to the new key. Lesson: key rotation is more error-prone than people think; treat it as a coordinated deploy, not a single API call.
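A sketch of a verifier-side JWKS cache that supports the force-bust fix. It also refetches when it sees an unknown kid, rate-limited so bogus kids cannot cause a fetch storm; that refetch is an addition assumed here, not something the incident write-up prescribes. `fetch` is a hypothetical callable returning a kid-to-key mapping.

```python
import time
from typing import Callable, Dict, Optional


class JwksCache:
    def __init__(self, fetch: Callable[[], Dict[str, str]], ttl_s: float,
                 min_refetch_interval_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._min_refetch_interval_s = min_refetch_interval_s
        self._clock = clock
        self._keys: Dict[str, str] = {}
        self._fetched_at: Optional[float] = None

    def invalidate(self) -> None:
        """Hook for a 'JWKS updated' pub/sub event: next lookup refetches."""
        self._fetched_at = None

    def _refresh(self) -> None:
        self._keys = dict(self._fetch())
        self._fetched_at = self._clock()

    def get_key(self, kid: str) -> Optional[str]:
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at > self._ttl_s:
            self._refresh()
        elif (kid not in self._keys
              and now - self._fetched_at >= self._min_refetch_interval_s):
            self._refresh()  # unknown kid: the key set may have rotated
        return self._keys.get(kid)


def earliest_old_key_removal(switch_time_s: float, access_token_ttl_s: float) -> float:
    """Overlap rule from the text: keep the old key for 2x the token TTL."""
    return switch_time_s + 2 * access_token_ttl_s
```

With this shape, a 6-hour TTL no longer means a 6-hour window of rejecting a newly published key: the first token carrying the new kid triggers a refetch once the rate limit allows it.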

Clock skew across services causing spurious token rejection: the NTP daemon failed on a few servers and their clocks drifted by about 5 minutes. On a host running 300s fast, a freshly issued token with exp = now + 900 appears to have only 600s left, so every token is rejected during the last 5 minutes of its real lifetime. On a host running 300s slow, the same token's nbf/iat lies in the future and it is rejected as not yet valid. Either way there is mass rejection on a subset of the fleet, and customers report "I just logged in, why am I logged out?" Detection: monitor host clock drift continuously; alert on >5s drift. Mitigation: the JWT library's leeway parameter (60-120s) absorbs minor skew; major skew (>120s) should crash the host or fail closed. Lesson: NTP discipline isn't optional for distributed auth.
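The leeway mitigation amounts to widening the exp and nbf comparisons by a small tolerance. The sketch below shows the comparison itself on already-decoded claims, assuming Unix-second timestamps; signature verification is out of scope here, and the function name and return strings are illustrative.

```python
from typing import Mapping


def check_time_claims(claims: Mapping[str, float], now: float,
                      leeway_s: float = 60.0) -> str:
    """Validate exp/nbf against the verifier's clock with a tolerance."""
    exp = claims.get("exp")
    if exp is None:
        return "missing_exp"
    if now >= exp + leeway_s:
        return "expired"
    nbf = claims.get("nbf")
    if nbf is not None and now < nbf - leeway_s:
        return "not_yet_valid"
    return "ok"
```

Leeway also extends every token's effective lifetime by the same amount, which is why it is kept to a minute or two and large skew is handled by taking the host out of service instead.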

Cross-region replication lag triggers tenant lockout: tenant admin in EU region adds a new user. EU MySQL primary commits the row. Replication to US region lags by 30 seconds. The new user, attempting to log in via the US region (because their VPN exits there), can't be found. They appear to be a non-existent account. Symptom: "we created the user, but they say login fails." Mitigation: route writes and immediately-following reads to the same region; tag new-user events for cross-region propagation priority; if cross-region is essential, accept the latency and route logins to the user-home-region rather than the closest region.


§25. Why not the obvious simpler alternative

"Just store sessions in MySQL, look up by session ID on every request." Pre-2010 design. At 2M QPS even 256 shards = ~8k lookups/sec/shard, above MySQL's comfort zone (~5k/sec sustained random-access). Buffer pool churns. Cross-region adds 60ms RTT per request. Adding Redis in front gives you the "session cache" architecture — but with MySQL writes on every login. Right move: validate JWTs locally; consult Redis only when revocation evidence is needed.

"Use JWT and never revoke." Stolen laptop at 9:00am — thief has 15 min of access token, 30 days of refresh token. Banking, healthcare, admin tooling: catastrophic. Compliance (GDPR, HIPAA, SOX) all require demonstrable immediate revocation. Logout, password change, account closure, admin-forced re-auth are not optional.

"Use HS256 (symmetric) and share the secret with every service." N services, N copies of the signing secret — any leak forges tokens for the entire system. A junior engineer copies the secret to their laptop "to debug." Auditing is impossible. Rotation requires synchronized deploys. Right move: RS256 (or EdDSA), HSM-backed private key, public-key JWKS for verifiers. 40× verify slowdown is irrelevant at 2M QPS spread across thousands of cores.

"Verify tokens by calling the IdP introspection endpoint on every request." IdP becomes the bottleneck — handling 2M QPS introspection means a cluster the size of the API tier. Cross-region latency on every API call (50-100ms). Single point of failure. Right move: stateless JWT validation in each service; introspection only for opaque tokens where revocation tightness matters more than throughput.


§26. Scaling axes

Type 1: more users (uniform growth)

| Scale (users) | What changes |
|---|---|
| 1M | Single auth service, single Postgres, Redis for sessions. Sessions fine; JWT optional. |
| 10M | Shard MySQL by user_id (8 shards). Redis cluster. RS256 JWT for multi-service validation. |
| 100M | 64 shards. Multi-region IdP. CDN JWKS. Kafka for revocation propagation. MFA as a service. |
| 1B | 256 shards. Per-region writes with async cross-region replication. Federated identity for B2B. SCIM provisioning. SPIFFE for service-to-service. |

Inflection points: ~10M users — shard the user store. ~50 services — JWKS / asymmetric signing; HS256 stops being viable. ~5 regions — Kafka revocation propagation; polling the IdP doesn't scale. B2B tenancy — federated identity (SAML or OIDC per tenant).
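Why HS256 stops being viable as the number of verifying services grows: with a symmetric algorithm, the secret needed to verify a token is the same secret needed to mint one. The sketch below builds HS256 tokens with the standard library only to make that concrete; `SHARED_SECRET` and the claim values are made up for illustration.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: bytes) -> str:
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"},
                                separators=(",", ":")).encode("utf-8"))
    payload = _b64url(json.dumps(claims, separators=(",", ":")).encode("utf-8"))
    signing_input = f"{header}.{payload}"
    sig = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(sig)}"


def verify_hs256(token: str, secret: bytes) -> bool:
    try:
        signing_input, sig_b64 = token.rsplit(".", 1)
    except ValueError:
        return False
    expected = hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest()
    return hmac.compare_digest(_b64url(expected).encode("ascii"),
                               sig_b64.encode("utf-8"))


# Every verifying service must hold this value (hypothetical secret).
SHARED_SECRET = b"copied-to-every-service"

# Any holder of the verification secret can mint an arbitrary token.
forged = sign_hs256({"sub": "attacker", "role": "admin"}, SHARED_SECRET)
```

With an asymmetric scheme such as RS256, verifiers hold only the public key, so compromising one of fifty services does not give the attacker signing capability.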

Type 2: one celebrity / robot / service account (hotspot intensification)

The more interesting axis. One identity with 100× normal API rate — a BFF (Backend For Frontend), a busy enterprise API integration, an admin tool serving 10k support agents on a shared account:

  • Token validation still scales — each verifier independent; same kid verifies identically regardless of source.
  • Refresh tokens become a hot key. One user's refresh_token row gets read/written 10k/sec. The shard hosting that row is now the bottleneck.
  • Revocation cost grows. Revoking a hot account invalidates an enormous number of in-flight calls; the 1s stale window has a much bigger blast radius.

Fixes: cache refresh token validity in Redis per token (~30 day TTL); reads stay in cache, only writes hit MySQL. Per-tenant rate limiting at the gateway — even legitimate hot users get throttled to a sane ceiling. For machine-to-machine traffic, prefer mTLS with short-lived (1h) certs; no refresh tokens. The cert IS the credential, with its own revocation model (CRL/OCSP).
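Per-tenant throttling at the gateway is commonly built as a token bucket per tenant. The following is a single-process sketch under that assumption; a real gateway would keep bucket state in shared storage, and the class name and parameters are illustrative.

```python
import time
from typing import Callable, Dict, Tuple


class TenantRateLimiter:
    """Token bucket per tenant: `rate_per_s` sustained, `burst` peak."""

    def __init__(self, rate_per_s: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self._rate = rate_per_s
        self._burst = burst
        self._clock = clock
        # tenant_id -> (tokens remaining, timestamp of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, tenant_id: str) -> bool:
        now = self._clock()
        tokens, last = self._buckets.get(tenant_id, (self._burst, now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(self._burst, tokens + (now - last) * self._rate)
        allowed = tokens >= 1.0
        self._buckets[tenant_id] = (tokens - 1.0 if allowed else tokens, now)
        return allowed
```

Because buckets are keyed by tenant, one hot identity exhausts only its own budget and quiet tenants on the same gateway are unaffected.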

Structural inflection: at high enough concentration, model these as "service accounts" with different lifecycle and rate-limit policies rather than as "users."


§27. Decision matrix

| Dimension | JWT (signed) | Session cookie (server state) | Opaque + cache |
|---|---|---|---|
| Per-request latency | ~100µs local crypto | ~1ms Redis | ~1ms cache |
| Revocation latency | T + 1s (denylist fanout) | Immediate | Immediate |
| Cross-service portability | Trivial (just verify) | Hard (sticky to domain) | Cache replication needed |
| Wire size | ~1KB | 32 B | 32 B |
| Update claims mid-session | Hard (must refresh) | Trivial | Trivial |
| Compromise blast radius | Large (valid till exp) | Bounded | Bounded |
| Best for | Microservices, mobile, federated | One-domain browser apps | High-revocation-freq APIs |

Thresholds: >20 services in a mesh — JWT. Single-domain browser app, no mobile — session cookie + server state. Tight revocation SLO (<1s) with cross-service portability — opaque + cache (Stripe/PayPal). Default for new builds — hybrid: session cookie for the browser, JWT for downstream services.

Service-to-service: mTLS vs SPIFFE vs bearer JWT

| Dimension | Plain mTLS (X.509) | SPIFFE workload identity | Bearer JWT |
|---|---|---|---|
| Identity layer | TCP (TLS handshake) | TCP + structured ID (URI form) | Application (HTTP header) |
| Revocation | CRL/OCSP (clunky) | SVID auto-expires hourly | Denylist + short TTL |
| Automation | Manual cert rotation pain | Fully automated (SPIRE agent per node) | IdP-issued |
| Cross-mesh | Cert chain trust | SPIFFE federation | OAuth federation |
| Best for | Legacy with manual PKI | Modern microservice mesh | When user identity must propagate |

Thresholds: greenfield mesh on K8s — SPIFFE/SPIRE or Istio's built-in mTLS. Need service + user identity — mTLS for connection + JWT in metadata for user. Mature manual PKI shop — plain mTLS, fine if rotation is automated.


Consumer login (Google, Meta). OIDC over OAuth 2.0, RS256 JWTs, JWKS located via the jwks_uri advertised in /.well-known/openid-configuration. Internal: BeyondCorp zero-trust mesh. MFA via push or Titan hardware keys. Demand: billions of accounts, broad device coverage (web/mobile/TV/watch), tolerance for minutes-scale revocation lag. Variant fits: federated OIDC with hybrid session-cookie + JWT.

Enterprise SSO (Okta, Auth0). Multi-tenant IdP. Each tenant configures their IdP-of-record (Active Directory, Workspace, custom); Okta/Auth0 federates and issues tokens to downstream SaaS via SAML or OIDC. SCIM (System for Cross-domain Identity Management) for user lifecycle. Demand: per-tenant key isolation, fine-grained admin policies, compliance (SOC 2, ISO 27001, FedRAMP). Variant fits: per-tenant signing keys, opaque tokens for SaaS APIs.

Mobile auth (Apple Sign-In, OAuth + PKCE). Authorization code + PKCE — mobile app is a "public client," can't keep a secret, so PKCE binds the auth code to a verifier known only to the legit app. ES256 (smaller signatures, better for mobile). Per-app pseudonymous identifier (Apple's "Hide my email"). Demand: small wire footprint, fast crypto on mobile silicon, biometric unlock. Variant fits: ES256 + PKCE + refresh tokens in iOS Keychain / Android KeyStore.
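The PKCE binding can be sketched with the standard library, following the S256 method from RFC 7636: the app keeps a random verifier, sends only its SHA-256 challenge with the authorization request, and reveals the verifier when redeeming the code. Function names are illustrative.

```python
import base64
import hashlib
import hmac
import secrets


def make_code_verifier(n_bytes: int = 32) -> str:
    """Random verifier; 32 bytes encodes to 43 URL-safe characters."""
    return base64.urlsafe_b64encode(secrets.token_bytes(n_bytes)).rstrip(b"=").decode("ascii")


def code_challenge_s256(verifier: str) -> str:
    """challenge = BASE64URL(SHA256(ASCII(verifier))), without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def server_accepts(stored_challenge: str, presented_verifier: str) -> bool:
    """Token endpoint check: does the verifier match the stored challenge?"""
    return hmac.compare_digest(code_challenge_s256(presented_verifier), stored_challenge)
```

An attacker who intercepts the authorization code on the redirect sees at most the challenge, and cannot produce the verifier needed to exchange the code for tokens.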

API authentication (Stripe API keys, GitHub PATs, OAuth Apps). Opaque bearer tokens. Stripe keys (sk_live_...) validated against a fast distributed cache. GitHub Personal Access Tokens similar. GitHub Apps issue short-lived installation tokens (1h); the app itself authenticates with a JWT signed by the app's private key. Demand: instant revocation, fine-grained scopes, high-volume per-token. Variant fits: opaque tokens cached at the edge.

Service-to-service mesh (SPIFFE/SPIRE, mTLS). Every workload gets a cryptographic identity via SPIFFE (spiffe://prod.example.com/ns/checkout/sa/checkout-svc) and a short-lived SVID (1h X.509). SPIRE agent per node auto-rotates. Istio/Linkerd handles mTLS. User identity layered on top as a JWT in request metadata. Demand: zero-touch identity rotation, no shared secrets, cryptographic identity at the connection layer. Variant fits: SPIFFE + JWT.
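On the receiving side, a service typically authorizes a peer by the SPIFFE ID carried in its SVID. A minimal sketch of parsing and checking that ID, assuming the ID string has already been extracted from a verified certificate; the allow-list shape and function names are hypothetical.

```python
from typing import Set, Tuple
from urllib.parse import urlsplit


def parse_spiffe_id(spiffe_id: str) -> Tuple[str, str]:
    """Split a SPIFFE ID into (trust_domain, workload_path)."""
    parts = urlsplit(spiffe_id)
    if parts.scheme != "spiffe":
        raise ValueError("not a SPIFFE ID: scheme must be 'spiffe'")
    if not parts.netloc or "@" in parts.netloc or ":" in parts.netloc:
        raise ValueError("invalid trust domain")
    if parts.query or parts.fragment:
        raise ValueError("SPIFFE IDs carry no query or fragment")
    return parts.netloc, parts.path


def is_authorized_peer(spiffe_id: str, trust_domain: str,
                       allowed_paths: Set[str]) -> bool:
    """Accept only known workloads from our own trust domain."""
    try:
        domain, path = parse_spiffe_id(spiffe_id)
    except ValueError:
        return False
    return domain == trust_domain and path in allowed_paths
```

For the example ID in the text, the trust domain is prod.example.com and the workload path is /ns/checkout/sa/checkout-svc; authorization is an exact match on both rather than a prefix or substring test.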

Federated identity (SAML for enterprise). Enterprise IdP (ADFS, Okta, Azure AD) issues a signed SAML assertion to the SaaS on user login. SaaS validates the signature against the customer's pre-uploaded IdP cert, extracts user attributes (email, groups), creates a local session. Demand: legacy enterprise compatibility (every Fortune 500 has SAML), attribute mapping, JIT (Just-In-Time) provisioning. Variant fits: SAML 2.0 with signed XML and browser POST binding. New SaaS prefer OIDC but must offer SAML for B2B.

Passwordless (WebAuthn / passkeys). User registers a hardware-backed credential (Touch ID, Face ID, YubiKey, Windows Hello). Private key never leaves the device's secure enclave. Login: relying party sends a challenge; device signs; server verifies with the registered public key. Phishing-resistant — credential is scoped to the origin (https://example.com), can't be presented to https://exarnple.com. Demand: phishing resistance, no password reuse, no reset support burden. Variant fits: FIDO2/WebAuthn. Apple/Google passkeys (cloud-synced) for consumer; YubiKey/Titan for enterprise.


§29. Real-world implementations with numbers

| System | Pattern | Scale |
|---|---|---|
| Google Identity | OIDC (Google co-authored OAuth 2.0 + OIDC). RS256 JWTs. JWKS at googleapis.com/oauth2/v3/certs. BeyondCorp zero-trust internally. | 3B+ accounts, ~150M logins/day, ~2M+ validations/sec across the global fleet (10,000:1 ratio). |
| Meta / Facebook Login | OAuth 2.0; custom signed tokens. Internal: TAO graph for authZ relationships. | 3B+ DAUs, billions of logins/day. |
| LinkedIn Auth | Hybrid session cookie + JWT. RBAC + ReBAC for content access (Espresso permission tables, Pegasus policies). SAML for workforce SSO; OIDC for consumer. | 1B+ members, hundreds of millions of session validations/sec across the mesh. |
| Okta | SaaS IdP, multi-tenant per-org keys. SAML + OIDC + SCIM. Per-tenant rate limits. | ~18k enterprise tenants, hundreds of millions of monthly active workforce identities, ~50B authN events/year (~1500/sec average). |
| Auth0 (now Okta) | SaaS IdP. OIDC-first. Rules engine for policy. | ~10k tenants pre-acquisition. |
| GitHub OAuth Apps + Apps | PATs (opaque), OAuth Apps (user-scoped), GitHub Apps (installation-scoped, JWT-authenticated, 1h installation tokens). | 100M+ users, billions of API calls/day. |
| Apple Sign-In | OIDC over ES256. Per-app pseudonymous identifier (relay email). Built-in PKCE. | Hundreds of millions of devices. |
| AWS IAM + STS | Custom SigV4 (HMAC-based, not JWT). STS issues short-lived (15min-12h) credentials for assumed roles. IRSA for K8s workloads. | Probably the highest-QPS auth surface on Earth: every AWS API call. |
| Google Zanzibar | Reference for ReBAC at planetary scale. ~2 trillion ACL tuples. Powers Drive, YouTube, Calendar. | Tens of millions of QPS, p95 <10ms globally. Inspired SpiceDB, OpenFGA. |
| Cloudflare Access | Zero-trust gateway. JWT-bound to device certs + IdPs. SaaS app reverse-proxy with auth at the edge. | Hundreds of thousands of customer tenants. |
| Stripe API auth | Opaque API keys (server-side cache lookup); webhook signing with HMAC-SHA256 + timestamp; restricted keys for fine-grained scoping. | Hundreds of millions of API calls/day. |

The pattern that wins at scale: signed credential (JWT or equivalent) for the hot path; revocation by short TTL + denylist; identity federation for B2B; workload identity (SPIFFE/mTLS) for service-to-service; KDF-hashed passwords with passkey upgrade for end-user credentials.


§30. Summary

Auth and identity is the canonical "hot path / cold path" technology class: per-request validation must be stateless and cryptographic (signed JWT verified locally, ~100µs CPU per RS256 verify), while state (sessions, revocation lists, user records) lives on the cold path, with revocation events fanned out to verifiers via Kafka. The validation-to-login ratio of ~10,000:1 dictates the architecture: anything that hits a central store on every request doesn't survive scale. Authentication's hard problems are credential storage (Argon2id, memory-hard, never reversible) and revocation in a stateless world (short TTL + jti denylist + per-user min-iat). Authorization's hard problems are per-tenant blast-radius isolation and scaling relationship checks (Zanzibar-style ReBAC for resource sharing, RBAC for coarse roles). Defense-in-depth means every hop verifies, every token carries an audience, every signing key lives in an HSM, and every action is auditable, because the day your user table leaks, the only thing standing between the attacker and your users is the work factor of your KDF.