Auth & Identity

§1. What auth and identity IS

Authentication and identity is the class of technology that answers two distinct questions on every interaction:

Authentication (AuthN) — "who is this?" Prove identity via something known (password), held (TOTP — Time-based One-Time Password, hardware token), inherent (biometric), or by chained trust (a previously-issued token).
Authorization (AuthZ) — "is this identity allowed to do that?" Decide whether the action is permitted: RBAC (Role-Based), ABAC (Attribute-Based), or ReBAC (Relationship-Based Access Control).

The same token (e.g., a signed JWT — JSON Web Token) usually carries both: sub says who; scope/role/aud constrain what. AuthN's hard problems are credential storage, MFA (Multi-Factor Authentication), and revocation. AuthZ's hard problems are policy evaluation at scale and cross-tenant blast-radius isolation.

The technology sits between the public network and business logic — fronted by an API gateway, fanned out into every service via a token validator or sidecar. Two distinct sub-problems:

User auth — humans logging in via password + MFA, browser cookies, mobile redirects, social login. Occasional logins, ~10-15 min access tokens, refresh tokens for continuity.
Service auth: machines authenticating to other machines in a microservice mesh. Short-lived (~1h) auto-rotated credentials, cryptographic identity bound to the connection during the TLS handshake (mTLS, mutual TLS; SPIFFE, Secure Production Identity Framework For Everyone). No human in the loop.

NOT in scope: end-to-end encryption (Signal protocol, X3DH, double ratchet — separate category, protecting message content from the platform); key management for data at rest (KMS for DEKs/KEKs — Data/Key Encryption Keys; auth uses KMS to custody signing keys, but data encryption is its own problem); fine-grained policy languages (OPA — Open Policy Agent, Rego — the AuthZ engine downstream of identity).

Where this tech is NOT good: high-frequency intra-app trust where every method call needs its own crypto check (auth gates the request boundary, not function calls); plaintext-content protection (JWT signing protects integrity, not confidentiality — claims are readable to anyone holding the token).

The defining property: auth events are rare; auth checks are constant. A user logs in maybe ten times a week, but their identity is verified on thousands of API calls per day. The technology must collapse the verification path to microseconds while keeping the issuance path correct.

§2. Inherent guarantees per scheme

Each scheme makes a different bargain.

JWT (signed bearer token). Provides stateless verification — any party with the IdP's (Identity Provider) public key can verify without contacting the issuer. Integrity, tamper detection, built-in expiration. Does NOT provide revocation — once issued, valid until exp no matter what. Does NOT provide confidentiality — claims are base64url-encoded, not encrypted. Must layer: short TTL + denylist for revocation; aud audience separation to prevent confused deputy; key rotation infrastructure.

Session cookie + server store. Provides instant revocation (delete the session row). Updatable claims mid-session. Easy audit. Does NOT provide cheap scale-out — every API call hits Redis. Does NOT provide cross-domain validity without explicit federation. Must layer: server-side store, session-fixation protection (rotate ID at login), CSRF (Cross-Site Request Forgery) tokens.

mTLS. Provides cryptographic identity at the connection level, established during the TLS handshake that runs on top of TCP. The identity IS the connection: it can't be stolen from a captured payload because TLS session keys aren't transferable. Does NOT provide human identity, semantic scopes, or fine-grained authZ. Must layer: PKI (Public Key Infrastructure) ops, meaning short-lived certs (e.g., 1h SPIFFE SVIDs, SPIFFE Verifiable Identity Documents), automated rotation, CRL (Certificate Revocation List) or OCSP (Online Certificate Status Protocol).

OAuth 2.0 / OIDC (OpenID Connect). Provides delegated authorization (OAuth) and federated identity (OIDC): a user can log into App A with their account at IdP B without giving A the password. Standardised flows. Does NOT provide implementation-grade defaults (footguns: implicit flow leaks tokens, missing PKCE on public clients, mis-validated aud). OAuth is a flow framework, not a revocation system.

SAML (Security Assertion Markup Language). XML-based browser federation, mature in enterprise SSO (Single Sign-On). Not mobile- or service-friendly. Modern shops migrate toward OIDC for new builds; SAML stays for legacy enterprise B2B.

Passwords + KDF (Key Derivation Function). Argon2id/bcrypt/scrypt provide work-factor protection against offline cracking after a DB leak. Salt defeats rainbow tables. Does NOT protect against phishing, credential stuffing, password reuse — KDFs only matter once the leak has happened. Must layer: MFA, breached-password checks, rate limiting, ideally passkeys.

Production systems compose these — JWT for the hot path, session cookie for browser revocation, mTLS for service mesh, OAuth flow for cross-app login, Argon2id for the once-per-login password check.

§3. The design space

Stateful vs stateless vs hybrid

Pattern	Token contents	Per-request work	Revocation	Best for
Stateful (session cookie)	Opaque ID (32 random bytes)	Redis/DB lookup ~1ms	Delete the row	Browser apps on one domain (banking portal, GitHub web)
Stateless (signed JWT)	Signed claims	Local crypto verify ~80-150µs	Hard — denylist + short TTL	Microservice mesh, mobile apps
Hybrid (opaque + cache)	Opaque ID, claims in cache	Cache hit ~1ms	Cache `DEL`	Stripe API keys, PayPal, Google LOAS

The hybrid pattern is common where you want JWT-like portability and instant revocation: PayPal, Stripe, Google's internal LOAS (Low Overhead Authentication System) use short opaque tokens with claims in a fast distributed cache. Warm cache hit; rare miss falls back to introspection.

Signing algorithms

Algo	Type	Key size	Sign	Verify	Sig size	Use
HS256 (HMAC-SHA256)	Symmetric	256-bit	~2µs	~2µs	32 B	Single trust boundary
RS256 (RSA-SHA256)	Asymmetric	2048-bit	~3ms	~80µs	256 B	Federated — IdP signs, many verifiers
ES256 (ECDSA P-256)	Asymmetric	256-bit	~150µs	~150µs	64 B	Mobile, IoT, wire-size-sensitive
EdDSA (Ed25519)	Asymmetric	256-bit	~80µs	~150µs	64 B	Modern default — deterministic, immune to nonce reuse

Default pick: RS256 or EdDSA. HS256 fails when more than one party needs to verify (sharing secrets is fatal). ECDSA's nonce-reuse footgun caused real disasters (Sony PS3, 2010: the same nonce was reused across signatures, which let attackers recover the private signing key); HSMs (Hardware Security Modules) avoid it, but EdDSA's determinism kills the whole bug class.
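
To make the HS256 trade-off concrete, here is a minimal standard-library sketch of symmetric JWT signing and verification. The function names are illustrative, not from any particular JWT library, and the verifier deliberately skips header and claim checks to keep the focus on the shared secret.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url without padding, the encoding JWS uses for every segment."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Build header.payload.signature with an HMAC-SHA256 tag."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    tag = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(tag)


def verify_hs256(token: str, secret: bytes) -> bool:
    """Recompute the tag with the SAME secret and compare in constant time."""
    parts = token.split(".")
    if len(parts) != 3:
        return False
    signing_input = (parts[0] + "." + parts[1]).encode("utf-8")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return hmac.compare_digest(b64url(expected).encode("ascii"), parts[2].encode("utf-8"))
```

Note that verify_hs256 needs the very same secret that sign_hs256 used, so any service able to verify a token is also able to mint one. That is the blast-radius problem the paragraph above describes.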

OAuth 2.0 flow selection

Flow	Client type	When
Authorization Code	Confidential (server-side webapp with secret)	Classic webapp login (Atlassian, Salesforce)
Authorization Code + PKCE	Public (mobile, SPA)	Mobile apps, SPAs (Apple Sign-In, Sign in with Google mobile)
Client Credentials	Service account	Backend-to-backend calls with no user present (cron jobs, internal service accounts)
Device Code	Smart TVs, CLI, IoT with no browser	Apple TV Netflix, AWS CLI SSO, `gh auth login`
Refresh Token	Any with stored refresh	Continuous session for mobile apps, long-running CLIs
Implicit (deprecated)	SPA historical	DEAD — leaks token in URL fragment
ROPC (Resource Owner Password Credentials, deprecated)	Legacy migration	DEAD — defeats the entire point of OAuth

Recommended pick: Authorization Code + PKCE (Proof Key for Code Exchange) for all new flows. OAuth 2.1 makes PKCE mandatory for all clients.

Token format, storage, MFA

Token formats: JWT (self-contained), opaque (server lookup), PASETO (JWT competitor without algorithm-confusion footguns), Macaroons (delegated, attenuatable caveats — Google internal, Tarsnap). Session storage: in-memory single instance (~1µs), Redis/Memcached (~1ms, default), MySQL/Postgres (~5-50ms, audit-trail), cookie-only signed (0ms, stateless edge).

MFA factor ranking (weakest → strongest): SMS OTP (weak — SIM swapping), TOTP (RFC 6238 — Google Authenticator), push (Duo, Okta Verify), hardware tokens (YubiKey, Titan), WebAuthn/FIDO2/passkeys (phishing-resistant — credential scoped to origin). Modern default: WebAuthn/passkeys. Apple, Google, Microsoft all ship them natively.

§4. Byte-level mechanics

This is where the depth lives. Anyone can name "JWT" or "OAuth"; the real reference walks the bytes.

4a. JWT anatomy

Three base64url-encoded segments joined with dots: header.payload.signature.

Wire format (line-broken for readability):

eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6ImtleS0yMDI2LTA1In0
.
eyJpc3MiOiJodHRwczovL2lkcC5leGFtcGxlLmNvbSIsInN1YiI6InVzZXItNDIi
LCJhdWQiOiJhcGkuZXhhbXBsZS5jb20iLCJleHAiOjE3MTUwMDA5MDAsImlhdCI6
MTcxNTAwMDAwMCwianRpIjoiYjM2ZDQ4MSJ9
.
VsbGgZkYZGW5XYsTuYg5fkH3vY3p9_lKqGo7n8sZ9wW...

Decoded header (JSON): {"alg":"RS256","typ":"JWT","kid":"key-2026-05"} — kid (key ID) tells the verifier which public key to use.

Decoded payload (claims, RFC 7519 §4.1):

{
  "iss": "https://idp.example.com",   // issuer — required, validated
  "sub": "user-42",                    // subject — user ID
  "aud": "api.example.com",            // audience — required, MUST be validated
  "exp": 1715000900,                   // expiration epoch
  "iat": 1715000000,                   // issued at — useful for per-user revocation
  "nbf": 1715000000,                   // not valid before
  "jti": "b36d481",                    // unique token ID — key for denylist
  "scope": "read:profile write:posts"
}

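Signature: RS256_sign(SHA256(header_b64 + "." + payload_b64), private_key), then base64url-encoded. RS256 with a 2048-bit key produces exactly 256 bytes pre-base64; ES256 produces 64 bytes in a JWT (JWS uses the raw R || S concatenation from RFC 7518 §3.4, not the ~70-72 byte DER encoding); HS256 produces 32 bytes.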
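
As an illustration of the anatomy above, the following sketch decodes the header segment of the example token using only the standard library. The helper names are made up for this example; the only subtlety is restoring the "=" padding that base64url strips.

```python
import base64
import json


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, re-adding the padding JWTs omit."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def peek_header(token: str) -> dict:
    """Read the (unverified) header so the verifier can pick a key by kid."""
    header_b64 = token.split(".", 1)[0]
    return json.loads(b64url_decode(header_b64))


# Header segment copied from the wire format above; the other two segments
# are placeholders because only the header is inspected here.
HEADER_B64 = "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6ImtleS0yMDI2LTA1In0"
header = peek_header(HEADER_B64 + ".payload.signature")
print(header["alg"], header["kid"])  # RS256 key-2026-05
```

Nothing read this way is trusted yet: the header is consulted only to choose a key, and the claims are used only after the signature check succeeds.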

4b. The RS256 verify path, byte by byte

Receive: "eyJ...header.eyJ...payload.VsbGg...signature"

Step 1: split on "." → header_b64, payload_b64, signature_b64
Step 2: base64url-decode header → JSON {"alg":"RS256","kid":"key-2026-05"}
Step 3: look up public key by kid in JWKS cache (JSON Web Key Set,
        fetched from https://idp.example.com/.well-known/jwks.json,
        refreshed hourly) → RSA public key (n, e)
Step 4: compute SHA256(header_b64 + "." + payload_b64) → 32-byte digest
Step 5: base64url-decode signature_b64 → 256-byte signature bytes
Step 6: RSA verify: signature^e mod n == PKCS#1-v1.5-padded digest?
Step 7: base64url-decode payload → JSON claims
Step 8: validate claims:
   - iss == expected_issuer (constant-string compare)
   - aud includes this service
   - exp > now() - leeway (typical leeway: 60s)
   - nbf <= now() + leeway
   - jti not in revocation denylist (hashset lookup, ~50ns)
   - iat >= min_iat[sub] (per-user revocation cache lookup)
Step 9: return claims to application
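
Step 8 can be sketched as a pure function over already-verified claims. This is a minimal illustration, assuming the signature check from step 6 has passed; the function name, the reason strings, and the plain set/dict used for the denylist and min_iat cache are hypothetical stand-ins.

```python
import time
from typing import Optional


def validate_claims(claims: dict, *, expected_issuer: str, audience: str,
                    denylist: set, min_iat: dict,
                    now: Optional[float] = None, leeway: int = 60) -> Optional[str]:
    """Return None if the claims pass, otherwise a short rejection reason."""
    now = time.time() if now is None else now

    if claims.get("iss") != expected_issuer:
        return "bad iss"

    # aud may be a single string or a list of strings.
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if audience not in audiences:
        return "bad aud"

    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now - leeway:
        return "expired"

    nbf = claims.get("nbf")
    if nbf is not None and nbf > now + leeway:
        return "not yet valid"

    if claims.get("jti") in denylist:
        return "revoked jti"

    # Per-user revocation: tokens issued before min_iat[sub] are dead.
    if claims.get("iat", 0) < min_iat.get(claims.get("sub"), 0):
        return "revoked by min_iat"

    return None
```

Returning a reason instead of raising keeps the hot path cheap and makes rejection metrics easy to label; a real verifier would likely map these reasons to a 401 response.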

Total CPU: ~80µs RS256 verify + ~5µs JSON parse + ~50ns hashset lookup ≈ 85µs; budget ~100µs per validation once base64url decoding and the min_iat cache lookup are included.

4c. RSA math (one paragraph)

Pick two large primes p, q (1024 bits each for RSA-2048). Compute n = p · q, φ(n) = (p-1)(q-1), pick e (almost always 65537: small, prime, sparse binary), compute d = e⁻¹ mod φ(n). Public key = (n, e); private = d. Sign: s = m^d mod n, where m is the padded digest. Verify: m =? s^e mod n. Security rests on integer factorization being hard: the best-known algorithm (NFS, Number Field Sieve) needs on the order of 2^112 (≈ 5 × 10^33) operations for a 2048-bit n, which is why RSA-2048 is rated at ~112-bit security. HSMs store d in tamper-resistant silicon; signing happens inside the HSM, never exposing d. For RS256 specifically: SHA-256 to 32-byte digest, PKCS#1 v1.5 pad to 256 bytes, RSA-sign the padded digest.
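
The arithmetic can be checked end to end with toy numbers. The sketch below uses the textbook primes 61 and 53 with e = 17 (65537 would not fit such a tiny modulus), and it reduces the digest mod n instead of applying PKCS#1 v1.5 padding. Both are simplifications for illustration only and are completely insecure.

```python
import hashlib

# Toy parameters: real RSA-2048 uses two ~1024-bit primes and e = 65537.
p, q = 61, 53
n = p * q                 # public modulus
phi = (p - 1) * (q - 1)   # Euler's totient of n
e = 17                    # public exponent, coprime with phi
d = pow(e, -1, phi)       # private exponent: modular inverse (Python 3.8+)


def toy_digest(message: bytes) -> int:
    """SHA-256 reduced mod n so it fits the toy modulus (RS256 pads instead)."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n


def sign(message: bytes) -> int:
    return pow(toy_digest(message), d, n)        # s = m^d mod n


def verify(message: bytes, signature: int) -> bool:
    return pow(signature, e, n) == toy_digest(message)   # m =? s^e mod n


sig = sign(b"header.payload")
print(n, phi, d, verify(b"header.payload", sig))  # 3233 3120 2753 True
```

The asymmetry is visible even at this size: verify needs only (n, e), while sign needs d, which is the one value the IdP never shares.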

4d. Fleet math

A system at 2M token validations/sec across the mesh:

RS256 verify: 2,000,000 × 80µs = 160 CPU-seconds per wall second = ~160 cores spread across the fleet. At 64-core hosts, ~2.5 hosts' worth of CPU consumed by JWT verification — totally tractable.
ES256 verify: 2M × 150µs = ~300 cores.
HS256 verify: 2M × 2µs = ~4 cores — 40× cheaper, but requires every verifier to hold the signing key. Poison pill for blast radius.

The fact that RS256 is 40× slower than HMAC but still only 160 cores at 2M QPS is why asymmetric crypto is the default for distributed validation: the CPU is a rounding error; the architectural simplification (only the IdP holds the private key) is the win.
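
The fleet estimate is simple enough to script, which makes it easy to rerun with your own measurements. The per-verify costs below are the figures quoted in this section; the helper name is illustrative.

```python
def cores_needed(validations_per_sec: int, verify_cost_us: float) -> float:
    """CPU-seconds consumed per wall-clock second, i.e. fully busy cores."""
    return validations_per_sec * verify_cost_us / 1_000_000


QPS = 2_000_000
HOST_CORES = 64

for algo, cost_us in {"RS256": 80, "ES256": 150, "HS256": 2}.items():
    cores = cores_needed(QPS, cost_us)
    print(f"{algo}: {cores:.0f} cores, {cores / HOST_CORES:.2f} hosts")
```

Swapping in a measured verify cost from a benchmark on your own hardware is the main reason to keep this as code rather than a one-off calculation.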

4e. JWT revocation invariants

The fundamental tension: a JWT is verifiable without contacting the issuer. That's what makes it scalable. But once issued, it's valid until exp — the IdP cannot "take it back" without breaking the no-server-lookup model.

Production layered design:

(1) Short access TTL + refresh tokens. Access tokens get 5-15 min expiry. After expiry, the client uses a long-lived refresh token (stored server-side with state) to get a new access token. To revoke a user: invalidate their refresh token; the access token works for up to 15 min more, then locked out.
(2) jti denylist via Kafka. Maintain a Redis set of revoked JWT IDs. When immediate revocation is needed, publish jti to Kafka; every verifier subscribes and updates its local in-memory denylist within ~1s. Verifiers check the denylist on each request (~50ns hashset lookup). Bounded size: once past exp, entries fall off, so the set stays in the hundreds of entries even at 50k revocations/day.
(3) Per-user min_iat stamp. Redis: user:42 → min_iat: 1715000500. Any token with iat < min_iat is rejected. To revoke all of user 42's tokens (the "log me out everywhere" button), bump min_iat to now. One Redis lookup per request (cached at the verifier; LRU of ~1M entries ≈ 100MB).
(4) Key rotation (nuclear). Rotate the signing key. All tokens signed with the old key are invalidated. Reserved for catastrophic compromise.

The complete answer combines (1) + (2) + (3): short access TTL is the safety net; Kafka-fanned jti denylist handles per-token revocation; min_iat handles per-user "logout everywhere."
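
A verifier's local view of the jti denylist and the min_iat stamps might look like the sketch below. It is an in-memory illustration only: the class and method names are hypothetical, and the Kafka consumer and Redis lookups that would feed it are left out.

```python
import time
from typing import Optional


class RevocationState:
    """In-memory revocation view held by a single verifier process."""

    def __init__(self, leeway: int = 60):
        self.leeway = leeway
        self._denylist = {}  # jti -> exp of the revoked token
        self._min_iat = {}   # sub -> earliest acceptable iat

    def revoke_token(self, jti: str, exp: int) -> None:
        """Per-token revocation (would be driven by a Kafka message)."""
        self._denylist[jti] = exp

    def revoke_user(self, sub: str, now: Optional[float] = None) -> None:
        """'Log me out everywhere': reject every token issued before now."""
        self._min_iat[sub] = int(time.time() if now is None else now)

    def prune(self, now: Optional[float] = None) -> None:
        """Drop entries whose token has expired anyway; keeps the set small."""
        now = time.time() if now is None else now
        cutoff = now - self.leeway
        self._denylist = {j: e for j, e in self._denylist.items() if e > cutoff}

    def is_revoked(self, claims: dict) -> bool:
        if claims.get("jti") in self._denylist:
            return True
        return claims.get("iat", 0) < self._min_iat.get(claims.get("sub"), 0)

    def denylist_size(self) -> int:
        return len(self._denylist)
```

Storing exp next to each jti is what bounds the denylist: an entry only has to survive as long as the token it blocks could still pass the exp check.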

4f. Refresh token storage

Refresh tokens are long-lived bearer credentials. Treat as carefully as passwords.

CREATE TABLE refresh_tokens (
  token_id        CHAR(36)  PRIMARY KEY,    -- visible part
  user_id         BIGINT    NOT NULL,
  hashed_secret   CHAR(64)  NOT NULL,        -- HMAC-SHA256 of secret part
  family_id       CHAR(36)  NOT NULL,        -- rotation chain
  generation      INT       NOT NULL,        -- rotation counter
  expires_at      TIMESTAMP,
  revoked_at      TIMESTAMP NULL,
  INDEX (user_id, revoked_at), INDEX (family_id)
);

Token shown to client: token_id.secret. Server stores token_id indexed and HMAC(server_key, secret). On use: split on ., look up by token_id, constant-time-compare HMAC(server_key, secret) to hashed_secret (non-constant-time leaks timing), check revoked_at IS NULL and expires_at > now(), issue new access token AND rotate (new generation, same family_id, mark old revoked).

family_id + generation detects stolen refresh tokens. If an attacker steals and uses one, the server issues a new one and revokes the old. When the legitimate client then presents the original (now-revoked) refresh token, the server detects "family already advanced past this generation" and revokes the entire family, which also kills the attacker's rotated token. This is the refresh token rotation scheme of OAuth 2.1 §6.1.
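
The issue, rotate, and reuse-detection logic can be sketched as follows. This is a simplified illustration: a dict stands in for the refresh_tokens table, SERVER_KEY is generated in-process (in practice it would live in a KMS or HSM), and the expires_at check is omitted for brevity.

```python
import hashlib
import hmac
import secrets
import uuid
from dataclasses import dataclass
from typing import Optional

SERVER_KEY = secrets.token_bytes(32)  # assumption: really held in KMS/HSM


@dataclass
class Row:
    user_id: int
    hashed_secret: str
    family_id: str
    generation: int
    revoked: bool = False


ROWS = {}  # token_id -> Row; stand-in for the refresh_tokens table


def _hash(secret: str) -> str:
    return hmac.new(SERVER_KEY, secret.encode("utf-8"), hashlib.sha256).hexdigest()


def issue(user_id: int, family_id: Optional[str] = None, generation: int = 0) -> str:
    token_id, secret = str(uuid.uuid4()), secrets.token_urlsafe(32)
    ROWS[token_id] = Row(user_id, _hash(secret), family_id or str(uuid.uuid4()), generation)
    return f"{token_id}.{secret}"  # only the client ever sees the raw secret


def rotate(presented: str) -> Optional[str]:
    """Return a fresh refresh token, or None if the presented one is rejected."""
    token_id, _, secret = presented.partition(".")
    row = ROWS.get(token_id)
    if row is None or not hmac.compare_digest(_hash(secret), row.hashed_secret):
        return None
    if row.revoked:
        # Reuse of an already-rotated token: assume theft, burn the family.
        for other in ROWS.values():
            if other.family_id == row.family_id:
                other.revoked = True
        return None
    row.revoked = True
    return issue(row.user_id, row.family_id, row.generation + 1)
```

After a reuse event every token in the family is dead, so both the attacker and the legitimate client must re-authenticate, which is the intended outcome.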

4g. OAuth 2.0 authorization code + PKCE, byte by byte

1. User clicks "Login" in webapp/mobile app.

2. Client generates:
     code_verifier  = random 43-128 char string
                      e.g., "dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk"
     code_challenge = base64url(SHA256(code_verifier))
     state          = random nonce (anti-CSRF)

3. Client 302-redirects browser to IdP /authorize?
     response_type=code&client_id=webapp&
     redirect_uri=https://app.example.com/callback&
     code_challenge=<challenge>&code_challenge_method=S256&
     state=<nonce>&scope=openid profile email

4. IdP shows login page; user submits credentials + MFA.

5. IdP:
   - verifies Argon2id hash (~80ms); MFA challenge if step-up
   - generates auth_code (1 min, single-use)
   - stores: auth_code → {user_id, client_id, code_challenge, scope}
   - 302-redirects browser to:
       https://app.example.com/callback?code=<auth_code>&state=<nonce>

6. Client callback:
   - verifies state matches stored value (anti-CSRF)
   - server-side POSTs to IdP /token:
       grant_type=authorization_code, code=<auth_code>,
       code_verifier=<original verifier>, client_id=webapp,
       client_secret=<secret> (confidential clients only),
       redirect_uri=https://app.example.com/callback

7. IdP /token:
   - retrieves stored code_challenge for auth_code
   - computes base64url(SHA256(code_verifier))
   - MUST match stored code_challenge
   - deletes auth_code (one-time use)
   - mints:
       access_token  (JWT, 15 min, aud=api.example.com)
       id_token      (JWT, OIDC identity claims about user)
       refresh_token (opaque, 30 days, server-side row)
   - returns { access_token, id_token, refresh_token, expires_in: 900 }

8. Client creates session:abc123 → {user_id, tokens} in Redis.
   Set-Cookie: session=abc123; HttpOnly; Secure; SameSite=Lax

9. Subsequent API calls: browser sends cookie → gateway resolves
   session → forwards Authorization: Bearer <access_token> →
   API validates JWT locally (no IdP call).

Why every piece is non-optional:

PKCE: prevents code interception. If an attacker steals the auth code (browser history, malicious app on phone, shared computer), they can't redeem it without the original code_verifier. Replaces the older client_secret-only flow for public clients.
state parameter: anti-CSRF. Otherwise an attacker can log a victim into the attacker's account.
Auth code, not token-in-URL: code traverses the user's browser (untrusted, logs, Referer headers); the token only goes server-to-server.
One-time use, 1-min auth code TTL: small leak window; one use closes it.
id_token vs access_token (OIDC): id_token is for the client app to verify who logged in. access_token is for the API to verify what's allowed. Mixing them up is a common bug class.
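
The PKCE derivation in step 2 and the IdP's check in step 7 fit in a few lines of standard-library Python. The function names are illustrative; the S256 transform itself (SHA-256, then unpadded base64url) is as specified in RFC 7636.

```python
import base64
import hashlib
import hmac
import secrets


def make_verifier() -> str:
    """Client side: 32 random bytes become a 43-char URL-safe verifier."""
    return secrets.token_urlsafe(32)


def challenge_for(verifier: str) -> str:
    """S256 method: base64url(SHA256(code_verifier)) without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def idp_accepts(stored_challenge: str, presented_verifier: str) -> bool:
    """IdP side at /token: recompute the challenge and compare."""
    recomputed = challenge_for(presented_verifier)
    return hmac.compare_digest(recomputed.encode("ascii"), stored_challenge.encode("utf-8"))


verifier = make_verifier()
challenge = challenge_for(verifier)   # sent with /authorize
print(idp_accepts(challenge, verifier))          # True
print(idp_accepts(challenge, make_verifier()))   # False
```

An attacker who intercepts only the authorization code sees the challenge at most, and recovering a verifier from it would require inverting SHA-256.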

4h. Password storage — why salt + slow KDF

Threat model: your user table eventually leaks (LinkedIn 2012, Adobe 2013, Yahoo 2013, RockYou 2009, Equifax 2017). Assume hashes are public. What must remain unrecoverable?

Bad: SHA-256(password). GPUs do ~10 billion SHA-256/sec. 8-char password = 95^8 ≈ 6.6 × 10^15 candidates → crackable in ~7 days on one rig. Rainbow tables make precomputed dictionaries trivial.

Better: SHA-256(salt + password). Salt defeats rainbow tables. But still 10 GH/s; dictionary passwords crack in seconds.

Good: bcrypt(password, cost=12). ~250ms/hash on 2010 hardware, ~50ms on 2026 hardware. GPU attacks run at ~20 hashes/sec/core, so 95^8 takes ~10M years per core (6.6 × 10^15 / 20 per second ≈ 3.3 × 10^14 s). Cost factor tunable up; re-hash on next login. Limitation: bcrypt is time-hard, not memory-hard, so ASICs (Application-Specific Integrated Circuits) and FPGAs (Field-Programmable Gate Arrays) parallelize it cheaply.

Best: Argon2id with m=64MiB, t=3, p=1. Memory-hard: each hash needs 64MiB of RAM, so ASICs hit a memory bandwidth wall. Argon2 won the 2015 Password Hashing Competition. The id variant is hybrid: resistant to side-channel attacks (the i lineage) AND GPU parallel attacks (the d lineage).

Math: at 64MiB/hash, a 24GB GPU runs ~375 hashes in parallel. At 80ms each, ~4700 hashes/sec/GPU. To crack a 10-char password (95^10 ≈ 6 × 10^19 candidates) takes ~400M years on one GPU, or ~400,000 years on a 1000-GPU farm.

Salt is automatic in the encoded output: $argon2id$v=19$m=65536,t=3,p=1$<base64-salt>$<base64-hash>. Each user gets a unique salt.
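
Python's standard library does not ship Argon2id, so the sketch below illustrates the same salt-plus-slow-KDF pattern with hashlib.scrypt, another memory-hard KDF. The "$scrypt$..." encoding is a made-up format modeled on the Argon2 string above, and the cost parameters are illustrative, not a recommendation.

```python
import base64
import hashlib
import hmac
import secrets

N, R, P = 2**14, 8, 1  # 128 * N * R bytes = 16 MiB of memory per hash


def _b64(data: bytes) -> str:
    return base64.b64encode(data).decode("ascii")


def hash_password(password: str) -> str:
    """Hash with a fresh per-user salt; parameters travel with the hash."""
    salt = secrets.token_bytes(16)
    key = hashlib.scrypt(password.encode("utf-8"), salt=salt, n=N, r=R, p=P, dklen=32)
    return f"$scrypt$n={N},r={R},p={P}${_b64(salt)}${_b64(key)}"


def verify_password(password: str, encoded: str) -> bool:
    """Re-derive with the stored salt and parameters, compare in constant time."""
    _, scheme, params, salt_b64, key_b64 = encoded.split("$")
    if scheme != "scrypt":
        return False
    cost = {k: int(v) for k, v in (kv.split("=") for kv in params.split(","))}
    candidate = hashlib.scrypt(
        password.encode("utf-8"), salt=base64.b64decode(salt_b64),
        n=cost["n"], r=cost["r"], p=cost["p"], dklen=32,
    )
    return hmac.compare_digest(candidate, base64.b64decode(key_b64))
```

Because the parameters are stored alongside each hash, the work factor can be raised later and old hashes upgraded on the user's next successful login. With an Argon2 library available, the same shape applies with Argon2id swapped in.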

Fleet cost: at 6,000 logins/sec peak, 6000 × 80ms = 480 CPU-seconds per wall second = 480 cores for password verification. With ~480 hashes in flight at any instant, that is 480 × 64MiB = 30GiB of resident RAM (and ~375GiB allocated and released every second). That's why login flows are aggressively rate-limited: an attacker probing usernames can OOM your auth tier.

When the interviewer asks "MD5 vs SHA-256 for passwords?" — neither. KDFs only. That's the entire answer.

Composing §4b + §4g + §4h with timings:

T=0: user submits credentials
T=4ms: IdP reads user row (~2ms MySQL shard read)
T=84ms: Argon2id verify completes (~80ms compute)
T=85-2086ms: MFA challenge (~2s for the human to type the TOTP)
T=2089ms: IdP HSM-signs the JWT (~3ms RS256 sign)
T=2092ms: IdP inserts refresh_token row (~3ms)
T=2096ms: webapp creates Redis session
T=2098ms: Set-Cookie returned

Then per API call: ~1ms session lookup at gateway; ~80µs RS256 verify at the backend; ~5µs claim validation; ~50ns denylist check. Per-request overhead after login: ~1.5ms total, ~100µs of crypto. Password verify is the only slow thing (80ms) and happens once per ~10,000 API calls. That asymmetry is what the technology buys.

§5. WebAuthn and passkeys in depth

Passwords are the worst credential humans have ever invented. They are phishable, reusable, leakable, forgettable, and stored in every breach corpus on the dark web. The 2024-2026 industry shift to passkeys is the first credible attempt at killing them at consumer scale. Understanding the actual protocol — not just the marketing — is required.

5.1 FIDO2 in two layers

FIDO2 (Fast IDentity Online 2) is two specs working together:

CTAP2 (Client to Authenticator Protocol v2) — talks between the client (browser, OS) and the authenticator (the device holding the private key: YubiKey over USB/NFC, Apple Secure Enclave, Windows Hello TPM). CTAP2 carries the challenge from the relying party to the device and the signed assertion back. Wire formats: CBOR (Concise Binary Object Representation) over USB-HID, NFC, or Bluetooth Low Energy.
WebAuthn (Web Authentication, W3C spec) — JavaScript API exposed by the browser to web pages. navigator.credentials.create() for registration, navigator.credentials.get() for login. WebAuthn talks to the OS/browser, which talks CTAP2 to the authenticator. Server side, the relying party receives a JSON blob with the signed assertion.

A web page never sees raw CTAP2; an authenticator never sees WebAuthn JSON. The browser is the bridge.

5.2 Public-key crypto on the device

Registration creates a fresh key pair on the authenticator:

1. RP (Relying Party) server sends: rp_id="example.com", user_id, challenge (random 32B).
2. Browser calls navigator.credentials.create({publicKey: {...}}).
3. Browser asks OS/authenticator to:
     - generate a fresh ECDSA P-256 (or Ed25519) key pair
     - bind the pair to (rp_id, user_id) inside the secure element
     - return: { credential_id, public_key, attestation_object,
                client_data_json (contains the challenge) }
4. Authenticator may also prompt for user verification: Face ID, Touch ID, PIN.
5. Server stores: (user_id, credential_id, public_key).
   Private key NEVER LEAVES the authenticator.

Login (authentication) then proves possession of that key without ever transmitting it:

1. Server generates challenge (32 bytes), sends to browser.
2. navigator.credentials.get(); authenticator looks up credential_id
   bound to rp_id, prompts user verification, signs:
     signature = ECDSA_sign(SHA256(authenticator_data || client_data_hash))
3. Server verifies signature with stored public_key.
4. Done. No password ever transmitted, ever.
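
Before verifying the signature, the relying party checks the context the authenticator signed over. The sketch below shows the signed byte string and those context checks with the standard library only. The function names are illustrative, and the ECDSA verification itself (which needs a crypto library) is intentionally left out.

```python
import base64
import hashlib
import hmac
import json


def signed_bytes(authenticator_data: bytes, client_data_json: bytes) -> bytes:
    """The message the authenticator signs: authData || SHA256(clientDataJSON)."""
    return authenticator_data + hashlib.sha256(client_data_json).digest()


def check_assertion_context(authenticator_data: bytes, client_data_json: bytes, *,
                            rp_id: str, origin: str, challenge: bytes) -> bool:
    """Checks to run before verifying the signature with the stored public key."""
    client = json.loads(client_data_json)
    expected_challenge = base64.urlsafe_b64encode(challenge).rstrip(b"=").decode("ascii")

    # authData layout: rpIdHash (32 bytes) | flags (1 byte) | signCount (4 bytes) | ...
    if len(authenticator_data) < 37:
        return False
    rp_hash_ok = hmac.compare_digest(
        authenticator_data[:32], hashlib.sha256(rp_id.encode("utf-8")).digest()
    )
    user_present = bool(authenticator_data[32] & 0x01)  # UP flag

    return (
        client.get("type") == "webauthn.get"
        and client.get("challenge") == expected_challenge
        and client.get("origin") == origin
        and rp_hash_ok
        and user_present
    )
```

Only when these checks pass would the server verify the signature over signed_bytes(...) with the public key stored at registration. An assertion produced for any other RP ID or origin fails here before any cryptography runs.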

The "private key never leaves the device" property is enforced by the secure element. On Apple, it's the Secure Enclave (a separate ARM core with isolated memory and a hardware AES engine). On Android, it's StrongBox (a discrete chip on Pixel/Samsung high-end) or TEE (Trusted Execution Environment) as a fallback. On Windows, it's the TPM (Trusted Platform Module). On a YubiKey, it's the YubiKey's secure element silicon. The OS gets a signed assertion; the OS cannot read the key.

5.3 No shared secret, phish-resistant

The "phish-resistant" claim rests on the rp_id binding. When the authenticator signs, it signs over rp_id_hash, the SHA-256 of the RP ID, and the browser only allows an RP ID that matches the domain of the origin the user is actually on. If a user visits https://exarnple.com (typo-squat, the m is rn), the browser passes rp_id=exarnple.com to the authenticator. The authenticator looks up credentials for exarnple.com and finds none. No credential, no signature, no login. The user literally cannot be phished into using their example.com credential on exarnple.com.

Compare to passwords: a phishing site shows a pixel-perfect login page; the user types their password; the attacker now has the password. The user is the weak link. Passkeys remove the user from the trust loop — the browser enforces origin binding cryptographically.

This is also why passkeys defeat credential stuffing. There's no shared secret to stuff with. Each (user, rp) pair has a unique key. A breach of one site's database leaks only public keys, which are useless to attackers.

5.4 Platform vs roaming authenticators

Platform authenticator — the authenticator is built into the device the user is on. Apple Keychain on iOS/macOS, Windows Hello on Windows 10+, Android Smart Lock. UX is seamless (Face ID prompt, done). Lives in the secure element. Cannot be used from a different device unless cloud-synced (see 5.5).
Roaming authenticator — separate hardware that the user carries. YubiKey 5 series, Google Titan key, Feitian. Connects via USB-A/C, NFC, or BLE. Same rp_id binding model, but the credential is portable: a single YubiKey can log into the user's account from any device with a USB port. Enterprise security teams prefer roaming authenticators because the credential is air-gapped from the OS — even total OS compromise doesn't leak it.

Mixed-mode is the production answer: platform for everyday convenience, roaming as the fallback / recovery / step-up factor.

5.5 Passkey sync across devices

The original FIDO2 model was non-syncable: lose the device, lose the key. That ergonomic was poison for consumer adoption — losing your iPhone meant losing your bank account. Apple, Google, and Microsoft all introduced synced passkeys in 2022-2023:

iCloud Keychain (Apple): passkey private keys are end-to-end encrypted in the sync fabric under a key derived from the iCloud password + device PINs. Apple cannot decrypt them. A new device joining the iCloud account performs an attested key-exchange handshake with an existing device to retrieve the sync keys.
Google Password Manager: similar architecture, with end-to-end encryption (E2EE) keys derived from the user's Google account secrets + device locks.
1Password / Dashlane / Bitwarden: third-party password managers added passkey storage in 2023-2024, syncing via the same vault that holds passwords.

The trade-off: synced passkeys are no longer hardware-bound. The private key exists in multiple devices (and in the encrypted cloud blob). Stronger than passwords (still phish-resistant, still per-origin), but weaker than non-syncable hardware keys (now susceptible to cloud-account compromise). Enterprise compliance regimes (FedRAMP High, PCI Level 1) still require hardware-bound (non-syncable) keys for privileged access; consumer flows accept synced.

5.6 When to migrate from passwords (the 2024 shift)

The industry inflection point: in 2024-2025 Apple, Google, Microsoft, GitHub, Amazon, eBay, Best Buy, Adobe, and PayPal all enabled passkeys as a primary credential. By 2026, "create a passkey" is the default sign-up CTA at most major consumer sites. The migration plays out as:

Phase 1 — additive: passkeys offered alongside passwords; users opt in.
Phase 2 — default: new signups use passkeys; existing users prompted to add one.
Phase 3 — passwordless: password becomes a recovery factor only; primary auth is passkey.
Phase 4 — passwords removed: account can be created and used with no password ever set.

Design call: most new B2C systems should start at Phase 2, with passkeys as the default and passwords as the fallback. For B2B/enterprise, follow what the customer's IdP supports (Okta and Azure AD both support WebAuthn now). For very high-security flows (admin, financial), hardware-bound roaming authenticators are still the right answer.

The cost: WebAuthn is more code than <input type=password>, registration UX has device-discoverability quirks (the credential lives on the device you registered with, not your account, in the non-synced model — confusing). Recovery requires a fallback channel (email, second device, IT helpdesk). Adoption requires user education. But the security upside — eliminating phishing, credential stuffing, and password reuse in one stroke — has finally tilted the calculation.

§6. Passwordless beyond passkeys

Passkeys are the strongest passwordless option, but the broader passwordless landscape contains several weaker — and one acceptable — alternatives. The hierarchy of MFA strength runs roughly: passkey (FIDO2/WebAuthn) > hardware OTP token > push with number-matching > push without number-matching > authenticator app TOTP > magic link > SMS OTP > knowledge-based (security questions).

6.1 Magic links

A magic link is a single-use URL emailed to the user's address: https://example.com/auth/magic?token=<random-64-char>. Click the link, the server validates the token (one-time, ~10 min TTL), and the user is logged in.

Security: depends entirely on the security of the user's email account. If email is itself behind a passkey/2FA, this is acceptable. If email is password-only, magic links inherit that weakness.
UX win: zero-friction onboarding. No password to remember, no app to install. Substack, Notion, Medium, Slack all use magic links for sign-in.
Failure modes: email delivery delays (SPF/DKIM misconfigured); link tokens leak via shared mailboxes, corporate email filters that pre-fetch URLs (Microsoft Defender expanding magic links and consuming the one-time use); user clicks link from a different device than they started on, leaving a half-authenticated session.

Caveat: magic links are sometimes positioned as "passwordless" but they're really "email-as-the-credential." If the threat model includes email account compromise, magic links don't help.
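
A single-use, time-limited magic-link token can be sketched as below. The in-memory dict is a stand-in for a real datastore, the function names are hypothetical, and storing only a hash of the token is a design choice of this sketch (so a leaked table cannot be replayed), not something the flow requires.

```python
import hashlib
import secrets
import time
from typing import Optional

TTL_SECONDS = 600   # ~10 min, as described above
_pending = {}       # sha256(token) -> (email, expires_at)


def _digest(token: str) -> str:
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def create_link(email: str, now: Optional[float] = None) -> str:
    now = time.time() if now is None else now
    token = secrets.token_urlsafe(48)  # 48 random bytes -> 64 URL-safe chars
    _pending[_digest(token)] = (email, now + TTL_SECONDS)
    return f"https://example.com/auth/magic?token={token}"


def redeem(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the email to log in, or None. pop() makes the token single-use."""
    now = time.time() if now is None else now
    entry = _pending.pop(_digest(token), None)
    if entry is None:
        return None
    email, expires_at = entry
    return email if now <= expires_at else None
```

The single-use property is exactly what link-prefetching mail scanners break: whichever client fetches the URL first consumes the token. Some deployments therefore require an explicit button click on the landing page before redeeming.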

6.2 SMS OTP (One-Time Password)

A 6-digit code sent via SMS. User types the code into the login page. Server validates (one-time, ~5 min TTL).

Security: weak. SMS is not end-to-end encrypted; the carrier sees every message. Worse, SIM-swap attacks let an attacker bribe or social-engineer a carrier into transferring the victim's phone number to an attacker-controlled SIM. The 2019 Twitter CEO hack (Jack Dorsey) and the 2022 wave of crypto-exchange compromises (Coinbase, FTX users) were SIM swaps.
UX: universal — every phone receives SMS. No app install. Still the most widely-deployed MFA factor by volume.
Why it persists: even weak MFA is much better than no MFA against bulk credential stuffing. NIST (National Institute of Standards and Technology) SP 800-63B has discouraged SMS since 2017 but didn't ban it because alternatives weren't universally available.

Verdict: acceptable as a fallback factor, never as the primary factor for high-value accounts. Banks pretending SMS is "two-factor" are gambling on bulk attackers being lazier than the SIM-swap rings.

6.3 Authenticator app TOTP (RFC 6238)

Time-based One-Time Password. The server and the device share a secret (a base32-encoded ~20-byte key, provisioned via QR code at enrollment). Every 30 seconds, both sides compute:

TOTP = truncate(HMAC-SHA1(secret, floor(now / 30))) mod 10^6

The 6-digit code is displayed in Google Authenticator, Authy, 1Password, etc. User types it; server computes the same value and compares.

Security: stronger than SMS — no carrier in the loop, no SIM swap. Weaker than passkeys — phishable (attacker proxies the code in real time via a phishing page), and the shared secret can be exfiltrated from the device (rooted phone, malicious Authenticator app).
UX cost: requires the user to install an app and scan a QR code at enrollment. Switching phones requires re-enrollment or a sync (Authy and 1Password offer cloud sync; Google Authenticator added it in 2023).
Clock skew: TOTP requires the device and server clocks to agree to within roughly one 30s step. Devices with bad NTP fail silently. The server should accept the current window plus one neighboring window on each side (±30s) by default.
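
The formula and the clock-skew window translate directly into standard-library Python. This sketch follows RFC 6238 with HMAC-SHA1 and takes the raw secret bytes; in practice the secret arrives base32-encoded from the QR code and would be decoded first. The function names are illustrative.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(secret: bytes, now: float, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 code for the time step containing `now`."""
    counter = struct.pack(">Q", int(now // step))         # 8-byte big-endian step count
    mac = hmac.new(secret, counter, hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                                # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10**digits).zfill(digits)


def verify_totp(secret: bytes, submitted: str, now: Optional[float] = None,
                step: int = 30, window: int = 1) -> bool:
    """Accept the current step plus `window` neighbors on each side."""
    now = time.time() if now is None else now
    return any(
        hmac.compare_digest(totp(secret, now + drift * step, step).encode("ascii"),
                            submitted.encode("utf-8"))
        for drift in range(-window, window + 1)
    )


# RFC 6238 test secret; at T=59s the 6-digit code is 287082.
print(totp(b"12345678901234567890", 59))
```

A production verifier would also remember the last accepted step per user, so that a code cannot be replayed within its validity window.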

Position: TOTP is the sensible default MFA factor when passkeys aren't available — phishable but far better than SMS or no MFA.

6.4 Push notifications and number-matching

The IdP sends a push to a previously-enrolled mobile app (Okta Verify, Duo, Microsoft Authenticator). User taps "Approve" — done.

MFA fatigue: by 2021-2022 attackers learned to spam approval prompts in the middle of the night, hoping the user would tap "approve" to make it stop. Uber's 2022 breach was via MFA fatigue against a contractor.
Number-matching mitigation: the login page shows a 2-digit number; the push notification asks the user to type that number into the app, not just tap. Microsoft has enforced this for Authenticator push since 2023, and Duo offers it as Verified Push. It eliminates blind-tap fatigue attacks.
Phishability: still phishable via real-time proxy (the attacker forwards the user's typed number to the legitimate login page).

Position: push with number-matching is acceptable; push without is a known vulnerability.

6.5 Magic links via passkeys — the synthesis

The 2025-2026 best practice is to combine: passkey as primary credential, magic-link to email or SMS to a paired device as recovery only, TOTP or hardware OTP for step-up to admin actions. This builds the MFA hierarchy into the flow, not the factor — the strong factor for sensitive operations, the weak factor only for the long-tail of recovery cases.

§7. OAuth 2.0 security best practices

OAuth 2.0 is a flow framework, not a turnkey security solution. It ships with footguns; the practitioner is expected to name them on sight.

7.1 PKCE (Proof Key for Code Exchange, RFC 7636) — mandatory

PKCE answers: how do we know the client that started the authorization flow is the same client redeeming the code?

The legacy flow used client_secret. That works for confidential clients (server-side web apps whose backend can keep a secret). It fails for public clients (mobile apps, SPAs) because they can't keep a secret: any value baked into the binary or bundle can be extracted.

PKCE replaces the secret with a dynamic per-flow proof:

At /authorize:
  code_verifier  = random 43-128 char URL-safe string  (kept by client)
  code_challenge = base64url(SHA256(code_verifier))    (sent to IdP)
  → IdP stores code_challenge alongside the authorization code

At /token:
  client sends code_verifier
  IdP computes SHA256(code_verifier), compares to stored code_challenge
  → match: issue token. Mismatch: reject.

Why it works: an attacker can steal the authorization code (browser history, log files, malicious app intercepting the redirect URI on mobile), but without the original code_verifier (kept in client memory, never sent to the IdP at the first step), they can't redeem it. The verifier is bound to the flow.
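A minimal sketch of both halves of the S256 exchange, using only the standard library. The function names are hypothetical; the verifier/challenge pair in the usage note is the example from RFC 7636 Appendix B.

```python
import base64
import hashlib
import hmac
import secrets


def make_code_verifier() -> str:
    # 64 random bytes encode to 86 URL-safe characters (allowed range: 43-128).
    return secrets.token_urlsafe(64)


def s256_challenge(code_verifier: str) -> str:
    """code_challenge = base64url(SHA256(code_verifier)), without '=' padding."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def idp_verify(stored_challenge: str, presented_verifier: str) -> bool:
    """IdP side at /token: recompute the challenge and compare in constant time."""
    recomputed = s256_challenge(presented_verifier)
    return hmac.compare_digest(recomputed.encode(), stored_challenge.encode())


verifier = make_code_verifier()        # stays with the client
challenge = s256_challenge(verifier)   # goes to /authorize
print(idp_verify(challenge, verifier))
```

For the RFC's example verifier dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk, s256_challenge returns E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM. An attacker holding only the authorization code has neither value's preimage, so idp_verify fails for them.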

OAuth 2.1 makes PKCE mandatory for all clients, public and confidential. The S256 challenge method (SHA256(verifier)) is required; the older plain method is deprecated.

7.2 `state` parameter — CSRF protection

The state parameter is a random nonce the client generates at the start of the OAuth flow and verifies at the callback. Without it, an attacker can initiate an OAuth flow themselves, capture the authorization code midway, and redirect a victim to the client's callback URL with the attacker's code. The victim's browser presents the code; the client redeems it and creates a session — but the session is for the attacker's identity. The victim is now logged into the attacker's account, and any data they enter goes into the attacker's account.

Fix: generate state (random, 32+ bytes) at the start of the flow; store it in the client's session storage; on callback, compare state from the URL against the stored value. Mismatch → abort.
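The fix can be sketched as two small functions. The session here is a plain dict standing in for whatever server-side session storage the client uses, and the key name oauth_state is an assumption for illustration.

```python
import secrets


def start_flow(session: dict) -> str:
    """Generate a fresh state value and remember it for this browser session."""
    state = secrets.token_urlsafe(32)  # 32 random bytes
    session["oauth_state"] = state
    return state                       # appended to the /authorize URL


def callback_state_ok(session: dict, returned_state) -> bool:
    """Check the state from the callback URL; each stored value is single-use."""
    expected = session.pop("oauth_state", None)
    if expected is None or returned_state is None:
        return False
    return secrets.compare_digest(expected.encode(), returned_state.encode())


session = {}
state = start_flow(session)
print(callback_state_ok(session, state))
```

Popping the stored value means a second callback with the same state is rejected, so a captured callback URL cannot be replayed against the same session.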

7.3 `nonce` parameter — OIDC replay protection

For OpenID Connect, the client also passes nonce at /authorize. The IdP echoes it into the id_token claims. The client verifies id_token.nonce == stored_nonce. This binds the identity assertion to the specific login flow — preventing an attacker from replaying an old id_token against a fresh session.

7.4 Redirect URI validation — exact match

The IdP must validate that the redirect_uri in the request matches a pre-registered URI for the client. Common bugs:

Wildcards: IdP allows https://app.example.com/*. Attacker points redirect_uri at a path under that origin that they can influence (for example https://app.example.com/.well-known/oauth-callback?..., or an open redirect on the same origin) and has the authorization code delivered to themselves.
Substring match — IdP allows any URI starting with https://app.example.com. Attacker uses https://app.example.com.evil.com.
Scheme mismatch — IdP allows app.example.com/callback (no scheme); attacker uses http://... and intercepts in transit.
Open redirector on the client — client has a /redirect?to=... endpoint; attacker chains the authorization flow through it to exfiltrate the code.

Fix: exact match on the full URI string, including scheme. No wildcards. No partial matches. Open redirects on the client domain are independently exploitable.
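A short sketch contrasting the prefix-match bug with the exact-match fix. The client ID, the registry dict, and the function names are hypothetical.

```python
# Hypothetical registration table: client_id -> set of full redirect URIs.
REGISTERED_REDIRECTS = {
    "client-123": {"https://app.example.com/callback"},
}


def redirect_allowed_exact(client_id: str, redirect_uri: str) -> bool:
    """The fix: full-string equality against a pre-registered URI."""
    return redirect_uri in REGISTERED_REDIRECTS.get(client_id, set())


def redirect_allowed_prefix(allowed_prefix: str, redirect_uri: str) -> bool:
    """The bug: 'starts with the registered origin' is not an origin check."""
    return redirect_uri.startswith(allowed_prefix)


evil = "https://app.example.com.evil.com/callback"
print(redirect_allowed_prefix("https://app.example.com", evil))  # accepted by the buggy check
print(redirect_allowed_exact("client-123", evil))                # rejected by exact match
```

Exact matching also rejects a scheme downgrade such as http://app.example.com/callback, because the scheme is part of the compared string.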

7.5 Refresh token rotation

Covered in §4f. Each use of a refresh token returns a new one and invalidates the old. Detect family reuse to catch theft. OAuth 2.1 §6.1.
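As a reminder of the mechanism, here is a minimal in-memory sketch of rotation with family reuse detection. The class and field names are hypothetical, and a real store would be durable, hash tokens at rest, and expire records.

```python
import secrets


class RefreshTokenStore:
    """Each refresh token belongs to a family that starts at login."""

    def __init__(self):
        self._tokens = {}             # token -> {"family": str, "used": bool}
        self._revoked_families = set()

    def issue(self, family=None) -> str:
        family = family or secrets.token_hex(8)
        token = secrets.token_urlsafe(32)
        self._tokens[token] = {"family": family, "used": False}
        return token

    def rotate(self, token: str):
        """Return a new token, or None if the presented token is rejected."""
        record = self._tokens.get(token)
        if record is None or record["family"] in self._revoked_families:
            return None
        if record["used"]:
            # A used token came back: either the client or a thief holds a
            # stale copy. Revoke the whole family to force a fresh login.
            self._revoked_families.add(record["family"])
            return None
        record["used"] = True
        return self.issue(record["family"])


store = RefreshTokenStore()
first = store.issue()
second = store.rotate(first)
print(second is not None)
```

Replaying first after it has been rotated revokes the family, so second stops working as well; that is the theft signal.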

7.6 OAuth 2.1 — the cleanup

OAuth 2.1 (draft RFC ongoing through 2024-2026) consolidates a decade of best practices into a single normative spec:

Implicit flow — removed. The implicit flow returned the access token directly in the URL fragment. Token in URL was always a footgun (browser history, Referer, server logs). PKCE-augmented authorization code is the only flow for public clients now.
ROPC (Resource Owner Password Credentials) — removed. The flow where the client collects the user's password directly and sends it to the IdP. Defeats the entire point of OAuth (the password should never touch the client). Removed.
PKCE — required for all clients.
Bearer tokens in URL query — forbidden. Tokens go in headers only.
Redirect URIs — exact match required. No more wildcards.
Refresh tokens: for public clients they must be sender-constrained or one-time use (rotated on every refresh).

Recommendation: design new systems against OAuth 2.1 from day one, even if the spec hasn't been published as final. None of the deprecated flows are worth supporting in 2026.

§8. SAML vs OIDC tradeoffs

SAML (Security Assertion Markup Language) and OIDC (OpenID Connect) solve the same problem — federated identity, an IdP vouches for a user to a downstream Service Provider (SP). They differ in vintage, wire format, and ecosystem.

8.1 The two protocols at a glance

Dimension	SAML 2.0	OIDC 1.0
Year	2005	2014
Wire format	XML, signed with XML-DSIG	JSON / JWT, signed with JWS
Transport	Browser POST or redirect	HTTPS, REST-ish
Carries	XML `<Assertion>` with attributes	`id_token` (JWT) + UserInfo endpoint
Mobile support	Painful (XML in mobile webview)	First-class (mobile SDKs)
API/service support	Not designed for it	First-class (access tokens for APIs)
Discovery	Manual cert exchange or SAML metadata XML	`/.well-known/openid-configuration`
Library maturity	Java EE world; C# WIF; Python pysaml2	Universal — every language has it
Initiated by	SP-initiated or IdP-initiated	Almost always SP-initiated

8.2 When you must support SAML

If you sell B2B SaaS to mid-large enterprises, you must support SAML. Every Fortune 500 has an Active Directory or Okta with SAML federation. Procurement checklists say "must support SAML 2.0 SSO" — without it, you don't get past the security review. The enterprise IdP-of-record is usually Active Directory Federation Services (ADFS), Okta, Azure AD, Ping, or OneLogin, all SAML-native.

For B2C, mobile-first, or API-first products, you can skip SAML entirely and ship OIDC. The customer base never asks.

8.3 Common SAML pitfalls

SAML's threat model is the worst part. The protocol involves parsing untrusted XML, validating signatures over selected XML subtrees, and resolving attribute claims — and every step has had CVEs:

XML signature wrapping — the assertion is signed, but the signature covers only a subset of the XML tree. An attacker takes a legitimately signed assertion, wraps it in a new XML envelope where the signed subtree is referenced by ID but the visible assertion (the one the SP actually parses) is unsigned and attacker-controlled. The SP validates the signature (matches), reads the unsigned assertion (attacker's content), and accepts attacker identity. CVE-2011-1411 (SAML 2.0 multiple implementations), CVE-2018-1056, CVE-2022-22556. Microsoft Azure AD, Okta, and OneLogin have all shipped patches for variants of this bug.
Comment injection in NameID: XML canonicalization strips comments before the signature is checked, so inserting a comment into a signed assertion does not invalidate the signature. An attacker who controls the account victim@example.com.evil.com obtains a legitimately signed assertion for it, then edits the NameID to victim@example.com<!---->.evil.com. The signature still verifies, but an SP library that reads only the first text node of the element sees victim@example.com. Cross-account impersonation. Duo Security disclosed this in 2018.
XXE (XML External Entity) — older SAML parsers loaded external XML entities, enabling SSRF and file-read attacks from a malicious assertion.
Time-bound assertions — the SAML assertion has NotBefore and NotOnOrAfter, but old SP libraries didn't enforce them strictly, allowing replay of stale assertions.
AssertionConsumerService URL spoofing — analogous to OAuth's redirect_uri validation. SP must validate.

The defensive playbook: use a battle-hardened library (Shibboleth, OneLogin's php-saml, Spring SAML), keep it patched, prefer EncryptedAssertion (signed and encrypted with the SP's public key) over plain signed, validate the entire signed tree as the only source of identity claims, and never trust unsigned XML siblings.
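The comment-injection pitfall above comes down to how the SP extracts text from an element that contains a comment node. The sketch below uses Python's xml.dom.minidom purely to show the difference between the two reads; minidom is not hardened against malicious XML and is not a SAML library, and the function names are hypothetical.

```python
from xml.dom import minidom

# NameID as it might look after an attacker inserts an empty comment.
TAMPERED_NAME_ID = "<NameID>victim@example.com<!---->.evil.com</NameID>"


def name_id_first_text_node(xml: str) -> str:
    """Vulnerable read: only the text before the first child boundary."""
    element = minidom.parseString(xml).documentElement
    return element.firstChild.data


def name_id_all_text(xml: str) -> str:
    """Safer read: concatenate every text node, ignoring comment nodes."""
    element = minidom.parseString(xml).documentElement
    return "".join(
        node.data for node in element.childNodes
        if node.nodeType == node.TEXT_NODE
    )


print(name_id_first_text_node(TAMPERED_NAME_ID))
print(name_id_all_text(TAMPERED_NAME_ID))
```

The first read yields the victim's address while the second yields the address the IdP actually asserted, which is why the identity must be taken from the full canonicalized, signed content.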

8.4 The migration story

The 2020-2026 trend: enterprise IdPs (Okta, Azure AD) speak both SAML and OIDC. New integrations use OIDC. Legacy integrations use SAML. Customer requests for OIDC grow yearly. Most SaaS products support both; some identity platforms (e.g., Auth0) abstract both behind a single API. Advice: ship OIDC as the primary, SAML as the alternate, and don't write your own SAML parser; use a library.

§9. API key vs OAuth tradeoffs

API keys and OAuth scopes solve a related but distinct problem: how does a machine (or sometimes a human in a script) authenticate to your API?

9.1 API keys — simple wins, scaling losses

An API key is an opaque bearer token, typically generated at signup and pasted into config:

curl -H "Authorization: Bearer sk_live_AbCdEfGhIjKlMnOp..." \
     https://api.example.com/v1/charges

Pros: trivial to issue, trivial to use. Every dev tool supports them. Stripe, Twilio, SendGrid, Mailchimp all started with API keys. Time-to-first-API-call is one minute.
Cons at scale:
Hard to scope — typically each key has full account access. A leaked key from a build server can read all customer data.
Hard to rotate — keys often end up in config files, CI variables, customer-side third-party integrations. Rotating is a coordinated migration.
Hard to audit — without further infrastructure, server logs see "key_xxx" but not "which engineer issued this for which purpose."
The "100k key" trap — a successful product accumulates keys across years of integrations. A company we won't name discovered 800k unrevoked Stripe-style keys outstanding, with no idea which were active. Mass-rotation requires customer-side coordination, often impossible.

9.2 OAuth scopes — proper grant model

OAuth 2.0 with scopes provides:

Per-grant scoping — the OAuth client requests specific scopes (read:profile, write:posts); the user (or admin) grants them; the token is limited to those scopes. The API rejects out-of-scope calls.
Revocable per-grant — the user can revoke a specific integration without affecting others.
Audit trail — every access token is traceable back to a specific OAuth client (= integration partner) + specific user grant.
Refresh tokens — short-lived access tokens with explicit refresh rotation.

Cost: setup is more code. Each integration partner registers an OAuth client. The OAuth dance is ~5 endpoints. Customer-facing integrations need consent screens. Time-to-first-API-call is closer to a week.

9.3 GitHub fine-grained PATs — the modern hybrid

GitHub Personal Access Tokens (PATs) shipped in 2013 as classic API keys (account-scoped, no expiry, no resource limits). The 2022 "fine-grained PATs" redesign keeps the UX of an API key (you generate it in settings, paste it into config) but adds the guardrails of OAuth:

Required expiry — minimum 7 days, maximum 1 year. No "forever" tokens.
Resource scopes — per-repository (not just per-account) access. "This token can read repo X but nothing else."
Permission scopes — read/write granularity on specific resources (contents, issues, pull requests).
Org admin policy — orgs can require fine-grained PATs and forbid classic PATs.

This is the recommendation for "we shipped with API keys, now we want OAuth-grade safety without breaking the simple integration UX": short-lived, resource-scoped, expiry-required, audit-logged. Keep the paste-into-config ergonomic; remove the grenade-pin nature of forever-tokens.
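A minimal sketch of that pattern: a paste-into-config token that is hashed at rest, must expire, and is limited to named resources and permissions. This is not GitHub's implementation; the pat_ prefix, the field names, and the one-year cap are assumptions chosen to mirror the list above.

```python
import hashlib
import secrets
from dataclasses import dataclass

MAX_LIFETIME_SECONDS = 365 * 24 * 3600  # assumed policy: no token outlives a year


@dataclass(frozen=True)
class TokenRecord:
    resources: frozenset    # e.g. {"repo:X"}
    permissions: frozenset  # e.g. {"contents:read"}
    expires_at: float       # Unix timestamp


def _digest(raw_token: str) -> str:
    return hashlib.sha256(raw_token.encode()).hexdigest()


def issue_token(store: dict, resources, permissions, lifetime_s: float, now: float) -> str:
    """Create a token; only its SHA-256 digest is stored server-side."""
    if not 0 < lifetime_s <= MAX_LIFETIME_SECONDS:
        raise ValueError("expiry is required and capped")
    raw_token = "pat_" + secrets.token_urlsafe(32)
    store[_digest(raw_token)] = TokenRecord(
        frozenset(resources), frozenset(permissions), now + lifetime_s
    )
    return raw_token  # shown to the user once


def authorize(store: dict, raw_token: str, resource: str, permission: str, now: float) -> bool:
    record = store.get(_digest(raw_token))
    return (
        record is not None
        and now < record.expires_at
        and resource in record.resources
        and permission in record.permissions
    )


store = {}
token = issue_token(store, {"repo:X"}, {"contents:read"}, lifetime_s=7 * 86400, now=0)
print(authorize(store, token, "repo:X", "contents:read", now=3600))
```

The same token is refused for repo:Y, for contents:write, and for any request after the expiry, which bounds the damage from a leak without changing how the integrator uses it.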

9.4 The decision

Scenario	Pick
Server-side automation, single org, devs trust each other	API keys (with expiry + scopes)
Third-party integrations, customer-installed apps, marketplace	OAuth 2.0 with PKCE
Internal CLI / CI scripts, want simplicity	Fine-grained PAT pattern
Highly sensitive (payments, healthcare PHI), regulated	OAuth + DPoP / mTLS-bound tokens
Mixed — public API + partner integrations	Both: API keys for vanilla, OAuth for marketplace

§10. Capacity envelope

Real deployments at very different scales:

Small: single Keycloak instance, ~100k users. A SaaS startup runs Keycloak (open-source IdP) on a single 4-core VM with PostgreSQL. Workload: 100k users, 1k DAU, ~10k logins/day → ~0.1 logins/sec average. Validations: ~10k API calls/day per active user × 1k DAU = ~10M/day ≈ 120/sec. Single instance at ~10% CPU. The bottleneck appears around 100 logins/sec, where Argon2id saturates the cores. Cost: ~$50/month.

Mid: Auth0/Okta enterprise tier, ~5M users per tenant. Mid-market SaaS uses Auth0 (now part of Okta). 5M users, 500k DAU, ~250k logins/day → ~3/sec average, ~30/sec peak. ~5B API calls/year → ~150/sec average. Multi-tenant shared infrastructure; each tenant gets its own JWKS endpoint but shares the validation tier. Bottleneck: per-tenant rate limits (~1000 logins/min). Cost: $5k-50k/month depending on M2M (machine-to-machine) volume.

Large: Google Identity, ~3B users. ~150M logins/day → ~1700/sec average, ~6000/sec peak (3.5× diurnal). ~30B API calls/day across Google services → ~350k/sec average validation, ~2M/sec peak across the global fleet. Validation-to-login ratio: ~200:1 (30B validations vs 150M logins per day). That ratio is the number that defines the technology choice.

To handle 2M validations/sec:

JWT validation distributed to every service via library + JWKS cache; no central validation tier. 160 cores spread across thousands of hosts.
Public JWKS endpoint cached at edge with hour-scale TTL.
Per-user revocation propagation via internal pub/sub, <1s fanout.
User store on Bigtable (sharded by user_id hash) for cold-path lookups.
Argon2id verification on a dedicated cluster with ~5000 cores reserved.
BeyondCorp zero-trust gates internal services — every request, every hop, verifies. No VPN, no implicit trust.

Other anchors: LinkedIn (1B+ members, hundreds of millions of session validations/sec, hybrid session-cookie+JWT). Apple Sign-In (hundreds of millions of devices, ES256). AWS STS (Security Token Service, probably the highest-QPS auth surface on Earth; every AWS API call validates an STS-issued credential via custom SigV4). GitHub OAuth Apps + GitHub Apps (100M+ users, billions of API calls/day).

The validation-to-login ratio defines the technology: anything you put per-request against a central store doesn't survive scale.
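The per-second figures in this section are plain division, and it can help to redo them when plugging in your own volumes. The sketch below uses the Large-tier daily counts quoted above; the helper names are hypothetical, and the 80 µs per-validation cost is an assumption picked because it is the value at which 2M validations/sec works out to the 160 cores mentioned in the list.

```python
SECONDS_PER_DAY = 86_400


def per_second(events_per_day: float) -> float:
    """Average rate implied by a daily volume."""
    return events_per_day / SECONDS_PER_DAY


def cores_needed(requests_per_sec: float, cpu_seconds_per_request: float) -> float:
    """Aggregate CPU: one core supplies one CPU-second per second."""
    return requests_per_sec * cpu_seconds_per_request


logins_per_day = 150e6
validations_per_day = 30e9

login_rate = per_second(logins_per_day)
validation_rate = per_second(validations_per_day)
ratio = validations_per_day / logins_per_day

print(f"logins/sec (avg):      {login_rate:,.0f}")
print(f"logins/sec (3.5x):     {login_rate * 3.5:,.0f}")
print(f"validations/sec (avg): {validation_rate:,.0f}")
print(f"validations per login: {ratio:,.0f}")
print(f"cores at 2M/sec, 80us: {cores_needed(2_000_000, 80e-6):,.0f}")
```

Spread across thousands of hosts, that aggregate CPU is a rounding error per machine, which is the argument for validating locally instead of calling a central tier.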

§11. Architecture in context

Canonical pattern, not specific to one product:

                                  ┌─────────────────────┐
                                  │ Public JWKS endpoint│
                                  │ /.well-known/...    │
                                  │ (public keys, kid)  │
                                  └──────────▲──────────┘
                                             │ fetched hourly
   ┌─────────┐       ┌────────────┐   ┌──────┴──────┐       ┌─────────┐
   │ Browser │──────▶│ API Gateway│──▶│ IdP / Auth  │──HSM─▶│  KMS    │
   │ / Mobile│ HTTPS │ (Envoy/Kong│   │  Service    │ sign  │  / HSM  │
   └─────────┘       │  /custom)  │   │  (issuer)   │       │ (priv k)│
        │            └──────┬─────┘   └─────┬───────┘       └─────────┘
        │ Set-Cookie        │                │
        │ session_id        │                ▼
        │                   │         ┌──────────────┐
        │                   │         │ User store   │
        │                   │         │ (MySQL,      │
        │                   │         │  sharded by  │
        │                   │         │  user_id)    │
        │                   │         │  - argon2id  │
        │                   │         │  - mfa_secret│
        │                   │         └──────────────┘
        │                   │
        │                   │ JWT in Authorization header
        │                   ▼
        │            ┌──────────────────┐    miss     ┌─────────────┐
        │            │ Token Validator  │────────────▶│ Session     │
        │            │ (gateway lib,    │             │ Cache       │
        │            │  sidecar, or     │◀────────────│ (Redis,     │
        │            │  service lib)    │   hit       │  sharded by │
        │            │  - verify sig    │   <1ms      │  user_id)   │
        │            │  - check exp/aud │             └─────────────┘
        │            │  - check denylist│◀────┐
        │            └────────┬─────────┘     │ subscribes
        │                     │               │
        │                     ▼               │ ┌──────────────────┐
        │            ┌─────────────────┐      └─┤ Revocation Bus   │
        │            │ Business        │        │ (Kafka)          │
        │            │ services        │        │ - revoked-jti    │
        │            │ (re-validate,   │        │ - revoked-user   │
        │            │  defense-in-    │        │ - rotated-key    │
        │            │  depth)         │        └──────▲───────────┘
        │            └─────────────────┘               │
        │ logout / pw change / breach reset            │ produces
        └──────────────────────────────────────────────┘

Service-to-service (east-west):
   [Svc A]──mTLS handshake──▶[Svc B]
       │      service identity from SPIFFE ID in cert
       └─JWT in metadata─▶(user identity, signed by IdP)

   ┌──────────────────────────┐
   │ Workload identity (SPIRE,│  issues short-lived (1h) certs
   │  cert-manager, Istio CA) │  to each service automatically
   └──────────────────────────┘

Annotations:

Sharding: user store by hash(user_id). Session cache by user_id. Revocation Kafka topic partitioned by user_id. JWKS is a single global CDN-fronted endpoint (~10KB, trivially cacheable).
Crucial topology choice: JWT validation happens at every hop, not centrally. Validators are libraries embedded in the gateway, sidecars, and services. They share only JWKS public keys and the Kafka revocation feed. No service-to-IdP RPC on the hot path. That's how 2M QPS validation works without overwhelming the IdP.
The IdP is on the cold path: logins, refreshes, MFA. ~6k/sec at peak — well within a single regional cluster's capacity.
East-west service identity: mTLS with SPIFFE-issued certs gives every workload a cryptographic identity at the connection layer. User identity layered on top as a JWT in request metadata.

§12. Hard problems inherent to auth tech

12.1 JWT revocation (the central problem)

Naive: "JWTs are stateless. Just don't revoke. Users wait until exp."

Failure: User's laptop stolen 9:00am, active JWT valid till 9:15am. Thief has 15 min of full access. For a banking app: catastrophic. For corporate SSO: thief drains mailboxes. Compliance regimes (HIPAA, PCI DSS, SOX, GDPR) expect access to be cut off promptly and demonstrably.

Fix: layered (§4e) — short access TTL caps worst case, jti denylist on Kafka closes the per-token gap (<1s globally), per-user min_iat for "logout everywhere," refresh tokens server-side.
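The per-request part of that layering is two cheap in-memory checks that run after signature and exp verification. In this sketch the denylist set and the min_iat map stand in for local caches fed by the revocation bus; the function and variable names are hypothetical.

```python
def is_revoked(claims: dict, revoked_jti: set, min_iat_by_user: dict) -> bool:
    """Return True if the (already signature-verified) token must be rejected."""
    # Per-token revocation: the token's ID is on the denylist.
    if claims["jti"] in revoked_jti:
        return True
    # Per-user "logout everywhere": tokens issued before the cutoff are dead.
    return claims["iat"] < min_iat_by_user.get(claims["sub"], 0)


revoked_jti = {"tok-17"}
min_iat_by_user = {"alice": 1_700_000_500}

print(is_revoked({"jti": "tok-42", "sub": "alice", "iat": 1_700_000_000},
                 revoked_jti, min_iat_by_user))
```

Denylist entries only need to live as long as the longest access-token TTL, since an expired token is rejected by the exp check anyway.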

Multi-domain: healthcare (revoke fired clinician in seconds, HIPAA), consumer social ("log me out everywhere" from settings), B2B SaaS (deprovision departing employee). Same problem class, different urgency tier.

12.2 Session fixation

Naive: "Generate session ID on visit; keep the same one after login."

Failure: Attacker visits app.example.com, gets session_id=ABC. Phishes victim with app.example.com/?session=ABC. Victim logs in. Now session_id=ABC is authenticated. Attacker uses ABC. Attacker is the victim.

Fix: always rotate session ID at login. Old ID dies; new random ID issued (Set-Cookie). Never accept session_id from URL parameters — only from HTTP-only cookies set by the server.
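What a framework's rotate-on-login does can be sketched in a few lines. The sessions dict stands in for the server-side session store, and the function name is hypothetical.

```python
import secrets


def login(sessions: dict, presented_session_id, user_id: str) -> str:
    """Authenticate a user: discard the pre-login ID and mint a fresh one."""
    sessions.pop(presented_session_id, None)  # the old ID dies, whoever knows it
    new_session_id = secrets.token_urlsafe(32)
    sessions[new_session_id] = {"user_id": user_id}
    return new_session_id                     # sent back via Set-Cookie


# The attacker planted ABC before the victim logged in.
sessions = {"ABC": {"user_id": None}}
victim_cookie = login(sessions, "ABC", "victim")
print("ABC" in sessions)
```

After login the planted ID no longer maps to anything, so the attacker's copy of ABC is worthless.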

Multi-domain: appears in older WordPress, some PHP apps, early Rails before reset_session became the default. One-liner in any modern framework, but it should be named on sight.

12.3 Replay / token theft

Naive: "Token is signed; just check the signature."

Failure: Attacker on coffee-shop wifi captures Authorization: Bearer <jwt> from victim's traffic (misconfigured TLS, MITM). Minutes later, the JWT is still within its 15-min TTL. Attacker replays the same request from their own laptop. Signature valid, token valid, so the server allows it.

Fixes (defense in depth): (1) TLS everywhere (no HTTP in transit). (2) HTTP-only, Secure, SameSite=Lax cookies. (3) DPoP (Demonstrating Proof of Possession, RFC 9449): the client generates a key pair, and each request carries a DPoP header holding a proof signed over the URL, method, timestamp, and access token hash; the server validates the proof and rejects on timestamp skew. (4) mTLS-bound tokens (RFC 8705): the token is bound to the client's TLS certificate thumbprint and is useless over a connection that presents a different certificate.
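The mTLS-bound variant is the easiest of these to sketch: RFC 8705 puts the base64url SHA-256 thumbprint of the client certificate in the token's cnf claim under x5t#S256, and the resource server compares it with the certificate seen on the connection. The byte strings below are placeholders, not real DER certificates, and the function names are hypothetical.

```python
import base64
import hashlib
import hmac


def cert_thumbprint(cert_der: bytes) -> str:
    """base64url(SHA-256(DER certificate)), without '=' padding."""
    digest = hashlib.sha256(cert_der).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def token_bound_to_cert(claims: dict, presented_cert_der: bytes) -> bool:
    """True only if the token's cnf thumbprint matches the TLS client cert."""
    expected = claims.get("cnf", {}).get("x5t#S256")
    if expected is None:
        return False  # an unbound token is not accepted on this path
    actual = cert_thumbprint(presented_cert_der)
    return hmac.compare_digest(expected.encode(), actual.encode())


client_cert = b"placeholder-client-cert-der"
attacker_cert = b"placeholder-attacker-cert-der"
claims = {"sub": "alice", "cnf": {"x5t#S256": cert_thumbprint(client_cert)}}

print(token_bound_to_cert(claims, client_cert))
print(token_bound_to_cert(claims, attacker_cert))
```

A token replayed from the attacker's laptop arrives over a connection with a different certificate (or none), so the thumbprint comparison fails even though the signature is valid.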

Multi-domain: payment platforms (Stripe, PayPal) use sender-constrained tokens for sensitive ops. Mobile banking pins certs and rotates aggressively. Consumer social accepts the weaker TLS + short-TTL story because UX cost outweighs the marginal threat.

12.4 OAuth confused deputy

Naive: "We issue a token to the user. They send it to whatever API. APIs validate the signature."

Failure: API A and API B both validate tokens from the same IdP. A token issued for A is sent to B; B trusts the signature and accepts. Now an app authorized only for "read calendar" (A) can read banking data (B). The deputy (B) is confused about who is authoritative.

Fix: aud claim mandatory and verified. Every token carries aud naming the API it's for. API B rejects any token whose aud doesn't include itself. IdP mints different tokens for different audiences. RFC 8707 standardizes the resource parameter for requesting audience-scoped tokens.
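The audience check itself is small; the important detail is that aud may be a single string or a list, and that a missing claim must fail closed. The audience URLs and function name below are hypothetical.

```python
def audience_ok(claims: dict, my_audience: str) -> bool:
    """Accept the token only if this API is named in its aud claim."""
    aud = claims.get("aud")
    if isinstance(aud, str):
        return aud == my_audience
    if isinstance(aud, list):
        return my_audience in aud
    return False  # missing or malformed aud: reject


calendar_token = {"sub": "alice", "aud": "https://calendar.example.com"}
print(audience_ok(calendar_token, "https://banking.example.com"))
```

API B running this check rejects the calendar token even though the IdP signature on it is perfectly valid.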

Multi-domain: matters most in enterprise SSO where one IdP fronts dozens of internal apps — an HR Self-Service token replayed against the payroll admin API. Real-world: multiple Salesforce CVEs, BetterHelp 2023.

12.5 Signing key compromise

Naive: "Key in HSM. Can't be stolen."

Failure: insider attack, HSM firmware bug, supply-chain compromise (real example: Infineon TPM ROCA bug, CVE-2017-15361, weakened millions of RSA keys). Or simpler: a poorly-written CI/CD pipeline briefly extracts the key. Once the private key is out, attacker mints arbitrary tokens for arbitrary users. Total compromise.

Fix: key rotation as routine, not emergency. JWKS advertises multiple keys (different kids); verifiers cache all of them. Sign with the new key; old key valid for the longest token TTL + buffer. Rotate quarterly under normal conditions. Have a "panic rotation" runbook — drop old key from JWKS within minutes; tokens signed with it stop verifying. Use different keys per audience so one compromise doesn't burn the whole fabric. HSM + audit logs on every signing; anomaly detection ("why did 10M tokens get minted in 30 seconds?").
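On the verifier side, rotation is a lookup by kid against a JWKS that holds more than one key. The sketch below shows the overlap period and a panic drop; the kid values and helper names are hypothetical and the key material is elided.

```python
def find_key(jwks: dict, kid: str):
    """Pick the verification key whose kid matches the token header."""
    return next((key for key in jwks["keys"] if key.get("kid") == kid), None)


def drop_key(jwks: dict, kid: str) -> dict:
    """Panic rotation: publish a JWKS without the compromised key."""
    return {"keys": [key for key in jwks["keys"] if key.get("kid") != kid]}


# During routine rotation both keys are advertised.
jwks = {"keys": [
    {"kid": "2025-q4", "kty": "RSA"},  # old key, still valid for in-flight tokens
    {"kid": "2026-q1", "kty": "RSA"},  # new key, used for signing
]}

print(find_key(jwks, "2025-q4") is not None)
jwks = drop_key(jwks, "2025-q4")
print(find_key(jwks, "2025-q4") is not None)
```

Once verifiers refresh their cached JWKS, tokens carrying the dropped kid have no key to verify against and are rejected; how fast that happens is bounded by the JWKS cache TTL unless a rotated-key event forces a refresh.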

Multi-domain: AWS, Microsoft, and every public CA rotate signing material on tight schedules per CA/Browser Forum baseline. Universal.

12.6 Distributed validation latency

Naive: "Every API call hits Redis to look up session + user DB. ~5ms."

Failure: At 2M QPS, even 1ms Redis latency is fine — but cross-region (US west user, EU service) eats 100ms RTT per request. Worse: Redis is a single point of failure for every request.

Fix: stateless validation at the request edge. JWT validated locally with cached JWKS keys (~100µs CPU). State lookups only when revocation evidence is needed (rare). Revocation denylist is eventually consistent via Kafka with ~1s convergence — for the 1s window after revocation, some verifiers might still accept the token. Explicit trade-off: 1s of stale validation in exchange for not putting Redis on every hot path.

For applications where 1s is too slow (banking, admin operations), wrap critical operations with a "fresh check" — Redis call to confirm session is still active. Explicit cost of high-security flows, accepted only on cold paths.

Multi-domain: feed API tolerates 1s staleness; wire-transfer confirmation does a fresh Redis check; microservice mesh runs entirely stateless; internal admin console double-checks.

12.7 Clock skew killing exp

Naive: "Check exp > now()."

Failure: server clocks drift. Verifier B is 90s ahead of the IdP, so every token looks expired to B 90s early. Verifier C is 90s behind, so a freshly minted token has iat/nbf in C's future and is rejected as not yet valid. Spurious rejection cascade on a fraction of the fleet.

Fix: clock skew tolerance + NTP discipline. JWT libraries accept a leeway parameter (typically 60-120s) that adds to exp and subtracts from nbf/iat. NTP (Network Time Protocol) on every host with <1s drift policy. Monitoring alerts on host drift >5s. Trap: leeway too high (say, 600s) defeats exp for short-lived tokens. 60-120s is the sweet spot.
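A sketch of what the leeway parameter does inside a validator, with the time passed in explicitly so the effect of drift is easy to see. The function name is hypothetical; real JWT libraries expose the same idea as a leeway or clock-tolerance option.

```python
def time_claims_ok(claims: dict, now: float, leeway: float = 60) -> bool:
    """Check exp and nbf with a symmetric tolerance for clock drift."""
    if now >= claims["exp"] + leeway:
        return False  # expired, even after allowing for drift
    if "nbf" in claims and now < claims["nbf"] - leeway:
        return False  # not yet valid, even after allowing for drift
    return True


issued_at = 1_700_000_000
claims = {"nbf": issued_at, "exp": issued_at + 900}

# A verifier whose clock runs 30s behind the IdP sees the token "too early".
print(time_claims_ok(claims, now=issued_at - 30, leeway=0))
print(time_claims_ok(claims, now=issued_at - 30, leeway=60))
```

With leeway=0 the slow verifier rejects a brand-new token; with 60s it accepts, at the cost of honoring each token for up to 60s past its nominal exp.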

§13. Authorization frameworks deep dive

AuthN proves who; AuthZ decides what. Three families dominate at scale, each with its own data model, evaluation cost, and operational characteristics.

13.1 RBAC (Role-Based Access Control)

The simplest model: users have roles; roles have permissions; permissions gate actions. Standardized from the NIST RBAC model as ANSI INCITS 359-2004, and the bread and butter of enterprise applications since the 1990s.

user:alice → role:editor
role:editor → permission:document.write

The check is a join: "does Alice have any role that has permission document.write?" — usually a SQL query against three tables (users, user_roles, role_permissions) or a denormalized lookup in Redis.
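The same join, written against two in-memory mappings instead of SQL tables. The role and permission names come from the example above; the dict layout and function name are assumptions for illustration.

```python
# user -> roles (the user_roles table)
USER_ROLES = {
    "alice": {"editor"},
    "bob": {"viewer"},
}

# role -> permissions (the role_permissions table)
ROLE_PERMISSIONS = {
    "editor": {"document.read", "document.write"},
    "viewer": {"document.read"},
}


def has_permission(user: str, permission: str) -> bool:
    """Does the user hold any role that grants the permission?"""
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )


print(has_permission("alice", "document.write"))
print(has_permission("bob", "document.write"))
```

Note what the function cannot express: it has no document argument, which is the resource-level gap discussed in the cons below.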

Pros: easy to model, easy to explain to compliance auditors ("show me everyone with admin role"), supported by every IdP and access management platform out of the box. Hierarchical roles (manager ⊇ employee) compose nicely.
Cons:
Role explosion — for any reasonably complex business, "Editor" isn't enough; you end up with "Editor for Documents in Group A," "Editor in Region EU," "Editor for Customer X's tenant," etc. A real-world enterprise can accumulate 50,000+ roles.
No resource-level granularity — Alice is editor. On which document? RBAC alone can't answer; you bolt on object-level checks separately.
Inheritance gets weird — if manager inherits employee, what happens when employee has access to "all employees in my team" — does manager get a recursive expansion? RBAC isn't designed for it.

When to pick: simple internal tools, small set of roles (<20), no per-resource permissions. Google Workspace admin console, Atlassian admin tier.

13.2 ABAC (Attribute-Based Access Control)

Attributes of the subject, resource, action, and environment combine in a policy expression. Policies are code — Open Policy Agent (OPA) with Rego, AWS IAM policies in JSON, XACML in XML.

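# OPA Rego example (Rego v1 syntax: rule bodies are introduced with `if`)
package authz

import rego.v1

# Deny unless every condition in the rule body holds.
default allow := false

allow if {
  input.user.department == input.resource.department            # same department
  input.user.clearance >= input.resource.classification         # sufficient clearance
  input.action == "read"
  time.now_ns() < time.parse_rfc3339_ns(input.resource.expiry)  # resource not yet expired
}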

Pros:
Resource-level, condition-aware — "Alice can read this doc if she's in the same department AND her clearance ≥ doc classification AND the current time is before doc expiry."
Externalized policy — security team writes Rego; engineering team writes code; policies can be reviewed, versioned, and deployed independently.
Real-world fits — government clearance hierarchies, healthcare PHI (Protected Health Information) access based on care relationship + role + time.
Cons:
Cost of evaluation — pulling subject/resource/environment attributes for every check requires fetches. Naive ABAC at 10k QPS chokes on the attribute lookup.
Hard to audit — "show me everyone who can read doc X" requires running the policy against every user. Not a SELECT statement.
Policy bugs — Rego is a real language; bugs can over-grant or under-grant silently. Testing matters.

When to pick: complex conditional policies (clearance, time-of-day, geofencing, regulatory), policy ownership by a security team, willingness to invest in OPA infrastructure.

13.3 ReBAC (Relationship-Based Access Control)

Permissions flow through relationships between users and resources. Google Zanzibar formalized this for Drive, Calendar, YouTube: a user has access to a resource if there exists a chain of relationships from user to resource that the policy graph approves.

alice → member of team:platform
team:platform → editor of folder:project-x
folder:project-x → parent of doc:design.pdf

Check: "can Alice edit doc:design.pdf?" Walk: doc inherits from parent folder; folder has team:platform as editor; team:platform has Alice as member. Answer: yes.

Pros:
Natural model for "shared documents" — Google Drive's "anyone with the link can comment, plus these specific people can edit, plus the folder owner inherits" is exactly ReBAC.
Recursive evaluation — a single check naturally traverses arbitrary depth of inheritance.
Resource-instance granularity — every object's permissions are computed from the same engine.
Cons:
Operational complexity — running Zanzibar/SpiceDB is a real infrastructure investment.
Latency budget: recursive walks can amplify. The Zanzibar paper reports 95th-percentile latency under 10ms and availability above 99.999%, but achieving that requires planet-scale infrastructure with consistent caching.
Audit — "show me everyone who can edit doc X" requires expanding the relationship graph from the doc backward to users (the "expand" API).

When to pick: any product where users share resources with other users (collaboration, social, B2B SaaS with team hierarchies), and the sharing graph is non-trivial. Drive, Slack, Notion, Figma, Asana — all reach for ReBAC.

13.4 The recursive evaluation example

The canonical illustration: "Alice is editor on doc X if Alice is in team Y and team Y has editor on folder containing doc X."

1. Check: can alice EDIT doc:X?
2. doc:X has parent folder:F. Recurse: does anyone with editor on folder:F
   give editor on doc:X? YES (folders propagate to children by default).
3. folder:F has explicit editor: team:Y. Does alice belong to team:Y?
4. team:Y has member: alice. YES.
5. Return: allow.

This 4-step recursion happens on every request. The naive implementation pulls 4 rows from the database — 4 round-trips, 5-20ms. The Zanzibar implementation walks the same path against a sharded in-memory graph cache, ~1ms total.
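The walk above can be sketched as a recursive function over a handful of relation tuples. This is a toy, not Zanzibar's algorithm: the tuple layout, the hard-coded rule that a relation on a parent propagates to its children, and the depth cap are all assumptions made for illustration.

```python
# (object, relation, subject). A subject is a user, an object, or a
# userset written as "object#relation".
TUPLES = {
    ("doc:X", "parent", "folder:F"),
    ("folder:F", "editor", "team:Y#member"),
    ("team:Y", "member", "user:alice"),
}

MAX_DEPTH = 10  # guard against cycles in the relationship graph


def check(obj: str, relation: str, user: str, depth: int = 0) -> bool:
    """Does `user` hold `relation` on `obj`, directly or through the graph?"""
    if depth > MAX_DEPTH:
        return False
    for tuple_obj, tuple_rel, subject in TUPLES:
        if tuple_obj != obj:
            continue
        if tuple_rel == relation:
            if subject == user:
                return True  # direct grant
            if "#" in subject:
                group, group_rel = subject.split("#", 1)
                if check(group, group_rel, user, depth + 1):
                    return True  # grant via userset membership
        elif tuple_rel == "parent" and relation != "parent":
            if check(subject, relation, user, depth + 1):
                return True  # inherited from the parent folder
    return False


print(check("doc:X", "editor", "user:alice"))
print(check("doc:X", "editor", "user:bob"))
```

Each recursive call corresponds to one numbered step in the walk; a naive backend would turn each call into a database round-trip, which is where the 5-20ms figure comes from.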

13.5 Picking the right model

Need	Pick
<20 roles, no per-resource permissions	RBAC
Conditional policies (time, location, clearance)	ABAC
Users share resources with users (any collaboration product)	ReBAC
Hybrid: roles for coarse access + resource-level for fine-grained	RBAC + ReBAC (most real products)
Regulated, must be auditable as policy code	ABAC with OPA

Many real systems combine: RBAC for the macro (admin / editor / viewer organization-wide), ReBAC for the micro (this specific doc is shared with this specific user), and ABAC layers for compliance conditions (no PHI access outside business hours from non-corp networks).

§14. Zanzibar / SpiceDB architecture

Google Zanzibar (2019 paper "Zanzibar: Google's Consistent, Global Authorization System") is the reference for planetary-scale ReBAC. Anyone working on collaboration/sharing products is expected to know it.

14.1 The data model

Zanzibar represents permissions as relation tuples:

<object>#<relation>@<user>

Examples:
doc:readme#owner@user:alice
doc:readme#viewer@group:engineering#member
folder:project-x#editor@team:platform
team:platform#member@user:bob

Each tuple says: <user> has <relation> on <object>
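A minimal parser for the `<object>#<relation>@<user>` text form shown in the examples, assuming the user part may itself carry a `#relation` suffix (a userset such as `group:engineering#member`). The `RelationTuple` type and `parse_tuple` function are hypothetical names, not part of any Zanzibar or SpiceDB API.

```python
from typing import NamedTuple, Optional


class RelationTuple(NamedTuple):
    obj: str                          # e.g. "doc:readme"
    relation: str                     # e.g. "viewer"
    subject: str                      # e.g. "user:alice" or "group:engineering"
    subject_relation: Optional[str]   # e.g. "member" for a userset, else None


def parse_tuple(text: str) -> RelationTuple:
    """Parse '<object>#<relation>@<user>' into its parts."""
    left, at, right = text.partition("@")
    if not at:
        raise ValueError(f"missing '@' in tuple: {text!r}")
    obj, hash_sep, relation = left.partition("#")
    if not hash_sep or not obj or not relation:
        raise ValueError(f"expected '<object>#<relation>' before '@': {text!r}")
    subject, _, subject_relation = right.partition("#")
    if not subject:
        raise ValueError(f"missing user after '@': {text!r}")
    return RelationTuple(obj, relation, subject, subject_relation or None)
```

The second example tuple parses with `subject="group:engineering"` and `subject_relation="member"`, which reads as "every member of group:engineering is a viewer of doc:readme".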

Tool	Latency to take effect	Per-request cost	Capability
Short TTL alone	Up to TTL (15 min)	None (already verifying JWT)	Cheapest, weakest
JTI denylist	~1s (Kafka fanout)	~50ns hashset	Per-token revocation
`min_iat`	~100ms (Redis write + cache TTL)	~50ns cached / ~1ms miss	User-level "logout everywhere"
Session-store check	Immediate	~1ms Redis	Surface-specific guarantee
Key rotation	Immediate, but global blast radius	None per-request	Nuclear
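The first three rows of the table above can be combined into one per-request check, sketched below. It assumes the JWT signature has already been verified and the claims decoded; the function name, the in-process `jti_denylist` set, and the `min_iat_by_user` map are hypothetical stand-ins for whatever the denylist fanout and Redis cache actually populate.

```python
from typing import Dict, Set


def token_still_valid(
    claims: dict,
    now: int,
    jti_denylist: Set[str],
    min_iat_by_user: Dict[str, int],
) -> bool:
    """Apply short-TTL, JTI-denylist, and min_iat checks to verified claims."""
    # Short TTL: an expired token is rejected regardless of anything else.
    if claims["exp"] <= now:
        return False
    # JTI denylist: per-token revocation via a local hash-set lookup.
    if claims.get("jti") in jti_denylist:
        return False
    # min_iat: "logout everywhere" rejects tokens issued before the user's floor.
    floor = min_iat_by_user.get(claims["sub"])
    if floor is not None and claims["iat"] < floor:
        return False
    return True
```

The checks are ordered cheapest first, so a token that has simply expired never touches the revocation data.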

Scale	What changes
1M	Single auth service, single Postgres, Redis for sessions. Sessions fine; JWT optional.
10M	Shard MySQL by user_id (8 shards). Redis cluster. RS256 JWT for multi-service validation.
100M	64 shards. Multi-region IdP. CDN JWKS. Kafka for revocation propagation. MFA as a service.
1B	256 shards. Per-region writes with async cross-region replication. Federated identity for B2B. SCIM provisioning. SPIFFE for service-to-service.
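The table says to shard by user_id but not how the mapping works. The sketch below assumes a stable hash taken modulo the shard count, which is one common choice; the `shard_for` name and the use of SHA-256 are illustrative assumptions.

```python
import hashlib


def shard_for(user_id: str, num_shards: int) -> int:
    """Map a user_id to a shard index using a hash that is stable across processes."""
    # Python's built-in hash() is salted per process for strings, so it is
    # unsuitable for routing; a cryptographic digest gives a stable value.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

With plain modulo routing, growing from 8 to 64 shards keeps `shard_for(u, 64) % 8 == shard_for(u, 8)` for every user, so each old shard splits into exactly 8 new ones and no rows move between unrelated shards.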

Dimension	JWT (signed)	Session cookie (server state)	Opaque + cache
Per-request latency	~100µs local crypto	~1ms Redis	~1ms cache
Revocation latency	T + 1s (denylist fanout)	Immediate	Immediate
Cross-service portability	Trivial (just verify)	Hard (sticky to domain)	Cache replication needed
Wire size	~1KB	32 B	32 B
Update claims mid-session	Hard (must refresh)	Trivial	Trivial
Compromise blast radius	Large (valid till exp)	Bounded	Bounded
Best for	Microservices, mobile, federated	One-domain browser apps	High-revocation-freq APIs

Dimension	Plain mTLS (X.509)	SPIFFE workload identity	Bearer JWT
Identity layer	TCP (TLS handshake)	TCP + structured ID (URI form)	Application (HTTP header)
Revocation	CRL/OCSP — clunky	SVID auto-expires hourly	Denylist + short TTL
Automation	Manual cert rotation pain	Fully automated (SPIRE agent per node)	IdP-issued
Cross-mesh	Cert chain trust	SPIFFE federation	OAuth federation
Best for	Legacy with manual PKI	Modern microservice mesh	When user identity must propagate

System	Pattern	Scale
Google Identity	OIDC (Google co-authored OAuth 2.0 + OIDC). RS256 JWTs. JWKS at `googleapis.com/oauth2/v3/certs`. BeyondCorp zero-trust internally.	3B+ accounts, ~150M logins/day (~1,700/sec), ~2M+ validations/sec across the global fleet (roughly 1,000:1 validations to logins at these figures).
Meta / Facebook Login	OAuth 2.0; custom signed tokens. Internal: TAO graph for authZ relationships.	3B+ DAUs, billions of logins/day.
LinkedIn Auth	Hybrid session cookie + JWT. RBAC + ReBAC for content access (Espresso permission tables, Pegasus policies). SAML for workforce SSO; OIDC for consumer.	1B+ members, hundreds of millions of session validations/sec across the mesh.
Okta	SaaS IdP, multi-tenant per-org keys. SAML + OIDC + SCIM. Per-tenant rate limits.	~18k enterprise tenants, hundreds of millions of monthly active workforce identities, ~50B authN events/year (~1500/sec average).
Auth0 (now Okta)	SaaS IdP. OIDC-first. Rules engine for policy.	~10k tenants pre-acquisition.
GitHub OAuth Apps + Apps	PATs (opaque), OAuth Apps (user-scoped), GitHub Apps (installation-scoped, JWT-authenticated, 1h installation tokens).	100M+ users, billions of API calls/day.
Apple Sign-In	OIDC over ES256. Per-app pseudonymous identifier (relay email). Built-in PKCE.	Hundreds of millions of devices.
AWS IAM + STS	Custom SigV4 (HMAC-based, not JWT). STS issues short-lived (15min-12h) credentials for assumed roles. IRSA for K8s workloads.	Probably the highest-QPS auth surface on Earth — every AWS API call.
Google Zanzibar	Reference for ReBAC at planetary scale. ~2 trillion ACL tuples. Powers Drive, YouTube, Calendar.	More than 10 million client QPS, p95 <10ms globally. Inspired SpiceDB, OpenFGA.
Cloudflare Access	Zero-trust gateway. JWT-bound to device certs + IdPs. SaaS app reverse-proxy with auth at the edge.	Hundreds of thousands of customer tenants.
Stripe API auth	Opaque API keys (server-side cache lookup); webhook signing with HMAC-SHA256 + timestamp; restricted keys for fine-grained scoping.	Hundreds of millions of API calls/day.
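The "HMAC-SHA256 + timestamp" webhook pattern in the last row can be sketched generically as follows. This is not Stripe's exact wire format: the `timestamp.payload` message layout, the 300-second tolerance, and the function names are assumptions chosen to show why the timestamp is signed together with the body.

```python
import hashlib
import hmac

TOLERANCE_SECONDS = 300  # assumed replay window


def sign(secret: bytes, timestamp: int, payload: bytes) -> str:
    """Sign the timestamp together with the payload so neither can be swapped."""
    message = str(timestamp).encode("ascii") + b"." + payload
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, payload: bytes, signature: str, now: int) -> bool:
    """Reject stale timestamps first, then compare signatures in constant time."""
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False
    expected = sign(secret, timestamp, payload)
    return hmac.compare_digest(expected, signature)
```

Because the timestamp is inside the signed message, an attacker who captures a valid webhook cannot replay it later with a fresh timestamp, and `hmac.compare_digest` avoids leaking how many leading characters of a guessed signature matched.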