Container Orchestration & Deployment

A technology reference on container orchestrators and deployment systems — what they are, how Kubernetes works at the byte level, where the design space splits across Kubernetes, Nomad, ECS (Elastic Container Service), Docker Swarm and Mesos, and how deployment patterns (blue-green, canary, rolling, dark launch, shadow) layer on top. Use cases (a SaaS web app, a stateful Postgres cluster, ML training pipelines, batch ETL, edge IoT) appear throughout as illustrations of the same class of technology bent to fit different workloads.


§1. What Container Orchestration Is

A container orchestrator is a runtime that schedules containerized workloads onto a cluster of machines, manages their lifecycle, exposes them as services, and handles failures — all driven by a declarative description of desired state. You hand the system a manifest that says "I want three replicas of my web app, each with 500m CPU and 512Mi RAM, exposed on port 8080, on nodes labeled 'web-tier'," and the orchestrator's job is to make reality match that description and keep it matching as nodes die, traffic spikes, and new versions deploy.

The defining property is desired-state convergence. You don't issue imperative commands ("start container X on node Y"). You declare what should be true, and a control loop continuously observes the current state, compares it to the desired state, and takes actions to close the gap. This is what makes the system resilient: on every pass, the same loop that originally placed your container asks "are three replicas still running? are they on healthy nodes? do they still pass health checks?" If the answer is no, it does what it takes to make the answer yes.
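
The convergence idea can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical in-memory "cluster" where each replica is a name mapped to a health flag; `reconcile_once` and the `web-N` names are invented for the example and are not Kubernetes APIs.

```python
import itertools

_ids = itertools.count(1)  # hypothetical source of replica names


def reconcile_once(desired: int, replicas: dict) -> list:
    """Compare observed replicas to the desired count; return the actions taken.

    `replicas` maps a replica name to True (healthy) or False (failed).
    """
    actions = []
    # Remove replicas that are observed as failed.
    for name in [n for n, healthy in replicas.items() if not healthy]:
        del replicas[name]
        actions.append(f"delete {name}")
    # Close the gap in either direction.
    while len(replicas) < desired:
        name = f"web-{next(_ids)}"
        replicas[name] = True
        actions.append(f"create {name}")
    while len(replicas) > desired:
        name = sorted(replicas)[-1]
        del replicas[name]
        actions.append(f"delete {name}")
    return actions


cluster = {}
print(reconcile_once(3, cluster))   # first pass creates three replicas
cluster["web-2"] = False            # simulate a crash
print(reconcile_once(3, cluster))   # the same loop replaces the failed one
print(reconcile_once(3, cluster))   # nothing to do: already converged
```

The caller never says which replica to start or where; it only restates the target, and the function works out the difference each time.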

Container orchestration sits at one corner of a design space:

  • Container orchestration (Kubernetes, Nomad, ECS, Docker Swarm) — schedules OCI (Open Container Initiative) containers onto a cluster, exposes them as services, manages lifecycle. Pods or tasks are ephemeral; the orchestrator handles replacement. Modern default for stateless and most stateful workloads.
  • VM (Virtual Machine) orchestration (VMware vSphere, OpenStack, AWS EC2 Auto Scaling Groups) — heavier units (VMs boot in 30–120 seconds vs containers in 1–5 seconds), stronger isolation (hypervisor), denser per-VM workloads (multiple processes). Still used for legacy apps, kernel-tied workloads, or strict isolation requirements.
  • Serverless platforms (AWS Lambda, Google Cloud Run, Azure Functions, Fargate) — abstract the cluster away entirely. You hand the platform a function or container, it runs it on demand, scales to zero, charges per invocation. No node management; constrained execution environment (for example, Lambda's 15-minute timeout, an ephemeral filesystem, and limited memory).
  • Bare-metal provisioning (Foreman, MAAS, Tinkerbell) — installs OSes and configurations onto physical machines. The base layer beneath VMs and containers. Slow turnover (hours to days), no native scheduling.
  • Configuration management (Ansible, Puppet, Chef, SaltStack) — applies configuration to long-lived hosts via SSH or agents. Predates containers; orthogonal but often combined for the host layer beneath the container layer.

This doc focuses on container orchestration, primarily Kubernetes, since that is where the large majority of new deployments happen and where the byte-level mechanics matter most. The other variants are compared along the way.

What container orchestration is NOT good for:

  • Persistent state guarantees. Pods are ephemeral. The orchestrator will reschedule your database pod to another node without warning if it deems necessary. Persistence is a separate concern handled by the storage layer (PVs — Persistent Volumes, PVCs — Persistent Volume Claims, CSI — Container Storage Interface drivers) but the orchestrator itself promises nothing about your data.
  • Cross-cluster atomicity. A Kubernetes deployment to cluster A and a deployment to cluster B are independent. There is no built-in distributed transaction across clusters. Multi-cluster systems must layer that on top.
  • Business logic correctness. The orchestrator runs your code; it does not validate that your code is right. A pod with a bug will be cheerfully kept alive forever as long as its health check returns 200 OK.
  • Tiny workloads. Running a 50-pod cluster requires ~3 control-plane nodes plus storage plus monitoring — overhead that might cost more than the workload itself. Below a certain threshold (~100 pods, ~5 services), a single VM with systemd is genuinely simpler.
  • Hard real-time. Pod scheduling, CNI (Container Network Interface) plumbing, and kubelet bookkeeping introduce variable startup latency. If you need a workload running within 50 ms of a trigger, use a pre-warmed serverless platform or pin a long-running pod.
  • GPU-dense legacy training that wants direct hardware control. Doable in Kubernetes (NVIDIA's device plugin) but legacy tools (Slurm) still dominate in scientific HPC (High Performance Computing) and parts of AI training.

Mental model: a container orchestrator is the scheduling and lifecycle layer between your application code and the physical infrastructure. Everything above it (load balancing rules, deployment strategies, secrets, configs) is expressed declaratively; everything below it (CPU, memory, disk, network) is abstracted into the resource model. The orchestrator's job is to reconcile the two.


§2. Inherent Guarantees

What container orchestration provides by design, and what must still be layered above it.

Provided by design

  • Desired-state convergence. Every controller in the system runs a reconcile loop that, given a declarative spec, takes actions to make actual state match. If you say "three replicas," the system will keep three replicas alive as long as the cluster has capacity.
  • Self-healing through pod replacement. A pod that crashes is restarted (default restartPolicy: Always for Deployments). A node that dies has its pods rescheduled to other nodes within ~5 minutes (the pod eviction timeout). A failing readiness probe removes the pod from Service endpoints within seconds; a failing liveness probe restarts the container.
  • Rolling updates with bounded disruption. Updating an image rolls pods one (or N) at a time, respecting maxUnavailable and maxSurge. Traffic shifts away from terminating pods (via readiness probes + endpoints controller) before they are killed.
  • Service discovery. Every Service object gets a stable virtual IP (ClusterIP) and DNS name (my-service.my-ns.svc.cluster.local). Pods come and go; the Service abstracts that churn.
  • Secret and config injection. Secrets (base64-encoded, encrypted at rest in etcd if configured) and ConfigMaps mount into pods as files or environment variables. Rotation is supported (with caveats — see §7).
  • Resource isolation. Pods declare CPU and memory requests and limits. The scheduler honors requests for placement; the kernel (via cgroups) enforces limits at runtime.
  • Autoscaling on metrics. HPA (Horizontal Pod Autoscaler) adjusts replica count and VPA (Vertical Pod Autoscaler) adjusts per-pod resources, both based on metrics. KEDA (Kubernetes Event-Driven Autoscaling) extends this to external event sources such as queue depth.
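
Two of the guarantees above reduce to small arithmetic. The sketch below is a simplified illustration: the function names are hypothetical, the 25% defaults and the 0.1 tolerance mirror the documented Kubernetes defaults, and real controllers also account for pod readiness, missing metrics, and stabilization windows, which are ignored here.

```python
import math


def resolve(value, replicas: int, round_up: bool) -> int:
    """Resolve an absolute count (int) or a percentage string such as "25%"."""
    if isinstance(value, int):
        return value
    fraction = int(value.rstrip("%")) * replicas / 100
    return math.ceil(fraction) if round_up else math.floor(fraction)


def rolling_update_bounds(replicas: int, max_surge="25%", max_unavailable="25%"):
    """Return (max total pods, min available pods) during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)              # surge rounds up
    unavailable = resolve(max_unavailable, replicas, round_up=False)  # unavailable rounds down
    return replicas + surge, replicas - unavailable


def hpa_desired_replicas(current: int, current_metric: float,
                         target_metric: float, tolerance: float = 0.1) -> int:
    """desired = ceil(current * currentMetric / targetMetric), skipped within tolerance."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current          # close enough to target: do not scale
    return math.ceil(current * ratio)


print(rolling_update_bounds(10))          # 10 replicas with the 25% / 25% defaults
print(hpa_desired_replicas(4, 200, 100))  # metric at twice the target
```

With 10 replicas and the defaults, a rollout may run up to 13 pods in total and must keep at least 8 available; a metric at twice its target doubles the replica count.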

Must be layered above

  • Persistent state. Pods are stateless to the orchestrator. Persistent storage is provided by PVs backed by EBS (Elastic Block Store), GCE (Google Compute Engine) Persistent Disks, Ceph, NFS, etc. The orchestrator decides where the pod runs; the storage layer decides whether the data follows. Operators (e.g., CloudNativePG, Strimzi, Vitess) wrap stateful workloads with custom reconciliation logic.
  • Cross-region / multi-cluster coordination. A single Kubernetes cluster is bounded to one (or a few coupled) failure domains. Active/active across regions requires multi-cluster tooling (KubeFed, Karmada, Argo CD ApplicationSet, or Cluster API for cluster lifecycle) plus an external service mesh or global load balancer.
  • Network policy enforcement. Default Kubernetes networking is flat — every pod can reach every other pod. NetworkPolicy objects are requests; the CNI plugin must enforce them (Calico, Cilium do; some don't). Zero-trust requires a service mesh layer (Istio, Linkerd) or a CNI that does L7 filtering (Cilium with eBPF).
  • Deployment strategies beyond rolling update. Native Deployment supports rolling update and recreate. Canary, blue-green, traffic-mirroring all require extra machinery — Argo Rollouts, Flagger, service mesh traffic splitting, or hand-rolled scripts.
  • GitOps / CI/CD. Kubernetes accepts manifests; it does not fetch them from git, build images, or sequence rollouts. Tools like ArgoCD, FluxCD, Spinnaker, and Tekton add the delivery pipeline on top.
  • Backup and disaster recovery. Tools such as Velero and Stash back up API objects and PV data; etcd snapshots (etcdctl snapshot save) cover the control-plane state. The orchestrator itself only knows current state; it does not maintain history.
  • Multi-tenancy isolation. Namespaces are soft boundaries; RBAC (Role-Based Access Control), ResourceQuotas, and LimitRanges layer on top. Hard multi-tenancy (untrusted workloads) requires extra isolation (gVisor, Kata Containers, vCluster, or separate clusters).
  • Cost allocation, chargeback, FinOps. The orchestrator does not natively attribute compute cost to teams. Tools like Kubecost, OpenCost calculate this from labels.

Synthesis: the orchestrator guarantees scheduling, lifecycle, and service abstraction for ephemeral workloads on a cluster of homogeneous-enough machines. Everything beyond — persistent state, cross-cluster, security policy enforcement, delivery pipeline, multi-tenancy — is the system designer's problem.


§3. The Design Space

The orchestrator design space is dominated by Kubernetes but has meaningful variants.

Axis 1: General-purpose orchestrator vs cloud-managed

  • General-purpose, self-hostable (Kubernetes, Nomad, Docker Swarm, Mesos): runs anywhere — on-prem bare metal, EC2 (Elastic Compute Cloud), bare-metal cloud, edge. You operate the control plane (or use a managed offering of the same software).
  • Cloud-proprietary (Amazon ECS, AWS Fargate, Azure Container Instances): tightly coupled to one cloud's services. Simpler to operate (no etcd, no control plane to babysit) but locked in.

Axis 2: Container-only vs multi-workload

  • Container-focused (Kubernetes, ECS, Docker Swarm): pods/tasks are the primary unit. Other workloads (raw binaries, VMs) are second-class or unsupported.
  • Multi-workload (Nomad): schedules containers, raw executables (the exec driver), QEMU VMs, Java JARs, Windows tasks. One scheduler, many runtimes. Useful when your fleet has legacy executables you can't easily containerize.

Axis 3: Batteries-included vs minimal

  • Batteries-included (Kubernetes): ships with a controller for almost everything — Deployment, StatefulSet, DaemonSet, Job, CronJob, HPA, NetworkPolicy, Ingress (via controllers), Service, PVC, etc. Steep learning curve; rich ecosystem.
  • Minimal core (Nomad): ~50 MB single binary. Scheduling, service discovery, simple stanza-based job definitions. Ecosystem is thinner; ops burden lower.

Axis 4: Deployment delivery model

Orthogonal to the orchestrator itself, but tightly coupled in practice:

  • Push-based (Spinnaker, traditional Jenkins-driven kubectl apply): the CI/CD system has cluster credentials, pushes manifests in. Familiar; secrets in CI; cluster has no idea where its config came from.
  • Pull-based GitOps (ArgoCD, FluxCD): an in-cluster controller watches a git repository, pulls manifests, applies them, reports drift. Cluster has no inbound credentials; the source of truth is git.
  • Progressive delivery (Argo Rollouts, Flagger): a deployment controller that handles canary and blue-green natively, gating each step on metrics from Prometheus / Datadog / etc.
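
The metric-gated stepping that progressive delivery controllers perform can be sketched as a loop. This is a hedged illustration only: `run_canary`, the weight list, and the 1% error threshold are assumptions made for the example, and the `error_rate` callable stands in for a query against Prometheus or a similar backend.

```python
from typing import Callable, Iterable, Tuple


def run_canary(steps: Iterable[int], error_rate: Callable[[int], float],
               max_error_rate: float = 0.01) -> Tuple[str, int]:
    """Walk traffic weights in order; roll back on the first failed metric check.

    Returns ("promoted", 100) or ("rolled_back", weight_at_failure).
    """
    for weight in steps:
        # A real controller would shift traffic to `weight` percent here and
        # wait for an analysis interval before reading the metric.
        if error_rate(weight) > max_error_rate:
            return "rolled_back", weight
    return "promoted", 100


healthy = run_canary([5, 25, 50, 100], lambda w: 0.002)
degraded = run_canary([5, 25, 50, 100], lambda w: 0.002 if w < 50 else 0.04)
print(healthy, degraded)
```

The design point is that promotion is the result of passing every gate, so a regression that only appears at higher traffic weights still stops the rollout before it reaches 100%.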

Comparison table

Dimension | Kubernetes | Nomad | ECS | Docker Swarm | Mesos | Spinnaker (delivery) | ArgoCD (delivery)
Scope | Orchestrator | Orchestrator | Orchestrator (AWS-bound) | Orchestrator | Orchestrator (legacy) | Delivery pipeline | Delivery pipeline
Self-hostable? | Yes | Yes | No (AWS only) | Yes | Yes | Yes | Yes (runs in cluster)
Single binary? | No (5+ components) | Yes (~50 MB) | N/A (cloud service) | Yes | No | No (microservices) | No (controller + UI)
Workload types | Containers, jobs | Containers, raw, VMs, JAR | Containers (Fargate or EC2) | Containers | Containers, frameworks | Targets k8s, EC2, GCE | Targets k8s
Service discovery | Built-in (CoreDNS) | Built-in (Consul integration) | AWS Cloud Map | Built-in | None native | N/A | N/A
Storage abstraction | PV/PVC/CSI | Host volumes, CSI | EBS, EFS, FSx via integrations | Volumes | Frameworks own this | N/A | N/A
Network policy | Yes (CNI dependent) | Limited | Security groups | Limited | N/A | N/A | N/A
Cluster max nodes | ~5,000 / cluster (110 pods/node, 150k pods total) | ~10,000 | 5,000 (Fargate auto-scales) | ~1,000 | Tens of thousands (historic) | N/A | N/A
Operational complexity | High | Low | Low (managed) | Low (declining ecosystem) | Very high | Medium | Medium
Deployment model | Built-in rolling, ecosystem for canary | Built-in canary, blue-green | Built-in rolling, canary via CodeDeploy | Rolling only | Framework-dependent | Multi-stage pipelines | GitOps pull
Status (2026) | Dominant | Healthy niche | Strong on AWS | Effectively legacy | Dead | Mature; usage declining | Dominant GitOps

The first three columns of the "Status" row are the relevant ones. Kubernetes won the orchestrator war; Nomad survives at HashiCorp shops and at Cloudflare's scale; ECS survives because AWS keeps shipping it; Swarm and Mesos are effectively legacy. For delivery, ArgoCD has become the de facto standard in new cloud-native shops; Spinnaker remains entrenched at Netflix and similar large estates that adopted it early.

The synthesis: each system's typical use is a consequence of its column in the table. Kubernetes ends up at "everything" because batteries-included plus 5000-node scale meets most workloads. Nomad ends up at "Cloudflare's edge" because a 50 MB single binary across 200+ POPs (Points of Presence) is operable; Kubernetes there would not be. ECS ends up at "AWS-native shops" because the lock-in is the feature.


§4. Underlying Mechanics: The Kubernetes Control Plane at the Byte Level

This is the depth section. We use Kubernetes as the anchor since it dominates the design space, walk a single kubectl apply from API call to running container, and explain why each layer exists.

4.1 The control plane architecture

A Kubernetes cluster is built from a control plane (typically 3 or 5 nodes for HA — High Availability) and a set of worker nodes. The control plane comprises:

                   ┌────────────────────────────────────┐
                   │       Control Plane Node           │
                   │  ┌──────────────────────────────┐  │
                   │  │ kube-apiserver               │  │  ← REST + validation + admission
                   │  │   (stateless, horizontally    │  │
                   │  │    scalable)                  │  │
                   │  └────────────┬─────────────────┘  │
                   │               │                    │
                   │  ┌────────────▼─────────────────┐  │
                   │  │ etcd                         │  │  ← all cluster state lives here
                   │  │   (Raft-replicated KV)        │  │     (key-value store, MVCC)
                   │  └──────────────────────────────┘  │
                   │  ┌──────────────────────────────┐  │
                   │  │ kube-scheduler               │  │  ← decides which node a pod runs on
                   │  └──────────────────────────────┘  │
                   │  ┌──────────────────────────────┐  │
                   │  │ kube-controller-manager      │  │  ← runs Deployment/ReplicaSet/etc loops
                   │  └──────────────────────────────┘  │
                   │  ┌──────────────────────────────┐  │
                   │  │ cloud-controller-manager     │  │  ← interfaces with AWS/GCP/Azure
                   │  └──────────────────────────────┘  │
                   └────────────────────────────────────┘
                                  │
                                  │
   ┌──────────────────────────────┼──────────────────────────────────┐
   │                              │                                   │
   ▼                              ▼                                   ▼
┌──────────────┐           ┌──────────────┐                  ┌──────────────┐
│ Worker Node  │           │ Worker Node  │     ...          │ Worker Node  │
│ ┌──────────┐ │           │ ┌──────────┐ │                  │ ┌──────────┐ │
│ │ kubelet  │ │ ← agent   │ │ kubelet  │ │                  │ │ kubelet  │ │
│ │  + CRI   │ │   talks    │ │          │ │                  │ │          │ │
│ │  + CNI   │ │   to       │ │          │ │                  │ │          │ │
│ │  + CSI   │ │   apiserver│ │          │ │                  │ │          │ │
│ ├──────────┤ │            │ ├──────────┤ │                  │ ├──────────┤ │
│ │kube-proxy│ │ ← service  │ │kube-proxy│ │                  │ │kube-proxy│ │
│ │          │ │   routing  │ │          │ │                  │ │          │ │
│ ├──────────┤ │            │ ├──────────┤ │                  │ ├──────────┤ │
│ │containerd│ │ ← actually │ │containerd│ │                  │ │containerd│ │
│ │  / CRI-O │ │   runs     │ │          │ │                  │ │          │ │
│ │          │ │   containers│ │          │ │                  │ │          │ │
│ └──────────┘ │            │ └──────────┘ │                  │ └──────────┘ │
│              │            │              │                  │              │
│  ┌────┐┌────┐│            │  ┌────┐      │                  │  ┌────┐      │
│  │Pod ││Pod ││            │  │Pod │      │                  │  │Pod │      │
│  └────┘└────┘│            │  └────┘      │                  │  └────┘      │
└──────────────┘            └──────────────┘                  └──────────────┘

Each component has one job. They communicate only through the API server.

4.2 etcd — the desired-state store

etcd is the single source of truth for the cluster. Every other component is stateless or has reconstructible state; everything that must persist (object specs, current status, leader elections) lives in etcd. It is a Raft-replicated key-value store with:

  • MVCC (Multi-Version Concurrency Control): every write creates a new revision; old revisions are kept for --auto-compaction-retention (typically 1 hour) so that watchers can resume from any point in the recent past.
  • Watch streams: clients open long-lived gRPC streams subscribed to a key prefix. Any mutation under that prefix flows out as an event. This is the mechanism the API server uses to feed change events to every controller in the cluster.
  • Consensus: writes require a quorum of etcd members (3-of-5, 2-of-3). Reads default to linearizable (also quorum-bound) but can be serializable (read from local member, possibly stale, faster).
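
The revision and watch-resume behavior described above can be illustrated with a toy in-memory store. This is a sketch under loud assumptions: `ToyMVCCStore` is invented for the example, has no Raft, persistence, or gRPC, and only imitates the shape of etcd's semantics (a global revision counter, replayable history, and an error once the requested revision has been compacted).

```python
class ToyMVCCStore:
    """In-memory stand-in for an MVCC key-value store."""

    def __init__(self):
        self.revision = 0      # global, monotonically increasing
        self.current = {}      # key -> (revision, value) for the latest version
        self.history = []      # (revision, key, value) events, oldest first
        self.compacted = 0     # events with a revision below this are gone

    def put(self, key, value):
        self.revision += 1     # every write creates a new revision
        self.current[key] = (self.revision, value)
        self.history.append((self.revision, key, value))
        return self.revision

    def watch(self, prefix, start_revision):
        """Replay events under `prefix` starting at `start_revision`."""
        if start_revision < self.compacted:
            raise LookupError("start revision has been compacted")
        return [e for e in self.history
                if e[0] >= start_revision and e[1].startswith(prefix)]

    def compact(self, revision):
        """Discard history older than `revision`; latest values are kept."""
        self.history = [e for e in self.history if e[0] >= revision]
        self.compacted = revision


store = ToyMVCCStore()
store.put("/registry/pods/default/a", b"v1")      # revision 1
store.put("/registry/services/default/s", b"v1")  # revision 2
store.put("/registry/pods/default/a", b"v2")      # revision 3
events = store.watch("/registry/pods/", start_revision=2)
print(events)   # only the pod event at revision 3 matches prefix and range
store.compact(3)
```

A watcher that was disconnected can resume from the last revision it saw, as long as that revision has not been compacted; after compaction it has to re-list and start a fresh watch.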

A typical key in etcd looks like:

/registry/pods/default/web-app-7d4f8c-x9k2p
  → value: protobuf-encoded PodSpec + PodStatus, ~5-50 KB

The API server reads and writes etcd; nothing else does. This isolation is intentional: it means etcd's permissions model can be brutally simple (TLS, Transport Layer Security, client certs only), and the cluster's complexity can grow without etcd having to learn about Pods, Services, or any Kubernetes concept. To etcd it is all opaque bytes.

Scale ceiling: etcd is generally happy up to about 8 GB of total data (the suggested maximum for --quota-backend-bytes, which defaults to 2 GB), ~150k objects, sustaining hundreds of writes per second. Past that, the Raft log grows large enough that follower replication starts lagging; cluster operators must tune --quota-backend-bytes, run regular etcdctl defrag, and limit the largest objects (etcd's --max-request-bytes defaults to 1.5 MiB, a hint that very large objects are not the design point). Stripe has reported running etcd at the edge of this envelope; Google's k8s offering shards across multiple etcd clusters per logical cluster as a workaround.

4.3 kube-apiserver — the validation and admission layer

The API server is a stateless HTTPS server that exposes the entire Kubernetes API as REST. It has four responsibilities:

  1. Authentication — verifies the client (TLS cert, token, OIDC — OpenID Connect, webhook).
  2. Authorization — runs RBAC checks. "Can user U perform verb V on resource R in namespace N?"
  3. Admission control — runs built-in admission plugins plus mutating and validating webhooks. This is where defaulting, validation, and policy enforcement happen. Examples:
     - The built-in LimitRanger admission controller fills in default CPU/memory requests.
     - PodSecurity admission (the successor to PodSecurityPolicy, which was removed in Kubernetes 1.25) rejects pods that violate the namespace's policy level, for example by asking for privileged mode.
     - An external mutating webhook (e.g., Istio's sidecar injector) adds an Envoy sidecar to every pod in a labeled namespace.
  4. Persistence to etcd — after admission, the object is written to etcd. The response to the client is returned only after etcd ack.

The API server is horizontally scalable — you can run 3, 5, 10 instances behind a load balancer, all reading/writing the same etcd. Watch fan-out is the main scaling constraint: each watcher is a long-lived stream the apiserver must feed. At large scale (5000-node clusters), the API server uses an internal watch cache that deduplicates etcd watches — one watcher per resource type to etcd, fanned out in-process to thousands of API clients.
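
The ordering of the stages matters: mutation happens before validation, and nothing is persisted until every stage has passed. The sketch below illustrates that ordering only. All names (`ALLOWED`, `handle`, the two admission functions) are hypothetical, authentication is assumed to have already produced a user name, and a plain dict stands in for etcd.

```python
class Rejected(Exception):
    """Raised when a stage refuses the request."""


# Hypothetical RBAC table of allowed (user, verb, resource, namespace) tuples.
ALLOWED = {("alice", "create", "pods", "prod")}


def authorize(user, verb, resource, namespace):
    if (user, verb, resource, namespace) not in ALLOWED:
        raise Rejected(f"{user} cannot {verb} {resource} in {namespace}")


def default_memory_limit(pod):
    """Mutating step: fill in a default limit, loosely modeled on LimitRanger."""
    for c in pod["spec"]["containers"]:
        c.setdefault("resources", {}).setdefault("limits", {}).setdefault("memory", "1Gi")
    return pod


def forbid_privileged(pod):
    """Validating step: reject privileged containers; never modifies the object."""
    for c in pod["spec"]["containers"]:
        if c.get("securityContext", {}).get("privileged"):
            raise Rejected(f"container {c['name']} is privileged")


def handle(user, verb, namespace, pod, store):
    authorize(user, verb, "pods", namespace)      # authorization
    for mutate in (default_memory_limit,):        # mutating admission runs first
        pod = mutate(pod)
    for validate in (forbid_privileged,):         # validators see the final object
        validate(pod)
    key = f"/registry/pods/{namespace}/{pod['metadata']['name']}"
    store[key] = pod                              # persistence is the last step
    return pod


store = {}
pod = {"metadata": {"name": "web-app"}, "spec": {"containers": [{"name": "app"}]}}
handle("alice", "create", "prod", pod, store)
print(store)
```

Running validators after mutators is what lets policy checks see injected sidecars and defaulted fields rather than the manifest as the user wrote it.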

4.4 The controller pattern — the heart of declarative

A controller is a process that:

  1. Watches one or more API resources via the API server's watch protocol.
  2. Maintains a local cache of the current state.
  3. Periodically (or on event) compares current state to desired state.
  4. Computes the actions needed to converge.
  5. Issues those actions through the API server.
  6. Repeats forever.

This is the reconcile loop:

                  ┌─────────────────┐
                  │  Watch events   │  ← from API server (apiserver feeds
                  │   queue up      │     watch events into client-go's
                  └────────┬────────┘    SharedInformer cache)
                           │
                           ▼
              ┌────────────────────────┐
              │   workqueue (rate-     │
              │   limited, dedup)      │
              └──────────┬─────────────┘
                         │ pop a key
                         ▼
              ┌────────────────────────┐
              │  Reconcile(key)         │
              │   1. fetch current state│
              │   2. fetch desired spec │
              │   3. compute delta      │
              │   4. take action via    │
              │      apiserver API call │
              │   5. update status      │
              └────────────────────────┘
                         │
                  ┌──────┴──────┐
                  │             │
              (success)     (error / requeue)

Every "primitive" in Kubernetes is a controller. The Deployment controller reconciles Deployments by creating ReplicaSets. The ReplicaSet controller reconciles ReplicaSets by creating Pods. The Endpoints controller reconciles Endpoints by listing Pods. Custom Resource Definitions (CRDs) extend this — write a CRD plus a custom controller and you've added a new primitive without modifying core Kubernetes. This is how operators (CloudNativePG for Postgres, Strimzi for Kafka, Prometheus Operator for monitoring) work.

The cardinal principle: controllers are level-triggered, not edge-triggered. A controller does not consume a single "create pod X" message and then forget; it observes "I want 3 pods named X-*" and continuously asserts that. Drop an event and the next periodic sync still corrects it. This is what makes the system robust to network blips, controller restarts, and etcd hiccups.
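
A small simulation makes the level-triggered property concrete. It is a sketch, not client-go: `DedupQueue` only imitates the deduplicating behavior of a workqueue, and the `desired` and `actual` dicts are hypothetical stand-ins for the informer cache and the cluster.

```python
from collections import deque


class DedupQueue:
    """FIFO of keys; a key that is already pending is not queued twice."""

    def __init__(self):
        self._order, self._pending = deque(), set()

    def add(self, key):
        if key not in self._pending:
            self._pending.add(key)
            self._order.append(key)

    def pop(self):
        key = self._order.popleft()
        self._pending.discard(key)
        return key

    def __len__(self):
        return len(self._order)


desired = {"web": 3}            # spec: replica count per key
actual = {"web": ["web-0"]}     # observed pods per key


def reconcile(key):
    """Level-triggered: inspect current state, ignore what the event said."""
    pods = actual.setdefault(key, [])
    while len(pods) < desired[key]:
        pods.append(f"{key}-{len(pods)}")
    del pods[desired[key]:]     # trim any surplus


queue = DedupQueue()
for _ in range(5):              # five rapid events for the same object...
    queue.add("web")
print(len(queue))               # ...collapse into one pending key
while queue:
    reconcile(queue.pop())

actual["web"].pop()             # a pod disappears and its event is lost
for key in desired:             # periodic resync re-queues every known key
    queue.add(key)
while queue:
    reconcile(queue.pop())
print(actual)
```

Because the queue carries keys rather than event payloads, a lost or duplicated event changes nothing: the next reconcile of that key reads the current state and repairs it.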

4.5 The scheduler — bin-packing with predicates and priorities

The default scheduler (kube-scheduler) is the controller that watches for Pods with spec.nodeName unset and assigns them to a node. The algorithm is filter, then score:

For each pending pod:
  Step 1: FILTER (a.k.a. predicates)
    For each node N in cluster:
      pass = true
      check NodeAffinity / NodeSelector match    ← skip if mismatch
      check Taints / Tolerations                 ← skip if untolerated
      check resource fits (CPU, memory, ephemeral storage)
      check pod fits port requirements
      check volume zone matches node zone
      check existing pod (anti-)affinity rules
      ...
      if any check fails, drop N from candidates

  Step 2: SCORE (a.k.a. priorities)
    For each candidate node N that passed filter:
      score = 0
      score += LeastRequested      (prefer emptier nodes; spreads load rather than bin-packing)
      score += BalancedAllocation  (prefer balance between CPU/memory)
      score += NodeAffinityWeight  (weighted by soft affinity rules)
      score += ImageLocality       (prefer nodes that already have the image)
      score += InterPodAffinity    (prefer near/far from labeled pods)
      ...

  Step 3: BIND
    Pick the highest-scoring node.
    Issue Binding API call → apiserver writes spec.nodeName = N to etcd.
    The pod is now "assigned." kubelet on node N will pick it up.

The scheduler considers one pod at a time: scheduling cycles run serially, while the filter and score phases within a cycle parallelize across nodes and the bind step runs asynchronously. A scheduler can sustain ~100 pods/sec on a 5000-node cluster, which is the practical ceiling for fast scale-up.

There is no central database of "which pod is where" — that information is the union of spec.nodeName fields on Pod objects in etcd. The scheduler does not own placement decisions retroactively. Once a pod is bound, only an eviction or a deletion moves it. This is by design: Kubernetes does not reschedule live pods for bin-packing reasons unless explicitly told to (via descheduler or PodDisruptionBudget-aware eviction).
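
The filter-then-score algorithm can be sketched as runnable code. This is a simplified model under stated assumptions: the `Node` and `Pod` dataclasses are hypothetical, only label selectors, taints, and CPU/memory fit are filtered, and the single scoring function is a least-allocated style average rather than the scheduler's full weighted plugin set.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpu_capacity_m: int              # millicores
    mem_capacity_mi: int             # MiB
    cpu_requested_m: int = 0
    mem_requested_mi: int = 0
    labels: dict = field(default_factory=dict)
    taints: set = field(default_factory=set)


@dataclass
class Pod:
    name: str
    cpu_m: int
    mem_mi: int
    node_selector: dict = field(default_factory=dict)
    tolerations: set = field(default_factory=set)


def passes_filter(pod, node):
    selector_ok = all(node.labels.get(k) == v for k, v in pod.node_selector.items())
    taints_ok = node.taints <= pod.tolerations   # every taint must be tolerated
    fits = (node.cpu_requested_m + pod.cpu_m <= node.cpu_capacity_m
            and node.mem_requested_mi + pod.mem_mi <= node.mem_capacity_mi)
    return selector_ok and taints_ok and fits


def score(pod, node):
    """Higher score for nodes that would be emptier after placement."""
    cpu_free = 1 - (node.cpu_requested_m + pod.cpu_m) / node.cpu_capacity_m
    mem_free = 1 - (node.mem_requested_mi + pod.mem_mi) / node.mem_capacity_mi
    return 100 * (cpu_free + mem_free) / 2


def schedule(pod, nodes):
    candidates = [n for n in nodes if passes_filter(pod, n)]
    if not candidates:
        return None                              # pod stays Pending
    best = max(candidates, key=lambda n: score(pod, n))
    best.cpu_requested_m += pod.cpu_m            # "bind": account for the pod
    best.mem_requested_mi += pod.mem_mi
    return best.name


nodes = [
    Node("node-1", 4000, 8192, cpu_requested_m=3800, labels={"tier": "web-tier"}),
    Node("node-2", 4000, 8192, cpu_requested_m=1000, labels={"tier": "web-tier"}),
    Node("node-3", 8000, 16384, labels={"tier": "batch"}),
]
web = Pod("web-app", cpu_m=500, mem_mi=512, node_selector={"tier": "web-tier"})
print(schedule(web, nodes))
```

Here node-1 is filtered out for lack of CPU and node-3 for a label mismatch, so node-2 is the only candidate left to score.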

4.6 kubelet — the per-node agent

Every worker node runs a kubelet, the agent that turns Pod specs into running containers. The kubelet:

  1. Watches the API server for Pods with spec.nodeName == thisNode.
  2. Talks to the CRI (Container Runtime Interface) — a gRPC API implemented by containerd or CRI-O — to create the pod sandbox, pull images, and start each container.
  3. Talks to the CNI (Container Network Interface) plugin to attach network interfaces.
  4. Talks to the CSI (Container Storage Interface) plugin to mount persistent volumes.
  5. Periodically reports node and pod status back to the API server, which writes to etcd.
  6. Runs liveness and readiness probes; restarts containers that fail.
  7. Streams logs to disk; serves them to kubectl logs.

The kubelet itself does NOT run containers. Containers are run by containerd (the dominant container runtime since dockershim, the kubelet's built-in Docker adapter, was deprecated in Kubernetes 1.20 and removed in 1.24, 2022). The kubelet → containerd boundary is the CRI:

kubelet                                    containerd
  │     CreateContainerRequest               │
  │ ───────────────────────────────────────► │   shim process forks runc
  │                                           │   runc sets up namespaces
  │     CreateContainerResponse(id=xyz)       │   (PID, NET, MNT, IPC, UTS, USER)
  │ ◄─────────────────────────────────────── │   cgroups limit resources
  │     StartContainerRequest                 │   container is now running
  │ ───────────────────────────────────────► │
  │     StartContainerResponse                │
  │ ◄─────────────────────────────────────── │

Below containerd is runc (or alternatives — gVisor for sandboxing, Kata Containers for VM-isolated containers, crun for a lighter C implementation). runc is what actually executes the clone() and unshare() syscalls to create Linux namespaces and the cgroups v2 writes to enforce CPU/memory limits.
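
To make the cgroups v2 writes concrete, the sketch below converts pod-style limits into the strings that end up in the cpu.max and memory.max control files. It is illustrative only: the helper names are hypothetical, the parsers handle just a small subset of Kubernetes quantity syntax, and only limits are covered (CPU requests map to cpu.weight, which is not shown).

```python
def cpu_millicores(quantity: str) -> int:
    """Parse a CPU quantity such as "500m" or "2" into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def memory_bytes(quantity: str) -> int:
    """Parse a memory quantity with a binary suffix ("512Mi") or plain bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def cgroup_v2_limits(cpu_limit: str, memory_limit: str, period_us: int = 100_000) -> dict:
    """Return the values that would be written to cpu.max and memory.max."""
    # cpu.max is "<quota> <period>" in microseconds: 500m means the container
    # may run for 50 ms out of every 100 ms period.
    quota_us = cpu_millicores(cpu_limit) * period_us // 1000
    return {
        "cpu.max": f"{quota_us} {period_us}",
        "memory.max": str(memory_bytes(memory_limit)),
    }


print(cgroup_v2_limits("500m", "512Mi"))
# {'cpu.max': '50000 100000', 'memory.max': '536870912'}
```

A CPU limit is therefore a throttle over time slices, while a memory limit is a hard byte ceiling beyond which the kernel's OOM killer intervenes.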

The CRI separation is what allows the orchestrator to evolve independently of the runtime. Kubernetes spoke "Docker" originally; the removal of dockershim in Kubernetes 1.24 (2022) was the long-deferred result of standardizing on CRI.

4.7 CNI (Container Network Interface) — how pod networking works

The CNI is a contract: "given a network namespace and a pod's identity, attach a network interface and configure routing." The implementation is the CNI plugin, and the choice of plugin is one of the most consequential per-cluster decisions.

The three dominant CNI plugins:

  • Cilium — eBPF-based. Programs the Linux kernel's networking subsystem directly via BPF (Berkeley Packet Filter) maps. No iptables, no kube-proxy needed. Supports L7 (application-layer) policy. Best performance at high pod counts. The de-facto choice for new large clusters in 2026.
  • Calico — supports BGP (Border Gateway Protocol) for routing pod IPs across nodes, or VXLAN/IPIP overlay for non-BGP environments. Long the production default; very flexible policy model.
  • Flannel — VXLAN overlay only. Simplest to operate; no policy enforcement. Common in dev clusters; rarely in production at scale.

How pod-to-pod networking actually works under Cilium (with bpf_lxc mode):

Pod A on Node 1 wants to send a packet to Pod B on Node 2.

1. Pod A's process calls sendmsg(). The socket is in Pod A's network namespace.
2. The kernel's network stack consults the routing table inside Pod A's netns.
   Default route → veth-A-host (the host-side end of a veth pair).
3. Packet arrives in the host's main network namespace via the veth.
4. Cilium's eBPF program attached to veth-A-host inspects the packet:
   - Checks NetworkPolicy: is Pod A allowed to talk to Pod B?
   - Looks up Pod B's identity in a BPF map (keyed by destination IP).
   - Determines the egress route: which physical interface, which encap.
5. In overlay mode the packet is encapsulated in VXLAN; in native-routing mode
   (nodes on the same L2 (Layer 2) segment, or routes distributed via BGP) it
   is sent unencapsulated. Either way it leaves via eth0.
6. Node 2's eth0 receives. Cilium's ingress eBPF program decapsulates,
   checks NetworkPolicy again on the destination side, and delivers to
   veth-B-host.
7. Packet flows through veth-B-host → veth-B-pod → Pod B's netns → Pod B's
   socket.

The key point: CNI is at L3+ (Layer 3 and above). Pods get IP addresses; the rest is kernel routing plus optional encapsulation. The orchestrator does not invent a new network stack; it configures the kernel's existing one.
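
The per-packet decision in the walkthrough (policy check, identity lookup, route choice) can be modeled with ordinary dictionaries. This is a toy under explicit assumptions: the IPs, CIDRs, and the `forward` function are hypothetical, Python dicts stand in for BPF maps, and nothing here reflects Cilium's actual data structures.

```python
import ipaddress

# Hypothetical cluster layout: each node IP owns one pod CIDR.
NODE_POD_CIDRS = {
    "10.0.1.10": ipaddress.ip_network("10.244.1.0/24"),   # node 1
    "10.0.1.11": ipaddress.ip_network("10.244.2.0/24"),   # node 2
}
# Stand-in for the identity map keyed by pod IP.
IDENTITY = {"10.244.1.5": "pod-a", "10.244.2.7": "pod-b"}
# Allowed (source identity, destination identity) pairs.
POLICY = {("pod-a", "pod-b")}


def forward(src_ip: str, dst_ip: str, local_node: str, overlay: bool = True) -> str:
    """Decide what happens to a packet leaving a pod on `local_node`."""
    src, dst = IDENTITY.get(src_ip), IDENTITY.get(dst_ip)
    if (src, dst) not in POLICY:
        return "drop"                           # policy denies or identity unknown
    dst_addr = ipaddress.ip_address(dst_ip)
    for node_ip, cidr in NODE_POD_CIDRS.items():
        if dst_addr in cidr:
            if node_ip == local_node:
                return "deliver locally"
            mode = "vxlan encap" if overlay else "native route"
            return f"{mode} via {node_ip}"
    return "drop"                               # no node owns the destination


print(forward("10.244.1.5", "10.244.2.7", local_node="10.0.1.10"))
print(forward("10.244.2.7", "10.244.1.5", local_node="10.0.1.11"))
```

The first call is allowed and tunneled to node 2; the second is dropped because the toy policy only permits traffic from pod-a to pod-b, not the reverse.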

4.8 CSI (Container Storage Interface) — how PVCs map to disks

The CSI is the storage analogue of CNI. Each storage backend (EBS, GCE PD, Ceph, Portworx, OpenEBS, vSAN, NFS) ships a CSI driver that runs as two components:

  • A controller plugin (a Deployment) that handles cluster-level operations: create volume, delete volume, snapshot.
  • A node plugin (a DaemonSet) that handles per-node operations: attach volume to this node, mount inside pod.

The flow when a pod requests a PersistentVolumeClaim (PVC):

1. User creates a PVC:
     kind: PersistentVolumeClaim
     spec:
       resources: { requests: { storage: 100Gi } }
       storageClassName: ebs-gp3

2. PVC controller sees the PVC. Looks up the StorageClass to find the provisioner.
3. CSI controller plugin gets a CreateVolume call:
     csi-ebs creates a 100 GiB gp3 EBS volume via AWS API.
     Returns volumeHandle = "vol-0123abc".
4. A PersistentVolume (PV) object is created with that handle.
5. The PVC binds to the PV.
6. A pod referencing the PVC is scheduled to node N.
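7. The attach/detach controller requests an attachment; the CSI controller
   plugin (ControllerPublishVolume) attaches vol-0123abc to node N via AWS
   API (~5-20 seconds).
8. The kubelet on node N calls the CSI node plugin:
     a. NodeStageVolume: detects the resulting /dev/nvme1n1 and mounts it at a
        node-level staging path.
     b. NodePublishVolume: bind-mounts the staging path onto
        /var/lib/kubelet/pods/<pod-uid>/volumes/...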
9. kubelet bind-mounts that directory into the container's mount namespace
   at the path specified in volumeMounts.
10. Container starts. Its filesystem now includes the persistent volume.

When the pod is deleted or rescheduled, the reverse happens: unmount, then detach. The underlying volume is deleted only when the PVC itself is deleted and the PV's persistentVolumeReclaimPolicy is Delete.
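
The setup and teardown ordering can be written down as data. The RPC names below come from the CSI specification; the `teardown_calls` helper and its parameters are hypothetical and exist only to show that teardown mirrors setup in reverse, with volume deletion gated on the claim being deleted and the reclaim policy.

```python
# CSI RPCs in the order they are issued when a volume is brought up for a pod.
SETUP = ["CreateVolume", "ControllerPublishVolume", "NodeStageVolume", "NodePublishVolume"]

# Each setup call and the call that undoes it.
UNDO = {
    "NodePublishVolume": "NodeUnpublishVolume",
    "NodeStageVolume": "NodeUnstageVolume",
    "ControllerPublishVolume": "ControllerUnpublishVolume",
    "CreateVolume": "DeleteVolume",
}


def teardown_calls(completed, claim_deleted=False, reclaim_policy="Delete"):
    """Return the undo calls for the completed setup steps, newest first.

    DeleteVolume is only included when the claim has been deleted and the
    reclaim policy is "Delete"; removing a pod alone never deletes the volume.
    """
    calls = [UNDO[step] for step in reversed(completed)]
    if not (claim_deleted and reclaim_policy == "Delete"):
        calls = [c for c in calls if c != "DeleteVolume"]
    return calls


print(teardown_calls(SETUP))                       # pod removed, claim kept
print(teardown_calls(SETUP, claim_deleted=True))   # claim removed as well
```

A pod stuck terminating usually corresponds to one of the middle undo steps not completing, most often the detach performed by ControllerUnpublishVolume.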

The brittle parts: attach/detach is slow (5-30 seconds), zonal volumes can only attach to nodes in the same zone, and only certain backends (EBS io1/io2 with Multi-Attach enabled, AWS EFS, Ceph RBD) support multi-attach. This is the source of the "my stateful pod is stuck terminating" pain: the CSI driver is waiting for AWS to detach a volume that has a stale lease.

4.9 Walking kubectl apply pod.yaml end-to-end

The depth walkthrough. A developer types kubectl apply -f pod.yaml. Here is everything that happens.

Pod manifest:
  apiVersion: v1
  kind: Pod
  metadata: { name: web-app, namespace: prod }
  spec:
    containers:
    - name: app
      image: registry.example.com/web-app:v1.4.2
      resources: { requests: { cpu: 500m, memory: 512Mi } }

Step 1: kubectl
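  Reads ~/.kube/config to find the apiserver URL and credentials.
  Parses the YAML; converts to JSON.
  GETs the Pod. If it does not exist yet, sends HTTPS POST
    https://apiserver:6443/api/v1/namespaces/prod/pods to create it.
  If it exists, computes a 3-way merge patch vs the last-applied-config annotation
    and sends HTTPS PATCH https://apiserver:6443/api/v1/namespaces/prod/pods/web-app.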

Step 2: kube-apiserver — request routing
  TLS handshake; client cert validated.
  Authn: extracts user identity from cert (Common Name + Organization).
  Authz: RBAC check — can user 'alice' verb 'create' (or 'patch', for an existing Pod) resource 'pods' in 'prod'?
  Routes to the v1.Pod handler.

Step 3: kube-apiserver — admission
  Mutating admission runs first (built-in plugins and webhooks):
    - LimitRanger fills in default memory limit (e.g., 1Gi).
    - The ServiceAccount admission plugin sets serviceAccountName: default.
    - Istio sidecar injector (a mutating webhook) adds an envoy sidecar container.
  Validating admission runs next:
    - PodSecurity admission checks privilege escalation.
    - ResourceQuota admission checks remaining quota in 'prod' namespace.

Step 4: kube-apiserver — persistence
  apiserver encodes the Pod as Protobuf (default storage format).
  Issues etcd transaction:
    if revision('pods/prod/web-app') == expected_revision:
      put('pods/prod/web-app', encoded_pod)
  etcd's Raft leader appends to WAL, replicates to followers, applies to
  state machine, returns success after quorum.
  apiserver returns 201/200 to kubectl. Pod is now persisted.

Step 5: Scheduler
  Scheduler's pod-informer cache receives a watch event: new Pod with
  spec.nodeName == "".
  Scheduler enqueues web-app.
  Filter phase: iterates over the 100 nodes in the cluster.
    Node 1: requests 500m CPU; has 200m free. FILTER OUT.
    Node 2: 800m free. PASS.
    ...
    47 nodes pass.
  Score phase: scores each. Node 17 wins (most free capacity + image already cached).
  Binding: scheduler issues a POST /pods/web-app/binding with target Node 17.
  apiserver writes spec.nodeName = "node17" to etcd.

Step 6: kubelet on node17
  kubelet's pod informer sees Pod with spec.nodeName == node17.
  Pulls Pod into its pod-manager.
  Calls CRI: RunPodSandbox — containerd creates the pod sandbox (a pause
    container that holds the network namespace).
  As part of sandbox setup, containerd invokes CNI ADD: Cilium attaches a veth
    pair, allocates pod IP 10.244.17.42, programs eBPF maps with the pod's identity.
  Calls CRI: PullImage if not present. containerd pulls registry.example.com/
    web-app:v1.4.2 via the configured credentials (~5-50 seconds depending
    on image size and registry latency).
  Calls CRI: CreateContainer for the 'app' container — containerd creates
    the OCI bundle and runc config.
  Calls CRI: StartContainer — runc clone()s the process inside the namespaces;
    cgroups v2 apply the 500m CPU request as a proportional cpu.weight and
    enforce the 1Gi memory limit.
  Container is now running with PID 1 = the app process.

Step 7: Readiness probe
  kubelet runs the readiness probe (e.g., HTTP GET /healthz on port 8080).
  Once it returns 200, kubelet patches the Pod's status.conditions to set
  Ready: True.
  apiserver writes the status update to etcd.

Step 8: Endpoints controller
  The Endpoints controller watches Pods for the Service that selects this pod.
  Sees the pod become Ready.
  Adds 10.244.17.42:8080 to the Endpoints object for service 'web-app'.
  apiserver writes the Endpoints update to etcd.

Step 9: kube-proxy / Cilium dataplane
  kube-proxy on every node sees the Endpoints change.
  Updates its iptables/IPVS rules (or, with Cilium, eBPF maps) so that
  packets to the service ClusterIP load-balance to include 10.244.17.42:8080.

Step 10: Traffic flows
  Other pods in the cluster doing DNS for web-app.prod.svc.cluster.local
  resolve to the Service ClusterIP. Their packets are DNAT'd to 10.244.17.42:8080.

DURABILITY POINT:
  The cluster's authoritative state lives in etcd. If every kubelet, scheduler,
  and apiserver dies, the data in etcd is sufficient to reconstruct the cluster.
  If etcd loses quorum but a quorum's worth of disks survive, a restored member
  can rejoin and apply the WAL. If etcd is irrecoverably lost, recovery requires
  restoring from a snapshot (typically taken via `etcdctl snapshot save` on a
  cron).

Most engineers stop at "kubectl applies the pod, it runs." The depth is in the controller-watch-react chain that makes the system level-triggered and resilient: every step is the result of a controller observing state and acting, not a chain of imperative commands.
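
The controller-watch-react idea can be made concrete with a small sketch. This is not Kubernetes code: the `reconcile` function, the `Action` type, and the pod-naming scheme are hypothetical, chosen only to show what "level-triggered" means. The function looks at a snapshot of desired and observed state and returns whatever actions close the gap, regardless of which events produced that snapshot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    verb: str  # "create" or "delete"
    name: str  # hypothetical pod name


def reconcile(desired_replicas, observed, prefix="web-app"):
    """Return the actions that move observed state toward desired state.

    Level-triggered: the result depends only on the current snapshot, so
    running it again after a missed or duplicated event is always safe.
    """
    actions = []
    missing = desired_replicas - len(observed)
    if missing > 0:
        # Too few pods: create the shortfall, skipping names already in use.
        taken = set(observed)
        ordinal = 0
        while missing > 0:
            name = f"{prefix}-{ordinal}"
            if name not in taken:
                actions.append(Action("create", name))
                missing -= 1
            ordinal += 1
    elif missing < 0:
        # Too many pods: delete the surplus (here, the last ones listed).
        for name in observed[desired_replicas:]:
            actions.append(Action("delete", name))
    return actions
```

Calling `reconcile(3, ["web-app-0", "web-app-2"])` yields a single create for `web-app-1`; calling it again once that pod exists yields no actions, which is the steady state the real controllers also converge to.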


§5. Capacity Envelope

The throughput and scale range that container orchestration covers.

Small scale — startup with 3-node cluster, ~50 pods

A typical seed-stage SaaS. Three small EC2 instances or droplets, each running both control plane and workload on a lightweight distribution like k3s (or, alternatively, three workers behind a managed control plane such as Linode Kubernetes Engine). 50 pods total across ~5 services: web app, API, worker, Postgres, Redis. Resource utilization per node ~30%. Cluster spec rarely changes. The orchestrator's overhead is significant proportionally — control plane CPU is on the order of 100m–500m, etcd takes ~200 MB RAM, kubelet+containerd another ~300 MB — but absolute numbers are tiny.

The next bottleneck at this scale is not the orchestrator; it's the application. The cluster could handle 10x more pods without breaking a sweat.

Mid scale — Spotify, Backstage, ~10k pods

Spotify's internal platform (Backstage was born here) runs thousands of microservices on Kubernetes. The fleet is on the order of 10k pods, 500–1000 nodes, multiple clusters per region for blast radius isolation. At this scale:

  • The control plane is dedicated (3 or 5 nodes, managed by GKE — Google Kubernetes Engine).
  • etcd holds ~3–8 GB of data. Compaction and defrag are operational concerns.
  • ~5–10 deployments per day per service, hundreds of services, thousands of deployments per day overall.
  • Each cluster is sized to ~150 nodes max; multi-cluster federation handles the rest.

The next bottleneck is operational: keeping the API server's request rate under control (large kubectl get -A queries can stall), keeping etcd compact, keeping CNI policies consistent.

Large scale — Stripe, ~100k pods

Stripe runs in the low hundreds of thousands of pods across many clusters, on AWS. The orchestrator there is largely Kubernetes but with significant in-house automation. The numbers:

  • Each cluster is capped around 1000–3000 nodes for safety.
  • etcd is tuned aggressively (for example, a lower --snapshot-count) to keep WALs short.
  • Pod startup latency p99 is in the 30–60 second range (image pull dominates for cold cache; ~5–10s for warm).
  • Deployment rate is thousands per day; rollouts are progressive and metric-gated.

The next bottleneck is cluster sprawl: managing 50+ clusters becomes its own engineering problem. Cluster API (the Kubernetes-managing-Kubernetes pattern) and per-cluster GitOps were developed largely to solve this.

Giant scale — Google Borg → Kubernetes, millions of cores

Google's Borg (Kubernetes' direct ancestor; the 2015 Borg paper is required reading) runs at the multi-million-core scale. A single Borg cell holds tens of thousands of machines. Kubernetes today is not run quite this large — Google internally still uses Borg for the largest workloads — but Kubernetes' design point is "scale tested to 5000 nodes per cluster, 150,000 pods total." For larger estates, the answer is multiple clusters, not larger clusters.
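
The "multiple clusters, not larger clusters" rule is simple arithmetic. The sketch below takes the two tested limits quoted above (5000 nodes, 150,000 pods per cluster) as defaults; the function name `clusters_needed` and the example estate sizes are hypothetical.

```python
import math

# Per-cluster limits quoted in the text above.
MAX_NODES_PER_CLUSTER = 5000
MAX_PODS_PER_CLUSTER = 150_000


def clusters_needed(total_nodes, total_pods,
                    node_cap=MAX_NODES_PER_CLUSTER,
                    pod_cap=MAX_PODS_PER_CLUSTER):
    """Smallest number of clusters that keeps every cluster under both caps."""
    by_nodes = math.ceil(total_nodes / node_cap)
    by_pods = math.ceil(total_pods / pod_cap)
    return max(by_nodes, by_pods, 1)
```

An estate of 12,000 nodes and 400,000 pods needs 3 clusters at the tested limits, and 12 clusters if an operator caps each cluster at 1000 nodes for safety (`clusters_needed(12_000, 400_000, node_cap=1000)`).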

The next bottleneck is cross-cluster coordination: deploying a service to 50 clusters with consistent versioning, gating rollouts on global error rates, handling regional failovers. This is what Spinnaker was originally built for at Netflix.

Tier            Pods      Nodes       etcd size      Deploy rate    Operator effort
Startup         ~50       3           <100 MB        1-5/day        <10% of one SRE
Mid (Spotify)   ~10k      500-1000    3-8 GB         hundreds/day   dedicated platform team
Large (Stripe)  ~100k     5000-10000  strict tuning  thousands/day  large platform team
Giant (Google)  millions  100k+       many shards    continuous     hundreds of engineers

The scale curve has a knee around 150-300 nodes per cluster where the operational model has to change (dedicated control plane, dedicated monitoring, automated etcd backup). It has another knee around 3000-5000 nodes per cluster where Kubernetes itself starts to creak (apiserver memory, scheduler latency, etcd write throughput) and multi-cluster federation becomes mandatory.


§6. Architecture in Context

The canonical pattern. Source code becomes a running pod through this pipeline:

                    ┌─────────────────────┐
                    │ developer / git push│
                    └──────────┬──────────┘
                               │
                               ▼
                    ┌─────────────────────┐
                    │  CI (Continuous     │
                    │  Integration):      │
                    │   build image       │
                    │   run unit tests    │
                    │   scan for CVEs     │
                    │   push to registry  │
                    └──────────┬──────────┘
                               │
                               ▼
                    ┌─────────────────────┐
                    │ container image     │   ← OCI image layers,
                    │ registry            │     content-addressable
                    │ (ECR, GCR, Harbor,  │     by SHA256 digest
                    │ Docker Hub)         │
                    └──────────┬──────────┘
                               │
       ┌───────────────────────┼───────────────────────┐
       │                       │                       │
       ▼                       ▼                       ▼
  ┌─────────┐         ┌─────────────────┐    ┌──────────────────┐
  │ git ops │         │  CD pipeline    │    │ manual deploy    │
  │ repo of │         │  (Spinnaker,    │    │ (rare; emergency)│
  │ k8s     │         │   custom script,│    │                  │
  │manifests│         │ Jenkins+kubectl)│    │                  │
  └────┬────┘         └────────┬────────┘    └────────┬─────────┘
       │                       │                      │
       ▼                       ▼                      ▼
  ┌──────────────────────────────────────────────────────────┐
  │  ArgoCD / FluxCD     OR    direct kubectl apply           │
  │  (GitOps controller        (push to apiserver with        │
  │   in-cluster, pulls         credentials)                  │
  │   manifests from git)                                     │
  └──────────────────────────┬───────────────────────────────┘
                             │
                             ▼
            ┌──────────────────────────────────┐
            │   Kubernetes API server          │  ◄── validation, admission
            │   → etcd                          │
            │   → controllers                   │
            │   → scheduler                     │
            │   → kubelet → containerd → runc   │
            └──────────────┬───────────────────┘
                           │
                           ▼
            ┌──────────────────────────────────┐
            │  Pods running on nodes           │
            │   exposed via Service (ClusterIP)│
            │   plus Ingress (L7 ingress       │
            │   controller — Nginx, Traefik,   │
            │   Envoy, Istio gateway)          │
            └──────────────────────────────────┘
                           │
                           ▼
            ┌──────────────────────────────────┐
            │ Traffic from users / other        │
            │ services                          │
            └──────────────────────────────────┘

The canonical pieces:

  • Image registry is content-addressable. A pod references an image by tag (web-app:v1.4.2) or, better, by digest (web-app@sha256:abc...) for immutability. Tags are mutable; digests are not. Production should pin by digest.
  • GitOps tool is the source-of-truth bridge. The cluster watches a git repo (per environment) for manifests and applies them. Changes flow through PRs; rollbacks are git revert.
  • The orchestrator does the heavy lifting from manifest to running container.
  • Service mesh / ingress handles ingress routing, TLS termination, mTLS (mutual TLS) between services, traffic splitting for canary, observability.
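
To make the tag-versus-digest point concrete, here is a minimal sketch. A digest is the SHA-256 of the content it names (for an image reference, the image manifest), so it cannot be repointed the way a tag can. The helper names `content_digest` and `is_pinned_by_digest` are hypothetical, and the check is a simple string test rather than a full image-reference parser.

```python
import hashlib

HEX_DIGITS = set("0123456789abcdef")


def content_digest(blob):
    """Digest string in the 'sha256:<hex>' form used by registries."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def is_pinned_by_digest(image_ref):
    """True if the reference names an image by digest rather than by tag."""
    _, sep, digest = image_ref.partition("@")
    if not sep or not digest.startswith("sha256:"):
        return False
    hex_part = digest[len("sha256:"):]
    return len(hex_part) == 64 and set(hex_part) <= HEX_DIGITS
```

A lint step in CI could reject any manifest whose image fails `is_pinned_by_digest`, which enforces the "production should pin by digest" rule mechanically.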

Variants:

  • Push model: CI/CD has cluster credentials; CI runs kubectl apply. Familiar but couples CI's identity to cluster modifications.
  • Pull model (GitOps): cluster watches git; CI just commits to git. Stronger security model, easier audit.
  • Service mesh optional: many clusters never deploy one. Mesh is most valuable for many-service estates with strict zero-trust requirements; for simple clusters, NetworkPolicy plus Ingress suffices.

§7. Hard Problems Inherent to This Technology

Six problems that anyone using container orchestration will hit. Each: the naive solution, how it fails, the actual fix.

7.1 Bin-packing under heterogeneous resource constraints

The problem in one line: the scheduler must place pods on nodes that fit their CPU, memory, and other resource requests, and bin-packing is NP-hard in general.

The concrete failure: a cluster with 100 nodes, each with 8 cores and 32 GiB RAM, hosts 1500 pods averaging ~400m CPU and ~1.5 GiB RAM each. Aggregate utilization is roughly 75% CPU and 70% RAM (600 of 800 cores, 2250 of 3200 GiB), comfortable on paper. A new pod arrives requesting 4 CPU and 16 GiB RAM. It does not fit on any node, because no single node has 4 unrequested CPU cores left (with the load spread evenly, each node has about 2 cores and 9.5 GiB free), even though aggregate capacity is plentiful. This is the "every node is half full but no node fits this pod" problem.
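
The fragmentation effect is easy to reproduce. The sketch below assumes an evenly loaded cluster in which every node already has 6000m of its 8000m CPU and 22.5 GiB of its 32 GiB RAM requested (about 75% and 70%); the `Node` type and `feasible_nodes` helper are hypothetical and model only the scheduler's resource-fit filter.

```python
from dataclasses import dataclass


@dataclass
class Node:
    cpu_capacity_m: int    # millicores
    mem_capacity_mi: int   # MiB
    cpu_requested_m: int
    mem_requested_mi: int

    def fits(self, cpu_m, mem_mi):
        """Resource-fit filter: the whole request must fit on this one node."""
        cpu_free = self.cpu_capacity_m - self.cpu_requested_m
        mem_free = self.mem_capacity_mi - self.mem_requested_mi
        return cpu_free >= cpu_m and mem_free >= mem_mi


def feasible_nodes(nodes, cpu_m, mem_mi):
    return [n for n in nodes if n.fits(cpu_m, mem_mi)]


# Assumed cluster: 100 identical nodes, each about three-quarters full.
nodes = [Node(8000, 32768, 6000, 23040) for _ in range(100)]
free_cpu_m = sum(n.cpu_capacity_m - n.cpu_requested_m for n in nodes)
```

Here `free_cpu_m` is 200,000 (200 cores free in aggregate), yet `feasible_nodes(nodes, 4000, 16384)` is empty: the free capacity exists only in 2-core slices, so a 4-core pod stays Pending.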

The naive fix: scale up by adding nodes. Wasteful — you have ~25% capacity unused already.

The actual fixes, in order of intervention:

  1. Right-size pod requests. Many pods request 1 CPU and use 100m. Use VPA (Vertical Pod Autoscaler) in recommendation mode to right-size requests across the fleet. Typical result is 30-50% capacity reclaimed.
  2. Topology-aware scheduling. For pods that need to be close (low-latency communication) or far apart (HA), use topologySpreadConstraints to influence placement.
  3. Run the descheduler. A separate controller that evicts pods that "should" be elsewhere — pods on overloaded nodes, pods that violate (now updated) affinity rules, low-utilization pods. The orchestrator does not rebalance on its own; the descheduler bridges this gap.
  4. Cluster autoscaler. Detects pending pods that can't fit any current node, and provisions new nodes from a node group whose instance type fits. Returns nodes when usage drops. The standard tool for elasticity.
  5. Karpenter (AWS). A next-generation autoscaler that picks instance types dynamically (rather than from preset node groups), often finding a better fit. Reports 30-60% cost savings vs traditional Cluster Autoscaler in published case studies.

Illustration from a different domain — ML training. A training job needs 8 GPUs (Graphics Processing Units), preferably on the same node for NVLink bandwidth. The scheduler must find a node with 8 free GPUs and respect any taints (e.g., GPU nodes tainted to keep general workloads off). Without GPU-aware scheduling (nvidia.com/gpu resource type, the NVIDIA device plugin, GPU-feature-discovery labels), the scheduler will refuse to place the pod even on nodes with GPUs because it doesn't know they exist. With it, the scheduler treats GPU as just another resource, and because a pod is always placed on exactly one node, a single pod requesting nvidia.com/gpu: 8 can only land on a node with 8 free GPUs. If the job is instead split across several pods, a requiredDuringSchedulingIgnoredDuringExecution pod-affinity rule is what keeps them on the same node.

7.2 StatefulSet vs Deployment vs DaemonSet vs Job — pick the right primitive

The problem in one line: Kubernetes ships four (really five, with CronJob) primitives for managing pod lifecycles, and picking the wrong one creates correctness or operability problems.

The five:

  • Deployment — stateless replicas. Pods are interchangeable. Rolling updates replace random pods; identity is irrelevant.
  • StatefulSet — stateful replicas. Pods have stable identities (my-db-0, my-db-1, my-db-2), stable storage (each has its own PVC), and ordered startup/shutdown. The default for clustered databases, Kafka brokers, etc.
  • DaemonSet — one pod per node. Used for node-local agents: log shippers (Fluentd, Vector), metrics exporters (node-exporter), CNI plugins, CSI node plugins.
  • Job — runs to completion. One-shot tasks: database migrations, ML training runs, ETL.
  • CronJob — Job on a schedule.

The concrete failure: a team deploys Postgres via a Deployment with replicas: 3 and a PersistentVolumeClaim mounted at /var/lib/postgresql. Because every replica is stamped from the same pod template, all three pods reference the same PVC rather than each owning one. The cluster scales down to one pod, then back up to three. The three new pods cannot tell which one was the primary: they come up with fresh random names, and the Deployment never promised stable identity or per-replica storage. Data is inconsistent.

The naive fix: hand-engineer identity by setting metadata.name. Doesn't work — Deployment generates names with random suffixes.

The actual fix: use StatefulSet. A StatefulSet with serviceName: my-db and replicas: 3 creates pods named my-db-0, my-db-1, my-db-2. Each has its own PVC (my-data-my-db-0, etc.), stable through restarts. The headless Service my-db resolves to specific pod DNS names (my-db-0.my-db.ns.svc.cluster.local). The orchestrator now guarantees that "the pod with identity 0" always has the same data attached.
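
The naming scheme is deterministic, which is the whole point. A small sketch, assuming a volumeClaimTemplate named my-data and the default cluster domain cluster.local; the function `statefulset_identities` is hypothetical and only derives the names, it does not talk to a cluster.

```python
def statefulset_identities(name, service, namespace, replicas,
                           claim_template, cluster_domain="cluster.local"):
    """Derive the stable pod, PVC, and DNS names for each ordinal."""
    identities = []
    for ordinal in range(replicas):
        pod = f"{name}-{ordinal}"
        identities.append({
            "pod": pod,
            # PVC name is <volumeClaimTemplate name>-<pod name>.
            "pvc": f"{claim_template}-{pod}",
            # Per-pod DNS record published through the headless Service.
            "dns": f"{pod}.{service}.{namespace}.svc.{cluster_domain}",
        })
    return identities
```

For `statefulset_identities("my-db", "my-db", "ns", 3, "my-data")`, ordinal 0 maps to pod my-db-0, PVC my-data-my-db-0, and DNS name my-db-0.my-db.ns.svc.cluster.local, and it maps to the same three names after every restart or reschedule.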

Then layer an operator — a custom controller — on top to handle the operational concerns: leader election, failover, backup, version upgrades. CloudNativePG, Strimzi (Kafka), Vitess (sharded MySQL), Crunchy Postgres Operator are widely used examples. The operator pattern is why Kubernetes finally became viable for stateful workloads in the 2020-2024 window.

Illustration in batch — an ML training pipeline. The training script is naturally a Job: run once, succeed or fail, no replication. Argo Workflows or Tekton (or Kubeflow Pipelines) wrap Jobs into multi-step DAGs (Directed Acyclic Graphs) with dependencies, retries, and artifact passing. The orchestrator runs the Jobs; the workflow tool sequences them.

7.3 Persistent storage — "my pod moved to a new node, where's my data?"

The problem in one line: pods are ephemeral; the orchestrator can move them; without explicit storage configuration, the data on the old node is lost.

The concrete failure: a pod writes a 5 GiB SQLite file to its container filesystem. The node hosting the pod is drained for kernel patching. The pod is rescheduled to a different node. The SQLite file is gone. The container filesystem is per-instance.

The naive fix: use emptyDir volume. Doesn't help — emptyDir is also tied to the pod's lifecycle; it dies with the pod.

The actual fix: use a PersistentVolumeClaim (PVC) backed by a network-attached PersistentVolume (PV) — EBS, GCE PD, Azure Disk, Ceph RBD, etc. The PV is independent of any node; it can attach to whichever node the pod runs on.

But this introduces new problems:

  • Zonal volumes. EBS volumes are zonal (us-east-1a or us-east-1b, not both). If the pod is scheduled to a node in us-east-1c, the volume cannot attach. The topology.kubernetes.io/zone label and WaitForFirstConsumer binding mode of the StorageClass jointly ensure the volume is provisioned in the zone where the pod runs.
  • Single-attach (RWO — ReadWriteOnce). Most block storage attaches to one node at a time. If the old node hangs without releasing the volume, the new node can't attach. Recovery requires manual intervention (force-detach) or, increasingly, a CSI driver that handles it (the AWS CSI driver does in newer versions).
  • Multi-attach (RWX — ReadWriteMany). Needed for shared storage (e.g., a stateful Jenkins). Requires a network filesystem (EFS, NFS, CephFS) — slower than block storage. Not supported by every CSI driver.
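
The zonal and single-attach constraints above amount to a small decision procedure. The sketch below is a simplified model, not CSI driver logic: `can_attach` and its parameters are hypothetical, and the zone check applies to zonal block volumes such as EBS.

```python
def can_attach(volume_zone, node_zone, access_mode, attached_nodes, node):
    """Return (ok, reason) for attaching a zonal volume to `node`.

    attached_nodes: set of node names the volume is currently attached to.
    """
    if volume_zone != node_zone:
        return False, "zone mismatch"
    if access_mode == "ReadWriteOnce":
        others = set(attached_nodes) - {node}
        if others:
            # The old node has not released the volume yet.
            return False, "still attached to another node"
    return True, "ok"
```

The second branch is the "old node hangs without releasing the volume" case: `can_attach("us-east-1a", "us-east-1a", "ReadWriteOnce", {"node-old"}, "node-new")` is refused until the stale attachment is cleared.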

Illustration from a database use case: CloudNativePG deploys Postgres on Kubernetes. Each replica is a pod with its own PVC (an EBS gp3 volume, typically 50–500 GiB). When the primary fails, the operator promotes a replica by patching the leader-pod label; the replica's PVC continues to exist. The application reconnects via a Service that selects the new leader. Storage continuity is what makes this work; without it, every failover would be a full data refresh.

7.4 Network policies and zero-trust pod networking

The problem in one line: by default, every pod in a Kubernetes cluster can reach every other pod over the network, and that is a security disaster waiting to happen.

The concrete failure: a developer pushes a Docker image that contains a known CVE (Common Vulnerabilities and Exposures). The image runs as part of the marketing-service Deployment. An attacker exploits the CVE, gets a shell, and then curls the payments-service's internal endpoint. The payments service trusts in-cluster traffic; the attacker exfiltrates payment data.

The naive fix: rely on the cluster's perimeter. Doesn't help once anything inside the cluster is compromised.

The actual fix: NetworkPolicy. A NetworkPolicy object expresses "pods labeled X can be reached only from pods labeled Y on ports Z." A common default is "deny all" plus selective allow:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: default-deny, namespace: payments }
spec:
  podSelector: {}        # selects all pods in this namespace
  policyTypes: [Ingress, Egress]
  # no ingress/egress rules = deny all

Then explicit allows for the legitimate cross-service traffic. NetworkPolicy is enforced by the CNI plugin — Calico and Cilium do; Flannel does not. For L7 (application layer — HTTP methods, paths, headers) policy, the service mesh (Istio AuthorizationPolicy) or Cilium L7 policy is required.

Layer above that: mTLS (mutual TLS) between services, terminated by a service mesh sidecar. Each service has its own short-lived cert (SPIFFE — Secure Production Identity Framework for Everyone — identities are the modern pattern), and traffic is encrypted and authenticated at the connection level. Even if NetworkPolicy is misconfigured, the connection won't establish without valid certs.

Illustration in a feed serving stack: the recommendation service must call the user-profile service for hydration but should never reach the billing service. NetworkPolicy expresses this. Without it, a vulnerability in the recommendation service compromises billing. With it, lateral movement is blocked at the network layer.
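
The evaluation rule behind "default deny plus selective allow" can be sketched in a few lines. This is a deliberately simplified model, assuming a single namespace, label-equality selectors only, and ingress only (no ports, no namespaceSelector, no egress); the dictionary shape and the function names are hypothetical.

```python
def matches(selector, labels):
    """An empty selector matches every pod, as podSelector: {} does."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(src_labels, dst_labels, policies):
    """Apply NetworkPolicy-style ingress semantics to one connection."""
    selecting = [p for p in policies if matches(p["pod_selector"], dst_labels)]
    if not selecting:
        return True  # no policy selects the destination: default allow
    # Once any policy selects the pod, only explicitly allowed peers get in.
    return any(matches(peer, src_labels)
               for policy in selecting
               for peer in policy["ingress_from"])


policies = [
    {"pod_selector": {}, "ingress_from": []},  # default deny for all pods
    {"pod_selector": {"app": "user-profile"},
     "ingress_from": [{"app": "recommendation"}]},
]
```

With these policies, recommendation can reach user-profile but not billing, because billing is selected only by the default-deny policy, which allows no peers. Policies are additive: adding an allow never removes access granted by another policy.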

7.5 Multi-tenancy — namespaces vs separate clusters

The problem in one line: namespaces are soft isolation; for hard multi-tenancy (untrusted workloads, regulatory boundaries, strict resource isolation), namespaces are insufficient.

The concrete failure: an engineering org runs 50 teams' workloads on one cluster, each in its own namespace. Team A deploys a buggy pod with no memory limit that allocates 100 GiB; the node comes under memory pressure, the kubelet starts evicting pods, and the kernel OOM (Out Of Memory) killer starts killing processes, including team B's database. The kernel does not respect namespace boundaries for memory pressure.

The naive fix: tighter ResourceQuotas. Helps with quota accounting but doesn't prevent node-level resource contention from a single greedy pod when limits are unbounded or improperly set.

The actual fix is multi-layered:

  • ResourceQuota caps per-namespace requested resources. Prevents one team from consuming all of CPU/memory at admission time.
  • LimitRange sets defaults and maxima for individual pod requests/limits.
  • Pod Security Standards (Restricted/Baseline/Privileged) limit kernel capabilities.
  • NetworkPolicy isolates per-namespace traffic.
  • PriorityClass + PodDisruptionBudget ensure that high-priority workloads can preempt low-priority ones (never the reverse) and that voluntary evictions stay within each workload's availability budget.
  • For hard multi-tenancy:
      • Separate clusters per tenant. The cleanest answer.
      • vCluster — virtual Kubernetes clusters running inside a host cluster, each with its own apiserver and controllers.
      • gVisor / Kata Containers — runtime sandboxes that provide stronger kernel isolation per pod.

Illustration in SaaS: a B2B SaaS hosting customer code (e.g., a webhook platform like Pipedream or a notebook host like Hex) runs untrusted user code. Hard multi-tenancy is mandatory. Most such products use Firecracker microVMs (AWS Fargate's underlying tech) or gVisor for each tenant rather than relying on namespace isolation.

7.6 Cluster upgrades without downtime

The problem in one line: Kubernetes ships a new version every 4 months and supports each version for ~14 months — meaning upgrades are a constant operational concern, and a botched upgrade can take down the cluster.

The concrete failure: a team runs Kubernetes 1.24 and upgrades to 1.28. Along the way, kubectl apply starts failing for a CRD they depend on, and 1.25 removes the policy/v1beta1 PodDisruptionBudget API (deprecated since 1.21) that they have hardcoded in scripts. Half their Deployments fail to apply during the upgrade.

The naive fix: upgrade in place. Risky — control plane downtime, deprecated APIs surface immediately.

The actual fix:

  1. Read the release notes. Every Kubernetes minor release has a "removed APIs" section. Audit your manifests against it.
  2. Use tools like kubent to find usage of deprecated APIs, and the kubectl convert plugin to rewrite manifests to the replacement API versions.
  3. Upgrade the control plane first, in place (managed Kubernetes handles this — GKE/EKS/AKS upgrade control planes with rolling control-plane node replacement). Workloads keep running.
  4. Then upgrade nodes, one at a time, by draining (which gracefully evicts pods, respecting PodDisruptionBudgets) and then replacing.
  5. For very risky upgrades, provision a new cluster, deploy workloads in parallel, shift traffic, decommission old. The "blue-green cluster" pattern. Expensive but bulletproof.
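
The manifest audit in steps 1 and 2 can be sketched as a lookup against a table of removed APIs. This is an illustration, not a replacement for kubent: the table below holds only three well-known removals (a real audit needs the full list from the release notes), manifests are assumed to be already parsed into dicts, and `removed_apis` is a hypothetical helper.

```python
# (apiVersion, kind) -> first Kubernetes version that no longer serves it.
REMOVED_IN = {
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
    ("batch/v1beta1", "CronJob"): (1, 25),
}


def parse_version(version):
    """'1.28' or 'v1.28.3' -> (1, 28)."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def removed_apis(manifests, target_version):
    """List manifests whose API is gone at or before the target version."""
    target = parse_version(target_version)
    hits = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        removed = REMOVED_IN.get(key)
        if removed is not None and removed <= target:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            hits.append((name, key[0], key[1], f"{removed[0]}.{removed[1]}"))
    return hits
```

Running this in CI against the rendered manifests, with the target version set to the next planned upgrade, turns a surprise during the upgrade into a failed pull-request check weeks earlier.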

Illustration in a high-availability service: a payments cluster cannot afford even seconds of downtime. The operator runs the upgrade as a blue-green at the cluster level — new cluster ("green") provisioned on the new Kubernetes version, all workloads deployed via GitOps, traffic shifted at the load balancer, old cluster ("blue") decommissioned a week later. Total upgrade cost: 2x infra for a week. Total downtime: zero.


§8. Failure Mode Walkthrough

What happens when things break, and what survives.

8.1 Worker node loss

A node disappears (kernel panic, hardware failure, EC2 instance terminated, network partition).

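  • After ~40 seconds without heartbeats (the default node-monitor-grace-period in most releases), the node-lifecycle controller marks the node NotReady. After ~5 more minutes (the default 300-second toleration for the not-ready and unreachable taints, which replaced the older pod-eviction-timeout flag), it starts evicting the node's pods. Eviction is logical: the pods on the node are marked for deletion, but the node may still be running them; this is the source of "split-brain" risks for stateful workloads.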
  • Stateless pods get rescheduled to other nodes by the Deployment/ReplicaSet controller. Total disruption: minutes, depending on pod startup time.
  • Stateful pods (StatefulSet) are rescheduled only after the operator confirms safety. The operator may pause; manual intervention may be needed if quorum is at risk.
  • What survives: etcd state is intact. The pods that were on the dead node are restarted elsewhere. Data in PVs reattaches to the new pod (after the volume detaches from the dead node, which can be slow if the cloud doesn't know the node is dead).

Recovery procedure: wait. The orchestrator is designed to handle this. If volumes stick, manually force-detach via the cloud console.

8.2 Control plane node loss

One of three (or five) control plane nodes dies.

  • etcd: loses one of three members. The remaining two form a quorum of 2-of-3 and continue serving. If a second member dies, write quorum is lost.
  • apiserver: the surviving instances continue serving. The load balancer routes around the dead instance.
  • Scheduler / controller-manager: these have leader-election semantics. The leader on the dead node loses its lease (~15 seconds), and another instance becomes leader.
  • Workloads: running pods continue to run. New pod placement, scaling, and rollouts pause briefly (~30 seconds) until a new scheduler leader is elected.

Recovery procedure: replace the failed node. Add a new etcd member. The new member catches up via Raft replication.
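
The 2-of-3 arithmetic generalizes to any cluster size. A minimal sketch of Raft majority math; the function names are hypothetical.

```python
def quorum(members):
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can fail while the cluster still accepts writes."""
    return members - quorum(members)
```

A 3-member cluster tolerates 1 failure and a 5-member cluster tolerates 2. A 4-member cluster still tolerates only 1, which is why etcd clusters are sized with odd member counts.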

8.3 etcd loses quorum

Two of three (or three of five) etcd members die at once. This is the catastrophic case.

  • The apiserver becomes read-only (linearizable reads also block; only stale reads from local etcd may serve, depending on config).
  • No new pods can be scheduled. No new deployments can be applied.
  • Running pods continue to run. This is critical: the data plane survives even when the control plane is down. The orchestrator's level-triggered design pays off here.

Recovery procedure:

  1. Identify the still-alive member.
  2. Take a snapshot from its on-disk WAL (or use the most recent scheduled snapshot).
  3. Restore from snapshot by initializing a new etcd cluster with the snapshot and bringing it up.
  4. Reconfigure the apiserver to point at the new cluster.
  5. Verify state. Some recent changes (the last few minutes) may be lost depending on snapshot recency.

This is why etcd snapshots on a regular cadence (every 30 minutes is common) are non-optional. The DURABILITY POINT is the snapshot file.

8.4 CNI failure

The CNI plugin (or its underlying dataplane — VXLAN, BGP, eBPF programs) breaks across the cluster. This is rarer but very high-blast-radius.

  • Existing connections may keep working (kernel sockets are already established).
  • New connections fail. Pod-to-pod, pod-to-service, ingress-to-pod — all may fail.
  • DNS resolution fails because CoreDNS pods can't be reached.

Recovery: depends on the failure mode. If a Cilium daemon crashed cluster-wide due to a bad config push, revert the config. If the BGP session collapsed, restart the BGP daemon. If a kernel bug, reboot nodes one by one.

The lesson: CNI changes are second only to etcd changes in blast radius. Test in a smaller cluster first; roll out gradually.

8.5 Image pull failures

A pod is scheduled but goes into ImagePullBackOff because the registry is unreachable or the image doesn't exist at the requested tag.

  • The pod stays in Pending → ImagePullBackOff indefinitely, retrying with exponential backoff (up to 5 minutes between attempts).
  • Other pods continue to run. This is a per-pod failure, not a cluster failure.
  • Recovery: fix the image reference, or fix the registry. Pin to digests to avoid "image moved" surprises.

8.6 CrashLoopBackOff

A container starts, crashes, gets restarted, crashes again. Kubernetes ramps up the backoff exponentially (10s, 20s, 40s, 80s, capped at 5 minutes).

  • The pod is technically running but in a useless state.
  • Readiness probes are key — if the readiness probe fails, the pod is removed from Service endpoints, so traffic doesn't hit a broken pod.
  • Recovery: look at the container logs (kubectl logs --previous), fix the bug, push a new image, redeploy.
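
The backoff schedule quoted above is a capped doubling sequence. A minimal sketch using the 10-second base and 5-minute cap from the text; `restart_backoff` is a hypothetical helper, not kubelet code.

```python
def restart_backoff(restarts, base=10, cap=300):
    """Delay in seconds before each of the first `restarts` restarts."""
    return [min(base * 2 ** attempt, cap) for attempt in range(restarts)]
```

`restart_backoff(7)` gives 10, 20, 40, 80, 160, 300, 300: the sixth restart would be 320 seconds uncapped, so from there on the pod is retried once every 5 minutes.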

8.7 PodDisruptionBudget blocking node drain

A node needs to be drained for maintenance. A PodDisruptionBudget (PDB) says "at least 2 of 3 replicas must always be available." Draining a node would violate the PDB if multiple replicas are on it.

  • The drain hangs. kubectl drain blocks until the PDB is satisfiable.
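  • Recovery: spread the replicas first (topologySpreadConstraints makes this automatic), or, for emergency drains, add --disable-eviction to the usual --ignore-daemonsets --delete-emptydir-data --force (it deletes pods directly instead of going through the eviction API, bypassing PDB checks), accepting the brief unavailability.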
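
Why the drain hangs follows from the budget arithmetic. The sketch below models a minAvailable-style PDB only; `disruptions_allowed` and `drain_plan` are hypothetical helpers, and the real eviction API re-evaluates the budget on every eviction request.

```python
def disruptions_allowed(healthy, min_available):
    """Evictions the budget permits right now."""
    return max(healthy - min_available, 0)


def drain_plan(pods_on_node, healthy, min_available):
    """Return (evictable_now, blocked) for the replicas on the draining node."""
    allowed = disruptions_allowed(healthy, min_available)
    evict_now = min(pods_on_node, allowed)
    return evict_now, pods_on_node - evict_now
```

With 3 healthy replicas, minAvailable of 2, and 2 replicas on the node being drained, `drain_plan(2, 3, 2)` returns (1, 1): one pod is evicted immediately, and the second waits until the first replacement becomes healthy elsewhere. If replacements cannot start, the drain blocks indefinitely.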

§9. Why Not Just SSH to Servers and systemd?

Why does container orchestration exist? What's the naive alternative, and why does it break?

The naive replacement: a fleet of VMs, each running systemd, with services managed by hand or by a configuration manager (Ansible, Chef). Deployments are scp-and-restart. Service discovery is hardcoded IPs or a DNS round-robin. Scaling is "provision more VMs and tell the load balancer."

This works at small scale and gets painful in predictable steps:

  1. At ~20 services, manual scaling becomes a constant chore. Need 10% more capacity? SSH to 50 boxes, edit configs, restart. Drift creeps in — some boxes get the new config, some don't.

  2. No self-healing. A service crashes; systemd restarts it. systemd can't reschedule it to a different machine if the machine itself is broken. The fleet's owner has to notice and intervene. Engineering on-call burden grows linearly with fleet size.

  3. Rolling updates are bespoke. Every team writes their own "deploy with 20% at a time" logic. Bugs in that logic cause outages. Some teams skip it and deploy all-at-once; some teams over-engineer it.

  4. Service discovery degrades. Hardcoded IPs break when machines are replaced. DNS works but is slow to converge. A purpose-built service registry (Consul, etcd, ZooKeeper) helps but adds its own ops burden.

  5. Resource isolation is whatever the kernel gives you. Two services on the same VM compete for memory; OOM-killer takes the wrong one. You either over-provision (one service per VM, expensive) or accept noisy neighbors.

  6. The "hundreds of servers" tipping point. Somewhere around 100-500 servers, the cumulative ops burden of "SSH and systemd" exceeds the burden of running Kubernetes. The crossover depends on team size and skill, but the curve is real.

The walk-through: a team grows from 5 services on 10 VMs to 50 services on 200 VMs over 2 years. Initially they used Ansible playbooks and hand-rolled deployment scripts. They now spend half a sprint per quarter on deployment infrastructure: writing scripts to drain-and-replace, retrofitting service discovery, debugging "why is this service running an old version on box-37." Eventually they migrate to Kubernetes; the deployment infrastructure becomes a few hundred lines of YAML and a CI pipeline. The team's velocity on actual product work goes up.

The orchestrator does not eliminate complexity; it moves the complexity into shared, well-understood, declarative primitives. That trade is worth it once you have enough services that the per-service overhead of "build your own deploy script" exceeds the platform overhead of "learn Kubernetes."


§10. Scaling Axes

Orchestrators scale along two distinct axes; the failure modes are different.

Type 1 — uniform growth: more services, more nodes

Linear scaling, mostly. Add nodes to add capacity. Adapt to higher pod counts by tuning the control plane.

  • Up to ~150 nodes per cluster with default settings: no special tuning. Standard managed Kubernetes (GKE/EKS/AKS) handles this transparently.
  • 150-1000 nodes: dedicated control plane sized up (4-8 cores on apiserver, ~16 GiB RAM on etcd). Tune the kubelet's --max-pods (default 110; can go to 250 on big nodes; managed platforms expose it under their own name, such as GKE's --max-pods-per-node). Tune --kube-api-qps and --kube-api-burst so that controllers don't get rate-limited.
  • 1000-5000 nodes: this is the soft ceiling for a single Kubernetes cluster. Beyond ~5000 nodes, scheduling latency, etcd write throughput, and apiserver memory all start hitting the wall. Spotify, Uber, Stripe reportedly cap at this size and use multiple clusters.
  • Beyond 5000 nodes: multi-cluster federation. Cluster API, Karmada, or per-environment / per-region clusters managed by GitOps ApplicationSet (in ArgoCD) or Flux's Kustomization tree. There is no single-cluster solution; you split.

Type 2 — hotspot intensification: same services, higher traffic

A single service needs to scale from 100 to 100,000 QPS. The mechanism is per-service:

  • HPA (Horizontal Pod Autoscaler) scales replica count based on CPU utilization, memory, or custom metrics. With metrics-server for CPU/memory and a custom metrics adapter (such as prometheus-adapter) for Prometheus-derived metrics, an HPA can target "scale to keep CPU at 70%" or "scale to keep p95 latency at 200ms." Standard tool.
  • KEDA (Kubernetes Event-driven Autoscaling) generalizes HPA to external event sources: Kafka lag, SQS queue depth, Postgres query result, CloudWatch alarm. KEDA's design point is event-driven workloads: a service that processes messages from a queue should scale by queue depth, not CPU.
  • VPA (Vertical Pod Autoscaler) adjusts per-pod resource requests. Less useful for hot services (which want more replicas, not bigger pods) but excellent for right-sizing in the steady state.
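
The core HPA calculation is a ratio: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1.0 (10% by default). A minimal sketch follows; it leaves out minReplicas/maxReplicas clamping, stabilization windows and not-yet-ready pods, and the function name is hypothetical.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         tolerance=0.1):
    """Replica count the HPA ratio rule would ask for."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Target is 70% CPU.
print(hpa_desired_replicas(4, 90, 70))   # 6  (scale up)
print(hpa_desired_replicas(10, 72, 70))  # 10 (inside tolerance, no change)
print(hpa_desired_replicas(10, 35, 70))  # 5  (scale down)
```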

The scaling curve:

  • Below 1k QPS / 50 replicas: HPA with default settings is enough. Scale-up takes ~1-3 minutes (pod schedule + image pull + warmup). Acceptable for most workloads.
  • 1k–10k QPS / 50-500 replicas: HPA is still fine but tune behavior.scaleUp.policies to scale more aggressively. Consider over-provisioning headroom (minReplicas sized for ~30% above current load) to absorb spikes before HPA reacts.
  • 10k–100k QPS: replica count matters less than node capacity. You may saturate a single Service object's iptables/IPVS routing rules; consider headless Service + client-side load balancing, or move to a service mesh.
  • 100k+ QPS: shard the service. Different shards in different namespaces / clusters; clients route by key. The orchestrator handles a single shard; sharding is above it.

Inflection points:

  • At ~150 nodes: dedicated platform team becomes necessary; the cluster ops burden exceeds "side task."
  • At ~5000 nodes: multi-cluster federation becomes mandatory.
  • At ~10k QPS per service: observability investment becomes mandatory; you can't troubleshoot at this scale by reading logs.

§11. Decision Matrix vs Adjacent Categories

When to pick Kubernetes vs Nomad vs ECS vs Serverless vs bare VMs.

| Dimension | Kubernetes | Nomad | ECS / Fargate | AWS Lambda | Bare VMs + systemd |
| --- | --- | --- | --- | --- | --- |
| Smallest viable scale | ~3 nodes + ~50 pods | ~1 node | ~10 tasks | ~1 function | 1 VM |
| Largest cluster (single) | 5,000 nodes | 10,000 nodes | thousands | unlimited (no cluster) | hundreds |
| Operational complexity | High | Low-medium | Low (managed) | Very low | Medium (Ansible-driven) |
| Stateful support | Yes (StatefulSet + operators) | Limited | Limited (EFS-backed) | No | Yes (full control) |
| Multi-cloud / on-prem | Yes | Yes | No (AWS only) | Vendor-specific | Yes |
| Heterogeneous workloads (containers + VMs + binaries) | Containers only | Yes (multi-driver) | Containers only | Functions only | Anything |
| Cost overhead | ~3 control-plane nodes + 5-10% compute | ~1-3 servers + 2-3% compute | Per-task / per-vCPU-hour | Per-invocation | Per-VM |
| Cold start latency | 5-60s (image pull dominant) | 1-10s | 30-60s (Fargate) | 50-500ms | N/A (always on) |
| Vendor lock-in | Low (CNCF standard) | Low (HashiCorp) | High (AWS) | Very high | None |
| Community / ecosystem | Massive | Medium | Medium | Massive | Massive |
| Best for | Long-running services, microservices at scale | Mixed workloads, simpler ops, edge | AWS shops, especially ones already deep in AWS | Spiky / event-driven workloads | <50 servers, simple apps, full control needs |
| Skip if | Tiny scale (<100 pods); kernel-tied legacy | You want a single dominant ecosystem | Not in AWS | Workload >15min, needs persistent connections, large memory | More than ~100 servers |

Specific thresholds

  • Pick Kubernetes if: >100 pods, multiple environments, multi-team, multi-cloud or on-prem, mainstream stack (you can hire people who know it).
  • Pick Nomad if: HashiCorp shop, simpler ops desired, multi-workload (containers + raw binaries + VMs), edge/distributed (Cloudflare's edge runs Nomad across hundreds of POPs).
  • Pick ECS / Fargate if: AWS-only, want managed everything, willing to accept lock-in, simpler workload patterns.
  • Pick Lambda / Cloud Functions if: event-driven, short-running, spiky, no persistent state in the function. The workload that fits Lambda is narrow but when it fits, the operational cost is near-zero.
  • Pick bare VMs + systemd if: small scale, special hardware needs, regulatory or compliance pinning to specific OS images, full control over kernel. Don't pick this for >100 services.

§12. Deployment Patterns

This is where the technology meets the discipline. The same Kubernetes cluster can be used to deploy with completely different risk profiles depending on which pattern you adopt.

12.1 Rolling update — the default

Kubernetes Deployments default to RollingUpdate, parametrized by maxUnavailable and maxSurge. Both can be absolute numbers or percentages.

maxUnavailable: 25%   ← at most 25% of replicas may be unavailable at a time
maxSurge: 25%         ← at most 25% extra replicas may be created beyond
                        the desired count during the rollout

For a Deployment with replicas: 8:
  - At most 2 pods are unavailable at any time
  - At most 10 pods exist transiently (8 + 2 surge)
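
The bounds above follow from how percentages are resolved: maxUnavailable rounds down and maxSurge rounds up. A small sketch of that arithmetic, with hypothetical function names:

```python
import math


def resolve(value, replicas, round_up):
    """Turn an absolute number or a 'NN%' string into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)


def rollout_bounds(replicas, max_unavailable="25%", max_surge="25%"):
    """Minimum available pods and maximum total pods during a rollout."""
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    surge = resolve(max_surge, replicas, round_up=True)
    return {"min_available": replicas - unavailable,
            "max_total": replicas + surge}


print(rollout_bounds(8))   # {'min_available': 6, 'max_total': 10}
print(rollout_bounds(10))  # {'min_available': 8, 'max_total': 13}
```

The second call shows the asymmetric rounding: 25% of 10 is 2.5, so at most 2 pods may be unavailable while up to 3 extra pods may be created.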

The rollout:

Time 0:  ┌─────────────────────────────────────┐
         │ v1: 8 replicas                      │  ← steady state
         └─────────────────────────────────────┘

Time 1:  ┌─────────────────────────────────────┐
         │ v1: 6 ready, 2 terminating          │
         │ v2: 2 ready                         │  ← 25% replaced
         └─────────────────────────────────────┘

Time 2:  ┌─────────────────────────────────────┐
         │ v1: 4 ready, 2 terminating          │
         │ v2: 4 ready                         │
         └─────────────────────────────────────┘

...

Time N:  ┌─────────────────────────────────────┐
         │ v2: 8 ready                         │  ← rolled out
         └─────────────────────────────────────┘

Pro: simple, built-in, no extra cost, zero downtime if traffic shifts gracefully.

Con: v1 and v2 run concurrently, so any change to a shared contract (database schema, API) must be backward-compatible. Rollback is also a rolling update, so it is slow.

Best for: small-to-medium services with backward-compatible changes. Most workloads.

12.2 Blue-green deployment

Two complete production environments, "blue" (current) and "green" (new). Deploy v2 to green while blue serves traffic. Once green is verified, flip the load balancer.

Step 1:  ┌─────────────┐         ┌─────────────┐
         │ Blue (v1)   │ ← all   │ Green       │  (empty / being built)
         │ serving     │  traffic│             │
         └─────────────┘         └─────────────┘

Step 2:  ┌─────────────┐         ┌─────────────┐
         │ Blue (v1)   │ ← all   │ Green (v2)  │
         │ serving     │  traffic│ smoke tests │
         └─────────────┘         └─────────────┘

Step 3:  ┌─────────────┐         ┌─────────────┐
         │ Blue (v1)   │         │ Green (v2)  │ ← all traffic
         │ idle        │         │ serving     │
         └─────────────┘         └─────────────┘
         (kept for rollback)

Step 4:  ┌─────────────┐         ┌─────────────┐
         │ Blue        │         │ Green (v2)  │
         │ decommiss.  │         │ now "blue"  │
         └─────────────┘         └─────────────┘

Pro: atomic switch. Rollback is instant (flip the LB back). No version skew during deploy.

Con: 2x infrastructure during the transition. Long-lived sessions / WebSockets / streaming must be drained, or you accept that they break at the flip.

Best for: stateless services where rollback latency matters (e.g., a deploy at 3am that turns out to be broken: you want a 30-second rollback, not 30 minutes).

Implementation in Kubernetes: deploy two Deployments (web-app-blue with v1, web-app-green with v2) selected by separate labels. The Service selects whichever label points to the "live" Deployment. Flip the Service selector to switch.
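
The selector flip can be pictured with plain dictionaries standing in for the Service manifest. This is a sketch only: the `slot` label key and its values are assumptions made for illustration, and no cluster API is called.

```python
import copy


def flip_service(service, live_slot):
    """Return a copy of the Service whose selector points at `live_slot`."""
    flipped = copy.deepcopy(service)
    flipped["spec"]["selector"]["slot"] = live_slot
    return flipped


service = {
    "kind": "Service",
    "metadata": {"name": "web-app"},
    # Hypothetical labels: web-app-blue carries slot=blue, web-app-green slot=green.
    "spec": {"selector": {"app": "web-app", "slot": "blue"}},
}

cutover = flip_service(service, "green")   # send all traffic to green
rollback = flip_service(cutover, "blue")   # rollback is the same operation

print(cutover["spec"]["selector"])   # {'app': 'web-app', 'slot': 'green'}
print(rollback["spec"]["selector"])  # {'app': 'web-app', 'slot': 'blue'}
```

Cutover and rollback are the same one-field change, which is what makes the rollback fast.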

12.3 Canary deployment

Gradually shift traffic from v1 to v2, monitoring metrics at each step. Roll forward if metrics are healthy; roll back if not.

Phase 1 (1%):    ┌────────┐  ┌──┐
                 │ v1 99% │  │v2│
                 │        │  │1%│
                 └────────┘  └──┘
                 Wait 15min. Check error rate, p95 latency, CPU.

Phase 2 (10%):   ┌─────────┐ ┌────┐
                 │ v1 90%  │ │ v2 │
                 │         │ │10% │
                 └─────────┘ └────┘

Phase 3 (50%):   ┌──────┐    ┌──────┐
                 │ v1   │    │ v2   │
                 │ 50%  │    │ 50%  │
                 └──────┘    └──────┘

Phase 4 (100%):              ┌────────┐
                              │ v2 100%│
                              └────────┘

Pro: blast radius capped at the canary percentage. Bugs are caught early with limited customer impact. Metrics-driven rollback is automatable.

Con: the rollout is slow (hours to days for risk-averse deploys). Requires good metrics and traffic-splitting infrastructure.

Best for: customer-facing services where a regression is expensive. Standard at Netflix, Lyft, Stripe.

Implementation:

  • Argo Rollouts (Kubernetes-native progressive delivery controller) handles canary natively. Define stages, weights, pauses, and metric checks declaratively.
  • Flagger (Flux-ecosystem analog) does the same.
  • Service mesh traffic splitting (Istio VirtualService, Linkerd) shifts traffic at the L7 layer.
  • Ingress controller weights for simpler cases (Nginx, Traefik).

The metric gates are the secret sauce. A canary rollout that doesn't query Prometheus / Datadog and abort on regression is just a slow rolling update.
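
A metric gate can be as small as a comparison between the canary's error rate and the baseline's. The sketch below is one possible rule, not the one any particular tool uses: the thresholds (`max_ratio`, `min_requests`, the 0.1% floor) are illustrative assumptions, and the counts would come from a metrics backend such as Prometheus.

```python
def canary_verdict(baseline_errors, baseline_total,
                   canary_errors, canary_total,
                   max_ratio=1.5, min_requests=100):
    """Decide whether to wait, promote, or abort a canary phase."""
    # Too little canary traffic to judge: keep waiting.
    if canary_total < min_requests:
        return "wait"
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    # Allow up to max_ratio x the baseline rate, with a small absolute
    # floor so a near-zero baseline does not make the gate impossible.
    limit = max(baseline_rate * max_ratio, 0.001)
    return "abort" if canary_rate > limit else "promote"


print(canary_verdict(10, 10_000, 0, 50))     # wait
print(canary_verdict(10, 10_000, 1, 1_000))  # promote
print(canary_verdict(10, 10_000, 5, 1_000))  # abort
```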

12.4 Dark launch

Deploy the code, but gate behind a feature flag (see 13_feature_flag.md). Traffic still goes to v2 pods, but the new code path is dormant. Once verified safe, flip the flag to enable.

Step 1: Deploy v2 pods. Flag = OFF.
        Behavior: identical to v1 (new code path not entered).

Step 2: Enable flag for internal users only.
        Behavior: internal users see new code path; everyone else: old.

Step 3: Enable for 1% of external users.

Step 4: Enable for 100%.

Step 5: Remove the flag (and dead code) in a later release.

Pro: decouples deploy from release. Rollback is "flip the flag", which takes milliseconds.

Con: the codebase carries feature-flag clutter. Discipline required to clean up old flags.

Best for: any user-visible change with non-trivial blast radius. The backbone of "many deploys per day" cultures.
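
The staged enablement in steps 1 to 4 can be sketched as a percentage gate with stable per-user bucketing, so that a user who sees the new path at 1% still sees it at 100%. This is a minimal illustration, not a real flag service; the flag name, user IDs and function names are hypothetical.

```python
import hashlib


def bucket(flag, user_id):
    """Map (flag, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 100


def flag_enabled(flag, user_id, percent, internal_users=frozenset()):
    """True if the new code path should run for this user."""
    if user_id in internal_users:
        return True                      # step 2: internal users first
    return bucket(flag, user_id) < percent


staff = {"staff-7"}
print(flag_enabled("new-checkout", "u1", 0))              # step 1: False
print(flag_enabled("new-checkout", "staff-7", 0, staff))  # step 2: True
print(flag_enabled("new-checkout", "u1", 100))            # step 4: True
```

Hashing the flag name together with the user ID keeps different flags from always selecting the same 1% of users.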

12.5 A/B deployment

Multiple versions running concurrently, with traffic split based on some attribute (user cohort, region, A/B test bucket). Not really about safe deployment — about controlled experimentation.

                    ┌────────────────┐
   User request ──► │ Router / mesh  │
                    └────────┬───────┘
                             │
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
        ┌──────────┐  ┌──────────┐  ┌──────────┐
        │ Variant A│  │ Variant B│  │ Variant C│
        │ (control)│  │ (test)   │  │ (test)   │
        │ 80%      │  │ 10%      │  │ 10%      │
        └──────────┘  └──────────┘  └──────────┘

The deployment system supports this by deploying multiple Deployments behind weighted routing rules. The experimentation system (see 13_feature_flag.md) is what decides which user goes where.

12.6 Shadow traffic / traffic mirroring

Send a copy of production traffic to v2; ignore the response; observe v2's behavior. Validate that v2 doesn't crash, doesn't leak memory, and doesn't behave wildly differently, all without any user being affected.

                    ┌────────────────┐
   User request ──► │ Router / mesh  │
                    └────────┬───────┘
                             │
                ┌────────────┴────────────┐
                │                         │
                ▼                         ▼
          ┌──────────┐              ┌──────────┐
          │ v1       │              │ v2       │ ← receives mirrored copy
          │ (live)   │              │ (shadow) │   response is discarded
          └────┬─────┘              └──────────┘
               │
               ▼
          ┌──────────┐
          │ User sees│
          │ v1's     │
          │ response │
          └──────────┘

Pro: zero-risk validation against real traffic. Catches bugs that synthetic tests miss.

Con: double-spends compute. Side effects (writes to a real database, sending emails) must be either disabled in v2 or made idempotent. Setup requires traffic-mirroring support in the mesh.

Best for: large refactors where the externally-observable behavior must be identical. Used heavily for service migrations (Go rewrite of a Java service, etc.).

Istio supports this via the mirror field of a VirtualService. Envoy proxies do too.
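
The essential contract of mirroring is that the shadow can never change what the user sees. The sketch below shows that contract with two plain callables standing in for v1 and v2. It runs the shadow call synchronously for clarity, whereas a real proxy mirrors asynchronously; all names are hypothetical.

```python
def handle_with_shadow(request, live, shadow, observations):
    """Serve from `live`; send a copy to `shadow` and only record what happened."""
    response = live(request)
    try:
        shadow_response = shadow(request)
        # Record whether v2 agreed with v1; the shadow response is then dropped.
        observations.append(("ok", shadow_response == response))
    except Exception as exc:
        # A crash in the shadow is an observation, never a user-facing error.
        observations.append(("error", repr(exc)))
    return response


def v1(request):
    return request.upper()


def v2_broken(request):
    raise RuntimeError("boom")


log = []
print(handle_with_shadow("ping", v1, v2_broken, log))  # PING
print(handle_with_shadow("ping", v1, v1, log))         # PING
print(log[1])                                          # ('ok', True)
```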

12.7 The "deploy on Friday" anti-pattern

A cultural concern, not a technical one. The thinking: "don't deploy on Friday because if it breaks, you ruin your weekend." Symptoms:

  • Deploys cluster into Monday-Thursday.
  • Long-lived feature branches accumulate.
  • Big-bang Monday deploys carry more risk than the small daily deploys they replaced.

The fix is not "deploy on Friday anyway." It's making every deploy low-risk:

  • Trunk-based development. Short-lived branches.
  • Small frequent deploys instead of large infrequent ones (Accelerate / DORA — DevOps Research and Assessment — research confirms deploy frequency negatively correlates with change failure rate).
  • Feature flags to decouple deploy from release.
  • Automated canary with metric gating, so a bad deploy auto-rolls-back without a human in the loop.
  • 24/7 on-call rotation that doesn't punish weekend incidents.

Done right, deploys become a non-event. Stripe, Etsy, and similar engineering cultures deploy hundreds of times per day across all hours, with confidence.


§13. GitOps and CI/CD

The delivery layer that sits between source code and the orchestrator.

13.1 Push-based deploy

The traditional model. CI builds the image, pushes to registry, then runs kubectl apply against the target cluster with stored credentials.

       ┌──────────┐       ┌──────────┐      ┌──────────┐
       │ git push │ ────► │ CI runner│ ───► │ image    │
       └──────────┘       │ builds + │      │ registry │
                          │ tests +  │      └──────────┘
                          │ pushes   │             │
                          └──────────┘             ▼
                                │           ┌──────────┐
                                └─────────► │ kubectl  │ ── credentials stored in CI
                                            │ apply    │
                                            └──────────┘
                                                  │
                                                  ▼
                                            ┌──────────┐
                                            │ Cluster  │
                                            └──────────┘

Pro: simple, low-tech, direct.

Con: the CI system must have cluster credentials. If the CI system is compromised, the cluster is compromised. Audit trail is in CI logs, which may not be retained. Drift between desired state (in git) and actual state (in cluster) is hard to detect.

Most teams started here and graduated to GitOps.

13.2 Pull-based GitOps

ArgoCD or FluxCD runs inside the cluster and watches a git repository. When the repo changes, the controller pulls the new manifests and applies them. The cluster initiates outbound connections; no inbound credentials needed.

       ┌──────────┐       ┌──────────┐      ┌──────────┐
       │ git push │ ────► │ CI       │ ───► │ image    │
       └──────────┘       │ builds + │      │ registry │
                          │ tests +  │      └──────────┘
                          │ commits  │
                          │ updated  │
                          │ manifest │
                          │ to       │
                          │ "config" │
                          │ repo     │
                          └──────────┘
                                │
                                ▼
                          ┌──────────────────────┐
                          │ git config repo      │  ← source of truth
                          │ (Kubernetes manifests)│
                          └──────────────────────┘
                                ▲
                                │ git pull (every 3 minutes
                                │  or on webhook)
                                │
                          ┌──────────────────────┐
                          │ ArgoCD in cluster    │
                          │  - polls git         │
                          │  - applies manifests │
                          │  - reports drift     │
                          └──────────────────────┘
                                │
                                ▼
                          ┌──────────────────────┐
                          │  Cluster             │
                          └──────────────────────┘

Pro:

  • No inbound credentials to the cluster. CI only needs to commit to git.
  • Source of truth is git, with full audit trail.
  • Drift detection is automatic: ArgoCD reports any in-cluster resource that doesn't match git.
  • Multi-cluster scales naturally (ApplicationSet pattern: one git repo, many target clusters).

Con:

  • Higher initial setup cost.
  • Secrets handling requires extra tooling (Sealed Secrets, External Secrets Operator, SOPS) since you can't store plaintext secrets in git.
  • Rollback is git revert, slightly slower than a direct kubectl rollout undo.
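
Drift detection reduces to comparing two maps: what git says should exist and what the cluster reports. A minimal sketch, assuming resources are keyed by a "Kind/name" string and compared as whole specs; a real controller also normalizes defaulted fields and ignores server-managed metadata, which this omits.

```python
def detect_drift(desired, live):
    """Classify every resource that differs between git and the cluster."""
    drift = {}
    for key in sorted(desired.keys() | live.keys()):
        if key not in live:
            drift[key] = "missing"     # in git, not in the cluster
        elif key not in desired:
            drift[key] = "extra"       # in the cluster, not in git
        elif desired[key] != live[key]:
            drift[key] = "modified"    # someone changed it out of band
    return drift


desired = {"Deployment/web": {"replicas": 3}, "Service/web": {"port": 8080}}
live = {"Deployment/web": {"replicas": 5}, "ConfigMap/debug": {"level": "trace"}}

print(detect_drift(desired, live))
# {'ConfigMap/debug': 'extra', 'Deployment/web': 'modified', 'Service/web': 'missing'}
```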

GitOps has become the de facto standard for new cloud-native deployments. ArgoCD is the dominant choice; FluxCD is the close runner-up. The CNCF (Cloud Native Computing Foundation) graduated both.

13.3 Argo Rollouts and progressive delivery

Argo Rollouts is a Kubernetes-native progressive delivery controller. It extends the Deployment primitive with explicit canary, blue-green, and analysis steps. A typical config:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  replicas: 10
  # metadata, selector and pod template omitted for brevity
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: { duration: 10m }
      - analysis:
          templates:
          - templateName: success-rate
      - setWeight: 50
      - pause: { duration: 20m }
      - analysis:
          templates:
          - templateName: success-rate
      - setWeight: 100

The analysis step queries Prometheus, Datadog, or another metric source. If the success rate falls below the threshold, the rollout pauses or aborts. This is the operationalization of "metric-gated canary" without writing custom scripts.

Flagger plays the same role in the Flux ecosystem.
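
The step list in the Rollout above behaves like a small state machine: set a weight, wait, analyze, and either continue or abort. The simulation below is a sketch of that control flow only, not Argo Rollouts code. Pauses are skipped, and `analysis_passes` is a hypothetical callable standing in for the metric query.

```python
ROLLOUT_STEPS = [
    {"setWeight": 10},
    {"pause": {"duration": "10m"}},
    {"analysis": {"templates": [{"templateName": "success-rate"}]}},
    {"setWeight": 50},
    {"pause": {"duration": "20m"}},
    {"analysis": {"templates": [{"templateName": "success-rate"}]}},
    {"setWeight": 100},
]


def run_rollout(steps, analysis_passes):
    """Walk the steps; return (outcome, final canary weight)."""
    weight = 0
    for step in steps:
        if "setWeight" in step:
            weight = step["setWeight"]
        elif "analysis" in step and not analysis_passes(weight):
            # A failed analysis sends all traffic back to the stable version.
            return "aborted", 0
        # "pause" steps only wait, so the simulation skips them.
    return "promoted", weight


print(run_rollout(ROLLOUT_STEPS, lambda weight: True))         # ('promoted', 100)
print(run_rollout(ROLLOUT_STEPS, lambda weight: weight < 50))  # ('aborted', 0)
```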

13.4 Trunk-based development + feature flags

The modern alternative to long-lived release branches:

  • Trunk-based development. All developers commit to main (or master). Short-lived feature branches (<2 days) merged via PR.
  • Feature flags. New code is gated behind a flag. Unreleased features can be merged to main and deployed without being exposed to users.
  • Continuous deployment. Every merge to main triggers a build + deploy. On a multi-team trunk, production is updated dozens of times per day.

This pattern decouples three things that used to be coupled:

  1. Code merge (when the work lands in main).
  2. Deploy (when the binary reaches production).
  3. Release (when users see the new behavior).

Coupling them all together creates large-blast-radius releases. Separating them — code in main but flagged off, deploy when ready, flip the flag when ready — is what enables the "thousands of deploys per day" cultures.


§14. Use Cases

Five domains where the same orchestrator technology serves dramatically different workloads.

14.1 Typical SaaS web app deployment

The canonical case. A SaaS company runs ~50 microservices: web frontend, API gateway, several backend services, async workers, caches, observability sidecars. Each service is a Deployment with HPA scaling on CPU. Total ~500 pods. One Kubernetes cluster per environment (dev, staging, prod). ArgoCD does GitOps. Argo Rollouts handles canary for prod. Standard Prometheus + Grafana for monitoring. Service mesh (Istio) for mTLS + tracing.

Demands on the tech: zero-downtime deploys, quick rollback, multi-team isolation via namespaces.

Variant of the tech: Kubernetes (managed — EKS or GKE), Deployment for stateless, StatefulSet for the one Postgres they self-host, plus a few CronJobs for nightly batches.

14.2 Stateful database deployment

A team needs to run Postgres on Kubernetes (rather than managed RDS — Relational Database Service) for cost or feature reasons. They use CloudNativePG, an operator that manages a Postgres cluster as a CRD.

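apiVersion: postgresql.cnpg.io/v1
kind: Cluster   # CloudNativePG CRD
metadata:
  name: cluster   # yields the cluster-1..3 pods and cluster-rw / cluster-ro Services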
spec:
  instances: 3
  postgresql: { parameters: { max_connections: "200" } }
  bootstrap:
    initdb:
      database: app
  storage:
    size: 100Gi
    storageClass: gp3

The operator's reconciliation loop:

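  • Creates 3 pods, each with its own PVC (cluster-1, cluster-2, cluster-3). CloudNativePG manages the pods and PVCs directly rather than through a StatefulSet; many other database operators do use one.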
  • Bootstraps cluster-1 as the primary.
  • Sets up streaming replication to cluster-2 and cluster-3.
  • Sets up a Service that selects the primary (cluster-rw for read/write) and a Service for replicas (cluster-ro for read-only).
  • Schedules backups via WAL archiving to S3.
  • On primary failure, promotes a replica, updates the Service selector, and re-syncs the old primary.
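
The failover step in that loop can be sketched as: pick a healthy replica, make it the read/write target, and leave the rest as read-only targets. The selection rule used here (promote the replica that has replayed the most WAL) is an assumption made for illustration, not a description of CloudNativePG's algorithm; the field names are hypothetical.

```python
def fail_over(instances, failed):
    """Choose a new primary and return the new Service targets."""
    candidates = [i for i in instances
                  if i["name"] != failed and i["healthy"]]
    if not candidates:
        raise RuntimeError("no healthy replica to promote")
    # Assumed rule: the most caught-up replica loses the least data.
    new_primary = max(candidates, key=lambda i: i["replayed_lsn"])
    replicas = sorted(i["name"] for i in candidates if i is not new_primary)
    return {"cluster-rw": new_primary["name"], "cluster-ro": replicas}


instances = [
    {"name": "cluster-1", "healthy": False, "replayed_lsn": 125},  # old primary
    {"name": "cluster-2", "healthy": True, "replayed_lsn": 120},
    {"name": "cluster-3", "healthy": True, "replayed_lsn": 118},
]

print(fail_over(instances, failed="cluster-1"))
# {'cluster-rw': 'cluster-2', 'cluster-ro': ['cluster-3']}
```

Applications keep connecting to cluster-rw throughout; only the pod behind the Service changes.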

Demands on the tech: stable pod identities, persistent storage that follows the pod, low-latency failover, careful upgrade orchestration.

Variant: stable-identity pods (a StatefulSet, or operator-managed pods as in CloudNativePG) + CSI-backed PVCs + a custom operator. The orchestrator alone is insufficient; the operator is what makes the stateful workload manageable.

14.3 ML training jobs

Training a model is fundamentally different from serving one. Training jobs are:

  • Long-running (hours to weeks).
  • Resource-intensive (multiple GPUs, often distributed via Horovod, PyTorch DDP, or DeepSpeed).
  • Batch-shaped (run to completion, then exit).
  • Multi-step (preprocess → train → evaluate → register).

The Kubernetes-native ML stack:

  • Argo Workflows or Kubeflow Pipelines for the DAG. Each step is a container; the workflow tool sequences them.
  • Volcano scheduler (or Kubernetes' own gang-scheduling support) — co-schedules all pods of a distributed training job to start together. Without this, you can deadlock waiting for all replicas to be runnable.
  • NVIDIA GPU operator + nvidia.com/gpu resource type for GPU scheduling.
  • Cluster autoscaler scaled to GPU node groups. GPU nodes are expensive (~$3-30/hour); scale to zero when idle.
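
Gang scheduling is an all-or-nothing placement decision: either every pod of the job gets a slot, or none is started. A minimal sketch, assuming identical pods and a simple first-fit over free GPUs per node; this shows the idea, not Volcano's algorithm, and the node names are hypothetical.

```python
def gang_schedule(free_gpus_per_node, pods, gpus_per_pod):
    """Return a node for every pod, or None if the whole gang cannot fit."""
    free = dict(free_gpus_per_node)   # work on a copy; commit only on success
    placement = []
    for _ in range(pods):
        node = next((n for n, gpus in free.items() if gpus >= gpus_per_pod), None)
        if node is None:
            return None               # place nothing rather than a partial job
        free[node] -= gpus_per_pod
        placement.append(node)
    return placement


print(gang_schedule({"gpu-a": 2, "gpu-b": 2}, pods=4, gpus_per_pod=1))
# ['gpu-a', 'gpu-a', 'gpu-b', 'gpu-b']
print(gang_schedule({"gpu-a": 2, "gpu-b": 1}, pods=4, gpus_per_pod=1))
# None
```

Without the all-or-nothing rule, the second job would hold three GPUs while waiting for a fourth, and two such jobs could block each other indefinitely.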

A typical training run: Argo Workflow with five sequential steps; the training step is a 4-GPU job that runs for 8 hours; the workflow handles retries, artifact passing (model checkpoints to S3), and cleanup.

Demands on the tech: GPU scheduling, gang scheduling, persistent intermediate artifacts, integration with experiment tracking (MLflow, Weights & Biases).

Variant: Job + Volcano scheduler + NVIDIA device plugin + workflow tool on top.

14.4 Batch processing

The classic ETL workload. A nightly job that reads from a warehouse, transforms, writes to another warehouse. Runs once per day, takes 2 hours.

Kubernetes-native batch:

  • CronJob for the schedule.
  • Job spawned by CronJob, with backoffLimit: 3, activeDeadlineSeconds: 7200.
  • Argo Workflows or Tekton if the batch is multi-step or has fan-out (e.g., a workflow that processes 10,000 files in parallel with 100 concurrency).

A typical pattern: a CronJob runs at 02:00 UTC daily. It creates a Job with parallelism: 50, completions: 10000. Kubernetes schedules 50 pods concurrently; as each completes, another starts, until all 10,000 are done.
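
A quick way to sanity-check such a Job against its deadline is to count the waves of pods. The estimate below assumes every pod takes the same time and that scheduling overhead is negligible; the 30-second per-pod duration is a made-up figure for illustration.

```python
import math


def job_waves(completions, parallelism):
    """Number of back-to-back batches needed to finish all completions."""
    return math.ceil(completions / parallelism)


def estimated_runtime_seconds(completions, parallelism, seconds_per_pod):
    """Rough wall-clock time if all pods take `seconds_per_pod`."""
    return job_waves(completions, parallelism) * seconds_per_pod


waves = job_waves(10_000, 50)
runtime = estimated_runtime_seconds(10_000, 50, seconds_per_pod=30)
print(waves)            # 200
print(runtime)          # 6000
print(runtime <= 7200)  # True: fits inside activeDeadlineSeconds: 7200
```

If the estimate exceeds the deadline, raise parallelism or the deadline before the Job is killed mid-run.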

Demands on the tech: scheduled execution, fan-out parallelism, failure handling with retries, resource fairness (don't starve interactive workloads).

Variant: CronJob + Job + a workflow tool for orchestration.

14.5 Edge deployment

Running compute close to users — in retail stores, factories, telcos, vehicles. Constraints:

  • Limited compute per site (often a single small server, a Raspberry Pi cluster, or a few nodes).
  • Intermittent network connectivity to the central cloud.
  • Operated by non-experts (store managers, technicians).
  • High site count (thousands to tens of thousands).

Solutions:

  • k3s — a 50 MB single-binary Kubernetes distribution by Rancher. Designed for edge. Replaces etcd with SQLite by default for single-node deployments. Runs on a Raspberry Pi.
  • KubeEdge — extends Kubernetes to manage nodes that may be offline. Pods sync to edge nodes when connectivity is available; the edge node continues running pods during disconnects.
  • Nomad — Cloudflare uses Nomad to manage 200+ POPs (Points of Presence) globally. Each POP runs a Nomad agent; a central Nomad cluster schedules across them.
  • Argo CD ApplicationSet — deploys the same manifest to N target clusters generated from a list (e.g., one per store).

Demands on the tech: tiny footprint, tolerance for disconnection, easy bulk management, simple per-site upgrade.

Variant: k3s + GitOps + carefully sized resource requests.


§15. Real-World Implementations with Numbers

Named systems shipping container orchestration at scale, across different use cases.

Google Borg → Kubernetes

The progenitor. Google's internal Borg system (paper: "Large-scale cluster management at Google with Borg," EuroSys 2015) runs essentially all of Google. A typical Borg cell is 10,000 machines; Google has hundreds of cells. Total scale is millions of machines, billions of containers.

Borg pioneered: declarative job specs, two-level scheduling (Borg → tasks → Linux processes), priority/preemption, cgroups-based isolation. Kubernetes (open-sourced 2014) is the externalized lessons learned from Borg.

Kubernetes itself targets 5,000 nodes per cluster; Google internally runs Kubernetes for some workloads but still relies on Borg for the largest.

Spotify

Spotify migrated from a homegrown system to Kubernetes around 2018-2020. The platform team open-sourced Backstage — a developer portal that ties Kubernetes deployments, observability, and ownership together. Public numbers: thousands of services, ~10k pods, hundreds of deploys per day. Multiple clusters per region, GitOps via Flux.

Airbnb

Airbnb runs ~100 production services on Kubernetes (mostly EKS) across multiple regions. Migrated from a homegrown deployment system. Public talks cite ~70% cost reduction in some workloads after migrating to Kubernetes with HPA + Cluster Autoscaler properly tuned, due to better bin-packing.

Lyft

Lyft is one of the largest publicly-discussed Envoy users (Envoy was created at Lyft). They run a service mesh on Kubernetes with thousands of services, billions of requests per day. The Lyft platform team contributed heavily to the cloud-native ecosystem.

Pinterest

Pinterest runs on Kubernetes (~10k pods) plus their own platform abstractions on top. They've published on multi-cluster ingress, GitOps at scale, and Argo Rollouts adoption.

Stripe

Stripe runs heavily on Kubernetes across many clusters. They've published on Argo CD at scale, on careful pod-startup latency tuning (image pull is the dominant component, mitigated by aggressive caching), and on the operational model for hundreds of thousands of pods.

Netflix

Netflix runs partly on Kubernetes (via their internal platform "Titus" for many years, more recently incorporating Kubernetes) and partly directly on EC2. Spinnaker was created at Netflix as the deployment orchestrator across both. Their use case is the canonical "thousands of services, hundreds of deploys per day, multi-region active-active" pattern. Open Connect (Netflix's CDN edge servers) runs Nomad.

Cloudflare

Cloudflare runs Nomad on 200+ POPs globally. They've publicly discussed why Nomad's smaller footprint (versus Kubernetes) fits the edge: each POP has a constrained number of nodes, the multi-workload model lets them run a mix of containers and raw binaries, and a single binary is easier to deploy across that many sites.

AWS — ECS and Fargate

Amazon ECS is used by a huge number of AWS customers — Slack, Lyft (in addition to Kubernetes), many SaaS startups. ECS is simpler than Kubernetes (no cluster to manage), tightly integrated with AWS (IAM — Identity and Access Management — roles map to tasks via task-role ARNs, VPC — Virtual Private Cloud — networking is native, ALB — Application Load Balancer — targets are first-class). Fargate underneath is AWS Lambda-style: AWS allocates micro-VMs (Firecracker), runs the container, charges per vCPU-hour. Used by ~100k+ customers.

Numbers summary

| Org | Orchestrator | Approx pods | Approx services | Approx deploys/day |
| --- | --- | --- | --- | --- |
| Google | Borg (+ k8s) | billions of containers | unknown | continuous |
| Spotify | Kubernetes | ~10k | thousands | hundreds |
| Stripe | Kubernetes | ~100k | thousands | thousands |
| Lyft | Kubernetes | tens of thousands | thousands | thousands |
| Pinterest | Kubernetes | ~10k+ | hundreds | hundreds |
| Cloudflare | Nomad | tens of thousands across POPs | hundreds | continuous |
| Netflix | Spinnaker over EC2+k8s | hundreds of thousands | thousands | thousands |

The range is enormous. Different orgs at different scales arrive at different points in the design space, and yet the same fundamental abstractions (declarative spec, reconcile loop, ephemeral workloads, network-attached storage) underlie all of them.


§16. Summary

Container orchestration is the runtime layer that turns "I want N copies of this service" into reality: a control plane stores desired state in a Raft-replicated KV store, controllers run reconcile loops that translate intent into actions, a scheduler bin-packs pods onto nodes, and per-node agents drive a container runtime through standardized interfaces for compute (CRI), networking (CNI), and storage (CSI). On top of that runtime, deployment patterns — rolling, canary, blue-green, dark launch, shadow — and delivery systems — GitOps with Argo or Flux, progressive delivery with Argo Rollouts, push pipelines with Spinnaker — turn the act of shipping software from a heroic event into a continuous, declarative, level-triggered process. The technology's contract is narrower than it appears: it promises scheduling, lifecycle, and abstraction for ephemeral workloads on a cluster of machines, not persistent state, not cross-cluster atomicity, not business correctness — and the difference between teams that ship Kubernetes well and teams that don't is largely the difference between those who internalize that narrow contract and those who don't.