
Observability

A technology reference. Observability is the class of systems that collect, store, and query telemetry from production software so humans (and increasingly machines) can diagnose what is happening. It is not one product; it is a tripod of three pillars — metrics, logs, traces — each with a distinct storage engine forced by a distinct access pattern. This doc covers the class.


§1. What Observability Is

Observability is telemetry collection + storage + query for diagnosing production systems. The output of an observable system is the ability to answer questions you didn't know you were going to ask, in production, without redeploying — distinguishing it from classical "monitoring," which only answers questions you wired dashboards for ahead of time.

An observability stack does three things: instrumentation (SDK in-process — OpenTelemetry, Datadog agent, Prometheus client lib, Sentry SDK), transport and storage (telemetry shipped async into per-pillar storage tiers), and query/alerting (dashboards, ad-hoc investigations, threshold or error-budget alerts).

The three pillars

Not three views of the same data — three different data types with different volume, shape, and query patterns.

  • Metrics. Pre-aggregated numeric time-series. Low volume per data point (one float + one timestamp + a label set), high cardinality if labels run wild. Access pattern: range scan over a labeled series, often aggregated (sum, rate, histogram_quantile). Used for dashboards, SLO (Service Level Objective) tracking, most alerts. Example: http_requests_total{service="checkout",status="500"} sampled every 10 seconds.
  • Logs. Free-form or structured text events emitted by application code. High volume (the biggest chunk of an observability bill), variable content, queries mix grep-style substring search with structured filters. Used for incident forensics, audit trails, and increasingly product analytics. Example: a JSON line {"ts":"...","level":"ERROR","service":"payment","trace_id":"abc","msg":"stripe API returned 500"}.
  • Traces. Causal chains of work across services. A trace is a tree of spans, each span a (start_time, duration, service, operation, attributes, parent_span_id) tuple. Volume scales with requests × span_count_per_request, so sampling is mandatory at any non-trivial scale. Used for "where did this request spend its time." Example: one user click producing 20 spans across api-gateway → auth → checkout → payment → stripe.

Where it sits and what it isn't

Observability lives alongside the application, not inside the request path. Done right, it's invisible to user latency; done wrong, a synchronous log flush wedges the application. It is not a substitute for testing, not a transactional system (telemetry is best-effort — dropping 0.01% of metric samples on restart is fine), not a single-product question ("install Datadog" is sourcing, not design; the technology choices — pull vs push, full-text vs label index, head vs tail sampling — exist regardless of vendor), and not real-time × full-fidelity × unbounded retention × cheap. Every platform picks two of those four and trades the rest.


§2. Inherent Guarantees per Pillar

Each pillar has a contract — what it provides by design and what you cannot have. The trap is assuming guarantees the technology doesn't make.

Metrics. You get: cheap aggregation, predictable storage, fast range queries by label. rate(http_requests_total[5m]) by (service) over thousands of series returns in milliseconds because the TSDB (Time-Series Database) is structurally optimized for this shape. You don't get: per-event detail — once you've incremented a counter, you cannot ask "which specific user caused this?"; that information was discarded at aggregation. High cardinality is also not free; each unique label-value combination is its own series. Must be layered: SLO definitions, alerting rules, exemplars (a histogram-bucket pointer to a sample trace).

Logs. You get: the full event payload, durably stored, queryable by any indexed field. Logs are the "ground truth" of what each service saw and did. You don't get: low cost. Log volume dwarfs metrics by 50-200x in bytes. You also don't automatically get correlation; a log line is just a string unless the application stamps trace_id, service, user_id on it. Must be layered: structured schema, trace context propagation, retention tiering, PII (Personally Identifiable Information) scrubbing, per-service rate limits.

Traces. You get: causal structure. Given a trace_id you can reconstruct the full tree of who called whom, with timing for each hop. The only pillar that natively expresses "the latency came from the third gRPC call to payment." You don't get: complete coverage. Tracing without sampling is uneconomical above ~10k requests/sec; whatever you didn't sample doesn't exist. Must be layered: sampling strategy (head/tail/hybrid), trace context propagation across HTTP, gRPC, Kafka, Espresso CDC (Change Data Capture), cron — every async boundary.

The bound you can't escape: real-time × full-fidelity × unbounded retention × cheap — pick two. Metrics give up detail (pre-aggregate). Logs give up retention (tier to cold). Traces give up coverage (sample). Cost determines how aggressively each pillar pays the toll.


§3. The Design Space

The class has axes. Pick a position on each.

Axis A: collection model — pull vs push. Pull (Prometheus) — the TSDB scrapes /metrics on a schedule. Strength: cardinality and rate bounded by scraper config. Weakness: must reach every target — breaks behind NAT (Network Address Translation), firewalls, or for short-lived batch jobs. Push (Datadog agent, StatsD, OTLP/HTTP) — the application sends telemetry to a collector. Strength: works for serverless, batch, cross-region. Weakness: a runaway client can DDoS the platform; mitigated by per-tenant rate limits. Most large platforms are hybrid — Prometheus pulls long-running services, OTLP receivers handle short-lived jobs.

Axis B: log indexing strategy.

  • Inverted index per word (ELK = Elasticsearch + Logstash + Kibana, OpenSearch, Splunk). Sub-second full-text grep. Storage 1.5-3x raw; operational cost high (Lucene shards, JVM tuning).
  • Label index only (Loki, Grafana Cloud Logs). Index a small set of labels (service, level, region); payload as compressed chunks. Cheap. Grep scans chunks at query time: fine for service=api last 1h, bad for global string search.
  • Columnar (ClickHouse, Apache Pinot for logs, Honeycomb's Retriever). Row per log line, column per JSON field. Fast SQL aggregations, cheap storage via columnar compression. Weak at substring grep on free text, strong at "how often does X happen by category."
  • Object-store + Parquet (CloudWatch Logs Insights, Athena, Snowflake on logs). Cheapest storage; query latency seconds-to-minutes. Compliance retention, not incident grep.

Axis C: trace sampling. Head: decide at trace start, propagate the decision with the request. Cheap, but cannot preferentially keep errors. Tail: every span is emitted and buffered by trace_id for ~30s; the decision keeps errors + slow + p% random. Expensive in memory; requires routing all spans of a trace to one collector. Hybrid: head-sample at 10%, then tail-sample the traces that survive. Most production platforms.

Axis D: storage tier model. Hot-only (Monarch holds metrics in RAM regionally; cheap to query, expensive to scale). Tiered (Prometheus + Thanos: 2h blocks NVMe, older to S3; Loki: recent chunks local SSD, older in object storage). Object-store native (Tempo: Parquet in S3, bloom-filter index; cheapest, highest cold-read latency).

Axis E: instrumentation framework. Vendor-specific SDKs (Datadog agent, New Relic) — rich auto-instrumentation, lock-in. OpenTelemetry (OTel) — vendor-neutral, CNCF (Cloud Native Computing Foundation) project; Google, Microsoft, AWS, Datadog, Splunk, Honeycomb, New Relic all back it. Default pick in 2026 if you don't want lock-in.

Comparison table

| Variant | Collection | Index strategy | Cost shape | Best for |
|---|---|---|---|---|
| Prometheus + Mimir | pull | inverted label index, Gorilla chunks | low storage, cardinality-bounded | metrics for any long-running fleet |
| Datadog (SaaS) | push agent | proprietary columnar TSDB + log columnstore | high $/GB, low ops | shops that buy not build |
| ELK / OpenSearch | push | full-text Lucene inverted | high $/GB, high ops | strong full-text needs, smaller volume |
| Loki + Grafana | push | label-only + zstd chunks in S3 | low $/GB, slower grep | large fleets with grep-not-aggregate logs |
| ClickHouse / Pinot for logs | push (Kafka) | columnar, per-column indexes | low $/GB, fast aggregations | SQL log analytics, dashboards |
| Honeycomb | push | columnar event store | medium $/GB, ad-hoc query strength | high-cardinality wide-event analysis |
| Jaeger / Cassandra | push, sampled | trace_id partitioned LSM | bounded by sample rate | per-request causal investigation |
| Tempo + S3 | push, sampled | Parquet + bloom filter in object store | very low $/GB | cheap long-retention traces |
| Honeycomb Retriever | push, often unsampled | columnar wide events | medium-high $/GB | "ask weird questions" workflows |

§4. Byte-Level Per Pillar

This is where most observability writeups hand-wave with "stores time-series data." Each pillar has a different engine, and the engine choice is forced by the access pattern.

4.1 Metrics — Prometheus TSDB with Gorilla compression

The Prometheus TSDB is a custom append-only time-series engine derived from Facebook's Beringei / Gorilla paper (VLDB 2015). It is the de facto reference design — M3DB, VictoriaMetrics, InfluxDB IOx, Mimir, Cortex, and Datadog's internal TSDB all use the same family of techniques.

Why a custom engine and not RocksDB or InnoDB? Four facts force the choice: (1) writes are append-only with monotonically increasing timestamps; (2) float values within one series are highly correlated; (3) reads are almost always range scans over a labeled series; (4) cardinality is the dominant cost driver, not throughput. An LSM (Log-Structured Merge) tree forces compaction overhead and duplicates the series identifier in every record. A B+ tree wastes space on internal nodes and is terrible at random writes. The Prometheus TSDB is best understood as a specialized append-only LSM with a custom value codec (Gorilla) and coarse block-level compaction (the 2h block merge).

On-disk block layout (one 2-hour block, which holds many chunks):

block/
  meta.json                         # block ID, time range, stats
  chunks/
    000001                          # raw chunks file (binary)
    000002
  index                             # inverted index of label values
  tombstones                        # deleted ranges (rare)

A chunk inside chunks/000001:
  +---------+-----------+---------+-----------+----------------+
  | header  | timestamp | first   | timestamp | ... compressed |
  | (series | header    | sample  | delta of  | delta-of-delta |
  |  ref)   | base      | (full)  | delta     | + XOR floats   |
  +---------+-----------+---------+-----------+----------------+
   ~10 B     ~10 B       ~16 B     ~1-4 b/sample  ~1-12 b/sample

The "1.37 bytes/sample" Gorilla figure comes from the steady state of this format averaged over many real Facebook samples. Compared to a naive 16 bytes/sample encoding (8B timestamp + 8B float), Gorilla is roughly 12x denser. At a 9M-samples/sec fleet, that's 12 MB/sec vs 144 MB/sec — the difference between a small TSDB cluster and a 100-node beast.
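The density arithmetic above can be checked in a few lines. This is only a back-of-envelope sketch; the constant names are illustrative, and the inputs are the figures quoted in the text.

```python
# Back-of-envelope check of the Gorilla density figures quoted above.
NAIVE_BYTES_PER_SAMPLE = 8 + 8      # 8-byte timestamp + 8-byte float
GORILLA_BYTES_PER_SAMPLE = 1.37     # steady-state figure quoted from the Gorilla paper
SAMPLES_PER_SEC = 9_000_000         # fleet-wide ingest rate used in the text

ratio = NAIVE_BYTES_PER_SAMPLE / GORILLA_BYTES_PER_SAMPLE
naive_mb_s = SAMPLES_PER_SEC * NAIVE_BYTES_PER_SAMPLE / 1e6
gorilla_mb_s = SAMPLES_PER_SEC * GORILLA_BYTES_PER_SAMPLE / 1e6

print(f"compression ratio: {ratio:.1f}x")
print(f"naive:   {naive_mb_s:.0f} MB/s")
print(f"gorilla: {gorilla_mb_s:.1f} MB/s")
```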

Gorilla: timestamp half (delta-of-delta). Scrapes happen on a fixed schedule. If t1=1000, t2=1010, t3=1020, t4=1030, the deltas are 10, 10, 10 and the delta-of-deltas are 0, 0, each of which encodes in 1 bit. A late scrape (10.001s) makes the delta-of-delta a small integer; in the Gorilla paper's scheme that costs 9-12 bits including the control prefix. Worst case is 36 bits (4 control bits + a 32-bit value).
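A minimal sketch of the delta-of-delta idea, assuming the bucket boundaries described in the Gorilla paper (Prometheus's own chunk encoding may use different bucket widths). The helper names `delta_of_delta` and `dod_bits` are illustrative, and the function only computes bit costs; it does not emit an actual bitstream.

```python
def delta_of_delta(timestamps):
    """Return (first_ts, first_delta, [delta-of-delta, ...]) for a sorted series."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods


def dod_bits(dod):
    """Bit cost of one delta-of-delta, using the Gorilla paper's buckets."""
    if dod == 0:
        return 1            # a single '0' bit
    if -63 <= dod <= 64:
        return 2 + 7        # '10' prefix + 7-bit value
    if -255 <= dod <= 256:
        return 3 + 9        # '110' prefix + 9-bit value
    if -2047 <= dod <= 2048:
        return 4 + 12       # '1110' prefix + 12-bit value
    return 4 + 32           # '1111' prefix + 32-bit value


# Millisecond timestamps: on-time scrapes, then one that arrives 1 ms late.
ts = [1_000_000, 1_010_000, 1_020_000, 1_030_001, 1_040_001]
first_ts, first_delta, dods = delta_of_delta(ts)
costs = [dod_bits(d) for d in dods]
print(dods, costs)
```

The on-time scrape costs one bit; the late scrape and the recovery after it each cost nine, which is why jittery scrape schedules compress visibly worse than steady ones.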

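Gorilla: float half (XOR). For values that change slowly, the XOR of consecutive IEEE 754 doubles has many leading zeros (sign, exponent, and top mantissa bits match) and, when the values are integers or short binary fractions, many trailing zeros as well. Gorilla encodes XOR results as (leading_zeros, meaningful_bits, value). Identical to previous → 1 bit. Small change → ~12-20 bits. Large change → up to 64 bits of payload. Average ~7-11 bits. Decimal readings such as CPU at 47.3, then 47.5, then 47.6 still share their leading bits, but their mantissas rarely end in zeros, so they compress less well than counters or integer gauges.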
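To make the XOR window concrete, here is a small sketch that inspects the bit pattern directly. It only measures the leading/meaningful/trailing split; it is not an encoder, and the function names are illustrative.

```python
import struct


def float_bits(x):
    """Reinterpret a Python float as its 64-bit IEEE 754 pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_window(prev, cur):
    """Return (leading_zeros, meaningful_bits, trailing_zeros) of prev XOR cur."""
    x = float_bits(prev) ^ float_bits(cur)
    if x == 0:
        return 64, 0, 0                      # identical value: one control bit suffices
    leading = 64 - x.bit_length()
    trailing = (x & -x).bit_length() - 1     # index of the lowest set bit
    return leading, 64 - leading - trailing, trailing


same = xor_window(12.0, 12.0)
doubled = xor_window(12.0, 24.0)      # differs in a single exponent bit
decimal = xor_window(47.3, 47.5)      # decimal fractions: wide meaningful window
print(same, doubled, decimal)
```

Comparing `doubled` with `decimal` shows why integer-valued counters tend to compress far better than gauges carrying arbitrary decimal fractions.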

Inverted label index. A query like sum(rate(http_requests_total{service="api",region="us-east-1"}[5m])) must find all series matching service="api" AND region="us-east-1". The TSDB maintains a posting-list index from (label_name, label_value) → sorted list of series_IDs. Label intersection is a posting-list AND, the same primitive Lucene uses for search. A merge scan over two sorted lists costs O(a + b); skip pointers or galloping search can bring it closer to O(min(a, b)) when one list is much shorter. This is also where cardinality explosions hurt: every new series adds an entry to the posting list of each of its labels, and every new unique label value adds a whole new posting list.
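The posting-list AND can be sketched as a plain merge scan. The index contents below are hypothetical, and a real TSDB stores posting lists in a compact on-disk format rather than Python lists.

```python
def intersect(a, b):
    """Intersect two sorted lists of series IDs with a linear merge scan."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical index: (label_name, label_value) -> sorted series IDs.
postings = {
    ("service", "api"): [1, 4, 7, 9, 12],
    ("service", "web"): [2, 3, 5],
    ("region", "us-east-1"): [3, 4, 9, 10, 12],
}


def select(matchers):
    """AND together the posting lists for every (name, value) matcher."""
    lists = sorted((postings.get(m, []) for m in matchers), key=len)
    if not lists:
        return []
    result = lists[0]                 # start from the shortest list
    for other in lists[1:]:
        result = intersect(result, other)
    return result


matched = select([("service", "api"), ("region", "us-east-1")])
print(matched)
```

Starting from the shortest list keeps intermediate results small, which is the usual reason selective labels are cheap to query and broad ones are not.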

Head block + WAL (Write-Ahead Log). In-memory state for the current ~2h window lives in a hash table of series_id → most_recent_chunk. New samples append to the open chunk. Every sample is also written to a WAL on disk before acknowledgment. fsync batches every ~1s (configurable) — pre-fsync, a crash loses up to 1s of samples (acceptable for monitoring). On crash recovery, WAL replay rebuilds the head block.
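The log-first, apply-second discipline and the replay-on-restart behaviour can be illustrated with a toy model. This is a sketch under loud assumptions: the `Head` class, the JSON-lines WAL format, and the file name are all invented for illustration; Prometheus's real WAL is a segmented binary format and handles torn writes, which this does not.

```python
import json
import os
import tempfile


class Head:
    """Toy head block: in-memory series plus an append-only WAL file."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.series = {}                  # series_id -> [(timestamp, value), ...]
        self._replay()                    # crash recovery: rebuild state from the WAL
        self.wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as f:
            for line in f:
                sid, ts, val = json.loads(line)
                self.series.setdefault(sid, []).append((ts, val))

    def append(self, sid, ts, val):
        self.wal.write(json.dumps([sid, ts, val]) + "\n")    # log first
        self.series.setdefault(sid, []).append((ts, val))    # then apply in memory

    def sync(self):
        """Batched in real systems; anything written since the last sync can be lost."""
        self.wal.flush()
        os.fsync(self.wal.fileno())

    def close(self):
        self.sync()
        self.wal.close()


path = os.path.join(tempfile.mkdtemp(), "wal.log")
head = Head(path)
head.append("s1", 1716391200000, 4823.0)
head.append("s1", 1716391210000, 4830.0)
head.close()                              # simulate a clean shutdown

recovered = Head(path)                    # a new process replays the WAL
recovered_series = dict(recovered.series)
recovered.close()
print(recovered_series)
```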

Walk through one scrape end-to-end.

1. App exposes /metrics:
   http_requests_total{service="api",status="200"} 4823 1716391200000

2. Scraper (HTTP GET every 10s): parse text → (metric + sorted label kv) hashes
   to series_id → append (timestamp, value) to open chunk in head block →
   append WAL record (series_id, timestamp, value).

3. WAL hits page cache; fsync batches every ~1s. Pre-fsync crash loses ≤1s.

4. Open chunk reaches 120 samples (~20 min) → close it, Gorilla-encode →
   new chunk opens. Closed chunks stay in head-block RAM.

5. Every 2h the head block is "cut" — closed chunks flush to a persistent
   block. Inverted label index rebuilt. WAL truncated.

6. After 24h, persistent blocks compact into larger blocks and ship to
   long-term storage (S3 via Thanos/Mimir).

Cardinality is the kill switch. Head-block RAM, inverted-index size, and query latency scale with cardinality (the number of active series); only chunk bytes also scale with sample rate. Doubling the scrape interval halves chunk storage and little else; doubling cardinality doubles chunk storage AND inverted-index size AND head-block RAM, and slows queries. user_id with 100M values → 100M series; at ~1 KB per open chunk that's 100 GB of RAM in the head block alone. Prometheus OOMs (runs out of memory) long before that. 1-10M series per instance is workable; above 10M, shard (Mimir) or drop the label.
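A rough calculator for the numbers above, assuming the text's ~1 KB of open-chunk state per series. The label counts in the example are hypothetical, and `worst_case_series` gives an upper bound (every label combination actually occurring), not a forecast.

```python
def head_ram_gb(series, bytes_per_series=1_000):
    """Rough head-block RAM, assuming ~1 KB of open-chunk state per series."""
    return series * bytes_per_series / 1e9


def worst_case_series(label_cardinalities):
    """Upper bound on series for one metric: product of distinct values per label."""
    total = 1
    for distinct_values in label_cardinalities.values():
        total *= distinct_values
    return total


# Hypothetical bounded label set for one metric.
bounded = worst_case_series({"service": 200, "status": 5, "region": 4})
print(bounded, "series ->", head_ram_gb(bounded), "GB")

# The user_id example from the text: 100M series.
print(head_ram_gb(100_000_000), "GB")
```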

4.2 Logs — three competing storage models

Model A: full inverted index per word (ELK / OpenSearch / Splunk). Every word tokenized and added to a Lucene inverted index. Storage ~1.5-3x raw. Query latency for level=ERROR AND message:"connection refused" is millisecond-class via posting-list intersection. Cost at 85 TB/day is brutal: 1000+ nodes, many TB of RAM for hot indexes. ELK shines below ~1 TB/day; above that the cost curve breaks.

Model B: index labels, scan the rest (Loki). Only service, host, level, region indexed; log content stored as zstd-compressed chunks addressed by label-set. Query {service="api"} |= "connection refused": (1) label index finds chunks for service="api" in the time range; (2) for each chunk, fetch from S3, decompress, grep. Works because most queries scope by service+time first, indexing per word is what's expensive (skipped here), and S3 parallel reads are easy. Loki's labels match Prometheus's — cross-pillar consistency for free.

Loki:
  /index/   # (label-set hash, time range) → chunk references
  /chunks/  # zstd-compressed payload, one chunk = one label-set + 1-2h
            # production: S3, GCS, ABS
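The "index labels, scan the rest" query path can be modelled in memory. In this sketch `zlib` stands in for zstd, a dict stands in for object storage, and `write_chunk` / `query` are illustrative names rather than Loki APIs.

```python
import zlib

index = {}    # frozenset of (label, value) pairs -> [(start_ts, end_ts, chunk_key)]
chunks = {}   # stand-in for object storage: chunk_key -> compressed bytes


def write_chunk(labels, start, end, lines):
    key = f"chunk-{len(chunks)}"
    chunks[key] = zlib.compress("\n".join(lines).encode())
    index.setdefault(frozenset(labels.items()), []).append((start, end, key))


def query(matchers, needle, start, end):
    """Roughly {k="v", ...} |= "needle" over the time range [start, end]."""
    want = set(matchers.items())
    hits = []
    for label_set, refs in index.items():
        if not want <= label_set:
            continue                              # label index prunes whole streams
        for c_start, c_end, key in refs:
            if c_end < start or c_start > end:
                continue                          # time range prunes chunks
            for line in zlib.decompress(chunks[key]).decode().split("\n"):
                if needle in line:                # brute-force grep on the payload
                    hits.append(line)
    return hits


write_chunk({"service": "api", "level": "error"}, 0, 100,
            ["connection refused to db-1", "timeout calling auth"])
write_chunk({"service": "api", "level": "info"}, 0, 100, ["request ok"])
write_chunk({"service": "web", "level": "error"}, 0, 100,
            ["connection refused to cache"])
write_chunk({"service": "api", "level": "error"}, 200, 300,
            ["connection refused to db-2"])

recent = query({"service": "api"}, "connection refused", 0, 150)
print(recent)
```

Only chunks that pass both the label and time filters are ever decompressed, which is the whole cost argument for this model; a query with no label scope would have to decompress everything.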

Model C: columnar (ClickHouse, Pinot, Honeycomb's Retriever). Row per log line, column per JSON field. Per-column indexes make SELECT count() FROM logs WHERE service='api' AND level='ERROR' GROUP BY error_code return sub-second on TB/day clusters. LinkedIn's log analytics runs logs → Kafka → Pinot for exactly this — most "how often does X happen" queries are SQL aggregations, not grep.

Decision rule. Grep-heavy → Loki. SQL-aggregation-heavy → Pinot/ClickHouse. Ad-hoc full-text and budget isn't a constraint → ELK. Very-long-term compliance, infrequent queries → Parquet on object storage with Athena/Snowflake. At scale you run multiple; LinkedIn pipes one Kafka stream into both Loki-equivalent (grep) and Pinot (analytics).

4.3 Traces — partition by trace_id, sampled

Storage shape. Each span is a record:

(trace_id, span_id, parent_span_id, service, operation,
 start_ns, duration_ns, status, attributes:{k:v}, events:[...])

The dominant query is "give me all spans for trace_id=X." Less commonly: "find traces where service=payment had duration > 2s and status=ERROR in the last hour."

For trace_id lookup, a key-value store keyed by trace_id is perfect. Jaeger originally used Cassandra (LSM tree, partition key = trace_id). Modern Tempo writes Parquet files to S3 and uses a Bloom-filtered index. Cassandra and S3 Parquet are both well-suited to "lots of writes, occasional point lookup by key."

For the search query ("find traces where..."), you need a secondary index — typically a small Lucene/Elasticsearch index that holds just the searchable span attributes plus a pointer to the full span payload in cheap storage. Honeycomb takes a different route — its Retriever stores every span as a row in a columnar DB and indexes nothing specifically, relying on columnar scan speed to make "ad-hoc queries on traces" the primary workflow.

Partition by trace_id is non-negotiable for tail sampling. Why: tail sampling decisions are made on the full set of spans for a trace_id. If span 1 of trace abc lands on collector A and span 2 lands on collector B, neither collector can decide whether to keep the trace. Kafka topic partitioned by trace_id ensures all spans of a trace hash to the same partition and are read by the same consumer.
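A sketch of the routing rule, assuming a stable hash of trace_id modulo the partition count. Kafka's default partitioner uses its own hash function; SHA-256 is swapped in here only because it is in the standard library, and `partition_for` is an illustrative name.

```python
import hashlib


def partition_for(trace_id: str, num_partitions: int) -> int:
    """Map a trace_id to a partition with a process-independent hash."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Spans arrive interleaved from different services.
spans = [("abc123", "001"), ("def456", "001"), ("abc123", "002"), ("abc123", "003")]

by_partition = {}
for trace_id, span_id in spans:
    by_partition.setdefault(partition_for(trace_id, 8), []).append((trace_id, span_id))

print(by_partition)
```

Python's built-in `hash()` is deliberately avoided: it is randomized per process for strings, so two producers would disagree on where a trace belongs.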

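Head vs tail sampling. Head: flip a coin at trace start; the sampled flag propagates via traceparent. Cheap, zero coordination, but cannot preferentially keep errors (you don't know yet). Tail: every span is emitted, the collector buffers by trace_id for ~30s, then decides: keep the whole trace if any span has status=ERROR or duration > 2× service p99, else keep with random probability p. Catches the interesting traces; needs memory to buffer and trace_id partitioning so all spans of a trace land on one collector. Hybrid (most production): the SDK head-samples at 10% (dropping 90% before the wire), and the collector tail-samples what survives, keeping all errors/slow traces plus 1% random. Net retention is about 10% × (error/slow fraction + 1%); the ~1.5% figure corresponds to a workload where roughly one head-sampled trace in seven is an error or slow. The catch: the tail stage sees only the head-sampled 10%, so it keeps 100% of the interesting traces within that 10%, not 100% overall. An error inside a head-dropped trace is gone.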
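The two decisions can be sketched as plain functions. This assumes a deterministic, trace_id-derived head decision (one common way to implement the coin flip) and the tail policy described above; the function names, span dict fields, and p99 table are all illustrative.

```python
import random


def head_keep(trace_id_hex: str, rate: float) -> bool:
    """Head decision derived from the trace_id, so it needs no coordination."""
    return int(trace_id_hex[-8:], 16) / 0x1_0000_0000 < rate


def tail_keep(spans, p99_ms, random_rate, rng=random.random):
    """Tail decision over the buffered spans of one complete trace."""
    if any(s["status"] == "ERROR" for s in spans):
        return True
    if any(s["duration_ms"] > 2 * p99_ms[s["service"]] for s in spans):
        return True
    return rng() < random_rate


def net_keep_rate(head_rate, interesting_frac, random_rate):
    """Expected fraction of all traces stored by head-then-tail sampling."""
    return head_rate * (interesting_frac + (1 - interesting_frac) * random_rate)


p99 = {"payment": 400, "auth": 50}        # hypothetical per-service p99, in ms
ok = [{"service": "auth", "status": "OK", "duration_ms": 20}]
slow = [{"service": "payment", "status": "OK", "duration_ms": 900}]
failed = [{"service": "payment", "status": "ERROR", "duration_ms": 30}]

print(tail_keep(failed, p99, 0.01), tail_keep(slow, p99, 0.01))
print(net_keep_rate(0.10, 0.05, 0.01))
```

A trace dropped by `head_keep` never reaches `tail_keep`, so the head rate caps how many errors the tail stage can possibly retain.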

Walk through one trace assembly.

1. api-gateway receives HTTP, no traceparent. Generates trace_id=abc123,
   span_id=001, sampled=1. Starts span "POST /api/checkout".

2. api-gateway → auth gRPC, sends traceparent: 00-abc123-001-01
   (IDs shortened here; the W3C header carries a 32-hex trace-id and a
   16-hex parent-id). auth creates span_id=002, parent=001.

3. api-gateway → checkout (span 003) → payment (004) → stripe (005).
   Each span reported async to local OTel SDK buffer.

4. SDK batches (512 spans or 5s) → OTLP/gRPC → local OTel Collector →
   regional Collector → Kafka topic "traces" partitioned by trace_id.

5. Tail-sampling consumer for the partition reads spans, buffers per-trace_id.
   After 30s (or all spans seen), applies tail policy.

6. Kept traces:
   - Cassandra: row key=trace_id, column key=span_id, value=span proto
   - Tempo: append to a Parquet file in S3 + Bloom filter + trace_id→offset index.

7. Query "show me trace abc123" → point lookup → ~1000 spans →
   UI assembles tree via parent_span_id pointers.
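Step 7, assembling the tree from parent_span_id pointers, can be sketched as follows. The span dicts mirror the walk-through above; the timings and field names are made up for illustration.

```python
from collections import defaultdict

# Spans as they might come back from storage: arbitrary order.
spans = [
    {"span_id": "005", "parent": "004", "name": "stripe", "start": 30, "dur": 40},
    {"span_id": "001", "parent": None, "name": "POST /api/checkout", "start": 0, "dur": 100},
    {"span_id": "003", "parent": "001", "name": "checkout", "start": 12, "dur": 80},
    {"span_id": "002", "parent": "001", "name": "auth", "start": 2, "dur": 8},
    {"span_id": "004", "parent": "003", "name": "payment", "start": 20, "dur": 60},
]


def build_tree(spans):
    """Group spans under their parent; spans with unknown parents become roots."""
    known = {s["span_id"] for s in spans}
    children = defaultdict(list)
    roots = []
    for s in sorted(spans, key=lambda s: s["start"]):
        if s["parent"] in known:
            children[s["parent"]].append(s)
        else:
            roots.append(s)       # true root, or parent span was never received
    return roots, children


def render(nodes, children, depth=0, out=None):
    """Depth-first walk producing one indented line per span."""
    out = [] if out is None else out
    for s in nodes:
        out.append("  " * depth + f'{s["name"]} ({s["dur"]} ms)')
        render(children[s["span_id"]], children, depth + 1, out)
    return out


roots, children = build_tree(spans)
lines = render(roots, children)
print("\n".join(lines))
```

Treating orphaned spans as extra roots is a deliberate choice here: with sampling and lossy transport, a UI that refuses to draw incomplete traces would hide exactly the traces that went wrong.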

§5. OpenTelemetry in Depth

OpenTelemetry (OTel) is the vendor-neutral instrumentation standard that has, over 2022-2026, replaced per-vendor SDKs as the default way applications emit telemetry. It is a CNCF (Cloud Native Computing Foundation) graduated project — second only to Kubernetes by contributor count. Understanding OTel architecture matters because every modern observability discussion assumes you've already picked it for instrumentation; the real choices are downstream of the SDK.

5.1 Why a standard exists

Pre-OTel, instrumentation was per-vendor. Datadog had a dd-trace SDK; New Relic had its agent; Honeycomb had beeline; Jaeger had its own client libraries; Prometheus had client_java, client_python. Each captured the same data — spans, metrics, logs — in incompatible formats. Switching backends meant re-instrumenting every service, line by line. Multi-vendor strategies (Datadog for traces, Prometheus for metrics, Splunk for logs) required three concurrent SDKs in every binary.

OTel exists to decouple instrumentation from backend. Instrument once with the OTel SDK; pick a backend (or several) via configuration on the Collector. Want to migrate from Datadog to Honeycomb? Change the exporter URL. The application code does not move.

This is the same architectural pattern as JDBC (Java Database Connectivity) for databases, ODBC (Open Database Connectivity) before that, and SLF4J (Simple Logging Facade for Java) for logging — a uniform API in the application, swappable provider underneath.

5.2 The three-layer SDK architecture

OTel splits instrumentation into three composable layers. This separation is the source of its power.

Layer 1: Instrumentation libraries (for example opentelemetry-instrumentation-* on PyPI, @opentelemetry/instrumentation-* on npm). Per-framework packages that auto-wrap common libraries: HTTP servers (Express, Flask, Spring), HTTP clients (requests, OkHttp), database drivers (JDBC, psycopg2, MySQL connector), message brokers (Kafka, RabbitMQ), gRPC clients/servers, AWS SDK. Each wraps the library's entry points and emits spans/metrics with conventional attribute names (http.method, http.status_code, db.statement). Hundreds of these libraries exist, contributed by the community.

Layer 2: SDK (the actual span/metric/log processor). The runtime that receives signals from instrumentation libraries (and from manual instrumentation in app code), batches them, optionally samples them, applies resource attributes (service.name, host.id, k8s.pod.name), and hands them to exporters. The SDK is per-language (opentelemetry-sdk on PyPI, io.opentelemetry:opentelemetry-sdk on Maven, go.opentelemetry.io/otel/sdk for Go, etc.), but the configuration model is deliberately aligned across languages.

Layer 3: Exporters (for example opentelemetry-exporter-* on PyPI, @opentelemetry/exporter-* on npm). Pluggable senders. Built-in: OTLP/gRPC, OTLP/HTTP, Prometheus (pull endpoint), Jaeger (deprecated, use OTLP), Zipkin, console (for debugging). Vendor-specific: Datadog, Honeycomb, New Relic, Splunk. The exporter is the only piece you swap when changing backends.

Conceptually:

[Instrumentation library] → [SDK: sampler, processor, batcher] → [Exporter] → wire
       ↑                                  ↑
   auto-wraps                         resource attrs,
   HTTP/DB/MQ                         sampler, span limits

5.3 OTLP — the wire protocol

OTLP (OpenTelemetry Protocol) is the on-the-wire format OTel emits. It is protobuf-defined telemetry carried over either gRPC (HTTP/2) or plain HTTP POST (OTLP/HTTP, with a binary protobuf or JSON body). Most production deployments use OTLP/gRPC for efficiency; OTLP/HTTP is the fallback when corporate proxies/firewalls misbehave with HTTP/2.

The protobuf schema is published as a stable v1: opentelemetry.proto.collector.trace.v1.TraceService, plus the corresponding MetricsService and LogsService. Each is an Export(ExportRequest) returns (ExportResponse) unary RPC. Streaming is intentionally not in the standard: batches are bounded, which lets backends absorb traffic without holding open per-client streams.

A typical OTLP span payload looks like (logical):

ExportTraceServiceRequest {
  resource_spans: [
    {
      resource: { attributes: {service.name="api", service.version="3.2.1", host.id="..."} }
      scope_spans: [
        { scope: {name="opentelemetry-instrumentation-flask", version="0.45.0"}
          spans: [
            { trace_id: bytes(16), span_id: bytes(8), parent_span_id: bytes(8),
              name: "POST /api/checkout",
              start_time_unix_nano: ..., end_time_unix_nano: ...,
              attributes: {http.method="POST", http.status_code=200, ...},
              events: [...], links: [...],
              status: { code: STATUS_CODE_OK } }
          ]
        }
      ]
    }
  ]
}

The fact that resource attributes (which describe the source — service name, version, host) are factored out of each span saves a lot of bytes when batching thousands of spans from the same process.
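The saving from factoring out the resource can be estimated with a quick comparison. JSON stands in for protobuf here purely for convenience, so the absolute byte counts are not representative; the host.id value and span fields are hypothetical.

```python
import json

resource = {"service.name": "api", "service.version": "3.2.1", "host.id": "i-0abc"}
spans = [{"name": "POST /api/checkout", "span_id": f"{i:016x}"} for i in range(1000)]

# Layout 1: resource attributes repeated on every span.
flat = [{**span, "resource": resource} for span in spans]

# Layout 2: OTLP-style, resource stated once for the whole batch.
factored = {"resource": resource, "spans": spans}

flat_bytes = len(json.dumps(flat))
factored_bytes = len(json.dumps(factored))
saved = flat_bytes - factored_bytes

print(f"flat: {flat_bytes} B, factored: {factored_bytes} B, saved: {saved} B")
```

The saving grows linearly with batch size, which is one reason batching in the SDK and Collector pays off twice: fewer requests and less repeated metadata per request.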

5.4 The Collector — the central piece

The OTel Collector is a stand-alone process that receives, processes, and exports telemetry. It is the most consequential operational component of an OTel deployment. Two canonical deployment patterns:

Agent pattern (sidecar / DaemonSet). One Collector instance per host (Kubernetes DaemonSet) or per pod (sidecar). Apps emit OTLP to localhost:4317. Collector handles batching, basic enrichment (k8s metadata, host attributes), and forwards to a regional gateway. Benefits: low-latency local handoff, app SDK can be lighter (less buffering required), enrichment uses local metadata. Drawback: many Collector instances, each consuming a small amount of CPU/RAM.

Gateway pattern (centralized). A pool of Collectors in a cluster receives OTLP from many apps (or from agent Collectors). Performs heavier processing: tail sampling (requires seeing whole traces, so must be at a centralized chokepoint partitioned by trace_id), redaction, fan-out to multiple backends. Benefits: economies of scale, central control plane for sampling policy. Drawback: must be made highly available, network path matters, capacity planning is real.

Most production deployments are agent + gateway: agent does local enrichment and forwards to gateway over OTLP/gRPC, gateway does tail sampling and exports to backends.

5.5 Collector pipeline: receivers → processors → exporters

Inside a Collector, telemetry flows through pipelines, one per signal type (traces, metrics, logs). Each pipeline is receivers → processors → exporters.

Receivers. OTLP/gRPC, OTLP/HTTP, Jaeger Thrift, Zipkin JSON, Prometheus scrape (the Collector can scrape /metrics targets the way a Prometheus server would), Fluent Forward (for log forwarders), syslog, statsd, kafka.

Processors (the interesting layer):

  • batch: buffer spans/metrics for some time window, send as a batch. The single most impactful processor for throughput.
  • memory_limiter: drop incoming data when Collector RAM exceeds a threshold. Without this, a backpressure event OOM-crashes the Collector.
  • attributes: add, remove, hash, or redact attribute values. The PII redaction layer.
  • resource: set/override resource attributes (e.g., inject deployment.environment=prod).
  • tail_sampling: buffer spans per trace_id, decide retention after the trace completes or a timeout.
  • probabilistic_sampler: head sampling.
  • filter: drop spans/metrics matching a predicate (e.g., health-check endpoints).
  • transform: OTTL (OpenTelemetry Transformation Language), a small DSL for arbitrary span/metric/log mutation. Power-user knob.
  • routing: route different signals to different exporters by attribute (e.g., test traffic → dev backend, prod traffic → prod backend).

Exporters. OTLP/gRPC (most common, forwards to gateway or backend), Prometheus remote-write (for metrics into Prometheus-compatible TSDBs like Mimir, VictoriaMetrics, Cortex), Loki (for logs), Tempo/Jaeger (for traces), vendor-specific (Datadog, New Relic, Honeycomb), Kafka, file, debug.

Pipeline is configured as YAML — the Collector is data-driven, not code-driven. Reconfiguring sampling, redaction, or routing is a config push, not a code deploy.
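The receivers → processors → exporters shape can be modelled as function composition. This is a toy model in Python, not the Collector's actual implementation or configuration syntax; the processor functions loosely mirror the filter, attributes, and resource processors, and every name here is illustrative.

```python
def filter_health_checks(batch):
    """Like the filter processor: drop spans for the health-check endpoint."""
    return [s for s in batch if s["attributes"].get("http.target") != "/healthz"]


def redact_attributes(keys):
    """Like the attributes processor: overwrite sensitive attribute values."""
    def process(batch):
        return [
            {**s, "attributes": {k: ("REDACTED" if k in keys else v)
                                 for k, v in s["attributes"].items()}}
            for s in batch
        ]
    return process


def set_resource(attrs):
    """Like the resource processor: inject or override resource attributes."""
    def process(batch):
        return [{**s, "resource": {**s.get("resource", {}), **attrs}} for s in batch]
    return process


class Pipeline:
    def __init__(self, processors, exporters):
        self.processors = processors
        self.exporters = exporters

    def consume(self, batch):
        for process in self.processors:      # processors run in declared order
            batch = process(batch)
        for export in self.exporters:        # fan out to every exporter
            export(batch)


exported = []                                # stand-in for a backend
traces = Pipeline(
    processors=[
        filter_health_checks,
        redact_attributes({"user.email"}),
        set_resource({"deployment.environment": "prod"}),
    ],
    exporters=[exported.extend],
)

traces.consume([
    {"name": "GET /healthz", "attributes": {"http.target": "/healthz"}},
    {"name": "POST /api/checkout",
     "attributes": {"http.target": "/api/checkout", "user.email": "a@example.com"}},
])
print(exported)
```

Because the pipeline is just an ordered list of stages, changing redaction or routing means changing the list, which is the property the YAML-driven Collector relies on.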

5.6 Manual vs auto-instrumentation

Auto-instrumentation. Drop a -javaagent:opentelemetry-javaagent.jar flag into your JVM startup, or run opentelemetry-instrument python app.py, or for Node.js node --require @opentelemetry/auto-instrumentations-node/register app.js. The OTel agent uses bytecode instrumentation (Java), import hooks (Python), or require-in-the-middle (Node) to wrap supported libraries at load time. Zero code changes, hundreds of libraries instrumented. The pragmatic default for most services.

Manual instrumentation. Use the SDK directly in code: with tracer.start_as_current_span("compute_recommendations") as span: .... Required for: business operations the SDK can't see ("checkout_validate" is not an HTTP call), custom attributes (order_id, merchant_tier), span events ("retry attempt 3"), span links (cross-trace causality).

Trade-offs. Auto-instrumentation: fast to deploy, broad coverage, occasional surprises (a library you didn't know was being wrapped — DNS lookups generating spans, for example), small overhead from bytecode rewriting at startup (Java agent adds ~3-5s to startup time), can produce noisy spans for trivial operations. Manual: targeted, low overhead, captures business semantics, but requires discipline and code review.

The production combination: auto-instrumentation for HTTP/RPC/DB/MQ boundaries (the boring boilerplate) + manual instrumentation for business operations + custom attributes. Audit periodically to drop spans no one queries.

5.7 Why OTel won and the remaining trade-offs

Why it won. Three facts: (1) every vendor backed it — Datadog, Splunk, New Relic, Honeycomb, Dynatrace, Grafana Labs all ship OTLP receivers and contribute to OTel. (2) The cost of maintaining a per-vendor SDK across N languages and M frameworks is overwhelming; offloading instrumentation maintenance to the OTel community is rational. (3) The CNCF blessing and Google/Microsoft/AWS contributors mean it is not a vendor's hostage.

Remaining trade-offs. OTel's coverage of logs is still maturing as of 2026 — the spec is stable, but auto-instrumentation for logs lags traces. Many shops still ship logs via Fluent Bit / Vector and use OTel only for traces+metrics. Performance overhead is non-zero — the OTel SDK adds 2-8% CPU in measured Java workloads (depends heavily on instrumentation density). Vendor-specific SDKs are often more optimized for their own backend. Sampling configuration sprawl — head sampling at SDK, tail sampling at Collector, retention in backend all interact; a misconfiguration in any layer breaks the chain.


§6. eBPF-Based Observability

eBPF (extended Berkeley Packet Filter) is a Linux kernel feature that lets sandboxed programs run in kernel space attached to events — syscalls, kprobes, uprobes, tracepoints, network sockets, perf events. The observability promise: see everything every process does without changing application code.

6.1 The core capability

A traditional SDK runs in-process — it sees only what the application explicitly calls. eBPF runs in the kernel — it sees every syscall, every network packet, every CPU sample. From an observability standpoint, eBPF replaces sidecars, agents, and SDK instrumentation with kernel-side hooks that nobody has to opt into.

Concretely, an eBPF program can:

  • Trace every TCP connect, accept, send, recv with timing, without network library instrumentation.
  • Sample CPU stacks every N ms on every process, without enabling profiling in the app.
  • Decode HTTP/gRPC/Postgres/MySQL/Redis protocols on the wire from packets, without a language SDK.
  • Capture file syscalls, exec calls, signal sends, for security forensics.
  • Aggregate metrics in-kernel (eBPF maps) and expose them as Prometheus endpoints.

6.2 Major eBPF observability products

Cilium Hubble (network observability without sidecars). Hubble is part of Cilium, an eBPF-based CNI (Container Network Interface) for Kubernetes. Every packet between pods is observed in kernel space. Hubble produces L3-L7 (Layer 3 through Layer 7) flow logs: source pod, dest pod, protocol, HTTP path, gRPC method, response code, latency. Replaces service-mesh sidecars (Istio's Envoy) for network observability. There is no per-pod sidecar process or memory footprint; the per-packet work runs in the node's kernel, so the cost is paid once per node rather than once per pod.

Pixie (auto-instrument every service without code changes). Acquired by New Relic. Deploys an eBPF agent as a Kubernetes DaemonSet. Decodes HTTP/2, gRPC, Kafka, MySQL, Postgres, Redis, DNS on the wire — including TLS-encrypted traffic by hooking the userland OpenSSL functions before encryption (uprobes on SSL_read/SSL_write). Result: traces and request/response payloads for every service, zero code changes. The promise is "kubectl apply, see everything in 30 seconds."

Parca / Pyroscope (continuous profiling via eBPF perf events). Sample CPU stack traces at 100 Hz across every process on every node. Stacks recorded as flamegraphs and stored as time-series. Find "this commit added 10ms to p99" by diffing flamegraphs before and after the deploy. eBPF makes this cheap (~1% CPU overhead).

Falco (security observability). CNCF runtime security project. eBPF program watches every syscall, applies rules ("alert if /etc/shadow is read by a non-root process"), forwards events. Foundational for runtime threat detection and compliance posture.

Tetragon (Cilium's security observability sibling). eBPF-based runtime enforcement and observability. Watches process execs, file accesses, network connections, with the option to block a syscall mid-kernel — observability with teeth.

6.3 The "no code changes" promise

The marketing claim is: instrument your entire cluster in minutes, no code changes, see every request. The reality:

What works without code changes.

  • L4 (Layer 4) network flows: every TCP/UDP connection, with bytes and timing.
  • L7 protocol decoding: HTTP/1, HTTP/2, gRPC, Postgres wire, Kafka wire, DNS, Redis, MySQL.
  • CPU profiling: stack traces at 100 Hz, mapped to symbol names via DWARF debug info.
  • Syscall tracing: every file open, every exec, every network connect.
  • Memory profiling (limited): allocation samples via uprobes on malloc/free.

What still needs application instrumentation.

  • Business operations the kernel can't see. There is no kernel hook for "the user clicked checkout." Span names like "compute_recommendations" require manual SDK calls.
  • Custom attributes such as order_id and merchant_tier. These live in application memory, not on the wire.
  • Trace context propagation. Although eBPF can read the traceparent header from HTTP requests, propagating it across async boundaries (Kafka, background jobs) still requires SDK awareness.
  • Business log lines such as "fraud check rejected this transaction because rule_id=42". These are emitted by the application and are not visible to eBPF.

So eBPF is complementary to OTel, not a replacement. It dominates L4-L7 network observability and continuous profiling; OTel dominates business-semantic spans and metrics.

6.4 Kernel version constraints

eBPF capabilities depend on kernel version. The ecosystem has evolved fast:

  • Linux 4.9+ (2016): basic eBPF, kprobes, perf events. Floor for any eBPF observability.
  • Linux 4.18+ (2018): BTF (BPF Type Format), enabling portable eBPF programs (CO-RE — Compile Once Run Everywhere).
  • Linux 5.4+ (2019): bpf_skb_load_bytes_relative, more flexible socket programs.
  • Linux 5.8+ (2020): ring buffers, atomic ops, broader uprobe support.
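  • Linux 5.10+ (2020, LTS): a common production floor. Exact minimums vary by tool and release, so check each project's requirements, but Cilium Hubble, Pixie, and Parca all run comfortably on 5.10+.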
  • Linux 5.15-6.x: newer hook types (e.g., bpf_loop, kfuncs), required for some Cilium Tetragon features.

Older kernels (RHEL 7 with 3.10 + backports, RHEL 8 with 4.18) sharply limit what's available. Some shops still run kernels too old for the modern eBPF observability stack — a structural constraint on adoption.

6.5 Failure modes specific to eBPF

Verifier rejects program. Every eBPF program is checked at load time by the kernel verifier, which requires bounded loops and rejects out-of-bounds memory access. Complex programs can be rejected, and because the verifier itself changes between kernel versions, what it accepts changes too. This is the "used to work on 5.10, fails on 5.15" class of bugs.

Kernel panic from buggy program. Rare with the verifier, but historical bugs in JIT compilation and helper functions have caused panics. Production rollouts of new eBPF probes follow staged rollout discipline.

Performance overhead under load. A poorly written program (e.g., printing every packet) can saturate a CPU. Maps with millions of entries cause hashing pressure. Per-CPU maps and ring buffers are the standard high-throughput pattern.

Encrypted-traffic blind spot. TLS hides L7 payloads from packet-level eBPF. Pixie's workaround is uprobing OpenSSL before encryption; this works for dynamically-linked OpenSSL but breaks for statically-linked binaries, Go's crypto/tls, or BoringSSL variants. Coverage is uneven across runtimes.


§7. Continuous Profiling

Continuous profiling is the fourth pillar in many recent observability discussions. Where one-off profiling (run perf or pprof when an engineer suspects something) gives a single snapshot, continuous profiling samples profiles continuously across all processes and stores them as time-series. The output: at any past timestamp, you can ask "what was the CPU doing at 2:34 PM yesterday?"

7.1 The data model

A profile is a flamegraph — a tree of (stack_trace → cumulative_sample_count). Each sample is a stack — [func_a, func_b, func_c] — with a weight (1 sample, 10 ms, 64 KB of memory, whatever the profile measures). A flamegraph aggregates millions of samples into a tree showing where time/memory/allocations were spent.

Continuous profiling: take a flamegraph every N seconds (typically 10s) for every process and store as a time-series of flamegraphs, queryable by service, host, timerange. Storage shape similar to traces — heavy event-shaped data — but with tree-structured payload rather than span-tree.
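To make the data model concrete, here is a minimal sketch that folds a few sampled stacks into a flamegraph-style tree. The `Node` class, the `fold` function, and the sample stacks are invented for illustration; production profilers use far more compact encodings.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One frame in the flamegraph tree."""
    name: str
    total: int = 0                      # cumulative weight of this frame and its callees
    children: dict = field(default_factory=dict)


def fold(samples):
    """Aggregate (stack, weight) samples into a tree rooted at 'root'.

    Each stack lists the outermost caller first, e.g. ["main", "handle", "parse"].
    """
    root = Node("root")
    for stack, weight in samples:
        root.total += weight
        node = root
        for frame in stack:
            node = node.children.setdefault(frame, Node(frame))
            node.total += weight
    return root


# Hypothetical samples: (stack, number of samples observed)
samples = [
    (["main", "handle", "parse_json"], 7),
    (["main", "handle", "query_db"], 2),
    (["main", "gc"], 1),
]
tree = fold(samples)
```

Each node's `total` is the width of its bar in the rendered flamegraph; storing one such tree per process per interval gives the time-series of flamegraphs described above.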

7.2 Profile types

  • CPU profile. Stack traces sampled at fixed rate (10-100 Hz). Tells you where CPU cycles went.
  • Heap / memory profile. Stack traces at allocation sites, weighted by bytes allocated. Tells you where memory went.
  • Lock / contention profile. Stack traces at blocked-on-mutex sites, weighted by wait time. Tells you where threads queued.
  • I/O profile. Stack traces at syscall sites blocking on I/O. Tells you where blocking I/O happened.
  • Goroutine / thread profile. Snapshot of every live goroutine or thread with its current stack — for diagnosing deadlocks and goroutine leaks.

7.3 Major continuous profiling products

Pyroscope (Grafana Labs). Open source. Originally a stand-alone product, now merged into Grafana's stack as Grafana Pyroscope. Pull or push profile ingestion; storage on local SSDs + S3 for long retention. Flamegraph diff view — pick two timeranges, see what changed.

Parca (Polar Signals). Open source eBPF-based continuous profiler. eBPF program samples stacks across every process on every node at 100 Hz, sends to Parca server. Stack symbolization via DWARF debug info. The "auto-profile everything" angle.

Datadog Continuous Profiler. SaaS. JVM, Go, .NET, Python, Node agents. Differential views integrated with traces — for a slow trace, click "show profile during this span" to see the CPU flamegraph for just that interval.

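Phlare. Grafana Labs' earlier open-source profiling database, since merged with Pyroscope into Grafana Pyroscope. (Pyrra, sometimes listed alongside it, is an SLO tool for Prometheus rather than a profiler.)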

7.4 Flamegraph diff — finding the regression

The killer use case: "this commit added 10ms to p99. What in the stack is responsible?"

Take a flamegraph for the hour before the deploy and the hour after. Diff them: for each (stack_trace), compute samples_after − samples_before. Stack frames that grew are colored hot, frames that shrank are cold. Read the diff — usually one frame jumps out: "yes, the new JSON serializer is taking 8ms more per call."

This collapses what used to be a multi-day investigation ("which of the 17 changes in this deploy caused the regression?") into a 10-minute review.
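The diff step itself is small. The sketch below assumes profiles have already been folded into a mapping from stack to sample count; the function name and the numbers are hypothetical.

```python
from collections import Counter


def diff_profiles(before, after):
    """Return {stack: samples_after - samples_before}, largest growth first."""
    delta = Counter(after)
    delta.subtract(before)          # keeps negative counts, unlike the '-' operator
    return dict(sorted(delta.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical folded profiles: stack (tuple of frames) -> sample count.
before = {
    ("main", "handle", "serialize_json"): 120,
    ("main", "handle", "query_db"): 300,
}
after = {
    ("main", "handle", "serialize_json"): 410,
    ("main", "handle", "query_db"): 290,
}

changes = diff_profiles(before, after)
hottest_stack, growth = next(iter(changes.items()))
```

If the two windows carried different traffic, it is safer to normalize each profile by its total sample count before diffing, so that a busier hour does not look like a regression.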

7.5 Differences from one-off profiling

Aspect   | One-off profile (perf, pprof)            | Continuous profile
---------|------------------------------------------|--------------------------------------------
When     | Reactively, after suspicion              | Always, prospectively
Coverage | One process, one time window             | Every process, every interval
Storage  | A file on disk                           | Time-series store, queryable by time/service
Workflow | SSH, run profiler, copy file, analyze    | Open dashboard, pick timerange
Cost     | Engineer time per investigation          | ~1% CPU overhead always-on
Best for | Deep dive in dev/staging                 | Production regression hunting

The shift from one-off to continuous mirrors the shift from "tail the logs over SSH" to "central log aggregation" — the move from reactive to ambient observability.

7.6 Failure modes

Symbolization breaks on stripped binaries. Without DWARF debug info or build-id-based debuginfod, eBPF profiling shows hex addresses instead of function names. Operationally, you have to ship symbol files to the profiler or run debuginfod.

Sampling bias on green-threaded runtimes. CPU sampling sees one OS thread; with goroutines or async tasks, the same OS thread runs many logical contexts. Runtime cooperation (Go's pprof labels) helps.

Cost of profile storage. A naive implementation stores every flamegraph as a tree. Modern profilers store as columnar (Parca-style) with delta compression — still real money at fleet scale.


§8. Sampling Strategies in Depth

Sampling is what makes traces (and increasingly logs and profiles) affordable. The wrong sampling strategy makes the platform either too expensive or blind to the interesting events. This section unpacks the strategy space — §3 introduced the head/tail/hybrid axis; here is the operational depth.

8.1 Head-based sampling (decide at root span)

The trace SDK at the root service flips a weighted coin: random() < 0.01. If yes, set sampled=1 on the trace context; if no, set sampled=0. The sampled bit propagates with traceparent through every downstream service. Every downstream SDK honors it — either emit spans or skip.

Strengths. Cheap: 99% of traces never produce any span, anywhere. Zero coordination: SDKs decide independently. Latency-neutral: no buffering. The "set and forget" choice.

Weaknesses. Cannot preferentially keep interesting traces. At root span, you don't yet know whether the request will error, take 5 seconds, or hit the rare bug. By the time you find out, the decision is locked in. The "ML team can't debug the 0.1% bug" failure (§22.3) is exactly this.

Variants:

  • Probabilistic head sampling. Uniform p across all traces.
  • Per-route head sampling. /api/healthcheck at 0.001%, /api/checkout at 10%. Route-level policy.
  • Per-tenant head sampling. Free-tier customer at 1%, enterprise customer at 100%. SaaS tier-aware.
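A per-route head sampler can be sketched in a few lines. The route table, the default rate, and the function names below are assumptions made for the example, not any SDK's actual API.

```python
import random

# Hypothetical per-route probabilities; unknown routes fall back to DEFAULT_RATE.
ROUTE_RATES = {"/api/healthcheck": 0.00001, "/api/checkout": 0.10}
DEFAULT_RATE = 0.01


def head_sample(route, rng=random):
    """Decide at the root span whether this trace is sampled."""
    rate = ROUTE_RATES.get(route, DEFAULT_RATE)
    return rng.random() < rate


def trace_flags(sampled):
    """The decision travels downstream as the trace-flags field of traceparent."""
    return "01" if sampled else "00"


rng = random.Random(42)  # seeded only so the example is repeatable
kept = sum(head_sample("/api/checkout", rng) for _ in range(100_000))
```

With a 10% rate, `kept` lands close to 10,000 out of 100,000 simulated requests. Downstream services never re-roll the dice; they only read the flag.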

8.2 Tail-based sampling (decide at end of trace)

Every service emits every span unconditionally. Collector (or dedicated tail-sampling tier) buffers spans by trace_id for ~30s. When the trace is judged "complete" (last span seen + timeout, or root span end + timeout), apply a policy: keep all if any span has status=ERROR, or trace duration > 2s, or http.status_code >= 500, or a custom predicate; else keep p=0.01 random.

Strengths. Catches the interesting traces by definition. Error and slow traces are kept at 100% even when they are only 1% of traffic. The random baseline allows sample-rate normalization.

Weaknesses.

  • Memory. Every span of every trace lives in collector RAM for ~30s. At 60M spans/sec ingest, a 30s buffer ≈ 1.8B spans. At ~2 KB per span, that is ~3.6 TB of RAM aggregated across the tail-sampling tier.
  • Trace affinity. All spans of a trace must be routed to one collector. Otherwise, span 1 lands on collector A and span 2 on collector B; neither sees the full trace, so neither can decide. A Kafka topic partitioned by trace_id is the standard solution.
  • Late-arriving spans. A long-running span (a 5-minute batch job) that's part of a trace whose root finished 30s ago will be evicted from the buffer and never associated with its trace. Trade-off between buffer window and tail-sampling correctness.
  • Centralized chokepoint. The tail-sampling tier has to be highly available. If it goes down, all traces (including errors) are at risk.
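The buffering and policy logic can be illustrated with a toy in-memory sampler. This is a sketch under simplifying assumptions: spans are plain dicts with hypothetical field names, completeness is signalled by the caller, and there is no eviction or partitioning.

```python
import random
from collections import defaultdict


class TailSampler:
    """Buffers spans per trace_id and decides once the trace is complete."""

    def __init__(self, slow_ms=2000, baseline=0.01, rng=None):
        self.slow_ms = slow_ms
        self.baseline = baseline
        self.rng = rng or random.Random()
        self.buffer = defaultdict(list)     # trace_id -> list of span dicts

    def add_span(self, span):
        self.buffer[span["trace_id"]].append(span)

    def decide(self, trace_id):
        """Call when the trace is judged complete (e.g. root ended + timeout)."""
        spans = self.buffer.pop(trace_id, [])
        if not spans:
            return False, spans
        has_error = any(
            s.get("status") == "ERROR" or s.get("http_status", 0) >= 500
            for s in spans
        )
        duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
        keep = has_error or duration > self.slow_ms or self.rng.random() < self.baseline
        return keep, spans


sampler = TailSampler(baseline=0.0)   # baseline 0 so the example is deterministic
sampler.add_span({"trace_id": "t1", "start_ms": 0, "end_ms": 40, "status": "OK"})
sampler.add_span({"trace_id": "t1", "start_ms": 5, "end_ms": 30, "status": "ERROR"})
sampler.add_span({"trace_id": "t2", "start_ms": 0, "end_ms": 50, "status": "OK"})
keep_t1, _ = sampler.decide("t1")
keep_t2, _ = sampler.decide("t2")
```

Here `t1` is kept because one of its spans errored, while the healthy, fast `t2` is dropped. The `buffer` dict is exactly the RAM cost described in the Memory bullet.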

8.3 Hybrid (the production sweet spot)

Most production deployments combine head and tail:

SDK head-samples at 10%    → drops 90% before any wire bytes
   ↓
Collector tail-samples on remaining → keeps 100% of errors + slow,
                                        keeps 1% random of healthy
   ↓
Net retention: 100% of interesting traces that survive the head stage, plus ~0.1% of healthy traces

The head stage saves Kafka bandwidth and tail-sampler RAM. The tail stage rescues the interesting traces from the head-sampled stream.

Sampling key matters. Head sampling must use a key uniformly distributed across traces, typically the trace_id itself, which is 128 bits wide and which W3C Trace Context expects to be randomly generated. A poorly seeded RNG (timestamp-derived, see §22.6 biased sampling) produces a non-uniform sample.

Consistent sampling. sampled = trace_id_hash mod 100 < 10 ensures all spans of a trace make the same head decision — required for trace completeness.
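A deterministic decision derived from the trace_id might look like the following. Using the low 64 bits as the key is a choice made for this sketch, and it assumes the ids are randomly generated.

```python
def consistent_sample(trace_id_hex, percent=10):
    """Deterministic head decision derived from the trace_id itself.

    Uses the low 64 bits of the 128-bit id as the uniformly distributed key.
    """
    key = int(trace_id_hex[-16:], 16)
    return key % 100 < percent


trace_id = "0af7651916cd43dd8448eb211c80319c"   # example id from the W3C spec
# Any service evaluating the same trace_id reaches the same decision.
decisions = {consistent_sample(trace_id) for _ in range(5)}
```

Because the function has no hidden state, every SDK in the call path can evaluate it independently and still agree, which is what keeps sampled traces complete.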

8.4 Adaptive sampling (rate adjusts to traffic)

Static sampling rates break under traffic spikes. At 1% sampling on a 10x traffic burst, span ingest also goes 10x; downstream storage can't keep up. Adaptive sampling adjusts the rate inversely to traffic:

target_spans_per_second = 100K
current_spans_per_second = N        (measured before sampling)
new_sample_rate = target_spans_per_second / N        (clamped to [0.001, 1.0])

Implementation lives in the SDK or the Collector. Datadog's APM and Honeycomb's auto-sample both implement variants.

Per-route adaptive sampling. Sample /api/checkout so its span rate stays at 10K/sec. Sample /api/healthcheck so its span rate stays at 100/sec. Different routes get different sample rates dynamically.
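The rate calculation, including the per-route variant, can be sketched as below. The targets and observed rates are made-up numbers, and a real implementation would smooth the observed rate over a window rather than react to each reading.

```python
def adaptive_rate(target_per_sec, observed_per_sec, floor=0.001, ceiling=1.0):
    """Sample rate that keeps the sampled volume near target_per_sec."""
    if observed_per_sec <= 0:
        return ceiling
    return max(floor, min(ceiling, target_per_sec / observed_per_sec))


# Hypothetical per-route targets (sampled spans/sec) and observed pre-sampling rates.
targets = {"/api/checkout": 10_000, "/api/healthcheck": 100}
observed = {"/api/checkout": 200_000, "/api/healthcheck": 5_000_000}
rates = {route: adaptive_rate(targets[route], observed[route]) for route in targets}
```

In this example checkout settles at 5%, while healthcheck would need 0.002% and is held at the 0.1% floor, so its actual volume overshoots the target. The clamp is a deliberate trade: a floor guarantees some visibility even during extreme bursts.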

8.5 The "1% sample but 100% of errors" hybrid recipe

The most commonly cited production recipe:

  • Head sample at SDK: 10% (knocks out the bulk of healthy traffic).
  • Tail sample at Collector:
      • Keep 100% if status=ERROR or http.status_code >= 500.
      • Keep 100% if duration > p99_of_service.
      • Keep 100% if marked force-keep by application (debug flag, A/B test cohort).
      • Otherwise, keep with probability 1%.

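Final retention depends on how much traffic is interesting: it is 10% × (interesting fraction + 1% of the rest). If 2% of traces are errors or slow, that comes to about 0.3% of all traces. Note that the head stage has already dropped 90% of errors; keeping literally 100% of errors requires head-sampling at 100% and relying on the tail stage alone. Within the retained set, the "interesting" categories are over-represented relative to baseline; you can still compute true rates by tracking original sample rates per category.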
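The arithmetic behind a recipe like this is easy to check. In the sketch below, the share of interesting traces (2%) is an assumed input, and the function names are illustrative only.

```python
def net_retention(head_rate, interesting_fraction, tail_baseline):
    """Fractions of all traces kept after head sampling followed by tail sampling."""
    kept_interesting = head_rate * interesting_fraction
    kept_healthy = head_rate * (1 - interesting_fraction) * tail_baseline
    return kept_interesting, kept_healthy


def upscale(kept_count, effective_rate):
    """Estimate the original count from a sampled count and its sample rate."""
    return kept_count / effective_rate


# Assumed inputs: 10% head sample, 2% of traces are errors or slow, 1% tail baseline.
interesting, healthy = net_retention(0.10, 0.02, 0.01)
total = interesting + healthy
```

Under these assumptions about 0.3% of all traces are retained. Interesting traces carry an effective rate of 10% and healthy ones 0.1%, so `upscale(200, 0.10)` turns 200 retained error traces back into an estimated 2,000 real ones.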

8.6 Sampling for logs and metrics

Logs. Sampling logs is harder because each log line is often individually meaningful (an error log might be the only evidence of an event). Many shops keep 100% of ERROR/WARN logs and sample INFO/DEBUG to 10%. Some apply per-event-name sampling: keep 100% of payment_charged, 1% of user_pageview.

Metrics. Metrics are pre-aggregated, so "sampling" doesn't apply in the same way. The closest analog is cardinality control — drop high-cardinality labels, only export top-N. The economic problem is cardinality, not sampling.

Profiles. Continuous profiling samples at 100 Hz on every CPU; that's already heavily sampled. Further "trace-correlated profiling" (only profile during sampled traces) is an emerging technique.


§9. SLO / SLI / SLA Framework

Service-level reliability vocabulary is borrowed from Google's SRE book and now standard across the industry. Mixing up the three terms wastes meetings; the discipline behind them is what makes alerting and incident response coherent.

9.1 Definitions

SLI (Service Level Indicator). The measurement. A quantifiable signal about service health. Examples:

  • successful_requests / total_requests over 1m windows.
  • requests_below_500ms / total_requests.
  • availability = uptime / total_time.
  • freshness = max(now() − event_timestamp) for streaming pipelines.

The SLI is a metric — computed from the same observability primitives the platform already collects.

SLO (Service Level Objective). The target the team commits to internally. The threshold the SLI must exceed. Examples:

  • "99.9% of requests succeed over rolling 30 days."
  • "99% of requests complete in <500 ms over rolling 7 days."
  • "Streaming pipeline freshness <60 seconds for 99% of events."

The SLO is the internal aspirational target — what engineering says it will hit.

SLA (Service Level Agreement). The contractual commitment to the customer, usually with financial consequences. Examples:

  • AWS S3: "99.9% monthly uptime, or refunds proportional to downtime."
  • Stripe: "99.99% API uptime."

SLA is typically looser than SLO by design: if engineering targets 99.9% (SLO) but contracts on 99.5% (SLA), the gap (0.4 percentage points) is the safety margin between "we missed our target" and "we owe the customer money." The SLA is also typically measured over coarser time windows (calendar month vs rolling window).

Crucially: SLA is a business artifact, SLO is an engineering artifact, SLI is a measurement artifact. Treating them as interchangeable confuses operations.

9.2 Error budget — the engineering practice

If SLO = 99.9% over 30 days, then error budget = 1 − 0.999 = 0.1% = 43.2 minutes of allowable downtime (or failed requests) per 30 days. The team has 43.2 minutes per month to "spend" on incidents, bad deploys, and risky changes.
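The same arithmetic as a small helper, for checking other SLO targets or window lengths. The function names are illustrative, and the sketch treats the budget as time-based (minutes of downtime) rather than request-based.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed 'bad' minutes in the window for a given SLO (e.g. 0.999)."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo, bad_minutes_so_far, window_days=30):
    """Minutes of budget left; negative means the SLO is already missed."""
    return error_budget_minutes(slo, window_days) - bad_minutes_so_far


budget = error_budget_minutes(0.999)      # about 43.2 minutes
left = budget_remaining(0.999, 5.2)       # about 38 minutes after a 5.2-minute incident
```

Moving the target to 99.99% shrinks the budget to roughly 4.3 minutes per 30 days, which is why each extra nine changes how a team has to operate.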

Why the error budget framing is powerful.

  • It makes risk explicit. "We have 38 minutes of error budget remaining and a major deploy planned" is a quantifiable risk.
  • It aligns engineering and product. A product team pushing for more features faster eats error budget; once exhausted, all attention shifts to reliability.
  • The freeze-deploys rule. If error budget is exhausted, deploys freeze — no new risky changes until next window. Some shops automate this (deploy pipeline checks SLO state).
  • It avoids the "all or nothing" trap. Without budget, every minute of downtime is a "failure"; with budget, the first 43 minutes/month are acceptable, fully consumed budget triggers escalation.

9.3 Multi-window multi-burn-rate alerts

This is the alerting innovation that pairs with SLO/error budget.

The naive approach: alert when SLI < SLO. Bad — by the time monthly SLI breaches monthly SLO, the month's budget is gone; you missed your window.

The right approach: alert on the burn rate of error budget.

burn_rate = current_error_rate / acceptable_error_rate

If SLO is 99.9% (error rate ≤ 0.1%), and current error rate is 1.0%, then burn_rate = 1.0 / 0.1 = 10x. The team is burning the budget 10 times faster than sustainable. At 10x, the month's budget is consumed in 3 days.

Multi-burn-rate, multi-window. Don't alert on just one window — combine short (sensitive, noisy) with long (insensitive, stable) windows for both fast and slow burns.

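A common ruleset, adapted from Google's Site Reliability Workbook:

Severity              | Burn rate | Short window | Long window | Time to budget exhaustion
----------------------|-----------|--------------|-------------|----------------------------------
Page (fast burn)      | 14.4x     | 5 min        | 1 hour      | ~2 days
Page (medium burn)    | 6x        | 30 min       | 6 hours     | ~5 days
Ticket (slow burn)    | 3x        | 2 hours      | 24 hours    | ~10 days
Ticket (slowest burn) | 1x        | 6 hours      | 72 hours    | budget consumed at end of window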

The compound condition (short_window_burn > X AND long_window_burn > X) prevents flapping on a transient 1-minute spike (short burns high, long stays low → no alert).
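The compound condition is simple enough to express directly. This sketch takes error rates as fractions for two windows; in a real system those would come from queries over the SLI, and the function names here are made up.

```python
def burn_rate(error_rate, slo):
    """How many times faster than sustainable the error budget is burning."""
    return error_rate / (1 - slo)


def should_alert(short_error_rate, long_error_rate, slo, threshold):
    """Fire only when both the short and the long window exceed the threshold."""
    return (
        burn_rate(short_error_rate, slo) > threshold
        and burn_rate(long_error_rate, slo) > threshold
    )


SLO = 0.999
# Sustained 2% errors on both windows: ~20x burn, above the 14.4x page threshold.
sustained = should_alert(0.02, 0.02, SLO, 14.4)
# Brief spike: the 5m window is hot (50x) but the 1h window is still near 2x.
spike = should_alert(0.05, 0.002, SLO, 14.4)
```

The sustained case pages and the spike does not. The short window also makes the alert clear quickly once the problem stops, instead of waiting for the long window to drain.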

9.4 What "alert when users hurt" buys you

Compared to threshold-based alerting (CPU > 80%, queue_depth > 1000):

  • Far fewer alerts. ~5 alerts per service (covering page-fast, page-medium, ticket-slow burns × 1-2 SLIs) instead of 50-200.
  • Every alert represents real user-visible impact, not an internal signal that may or may not matter.
  • Engineers stop ignoring the pager.
  • Capacity planning, deploy gating, and oncall sizing all align on the same number — the error budget.

§10. Alerting Architecture

The alerting plane is the human-actuating side of observability. Architecture matters because alerts that fire reliably under load, route to the right human, and don't drown the team in noise are different from an if x > threshold: page loop.

10.1 Prometheus AlertManager — the canonical reference

In a Prometheus stack, AlertManager is the alert-routing daemon. The flow:

[Prometheus] evaluates [alert rules] (PromQL expressions)
    ↓ when rule fires
[AlertManager] receives alert (HTTP POST)
    ↓
[Grouping] — combine related alerts by labels (cluster, service, alertname)
    ↓
[Inhibition] — suppress alerts implied by other alerts
                  (cluster down → don't page on every pod down)
    ↓
[Silences] — mute alerts during planned maintenance
    ↓
[Routing tree] — match labels to receivers
                  team=payments → PagerDuty key X
                  team=infra    → PagerDuty key Y, also Slack #alerts-infra
    ↓
[Receivers]
    ├─ PagerDuty / Opsgenie (paging)
    ├─ Slack / MS Teams (chat)
    ├─ Email
    ├─ Webhook (incident.io, Rootly, custom)
    └─ SMS

Alert rules are written in PromQL (Prometheus Query Language):

groups:
- name: payments
  rules:
  - alert: PaymentErrorBudgetBurnFast
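    expr: |
      # Burn rate above 14.4x for a 99.9% SLO (0.1% error budget), on both windows.
      # sum() collapses per-instance series so both sides of "and" match.
      (
        sum(rate(payment_failures_total[5m])) / sum(rate(payment_requests_total[5m])) > 14.4 * 0.001
        and
        sum(rate(payment_failures_total[1h])) / sum(rate(payment_requests_total[1h])) > 14.4 * 0.001
      )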
    for: 2m
    labels:
      severity: page
      team: payments
    annotations:
      summary: "Payment SLO budget burning at 14.4x rate"
      runbook: "https://wiki.example.com/runbooks/payment-burn"
      dashboard: "https://grafana.example.com/d/payments-slo"

for: 2m — alert only after condition holds 2 minutes (avoid flapping). annotations.runbook — link in the page to the runbook so oncall doesn't start from zero.

10.2 Grouping and routing — the noise control layer

Grouping. Without grouping, a cluster-wide outage that fires 800 pod-down alerts produces 800 pages. AlertManager's group_by: [cluster, service] collapses them into one page with the count: "service=api in cluster=use1 has 200 pods down." One page, one incident.

Routing tree. A tree of matchers. First match wins.

route:
  group_by: [alertname, cluster, service]
  receiver: default-slack
  routes:
  - match: { severity: page }
    receiver: pagerduty-primary
    continue: false
  - match: { team: payments }
    receiver: pagerduty-payments-team
  - match: { severity: ticket }
    receiver: jira-ticket-creator

continue: false is the default — first match terminates. continue: true allows multi-routing (also send to Slack while paging).

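Inhibition. "If the parent is down, don't alert on the children." Expressed as an inhibit rule: a firing alert matching severity=critical and alertname=ClusterDown (the source) suppresses alerts matching severity=warning and alertname=PodDown (the target) whenever the labels listed under equal, such as cluster, have the same values on both. Stops the cascade.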

Silences. Time-bounded mutes, usually for planned maintenance or known incidents. UI lets oncall acknowledge and silence in one click.

10.3 Alerting fatigue and how to fight it

Pre-SLO shops typically have 100-500 alert rules per platform. Most fire transiently, self-resolve, and condition the oncall to ignore the pager. Engineers stop reading alert titles. The next real incident is missed.

Anti-patterns that cause fatigue.

  • Cause-based alerts. CPU > 80%, queue_depth > 1000, disk_usage > 75%. These alert on internal causes, not customer impact. CPU at 80% on a CPU-elastic system is fine.
  • Static thresholds across all services. What counts as a "high error rate" for the payment service (must be perfect) is not the same as for the search-suggestions service (degraded mode is fine).
  • Single-window alerts. rate(errors[5m]) > 0.01 will flap on every 5-minute traffic anomaly.
  • One alert per micro-condition. Separate alerts for each instance of a service mean ten instances of one alert produce ten pages.

Fighting fatigue.

  • SLO-burn-rate alerts. See §9 and §10.1. Replaces 100 cause-based alerts with ~5 effect-based alerts per service.
  • Consolidated dashboards. When the oncall is paged, they hit one "service-health" dashboard, not 14 different ones. The "Datadog Service Page" pattern: one URL per service, all signals.
  • Symptom and error budget remaining are first-class. Pages include "you have 23 minutes of error budget left this week", which frames urgency.
  • Auto-resolution. Alert fires, pages, oncall starts investigating; condition self-resolves at the 4-minute mark. AlertManager sends the resolution notification, which closes the page and drops the war room.
  • Postmortem cycle. Every alert that fires should periodically be reviewed. "Did this represent real user impact?" If not, delete the rule or tune the threshold.
  • Pager budget per team. "Team X is paged > 5 times per week → engineering manager owns reducing it." Treats noise as a managed defect.

10.4 On-call ergonomics

Pager rotation tooling. PagerDuty, Opsgenie, Grafana OnCall. Rotate by week / day, escalation chains, "follow-the-sun" multi-region rotations.

Runbook discipline. Every alert annotation links to a runbook. Runbooks are short — "verify in dashboard X, restart service Y if Z, escalate to team A otherwise." Updated after every incident.

Page acknowledgment. First responder acks → escalation cancels. Resolves → page closes. Time-to-ack metric tracked.

Postmortem feedback loop. Every page produces an incident record. Postmortems extract preventive engineering work into the backlog. The good shops have a "no alert without a runbook" rule.


§11. Log Structure and Parsing

The "logs are just text" mental model breaks at scale. Free-text logs without structured fields produce search latency that scales with data volume, and entire categories of debugging become impossible. Structured logging is the discipline that makes logs queryable.

11.1 Structured vs unstructured

Unstructured (the old way):

2026-05-22 14:32:11 INFO  Payment processed for user alice@example.com
   in 234ms total=49.99 currency=USD trace=abc123

Looks fine. To search "all payments by alice@example.com," grep. To compute "average payment by currency," parse with awk. To correlate with a trace, regex-extract trace=. Every query becomes a parsing exercise.

Structured (the modern way):

{
  "ts": "2026-05-22T14:32:11.234Z",
  "level": "INFO",
  "service": "payment",
  "trace_id": "abc123",
  "span_id": "def456",
  "user_id": "alice@example.com",
  "msg": "Payment processed",
  "duration_ms": 234,
  "amount": 49.99,
  "currency": "USD"
}

Each field is named, typed, and indexable. Queries become WHERE user_id='alice@example.com' or WHERE service='payment' AND currency='USD' — fast, no parsing.

11.2 Required fields (the schema)

A production-quality log schema includes at minimum:

  • ts (RFC 3339 timestamp, UTC, millisecond precision) — required for time-range queries.
  • level (DEBUG, INFO, WARN, ERROR, FATAL) — required for severity-based filtering.
  • service (service name) — required for routing and per-service queries.
  • trace_id (W3C trace ID hex) — required for trace-log correlation. Without this, you can't pivot from a slow trace to its logs.
  • span_id (current span) — required for span-level log correlation.
  • host / instance / pod — required for host-level debugging.
  • version (service version / git SHA) — required for "did this start with the new deploy?"
  • msg (human-readable message) — required because logs are still read by humans.
  • Domain-specific structured fields — user_id, order_id, merchant_tier, etc.

Many shops use OpenTelemetry semantic conventions for log attribute names. http.method instead of method. db.statement instead of query. Consistency across services is what makes cross-service queries possible.
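As one way to produce such a schema from Python's standard logging module, the sketch below formats each record as a JSON line. The `JsonFormatter` class and the `fields` attribute convention are inventions of this example; host, pod, and span_id would be added the same way.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line with a fixed set of top-level fields."""

    def __init__(self, service, version):
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        doc = {
            "ts": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "version": self.version,
            "msg": record.getMessage(),
        }
        # Structured fields arrive via extra={"fields": {...}} (a convention of this sketch).
        doc.update(getattr(record, "fields", {}))
        return json.dumps(doc)


formatter = JsonFormatter(service="payment", version="abc1234")
record = logging.LogRecord(
    name="payment", level=logging.INFO, pathname="example.py", lineno=0,
    msg="Payment processed", args=(), exc_info=None,
)
record.fields = {"trace_id": "abc123", "currency": "USD", "duration_ms": 234}
line = formatter.format(record)
parsed = json.loads(line)
```

In an application the formatter would be attached with `handler.setFormatter(formatter)` and fields passed as `logger.info("Payment processed", extra={"fields": {...}})`. Note that Python names its level WARNING rather than WARN, so a mapping step may be needed to match the schema above.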

11.3 The "can't search for user X across services" pain

Without structured logging, the question "show me everything related to user_id=42 in the past hour, across all services" requires:

  • Knowing every service's log format.
  • Parsing free-text logs in each format.
  • Joining results manually.

With structured logging:

SELECT * FROM logs
WHERE user_id = '42'
  AND ts >= now() - INTERVAL '1 hour'
ORDER BY ts ASC

One query, sub-second. The structured-vs-unstructured gap is the difference between "I can debug this in 5 minutes" and "I can't debug this."

11.4 Log-to-metric pipelines

Some logs encode metrics implicitly: Payment failed lines, Order completed lines, User churned lines. Counting them per minute gives metrics.

The pipeline: logs → stream processor (Flink, Kafka Streams, Materialize, Vector) → counters → push to Prometheus or a TSDB.

{"level":"ERROR","service":"payment","error_code":"insufficient_funds",...}
    ↓ Vector / Fluent Bit
    aggregator counts by (service, error_code) in 1-minute windows
    ↓
push metric: payment_errors_total{service="payment",error_code="insufficient_funds"}

Now you have a low-cardinality metric (drop user_id, keep error_code) that powers dashboards and alerts, while the underlying logs are still searchable for forensics. Two pillars cooperate: high-cardinality detail in logs, low-cardinality aggregate in metrics, both fed from the same source.
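The aggregation step in the middle of that pipeline can be sketched without any streaming framework. The `ts_epoch` field and the function name are assumptions of this example; a real pipeline would do the same thing incrementally inside Vector, Flink, or similar.

```python
import json
from collections import Counter


def count_errors(log_lines, window_seconds=60):
    """Count ERROR events by (window_start, service, error_code).

    Only low-cardinality fields become labels; user_id and similar are ignored.
    Expects a numeric epoch-seconds field named ts_epoch.
    """
    counts = Counter()
    for line in log_lines:
        event = json.loads(line)
        if event.get("level") != "ERROR":
            continue
        window = int(event["ts_epoch"]) // window_seconds * window_seconds
        counts[(window, event["service"], event.get("error_code", "unknown"))] += 1
    return counts


lines = [
    '{"ts_epoch": 120, "level": "ERROR", "service": "payment", "error_code": "insufficient_funds", "user_id": "u1"}',
    '{"ts_epoch": 150, "level": "ERROR", "service": "payment", "error_code": "insufficient_funds", "user_id": "u2"}',
    '{"ts_epoch": 155, "level": "INFO", "service": "payment"}',
    '{"ts_epoch": 185, "level": "ERROR", "service": "payment", "error_code": "card_declined", "user_id": "u3"}',
]
counts = count_errors(lines)
```

Three error events from three different users collapse into two series, one per error_code and window, which is the cardinality reduction the pipeline exists to provide.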

11.5 Parsing legacy unstructured logs

For services that can't be re-instrumented (vendor binaries, ancient code), parse at ingest. Grok patterns are regexes with named captures: %{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}. Logstash and Vector support grok directly; Fluent Bit does the same job with named-group regex parsers. Output is structured even if the source isn't.

Trade-off: parsing is CPU-expensive and brittle (format drift breaks it silently). Always prefer fixing the source if possible.
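The same idea in plain Python, applied to the unstructured line from §11.1. The patterns below are written for that one format and are only a sketch; returning None on a mismatch is how format drift would surface.

```python
import re

# Named groups play the role of grok's %{PATTERN:name} captures.
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<msg>.*)"
)
KV_RE = re.compile(r"(\w+)=(\S+)")


def parse_line(line):
    """Return a structured dict, or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    event = match.groupdict()
    # Promote trailing key=value pairs (total=49.99, trace=abc123) to fields.
    event.update(dict(KV_RE.findall(event["msg"])))
    return event


event = parse_line(
    "2026-05-22 14:32:11 INFO  Payment processed for user alice@example.com "
    "in 234ms total=49.99 currency=USD trace=abc123"
)
```

Counting the lines for which `parse_line` returns None, and alerting when that count rises, is one way to make silent format drift visible.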


§12. Metrics Cardinality Discipline

Cardinality is the single biggest cost lever in metrics — and the single biggest failure mode. The TSDB's storage, RAM, and query cost all scale with the number of distinct series, not the rate of samples. The discipline of controlling cardinality is what separates a healthy metrics platform from one OOM-crashing every Tuesday.

12.1 What cardinality is, concretely

A series is a unique combination of metric name + label values. Two examples emit two series:

http_requests_total{service="api", method="GET", status="200"}   ← series 1
http_requests_total{service="api", method="GET", status="500"}   ← series 2

Cardinality of a metric = product of cardinalities of its labels (in the worst case).

http_requests_total{service, method, status, endpoint, region}
                     50      5       8       100         5
cardinality = 50 × 5 × 8 × 100 × 5 = 1,000,000 series

Add one more label:

                     ... × user_id   (1M unique users)
cardinality = 1,000,000 × 1,000,000 = 1 trillion series

This is the cardinality explosion — one bad label takes the metric from "fine" to "infeasible."
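The worst-case bound is just a product, which makes it easy to evaluate before a label ships. The helper name below is illustrative.

```python
from math import prod


def worst_case_series(label_cardinalities):
    """Upper bound on series for one metric: the product of its label cardinalities."""
    return prod(label_cardinalities.values())


labels = {"service": 50, "method": 5, "status": 8, "endpoint": 100, "region": 5}
base = worst_case_series(labels)                                   # 1,000,000
with_user = worst_case_series({**labels, "user_id": 1_000_000})    # 10**12
```

Real cardinality is usually below this bound because not every combination occurs, but the bound is what a reviewer can compute without production data.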

12.2 Why each series costs

Per active series, the TSDB pays:

  • RAM in the head block. ~1-2 KB per open chunk (current 2h window). At 50M series, that's 50-100 GB RAM.
  • Inverted index size. Posting list entries for every (label_name, label_value) combination. Grows linearly with cardinality.
  • Query cost. A PromQL query sum by(service) (rate(metric[5m])) over a 50M-series metric scans 50M series, even if most are zero.
  • WAL bytes. Each scrape adds N samples to WAL where N = number of series scraped.

Cardinality is the cost axis for metrics. Sample rate matters far less.

12.3 The "1M users" disaster

The recipe for a self-inflicted outage:

  1. Engineer adds user_id label to http_requests_total to debug a per-user issue.
  2. Deploys. Every distinct user_id that sends a request now creates its own new series.
  3. Within 2 hours of traffic, the head block holds 1M new series.
  4. RAM grows by ~1-2 GB. Maybe fine on one beefy node.
  5. The metric is queried — sum by(status) (rate(http_requests_total[5m])) — and now scans 1M series instead of 1000.
  6. Query takes 30s instead of 30ms. Dashboards time out. Prometheus alert rule evaluations fall behind or are skipped.
  7. Or worse: the Prometheus head block exhausts RAM, OOM-killed.

The platform team gets paged. The original feature debugging doesn't happen because the platform is on fire.

12.4 Cardinality control techniques

Bound user-input dimensions. Never label by raw user_id, request_id, session_id, query string. These have unbounded cardinality.

Aggregate buckets. Replace user_id (1M values) with user_tier (5 values: anon, free, basic, pro, enterprise). Same insight (per-tier slicing) at 5 series instead of 1M.

Boolean projections. Replace customer_id with is_high_value_customer (0/1). Still actionable; cardinality 2 instead of 1M.

Per-service quotas at ingest. Mimir, Cortex, and VictoriaMetrics support series limits (Mimir's per-tenant max_global_series_per_user, for example) that refuse new series above N. Alerts the owning team; protects the platform.

Cardinality CI checks. In CI, run the service in a synthetic load, count the cardinality of each metric. If a metric's cardinality grew more than 2x or crossed a threshold, fail the build. Catches cardinality regressions at commit time.

Cardinality dashboards. A Grafana page per tenant: "your top 20 metrics by cardinality, sorted by series count, with growth rate." Engineers can see their footprint and self-police.

Recording rules to pre-aggregate. If you must capture high cardinality at the source, write a recording rule that pre-aggregates: sum by(service, tier) (http_requests_total) → stored as a new derived metric with lower cardinality. Use the derived metric in dashboards; keep the raw metric only in trace-level / wide-event storage.
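The cardinality CI check described above reduces to comparing two sets of series counts. In this sketch the counts are given as dictionaries; how they are collected (for example, by scraping the service under synthetic load) is left out, and the thresholds are arbitrary.

```python
def check_cardinality(baseline, current, max_growth=2.0, hard_limit=10_000):
    """Return a list of violation messages; an empty list means the build passes.

    baseline and current map metric name -> observed series count.
    """
    violations = []
    for metric, count in sorted(current.items()):
        previous = baseline.get(metric, 0)
        if count > hard_limit:
            violations.append(f"{metric}: {count} series exceeds limit {hard_limit}")
        elif previous and count > previous * max_growth:
            violations.append(f"{metric}: grew {previous} -> {count} (> {max_growth}x)")
    return violations


# Hypothetical counts from a synthetic load run on main vs. the candidate build.
baseline = {"http_requests_total": 400, "db_query_seconds_bucket": 1200}
current = {"http_requests_total": 52_000, "db_query_seconds_bucket": 1250}
violations = check_cardinality(baseline, current)
```

A CI job would fail the build when `violations` is non-empty and print the messages, pointing the author at the metric that regressed.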

12.5 The "we logged credit_card_count" anecdote

A real (composite) story: an engineer added a credit_card_count label to a metric to debug a "why so many cards on this account" question. credit_card_count had only about 30 distinct values (most accounts have fewer than 30 cards), so it looked safe. But queries joined that metric with another one labeled by account_id (a million accounts), and the join fanned out across both label sets. RAM tripled in 8 hours. The post-incident learning: cardinality is multiplicative across labels, so a "small" label can be the trigger.

Discipline. Label additions go through a review. "What's the cardinality? What's the cardinality multiplied across labels? What's the cardinality if traffic 10x's? What's the rollback plan?"
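The review questions reduce to a multiplication. A small sketch, with made-up label cardinalities, of the worst-case series count for one metric before and after a label is added (the real count is lower when not every combination occurs, but the upper bound is what the review should look at):

```python
from math import prod

def worst_case_series(label_cardinalities):
    """Upper bound on series for one metric: product of per-label cardinalities."""
    return prod(label_cardinalities.values())

# Hypothetical label sets for http_requests_total.
before = {"service": 20, "status": 10, "method": 5}
after = {**before, "user_id": 1_000_000}

print("before:", worst_case_series(before))
print("after: ", worst_case_series(after))
print("growth:", worst_case_series(after) // worst_case_series(before), "x")
```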


§13. Trace Context Propagation Across Async Boundaries

Traces work because every span carries (trace_id, parent_span_id). The trace is reconstructed by joining spans. The moment context propagation breaks, the trace breaks — and downstream spans become orphans with no parent.

13.1 The W3C standard

W3C Trace Context (https://www.w3.org/TR/trace-context/) defines two HTTP headers:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             │  │                                │                │
             │  │                                │                └─ flags (sampled=1)
             │  │                                └─ parent span_id (8 bytes hex)
             │  └─ trace_id (16 bytes hex)
             └─ version (00)

tracestate: vendor1=value1,vendor2=value2
            (vendor-specific key-value pairs; at most 32 entries, and implementations should propagate at least 512 characters)

traceparent is the must-have. tracestate carries vendor-specific propagation (Datadog's dd-id, AWS X-Ray's IDs).

Every OTel SDK reads these on inbound HTTP/gRPC and writes them on outbound. Hundreds of HTTP libraries are auto-instrumented to do this.
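To make the header layout concrete, here is a stdlib-only sketch that parses a version-00 traceparent and builds the header a service would send downstream. It is not the OTel SDK's implementation; the names TraceParent, parse_traceparent, and child_traceparent are illustrative.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version(2 hex) - trace_id(32 hex) - parent_id(16 hex) - flags(2 hex), lowercase only
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: int

    @property
    def sampled(self) -> bool:
        return bool(self.flags & 0x01)  # lowest bit is the sampled flag

def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed or invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(version, trace_id, parent_id, int(flags, 16))

def child_traceparent(parent: TraceParent) -> str:
    """Header for an outbound call: same trace_id and flags, fresh span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    return f"{parent.version}-{parent.trace_id}-{new_span_id}-{parent.flags:02x}"

inbound = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
outbound = child_traceparent(inbound)
```

The trace_id survives every hop unchanged; only the span ID field is rewritten, which is what lets the backend join spans into one tree.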

13.2 The Kafka producer/consumer gotcha

Naive Kafka usage. Producer sends a message; consumer receives it. The consumer's span has no parent — the trace ends at "produce" and a new trace starts at "consume." Two disconnected trees, same logical request.

The OTel pattern for Kafka.

Producer side:

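from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

tracer = trace.get_tracer(__name__)
propagator = TraceContextTextMapPropagator()
with tracer.start_as_current_span("produce_to_kafka", kind=trace.SpanKind.PRODUCER) as span:
    carrier = {}
    propagator.inject(carrier)   # writes traceparent (and tracestate, if any) into carrier dict
    headers = [(k, v.encode()) for k, v in carrier.items()]
    producer.send(topic, value=payload, headers=headers)   # producer, topic, payload defined elsewhere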

Kafka messages have headers (added in Kafka 0.11). traceparent and tracestate are written as headers.

Consumer side:

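from opentelemetry import trace
from opentelemetry.trace import Link, SpanKind
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

tracer = trace.get_tracer(__name__)
propagator = TraceContextTextMapPropagator()

for message in consumer:
    headers_dict = {k: v.decode() for k, v in message.headers}
    ctx = propagator.extract(headers_dict)   # rebuilds the producer's context from traceparent
    producer_span_ctx = trace.get_current_span(ctx).get_span_context()
    with tracer.start_as_current_span("consume_from_kafka",
                                       context=ctx,              # parent: keeps the producer's trace_id
                                       kind=SpanKind.CONSUMER,
                                       links=[Link(producer_span_ctx)]):
        process_message(message)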

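The consumer span:

  • Inherits trace_id from the producer's context. Passing context=ctx makes the producer's span the parent, and in OTel a span takes its trace_id from its parent.
  • Also carries the producer's span as a Link. A parent is allowed to have ended already, so the Link is not what keeps the trace together here. It matters most for batch consumers, where one consumer span relates to many producer spans and only one of them can be the parent. FOLLOWS_FROM is the OpenTracing name for this relationship; OTel has no such reference type and models it as a Link.
  • Sets kind=CONSUMER so the span shows as a consumer in the trace UI.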

Now: produce span and consume span are both in the same trace, parent-link is correct, the trace UI shows the queue hop as an edge with the queue wait time visible.

13.3 Other async boundaries

Espresso / Pinot CDC streams. The CDC envelope (the message format) must carry trace_id. LinkedIn's CDC framework adds a _trace_context field; consumers extract it like Kafka headers.

Background jobs / cron. A scheduled job has no inbound trace context — it starts a new trace. But the trace_id should be discoverable from the trigger metadata if you want to correlate downstream. Pattern: the scheduler logs (job_id, trace_id) when it triggers; the job creates a span with that trace_id.

Async tasks within a process (e.g., goroutines, thread pools). The OTel SDK uses thread-local / async-local state to track the current context. Submitting work to a thread pool must pass the context explicitly:

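def submit_work(executor, fn, *args):
    ctx = context.get_current()         # opentelemetry.context: capture the submitter's context
    def run():
        token = context.attach(ctx)     # make it current on the worker thread
        try:
            return fn(*args)
        finally:
            context.detach(token)       # restore the worker's previous context
    return executor.submit(run)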

Or use OTel's context.attach wrappers. Missing this is the most common reason "the span has no parent in async code."

WebSocket / Server-Sent Events. Long-lived connections. Each message is logically a separate request; the producer side wraps each message with a fresh traceparent in a custom protocol field. No standard yet.

HTTP/2 push, gRPC streaming. Each stream has one traceparent. Within the stream, each message is part of the same span unless the app creates child spans.

13.4 Detection and enforcement

Detect propagation gaps in traces. A trace with N services but only one span per service is suspicious — probably propagation is broken. Honeycomb and Datadog flag "broken trace" — root span exists with no child spans on the next service.
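A complementary check that is easy to run over exported spans: flag any span whose parent_span_id does not appear in its own trace. This is a sketch over plain dicts with assumed field names, not any vendor's detector, and it will also flag spans whose parents were dropped by sampling or have not arrived yet.

```python
from collections import defaultdict

def find_orphan_spans(spans):
    """spans: list of dicts with trace_id, span_id, parent_span_id (None for roots)."""
    span_ids_by_trace = defaultdict(set)
    for span in spans:
        span_ids_by_trace[span["trace_id"]].add(span["span_id"])
    return [
        span for span in spans
        if span["parent_span_id"] is not None
        and span["parent_span_id"] not in span_ids_by_trace[span["trace_id"]]
    ]

spans = [
    {"trace_id": "t1", "span_id": "a", "parent_span_id": None, "service": "checkout"},
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a", "service": "payment"},
    # Parent "zz" was never recorded in t1: propagation or export broke upstream.
    {"trace_id": "t1", "span_id": "c", "parent_span_id": "zz", "service": "ledger"},
]
orphans = find_orphan_spans(spans)
print([s["service"] for s in orphans])
```

Grouping the orphans by service usually points straight at the boundary that drops the header.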

CI test for propagation. Spin up the service + a downstream mock, send a request with a known trace_id, assert the downstream got the same trace_id. Run on every PR. Catches "we forgot to wire propagation" regressions before prod.

OTel auto-instrumentation reduces the surface area. Most HTTP/gRPC libraries are wrapped by community instrumentation, so propagation is automatic. The remaining propagation work is at custom protocols, Kafka, and async-task boundaries.


§14. Exemplars — The Bridge Between Pillars

A metric data point answers "how many," not "which one." When p99 latency spikes at 2:00 PM, the metric tells you the spike happened — exemplars tell you which specific trace caused it.

14.1 The concept

An exemplar is a pointer attached to a metric data point that links to one example trace. Specifically, exemplars attach to histogram buckets: for the 2s-5s bucket of payment_duration_seconds, the histogram includes a sample trace_id where a request fell into that bucket.

payment_duration_seconds_bucket{le="5"} 12345 # 12345 requests
   # exemplar: trace_id=abc123 value=4.2 timestamp=1716391200

The metric is still cheap to store (one float per bucket per series); the exemplar adds a few bytes per data point (a trace_id and a value).

14.2 The cross-pillar workflow

The exemplar is what makes the unified observability UI work:

  1. Engineer opens a Grafana dashboard, sees p99 latency spike at 2:00 PM.
  2. Hovers the histogram heatmap; sees the spike concentrated in the 2s-5s bucket.
  3. Clicks an exemplar dot on the chart — links to a specific trace in Tempo / Jaeger.
  4. Opens the trace; identifies the slow span (payment → external-bank, 4.2s).
  5. Clicks "view logs" on that span — links to Loki / log search filtered by trace_id.
  6. Reads logs; sees "external-bank returned 503 after 4s retry."
  7. Root cause identified.

The chain metric → trace → log is the cross-pillar pivot. Exemplars make step 3 possible — without them, the engineer has to manually search "traces near 2:00 PM with duration > 2s," a needle-in-haystack query.

14.3 How Prometheus exemplars work

OpenMetrics format (the successor to the Prometheus text format) adds an exemplar syntax:

# TYPE payment_duration_seconds histogram
payment_duration_seconds_bucket{le="5"} 12345 # {trace_id="abc123"} 4.2 1716391200.000

Comment after # carries the exemplar.
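A rough sketch of pulling the exemplar out of such a line with a regex. It handles the simple shape shown above (no escaped quotes or "}" inside label values) and is not a full OpenMetrics parser; parse_sample is an illustrative name.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'        # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'                  # optional {labels}
    r'\s+(?P<value>[^#\s]\S*)'                     # sample value
    r'(?:\s+(?P<ts>[^#\s]\S*))?'                   # optional sample timestamp
    r'(?:\s+#\s+\{(?P<ex_labels>[^}]*)\}'          # exemplar labels after "# "
    r'\s+(?P<ex_value>\S+)'                        # exemplar value
    r'(?:\s+(?P<ex_ts>\S+))?)?\s*$'                # optional exemplar timestamp
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

def parse_sample(line):
    """Return a dict for one sample line, with 'exemplar' set to a dict or None."""
    match = SAMPLE_RE.match(line)
    if not match:
        return None
    sample = {
        "name": match["name"],
        "labels": dict(LABEL_RE.findall(match["labels"] or "")),
        "value": float(match["value"]),
        "exemplar": None,
    }
    if match["ex_labels"] is not None:
        sample["exemplar"] = {
            "labels": dict(LABEL_RE.findall(match["ex_labels"])),
            "value": float(match["ex_value"]),
            "timestamp": float(match["ex_ts"]) if match["ex_ts"] else None,
        }
    return sample

line = 'payment_duration_seconds_bucket{le="5"} 12345 # {trace_id="abc123"} 4.2 1716391200.000'
sample = parse_sample(line)
print(sample["exemplar"]["labels"]["trace_id"])
```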

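Storage. Prometheus keeps exemplars in memory in a single fixed-size circular buffer shared by all series (enabled with --enable-feature=exemplar-storage and sized by storage.exemplars.max_exemplars, 100,000 by default). The oldest exemplars are evicted first, so not all data points have exemplars stored; the buffer captures recent ones.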

Query. Grafana's Explore view shows exemplars as dots overlaid on the chart. Clicking a dot triggers a trace-store query for trace_id=<exemplar>.

14.4 Other linking patterns

Logs ↔ Traces. Every log line carries trace_id. From a trace span, query logs filtered by trace_id. From a log line, click trace_id → land in the trace UI.

Logs ↔ Metrics. Log-to-metric pipelines (§11.4) make logs an upstream source for metrics; tracing the metric back to the log line is via the source label.

Traces ↔ Profiles. Datadog and Pyroscope are experimenting with "trace-correlated profiling" — for a specific slow trace, fetch the CPU profile sampled during that trace. Identifies which function was hot during that specific request.

The full graph: every pillar links to every other pillar via trace_id. The platform exists to let humans (and automation) walk that graph.


§15. PII Redaction in Logs and Traces

PII (Personally Identifiable Information) in telemetry is a compliance and ethics problem. Regulations (GDPR — General Data Protection Regulation, HIPAA — Health Insurance Portability and Accountability Act, PCI DSS — Payment Card Industry Data Security Standard, CCPA — California Consumer Privacy Act) impose hard limits on what can be stored and for how long. The discipline of PII redaction is a structural feature of any mature observability platform.

15.1 What counts as PII

Direct identifiers: full name, email, phone number, SSN (Social Security Number), credit card number, government ID, IP address, geolocation, device ID, biometric data.

Indirect identifiers: combinations that uniquely identify (zip code + DOB + gender; user_id stable across sessions; cookie ID).

Special categories under GDPR: health data, racial/ethnic origin, religious beliefs, sexual orientation, biometric.

Under HIPAA (US healthcare): 18 specific identifiers including medical record numbers, IP addresses, dates more granular than year.

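Under PCI DSS: sensitive authentication data (full magnetic stripe data, CVV/CVC, PIN) must never be stored after authorization, and the full card number (PAN) may be stored only if rendered unreadable (encrypted, truncated, or tokenized). In practice, none of it belongs in logs.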

15.2 Where PII leaks in observability

  • Logs. The most common leak. A debug log f"received request with body: {request.body}" dumps everything — including emails, phone numbers, possibly credit cards.
  • Traces / span attributes. db.statement="SELECT * FROM users WHERE email='alice@example.com'" captures the email in the span. http.url="https://api/users?email=alice@example.com" captures it via URL.
  • Metrics. Less common (cardinality discipline usually rules out PII labels) but possible. user_email as a label is both a cardinality bomb and PII.
  • Error stack traces / exception messages. "TypeError: cannot serialize User(email='alice@example.com')" — the error message contains PII.

15.3 The "we logged credit card numbers" incident class

Real incidents in this class:

  • A debug log accidentally enabled in production prints request bodies, capturing full credit card numbers for hours before discovery.
  • A misconfigured Lambda's stderr stream logs API responses including SSNs.
  • An exception handler logs the whole input object, including unredacted PII.

Once PII is in the log store, it's in: every backup, every replica, every cold-storage Parquet file. Right-to-be-forgotten (GDPR Article 17) gives a user the right to demand deletion — which requires scanning and rewriting petabytes of logs to remove their records. Expensive and often technically infeasible at scale.

15.4 Redaction at the Collector layer

The canonical pattern: redact at the Collector before persistent storage, so PII never lands in logs/traces in the first place.

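OTel Collector's attributes processor supports actions such as hash, delete, and update on whole attribute values. It cannot rewrite part of a value with a regex; that takes the transform processor's replace_pattern function:

processors:
  attributes:
    actions:
    - key: user.email
      action: hash      # one-way hash of the value; consistent for joins, raw email not visible
    - key: user.ssn
      action: delete
    - key: credit_card.number
      action: delete
  transform:
    trace_statements:
    - context: span
      statements:
      # strip the email query parameter, keep the rest of the URL ($$ escapes $ in Collector config)
      - replace_pattern(attributes["http.url"], "([?&])email=[^&]+", "$$1email=<redacted>")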

For free-text fields (log messages, exception strings), regex-based redactors run at the Collector. Detect patterns and replace:

  • Credit card: \b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b → <REDACTED_CC>
  • US SSN: \b\d{3}-\d{2}-\d{4}\b → <REDACTED_SSN>
  • Email: \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b → <REDACTED_EMAIL> (sometimes hash instead, to keep correlation by user).
  • IP addresses: redact to /24 subnet for analytics, full redact for PCI.
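The patterns above can be exercised in a few lines of Python. This is a sketch of the technique, not a Collector processor: redact and truncate_ipv4 are illustrative names, and the credit card pattern only covers the 16-digit grouping listed above.

```python
import ipaddress
import re

# (pattern, replacement) pairs, applied in order to free-text fields.
REDACTIONS = [
    (re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"), "<REDACTED_CC>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<REDACTED_SSN>"),
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "<REDACTED_EMAIL>"),
]

def redact(text):
    """Replace every match of every pattern with its placeholder."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text

def truncate_ipv4(ip):
    """Keep only the /24 network part of an IPv4 address."""
    network = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(network.network_address)

message = "card 4111 1111 1111 1111 for alice@example.com ssn 123-45-6789"
print(redact(message))
print(truncate_ipv4("203.0.113.77"))
```

Regex redaction is best-effort: it produces both false positives (any 16-digit order number) and false negatives (PII in unexpected formats), which is why the structured-field rules come first.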

Tools: Vector, Fluent Bit, Cribl, custom Collector processors. Some shops use ML-based PII detection (Google DLP API, AWS Macie) for harder patterns.

15.5 Right-to-be-forgotten implications

GDPR Article 17 requires deletion of a user's data on request. For observability:

  • Logs. Identify all logs referencing user_id=X across all storage tiers, including S3 cold storage and Glacier. Rewrite to scrub. Expensive but tractable with structured logs.
  • Traces. Identify all spans where attributes contain user.id=X. Rewrite. Trace_id-keyed storage helps.
  • Metrics. Mostly safe if cardinality discipline kept user_id out of labels. If not, time-bound retention helps.

Architectural mitigation. Don't store PII in the first place — use hashed/pseudonymized user IDs in observability data, with the hash → real-ID mapping stored in a separate PII-graded system that supports proper deletion. The observability store deals only with hashes; right-to-be-forgotten happens at the hash-mapping layer.
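A minimal in-memory sketch of that separation. Note one deliberate substitution: it issues a random token per user rather than a hash of the ID, so that once the mapping is deleted nothing can be recomputed from the real ID. PseudonymVault and its methods are hypothetical names; a real vault would be a durable, access-controlled service.

```python
import secrets

class PseudonymVault:
    """Hypothetical PII-graded store: the only place real user IDs live."""

    def __init__(self):
        self._token_by_user = {}
        self._user_by_token = {}

    def token_for(self, user_id):
        """Stable pseudonym to put in logs and span attributes."""
        token = self._token_by_user.get(user_id)
        if token is None:
            token = secrets.token_hex(16)  # random, so not derivable from user_id
            self._token_by_user[user_id] = token
            self._user_by_token[token] = user_id
        return token

    def resolve(self, token):
        """Map a pseudonym back to the real ID, or None if forgotten/unknown."""
        return self._user_by_token.get(token)

    def forget(self, user_id):
        """Right-to-be-forgotten: drop the mapping; telemetry keeps only the token."""
        token = self._token_by_user.pop(user_id, None)
        if token is None:
            return False
        del self._user_by_token[token]
        return True

vault = PseudonymVault()
token = vault.token_for("user-42")
log_line = {"level": "ERROR", "user": token, "msg": "payment declined"}
vault.forget("user-42")
```

After forget, the token in old log lines no longer resolves to anyone, so the petabytes of telemetry do not need rewriting.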

15.6 The minimum-viable PII discipline

  • Default-deny: no debug logs in prod, redaction processor in every Collector pipeline.
  • Schema includes a PII tier: every log/span field tagged pii=none|low|high|forbidden. Forbidden tier rejected at Collector.
  • Automated scans of telemetry for PII patterns; alert on findings.
  • Retention tiers shorter for any logs containing personal data (30 days max for high-PII logs).
  • Audit log of who queried PII fields (the meta-meta-observability).

§16. Retention Tiers and Economics

Telemetry storage cost dominates observability cost. A naive "keep everything hot for 90 days" approach is a multi-million-dollar mistake at any scale. Retention tiers — where data lives, how fast it can be queried, what it costs — are the central economic design.

16.1 The tier model

A typical four-tier model:

Tier     | Medium                                    | Query latency | Typical retention  | $/GB/month (2026, rough)
Hot      | SSD on Prometheus / Loki / ELK / Tempo    | < 1 second    | 7-30 days          | $0.20-0.50
Warm     | SSD/HDD on remote cluster, S3 cached      | 1-10 seconds  | 1-3 months         | $0.05-0.15
Cold     | S3 Standard / Parquet, lazy-loaded        | 10s-minutes   | 6-12 months        | $0.023
Archive  | S3 Glacier / Deep Archive, restore needed | hours-days    | years (compliance) | $0.001-0.004

Cost ratio: ~200x between hot and archive. A 7-year compliance retention requirement on a tiered system is two orders of magnitude cheaper than the same retention all-hot.

16.2 What lives where

  • Hot. Active operational data — last 7-30 days, used for incident response, dashboards, alerts. Query frequency: 100s-10000s of queries per day. Loki recent chunks on SSD, Prometheus head + recent blocks, ELK active indexes, Tempo recent days.
  • Warm. 1-3 months. Used for post-incident analysis, weekly reviews, capacity planning. Query frequency: 10s-100s per day. Compressed Parquet on cheaper SSD, or S3 with caching layer.
  • Cold. 6-12 months. Compliance, occasional forensics, historical trend analysis. Query frequency: 1-10 per day, latency-tolerant. S3 Standard with Athena / Trino / DuckDB on top.
  • Archive. Years. Pure compliance, regulatory retention. Query frequency: near zero, exceptions for legal discovery. S3 Glacier Deep Archive — restore takes hours.

16.3 The economic disaster of "all hot"

A scenario:

  • 10 PB/day log ingestion.
  • 30 days hot retention "because that's safe."
  • 300 PB in hot SSD storage.
  • At $0.20/GB/month, that's $60M/month.

Just for hot logs. Real industry numbers from companies that didn't tier early are in this range.

Compare to a tiered model:

  • 3 days hot (~30 PB hot, $6M/month).
  • 30 days warm (~300 PB warm at $0.05/GB, $15M/month).
  • 1 year cold (3.6 EB cold at $0.023/GB, $83M/month — but only because of the long tail; could halve by sampling or aggregating before cold).

On the same $/GB figures, "all hot for 90 days" would hold 900 PB at $0.20/GB, or $180M/month. The tiered year above totals about $104M/month ($6M + $15M + $83M): roughly 4x the retention for a bit over half the cost.
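The scenario's arithmetic, written out so the inputs can be swapped for your own. It follows the simplification used above (each tier is billed for its full retention window, 1 PB = 1,000,000 GB) and uses 365 days for the cold tier, so the cold figure lands slightly above the rounded $83M.

```python
GB_PER_PB = 1_000_000

def monthly_cost_usd(ingest_pb_per_day, retention_days, usd_per_gb_month):
    """Steady-state monthly storage bill for one tier."""
    stored_gb = ingest_pb_per_day * retention_days * GB_PER_PB
    return stored_gb * usd_per_gb_month

INGEST_PB_PER_DAY = 10

all_hot_30_days = monthly_cost_usd(INGEST_PB_PER_DAY, 30, 0.20)
all_hot_90_days = monthly_cost_usd(INGEST_PB_PER_DAY, 90, 0.20)

tiers = [  # (name, retention_days, $/GB/month)
    ("hot", 3, 0.20),
    ("warm", 30, 0.05),
    ("cold", 365, 0.023),
]
tiered_one_year = sum(
    monthly_cost_usd(INGEST_PB_PER_DAY, days, price) for _, days, price in tiers
)

for label, cost in [
    ("all hot, 30 days", all_hot_30_days),
    ("all hot, 90 days", all_hot_90_days),
    ("tiered, 1 year", tiered_one_year),
]:
    print(f"{label:>18}: ${cost / 1e6:,.1f}M/month")
```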

16.4 Query patterns by tier

Hot queries. Dashboards refresh every 10s, alert evaluations every 30s, incident response queries are 1-10 minutes wide on the last hour. All sub-second. Index-heavy storage formats.

Warm queries. Post-incident analysis, "what did the system look like last Tuesday?" Acceptable latency: 5-30 seconds. Parquet on S3 with bloom filters; queries scan more bytes but few queries per day means the bytes-scanned cost is bounded.

Cold queries. Compliance retrieval, "give me all of user X's actions in the past 6 months." Latency-tolerant; might take minutes. Athena/Trino over Parquet, scanning gigabytes.

Archive queries. Restore from Glacier (hours), then query. Pre-warmed for known legal-hold cases. Rare.

16.5 Tier transitions and downsampling

Data flows down tiers over time:

Hot (7 days) → Warm (3 months) → Cold (1 year) → Archive (7 years)

At each transition, transform the data to be cheaper in the next tier.

  • Hot → Warm. Aggregate raw samples to 1-minute resolution for metrics (downsampling). Compress log chunks more aggressively.
  • Warm → Cold. Aggregate to 1-hour resolution. Drop debug-level logs (already past their useful window). Convert from row to columnar Parquet.
  • Cold → Archive. Aggregate to 1-day. Move to Glacier Deep Archive.
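A sketch of the metric downsampling step, assuming raw samples arrive as (unix_timestamp, value) pairs. It keeps min, max, sum, and count per bucket rather than a single average, so coarser tiers can still be re-aggregated correctly; the function name and tuple layout are illustrative.

```python
from collections import defaultdict

def downsample(samples, step_seconds=60):
    """Aggregate (ts, value) samples into fixed buckets.

    Returns a list of (bucket_start, min, max, sum, count), ordered by time.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        bucket_start = int(ts) // step_seconds * step_seconds
        buckets[bucket_start].append(value)
    return [
        (start, min(values), max(values), sum(values), len(values))
        for start, values in sorted(buckets.items())
    ]

raw = [(0, 1), (10, 3), (59, 2), (60, 10), (70, 20)]
one_minute = downsample(raw, step_seconds=60)
print(one_minute)
```

The same function with step_seconds=3600 or 86400 covers the later transitions; an average is recoverable at any level as sum / count.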

Mimir's compactor merges blocks through a sequence of ranges, 2h → 12h → 24h, but does not downsample. Thanos's compactor does: as blocks age it writes 5-minute and 1-hour resolution blocks alongside the raw data.

16.6 The "ghost in the warehouse" problem

A common mistake: ingest at full fidelity into hot, transition to warm/cold without downsampling. The data is still there but in a tier with high query cost (S3 scan over Parquet) and no one ever queries it because the use case for warm/cold is aggregate trends, not raw events. The result: petabytes of cold storage with zero queries — pure cost.

Discipline. For each tier, define the query use case. If no query use case exists, the data shouldn't be in that tier. Either delete or move to archive (compliance-only).


§17. Multi-Tenancy in Observability Platforms

A platform team owns the observability stack; every product team owns its own services and the telemetry they emit. Multi-tenancy is the architecture that lets this work — one team's bad deploy doesn't bring down another team's visibility.

17.1 The platform-team / product-team split

The contract:

  • Platform team owns: the observability stack (Prometheus/Mimir, Loki, Tempo, Grafana, Collectors, Kafka, AlertManager). Reliability, scaling, multi-tenant isolation, cost controls.
  • Product teams own: their service's instrumentation, dashboards, alerts, runbooks. They use the platform; they don't operate it.

Without explicit multi-tenancy, a noisy product team consumes shared resources — high cardinality, high log volume, runaway traces — and degrades every other team's experience. With multi-tenancy, the noisy team's impact is contained to their own tenant.

17.2 Tenant isolation primitives

Per-tenant ingest quotas. Mimir's max_global_series_per_user, ingestion_rate, and max_global_series_per_metric. Loki has equivalent per-tenant limits (ingestion_rate_mb, for example). The platform refuses metrics/logs from a tenant over budget, which protects the platform and alerts the tenant.
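The core of a series quota is small. The sketch below is an illustration of the idea, not how Mimir or Loki implement it: existing series keep flowing, new series are refused once a tenant reaches its limit, and other tenants are unaffected. SeriesLimiter and its parameters are hypothetical.

```python
from collections import defaultdict

class SeriesLimiter:
    """Per-tenant cap on the number of distinct active series."""

    def __init__(self, default_limit, overrides=None):
        self._default_limit = default_limit
        self._overrides = dict(overrides or {})
        self._active = defaultdict(set)

    def admit(self, tenant, series_key):
        """True if the sample may be ingested; False if it would exceed the quota."""
        active = self._active[tenant]
        if series_key in active:
            return True  # known series: always accepted
        limit = self._overrides.get(tenant, self._default_limit)
        if len(active) >= limit:
            return False  # new series over budget: reject and alert the tenant
        active.add(series_key)
        return True

    def active_series(self, tenant):
        return len(self._active[tenant])

limiter = SeriesLimiter(default_limit=2, overrides={"payments": 3})
for user_id in range(5):
    key = f'http_requests_total{{user_id="{user_id}"}}'
    if not limiter.admit("checkout", key):
        print("rejected:", key)
```

A production limiter also has to expire idle series and share counts across ingesters, which this sketch leaves out.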

Per-tenant storage quotas. Hot/warm/cold storage capped per tenant. Once full, tenant must delete (or pay more).

Per-tenant retention. Tenant A retains 90 days (default), tenant B retains 30 days (their choice / cost decision), tenant C retains 7 years (compliance — pays for it). Configured per tenant.

Per-tenant query isolation. A tenant's query workload doesn't crowd out another tenant. Per-tenant query concurrency limits, per-tenant query budget (CPU-seconds per minute). The "noisy neighbor in the query plane" problem.

Per-tenant authentication & authorization. Tenants only see their own data. Grafana's tenant-aware datasources, header-based tenant identification (X-Scope-OrgID in Mimir/Loki), RBAC for dashboards/alerts.

17.3 Tenant identification

How does the platform know which tenant a piece of telemetry belongs to?

  • By service identity. Service name → tenant lookup (from a service catalog). The OTel Collector adds tenant_id resource attribute based on service.
  • By Kubernetes namespace. One namespace = one team = one tenant.
  • By explicit header. App sends X-Tenant-ID: payments on OTLP requests; Collector routes by header.
  • By cluster. One Kubernetes cluster per tenant. Strong isolation but expensive at scale.

LinkedIn's approach: service catalog maps service → owning team → tenant. Platform components query catalog on receive.

17.4 The chargeback model

To create the right incentives, cost is allocated to tenants:

  • Compute the per-tenant share of platform cost (ingest bytes, storage bytes, query CPU).
  • Bill internally (chargeback) or visibly track (showback).
  • High-cost tenants either reduce volume or pay for the privilege.

Without chargeback, every tenant has incentive to over-instrument and over-retain — externalizing cost to the platform team's budget.
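One simple way to compute the shares, sketched under stated assumptions: the platform's monthly cost is split into pools by fixed weights (the 40/40/20 split here is invented), and each pool is divided among tenants in proportion to their usage on that dimension. Function and field names are illustrative.

```python
def allocate_cost(total_cost, usage_by_tenant, weights):
    """Split total_cost across tenants.

    usage_by_tenant: {tenant: {dimension: usage}}
    weights: {dimension: fraction of total_cost attributed to it}
    """
    bills = {tenant: 0.0 for tenant in usage_by_tenant}
    for dimension, weight in weights.items():
        pool = total_cost * weight
        total_usage = sum(u.get(dimension, 0) for u in usage_by_tenant.values())
        if total_usage == 0:
            continue  # nobody used this dimension; its pool stays unallocated
        for tenant, usage in usage_by_tenant.items():
            bills[tenant] += pool * usage.get(dimension, 0) / total_usage
    return bills

weights = {"ingest_bytes": 0.4, "storage_bytes": 0.4, "query_cpu_seconds": 0.2}
usage = {
    "payments": {"ingest_bytes": 300, "storage_bytes": 100, "query_cpu_seconds": 10},
    "search": {"ingest_bytes": 100, "storage_bytes": 100, "query_cpu_seconds": 30},
}
bills = allocate_cost(1000.0, usage, weights)
for tenant, amount in sorted(bills.items()):
    print(f"{tenant}: ${amount:,.2f}")
```

Showback uses the same numbers; the only difference is whether the result lands in a dashboard or a budget transfer.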

17.5 The hard parts

Hot tenants on shared storage. Even with quotas, query patterns can be uneven. A single tenant's expensive PromQL query can block other tenants behind it in the queue. Per-tenant queue isolation helps; full isolation only via dedicated infra per tenant (expensive).

Cross-tenant queries. Some legitimate use cases need cross-tenant access (security audit, platform team incident response). Special "super-tenant" roles, audited.

Onboarding new tenants. Self-service onboarding is hard. Most platforms have a manual approval flow with cost forecasting before granting a tenant a quota.

Quota tuning over time. A tenant outgrows its quota legitimately. Quota changes are change-management — must not surprise the platform-team capacity plan.


§18. Incident Management Integration

Observability is the substrate that incident management runs on. The integration between observability and incident-management tools determines whether incident response is a stitched-together manual process or a coordinated workflow.

18.1 The incident lifecycle and where observability hooks in

1. Detection      ← alert fires from observability
2. Triage         ← oncall investigates; uses observability dashboards
3. Notification   ← incident-mgmt tool notifies stakeholders, escalates
4. Mitigation     ← team finds root cause via observability, mitigates
5. Resolution     ← alert auto-resolves; observability confirms recovery
6. Post-mortem    ← assemble timeline from observability artifacts

Each step has a specific observability integration.

18.2 Tool ecosystem

Paging: PagerDuty, Opsgenie, VictorOps. Receives alerts from AlertManager (or equivalent), rotates oncall, escalates if unacknowledged.

Incident management: PagerDuty Incident Response, incident.io, Rootly, FireHydrant, Blameless. Tracks the incident as an entity — stakeholders, severity, timeline, mitigations, postmortem.

ChatOps: Slack, MS Teams. Hosts the war room channel created automatically for the incident.

Status pages: Statuspage, status.io, internally-hosted. Public-facing customer communication.

Post-mortem tools: Notion, Confluence, Blameless, FireHydrant.

18.3 Auto-create incidents from alerts

The most consequential integration. When a critical alert fires:

  1. AlertManager → webhook to incident-mgmt tool.
  2. Incident-mgmt creates an incident record with severity from the alert label.
  3. Creates a dedicated Slack channel (#incident-<id>-<short-summary>).
  4. Invites the oncall + auto-detected stakeholders (service owner, dependency owners).
  5. Posts the alert details + links to dashboards + links to relevant traces/logs.
  6. Optionally: auto-creates a JIRA ticket, status page draft.

Within 30 seconds of alert fire, the war room exists, people are paged, and links to the observability data are in chat.
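Steps 2, 3, and 5 amount to a small transformation of the webhook body. The sketch below reads fields that Alertmanager's webhook payload carries (commonLabels, commonAnnotations, alerts[].generatorURL); the severity label, the summary annotation, the incident ID, and the shape of the returned record are assumptions about local conventions, not any incident tool's API.

```python
import re

def incident_from_webhook(payload, incident_id):
    """Build an incident record from an Alertmanager webhook payload (a dict)."""
    labels = payload.get("commonLabels", {})
    annotations = payload.get("commonAnnotations", {})
    alertname = labels.get("alertname", "alert")
    # Chat channel names: lowercase letters, digits, and hyphens only.
    slug = re.sub(r"[^a-z0-9]+", "-", alertname.lower()).strip("-")
    return {
        "id": incident_id,
        "severity": labels.get("severity", "unknown"),
        "channel": f"#incident-{incident_id}-{slug}",
        "summary": annotations.get("summary", alertname),
        "links": [
            alert["generatorURL"]
            for alert in payload.get("alerts", [])
            if alert.get("generatorURL")
        ],
    }

payload = {
    "status": "firing",
    "commonLabels": {"alertname": "Checkout_5xx Rate", "severity": "critical"},
    "commonAnnotations": {"summary": "checkout 5xx rate above SLO burn threshold"},
    "alerts": [{"status": "firing", "generatorURL": "https://prometheus.example/graph?g0=1"}],
}
incident = incident_from_webhook(payload, incident_id=42)
print(incident["channel"])
```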

18.4 Linking traces and logs to incidents

The incident record should include:

  • Links to dashboards showing the breaching SLI.
  • Links to specific traces that demonstrate the failure (via exemplars).
  • Saved log queries filtered to the incident time window and service.
  • Profile diffs if continuous profiling is in play and the regression is identifiable.

These are not screenshots — they are live links that update as the incident evolves. PagerDuty and incident.io both support attaching observability links to an incident.

18.5 Post-mortem timeline from observability artifacts

After resolution, a post-mortem is assembled. The timeline section traditionally was hand-curated from chat logs and memory. Modern integrations auto-assemble:

  • Alert fire/resolve events from AlertManager.
  • Deploy events from CI/CD (Spinnaker, GitHub Actions).
  • Chat messages from the war-room Slack channel.
  • Mitigation actions logged by responders ("rolled back commit abc").
  • Recovery signals from observability (SLI returns to within SLO).

Blameless, FireHydrant, and incident.io all auto-assemble timelines. The human responder annotates and the post-mortem is ready in minutes instead of days.

18.6 The "did this incident affect any customer SLA?" question

Once SLO/SLI/SLA tracking is in place, an incident's customer impact is computable directly from observability data:

  • "During the incident, what was the SLI?"
  • "Which customers' SLAs were breached?"
  • "What's the refund / credit owed?"

The financial follow-on of an incident is computed from the same telemetry that detected the incident. SLA management becomes data-driven instead of "we'll guess at goodwill credits."


§19. Mobile and Frontend Observability

Server-side observability is most of the literature, but the user-facing layer (browser, mobile app) is where the user actually experiences pain. The frontend observability stack is structurally different from the backend stack — different SDKs, different storage shapes, different failure modes.

19.1 What frontend observability captures

  • JavaScript errors / unhandled exceptions. Stack trace, browser, OS, URL, user actions before the error (breadcrumbs).
  • Mobile crashes. Native stack traces (often symbolicated server-side), device info, OS version, app version.
  • Frontend performance. Time to first byte, time to interactive, largest contentful paint, frames per second (FPS), JavaScript long tasks, layout shifts (CLS — Cumulative Layout Shift).
  • User actions. Clicks, route changes, form submissions, A/B test variant assignments.
  • Network requests from the frontend. Each XHR / fetch with timing, status, headers.
  • Backend correlation. Frontend HTTP requests carry a trace_id; backend continues the trace. Now "user said the page was slow" can be diagnosed end-to-end.

19.2 The major products

Sentry. Error tracking with a managed offering and a self-hostable, source-available codebase (it was open source before relicensing). Captures JS / Python / Ruby / Java / mobile errors; symbolicates stacks; groups by signature; suggests an owner from git blame. ~$26/month entry-level, scales by event volume. Storage on Snuba (ClickHouse-based wide-event store).

Crashlytics (Google / Firebase). Mobile crash analytics for iOS and Android. Free. Industry-standard for mobile crash reporting. Symbolication via uploaded dSYM (iOS) / ProGuard mapping files (Android).

Datadog RUM (Real User Monitoring). Frontend SDK that captures performance, errors, user sessions, distributed traces continuing into the backend. Same Datadog backend; cross-stack correlation in one UI.

New Relic Browser, Dynatrace Real User Monitoring, AppDynamics EUM. Vendor equivalents.

LogRocket, FullStory. Session replay — record and replay user sessions (clicks, navigation, even DOM snapshots). The "I want to see what the user saw" use case.

OpenTelemetry JS / Mobile SDKs. OTel has client-side SDKs for browsers (instrument fetch / XHR / DOM events) and mobile (Android / iOS). The vendor-neutral path; still maturing as of 2026 vs the established vendor SDKs.

19.3 Differences from backend observability

Sampling differs. Backends sample at 1% because traffic is high and uniform. Frontend events are bursty (a viral page → 100x traffic spike). Adaptive sampling that auto-adjusts is more important. Errors are usually kept at 100% (rare, high-value).

Symbolication. Mobile crash stack traces are dumps of memory addresses. Server-side symbol stores (dSYM for iOS, ProGuard mapping for Android) translate to function names. The "we shipped a release with broken symbols" failure mode breaks every crash report for a release.

Privacy. Session replay captures user actions — potentially form field values, PII. Frontend SDKs need configurable redaction (mark form fields as "do not record").

Network instability. Mobile SDKs buffer events when offline, retry on reconnect. The "the user crashed during a flight, the crash report shows up 8 hours later" pattern.

Versioning. Mobile apps have N concurrent versions in the wild. Backend has one. Crash reports must be partitioned by version, and the symbol store must keep symbols for every released version.

19.4 What good frontend observability tells the team

  • "Top 10 crashes by user impact in v3.2.1." Not just count — weighted by how many users affected. A single user with a flapping device gives 1000 crashes; one bug affecting 0.1% of users is more important.
  • "Time to interactive p75 by route." Per-route page-load performance; drives optimization priorities.
  • "5% of users who clicked Checkout saw a JS error in the past hour." Functional impact, not just exception count.
  • "This trace shows the user's request from frontend through 8 backend services." Cross-stack causal chain.
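The first item is a ranking by distinct users rather than by event count. A small sketch over crash events represented as dicts; the field names (signature, user_id, app_version) are assumptions about the event schema.

```python
from collections import defaultdict

def top_crashes_by_user_impact(crash_events, version, limit=10):
    """Rank crash signatures in one app version by distinct affected users.

    Returns a list of (signature, affected_users, crash_count).
    """
    users = defaultdict(set)
    counts = defaultdict(int)
    for event in crash_events:
        if event["app_version"] != version:
            continue
        users[event["signature"]].add(event["user_id"])
        counts[event["signature"]] += 1
    ranked = sorted(users, key=lambda sig: (-len(users[sig]), -counts[sig], sig))
    return [(sig, len(users[sig]), counts[sig]) for sig in ranked[:limit]]

events = (
    # One flapping device: 1000 crashes, a single user.
    [{"signature": "NullDeref@Camera", "user_id": "u1", "app_version": "3.2.1"}] * 1000
    # A real bug: few crashes, several users.
    + [
        {"signature": "Timeout@Checkout", "user_id": f"u{i}", "app_version": "3.2.1"}
        for i in range(2, 5)
    ]
)
for row in top_crashes_by_user_impact(events, "3.2.1"):
    print(row)
```

Sorting by raw count would put the flapping device first; sorting by affected users puts the checkout bug first.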

19.5 Failure modes specific to frontend

Symbol mismatch. Engineer uploads symbols for v3.2.0; production users are on v3.2.1. All v3.2.1 crashes are unsymbolicated — show as 0x7fff8c4d3a20 instead of MyApp.PaymentScreen.confirm(). Fix: CI/CD step uploads symbols for every release.

Ad blockers. Browser ad blockers block analytics endpoints, including some observability SDKs (Datadog RUM, Sentry hosted at obvious domains). Workarounds: proxy through your own domain; accept some sampling bias.

SDK overhead on mobile. Battery and bandwidth are scarce. SDKs must batch aggressively, defer to Wi-Fi, throttle.

False crash signals from third-party SDKs. A buggy ad SDK throws an unhandled exception; the host app reports a "crash" that's not really the app's fault. Stack-trace-aware filtering required.


§20. Capacity Envelope

The whole point of this section is that observability spans about six orders of magnitude in scale, and the engine choice depends on which tier you're in.

Small — startup, single product line. 10-50 application containers; ~30K active series, ~3K samples/sec; ~10 MB/sec, ~1 TB/day logs; ~1K spans/sec head-sampled. Single Prometheus, single ELK, Jaeger on Cassandra on a handful of m5.large boxes. Budget < $5K/month. One engineer part-time. Bottleneck: single ELK node hits I/O ceiling around 10 MB/sec sustained.

Mid — Grafana Cloud customer, mid-size SaaS. 1-5K containers; 3-15M active series, 300K-1.5M samples/sec; ~1 GB/sec, ~85 TB/day logs; 100-500K spans/sec post-sample. Mimir / Cortex for metrics with sharding, Loki for logs, Tempo for traces, object storage cold tier. Team of 2-4. Bottleneck: single Prometheus hits cardinality ceiling around 5-10M active series; shard via Mimir.

Large — LinkedIn, Datadog customer, mid-tier hyperscaler. 30-100K containers; 50-500M active series, 5-50M samples/sec; LinkedIn ~10 PB/day log ingestion across the fleet, ~115 GB/sec sustained; 1-10M spans/sec post-sample, billions/day. Sharded Mimir, multi-cluster Loki + Pinot for logs, multi-region Tempo with tail sampling. Dedicated org of 10-30. Bottleneck: Kafka if you didn't dedicate per-pillar clusters; logs at 6 GB/sec on one Kafka is fine, logs + traces on the same cluster asks for back-pressure cascades.

Giant — Google, Meta.

  • Google Monarch (VLDB 2020): in-memory globally-replicated TSDB, hierarchical regional zones, ~1 trillion samples/sec ingestion peak; replaces Borgmon.
  • Datadog: publicly reports trillions of metric points per day across its customer-aggregate platform; at 10 trillion/day that's ~115M points/sec.
  • Cloudflare logs at ~1 trillion log records/day, analytics on ClickHouse.
  • Meta's Scuba (VLDB 2013): in-memory wide-event store at petabyte scale, sub-second aggregations across billions of events.

At this tier the question stops being "can we store it" and becomes "can we afford to keep it." Retention costs dominate ingest costs; aggressive downsampling and tiered cold storage are structural, not optional.
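The per-day and per-second figures quoted across the tiers are related by simple division. The sketch below reproduces that arithmetic, assuming decimal units (1 PB = 1e15 bytes); the variable names are illustrative only.

```python
SECONDS_PER_DAY = 86_400


def per_second(per_day: float) -> float:
    """Convert a daily volume into a sustained per-second rate."""
    return per_day / SECONDS_PER_DAY


# Figures taken from the tiers above, in decimal units.
small_logs_bytes_per_day = 10e6 * SECONDS_PER_DAY   # 10 MB/sec sustained
large_logs_bytes_per_sec = per_second(10e15)        # 10 PB/day
points_per_sec = per_second(10e12)                  # 10 trillion points/day

print(f"small tier logs: {small_logs_bytes_per_day / 1e12:.2f} TB/day")
print(f"large tier logs: {large_logs_bytes_per_sec / 1e9:.1f} GB/sec")
print(f"10T points/day:  {points_per_sec / 1e6:.1f} M points/sec")
```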


§21. Architecture in Context

The canonical pattern. Not one product's full system — the shape every observability platform converges on.

   ┌────────────────┐                                          ┌─────────────────────┐
   │  Application   │                                          │  Metrics store      │
   │   (Java/Go/Py) │                                          │  Prometheus / Mimir │
   │ ┌────────────┐ │      pull /metrics                       │  shard by job+host  │
   │ │ OTel SDK   │─┼──────────────────────► [Scrapers] ──────►│  - WAL              │
   │ │ metrics    │ │      every 10s                           │  - Head block       │
   │ └────────────┘ │                                          │  - 2h blocks → S3   │
   │ ┌────────────┐ │                                          └─────────────────────┘
   │ │ OTel SDK   │ │      OTLP/gRPC                                ▲
   │ │ traces     │─┼─────────► [OTel Collector] ──────────────────┤ exemplars
   │ │ (sampled)  │ │      head-sample 1%      │                    │ metric ↔ trace
   │ └────────────┘ │                          │                    │
   │ ┌────────────┐ │                          ├──► [Kafka: traces] │
   │ │ Structured │ │      OTLP logs           │   partition by     ▼
   │ │ logger     │─┼─────────► [Fluent Bit] ──┤   trace_id    [Trace store]
   │ │ (JSON)     │ │                          │                Cassandra/Tempo/S3
   │ └────────────┘ │                          │                indexed by trace_id
   └────────────────┘                          │                sampled head + tail
                                               │
                                               ├──► [Kafka: logs]
                                               │    partition by   ┌─────────────────────┐
                                               │    service.name   │ Log store           │
                                               │                   │ Loki labels + S3    │
                                               ├───────────────────┤ chunks (hot)        │
                                               │                   │ OR Pinot/Clickhouse │
                                               │                   │ for log analytics   │
                                               ▼                   └─────────────────────┘
                                  [Tail-sampling decision engine]           │
                                  buffers spans 30s,                        │
                                  keeps errors + slow + 1% random      ┌────▼─────────┐
                                                                       │ Cold tier S3 │
                                                                       │ Parquet      │
                                                                       │ day+service  │
                                                                       └──────────────┘

         ┌──────────────────────────────────────────────────────────┐
         │ Query / alert plane (Grafana / Alertmanager / PagerDuty) │
         │   - dashboards join metrics/logs/traces                  │
         │   - exemplars: hop metric → trace → log                  │
         │   - SLO multi-window burn-rate alerts                    │
         └──────────────────────────────────────────────────────────┘

Staff-level callouts.

  • Three pillars, three pipelines, three storage tiers, no shared fate. A backed-up log Kafka cluster cannot stall the metrics path; the metrics path stays alive when the log path melts.
  • Sampling at multiple stages. SDK head sampling (1-10%), Collector tail sampling (keep errors + slow), and storage retention tiering each cut another order of magnitude.
  • Kafka in the middle for logs and traces, not metrics. Metrics are pulled (scraper WAL is enough buffer); logs and traces are pushed and bursty, so Kafka absorbs variance.
  • Partition keys named explicitly. Metrics shard by job+host, logs partition by service.name, traces partition by trace_id (required for tail sampling).
  • Exemplars are the bridge. A Prometheus exemplar attached to a histogram bucket carries a trace_id; click a slow p99 bucket → jump to one specific trace → jump to its logs. The cross-pillar workflow only works if trace_id is stamped on metrics, logs, and spans.
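Partitioning traces by trace_id matters because tail sampling needs every span of a trace to reach the same consumer. A minimal sketch of such a partition function follows. It uses SHA-256 purely for illustration; it is not Kafka's own default partitioner, and `partition_for` is a hypothetical name.

```python
import hashlib


def partition_for(trace_id: str, num_partitions: int) -> int:
    """Map a trace_id to a stable partition index.

    A content hash is used instead of Python's built-in hash(), which is
    salted per process and would route the same trace differently on
    different producers.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Every span of one trace lands on the same partition, whichever host emits it.
spans = [("4bf92f3577b34da6a3ce929d0e0e4736", name) for name in ("api", "payment", "fraud")]
print({partition_for(trace_id, 12) for trace_id, _ in spans})  # a single partition index
```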


§22. Hard Problems Inherent to Observability

Six fundamental challenges that anyone deploying this technology will face, regardless of product choice. Each illustrated with a different domain.

22.1 Cardinality explosion (kills metrics)

Domain: payments incident debugging. The payments team wants "p99 latency per merchant." They add merchant_id as a label on payment_duration_seconds. 5M active merchants × status × method × region → cardinality goes from 50K to 50M overnight. Prometheus's head block hits 50 GB and OOMs. The platform pages, not the original incident.

Naïve fix. Add RAM. Buys days, not months — cardinality is unbounded.

Real fix. (1) Cardinality enforcement at ingestion: the TSDB tracks active series per tenant; exceed the budget (e.g., 100K series per service) → refuse new series, alert owner. Mimir enforces this through its per-tenant series limit (max_global_series_per_user). (2) Aggregation rules: pre-aggregate high-cardinality dimensions in a stream processor (Flink, Materialize), store top-100 merchants as a metric, push the rest into a wide-event store. (3) Tracing is the answer for per-entity slices: "p99 for this merchant" is a trace/wide-event query, not a metric query. (4) Cardinality alerts: alert when series count grows >2x in 24h; catches bad deploys before OOM.
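The ingestion-time enforcement in item (1) can be sketched as a small in-memory limiter. This is an illustration of the idea only, not how Mimir or Prometheus implement it; `SeriesLimiter` and its methods are hypothetical names.

```python
class SeriesLimiter:
    """Tracks active series per service and refuses new ones past a budget."""

    def __init__(self, budget_per_service: int):
        self.budget = budget_per_service
        self.active: dict[str, set[frozenset]] = {}

    def admit(self, service: str, labels: dict[str, str]) -> bool:
        series = self.active.setdefault(service, set())
        key = frozenset(labels.items())   # a series is identified by its label set
        if key in series:
            return True                   # existing series: keep accepting samples
        if len(series) >= self.budget:
            return False                  # new series over budget: reject, alert the owner
        series.add(key)
        return True


limiter = SeriesLimiter(budget_per_service=2)
for merchant in ("m1", "m2", "m3", "m1"):
    print(merchant, limiter.admit("checkout", {"merchant_id": merchant}))
```

Existing series keep flowing when the budget is exhausted; only the new label combinations introduced by the bad deploy are refused.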

22.2 Cost of full log indexing

Domain: security and audit logging. The security team needs full-text grep across all auth events: "did this attacker IP show up anywhere in 90 days?" They push 100% into Elasticsearch with full-text indexing. 50 TB/day raw → 100 TB/day indexed. ES cluster grows to 1500 nodes. Cost overtakes the rest of the platform.

Naïve fix. Reduce retention. Helps but doesn't fix the structural problem — they're paying to index 99% of data that's never queried.

Real fix. Tier by query pattern. Recent 7 days in ELK for fast grep (~350 TB raw at 50 TB/day, versus ~4.5 PB for the full 90 days). 8-90 days in Loki + S3 chunks: grep is slower (10-30s) but acceptable for forensics. Beyond 90 days in Parquet on S3, queried via Athena. Most security workflows tolerate seconds-to-minutes grep on older data; the fast-index tier just needs to cover the active investigation window.
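Tiering by query pattern implies a router that sends each query to the store holding data of that age. A minimal sketch, using the 7-day and 90-day boundaries from the text; the tier names returned are illustrative labels, not real endpoints.

```python
from datetime import datetime, timedelta, timezone


def tier_for(event_time: datetime, now: datetime) -> str:
    """Pick the storage tier that holds an event of the given age."""
    age = now - event_time
    if age <= timedelta(days=7):
        return "elasticsearch"   # full-text index, fast grep
    if age <= timedelta(days=90):
        return "loki"            # label index + S3 chunks, slower grep
    return "parquet"             # S3 + Athena, batch-style queries


now = datetime(2026, 5, 21, tzinfo=timezone.utc)
for days in (1, 30, 200):
    print(days, tier_for(now - timedelta(days=days), now))
```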

22.3 Head-based vs tail-based trace sampling — choosing wrongly

Domain: ML model serving. The ML team is debugging "why does the recommendation model occasionally return empty results for 0.1% of requests." Head sampling is at 1%. Out of 10K requests, about 10 are affected and 100 are sampled; the chance that even one affected request is among the sampled traces is under 10%. Team is blind.

Naïve fix. Increase head sample to 50%. Volume goes up 50x; storage and Kafka cost balloon. Storing massive amounts of perfectly boring traces.

Real fix. Tail sampling. Every span to a collector. Buffer by trace_id for 30s. Keep: 100% errors, 100% slow, 100% where recommendation_count == 0, plus 1% random for baseline. Storage ~1.5x old head rate, coverage of "interesting" traces 100%. Cost: collector tier must partition by trace_id and needs RAM for the buffering window. At 60M spans/sec, 30s buffer ≈ 1.8B spans, ~2-4 TB RAM aggregated.
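The keep/drop rule applied once a trace's buffer window closes can be sketched as below. The `Span` shape, the `slow_ms` threshold, and the `recommendation_count` attribute name are assumptions for illustration; a production collector would express the same policy in its own configuration.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    error: bool = False
    recommendation_count: Optional[int] = None   # hypothetical span attribute


def keep_trace(spans: list, slow_ms: float = 1000.0,
               baseline_rate: float = 0.01, rng=random) -> bool:
    """Decide on a whole trace after all of its spans have been buffered."""
    if any(s.error for s in spans):
        return True                                  # 100% of errors
    if any(s.duration_ms >= slow_ms for s in spans):
        return True                                  # 100% of slow traces
    if any(s.recommendation_count == 0 for s in spans):
        return True                                  # 100% of the empty-result bug
    return rng.random() < baseline_rate              # small random baseline


buggy = [Span("t1", 40.0), Span("t1", 12.0, recommendation_count=0)]
print(keep_trace(buggy))   # kept because of the empty-result rule
```

The decision needs every span of the trace in one place, which is why the collector tier has to be partitioned by trace_id.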

22.4 Trace context propagation across async boundaries

Domain: gaming server crash investigation. A live-ops engineer debugs "user's item purchase didn't credit." Flow: HTTP → Kafka → async worker → Espresso write → CDC stream → ledger reconciliation. Six stages, five hops, most of them asynchronous. Trace ends at "request enters Kafka"; every downstream span is an orphan.

Naïve fix. Include trace_id in the Kafka message body. Half-right — loses span hierarchy; SDK doesn't wire consumer spans as children.

Real fix. OTel-defined trace context propagation across every boundary. HTTP/gRPC: traceparent header (W3C standard). Kafka: traceparent and tracestate as Kafka message headers. Producer SDK attaches on send; consumer SDK reads them and creates a span whose parent_span_id is the producer's span_id. In OpenTracing terms the reference is FOLLOWS_FROM, not CHILD_OF, because the parent has already completed; OTel has no FOLLOWS_FROM type and models that relationship with span links. Espresso/Pinot CDC: trace context in the CDC envelope. Scheduled jobs (cron, Flink, batch): scheduler creates a root span; job inherits trace_id from trigger metadata.
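The header mechanics can be sketched without any SDK. The example below builds and parses a version-00 W3C traceparent value and carries it in message headers; the dict standing in for a Kafka record and the `produce`/`consume` helpers are hypothetical, and a real deployment would rely on the OTel propagators rather than hand-rolled parsing.

```python
import re
import secrets

# version-00 traceparent: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(value: str):
    m = TRACEPARENT_RE.match(value)
    if m is None:
        raise ValueError(f"malformed traceparent: {value!r}")
    trace_id, parent_span_id, flags = m.groups()
    return trace_id, parent_span_id, bool(int(flags, 16) & 0x01)


def produce(payload: bytes, trace_id: str, span_id: str) -> dict:
    """Attach trace context as a message header (a dict stands in for a record)."""
    header = ("traceparent", make_traceparent(trace_id, span_id).encode())
    return {"value": payload, "headers": [header]}


def consume(message: dict) -> dict:
    """Start a consumer span that points back at the producer span."""
    headers = dict(message["headers"])
    trace_id, producer_span_id, sampled = parse_traceparent(headers["traceparent"].decode())
    return {
        "trace_id": trace_id,                    # same trace continues
        "span_id": secrets.token_hex(8),         # new 64-bit span id
        "parent_span_id": producer_span_id,      # restores the hierarchy
        "sampled": sampled,
    }


msg = produce(b"credit-item", "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
print(consume(msg))
```

Carrying only trace_id in the body would lose `parent_span_id`, which is the half of the context the naive fix drops.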

22.5 Alert fatigue

Domain: any SRE oncall rotation. ~200 active alert rules. Pager fires 40 times/week, most transient bumps that self-resolve in 90 seconds. Engineers stop reading alerts. The next real incident goes unnoticed for 20 minutes.

Naïve fix. Tune thresholds. Reduces the noise but doesn't fix the structural problem: alerting on causes (CPU > 80%, queue depth > 1000) means paging on conditions that may or may not affect users.

Real fix. SLO-driven multi-window burn-rate alerts. Define user pain: "99.9% of requests 2xx-3xx and <500ms over rolling 28 days." Compute error budget. Page at 14.4x burn rate over 1h (the whole budget gone in about 2 days), 6x over 6h (about 5 days), ticket at 3x over 24h. ~5 alerts per service instead of 200; each represents real user impact. Cause-based alerts move to ticket → dashboard. Alert when users are hurting, not when a metric is high: the SRE discipline applied to alerting.
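Burn rate is the observed error ratio divided by the error budget, and the time to exhaust the budget is the SLO window divided by the burn rate. The sketch below works that arithmetic and shows the usual two-window page condition; the function names are illustrative, and requiring a short window alongside the long one is a common refinement assumed here rather than stated in the text.

```python
def burn_rate(error_ratio: float, error_budget: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent.

    error_budget is 1 - SLO target, e.g. 0.001 for a 99.9% SLO.
    """
    return error_ratio / error_budget


def days_to_exhaustion(rate: float, window_days: float) -> float:
    """Days until the whole budget is gone if this burn rate is sustained."""
    return window_days / rate


def should_page(long_window_ratio: float, short_window_ratio: float,
                error_budget: float, threshold: float) -> bool:
    """Page only if both the long and the short window exceed the threshold,
    so the alert clears quickly once the problem stops."""
    return (burn_rate(long_window_ratio, error_budget) >= threshold
            and burn_rate(short_window_ratio, error_budget) >= threshold)


budget = 0.001   # 99.9% SLO
for rate in (14.4, 6.0, 3.0):
    print(f"{rate}x burn: budget lasts {days_to_exhaustion(rate, 28):.1f} days")
print(should_page(0.02, 0.02, budget, 14.4))    # sustained 20x burn
print(should_page(0.02, 0.005, budget, 14.4))   # spike already over
```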

22.6 Biased sampling

Domain: mobile app crash analytics. Crashlytics/Sentry samples at 10% with key device_id mod 10. Months later the team finds iOS device_id distribution is skewed (timestamp-derived seed correlates with launch time). The 10% sample is now skewed toward Android. Crash trends say "Android has 3x more crashes" when rates are equal — iOS samples are systematically missing.

Naïve fix. Stratified sampling per platform. Treats one symptom.

Real fix. Sampling keys must be uniformly distributed. W3C Trace Context defines a 128-bit trace_id and expects it to be randomly generated; use a cryptographic-quality RNG, not a timestamp-derived seed. Monitor for bias: per-partition span rate, per-platform sample rate, per-region sample rate. Any category materially hotter than others means the hash or sampling key is broken. Observability of the observability platform.
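The failure and the monitor can both be shown on synthetic data. In this sketch the skew is simulated by giving one platform only odd device ids (an invented stand-in for the timestamp-derived pattern), and hashing the key is one possible way to restore uniformity; all names are illustrative.

```python
import hashlib
from collections import Counter


def sample_by_mod(device_id: int, keep_one_in: int = 10) -> bool:
    """The biased sampler: trusts the raw id to be uniformly distributed."""
    return device_id % keep_one_in == 0


def sample_by_hash(device_id: int, keep_one_in: int = 10) -> bool:
    """Hash the key first so structure in the id cannot leak into the sample."""
    digest = hashlib.sha256(str(device_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % keep_one_in == 0


def sample_rates(events, sampler) -> dict:
    """Per-platform sample rate: the bias monitor described above."""
    seen, kept = Counter(), Counter()
    for platform, device_id in events:
        seen[platform] += 1
        kept[platform] += sampler(device_id)
    return {p: kept[p] / seen[p] for p in seen}


# Synthetic skew: android ids are sequential, ios ids are always odd.
events = [("android", i) for i in range(10_000)]
events += [("ios", 2 * i + 1) for i in range(10_000)]

print(sample_rates(events, sample_by_mod))    # ios is never sampled
print(sample_rates(events, sample_by_hash))   # both platforms near 10%
```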


§23. Failure Modes

The platform itself is a distributed system. It fails. Systematic walkthrough.

Collector backpressure

Scenario. OTel Collectors are CPU-pinned on tail sampling. Span ingestion lags. Kafka queues fill up.

Recovery.

  1. Circuit breaker on the Collector: when CPU > 80% for > 1 minute, switch to degraded mode — drop the random-sample portion but keep error/slow traces. Cuts ingest ~80% while preserving high-value traces.
  2. Kafka tiered storage — once disk fills, oldest segments offload to S3. Traces retrievable but with higher latency.
  3. Auto-scaling on the Collector tier triggered by Kafka consumer lag or CPU.

Durability point. Traces are still in Kafka. Worst case is delayed assembly, not loss.

Storage tier full

Scenario. Prometheus disk fills overnight because someone deployed a new label that exploded a single metric to 50M series.

Recovery.

  1. Storage hard limit — Prometheus stops accepting new samples (it does not crash). Old data still queryable.
  2. Cardinality monitoring catches the explosion within minutes; alert pages the owning service team.
  3. Service rollback or hot-fix the bad label.
  4. Once cardinality returns to normal, the next 2h block boundary clears the head block; disk pressure resolves.

Durability point. Persistent 2h blocks already on disk and replicated to S3 (Thanos/Mimir). The head block is rebuildable from the WAL.

Biased sampling silent dropout

Scenario. Trace_id distribution skewed → one Kafka partition gets 10x normal load → buffer overflows → that partition's traces preferentially drop. Specific region's traces systematically missing.

Recovery. No live recovery — past traces are lost. Forward fix: detect bias proactively via per-partition span-rate alerts and re-shard if the hash function is broken.

Durability point. None for dropped traces. Prevent via invariant 4 (sampling decisions auditable) and proactive bias detection.

Missing trace_id in logs

Scenario. A new service deployed without the OTel logger handler. Its logs don't carry trace_id. During an incident, you can't pivot from trace → logs for that service.

Recovery.

  1. Detect via a CI check: every service's structured-log format must contain a trace_id field. Reject deploys without it.
  2. For the in-progress incident, fall back to time-range + service-name filter. Higher noise but works.

Durability point. This is an SDLC (Software Development Life Cycle) invariant, not a runtime one. Fix is in the deploy pipeline.

Multi-region partition

Scenario. us-east-1 loses connectivity to the central observability platform in us-west-2 for 20 minutes.

Recovery.

  1. Per-region OTel Collectors have local Kafka buffers. They keep buffering during the partition.
  2. Per-region Prometheus instances keep scraping locally; regional Grafana queries keep working.
  3. When the link heals, Collectors flush the backlog. The metrics store backfills (Thanos sidecars accept out-of-order blocks).
  4. Alerts that fire during the partition are sent by the regional Alertmanager via a secondary alerting channel (PagerDuty fallback over a different network path).

Durability point. Per-region storage. Built to degrade per region, not require a globally consistent view.

Collector backpressure cascades into the application

Scenario. The OTel Collector exporter (sending spans to Tempo) blocks because Tempo is overloaded. The Collector's internal queue fills. If the application SDK exports synchronously, or its batch processor is configured to block rather than drop when its queue is full, the SDK also blocks waiting for the Collector to ack the OTLP batch. Application threads pile up on the OTel exporter call; request latency degrades; the observability stack has caused the application incident it exists to diagnose.

Recovery.

  1. Configure non-blocking SDK exporters. OTel SDK BatchSpanProcessor should drop on overflow, not block. maxQueueSize, scheduledDelayMillis, exporterTimeout tuned so that an overwhelmed Collector results in dropped spans, not blocked apps.
  2. Memory limiter processor in the Collector. When Collector RAM hits a threshold, drop incoming OTLP before queueing. Apps see fast errors instead of slow blocks.
  3. Async batched logging. Same pattern for logs — never block the application on a log write. log.warn calls should be queue-and-return.
  4. Production discipline. "Observability never blocks the app, ever." Audited periodically; failure of this property is treated as a P0 platform bug.

Durability point. Drop spans/logs in extremis. Telemetry is best-effort; application liveness is not.
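The drop-on-overflow behaviour in step 1 reduces to a bounded queue whose producer side never waits. A minimal sketch using only the standard library; `DroppingExporterQueue` is a hypothetical name, and it assumes `max_queue_size` is positive (a size of 0 would make `queue.Queue` unbounded).

```python
import queue


class DroppingExporterQueue:
    """Bounded buffer between request threads and a background exporter."""

    def __init__(self, max_queue_size: int):
        self._q = queue.Queue(maxsize=max_queue_size)
        self.dropped = 0

    def enqueue(self, span) -> bool:
        """Called on the request path: returns immediately, never blocks."""
        try:
            self._q.put_nowait(span)
            return True
        except queue.Full:
            self.dropped += 1       # count drops so the loss itself is observable
            return False

    def drain(self, max_batch: int) -> list:
        """Called by the exporter thread to build the next batch."""
        batch = []
        while len(batch) < max_batch:
            try:
                batch.append(self._q.get_nowait())
            except queue.Empty:
                break
        return batch


q = DroppingExporterQueue(max_queue_size=2)
print([q.enqueue(s) for s in ("a", "b", "c")], q.dropped)
print(q.drain(max_batch=10))
```

Exposing `dropped` as a metric keeps the trade-off visible: the application stays live, and the platform can still tell how much telemetry it lost.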

Prometheus scrape overwhelms the target

Scenario. Prometheus is configured to scrape /metrics every 5s on a service with 50K metrics. Each scrape parses 50K lines, blocks one thread for ~200ms. Combined with 10 Prometheus replicas scraping for HA (High Availability), the service sees 2 scrapes/sec, each consuming a thread: about 400ms of thread time per second, or 40% of one core, spent emitting metrics. The metrics endpoint becomes a hotspot.

Recovery.

  1. Reduce scrape frequency. 30s instead of 5s for most metrics.
  2. Pre-compute / pre-aggregate. Per-process metrics aggregation rather than per-call exposition.
  3. Use a remote-write pattern. App pushes via OTLP to a Collector (which pushes to Prometheus via remote-write), eliminating the per-scrape parsing cost.
  4. Cardinality reduction. Often the 50K metrics are themselves a sign of an unaddressed cardinality issue.

Durability point. Push-based ingestion with batching scales better than pull-based scraping above ~10K series per instance.
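The load a scrape configuration puts on a target follows from three numbers: replica count, scrape interval, and per-scrape cost. A small sketch of that arithmetic, using the scenario's inputs (10 replicas, ~200ms per scrape) and treating the per-scrape cost as fully occupying one thread, which is a simplifying assumption.

```python
def scrape_load(replicas: int, interval_s: float, scrape_cost_s: float):
    """Return (scrapes per second, fraction of one thread kept busy)."""
    scrapes_per_sec = replicas / interval_s
    busy_fraction = scrapes_per_sec * scrape_cost_s
    return scrapes_per_sec, busy_fraction


for interval in (5, 30):
    rate, busy = scrape_load(replicas=10, interval_s=interval, scrape_cost_s=0.2)
    print(f"every {interval}s: {rate:.2f} scrapes/sec, {busy:.0%} of one thread")
```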

One bug fills logs with stack traces

Scenario. A common-path exception starts firing at 1000 req/sec. Each fires a 30-line stack trace, ~3 KB log line. Net log volume: 3 MB/sec = 260 GB/day from one bug. ELK ingest overruns; warm tier fills; ingest backs up for all services. The platform takes a structural hit because of one service's bug.

Recovery.

  1. Per-service log rate limits. Each service has a budget sized to its normal volume (e.g., 1 MB/sec, well under the 3 MB/sec this bug produces). Excess gets dropped at the local Fluent Bit / OTel Collector. The buggy service's logs partially drop; other services unaffected.
  2. Stack-trace deduplication at the Collector. If 1000 lines have the same stack signature in 1s, emit one with count=1000. Cuts volume by 1000x for repeating exceptions.
  3. Hot-fix the bug. Until then, the rate limit is the safety net.
  4. Alert on log-volume spikes. A service emitting 10x its baseline log rate triggers a page; faster than waiting for the storage tier to fill.

Durability point. Per-service rate limits are the primary defense. Without them, one service's bug can take out the platform.
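Stack-trace deduplication (step 2) groups records by a signature computed from the frames and emits one record per group with a count. The sketch below assumes Java-style frame lines beginning with "at " and a record shape with `msg` and `stack` keys; both are illustrative choices, not a real collector processor.

```python
import hashlib


def stack_signature(stack_trace: str) -> str:
    """Hash only the frame lines so differing messages still group together."""
    frames = [line.strip() for line in stack_trace.splitlines()
              if line.strip().startswith("at ")]
    return hashlib.sha1("\n".join(frames).encode()).hexdigest()


def dedup_window(records: list) -> list:
    """Collapse one time window of records into one record per signature."""
    grouped = {}
    for rec in records:
        sig = stack_signature(rec["stack"])
        if sig in grouped:
            grouped[sig]["count"] += 1
        else:
            grouped[sig] = {**rec, "count": 1}   # keep the first occurrence as the sample
    return list(grouped.values())


stack = ("java.lang.IllegalStateException: boom\n"
         "  at com.example.Pay.charge(Pay.java:42)\n"
         "  at com.example.Api.handle(Api.java:10)")
window = [{"msg": f"request {i} failed", "stack": stack} for i in range(1000)]
out = dedup_window(window)
print(len(out), out[0]["count"])
```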

Clock skew makes traces look wrong

Scenario. Service A's clock is 200ms ahead of service B's clock. A's call to B starts at t=10.000 by A's clock; B's span for the same call is stamped by B's clock and finishes at t=9.950. The trace UI shows B's span starting and finishing before A's call to B began, which is physically impossible. Engineers debugging the trace are confused; investigations stall.

Recovery.

  1. Strict NTP / PTP discipline. Every host runs Network Time Protocol or Precision Time Protocol; clock drift > 100ms is alarmed; clock drift > 1s is paged.
  2. Trace UI tolerance. Modern trace UIs (Tempo, Jaeger) compensate for small skew by aligning child span start to parent span start when skew is detected. Doesn't fix the data, but masks the visualization issue.
  3. Span timing reported as durations, not absolute times. OTel spans carry start and end timestamps taken on the same host, so a span's duration is unambiguous even when its absolute start time depends on a skewed host clock. UIs prefer duration for visualization within a trace.
  4. Detect skew via known-good timing. Cross-service request that should take <10ms (e.g., a health check ping) measured >100ms apparent — flag the host. Observability of clock skew is its own monitor.

Durability point. Clock skew is bounded by NTP discipline. The platform must defend against the long tail.
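The UI-side compensation in step 2 can be sketched as a shift that preserves the child's duration. This is a deliberately simplified version of what trace UIs do; `TimedSpan` and `skew_adjust` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TimedSpan:
    name: str
    start: float   # seconds, by the emitting host's clock
    end: float


def skew_adjust(parent: TimedSpan, child: TimedSpan) -> TimedSpan:
    """Shift a child that appears to start before its parent.

    Only the child's position moves; its duration, measured on one host,
    is kept as reported.
    """
    if child.start >= parent.start:
        return child
    shift = parent.start - child.start
    return TimedSpan(child.name, child.start + shift, child.end + shift)


a_call = TimedSpan("A: call B", 10.000, 10.300)
b_work = TimedSpan("B: handle", 9.810, 9.950)   # stamped by B's lagging clock
fixed = skew_adjust(a_call, b_work)
print(f"start={fixed.start:.3f} duration={fixed.end - fixed.start:.3f}")
```

The underlying timestamps stay wrong in storage; the adjustment only makes the rendered trace readable.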


§24. Why Not the Obvious Simpler Alternative — "SSH and tail the logs"

This is the actual mental model many engineers grew up with. It worked when "the application" was one process on one host. It categorically breaks the moment you have more than one service or more than one host.

A customer reports "checkout failed at 2:34 PM Eastern." You ssh into api-gateway. You don't know which of 800 hosts served their request. You tail random hosts; nothing. After 20 minutes grep'ing 800 hosts you find a candidate on host 437. The log says "downstream timeout" — doesn't say which downstream. The flow involves auth, checkout, payment, fraud, inventory. You ssh each. checkout says "called payment, got 500." Payment is 200 hosts. You grep. Two hours in, you find: "stripe API returned 5xx after 30s." But without traces you can't reconstruct which request that was, what user, what params, what DB state. You give up. MTTR (Mean Time To Resolution) is essentially infinite.

Scale that to an incident affecting 10,000 customers, where you're computing impact radius. Or where the request crossed an async Kafka hop. Or where half the logs have rotated off the host because retention is 24h. This is the wall. The observability stack exists to replace SSH-and-grep with structured, correlated, queryable data. Every component (central log aggregation, trace_id propagation, metrics dashboards, SLO alerting) is part of getting over that wall.


§25. Scaling Axes

Type 1: uniform growth (more services, more hosts)

Same per-host telemetry, more hosts. Mostly horizontal fixes.

  • 1x (1K hosts): single Prometheus (~3M series), single ELK (~3 TB/day), single Jaeger Cassandra. One SRE.
  • 10x (10K hosts): Prometheus at cardinality ceiling; shard via federation or move to Mimir/Cortex. Logs move off pure ELK or grow it to 200 nodes (painful). Traces still fit in Cassandra with tighter sampling.
  • 100x (100K hosts): multi-region everything. Per-region Prometheus aggregated to global Mimir via remote-write. Logs through Kafka into multi-tier (Loki + Pinot). Traces at 0.1% head + tail. Team 5-10 engineers.

Inflection points where topology must structurally change: past ~5-10M active series per instance (the single-Prometheus ceiling) → shard TSDB; past 1 GB/sec log ingest → move off pure full-text; past 10M spans/sec post-sample → partition traces by trace_id, multi-shard tail sampling; past 3 regions → per-region storage with cross-region query federation.

Type 2: hotspot intensification (cardinality explosion from a bad label)

Same hosts, but one service emits 100x more series because a deploy added a bad label. Horizontal scaling doesn't help: the problem is concentrated in one tenant. Fix shape:

  • Per-service cardinality quotas enforced at ingest. "checkout" has a budget of 100K series; a new deploy emits 5M → the TSDB refuses new series and pages the owning team.
  • Per-service rate limits on log and span ingest, so a runaway service can't drown the platform.
  • Pre-prod cardinality CI checks. Simulate metric output and count cardinality before merge; fail CI if a service's series count doubles.

Type-2 growth is harder than Type-1 because it's bursty, attributable to specific deploys, and can be 10-100x steady-state in seconds. The TSDB must be hostile to bad neighbors.
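A pre-merge cardinality check can be as small as counting distinct label sets before and after a change. The sketch below assumes the CI job can produce a list of label dictionaries from a simulated run; `count_series` and `cardinality_check` are hypothetical names, and the 2x threshold mirrors the "doubles" rule in the text.

```python
def count_series(samples: list) -> int:
    """Number of distinct label sets, i.e. distinct time series."""
    return len({frozenset(s.items()) for s in samples})


def cardinality_check(baseline: list, candidate: list, max_growth: float = 2.0) -> bool:
    """Pass only if the candidate stays under max_growth times the baseline."""
    return count_series(candidate) < count_series(baseline) * max_growth


baseline = [{"status": s, "method": m} for s in ("200", "500") for m in ("card", "ach")]
# The bad deploy adds merchant_id as a label, multiplying every series by 50.
candidate = [{**labels, "merchant_id": str(i)} for labels in baseline for i in range(50)]

print(count_series(baseline), count_series(candidate))
print(cardinality_check(baseline, candidate))   # False: fail the build
```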


§26. Decision Matrix

Per pillar

| Workload | Best fit | Why |
| --- | --- | --- |
| Aggregate numeric time-series, low cardinality | Prometheus / Mimir TSDB | Gorilla compression, inverted label index, optimized for label slicing |
| Aggregate time-series, very high cardinality (per-user, per-merchant) | Honeycomb wide events / Pinot | Columnar scan beats label-index when cardinality > 10M |
| Log search, ad-hoc grep over labeled streams | Loki | Label-only index keeps cost manageable at TB/day |
| Log analytics (SQL aggregations, dashboards) | Pinot / ClickHouse | Columnar, streaming ingest from Kafka, sub-minute freshness |
| Powerful full-text log queries, smaller volume | Elasticsearch / Splunk | Lucene power, expensive but unmatched search UX |
| Per-request causal chains, named lookup by trace_id | Jaeger / Tempo | trace_id partitioning, sampling, span tree assembly |
| Ad-hoc multi-dimensional trace analysis | Honeycomb Retriever | Columnar event store, query weird questions |
| Continuous profiling | Pyroscope / Parca | Continuous low-overhead CPU/heap profiles |
| Black-box / synthetic | Pingdom-style probes | External reachability, not a substitute for internal pillars |
| Mobile/crash analytics | Crashlytics / Sentry | Built-in symbolication, native crash grouping, mobile SDK |

Build vs buy

Buy Datadog / New Relic / Splunk if you're small-to-mid, value time-to-instrumented over $/GB, and accept lock-in. Below ~1 PB/day total telemetry, SaaS is often cheaper than the headcount to operate the equivalent.

Build on open source (OTel + Mimir + Loki + Tempo + Pinot) if you're large enough that vendor markup × volume > your engineering cost, you have regulatory/data-locality requirements a vendor can't meet, or your access patterns don't fit vendor pricing (Honeycomb prices on events; at 1T events/day you pay them more than your AWS bill). Past ~10 PB/day, build math usually wins.

Hybrid (most realistic): OTel SDK everywhere, open source for bulk storage (metrics + logs), SaaS for specialized analytics (Honeycomb for high-cardinality wide events, Sentry for mobile crashes).

Pick of picks for a green-field large-scale platform. OpenTelemetry SDK → OTel Collector → Kafka (per-pillar clusters) → (Mimir for metrics) + (Loki + Pinot for logs) + (Tempo with tail sampling for traces) → Grafana for unified query → Alertmanager for SLO-burn alerting. No vendor lock-in (OTel), independent pillar failure domains (separate Kafka), economical at >10 PB/day log scale (Loki + Pinot beats ELK), tail sampling for trace coverage (Tempo).


Seven different products, seven different ways the same three-pillar tech is applied. The technology is the subject; these are illustrations.

SRE incident response — payments platform. Stripe-class payment processor. SRE oncall paged via SLO burn-rate alert on payment_success_rate. Hop into a Grafana dashboard. Histogram of payment_duration_seconds shows exemplars on the p99 bucket — click one, land on a specific trace. Trace shows payment → fraud-check → external-bank; fraud-check is the slow span. Logs tagged with that trace_id show "cache miss → DB query took 4s." Root cause in 8 minutes; MTTD (Mean Time To Detect) under 2 min, MTTR (Mean Time To Resolution) under 15. Three pillars compress an hours-long debug into a ten-minute walk.

Business KPI monitoring — e-commerce dashboards. A Shopify-class merchant uses Datadog dashboards to track real-time GMV (Gross Merchandise Value), conversion rate, cart abandonment. App emits orders_completed_total, cart_abandoned_total; dashboards rate() over 1-minute windows. Dashboards owned by the business team, not engineering. Same TSDB that serves SRE alerts serves these dashboards — different labels and query windows. Same tech, different consumer.

Security audit logging — financial services compliance. A bank retains every auth event, transaction, and admin action for 7 years (regulatory). 5 TB/day. Hot tier (30 days in Elasticsearch) for active investigation. Warm tier (30d-1y in Loki + S3) for forensic grep. Cold tier (1-7y in Parquet on S3 + Glacier). Splunk is the buy-vs-build incumbent here — many banks pay $20-50M/year for compliance certifications and audit trail features.

ML model monitoring — recommendation drift. A recommendation model in production. Monitor: input feature distributions (drift), output distribution (degenerate?), per-request latency (slower as index grows?), business outcome (driving clicks?). Distributions are high-cardinality wide events — Honeycomb-shape store. Latency is metrics — Prometheus histogram. Outcome is a metric joined with a click log. Three pillars × ML. When the model goes degenerate at 3 AM, all three cooperate to identify feature pipeline vs model vs data quality.

Distributed tracing for slow-request debugging — Honeycomb's home turf. SaaS company sees p99 latency spike. Engineer opens Honeycomb. heatmap(duration) WHERE service=api — spike concentrated in a window. Add GROUP BY customer_id — one customer dominates. Add GROUP BY endpoint — one endpoint. Drill into a slow trace: api → search → vector-store → object-storage. Vector-store fetch is the slow hop. Opens vector-store dashboard, finds a deploy 5 minutes earlier. Rolls back. Total time: 4 minutes. High-cardinality wide-event analysis is the workflow Honeycomb's columnar Retriever optimizes for — and the workflow aggregated metrics dashboards cannot serve.

Mobile app crash analytics — Crashlytics / Sentry. Every crash generates a report with stack trace, device info, OS version, breadcrumbs. SDK uploads on next connectivity. Server-side, crashes are grouped by stack signature, weighted by user impact. Devs see "top 10 crashes by user impact in v3.2.1." Event-shaped telemetry (one crash = one structured record), not metric-shaped. Storage is a columnar event store — same family as Honeycomb's Retriever, tuned for the mobile crash workflow. 100% sampling for crashes (rare, high-value).

Product analytics on log data — LinkedIn's Pinot for logs. Growth team wants "users who completed onboarding yesterday, by country, by device." Historically a custom pipeline question. With Pinot for logs it becomes SQL on the log stream: SELECT count(DISTINCT user_id) FROM logs WHERE event='onboarding_complete' AND date='2026-05-21' GROUP BY country, device. Columnar storage and per-column indexes return sub-second on TB/day clusters. Logs become a product analytics substrate, a use case the original "logs are for incident debugging" model never anticipated.


§28. Real-World Implementations with Numbers

  • Google Monarch (VLDB 2020). Globally-distributed in-memory TSDB. Hierarchical regional zones, each holding its own data; aggregator queries fan out. ~1 trillion samples/sec ingestion peak. Replaces Borgmon. Lesson: in-memory beats disk for hot metrics; regional autonomy beats global consistency.
  • Facebook Beringei / Gorilla (VLDB 2015). Paper introducing delta-of-delta + XOR float compression. Drove Gorilla-shape chunks in Prometheus TSDB, M3DB, VictoriaMetrics, and InfluxDB's TSM engine. The 1.37 bytes/sample figure for steady-state metrics is from this paper.
  • Google Dapper (Google technical report, 2010). Original distributed tracing paper. Established head sampling at 0.01-0.1% as the default. Modern observability still leans on Dapper's model: trace_id, span hierarchy, context propagation.
  • LinkedIn's log pipeline. ~10 PB/day of log ingestion across the fleet — ~115 GB/sec sustained. Pipeline: app → Kafka → tiered storage. Pinot for log analytics delivers sub-second SQL on 10+ TB/day per cluster; replaces ELK for aggregation use cases.
  • Datadog. Publicly reports trillions of metric points per day, billions of traces/day, exabytes of logs over its history. Sharded TSDB on Cassandra-like LSM trees with custom Gorilla-like codecs. Tail sampling is a managed APM feature.
  • Honeycomb's columnar Retriever. Pioneered "events not traces" — every span is a row in a proprietary columnar store. Ad-hoc query language (heatmap(duration) WHERE service=api GROUP BY customer_id) supports very high cardinality natively. Better for "ask weird questions," worse for "show me one trace tree."
  • Cloudflare logs. ~1 trillion log records/day via ClickHouse-based analytics. Seconds-class aggregation at this volume — order of magnitude cheaper than Elasticsearch.
  • Uber Jaeger. Open-sourced from Uber. trace_id-partitioned Cassandra early on; now offers ClickHouse and Elasticsearch backends. Default head sampling 0.1% across the Uber fleet.
  • Lyft Envoy + observability. Envoy emits rich metrics, logs, tracing out of the box. The "service mesh has built-in observability" pattern (Envoy + Istio) means no app-side instrumentation for L7 (Layer 7) telemetry.
  • Sentry. Open source error tracking — full stack traces and breadcrumbs. ~100M+ events/day SaaS tier. Storage on Snuba (ClickHouse-based wide-event store).
  • OpenTelemetry. Vendor-neutral instrumentation. CNCF graduated project, backed by Google, Microsoft, AWS, Splunk, Datadog, New Relic, Honeycomb. The strategic bet: instrument with OTel, change backends as economics shift.

§29. Summary

"Observability is three pillars, three storage engines, three failure domains — never one product. Metrics live in a Gorilla-compressed TSDB sharded by series hash with hard cardinality quotas. Logs split across label-indexed chunk stores (Loki) for grep and columnar stores (Pinot, ClickHouse) for SQL aggregation — never a full-text index at petabyte scale. Traces ride OTLP through Kafka partitioned by trace_id into a tail-sampling tier that keeps 100% of errors and slow plus a small random sample. Correlation lives in the trace_id stamped on every metric exemplar, every log line, every span. Alerts are SLO-burn-driven so the pager fires when users are hurting, not when a metric is high. The pillars degrade independently; the platform survives the failures it exists to diagnose."