Time-Series Database

A technology reference on TSDBs (Time-Series Databases) — what they are, why a generic relational engine collapses on their workload, the byte-level mechanics (Gorilla XOR floats, delta-of-delta timestamps, 2h blocks, posting-list label indexes), the design space (Prometheus, VictoriaMetrics, InfluxDB, TimescaleDB, M3DB, Thanos/Mimir/Cortex, Pinot/Druid, QuestDB, ClickHouse), and the workloads they serve — monitoring metrics, IoT (Internet of Things) sensor telemetry, financial market data, application performance, business KPIs (Key Performance Indicators). Use cases appear as illustrations throughout — Kubernetes monitoring in §4, Tesla fleet telemetry in §5, trading systems in §6, smart-meter data in §15 — never as the framing device. The technology is the subject.

§1. What Time-Series Databases Are

A TSDB (Time-Series Database) is a specialized storage engine for append-only, timestamped, numeric data whose access patterns are dominated by (a) range scans over a time window, (b) aggregations across windows (sum, avg, percentile, rate), (c) downsampling from fine-grained to coarse-grained resolutions, and (d) automatic retention pruning of old data. The defining property is "ingest enormous volumes of timestamped points cheaply, query time windows fast, throw old data away on a schedule."

A time-series sample looks like (metric_name, label_set, timestamp, value) — e.g. http_requests_total{service="api", region="us-east-1", status="200"} 4823 1716391200000. The pair (metric_name, label_set) defines a time-series — a logical stream of (timestamp, value) points. The pair (timestamp, value) is one sample. A production cluster might have 10 million active series each receiving a sample every 10 seconds — 1 million samples per second steady state. That is the design point.

Where TSDBs sit and what they aren't

TSDBs are a corner of the storage landscape distinct from three adjacent categories newcomers conflate:

vs. OLTP (OnLine Transaction Processing) databases (PostgreSQL, MySQL). OLTP serves ACID (Atomicity, Consistency, Isolation, Durability) transactional mutations of structured data — bank balances, user profiles, inventory counts. Putting a metrics workload into PostgreSQL is wasteful: no transactions to commit (each sample is independent), no joins (a metric is a self-contained series), no point lookups except (series, time_range), and row-oriented storage when float values compress 10× better in a columnar codec. A 1M-samples/sec workload melts a PostgreSQL primary; the same workload on Prometheus runs on a 4-core VM. The mismatch is structural, not tuning.
vs. OLAP (OnLine Analytical Processing) warehouses (BigQuery, Snowflake, Redshift). OLAP serves general analytical queries — group-by, joins, window functions across petabytes. It can store time-series but pays a generality tax: no specialized timestamp codec, no inverted label index, no automatic retention, no 1.37-bytes-per-sample compression. A TSDB is time-specialized; OLAP is time-capable. Crossover around 1 billion samples per day — below that, ClickHouse or Pinot is fine; above, the specialized TSDB pulls ahead 5-20× on storage and query cost.
vs. event stores / event sourcing (EventStoreDB, Kafka as a log). Event stores record discrete domain events (OrderPlaced, PaymentCaptured) and support replaying the log to rebuild state. A TSDB records measurements (a CPU sample, a request count) where the goal is "value at time T" or "rate over the last hour," not "rebuild aggregate state at version N." Event sourcing is about causality and state reconstruction; TSDB is about measurement and aggregation.

What a TSDB is NOT good for: complex joins (no relational join operator; PromQL — Prometheus Query Language — supports label-matched joins but not arbitrary cross-metric joins), transactional updates to past samples (rewrites are hostile to the storage layer), low-latency point lookup by non-time keys (use Redis or a key-value store), unstructured text payloads (logs and traces belong in their own engines — see observability doc).

Mental model: a TSDB is the right answer for "millions of floats per second, indexed by labels, queried by time window, thrown away on a schedule." Anywhere that profile doesn't fit, the TSDB is the wrong tool.

§2. Inherent Guarantees

What a TSDB provides by design, and what the system designer still has to layer on top.

Provided by design

High write throughput. A single Prometheus TSDB node sustains 1 million samples per second on commodity hardware (16 vCPU, 64 GB RAM, NVMe SSD); VictoriaMetrics claims 10 million per node with the same hardware budget. Datadog's internal TSDB ingests trillions of samples per day across its fleet. Write throughput scales linearly with cores until the WAL fsync becomes the bottleneck.
Aggressive compression. Gorilla encoding (§4) hits 1.37 bytes per sample on real production data — timestamp + 64-bit float compressed into less than 11 bits of disk on average. Compared to a naive 16-byte row store, the storage efficiency gap is ~12×.
Fast time-range aggregations. A query like sum(rate(http_requests_total[5m])) by (service) over 1000 series returns in milliseconds because (a) samples are stored sequentially per series and (b) the query engine pushes the aggregation down to the chunk decoder.
Automatic retention. Drop blocks older than N days. No manual DELETE is ever necessary; the TSDB owns the lifecycle of old data. Prometheus's --storage.tsdb.retention.time=30d flag does the entire job.
Label-based query indexing. An inverted index from (label_name, label_value) → series_ids lets queries filter by arbitrary label combinations without scanning every series.

Must be layered above

Cross-series aggregation correctness. A counter that resets on process restart looks like a negative rate if naively differenced. The application must use a monotonic counter type AND the query language must understand counter resets (PromQL's rate() handles it; raw SQL on samples does not).
Cardinality budget enforcement. A high-cardinality label (user_id, request_id, trace_id) generates one series per unique value. 10 million series is the cliff edge for a single node. The SDK must drop or hash such labels before they hit the TSDB.
Long-term retention. Native Prometheus retention is days-to-weeks of full-resolution data. Year-long retention requires layering Thanos, Mimir, Cortex, or VictoriaMetrics's long-term tier — an object-storage-backed federation layer.
High availability. A single Prometheus instance is a single point of failure. HA (High Availability) requires running two Prometheus instances scraping the same targets with a deduplicating query layer above them.
Backfill. Loading historical data — say, CSVs of last year's stock prices — is not what stock Prometheus is for. TimescaleDB, InfluxDB, ClickHouse, and QuestDB all support backfill; Prometheus only recently added it.
Multi-tenancy. A bare TSDB serves one tenant. Cortex/Mimir add per-tenant ingester sharding, per-tenant query limits, and per-tenant retention. SaaS observability platforms layer billing on top.
Linearizability across replicas. TSDBs are typically eventually consistent — replicas may diverge briefly during scrape jitter. For monitoring this is fine (a 1-second sample drift is invisible); for financial settlement-of-record time-series it is not.

The contract: a TSDB stores timestamped floats, queries time ranges, drops old data. Everything else — multi-tenancy, high availability, long-term retention, cardinality control — is layered on by the operator.

§3. The Design Space

The TSDB category split into five sub-categories over the past decade. They differ along axes of collection model (pull vs push), query language (PromQL vs SQL vs Flux), retention design (short vs tiered), and access pattern (operator metrics vs user-facing analytics).

A. Monitoring TSDB — Prometheus, VictoriaMetrics

Pull-based. The TSDB scrapes HTTP /metrics endpoints from monitored targets on a schedule (default 15-30s). Service discovery (Kubernetes API, Consul, file-based) tells the scraper which targets to poll. Data model: float per (metric, label_set, timestamp); labels are key-value pairs that define unique series. Query language: PromQL — functional, label-aware, with rate(), histogram_quantile(), topk(), label-matched joins. Retention: short by default (15 days). Long-term needs a federation layer (Thanos/Mimir/Cortex/VictoriaMetrics long-term).

Strengths: dead simple to deploy, Kubernetes default, vast ecosystem (Grafana, AlertManager, exporters for every database/queue/proxy). Weaknesses: single-node by default, no built-in HA, cardinality cliff at ~10M active series.

VictoriaMetrics is the higher-performance Prometheus-compatible alternative — same data model, same PromQL, 5-10× better storage density and throughput. Drop-in replacement when a Prometheus node hits its ceiling.

B. General TSDB — InfluxDB, TimescaleDB

Push-based. Applications and agents (Telegraf, fluentd) write samples via HTTP POST or line protocol. InfluxDB has tags (indexed) and fields (not indexed). TimescaleDB uses PostgreSQL rows in a special hypertable. Query language: InfluxQL → Flux → SQL (Influx's checkered history — Flux was a functional language, deprecated in v3 in favor of SQL). TimescaleDB extends PostgreSQL SQL with time_bucket() and continuous aggregates.

Strengths: SQL access, joins with relational data (TimescaleDB), arbitrary backfill, richer-than-numeric data types. Weaknesses: lower ingest ceiling per node than Prometheus-style; SQL more verbose than PromQL for "rate per label" queries.

C. Real-time analytics — Apache Pinot, Apache Druid

Push via Kafka. Designed for user-facing sub-second analytical queries, not operator monitoring. Data model: rows in a columnar segment, time-partitioned. Query language: SQL. Retention: weeks-to-years with tiered hot/cold storage.

Strengths: sub-second queries over billions of events. LinkedIn's "Who Viewed Your Profile" runs on Pinot, p99 <100ms for queries over years of pageview events. Weaknesses: ops complexity (controller, broker, server, deep store); slower ingest than a pure TSDB; over-engineered for "what's my error rate" monitoring.

Pinot and Druid are sometimes called TSDBs and sometimes called OLAP. They're shaped like a TSDB (time-partitioned, append-only, columnar) but query user-facing dashboards, not machine alerts.

D. Long-term Prometheus storage — Thanos, Mimir, Cortex, VictoriaMetrics

Receive Prometheus's remote-write stream OR sidecar-tap into local 2h blocks. Storage: S3-class object storage (S3, GCS — Google Cloud Storage, ABS — Azure Blob Storage) for blocks older than ~6 hours. Architecture: ingester tier (in-memory recent samples) + store gateway (S3 query path) + query frontend (caching, splitting, parallelism).

Strengths: unlimited retention (S3 prices, $0.023 per GB-month), horizontal scalability via tenant sharding. Weaknesses: large ops surface area (5-7 microservices); cold queries from S3 are 10-100× slower than recent-data queries from ingesters.

Mimir is Grafana Labs's Cortex fork — actively maintained. Cortex is the original CNCF (Cloud Native Computing Foundation) project. Thanos is the sidecar-style alternative — slightly different architecture, similar outcome. VictoriaMetrics cluster does the same job with fewer components.

E. Specialized TSDBs — M3DB, OpenTSDB, QuestDB, ClickHouse-for-time-series

M3DB: Uber-built, optimized for high-cardinality (Prometheus's limit was a non-starter at Uber's scale). M3 = ingester + coordinator + query layer with M3DB as the underlying TSDB.
OpenTSDB: the original (2010) Hadoop/HBase-backed TSDB. Largely superseded but still runs at some shops.
QuestDB: SQL-native, single-binary, optimized for financial market data. Tens of millions of ticks per second on a single machine.
ClickHouse for time-series: general columnar OLAP database with MergeTree engine + time-ordered partitioning. Deployed as a TSDB at Uber, Cloudflare, eBay for SQL + aggregate analytics together.

Comparison table

TSDB	Collection	Query language	Compression	Per-node ingest ceiling	Retention model	Best for
Prometheus	pull	PromQL	Gorilla (~1.4 B/sample)	~1M samples/sec	local 2h blocks, days-weeks	Kubernetes monitoring, the default
VictoriaMetrics	pull or push	PromQL + MetricsQL	Gorilla + extra zstd	~10M samples/sec	local + S3 long-term	Prometheus-compatible at higher scale
InfluxDB (v3 IOx)	push	SQL	Parquet + zstd	~1-2M samples/sec	retention policies	SQL access, IoT, mixed workloads
TimescaleDB	push	PostgreSQL SQL	columnar compression	~500K-1M samples/sec	hypertable chunks	wants PostgreSQL + time-series in one DB
M3DB	push	M3QL / PromQL	Gorilla variant	~5M samples/sec	local + S3	huge cardinality (Uber-class)
Pinot	Kafka push	SQL	columnar + dictionary	~1M events/sec	tiered hot/cold	user-facing analytics dashboards
Druid	Kafka push	SQL	columnar	similar to Pinot	tiered hot/cold	similar to Pinot, slight ops differences
ClickHouse	push	SQL	columnar + LZ4/zstd	~10M+ rows/sec	TTL on tables	SQL analytics + time-series in one engine
QuestDB	push	SQL	columnar	tens of millions ticks/sec	retention policies	financial tick data, low-latency ingest
Thanos / Mimir	downstream of Prometheus	PromQL	inherited from Prom blocks	scales horizontally	S3 unbounded	long-term Prometheus retention

The picks split cleanly: Prometheus for monitoring, VictoriaMetrics if Prometheus is too slow, TimescaleDB if SQL and joins matter, InfluxDB for IoT push-mode, Pinot/Druid for user-facing analytics, ClickHouse for SQL analytics with time-series shape, M3DB for Uber-class cardinality, Thanos/Mimir for long-term Prometheus.

§4. Byte-Level Storage Internals

This is the section that separates "I deployed Prometheus" from "I know why Prometheus uses 1/12th the disk a naive design would." Three pieces matter: the data model, the block layout, and the two compression algorithms (Gorilla XOR for floats and delta-of-delta for timestamps).

4.1 The data model: series, labels, samples

Every sample lives in a logical time-series. A series is uniquely identified by:

metric_name + sorted(label_kv_pairs)
        ↓ hash
   series_id (typically uint64)

For example, the metric http_requests_total with labels {service="api", region="us-east-1", status="200", method="POST"} becomes one series with a stable series_id. A different status (status="500") is a different series with a different series_id. This is critical: cardinality is driven by unique label-value combinations, not by sample volume.

A sample is the (timestamp_ms, float64_value) pair appended to a series. Steady state: 100M samples/sec across 10M series = 10 samples/sec/series average = sample every 100ms.

4.2 Block-based storage (Prometheus TSDB)

The Prometheus on-disk format is the de facto reference design. M3DB, VictoriaMetrics, Cortex, and Mimir all use variants of the same shape.

The head block. The most recent ~2 hours of data lives in memory in the head block. New scrapes append to in-memory chunks per series. Every sample is also written to a WAL (Write-Ahead Log) on disk for crash recovery. Head block memory size is roughly:

head_RAM ≈ active_series × (open_chunk_bytes + index_overhead)
         ≈ 10M series × (1KB open chunk + 200 bytes index)
         ≈ ~12 GB RAM for a 10M-series instance

Persistent 2-hour blocks. Every 2 hours, the head block "cuts" — closed chunks flush to a persistent block on disk and the head block resets. Each block is a self-contained immutable directory holding all samples for its 2-hour window:

data/
  01HF3MRD8Y6J3QTBPGZS5RFK0E/      # ULID — universally unique time-sorted ID
    meta.json                       # min_time, max_time, stats, ulid
    chunks/
      000001                        # binary chunks (Gorilla-encoded)
      000002
    index                           # inverted label index + series table
    tombstones                      # rare — deleted ranges

Block immutability is the design's superpower: a block can be compressed once, indexed once, then never rewritten. Compaction merges multiple 2h blocks into larger blocks (8h → 24h → multi-day) without touching the original samples, only re-encoding into a denser layout.

WAL for crash safety. Before a sample lands in the head block's in-memory chunk, it is appended to a WAL segment on disk. The WAL is a sequential stream of records:

record_type | series_id | timestamp | value

WAL fsync is batched every ~100ms (configurable). On crash, the head block is rebuilt by replaying the WAL from the last checkpoint. Durability point: up to 100ms of pre-fsync samples can be lost. Acceptable for monitoring (one missing 10s sample is invisible); not acceptable for financial settlement (where TimescaleDB's synchronous_commit=on or QuestDB's sync mode buys per-commit durability at lower throughput).

Compaction. Every 2h block is compacted with neighbors at fixed intervals — typically 6h → 24h → multi-day blocks. Compaction: - Re-encodes chunks for better compression (more samples per chunk → better delta-of-delta and XOR locality). - Drops tombstoned ranges. - Rebuilds the inverted index. - Reduces the number of blocks the query engine must touch for a long-range query.

4.3 The inverted label index

A PromQL query sum(rate(http_requests_total{service="api", region="us-east-1"}[5m])) must first find all series matching service="api" AND region="us-east-1". This is exactly the problem Lucene solves for full-text search: posting-list intersection.

The TSDB maintains a posting-list index:

(label_name="service", label_value="api")        → [series_3, series_7, series_11, ...]
(label_name="region",  label_value="us-east-1")  → [series_3, series_11, series_19, ...]

AND merge: [series_3, series_11]  ← these series match both labels

Posting lists are stored sorted by series_id. Intersection is a merge scan — O(min(|A|, |B|)) time, fast even for large lists. The query engine then loads chunks for [series_3, series_11] over the requested time window and decodes.

Cardinality explosion is what kills this index. Each unique label_value adds a posting list entry. If user_id is a label with 10 million unique values, the index has 10 million posting lists for that label. Memory for the index alone exceeds 10 GB and queries that filter by user_id become O(N) instead of O(log N). Mitigation: drop high-cardinality labels at the SDK before the scrape (§7).

4.4 Compression: delta-of-delta on timestamps

Timestamps in a monitoring TSDB are almost always at regular intervals. Scrape every 10 seconds → timestamps are T, T+10s, T+20s, T+30s, .... Deltas between consecutive timestamps are constant (10s, 10s, 10s). The delta-of-delta is zero (10-10=0, 10-10=0, 10-10=0).

The Gorilla codec encodes the delta-of-delta:

For each sample after the first:
  delta_i = timestamp_i - timestamp_{i-1}
  delta_of_delta_i = delta_i - delta_{i-1}

If delta_of_delta == 0:          encode in 1 bit (the value '0')
If delta_of_delta in [-63, 64]:  encode in 7 bits + 2-bit tag (9 bits total)
If delta_of_delta in [-255, 256]: encode in 9 bits + 3-bit tag (12 bits total)
If delta_of_delta in [-2047, 2048]: encode in 12 bits + 4-bit tag (16 bits total)
Otherwise:                       fallback 32-bit delta + 4-bit tag (36 bits total)

Steady-state monitoring: 99% of samples have delta_of_delta = 0, so timestamps compress to ~1 bit per sample.

4.5 Compression: Gorilla XOR on floats

Values from the same metric tend to be highly correlated. CPU utilization sample-to-sample changes by tenths of a percent. A memory gauge moves smoothly. Even a request-rate counter, after rate(), varies slowly.

The Gorilla paper (Pelkonen et al., VLDB 2015 — Facebook) observed that XOR of two consecutive IEEE 754 doubles produces a small integer with many leading and trailing zeros:

double_1 = 47.3       → IEEE 754: 0x4047A66666666666
double_2 = 47.5       → IEEE 754: 0x4047C00000000000
xor      = 47.3 ⊕ 47.5 = 0x0000666666666666
                          ↑ many leading zeros
                                                 ↑ many trailing zeros
                              ↑─ meaningful bits ─↑

Gorilla encodes the XOR result as (leading_zero_count, meaningful_bit_count, meaningful_bits):

If XOR == 0 (identical to previous): encode in 1 bit (value '0')
If meaningful_bits fit in previous range: encode in (meaningful_bits) + 1 bit (skip header)
Otherwise:                          5 bits leading + 6 bits length + meaningful_bits + 2-bit tag

Real-world floats average ~7-11 bits per sample in production.

4.6 The 1.37-bytes-per-sample number

Putting it together for one sample: - Timestamp: ~1 bit (delta-of-delta=0 case) up to ~16 bits worst case → average ~1-4 bits. - Value: ~7-11 bits average via XOR. - Per-sample overhead (delimiters, padding): negligible amortized.

Total: ~11 bits per sample on average = 1.37 bytes per sample. Compared to a naive (uint64 timestamp, double value) = 16 bytes per sample, Gorilla is ~12× denser.

At a 1-million-samples-per-second workload, this is the difference between 1.4 MB/sec on disk vs 16 MB/sec on disk — i.e., a 4 TB disk holds ~33 days at Gorilla density vs ~3 days naive. Compounded by compaction-level recompression and you get the multi-month retention Prometheus advertises on commodity hardware.

4.7 One scrape end-to-end

Walk a single metric scrape, from HTTP request to a byte on disk:

1. App exposes /metrics in text exposition format:
   http_requests_total{service="api",status="200"} 4823 1716391200000

2. Prometheus scraper HTTP GETs /metrics every 10s.
   Parses each line: (metric_name, label_set, value, timestamp).

3. Hash (metric_name + sorted_labels) → series_id.
   If new series_id, allocate it in the in-memory series table and
   add posting-list entries to the inverted label index.

4. Append (timestamp, value) to the open chunk for series_id in the head
   block. Chunk grows by ~1.4 bytes due to Gorilla encoding.

5. Append WAL record (series_id, timestamp, value). WAL fsync batched
   every ~100ms. Pre-fsync crash loses ≤100ms of samples.

6. After ~120 samples (~20 min at 10s scrape), the open chunk closes,
   gets Gorilla-encoded fully; a new open chunk replaces it. Closed
   chunks stay in head-block RAM until block cut.

7. Every 2h, the head block "cuts": closed chunks flush to disk in the
   block's chunks/ directory; inverted label index is built; WAL is
   truncated; meta.json written last (atomic block visibility); head
   block resets.

8. After 24h, compaction merges 2h blocks into 24h blocks; multi-day
   blocks into week-long blocks.

9. After retention (e.g. 15d), oldest blocks are deleted by rmdir.
   Atomic, no rewrite.

10. (Optional) Thanos sidecar uploads blocks to S3 for long-term
    retention; local retention can shorten once blocks are in S3.

The whole pipeline is "append to memory, write to WAL, batch fsync, flush block, compact, expire" — no random page rewrites, no btree splits, no transactional rollbacks. The simplicity is what makes 1M samples/sec on a single node possible.

4.8 Why this beats a generic database 10-100×

A naive PostgreSQL table (metric_id, ts, value) with a (metric_id, ts) btree index:

16 bytes per row (key) + 16 bytes per value + 30-40 bytes of btree/MVCC overhead = ~60 bytes per sample.
Random page writes (btree page rewrites on each insert) → ~20-30k inserts/sec/SSD ceiling.
No timestamp-aware compression.

A specialized TSDB:

1.4 bytes per sample (Gorilla).
Sequential WAL writes + memtable + immutable blocks → 1M+ samples/sec.
30-50× less storage, 30-100× more throughput.

The factor of ~12× on storage and ~30× on throughput is structural — it comes from the access pattern (monotonic timestamps, repeated values, no updates), not from clever tuning. A generic database cannot close this gap by adding indexes; it would have to become a TSDB.

§5. Capacity Envelope

What this technology can do across very different scales, illustrated by real deployments.

Small — single Prometheus VM

Setup: Prometheus on a 4-core, 16 GB RAM VM with a 100 GB local SSD. Scraping 50 targets, each exposing ~100 metrics. ~10k samples/sec steady state. 7-day retention.

Storage: 10k samples/sec × 86400 s × 7 days × 1.4 B/sample ≈ 8 GB for the whole 7-day window.
CPU: scrape + Gorilla encode + WAL write uses <10% of 4 cores.
Query latency: common dashboards (CPU/memory by service over 1h) return in <100ms.

Plenty for a startup running 50 containers in Kubernetes. The bottleneck shows up around 10k targets / 1M series — RAM for the head block.

Mid — typical SaaS observability

Setup: 5 Prometheus instances scraping a 2000-node Kubernetes cluster + Thanos sidecars uploading to S3. ~1M samples/sec steady state. 30-day retention local + 1-year retention in S3.

Active series: ~5M across all instances.
Daily samples: 1M × 86400 = ~86 billion samples/day.
Daily storage: 86B × 1.4 B = ~120 GB/day local before compaction; ~50 GB/day in S3 after compaction-level recompression.
30-day local: ~3.5 TB after compaction.
1-year S3: ~18 TB at $0.023/GB-month = ~$415/month.

Most production observability platforms live here. Bottlenecks: cardinality from poorly bounded labels, query cost on long ranges (year-long p99 latency query touches 365 day-blocks).

Large — Datadog scale

Setup: Datadog's internal TSDB (proprietary, derived from same Gorilla family). Per public material: 15+ trillion samples ingested per day at peak.

15 trillion samples/day = 175 million samples/sec average; multi-hundred-million peaks.
Storage at 1.4 B/sample: ~21 TB/day after compression.
Cluster footprint: thousands of ingester nodes, S3 long-term storage in the multi-petabyte range.

Per-tenant sharding (multi-tenant SaaS) plus regional sharding (data residency) plus per-customer rate limits to keep one tenant from drowning others.

Giant — Google Monarch

Setup: Google's internal monitoring system. Per public material (USENIX 2020): ~1 trillion samples per second globally at the time of publication.

Per-second samples: ~1 trillion/sec → 10^12 ÷ 10^6 = 1M samples/sec/server, hosted by a fleet on the order of 1M ingest servers globally.
Architecture: in-memory regional zones, leaf nodes hold recent data, root nodes aggregate cross-region queries.
No SSD persistence in the hot tier — Monarch holds the recent window in RAM only, replicated across zones. Reliability via replication, not persistence. Long-term archived to colossal blob storage.

The fact that Monarch is RAM-only in the hot tier is the design's bold choice: at this scale, SSD writeback becomes the bottleneck and RAM + replication is the way out.

Domain-specific scales

IoT — Tesla fleet telemetry. ~5M vehicles × hundreds of signals at 1-100 Hz → multi-millions of samples/sec. InfluxDB and ClickHouse variants common in automotive.
Financial market data. Equities tick streams (NYSE, NASDAQ consolidated) hit tens of millions of ticks/sec at peak. QuestDB and kdb+ dominate — optimized for sub-ms query latency on a single host. ~50 GB/day per market at richer payload (bid/ask/volume).
Application performance — LinkedIn-class. ~5000 services × 100 metrics × 0.1 Hz = ~50M samples/sec aggregate. Backed by internal Prometheus-compatible TSDB with multi-region federation.

The range spans six orders of magnitude in samples/sec — from 10k on a 4-core VM to 1 trillion across Monarch's global fleet. The storage engine fundamentals (Gorilla, delta-of-delta, immutable blocks, inverted label index) are identical across the range. What scales is the topology above the engine — sharding, federation, replication, retention tiering.

§6. Architecture in Context

The canonical pattern for a TSDB in a production monitoring or analytics system.

                                ┌──────────────────────────────┐
                                │  Application + sidecars      │
                                │  /metrics endpoint exposed   │
                                │  via Prom client lib or      │
                                │  OTel SDK                    │
                                └──────────┬───────────────────┘
                                           │
                                           │ HTTP GET /metrics (pull)
                                           │  -or- OTLP push
                                           ▼
                                ┌──────────────────────────────┐
                                │   Scraper / ingester tier    │
                                │   - Prometheus (pull)        │
                                │   - OTel Collector (push)    │
                                │   - VictoriaMetrics agent    │
                                └──────────┬───────────────────┘
                                           │
                                           │ WAL + head block in RAM
                                           │ 2h blocks on local NVMe
                                           ▼
                ┌──────────────────────────────────────────────────────┐
                │              Hot TSDB tier (recent ~6h-30d)          │
                │  - Prometheus / VictoriaMetrics / Mimir ingesters    │
                │  - Sharded by tenant or by metric_name hash          │
                │  - Replicated for HA                                 │
                └──────────────────────────────────────────┬───────────┘
                                                           │
                                                           │ 2h block upload
                                                           ▼
                                          ┌──────────────────────────────┐
                                          │   Cold tier — S3 / GCS / ABS │
                                          │   Multi-month / multi-year   │
                                          │   blocks                     │
                                          └──────────┬───────────────────┘
                                                     │
              ┌──────────────────────────────────────┴──────────────┐
              │                                                     │
              ▼                                                     ▼
   ┌──────────────────────┐                            ┌──────────────────────┐
   │ Query frontend       │                            │ Store gateway        │
   │ - PromQL parser      │ ─────────────────────────► │ Reads blocks from S3 │
   │ - query splitting    │                            │ Bloom filter cache   │
   │ - result caching     │                            │ Decode chunks        │
   └──────────┬───────────┘                            └──────────────────────┘
              │
              ▼
   ┌──────────────────────────────────────────────────┐
   │  Consumers                                       │
   │  - Grafana (dashboards)                          │
   │  - AlertManager (rule evaluation → PagerDuty)    │
   │  - Recording rules (downsampled rollup outputs)  │
   │  - Internal analytics dashboards                 │
   └──────────────────────────────────────────────────┘

Annotations on the diagram:

Sharding happens in the ingester tier. Mimir/Cortex shard by (tenant_id, series_hash) so all samples of a given series land on the same ingester (for replay correctness). Prometheus single-instance has no sharding; HA pairs scrape redundantly and the query layer deduplicates.
Replication is typically 2-3× at the ingester tier. Each sample written to 2 or 3 ingesters; on query, results are deduplicated.
WAL fsync is the durability point — pre-fsync samples are lost on crash; post-fsync samples are recoverable from the WAL replay.
Block upload is the long-term durability point — once the block is in S3, the local copy can be safely deleted after retention expires.

This is the shape of every Prometheus-class observability platform from a startup running one VM to Datadog's multi-region SaaS. The boxes scale, multiply, and shard, but the data flow — scrape/push → WAL → head block → flush → block → S3 — does not change.

§7. Hard Problems Inherent to TSDBs

Six structural challenges every TSDB operator faces, drawn from the technology's design constraints — not from any particular vendor or workload.

7.1 Cardinality explosion

The problem. A label with high cardinality — user_id, request_id, session_id, trace_id, anything with millions of unique values — creates millions of unique series. Memory, index size, and query latency all scale linearly with cardinality.

The naive solution. "Just add the label, more dimensions are more visibility." Run with it.

How it breaks. A web service emits http_request_duration_seconds{user_id="..."} and ships to Prometheus. The fleet has 50 million users. The TSDB allocates 50M series, each consuming ~1 KB for the open chunk + index entries. Head block RAM blows past 50 GB. The TSDB OOMs (Out Of Memory). Restart it; it OOMs again on WAL replay. Production observability is now down because someone added one well-intentioned label.

The actual fix. - Drop high-cardinality labels at the SDK. The Prom client library Counter.labels(user_id=...) should be Counter.labels() instead — the per-user detail goes into traces (where each trace is one event, not one series), not metrics. - For "I need to see one user's metrics," use exemplars — a histogram-bucket pointer to a sample trace_id. Exemplars cost one trace pointer per histogram bucket, not one series per user. - For pre-emptive defense, configure metric_relabel_configs in Prometheus to drop or hash labels matching a pattern at scrape time. - Set query.max-series-fetched to fail expensive queries before they take down the TSDB.

(Treated in more depth in the observability doc, §4.1.)

7.2 Downsampling and retention tiering

The problem. Raw 10-second samples are great for the last hour. They are wasteful for last year — nobody looks at second-level granularity over a 365-day window. But "year-long" dashboards are common (capacity planning, business KPIs over time, compliance reporting).

The naive solution. Keep raw samples forever. Disk is cheap.

How it breaks. 1M samples/sec × 1 year × 1.4 B/sample = 44 TB/year of raw data. A "show me p99 latency for the last year" query reads 365 day-blocks × ~120 GB each = enormous I/O. Query takes minutes; dashboard becomes unusable; users complain; you start to think monitoring isn't worth it.

The actual fix. Tiered downsampling — keep multiple resolutions, each with its own retention:

Resolution	Retention	Storage cost
Raw (10s)	30 days	120 GB/day × 30 = 3.6 TB
1m aggregated	90 days	12 GB/day × 90 = 1.1 TB
5m aggregated	1 year	2.4 GB/day × 365 = 880 GB
1h aggregated	5 years	200 MB/day × 1825 = 365 GB

Total: ~6 TB for 5 years of multi-resolution data vs ~220 TB raw. 35× cost reduction.

Mechanisms: - Prometheus recording rules pre-compute aggregates and write them as new series. rate(http_requests_total[1m]) becomes a separate series http_request_rate_1m that the dashboard queries directly. - Thanos / Mimir downsampling is automatic — every 2h block is downsampled to 5m and 1h resolution. - TimescaleDB continuous aggregates are SQL-defined rollups maintained incrementally. - InfluxDB continuous queries (legacy) and tasks (v3) do the same.

Query layer must know about multiple resolutions and pick the right one based on the requested range: "give me 1 year p99" should hit the 1h-resolution series, not the raw.

7.3 High-write ingest

The problem. A single TSDB node is designed for ~1M samples/sec; beyond that, sharding is required. Cardinality explosion (7.1) can also drive ingest cost up sharply even at modest sample rates.

The naive solution. "Just buy a bigger machine." 32-core, 256 GB RAM, NVMe RAID.

How it breaks. Beyond ~3-5M samples/sec on a single node, WAL fsync becomes the bottleneck (each fsync is ~50-200µs, and a 100ms batch limits throughput). The head block index grows past available RAM. Compaction CPU starts to compete with ingest CPU. Scaling vertically hits a wall at ~5M samples/sec/node regardless of hardware.

The actual fix. Shard by series. Mimir/Cortex use consistent hashing on series_id to distribute samples across an ingester ring. Each ingester owns ~1/N of the series. Query layer fans out a PromQL query to all ingesters and merges results. M3DB does the same with its own ring topology. VictoriaMetrics cluster shards by metric name + label fingerprint.

The inflection: 1 instance ≈ 1M samples/sec; 10 instances ≈ 10M; 100 instances ≈ ~80M (sublinear due to query fanout overhead).

7.4 Long-range query cost

The problem. A query like histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) over the last 1 year must touch every block in the year — even with downsampling, that's hundreds of blocks. Each block must be opened, its index consulted, chunks decoded, percentile computed.

The naive solution. Just execute it. Modern hardware is fast.

How it breaks. A naive 1-year query on a Mimir tenant with 5M series scans ~365 day-blocks, decodes ~10 GB of data, performs the percentile computation. Latency: 30-300 seconds. One user clicking such a query can saturate the query path; concurrent users grind dashboards to a halt.

The actual fix. - Query splitting and parallelism. Mimir/Thanos query frontend splits a 1-year query into 12 monthly sub-queries that run in parallel, then merges results. - Result caching. PromQL queries on stable past windows cache forever (Mimir's results-cache). A repeated 1-year dashboard hits cache on second load. - Pre-aggregated recording rules. The percentile is computed at ingest time once per 5m window and written as a series. The 1-year dashboard queries the pre-computed series, not the raw histogram buckets — 100× cheaper. - Adaptive resolution. Query frontend rewrites the query to use coarser resolution (1h instead of 10s) when the range is long.

7.5 Multi-tenancy

The problem. One TSDB serving many teams. Team A's runaway cardinality and slow query should not break Team B's dashboards.

The naive solution. One TSDB per team. Operational nightmare; cost duplication; cross-team queries impossible.

How it breaks. One TSDB cluster shared by 50 teams. Team A's developer adds a high-cardinality label by mistake. Cardinality across the cluster explodes from 5M to 50M series in 30 minutes. Head block OOMs. All 50 teams lose monitoring at once. Postmortem: "noisy neighbor."

The actual fix. Per-tenant resource limits and sharded ingest: - Each request to the TSDB carries a tenant_id header (Mimir/Cortex pattern). - Ingesters route by (tenant_id, series_hash) so each tenant's series live on a tenant-specific ring slice. - Per-tenant limits: max active series, max samples/sec ingest, max query series fetched, max query duration, max query memory. Exceeding any limit returns 429 to that tenant only; other tenants unaffected. - Per-tenant retention. Free tier 14 days, paid tier 90 days, enterprise 1 year. Implemented as a TTL per tenant on block deletion. - Per-tenant query queues. Slow queries from tenant A don't block tenant B's queue.

This is the architecture of Grafana Cloud, Chronosphere, Datadog, New Relic — every SaaS observability platform. The same pattern applies internally at a large company where each engineering team is a "tenant."

7.6 Backfill

The problem. Loading historical data — last year's stock prices, archived telemetry from a vehicle that was offline, exported metrics from a deprecated TSDB being migrated.

The naive solution. Just write old timestamps to the TSDB.

How it breaks. Prometheus (until very recently) explicitly rejected samples with timestamps older than ~1h. The reason: the head block is in-memory and time-ordered; out-of-order samples would force chunk rewrites that break the immutability invariant. Backfill via push was simply not supported. Operators worked around it by stopping Prometheus and loading raw blocks via the promtool tsdb create-blocks-from rules workflow.

The actual fix. - TimescaleDB, InfluxDB, ClickHouse, QuestDB support out-of-order writes natively because they're not constrained by the 2h immutable block design. - Prometheus now supports out-of-order writes (since v2.39, 2022) with a configurable out-of-order window — but at a complexity and memory cost; not the default. - For Prometheus migrations, the recommended path is generate blocks offline with promtool and drop them into the data directory. Block compaction will merge them with existing data. - For long-term backfill, push directly to the long-term tier (Thanos's thanos receive, Mimir's distributor) which can accept out-of-order data because its blocks are not yet finalized.

7.7 Slow query isolation

The problem. One user's expensive query — topk(100, sum(rate(http_requests_total[1d])) by (user_id)) over a 1M-series, 100M-cardinality tenant — reads gigabytes of chunks into memory, runs for minutes, and can OOM the entire TSDB.

The naive solution. Trust users to write good queries.

How it breaks. A junior engineer is debugging at 3am, types the wrong PromQL, hits enter. The query reads 2 GB into the query engine's working memory. The TSDB OOMs. All dashboards and alerts go dark for the 60 seconds it takes to restart. The junior engineer is now responsible for a real incident.

The actual fix. - Query timeouts (default 2 minutes, configurable per tenant). - Per-query memory limits (query.max-samples, query.max-fetched-series, query.max-fetched-chunks). - Query queues to limit concurrency. - Query analyzer (Mimir's read-path analyzer, Grafana's "Query Inspector") to estimate cost before execution. - Resource quotas at the query frontend that reject queries projected to exceed thresholds.

§8. Failure Modes

Specific crash and degradation scenarios with recovery procedure and durability point.

Disk full

The TSDB instance fills its local disk. WAL writes fail. Head block can no longer flush. Prometheus stops scraping new samples (configurable: drop or stop). Already-acked samples in WAL are recoverable if WAL fsync completed; queries continue on existing data. Recovery: add disk, restart, WAL replays on startup. Prevention: alert on node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15, set retention aggressively low until upgrade.

WAL corruption

Power loss mid-WAL-write or filesystem corruption. On startup, Prometheus detects checksum mismatch on a WAL segment and skips it, continuing replay from the next valid record. Recovery: lose up to the corrupt segment's worth of samples (~30 minutes worst case); head block rebuilds from valid WAL records. Durability point: WAL fsync — samples before the last fsync are recoverable; samples within the ~100ms batching window are lost.

Cardinality explosion taking out the TSDB

A bad deploy adds a high-cardinality label. Series count goes from 5M to 50M in 5 minutes. Head block RAM grows past available memory; Linux OOM killer terminates the process; restart with WAL replay re-allocates the 50M series and OOMs again. Recovery (graceful degradation): (1) identify the offending metric via prometheus_tsdb_head_series and "Top metrics by series count" dashboards; (2) add a metric_relabel_configs rule to drop the label at scrape time; (3) restart with reduced chunk-write queue and aggressive --query.max-samples; (4) if WAL fatally bloated, delete old segments and accept sample loss; (5) wait for retention to expire bloated blocks. Prevention: cardinality alerts, per-job cardinality limits in scrape config, organizational label-naming discipline.

Query OOM

A user runs topk(1000, ...) with no time filter; query reads 50 GB into memory. The query engine hits its per-query limit and returns an error; if unlimited, OOMs the process and other queries fail until restart. Recovery: restart query engine (separate process in cluster mode; same process in single-node Prometheus). Queries are read-only — recovery is just process restart.

Scrape target timeout cascading

A scrape target's /metrics endpoint hangs for 30+ seconds. Scraper's worker pool depletes; other targets miss scrape windows. Each scrape has a scrape_timeout (default 10s); targets exceeding it are aborted and marked unhealthy via up{...} == 0. Prevention: scrape timeout < scrape interval (always); limit per-target metric count (some exporters expose tens of thousands of metrics); metric_relabel_configs to drop noisy metrics at parse time.

Ingester replica divergence

In a Mimir cluster, two ingester replicas for the same series have ~10s clock skew or one missed samples during a network blip. Query frontend merges results from all replicas and dedupes by (series_id, timestamp). Durability point: ingester replication factor — with RF=3, two ingesters can fail simultaneously without sample loss.

§9. Why Not a Generic Database

The seductive argument: "PostgreSQL is rock-solid, well-understood, runs my application — why not just put metrics in there too?"

Build a table:

CREATE TABLE metrics (
  metric_name TEXT NOT NULL,
  labels JSONB NOT NULL,
  ts TIMESTAMPTZ NOT NULL,
  value DOUBLE PRECISION NOT NULL
);
CREATE INDEX ON metrics (metric_name, ts);

Walk through ingest at 100K samples/sec:

Each INSERT is a btree page mutation. Random page write. NVMe SSD random write: ~10-30K IOPS sustainably. With group commit and prepared statements, maybe 50K rows/sec. Past 50K/sec, the WAL fsync saturates.
Storage per row. 8 bytes timestamp + 8 bytes float + ~50 bytes for metric_name (TEXT) + ~80 bytes for JSONB labels + ~30 bytes btree overhead = ~180 bytes per sample. 130× worse than Gorilla's 1.4 bytes.
Query SELECT avg(value) FROM metrics WHERE metric_name='cpu' AND ts > now() - '5m' GROUP BY (labels->>'service'). PostgreSQL scans the btree, fetches matching rows. Each row is in a different page (no co-location by series). 5 minutes at 10K samples/sec = 3M rows. Random page reads → multi-second latency.
MVCC bloat from the constant INSERT stream. Autovacuum cannot keep up; table bloats; query plans degrade further.
Retention. DELETE FROM metrics WHERE ts < now() - '30 days' is a delete on millions of rows. Locks pages, blocks vacuum, generates massive WAL. The whole cluster slows. In a TSDB, retention is rmdir block_directory — atomic, near-instant.

The breakage is structural:

Cost dimension	PostgreSQL	Prometheus	Factor
Storage per sample	~180 B	~1.4 B	130× worse
Sustained writes/sec/SSD	~50K	~1M	20× worse
5min range query	seconds	milliseconds	100× worse
Retention deletion	hours of locks	seconds (rmdir)	massive
Compression on floats	none / generic	Gorilla XOR	12×
Compression on timestamps	none	delta-of-delta	30×

The factor of 10-100× across every dimension is what justifies a specialized engine. PostgreSQL is the right tool when you need transactions, joins, and arbitrary structured queries on small-to-medium data. It is the wrong tool when you need millions of timestamped floats per second.

TimescaleDB closes some of this gap by adding a hypertable layer with chunk pruning and compression on top of PostgreSQL — but in doing so, it has effectively become a TSDB that happens to expose PostgreSQL SQL. Same engine internally; the abstraction at the SQL surface is the only thing that's "still PostgreSQL."

§10. Scaling Axes

Time-series workloads scale on two qualitatively different axes. They require different fixes.

Type 1: more series, more scrape targets — uniform growth

A fleet doubles from 5K to 10K Kubernetes pods. Each pod exposes ~100 metrics. Active series doubles from 5M to 10M. Sample rate doubles from 500K/sec to 1M/sec.

Fix: horizontal sharding. Mimir/Cortex consistent hashing across the ingester ring spreads new series across more ingester nodes. Each node still holds ~1/N of series; doubling targets means doubling N. Query frontend fan-out spreads query cost.

Inflection points: - 1 instance → HA pair (~1M series/instance): add a second Prometheus instance scraping the same targets, query layer deduplicates. - HA pair → sharded cluster (~10M series/instance): transition to Mimir or VictoriaMetrics cluster. Cardinality across the fleet outgrows one instance. - Sharded cluster → multi-region (~100M series cluster-wide): federation across regions. Each region has its own ingester ring; cross-region queries fan out to all regions and merge.

Type 2: same entities, more rate per entity — hotspot intensification

Same 5K pods, but a high-cardinality label appears. Per-pod series count goes from 100 to 50,000. Total series count goes from 500K to 250M. Sample rate stays the same.

This is the harder one. Sharding by series helps for ingest (each ingester takes 1/N of the new series), but it does not help the query path — a query like topk(10, sum(rate(...)) by (label)) must read across all 250M series to compute the topk.

Fixes: - Drop the high-cardinality label at SDK. First and best fix. If you don't need per-user metrics, don't emit per-user labels. - Pre-aggregated views. Define a recording rule that materializes the topk at ingest time as a low-cardinality derived series. Queries hit the derived series. - Tail-sampled traces instead. Per-request detail belongs in traces, where each request is one event, not one series. - Switch to a higher-cardinality TSDB. M3DB at Uber was built because Prometheus's cardinality ceiling was a non-starter for their use case. VictoriaMetrics and Mimir tolerate higher cardinality than vanilla Prometheus.

Inflection points: - 1M-10M series active: vanilla Prometheus on a big node. - 10M-100M series active: Mimir/Cortex or VictoriaMetrics cluster. - 100M+ series active: M3DB-class system, or aggressive cardinality control to bring it back down.

Read amplification

A separate scaling axis often missed: a single PromQL query can read many series. sum(rate(http_requests_total[5m])) over all services with all status codes touches every series matching the metric — could be 10K+ series for a moderately-labeled metric. Multiply by N concurrent queries (Grafana with 30 panels open, 100 users) and the query path is the bottleneck before the ingest path.

Fix: query splitting, result caching, recording rules — same techniques as §7.4.

§11. Decision Matrix

When to pick which TSDB.

You want	Pick	Why
Kubernetes monitoring, default reach	Prometheus	Ecosystem default; every exporter and dashboard assumes it; deploy in <1 hour
Prometheus-compatible, higher per-node throughput	VictoriaMetrics	Drop-in for PromQL/remote-write; 5-10× throughput; simpler ops than Mimir
Long-term retention behind Prometheus	Mimir or Thanos	Both add S3-backed unbounded retention; Mimir for big SaaS; Thanos is lighter
IoT push-mode, SQL access	InfluxDB	Push-native via line protocol; v3 has SQL; rich tags/fields
SQL + relational data + time-series in one place	TimescaleDB	PostgreSQL hypertable; joins with relational tables; familiar SQL; backfill native
Financial market data, microsecond ingest	QuestDB or kdb+	Optimized for tick data; SQL; out-of-order ingest; nanosecond timestamps
User-facing analytics dashboards (sub-second p99)	Apache Pinot	Indexed columnar; star-tree indexes; LinkedIn-class scale
Cross between time-series and general analytics	ClickHouse	SQL; great compression; flexible schema; ingest 10M+ rows/sec
Uber-class cardinality (10⁸+ series)	M3DB	Built for this; M3 coordinator + DB + query layer
Single trillion-samples/sec global scale	Build your own like Monarch	Beyond the public OSS ceiling

The boring default: Prometheus + Grafana + AlertManager for monitoring, plus VictoriaMetrics or Mimir when you outgrow it, plus Pinot or ClickHouse for user-facing analytics dashboards. Most organizations don't need anything more exotic until they're at 100M+ active series.

§12. PromQL and Query Languages

The query language is the second-half of a TSDB's value. A great storage engine with a bad query language wins fewer users than a mediocre engine with a great query language. PromQL is the dominant choice for monitoring; SQL is the dominant choice for analytics.

PromQL

Prometheus's functional query language. Tightly fitted to the time-series data model.

Instant vector vs range vector. http_requests_total returns one sample per series at a single point in time (instant). http_requests_total[5m] returns the time-window of samples per series over the last 5 minutes (range) — not directly displayable, must be reduced via a function.
Aggregations. sum, avg, min, max, count, quantile, topk, bottomk. Applied across series with by/without clauses: sum(rate(http_requests_total[5m])) by (service) computes the per-second rate over a 5m window, then sums by service.
rate() vs increase(). Single most common confusion for newcomers. rate(metric[5m]) = per-second rate over the window, accounting for counter resets. increase(metric[5m]) = total increase over the window = rate × window_seconds. Both handle counter resets correctly. rate() is correct for "requests per second"; increase() for "total requests in window."
histogram_quantile(). Computes a percentile from histogram buckets. The metric http_request_duration_seconds_bucket{le="..."} has one series per bucket; histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) aggregates buckets across instances and interpolates the 99th percentile.
Label-matched joins. metric_a * on(service) group_left() metric_b. Limited — not full SQL joins, but enough for "multiply error rate by traffic to get error volume."

InfluxQL → Flux → SQL

InfluxDB's query language evolved three times in a decade: - InfluxQL (original, SQL-like): "easy if you know SQL." Limited expressiveness. - Flux (v2): functional pipeline language with |> syntax. Powerful, but a new DSL nobody else knew. - SQL (v3 IOx): back to SQL on top of a columnar engine.

The lesson: query language design is hard, and users hate learning new DSLs when SQL is "good enough" for time-series queries.

SQL extensions (TimescaleDB, ClickHouse, QuestDB)

SQL with time-series functions: - time_bucket('1 minute', ts) — bucket samples into 1-minute windows. - first_value(), last_value() — sample at start/end of window. - time_bucket_gapfill() — fill missing windows with NULLs or interpolation. - LATERAL JOIN for windowed lookups.

TimescaleDB example:

SELECT time_bucket('5 minutes', ts) AS bucket,
       service,
       avg(value) AS p50,
       percentile_cont(0.99) WITHIN GROUP (ORDER BY value) AS p99
FROM metrics
WHERE metric_name = 'http_request_duration_seconds'
  AND ts > now() - INTERVAL '1 hour'
GROUP BY bucket, service
ORDER BY bucket;

ClickHouse and QuestDB have similar dialects. The advantage of SQL is that everyone already knows it and analytical tools (Tableau, Looker, Superset) speak it natively. The disadvantage is that for the most common monitoring queries, PromQL is dramatically more concise.

The "metric query is hard" lesson

Most newcomers to TSDBs struggle with: - Counter reset handling. A counter resets to 0 when the process restarts. Naive value_now - value_then produces a negative number. PromQL's rate() handles this; SQL on raw samples does not unless you explicitly model it. - Sample alignment across series. Two series scraped at slightly different times have offset timestamps. Aligning them for a join requires interpolation or window-snapping. - Histogram aggregation. Quantiles do not aggregate linearly. avg(p99(serviceA), p99(serviceB)) is not the p99 of the combined population. You must aggregate the underlying histogram buckets first, then compute the quantile.

These traps catch experienced SQL practitioners as often as junior engineers. They're not bugs in PromQL — they're consequences of the time-series data model meeting a query language.

§13. Downsampling and Retention

The economic math of TSDB operation.

Pre-aggregation patterns. - Recording rules (Prometheus). PromQL expressions evaluated on a schedule and written back as new series. record: service:http_request_rate:5m / expr: sum(rate(http_requests_total[5m])) by (service) lets the dashboard query the pre-aggregated series — 1000× cheaper than computing live. - Continuous aggregates (TimescaleDB). SQL-defined rollups maintained incrementally — new hypertable chunks trigger the rollup to extend. - Continuous queries / tasks (InfluxDB). Declarative aggregations re-evaluated on a schedule. - Materialized views (ClickHouse). Views that physically store aggregated rows, updated on insert via row-level triggers. - Thanos / Mimir downsampling. Automatic — every 2h block downsampled to 5m; every 24h block downsampled to 1h. Three resolutions of the same data coexist on disk.

Retention policies. - Prometheus --storage.tsdb.retention.time=15d: blocks older than 15 days are deleted by rmdir — atomic and near-instant. - InfluxDB retention policies: per-bucket TTL applied to shards. - TimescaleDB SELECT drop_chunks('hypertable', INTERVAL '90 days') removes whole chunks at once. - ClickHouse TTL clauses per-table or per-column; data deleted in background merges.

The cost math. For a SaaS observability workload: - Raw 10s × 100 metrics × 1000 hosts × 86400 s/day = 86.4 billion samples/day. - At 1.4 B/sample (Gorilla): 120 GB/day. - 30-day raw retention: 3.6 TB local NVMe. - 1-year raw retention: 44 TB — uncomfortable for local SSD, fine for S3 at $1000/month. - 1-year with 5m downsampling: 88 GB for the 5m tier — trivial. - 5-year with 1h downsampling: ~90 GB total — trivial.

The pattern: keep raw for short windows where investigation needs precision; downsample aggressively beyond that. The choice of resolutions and retention boundaries is the single biggest cost lever for an observability platform.

§14. Specialized Time-Series Workloads

Different domains stress different parts of the TSDB design.

Application monitoring — RED and USE. RED (Rate, Errors, Duration) for request-driven services — three metrics per endpoint, the basis of most service-level alerts. USE (Utilization, Saturation, Errors) for resources — CPU, queue depth, error counts; the basis of most infrastructure alerts. A typical cluster: ~50-200 metrics per service × thousands of services × 10s scrape = ~5-50K samples/sec.
Infrastructure monitoring — node_exporter. Linux host metrics: CPU, memory, disk, network, /proc inspection. node_exporter exposes ~1000 metrics per host with well-bounded labels — 1000 hosts × 1000 metrics × 6 scrapes/min ≈ 100K samples/sec.
Business KPIs. Signups per minute, revenue per minute, active users, conversion rate. Emitted as counters/gauges from application code (signup_total, revenue_dollars_total), queried via PromQL or pushed to Pinot. Cardinality bounded by business categories (campaign, region, tier) — typically <1000 unique series.
IoT sensor telemetry. Millions of devices × ~10-100 sensors × ~1Hz = ~10M-1B samples/sec aggregate. High cardinality: every device is a series. A 5M-vehicle fleet × 50 sensors = 250M series — exactly the case Prometheus is not designed for. Backfill essential (devices reconnect after offline periods). Workload fit: InfluxDB, TimescaleDB, ClickHouse; M3DB if cardinality is extreme.
Financial market data. Tens of millions of ticks/sec at peak. Ingest-to-query in single-digit milliseconds; nanosecond timestamps; values are bid/ask/volume tuples, not just one float. Workload fit: QuestDB, kdb+, InfluxDB; ClickHouse for analytics on archived ticks.
APM (Application Performance Monitoring) trace summaries. Trace data lives in a trace store (Tempo, Jaeger); trace-derived metrics — per-service p99 latency, per-endpoint error rate — get ingested as time-series. Pattern: tail-sampling collector emits metrics-from-spans (RED metrics computed across sampled spans) as Prometheus exporters.
Container / pod metrics. Kubernetes-native time-series: kube-state-metrics (desired/actual pods, deployment health), cAdvisor/kubelet (per-container CPU/memory/network), metrics-server (HPA — Horizontal Pod Autoscaler input). Cardinality bounded by (namespace, pod, container) — typically 100s of K to a few M series in a large cluster.

§15. Use Case Gallery

Different products, different workloads, all on TSDBs.

Service monitoring — every Kubernetes-era backend. The dominant TSDB use case. Prometheus (or a Prometheus-compatible TSDB) scrapes /metrics endpoints of every service. Pattern: ~1000 services × ~100 metrics × 10s scrape ≈ 100K samples/sec/cluster. Storage 100K × 86400 × 1.4 B ≈ 12 GB/day. Trivial for a single Prometheus instance. The default of the cloud-native ecosystem.
IoT telemetry — Tesla, factory floors, smart homes. Tesla's fleet of 5M+ vehicles each reporting hundreds of telemetry channels. Industrial: factory PLCs (Programmable Logic Controllers) over OPC-UA. Smart home: Nest, Ring, Hue. Workload shape: high-cardinality per-device series, multi-million samples/sec aggregate, high backfill demand. Typical stack: device → MQTT broker → Kafka → TSDB (InfluxDB or ClickHouse) → Grafana.
Trading systems — real-time stock prices. NYSE consolidated tape ~10-50M trades/sec at peak. Trading algorithms must see the latest tick within microseconds. Typical stack: market feed (TCP/multicast) → ingest service → QuestDB or kdb+ on a beefy single machine. Query "last 1 minute of trades for AAPL bucketed by 100ms" returns in <1ms.
Cloud cost monitoring. Resource utilization × pricing = cost time-series. Cloud providers (AWS Cost Explorer, GCP Billing) emit data; third-party tools (Vantage, Kubecost) aggregate. Pattern: hourly fetch of cost-per-service breakdown stored as cloud_cost_dollars{service=..., region=...}. Low volume (<10K samples/sec), long retention (years).
Network monitoring — SNMP polling. SNMP (Simple Network Management Protocol) is the 30-year-old standard for polling routers, switches, firewalls. NOCs (Network Operations Centers) poll ~10000 devices every 30s with ~100 OIDs (Object Identifiers) per device = ~33K samples/sec. Stack: snmp_exporter / LibreNMS → Prometheus → Grafana. Older shops still run RRDtool (Round-Robin Database, 1999) → Cacti / MRTG (Multi Router Traffic Grapher).
DevOps process metrics — DORA. Deployment frequency, change failure rate, mean time to recovery. Emit from CI/CD as counters/gauges; chart "deploys per week per team" for leadership. GitHub Actions / Jenkins → Prometheus pushgateway → Grafana. Low volume, long retention.
User analytics — DAU and retention. Daily/weekly active users, cohort retention curves. Computed batch from event data, result is a small time-series per cohort. Events in Kafka → Spark/Flink → Pinot or Prometheus. Dashboard queries the pre-computed time-series.
Energy / grid monitoring — smart meters. ~150M smart electricity meters in the US, each reporting consumption every 15 minutes (or 1Hz for high-resolution meters). ~167K samples/sec for a national utility. Compliance retention: 7-year billing dispute window. Stack: meter → AMI (Advanced Metering Infrastructure) → InfluxDB or TimescaleDB → analytics + billing.

§16. Real-World Implementations

Named systems shipping TSDBs at scale.

Prometheus. Born at SoundCloud (2012), inspired by Google Borgmon. Kubernetes default since ~2017. Single-node, pull-based. Federation via Thanos/Mimir/Cortex for long-term.
Datadog. SaaS observability. Proprietary internal TSDB derived from Gorilla-family techniques. Trillions of samples per day ingested across customer base. Push-based agent + collectors per host.
Grafana Mimir / Cortex. Mimir is Grafana Labs's actively-maintained fork of Cortex (the original CNCF project for horizontally-scalable Prometheus). Open source. Powers Grafana Cloud Metrics. Reference deployments handle billions of active series.
Thanos. Sidecar-based long-term retention for Prometheus. Backs S3-class storage with a query layer for unified access. The "minimum upgrade from vanilla Prometheus."
VictoriaMetrics. Single-binary or cluster Prometheus-compatible TSDB with higher per-node throughput. Open source + enterprise tier. Drop-in replacement at higher scale.
InfluxDB. v1/v2 were the OG general-purpose TSDB. v3 IOx is a new engine on Parquet + DataFusion (Rust). InfluxDB Cloud is the managed offering; Aiven and others also offer managed InfluxDB.
TimescaleDB. PostgreSQL extension with hypertables. Used widely for time-series + relational workloads. Timescale Cloud is the managed offering.
M3DB / M3. Uber-built, Uber-open-sourced. Built for Uber's cardinality (~1 billion active series). Gorilla-derived chunk format with custom Bloom filter index. M3 platform = ingester + coordinator + query layer around M3DB.
Facebook Beringei. The original Gorilla paper's implementation. In-memory engine, durable via replication, not disk. The paper itself (Pelkonen et al., VLDB 2015) is the foundational reference everyone in the TSDB space has read.
Netflix Atlas. Netflix-internal TSDB optimized for in-memory operational dashboards. Open source. Multi-tier with in-memory hot layer and on-disk cold layer.
Google Monarch. Google's internal monitoring system. Per USENIX 2020 paper, ~1 trillion samples/sec globally. The largest TSDB deployment publicly described. Hot tier is RAM-only across zones; no SSD in the recent-data path. Federated query layer across regions.
Apache Pinot. LinkedIn-born, open source. Powers user-facing analytical queries over time-series-shaped data — "Who Viewed Your Profile," ad analytics. Sub-second p99 on billions of events.
Apache Druid. Similar shape to Pinot; born at Metamarkets/Imply. Heavy use in advertising tech.
ClickHouse. Yandex-born, now ClickHouse Inc. General columnar analytical database; widely deployed for time-series because of its compression and ingest speed. Cloudflare, Uber, eBay, GitLab all run large ClickHouse deployments.
QuestDB. SQL-native TSDB optimized for financial market data and IoT. Tens of millions of inserts/sec on a single machine.
kdb+ / KX. Proprietary, dominant in finance. Q programming language. Decades of incumbency at investment banks for tick data.
OpenTSDB. The 2010-era TSDB on HBase. Largely superseded; still running at older shops.
RRDtool. The 1999 original — Round-Robin Database. Built for network monitoring (MRTG → Cacti). Fixed-size circular file; automatic downsampling baked in. Still widely deployed in NOCs and embedded systems for its simplicity.

§17. Summary

A TSDB is a specialized engine for "append-only timestamped floats indexed by labels and queried by time range" — a workload where a generic relational database loses 10-100× to a specialized one. The wins come from a small set of structural choices: monotonic timestamps compress to delta-of-delta (~1 bit/sample), correlated floats compress to Gorilla XOR (~7-11 bits/sample), immutable 2-hour blocks let retention be rmdir instead of DELETE, and an inverted label index lets queries filter by arbitrary label combinations. The dominant pattern — Prometheus + Grafana + AlertManager for monitoring, with Mimir/Thanos/VictoriaMetrics layered on for long-term — is the cloud-native default; specialized variants (Pinot/Druid for user-facing analytics, InfluxDB/TimescaleDB for SQL + IoT, M3DB for trillion-series cardinality, QuestDB/kdb+ for financial ticks) exist where the default's design point breaks. The recurring failure mode is cardinality explosion — a high-cardinality label (user_id, request_id) generating millions of series and OOM'ing the head block. The recurring cost optimization is tiered downsampling — raw samples for days, downsampled aggregates for years. Pick a TSDB when the workload is millions of timestamped floats per second; pick anything else when it isn't.