A technology reference on OLAP (OnLine Analytical Processing) data stores — what they are, how their columnar engines work at the byte level, where the design space splits between cloud warehouses, real-time analytics engines, and lakehouse query layers, and when to pick which. Use cases (business intelligence dashboards, real-time product analytics, customer-facing analytics, ML feature pipelines, log analytics, marketing attribution) appear throughout as illustrations of the same class of technology bent to fit different workloads.
§1. What OLAP Storage Is
An OLAP store is a database whose physical layout, execution engine, and durability model are all optimized for one workload: scanning huge fact tables, filtering, grouping, and aggregating. Where an OLTP (OnLine Transaction Processing) database commits a small mutation in 1 ms and serves a point lookup in 5 ms, an OLAP store reads ten billion rows in two seconds and returns a SUM(revenue) GROUP BY country. The defining property is "answers analytical questions over the entire dataset in human-tolerable time, not in hours."
Three orthogonal shifts separate OLAP from OLTP:
- Columnar storage instead of row storage. A row store (B+ tree of 16 KB pages, each holding ~60 full rows) is built for "give me this one row." A column store keeps each column in its own contiguous file or page run, so a query touching only 3 of 200 columns reads only 1.5% of the table — a 60x I/O reduction before compression.
- Bulk writes instead of small transactions. OLAP ingests batches (a million rows from S3, an hour of streamed events, a daily ETL load). Per-row UPDATE with row-level locking is anti-pattern; the engine is built around large appends and full-segment rewrites.
- No multi-row ACID across writers, by design. OLAP gives per-batch or per-snapshot atomicity, not point-isolation between 10,000 concurrent debit transactions. The cost saved on locking and MVCC (Multi-Version Concurrency Control) garbage is spent on scan-and-aggregate throughput.
OLAP sits at one corner of a four-way analytical-storage design space:
- Columnar OLAP warehouses (the focus of this doc). Snowflake, BigQuery, Redshift, Databricks SQL, ClickHouse, StarRocks, Apache Doris, Pinot, Druid, Trino+Iceberg, Spark SQL+Delta, DuckDB. Column-oriented, compressed, vectorized execution, optimizer over a large statistical model of the data. Built for
SELECT col1, col2, SUM(col3) FROM huge_table WHERE col4 > X GROUP BY col5. - OLTP relational databases. MySQL InnoDB, PostgreSQL, Oracle, SQL Server. B+ tree, row-oriented, strong ACID. Excellent for the small mutable state that drives a product (account balance, user profile, cart). Awful for
SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '90 days'over 100 B rows. See01_databases.md §1. - HTAP (Hybrid Transactional/Analytical Processing) stores. TiDB, SingleStore, Unistore (Snowflake's experiment), Oracle HeatWave. Bolt a column store onto an OLTP engine via a delta-stage + column-store, or vice versa. Promise: one system for both. Reality: the column store is real-time-ish (seconds-stale), one workload usually starves the other, and the cost-per-row is the union of OLTP and OLAP costs. Used when "ETL freshness" is the bottleneck and the team explicitly wants to avoid a separate warehouse.
- Search engines. Elasticsearch, OpenSearch, Solr. Inverted indexes for text relevance + filter aggregations. Optimized for "find the 10 best matches for a free-text query." Can do facet aggregations but pays a 10–100x cost vs columnar OLAP at the same data size and is bad at multi-column
GROUP BY. See03_search.md. - Time-series databases. Prometheus, InfluxDB, TimescaleDB, M3DB, VictoriaMetrics, Honeycomb's columnar engine. Specialized for
(timestamp, metric, tags) → valuewith high-cardinality label indexes, downsampling, and TTL (Time-To-Live) eviction. OLAP can serve time-series, but TSDBs (Time-Series DBs) win on storage cost per data point and on rate-style queries (rate(http_requests_total[5m])). See10_observability.md §3.
This doc focuses on the columnar OLAP corner because that is where the byte-level mechanics matter most and where the modern data stack converges. We'll contrast lakehouse formats (Iceberg, Delta, Hudi) against integrated warehouses (Snowflake, BigQuery) and against single-purpose real-time OLAP engines (Pinot, Druid, ClickHouse).
What OLAP is NOT good for:
- Low-latency point lookups. Reading one row by primary key from a columnar store re-assembles N columns from N separate files. A ClickHouse point lookup at p99 is ~30–50 ms; a Postgres B+ tree lookup is <1 ms. If your workload is "render a profile page given a user ID," OLAP is the wrong tool.
- High-concurrency small writes. A row-by-row UPDATE in Snowflake or ClickHouse re-writes a column file, often with a copy-on-write semantics. Per-row OLTP throughput on a columnar engine is ~10–100 ops/sec/cluster — five orders of magnitude below OLTP. Bulk-load or stream into immutable segments and let compaction handle merges.
- ACID across multiple writers. Most OLAP gives per-batch atomic (a load either fully visible or fully invisible) and snapshot isolation per query. Two concurrent updates to overlapping data is a "last writer wins" or a conflict-and-abort situation, not the OLTP-style serialization of millions of debits and credits.
- Sub-millisecond freshness. Lakehouse pipelines (Kafka → Flink → Iceberg) deliver data at ~minute freshness. Streaming OLAP (Pinot, Druid) reaches sub-second. Nothing in OLAP territory delivers OLTP-style "the write is queryable in microseconds at single-row granularity."
- Storing 10 MB blobs in a column. A VARCHAR full of base64 PDFs destroys compression and ruins scan throughput. Use object storage for blobs; OLAP stores the path.
Mental model: an OLAP store is the secondary, derived, analytical view of the truth that lives in OLTP. The OLTP source emits a change stream (CDC — Change Data Capture); a pipeline reshapes it into wide denormalized fact tables in columnar format; the OLAP engine serves "report queries" — business intelligence dashboards, real-time product analytics, ML training feature joins, customer-facing analytics dashboards, ad-hoc data-science exploration. Get the OLTP source right; everything analytical can be rebuilt from its change log.
§2. Inherent Guarantees
What columnar OLAP provides by design, and what must still be layered above it.
Provided by design
- Fast aggregation over billions of rows. A modern engine scans 1–10 GB/sec/node and produces a
GROUP BYresult over 10 B rows in seconds. ClickHouse benchmarks at ~2 GB/sec/core on aggregation; BigQuery slot allocation routinely scans terabytes in tens of seconds. - High compression, 5–20x. Per-column compression (dictionary + run-length + delta + general-purpose Zstandard) over the columnar layout typically gives 5–10x compression for transactional fact tables and 20–50x for sparse logs or status columns. The math: column data is locally similar, so entropy coders work brilliantly.
- Predicate pushdown via zone maps / metadata. Each storage granule (a row-group, micro-partition, or segment) carries
min,max,null_count, and often histograms / bloom filters per column. A query withWHERE event_date = '2026-05-21'may skip 99.9% of the table without reading data. - Snapshot isolation per query. A query sees a frozen view of the table as of its start time. Concurrent inserts and table mutations are visible only after a snapshot commit; in-flight readers never see partial writes.
- Late-binding schema for raw data. Many OLAP engines and lakehouse formats (Iceberg, Parquet) let you store JSON / VARIANT / Map
columns and project keys at query time. Schema-on-read. Lets analysts add new fields without a schema migration. - Per-batch / per-snapshot atomicity. A load of 100 M rows is either visible or invisible; never half-visible. Iceberg commits a new snapshot atomically via metadata pointer swap; Snowflake micro-partition commits via metadata transaction; BigQuery streaming buffer commits at flush.
Must be layered above
- OLTP-style point-row consistency. If a single user's account balance must be exact and immediately readable everywhere, that lives in OLTP, not OLAP. The warehouse is a derived view, intrinsically a few seconds-to-minutes stale.
- End-to-end pipeline correctness. OLAP gives you snapshot isolation; it does not give you "every transaction from the OLTP system reached the warehouse exactly once." Exactly-once ingestion lives at the stream-processor / CDC layer, with idempotent keys and offset checkpointing — see
04_stream_processing.md §6. - Workload isolation between tenants. A heavyweight query from team A can saturate a shared cluster and degrade team B. Multi-tenancy needs warehouse-level "virtual warehouses" (Snowflake), per-team slot reservations (BigQuery), or per-cluster isolation. The engine doesn't auto-fairness across tenants.
- Data quality and schema evolution discipline. A wrong column type or a deleted upstream field silently corrupts downstream tables and dashboards. Schema-on-read does not save you from human error; data contracts at the ingestion edge do.
- Cost governance. Pay-per-query engines (BigQuery) bill on bytes scanned; idle-but-allocated clusters (Snowflake virtual warehouses, Redshift) bill per uptime. A runaway dashboard can scan a petabyte and bill $5,000 in an hour. Quotas, cost tagging, slot limits — layered above the engine.
- PII (Personally Identifiable Information) governance, row-level security, column masking. Columnar engines provide RBAC (Role-Based Access Control) and column masks as a feature, but enforcing GDPR (General Data Protection Regulation) deletion across snapshots, replicas, and downstream BI extracts is a data-governance problem above the engine.
- Backup / point-in-time recovery. Snowflake's Time Travel + Fail-safe is the strongest built-in (90 + 7 days). Iceberg gives you snapshot history. Most engines need an explicit retention policy and an external "is the daily backup actually restorable" drill.
Synthesis: the OLAP engine guarantees per-snapshot atomic, columnar-fast aggregation with predicate skipping on the data you give it. The fidelity of that data, the freshness of the pipeline, the isolation between tenants, the cost ceiling, and the regulatory governance are all the platform team's problem.
§3. The Design Space
Five distinct flavors of OLAP store; they diverge along compute/storage coupling, target query latency, ingestion model, and operational ownership.
Cloud data warehouses (Snowflake, BigQuery, Redshift, Databricks SQL)
Snowflake. Fully-managed warehouse-as-a-service. Decoupled architecture: data lives in S3 as immutable micro-partitions (~16 MB each, columnar, compressed); compute is "virtual warehouses" (clusters of EC2 nodes) that you size up/down/pause; metadata is a centralized Snowflake-operated FoundationDB store. Strong on time-travel, zero-copy cloning, cross-account data sharing. Read-mostly with strong support for MERGE/UPDATE. Pricing: credits per warehouse-second + storage.
BigQuery. Google's serverless warehouse. Built on Dremel (the paper that started the columnar-OLAP wave). No clusters; you submit a query and Google's scheduler reserves "slots" (units of CPU) from a shared pool, scans columnar storage (Capacitor format) on Colossus (Google's distributed filesystem), and returns results. Decoupled storage and compute, but the compute is invisible — there are no nodes you operate. Pricing: bytes scanned (on-demand) or reserved slot capacity.
Redshift. Amazon's warehouse, originally a fork of ParAccel. Two generations: classic (shared-storage cluster) and RA3 nodes that decouple compute from storage (data in Redshift Managed Storage on S3). Strong PostgreSQL-compatible SQL surface. Pricing: per node-hour + RMS storage.
Databricks SQL (and the wider Databricks platform). Lakehouse approach. Data lives in Delta Lake format on the customer's S3 / ADLS / GCS (Google Cloud Storage); compute is Photon (Databricks' vectorized C++ engine) running on Spark clusters. Same engine serves SQL workloads, notebooks, ML training. Strong story for "one platform for warehouse + lakehouse + ML."
These four converge on the same architecture: shared-nothing-compute over shared-storage. Storage is object storage (S3, GCS, ADLS) with columnar files (Parquet, ORC, or proprietary). Compute is a scale-out cluster that you size per workload. Metadata (table layout, snapshots, statistics) is in a centralized catalog.
Self-hosted OLAP engines (ClickHouse, StarRocks, Apache Doris)
ClickHouse. Yandex-origin (now ClickHouse Inc. / open source), C++, single-node-or-clustered. The performance king on raw CPU efficiency — vectorized execution with hand-written SIMD (Single Instruction Multiple Data) kernels, ~2 GB/sec/core scan rates. Strong story for log analytics, customer-facing analytics, time-series-ish workloads. Coupled compute and storage in the classic architecture (data on local NVMe per node); newer ClickHouse Cloud decouples to S3.
StarRocks. Originally a fork of Apache Doris by ex-Baidu engineers, now its own project. MySQL-protocol-compatible, MPP (Massively Parallel Processing) architecture, vectorized engine. Strong real-time materialized view support; data can be ingested from Kafka and made queryable in seconds.
Apache Doris. Same family as StarRocks (shared origin). MPP, MySQL-compatible, focus on real-time analytics for internet companies (Baidu, ByteDance internal use).
These three are the "you operate it" tier. Cheaper at scale than cloud warehouses (no per-credit billing) but you carry the ops burden — capacity planning, replica management, version upgrades, failure recovery.
Real-time analytics engines (Pinot, Druid)
Apache Pinot. Born at LinkedIn (2014, originally for "Who's Viewed Your Profile" and feed analytics), now Apache. Designed for sub-100 ms p99 aggregation queries on streaming-ingested data with hundreds of thousands of QPS (Queries Per Second). Architecture: immutable segments, columnar with inverted indexes per column, Helix-based coordinator. Strong story for customer-facing real-time dashboards (Uber's order map, LinkedIn's analytics tab).
Apache Druid. Born at Metamarkets (~2011), Apache since 2018. Same niche as Pinot — real-time OLAP with streaming ingest. Architecture: segment files (columnar with bitmap indexes), Historical / MiddleManager / Broker / Coordinator nodes. Used at Netflix (clickstream), Airbnb (search-funnel analytics), Lyft.
These two overlap heavily; both target sub-second aggregation on data that arrived seconds ago. Differences in operational story, indexing tradeoffs, and SQL maturity drive the choice.
Federated query engines (Trino, Presto, Spark SQL, DuckDB)
Trino (formerly PrestoSQL) and Presto. Originated at Facebook (2012) for ad-hoc analytics on Hive tables. Trino is the active community fork; PrestoDB is the Meta-led continuation. Architecture: separation of compute from any specific storage — Trino has connectors for Hive/Iceberg, PostgreSQL, MySQL, Kafka, Cassandra, MongoDB, S3 directly. You can JOIN a Postgres table against an Iceberg fact table against a Kafka topic in a single SQL statement.
Spark SQL. General-purpose distributed compute engine; SQL is one of its surfaces. More batch-friendly than Trino (longer-running jobs, fault-tolerant via lineage), used heavily in ETL (Extract-Transform-Load) and ML training pipelines.
DuckDB. Single-node, embedded ("SQLite for analytics"). Vectorized engine, reads Parquet / CSV / Arrow / Iceberg directly. Used as a data-science workhorse on laptops and as an embedded engine inside other tools (dbt, Mode, Mother Duck for cloud).
These engines do not own the storage. They are query layers over external storage (lakehouse tables, object storage, federated databases). Coupled with Iceberg or Delta they form the lakehouse pattern: warehouse-grade SQL on commodity object storage with open file formats.
Lakehouse table formats (Iceberg, Delta Lake, Hudi)
Not engines, but table formats — metadata layers that turn a pile of Parquet files in S3 into a real "table" with snapshot isolation, schema evolution, time travel, and ACID per commit.
Apache Iceberg. Netflix origin, vendor-neutral. Snapshot + manifest tree (see §4). Dominant choice in 2024–2026 for new lakehouses. Read/write support: Trino, Spark, Flink, Snowflake (now), BigQuery (now), Databricks (via interop), Athena, ClickHouse.
Delta Lake. Databricks origin, open-sourced. Snapshot + transaction log (_delta_log). Native to Databricks; broader engine support has grown (Trino, Spark, Snowflake via Iceberg interop, Flink).
Apache Hudi. Uber origin. Built for upsert-heavy workloads (rider-and-driver state updates flowing through ride-matching). Two table types: Copy-on-Write (rewrite affected files at write) and Merge-on-Read (write delta log, merge at read). Strong on streaming upserts; the others have caught up since.
Comparison table
| Dimension | Cloud warehouse (Snowflake) | BigQuery | Self-hosted (ClickHouse) | Real-time (Pinot/Druid) | Lakehouse (Trino+Iceberg) | Single-node (DuckDB) |
|---|---|---|---|---|---|---|
| Compute / storage coupling | Decoupled | Decoupled (serverless) | Coupled (classic) / decoupled | Coupled | Decoupled | Single node |
| Target query latency | Seconds-to-minutes | Seconds-to-minutes | Sub-second to seconds | <100 ms p99 | Seconds-to-minutes | Sub-second on small data |
| Ingest latency | Minutes (Snowpipe ~1m) | Seconds (streaming buf) | Seconds (Kafka engine) | Sub-second (real-time segment) | Minutes (commit cadence) | Manual |
| Concurrency model | Virtual warehouses | Slot pool | Per-cluster | Broker fanout | Coordinator + workers | Single user |
| Storage format | Proprietary micro-partitions | Capacitor (proprietary) | MergeTree (open) | Pinot/Druid segments (open) | Parquet + Iceberg/Delta | Parquet / native |
| Time travel | 90 days (Time Travel) | 7 days | Limited | Limited | Unlimited (Iceberg snapshots) | N/A |
| Updates / deletes | Native MERGE | Native MERGE | Mutations (rewrite parts) | Limited (UPSERT in Pinot) | MERGE via Iceberg/Delta | Native |
| Cost model | Credits per warehouse-second | Bytes scanned / slots | Per-node | Per-node | Per-node + S3 | Free (open source) |
| Ops burden | Low (managed) | None (serverless) | High | High | Medium (cluster + catalog) | Low (just a library) |
| Sweet spot | BI on TB-PB | BI on TB-PB | Customer-facing logs/events | <100ms real-time dashboards | Open warehouse, ML pipelines | Notebooks, data science |
Synthesis: every cell's value is driven by the engine's central design choice. Pinot picks coupled storage with bitmap indexes because <100 ms p99 means you can't pay for S3 round-trips. Snowflake picks decoupled storage because the workload is BI, and 200 ms more on cold-cache scans is acceptable in exchange for "you can resize compute in 5 seconds." Trino picks no storage at all because its job is to federate across whatever already exists.
§4. Byte-Level Mechanics
The depth section. We zoom into columnar storage layout, compression codecs, zone maps, vectorized execution, Parquet file format, and the Iceberg manifest tree. We walk one aggregation query end-to-end.
4.1 Columnar layout: row-major vs column-major
A row-oriented store (InnoDB, PostgreSQL) writes rows contiguously on a page. A 200-column table with rows of 256 B has each row's 200 column values adjacent in memory:
Row-major (one InnoDB 16 KB page):
┌───────────────────────────────────────────────────────────────┐
│ Row 1: col1 col2 col3 ... col200 Row 2: col1 col2 col3 ... │
│ │ │ │ │ │
│ adjacent in memory, one row = one contiguous chunk │
└───────────────────────────────────────────────────────────────┘
A column-oriented store stores each column in its own contiguous run. For the same 200-column table, you get 200 separate "column files" (or 200 contiguous column chunks within a row group):
Column-major (one Parquet row group, simplified):
┌─────────────────────────────────────────────────────────┐
│ col1: [val_1, val_2, val_3, ..., val_N] ← contiguous │
│ col2: [val_1, val_2, val_3, ..., val_N] ← contiguous │
│ col3: [val_1, val_2, val_3, ..., val_N] │
│ ... │
│ col200: [val_1, val_2, val_3, ..., val_N] │
└─────────────────────────────────────────────────────────┘
The key consequence: a query SELECT col5, col10, SUM(col17) FROM t WHERE col5 = 'US' reads only the byte ranges for col5, col10, col17. If each column is 1% of total table bytes, the query reads ~3% of the data — a 30x I/O reduction before compression even kicks in.
The next consequence: locality within a column is much higher than locality within a row. All country_code values are likely to be one of ~200 strings; all event_type values one of ~50; all timestamp values monotonically increasing. Per-column compression exploits this brutally.
4.2 Compression codec stack
A modern columnar engine doesn't use one compressor — it composes lightweight codecs tailored to the column's data shape, then optionally applies a general-purpose entropy coder on top.
Dictionary encoding. For low-cardinality columns (country_code, event_type, status), build a dictionary mapping each distinct value to a small integer ID. Store the dictionary once, and replace the column values with the IDs. A country_code column of 1 B values × 8 bytes = 8 GB becomes 1 B × 1 byte (for 256 distinct countries) + dictionary = ~1 GB. 8x reduction.
Run-length encoding (RLE). For columns where consecutive values often repeat (e.g., after sorting by country_code: 1 B "US", then 200 M "GB", then ...). Store (value, count) pairs. A column with 1000 distinct runs over 1 B rows compresses to ~16 KB.
Bit-packing. For dictionary-encoded columns where the dictionary fits in K bits (256 entries → 8 bits, 64 entries → 6 bits), pack the IDs at bit-level. Bit-packing on a 5-bit dictionary across 1 B rows saves 3 bits per row vs a full byte = ~375 MB saved per column.
Delta encoding. For sorted or monotonic numeric columns — most commonly timestamps. Store (start, delta_1, delta_2, ...). For an event-stream timestamp column at second resolution, deltas are small integers fitting in 4–8 bits each.
Delta-of-delta. Two-level delta. Used in Gorilla (Facebook's in-memory time-series engine, now generalized into Prometheus/M3DB/InfluxDB IOx). For a regular-cadence metric ("CPU usage sampled every 10s"), most timestamp deltas equal 10 seconds; delta-of-delta is 0. Compresses a 64-bit timestamp into ~1 bit on average for steady cadences.
Gorilla encoding (for floats). Facebook's Gorilla paper trick: XOR each float64 with the previous, then store the number of leading zero bytes + significant bytes. Stationary or slowly-changing metrics (memory usage, throughput) compress 10–20x because consecutive samples differ in only the last few bytes.
Zstandard / LZ4 / Snappy on top. After the structural codecs, run a general-purpose entropy coder over the result. Zstd at level 3 is the modern default — ~500 MB/sec compress, ~2 GB/sec decompress, 2–4x further reduction. Snappy is faster but compresses less; LZ4 is fastest. Parquet defaults vary by engine; ZSTD is the modern default in Iceberg/Delta.
The cumulative effect. A real production fact table (event logs, ~50 columns, ~256 B/row uncompressed): per-column codecs give ~5x; Zstd on top gives ~2x; total ~10x. A 10 TB raw fact table on S3 is ~1 TB in Parquet. Snowflake reports typical compression of 3-5x on production tables (their own micro-partition format with their codec stack).
Column "country_code" compression path (1 B rows):
Raw VARCHAR 8 GB 8-byte strings
Dictionary → 1 GB 8-bit IDs + small dictionary
Bit-pack → 125 MB bit-packed to actual bit-width
Zstd → ~15 MB entropy coding on the bit-packed bytes
───
~530x reduction on a perfectly-skewed column
4.3 Zone maps, MinMax indexes, and sparse primary keys
A columnar engine doesn't index every row. It indexes granules — fixed-size chunks of consecutive rows (~8192 rows in ClickHouse, ~1024 rows in Pinot, ~1 M rows per row group in Parquet).
Each granule carries per-column metadata:
- Min and max of the column within the granule. A
WHERE event_date BETWEEN '2026-05-01' AND '2026-05-07'query checks each granule'sminandmax; if the granule's range doesn't intersect, skip it entirely. - Null count.
- Distinct count (sometimes).
- Bloom filter (optional). For point-equality predicates (
WHERE user_id = 42), a bloom filter per granule rules out granules with no chance. - Histogram (optional). Snowflake's clustering keeps approximate distributions per micro-partition for the optimizer.
ClickHouse MergeTree, simplified granule layout:
┌────────────────────────────────────────────────────────────┐
│ Part (~10-100 GB, one logical chunk on disk) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Granule 0: │ │
│ │ col1.bin bytes [0..L1] col2.bin bytes [0..L2] │ │
│ │ ... │ │
│ │ Mark file: col1: min=10, max=99, NULL=0 │ │
│ │ col2: min='2026-05-01', max='2026-05-01' │ │
│ ├──────────────────────────────────────────────────────┤ │
│ │ Granule 1: │ │
│ │ col1.bin bytes [L1..L1+L1'] │ │
│ │ ... │ │
│ │ Mark file: col1: min=100, max=199 │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Primary key index (sparse): one entry per granule │
│ granule_0 → (col1=10, col2='2026-05-01') │
│ granule_1 → (col1=100, col2='2026-05-01') │
│ ... │
└────────────────────────────────────────────────────────────┘
The primary key index in a columnar OLAP engine is fundamentally different from an OLTP B+ tree. It is sparse — one entry per granule, not per row. For an 8192-row granule, the PK index is ~8000x smaller than an OLTP per-row index. The query plan uses the PK index to binary-search to the right granule, then scans within. Point lookups can't use this to find one row in <1 ms — they find one granule (~8000 rows) in <1 ms and then have to scan ~8000 rows.
This is why OLAP is anti-pattern for point lookups: the index granularity is wrong by design.
4.4 Vectorized execution and SIMD
The classic execution model — tuple-at-a-time — processes one row at a time through the operator tree (Volcano-style iterators). Each row pays the overhead of virtual function calls, branch mispredictions, and L1 cache misses on every operator boundary.
The modern model — vectorized execution — processes a batch (typically 1024 or 4096 values) through each operator at once. The operator's inner loop becomes a tight scalar or SIMD-vector loop over an array of column values.
Concrete: filtering WHERE price > 100.
// Tuple-at-a-time (slow)
while (next_row(row)) {
if (row->price > 100) emit(row);
}
// Each iteration: virtual call, branch, possible cache miss.
// Throughput: ~10–50 M rows/sec/core.
// Vectorized (fast)
for (int i = 0; i < 1024; i++) {
selection[i] = (price[i] > 100);
}
emit(selection);
// Inner loop: 1024-element-wide SIMD compare + bitmask in <100 nanoseconds on AVX-512.
// Throughput: ~1–5 GB/sec/core on integer columns.
SIMD instructions. Intel AVX-512 (Advanced Vector Extensions) processes 64 bytes at once — 16 int32 values, 8 int64 values, 64 byte values per instruction. ClickHouse, DuckDB, Databricks Photon, BigQuery's Capacitor engine all use AVX-512 (or AVX2 on older hardware) intensively. Aggregations (SUM, COUNT, MIN, MAX), filters (predicate evaluation), arithmetic (price * quantity), and hashing (group-by) all parallelize.
Vectorized execution + tight columnar layout + SIMD is why a modern engine scans 1–10 GB/sec/core. This is 10–100x faster than tuple-at-a-time on the same hardware. The whole "columnar OLAP is fast" story rests on this.
4.5 Parquet file structure
Parquet is the dominant on-disk columnar format. Layout (one file):
Parquet file (typical, 128–1024 MB on disk):
┌────────────────────────────────────────────────────────────┐
│ Magic number: "PAR1" (4 bytes) │
├────────────────────────────────────────────────────────────┤
│ Row Group 0 (~128 MB uncompressed, ~1 M rows) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Column chunk for col1 │ │
│ │ ┌────────────────────────────────────────────────┐ │ │
│ │ │ Page 0: dictionary page │ │ │
│ │ │ Page 1: data page (encoded values) │ │ │
│ │ │ Page 2: data page │ │ │
│ │ │ ... │ │ │
│ │ └────────────────────────────────────────────────┘ │ │
│ │ Column chunk for col2 → ... │ │
│ │ Column chunk for col3 → ... │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────────┘ │
├────────────────────────────────────────────────────────────┤
│ Row Group 1 │
│ Row Group 2 │
│ ... │
├────────────────────────────────────────────────────────────┤
│ Footer: │
│ File metadata: │
│ Schema definition │
│ Row groups[]: │
│ - byte_offset, byte_length │
│ - num_rows │
│ - column_metadata[]: │
│ - col_name, encoding, codec │
│ - statistics: min, max, null_count │
│ - byte_offset, compressed_size, uncompressed_size│
│ - bloom_filter offset (Parquet 1.12+) │
│ Footer length (4 bytes) │
│ Magic: "PAR1" (4 bytes) │
└────────────────────────────────────────────────────────────┘
Key properties:
- Self-describing. Each file is independently readable. The footer holds the full schema + statistics for every row group.
- Row groups are independent. A query can skip a row group entirely based on footer statistics, parallelize across row groups in different files, and read column chunks for only the projected columns.
- Page is the I/O unit. Pages are ~1 MB. A query that needs only the first 10 K rows of a 1 M-row row group reads only the first ~1 MB of each projected column.
- Footer at the end. Annoying for streaming writes (you can't write the footer until you've written all row groups), but the convention is to read the file by
seekto the last 8 bytes, read footer length and magic, thenseekback and read the footer, thenseekagain to the data ranges. 3 round trips on S3 before data fetches start — a performance gotcha for many-small-file workloads. - ORC (Optimized Row Columnar) is the alternative. Originated in Hive/Hortonworks. Same general structure (file with stripes ~ Parquet row groups, columns within stripes, footer at end). More aggressively optimized for Hive-era predicate pushdown. Modern lakehouses standardize on Parquet because of broader engine support; ORC is still common in older Hadoop estates.
4.6 Iceberg manifest tree — table semantics over Parquet
Parquet alone is "files in S3." To make a real table — with INSERT, UPDATE, DELETE, snapshot isolation, schema evolution, time travel — you need a metadata layer above the files. Iceberg's solution: a multi-level manifest tree, all in object storage, all updated by atomic pointer swaps.
Iceberg metadata hierarchy (in S3 alongside the data):
table_root/
├── metadata/
│ ├── v1.metadata.json ← initial metadata
│ ├── v2.metadata.json
│ ├── ...
│ └── version-hint.text ← points to latest version number
│
│ v42.metadata.json:
│ {
│ "table_uuid": "...",
│ "current-snapshot-id": 12345,
│ "snapshots": [
│ { "snapshot-id": 12345,
│ "timestamp-ms": ...,
│ "manifest-list": "snap-12345-mlist.avro",
│ "summary": { "added-files": 17, "deleted-files": 3 } },
│ { "snapshot-id": 12344, ... },
│ ...
│ ],
│ "schema": { ... },
│ "partition-spec": { ... },
│ "sort-order": { ... }
│ }
│
├── metadata/snap-12345-mlist.avro
│ manifest list (one entry per manifest file):
│ [
│ { "manifest_path": "manifest-001.avro",
│ "added_data_files_count": 5,
│ "partitions": [ { "lower_bound": "2026-05-01", "upper_bound": "2026-05-01" } ] },
│ { "manifest_path": "manifest-002.avro", ... },
│ ...
│ ]
│
├── metadata/manifest-001.avro
│ manifest file (one entry per data file):
│ [
│ { "data_file": "data/dt=2026-05-01/part-0001.parquet",
│ "file_size_bytes": 524288000,
│ "record_count": 1042000,
│ "partition": { "dt": "2026-05-01" },
│ "column_stats": {
│ "user_id": { "min": 1, "max": 99999, "null_count": 0 },
│ "country": { "min": "AD", "max": "ZW", "null_count": 12 }
│ } },
│ ...
│ ]
│
└── data/
├── dt=2026-05-01/part-0001.parquet
├── dt=2026-05-01/part-0002.parquet
├── ...
The chain: table metadata → snapshot → manifest list → manifest files → data files.
Why three levels. A single flat list of all data files in the table doesn't scale — a 1 PB table can have millions of Parquet files. A two-level tree (manifest list → manifest files) lets the engine:
- Read the snapshot's metadata.json (1 KB).
- Read the manifest list (1–10 MB).
- Skip manifests whose partition bounds don't intersect the query predicate (most are skipped).
- Read only the relevant manifest files (each ~1 MB, listing ~1000 data files with stats).
- Use per-file stats to skip data files.
- Read only the matching Parquet files, then use Parquet's row-group stats to skip row groups within them.
This is a four-level zone-map hierarchy: partition → file → row-group → page. Each level prunes ~10–100x. A 1 PB table with a tight predicate often scans <100 MB.
Snapshot isolation via pointer swap. A write transaction:
- Writes new Parquet files to
data/. - Writes a new manifest file pointing to them.
- Writes a new manifest-list file referencing the old manifests (unchanged ones) + the new manifest.
- Writes a new
v43.metadata.jsonreferencing the new manifest list. - Atomically updates
version-hint.textto "43" (or, in catalog-based deployments, atomically updates the catalog row for this table to point tov43.metadata.json).
Step 5 is the commit point. Until it happens, no reader sees the new data. After it happens, all new readers see the new snapshot atomically; in-flight readers continue on their original snapshot (they hold the old metadata.json reference and ignore the new manifest list). Snapshot isolation, without locks, on object storage.
Time travel. Old snapshot files are not deleted until a retention threshold passes. SELECT * FROM t FOR VERSION AS OF 12343 opens the table at snapshot 12343 (manifest list, manifests, and data files all immutable and still present in S3). Iceberg defaults to retaining snapshots for 5 days; configurable.
Schema evolution. Each snapshot carries a schema version; column adds, renames, drops are metadata-only operations that produce a new snapshot with a new schema. Old snapshots remain readable under their original schema.
Hidden partitioning. Iceberg stores the partition function (e.g., bucket(16, user_id), day(event_ts)) in the table metadata. Queries say WHERE event_ts > '2026-05-01' and Iceberg auto-applies the partition predicate. The user never says WHERE event_dt = '2026-05-01' (the partition column in Hive). This decouples partitioning from query syntax; partitioning evolution doesn't require rewriting queries.
Delta Lake uses a similar idea with a different metadata format — a _delta_log/ directory of JSON commit files and Parquet checkpoint files. Hudi uses a .hoodie/ directory with timeline files. All three solve the same problem (table semantics over files in object storage); Iceberg's manifest tree is the dominant choice in 2026 because of breadth of engine support and operational simplicity at PB+ scale.
4.7 One concrete query, walked end-to-end
A real query against an Iceberg+Trino lakehouse:
SELECT country, SUM(revenue) AS total_revenue
FROM clickstream_events
WHERE event_date BETWEEN '2026-05-01' AND '2026-05-07'
AND event_type = 'purchase'
GROUP BY country
ORDER BY total_revenue DESC
LIMIT 20;
Table: partitioned by event_date, 10 B rows over a year, ~50 columns, ~5 TB Parquet.
- Parse, plan. Trino coordinator builds plan:
Scan → Filter → Aggregate → Sort → Limit. Filter pushed to scan. - Read table metadata (
v_latest.metadata.json, then manifest list ~5 MB). - Prune manifests by partition bounds. 7 days of 365 → ~99% skipped.
- Read 7 relevant manifests (~7 MB total) listing data files with per-file column stats.
- Prune data files by
event_typemin/max — ~30% drop. After: ~1000 files (~700 GB) out of ~50 K (~5 TB). ~85% of bytes skipped. - Schedule splits. Coordinator divides files among workers, one row-group per task.
- Per-worker read. Open Parquet file (HTTP GET footer first, then data ranges); skip row groups within file via footer stats (~30% further pruning); read only the 4 projected columns; decompress pages; decode into Arrow vectors.
- Vectorized filter. SIMD compare
event_type == 'purchase'across 1024-row batches (~100 ns per batch on AVX-512). - Partial aggregate. Hash by
country, accumulateSUM(revenue)in a per-worker hash table. - Shuffle partial results by
hash(country) % N. - Final aggregate. Sum partial sums per country (~250 country rows total).
- Sort, limit. Coordinator sorts and returns top 20.
Wall clock on a 50-worker Trino cluster: ~3–10 seconds for ~700 GB effective scan, dominated by S3 GET latency. Aggregation itself at ~5 GB/sec/node runs in ~60 ms wall time after parallelization.
Byte-level total. Out of 5 TB of table data, ~700 GB Parquet downloaded → ~150 GB after column projection → ~30 GB after row-group skipping. Final output: ~480 bytes (20 rows × 24 bytes). Without columnar layout, manifest pruning, and vectorized execution, the same query reads 5 TB and takes hours. With them, 30 GB and 5 seconds.
§5. Capacity Envelope
What columnar OLAP can do, illustrated across very different scales. The same family of technology covers six orders of magnitude.
DuckDB on a laptop — GB scale. A data scientist runs DuckDB in a Jupyter notebook on a MacBook Pro M3 (16 GB RAM, NVMe). 50 GB of Parquet from S3 cached locally; aggregations over ~500 M rows in 1–5 seconds. ~6 GB/sec scan rate on a single chip; 1 B-row TPC-H Q1 in ~2 seconds. Concurrent users: 1. Cost: zero (open-source).
ClickHouse self-hosted — TB scale. A SaaS analytics company runs a 16-node ClickHouse cluster on AWS i4i.4xlarge (16 vCPU, 128 GB RAM, 7.5 TB NVMe each). ~120 TB compressed (~5 PB uncompressed equivalent). ~1000 QPS, p99 ~500 ms. Ingest ~2 M events/sec via Kafka. Plausible Analytics publicly runs their entire product on this scale.
Snowflake at mid-large enterprise — TB-to-PB scale. A retailer with thousands of analytics users. Virtual warehouses: 4–8 size-large clusters (~8 nodes each), scaled up during BI peak hours, paused overnight. ~50–500 TB compressed; ~5 PB raw. Dashboard query: 2–10 seconds. ETL: 5–30 minutes. Concurrency: ~200–500 concurrent queries. Cost: $1 M–$10 M/year typical.
Netflix Iceberg + Trino — exabyte scale. ~hundreds of PB to low exabytes on S3 in Iceberg (~500 PB disclosed 2023). Trino interactive (~1000+ nodes), Spark batch (tens of thousands of jobs/day), Flink streaming. Queries from 1-second metadata lookups to multi-hour ML training. Iceberg was born at Netflix in 2017 to fix Hive's metadata scaling at this size.
Meta Presto, Google BigQuery — exabyte scale. Meta runs Presto on hundreds of thousands of CPUs over >300 PB; the dominant interactive engine inside Facebook. BigQuery is likely the largest known OLAP system — exabytes of customer data on Capacitor/Colossus, queries scanning multiple PB in seconds, millions of queries/day.
Apache Pinot at LinkedIn / Uber — real-time, multi-PB. LinkedIn Pinot for member-facing dashboards: ~200 K QPS, p99 <50 ms, tens of PB. Uber Pinot + Druid: ~600 K QPS at peak, sub-100 ms p99. Different OLAP shape than Snowflake — "tens-of-ms-fast over PB" via coupled storage and per-column bitmap indexes, not Snowflake's "seconds-fast over PB" via cheap S3.
Capacity envelope: GB on a laptop to exabytes at hyperscale — six orders of magnitude wide. Storage engine details (columnar, compressed, vectorized) are the same on a DuckDB process as on a BigQuery slot. What scales is the partitioning, manifest layering, and distributed coordinator wrapping the engine.
§6. Architecture in Context
The canonical modern OLAP architecture, the Lakehouse pattern. Not "the X system" — the shape that recurs across analytics platforms in 2024–2026.
┌────────────────────────────────────────┐
│ OLTP source systems │
│ (MySQL / PostgreSQL / DynamoDB / │
│ MongoDB / app-level events) │
└─────────────┬──────────────────────────┘
│
│ CDC (Change Data Capture)
│ Debezium / Brooklin / native
│
▼
┌────────────────────────────────────────┐
│ Kafka (or equivalent log) │
│ Per-table topics, ordered by │
│ primary key partition │
└─────────────┬──────────────────────────┘
│
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Stream │ │ Real-time│ │ Batch │
│ processor│ │ OLAP │ │ ETL │
│ (Flink) │ │ (Pinot, │ │ (Airflow │
│ │ │ Druid) │ │ + Spark)│
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ direct │
│ │ ingest │
▼ │ ▼
┌────────────────────┴──────────────────────┐
│ Lakehouse: Iceberg / Delta / Hudi │
│ on S3 / GCS / ADLS │
│ │
│ Bronze (raw) → Silver (cleaned) │
│ → Gold (aggregated / materialized) │
└─────────────────┬─────────────────────────┘
│
┌─────────────┼─────────────┬────────────────┐
▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────────┐ ┌───────────┐
│ Trino │ │ Spark │ │ Snowflake / │ │ DuckDB │
│ (inter- │ │ SQL │ │ BigQuery │ │ (laptops, │
│ active │ │ (batch │ │ (managed │ │ embedded)│
│ SQL) │ │ ETL/ML)│ │ warehouse) │ │ │
└────┬────┘ └────┬────┘ └─────┬───────┘ └─────┬─────┘
│ │ │ │
▼ ▼ ▼ ▼
┌────────────────────────────────────────────────────┐
│ Consumers: BI tools (Tableau, Looker), data │
│ science notebooks, ML training, reverse ETL │
│ to operational systems, customer-facing analytics│
└────────────────────────────────────────────────────┘
The canonical layers: OLTP source is the truth (account balances, orders); CDC captures changes (Debezium, Brooklin, DynamoDB Streams); Kafka is the durable, replayable backbone; stream processor (Flink, Spark Streaming) materializes Kafka topics into Iceberg/Delta with exactly-once semantics, committing every 30s–5min; real-time OLAP (Pinot, Druid) ingests directly from Kafka for sub-second-freshness use cases; batch ETL (Airflow + Spark) runs longer transformations producing Bronze → Silver → Gold medallion-architecture tables; query engines (Trino, Spark, Snowflake, BigQuery, DuckDB) read the lakehouse tables — the table is the integration point. Consumers are BI dashboards, notebooks, ML pipelines, customer-facing analytics, and reverse-ETL.
Lambda vs Kappa variants. Lambda (Marz, 2014) runs batch + speed layers in parallel, queries merging both — two implementations, two drift opportunities. Kappa (Kreps, 2014) runs one streaming pipeline and replays from Kafka to backfill. The diagram above is Kappa; Flink + Iceberg + Trino is the modern default.
§7. Hard Problems Inherent to OLAP
Seven fundamental challenges. Each shows up regardless of which warehouse or engine you pick; illustrating examples come from across domains.
7.1 Joins at scale
One line. A JOIN between two large tables triggers a hash table or a shuffle that may not fit in memory or in network bandwidth.
Where it shows up. A marketing-attribution query joins clicks (100 B rows) against purchases (1 B rows) by user_id. A risk model joins transactions (10 B rows) against accounts (1 M rows) by account_id. An ML pipeline joins events (50 B rows) against feature_table (1 M user features) by user_id.
Naïve plan. Cross-product: for every clicks row, scan all purchases. 100 B × 1 B = 10²⁰ comparisons. Heat death of the universe before it returns.
Real fix. The optimizer picks one of three strategies based on table sizes and join keys.
- Broadcast (replicated) join. When one side is small (<1 GB after filters), the small side is replicated to every worker; each worker hash-joins its slice of the big side against the in-memory small side. Cost: O(big_side_scan + N_workers × small_side_size). Used for fact-against-small-dimension joins (e.g.,
events⨝products). - Shuffle join. Both sides are large. Hash both by the join key, redistribute rows to N partitions, hash-join within each partition. Cost: O(both_sides_scan + both_sides_network). Used when neither side is broadcastable.
- Bucketed join. If both tables are pre-bucketed (Iceberg's
bucket(N, user_id)partition spec) on the join key, no shuffle is needed — corresponding buckets join directly. Free, when applicable.
The "100 B × 1 M" cross-join disaster. A common pitfall: the optimizer mis-estimates the small side as "huge," picks a shuffle join, and the shuffle phase moves 100 B rows over the network. 10 TB of data shuffled at 10 GB/sec across the cluster = 1000 seconds of pure network time. The fix: explicit hint to the optimizer (/*+ BROADCAST(small_table) */) or pre-collect stats so the optimizer picks the right plan. Bad joins are the single biggest cause of "query that should take 10 seconds takes 4 hours."
7.2 Cardinality estimation for query planning
One line. The optimizer's chosen plan depends on its estimate of row counts at each operator; bad estimates cause bad plans.
Where it shows up. A query with WHERE event_date = '2026-05-20' AND user_country = 'JP' AND event_type = 'purchase' could match 100 rows or 100 M rows depending on the day, country, and type. The optimizer must guess to pick the right plan.
Naïve estimate. Assume column predicates are independent: selectivity(date) × selectivity(country) × selectivity(type). Often wrong by orders of magnitude — country='JP' and event_type='cart_view' are correlated (Japan has higher e-commerce engagement); the joint selectivity is not the product.
Why it breaks. The optimizer estimates a join's left side at 1000 rows, picks a broadcast join, allocates 100 MB for the hash table. Reality: 100 M rows. Hash table OOMs (out of memory); the worker crashes; the query restarts from scratch with a different plan; total wall time goes from a target 5 s to 30 minutes.
Real fix. Multi-pronged:
- Statistics collection. Engines maintain per-column histograms, NDV (Number of Distinct Values) estimates, and most-common-value frequencies. Snowflake auto-collects on micro-partition writes; BigQuery uses ZetaSQL's optimizer over column statistics in metadata; Spark requires
ANALYZE TABLE. - Multi-column statistics. Joint histograms (PostgreSQL's
CREATE STATISTICS) capture correlations between columns. Often expensive, used selectively. - Adaptive query execution (AQE). Spark 3+, Snowflake's adaptive planner, Databricks Photon's runtime adaptation. The engine starts a plan, observes actual row counts during the first stages, and re-plans downstream stages. Turns "wrong plan = disaster" into "wrong plan = slow start, then re-plan."
- Manual hints. Production-critical queries often carry hints to override the optimizer when stats are known to be misleading.
7.3 Data skew
One line. Even with hashed partitioning, a few keys account for most of the rows; one worker gets 90% of the work.
Where it shows up. A click-stream GROUP BY user_id finds one bot account with 1 B events out of 10 B total. A SaaS analytics GROUP BY tenant_id finds one mega-customer driving 30% of all rows. A multiplayer game GROUP BY game_session_id has one viral session with 100x median traffic.
Naïve fix. Hash partition by the GROUP BY key. Spreads keys evenly across workers — but the rows per key are unevenly spread, so a worker holding the heavy key holds most of the rows.
Why it breaks. Concrete: 100 workers, shuffle-aggregating 10 B rows. Theoretical balance: 100 M rows/worker. Reality (skewed key): one worker holds 1 B (the bot's events) + ~90 M (random other keys) ≈ 1.1 B rows. Wall time = (1.1 B rows) / (1 worker's throughput) = 10x the expected. The other 99 workers idle for 90% of the query.
Real fix. Salting the skewed key. For the known-hot key, append a random 0..N-1 suffix at write or at the partial aggregation step. The hot key becomes N "shards" — bot_user_id:0, bot_user_id:1, ..., bot_user_id:15 — each landing on a different worker. After local aggregation, a final reduce step de-salts and combines: 16 partial sums per shard → one total per logical key.
Without salting: With salting (N=16):
worker_0 [user_42:1B rows] ⨯ worker_0 [user_42:0, 62M rows] ✓
worker_1 [other 90M rows] ✓ worker_1 [user_42:1, 62M rows] ✓
worker_2 [other 90M rows] ✓ ...
... worker_15 [user_42:15, 62M rows] ✓
↓
Final reduce: sum across user_42:0..15
Trino and Spark have automatic skew detection (AQE, dynamic partition coalescing); manual salting is still common for known-pathological cases.
7.4 Data freshness vs cost
One line. Going from "hourly batch ETL" to "sub-second streaming ingest" can cost 10–100x more compute.
Where it shows up. A daily dashboard with 24-hour-stale data is fine for finance reporting; useless for fraud detection. Real-time fraud needs <1-second visibility into transactions. ML model training needs hourly fresh features; online inference needs sub-second.
Naïve fix. Stream everything. Run Flink jobs continuously; commit to Iceberg every 30 seconds; query the lakehouse for everything.
Why it breaks. Continuous streaming commits produce small files. A table with 30-second commits and 100 partitions produces 100 × (86400/30) = 288,000 files per day. After a year: 100 M small files. File listing alone (S3 LIST is paginated at 1000 keys per call) takes minutes; query planning becomes "list 100 M keys to find the right partition." The "small-file problem" is the lakehouse's #1 chronic disease.
Real fix. Tiered freshness:
- Hot tier (real-time): sub-second freshness via Pinot/Druid, ingested directly from Kafka. Limited retention (~7–30 days).
- Warm tier (minute freshness): Flink → Iceberg with 1–5 minute commits. Background compaction merges small files into larger ones nightly (10K files of 1 MB → 100 files of 100 MB).
- Cold tier (hourly/daily): batch ETL produces day-partitioned tables for the long history; ad-hoc analytics queries hit this.
- Federated query to unify: a dashboard query joins real-time tier (last hour) with warm tier (last day) with cold tier (older). Pinot can federate to Iceberg natively; Trino spans both as connectors.
The cost ratio: real-time tier is ~10–100x more expensive per GB than cold tier, because (a) you pay for always-on streaming compute, (b) you pay for the inverted indexes that enable <100ms queries. Use it only for the slice that needs that freshness.
7.5 Multi-tenancy and workload isolation
One line. A heavy query from team A can saturate a shared cluster and degrade team B's dashboards.
Where it shows up. A data scientist's exploratory SELECT * on a 100 TB table saturates Snowflake's shared warehouse, freezing every CFO dashboard. An ad-hoc Trino query joins three large tables badly and consumes 80% of cluster CPU for an hour.
Naïve fix. "Just be careful." Doesn't scale past 10 users; queries from misbehaving sources are common.
Real fix. Workload isolation primitives:
- Snowflake virtual warehouses. Each team or job class gets its own warehouse — a separate compute cluster sharing the storage layer. Sized independently; one warehouse's load can't touch another. Common pattern: BI warehouse (auto-scale, peak hours), ETL warehouse (large, night), data-science warehouse (small, on-demand).
- BigQuery reservations / slot pools. Reserved slot capacity per project or department. Idle slots flex into other projects (fair-share); committed slots are guaranteed.
- Trino resource groups. Queries from each group get a CPU/memory share. A misbehaving group can be capped at 30% of the cluster, preventing it from saturating.
- Per-query memory limits. Trino's
query.max-memory-per-node, Snowflake'sSTATEMENT_TIMEOUT_IN_SECONDS, BigQuery'smaximumBytesBilled. Force the bad query to fail rather than degrade the cluster.
A production OLAP platform team's #1 ongoing job is tuning these policies as workloads evolve.
7.6 Schema evolution
One line. Add, rename, or drop a column on a 1 PB table — without rewriting the data.
Where it shows up. Product adds a new event field; analytics needs to track it. A column type needs to change from INT to BIGINT (counter overflowed). A column rename for clarity. A column drop for GDPR compliance (PII_email must be permanently deleted).
Naïve fix. ALTER TABLE. In Hive (pre-Iceberg), ALTER TABLE was metadata-only but had positional column matching — adding a column in the middle re-numbered subsequent columns, causing readers to misinterpret existing Parquet files (Parquet stores by position, not name). Adding created_at between id and name made all existing rows appear shifted. Silent data corruption.
Real fix. Iceberg and Delta both use column IDs instead of column positions. Each column has a stable ID at table-creation; renames change the name → ID mapping in metadata; the underlying Parquet column chunk is identified by ID. Adding a column anywhere is metadata-only; existing files don't reference the new column ID, and reads default to NULL. Renaming a column is a metadata update of the name → ID mapping, no file rewrite.
For type changes (INT → BIGINT), Iceberg allows widening conversions (INT → BIGINT, FLOAT → DOUBLE) as metadata-only. Narrowing or incompatible changes require a column-add + backfill + column-drop dance.
For column drops with GDPR-level deletion (must be permanently irrecoverable, not just hidden), the rewrite cost is paid — but it's a one-time job (OPTIMIZE table ... REWRITE), not blocking.
7.7 Time travel and snapshot retention
One line. Querying "the table as it was yesterday" is useful for debugging, recovery, and ML reproducibility; but old snapshots cost storage and metadata complexity.
Where it shows up. A bug deletes 10 M rows from a production table at 2 AM. A data scientist needs to reproduce a model trained on yesterday's data. A compliance audit asks "what did this table look like on 2025-12-31?"
Naïve fix. Daily backups. Coarse-grained, slow to restore, often not actually tested.
Real fix. Per-snapshot time travel. Iceberg keeps every commit's metadata; querying FOR VERSION AS OF <id> or FOR TIMESTAMP AS OF <ts> opens the table at that snapshot's manifest list. Cost: old data files must be retained (no garbage collection) for the retention window. Snowflake's Time Travel: 90 days for Enterprise+. Snowflake Fail-safe (a separate 7-day disaster-recovery tier behind Time Travel): for catastrophic recovery, you call Snowflake support to restore.
Operational gotcha. Time travel storage cost can dominate the table's actual size. A 1 TB table with hourly updates of 1% of rows = 240 GB/day of new files; over 90 days = ~22 TB of retained-file storage for 1 TB of "current" data. Tuning snapshot retention is a real cost lever.
§8. Failure Modes
Five archetypal OLAP failure scenarios; recovery procedure for each.
8.1 Query out-of-memory (OOM)
Scenario. A GROUP BY query's hash table exceeds available worker memory. Common when (a) cardinality estimation is wrong, (b) the GROUP BY key has high cardinality (e.g., GROUP BY user_id on 1 B users), (c) joins explode in cardinality.
Failure mode. The worker crashes (OutOfMemoryError). The coordinator either fails the query or retries it. If many workers OOM, the cluster goes into a thrash spiral.
Recovery. Engines have several mechanisms:
- Spill to disk. Trino, Spark, ClickHouse, Snowflake — when the in-memory hash table exceeds a threshold, spill the partial state to local NVMe. Trino's "spill-to-disk" doubles query time for large GROUP BY but prevents OOM.
- Adaptive partitioning. Detect skewed hash buckets and re-partition them at runtime.
- Per-query memory limits. Fail loud, fast — let the user shrink the query.
Durability point. The query is idempotent (read-only); restart from scratch with adjusted parameters.
8.2 One bad query freezes the warehouse
Scenario. A cross-join with no predicate. A query joining three 100B-row tables. A buggy dbt model that scans entire history.
Failure mode. The bad query consumes 100% of cluster CPU / memory / network. Other queries either queue (BigQuery) or compete and slow proportionally (Snowflake without virtual-warehouse isolation; Trino without resource-group caps). Dashboards across the company go red.
Recovery. Two layers:
- Real-time kill. Operator finds the offending query (
SHOW QUERIES/ Snowflake's QUERY_HISTORY / BigQuery's INFORMATION_SCHEMA.JOBS) and aborts it. Manual; pager-duty-driven. - Permanent fix. Apply per-query limits (max bytes scanned, max execution time, max memory) and per-user/team resource caps to prevent recurrence.
Durability point. Read-only queries — abort costs nothing. Write queries (MERGE, COPY) have batch atomicity — abort rolls back the entire commit; partial data never visible.
8.3 CDC → lakehouse pipeline lag
Scenario. Debezium → Kafka → Flink → Iceberg pipeline. Flink is supposed to commit every 60 s. A backpressure surge (Iceberg compaction stall, S3 throttling, Flink GC pause) causes commits to fall behind. Lag grows from seconds to minutes to hours.
Failure mode. Queries against the lakehouse see stale data. A dashboard says "revenue today = $3 M" when the truth is "$5 M" because the last 2 hours of orders aren't materialized yet.
Recovery.
- Detect: monitor current_time - max(commit_timestamp) continuously; alert at >5 minutes.
- Investigate: is Flink GC-thrashing? Is Iceberg's snapshot commit blocking on metadata? Is S3 throttling writes (S3's prefix-level rate limits)?
- Apply backpressure relief: scale Flink TaskManagers; tune commit batch size; rate-limit upstream Kafka if necessary.
- Catch up: Flink replays from its committed Kafka offset; no data loss as long as Kafka retention exceeds the lag.
Durability point. Kafka is the durability point for CDC. As long as Flink's offset is durable (Flink's exactly-once checkpoints — see 04_stream_processing.md §6), the data is recoverable.
8.4 Schema drift breaks downstream tables
Scenario. An upstream service adds a new field to the purchase_events event. The new field is a non-nullable INT (a service-team mistake). The CDC pipeline propagates the new column to Bronze. The dbt model that builds Silver doesn't know about the new column, but a SELECT * downstream of it breaks.
Failure mode. Pipeline writes start failing with type errors. Dashboards depending on the Silver table go to "0 rows" until the schema mismatch is resolved.
Recovery.
- Schema contract enforcement at the producer: the service should not be able to add a non-nullable field without a data-platform review.
- Iceberg's schema evolution: since Iceberg uses column IDs, the new column is added as nullable, defaults to NULL for old rows. Downstream models that don't reference it are unaffected. Only models that explicitly use the new field need updating.
- Schema-on-read engines (DuckDB, Trino with JSON columns): can defer schema resolution to query time. Trade structure rigor for flexibility.
- Investigation pattern: check metadata_log_entries table in Iceberg for the offending commit; identify the upstream pipeline that introduced the breaking change.
Durability point. Iceberg snapshots are immutable; rolling back to the pre-drift snapshot via time travel restores the dashboard while the team fixes the upstream contract.
8.5 File-listing slowness on S3
Scenario. A table has 50 partitions × 100,000 commits × ~10 small files per commit = ~50 M Parquet files. Query planning needs to know which files match the query's predicate. The naïve approach: S3 LIST against the partition prefix.
Failure mode. S3's LIST is paginated at 1000 keys per call; listing a partition with 1 M files takes 1000 sequential LIST calls × ~200 ms = 200 s of planning time before the query even starts. Iceberg's manifest tree avoids this for Iceberg-managed tables, but direct queries over Hive-style partitioned Parquet (or Iceberg tables polluted by external writers bypassing the manifest) can hit this.
Recovery. - Use a real table format (Iceberg/Delta) with metadata manifests; don't rely on S3 LIST. This is the single biggest reason the lakehouse moved from Hive to Iceberg. - Compaction. Background jobs merge small files into larger ones, reducing total file count by 100x. - Partitioning discipline. Don't over-partition (one partition per day, not per minute).
Durability point. Files in S3 are immutable; compaction is a write+swap operation, atomically committed via Iceberg's manifest update. If compaction fails mid-way, the new files become orphans (cleaned by garbage collection); the old files remain referenced by the current snapshot.
§9. Why Not Just Run Analytics on the OLTP Database
The naïve alternative: "we already have PostgreSQL; just run analytics on the production replica." Five interlocking reasons it fails:
- Locks and replica lag. A 30-minute analytical scan holds an MVCC snapshot for 30 minutes; PostgreSQL's autovacuum cannot reclaim dead row versions; tables bloat, write throughput collapses (see
01_databases.md §7.5). Reading from a replica avoids the lock cost but adds replica lag. - IOPS exhaustion.
SELECT * FROM ordersover 100 B rows reads ~30 TB of B+ tree pages. At 100 K IOPS, that's 50 minutes of saturated I/O. Same scan on Snowflake reads ~3 TB of compressed Parquet, zone-map skips 90%, scans ~300 GB at 5 GB/sec/node × 8 nodes = ~8 seconds. 375x faster. - Row-major layout is wrong for scans. A 200-column row store reading 3 columns still reads all 200 columns per row — 89% of I/O wasted. Columnar reads only the 3.
- No vectorization. Postgres processes rows one at a time; ClickHouse processes 1024-row batches through SIMD kernels. 10–50x CPU efficiency gap even before I/O.
- Concurrency death spiral. Ten concurrent analytical queries on the OLTP replica serialize on buffer pool, WAL writer, I/O scheduler. Buffer pool is sized for the OLTP hot working set; analytical scans evict it; OLTP latency p99 jumps 100x.
Concrete walkthrough. SELECT date_trunc('day', created_at), COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '90 days' GROUP BY 1 on a 100B-row PostgreSQL: 1 hour of degraded production, replica 30 minutes behind, p99 800 ms (was 5 ms). Same query on Snowflake with the data in Parquet micro-partitions: 5 seconds, zero impact on the OLTP source. The "warehouse" metaphor is deliberate — it's where you take data to do work that would damage the OLTP store.
§10. Scaling Axes
How OLAP scales, on two distinct axes.
Type 1: more data, more queries (uniform growth)
Dataset grows from 1 TB to 100 TB; QPS from 10 to 1000. Same workload shape.
Modern fix: decouple compute from storage. Storage scales effectively without limit on S3/GCS/ADLS (~$23/TB/month S3 Standard, 10x cheaper in IA/Glacier; a 100 PB lake is ~$2 M/month). Compute scales by adding capacity — Snowflake resizes warehouse from L to XL to XXL in seconds (no data migration); BigQuery increases slot reservations; Trino adds workers.
Inflection points. ~1 TB: DuckDB or small Snowflake suffice. ~100 TB: multi-node cluster; partition pruning and compaction critical. ~10 PB: lakehouse architecture essential; Iceberg manifest tree replaces Hive file listing. ~exabytes (Meta, Google, Netflix): custom catalog scaling, tiered storage (S3 hot / Glacier cold), federated query across multiple catalogs.
Cost story: storage scales linearly and is cheap; compute scales with query load and is expensive. A 100 PB lake with low query rate costs less than a 10 TB lake with high query rate.
Type 2: one query needs more compute (hotspot)
A complex 5-way JOIN over hundreds of TB needs more compute than one warehouse delivers in reasonable time.
Fix: distributed query execution. Every modern OLAP engine is MPP (Massively Parallel Processing). Scan parallelism by row group or partition; shuffle parallelism via hash redistribution; aggregation parallelism via partial-then-final.
Inflection points. Adding workers gives diminishing returns past the shuffle width (a query with 100 shuffle partitions uses ~100 workers efficiently; more helps marginally). Network bandwidth ceilings shuffle-heavy queries (10 TB shuffle across 100 workers at 10 Gbps = 8 seconds of network alone). Fix: pre-bucketed tables eliminate shuffles, or change join order to reduce intermediate size. For massive queries: BigQuery dynamically allocates thousands of slots; Trino at Meta provisions hundreds of elastic worker nodes per query.
Type 1 + Type 2 interaction. If both axes grow, workload isolation (§7.5) becomes critical — each heavy query gets its own warehouse; lakehouse storage serves all warehouses transparently.
§11. Decision Matrix
When to pick which OLAP variant. Use cases come from across domains.
| If you need... | Pick | Why |
|---|---|---|
| Managed BI platform, hands-off ops, mid-large enterprise | Snowflake | Best-in-class managed experience, mature ecosystem, decoupled compute/storage |
| Serverless, no-ops, deep Google integration | BigQuery | True serverless; no clusters to manage; bytes-scanned pricing aligns with usage |
| Customer-facing real-time analytics, <100 ms p99 SLAs | Pinot | Inverted indexes + bitmap indexes; designed for the latency target |
| Customer-facing real-time, time-series-heavy | Druid | Similar to Pinot; mature ecosystem at Netflix, Lyft |
| Self-hosted high-performance OLAP, log/event analytics, cost-sensitive | ClickHouse | Fastest open-source engine on CPU efficiency; cheap at scale if you can operate it |
| Open lakehouse, mixed batch + streaming + ML, vendor-neutral | Trino + Iceberg | Open file format, multiple engines, federation, no vendor lock-in |
| Databricks ecosystem, Spark-native, ML-heavy | Databricks SQL | Photon + Delta integrate tightly; one platform for data + ML |
| Local analytics, notebooks, embedded | DuckDB | Single binary; reads Parquet/Iceberg from S3; fast on TB-scale single-node |
| Cheap, simple, batch ETL with Spark | Spark SQL | Batch-friendly fault tolerance; first-class lakehouse support |
| MySQL-compatible OLAP for ByteDance-style internet workloads | StarRocks / Doris | MySQL protocol; real-time materialized views |
Thresholds.
- <1 TB total data: DuckDB or a single ClickHouse node is enough. Don't pay Snowflake.
- 1–100 TB: managed warehouse (Snowflake, BigQuery, Databricks) is the default unless you specifically need real-time (<1s) or low ops cost.
- 100 TB–10 PB: lakehouse (Iceberg + Trino + Spark) becomes attractive for vendor flexibility; managed warehouses are still strong choices.
-
10+ PB: open lakehouse is the modern default at hyperscale (Netflix, Apple, LinkedIn).
-
Query latency SLA p99 <100 ms with high QPS: Pinot/Druid. Snowflake/BigQuery can't hit this.
- Query latency SLA p99 ~1 s with moderate QPS: ClickHouse for self-hosted; Snowflake/BigQuery for managed.
-
Query latency SLA seconds-to-minutes: anything works; pick on cost and ops.
-
Concurrency >1000 concurrent users: Snowflake virtual warehouses, BigQuery slots, or per-team isolated clusters. Don't share one warehouse.
-
Concurrency <100: any single cluster is fine.
-
Hands-on ops team: ClickHouse, StarRocks, self-hosted Trino are great.
- No ops team: Snowflake or BigQuery only.
§12. Ingestion Patterns
Five patterns for loading data into OLAP. Each is right for a specific freshness / volume / cost point.
Batch ETL (Airflow → Spark → warehouse)
The classic pattern, still dominant for "yesterday's data is fine." Airflow schedules a DAG (Directed Acyclic Graph) at daily/hourly cadence; each task is a Spark job reading source data, transforming it, and writing to the warehouse (Parquet to S3 / COPY INTO Snowflake / bq load). Freshness: hours-to-daily. Cost: cheap (~$0.50–$2/TB). Failure handling: idempotent reruns. Common at banks, retailers, traditional enterprises for daily/weekly BI.
CDC streaming (Debezium → Kafka → Flink → Iceberg)
The modern default for sub-minute freshness without paying for true real-time. Debezium (or LinkedIn's Brooklin, Maxwell, native CDC) reads the OLTP source's logical replication / binlog and emits per-row change events to Kafka. Flink consumes, transforms, and writes to Iceberg/Delta/Hudi at 30 s–5 min commit cadence. Freshness: sub-minute. Cost: 5–10x batch ETL. Exactly-once via Flink's two-phase commit + Iceberg atomic snapshot (see 04_stream_processing.md §6). Common at e-commerce (real-time inventory), SaaS (usage dashboards), fintech (intra-day reconciliation).
Real-time ingest (Kafka → Pinot/Druid directly)
For sub-second freshness on real-time-OLAP engines. Kafka producers publish events; Pinot/Druid ingest directly and events become queryable within seconds. Freshness: sub-second. Cost: highest (~10–50x batch) — engines maintain per-event memory + indexes. Storage tiered: real-time segments in memory + local disk; historical on S3. Common at customer-facing real-time dashboards (Uber, LinkedIn, Stripe).
Bulk COPY / LOAD from S3
For periodic large loads from partners, vendors, external pipelines. Data lands as Parquet/CSV in S3; a COPY INTO / bq load / INSERT INTO ... SELECT FROM staged imports. Freshness tied to upstream cadence. Cheap and idempotent. Common for data sharing between organizations and on-prem → cloud migrations.
Reverse ETL (warehouse → operational systems)
The opposite direction: take aggregated analytics back to operational systems. dbt materializes aggregated tables (customer_360, lead_score, churn_risk); Hightouch / Census sync them to Salesforce, Marketo, Zendesk, internal app databases. Analytics produces insights → reverse-ETL pushes them to product features (personalization, marketing). Freshness usually hourly. Common in "modern data stack" deployments running data-driven product loops.
§13. Lakehouse Table Formats Deep Dive
The three contenders for "table format on top of object storage" — Iceberg, Delta Lake, Hudi.
Apache Iceberg
Origin. Netflix, 2017. Apache top-level 2020. Built to fix Hive's atomic-rename and metadata-scaling problems at PB scale.
Architecture. Manifest tree (see §4.6): metadata.json → manifest list → manifest files → data files. Each commit is an atomic pointer swap on the latest metadata.json (or catalog row).
Strengths. Vendor-neutral (engine support across Trino, Spark, Flink, Snowflake, BigQuery, Athena, ClickHouse, DuckDB, Databricks); hidden partitioning (partition function in metadata; users don't write partition predicates explicitly); schema evolution by column ID (safe adds, renames, drops); snapshot isolation and time travel; branching and tagging (Iceberg 1.0+); manifest scales to exabytes (tested at Netflix and Apple).
Weaknesses. Catalog can have brief consistency issues during commits (Hive Metastore, AWS Glue); modern catalogs (Nessie, Iceberg REST, Unity) are stronger. Compaction is the user's responsibility — periodic OPTIMIZE jobs, no built-in always-on compactor.
Dominant in 2026. Used by Netflix, Apple, LinkedIn, Tencent, Adobe, AWS, Stripe.
Delta Lake
Origin. Databricks, 2019. Linux Foundation 2020.
Architecture. Per-table _delta_log/ directory containing ordered JSON files (one per commit) describing add/remove operations on data files, and periodic Parquet "checkpoint" files that snapshot the full table state.
Strengths. Tight Databricks integration (Photon, Auto Loader, Live Tables, Delta Sharing — first-class); Z-ordering (multi-dimensional clustering for better pruning on multi-column predicates than partition pruning alone); Change Data Feed (built-in CDC output via table_changes()); DELETE and MERGE with deletion vectors (bitmap-based soft deletes, Delta 3+).
Weaknesses. Databricks-centric ecosystem; portability less mature than Iceberg. _delta_log can grow large on heavily-updated tables; checkpoint frequency tuning matters.
Strong choice when you're already on Databricks, or when Delta-specific features are first-class.
Apache Hudi
Origin. Uber, 2016. Built for upsert-heavy workloads — ride state, driver position changing constantly.
Architecture. Per-table .hoodie/ directory with timeline files. Two table types: Copy-on-Write (CoW) rewrites affected Parquet files (read-optimized, write-expensive); Merge-on-Read (MoR) appends to a row-level delta log in Avro, reads merge with the base Parquet on the fly (write-cheap, read pays merge cost).
Strengths. Best-in-class upsert performance; incremental queries ("all rows changed since timestamp T") efficient; async compaction.
Weaknesses. Smaller ecosystem; complex operational model (two table types + compaction + cleaner + archival).
Used at: Uber (origin), TikTok, large CDC-driven lakehouses where upsert volume dominates.
The format wars (2024–2026 status)
- 2017–2020: Hive dominant; Iceberg/Delta/Hudi emerged in parallel.
- 2021–2023: Three-way race; each format added features (time travel, schema evolution, deletion vectors). Engine support fragmented.
- 2024–2026: Convergence. Snowflake added native Iceberg read/write (2023). Databricks added Iceberg interop via UniForm — Delta files with Iceberg-compatible metadata (2023). BigQuery added native Iceberg support (2024). Walls came down.
Strategic shift: the storage layer is becoming a commodity. Iceberg is the dominant interop format; vendors compete on compute engines and managed experiences, not lock-in via storage. A 2026 lakehouse commonly has one copy of Iceberg data on S3, read by Snowflake for BI, Databricks for ML, Trino for ad-hoc, Flink for streaming write, and DuckDB for laptop exploration.
§14. Query Engines Deep Dive
Trino / Presto
Origin. Facebook, 2012. Open-sourced 2013. Split into two forks in 2018: PrestoSQL (now Trino) led by the original founders + community; PrestoDB continues at Meta. Today Trino is the dominant community fork.
Architecture. Coordinator + workers. Queries decomposed into stages (scan, filter, partial-aggregate, shuffle, final-aggregate, sort). Workers stream tuples between stages via in-memory pipelines — fast, but no fault tolerance per query.
Strengths. Federation (~50+ connectors: query Postgres + Iceberg + Kafka + Cassandra in one SQL); fast on interactive queries (sub-second small, seconds large); battle-tested at Meta, Netflix, LinkedIn, Lyft, Uber, Pinterest, Airbnb.
Weaknesses. No mid-query fault tolerance by default (until recent fault-tolerant-execution mode); one worker dies during a long query → whole query restarts. All-in-memory shuffle by default; memory pressure is a tuning lever.
Capacity. Meta runs Presto at exabyte scale; LinkedIn runs Trino at PB scale on hundreds of nodes per cluster.
Spark SQL
Origin. Databricks (Berkeley AMPLab), 2014.
Architecture. RDD (Resilient Distributed Dataset) / DataFrame model on top of Spark's executor framework. Query plans become DAGs of stages; failures restart only the failed partition.
Strengths. Fault tolerance via RDD lineage (critical for long-running batch); general-purpose (SQL, DataFrames, ML, streaming in one engine); lakehouse-native (first-class Iceberg, Delta, Hudi); Adaptive Query Execution (runtime re-planning based on observed stage sizes).
Weaknesses. Higher latency than Trino on short interactive queries (JVM/task overhead); heavier ops footprint.
Capacity. Netflix runs Spark for ETL at exabyte scale; Databricks runs Photon (vectorized Spark SQL replacement) for SQL workloads as fast as Trino.
DuckDB
Origin. CWI Amsterdam, 2018. Same lab as MonetDB / Vectorwise.
Architecture. Single-process, embedded library. Vectorized engine inspired by MonetDB/X100. Reads Parquet, CSV, Arrow, JSON, Iceberg directly from S3.
Strengths. Zero setup (pip install duckdb); fast (TPC-H Q1 on 1 B rows in 2 seconds on a MacBook M3, ~6 GB/sec scan rate); no cluster (just a library); polyglot bindings.
Weaknesses. Single-node ceiling (~10 TB on big machines); single-user concurrency.
Growing role. Mother Duck (cloud DuckDB), serverless analytics, dbt local development. DuckDB is becoming the "first tier" of analytics — analysts pull a TB of Parquet to a laptop for exploration, then push to Snowflake/BigQuery for production.
Snowflake's engine
Architecture. Proprietary. Decoupled compute (virtual warehouses) from storage (S3 micro-partitions). Per-warehouse local SSD cache. Custom vectorized engine in C++. FoundationDB-backed metadata.
Strengths. Zero-copy clones (cloning a 100 TB database is a metadata pointer copy); Time Travel + Fail-safe (90 days of snapshot recovery built in); multi-cluster warehouses (one warehouse can have N clusters for concurrency); data sharing across accounts without copy; built-in MERGE/UPDATE/DELETE.
Weaknesses. Closed format historically (Iceberg interop now reduces this); per-credit pricing can be expensive at high concurrency.
BigQuery
Origin. Google, 2010 (public). Built on Dremel (paper 2010). Storage on Capacitor (Google's columnar format) on Colossus (Google's distributed file system).
Architecture. Serverless. Queries enter a global slot scheduler; slots (CPU units) are allocated from a shared pool or your reserved capacity. Storage and compute decoupled at the most extreme level — even within a query, slots and storage are completely independent.
Strengths. True serverless (no clusters, no nodes); massive scale per query (PB scanned in seconds); strong GCP integration; BigQuery ML lets you train models in SQL.
Weaknesses. Capacitor is Google-only (mitigated by recent Iceberg support); slot model opaque, cost prediction tricky; cold-start on slot allocation hurts sub-second targets.
Capacity. Likely the world's largest OLAP system. Queries scanning multiple PB; total data under management in the exabytes.
§15. Use Case Gallery
Seven different products / domains, same family of technology applied differently.
1. Business Intelligence dashboards (Tableau, Looker on Snowflake/BigQuery)
A retailer's CFO opens a Tableau dashboard showing weekly revenue by region. Tableau issues 5–10 SQL queries to Snowflake; each scans tens to hundreds of GB of pre-aggregated Gold tables. Latency: 2–10 seconds per query. Data flow: OLTP → CDC → Kafka → Flink → Iceberg Bronze → Spark → Silver → dbt → Gold; Snowflake reads Gold via Iceberg interop. Pattern: aggregated tables (Gold tier) served by managed warehouse to interactive dashboards.
2. Real-time analytics dashboards (Uber order map, LinkedIn feed analytics)
Uber's operations dashboard shows live ride supply/demand by city, updated every few seconds. Ride state changes flow through Kafka into Pinot, which serves aggregation queries with <100 ms latency. LinkedIn's "Who's Viewed Your Profile" tab is built on Pinot at similar scale — hundreds of thousands of QPS, p99 <50 ms. Pattern: streaming ingest into real-time-OLAP engine for sub-second freshness + sub-100ms query latency.
3. ML feature pipelines (warehouse → feature store → online serving)
A recommender system. Offline features (user_watch_time_30d, user_genre_affinity_vector) computed daily in Spark from the warehouse. Results land in a feature store: offline (Iceberg) for training-time joins, online (Redis or DynamoDB) for serving-time lookups. Pattern: warehouse as offline feature compute layer; specialized stores for online serving.
4. Customer-facing analytics (Stripe Sigma)
Stripe Sigma lets merchants run SQL on their own payment data. The engine (Trino-derivative at Stripe) enforces row-level security per merchant_id. Payments flow Stripe Postgres → CDC → warehouse → per-merchant projections. Latency: seconds-to-minutes per query. Pattern: multi-tenant OLAP with row-level filtering; merchants are the end users.
5. Cohort and funnel analysis (Amplitude, Mixpanel)
Product analytics tools let PMs build cohorts ("users who signed up in March AND activated AND purchased") and funnels. Behind both Amplitude and Mixpanel: ClickHouse clusters (publicly disclosed). Events ingested in real-time; cohort/funnel queries are SQL-with-time-window-semantics over event streams. Per-column compression on event_type/user_id gives 20–50x storage savings. Pattern: event-stream analytics on ClickHouse with custom DSL.
6. Log analytics (Honeycomb, Datadog, ClickHouse for logs)
Logs are the highest-volume analytical workload: tens of TB/day of structured events. Honeycomb built its own columnar engine in Go with sparse indexes for high-cardinality event attributes. Datadog moved from Elasticsearch to a custom columnar store. ClickHouse-for-logs at Cloudflare, Uber, ByteDance. Pattern: columnar storage with attribute-level indexes optimized for "find events matching predicate X over the last hour." See 10_observability.md §5.
7. Marketing attribution
A marketer wants the path of touchpoints across channels for today's conversions. Requires joining events (clicks, opens, impressions — billions of rows) against conversions (millions) over user_id and time, then applying time-decay/Markov-chain attribution math. Too heavy for real-time; freshness budget is "morning report on yesterday." Pattern: heavy batch ETL in Spark materializes a customer_journey table; BI dashboards query it on Snowflake/BigQuery.
§16. Real-World Implementations
Named systems shipping OLAP at scale, across diverse use cases.
Snowflake — the largest data warehouse business. ~10,000 enterprise customers as of public 2024 disclosures; ~$3 B+ annual revenue. Customers span finance (Capital One), healthcare (CVS), retail (Walmart). Single-customer deployments routinely reach hundreds of TB to several PB; the largest approach exabyte-scale. Snowflake's architecture (decoupled compute + S3-backed micro-partitions + FoundationDB metadata) defined the modern cloud warehouse playbook.
Google BigQuery — multi-exabyte serverless. Google's internal + customer data totals tens of exabytes (estimated). Single queries scan PB in seconds. Used by Spotify, Twitter (pre-X), Carrefour, and most major Google services. The Dremel paper (2010) is foundational to modern OLAP.
Netflix Iceberg + Trino — exabyte-scale lakehouse. ~hundreds of PB on S3 in Iceberg format (low exabytes as of 2024–2026). Trino is the interactive query layer (~thousands of nodes globally); Spark for batch ETL; Flink for streaming. Iceberg was created at Netflix in 2017 to fix Hive metastore scaling. ~500 PB disclosed in 2023.
Meta Presto — exabyte-scale warehouse. Meta runs Presto on hundreds of thousands of CPUs. Public disclosures: >300 PB warehouse, >300K queries/day; Presto is the dominant interactive query layer inside Facebook for analytics.
LinkedIn Pinot — real-time OLAP origin. Pinot was born at LinkedIn (2014) for "Who's Viewed Your Profile": hundreds of thousands of QPS, p99 <50 ms, on PB-class data. Today LinkedIn runs Pinot for member-facing analytics (profile views, post analytics, recruiter dashboards) — hundreds of Pinot clusters internally.
Uber Pinot + Druid. ~600 K QPS at peak for marketplace and rider/driver dashboards on Pinot; Druid for time-series (M3 metrics platform). Both serve customer-facing real-time analytics.
Pinterest Druid — ad analytics. Advertisers see real-time impressions, clicks, conversions on campaigns. Druid's bitmap-index-per-column fits ad-attribute filtering well; tens of TB of fresh data, sub-second queries.
ByteDance ClickHouse — internet-scale event analytics. Publicly disclosed: one of the world's largest ClickHouse deployments — tens of thousands of CPUs, hundreds of PB, internal analytics across TikTok and Douyin.
Cloudflare ClickHouse — log analytics. Hundreds of TB/day of HTTP request log ingest; trillions of historical requests for security/performance investigation. Also backs DNS Analytics and Workers Analytics Engine customer products.
Plausible Analytics ClickHouse. Full product (tens of thousands of customer dashboards, billions of page views) on a single ClickHouse cluster. Small team, big scale, low cost — ClickHouse's "operate well, save 10x vs Snowflake" thesis.
Databricks Delta + Photon. ~10,000+ enterprise customers. Single customers run multi-PB Delta lakehouses. Photon is Databricks' vectorized C++ engine, replacing JVM-based Spark SQL.
Stripe Sigma. Merchants run SQL on their own payment data. Backed by an internal Trino-derived engine over Parquet on S3, with row-level security per merchant.
The numbers — from a single ClickHouse node at Plausible to multi-exabyte BigQuery at Google — span more than six orders of magnitude on the same family of technology (columnar, compressed, vectorized, snapshot-atomic on bulk writes). What varies is the operational layer (managed vs self-hosted), the freshness budget (sub-second vs hourly), and the integration (federated vs warehouse-native). The byte-level mechanics are the same.
§17. Summary
Columnar OLAP is what you get when you keep relational SQL's interface but discard everything in row-store machinery that exists to support per-row transactions, then bend the storage engine around bulk scans: per-column files for I/O reduction, per-column codec stacks for 5–20x compression, zone maps for predicate-driven skipping, sparse granule indexes for scan-not-point, vectorized SIMD execution for cache- and pipeline-friendly aggregation, and snapshot-atomic-bulk-write semantics in place of OLTP's row-level ACID. The result is an engine that answers
SUM ... GROUP BYover a trillion rows in seconds — and is correspondingly terrible at returning one row by primary key in five milliseconds. Modern variants split across cloud warehouses (Snowflake, BigQuery — managed and decoupled), self-hosted engines (ClickHouse, StarRocks — operate-yourself and fast), real-time OLAP (Pinot, Druid — sub-100ms on streaming-ingested data), and the lakehouse pattern (Iceberg + Trino + Spark on object storage — open file format, multi-engine, vendor-neutral). They converge on the same byte-level story: columnar Parquet on S3, ZSTD-compressed, organized into manifest trees that prune queries at four levels (partition → file → row-group → page), executed by vectorized engines processing 1024 values at a time on SIMD lanes. Pick by the freshness SLA, the query latency budget, the ops envelope, and where on the cost-vs-control axis the team sits — and remember that OLAP is always a derived view of the truth that lives in OLTP; get the source right, and any analytical store can be rebuilt from its change log.