Search

Component class: search engines built on an inverted index over text and structured fields. Implementations span Apache Lucene (the library) and its derivatives (Elasticsearch, OpenSearch, Solr), alternative engines (Vespa, Tantivy, Manticore), and purpose-built systems (Zoekt / Sourcegraph and GitHub Blackbird for code, Grafana Loki / Quickwit for logs, Twitter Earlybird for tweets, LinkedIn Galene for profiles). The byte-level mechanics in §4 apply to all Lucene-derived engines and broadly to the segment pattern Vespa and Tantivy follow.


§1. What a search system IS

A search system takes a corpus of documents and a free-form query and returns a ranked list of matching documents with low latency. The defining structure is the inverted index: a mapping from term → list of documents containing that term, augmented with per-doc, per-field metadata sufficient to compute a relevance score and to apply filters, facets, and sorting.
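As a minimal sketch of the term → documents mapping described above, the following builds an inverted index over a three-document toy corpus. The function names and the lowercase-plus-whitespace tokenizer are illustrative assumptions; real analyzers do considerably more.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase whitespace token to a sorted list of doc IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search_and(index, query):
    """Return the doc IDs that contain every query term (a conjunction)."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical toy corpus.
docs = {
    1: "Stainless steel bottle",
    2: "Glass bottle with steel cap",
    3: "Stainless pan",
}
index = build_inverted_index(docs)
print(index["bottle"])                        # [1, 2]
print(search_and(index, "stainless bottle"))  # [1]
```

Lookup cost depends on the posting lists touched, not on the corpus size, which is the property the rest of this document builds on.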

A search system is a derived store, never the source of truth. It sits downstream of a primary database via a change-capture or batch pipeline, and exists to answer queries the primary cannot answer efficiently — fuzzy text matching, ranked retrieval, faceting at scale, geo-distance, hybrid keyword + vector retrieval.

Distinguish from adjacent categories

  • SQL LIKE '%term%' does a sequential scan; a B+ tree can't use an index on a leading wildcard. Inviable past ~1M rows. Postgres tsvector + GIN (Generalized Inverted Index) is a real inverted index but tops out around ~10M docs without sharding, lacks BM25 (Best Match 25), no native faceting at search-engine latencies.
  • Document DB queries (MongoDB, Couchbase) excel at fetching by ID; bundled text indexes are simpler postings bolted onto the primary. Not optimized for scoring fanout or near-real-time refresh.
  • Vector search / ANN (Approximate Nearest Neighbor) engines (Pinecone, Milvus, pgvector) answer "documents semantically close to this embedding" via HNSW (Hierarchical Navigable Small World) graphs or IVF (Inverted File) cluster indexes. They do not answer "documents containing this term." Modern Lucene-derived engines and Vespa fuse both, but vector-only doesn't replace keyword.
  • OLAP (Online Analytical Processing) engines (Druid, Pinot, ClickHouse) overlap in faceting and aggregation but are tuned for analytical scans, not ranked retrieval over a sparse term space.

What a search system is NOT good for

  • Point lookups by primary key — 5–10× slower than a KV (Key-Value) store; a search query fans out to every shard.
  • Source of truth — segments corrupt; schemas evolve forward-only; cluster loss must be recoverable.
  • Strong consistency — refresh is near-real-time (1s default); read-your-writes is not free.
  • Transactional updates — no multi-document atomicity.
  • Sub-millisecond p99 — segment-search with multi-shard fanout can't hit <1ms.
  • Sustained >100k writes/sec/shard — merge cost dominates; specialized engines win (Pinot upsert, Prometheus).

§2. Inherent guarantees — the contract by design

Provided:

  • Term-based retrieval at scale: O(posting list length), not O(N).
  • Ranking: BM25 length-normalized TF (Term Frequency) scoring built in; pluggable via function score, script score, or learned-to-rank rerank.
  • Near-real-time visibility: doc indexed at t=0 is searchable at t=1s (configurable). Bounded, not real-time.
  • Faceting / aggregation on indexed fields via doc values (columnar per-field storage). Sub-100ms over millions of matches.
  • Horizontal scalability through sharding; read availability through replication.

Not provided — must be layered on:

  • Durability as a primary store. Translog (Write-Ahead Log) makes individual write recovery work, but the canonical record must live elsewhere. Pattern is always: primary → CDC (Change Data Capture) → search engine.
  • Strong consistency. Replicas may lag by tens of ms to seconds.
  • Schema flexibility in place. Mappings are forward-only. Changing an analyzer or field type means building a new index and swapping an alias.
  • Idempotent writes by default. Must encode via external versioning (version_type=external, with version set to a monotonic source-of-truth sequence number) or accept that retries can clobber newer writes.
  • Multi-doc atomicity. No transactions.
  • Privacy / authorization filtering. Engine returns any matching doc; per-user authorization is the caller's problem.

Mental model: a search engine is a scalable, replicated, near-real-time, ranked-retrieval materialization layer. Everything outside that contract is your problem.


§3. The design space

| Engine | Storage core | Primary use cases | Native vectors | Operational tradeoffs |
| --- | --- | --- | --- | --- |
| Apache Lucene (library) | Per-segment FST + PFOR-Delta postings + doc values + stored fields | Embedded in JVM apps; foundation for ES/Solr | k-NN via HNSW | You write the distribution layer |
| Elasticsearch / OpenSearch | Lucene per shard | General-purpose: logs, products, profiles, observability | HNSW built-in | Largest ecosystem; ES is Elastic License, OpenSearch is Apache 2.0 fork |
| Apache Solr | Lucene per core | Enterprise / catalog / bibliographic search | KNN (newer) | ZooKeeper-coordinated; rock-solid in catalog tier |
| Vespa (Yahoo, standalone) | Custom tensors + posting lists | E-commerce ranking, ad serving, ML-heavy retrieval, RAG (Retrieval-Augmented Generation) | First-class | Single engine for retrieval + ranking + inference; smaller community |
| Tantivy (Rust) | Lucene-inspired, Rust-native | Embedded search in Rust; small/mid-scale | Yes | Library, not a distributed service; basis for Quickwit |
| Quickwit (Rust, Tantivy core) | Tantivy segments on object storage | Log search on cheap S3 (Simple Storage Service); compute/storage separation | Limited | Built for high ingest + cheap retention |
| Zoekt (Google, Sourcegraph) | Trigram inverted index | Code search (substring on source) | No | Trigram tokenization, not word-level |
| GitHub Blackbird | Custom n-gram inverted index | GitHub global code search | No | Workload-specific; not open source |
| Grafana Loki | Inverted index on labels only; logs as compressed chunks | Cheap log aggregation | No | Brilliant if you only filter by label and grep |
| Algolia | Custom in-memory engine | Sub-100ms site search up to ~100M records | Yes (newer) | Hosted only; cost scales with RAM |

How to read this: Lucene-derived engines (ES/Solr/OpenSearch) dominate the general-purpose tier — heterogeneous documents, varying queries, facets and ranking; reach for them first. Vespa earns its place when ranking is the product (e-commerce, ads, RAG retrieval-with-rerank). Specialized engines (Zoekt, Blackbird, Loki, Earlybird) exist where tokenization shape doesn't match Lucene's word-level index — code is not English, logs are not articles. Tantivy / Quickwit / Manticore are smaller-footprint alternatives. Default pick: OpenSearch (license clarity post-2021) or Elasticsearch; specialized only when the workload demands it.


§4. Underlying data structure — Lucene at the byte level

The Lucene segment format is the single most important thing to understand about modern search engines. Vespa and Tantivy diverge in specifics but use the same conceptual primitives.

4.1 A shard is one Lucene index

An ES/OpenSearch shard is literally a Lucene index on disk. The engine is a distributed wrapper around N Lucene indexes. A Lucene index is a directory of segment files, each segment an immutable mini-index produced by a single flush.

/data/nodes/0/indices/products_v3/0/index/
├── _0.cfs          ← segment 0, compound file (all sub-files packed)
├── _0.cfe          ← segment 0 entry table
├── _0.si           ← segment 0 segment info
├── _3.fdt          ← stored fields
├── _3.fdx          ← stored fields index
├── _3.tim          ← term dictionary (FST tail)
├── _3.tip          ← term index (FST root in memory)
├── _3.doc          ← posting list
├── _3.pos          ← positions
├── _3.dvd          ← doc values data
├── _3.dvm          ← doc values metadata
├── _3.nvd          ← norms data
├── _3.fnm          ← field info
├── segments_42     ← commit point: lists live segments
└── write.lock

The defining invariant: every segment is immutable once written. Updates and deletes are tombstones layered on top of segments; actual reclamation happens at merge time. The merge process is to a search engine what compaction is to an LSM (Log-Structured Merge) tree — and the analogy is exact, because Lucene's segment model is structurally an LSM tree of inverted indexes.

4.2 Inside one segment

(a) Term dictionary — FST (Finite State Transducer)

The term dictionary maps terms → posting list offsets. Lucene encodes it as an FST — a DAG (Directed Acyclic Graph) that compresses shared prefixes and suffixes simultaneously, conceptually a minimized trie.

terms: "engineer", "engineers", "engineering", "england", "english"

  start ─e─►─n─►─g─┬─i─►─n─►─e─►─e─►─r─►●(eng-pointer)─┐
                   │                    └─s─►●         │
                   └─l─►─a─►─n─►─d─►●               (suffix shared)
                                  └─i─►─s─►─h─►●

Why FST and not a plain B-tree:

  • The index part lives in memory (.tip). The tail (.tim) is on disk, accessed at most once per term lookup.
  • For a 1B-doc index with ~10M unique terms per shard, the FST in memory is ~100 MB, not 10 GB. 50–100× compression vs a flat sorted term list.
  • Lookup is O(term_length) byte comparisons; doesn't depend on the number of terms.
  • Prefix walks (engin*) are a single FST traversal that emits all matching terms in lex order — cheap autocomplete.

Trade-off: an FST is built once from sorted input and is immutable afterwards. Can't insert a term into an existing FST. Fine because segments are immutable: adding a doc builds a new segment with a new FST.

(b) Posting list — frame-of-reference + PFOR compression

For each term, the posting list is the sequence of (doc_id, term_freq, positions...) for every doc containing that term. Lucene encodes it with stacked compressions:

  1. Delta encoding on doc_id: store [12, 47, 89, 142] as [12, 35, 42, 53]. Differences are smaller than absolutes.
  2. PFOR-Delta (Patched Frame of Reference) on deltas: pick a bit width that fits most deltas (e.g., 8 bits), bit-pack 128 deltas into a fixed-width block. Outliers ("patches") get a side-channel exception list. Result: ~3 bits/doc for typical lists vs 32 bits/doc uncompressed — 10× compression.
  3. Skip pointers interleaved every 128 docs: (doc_id, file_offset) entries that let you jump ahead without decoding intervening blocks. Critical for conjunctions: for q = "stainless AND bottle", if stainless's next candidate is doc 50000, bottle's iterator skips straight past the blocks covering docs 1000, 5000, and so on without decoding them. Intersection drops from O(N+M) toward O(min(N,M) × log(max)/skip_interval).

Posting list for term "engineer":
block 0 (docs 0–127):    [bit-packed 128 deltas, 8 bits each = 128 bytes]
                          skip → (doc_id=145, offset=byte_500)
block 1 (docs 128–~300): [bit-packed 128 deltas]
                          skip → (doc_id=380, offset=byte_900)
...
Iterator asks "advance past doc 50000" — skip-list jumps straight to
the right block without decoding 50 prior blocks.

The companion .pos file holds positions for phrase queries and proximity scoring; same delta + bit-pack treatment.
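A minimal sketch of the first two compression layers, using the doc IDs from the delta-encoding example above. It implements plain frame-of-reference packing (one fixed bit width per block) and deliberately omits the patch/exception list that PFOR adds; all function names are illustrative, and a Python integer stands in for the packed byte block.

```python
def delta_encode(doc_ids):
    """Keep the first doc ID as-is, then store gaps between sorted neighbours."""
    deltas, prev = [], 0
    for doc_id in doc_ids:
        deltas.append(doc_id - prev)
        prev = doc_id
    return deltas

def delta_decode(deltas):
    """Rebuild absolute doc IDs with a running sum."""
    doc_ids, total = [], 0
    for gap in deltas:
        total += gap
        doc_ids.append(total)
    return doc_ids

def block_bit_width(values):
    """Smallest fixed width, in bits, that fits every value in the block."""
    return max(1, max(values).bit_length())

def pack(values, width):
    """Bit-pack values into one integer, `width` bits per value."""
    word = 0
    for i, value in enumerate(values):
        word |= value << (i * width)
    return word

def unpack(word, width, count):
    """Inverse of pack: extract `count` values of `width` bits each."""
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(count)]

doc_ids = [12, 47, 89, 142]
deltas = delta_encode(doc_ids)             # [12, 35, 42, 53]
width = block_bit_width(deltas)            # 6 bits per delta
print(deltas, width, block_bit_width(doc_ids))  # absolute IDs would need 8 bits
packed = pack(deltas, width)
assert delta_decode(unpack(packed, width, len(deltas))) == doc_ids
```

Even on four postings the gaps need 6 bits each where the absolute IDs need 8; on long, dense posting lists the gap between the two widths is what produces the compression ratios quoted above.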

(c) Doc values — columnar per-field storage

The inverted index is term → docs. For sorting, faceting, aggregating, and scoring with field values, you also need doc → field value. That's doc_values: column-oriented arrays, one entry per doc per field.

docId:  0    1    2    3    4    5    ...
price: [42, 19, 99, 37, 19, 80, ...]   ← packed column
       (8 bits if max < 256, 16 otherwise — minimum bit width)

Stored in .dvd / .dvm, mmap'd. This is what makes aggregations work — without doc values, every facet count would have to re-tokenize from the inverted index. The same column-oriented layout is why Druid/Pinot do interactive OLAP.
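To illustrate why a doc → value column makes faceting and sorting cheap, here is a small sketch in which each field is a plain list indexed by docId. The brand column and the hit list are hypothetical; real doc values are bit-packed and memory-mapped rather than Python lists.

```python
from collections import Counter

# Hypothetical columns: position = docId, value = that doc's field value.
price = [42, 19, 99, 37, 19, 80]
brand = ["acme", "bolt", "acme", "acme", "bolt", "cask"]

def facet_counts(column, matching_doc_ids):
    """Count field values over the matching docs with one array read per doc."""
    return Counter(column[doc_id] for doc_id in matching_doc_ids)

def sort_by_column(column, matching_doc_ids, descending=False):
    """Order matching docs by a field value without touching stored fields."""
    return sorted(matching_doc_ids, key=lambda d: column[d], reverse=descending)

hits = [0, 2, 3, 4]  # doc IDs produced by the inverted-index phase
print(facet_counts(brand, hits))   # acme: 3, bolt: 1
print(sort_by_column(price, hits)) # [4, 3, 0, 2]
```

The inverted index answers "which docs match"; the columns answer "what is this doc's value", and neither step re-reads or re-tokenizes the original document.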

Trade-off vs FieldData (the legacy in-heap version): doc values are on-disk and mmap'd, so they don't OOM (Out Of Memory) the JVM. First read is a disk access; the OS page cache serves subsequent reads.

(d) Stored fields — original document blob

_source (the original JSON) lives in .fdt / .fdx, LZ4-compressed in 16 KB blocks. Fetched only at phase 2 of a search (the final top-K). Disabling _source for non-display fields saves 30–50% of disk and speeds bulk reindex dramatically. Log search keeps _source (the doc is the point); product catalog may drop it and re-fetch from primary.

(e) Norms — per-field length normalization

.nvd stores one byte per doc per field encoding field length (lossily compressed). BM25 uses it for length normalization. Loaded in memory.

4.3 The write path — one doc end to end

Whether the doc is a Booking.com hotel, an Earlybird tweet, or a Galene profile, the mechanics are the same.

Step 1: Document arrives at the shard's primary.
        Doc = {id: 12345, title: "Stainless steel bottle", ...}

Step 2: Append to translog (per-shard WAL).
        ┌─────────────────────────────────────────────────────────┐
        │ translog-{generation}.tlog                              │
        │ [op=INDEX, seq_no=12834, doc_id=12345, payload=...]     │
        └─────────────────────────────────────────────────────────┘
        Write to OS buffer. If translog.durability=request (default),
        fsync at end of bulk request. ← DURABILITY POINT.

Step 3: Apply to in-memory indexing buffer.
        - Tokenize text fields ("stainless", "steel", "bottle").
        - Append to per-thread DocumentsWriter buffer.
        - RAM only. NOT searchable yet.

Step 4: Replicate to in-sync replicas.
        Primary forwards op to each replica's translog.
        Primary waits for replicas to ack their translog write.
        ← Replicas are durable for the SAME ack.

Step 5: Refresh (every refresh_interval, default 1s).
        - Flush in-memory buffer to a NEW segment file.
        - Open a new IndexReader that includes the new segment.
        - Doc is NOW searchable.
        - Segment is NOT yet fsync'd; lives in OS page cache.

Step 6: Flush (less frequent, ~5 min or 512 MB of translog).
        - fsync all segment files.
        - Write new segments_N commit point.
        - Roll over to a new translog generation.
        - Old translog reclaimable.
        ← FSYNC POINT for segments.

Step 7: Merge (background).
        - Tiered merge policy picks similar-sized segments.
        - Concatenates, dedups tombstones, writes one larger segment.
        - Old segments deleted after merge commits.

What survives a crash:

  • Between 1–2: doc lost; client gets no ack. Fine.
  • Between 2–4: primary has translog, replicas don't. Primary replays translog into the buffer; doc recovered. If primary dies, replica's shard recovery brings it back to step-2.
  • Between 4–5: doc in translog on primary + replicas. Translog replay rebuilds the buffer. Recovered without ever having been searchable — fine, since ack contract is "visible after refresh."
  • Between 5–6: segment in page cache but not on disk. On restart, segment gone, but translog replay re-creates an equivalent.
  • After 6: segment on disk; translog truncated to commit point. Engine opens at the commit point and continues.

This is why the translog is the durability primitive, not the segment file. Segments are immutable, fsync'd lazily, and reconstructible. Translog is the WAL and must be fsync'd per request.
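The crash cases above can be sketched with a toy model. ToyShard and its method names are hypothetical, replication (step 4) is left out, and the translog append is simply assumed to be fsync'd; the point is only which pieces of state survive a crash and which are rebuilt by replay.

```python
class ToyShard:
    """Toy single-shard model of translog, buffer, refresh, and flush."""

    def __init__(self):
        self.translog = []    # durable: survives a crash
        self.committed = []   # fsync'd segments: survive a crash
        self.buffer = []      # RAM only: lost on crash
        self.page_cache = []  # refreshed but not fsync'd segments: lost on crash

    def index(self, doc):
        self.translog.append(doc)  # step 2 (fsync assumed)
        self.buffer.append(doc)    # step 3: not searchable yet

    def refresh(self):
        # Step 5: the buffer becomes a new searchable segment.
        if self.buffer:
            self.page_cache.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        # Step 6: fsync segments, write a commit point, roll the translog.
        self.refresh()
        self.committed.extend(self.page_cache)
        self.page_cache.clear()
        self.translog.clear()

    def crash_and_recover(self):
        # Volatile state is lost; replaying the translog rebuilds it.
        self.buffer = list(self.translog)
        self.page_cache = []
        self.refresh()

    def search(self):
        return [doc for seg in self.committed + self.page_cache for doc in seg]

shard = ToyShard()
shard.index("a")
shard.flush()              # "a" is in a committed segment
shard.index("b")
shard.refresh()            # "b" is searchable but only in page cache
shard.crash_and_recover()  # the segment holding "b" is gone, the translog is not
print(shard.search())      # ['a', 'b']
```

In this model nothing acknowledged is lost as long as the translog survives, which matches the claim that the translog, not the segment file, is the durability primitive.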

4.4 Why 1-second refresh

Each refresh creates a new segment. With 5k writes/sec/shard, a 1s refresh produces a 5000-doc segment — a meaningful unit of work. Each segment carries fixed overhead (FST tip in memory, file handles, metadata). 1s refresh accumulates 3600 segments/hour, which the merger keeps in check; 100ms accumulates 36000/hour and the merger can't keep up — merge CPU crowds out query CPU. Each search visits every live segment; a 100-segment shard is ~30% slower than a 10-segment shard. 1 second is the empirical sweet spot.
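The segment-count arithmetic above is easy to reproduce. This sketch uses integer milliseconds to avoid floating-point noise; the function names and the 5k writes/sec figure (taken from the paragraph above) are the only inputs.

```python
def segments_per_hour(refresh_interval_ms: int) -> int:
    """Upper bound on new segments per hour: one per refresh."""
    return 3_600_000 // refresh_interval_ms

def docs_per_segment(writes_per_sec: int, refresh_interval_ms: int) -> int:
    """Docs accumulated in the buffer between two refreshes."""
    return writes_per_sec * refresh_interval_ms // 1000

for interval_ms in (100, 1000, 30_000):
    print(interval_ms,
          segments_per_hour(interval_ms),
          docs_per_segment(5000, interval_ms))
# 100   -> 36000 segments/hour,    500 docs each
# 1000  ->  3600 segments/hour,   5000 docs each
# 30000 ->   120 segments/hour, 150000 docs each
```

Shorter intervals produce many small segments for the merger to absorb; longer intervals trade visibility delay for fewer, larger segments.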

Workload-specific tuning: bulk reindex / log ingestion sets refresh_interval: -1 during load, re-enables after — 50k → 200k+ docs/sec on the same hardware. E-commerce catalog: 5–30s is fine. Real-time tweet / news search: 200ms, accept merge cost (Earlybird tuned aggressively).

4.5 Segment merge — tiered

Tiered merge groups segments into size tiers, picks up to max_merge_at_once (default 10) similarly-sized segments and merges them into one larger segment in the next tier. Avoids quadratic merge work; logarithmic segment count growth. It is the same idea as size-tiered compaction (Cassandra-style); the competing alternative is leveled compaction (LevelDB/RocksDB-style), which leaves fewer overlapping files to read but is more write-amplifying. Lucene's tiered merge balances write amp (~2–3× over a doc's lifetime) against query-side segment count.
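A toy simulation of the tiering idea, under simplifying assumptions: segment sizes are abstract units, a tier is the size's order of magnitude in base max_merge_at_once, and a merge fires only when a tier is full. Lucene's actual TieredMergePolicy scores candidate merges and has more knobs, so treat this purely as an illustration of the logarithmic behaviour.

```python
from collections import defaultdict

def tier_of(size, base):
    """Tier 0 holds sizes below `base`, tier 1 below base**2, and so on."""
    tier = 0
    while size >= base:
        size //= base
        tier += 1
    return tier

def tiered_merge(sizes, max_merge_at_once=10):
    """Merge full tiers until none remain.

    Returns (final segment sizes, total units rewritten by merges).
    """
    segments = list(sizes)
    rewritten = 0
    while True:
        tiers = defaultdict(list)
        for size in segments:
            tiers[tier_of(size, max_merge_at_once)].append(size)
        full = [t for t, segs in tiers.items() if len(segs) >= max_merge_at_once]
        if not full:
            return sorted(segments), rewritten
        group = sorted(tiers[min(full)])[:max_merge_at_once]
        for size in group:
            segments.remove(size)
        merged = sum(group)
        segments.append(merged)
        rewritten += merged

final, rewritten = tiered_merge([1] * 100)
print(final, rewritten)  # [100] 200
```

One hundred unit-sized flushes end up as a single segment after each unit has been rewritten twice by merges, which is the kind of low, logarithmic write amplification the paragraph above describes.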

Force merge (_forcemerge?max_num_segments=1) collapses everything to one segment. Good for read-only archives (a closed monthly log index, an immutable historical catalog). Catastrophic on a live index — monopolizes I/O for minutes-to-hours.

4.6 Why an inverted index and not the alternatives

  • vs B+ tree: text LIKE '%shoes%' requires full scan; leading wildcard isn't selective.
  • vs LSM tree with secondary indexes: LSM gives fast point writes, but multi-term query intersection across secondary indexes is essentially what posting list intersection does — and Lucene does it on PFOR-bit-packed data with skip pointers. ~5–10× less I/O.
  • vs in-memory hash map: works for tiny indexes; blows the heap at billion-doc scale.
  • vs HNSW vector alone: pure ANN can't filter by exact term, doesn't facet cheaply, loses precision on exact identifier matches (SKUs, error codes, names). Modern architecture is hybrid: keyword postings for precise queries, HNSW for semantic similarity, RRF (Reciprocal Rank Fusion) or learned-to-rank combines them.
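For the hybrid case, Reciprocal Rank Fusion is simple enough to sketch in a few lines. This follows the commonly used formulation, score(d) = Σ 1 / (k + rank), with k = 60 as a conventional default; the document IDs are made up, and how a given engine exposes RRF is not assumed here.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each list contributes 1 / (k + rank) per doc, with rank starting at 1.
    Ties are broken by doc ID so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists from a keyword query and a vector query.
keyword_hits = ["sku-42", "sku-7", "sku-13"]
vector_hits = ["sku-7", "sku-99", "sku-42"]
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# ['sku-7', 'sku-42', 'sku-99', 'sku-13']
```

RRF uses only ranks, never raw scores, so it sidesteps the problem that BM25 scores and vector similarities live on incomparable scales.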

§5. Capacity envelope across scales and domains

Search engines span six orders of magnitude in doc count and four in query rate. The technology is the same; the topology is wildly different.

| Tier | Example | Docs | Index size | QPS | Topology |
| --- | --- | --- | --- | --- | --- |
| Small | Internal wiki, single-team product catalog, hobby project | 10k–1M | <10 GB | <100 | Single node, single shard, no replicas; Tantivy or SQLite-FTS in-process |
| Mid | E-commerce mid-market (1M products), Stack Overflow Q&A | 10M–100M | 50–500 GB | 1k–10k | 5–10 shards × 2 replicas, 3–5 nodes |
| Large | LinkedIn Galene (1B profiles, 100k QPS), Booking.com hotels (~2M with hundreds of facets), Yelp businesses | 100M–1B | 1–50 TB | 50k–500k | 50–200 shards × 2–3 replicas, 20–50 nodes, hot/cold tiering, routing partitions |
| Giant | GitHub Blackbird (200B docs, 640 TB), Datadog logs, Earlybird, large-vendor observability | 10B–1T+ | 100 TB – multi-PB | 100k+ ingest, 1k–10k query | Multi-cluster federation, per-region/per-time-bucket indices, compute/storage separation on object stores (Quickwit-style) |

Where the next bottleneck appears

  • Small. Translog fsync latency. Solo devs set translog.durability=async to push past 10k writes/sec single SSD. No fanout overhead.
  • Mid. Per-shard query cost. A 50 GB shard with 10M docs spends most of its budget in the BM25 scorer.
  • Large. Fanout tail latency. Every query touches all shards; slowest dominates. With per-shard 1% chance of >50ms, fanout 100 gives 1 - 0.99^100 ≈ 63% chance of crossing 50ms — p99 collapses unless you route to limit fanout. Galene runs ~100 shards × 3 replicas, ~30 data nodes/region, ~100k QPS at p99 ~200ms.
  • Giant. Storage cost vs query rate. At Blackbird scale (200B docs, 640 TB), full hot NVMe is economically painful. Hot/cold split with searchable snapshots on S3; Quickwit's compute-on-S3-segments design exists for this tier.
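The fanout tail-latency figure for the Large tier follows from one line of probability, assuming shard latencies are independent (a simplification; correlated slowness such as GC pauses or a hot node behaves differently). The function name is illustrative.

```python
def p_any_shard_slow(per_shard_slow_prob: float, fanout: int) -> float:
    """Probability that at least one of `fanout` independent shards is slow."""
    return 1 - (1 - per_shard_slow_prob) ** fanout

for fanout in (1, 10, 100):
    print(fanout, round(p_any_shard_slow(0.01, fanout), 3))
# 1   -> 0.01
# 10  -> 0.096
# 100 -> 0.634
```

Routing a query to 10 shards instead of 100 cuts the chance of hitting a slow shard from roughly 63% to under 10%, which is why routing partitions appear in the Large-tier topology.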

Per-shard sweet spot: 20–50 GB. Below 10 GB, fanout overhead dominates; above 50 GB, query p99 spikes. Targets:

  • 100M × 2 KB/doc = 200 GB → 10 shards × 20 GB.
  • 1B × 3 KB/doc = 3 TB → 100 shards × 30 GB.
  • 200B × 3 KB/doc = 600 TB → tens of thousands of shards, federated.
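The three sizing targets above can be reproduced with a small helper. It uses decimal units (1 KB = 1000 bytes, 1 GB = 10^9 bytes) to match the round numbers in the list; plan_shards and its parameters are illustrative names, and real sizing should use measured on-disk index size rather than raw document size.

```python
import math

def plan_shards(doc_count: int, bytes_per_doc: int, target_shard_gb: float):
    """Return (total index size in GB, primary shard count) for a target shard size."""
    total_gb = doc_count * bytes_per_doc / 1e9
    return total_gb, math.ceil(total_gb / target_shard_gb)

print(plan_shards(100_000_000, 2000, 20))      # (200.0, 10)
print(plan_shards(1_000_000_000, 3000, 30))    # (3000.0, 100)
print(plan_shards(200_000_000_000, 3000, 30))  # (600000.0, 20000)
```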

Data node sizing: 32–64 vCPU, 128–256 GB RAM, 2–4 TB NVMe. Heap capped at 31 GB (compressed-oops ceiling). Remaining RAM is page cache — and query latency lives or dies in page cache. Size so the working set of posting lists fits.


§6. Architecture in context

The canonical pattern:

┌─────────────────────────────────────────────────────────────────────┐
│                       SOURCE OF TRUTH                                │
│  ┌─────────────────────────┐    ┌──────────────────────────┐        │
│  │  Primary DB             │    │  Write service           │        │
│  │  (Postgres, MySQL,      │◄───┤  (apps write here first) │        │
│  │   DynamoDB, Espresso,   │    └──────────────────────────┘        │
│  │   Cassandra...)         │                                         │
│  └────────────┬────────────┘                                         │
│               │  WAL / changelog                                     │
│               ▼                                                      │
│       ┌──────────────┐                                               │
│       │  CDC reader  │   (Debezium, Brooklin, DMS, DynamoDB          │
│       │              │    Streams, Git post-receive hooks for code)  │
│       └──────┬───────┘                                               │
└──────────────┼──────────────────────────────────────────────────────┘
               │
               ▼
   ┌───────────────────────────────────────────────────────────────┐
   │  Kafka topic (or Pulsar / Kinesis / Pub/Sub)                  │
   │  Partitioned by entity ID, retention 3–30 days                │
   │  ── replayable, immutable, deduplicated by (key, lsn)         │
   └────────────────────────┬──────────────────────────────────────┘
                            │
                            ▼
              ┌─────────────────────────────┐
              │  Indexer (Flink / Samza /   │
              │   Kafka Streams / bespoke   │
              │   consumer pool)            │
              │  - read partition           │
              │  - enrich / join            │
              │  - transform → bulk doc     │
              │  - bulk write w/ idempotent │
              │    external_version=lsn     │
              └──────────────┬──────────────┘
                             │ bulk index over HTTP
                             ▼
   ┌───────────────────────────────────────────────────────────────┐
   │                  SEARCH ENGINE CLUSTER                         │
   │                                                                │
   │   ┌──────────┐  ┌──────────┐  ┌──────────┐                    │
   │   │ Master 1 │  │ Master 2 │  │ Master 3 │   ← voting config  │
   │   └──────────┘  └──────────┘  └──────────┘     (Raft-like)    │
   │                                                                │
   │   ┌────────────┐  ┌────────────┐  ┌────────────┐              │
   │   │ Coord A    │  │ Coord B    │  │ Coord C    │ ← stateless  │
   │   └─────┬──────┘  └─────┬──────┘  └─────┬──────┘   fanout     │
   │         │               │               │                     │
   │         └───────────────┼───────────────┘                     │
   │                         ▼                                     │
   │   ┌──────────────────────────────────────────────────────┐   │
   │   │  Data nodes, each holding N primary + replica shards │   │
   │   │  Shard 0 P  Shard 0 R1  Shard 0 R2                   │   │
   │   │  Shard 1 R  Shard 1 P   Shard 1 R                    │   │
   │   │  ...                                                  │   │
   │   │  K primary shards × (1+R) copies                     │   │
   │   │  routed by hash(entity_id) mod K                     │   │
   │   └──────────────────────────────────────────────────────┘   │
   └───────────────────────┬───────────────────────────────────────┘
                           │
                           ▼
                ┌─────────────────────┐         ┌──────────────────┐
                │  Search API tier    │◄────────│  User request    │
                │  (BM25 + rerank)    │         │                  │
                └─────────────────────┘         └──────────────────┘
                           │ (optional)
                           ▼
                ┌─────────────────────┐
                │  Online feature /   │
                │  ML model serving   │
                │  for rerank         │
                └─────────────────────┘

The labels change per domain:

  • E-commerce search: primary = MySQL/DynamoDB catalog, CDC = Debezium, indexer enriches with pricing/inventory, engine = OpenSearch, rerank = personalization ML.
  • Code search: primary = Git repos, CDC = post-receive hooks, indexer tokenizes source as trigrams/n-grams, engine = Zoekt / Blackbird, rerank = recency/popularity mix.
  • Log search: primary = the emitting app, CDC = filebeat/Fluentd/Vector, Kafka or directly into Logstash, engine = ES / Loki / Quickwit, no rerank.
  • Profile search: primary = Espresso, CDC = Brooklin, engine = Galene, rerank = LTR with online feature store.

Key annotations:

  • Sharding key consistent top to bottom: primary sharded by entity_id, Kafka keyed by entity_id, engine routed by hash(entity_id) mod K.
  • Idempotency: bulk writes set the external version to the source LSN (Log Sequence Number); with version_type=external_gte the engine rejects writes older than the stored version.
  • Multi-stage search: phase-1 BM25 retrieves top ~1000/shard; phase-2 fetches _source for top-K; optional rerank applies ML on merged top-200 or top-500.
  • Master separation: master-eligible nodes isolated from data and coord roles; three masters = quorum minimum.
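The routing and idempotency annotations can be sketched together. VersionedIndex is a hypothetical in-memory stand-in for the engine's external-version check (accept a write only if its version is greater than or equal to the stored one), and CRC32 is used only as a stable example hash; the hash function a real engine uses for routing is not assumed here.

```python
import zlib

def shard_for(entity_id: str, num_primary_shards: int) -> int:
    """Stable hash(entity_id) mod K routing; CRC32 is an illustrative choice."""
    return zlib.crc32(entity_id.encode("utf-8")) % num_primary_shards

class VersionedIndex:
    """Toy store that rejects writes older than the version it already holds."""

    def __init__(self):
        self.docs = {}  # doc_id -> (version, body)

    def index(self, doc_id, version, body):
        current = self.docs.get(doc_id)
        if current is not None and version < current[0]:
            return False  # stale write from a retry or reordered delivery
        self.docs[doc_id] = (version, body)
        return True

idx = VersionedIndex()
print(idx.index("p1", 10, {"title": "new"}))  # True
print(idx.index("p1", 9, {"title": "old"}))   # False: older LSN is dropped
print(idx.index("p1", 10, {"title": "new"}))  # True: an exact retry is harmless
print(idx.docs["p1"])                         # (10, {'title': 'new'})
```

Because the version comes from the source of truth's log position, replaying a Kafka partition from an earlier offset converges to the same final state instead of clobbering newer documents.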


§7. Relevance tuning in depth — BM25, boosts, and the Explain API

Once retrieval works, the next job is making the top of the result list correct. Search engineers spend more wall-clock time on relevance debugging than on anything else, and the vocabulary is consistent across Lucene-derived engines: term frequency, inverse document frequency, length normalization, function scores, and the Explain API.

7.1 BM25 — the formula every search engineer needs in their head

BM25 (Best Match 25, "25" because it was the 25th iteration in the Okapi project at City University of London) is Lucene's default scoring function since version 6. Earlier Lucene used TF-IDF (Term Frequency–Inverse Document Frequency) with a square-root TF and a different length norm; BM25 fixes both of TF-IDF's pathologies.

For a single query term t against document D:

score(D, t) = IDF(t) × (f(t,D) × (k1 + 1)) / (f(t,D) + k1 × (1 - b + b × |D|/avgdl))

where:
  f(t,D)  = term frequency of t in D (raw count)
  |D|     = length of D in tokens (for the queried field)
  avgdl   = average document length across the index for that field
  k1      = TF saturation knob (default 1.2)
  b       = length-normalization knob (default 0.75)

IDF(t)    = ln(1 + (N - n(t) + 0.5) / (n(t) + 0.5))
  N       = total docs in the index (for that field)
  n(t)    = number of docs containing t

For a multi-term query, the score is the sum across terms (technically across query clauses, where a single clause may itself be a term or a phrase). A bool query sums its matching clauses this way; dis_max instead takes the best clause's score, plus tie_breaker × the scores of the others.

Three pieces matter for intuition:

(a) Inverse document frequency — the rare-term bonus. A query for "Yifan stainless" gives almost all its score to "Yifan" because almost no docs contain it. "stainless" appears in tens of thousands, gets little weight. This is what makes "the" a no-op even when not stopword-filtered.

(b) Term frequency saturation: f × (k1 + 1) / (f + k1) is a hyperbolic curve that saturates at k1 + 1. A doc with 50 mentions of "bottle" is not 50× more relevant than one with 1; at k1=1.2 and average length it scores about 2.1× higher. TF-IDF used sqrt(f), which never saturates and let keyword-stuffing pages dominate. k1=1.2 is empirically tuned; raising to 2.0+ delays saturation (score stays sensitive to TF for longer), lowering to 0.5 makes it sharper (TF caps quickly).

(c) Length normalization: (1 - b + b × |D|/avgdl) divides by document length relative to the average. A 50-token product title that says "bottle" scores higher than a 5000-word blog post that also says "bottle" once. b=0.75 is the empirical default. b=0 disables length normalization entirely (use for fields where length is irrelevant, like a SKU); b=1 is full normalization.
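The formula from 7.1 translates directly into code, which makes the three effects easy to check numerically. This is a sketch of the formula as written above, not of any engine's internal implementation; the function names and the corpus statistics are illustrative.

```python
import math

def idf(n_docs: int, doc_freq: int) -> float:
    """IDF(t) = ln(1 + (N - n(t) + 0.5) / (n(t) + 0.5))."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term(freq, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score of one query term against one document."""
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf(n_docs, doc_freq) * freq * (k1 + 1) / (freq + k1 * length_norm)

N = 1_000_000  # hypothetical corpus size

# (a) Rare terms dominate: 10 matching docs vs 50,000.
print(idf(N, 10) > idf(N, 50_000))                       # True

# (b) TF saturation at average length: 50 mentions vs 1.
one = bm25_term(1, 100, 100, N, 1000)
fifty = bm25_term(50, 100, 100, N, 1000)
print(round(fifty / one, 2))                             # 2.15

# (c) Length normalization: same single mention, short vs long doc.
print(bm25_term(1, 50, 100, N, 1000) > bm25_term(1, 5000, 100, N, 1000))  # True
```

Setting b=0 in bm25_term makes the short and long documents score identically, which is the behaviour wanted for fields such as SKUs.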

7.2 Tuning k1 and b per field

Different fields want different parameters:

| Field | Length characteristics | Recommended k1 | Recommended b |
| --- | --- | --- | --- |
| Product title | Short, dense, every word matters | 1.2 (default) | 0.5–0.75 |
| Product description | Long, repeats keywords for SEO | 1.0 | 0.75 |
| Reviews | Variable, repetition is noise | 0.5 | 0.75 |
| Author / brand | Single token usually | 1.2 | 0 (no normalization) |
| Code identifier | Short, often unique | 0.5 | 0 |
| News article body | Long-form prose | 1.5 | 0.75 |

Override per-field in the mapping:

{
  "settings": {
    "similarity": {
      "title_sim": { "type": "BM25", "k1": 1.0, "b": 0.5 },
      "body_sim":  { "type": "BM25", "k1": 1.5, "b": 0.85 }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "similarity": "title_sim" },
      "body":  { "type": "text", "similarity": "body_sim" }
    }
  }
}

A/B test on click logs; a top-5 relevance metric (for example NDCG@5) is the usual yardstick. Production teams iterate k1 in 0.3 steps and b in 0.1 steps; expect 1–5% click lift per tuning round before plateauing.

7.3 Query-time vs index-time boosts

Two ways to give a field extra weight:

Index-time boost: multiplied into the field's norm at indexing time. Avoid. Lucene deprecated index-time boosts in 6.5 and removed them in 7.0 because (1) they bake ranking decisions into immutable segments, so changing a boost means reindexing everything; (2) they are lossily quantized into one byte; (3) they compose badly with BM25's own length norm.

Query-time boost: multiplier applied to a clause at search time. Easy to iterate.

{
  "query": {
    "multi_match": {
      "query": "stainless steel bottle",
      "fields": ["title^3", "brand^2", "description^1"]
    }
  }
}

The ^N syntax says "multiply this field's BM25 score by N before combining across fields." With multi_match's default best_fields type the combined score is the best-scoring field's; most_fields sums them instead. A match in the title is 3× a match in the description. Boosts are multiplicative on score, not on rank, so the effect depends on the underlying BM25 distribution. A boost of 3 on title is usually noticeable; 30 makes title hits dominate entirely; 1.5 is a nudge. Tune via A/B.

7.4 Function score and the custom ranking layer

BM25 alone says "how textually similar." Real ranking wants "and also recent, and popular, and not currently out of stock." function_score multiplies (or adds) custom factors on top of BM25:

{
  "query": {
    "function_score": {
      "query": { "match": { "title": "espresso machine" } },
      "functions": [
        { "field_value_factor": {
            "field": "rating", "modifier": "log1p", "factor": 0.5
        }},
        { "gauss": {
            "release_date": {
              "origin": "now", "scale": "365d", "decay": 0.5
            }
        }},
        { "filter": { "term": { "in_stock": true }}, "weight": 2.0 }
      ],
      "score_mode": "multiply",
      "boost_mode": "multiply"
    }
  }
}

score_mode combines the functions (multiply, sum, avg, max, min). boost_mode combines the function output with the base BM25. Multiplying is the common choice — keeps the relative ordering of strong text matches while penalizing stale or out-of-stock items.

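The decay functions (gauss, linear, exp) are how "recency matters" gets expressed. A gauss with scale: 30d and decay: 0.5 says "the multiplier is 0.5 at 30 days from origin" and falls off faster beyond that (about 0.06 at 60 days); exp is the variant that halves every 30 days. Newsrooms tune this aggressively; e-commerce uses a much longer decay for evergreen products.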
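To compare the three decay shapes numerically, the sketch below follows the formulas given in the Elasticsearch function_score documentation, with distances in days and offset defaulting to 0. The Python function names are illustrative, and the results are multipliers applied on top of the text score.

```python
import math

def gauss_decay(distance, scale, decay=0.5, offset=0.0):
    """Bell curve: equals `decay` at distance == scale, then drops quickly."""
    d = max(0.0, abs(distance) - offset)
    sigma_sq = -scale ** 2 / (2 * math.log(decay))
    return math.exp(-d ** 2 / (2 * sigma_sq))

def exp_decay(distance, scale, decay=0.5, offset=0.0):
    """Exponential: multiplies by `decay` for every additional `scale`."""
    d = max(0.0, abs(distance) - offset)
    return math.exp(math.log(decay) / scale * d)

def linear_decay(distance, scale, decay=0.5, offset=0.0):
    """Straight line: equals `decay` at distance == scale, zero soon after."""
    d = max(0.0, abs(distance) - offset)
    s = scale / (1 - decay)
    return max(0.0, (s - d) / s)

for days in (0, 30, 60):
    print(days,
          round(gauss_decay(days, 30), 4),
          round(exp_decay(days, 30), 4),
          round(linear_decay(days, 30), 4))
# 0  -> 1.0,    1.0,  1.0
# 30 -> 0.5,    0.5,  0.5
# 60 -> 0.0625, 0.25, 0.0
```

All three agree at origin and at one scale out; they differ in how harshly they treat documents beyond that, which is the choice to make per domain.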

script_score (one rung above) runs a Painless script per candidate. Powerful, expensive — used for "geo distance × rating × price-bucket" composite scores. Avoid in inner loops at scale (touches every candidate).

7.5 The "why does my CEO's profile rank low" investigation pattern

The classic relevance bug at LinkedIn-style scale: someone at the company searches for the CEO's name, and the result is somewhere below page 3. This recurs at every search shop. The investigation playbook:

Step 1: Reproduce the query exactly. Get the query string the user actually typed (autocomplete, typos, locale, filters). Don't trust the screenshot — re-run via curl with the same request body.

Step 2: Confirm the document exists in the index. GET /profiles/_doc/{id} — is it there? Is _source what you expect? If the doc is missing, the bug is in the indexing pipeline, not relevance.

Step 3: Confirm the document MATCHES. Run the same query with explain=true AND a hard filter on _id: <ceo_id>:

GET /profiles/_search?explain=true
{
  "query": {
    "bool": {
      "filter": [{ "term": { "_id": "ceo123" }}],
      "must":   [{ "match": { "full_name": "Tim Cook" }}]
    }
  }
}

If the must clause doesn't match (no hits returned), the analyzer is the bug. The query "Tim Cook" might tokenize to ["tim", "cook"]; the doc field full_name might be indexed as a keyword (whole-string only), so "Tim Cook" matches but "tim" or "cook" alone does not. Or the field is using the wrong analyzer (e.g., a Whitespace tokenizer when you wanted Standard with lowercase).

Step 4: Compare scores via the Explain API. Once it matches, run without the _id filter and use explain=true to see the score breakdown for both the top-ranked doc and the CEO:

{
  "explanation": {
    "value": 8.43,
    "description": "sum of:",
    "details": [
      { "value": 6.12, "description": "weight(full_name:tim in 123) [BM25]",
        "details": [
          { "value": 6.12, "description": "score(freq=1.0), computed as ...",
            "details": [
              { "value": 1.0, "description": "termFreq=1.0" },
              { "value": 9.8, "description": "idf, computed as ..." },
              { "value": 0.42, "description": "tfNorm, computed as 1/(1+k1*(1-b+b*dl/avgdl)) ..." }
            ]
          }
        ]
      },
      { "value": 2.31, "description": "weight(full_name:cook in 123) [BM25]", "details": [...] }
    ]
  }
}

Read top-down: total score, per-clause scores, per-term breakdown including TF, IDF, and the tfNorm component. The typical findings:

  • "Tim Cook" appears with a lower IDF than expected because the index contains 50k "Tim"s and 10k "Cook"s — both common.
  • The top-ranked doc has "Tim Cook" appearing 4× in the title field; CEO's profile only has it 1× in full_name. TF saturation should clip this, but if k1 is too high it doesn't.
  • The top-ranked doc is from a 30-word profile; CEO's is from a 500-word one. Length normalization punishes the longer doc.
  • Worst case: the CEO's name is in current_company.ceo_name, NOT in full_name. The query matches full_name only.
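
The TF-saturation and length-normalization findings above can be reproduced with a small Python sketch of the per-term BM25 formula. The corpus numbers (10M profiles, 50k containing "tim", average field length 100) are made up for illustration, and the sketch omits any query-time boost factor.

```python
import math

def bm25_idf(doc_count, doc_freq):
    """Inverse document frequency: rarer terms score higher."""
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Saturating term-frequency component with length normalization."""
    return freq / (freq + k1 * (1.0 - b + b * doc_len / avg_doc_len))

def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    return bm25_idf(doc_count, doc_freq) * bm25_tf(freq, doc_len, avg_doc_len, k1, b)

# Hypothetical corpus: 10M profiles, 50k of them contain "tim".
short_doc = bm25_term_score(1, 30, 100, 10_000_000, 50_000)
long_doc = bm25_term_score(1, 500, 100, 10_000_000, 50_000)
print(short_doc, long_doc)  # the 30-word doc outscores the 500-word doc

# Saturation: 4 occurrences are worth far less than 4x one occurrence,
# and a higher k1 weakens that clipping.
gain_default = bm25_tf(4, 100, 100) / bm25_tf(1, 100, 100)
gain_high_k1 = bm25_tf(4, 100, 100, k1=5.0) / bm25_tf(1, 100, 100, k1=5.0)
print(gain_default, gain_high_k1)
```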

Step 5: Fix with a structural change, not a one-off boost. Bad fix: add a function_score that boosts the CEO's specific doc ID. Doesn't generalize, embarrasses the engineer. Good fix: add a weight boost on docs where is_executive: true, OR add a cross_fields multi_match so the query matches against full_name, headline, AND current_company.ceo_name simultaneously, OR add an LTR feature for "person matches a known prominent role in their company." All three are general; all three lift the CEO's profile because of what it is, not because of who it is.

The Explain API is a debug tool — never call it in production hot paths (it inflates response size 10–100×). Use it in a relevance-debugging notebook, dashboard, or sidecar service.

7.6 Per-query relevance instrumentation

Long-term, relevance is a measurement problem. Production search shops build:

  • NDCG (Normalized Discounted Cumulative Gain) dashboards computed from click logs. NDCG@10 is the common SLA metric: over the top 100 head queries, it scores how close the top-10 ordering is to the ideal one, with clicked (relevant) results counting for more the higher they rank.
  • Side-by-side diff tool (e.g., Yelp's "Sherlock," LinkedIn's relevance lab) that runs the same query against two engine configurations and renders the diff visually. PR approvals against this.
  • Sampled query replay — 0.1% of production queries replayed nightly against a candidate config; aggregate score delta is the gating metric for releases.

Without these, ranking changes become vibes-based and regress silently.
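
For reference, a minimal NDCG@k in Python. It assumes linear gain (some teams use 2^rel − 1 instead) and computes the ideal ordering from the same result list, whereas a production dashboard would normally use all judged documents for the query. The relevance labels are hypothetical click-derived grades.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (1-based log2 discount)."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# 1 = clicked, 0 = not clicked, in displayed order.
print(ndcg_at_k([1, 0, 0], 10))  # clicked result already on top
print(ndcg_at_k([0, 1, 0], 10))  # clicked result one position too low
```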


§8. Learning-to-Rank (LTR) — the ML re-ranking layer

BM25 with hand-tuned function scores plateaus quickly. The standard upgrade is a two-stage architecture: BM25 as a fast candidate generator, then a learned model re-ranks the top N. Every serious consumer search system (LinkedIn, Yelp, Booking, Pinterest, Etsy, Indeed) operates this way.

8.1 Why two stages

Scoring every document with a heavy model is infeasible. A neural reranker can score maybe 200–1000 docs per query in the latency budget; BM25 can score millions. So:

  1. Stage 1 — candidate generation. BM25 (with filters and lightweight function scores) returns top 100–1000 documents per shard, merged to a global top 500–2000. Optimized for recall: we just need the truly-relevant docs to be SOMEWHERE in this set, ranking within the set is less important.
  2. Stage 2 — re-ranking. A learned model scores the merged candidate set. Optimized for precision at K: get the very best result to position 1, second-best to position 2, etc.

The key tradeoff: stage 1 must be both fast and high-recall. If the CEO's profile isn't in stage 1's top 1000, no reranker can save it.

Query → BM25 retrieval (per shard, top 1000 each)
      → coordinator merges → top 2000 globally
      → fetch ML features per (query, doc) pair
      → LTR model scores all 2000
      → return top 10 sorted by model score
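
The same flow as a toy Python sketch. Everything here is illustrative: the shard hit lists, the popularity feature, and toy_model stand in for real BM25 results, a feature store, and an LTR model.

```python
import heapq

def merge_shard_hits(per_shard_hits, global_k):
    """Coordinator step: keep the global top-k (doc_id, bm25) pairs across shards."""
    all_hits = (hit for shard in per_shard_hits for hit in shard)
    return heapq.nlargest(global_k, all_hits, key=lambda hit: hit[1])

def rerank(candidates, model_score, final_k):
    """Stage 2: score every candidate with the model and return the best doc ids."""
    scored = [(doc_id, model_score(doc_id, bm25)) for doc_id, bm25 in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in scored[:final_k]]

shard_a = [("a1", 9.1), ("a2", 7.4), ("a3", 2.0)]
shard_b = [("b1", 8.8), ("b2", 3.5)]
popularity = {"a1": 0.1, "a2": 0.9, "a3": 5.0, "b1": 0.5, "b2": 0.2}  # hypothetical feature

def toy_model(doc_id, bm25):
    # Stand-in for a learned model: blends text score with a document feature.
    return 0.3 * bm25 + 5.0 * popularity[doc_id]

candidates = merge_shard_hits([shard_a, shard_b], global_k=3)
top = rerank(candidates, toy_model, final_k=2)
print(candidates, top)
```

Note that a3 would win under toy_model, but it never reaches stage 2 because its BM25 score keeps it out of the merged top 3. That is the recall ceiling in miniature.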

8.2 Model choices

Gradient-boosted decision trees (XGBoost, LightGBM) dominate. Why: hundreds of features mix dense (numeric: BM25 score, log of click count, recency in days) and sparse (categorical: country, query intent class). GBDTs handle the mix natively, train in minutes on millions of impressions, infer in microseconds per (query, doc) pair, and are debuggable — predict_contrib says which features moved which row.

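Neural rerankers (BERT-style cross-encoders, typically transformers fine-tuned on MS MARCO) win on relevance at higher latency cost. A cross-encoder ingests [query, doc] jointly and outputs a relevance score, which makes it better at semantic alignment but roughly 100× slower than a GBDT. It is used for top-50 final reranking when the budget allows. ColBERT is a related late-interaction model rather than a cross-encoder: it encodes query and doc separately and compares per-token embeddings at query time.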

Hybrid stacks (LinkedIn, Pinterest): GBDT to score top 2000 → top 200, then a cross-encoder to score top 200 → final order. Each stage uses what it's good at within its latency cell.

8.3 Features

Three families:

Query-only: query length, language, presence of named entities, query intent classifier output ("navigational" vs "informational"), historical click-through rate for this query string.

Document-only: log of profile completeness, recency in days, popularity (views, stars, reviews), language, document type, structured priors (verified seller, in-stock, "responds in 1 hour" for hotels).

Query-document interaction (the high-signal family): BM25 score itself, BM25 across multiple fields (title, body, brand) as separate features, lexical overlap (count of query tokens appearing in doc), term proximity (minimum window containing all query terms), historical click-through for this exact (query, doc) pair, vector cosine if hybrid retrieval is in play.

Modern LTR models train on 500–2000 features. Feature engineering and pipeline reliability matter more than model architecture choice past a point.

8.4 Click-through data as training signal

The dominant training signal is implicit: click logs. For each search, log impressions (which docs were shown, in what positions) and clicks (which were clicked). Convert to pairwise preference data:

  • Doc A shown at position 3, clicked. Doc B shown at position 1, not clicked. Implied: A > B (despite worse position).
  • Doc A shown at position 5, clicked. Doc B shown at position 6, clicked. Both positive — weaker signal.

Use LambdaMART (the GBDT variant that optimizes a ranking metric directly) or a pairwise/listwise loss in a neural model. Position bias is a major confound: users click position 1 simply because it's position 1. Correct for it with inverse propensity scoring (weight each click by the inverse of the probability that its position is examined, so clicks at the top count less and clicks further down count more), or sidestep it with counterfactual evaluation and interleaving experiments.
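
A minimal Python sketch of both ideas: turning one logged impression list into pairwise preferences with the "clicked beats skipped-above" heuristic, and weighting a click by inverse propensity. The session data and the propensity table are invented for illustration; real propensities are estimated from randomization or click models.

```python
def skip_above_pairs(impressions):
    """impressions: list of (doc_id, clicked) in displayed order, position 1 first.

    Returns (preferred, skipped) pairs: each clicked doc is preferred over
    every unclicked doc that was shown above it.
    """
    pairs = []
    for idx, (doc_id, clicked) in enumerate(impressions):
        if not clicked:
            continue
        for above_id, above_clicked in impressions[:idx]:
            if not above_clicked:
                pairs.append((doc_id, above_id))
    return pairs

def ips_weight(position, propensity):
    """Inverse propensity weight for a click at a 1-based position."""
    return 1.0 / propensity[position]

session = [("B", False), ("C", False), ("A", True)]  # A clicked at position 3
pairs = skip_above_pairs(session)
propensity = {1: 1.0, 2: 0.6, 3: 0.4}  # hypothetical examination probabilities
print(pairs, ips_weight(3, propensity))
```

Two clicked docs never form a pair here, which matches the "both positive, weaker signal" case above.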

Cold-start problem: a brand new product has no click history. Mitigation: lean on document-only features for new docs, and run a bandit that occasionally injects new items into the top results to gather data.

8.5 How LinkedIn and Yelp do it

LinkedIn People Search (Galene + LTR): stage 1 is BM25-on-Galene with network-cluster routing (fanout ~10 shards). Stage 1 returns ~2000 candidates. Stage 2 features come from an online feature store (Pinot, Venice) keyed by (searcher_id, candidate_id) — features include "connection degree," "shared employers," "skill overlap," "recent profile views by searcher's network." LTR model is a GBDT trained on click + connection-accept data, retrained weekly. Per-request latency: stage 1 ~30ms, feature fetch ~40ms, scoring ~10ms, total under 200ms p99.

Yelp Business Search: stage 1 is Lucene-on-Solr (Yelp uses both Lucene-derived and a custom layer historically). Filters by geo and category. Returns top 500 by BM25 + lightweight geo distance score. Stage 2 GBDT scores with features like business rating, review velocity, hours-open-now, photo count, historical click-through by users in this user's segment. Etsy's stack is structurally identical.

Booking.com: similar shape, with the added complexity that "best result" depends heavily on travel dates (a hotel that's full on the user's dates is irrelevant regardless of score). Date availability is a filter, not a feature; stage 1 applies it.

8.6 Deployment patterns — where the model runs

In-engine (Elasticsearch LTR plugin, OpenSearch Learning to Rank): model lives on the data nodes; reranking happens before shard results return. Lowest latency, tightest coupling. Painful to iterate on features (mapping changes) and to roll back. Good for stable production models.

Sidecar (between coord and client): coordinator returns top N raw docs to the search API tier; API calls a model-serving service. Easy to iterate (any model, any features, deploy independently); higher latency due to extra hop. Default pattern at most companies.

In-engine tensors (Vespa): Vespa runs the model inside the engine via its tensor framework. Compose retrieval and ranking in one query plan, no round trip. The reason Vespa is loved for ML-heavy ranking.

8.7 Online evaluation — never trust offline

Offline NDCG on held-out click logs is necessary but not sufficient. Real-world LTR rollouts always:

  1. Interleave results from production and candidate models for a small traffic fraction. Compare click rates.
  2. A/B test the candidate model on 1–5% of traffic. Measure not just NDCG but downstream business metrics (purchases, application starts, signups).
  3. Watch for fairness regressions: did the new model rank certain demographics or sellers systematically lower? Critical at LinkedIn (job seekers) and at Airbnb-style marketplaces.

A 2% NDCG offline lift that translates to 0% revenue lift is common. A 0.5% offline lift that translates to 3% revenue lift is also common. Online wins, always.


§9. Hybrid retrieval — BM25 + vector for the post-2023 era

The most consequential architectural shift in search in a decade is the merger of keyword and vector retrieval. ChatGPT made dense retrieval mainstream; production systems by 2024 routinely combine both. The pattern is now table-stakes for any new search system.

9.1 Why hybrid, not vector-only

Vector embeddings capture semantic meaning — "espresso machine" and "coffee maker" are close in embedding space even with no token overlap. But:

  • Vector loses on rare identifiers. Searching for SKU B07XYZ1234 or error code ERR-4242 or username liali. Embedding models normalize away exact spellings; BM25 nails them.
  • Vector loses on negation and operators. "iPhone NOT case" routes through query parsing; vector embeddings of the whole phrase don't subtract "case" cleanly.
  • Vector loses on filters at scale. "Espresso machines under $200 with 4+ star reviews" — vector indexes (HNSW) struggle to apply post-filters efficiently because the graph traversal doesn't know about the filter until candidates are reached.

BM25 loses on synonyms ("auto" vs "car"), on paraphrase ("how do I reset my password" vs "I forgot my password"), and on cross-language ("perro" vs "dog"). The complementary failure modes are exactly why hybrid wins.

9.2 Co-located vs separated vector storage

Two architectures:

(a) Co-located: vector field inside the same Lucene index. Elasticsearch added the dense_vector field type with HNSW indexing; OpenSearch's equivalent field type is knn_vector. Elasticsearch mapping:

{
  "mappings": {
    "properties": {
      "title":          { "type": "text" },
      "description":    { "type": "text" },
      "title_embedding": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine",
        "index_options": { "type": "hnsw", "m": 16, "ef_construction": 100 }
      }
    }
  }
}

Each segment now contains both the inverted index AND an HNSW graph. Query can ask kNN + filters + text match in one request. The engine intersects results in the same query plan.

Pros: one system, one operational story, atomic updates (text and embedding change together at refresh boundaries). Cons: vector storage is heavy (768 floats × 4 bytes = 3 KB per doc, before HNSW graph overhead); HNSW graph is expensive to build during merge; vector indexes don't compress as well as text inverted indexes. A 100M doc index with 768-dim vectors needs 300–500 GB just for vectors.
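
The storage figure is simple arithmetic, sketched here in Python. The function name is mine, and the numbers cover raw float32 vectors only; HNSW graph overhead and replicas come on top.

```python
def raw_vector_bytes(num_docs, dims, bytes_per_value=4):
    """Raw vector payload in bytes: one value per dimension per document."""
    return num_docs * dims * bytes_per_value

per_doc_bytes = raw_vector_bytes(1, 768)               # float32 vectors
total_gb = raw_vector_bytes(100_000_000, 768) / 1e9    # decimal gigabytes
quantized_gb = raw_vector_bytes(100_000_000, 768, bytes_per_value=1) / 1e9
print(per_doc_bytes, total_gb, quantized_gb)
```

Passing bytes_per_value=1 models one-byte-per-dimension quantization, which is the usual lever when the float32 footprint is too large.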

(b) Separated: Pinecone / Weaviate / Milvus + ES. Vector DB stores (doc_id, embedding). Search engine stores the text index. Query path:

Query embedding → Pinecone returns top 1000 doc_ids by cosine
              ↘
                merge by doc_id
              ↗
Query text   → ES returns top 1000 doc_ids by BM25

Both are merged in the search API tier. Pros: each system optimized for its job, scale independently, can change vector models without reindexing text. Cons: two systems to operate, doc_id consistency across both is your problem, joining at query time adds latency.

Production teams in 2024–2026 mostly pick (a) for new builds when index size allows, (b) when vector volume dwarfs text or when teams already have a specialized vector DB. LinkedIn, Pinterest, Spotify all use a mix.

9.3 Reciprocal Rank Fusion (RRF) — the simple, robust score combiner

The hard problem in hybrid retrieval: BM25 returns unbounded scores (e.g., 8.43); vector cosine returns a bounded similarity (raw cosine lies in [-1, 1], and engines typically rescale it to [0, 1]). They're not comparable. Three approaches:

(a) Score normalization — divide each result list's scores by their max, then weighted-sum. Brittle: outliers warp normalization.

(b) Learned combination — train a small model with bm25_score, vector_score, and features as input, predict relevance. Powerful, more moving parts.

(c) Reciprocal Rank Fusion (RRF) — ignore scores entirely. Combine by rank:

RRF(doc) = sum over all retrieval methods i of:
             1 / (k + rank_i(doc))

  k is a smoothing constant, default 60.
  rank_i is the doc's rank in method i's result list (1 = top).
  If the doc isn't in method i's top list, its contribution is 0.

For a doc ranked 1st by BM25 and 3rd by vector: RRF = 1/(60+1) + 1/(60+3) = 0.0164 + 0.0159 = 0.0323. For a doc ranked 1st by BM25 only (not in vector top): RRF = 1/61 = 0.0164. The doc appearing in BOTH lists is preferred even at moderate ranks.

RRF works because it doesn't depend on score calibration; it only needs each retriever to produce a sensible ranking. It's robust to weird score distributions, handles new retrievers being added without retuning, and the result is interpretable (each retriever's contribution is visible). Elasticsearch added it as a first-class operator (rrf in _search) precisely because production teams kept reimplementing it.

The k parameter controls how much top ranks dominate. k=60 is the empirical default; tune via A/B. Lower k weights top positions more; higher k flattens differences.
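
A minimal RRF in Python, reproducing the worked numbers above. The document ids and the two ranked lists are made up; the function is a sketch of the formula, not any engine's implementation.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked doc_id lists (best first).

    Returns [(doc_id, score)] sorted by fused score, highest first.
    A doc absent from a list simply contributes nothing for that list.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_ranking = ["d1", "d2", "d3"]     # d1 is 1st by BM25
vector_ranking = ["d4", "d5", "d1"]   # d1 is 3rd by vector; d4 is 1st by vector only
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 4))
```

d1 appears in both lists and comes out on top, ahead of d4, which is ranked first but in only one list.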

9.4 The semantic search revolution post-2023

What changed: production-quality embedding models became cheap and small. OpenAI text-embedding-3-small is 1536-dim, 2 cents per 1M tokens. Cohere embed-english-v3 is 1024-dim. Open models (sentence-transformers, BGE, E5) run on a single CPU at 100–500 docs/sec. Embedding a 1B-doc catalog dropped from "infeasible" to "$5000 of API calls, a weekend on a GPU."

Companion shifts:

  • RAG (Retrieval-Augmented Generation) as a primary use case. Document chunks retrieved by hybrid search are fed as context to an LLM. Quality of the RAG output depends directly on retrieval quality — driving search investment from LLM teams.
  • Re-ranking with cross-encoders (e.g., Cohere rerank, BGE-reranker) as a stage between hybrid retrieval and final results — typically reorders top 100 to top 10.
  • Late interaction models (ColBERT) that produce per-token embeddings then sum-of-max-similarity at query time. More expressive than single-vector embeddings, more expensive to store and query. Used by some advanced production stacks (Vespa supports it natively).

The composite query in 2026 production: (BM25 on title^3 + brand^2 + description^1) RRF-fused with (vector knn over title_embedding, ef=100) + filters on category, price, in_stock → top 200 → cross-encoder rerank → top 10. Five years ago this would have been three separate research projects; today it's one query body in Elasticsearch 8.x.


§10. Tokenization and analyzers

Tokenization is the unglamorous foundation of search. The wrong analyzer makes everything else (BM25, LTR, hybrid) worse, and the bugs are subtle: queries return nothing or rank obviously wrong docs, and the cause is buried in how text was sliced into terms at index time.

10.1 The analyzer pipeline — three stages

A Lucene analyzer is a pipeline applied to text before it's inverted:

raw text  →  CHARACTER FILTER  →  TOKENIZER  →  TOKEN FILTERS  →  terms

  • Character filter — operates on the raw character stream. Strips HTML tags, maps &amp; to &, normalizes Unicode (NFC), folds accents (é → e).
  • Tokenizer — splits the character stream into tokens. Standard tokenizer splits on Unicode boundaries (whitespace, punctuation). Other tokenizers: whitespace-only, character-by-character, regex-based, n-gram.
  • Token filter — operates on the token stream. Lowercases, removes stopwords, stems (running → run), expands synonyms (nyc → nyc + new york), folds diacritics.

The analyzer used at INDEX time and at QUERY time must produce compatible tokens. The most common bug: indexed text was lowercased, query was not (or vice versa) — no matches. Always set analyzer on a field and let it apply to both, OR explicitly set search_analyzer and verify.
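
The three stages can be mimicked in plain Python to make the index/query compatibility rule concrete. This is a toy, not Lucene: the stopword list is tiny, the tokenizer is a regex, and there is no stemming.

```python
import html
import re
import unicodedata

STOPWORDS = {"the", "a", "an", "and", "of"}  # tiny illustrative list

def char_filter(text):
    """Character filter: strip tags, decode entities, normalize to NFC."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = html.unescape(text)
    return unicodedata.normalize("NFC", text)

def tokenize(text):
    """Tokenizer: split on anything that is not a word character."""
    return re.findall(r"\w+", text)

def fold_accents(token):
    """Drop combining marks after decomposition (é -> e)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def token_filters(tokens):
    """Token filters: lowercase, remove stopwords, fold accents."""
    lowered = (t.lower() for t in tokens)
    kept = (t for t in lowered if t not in STOPWORDS)
    return [fold_accents(t) for t in kept]

def analyze(text):
    return token_filters(tokenize(char_filter(text)))

indexed_terms = analyze("<p>The Caf&eacute; &amp; Bar</p>")
query_terms = analyze("CAFE")
print(indexed_terms, query_terms)
```

Because the same analyze function runs on both sides, the query "CAFE" produces a term that exists in the index. Skipping the lowercase or folding step on either side breaks the match.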

10.2 Standard vs language-specific analyzers

Standard — Unicode word-boundary tokenizer + lowercase filter. Works for any Latin-script language as a baseline. No stemming, so "running" and "run" don't match.

English — Standard tokenizer + lowercase + English stopwords + Porter stemmer (rule-based: running → run, runs → run, connection → connect). Stems can be aggressive: university and universe both stem to univers. Usually a worthwhile tradeoff, but not always.

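Snowball — A family of rule-based stemmers (Porter, Porter2/English, Lovins, plus stemmers for many other languages) with subtly different aggressiveness. KStem (the Krovetz stemmer) is a separate, dictionary-assisted alternative: it is conservative ("only conflate forms that are surely the same word") where Porter is aggressive. KStem is a common e-commerce choice because it produces fewer false matches on product names.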

Language-specific — analyzers tuned per language: Spanish, French, German, Russian, Arabic, etc. Each carries language-specific stopword lists and stemmer rules. German is particularly important because compound words (Donaudampfschifffahrtsgesellschaftskapitän) need a decompounder to find "Schiff" or "Kapitän."

ICU (International Components for Unicode) — analyzer with deeper Unicode awareness. Handles Asian languages, normalization to NFKC, locale-specific casing. The starting point for non-Latin scripts.

10.3 Tokenizing non-Latin scripts

Lucene's word-boundary tokenizer fails on languages without spaces between words:

Chinese. No whitespace between words. Options:

  • Character-level (unigrams). Every character indexed as a token. Simple, recall-heavy, terrible precision: "中" appears in millions of docs.
  • Jieba / IK / SmartCN segmenter. Dictionary + statistical model splits character streams into words. A typical Chinese index uses the IK Analyzer plugin or SmartCN. Required for usable Chinese search.
  • Bigram tokenizer ("CJKBigram"). Each pair of adjacent characters is a token. Decent precision/recall balance, no dictionary maintenance.
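
The bigram option is easy to picture with a toy Python tokenizer. It assumes the input is already a run of CJK characters and splits only on whitespace; Lucene's CJK bigram filter also detects scripts and handles mixed text, which this sketch does not.

```python
def cjk_bigrams(text):
    """Emit overlapping character pairs for each whitespace-separated run.

    A run of a single character is emitted as-is so it stays searchable.
    """
    tokens = []
    for run in text.split():
        if len(run) == 1:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens

print(cjk_bigrams("中华人民"))
```

A four-character run yields three overlapping pairs, so a two-character query term matches wherever that pair occurs without any dictionary.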

Japanese. Similar problem but worse: three scripts (kanji, hiragana, katakana) mixed inline. Kuromoji (the official analysis-kuromoji Elasticsearch plugin, installed separately) is a morphological analyzer that recognizes kanji compounds and inflections. Without Kuromoji, "走った" (ran) and "走る" (run) are different tokens.

Korean. Nori tokenizer (the official analysis-nori plugin, available since ES 6.4). Korean agglutinates suffixes onto roots; Nori strips them.

Thai, Khmer, Lao. No whitespace, no Kuromoji-equivalent for all of them. ICU tokenizer is the fallback; performance varies.

Arabic and Hebrew. Right-to-left, with diacritics often stripped. ICU handles bidi correctly; an Arabic stemmer (light10) helps stem the rich morphology.

10.4 Synonyms

Synonym filters expand terms at INDEX time (nyc indexed as both nyc and new york) or QUERY time (nyc typed in a query expanded to also search new york). Both have tradeoffs:

Index-time synonyms: cheaper at query time (fewer terms to search), index size grows, changing the synonym list requires reindexing. Common pattern: city aliases, product number variants, common typos.

Query-time synonyms: index unchanged, easy to iterate, but query expansion can balloon (a query like "car" might expand to 8 terms — car|auto|automobile|vehicle|sedan|...). The synonym filter applies at query parse and feeds the resulting terms into BM25.

Multi-word synonyms ("new york" ↔ "ny" ↔ "nyc") need careful phrase handling. Elasticsearch's synonym_graph token filter, which is intended for search-time analyzers, handles multi-word equivalences correctly inside match_phrase queries; the older synonym filter has known gaps.

WordNet provides English synonym data; OpenThesaurus for German; custom dictionaries for product domains (tee/tshirt/t-shirt).

10.5 Edge n-grams and autocomplete

Autocomplete needs prefix matching: typing "iph" should match "iphone." Two patterns:

(a) Edge n-grams at index time. Tokenize each term into all prefixes: iphone → i, ip, iph, ipho, iphon, iphone. At query time, use a standard analyzer (no edge n-gram) so the user's "iph" matches one of the indexed n-grams. Index grows ~5×, queries are blazing fast (single-term lookup).

(b) Completion suggester. Lucene's completion field type — a memory-resident FST keyed on prefix, returning the top-K suggestions weighted by an indexed score. Sub-millisecond. The right tool for "as-you-type" suggestions; doesn't help with anywhere-in-string matching (use match_phrase_prefix for that with rank trade-offs).

Common bug: applying the edge n-gram analyzer on BOTH the index and search sides. Then "iph" tokenizes to i, ip, iph, and the search expands to all three terms, matching everything starting with "i." Always set search_analyzer to a non-edge-n-gram analyzer for these fields.
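
Both the pattern and the bug fit in a short Python sketch. The in-memory dictionary stands in for the inverted index, and the three product terms are invented for illustration.

```python
def edge_ngrams(term, min_gram=1, max_gram=20):
    """All prefixes of `term` between min_gram and max_gram characters long."""
    upper = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, upper + 1)]

# Index side: every prefix points back at the full term.
index = {}
for term in ["iphone", "ipad", "ink"]:
    for gram in edge_ngrams(term):
        index.setdefault(gram, set()).add(term)

# Correct: the query is NOT n-grammed, so it is a single-term lookup.
correct = index.get("iph", set())

# Buggy: the query is n-grammed too, so "i" and "ip" are searched as well.
buggy = set()
for gram in edge_ngrams("iph"):
    buggy |= index.get(gram, set())

print(edge_ngrams("iphone"), correct, buggy)
```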

10.6 The keyword vs text distinction

Two field types, both backed by Lucene, with totally different semantics:

text — analyzed, tokenized, inverted-indexed by terms. Used for full-text search. Cannot be sorted on or aggregated directly (uses fielddata, expensive). Score-able via BM25.

keyword — NOT analyzed; the entire string is a single token. Used for exact match, sort, aggregate, filter. Cannot be substring-searched. Stored in doc values automatically.

The convention is the multi-field mapping:

{
  "properties": {
    "city": {
      "type": "text",
      "fields": {
        "raw": { "type": "keyword" }
      }
    }
  }
}

Now city is searchable (text analyzer), city.raw is exact-matchable, sortable, and aggregatable. This pattern is in every production Elasticsearch index.

Beginners' bug: mapping a status field ("OPEN", "CLOSED") as text, then trying to filter or aggregate by it. The analyzer lowercases and tokenizes, so "OPEN" is indexed as ["open"]: a term filter on "OPEN" matches nothing, and sorting or aggregating is rejected unless fielddata is enabled, which is slow and heap-hungry. Use keyword from the start.


§11. Cluster topology deep dive — node roles and sizing

A production Elasticsearch/OpenSearch cluster is a mix of node roles. Each role does a distinct job; mixing roles on the same VM works for small clusters but breaks at scale. The reason is failure isolation: a slow query on a data node shouldn't slow cluster-state operations on a master.

11.1 The four roles

(1) Master-eligible nodes. Manage the cluster state: which indices exist, what their mappings are, where each shard lives, which nodes are up. The cluster state is a single replicated document (effectively a tree) that the elected master publishes to every node, with each change committed by a quorum of master-eligible nodes in a Raft-like protocol. Master-eligible nodes are the consensus group. Dedicated masters do NOT serve queries, hold shards, or process documents.

(2) Data nodes. Hold shards. Serve queries on local shards. Index documents. Run merges. CPU- and disk-intensive. Most cluster machines are data nodes.

(3) Coordinating-only nodes (also called "client nodes" historically). No shards, no master role. Accept HTTP requests, parse queries, fan out to relevant data nodes, gather and merge results, return to client. Pure routing/aggregation. Stateless, easy to scale horizontally.

(4) Ingest nodes. Run ingest pipelines — pre-processing applied to docs before they land on a data node. Pipelines parse JSON, enrich with GeoIP, lowercase fields, drop sensitive data. Optional; if a doc doesn't need preprocessing, skip ingest nodes entirely.

A small cluster (3–5 nodes total) runs every role on every node. A large cluster separates them:

3× master-only nodes        (4 vCPU, 8 GB RAM, small disk — cluster state is tiny)
3–10× coordinating nodes    (16 vCPU, 64 GB RAM, no disk — query fanout)
2× ingest nodes (optional)  (16 vCPU, 32 GB RAM — pipeline CPU)
20–200× data nodes          (32–64 vCPU, 128–256 GB RAM, 2–4 TB NVMe SSD)

11.2 Why dedicated masters past ~10 data nodes

The cluster state grows with the number of indices and shards. A cluster with 1000 indices × 10 shards × 3 copies has 30000 shard entries in the cluster state. Every node holds a copy and applies updates.

If a data node is also master, a long garbage-collection pause on the data role blocks cluster-state updates — shard rebalancing stalls, allocation decisions queue up. Worse, a data node with heavy query load can drop heartbeats to peers; the cluster thinks the master is down and elects a new one. Spurious master elections cascade: in-flight cluster-state updates abort, allocation churns, query latency spikes.

Rule of thumb: past ~10 data nodes, give masters dedicated VMs. They're tiny (4 vCPU and 8 GB is plenty) but critical for stability. Always 3 of them, never 2 (losing either one loses the quorum, and misconfigured older clusters could split-brain), never 1 (single point of failure). Use 5 masters in geographically spread deployments where a quorum must survive the loss of a region.
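
The 3-not-2 rule falls out of majority arithmetic, shown here as a small Python sketch (function names are mine):

```python
def quorum(master_eligible):
    """Smallest majority of the master-eligible nodes."""
    return master_eligible // 2 + 1

def tolerated_failures(master_eligible):
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible - quorum(master_eligible)

table = {n: tolerated_failures(n) for n in (1, 2, 3, 5)}
print(table)
```

Two masters tolerate no more failures than one, which is why the second node buys nothing; three tolerate one loss and five tolerate two.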

11.3 Why coordinating-only nodes

A query lands on some node. That node:

  1. Parses the query, resolves the index name to a list of shard IDs.
  2. Sends the request to one copy (primary or replica) of each shard.
  3. Receives per-shard top-K results.
  4. Merges into a global top-K.
  5. (Phase 2) Asks for _source from the data nodes holding the final top-K.
  6. Returns to client.

This work is CPU and network heavy. If the coord role lives on a data node, it competes with the local shard searches. Bursts of queries cause data nodes to spend cycles coordinating other people's queries instead of running their own.

Dedicated coord nodes keep the data nodes focused. A pool of 5–10 coord-only nodes can fan out to a 100-data-node cluster. The coord layer is stateless — kill one, requests retry; scale up/down with autoscaling groups.

For most clusters under ~20 data nodes, you don't need dedicated coords. The savings are real at large scale.

11.4 Ingest nodes — when and when not

Ingest pipelines transform documents before indexing. Use cases:

  • Parse logs (parse syslog timestamps, extract fields from a single string).
  • Enrich with GeoIP (turn IP into country/city/coords).
  • Drop or rename fields.
  • Encrypt PII before storing.

If documents arrive already-shaped, skip ingest entirely. If pipelines are heavy (heavy regex, ML inference at index time), dedicate ingest nodes so they don't compete with data nodes for CPU.

Alternative architecture for heavy pre-processing: do it in the upstream indexer (Logstash, Flink, custom consumer) before sending bulk to the cluster. Trades operational complexity for finer-grained control.

11.5 Heap sizing — 31 GB ceiling

Keep the JVM heap on every ES/OS node at or below about 31 GB (slightly below the 32 GB compressed-oops threshold). Past that point, object pointers become full 64-bit instead of compressed 32-bit, RAM efficiency drops, and garbage collection time grows. Crossing the threshold makes a somewhat larger heap worse, not better.

A 256 GB RAM data node uses:

  • 31 GB heap.
  • ~225 GB for the OS page cache.

That OS page cache is what makes Lucene fast — segment files, doc values, FST tails are mmap'd, so the OS caches what's hot. Latency lives in page cache hits. Size the cluster so that the working set of postings + doc values fits in collective page cache across nodes.
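
The memory split as a tiny Python helper. The 31 GB cap comes from the text above; the additional "no more than half of RAM" ceiling is general Elasticsearch sizing guidance that I am assuming here, and it only matters on nodes smaller than 62 GB.

```python
def split_memory(ram_gb, heap_cap_gb=31):
    """Return (heap_gb, page_cache_gb) for a node with `ram_gb` of RAM.

    Heap is capped at heap_cap_gb and, as an assumption, at half of RAM;
    everything else is left to the OS page cache.
    """
    heap = min(heap_cap_gb, ram_gb // 2)
    return heap, ram_gb - heap

print(split_memory(256))  # large data node
print(split_memory(32))   # small node: the half-of-RAM rule applies instead
```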

11.6 Cross-AZ and rack-awareness

cluster.routing.allocation.awareness.attributes: zone tells the engine to spread the copies of each shard across zones, so a primary and its replica avoid sharing a zone whenever enough zones are available. Lose an AZ, lose at most one copy of any shard. The same applies to rack, host, or any failure domain you label.

Forced awareness goes further: awareness.force.zone.values: [a, b, c] means the cluster refuses to allocate replicas if it can't satisfy zone-spread. Safer in production.


§12. Index lifecycle management (ILM) — hot/warm/cold/frozen tiers

A logs index from a week ago is queried 100× less often than today's logs but costs the same on hot NVMe. Index Lifecycle Management (ILM in Elasticsearch; ISM, Index State Management, in OpenSearch) automates the migration of indices through cost-tiered storage. Per-TB cost drops 10–100× between the hot and frozen tiers, and a whole cluster's storage bill typically drops several-fold (about 6× in the example in 12.4), with little operational overhead.

12.1 The four tiers

Hot — writable, queried frequently. Local NVMe SSD. High IOPS, high CPU. The "today's data" tier. Cost: $300–500/TB-month on cloud NVMe.

Warm — read-only, queried occasionally. Larger spinning disks or cheaper SSD. Same engine binary, no new operational story. Cost: $50–100/TB-month.

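Cold — read-only, queried rarely. Searchable snapshots on object storage (S3, GCS, Azure Blob). The index files live in object storage; the data node holds only metadata + a lightweight cache. Queries against cold tier pull blocks from S3 on demand. First query is slow (multi-second), subsequent are cached. Note that Elasticsearch's own cold tier uses fully mounted snapshots (a full local copy, with the snapshot standing in for replicas); the fetch-on-demand behavior described here is what Elasticsearch calls a partially mounted index. Cost: $20–25/TB-month on S3 (storage); compute is cheap because RAM/CPU footprint is small per index.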

Frozen — like cold but even more aggressive. Partially mounted: nothing is cached locally beyond a small block cache. Every query is mostly an S3 fetch. Latency in single-digit seconds for first-touch. Used for compliance retention — "we must hold logs for 7 years but we'll touch them once a quarter for audits." Cost: $5–10/TB-month effective.

12.2 Rollover and time-based indices

The pattern that makes tiered storage work: instead of one giant logs index that lives forever, write to a rolling alias logs that points to the current write index (logs-2026.05.22-000001). Triggers:

  • Size: roll over when index reaches 50 GB (per shard).
  • Age: roll over when index is 1 day old.
  • Doc count: roll over at 100M docs.

After rollover, the previous index no longer receives writes, which makes it a perfect candidate for migration to the warm tier. Reads still go through the alias logs (or an index pattern such as logs-*), which the engine expands to all matching indices.
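
The trigger logic is an OR over the three conditions, sketched here in Python. The WriteIndexStats type and its field names are hypothetical; the thresholds are the ones listed above.

```python
from dataclasses import dataclass

@dataclass
class WriteIndexStats:
    largest_primary_shard_gb: float
    age_days: float
    doc_count: int

def should_rollover(stats, max_shard_gb=50, max_age_days=1, max_docs=100_000_000):
    """Roll over as soon as ANY threshold is reached."""
    return (
        stats.largest_primary_shard_gb >= max_shard_gb
        or stats.age_days >= max_age_days
        or stats.doc_count >= max_docs
    )

print(should_rollover(WriteIndexStats(12.0, 0.5, 40_000_000)))  # nothing reached yet
print(should_rollover(WriteIndexStats(51.0, 0.2, 1_000)))       # size threshold hit
```

Combining size and age keeps shards from growing too large on busy days and from lingering half-empty on quiet ones.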

12.3 An ILM policy

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "1d" },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "forcemerge": { "max_num_segments": 1 },
          "shrink": { "number_of_shards": 1 },
          "allocate": { "include": { "data_tier": "data_warm" }},
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": { "snapshot_repository": "s3_repo" },
          "allocate": { "include": { "data_tier": "data_cold" }}
        }
      },
      "frozen": {
        "min_age": "180d",
        "actions": {
          "searchable_snapshot": { "snapshot_repository": "s3_repo" }
        }
      },
      "delete": {
        "min_age": "730d",
        "actions": { "delete": {} }
      }
    }
  }
}

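What it does: write to the hot index until it rolls over (50 GB or 1 day), force-merge and shrink at the warm transition 7 days later (one segment, one shard: a small index, ideal for read-only), mount as a searchable snapshot on S3 at cold (30 days), switch to a partially mounted snapshot at frozen (180 days), delete after 2 years. Every min_age is measured from rollover. Operations are fully automatic.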

The force-merge to one segment at warm transition is the critical optimization: read-only indices serve queries faster from one big segment than from 50 small ones. Don't force-merge on hot tier (live writes); always do it on rollover to warm.

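The shrink to one shard consolidates a multi-shard index into one shard after rollover. Old logs have a low query rate; one shard is cheaper to operate than 10. Shrink writes a new index rather than modifying the old one, but it hard-links the existing segment files where the filesystem allows, so little data is copied once a copy of every shard sits on one node.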
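
To make the phase boundaries concrete, here is a minimal sketch that maps an index's age since rollover to the phase the policy above would place it in. The table is copied from the min_age values in the policy; phase_for_age is a hypothetical helper, not an ILM API.

```python
# (phase, min_age in days) taken from the policy above; hot has no min_age.
PHASE_MIN_AGE_DAYS = [
    ("hot", 0),
    ("warm", 7),
    ("cold", 30),
    ("frozen", 180),
    ("delete", 730),
]


def phase_for_age(days_since_rollover):
    """Return the latest phase whose min_age has been reached."""
    current = "hot"
    for name, min_age in PHASE_MIN_AGE_DAYS:
        if days_since_rollover >= min_age:
            current = name
    return current
```

In the real engine the transition also waits for the previous phase's actions (force-merge, shrink) to finish, so the observed phase can lag this idealized mapping.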

12.4 Real cost numbers

A petabyte-scale log cluster at a SaaS company, with 1 PB ingested per month and 90-day total retention:

  • Without ILM: 3 PB × $400/TB-month = $1.2M/month.
  • With ILM (7d hot, 30d warm, 90d cold): (0.23 PB × $400) + (0.77 PB × $80) + (2 PB × $25) = $92k + $62k + $50k = $204k/month.

~6× cost reduction, same query semantics, transparent to the application. This is the largest single optimization available for log-heavy ES clusters and the reason Quickwit/Loki exist (they take cold-tier-on-object-storage further as the default, not an optimization).
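
The arithmetic can be reproduced with a short script. It assumes a 30-day month and the same per-TB prices as the bullets ($400 hot, $80 warm, $25 cold); tier_cost is a hypothetical helper. Because it does not round the tier sizes to 0.23 and 0.77 PB, the tiered total comes out near $205k rather than $204k.

```python
INGEST_TB_PER_DAY = 1000 / 30  # 1 PB per 30-day month


def tier_cost(tiers):
    """tiers: list of (days of data held in the tier, USD per TB-month)."""
    return sum(days * INGEST_TB_PER_DAY * price for days, price in tiers)


# 90 days of data, all on hot storage.
no_ilm = tier_cost([(90, 400)])

# Days 0-7 hot, 7-30 warm, 30-90 cold.
with_ilm = tier_cost([(7, 400), (23, 80), (60, 25)])

print(f"no ILM: ${no_ilm:,.0f}  with ILM: ${with_ilm:,.0f}  ratio: {no_ilm / with_ilm:.1f}x")
```

Changing the tier boundaries in the second call is a quick way to see how sensitive the bill is to the hot-tier window, which dominates the total.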


§13. Re-indexing strategies — the alias swap pattern in detail

Field mappings are effectively immutable once data has been indexed. You can add new fields, but you cannot change an existing field's analyzer, type, or doc_values setting on a live index. The way to evolve a schema in production is the alias swap pattern, which §20.6 (the "hard problems" section) introduces briefly. This section drills into the mechanics.

13.1 Why mappings are immutable

Segments encode field metadata at write time. The FST term dictionary, the analyzed terms, the doc values column types — all baked into segment files. Changing an analyzer would require re-tokenizing every document from _source. Changing a field from keyword to text would require regenerating the inverted index for that field. Changing a doc_values codec would mean rewriting the column.

Lucene doesn't support in-place rewrite. Instead: build a new index with the new mapping, copy data over, swap.

13.2 The five-step alias swap

Step 1: CREATE the new index with new mapping.
   PUT /products_v4
   { "mappings": { "properties": {
       "title": { "type": "text", "analyzer": "english_v2" },
       "title_embedding": { "type": "dense_vector", "dims": 768 }
   }}}
   Index is empty.

Step 2: START dual writes.
   Indexer publishes each doc to BOTH products_v3 AND products_v4.
   This catches all NEW edits from t=now forward.
   products_v4 is now catching up in real time.

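Step 3: BACKFILL with _reindex (or external batch).
   POST /_reindex
   { "conflicts": "proceed",
     "source": { "index": "products_v3" },
     "dest":   { "index": "products_v4", "version_type": "external" },
     "script": { "source": "..." }  // optional field transforms
   }
   Reindex API streams from old → new at controlled rate.
   version_type: external is not the default. Set it so a slow
   backfill row cannot overwrite a newer live write of the same doc
   (the dual writes must carry external versions too; op_type: create
   is the simpler guard if they do not). conflicts: proceed keeps
   skipped docs from aborting the job.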

Step 4: VERIFY parity.
   Diff doc counts: GET /products_v3/_count vs /products_v4/_count
   Spot-check sampled docs by ID.
   Run query suite against both, compare top-K stability.
   This is the highest-risk gate; do not skip.

Step 5: SWAP the alias atomically.
   POST /_aliases
   { "actions": [
       { "remove": { "index": "products_v3", "alias": "products" }},
       { "add":    { "index": "products_v4", "alias": "products" }}
   ]}
   Both ops in one request = atomic. Zero downtime.

Step 6 (optional): SOAK then delete.
   Keep products_v3 read-able for 1–7 days in case rollback needed.
   Then DELETE /products_v3.
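
The "top-K stability" check in Step 4 can be sketched as an overlap measure between the ranked IDs each index returns for the same query. This is one simple way to do it, not a prescribed method; topk_overlap, parity_report and the 0.8 threshold are illustrative assumptions.

```python
def topk_overlap(old_ids, new_ids, k=10):
    """Fraction of the old top-k that also appears in the new top-k."""
    old_top, new_top = old_ids[:k], new_ids[:k]
    if not old_top:
        return 1.0
    return len(set(old_top) & set(new_top)) / len(old_top)


def parity_report(results_old, results_new, k=10, threshold=0.8):
    """results_*: dict of query -> ranked doc IDs. Returns queries below threshold."""
    return sorted(
        query
        for query in results_old
        if topk_overlap(results_old[query], results_new.get(query, []), k) < threshold
    )
```

An analyzer change is expected to move some results, so the threshold is a judgment call; the point of the report is to surface the queries a human should inspect before the swap.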

The key invariant: queries always go through the alias products, never the versioned name. Production code never knows whether it's hitting v3 or v4. Reverting is a single alias-swap call in the other direction — under 1 second to roll back.
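
A minimal sketch of the swap and its rollback as request bodies for POST /_aliases. It only builds the JSON payloads; sending them is left to whatever HTTP client is in use. alias_swap and rollback are hypothetical helper names.

```python
def alias_swap(alias, old_index, new_index):
    """Body for POST /_aliases: remove and add in one request so the swap is atomic."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def rollback(swap_body):
    """Build the inverse swap from a previously applied swap body."""
    removed = swap_body["actions"][0]["remove"]
    added = swap_body["actions"][1]["add"]
    return alias_swap(removed["alias"], added["index"], removed["index"])
```

Persisting the applied swap body alongside the deploy record means a rollback needs no lookup of which version was live before.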

13.3 Reindex API at scale

_reindex is throttle-aware. By default it streams documents using internal scroll. Throttle with requests_per_second:

POST /_reindex?slices=auto&wait_for_completion=false&requests_per_second=5000

slices=auto parallelizes across shards (one slice per source shard). For a 100-shard source, that's 100 parallel reindex tasks. wait_for_completion=false returns a task ID; track via GET /_tasks/{id}.

A 1 TB index typically reindexes in 4–12 hours at production throttle settings. Bottlenecks: source disk read throughput (the copy reads _source, not doc values) and destination merge cost (set refresh_interval: -1 and number_of_replicas: 0 on the destination during the backfill; restore both afterwards).

For mapping changes that require enrichment from another source (e.g., need to call an embedding service per doc), the reindex API has a script field that runs Painless on each doc, but invoking an external HTTP service from Painless is unsupported. Instead: run a custom indexer in Flink/Spark that reads from the primary, enriches, writes to the new index. The reindex API is for engine-internal copy.
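
A sketch of the enrichment half of such a custom indexer: it turns rows read from the primary into newline-delimited bulk API lines, attaching an embedding and an external version. The row shape (id, version, title), the embed callable, and the index name are assumptions for illustration; a real pipeline would add batching and retry handling around this.

```python
import json


def bulk_lines(rows, embed, index="products_v4"):
    """Yield bulk API lines (action, then document) for each enriched row.

    rows:  iterable of dicts with 'id', 'version' and 'title' (hypothetical shape).
    embed: callable returning a list of floats for a string (stand-in for the
           external embedding service).
    """
    for row in rows:
        action = {
            "index": {
                "_index": index,
                "_id": row["id"],
                # External versioning: an older row cannot overwrite a newer write.
                "version": row["version"],
                "version_type": "external",
            }
        }
        doc = {"title": row["title"], "title_embedding": embed(row["title"])}
        yield json.dumps(action)
        yield json.dumps(doc)
```

Joining the yielded lines with "\n" plus a trailing newline gives a body suitable for POST /_bulk.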

13.4 Painless script bulk update for in-place modifications

Not every schema change needs a new index. If the change is adding a derived field, computed from existing fields, the _update_by_query API with a Painless script can rewrite the relevant docs in place:

POST /products/_update_by_query
{
  "query": { "term": { "category": "electronics" }},
  "script": {
    "source": "ctx._source.tax_class = ctx._source.price > 500 ? 'high' : 'low'"
  }
}

This still creates new segments under the hood (every "updated" doc is reindexed with a tombstone on the old one), so it's not magic — but it avoids the alias swap. Use for surgical changes; use alias swap for analyzer or type changes.

13.5 Versioning index names

The convention: products-v1, products-v2, ..., with the alias products always pointing to the current write index. Variations:

  • Date-stamped: products-2026.05.22 rolled daily for time-series.
  • Schema version + date: logs-v3-2026.05.22. Lets you do both — increment the schema (alias swap) and roll daily (ILM rollover).
  • Color-coded blue/green: products-blue, products-green. Alias swap is the cutover. Single rollback action.

The discipline is: production code reads/writes through aliases, never raw index names. The DB-name-in-the-app pattern is forbidden. Otherwise rollouts and rollbacks turn into multi-team coordinations.


§14. Cross-cluster search and replication

Past a single cluster's limits (cost, blast radius, regulatory locality) you split into multiple clusters. Search across them is solved by Cross-Cluster Search (CCS); replication by Cross-Cluster Replication (CCR) in Elasticsearch or by OpenSearch's own cross-cluster replication plugin.

14.1 Cross-cluster search

A single query fans out across N clusters. Configuration (cluster settings):

{
  "persistent": {
    "cluster": {
      "remote": {
        "us-east": { "seeds": ["us-east-coord-1:9300"] },
        "eu-west": { "seeds": ["eu-west-coord-1:9300"] }
      }
    }
  }
}

Now a query like GET /us-east:logs-*,eu-west:logs-*/_search fans out to both clusters. Each remote cluster runs the query locally and returns its own top-K to the originating cluster, which merges into a global top-K.
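
The merge step on the originating cluster can be sketched as a k-way merge of already-sorted per-cluster hit lists. This is a simplified model that assumes scores from different clusters are directly comparable; merge_top_k and the (score, doc_id) hit shape are illustrative, not the engine's internal types.

```python
import heapq


def merge_top_k(per_cluster_hits, k):
    """Merge per-cluster hit lists (each sorted by score, descending) into a global top-k.

    per_cluster_hits: dict of cluster alias -> list of (score, doc_id).
    Returns a list of (score, "cluster:doc_id").
    """
    tagged = (
        [(score, f"{cluster}:{doc_id}") for score, doc_id in hits]
        for cluster, hits in per_cluster_hits.items()
    )
    merged = heapq.merge(*tagged, key=lambda hit: hit[0], reverse=True)
    return [hit for _, hit in zip(range(k), merged)]
```

Because each remote returns only its own top-K, the coordinator handles at most N × K hits regardless of how large the remote indices are.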

Use cases:

  • Geo-distributed deployment: per-region indices (each region's data stays in-region for latency and regulatory reasons); global queries (rare) fan out cross-region.
  • Tenant isolation: per-tenant clusters with one global "search across all tenants" plane for admin/audit.
  • Hot/cold separation across clusters: hot cluster for last 30d, archive cluster for 30d–7yr. CCS over both for occasional historical queries.

The key cost is network: queries hit the remote cluster over inter-region links. Latency adds up. Bandwidth costs add up. CCS works well when the remote is rarely queried; less well when it's the primary path.

14.2 Cross-cluster replication

CCR replicates a leader index to one or more follower indices in remote clusters. The follower pulls the leader shards' operation history (retained on the leader via soft deletes rather than read from the translog) and replays it; lag is bounded by network + processing.

Used for disaster recovery: primary cluster in us-east-1, follower cluster in us-west-2. If us-east-1 burns, promote us-west-2 followers to leaders, repoint clients. Time to recover: minutes.

Also used for read scaling across regions: keep the write path in one region, replicate to read-only followers in others. Users query their nearest follower. Followers can lag a few seconds; tolerable for most read use cases.

Caveats:

  • Mapping changes propagate from leader to follower automatically.
  • Hardware mismatches between clusters cause merge throughput differences; followers can fall behind on heavy write loads.
  • Conflict resolution is one-way: leader wins, follower has no votes. For active-active multi-region writes, CCR isn't the right tool; you'd need application-level conflict resolution (e.g., CRDT-style) and route writes by entity.

OpenSearch has its own cross-cluster replication implementation, also based on the operation log; semantics are nearly identical.

14.3 When to federate vs centralize

Driver                                                            Federate   Centralize
Single-cluster size > 5 TB or 200 nodes                           Yes        No, scale up first
Cross-region regulatory data locality                             Yes        No
Independent tenant capacity needs                                 Yes        No
Vast majority of queries are local to a single tenant or region   Yes        No
Most queries are cross-tenant or global                           No         Yes
Operational simplicity dominates                                  No         Yes

Federation costs are real: more clusters means more masters, more snapshot policies, more alert dashboards, more on-call complexity. Don't federate prematurely.


§15. Security — multi-tenancy, field-level, and the licensing story

A search engine in a SaaS or enterprise context must enforce who can see what. The engine answers "documents matching the query"; it's the security layer's job to ensure the user is allowed to see those documents.

15.1 The default — no security

Lucene is a library and has no authentication layer at all. Elasticsearch shipped without free security features until 6.8/7.1 and has enabled security by default only since 8.0; OpenSearch bundles a security plugin, but it is only as good as the credentials and certificates you configure. Exposing an unsecured engine to the internet is the canonical mistake: many companies have leaked customer data through unsecured ES clusters.

Production must always have at minimum:

  • Authentication — basic auth, JWT, mutual TLS, SAML, OIDC. The cluster knows who the caller is.
  • Authorization — role-based access control. Roles map to cluster privileges (read this index, write that one).
  • TLS for all node-to-node and client-to-node traffic.

15.2 Field-level security

Different roles see different fields of the same document. Example: a users index has fields name, email, ssn, phone. Customer support sees name, email, phone; engineering sees name, email; only fraud-ops sees ssn.

Configured per role:

{
  "role": {
    "indices": [{
      "names": ["users"],
      "privileges": ["read"],
      "field_security": {
        "grant": ["name", "email", "phone"]
      }
    }]
  }
}

The engine strips disallowed fields from _source before returning. Note: this can still leak via aggregations — if a user can aggregate on ssn, the buckets reveal SSN values. Field-level security must be combined with restricting which fields can be aggregated.
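
The stripping step can be sketched as a grant-list filter over a flat _source. The role names and grants mirror the example above; ROLE_GRANTS and apply_field_security are hypothetical names, and the sketch ignores nested fields and wildcards.

```python
ROLE_GRANTS = {
    "support": ["name", "email", "phone"],
    "engineering": ["name", "email"],
    "fraud_ops": ["name", "email", "phone", "ssn"],
}


def apply_field_security(source, role):
    """Return a copy of a flat _source dict containing only the fields granted to role."""
    allowed = set(ROLE_GRANTS.get(role, []))
    return {field: value for field, value in source.items() if field in allowed}
```

Note the default for an unknown role is an empty grant, so a misconfigured caller sees nothing rather than everything.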

15.3 Document-level security

Different roles see different documents in the same index. Implemented as a query filter applied automatically by the engine. Example: a multi-tenant SaaS where every doc has a tenant_id field, and each user can only see docs in their own tenant:

{
  "role": {
    "indices": [{
      "names": ["data"],
      "privileges": ["read"],
      "query": "{\"term\":{\"tenant_id\":\"{{_user.metadata.tenant}}\"}}"
    }]
  }
}

Every query from a user in tenant acme_corp gets an implicit AND tenant_id = "acme_corp" filter. The user can never see other tenants' data, even with a maliciously-crafted query.
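
A minimal sketch of why a crafted query cannot escape the filter: the user's query is wrapped, never merged, so the tenant clause always applies. with_tenant_filter is a hypothetical helper, and matches is a toy evaluator for only the term and bool query shapes used here.

```python
def with_tenant_filter(user_query, tenant_id):
    """Wrap any user query so the tenant restriction is always ANDed in."""
    return {
        "bool": {
            "must": [user_query],
            "filter": [{"term": {"tenant_id": tenant_id}}],
        }
    }


def matches(doc, query):
    """Toy evaluator supporting term queries and bool must/filter clauses."""
    if "term" in query:
        (field, value), = query["term"].items()
        return doc.get(field) == value
    clauses = query["bool"]
    return all(matches(doc, q) for q in clauses.get("must", []) + clauses.get("filter", []))
```

A user in acme_corp who asks for tenant_id = other_corp ends up with a query that requires both values at once, which no document satisfies.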

Runtime queries can be arbitrarily complex — checking allowlists, computing per-user permissions from a metadata field, joining with an external rules engine via runtime fields. Trade-off: complex doc-level filters slow every query for that user.

Multi-tenancy patterns:

  • Shared index, doc-level security: cheapest, simplest, biggest blast radius. One badly-tuned query slows everyone.
  • Per-tenant index: stronger isolation, but cluster state grows (one index per tenant = thousands of indices for big SaaS).
  • Per-tenant cluster: strongest isolation, biggest cost.

Most SaaS pick shared-with-doc-level until a tenant hits a scale where they justify their own index/cluster.

15.4 The open-source vs Elastic-Cloud vs OpenSearch fork story

In 2021, Elastic changed the Elasticsearch license from Apache 2.0 to a dual SSPL + Elastic License (non-OSI-approved). The trigger was AWS reselling Elasticsearch as a managed service without contributing back; Elastic wanted to preempt that pattern. AWS responded by forking the last Apache 2.0 release (7.10) into OpenSearch, with its own roadmap.

The current landscape (2026):

  • Elasticsearch — Elastic's product, dual-licensed (free Basic tier for most use; paid commercial for x-pack features like cross-cluster replication, machine-learning anomaly detection, advanced security). Available on Elastic Cloud (Elastic's hosted service).
  • OpenSearch — AWS's Apache 2.0 fork. Functionally similar to ES 7.10 at the start, has diverged with its own features (Anomaly Detection, Observability, OpenSearch Dashboards). Available on Amazon OpenSearch Service.
  • Apache Lucene — unchanged, Apache 2.0, the underlying library. Both ES and OpenSearch consume it.

Companies pick:

  • Greenfield, no Elastic relationship: OpenSearch by default. Apache 2.0 license clarity, fully OSS, cheaper at scale on AWS.
  • Existing Elastic customer: stay on ES, the upgrade path is smoother, x-pack features are valuable.
  • Air-gapped or self-hosted with self-sufficiency: either; OpenSearch is the Apache 2.0 safer bet.

The practical answer: "OpenSearch is the default open choice in 2026, ES is a strong choice if you're already paying Elastic." Know the license shift exists, but don't dwell on it.

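In August 2024 Elastic added AGPLv3 as a third license option for the Elasticsearch and Kibana source code, alongside SSPL and the Elastic License, so the situation is still in flux. OpenSearch, which moved under the Linux Foundation the same year, remains the safe-default fork.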


§16. Operations — snapshots, monitoring, and the heap pressure playbook

A search cluster in production runs 24/7 for years. Operations are the part nobody writes blog posts about, and the part that distinguishes a working cluster from one that pages at 2 AM every week.

16.1 Snapshots to S3

The snapshot/restore subsystem writes incremental snapshots of indices to an object store. The repository registration:

PUT /_snapshot/s3_repo
{
  "type": "s3",
  "settings": {
    "bucket": "company-es-snapshots",
    "region": "us-east-1",
    "compress": true,
    "base_path": "prod-cluster/"
  }
}

Snapshots are incremental at the segment level. The first snapshot uploads all segment files; subsequent snapshots upload only the segments created since the last one. Because segments are immutable (once written, never modified), incrementality is trivially correct: a segment is either already in S3 (skip) or new (upload).
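
The skip-or-upload decision reduces to a set difference over segment file names. A minimal sketch, assuming file names uniquely identify immutable segment files; plan_snapshot is a hypothetical helper and the file names in the usage are made up.

```python
def plan_snapshot(local_segments, repo_segments):
    """Split local segment files into those to upload and those already in the repository."""
    local, stored = set(local_segments), set(repo_segments)
    to_upload = sorted(local - stored)
    reused = sorted(local & stored)
    return to_upload, reused
```

For example, plan_snapshot(["_0.cfs", "_1.cfs", "_2.cfs"], ["_0.cfs", "_1.cfs"]) uploads only "_2.cfs". A merge changes the picture: it replaces several small segments with one new file, so the snapshot after a large merge uploads more than the volume of newly indexed data would suggest.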

A snapshot takes 1–10 minutes for a typical incremental on a multi-TB cluster, hours for the first full snapshot. Schedule via SLM (Snapshot Lifecycle Management) policies:

{
  "policy": {
    "name": "<daily-snap-{now/d}>",
    "schedule": "0 30 1 * * ?",
    "repository": "s3_repo",
    "config": {
      "indices": ["*"],
      "ignore_unavailable": true,
      "include_global_state": true
    },
    "retention": {
      "expire_after": "30d",
      "min_count": 5,
      "max_count": 50
    }
  }
}

Restore time depends on data volume. A 10 TB cluster restores from S3 in 4–8 hours at typical S3-to-NVMe bandwidth. Restoring a single index is faster — pull only that index's files.

Repository plugins: S3 (default for AWS), GCS for Google Cloud, Azure for Azure, FS for shared NFS, HDFS for on-prem Hadoop ecosystems. All implement the same incremental segment-level pattern.

16.2 Monitoring — the four golden signals

Every search cluster needs alerts on four families:

(1) Cluster health. _cluster/health returns status green (all primaries and replicas assigned), yellow (primaries assigned, some replicas missing), red (some primaries unassigned). Yellow is sustainable temporarily during recovery; red means part of the data is unavailable to users until those shards are recovered.
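
The three colors follow mechanically from shard states. A minimal sketch, assuming a simplified shard record with a primary flag and a state string; the record shape and cluster_status are illustrative, not the health API's response format.

```python
ACTIVE_STATES = {"STARTED", "RELOCATING"}


def cluster_status(shards):
    """shards: list of dicts with 'primary' (bool) and 'state' (str)."""
    inactive = [s for s in shards if s["state"] not in ACTIVE_STATES]
    if any(s["primary"] for s in inactive):
        return "red"  # at least one primary is not serving
    if inactive:
        return "yellow"  # all primaries serve, some replica does not
    return "green"
```

This is why a single-node cluster with replicas configured sits permanently at yellow: the replicas have nowhere to go.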

(2) Shard allocation. Are shards balanced across nodes? Is there a node holding 2× the average shard count? Has allocation stalled due to disk watermarks (cluster.routing.allocation.disk.watermark.low/high/flood_stage)?

(3) JVM heap pressure. Per-node JVM heap usage. Target: below 75% of max heap after old-gen GC. Above 85% sustained, GC pauses lengthen and queries time out. The data-nodes-hit-75%-heap remediation playbook:

  • Check fielddata cache: GET /_nodes/stats/indices/fielddata. If high, set indices.fielddata.cache.size to limit it.
  • Check segment count: aggressive refresh creates too many small segments, each carrying fixed heap overhead. Lengthen the refresh interval.
  • Check query patterns: deep pagination (from: 10000) buffers all results in heap. Switch to search_after.
  • Check aggregations: large cardinality aggregations and terms with size: 100000 chew through heap. Set sensible size limits.
  • Last resort: add data nodes to spread load.
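
The search_after switch mentioned in the list can be illustrated with an in-memory model: each page keeps only size entries, and the last hit's sort values become the cursor for the next page. The (ts, id) sort key and search_page are assumptions for illustration; the real API works per shard over Lucene doc values.

```python
import heapq


def sort_key(doc):
    """Sort by timestamp with the ID as a tiebreaker, so the cursor is unique."""
    return (doc["ts"], doc["id"])


def search_page(docs, size, search_after=None):
    """Return (page, cursor). Holds at most `size` docs at a time."""
    candidates = (
        d for d in docs
        if search_after is None or sort_key(d) > tuple(search_after)
    )
    page = heapq.nsmallest(size, candidates, key=sort_key)
    cursor = list(sort_key(page[-1])) if page else None
    return page, cursor
```

Contrast with from/size: reaching offset 10000 that way makes every shard collect 10000 + size entries before discarding most of them.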

(4) GC pauses. A 10-second old-gen GC pause = a 10-second query stall. Monitor via GET /_nodes/stats/jvm. Pauses > 1s/min are warnings; pauses > 5s are pages. Modern G1GC (default since ES 7+) is mostly self-tuning; ZGC has been an option for low-pause workloads since ES 8 with sub-10ms pauses but higher CPU cost.

16.3 The "data nodes hit 75% heap" playbook

Pager goes off at 2 AM: jvm.mem.heap_used_percent > 85 on three data nodes. The investigation:

Step 1. Quick stabilize.
  - Throttle ingest if writes are spiking: pause one Kafka consumer group.
  - Disable expensive cron jobs (large _reindex, _update_by_query).
  - Lengthening the refresh interval doesn't help in the moment: the
    segments are already created.

Step 2. Diagnose. Hit /_nodes/stats?metric=jvm,indices.
  - fielddata.memory_size_in_bytes? If high, someone aggregated on a
    `text` field (uses fielddata, evil). Fix: change to keyword, and cap
    the cache with `indices.fielddata.cache.size: 10%`.
  - segments.count high per node? Too many small segments. Cause: too
    aggressive refresh, or merge stuck. Check merge stats; consider
    bumping merge scheduler threads.
  - circuit_breakers tripping? Engine is preventing OOM at the cost of
    failed requests.
  - Per-shard memory: which index is the hog? Often a single misbehaving
    index dominates.

Step 3. Mitigate.
  - Force-merge specific old indices (NOT live-writing ones) to reduce
    segment count.
  - Roll a node restart on the worst offender (full heap reclaim).
  - Move shards off the worst nodes via temporary allocation rules.

Step 4. Post-mortem and structural fix.
  - Did mapping explosion cause the heap pressure? (§18)
  - Was someone aggregating on text fields? Fix the application.
  - Was bulk index too aggressive? Cap bulk size in the indexer.
  - Are we sized correctly? If average heap is consistently >70%,
    add capacity.

The structural fixes matter more than the immediate. A cluster that hits 85% heap nightly will eventually have a bad night.
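
The diagnose step can be turned into a first-pass triage function. This sketch works on a flattened, hypothetical per-node summary (not the raw node stats response), and the segment-count threshold is an arbitrary illustration that should be replaced with a baseline from your own cluster.

```python
def triage(node, fielddata_fraction=0.10, max_segments=2000):
    """Return human-readable findings for one node summary.

    node: dict with heap_used_percent, heap_max_bytes, fielddata_bytes,
          segment_count, breaker_trips (hypothetical flattened shape).
    """
    findings = []
    if node["heap_used_percent"] > 85:
        findings.append("heap above 85%: stabilize first")
    if node["fielddata_bytes"] > fielddata_fraction * node["heap_max_bytes"]:
        findings.append("fielddata high: look for aggregations on text fields")
    if node["segment_count"] > max_segments:
        findings.append("segment count high: check refresh interval and merges")
    if node["breaker_trips"] > 0:
        findings.append("circuit breakers tripping: requests are being rejected")
    return findings
```

Running this across all data nodes and grouping by finding usually shows whether the problem is one hot node or a cluster-wide pattern.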

16.4 Shard allocation insight

GET /_cluster/allocation/explain is the canonical debugging tool when a shard refuses to allocate. It returns a structured reason: disk watermark exceeded, awareness violation, node failure history, deciders rejecting. The shape of typical answers:

  • "no_attempt: no allocation can be performed because of forced awareness constraints (zone)." Add a node in the missing zone.
  • "the shard cannot be allocated because of disk usage above the high watermark." Move shards or expand disk.
  • "the node is currently throttled with concurrent recoveries." Wait or raise cluster.routing.allocation.node_concurrent_recoveries.

A red cluster is almost always solved by reading allocation/explain for an unassigned shard.


§17. Cost optimization — Lucene's storage cost and how to fight it

Lucene's inverted index, doc values, stored fields, and FST collectively use ~3× the raw JSON size on disk. 1 TB of JSON documents becomes a 3 TB Lucene index. The breakdown:

  • Stored fields (_source): original JSON, LZ4 compressed. Roughly 0.5–1× of raw JSON.
  • Inverted index (terms + postings + positions): roughly 0.5–1× of raw text.
  • Doc values (columnar per field): roughly 0.3–0.6× of raw structured data.
  • Norms, term vectors, segment metadata: small constant overhead.

The multiplier scales with how many fields are indexed, how many are stored, whether positions are kept. Aggressive optimization can cut to 1.5×; default settings land at 3×.

17.1 Source filtering

By default _source includes the entire original JSON. For a doc with 200 fields where only 10 are displayed, this is wasteful. Two ways to trim:

(a) _source filtering at query time: return only the listed fields:

{ "_source": { "includes": ["title", "price"] }, ... }

Saves bandwidth; doesn't save disk.

(b) _source field filtering at index time — exclude fields from being stored at all:

{ "_source": { "excludes": ["raw_html", "log_blob"] }, ... }

Saves disk. Trade-off: those fields can't be reindexed in place (no source to read from) — must re-fetch from primary. Acceptable for log fields that are pure noise.

17.2 Disable _source for log-only data

The most aggressive optimization: _source: { enabled: false }. Original JSON not stored at all. Inverted index and doc values remain — search and aggregate still work. Trade-offs:

  • Cannot retrieve the original doc. You see hits' IDs, scores, and any explicitly store: true fields, but not the doc body.
  • Cannot reindex from this index. Future schema changes require re-ingesting from the source.
  • Highlighting requires store: true on specific fields (snippet extraction needs the text).

For pure log search where the source is the application (which never queries back to ES for the source) and the doc body has no display path, disabling _source cuts ~30–50% of disk. Datadog and similar log-shape workloads use this.

17.3 Force-merge fewer segments for cold tier

A 50 GB index might have 20 segments at hot tier. Each segment has fixed overhead (FST tip in heap, file handles, metadata). At rest in warm/cold tier, you don't need fast writes — collapse to one segment:

POST /logs-2025.04.15/_forcemerge?max_num_segments=1

Trade-off: takes minutes to hours, monopolizes I/O. Run during off-peak; one shard at a time. Always done at ILM warm/cold transition; never on hot tier with live writes.

Effects:

  • Heap usage drops (single FST tip instead of 20).
  • Query latency drops (fewer segments to visit).
  • Disk usage drops 10–30% (better compression on larger segments; less tombstone overhead).
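  • The merge call runs until it completes and competes with any concurrent indexing for I/O, which is one more reason to run it only on indices that no longer take writes.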

17.4 Codec choices

Lucene supports multiple codecs. The default LZ4 compression on stored fields is fast and decent. Switching to best_compression (DEFLATE):

{ "settings": { "index": { "codec": "best_compression" }}}

Cuts _source size 40–60%. Costs 10–30% query latency on retrieval (decompression). Worth it for log workloads; not for low-latency e-commerce.

Beyond the default codecs, newer Elasticsearch and OpenSearch releases offer Zstandard (zstd) based stored-field compression (faster decompression than DEFLATE at a similar ratio), and Lucene lets the postings and doc-values formats be chosen per field. Specialization at the field level is the way teams squeeze the last 10–20%.

17.5 Cost in real numbers

A LinkedIn-style 1B profile index:

  • Raw JSON per profile: ~3 KB. Total: 3 TB raw.
  • Default ES: ~9 TB. At $300/TB-month for NVMe SSD, that's $2700/month/copy. With 3 copies (primary + 2 replicas): $8100/month.
  • With _source excluding raw text fields (kept only displayable): drops to ~6 TB. $5400/month with 3 copies.
  • With best_compression on _source: drops to ~5 TB. $4500/month.
  • With ILM moving 30-day-stale profiles to warm tier: hot 1 TB, warm 5 TB. (1 × $900 + 5 × $300) = $2400/month.
  • With ILM cold tier on S3 for inactive profiles: hot 0.5 TB, warm 1.5 TB, cold 4 TB. ($450 + $450 + $200) = $1100/month.

~7× total cost reduction ($8100 → $1100) without changing query semantics. Hot-tier-only is the lazy expensive option; aggressive ILM is where the savings live.
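
The steps above can be recomputed with a small script. The per-TB rates are back-derived from the bullets' own arithmetic and include all copies (hot $900 = 3 × $300, warm $300, cold $50); they are assumptions of this sketch, not quoted prices.

```python
HOT, WARM, COLD = 900, 300, 50  # USD per TB-month, all copies included


def monthly_cost(hot_tb=0.0, warm_tb=0.0, cold_tb=0.0):
    return hot_tb * HOT + warm_tb * WARM + cold_tb * COLD


scenarios = {
    "default": monthly_cost(hot_tb=9),
    "trimmed _source": monthly_cost(hot_tb=6),
    "best_compression": monthly_cost(hot_tb=5),
    "hot + warm": monthly_cost(hot_tb=1, warm_tb=5),
    "hot + warm + cold": monthly_cost(hot_tb=0.5, warm_tb=1.5, cold_tb=4),
}

for name, cost in scenarios.items():
    print(f"{name:>18}: ${cost:,.0f}/month")
print(f"reduction: {scenarios['default'] / scenarios['hot + warm + cold']:.1f}x")
```

The ratio between the first and last scenario comes out near 7.4, and most of it comes from moving data off the hot tier rather than from shrinking it.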


§18. The mapping explosion problem

This was sketched briefly in the original "hard problems" section; here's the full deep dive because it's the canonical reason production search clusters page their on-call.

18.1 What it is

Search engines materialize per-field metadata. For each unique field name, the cluster state holds: the field type, the analyzer, the doc values setting, whether it's indexed, stored, has term vectors. The cluster state is replicated to every node; every change is broadcast.

If your ingest pipeline allows arbitrary field names from untrusted input — user-supplied JSON, log lines with structured but uncontrolled keys, third-party APIs you don't control — you can blow this up.

18.2 A realistic failure timeline

Day 1.  A new microservice deploys. Its logs include fields like
        attr.user_id_3f7b29a1: "..."
        attr.session_abc: "..."
        ...one new field name per call.

Day 1 evening.  Cluster has 5000 unique fields in the logs index.
                Mapping is 200 KB. No alerts.

Day 3.  100k unique fields. Mapping is 5 MB. Cluster state propagation
        slows. Background "publish cluster state" tasks take seconds
        instead of milliseconds. Coord nodes start logging warnings.

Day 5.  300k fields. Cluster state is 20 MB. Every node holds a copy;
        every change rebroadcasts. Heap pressure rising on data nodes
        (each new mapping change allocates a new full cluster state on
        the JVM heap; old ones are GC'd; allocation pressure spikes).

Day 7.  500k fields. Master OOM during a routine snapshot operation
        that tries to serialize the full cluster state. Master dies,
        election fires, new master tries to do the same thing, also
        OOMs. Cluster red. All ingest stopped.

Recovery.  Manually clean up: delete the offending indices so their
           mappings drop out of the cluster state, restart the cluster.
           Hours of downtime. Postmortem reveals the bad ingester.

This pattern is so common that every search-ops team has a story.

18.3 Mitigations

(1) dynamic: strict on indices accepting external data. Documents with unknown fields are rejected at ingest. Ingester sees the error, you investigate. No silent expansion.

{ "mappings": { "dynamic": "strict", "properties": { ... } }}

The downside: every new legitimate field requires a mapping update before docs flow. Workable if your schema is stable and changes are coordinated; painful if devs frequently add fields.

(2) dynamic: false. New fields are accepted into _source but NOT indexed. Search on them silently returns nothing; they're invisible. Use when you want to keep the raw data but never search it.

(3) Dynamic templates. Define rules for how new fields are auto-mapped:

{
  "mappings": {
    "dynamic_templates": [
      { "tags_as_keyword": {
          "match_mapping_type": "string",
          "match": "tag_*",
          "mapping": { "type": "keyword", "ignore_above": 256 }
      }},
      { "metric_as_float": {
          "match": "metric_*",
          "mapping": { "type": "float" }
      }}
    ]
  }
}

Permits new fields matching patterns; gives them consistent settings. Combine with total_fields.limit to cap blast radius.

(4) index.mapping.total_fields.limit (default 1000). Hard cap on field count. Once hit, new fields are rejected. Tune to your actual schema size with headroom (a 200-field schema might cap at 500). Set this on every index from day 1.

(5) Flattened fields. For genuinely free-form key-value data, use flattened:

{ "properties": { "attributes": { "type": "flattened" }}}

A document { "attributes": { "user_id": "abc", "session": "xyz" }} indexes both keys and values into a single internal field. Search via attributes.user_id: abc. Trade-off: less expressive scoring on sub-fields, no per-key tokenization. Right tool for variable user-supplied JSON.

(6) Two-index pattern for structured + free-form. Stable fields go in a normal index with strict mapping. Free-form data goes in a separate index with flattened or dynamic: false. Queries that need both join in the application layer.

The lesson is: never allow untrusted input to control field names. Always have one of the above guardrails. If you're operating a SaaS log search where customers can emit any field they want, you need to enforce per-customer field-count limits and reject when exceeded.
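
A per-customer field budget of the kind described can be sketched as an ingest-side gate that counts distinct leaf field paths and rejects documents that would exceed the limit. FieldBudget and field_paths are hypothetical names; a production version would persist the counts and treat arrays of objects explicitly.

```python
def field_paths(doc, prefix=""):
    """Yield dotted leaf field paths of a nested JSON-like dict."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from field_paths(value, path + ".")
        else:
            yield path


class FieldBudget:
    def __init__(self, limit):
        self.limit = limit
        self.seen = {}  # customer -> set of field paths admitted so far

    def admit(self, customer, doc):
        """Admit the doc only if its new fields fit in the customer's budget."""
        known = self.seen.setdefault(customer, set())
        new = set(field_paths(doc)) - known
        if len(known) + len(new) > self.limit:
            return False  # reject before the engine ever sees the new fields
        known |= new
        return True
```

Rejecting at the gate keeps the failure local to one customer's pipeline instead of letting it reach the shared cluster state.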


§19. Specialized retrieval engines beyond Lucene

The Lucene-derived family (ES, OpenSearch, Solr) covers ~80% of search workloads. The remaining 20% — extreme scale, exotic tokenization, or radical cost requirements — leaves Lucene behind. The current landscape of alternatives:

19.1 Vespa — ranking + retrieval in one engine

Vespa was Yahoo's internal search/ad-serving platform, open-sourced in 2017. Its key architectural differentiator: the ML model runs inside the engine. Vespa has a native tensor type with operations (matrix multiply, ReLU, attention) that the engine knows how to score. A query plan can include retrieval (find candidates) AND ranking (score them with a learned model) in one pass on a data node, with no external service call.

Query → match-phase (BM25 + filters, in-engine)
     → first-phase ranking (cheap function, top 2000)
     → second-phase ranking (heavy model evaluating tensors, top 10)
     → return

Every phase runs on the content node holding the docs. No round-trip to a model server. This is decisive for ML-heavy workloads with <100ms budgets — Yahoo Mail, Yahoo Shopping ad ranking, increasingly RAG retrieval-with-rerank at companies that need tighter latency than ES + sidecar reranker can deliver.
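
The phased pipeline can be sketched generically: a cheap function orders all matched candidates, and only the best few are passed to the expensive model. This is a plain-Python illustration of the idea, not Vespa's rank-profile syntax; rank, first_phase and second_phase are hypothetical names.

```python
def rank(candidates, first_phase, second_phase, rerank_count=2000, top_k=10):
    """Two-phase ranking: cheap score over everything, heavy score over the shortlist."""
    shortlist = sorted(candidates, key=first_phase, reverse=True)[:rerank_count]
    return sorted(shortlist, key=second_phase, reverse=True)[:top_k]
```

The expensive function runs at most rerank_count times per query regardless of how many documents matched, which is what keeps a heavy model inside a tight latency budget. The cost of the shortcut is recall: a document the cheap phase ranks below the cutoff is never seen by the model.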

Trade-offs: smaller ecosystem, steeper learning curve, less hosted-managed-service availability (Vespa Cloud is the main hosted option). Operations require more Vespa-specific expertise than ES does.

When to leave Lucene for Vespa: ranking is the product, latency budget is sub-100ms, model is heavy (cross-encoder, neural reranker, or tensor-based scoring), and the team is willing to invest in Vespa expertise. Not a casual upgrade.

19.2 Quickwit

Quickwit builds on Tantivy (a Lucene-inspired search library written in Rust) with two architectural twists:

  1. Compute and storage are separated. Indexes live on S3 (or any object store), not on local disk. Queries read from S3 directly via prefetch and caching. No "data node" with hot data; nodes are stateless.
  2. Built for high-ingest, time-series. Streaming ingestion is first-class. Indexes are partitioned by time bucket; old buckets are cheap S3 objects.

Result: log search at $20/TB-month storage (S3) instead of $300/TB-month (NVMe). For workloads where retention dominates cost — Binance, BlaBlaCar, large observability vendors — Quickwit is 5–10× cheaper than ES at equivalent retention. Query latency is higher (S3 fetch is hundreds of ms instead of microseconds page-cache), but for log search where the query rate is engineer-driven and not user-facing, the trade is fine.

When to leave Lucene for Quickwit: log search at multi-petabyte retention where cost dominates. Don't use for user-facing low-latency search (latency budget too tight for S3 fetches).

19.3 Tantivy — Rust Lucene equivalent

Tantivy is a from-scratch Rust search library modeled on Lucene's design; its on-disk files are not interchangeable with Lucene's. It follows the same pattern: an FST term dictionary, block-compressed delta-encoded postings, and the same segment-merge lifecycle. The difference is the implementation language: no JVM, no garbage collection, lower memory overhead per index.

Tantivy is a library, not a distributed service. It powers:

  • Quickwit — distributed Tantivy on S3 (above).
  • Custom embedded search in Rust applications.

Meilisearch is often grouped with it as a lightweight Rust site-search engine (lighter operationally than ES, sub-50ms p99 at <100M doc scale), but it runs on its own engine rather than on Tantivy.

When to leave Lucene for Tantivy: embedded search in a Rust app (in-process, no JVM tax), or operating at smaller scales where ES/OS feels heavy. Not a drop-in for production distributed search at scale.

19.4 Zoekt

Zoekt was built at Google for internal code search; now powers Sourcegraph for self-hosted users. The differentiator: a trigram inverted index over source code, designed for substring search ("function", "getUserByEmail(") and regex.

Lucene's word-level tokenization is wrong for code. getUserByEmail is one token to Lucene (or with a CamelCase tokenizer, split into get, User, By, Email — also wrong for substring search). Zoekt indexes every overlapping 3-character sequence: get, etU, tUs, Use, ... A regex query gets candidates via trigram intersection, then verifies the regex against matching files.
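A minimal sketch of the trigram technique, covering plain substring search only. It is not Zoekt's implementation: the class and method names are invented for illustration, and regex support (which needs the required trigrams extracted from the pattern first) is left out.

```python
from collections import defaultdict
from typing import Dict, List, Set


def trigrams(text: str) -> Set[str]:
    """Every overlapping 3-character window of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)  # trigram -> file paths
        self.files: Dict[str, str] = {}                        # path -> content

    def add(self, path: str, content: str) -> None:
        self.files[path] = content
        for gram in trigrams(content):
            self.postings[gram].add(path)

    def search_substring(self, needle: str) -> List[str]:
        grams = trigrams(needle)
        if grams:
            # Candidate files must contain every trigram of the needle.
            candidates = set.intersection(*(self.postings.get(g, set()) for g in grams))
        else:
            # Needles shorter than 3 characters have no trigram: scan all files.
            candidates = set(self.files)
        # Verify: sharing all trigrams is necessary, not sufficient.
        return sorted(p for p in candidates if needle in self.files[p])


index = TrigramIndex()
index.add("user.go", "func getUserByEmail(email string) {}")
index.add("lookup.go", "func getUserByID(id int) {}")
index.add("noise.txt", "baba")
hits = index.search_substring("getUserBy")
print(hits)
```

The verification step matters: "noise.txt" contains both trigrams of "abab" (aba, bab) but not the string itself, so it is a candidate that verification discards.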

When to leave Lucene for Zoekt: code search where users want substring and regex matching. Not for English text where word-level tokenization is correct.

19.5 GitHub Blackbird

Blackbird is GitHub's from-scratch global code search engine, replacing the Lucene/Solr-based predecessor. 200B documents, 640 TB hot index, sub-2-second p99 for substring queries across all of public GitHub.

Architectural notes (from GitHub's engineering blog):

  • Custom n-gram inverted index (similar conceptually to Zoekt but at planet scale).
  • Sharded by Git blob object ID, which spreads documents evenly across shards and avoids indexing identical content twice.
  • Tightly coupled to Git's object format — indexes Git objects directly rather than extracted files.
  • Built in Rust for memory efficiency and tail-latency control.

Not open source. The lesson: when Lucene's word-level model is fundamentally wrong for your workload AND you're at hundreds-of-billions scale, building a from-scratch engine is justified. Most teams won't be at that scale.

19.6 When to leave Lucene — the decision

Default                                       → ES / OpenSearch (Lucene)
Ranking is the product, low-latency ML rerank → Vespa
Log search at PB-retention, cost dominates    → Quickwit / Loki
Embedded search in Rust app                   → Tantivy / Meilisearch
Code search (substring, regex)                → Zoekt / Blackbird
Label-filtered logs, grep-friendly            → Loki
Hosted small-to-mid site search               → Algolia / Meilisearch

The vast majority of new search projects in 2026 stay on Lucene-derived engines. The specialized engines exist because someone hit a wall — usually around cost, tokenization, or ranking-in-engine — that the default couldn't solve at their scale.


§20. Hard problems inherent to this technology

20.1 Relevance vs freshness — when stale signals corrupt fresh retrieval

Problem. Multiple ranking signals refresh at different cadences: keyword (~1s), filter fields (same), vector embeddings (batch, hours), LTR features (online, seconds-to-minutes). Mixing them gives incoherent results.

Illustration — e-commerce. A merchant drops a water bottle's price from $35 to $19. The keyword index sees the change immediately. The vector embedding (computed offline on the description) hasn't changed, so the bottle ranks where it did pre-discount. Net: discount lands in the index but doesn't appear in the ranking until the next batch.

Naïve fix. Cut refresh_interval to 100ms and hope. Why it breaks: a 100ms refresh overwhelms the merge process (§4.4). Each shard can emit up to 10 new segments per second, or 36,000 per hour; the merger can't keep up and query latency degrades. And the embedding is still stale.

Real fix. Decompose freshness per signal:

  • Keyword / filter freshness: tied to refresh_interval. 1–5 seconds.
  • Vector embedding refresh: route freshly-edited docs through a streaming embedding service that pushes new embeddings into the same index path. Fall back to batch when streaming is delayed.
  • LTR (Learning to Rank) features: separate online feature store keyed by entity ID. Reranker pulls fresh features at query time, NOT from the index. Decouples model freshness from index freshness.

20.2 Sharding for fanout cost — more shards is not always more performance

Problem. Growing index pushes per-shard size past 50 GB. Instinct is to bump shard count.

Illustration — observability log search. Daily log volume grows 50 TB → 500 TB. Per-day index goes from 10 shards × 5 TB to 100 shards × 5 TB. Query p99 doubles. Why: every search fans out to 100 shards, and the chance that at least one shard is slow is 1 - (1-p)^100. With a 1% per-shard chance of exceeding 50ms, that is a 63% chance the whole query exceeds 50ms.
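The fanout arithmetic is easy to check numerically. The sketch below assumes shard latencies are independent, which real clusters only approximate; the function name is illustrative.

```python
def p_any_shard_slow(per_shard_slow: float, fanout: int) -> float:
    """Probability that at least one of `fanout` shards exceeds the latency bound."""
    return 1.0 - (1.0 - per_shard_slow) ** fanout


for fanout in (5, 10, 100):
    print(fanout, round(p_any_shard_slow(0.01, fanout), 3))
```

At a 1% per-shard miss rate, a fanout of 100 gives roughly 0.63, while a fanout of 5 to 10 stays under 0.10, which is the case for limiting fanout rather than adding shards.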

Real fix. Routing partitioner to limit fanout. Log search partitions by (tenant, hour) (fanout 100 → 5–10). Profile search routes by searcher's network cluster (a "search within network" query hits ~10 shards instead of all 100). E-commerce routes by region or category when present.
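One way to picture a routing partitioner is a stable hash from the routing key to a small, fixed group of shards. The sketch below is a generic illustration, not any engine's built-in routing; the function name, the (tenant, hour) key format, and the group size of 5 are assumptions.

```python
import hashlib
from typing import List


def shards_for(tenant: str, hour_bucket: str,
               total_shards: int = 100, group_size: int = 5) -> List[int]:
    """Map a (tenant, hour) key to one contiguous group of shards."""
    key = f"{tenant}|{hour_bucket}".encode("utf-8")
    # A cryptographic hash gives a value that is stable across processes and runs.
    digest = hashlib.sha256(key).digest()
    group_count = total_shards // group_size
    group = int.from_bytes(digest[:8], "big") % group_count
    start = group * group_size
    return list(range(start, start + group_size))


targets = shards_for("acme", "2026-01-01T10")
print(targets)
```

Both the indexer and the query coordinator must use the same function: writes for a key land on its group, and a query scoped to that key fans out to 5 shards instead of 100.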

Alternative: two-tier (hot/cold) indexing. Recently-active entities on a small hot-tier cluster (5–10 shards). Rest on a larger cold-tier (100 shards). 90%+ of queries serve from hot. Galene and Earlybird pattern.

20.3 Hot keys and hot queries

Problem. A few terms appear in a huge fraction of docs (or queries). Scoring every match is impossibly expensive.

Illustration — code search. A user types function into Sourcegraph. The term appears in tens of millions of files. Naïve scoring touches them all.

Naïve fix. Cache top-20 results for "function" with a 60s TTL. Why it breaks: personalization defeats the cache (each user gets different results based on repos and permissions), and the cache lags on index rebuild.

Real fix. Don't score everything. Early termination on a sorted index: maintain postings sorted by a static prior (recency, repo popularity, stars). For high-recall queries, score only top M = 10000 docs per shard. Prove via WAND or Block-Max WAND that no doc beyond M can beat the current top-K. WAND uses upper bounds on per-term scores to skip docs whose maximum possible score is below the current threshold. For "function," might mean scoring 50k docs instead of 50M.
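The core idea of early termination on a sorted index can be shown with a deliberately simplified model. This is not WAND or Block-Max WAND: it assumes a single posting list sorted by static prior, a score of the form prior + text_score, and one global upper bound on text_score. All names are illustrative.

```python
import heapq
from typing import List, Tuple

Posting = Tuple[int, float, float]  # (doc_id, static_prior, text_score)


def top_k_early_terminate(postings: List[Posting], max_text_score: float,
                          k: int) -> Tuple[List[Tuple[float, int]], int]:
    """postings must be sorted by static_prior, highest first."""
    heap: List[Tuple[float, int]] = []  # min-heap of (score, doc_id)
    scored = 0
    for doc_id, prior, text_score in postings:
        # Priors only decrease from here on, so if even the best possible
        # text score cannot beat the current k-th result, stop.
        if len(heap) == k and prior + max_text_score <= heap[0][0]:
            break
        scored += 1
        score = prior + text_score
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True), scored


# Synthetic postings: prior falls by 1 per doc, text score cycles through 0..9.
postings = [(i, float(10_000 - i), float((i * 37) % 10)) for i in range(10_000)]
results, scored = top_k_early_terminate(postings, max_text_score=9.0, k=3)
print(results, scored)
```

On this synthetic list only a handful of the 10,000 postings are scored before the bound proves that no later document can enter the top 3. WAND generalizes the same bound check to multiple terms with per-term upper bounds.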

For non-personalized aggregations ("how many files contain function"), cache counts, not result lists.

20.4 Mapping explosion — a single bad ingester field takes the cluster down

Problem. Search engines materialize per-field metadata. Untrusted input that explodes field count blows the cluster state.

Illustration — observability log search. A new microservice emits logs with fields named attr.<random_uuid>: <value>. Each line introduces a new field name. After a week, the cluster has 500k unique fields. The engine tracks every field in the index mapping and in segment metadata; the mapping is part of the cluster state, which the elected master publishes to every node, and that state grows to hundreds of MB; master OOMs; cluster red.

Real fix. "dynamic": "strict" on indices accepting third-party data — reject unknown fields at ingest. For genuinely free-form key-value data, use a flattened field type: the whole sub-object is indexed as one field with keys preserved at query time. Set index.mapping.total_fields.limit (default 1000) and fail loudly when exceeded.
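A sketch of what such an index definition could look like, written as a Python dict that serializes to an Elasticsearch-style create-index body. The field names (timestamp, message, attr) are hypothetical; the three settings shown are the ones named above.

```python
import json

index_body = {
    "settings": {
        # Hard cap on mapped fields; indexing fails loudly once it is exceeded.
        "index.mapping.total_fields.limit": 1000,
    },
    "mappings": {
        # Reject documents that carry fields not declared below.
        "dynamic": "strict",
        "properties": {
            "timestamp": {"type": "date"},
            "message": {"type": "text"},
            # Free-form key-value pairs live under one mapped field,
            # so attr.<random_uuid> keys never create new mappings.
            "attr": {"type": "flattened"},
        },
    },
}

print(json.dumps(index_body, indent=2))
```

With this shape, the number of mapped fields is fixed at three no matter how many distinct keys appear under attr.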

20.5 Eventual consistency vs the primary — silent divergence

Problem. Index drifts from the primary store and you don't notice until a user complaint.

Illustration — e-commerce inventory. A product goes out of stock. Catalog DB commits. CDC event into Kafka. Indexer pulls. Bulk write fails 429 (too many requests). Indexer retries; succeeds. During the failure window, the index served stale "in stock" results; users placed orders the inventory system then rejected.

Naïve fix. Retry until accepted; monitor indexer lag. Why it breaks: at-least-once + non-idempotent writes = duplicates or out-of-order (a retry might overwrite a newer edit), and silent divergence is undetectable from inside the engine.

Real fix. Idempotent bulk writes via external_version (in Elasticsearch, version_type=external with version set to the source LSN): the engine rejects any write whose version is not higher than the one it already holds, so out-of-order retries cannot clobber newer data. Periodic reconciliation: an hourly job samples 0.1% of IDs and alerts on divergence > 0.01%. Treat the engine as a strict materialization: features requiring authoritative data read from the primary. Business-invariant features (inventory, privacy) enforce at the API layer with a live primary read, not on an indexed-field filter.
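The version check and the reconciliation sample can be modeled in memory. This is a toy stand-in for the engine, not a client for any real API: the class, the status codes it returns, and the helper names are assumptions chosen to mirror the behavior described above.

```python
from typing import Any, Dict, Iterable, Tuple


class VersionedIndex:
    """Accepts a write only if its version is higher than the stored one."""

    def __init__(self) -> None:
        self.docs: Dict[str, Tuple[int, Any]] = {}  # id -> (version, source)

    def index(self, doc_id: str, version: int, source: Any) -> int:
        current = self.docs.get(doc_id)
        if current is not None and version <= current[0]:
            return 409  # stale or duplicate write: rejected, safe to drop
        self.docs[doc_id] = (version, source)
        return 201 if current is None else 200


def divergence_rate(primary: Dict[str, Any], index: VersionedIndex,
                    sample_ids: Iterable[str]) -> float:
    """Fraction of sampled IDs whose indexed source differs from the primary."""
    ids = list(sample_ids)
    mismatched = sum(
        1 for i in ids if index.docs.get(i, (0, None))[1] != primary.get(i)
    )
    return mismatched / len(ids)


idx = VersionedIndex()
first = idx.index("sku-1", version=7, source={"in_stock": False})  # newer edit arrives first
late = idx.index("sku-1", version=5, source={"in_stock": True})    # delayed retry of an older edit
primary = {"sku-1": {"in_stock": False}}
rate = divergence_rate(primary, idx, ["sku-1"])
print(first, late, rate)
```

The delayed retry is rejected with a conflict and the index keeps the newer state, which is what makes at-least-once delivery safe.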

20.6 Bulk reindex without downtime — the alias swap pattern

Problem. Mappings are forward-only. Changing an analyzer (adding synonyms, switching language tokenizers) requires a rebuild.

Illustration — multi-lingual product search. A retailer adds Japanese to a previously English-only catalog. The Japanese tokenizer (Kuromoji) requires re-tokenizing all documents. "Stop writes" isn't an option in production.

Real fix. Two-phase rebuild with alias swap:

  1. Create products_v4 with new analyzer. Empty.
  2. Indexer dual-writes to products_v3 and products_v4 simultaneously.
  3. Backfill products_v4 from the primary with a batch indexer. Uses external_version so a stale backfill record can't clobber fresher live edits.
  4. Once backfill is complete and lag is zero, swap the alias: products → products_v4. Atomic, no downtime.
  5. After soak (a few days), delete products_v3.
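Step 4 above can be sketched as a single request body. The helper below only builds the JSON for an Elasticsearch-style alias update (sent to the _aliases endpoint); the helper name is hypothetical, and the alias and index names are the ones from the steps above.

```python
import json
from typing import Any, Dict


def alias_swap_body(alias: str, old_index: str, new_index: str) -> Dict[str, Any]:
    """Both actions travel in one request, so readers never see the alias missing."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = alias_swap_body("products", "products_v3", "products_v4")
print(json.dumps(body, indent=2))
```

Because queries address the alias rather than a concrete index, rolling back is the same request with the two index names exchanged, which is why products_v3 is kept through the soak period.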

Applies identically to LinkedIn profile schema migrations, GitHub repo re-indexing, log roll-overs, Booking.com facet evolution.


§21. Failure mode walkthrough

21.1 Shard primary loss mid-write

Primary fsyncs translog op, dies before forwarding to replicas. Master detects via heartbeats, promotes an in-sync replica. Client either got an ack (primary fully forwarded before crash) or not (no ack, retries). With idempotent external_version, retry is safe. Durability point: translog fsync on each in-sync replica; primary doesn't ack until replicas fsync.

21.2 Shard primary loss between writes

Network blip drops primary heartbeat for 30s; master demotes, promotes replica. Demoted primary, on rejoin, queries new primary for current seq_no watermark. Replays local ops with seq_no > watermark; idempotent ones no-op. Resumes as replica. Durability point: seq_no per-shard monotonic, stored in translog.

21.3 Master death — split-brain in a 2-master cluster

Two master-eligible nodes, network partition: each elects itself; conflicting cluster states; shard assignments diverge. Real fix — quorum. Always run odd number ≥ 3. Voting config requires majority. 1-vs-2 partition leaves the minority unable to elect, read-only until heal. Modern ES Zen 2 is essentially Raft for cluster state; Solr uses ZooKeeper. Pre-ES-7, discovery.zen.minimum_master_nodes = (N/2) + 1 had to be set explicitly — the canonical split-brain bug.
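The quorum rule is small enough to state as code. The function names are illustrative; the formula is the (N/2) + 1 majority from the paragraph above, using integer division.

```python
def quorum(master_eligible: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return master_eligible // 2 + 1


def can_elect(reachable: int, master_eligible: int) -> bool:
    """True if one side of a partition sees enough nodes to elect a master."""
    return reachable >= quorum(master_eligible)


for n in (2, 3, 5):
    print(n, quorum(n))
```

With three master-eligible nodes, the side holding two can elect and the side holding one cannot. With two nodes the majority is also two, so a 1-vs-1 partition elects nobody: safe but unavailable, which is why two master-eligible nodes buy nothing over one.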

21.4 Bulk index failure mid-batch

Bulk of 500 docs returns HTTP 200 with per-doc status: 487 succeeded, 13 returned 409 version conflict. Conflicts are expected — out-of-order CDC with external_version lower than current. Log, increment counter, proceed. NO retry.

If 13× 503: retry those specifically with exponential backoff. Idempotent via external_version.

If whole bulk times out: re-send entire batch; version check rejects already-applied docs.

Durability point. Kafka offset committed after successful bulk. Crash before commit = next consumer starts from prior offset and replays. Kafka offsets + engine version conflicts = exactly-once application.
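The per-document handling described in this subsection can be sketched as a triage function over the bulk response. The response shape (an items list with one action key per entry and a status on each) follows the Elasticsearch bulk format; the function name, the bucket names, and the choice to treat 429 as retryable alongside 503 are assumptions.

```python
from typing import Any, Dict, List

RETRYABLE = {429, 503}


def triage_bulk_response(response: Dict[str, Any]) -> Dict[str, List[str]]:
    buckets: Dict[str, List[str]] = {"ok": [], "conflict": [], "retry": [], "failed": []}
    for item in response["items"]:
        # Each item has exactly one key naming the action (index, create, ...).
        (result,) = item.values()
        status = result["status"]
        if 200 <= status < 300:
            buckets["ok"].append(result["_id"])
        elif status == 409:
            buckets["conflict"].append(result["_id"])  # stale version: count it, do not retry
        elif status in RETRYABLE:
            buckets["retry"].append(result["_id"])     # resend with exponential backoff
        else:
            buckets["failed"].append(result["_id"])    # needs a human or a dead-letter queue
    return buckets


sample = {
    "errors": True,
    "items": [
        {"index": {"_id": "a", "status": 201}},
        {"index": {"_id": "b", "status": 409}},
        {"index": {"_id": "c", "status": 503}},
        {"index": {"_id": "d", "status": 400}},
    ],
}
buckets = triage_bulk_response(sample)
print(buckets)
```

Only after the retry bucket has drained should the consumer commit its Kafka offset; committing earlier would turn a transient 503 into a lost update.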

21.5 Permanent shard loss

All three copies of shard 42 sit in a single failure domain (cabinet, AZ — Availability Zone) and it catches fire. Recovery: with snapshots, restore + Kafka replay from snapshot timestamp. Without snapshots, full reindex of shard 42's doc range from the primary — hash(entity_id) mod K tells you which IDs. Hours, but bounded. Recoverable because the index is derived. Prevention: anti-affinity rules; never colocate primary + replicas of the same shard on the same host, rack, or AZ.

21.6 Cascading merge storm

Force-merge accidentally on a 30-shard live index; each shard tries to merge to 1 segment; disk I/O saturates; p99 goes from 80ms to 8s. Cancel force-merge tasks (they don't abort cleanly mid-merge but new merge work stops); throttle index.merge.scheduler.max_thread_count. Prevention: don't force-merge live indices. Only on roll-over indices (ILM-managed — Index Lifecycle Management) no longer written to.


§22. Why not just SQL LIKE or Postgres FTS

22.1 LIKE '%term%' — instant rejection

Leading-wildcard LIKE can't use a B-tree. Full scan: 100M rows × 2 KB = 200 GB at ~500 MB/sec = ~7 min/query. Unviable.
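The scan-time estimate is straightforward arithmetic, reproduced here with the figures from the sentence above (2 KB taken as 2,000 bytes to match the 200 GB total). The function name is illustrative.

```python
def full_scan_seconds(rows: int, bytes_per_row: int, bytes_per_second: int) -> float:
    """Time to read every row once at a given sequential throughput."""
    return rows * bytes_per_row / bytes_per_second


seconds = full_scan_seconds(100_000_000, 2_000, 500_000_000)
print(f"{seconds:.0f} s = {seconds / 60:.1f} min per query")
```

That is 400 seconds, a little under 7 minutes, for every query, before any concurrency is considered.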

22.2 Postgres tsvector with GIN — more interesting

Postgres has a real inverted index via GIN.

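CREATE INDEX idx ON products USING gin(to_tsvector('english', title || ' ' || description));
-- The WHERE expression must match the indexed expression for the GIN index to apply.
SELECT * FROM products
WHERE to_tsvector('english', title || ' ' || description) @@ to_tsquery('english', 'stainless & bottle')
ORDER BY ts_rank(to_tsvector('english', title || ' ' || description),
                 to_tsquery('english', 'stainless & bottle')) DESC
LIMIT 20;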

Works up to ~5–10M docs, single Postgres node, simple queries.

Breaks at 100M+ docs: (1) no native horizontal sharding — Citus possible, cross-partition ts_rank is awkward. (2) No BM25 — ts_rank is simpler; length normalization and TF saturation matter. (3) Analyzer extensibility painful — custom analyzers, edge-n-grams, multilingual tokenization not first-class. (4) No faceting at query time — GROUP BY after 200M candidates is slow. (5) No hybrid vector + keyword — pgvector exists but blending BM25 + vector + LTR features is awkward. (6) GIN updates expensive — fastupdate stalls on heavy writes; autovacuum pressure and table bloat at 5–10k writes/sec. (7) No NRT (Near-Real-Time) refresh control. (8) No WAND — ranking top-1000 from a 200M-doc posting list requires reading all of them.

22.3 "Use a SQL DB plus a cache" — almost never

Cache works only if queries repeat exactly. With personalization, hit rate is near zero. A cache helps for autocomplete top-K, popular queries, faceted counts on home pages — not as the engine itself.

22.4 What you're buying with Lucene / Vespa / Tantivy

Pre-built FST term dictionaries and PFOR-compressed posting lists; merge / refresh strategies optimized over 20+ years for query efficiency; a query DSL (Domain-Specific Language) with proximity, phrase, BM25, function scores, vector, geo, percolator — composable; an ecosystem of analyzers, language plugins, k-NN, LTR. Building this on Postgres = reinventing a search engine poorly.


§23. Scaling axes

23.1 Type 1 — Uniform expansion

More docs, similar query rate per doc.

Scale                 | Topology
1× (10M docs, 1k QPS) | 5 shards × 2 replicas, 3 nodes
10× (100M, 10k)       | 10 shards × 2 replicas, 5–10 nodes
100× (1B, 100k)       | 100 shards × 3 replicas, 30 data nodes, 3 masters, hot/warm tiering
1000× (10B+, 1M)      | Multi-cluster federation; per-region indices; cross-cluster search; cold tier on S3 searchable snapshots

Inflection points: 5 TB per shard cluster (must split into multiple indices — per-region, per-time-bucket, per-tenant); 100k QPS total (hot/cold tiering mandatory); 1B docs in one logical index (routing partitioner mandatory); multi-region latency requirements (federate; per-region clusters + cross-cluster search for rare global queries).

23.2 Type 2 — Hotspot intensification

Same dataset, query rate spikes (viral event, breaking news, election day, marketing push).

Pressure                                     | Fix
10× QPS on same data                         | More replicas. 3 → 6 = 2× read capacity per shard with no change to shard count (each added replica still applies every write).
Hot query repeats ("trending")               | App cache (Redis) for top ~100 trending queries. TTL 30s.
Hot term ("election", "covid", "function")   | WAND / Block-Max WAND early termination; IndexSorting by static prior.
Hot writer (bad shard key)                   | Re-shard; finer-grained routing partitioner.
Coordinator overload                         | Add dedicated coord-only nodes. Stateless; cheap.

The two axes need different fixes. Type 1 = more shards, more storage. Type 2 = more replicas, more caching, smarter early-termination.


§24. Decision matrix vs adjacent categories

Dimension | Search engine (ES/Solr/OpenSearch/Vespa/Tantivy) | Postgres tsvector / SQL FTS | Vector DB (Pinecone, Milvus, pgvector) | Specialized (Zoekt, Loki, Quickwit)
Doc count ceiling | 10B+ in one engine; trillions federated | 10M practical; 100M with effort | 100M–1B vectors | 10B+ at their tuned tier
Query model | Text + filter + facet + sort + hybrid vector | Text + filter; weak ranking | Vector similarity only | Specialized: trigram / label-only
Latency p99 (200ms) | Up to ~1B docs with care | Misses past ~10M for complex queries | Hits for vector-only; doesn't filter cleanly | Hits at their target scale
Write rate sustained | 50k–500k docs/sec/cluster | 5k–20k tx/sec single node | 1k–10k vectors/sec | 1M+ for log-tuned engines
Native faceting | First-class via doc values | Slow GROUP BY | None | Limited
Native ranking | BM25, custom, LTR plugin, in-engine tensors (Vespa) | ts_rank (basic) | Cosine / dot-product only | Custom per use case
Hybrid keyword + vector | Yes (ES 8.x, OpenSearch 2.x, Vespa native) | pgvector bolt-on | Vector-only | Some via add-ons
Operational complexity | Cluster ops, JVM tuning, mapping discipline | "It's just Postgres" | Managed often | Varies; Loki is famously simple
Cost at 1B docs | $$ | $$$ (vertical only) | $$$$ (vector RAM expensive) | $ – $$$ per engine
When to pick | Default for general text + facet + hybrid | <10M docs, no facet/personalization | Semantic similarity dominates | Tokenization diverges from word-level English

Thresholds: <10M docs and simple queries → Postgres tsvector. 10M–1B docs with text + facet + filter → Lucene-derived. 1B+ docs with ranking as the product (e-commerce, ads, RAG) → consider Vespa. Pure semantic similarity → vector-only DB. Code or log search → specialized engine.


25.1 E-commerce product search (Yelp, Booking.com)

~2M – 200M products / businesses / hotels; faceted search (price, category, rating, geo-distance); personalized ranking; <200ms p99; updates every few seconds. Variant: Lucene-derived with heavy doc values for facet counts, geo-distance scoring, BM25 + function scores for "boost by rating." Indexer denormalizes joined pricing/inventory/reviews so the engine doesn't join at query time. Hot tier of trending products; faceted counts cached at home-page level.

25.2 Code search (GitHub Blackbird, Sourcegraph + Zoekt)

Source code, not English. Users want substring matches ("func MyHelper("), regex, and symbol search. Word-tokenized Lucene doesn't fit — getUserByEmailAddress needs substring and camel-case decomposition. Variant: custom n-gram inverted index (trigrams or longer). Each file is the set of all overlapping 3-grams; intersect to find candidates; verify with regex on matches. Blackbird at GitHub: 200B documents, 640 TB total, sub-2s p99 for global substring queries — a from-scratch engine because Lucene's word-level model doesn't compose with substring grep. Sourcegraph uses Zoekt (originally Google), similar architecture, smaller scale.

25.3 Log / observability search (Datadog, ELK, Loki, Quickwit)

Millions of log lines/sec ingest, lower query rate (engineers searching ad-hoc), short retention (~30 days hot), cost-sensitive. Three sub-shapes: ELK / OpenSearch logs (full inverted index on every field; most flexible, most expensive); Loki (index ONLY on labels, log body as compressed chunks, full-text is label-filter-then-grep — ~10× cheaper than ELK); Quickwit (Tantivy segments on S3 — decouples compute from storage, $/TB-month 10–20× cheaper than NVMe, trades a few hundred ms latency for order-of-magnitude cost reduction).

25.4 Profile and people search (LinkedIn Galene)

1B+ profiles; heavy per-searcher personalization; facet-heavy UI; 100k+ QPS; 200ms budget. Variant: Galene is a Lucene fork at LinkedIn. Per-searcher personalization in the hot path — every result list is reranked using features specific to the searcher. LTR is part of the search API tier, fed by an online feature store (Pinot or Venice) so model features refresh faster than the index. Routing by network cluster keeps fanout to ~10 shards instead of ~100 for most queries.

25.5 Real-time feed / tweet search (Twitter Earlybird)

Massive write rate (500M tweets/day historical), sub-second freshness, time-decaying relevance. Variant: Earlybird was a Lucene-derived in-memory engine with aggressive 100ms refresh. Hot tier holds last few days in RAM; cold tier on disk, rarely queried directly. Sharded by time bucket and user. Tuned for "typical query is for very recent tweets in a small slice of the timeline."

25.6 E-commerce ranking with ML in the hot path (Vespa at Yahoo Shopping)

Tens of millions of candidates per query, ML rerank scoring thousands with a gradient-boosted tree or neural model, ~100ms budget. Variant: Vespa runs the ML model inside the engine via tensors. Engine fetches the candidate, multiplies its embedding/feature vector against the model in-process, returns top-K — no round-trip to external service. This is why Vespa dominates ML-heavy ranking where round-trip latency would otherwise kill the budget.

25.7 Hosted small-to-mid site search (Algolia)

Up to ~100M records, sub-100ms p99 globally, low operational overhead. Variant: custom in-memory engine. Full index in RAM on every search node — the latency win. Cost scales with RAM; doesn't extend to billion-doc scale. For sub-100M site search, hard to beat.


§26. Real-world implementations with numbers

  • LinkedIn Galene — Lucene-derived custom engine. 1B+ profiles, ~100k+ QPS, p99 < 200ms. Forked Lucene with personalization-aware scoring; federation, network-cluster partitioning, LTR in the hot path.
  • GitHub Code Search (Blackbird) — Custom engine (not Lucene). 200B documents, 640 TB hot index, sub-2s p99 for global substring. Reinvented the tokenizer (n-grams) because Lucene's word-level model doesn't fit code search.
  • Sourcegraph + Zoekt — Open-source code search. Trigram-indexed. Tens of thousands of repos at the typical deployment; same shape as Blackbird, smaller scale.
  • Yelp — Custom Lucene for business search. ~10M businesses, hundreds of facets, geo + faceted scoring, pre-computed home-page facets, 2-tier hot/cold.
  • Booking.com — Lucene for hotel search. ~2M properties × hundreds of attributes, p99 < 200ms with 10+ active facets, hybrid keyword + ranking-signal.
  • Twitter / X Earlybird — Custom Lucene fork. 500M tweets/day historical write rate. Two-tier hot/cold; aggressive refresh.
  • Stack Overflow — Elasticsearch. ~100M Q&A documents. Canonical mid-large workload on a single ES cluster with thoughtful ops.
  • Algolia — Hosted SaaS, custom in-memory. Sub-100ms p99 globally for sites up to ~100M records.
  • Vespa at Yahoo Shopping / Vespa Cloud — In-engine ranking with tensors. Billions of docs with ML rerank in the hot path; Yahoo Mail/Shopping, ad ranking, increasingly RAG (Retrieval-Augmented Generation) backends.
  • Grafana Loki at large observability SaaS — label-only inverted index. ~10× cheaper than ES-on-logs at equivalent retention.
  • Datadog Logs — ES-backed. Millions of log lines/sec ingest at petabyte-scale daily volumes; two-tier hot/archive with on-demand rehydration.
  • Quickwit at Binance / BlaBlaCar — Tantivy on S3. 5–10× cost reduction vs ES at equivalent log retention.
  • Apache Solr in library catalogs / enterprise CMS — less hot in modern web, still default in many institutional installations.

The same technology category — inverted-index-based ranked retrieval — covers six orders of magnitude in doc count, half a dozen distinct workload shapes, and stretches from "hobby Tantivy project" to "GitHub's planet-scale code search on a custom engine." The vocabulary (terms, postings, doc values, segments, refresh, merge, fanout) is shared across all of them.


§27. Summary

"A search system is an inverted-index-based, ranked-retrieval materialization downstream of a primary store: terms in an FST term dictionary, posting lists PFOR-compressed with skip pointers, doc values for columnar facet/aggregation, immutable per-segment with a translog WAL for write durability and a 1-second refresh as the freshness/throughput knob; you make it fast by shrinking fanout with routing partitions and hot/cold tiers, early-terminating with WAND, and reranking with a model fed by an online feature store; you make it correct by treating it as derived (so cluster loss is recoverable from the source of truth via CDC + Kafka + idempotent external_version), by enforcing forward-only mapping evolution behind alias swaps, and by reconciling against the primary; you make it match the workload by picking from a design space that spans Lucene-derived (ES/Solr/OpenSearch) for general-purpose text + facet + hybrid retrieval, Vespa for ML-in-engine ranking, Tantivy/Quickwit for cost-sensitive log search, and purpose-built engines (Zoekt, Blackbird, Earlybird, Loki) when the tokenization shape — code, logs, tweets, label-only — diverges from word-level English."