Vector Database

Component class: storage and search systems built around ANN (Approximate Nearest Neighbor) lookup over high-dimensional vector embeddings produced by ML (Machine Learning) models. Implementations span specialized managed services (Pinecone), open-source dedicated systems (Weaviate, Milvus, Qdrant), Postgres extensions (pgvector), embedded libraries (Chroma, FAISS, Annoy), and traditional search engines with vector capabilities bolted on (Elasticsearch dense_vector, OpenSearch k-NN, Vespa, Solr). The byte-level mechanics in §4 — HNSW (Hierarchical Navigable Small World) graphs, IVF (Inverted File) clustering, PQ (Product Quantization) compression — apply uniformly across all of them, because everyone is reusing the same handful of academic algorithms from the 2010s.


§1. What a vector database IS

A vector database stores fixed-dimensional floating-point vectors — typically 256 to 3072 dimensions per vector — and answers queries of the form "give me the K vectors in the corpus that are closest to this query vector, under distance metric D." The distance metric is almost always cosine similarity, dot product, or Euclidean (L2). The closeness is computed in a continuous, semantically meaningful space produced by an upstream embedding model: a sentence about "how to renew a passport" lands near a sentence about "passport renewal procedure," because the embedding model places them near each other in vector space.

The defining structure is an ANN (Approximate Nearest Neighbor) index, almost always one of: HNSW (Hierarchical Navigable Small World), IVF (Inverted File) with PQ (Product Quantization), or some industrial variant like ScaNN (Scalable Nearest Neighbors). The "approximate" is load-bearing. Exact nearest-neighbor search over a billion 1536-dimensional vectors is a linear scan of 6 TB — minutes per query. ANN burns 1-5% recall to drop that to milliseconds.

In almost every architecture a vector database is a derived store, not the source of truth. Embeddings are themselves derived from raw text or images held by a primary system; if the embedding model changes, the entire corpus must be re-embedded and re-indexed. The original document IDs and metadata live in a relational or document database; the vector DB holds an index keyed by those IDs plus the high-dimensional vector and a thin metadata sidecar for filtering.

Distinguish from adjacent categories

  • Traditional databases (relational, document, wide-column) answer "give me rows where column X equals value Y" — exact match on discrete predicates. They cannot say "give me rows whose meaning is similar to this query," because they have no notion of meaning, only of bytes. A B+ tree on a LIKE '%word%' predicate degrades to a scan, and even a trigram index only handles substring fuzziness, not semantic fuzziness.
  • Search engines (Elasticsearch, OpenSearch, Solr) with text inverted indexes answer "give me documents containing these terms ranked by BM25 (Best Match 25)." They match discrete tokens, not continuous semantics. They will not match "passport renewal" to a query of "how do I update my travel documents" unless an analyzer or synonym dictionary explicitly maps the tokens. Modern Lucene-based engines have added dense_vector fields and an HNSW index, which makes them hybrid — see §3.
  • Key-value stores (Redis, DynamoDB) answer "fetch the value for this exact key" — O(1), no scan, no scoring. They are useless for "find content similar to this" because the key carries no semantic relationship to other keys. Redis has added a RediSearch vector module, but the access pattern is still a separate index.
  • Graph databases (Neo4j, Neptune) model explicit edges between entities; vector DBs model implicit similarity as proximity in a continuous space. A graph DB stores "user U follows user V"; a vector DB stores "user U's embedded preference vector is 0.91 cosine-similar to user V's preference vector."
  • OLAP (Online Analytical Processing) engines (ClickHouse, Druid, Pinot) scan huge columnar tables for aggregates; they do not do nearest-neighbor lookup as a first-class operation.

What a vector database is NOT good for

  • Exact-match lookup by ID — slower than a KV store; you pay HNSW traversal cost when you just wanted a hash lookup. Always carry the original ID separately and fetch the raw doc from the primary.
  • Source of truth — embeddings are derived; vectors are an interpretive layer over text/images/code. Rebuild from the raw is always the recovery path.
  • High-throughput point updates — index builds are expensive; HNSW graphs degrade with churn; deletion is a tombstone, not an erasure.
  • Returning exact top-K — recall is typically 95-99%. If your product requires 100%, ANN is wrong; you want exact KNN at small scale or a structured query.
  • Transactional semantics across multiple vectors — no ACID. No multi-vector atomicity.
  • Small corpora where an in-process linear scan is faster than a network round-trip to a service. Below ~10k vectors, sklearn or FAISS in-process beats any networked vector DB on latency.
  • Storage efficiency at the byte level. A single OpenAI text-embedding-ada-002 vector is 1536 floats × 4 bytes = 6,144 bytes per document, before index overhead. 100 million docs is ~614 GB of raw vectors before HNSW edge lists.

§2. Inherent guarantees — the contract by design

Provided:

  • Sub-second ANN at scale. A well-tuned HNSW index over 100 million 768-dimensional vectors returns top-10 in 5-50 ms on a single node with vectors in RAM. Pinecone, Weaviate, Milvus all hit p99 latencies in this range.
  • Configurable recall/latency tradeoff. HNSW exposes ef_search (how wide the candidate set is during query); IVF exposes nprobe (how many clusters to scan). Larger = higher recall, slower. The tradeoff is a knob the operator picks per use case.
  • Metadata filtering. Almost every modern vector DB allows attaching a JSON blob or set of attributes to each vector and filtering by predicate (tenant_id = X AND category = "support") as part of the ANN query.
  • Horizontal scalability via sharding. Vectors are partitioned by hash of ID or by metadata; each shard runs its own ANN index; the coordinator merges top-K from each shard.
  • Persistent durability of the raw vectors. Most implementations write vectors to a WAL (Write-Ahead Log) or to a Parquet/RocksDB backing store so the in-memory index can be rebuilt after a crash.
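
The scatter-gather step behind the sharding guarantee can be sketched in a few lines. This is a minimal illustration, not any particular engine's coordinator: the function name `merge_top_k` and the `(distance, id)` hit format are assumptions made for the example, and each shard is assumed to return its own top-K already sorted by ascending distance.

```python
import heapq
from itertools import islice

def merge_top_k(shard_results, k):
    """Merge per-shard top-K lists of (distance, vector_id) into a global top-K.

    Assumes every shard list is already sorted by ascending distance.
    """
    # heapq.merge lazily merges sorted inputs, so we only pull k items in total.
    return list(islice(heapq.merge(*shard_results), k))

# Hypothetical responses from three shards (smaller distance = closer).
shard_a = [(0.10, "doc-7"), (0.42, "doc-2"), (0.90, "doc-5")]
shard_b = [(0.05, "doc-11"), (0.50, "doc-3")]
shard_c = [(0.30, "doc-9")]

top3 = merge_top_k([shard_a, shard_b, shard_c], k=3)
print(top3)
```

For the merged list to be a correct global top-K, each shard has to return at least K hits of its own (or everything it has), because all K winners could live on a single shard.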

Not provided — must be layered on:

  • Exact results. ANN trades 1-5% recall for orders of magnitude latency improvement. If your product needs 100% recall (e.g., legal discovery), you need either a brute-force scan or a structured retrieval layer on top.
  • High write throughput. HNSW construction is O(N log N) edge insertions; rebuilding a 100M-vector index from scratch on a single node takes hours. Steady-state inserts are tolerable (a few thousand per second per shard), but bulk loading is expensive.
  • Compact storage. Vectors are intrinsically large; PQ (Product Quantization) and other compression help but cost recall. There is no free lunch.
  • Idempotent writes. Inserting a vector twice creates two index entries unless the application enforces uniqueness via the ID layer.
  • Read-your-writes consistency. Most vector DBs have an async commit phase; a newly-inserted vector may not appear in queries for tens of milliseconds to seconds.
  • Embedding-model versioning. When the upstream model changes, all existing vectors are in the old space and not comparable to new query vectors. The vector DB knows nothing about this; the application must track model versions and re-embed.
  • Authorization filtering. The engine returns matching vectors; per-tenant or per-user authorization is the caller's problem. Metadata filtering is the lever, but it must be wired in.

Mental model: a vector database is a horizontally scalable, persistent, near-real-time ANN index with optional metadata filtering. Everything outside that contract — including the embedding model that produces the vectors — is your problem.


§3. The design space

Vector storage and search has fragmented across at least five architectural patterns since 2021, when OpenAI's GPT-3 (Generative Pretrained Transformer 3) wave drove demand for RAG (Retrieval Augmented Generation) and similarity search to commercial scale.

| System | Architecture | Storage core | License/Hosting | Primary use case | Operational tradeoffs |
|---|---|---|---|---|---|
| Pinecone | Managed-only SaaS (Software as a Service) | HNSW + proprietary indexes | Closed source, hosted only | RAG, recsys; "I want a vector DB endpoint" | Vendor lock-in; opaque internals; pay-per-vector pricing |
| Weaviate | Open-source + managed cloud | HNSW; columnar metadata in LSM | BSD 3-Clause (open), managed cloud option | RAG with rich filtering; multi-tenancy | GraphQL API; learning curve; recent maturity |
| Milvus | Open-source distributed | HNSW, IVF-PQ, DiskANN; pluggable | Apache 2.0; Zilliz Cloud as managed offering | Large-scale (billions); ML pipelines | Heavy stack (etcd, Pulsar, MinIO components); ops burden |
| Qdrant | Open-source, single-binary Rust | HNSW; payload index for filters | Apache 2.0; Qdrant Cloud as managed | Mid-scale; clean API; filtering | Lighter than Milvus; smaller ecosystem |
| pgvector | Postgres extension | HNSW or IVFFlat inside Postgres | PostgreSQL License | "I already have Postgres; add vectors" | Single-node bottleneck; HNSW added in pgvector 0.5 (2023) |
| Chroma | Embedded, in-process | HNSW via hnswlib backend | Apache 2.0 | Prototypes, notebooks, small RAG demos | Not for production at scale |
| FAISS (Meta) | Library, not a service | HNSW, IVF, PQ, IVFPQ, ScaNN-like | MIT License | Embedded research; the reference implementation | No service layer; user wraps it |
| Annoy (Spotify) | Library | Random-projection forest | Apache 2.0 | In-process recsys; lightweight | Older algorithm; Spotify still uses it in production |
| ScaNN (Google) | Library | Anisotropic vector quantization | Apache 2.0 | High-precision IVF variant; YouTube recsys | C++; tight Google TF integration |
| Vespa (Yahoo) | Distributed search + ranking + vector | HNSW + tensor ranking | Apache 2.0 | First-class hybrid search with ranking | Single engine for retrieval + ranking + ML inference; smaller community |
| Elasticsearch dense_vector | Lucene segment with HNSW | Lucene segment per shard | Elastic License v2 | Text search corpora that gain vectors | Vector retrieval less optimized than dedicated; segment immutability friction |
| OpenSearch k-NN | Lucene fork with k-NN plugin | HNSW or IVF via NMSLIB / FAISS | Apache 2.0 | Hybrid search at scale | Fork of ES post-2021 license change |
| Redis Stack (RediSearch + Vector) | In-memory module | HNSW, FLAT | Source Available | Sub-ms latency, smaller corpora | Cost per GB; RAM-bound |
| MongoDB Atlas Vector Search | Document DB extension | HNSW via Lucene under hood | SaaS within Atlas | "I already have Mongo Atlas" | Restricted to Atlas |

How to read this: there are five real categories, and the pick is structural, not aesthetic.

  1. Specialized managed (Pinecone) — you want a vector index endpoint, you don't want to run anything, you'll pay for it. Operational simplicity. Vendor lock-in. The popular default for AI startups in 2023-2024.
  2. Open-source dedicated (Weaviate, Milvus, Qdrant) — you want a real vector DB, but on-prem or self-hosted-cloud. Different ergonomics: Weaviate is GraphQL-first and modular; Milvus is heavy-stack and built for billions; Qdrant is lean Rust with a clean REST API and is the rising mid-tier pick.
  3. Postgres extension (pgvector) — you already operate Postgres; you have < 50M vectors; you'd rather add an extension than add a system. pgvector is now the dominant on-ramp; it gained HNSW support in pgvector 0.5.0 (mid-2023) and is now plausible for mid-scale workloads.
  4. Embedded library (FAISS, Chroma, Annoy) — vector search is in-process. Used in research, prototypes, and inside other systems (Spotify's recsys uses Annoy in-process; many Pinecone-like engines wrap FAISS under the hood).
  5. Search engine + vectors (Elasticsearch, OpenSearch, Vespa, Solr) — you already have a Lucene-based search system; you add vector retrieval as another field type. The advantage is hybrid retrieval natively — keyword and vector in the same query. The cost is that the Lucene segment model is optimized for inverted indexes, not for dense vector search; HNSW segments are conceptually awkward inside Lucene's append-only segment world.

Default picks today (mid-2020s):

  • < 10M vectors, low-ops constraint: pgvector. Done.
  • Up to 100M vectors, want managed: Pinecone or Weaviate Cloud.
  • 100M-10B vectors, self-host, ops capacity: Milvus or Qdrant.
  • Already have ES/OpenSearch and want hybrid keyword+vector: Elasticsearch dense_vector or OpenSearch k-NN.
  • Need first-class ranking with ML model in the loop: Vespa.


§4. Underlying data structure — ANN algorithms at the byte level

This is the section shallow docs miss. There are three families of ANN algorithms that every production vector DB implements; almost every system is one or a combination of HNSW, IVF, and PQ. Understanding them at the byte level is the difference between "I know vector DBs exist" and "I know which knob to turn when latency spikes."

4.1 The naïve baseline — brute-force KNN (k-Nearest Neighbors)

The reference, for context: for every query vector q, compute the distance to every vector v in the corpus, then sort and return top-K. Cost is O(N·d) per query, where N is corpus size and d is dimensionality. For N=10⁹ and d=1536, that's 1.5 × 10¹² float operations per query. At 100 GFLOPS, that's 15 seconds. Unusable.

Even with SIMD (Single Instruction, Multiple Data) and GPUs, brute-force is fine up to N ~ 10⁶ (then 1-100 ms per query), painful at N=10⁸, and unworkable at N=10⁹+. Hence ANN.
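
As a reference point, the brute-force baseline and its cost estimate fit in a short sketch. This is a pure-Python illustration on toy data (real systems would use SIMD or a GPU); `brute_force_knn` and `scan_seconds` are names made up for the example, and the cost model counts one float operation per dimension, as the paragraph above does.

```python
import heapq
import math
import random

def brute_force_knn(query, corpus, k):
    """Exact top-k: score every vector, keep the k smallest (distance, index) pairs."""
    scored = ((math.dist(query, vec), idx) for idx, vec in enumerate(corpus))
    return heapq.nsmallest(k, scored)

def scan_seconds(n_vectors, dims, flops_per_second):
    """Back-of-envelope scan time, counting one float operation per dimension."""
    return n_vectors * dims / flops_per_second

rng = random.Random(0)
corpus = [[rng.random() for _ in range(8)] for _ in range(1000)]
query = corpus[42]                      # query with a vector that is in the corpus

hits = brute_force_knn(query, corpus, k=5)
print(hits[0])                          # the vector itself, at distance 0.0
print(scan_seconds(10**9, 1536, 100e9)) # roughly 15 seconds at 100 GFLOPS
```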

4.2 HNSW — Hierarchical Navigable Small World

HNSW is the dominant ANN algorithm of the 2020s. It was published by Malkov & Yashunin in 2016 and is the index of choice in Pinecone, Weaviate, Qdrant, Milvus (alongside IVF-PQ), pgvector, FAISS, Elasticsearch dense_vector, OpenSearch k-NN, and Vespa.

Concept. HNSW builds a multi-layer graph where each vector is a node. The bottom layer (layer 0) contains every vector. Each higher layer contains a geometrically decreasing subset (typically each vector has probability 1/M of being promoted to the next layer). The top layer has only a handful of nodes connected by sparse long-range edges.

Search starts at the entry point in the top layer, greedily walks toward the query vector until no closer neighbor exists at that layer, then descends to the next layer at the current best node, and repeats. At the bottom layer, the search returns the top-K closest vectors.

The "small world" insight comes from Watts & Strogatz's 1998 paper on small-world graphs — a network where most edges are local but a few long-range edges drastically shorten paths between any two nodes. Add a few long edges at the top of a hierarchy and you can route from any starting point to any target in O(log N) hops.

Layer 2:     A ────────────── F
              \              /
Layer 1:      A ─ C ──── E ─ F ─ H
                 / \    / \   \
Layer 0: A B C D E F G H I J K L M N O P  ← all vectors live here
         (dense local neighborhood graph)
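
The layer assignment can be simulated directly. The sketch below assumes the commonly cited rule of drawing a node's top layer as floor(-ln(U) / ln(M)) with U uniform on (0, 1], which gives each node a 1/M chance of reaching each successive layer; `sample_level` is an illustrative name, not a library function.

```python
import math
import random
from collections import Counter

def sample_level(M, rng):
    """Draw the top layer for a new node so that P(level >= l) = (1/M) ** l."""
    m_L = 1.0 / math.log(M)      # level normalization factor
    u = 1.0 - rng.random()       # uniform in (0, 1], avoids log(0)
    return int(-math.log(u) * m_L)

rng = random.Random(0)
M = 16
n_samples = 100_000
levels = Counter(sample_level(M, rng) for _ in range(n_samples))

promoted = sum(count for level, count in levels.items() if level >= 1)
fraction_promoted = promoted / n_samples   # should be close to 1/M = 0.0625
print(sorted(levels.items()), fraction_promoted)
```

With M=16 almost all nodes stay on layer 0, a few thousand per hundred thousand reach layer 1, and only a handful go higher, which is the geometric thinning the diagram shows.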

Byte-level layout (this is what's actually on disk and in RAM):

Per vector:
  vector_data:     d × float32           (e.g., 1536 × 4 = 6144 bytes for OpenAI ada-002)
  metadata:        variable (JSON / fixed schema, e.g., 50-500 bytes)

Per layer per vector:
  edge_list:       M × uint32 neighbor IDs
                   (M typically 16-64; pgvector default M=16)
                   = 64-256 bytes per layer

A 100 million-vector HNSW index at d=1536, M=16, with vectors and edges in RAM costs:

  • Vectors: 100M × 6144 B = 614 GB
  • Bottom layer edges: 100M × 16 × 4 B = 6.4 GB
  • Higher layer edges: ~0.4 GB total (with promotion probability 1/M, the upper layers together hold about 100M/15 ≈ 6.7M node entries at 64 B each)
  • Total: ~621 GB

This is why HNSW at billion-scale is a sharded, multi-node game, and why PQ compression (next sections) matters.
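
The same arithmetic can be wrapped in a small estimator. This is a rough sketch under stated assumptions: float32 vectors, uint32 neighbor IDs, exactly M edges per node per layer, and the 1/M promotion probability from the concept paragraph (so the upper layers form a geometric series summing to n/(M-1) node entries). Metadata and allocator overhead are ignored, and `hnsw_memory_bytes` is a hypothetical helper name.

```python
def hnsw_memory_bytes(n_vectors, dims, M, bytes_per_float=4, bytes_per_id=4):
    """Rough RAM estimate for an HNSW index, ignoring metadata and overhead."""
    vectors = n_vectors * dims * bytes_per_float
    layer0_edges = n_vectors * M * bytes_per_id
    # Layers 1, 2, ... hold n/M, n/M^2, ... nodes: a series summing to n/(M-1).
    upper_edges = n_vectors / (M - 1) * M * bytes_per_id
    return {
        "vectors": vectors,
        "layer0_edges": layer0_edges,
        "upper_edges": upper_edges,
        "total": vectors + layer0_edges + upper_edges,
    }

estimate = hnsw_memory_bytes(n_vectors=100_000_000, dims=1536, M=16)
for part, size in estimate.items():
    print(f"{part:>13}: {size / 1e9:8.1f} GB")
```

The raw vectors dominate: the edge lists are around 1% of the total at this dimensionality, which is why compression targets the vectors rather than the graph.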

Tuning parameters:

  • M: the number of edges per node per layer. Higher M = better recall, more memory, slower build. Typical: M=16 (lean), M=32 (balanced), M=64 (high recall, large indexes).
  • ef_construction: how aggressively the builder searches for good neighbors when inserting a new node. Higher = better graph quality, slower build. Typical 100-500.
  • ef_search: at query time, how wide the candidate set is during graph traversal. Higher = better recall, slower query. Typical 50-500. This is the runtime tuning knob; unlike M and ef_construction, it can be changed without rebuilding the index.

Walk through one HNSW search end-to-end:

Setup: corpus of 10M vectors, d=768, M=32, ef_search=100. Query vector q.

  1. Entry: start at the global entry point, a single node in the top layer. Compute distance from q to the entry point. Suppose entry has 4 neighbors at layer 2; compute distance to each.
  2. Greedy step at top layer: move to the closest neighbor. Recompute distances to that node's neighbors. Repeat until no neighbor is closer to q than the current node.
  3. Descend to layer 1 at the current best node. Repeat the greedy walk at layer 1 — denser graph, more edges per node, finer-grained navigation.
  4. Descend to layer 0. This is the bottom layer with all 10M vectors and dense local connectivity.
  5. At layer 0, instead of strict greedy, the algorithm runs a best-first search with two heaps: a min-heap of candidates still to explore, and a result heap capped at ef_search=100 that holds the best nodes found so far. Each iteration: pop the closest unexplored candidate; compute distances to its M=32 neighbors; add each unvisited neighbor to both heaps if it beats the worst current result (evicting that worst result once the result heap is full). Stop when the candidate heap is empty or the closest remaining candidate is farther from q than the worst node in the result heap.
  6. Return the top-K (say K=10) from the result heap.
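
The layer-0 step of this walkthrough can be sketched as a best-first search over an adjacency list. This is a simplified illustration, not a full HNSW: the toy graph is built by brute force (6 nearest neighbors plus ring edges so it stays connected), there is only one layer, and `search_layer` is a name chosen for the example.

```python
import heapq
import math
import random

def search_layer(query, entry, vectors, neighbors, ef):
    """Best-first search on one graph layer; returns up to ef (distance, id) pairs."""
    d0 = math.dist(query, vectors[entry])
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    results = [(-d0, entry)]     # max-heap via negation: worst kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break                # closest pending node is worse than the worst kept
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = math.dist(query, vectors[nb])
            if len(results) < ef or d_nb < -results[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(results, (-d_nb, nb))
                if len(results) > ef:
                    heapq.heappop(results)   # evict the current worst
    return sorted((-neg_d, node) for neg_d, node in results)

# Toy single-layer graph over 300 random 2-D points.
rng = random.Random(1)
vectors = [(rng.random(), rng.random()) for _ in range(300)]
n = len(vectors)
neighbors = []
for i, v in enumerate(vectors):
    knn = sorted(range(n), key=lambda j: math.dist(v, vectors[j]))[1:7]
    ring = [(i - 1) % n, (i + 1) % n]        # guarantees the graph is connected
    neighbors.append(sorted(set(knn + ring)))

query = (0.5, 0.5)
top10 = search_layer(query, entry=0, vectors=vectors, neighbors=neighbors, ef=50)[:10]
print(top10)
```

Raising `ef` widens the result heap, so the search explores more of the graph before the stop condition fires. On this connected toy graph an `ef` larger than the node count degenerates into an exhaustive, exact search.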

Total compute: each visited node costs M distance computations, each of which is d=768 float operations. At ef_search=100, the search visits maybe 1000-5000 nodes — call it 3000 × 32 × 768 = ~74 million float operations. On a modern CPU with AVX-512 (Advanced Vector Extensions 512), this is sub-millisecond.

The recall is typically 95-99% (i.e., 9.5-9.9 of the true top-10 are returned). At ef_search=500, recall climbs above 99%, latency rises ~4×. This is the knob.
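
Recall@K, as used here, is simply the overlap between the approximate result and the exact top-K. A minimal sketch follows; the ID lists are made-up data for illustration.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k that the approximate result also returned."""
    truth = set(exact_ids[:k])
    found = truth.intersection(approx_ids[:k])
    return len(found) / len(truth)

# Hypothetical results for one query: exact KNN vs. an ANN index.
exact = [3, 17, 42, 8, 99, 5, 61, 23, 70, 12]
approx = [3, 42, 17, 8, 99, 5, 61, 23, 70, 88]   # order differs, one miss (12)

recall = recall_at_k(approx, exact, k=10)
print(recall)   # 9 of the true top-10 were found
```

In practice this is averaged over a sample of queries, with the exact side computed by brute force offline. Order within the top-K does not affect the score.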

Trade-offs vs the alternative: HNSW is fast at query time, but expensive to build (each insert does a full search to find neighbors) and bad at deletes (removing a node breaks edges; the graph degrades until rebuild). It also demands the entire graph in RAM. Disk-backed graph indexes exist (DiskANN, which uses its own Vamana graph rather than HNSW) but are significantly slower.

4.3 IVF — Inverted File index

IVF predates HNSW by a decade and is still everywhere — it's the basis of FAISS's default index and is used heavily in YouTube's candidate generation via ScaNN.

Concept: cluster all vectors using k-means into N centroids (typically N = √(corpus size); for 100M vectors, N ≈ 10,000 clusters). Each vector is assigned to its nearest centroid. At query time, find the closest n_probe centroids to the query, then do a brute-force scan of the vectors in those clusters. n_probe is the runtime knob — higher = higher recall, slower.

                    ┌── centroid c1 ── [v1, v7, v23, v89, ...]
query vector  q ──> ├── centroid c2 ── [v4, v12, ...]         ← n_probe = 3
                    ├── centroid c3 ── [v5, v8, ...]            scan these
                    └── centroid c4 ── [v2, v15, ...]
                    ...
                    └── centroid c10000 ── [...]

Byte-level layout:

Centroids:        N_clusters × d × float32  (10000 × 1536 × 4 = 60 MB)
Cluster lists:    per cluster, a packed list of vector IDs and (optionally) the full vectors
Inverted file:    cluster_id → list of vector_ids

Walk through an IVF query: corpus of 100M vectors, N=10000 clusters, n_probe=16.

  1. Compute distance from q to all 10000 centroids: 10000 × 1536 = 15M float ops. ~150 µs on modern CPU.
  2. Pick the 16 closest centroids.
  3. For each of the 16 clusters, fetch the ~10000 vectors in that cluster (100M / 10000 = 10000 vectors/cluster on average) and compute distances: 16 × 10000 × 1536 = 246M float ops. ~2.5 ms.
  4. Maintain a bounded heap of the K best candidates across all clusters scanned (a max-heap, so the current worst is the one evicted). Return top-K.

Recall depends on n_probe. n_probe=1 might give 60-70% recall; n_probe=32 might give 95%+. Tuning n_probe is the IVF analog of HNSW's ef_search.
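
A toy IVF index makes the n_probe knob concrete. The sketch below is pure Python with a few Lloyd iterations standing in for a real k-means trainer, and it runs on 2,000 random 4-dimensional vectors rather than the 100M-vector corpus in the walkthrough. All function names are illustrative.

```python
import heapq
import math
import random

def nearest(point, centroids):
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

def kmeans(vectors, n_clusters, iterations, rng):
    """A few Lloyd iterations; enough for a demo, not a tuned trainer."""
    centroids = rng.sample(vectors, n_clusters)
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        for c, members in enumerate(groups):
            if members:   # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids

def build_ivf(vectors, centroids):
    """Inverted file: cluster id -> list of vector ids."""
    lists = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        lists[nearest(v, centroids)].append(idx)
    return lists

def ivf_search(query, vectors, centroids, lists, k, n_probe):
    probe = heapq.nsmallest(n_probe, range(len(centroids)),
                            key=lambda c: math.dist(query, centroids[c]))
    scored = ((math.dist(query, vectors[i]), i) for c in probe for i in lists[c])
    return heapq.nsmallest(k, scored)

rng = random.Random(7)
vectors = [tuple(rng.random() for _ in range(4)) for _ in range(2000)]
centroids = kmeans(vectors, n_clusters=45, iterations=5, rng=rng)  # ~sqrt(2000)
lists = build_ivf(vectors, centroids)
query = tuple(rng.random() for _ in range(4))

fast = ivf_search(query, vectors, centroids, lists, k=10, n_probe=4)
full = ivf_search(query, vectors, centroids, lists, k=10, n_probe=len(centroids))
print(len(set(fast) & set(full)), "of 10 true neighbors found with n_probe=4")
```

Probing every cluster is equivalent to a brute-force scan, so `full` is the exact answer and the printed overlap is the recall of the cheaper query.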

Trade-offs vs HNSW: IVF is faster to build (just k-means and assignment, vs HNSW's per-vector graph search), cheaper in memory because cluster lists can live on disk, and easier to update (just re-assign a vector to its new cluster). But query latency is generally higher and recall/latency Pareto is slightly worse than HNSW at the same corpus size, which is why HNSW has won as the default. IVF wins at extreme scale (10B+ vectors) where HNSW's memory footprint is prohibitive.

ScaNN (Scalable Nearest Neighbors, Google 2020) is an optimized IVF variant using anisotropic vector quantization — the quantizer is tuned to preserve the dot product magnitudes that matter most for the final ranking, not just nearest-neighbor distances. ScaNN backs YouTube's video recommendation candidate generation and is a few percent better than vanilla IVFPQ at billion-vector scale.

4.4 PQ — Product Quantization

Product Quantization solves the memory problem. A 1536-dim float32 vector is 6 KB; a billion such vectors is 6 TB. PQ compresses each vector to a few tens of bytes (typically 8-64), a reduction of roughly 100x or more, at the cost of a few percent recall.

Concept: split each d-dimensional vector into M sub-vectors of length d/M. For each sub-vector position, train a codebook of K=256 sub-vector centroids via k-means on a training sample. Each sub-vector is then represented by the index of its closest centroid — 1 byte (256 = 2⁸).

Original vector v ∈ R^768:
  [v_1, v_2, ..., v_768]
                                ┌─ codebook 1 (256 centroids of d/M dims)
Split into M=12 chunks of 64:   ├─ codebook 2
  [v_1..v_64] [v_65..v_128] ... ├─ ...
   ↓             ↓               └─ codebook 12
   c1 (1 byte) c2 (1 byte) ...   

Compressed: 12 bytes per vector (was 768×4 = 3072 bytes)
            = 256× reduction

For 1536-dim vectors with M=48 chunks: compressed to 48 bytes per vector. A billion-vector index is 48 GB instead of 6 TB. Fits on one large box.

Distance computation under PQ uses precomputed lookup tables. At query time, the M sub-vectors of q are compared to each of the 256 centroids of each codebook, producing M tables of 256 partial distances. Then to compute the distance from q to any compressed vector v, you do M table lookups and sum — far cheaper than the original M×(d/M) float multiplications.
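
The encode step and the lookup-table distance can be sketched end to end on toy sizes. This illustration assumes squared L2 as the metric and uses 16 codes per chunk instead of 256 to keep the pure-Python k-means fast; the function names are made up for the example.

```python
import random

def split(vec, n_chunks):
    size = len(vec) // n_chunks
    return [tuple(vec[i * size:(i + 1) * size]) for i in range(n_chunks)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_codebooks(vectors, n_chunks, n_codes, rng, iterations=5):
    """One small k-means per chunk position."""
    codebooks = []
    for m in range(n_chunks):
        subs = [split(v, n_chunks)[m] for v in vectors]
        codes = rng.sample(subs, n_codes)
        for _ in range(iterations):
            groups = [[] for _ in codes]
            for s in subs:
                best = min(range(n_codes), key=lambda c: sq_dist(s, codes[c]))
                groups[best].append(s)
            codes = [tuple(sum(col) / len(g) for col in zip(*g)) if g else codes[c]
                     for c, g in enumerate(groups)]
        codebooks.append(codes)
    return codebooks

def encode(vec, codebooks):
    """One byte per chunk: the index of the nearest centroid in that chunk's codebook."""
    return bytes(min(range(len(cb)), key=lambda c: sq_dist(sub, cb[c]))
                 for sub, cb in zip(split(vec, len(codebooks)), codebooks))

def distance_tables(query, codebooks):
    """Per chunk, the squared distance from the query chunk to every centroid."""
    return [[sq_dist(sub, code) for code in cb]
            for sub, cb in zip(split(query, len(codebooks)), codebooks)]

def adc_distance(code, tables):
    """Approximate squared L2: one table lookup per chunk, then a sum."""
    return sum(table[c] for table, c in zip(tables, code))

rng = random.Random(3)
dims, n_chunks, n_codes = 16, 4, 16     # toy sizes; the text uses 256 codes per chunk
vectors = [tuple(rng.gauss(0, 1) for _ in range(dims)) for _ in range(500)]
codebooks = train_codebooks(vectors, n_chunks, n_codes, rng)
codes = [encode(v, codebooks) for v in vectors]   # 4 bytes each instead of 64

query = tuple(rng.gauss(0, 1) for _ in range(dims))
tables = distance_tables(query, codebooks)        # built once per query
approx = adc_distance(codes[0], tables)
print(approx, sq_dist(query, vectors[0]))         # approximate vs. exact
```

The table-based distance equals the exact squared distance between the query and the reconstructed (quantized) vector, so the error comes entirely from quantization and not from the lookup trick.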

IVFPQ — the workhorse combination: IVF partitions vectors into clusters; PQ compresses each cluster's vectors. FAISS's default index for large-scale workloads is IVFPQ. Milvus exposes IVFPQ as a first-class index type. Most "billions-scale" deployments end up here because pure HNSW can't fit in memory.

Trade-off: PQ loses recall (typically 2-5% vs uncompressed). For recall-critical applications, you can use PQ as a coarse filter and re-rank the top candidates with the full uncompressed vectors (the "rerank" step in IVFPQ pipelines).

4.5 Distance metrics — the choice that depends on the embedding

  • Cosine similarity: angle between vectors, indifferent to magnitude. The default for text embeddings (OpenAI, Cohere, sentence-transformers), which are typically L2-normalized so dot-product = cosine.
  • Dot product: magnitude matters. Used for non-normalized embeddings and in cases like recommendation systems where the magnitude encodes confidence.
  • Euclidean (L2): straight-line distance. Used for some image embeddings and some neural retrieval models.
  • Manhattan (L1): sum of absolute differences. Niche, occasionally used in sparse models.

The choice is dictated by the embedding model's training objective. If you get the metric wrong, you'll see catastrophic recall, so match the metric the model was trained with. OpenAI text-embedding-3-* are normalized for cosine; CLIP (Contrastive Language-Image Pretraining) image embeddings use cosine; many BERT-derived embeddings are dot-product native.
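
For L2-normalized vectors the three main metrics produce the same ranking, because dot product equals cosine and squared Euclidean distance equals 2 - 2·cosine. The sketch below checks this on random unit vectors. It is a numerical illustration on synthetic data, not a statement about any specific embedding model.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

rng = random.Random(5)
corpus = [normalize([rng.gauss(0, 1) for _ in range(32)]) for _ in range(200)]
query = normalize([rng.gauss(0, 1) for _ in range(32)])

ids = range(len(corpus))
by_cosine = sorted(ids, key=lambda i: -cosine(query, corpus[i]))[:10]
by_dot = sorted(ids, key=lambda i: -dot(query, corpus[i]))[:10]
by_l2 = sorted(ids, key=lambda i: math.dist(query, corpus[i]))[:10]
print(by_cosine == by_dot == by_l2)
```

The equivalence breaks as soon as the vectors are not normalized, which is when the choice between dot product and cosine starts to change results.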

4.6 Filtering during ANN search — the hard interaction

Production queries usually carry a metadata filter: "find similar support tickets, but only for this tenant, only in English, only from the last 90 days." Three strategies, in increasing order of sophistication:

  1. Post-filtering (naive): run ANN to get top-K (say K=100), then filter. Problem: if the filter is selective (e.g., 1% of corpus passes), 99 of the 100 candidates are discarded and only 1 survives. Recall craters. You must over-fetch to a multiple of K, sometimes K × 1000, which destroys latency.
  2. Pre-filtering: compute the set of vectors matching the filter, then do a brute-force (exact) scan of that subset. Works if the filter is selective and the subset is small (≤ 10k). Falls over if the subset is large (millions) or if filter combinations explode.
  3. Filtered ANN (tightly-coupled): maintain auxiliary structures so the ANN traversal itself respects the filter. Qdrant, Weaviate, and Milvus implement variants of this — e.g., per-segment filter bitmaps consulted during HNSW graph traversal so only matching neighbors are explored. Recall stays high; latency stays bounded. This is the current state of the art and is what differentiates a real vector DB from "add a vector index to an existing store."
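
The first two strategies can be compared on a toy corpus where 1% of the items pass the filter. In this sketch an exact scan stands in for the ANN index, so it only illustrates the over-fetch problem, not real ANN behavior; the tenant names and helper functions are hypothetical.

```python
import heapq
import math
import random

rng = random.Random(11)
corpus = [
    {"id": i, "tenant": "acme" if i % 100 == 0 else "other",
     "vec": (rng.random(), rng.random())}
    for i in range(5000)                 # 50 of 5000 items (1%) belong to "acme"
]

def top_k(query, items, k):
    """Stand-in for the ANN index: an exact scan over the given items."""
    return heapq.nsmallest(k, items, key=lambda it: math.dist(query, it["vec"]))

def post_filter(query, items, k, predicate, overfetch=1):
    candidates = top_k(query, items, k * overfetch)   # search first ...
    return [it for it in candidates if predicate(it)][:k]   # ... filter after

def pre_filter(query, items, k, predicate):
    return top_k(query, [it for it in items if predicate(it)], k)

def is_acme(item):
    return item["tenant"] == "acme"

query = (0.5, 0.5)
naive = post_filter(query, corpus, 10, is_acme)                # fetch 10, then filter
wide = post_filter(query, corpus, 10, is_acme, overfetch=100)  # fetch 1000, then filter
exact = pre_filter(query, corpus, 10, is_acme)                 # filter, then scan 50
print(len(naive), len(wide), len(exact))
```

Post-filtering returns only the matches that happened to land in the fetched window, so it comes back short unless the over-fetch factor is large. Pre-filtering always returns the full K here because the matching subset is tiny.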

Elasticsearch's dense_vector with HNSW has historically struggled here because Lucene's segment model is immutable and per-segment filter bitmaps are awkward. OpenSearch and ES have shipped improvements (efficient filtering in 8.x), but specialized vector DBs still beat them on filtered ANN.

4.7 Durability and persistence

Most vector DBs persist three layers: the raw vectors (in RocksDB, Parquet, or flat files on local disk and/or object storage — the recovery source), the metadata sidecar (LSM or relational), and the index itself (HNSW edges, IVF cluster assignments) as periodic snapshots with a WAL for incremental updates.

Crash recovery: load latest snapshot, replay WAL, resume serving. If the snapshot is corrupt, the index is rebuilt from raw vectors — expensive (O(N log N) for HNSW, hours at 100M) but always possible. The raw vectors are the source of truth within the DB; the upstream text/image store is the deeper source.
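
The snapshot-plus-WAL recovery path reduces to "load the snapshot, replay the log in order". A minimal in-memory sketch follows, assuming writes are upserts keyed by vector ID. The `VectorStore` class and `recover` function are hypothetical, and a real engine would persist both structures to disk and rebuild the ANN index from the recovered vectors.

```python
from dataclasses import dataclass, field

@dataclass
class VectorStore:
    vectors: dict = field(default_factory=dict)    # id -> vector (live state)
    snapshot: dict = field(default_factory=dict)   # last persisted copy
    wal: list = field(default_factory=list)        # ops since the last snapshot

    def upsert(self, vec_id, vector):
        self.wal.append(("upsert", vec_id, vector))   # log first, then apply
        self.vectors[vec_id] = vector

    def delete(self, vec_id):
        self.wal.append(("delete", vec_id, None))
        self.vectors.pop(vec_id, None)

    def take_snapshot(self):
        self.snapshot = dict(self.vectors)
        self.wal.clear()                              # the log restarts here

def recover(snapshot, wal):
    """Rebuild the live state after a crash: snapshot first, then replay the WAL."""
    state = dict(snapshot)
    for op, vec_id, vector in wal:
        if op == "upsert":
            state[vec_id] = vector
        else:
            state.pop(vec_id, None)
    return state

store = VectorStore()
store.upsert("doc-1", [0.1, 0.2])
store.upsert("doc-2", [0.3, 0.4])
store.take_snapshot()
store.upsert("doc-3", [0.5, 0.6])
store.upsert("doc-3", [0.5, 0.6])    # repeated upsert: same ID, no duplicate entry
store.delete("doc-1")

recovered = recover(store.snapshot, store.wal)
print(recovered)
```

Keying every operation by ID is what makes replay safe to repeat: applying the same WAL twice yields the same state.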


§5. Capacity envelope

The capacity envelope of vector DBs spans roughly four orders of magnitude, from about a million vectors to tens of billions. Concrete numbers from production deployments:

  • Small (≤ 1M vectors) — pgvector on a single Postgres instance. A 1M-vector, 1536-dim, M=16 HNSW index fits in ~6 GB of RAM. Queries: 5-20 ms. Common for SaaS startup RAG over a customer's docs.
  • Mid (10M-100M vectors) — Pinecone, Weaviate, Qdrant, Elasticsearch dense_vector. Single node or small cluster. Vectors typically in RAM (sharded). Query p99: 10-50 ms. Used by mid-market RAG products, recommendation systems at smaller consumer apps.
  • Large (1B+ vectors) — Milvus, FAISS-based custom deployments, ScaNN. Distributed multi-node, often IVFPQ to fit in memory. Spotify's recommender (Annoy in-process), Pinterest visual search, Notion's RAG over enterprise workspaces. Query p99: 50-200 ms.
  • Giant (10B+ vectors) — bespoke systems at Google (YouTube candidate generation via ScaNN), Meta (Reels, Instagram recommendations via in-house systems built on FAISS lineage), Amazon (product recommendations). At this scale no off-the-shelf vector DB is used; the deployments are custom, often involving multi-tier indexes (a coarse IVFPQ for candidate generation followed by exact rerank).

A useful approximate scaling rule: a single modern node (32-core, 256-512 GB RAM, NVMe SSD) can host an HNSW index of about 50-200M vectors at 768 dimensions, depending on M and ef. Beyond that, you shard. Beyond about 5B vectors, you go custom.


§6. Architecture in context — the canonical RAG / similarity pipeline

The vector DB rarely stands alone. The canonical integration pattern in 2024-2025 looks like this:

        ┌───────────────────────────────────────────────────────────────┐
        │                       INGEST PATH                              │
        │                                                                │
        │  raw docs / items / images                                     │
        │     │                                                          │
        │     ▼                                                          │
        │  Chunker (split documents into 200-1000 token passages          │
        │           with 50-200 token overlap)                            │
        │     │                                                          │
        │     ▼                                                          │
        │  Embedding model (OpenAI ada-002, Cohere, BGE, E5,             │
        │                   CLIP for images)                              │
        │     │                                                          │
        │     │  vector + doc_id + metadata                              │
        │     ▼                                                          │
        │  Vector DB (HNSW or IVFPQ index, metadata sidecar)             │
        │     ▲                                                          │
        │     │  primary store (Postgres / S3 / Mongo)                   │
        │     │  is the source of truth for raw text and IDs              │
        └─────┼──────────────────────────────────────────────────────────┘
              │
              │
        ┌─────┼──────────────────────────────────────────────────────────┐
        │     │                  QUERY PATH                              │
        │     │                                                          │
        │  query string from user                                        │
        │     │                                                          │
        │     ▼                                                          │
        │  Embedding model (SAME model used at ingest)                   │
        │     │                                                          │
        │     │  query vector                                            │
        │     ▼                                                          │
        │  Vector DB ──► top-K candidates + scores + metadata            │
        │     │                                                          │
        │     ▼                                                          │
        │  Optional re-ranker (cross-encoder, e.g. Cohere rerank,        │
        │                       BGE-reranker)                             │
        │     │                                                          │
        │     ▼                                                          │
        │  Top-K re-ranked passages                                      │
        │     │                                                          │
        │     ▼                                                          │
        │  LLM (Large Language Model) with passages as context           │
        │     │                                                          │
        │     ▼                                                          │
        │  Answer with citations                                         │
        └─────────────────────────────────────────────────────────────────┘

Key invariants:

  • Same embedding model on ingest and query. If query uses V2 but corpus is V1, vectors are in incompatible spaces and recall collapses. Re-embed the corpus whenever the model changes.
  • Vector DB returns IDs; actual content is fetched from the primary. Typical: vector DB stores {id, vector, metadata: {doc_id, chunk_idx}}. The full text lives in the primary store and is fetched after retrieval.
  • Re-ranking is optional but standard. ANN gives a fast coarse top-100; cross-encoder reranks to top-10.
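
The first two invariants can be made concrete with a toy query path. In this sketch the embedder is a bag-of-words counter standing in for a real embedding model (it has no semantic understanding), the "vector DB" is a plain list, and `MODEL_VERSION`, `embed`, and `retrieve` are hypothetical names.

```python
import math
from collections import Counter

MODEL_VERSION = "toy-bow-v1"   # hypothetical version tag stored with every vector

def embed(text):
    """Stand-in embedder: bag-of-words counts, NOT a real semantic model."""
    return Counter(text.lower().split())

def cosine(a, b):
    shared = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return shared / norm if norm else 0.0

# Primary store: source of truth for the raw text.
primary = {
    "doc-1": "passport renewal procedure for citizens",
    "doc-2": "how to reset a router password",
    "doc-3": "renewal of a driving licence",
}

# "Vector DB": holds only id, vector, and the model version used at ingest.
index = [{"id": doc_id, "vector": embed(text), "model": MODEL_VERSION}
         for doc_id, text in primary.items()]

def retrieve(query, k, query_model=MODEL_VERSION):
    q = embed(query)
    hits = []
    for entry in index:
        if entry["model"] != query_model:
            raise ValueError("query and corpus were embedded with different models")
        hits.append((cosine(q, entry["vector"]), entry["id"]))
    top = sorted(hits, reverse=True)[:k]
    # The index returns IDs; the text itself comes from the primary store.
    return [(doc_id, primary[doc_id]) for _, doc_id in top]

results = retrieve("passport renewal", k=2)
print(results)
```

Storing the model version next to each vector turns a silent recall collapse into an explicit error when the query side is upgraded before the corpus is re-embedded.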

For recommender systems, replace "LLM" with "ranking model" and the path is identical.


§7. Hard problems inherent to vector DBs

Problem 7.1 — Recall vs latency tradeoff

Naive solution: pick HNSW defaults (ef_search=50, M=16) and ship.

How it breaks: measured on production traffic, recall is only 90%. Users complain that obvious matches aren't surfacing. Engineers crank ef_search to 500 to fix recall; p99 latency jumps from 15 ms to 70 ms; the SLO (Service Level Objective) is blown.

Concrete state: a support-ticket retrieval system has 5M vectors at d=768, M=16, ef_search=100. Recall@10 = 94%. The bot is missing relevant past tickets ~1 time in 16. Bumping ef_search to 400 gets recall to 99.2% but doubles latency.

Actual fix: this is fundamentally a Pareto frontier; you don't escape it. Three real moves:

  1. Increase M (rebuild the index with M=32 or M=64). Higher M shifts the Pareto curve, giving more recall at the same ef_search, at the price of more memory per node and a slower build.
  2. Add a cheap re-ranking step: retrieve top-100 with ef_search=100 (fast), then rerank to top-10 with a small cross-encoder. The cross-encoder promotes relevant candidates that ANN scored outside its own top-10, and you've kept latency in budget. It cannot recover candidates that never reached the top-100.
  3. Hybrid retrieval: combine ANN with keyword search and use RRF (Reciprocal Rank Fusion). The keyword side catches exact-term matches that vectors are mediocre at; the vector side catches semantic matches that keyword misses. Combined recall is higher than either alone at the same latency.
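To know where a given (M, ef_search) setting sits on the curve, recall has to be measured against exact ground truth. The sketch below is one minimal way to do that on a sample of queries; the function names are illustrative, the corpus is a plain dict of id to vector, and the ANN result is assumed to come from whatever engine is under test.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def exact_top_k(query, corpus, k):
    """Brute-force ground truth: ids of the k closest vectors.

    corpus is a dict mapping id -> vector (hypothetical in-memory layout).
    """
    ranked = sorted(corpus, key=lambda vid: cosine_distance(query, corpus[vid]))
    return ranked[:k]

def recall_at_k(ann_ids, exact_ids):
    """Fraction of the exact top-k that the ANN result also returned."""
    exact = set(exact_ids)
    return len(exact & set(ann_ids)) / len(exact)

corpus = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
truth = exact_top_k([1.0, 0.0], corpus, k=2)
# Pretend the ANN index returned "a" and "c" for the same query.
print(recall_at_k(["a", "c"], truth))
```

Averaging recall_at_k over a few hundred sampled queries per parameter setting, next to the measured p99, gives the actual frontier for the corpus rather than a guess.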

Problem 7.2 — Updates and deletes

Naive solution: insert and delete vectors freely.

How it breaks: HNSW graphs degrade with deletions. A delete is a tombstone — the node is marked invalid but its edges remain. Over time, traversals waste time on dead nodes. After 30% of vectors are deleted, recall drops noticeably and latency rises 2-3×. Bulk re-indexing is required.

Concrete state: an e-commerce catalog has 50M product vectors with high turnover — 5% of products are added or removed daily. After three months, half the original vectors are tombstones; queries are 2× slower; recall has fallen 5%.

Actual fix: most vector DBs implement periodic merge/rebuild. Milvus does segment-level rebuilds. Weaviate and Qdrant rebuild HNSW shards in the background. Pinecone abstracts this away (it's their managed service problem). pgvector's HNSW is more naive — heavy deletes hurt; you may need to manually REINDEX. The lesson: any catalog with churn needs an index rebuild policy, not just a delete API.
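A rebuild policy can be as small as a tombstone-ratio check per shard. The sketch below is an assumption-laden illustration: ShardStats and the 20% threshold are hypothetical, chosen to trigger before the roughly 30% point where the text says degradation becomes noticeable, and how the counts are obtained depends on the engine.

```python
from dataclasses import dataclass

@dataclass
class ShardStats:
    """Hypothetical per-shard counters exposed by the engine or tracked by the app."""
    live_vectors: int
    tombstones: int

def tombstone_ratio(stats):
    """Share of graph nodes that are deleted but still physically present."""
    total = stats.live_vectors + stats.tombstones
    return stats.tombstones / total if total else 0.0

def needs_rebuild(stats, max_ratio=0.2):
    """True once tombstones reach max_ratio of the shard (threshold is an assumption)."""
    return tombstone_ratio(stats) >= max_ratio

shards = {"shard-0": ShardStats(95, 5), "shard-1": ShardStats(70, 30)}
to_rebuild = [name for name, s in shards.items() if needs_rebuild(s)]
print(to_rebuild)
```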

Problem 7.3 — Hybrid retrieval (vector + keyword)

Naive solution: pure vector search. "Embeddings capture meaning; we don't need keyword."

How it breaks: vector search is mediocre at proper nouns, exact terms, recent jargon, and codes. A user searches "ELI5 transformer architecture": embeddings might pull up generic ML content but miss the actual "ELI5 transformer" tutorial because "ELI5" is a low-frequency token the embedding compresses heavily. A search for the product code "RTX 4090" loses against generic gaming GPU content.

Concrete state: a documentation search system uses pure vector ANN. Engineers report that searching for "kubectl apply -f" doesn't reliably surface the doc that literally contains that command, because the embedding model treats the snippet semantically rather than as a literal token.

Actual fix: hybrid retrieval. Run both BM25 (keyword) and vector ANN, then fuse with RRF (Reciprocal Rank Fusion): for each document d, score(d) = Σ over rankings i of 1/(k + rank_i(d)), where k is a smoothing constant (typically 60). RRF is robust (no weight tuning), gives improvements of 5-15% MRR (Mean Reciprocal Rank) over pure vector or pure keyword, and is the industry default in 2024-2025. Elastic, Weaviate, Vespa all have first-class RRF. Pinecone exposes a sparse-dense hybrid that's similar in spirit. See §9 of 03_search.md for the search-side view.

Problem 7.4 — Filter interaction with ANN

Naive solution: do ANN, then filter the results.

How it breaks: if the filter is selective (1% of corpus), you fetch top-100 and find only 1 surviving result, so recall on the filtered subset is terrible. You compensate by fetching top-10000, but now you've blown the latency budget.

Concrete state: a multi-tenant SaaS RAG product has 10M vectors total, but each tenant only sees their own data, on average 10,000 vectors per tenant (0.1% of the corpus). A post-filtered query that fetches the global top-100 and then filters by tenant_id = X often returns only 0-2 vectors for that tenant. Latency is 50 ms because of the over-fetch; quality is awful because most candidates are discarded.

Actual fix: filtered ANN with tightly-coupled metadata bitmaps. Qdrant pioneered the pattern: maintain per-payload-field bitmap indexes; during HNSW traversal, skip nodes that fail the filter as results, but keep exploring their neighbors as candidates. Alternatively, partition by tenant: assign each tenant its own collection or shard. With per-tenant collections of 10k vectors each, ANN is trivially fast and exact KNN on a single shard is plausible.
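The difference between post-filtering and per-tenant partitioning shows up even on a toy corpus. The sketch below uses exact scans in both paths to isolate the filtering effect; the row layout and function names are made up for illustration and do not correspond to any engine's API.

```python
import heapq

def l2_sq(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def post_filter_search(query, rows, tenant_id, fetch_k, k):
    """Global top-fetch_k first, tenant filter second. rows: (id, tenant, vector)."""
    candidates = heapq.nsmallest(fetch_k, rows, key=lambda r: l2_sq(query, r[2]))
    return [r[0] for r in candidates if r[1] == tenant_id][:k]

def partitioned_search(query, partitions, tenant_id, k):
    """Exact scan inside one tenant's partition. partitions: tenant -> [(id, vector)]."""
    rows = partitions.get(tenant_id, [])
    nearest = heapq.nsmallest(k, rows, key=lambda r: l2_sq(query, r[1]))
    return [r[0] for r in nearest]

rows = [
    ("b1", "big", [0.1, 0.0]),
    ("b2", "big", [0.2, 0.0]),
    ("b3", "big", [0.3, 0.0]),
    ("s1", "small", [5.0, 5.0]),
    ("s2", "small", [6.0, 6.0]),
]
partitions = {}
for vid, tenant, vec in rows:
    partitions.setdefault(tenant, []).append((vid, vec))

query = [0.0, 0.0]
# The global top-3 is entirely tenant "big", so the post-filter returns nothing.
print(post_filter_search(query, rows, "small", fetch_k=3, k=2))
# The partitioned scan only ever looks at tenant "small".
print(partitioned_search(query, partitions, "small", k=2))
```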

Problem 7.5 — Embedding drift / model version skew

Naive solution: deploy a new embedding model; don't re-embed the corpus.

How it breaks: queries are embedded with model V2 (3072 dims) while the corpus is in V1 (1536 dims). Either the query is rejected because of dimension mismatch (best case), or the application code silently truncates/pads vectors and recall is catastrophic (worst case — silent data corruption).

Even with same-dimension models, the embedding spaces are not aligned. A V2 vector of "passport renewal" might be at angle 1.2 radians from a V1 vector of the same phrase, even though both encode the same semantics. Mixing them is meaningless math.

Concrete state: a company switches from OpenAI ada-002 to text-embedding-3-small (both 1536 dims) without re-embedding the 50M-document corpus. Search quality drops 30% overnight because queries and docs are in incompatible spaces. The drop is silent: no errors, just bad results.

Actual fix:

  1. Maintain corpus version metadata. Tag each vector with embedding_model_version. Queries must use the same version. Cross-version queries are blocked.
  2. Re-embed on model change. This is expensive: 50M docs × $0.0001/1k tokens × ~500 tokens/doc = $2,500 in API fees, plus several days of re-indexing time. Plan for it.
  3. Shadow rollout. Index the corpus in both old and new versions; query both; compare quality; cut over when V2 wins.
  4. Use sentence-transformers / open-source models you can pin. Mitigates the risk of a vendor model changing under you.
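The version guard from item 1 and the cost arithmetic from item 2 fit in a few lines. This is a sketch under stated assumptions: the exception and function names are hypothetical, and the price and token counts are the illustrative figures from the text, not a quote from any vendor.

```python
class EmbeddingVersionMismatch(Exception):
    """Raised when a query vector and a collection come from different models."""

def check_query_version(query_model_version, collection_model_version):
    """Block cross-version queries instead of returning silently bad results."""
    if query_model_version != collection_model_version:
        raise EmbeddingVersionMismatch(
            f"query embedded with {query_model_version!r}, "
            f"collection holds {collection_model_version!r}"
        )

def reembedding_cost_usd(num_docs, tokens_per_doc, usd_per_1k_tokens):
    """API cost of re-embedding the whole corpus once."""
    return num_docs * tokens_per_doc / 1000 * usd_per_1k_tokens

# 50M docs at ~500 tokens each and $0.0001 per 1k tokens, as in the text.
print(round(reembedding_cost_usd(50_000_000, 500, 0.0001), 2))

try:
    check_query_version("v2", "v1")
except EmbeddingVersionMismatch as err:
    print("blocked:", err)
```

Failing loudly at query time turns the silent quality drop into an error that shows up in logs and alerts.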

Problem 7.6 — High-dimensional curse

Naive solution: pick whatever embedding dimension the latest model uses.

How it breaks: at d=3072 (OpenAI text-embedding-3-large), each vector is 12 KB. A 100M-vector index is 1.2 TB of raw vectors before edges. RAM cost balloons and cache misses skyrocket. The HNSW edge lists themselves do not grow with d (their size is set by M), but every distance computation along the traversal touches d floats, so latency goes up roughly linearly with d. ANN quality also subtly degrades at very high d because the "curse of dimensionality" pushes all distances toward similar values, making the metric less discriminating.

Concrete state: an enterprise switches from a 384-dim model to a 3072-dim model expecting better quality. Latency is 6× worse, infra cost is 8× higher, and the quality improvement is only 2-3%.

Actual fix:

  1. Use Matryoshka representation learning. Recent models (OpenAI text-embedding-3-*, BGE variants) are trained so that truncating to a shorter prefix (say 256 or 512 dims) preserves most of the signal. You get 90% of the quality at 25% of the cost.
  2. Apply PQ compression. Trades 2-5% recall for 8-100× memory reduction. Critical at 1B+ scale.
  3. Choose dimensions to match the workload. 384 dims is fine for most semantic search. 768 is a sweet spot. 1536+ only when the marginal quality matters more than the marginal cost.
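Matryoshka truncation is mechanically just "keep the first N components and re-normalize", and the memory effect is plain arithmetic. The sketch below assumes the vector came from a model actually trained this way; truncating an ordinary embedding like this would discard signal unpredictably. Function names are illustrative.

```python
import math

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` components and rescale to unit length.

    Re-normalizing matters because a prefix of a unit vector is generally
    no longer unit length, which skews dot-product comparisons.
    """
    if dims > len(vec):
        raise ValueError("dims exceeds vector length")
    prefix = vec[:dims]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("prefix has zero norm")
    return [x / norm for x in prefix]

def raw_vector_bytes(num_vectors, dims, bytes_per_component=4):
    """Raw float32 storage, before graph edges or metadata."""
    return num_vectors * dims * bytes_per_component

print(truncate_and_normalize([3.0, 4.0, 12.0], 2))
# 100M vectors at full width versus a 512-dim prefix.
print(raw_vector_bytes(100_000_000, 3072) / 1e12, "TB")
print(raw_vector_bytes(100_000_000, 512) / 1e9, "GB")
```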

Problem 7.7 — Real-time index updates

Naive solution: insert vectors as they arrive; the index updates immediately.

How it breaks: HNSW supports incremental adds, but each insert costs an ef_construction-quality search. At very high insert rates (10k/sec/shard), insert latency rises and query latency starts to suffer from lock contention or write amplification. Pure IVF requires a full re-cluster periodically; if you don't, new vectors fall into stale cluster assignments and recall drops.

Concrete state: a fraud-detection system embeds 10k events/second. The vector DB starts queueing inserts; the queue grows to 1M; query latency spikes because the index is constantly being modified.

Actual fix:

  1. Batch inserts. Buffer for 1-5 seconds; insert in bulk; commit to the searchable index once per batch.
  2. Two-tier index. Hot vectors (last hour) in a small frequently-rebuilt in-memory index; cold vectors in a large stable HNSW. Query both, merge results. This is what most fraud and real-time recsys deployments do.
  3. Periodic full rebuilds. Schedule background rebuilds nightly or per-shard rolling. Most production deployments rebuild the entire HNSW once a week to drop tombstones and re-optimize edge selection.
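The batching move can be sketched as a small buffer that flushes on size or age. Everything here is an assumption for illustration: InsertBuffer is a hypothetical class, flush_fn stands in for the engine's bulk-insert call, and the age check only runs when a new item arrives (a real implementation would also flush from a timer).

```python
import time

class InsertBuffer:
    """Collects (id, vector) pairs and hands them to flush_fn in batches."""

    def __init__(self, flush_fn, max_items=1000, max_age_s=2.0, clock=time.monotonic):
        self._flush_fn = flush_fn      # stand-in for the engine's bulk insert
        self._max_items = max_items
        self._max_age_s = max_age_s
        self._clock = clock            # injectable so the age rule can be tested
        self._items = []
        self._oldest = None

    def add(self, vector_id, vector):
        if not self._items:
            self._oldest = self._clock()
        self._items.append((vector_id, vector))
        too_many = len(self._items) >= self._max_items
        too_old = self._clock() - self._oldest >= self._max_age_s
        if too_many or too_old:
            self.flush()

    def flush(self):
        """Send whatever is buffered as one batch; no-op when empty."""
        if self._items:
            batch, self._items = self._items, []
            self._flush_fn(batch)

batches = []
buf = InsertBuffer(batches.append, max_items=2, max_age_s=60.0)
for i in range(3):
    buf.add(f"v{i}", [float(i)])
buf.flush()  # drain the remainder on shutdown
print([len(b) for b in batches])
```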


§8. Failure mode walkthrough

8.1 Crash mid-insert

State: HNSW node partially added — vector data written, some edges set, reverse edges not yet committed. In-memory graph inconsistent.

Recovery: WAL contains the insert. On restart, reload snapshot, replay WAL — the partial insert is either re-applied atomically or rolled back.

Durability point: operation is durable once the WAL is fsynced; the in-memory graph mutation is best-effort and reconstructable from the WAL.

8.2 Crash between two operations

State: insert A committed; insert B not yet started. Clean crash between.

Recovery: reload snapshot, replay WAL — A is re-applied; B is unknown to the engine. If the upstream producer persisted B, it can retry.

8.3 Leader / coordinator death

State: the coordinator routing queries to shards dies.

Recovery (Milvus example): etcd elects a new coordinator. Data nodes can serve reads during the 5-30s failover window. Inserts queue.

Pinecone abstracts this; users see brief latency spikes during managed failover, no API downtime.

Durability point: shard metadata lives in the coordination service (etcd, ZooKeeper); data lives on shard nodes. As long as both survive independently, recovery is automatic.

8.4 Network partition / split-brain

State: cluster partitions; both halves believe they are authoritative.

Vector DBs are typically AP (Availability + Partition tolerance). Both partitions serve reads (one stale); writes are either allowed on both and reconciled later (a hazard) or restricted to majority side.

Milvus uses Pulsar+etcd for a single write path; minority writes are blocked. Pinecone hides this with cross-AZ replication and control-plane quorum.

Durability point: writes go through a single agreed-upon log; both partitions converge by replaying it post-heal.

8.5 Permanent participant loss

State: a shard's storage is permanently lost (disk failure, all replicas dead).

Recovery: re-embed all docs on the lost shard from the upstream source-of-truth store. This is why the primary text/image store must be the deep source of truth — embeddings are always regenerable. Time cost: hours to days at billion scale; data recoverable as long as upstream is intact.

The pathological case: the engine has vectors but the application has lost doc_id → raw_text mapping. Embeddings exist but the application can't show the user what was matched. Doc IDs and raw text must be persisted independently of the vector DB.

8.6 The silent quality drift

State: a developer swaps the embedding endpoint to a new model (or the vendor silently upgrades it). New inserts are V2; existing vectors are V1.

Symptom: gradual quality degradation. No errors. ANN still returns results. Users complain search "feels worse."

Recovery: tag every vector with model version; run a quality eval (MRR on a labeled query set) regularly; re-embed when versions diverge. This is a quality-monitoring problem, not a crash-recovery one — the vector DB is fine; the corpus understanding has drifted.


§9. Why not the obvious simpler alternative — exact KNN with linear scan

The naive replacement for an ANN index is brute-force KNN: store all vectors in a flat array; for each query, compute distance to every vector; return top-K. No index. No tuning. Always exact.

Why this breaks:

Consider a 1B-vector, d=1536 OpenAI corpus. Raw storage: 1B × 1536 × 4 B = 6 TB.

A query is a scan over all 6 TB. Even with all data in RAM and SIMD-accelerated dot products at 100 GFLOPS sustained per core, that's 1B × 1536 = 1.5 × 10¹² float operations per query → 15 seconds per query per core. Parallelize across 64 cores: 230 ms per query, assuming perfect linear scaling, which never happens because of memory bandwidth saturation.

On a GPU with 1 TFLOPS effective throughput: ~1.5 seconds per query. Still too slow for an interactive product.

ANN's HNSW visits maybe 3000-5000 vectors per query out of 1B. That's a 200,000× reduction in compute. Sub-millisecond per query, configurably traded against recall.
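The arithmetic above is easy to re-run for other corpus sizes. The sketch below reproduces the numbers in the text under the same simplifications: one operation per vector component, compute-bound only, and perfect scaling across cores. The function name and parameters are illustrative.

```python
def brute_force_scan(num_vectors, dims, ops_per_second, cores=1, bytes_per_component=4):
    """Return (raw storage in bytes, seconds per query) for an exact linear scan.

    Counts one operation per vector component, as the text does, and ignores
    memory bandwidth, which in practice is the tighter limit.
    """
    storage_bytes = num_vectors * dims * bytes_per_component
    ops = num_vectors * dims
    seconds = ops / (ops_per_second * cores)
    return storage_bytes, seconds

storage, one_core = brute_force_scan(1_000_000_000, 1536, 100e9)
_, sixty_four = brute_force_scan(1_000_000_000, 1536, 100e9, cores=64)
print(storage / 1e12, "TB")
print(one_core, "s on one core;", sixty_four, "s on 64 cores")

# Ratio of vectors touched: full scan versus ~5000 HNSW visits.
print(1_000_000_000 // 5000)
```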

At what corpus size does ANN start to pay off? Roughly at N ≈ 10⁶ — below that, brute force in NumPy or PyTorch on a single CPU/GPU is fine. Above 10⁶, ANN is essential. Above 10⁸, exact KNN is unaffordable. Above 10⁹, even ANN requires significant engineering (sharding, IVFPQ, multi-tier).

The case for brute force still exists at very small scale (a flat index such as FAISS IndexFlatL2 is exactly this) and as the rerank step after ANN: re-rank the top-100 ANN candidates against the query with exact full-precision distance to recover the last few points of recall.


§10. Scaling axes

Type 1 — uniform growth (more vectors, more queries)

Inflection 1: single-node memory. A 256 GB RAM node fits ~30-50M vectors at d=1536 in HNSW. Beyond that, shard. Strategies: hash sharding on vector_id (uniform load, queries fan out to every shard); metadata sharding by tenant or category (reduces fanout but creates hotspots on uneven tenants); hybrid — shard by tenant, replicate hot tenants.

Inflection 2: write throughput. A single HNSW index tops out at 1k-10k inserts/second/shard. Beyond that, batch inserts, two-tier hot+cold indexes, or move to IVF (lower per-insert cost).

Inflection 3: cross-shard ANN cost. With S shards, a top-K query fetches a multiple of K from each shard (e.g., the top 3×K) to ensure global recall after merge. Latency is bounded by the slowest shard. Beyond ~50-100 shards, tail latency dominates and you need request hedging or replica selection.

Inflection 4: dimensionality. At d ≥ 3072, even HNSW becomes memory-bound; PQ or Matryoshka truncation becomes mandatory.

Inflection 5: billion-scale. Above 1B vectors, IVFPQ with disk-resident clusters is the standard architecture; pure HNSW is too memory-hungry. DiskANN is the alternative — HNSW-like graph on SSD with cache-aware layout.
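The scatter-gather merge behind Inflection 3 is a k-way merge of per-shard result lists. A minimal sketch, assuming each shard returns (distance, id) pairs already sorted by ascending distance; the function name and data layout are illustrative.

```python
import heapq
from itertools import islice

def merge_shard_results(shard_results, k):
    """Global top-k from per-shard lists of (distance, id), each sorted ascending."""
    return list(islice(heapq.merge(*shard_results), k))

shard_results = [
    [(0.10, "a"), (0.40, "d")],   # shard 0
    [(0.20, "b"), (0.30, "c")],   # shard 1
    [],                           # shard 2 had no candidates
]
print(merge_shard_results(shard_results, k=3))
```

The merge itself is cheap; the cost is in waiting for the slowest shard, which is why hedging and replica selection matter more as the shard count grows.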

Type 2 — hotspot growth (a few queries or tenants dominate traffic)

Common in product search ("iPhone"), support ("password reset"), assistants ("what time is it"). Fixes:

  • Query-result caching. Cache (query_vector, top_K_results). Key by the rounded query vector — many users phrase the same question slightly differently and embeddings map to nearly identical vectors, so even coarse rounding gets cache hits.
  • Hot-shard replication. If one tenant or category dominates traffic, replicate that shard 3-5×.
  • Warm caches. Pre-compute top-K for known popular queries during off-peak.
  • Rerank caches. Cache (query, candidate IDs) → reranked order to skip the slow cross-encoder for repeated queries.

The hotspot fix lives mostly upstream of the vector DB itself — at the query-embedding and result-caching layer.
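Keying a result cache by a rounded query vector can be sketched as follows. The class and the two-decimal rounding are assumptions for illustration. Note the limitation: two vectors only share a key if every component rounds to the same value, so with hundreds of dimensions the hit rate depends heavily on how coarse the rounding is and should be measured, not assumed.

```python
def cache_key(query_vector, decimals=2):
    """Round each component so near-identical embeddings map to the same key."""
    return tuple(round(x, decimals) for x in query_vector)

class QueryResultCache:
    """Unbounded in-memory cache; a real one would add TTLs and eviction."""

    def __init__(self, decimals=2):
        self._decimals = decimals
        self._store = {}

    def get(self, query_vector):
        return self._store.get(cache_key(query_vector, self._decimals))

    def put(self, query_vector, results):
        self._store[cache_key(query_vector, self._decimals)] = results

cache = QueryResultCache()
cache.put([0.1234, 0.5678], ["doc1", "doc7"])
print(cache.get([0.1241, 0.5681]))  # slightly different phrasing, same rounded key
print(cache.get([0.2000, 0.5000]))  # different query, cache miss
```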


§11. Decision matrix vs adjacent technology categories

| Dimension | pgvector | Pinecone | Milvus | Weaviate | Qdrant | ES dense_vector | OpenSearch k-NN | Chroma | FAISS |
|---|---|---|---|---|---|---|---|---|---|
| Hosting | Self-host | Managed only | Self/managed | Self/managed | Self/managed | Self/managed | Self/managed | Embedded | Library |
| License | PostgreSQL | Closed | Apache 2.0 | BSD-3 | Apache 2.0 | Elastic License | Apache 2.0 | Apache 2.0 | MIT |
| Sweet spot | < 50M | 1M-1B | 100M-10B | 10M-500M | 10M-200M | already on ES | already on OpenSearch | < 1M (prototype) | research, embed |
| HNSW | Yes (v0.5+) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| IVFPQ | No | Internal | Yes | No | No | No | Via FAISS plugin | No | Yes |
| Filtered ANN | Basic | Yes | Yes | Yes (best-in-class) | Yes (best-in-class) | Improving | Improving | Basic | Manual |
| Hybrid (vector+keyword) | Tough | Sparse-dense | Limited | Yes | Yes | First-class (BM25+HNSW) | First-class | No | No |
| Distributed | Limited | Built-in | Yes | Yes | Yes | Yes (Lucene-style) | Yes | No | No |
| Schema flex | Native SQL | Limited | Schema-on-write | Schema, modular | Schema | Mappings | Mappings | Loose | None |
| When to pick | Already have PG, ≤50M | Want managed, AI-first | 1B+ self-host | RAG with rich filters | Lean, modern, mid-scale | Already on ES, want vector | Same as ES, Apache license | Notebooks, prototypes | Custom platform |

Adjacent categories for comparison:

| Category | Strengths | Where it loses to vector DBs |
|---|---|---|
| Relational DB with B+ tree | ACID, exact match, joins, SQL | No semantic similarity; trigram only does substring |
| Search engine (ES/OpenSearch, Solr, Vespa) | BM25, faceting, mature ops, hybrid possible | Pure-vector workloads less optimized; Lucene segment model awkward for ANN |
| Key-value store | Sub-ms lookups | No similarity |
| Graph DB | Explicit relationships, multi-hop | No implicit similarity |
| Embedded ANN library (FAISS, Annoy) | Max control, no network hop | No service layer |

Specific thresholds:

  • < 10M vectors, < 100 QPS, latency budget > 50 ms, ops aversion: pgvector. Just add the extension.
  • < 100M vectors, want managed, comfortable with vendor lock-in: Pinecone.
  • < 100M vectors, want open-source, want rich filtering: Qdrant or Weaviate.
  • 100M-10B vectors, self-host, ops capacity: Milvus.
  • You already have ES/OpenSearch, want hybrid keyword+vector: ES dense_vector or OpenSearch k-NN.
  • You're prototyping in a Jupyter notebook: Chroma. (Don't deploy it to production.)
  • You're building your own platform / custom retrieval system: FAISS as a library.

§12. Embedding models — the deep dive vector DBs depend on

Vector DBs are downstream of embedding models. The vector DB is dumb about meaning; all meaning is upstream.

Closed/commercial models

  • OpenAI text-embedding-ada-002 (2022) — 1536 dims, ~$0.0001/1k tokens. The 2023 de facto standard.
  • OpenAI text-embedding-3-small / -3-large (2024) — 1536 / 3072 dims; cheaper and better than ada-002; Matryoshka-truncatable.
  • Cohere embed-english-v3.0, embed-multilingual-v3.0 — 1024 dims; strong multilingual; well-priced.
  • Google text-embedding-004 — 768 dims, part of Gemini.

Open-source models

  • BGE family (BAAI): bge-large/base/small, 1024/768/384 dims. Top of MTEB (Massive Text Embedding Benchmark) leaderboard among open models. Fine-tune-friendly.
  • E5 family (Microsoft) — instruction-tuned embeddings; e5-mistral-7b-instruct is high quality.
  • sentence-transformers: all-MiniLM-L6-v2 (384 dims, fast), all-mpnet-base-v2 (768 dims, balanced). Swiss Army knife.
  • Nomic nomic-embed-text-v1 — 768 dims, fully open weights.

Image embeddings

  • CLIP (OpenAI) — joint image+text space (512/768 dims). Basis for "search images by text" workflows.
  • DINOv2 (Meta) — self-supervised image embeddings, no text alignment. Pure visual similarity.
  • SigLIP (Google) — improved CLIP successor with sigmoid loss.

Fine-tune vs zero-shot

Zero-shot is fast to deploy but mediocre for domain-specific tasks — a medical RAG might get 70% recall@10 with generic ada-002 vs 85% fine-tuned on medical Q&A. For consumer-grade RAG, zero-shot is fine; for specialized domains (legal, medical, code), fine-tuning gains 5-20% on MRR or recall@K. Approach: contrastive loss with in-batch negatives over (query, relevant doc) pairs via sentence-transformers or FlagEmbedding.

Dimensionality choices

  • 128-384 dims: fast, low memory, lower quality. Large corpora where cost dominates, or feeding a downstream rerank.
  • 512-768 dims: workhorse range. sentence-transformers, BGE-base. Good quality-to-cost.
  • 1024-1536 dims: high quality. ada-002, Cohere v3, BGE-large. Premium RAG default.
  • 3072+ dims: top-end. text-embedding-3-large. Diminishing gains; significant cost.

Matryoshka representation learning (2022; deployed in OpenAI text-embedding-3-* and many 2024+ models) trains a single model where the first N dimensions are themselves a usable embedding. Store full vectors and query with a truncated prefix, or store only the truncated prefix to save memory; dropped dimensions cannot be reconstructed from the prefix, so going back to full length means re-embedding. Increasingly the default for new models.


§13. RAG (Retrieval Augmented Generation) architecture

RAG drove the commercial wave of vector DBs after GPT-3.5 / GPT-4 made it obvious that LLMs needed a grounding mechanism. The structure is canonical:

INGEST
  Documents (PDFs, HTML, Markdown, transcripts...)
    │
    ▼
  Chunking (200-1000 tokens per chunk, 50-200 token overlap)
    │
    ▼
  Embedding model
    │
    ▼
  Vector DB (chunk_id, vector, metadata={doc_id, chunk_idx, source, timestamp})


QUERY
  User question
    │
    ▼
  Embedding model (SAME)
    │
    ▼
  Vector DB ANN ── top-K chunks (K = 20-100)
    │
    ▼
  Cross-encoder reranker (Cohere rerank, BGE-reranker)
    │
    ▼
  Top-K' chunks (K' = 3-10)
    │
    ▼
  Prompt: "Answer using only the following passages:
           [chunk 1]... [chunk 2]... QUESTION: {q}"
    │
    ▼
  LLM (GPT-4, Claude, Llama)
    │
    ▼
  Answer with inline citations

Chunking strategies

  • Fixed-size with overlap (default naive RAG): split every 500 tokens with 50-token overlap. Predictable; works for narrative text.
  • Semantic chunking: split on natural boundaries (section headers, topic shifts detected by an embedding model). Robust but harder.
  • Recursive chunking (LangChain default): try paragraph → sentence → character. Pragmatic fallback chain.
  • Document-structure-aware: respect headings, tables, code blocks — essential for technical docs and code.
  • Chunk size matters: too small (< 100 tokens) loses context; too large (> 1500) bloats the prompt. 300-800 is the sweet spot.
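Fixed-size chunking with overlap, the first strategy in the list, is a sliding window over the token sequence. A minimal sketch, assuming the input is already tokenized into a list (the tokenizer itself is out of scope here, and the function name is illustrative):

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end; avoid a tiny trailing chunk
    return chunks

# Ten "tokens", windows of 4 with 1 token of overlap.
print(chunk_tokens(list(range(10)), chunk_size=4, overlap=1))
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each chunk would then be stored with its chunk_idx and doc_id so that the retrieved passage can be traced back to its source document.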

Re-ranking with cross-encoders

ANN gives a fast coarse top-K. Cross-encoders take (query, passage) as joint input and score relevance — slower (10-100ms/pair on CPU; 1-10ms on GPU) but more accurate. The standard pipeline: ANN top-100 → cross-encoder rerank → top-10 into the LLM prompt. Typical gain on nDCG@10 (normalized Discounted Cumulative Gain): 5-15%. Common rerankers: Cohere rerank, BGE-reranker-large, ms-marco-MiniLM.

The hallucination problem

When retrieved context does not answer the user's question, LLMs frequently hallucinate. Mitigations: filter low-confidence retrievals (skip the LLM entirely if top-K cosine is low); require citations and verify them post-hoc; multi-hop / agentic retrieval (refine the query and retry); evaluation harnesses (RAGAS, TruLens) measuring faithfulness and answer relevance.


§14. Hybrid search — vector + keyword

The 2023-2024 industry consensus: hybrid beats pure vector beats pure keyword for most retrieval tasks. The mechanism is straightforward — vectors and keywords complement each other's weaknesses.

Why hybrid wins

  • Keyword search excels at: proper nouns, product codes, exact phrases, technical jargon, code symbols, recent neologisms the embedding model has never seen.
  • Vector search excels at: paraphrases, semantic intent, cross-language similarity, conceptual matches without lexical overlap.

A search for "java vs scala" benefits from keyword (the token "java" is rare and content-bearing); a search for "how do I make my service faster" benefits from vectors (no exact phrase will match all the docs about optimization, caching, profiling, etc.).

Reciprocal Rank Fusion (RRF)

The canonical fusion algorithm. For each document d, score it across all ranking sources i (e.g., BM25 ranking and vector ranking):

RRF_score(d) = Σ_i  1 / (k + rank_i(d))

Where k is a smoothing constant (usually 60, from the original Cormack et al. 2009 paper). Documents that appear high in any ranking get a high score; documents that appear in multiple rankings get amplified.

RRF is robust: it requires no weight tuning, no score normalization, no calibration. It treats each ranking as an ordinal vote. It is the default for hybrid in OpenSearch, Elasticsearch, Weaviate, Vespa, and most modern retrieval frameworks. Improvements over pure vector or pure keyword are typically 5-15% MRR.
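The formula translates almost line for line into code. A minimal sketch, assuming each ranking is a list of document IDs ordered best first and ranks start at 1; the function name and the alphabetical tie-break are illustrative choices.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of doc ids (best first)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_ranking = ["a", "b", "c"]
vector_ranking = ["c", "a", "d"]
print(rrf_fuse([bm25_ranking, vector_ranking]))
# → ['a', 'c', 'b', 'd']
```

Documents "a" and "c" appear in both lists and rise to the top; "b" and "d" each get a single vote and are ordered by their rank within that one list.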

Weighted combination

The alternative is score = α × BM25_score + (1-α) × cosine_score, with α tunable. This requires score normalization (BM25 and cosine are on different scales) and parameter tuning. It can beat RRF when carefully tuned on a labeled dataset but is fragile.
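A sketch of the weighted alternative makes the extra moving parts visible. Min-max normalization per result list is an assumption here (other schemes such as z-scores are equally plausible), as are the function names; documents missing from one list are treated as scoring 0 on that side.

```python
def min_max(scores):
    """Rescale a dict of doc id -> raw score into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}  # no spread, so no signal to rank on
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def weighted_fuse(bm25_scores, cosine_scores, alpha=0.5):
    """score = alpha * BM25 + (1 - alpha) * cosine, on normalized scores."""
    b, c = min_max(bm25_scores), min_max(cosine_scores)
    fused = {
        d: alpha * b.get(d, 0.0) + (1 - alpha) * c.get(d, 0.0)
        for d in set(b) | set(c)
    }
    return sorted(fused, key=lambda d: (-fused[d], d))

bm25_scores = {"a": 12.0, "b": 3.0, "c": 6.0}
cosine_scores = {"a": 0.2, "b": 0.8, "c": 0.7}
print(weighted_fuse(bm25_scores, cosine_scores, alpha=0.5))
print(weighted_fuse(bm25_scores, cosine_scores, alpha=1.0))
```

Changing alpha reorders the results, which is exactly the tuning burden RRF avoids.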

Sparse-dense hybrid (SPLADE, ColBERT)

A frontier line: learned sparse embeddings (SPLADE) and late-interaction models (ColBERT) bridge keyword and vector. SPLADE outputs a sparse high-dim vector that behaves like a learned BM25 index. ColBERT computes per-token embeddings and matches tokens individually. These are research- and production-grade in 2024-2025 but heavier to deploy than dense + BM25.


§15. Common use cases

15.1 Semantic search over documents (RAG for support, knowledge bases)

The single most common use case in 2024-2025. A company indexes its internal Confluence, Notion, Google Docs, Slack, Zendesk tickets, etc. Users ask questions; a RAG pipeline retrieves relevant chunks and feeds them to an LLM. Tools like Glean, Notion AI, GitHub Copilot Chat, and countless internal "chat with our docs" deployments are this pattern.

Demands: high recall (don't miss the relevant doc), strong filtering (only see docs the user has permission for), freshness (a new doc must be searchable within minutes). pgvector, Pinecone, Weaviate, or Elasticsearch hybrid are all viable depending on existing infrastructure.

15.2 Recommendation systems

Spotify's Discover Weekly, Netflix's "more like this," YouTube's next-up feed, Amazon's "customers also bought" — all use vector similarity at the candidate-generation tier. A user's listening history → an embedding; nearest neighbors in track-embedding space become candidates; a heavier ranking model picks the final ordering.

Demands: massive scale (10M+ users, 100M+ items, but only ~10⁶-10⁷ active items in the candidate set); very low latency (sub-100ms p99 for the candidate generation step); approximate is fine because the rerank stage does the heavy lifting. ScaNN (YouTube), Annoy (Spotify), and custom FAISS-based systems dominate at this scale.

15.3 Image and visual search

CLIP-style models embed images and text into the same space. "Find images of red sneakers" becomes "embed 'red sneakers' as a query vector and ANN over the image embedding corpus." Pinterest's visual search, Google Lens, Shopify's visual-similar-product feature all use variants of this.

Demands: storage is dominated by image vectors (often 512-dim from CLIP); ingest is expensive because every image needs the model run; the metric is usually cosine on L2-normalized embeddings. Milvus, Qdrant, and custom FAISS deployments are common.

15.4 Anomaly detection

Embed normal events; cluster them. New events are scored by distance from the nearest cluster centroid or by their LOF (Local Outlier Factor) within the embedded space. Used in fraud detection, network intrusion detection, and industrial sensor monitoring.

Demands: continuous ingest of new event embeddings; fast nearest-neighbor queries; the corpus grows continuously, so an IVF index with periodic re-clustering or an HNSW with incremental adds is required.

15.5 Deduplication

Find near-duplicate articles, products, or images. Embed everything; cluster by similarity threshold (e.g., cosine > 0.95); flag clusters for review or auto-merge. Used in news aggregators (don't show 50 versions of the same AP wire story), e-commerce (consolidate duplicate listings), and content moderation (find variants of known spam).

Demands: very high recall on the near-duplicate detection (a missed dup is a quality bug); willingness to use a strong embedding model; the dedupe pass is often a batch job, not realtime.
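Threshold-based grouping can be sketched with a union-find over pairs whose cosine similarity exceeds the cutoff. The all-pairs loop below is quadratic and only suitable for illustration; at scale the candidate pairs would come from an ANN query per item. Names and the toy vectors are made up.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def near_duplicate_groups(vectors, threshold=0.95):
    """Group ids whose vectors are linked by cosine > threshold (transitively)."""
    parent = {vid: vid for vid in vectors}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for vid in vectors:
        groups.setdefault(find(vid), []).append(vid)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)

vectors = {
    "wire-story": [1.0, 0.0],
    "wire-story-copy": [0.99, 0.05],
    "unrelated": [0.0, 1.0],
}
print(near_duplicate_groups(vectors))
```

Because linking is transitive, a chain of pairwise-similar items ends up in one group even if its two ends fall below the threshold, which is usually what a review queue wants but is worth knowing before auto-merging.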

15.6 Customer support intent matching

A new support ticket arrives. Embed it. Find the K nearest past tickets. Use the resolution pattern of those tickets to route, suggest replies, or auto-respond. Zendesk Answer Bot, Intercom Resolution Bot, and similar systems are this pattern.

Demands: per-tenant isolation (each company's tickets are siloed); fast ingest (new tickets must be searchable in seconds); rich metadata filtering (status=open, priority=high, etc.). Multi-tenancy is the dominant operational concern.

15.7 Code search

Semantic code search: "find functions that parse JSON and handle errors." Embed every function with a code-aware model (CodeBERT, OpenAI code embeddings). Used in GitHub Copilot context-fetching, Sourcegraph semantic search, and internal dev tools. Often combined with AST-based structural search and exact-string search for completeness.

15.8 Multimodal search

Text-to-image and image-to-text via CLIP-like models. Used in shopping (find similar products from a photo), stock-image libraries, and moderation (find policy-violating images by semantic concept).

15.9 Personalization

User embedding + item embedding → predict affinity by dot product. The user embedding is computed from interaction history; items are embedded by content; nearest neighbors become recommendations.

15.10 Plagiarism / paragraph similarity

Embed every paragraph; compare against a corpus of known sources via ANN. Used in academic integrity tools, AI-content detection, and legal e-discovery.


§16. Real-world implementations

Pinecone

The leading dedicated managed vector DB. Raised significant funding in 2023 ($100M Series B at a $750M valuation) on the RAG wave. Built on proprietary internals (originally HNSW-based; has evolved). Hundreds of thousands of customers across indie RAG prototypes to enterprise deployments. The "easy button" of vector search.

Weaviate

Open-source vector DB with a managed cloud option. GraphQL-first API. Modular embedding integrations (OpenAI, Cohere, Hugging Face). Heavily used in mid-market RAG products.

Milvus / Zilliz

Open-source distributed vector DB; donated to the Linux Foundation. Heavy stack (Pulsar, etcd, MinIO/S3). Built for billion-vector scale. Zilliz Cloud is the managed offering.

Qdrant

Rust-based, single-binary, lean and modern. Apache 2.0. Strong on filtered ANN. Rising share in 2023-2024 as the "real vector DB without the heavy stack" pick.

pgvector

Postgres extension. Initially IVFFlat-only; HNSW added in v0.5.0 (mid-2023). Now exposed turnkey by Supabase, Neon, AWS RDS. The pragmatic choice when Postgres is already in the stack.

Spotify Discover Weekly (Annoy)

Spotify open-sourced Annoy (Approximate Nearest Neighbors Oh Yeah) in 2015 — a random-projection forest, a different ANN algorithm from HNSW or IVF, building binary trees by random hyperplanes. Still in production for Spotify's tens of millions of tracks; Discover Weekly's candidate generation embeds users and tracks and queries nearest neighbors.

YouTube candidate generation (ScaNN)

One of the largest deployed ANN systems in the world. Candidate generation uses ScaNN (Scalable Nearest Neighbors), a Google IVFPQ variant with anisotropic vector quantization. Sharded across many machines over billions of videos; a heavy ranker orders the candidates.

Pinterest visual search, Notion AI, Perplexity, Copilot

Pinterest "Lens": CLIP-like image embeddings + ANN over billions of pins. Notion AI: per-workspace RAG with strict tenant isolation. Perplexity: RAG over the live web as a primary product surface. GitHub Copilot: semantic code search in the IDE plus context-fetching for chat. Different domains, identical core pattern.

Twitter/X "For You", LinkedIn surfaces

Two-tower neural networks embed users and items; ANN generates candidates; a ranking model orders. Twitter open-sourced parts of this stack in 2023; LinkedIn's search and "people you may know" use internal infrastructure combining LSH (Locality-Sensitive Hashing), HNSW, and learned-to-rank rerankers.


§17. Summary

A vector database is a horizontally scalable, persistent ANN index over high-dimensional embeddings — fast similarity search for content semantics, with metadata filtering, at the cost of exactness. HNSW dominates for in-memory mid-scale; IVFPQ dominates for billion-scale on disk; the embedding model upstream is where all the meaning actually lives, and the vector DB is the dumb-but-fast retrieval layer over it. Pick pgvector if you already have Postgres and ≤ 50M vectors; pick Pinecone or Weaviate if you want managed; pick Milvus or Qdrant if you self-host; pick ES/OpenSearch if you already run search; pick FAISS if you're building a custom platform. Always pair with a primary store as the source of truth, always tag vectors with their embedding model version, always plan for re-embedding when the model changes, and always benchmark hybrid retrieval (RRF over BM25 + vector) before settling on pure vector.