
Database Indexes: Why the Data Structure Behind Your Query Matters

An index is how a database finds your data in milliseconds instead of scanning millions of rows. The choice between a B-tree and an LSM tree determines whether your workload performs or collapses.

Most engineers think of indexes as something the database handles automatically. Add an index on the column you query, and queries get faster. This works until it does not: until a query that should be fast still does a full scan, until write throughput collapses under the weight of index maintenance, until your time-series database falls over because a B-tree that serves random-access reads well buckles under a relentless, write-heavy ingest load.

What an Index Does

Without an index, finding a row matching a predicate requires scanning every row in the table. For a table with ten million rows, a query for the user with email alice@example.com reads through up to ten million rows to find one.

An index is a separate data structure that maps field values to row locations. With an index on the email column, the same query traverses the index to find the row location, then reads that one row. A balanced binary tree over ten million entries is roughly 23 levels deep (log2 of ten million); a B-tree with hundreds of keys per node gets there in three or four page reads. Either way: a few dozen comparisons versus ten million row reads.

The index has a cost: it consumes disk space and must be updated on every write. For read-heavy workloads, this tradeoff is overwhelmingly positive. For write-heavy workloads, the overhead of maintaining many indexes can be significant.
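
To make the tradeoff concrete, here is a minimal Python sketch. The table is an in-memory list, and a plain dict stands in for the on-disk tree a real database would use; the row count and email scheme are illustrative.

rows = [{"id": i, "email": f"user{i}@example.com"} for i in range(100_000)]

def find_by_scan(email):
    # No index: examine every row until the predicate matches.
    for row in rows:
        if row["email"] == email:
            return row
    return None

# The index maps a field value to the row's location (here, its offset).
email_index = {row["email"]: i for i, row in enumerate(rows)}

def find_by_index(email):
    pos = email_index.get(email)          # one lookup instead of a scan
    return rows[pos] if pos is not None else None

assert find_by_scan("user99999@example.com") == find_by_index("user99999@example.com")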

B-Trees: The Workhorse of Relational Databases

The B-tree is the dominant index structure in relational databases: PostgreSQL, MySQL's InnoDB, SQL Server, and Oracle all use B-trees (in practice, B+-tree variants) for their primary and secondary indexes.

A B-tree is a balanced tree where each node contains multiple keys and pointers to children. The tree is kept balanced through splits and merges as data is inserted and deleted. The depth of the tree (and thus the number of disk reads required for a lookup) is O(log n) where n is the number of entries.

What makes B-trees good for databases is their I/O efficiency. Each tree node maps to one or a few disk pages. A single disk read fetches an entire node with many keys and pointers. This is more efficient than a binary tree, where each node contains only one key and the depth is much greater.
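
A back-of-the-envelope calculation shows why the fan-out matters so much. The page, key, and pointer sizes below are illustrative assumptions, not any particular engine's defaults:

import math

page_size = 8192        # bytes per node (one disk page), assumed
entry_size = 16 + 8     # assumed key bytes + child-pointer bytes
fanout = page_size // entry_size            # ~341 entries per node

n = 10_000_000
depth = math.ceil(math.log(n, fanout))      # log_341(10^7) ~= 2.76 -> 3
print(fanout, depth)    # 341 3: three page reads reach any of 10M keys

Three page reads for ten million keys is also why B-tree lookups feel nearly constant-time in practice: the top levels of the tree are almost always cached in memory, so many lookups cost a single disk read.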

B-trees handle range scans well. Because keys are ordered, a query for "all users created between January and March" traverses the tree to the first matching key and then walks the leaf nodes sequentially. This is efficient for time-range queries, lexicographic prefix searches, and any query that benefits from ordered key traversal.
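
The same idea in a Python sketch, modeling the leaf level as a single sorted list: one binary search finds the first matching key, and the scan then simply walks forward until keys leave the range.

from bisect import bisect_left

# Leaf entries modeled as one sorted list of (created_at, row_id) pairs.
leaf_entries = sorted([
    ("2024-01-15", 101), ("2024-02-03", 102),
    ("2024-03-20", 103), ("2024-04-11", 104),
])

def range_scan(lo, hi):
    i = bisect_left(leaf_entries, (lo,))   # one descent finds the start
    while i < len(leaf_entries) and leaf_entries[i][0] <= hi:
        yield leaf_entries[i]              # then walk leaves in order
        i += 1

print(list(range_scan("2024-01-01", "2024-03-31")))  # the Jan-Mar rows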

The weakness of B-trees is write amplification. Inserting or updating a row requires updating every index on the table. If a new entry lands in a full leaf node, the node splits, which may cause its parent to split, propagating up the tree. On average, a single row insert touches multiple tree nodes, and a table with many indexes pays that cost once per index.
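
The sketch below makes the split cascade visible on a textbook (CLRS-style) B-tree, with a deliberately tiny minimum degree so splits happen often. The writes counter tallies every node an insert modifies; it is illustrative accounting, not a model of any specific engine.

class Node:
    def __init__(self, leaf):
        self.keys, self.children, self.leaf = [], [], leaf

class BTree:
    def __init__(self, t=2):                       # t: minimum degree
        self.t, self.root, self.writes = t, Node(leaf=True), 0

    def _split_child(self, parent, i):
        t, full = self.t, parent.children[i]
        right = Node(leaf=full.leaf)
        right.keys = full.keys[t:]                 # upper half moves right
        if not full.leaf:
            right.children = full.children[t:]
            full.children = full.children[:t]
        parent.keys.insert(i, full.keys[t - 1])    # median moves up
        full.keys = full.keys[:t - 1]
        parent.children.insert(i + 1, right)
        self.writes += 3                           # left, right, parent

    def insert(self, key):
        root = self.root
        if len(root.keys) == 2 * self.t - 1:       # root full: tree deepens
            self.root = Node(leaf=False)
            self.root.children.append(root)
            self._split_child(self.root, 0)
        self._insert_nonfull(self.root, key)

    def _insert_nonfull(self, node, key):
        i = len(node.keys) - 1
        if node.leaf:
            node.keys.append(None)                 # shift keys right until
            while i >= 0 and key < node.keys[i]:   # the slot is found
                node.keys[i + 1] = node.keys[i]
                i -= 1
            node.keys[i + 1] = key
            self.writes += 1                       # the leaf itself
        else:
            while i >= 0 and key < node.keys[i]:
                i -= 1
            i += 1
            if len(node.children[i].keys) == 2 * self.t - 1:
                self._split_child(node, i)         # split before descending
                if key > node.keys[i]:
                    i += 1
            self._insert_nonfull(node.children[i], key)

tree = BTree(t=2)
for k in range(32):
    tree.insert(k)
print(tree.writes / 32)    # well above 1.0: splits add extra node writes

With such tiny nodes the ratio comes out well above one node write per insert. Real B-trees with large pages split far less often, but each split still dirties several pages, and every secondary index on the table repeats the work.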

LSM Trees: Optimized for Writes

The Log-Structured Merge tree (LSM tree) takes a fundamentally different approach. Instead of updating an index in-place, it buffers all writes in memory and periodically flushes them to disk as sorted, immutable files called SSTables (Sorted String Tables).

When you write to an LSM-based store, the write goes to an in-memory structure (a memtable, usually a balanced binary search tree or a skip list) and is appended to a write-ahead log for durability. When the memtable reaches a size threshold, it is flushed to disk as an SSTable. Over time, you accumulate many SSTables.

A read query must check the memtable and then the SSTables, newest first, until it finds the key; in the worst case it consults every SSTable. This is slower than a B-tree read, which consults a single sorted structure. LSM stores use bloom filters (probabilistic data structures that quickly answer "this key is definitely not in this SSTable") to skip SSTables that cannot contain the key.
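
Here is a minimal in-memory sketch of both paths, assuming a dict for the memtable and sorted Python lists for the SSTables; a real store would write SSTables to disk files, append each write to a WAL first, and use a far larger flush threshold.

import bisect

MEMTABLE_LIMIT = 4        # artificially small so flushes happen quickly

class LSMStore:
    def __init__(self):
        self.memtable = {}
        self.sstables = []                     # oldest first, newest last

    def put(self, key, value):
        # A real store appends to the write-ahead log here first.
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self._flush()

    def _flush(self):
        # Flush: dump the memtable as one sorted, immutable run.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:               # newest data first
            return self.memtable[key]
        for table in reversed(self.sstables):  # then newest SSTable first
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

store = LSMStore()
for i in range(10):
    store.put(f"k{i}", i)
print(store.get("k3"), len(store.sstables))    # 3, and 2 flushed runs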

The background process that makes LSM trees work is compaction. Periodically, the system merges multiple SSTables into fewer, larger ones. This is where the "merge" in LSM comes from. Compaction removes deleted entries, deduplicates updated values, and reduces the number of SSTables a read query must check.
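
A sketch of one compaction pass, under a few assumptions: each SSTable is a sorted list of (key, seq, value) entries, a higher seq means a newer write, and a value of None is a tombstone marking a delete.

import heapq
from itertools import groupby

def compact(sstables):
    out = []
    merged = heapq.merge(*sstables)            # k-way merge, stays sorted
    for key, entries in groupby(merged, key=lambda e: e[0]):
        newest = max(entries, key=lambda e: e[1])  # dedupe: newest wins
        if newest[2] is not None:              # drop deleted keys entirely
            out.append(newest)
    return out

old = [("a", 1, "v1"), ("b", 2, "v1")]
new = [("a", 3, "v2"), ("b", 4, None)]         # b was deleted later
print(compact([old, new]))                     # [('a', 3, 'v2')]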

LSM trees have significantly lower write amplification than B-trees for random-insert workloads. The sequential writes to disk (appending to SSTables) are much faster than the random in-place writes of B-tree updates. LevelDB, RocksDB, Cassandra, HBase, and ScyllaDB all use LSM trees for this reason.

The cost is read amplification (more files to check per read) and a background compaction process that consumes I/O bandwidth. Tuning an LSM-based system means understanding compaction strategies (leveled vs. size-tiered) and their tradeoffs for read and write performance.

Bloom Filters

A bloom filter is a probabilistic data structure that answers membership queries with two possible answers: "definitely not in this set" or "probably in this set." It never produces false negatives (it will never say an element is absent when it is present) but can produce false positives (it may say an element is present when it is not).

In LSM stores, each SSTable has a bloom filter over its keys. When a read query looks for key K, it checks the bloom filters of all SSTables. Any SSTable whose bloom filter returns "definitely not" is skipped. Only SSTables whose bloom filter returns "probably yes" are read from disk. Since most SSTables do not contain any given key, this dramatically reduces disk reads.

Bloom filters appear in many other places: Google's Bigtable uses them, CDNs use them to avoid caching one-hit-wonder URLs, and Bitcoin's SPV nodes use them for lightweight transaction filtering. The false positive rate is configurable by changing the filter size and the number of hash functions: a larger filter means fewer false positives but more memory.
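
A minimal sketch of the structure, sized with the standard formulas m = -n*ln(p)/(ln 2)^2 bits and k = (m/n)*ln 2 hash functions for n expected keys at a target false-positive rate p. Deriving all k bit positions from a single SHA-256 digest is an illustrative shortcut; production filters typically use double hashing instead.

import hashlib
import math

class BloomFilter:
    def __init__(self, n, p=0.01):
        self.m = max(1, int(-n * math.log(p) / math.log(2) ** 2))
        self.k = max(1, round(self.m / n * math.log(2)))
        self.bits = 0                      # a Python int as a bit array

    def _positions(self, key):
        # k positions carved out of one digest (illustrative shortcut).
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.k):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # A false "yes" is possible; a false "no" is not.
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter(n=1000, p=0.01)
bf.add("alice@example.com")
print(bf.might_contain("alice@example.com"))   # True
print(bf.might_contain("bob@example.com"))     # almost certainly False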

Inverted Indexes

A B-tree index lets you answer "find all rows where email = X." An inverted index lets you answer "find all rows where the text contains the word X."

An inverted index maps each term to the list of documents containing that term. For the sentence "the quick brown fox," the inverted index contains: {the: [doc1], quick: [doc1], brown: [doc1], fox: [doc1]}. For a corpus of millions of documents, the index maps each word to the (possibly millions of) document IDs containing it.
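
Building that mapping takes a few lines, assuming documents are plain strings and a term is just a lowercased whitespace-separated word (real engines add tokenization, stemming, and stop-word handling):

from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):  # each term once per doc
            index[term].append(doc_id)
    return index

docs = {"doc1": "the quick brown fox", "doc2": "the lazy dog"}
index = build_inverted_index(docs)
print(index["the"])    # ['doc1', 'doc2']
print(index["fox"])    # ['doc1']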

Elasticsearch, Apache Solr, and the full-text search in PostgreSQL are all built on inverted indexes. The term "inverted" comes from the direction of lookup: instead of document-to-words (the natural direction), it maps words-to-documents (the inverted direction).

The core operations are more complex than B-tree lookups. Boolean queries ("fox AND hound") require intersecting posting lists. Phrase queries ("quick brown fox") require positional information to verify that the terms appear next to each other. Relevance ranking requires computing TF-IDF or BM25 scores. But the fundamental structure is the same: an index mapping terms to documents.
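
As an example of the first operation, an AND query reduces to intersecting two sorted posting lists with a two-pointer merge; the document IDs below are made up:

def intersect(postings_a, postings_b):
    i = j = 0
    out = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:     # in both lists: a match
            out.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:    # advance whichever list
            i += 1                             # is behind
        else:
            j += 1
    return out

fox = [1, 4, 7, 9]        # doc IDs containing "fox", kept sorted
hound = [2, 4, 9, 12]     # doc IDs containing "hound"
print(intersect(fox, hound))   # [4, 9]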

What This Track Builds

The Indexes track implements both a B-tree and an LSM-based index and benchmarks them against different workload patterns. You will see firsthand why a write-heavy time-series workload that destroys B-tree performance runs efficiently on an LSM tree, and why a random-lookup workload that forces an LSM store to consult many files stays fast on a B-tree.

This hands-on comparison is more valuable than any explanation. The performance profile of each data structure becomes intuitive once you have measured it yourself under controlled conditions.


Ready to build it? The Indexes track builds and benchmarks both B-tree and LSM-based indexes. You will run read-heavy and write-heavy workloads against each and measure the performance difference. The same data structures underpin PostgreSQL, MySQL, Cassandra, RocksDB, and Elasticsearch.

Build it yourself

Reading about distributed systems is useful. Building them is how you actually learn.

Start the Indexes track