Tasks
Implement Hash Index
Build a hash index that maps keys to data file offsets:
1. Hash each key to a bucket
2. Store key and file offset in the bucket
3. Lookup returns the...
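The steps above can be sketched in a few lines. This is a minimal, hypothetical implementation (class name, one-record-per-line file format, and a plain Python dict standing in for the bucket array are all assumptions), in the spirit of Bitcask-style stores:

```python
# Hash index sketch: an in-memory map from key to the byte offset of
# the latest record in an append-only data file.
import os

class HashIndex:
    def __init__(self, path: str):
        self.path = path
        self.index: dict[str, int] = {}   # key -> byte offset of latest record
        open(path, "ab").close()          # ensure the data file exists

    def put(self, key: str, value: str) -> None:
        # Toy format: "key,value\n"; values must not contain commas/newlines.
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset          # newest offset wins on overwrite

    def get(self, key: str):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                # one random read per lookup
            record = f.readline().decode().rstrip("\n")
        _, value = record.split(",", 1)
        return value
```

Note that overwrites simply append a new record and repoint the index entry; stale records remain on disk until a compaction pass (not shown) reclaims them.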
Build B-Tree Index
Implement a B-Tree index supporting both point and range queries:
1. Internal nodes contain keys and child pointers
2. Leaf nodes contain keys and da...
Implement LSM Tree
Build a Log-Structured Merge Tree (LSM Tree):
1. Writes go to in-memory memtable (sorted structure)
2. When memtable is full, flush to SSTable on dis...
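The memtable-to-SSTable flow can be sketched as follows. This is a toy, hypothetical model: in-memory sorted lists stand in for immutable SSTable files on disk, and compaction is omitted.

```python
# LSM sketch: writes land in a memtable; when it fills, it is flushed
# as a sorted, immutable run. Reads check memtable first, then runs
# from newest to oldest.
import bisect

class TinyLSM:
    def __init__(self, memtable_limit: int = 4):
        self.memtable: dict = {}   # sorted only at flush time
        self.sstables: list = []   # newest first; each is a sorted (key, value) list
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        run = sorted(self.memtable.items())   # sequential, sorted "write"
        self.sstables.insert(0, run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:             # newest run wins
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return run[i][1]
        return None
```

A real implementation would add a write-ahead log for durability, bloom filters per run, and background compaction to merge runs.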
Add Secondary Indexes
Implement secondary indexes for non-key attributes:
1. Primary index: primary_key -> data
2. Secondary index: attribute_value -> list of primary_keys...
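The two mappings above can be sketched together. This is a minimal in-memory model (class and method names are hypothetical); the key point is that every write must also remove the stale secondary entry:

```python
# Secondary index sketch: primary map (pk -> row) plus a derived map
# (attribute value -> set of pks), kept in sync on every write.
from collections import defaultdict

class Store:
    def __init__(self, indexed_attr: str):
        self.attr = indexed_attr
        self.primary: dict = {}                  # pk -> row dict
        self.secondary = defaultdict(set)        # attr value -> {pk, ...}

    def put(self, pk, row: dict):
        old = self.primary.get(pk)
        if old is not None:                      # drop the stale index entry
            self.secondary[old[self.attr]].discard(pk)
        self.primary[pk] = row
        self.secondary[row[self.attr]].add(pk)

    def find_by_attr(self, value):
        # Secondary lookup returns pks; each row still needs a primary fetch.
        return [self.primary[pk] for pk in self.secondary[value]]
```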
Distribute Index Across Nodes
Distribute your index across multiple nodes:
1. Partition index entries by key (hash or range)
2. Point lookups go to single partition
3. Range queri...
Interview Prep
Common interview questions for Database / Backend Engineer roles that map directly to what you build in this track. Click any question to reveal the model answer.
Model Answer
An index lookup for many rows requires: (1) reading the index B-tree to find matching entries, (2) random I/O to fetch each row by its heap/rowid location. Random I/O is much slower than sequential I/O. A full table scan reads data sequentially, which is cache-friendly and allows prefetching. For high-selectivity queries (returning few rows), the random I/O overhead is worth it. For low-selectivity queries, a sequential scan is faster. Database query planners estimate selectivity and choose accordingly. The crossover point is typically 5-15% of rows depending on hardware.
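The crossover can be illustrated with a toy cost model. The constants below are illustrative, not measured; real planners also model caching and index/heap correlation, which pushes the crossover point higher than this naive model suggests:

```python
# Toy planner arithmetic: sequential scan reads every page once in
# order; an index scan pays roughly one random heap fetch per match.
def scan_cost(total_rows: int, rows_per_page: int = 100,
              seq_page_cost: float = 1.0) -> float:
    return (total_rows / rows_per_page) * seq_page_cost

def index_cost(matching_rows: int, random_page_cost: float = 4.0) -> float:
    return matching_rows * random_page_cost

total = 1_000_000
for pct in (0.1, 1, 5, 10):
    matching = int(total * pct / 100)
    winner = "index" if index_cost(matching) < scan_cost(total) else "seq scan"
    print(f"{pct:>5}% of rows -> {winner}")
```

The `random_page_cost = 4.0` default mirrors PostgreSQL's planner constant of the same name, which expresses how much more expensive a random page fetch is than a sequential one.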
Model Answer
B-tree: in-place updates, O(log N) reads and writes, good for random reads and range scans; write amplification grows with index fragmentation. LSM tree: append-only writes to an in-memory memtable, sequential flushes to immutable SSTables, background compaction. Writes are fast (sequential I/O); reads must check multiple SSTables (mitigated by bloom filters). For time-series workloads, writes arrive roughly in time order and are append-heavy, so LSM trees excel: each new data point lands in the memtable and reaches disk via a sequential flush. B-trees would suffer high write amplification under non-sequential inserts. Cassandra, HBase, InfluxDB, and RocksDB all use LSM trees for this reason.
Model Answer
(1) EXPLAIN / EXPLAIN ANALYZE — is the query using the expected index? Does production have the same indexes as dev? (2) Check data volume — dev may have 10K rows, production 100M. Index selectivity changes with data size. (3) Check for parameter sniffing / plan caching — the production query plan may have been compiled for different parameter values. (4) Check table statistics — are auto-analyze statistics up to date? Stale statistics lead to bad plan choices. (5) Check for lock contention — is the query waiting on locks from concurrent writes? EXPLAIN only shows the execution plan, not wait time. Use pg_stat_activity or slow query logs to see wait events.
Model Answer
A covering index includes all columns needed to satisfy a query (WHERE, SELECT, ORDER BY) without accessing the main table. Example: SELECT email, created_at FROM users WHERE status = 'active' ORDER BY created_at. A covering index on (status, created_at, email) includes all needed columns, so the query engine reads only the index, never the heap. Benefit: it avoids the random I/O of heap lookups. In PostgreSQL this is an "index-only scan." In InnoDB (MySQL), secondary indexes implicitly include the primary key, so a query that needs only the PK also benefits. Covering indexes trade extra storage and write overhead for faster reads on specific queries.
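This is easy to see with SQLite from the Python standard library (table and index names below are made up for the demo). With a composite index covering every column the query touches, the plan should report a covering-index scan:

```python
# Covering index demo: the query never needs the table itself, only
# the index, so EXPLAIN QUERY PLAN should mention "COVERING INDEX".
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, status TEXT, "
    "created_at INTEGER, email TEXT)"
)
con.execute("CREATE INDEX idx_cover ON users (status, created_at, email)")
plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT email, created_at FROM users "
    "WHERE status = 'active' ORDER BY created_at"
).fetchall()
print(plan)   # plan detail should mention a COVERING INDEX scan
```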
Model Answer
An LSM store accumulates multiple SSTables over time. To read a key, you must check the memtable and all SSTables in reverse chronological order. Each SSTable read is a disk I/O. A bloom filter per SSTable answers "is this key definitely NOT in this SSTable?" with zero false negatives. If the filter says the key is not present, you skip the SSTable entirely. With typical bloom filter false positive rates of 1%, you avoid ~99% of unnecessary SSTable reads. The filter itself is tiny (10 bits/key for 1% FPR) and kept in memory. This transforms O(number of SSTables) reads into roughly O(1) for most point queries.
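A minimal bloom filter is a few lines of code. This sketch uses hypothetical parameters (m = 8192 bits, k = 7 hash functions, derived from SHA-256 with a per-function salt) rather than the tuned values a real LSM engine would pick per SSTable:

```python
# Bloom filter sketch: k hash positions in an m-bit array. add() sets
# the bits; might_contain() returns False only if the key was never
# added (no false negatives; rare false positives are possible).
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int = 8192, k_hashes: int = 7):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, key: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))
```

In an LSM read path, the lookup checks each SSTable's filter first and skips the SSTable entirely when `might_contain` returns False.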
Questions are representative of real interview patterns. Model answers are starting points — adapt them with your own experience and the specific context of the interview.
Common Mistakes
The top 5 mistakes builders make in this track — and exactly how to fix them. Click any mistake to see the root cause and the correct approach.
Why it happens
B-trees are balanced and support point lookups and range scans efficiently, but in-place updates during writes cause page splits and random I/O as the tree grows.
The fix
Use an LSM-tree (Log-Structured Merge-tree) for write-heavy workloads. Writes go to an in-memory buffer (memtable) and are flushed to immutable sorted files (SSTables) in the background.
Why it happens
Without an index on the filter column, the query engine must inspect every row.
The fix
Add a secondary index on frequently queried non-primary-key columns. In a custom store, maintain a secondary sorted data structure keyed by the indexed column.
Why it happens
A boolean column has only two values. A B-tree index on it splits the data into two huge buckets. Scanning half the table via the index is slower than a sequential table scan due to random I/O.
The fix
Do not index low-cardinality columns in isolation. Use a composite index (e.g., `status + created_at`) or a partial index that only indexes rows with the minority value.
Why it happens
A secondary index is a derived data structure. When the primary data changes, the index must also be updated — both the old entry deleted and the new one inserted.
The fix
Treat index updates as part of the same atomic write as the primary data change. In an LSM store, both the primary SSTable entry and the index entry go into the same memtable flush.
Why it happens
Range queries on an index have no implicit row limit. A query like "all events after epoch 0" may match every row.
The fix
Always apply a limit to range scans. Design APIs to require a maximum page size and use cursor-based pagination for large result sets.
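Cursor-based pagination over a sorted index can be sketched in one function (the function name and cursor format are assumptions; here the cursor is simply the last key of the previous page):

```python
# Cursor pagination sketch: each call returns at most `limit` keys
# and an opaque cursor for the next page, or None when exhausted.
import bisect

def paginate(sorted_keys: list, cursor=None, limit: int = 100):
    start = 0 if cursor is None else bisect.bisect_right(sorted_keys, cursor)
    page = sorted_keys[start:start + limit]
    next_cursor = page[-1] if len(page) == limit else None
    return page, next_cursor
```

Unlike OFFSET-based paging, resuming from a key keeps each page an O(log n) seek plus a bounded scan, regardless of how deep into the result set the client is.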
Comparison Mode
Side-by-side comparisons of the approaches, algorithms, and trade-offs you encounter in this track. Expand any comparison to see a detailed breakdown.
| Dimension | B-Tree Index | Hash Index | Inverted Index |
|---|---|---|---|
| Best query type | Range queries, prefix scans | Exact-match lookups | Full-text search, multi-value attributes |
| Point lookup speed | O(log n) | O(1) average | O(1) for term lookup + postings merge |
| Range query support | Yes — in-order traversal | No — keys are hashed, order lost | Partial — sorted postings lists support some range ops |
| Write overhead | Medium (rebalancing, page splits) | Low (append to bucket) | High (tokenise, update many term lists) |
| Space usage | Medium | Low | High (term dictionary + postings) |
| Used in | PostgreSQL, MySQL, SQLite | InnoDB adaptive hash, memory tables | Elasticsearch, Lucene, Postgres GIN |
Concepts Covered
Prerequisites
It is recommended to complete the previous tracks before starting this one. Concepts build progressively throughout the curriculum.