Distributed Clocks: Why Timestamps Are Lies and What to Do About It
The clock on your server is wrong. Not just slightly wrong — wrong in ways that can corrupt your data if you rely on it for ordering events across machines. Understanding why requires understanding how time works at the hardware level.
This is not a defect in your hardware or your operating system. It is a fundamental consequence of how time works across a distributed system. Every distributed systems engineer eventually learns this lesson. Better to learn it before it causes a production incident.
How Physical Clocks Work
A server's clock is maintained by a quartz oscillator that counts cycles and converts them to seconds. The oscillator drifts — it runs slightly fast or slightly slow relative to actual time. On a typical server, this drift is tens of parts per million, which translates to a few seconds per day.
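As a back-of-envelope check (the 20 ppm figure below is an illustrative mid-range value, not a measurement of any particular oscillator):

```python
# Drift per day for a quartz oscillator with a given error rate.
SECONDS_PER_DAY = 24 * 60 * 60             # 86,400 seconds
drift_ppm = 20                             # illustrative: tens of parts per million
drift_per_day = SECONDS_PER_DAY * drift_ppm / 1_000_000
print(f"{drift_per_day:.2f} seconds/day")  # 1.73 seconds/day
```

Uncorrected, that error compounds: after a week, two servers drifting in opposite directions could disagree by tens of seconds.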
To correct this drift, servers use NTP (Network Time Protocol) to synchronize against reference servers that track actual time via atomic clocks or GPS. NTP introduces its own errors: the time it takes to communicate with the NTP server means the correction is always slightly stale. On a local network with a good NTP setup, clock skew between two servers might be a few milliseconds. Across datacenters, it might be tens or hundreds of milliseconds.
A few milliseconds sounds negligible. It is not. In a distributed system where events happen in microseconds, a few milliseconds of clock skew means you cannot trust timestamps to determine which event happened first.
Why This Matters
Suppose you have two database replicas. User A updates their email address to alice@example.com on replica 1, which stamps the write with its local clock time T. Two milliseconds later, user B (or a buggy client retry) updates the same email address to alice2@example.com on replica 2. Replica 2's clock is 5ms behind replica 1's, so it stamps the write T-3ms.
When the replicas synchronize, which update wins? If you use last-writer-wins based on timestamps, replica 2's write should win: it happened last. But its timestamp T-3ms is numerically smaller than T, so the synchronization logic picks alice@example.com as the winner and silently discards the later update. That is the wrong answer.
This is the classic last-writer-wins clock skew bug. It is not hypothetical. It has caused real data corruption in production systems.
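A minimal sketch of the bug, using the numbers from the scenario above (the microsecond timestamps and dict layout are illustrative):

```python
def lww_merge(write_a, write_b):
    """Last-writer-wins: keep the write with the larger timestamp."""
    return write_a if write_a["ts"] >= write_b["ts"] else write_b

T = 1_000_000  # replica 1's clock at the first write, in microseconds

# Replica 1 writes first; assume its clock is accurate.
write_1 = {"value": "alice@example.com", "ts": T}

# Replica 2 writes 2 ms later in real time, but its clock runs 5 ms behind,
# so the physically later write gets the numerically smaller timestamp.
write_2 = {"value": "alice2@example.com", "ts": T + 2_000 - 5_000}

winner = lww_merge(write_1, write_2)
print(winner["value"])  # alice@example.com: the earlier write wins
```

The merge function is doing exactly what it was told; the timestamps it was given simply do not reflect the real order of events.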
Logical Clocks: Ordering Without Time
Leslie Lamport's solution, introduced in his 1978 paper, was to abandon physical time entirely for ordering purposes. Instead of timestamps, use a logical counter that captures causal relationships.
A Lamport clock is a counter. Each process maintains its own counter. When a process performs an event, it increments its counter. When a process sends a message, it includes its current counter value. When a process receives a message with counter value V, it sets its counter to max(local counter, V) + 1.
This ensures a causal property: if event A causally precedes event B (A caused B to happen), then A's Lamport timestamp is less than B's. You can use Lamport timestamps to order events that are causally related.
The limitation runs the other way: a smaller Lamport timestamp does not imply causality. If two events are causally independent (neither caused the other), their Lamport timestamps might come out in either order, or be equal. From the timestamps alone you cannot tell whether one event preceded the other or whether they were concurrent; the causal ordering they capture is only a partial order.
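The three rules translate directly into code. This is a minimal sketch; in a real system the counter would travel inside network messages rather than as a return value:

```python
class LamportClock:
    def __init__(self):
        self.counter = 0

    def local_event(self):
        # Rule 1: increment on every local event.
        self.counter += 1
        return self.counter

    def send(self):
        # Rule 2: sending is an event; increment and attach the counter.
        self.counter += 1
        return self.counter

    def receive(self, received_counter):
        # Rule 3: merge as max(local, received) + 1.
        self.counter = max(self.counter, received_counter) + 1
        return self.counter

a, b = LamportClock(), LamportClock()
a.local_event()   # a.counter == 1
msg = a.send()    # a.counter == 2, message carries 2
b.receive(msg)    # b.counter == max(0, 2) + 1 == 3
```

Note that b's timestamp ends up larger than a's even though b has done less work: the merge rule is what propagates causal ordering across processes.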
Vector Clocks: Tracking Causality Precisely
A vector clock extends the Lamport clock to track causal relationships across all processes, not just the current process.
Each process maintains a vector of counters, one per process in the system. When process A sends a message, it increments its own counter in its vector and sends the whole vector. When process B receives the message, it updates each element of its vector to the maximum of its own value and the received value, then increments its own element.
Two events are causally related if one vector is strictly less than the other (every component of A is less than or equal to the corresponding component of B, and at least one is strictly less). Two events are concurrent if neither vector is less than or equal to the other — they happened independently, with no causal relationship.
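A sketch of the same rules for a fixed set of processes, plus the comparison that distinguishes "happened before" from "concurrent" (the class and function names are illustrative):

```python
class VectorClock:
    def __init__(self, n, pid):
        self.v = [0] * n    # one counter per process in the system
        self.pid = pid

    def local_event(self):
        self.v[self.pid] += 1

    def send(self):
        self.v[self.pid] += 1
        return list(self.v)  # a copy travels with the message

    def receive(self, received):
        # Element-wise max with the received vector, then tick our own entry.
        self.v = [max(a, b) for a, b in zip(self.v, received)]
        self.v[self.pid] += 1

def happened_before(u, v):
    """True if every component of u <= v and the vectors differ."""
    return all(a <= b for a, b in zip(u, v)) and u != v

def concurrent(u, v):
    return not happened_before(u, v) and not happened_before(v, u)

p0, p1 = VectorClock(2, 0), VectorClock(2, 1)
p0.local_event()                    # p0.v == [1, 0]
p1.local_event()                    # p1.v == [0, 1]
print(concurrent(p0.v, p1.v))       # True: neither event saw the other
msg = p0.send()                     # p0.v == [2, 0]
p1.receive(msg)                     # p1.v == [2, 2]
print(happened_before(msg, p1.v))   # True: the send causally precedes this
```

This is the concurrency detection Lamport clocks cannot provide: the two independent `local_event` calls compare as concurrent, not ordered.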
Amazon's Dynamo used vector clocks to track concurrent writes. When two replicas received conflicting updates that were causally concurrent, Dynamo returned both versions to the client and required the client to resolve the conflict. Shopping cart reconciliation was the motivating use case: if you added an item in one tab and removed a different item in another, the cart should contain both changes.
The challenge with vector clocks is size. A vector clock has one entry per process, so in a system with hundreds of processes the clocks become large. DynamoDB, Amazon's later managed service, dropped vector clocks in favor of a simpler conflict-resolution strategy, using last-writer-wins for the common case.
Hybrid Logical Clocks
Hybrid Logical Clocks (HLC), proposed by Sandeep Kulkarni and others in 2014, combine the best of physical and logical clocks. The goal is a clock that stays close to physical time (so you can use it for human-meaningful timestamps) while providing the causal ordering guarantees of logical clocks.
An HLC has two components: a physical component (the maximum of the current NTP time and the highest physical timestamp seen in any received message) and a logical component (a counter that breaks ties when the physical components are equal).
The result: if physical clocks have bounded skew (say, within 500ms of each other), HLC timestamps are within a bounded distance of real time, and HLC timestamps preserve causality. If event A causally precedes event B, A's HLC timestamp is less than B's.
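A sketch of the HLC update rules described above. The frozen physical clock in the example exists only to make the logical component visible; a real node would use its NTP-disciplined clock:

```python
import time

class HLC:
    def __init__(self, physical_now=lambda: int(time.time() * 1000)):
        self.physical_now = physical_now
        self.l = 0  # physical component: max physical time seen so far
        self.c = 0  # logical component: tie-breaking counter

    def now(self):
        """Timestamp a local or send event."""
        pt = self.physical_now()
        if pt > self.l:
            self.l, self.c = pt, 0  # physical time moved forward; reset counter
        else:
            self.c += 1             # physical time stalled; counter breaks ties
        return (self.l, self.c)

    def update(self, remote_l, remote_c):
        """Merge the timestamp carried by a received message."""
        pt = self.physical_now()
        if pt > self.l and pt > remote_l:
            self.l, self.c = pt, 0
        elif remote_l > self.l:
            self.l, self.c = remote_l, remote_c + 1  # fast-forward past sender
        elif self.l > remote_l:
            self.c += 1
        else:  # equal physical components: take the larger counter
            self.c = max(self.c, remote_c) + 1
        return (self.l, self.c)

clock = HLC(physical_now=lambda: 100)  # frozen clock for illustration
print(clock.now())           # (100, 0)
print(clock.now())           # (100, 1): same physical time, counter ticks
print(clock.update(107, 3))  # (107, 4): fast-forward past the remote timestamp
```

Timestamps compare as (l, c) tuples, so causally later events always sort later, while l never runs ahead of the fastest physical clock in the system.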
CockroachDB uses HLC as its primary clock. Every transaction gets an HLC timestamp. The committed timestamp of a transaction must be in the future (greater than the current HLC) to ensure that subsequent reads see the write. If a node's clock is behind (its HLC is in the past relative to a message it receives), it fast-forwards its clock.
TrueTime: When You Control the Hardware
Google Spanner took a different approach: instead of managing clock uncertainty in software, deploy hardware (GPS receivers and atomic clocks) in every datacenter to minimize it.
TrueTime is Google's API for getting the current time. Instead of returning a single timestamp, it returns an interval: the current time is guaranteed to be within [earliest, latest]. The uncertainty (latest - earliest) is typically a few milliseconds.
Spanner uses this uncertainty to implement external consistency: a transaction that commits at time T is guaranteed to be visible to any transaction that starts after T + epsilon (where epsilon is the maximum TrueTime uncertainty). To implement this, Spanner explicitly waits out the uncertainty interval before committing a transaction.
This means every write in Spanner incurs a delay of a few milliseconds — the "commit wait" — to account for TrueTime uncertainty. The result is that you can compare Spanner timestamps across nodes and globally order transactions, a property almost no other production system offers.
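The commit-wait idea can be sketched as follows. `truetime_now`, the 4ms uncertainty, and the `commit` wrapper are stand-ins for illustration, not Google's actual API or epsilon:

```python
import time

EPSILON_MS = 4  # assumed maximum TrueTime uncertainty (illustrative)

def truetime_now():
    """Stand-in for TrueTime: true time is guaranteed to lie in [earliest, latest]."""
    t = time.time() * 1000
    return (t - EPSILON_MS, t + EPSILON_MS)

def commit(apply_write):
    """Apply a write, then wait out the uncertainty before acknowledging."""
    _, latest = truetime_now()
    commit_ts = latest  # a timestamp guaranteed >= the true time right now
    apply_write(commit_ts)
    # Commit wait: block until TrueTime guarantees commit_ts is in the past,
    # so any transaction that starts after we acknowledge gets a strictly
    # larger timestamp on every node.
    while truetime_now()[0] < commit_ts:
        time.sleep(0.001)
    return commit_ts

ts = commit(lambda t: None)  # each commit pays roughly 2 * EPSILON_MS of wait
```

The smaller the hardware can make epsilon, the shorter the wait, which is exactly why Spanner deploys GPS and atomic clocks rather than relying on NTP alone.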
What the Track Covers
The Advanced track deals with the distributed timing and ordering problems that appear once you have the basics working. You will implement a system that uses logical clocks to order events, handle concurrent updates with vector clocks, and see how HLC keeps distributed timestamps meaningful.
These are the topics that separate distributed systems engineers from application developers who use distributed systems. Every experienced distributed systems engineer has been burned by clock skew at least once. The engineers who built Spanner, CockroachDB, and Cassandra all had to think carefully about time. So will you.
Ready to build it? You will implement causal ordering for events across nodes, from Lamport clocks to vector clocks to hybrid logical clocks, and see how real systems like CockroachDB and Spanner handle time differently. The same concepts appear in every distributed database and in any system that needs to order events across machines.
Build it yourself
Reading about distributed systems is useful. Building them is how you actually learn.
Start the Advanced track