Logical Clocks
June 27th, 2025
Overview
Unlike the monotonic clock and the wall clock, logical clocks are typically fully independent of the device clock. They are defined exclusively in software, and two independent logical clocks cannot be compared in any meaningful way. Logical clocks are widely used in distributed systems to determine the relative ordering of events, and are often used as version vectors. I wrote a simple demo for these which you can find here.
Happens Before Relationship
Before we take a look at logical clocks, we should understand what it means for one event to happen before another. If you were to ask a random person on the street what this means, you would likely get something like "the time at which event A happened is earlier than the time at which event B happened". If you read my post on clock synchronization, you'll be familiar with the idea that the concept of time is not uniform across machines. Therefore, when discussing distributed systems, the happens-before relationship works a little differently.
When event A happens before event B (written A → B), one of the following is true:
- Events A and B both happen on node X, and A happens before B in X's local execution order
- Event A causes event B to happen (for example, A is the sending of a message and B is its receipt)
- If event A happens before event B, and event B before event C, then A → C (the relation is transitive)
Lamport Clock
The Lamport clock is a logical clock that gives you a partial ordering of the events in your system. This partial ordering tells us that particular events happened before others. A partial ordering does not guarantee that any two elements are comparable, only that if two events are comparable, there is an ordering between them.
The algorithm is roughly as follows:
- Any time an event is discovered (message received, state changed, etc), you increment your own logical clock.
- If you are sending knowledge of this event to another node, include the value of your own logical clock.
- Any time a message is received, you set your clock to the maximum of (your_time, message_time), then increment it as in the first step
In Rust, the algorithm looks something like this (the `Clock` trait isn't shown in the post itself, so a minimal version is sketched at the top):

```rust
use std::cmp::max;

/// A minimal `Clock` trait, assumed here so the snippet is self-contained.
trait Clock {
    fn advance_clock(&mut self) -> i64;
    fn update_clock(&mut self, message_timestamp: i64) -> i64;
}

#[derive(Clone, Debug)]
struct LamportClock {
    pub time: i64,
}

impl Clock for LamportClock {
    /// Whenever an event occurs on this node (message received, state changed, etc.), we
    /// immediately increment our clock to show the passing of time.
    fn advance_clock(&mut self) -> i64 {
        self.time += 1;
        self.time
    }

    /// Whenever we receive a message from another node, we set our clock to the maximum of our
    /// own clock and the message timestamp, then advance. This ensures time always moves
    /// forward, and is what allows us to define a partial ordering over events.
    fn update_clock(&mut self, message_timestamp: i64) -> i64 {
        self.time = max(self.time, message_timestamp);
        self.advance_clock()
    }
}

impl LamportClock {
    pub fn new() -> Self {
        Self { time: 0 }
    }
}
```
After Receiving | Node A | Node B | Node C
---|---|---|---
1 | 1 | 0 | 0
2 | 1 | 0 | 1
3 | 2 | 1 | 1
4 | 2 | 2 | 2
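To see the send/receive rules in action, here is a small usage sketch. It redefines a minimal `LamportClock` (without the trait) so the snippet compiles on its own:

```rust
use std::cmp::max;

// Minimal standalone copy of the LamportClock above, so this snippet compiles on its own.
struct LamportClock { time: i64 }

impl LamportClock {
    fn new() -> Self { Self { time: 0 } }
    fn advance_clock(&mut self) -> i64 { self.time += 1; self.time }
    fn update_clock(&mut self, t: i64) -> i64 { self.time = max(self.time, t); self.advance_clock() }
}

fn main() {
    let mut a = LamportClock::new();
    let mut b = LamportClock::new();

    let send = a.advance_clock();     // A records sending a message: A's clock becomes 1
    let recv = b.update_clock(send);  // B receives it: max(0, 1) + 1 = 2
    assert!(send < recv);             // the send is always ordered before the receive
}
```

Note that the receive always ends up with a strictly larger timestamp than the send, which is exactly property (1) below.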
Lamport clocks have several interesting properties. Assume we have events A and B with Lamport timestamps L(A) and L(B) respectively. Then:
1. If A → B, then L(A) < L(B)
2. If L(A) < L(B), then either A → B or A and B are concurrent
3. Lamport clocks can provide a total ordering if there is a way to break ties
Let's explore these properties and try to understand their consequences. First, (1) states that if event A happened before event B, then the Lamport timestamp for event A will be less than that of B. In figure 1, the sending of a message on node A must have a lower timestamp than the receipt of that message on node B, because you cannot receive a message that was not sent.
(2) states that given two Lamport timestamps L(A) and L(B), if L(A) < L(B), then either A happened before B, or A and B are concurrent. Concurrent in this case means that we don't actually know which of the two events happened first, and that the events are independent. The concurrent events in figure 2 have no happens-before relationship between them. Concurrent events are the reason why we have a partial ordering rather than a total ordering: some events do not have an ordering between them according to Lamport timestamps, which brings us to (3).
(3) states that, assuming we have a way to break ties between events with the same timestamp, we get a total ordering of the events in our system. A total ordering is an order of events that all nodes agree on. One total ordering you are likely familiar with is the lexicographic ordering of words found in a dictionary. We know that the word bat shows up before cat, and cat before zebra. Indeed, given any two words in the English language, you can tell me which one shows up first in a dictionary.

Lamport timestamps can give us a total ordering provided we have a way to break ties between events with the same timestamp. One common way to do this is to include a node identifier in the event, which is compared lexicographically in the case of a tie. In figure 1, two events with the same timestamp on nodes A and C can still be ordered if the node ID is attached to the event, in which case the event on node A comes first because A comes before C in the alphabet. If all of the nodes were threads on the same machine, you could instead use the process ID (PID). If all events are generated in a single thread on a single node, then no extra information is needed for a total ordering, because you can never generate two events with the same timestamp, and all timestamps are comparable by default.
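That tie-break can be sketched in a few lines (the `OrderedEvent` type here is my own illustration, not part of the demo). Deriving `Ord` on a struct compares fields in declaration order, which gives exactly the lexicographic (timestamp, node ID) comparison described above:

```rust
// Illustrative event type: a Lamport timestamp plus a node identifier.
// Derived Ord compares `time` first, then `node_id`, breaking ties by node ID.
#[derive(PartialEq, Eq, PartialOrd, Ord, Debug)]
struct OrderedEvent {
    time: i64,
    node_id: char,
}

fn main() {
    let on_a = OrderedEvent { time: 3, node_id: 'A' };
    let on_c = OrderedEvent { time: 3, node_id: 'C' };
    assert!(on_a < on_c); // equal timestamps: the node ID breaks the tie
}
```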
Vector Clocks
Another drawback of Lamport clocks is that when you have two Lamport timestamps with L(A) < L(B), we don't actually know whether A happened before B or whether they are concurrent (we don't know which came first). If we want to be able to draw this conclusion, we need something different. Enter vector clocks, which solve this problem by keeping a vector of timestamps with one entry for each node in the system. The algorithm is roughly as follows:
- On initialization, set your vector clock to all zeroes, where the length of the vector = the number of nodes.
- When node X generates a local event, it increments its own entry in its vector.
- When node X sends a message, it attaches its vector clock to the message.
- When node X receives a message with vector V, it does the following:
a. For each index i, it sets its own entry at i to the max of its own entry and V[i]
b. It increments its own entry, because receiving the message is itself an event
In Rust, this looks something like the following (the `Clock` trait is the same as before, made generic over the timestamp type, and again sketched here so the snippet stands alone):

```rust
use std::cmp::max;

/// The same `Clock` trait as before, made generic over the timestamp type.
trait Clock<T> {
    fn advance_clock(&mut self) -> T;
    fn update_clock(&mut self, message_timestamp: &T) -> T;
    fn get_clock(&self) -> T;
}

#[derive(Clone, Debug)]
struct VectorClock {
    pub node_id: usize,
    pub time_vector: Vec<i64>,
}

// NOTE: We clone a lot here, which is fine for the example. If we were using shared memory, it
// would be better to wrap messages in an Arc to avoid the overhead of allocating a new vector
// each time.
impl Clock<Vec<i64>> for VectorClock {
    // Whenever an event occurs on this node (message received, state changed, etc.), we
    // immediately increment our own entry to show the passing of time.
    fn advance_clock(&mut self) -> Vec<i64> {
        self.time_vector[self.node_id] += 1;
        self.time_vector.clone()
    }

    // Whenever we receive a message from another node, we update our clock so that each entry
    // of the vector is the elementwise maximum, then advance our own entry.
    fn update_clock(&mut self, message_timestamp: &Vec<i64>) -> Vec<i64> {
        self.time_vector = self
            .time_vector
            .iter()
            .zip(message_timestamp.iter())
            .map(|(a, b)| *max(a, b))
            .collect();
        self.advance_clock()
    }

    fn get_clock(&self) -> Vec<i64> {
        self.time_vector.clone()
    }
}

impl VectorClock {
    pub fn new(node_id: usize, num_nodes: usize) -> Self {
        Self {
            node_id,
            time_vector: vec![0; num_nodes],
        }
    }
}
```
After Receiving | Node A | Node B | Node C
---|---|---|---
1 | | |
2 | | |
3 | | |
4 | | |
5 | | |
6 | | |
7 | | |
Before we talk about comparing vector timestamps, it's worth noting that no two distinct events in the system will ever share the same vector timestamp. This means that if two events share the same vector timestamp, they must be the same event. However, as you'll see, this fact alone does not provide a total ordering of the events. In addition to thinking of a vector clock as a timestamp, you can also think of a vector timestamp as a set of events: each event E corresponds to the set containing E and all of its causal dependencies (i.e. any event that happened before E). Any event that can be reached moving backwards in time in a sequence diagram, starting at some event, belongs to that event's causal dependency set. Consider the receipt of message 4, which has the vector timestamp ⟨2, 2, 2⟩: if we travel backwards from it, we reach the first two events on each node.
Similarly to Lamport timestamps, we can compare two vector timestamps to determine their relative ordering:
1. V(A) < V(B) if and only if every element of V(A) is less than or equal to the corresponding element of V(B)
2. If neither V(A) < V(B) nor V(B) < V(A), then A and B are concurrent

In English, (1) states that V(A) is less than or equal to V(B) if all elements of V(A) are less than or equal to the corresponding elements of V(B). Recall that at least one element of the two timestamps must differ in order for them to come from distinct events. If we think about this in terms of a sequence diagram, A is reachable moving backwards in time starting at B if and only if V(A) < V(B).
(2) covers the case where neither event is reachable from the other. In this case, they are said to be concurrent, meaning we don't actually know which event happened first. An example of two concurrent events from Figure 2 are events 1 and 2: neither vector timestamp is less than or equal to the other, so we cannot determine which came first, but we also know that neither had a causal relationship with the other.
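The comparison rules above can be sketched as a standalone function (the `Causality` enum and the `compare` helper are my own naming, not from the demo):

```rust
/// The four possible outcomes of comparing two vector timestamps.
#[derive(Debug, PartialEq)]
enum Causality {
    Before,     // A happened before B
    After,      // B happened before A
    Equal,      // same event
    Concurrent, // neither is reachable from the other
}

/// Compare two vector timestamps of the same length, elementwise.
fn compare(a: &[i64], b: &[i64]) -> Causality {
    let a_le_b = a.iter().zip(b).all(|(x, y)| x <= y);
    let b_le_a = b.iter().zip(a).all(|(x, y)| x <= y);
    match (a_le_b, b_le_a) {
        (true, true) => Causality::Equal,
        (true, false) => Causality::Before,
        (false, true) => Causality::After,
        (false, false) => Causality::Concurrent,
    }
}

fn main() {
    assert_eq!(compare(&[1, 0, 0], &[2, 1, 0]), Causality::Before);
    assert_eq!(compare(&[1, 0, 0], &[0, 1, 0]), Causality::Concurrent);
}
```

The `Concurrent` case is exactly the situation a single Lamport timestamp cannot detect.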
Variable Number of Nodes
Vector clocks provide us a partial order that is consistent with causality, but their correctness depends on the fact that we know the number of nodes participating in the exchange. If this isn't the case, we need a different algorithm. I won't cover any of them here, but some examples include Chain Clocks and Interval Tree Clocks.
Logical Timestamp Use Cases
Handling Concurrent Writes
One way logical timestamps are used in database systems is in handling concurrent writes to a replicated store. Consider figure 3, which shows two clients trying to update the same key at roughly the same time on different replicas. Consider the following possible orderings of messages:
- Both A and B receive M1 before receiving M2
- Both A and B receive M2 before receiving M1
- A receives M2 then M1, while B receives M1 then M2
- A receives M1 then M2, while B receives M2 then M1
If we disregard the timestamps in the messages and just apply the messages in the order we receive them, then cases (1) and (2) result in a consistent state, while cases (3) and (4) result in an inconsistent state. Since replication by definition requires the replicas to hold identical data, this isn't sufficient.
If the timestamps attached to M1 and M2 are Lamport timestamps, then we can define a total ordering of the events in our system and keep only the value whose timestamp is greater. Remember, however, that L(M1) > L(M2) does not mean that M2 happened before M1; the events could be concurrent. This means that even if M2 happened after M1 according to an observer of the system, the database may silently discard M2's value because the Lamport timestamp for M1 was greater.
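This last-writer-wins rule can be sketched as follows (the `LwwRegister` type is hypothetical, for illustration only): each write carries a `(lamport_time, node_id)` pair, and only the value with the greatest pair survives.

```rust
/// Hypothetical last-writer-wins register. The timestamp is a
/// (lamport_time, node_id) pair; the node ID breaks ties.
struct LwwRegister {
    value: String,
    timestamp: (i64, char),
}

impl LwwRegister {
    fn new() -> Self {
        Self { value: String::new(), timestamp: (0, ' ') }
    }

    fn apply(&mut self, value: String, timestamp: (i64, char)) {
        // Keep only the write with the greater timestamp. A concurrent write
        // with a smaller timestamp is silently discarded -- this is the
        // data-loss hazard described above.
        if timestamp > self.timestamp {
            self.timestamp = timestamp;
            self.value = value;
        }
    }
}

fn main() {
    let mut reg = LwwRegister::new();
    reg.apply("from M1".into(), (5, 'B')); // M1 carries the greater timestamp
    reg.apply("from M2".into(), (4, 'A')); // M2's write is discarded
    assert_eq!(reg.value, "from M1");
}
```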
If the data loss associated with Lamport timestamps is unacceptable, we can instead use vector clocks alongside a change to how we store values in our database. Assume that messages M1 and M2 carry vector timestamps V(M1) and V(M2). If either V(M1) < V(M2) or V(M2) < V(M1), we simply take the value with the greater timestamp and store it alongside its timestamp. If M1 and M2 are concurrent, we store both values in the database with their timestamps and let the application code deal with resolving the conflict (i.e. let the user choose which to use).

We can extend this principle of holding on to more than one value to get some interesting benefits. One such benefit is being able to do a large database read without needing to lock any rows. Because data is appended rather than overwritten, you'll never update an older value, so any existing value in the database is read-only and can safely be read without locking. This is one of the strategies used by Spanner for maintaining consistent snapshots.
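As a rough sketch of the conflict-keeping approach (the `SiblingRegister` type and its API are my own illustration, not from any particular database): on each write we drop values the new timestamp dominates, and keep the rest around as concurrent siblings.

```rust
// Illustrative multi-value register: values dominated by an incoming write are
// replaced; concurrent values are kept as siblings for the application to resolve.
struct SiblingRegister {
    siblings: Vec<(Vec<i64>, String)>, // (vector timestamp, value)
}

/// True if `a` is greater than or equal to `b` elementwise, i.e. `a` causally
/// supersedes `b`.
fn dominates(a: &[i64], b: &[i64]) -> bool {
    a.iter().zip(b).all(|(x, y)| x >= y)
}

impl SiblingRegister {
    fn new() -> Self {
        Self { siblings: Vec::new() }
    }

    fn write(&mut self, timestamp: Vec<i64>, value: String) {
        // If an existing sibling already dominates this write, the write is stale.
        if self.siblings.iter().any(|(t, _)| dominates(t, &timestamp)) {
            return;
        }
        // Otherwise, remove everything this write dominates and keep the rest.
        self.siblings.retain(|(t, _)| !dominates(&timestamp, t));
        self.siblings.push((timestamp, value));
    }
}

fn main() {
    let mut reg = SiblingRegister::new();
    reg.write(vec![1, 0], "a".into());
    reg.write(vec![0, 1], "b".into()); // concurrent with "a": both kept
    assert_eq!(reg.siblings.len(), 2);
    reg.write(vec![2, 1], "c".into()); // dominates both: siblings collapse to "c"
    assert_eq!(reg.siblings.len(), 1);
}
```

Resolving the surviving siblings (e.g. by asking the user) is left to the application, as the text above describes.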
Sources
- Designing Data-Intensive Applications by the goat Martin Kleppmann
- Patterns of Distributed Systems by Unmesh Joshi
- Wiki