This article provides a detailed overview of the various timestamps the Big Peer uses to enforce causal consistency and high availability.
A unique universally stable timestamp (UST) indicates its order in the sequence of events, causal relationship to other events, and conflict resolution mechanisms to handle concurrency conflicts and enable non-blocking reads.
The UST maintains its value and consistency over time so event ordering remains intact and reliable.
Regardless of the view of the UST, the Big Peer reflects a causally consistent version of the database that can be read by Small Peers.
This is due to each Small Peer committing their highest (maximum) transactions to local storage. Once committed, they gossip their committed maximums across peers. Each Small Peer computes the UST by determining the lowest transaction among the set — the lowest common denominator, if you will. which is then considered to be the upper limit and therefore represents the UST.
Of the various maximums that have been committed, the minimum refers to the lowest transaction among the set — the lowest common denominator, if you will.
To further explain, consider the following scenario of a three-node cluster:
- Server A commits Transaction 10
- Server B commits Transaction 5
- Server C commits Transaction 7
Given this scenario, the UST is 5.
Each server may have a different transactional view of the UST, depending on the duration of time required for them to finish gossiping their transactions with each other.
A transactional view is the snapshot of data that a transaction observes during execution.
In a concurrent, multi-user environment, conflicting transactions can run simultaneously at any point. As a result, each transaction receives its individual data view to guarantee consistency and isolation.
For example, consider the following complex scenario within a three-node cluster, as indicated by servers A, B, and C in the graphic.
Server A computes the UST as 4
Server B computes the UST as 5
Server C computes the UST as 3
Given this scenario, all three servers retain a causally consistent version of the database that remains readable. This remains true even though each server maintains a distinct view of the UST.
The UST changes on every storage node in the cluster depending on the state of the Big Peer:
- When actively processing transactions, the UST increments.
- When quiescent and inactive, the UST reflects the last transaction produced by the transaction log and does not increment.
The Big Peer uses the GC timestamp to determine which data are candidates for purging across the distributed database, while storage nodes use local GC timestamps to determine which data to purge from its local environment.
Similar to UST, the GC timestamp represents the lowest transaction timestamp in the cluster — all versions within the cluster that fall below the GC timestamp get rolled up or merged, re-writing them as a single value.
The GC timestamp is calculated based on the minimum active read transaction timestamp across the entire cluster. The local GC timestamp is the highest transaction below the minimum read transaction.
After the query is completed, the Read Transaction is purged from memory, allowing the GC timestamp to increase.
The UST and the GC work together to form a sliding window of versions over the database. This window represents the timestamp versions at which a causally consistent query can execute.
The coordinator assigns a predetermined timestamp to the query for each partition. Typically, this timestamp matches the UST at the coordinator's location but can fall between the cluster-wide garbage collection (GC) timestamps and the UST.
When a node coordinates a read transaction, it maintains certain metadata in memory, indicating the UST's value at the transaction's start.
This data helps calculate the local GC timestamp that the node shares during gossip:
- When a node is not actively executing a read transaction, to ensure continuous progress, it continues to gossip its view of the UST as the GC timestamp.
- In a quiet cluster with no ongoing reads, the GC timestamp matches the UST, resulting in only one version of each data item.