Data Sync

Sync Concepts

The realtime end-user transactional data generated by your app can be disseminated across devices through three different methods:

  • Centralized — A Small Peer device establishes a direct internet-enabled connection to the Big Peer cloud deployment.
  • Decentralized — A Small Peer device establishes a mesh network connection with nearby Small Peer devices using any and all communication transport types available to them by default.
  • Hybrid — If one or more Small Peer devices connected in the mesh network gain access to the internet, those devices upload not only their own local Ditto store to the Big Peer cloud, but also the data of every nearby offline Small Peer device.

For instructions on enabling and disabling transport types, see Configuring Transports.

Syncing with Big Peer

The Big Peer supports both bidirectional sync and unidirectional data sync.

Before you can establish a peer-to-peer mesh network and exchange data directly with nearby peers, you must connect to the Big Peer at least once to set up your communication infrastructure.

When you connect to the Big Peer for the first time, you obtain your access credentials and sync the latest data so that you are up to date with the rest of the peers. For more information, see Authentication.

Any Small Peer that has access to the internet automatically establishes a connection with the Big Peer to sync data across the Big Peer and, by way of multihop data sync, all connected Small Peers. For more information, see Multihop Data Replication, later in this topic.

Syncing Across Peers

The multiple replicas that result from a single partition are spread across virtual and relevant physical server nodes; that is, the Big Peer cloud deployment and local Ditto stores of subscribing Small Peers.

By default, Ditto does not automatically replicate data to peers. Instead, peers subscribe to changes and, using a live query, indicate the data they're interested in watching for changes. Once a change is observed somewhere in the mesh, only the delta matching the live query replicates to subscribing peers.

In simpler terms, sync involves a subscribing peer selectively "pulling" data from other peers, rather than remote peers automatically "pushing" data to it.
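
The pull model described above can be sketched in a few lines. This is a minimal illustration, not the Ditto SDK: the `Peer` class, the `replicate` function, and the use of a plain Python predicate as a stand-in for a live query are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Document = Dict[str, object]
Query = Callable[[Document], bool]  # hypothetical stand-in for a live query


@dataclass
class Peer:
    """Hypothetical peer that only stores data it has subscribed to."""
    name: str
    subscriptions: List[Query] = field(default_factory=list)
    store: Dict[str, Document] = field(default_factory=dict)

    def subscribe(self, query: Query) -> None:
        """Register interest in documents matching the query."""
        self.subscriptions.append(query)

    def wants(self, doc: Document) -> bool:
        """True if any of this peer's subscriptions matches the document."""
        return any(query(doc) for query in self.subscriptions)


def replicate(change: Document, peers: List[Peer]) -> List[str]:
    """Deliver a change only to peers whose subscriptions match it."""
    receivers = []
    for peer in peers:
        if peer.wants(change):
            peer.store[str(change["_id"])] = change
            receivers.append(peer.name)
    return receivers


kitchen = Peer("kitchen")
kitchen.subscribe(lambda doc: doc.get("status") == "open")
lobby = Peer("lobby")  # no subscriptions, so nothing is pulled

delivered = replicate({"_id": "order-1", "status": "open"}, [kitchen, lobby])
print(delivered)
```

Only the subscribing peer receives the change; the peer without a matching subscription keeps an empty store.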

Greedy and Selfish Data Sync



Multihop Data Replication

When all peers connected in the mesh need to share the same view of the data, Ditto automatically initiates the flooding replication process to multihop data from one connected peer to another connected peer by way of intermediate “hops” along a given path.

As illustrated in the following graphic, for multihop to occur, all peers within the chain must observe the same data:

Document image


The benefits of multihop sync technology include the following:

  • Guarantee a consistent offline-first sync of critical data, ensuring data remains consistent even when connectivity is intermittent.
  • Enhance the overall responsiveness of your app, resulting in faster interactions.
  • Reduce costs by optimizing data operations since, when feasible, you can read and write data to a nearby peer instead of reaching a remote data source over a fragile, low-bandwidth connection.

Flooding the Mesh

To ensure that all connected peers subscribing to the same query share the same view of the data, Ditto automatically initiates flood fill, or flooding for short.

Flooding is a replication process common in distributed systems in which each peer forwards (multihops) updates to the peers it is connected to until all peers in the mesh have the data change.
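
As a rough illustration of flooding, the following sketch propagates an update through a mesh using a breadth-first traversal. The mesh layout, the peer names, and the `flood` function are hypothetical, and the sketch assumes every peer subscribes to the same query; it models the general technique rather than Ditto's internal implementation.

```python
from collections import deque
from typing import Dict, List


def flood(mesh: Dict[str, List[str]], origin: str) -> Dict[str, int]:
    """Return the hop count at which each reachable peer receives the update."""
    hops = {origin: 0}
    queue = deque([origin])
    while queue:
        peer = queue.popleft()
        for neighbor in mesh.get(peer, []):
            if neighbor not in hops:  # forward only to peers that lack the update
                hops[neighbor] = hops[peer] + 1
                queue.append(neighbor)
    return hops


# Hypothetical mesh: A - B - C - D form a chain; E has no connections.
mesh = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
    "E": [],
}
result = flood(mesh, "A")
print(result)
```

Peer D is never directly connected to peer A, yet it receives the update after three hops. Peer E, which has no connection to the mesh, receives nothing.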

Multihopping to the Big Peer: Ditto Links

If at any time a peer connected in the mesh gains access to the internet, all mesh-generated data gets automatically exchanged with the Big Peer by way of multihop.

Ditto’s multihop Big Peer connection allows a Small Peer to have a secure and encrypted connection to the Big Peer through one or more intermediate peers.

When a peer establishes a mesh network connection with another peer, it first establishes encrypted internal TLS tunnels, or Ditto Links, that nearby peers use for transmission. Similar in function to a virtual private network (VPN), Ditto Links are virtual private connections that use the Noise Protocol to establish secure and protected connections.

The Noise Protocol Framework is a framework for building cryptographic protocols based on Diffie-Hellman key agreement. For more information, see the official documentation at noiseprotocol.org.

Concurrency Conflicts

A challenge arises in offline scenarios when two or more peers make edits independently and the data values stored by each peer diverge over time.

Referred to as conflict resolution, Ditto's process of addressing concurrency conflicts involves a combination of a metadata database and guiding principles.

Metadata Database

Each peer individually maintains its own metadata database. The metadata database serves as a storage repository of information essential in resolving conflicts that arise during the process of merging concurrent changes.

Within this metadata database is the version vector. The version vector manages the following essential details for a particular peer:

  • The current state of sync.
  • A record of the data previously sent and received.
  • The sequence of updates.
  • The timestamp of the last write operation.

Guiding Principles

Ditto adheres to the following strategies to ensure that all peers ultimately reach the same value for a particular data item:

  • Deterministic — As part of the strategy of eventual consistency, regardless of the order in which updates from different peers are received and merged, all peers ultimately converge to the same single value.
  • Predictable and Meaningful — Instead of arbitrarily resolving conflicting registers to a predefined value, the resulting merge accurately represents the original input and some rational interpretation of that input.
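
To make the deterministic property concrete, the following sketch merges the same three writes in every possible order and checks that the result never changes. It uses a last-writer-wins register, a common CRDT, purely as an illustration; the `Write` type, its fields, and the tie-breaking rule on `site_id` are assumptions and are not taken from Ditto's implementation.

```python
from itertools import permutations
from typing import Iterable, NamedTuple, Optional


class Write(NamedTuple):
    timestamp: int  # hypothetical logical timestamp
    site_id: str    # hypothetical peer identifier, used to break ties
    value: str


def merge(current: Optional[Write], incoming: Write) -> Write:
    """Keep the write with the larger (timestamp, site_id) pair."""
    if current is None:
        return incoming
    return max(current, incoming, key=lambda w: (w.timestamp, w.site_id))


def converge(writes: Iterable[Write]) -> Optional[Write]:
    """Merge writes one at a time, in the order they arrive."""
    state = None
    for write in writes:
        state = merge(state, write)
    return state


writes = [
    Write(7, "peer-a", "red"),
    Write(7, "peer-b", "blue"),   # same timestamp as peer-a: tie broken by site_id
    Write(5, "peer-c", "green"),
]

# Try every arrival order and collect the distinct final values.
outcomes = {converge(order).value for order in permutations(writes)}
print(outcomes)
```

Because the comparison depends only on the writes themselves and never on arrival order, all six orderings settle on the same value.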

The following scenario provides a walkthrough of the mechanics of version vectors, their role in determining merging behavior, and how different peers contribute to data sync:

Eventual Consistency: Example Scenario

In Ditto, the hybrid logical clock (HLC) uses the UInt128 data type to represent the Site_ID and a 64-bit timestamp; however, for simplicity, the following scenario uses basic string and number types instead.

1. Local Peer A's document, 123abc, links to a version vector that indicates:

  • Its locally stored document is currently at version 5.
  • The most recent changes from Remote Peer B were incorporated and merged at version 1.
  • The most recent changes from Remote Peer C were incorporated and merged at version 4.

2. Local Peer A receives document changes from Remote Peer B. The incoming document's version vector indicates:

  • The incoming version from Remote Peer B is 4.
  • The most recent Remote Peer B changes that Local Peer A merged were at version 1, a value less than the incoming version of 4.

Since 4 is greater than 1, the local Peer A determines that the changes are new and should be incorporated and merged in order to remain consistent.
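
The two steps of this scenario can be replayed with plain dictionaries. The sketch below assumes that a version vector can be modeled as a mapping from peer name to version number, and that merging remote changes updates only the remote peer's entry; the `is_new` helper is hypothetical.

```python
# Step 1: Peer A's version vector for document 123abc.
local_vector = {"A": 5, "B": 1, "C": 4}

# Step 2: changes arrive from Peer B at version 4.
incoming_peer = "B"
incoming_version = 4


def is_new(vector: dict, peer: str, version: int) -> bool:
    """Incoming changes are new if their version exceeds the last one merged."""
    return version > vector.get(peer, 0)


changes_are_new = is_new(local_vector, incoming_peer, incoming_version)
if changes_are_new:
    # Record that Peer B's changes up to version 4 are now merged.
    local_vector[incoming_peer] = incoming_version

print(changes_are_new, local_vector)
```

If the same changes arrive again, the comparison is 4 against 4, so they are treated as already merged.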


Hybrid Logical Clock

Each document in each peer contains a hidden metadata map of a Site_ID and an HLC, or hybrid logical clock. The HLC is used to determine whether one change "happened before" another.

It might be tempting to use physical clocks to resolve conflicts when attempting to merge concurrent data changes. However, it's essential to know that even quartz-crystal-based physical clocks can skew forward or backward in time. Almost every device regularly attempts to synchronize with an NTP-synchronized clock server. But even then, the round trip time from the request to the server's response adds additional variability. In addition, there are limitations to nature and physics that will never allow two measurements of physical time to align precisely. Thus, these conditions led us to determine that physical clocks were not reliable in a distributed mesh network.

Although we decided that we could not build a system that resolved conflicts based purely on physical time, we needed to preserve the notion of physical time so as not to confuse users of collaborative applications. However, each peer still needs a deterministic way to resolve conflicts. In other words, peers sharing CRDT deltas must always resolve conflicts the same way, which requires a logical ordering. This requirement led us to implement the version vector with a Hybrid Logical Clock (often referred to as HLC).
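
The following sketch shows how a hybrid logical clock stays close to physical time while still guaranteeing a logical order. It follows the textbook HLC update rules and is not Ditto's implementation; the class name, the `(l, c)` tuple representation, and the injected `physical_now` function are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Timestamp = Tuple[int, int]  # (l, c): largest physical time seen, logical counter


@dataclass
class HybridLogicalClock:
    physical_now: Callable[[], int]  # hypothetical source of physical time
    l: int = 0
    c: int = 0

    def tick(self) -> Timestamp:
        """Advance the clock for a local write."""
        pt = self.physical_now()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1  # physical clock has not advanced; order logically
        return (self.l, self.c)

    def receive(self, remote: Timestamp) -> Timestamp:
        """Advance the clock when a remote timestamp is observed."""
        pt = self.physical_now()
        remote_l, remote_c = remote
        new_l = max(self.l, remote_l, pt)
        if new_l == self.l == remote_l:
            self.c = max(self.c, remote_c) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == remote_l:
            self.c = remote_c + 1
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c)


# A device whose physical clock is stuck at 100.
clock = HybridLogicalClock(physical_now=lambda: 100)
first = clock.tick()
second = clock.tick()
received = clock.receive((150, 2))  # timestamp from a peer with a faster clock
after = clock.tick()
print(first, second, received, after)
```

Even though the local physical clock never advances and lags behind the remote peer, every timestamp the clock produces is strictly greater than the previous one and greater than the remote timestamp it observed.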

Clock Accuracy

Ditto requires that all devices participating in the mesh have the time and date configured accurately. It is recommended to synchronize the clock automatically with an Internet time server. This feature is built into most platforms including iOS and Android. This will be sufficiently accurate for Ditto’s needs.

If a device has a significantly incorrect clock, outcomes may include:

  • Failing to connect to a peer because the device considers the peer's certificate not yet valid or already expired.
  • Unintuitive conflict resolution in certain situations when concurrently editing documents.
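
The first outcome can be illustrated with a simple validity-window check. The dates, the one-week validity window, and the `certificate_status` function below are hypothetical and are chosen only to show how clock skew changes the result.

```python
from datetime import datetime, timedelta, timezone


def certificate_status(now: datetime, not_before: datetime, not_after: datetime) -> str:
    """Classify a certificate against the device's own idea of the current time."""
    if now < not_before:
        return "not yet valid"
    if now > not_after:
        return "expired"
    return "valid"


# Hypothetical certificate valid for one week.
not_before = datetime(2024, 1, 1, tzinfo=timezone.utc)
not_after = datetime(2024, 1, 8, tzinfo=timezone.utc)
true_now = datetime(2024, 1, 4, tzinfo=timezone.utc)

statuses = {
    "accurate clock": certificate_status(true_now, not_before, not_after),
    "clock 30 days slow": certificate_status(true_now - timedelta(days=30), not_before, not_after),
    "clock 30 days fast": certificate_status(true_now + timedelta(days=30), not_before, not_after),
}
print(statuses)
```

The certificate itself is the same in all three cases; only the device's clock differs.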

Version Vectors

Each peer has its own version of the documents that multiple peers can work on concurrently. They keep track of their respective document changes in a data structure known as a version vector.

The version vectors that each document links to contain a snapshot of essential metadata, such as the given document's current state, as well as a record of the data previously sent and received.

Incrementing Version Vectors

When a peer directly updates its own locally stored document, the document version vector increments by a value of one.

Determining Newness

Local peers use document version vectors to compare local and observed edits accurately and keep data consistent:

  • Before a peer incorporates and merges incoming changes into its document state, the peer compares its existing version vector value with the incoming version vector value to determine whether the incoming changes are new (or old) relative to its current document state.
  • If the incoming version vector is a value greater than the existing version vector value, the incoming changes are new and must be incorporated and merged for its document state to remain consistent.
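
The increment and comparison rules can be combined into a small set of helper functions. This is a sketch of a generic version vector, assuming it is modeled as a mapping from peer name to counter and merged by taking the larger counter for each peer; the function names are hypothetical and the code is not taken from Ditto.

```python
from typing import Dict

VersionVector = Dict[str, int]


def increment(vector: VersionVector, peer: str) -> VersionVector:
    """A local write bumps the writing peer's own counter by one."""
    updated = dict(vector)
    updated[peer] = updated.get(peer, 0) + 1
    return updated


def has_new_changes(existing: VersionVector, incoming: VersionVector) -> bool:
    """True if the incoming vector is ahead of the existing one for any peer."""
    return any(count > existing.get(peer, 0) for peer, count in incoming.items())


def merge(existing: VersionVector, incoming: VersionVector) -> VersionVector:
    """Keep the larger counter for every peer seen in either vector."""
    peers = existing.keys() | incoming.keys()
    return {peer: max(existing.get(peer, 0), incoming.get(peer, 0)) for peer in peers}


peer_a = increment({"A": 5, "B": 1, "C": 4}, "A")  # local write on Peer A
from_b = {"A": 3, "B": 4, "C": 4}                  # vector received from Peer B

fresh = has_new_changes(peer_a, from_b)
merged = merge(peer_a, from_b)
stale = has_new_changes(merged, from_b)
print(fresh, merged, stale)
```

After the merge, receiving the same vector from Peer B again reports no new changes, so the peer can skip it.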

Managing Time-Based Operations

The version vector pinpoints the chronological order of events and determines the exact timing of the most recent write operation for a specific peer using a combination of a temporal timestamp and the Hybrid Logical Clock (HLC):

  • The temporal timestamp is a component associated with an individual peer that tracks the chronological order of its local events.
  • The HLC is a higher-level component that combines both physical time and logical clocks to create a unified timestamp that tracks the chronological order of events for the entire distributed system.