Bad Nodes
In this first iteration of Ditto Server we have a blunt, expedient tool available to us: killing and removing a bad node. The process is simple. Update the Current Config to remove the offending node(s) and signal the remaining servers. They will immediately stop routing to that node, stop using it in their calculations, and stop listening to gossip from it. This is fine as long as at least one node in each partition of the map is left standing. As soon as the offending nodes are removed, the data is under-replicated, so replacement nodes should be added at once by performing a transition as above. For example, imagine p1r2 has become unresponsive. Remove it from the Current Config, create a Next Config with a new server to take the place of p1r2, store the configs in the Strongly Consistent metadata store, and signal the nodes. The new node will begin to consume transactions and backfill, and the UST will rise.
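The removal and replacement steps can be sketched as two pure functions over a config value. This is a minimal illustration, not the Ditto Server implementation: the `Config` shape, the version counter, and the replacement node name `p1r3` are all assumptions made for the example, and storing the configs and signalling the nodes are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical cluster config: partition id -> replica node names."""
    version: int
    partitions: dict


def remove_nodes(current: Config, bad: set) -> Config:
    """Return a new Current Config with the bad nodes dropped."""
    remaining = {
        pid: tuple(n for n in nodes if n not in bad)
        for pid, nodes in current.partitions.items()
    }
    # Removal is only safe while every partition keeps at least one node.
    for pid, nodes in remaining.items():
        if not nodes:
            raise ValueError(f"partition {pid} would have no replicas left")
    return Config(current.version + 1, remaining)


def with_replacement(current: Config, partition: str, new_node: str) -> Config:
    """Return a Next Config that adds new_node to the given partition."""
    parts = dict(current.partitions)
    parts[partition] = parts[partition] + (new_node,)
    return Config(current.version + 1, parts)


before = Config(1, {
    "p0": ("p0r0", "p0r1", "p0r2"),
    "p1": ("p1r0", "p1r1", "p1r2"),
})
current = remove_nodes(before, {"p1r2"})              # p1 is now under-replicated
next_config = with_replacement(current, "p1", "p1r3")  # p1r3 is a made-up name
```

The guard in `remove_nodes` encodes the one safety condition stated above: at least one node must be left standing in each partition.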
Missing and Lost Data
As described in the first Backfill section, it is possible, with a long network incident and a short log retention policy, that some transactions are missed. If all the replicas for a partition miss some intersecting subset of transactions, no replica holds that data, and it is lost. This should never happen. If it does, we don’t want to throw away the Ditto Server cluster and all its good data. Progress must still be made. In this case each replica of the partition understands from its IntervalMaps that some transaction T has been missed. After doing a strongly consistent read of the metadata store, to check that no server exists in the Next Config that may have the data, the replicas each decide unilaterally to pretend that they really did store this data, and they splice it into their IntervalMaps. The UST rises, and progress is made.
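The splice can be illustrated with a simplified model. The sketch below assumes an IntervalMap can be approximated by a sorted list of half-open `(start, end)` ranges of transaction ids, and uses a per-replica contiguous watermark as a stand-in for the UST; both are assumptions for the example, as are all function names. For brevity it takes a global view of the replicas, whereas in the text each replica reaches the conclusion from its own IntervalMaps.

```python
def splice(intervals, start, end):
    """Insert [start, end) into a list of ranges, merging touching neighbours."""
    merged = []
    for s, e in sorted(intervals + [(start, end)]):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def gaps(intervals):
    """Return the ranges missing from a sorted, merged interval list."""
    missing, cursor = [], 0
    for s, e in intervals:
        if s > cursor:
            missing.append((cursor, s))
        cursor = max(cursor, e)
    return missing


def contiguous_watermark(intervals):
    """Highest N such that every transaction in [0, N) is recorded."""
    if not intervals or intervals[0][0] != 0:
        return 0
    return intervals[0][1]


def accept_lost_data(replicas, next_config_may_have_data):
    """Splice ranges that no replica holds into every replica's intervals."""
    if next_config_may_have_data:
        return dict(replicas)  # a new server might still supply the data
    everything = []
    for intervals in replicas.values():
        for s, e in intervals:
            everything = splice(everything, s, e)
    lost = gaps(everything)  # missed by all replicas at once
    healed = {}
    for name, intervals in replicas.items():
        for s, e in lost:
            intervals = splice(intervals, s, e)
        healed[name] = intervals
    return healed


replicas = {
    "p1r0": [(0, 5), (6, 10)],
    "p1r1": [(0, 5), (6, 10)],
    "p1r2": [(0, 3), (6, 10)],
}
healed = accept_lost_data(replicas, next_config_may_have_data=False)
```

Only transaction 5, which every replica missed, is spliced in. In this example `p1r2` still has a gap at transactions 3 and 4, which it can backfill from its peers in the usual way.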
It is essential to understand this is a disaster scenario, and not business as usual, but disasters happen, and they should be planned for. We do everything we can to never lose data, including a replicated durable transaction log with a long retention policy.