Fault Tolerance

During my PhD thesis, I have studied possible solutions to fault tolerance within fully asynchronous distributed systems. In particular, I have focused on designing software solutions to active replication. In this model of replication, a critical tasked is replicated on many symmetrical entities. They all execute the same piece of deterministic code, and the environment is managed, so that all the tasks (transactions, RPC call, and so on ...) the entities are submitted, are committed in the same order (by all the entities). Since we assume that the initial state of entities is unique, it can be derived that we are building a replicated state machine, whose global state, is equal to the local state of each entity. This can be opposed to the passive replication scheme, where entities are divided segregated into one unique master entity, and slaves. All the tasks submitted to the group of entities are only committed by the master. Slaves are only useful in the case of a master crash, when they have to elect a new master among them. It is easy to see that this scheme cannot outperform the active one, in terms of response time, when a crash occurs.

When it comes to guarantee that all the tasks that are to be performed by the entities, will all be committed in the same order, it is necessary to implement a well-known communications primitive called total order within the group of replicas. One way of implementing this primitive, is to provide the group with an agreement protocol. Such a protocol allows a set of processes to reach a common agreement on a value (taken from a very general set of values), starting from potentially different initial values at each process. And this must be ensured despite the crash of an unbounded number of processes. In the case we are concerned with (implementing a total order communication primitive), the initial values will be the identities of task to be ordered, and the final agreement value, may well be the union of all these values.

Moreover, we are interesting in allowing the group composition to be dynamically changed. This is necessary to adjust the group size with respect to level of safety we want to reach. The optimal size of the group can de deduced from the probability of failure of each of its single component. To perform a dynamic change of group composition, we need a service usually called membersship . This service is in charge of computing the most accurate view possible of the members still alive in the group. To do this, this service needs to gather the local view at each member, and then delivers an unique view on which each member will agree.

These changes of group composition need to be synchronized with the deliveries of application messages. This is necessary to obtain a coherent global state. Indeed, without this synchronization, some replicas may apply some transactions while others do not (because of a view change). This would end up with different local state, thus breaking the overall coherence of the group. To avoid this situation we have to solve a problem called view synchrony . A possible way of implementing view synchrony is to apply a view change in a virtually synchronous way. That is, all the replicas apply the view change at the same logical time, namely right after the commitment of a predetermined transaction (this transaction is taken from the ordered stream of transaction that still need to be applied). We can see that the processes still need to agree on the logical time at which the view change is performed.

One can see that the problem of distributed agreement recurrently appears in the solutions to the previous problems. Also called, consensus, it is one of the fundamental problem tackled by distributed algorithmics. It has been devoted a lot of work. The partial conclusions of these works, are that (1) there is no deterministic solution to the problem of consensus, in a purely asynchronous system, (2) most of the problems that have been previously quoted, are in fact equivalent to the consensus. That is, solving one gives a solution to the other, and conversely. So we are facing a situation where most of the problems we want to solve, cannot be solved by deterministic solutions. It is important to note that randomized solutions have been found for the problem of binary consensus ( where processes have to choose among two predetermined values). However, these solutions do not scale when it comes to large sets of values, or even worse when the set of initial values is not well defined. However, I have proposed in my PhD thesis a way to build, from any randomized protocol for binary consensus, a protocol for general consensus (even when initial values are not defined at compile time).

Frédéric Tronel
Last modified: Fri Oct 12 10:56:39 MEST 2001