Accrual Failure Detectors

Detecting failures is a fundamental issue for fault tolerance in distributed systems. Recently, many people have come to realize that failure detection ought to be provided as some form of generic service, similar to IP address lookup or time synchronization. However, this has not been successful so far. One of the reasons is the difficulty of satisfying several application requirements simultaneously when using classical failure detectors. We proposed a novel abstraction, called accrual failure detectors, that emphasizes flexibility and expressiveness and can serve as a basic building block for implementing failure detectors in distributed systems. Instead of providing boolean information (trust vs. suspect), accrual failure detectors output a suspicion level on a continuous scale.

The principal merit of this approach is that it favors a nearly complete decoupling between application requirements and the monitoring of the environment. We implemented a failure detector based on the accrual failure detector model, which we call the phi failure detector (1). What distinguishes the phi failure detector is that it dynamically adjusts the scale on which the suspicion level is expressed to current network conditions. We analyzed the behavior of our phi failure detector over an intercontinental communication link for several days. Our experimental results show that the phi failure detector performs as well as other known adaptive failure detection mechanisms, while offering improved flexibility. The phi accrual failure detector is currently implemented in several systems, such as Cassandra from Facebook, fluentd, Akka, Node.js, and APPIA, developed at Universidade de Lisboa.
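To make the idea of a continuous suspicion level concrete, the following is a minimal sketch of a phi-style detector, not the implementation evaluated in our experiments. It assumes heartbeat inter-arrival times are modeled by a normal distribution estimated over a sliding window, and it expresses suspicion as the negative base-10 logarithm of the probability that a heartbeat would still arrive after the time already elapsed. The class name, method names, window size, standard-deviation floor, and thresholds are all hypothetical choices made for illustration.

```python
import math
from collections import deque
from statistics import mean, pstdev


class PhiAccrualDetector:
    """Illustrative sketch only; names and defaults are assumptions."""

    def __init__(self, window_size=1000, min_std=0.01):
        # Sliding window of recent heartbeat inter-arrival times.
        self.intervals = deque(maxlen=window_size)
        self.last_arrival = None
        # Floor on the standard deviation so perfectly regular
        # heartbeats do not cause a division by zero.
        self.min_std = min_std

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Return the suspicion level at time `now`."""
        if len(self.intervals) < 2:
            return 0.0  # not enough samples to estimate a distribution
        mu = mean(self.intervals)
        sigma = max(pstdev(self.intervals), self.min_std)
        elapsed = now - self.last_arrival
        # Probability that the next heartbeat arrives later than
        # `elapsed`, under the assumed normal model.
        p_later = 0.5 * math.erfc((elapsed - mu) / (sigma * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # guard against log10(0)
        return -math.log10(p_later)


# Heartbeats roughly one second apart, with some jitter.
detector = PhiAccrualDetector()
for t in [0.0, 1.0, 2.1, 2.9, 4.0, 5.2, 6.0, 7.1, 8.0, 9.0]:
    detector.heartbeat(t)

# Two applications interpret the same suspicion level differently.
AGGRESSIVE_THRESHOLD = 1.0    # hypothetical: react quickly
CONSERVATIVE_THRESHOLD = 8.0  # hypothetical: avoid wrong suspicions

for now in (9.5, 10.5, 11.0):
    level = detector.phi(now)
    print(now, level > AGGRESSIVE_THRESHOLD, level > CONSERVATIVE_THRESHOLD)
```

Because the mean and standard deviation are re-estimated from the sliding window, the same threshold corresponds to a longer or shorter timeout as network conditions change, and each application chooses its own threshold without touching the monitoring code.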

Chief investigator: Naohiro HAYASHIBARA