AbstractWe introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterise unreliable failure detectors in terms of two properties -- completeness and accuracy. We show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus, the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus [Chandra et al. 1992].
The abstract is also available as a LaTeX file, a DVI file, or a PostScript file.
Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems -- distributed applications, distributed databases, network operating systems; C.4 [Performance of Systems]; D.1.3 [Programming Techniques]: Concurrent Programming -- distributed programming; D.4.5 [Operating Systems]: Reliability -- fault-tolerance; F.1.1 [Computation by Abstract Devices]: Models of Computation -- automata, relations among models; F.1.2 [Computation by Abstract Devices]: Modes of Computation -- parallelism and concurrency; H.2.4 [Database Management]: Systems -- concurrency, distributed systems, transaction processing
General Terms: Algorithms, Reliability, Theory
Additional Key Words and Phrases: agreement problem, asynchronous systems, atomic broadcast, \emph{Byzantine Generals'} problem, commit problem, consensus problem, crash failures, failure detection, fault-tolerance, message passing, partial synchrony, processor failures
Selected papers that cite this one
- Francis Chu. Reducing Omega to \Diamond W. Information Processing Letters, 67(6):289-293, 30 September 1998.
- Cynthia Dwork, Joseph Y. Halpern, and Orli Waarts. Performing work efficiently in the presence of faults. SIAM Journal on Computing, 27(5):1457-1491, October 1998.
Selected references
- Y. Amir, D. Dolev, S. Kramer, and D. Malki. Transis: A communication sub-system for high availability. Technical Report CS91-13 (Nov.), Computer Science Department, The Hebrew University of Jerusalem.
- Hagit Attiya, Amotz Bar-Noy, Danny Dolev, Daphne Koller, David Peleg, and Rüdiger Reischuk. Achievable cases in an asynchronous environment (extended abstract). In 28th Annual Symposium on Foundations of Computer Science, pages 337-346, Los Angeles, California, 12-14 October 1987. IEEE.
- Hagit Attiya, Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. Bounds on the time to reach agreement in the presence of timing uncertainty. In Proceedings of the Twenty Third Annual ACM Symposium on Theory of Computing, pages 359-369, New Orleans, Louisiana, 6-8 May 1991.
- Michael Ben-Or. Another advantage of free choice: Completely asynchronous agreement protocols (extended abstract). In Proceedings of the Second Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, pages 27-30, Montreal, Quebec, Canada, 17-19 August 1983.
- Piotr Berman, Juan A. Garay, and Kenneth J. Perry. Towards optimal distributed consensus (extended abstract). In 30th Annual Symposium on Foundations of Computer Science, pages 410-415, Research Triangle Park, North Carolina, 30 October-1 November 1989. IEEE.
- Ofer Biran, Shlomo Moran, and Shmuel Zaks. A combinatorial characterization of the distributed tasks which are solvable in the presence of one faulty processor. In Proceedings of the Seventh Annual ACM Symposium on Principles of Distributed Computing, pages 263-275, Toronto, Ontario, Canada, 15-17 August 1988.
- Kenneth P. Birman, Robert Cooper, Thomas A. Joseph, Kenneth P. Kane, and Frank Bernhard Schmuck. ISIS -- A Distributed Programming Environment. June 1990.
- Kenneth P. Birman and Thomas A. Joseph. Reliable communication in the presence of failures. ACM Transactions on Computer Systems, 5(1):47-76, February 1987.
- Gabriel Bracha and Sam Toueg. Asynchronous consensus and broadcast protocols. Journal of the ACM, 32(4):824-840, October 1985.
- Michael F. Bridgland and Ronald J. Watro. Fault-tolerant decision making in totally asynchronous distributed systems (preliminary version). In Proceedings of the Sixth Annual ACM Symposium on Principles of Distributed Computing, pages 52-63, Vancouver, British Columbia, Canada, 10-12 August 1987.
- N. Budhiraja, A. Gopal, and S. Toueg. Early-stopping distributed bidding and applications. In Proceedings of the Fourth International Workshop on Distributed Algorithms (Sept. 1990), pp. 301-320. Springer-Verlag.
- T. D. Chandra, V. Hadzilacos, and S. Toueg. The weakest failure detector for solving consensus. Technical Report 92-1293 (July), Department of Computer Science, Cornell University. Available from ftp://ftp.cs.cornell.edu/pub/chandra/failure.detectors.weakest.dvi.Z. A preliminary version appeared in the Proceedings of the Eleventh ACM Symposium on Principles of Distributed Computing, pages 147-158. ACM Press, August 1992.
- T. D. Chandra, V. Hadzilacos, and S. Toueg. Impossibility of group membership in asynchronous systems. Technical Report 95-1533 (August), Computer Science Department, Cornell University, Ithaca, New York 14853.
- T. D. Chandra and S. Toueg. Time and message efficient reliable broadcasts. In Proceedings of the Fourth International Workshop on Distributed Algorithms (Sept. 1990), pp. 289-300. Springer-Verlag.
- Jo-Mei Chang and N. F. Maxemchuk. Reliable broadcast protocols. ACM Transactions on Computer Systems, 2(3):251-273, August 1984.
- B. Chor and C. Dwork. Randomization in byzantine agreement. Advances in Computer Research 5, 443-497.
- F. Cristian. Issues in the design of highly available computing services. In Annual Symposium of the Canadian Information Processing Society (July 1987), pp. 9-16. Also IBM Research Report RJ5856, July 1987.
- F. Cristian, H. Aghili, R. Strong, and D. Dolev. Atomic broadcast: From simple message diffusion to Byzantine agreement. In Proceedings of the Fifteenth International Symposium on Fault-Tolerant Computing (June 1985), pp. 200-206. A revised version appears as IBM Research Laboratory Technical Report RJ5244 (April 1989).
- F. Cristian, R. D. Dancey, and J. Dehn. Fault-tolerance in the advanced automation system. Technical Report RJ 7424 (April), IBM Research Laboratory.
- Danny Dolev, Cynthia Dwork, and Larry Stockmeyer. On the minimal synchronism needed for distributed consensus. Journal of the ACM, 34(1):77-97, January 1987.
- Danny Dolev, Nancy A. Lynch, Shlomit S. Pinter, Eugene W. Stark, and William E. Weihl. Reaching approximate agreement in the presence of faults. Journal of the ACM, 33(3):499-516, July 1986.
- Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. Consensus in the presence of partial synchrony. Journal of the ACM, 35(2):288-323, April 1988.
- M. J. Fischer. The consensus problem in unreliable distributed systems (a brief survey). Technical Report 273 (June), Department of Computer Science, Yale University.
- Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374-382, April 1985.
- Ajei S. Gopal, H. Raymond Strong, Sam Toueg, and Flaviu Cristian. Early-delivery atomic broadcast. In Proceedings of the Ninth Annual ACM Symposium on Principles of Distributed Computing, pages 297-309, Quebec City, Quebec, Canada, 22-24 August 1990.
- R. Guerraoui. Revisiting the relationship between non blocking atomic commitment and consensus. In Proceedings of the Ninth International Workshop on Distributed Algorithms (September 1995). Springer-Verlag.
- V. Hadzilacos and S. Toueg. Fault-tolerant broadcasts and related problems. In S. J. Mullender Ed., Distributed Systems, Chapter 5, pp. 97-145. Addison-Wesley.
- V. Hadzilacos and S. Toueg. A modular approach to fault-tolerant broadcasts and related problems. Technical Report 94-1425 (May), Computer Science Department, Cornell University, Ithaca, New York 14853. Available by anonymous ftp from ftp://ftp.db.toronto.edu/pub/vassos/fault.tolerant.broadcasts.dvi.Z.
- Joseph Y. Halpern and Yoram Moses. Knowledge and common knowledge in a distributed environment. Journal of the ACM, 37(3):549-587, July 1990.
- L. Lamport. The implementation of reliable distributed multiprocess systems. Computer Networks 2, 95-114.
- Leslie Lamport, Robert E. Shostak, and Marshall C. Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4(3):382-401, July 1982.
- W. K. Lo and V. Hadzilacos. Using failure detectors to solve consensus in asynchronous shared-memory systems. In Proceedings of the Eighth International Workshop on Distributed Algorithms (Sept. 1994), pp. 280-295. Springer-Verlag. Available from ftp://ftp.db.toronto.edu/pub/vassos/failure.detectors.shared.memory.ps.Z.
- M. Loui and Abu-Amara. Memory requirements for agreement among unreliable asynchronous processes. Advances in computing research 4, 163-183.
- P. Melliar-Smith, R. E. Shostak, and C. B. Weinstock. SIFT: Design and analysis of a fault-tolerant computer for aircraft control. Proceedings of the IEEE 66, 10 (Oct.), 1240-1255.
- Yoram Moses, Danny Dolev, and Joseph Y. Halpern. Cheating husbands and other stories: A case study of knowledge, action, and communication. Distributed Computing, 1(3):167-176, 1986.
- S. J. Ed. Mullender. The Amoeba distributed operating system: Selected papers 1984 - 1987. Centre for Mathematics and Computer Science.
- Gil Neiger. Failure detectors and the wait-free hierarchy. In Proceedings of the Fourteenth Annual ACM Symposium on Principles of Distributed Computing, pages 100-109, Ottawa, Ontario, Canada, 2-23 August 1995.
- Gil Neiger and Sam Toueg. Automatically increasing the fault-tolerance of distributed algorithms. Journal of Algorithms, 11(3):374-419, September 1990.
- M. Pease, R. Shostak, and L. Lamport. Reaching agreement in the presence of faults. Journal of the ACM, 27(2):228-234, April 1980.
- Larry L. Peterson, Nick C. Buchholz, and Richard D. Schlichting. Preserving and using context information in interprocess communication. ACM Transactions on Computer Systems, 7(3):217-246, August 1989.
- Frank M. Pittelli and Hector Garcia-Molina. Reliable scheduling in a TMR database system. ACM Transactions on Computer Systems, 7(1):25-60, February 1989.
- D. Ed. Powell. Delta-4: A Generic Architecture for Dependable Distributed Computing. Springer-Verlag.
- R. Reischuk. A new solution for the Byzantine general's problem. Technical Report RJ 3673 (Nov.), IBM Research Laboratory.
- Aleta Ricciardi and Kenneth P. Birman. Using process groups to implement failure detection in asynchronous environments. In Proceedings of the Tenth Annual ACM Symposium on Principles of Distributed Computing, pages 341-353, Montreal, Quebec, Canada, 19-21 August 1991.
- L. Sabel and K. Marzullo. Election vs. consensus in asynchronous systems. Technical Report TR95-411 (Feb.), University of California at San Diego. Available at ftp://ftp.cs.cornell.edu/pub/sabel/tr94-1413.ps.
- Fred B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys, 22(4):299-319, December 1990.