An empirical study on crash recovery bugs in large-scale distributed systems

Y Gao, W Dou, F Qin, C Gao, D Wang, J Wei, R Huang, L Zhou, Y Wu
Proceedings of the 2018 26th ACM joint meeting on european software …, 2018 - dl.acm.org
In large-scale distributed systems, node crashes are inevitable, and can happen at any time. As such, distributed systems are usually designed to be resilient to these node crashes via various crash recovery mechanisms, such as write-ahead logging in HBase and hinted handoffs in Cassandra. However, faults in crash recovery mechanisms and their implementations can introduce intricate crash recovery bugs, and lead to severe consequences.
In this paper, we present CREB, the most comprehensive study on 103 Crash REcovery Bugs from four popular open-source distributed systems: ZooKeeper, Hadoop MapReduce, Cassandra, and HBase. For all the studied bugs, we analyze their root causes, triggering conditions, impacts, and fixes. Through this study, we obtain many interesting findings that can open up new research directions for combating crash recovery bugs.