DOI: 10.1145/2939672.2939699
research-article

Predicting Disk Replacement towards Reliable Data Centers

Published: 13 August 2016

Abstract

Disks are among the most frequently failing components in today's IT environments. Despite defense mechanisms such as RAID, disk failures still often severely impact system availability and reliability. In this paper, we present a highly accurate SMART-based analysis pipeline that can correctly predict the need for a disk replacement up to 10-15 days in advance. Our method has been built and evaluated on more than 30,000 disks from two major manufacturers, monitored over 17 months. Our approach employs statistical techniques to automatically detect which SMART parameters correlate with disk replacement and uses them to predict the replacement of a disk with up to 98% accuracy.
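
The abstract does not say which statistical techniques are used to find the SMART parameters that correlate with replacement. As a rough illustration of the general idea only, the sketch below assumes a Welch-style t statistic that compares each parameter between replaced and healthy disks and keeps the parameters whose score exceeds a cutoff. The function names, the threshold of 3.0, the parameter names, and the toy readings are all hypothetical and are not taken from the paper.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    diff = statistics.mean(a) - statistics.mean(b)
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    if se == 0.0:
        # Both samples are constant: no evidence if equal, unbounded if not.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def select_parameters(replaced, healthy, threshold=3.0):
    """Return parameter names with |t| above the threshold, strongest first.

    `replaced` and `healthy` map a parameter name to a list of readings,
    one reading per disk. The threshold is an illustrative assumption.
    """
    scores = {
        name: welch_t(replaced[name], healthy[name])
        for name in replaced
        if name in healthy
    }
    picked = [name for name, t in scores.items() if abs(t) > threshold]
    return sorted(picked, key=lambda name: -abs(scores[name]))


# Hypothetical toy readings, not data from the paper.
replaced = {
    "reallocated_sectors": [12, 30, 8, 25, 40],
    "power_on_hours": [21000, 18000, 25000, 19000, 22000],
}
healthy = {
    "reallocated_sectors": [0, 1, 0, 2, 0],
    "power_on_hours": [20000, 19500, 24000, 18500, 23000],
}

selected = select_parameters(replaced, healthy)
print(selected)
```

In this toy data only the reallocated-sector count differs clearly between the two groups, so it is the only parameter that would be passed on to a downstream classifier. A real pipeline would work on per-disk time series rather than single readings, which this sketch does not attempt.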

Supplementary Material

MP4 File (kdd2016_botezatu_disk_replacement_01-acm.mp4)

Published In

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2016
2176 pages
ISBN: 9781450342322
DOI: 10.1145/2939672
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. changepoint
  2. classification
  3. disk replacement
  4. time series

Qualifiers

  • Research-article

Conference

KDD '16
Acceptance Rates

KDD '16 Paper Acceptance Rate 66 of 1,115 submissions, 6%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 69
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 06 Nov 2024

Citations

Cited By

  • (2024) Prediction of Disk Failure Based on Classification Intensity Resampling. Information 15:6 (322). DOI: 10.3390/info15060322. Online publication date: 31-May-2024
  • (2024) On the Model Update Strategies for Supervised Learning in AIOps Solutions. ACM Transactions on Software Engineering and Methodology 33:7 (1-38). DOI: 10.1145/3664599. Online publication date: 26-Aug-2024
  • (2024) Is Your Anomaly Detector Ready for Change? Adapting AIOps Solutions to the Real World. Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI (222-233). DOI: 10.1145/3644815.3644961. Online publication date: 14-Apr-2024
  • (2024) MISP: A Multimodal-based Intelligent Server Failure Prediction Model for Cloud Computing Systems. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (5509-5520). DOI: 10.1145/3637528.3671568. Online publication date: 25-Aug-2024
  • (2024) SOIL: Score Conditioned Diffusion Model for Imbalanced Cloud Failure Prediction. Companion Proceedings of the ACM Web Conference 2024 (65-72). DOI: 10.1145/3589335.3648303. Online publication date: 13-May-2024
  • (2024) SiaDFP: A Disk Failure Prediction Framework Based on Siamese Neural Network in Large-Scale Data Center. IEEE Transactions on Services Computing 17:5 (2890-2903). DOI: 10.1109/TSC.2024.3394692. Online publication date: Sep-2024
  • (2024) Proactive Drive Failure Prediction for Cloud Storage System Through Semi-Supervised Learning. IEEE Transactions on Dependable and Secure Computing 21:4 (1528-1543). DOI: 10.1109/TDSC.2023.3286093. Online publication date: Jul-2024
  • (2024) Machine Learning Based Collaborative Prediction of SSD Failures in the Cloud. 2024 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA) (1-6). DOI: 10.1109/ACDSA59508.2024.10467231. Online publication date: 1-Feb-2024
  • (2024) Using machine learning to forecast hard drive failures. E3S Web of Conferences 549 (08024). DOI: 10.1051/e3sconf/202454908024. Online publication date: 15-Jul-2024
  • (2024) Storage Reliability. Data Storage Architectures and Technologies (225-270). DOI: 10.1007/978-981-97-3534-1_9. Online publication date: 28-Aug-2024
