research-article
Open access

A Large-Scale Study of Flash Memory Failures in the Field

Published: 15 June 2015

Abstract

Servers use flash memory based solid state drives (SSDs) as a high-performance alternative to hard disk drives to store persistent data. Unfortunately, recent increases in flash density have also brought about decreases in chip-level reliability. In a data center environment, flash-based SSD failures can lead to downtime and, in the worst case, data loss. As a result, it is important to understand flash memory reliability characteristics over flash lifetime in a realistic production data center environment running modern applications and system software.
This paper presents the first large-scale study of flash-based SSD reliability in the field. We analyze data collected across a majority of flash-based solid state drives at Facebook data centers over nearly four years and many millions of operational hours in order to understand failure properties and trends of flash-based SSDs. Our study considers a variety of SSD characteristics, including: the amount of data written to and read from flash chips; how data is mapped within the SSD address space; the amount of data copied, erased, and discarded by the flash controller; and flash board temperature and bus power.
Based on our field analysis of how flash memory errors manifest when running modern workloads on modern SSDs, this paper is the first to make several major observations: (1) SSD failure rates do not increase monotonically with flash chip wear; instead they go through several distinct periods corresponding to how failures emerge and are subsequently detected, (2) the effects of read disturbance errors are not prevalent in the field, (3) sparse logical data layout across an SSD's physical address space (e.g., non-contiguous data), as measured by the amount of metadata required to track logical address translations stored in an SSD-internal DRAM buffer, can greatly affect SSD failure rate, (4) higher temperatures lead to higher failure rates, but techniques that throttle SSD operation appear to greatly reduce the negative reliability impact of higher temperatures, and (5) data written by the operating system to flash-based SSDs does not always accurately indicate the amount of wear induced on flash cells due to optimizations in the SSD controller and buffering employed in the system software. We hope that the findings of this first large-scale flash memory reliability study can inspire others to develop other publicly-available analyses and novel flash reliability solutions.

    Published In

    ACM SIGMETRICS Performance Evaluation Review  Volume 43, Issue 1
    Performance evaluation review
    June 2015
    468 pages
    ISSN:0163-5999
    DOI:10.1145/2796314
    • ACM Conferences
      SIGMETRICS '15: Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems
      June 2015
      488 pages
      ISBN:9781450334860
      DOI:10.1145/2745844
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. flash memory
    2. reliability
    3. warehouse-scale data centers
