What is the Issue?
Silent Data Corruption (SDC), sometimes referred to as Silent Data Error (SDE), is an industry-wide issue impacting not only long-protected memory, storage, and networking, but also CPUs. As with software issues, hardware-induced SDC can contribute to data loss and corruption. An SDC occurs when an impacted CPU inadvertently causes errors in the data it processes. For example, an impacted CPU might miscalculate data (e.g., 1+1=3). There may be no indication of these computational errors unless the software systematically checks for errors.
SDC issues have been around for many years, but as chips have become more advanced and compact, the transistors and lines have become so tiny that small electrical fluctuations can cause errors. Most of these errors are caused by defects during manufacturing and are screened out by the vendors; others are caught by hardware error detection or correction. However, some errors go undetected by hardware, so only software-level detection can protect against them.
How is Google Addressing this Issue?
We have employed software-level defenses against hardware data corruption for two decades. For example, the Google File System (GFS) paper (published in 2003) describes how the early file system defended against memory and disk errors by adding software checksums. Today, many Google Cloud services contain such checks, thus preventing SDC errors from permanently corrupting customer or system data.
As part of our ongoing fleet maintenance, we systematically screen our fleet and take the following actions:
- When a substantially better test becomes available, we re-screen the fleet.
- When a defect is detected, that CPU is replaced.
Additionally, Google collaborates closely with CPU vendors on detection, analysis, and mitigation to identify defective CPUs across the industry. As part of these efforts, more than two years ago, we created CPU Check, an open source test that any organization can use to scan its fleet for hardware faults. Google maintains and periodically updates this test. These detection and mitigation measures are being continuously strengthened and are effectively identifying most defective CPUs.
However, SDC is difficult to diagnose with 100% accuracy, and there are a variety of possible CPU defects. We are working with our partners to develop more effective screening tests and to help our customers defend their own critical systems.
Potential Mitigation Approaches
Mitigation in software and added protections in hardware are common approaches to ensure reliable operation in the presence of faulty hardware. For example, DRAM is protected by Error Correction Codes (ECC) that detect and correct bit errors. Disks use parity bits and RAID (Redundant Array of Independent Disks) configurations to enable reliable operation. For network packet corruption, methods such as the cyclic redundancy check (CRC) enable corrupted packets to be detected and discarded. Software bugs may also cause corruption.
CPUs also contain ECC protection for many of their parts (e.g., cache memory), but as discussed above, some errors can remain undetected and lead to corrupted computations, corrupted data, or partial or even complete loss of data (e.g., metadata and cryptographic key corruption can have widespread impact).
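As a minimal sketch of the CRC idea mentioned above, the following Python snippet appends a CRC-32 trailer to a payload and discards any packet whose trailer does not match. The function names `frame` and `unframe` and the 4-byte trailer layout are assumptions made for illustration; real network protocols define their own framing.

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Sender side: append the CRC-32 of the payload as a 4-byte big-endian trailer."""
    return payload + struct.pack(">I", zlib.crc32(payload))


def unframe(packet: bytes):
    """Receiver side: return the payload if the CRC matches, otherwise None (discard)."""
    if len(packet) < 4:
        return None
    payload, trailer = packet[:-4], packet[-4:]
    (expected,) = struct.unpack(">I", trailer)
    return payload if zlib.crc32(payload) == expected else None


packet = frame(b"hello, world")
assert unframe(packet) == b"hello, world"

# Simulate a single bit flipped in transit; the receiver discards the packet.
corrupted = bytearray(packet)
corrupted[3] ^= 0x01
assert unframe(bytes(corrupted)) is None
```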
We therefore need to extend known protection methods in software against more potential corruption sources. We discuss some of these methods in more detail below to serve as a point of departure. Such protection measures may help mitigate the far more common software bugs, while also addressing the challenge of detecting hardware-induced corruption.
Identifying and removing corrupting CPUs from serving systems is a constant endeavor. Given that this is a challenging problem, such methods need to be complemented with building further resiliency in software.
Many of Google’s software systems already replicate some computations for fail-stop resilience and cross-check some computations to detect software bugs and hardware defects (e.g., via end-to-end checksums).
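The cross-checking idea can be illustrated for a single critical function. The sketch below is not Google's implementation; it assumes compression with Python's standard `zlib` module, and `checked_compress` and `SelfCheckError` are hypothetical names. It compresses the input and then verifies that the output decompresses back to the original before returning it.

```python
import zlib


class SelfCheckError(RuntimeError):
    """Raised when a result fails its own verification."""


def checked_compress(data: bytes, compress=zlib.compress) -> bytes:
    """Compress `data` and verify that the output decompresses back to it.

    `compress` is injectable only so that a faulty compressor can be
    simulated in tests; it defaults to zlib.compress.
    """
    compressed = compress(data)
    try:
        round_trip = zlib.decompress(compressed)
    except zlib.error as exc:
        raise SelfCheckError("compressed output is not decodable") from exc
    if round_trip != data:
        raise SelfCheckError("compressed output does not round-trip")
    return compressed


payload = b"important payload" * 100
blob = checked_compress(payload)
assert zlib.decompress(blob) == payload
```

A check that runs on the same core as the original computation may repeat the same fault, so a stronger variant would run the verification step on a different core or machine. This sketch does not control where the verification runs.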
To allow a broader group of application developers to leverage our shared expertise in addressing SDCs, users may consider developing a few libraries with self-checking implementations of critical functions, such as encryption and compression, where a single corruption can have a large impact. We discuss some guidelines that users can choose to follow to ensure more reliable operation of their applications, services, and systems.
- Planning Based on your Objective Metrics: Recovery Point Objective (RPO) and Recovery Time Objective (RTO) goals for your services should help inform how often backup checkpoints are made as part of your operations lifecycle. Consider logging so that a system is in place to recover from the last checkpoint or backup when a corruption incident is detected. These objectives can guide enterprises in choosing an optimal cloud backup and disaster recovery plan. See the example guidelines in the Google Disaster Recovery Planning Guide.
- End-to-End Checksumming: It is critical to follow the end-to-end principle of system design so that any corruption can be detected and recovered from. All data traversing the network should be protected by computing a CRC at the source and verifying it at the destination, so that all corruption in transit is detected.[1] For example, the gRPC library adds software checksums to all messages to detect SDC. Likewise, enable end-to-end checksumming for other services that operate on and move data.
- Building Replicas for Important Data: For all important data, consider developing and maintaining replicas, preferably on different machines, to serve as a) recovery options when one of the machines holding a replica fails, and b) a means to check periodically that the replicas are consistent.
- Write Paths: Write paths are especially critical to avoid any permanent loss of data. Best practices suggest keeping replicas of all data and, when writing, making sure the checksums of the replicas match. When reading, consider reading from both replicas and checking that they match.
- Choosing a Checksum Algorithm: The checksum algorithm must be carefully chosen to minimize the chance that different input streams produce the same checksum output. Salting the input with object- or data-specific attributes (e.g., sequence id, last timestamp, or object id) can help minimize collisions and detect metadata corruption that swaps objects. Such salting serves as an extra layer of protection.
- Periodic Verification: Periodically verify the checksums of data at rest to make sure the data remains in a consistent state and operation stays reliable. These verifications can run as background activity when utilization is low.
- Critical Control Objects: Encryption operations and security key generation are especially important to protect via software. High-value metadata (e.g., file system inodes) is also critical to protect using additional software checks. Corruption of these objects can significantly amplify data integrity and durability risks. If not too expensive, it is recommended that all such operations be checked via software on each access. If that is cost-prohibitive, sampling-based approaches can be considered, depending on the cost and the needed confidence in reliable operation. One approach to consider is to encrypt on one CPU, decrypt on another CPU, and check that the data is consistent. This protects against encryption landing on a possibly corrupting core. Another approach is to maintain replicas of these critical objects, compute the checksums of the replicas on different machines, and check that they match.
- Invariant Checks (i.e., Asserts): Consider adding and strengthening invariant checks throughout the system to improve coverage and timeliness for detection of software and hardware corruptions.
- Soft Deletes: Consider adding “soft deletes” to mitigate against potentially erroneous garbage collection of files and data, perhaps moving them to a temporary directory to be cleaned up only periodically. This will enable restoring in the event of corruption that results in loss of data.
- Operational Improvements: Consider constant improvement of test coverage and use all available code sanitizers: a) address sanitizers, including HWASAN (hardware-assisted), to check for memory addressability, b) leak sanitizers to check for memory leaks, c) thread sanitizers to detect race conditions and deadlocks, d) memory sanitizers to check for references to uninitialized memory, and e) undefined behavior sanitizers (UBSAN) to check for undefined behavior.
- Triage: Triage all crashes and errors promptly to guard against rolling out buggy software releases and to ensure bad hardware is identified and removed from operation to reduce recidivism. Close the loop on the results of the triage. This could include new test content and screening to proactively identify other machines used by your services that may be susceptible to similar failures.
- Exercising Backup and Restore Mechanisms: It is critical to ensure the backup and restore systems are tested for consistent operation from time to time, given that these are the last line of defense against potential corruption.
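Several of the guidelines above (end-to-end checksumming, replicas, write-path verification, salted checksums, and periodic verification) can be combined in one small sketch. Everything here is an illustrative assumption rather than a description of any Google system: the names `salted_checksum`, `ReplicatedStore`, and `CorruptionError` are hypothetical, CRC-32 from Python's `zlib` stands in for whatever checksum you choose, and in-memory dictionaries stand in for replicas that would normally live on different machines.

```python
import zlib


class CorruptionError(RuntimeError):
    """Raised when a checksum or replica comparison fails."""


def salted_checksum(object_id: str, data: bytes) -> int:
    """CRC-32 over a length-prefixed object id followed by the data.

    Salting with the id means a record stored under the wrong key fails
    verification even if its bytes are intact.
    """
    salt = object_id.encode("utf-8")
    header = len(salt).to_bytes(4, "big") + salt
    return zlib.crc32(data, zlib.crc32(header))


class ReplicatedStore:
    """Toy store; each dict stands in for a replica on a separate machine."""

    def __init__(self, replica_count: int = 2):
        self.replicas = [dict() for _ in range(replica_count)]

    def write(self, object_id: str, data: bytes, checksum: int) -> None:
        """Verify the caller's source-side checksum, store on every replica, read back."""
        if salted_checksum(object_id, data) != checksum:
            raise CorruptionError("data corrupted before reaching the store")
        for replica in self.replicas:
            replica[object_id] = (bytes(data), checksum)
        self.read(object_id)

    def read(self, object_id: str) -> bytes:
        """Read from all replicas; return the data only if every copy verifies and matches."""
        copies = [replica.get(object_id) for replica in self.replicas]
        if all(copy is None for copy in copies):
            raise KeyError(object_id)
        if any(copy is None for copy in copies):
            raise CorruptionError("object missing from a replica")
        for data, checksum in copies:
            if salted_checksum(object_id, data) != checksum:
                raise CorruptionError("checksum mismatch on a replica")
        if len(set(copies)) != 1:
            raise CorruptionError("replicas disagree")
        return copies[0][0]

    def scrub(self) -> list:
        """Periodic verification: return the ids of objects that fail their checks."""
        all_ids = set().union(*(replica.keys() for replica in self.replicas))
        bad = []
        for object_id in sorted(all_ids):
            try:
                self.read(object_id)
            except CorruptionError:
                bad.append(object_id)
        return bad


store = ReplicatedStore()
store.write("a", b"alpha", salted_checksum("a", b"alpha"))
store.write("b", b"beta", salted_checksum("b", b"beta"))
assert store.read("a") == b"alpha"
assert store.scrub() == []
```

In this sketch the caller computes the checksum at the source, so corruption anywhere between the caller and the store is caught on write, while `scrub` is the kind of check that could run in the background when utilization is low.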
References
- The Google File System: A very early Google paper, discusses checksumming.
- Bigtable: A Distributed Storage System for Structured Data: Discusses the value of various kinds of protection, e.g., see section 9 “Lessons”.
- Colossus: Google’s exabyte scale file system.
1. Google Cloud’s network hardware already performs checksumming to guard against data corruption. However, software checks can guard against that data being corrupted by the sending or receiving network processor before it is protected by the hardware checksum.