Correcting An Error
Why error-correcting memory, long an obscure computing concept, suddenly has major relevance outside of the server room. At least according to Linus Torvalds.
Today’s GIF comes from the classically bad Raul Julia film Overdrawn at the Memory Bank, which Mystery Science Theater 3000 riffed on in 1997.
“Dammit, if a machine can find out that there is an error, why can’t it locate where it is and change the setting of the relay from one to zero or zero to one?”
— Richard Hamming, a Bell Labs employee, discussing the decision-making process that led to the Hamming Code, the first prominent error-correction algorithm, in 1950. The code relies on parity checking to determine whether errors occurred in a data transmission, and to fix them. Hamming’s work, per a Computer History Museum biography, was inspired by a test that broke on him on the computer he was using at the time, a Bell Model V that relied on punch cards. An error with the cards sent the results he needed to give his coworkers off the rails, but it soon led to something fundamental to the history of computing. His formative work was soon improved greatly by the many others who followed in his footsteps.
What the heck is error correction, and why would a computer user want it?
Error correction is a family of techniques that add redundant information to data, with the aim of ensuring that the flow of information isn’t broken even if part of it is lost or corrupted along the way.
And its context goes far beyond what the RAM in your computer does.
A good way of thinking about this in a real-world way is to consider what happens if you’re streaming a video on a bad connection. Bits and pieces break off the stream, and the video client (for example, Zoom) has to account for them as best as possible. It may lead to a choppy experience with dropped frames and maybe some blurriness or broken-up imagery, but the video does the best it can to continue unabated. Perhaps redundancy is built into the video codec so that a random missing byte doesn’t break the end user’s connection; perhaps parity checks, which help gauge the quality of the data being sent, can clean up some of the bits coming over the wire so that an error never becomes visible.
This is actually something that connections have been doing all along. When we were all trying to download data on pokey, noisy telephone lines, a little static was enough to ruin a connection.
This led to many efforts at error correction targeting the phone system. For example, many modems sold during the late ’80s and early ’90s supported an error correction protocol called V.42, one of a number of “V-series” protocols decided on by the International Telecommunication Union for managing data-based communications through the telephone line.
Rather than correcting the error on the fly like the Hamming Code allows for, V.42 used an error correction method called Automatic Repeat reQuest (ARQ), which basically means that it asks for a packet again after a piece of data goes missing. The error correction worked through repetition, essentially resending any lost data packets as soon as the loss is detected. (The goal is not maximum speed, but consistency. After all, a fast connection that completely breaks down is likely not worth it on dial-up.)
Error correction methods like this ensured that dropped bytes could be sent again, so that a little line noise didn’t, for example, ruin a file transfer.
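To make the resend-on-error idea concrete, here is a minimal Python sketch of a stop-and-wait style of ARQ. It is an illustration of the general technique, not V.42’s actual frame format or protocol: the one-byte sequence number, the CRC-32 checksum, the noisy_line channel and every function name are assumptions made up for this example.

```python
import random
import zlib


def make_frame(seq, payload):
    """Build a frame: sequence byte + payload + CRC-32 trailer (illustrative layout)."""
    body = bytes([seq % 256]) + payload
    return body + zlib.crc32(body).to_bytes(4, "big")


def frame_is_intact(frame):
    """Recompute the checksum over the body and compare it with the trailer."""
    body, trailer = frame[:-4], frame[-4:]
    return zlib.crc32(body).to_bytes(4, "big") == trailer


def noisy_line(frame, rng, error_rate):
    """Hypothetical phone line: with probability error_rate, flip one random bit."""
    if rng.random() >= error_rate:
        return frame
    damaged = bytearray(frame)
    position = rng.randrange(len(damaged) * 8)
    damaged[position // 8] ^= 1 << (position % 8)
    return bytes(damaged)


def send_with_arq(chunks, error_rate=0.3, seed=1, max_tries=50):
    """Send each chunk, resending it until a copy arrives with a valid checksum."""
    rng = random.Random(seed)
    received, transmissions = [], 0
    for seq, chunk in enumerate(chunks):
        for _ in range(max_tries):
            transmissions += 1
            arrived = noisy_line(make_frame(seq, chunk), rng, error_rate)
            if frame_is_intact(arrived):
                # Strip the sequence byte and the checksum, keep the payload.
                received.append(arrived[1:-4])
                break
            # A failed check stands in for the receiver asking for a repeat.
        else:
            raise RuntimeError("line too noisy, giving up")
    return b"".join(received), transmissions


data, sent = send_with_arq([b"hello ", b"world"])
print(data, "arrived after", sent, "transmissions")
```

Note the trade-off the sketch makes visible: the data always arrives intact, but a noisy line costs extra transmissions, which is the consistency-over-speed bargain described above.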
The error correction system that Hamming landed upon, meanwhile, involves a concept called parity, in which a little more information is sent than strictly needed, so the receiver can confirm that what was sent got through correctly. A parity bit records whether a group of binary digits should contain an even or odd number of ones; a single parity bit can only reveal that something went wrong, but by having several parity bits cover overlapping groups, the Hamming codes can pinpoint which bit flipped and correct it. As a Khan Academy video on the topic shows, the solution essentially runs a test on itself to ensure that nothing broke during the data transmission process that could negatively affect the information.
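For the curious, the overlapping-parity trick can be sketched in a few lines of Python. This uses the textbook Hamming(7,4) layout, which protects four data bits with three parity bits; the function names are made up for this example, and real ECC memory uses wider variants of the same idea rather than this exact code.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as a 7-bit codeword.

    Positions are numbered 1 to 7; parity bits sit at positions 1, 2 and 4.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # checks positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # checks positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # checks positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Fix up to one flipped bit; return (data_bits, error_position).

    error_position is 0 when every parity check passes.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The failed checks, read as a binary number, spell out the bad position.
    error_position = s1 + 2 * s2 + 4 * s4
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], error_position


codeword = hamming74_encode([1, 0, 1, 1])
codeword[4] ^= 1  # simulate noise flipping the bit at position 5
print(hamming74_decode(codeword))  # prints ([1, 0, 1, 1], 5)
```

This is Hamming’s complaint from the quote above, answered: the machine not only finds out that there is an error, it locates the bit and flips it back. The sketch handles one flipped bit per codeword; two flips in the same codeword would be miscorrected.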
This kind of error correction, called forward error correction, has a lot of practical uses. Hamming’s work was eventually followed up by other error-correction methods, most notably a system devised by Irving S. Reed and Gustave Solomon in the early 1960s that combined on-the-fly encoding and decoding of data to help protect the integrity of the data in noisy environments. Reed-Solomon codes have come into use most famously with CDs and DVDs (it’s the technology that helps prevent skips in those devices when, say, a disc is scratched), but also in a wide array of other technologies, such as wireless data.
There are lots of other codes for error correction that have found use over the years, but for the layperson, the key thing to know is that it’s a fundamental building block of computing … and it’s everywhere, helping to ensure things as diverse as your Netflix stream and your LTE signal land with a minimum of disruption.
This concept also applies to computer memory, which requires error correction in particular contexts. Ever have it happen where a piece of software just crashes on you, no explanation, and you have to restart your app, or possibly even the computer? Often there may be no rhyme or reason to it, but it happens anyway.
In certain environments, such as server rooms, crashes such as these can prove hugely problematic, stopping mission critical applications directly in their tracks.
And in ECC memory, the kind Linus Torvalds was complaining about, the Hamming Code is everywhere, helping to make sure those small computational mistakes don’t break the machine.
0.09%
The estimated failure rate for ECC memory, according to a 2014 analysis by Puget Systems, a developer of high-end workstations and servers. The company analyzed the failure rates of its computer memory over a yearlong period. By comparison, its non-ECC memory failed 0.6 percent of the time, or roughly 6.67 times as often as the error-correcting option. (Puget’s analysis, which is admittedly a bit on the older side, also dives into a common misconception about ECC memory, that the extra error-checking comes at a significant performance cost; in some of its tests, it found that the ECC memory was actually faster than the standard equivalent.)
Why you likely have never used error-correcting memory in a computer you own … unless you used an IBM PC in the ’80s
So to get back to Linus Torvalds’ complaint, he’s effectively upset that Intel’s efforts over the years to differentiate its high-end server and workstation equipment from its consumer-level gear have left most modern computer users without a feature that could benefit many of them.
Of course, the way he says it is way more colorful than anything I could come up with, so I’ll let him take it from here: “The ‘modern DRAM is so reliable that it doesn’t need ECC’ was always a bedtime story for children that had been dropped on their heads a bit too many times.”
(OK, maybe that was a bad idea. Yikes, that metaphor.)
But to take a step back, the general concept of error checking in memory actually dates to the earliest days of the IBM PC platform, when many early PCs used nine-bit memory words, with the additional bit going to parity. But that faded away over time, with many major RAM manufacturers deciding by the mid-1990s to stop selling parity memory for consumer use cases.
There was a good reason for this, too: Manufacturers didn’t think it was necessary for regular users anymore, so they dropped the feature, which was seen as adding cost and lowering speed, in many non-critical use cases.
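As a rough illustration of what that ninth bit buys you, here is a small Python sketch of a nine-bit memory word. The choice of even parity and the function names are assumptions made for this example, not a description of the original IBM PC circuitry.

```python
def store_with_parity(byte):
    """Pack a byte into a 9-bit word whose ninth bit makes the count of ones even."""
    parity = bin(byte).count("1") % 2
    return (parity << 8) | byte


def parity_ok(word):
    """A stored word passes the check only if it still holds an even number of ones."""
    return bin(word).count("1") % 2 == 0


word = store_with_parity(0b10110010)
print(parity_ok(word))            # True: nothing has changed
print(parity_ok(word ^ 0b1000))   # False: one flipped bit is caught
print(parity_ok(word ^ 0b1100))   # True: two flipped bits cancel out and slip through
```

The last line shows the limit of plain parity: it can flag a single flipped bit but cannot say which bit it was, and it misses an even number of flips entirely. Closing that gap is what separates ECC memory from the parity memory of early PCs.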
ECC memory has been around a long time, but has largely been in niche use cases like workstations and servers for the past 30 years or so—in part because Intel has largely limited its support to its high-end Xeon chip line, which often is used in mission-critical ways. (My dumpster-dive Xeon uses ECC memory, in case you were wondering. Side note: While ECC memory is generally more expensive new, it’s often cheaper used, which is why said machine has 64 gigs of RAM.)
But recently, the case for ECC memory for regular users has started to grow as individual memory chips have become faster and more densely packed. This has created new types of technical problems that seem to point toward ECC’s comeback at a consumer level.
In recent years, a new type of security exploit called “rowhammer” has gained attention in technical circles. This exploit, also known as a bit-flip attack, repeatedly accesses rows of memory cells in order to flip bits in neighboring rows, with the goal of acquiring or changing data already in memory. It’s essentially the computer memory equivalent of a concussion.
As ZDNet notes, the attack model is largely theoretical at this time, but vendors have tried … and repeatedly failed to prevent academics from proving that rowhammer attacks remain a fundamental threat to computer security. (Better academics than zero-day exploiters, right?)
While ECC memory can help mitigate such attacks, it is not foolproof, with Dutch researchers coming up with a rowhammer attack that even affects ECC RAM.
Torvalds, who builds a low-level operating system kernel used by hundreds of millions of people and so likely sees these issues up close, argues that Intel’s move to segregate ECC from mainstream computer users has likely caused problems with modern computers for years, even without academics going out of their way to attack it.
“We have decades of odd random kernel oopses that could never be explained and were likely due to bad memory,” he writes. “And if it causes a kernel oops, I can guarantee that there are several orders of magnitude more cases where it just caused a bit-flip that just never ended up being so critical.”
Intel’s mainstream Core chips generally do not support ECC, but as Torvalds notes, more recent AMD Ryzen chips—which have gained major popularity in the consumer technology space in recent years, largely because they’re often better than Intel—generally do (though it is dependent on the motherboard).
“The effect of cosmic rays on large computers is so bad that today’s large supercomputers would not even boot up if they did not have ECC memory in them.”
— Al Geist, a research scientist at the Oak Ridge National Laboratory, discussing the importance of ECC memory in a 2012 Wired piece that largely focuses on the challenges that cosmic rays create for computing solutions on the planet Mars—something that the Curiosity Rover, which is based on a PowerPC chip design, was built to work around. (Cosmic rays are just one of the factors that can cause bit-flipping, or the introduction of errors into computer memory.) However, Geist notes that the concerns that space rovers face also can cause issues on the ground—issues mitigated by the use of error-correcting memory.
In some ways, the fact that such a prominent figure—a guy famous for speaking his mind on random internet forums—is sticking his neck out there in favor of a technology that few consumers even know about highlights its importance in the modern day.
But the truth is, error correction has always been with us in one way or another as computer users, whether in a hardware or software context.
One might argue he’s just venting, letting off steam, but nonetheless, a big-name figure arguing for something that would benefit regular consumers is a good thing, even if it might not happen tomorrow.
(Something in favor of this becoming more mainstream: Intel is hurting, and is the target of activist investors that are trying to make the case that Intel needs to make some huge strategic changes to keep up, even going so far as to drop its longstanding vertical-integration model, which involves manufacturing its own chips. In other words, cards are on the table that haven’t been in a long time.)
Will it come to mainstream computers in this context? Only time will tell, but perhaps this is a conversation worth having right now. After all, as computers become more complex, old standards for technical needs are going to matter less and less, and things like reliability are going to matter a whole lot more.
Whether or not he was trying to, Linus may have started a useful conversation about what the future of computing needs to include, and what should or shouldn’t be a premium feature in our hardware.
But he may want to leave the metaphors to others.