Andy Glew's comp-arch.net wiki, http://semipublic.comp-arch.net
If you are reading this elsewhere, e.g. at site waboba.info, it is an unauthorized copy, and probably a malware site.
comp-arch.net wiki on hold from October 17, 2011
Poor Man's ECC
Conventional Width Oriented ECC
Conventional ECC is width-oriented ECC: e.g., to every 64 bits of data you attach 8 bits of ECC code. Typically these 64+8-bit-wide chunks are organized into cache-line-sized bursts of 64 bytes, i.e. 8 chunks of 64+8 bits apiece.
These 8 bits are sufficient for normal SECDED. Some systems calculate the ECC across more than 64-bits at a time, allowing more advanced codes such as chipkill.
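The 8-bit figure follows from the Hamming bound: r check bits give single-error correction over m data bits when 2^r >= m + r + 1, and one extra overall-parity bit upgrades SEC to SECDED. A quick sketch of the arithmetic (illustrative Python, not part of any real design):

```python
def secded_check_bits(m):
    # Hamming bound: r check bits cover m data bits for single-error
    # correction when 2^r >= m + r + 1; the extra +1 is the overall
    # parity bit that upgrades SEC to SECDED.
    r = 1
    while (1 << r) < m + r + 1:
        r += 1
    return r + 1

print(secded_check_bits(64))   # 8  -> the familiar 64+8
print(secded_check_bits(128))  # 9  -> wider ECC words amortize better
```

Computing the code across wider words, as chipkill-style systems do, needs proportionally fewer check bits per data bit.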
"Chipkill"? The term tends to imply DRAM memory systems created out of 8-bit wide DRAMs: 9 8 bit wide DRAMs to create a 72 bit wide multichip DRAM interface. Such 9x8b = 72 bit wide DRAM DIMM interfaces are the norm in 2010. 128-bit wide (144 total) can also be found.
In years past, x1 (1-bit wide) and x4 (4-bit wide) DRAM chip interfaces were originally more common than x8. These narrower interfaces allowed system designers to create different flavors of width oriented ECC subsystems. However, trends in the DRAM market suppressed these design alternatives.
Almost never does one see x9, x18, or x36 DRAMs, where the ECC is "built in".
How about wider DRAM interfaces? 16-bit-wide and 32-bit-wide DRAMs can be found, at a premium over commodity 8-bit-wide DRAMs.
Consider what this would do to width-oriented ECC: to build a 64-bit-wide data path you would need 5 x16 chips => actually 80 bits wide. A net overhead of 25% rather than 12.5%, purely because of the granularity of the x16 DRAM chip. 128 bits of data would work out okay: 9 x16 chips, 144 bits total. But the same scenario repeats as you go to x32.
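The granularity penalty can be made concrete. The function names here are mine; the numbers match the scenario above:

```python
def chips_needed(total_bits, chip_width):
    return -(-total_bits // chip_width)  # ceiling division: whole chips only

def ecc_overhead(data_bits, ecc_bits, chip_width):
    # Bits actually provisioned, rounded up to whole DRAM chips,
    # expressed as overhead relative to the data bits.
    bits = chips_needed(data_bits + ecc_bits, chip_width) * chip_width
    return (bits - data_bits) / data_bits

print(ecc_overhead(64, 8, 8))    # 0.125 -> 9 x8 chips, the usual 12.5%
print(ecc_overhead(64, 8, 16))   # 0.25  -> 5 x16 chips = 80 bits, 25%
print(ecc_overhead(128, 16, 16)) # 0.125 -> 9 x16 chips = 144 bits total
```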
David Wang first explained this to me in the late 1990s:
* servers that want arbitrarily large memory subsystems can build them using narrow, x4 or x8 DRAM chips
* but PCs and embedded systems that want less memory seek to use wider DRAM chips over time - and hence the overhead of ECC grows over time.
Poor Man's ECC
Poor Man's ECC seeks to break this stalemate, by providing alternatives to width-oriented ECC:
Length Oriented ECC
Length oriented ECC places the ECC immediately after the data.
If you can affect the design of DRAM chips, you might be able to persuade JEDEC and the DRAM manufacturers to make available DRAMs with 5 or 9 cycle burst transfers.
- Unfortunately, influencing DRAM vendors has not proved successful. Such a DRAM would be a non-commodity part, and hence expensive.
A second approach to length oriented ECC is to have the memory controller scale all addresses by 9/8 or the like. Worst case, two DRAM burst transfers might be required for any cache line; however, this cost would be amortized over sequential access patterns.
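A sketch of the 9/8 scaling, assuming 64-byte cache lines, 8 ECC bytes stored immediately after each line, and 64-byte DRAM bursts (all sizes illustrative):

```python
LINE = 64          # data bytes per cache line
STORED = LINE + 8  # 64 data bytes + 8 ECC bytes stored contiguously
BURST = 64         # DRAM burst size in bytes

def stored_range(line):
    # The controller scales line addresses by 9/8:
    # data line n occupies stored bytes [n*72, n*72 + 72).
    start = line * STORED
    return start, start + STORED

def bursts_touched(lines):
    # Count distinct 64-byte DRAM bursts needed to read the given lines.
    touched = set()
    for n in lines:
        start, end = stored_range(n)
        touched.update(range(start // BURST, (end - 1) // BURST + 1))
    return len(touched)

# A single 72-byte line straddles two bursts: the 2x worst case...
print(bursts_touched([3]))       # 2
# ...but streaming 8 sequential lines touches only 9 bursts, not 16.
print(bursts_touched(range(8)))  # 9
```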
This would allow conventional DRAM DIMMs to be used. However, software might find it strange that, after the scaling is done, 1 GiB of DRAM is effectively less than 1 GiB - and no longer a power of two. Software often assumes powers of two.
A third approach is to store the ECC in a separate, outlying area of memory.
There might be a single area for all of memory - assuming a contiguous DRAM space.
e.g. ECC_address(A) = (A >> 6) + ECC_base
Or such an ECC area might be allocated once every M bytes, e.g. in every 1 GiB region, one might allow 7/8 of a GiB of data, and 1/8 of a GiB of ECC. Or, for a smaller (1 MiB) region size:
e.g. ECC_address(A) = (A & ~0x0FFFFF) + ((A&0x0FFFFF)>>6)
- downside: cannot have large contiguous amounts of physical memory in this scheme; large arrays must be allocated in virtual memory
- historically OSes objected to discontiguous physical memory, but there is nothing fundamentally hard about it.
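Both mappings can be written out directly. ECC_BASE and the 1 MiB region size are illustrative, and `>> 6` reflects one ECC unit per 64-byte line, as in the formulas above:

```python
ECC_BASE = 0x8000_0000  # assumed base of the single contiguous ECC area
REGION = 1 << 20        # 1 MiB regions for the per-region mapping

def ecc_addr_single(a):
    # One ECC area for all of memory: one ECC unit per 64-byte line.
    return (a >> 6) + ECC_BASE

def ecc_addr_per_region(a):
    # ECC for each region lives at the start of that region;
    # the data's offset within the region selects the ECC unit.
    return (a & ~(REGION - 1)) + ((a & (REGION - 1)) >> 6)

print(hex(ecc_addr_single(0x40)))          # 0x80000001
print(hex(ecc_addr_per_region(0x100040)))  # 0x100001
```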
An ECC memory controller would have to arrange to read both the data and the outlying ECC. Worst case this would be 2X the memory traffic for random accesses; however, sequential access patterns would greatly reduce this.
Optimizing Poor Man's ECC
As mentioned above, worst case any flavour of Poor Man's ECC would be 2X the memory traffic for random accesses; however, sequential access patterns would greatly reduce this.
One can imagine an ECC cache.
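A sketch of such an ECC cache, assuming 8 ECC bytes per 64-byte data line, so that one 64-byte ECC fetch covers 8 data lines (sizes illustrative):

```python
from collections import OrderedDict

ECC_PER_LINE = 8                           # ECC bytes per 64-byte data line
ECC_LINES_PER_FETCH = 64 // ECC_PER_LINE   # one 64B ECC fetch covers 8 lines

class EccCache:
    """Tiny LRU cache of 64-byte ECC lines; counts outlying-ECC fetches."""
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, data_line):
        ecc_line = data_line // ECC_LINES_PER_FETCH
        if ecc_line in self.lines:
            self.lines.move_to_end(ecc_line)   # LRU hit
        else:
            self.misses += 1                   # extra DRAM access
            self.lines[ecc_line] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)

c = EccCache()
for n in range(64):   # stream 64 sequential data lines
    c.access(n)
print(c.misses)       # 8 extra ECC fetches instead of 64
```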
Finally, one can imagine compressing each cache line of data, as described in my (Andy Glew's) ASPLOS 98 talk. The compressed cache line might include the data, plus a bit to indicate compression, plus ECC. Therefore, in many circumstances the data plus the ECC would fit in the space of the original cache line, and it would not be necessary to access the outlying ECC area. (Using compressed memory for metadata is a general technique that can be applied to any dense metadata.)
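A sketch of the compression idea. Here zlib stands in for whatever simple hardware compressor would really be used, and the flag-byte layout is my own illustration:

```python
import zlib  # stand-in; real hardware would use a much simpler compressor

LINE = 64  # data bytes per cache line
ECC = 8    # ECC bytes per line

def store_line(data):
    # If flag + compressed data + ECC fit in the original 64 bytes,
    # the ECC rides inline and no outlying-ECC access is needed.
    assert len(data) == LINE
    packed = zlib.compress(data)
    if 1 + len(packed) + ECC <= LINE:
        blob = bytes([1]) + packed + bytes(ECC)  # flag=1: compressed, ECC inline
        return blob.ljust(LINE, b"\x00"), True
    return data, False  # stored uncompressed; ECC goes to the outlying area

stored, inlined = store_line(bytes(LINE))  # an all-zero line compresses easily
print(inlined)  # True
```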
Poor Man's ECC Patents
US patent application 20080235485, ECC implementation in non-ECC components, Haertel, Polzin, Kocev, Steinman, assigned to AMD, filed 2007.
- Since this AMD patent application is now public, I think that it is okay to mention that I invented the above concepts of Poor Man's ECC at AMD in 2002-2004, and discussed them with Haertel.
US patent 7,117,421, Transparent error correction code memory system and method, Danilak, assigned to Nvidia, 2002.
- Probably predates any work I did on Poor Man's ECC.
Poor Man's ECC in Products
None as of Jan 2010, but...
Nvidia's announced Fermi GPU ECC support seems to indicate that Nvidia is using outlying ECC. Perhaps as described above. Perhaps using some of the performance optimization techniques described above.
Let us look forward to the day when anyone, not just big servers, can have ECC.