Andy Glew's comp-arch.net wiki, http://semipublic.comp-arch.net
comp-arch.net wiki on hold from October 17, 2011
Content Addressable Memory (CAM)
CAMs are a powerful concept often desired by computer architects, and often deprecated by circuit designers.
Whereas a RAM is indexed by the number of an entry, a CAM is indexed by something within an entry - usually not the whole entry, but a field within it.
Examples of structures that are, at least conceptually, CAMs:
- Translation Lookaside Buffer (TLB): indexed by virtual address, returns physical address and permissions
- Cache: indexed by virtual or physical address, contains instruction or data
CAM array structure
Pitch problems TBD
Split into separate arrays: CAM tag array and SRAM data array.
A CAM can usually be described with pseudocode as
for every entry E in CAM array
    if match(E, Input.address) then
        select E
The CAM match function may take various forms:
- Equality CAM
match(E,Input.address) ::== E.address = Input.address
- Masked (ternary) CAM, entry mask applied to both sides
match(E,Input.address) ::== E.address & E.mask = Input.address & E.mask
- Masked CAM, entry mask applied to the input only
match(E,Input.address) ::== E.address = Input.address & E.mask
- Bitmask CAM
match(E,Input.address) ::== 0 != (E.address & Input.address)
- Range CAM, encoded form
match(E,Input.address) ::== E.address_lo <= Input.address < E.address_hi
- Interval CAM, encoded form
match(E,Input.address) ::== [E.lo,E.hi) INTERSECT [Input.lo, Input.hi)
- Bitmask CAMs lend themselves to a different implementation of range or interval CAMs.
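In software terms, the match variants above can be sketched as follows (a minimal Python model; the entry field names are illustrative, not from any particular design):

```python
# Minimal software models of the CAM match variants above.
# Each function models one entry's comparator; entries are simple
# objects with illustrative field names.

def match_equality(e, addr):
    # Equality CAM: exact compare.
    return e.address == addr

def match_masked(e, addr):
    # Masked/ternary CAM: compare only the bits where e.mask is 1.
    return (e.address & e.mask) == (addr & e.mask)

def match_bitmask(e, addr):
    # Bitmask CAM: match if the entry and input share any set bit.
    return (e.address & addr) != 0

def match_range(e, addr):
    # Range CAM, encoded form: half-open interval [lo, hi).
    return e.address_lo <= addr < e.address_hi

def match_interval(e, lo, hi):
    # Interval CAM, encoded form: [e.lo, e.hi) intersects [lo, hi).
    return e.lo < hi and lo < e.hi
```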
for every entry E in CAM array
    if match(E, Input.address) then assert E.matched
for all entries E that have E.matched asserted
    select one and only one according to some prioritization rule
Typical prioritization rules may be
- lowest physical entry number E
- entry E that has the lowest value of some other field, such as a timestamp
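Putting the match and the prioritization together, a priority CAM's select stage can be modeled as (a sketch; the prioritization rule is pluggable, and the names are illustrative):

```python
def priority_select(entries, addr, match, prio=None):
    """Return the single winning entry among all matches, or None.

    `match(e, addr)` is any of the match functions; `prio(i, e)` is the
    prioritization rule. By default the lowest physical entry number
    (list index) wins; pass e.g. prio=lambda i, e: e["timestamp"] to
    prioritize on some other field instead.
    """
    if prio is None:
        prio = lambda i, e: i  # lowest physical entry number wins
    candidates = [(i, e) for i, e in enumerate(entries) if match(e, addr)]
    if not candidates:
        return None
    return min(candidates, key=lambda ie: prio(ie[0], ie[1]))[1]
```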
Are CAMs prohibitively expensive?
For many years it has been a truism that CAMs are prohibitively expensive. This is not always true.
One basic difference: RAM decoders can share much logic, essentially via common subexpression elimination at the gate level. CAM decoders cannot share such logic in general, if the addresses that are to be CAMmed are completely independent of each other.
Similarly CAMs may be more power hungry, because every bit is potentially active at all times.
However, beyond these basic differences, CAMs have often earned a bad rap because of the prioritization logic in Priority CAMs - sometimes called a picker circuit. Non-priority CAMs can be significantly less expensive than priority CAMs.
Duality between CAMs and Software Data structures
CAMs are a hardware data structure.
In hardware, CAMs may be quite simple and efficient - if you neglect things like power and circuit area. At least we know how to build hardware CAMs, no matter what their cost.
In software, there is no directly equivalent data structure - at least not on a single-threaded Von Neumann machine. Arguably, one can build a CAM out of a MIMD parallel machine, with a separate processor dedicated to each CAM comparison, but we are not yet at a point where that is a reasonable approach.
Moreover, even hardware CAMs are, more and more, deprecated. So the duality described here also applies to hardware - alternative ways of implementing a hardware CAM.
However, there are certain well-known dualities between hardware CAMs and software data structures:
- Equality CAMs often map to software hash tables, with whatever convenient collision structure
- Ternary CAMs or Masked CAMs map to the sort of data structures that I used for mkIrecog - usually a multilevel tree dispatched on the largest high information content fields that can be found.
- Range CAMs map to ... TBD
- Priority CAMs map to ... TBD
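The equality-CAM duality is directly visible in software: an equality CAM behaves like a hash table keyed on the CAMmed field (a sketch using TLB-like names, which are illustrative):

```python
# An equality CAM whose entries map (virtual page -> physical page,
# permissions) is, in software, just a hash table keyed on the
# CAMmed field.
tlb = {}                        # virtual_page -> (physical_page, perms)
tlb[0x1234] = (0x9ABC, "rw")    # install an entry

def lookup(vpage):
    # Hardware: parallel compare against every entry at once.
    # Software: one hashed probe, amortized O(1).
    return tlb.get(vpage)       # None on a miss
```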
An encoded CAM compares the encoded address of each entry:

for every entry E of the CAM array:
    if E.address = Input.address then select entry

A decoded CAM compares decoded address bitmasks:

for every entry E of the CAM array:
    if E.decoded_address_bitmask & Input.decoded_address_bitmask then select entry
Where the Input.decoded_address_bitmask might be prepared by
Input.decoded_address_bitmask = 1 << Input.encoded_address, i.e. all bits 0 except for the single bit corresponding to the encoded address
i.e. it might be a one-hot.
However, a decoded CAM might also be indexed by multiple bits in the mask. I.e. it is a little easier to select multiple entries for output with a decoded CAM, whereas with an encoded CAM you would have to provide several encoded addresses (although a ternary CAM provides a limited degree of flexibility in matching by allowing some bits to be "don't care").
(In either case, the CAM may need a policy as to what happens when multiple entries in a CAM are selected - does it act like a Priority CAM or are the multiple selected entries combined by a logic function such as AND or OR?)
- Decoded CAM has slightly simpler logic per decoded address bit
- Encoded CAM requires what amounts to an XOR at every address bit for every entry
As mentioned above:
- Decoded CAM or bitmask CAM permits multiple entries to be addressed
- whereas an encoded CAM might require multiple encoded addresses
- number of address bits scales more slowly for encoded CAMs than for decoded CAMs.
Overall, decoded CAMs are fine for small addresses. Encoded CAMs are more efficient for large addresses.
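The encoded/decoded distinction can be sketched in software as follows (illustrative; a real decoded CAM would store the one-hot mask per entry):

```python
def encoded_match(entry_addr, input_addr):
    # Encoded CAM: per-entry equality compare - effectively an XOR
    # per address bit, per entry.
    return entry_addr == input_addr

def decoded_match(entry_mask, input_mask):
    # Decoded (bitmask) CAM: per-entry AND of the stored and input
    # bitmasks. With a one-hot input this is plain equality matching;
    # a multi-hot input selects several entries at once.
    return (entry_mask & input_mask) != 0

def one_hot(encoded_addr):
    # Prepare Input.decoded_address_bitmask from an encoded address:
    # all bits 0 except the bit corresponding to the encoded address.
    return 1 << encoded_addr
```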
May already have been discussed above.
- Caches: usually N-way associative or direct mapped
- of note: original StrongARM, with multiple fully associative banks. Rarely seen.
- Store-to-load forwarding: often a variant of Priority CAM, since a load must receive the data from the youngest store older than itself matching its address.
- sometimes a Range CAM or Interval CAM
- BTPs (branch target predictors) are good examples for CAM versus non-CAM data structures - good examples of many different possibilities
- See also CAM-indexed versus tagged arrays.
- detecting that writeback of a result enables dependent instructions is often done by an Encoded CAM or Decoded CAM, matching on renamed physical register numbers or result number (See Decoded versus Encoded CAMs for Instruction Scheduling.)
- in older machines, e.g. HPSm, without register renaming, these were Priority CAMs.
- picking instructions to schedule out of ready instructions involves prioritization, but in schedulers the CAMming to mark operands as ready is usually quite separate from the picking
- Bypass detection
- nearly any predictor can be originally imagined as a CAM indexed by history and/or location, e.g. branch from address. See Andy Glew's method for understanding branch predictors
- however, TnT (taken/not-taken) predictors are often so simple - basically 2-bit counters - that it is more efficient to build many of them in a simple hashed, direct mapped structure than it is to build a CAM for them. In fact, often these are untagged hashed direct mapped structures. However, a few bits of tag occasionally help such direction predictors.
- I have worked on branch predictors with variable length history using mask CAMs - however, it is quite difficult to make these area efficient. The algorithms work, but the hardware is expensive. Nevertheless, a good thinking tool.
- Branch identifiers and branch target buffers more often benefit from tags. The usual tradeoffs for associativity apply.
- Glew opinion: untagged hashed direct mapped may be fastest, if the speed of an L0 BTB is needed. (But usually an L0 BTB is not needed, given techniques for tolerating predicted taken branch latency.) Tagged direct mapped or highly associative structures may be next fastest - an L1 BTB, if you will. L2 BTBs may use more general techniques, such as N-way associativity.
- TLBs: often fully associative
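The scheduler wakeup CAMming mentioned above - broadcast the written physical register number, mark matching source operands ready, and pick separately - can be sketched as (a toy model; all field names are illustrative):

```python
# Toy model of scheduler wakeup: on writeback, broadcast the produced
# physical register number; every waiting entry CAMs its source tags
# against the broadcast and marks matching operands ready.

def wakeup(entries, written_preg):
    for e in entries:
        for i, (tag, ready) in enumerate(e["sources"]):
            if not ready and tag == written_preg:
                e["sources"][i] = (tag, True)

def ready_entries(entries):
    # Picking (prioritization among ready entries) is a separate step
    # from the wakeup CAMming above.
    return [e for e in entries if all(r for _, r in e["sources"])]
```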
- Content Addressable Memory (CAM)