Andy Glew's comp-arch.net wiki, http://semipublic.comp-arch.net
If you are reading this elsewhere, e.g. at site waboba.info, it is an unauthorized copy, and probably a malware site.
comp-arch.net wiki on hold from October 17, 2011
Wide operand cache
The wide operand cache is a concept originally from Microunity. The motivating situation:
- you want to design a RISC machine, with "natural" register widths of 32 or 64 bits.
- or even 1024 bit registers - no matter how wide you go, there will nearly always be a reason to have significantly wider operands.
- but you also want to support "simple" instructions that just happen to have very wide operands.
See #Examples of instructions with wide operands. For the purposes of this discussion, we will consider BMM (bit matrix multiply). BMM wants to have one, say, 64 bit operand, and a second operand, the matrix, that is 64x64 bits in size - 4Kib.
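To make the operand sizes concrete, here is a minimal reference model of BMM. It is a sketch under assumptions that the text above does not fix: the product is taken over GF(2) (AND for multiply, XOR for add), the matrix is stored as 64 rows packed into integers, and the function name bmm and the row layout are chosen for illustration only.

```python
N = 64  # vector width in bits; the matrix is N x N bits = 4096 bits (4 Kib)

def bmm(vec, matrix):
    """Multiply an N-bit row vector by an N x N bit matrix over GF(2).

    Assumed layout: matrix[i] is row i packed into an int. Bit j of the
    result is the XOR over all i of (bit i of vec AND bit j of matrix[i]),
    which is the same as XOR-ing together the rows selected by vec.
    """
    assert len(matrix) == N
    result = 0
    for i in range(N):
        if (vec >> i) & 1:
            result ^= matrix[i]
    return result & ((1 << N) - 1)

# The identity matrix leaves the vector unchanged.
identity = [1 << i for i in range(N)]
unchanged = bmm(0xDEADBEEF, identity)
```

The narrow operand and the result each fit in one ordinary 64-bit register, while the matrix needs 64 such words.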
Defining an instruction such as
BMM destreg64b := srcreg64b * srcreg64x64b
could be possible - but it has many issues:
- how many 64x64=4Kib operand registers do you want to define? 1? 2? 4?
- you probably do not want to copy such a wide operand around - instead, you might want it to live inside its execution unit
The wide operand cache approach is as follows: define an instruction with a memory operand
BMM destreg64b := srcreg64b * M[...]
- You may specify the memory operand simply as register direct, M[reg], or you may define it using a normal addressing mode such as M[basereg+offset], or even using scaled indexing.
Conceptually, the wide operand is loaded before every use:
BMM destreg64b := srcreg64b * M[addr]
    tmpreg64x64b := load M[addr]
    destreg64b := srcreg64b * tmpreg64x64b
However, we may avoid unnecessary memory traffic by caching the wide operand.
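The saving can be sketched with a toy software model. Everything here is hypothetical: the class name WideOperandCache, the dictionary standing in for memory, the LRU replacement policy and the entry count are assumptions made for illustration, not a description of any real implementation.

```python
from collections import OrderedDict

N = 64  # bits per vector; each wide operand is N rows of N bits

def bmm(vec, matrix):
    """GF(2) vector-matrix product (assumed semantics): XOR the selected rows."""
    result = 0
    for i in range(N):
        if (vec >> i) & 1:
            result ^= matrix[i]
    return result

class WideOperandCache:
    """Toy model: a few wide operands kept next to the execution unit."""

    def __init__(self, memory, entries=2):
        self.memory = memory        # hypothetical backing store: addr -> N row ints
        self.entries = entries      # number of wide operand cache entries
        self.cache = OrderedDict()  # addr -> cached matrix, least recently used first
        self.loads = 0              # counts 4 Kib loads from memory

    def fetch(self, addr):
        if addr in self.cache:
            self.cache.move_to_end(addr)    # hit: no memory traffic
            return self.cache[addr]
        self.loads += 1                     # miss: load the whole wide operand
        matrix = list(self.memory[addr])
        self.cache[addr] = matrix
        if len(self.cache) > self.entries:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return matrix

    def bmm(self, vec, addr):
        """BMM destreg := srcreg * M[addr], served from the cache when possible."""
        return bmm(vec, self.fetch(addr))

memory = {0x1000: [1 << i for i in range(N)]}   # identity matrix at a made-up address
woc = WideOperandCache(memory)
results = [woc.bmm(v, 0x1000) for v in (1, 2, 3)]
```

In this model three executions of the instruction cause a single load of the matrix; without the cache each execution would load all 4 Kib again.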
This leads to the basic issue: #TLB semantics versus coherent.
TLB semantics versus coherent
The wide operand cache concept admits a basic question:
- do you snoop the wide operand cache, keeping it coherent with main memory, so that if somebody writes to a cached wide operand it is reflected in the next execution of the instruction?
- or do you use "TLB semantics", i.e. make it a noncoherent cache and require the user to invalidate it explicitly (which requires an instruction that invalidates the wide operand cache(s))?
If coherent then an implementation can vary the number of wide operand cache entries transparently.
If noncoherent, then not only can the number of entries be detected, but, if the backing store for the wide value is changing, you also run the risk of getting different answers depending on the context switching rate.
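The difference between the two choices can be shown with a small simulation. This is a sketch under stated assumptions: the Machine class, its one-entry cache, whole-operand stores, and the idea that a context switch flushes the wide operand cache are all hypothetical modelling choices, and BMM is again taken over GF(2).

```python
N = 64

def bmm(vec, matrix):
    """GF(2) vector-matrix product (assumed semantics): XOR the selected rows."""
    result = 0
    for i in range(N):
        if (vec >> i) & 1:
            result ^= matrix[i]
    return result

class Machine:
    """Toy model: memory plus a one-entry wide operand cache."""

    def __init__(self, coherent):
        self.coherent = coherent
        self.memory = {}
        self.cached_addr = None
        self.cached_matrix = None

    def store(self, addr, matrix):
        self.memory[addr] = list(matrix)
        if self.coherent and addr == self.cached_addr:
            self.invalidate()           # snoop hit: drop the stale copy

    def invalidate(self):
        """Explicit invalidate, also used here to model a context switch."""
        self.cached_addr = None
        self.cached_matrix = None

    def bmm(self, vec, addr):
        if addr != self.cached_addr:    # miss: load the wide operand
            self.cached_addr = addr
            self.cached_matrix = list(self.memory[addr])
        return bmm(vec, self.cached_matrix)

identity = [1 << i for i in range(N)]
zero = [0] * N

def run(coherent, context_switch):
    m = Machine(coherent)
    m.store(0x1000, identity)
    first = m.bmm(0b1011, 0x1000)
    m.store(0x1000, zero)               # somebody rewrites the wide operand
    if context_switch:
        m.invalidate()                  # assumption: a context switch flushes the cache
    second = m.bmm(0b1011, 0x1000)
    return first, second
```

In this model the coherent machine gives the same pair of results with or without the context switch. The noncoherent machine returns the stale product when no switch intervenes and the fresh one when it does, which is the dependence on context switching rate described above.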
Glew opinion: I prefer coherent, but would live with noncoherent if necessary.
Frustrating anecdote: at AMD I was working out, with Alex Klaiber, how to support instructions with very wide operands, such as BMM. My whiteboard was full of scribblings, centered on the basic question of #TLB semantics versus coherent.
I then went to a meeting where Microunity presented their patents.
So close... Actually, probably off by 5-10 years. But I was following a path that I did not know had been trailblazed...
Examples of instructions with wide operands
Any instruction that has an operand that is significantly wider than some of its inputs, and which either tends to be constant, or which tends to be modified in place, is a candidate for a wide operand in memory implemented via a wide operand cache. For that matter, one could have wide operand instructions whose operands are all pseudo-registers in a wide operand cache.
- vector-matrix instructions - an N-element vector times an NxN matrix
- BMM (bit matrix multiply) is a special case for single-bit values, where the vector might fit in an ordinary, e.g., 64-bit, register
- permutation index vector instruction - the wide operand is Nbits*log2(Nbits) bits
- permutation bit matrix instruction - really a form of BMM
- superaccumulator - 32 bit - thousands ...
- regex instructions - with large regex operands that can be compiled
- LUT (lookup table) instructions
- texture sampling
- interpolation instructions
- CAM instructions
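As an illustration of the remark that a permutation bit matrix instruction is really a form of BMM, the sketch below builds a permutation matrix from an index vector and applies it with a GF(2) bit matrix multiply. The helper names, the row-major layout and the choice of bit reversal as the example are assumptions made for illustration.

```python
N = 64

def bmm(vec, matrix):
    """GF(2) vector-matrix product (assumed semantics): XOR the selected rows."""
    result = 0
    for i in range(N):
        if (vec >> i) & 1:
            result ^= matrix[i]
    return result

def permutation_matrix(perm):
    """Row i has a single 1 in column perm[i]: input bit i moves to output bit perm[i]."""
    assert sorted(perm) == list(range(N))   # perm must be a permutation of 0..N-1
    return [1 << perm[i] for i in range(N)]

# Example: reverse the order of the 64 bits.
bit_reverse = permutation_matrix([N - 1 - i for i in range(N)])
x = 0b1011
y = bmm(x, bit_reverse)                     # bits 0, 1, 3 move to bits 63, 62, 60

# Operand sizes for N = 64: the index vector needs N * log2(N) = 64 * 6 = 384 bits,
# while the equivalent permutation matrix needs N * N = 4096 bits.
index_vector_bits = N * (N.bit_length() - 1)
matrix_bits = N * N
```

Both encodings are far wider than a 64-bit register, which is why either could be a candidate for a wide operand held in memory and cached.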