Andy Glew's comp-arch.net wiki, http://semipublic.comp-arch.net

If you are reading this elsewhere, e.g. at site waboba.info, it is an unauthorized copy, and probably a malware site.
comp-arch.net wiki on hold from October 17, 2011

Dynamic instruction rewriting

From CompArch

Jump to: navigation, search

Contents

Dynamic Instruction Rewriting (Glew 1998-2000, unpublished) =

Dynamic instruction rewriting is a microarchitecture optimization that I, Andy Glew, came up with while I was a student at the University of Wisconsin, 1996-2000. Undoubtedly others have had similar ideas: I know for sure after (TBD, publication) and suspect before. I will describe my version here (TBD add and compare other versions).

    I often refer to this optimization as simply flattening, since that it what it does to the dataflow graph. Dynamic instruction rewriting was one of several optimizations I grouped as instruction refinement for the PhD I never completed.


Dynamic instruction rewriting does what it says - it rewrites instructions, in particular, flattening the dataflow graph of a computation, in hardware. It does this much as a compiler would do, but goes beyond a compiler's capabilities in that it performs these optimizations across basic blocks, procedures, infrequently taken branches, and other boundaries that might impede a compiler based optimization. (In this, I also include "compiler-like optimizations" performed inside the instruction cache.)

My flavor of dynamic instruction rewriting is based on letting instructions flow around a microarchitecture much as values might.

It can be performed in several places in the pipeline, in particular in the instruction renaming pipestage, and/or in actual execution. Also possibly as a post-retirement optimization.

The quintessential example dynamic instruction rewriting is this: imagine an instruction sequence that increments a register. (Actually, we will change the registers for didactic purposes, as might happen with simple loop unrolling.)

  ...
  r2 := r1 + 1
  ...
  r1 := r2 + 1
  ...
  r2 := r1 + 1
  ...

Let us discuss the renamer pipestage optimization.

The renamer or map pipestage has a table indexed by logical register number. In a simple in-orrder scoreboarded machine, the entries of this table may be single bit indicating whether a register value is ready or not; in a P6 or HaRRM style machine, the entries of this table may also contain physical register numbers that the logical registers are mapped to, which are scheduled later; in a future file or active register file, the actual values may also be in this table.

In renamer pipestage dynamic register rewriting, the table also includes the instruction, the actual instruction, that writes the result. Or at least a subset thereof (the subset that can be optimized in this manner).

Considering the above code sequence, let us indicate the value of a P6 or HaRRM style register renamer, mapping logical to physical register numbers The actual physical register numbers do not matter, and could be random - whatever preg is free next.

  ...
    r1 -> pr11
    r2 -> pr22
  r2 := r1 + 1
    r1 -> pr11
    r2 -> pr33
  ...
  r1 := r2 + 1
    r1 -> pr44
    r2 -> pr33
  ...
  r2 := r1 + 1
    r1 -> pr44
    r2 -> pr55
  ...

Now let us augment with the actual instruction producing the result - but without performing the optimization


  ...
    r1 -> pr11
    r2 -> pr22
  r2 := r1 + 1
    r1 -> pr11   
    r2 -> pr33   pr33 := pr11 + 1
  ...
  r1 := r2 + 1
    r1 -> pr44   pr44 := pr33 + 1
    r2 -> pr33   pr33 := pr11 + 1
  ...
  r2 := r1 + 1
    r1 -> pr44   pr44 := pr33 + 1
    r2 -> pr55   pr55 := pr44 + 1

Now let us imagine that renamer logic is enabled to substitute the input "value" of a register - the instruction - in its output expression

  ...
    r1 -> pr11
    r2 -> pr22
  r2 := r1 + 1
    r1 -> pr11   
    r2 -> pr33   pr33 := pr11 + 1
  ...
  r1 := r2 + 1
    r1 -> pr44   pr44 := pr33 + 1 == (pr11 + 1) + 1
    r2 -> pr33   pr33 := pr11 + 1
  ...
  r2 := r1 + 1
    r1 -> pr44   pr44 := pr33 + 1
    r2 -> pr55   pr55 := pr44 + 1

When the dynamic instruction rewriting logic sees a value of the form

  (pr11 + 1) + 1

it performs the associative and commutative transformations, to produce optimizations

  => pr11 + 1 + 1
  => pr11 + 2

and overall

  ...
    r1 -> pr11
    r2 -> pr22
  r2 := r1 + 1
    r1 -> pr11   
    r2 -> pr33   pr33 := pr11 + 1
  ...
  r1 := r2 + 1
    r1 -> pr44   pr44 := pr33 + 1 == (pr11 + 1) + 1 == pr11 + 2
    r2 -> pr33   pr33 := pr11 + 1
  ...
  r2 := r1 + 1
    r1 -> pr44   pr44 := pr33 + 1 == pr11+2
    r2 -> pr55   pr55 := pr44 + 1 == pr11+3

and so on.


Actually, nothing in the above description depends on the dynamic instruction rewriting logic being in the renamer pipestage.

If it is in the renamer pipestage, and there is no additional delay for the optimization, we would never see a multi-stage transformation

 pr55:=pr44+1 == (pr33+1)+1 == ((pr11+2)+1) == pr11+3

i.e. if the dynamic instruction rewriting logic at the renamer pipestage were fast enough, only the optimized version would propagate.


However, if the dynamic instruction rewriting logic were in the out-of-order execution system, we might see such multistage transformations. (Also, if renamer pipestage logic had a delay.)

Indeed, we might place unrewritten instructions into the out-of-order scheduler, and trigger a rewrite when one of the operands arrives - but not the other, without waiting for the second or other operands to arrive. I.e. this dynamic instruction rewriting optimization can take advantage of operand inter-arrival time skew, indeed, of dynamic operand inter-arrival time skew, when the operand arrival times cannot be estimated in advance.

(Another class of optimizations can be performed when you can estimate operand inter-arrival time skew.)

Consider

   ...
   r1 := load ( ... ) // cache  miss, unpredictably randomized arrival time
   r2 := load ( ... ) // cache  miss, unpredictably randomized arrival time
   r3 := load ( ... ) // cache  miss, unpredictably randomized arrival time
   r4 := r1 + r2
   r5 := r4 + r3
   ...

Let all of these operations be placed into an out-of-order scheduler. (The renamer for this is trivial.)

Now let us assume that the r3 load completes early, r3:=C

   ...
   r1 := load ( ... ) // cache  miss, unpredictably randomized arrival time
   r2 := load ( ... ) // cache  miss, unpredictably randomized arrival time
   COMPLETE C <= r3 := load ( ... ) // cache  miss, unpredictably randomized arrival time
   r4 := r1 + r2
   r5 := r4 + (r3=C)
   ...

Now let us assume that the r2 load completes early, r2:=B

   ...
   r1 := load ( ... ) // cache  miss, unpredictably randomized arrival time
   COMPLETE B <= r2 := load ( ... ) // cache  miss, unpredictably randomized arrival time
   COMPLETE C <= r3 := load ( ... ) // cache  miss, unpredictably randomized arrival time
   r4 := r1 + (r2=B)
   r5 := r4 + (r3=C)
   ...

Then let us "fire" the partially complete instruction

   r4 := r1 + (r2=B)

producing


   ...
   r1 := load ( ... ) // cache  miss, unpredictably randomized arrival time
   COMPLETE B <= r2 := load ( ... ) // cache  miss, unpredictably randomized arrival time
   COMPLETE C <= r3 := load ( ... ) // cache  miss, unpredictably randomized arrival time
   r4 := r1 + (r2=B)
   r5 := r4 + (r3=C) == (r4=(r1+(r2=B))+(r3=C))== r1 + (B+C)
   ...

where B+C is a constant that can be evaluated, and written into the scheduler entry for r5:=.

Therefore, when r1 completes, both

  r4 := r1 + B

and

  r5 := r1 + (B+C) 

can be evaluated in dataflow graph depth 1 - rather than the two if not rewritten.

Fibonacci anecdote

The first time I implemented execution pipestage dynamic instruction rewriting, I was tickled to see that the latencies for a particular setup produced an inverse Fibonacci sequence.

(TBD: record the configuration that gave this rather pleasant result.)

Types of Dynamic Instruction Rewriting Optimizations

Commutative-associative add operations

The quintessential dynamic instruction rewriting optimization involves commutative associative operators such as add.

   rN := (rL + C1) + C2 => rM + (C1+C2)
   rN := (rL - C1) - C2 => rM - (C1-C2)
   rN := C2 - (C1 - rL) => (C2-C1)+rL

etc.

Unfortunately, only wrap around integer arithmetic is commutative associative in this manner. Neither floating point, nor the sort of saturated arithmetic common in multimedia is.

We can treat saturation as an even later propagating condition: just as an instruction may fire early, for dynamic instruction rewriting with partial information, and later, with all register values, in can fire even later, when saturation is detected. I.e. saturation can become a replay condition, much like cache misses have been. (And care must be taken to avoid replay tornadoes, since replays would be even more common.)

I think that floating point intermediate rounding could be handled in the same way - but I must admit, I have not worked out the details to my satisfaction. My current best idea is to propagate floating point values across the optimized dataflow graph, and to verify the result after rounding is done in the proper order. I.e. intermediate rounding differences also become a replay condition. However, it is probably wise not to nuke a pipeline on such a replay - it is entirely likely that the ULP may diverge for one stage of a dataflow graph, and reconverge by the next. I.e. incremental replay is desirable. Instantaneous replay may waste work for FP intermediate rounding.

Logic operations

Well known properties

   rN := (rL OR C1) OR C2 => rM + (C1|C2)

etc

Shift count

Well known properties

   rN := (rL >>C1) >> C2 => rM >> (C1+C2)

but

   rN := (rL >>C1) <<C2 != rL >> (C1-C2)

rather, it corresponds to a particular bitfield extraction operation

Ironically, rotates are more easily handled in such optimizations - implying we can speed up cryptography.

Dependent shifts
   rN := C2 >> (C1 >> r2)

I do not know of an expression - but can imagine deep shift networks. Not worth doing unless common - and most shits have fixed shift count.

Load Load
   rN := load( load(rL+C1) + C2 )

This is the sort of operation that is cached in a pointer cache , particularly if C1=C2

Store to load forwarding
   rN := load( store(rL+C1,val) )

Two comments:

(1) the biggest benefit from dynamic instruction rewriting comes when it is used in conjunction with a store-to-load forwarding predictor, allowing the dataflow graph to be flattened even when it is not contained in the register file.

(2) dynamic instruction rewriting actually provides a pretty damned good form of store to load disambiguation. This term is used because it is stringer than store to load forwarding predictor or memory dependency predictor.

If the store to load forwarding comparison is dome 'symbolically", using the instruction bits that are propagated in dynamic instruction rewriting, then when it detects a match there is guaranteed to be such a match. However, if the "symbols" do not match, and involve different registers, then a match may still occur. It may be possible to detect independence for certain forms of "flattened" symbolic addresses, for example

    r1+C1  ?=? r1 + C2  => mismatch if abs(C1-C2) > data_size  

- forgetting the possibility of virtual address aliasing - which can be added to the match logic, but which is rare enough that it may better be treated as a replay condition.


Instruction Repertoire

I usually considered dynamic instruction rewriting optimizing from and to the same instruction of microinstruction format: e.g. 2-input, 1-output uops received and optimized to.

There are some interesting possibilities when the output set is different than the input set. E.g. the output set might have more complicated operations than you would consider for a long lived instruction set design.


See Also

Personal tools
No more shadowing