Conversion from Tomasulo to Scoreboards

See discussion (1) and discussion (2)

This page aids and assists in understanding the full functional equivalence of a Scoreboard-based design when compared to a Tomasulo algorithm. However it is extremely important to note that the Academic literature, by focussing exclusively on the published patent covering Q-Tables, is hopelessly inaccurate, factually incorrect and completely misleading.

By only comparing Q-Tables against the entirety of the Tomasulo algorithm, this is equivalent to narrowly focussing on the Reorder Buffer of Tomasulo, excluding all else, and concluding that a design that uses a ROB and therefore the entire Tomasulo algorithm itself is incapable of useful high-performance out-of-order execution.

This article helps readers to understand that Q-Tables != Scoreboards, by describing a series of functionally-equivalent transformations that, when followed, turn the Tomasulo algorithm into a Scoreboard-based design. It also highlights that, following that transformation, multi-issue execution is near-trivial to add by comparison. Precise exception handling is also trivial to add (holding of write commits) and is described in the 6600scoreboard page under "Shadowing"

On Saturday, May 16, 2020, Yehowshua yimmanuel3@gatech.edu wrote:

This is a very intricate and complicated subject matter for sure.

yes, except it doesn't have to be. the actual https://en.wikipedia.org/wiki/Levenshtein_distance between Tomasulo and 6600 really is not that great.

i thought it would be fun to use a new unpronounceable word i learned yesterday :)

At some point, it be great to really break things down and make them more accessible.

yes. it comes down to time.

start with this.

  1. Begin from Tomasulo.

  2. Make sure to add an Operand Forwarding Bus. this is critical to providing the functionality provided by the Tomasulo Common Data Bus.

    note (later) that multiple Op Fwd Buses may be conveniently added as parallel data paths without severe design penalties.

  3. Start by only allowing one row per Reservation Station.

  4. Expand the number of RSes so that if you were to count the total number of places operands are stored, they are the same.

    (another way to put this is, "flatten all 2D RSes into 1D")

  5. where pipelines were formerly connected exclusively to one RS, preserve those connections even though the rows are now 1D flattened.

    (another way to put this is: we have a global 1D naming scheme to reference the operand latches rather than a 2D scheme involving RS number in 1 dimension and the row number in the 2nd)

  6. give this 1D flattening an UNARY numbering scheme.

  7. make the size of the Reorder Buffer EXACTLY equal to the number of 1D flattened RSes.

  8. rename RSes to "Function Units" (actually in Thornton's book the phrase "Computation Units" is used)

    thus, at this point in the transformation, the ROB row number IS the Function Unit Number, the need to actually store the ROB # in the Reservation Station Row is REMOVED, and consequently the Reservation Stations are NO LONGER A CAM.

  9. give all register file numbers (INT FP) an UNARY numbering.

    this means that in the ROB, updating of register numbers in a multi-issue scenario is a matter of raising one of any number of single bits. contrast this in the Tomasulo to having to multi-port the SRAM in the ROB, setting multiple bits even for single-issue (5-bits for 32-bit reg numbering).

  10. add "Shadowing" capability to each Function Unit and create a Shadow Matrix (appx 20 gates per Function Unit)

    the "Shadow" capability hooks into the WRITE-COMMIT phase of every Function Unit, permitting it to EXECUTE but prohibiting it from WRITING the result of that execution until explicitly permitted to do so.

  11. Upgrade the CDB from a multi-fan-in, multi-fan-out, single resource global choke-point to separate (multiple, if desired) read-fanout broadcast and write-fan-in register data broadcast buses.

Post-transformation Analysis

with the ROB now having rows of bitvectors, it is now termed a "Matrix".

the left side of the ROB, which used to contain the RS Number in unary, now contains a bitvector Directed Acyclic Graph of the FU to FU dependencies, and is split out into its own Matrix.

this we call the FU-FU Dependency Matrix.

therefore, where previously, it was the ROB Row binary number that preserved instruction order (as an inherent DAG through sequential cyclically-incremented numbering), the 2-D bit-level FU-FU matrix preserves the same DAG by way of single-bit cells that express FU-to-FU dependencies, creating a hardware-form of a software "linked list".

the remainder of the "ROB" contains the register numbers in unary Matrix form, and with each row being directly associated with a Function Unit, we now have an association between FU and Regs which preserves the knowledge of what instruction required which registers, and who will produce the result.

this we call the FU-Regs Dependency Matrix.

that really is it.

take some time to absorb the transformation which not only preserves absolutely every functional aspect of the Tomasulo Algorithm, it drastically simplifies the implementation, reduces gate count, reduces power consumption and provides a strong foundation for doing arbitrary multi-issue execution with only an O(N) linear increase in gate count to do so.

further hilariously simple additional transformations occur to replace former massive resource constrained bottlenecks, due to the binary numbering on both ROB numbers and Reg numbers, with simple large unary NOR gates:

  • the determination of when hazards are clear, on a per register basis, is a laughably trivial NOR gate across all columns of the FU-REGs matrix, producing a row bitvector for each read register and each write register.

  • the determination of when a Function Unit may proceed is a laughably trivial NOR gate across all rows of the FU-FU Matrix, producing a row-based vector, determining that it is "readable" if there exists no write hazard and "writable" if there exists no read hazard.

  • the Tomasulo Common Data Bus, formerly being a single chokepoint binary-addressing global Bus, may now be upgraded to MULTIPLE Common Data Buses that, because the addressing information about registers is now in unary, is likewise laughably trivial to use cascading Priority Pickers (a nmigen PriorityEncoder and Decoder, back-to-back) to determine which Function Unit shall be granted access to which CDB in order to receive (or send) its operand (or result).

  • multi-issue as i mentioned a few times is an equally laughably trivial matter of transitively cascading the Register Dependency Hazards (both read and write) across future instructions in the same multi issue execution window. instr2 has instr1 AND instr2's hazards. instr3 has instr1 AND instr2 AND instr3's hazards and so on. this just leaves the necessity of increasing register port numbers, number of CDBs, and LD/ST memory bandwidth to compensate and cope with the additional resource demands that will now occur.

the latter is particularly why we have a design that, ultimately, we could take on ARM, Intel, and AMD.

there is no reason technically why we could not do a 4, 6 or 8 multi issue system, and with enough Function Units and the cyclic buffer system (so as not to require a full crossbar at the Common Data Buses), and proper stratification and design of the register files, massive Vector parallelism at the pipelines would be kept fully occupied without an overwhelming increase in gates or power consumption that would normally be expected, and scalar performance would be similarly high as well.

Terminology notes

These terms help understand that conceptually there is no difference in the capabilities of Tomasulo and Scoreboards.

Tomasulo name Scoreboard name
Precise Exceptions Precise-capable ("Shadowed") Scoreboard
ROB index cycling order FU-FU DAG that preserves instruction order
Reorder Buffer hybrid of Shadow, FU-FU and FU-Regs Matrices
Reservation Station CAMs RS Row = "Computation Unit latches" (no CAM)
"register renaming" "nameless" registers (Comp Unit latches)
part-ROB, part-RS Q-Tables
blocking Common Data Bus fan-out Read Reg, fan-in Write, OpFwd Bus(es)
Centralised regfile(s) Centralised regfile(s)