Comparing the 6600-derived architecture to the traditional register-renaming/OoO architecture

One critical difference between the 6600-derived architecture and traditional register-renaming OoO speculative processors is the rate of writes to a single ISA-level register. In the 6600-derived architecture, writes to any one particular ISA-level register max out at 1 per clock cycle (without special measures to improve that). The register-renamed version can easily handle multiple such writes per clock cycle, since they are spread out across multiple physical registers.

(Note from lkcl: 6600 Reservation Stations are "register-renaming" stations. Unlike in the Tomasulo Algorithm, they are just not given "names", because Cray and Thornton solved a problem they didn't realise everyone else would have. See tomasulo transformation; however, further investigation shows that this may be WaW hazard related.)

The following diagrams assume that fetch, decode, branch prediction, and register renaming can each handle 4 instructions per clock cycle (typical of Intel's processors for many generations). They also assume that ldu can write the address register after 1 clock cycle of execution and the destination register after 4 clock cycles of execution (achievable by splitting it into 2 separate micro-ops).

The following C program is used:

#include <stdint.h>

void f(uint64_t *r3, uint64_t r4) {
    uint64_t ctr, r9;
    ctr = r4;
    do {
        r9 = *++r3;
        r9 += 100;
        *r3 = r9;
    } while(--ctr != 0);
}

See on Compiler Explorer

It produces the following Power instructions (edited for style):

    mtctr r4
.L2:
    ldu r9, 8(r3)
    addi r9, r9, 100
    std r9, 0(r3)
    bdnz .L2
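The ldu instruction does two things: it loads a doubleword and it writes the effective address back to the base register. The diagrams assume it is split into an address micro-op and a memory micro-op. A minimal C sketch of that split is given here. The names regs_t, ldu_addr_uop and ldu_mem_uop are hypothetical and chosen only for illustration, and memory is modelled as a flat byte array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the two registers that "ldu r9, 8(r3)" touches. */
typedef struct {
    uint64_t r3; /* base register, updated with the effective address */
    uint64_t r9; /* load destination */
} regs_t;

/* Address micro-op: EA = r3 + displacement, then r3 = EA.
 * This result is available early (after 1 cycle in the diagrams). */
static uint64_t ldu_addr_uop(regs_t *regs, int64_t disp)
{
    uint64_t ea = regs->r3 + (uint64_t)disp;
    regs->r3 = ea;
    return ea;
}

/* Memory micro-op: r9 = 64-bit load from EA (host byte order in this
 * sketch). This result arrives later (after 4 cycles in the diagrams). */
static void ldu_mem_uop(regs_t *regs, const uint8_t *mem, uint64_t ea)
{
    uint64_t value;
    assert(mem != NULL);
    memcpy(&value, mem + ea, sizeof value);
    regs->r9 = value;
}
```

Calling ldu_addr_uop followed by ldu_mem_uop with the returned address has the same architectural effect as one ldu. The split only matters for timing, since consumers of r3 can proceed without waiting for the load data.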

Register Renaming

Renamed hardware registers are named h0, h1, h2, ...

The syntax ldu h7, 8(h5 -> h8) will be used to mean that the address read comes from h5 and the address write goes to h8.

The register rename table starts out as follows:

r3 r4
h0 h1
ISA-level instruction Num Renamed Instruction 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
mtctr r4 #0 mtctr h1 Fetch Decode Ex: Rd h1 Ex: Wr ctr Retire
ldu r9, 8(r3) #1 ldu h2, 8(h0 -> h3) Fetch Decode Ex: Rd h0 Ex: Wr h3 Ex Ex: Wr h2 Retire
addi r9, r9, 100 #2 addi h4, h2, 100 Fetch Decode Wait: h2 Wait: h2 Wait: h2 Ex: Rd h2 Ex: Wr h4 Retire
std r9, 0(r3) #3 std h4, 0(h3) Fetch Decode Wait: h3 and h4 Wait: h4 Wait: h4 Wait: h4 Ex: Rd h3 and h4 Ex Ex Retire
bdnz .L2 #4 bdnz .L2 Fetch Decode Ex: Rd ctr Ex: Wr ctr Wait: Retire Wait: Retire Wait: Retire Wait: Retire Retire
ldu r9, 8(r3) #5 ldu h5, 8(h3 -> h6) Fetch Decode Ex: Rd h3 Ex: Wr h6 Ex Ex: Wr h5 Wait: Retire Retire
addi r9, r9, 100 #6 addi h7, h5, 100 Fetch Decode Wait: h5 Wait: h5 Wait: h5 Ex: Rd h5 Ex: Wr h7 Retire
std r9, 0(r3) #7 std h7, 0(h6) Fetch Decode Wait: h6 and h7 Wait: h7 Wait: h7 Wait: h7 Ex: Rd h6 and h7 Ex Ex Retire
bdnz .L2 #8 bdnz .L2 Fetch Decode Ex: Rd ctr Ex: Wr ctr Wait: Retire Wait: Retire Wait: Retire Wait: Retire Wait: Retire Retire
ldu r9, 8(r3) #9 ldu h8, 8(h6 -> h9) Fetch Decode Ex: Rd h6 Ex: Wr h9 Ex Ex: Wr h8 Wait: Retire Wait: Retire Retire
addi r9, r9, 100 #10 addi h10, h8, 100 Fetch Decode Wait: h8 Wait: h8 Wait: h8 Ex: Rd h8 Ex: Wr h10 Wait: Retire Retire
std r9, 0(r3) #11 std h10, 0(h9) Fetch Decode Wait: h9 and h10 Wait: h10 Wait: h10 Wait: h10 Ex: Rd h9 and h10 Ex Ex Retire
bdnz .L2 #12 bdnz .L2 Fetch Decode Ex: Rd ctr Ex: Wr ctr Wait: Retire Wait: Retire Wait: Retire Wait: Retire Wait: Retire Retire
ldu r9, 8(r3) #13 ldu h11, 8(h9 -> h12) Fetch Decode Ex: Rd h9 Ex: Wr h12 Ex Ex: Wr h11 Wait: Retire Wait: Retire Retire
addi r9, r9, 100 #14 addi h13, h11, 100 Fetch Decode Wait: h11 Wait: h11 Wait: h11 Ex: Rd h11 Ex: Wr h13 Wait: Retire Retire
std r9, 0(r3) #15 std h13, 0(h12) Fetch Decode Wait: h12 and h13 Wait: h13 Wait: h13 Wait: h13 Ex: Rd h12 and h13 Ex Ex Retire
bdnz .L2 #16 bdnz .L2 Fetch Decode Ex: Rd ctr Ex: Wr ctr Wait: Retire Wait: Retire Wait: Retire Wait: Retire Wait: Retire Retire
ldu r9, 8(r3) #17 ldu h14, 8(h12 -> h15) Fetch Decode Ex: Rd h12 Ex: Wr h15 Ex Ex: Wr h14 Wait: Retire Wait: Retire Retire
addi r9, r9, 100 #18 addi h16, h14, 100 Fetch Decode Wait: h14 Wait: h14 Wait: h14 Ex: Rd h14 Ex: Wr h16 Wait: Retire Retire
std r9, 0(r3) #19 std h16, 0(h15) Fetch Decode Wait: h15 and h16 Wait: h16 Wait: h16 Wait: h16 Ex: Rd h15 and h16 Ex Ex Retire
bdnz .L2 #20 bdnz .L2 Fetch Decode Ex: Rd ctr Ex: Wr ctr Wait: Retire Wait: Retire Wait: Retire Wait: Retire Wait: Retire Retire
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
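The hardware register numbers in the Renamed Instruction column follow a simple rule: every destination gets a fresh hardware register, and every source reads whatever the rename table currently maps to. A minimal C sketch of that rule for one loop body is given here. All type and function names are hypothetical, the allocation order (load destination before updated base) is inferred from the table, and ctr is left unrenamed as in the table.

```c
#include <assert.h>

enum { NUM_ISA_REGS = 32 };

/* Hypothetical rename table: ISA register -> current hardware register. */
typedef struct {
    int map[NUM_ISA_REGS]; /* -1 means "not mapped yet" */
    int next_free;         /* next unused hardware register number */
} rename_table_t;

/* Hardware registers chosen for one ldu/addi/std loop body. */
typedef struct {
    int ldu_dst, ldu_base_rd, ldu_base_wr;
    int addi_dst, addi_src;
    int std_data, std_base;
} renamed_iter_t;

/* Initial state from the text: r3 -> h0, r4 -> h1. */
static void rename_init(rename_table_t *t)
{
    for (int i = 0; i < NUM_ISA_REGS; i++)
        t->map[i] = -1;
    t->map[3] = 0;
    t->map[4] = 1;
    t->next_free = 2;
}

/* A source operand reads the current mapping. */
static int rename_src(const rename_table_t *t, int isa_reg)
{
    assert(isa_reg >= 0 && isa_reg < NUM_ISA_REGS);
    return t->map[isa_reg];
}

/* A destination operand always allocates a fresh hardware register. */
static int rename_dst(rename_table_t *t, int isa_reg)
{
    assert(isa_reg >= 0 && isa_reg < NUM_ISA_REGS);
    t->map[isa_reg] = t->next_free++;
    return t->map[isa_reg];
}

/* Rename "ldu r9, 8(r3); addi r9, r9, 100; std r9, 0(r3)". */
static renamed_iter_t rename_iteration(rename_table_t *t)
{
    renamed_iter_t out;
    out.ldu_base_rd = rename_src(t, 3); /* old r3 */
    out.ldu_dst     = rename_dst(t, 9); /* new r9 (loaded value) */
    out.ldu_base_wr = rename_dst(t, 3); /* new r3 (updated address) */
    out.addi_src    = rename_src(t, 9);
    out.addi_dst    = rename_dst(t, 9); /* second write to r9 */
    out.std_data    = rename_src(t, 9);
    out.std_base    = rename_src(t, 3);
    return out;
}
```

Because both writes to r9 land in different hardware registers, nothing in this scheme forces the two writes into separate clock cycles.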


Notice how, in the 6600-derived version tabulated below, the WaR Waits on r9 cause only 2 instructions to finish per cycle (5 micro-ops per 2 cycles) instead of the 4 per cycle for the Register Renaming version. This means the processor's resources will eventually be full, limiting total throughput to 2 instructions/clock.
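One way to arrive at the 2 instructions/clock figure is to count register writes per loop iteration. This is an interpretation, not something the tables state directly: it assumes the only bottleneck is the limit of 1 write per ISA-level register per clock. Each iteration writes r3 once (ldu update), r9 twice (ldu and addi) and ctr once (bdnz), so r9 needs at least 2 cycles per iteration, and 4 instructions in 2 cycles gives 2 instructions/clock. The function name below is hypothetical.

```c
#include <assert.h>

/* Lower bound on cycles per loop iteration when each ISA register accepts
 * at most writes_per_cycle writes per clock. Uses a per-iteration ceiling,
 * which is exact for writes_per_cycle == 1 and conservative otherwise. */
static unsigned min_cycles_per_iteration(const unsigned *writes_per_iter,
                                         unsigned num_regs,
                                         unsigned writes_per_cycle)
{
    unsigned worst = 0;
    assert(writes_per_cycle > 0);
    for (unsigned i = 0; i < num_regs; i++) {
        unsigned cycles =
            (writes_per_iter[i] + writes_per_cycle - 1) / writes_per_cycle;
        if (cycles > worst)
            worst = cycles;
    }
    return worst;
}
```

With the write counts {1, 2, 1} for r3, r9 and ctr and one write per cycle, the bound is 2 cycles per iteration. Allowing two writes per cycle to the same register would lower it to 1, which is the kind of "special measure" mentioned at the top of the page.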

For the following table:

- Assumes that ldu instructions are split into two micro-ops in the decode stage. The address computation is denoted "#5.a" and the memory read is denoted "#5.m".
- Assumes that a mechanism for forwarding from a FU's result latch to a waiting operation is in place, without having to wait until the result can be written to the register file.
- "Av r3" denotes that the value to be written to r3 is computed and is available for forwarding but can't yet be written to the register file.
- "SW: #4" denotes that the instruction is waiting on the shadow produced by instruction #4.
- "Rf #5:r5" denotes that the instruction reads the result latch for instruction #5's new value for r5 through the forwarding mechanism.

ISA-level instruction Num 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
mtctr r4 #0 Fetch Decode Ex: Rd r4 Ex: Wr ctr Finish
ldu r9, 8(r3) #1.a Fetch Decode Ex: Rd r3 Ex: Av r3 SW: #1.m Ex: Wr r3 Finish
ldu r9, 8(r3) #1.m Decode Wait: #1.a Ex Ex Ex: Wr r9 Finish
addi r9, r9, 100 #2 Fetch Decode Wait: #1.m Wait: #1.m Wait: #1.m Ex: Rd r9 Ex: Wr r9 Finish
std r9, 0(r3) #3 Fetch Decode Wait: #1.a #2 Wait: #2 Wait: #2 Wait: #2 Ex: Rd r3 r9 Ex Ex Finish
bdnz .L2 #4 Fetch Decode Ex: Rd ctr Ex: Av ctr SW: #3 SW: #3 SW: #3 SW: #3 Ex: Wr ctr Finish
ldu r9, 8(r3) #5.a Fetch Decode Ex: Rf #1.a:r3 Ex: Av r3 SW: #5.m SW: #3 SW: #3 Ex: Wr r3 Finish
ldu r9, 8(r3) #5.m Decode Wait: #5.a Ex Ex Ex: Av r9 SW: #3 Ex: Wr r9 Finish
addi r9, r9, 100 #6 Fetch Decode Wait: #5.m Wait: #5.m Wait: #5.m Ex: Rf #5.m:r9 Ex: Av r9 WaR Wait: r9 Ex: Wr r9 Finish
std r9, 0(r3) #7 Fetch Decode Wait: #5.a #6 Wait: #6 Wait: #6 Wait: #6 Ex: Rf #6:r9 Ex Ex Finish
bdnz .L2 #8 Fetch Decode Ex: Rf #4:ctr Ex: Av ctr SW: #7 SW: #7 SW: #7 SW: #7 SW: #7 Ex: Wr ctr Finish
ldu r9, 8(r3) #9.a Fetch Decode Ex: Rf #5.a:r3 Ex: Av r3 SW: #9.m SW: #7 SW: #7 SW: #7 Ex: Wr r3 Finish
ldu r9, 8(r3) #9.m Decode Wait: #9.a Ex Ex Ex: Av r9 SW: #7 SW: #7 Ex: Wr r9 Finish
addi r9, r9, 100 #10 Fetch Decode Wait: #9.m Wait: #9.m Wait: #9.m Ex: Rf #9.m:r9 Ex: Av r9 SW: #7 WaR Wait: r9 Ex: Wr r9 Finish
std r9, 0(r3) #11 Fetch Decode Wait: #9.a #10 Wait: #10 Wait: #10 Wait: #10 Ex: Rf #9.a:r3 #10:r9 Ex Ex Finish
bdnz .L2 #12 Fetch Decode Ex: Rf #8:ctr Ex: Av ctr SW: #11 SW: #11 SW: #11 SW: #11 SW: #11 Ex: Wr ctr Finish
ldu r9, 8(r3) #13.a Fetch Decode Ex: Rf #9.a:r3 Ex: Av r3 SW: #13.m SW: #11 SW: #11 SW: #11 Ex: Wr r3 Finish
ldu r9, 8(r3) #13.m Decode Wait: #13.a Ex Ex Ex: Av r9 SW: #11 SW: #11 WaR Wait: r9 Ex: Wr r9 Finish
addi r9, r9, 100 #14 Fetch Decode Wait: #13.m Wait: #13.m Wait: #13.m Ex: Rf #13.m:r9 Ex: Av r9 SW: #11 WaR Wait: r9 WaR Wait: r9 Ex: Wr r9 Finish
std r9, 0(r3) #15 Fetch Decode Wait: #13.a #14 Wait: #14 Wait: #14 Wait: #14 Ex: Rf #13.a:r3 #14:r9 Ex Ex Finish
bdnz .L2 #16 Fetch Decode Ex: Rf #12:ctr Ex: Av ctr SW: #15 SW: #15 SW: #15 SW: #15 SW: #15 Ex: Wr ctr Finish
ldu r9, 8(r3) #17.a Fetch Decode Ex: Rf #13.a:r3 Ex: Av r3 SW: #17.m SW: #15 SW: #15 SW: #15 Ex: Wr r3 Finish
ldu r9, 8(r3) #17.m Decode Wait: #17.a Ex Ex Ex: Av r9 SW: #15 SW: #15 WaR Wait: r9 WaR Wait: r9 Ex: Wr r9 Finish
addi r9, r9, 100 #18 Fetch Decode Wait: #17.m Wait: #17.m Wait: #17.m Ex: Rf #17.m:r9 Ex: Av r9 SW: #15 WaR Wait: r9 WaR Wait: r9 WaR Wait: r9 Ex: Wr r9 Finish
std r9, 0(r3) #19 Fetch Decode Wait: #17.a #18 Wait: #18 Wait: #18 Wait: #18 Ex: Rf #17.a:r3 #18:r9 Ex Ex Finish
bdnz .L2 #20 Fetch Decode Ex: Rf #16:ctr Ex: Av ctr SW: #19 SW: #19 SW: #19 SW: #19 SW: #19 Finish
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...