Comparing the 6600-derived architecture to the traditional register-renaming/OoO architecture

One critical difference between the 6600-derived architecture and traditional register-renaming OoO speculative processors is the rate at which a single ISA-level register can be written. In the 6600-derived architecture, writes to any one particular ISA-level register max out at 1 per clock cycle (without special measures to improve that). The register-renamed version can easily handle multiple such writes per clock cycle, since the writes are spread out across multiple physical registers.

(Note from lkcl: 6600 Reservation Stations are "register-renaming" stations. Unlike in the Tomasulo Algorithm, they're just not given "names", because Cray and Thornton solved a problem they didn't realise everyone else would have. See tomasulo transformation and http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-October/001050.html. However, further investigation shows that this may be WaW hazard related.)

The following diagrams assume that fetch, decode, branch prediction, and register renaming can handle 4 instructions per clock cycle (as has been usual on Intel's processors for many generations). They also assume that ldu can write the address register after 1 clock cycle of execution and the destination register after 4 clock cycles of execution (which can be achieved by splitting it into 2 separate micro-ops).

The following C program is used:

#include <stdint.h>

void f(uint64_t *r3, uint64_t r4) {
    uint64_t ctr, r9;
    ctr = r4;
    do {
        r9 = *++r3;
        r9 += 100;
        *r3 = r9;
    } while(--ctr != 0);
}

See on Compiler Explorer

It produces the following Power instructions (edited for style):

f:
    mtctr r4
.L2:
    ldu r9, 8(r3)
    addi r9, r9, 100
    std r9, 0(r3)
    bdnz .L2
    blr

Register Renaming

Renamed hardware registers are named h0, h1, h2, ...

The syntax ldu h7, 8(h5 -> h8) means that the base address is read from h5 and the updated address is written to h8.

The register rename table starts out as follows:

r3 r4
h0 h1
ISA-level instruction Num Renamed Instruction 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
mtctr r4 #0 mtctr h1 Fetch Decode Ex: Rd h1 Ex: Wr ctr Retire
ldu r9, 8(r3) #1 ldu h2, 8(h0 -> h3) Fetch Decode Ex: Rd h0 Ex: Wr h3 Ex Ex: Wr h2 Retire
addi r9, r9, 100 #2 addi h4, h2, 100 Fetch Decode Wait: h2 Wait: h2 Wait: h2 Ex: Rd h2 Ex: Wr h4 Retire
std r9, 0(r3) #3 std h4, 0(h3) Fetch Decode Wait: h3 and h4 Wait: h4 Wait: h4 Wait: h4 Ex: Rd h3 and h4 Ex Ex Retire
bdnz .L2 #4 bdnz .L2 Fetch Decode Ex: Rd ctr Ex: Wr ctr Wait: Retire Wait: Retire Wait: Retire Wait: Retire Retire
ldu r9, 8(r3) #5 ldu h5, 8(h3 -> h6) Fetch Decode Ex: Rd h3 Ex: Wr h6 Ex Ex: Wr h5 Wait: Retire Retire
addi r9, r9, 100 #6 addi h7, h5, 100 Fetch Decode Wait: h5 Wait: h5 Wait: h5 Ex: Rd h5 Ex: Wr h7 Retire
std r9, 0(r3) #7 std h7, 0(h6) Fetch Decode Wait: h6 and h7 Wait: h7 Wait: h7 Wait: h7 Ex: Rd h6 and h7 Ex Ex Retire
bdnz .L2 #8 bdnz .L2 Fetch Decode Ex: Rd ctr Ex: Wr ctr Wait: Retire Wait: Retire Wait: Retire Wait: Retire Wait: Retire Retire
ldu r9, 8(r3) #9 ldu h8, 8(h6 -> h9) Fetch Decode Ex: Rd h6 Ex: Wr h9 Ex Ex: Wr h8 Wait: Retire Wait: Retire Retire
addi r9, r9, 100 #10 addi h10, h8, 100 Fetch Decode Wait: h8 Wait: h8 Wait: h8 Ex: Rd h8 Ex: Wr h10 Wait: Retire Retire
std r9, 0(r3) #11 std h10, 0(h9) Fetch Decode Wait: h9 and h10 Wait: h10 Wait: h10 Wait: h10 Ex: Rd h9 and h10 Ex Ex Retire
bdnz .L2 #12 bdnz .L2 Fetch Decode Ex: Rd ctr Ex: Wr ctr Wait: Retire Wait: Retire Wait: Retire Wait: Retire Wait: Retire Retire
ldu r9, 8(r3) #13 ldu h11, 8(h9 -> h12) Fetch Decode Ex: Rd h9 Ex: Wr h12 Ex Ex: Wr h11 Wait: Retire Wait: Retire Retire
addi r9, r9, 100 #14 addi h13, h11, 100 Fetch Decode Wait: h11 Wait: h11 Wait: h11 Ex: Rd h11 Ex: Wr h13 Wait: Retire Retire
std r9, 0(r3) #15 std h13, 0(h12) Fetch Decode Wait: h12 and h13 Wait: h13 Wait: h13 Wait: h13 Ex: Rd h12 and h13 Ex Ex Retire
bdnz .L2 #16 bdnz .L2 Fetch Decode Ex: Rd ctr Ex: Wr ctr Wait: Retire Wait: Retire Wait: Retire Wait: Retire Wait: Retire Retire
ldu r9, 8(r3) #17 ldu h14, 8(h12 -> h15) Fetch Decode Ex: Rd h12 Ex: Wr h15 Ex Ex: Wr h14 Wait: Retire Wait: Retire Retire
addi r9, r9, 100 #18 addi h16, h14, 100 Fetch Decode Wait: h14 Wait: h14 Wait: h14 Ex: Rd h14 Ex: Wr h16 Wait: Retire Retire
std r9, 0(r3) #19 std h16, 0(h15) Fetch Decode Wait: h15 and h16 Wait: h16 Wait: h16 Wait: h16 Ex: Rd h15 and h16 Ex Ex Retire
bdnz .L2 #20 bdnz .L2 Fetch Decode Ex: Rd ctr Ex: Wr ctr Wait: Retire Wait: Retire Wait: Retire Wait: Retire Wait: Retire Retire
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

6600-derived

Notice how the WaR Waits on r9 in the table below allow only 2 instructions to finish per cycle (5 micro-ops per 2 cycles), instead of the 4 per cycle of the Register Renaming version. This means the processor's resources will eventually fill up, limiting total throughput to 2 instructions per clock.

For the following table:

- Assumes that ldu instructions are split into two micro-ops in the decode stage. The address computation is denoted "#5.a" and the memory read is denoted "#5.m".
- Assumes that a mechanism for forwarding from a FU's result latch to a waiting operation is in place, without having to wait until the result can be written to the register file.
- "Av r3" denotes that the value to be written to r3 is computed and is available for forwarding but can't yet be written to the register file.
- "SW: #4" denotes that the instruction is waiting on the shadow produced by instruction #4.
- "Rf #5:r5" denotes that the instruction reads the result latch for instruction #5's new value for r5 through the forwarding mechanism.

ISA-level instruction Num 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
mtctr r4 #0 Fetch Decode Ex: Rd r4 Ex: Wr ctr Finish
ldu r9, 8(r3) #1.a Fetch Decode Ex: Rd r3 Ex: Av r3 SW: #1.m Ex: Wr r3 Finish
ldu r9, 8(r3) #1.m Decode Wait: #1.a Ex Ex Ex: Wr r9 Finish
addi r9, r9, 100 #2 Fetch Decode Wait: #1.m Wait: #1.m Wait: #1.m Ex: Rd r9 Ex: Wr r9 Finish
std r9, 0(r3) #3 Fetch Decode Wait: #1.a #2 Wait: #2 Wait: #2 Wait: #2 Ex: Rd r3 r9 Ex Ex Finish
bdnz .L2 #4 Fetch Decode Ex: Rd ctr Ex: Av ctr SW: #3 SW: #3 SW: #3 SW: #3 Ex: Wr ctr Finish
ldu r9, 8(r3) #5.a Fetch Decode Ex: Rf #1.a:r3 Ex: Av r3 SW: #5.m SW: #3 SW: #3 Ex: Wr r3 Finish
ldu r9, 8(r3) #5.m Decode Wait: #5.a Ex Ex Ex: Av r9 SW: #3 Ex: Wr r9 Finish
addi r9, r9, 100 #6 Fetch Decode Wait: #5.m Wait: #5.m Wait: #5.m Ex: Rf #5.m:r9 Ex: Av r9 WaR Wait: r9 Ex: Wr r9 Finish
std r9, 0(r3) #7 Fetch Decode Wait: #5.a #6 Wait: #6 Wait: #6 Wait: #6 Ex: Rf #6:r9 Ex Ex Finish
bdnz .L2 #8 Fetch Decode Ex: Rf #4:ctr Ex: Av ctr SW: #7 SW: #7 SW: #7 SW: #7 SW: #7 Ex: Wr ctr Finish
ldu r9, 8(r3) #9.a Fetch Decode Ex: Rf #5.a:r3 Ex: Av r3 SW: #9.m SW: #7 SW: #7 SW: #7 Ex: Wr r3 Finish
ldu r9, 8(r3) #9.m Decode Wait: #9.a Ex Ex Ex: Av r9 SW: #7 SW: #7 Ex: Wr r9 Finish
addi r9, r9, 100 #10 Fetch Decode Wait: #9.m Wait: #9.m Wait: #9.m Ex: Rf #9.m:r9 Ex: Av r9 SW: #7 WaR Wait: r9 Ex: Wr r9 Finish
std r9, 0(r3) #11 Fetch Decode Wait: #9.a #10 Wait: #10 Wait: #10 Wait: #10 Ex: Rf #9.a:r3 #10:r9 Ex Ex Finish
bdnz .L2 #12 Fetch Decode Ex: Rf #8:ctr Ex: Av ctr SW: #11 SW: #11 SW: #11 SW: #11 SW: #11 Ex: Wr ctr Finish
ldu r9, 8(r3) #13.a Fetch Decode Ex: Rf #9.a:r3 Ex: Av r3 SW: #13.m SW: #11 SW: #11 SW: #11 Ex: Wr r3 Finish
ldu r9, 8(r3) #13.m Decode Wait: #13.a Ex Ex Ex: Av r9 SW: #11 SW: #11 WaR Wait: r9 Ex: Wr r9 Finish
addi r9, r9, 100 #14 Fetch Decode Wait: #13.m Wait: #13.m Wait: #13.m Ex: Rf #13.m:r9 Ex: Av r9 SW: #11 WaR Wait: r9 WaR Wait: r9 Ex: Wr r9 Finish
std r9, 0(r3) #15 Fetch Decode Wait: #13.a #14 Wait: #14 Wait: #14 Wait: #14 Ex: Rf #13.a:r3 #14:r9 Ex Ex Finish
bdnz .L2 #16 Fetch Decode Ex: Rf #12:ctr Ex: Av ctr SW: #15 SW: #15 SW: #15 SW: #15 SW: #15 Ex: Wr ctr Finish
ldu r9, 8(r3) #17.a Fetch Decode Ex: Rf #13.a:r3 Ex: Av r3 SW: #17.m SW: #15 SW: #15 SW: #15 Ex: Wr r3 Finish
ldu r9, 8(r3) #17.m Decode Wait: #17.a Ex Ex Ex: Av r9 SW: #15 SW: #15 WaR Wait: r9 WaR Wait: r9 Ex: Wr r9 Finish
addi r9, r9, 100 #18 Fetch Decode Wait: #17.m Wait: #17.m Wait: #17.m Ex: Rf #17.m:r9 Ex: Av r9 SW: #15 WaR Wait: r9 WaR Wait: r9 WaR Wait: r9 Ex: Wr r9 Finish
std r9, 0(r3) #19 Fetch Decode Wait: #17.a #18 Wait: #18 Wait: #18 Wait: #18 Ex: Rf #17.a:r3 #18:r9 Ex Ex Finish
bdnz .L2 #20 Fetch Decode Ex: Rf #16:ctr Ex: Av ctr SW: #19 SW: #19 SW: #19 SW: #19 SW: #19 Finish
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...