Comparing the 6600-derived architecture to the traditional register-renaming/OoO architecture

One critical difference between the 6600-derived architecture and traditional register-renaming OoO speculative processors is the rate at which a single ISA-level register can be written. In the 6600-derived architecture, writes to any one particular ISA-level register max out at 1 per clock cycle (without special measures to improve that). The register-renamed version can easily handle multiple such writes per clock cycle, since the writes are spread out across multiple physical registers.

(Note from lkcl: 6600 Reservation Stations are "register-renaming" stations. Unlike in the Tomasulo Algorithm, they're just not given "names", because Cray and Thornton solved a problem they didn't realise everyone else would have. See tomasulo transformation and http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-October/001050.html for background. However, further investigation shows that this may be WaW hazard related.)

The following diagrams assume that fetch, decode, branch prediction, and register renaming can handle 4 instructions per clock cycle (usual on Intel's processors for many generations). They also assume that ldu can write the address register after 1 clock cycle of execution and the destination register after 4 clock cycles of execution (achievable by splitting it into 2 separate micro-ops).
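The split of ldu into two micro-ops can be pictured with a small sketch. This is only an illustration under stated assumptions: the `Regs` struct and the function names `ldu_addr` and `ldu_mem` are hypothetical, and a host pointer stands in for the effective address.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register state: only the two registers that ldu touches. */
typedef struct {
    uint64_t *r3; /* address register, modelled as a host pointer */
    uint64_t r9;  /* load destination */
} Regs;

/* Micro-op 1 (address): compute the effective address and update the
 * address register. The diagrams assume this takes 1 cycle of execution. */
uint64_t *ldu_addr(Regs *regs, int64_t disp_bytes) {
    /* This sketch only handles displacements that are whole doublewords. */
    assert(disp_bytes % (int64_t)sizeof(uint64_t) == 0);
    uint64_t *ea = regs->r3 + disp_bytes / (int64_t)sizeof(uint64_t);
    regs->r3 = ea;
    return ea;
}

/* Micro-op 2 (memory): read memory at the effective address into the
 * destination. The diagrams assume the result is written after 4 cycles. */
void ldu_mem(Regs *regs, const uint64_t *ea) {
    regs->r9 = *ea;
}
```

Calling `ldu_addr(&regs, 8)` followed by `ldu_mem(&regs, ea)` has the same effect as a single `ldu r9, 8(r3)`. The point of the split is that an instruction needing only the updated address does not have to wait for the memory read.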

The following C program is used:

#include <stdint.h>

void f(uint64_t *r3, uint64_t r4) {
    uint64_t ctr, r9;
    ctr = r4;
    do {
        r9 = *++r3;
        r9 += 100;
        *r3 = r9;
    } while(--ctr != 0);
}

See on Compiler Explorer

It produces the following Power instructions (edited for style):

f:
    mtctr r4
.L2:
    ldu r9, 8(r3)
    addi r9, r9, 100
    std r9, 0(r3)
    bdnz .L2
    blr

Register Renaming

Renamed hardware registers are named h0, h1, h2, ...

The syntax `ldu h7, 8(h5 -> h8)` will be used to mean that the address is read from h5 and the updated address is written to h8.

The register rename table starts out as follows:

`r3`	`r4`
`h0`	`h1`

ISA-level instruction	Num	Renamed Instruction	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14
`mtctr r4`	#0	`mtctr h1`	Fetch	Decode	Ex: Rd `h1`	Ex: Wr `ctr`	Retire
`ldu r9, 8(r3)`	#1	`ldu h2, 8(h0 -> h3)`	Fetch	Decode	Ex: Rd `h0`	Ex: Wr `h3`	Ex	Ex: Wr `h2`	Retire
`addi r9, r9, 100`	#2	`addi h4, h2, 100`	Fetch	Decode	Wait: `h2`	Wait: `h2`	Wait: `h2`	Ex: Rd `h2`	Ex: Wr `h4`	Retire
`std r9, 0(r3)`	#3	`std h4, 0(h3)`	Fetch	Decode	Wait: `h3` and `h4`	Wait: `h4`	Wait: `h4`	Wait: `h4`	Ex: Rd `h3` and `h4`	Ex	Ex	Retire
`bdnz .L2`	#4	`bdnz .L2`		Fetch	Decode	Ex: Rd `ctr`	Ex: Wr `ctr`	Wait: Retire	Wait: Retire	Wait: Retire	Wait: Retire	Retire
`ldu r9, 8(r3)`	#5	`ldu h5, 8(h3 -> h6)`			Fetch	Decode	Ex: Rd `h3`	Ex: Wr `h6`	Ex	Ex: Wr `h5`	Wait: Retire	Retire
`addi r9, r9, 100`	#6	`addi h7, h5, 100`			Fetch	Decode	Wait: `h5`	Wait: `h5`	Wait: `h5`	Ex: Rd `h5`	Ex: Wr `h7`	Retire
`std r9, 0(r3)`	#7	`std h7, 0(h6)`			Fetch	Decode	Wait: `h6` and `h7`	Wait: `h7`	Wait: `h7`	Wait: `h7`	Ex: Rd `h6` and `h7`	Ex	Ex	Retire
`bdnz .L2`	#8	`bdnz .L2`			Fetch	Decode	Ex: Rd `ctr`	Ex: Wr `ctr`	Wait: Retire	Wait: Retire	Wait: Retire	Wait: Retire	Wait: Retire	Retire
`ldu r9, 8(r3)`	#9	`ldu h8, 8(h6 -> h9)`				Fetch	Decode	Ex: Rd `h6`	Ex: Wr `h9`	Ex	Ex: Wr `h8`	Wait: Retire	Wait: Retire	Retire
`addi r9, r9, 100`	#10	`addi h10, h8, 100`				Fetch	Decode	Wait: `h8`	Wait: `h8`	Wait: `h8`	Ex: Rd `h8`	Ex: Wr `h10`	Wait: Retire	Retire
`std r9, 0(r3)`	#11	`std h10, 0(h9)`				Fetch	Decode	Wait: `h9` and `h10`	Wait: `h10`	Wait: `h10`	Wait: `h10`	Ex: Rd `h9` and `h10`	Ex	Ex	Retire
`bdnz .L2`	#12	`bdnz .L2`				Fetch	Decode	Ex: Rd `ctr`	Ex: Wr `ctr`	Wait: Retire	Wait: Retire	Wait: Retire	Wait: Retire	Wait: Retire	Retire
`ldu r9, 8(r3)`	#13	`ldu h11, 8(h9 -> h12)`					Fetch	Decode	Ex: Rd `h9`	Ex: Wr `h12`	Ex	Ex: Wr `h11`	Wait: Retire	Wait: Retire	Retire
`addi r9, r9, 100`	#14	`addi h13, h11, 100`					Fetch	Decode	Wait: `h11`	Wait: `h11`	Wait: `h11`	Ex: Rd `h11`	Ex: Wr `h13`	Wait: Retire	Retire
`std r9, 0(r3)`	#15	`std h13, 0(h12)`					Fetch	Decode	Wait: `h12` and `h13`	Wait: `h13`	Wait: `h13`	Wait: `h13`	Ex: Rd `h12` and `h13`	Ex	Ex	Retire
`bdnz .L2`	#16	`bdnz .L2`					Fetch	Decode	Ex: Rd `ctr`	Ex: Wr `ctr`	Wait: Retire	Wait: Retire	Wait: Retire	Wait: Retire	Wait: Retire	Retire
`ldu r9, 8(r3)`	#17	`ldu h14, 8(h12 -> h15)`						Fetch	Decode	Ex: Rd `h12`	Ex: Wr `h15`	Ex	Ex: Wr `h14`	Wait: Retire	Wait: Retire	Retire
`addi r9, r9, 100`	#18	`addi h16, h14, 100`						Fetch	Decode	Wait: `h14`	Wait: `h14`	Wait: `h14`	Ex: Rd `h14`	Ex: Wr `h16`	Wait: Retire	Retire
`std r9, 0(r3)`	#19	`std h16, 0(h15)`						Fetch	Decode	Wait: `h15` and `h16`	Wait: `h16`	Wait: `h16`	Wait: `h16`	Ex: Rd `h15` and `h16`	Ex	Ex	Retire
`bdnz .L2`	#20	`bdnz .L2`						Fetch	Decode	Ex: Rd `ctr`	Ex: Wr `ctr`	Wait: Retire	Wait: Retire	Wait: Retire	Wait: Retire	Wait: Retire	Retire
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
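The hardware register numbers in the table above can be reproduced with a very small renamer. This is a minimal sketch, not a description of any real design: the `RenameTable` and `RenamedBody` types and all function names are hypothetical, hardware registers are handed out in increasing order and never recycled, and `ctr` is left unrenamed as in the table.

```c
#include <assert.h>
#include <stdio.h>

/* ISA-level registers renamed by this sketch (ctr is not renamed). */
enum { R3, R4, R9, NUM_ISA_REGS };

typedef struct {
    int map[NUM_ISA_REGS]; /* ISA register -> hardware register, -1 = none */
    int next_free;         /* next never-used hardware register number */
} RenameTable;

/* Hardware register numbers chosen for one iteration of the loop body. */
typedef struct {
    int ldu_dst, ldu_addr_in, ldu_addr_out; /* ldu hDST, 8(hIN -> hOUT) */
    int addi_dst, addi_src;                 /* addi hDST, hSRC, 100 */
    int std_data, std_addr;                 /* std hDATA, 0(hADDR) */
} RenamedBody;

/* Initial state from the text: r3 -> h0, r4 -> h1, r9 not yet mapped. */
void rename_init(RenameTable *t) {
    t->map[R3] = 0;
    t->map[R4] = 1;
    t->map[R9] = -1;
    t->next_free = 2;
}

/* A source operand uses whatever the ISA register currently maps to. */
int rename_src(const RenameTable *t, int isa_reg) {
    assert(t->map[isa_reg] >= 0);
    return t->map[isa_reg];
}

/* A destination operand always gets a fresh hardware register. */
int rename_dst(RenameTable *t, int isa_reg) {
    t->map[isa_reg] = t->next_free++;
    return t->map[isa_reg];
}

/* Rename one iteration: ldu r9, 8(r3); addi r9, r9, 100; std r9, 0(r3). */
RenamedBody rename_loop_body(RenameTable *t) {
    RenamedBody b;
    /* ldu: read the old r3 mapping before allocating the new registers. */
    b.ldu_addr_in = rename_src(t, R3);
    b.ldu_dst = rename_dst(t, R9);
    b.ldu_addr_out = rename_dst(t, R3);
    /* addi: the source is the load result, the destination is fresh. */
    b.addi_src = rename_src(t, R9);
    b.addi_dst = rename_dst(t, R9);
    /* std: sources only, so nothing is allocated. */
    b.std_data = rename_src(t, R9);
    b.std_addr = rename_src(t, R3);
    return b;
}

/* Print the renamed instructions in the notation used by the table. */
void print_renamed_body(const RenamedBody *b) {
    printf("ldu h%d, 8(h%d -> h%d)\n", b->ldu_dst, b->ldu_addr_in, b->ldu_addr_out);
    printf("addi h%d, h%d, 100\n", b->addi_dst, b->addi_src);
    printf("std h%d, 0(h%d)\n", b->std_data, b->std_addr);
}
```

Calling `rename_init` once and then `rename_loop_body` repeatedly yields the rows of the table in order. Because every write to r9 lands in a different hardware register, successive iterations have no false dependency on each other through r9.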

6600-derived

Notice how the WaR Waits on r9 limit the processor to finishing 2 instructions per cycle (5 micro-ops per 2 cycles), instead of the 4 per cycle of the Register Renaming version. This means the processor's resources will eventually fill up, limiting total throughput to 2 instructions/clock.
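One way to see where the 2 instructions/clock figure comes from is to count writes to r9. The sketch below assumes that the one-write-per-cycle limit on a single ISA-level register is the binding constraint. The function names are hypothetical, and the per-iteration counts (2 writes to r9, 4 instructions, 5 micro-ops) are read off the loop body.

```c
#include <assert.h>

/* Minimum cycles per loop iteration when one ISA-level register accepts at
 * most writes_per_cycle writes per clock (ceiling division). */
unsigned min_cycles_per_iteration(unsigned writes_per_iteration,
                                  unsigned writes_per_cycle) {
    assert(writes_per_cycle > 0);
    return (writes_per_iteration + writes_per_cycle - 1) / writes_per_cycle;
}

/* Sustained throughput for a given amount of work per iteration. */
double ops_per_cycle(unsigned ops_per_iteration, unsigned cycles_per_iteration) {
    assert(cycles_per_iteration > 0);
    return (double)ops_per_iteration / (double)cycles_per_iteration;
}
```

Each iteration writes r9 twice (once by the ldu memory micro-op, once by addi), so `min_cycles_per_iteration(2, 1)` gives 2 cycles per iteration. With 4 instructions per iteration, `ops_per_cycle(4, 2)` gives 2 instructions per cycle, and with 5 micro-ops per iteration it gives 5 micro-ops per 2 cycles.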

For the following table:

- Assumes that ldu instructions are split into two micro-ops in the decode stage. The address computation is denoted "#5.a" and the memory read is denoted "#5.m".
- Assumes that a mechanism for forwarding from a FU's result latch to a waiting operation is in place, without having to wait until the result can be written to the register file.
- "Av r3" denotes that the value to be written to r3 is computed and is available for forwarding but can't yet be written to the register file.
- "SW: #4" denotes that the instruction is waiting on the shadow produced by instruction #4.
- "Rf #5:r5" denotes that the instruction reads the result latch for instruction #5's new value for r5 through the forwarding mechanism.

ISA-level instruction	Num	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17
`mtctr r4`	#0	Fetch	Decode	Ex: Rd `r4`	Ex: Wr `ctr`	Finish
`ldu r9, 8(r3)`	#1.a	Fetch	Decode	Ex: Rd `r3`	Ex: Av `r3`	SW: #1.m	Ex: Wr `r3`	Finish
`ldu r9, 8(r3)`	#1.m		Decode	Wait: #1.a	Ex	Ex	Ex: Wr `r9`	Finish
`addi r9, r9, 100`	#2	Fetch	Decode	Wait: #1.m	Wait: #1.m	Wait: #1.m	Ex: Rd `r9`	Ex: Wr `r9`	Finish
`std r9, 0(r3)`	#3	Fetch	Decode	Wait: #1.a #2	Wait: #2	Wait: #2	Wait: #2	Ex: Rd `r3` `r9`	Ex	Ex	Finish
`bdnz .L2`	#4		Fetch	Decode	Ex: Rd `ctr`	Ex: Av `ctr`	SW: #3	SW: #3	SW: #3	SW: #3	Ex: Wr `ctr`	Finish
`ldu r9, 8(r3)`	#5.a			Fetch	Decode	Ex: Rf #1.a:`r3`	Ex: Av `r3`	SW: #5.m	SW: #3	SW: #3	Ex: Wr `r3`	Finish
`ldu r9, 8(r3)`	#5.m				Decode	Wait: #5.a	Ex	Ex	Ex: Av `r9`	SW: #3	Ex: Wr `r9`	Finish
`addi r9, r9, 100`	#6			Fetch	Decode	Wait: #5.m	Wait: #5.m	Wait: #5.m	Ex: Rf #5.m:`r9`	Ex: Av `r9`	WaR Wait: `r9`	Ex: Wr `r9`	Finish
`std r9, 0(r3)`	#7			Fetch	Decode	Wait: #5.a #6	Wait: #6	Wait: #6	Wait: #6	Ex: Rf #5.a:`r3` #6:`r9`	Ex	Ex	Finish
`bdnz .L2`	#8			Fetch	Decode	Ex: Rf #4:`ctr`	Ex: Av `ctr`	SW: #7	SW: #7	SW: #7	SW: #7	SW: #7	Ex: Wr `ctr`	Finish
`ldu r9, 8(r3)`	#9.a				Fetch	Decode	Ex: Rf #5.a:`r3`	Ex: Av `r3`	SW: #9.m	SW: #7	SW: #7	SW: #7	Ex: Wr `r3`	Finish
`ldu r9, 8(r3)`	#9.m					Decode	Wait: #9.a	Ex	Ex	Ex: Av `r9`	SW: #7	SW: #7	Ex: Wr `r9`	Finish
`addi r9, r9, 100`	#10				Fetch	Decode	Wait: #9.m	Wait: #9.m	Wait: #9.m	Ex: Rf #9.m:`r9`	Ex: Av `r9`	SW: #7	WaR Wait: `r9`	Ex: Wr `r9`	Finish
`std r9, 0(r3)`	#11				Fetch	Decode	Wait: #9.a #10	Wait: #10	Wait: #10	Wait: #10	Ex: Rf #9.a:`r3` #10:`r9`	Ex	Ex	Finish
`bdnz .L2`	#12				Fetch	Decode	Ex: Rf #8:`ctr`	Ex: Av `ctr`	SW: #11	SW: #11	SW: #11	SW: #11	SW: #11	Ex: Wr `ctr`	Finish
`ldu r9, 8(r3)`	#13.a					Fetch	Decode	Ex: Rf #9.a:`r3`	Ex: Av `r3`	SW: #13.m	SW: #11	SW: #11	SW: #11	Ex: Wr `r3`	Finish
`ldu r9, 8(r3)`	#13.m						Decode	Wait: #13.a	Ex	Ex	Ex: Av `r9`	SW: #11	SW: #11	WaR Wait: `r9`	Ex: Wr `r9`	Finish
`addi r9, r9, 100`	#14					Fetch	Decode	Wait: #13.m	Wait: #13.m	Wait: #13.m	Ex: Rf #13.m:`r9`	Ex: Av `r9`	SW: #11	WaR Wait: `r9`	WaR Wait: `r9`	Ex: Wr `r9`	Finish
`std r9, 0(r3)`	#15					Fetch	Decode	Wait: #13.a #14	Wait: #14	Wait: #14	Wait: #14	Ex: Rf #13.a:`r3` #14:`r9`	Ex	Ex	Finish
`bdnz .L2`	#16					Fetch	Decode	Ex: Rf #12:`ctr`	Ex: Av `ctr`	SW: #15	SW: #15	SW: #15	SW: #15	SW: #15	Ex: Wr `ctr`	Finish
`ldu r9, 8(r3)`	#17.a						Fetch	Decode	Ex: Rf #13.a:`r3`	Ex: Av `r3`	SW: #17.m	SW: #15	SW: #15	SW: #15	Ex: Wr `r3`	Finish
`ldu r9, 8(r3)`	#17.m							Decode	Wait: #17.a	Ex	Ex	Ex: Av `r9`	SW: #15	SW: #15	WaR Wait: `r9`	WaR Wait: `r9`	Ex: Wr `r9`	Finish
`addi r9, r9, 100`	#18						Fetch	Decode	Wait: #17.m	Wait: #17.m	Wait: #17.m	Ex: Rf #17.m:`r9`	Ex: Av `r9`	SW: #15	WaR Wait: `r9`	WaR Wait: `r9`	WaR Wait: `r9`	Ex: Wr `r9`	Finish
`std r9, 0(r3)`	#19						Fetch	Decode	Wait: #17.a #18	Wait: #18	Wait: #18	Wait: #18	Ex: Rf #17.a:`r3` #18:`r9`	Ex	Ex	Finish
`bdnz .L2`	#20						Fetch	Decode	Ex: Rf #16:`ctr`	Ex: Av `ctr`	SW: #19	SW: #19	SW: #19	SW: #19	SW: #19	Ex: Wr `ctr`	Finish
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...