Comparing the 6600-derived architecture to the traditional register-renaming/OoO architecture
One critical difference between the 6600-derived architecture and traditional register-renaming OoO speculative processors is that writes to any one particular ISA-level register max out at 1 per clock cycle (without special measures to improve that) in the 6600-derived architecture, whereas the register-renamed version can easily handle multiple such register writes per clock cycle since the register writes are spread out across multiple physical registers.
(Note from lkcl: 6600 Reservation Stations are "register-renaming" stations. unlike in the Tomasulo Algorithm, they're just not given "names" because Cray and Thornton solved a problem they didn't realise everyone else would have. See tomasulo transformation and http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-October/001050.html However further investigation shows that this may be WaW hazard relate)
The following diagrams are assuming that the fetch, decode, branch
prediction, and register renaming can handle 4 instructions per clock
cycle (usual on Intel's processors for many generations). They assume that
ldu
can write the address register after 1 clock cycle of execution
and the destination register after 4 clock cycles of execution (can be
achieved by splitting into 2 separate micro-ops).
The following C program is used:
#include <stdint.h>
void f(uint64_t *r3, uint64_t r4) {
uint64_t ctr, r9;
ctr = r4;
do {
r9 = *++r3;
r9 += 100;
*r3 = r9;
} while(--ctr != 0);
}
It produces the following Power instructions (edited for style):
f:
mtctr r4
.L2:
ldu r9, 8(r3)
addi r9, r9, 100
std r9, 0(r3)
bdnz .L2
blr
Register Renaming
Renamed hardware registers are named h0
, h1
, h2
, ...
The syntax ldu h7, 8(h5 -> h8)
will be used to mean that the address
read comes from h5
and the address write goes to h8
The register rename table starts out as following:
r3 |
r4 |
---|---|
h0 |
h1 |
ISA-level instruction | Num | Renamed Instruction | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mtctr r4 |
#0 | mtctr h1 |
Fetch | Decode | Ex: Rd h1 |
Ex: Wr ctr |
Retire | ||||||||||
ldu r9, 8(r3) |
#1 | ldu h2, 8(h0 -> h3) |
Fetch | Decode | Ex: Rd h0 |
Ex: Wr h3 |
Ex | Ex: Wr h2 |
Retire | ||||||||
addi r9, r9, 100 |
#2 | addi h4, h2, 1 |
Fetch | Decode | Wait: h2 |
Wait: h2 |
Wait: h2 |
Ex: Rd h2 |
Ex: Wr h4 |
Retire | |||||||
std r9, 0(r3) |
#3 | std h4, 0(h3) |
Fetch | Decode | Wait: h3 and h4 |
Wait: h4 |
Wait: h4 |
Wait: h4 |
Ex: Rd h3 and h4 |
Ex | Ex | Retire | |||||
bdnz .L2 |
#4 | bdnz .L2 |
Fetch | Decode | Ex: Rd ctr |
Ex: Wr ctr |
Wait: Retire | Wait: Retire | Wait: Retire | Wait: Retire | Retire | ||||||
ldu r9, 8(r3) |
#5 | ldu h5, 8(h3 -> h6) |
Fetch | Decode | Ex: Rd h3 |
Ex: Wr h6 |
Ex | Ex: Wr h5 |
Wait: Retire | Retire | |||||||
addi r9, r9, 100 |
#6 | addi h7, h5, 100 |
Fetch | Decode | Wait: h5 |
Wait: h5 |
Wait: h5 |
Ex: Rd h5 |
Ex: Wr h7 |
Retire | |||||||
std r9, 0(r3) |
#7 | std h7, 0(h6) |
Fetch | Decode | Wait: h6 and h7 |
Wait: h7 |
Wait: h7 |
Wait: h7 |
Ex: Rd h6 and h7 |
Ex | Ex | Retire | |||||
bdnz .L2 |
#8 | bdnz .L2 |
Fetch | Decode | Ex: Rd ctr |
Ex: Wr ctr |
Wait: Retire | Wait: Retire | Wait: Retire | Wait: Retire | Wait: Retire | Retire | |||||
ldu r9, 8(r3) |
#9 | ldu h8, 8(h6 -> h9) |
Fetch | Decode | Ex: Rd h6 |
Ex: Wr h9 |
Ex | Ex: Wr h8 |
Wait: Retire | Wait: Retire | Retire | ||||||
addi r9, r9, 100 |
#10 | addi h10, h8, 100 |
Fetch | Decode | Wait: h8 |
Wait: h8 |
Wait: h8 |
Ex: Rd h8 |
Ex: Wr h10 |
Wait: Retire | Retire | ||||||
std r9, 0(r3) |
#11 | std h10, 0(h9) |
Fetch | Decode | Wait: h9 and h10 |
Wait: h10 |
Wait: h10 |
Wait: h10 |
Ex: Rd h9 and h10 |
Ex | Ex | Retire | |||||
bdnz .L2 |
#12 | bdnz .L2 |
Fetch | Decode | Ex: Rd ctr |
Ex: Wr ctr |
Wait: Retire | Wait: Retire | Wait: Retire | Wait: Retire | Wait: Retire | Retire | |||||
ldu r9, 8(r3) |
#13 | ldu h11, 8(h9 -> h12) |
Fetch | Decode | Ex: Rd h9 |
Ex: Wr h12 |
Ex | Ex: Wr h11 |
Wait: Retire | Wait: Retire | Retire | ||||||
addi r9, r9, 100 |
#14 | addi h13, h11, 100 |
Fetch | Decode | Wait: h11 |
Wait: h11 |
Wait: h11 |
Ex: Rd h11 |
Ex: Wr h13 |
Wait: Retire | Retire | ||||||
std r9, 0(r3) |
#15 | std h13, 0(h12) |
Fetch | Decode | Wait: h12 and h13 |
Wait: h13 |
Wait: h13 |
Wait: h13 |
Ex: Rd h12 and h13 |
Ex | Ex | Retire | |||||
bdnz .L2 |
#16 | bdnz .L2 |
Fetch | Decode | Ex: Rd ctr |
Ex: Wr ctr |
Wait: Retire | Wait: Retire | Wait: Retire | Wait: Retire | Wait: Retire | Retire | |||||
ldu r9, 8(r3) |
#17 | ldu h14, 8(h12 -> h15) |
Fetch | Decode | Ex: Rd h12 |
Ex: Wr h15 |
Ex | Ex: Wr h14 |
Wait: Retire | Wait: Retire | Retire | ||||||
addi r9, r9, 100 |
#18 | addi h16, h14, 100 |
Fetch | Decode | Wait: h14 |
Wait: h14 |
Wait: h14 |
Ex: Rd h14 |
Ex: Wr h16 |
Wait: Retire | Retire | ||||||
std r9, 0(r3) |
#19 | std h16, 0(h15) |
Fetch | Decode | Wait: h15 and h16 |
Wait: h16 |
Wait: h16 |
Wait: h16 |
Ex: Rd h15 and h16 |
Ex | Ex | Retire | |||||
bdnz .L2 |
#20 | bdnz .L2 |
Fetch | Decode | Ex: Rd ctr |
Ex: Wr ctr |
Wait: Retire | Wait: Retire | Wait: Retire | Wait: Retire | Wait: Retire | Retire | |||||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6600-derived
Notice how the WaR Waits on r9
cause 2 instructions to finish per cycle
(5 micro-ops per 2 cycles) instead of the 4 per cycle for the Register
Renaming version, this means the processor's resources will eventually
be full, limiting total throughput to 2 instructions/clock.
For the following table:
- Assumes that ldu
instructions are split into two micro-ops in the
decode stage. The address computation is denoted "#5.a" and the memory
read is denoted "#5.m".
- Assumes that a mechanism for forwarding from a FU's result latch to a
waiting operation is in place, without having to wait until the result
can be written to the register file.
- "Av r3
" denotes that the value to be written to r3
is computed and
is available for forwarding but can't yet be written to the register file.
- "SW: #4" denotes that the instruction is waiting on the shadow produced
by instruction #4.
- "Rf #5:r5
" denotes that the instruction reads the result latch for
instruction #5's new value for r5
through the forwarding mechanism.
ISA-level instruction | Num | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mtctr r4 |
#0 | Fetch | Decode | Ex: Rd r4 |
Ex: Wr ctr |
Finish | |||||||||||||
ldu r9, 8(r3) |
#1.a | Fetch | Decode | Ex: Rd r3 |
Ex: Av r3 |
SW: #1.m | Ex: Wr r3 |
Finish | |||||||||||
ldu r9, 8(r3) |
#1.m | Decode | Wait: #1.a | Ex | Ex | Ex: Wr r9 |
Finish | ||||||||||||
addi r9, r9, 100 |
#2 | Fetch | Decode | Wait: #1.m | Wait: #1.m | Wait: #1.m | Ex: Rd r9 |
Ex: Wr r9 |
Finish | ||||||||||
std r9, 0(r3) |
#3 | Fetch | Decode | Wait: #1.a #2 | Wait: #2 | Wait: #2 | Wait: #2 | Ex: Rd r3 r9 |
Ex | Ex | Finish | ||||||||
bdnz .L2 |
#4 | Fetch | Decode | Ex: Rd ctr |
Ex: Av ctr |
SW: #3 | SW: #3 | SW: #3 | SW: #3 | Ex: Wr ctr |
Finish | ||||||||
ldu r9, 8(r3) |
#5.a | Fetch | Decode | Ex: Rf #1.a:r3 |
Ex: Av r3 |
SW: #5.m | SW: #3 | SW: #3 | Ex: Wr r3 |
Finish | |||||||||
ldu r9, 8(r3) |
#5.m | Decode | Wait: #5.a | Ex | Ex | Ex: Av r9 |
SW: #3 | Ex: Wr r9 |
Finish | ||||||||||
addi r9, r9, 100 |
#6 | Fetch | Decode | Wait: #5.m | Wait: #5.m | Wait: #5.m | Ex: Rf #5.m:r9 |
Ex: Av r9 |
WaR Wait: r9 |
Ex: Wr r9 |
Finish | ||||||||
std r9, 0(r3) |
#7 | Fetch | Decode | Wait: #5.a #6 | Wait: #6 | Wait: #6 | Wait: #6 | Ex: Rf #6:r9 |
Ex | Ex | Finish | ||||||||
bdnz .L2 |
#8 | Fetch | Decode | Ex: Rf #4:ctr |
Ex: Av ctr |
SW: #7 | SW: #7 | SW: #7 | SW: #7 | SW: #7 | Ex: Wr ctr |
Finish | |||||||
ldu r9, 8(r3) |
#9.a | Fetch | Decode | Ex: Rf #5.m:r3 |
Ex: Av r3 |
SW: #9.m | SW: #7 | SW: #7 | SW: #7 | Ex: Wr r3 |
Finish | ||||||||
ldu r9, 8(r3) |
#9.m | Decode | Wait: #9.a | Ex | Ex | Ex: Av r9 |
SW: #7 | SW: #7 | Ex: Wr r9 |
Finish | |||||||||
addi r9, r9, 100 |
#10 | Fetch | Decode | Wait: #9.m | Wait: #9.m | Wait: #9.m | Ex: Rf #9.m:r9 |
Ex: Av r9 |
SW: #7 | WaR Wait: r9 |
Ex: Wr r9 |
Finish | |||||||
std r9, 0(r3) |
#11 | Fetch | Decode | Wait: #9.a #10 | Wait: #10 | Wait: #10 | Wait: #10 | Ex: Rf #9.a:r3 #10:r9 |
Ex | Ex | Finish | ||||||||
bdnz .L2 |
#12 | Fetch | Decode | Ex: Rf ctr |
Ex: Av ctr |
SW: #11 | SW: #11 | SW: #11 | SW: #11 | SW: #11 | Ex: Wr ctr |
Finish | |||||||
ldu r9, 8(r3) |
#13.a | Fetch | Decode | Ex: Rf #9.a:r3 |
Ex: Av r3 |
SW: #13.m | SW: #11 | SW: #11 | SW: #11 | Ex: Wr r3 |
Finish | ||||||||
ldu r9, 8(r3) |
#13.m | Decode | Wait: #13.a | Ex | Ex | Ex: Av r9 |
SW: #11 | SW: #11 | WaR Wait: r9 |
Ex: Wr r9 |
Finish | ||||||||
addi r9, r9, 100 |
#14 | Fetch | Decode | Wait: #13.m | Wait: #13.m | Wait: #13.m | Ex: Rf #13.m:r9 |
Ex: Av r9 |
SW: #11 | WaR Wait: r9 |
WaR Wait: r9 |
Ex: Wr r9 |
Finish | ||||||
std r9, 0(r3) |
#15 | Fetch | Decode | Wait: #13.a #14 | Wait: #14 | Wait: #14 | Wait: #14 | Ex: Rf #13.a:r3 #14:r9 |
Ex | Ex | Finish | ||||||||
bdnz .L2 |
#16 | Fetch | Decode | Ex: Rf #12:ctr |
Ex: Av ctr |
SW: #15 | SW: #15 | SW: #15 | SW: #15 | SW: #15 | Ex: Wr ctr |
Finish | |||||||
ldu r9, 8(r3) |
#17.a | Fetch | Decode | Ex: Rf #13.a:r3 |
Ex: Av r3 |
SW: #17.m | SW: #15 | SW: #15 | SW: #15 | Ex: Wr r3 |
Finish | ||||||||
ldu r9, 8(r3) |
#17.m | Decode | Wait: #17.a | Ex | Ex | Ex: Av r9 |
SW: #15 | SW: #15 | WaR Wait: r9 |
WaR Wait: r9 |
Ex: Wr r9 |
Finish | |||||||
addi r9, r9, 100 |
#18 | Fetch | Decode | Wait: #17.m | Wait: #17.m | Wait: #17.m | Ex: Rf #17.m:r9 |
Ex: Av r9 |
SW: #15 | WaR Wait: r9 |
WaR Wait: r9 |
WaR Wait: r9 |
Ex: Wr r9 |
Finish | |||||
std r9, 0(r3) |
#19 | Fetch | Decode | Wait: #17.a #18 | Wait: #18 | Wait: #18 | Wait: #18 | Ex: Rf #17.a:r3 #18:r9 |
Ex | Ex | Finish | ||||||||
bdnz .L2 |
#20 | Fetch | Decode | Ex: Rf #16:ctr |
Ex: Av ctr |
SW: #19 | SW: #19 | SW: #19 | SW: #19 | SW: #19 | Finish | ||||||||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |