Example execution of vector chaining through masks stored in integer registers
As described in bug 213 comment 56 and bug 213 comment 53.
The examples below use the following assembly language:
# starts with VL = 20
vec.cmpw.ge r10, r30, r50
andc r12, r10, r11
vec.fadds f30, f30, f50, mask=r12
The examples assume a computer with a 4x32-bit SIMD integer pipe, a 4x32-bit SIMD FP pipe, and a scalar integer pipe.
For ease of viewing, the following examples combine the execution pipelines and the scheduling FUs, which are actually separate things.
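To make the data flow concrete, the following Python sketch models what the three-instruction sequence above computes, element by element. The semantics are assumptions made for illustration and are not taken from the bug comments: `vec.cmpw.ge` is taken to set mask bit i when element i of the first source is greater than or equal to element i of the second, `andc` is taken to compute `ra AND NOT rb`, and the masked `vec.fadds` is taken to leave unselected lanes unchanged. The helper names and the shortened VL of 4 are hypothetical.

```python
import struct

def f32(x):
    """Round a Python float to single precision, as a single-precision add would."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def vec_cmpw_ge(a, b):
    """Return an integer mask: bit i is set when a[i] >= b[i] (assumed semantics)."""
    mask = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if x >= y:
            mask |= 1 << i
    return mask

def andc(ra, rb, width=64):
    """ra AND (NOT rb), truncated to a 64-bit integer register."""
    return ra & ~rb & ((1 << width) - 1)

def vec_fadds_masked(dst, src, mask):
    """Add src[i] into dst[i] only where mask bit i is set; other lanes are kept."""
    return [f32(d + s) if (mask >> i) & 1 else d
            for i, (d, s) in enumerate(zip(dst, src))]

# VL shortened from 20 to 4 to keep the example readable.
r30 = [5, 1, 7, 3]
r50 = [2, 4, 7, 9]
r11 = 0b0001                      # lanes to exclude from the mask
f30 = [1.0, 2.0, 3.0, 4.0]
f50 = [0.5, 0.5, 0.5, 0.5]

r10 = vec_cmpw_ge(r30, r50)       # lanes 0 and 2 pass the compare
r12 = andc(r10, r11)              # lane 0 is then removed
result = vec_fadds_masked(f30, f50, r12)
print(bin(r10), bin(r12), result)  # 0b101 0b100 [1.0, 2.0, 3.5, 4.0]
```

The point to notice is that lane i of the `fadds` depends only on bit i of `r12`, which in turn depends only on bit i of `r10` and `r11`. That per-lane dependency is what the scheduler can exploit or fail to exploit in the two cases below.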
Treating integer registers as whole registers at the scheduler level (vector chaining doesn't work)
Slow because the fadds instructions have to wait for all the cmpw.ge and andc instructions to complete before any can start executing.
Cycle | SIMD integer pipe | SIMD FP pipe | scalar integer pipe |
---|---|---|---|
0 | r10.0-3 <- cmpw.ge r30-31, r50-51 | waiting on r12.0-63 | waiting on r10.0-63 |
1 | r10.4-7 <- cmpw.ge r32-33, r52-53 | waiting on r12.0-63 | waiting on r10.0-63 |
2 | r10.8-11 <- cmpw.ge r34-35, r54-55 | waiting on r12.0-63 | waiting on r10.0-63 |
3 | r10.12-15 <- cmpw.ge r36-37, r56-57 | waiting on r12.0-63 | waiting on r10.0-63 |
4 | r10.16-19 <- cmpw.ge r38-39, r58-59 | waiting on r12.0-63 | waiting on r10.0-63 |
5 | | waiting on r12.0-63 | r12.0-63 <- andc r10.0-63, r11 |
6 | | f30-31 <- fadds f30-31, f50-51, mask=r12.0-3 | |
7 | | f32-33 <- fadds f32-33, f52-53, mask=r12.4-7 | |
8 | | f34-35 <- fadds f34-35, f54-55, mask=r12.8-11 | |
9 | | f36-37 <- fadds f36-37, f56-57, mask=r12.12-15 | |
10 | | f38-39 <- fadds f38-39, f58-59, mask=r12.16-19 | |
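The cycle counts in the table above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming every operation has a one-cycle latency, each pipe handles one group of 4 elements per cycle, and a result becomes usable on the cycle after it is produced. The variable names are hypothetical.

```python
VL = 20
LANES = 4                       # 4x32-bit SIMD pipes
groups = VL // LANES            # 5 groups of 4 elements each

# cmpw.ge has no dependencies, so one group issues per cycle from cycle 0.
cmp_cycles = list(range(groups))

# With r10 tracked as one whole register, andc cannot start until the
# last cmpw.ge group has written its bits.
andc_cycle = cmp_cycles[-1] + 1

# With r12 tracked as one whole register, no fadds group can start until
# the single andc has completed.
fadds_cycles = [andc_cycle + 1 + g for g in range(groups)]

for g in range(groups):
    lo, hi = g * LANES, g * LANES + LANES - 1
    print(f"cmpw.ge -> r10.{lo}-{hi}: cycle {cmp_cycles[g]}, "
          f"fadds mask=r12.{lo}-{hi}: cycle {fadds_cycles[g]}")
print("andc cycle:", andc_cycle)          # 5
print("last cycle:", fadds_cycles[-1])    # 10
```

Under these assumptions the three instructions run strictly one after another, so the total is the sum of their individual durations: 5 + 1 + 5 cycles, numbered 0 to 10.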
Treating integer registers* as many single-bit registers at the scheduler level (vector chaining works)
* or at least the register(s) optimized for usage as masks
Faster because fadds instructions only have to wait for their vector lanes' mask bits to complete, rather than all vector lanes.
Cycle | SIMD integer pipe | SIMD FP pipe | scalar integer pipe |
---|---|---|---|
0 | r10.0-3 <- cmpw.ge r30-31, r50-51 | waiting on r12.0-3 | waiting on r10.0-3 |
1 | r10.4-7 <- cmpw.ge r32-33, r52-53 | waiting on r12.0-3 | r12.0-3 <- andc r10.0-3, r11.0-3 |
2 | r10.8-11 <- cmpw.ge r34-35, r54-55 | f30-31 <- fadds f30-31, f50-51, mask=r12.0-3 | r12.4-7 <- andc r10.4-7, r11.4-7 |
3 | r10.12-15 <- cmpw.ge r36-37, r56-57 | f32-33 <- fadds f32-33, f52-53, mask=r12.4-7 | r12.8-11 <- andc r10.8-11, r11.8-11 |
4 | r10.16-19 <- cmpw.ge r38-39, r58-59 | f34-35 <- fadds f34-35, f54-55, mask=r12.8-11 | r12.12-15 <- andc r10.12-15, r11.12-15 |
5 | | f36-37 <- fadds f36-37, f56-57, mask=r12.12-15 | r12.16-19 <- andc r10.16-19, r11.16-19 |
6 | | f38-39 <- fadds f38-39, f58-59, mask=r12.16-19 | |
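The chained schedule above can be sketched as a small in-order issue model in which each group of 4 mask bits is tracked separately. It assumes one-cycle latency, one group per pipe per cycle, and that the `andc` is split into 4-bit pieces as the table shows. The `issue` helper is hypothetical and is not part of any real scheduler.

```python
def issue(ready_cycles):
    """Issue groups in order, one per cycle, none earlier than its inputs are ready."""
    issued = []
    for ready in ready_cycles:
        free = issued[-1] + 1 if issued else 0   # next cycle the pipe is free
        issued.append(max(ready, free))
    return issued

VL = 20
LANES = 4
groups = VL // LANES                                # 5 groups of 4 mask bits

cmp_cycles = issue([0] * groups)                    # no input dependencies
# Each andc piece waits only for its own 4 bits of r10.
andc_cycles = issue([c + 1 for c in cmp_cycles])
# Each fadds group waits only for its own 4 bits of r12.
fadds_cycles = issue([c + 1 for c in andc_cycles])

print("cmpw.ge:", cmp_cycles)      # [0, 1, 2, 3, 4]
print("andc:   ", andc_cycles)     # [1, 2, 3, 4, 5]
print("fadds:  ", fadds_cycles)    # [2, 3, 4, 5, 6]
```

Because each stage trails the previous one by a single cycle, the three instructions overlap and the last `fadds` group issues in cycle 6 rather than cycle 10.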