daxpy
This is a standard textbook algorithm demonstration for Vector and SIMD ISAs.
Summary
| ISA   | total instructions | loop instructions | code size (32-bit words) | notes              |
|-------|--------------------|-------------------|--------------------------|--------------------|
| SVP64 | 8                  | 6                 | 13                       | 5 64-bit, 3 32-bit |
| RVV   | 13                 | 11                | 9.5                      | 7 32-bit, 5 16-bit |
| SVE   | 12                 | 7                 | 12                       | all 32-bit         |
C code
void daxpy(size_t n, double a, const double x[], double y[]) {
    for (size_t i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}
SVP64 Power ISA version
In SVP64 Power ISA assembler the algorithm, despite being amenable to easy parallelism in hardware, is almost deceptively simple and straightforward. There are however some key additions over the Standard Scalar (SFFS Subset) Power ISA 3.0 that need explaining.
# r5: n count; r6: x ptr; r7: y ptr; fp1: a
    mtctr 5                    # move n to CTR
.L2 setvl MAXVL=32,VL=CTR      # actually VL=MIN(MAXVL,CTR)
    sv.lfdup   *32,8(6)        # load x into fp32-63, incr x
    sv.lfd/els *64,8(7)        # load y into fp64-95, NO INC
    sv.fmadd   *64,*32,1,*64   # (*y) = a * (*x) + (*y)
    sv.stfdup  *64,8(7)        # store at y, post-incr y
    sv.bc/ctr  .L2             # decr CTR by VL, jump !zero
    blr                        # return
The first instruction is simple: the plan is to use CTR for looping, so n (r5) is copied into CTR. The next instruction, at the start of the loop (.L2), is less obvious: setvl sets MAXVL to 32 elements while simultaneously setting VL to MIN(MAXVL, CTR).
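As a rough C sketch of that rule (illustrative only, not the normative setvl pseudocode), each pass of the loop handles a batch of at most MAXVL elements, with the final pass shrinking to whatever remains:

#include <stddef.h>

/* Sketch: the batch length chosen at the top of each pass.
 * maxvl is the compile-time limit (32 in the listing above);
 * ctr is the number of elements still left to process. */
size_t setvl_model(size_t maxvl, size_t ctr) {
    return ctr < maxvl ? ctr : maxvl;   /* VL = MIN(MAXVL, CTR) */
}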
This algorithm relies on post-increment, on there being no overlap between x and y in memory, and, critically, on y being overwritten in place. x is post-incremented when read, whereas y is post-incremented on write. Load/Store Post-Increment is a new Draft set of instructions for the Scalar Subsets, which saves having to pre-subtract an offset before entering the loop.
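The no-overlap requirement is exactly what the restrict qualifier expresses in C. A hedged sketch of the reference function with that assumption made explicit (an illustrative variant with a hypothetical name, not the code used for the comparison above) is:

#include <stddef.h>

/* Sketch: the same daxpy reference, with restrict documenting the
 * assumption that x and y never alias. */
void daxpy_restrict(size_t n, double a,
                    const double *restrict x, double *restrict y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}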
For sv.lfdup, RA is Scalar and so continuously accumulates additions of the immediate (8), but only after RA has been used as the Effective Address for each element. The last write to RA is therefore the address of the next block (the next time round the CTR loop).
To understand this it is necessary to appreciate that SVP64 behaves as if a sequence of loop-unrolled scalar instructions had been issued. Because each element-instruction in that sequence writes the new version of RA before the next one runs, the end result is identical in effect to Element-Strided addressing, except that RA is left pointing at the start of the next batch.
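A C sketch of that element-by-element behaviour (illustrative only, with hypothetical names standing in for the registers) would be:

#include <stddef.h>

/* Sketch: sv.lfdup *32,8(6) modelled element by element.
 * 'ra' stands for r6 (the x pointer), 'fp32' for the fp32..fp63 block.
 * Each element uses the current value of ra as its Effective Address,
 * and only afterwards is the immediate (8 bytes, one double) added,
 * so after VL elements ra points at the start of the next batch. */
void sv_lfdup_model(double fp32[], const double **ra, size_t vl) {
    for (size_t i = 0; i < vl; i++) {
        fp32[i] = **ra;    /* load from EA = RA                 */
        (*ra)++;           /* RA += 8 (one double), afterwards  */
    }
}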
If sv.lfdup were not available, sv.lfdu could be used to the same effect, but RA would have to be pre-subtracted by one element outside of the loop. Due to the compactness of this highly hardware-parallelizable algorithm, that one additional instruction would increase the implementation code size by around 8 percent (one extra 32-bit word on top of thirteen)! This helps explain why Post-Increment Update Load/Store instructions are so important.
Use of Element-Strided mode on sv.lfd/els ensures that the Immediate (8) results in a contiguous load without modifying RA: the first element is loaded from RA, the second from RA+8, the third from RA+16 and so on. Unlike sv.lfdup, however, RA remains pointing at the block of the y array currently being processed.
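The contrast with sv.lfdup can be seen in a similar sketch (again illustrative, hypothetical names): every Effective Address is computed from the unchanged base:

#include <stddef.h>

/* Sketch: sv.lfd/els *64,8(7) modelled element by element.
 * 'ra' stands for r7 (the y pointer), 'fp64' for the fp64..fp95 block.
 * EA = RA + i*8 for element i; RA itself is never written, so r7
 * still points at the current batch of y afterwards. */
void sv_lfd_els_model(double fp64[], const double *ra, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        fp64[i] = ra[i];   /* load from EA = RA + i*8, RA unchanged */
}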
With parts of both x and y loaded into batches of FP registers starting at fp32 and fp64 respectively, a multiply-and-accumulate can be carried out. The scalar constant a is in fp1.
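In scalar-equivalent terms (a sketch with hypothetical names, not the formal element-loop pseudocode), the vectorised fmadd reuses the Scalar constant in fp1 for every element of the batch:

#include <stddef.h>

/* Sketch: sv.fmadd *64,*32,1,*64 modelled element by element.
 * fp64[i] = fp32[i] * fp1 + fp64[i], i.e. y = a*x + y, with the scalar
 * constant a (fp1) reused for every element because it is marked Scalar. */
void sv_fmadd_model(double fp64[], const double fp32[],
                    double fp1, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        fp64[i] = fp32[i] * fp1 + fp64[i];
}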
Because the pointer to y was not updated by the sv.lfd/els instruction, y (r7) is still pointing at the right place to store the results. However, given that we want r7 to end up pointing at the start of the next batch, sv.stfdup can now be used, which post-increments RA repeatedly by 8.
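The store side mirrors the sv.lfdup sketch above (again a hedged model with hypothetical names): each element is written to the current RA, and only afterwards is RA advanced, which is what leaves r7 at the start of the next batch:

#include <stddef.h>

/* Sketch: sv.stfdup *64,8(7) modelled element by element.
 * 'ra' stands for r7 (the y pointer); after VL stores it points
 * at the start of the next batch of y. */
void sv_stfdup_model(const double fp64[], double **ra, size_t vl) {
    for (size_t i = 0; i < vl; i++) {
        **ra = fp64[i];    /* store to EA = RA         */
        (*ra)++;           /* RA += 8, after the store */
    }
}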
Now that a batch of length VL has been processed, the next step is to decrement CTR by VL. Thanks to the setvl instruction we know that either VL equals CTR, or, if CTR was greater than MAXVL, that VL equals MAXVL. Therefore, when sv.bc/ctr decrements CTR by VL, we can be confident that CTR will only reach zero when there is no more of the array left to process.
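Putting the pieces together, a scalar-equivalent C model of the whole loop (an illustrative sketch only: register names are mapped onto ordinary C variables, and MAXVL is taken from the setvl line above) might read:

#include <stddef.h>

#define MAXVL 32   /* as set by setvl in the listing */

/* Sketch: the whole SVP64 daxpy loop modelled in plain C.
 * ctr plays the role of CTR, x/y of r6/r7, a of fp1, and the two
 * local arrays of fp32-63 and fp64-95. */
void daxpy_svp64_model(size_t n, double a, const double *x, double *y) {
    size_t ctr = n;                               /* mtctr 5            */
    while (ctr != 0) {                            /* .L2                */
        size_t vl = ctr < MAXVL ? ctr : MAXVL;    /* setvl              */
        double fp32[MAXVL], fp64[MAXVL];
        for (size_t i = 0; i < vl; i++) fp32[i] = *x++;   /* sv.lfdup   */
        for (size_t i = 0; i < vl; i++) fp64[i] = y[i];   /* sv.lfd/els */
        for (size_t i = 0; i < vl; i++)
            fp64[i] = fp32[i] * a + fp64[i];              /* sv.fmadd   */
        for (size_t i = 0; i < vl; i++) *y++ = fp64[i];   /* sv.stfdup  */
        ctr -= vl;                       /* sv.bc/ctr: loop while != 0  */
    }
}                                                 /* blr                */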
Thus in an elegant way each RISC instruction is actually quite sophisticated, but not a huge CISC-like difference from the original Power ISA. Scalar Power ISA already has LD/ST-Update (pre-increment on RA): we propose adding Post-Increment (Motorola 68000 and 8086 have had both for decades). Power ISA branch-conditional has had Decrement-CTR since its inception: we propose in SVP64 to add "Decrement CTR by VL". The end result is an exceptionally compact daxpy that is easy to read and understand.
RVV version
# a0 is n, a1 is pointer to x[0], a2 is pointer to y[0], fa0 is a
li t0, 2<<25
vsetdcfg t0 # enable 2 64b Fl.Pt. registers
loop:
setvl t0, a0 # vl = t0 = min(mvl, n)
vld v0, a1 # load vector x
c.slli t1, t0, 3 # t1 = vl * 8 (in bytes)
vld v1, a2 # load vector y
c.add a1, a1, t1 # increment pointer to x by vl*8
vfmadd v1, v0, fa0, v1 # v1 += v0 * fa0 (y = a * x + y)
c.sub a0, a0, t0 # n -= vl (t0)
vst v1, a2 # store Y
c.add a2, a2, t1 # increment pointer to y by vl*8
c.bnez a0, loop # repeat if n != 0
c.ret # return
SVE Version
// x0 = &x[0], x1 = &y[0], x2 = &a, x3 = &n
daxpy_:
    ldrsw x3, [x3]                      // x3 = *n
    mov   x4, #0                        // x4 = i = 0
    whilelt p0.d, x4, x3                // p0 = while(i++ < n)
    ld1rd z0.d, p0/z, [x2]              // p0:z0 = bcast(*a)
.loop:
    ld1d  z1.d, p0/z, [x0, x4, lsl #3]  // p0:z1 = x[i]
    ld1d  z2.d, p0/z, [x1, x4, lsl #3]  // p0:z2 = y[i]
    fmla  z2.d, p0/m, z1.d, z0.d        // p0?z2 += x[i]*a
    st1d  z2.d, p0, [x1, x4, lsl #3]    // p0?y[i] = z2
    incd  x4                            // i += (VL/64)
.latch:
    whilelt p0.d, x4, x3                // p0 = while(i++ < n)
    b.first .loop                       // more to do?
    ret