Positional popcount SVP64
Links
- https://bugs.libre-soc.org/show_bug.cgi?id=672
- https://github.com/clausecker/pospop/blob/master/countsse2_amd64.s
- RISC-V Bitmanip Extension, Document Version 0.94-draft, Editor: Claire Wolf (Symbiotic GmbH): https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-draft.pdf
Funding
This project is funded through the NGI Assure Fund, a fund established by NLnet with financial support from the European Commission's Next Generation Internet program. Learn more at the NLnet project page.
Introduction
Positional popcount in optimised assembler is typically done on SIMD
ISAs in around 500 lines. The implementations are extraordinarily
complex and very hard to understand. The Power ISA, thanks to bpermd,
can be much more efficient: with SVP64, even more so. The reference implementation
showing the concept is below.
// Copyright (c) 2020 Robert Clausecker <fuz@fuz.su>
// count8 reference implementation for tests. Do not alter.
func count8safe(counts *[8]int, buf []uint8) {
    for i := range buf {
        for j := 0; j < 8; j++ {
            counts[j] += int(buf[i] >> j & 1)
        }
    }
}
A simple but still hardware-paralleliseable SVP64 assembler for
8-bit input values (count8safe) is as follows:
mtspr 9, 3 # move r3 to CTR
setvl 3,0,8,0,1,1 # set MVL=8, VL=r3=MIN(MVL,CTR)
# load VL bytes (update r4 addr) but compressed (dw=8)
addi 6, 0, 0 # initialise all 64-bits of r6 to zero
sv.lbzu/pi/dw=8 *6, 1(4) # should be /lf here as well
# gather performs the transpose (which gets us to positional..)
gbbd 8,6
# now those bits have been turned around, popcount and sum them
setvl 0,0,8,0,1,1 # set MVL=VL=8
sv.popcntd/sw=8 *24,*8 # do the (now transposed) popcount
sv.add *16,*16,*24 # and accumulate in results
# branch back if CTR still non-zero. works even though VL=8
sv.bc/all 16, *0, -0x28 # reduce CTR by VL and stop if -ve
Array popcount is just the standard popcount function (Hamming weight) applied across an array of values, horizontally; positional popcount however is different (vertical). Refer to Fig.1.
Positional popcount adds up, for each bit position, the total number of bits set to 1 across an array of input values. Refer to Fig.2.
Visual representation of the pospopcount algorithm
In order to perform positional popcount we need to go through the series of steps shown below in figures 3, 4, 5 and 6. The limitation to overcome is that the CPU can only work on registers (horizontally), but the requirement of pospopcount is to add vertically. Part of the challenge is therefore to perform an appropriate transpose of the data, in blocks that suit the processor and the ISA capacity.
Fig.3 depicts how the data is divided into blocks of 8*8-bit.
Fig.4 shows that a transpose is needed on each block. The gbbd instruction is used for implementing the transpose function, preparing the data for using the standard popcount instruction.
Now, on each 8*8 block, 8 popcount instructions can be run, each of which is independent and can therefore be parallelised even by In-Order multi-issue hardware (Fig.5).
Fig.6 depicts how each of the intermediate results is accumulated. It is worth noting that each intermediate result is independent of the others, and that parallel reduction can additionally be applied to each of them individually. This gives two opportunities for hardware parallelism rather than one.
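To make the steps of figures 3 to 6 concrete, here is a minimal plain-Go sketch of the block-oriented algorithm (the names transpose8x8 and count8blocks are purely illustrative, not part of the reference code): split the input into 8-byte blocks, transpose each block as an 8x8 bit-matrix, popcount each transposed byte, and accumulate per bit-position.

package main

import (
    "fmt"
    "math/bits"
)

// transpose8x8 treats 8 input bytes as an 8x8 bit-matrix and
// transposes it: bit i of output byte j equals bit j of input byte i.
// This is the property that gbbd provides on a 64-bit register.
func transpose8x8(in [8]uint8) (out [8]uint8) {
    for i := 0; i < 8; i++ {
        for j := 0; j < 8; j++ {
            out[j] |= (in[i] >> j & 1) << i
        }
    }
    return out
}

// count8blocks follows figures 3-6: split the input into 8-byte
// blocks (the final short block is zero-padded, which is harmless
// because zero bytes contribute nothing), transpose each block,
// popcount each transposed byte, and accumulate per bit-position.
func count8blocks(counts *[8]int, buf []uint8) {
    for len(buf) > 0 {
        var block [8]uint8
        n := copy(block[:], buf)
        buf = buf[n:]
        t := transpose8x8(block)
        for j := 0; j < 8; j++ {
            counts[j] += bits.OnesCount8(t[j])
        }
    }
}

func main() {
    var counts [8]int
    count8blocks(&counts, []uint8{0xFF, 0x01, 0x80})
    fmt.Println(counts) // prints [2 1 1 1 1 1 1 2]
}

The zero-padding of the final short block is the same trick the assembler uses below: padding with zeros never disturbs the counts.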
In short this algorithm is very straightforward to implement thanks to the
two crucial instructions, gbbd and popcntd. Below is a walkthrough
of the assembler, keeping it very simple, and exploiting only one of the
opportunities for parallelism (by not including the Parallel Reduction
opportunity mentioned above).
Walkthrough of the assembler
Firstly the CTR (Counter) SPR is set up; it is key to the looping, as outlined further below.
mtspr 9, 3 # move r3 to CTR
The Vector Length, which is limited to 8 (MVL, Maximum Vector Length), is set up. A special "CTR" Mode is used which automatically uses the CTR SPR rather than register RA. (Note that RA is set to zero to indicate this, because there is limited encoding space. See the setvl instruction specification for details.)
The result of this instruction is that if CTR is greater than 8, VL is set
to 8. If however CTR is less than or equal to 8, then VL is set to CTR.
Additionally, a copy of VL is placed into RT (r3 in this case). This again
is a consequence of the limited encoding space, but in some cases (not here)
it is desirable, avoiding an mfspr instruction to take a copy of VL into a GPR.
# VL = r3 = MIN(CTR,MAXVL=8)
setvl 3,0,8,0,1,1 # set MVL=8, VL=MIN(MVL,CTR)
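As a rough Go model of what this instruction computes (setvlCTR is an illustrative name only, not a real instruction or API):

package pospop

// setvlCTR models the CTR-mode setvl used here: VL is set to the
// remaining element count in CTR, clamped to MVL (8 in this
// algorithm); the same value is also written into RT (r3 in the
// listing above).
func setvlCTR(ctr, mvl int) int {
    if ctr < mvl {
        return ctr
    }
    return mvl
}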
Next the data must be loaded in batches (of up to QTY 8 of 8-bit values). We must also not over-run the end of the array (which a normal SIMD ISA would do, potentially causing a pagefault): this is avoided because, when CTR <= 8, VL (Vector Length) is set to the number of remaining elements.
The next instruction is sv.lbzu/pi - a Post-Increment variant of lbzu
which updates the Effective Address (in RA) after it has been used,
not before. Also of note is the element-width override of dw=8, which
writes to individual bytes of r6, not to r6/7/8...13.
Of note here however is the precursor addi instruction, which clears
out the contents of r6 to zero before beginning the Vector/Looped Load
sequence. This is important because SV does not zero out parts
of a register when element-width overrides are in use.
# load VL bytes (update r4 addr) but compressed (dw=8)
addi 6, 0, 0 # initialise all 64-bits of r6 to zero
sv.lbzu/pi/dw=8 *6, 1(4) # should be /lf here as well
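The combined effect of these two instructions can be sketched in Go as follows (loadBytesDW8 is an illustrative name; the model glosses over the real register file and addressing details):

package pospop

// loadBytesDW8 models the addi + sv.lbzu/pi/dw=8 pair: the 64-bit
// destination is cleared first, then up to vl bytes are read from buf
// starting at addr, each written into the next 8-bit slice of the
// register (the dw=8 element-width override), with the address
// post-incremented after every element.
func loadBytesDW8(buf []uint8, addr, vl int) (r6 uint64, next int) {
    r6 = 0 // the precursor addi clears the whole register
    for i := 0; i < vl; i++ {
        r6 |= uint64(buf[addr]) << (8 * i)
        addr++ // post-increment of the effective address
    }
    return r6, addr
}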
The next instruction is quite straightforward, and has the effect of turning (transposing) 8 values. Bit zero of each input byte is placed into the first result byte; bit one of each input byte is placed into the second result byte, and so on.
(This instruction in VSX is called vgbbd, and it is present in many
other ISAs. RISC-V bitmanip-draft-0.94 calls it bmatflip, and x86 calls it
GF2P8AFFINEQB. Power ISA, Cray, and many others all have this extremely
powerful and useful instruction.)
# gather performs the transpose (which gets us to positional..)
gbbd 8,6
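A simple Go model of the transpose property relied on here (gbbdModel is an illustrative name; Power ISA bit-numbering conventions are deliberately glossed over, since only the 8x8 bit-matrix transpose effect matters for this algorithm):

package pospop

// gbbdModel is a plain-Go model of the 8x8 bit-matrix transpose which
// gbbd performs on a 64-bit register: bit i of output byte j is
// bit j of input byte i.
func gbbdModel(r6 uint64) (r8 uint64) {
    for i := 0; i < 8; i++ {
        inByte := (r6 >> (8 * i)) & 0xFF
        for j := 0; j < 8; j++ {
            r8 |= ((inByte >> j) & 1) << (8*j + i)
        }
    }
    return r8
}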
Now that the bits have been transposed into bytes, it is plain sailing to perform QTY 8 parallel popcounts. However there is a trick going on: we have set VL=MVL=8. To explain this: it covers the last case, where CTR may be between 1 and 8.
Remember what happened back at the Vector-Load, where r6 was cleared to
zero beforehand? This filled out the 8x8 transposed grid (gbbd) fully
with zeros prior to the actual transpose. Now when we do the popcount,
the upper-numbered columns that are not part of the result contain zero
and consequently have no impact on the result.
This elegant trick extends even to the accumulation of the
results. However, before we get there it is necessary to describe why
sw=8 has been added to sv.popcntd. What this does is treat
each byte of the input (starting at the first byte of r8) as an
independent quantity to be popcounted, with each result zero-extended
to 64-bit and stored in r24, r25, r26 ... r31.
Therefore:
- r24 contains the popcount of the first byte of r8
- r25 contains the popcount of the second byte of r8
- ...
- r31 contains the popcount of the last (eighth) byte of r8
# now those bits have been turned around, popcount and sum them
setvl 0,0,8,0,1,1 # set MVL=VL=8
sv.popcntd/sw=8 *24,*8 # do the (now transposed) popcount
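Modelled in Go, the sw=8 source-width override behaves roughly as follows (popcntBytes is an illustrative name only):

package pospop

import "math/bits"

// popcntBytes models sv.popcntd/sw=8: each byte of the 64-bit source
// is popcounted as an independent quantity and the result is
// zero-extended into its own 64-bit destination (r24..r31 in the
// assembler above).
func popcntBytes(r8 uint64) (results [8]uint64) {
    for j := 0; j < 8; j++ {
        b := uint8(r8 >> (8 * j))
        results[j] = uint64(bits.OnesCount8(b))
    }
    return results
}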
With VL being hard-set to 8, we can now Vector-Parallel sum the partial
results r24-r31 into the array of accumulators r16-r23. There was no need
to set VL=8 again because it was already set by the setvl above.
sv.add *16,*16,*24 # and accumulate in results
Finally, sv.bc loops back, subtracting VL from CTR. Being based on
the Power ISA Scalar Branch-Conditional, it does not matter if CTR goes
negative (which it will, because VL was hard-coded to 8 above): the loop
will still be correctly exited. In other words we complete the correct
number of blocks (of length 8).
# branch back if CTR still non-zero. works even though VL=8
sv.bc/all 16, *0, -0x28 # reduce CTR by VL and stop if -ve
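Pulling the sketches above together, the overall control flow of the assembler can be modelled as the following Go loop (reusing the illustrative helpers from the earlier sketches, all in the same hypothetical package; none of these names are part of any real API):

package pospop

// count8Model ties the sketches above together, mirroring the control
// flow of the assembler: CTR holds the element count, VL is re-derived
// for each block, and the loop exits once CTR, decremented by VL
// (hard-set to 8 at the branch), reaches or passes zero.
func count8Model(counts *[8]int, buf []uint8) {
    ctr := len(buf) // mtspr 9, 3
    addr := 0
    for ctr > 0 {
        vl := setvlCTR(ctr, 8)                  // setvl: VL = MIN(CTR, 8)
        r6, next := loadBytesDW8(buf, addr, vl) // addi + sv.lbzu/pi/dw=8
        addr = next
        r8 := gbbdModel(r6)      // gbbd: 8x8 bit-matrix transpose
        partial := popcntBytes(r8) // sv.popcntd/sw=8
        for j := 0; j < 8; j++ { // sv.add: accumulate into the result array
            counts[j] += int(partial[j])
        }
        ctr -= 8 // sv.bc: CTR -= VL (8); going negative simply ends the loop
    }
}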
Improvements
There exist further opportunities for parallelism that simpler hardware would need to see made explicit in order to maximise performance. On Out-of-Order hardware the above extremely simple and very clear algorithm will achieve extremely high levels of performance simply by exploiting standard Multi-Issue Register Hazard Management.
However simpler hardware - in-order - will need a little bit of
assistance, and that can come in the form of expanding to QTY4 or QTY8
64-bit blocks (so that sv.popcntd uses MVL=VL=32 or MVL=VL=64), with gbbd
becoming sv.gbbd but with VL set to the block count (QTY4 or QTY8),
and with the SV REMAP Parallel Reduction Schedule applied to each
intermediary result rather than using an array of straight accumulators
r16-r23. However this starts to push the boundaries of the number of
registers needed, so as an exercise it is left for another time.
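For illustration only, the shape of such a parallel reduction can be sketched in plain Go over per-block partial results (treeReduce is a hypothetical name; the real REMAP schedule operates on registers, not Go slices):

package pospop

// treeReduce sketches the shape of a parallel (tree) reduction over
// per-block partial count arrays: at every level, pairs of partials
// are summed, halving their number until a single result remains.
// A REMAP Parallel Reduction Schedule applies the same pattern to
// accumulators held in registers rather than in Go slices.
func treeReduce(partials [][8]int) [8]int {
    if len(partials) == 0 {
        return [8]int{}
    }
    for len(partials) > 1 {
        half := (len(partials) + 1) / 2
        for i := 0; i+half < len(partials); i++ {
            for j := 0; j < 8; j++ {
                partials[i][j] += partials[i+half][j]
            }
        }
        partials = partials[:half]
    }
    return partials[0]
}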
Conclusion
Where a normal SIMD ISA requires explicit hand-crafted optimisation in order to achieve full utilisation of the underlying hardware, Simple-V can instead rely to a large extent on standard Multi-Issue hardware to achieve similar performance, whilst crucially keeping the algorithm implementation down to a shockingly-simple degree that makes it easy to understand and easy to review. Also, as with many other algorithms when implemented in Simple-V SVP64, by keeping to a LOAD-COMPUTE-STORE paradigm the L1 Data Cache usage is minimised, and in this case, just as with chacha20, the entire algorithm - being only 9 lines of assembler fitting into 13 4-byte words - can fit into a single L1 I-Cache Line without triggering Virtual Memory TLB misses.
Further performance improvements are achievable by using REMAP Parallel Reduction, still fitting into a single L1 Cache line, but beginning to approach the limit of the 128-entry register file.