Positional popcount SVP64

Funding

This project is funded through the NGI Assure Fund, a fund established by NLnet with financial support from the European Commission's Next Generation Internet program. Learn more at the NLnet project page.

Introduction

Positional popcount in optimised assembler is typically done on SIMD ISAs in around 500 lines of extraordinarily complex, very hard to understand code. The Power ISA, thanks to bpermd, can be much more efficient; with SVP64, even more so. The reference implementation showing the concept is below.

// Copyright (c) 2020 Robert Clausecker <fuz@fuz.su>
// count8 reference implementation for tests.  Do not alter.
func count8safe(counts *[8]int, buf []uint8) {
    for i := range buf {
        for j := 0; j < 8; j++ {
            counts[j] += int(buf[i] >> j & 1)
        }
    }
}

A simple but still hardware-paralleliseable SVP64 assembler for 8-bit input values (count8safe) is as follows:

mtspr 9, 3                 # move r3 to CTR
setvl 3,0,8,0,1,1          # set MVL=8, VL=r3=MIN(MVL,CTR)
# load VL bytes (update r4 addr) but compressed (dw=8)
addi 6, 0, 0               # initialise all 64-bits of r6 to zero
sv.lbzu/pi/dw=8 *6, 1(4)   # should be /lf here as well
# gather performs the transpose (which gets us to positional..)
gbbd 8,6
# now those bits have been turned around, popcount and sum them
setvl 0,0,8,0,1,1          # set MVL=VL=8
sv.popcntd/sw=8 *24,*8     # do the (now transposed) popcount
sv.add *16,*16,*24         # and accumulate in results
# branch back if CTR still non-zero. works even though VL=8
sv.bc/all 16, *0, -0x28   # reduce CTR by VL and stop if -ve

Array popcount is just the standard popcount function (Hamming weight) applied across an array of values, horizontally. Positional popcount, however, is different: it works vertically. Refer to Fig.1

[Fig.1: pospopcnt]

Positional popcount counts, for each bit position, how many values in an array of inputs have that bit set to 1. Refer to Fig.2

[Fig.2: pospopcnt]

Visual representation of the pospopcount algorithm
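The horizontal/vertical distinction can be sketched in a few lines of Go (the function names here are illustrative, not part of the reference code):

```go
package main

import (
	"fmt"
	"math/bits"
)

// arrayPopcount: horizontal - one total Hamming weight across the array.
func arrayPopcount(buf []uint8) int {
	total := 0
	for _, v := range buf {
		total += bits.OnesCount8(v)
	}
	return total
}

// pospopcount: vertical - a separate count per bit position.
func pospopcount(buf []uint8) (counts [8]int) {
	for _, v := range buf {
		for j := 0; j < 8; j++ {
			counts[j] += int(v >> j & 1)
		}
	}
	return counts
}

func main() {
	buf := []uint8{0b0011, 0b0001, 0b0110}
	fmt.Println(arrayPopcount(buf)) // total bits set: 5
	fmt.Println(pospopcount(buf))   // per-position: [2 2 1 0 0 0 0 0]
}
```

Both walk the same array; the difference is purely in whether bits are summed across each value or down each bit-column.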

In order to perform positional popcount we need to go through a series of steps, shown below in Figures 3, 4, 5 and 6. The limitation to overcome is that the CPU can only work on registers (horizontally), whereas pospopcount requires adding vertically. Part of the challenge is therefore to perform an appropriate transpose of the data, in blocks that suit the processor and the capabilities of the ISA.

Fig.3 depicts how the data is divided into blocks of 8x8 bits (eight 8-bit values per block).

[Fig.3: pospopcnt]

Fig.4 shows that a transpose is needed on each block. The gbbd instruction implements the transpose, preparing the data for use with the standard popcount instruction.

[Fig.4: pospopcnt]

Now, on each 8x8 block, 8 popcount instructions can be run, each of which is independent of the others and can therefore be parallelised even by In-Order multi-issue hardware (Fig.5).

[Fig.5: pospopcnt]

Fig.6 depicts how the intermediate results are accumulated. It is worth noting that each intermediate result is independent of the others, and that Parallel Reduction can additionally be applied to each of them individually. This gives two opportunities for hardware parallelism rather than one.

[Fig.6: pospopcnt]

In short, this algorithm is very straightforward to implement thanks to two crucial instructions: gbbd and popcntd. Below is a walkthrough of the assembler, kept very simple, exploiting only one of the two opportunities for parallelism (the Parallel Reduction opportunity mentioned above is not used).

Walkthrough of the assembler

Firstly the CTR (Counter) SPR is set up; it is key to the looping, as outlined further below.

mtspr 9, 3                # move r3 to CTR

The Vector Length, which is limited to 8 (MVL - Maximum Vector Length), is set up next. A special "CTR" Mode is used which automatically uses the CTR SPR rather than register RA. (Note that RA is set to zero to indicate this, because there is limited encoding space. See the setvl instruction specification for details).

The result of this instruction is that if CTR is greater than 8, VL is set to 8. If however CTR is less than or equal to 8, then VL is set to CTR. Additionally, a copy of VL is placed into RT (r3 in this case), which again is necessary as part of the limited encoding space but in some cases (not here) this is desirable, and avoids a mfspr instruction to take a copy of VL into a GPR.

# VL = r3 = MIN(CTR,MAXVL=8)
setvl 3,0,8,0,1,1         # set MVL=8, VL=MIN(MVL,CTR)
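The clamping behaviour can be modelled in a few lines of Go (a sketch of the effect only, not the full setvl specification):

```go
package main

import "fmt"

// setvlCTR models the effect of "setvl 3,0,8,0,1,1" in CTR mode:
// VL is set to MVL when the counter exceeds it, otherwise to the
// remaining counter value itself.
func setvlCTR(ctr, mvl int) int {
	if ctr > mvl {
		return mvl
	}
	return ctr
}

func main() {
	fmt.Println(setvlCTR(100, 8)) // 8: a full batch remains
	fmt.Println(setvlCTR(5, 8))   // 5: the final, partial batch
}
```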

Next the data must be loaded in batches of up to eight 8-bit values. We must also not over-run the end of the array (which a normal SIMD ISA would do, potentially causing a pagefault); this is achieved because when CTR <= 8, VL (Vector Length) is set to the number of remaining elements.

The next instruction is sv.lbzu/pi - a Post-Increment variant of lbzu which updates the Effective Address (in RA) after it has been used, not before. Also of note is the element-width override of dw=8 which writes to individual bytes of r6, not to r6/7/8...13.

Of note here however is the precursor addi instruction, which clears out the contents of r6 to zero before beginning the Vector/Looped Load sequence. This is important because SV does not zero-out parts of a register when element-width overrides are in use.

# load VL bytes (update r4 addr) but compressed (dw=8)
addi 6, 0, 0               # initialise all 64-bits of r6 to zero
sv.lbzu/pi/dw=8 *6, 1(4)   # should be /lf here as well
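The combined effect of the zeroing addi and the byte-compressed load can be sketched in Go (loadCompressed is an illustrative name, not an instruction):

```go
package main

import "fmt"

// loadCompressed models the addi + sv.lbzu/pi/dw=8 pair: a 64-bit
// register is first cleared, then filled with up to vl bytes from
// buf, one byte per element slot. Unused upper slots stay zero,
// which is exactly why the precursor addi matters.
func loadCompressed(buf []uint8, vl int) uint64 {
	var r6 uint64 // addi 6, 0, 0: all 64 bits zeroed
	for i := 0; i < vl; i++ {
		r6 |= uint64(buf[i]) << (8 * i)
	}
	return r6
}

func main() {
	fmt.Printf("%#x\n", loadCompressed([]uint8{0x11, 0x22, 0x33}, 3)) // 0x332211
}
```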

The next instruction is quite straightforward, and has the effect of turning (transposing) 8 values. Each bit zero of each input byte is placed into the first byte; each bit one of each input byte is placed into the second byte and so on.

(This instruction in VSX is called vgbbd, and it is present in many other ISAs. RISC-V bitmanip-draft-0.94 calls it bmatflip, x86 calls it GF2P8AFFINEQB. Power ISA, Cray, and many others all have this extremely powerful and useful instruction)

# gather performs the transpose (which gets us to positional..)
gbbd 8,6
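The transpose effect can be modelled in Go as follows. Note this is a sketch: LSB0 bit numbering is used here for clarity, whereas the real instruction is specified in Power ISA MSB0 ordering.

```go
package main

import "fmt"

// gbbdModel sketches gbbd as an 8x8 bit-matrix transpose: bit j of
// input byte i becomes bit i of output byte j, so output byte j
// gathers bit j of every input byte.
func gbbdModel(in uint64) (out uint64) {
	for i := 0; i < 8; i++ {
		for j := 0; j < 8; j++ {
			bit := in >> (8*i + j) & 1
			out |= bit << (8*j + i)
		}
	}
	return out
}

func main() {
	// input bytes 0x03, 0x01, 0x06 packed LSB-first (upper bytes zero)
	fmt.Printf("%#x\n", gbbdModel(0x060103))
}
```

A transpose is its own inverse, so applying the model twice returns the original value - a handy sanity check.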

Now that the bits have been transposed within bytes, it is plain sailing to perform QTY8 parallel popcounts. However there is a trick going on: we have set VL=MVL=8 even though fewer than 8 bytes may have been loaded. This covers the final iteration, where CTR may be anywhere between 1 and 8.

Remember what happened back at the Vector-Load, where r6 was cleared to zero beforehand? This filled out the 8x8 grid fully with zeros prior to the transpose (gbbd). Now, when we do the popcount, any upper-numbered columns that are not part of the result contain zero, and consequently have no impact on it.

This elegant trick extends even to the accumulation of the results. Before we get there, however, it is necessary to describe why sw=8 has been added to sv.popcntd. It treats each byte of its input (starting at the first byte of r8) as an independent quantity to be popcounted; each result is zero-extended to 64-bit and stored in r24, r25, r26 ... r31.

Therefore:

  • r24 contains the popcount of the first byte of r8
  • r25 contains the popcount of the second byte of r8
  • ...
  • r31 contains the popcount of the eighth and last byte of r8

# now those bits have been turned around, popcount and sum them
setvl 0,0,8,0,1,1          # set MVL=VL=8
sv.popcntd/sw=8 *24,*8     # do the (now transposed) popcount
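The per-byte popcount semantics can be sketched in Go (popcntBytes is an illustrative name; the array models the destination registers r24..r31):

```go
package main

import (
	"fmt"
	"math/bits"
)

// popcntBytes models sv.popcntd/sw=8: each byte of r8 is treated as
// an independent source element; its popcount is zero-extended to
// 64 bits and written to its own destination register slot.
func popcntBytes(r8 uint64) (r [8]uint64) {
	for j := 0; j < 8; j++ {
		r[j] = uint64(bits.OnesCount8(uint8(r8 >> (8 * j))))
	}
	return r
}

func main() {
	// transposed bytes 0x03, 0x05, 0x04 in the low three byte slots
	fmt.Println(popcntBytes(0x040503)) // [2 2 1 0 0 0 0 0]
}
```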

With VL hard-set to 8, we can now Vector-Parallel sum the partial results r24-r31 into the array of accumulators r16-r23. There was no need to set VL=8 again because it was already set by the setvl above.

sv.add *16,*16,*24         # and accumulate in results

Finally, sv.bc loops back, subtracting VL from CTR. Being based on the Power ISA Scalar Branch-Conditional, it does not matter if CTR goes negative (which it will, because VL was hard-coded to 8 above): the loop is still correctly exited. In other words, we complete the correct number of blocks (of length 8).

# branch back if CTR still non-zero. works even though VL=8
sv.bc/all 16, *0, -0x28   # reduce CTR by VL and stop if -ve
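To tie the walkthrough together, here is a Go model of the complete loop - load, zero-pad, transpose, per-byte popcount, accumulate - which can be checked against the count8safe reference. As before, LSB0 bit numbering is a simplification of the real gbbd; the structure of each iteration mirrors the assembler.

```go
package main

import (
	"fmt"
	"math/bits"
)

// count8safe is the reference implementation from above.
func count8safe(counts *[8]int, buf []uint8) {
	for i := range buf {
		for j := 0; j < 8; j++ {
			counts[j] += int(buf[i] >> j & 1)
		}
	}
}

// count8svp64 mirrors the SVP64 loop: up to 8 bytes per iteration.
func count8svp64(counts *[8]int, buf []uint8) {
	for len(buf) > 0 {
		vl := len(buf) // setvl: VL = MIN(CTR, MVL=8)
		if vl > 8 {
			vl = 8
		}
		var r6 uint64                 // addi 6, 0, 0
		for i := 0; i < vl; i++ {     // sv.lbzu/pi/dw=8
			r6 |= uint64(buf[i]) << (8 * i)
		}
		var r8 uint64 // gbbd 8,6: 8x8 bit-matrix transpose
		for i := 0; i < 8; i++ {
			for j := 0; j < 8; j++ {
				r8 |= (r6 >> (8*i + j) & 1) << (8*j + i)
			}
		}
		for j := 0; j < 8; j++ { // sv.popcntd/sw=8 + sv.add
			counts[j] += bits.OnesCount8(uint8(r8 >> (8 * j)))
		}
		buf = buf[vl:] // sv.bc: CTR -= VL, loop while positive
	}
}

func main() {
	buf := []uint8{0x03, 0x01, 0x06, 0xff, 0x80, 0x7f, 0x55, 0xaa, 0x0f}
	var ref, got [8]int
	count8safe(&ref, buf)
	count8svp64(&got, buf)
	fmt.Println(ref, got) // the two results are identical
}
```

Note that the 9-element input exercises the tail case: the final iteration loads a single byte into an otherwise-zeroed register, and the zero columns contribute nothing to the counts.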

Improvements

There exist many opportunities for parallelism that simpler hardware would need to exploit in order to maximise performance. On Out-of-Order hardware the above extremely simple and very clear algorithm will achieve extremely high levels of performance simply by exploiting standard Multi-Issue Register Hazard Management.

However simpler - in-order - hardware will need a little assistance, and that can come in the form of expanding to QTY4 or QTY8 64-bit blocks (so that sv.popcntd uses MVL=VL=32 or MVL=VL=64), with gbbd becoming an sv.gbbd whose VL is set to the block count (QTY4 or QTY8), and with the SV REMAP Parallel Reduction Schedule applied to each intermediary result rather than using the straight array of accumulators r16-r23. However this starts to push the boundaries of the number of registers needed, so it is left as an exercise for another time.
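The Parallel Reduction idea itself can be sketched in Go (a generic tree reduction, not the actual REMAP Schedule):

```go
package main

import "fmt"

// treeReduce sums partial accumulators pairwise in log2(n) steps.
// Within each step the additions touch disjoint elements, so an
// in-order multi-issue machine can execute them in parallel.
func treeReduce(partials []int) int {
	for step := 1; step < len(partials); step *= 2 {
		for i := 0; i+step < len(partials); i += 2 * step {
			partials[i] += partials[i+step] // independent adds
		}
	}
	return partials[0]
}

func main() {
	fmt.Println(treeReduce([]int{1, 2, 3, 4, 5, 6, 7, 8})) // 36
}
```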

Conclusion

Where a normal SIMD ISA requires explicit hand-crafted optimisation in order to achieve full utilisation of the underlying hardware, Simple-V can instead rely to a large extent on standard Multi-Issue hardware to achieve similar performance, whilst crucially keeping the algorithm implementation down to a shockingly-simple degree that makes it easy to understand and easy to review. As with many other algorithms implemented in Simple-V SVP64, by keeping to a LOAD-COMPUTE-STORE paradigm the L1 Data Cache usage is minimised; and in this case, just as with chacha20, the entire algorithm - being only 9 lines of assembler fitting into 13 4-byte words - can fit into a single L1 I-Cache Line without triggering Virtual Memory TLB misses.

Further performance improvements are achievable by using REMAP Parallel Reduction, still fitting into a single L1 Cache line, but beginning to approach the limit of the 128-entry register file.