RFC ls014 Advanced Scalar Bitmanipulation

Funded by NLnet under the Privacy and Enhanced Trust Programme, EU Horizon2020 Grant 825310, and NGI0 Entrust No 101069594
https://libre-soc.org/openpower/sv/rfc/ls014/
https://git.openpower.foundation/isa/PowerISA/issues/TODO
https://bugs.libre-soc.org/show_bug.cgi?id=1065

Severity: Major

Status: New

Date: 14 Apr 2023

Target: v3.2B

Source: v3.1B

Books and Section affected:

    Book I Fixed-Point Instructions
    Appendix E Power ISA sorted by opcode
    Appendix F Power ISA sorted by version
    Appendix G Power ISA sorted by Compliancy Subset
    Appendix H Power ISA sorted by mnemonic

Summary

    Instructions added: bmask, grevlut, grevluti

Submitter: Luke Leighton (Libre-SOC)

Requester: Libre-SOC

Impact on processor:

    Addition of new GPR-based instructions

Impact on software:

    Requires support for new instructions in assembler, debuggers,
    and related tools.

Keywords:

    LUTs, Bitmanipulation, GPR

Motivation

Scalar Bitmanipulation in other high-end ISAs have had BMI subsets for over a decade. Their use and benefit is well-understood and compiler integration well-established. bmask brings twenty four BMI instructions to the Power ISA.

grevlut on the other hand is highly experimental and extremely powerful. Normally only grev (Generalised Reverse) and occasionally gor are added to a Bitmanip-strong ISA: grevlut utilises LUTs and inversion to add 512 Generalised Reverse instructions. Desirable savings in general binary size are achieved.

Notes and Observations:

bmask is a synthesis and generalisation of every "TBM" instruction with additional options not found in any other ISA BMI group.
grevluti as a 32-bit Defined Word-instruction is capable of generating over a thousand useful regular-patterned 64-bit "magic constants" that otherwise require either a Load or require several instructions to synthesise
word halfword byte nibble 2-bit 1-bit reversal at multiple levels are all achieved with grevlut. Some of these instructions were explicitly added in Power ISA v3.1 but grevlut is akin to xxeval.
grevlut can be expensive in hardware (estimated 20,000 gates) but like xxeval provides 512 equivalent instructions.

Changes

Add the following entries to:

the Appendices of Book I
Book I 3.3.13 Fixed-Point Logical Instructions
Book I 1.6.1 and 1.6.2

\newpage{}

Rationale

bmask

Based on RVV masked set-before-first, set-after-first etc. and Intel and AMD Bitmanip instructions made generalised then advanced further to include masks, this is a single instruction covering 24 individual instructions in other ISAs.

The patterns within the pseudocode for AMD TBM and x86 BMI1 are as follows:

first pattern A: two options x or ~x
second pattern B: three options | & or ^
third pattern C: four options x+1, x-1, ~(x+1) or (~x)+1

Thus it makes sense to create a single instruction that covers all of these. A crucial addition that is essential for Scalable Vector usage as Predicate Masks, is the second mask parameter (RB). The additional paramater, L, if set, will leave bits of RA masked by RB unaltered, otherwise those bits are set to zero. Note that when RB=0 then instead of reading from the register file the mask is set to all ones.

Executable pseudocode demo:


def bmask(bm, RA, RB=None, zero=False, XLEN=64):
    mask = RB if RB is not None else ((1<<XLEN)-1)
    ra = RA & mask
    mode1 = bm&1
    a1 = ra if mode1 else ~ra
    mode2 = (bm >> 1) & 0b11
    if mode2 == 0: a2 = -ra
    if mode2 == 1: a2 = ra-1
    if mode2 == 2: a2 = ra+1
    if mode2 == 3: a2 = ~(ra+1)
    a1 = a1 & mask
    a2 = a2 & mask
    mode3 = (bm >> 3) & 0b11
    if mode3 == 0: RS = a1 | a2
    if mode3 == 1: RS = a1 & a2
    if mode3 == 2: RS = a1 ^ a2
    if mode3 == 3: RS = 0 # RESERVED
    RS &= mask
    if not zero:
        # put back masked-out bits of RA
        RS |= RA & ~mask
    return RS

SBF = 0b01010 # set before first
SOF = 0b01001 # set only first
SIF = 0b10000 # set including first 10011 also works no idea why yet

\newpage{}

SV Vector-assist Operations.

Links:

discussion
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions
https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html
https://bugs.libre-soc.org/show_bug.cgi?id=865 implementation in simulator
https://bugs.libre-soc.org/show_bug.cgi?id=213
https://bugs.libre-soc.org/show_bug.cgi?id=142 specialist vector ops out of scope for this document 3d vector ops
bitmanip previous version, contains pseudocode for sof, sif, sbf
https://en.m.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set#TBM_(Trailing_Bit_Manipulation)

The core Power ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. Therefore there are not that many cases where actual Vector instructions are needed. If they are, they are more "assistance" functions. Two traditional Vector instructions were initially considered (conflictd and vmiota) however they may be synthesised from existing SVP64 instructions: vmiota may use svstep. Details in discussion

Notes:

Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations. See cr int predication.

Mask-suited Bitmanipulation

BM2-Form

0..5	6..10	11..15	16..20	21-25	26	27..31	Form
PO	RS	RA	RB	bm	L	XO	BM2-Form

bmask RS,RA,RB,bm,L

Pseudo-code:

    if _RB = 0 then mask <- [1] * XLEN
    else            mask <- (RB)
    ra <- (RA) & mask
    a1 <- ra
    if bm[4] = 0 then a1 <- ¬ra
    mode2 <- bm[2:3]
    if mode2 = 0 then a2 <- (¬ra)+1
    if mode2 = 1 then a2 <- ra-1
    if mode2 = 2 then a2 <- ra+1
    if mode2 = 3 then a2 <- ¬(ra+1)
    a1 <- a1 & mask
    a2 <- a2 & mask
    # select operator
    mode3 <- bm[0:1]
    if mode3 = 0 then result <- a1 | a2
    if mode3 = 1 then result <- a1 & a2
    if mode3 = 2 then result <- a1 ^ a2
    if mode3 = 3 then result <- undefined([0]*XLEN)
    # mask output
    result <- result & mask
    # optionally restore masked-out bits
    if L = 1 then
        result <- result | (RA & ¬mask)
    RT <- result

first pattern A: two options x or ~x
second pattern B: three options | & or ^
third pattern C: four options x+1, x-1, ~(x+1) or (~x)+1

The lower two bits of bm set to 0b11 are RESERVED. An illegal instruction trap must be raised.

Special Registers Altered:

    None

Carry-lookahead

As a single scalar 32-bit instruction, up to 64 carry-propagation bits may be computed. When the output is then used as a Predicate mask it can be used to selectively perform the "add carry" of biginteger math, with sv.addi/sm=rN RT.v, RA.v, 1.

cprop RT,RA,RB (Rc=0)
cprop. RT,RA,RB (Rc=1)

pseudocode:

    P = (RA)
    G = (RB)
    RT = ((P|G)+G)^P

X-Form

0:5	6:10	11:15	16:20	21:30	31	name	Form
PO	RT	RA	RB	XO	Rc	cprop	X-Form

used not just for carry lookahead, also a special type of predication mask operation.

\newpage{}

Instruction Formats

Add the following entries to Book I 1.6.1 Word Instruction Formats:

MM-FORM

    |0    |6    |11   |16   |21   |24 |25  |31  |
    | PO  | FRT | FRA | FRB | FMM     | XO | Rc |
    | PO  | RT  | RA  | RB  | MMM | / | XO | Rc |

Add the following new fields to Book I 1.6.2 Word Instruction Fields:

    FMM (21:24)
        Field used to specify minimum/maximum mode for fminmax[s].

        Formats: MM

    MMM (21:23)
        Field used to specify minimum/maximum mode for integer minmax.

        Formats: MM

Add MM to the Formats: list for all of FRT, FRA, FRB, XO (25:30), Rc, RT, RA and RB.

\newpage{}

Appendices

Appendix E Power ISA sorted by opcode
Appendix F Power ISA sorted by version
Appendix G Power ISA sorted by Compliancy Subset
Appendix H Power ISA sorted by mnemonic

Form	Book	Page	Version	Mnemonic	Description
MM	I	#	3.2B	fminmax	Floating Minimum/Maximum
MM	I	#	3.2B	fminmaxs	Floating Minimum/Maximum Single
MM	I	#	3.2B	minmax	Minimum/Maximum