SV Vector-assist Operations.

The core Power ISA was designed as scalar: SV provides a level of abstraction that adds variable-length, element-independent parallelism. Consequently there are few cases where actual Vector instructions are needed, and those that are tend to be "assistance" functions. Two traditional Vector instructions were initially considered (conflictd and vmiota); however, they may be synthesised from existing SVP64 instructions: vmiota, for example, may use svstep. Details are in the discussion.

Notes:

  • Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
  • Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations. See cr int predication.

Mask-suited Bitmanipulation

Based on RVV's masked set-before-first, set-after-first etc., and on Intel and AMD Bitmanip instructions, generalised and then extended to include masks, this is a single instruction covering 24 individual instructions found in other ISAs. (sbf/sof/sif moved to discussion)

BM2-Form

| 0..5 | 6..10 | 11..15 | 16..20 | 21..25 | 26 | 27..31 | Form     |
|------|-------|--------|--------|--------|----|--------|----------|
| PO   | RS    | RA     | RB     | bm     | L  | XO     | BM2-Form |
  • bmask RS,RA,RB,bm,L

The patterns within the pseudocode for AMD TBM and x86 BMI1 are as follows:

  • first pattern A: two options x or ~x
  • second pattern B: three options | & or ^
  • third pattern C: four options (~x)+1 (i.e. -x), x-1, x+1 or ~(x+1)
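As a cross-check, the familiar x86 BMI1 operations fall out as particular (A, B, C) pattern combinations. A small sketch (the mnemonics are Intel's; the pattern annotations are illustrative):

```python
def blsi(x):
    return x & -x        # A = x, B = AND, C = (~x)+1: isolate lowest set bit

def blsmsk(x):
    return x ^ (x - 1)   # A = x, B = XOR, C = x-1: mask up to lowest set bit

def blsr(x):
    return x & (x - 1)   # A = x, B = AND, C = x-1: clear lowest set bit

x = 0b10110100
assert blsi(x)   == 0b00000100
assert blsmsk(x) == 0b00000111
assert blsr(x)   == 0b10110000
```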

Thus it makes sense to create a single instruction that covers all of these. A crucial addition, essential for Scalable Vector usage as Predicate Masks, is the second mask parameter (RB). The additional parameter, L, if set, leaves the bits of RA masked out by RB unaltered; otherwise those bits are set to zero. Note that when RB=0, instead of reading from the register file, the mask is set to all ones.

The lower two bits of bm set to 0b11 are RESERVED: an illegal-instruction trap must be raised.

Executable pseudocode demo:


def bmask(bm, RA, RB=None, zero=False, XLEN=64):
    # RB=None models RB=0: the mask is all ones instead of a register read
    mask = RB if RB is not None else ((1 << XLEN) - 1)
    ra = RA & mask
    # pattern A: x or ~x
    mode1 = bm & 1
    a1 = ra if mode1 else ~ra
    # pattern C: -x, x-1, x+1 or ~(x+1)
    mode2 = (bm >> 1) & 0b11
    if mode2 == 0: a2 = -ra
    if mode2 == 1: a2 = ra-1
    if mode2 == 2: a2 = ra+1
    if mode2 == 3: a2 = ~(ra+1)
    a1 = a1 & mask
    a2 = a2 & mask
    # pattern B: OR, AND or XOR (0b11 RESERVED)
    mode3 = (bm >> 3) & 0b11
    if mode3 == 0: RS = a1 | a2
    if mode3 == 1: RS = a1 & a2
    if mode3 == 2: RS = a1 ^ a2
    if mode3 == 3: RS = 0 # RESERVED
    RS &= mask
    if not zero:
        # L=1: put back masked-out bits of RA unaltered
        RS |= RA & ~mask
    return RS

SBF = 0b01010 # set before first
SOF = 0b01001 # set only first
SIF = 0b10000 # set including first (0b10011 also works: x^(x-1) == ~x^-x, since -x == ~(x-1))
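For example, with a single set bit marking the "first" element, the three encodings above reduce to the following bit-identities (a standalone sketch of what the pseudocode computes for these bm values):

```python
XLEN = 8
M = (1 << XLEN) - 1

x = 0b00010100           # first (lowest) set bit at position 2

sbf = ~x & (x - 1) & M   # bm=0b01010: bits strictly below the first set bit
sof = x & -x & M         # bm=0b01001: only the first set bit
sif = (~x ^ -x) & M      # bm=0b10000: bits up to and including the first set bit

assert sbf == 0b00000011
assert sof == 0b00000100
assert sif == 0b00000111
```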

Carry-lookahead

As a single scalar 32-bit instruction, up to 64 carry-propagation bits may be computed. When the output is then used as a Predicate mask it can selectively perform the "add carry" step of big-integer math, with sv.addi/sm=rN RT.v, RA.v, 1.

  • cprop RT,RA,RB
  • cprop. RT,RA,RB

pseudocode:

P = (RA)
G = (RB)
RT = ((P|G)+G)^P 
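A small worked example of the identity, using 4-bit values for readability:

```python
P = 0b0110               # propagate at bit positions 1 and 2
G = 0b0001               # generate at bit position 0
RT = ((P | G) + G) ^ P   # the carry generated at bit 0 ripples through bits 1-2
assert RT == 0b1110      # carries arrive into positions 1, 2 and 3
```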

X-Form

| 0..5 | 6..10 | 11..15 | 16..20 | 21..30     | 31 | name  | Form   |
|------|-------|--------|--------|------------|----|-------|--------|
| NN   | RT    | RA     | RB     | 0110001110 | Rc | cprop | X-Form |

This is used not just for carry lookahead but also as a special type of predicate-mask operation.

From QLSKY.png:

    x0 = nand(CIn, P0)
    C0 = nand(x0, ~G0)

    x1 = nand(CIn, P0, P1)
    y1 = nand(G0, P1)
    C1 = nand(x1, y1, ~G1)

    x2 = nand(CIn, P0, P1, P2)
    y2 = nand(G0, P1, P2)
    z2 = nand(G1, P2)
    C2 = nand(x2, y2, z2, ~G2)

    # Gen*
    x3 = nand(G0, P1, P2, P3)
    y3 = nand(G1, P2, P3)
    z3 = nand(G2, P3)
    G* = nand(x3, y3, z3, ~G3)
     P = (A | B) & Ci
     G = (A & B)

The Stackoverflow algorithm ((P|G)+G)^P works on the accumulated bits of P and G from the associated vector elements (P and G are integers here). The result of the algorithm is the new carry-in, which already includes ripple: one bit of carry per element.

    At each id, compute C[id] = A[id] + B[id] + 0 (no carry-in)
    Get G[id] = C[id] > radix - 1
    Get P[id] = C[id] == radix - 1
    Join all P[id] together into P, likewise all G[id] into G
    Compute newC = ((P|G)+G)^P
    result[id] = (C[id] + newC[id]) % radix
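The steps above can be sketched as follows (an illustration, not the ISA pseudocode; radix 10, digits least-significant first):

```python
def vec_add(A, B, radix=10):
    # per-element add, no carry between elements
    C = [a + b for a, b in zip(A, B)]
    # G[id]: element generates a carry; P[id]: element would propagate one
    G = sum((c > radix - 1) << i for i, c in enumerate(C))
    P = sum((c == radix - 1) << i for i, c in enumerate(C))
    newC = ((P | G) + G) ^ P   # bit i = carry arriving into element i
    return [(c + ((newC >> i) & 1)) % radix for i, c in enumerate(C)]

# 39 + 61 = 100: digits [9,3] + [1,6] -> [0,0], with a carry out the top
assert vec_add([9, 3], [1, 6]) == [0, 0]
```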

Two versions are planned: a scalar integer version and a CR-based version.

The scalar integer version acts as a scalar carry-propagate, reading XER.CA as input, P and G as registers, and taking a radix argument. The end bits go into XER.CA and CR0.ge.

The vector version takes CR0.so as carry-in, and stores the end bits in CR0.so and CR.ge.

If the result is zero (no propagation), CR0.eq is zero.

The CR-based version is TODO.