OpenPOWER SV setvl/setvli

See links:

Use of setvl results in changes to the MVL, VL and STATE SPRs. see sprs

Behaviour and Rationale

SV's Vector Engine is based on Cray-style Variable-length Vectorisation, just like RVV. However unlike RVV, SV sits on top of the standard Scalar regfiles: there is no separate Vector register numbering. Therefore, also unlike RVV, SV does not have hard-coded "Lanes": microarchitects may use ordinary in-order, out-of-order, or superscalar designs as the basis for SV. By contrast, the relevant parameter in RVV is "MAXVL" and this is architecturally hard-coded into RVV systems, anywhere from 1 to tens of thousands of Lanes in supercomputers.

SV is more like how MMX used to sit on top of the x86 FP regfile. Therefore when Vector operations are performed, the question has to be asked, "well, how much of the regfile do you want to allocate to this operation?" because if it is too small an amount performance may be affected, and if too large then other registers would overlap and cause data corruption, or even if allocated correctly would require spill to memory.

The answer effectively needs to be parameterised. Hence: MAXVL (MVL) is set from an immediate, so that the compiler may decide, statically, a guaranteed resource allocation according to the needs of the application.

While RVV's MAXVL was a hw limit, SV's MVL is simply a loop optimization. It does not carry side-effects for the arch, though for a specific cpu it may affect hw unit usage.

Other than being able to set MVL, SV's VL (Vector Length) works just like RVV's VL, with one minor twist. RVV permits the setvl instruction to set VL to an arbitrary explicit value. Within the limit of MVL, VL MUST be set to the requested value. Given that RVV only works on Vector Loops, this is fine and part of its value and design. However, SV sits on top of the standard register files. When MVL=VL=2, a Vector Add on r3 will perform two Scalar Adds: one on r3 and one on r4.

Thus there is the opportunity to set VL to an explicit value (within the limits of MVL) with the reasonable expectation that if two operations are requested (by setting VL=2) then two operations are guaranteed. This avoids the need for a loop (with not-insignificant use of the regfiles for counters), simply two instructions:

setvli r0, MVL=64, VL=64
ld r0.v, 0(r30) # load exactly 64 registers from memory

Page Faults etc. aside this is guaranteed 100% without fail to perform 64 unit-strided LDs starting from the address pointed to by r30 and put the contents into r0 through r63. Thus it becomes a "LOAD-MULTI". Twin Predication could even be used to only load relevant registers from the stack. This only works if VL is set to the requested value rather than, as in RVV, allowing the hardware to set VL to an arbitrary value (caveat being, limited to not exceed MVL)

Also available is the option to set VL from CTR (VL = MIN(CTR, MVL). In combination with SVP64 branches this can save one instruction inside critical inner loops.

Format

(Allocation of opcode TBD pending OPF ISA WG approval), using EXT22 temporarily and fitting into the bitmanip space

Form: SVL-Form (see fields.text)

0.5 6.10 11.15 16..21 22...25 26.30 31 name
OPCD RT RA SVi cv ms vs vf 11110 Rc setvl

Note that the immediate (SVi) spans 7 bits (16 to 22)

  • cv - bit 22 - reads CTR instead of RA
  • ms - bit 23 - allows for setting of MVL.
  • vs - bit 24 - allows for setting of VL.
  • vf - bit 25 - sets "Vertical First Mode".

Note that in immediate setting mode VL and MVL start from one i.e. that an immediate value of zero will result in VL/MVL being set to 1. 0b111111 results in VL/MVL being set to 64. This is because setting VL/MVL to 1 results in "scalar identity" behaviour, where setting VL/MVL to 0 would result in all Vector operations becoming nop. If this is truly desired (nop behaviour) then setting VL and MVL to zero is to be done via the SVSTATE SPR

Note that setmvli is a pseudo-op, based on RA/RT=0, and setvli likewise

setvli VL=8    : setvl r5, r0, VL=8
setmvli MVL=8  : setvl r0, r0, MVL=8

Additional pseudo-op for obtaining VL without modifying it:

getvl r5       : setvl r5, r0, vf=0, vs=0, ms=0

For Vertical-First mode, a pseudo-op for explicit incrementing of srcstep and dststep:

svstep.        : setvl. 0, 0, vf=1, vs=0, ms=0

Note that whilst it is possible to set both MVL and VL from the same immediate, it is not possible to set them to different immediates in the same instruction. That would require two instructions.

setmvlhi

In Vertical-First Mode the minimum expectation is that one scalar element may be executed by each instruction. There are however circumstances where it may be possible to execute more than one element per instruction (srcstep elements 0-3 for example) but leaving it up to hardware to determine a "safe minimum amount" where memory corruption does not occur may not be practical (or is simply very costly).

Therefore, setmvlhi may specify, as determined by the compiler, exactly what that quantity is. Unlike VL, which is an amount that, when requested, must be executed, VFhint may be set by the hardware to an amount that the hardware is capable of. In other words: setmvlhi requests a hint size, bur hardware chooses the actual hint.

The reason for this cooperative negotiation between hardware and software is that whilst the compiler may have information about memory hazards that must be avoided which hardware cannot know about, the hardware knows the maximum batch size it can execute in parallel but the compiler is unaware of the variance in that batch size on different implementations. Thus, hardware sets VLHint to the minimum of the requested amount and the hardware limit. Simple implementations always set VLHint to 1.

Critical to note are two things:

  1. VFhint must not be set by hardware to an amount that exceeds either MVL or the requested amount, and must set VFhint to at least 1 element.
  2. svstep will increment srcstep and dststep by VFhint, therefore when hardware says it can perform N element operations, hardware MUST perform N operations for every single instruction.

Form: SVL-Form (see fields.text)

0.5 6.10 11.15 16..21 22 23...25 26.30 31 name
OPCD RT MVL SVi MVL ms vs vf 10110 Rc setmvlhi

Vertical First Mode

Vertical First is effectively like an implicit single bit predicate applied to every SVP64 instruction. ONLY one element in each SVP64 Vector instruction is executed; srcstep and dststep do not increment, and the Program Counter progresses *immediately to the next instruction just as it would for any standard scalar v3.0B instruction.

An explicit mode of setvl is called which can move srcstep and dststep on to the next element, still respecting predicate masks.

In other words, where normal SVP64 Vectorisation acts "horizontally" by looping first through 0 to VL-1 and only then moving the PC to the next instruction, Vertical-First moves the PC onwards (vertically) through multiple instructions with the same srcstep and dststep, then an explict instruction used to advance srcstep/dststep, and an outer loop is expected to be used (branch instruction) which completes a series of Vector operations.

svstep mode is enabled when vf=1, vs=0 and ms=0. When Rc=1 it is possible to determine when any level of loops reach an end condition, or if VL has been reached. The immediate can be reinterpreted as indicating which SVSTATE (0-3) should be tested and placed into CR0.

  • setvl immediate = 1: only VL testing is enabled. CR0.SO is set to 1 when either srcstep or dststep reach VL
  • setvl immediate = 2: also include inner middle and outer loop end conditions from SVSTATE0 into CR.EQ CR.LE CR.GT
  • setvl immediate = 3: test SVSTATE1
  • setvl immediate = 4: test SVSTATE2
  • setvl immediate = 5: test SVSTATE3

Testing any end condition of any loop of any REMAP state allows branches to be used to create loops.

Programmers should be aware that VL, srcstep and dststep are global in nature. Nested looping with different schedules is perfectly possible, as is calling of functions, however SVSTATE (and any associated SVSTATE) should be stored on the stack.

Pseudocode

// instruction fields:
rd = get_rt_field();         // bits 6..10
ra = get_ra_field();         // bits 11..15
vc = get_vc_field();         // bit 22
vf = get_vf_field();         // bit 23
vs = get_vs_field();         // bit 24
ms = get_ms_field();         // bit 25
Rc = get_Rc_field();         // bit 31

if vf and not vs and not ms {
    // increment src/dest step mode
    // NOTE! this is in no way complete! predication is not included
    // and neither is SUB-VL mode
    srcstep = SPR[SV].srcstep
    dststep = SPR[SV].dststep
    VL = SPR[SV].VL
    srcstep++
    dststep++
    rollover = (srcstep == VL or dststep == VL)
    if rollover:
        // Reset srcstep, dststep, and also exit "Vertical First" mode
        srcstep = 0
        dststep = 0
        MSR[6] = 0
    SPR[SV].srcstep = srcstep
    SPR[SV].dststep = dststep

    // write CR? helps for doing Vertical loops, detects end
    // of Vector Elements
    if Rc {
        // update CR to indicate that srcstep/dststep "rolled over"
        CR0.eq = rollover
    }
} else {
    // add one. MVL/VL=1..64 not 0..63
    vlimmed = get_immed_field()+1; //  16..22

    // set VL (or not).
    // 4 options: from SPR, from immed, from ra, from CTR
    if vs {
       // VL to be sourced from fields/regs
       if vc {
           VL = CTR
       } else if ra != 0 {
           VL = GPR[ra]
       } else {
           VL = vlimmed
       }
    } else {
       // VL not to change (except if MVL is reduced)
       // read from SPRs
       VL = SPR[SV_VL]
    }

    // set MVL (or not).
    // 2 options: from SPR, from immed
    if ms {
       MVL = vlimmed
    } else {
       // MVL not to change, read from SPRs
       MVL = SPR[SV_MVL]
    }

    // calculate (limit) VL
    VL = min(VL, MVL)

    // store VL, MVL
    SVSTATE.VL = VL
    SVSTATE.MVL = MVL

    // write rd
    if rt != 0 {
        // rt is not zero
        regs[rt] = VL;
    }
    // write CR?
    if Rc {
        // update CR from VL (not rt)
        CR0.eq = (VL == 0)
        ...
        ...
    }
    // write Vertical-First mode
    SVSTATE.vf = vf
}

Examples

Core concept loop

loop:
    setvl a3, a0, MVL=8    #  update a3 with vl
                           # (# of elements this iteration)
                           # set MVL to 8
    # do vector operations at up to 8 length (MVL=8)
    # ...
    sub a0, a0, a3   # Decrement count by vl
    bnez a0, loop    # Any more?

Loop using Rc=1

my_fn:
  li r3, 1000
  b test
loop:
  sub r3, r3, r4
  ...
test:
  setvli. r4, r3, MVL=64
  bne cr0, loop
end:
  blr

setmvlhi double loop

Two elements per inner loop are executed per instruction. This assumes that underlying hardware, when setmvlhi requests a parallelism hint of 2 actually sets a parallelism hint of 2.

This example, in c, would be:

long *r4;
for (i=0; i < CTR; i++)
{
    r4[i+2] += r4[i]
}

where, clearly, it is not possible to do more than 2 elements in parallel at a time: attempting to do so would result in data corruption. The compiler may be able to determine memory aliases and inform hardware at runtime of the maximum safe parallelism limit.

Whilst this example could be simplified to simply set VL=2, or exploit the fact that overlapping adds have well-defined behaviour, this has not been done, here, for illustrative purposes in order to demonstrate setmvhli and Vertical-First Mode.

Note, crucially, how r4, r32 and r20 are NOT incremented inside the inner loop. The MAXVL reservation is still 8, i.e. as srcstep and dststep advance (by 2 elements at a time) registers r20-r27 will be used for the first LD, and registers r32-39 for the second LD. r4+srcstep*8 will be used as the elstrided offset for LDs.

   setmvlhi  8, 2 # MVL=8, VFHint=2
loop:
    setvl  r1, CTR, vf=1 # VL=r1=MAX(MVL, CTR), VF=1
    mulli  r1, r1, 8     # multiply by int width
loopinner:
    sv.ld r20.v, r4(0) # load VLhint elements (max 2)
    addi r2, r4, 16    # 2 elements ahead
    sv.ld r32.v, r2(0) # load VLhint elements (max 2)
    sv.add r32.v, r20.v, r32.v # x[i+2] += x[i]
    sv.st r32.v, r2(0) # store VLhint elements
    svstep.            # srcstep += VLhint
    bnz loopinner      # repeat until srcstep=VL
    # now done VL elements, move to next batch
    add r4, r4, r1     # move r4 pointer forward
    sv.bnz/ctr loop    # decrement CTR by VL