DRAFT setvl/setvli

Use of setvl results in changes to the SVSTATE SPR. See sprs.

Behaviour and Rationale

SV's Vector Engine is based on Cray-style Variable-length Vectorisation, just like RVV. However unlike RVV, SV sits on top of the standard Scalar regfiles: there is no separate Vector register numbering. Therefore, also unlike RVV, SV does not have hard-coded "Lanes": microarchitects may use ordinary in-order, out-of-order, or superscalar designs as the basis for SV. By contrast, the relevant parameter in RVV is "MAXVL" and this is architecturally hard-coded into RVV systems, anywhere from 1 to tens of thousands of Lanes in supercomputers.

SV is more like how MMX used to sit on top of the x86 FP regfile. Therefore, when Vector operations are performed, the question has to be asked: "how much of the regfile do you want to allocate to this operation?" If the amount is too small, performance may be affected. If it is too large, other registers would overlap and cause data corruption, or, even if allocated correctly, would require spill to memory.

The answer effectively needs to be parameterised. Hence: MAXVL (MVL) is set from an immediate, so that the compiler may decide, statically, a guaranteed resource allocation according to the needs of the application.

While RVV's MAXVL is a hardware limit, SV's MVL is simply a loop optimization. It carries no architectural side-effects, though on a specific CPU it may affect hardware unit usage.

Other than being able to set MVL, SV's VL (Vector Length) works just like RVV's VL, with one minor twist. RVV permits the hardware to set VL to an arbitrary value in response to setvl. Given that RVV only works on Vector Loops, this is fine and part of its value and design. In SV, by contrast, VL MUST be set to the requested value, within the limit of MVL. The reason is that SV sits on top of the standard register files: when MVL=VL=2, a Vector Add on r3 will perform two Scalar Adds: one on r3 and one on r4.

Thus there is the opportunity to set VL to an explicit value (within the limits of MVL) with the reasonable expectation that if two operations are requested (by setting VL=2) then two operations are guaranteed. This avoids the need for a loop (with not-insignificant use of the regfiles for counters), simply two instructions:

setvli r0, MVL=64, VL=64
sv.ld *r0, 0(r30) # load exactly 64 registers from memory

Page Faults etc. aside, this is guaranteed to perform 64 unit-strided LDs starting from the address pointed to by r30 and to put the contents into r0 through r63. Thus it becomes a "LOAD-MULTI". Twin Predication could even be used to only load relevant registers from the stack. This only works if VL is set to the requested value rather than, as in RVV, allowing the hardware to set VL to an arbitrary value (due to variances in implementation choices).
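The "LOAD-MULTI" behaviour can be modelled in a few lines of Python. This is only an illustrative sketch: `regs`, `mem` and the helper `sv_ld` are hypothetical names, memory is indexed in whole 64-bit words for simplicity, and page faults are ignored.

```python
def sv_ld(regs, mem, rt, base, vl):
    """Model a unit-strided vector load as vl scalar loads
    into consecutive registers of one flat register file."""
    for i in range(vl):
        regs[rt + i] = mem[base + i]  # element i goes to register rt+i

regs = [0] * 128                      # hypothetical flat register file
mem = list(range(1000, 1100))         # hypothetical memory, one word per entry

mvl = 64
vl = min(mvl, 64)                     # setvli MVL=64, VL=64: VL is exactly 64
sv_ld(regs, mem, rt=0, base=0, vl=vl)
```

Because VL is exactly the requested value, the number of registers written is known statically, which is what makes the two-instruction sequence usable without a loop.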

Also available is the option to set VL from CTR (VL = MIN(CTR, MVL)). In combination with SVP64 branches this can save one instruction inside critical inner loops. A caveat: to avoid having an extra opcode bit in setvl, selection of CTR mode is slightly convoluted.

Format

(Allocation of opcode TBD pending OPF ISA WG approval), using EXT22 temporarily and fitting into the bitmanip space

Form: SVL-Form (see fields.text)

0..5   6..10   11..15   16..22   23..25     26..30   31   name
OPCD   RT      RA       SVi      ms vs vf   11011    Rc   setvl

Instruction format:

setvl RT,RA,SVi,vf,vs,ms
setvl. RT,RA,SVi,vf,vs,ms

Note that the immediate (SVi) spans 7 bits (16 to 22)

  • ms - bit 23 - allows for setting of MVL
  • vs - bit 24 - allows for setting of VL
  • vf - bit 25 - sets "Vertical First Mode".

Note that in immediate setting mode VL and MVL start from one but that this is compensated for in the assembly notation. i.e. that an immediate value of 1 in assembler notation actually places the value 0b0000000 in the SVi field bits: on execution the setvl instruction adds one to the decoded SVi field bits, resulting in VL/MVL being set to 1. This allows VL to be set to values ranging from 1 to 128 with only 7 bits instead of 8. Setting VL/MVL to 0 would result in all Vector operations becoming nop. If this is truly desired (nop behaviour) then setting VL and MVL to zero is to be done via the SVSTATE SPR.
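The off-by-one encoding described above can be sketched in Python. The helper names `encode_svi` and `decode_svi` are hypothetical and exist only for illustration; they model the assembler side and the execution side respectively.

```python
def encode_svi(value):
    """Assembler side: map a VL/MVL of 1..128 to the 7-bit SVi field."""
    if not 1 <= value <= 128:
        raise ValueError("immediate VL/MVL must be in the range 1..128")
    return value - 1

def decode_svi(field):
    """Execution side: add one to the 7-bit SVi field to recover VL/MVL."""
    return (field & 0b1111111) + 1
```

Zero is deliberately not representable here, matching the text: a VL or MVL of zero has to be written through the SVSTATE SPR instead.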

Note that setmvli and setvli are both pseudo-ops, based on RA/RT=0:

setvli   VL=8   : setvl  r0, r0, VL=8, vf=0, vs=1, ms=0
setvli.  VL=8   : setvl. r0, r0, VL=8, vf=0, vs=1, ms=0
setmvli  MVL=8  : setvl  r0, r0, MVL=8, vf=0, vs=0, ms=1
setmvli. MVL=8  : setvl. r0, r0, MVL=8, vf=0, vs=0, ms=1

Additional pseudo-op for obtaining VL without modifying it (or any state):

getvl  r5      : setvl  r5, r0, vf=0, vs=0, ms=0
getvl. r5      : setvl. r5, r0, vf=0, vs=0, ms=0

For Vertical-First mode, a pseudo-op for explicit incrementing of srcstep and dststep:

svfstep         : setvl  0, 0, vf=1, vs=0, ms=0
svfstep.        : setvl. 0, 0, vf=1, vs=0, ms=0

This pseudo-op is different from svstep, which is used to perform detailed enquiries about internal state.

Note that whilst it is possible to set both MVL and VL from the same immediate, it is not possible to set them to different immediates in the same instruction. Doing so would require two instructions.

Selecting sources for VL

Because there is considerable opcode pressure, the source from which VL is set is selected by the combination of RA and RT, as follows:

condition              effect
vs=1, RA=0, RT!=0      VL,RT set to MIN(MVL, CTR)
vs=1, RA=0, RT=0       VL set to MIN(MVL, SVi+1)
vs=1, RA!=0, RT=0      VL set to MIN(MVL, RA)
vs=1, RA!=0, RT!=0     VL,RT set to MIN(MVL, RA)

The reasoning here is that the opportunity to set RT equal to the immediate SVi+1 is sacrificed in favour of setting from CTR.
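The four vs=1 cases in the table can be written out as a small Python function. This is a sketch under stated assumptions, not the specification's pseudocode: `select_vl` and its parameter names are hypothetical, `ra` and `rt` are register numbers, `ra_value` is the contents of register RA, and `svi` is the raw 7-bit field.

```python
def select_vl(ra, rt, ra_value, ctr, svi, mvl):
    """Return (new_vl, rt_result) for the vs=1 cases of the table.
    rt_result is None when RT=0, i.e. no register is written."""
    if ra == 0:
        if rt != 0:
            vl = min(mvl, ctr)         # CTR mode: VL and RT both updated
            return vl, vl
        return min(mvl, svi + 1), None # immediate mode: only VL updated
    vl = min(mvl, ra_value)            # register mode: VL from contents of RA
    return vl, (vl if rt != 0 else None)
```

For instance, with MVL=8 and CTR=100, `select_vl(ra=0, rt=5, ra_value=0, ctr=100, svi=0, mvl=8)` clamps VL to 8 and also returns 8 for RT.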

Unusual Rc=1 behaviour

Normally, the return result from an instruction is in RT. With it being possible for RT=0 to mean that CTR mode is to be read, some different semantics are needed.

CR Field 0, when Rc=1, may be set even if RT=0. The reason is that overflow may occur: VL, if set either from an immediate or from CTR, may not exceed MAXVL, and if the requested value does exceed it, CR0.SO must be set.

Additionally, in reality it is VL being set. Therefore, rather than CR0 testing RT when Rc=1, CR0.EQ is set if VL=0 and CR0.GE is set if VL is non-zero.
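A minimal Python sketch of the Rc=1 rules described above follows. The function name `setvl_cr0` is hypothetical, the dictionary keys simply reuse the flag names from the text (SO, EQ, GE), and it is an assumption of this sketch that overflow is detected by comparing the requested value against MAXVL before clamping.

```python
def setvl_cr0(requested, maxvl):
    """Return (vl, cr0) where cr0 holds the flags named in the text."""
    vl = min(requested, maxvl)        # VL may not exceed MAXVL
    cr0 = {
        "SO": requested > maxvl,      # overflow: request had to be clamped
        "EQ": vl == 0,                # tests VL, not RT
        "GE": vl != 0,
    }
    return vl, cr0
```

Testing VL rather than RT means the flags remain meaningful even in the encodings where RT=0 and no register result is written.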

Vertical First Mode

Vertical First is effectively like an implicit single bit predicate applied to every SVP64 instruction. ONLY one element in each SVP64 Vector instruction is executed; srcstep and dststep do not increment, and the Program Counter progresses immediately to the next instruction just as it would for any standard scalar v3.0B instruction.

An explicit mode of setvl is called which can move srcstep and dststep on to the next element, still respecting predicate masks.

In other words, where normal SVP64 Vectorisation acts "horizontally" by looping first through 0 to VL-1 and only then moving the PC to the next instruction, Vertical-First moves the PC onwards (vertically) through multiple instructions with the same srcstep and dststep, then an explicit instruction is used to advance srcstep/dststep. An outer loop is expected to be used (branch instruction) which completes a series of Vector operations.
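The difference in execution order can be illustrated with a short Python sketch. It is a simplification under stated assumptions: predicate masks and REMAP are ignored, srcstep and dststep are treated as one shared step counter, and the function names are hypothetical.

```python
def horizontal_first(program, vl):
    """Each instruction loops over elements 0..VL-1 before the PC moves on."""
    return [(insn, step) for insn in program for step in range(vl)]

def vertical_first(program, vl):
    """The PC runs through every instruction for one element, then an
    explicit step instruction advances to the next element."""
    return [(insn, step) for step in range(vl) for insn in program]

program = ["ld", "add"]
h_order = horizontal_first(program, 2)
v_order = vertical_first(program, 2)
```

With this two-instruction program and VL=2, the horizontal order is ld 0, ld 1, add 0, add 1, whereas the vertical order is ld 0, add 0, ld 1, add 1.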

svfstep mode is enabled when vf=1, vs=0 and ms=0. When Rc=1 it is possible to determine when any level of loop reaches an end condition, or if VL has been reached. The immediate can be reinterpreted as indicating which SVSTATE (0-3) should be tested and placed into CR0 (when Rc=1).

When RT is not zero, an internal stepping index may also be returned, either the REMAP index or srcstep or dststep. This table is identical to that of svstep:

  • SVi=1: also include inner middle and outer loop end conditions from SVSTATE0 into CR.EQ CR.LE CR.GT
  • SVi=2: test SVSTATE1 (and return conditions)
  • SVi=3: test SVSTATE2 (and return conditions)
  • SVi=4: test SVSTATE3 (and return conditions)
  • SVi=5: SVSTATE.srcstep is returned.
  • SVi=6: SVSTATE.dststep is returned.

Testing any end condition of any loop of any REMAP state allows branches to be used to create loops.

Programmers should be aware that VL, srcstep and dststep are global in nature. Nested looping with different schedules is perfectly possible, as is calling of functions; however, SVSTATE (and any associated SVSTATE) should be stored on the stack.

SUBVL

Sub-vector elements are not to be considered "Vertical". The vec2/3/4 is to be considered as if it were the "single element". Caveats exist for mv.swizzle and mv.vec when Pack/Unpack is enabled, due to the order in which VL and SUBVL loops are applied being swapped (outer-inner becomes inner-outer).

Examples

Core concept loop

loop:
    setvl a3, a0, MVL=8    #  update a3 with vl
                           # (# of elements this iteration)
                           # set MVL to 8
    # do vector operations at up to 8 length (MVL=8)
    # ...
    sub a0, a0, a3   # Decrement count by vl
    bnez a0, loop    # Any more?

Loop using Rc=1

my_fn:
  li r3, 1000
  b test
loop:
  sub r3, r3, r4
  ...
test:
  setvl. r4, r3, MVL=64
  bne cr0, loop
end:
  blr

Load/Store-Multi (selective)

Up to 64 FPRs will be loaded, here. r3 has one bit set for each FP register required to be loaded. The block of memory from which the registers are loaded is contiguous (no gaps): any FP register which has a corresponding zero bit in r3 is unaltered. In essence this is a selective LD-multi with "Scatter" capability.

setvli r0, MVL=64, VL=64
sv.lfd/dm=r3 *fp0, 0(r30) # selectively load up to 64 FP registers
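The selective load can be modelled in Python as follows. This is a sketch under assumptions, not the specification's behaviour in full: `selective_load` is a hypothetical name, memory is indexed one FP value per entry, and bit i of the mask (counting from the least significant bit) is assumed to correspond to FP register i.

```python
def selective_load(fprs, mem, base, mask, vl=64):
    """Load contiguous memory into only those FPRs whose mask bit is set.
    Returns the number of values consumed from memory."""
    src = base
    for i in range(vl):
        if (mask >> i) & 1:
            fprs[i] = mem[src]  # destination skips ahead, source does not
            src += 1            # memory is consumed with no gaps
    return src - base

fprs = [0.0] * 64
mem = [1.5, 2.5, 3.5]
count = selective_load(fprs, mem, base=0, mask=0b10101)
```

Here registers 0, 2 and 4 receive the three consecutive memory values, while registers 1 and 3 are left unaltered.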

Up to 64 FPRs will be saved, here. Again, r3 has one bit set for each FP register required to be saved.

setvli r0, MVL=64, VL=64
sv.stfd/sm=r3 *fp0, 0(r30) # selective store 64 FP registers