svstep: Vertical-First Stepping and status reporting

SVL-Form

  • svstep RT,RA,SVi,vf (Rc=0)
  • svstep. RT,RA,SVi,vf (Rc=1)
| 0-5 | 6-10 | 11-15 | 16-22 | 23-25  | 26-30 | 31 | Form     |
|-----|------|-------|-------|--------|-------|----|----------|
| PO  | RT   | RA    | SVi   | / / vf | XO    | Rc | SVL-Form |

Pseudo-code:

    if SVi[3:4] = 0b11 then
        # store pack and unpack in SVSTATE
        SVSTATE[53] <- SVi[5]
        SVSTATE[54] <- SVi[6]
        RT <- [0]*62 || SVSTATE[53:54]
    else
        # Vertical-First explicit stepping.
        step <- SVSTATE_NEXT(SVi, vf)
        RT <- [0]*57 || step

Special Registers Altered:

CR0                     (if Rc=1)

Description

svstep may be used to enquire about the REMAP Schedule and it may be used to alter Vectorization State. When vf=1 then stepping occurs. When vf=0 the enquiry is performed without altering internal state. If SVi=0, Rc=0, vf=0 the instruction is a nop.

The following Modes exist:

  • SVi=0: appropriately step srcstep, dststep, subsrcstep and subdststep to the next element, taking pack and unpack into consideration.
  • When SVi is 1-4 the REMAP Schedule for a given SVSHAPE may be returned in RT. SVi=1 selects SVSHAPE0 current state, through to SVi=4 selects SVSHAPE3.
  • When SVi is 5, SVSTATE.srcstep is returned.
  • When SVi is 6, SVSTATE.dststep is returned.
  • When SVi is 7, SVSTATE.ssubstep is returned.
  • When SVi is 8, SVSTATE.dsubstep is returned.
  • When SVi is 0b1100, pack and unpack in SVSTATE are both cleared.
  • When SVi is 0b1101, pack in SVSTATE is set and unpack is cleared.
  • When SVi is 0b1110, unpack in SVSTATE is set and pack is cleared.
  • When SVi is 0b1111, pack and unpack in SVSTATE are both set.
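
The mode selection listed above can be summarised in a short Python sketch. The function name `classify_svi` and the returned labels are illustrative assumptions, not part of the specification. SVi is taken to be the 7-bit field of the encoding, and the pack/unpack test follows the `SVi[3:4] = 0b11` check of the pseudo-code (MSB0 numbering, so `SVi[5:6]` are the two least-significant bits). Values not covered by the list are simply reported as "unlisted".

```python
def classify_svi(svi):
    """Classify a 7-bit SVi value into (mode, detail).

    Hypothetical helper for illustration only; labels are assumptions.
    """
    if not 0 <= svi < 128:
        raise ValueError("SVi is a 7-bit field")
    # SVi[3:4] (MSB0, 7-bit field) are bits 3 and 2 in LSB0 numbering
    if (svi >> 2) & 0b11 == 0b11:
        # SVi[5:6] are copied into SVSTATE[53:54] and returned in RT
        return ("pack_unpack", svi & 0b11)
    if svi == 0:
        return ("step", None)            # explicit stepping
    if 1 <= svi <= 4:
        return ("remap", svi - 1)        # SVSHAPE0..SVSHAPE3
    queries = {5: "srcstep", 6: "dststep", 7: "ssubstep", 8: "dsubstep"}
    if svi in queries:
        return ("query", queries[svi])   # SVSTATE field returned in RT
    return ("unlisted", None)            # not described in the list above


for value in (0, 3, 6, 0b1101):
    print(value, classify_svi(value))
```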

As this is a Single-Predicated (1P) instruction, predication may be applied to skip (or zero) elements.

  • Vertical-First Mode will return the requested index (and move to the next state if vf=1)
  • Horizontal-First Mode can be used to return all indices, i.e. walks through all possible states.

Vectorization of svstep itself

As a 32-bit instruction, svstep may itself be Vector-Prefixed, as sv.svstep. This works as well in Horizontal-First Mode as it does in Vertical-First Mode, although there are caveats for the Deterministic use of looping with Sub-Vectors in Vertical-First Mode.

Example: to obtain the full set of possible computed element indices use sv.svstep *RT,SVi,1, which stores all computed element indices starting from RT. If Rc=1 then a co-result Vector of CR Fields is also returned, comprising the "loop end-points" of each of the inner loops when either Matrix Mode or DCT/FFT Mode is set. For example, when the xdim inner loop reaches its end, such that on the next iteration it will begin again at zero, the CR Field EQ bit is set. With a maximum of three loops within both Matrix and DCT/FFT Modes, the CR Field's EQ bit is set at the end of the first inner loop, the LT bit for the second, the GT bit for the outermost loop, and the SO bit on the very last element, when all loops reach their maximum extent.

Programmer's note: VL in some situations, particularly larger Matrices (5x7x3 will set MAXVL=105), will cause sv.svstep to return a considerable number of values. Under such circumstances sv.svstep/ew=8 is recommended.
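
To make the idea of a Vector of indices with accompanying loop end-points more concrete, the following Python sketch enumerates a plain three-level loop. It is a simplified model under stated assumptions: real REMAP Schedules depend on the SVSHAPE settings (which are not modelled here), the function name `matrix_schedule` and the flag names are hypothetical, and treating each loop's end flag as independent of the others is an assumption rather than something the text above states.

```python
from itertools import product


def matrix_schedule(xdim, ydim, zdim):
    """Yield (index, ends) for an un-permuted 3-level loop, x innermost.

    Illustrative only: flag names and independence are assumptions.
    """
    for z, y, x in product(range(zdim), range(ydim), range(xdim)):
        index = (z * ydim + y) * xdim + x
        ends = {
            "inner": x == xdim - 1,   # first (innermost) loop at its end
            "middle": y == ydim - 1,  # second loop at its end
            "outer": z == zdim - 1,   # outermost loop at its end
        }
        # the very last element: all loops at their maximum extent
        ends["last"] = all(ends.values())
        yield index, ends


schedule = list(matrix_schedule(5, 7, 3))
print(len(schedule))  # one entry per element of the 5x7x3 schedule
```

The length of the resulting list is the product of the three dimensions, which is why a narrower element width is attractive when storing the indices.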

Programmer's note: having conveniently obtained a pre-computed Schedule with sv.svstep, it may then be used as the input to Indexed REMAP Mode to achieve the exact same Schedule. Before use, however, some of the Indices may be arbitrarily altered as desired. sv.svstep helps the programmer avoid having to manually recreate Indices for certain types of common Loop patterns. In its simplest form, without REMAP (SVi=5 or SVi=6), sv.svstep is equivalent to the iota instruction found in other Vector ISAs.
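
A minimal Python sketch of this workflow is given below: an iota-style index Vector is generated, one pair of indices is altered, and the result is then used to gather elements. The helper names `iota` and `apply_indices` are hypothetical, and the gather is only a stand-in for Indexed REMAP, whose actual behaviour is defined elsewhere in the specification.

```python
def iota(vl):
    """Indices 0..vl-1, as a model of the simplest (no REMAP) schedule."""
    return list(range(vl))


def apply_indices(data, indices):
    """Gather data through an index vector (stand-in for Indexed REMAP)."""
    return [data[i] for i in indices]


schedule = iota(4)
# arbitrarily alter the schedule before use: swap elements 1 and 2
schedule[1], schedule[2] = schedule[2], schedule[1]
result = apply_indices(["a", "b", "c", "d"], schedule)
print(schedule, result)
```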

Vertical First Mode

Vertical First is effectively like an implicit single bit predicate applied to every SVP64 instruction. ONLY one element in each SVP64 Vector instruction is executed; srcstep and dststep do not increment automatically on completion of one instruction, and the Program Counter progresses immediately to the next instruction just as it would for any standard scalar v3.0B instruction.

A mode of svstep (SVi=0) is provided which moves srcstep and dststep on to the next element, still respecting predicate masks.

In other words, where normal SVP64 Vectorization acts "horizontally" by looping first through 0 to VL-1 and only then moving the PC to the next instruction, Vertical-First moves the PC onwards (vertically) through multiple instructions with the same srcstep and dststep, and an explicit instruction is then used to advance srcstep/dststep. An outer loop (using a branch instruction) is expected to be used to complete a series of Vector operations.

Testing any end condition of any loop of any REMAP state allows branches to be used to create loops.
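
The difference in execution order between the two modes can be sketched in Python as two nested loops with the nesting swapped. This is a conceptual model only: the generator names are hypothetical, and predication, Sub-Vectors and REMAP are all ignored.

```python
def horizontal_first(program, vl):
    """Each instruction loops over all elements before the PC moves on."""
    for insn in program:
        for step in range(vl):
            yield insn, step


def vertical_first(program, vl):
    """The PC walks through every instruction for one element at a time."""
    for step in range(vl):
        for insn in program:
            yield insn, step
        # an explicit svstep (SVi=0) followed by a branch would go here


program = ["load", "add", "store"]
print(list(horizontal_first(program, 2)))
print(list(vertical_first(program, 2)))
```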

Programmer's note: when Predicate Non-Zeroing is used this indicates to the underlying hardware that any masked-out element must be skipped. This also applies in Vertical-First Mode, and programmers should be keenly aware that srcstep or dststep or both may jump by more than one as a result, because the actual request under these circumstances was to execute on the next available non-masked-out element. It is therefore the sv.svstep instruction that must be Predicated in order for the entire loop to use the Predicate correctly, and it is strongly recommended that all instructions within the same Vertical-First Loop utilise the exact same Predicate Mask(s).
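
The "jump by more than one" effect can be illustrated with a small Python sketch that finds the next non-masked-out element. The function `next_unmasked` is a hypothetical helper, the predicate is modelled as an integer bitmask with bit N covering element N, and returning `None` at the end of the Vector is a convention chosen for this sketch only.

```python
def next_unmasked(step, mask, vl):
    """Return the next step after `step` whose predicate bit is set.

    Returns None when no further unmasked element exists below vl.
    """
    step += 1
    while step < vl and not (mask >> step) & 1:
        step += 1  # masked-out element: skip it
    return step if step < vl else None


mask = 0b10011  # elements 0, 1 and 4 enabled; 2 and 3 masked out
print(next_unmasked(0, mask, 5))
print(next_unmasked(1, mask, 5))  # jumps over elements 2 and 3
```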

Programmers should be aware that VL, srcstep and dststep and the SUBVL substeps are global in nature. Nested looping with different schedules is perfectly possible, as is calling of functions, however SVSTATE (and any associated SVSHAPEs if REMAP is being used) should obviously be stored on the stack in order to achieve this benefit not normally found in Vector ISAs.
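
As a rough illustration of why the state needs saving, the following Python sketch models SVSTATE as a dictionary that an inner loop is free to overwrite, with the caller saving and restoring it around the call. The dictionary layout and the names `run_nested` and `inner` are assumptions for illustration; on real hardware the save and restore would be of the SVSTATE (and SVSHAPE) registers to and from the stack.

```python
import copy


def run_nested(state, inner_loop):
    """Save state, run a nested loop that clobbers it, then restore it."""
    saved = copy.deepcopy(state)  # models pushing SVSTATE onto the stack
    try:
        inner_loop(state)
    finally:
        state.clear()
        state.update(saved)       # models popping SVSTATE back


def inner(state):
    # the nested loop uses its own VL and restarts stepping from zero
    state["vl"] = 2
    state["srcstep"] = 0
    state["dststep"] = 0


outer_state = {"vl": 8, "srcstep": 3, "dststep": 3}
run_nested(outer_state, inner)
print(outer_state)
```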

Use of svstep with Vertical-First sub-vectors

Incrementing and iteration through subvector state ssubstep and dsubstep is possible with sv.svstep/vecN where, as expected, N may be 2, 3 or 4. However it is necessary to use the exact same Sub-Vector qualifier on every Prefixed instruction within any given Vertical-First loop: vec2/3/4 is not automatically applied to all instructions, it must be explicitly applied on a per-instruction basis. Not specifying a Sub-Vector qualifier at all is also valid, but it is critically important to note that operations will then be repeated. For example, if sv.svstep/vec2 is used but the vec2 qualifier is omitted from sv.addi, then each Vector element operation of the sv.addi is repeated twice. The reason is that whilst svstep iterates through both the SUBVL and VL loops, the addi instruction only uses srcstep and dststep (not ssubstep or dsubstep). This is illustrated below:

    SUBVL = 2
    def offset():
        for step in range(VL):
            for substep in range(SUBVL):
                yield step, substep
    for i, j in offset():
        vec2_offs = i * SUBVL + j  # calculate vec2 offset
        addi RT+i, RA+i, 1         # but sv.addi is not vec2!
        muli/vec2 RT+vec2_offs, RA+vec2_offs, 2  # this is vec2

Actual assembler would be:

    loop:
        setvl VF=1, CTRmode
        sv.addi *RT, *RA, 1      # no vec2
        sv.muli/vec2 *RT, *RA, 2 # vec2
        sv.svstep/vec2           # must match the muli
        sv.bc CTRmode, loop      # subtracts VL from CTR

This illustrates the correct but seemingly-anomalous behaviour: sv.svstep/vec2 is being requested to update SVSTATE to follow a vec2 loop construct. The anomalous sv.addi is not prohibited as it may in fact be desirable to execute operations twice, or to re-load data that was overwritten, and many other possibilities.
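
The repetition can be checked with a small runnable Python model that simply counts how often each element offset is visited by the two instructions. It is a sketch under simplifying assumptions: no predication, no REMAP, and the function name `vertical_first_vec2` is hypothetical.

```python
from collections import Counter


def vertical_first_vec2(vl, subvl=2):
    """Count element visits for a non-vec2 and a vec2 instruction."""
    addi_hits = Counter()  # sv.addi: indexed by step only
    muli_hits = Counter()  # sv.muli/vec2: indexed by step and substep
    for step in range(vl):
        for substep in range(subvl):
            addi_hits[step] += 1
            muli_hits[step * subvl + substep] += 1
    return addi_hits, muli_hits


addi_hits, muli_hits = vertical_first_vec2(vl=3)
print(dict(addi_hits))  # every element offset is visited twice
print(dict(muli_hits))  # every vec2 offset is visited once
```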



Appendix

src_iterate

Note that srcstep and ssubstep are not the absolute final Element (and Sub-Element) offsets. srcstep still has to go through individual REMAP translation before becoming a per-operand (RA, RB, RC, RT, RS) Element-level Source offset.

Note also, critically, that PACK mode simply inverts the outer/inner loops, making SUBVL the outer loop and VL the inner.

    # source-stepping iterator
    subvl = SVSTATE.subvl
    vl = SVSTATE.vl
    pack = SVSTATE.pack
    unpack = SVSTATE.unpack
    ssubstep = SVSTATE.ssubstep
    end_ssub = ssubstep == subvl
    end_src = SVSTATE.srcstep == vl-1
    # first source step.
    srcstep = SVSTATE.srcstep
    # used below:
    #       sz      - from RM.MODE, source-zeroing
    #       srcmask - from RM.MODE, the source predicate
    if pack:
        # pack advances subvl in *outer* loop
        while True:
            assert srcstep <= vl-1
            end_src = srcstep == vl-1
            if end_src:
                if end_ssub:
                    loopend = True
                else:
                    SVSTATE.ssubstep += 1
                srcstep = 0  # reset
                break
            else:
                srcstep += 1  # advance srcstep
                if not sz:
                    break
                if ((1 << srcstep) & srcmask) != 0:
                    break
    else:
        # advance subvl in *inner* loop
        if end_ssub:
            while True:
                assert srcstep <= vl-1
                end_src = srcstep == vl-1
                if end_src:  # end-point
                    loopend = True
                    srcstep = 0
                    break
                else:
                    srcstep += 1
                if not sz:
                    break
                if ((1 << srcstep) & srcmask) != 0:
                    break
                else:
                    log("      sskip", bin(srcmask), bin(1 << srcstep))
            SVSTATE.ssubstep = 0b00  # reset
        else:
            # advance ssubstep
            SVSTATE.ssubstep += 1

    SVSTATE.srcstep = srcstep
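
The effect of PACK on loop ordering can be seen by enumerating the (srcstep, ssubstep) pairs in the order the iterator above would visit them. The Python sketch below ignores predication entirely, and `step_order` is a hypothetical helper that takes the Sub-Vector length directly (2 for vec2) rather than the encoded SVSTATE.subvl field.

```python
def step_order(vl, subvl_len, pack=False):
    """List (srcstep, ssubstep) pairs in visiting order, no predication."""
    if pack:
        # pack: SUBVL is the outer loop, VL the inner loop
        return [(s, sub) for sub in range(subvl_len) for s in range(vl)]
    # default: VL is the outer loop, SUBVL the inner loop
    return [(s, sub) for s in range(vl) for sub in range(subvl_len)]


print(step_order(2, 2))
print(step_order(2, 2, pack=True))
```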


dest_iterate

Note that dststep and dsubstep are not the absolute final Element (and Sub-Element) offsets. dststep still has to go through individual REMAP translation before becoming a per-operand (RT, RS/EA) destination Element-level offset, and dsubstep may also go through (f)mv.swizzle reordering.

Note also, critically, that UNPACK mode simply inverts the outer/inner loops, making SUBVL the outer loop and VL the inner.

    # dest step iterator
    vl = SVSTATE.vl
    subvl = SVSTATE.subvl
    unpack = SVSTATE.unpack
    dsubstep = SVSTATE.dsubstep
    end_dsub = dsubstep == subvl
    dststep = SVSTATE.dststep
    end_dst = dststep == vl-1
    # used below:
    #       dz      - from RM.MODE, destination-zeroing
    #       dstmask - from RM.MODE, the destination predicate
    if unpack:
        # unpack advances subvl in *outer* loop
        while True:
            assert dststep <= vl-1
            end_dst = dststep == vl-1
            if end_dst:
                if end_dsub:
                    loopend = True
                else:
                    SVSTATE.dsubstep += 1
                dststep = 0  # reset
                break
            else:
                dststep += 1  # advance dststep
                if not dz:
                    break
                if ((1 << dststep) & dstmask) != 0:
                    break
    else:
        # advance subvl in *inner* loop
        if end_dsub:
            while True:
                assert dststep <= vl-1
                end_dst = dststep == vl-1
                if end_dst:  # end-point
                    loopend = True
                    dststep = 0
                    break
                else:
                    dststep += 1
                if not dz:
                    break
                if ((1 << dststep) & dstmask) != 0:
                    break
            SVSTATE.dsubstep = 0b00  # reset
        else:
            # advance dsubstep
            SVSTATE.dsubstep += 1

    SVSTATE.dststep = dststep


SVSTATE_NEXT

    if SVi = 1 then return REMAP SVSHAPE0 current offset
    if SVi = 2 then return REMAP SVSHAPE1 current offset
    if SVi = 3 then return REMAP SVSHAPE2 current offset
    if SVi = 4 then return REMAP SVSHAPE3 current offset
    if SVi = 5 then return SVSTATE.srcstep  # VL source step
    if SVi = 6 then return SVSTATE.dststep  # VL dest step
    if SVi = 7 then return SVSTATE.ssubstep # SUBVL source step
    if SVi = 8 then return SVSTATE.dsubstep # SUBVL dest step

    # SVi=0, explicit iteration requested
    src_iterate();
    dest_iterate();
    return 0

at_loopend

Both Vertical-First and Horizontal-First may use this algorithm to determine if the "end-of-looping" (end of Sub-Program-Counter) has been reached. On reaching it, Horizontal-First Mode immediately moves to the next instruction, whereas in Vertical-First Mode svstep. sets CR0.EQ to 1.

    # tells if this is the last possible element.
    subvl = SVSTATE.subvl
    vl = SVSTATE.vl
    end_ssub = SVSTATE.ssubstep == subvl
    end_dsub = SVSTATE.dsubstep == subvl
    if SVSTATE.srcstep == vl-1 and end_ssub:
        return True
    if SVSTATE.dststep == vl-1 and end_dsub:
        return True
    return False
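
For experimentation, the check above can be wrapped into a self-contained Python function. The `SVState` dataclass is a hypothetical stand-in for the SVSTATE register that carries only the fields used here. Note the assumption, inferred from the `ssubstep == subvl` comparison, that `subvl` holds the encoded value (Sub-Vector length minus one) rather than the length itself.

```python
from dataclasses import dataclass


@dataclass
class SVState:
    """Hypothetical model of the SVSTATE fields used by at_loopend."""
    vl: int
    subvl: int          # assumed encoding: Sub-Vector length minus one
    srcstep: int = 0
    dststep: int = 0
    ssubstep: int = 0
    dsubstep: int = 0


def at_loopend(st):
    """Tell whether this is the last possible element."""
    end_ssub = st.ssubstep == st.subvl
    end_dsub = st.dsubstep == st.subvl
    if st.srcstep == st.vl - 1 and end_ssub:
        return True
    if st.dststep == st.vl - 1 and end_dsub:
        return True
    return False


print(at_loopend(SVState(vl=4, subvl=1, srcstep=3, ssubstep=1)))
print(at_loopend(SVState(vl=4, subvl=1, srcstep=3, ssubstep=0)))
```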
