## REMAP Matrix pseudocode

The algorithm below shows more clearly how REMAP works, and may be executed as a Python program:

``````
# Finite State Machine version of the REMAP system.  much more likely
# to end up being actually used in actual hardware

# up to three dimensions permitted
xdim = 3
ydim = 2
zdim = 1

VL = xdim * ydim * zdim # set total (can repeat, e.g. VL=x*y*z*4)

lims = [xdim, ydim, zdim]
idxs = [0,0,0]   # starting indices
applydim = [1, 1]   # apply lower dims
order = [1,0,2]  # experiment with different permutations, here
offset = 0       # experiment with different offsets, here
invxyz = [0,1,0] # inversion allowed

# pre-prepare the index state: run for "offset" times before
# actually starting.  this algorithm can also be used for re-entrancy
# if exceptions occur and a REMAP has to be started from where the
# interrupt left off.
for idx in range(offset):
    for i in range(3):
        idxs[order[i]] = idxs[order[i]] + 1
        if (idxs[order[i]] != lims[order[i]]):
            break
        idxs[order[i]] = 0

break_count = 0 # for pretty-printing

for idx in range(VL):
    ix = [0] * 3
    for i in range(3):
        ix[i] = idxs[i]
        if invxyz[i]:
            ix[i] = lims[i] - 1 - ix[i]
    new_idx = ix[2]
    if applydim[1]:
        new_idx = new_idx * ydim + ix[1]
    if applydim[0]:
        new_idx = new_idx * xdim + ix[0]
    print("%d->%d" % (idx, new_idx), end=" ")
    break_count += 1
    if break_count == lims[order[0]]:
        print()
        break_count = 0
    # this is the exact same thing as the pre-preparation stage
    # above.  step 1: count up to the limit of the current dimension
    # step 2: if limit reached, zero it, and allow the *next* dimension
    # to increment.  repeat for 3 dimensions.
    for i in range(3):
        idxs[order[i]] = idxs[order[i]] + 1
        if (idxs[order[i]] != lims[order[i]]):
            break
        idxs[order[i]] = 0
``````

An easier-to-read version (using python iterators) is given in a later section of this Appendix.

Each element index from the for-loop `0..VL-1` is run through the above algorithm to work out the actual element index to use in its place. Given that there are four possible SHAPE entries, up to four separate registers in any given operation may be simultaneously remapped:

``````
function op_add(RT, RA, RB) # add not VADD!
    for (i=0, id=0, irs1=0, irs2=0; i < VL; i++)
        SVSTATE.srcstep = i # save context
        if (predval & 1<<i) # predication mask
            GPR[RT+remap1(id)] <= GPR[RA+remap2(irs1)] +
                                  GPR[RB+remap3(irs2)];
            if (!RT.isvector) break;
        if (RT.isvector)  { id += 1; }
        if (RA.isvector)  { irs1 += 1; }
        if (RB.isvector)  { irs2 += 1; }
``````

By changing the remapping, a 2D matrix may be transposed "in-place" for one operation, then given a different permutation order for the next, without having to move the values in the registers to or from memory.
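The transpose-by-permutation idea can be sketched with the same FSM increment logic as the pseudocode above (plain Python; `remap_indices` is an illustrative helper, not part of the ISA):

```python
# Illustrative sketch: a 2D transpose achieved purely by changing the
# REMAP permutation order.  Same index-increment FSM as the pseudocode.

def remap_indices(xdim, ydim, order):
    """Yield remapped element indices for a 2D shape, incrementing
    the dimensions in the requested order."""
    lims = [xdim, ydim]
    idxs = [0, 0]
    for _ in range(xdim * ydim):
        # row-major flattening of the current (x, y) coordinate
        yield idxs[1] * xdim + idxs[0]
        # count up in the first-listed dimension; carry into the next
        for i in range(2):
            idxs[order[i]] += 1
            if idxs[order[i]] != lims[order[i]]:
                break
            idxs[order[i]] = 0

print(list(remap_indices(3, 2, order=[0, 1])))  # identity:   [0, 1, 2, 3, 4, 5]
print(list(remap_indices(3, 2, order=[1, 0])))  # transposed: [0, 3, 1, 4, 2, 5]
```

The register contents never move: only the order in which element indices are presented to the operation changes.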

Note that:

• Over-running the register file clearly has to be detected, and an illegal instruction exception thrown.
• When non-default elwidths are set, the exact same algorithm still applies (i.e. it offsets polymorphic elements within registers rather than entire registers).
• If permute option 000 is utilised, the actual order of the reindexing does not change. However, modulo MVL still occurs, which will result in repeated operations (use with caution).
• If two or more dimensions are set to zero, the actual order does not change!
• The above algorithm is pseudo-code only. Actual implementations will need to take into account the fact that the element for-looping must be re-entrant, due to the possibility of exceptions occurring. See SVSTATE SPR, which records the current element index. Continuing after return from an interrupt may introduce latency due to re-computation of the remapped offsets.
• Twin-predicated operations require two separate and distinct element offsets. The above pseudo-code algorithm will be applied separately and independently to each, should each of the two operands be remapped. This even includes unit-strided LD/ST and other operations in that category, where in that case it will be the address offset that is remapped: `EA <- (RA) + immediate * REMAP(elementoffset)`.
• Offset is especially useful, on its own, for accessing elements within the middle of a register. Without offsets, it is necessary to either use a predicated MV, skipping the first elements, or performing a LOAD/STORE cycle to memory. With offsets, the data does not have to be moved.
• Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to less than MVL is perfectly legal, albeit very obscure. It permits entries to be regularly presented to operands more than once, thus allowing the same underlying registers to act as an accumulator of multiple vector or matrix operations, for example.
• Note especially that Program Order must still be respected even when overlaps occur that read or write the same register elements including polymorphic ones
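The offset point above can be illustrated with a tiny model (plain Python, simplified to one dimension; `offset_indices` is a hypothetical helper, not ISA code): pre-advancing the index state lets an operation start partway into a vector without a predicated MV or a memory round-trip.

```python
# Simplified 1D model of the "offset" feature: the FSM is pre-run
# "offset" times, so the first element actually accessed sits in the
# middle of the register group, wrapping modulo the dimension.

def offset_indices(dim, offset, count):
    """Yield `count` element indices starting `offset` elements in."""
    idx = offset % dim
    for _ in range(count):
        yield idx
        idx = (idx + 1) % dim

print(list(offset_indices(8, 3, 5)))  # [3, 4, 5, 6, 7]
print(list(offset_indices(4, 3, 3)))  # [3, 0, 1] - wraps modulo the dim
```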

Clearly, considerable care needs to be taken here, as the remapping could hypothetically create arithmetic operations that target the exact same underlying registers, resulting in data corruption due to pipeline overlaps. Out-of-order / superscalar micro-architectures with register renaming will have an easier time dealing with this than DSP-style SIMD micro-architectures.

### 4x4 Matrix to vec4 Multiply (4x4 by 1x4)

The following settings will allow a 4x4 matrix (starting at f8), expressed as a sequence of 16 numbers first by row then by column, to be multiplied by a vector of length 4 (starting at f0), using a single FMAC instruction.

• SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
• SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
• VL=16, f4=vec, f0=vec, f8=vec
• FMAC f4, f0, f8, f4

The permutation on SHAPE0 will use f0 as a vec4 source. On the first four iterations through the hardware loop, the REMAPed index will not increment. On the second four, the index will increase by one. Likewise on each subsequent group of four.

The permutation on SHAPE1 will increment f4 continuously, cycling through f4-f7 every four iterations of the hardware loop.

At the same time, because there is no SHAPE on f8, the f8 index will simply increment sequentially through the 16 Matrix values f8-f23. The equivalent instruction sequence thus issued is:

``````
fmac f4, f0, f8, f4
fmac f5, f0, f9, f5
fmac f6, f0, f10, f6
fmac f7, f0, f11, f7
fmac f4, f1, f12, f4
fmac f5, f1, f13, f5
fmac f6, f1, f14, f6
fmac f7, f1, f15, f7
fmac f4, f2, f16, f4
fmac f5, f2, f17, f5
fmac f6, f2, f18, f6
fmac f7, f2, f19, f7
fmac f4, f3, f20, f4
fmac f5, f3, f21, f5
fmac f6, f3, f22, f6
fmac f7, f3, f23, f7
``````

Hardware should easily pipeline the above FMACs: because each accumulator (f4-f7) is reused only on every fourth iteration, as long as each FMAC completes in four cycles or fewer there will be 100% sustained throughput, from the one single Vector FMAC.

The only other instructions required are those that initialise f4-f7 (usually to zero). If the FMACs form part of some other computation, as is frequently the case, even that zeroing is unnecessary.
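The expanded sequence above can be checked with a few lines of plain Python (an illustrative model, not ISA code): f0-f3 hold the vec4 input, f8-f23 the matrix, and f4-f7 the accumulators.

```python
# Model the sixteen FMACs above: iteration i performs
#   acc[i % 4] += vec[i // 4] * mat[i]
# exactly matching fmac f4,f0,f8,f4 ... fmac f7,f3,f23,f7.
vec = [1.0, 2.0, 3.0, 4.0]           # f0-f3
mat = [float(i) for i in range(16)]  # f8-f23
acc = [0.0] * 4                      # f4-f7, initialised to zero

for i in range(16):                  # one hardware-loop iteration each
    acc[i % 4] += vec[i // 4] * mat[i]

# reference: result[c] = sum over r of vec[r] * mat[r*4 + c]
ref = [sum(vec[r] * mat[r * 4 + c] for r in range(4)) for c in range(4)]
print(acc)        # [80.0, 90.0, 100.0, 110.0]
print(acc == ref) # True
```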

## REMAP FFT, DFT, NTT

The algorithm from a later section of this Appendix shows how FFT REMAP works, and it may be executed as a standalone python3 program. The executable code is designed to illustrate how a hardware implementation may generate Indices which are completely independent of the Execution of element-level operations, even for something as complex as a Triple-loop Cooley-Tukey Schedule. A comprehensive demo and test suite may be found here, including a Complex-Number FFT which deploys Vertical-First Mode on top of the REMAP Schedules.

Uses extend beyond DFT and NTT: as an abstracted RISC paradigm, the Schedules are not restricted in any way or tied to any particular instruction. If the programmer can find any algorithm with an identical triple-nested loop structure, the FFT Schedule may be used even there.
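As an illustration of index generation being independent of execution, here is a sketch of a standard radix-2 Cooley-Tukey butterfly index schedule in plain Python (an assumption for clarity; the standalone demo's actual code may differ):

```python
# Generate the triple-nested Cooley-Tukey loop indices as a flat
# schedule, completely separate from the butterfly arithmetic that
# will eventually consume them.

def fft_butterfly_schedule(n):
    """Yield (j, j+halfsize, k) index triples for an in-place
    radix-2 FFT of n points (n a power of two): two data indices
    and one twiddle-factor (COS table) index per butterfly."""
    size = 2
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):      # outer loop: block starts
            k = 0
            for j in range(i, i + halfsize):  # inner loop: butterflies
                yield (j, j + halfsize, k)
                k += tablestep
        size *= 2                        # middle loop: doubling sizes

print(list(fft_butterfly_schedule(4)))
# [(0, 1, 0), (2, 3, 0), (0, 2, 0), (1, 3, 1)]
```

The schedule contains (n/2)·log2(n) triples, which is exactly the VL computed by the FFT mode of the svshape pseudocode below.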

## svshape pseudocode

``````
# for convenience, VL to be calculated and stored in SVSTATE
vlen <- [0] * 7
mscale[0:5] <- 0b000001 # for scaling MAXVL
itercount[0:6] <- [0] * 7
SVSTATE[0:31] <- [0] * 32
# only overwrite REMAP if "persistence" is zero
if (SVSTATE[62] = 0b0) then
    SVSTATE[32:33] <- 0b00
    SVSTATE[34:35] <- 0b00
    SVSTATE[36:37] <- 0b00
    SVSTATE[38:39] <- 0b00
    SVSTATE[40:41] <- 0b00
    SVSTATE[42:46] <- 0b00000
    SVSTATE[62] <- 0b0
    SVSTATE[63] <- 0b0
# clear out all SVSHAPEs
SVSHAPE0[0:31] <- [0] * 32
SVSHAPE1[0:31] <- [0] * 32
SVSHAPE2[0:31] <- [0] * 32
SVSHAPE3[0:31] <- [0] * 32

# set schedule up for multiply
if (SVrm = 0b0000) then
    # VL in Matrix Multiply is xd*yd*zd
    xd <- (0b00 || SVxd) + 1
    yd <- (0b00 || SVyd) + 1
    zd <- (0b00 || SVzd) + 1
    n <- xd * yd * zd
    vlen[0:6] <- n[14:20]
    # set up template in SVSHAPE0, then copy to 1-3
    SVSHAPE0[0:5] <- (0b0 || SVxd)    # xdim
    SVSHAPE0[6:11] <- (0b0 || SVyd)   # ydim
    SVSHAPE0[12:17] <- (0b0 || SVzd)  # zdim
    SVSHAPE0[28:29] <- 0b11           # skip z
    # copy
    SVSHAPE1[0:31] <- SVSHAPE0[0:31]
    SVSHAPE2[0:31] <- SVSHAPE0[0:31]
    SVSHAPE3[0:31] <- SVSHAPE0[0:31]
    # set up FRA
    SVSHAPE1[18:20] <- 0b001          # permute x,z,y
    SVSHAPE1[28:29] <- 0b01           # skip z
    # FRC
    SVSHAPE2[18:20] <- 0b001          # permute x,z,y
    SVSHAPE2[28:29] <- 0b11           # skip y

# set schedule up for FFT butterfly
if (SVrm = 0b0001) then
    # calculate O(N log2 N)
    n <- [0] * 3
    do while n < 5
        if SVxd[4-n] = 0 then
            leave
        n <- n + 1
    n <- ((0b0 || SVxd) + 1) * n
    vlen[0:6] <- n[1:7]
    # set up template in SVSHAPE0, then copy to 1-3
    # for FRA and FRT
    SVSHAPE0[0:5] <- (0b0 || SVxd)    # xdim
    SVSHAPE0[12:17] <- (0b0 || SVzd)  # zdim - "striding" (2D FFT)
    mscale <- (0b0 || SVzd) + 1
    SVSHAPE0[30:31] <- 0b01           # Butterfly mode
    # copy
    SVSHAPE1[0:31] <- SVSHAPE0[0:31]
    SVSHAPE2[0:31] <- SVSHAPE0[0:31]
    # set up FRB and FRS
    SVSHAPE1[28:29] <- 0b01           # j+halfstep schedule
    # FRC (coefficients)
    SVSHAPE2[28:29] <- 0b10           # k schedule

# set schedule up for (i)DCT Inner butterfly
# SVrm Mode 4 (Mode 12 for iDCT) is for on-the-fly (Vertical-First Mode)
if ((SVrm = 0b0100) |
    (SVrm = 0b1100)) then
    # calculate O(N log2 N)
    n <- [0] * 3
    do while n < 5
        if SVxd[4-n] = 0 then
            leave
        n <- n + 1
    n <- ((0b0 || SVxd) + 1) * n
    vlen[0:6] <- n[1:7]
    # set up template in SVSHAPE0, then copy to 1-3
    # set up FRB and FRS
    SVSHAPE0[0:5] <- (0b0 || SVxd)    # xdim
    SVSHAPE0[12:17] <- (0b0 || SVzd)  # zdim - "striding" (2D DCT)
    mscale <- (0b0 || SVzd) + 1
    if (SVrm = 0b1100) then
        SVSHAPE0[30:31] <- 0b11       # iDCT mode
        SVSHAPE0[18:20] <- 0b011      # iDCT Inner Butterfly sub-mode
    else
        SVSHAPE0[30:31] <- 0b01       # DCT mode
        SVSHAPE0[18:20] <- 0b001      # DCT Inner Butterfly sub-mode
        SVSHAPE0[21:23] <- 0b001      # "inverse" on outer loop
    SVSHAPE0[6:11] <- 0b000011        # (i)DCT Inner Butterfly mode 4
    # copy
    SVSHAPE1[0:31] <- SVSHAPE0[0:31]
    SVSHAPE2[0:31] <- SVSHAPE0[0:31]
    if (SVrm != 0b0100) & (SVrm != 0b1100) then
        SVSHAPE3[0:31] <- SVSHAPE0[0:31]
    # for FRA and FRT
    SVSHAPE0[28:29] <- 0b01           # j+halfstep schedule
    # for cos coefficient
    SVSHAPE2[28:29] <- 0b10           # ci (k for mode 4) schedule
    SVSHAPE2[12:17] <- 0b000000       # reset costable "striding" to 1
    if (SVrm != 0b0100) & (SVrm != 0b1100) then
        SVSHAPE3[28:29] <- 0b11       # size schedule

# set schedule up for (i)DCT Outer butterfly
if (SVrm = 0b0011) | (SVrm = 0b1011) then
    # calculate O(N log2 N) number of outer butterfly overlapping adds
    vlen[0:6] <- [0] * 7
    n <- 0b000
    size <- 0b0000001
    itercount[0:6] <- (0b00 || SVxd) + 0b0000001
    itercount[0:6] <- (0b0 || itercount[0:5])
    do while n < 5
        if SVxd[4-n] = 0 then
            leave
        n <- n + 1
        count <- (itercount - 0b0000001) * size
        vlen[0:6] <- vlen + count[7:13]
        size[0:6] <- (size[1:6] || 0b0)
        itercount[0:6] <- (0b0 || itercount[0:5])
    # set up template in SVSHAPE0, then copy to 1-3
    # set up FRB and FRS
    SVSHAPE0[0:5] <- (0b0 || SVxd)    # xdim
    SVSHAPE0[12:17] <- (0b0 || SVzd)  # zdim - "striding" (2D DCT)
    mscale <- (0b0 || SVzd) + 1
    if (SVrm = 0b1011) then
        SVSHAPE0[30:31] <- 0b11       # iDCT mode
        SVSHAPE0[18:20] <- 0b011      # iDCT Outer Butterfly sub-mode
        SVSHAPE0[21:23] <- 0b101      # "inverse" on outer and inner loop
    else
        SVSHAPE0[30:31] <- 0b01       # DCT mode
        SVSHAPE0[18:20] <- 0b100      # DCT Outer Butterfly sub-mode
    SVSHAPE0[6:11] <- 0b000010        # DCT Butterfly mode
    # copy
    SVSHAPE1[0:31] <- SVSHAPE0[0:31]  # j+halfstep schedule
    SVSHAPE2[0:31] <- SVSHAPE0[0:31]  # costable coefficients
    # for FRA and FRT
    SVSHAPE1[28:29] <- 0b01           # j+halfstep schedule
    # reset costable "striding" to 1
    SVSHAPE2[12:17] <- 0b000000

# set schedule up for DCT COS table generation
if (SVrm = 0b0101) | (SVrm = 0b1101) then
    # calculate O(N log2 N)
    vlen[0:6] <- [0] * 7
    itercount[0:6] <- (0b00 || SVxd) + 0b0000001
    itercount[0:6] <- (0b0 || itercount[0:5])
    n <- [0] * 3
    do while n < 5
        if SVxd[4-n] = 0 then
            leave
        n <- n + 1
        vlen[0:6] <- vlen + itercount
        itercount[0:6] <- (0b0 || itercount[0:5])
    # set up template in SVSHAPE0, then copy to 1-3
    # set up FRB and FRS
    SVSHAPE0[0:5] <- (0b0 || SVxd)    # xdim
    SVSHAPE0[12:17] <- (0b0 || SVzd)  # zdim - "striding" (2D DCT)
    mscale <- (0b0 || SVzd) + 1
    SVSHAPE0[30:31] <- 0b01           # DCT/FFT mode
    SVSHAPE0[6:11] <- 0b000100        # DCT Inner Butterfly COS-gen mode
    if (SVrm = 0b0101) then
        SVSHAPE0[21:23] <- 0b001      # "inverse" on outer loop for DCT
    # copy
    SVSHAPE1[0:31] <- SVSHAPE0[0:31]
    SVSHAPE2[0:31] <- SVSHAPE0[0:31]
    # for cos coefficient
    SVSHAPE1[28:29] <- 0b10           # ci schedule
    SVSHAPE2[28:29] <- 0b11           # size schedule

# set schedule up for iDCT / DCT inverse of half-swapped ordering
if (SVrm = 0b0110) | (SVrm = 0b1110) | (SVrm = 0b1111) then
    vlen[0:6] <- (0b00 || SVxd) + 0b0000001
    # set up template in SVSHAPE0
    SVSHAPE0[0:5] <- (0b0 || SVxd)    # xdim
    SVSHAPE0[12:17] <- (0b0 || SVzd)  # zdim - "striding" (2D DCT)
    mscale <- (0b0 || SVzd) + 1
    if (SVrm = 0b1110) then
        SVSHAPE0[18:20] <- 0b001      # DCT opposite half-swap
    if (SVrm = 0b1111) then
        SVSHAPE0[30:31] <- 0b01       # FFT mode
    else
        SVSHAPE0[30:31] <- 0b11       # DCT mode
    SVSHAPE0[6:11] <- 0b000101        # DCT "half-swap" mode

# set schedule up for parallel reduction
if (SVrm = 0b0111) then
    # calculate the total number of operations (brute-force)
    vlen[0:6] <- [0] * 7
    itercount[0:6] <- (0b00 || SVxd) + 0b0000001
    step[0:6] <- 0b0000001
    i[0:6] <- 0b0000000
    do while step <u itercount
        newstep <- step[1:6] || 0b0
        j[0:6] <- 0b0000000
        do while (j+step <u itercount)
            j <- j + newstep
            i <- i + 1
        step <- newstep
    # VL in Parallel-Reduce is the number of operations
    vlen[0:6] <- i
    # set up template in SVSHAPE0, then copy to 1. only 2 needed
    SVSHAPE0[0:5] <- (0b0 || SVxd)    # xdim
    SVSHAPE0[12:17] <- (0b0 || SVzd)  # zdim - "striding" (2D DCT)
    mscale <- (0b0 || SVzd) + 1
    SVSHAPE0[30:31] <- 0b10           # parallel reduce submode
    # copy
    SVSHAPE1[0:31] <- SVSHAPE0[0:31]
    # set up right operand (left operand 28:29 is zero)
    SVSHAPE1[28:29] <- 0b01           # right operand

# set VL, MVL and Vertical-First
m[0:12] <- vlen * mscale
maxvl[0:6] <- m[6:12]
SVSTATE[0:6] <- maxvl  # MAXVL
SVSTATE[7:13] <- vlen  # VL
SVSTATE[63] <- vf
``````
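The parallel-reduction operation count above (SVrm = 0b0111) transcribes directly into plain Python; this sketch confirms that the brute-force double loop yields the expected N-1 operations for a tree reduction of N elements.

```python
# Plain-Python transcription of the parallel-reduce VL calculation:
# count the pairwise operations a binary-tree reduction performs.

def parallel_reduce_ops(itercount):
    i = 0
    step = 1
    while step < itercount:
        newstep = step * 2        # step[1:6] || 0b0 is a left shift
        j = 0
        while j + step < itercount:
            j += newstep
            i += 1                # one reduction operation scheduled
        step = newstep
    return i

# a tree reduction of N elements always takes N-1 operations
print([parallel_reduce_ops(n) for n in (2, 3, 7, 8)])  # [1, 2, 6, 7]
```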

## svindex pseudocode

``````
# based on nearest MAXVL compute other dimension
MVL <- SVSTATE[0:6]
d <- [0] * 6
dim <- SVd+1
do while d*dim <u ([0]*4 || MVL)
    d <- d + 1

# set up template, then copy once location identified
shape <- [0]*32
shape[30:31] <- 0b00                # mode
if SVyx = 0 then
    shape[18:20] <- 0b110           # indexed xd/yd
    shape[0:5] <- (0b0 || SVd)      # xdim
    if sk = 0 then shape[6:11] <- 0 # ydim
    else           shape[6:11] <- 0b111111 # ydim max
else
    shape[18:20] <- 0b111           # indexed yd/xd
    if sk = 1 then shape[6:11] <- 0 # ydim
    else           shape[6:11] <- d-1 # ydim max
    shape[0:5] <- (0b0 || SVd)      # ydim
shape[12:17] <- (0b0 || SVG)        # SVGPR
shape[28:29] <- ew                  # element-width override
shape[21] <- sk                     # skip 1st dimension

# select the mode for updating SVSHAPEs
SVSTATE[62] <- mm # set or clear persistence
if mm = 0 then
    # clear out all SVSHAPEs first
    SVSHAPE0[0:31] <- [0] * 32
    SVSHAPE1[0:31] <- [0] * 32
    SVSHAPE2[0:31] <- [0] * 32
    SVSHAPE3[0:31] <- [0] * 32
    SVSTATE[32:41] <- [0] * 10 # clear REMAP.mi/o
    SVSTATE[42:46] <- rmm # rmm exactly REMAP.SVme
    idx <- 0
    for bit = 0 to 4
        if rmm[4-bit] then
            # activate requested shape
            if idx = 0 then SVSHAPE0 <- shape
            if idx = 1 then SVSHAPE1 <- shape
            if idx = 2 then SVSHAPE2 <- shape
            if idx = 3 then SVSHAPE3 <- shape
            SVSTATE[bit*2+32:bit*2+33] <- idx
            # increment shape index, modulo 4
            if idx = 3 then idx <- 0
            else            idx <- idx + 1
else
    # refined SVSHAPE/REMAP update mode
    bit <- rmm[0:2]
    idx <- rmm[3:4]
    if idx = 0 then SVSHAPE0 <- shape
    if idx = 1 then SVSHAPE1 <- shape
    if idx = 2 then SVSHAPE2 <- shape
    if idx = 3 then SVSHAPE3 <- shape
    SVSTATE[bit*2+32:bit*2+33] <- idx
    SVSTATE[46-bit] <- 1
``````
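The opening loop of the pseudocode derives the second dimension from MAXVL: d counts upward until d*dim covers MVL, i.e. d = ceil(MVL / (SVd+1)). A minimal plain-Python sketch (names follow the pseudocode, not an actual API):

```python
# Compute the "other" dimension from the nearest MAXVL, exactly as
# the do-while loop at the top of the svindex pseudocode does.

def other_dimension(mvl, svd):
    dim = svd + 1       # SVd is the dimension field, width SVd+1
    d = 0
    while d * dim < mvl:
        d += 1
    return d            # equivalently: ceil(mvl / (svd + 1))

print(other_dimension(10, 2))  # 4: a 3-wide shape needs 4 rows to cover MVL=10
```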

## svshape2 pseudocode

``````
# based on nearest MAXVL compute other dimension
MVL <- SVSTATE[0:6]
d <- [0] * 6
dim <- SVd+1
do while d*dim <u ([0]*4 || MVL)
    d <- d + 1
# set up template, then copy once location identified
shape <- [0]*32
shape[30:31] <- 0b00                # mode
shape[0:5] <- (0b0 || SVd)          # x/ydim
if SVyx = 0 then
    shape[18:20] <- 0b000           # ordering xd/yd(/zd)
    if sk = 0 then shape[6:11] <- 0 # ydim
    else           shape[6:11] <- 0b111111 # ydim max
else
    shape[18:20] <- 0b010           # ordering yd/xd(/zd)
    if sk = 1 then shape[6:11] <- 0 # ydim
    else           shape[6:11] <- d-1 # ydim max
# offset (the prime purpose of this instruction)
shape[24:27] <- SVo                 # offset
if sk = 1 then shape[28:29] <- 0b01 # skip 1st dimension
else           shape[28:29] <- 0b00 # no skipping
# select the mode for updating SVSHAPEs
SVSTATE[62] <- mm # set or clear persistence
if mm = 0 then
    # clear out all SVSHAPEs first
    SVSHAPE0[0:31] <- [0] * 32
    SVSHAPE1[0:31] <- [0] * 32
    SVSHAPE2[0:31] <- [0] * 32
    SVSHAPE3[0:31] <- [0] * 32
    SVSTATE[32:41] <- [0] * 10 # clear REMAP.mi/o
    SVSTATE[42:46] <- rmm # rmm exactly REMAP.SVme
    idx <- 0
    for bit = 0 to 4
        if rmm[4-bit] then
            # activate requested shape
            if idx = 0 then SVSHAPE0 <- shape
            if idx = 1 then SVSHAPE1 <- shape
            if idx = 2 then SVSHAPE2 <- shape
            if idx = 3 then SVSHAPE3 <- shape
            SVSTATE[bit*2+32:bit*2+33] <- idx
            # increment shape index, modulo 4
            if idx = 3 then idx <- 0
            else            idx <- idx + 1
else
    # refined SVSHAPE/REMAP update mode
    bit <- rmm[0:2]
    idx <- rmm[3:4]
    if idx = 0 then SVSHAPE0 <- shape
    if idx = 1 then SVSHAPE1 <- shape
    if idx = 2 then SVSHAPE2 <- shape
    if idx = 3 then SVSHAPE3 <- shape
    SVSTATE[bit*2+32:bit*2+33] <- idx
    SVSTATE[46-bit] <- 1
``````

## Example Matrix Usage

• `svshape` to set the type of reordering to be applied to an otherwise usual `0..VL-1` hardware for-loop
• `svremap` to set which registers a given reordering is to apply to (RA, RT etc)
• `sv.{instruction}` where any Vectorized register marked by `svremap` will have its ordering REMAPPED according to the schedule set by `svshape`.

The following illustrative example multiplies a 3x4 and a 5x3 matrix to create a 5x4 result:

``````
svshape 5,4,3,0,0         # Outer Product 5x4 by 4x3
svremap 15,1,2,3,0,0,0,0  # link Schedule to registers
sv.fmadds *0,*32,*64,*0   # 60 FMACs get executed here
``````

• svshape sets up the four SVSHAPE SPRS for a Matrix Schedule
• svremap activates four out of five registers RA RB RC RT RS (15)
• svremap requests:
• RA to use SVSHAPE1
• RB to use SVSHAPE2
• RC to use SVSHAPE3
• RT to use SVSHAPE0
• RS Remapping to not be activated
• sv.fmadds has vectors RT=0, RA=32, RB=64, RC=0
• With REMAP being active each register's element index is independently transformed using the specified SHAPEs.

Thus the Vector Loop is arranged such that the use of the multiply-and-accumulate instruction executes precisely the required Schedule to perform an in-place, in-registers Outer Product Matrix Multiply, with no need for additional Transpose or register-copy instructions. The example above may be executed as a unit test and demo, here.
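What the three-instruction sequence achieves can be modelled in plain Python (an illustrative sketch of the resulting triple-nested Schedule, not the exact SVSHAPE index streams):

```python
# Illustrative model: sixty fused multiply-adds forming a complete
# matrix multiply, with no transpose or copy operations in between.
X, Y, Z = 5, 4, 3                 # the dimensions given to svshape
A = [[float(y * Z + z) for z in range(Z)] for y in range(Y)]  # Y x Z
B = [[float(z * X + x) for x in range(X)] for z in range(Z)]  # Z x X
C = [[0.0] * X for _ in range(Y)]                             # Y x X result

fmacs = 0
for y in range(Y):
    for x in range(X):
        for z in range(Z):        # the triple-nested Schedule
            C[y][x] += A[y][z] * B[z][x]  # one sv.fmadds element-op
            fmacs += 1

print(fmacs)    # 60, matching "60 FMACs get executed here"
print(C[0][0])  # 25.0
```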

\newpage{}