REMAP

See the "propagation" section, because it is currently the only way to apply REMAP.

REMAP allows the usual vector loop 0..VL-1 to be "reshaped" (re-mapped) from a linear form to a 2D or 3D transposed form, or "offset" to permit arbitrary access to elements, independently on each Vector src or dest register.

The primary use is Matrix Multiplication, reordering sequential data in-place. Four SPRs are provided so that a single FMAC may be used in a single loop to perform 4x4 by 4x4 Matrix Multiplication, generating 64 FMACs. Additional uses include regular "Structure Packing", such as RGB pixel data extraction and reforming.

REMAP, like all of SV, is abstracted out, meaning that unlike traditional Vector ISAs which would typically only have a limited set of instructions that can be structure-packed (LD/ST typically), REMAP may be applied to literally any instruction: CRs, Arithmetic, Logical, LD/ST, anything.

Note that REMAP does not apply to sub-vector elements: that is what swizzle is for. Swizzle can however be applied to the same instruction as REMAP.

REMAP is quite expensive to set up, and on some implementations it may introduce latency, so it should realistically be used only where it is worthwhile.

Principle

  • normal vector element read/write as operands would be sequential (0 1 2 3 ....)
  • this is not appropriate for (e.g.) Matrix multiply, which requires accessing elements in alternative sequences (0 3 6 1 4 7 ...)
  • normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope" with this; both are expensive (copying large vectors, spilling through memory)
  • REMAP redefines the order of access according to set "Schedules"

Only the most commonly-used algorithms in computer science have REMAP support, due to the high cost in both the ISA and in hardware.
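
As a simple illustration of the principle (a minimal sketch in plain Python, not the SVSHAPE encoding itself), the "alternative sequence" 0 3 6 1 4 7 ... mentioned above is just a column-major walk over a row-major 3-wide matrix; REMAP generates index sequences of this kind in hardware, rather than requiring Indexed-MV or Indexed-LD/ST:

    # illustrative sketch only: a column-major walk over a row-major
    # 3x3 matrix produces the access order 0 3 6 1 4 7 2 5 8
    rows, cols = 3, 3
    linear   = list(range(rows * cols))
    remapped = [r * cols + c for c in range(cols) for r in range(rows)]
    print(linear)    # [0, 1, 2, 3, 4, 5, 6, 7, 8]
    print(remapped)  # [0, 3, 6, 1, 4, 7, 2, 5, 8]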

REMAP SPR

|0..1 |2..3 |4..5 |6..7 |8..9 |10..14 |15..23 |
|mi0  |mi1  |mi2  |mo0  |mo1  |SVme   |rsvd   |

mi0-2 and mo0-1 each select one of SVSHAPE0-3 to apply to a given register. mi0-2 apply to RA, RB, RC respectively, as input registers, and likewise mo0-1 apply to the output registers (FRT, FRS respectively). SVme is 5 bits and indicates whether each SVSHAPE is actively applied or not:

  • bit 0 of SVme indicates if mi0 is applied to RA / FRA
  • bit 1 of SVme indicates if mi1 is applied to RB / FRB
  • bit 2 of SVme indicates if mi2 is applied to RC / FRC
  • bit 3 of SVme indicates if mo0 is applied to RT / FRT
  • bit 4 of SVme indicates if mo1 is applied to Effective Address / FRS (LD/ST-with-update has an implicit 2nd write register, RA)
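
The following is a rough sketch (not authoritative; the function name and the use of LSB0 bit numbering here are illustrative assumptions) of how the five SVme enable bits and the mi0-2/mo0-1 selectors combine: each active SVme bit means "this operand uses the SVSHAPE chosen by its 2-bit selector".

    # illustrative sketch: decode which SVSHAPE (if any) applies to each
    # operand, following the SVme bit assignments listed above
    def decode_remap(SVme, mi0, mi1, mi2, mo0, mo1):
        operands  = ["RA/FRA", "RB/FRB", "RC/FRC", "RT/FRT", "EA/FRS"]
        selectors = [mi0, mi1, mi2, mo0, mo1]
        active = {}
        for bit, (op, sel) in enumerate(zip(operands, selectors)):
            if SVme & (1 << bit):        # SVme bit 0 -> RA, bit 1 -> RB, ...
                active[op] = "SVSHAPE%d" % sel
        return active

    # example: SVSHAPE0 applied to RA and SVSHAPE1 applied to RT, nothing else
    print(decode_remap(SVme=0b01001, mi0=0, mi1=0, mi2=0, mo0=1, mo1=0))
    # {'RA/FRA': 'SVSHAPE0', 'RT/FRT': 'SVSHAPE1'}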

There is also a corresponding SVRM-Form for the svremap instruction which matches the above SPR:

|0     |6     |11  |13   |15   |17   |19   |21    | 22    |26     |31 |
| PO   | SVme |mi0 | mi1 | mi2 | mo0 | mo1 | pst  | rsvd | XO    | / |

SHAPE 1D/2D/3D vector-matrix remapping SPRs

There are four "shape" SPRs, SHAPE0-3, 32 bits each, all with the same format.

When a SHAPE SPR is set entirely to zeros, remapping is disabled: the register's elements form a linear (1D) vector.

|31..30 |29..28  |27..24 |23..21 |20..18   |17..12 |11..6  |5..0   |
|0b00   |skip    |offset |invxyz |permute  |zdimsz |ydimsz |xdimsz |
|0b01   |submode |offset |invxyz |submode2 |rsvd   |rsvd   |xdimsz |

mode sets different behaviours (straight matrix multiply, FFT, DCT).

  • mode=0b00 sets straight Matrix Mode
  • mode=0b01 sets "FFT/DCT" mode and activates submodes

When submode2 is 0, for FFT submode the following schedules may be selected:

  • submode=0b00 selects the j offset of the innermost for-loop of Cooley-Tukey
  • submode=0b10 selects the j+halfsize offset of the innermost for-loop of Cooley-Tukey
  • submode=0b11 selects the k of exptable (which coefficient)
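
For orientation, the sketch below shows where those three indices appear in a textbook in-place radix-2 Cooley-Tukey loop nest (illustrative only, not a model of the SVSHAPE hardware; variable names are assumptions):

    # illustrative only: the j, j+halfsize and exptable-index k terms of a
    # textbook iterative radix-2 Cooley-Tukey FFT (n a power of two)
    n = 8
    size = 2
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):
            k = 0
            for j in range(i, i + halfsize):
                # butterfly on elements j (submode=0b00) and j+halfsize
                # (submode=0b10), using exptable coefficient k (submode=0b11)
                print(j, j + halfsize, k)
                k += tablestep
        size *= 2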

When submode2 is 1 or 2, for DCT inner butterfly submode the following schedules may be selected. When submode2 is 1, additional bit-reversing is also performed.

  • submode=0b00 selects the j offset of the innermost for-loop, in-place
  • submode=0b01 selects the j+halfsize offset of the innermost for-loop, in reverse-order, in-place
  • submode=0b10 selects the ci count of the innermost for-loop, useful for calculating the cosine coefficient
  • submode=0b11 selects the size offset of the outermost for-loop, useful for the cosine coefficient cos(ci + 0.5) * pi / size

When submode2 is 3 or 4, for DCT outer butterfly submode the following schedules may be selected. When submode2 is 3, additional bit-reversing is also performed.

  • submode=0b00 selects the j offset of the innermost for-loop
  • submode=0b01 selects the j+1 offset of the innermost for-loop

In Matrix Mode, skip allows dimensions to be skipped from being included in the resultant output index. This allows sequences to be repeated: 0 0 0 1 1 1 2 2 2 ..., or, in the case of skip=0b11, results in modulo: 0 1 2 0 1 2 ... (see the short sketch after the list below).

  • skip=0b00 indicates no dimensions to be skipped
  • skip=0b01 sets "skip 1st dimension"
  • skip=0b10 sets "skip 2nd dimension"
  • skip=0b11 sets "skip 3rd dimension"
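
A minimal sketch of the two index patterns described above (illustrative only; the exact mapping of skip encodings onto patterns is handled by the reference code later in this section):

    # illustrative only: skipping a dimension from the index calculation
    # repeats indices, whilst (as noted above) skip=0b11 gives the modulo form
    xdim, ydim = 3, 3
    repeated = [y for y in range(ydim) for x in range(xdim)]
    modulo   = [x for y in range(ydim) for x in range(xdim)]
    print(repeated)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
    print(modulo)    # [0, 1, 2, 0, 1, 2, 0, 1, 2]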

invxyz will invert the start index of each of x, y or z. If invxyz[0] is zero then x-dimensional counting begins from 0 and increments, otherwise it begins from xdimsz-1 and iterates down to zero. Likewise for y and z.

offset will have the effect of offsetting the result by offset elements:

for i in 0..VL-1:
    GPR(RT + remap(i) + SVSHAPE.offset) = ....

This appears redundant, because the register RT could simply be changed by a compiler, until element width overrides are introduced. Also bear in mind that, unlike a static compiler, SVSHAPE.offset may be set dynamically at runtime.

xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates that the array dimensionality for that dimension is 1. Any dimension not intended to be used must have its value set to 0 (dimensionality of 1). A value of xdimsz=2 would indicate that the first dimension has 3 elements. For example, to create a 2D array X,Y of dimensionality X=3 and Y=2, set xdimsz=2, ydimsz=1 and zdimsz=0.

The format of the array is therefore as follows:

array[xdimsz+1][ydimsz+1][zdimsz+1]

However whilst illustrative of the dimensionality, that does not take the "permute" setting into account. "permute" may be any one of six values (0-5, with values of 6 and 7 being reserved, and not legal). The table below shows how the permutation dimensionality order works:

|permute |order |array format             |
|000     |0,1,2 |(xdim+1)(ydim+1)(zdim+1) |
|001     |0,2,1 |(xdim+1)(zdim+1)(ydim+1) |
|010     |1,0,2 |(ydim+1)(xdim+1)(zdim+1) |
|011     |1,2,0 |(ydim+1)(zdim+1)(xdim+1) |
|100     |2,0,1 |(zdim+1)(xdim+1)(ydim+1) |
|101     |2,1,0 |(zdim+1)(ydim+1)(xdim+1) |

In other words, the "permute" option changes the order in which nested for-loops over the array would be done. See executable python reference code for further details.

The algorithm below shows how REMAP works more clearly, and may be executed as a python program:


# Finite State Machine version of the REMAP system.  much more likely
# to end up being actually used in actual hardware

# up to three dimensions permitted
xdim = 3
ydim = 2
zdim = 1

VL = xdim * ydim * zdim # set total (can repeat, e.g. VL=x*y*z*4)

lims = [xdim, ydim, zdim]
idxs = [0,0,0]   # starting indices
applydim = [1, 1]   # apply lower dims
order = [1,0,2]  # experiment with different permutations, here
offset = 0       # experiment with different offsets, here
invxyz = [0,1,0] # inversion allowed

# pre-prepare the index state: run for "offset" times before
# actually starting.  this algorithm can also be used for re-entrancy
# if exceptions occur and a REMAP has to be started from where the
# interrupt left off.
for idx in range(offset):
    for i in range(3):
        idxs[order[i]] = idxs[order[i]] + 1
        if (idxs[order[i]] != lims[order[i]]):
            break
        idxs[order[i]] = 0

break_count = 0 # for pretty-printing

for idx in range(VL):
    ix = [0] * 3
    for i in range(3):
        ix[i] = idxs[i]
        if invxyz[i]:
            ix[i] = lims[i] - 1 - ix[i]
    new_idx = ix[2]
    if applydim[1]:
        new_idx = new_idx * ydim + ix[1]
    if applydim[0]:
        new_idx = new_idx * xdim + ix[0]
    print ("%d->%d" % (idx, new_idx), end=" ")
    break_count += 1
    if break_count == lims[order[0]]:
        print ("")
        break_count = 0
    # this is the exact same thing as the pre-preparation stage
    # above.  step 1: count up to the limit of the current dimension
    # step 2: if limit reached, zero it, and allow the *next* dimension
    # to increment.  repeat for 3 dimensions.
    for i in range(3):
        idxs[order[i]] = idxs[order[i]] + 1
        if (idxs[order[i]] != lims[order[i]]):
            break
        idxs[order[i]] = 0

An easier-to-read version (using python iterators) shows the loop nesting:


# a "yield" version of the REMAP algorithm. a little easier to read
# than the Finite State Machine version

# python "yield" can be iterated. use this to make it clear how
# the indices are generated by using natural-looking nested loops
def iterate_indices(SVSHAPE):
    # get indices to iterate over, in the required order
    xd = SVSHAPE.lims[0]
    yd = SVSHAPE.lims[1]
    zd = SVSHAPE.lims[2]
    # create lists of indices to iterate over in each dimension
    x_r = list(range(xd))
    y_r = list(range(yd))
    z_r = list(range(zd))
    # invert the indices if needed
    if SVSHAPE.invxyz[0]: x_r.reverse()
    if SVSHAPE.invxyz[1]: y_r.reverse()
    if SVSHAPE.invxyz[2]: z_r.reverse()
    # start an infinite (wrapping) loop
    while True:
        for z in z_r:   # loop over 1st order dimension
            for y in y_r:       # loop over 2nd order dimension
                for x in x_r:           # loop over 3rd order dimension
                    # ok work out which order to construct things in.
                    # start by creating a list of tuples of the dimension
                    # and its limit
                    vals = [(SVSHAPE.lims[0], x, "x"),
                            (SVSHAPE.lims[1], y, "y"),
                            (SVSHAPE.lims[2], z, "z")
                           ]
                    # now select those by order.  this allows us to
                    # create schedules for [z][x], [x][y], or [y][z]
                    # for matrix multiply.
                    vals = [vals[SVSHAPE.order[0]],
                            vals[SVSHAPE.order[1]],
                            vals[SVSHAPE.order[2]]
                           ]
                    # some of the dimensions can be "skipped".  the order
                    # was actually selected above on all 3 dimensions,
                    # e.g. [z][x][y] or [y][z][x].  "skip" allows one of
                    # those to be knocked out
                    if SVSHAPE.skip == 0b00:
                        select = 0b111
                    elif SVSHAPE.skip == 0b11:
                        select = 0b011
                    elif SVSHAPE.skip == 0b01:
                        select = 0b110
                    elif SVSHAPE.skip == 0b10:
                        select = 0b101
                    else:
                        select = 0b111
                    result = 0
                    mult = 1
                    # ok now we can construct the result, using bits of
                    # "order" to say which ones get stacked on
                    for i in range(3):
                        lim, idx, dbg = vals[i]
                        if select & (1<<i):
                            #print ("select %d %s" % (i, dbg))
                            idx *= mult   # shifts up by previous dimension(s)
                            result += idx # adds on this dimension
                            mult *= lim   # for the next dimension

                    yield result + SVSHAPE.offset

def demo():
    # set the dimension sizes here
    xdim = 3
    ydim = 2
    zdim = 1

    # set total (can repeat, e.g. VL=x*y*z*4)
    VL = xdim * ydim * zdim

    # set up an SVSHAPE
    class SVSHAPE:
        pass
    SVSHAPE0 = SVSHAPE()
    SVSHAPE0.lims = [xdim, ydim, zdim]
    SVSHAPE0.order = [1,0,2]  # experiment with different permutations, here
    SVSHAPE0.mode = 0b00
    SVSHAPE0.offset = 0       # experiment with different offset, here
    SVSHAPE0.invxyz = [0,0,0] # inversion if desired
    SVSHAPE0.skip = 0b00      # no dimensions skipped (needed by iterate_indices)

    # enumerate over the iterator function, getting new indices
    for idx, new_idx in enumerate(iterate_indices(SVSHAPE0)):
        if idx >= VL:
            break
        print ("%d->%d" % (idx, new_idx))

# run the demo
if __name__ == '__main__':
    demo()

Each element index from the for-loop 0..VL-1 is run through the above algorithm to work out the actual element index, instead. Given that there are four possible SHAPE entries, up to four separate registers in any given operation may be simultaneously remapped:

function op_add(rd, rs1, rs2) # add not VADD!
  ...
  ...
  for (i = 0; i < VL; i++)
    xSTATE.srcoffs = i # save context
    if (predval & 1<<i) # predication uses intregs
       ireg[rd+remap1(id)] <= ireg[rs1+remap2(irs1)] +
                              ireg[rs2+remap3(irs2)];
       if (!int_vec[rd ].isvector) break;
    if (int_vec[rd ].isvector)  { id += 1; }
    if (int_vec[rs1].isvector)  { irs1 += 1; }
    if (int_vec[rs2].isvector)  { irs2 += 1; }

By changing remappings, 2D matrices may be transposed "in-place" for one operation, followed by setting a different permutation order without having to move the values in the registers to or from memory.

Note that:

  • Over-running the register file clearly has to be detected and an illegal instruction exception thrown
  • When non-default elwidths are set, the exact same algorithm still applies (i.e. it offsets polymorphic elements within registers rather than entire registers).
  • If permute option 000 is utilised, the actual order of the reindexing does not change. However, modulo MVL still occurs which will result in repeated operations (use with caution).
  • If two or more dimensions are set to zero, the actual order does not change!
  • The above algorithm is pseudo-code only. Actual implementations will need to take into account the fact that the element for-looping must be re-entrant, due to the possibility of exceptions occurring. See SVSTATE SPR, which records the current element index. Continuing after return from an interrupt may introduce latency due to re-computation of the remapped offsets.
  • Twin-predicated operations require two separate and distinct element offsets. The above pseudo-code algorithm will be applied separately and independently to each, should each of the two operands be remapped. This even includes unit-strided LD/ST and other operations in that category, where in that case it will be the offset that is remapped.
  • Offset is especially useful, on its own, for accessing elements within the middle of a register. Without offsets, it is necessary to either use a predicated MV, skipping the first elements, or performing a LOAD/STORE cycle to memory. With offsets, the data does not have to be moved.
  • Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to less than MVL is perfectly legal, albeit very obscure. It permits entries to be regularly presented to operands more than once, thus allowing the same underlying registers to act as an accumulator of multiple vector or matrix operations, for example.
  • Note especially that Program Order must still be respected even when overlaps occur that read or write the same register elements, including polymorphic ones.

Clearly here some considerable care needs to be taken as the remapping could hypothetically create arithmetic operations that target the exact same underlying registers, resulting in data corruption due to pipeline overlaps. Out-of-order / Superscalar micro-architectures with register-renaming will have an easier time dealing with this than DSP-style SIMD micro-architectures.

svstate instruction

Please note: this is not intended for production. It sets up (overwrites) all required SVSHAPE SPRs and indicates that the next instruction shall have those REMAP shapes applied to it, assuming that instruction is of the form FRT,FRA,FRC,FRB.

Form: SVM-Form SV "Matrix" Form (see fields.text)

| 0..5 | 6..10 | 11..15 | 16..20 | 21..24 | 25 | 26..30 | 31 | name    |
| OPCD | SVxd  | SVyd   | SVzd   | SVRM   | vf | XO     | /  | svstate |

Fields:

  • SVxd - SV REMAP "xdim"
  • SVyd - SV REMAP "ydim"
  • SVzd - SV REMAP "zdim"
  • SVRM - SV REMAP Mode (0b0000 for Matrix, 0b0001 for FFT)
  • vf - sets "Vertical-First" mode
  • XO - standard 5-bit XO field

4x4 Matrix to vec4 Multiply Example

The following settings will allow a 4x4 matrix (starting at f8), expressed as a sequence of 16 numbers first by row then by column, to be multiplied by a vector of length 4 (starting at f0), using a single FMAC instruction.

  • SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
  • SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
  • VL=16, f4=vec, f0=vec, f8=vec
  • FMAC f4, f0, f8, f4

The permutation on SHAPE0 will use f0 as a vec4 source. On the first four iterations through the hardware loop, the REMAPed index will not increment. On the second four, the index will increase by one. Likewise on each subsequent group of four.

The permutation on SHAPE1 will increment the f4 index continuously, cycling through f4-f7 every four iterations of the hardware loop.

At the same time, because there is no SHAPE on f8, the f8 operand will increment straight sequentially through the 16 values f8-f23 in the Matrix. The equivalent sequence thus issued is:

fmac f4, f0, f8, f4
fmac f5, f0, f9, f5
fmac f6, f0, f10, f6
fmac f7, f0, f11, f7
fmac f4, f1, f12, f4
fmac f5, f1, f13, f5
fmac f6, f1, f14, f6
fmac f7, f1, f15, f7
fmac f4, f2, f16, f4
fmac f5, f2, f17, f5
fmac f6, f2, f18, f6
fmac f7, f2, f19, f7
fmac f4, f3, f20, f4
fmac f5, f3, f21, f5
fmac f6, f3, f22, f6
fmac f7, f3, f23, f7
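
As a cross-check, the short sketch below (plain Python, not a model of the SVSHAPE hardware) regenerates the register numbers in the listing above from the settings given: SHAPE1 cycles the destination through f4-f7, SHAPE0 holds each of f0-f3 for four iterations, and the un-remapped f8 operand runs linearly:

    # regenerate the 16-FMAC schedule shown above (illustrative only)
    VL = 16
    for i in range(VL):
        rt = 4 + (i % 4)     # SHAPE1: cycles f4-f7 every four iterations
        ra = 0 + (i // 4)    # SHAPE0 (permute=yx): f0-f3, each held for four
        rc = 8 + i           # no SHAPE on f8: sequential f8-f23
        print("fmac f%d, f%d, f%d, f%d" % (rt, ra, rc, rt))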

The only other instruction required is to ensure that f4-f7 are initialised (usually to zero).

It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively the same technique applied to four independent vectors, can be done by setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 SPRs, and applying a rotating 1D SHAPE SPR of xdim=16 to f8 in order to get it to apply four times to compute the four columns worth of vectors.
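
For reference, the arithmetic that the VL=64 schedule maps onto is the plain triple-loop 4x4 by 4x4 multiply below, i.e. 64 multiply-accumulates (this sketch illustrates the arithmetic only, not the register allocation, which is an implementation choice):

    # the 64 FMACs of a 4x4 by 4x4 matrix multiply (illustrative only)
    A = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
    B = [[float(i == j)    for j in range(4)] for i in range(4)]  # identity
    C = [[0.0] * 4 for _ in range(4)]
    for y in range(4):           # result row
        for x in range(4):       # result column
            for z in range(4):   # dot-product index: one FMAC per iteration
                C[y][x] += A[y][z] * B[z][x]
    print(C)  # with B set to the identity, C equals A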

Warshall transitive closure algorithm

TODO move to ?discussion page, copied from here http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html

with thanks to Hendrik.

Just a note: interpreting + as 'or', and * as 'and', operating on Boolean matrices, and having result, X, and Y be the exact same matrix, updated while being used, gives the traditional Warshall transitive-closure algorithm, if the loops are nested exactly in this order.

This can be done with the ternary instruction, which has an in-place triple boolean input:

RT = RT | (RA & RB)

There is also a CR Field variant of the same.

Notes from conversations:

    for y in y_r:
        for x in x_r:
            for z in z_r:
                result[y][x] += a[y][z] * b[z][x]

This nesting of loops works for matrix multiply, but not for transitive closure.

It can be done:

    for z in z_r:
        for y in y_r:
            for x in x_r:
                result[y][x] += a[y][z] * b[z][x]

And this ordering of loops does work for transitive closure, when a, b, and result are the very same matrix, updated while being used.

By the way, I believe there is a graph algorithm that does the transitive closure thing, but instead of using boolean, "and", and "or", they use real numbers, addition, and minimum.  I think that one computes shortest paths between vertices.

By the time the z'th iteration of the z loop begins, the algorithm has already processed paths that go through vertices numbered < z, and it adds paths that go through vertices numbered z.

For this to work, the outer loop has to be the one on the subscript that bridges a and b (which in this case are the same matrix, of course).
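
A minimal runnable sketch of the algorithm described in these notes, with booleans, the bridging-vertex z loop outermost, and the result being the same matrix updated in place (names are illustrative):

    # Warshall transitive closure: m[y][x] becomes True if x is reachable
    # from y.  the z (bridging vertex) loop is outermost, and the matrix
    # is read and updated in place.
    def warshall(m):
        n = len(m)
        for z in range(n):
            for y in range(n):
                for x in range(n):
                    m[y][x] = m[y][x] or (m[y][z] and m[z][x])
        return m

    # edges 0->1 and 1->2: the closure adds 0->2
    print(warshall([[False, True,  False],
                    [False, False, True ],
                    [False, False, False]]))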

SUBVL Remap

Remapping even of SUBVL (vec2/3/4) elements is permitted, as if the sub-vector elements were simply part of the main VL loop. This is the complete opposite of predication, which only applies to the whole vec2/3/4. In pseudocode this would be:

  for (i = 0; i < VL; i++)
    if (predval & 1<<i) # apply to VL not SUBVL
      for (j = 0; j < SUBVL; j++)
         id = i*SUBVL + j # not, "id=i".
         ireg[RT+remap1(id)] ...

The reason for allowing SUBVL Remaps is that some regular patterns using Swizzle, which would otherwise require multiple explicit instructions with 12-bit swizzles encoded in them, may be efficiently encoded with Remap instead. Note however that Swizzle is still permitted to be applied.

An example where SUBVL Remap is appropriate is the Rijndael MixColumns stage:

Assuming that the bytes are stored a00 a01 a02 a03 a10 .. a33, a 2D REMAP allows:

  • the column bytes (as a vec4) to be iterated over as an inner loop, progressing vertically (a00 a10 a20 a30)
  • the columns themselves to be iterated as an outer loop
  • a 32 bit GF(256) Matrix Multiply on the vec4 to be performed.

This is done entirely in-place, without special 128-bit opcodes. Below is the pseudocode for Rijndael MixColumns:

void gmix_column(unsigned char *r) {
    unsigned char a[4];
    unsigned char b[4];
    unsigned char c;
    unsigned char h;
    // no swizzle here but still SUBVL.Remap
    // can be done as vec4 byte-level
    // elwidth overrides though.
    for (c = 0; c < 4; c++) {
        a[c] = r[c];
        h = (unsigned char)((signed char)r[c] >> 7);
        b[c] = r[c] << 1;
        b[c] ^= 0x1B & h; /* Rijndael's Galois field */
    }
    // SUBVL.Remap still needed here
    // bytelevel elwidth overrides and vec4
    // These may then each be 4x 8-bit Swizzled
    // r0.vec4 = b.vec4
    // r0.vec4 ^= a.vec4.WXYZ
    // r0.vec4 ^= a.vec4.ZWXY
    // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
    r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
    r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
    r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3]; 
    r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
}

With the assumption made by the above code that the column bytes have already been turned around (vertical rather than horizontal), SUBVL.REMAP may transparently fill that role, in-place, without a complex byte-level mv operation.

The application of the swizzles allows the remapped vec4 a, b and r variables to perform four straight linear 32-bit XOR operations where a scalar processor would be required to perform 16 byte-level individual operations. Given wide enough SIMD backends in hardware, these three XORs may be done as single-cycle operations across the entire 128-bit Rijndael Matrix.
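
The sketch below (illustrative only; the helper names are not part of any ISA) cross-checks that the swizzled vec4 formulation in the comments above produces exactly the same byte-level XORs as gmix_column():

    # cross-check: with lanes X,Y,Z,W being indices 0,1,2,3,
    # r = b ^ a.WXYZ ^ a.ZWXY ^ (b.YZWX ^ a.YZWX) matches gmix_column()
    def swizzle(v, order):
        lane = {"X": 0, "Y": 1, "Z": 2, "W": 3}
        return [v[lane[ch]] for ch in order]

    def mixcolumn_xors(a, b):
        r = list(b)
        for i in range(4):
            r[i] ^= swizzle(a, "WXYZ")[i]
            r[i] ^= swizzle(a, "ZWXY")[i]
            r[i] ^= swizzle(b, "YZWX")[i] ^ swizzle(a, "YZWX")[i]
        return r  # identical to the four r[0..3] assignments in gmix_column()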

The other alternative is to simply perform the actual 4x4 GF(256) Matrix Multiply using the MDS Matrix.
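
For completeness, a sketch of that alternative (illustrative only): multiply each column by the standard AES MDS matrix over GF(2^8), using the same 0x1B reduction polynomial already used in gmix_column() above:

    # GF(2^8) multiply with the Rijndael reduction polynomial 0x1B, then
    # the MDS matrix-vector product for one column (illustrative only)
    def gf_mul(a, b):
        p = 0
        for _ in range(8):
            if b & 1:
                p ^= a
            hi = a & 0x80
            a = (a << 1) & 0xFF
            if hi:
                a ^= 0x1B
            b >>= 1
        return p

    MDS = [[2, 3, 1, 1],
           [1, 2, 3, 1],
           [1, 1, 2, 3],
           [3, 1, 1, 2]]

    def mix_column_mds(col):
        return [gf_mul(MDS[i][0], col[0]) ^ gf_mul(MDS[i][1], col[1]) ^
                gf_mul(MDS[i][2], col[2]) ^ gf_mul(MDS[i][3], col[3])
                for i in range(4)]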

TODO

investigate https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429 in https://bugs.libre-soc.org/show_bug.cgi?id=653