## REMAP Matrix pseudocode

The algorithm below shows more clearly how REMAP works, and may be executed as a Python program:

``````
# Finite State Machine version of the REMAP system.  much more likely
# to end up being actually used in actual hardware

# up to three dimensions permitted
xdim = 3
ydim = 2
zdim = 1

VL = xdim * ydim * zdim # set total (can repeat, e.g. VL=x*y*z*4)

lims = [xdim, ydim, zdim]
idxs = [0,0,0]   # starting indices
applydim = [1, 1]   # apply lower dims
order = [1,0,2]  # experiment with different permutations, here
offset = 0       # experiment with different offsets, here
invxyz = [0,1,0] # inversion allowed

# pre-prepare the index state: run for "offset" times before
# actually starting.  this algorithm can also be used for re-entrancy
# if exceptions occur and a REMAP has to be started from where the
# interrupt left off.
for idx in range(offset):
    for i in range(3):
        idxs[order[i]] = idxs[order[i]] + 1
        if (idxs[order[i]] != lims[order[i]]):
            break
        idxs[order[i]] = 0

break_count = 0 # for pretty-printing

for idx in range(VL):
    ix = [0] * 3
    for i in range(3):
        ix[i] = idxs[i]
        if invxyz[i]:
            ix[i] = lims[i] - 1 - ix[i]
    new_idx = ix[2]
    if applydim[1]:
        new_idx = new_idx * ydim + ix[1]
    if applydim[0]:
        new_idx = new_idx * xdim + ix[0]
    print("%d->%d" % (idx, new_idx), end=" ")
    break_count += 1
    if break_count == lims[order[0]]:
        print()
        break_count = 0
    # this is the exact same thing as the pre-preparation stage
    # above.  step 1: count up to the limit of the current dimension
    # step 2: if limit reached, zero it, and allow the *next* dimension
    # to increment.  repeat for 3 dimensions.
    for i in range(3):
        idxs[order[i]] = idxs[order[i]] + 1
        if (idxs[order[i]] != lims[order[i]]):
            break
        idxs[order[i]] = 0
``````

An easier-to-read version (using python iterators) is given in a later section of this Appendix.

Each element index from the for-loop `0..VL-1` is run through the above algorithm to work out the actual element index to use in its place. Given that there are four possible SHAPE entries, up to four separate registers in any given operation may be simultaneously remapped:

``````
function op_add(RT, RA, RB) # add not VADD!
    for (i=0, id=0, irs1=0, irs2=0; i < VL; i++)
        SVSTATE.srcstep = i # save context
        if (predval & 1<<i) # predication mask
            GPR[RT+remap1(id)] <= GPR[RA+remap2(irs1)] +
                                  GPR[RB+remap3(irs2)];
            if (!RT.isvector) break;
        if (RT.isvector)  { id += 1; }
        if (RA.isvector)  { irs1 += 1; }
        if (RB.isvector)  { irs2 += 1; }
``````

By changing the remapping, a 2D matrix may be transposed "in-place" for one operation, then given a different permutation order for the next, without having to move the values in the registers to or from memory.
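The transpose-by-permutation idea can be sketched with the same FSM increment logic as the pseudocode above (plain Python; `remap_indices` is an illustrative helper, not part of the ISA):

```python
# Illustrative sketch: a 2D transpose achieved purely by changing the
# REMAP permutation order.  Same index-increment FSM as the pseudocode.

def remap_indices(xdim, ydim, order):
    """Yield remapped element indices for a 2D shape, incrementing
    the dimensions in the requested order."""
    lims = [xdim, ydim]
    idxs = [0, 0]
    for _ in range(xdim * ydim):
        # row-major flattening of the current (x, y) coordinate
        yield idxs[1] * xdim + idxs[0]
        # count up in the first-listed dimension; carry into the next
        for i in range(2):
            idxs[order[i]] += 1
            if idxs[order[i]] != lims[order[i]]:
                break
            idxs[order[i]] = 0

print(list(remap_indices(3, 2, order=[0, 1])))  # identity:   [0, 1, 2, 3, 4, 5]
print(list(remap_indices(3, 2, order=[1, 0])))  # transposed: [0, 3, 1, 4, 2, 5]
```

The register contents never move: only the order in which element indices are presented to the operation changes.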

Note that:

• Over-running the register file clearly has to be detected, and an illegal instruction exception thrown.
• When non-default elwidths are set, the exact same algorithm still applies (i.e. it offsets polymorphic elements within registers rather than entire registers).
• If permute option 000 is utilised, the actual order of the reindexing does not change. However, modulo MVL still occurs, which will result in repeated operations (use with caution).
• If two or more dimensions are set to zero, the actual order does not change!
• The above algorithm is pseudo-code only. Actual implementations will need to take into account the fact that the element for-looping must be re-entrant, due to the possibility of exceptions occurring. See SVSTATE SPR, which records the current element index. Continuing after return from an interrupt may introduce latency due to re-computation of the remapped offsets.
• Twin-predicated operations require two separate and distinct element offsets. The above pseudo-code algorithm will be applied separately and independently to each, should each of the two operands be remapped. This even includes unit-strided LD/ST and other operations in that category, where in that case it will be the address offset that is remapped: `EA <- (RA) + immediate * REMAP(elementoffset)`.
• Offset is especially useful, on its own, for accessing elements within the middle of a register. Without offsets, it is necessary to either use a predicated MV, skipping the first elements, or performing a LOAD/STORE cycle to memory. With offsets, the data does not have to be moved.
• Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to less than MVL is perfectly legal, albeit very obscure. It permits entries to be regularly presented to operands more than once, thus allowing the same underlying registers to act as an accumulator of multiple vector or matrix operations, for example.
• Note especially that Program Order must still be respected even when overlaps occur that read or write the same register elements including polymorphic ones
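The offset point above can be illustrated with a tiny model (plain Python, simplified to one dimension; `offset_indices` is a hypothetical helper, not ISA code): pre-advancing the index state lets an operation start partway into a vector without a predicated MV or a memory round-trip.

```python
# Simplified 1D model of the "offset" feature: the FSM is pre-run
# "offset" times, so the first element actually accessed sits in the
# middle of the register group, wrapping modulo the dimension.

def offset_indices(dim, offset, count):
    """Yield `count` element indices starting `offset` elements in."""
    idx = offset % dim
    for _ in range(count):
        yield idx
        idx = (idx + 1) % dim

print(list(offset_indices(8, 3, 5)))  # [3, 4, 5, 6, 7]
print(list(offset_indices(4, 3, 3)))  # [3, 0, 1] - wraps modulo the dim
```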

Clearly, considerable care needs to be taken here, as the remapping could hypothetically create arithmetic operations that target the exact same underlying registers, resulting in data corruption due to pipeline overlaps. Out-of-order / superscalar micro-architectures with register renaming will have an easier time dealing with this than DSP-style SIMD micro-architectures.

### 4x4 Matrix to vec4 Multiply (4x4 by 1x4)

The following settings will allow a 4x4 matrix (starting at f8), expressed as a sequence of 16 numbers first by row then by column, to be multiplied by a vector of length 4 (starting at f0), using a single FMAC instruction.

• SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
• SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
• VL=16, f4=vec, f0=vec, f8=vec
• FMAC f4, f0, f8, f4

The permutation on SHAPE0 will use f0 as a vec4 source. On the first four iterations through the hardware loop, the REMAPed index will not increment. On the second four, the index will increase by one. Likewise on each subsequent group of four.

The permutation on SHAPE1 will increment f4 continuously, cycling through f4-f7 every four iterations of the hardware loop.

At the same time, because there is no SHAPE on f8, the f8 index will simply increment sequentially through the 16 Matrix values f8-f23. The equivalent instruction sequence thus issued is:

``````
fmac f4, f0, f8, f4
fmac f5, f0, f9, f5
fmac f6, f0, f10, f6
fmac f7, f0, f11, f7
fmac f4, f1, f12, f4
fmac f5, f1, f13, f5
fmac f6, f1, f14, f6
fmac f7, f1, f15, f7
fmac f4, f2, f16, f4
fmac f5, f2, f17, f5
fmac f6, f2, f18, f6
fmac f7, f2, f19, f7
fmac f4, f3, f20, f4
fmac f5, f3, f21, f5
fmac f6, f3, f22, f6
fmac f7, f3, f23, f7
``````

Hardware should easily pipeline the above FMACs: because each accumulator (f4-f7) is reused only on every fourth iteration, as long as each FMAC completes in four cycles or fewer there will be 100% sustained throughput, from the one single Vector FMAC.

The only other instructions required are those that initialise f4-f7 (usually to zero). If the FMACs form part of some other computation, as is frequently the case, even that zeroing is unnecessary.
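The expanded sequence above can be checked with a few lines of plain Python (an illustrative model, not ISA code): f0-f3 hold the vec4 input, f8-f23 the matrix, and f4-f7 the accumulators.

```python
# Model the sixteen FMACs above: iteration i performs
#   acc[i % 4] += vec[i // 4] * mat[i]
# exactly matching fmac f4,f0,f8,f4 ... fmac f7,f3,f23,f7.
vec = [1.0, 2.0, 3.0, 4.0]           # f0-f3
mat = [float(i) for i in range(16)]  # f8-f23
acc = [0.0] * 4                      # f4-f7, initialised to zero

for i in range(16):                  # one hardware-loop iteration each
    acc[i % 4] += vec[i // 4] * mat[i]

# reference: result[c] = sum over r of vec[r] * mat[r*4 + c]
ref = [sum(vec[r] * mat[r * 4 + c] for r in range(4)) for c in range(4)]
print(acc)        # [80.0, 90.0, 100.0, 110.0]
print(acc == ref) # True
```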

## REMAP FFT, DFT, NTT

The algorithm from a later section of this Appendix shows how FFT REMAP works, and it may be executed as a standalone python3 program. The executable code is designed to illustrate how a hardware implementation may generate Indices which are completely independent of the Execution of element-level operations, even for something as complex as a Triple-loop Cooley-Tukey Schedule. A comprehensive demo and test suite may be found here, including a Complex-Number FFT which deploys Vertical-First Mode on top of the REMAP Schedules.

Uses extend beyond DFT and NTT: as an abstracted RISC paradigm, the Schedules are not restricted in any way or tied to any particular instruction. If the programmer can find any algorithm with an identical triple-nested loop structure, the FFT Schedule may be used even there.
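As an illustration of index generation being independent of execution, here is a sketch of a standard radix-2 Cooley-Tukey butterfly index schedule in plain Python (an assumption for clarity; the standalone demo's actual code may differ):

```python
# Generate the triple-nested Cooley-Tukey loop indices as a flat
# schedule, completely separate from the butterfly arithmetic that
# will eventually consume them.

def fft_butterfly_schedule(n):
    """Yield (j, j+halfsize, k) index triples for an in-place
    radix-2 FFT of n points (n a power of two): two data indices
    and one twiddle-factor (COS table) index per butterfly."""
    size = 2
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):      # outer loop: block starts
            k = 0
            for j in range(i, i + halfsize):  # inner loop: butterflies
                yield (j, j + halfsize, k)
                k += tablestep
        size *= 2                        # middle loop: doubling sizes

print(list(fft_butterfly_schedule(4)))
# [(0, 1, 0), (2, 3, 0), (0, 2, 0), (1, 3, 1)]
```

The schedule contains (n/2)·log2(n) triples, which is exactly the VL computed by the FFT mode of the svshape pseudocode below.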

## svshape pseudocode

``````
# for convenience, VL to be calculated and stored in SVSTATE
vlen <- [0] * 7
mscale[0:5] <- 0b000001 # for scaling MAXVL
itercount[0:6] <- [0] * 7
SVSTATE[0:31] <- [0] * 32
# only overwrite REMAP if "persistence" is zero
if (SVSTATE[62] = 0b0) then
    SVSTATE[32:33] <- 0b00
    SVSTATE[34:35] <- 0b00
    SVSTATE[36:37] <- 0b00
    SVSTATE[38:39] <- 0b00
    SVSTATE[40:41] <- 0b00
    SVSTATE[42:46] <- 0b00000
    SVSTATE[62] <- 0b0
    SVSTATE[63] <- 0b0
# clear out all SVSHAPEs
SVSHAPE0[0:31] <- [0] * 32
SVSHAPE1[0:31] <- [0] * 32
SVSHAPE2[0:31] <- [0] * 32
SVSHAPE3[0:31] <- [0] * 32

# set schedule up for multiply
if (SVrm = 0b0000) then
    # VL in Matrix Multiply is xd*yd*zd
    xd <- (0b00 || SVxd) + 1
    yd <- (0b00 || SVyd) + 1
    zd <- (0b00 || SVzd) + 1
    n <- xd * yd * zd
    vlen[0:6] <- n[14:20]
    # set up template in SVSHAPE0, then copy to 1-3
    SVSHAPE0[0:5] <- (0b0 || SVxd)    # xdim
    SVSHAPE0[6:11] <- (0b0 || SVyd)   # ydim
    SVSHAPE0[12:17] <- (0b0 || SVzd)  # zdim
    SVSHAPE0[28:29] <- 0b11           # skip z
    # copy
    SVSHAPE1[0:31] <- SVSHAPE0[0:31]
    SVSHAPE2[0:31] <- SVSHAPE0[0:31]
    SVSHAPE3[0:31] <- SVSHAPE0[0:31]
    # set up FRA
    SVSHAPE1[18:20] <- 0b001          # permute x,z,y
    SVSHAPE1[28:29] <- 0b01           # skip z
    # FRC
    SVSHAPE2[18:20] <- 0b001          # permute x,z,y
    SVSHAPE2[28:29] <- 0b11           # skip y

# set schedule up for FFT butterfly
if (SVrm = 0b0001) then
    # calculate O(N log2 N)
    n <- [0] * 3
    do while n < 5
        if SVxd[4-n] = 0 then
            leave
        n <- n + 1
    n <- ((0b0 || SVxd) + 1) * n
    vlen[0:6] <- n[1:7]
    # set up template in SVSHAPE0, then copy to 1-3
    # for FRA and FRT
    SVSHAPE0[0:5] <- (0b0 || SVxd)    # xdim
    SVSHAPE0[12:17] <- (0b0 || SVzd)  # zdim - "striding" (2D FFT)
    mscale <- (0b0 || SVzd) + 1
    SVSHAPE0[30:31] <- 0b01           # Butterfly mode
    # copy
    SVSHAPE1[0:31] <- SVSHAPE0[0:31]
    SVSHAPE2[0:31] <- SVSHAPE0[0:31]
    # set up FRB and FRS
    SVSHAPE1[28:29] <- 0b01           # j+halfstep schedule
    # FRC (coefficients)
    SVSHAPE2[28:29] <- 0b10           # k schedule

# set schedule up for (i)DCT Inner butterfly
# SVrm Mode 4 (Mode 12 for iDCT) is for on-the-fly (Vertical-First Mode)
if ((SVrm = 0b0100) |
    (SVrm = 0b1100)) then
    # calculate O(N log2 N)
    n <- [0] * 3
    do while n < 5
        if SVxd[4-n] = 0 then
            leave
        n <- n + 1
    n <- ((0b0 || SVxd) + 1) * n
    vlen[0:6] <- n[1:7]
    # set up template in SVSHAPE0, then copy to 1-3
    # set up FRB and FRS
    SVSHAPE0[0:5] <- (0b0 || SVxd)    # xdim
    SVSHAPE0[12:17] <- (0b0 || SVzd)  # zdim - "striding" (2D DCT)
    mscale <- (0b0 || SVzd) + 1
    if (SVrm = 0b1100) then
        SVSHAPE0[30:31] <- 0b11       # iDCT mode
        SVSHAPE0[18:20] <- 0b011      # iDCT Inner Butterfly sub-mode
    else
        SVSHAPE0[30:31] <- 0b01       # DCT mode
        SVSHAPE0[18:20] <- 0b001      # DCT Inner Butterfly sub-mode
        SVSHAPE0[21:23] <- 0b001      # "inverse" on outer loop
    SVSHAPE0[6:11] <- 0b000011        # (i)DCT Inner Butterfly mode 4
    # copy
    SVSHAPE1[0:31] <- SVSHAPE0[0:31]
    SVSHAPE2[0:31] <- SVSHAPE0[0:31]
    if (SVrm != 0b0100) & (SVrm != 0b1100) then
        SVSHAPE3[0:31] <- SVSHAPE0[0:31]
    # for FRA and FRT
    SVSHAPE0[28:29] <- 0b01           # j+halfstep schedule
    # for cos coefficient
    SVSHAPE2[28:29] <- 0b10           # ci (k for mode 4) schedule
    SVSHAPE2[12:17] <- 0b000000       # reset costable "striding" to 1
    if (SVrm != 0b0100) & (SVrm != 0b1100) then
        SVSHAPE3[28:29] <- 0b11       # size schedule

# set schedule up for (i)DCT Outer butterfly
if (SVrm = 0b0011) | (SVrm = 0b1011) then
    # calculate O(N log2 N) number of outer butterfly overlapping adds
    vlen[0:6] <- [0] * 7
    n <- 0b000
    size <- 0b0000001
    itercount[0:6] <- (0b00 || SVxd) + 0b0000001
    itercount[0:6] <- (0b0 || itercount[0:5])
    do while n < 5
        if SVxd[4-n] = 0 then
            leave
        n <- n + 1
        count <- (itercount - 0b0000001) * size
        vlen[0:6] <- vlen + count[7:13]
        size[0:6] <- (size[1:6] || 0b0)
        itercount[0:6] <- (0b0 || itercount[0:5])
    # set up template in SVSHAPE0, then copy to 1-3
    # set up FRB and FRS
    SVSHAPE0[0:5] <- (0b0 || SVxd)    # xdim
    SVSHAPE0[12:17] <- (0b0 || SVzd)  # zdim - "striding" (2D DCT)
    mscale <- (0b0 || SVzd) + 1
    if (SVrm = 0b1011) then
        SVSHAPE0[30:31] <- 0b11       # iDCT mode
        SVSHAPE0[18:20] <- 0b011      # iDCT Outer Butterfly sub-mode
        SVSHAPE0[21:23] <- 0b101      # "inverse" on outer and inner loop
    else
        SVSHAPE0[30:31] <- 0b01       # DCT mode
        SVSHAPE0[18:20] <- 0b100      # DCT Outer Butterfly sub-mode
    SVSHAPE0[6:11] <- 0b000010        # DCT Butterfly mode
    # copy
    SVSHAPE1[0:31] <- SVSHAPE0[0:31]  # j+halfstep schedule
    SVSHAPE2[0:31] <- SVSHAPE0[0:31]  # costable coefficients
    # for FRA and FRT
    SVSHAPE1[28:29] <- 0b01           # j+halfstep schedule
    # reset costable "striding" to 1
    SVSHAPE2[12:17] <- 0b000000

# set schedule up for DCT COS table generation
if (SVrm = 0b0101) | (SVrm = 0b1101) then
    # calculate O(N log2 N)
    vlen[0:6] <- [0] * 7
    itercount[0:6] <- (0b00 || SVxd) + 0b0000001
    itercount[0:6] <- (0b0 || itercount[0:5])
    n <- [0] * 3
    do while n < 5
        if SVxd[4-n] = 0 then
            leave
        n <- n + 1
        vlen[0:6] <- vlen + itercount
        itercount[0:6] <- (0b0 || itercount[0:5])
    # set up template in SVSHAPE0, then copy to 1-3
    # set up FRB and FRS
    SVSHAPE0[0:5] <- (0b0 || SVxd)    # xdim
    SVSHAPE0[12:17] <- (0b0 || SVzd)  # zdim - "striding" (2D DCT)
    mscale <- (0b0 || SVzd) + 1
    SVSHAPE0[30:31] <- 0b01           # DCT/FFT mode
    SVSHAPE0[6:11] <- 0b000100        # DCT Inner Butterfly COS-gen mode
    if (SVrm = 0b0101) then
        SVSHAPE0[21:23] <- 0b001      # "inverse" on outer loop for DCT
    # copy
    SVSHAPE1[0:31] <- SVSHAPE0[0:31]
    SVSHAPE2[0:31] <- SVSHAPE0[0:31]
    # for cos coefficient
    SVSHAPE1[28:29] <- 0b10           # ci schedule
    SVSHAPE2[28:29] <- 0b11           # size schedule

# set schedule up for iDCT / DCT inverse of half-swapped ordering
if (SVrm = 0b0110) | (SVrm = 0b1110) | (SVrm = 0b1111) then
    vlen[0:6] <- (0b00 || SVxd) + 0b0000001
    # set up template in SVSHAPE0
    SVSHAPE0[0:5] <- (0b0 || SVxd)    # xdim
    SVSHAPE0[12:17] <- (0b0 || SVzd)  # zdim - "striding" (2D DCT)
    mscale <- (0b0 || SVzd) + 1
    if (SVrm = 0b1110) then
        SVSHAPE0[18:20] <- 0b001      # DCT opposite half-swap
    if (SVrm = 0b1111) then
        SVSHAPE0[30:31] <- 0b01       # FFT mode
    else
        SVSHAPE0[30:31] <- 0b11       # DCT mode
    SVSHAPE0[6:11] <- 0b000101        # DCT "half-swap" mode

# set schedule up for parallel reduction
if (SVrm = 0b0111) then
    # calculate the total number of operations (brute-force)
    vlen[0:6] <- [0] * 7
    itercount[0:6] <- (0b00 || SVxd) + 0b0000001
    step[0:6] <- 0b0000001
    i[0:6] <- 0b0000000
    do while step <u itercount
        newstep <- step[1:6] || 0b0
        j[0:6] <- 0b0000000
        do while (j+step <u itercount)
            j <- j + newstep
            i <- i + 1
        step <- newstep
    # VL in Parallel-Reduce is the number of operations
    vlen[0:6] <- i
    # set up template in SVSHAPE0, then copy to 1. only 2 needed
    SVSHAPE0[0:5] <- (0b0 || SVxd)    # xdim
    SVSHAPE0[12:17] <- (0b0 || SVzd)  # zdim - "striding" (2D DCT)
    mscale <- (0b0 || SVzd) + 1
    SVSHAPE0[30:31] <- 0b10           # parallel reduce submode
    # copy
    SVSHAPE1[0:31] <- SVSHAPE0[0:31]
    # set up right operand (left operand 28:29 is zero)
    SVSHAPE1[28:29] <- 0b01           # right operand

# set VL, MVL and Vertical-First
m[0:12] <- vlen * mscale
maxvl[0:6] <- m[6:12]
SVSTATE[0:6] <- maxvl  # MAXVL
SVSTATE[7:13] <- vlen  # VL
SVSTATE[63] <- vf
``````
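The parallel-reduction operation count above (SVrm = 0b0111) transcribes directly into plain Python; this sketch confirms that the brute-force double loop yields the expected N-1 operations for a tree reduction of N elements.

```python
# Plain-Python transcription of the parallel-reduce VL calculation:
# count the pairwise operations a binary-tree reduction performs.

def parallel_reduce_ops(itercount):
    i = 0
    step = 1
    while step < itercount:
        newstep = step * 2        # step[1:6] || 0b0 is a left shift
        j = 0
        while j + step < itercount:
            j += newstep
            i += 1                # one reduction operation scheduled
        step = newstep
    return i

# a tree reduction of N elements always takes N-1 operations
print([parallel_reduce_ops(n) for n in (2, 3, 7, 8)])  # [1, 2, 6, 7]
```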

## svindex pseudocode

``````
# based on nearest MAXVL compute other dimension
MVL <- SVSTATE[0:6]
d <- [0] * 6
dim <- SVd+1
do while d*dim <u ([0]*4 || MVL)
    d <- d + 1

# set up template, then copy once location identified
shape <- [0]*32
shape[30:31] <- 0b00                # mode
if SVyx = 0 then
    shape[18:20] <- 0b110           # indexed xd/yd
    shape[0:5] <- (0b0 || SVd)      # xdim
    if sk = 0 then shape[6:11] <- 0 # ydim
    else           shape[6:11] <- 0b111111 # ydim max
else
    shape[18:20] <- 0b111           # indexed yd/xd
    if sk = 1 then shape[6:11] <- 0 # ydim
    else           shape[6:11] <- d-1 # ydim max
    shape[0:5] <- (0b0 || SVd)      # ydim
shape[12:17] <- (0b0 || SVG)        # SVGPR
shape[28:29] <- ew                  # element-width override
shape[21] <- sk                     # skip 1st dimension

# select the mode for updating SVSHAPEs
SVSTATE[62] <- mm # set or clear persistence
if mm = 0 then
    # clear out all SVSHAPEs first
    SVSHAPE0[0:31] <- [0] * 32
    SVSHAPE1[0:31] <- [0] * 32
    SVSHAPE2[0:31] <- [0] * 32
    SVSHAPE3[0:31] <- [0] * 32
    SVSTATE[32:41] <- [0] * 10 # clear REMAP.mi/o
    SVSTATE[42:46] <- rmm # rmm exactly REMAP.SVme
    idx <- 0
    for bit = 0 to 4
        if rmm[4-bit] then
            # activate requested shape
            if idx = 0 then SVSHAPE0 <- shape
            if idx = 1 then SVSHAPE1 <- shape
            if idx = 2 then SVSHAPE2 <- shape
            if idx = 3 then SVSHAPE3 <- shape
            SVSTATE[bit*2+32:bit*2+33] <- idx
            # increment shape index, modulo 4
            if idx = 3 then idx <- 0
            else            idx <- idx + 1
else
    # refined SVSHAPE/REMAP update mode
    bit <- rmm[0:2]
    idx <- rmm[3:4]
    if idx = 0 then SVSHAPE0 <- shape
    if idx = 1 then SVSHAPE1 <- shape
    if idx = 2 then SVSHAPE2 <- shape
    if idx = 3 then SVSHAPE3 <- shape
    SVSTATE[bit*2+32:bit*2+33] <- idx
    SVSTATE[46-bit] <- 1
``````
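The opening loop of the pseudocode derives the second dimension from MAXVL: d counts upward until d*dim covers MVL, i.e. d = ceil(MVL / (SVd+1)). A minimal plain-Python sketch (names follow the pseudocode, not an actual API):

```python
# Compute the "other" dimension from the nearest MAXVL, exactly as
# the do-while loop at the top of the svindex pseudocode does.

def other_dimension(mvl, svd):
    dim = svd + 1       # SVd is the dimension field, width SVd+1
    d = 0
    while d * dim < mvl:
        d += 1
    return d            # equivalently: ceil(mvl / (svd + 1))

print(other_dimension(10, 2))  # 4: a 3-wide shape needs 4 rows to cover MVL=10
```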

## svshape2 pseudocode

``````
# based on nearest MAXVL compute other dimension
MVL <- SVSTATE[0:6]
d <- [0] * 6
dim <- SVd+1
do while d*dim <u ([0]*4 || MVL)
    d <- d + 1
# set up template, then copy once location identified
shape <- [0]*32
shape[30:31] <- 0b00                # mode
shape[0:5] <- (0b0 || SVd)          # x/ydim
if SVyx = 0 then
    shape[18:20] <- 0b000           # ordering xd/yd(/zd)
    if sk = 0 then shape[6:11] <- 0 # ydim
    else           shape[6:11] <- 0b111111 # ydim max
else
    shape[18:20] <- 0b010           # ordering yd/xd(/zd)
    if sk = 1 then shape[6:11] <- 0 # ydim
    else           shape[6:11] <- d-1 # ydim max
# offset (the prime purpose of this instruction)
shape[24:27] <- SVo                 # offset
if sk = 1 then shape[28:29] <- 0b01 # skip 1st dimension
else           shape[28:29] <- 0b00 # no skipping
# select the mode for updating SVSHAPEs
SVSTATE[62] <- mm # set or clear persistence
if mm = 0 then
    # clear out all SVSHAPEs first
    SVSHAPE0[0:31] <- [0] * 32
    SVSHAPE1[0:31] <- [0] * 32
    SVSHAPE2[0:31] <- [0] * 32
    SVSHAPE3[0:31] <- [0] * 32
    SVSTATE[32:41] <- [0] * 10 # clear REMAP.mi/o
    SVSTATE[42:46] <- rmm # rmm exactly REMAP.SVme
    idx <- 0
    for bit = 0 to 4
        if rmm[4-bit] then
            # activate requested shape
            if idx = 0 then SVSHAPE0 <- shape
            if idx = 1 then SVSHAPE1 <- shape
            if idx = 2 then SVSHAPE2 <- shape
            if idx = 3 then SVSHAPE3 <- shape
            SVSTATE[bit*2+32:bit*2+33] <- idx
            # increment shape index, modulo 4
            if idx = 3 then idx <- 0
            else            idx <- idx + 1
else
    # refined SVSHAPE/REMAP update mode
    bit <- rmm[0:2]
    idx <- rmm[3:4]
    if idx = 0 then SVSHAPE0 <- shape
    if idx = 1 then SVSHAPE1 <- shape
    if idx = 2 then SVSHAPE2 <- shape
    if idx = 3 then SVSHAPE3 <- shape
    SVSTATE[bit*2+32:bit*2+33] <- idx
    SVSTATE[46-bit] <- 1
``````

## Example Matrix Usage

• `svshape` to set the type of reordering to be applied to an otherwise usual `0..VL-1` hardware for-loop
• `svremap` to set which registers a given reordering is to apply to (RA, RT etc)
• `sv.{instruction}` where any Vectorized register marked by `svremap` will have its ordering REMAPPED according to the schedule set by `svshape`.

The following illustrative example multiplies a 3x4 and a 5x3 matrix to create a 5x4 result:

``````
svshape 5,4,3,0,0         # Outer Product 5x4 by 4x3
svremap 15,1,2,3,0,0,0,0  # link Schedule to registers
sv.fmadds *0,*32,*64,*0   # 60 FMACs get executed here
``````

• svshape sets up the four SVSHAPE SPRS for a Matrix Schedule
• svremap activates four out of five registers RA RB RC RT RS (15)
• svremap requests:
• RA to use SVSHAPE1
• RB to use SVSHAPE2
• RC to use SVSHAPE3
• RT to use SVSHAPE0
• RS Remapping to not be activated
• sv.fmadds has vectors RT=0, RA=32, RB=64, RC=0
• With REMAP being active each register's element index is independently transformed using the specified SHAPEs.

Thus the Vector Loop is arranged such that the use of the multiply-and-accumulate instruction executes precisely the required Schedule to perform an in-place, in-registers Outer Product Matrix Multiply, with no need for additional Transpose or register-copy instructions. The example above may be executed as a unit test and demo, here.
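What the three-instruction sequence achieves can be modelled in plain Python (an illustrative sketch of the resulting triple-nested Schedule, not the exact SVSHAPE index streams):

```python
# Illustrative model: sixty fused multiply-adds forming a complete
# matrix multiply, with no transpose or copy operations in between.
X, Y, Z = 5, 4, 3                 # the dimensions given to svshape
A = [[float(y * Z + z) for z in range(Z)] for y in range(Y)]  # Y x Z
B = [[float(z * X + x) for x in range(X)] for z in range(Z)]  # Z x X
C = [[0.0] * X for _ in range(Y)]                             # Y x X result

fmacs = 0
for y in range(Y):
    for x in range(X):
        for z in range(Z):        # the triple-nested Schedule
            C[y][x] += A[y][z] * B[z][x]  # one sv.fmadds element-op
            fmacs += 1

print(fmacs)    # 60, matching "60 FMACs get executed here"
print(C[0][0])  # 25.0
```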

\newpage{}