TODO svshape offset concept

svshape2 offs,inv,yx,rmm,SVd,sk,mm

0.5	6..8	9	10	11.15	16..20	21..25	25	26..31	name
OPCD	offs	inv	yx	rmm	SVd	100/mm	sk	XO	svshape

offs (3 bits) - unsigned offset
yx (1 bit) - swap XY to YX
inv (1 bit) inverts X if yx=0, Y if yx=1
SVd dimension size
rmm REMAP mask
mm mask mode
sk (1 bit) skips 1st dimension if set

Dimensions are calculated exactly as svindex. rmm and mm are as per svindex.

TODO

investigate https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429 in https://bugs.libre-soc.org/show_bug.cgi?id=653
UTF-8 https://bugs.libre-soc.org/show_bug.cgi?id=794
Triangular REMAP
Cross-Product REMAP (actually, skew Matrix: https://en.m.wikipedia.org/wiki/Skew-symmetric_matrix)
Convolution REMAP

Warshall transitive closure algorithm

TODO move to discussion page, copied from here http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html

with thanks to Hendrik.

https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm

Just a note: interpreting + as 'or', and * as 'and', operating on Boolean matrices, and having result, X, and Y be the exact same matrix, updated while being used, gives the traditional Warshall transitive-closure algorithm, if the loops are nested exactly in thie order.

this can be done with the ternary instruction which has an in-place triple boolean input:

    RT = RT | (RA & RB)

and also has a CR Field variant of the same

notes from conversations:

for y in y_r: for x in x_r: for z in z_r: result[y][x] += a[y][z] * b[z][x]

This nesting of loops works for matrix multiply, but not for transitive closure.

it can be done:

for z in z_r: for y in y_r: for x in x_r: result[y][x] += a[y][z] * b[z][x]

And this ordering of loops does work for transitive closure, when a, b, and result are the very same matrix, updated while being used.

By the way, I believe there is a graph algorithm that does the transitive closure thing, but instead of using boolean, "and", and "or", they use real numbers, addition, and minimum. I think that one computes shortest paths between vertices.

By the time the z'th iteration of the z loop begins, the algorithm has already peocessed paths that go through vertices numbered < z, and it adds paths that go through vertices numbered z.

For this to work, the outer loop has to be the one on the subscript that bridges a and b (which in this case are teh same matrix, of course).

SUBVL Remap

Remapping of SUBVL (vec2/3/4) elements is not permitted: the vec2/3/4 itself must be considered to be the "element". To perform REMAP on the elements of a vec2/3/4, either use mv.swizzle, or, due to the sub-elements themselves being contiguous, treat them as such and use Indexing, or add one extra dimension to Matrix REMAP, the inner dimension being the size of the Subvector (2, 3, or 4).

Note that Swizzle on Sub-vectors may be applied on top of REMAP. Where this is appropriate is the Rijndael MixColumns stage:

Assuming that the bytes are stored a00 a01 a02 a03 a10 .. a33 a 2D REMAP allows:

the column bytes (as a vec4) to be iterated over as an inner loop, progressing vertically (a00 a10 a20 a30)
the columns themselves to be iterated as an outer loop
a 32 bit GF(256) Matrix Multiply on the vec4 to be performed.

This entirely in-place without special 128-bit opcodes. Below is the pseudocode for Rijndael MixColumns

void gmix_column(unsigned char *r) {
    unsigned char a[4];
    unsigned char b[4];
    unsigned char c;
    unsigned char h;
    // no swizzle here but vec4 byte-level
    // elwidth overrides can be done though.
    for (c = 0; c < 4; c++) {
        a[c] = r[c];
        h = (unsigned char)((signed char)r[c] >> 7);
        b[c] = r[c] << 1;
        b[c] ^= 0x1B & h; /* Rijndael's Galois field */
    }
    // These may then each be 4x 8bit Swizzled
    // r0.vec4 = b.vec4
    // r0.vec4 ^= a.vec4.WXYZ
    // r0.vec4 ^= a.vec4.ZWXY
    // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
    r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
    r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
    r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3]; 
    r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
}

The application of the swizzles allows the remapped vec4 a, b and r variables to perform four straight linear 32 bit XOR operations where a scalar processor would be required to perform 16 byte-level individual operations. Given wide enough SIMD backends in hardware these 3 bit XORs may be done as single-cycle operations across the entire 128 bit Rijndael Matrix.

The other alternative is to simply perform the actual 4x4 GF(256) Matrix Multiply using the MDS Matrix.