# MV.X and MV.swizzle

Swizzle needs a MV instruction (there are 2 of them: swizzle and swizzle2). See below for a potential way to use funct7 to indicate that rs2 holds the swizzle.

| Encoding | 31:20 | 19:15 | 14:12 | 11:7 | 6:2 | 1:0 |
|---|---|---|---|---|---|---|
| RV32-I-type | imm[11:0] | rs1[4:0] | funct3 | rd[4:0] | opcode | 0b11 |
| RV32-I-type | fn4[3:0], swizzle[7:0] | rs1[4:0] | 0b000 | rd[4:0] | OP-V | 0b11 |

• funct3 = MV: 0b000 for FP, 0b001 for INT
• OP-V = 0b1010111
• fn4 = 4 bit function.
• fn4 = 0b0000 - MV-SWIZZLE
• fn4 = 0bNN01 - MV-X, NN=elwidth (default/8/16/32)
• fn4 = 0bNN11 - MV-X.SUBVL NN=elwidth (default/8/16/32)

swizzle (only active on SV or P48/P64 when SUBVL!=0):

| 7:6 | 5:4 | 3:2 | 1:0 |
|---|---|---|---|
| w | z | y | x |
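
As a rough illustration of the field layout above, the following sketch splits an 8-bit swizzle value into its four 2-bit fields. The helper names `decode_swizzle` and `apply_swizzle` are hypothetical, and the reading that each field holds the index of the source element for that destination lane is an assumption, not something the encoding table states.

```python
def decode_swizzle(swizzle):
    """Split an 8-bit swizzle value into its x, y, z, w fields (2 bits each)."""
    return {
        "x": swizzle & 0b11,          # bits 1:0
        "y": (swizzle >> 2) & 0b11,   # bits 3:2
        "z": (swizzle >> 4) & 0b11,   # bits 5:4
        "w": (swizzle >> 6) & 0b11,   # bits 7:6
    }


def apply_swizzle(src, swizzle):
    """Assumption: each field is the source index for that destination lane."""
    fields = decode_swizzle(swizzle)
    return [src[fields[lane]] for lane in "xyzw"]


# reverse a 4-element sub-vector: x<-3, y<-2, z<-1, w<-0
reversed_vec = apply_swizzle(["a", "b", "c", "d"], 0b00_01_10_11)
```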

MV.X has two modes: SUBVL mode applies the element offsets only within a SUBVL inner loop. This can be used for transposition.

```
for i in range(VL):
    for j in range(SUBVL):
        regs[rd] = regs[rd+regs[rs+j]]
```

Normal mode will apply the element offsets incrementally:

```
k = 0
for i in range(VL):
    for j in range(SUBVL):
        regs[rd] = regs[rd+regs[rs+k]]
        k += 1
```

Pseudocode for element width part of MV.X:

```
def mv_x(rd, rs1, funct4):
    elwidth = (funct4>>2) & 0x3
    bitwidth = {0:XLEN, 1:8, 2:16, 3:32}[elwidth] # get bits per el
    bytewidth = bitwidth // 8 # get bytes per el
    for i in range(VL):
        offset = addr + bytewidth # get offset within regfile as SRAM
        # TODO, actually, needs to respect rd and rs1 element width,
        # here, as well.  this pseudocode just illustrates that the
        # MV.X operation contains a way to compact the indices into
        # less space.
        regs[rd] = (unsigned char*)(regs)[offset]
```

The idea here is to allow 8-bit indices to be stored inside XLEN-sized registers, such that rather than doing this:

```
ldimm x8, 1
ldimm x9, 3
ldimm x10, 2
ldimm x11, 0
{SVP.VL=4} MV.X x3, x8, elwidth=default
```

The alternative is this:

```
ldimm x8, 0x00020301
{SVP.VL=4} MV.X x3, x8, elwidth=8
```

This compacts four indices into one register. The element widths of x3 and x8 are independent of the MV.X elwidth, so both the source and destination element widths of the elements being moved can be overridden, whilst at the same time allowing the indices to be compacted.
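
To make the packing concrete, here is a minimal sketch that extracts the four 8-bit indices from the single register value used above. `unpack_indices` is a hypothetical helper, and the assumption is that element 0 sits in the least-significant bits, which is what makes `0x00020301` correspond to the x8..x11 values 1, 3, 2, 0 of the unpacked version.

```python
def unpack_indices(value, elwidth_bits, count):
    """Extract `count` indices of `elwidth_bits` bits each, lowest bits first."""
    mask = (1 << elwidth_bits) - 1
    return [(value >> (i * elwidth_bits)) & mask for i in range(count)]


# the packed register from the elwidth=8 example
indices = unpack_indices(0x00020301, 8, 4)
```

With 16-bit indices the same helper would be called with `elwidth_bits=16`, halving the number of indices that fit in one register.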

potential MV.X? register-version of MV-swizzle?

| Encoding | 31:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:2 | 1:0 |
|---|---|---|---|---|---|---|---|
| RV32-R-type | funct7 | rs2[4:0] | rs1[4:0] | funct3 | rd[4:0] | opcode | 0b11 |
| RV32-R-type | 0b0000000 | rs2[4:0] | rs1[4:0] | 0b001 | rd[4:0] | OP-V | 0b11 |

• funct3 = MV.X
• OP-V = 0b1010111
• funct7 = 0b000NN00 - INT MV.X, elwidth=NN (default/8/16/32)
• funct7 = 0b000NN10 - FP MV.X, elwidth=NN (default/8/16/32)
• funct7 = 0b0000001 - INT MV.swizzle to say that rs2 is a swizzle argument?
• funct7 = 0b0000011 - FP MV.swizzle to say that rs2 is a swizzle argument?
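
The funct7 values listed above can be read as three small fields: bit 0 marks the swizzle form, bit 1 selects FP, and bits 3:2 carry NN. The sketch below decodes them on that reading. `decode_funct7` is a hypothetical helper, and rejecting unlisted combinations with an error is an assumption, since the proposal does not say what they should do.

```python
ELWIDTHS = ("default", 8, 16, 32)


def decode_funct7(funct7):
    """Return (operation, regfile, elwidth) for the proposed funct7 values."""
    if not 0 <= funct7 < 0b0010000:
        # every listed encoding has the upper three bits clear
        raise ValueError("upper three bits of funct7 must be zero")
    regfile = "FP" if funct7 & 0b10 else "INT"
    nn = (funct7 >> 2) & 0b11
    if funct7 & 0b01:
        if nn != 0:
            # assumption: only NN=00 is listed for the swizzle form
            raise ValueError("no MV.swizzle encoding listed with NN != 0")
        return ("MV.swizzle", regfile, None)
    return ("MV.X", regfile, ELWIDTHS[nn])
```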

question: do we need a swizzle MV.X as well?

# MV.X with 3 operands

regs[rd] = regs[rs1 + regs[rs2]]

Similar to LD/ST with the same twin predication rules
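
A minimal sketch of how the 3-operand form might behave once vectorised, treating the register file as a flat Python list. It is a gather: element i of the result is read from rs1 offset by the index held in element i of rs2. The function name `mv_x3`, the per-element stepping of rd and rs2, and the read-everything-before-writing order are all assumptions. Predication is left out entirely.

```python
def mv_x3(regs, rd, rs1, rs2, vl):
    """regs[rd+i] = regs[rs1 + regs[rs2+i]] for i in 0..vl (assumed semantics)."""
    # gather first, so that an overlapping rd range cannot disturb later reads
    gathered = [regs[rs1 + regs[rs2 + i]] for i in range(vl)]
    for i, value in enumerate(gathered):
        regs[rd + i] = value


regs = [0] * 32
regs[8:12] = [1, 3, 2, 0]            # indices, as in the ldimm example
regs[16:20] = [100, 101, 102, 103]   # source elements
mv_x3(regs, rd=3, rs1=16, rs2=8, vl=4)
```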

# macro-op fusion

there is the potential for macro-op fusion of mv-swizzle with the following instruction and/or preceding instruction. <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002486.html>

# VBLOCK context?

additional idea: a VBLOCK context that says that if a given register is used, it indicates that the register is to be "swizzled", and the VBLOCK swizzle context contains the swizzling to be carried out.

# mm_shuffle_ps?

`__m128 _mm_shuffle_ps(__m128 lo, __m128 hi, _MM_SHUFFLE(hi3, hi2, lo1, lo0))`

Interleave inputs into low 2 floats and high 2 floats of output. Basically:
`out[0]=lo[lo0]; out[1]=lo[lo1]; out[2]=hi[hi2]; out[3]=hi[hi3];`

For example, `_mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i))` copies the float `a[i]` into all 4 output floats.
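
For comparison with the swizzle encodings on this page, the behaviour described above can be modelled in a few lines. This is only a model on Python lists, not the intrinsic itself. `mm_shuffle` mirrors the `_MM_SHUFFLE` macro, which packs its four arguments into one byte with the first argument in the top two bits.

```python
def mm_shuffle(z, y, x, w):
    """Model of _MM_SHUFFLE: four 2-bit selectors packed into one byte."""
    return (z << 6) | (y << 4) | (x << 2) | w


def mm_shuffle_ps(lo, hi, imm8):
    """Model of _mm_shuffle_ps: low two outputs from lo, high two from hi."""
    return [
        lo[imm8 & 0b11],
        lo[(imm8 >> 2) & 0b11],
        hi[(imm8 >> 4) & 0b11],
        hi[(imm8 >> 6) & 0b11],
    ]


a = [10.0, 20.0, 30.0, 40.0]
splat = mm_shuffle_ps(a, a, mm_shuffle(2, 2, 2, 2))  # broadcast a[2]
```

Note that the byte layout is the same 7:6, 5:4, 3:2, 1:0 split as the 8-bit swizzle field earlier on this page.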

# Transpose

Assuming a vector of 4x4 matrices is stored as 4 separate vectors with subvl=4 in struct-of-array-of-struct form (the form I've been planning on using): using standard (4+4) -> 4 swizzle instructions with 2 input vectors with subvl=4 and 1 output vector with subvl=4, a vectorized matrix transpose operation can be done in 2 steps with 4 instructions per step, giving 8 instructions in total:

input:

    | m00 m10 m20 m30 |
    | m01 m11 m21 m31 |
    | m02 m12 m22 m32 |
    | m03 m13 m23 m33 |

transpose 4 corner 2x2 matrices

intermediate:

    | m00 m01 m20 m21 |
    | m10 m11 m30 m31 |
    | m02 m03 m22 m23 |
    | m12 m13 m32 m33 |

finish transpose

output:

    | m00 m01 m02 m03 |
    | m10 m11 m12 m13 |
    | m20 m21 m22 m23 |
    | m30 m31 m32 m33 |

```
__m128i T0 = _mm_unpacklo_epi32(I0, I1);
__m128i T1 = _mm_unpacklo_epi32(I2, I3);
__m128i T2 = _mm_unpackhi_epi32(I0, I1);
__m128i T3 = _mm_unpackhi_epi32(I2, I3);

/* Assigning transposed values back into I[0-3] */
I0 = _mm_unpacklo_epi64(T0, T1);
I1 = _mm_unpackhi_epi64(T0, T1);
I2 = _mm_unpacklo_epi64(T2, T3);
I3 = _mm_unpackhi_epi64(T2, T3);
```
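
The two-step scheme (2x2 corner transposes, then the block swap) can be checked with a small model of a (4+4) -> 4 swizzle. In this sketch `swizzle2` is a hypothetical helper whose selector is a tuple of four indices into the concatenation of its two inputs (0-3 pick from the first, 4-7 from the second). That is a simplification for illustration, not the 3-bit field encoding used in the pseudocode on this page.

```python
def swizzle2(a, b, sel):
    """Pick 4 elements from the 8 elements of a followed by b."""
    both = list(a) + list(b)
    return [both[s] for s in sel]


def transpose4x4(v0, v1, v2, v3):
    # step 1: transpose the four corner 2x2 matrices
    t0 = swizzle2(v0, v1, (0, 4, 2, 6))
    t1 = swizzle2(v0, v1, (1, 5, 3, 7))
    t2 = swizzle2(v2, v3, (0, 4, 2, 6))
    t3 = swizzle2(v2, v3, (1, 5, 3, 7))
    # step 2: swap the off-diagonal 2x2 blocks to finish the transpose
    o0 = swizzle2(t0, t2, (0, 1, 4, 5))
    o1 = swizzle2(t1, t3, (0, 1, 4, 5))
    o2 = swizzle2(t0, t2, (2, 3, 6, 7))
    o3 = swizzle2(t1, t3, (2, 3, 6, 7))
    return [o0, o1, o2, o3]


# input vectors laid out as in the diagram: vector j holds m0j m1j m2j m3j
vectors = [[f"m{i}{j}" for i in range(4)] for j in range(4)]
result = transpose4x4(*vectors)
```

Each step uses four `swizzle2` calls, matching the count of 8 instructions given above.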

# Table to evaluate

swizzle2 takes 2 arguments, interleaving the two vectors depending on a 3rd (the swizzle selector)

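| Instruction | 31:27 | 26:25 | 24:20 | 19:15 | 14:12 | 11:7 |
|---|---|---|---|---|---|---|
| swizzle2 | rs3 | 00 | rs2 | rs1 | 000 | rd |
| fswizzle2 | rs3 | 01 | rs2 | rs1 | 000 | rd |
| swizzle | 0 | 10 | rs2 | rs1 | 000 | rd |
| fswizzle | 0 | 11 | rs2 | rs1 | 000 | rd |
| swizzlei | imm (31:20) | | | rs1 | 001 | rd |
| fswizzlei | imm (31:20) | | | rs1 | 010 | rd |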

More:

swizzlei would still need the 12-bit format due to not having enough immediate bits. we can get away with only 3 i-type funct3s used for [f]swizzlei by having one funct3 for destsubvl 1 through 3 for int and fp versions and a separate one for destsubvl = 4 that's shared between int/fp:

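| int/fp | DESTSUBVL | 31 | 30:29 | 28:20 | 19:15 | 14:12 | 11:7 |
|---|---|---|---|---|---|---|---|
| int | 1 to 3 | 0 | DESTSUBVL | selector | rs | 000 | rd |
| fp | 1 to 3 | 1 | DESTSUBVL | selector | rs | 000 | rd |
| int | 4 | selector[11:0] (31:20) | | | rs | 001 | rd |
| fp | 4 | selector[11:0] (31:20) | | | rs | 010 | rd |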

the rest could be encoded as follows:

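| Instruction | 31:27 | 26:25 | 24:20 | 19:15 | 14:12 | 11:7 |
|---|---|---|---|---|---|---|
| swizzle2 | rs3 | DESTSUBVL | rs2 | rs1 | 100 | rd |
| swizzle | rs1 | DESTSUBVL | rs2 | rs1 | 100 | rd |
| fswizzle2 | rs3 | DESTSUBVL | rs2 | rs1 | 101 | rd |
| fswizzle | rs1 | DESTSUBVL | rs2 | rs1 | 101 | rd |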

note how for [f]swizzle, rs3 == rs1

So it uses 5 funct3 values overall, which is appropriate, since swizzle is probably second only to muladd in usage in graphics shaders.

Alternative immed encoding

| int/fp | 31:28 | 27:20 | 19:15 | 14:12 | 11:7 |
|---|---|---|---|---|---|
| int | DESTMASK | selector | rs | 000 | rd |
| fp | DESTMASK | selector | rs | 001 | rd |
| int | DESTMASK | constsel | rs | 010 | rd |
| fp | DESTMASK | constsel | rs | 011 | rd |

Allows setting of arbitrary dest (xz, yw) without needing register-versions. Saves on instruction count. Needs 4 funct3 to express.
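
One possible reading of the DESTMASK form, sketched below, is a masked write: each of the 4 DESTMASK bits enables one destination lane, and the 8-bit selector holds a 2-bit source index per lane. The bit assignments (bit 0 = x, selector bits 1:0 = x) and the name `swizzle_destmask` are assumptions, as the table does not define them. The constsel variants are not modelled.

```python
def swizzle_destmask(rd, rs, destmask, selector):
    """Write rs[sel] into each lane of rd whose DESTMASK bit is set."""
    out = list(rd)
    for lane in range(4):
        if (destmask >> lane) & 1:
            out[lane] = rs[(selector >> (2 * lane)) & 0b11]
        # lanes with a clear mask bit keep their old value
    return out


# write only x and z: x <- rs.w, z <- rs.y
selector = (3 << 0) | (1 << 4)
updated = swizzle_destmask([0, 0, 0, 0], [1, 2, 3, 4], 0b0101, selector)
```

On this reading a single instruction can update an `xz` or `yw` pair in place, which is where the saving in instruction count would come from.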

# Matrix 4x4 Vector mul

```
pfscale,3 F2, F1, F10
```

pfscale is a 4 vec mv.shuffle followed by a fmul. pfscaleadd is a 4 vec mv.shuffle followed by a fmac.

In effect what this is doing is:

```
fmul f2, f1.xxxx, f10
fmac f2, f1.yyyy, f11, f2
fmac f2, f1.zzzz, f12, f2
fmac f2, f1.wwww, f13, f2
```

Where all of f2, f1, and f10-13 are vec4, and f1.x-w are copied (fixed index) where the other vec4 indices progress.
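
The four-instruction sequence above can be modelled directly on Python lists. In this sketch `broadcast` stands in for the fixed-index swizzles (`.xxxx`, `.yyyy`, and so on), and `fmul` / `fmac` are plain element-wise helpers. All three names are illustrative, and treating f10-f13 as the four columns of the matrix is an assumption about the intended layout.

```python
def broadcast(vec, lane):
    """Model of a fixed-index swizzle such as f1.xxxx (lane=0)."""
    return [vec[lane]] * 4


def fmul(a, b):
    return [x * y for x, y in zip(a, b)]


def fmac(a, b, c):
    """Element-wise a * b + c."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def mat4_vec4_mul(f1, f10, f11, f12, f13):
    f2 = fmul(broadcast(f1, 0), f10)       # fmul f2, f1.xxxx, f10
    f2 = fmac(broadcast(f1, 1), f11, f2)   # fmac f2, f1.yyyy, f11, f2
    f2 = fmac(broadcast(f1, 2), f12, f2)   # fmac f2, f1.zzzz, f12, f2
    f2 = fmac(broadcast(f1, 3), f13, f2)   # fmac f2, f1.wwww, f13, f2
    return f2


identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
unchanged = mat4_vec4_mul([1, 2, 3, 4], *identity)
```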

# Pseudocode

Swizzle:

```
pub trait SwizzleConstants: Copy + 'static {
    const CONSTANTS: &'static [Self; 4];
}

impl SwizzleConstants for u8 {
    const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFF, 0x7F];
}

impl SwizzleConstants for u16 {
    const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFFFF, 0x7FFF];
}

impl SwizzleConstants for f32 {
    const CONSTANTS: &'static [Self; 4] = &[0.0, 1.0, -1.0, 0.5];
}

// impl for other types too...

pub fn swizzle<Elm, Selector>(
    rd: &mut [Elm],
    rs1: &[Elm],
    rs2: &[Selector],
    vl: usize,
    destsubvl: usize,
    srcsubvl: usize,
) where
    Elm: SwizzleConstants,
    // Selector is a copyable type that can be converted into u64
    Selector: Copy + Into<u64>,
{
    const FIELD_SIZE: usize = 3;
    for vindex in 0..vl {
        // the annotation makes `into()` resolve to u64
        let selector: u64 = rs2[vindex].into();
        if selector >> (FIELD_SIZE * destsubvl) != 0 {
            // handle illegal instruction trap
        }
        for i in 0..destsubvl {
            let mut sel_field = selector >> (FIELD_SIZE * i);
            // bit 2 of the field picks the constants table instead of rs1
            let src: &[Elm] = if (sel_field & 0b100) == 0 {
                &rs1[(vindex * srcsubvl)..]
            } else {
                Elm::CONSTANTS
            };
            sel_field &= 0b11;
            if sel_field as usize >= srcsubvl {
                // handle illegal instruction trap
            }
            let value = src[sel_field as usize];
            rd[vindex * destsubvl + i] = value;
        }
    }
}
```
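
To show how a selector value for the routine above is put together, here is a minimal sketch that packs one 3-bit field per destination element, lowest field first, with bit 2 of each field choosing the constants table. `encode_selector` and its `(use_constant, index)` tuples are hypothetical conveniences, not part of the proposal.

```python
FIELD_SIZE = 3


def encode_selector(fields):
    """Pack (use_constant, index) pairs into a selector, element 0 first."""
    selector = 0
    for i, (use_constant, index) in enumerate(fields):
        if not 0 <= index <= 3:
            raise ValueError("index must fit in 2 bits")
        field = (0b100 if use_constant else 0) | index
        selector |= field << (FIELD_SIZE * i)
    return selector


# destination = (src.z, src.y, src.x, CONSTANTS[1]); for f32 that constant is 1.0
zyx1 = encode_selector([(False, 2), (False, 1), (False, 0), (True, 1)])
```

With destsubvl=4 the result occupies 12 bits, which is why the immediate form with destsubvl = 4 needs the full `selector[11:0]` field.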

Swizzle2:

```
fn swizzle2<Elm, Selector>(
    rd: &mut [Elm],
    rs1: &[Elm],
    rs2: &[Selector],
    rs3: &[Elm],
    vl: usize,
    destsubvl: usize,
    srcsubvl: usize,
) where
    // Elm is a copyable type
    Elm: Copy,
    // Selector is a copyable type that can be converted into u64
    Selector: Copy + Into<u64>,
{
    const FIELD_SIZE: usize = 3;
    for vindex in 0..vl {
        // the annotation makes `into()` resolve to u64
        let selector: u64 = rs2[vindex].into();
        if selector >> (FIELD_SIZE * destsubvl) != 0 {
            // handle illegal instruction trap
        }
        for i in 0..destsubvl {
            let mut sel_field = selector >> (FIELD_SIZE * i);
            // bit 2 of the field picks between the two source vectors
            let src = if (sel_field & 0b100) != 0 {
                rs1
            } else {
                rs3
            };
            sel_field &= 0b11;
            if sel_field as usize >= srcsubvl {
                // handle illegal instruction trap
            }
            let value = src[vindex * srcsubvl + (sel_field as usize)];
            rd[vindex * destsubvl + i] = value;
        }
    }
}
```