MV.X and MV.swizzle
Swizzle needs a MV instruction (there are 2 of them: swizzle and swizzle2). See below for a potential way to use funct7 to indicate that rs2 holds a swizzle.
Encoding | 31:28 | 27:20 | 19:15 | 14:12 | 11:7 | 6:2 | 1:0
---|---|---|---|---|---|---|---
RV32-I-type | imm[11:8] | imm[7:0] | rs1[4:0] | funct3 | rd[4:0] | opcode | 0b11
RV32-I-type | fn4[3:0] | swizzle[7:0] | rs1[4:0] | 0b000 | rd[4:0] | OP-V | 0b11
- funct3 = MV: 0b000 for FP, 0b001 for INT
- OP-V = 0b1010111
- fn4 = 4 bit function.
- fn4 = 0b0000 - MV-SWIZZLE
- fn4 = 0bNN01 - MV-X, NN=elwidth (default/8/16/32)
- fn4 = 0bNN11 - MV-X.SUBVL NN=elwidth (default/8/16/32)
swizzle (only active on SV or P48/P64 when SUBVL!=0):
7:6 | 5:4 | 3:2 | 1:0
---|---|---|---
w | z | y | x
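As a minimal sketch of the field layout above, the 8-bit swizzle immediate can be split into its four 2-bit fields as follows. The function name `decode_swizzle` is hypothetical and only illustrates the bit positions in the table; it says nothing about how the fields are then applied.

```python
def decode_swizzle(swizzle):
    """Split an 8-bit swizzle immediate into its x, y, z, w fields."""
    assert 0 <= swizzle <= 0xFF
    return {
        "x": swizzle & 0b11,          # bits 1:0
        "y": (swizzle >> 2) & 0b11,   # bits 3:2
        "z": (swizzle >> 4) & 0b11,   # bits 5:4
        "w": (swizzle >> 6) & 0b11,   # bits 7:6
    }
```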
MV.X has two modes: SUBVL mode applies the element offsets only within a SUBVL inner loop. This can be used for transposition.

    for i in range(VL):
        for j in range(SUBVL):
            regs[rd] = regs[rd+regs[rs+j]]

Normal mode will apply the element offsets incrementally:

    k = 0
    for i in range(VL):
        for j in range(SUBVL):
            regs[rd] = regs[rd+regs[rs+k]]
            k += 1

Pseudocode for element width part of MV.X:

    def mv_x(rd, rs1, funct4):
        elwidth = (funct4>>2) & 0x3
        bitwidth = {0:XLEN, 1:8, 2:16, 3:32}[elwidth] # get bits per el
        bytewidth = bitwidth / 8 # get bytes per el
        for i in range(VL):
            addr = (unsigned char *)&regs[rs1]
            offset = addr + bytewidth # get offset within regfile as SRAM
            # TODO, actually, needs to respect rd and rs1 element width,
            # here, as well.  this pseudocode just illustrates that the
            # MV.X operation contains a way to compact the indices into
            # less space.
            regs[rd] = (unsigned char*)(regs)[offset]

The idea here is to allow 8-bit indices to be stored inside XLEN-sized registers, such that rather than doing this:

    ldimm x8, 1
    ldimm x9, 3
    ldimm x10, 2
    ldimm x11, 0
    {SVP.VL=4} MV.X x3, x8, elwidth=default

The alternative is this:

    ldimm x8, 0x00020301
    {SVP.VL=4} MV.X x3, x8, elwidth=8

This compacts four indices into one register. The element widths of x3 and x8 are independent of the MV.X elwidth, so the source and destination element widths of the elements being moved can both be overridden, whilst the indices are compacted as well.
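The compaction of indices can be illustrated with a short sketch. It assumes the indices are packed little-endian, with index 0 in the lowest bits of the register, which is what the `0x00020301` example implies; the function name `unpack_indices` is hypothetical, and only the case where all indices fit in a single register is covered.

```python
def unpack_indices(reg_value, vl, bitwidth):
    """Extract vl indices of bitwidth bits each from one register value.

    Assumption: index 0 sits in the lowest bits (little-endian packing).
    """
    mask = (1 << bitwidth) - 1
    return [(reg_value >> (i * bitwidth)) & mask for i in range(vl)]


# The four separate ldimm values 1, 3, 2, 0 packed as 8-bit elements
indices = unpack_indices(0x00020301, vl=4, bitwidth=8)
```

With `elwidth=8` and `VL=4`, the single register yields the same four indices that the first version loads into x8 to x11.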
potential MV.X? register-version of MV-swizzle?
Encoding | 31:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:2 | 1:0
---|---|---|---|---|---|---|---
RV32-R-type | funct7 | rs2[4:0] | rs1[4:0] | funct3 | rd[4:0] | opcode | 0b11
RV32-R-type | 0b0000000 | rs2[4:0] | rs1[4:0] | 0b001 | rd[4:0] | OP-V | 0b11
- funct3 = MV.X
- OP-V = 0b1010111
- funct7 = 0b000NN00 - INT MV.X, elwidth=NN (default/8/16/32)
- funct7 = 0b000NN10 - FP MV.X, elwidth=NN (default/8/16/32)
- funct7 = 0b0000001 - INT MV.swizzle to say that rs2 is a swizzle argument?
- funct7 = 0b0000011 - FP MV.swizzle to say that rs2 is a swizzle argument?
question: do we need a swizzle MV.X as well?
MV.X with 3 operands
regs[rd] = regs[rs1 + regs[rs2]]
Similar to LD/ST with the same twin predication rules
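A minimal sketch of the 3-operand form, treating the register file as a flat Python list. It assumes that rd and rs2 step by one element per iteration while rs1 acts as a fixed base, and it ignores predication and element widths entirely. The name `mv_x3` is hypothetical.

```python
def mv_x3(regs, rd, rs1, rs2, vl):
    """Indexed move: regs[rd+i] = regs[rs1 + regs[rs2+i]] for each element.

    Assumptions: rd and rs2 advance per element, rs1 is a fixed base,
    no predication, all elements are full registers.
    """
    for i in range(vl):
        regs[rd + i] = regs[rs1 + regs[rs2 + i]]


regs = [0] * 32
regs[16:20] = [10, 11, 12, 13]   # source data, base register 16
regs[8:12] = [1, 3, 2, 0]        # indices, starting at register 8
mv_x3(regs, rd=3, rs1=16, rs2=8, vl=4)
```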
macro-op fusion
there is the potential for macro-op fusion of mv-swizzle with the following instruction and/or preceding instruction. <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002486.html>
VBLOCK context?
additional idea: a VBLOCK context that says that if a given register is used, it indicates that the register is to be "swizzled", and the VBLOCK swizzle context contains the swizzling to be carried out.
mm_shuffle_ps?

    __m128 _mm_shuffle_ps(__m128 lo, __m128 hi,
                          _MM_SHUFFLE(hi3, hi2, lo1, lo0))

Interleave inputs into low 2 floats and high 2 floats of output. Basically:

    out[0]=lo[lo0]; out[1]=lo[lo1]; out[2]=hi[hi2]; out[3]=hi[hi3];

For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float a[i] into all 4 output floats.
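The behaviour described above can be modelled in a few lines. This is only a software model of the SSE intrinsic on Python lists, with `mm_shuffle` standing in for the `_MM_SHUFFLE` macro, which packs its four 2-bit arguments with the first argument in the top bits.

```python
def mm_shuffle(z, y, x, w):
    """Model of the _MM_SHUFFLE macro: pack four 2-bit indices into 8 bits."""
    return (z << 6) | (y << 4) | (x << 2) | w


def mm_shuffle_ps(lo, hi, imm8):
    """Model of _mm_shuffle_ps: low two outputs from lo, high two from hi."""
    return [
        lo[imm8 & 3],
        lo[(imm8 >> 2) & 3],
        hi[(imm8 >> 4) & 3],
        hi[(imm8 >> 6) & 3],
    ]


a = [10.0, 11.0, 12.0, 13.0]
broadcast = mm_shuffle_ps(a, a, mm_shuffle(2, 2, 2, 2))  # copies a[2] everywhere
```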
Transpose
Assuming a vector of 4x4 matrices is stored as 4 separate vectors with subvl=4 in struct-of-array-of-struct form (the form I've been planning on using): using standard (4+4) -> 4 swizzle instructions, each taking 2 input vectors with subvl=4 and producing 1 output vector with subvl=4, a vectorized matrix transpose can be done in 2 steps of 4 instructions each, giving 8 instructions in total:
input:

    | m00 m10 m20 m30 |
    | m01 m11 m21 m31 |
    | m02 m12 m22 m32 |
    | m03 m13 m23 m33 |

transpose 4 corner 2x2 matrices

intermediate:

    | m00 m01 m20 m21 |
    | m10 m11 m30 m31 |
    | m02 m03 m22 m23 |
    | m12 m13 m32 m33 |

finish transpose

output:

    | m00 m01 m02 m03 |
    | m10 m11 m12 m13 |
    | m20 m21 m22 m23 |
    | m30 m31 m32 m33 |


    __m128i T0 = _mm_unpacklo_epi32(I0, I1);
    __m128i T1 = _mm_unpacklo_epi32(I2, I3);
    __m128i T2 = _mm_unpackhi_epi32(I0, I1);
    __m128i T3 = _mm_unpackhi_epi32(I2, I3);

    /* Assigning transposed values back into I[0-3] */
    I0 = _mm_unpacklo_epi64(T0, T1);
    I1 = _mm_unpackhi_epi64(T0, T1);
    I2 = _mm_unpacklo_epi64(T2, T3);
    I3 = _mm_unpackhi_epi64(T2, T3);

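The two-step, 8-instruction transpose described in this section can be checked with a small model. The helper `select8` is hypothetical: it stands in for a generic (4+4) -> 4 swizzle, where selector values 0 to 3 pick from the first input and 4 to 7 from the second. The selector patterns are one possible choice that reproduces the intermediate and output layouts shown above, not necessarily the encoding a real instruction would use.

```python
def select8(a, b, sel):
    """Hypothetical (4+4) -> 4 swizzle: sel values 0-3 index a, 4-7 index b."""
    both = list(a) + list(b)
    return [both[s] for s in sel]


def transpose4x4(r0, r1, r2, r3):
    # Step 1: transpose each of the four 2x2 corner blocks (4 swizzles)
    t0 = select8(r0, r1, [0, 4, 2, 6])
    t1 = select8(r0, r1, [1, 5, 3, 7])
    t2 = select8(r2, r3, [0, 4, 2, 6])
    t3 = select8(r2, r3, [1, 5, 3, 7])
    # Step 2: swap the off-diagonal 2x2 blocks (4 swizzles)
    o0 = select8(t0, t2, [0, 1, 4, 5])
    o1 = select8(t1, t3, [0, 1, 4, 5])
    o2 = select8(t0, t2, [2, 3, 6, 7])
    o3 = select8(t1, t3, [2, 3, 6, 7])
    return [o0, o1, o2, o3]


# Input rows as in the diagram: row r holds m0r m1r m2r m3r
rows = [[f"m{c}{r}" for c in range(4)] for r in range(4)]
out = transpose4x4(*rows)
```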
Transforms for DCT
Table to evaluate
swizzle2 takes 2 arguments, interleaving the two vectors depending on a 3rd (the swizzle selector)
instruction | 31:27 | 26:25 | 24:20 | 19:15 | 14:12 | 11:7
---|---|---|---|---|---|---
swizzle2 | rs3 | 00 | rs2 | rs1 | 000 | rd
fswizzle2 | rs3 | 01 | rs2 | rs1 | 000 | rd
swizzle | 0 | 10 | rs2 | rs1 | 000 | rd
fswizzle | 0 | 11 | rs2 | rs1 | 000 | rd
swizzlei | imm (31:20) | | | rs1 | 001 | rd
fswizzlei | imm (31:20) | | | rs1 | 010 | rd
More:
swizzlei would still need the 12-bit format due to not having enough immediate bits. we can get away with only 3 i-type funct3s used for [f]swizzlei by having one funct3 for destsubvl 1 through 3 for int and fp versions and a separate one for destsubvl = 4 that's shared between int/fp:
int/fp | DESTSUBVL | 31 | 30:29 | 28:20 | 19:15 | 14:12 | 11:7
---|---|---|---|---|---|---|---
int | 1 to 3 | 0 | DESTSUBVL | selector | rs | 000 | rd
fp | 1 to 3 | 1 | DESTSUBVL | selector | rs | 000 | rd
int | 4 | selector[11:0] (31:20) | | | rs | 001 | rd
fp | 4 | selector[11:0] (31:20) | | | rs | 010 | rd
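The field widths in this table line up with the 3-bit selector fields used by the pseudocode at the end of this page: three fields need 9 bits (28:20) and four fields need the full 12 bits. A minimal sketch of that arithmetic, with hypothetical helper names and assuming element 0 occupies the lowest selector bits as in the pseudocode:

```python
FIELD_SIZE = 3  # selector bits per destination element (pseudocode FIELD_SIZE)


def selector_bits(destsubvl):
    """Number of selector bits needed for a given DESTSUBVL."""
    return FIELD_SIZE * destsubvl


def pack_selector(fields):
    """Pack per-element 3-bit fields, element 0 in the lowest bits."""
    assert 1 <= len(fields) <= 4
    selector = 0
    for i, field in enumerate(fields):
        assert 0 <= field < (1 << FIELD_SIZE)
        selector |= field << (FIELD_SIZE * i)
    return selector
```

For DESTSUBVL of 1 to 3 the packed selector always fits in the 9-bit field, which is what frees bits 31:29 for the int/fp flag and DESTSUBVL.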
the rest could be encoded as follows:
instruction | 31:27 | 26:25 | 24:20 | 19:15 | 14:12 | 11:7
---|---|---|---|---|---|---
swizzle2 | rs3 | DESTSUBVL | rs2 | rs1 | 100 | rd
swizzle | rs1 | DESTSUBVL | rs2 | rs1 | 100 | rd
fswizzle2 | rs3 | DESTSUBVL | rs2 | rs1 | 101 | rd
fswizzle | rs1 | DESTSUBVL | rs2 | rs1 | 101 | rd
note how for [f]swizzle, rs3 == rs1
so it uses 5 funct3 values overall, which is appropriate, since swizzle is probably right after muladd in usage in graphics shaders.
Alternative immed encoding
int/fp | 31:28 | 27:20 | 19:15 | 14:12 | 11:7 |
---|---|---|---|---|---|
int | DESTMASK | selector | rs | 000 | rd |
fp | DESTMASK | selector | rs | 001 | rd |
int | DESTMASK | constsel | rs | 010 | rd |
fp | DESTMASK | constsel | rs | 011 | rd |
Allows setting of arbitrary dest (xz, yw) without needing register-versions. Saves on instruction count. Needs 4 funct3 to express.
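One possible reading of the DESTMASK encoding, sketched under stated assumptions: bit i of DESTMASK enables writing destination element i (bit 0 = x), the 8-bit selector holds one 2-bit source index per destination element with element 0 in the lowest bits, and masked-out elements keep their old value. None of this is fixed by the table above, and the constsel variants are not modelled. The name `masked_swizzle` is hypothetical.

```python
def masked_swizzle(dest, src, destmask, selector):
    """Write src[selected index] into each dest element enabled by destmask.

    Assumptions: destmask bit i enables element i; selector has a 2-bit
    source index per destination element; other elements are unchanged.
    """
    out = list(dest)
    for i in range(4):
        if (destmask >> i) & 1:
            out[i] = src[(selector >> (2 * i)) & 0b11]
    return out


# Write only x and z: x takes src.w, z takes src.y
result = masked_swizzle([1, 2, 3, 4], [10, 11, 12, 13], 0b0101, 0b00010011)
```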
Matrix 4x4 Vector mul

    pfscale,3    F2, F1, F10
    pfscaleadd,2 F2, F1, F11, F2
    pfscaleadd,1 F2, F1, F12, F2
    pfscaleadd,0 F2, F1, F13, F2

pfscale is a 4 vec mv.shuffle followed by a fmul. pfscaleadd is a 4 vec mv.shuffle followed by a fmac.
In effect what this is doing is:

    fmul f2, f1.xxxx, f10
    fmac f2, f1.yyyy, f11, f2
    fmac f2, f1.zzzz, f12, f2
    fmac f2, f1.wwww, f13, f2

Where all of f2, f1, and f10-13 are vec4, and f1.x-w are copied (fixed index) where the other vec4 indices progress.
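The four-instruction sequence can be modelled directly on Python lists. This sketch assumes that f10 to f13 hold the four columns of the matrix, which the text above does not state explicitly; `splat`, `fmul`, `fmac` and `mat4_vec_mul` are hypothetical helper names.

```python
def splat(vec, i):
    """Model of f1.xxxx etc.: copy element i into all four lanes."""
    return [vec[i]] * 4


def fmul(a, b):
    return [x * y for x, y in zip(a, b)]


def fmac(a, b, acc):
    return [x * y + z for x, y, z in zip(a, b, acc)]


def mat4_vec_mul(cols, v):
    """cols[0..3] play the role of f10..f13 (assumed matrix columns), v is f1."""
    f2 = fmul(splat(v, 0), cols[0])
    f2 = fmac(splat(v, 1), cols[1], f2)
    f2 = fmac(splat(v, 2), cols[2], f2)
    f2 = fmac(splat(v, 3), cols[3], f2)
    return f2


identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
unchanged = mat4_vec_mul(identity, [2, 3, 4, 5])
```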
Pseudocode
Swizzle:

    pub trait SwizzleConstants: Copy + 'static {
        const CONSTANTS: &'static [Self; 4];
    }
    impl SwizzleConstants for u8 {
        const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFF, 0x7F];
    }
    impl SwizzleConstants for u16 {
        const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFFFF, 0x7FFF];
    }
    impl SwizzleConstants for f32 {
        const CONSTANTS: &'static [Self; 4] = &[0.0, 1.0, -1.0, 0.5];
    }
    // impl for other types too...

    pub fn swizzle<Elm, Selector>(
        rd: &mut [Elm],
        rs1: &[Elm],
        rs2: &[Selector],
        vl: usize,
        destsubvl: usize,
        srcsubvl: usize)
    where
        Elm: SwizzleConstants,
        // Selector is a copyable type that can be converted into u64
        Selector: Copy + Into<u64>,
    {
        const FIELD_SIZE: usize = 3;
        const FIELD_MASK: u64 = 0b111;
        for vindex in 0..vl {
            let selector: u64 = rs2[vindex].into();
            if selector >> (FIELD_SIZE * destsubvl) != 0 {
                // handle illegal instruction trap
            }
            for i in 0..destsubvl {
                let mut sel_field = selector >> (FIELD_SIZE * i);
                sel_field &= FIELD_MASK;
                // bit 2 of the field selects between rs1 and the constants
                let src: &[Elm] = if (sel_field & 0b100) == 0 {
                    &rs1[(vindex * srcsubvl)..]
                } else {
                    &Elm::CONSTANTS[..]
                };
                sel_field &= 0b11;
                if sel_field as usize >= srcsubvl {
                    // handle illegal instruction trap
                }
                let value = src[sel_field as usize];
                rd[vindex * destsubvl + i] = value;
            }
        }
    }

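As a worked example of the selector encoding, here is a small Python model of the swizzle pseudocode for a single vector element. It mirrors the pseudocode's checks as written, including applying the srcsubvl bound to constant indices, and raises `ValueError` where the pseudocode says "illegal instruction trap". The name `swizzle_one` is hypothetical and the constants are the f32 ones from the pseudocode.

```python
CONSTANTS_F32 = [0.0, 1.0, -1.0, 0.5]  # f32 constants from the pseudocode


def swizzle_one(src, selector, destsubvl, srcsubvl, constants=CONSTANTS_F32):
    """Model of swizzle for one vector element (one vindex)."""
    if selector >> (3 * destsubvl):
        raise ValueError("illegal instruction: selector bits beyond destsubvl")
    out = []
    for i in range(destsubvl):
        field = (selector >> (3 * i)) & 0b111
        index = field & 0b11
        if index >= srcsubvl:
            raise ValueError("illegal instruction: index >= srcsubvl")
        # bit 2 of the field selects the constants table instead of src
        table = constants if field & 0b100 else src
        out.append(table[index])
    return out


# vec2 source -> vec3 result (y, x, constant 1.0):
# fields are 0b001, 0b000, 0b101, packed with element 0 lowest
selector = 0b001 | (0b000 << 3) | (0b101 << 6)
result = swizzle_one([7.0, 8.0], selector, destsubvl=3, srcsubvl=2)
```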
Swizzle2:

    fn swizzle2<Elm, Selector>(
        rd: &mut [Elm],
        rs1: &[Elm],
        rs2: &[Selector],
        rs3: &[Elm],
        vl: usize,
        destsubvl: usize,
        srcsubvl: usize)
    where
        // Elm is a copyable type
        Elm: Copy,
        // Selector is a copyable type that can be converted into u64
        Selector: Copy + Into<u64>,
    {
        const FIELD_SIZE: usize = 3;
        const FIELD_MASK: u64 = 0b111;
        for vindex in 0..vl {
            let selector: u64 = rs2[vindex].into();
            if selector >> (FIELD_SIZE * destsubvl) != 0 {
                // handle illegal instruction trap
            }
            for i in 0..destsubvl {
                let mut sel_field = selector >> (FIELD_SIZE * i);
                sel_field &= FIELD_MASK;
                // bit 2 of the field selects which source vector is read
                let src = if (sel_field & 0b100) != 0 {
                    rs1
                } else {
                    rs3
                };
                sel_field &= 0b11;
                if sel_field as usize >= srcsubvl {
                    // handle illegal instruction trap
                }
                let value = src[vindex * srcsubvl + (sel_field as usize)];
                rd[vindex * destsubvl + i] = value;
            }
        }
    }
