Note about naming

the original assessment for SVP from 18 months ago concluded that it should be easy for scalar (non SV) instructions to get at the exact same scalar registers when in SVP mode. otherwise scalar v3.0B code needs to restrict itself to a massively truncated subset of the scalar registers numbered 0-31 (only r0, r4, r8...) which hugely interferes with ABIs to such an extent that it would compromise SV.

question: has anything changed about the assessment that was done, which concluded that for scalar SVP regs they should overlap completely with scalar ISA regs?

Notes on requirements for bit allocations

do not try to jam VL or MAXVL in. go with the flow of 24 bits spare.

  • 2: SUBVL
  • 2: elwidth
  • 2: twin-predication (src, dest) elwidth
  • 1: select INT or CR predication
  • 3: predicate selection and inversion (QTY 2 for tpred)
  • 4x2 or 3x3: src1/2/3/dest Vector/Scalar reg
  • 5: mode

totals: 24 bits (dest elwidth shared)

http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-December/001434.html

All zeros indicates "disable SVP"

The defaults for all capabilities of SVP should be zero to indicate "no action". SUBVL=1 encoded as 0b00. register name prefixes, scalar=0b0, elwidth overrides DEFAULT=0b00, predication off=0b000 etc.

this way SV may be entirely disabled, leaving an "all zeros" to indicate to v3.1B 64bit prefixing that the standard OpenPOWER v3.1B encodings are in full effect (and that SV is not). As all zeros meshes with current "reserved" encodings this should work well.

(but not disable use of VL or anything related to SVSTATE, that is completely different. above refers to instruction augmentation)

twin predication

twin predication and twin elwidth overrides are extremely important: they make it possible to override both the src and dest elwidth yet keep the underlying scalar operation intact. examples include mr with an elwidth=8, VL=8 on the src, which will take a byte at a time from one 64 bit reg and place it into 8x 64-bit regs, zero-extended. more complex operations involve SUBVL and Audio/Video DSP operations, see av opcodes
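The mr example above can be sketched in a few lines of Python. This is only an illustration of the described data movement, not the actual SV semantics: the function name is hypothetical, and it assumes element 0 is the least-significant byte of the source register, which the text does not state.

```python
def mr_src_elwidth8(src_reg, vl=8):
    """Take `vl` bytes from one 64-bit source value and return `vl`
    64-bit destination values, each zero-extended from one byte.

    Assumption (not from the source): element 0 is the LSB byte.
    """
    if not 0 <= vl <= 8:
        raise ValueError("only 8 byte-sized elements fit in a 64-bit reg")
    dest = []
    for i in range(vl):
        byte = (src_reg >> (8 * i)) & 0xFF  # select source element i
        dest.append(byte)                   # upper 56 bits stay zero
    return dest


if __name__ == "__main__":
    for i, value in enumerate(mr_src_elwidth8(0x0807060504030201)):
        print(f"dest[{i}] = 0x{value:016x}")
```

The underlying operation is still a plain register move; only the source element width changes how the source is read.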

something like:

| 0 1   | 2 3 | 4 5 | 6    | 7 9  | 10 12 | 13 18 | 19 23 |
| subvl | sew | dew | ptyp | psrc | pdst  | vspec | mode  |

table:

  • subvl - 1 to 4 scalar / vec2 / vec3 / vec4
  • sew / dew - DEFAULT / 8 / 16 /32 element width
  • ptyp - predication INT / CR
  • psrc / pdst - predicate mask selector and inversion
  • vspec - 3 bit src / dest scalar-vector extension
  • mode: 5 bits
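A quick way to sanity-check the layout above is to pack and unpack it. The following is a minimal sketch, assuming MSB0 numbering (bit 0 is the most significant of the 24 bits, as is conventional for OpenPOWER); the helper names are hypothetical and the field widths are taken from the bit positions listed above.

```python
# (name, width) in the order given above. MSB0 ordering is an assumption.
TWIN_PRED_FIELDS = [
    ("subvl", 2), ("sew", 2), ("dew", 2), ("ptyp", 1),
    ("psrc", 3), ("pdst", 3), ("vspec", 6), ("mode", 5),
]
TOTAL_BITS = sum(width for _, width in TWIN_PRED_FIELDS)  # expected: 24


def pack(fields):
    """Pack a dict of field values into one integer; missing fields are 0."""
    value = 0
    for name, width in TWIN_PRED_FIELDS:
        v = fields.get(name, 0)
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name}={v} does not fit in {width} bits")
        value = (value << width) | v
    return value


def unpack(value):
    """Split a 24-bit integer back into its named fields."""
    out = {}
    shift = TOTAL_BITS
    for name, width in TWIN_PRED_FIELDS:
        shift -= width
        out[name] = (value >> shift) & ((1 << width) - 1)
    return out


if __name__ == "__main__":
    print(TOTAL_BITS)
    print(unpack(pack({"sew": 1, "psrc": 0b010, "mode": 3})))
```

With every field left at its default of zero the packed value is zero, which matches the "all zeros indicates disable SVP" convention described earlier.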

twin predication, CR based.

separate src and dest predicates are a critical part of SV for provision of VEXPAND, VCOMPRESS, VSPLAT, VINSERT and many more operations.

Twin CR predication could be done in two ways:

  • start from different CRs for the src and dest
  • start from the same CR.

With different bits being selectable (CR[0..3]) starting from the same CR makes some sense.

standard arith ops (single predication)

these are of the form res = op(src1, src2, ...)

| 0 1   | 2 3 | 4 5 | 6    | 7 9  | 10 18 | 19 23 |
| subvl | sew | dew | ptyp | pred | vspec | mode  |

table:

  • subvl - 1 to 4 scalar / vec2 / vec3 / vec4
  • sew / dew - DEFAULT / 8 / 16 /32 element width
  • ptyp - predication INT / CR
  • pred - predicate mask selector and inversion
  • vspec - 2/3 bit src / dest scalar-vector extension
  • mode - 5 bit

For 2 op (dest/src1/src2) the tag may be 3 bits: total 9 bits. for 3 op (dest/src1/2/3) the vspec may be 2 bits per reg: total 8 bits.

Note:

  • the operation should always be done at max(srcwidth, dstwidth), unless it can be proven using the lower will lead to the same result
  • saturation is done on the result at the dest elwidth

Some examples on different operation widths:

u16 / u16 = u8
256 / 2 = 128 # if we used the smaller width, we'd get 0. Wrong

u8 * u8 = u16
255 * 2 = 510 # if we used the smaller width, we'd get 254. Wrong

u16 + u16 = u8
256 + 2 = 2 # this is correct whether we use the larger or smaller width
            # aka hw can optimize narrowing addition
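The three examples above can be reproduced with a small sketch. It assumes plain truncation (modulo 2^width) when narrowing and does not model saturation; the helper names are hypothetical.

```python
import operator


def mask(bits):
    """All-ones mask of the given width."""
    return (1 << bits) - 1


def op_at_width(op, a, b, op_bits, dest_bits):
    """Perform `op` with operands and result truncated to `op_bits`,
    then truncate the result again to the destination width."""
    a &= mask(op_bits)
    b &= mask(op_bits)
    return op(a, b) & mask(op_bits) & mask(dest_bits)


if __name__ == "__main__":
    # u16 / u16 = u8
    print(op_at_width(operator.floordiv, 256, 2, 16, 8),
          op_at_width(operator.floordiv, 256, 2, 8, 8))
    # u8 * u8 = u16
    print(op_at_width(operator.mul, 255, 2, 16, 16),
          op_at_width(operator.mul, 255, 2, 8, 16))
    # u16 + u16 = u8
    print(op_at_width(operator.add, 256, 2, 16, 8),
          op_at_width(operator.add, 256, 2, 8, 8))
```

Only the addition gives the same answer at both widths, which is the case where hardware may safely use the narrower operation.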

Notes about Swizzle

Basically, there isn't enough room to fit swizzles for two sources (src1/2) as well as SV, even into 64 bit (actually 24), without severely compromising on the number of bits allocated to either swizzle, or SV, or both.

therefore the strategy proposed is:

  • design 16bit scalar ops
  • use the 11 bit old SV prefix to create 32bit insns
  • when those are embedded into v3.1B 64 prefix, the 24 bits are entirely allocated to swizzle.

with 2x12 this would mean no need to have complex encoding of swizzle.

if we really do need 2 bits spare then the complex encoder of swizzle could be deployed. (an analysis shows this to be very unlikely. 7^4 is around 2400, which still requires 12 bits to encode. (that's miscalculated, see Single Swizzle section below.) it isn't, because you missed out predicate mask skip as the 7th option.)

Single Swizzle

I expect swizzle to not be common enough to warrant 2 swizzles in a single instruction, therefore the above swizzle strategy is probably unnecessary.

Also, if a swizzle supports up to subvl=4, then 11 bits is sufficient since each swizzle element needs to be able to select 1 of 6 different values: 0, 1, x, y, z, w. 6^4 = 1296 which easily fits in 11 bits (only by dropping "predicate mask" from the list of options, which makes 7 options not 6. see mv.swizzle)

What about subvl=4 that skips one element? src vec is 4 but one of the elements is to be left alone? This is not 6 options, it is 7 options (including "skip" i.e combining with a predicate mask in effect). note that this is not the same as a vec3-with-a-skip

What could hypothetically be done is: when SUBVL=3 a different encoding is used, one that allows the "skip" to be specified. X Y skip W for example. this would then be interpreted, "actually the vector is vec4 but one element is skipped"

the problem with that is that now SUBVL has become critically dependent on the swizzle, worse than that the swizzle is embedded in the instruction, even worse than that it's encoded in a complex multi-gate fashion.

all of which screams, "this is going in completely the wrong direction". keep it simple. 7 options, 3 bits, 4x3, 12 bits for swizzle, ignore some if SUBVL is 1 2 or 3.
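The bit counts argued over above are easy to check, and the "keep it simple" 4x3 encoding is short enough to sketch. The numbering of the seven options below is purely hypothetical (no assignment is given here), as are the function names.

```python
# Hypothetical 3-bit numbering of the 7 options; code 7 is left unused.
SWIZZLE_OPTIONS = ("x", "y", "z", "w", "0", "1", "skip")


def bits_needed(n_options, lanes=4):
    """Number of combinations for `lanes` selectors, and the bits a
    dense (complex) encoding of those combinations would need."""
    combos = n_options ** lanes
    return combos, (combos - 1).bit_length()


def encode_swizzle(selectors):
    """Simple encoding: 3 bits per lane, lane 0 in the lowest bits."""
    if len(selectors) != 4:
        raise ValueError("expected 4 selectors (ignore some if SUBVL < 4)")
    value = 0
    for lane, sel in enumerate(selectors):
        value |= SWIZZLE_OPTIONS.index(sel) << (3 * lane)
    return value


def decode_swizzle(value):
    """Inverse of encode_swizzle."""
    return [SWIZZLE_OPTIONS[(value >> (3 * lane)) & 0b111] for lane in range(4)]


if __name__ == "__main__":
    print(bits_needed(6))  # 6 options: 0, 1, x, y, z, w
    print(bits_needed(7))  # 7 options: the above plus "skip"
    print(decode_swizzle(encode_swizzle(["x", "y", "skip", "w"])))
```

With 7 options even the dense encoding needs 12 bits, so the simple 4x3 layout costs nothing extra.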

note about INT predicate

001 ALWAYS (implicit)   Operation is not masked

this means by default that 001 will always be in nonpredicated ops, which seems anomalous. would 000 be better to indicate "no predication"?

000 would indicate "the predicate is an immediate of all 1s" i.e. "no operation is masked out"

programmerjake: I picked 001 to indicate ALWAYS since that matches with the other semantics: the LSB bit is invert-the-mask, and you can think about the table as-if it is really:

this is the opposite of what feels natural. inversion should switch off something. also 000 is the canonical "this feature is off by default" number.

the constant should be an immediate of all 1s (not r0), which is the natural way to think of "predication is off".

i get the idea "r0 to be used therefore it is all zeros" but that makes 001 the "default", not 000.

Value Mnemonic
000 R0 (zero) set to all 1s, naturally means "no predication"
001 ~R0 (~zero)
010 R3
011 ~R3
100 R10
101 ~R10
110 R30
111 ~R30

CR Vectorization

Some thoughts on this: the sensible (sane) number of CRs to have is 64. A case could be made for having 128 but it is an awful lot. 64 CRs also has the advantage that it is only 4x 64 bit registers on a context-switch (programmerjake: yeah, but we already have 256 64-bit registers, a few more won't change much).

A practical issue stems from the fact that accessing the CR regfile on a non-aligned 8-CR boundary during Vector operations would significantly increase internal routing. By aligning Vector Reads/Writes to 8 CRs this requires only 32 bit aligned read/writes. (programmerjake: simple solution -- rename them internally such that CR6 is the first one)

How to number them as vectors gets particularly interesting. A case could be made for treating the 64 CRs as a square, and using CR numbering (CR0-7) to begin VL for-loop incrementing first by row and when rolling over to increment the column. CR6 CR14 ... CR62 then CR7 CR15 ...

When the SV prefix marks them with 2 bits, one of those could be used to indicate scalar, and the other to indicate whether the 3 bit CR number is to be treated as a horizontal vector (CR incrementing straight by 1) or a vertical vector (incrementing by 8)

When there are 3 bits it would be possible to indicate whether to begin from a position offset by 4 (middle of matrix, edge of matrix).

Note: considerable care needs to be taken when putting these horiz/vertical CRs through the Dependency Matrices

Indexing algorithm illustrating how the H/V modes would work. Note that BA is the 3 bit CR register field that normally, in scalar ISA, would reference only CR0-7 as CR[BA].

for i in range(VL):
    y = i % 8
    x = i // 8
    if verticalmode:
        CRINDEX = BA + y*8 + x
    else:
        CRINDEX = BA*8 + i
    CR[CRINDEX] = ...
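A runnable version of the same loop, returning the list of CR indices instead of writing to them, makes the two orderings easy to inspect. The function name is hypothetical, and no wrap-around at 64 is applied since none is specified above.

```python
def cr_indices(BA, VL, verticalmode):
    """Return the CR index used for each element 0..VL-1."""
    indices = []
    for i in range(VL):
        y = i % 8   # row within the 8x8 square
        x = i // 8  # column, advanced after every 8 elements
        if verticalmode:
            crindex = BA + y * 8 + x
        else:
            crindex = BA * 8 + i
        indices.append(crindex)
    return indices


if __name__ == "__main__":
    print(cr_indices(BA=6, VL=10, verticalmode=True))
    print(cr_indices(BA=6, VL=3, verticalmode=False))
```

In vertical mode with BA=6 the sequence runs CR6, CR14, ... CR62 and then continues at CR7, CR15, matching the numbering described earlier.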

Should twin-predication (src=1, dest=1) have DEST SUBVL?

this is tricky: there isn't really enough space unless the reg scalar-vector extension (currently 3 bits per reg) is compacted to only 2 bits each, which would provide 2 extra bits.

so before adding this, an evaluation is needed: is it necessary?

what actual operations out of this list need - and work - with a separate SRC and DEST SUBVL?

  • mv (the usual way that V* operations are created)
  • exts* sign-extension
  • rlwinm and other RS-RA shift operations
  • LD and ST (treating AGEN as one source)
  • FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
  • Condition Register ops mfcr, mtcr and other similar

Evaluation:

  • mv: yes. these may need merge/split
  • exts: no. no transformation.
  • rlwinm shift operations: no
  • LD and ST: no
  • FP ops: no
  • CR ops: maybe on mvs, not on arithmetic.

therefore it makes no sense to have DEST SUBVL, and instead to have special mv operations. see mv.vec

"Auto-Scalar" (Auto-VL=1) review

full review needed, answering question:

if sv.op RT.scalar RA.scalar RB.scalar is changed to temporarily
override VL to be 1, is anything lost?

four aspects:

  1. instructions with only one source (or dest)
  2. RM Modes
  3. is the loss of the dynamic meaning "VL=0" nop effect important?
  4. why would "sv.op all-scalar" be inside a loop in the first place?

Summary so far:

  • failfirst needs to be an illegal exception if all-scalar
  • non-zeroing predication on all-scalar with VL>1 requires all relevant bits to be set, this changes to the first bit for auto-VL=1. requires an extra reduction instruction.
  • sv.branches should not be touched. at all.

only 1 src/dest

Instructions in this category are usually Unvectorizeable or they are Load-Immediates. fmvis for example, is 1-Write, whilst SV.Branch-Conditional is BI (CR field bit).

TBD

answers to 2, RM Modes

Normal Mode:

  • simple mode is straight vectorization.
  • reduce mode
  • ffirst or data-dependent fail-on-first:
  • sat mode or saturation:
  • pred-result mode

simple mode is fine including on predication but has a CHANGE OF BEHAVIOUR. first bit of src/dest is used when zeroing is on, but first ENABLED bit of predicate is used when VL>1.

reduce mode is unaffected (meaningless) on a scalar operation

fail-first with or without VLI, should be unaffected, but what should VL be truncated to? the override VL=1? (or VL=0 when VLI is set?)

saturation mode is fine

predicate-result should be fine as well (aside from change in predicate behaviour)

LD/ST Immediate

ldst

First critical observation, RA.isvec is utilized to detect element-stride.

if RA.isvec:
    svctx.ldstmode = indexed
elif els == 0:
    svctx.ldstmode = unitstride
elif immediate != 0:
    svctx.ldstmode = elementstride

thus it is actually legitimate to have scalar src and dest especially with predicate masks. the trick noted in (4) below of setting RA.isvec would therefore activate the Vector Indexed mode, with associated predication-based offsets (and REMAP) (not to be confused with VSPLAT mode).

elif RA.isvec:
    # quirky Vector indexed mode but with an immediate
    srcbase = ireg[RA+i]
    offs = immed
else:
    # standard scalar mode (but predicated)
    # no stride multiplier means VSPLAT mode (FIXME(lkcl): unclear --
    # VSPLAT requires RT.isvec=1 and all sources to be scalar)
    srcbase = ireg[RA]
    offs = immed

  • Question: is RA.isvec=0 RT.isvec=1 a substitute?
  • Answer: no because RA.isvec=0 RT.isvec=1 is VSPLAT mode
  • Question: are there any other equivalent modes?
  • Answer: no, because immediate=0 selects Indexed mode
  • Question: can the VSPLAT mode be actually used in proposed scalar-mode?
  • Answer: No, scalar-mode requires RA.isvec=0 RT.isvec=0, but VSPLAT is RA.isvec=0 RT.isvec=1.

VL>1 at the moment, with a scalar source and scalar dest, will not undergo any changes to the EA compared to if VL=1. Destination SPLAT if scalar is only influenced by predication which has a workaround below (merge to single bit mask)

LD/ST Indexed

Element-Striding is specifically enabled on RA and RB being scalar. If VL=1 behaviour is also activated then this is potentially interfered with, except that, again, RT may be set as a vector destination.

    if svctx.ldstmode == elementstride:
        EA = ireg[RA] + ireg[RB]*j   # register-strided
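To see what the register-strided formula produces, here is a minimal sketch listing the Effective Address for each element. It assumes j is the element index running from 0 to VL-1 and ignores 64-bit wrap-around; the function name and the example register contents are hypothetical.

```python
def element_strided_eas(ireg, RA, RB, VL):
    """EA for each element j when RA and RB are both scalar:
    base from ireg[RA], stride from ireg[RB]."""
    return [ireg[RA] + ireg[RB] * j for j in range(VL)]


if __name__ == "__main__":
    ireg = {3: 0x1000, 4: 16}  # example register file contents
    print([hex(ea) for ea in element_strided_eas(ireg, RA=3, RB=4, VL=4)])
```

With VL=1 only the j=0 address remains, which is the plain scalar EA of ireg[RA].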

Vector destination is again "VSPLAT" mode, but if a Scalar destination was set with VL>1, then just as with LD-immediate it is the entire predicate mask which must be zero to stop the scalar element from being loaded, and the same effect may be achieved with VL=1 by ORing all predicate mask bits down to a single bit as a new predicate.

CR ops

cr ops

  • simple mode is straight vectorization.
  • reduce mode
  • ffirst or data-dependent fail-on-first:

all of these modes are identical analysis to RM.normal. simple requires predicate reduction, mapreduce is irrelevant and failfirst needs to be an Illegal Instruction.

Branch-Conditional

branches are so heavily interdependent on CTR-test and VLSet Modes, and have only a single source (BI), that it is strongly recommended not to interfere with their behaviour at all. additionally the unaltered behaviour is needed to substitute for the loss of all-ones predicate mask behaviour on SV-scalar-regs

answers to 4, loops/uses

REMAP

A REMAP would redirect operations from the first nonmasked predicated element to the first REMAPped element, and combined with predication (single bit or otherwise) selection of a single element is achieved.

(but hang on, does it actually? answer: no. scalar sources are not REMAPped regardless of VL)

  • biggest concern: how to achieve the same effect?
  • answer: use at least one vector source. this solves the predication issue.
  • question: does this impact LD/ST which has special overrides and mode-selection based on RA.isvec?
  • answer: yes but only for predication (below)

summary, REMAP is a red herring

predication

with nonzeroing the application of a predicate mask to an all-scalar operation effectively tests ALL relevant bits 0..VL-1 as nonzero in the decision-making, whereas VL=1 will only test the first.

a need for merging (ORing) all bits into a single alternative predicate mask (single-bit) is the sort of thing we can probably live with.
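The difference, and the OR-merge workaround, can be sketched as follows. This assumes predicate bit 0 corresponds to element 0 and follows the description above that an all-scalar operation with VL>1 goes ahead if any of bits 0..VL-1 is set; the function names are hypothetical.

```python
def all_scalar_runs_with_vl(mask, vl):
    """Current behaviour as described: any set bit in 0..VL-1 enables
    the (single) scalar operation."""
    return (mask & ((1 << vl) - 1)) != 0


def all_scalar_runs_auto_vl1(mask):
    """Auto-VL=1 behaviour: only the first predicate bit is tested."""
    return (mask & 1) != 0


def merge_predicate(mask, vl):
    """OR bits 0..VL-1 down to a single-bit replacement predicate."""
    return 1 if (mask & ((1 << vl) - 1)) else 0


if __name__ == "__main__":
    mask, vl = 0b0100, 4
    print(all_scalar_runs_with_vl(mask, vl))
    print(all_scalar_runs_auto_vl1(mask))
    print(all_scalar_runs_auto_vl1(merge_predicate(mask, vl)))
```

The merge is the extra reduction step mentioned in the summary: it costs an instruction, but restores the earlier decision under auto-VL=1.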

reducing unnecessary setvl interchanges

A major motivation for changing SVP64 with all isvec=0 to temporarily override VL to 1 is to allow easy interleaving of Scalar instructions accessing the otherwise-inaccessible registers and CR Fields numbered 32-127, without the need to change VL. Given that VL may have been set to any value between 0 and MAXVL on a Fail-First, and given that VL=0 turns Vector instructions into nop this is quite important.

An example, emulating PackedSIMD ISA behaviour:

# set VL=4 to be able to do the next add as `u8x16`/`i8x16`
setvl MAXVL=VL=4
sv.add/subvl=4/elwid=8 *RT, *RA, *RB
# now set VL=1 to get scalar register access
setvl VL=1
sv.add/subvl=4/elwid=8 RT, RA, RB
# set VL=4 again.  repeat, repeat, repeat
setvl MAXVL=VL=4

This applies equally to loops when VL is dynamically set (from RA or CTR) reducing instruction count in hot-loops as it does when setting VL to static quantities as a method of emulating PackedSIMD ISA behaviour.