# SimdShape

Links:

# Requirements Analysis

The dynamic partitioned SimdSignal class is based on the logical extension of the full capabilities to the nmigen language behavioural constructs to a parallel dimension, with zero changes in that behaviour as a result of that parallelism.

Logically therefore even the concept of ast.Shape should be extended solely to express and define the extent of the parallelism and SimdShape should in no way attempt to change the expected behaviour of the Shape class behaviour from which it derives.

A logical extension of the nmigen `ast.Shape`

concept, `SimdShape`

provides sufficient context to both define overrides for individual lengths
on a per-mask basis as well as sufficient information to "upcast"
back to a SimdSignal, in exactly the same way that c++ virtual base
class upcasting works when RTTI (Run Time Type Information) works.

By deriving from `ast.Shape`

both `width`

and `signed`

are provided
already, leaving the `SimdShape`

class with the responsibility to
additionally define lengths for each mask basis. This is best illustrated
with an example.

Also by fitting on top of existing nmigen concepts, and defining the
`SimdShape.width`

equal to and synonymous with `Shape.width`

then
downcasting becomes possible and practical. *(An alternative proposal
to redefine "width" to be in terms of the multiple options, i.e.
context-dependent on the partition setting, is unworkable as it
prevents downcasting to e.g. Signal)*

The Libre-SOC IEEE754 ALUs need to be converted to SIMD Partitioning
but without massive disruptive code-duplication or intrusive explicit
coding as outlined in the worst of the techniques documented in
dynamic simd. This in turn implies that Signals need to be declared
for both mantissa and exponent that **change width to non-power-of-two
sizes** depending on Partition Mask Context.

Mantissa:

- when the context is 1xFP64 the mantissa is 54 bits (excluding guard rounding and sticky)
- when the context is 2xFP32 there are
**two**mantissas of 23 bits - when the context is 4xFP16 there are
**four**mantissas of 10 bits - when the context is 4xBF16 there are four mantissas of 5 bits.

Exponent:

- 1xFP64: 11 bits, one exponent
- 2xFP32: 8 bits, two exponents
- 4xFP16: 5 bits, four exponents
- 4xBF16: 8 bits, four exponents

`SimdShape`

needs this information in addition to the normal
information (width, sign) in order to create the partitions
that allow standard nmigen operations to **transparently**
and naturally take place at **all** of these non-uniform
widths, as if they were in fact scalar Signals *at* those
widths.

A minor wrinkle which emerges from deep analysis is that the overall
available width (`Shape.width`

) does in fact need to be explicitly
declared under *some* circumstances, and
the sub-partitions to fit onto power-of-two boundaries, in order to allow
straight wire-connections rather than allow the SimdSignal to be
arbitrary-sized (compact). Although on shallow inspection this
initially would seem to imply that it would result in large unused
sub-partitions (padding partitions) these gates can in fact be eliminated
with a "blanking" mask, created from static analysis of the SimdShape
context.

Example:

- all 32 and 16-bit values are actually to be truncated to 11 bit
- all 8-bit values to 5-bit

from these we can write out the allocations, bearing in mind that in each partition the sub-signal must start on a power-2 boundary,

```
|31| | |24| 16|15| | 8|7 0 |
32bit | | | | 1.11 |
16bit | | 2.11 | | | 1.11 |
8bit | | 4.5 | 3.5 | | 2.5 | | 1.5 |
```

Next we identify the start and end points, and note that "x" marks unused (padding) portions. We begin by marking the power-of-two boundaries (0-7 .. 24-31) and also including column guidelines to delineate the start and endpoints:

```
|31| | |24| 16|15| | 8|7 0 |
|31|28|26|24| |20|16|15|12|10|8| |4 0 |
32bit | x| x| x| | x| x| x|10 .... 0 |
16bit | x| x|26 ... 16 | x| x|10 .... 0 |
8bit | x|28 .. 24| 20.16| x|12 .. 8|x|4.. 0 |
unused x x
```

thus, we deduce, we *actually* need breakpoints at *nine* positions,
and that unused portions common to **all** cases can be deduced
and marked "x" by looking at the columns above them.
These 100% unused "x"s therefore define the "blanking" mask, and in
these sub-portions it is unnecessary to allocate computational gates.

Also in order to save gates, in the example above there are only three
cases (32 bit, 16 bit, 8 bit) therefore only three sets of logic
are required to construct the larger overall computational result
from the "smaller chunks". At first glance, with there
being 9 actual partitions (28, 26, 24, 20, 16, 12, 10, 8, 4), it
would appear that 2^{9} (512!) cases were required, where in fact
there are only three.

These facts also need to be communicated to both the SimdSignal as well as the submodules implementing its core functionality: add operation and other arithmetic behaviour, as well as cat and others.

In addition to that, there is a "convenience" that emerged
from technical discussions as desirable
to have, which is that it should be possible to perform
rudimentary arithmetic operations *on a SimdShape* which preserves
or adapts the Partition context, where the arithmetic operations
occur on `Shape.width`

.

```
>>> XLEN = SimdShape(fixed_width=64, signed=True, ...)
>>> x2 = XLEN // 2
>>> print(x2.width)
32
>>> print(x2.signed)
True
```

With this capability it becomes possible to use the Liskov Substitution Principle in dynamically compiling code that switches between scalar and SIMD transparently:

```
# scalar context
scalarctx = scl = object()
scl.XLEN = 64
scl.SigKls = Signal # standard nmigen Signal
# SIMD context
simdctx = sdc = object()
sdc.XLEN = SimdShape({1x64, 2x32, 4x16, 8x8})
sdc.SigKls = SimdSignal # advanced SIMD Signal
sdc.elwidth = Signal(2)
# select one
if compiletime_switch == 'SIMD':
ctx = simdctx
else:
ctx = scalarctx
# exact same code switching context at compile time
m = Module():
with ctx:
x = ctx.SigKls(ctx.XLEN)
y = ctx.SigKls(ctx.XLEN // 2)
...
m.d.comb += x.eq(Const(3))
```

An interesting practical requirement transpires from attempting to use
SimdSignal, that affects the way that SimdShape works. The register files
are 64 bit, and are subdivided according to what wikipedia terms
"SIMD Within A Register" (SWAR). Therefore, the SIMD ALUs *have* to
both accept and output 64-bit signals at that explicit width, with
subdivisions for 1x64, 2x32, 4x16 and 8x8 SIMD capability.

However when it comes to intermediary processing (partial computations)
those intermediary Signals can and will be required to be a certain
fixed width *regardless* and having nothing to do with the register
file source or destination 64 bit fixed width.

The simplest example here would be a boolean (1 bit) Signal for Scalar (but an 8-bit quantity for SIMD):

```
m = Module():
with ctx:
x = ctx.SigKls(ctx.XLEN)
y = ctx.SigKls(ctx.XLEN)
b = ctx.SigKls(1)
m.d.comb += b.eq(x == y)
with m.If(b):
....
```

This code is obvious for Scalar behaviour but for SIMD, because
the elwidths are declared as `1x64, 2x32, 4x16, 8x8`

then whilst
the *elements* are 1 bit (in order to make a total of QTY 8
comparisons of 8 parallel SIMD 8-bit values), there correspondingly
needs to be **eight** such element bits in order to store up to
eight 8-bit comparisons. Exactly how that comparison works
is described in eq

Another example would be a simple test of the first *nibble* of
the data.

```
m = Module():
with ctx:
x = ctx.SigKls(ctx.XLEN)
y = ctx.SigKls(4)
m.d.comb += y.eq(x[0:3])
....
```

Here, we do not necessarily want to declare y to be 64-bit: we want only the first 4 bits of each element, after all, and when y is set to be QTY 8of 8-bit elements, then y will only need to store QTY 8of 4-bit quantities, i.e. only a maximum of 32 bits total.

If y was declared as 64 bit this would indicate that the actual elements were at least 8 bit long, and if that was then used as a shift input it might produce the wrong calculation because the actual shift amount was only supposed to be 4 bits.

Thus not one method of setting widths is required but *two*:

- at the element level
- at the width of the entire SIMD signal

With this background and context in mind the requirements can be determined

# Requirements

SimdShape needs:

- to derive from nmigen ast.Shape in order to provide the overall width and whether it is signed or unsigned. However the overall width is not necessarily hard-set but may be calculated
- provides a means to specify the number of partitions in each of an arbitrarily-named set. for convenience and by convention from SVP64 this set is called "elwidths".
- to support a range of sub-signal divisions (element widths) and for there to be an option to either set each element width explicitly or to allow each width to be computed from the overall width and the number of partitions.
- to provide rudimentary arithmetic operator capability that automatically computes a new SimdShape, adjusting width and element widths accordingly.

Interfacing to SimdSignal requires an adapter that:

- allows a switch-case set to be created
- the switch statement is the elwidth parameter
- the case statements are the PartitionPoints
- identifies which partitions are "blank" (padding)

# SimdShape API

SimdShape needs:

- a constructor taking the following arguments:
- (mandatory, from scope) an elwidth Signal
- (optional) an integer vector width or a dictionary of vector widths (the keys to be the "elwidth")
- (mandatory, from scope) a dictionary of "partition counts": the keys to again be the "elwidth" and the values to be the number of Vector Elements at that elwidth
- (optional) a "fixed width" which if given shall auto-compute the dictionary of Vector Widths
- (mandatory) a "signed" boolean argument which defaults to False

- To derive from Shape, where the (above) constructor passes it
the following arguments:
- the signed argument. this is simply passed in, unmodified.
- a width argument. this will be
**either**the fixed_width parameter from the constructor (if given)**or**it will be the**calculated**value sufficient to store all partitions.

- a suite of operators (
`__add__`

, etc) that shall take simple integer arguments and perform the computations on*every*one of the dictionary of Vector widths (examples below) - a "recalculate" function (currently known as layout() in layout_experiment.py) which creates information required by PartitionedSignal.
- a function which computes and returns a suite of PartitionPoints as well as an "Adapter" instance, for use by PartitionedSignal

Examples of the operator usage:

```
x = SimdShape(vec_op_widths={0b00: 64, 0b01:32, 0b10: 16})
y = x + 5
print(y.vec_op_widths)
{0b00: 69, 0b01: 37, 0b10: 21}
```

In other words, when requesting 5 to be added to x, every single one of the Vector Element widths had 5 added to it. If the partition counts were 2x for 0b00 and 4x for 0b01 then this would create 2x 69-bit and 4x 37-bit Vector Elements.

# Adapter API

The Adapter API performs a specific job of letting SimdSignal
know the relationship between the supported "configuration"
options that a SimdSignal must have, and the actual PartitionPoints
bits that must be set or cleared *in order* to have the SimdSignal
cut itself into the required sub-sections. This information
comes *from* SimdShape but the adapter is not part *of* SimdShape
because there can be more than one type of Adapter Mode, depending
on SimdShape input parameters.

```
class PartType: # TODO decide name
def __init__(self, psig):
self.psig = psig
def get_mask(self):
def get_switch(self):
def get_cases(self):
@property
def blanklanes(self):
```

# SimdShape arithmetic operators

Rudimentary arithmetic operations are required in order to perform tricks such as:

```
m = Module()
with SimdScope(m, elwid, vec_el_counts) as s:
shape = SimdShape(s, fixed_width=width)
a = s.Signal(shape)
b = s.Signal(shape*2)
o = s.Signal(shape*3)
m.d.comb + o.eq(Cat(a, b))
```

as well as:

```
with SimdScope(m, elwid, vec_el_counts) as s:
shape = SimdShape(s, fixed_width=width)
a = s.Signal(shape)
b = s.Signal(shape*2)
o2 = s.Signal(a.shape + b.shape)
```

and:

```
with SimdScope(m, elwid, vec_el_counts) as s:
shape = SimdShape(s, fixed_width=width)
a = s.Signal(16) # element width set to 16
b = s.Signal(shape*2)
o2 = s.Signal(a.shape + b.shape)
```

From these examples we deduce what the arithmetic operators have to cope with:

- RHS of simple integers
- RHS of another SimdShape

In both cases, there are further subdivisions because SimdShapes can be created as either fixed_width priority or elwidths priority

- fixed_width priority (vec_op_widths=None)
- elwidths priority (fixed_width=None)
- equal (no) priority (both are given)

## RHS simple integers

Although this is a simpler case, there are still three options:

- add/mul/sub etc. of integer at fixed_width priority (vec_op_widths=None)
- add/mul/sub etc. of integer at elwidths priority (fixed_width=None)
- add/mul/sub etc. of integer at equal (no) priority (both are given)

The expected behaviour on fixed_width priority is that the arithmetic operation should take place on the fixed width. adding 8 to a 64-bit fixed-width SimdSignal should result in the overall fixed width being extended to 72-bit, and for all partitions the new element-width within each partition is newly-computed to be the maximum possible permitted amount. Example:

- assume fixed_width=64 and partition_counts {1,2,4,8}.
- therefore the computed element sizes are: 1x64, 2x32, 4x16, 8x8
- assume that an add operation has requested to add 8 to the fixed width
- the new fixed_width is 72
- the newly-computed element sizes shall be: 1x72, 2x36, 4x18, 8x9

The expected behaviour on element-width-priority on the other hand is that
the arithmetic operation takes place on *each of the element-widths*

Example:

- assume element_widths={16,16,10,12} and partition_counds {1,2,4,8}
- therefore the computed element sizes are: 1x16, 2x16, 4x10, 8x12
- therefore the computed overall width is the maximum of these amounts (8x12) which is a 96-bit wide Signal.
- assume that a subtract operation has requested to subtract 5 from the element_widths.
- the newly-computed element_widths becomes {16-5,16-5,10-5,12-5} which is {11,11,5,7}
- therefore the computed element sizes are: 1x11, 2x11, 4x5, 8x7
- therefore the newly-computed overall width is the maximum of these amounts (8x7) which is a 56-bit quantity.

It is important to note that some arithmetic operations will not work correctly if both fixed_width and element-widths are specified. Addition:

- assume fixed_width=64, element_widths {8,8,8,8} and partition_counts {1,2,4,8}.
- adding 8 to each element width creates 16 element width for all partitions
- this creates a maximum expected width of 8x16 or 128 bits
- but the fixed_width of 64 plus 8 is only 72.
- therefore we must prohibit this operation when
**both**fixed and elwidth are specified

However for multiply, it is fine, due to a property of multiplication:

- assume fixed_width=64, element_widths {8,8,8,8} and partition_counts {1,2,4,8}.
- multiply by an integer value of 2
- the new fixed_width is 2x64 = 128
- each element width is
**also**multiplied by 2 to create {16,16,16,16} - this creates partitions 1x16 2x16 4x16 8x16
- the maximum is (not by a coincidence) exactly 128 bit wide
- this matches perfectly the newly-calculated fixed_width

Given that left-shift is a type of multiply, a left-shift arithmetic operator with an integer amount (as applied equally to all element widths and to the fixed_width) is also expected to also work.

Divide on the other hand (and right-shift) will only work (when both elwidth and fixed-width are set) if there is no rounding (no bits lost due to the division / right-shift).

Summary:

**Fixed_width Priority**- all operations (add, mul, sub, shift, div) expected to work (caveat: layout() should check partition alignment)

**Element-width Priority**- all operations expected to work (no caveats)

**Fixed-and-Elwidth (no priority)**- multiply and left-shift always expected to work
- divide and right-shift expected to work (caveat: no bits lost due to rounding)
- add and subtract
**not**expected to work (ambiguous): exception must be raised

## Arithmetic of two SimdShapes

With some thought it becomes clear that when performing operations not involving elwidth priority should simply calculate a new fixed width based on straight arithmetic on the LHS and RHS fixed width. The partition counts remains the same (coming from the scope context) therefore the result may also be a fixed_width priority result using the exact same partition counts.

However - and bearing in mind that for fixed_width priority the
elwidths are *computed* from the fixed width and the partition counts -
the moment that elwidths (vec_op_widths) are involved then
the priority switches to the elwidths, even if one of those elwidths were
calculated initially from a fixed_width and partition counts.
In this case, the result will be an elwidths priority SimdShape,
where the layout() function is already capable of computing the
required overall width based on the (newly-computed) vec_el_widths.

Note that, interestingly, when fixed_width priority SimdShapes are used on both LHS and RHS, it is effectively an identical case to when the RHS is an integer, because the fixed_width of the RHS is itself an integer.

For certain operators (mul, shift) a special case of this also occurs in instances where all elwidths in all partitions on either LHS or RHS are identical: element widths {3,3,3,3} for example. Under such special circumstances, multiply would function correctly even for dual-priority, due to uniform scaling.

Summary:

**Both LHS and RHS Fixed-width Priority**- Exactly the same as for when RHS is an Integer, given that the integer fixed width is, in fact, an integer.
- the return result is always a fixed-width priority SimdShape

**Either LHS or RHS Elwidth Priority**(but not both)- all operators always expected to work because, again, one of RHS or LHS is an integer, and that integer is used as the input to the arithmetic.
- reverse-operators (rmul etc) take care of RHS.
- the return result is however always an elwidth-priority SimdShape

**Both LHS and RHS Elwidth priority**- for mul and shift-left, as long as one of LHS or RHS has identical elwidths, by a mathematical coincidence these are fine. return result may be a dual-priority result.
- for add and sub, an exception must be raised.
- divide and shift-right are expected to work on the condition that, again, all elwidths have the exact same value, and, again, that no LSBs are lost.