Analysis
DRAFT SVP64
 Revision 0.0: 21apr2022 https://www.youtube.com/watch?v=8hrIG7E77o
 Revision 0.01: 22apr2022 removal of msubed because sv.maddedu and sv.subfe works
 Revision 0.02: 22apr2022 128/64 scalar divide, investigate Goldschmidt
 Revision 0.03: 24apr2022 add 128/64 divmod2du, similar loop to maddedu
 Revision 0.04: 26apr2022 Knuth original uses overflow on scalar div
 Revision 0.05: 27apr2022 add vector shift section (no new instructions)
Introduction
This page covers an analysis of big integer operations, to work out optimal Scalar Instructions to propose be submitted to the OpenPOWER ISA WG, that when combined with Draft SVP64 give high performance compact Big Integer Vector Arithmetic. Leverage of existing Scalar Power ISA instructions is also explained.
Use of smaller suboperations is a given: worstcase in a Scalar context, addition is O(N) whilst multiply and divide are O(N^{2}), and their Vectorisation would reduce those (for small N) to O(1) and O(N). Knuth's biginteger scalar algorithms provide useful realworld grounding into the types of operations needed, making it easy to demonstrate how they would be Vectorised.
The basic principle behind Knuth's algorithms is to break the problem down into a single scalar op against a Vector operand. This fits naturally with a Scalable Vector ISA such as SVP64. It only remains to exploit Carry (1bit and 64bit) in a Scalable Vector context and the picture is complete.
Links
 http://www.intel.com/content/dam/www/public/us/en/documents/whitepapers/ialargeintegerarithmeticpaper.pdf
 https://lists.libresoc.org/pipermail/libresocdev/2022April/004700.html
 https://news.ycombinator.com/item?id=21151646
 https://twitter.com/lkcl/status/1517169267912986624
 https://www.youtube.com/watch?v=8hrIG7E77o
 https://www.reddit.com/r/OpenPOWER/comments/u8r4vf/draft_svp64_biginteger_vector_arithmetic_for_the/
 https://bugs.libresoc.org/show_bug.cgi?id=817
Vector Add and Subtract
Surprisingly, no new additional instructions are required to perform
a straightforward biginteger add or subtract. Vectorised adde
or addex
is perfectly sufficient to produce arbitrarylength
biginteger add due to the rules set in SVP64 that all Vector Operations
are directly equivalent to the strict Program Order Execution of
their elementlevel operations. Assuming that the two bigints (or
a part thereof) have been loaded into sequentiallycontiguous
registers, with the leastsignificant bits being in the lowestnumbered
register in each case:
R0,CA = A0+B0+CA adde r0,a0,b0

++

R1,CA = A1+B1+CA adde r1,a1,b1

++

R2,CA = A2+B2+CA adde r2,a2,b2
This pattern  sequential execution of individual instructions
with incrementing register numbers  is precisely the very definition
of how SVP64 works!
Thus, due to sequential execution of adde
both consuming and producing
a CA Flag, with no additions to SVP64 or to the v3.0 Power ISA,
sv.adde
is in effect an alias for BigInteger Vectorised add. As such,
implementors are entirely at liberty to recognise HorizontalFirst Vector
adds and send the vector of registers to a much larger and wider backend
ALU, and shortcut the intermediate storage of XER.CA on an element
level in backend hardware that need only:
 read the first incoming XER.CA
 implement a large Vectoraware carry propagation algorithm
 store the very last XER.CA in the batch
The size and implementation of the underlying backend SIMD ALU is entirely at the discretion of the implementer, as is whether to deploy the above strategy. The only hard requirement for implementors of SVP64 is to comply with strict and precise Program Order even at the Element level.
If there is pressure on the register file (or multimilliondigit big integers) then a partialsum may be carried out with LD and ST in a standard Craystyle Vector Loop:
aptr = A address
bptr = B address
rptr = Result address
li r0, 0 # used to help clear CA
addic r0, r0, 0 # CA to zero as well
setmvli 8 # set MAXVL to 8
loop:
setvl t0, n # n is the number of digits
mulli t1, t0, 8 # 8 bytes per digit/element
sv.ldu a0, aptr, t1 # update advances pointer
sv.ldu b0, bptr, t1 # likewise
sv.adde r0, a0, b0 # takes in CA, updates CA
sv.stu rptr, r0, t1 # pointer advances too
sub. n, n, t0 # should not alter CA
bnz loop # do more digits
This is not that different from a Scalar BigInt add, it is just that like all Craystyle Vectorisation, a variable number of elements are covered by one instruction. Of interest to people unfamiliar with Craystyle Vectors: if VL is not permitted to exceed 1 (because MAXVL is set to 1) then the above actually becomes a Scalar BigInt add algorithm.
Vector Shift
Like add and subtract, strictly speaking these need no new instructions. Keeping the shift amount within the range of the element (64 bit) a Vector bitshift may be synthesised from a pair of shift operations and an OR, all of which are standard Scalar Power ISA instructions that when Vectorised are exactly what is needed.
void bigrsh(unsigned s, uint64_t r[], uint64_t un[], int n) {
for (int i = 0; i < n  1; i++)
r[i] = (un[i] >> s)  (un[i + 1] << (64  s));
r[n  1] = un[n  1] >> s;
}
With SVP64 being on top of the standard scalar regfile the offset by
one of the elements may be achieved simply by referencing the same
vector data offset by one. Given that all three instructions
(srd
, sld
, or
) are an SVP64 type RM1P2S1D
and are EXTRA3
,
it is possible to reference the full 128 64bit registers (r0r127):
subfic t1, t0, 64 # compute 64s (s in t0)
sv.srd r8.v, r24.v, t0 # shift each element of r24.v up by s
sv.sld r16.v, r25.v, t1 # offset start of vector by one (r25)
sv.or r8.v, r8.v, r16.v # OR two parts together
Predication with zeroing may be utilised on sld to ensure that the last element is zero, avoiding overrun.
The reason why three instructions are needed instead of one in the
case of bigadd is because multiple bits chain through to the
next element, where for add it is a single bit (carryin, carryout),
and this is precisely what adde
already does.
For multiply and divide as shown later it is worthwhile to use
one scalar register effectively as a full 64bit carry/chain
but in the case of shift, an OR may glue things together, easily,
and in parallel, because unlike sv.adde
, downchain
carrypropagation through multiple elements does not occur.
With Scalar shift and rotate operations in the Power ISA already being complex and very comprehensive, it is hard to justify creating complex 3in 2out variants when a sequence of 3 simple instructions will suffice. However it is reasonably justifiable to have a 3in 1out instruction with an implicit source, based around the inner operation:
# r[i] = (un[i] >> s)  (un[i + 1] << (64  s));
t < ROT128(RA  RA1, RB[58:63])
RT < t[64:127]
``
RA1 is implicitly (or explicitly, RC) greater than RA by one
scalar register number, and like the other operations below,
a 128/64 shift is performed, truncating to take the lower
64 bits. By taking a Vector source RA and assuming lowernumbered
registers are lowersignificant digits in the biginteger operation
the entire biginteger source may be shifted by a scalar.
For larger shift amounts beyond an element bitwidth standard register move
operations may be used, or, if the shift amount is static,
to reference an alternate starting point in
the registers containing the Vector elements
because SVP64 sits on top of a standard Scalar register file.
`sv.sld r16.v, r26.v, t1` for example is equivalent to shifting
by an extra 64 bits, compared to `sv.sld r16.v, r25.v, t1`.
# Vector Multiply
Longmultiply, assuming an O(N^2) algorithm, is performed by summing
NxN separate smaller multiplications together. Karatsuba's algorithm
reduces the number of small multiplies at the expense of increasing
the number of additions. Some algorithms follow the Vedic Multiply
pattern by grouping together all multiplies of the same magnitude/power
(same column) whilst others perform rowbased multiplication: a single
digit of B multiplies the entirety of A, summed a row at a time.
A Rowbased
algorithm is the basis of the analysis below (Knuth's Algorithm M).
Multiply is tricky: 64 bit operands actually produce a 128bit result,
which clearly cannot fit into an orthogonal register file.
Most Scalar RISC ISAs have separate `mullowhalf` and `mulhihalf`
instructions, whilst some (OpenRISC) have "Accumulators" from which
the results of the multiply must be explicitly extracted. High
performance RISC advocates
recommend "macroop fusion" which is in effect where the second instruction
gains access to the cached copy of the HI half of the
multiply result, which had already been
computed by the first. This approach quickly complicates the internal
microarchitecture, especially at the decode phase.
Instead, Intel, in 2012, specifically added a `mulx` instruction, allowing
both HI and LO halves of the multiply to reach registers with a single
instruction. If however done as a
multiplyandaccumulate this becomes quite an expensive operation:
(3 64Bit in, 2 64bit registers out).
Longmultiplication may be performed a row at a time, starting
with B0:
C4 C3 C2 C1 C0
A0xB0
A1xB0
A2xB0
A3xB0
R4 R3 R2 R1 R0
* R0 contains C0 plus the LO half of A0 times B0
* R1 contains C1 plus the LO half of A1 times B0
plus the HI half of A0 times B0.
* R2 contains C2 plus the LO half of A2 times B0
plus the HI half of A1 times B0.
This would on the face of it be a 4in operation:
the upper half of a previous multiply, two new operands
to multiply, and an additional accumulator (C). However if
C is left out (and added afterwards with a VectorAdd)
things become more manageable.
Demonstrating in c, a Rowbased multiply using a temporary vector.
Adapted from a simple implementation
of Knuth M: <https://git.libresoc.org/?p=libreriscv.git;a=blob;f=openpower/sv/bitmanip/mulmnu.c;hb=HEAD>
// this becomes the basis for sv.maddedu in RS=RC Mode,
// where k is RC. k takes the upper half of product
// and adds it in on the next iteration
k = 0;
for (i = 0; i < m; i++) {
unsigned product = u[i]*v[j] + k;
k = product>>16;
plo[i] = product; // & 0xffff
}
// this is simply sv.adde where k is XER.CA
k = 0;
for (i = 0; i < m; i++) {
t = plo[i] + w[i + j] + k;
w[i + j] = t; // (I.e., t & 0xFFFF).
k = t >> 16; // carry: should only be 1 bit
}
We therefore propose an operation that is 3in, 2out,
that, noting that the connection between successive
muladds has the UPPER half of the previous operation
as its input, writes the UPPER half of the current
product into a second output register for exactly the
purpose of letting it be added onto the next BigInt digit.
product = RA*RB+RC
RT = lowerhalf(product)
RC = upperhalf(product)
HorizontalFirst Mode therefore may be applied to just this one
instruction.
Successive sequential iterations effectively use RC as a kind of
64bit carry, and
as noted by Intel in their notes on mulx,
`RA*RB+RC+RD` cannot overflow, so does not require
setting an additional CA flag. We first cover the chain of
`RA*RB+RC` as follows:
RT0, RC0 = RA0 * RB0 + 0

++

RT1, RC1 = RA1 * RB1 + RC0

++

RT2, RC2 = RA2 * RB2 + RC1
Following up to add each partiallycomputed row to what will become
the final result is achieved with a Vectorised bigint
`sv.adde`. Thus, the key inner loop of
Knuth's Algorithm M may be achieved in four instructions, two of
which are scalar initialisation:
li r16, 0 # zero accumulator
addic r16, r16, 0 # CA to zero as well
sv.madde r0.v, r8.v, r17, r16 # mul vector
sv.adde r24.v, r24.v, r0.v # bigadd row to result
Normally, in a Scalar ISA, the use of a register as both a source
and destination like this would create costly Dependency Hazards, so
such an instruction would never be proposed. However: it turns out
that, just as with repeated chained application of `adde`, macroop
fusion may be internally applied to a sequence of these strange multiply
operations. (*Such a trick works equally as well in a Scalaronly
OutofOrder microarchitecture, although the conditions are harder to
detect*).
**Application of SVP64**
SVP64 has the means to mark registers as scalar or vector. However
the available space in the prefix is extremely limited (9 bits).
With effectively 5 operands (3 in, 2 out) some compromises are needed.
A little thought gives a useful workaround: two modes,
controlled by a single bit in `RM.EXTRA`, determine whether the 5th
register is set to RC or whether to RT+MAXVL. This then leaves only
4 registers to qualify as scalar/vector, which can use four
EXTRA2 designators and fits into the available 9bit space.
RS=RT+MAXVL Mode:
product = RA*RB+RC
RT = lowerhalf(product)
RS=RT+MAXVL = upperhalf(product)
and RS=RC Mode:
product = RA*RB+RC
RT = lowerhalf(product)
RS=RC = upperhalf(product)
Now there is much more potential, including setting RC to a Scalar,
which would be useful as a 64 bit Carry. RC as a Vector would produce
a Vector of the HI halves of a Vector of multiplies. RS=RT+MAXVL Mode
would allow that same Vector of HI halves to not be an overwrite of RC.
Also it is possible to specify that any of RA, RB or RC are scalar or
vector. Overall it is extremely powerful.
# Vector Divide
The simplest implementation of bigint divide is the standard schoolbook
"Long Division", set with RADIX 64 instead of Base 10. Donald Knuth's
Algorithm D performs estimates which, if wrong, are compensated for
afterwards. Essentially there are three phases:
* Calculation of the quotient estimate. This uses a single
Scalar divide, which is covered separately in a later section
* Big Integer multiply and subtract.
* CarryCorrection with a big integer add, if the estimate from
phase 1 was wrong by one digit.
From Knuth's Algorithm D, implemented in
[divmnu64.c](https://git.libresoc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/divmnu64.c;hb=HEAD),
Phase 2 is expressed in c, as:
// Multiply and subtract.
k = 0;
for (i = 0; i < n; i++) {
p = qhat*vn[i]; // 64bit product
t = un[i+j]  k  (p & 0xFFFFFFFFLL);
un[i+j] = t;
k = (p >> 32)  (t >> 32);
}
Where analysis of this algorithm, if a temporary vector is acceptable,
shows that it can be split into two in exactly the same way as Algorithm M,
this time using subtract instead of add.
uint32_t carry = 0;
// this is just sv.maddedu again
for (int i = 0; i <= n; i++) {
uint64_t value = (uint64_t)vn[i] * (uint64_t)qhat + carry;
carry = (uint32_t)(value >> 32); // upper half for next loop
product[i] = (uint32_t)value; // lower into vector
}
bool ca = true;
// this is simply sv.subfe where ca is XER.CA
for (int i = 0; i <= n; i++) {
uint64_t value = (uint64_t)~product[i] + (uint64_t)un_j[i] + ca;
ca = value >> 32 != 0;
un_j[i] = value;
}
bool need_fixup = !ca; // for phase 3 correction
In essence then the primary focus of Vectorised BigInt divide is in
fact biginteger multiply
Detection of the fixup (phase 3) is determined by the Carry (borrow)
bit at the end. Logically: if borrow was required then the qhat estimate
was too large and the correction is required, which is, again,
nothing more than a Vectorised biginteger add (one instruction).
However this is not the full story
**128/64bit divisor**
As mentioned above, the first part of the Knuth Algorithm D involves
computing an estimate for the divisor. This involves using the three
most significant digits, performing a scalar divide, and consequently
requires a scalar division with *twice* the number of bits of the
size of individual digits (for example, a 64bit array). In this
example taken from
[divmnu64.c](https://git.libresoc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/divmnu64.c;hb=HEAD)
the digits are 32 bit and, specialcasing the overflow, a 64/32 divide is sufficient (64bit dividend, 32bit divisor):
// Compute estimate qhat of q[j] from top 2 digits
uint64_t dig2 = ((uint64_t)un[j + n] << 32)  un[j + n  1];
if (un[j+n] >= vn[n1]) {
// rhat can be bigger than 32bit when the division overflows
qhat = UINT32_MAX;
rhat = dig2  (uint64_t)UINT32_MAX * vn[n  1];
} else {
qhat = dig2 / vn[n  1]; // 64/32 divide
rhat = dig2 % vn[n  1]; // 64/32 modulo
}
// use 3rdfromtop digit to obtain better accuracy
b = 1UL<<32;
while (rhat < b  qhat * vn[n  2] > b * rhat + un[j + n  2]) {
qhat = qhat  1;
rhat = rhat + vn[n  1];
}
However when moving to 64bit digits (desirable because the algorithm
is `O(N^2)`) this in turn means that the estimate has to be computed
from a *128* bit dividend and a 64bit divisor. Such an operation
simply does not exist in most Scalar 64bit ISAs. Although Power ISA
comes close with `divdeu`, by placing one operand in the upper half
of a 128bit dividend, the lower half is zero. Again Power ISA
has a Packed SIMD instruction `vdivuq` which is a 128/128
(quad) divide, not a 128/64, and its use would require considerable
effort to move registers to and from GPRs. Some investigation into
softimplementations of 128/128 or 128/64 divide show it to be typically
implemented bitwise, with all that implies.
The irony is, therefore, that attempting to
improve biginteger divide by moving to 64bit digits in order to take
advantage of the efficiency of 64bit scalar multiply when Vectorised
would instead
lock up CPU time performing a 128/64 scalar division. With the Vector
Multiply operations being critically dependent on that `qhat` estimate, and
because that scalar is as an input into each of the vector digit
multiples, as a Dependency Hazard it would cause *all* Parallel
SIMD Multiply backends to sit 100% idle, waiting for that one scalar value.
Whilst one solution is to reduce the digit width to 32bit in order to
go back to 64/32 divide, this increases the completion time by a factor
of 4 due to the algorithm being `O(N^2)`.
**Reducing completion time of 128/64bit Scalar division**
Scalar division is a known computer science problem because, as even the
BigInt Divide shows, it requires looping around a multiply (or, if reduced
to 1bit per loop, a simple compare, shift, and subtract). If the simplest
approach were deployed then the completion time for the 128/64 scalar
divide would be a whopping 128 cycles. To be workable an alternative
algorithm is required, and one of the fastest appears to be
Goldschmidt Division. Whilst typically deployed for Floating Point,
there is no reason why it should not be adapted to Fixed Point.
In this way a Scalar Integer divide can be performed in the same
timeorder as NewtonRaphson, using two hardware multipliers
and a subtract.
**Back to Vector carrylooping**
There is however another reason for having a 128/64 division
instruction, and it's effectively the reverse of `maddedu`.
Look closely at Algorithm D when the divisor is only a scalar
(`v[0]`):
k = 0; // the case of a
for (j = m  1; j >= 0; j)
{ // singledigit
uint64_t dig2 = ((k << 32)  u[j]);
q[j] = dig2 / v[0]; // divisor here.
k = dig2 % v[0]; // modulo back into next loop
}
```
Here, just as with maddedu
which can put the hihalf of the 128 bit product
back in as a form of 64bit carry, a scalar divisor of a vector dividend
puts the modulo back in as the hihalf of a 128/64bit divide.
RT0 = (( 0<<64)  RA0) / RB0
RC0 = (( 0<<64)  RA0) % RB0

++

RT1 = ((RC0<<64)  RA1) / RB1
RC1 = ((RC0<<64)  RA1) % RB1

++

RT2 = ((RC1<<64)  RA2) / RB2
RC2 = ((RC1<<64)  RA2) % RB2
By a nice coincidence this is exactly the same 128/64bit operation
needed (once, rather than chained) for the qhat
estimate if it may
produce both the quotient and the remainder.
The pseudocode cleanly covering both scenarios (leaving out
overflow for clarity) can be written as:
divmod2du RT,RA,RB,RC
dividend = (RC)  (RA)
divisor = EXTZ128(RB)
RT = UDIV(dividend, divisor)
RS = UREM(dividend, divisor)
Again, in an SVP64 context, using EXTRA mode bit 8 allows for
selecting whether RS=RC
or
RS=RT+MAXVL
. Similar flexibility in the scalarvector settings
allows the instruction to perform full parallel vector div/mod,
or act in loopback mode for bigint division by a scalar,
or for a single scalar 128/64 div/mod.
Again, just as with sv.maddedu
and sv.adde
, adventurous implementors
may perform massivelywide DIV/MOD by transparently merging (fusing)
the Vector element operations together, only inputting a single RC and
outputting the last RC. Where efficient algorithms such as Goldschmidt
are deployed internally this could dramatically reduce the cycle completion
time for massive Vector DIV/MOD. Thus, just as with the other operations
the apparent limitation of creating chains is overcome: SVP64 is,
by design, an "expression of intent" where the implementor is free to
achieve that intent in any way they see fit
as long as strict preciseaware Program Order is
preserved (even on the VL forloops).
Just as with divdeu
on which this instruction is based an overflow
detection is required. When the divisor is too small compared to
the dividend then the result may not fit into 64 bit. Knuth's
original algorithm detects overflow and manually places 0xffffffff
(all ones) into qhat
. With there being so many operands already
in divmod2du
a cmpl
instruction can be used instead to detect
the overflow. This saves having to add an Rc=1 or OE=1 mode when
the available space in VAForm EXT04 is extremely limited.
Looking closely at the loop however we can see that overflow
will not occur. The initial value k is zero: as long as a dividebyzero
is not requested this always fulfils the condition RC < RA
, and on
subsequent iterations the new k, being the modulo, is always less than the
divisor as well. Thus the condition (the loop invariant) RC < RA
is preserved, as long as RC starts at zero.
Limitations
One of the worst things for any ISA is that an algorithm's completion time is directly affected by different implementations having instructions take longer or shorter times. Knuth's BigInteger division is unfortunately one such algorithm.
Assuming that the computation of qhat takes 128 cycles to complete on a small powerefficient embedded design, this time would dominate compared to the 64 bit multiplications. However if the element width was reduced to 8, such that the computation of qhat only took 16 cycles, the calculation of qhat would not dominate, but the number of multiplications would rise: somewhere in between there would be an elwidth and a Vector Length that would suit that particular embedded processor.
By contrast a high performance microarchitecture may deploy Goldschmidt or other efficient Scalar Division, which could complete 128/64 qhat computation in say only 5 to 8 cycles, which would be tolerable. Thus, for generalpurpose software, it would be necessary to ship multiple implementations of the same algorithm and dynamically select the best one.
The very fact that programmers even have to consider multiple implementations and compare their performance is an unavoidable nuisance. SVP64 is supposed to be designed such that only one implementation of any given algorithm is needed. In some ways it is reassuring that some algorithms just don't fit. Slightly more reassuring is that Goldschmidt Divide, which uses two multiplications that can be performed in parallel, would be a much better fit with SVP64 (and Vector Processing in general), the only downside being that it is regarded as worthwhile for much larger integers.
Conclusion
TODO