Analysis

DRAFT SVP64

  • Revision 0.0: 21apr2022 https://www.youtube.com/watch?v=8hrIG7-E77o
  • Revision 0.01: 22apr2022 removal of msubed because sv.maddedu and sv.subfe work
  • Revision 0.02: 22apr2022 128/64 scalar divide, investigate Goldschmidt
  • Revision 0.03: 24apr2022 add 128/64 divmod2du, similar loop to maddedu
  • Revision 0.04: 26apr2022 Knuth original uses overflow on scalar div
  • Revision 0.05: 27apr2022 add vector shift section (no new instructions)

Introduction

This page covers an analysis of big integer operations, to work out optimal Scalar Instructions to propose for submission to the OpenPOWER ISA WG, which, when combined with Draft SVP64, give high-performance compact Big Integer Vector Arithmetic. Leverage of existing Scalar Power ISA instructions is also explained.

Use of smaller sub-operations is a given: worst-case in a Scalar context, addition is O(N) whilst multiply and divide are O(N^2), and their Vectorization would reduce those (for small N) to O(1) and O(N). Knuth's big-integer scalar algorithms provide useful real-world grounding into the types of operations needed, making it easy to demonstrate how they would be Vectorized.

The basic principle behind Knuth's algorithms is to break the problem down into a single scalar op against a Vector operand. This fits naturally with a Scalable Vector ISA such as SVP64. It only remains to exploit Carry (1-bit and 64-bit) in a Scalable Vector context and the picture is complete.

Links

Vector Add and Subtract

Surprisingly, no new additional instructions are required to perform a straightforward big-integer add or subtract. Vectorized adde or addex is perfectly sufficient to produce arbitrary-length big-integer add due to the rules set in SVP64 that all Vector Operations are directly equivalent to the strict Program Order Execution of their element-level operations. Assuming that the two bigints (or a part thereof) have been loaded into sequentially-contiguous registers, with the least-significant bits being in the lowest-numbered register in each case:

R0,CA = A0+B0+CA  adde r0,a0,b0
    |
    +----------+
               |
R1,CA = A1+B1+CA  adde r1,a1,b1
    |
    +----------+
               |
R2,CA = A2+B2+CA  adde r2,a2,b2

This pattern - sequential execution of individual instructions with incrementing register numbers - is precisely the definition of how SVP64 works! Thus, due to sequential execution of adde both consuming and producing a CA Flag, with no additions to SVP64 or to the v3.0 Power ISA, sv.adde is in effect an alias for Big-Integer Vectorized add. As such, implementors are entirely at liberty to recognise Horizontal-First Vector adds, send the whole vector of registers to a much larger and wider back-end ALU, and short-cut the intermediate element-level storage of XER.CA in back-end hardware that need only:

  • read the first incoming XER.CA
  • implement a large Vector-aware carry propagation algorithm
  • store the very last XER.CA in the batch

The size and implementation of the underlying back-end SIMD ALU is entirely at the discretion of the implementer, as is whether to deploy the above strategy. The only hard requirement for implementors of SVP64 is to comply with strict and precise Program Order even at the Element level.
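
To make the chaining concrete, below is a minimal C sketch of what a chain of element-level adde operations computes. The function name and signature are illustrative only, and unsigned __int128 (a GCC/Clang extension) stands in for the 65-bit adder; this is not official pseudocode.

#include <stdbool.h>
#include <stdint.h>

// Each element consumes and produces XER.CA: exactly a big-integer add.
bool sv_adde_model(uint64_t rt[], const uint64_t ra[], const uint64_t rb[],
                   bool ca, int vl) {
    for (int i = 0; i < vl; i++) {
        unsigned __int128 sum = (unsigned __int128)ra[i] + rb[i] + ca;
        rt[i] = (uint64_t)sum;      // element result
        ca = (bool)(sum >> 64);     // carry propagates into the next element
    }
    return ca;                      // final XER.CA for the whole Vector
}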

If there is pressure on the register file (or multi-million-digit big integers) then a partial-sum may be carried out with LD and ST in a standard Cray-style Vector Loop:

  aptr = A address
  bptr = B address
  rptr = Result address
  li r0, 0        # used to help clear CA
  addic r0, r0, 0 # CA to zero as well
  setmvli 8       # set MAXVL to 8
loop:
  setvl t0, n         # n is the number of digits
  mulli t1, t0, 8     # 8 bytes per digit/element
  sv.ldu a0, aptr, t1 # update advances pointer
  sv.ldu b0, bptr, t1 # likewise
  sv.adde r0, a0, b0  # takes in CA, updates CA
  sv.stu rptr, r0, t1 # pointer advances too
  sub. n, n, t0       # should not alter CA
  bnz loop            # do more digits

This is not that different from a Scalar Big-Int add; it is just that, as with all Cray-style Vectorization, a variable number of elements are covered by one instruction. Of interest to people unfamiliar with Cray-style Vectors: if VL is not permitted to exceed 1 (because MAXVL is set to 1) then the above actually becomes a Scalar Big-Int add algorithm.
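
A hedged C equivalent of the strip-mined loop above may make the Cray-style structure clearer. The chunk size MAXVL=8, the function name and the use of unsigned __int128 are illustrative assumptions, not part of any specification.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

void bigadd_strip_mined(uint64_t *r, const uint64_t *a, const uint64_t *b,
                        size_t n) {
    const size_t MAXVL = 8;                // setmvli 8
    bool ca = false;                       // addic r0,r0,0 clears CA
    while (n > 0) {
        size_t vl = n < MAXVL ? n : MAXVL; // setvl t0, n
        for (size_t i = 0; i < vl; i++) {  // sv.adde element loop
            unsigned __int128 sum = (unsigned __int128)a[i] + b[i] + ca;
            r[i] = (uint64_t)sum;
            ca = (bool)(sum >> 64);
        }
        a += vl; b += vl; r += vl;         // sv.ldu / sv.stu pointer updates
        n -= vl;                           // sub. n, n, t0
    }
}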

Vector Multiply

Long-multiply, assuming an O(N^2) algorithm, is performed by summing NxN separate smaller multiplications together. Karatsuba's algorithm reduces the number of small multiplies at the expense of increasing the number of additions. Some algorithms follow the Vedic Multiply pattern by grouping together all multiplies of the same magnitude/power (same column) whilst others perform row-based multiplication: a single digit of B multiplies the entirety of A, summed a row at a time. A Row-based algorithm is the basis of the analysis below (Knuth's Algorithm M).

Multiply is tricky: 64-bit operands actually produce a 128-bit result, which clearly cannot fit into an orthogonal register file. Most Scalar RISC ISAs have separate mul-low-half and mul-hi-half instructions, whilst some (OpenRISC) have "Accumulators" from which the results of the multiply must be explicitly extracted. High-performance RISC advocates recommend "macro-op fusion", in which the second instruction gains access to a cached copy of the HI half of the multiply result that had already been computed by the first. This approach quickly complicates the internal microarchitecture, especially at the decode phase.

Instead, Intel, in 2012, specifically added a mulx instruction, allowing both HI and LO halves of the multiply to reach registers with a single instruction. If however done as a multiply-and-accumulate this becomes quite an expensive operation: 3 64-bit registers in, 2 64-bit registers out.

Long-multiplication may be performed a row at a time, starting with B0:

C4 C3 C2 C1 C0
         A0xB0
      A1xB0
   A2xB0
A3xB0
R4 R3 R2 R1 R0
  • R0 contains C0 plus the LO half of A0 times B0.
  • R1 contains C1 plus the LO half of A1 times B0 plus the HI half of A0 times B0.
  • R2 contains C2 plus the LO half of A2 times B0 plus the HI half of A1 times B0.

This would on the face of it be a 4-in operation: the upper half of a previous multiply, two new operands to multiply, and an additional accumulator (C). However if C is left out (and added afterwards with a Vector-Add) things become more manageable.

Demonstrating in C, a Row-based multiply using a temporary vector, adapted from a simple implementation of Knuth's Algorithm M: https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/bitmanip/mulmnu.c;hb=HEAD

      // this becomes the basis for sv.maddedu in RS=RC Mode,
      // where k is RC. k takes the upper half of product
      // and adds it in on the next iteration
      k = 0;
      for (i = 0; i < m; i++) {
         unsigned product = u[i]*v[j] + k;
         k = product>>16;
         plo[i] = product; // & 0xffff
      }
      // this is simply sv.adde where k is XER.CA
      k = 0;
      for (i = 0; i < m; i++) {
         t = plo[i] + w[i + j] + k;
         w[i + j] = t;          // (I.e., t & 0xFFFF).
         k = t >> 16; // carry: should only be 1 bit
      }

We therefore propose an operation that is 3-in, 2-out. Noting that the connection between successive mul-adds is that the UPPER half of the previous operation becomes an input, the proposed instruction writes the UPPER half of the current product into a second output register, for exactly the purpose of letting it be added onto the next BigInt digit.

product = RA*RB+RC
RT = lowerhalf(product)
RC = upperhalf(product)

Horizontal-First Mode may therefore be applied to just this one instruction. Successive sequential iterations effectively use RC as a kind of 64-bit carry, and, as noted by Intel in their notes on mulx, RA*RB+RC+RD cannot overflow (the maximum possible value is (2^64-1)*(2^64-1) + 2*(2^64-1) = 2^128-1, which still fits in 128 bits), so no additional CA flag needs to be set. We first cover the chain of RA*RB+RC as follows:

RT0, RC0 = RA0 * RB0 + 0
      |
      +----------------+
                       |
RT1, RC1 = RA1 * RB1 + RC0
      |
      +----------------+
                       |
RT2, RC2 = RA2 * RB2 + RC1

The follow-up, adding each partially-computed row to what will become the final result, is achieved with a Vectorized big-int sv.adde. Thus, the key inner loop of Knuth's Algorithm M may be achieved in four instructions, two of which are scalar initialisation:

li r16, 0                     # zero accumulator
addic r16, r16, 0             # CA to zero as well
sv.maddedu *r0, *r8, r17, r16 # mul vector
sv.adde *r24, *r24, *r0   # big-add row to result

Normally, in a Scalar ISA, the use of a register as both a source and destination like this would create costly Dependency Hazards, so such an instruction would never be proposed. However it turns out that, just as with repeated chained application of adde, macro-op fusion may be internally applied to a sequence of these unusual multiply operations. (Such a trick works equally well in a Scalar-only Out-of-Order microarchitecture, although the conditions are harder to detect.)

Application of SVP64

SVP64 has the means to mark registers as scalar or vector. However the available space in the prefix is extremely limited (9 bits). With effectively 5 operands (3 in, 2 out) some compromises are needed. A little thought gives a useful workaround: two modes, controlled by a single bit in RM.EXTRA, determine whether the 5th register is RC or RT+MAXVL. This then leaves only 4 registers to qualify as scalar/vector, which can use four EXTRA2 designators and fits into the available 9-bit space.

RS=RT+MAXVL Mode:

product = RA*RB+RC
RT = lowerhalf(product)
RS=RT+MAXVL = upperhalf(product)

and RS=RC Mode:

product = RA*RB+RC
RT = lowerhalf(product)
RS=RC = upperhalf(product)

Now there is much more potential, including setting RC to a Scalar, which would be useful as a 64-bit carry. RC as a Vector would produce a Vector of the HI halves of a Vector of multiplies. RS=RT+MAXVL Mode would allow that same Vector of HI halves to not be an overwrite of RC. Also it is possible to specify that any of RA, RB or RC are scalar or vector. Overall it is extremely powerful.
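
As an illustration, the element-level effect of sv.maddedu in RS=RC Mode with a Vector RA and Scalar RB and RC might be modelled in C as follows. This is an assumption about behaviour for clarity, not the official pseudocode; the function name and the use of unsigned __int128 are illustrative only.

#include <stdint.h>

// RC acts as a 64-bit carry chained through the elements.
void sv_maddedu_rs_rc(uint64_t rt[], const uint64_t ra[],
                      uint64_t rb, uint64_t *rc, int vl) {
    uint64_t carry = *rc;
    for (int i = 0; i < vl; i++) {
        // 128-bit product plus the incoming 64-bit carry: cannot overflow
        unsigned __int128 p = (unsigned __int128)ra[i] * rb + carry;
        rt[i] = (uint64_t)p;          // RT element gets the lower half
        carry = (uint64_t)(p >> 64);  // upper half becomes the next RC
    }
    *rc = carry;                      // final 64-bit carry left in RC
}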

Vector Shift

The decision here was made to follow the same principle as for multiply-and-accumulate. Whilst Intel has shrd which is very similar (but only 2-in 2-out), the decision was made to create a 3-in 2-out instruction that effectively uses one of the inputs and one of the outputs as a "sort of 64-bit carry".

Without the dsrd instruction, it is necessary to have three instructions in an inner loop. Keeping the shift amount within the range of the element (64 bit), a Vector bit-shift may be synthesised from a pair of shift operations and an OR, all of which are standard Scalar Power ISA instructions that, when Vectorized, are exactly what is needed, but this critically requires Vertical-First Mode.

void bigrsh(unsigned s, uint64_t r[], uint64_t un[], int n) {
    for (int i = 0; i < n - 1; i++)
        r[i] = (un[i] >> s) | (un[i + 1] << (64 - s));
    r[n - 1] = un[n - 1] >> s;
}

With SVP64 being on top of the standard scalar regfile the offset by one of the elements may be achieved simply by referencing the same vector data offset by one. Given that all three instructions (srd, sld, or) are an SVP64 type RM-1P-2S1D and are EXTRA3, it is possible to reference the full 128 64-bit registers (r0-r127):

subfic t1, t0, 64     # compute 64-s (s in t0)
sv.srd *r8, *r24, t0  # shift each element of r24 vector right by s
sv.sld *r16, *r25, t1 # offset start of vector by one (r25)
sv.or  *r8, *r8, *r16 # OR two parts together

Predication with zeroing may be utilised on sld to ensure that the last element is zero, avoiding over-run.

The reason why three instructions are needed here, where big-int add needs only one, is that for a shift multiple bits chain through to the next element, whereas for add it is a single bit (carry-in, carry-out), which is precisely what adde already provides. For multiply, divide and shift it is worthwhile to use one scalar register effectively as a full 64-bit carry/chain.

The limitations of this approach therefore become pretty clear: not only must Vertical-First Mode be used but also the predication with zeroing trick. Worse than that, an entire temporary vector is required which wastes register space. A better way would be to create a single scalar instruction that can do the long-shift in-place.

The basic principle of the 3-in 2-out dsrd is:

    # r[i] = (un[i] >> s) | (un[i + 1] << (64 - s));
    temp <- ROT128(RA || RC, RB[58:63])
    RT <- temp[64:127]
    RS <- temp[0:63]

A 128-bit shift is performed, taking the lower half into RS and the upper half into RT. However there is a trick that may be applied, which only requires ROT64:

    n <- (RB)[58:63]
    v <- ROTL64((RA), 64-n)
    mask <- MASK(n, 63)
    RT <- (v[0:63] & mask) | ((RC) & ¬mask)
    RS <- v[0:63] & ¬mask

The trick here is that the entirety of RA is rotated, then parts of it are masked into the destinations. RC, if also properly masked, can be ORed into RT, as long as the bits of RC are in the right place. The really interesting part is that, when Vectorised, the upper bits (now in RS) are already in the right bit-positions to be ORed into the next dsrd operation. This allows us to create a chain of sv.dsrd operations, and a single instruction replaces all four above:

sv.dsrd *r8, *r24, t1, t0
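
A hedged C model of that chain (illustrative only, not the official pseudocode; the element ordering and function name are assumptions) shows how RC carries the masked-off upper bits of each element into the next, per the ROT64 pseudocode above.

#include <stdint.h>

void sv_dsrd_chain(uint64_t rt[], const uint64_t ra[],
                   uint64_t rb, uint64_t *rc, int vl) {
    unsigned n = rb & 63;                      // shift amount, RB[58:63]
    uint64_t chain = *rc;                      // incoming "64-bit carry"
    for (int i = 0; i < vl; i++) {
        // rotate right by n (== ROTL64 by 64-n); guard the n == 0 case
        uint64_t v = n ? (ra[i] >> n) | (ra[i] << (64 - n)) : ra[i];
        uint64_t mask = n ? ~(uint64_t)0 >> n : ~(uint64_t)0;
        rt[i] = (v & mask) | (chain & ~mask);  // low part | previous high part
        chain = v & ~mask;                     // high part becomes next RC
    }
    *rc = chain;
}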

For larger shift amounts beyond an element bitwidth, standard register move operations may be used, or, if the shift amount is static, an alternate starting point in the registers containing the Vector elements may be referenced, because SVP64 sits on top of a standard Scalar register file. sv.sld *r16, *r26, t1 for example is equivalent to shifting by an extra 64 bits, compared to sv.sld *r16, *r25, t1.

Vector Divide

The simplest implementation of big-int divide is the standard schoolbook "Long Division", using radix 2^64 instead of base 10. Donald Knuth's Algorithm D performs estimates which, if wrong, are compensated for afterwards. Essentially there are three phases:

  • Calculation of the quotient estimate. This uses a single Scalar divide, which is covered separately in a later section.
  • Big Integer multiply and subtract.
  • Carry-Correction with a big integer add, if the estimate from phase 1 was wrong by one digit.

From Knuth's Algorithm D, implemented in divmnu64.c, Phase 2 is expressed in C as:

      // Multiply and subtract.
      k = 0;
      for (i = 0; i < n; i++) {
         p = qhat*vn[i]; // 64-bit product
         t = un[i+j] - k - (p & 0xFFFFFFFFLL);
         un[i+j] = t;
         k = (p >> 32) - (t >> 32);
      }

Analysis of this algorithm shows that, if a temporary vector is acceptable, it can be split into two loops in exactly the same way as Algorithm M, this time using subtract instead of add.

        uint32_t carry = 0;
        // this is just sv.maddedu again
        for (int i = 0; i <= n; i++) {
            uint64_t value = (uint64_t)vn[i] * (uint64_t)qhat + carry;
            carry = (uint32_t)(value >> 32); // upper half for next loop
            product[i] = (uint32_t)value;    // lower into vector
        }
        bool ca = true;
        // this is simply sv.subfe where ca is XER.CA
        for (int i = 0; i <= n; i++) {
            uint64_t value = (uint64_t)~product[i] + (uint64_t)un_j[i] + ca;
            ca = value >> 32 != 0;
            un_j[i] = value;
        }
        bool need_fixup = !ca; // for phase 3 correction

In essence, then, the primary focus of Vectorized Big-Int divide is in fact big-integer multiply.

Detection of the fixup (phase 3) is determined by the Carry (borrow) bit at the end. Logically: if a borrow was required then the qhat estimate was too large and the correction is needed, which is, again, nothing more than a Vectorized big-integer add (one instruction). However this is not the full story.
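
For completeness, a sketch of that phase-3 correction, adapted from the structure of Knuth's Algorithm D and reusing the variable names from the snippets above (un_j, vn, q, j, n and need_fixup are assumptions carried over from those fragments, not new definitions):

        if (need_fixup) {                // borrow occurred: qhat was one too large
            q[j] = q[j] - 1;
            uint64_t k = 0;
            // add the divisor back on: simply sv.adde again when Vectorized
            for (int i = 0; i < n; i++) {
                uint64_t t = (uint64_t)un_j[i] + vn[i] + k;
                un_j[i] = (uint32_t)t;
                k = t >> 32;
            }
            un_j[n] = un_j[n] + (uint32_t)k; // final carry absorbed, as in Knuth D
        }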

128/64-bit divisor

As mentioned above, the first part of Knuth's Algorithm D involves computing an estimate, qhat, of the quotient digit. This involves using the three most significant digits and performing a scalar divide, and consequently requires a scalar division with twice the number of bits of the individual digit size. In this example taken from divmnu64.c the digits are 32 bit and, special-casing the overflow, a 64/32 divide is sufficient (64-bit dividend, 32-bit divisor):

        // Compute estimate qhat of q[j] from top 2 digits
        uint64_t dig2 = ((uint64_t)un[j + n] << 32) | un[j + n - 1];
        if (un[j+n] >= vn[n-1]) {
            // rhat can be bigger than 32-bit when the division overflows
            qhat = UINT32_MAX;
            rhat = dig2 - (uint64_t)UINT32_MAX * vn[n - 1];
        } else {
            qhat = dig2 / vn[n - 1]; // 64/32 divide
            rhat = dig2 % vn[n - 1]; // 64/32 modulo
        }
        // use 3rd-from-top digit to obtain better accuracy
        b = 1ULL<<32;
        while (rhat < b && qhat * vn[n - 2] > b * rhat + un[j + n - 2]) {
            qhat = qhat - 1;
            rhat = rhat + vn[n - 1];
        }

However when moving to 64-bit digits (desirable because the algorithm is O(N^2)) this in turn means that the estimate has to be computed from a 128-bit dividend and a 64-bit divisor. Such an operation simply does not exist in most Scalar 64-bit ISAs. Power ISA comes close with divdeu, but it places one operand in the upper half of a 128-bit dividend and hard-codes the lower half to zero. Power ISA also has a Packed SIMD instruction, vdivuq, but that is a 128/128 (quad) divide, not a 128/64, and its use would require considerable effort moving data between the Vector registers and the GPRs. Some investigation into soft-implementations of 128/128 or 128/64 divide shows them to be typically implemented bit-wise, with all that implies.
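
For reference, the effect of divdeu can be sketched roughly as follows (a simplified model that ignores the undefined-on-overflow case; the function name and use of unsigned __int128 are illustrative assumptions), which makes the hard-coded zero lower half visible:

#include <stdint.h>

uint64_t divdeu_model(uint64_t ra, uint64_t rb) {
    // lower 64 bits of the 128-bit dividend are fixed at zero
    unsigned __int128 dividend = (unsigned __int128)ra << 64;
    return (uint64_t)(dividend / rb); // quotient, assuming it fits in 64 bits
}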

The irony is, therefore, that attempting to improve big-integer divide by moving to 64-bit digits in order to take advantage of the efficiency of 64-bit scalar multiply when Vectorized would instead lock up CPU time performing a 128/64 scalar division. With the Vector Multiply operations being critically dependent on that qhat estimate, and because that scalar is an input to each of the vector digit multiplies, as a Dependency Hazard it would cause all Parallel SIMD Multiply back-ends to sit 100% idle, waiting for that one scalar value.

Whilst one solution is to reduce the digit width to 32-bit in order to go back to 64/32 divide, this increases the completion time by a factor of 4 due to the algorithm being O(N^2).

Reducing completion time of 128/64-bit Scalar division

Scalar division is a known computer science problem because, as even the Big-Int Divide shows, it requires looping around a multiply (or, if reduced to 1-bit per loop, a simple compare, shift, and subtract). If the simplest approach were deployed then the completion time for the 128/64 scalar divide would be a whopping 128 cycles. To be workable an alternative algorithm is required, and one of the fastest appears to be Goldschmidt Division. Whilst typically deployed for Floating Point, there is no reason why it should not be adapted to Fixed Point. In this way a Scalar Integer divide can be performed in the same time-order as Newton-Raphson, using two hardware multipliers and a subtract.
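
For background, the core Goldschmidt iteration can be sketched in floating point as below. This is purely illustrative of the principle (the names and fixed iteration count are assumptions); the text is proposing a fixed-point hardware variant, not this code.

#include <math.h>

double goldschmidt_div(double n, double d) {
    int e;
    d = frexp(d, &e);                  // scale divisor into [0.5, 1.0)
    n = ldexp(n, -e);                  // scale dividend identically
    for (int i = 0; i < 5; i++) {      // quadratic convergence
        double f = 2.0 - d;            // correction factor approximating 1/d
        n *= f;                        // the two multiplies are independent
        d *= f;                        // and can execute in parallel
    }
    return n;                          // d -> 1.0, n -> the quotient
}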

Back to Vector carry-looping

There is however another reason for having a 128/64 division instruction, and it's effectively the reverse of maddedu. Look closely at Algorithm D when the divisor is only a scalar (v[0]):

        // the case of a single-digit divisor v[0]
        k = 0;
        for (j = m - 1; j >= 0; j--)
        {
            uint64_t dig2 = ((k << 32) | u[j]);
            q[j] = dig2 / v[0];
            k = dig2 % v[0]; // modulo back into next loop
        }

Here, just as with maddedu, which can put the hi-half of the 128-bit product back in as a form of 64-bit carry, a scalar divisor of a vector dividend puts the modulo back in as the hi-half of a 128/64-bit divide.

RT0      = ((  0<<64) | RA0) / RB0
     RC0 = ((  0<<64) | RA0) % RB0
      |
      +-------+
              |
RT1      = ((RC0<<64) | RA1) / RB1
     RC1 = ((RC0<<64) | RA1) % RB1
      |
      +-------+
              |
RT2      = ((RC1<<64) | RA2) / RB2
     RC2 = ((RC1<<64) | RA2) % RB2

By a nice coincidence this is exactly the same 128/64-bit operation needed (once, rather than chained) for the qhat estimate if it may produce both the quotient and the remainder. The pseudocode cleanly covering both scenarios (leaving out overflow for clarity) can be written as:

divmod2du RT,RA,RB,RC

 dividend = (RC) || (RA)
 divisor = EXTZ128(RB)
 RT = UDIV(dividend, divisor)
 RS = UREM(dividend, divisor)

Again, in an SVP64 context, using EXTRA mode bit 8 allows for selecting whether RS=RC or RS=RT+MAXVL. Similar flexibility in the scalar-vector settings allows the instruction to perform full parallel vector div/mod, or act in loop-back mode for big-int division by a scalar, or for a single scalar 128/64 div/mod.
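
A hedged C model of the chained (loop-back) case with a scalar divisor, analogous to the earlier maddedu sketch (again an illustration, not official pseudocode; the function name, element ordering and use of unsigned __int128 are assumptions), would be:

#include <stdint.h>

// RC carries the remainder into the upper half of the next 128-bit dividend,
// exactly reversing the maddedu chain. ra[] is assumed most-significant first.
void sv_divmod2du_chain(uint64_t rt[], const uint64_t ra[],
                        uint64_t rb, uint64_t *rc, int vl) {
    uint64_t rem = *rc;                       // must start less than rb
    for (int i = 0; i < vl; i++) {
        unsigned __int128 dividend = ((unsigned __int128)rem << 64) | ra[i];
        rt[i] = (uint64_t)(dividend / rb);    // quotient digit (fits: rem < rb)
        rem   = (uint64_t)(dividend % rb);    // remainder chained onward
    }
    *rc = rem;                                // final remainder left in RC
}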

Again, just as with sv.maddedu and sv.adde, adventurous implementors may perform massively-wide DIV/MOD by transparently merging (fusing) the Vector element operations together, only inputting a single RC and outputting the last RC. Where efficient algorithms such as Goldschmidt are deployed internally this could dramatically reduce the cycle completion time for massive Vector DIV/MOD. Thus, just as with the other operations, the apparent limitation of creating chains is overcome: SVP64 is, by design, an "expression of intent" where the implementor is free to achieve that intent in any way they see fit, as long as strict precise-aware Program Order is preserved (even on the VL for-loops).

Just as with divdeu, on which this instruction is based, overflow detection is required. When the divisor is too small compared to the dividend then the result may not fit into 64 bits. Knuth's original algorithm detects overflow and manually places 0xffffffff (all ones) into qhat. With there being so many operands already in divmod2du, a separate cmpl instruction can be used instead to detect the overflow. This saves having to add an Rc=1 or OE=1 mode when the available space in VA-Form EXT04 is extremely limited.

Looking closely at the loop, however, we can see that overflow will not occur. The initial value k is zero: as long as a divide-by-zero is not requested, this always fulfils the condition RC < RB, and on subsequent iterations the new k, being the modulo, is always less than the divisor as well. Thus the condition (the loop invariant) RC < RB is preserved, as long as RC starts at zero.

Limitations

One of the worst things for any ISA is for an algorithm's completion time to depend directly on how long individual instructions take on a given implementation. Knuth's Big-Integer division is unfortunately one such algorithm.

Assuming that the computation of qhat takes 128 cycles to complete on a small power-efficient embedded design, this time would dominate compared to the 64-bit multiplications. However if the element width were reduced to 8, such that the computation of qhat only took 16 cycles, the calculation of qhat would not dominate, but the number of multiplications would rise: somewhere in between there would be an elwidth and a Vector Length that would suit that particular embedded processor.

By contrast a high performance microarchitecture may deploy Goldschmidt or other efficient Scalar Division, which could complete 128/64 qhat computation in say only 5 to 8 cycles, which would be tolerable. Thus, for general-purpose software, it would be necessary to ship multiple implementations of the same algorithm and dynamically select the best one.

The very fact that programmers even have to consider multiple implementations and compare their performance is an unavoidable nuisance. SVP64 is supposed to be designed such that only one implementation of any given algorithm is needed. In some ways it is reassuring that some algorithms just don't fit. Slightly more reassuring is that Goldschmidt Division, which uses two multiplications that can be performed in parallel, would be a much better fit with SVP64 (and Vector Processing in general), the only downside being that it is normally only considered worthwhile for much larger integers.

Conclusion

The majority of ISAs do not have 3-in 2-out instructions, for very good reasons: they are horrendously expensive if implemented as operations that can write to two registers every clock cycle. Implementations that instead take two clock cycles to write via a single write port are frowned upon, and the Hazard Management complexity also goes up.

However when looked at from a Vectorised perspective and using "chaining" (the 64-bit carry-out becomes the 64-bit input of the next operation), "Operand Forwarding" makes one read and one write effectively disappear, reducing most of the operations to a manageable, efficient 2-in 1-out. In normal Scalar hardware being used to attempt Vectorized bigint operations the opportunity for Forwarding is quite hard to spot, but because of the Vectorisation it is known in advance that a whole Vector of values is being used, i.e. only one instruction is issued, not several. Therefore it is easy to micro-code the Operand Forwarding.

In essence, by taking things that extra step further, the complexity of Scalar ISAs disappears through the introduction of some uniform Vector-Looping that, although complex in hardware, at least only has to be done once and applies uniformly to all instructions. These comprehensive, powerful Scalar arithmetic instructions will significantly reduce the complexity of big-integer operations.