Big Integer Arithmetic

DRAFT STATUS 19apr2022, last edited 23may2022

BigNum arithmetic is extremely common especially in cryptography, where for example RSA relies on arithmetic of 2048 or 4096 bits in length. The primary operations are add, multiply and divide (and modulo) with specialisations of subtract and signed multiply.

A reminder that a particular focus of SVP64 is that it is built on top of Scalar operations, where those scalar operations are useful in their own right without SVP64. Thus the operations here are proposed first as Scalar Extensions to the Power ISA.

A secondary focus is that if Vectorised, implementors may choose to deploy macro-op fusion targetting back-end 256-bit or greater Dynamic SIMD ALUs for maximum performance and effectiveness.

Analysis

Covered in analysis the summary is that standard adde is sufficient for SVP64 Vectorisation of big-integer addition (and subfe for subtraction) but that big-integer multiply and divide require an extra 3-in 2-out instruction, similar to Intel's mulx, to be efficient. The same instruction (madded) is used for both because 'madded''s primary purpose is to perform a fused 64-bit scalar multiply with a large vector, where that result is Big-Added for Big-Multiply, but Big-Subtracted for Big-Divide.

Macro-op Fusion and back-end massively-wide SIMD ALUs may be deployed in a fashion that is hidden from the user, behind a consistent, stable ISA API. The same macro-op fusion may theoretically be deployed even on Scalar operations.

madded

DRAFT

madded is similar to v3.0 madd, and is VA-Form despite having 2 outputs: the second destination register is implicit.

0.....5 6..10 11..15 16..20 21..25 26..31
EXT04 RT RA RB RC XO

The pseudocode for madded RT, RA, RB, RC is:

prod[0:127] = (RA) * (RB)
sum[0:127] = EXTZ(RC) + prod
RT <- sum[64:127]
RS <- sum[0:63] # RS implicit register, see below

RC is zero-extended (not shifted, not sign-extended), the 128-bit product added to it; the lower half of that result stored in RT and the upper half in RS.

The differences here to maddhdu are that maddhdu stores the upper half in RT, where madded stores the upper half in RS. There is no equivalent to maddld because maddld performs sign-extension on RC.

Programmer's Note: As a Scalar Power ISA operation, like lq and stq, RS=RT+1. To achieve the same big-integer rolling-accumulation effect as SVP64: assuming the scalar to multiply is in r0, the vector to multiply by starts at r4 and the result vector in r20, instructions may be issued madded r20,r4,r0,r20 madded r21,r5,r0,r21 etc. where the first madded will have stored the upper half of the 128-bit multiply into r21, such that it may be picked up by the second madded. Repeat inline to construct a larger bigint scalar-vector multiply, as Scalar GPR register file space permits.

SVP64 overrides the Scalar behaviour of what defines RS. For SVP64 EXTRA register extension, the RM-1P-3S-1D format is used with the additional bit set for determining RS.

Field Name Field bits Description
Rdest_EXTRA2 10:11 extends RT (R*_EXTRA2 Encoding)
Rsrc1_EXTRA2 12:13 extends RA (R*_EXTRA2 Encoding)
Rsrc2_EXTRA2 14:15 extends RB (R*_EXTRA2 Encoding)
Rsrc3_EXTRA2 16:17 extends RC (R*_EXTRA2 Encoding)
EXTRA2_MODE 18 used by madded for determining RS

When EXTRA2_MODE is set to zero, the implicit RS register takes its Vector/Scalar setting from Rdest_EXTRA2, and takes the register number from RT, but all numbering is offset by MAXVL. Note that element-width overrides influence this offset (see SVP64 appendix for full details).

When EXTRA2_MODE is set to one, the implicit RS register is identical to RC extended with SVP64 using Rsrc3_EXTRA2 in every respect, including whether RC is set Scalar or Vector.

divrem2du RT,RA,RB,RC

DRAFT

Divide/Modulu Quad-Double Unsigned is another VA-Form instruction that is near-identical to divdeu except that:

  • the lower 64 bits of the dividend, instead of being zero, contain a register, RC.
  • it performs a fused divide and modulo in a single instruction, storing the modulo in an implicit RS (similar to madded)

RB, the divisor, remains 64 bit. The instruction is therefore a 128/64 division, producing a (pair) of 64 bit result(s). Overflow conditions are detected in exactly the same fashion as divdeu, except that rather than have UNDEFINED behaviour, RT is set to all ones and RS set to all zeros on overflow.

Programmer's note: there are no Rc variants of any of these VA-Form instructions. cmpi will need to be used to detect overflow conditions: the saving in instruction count is that both RT and RS will have already been set to useful values needed as part of implementing Knuth's Algorithm D

For SVP64, given that this instruction is also 3-in 2-out 64-bit registers, the exact same EXTRA format and setting of RS is used as for sv.madded. For Scalar usage, just as for madded, RS=RT+1 (similar to lq and stq).

Pseudo-code:

if ((RA) <u (RB)) & ((RB) != [0]*XLEN) then
    dividend[0:(XLEN*2)-1] <- (RA) || (RC)
    divisor[0:(XLEN*2)-1] <- [0]*XLEN || (RB)
    result <- dividend / divisor
    modulo <- dividend % divisor
    RT <- result[XLEN:(XLEN*2)-1]
    RS <- modulo[XLEN:(XLEN*2)-1]
else
    RT <- [1]*XLEN
    RS <- [0]*XLEN

[DRAFT] EXT04 Proposed Map

For the Opcode map (XO Field) see Power ISA v3.1, Book III, Appendix D, Table 13 (sheet 7 of 8), p1357. Proposed is the addition of madded (DRAFT, NOT APPROVED) in 110010 and divrem2du in 110100

110000 110001 110010 110011 110100 110101 110110 110111
maddhd maddhdu madded maddld divrem2du rsvd rsvd rsvd