Big Integer Arithmetic
DRAFT STATUS 19apr2022, last edited 23may2022
- discussion page for notes
- https://bugs.libre-soc.org/show_bug.cgi?id=817 bugreport
- https://bugs.libre-soc.org/show_bug.cgi?id=937 128/64 shifts
- analysis
- svfixedarith pseudocode
BigNum arithmetic is extremely common especially in cryptography, where for example RSA relies on arithmetic of 2048 or 4096 bits in length. The primary operations are add, multiply and divide (and modulo) with specialisations of subtract and signed multiply.
A reminder that a particular focus of SVP64 is that it is built on top of Scalar operations, where those scalar operations are useful in their own right without SVP64. Thus the operations here are proposed first as Scalar Extensions to the Power ISA.
A secondary focus is that if Vectorized, implementors may choose to deploy macro-op fusion targetting back-end 256-bit or greater Dynamic SIMD ALUs for maximum performance and effectiveness.
Analysis
Covered in analysis the summary is that standard adde
is sufficient for SVP64 Vectorization of big-integer addition (and subfe
for subtraction) but that big-integer shift, multiply and divide require an
extra 3-in 2-out instructions, similar to Intel's
shld
and shrd,
mulx and
divq,
to be efficient.
The same instruction (maddedu
) is used in both
big-divide and big-multiply because 'maddedu''s primary
purpose is to perform a fused 64-bit scalar multiply with a large vector,
where that result is Big-Added for Big-Multiply, but Big-Subtracted for
Big-Divide.
Chaining the operations together gives Scalar-by-Vector
operations, except for sv.adde
and sv.subfe
which are
Vector-by-Vector Chainable (through the CA
flag).
Macro-op Fusion and back-end massively-wide SIMD ALUs may be deployed in a
fashion that is hidden from the user, behind a consistent, stable ISA API.
The same macro-op fusion may theoretically be deployed even on Scalar
operations.
DRAFT dsld
0.....5 | 6..10 | 11..15 | 16..20 | 21.25 | 26..30 | 31 |
---|---|---|---|---|---|---|
EXT04 | RT | RA | RB | RC | XO | Rc |
VA2-Form
- dsld RT,RA,RB,RC (Rc=0)
- dsld. RT,RA,RB,RC (Rc=1)
Pseudo-code:
n <- (RB)[58:63]
v <- ROTL64((RA), n)
mask <- MASK(0, 63-n)
RT <- (v[0:63] & mask) | ((RC) & ¬mask)
RS <- v[0:63] & ¬mask
overflow = 0
if RS != [0]*64:
overflow = 1
Special Registers Altered:
CR0 (if Rc=1)
DRAFT dsrd
0.....5 | 6..10 | 11..15 | 16..20 | 21.25 | 26..30 | 31 |
---|---|---|---|---|---|---|
EXT04 | RT | RA | RB | RC | XO | Rc |
VA2-Form
- dsrd RT,RA,RB,RC (Rc=0)
- dsrd. RT,RA,RB,RC (Rc=1)
Pseudo-code:
n <- (RB)[58:63]
v <- ROTL64((RA), 64-n)
mask <- MASK(n, 63)
RT <- (v[0:63] & mask) | ((RC) & ¬mask)
RS <- v[0:63] & ¬mask
overflow = 0
if RS != [0]*64:
overflow = 1
Special Registers Altered:
CR0 (if Rc=1)
maddedu
DRAFT
maddedu
is similar to v3.0 madd
, and
is VA-Form despite having 2 outputs: the second
destination register is implicit.
0.....5 | 6..10 | 11..15 | 16..20 | 21..25 | 26..31 |
---|---|---|---|---|---|
EXT04 | RT | RA | RB | RC | XO |
The pseudocode for maddedu RT, RA, RB, RC
is:
prod[0:127] = (RA) * (RB)
sum[0:127] = EXTZ(RC) + prod
RT <- sum[64:127]
RS <- sum[0:63] # RS implicit register, see below
RC is zero-extended (not shifted, not sign-extended), the 128-bit product added to it; the lower half of that result stored in RT and the upper half in RS.
The differences here to maddhdu
are that maddhdu
stores the upper
half in RT, where maddedu
stores the upper half in RS.
The value stored in RT is exactly equivalent to maddld
despite maddld
performing sign-extension on RC, because RT is the full mathematical result
modulo 264 and sign/zero extension from 64 to 128 bits produces identical
results modulo 264. This is why there is no maddldu instruction.
Programmer's Note:
As a Scalar Power ISA operation, like lq
and stq
, RS=RT+1.
To achieve the same big-integer rolling-accumulation effect
as SVP64: assuming the scalar to multiply is in r0,
the vector to multiply by starts at r4 and the result vector
in r20, instructions may be issued maddedu r20,r4,r0,r20
maddedu r21,r5,r0,r21
etc. where the first maddedu
will have
stored the upper half of the 128-bit multiply into r21, such
that it may be picked up by the second maddedu
. Repeat inline
to construct a larger bigint scalar-vector multiply,
as Scalar GPR register file space permits.
SVP64 overrides the Scalar behaviour of what defines RS.
For SVP64 EXTRA register extension, the RM-1P-3S-1D
format is
used with the additional bit set for determining RS.
Field Name | Field bits | Description |
---|---|---|
Rdest_EXTRA2 | 10:11 |
extends RT (R*_EXTRA2 Encoding) |
Rsrc1_EXTRA2 | 12:13 |
extends RA (R*_EXTRA2 Encoding) |
Rsrc2_EXTRA2 | 14:15 |
extends RB (R*_EXTRA2 Encoding) |
Rsrc3_EXTRA2 | 16:17 |
extends RC (R*_EXTRA2 Encoding) |
EXTRA2_MODE | 18 |
used by maddedu for determining RS |
When EXTRA2_MODE
is set to zero, the implicit RS register takes
its Vector/Scalar setting from Rdest_EXTRA2, and takes
the register number from RT, but all numbering
is offset by MAXVL. Note that element-width overrides influence this
offset (see SVP64 appendix for full details).
When EXTRA2_MODE
is set to one, the implicit RS register is identical
to RC extended with SVP64 using Rsrc3_EXTRA2
in every respect, including whether RC is set Scalar or
Vector.
divmod2du RT,RA,RB,RC
DRAFT
Divide/Modulu Quad-Double Unsigned is another VA-Form instruction
that is near-identical to divdeu
except that:
- the lower 64 bits of the dividend, instead of being zero, contain a register, RC.
- it performs a fused divide and modulo in a single instruction, storing
the modulo in an implicit RS (similar to
maddedu
)
RB, the divisor, remains 64 bit. The instruction is therefore a 128/64
division, producing a (pair) of 64 bit result(s), in the same way that
Intel divq works.
Overflow conditions
are detected in exactly the same fashion as divdeu
, except that rather
than have UNDEFINED
behaviour, RT is set to all ones and RS set to all
zeros on overflow.
Programmer's note: there are no Rc variants of any of these VA-Form
instructions. cmpi
will need to be used to detect overflow conditions:
the saving in instruction count is that both RT and RS will have already
been set to useful values (all 1s and all zeros respectively)
needed as part of implementing Knuth's
Algorithm D
For SVP64, given that this instruction is also 3-in 2-out 64-bit registers,
the exact same EXTRA format and setting of RS is used as for sv.maddedu
.
For Scalar usage, just as for maddedu
, RS=RT+1
(similar to lq
and stq
).
Pseudo-code:
if ((RA) <u (RB)) & ((RB) != [0]*XLEN) then
dividend[0:(XLEN*2)-1] <- (RA) || (RC)
divisor[0:(XLEN*2)-1] <- [0]*XLEN || (RB)
result <- dividend / divisor
modulo <- dividend % divisor
RT <- result[XLEN:(XLEN*2)-1]
RS <- modulo[XLEN:(XLEN*2)-1]
else
RT <- [1]*XLEN
RS <- [0]*XLEN
[DRAFT] EXT04 Proposed Map
For the Opcode map (XO Field) see Power ISA v3.1, Book III, Appendix D, Table 13 (sheet 7 of 8), p1357. Proposed is the addition of:
maddedu
in110010
divmod2du
in111010
pcdec
in111000
v > | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
---|---|---|---|---|---|---|---|---|
110 | maddhd | maddhdu | maddedu | maddld | rsvd | rsvd | rsvd | rsvd |
111 | pcdec. | rsvd | divmod2du | vpermr | vaddequm | vaddecuq | vsubeuqm | vsubecuq |