RFC ls003 Big Integer

URLs:

Severity: Major

Status: New

Date: 20 Oct 2022

Target: v3.2B

Source: v3.0B

Books and Section affected: UPDATE

    Book I 64-bit Fixed-Point Arithmetic Instructions 3.3.9.1
    Appendix E Power ISA sorted by opcode
    Appendix F Power ISA sorted by version
    Appendix G Power ISA sorted by Compliancy Subset
    Appendix H Power ISA sorted by mnemonic

Summary

Instructions added

    maddedu - Multiply-Add Extended Double Unsigned
    divmod2du - Divide/Modulo Quad-Double Unsigned
    dsld - Double Shift Left Doubleword
    dsrd - Double Shift Right Doubleword

Submitter: Luke Leighton (Libre-SOC)

Requester: Libre-SOC

Impact on processor:

    Addition of two new GPR-based instructions

Impact on software:

    Requires support for new instructions in assembler, debuggers,
    and related tools.

Keywords:

    GPR, Big-integer, Double-word

Motivation

  • Similar to maddhdu and maddld, but allow for a big-integer rolling accumulation affect: RC effectively becomes a 64-bit carry in chains of highly-efficient loop-unrolled arbitrary-length big-integer operations.
  • Similar to divdeu, and has similar advantages to maddedu, Modulo result is available with the quotient in a single instruction allowing highly-efficient arbitrary-length big-integer division.

Notes and Observations:

TODO: address Jacob's comments: https://libre-soc.org/irclog/%23libre-soc.2022-10-28.log.html#t2022-10-28T18:00:27

  1. It is not practical to add Rc=1 variants as VA-Form is used and there is a pair of results produced.
  2. An overflow variant (XER.OV set) of divmod2du would be valuable but VA-Form EXT004 is under severe pressure.
  3. Both maddhdu and divmod2du instructions have been present in Intel x86 for several decades. Likewise, variants of dsld and dsrd.
  4. None of these instruction is present in VSX: these are 128/64 whereas VSX is 128/128.
  5. maddedu and divmod2du are full inverses of each other, including when used for arbitrary-length big-integer arithmetic.
  6. These are all 3-in 2-out instructions. If Power ISA did not already have LD/ST-with-update instructions and instructions with RAp and RTp then these instructions would not be proposed.

Changes

Add the following entries to:

  • the Appendices of Book I
  • Instructions of Book I added to Section 3.3.9.1
  • VA2-Form of Book I Section 1.6.21.1 and 1.6.2

\newpage{}

Multiply-Add Extended Double Unsigned

maddedu RT, RA, RB, RC

0-5 6-10 11-15 16-20 21-25 26-31 Form
EXT04 RT RA RB RC XO VA-Form

Pseudocode:

prod[0:127] <- (RA) * (RB)    # Multiply RA and RB, result 128-bit
sum[0:127] <- EXTZ(RC) + prod # Zero extend RC, add product
RT <- sum[64:127]             # Store low half in RT
RS <- sum[0:63]               # RS implicit register, equal to RC

Special registers altered:

None

The 64-bit operands are (RA), (RB), and (RC). RC is zero-extended (not shifted, not sign-extended). The 128-bit product of the operands (RA) and (RB) is added to (RC). The low-order 64 bits of the 128-bit sum are placed into register RT. The high-order 64 bits of the 128-bit sum are placed into register RS. RS is implictly defined as the same register as RC.

All three operands and the result are interpreted as unsigned integers.

The differences here to maddhdu are that maddhdu stores the upper half in RT, where maddedu stores the upper half in RS.

The value stored in RT is exactly equivalent to maddld despite maddld performing sign-extension on RC, because RT is the full mathematical result modulo 264 and sign/zero extension from 64 to 128 bits produces identical results modulo 264. This is why there is no maddldu instruction.

Programmer's Note: To achieve a big-integer rolling-accumulation effect: assuming the scalar to multiply is in r0, and r3 is used (effectively) as a 64-bit carry, the vector to multiply by starts at r4 and the result vector in r20, instructions may be issued maddedu r20,r4,r0,r3 maddedu r21,r5,r0,r3 etc. where the first maddedu will have stored the upper half of the 128-bit multiply into r3, such that it may be picked up by the second maddedu. Repeat inline to construct a larger bigint scalar-vector multiply, as Scalar GPR register file space permits. If register spill is required then r3, as the effective 64-bit carry, continues the chain.

Examples:

# (r0 * r1) + r2, store lower in r4, upper in r2
maddedu r4, r0, r1, r2

# Chaining together for larger bigint (see Programmer's Note above)
# r3 starts with zero (no carry-in)
maddedu r20,r4,r0,r3
maddedu r21,r5,r0,r3
maddedu r22,r6,r0,r3

\newpage{}

Divide/Modulo Quad-Double Unsigned

Should name be Divide/Module Double Extended Unsigned? Check the pseudo-code comments

divmod2du RT,RA,RB,RC

0-5 6-10 11-15 16-20 21-25 26-31 Form
EXT04 RT RA RB RC XO VA-Form

Pseudo-code:

if ((RA) <u (RB)) & ((RB) != [0]*64) then  # Check RA<RB, for divide-by-0
    dividend[0:127] <- (RA) || (RC)        # Combine RA/RC as 128-bit
    divisor[0:127] <- [0]*64 || (RB)       # Extend RB to 128-bit
    result <- dividend / divisor           # Unsigned Division
    modulo <- dividend % divisor           # Unsigned Modulo
    RT <- result[64:127]                   # Store result in RT
    RS <- modulo[64:127]                   # Modulo in RC, implicit
else                                       # In case of error
    RT <- [1]*64                           # RT all 1's
    RS <- [0]*64                           # RS all 0's

Special registers altered:

None

The 128-bit dividend is (RA) || (RC). The 64-bit divisor is (RB). If the quotient can be represented in 64 bits, it is placed into register RT. The modulo is placed into register RS. RS is implictly defined as the same register as RC, similarly to maddedu.

The quotient can be represented in 64-bits when both these conditions are true:

  • (RA) < (RB) (unsigned comparison)
  • (RB) is NOT 0 (not divide-by-0)

If these conditions are not met, RT is set to all 1's, RS to all 0's.

All operands, quotient, and modulo are interpreted as unsigned integers.

Divide/Modulo Quad-Double Unsigned is a VA-Form instruction that is near-identical to divdeu except that:

  • the lower 64 bits of the dividend, instead of being zero, contain a register, RC.
  • it performs a fused divide and modulo in a single instruction, storing the modulo in an implicit RS (similar to maddedu)
  • There is no UNDEFINED behaviour.

RB, the divisor, remains 64 bit. The instruction is therefore a 128/64 division, producing a (pair) of 64 bit result(s), in the same way that Intel divq works. Overflow conditions are detected in exactly the same fashion as divdeu, except that rather than have UNDEFINED behaviour, RT is set to all ones and RS set to all zeros on overflow.

Programmer's note: there are no Rc variants of any of these VA-Form instructions. cmpi will need to be used to detect overflow conditions: the saving in instruction count is that both RT and RS will have already been set to useful values (all 1s and all zeros respectively) needed as part of implementing Knuth's Algorithm D

For Scalar usage, just as for maddedu, RS=RC Examples:

# ((r0 << 64) + r2) / r1, store in r4
# ((r0 << 64) + r2) % r1, store in r2
divmod2du r4, r0, r1, r2

\newpage{}

Double-Shift Left Doubleword

dsld RT,RA,RB,RC

0-5 6-10 11-15 16-20 21-25 26-30 31 Form
EXT04 RT RA RB RC XO Rc VA2-Form

Pseudo-code:

n <- (RB)[58:63]                        # Use lower 6-bits for shift
v <- ROTL64((RA), n)                    # Rotate RA 64-bit left by n bits
mask <- MASK(64, 63-n)                  # 1's mask in MSBs
RT <- (v[0:63] & mask) | ((RC) & ¬mask) # mask-in RC into RT
RS <- v[0:63] & ¬mask                   # part normally lost into RC
overflow = 0                            # Clear overflow flag
if RS != [0]*64:                        # Check if RS is NOT zero
    overflow = 1                        # Set the overflow flag

Special Registers Altered:

CR0                    (if Rc=1)

The contents of register RA are shifted left the number of bits specified by (RB) 58:63. The same number of shifted bits are taken from register RC and placed into the LSBs of the result, RT. Additionally, the MSB bits of register RA that would normally be discarded by a 64-bit left shift are placed into the LSBs of RS.

Note: When Rc=1, and the value in RS is nonzero, the overflow flag is raised in CR0.

Programmer's note: similar to maddedu and divmod2du, dsld can be chained (using RC).

\newpage{}

Double-Shift Right Doubleword

dsrd RT,RA,RB,RC

0-5 6-10 11-15 16-20 21-25 26-30 31 Form
EXT04 RT RA RB RC XO Rc VA2-Form

Pseudo-code:

n <- (RB)[58:63]                        # Take lower 6-bits for shift
v <- ROTL64((RA), 64-n)                 # Rotate RA 64-bit left by 64-n bits
mask <- MASK(n, 63)                     # 0's mask, set mask[n:63] to 1' 
RT <- (v[0:63] & mask) | ((RC) & ¬mask) # 
RS <- v[0:63] & ¬mask
overflow = 0
if RS != [0]*64:
    overflow = 1

Special Registers Altered:

CR0                    (if Rc=1)

\newpage{}

VA2-Form

Add the following to Book I, 1.6.21.1, VA2-Form

|0      |6     |11     |16     |21  |24|26  |31  |
| PO    |  RT  |   RA  |   RB  | RC    | XO | Rc |

Add RA to XO Field in Book I, 1.6.2

RA (11:15)
    Field used to specify a GPR to be used as a
    source or as a target.
    Formats: A, BM2, D, DQ, DQE, DS, M, MD, MDS, TX, VA, VA2,
             VX, X, XO, XS, SVL, XB, TLI, Z23

TODO other fields RT, RB, RC, XO, and Rc, see https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isatables/fields.text;hb=HEAD

Appendices

Appendix E Power ISA sorted by opcode
Appendix F Power ISA sorted by version
Appendix G Power ISA sorted by Compliancy Subset
Appendix H Power ISA sorted by mnemonic
Form Book Page Version mnemonic Description
VA I # 3.0B maddedu Multiply-Add Extend Double Unsigned
VA I # 3.0B divmod2du Divide/Modulo Quad-Double Unsigned