Implementation strategy for poly1305 on Simple-V

Links:

Poly1305 is a fast and surprisingly simple provably-secure one-time authenticator. A huge amount of detailed mathematical analysis has been done on it, demonstrating and implementing tricks of modulo arithmetic that can be exploited to good effect and need not be repeated here: the primary purpose of this document is to show that it is possible to use Simple-V REMAP when patterns are identified, instead of standard implementations on Scalar / SIMD ISAs that are loop-unrolled in order to ensure no branching (and definitely no conditional-branching) is utilised.

For example: this entire function, from Loup Vaillant's tutorial may be carried out with some REMAPs, a mul and a madd instruction.

void mult(uint64_t p[5], const uint32_t a[5], const uint32_t b[5])
{
    uint64_t a0 = a[0];  uint64_t b0 = b[0];
    uint64_t a1 = a[1];  uint64_t b1 = b[1];
    uint64_t a2 = a[2];  uint64_t b2 = b[2];
    uint64_t a3 = a[3];  uint64_t b3 = b[3];
    uint64_t a4 = a[4];  uint64_t b4 = b[4];
    p[0] = a0*b0 + a1*b4*5 + a2*b3*5 + a3*b2*5 + a4*b1*5;
    p[1] = a0*b1 + a1*b0   + a2*b4*5 + a3*b3*5 + a4*b2*5;
    p[2] = a0*b2 + a1*b1   + a2*b0   + a3*b4*5 + a4*b3*5;
    p[3] = a0*b3 + a1*b2   + a2*b1   + a3*b0   + a4*b4*5;
    p[4] = a0*b4 + a1*b3   + a2*b2   + a3*b1   + a4*b0  ;
}

Above, the patterns are quite clear: it can be expressed as a pair of loops (which is normally avoided like the plague),

for ai = 0 to 4
    for bi = 0 to 4
        if ai < bi then const = 5 else 1
        p[ai] += a[ai] * b[(bi+ai)%5] * const

This can be covered in Simple-V by having

one REMAP for the constant (5 or 1), which is simplest done by a Triangle REMAP that selects from a Vector. (otherwise, Indexed REMAP is necessary)
a second REMAP for the loop ai above
a third REMAP for the loop value bi+ai modulo 5

Then - using Vertical-First - the loop is four instructions: mul, madd, svstep and bc.

Fast-forwarding to the python implementation, let us look at this:

 186    h0 = (h0 & c) | g0;
 187    h1 = (h1 & c) | g1;
 188    h2 = (h2 & c) | g2;

This again is a simple parallel operation, so can be done as two Vertical-First instructions:

sv.and *h, *h, c
sv.or  *h, *h, *g

This section however gets more complex, and also requires a new instruction (Double-Shift-and-add, where previous instructions designed are Double-Shift-and-or):

 110    h0 += (( t0                    ) & 0xfffffffffff);
 111    h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff);
 112    h2 += (((t1 >> 24)             ) & 0x3ffffffffff) | hibit;

Here, a "cheat" is required - slightly - to bring in some extra variables that are set to zero: t2 and t-minus-one (which we name tm):

 110    h0 += (((t0 >> 0 ) | (tm << 64)) & 0xfffffffffff);
 111    h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff);
 112    h2 += (((t1 >> 24) | (t2 << 40 ) & 0x3ffffffffff);
 112    h2 |= hibit;

Both tm and t2 are set to zero: strictly speaking tm shifted by 64 will always be zero. Now it becomes possible to identify the pattern, and also split out the AND part:

        h[N] += (((t[N] >> s[N] ) | (t[N-1] << (64-s[N]))
        h[N] &= const[N]

Where it is clear that the shift amount s[N] and const[N] can be set up as Vectors. Now it is possible to utilise a proposed new instruction, "double-shift-and-add" for the first part, and a straight sv.add *h, *h, *const for the second.

In this way, again we have identified very simple regular patterns, and applied Vectorisation to them to reduce instruction count. Interestingly in this case, unlike many algorithms converted there is not anticipated to be a huge reduction in instruction count, but the key is that the core of the algorithm is preserved and thus is easy to validate.