Implementation strategy for poly1305 on Simple-V
Links:
Poly1305 is a fast and surprisingly simple provably-secure one-time authenticator. A huge amount of detailed mathematical analysis has been done on it, demonstrating and implementing tricks of modulo arithmetic that can be exploited to good effect and need not be repeated here: the primary purpose of this document is to show that it is possible to use Simple-V REMAP when patterns are identified, instead of standard implementations on Scalar / SIMD ISAs that are loop-unrolled in order to ensure no branching (and definitely no conditional-branching) is utilised.
For example: this entire function, from Loup Vaillant's tutorial may be carried out with some REMAPs, a mul and a madd instruction.
void mult(uint64_t p[5], const uint32_t a[5], const uint32_t b[5])
{
uint64_t a0 = a[0]; uint64_t b0 = b[0];
uint64_t a1 = a[1]; uint64_t b1 = b[1];
uint64_t a2 = a[2]; uint64_t b2 = b[2];
uint64_t a3 = a[3]; uint64_t b3 = b[3];
uint64_t a4 = a[4]; uint64_t b4 = b[4];
p[0] = a0*b0 + a1*b4*5 + a2*b3*5 + a3*b2*5 + a4*b1*5;
p[1] = a0*b1 + a1*b0 + a2*b4*5 + a3*b3*5 + a4*b2*5;
p[2] = a0*b2 + a1*b1 + a2*b0 + a3*b4*5 + a4*b3*5;
p[3] = a0*b3 + a1*b2 + a2*b1 + a3*b0 + a4*b4*5;
p[4] = a0*b4 + a1*b3 + a2*b2 + a3*b1 + a4*b0 ;
}
Above, the patterns are quite clear: it can be expressed as a pair of loops (which is normally avoided like the plague),
for ai = 0 to 4
for bi = 0 to 4
if ai < bi then const = 5 else 1
p[ai] += a[ai] * b[(bi+ai)%5] * const
This can be covered in Simple-V by having
- one REMAP for the constant (5 or 1), which is simplest done by a Triangle REMAP that selects from a Vector. (otherwise, Indexed REMAP is necessary)
- a second REMAP for the loop
ai
above - a third REMAP for the loop value
bi+ai
modulo 5
Then - using Vertical-First - the loop is four instructions: mul, madd, svstep and bc.
Fast-forwarding to the python implementation, let us look at this:
186 h0 = (h0 & c) | g0;
187 h1 = (h1 & c) | g1;
188 h2 = (h2 & c) | g2;
This again is a simple parallel operation, so can be done as two Vertical-First instructions:
sv.and *h, *h, c
sv.or *h, *h, *g
This section however gets more complex, and also requires a new instruction (Double-Shift-and-add, where previous instructions designed are Double-Shift-and-or):
110 h0 += (( t0 ) & 0xfffffffffff);
111 h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff);
112 h2 += (((t1 >> 24) ) & 0x3ffffffffff) | hibit;
Here, a "cheat" is required - slightly - to bring in some extra variables that are set to zero: t2 and t-minus-one (which we name tm):
110 h0 += (((t0 >> 0 ) | (tm << 64)) & 0xfffffffffff);
111 h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff);
112 h2 += (((t1 >> 24) | (t2 << 40 ) & 0x3ffffffffff);
112 h2 |= hibit;
Both tm and t2 are set to zero: strictly speaking tm shifted by 64 will always be zero. Now it becomes possible to identify the pattern, and also split out the AND part:
h[N] += (((t[N] >> s[N] ) | (t[N-1] << (64-s[N]))
h[N] &= const[N]
Where it is clear that the shift amount s[N]
and const[N]
can be set up as Vectors. Now it is possible to utilise a
proposed new instruction, "double-shift-and-add" for the first
part, and a straight sv.add *h, *h, *const
for the second.
In this way, again we have identified very simple regular patterns, and applied Vectorisation to them to reduce instruction count. Interestingly in this case, unlike many algorithms converted there is not anticipated to be a huge reduction in instruction count, but the key is that the core of the algorithm is preserved and thus is easy to validate.