Dynamic Partitioned Multiply
This is complicated! It is necessary to compute a full NxN matrix of partial multiplication results, then perform a cascade of adds (long multipication, in binary), using PartitionedAdd, which will "automatically" break the results down into segments, at all times, keeping each partitioned result separate.
Therefore, for a full 64 bit multiply, with 7 partitions, a matrix of 8x8 multiplications are performed, then added up in each column of the same magnitude, in exactly the same way as described by Vedic Mathematics. Ultimately it is the partitions on the adds that allows the entire multiply to be broken into SIMD pieces: the Partitioned Adders are what stops carry rolling over to "affect" the result of another partition.
The Wallace Tree algorithm is presently deployed, here: we need to use the (more efficient) Dadda algorithm
This image illustrates it well, except for the last step, "Step(v)", whose result should be "1 x 2 = 2". The partitioned sum at the bottom is correct.