# Friday, 2023-05-19

*** jn has quit IRC 00:07
*** jn has joined #libre-soc 00:08
*** jn has joined #libre-soc 00:08
*** ghostmansd[m] has quit IRC 07:05
*** ghostmansd[m] has joined #libre-soc 07:06
*** ghostmansd[m] has quit IRC 07:23
*** ghostmansd[m] has joined #libre-soc 07:23
*** ghostmansd[m] has quit IRC 07:30
*** ghostmansd[m] has joined #libre-soc 07:30
*** ghostmansd[m] has quit IRC 08:42
*** ghostmansd[m] has joined #libre-soc 08:43
*** ghostmansd[m] has quit IRC 08:48
*** libredev has quit IRC 08:48
*** ghostmansd[m] has joined #libre-soc 08:48
*** libredev has joined #libre-soc 09:15
*** libredev has quit IRC 09:36
ghostmansd[m], it is indeed. 1039 09:42
does anyone know how to use parallel prefix sum to do polynomial interpolation? 09:43
https://en.wikipedia.org/wiki/Polynomial_interpolation 09:43
https://bugs.libre-soc.org/show_bug.cgi?id=1039 09:48
*** libredev has joined #libre-soc 10:01
I've done a similar method in the past 10:34
idk, but parallel prefix sum and parallel tree reduction can be used for polynomial evaluation: 10:36
reduce(element_wise_mul(c, prefix_sum([1] + splat(x), op=mul)), op=add) == 10:36
c[0]+c[1]*x+c[2]*x^2+c[3]*x^3+c[4]*x^4... 10:36
though it tends to have lower precision than using fma, because of rounding before adding 10:39
so it's d0 + x * (d1 + x * (d2 + x * (...))) 12:20
where there's a clear relationship between c and d? 12:20
*** tplaten has joined #libre-soc 12:20
*** tplaten has quit IRC 12:24
if you do it like that it has to be calculated sequentially though 13:05
and given enough terms it should always get to the correct result 13:07
unless I'm misunderstanding something, a reduce operation like that has to start from the inner expressions and work outward, and I don't see how that can be parallelized 13:08
*** tplaten has joined #libre-soc 14:46
ok so i have it, now, real simple.
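The identity posted at 10:36 can be checked with a short Python sketch. The helper names `splat`, `prefix_sum` and `element_wise_mul` are taken from the chat message; the naive serial implementations below are stand-ins for illustration only, not Libre-SOC/SVP64 code:

```python
# Sketch of the prefix-sum polynomial-evaluation identity from the log.
# Helper names come from the chat; their bodies here are naive serial
# stand-ins, not the actual vectorized/SVP64 implementations.
from functools import reduce
from itertools import accumulate
import operator

def splat(x, n):
    """Broadcast scalar x into an n-element vector."""
    return [x] * n

def prefix_sum(v, op):
    """Inclusive prefix scan of v under op (serial reference version)."""
    return list(accumulate(v, op))

def element_wise_mul(a, b):
    return [ai * bi for ai, bi in zip(a, b)]

def poly_eval_prefix(c, x):
    # prefix_sum([1, x, x, ..., x], op=mul) == [1, x, x^2, ..., x^(n-1)]
    powers = prefix_sum([1] + splat(x, len(c) - 1), op=operator.mul)
    # multiply coefficients by powers, then reduce with addition
    return reduce(operator.add, element_wise_mul(c, powers))

c = [3.0, 1.0, 4.0, 1.0, 5.0]   # 3 + x + 4x^2 + x^3 + 5x^4
assert poly_eval_prefix(c, 2.0) == sum(ci * 2.0**i for i, ci in enumerate(c))
```

As noted at 10:39, doing the multiply and the sum as separate rounded steps tends to be less accurate than a fused-multiply-add (Horner) chain, which rounds once per term.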
programmerjake thank you i was able to break it down into two REMAPs, a straight vec-mul and a mapreduce 16:47
*** pangelo[m] has quit IRC 17:06
> so it's d0 + x * (d1 + x * (d2 + x * (...))) 17:32
yes, except i call the coefficients c[0], c[1], c[2], ... instead of d0, d1, d2, ... 17:32
though the evaluation order of the parallel prefix-sum/reduction version is different, to achieve O(log N) runtime, assuming the cpu has wide enough simd units to evaluate N-element vector ops in O(1) 17:35
markos: parallel prefix sum is evaluated like this (where the `@` stands for the single-element operation, e.g. addition): https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_prefix_sum.py;h=2b88407216ccad3fc99a7d633331a30a3d3f562f;hb=4bf2f20bddc057df1597d14e0b990c0b9bdeb10e#l167 17:39
parallel tree reduction is kinda like that, except it only returns the last element, eliminating the operations that are unnecessary for computing it. for reduction in SVP64 the tree is tilted top-to-bottom-left rather than top-to-bottom-right, to end up with the reduction result in element 0 (assuming the predicate mask is all ones) 17:44
so for polynomial evaluation on a wide-enough simd machine it takes approximately 3 * log2(N) steps (the first 2 * log2(N) for the prefix sum, the last log2(N) for the reduction) 17:46
which is waay faster than the N steps needed for serial polynomial evaluation 17:47
for big enough N 17:48
though if you need N independent polynomial evaluations rather than 1, it's still faster and more accurate to use the serial fma algorithm vectorized to be N-wide 17:49
since that takes just N steps rather than N * 3 * log2(N) 17:50
assuming no overlap between the independent parallel prefix-sums/reductions 17:51
*** ghostmansd[m] has quit IRC 18:12
*** ghostmansd[m] has joined #libre-soc 18:13
*** octavius has joined #libre-soc 18:45
*** tplaten has quit IRC 20:42
*** octavius has quit IRC 21:32
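The O(log N) step count discussed at 17:35-17:47 can be illustrated with a log-depth scan. This sketch uses a Hillis-Steele schedule, which is *one* parallel prefix algorithm and not necessarily the exact tree shown in nmutil's test_prefix_sum.py (that file also covers a work-efficient variant); the function name `parallel_prefix` and the step counter are mine:

```python
# Hedged sketch: log-depth inclusive prefix scan (Hillis-Steele style).
# Each pass combines elements at stride 1, 2, 4, ...; all ops within a
# pass are independent, so a wide-enough SIMD machine can execute each
# pass as a single N-wide vector step, giving ceil(log2(N)) steps total.
import operator

def parallel_prefix(v, op):
    """Inclusive prefix scan of v under op, counting the parallel passes."""
    v = list(v)
    n = len(v)
    stride = 1
    steps = 0
    while stride < n:
        # this comprehension reads only the previous pass's values,
        # so conceptually the whole pass happens at once
        v = [op(v[i - stride], v[i]) if i >= stride else v[i]
             for i in range(n)]
        stride *= 2
        steps += 1
    return v, steps

# 8 elements -> 3 parallel passes instead of 7 serial additions
scan, steps = parallel_prefix([1, 2, 3, 4, 5, 6, 7, 8], operator.add)
assert scan == [1, 3, 6, 10, 15, 21, 28, 36] and steps == 3
```

With `op=operator.mul` the same scan produces the power vector `[1, x, x^2, ...]` needed for the polynomial evaluation earlier in the log, which is where the "2 * log2(N) for the prefix sum" part of the 3 * log2(N) estimate comes from.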

Generated by irclog2html.py 2.17.1 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!