Friday, 2023-05-19

*** jn <jn!~quassel@user/jn/x-3390946> has quit IRC00:07
*** jn <jn!~quassel@ip-095-223-044-193.um35.pools.vodafone-ip.de> has joined #libre-soc00:08
*** jn <jn!~quassel@user/jn/x-3390946> has joined #libre-soc00:08
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.205.168.99> has quit IRC07:05
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.170.0> has joined #libre-soc07:06
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.170.0> has quit IRC07:23
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.205.168.99> has joined #libre-soc07:23
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.205.168.99> has quit IRC07:30
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.53.205> has joined #libre-soc07:30
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.53.205> has quit IRC08:42
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.205.168.99> has joined #libre-soc08:43
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.205.168.99> has quit IRC08:48
*** libredev <libredev!libredev@ircforever.org> has quit IRC08:48
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.53.205> has joined #libre-soc08:48
*** libredev <libredev!libredev@libredev.ircforever.org> has joined #libre-soc09:15
*** libredev <libredev!libredev@libredev.ircforever.org> has quit IRC09:36
lkclghostmansd[m], it is indeed. 103909:42
lkcldoes anyone know how to use parallel prefix sum to do polynomial interpolation?09:43
lkclhttps://en.wikipedia.org/wiki/Polynomial_interpolation09:43
lkclhttps://bugs.libre-soc.org/show_bug.cgi?id=103909:48
*** libredev <libredev!libredev@ircforever.org> has joined #libre-soc10:01
markos_I've done a similar method in the past10:34
programmerjakeidk, but parallel prefix sum and parallel tree reduction can be used for polynomial evaluation:10:36
programmerjakereduce(element_wise_mul(c, prefix_sum([1] + splat(x), op=mul)), op=add) ==10:36
programmerjakec[0]+c[1]*x+c[2]*x^2+c[3]*x^3+c[4]*x^4...10:36
programmerjakethough it tends to have lower precision than using fma because of rounding before adding10:39
lkclso it's d0 + x * (d1 + x * (d2 + x * (....)12:20
lkclwhere there's a clear relationship between c and d?12:20
*** tplaten <tplaten!~tplaten@62.144.46.89> has joined #libre-soc12:20
*** tplaten <tplaten!~tplaten@62.144.46.89> has quit IRC12:24
markos_if you do it like that it has to be calculated sequentially though13:05
markos_and given enough terms it should always get to the correct result13:07
markos_unless I'm misunderstanding something, but a reduce operation like that has to start from the inner expressions and going outward, which I don't see how it can be parallelized13:08
*** tplaten <tplaten!~tplaten@62.144.46.89> has joined #libre-soc14:46
lkclok so i have it, now, real simple. programmerjake thank you i was able to break it down into two REMAPs, a straight vec-mul and a mapreduce16:47
*** pangelo[m] <pangelo[m]!~pangeloma@2001:470:69fc:105::3ec5> has quit IRC17:06
programmerjake> so it's d0 + x * (d1 + x * (d2 + x * (....)17:32
programmerjakeyes, except i call the coefficients c[0], c[1], c[2], ... instead of d0, d1, d2, ...17:32
programmerjakethough evaluation order of the parallel prefix-sum/reduction version is different to achieve O(log N) runtime assuming the cpu has wide enough simd units to evaluate N-element vector ops in O(1)17:35
programmerjakemarkos: parallel prefix sum is evaluated like (where the `@` stands for the single-element operation, e.g. addition): https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_prefix_sum.py;h=2b88407216ccad3fc99a7d633331a30a3d3f562f;hb=4bf2f20bddc057df1597d14e0b990c0b9bdeb10e#l16717:39
programmerjakeparallel tree reduction is kinda like that, except only returning the last element and eliminating unnecessary operations for computing the non-last element. for reduction in SVP64 the tree is tilted top to bottom left rather than top to bottom right to end up with the reduction result in element 0 (assuming the predicate mask is all ones)17:44
programmerjakeso for polynomial evaluation on a wide-enough simd machine it takes approximately 3 * log2(N) steps (first 2 log2 for prefix sum, last one log2 for reduction)17:46
programmerjakewhich is waay faster than the N steps needed for serial polynomial evaluation17:47
programmerjakefor big enough N17:48
programmerjakethough if you need N independent polynomial evaluations rather than 1 it's still faster and more accurate to use the serial fma algorithm vectorized to be N-wide17:49
programmerjakesince that takes just N steps rather than N * 3 * log2(N)17:50
programmerjakeassuming no overlap between the independent parallel prefix-sums/reductions17:51
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.53.205> has quit IRC18:12
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.205.168.99> has joined #libre-soc18:13
*** octavius <octavius!~octavius@92.40.168.179.threembb.co.uk> has joined #libre-soc18:45
*** tplaten <tplaten!~tplaten@62.144.46.89> has quit IRC20:42
*** octavius <octavius!~octavius@92.40.168.179.threembb.co.uk> has quit IRC21:32

Generated by irclog2html.py 2.17.1 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!