*** jn <jn!~quassel@user/jn/x-3390946> has quit IRC | 00:07 | |
*** jn <jn!~quassel@ip-095-223-044-193.um35.pools.vodafone-ip.de> has joined #libre-soc | 00:08 | |
*** jn <jn!~quassel@user/jn/x-3390946> has joined #libre-soc | 00:08 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.205.168.99> has quit IRC | 07:05 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.170.0> has joined #libre-soc | 07:06 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.170.0> has quit IRC | 07:23 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.205.168.99> has joined #libre-soc | 07:23 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.205.168.99> has quit IRC | 07:30 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.53.205> has joined #libre-soc | 07:30 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.53.205> has quit IRC | 08:42 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.205.168.99> has joined #libre-soc | 08:43 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.205.168.99> has quit IRC | 08:48 | |
*** libredev <libredev!libredev@ircforever.org> has quit IRC | 08:48 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.53.205> has joined #libre-soc | 08:48 | |
*** libredev <libredev!libredev@libredev.ircforever.org> has joined #libre-soc | 09:15 | |
*** libredev <libredev!libredev@libredev.ircforever.org> has quit IRC | 09:36 | |
lkcl | ghostmansd[m], it is indeed: bug 1039 | 09:42 |
lkcl | does anyone know how to use parallel prefix sum to do polynomial interpolation? | 09:43 |
lkcl | https://en.wikipedia.org/wiki/Polynomial_interpolation | 09:43 |
lkcl | https://bugs.libre-soc.org/show_bug.cgi?id=1039 | 09:48 |
*** libredev <libredev!libredev@ircforever.org> has joined #libre-soc | 10:01 | |
markos_ | I've used a similar method in the past | 10:34 |
programmerjake | idk, but parallel prefix sum and parallel tree reduction can be used for polynomial evaluation: | 10:36 |
programmerjake | reduce(element_wise_mul(c, prefix_sum([1] + splat(x), op=mul)), op=add) == | 10:36 |
programmerjake | c[0]+c[1]*x+c[2]*x^2+c[3]*x^3+c[4]*x^4... | 10:36 |
programmerjake | though it tends to have lower precision than using fma because of rounding before adding | 10:39 |
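A minimal plain-Python sketch of the scheme above, with `itertools.accumulate` standing in for the parallel prefix-sum and `functools.reduce` for the tree reduction (serial stand-ins for what would be parallel vector ops; the helper name `poly_eval_prefix` is illustrative, not an SVP64 primitive):

```python
# Sketch: evaluate c[0] + c[1]*x + c[2]*x**2 + ... by building the powers of x
# with a prefix "sum" under multiplication, multiplying element-wise by the
# coefficients, then add-reducing.  Serial Python stands in for the parallel
# prefix-scan / tree-reduce steps described in the chat.
from functools import reduce
from itertools import accumulate
import operator

def poly_eval_prefix(c, x):
    # prefix_sum([1, x, x, ...], op=mul) -> [1, x, x**2, x**3, ...]
    powers = list(accumulate([1] + [x] * (len(c) - 1), operator.mul))
    # element_wise_mul(c, powers), then reduce(op=add)
    return reduce(operator.add, (ci * p for ci, p in zip(c, powers)))

# example: 1 + 2*x + 3*x**2 at x = 2  ->  1 + 4 + 12 = 17
assert poly_eval_prefix([1, 2, 3], 2) == 17
```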
lkcl | so it's d0 + x * (d1 + x * (d2 + x * (...))) | 12:20 |
lkcl | where there's a clear relationship between c and d? | 12:20 |
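That nested form is Horner's method; a minimal sketch for comparison with the parallel version, where each step is one multiply-add (a single fma on hardware) but the steps are strictly sequential:

```python
# Horner's method: d[0] + x*(d[1] + x*(d[2] + ...)), evaluated innermost-first.
# One multiply-add per coefficient, each depending on the previous one --
# which is the sequential-dependency point markos_ raises below.
def poly_eval_horner(d, x):
    acc = 0.0
    for di in reversed(d):
        acc = acc * x + di   # one fused multiply-add per coefficient
    return acc

# example: 1 + 2*x + 3*x**2 at x = 2  ->  17
assert poly_eval_horner([1, 2, 3], 2) == 17
```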
*** tplaten <tplaten!~tplaten@62.144.46.89> has joined #libre-soc | 12:20 | |
*** tplaten <tplaten!~tplaten@62.144.46.89> has quit IRC | 12:24 | |
markos_ | if you do it like that it has to be calculated sequentially though | 13:05 |
markos_ | and given enough terms it should always get to the correct result | 13:07 |
markos_ | unless I'm misunderstanding something, a reduce operation like that has to start from the innermost expression and work outward, and I don't see how that can be parallelized | 13:08 |
*** tplaten <tplaten!~tplaten@62.144.46.89> has joined #libre-soc | 14:46 | |
lkcl | ok so i have it now, real simple. programmerjake thank you, i was able to break it down into two REMAPs: a straight vec-mul and a mapreduce | 16:47 |
*** pangelo[m] <pangelo[m]!~pangeloma@2001:470:69fc:105::3ec5> has quit IRC | 17:06 | |
programmerjake | > so it's d0 + x * (d1 + x * (d2 + x * (...))) | 17:32 |
programmerjake | yes, except i call the coefficients c[0], c[1], c[2], ... instead of d0, d1, d2, ... | 17:32 |
programmerjake | though the evaluation order of the parallel prefix-sum/reduction version is different, to achieve O(log N) runtime, assuming the cpu has wide-enough simd units to evaluate N-element vector ops in O(1) | 17:35 |
programmerjake | markos: parallel prefix sum is evaluated like (where the `@` stands for the single-element operation, e.g. addition): https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_prefix_sum.py;h=2b88407216ccad3fc99a7d633331a30a3d3f562f;hb=4bf2f20bddc057df1597d14e0b990c0b9bdeb10e#l167 | 17:39 |
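A simple sketch of such a log-depth prefix scan, in the spirit of the linked test_prefix_sum.py (Hillis-Steele-style inclusive scan; not the exact work-efficient schedule in nmutil):

```python
# Log-depth prefix scan.  `op` is the single-element operation written as `@`
# above; with op=mul and input [1, x, x, x, ...] it yields [1, x, x**2, ...].
# Each while-loop pass is one parallel step: all combines in a pass are
# independent of each other.
import operator

def prefix_scan(items, op):
    items = list(items)
    dist = 1
    while dist < len(items):
        # every element at index >= dist combines with the element `dist`
        # positions to its left
        items = [op(items[i - dist], v) if i >= dist else v
                 for i, v in enumerate(items)]
        dist *= 2
    return items

# prefix products of [1, 2, 2, 2] -> [1, 2, 4, 8]
assert prefix_scan([1, 2, 2, 2], operator.mul) == [1, 2, 4, 8]
```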
programmerjake | parallel tree reduction is kinda like that, except it only returns the last element, eliminating the operations that are only needed to compute the non-final elements. for reduction in SVP64 the tree is tilted top-to-bottom-left rather than top-to-bottom-right, so that the reduction result ends up in element 0 (assuming the predicate mask is all ones) | 17:44 |
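A simplified sketch of a tree reduction that leaves the result in element 0; predicate masks and the actual SVP64 REMAP schedule are ignored here:

```python
# Tree reduction: at each step, surviving elements fold in their neighbour
# `step` positions to the right.  All folds within a step are independent,
# so each step is one parallel operation; the result lands in element 0.
import operator

def tree_reduce(items, op):
    items = list(items)
    step = 1
    while step < len(items):
        for i in range(0, len(items) - step, 2 * step):
            items[i] = op(items[i], items[i + step])
        step *= 2
    return items[0]

assert tree_reduce([1, 2, 3, 4, 5], operator.add) == 15
```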
programmerjake | so for polynomial evaluation on a wide-enough simd machine it takes approximately 3 * log2(N) steps (2 * log2(N) for the prefix sum, plus log2(N) for the reduction) | 17:46 |
programmerjake | which is waay faster than the N steps needed for serial polynomial evaluation | 17:47 |
programmerjake | for big enough N | 17:48 |
programmerjake | though if you need N independent polynomial evaluations rather than 1 it's still faster and more accurate to use the serial fma algorithm vectorized to be N-wide | 17:49 |
programmerjake | since that takes just N steps rather than N * 3 * log2(N) | 17:50 |
programmerjake | assuming no overlap between the independent parallel prefix-sums/reductions | 17:51 |
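A sketch of that trade-off: the serial fma algorithm vectorized N-wide, here shown as one set of coefficients evaluated at N independent points (numpy purely for illustration), so each loop iteration is one N-wide multiply-add step and the whole evaluation costs about N vector steps:

```python
# Vectorized Horner: one vector fma per coefficient, all N independent
# evaluations advancing in lockstep.  Roughly N steps total, versus
# N * 3 * log2(N) if each evaluation used the parallel prefix/reduce scheme.
import numpy as np

def poly_eval_horner_vec(c, xs):
    acc = np.zeros_like(xs, dtype=float)
    for ci in reversed(c):
        acc = acc * xs + ci   # one vector fused multiply-add
    return acc

xs = np.array([0.0, 1.0, 2.0])
# 1 + 2*x + 3*x**2 at x = 0, 1, 2  ->  1, 6, 17
assert np.allclose(poly_eval_horner_vec([1, 2, 3], xs), [1.0, 6.0, 17.0])
```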
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.53.205> has quit IRC | 18:12 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.205.168.99> has joined #libre-soc | 18:13 | |
*** octavius <octavius!~octavius@92.40.168.179.threembb.co.uk> has joined #libre-soc | 18:45 | |
*** tplaten <tplaten!~tplaten@62.144.46.89> has quit IRC | 20:42 | |
*** octavius <octavius!~octavius@92.40.168.179.threembb.co.uk> has quit IRC | 21:32 |
Generated by irclog2html.py 2.17.1 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!