*** jn <jn!~quassel@user/jn/x-3390946> has quit IRC | 00:07 | |
*** jn <jn!~quassel@ip-095-223-044-193.um35.pools.vodafone-ip.de> has joined #libre-soc | 00:08 | |
*** jn <jn!~quassel@user/jn/x-3390946> has joined #libre-soc | 00:08 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.205.168.99> has quit IRC | 07:05 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.170.0> has joined #libre-soc | 07:06 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.170.0> has quit IRC | 07:23 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.205.168.99> has joined #libre-soc | 07:23 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.205.168.99> has quit IRC | 07:30 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.53.205> has joined #libre-soc | 07:30 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.53.205> has quit IRC | 08:42 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.205.168.99> has joined #libre-soc | 08:43 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.205.168.99> has quit IRC | 08:48 | |
*** libredev <libredev!libredev@ircforever.org> has quit IRC | 08:48 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.53.205> has joined #libre-soc | 08:48 | |
*** libredev <libredev!libredev@libredev.ircforever.org> has joined #libre-soc | 09:15 | |
*** libredev <libredev!libredev@libredev.ircforever.org> has quit IRC | 09:36 | |
lkcl | ghostmansd[m], it is indeed: bug 1039 | 09:42 |
lkcl | does anyone know how to use parallel prefix sum to do polynomial interpolation? | 09:43 |
lkcl | https://en.wikipedia.org/wiki/Polynomial_interpolation | 09:43 |
lkcl | https://bugs.libre-soc.org/show_bug.cgi?id=1039 | 09:48 |
*** libredev <libredev!libredev@ircforever.org> has joined #libre-soc | 10:01 | |
markos_ | I've used a similar method in the past | 10:34 |
programmerjake | idk, but parallel prefix sum and parallel tree reduction can be used for polynomial evaluation: | 10:36 |
programmerjake | reduce(element_wise_mul(c, prefix_sum([1] + splat(x), op=mul)), op=add) == | 10:36 |
programmerjake | c[0]+c[1]*x+c[2]*x^2+c[3]*x^3+c[4]*x^4... | 10:36 |
programmerjake | though it tends to have lower precision than using fma because of rounding before adding | 10:39 |
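A minimal plain-Python sketch of the scheme above, with `itertools.accumulate` standing in for the parallel prefix-sum and `functools.reduce` for the tree reduction (serial stand-ins for what would be parallel vector ops; the helper name `poly_eval_prefix` is illustrative, not an SVP64 primitive):

```python
# Sketch: evaluate c[0] + c[1]*x + c[2]*x**2 + ... by building the powers of x
# with a prefix "sum" under multiplication, multiplying element-wise by the
# coefficients, then add-reducing.  Serial Python stands in for the parallel
# prefix-scan / tree-reduce steps described in the chat.
from functools import reduce
from itertools import accumulate
import operator

def poly_eval_prefix(c, x):
    # prefix_sum([1, x, x, ...], op=mul) -> [1, x, x**2, x**3, ...]
    powers = list(accumulate([1] + [x] * (len(c) - 1), operator.mul))
    # element_wise_mul(c, powers), then reduce(op=add)
    return reduce(operator.add, (ci * p for ci, p in zip(c, powers)))

# example: 1 + 2*x + 3*x**2 at x = 2  ->  1 + 4 + 12 = 17
assert poly_eval_prefix([1, 2, 3], 2) == 17
```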
lkcl | so it's d0 + x * (d1 + x * (d2 + x * (...))) | 12:20 |
lkcl | where there's a clear relationship between c and d? | 12:20 |
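That nested form is Horner's method; a minimal sketch for comparison with the parallel version, where each step is one multiply-add (a single fma on hardware) but the steps are strictly sequential:

```python
# Horner's method: d[0] + x*(d[1] + x*(d[2] + ...)), evaluated innermost-first.
# One multiply-add per coefficient, each depending on the previous one --
# which is the sequential-dependency point markos_ raises below.
def poly_eval_horner(d, x):
    acc = 0.0
    for di in reversed(d):
        acc = acc * x + di   # one fused multiply-add per coefficient
    return acc

# example: 1 + 2*x + 3*x**2 at x = 2  ->  17
assert poly_eval_horner([1, 2, 3], 2) == 17
```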
*** tplaten <tplaten!~tplaten@62.144.46.89> has joined #libre-soc | 12:20 | |
*** tplaten <tplaten!~tplaten@62.144.46.89> has quit IRC | 12:24 | |
markos_ | if you do it like that it has to be calculated sequentially though | 13:05 |
markos_ | and given enough terms it should always get to the correct result | 13:07 |
markos_ | unless I'm misunderstanding something, a reduce operation like that has to start from the innermost expression and work outward, and I don't see how that can be parallelized | 13:08 |
*** tplaten <tplaten!~tplaten@62.144.46.89> has joined #libre-soc | 14:46 | |
lkcl | ok so i have it now, real simple. programmerjake thank you, i was able to break it down into two REMAPs: a straight vec-mul and a mapreduce | 16:47 |
*** pangelo[m] <pangelo[m]!~pangeloma@2001:470:69fc:105::3ec5> has quit IRC | 17:06 | |
programmerjake | > so it's d0 + x * (d1 + x * (d2 + x * (...))) | 17:32 |
programmerjake | yes, except i call the coefficients c[0], c[1], c[2], ... instead of d0, d1, d2, ... | 17:32 |
programmerjake | though the evaluation order of the parallel prefix-sum/reduction version is different, to achieve O(log N) runtime, assuming the cpu has wide-enough simd units to evaluate N-element vector ops in O(1) | 17:35 |
programmerjake | markos: parallel prefix sum is evaluated like (where the `@` stands for the single-element operation, e.g. addition): https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_prefix_sum.py;h=2b88407216ccad3fc99a7d633331a30a3d3f562f;hb=4bf2f20bddc057df1597d14e0b990c0b9bdeb10e#l167 | 17:39 |
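A simple sketch of such a log-depth prefix scan, in the spirit of the linked test_prefix_sum.py (Hillis-Steele-style inclusive scan; not the exact work-efficient schedule in nmutil):

```python
# Log-depth prefix scan.  `op` is the single-element operation written as `@`
# above; with op=mul and input [1, x, x, x, ...] it yields [1, x, x**2, ...].
# Each while-loop pass is one parallel step: all combines in a pass are
# independent of each other.
import operator

def prefix_scan(items, op):
    items = list(items)
    dist = 1
    while dist < len(items):
        # every element at index >= dist combines with the element `dist`
        # positions to its left
        items = [op(items[i - dist], v) if i >= dist else v
                 for i, v in enumerate(items)]
        dist *= 2
    return items

# prefix products of [1, 2, 2, 2] -> [1, 2, 4, 8]
assert prefix_scan([1, 2, 2, 2], operator.mul) == [1, 2, 4, 8]
```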
programmerjake | parallel tree reduction is kinda like that, except it only returns the last element, eliminating the operations that are only needed to compute the non-final elements. for reduction in SVP64 the tree is tilted top-to-bottom-left rather than top-to-bottom-right, so that the reduction result ends up in element 0 (assuming the predicate mask is all ones) | 17:44 |
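A simplified sketch of a tree reduction that leaves the result in element 0; predicate masks and the actual SVP64 REMAP schedule are ignored here:

```python
# Tree reduction: at each step, surviving elements fold in their neighbour
# `step` positions to the right.  All folds within a step are independent,
# so each step is one parallel operation; the result lands in element 0.
import operator

def tree_reduce(items, op):
    items = list(items)
    step = 1
    while step < len(items):
        for i in range(0, len(items) - step, 2 * step):
            items[i] = op(items[i], items[i + step])
        step *= 2
    return items[0]

assert tree_reduce([1, 2, 3, 4, 5], operator.add) == 15
```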
programmerjake | so for polynomial evaluation on a wide-enough simd machine it takes approximately 3 * log2(N) steps (2 * log2(N) for the prefix sum, plus log2(N) for the reduction) | 17:46 |
programmerjake | which is waay faster than the N steps needed for serial polynomial evaluation | 17:47 |
programmerjake | for big enough N | 17:48 |
programmerjake | though if you need N independent polynomial evaluations rather than 1 it's still faster and more accurate to use the serial fma algorithm vectorized to be N-wide | 17:49 |
programmerjake | since that takes just N steps rather than N * 3 * log2(N) | 17:50 |
programmerjake | assuming no overlap between the independent parallel prefix-sums/reductions | 17:51 |
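A sketch of that trade-off: the serial fma algorithm vectorized N-wide, here shown as one set of coefficients evaluated at N independent points (numpy purely for illustration), so each loop iteration is one N-wide multiply-add step and the whole evaluation costs about N vector steps:

```python
# Vectorized Horner: one vector fma per coefficient, all N independent
# evaluations advancing in lockstep.  Roughly N steps total, versus
# N * 3 * log2(N) if each evaluation used the parallel prefix/reduce scheme.
import numpy as np

def poly_eval_horner_vec(c, xs):
    acc = np.zeros_like(xs, dtype=float)
    for ci in reversed(c):
        acc = acc * xs + ci   # one vector fused multiply-add
    return acc

xs = np.array([0.0, 1.0, 2.0])
# 1 + 2*x + 3*x**2 at x = 0, 1, 2  ->  1, 6, 17
assert np.allclose(poly_eval_horner_vec([1, 2, 3], xs), [1.0, 6.0, 17.0])
```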
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.53.205> has quit IRC | 18:12 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.205.168.99> has joined #libre-soc | 18:13 | |
*** octavius <octavius!~octavius@92.40.168.179.threembb.co.uk> has joined #libre-soc | 18:45 | |
*** tplaten <tplaten!~tplaten@62.144.46.89> has quit IRC | 20:42 | |
*** octavius <octavius!~octavius@92.40.168.179.threembb.co.uk> has quit IRC | 21:32 |
Generated by irclog2html.py 2.17.1 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!