Thursday, 2022-04-21

sadoon[m]	Awesome, I'll shoot you an email with the public key soon then, thanks!	07:56
sadoon[m]	@lkcl	07:56
lkcl	cesar, check email, are you happy with EUR 1,000 for the bitcell Formal proof work?	15:50
lkcl	for the 8T one, it is after the NGI POINTER Milestone 2, so would be later, but also payable	15:51
lkcl	just did a quick video about the big-int math SVP64 analysis and am doing the usual rounds https://twitter.com/lkcl/status/1517169267912986624	18:06
lkcl	https://www.youtube.com/watch?v=8hrIG7-E77o	18:06
cesar	lkcl: Sure!	18:37
programmerjake	watching the video, i'll note that for maddld it doesn't matter that sign-extension of RC is used, since the lower 64 bits of the result are exactly the same, which is all maddld outputs.	19:38
programmerjake	hence why there isn't a maddldu	19:38
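
The point about maddld can be checked with a small model. The sketch below is not ISA pseudocode: the function names are hypothetical, and it simply compares the low 64 bits of RA*RB + RC when the operands are read as signed versus unsigned two's-complement values.

```python
MASK64 = (1 << 64) - 1


def to_signed64(x):
    """Interpret a 64-bit pattern as a two's-complement signed integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def maddld_signed(ra, rb, rc):
    """Low 64 bits of RA*RB + RC with every operand read as signed."""
    return (to_signed64(ra) * to_signed64(rb) + to_signed64(rc)) & MASK64


def maddld_unsigned(ra, rb, rc):
    """Low 64 bits of RA*RB + RC with every operand read as unsigned."""
    return ((ra & MASK64) * (rb & MASK64) + (rc & MASK64)) & MASK64
```

Both readings differ only by multiples of 2^64, which the final mask discards, so a separate unsigned variant would produce the same output.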
programmerjake	afaict the division loop is incorrect, it should do sv.madded and sv.subfe, so sv.msubed isn't needed	19:43
programmerjake	sorry i didn't yet get around to fixing the c code to do that	19:43
programmerjake	since the division loop computes `temp[] = divisor[] * qhat; numerator[] -= temp[];` where [] indicates vector	19:45
programmerjake	more specifically the corrected division loop should compute ^	19:46
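
A minimal Python sketch of the corrected loop body described above, assuming little-endian vectors of 64-bit words. The helper names are hypothetical; the carry chaining only loosely models what sv.madded and sv.subfe would do (the real subfe uses an inverted-borrow carry flag, which is not modelled here).

```python
MASK64 = (1 << 64) - 1


def words_to_int(words):
    """Combine little-endian 64-bit words into one Python integer."""
    return sum(w << (64 * i) for i, w in enumerate(words))


def mul_vec_by_scalar(divisor, qhat):
    """temp[] = divisor[] * qhat, chaining the high half into the next word."""
    temp = []
    carry = 0
    for d in divisor:
        prod = d * qhat + carry
        temp.append(prod & MASK64)   # low 64 bits stay in this element
        carry = prod >> 64           # high 64 bits feed the next element
    temp.append(carry)               # result is one word longer than divisor
    return temp


def sub_vec(numerator, temp):
    """numerator[] -= temp[] with a chained borrow; returns (words, borrow_out)."""
    result = []
    borrow = 0
    for n, t in zip(numerator, temp):
        diff = n - t - borrow
        result.append(diff & MASK64)
        borrow = 1 if diff < 0 else 0
    return result, borrow
```

A borrow_out of 1 means qhat was too large for this step, which is the case where Algorithm D adds the divisor back and decrements qhat.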
programmerjake	created https://bugs.libre-soc.org/show_bug.cgi?id=817 to track bigint stuff	20:05
lkcl	programmerjake, totally got it now (see comment #2) if you can do the c i'll do the wiki/words/writeup.	21:18
lkcl	128-bit div in 64-bit regs is going to be an absolute pig	21:56
lkcl	(scalar 128-bit div in 64-bit regs, that is)	21:56
lkcl	like, 2x for divisor, 2x for dividend, 2x for result, 2x for remainder	21:57
lkcl	4-in, 4-out. it's like totally out of the question	21:57
lkcl	even in hardware it's going to be dicey: a typical FSM would be a whopping 256 clock cycles if done as shift-and-cmp-and-subtract	22:01
lkcl	total irony: it might even be better to go down to an elwidth of 16 or even 8 and crank VL up to 32 or 64, just to get that qhat estimate to complete in a shorter amount of time!	22:03
lkcl	urr this is giving me flashback nightmares to what we had to do for the Aspex Array-String-Processor :)	22:04
lkcl	two competing parallel instructions that completed in different times, where you had to design and write multiple algorithms with different bit-widths, and use the "best"	22:05
* lkcl shudders		22:05
lkcl	in this case, the fact that that scalar-estimate would stall for say 64 or greater cycles trying to compute a 128-bit scalar division in hardware, you can't run the SIMD-Muls because they'd be stalled waiting for qhat!	22:07
lkcl	even if you manually did the 128-bit divide in software, using 64-bit scalar arith, you'd still be twiddling thumbs	22:08
lkcl	ironically, doing a 16-bit scalar div to produce qhat, then doing sv.madded/ew=8 @ VL=64, i suspect it would be faster even though it'd be doing more work (O(N^2))	22:11
lkcl	nggggh :)	22:11
programmerjake	it's 128x64->64 division, so 2x 64 for numerator, 1x 64 for denominator, and 1x 64 for result...so 3-in 1-out, not 4-in 2-out	22:34
programmerjake	once you have enough bits in your division, it's more efficient to use a different faster algorithm, like goldschmidt division:	22:40
programmerjake	https://en.wikipedia.org/wiki/Division_algorithm#Goldschmidt_division	22:40
lkcl	am still scared of it! :)	22:45
lkcl	ooo i like that goldschmidt thing	22:47
lkcl	3-in 1-out is manageable	22:47
lkcl	still need the modulo	22:47
lkcl	(rhat)	22:47
lkcl	ya know... the goldschmidt reminds me of Algorithm D, but at the bit-level not the RADIX-64 level	22:48
programmerjake	simply have another 3-in 1-out instruction for the remainder	22:49
lkcl	yeh grok it	22:49
lkcl	or, have RC=RA+1	22:50
lkcl	(similar to lq/stq)	22:50
lkcl	same trick	22:50
lkcl	so basically in goldschmidt, you're trying to get D to be 0xffff_ffff_ffff_ffff	22:52
programmerjake	goldschmidt division in hw takes 2*log2(N) or so N-bit multiplies for a N-bit division...algorithm D is much slower at around N N-bit multiplies	22:53
lkcl	nnnggggh brain-melt	22:53
lkcl	i need to look up a ref implementation	22:54
programmerjake	basically goldschmidt division is what i'm saying 128x64-bit division should probably be implemented with	22:54
lkcl	using fixed-point math not FP	22:55
programmerjake	yup	22:55
lkcl	that's what confused me, then, with 0 < D < 1	22:55
programmerjake	it can also be used for implementing fp div if we like...	22:55
programmerjake	just pretend your 64-bit divisor is a fraction 0x0.XXXX....	22:56
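
A rough fixed-point sketch of the Goldschmidt idea for the 128x64->64 case discussed above, with the normalised divisor treated as a fraction in [0.5, 1). Everything here is an assumption for illustration: the function name, the 192 fractional bits, the fixed count of 7 iterations, the lack of a table-based first estimate, and the final correction loops, which are added to make the result exact rather than taken from the discussion.

```python
def goldschmidt_divmod(n, d, bits=64, frac_bits=192, iterations=7):
    """Return (n // d, n % d) for a 2*bits-wide n and a bits-wide nonzero d."""
    if d <= 0 or d >> bits:
        raise ValueError("divisor must be nonzero and fit in `bits` bits")
    if n < 0 or (n >> bits) >= d:
        raise ValueError("quotient would not fit in `bits` bits")

    one = 1 << frac_bits
    # Normalise so the divisor's top bit is set: D then represents 0.1xxx...
    shift = bits - d.bit_length()
    D = (d << shift) << (frac_bits - bits)
    N = (n << shift) << (frac_bits - bits)

    # Each pass scales N and D by F = 2 - D, driving D towards 1.0
    # (all-ones in fixed point) while N converges on the quotient.
    for _ in range(iterations):
        F = 2 * one - D
        N = (N * F) >> frac_bits
        D = (D * F) >> frac_bits

    q = N >> frac_bits
    # Truncation can leave the estimate slightly off, so fix it up exactly.
    r = n - q * d
    while r < 0:
        q -= 1
        r += d
    while r >= d:
        q += 1
        r -= d
    return q, r
```

Each pass costs two wide multiplies, and the error in D roughly squares every time, which is where the figure of about 2*log2(N) multiplies comes from. Returning the remainder alongside the quotient corresponds to the rhat that is still needed.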
* lkcl yeyeh		22:57
lkcl	https://stackoverflow.com/questions/2661541/picking-good-first-estimates-for-goldschmidt-division	22:58
programmerjake	see also https://en.m.wikipedia.org/wiki/Methods_of_computing_square_roots#Goldschmidt%E2%80%99s_algorithm	23:00
lkcl	fascinating	23:02
