sadoon[m] | Awesome, I'll shoot you an email with the public key soon then, thanks! | 07:56 |
---|---|---|
sadoon[m] | @lkcl | 07:56 |
lkcl | cesar, check email, are you happy with EUR 1,000 for the bitcell Formal proof work? | 15:50 |
lkcl | for the 8T one, it is after the NGI POINTER Milestone 2, so would be later, but also payable | 15:51 |
lkcl | just did a quick video about the big-int math SVP64 analysis and am doing the usual rounds https://twitter.com/lkcl/status/1517169267912986624 | 18:06 |
lkcl | https://www.youtube.com/watch?v=8hrIG7-E77o | 18:06 |
cesar | lkcl: Sure! | 18:37 |
programmerjake | watching the video, i'll note that for maddld it doesn't matter that sign-extension of RC is used, since the lower 64 bits of the result are exactly the same, which is all maddld outputs. | 19:38 |
programmerjake | hence why there isn't a maddldu | 19:38 |
programmerjake | afaict the division loop is incorrect, it should do sv.madded and sv.subfe, so sv.msubed isn't needed | 19:43 |
programmerjake | sorry i didn't yet get around to fixing the c code to do that | 19:43 |
programmerjake | since the division loop computes `temp[] = divisor[] * qhat; numerator[] -= temp[];` where [] indicates vector | 19:45 |
programmerjake | more specifically the corrected division loop should compute ^ | 19:46 |
programmerjake | created https://bugs.libre-soc.org/show_bug.cgi?id=817 to track bigint stuff | 20:05 |
lkcl | programmerjake, totally got it now (see comment #2) if you can do the c i'll do the wiki/words/writeup. | 21:18 |
lkcl | 128-bit div in 64-bit regs is going to be an absolute pig | 21:56 |
lkcl | (scalar 128-bit div in 64-bit regs, that is) | 21:56 |
lkcl | like, 2x for divisor, 2x for dividend, 2x for result, 2x for remainder | 21:57 |
lkcl | 4-in, 4-out. it's like totally out of the question | 21:57 |
lkcl | even in hardware it's going to be dicey: a typical FSM would be a whopping 256 clock cycles if done as shift-and-cmp-and-subtract | 22:01 |
lkcl | total irony: it might even be better to go down to an elwidth of 16 or even 8 and crank VL up to 32 or 64, just to get that qhat estimate to complete in a shorter amount of time! | 22:03 |
lkcl | urr this is giving me flashback nightmares to what we had to do for the Aspex Array-String-Processor :) | 22:04 |
lkcl | two competing parallel instructions that completed in different times, where you had to design and write *multiple* algorithms with different bit-widths, and use the "best" | 22:05 |
* lkcl | shudders | 22:05 |
lkcl | in this case, the fact that that scalar-estimate would stall for say 64 or greater cycles trying to compute a 128-bit scalar division in hardware, you *can't* run the SIMD-Muls because they'd be stalled waiting for qhat! | 22:07 |
lkcl | even if you manually did the 128-bit divide in software, using 64-bit scalar arith, you'd still be twiddling thumbs | 22:08 |
lkcl | ironically, doing a 16-bit scalar div to produce qhat, then doing sv.madded/ew=8 @ VL=64, i suspect it would be faster *even though* it'd be doing more work (O(N^2)) | 22:11 |
lkcl | nggggh :) | 22:11 |
programmerjake | it's 128x64->64 division, so 2x 64 for numerator, 1x 64 for denominator, and 1x 64 for result...so 3-in 1-out, not 4-in 2-out | 22:34 |
programmerjake | once you have enough bits in your division, it's more efficient to use a different faster algorithm, like goldschmidt division: | 22:40 |
programmerjake | https://en.wikipedia.org/wiki/Division_algorithm#Goldschmidt_division | 22:40 |
lkcl | am still scared of it! :) | 22:45 |
lkcl | ooo i like that goldschmidt thing | 22:47 |
lkcl | 3-in 1-out is manageable | 22:47 |
lkcl | still need the modulo | 22:47 |
lkcl | (rhat) | 22:47 |
lkcl | ya know... the goldschmidt reminds me of Algorithm D, but at the bit-level not the RADIX-64 level | 22:48 |
programmerjake | simply have another 3-in 1-out instruction for the remainder | 22:49 |
lkcl | yeh grok it | 22:49 |
lkcl | or, have RC=RA+1 | 22:50 |
lkcl | (similar to lq/stq) | 22:50 |
lkcl | same trick | 22:50 |
lkcl | so basically in goldschmidt, you're trying to get D to be 0xffff_ffff_ffff_ffff | 22:52 |
programmerjake | goldschmidt division in hw takes 2*log2(N) or so N-bit multiplies for a N-bit division...algorithm D is much slower at around N N-bit multiplies | 22:53 |
lkcl | nnnggggh brain-melt | 22:53 |
lkcl | i need to look up a ref implementation | 22:54 |
programmerjake | basically goldschmidt division is what i'm saying 128x64-bit division should probably be implemented with | 22:54 |
lkcl | using fixed-point math not FP | 22:55 |
programmerjake | yup | 22:55 |
lkcl | that's what confused me, then, with 0 < D < 1 | 22:55 |
programmerjake | it can also be used for implementing fp div if we like... | 22:55 |
programmerjake | just pretend your 64-bit divisor is a fraction 0x0.XXXX.... | 22:56 |
* lkcl | yeyeh | 22:57 |
lkcl | https://stackoverflow.com/questions/2661541/picking-good-first-estimates-for-goldschmidt-division | 22:58 |
programmerjake | see also https://en.m.wikipedia.org/wiki/Methods_of_computing_square_roots#Goldschmidt%E2%80%99s_algorithm | 23:00 |
lkcl | fascinating | 23:02 |
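
For reference, the corrected division-loop step from the log (`temp[] = divisor[] * qhat; numerator[] -= temp[];`) can be sketched in Python on 64-bit limbs. This is only a model of the arithmetic, not of the SVP64 instruction semantics: the helper names `to_limbs`, `from_limbs`, `mul_by_word` and `sub_limbs` are hypothetical, limbs are assumed little-endian, and the subtract uses a plain borrow flag rather than the Power ISA carry convention that sv.subfe would use.

```python
MASK64 = (1 << 64) - 1

def to_limbs(value, count):
    """Split a non-negative integer into `count` little-endian 64-bit limbs."""
    return [(value >> (64 * i)) & MASK64 for i in range(count)]

def from_limbs(limbs):
    """Rebuild the integer from little-endian 64-bit limbs."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

def mul_by_word(divisor, qhat):
    """temp[] = divisor[] * qhat (the job the log gives to sv.madded).

    Each step is a 64x64 multiply plus the previous high half, which
    always fits in 128 bits. Returns len(divisor) + 1 limbs.
    """
    temp = []
    carry = 0
    for limb in divisor:
        prod = limb * qhat + carry
        temp.append(prod & MASK64)   # low 64 bits stay in this limb
        carry = prod >> 64           # high 64 bits feed the next limb
    temp.append(carry)
    return temp

def sub_limbs(numerator, temp):
    """numerator[] -= temp[] with a borrow chain (the job given to sv.subfe).

    Both lists must have the same length. Returns (limbs, borrow_out);
    borrow_out == 1 means qhat was too large for this numerator.
    """
    if len(numerator) != len(temp):
        raise ValueError("limb counts must match")
    out = []
    borrow = 0
    for n, t in zip(numerator, temp):
        diff = n - t - borrow
        out.append(diff & MASK64)
        borrow = 1 if diff < 0 else 0
    return out, borrow

# Usage: a 2-limb divisor times a 1-limb qhat, subtracted from 3 limbs.
divisor = to_limbs(0x123456789abcdef0fedcba9876543210, 2)
numerator = to_limbs(0x123456789abcdef0fedcba9876543210 * 3 + 7, 3)
remainder, borrow = sub_limbs(numerator, mul_by_word(divisor, 3))
```

Splitting the step into a multiply pass and a subtract pass mirrors the point made in the log that a fused multiply-subtract (sv.msubed) is not needed.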
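
The fixed-point Goldschmidt idea discussed above (treat the divisor as a fraction 0x0.XXXX..., then keep multiplying numerator and divisor by the same factor until the divisor is just under 1.0) can be sketched as follows. This is a minimal Python model under stated assumptions, not a hardware design: the function name `goldschmidt_divmod`, the choice of 8 guard bits, the default of 7 iterations, and the final correction loops are all assumptions made for illustration. The remainder (rhat) is recovered with one extra multiply and subtract.

```python
def goldschmidt_divmod(n, d, width=64, iterations=7):
    """Divide a 2*width-bit n by a width-bit d using Goldschmidt iteration.

    Requires 0 < d < 2**width and 0 <= n < d * 2**width, so that the
    quotient fits in `width` bits (the 128x64->64 case for width=64).
    Returns (quotient, remainder).
    """
    if not 0 < d < (1 << width):
        raise ValueError("divisor out of range")
    if not 0 <= n < (d << width):
        raise ValueError("quotient would not fit in width bits")

    frac = 2 * width + 8          # fraction bits; 8 guard bits is an assumption
    one = 1 << frac

    # Normalise so the divisor reads as a fraction in [0.5, 1.0).
    shift = width - d.bit_length()
    D = (d << shift) << (frac - width)
    # Scale the numerator the same way; N / D then equals quotient / 2**width.
    N = (n << shift) << (frac - 2 * width)

    for _ in range(iterations):
        F = 2 * one - D           # F = 2 - D in fixed point
        N = (N * F) >> frac       # two multiplies per iteration
        D = (D * F) >> frac       # D converges quadratically towards 1.0

    q = N >> (frac - width)       # N now approximates quotient / 2**width

    # Truncation can leave the estimate slightly off; fix it up exactly.
    r = n - q * d
    while r < 0:
        q -= 1
        r += d
    while r >= d:
        q += 1
        r -= d
    return q, r

# Usage: a 128-bit numerator divided by a 64-bit divisor.
n = (0x0123456789abcdef << 64) | 0xfedcba9876543210
quotient, remainder = goldschmidt_divmod(n, 0xfedcba9876543211)
```

Each iteration squares the distance between D and 1.0, so starting from a divisor of at least 0.5 the iteration count grows with log2 of the bit width, with two multiplies per iteration, which is consistent with the "2*log2(N) or so" multiply count quoted in the log.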