Thursday, 2022-04-21

<sadoon[m]> Awesome, I'll shoot you an email with the public key soon then, thanks! [07:56]
<sadoon[m]> @lkcl [07:56]
<lkcl> cesar, check email, are you happy with EUR 1,000 for the bitcell Formal proof work? [15:50]
<lkcl> for the 8T one, it is after the NGI POINTER Milestone 2, so would be later, but also payable [15:51]
<lkcl> just did a quick video about the big-int math SVP64 analysis and am doing the usual rounds https://twitter.com/lkcl/status/1517169267912986624 [18:06]
<lkcl> https://www.youtube.com/watch?v=8hrIG7-E77o [18:06]
<cesar> lkcl: Sure! [18:37]
<programmerjake> watching the video, i'll note that for maddld it doesn't matter that sign-extension of RC is used, since the lower 64 bits of the result are exactly the same, and that is all maddld outputs. [19:38]
<programmerjake> hence why there isn't a maddldu [19:38]
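A quick way to check the maddld point above is to compute the low 64 bits of a multiply-add twice, once treating the operands as signed and once as unsigned. The sketch below is plain Python integer arithmetic, not a model of the actual instruction; the helper names `to_signed64` and `madd_low64` are made up for illustration.

```python
MASK64 = (1 << 64) - 1

def to_signed64(x):
    """Reinterpret a 64-bit pattern as a two's-complement signed integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x

def madd_low64(ra, rb, rc, signed):
    """Low 64 bits of ra*rb + rc, with inputs read as signed or unsigned.
    (Hypothetical helper, only meant to illustrate the arithmetic.)"""
    if signed:
        ra, rb, rc = to_signed64(ra), to_signed64(rb), to_signed64(rc)
    # Python's & on a negative int behaves like infinite two's complement,
    # so masking gives the value modulo 2**64 in both cases.
    return (ra * rb + rc) & MASK64

ra, rb, rc = 0xFFFF_FFFF_FFFF_FFFE, 0x1234_5678_9ABC_DEF0, 0x8000_0000_0000_0001
assert madd_low64(ra, rb, rc, signed=True) == madd_low64(ra, rb, rc, signed=False)
```

The two interpretations differ only by multiples of 2**64, which the mask discards, so the low half is identical either way.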
<programmerjake> afaict the division loop is incorrect, it should do sv.madded and sv.subfe, so sv.msubed isn't needed [19:43]
<programmerjake> sorry i didn't yet get around to fixing the c code to do that [19:43]
<programmerjake> since the division loop computes `temp[] = divisor[] * qhat; numerator[] -= temp[];` where [] indicates vector [19:45]
<programmerjake> more specifically the corrected division loop should compute ^ [19:46]
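The two-step loop described above (`temp[] = divisor[] * qhat; numerator[] -= temp[];`) can be sketched on little-endian 64-bit limbs as below. This is only an illustration in Python: the function names are hypothetical, and the comments about which step corresponds to sv.madded or sv.subfe are a loose reading of the discussion, not the instruction definitions.

```python
MASK64 = (1 << 64) - 1

def to_limbs(value, count):
    """Split an integer into `count` little-endian 64-bit limbs."""
    return [(value >> (64 * i)) & MASK64 for i in range(count)]

def from_limbs(limbs):
    """Reassemble little-endian 64-bit limbs into one integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

def mul_by_word(divisor, qhat):
    """temp[] = divisor[] * qhat, one extra limb out.
    Each step is a 64x64+64 -> 128 multiply-add whose high half is carried
    into the next limb (the multiply-add step of the loop)."""
    carry, temp = 0, []
    for limb in divisor:
        prod = limb * qhat + carry
        temp.append(prod & MASK64)
        carry = prod >> 64
    temp.append(carry)
    return temp

def sub_in_place(numerator, temp):
    """numerator[] -= temp[] with a borrow chain (the subtract step).
    Returns the final borrow: 1 means temp was larger than numerator."""
    borrow = 0
    for i, t in enumerate(temp):
        diff = numerator[i] - t - borrow
        numerator[i] = diff & MASK64
        borrow = 1 if diff < 0 else 0
    return borrow

d_int = 0x0123_4567_89AB_CDEF_FEDC_BA98_7654_3210
qhat = 0xDEAD_BEEF
numerator = to_limbs(d_int * qhat + 12345, 3)
borrow = sub_in_place(numerator, mul_by_word(to_limbs(d_int, 2), qhat))
assert (from_limbs(numerator), borrow) == (12345, 0)
```

In Knuth's Algorithm D a final borrow of 1 signals that qhat was an overestimate and an add-back step is needed; that correction is left out of this sketch.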
<programmerjake> created https://bugs.libre-soc.org/show_bug.cgi?id=817 to track bigint stuff [20:05]
<lkcl> programmerjake, totally got it now (see comment #2) if you can do the c i'll do the wiki/words/writeup. [21:18]
<lkcl> 128-bit div in 64-bit regs is going to be an absolute pig [21:56]
<lkcl> (scalar 128-bit div in 64-bit regs, that is) [21:56]
<lkcl> like, 2x for divisor, 2x for dividend, 2x for result, 2x for remainder [21:57]
<lkcl> 4-in, 4-out. it's like totally out of the question [21:57]
<lkcl> even in hardware it's going to be dicey: a typical FSM would be a whopping 256 clock cycles if done as shift-and-cmp-and-subtract [22:01]
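For reference, shift-and-compare-and-subtract division produces one quotient bit per pass, so a 128-bit dividend needs 128 passes. The sketch below counts passes only; how many clock cycles each pass costs depends on the hardware, and the 256-cycle figure above is not derived here. The function name and the `steps` counter are illustrative assumptions.

```python
def shift_sub_divide(dividend, divisor, width=128):
    """Restoring (shift-compare-subtract) division, one quotient bit per pass.
    Returns (quotient, remainder, number_of_passes)."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    remainder = 0
    quotient = 0
    steps = 0
    for bit in range(width - 1, -1, -1):
        # Shift the next dividend bit into the running remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        # Compare, and subtract only when the divisor fits.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
        steps += 1
    return quotient, remainder, steps

n = (0x1234_5678_9ABC_DEF0 << 64) | 0x0FED_CBA9_8765_4321
d = 0xFFFF_0000_FFFF_0001
q, r, steps = shift_sub_divide(n, d)
assert (q, r) == divmod(n, d) and steps == 128
```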
<lkcl> total irony: it might even be better to go down to an elwidth of 16 or even 8 and crank VL up to 32 or 64, just to get that qhat estimate to complete in a shorter amount of time! [22:03]
<lkcl> urr this is giving me flashback nightmares to what we had to do for the Aspex Array-String-Processor :) [22:04]
<lkcl> two competing parallel instructions that completed in different times, where you had to design and write *multiple* algorithms with different bit-widths, and use the "best" [22:05]
* lkcl shudders [22:05]
<lkcl> in this case, because that scalar estimate would stall for, say, 64 or more cycles trying to compute a 128-bit scalar division in hardware, you *can't* run the SIMD-Muls: they'd be stalled waiting for qhat! [22:07]
<lkcl> even if you manually did the 128-bit divide in software, using 64-bit scalar arith, you'd still be twiddling thumbs [22:08]
<lkcl> ironically, doing a 16-bit scalar div to produce qhat, then doing sv.madded/ew=8 @ VL=64, i suspect it would be faster *even though* it'd be doing more work (O(N^2)) [22:11]
<lkcl> nggggh :) [22:11]
<programmerjake> it's 128x64->64 division, so 2x 64 for numerator, 1x 64 for denominator, and 1x 64 for result...so 3-in 1-out, not 4-in 2-out [22:34]
<programmerjake> once you have enough bits in your division, it's more efficient to use a different faster algorithm, like goldschmidt division: [22:40]
<programmerjake> https://en.wikipedia.org/wiki/Division_algorithm#Goldschmidt_division [22:40]
<lkcl> am still scared of it! :) [22:45]
<lkcl> ooo i like that goldschmidt thing [22:47]
<lkcl> 3-in 1-out is manageable [22:47]
<lkcl> still need the modulo [22:47]
<lkcl> (rhat) [22:47]
<lkcl> ya know... the goldschmidt reminds me of Algorithm D, but at the bit-level not the RADIX-64 level [22:48]
<programmerjake> simply have another 3-in 1-out instruction for the remainder [22:49]
<lkcl> yeh grok it [22:49]
<lkcl> or, have RC=RA+1 [22:50]
<lkcl> (similar to lq/stq) [22:50]
<lkcl> same trick [22:50]
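The operand counting above (two 64-bit words for the numerator, one for the divisor, one 64-bit result) can be written out as a pair of functions, one for the quotient and one for the remainder. Both names are hypothetical and nothing here describes an actual instruction encoding; the overflow check is simply the condition under which the quotient fits in 64 bits.

```python
MASK64 = (1 << 64) - 1

def _check(hi, lo, divisor):
    """Reject inputs whose quotient cannot fit in a single 64-bit word."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    if hi >= divisor:
        # (hi:lo) >= divisor * 2**64, so the quotient would need over 64 bits.
        raise OverflowError("quotient does not fit in 64 bits")

def div_128_by_64(hi, lo, divisor):
    """3-in 1-out: quotient of the 128-bit value (hi:lo) by a 64-bit divisor."""
    _check(hi, lo, divisor)
    return ((hi << 64) | lo) // divisor

def mod_128_by_64(hi, lo, divisor):
    """The matching 3-in 1-out remainder (the 'rhat' side of the estimate)."""
    _check(hi, lo, divisor)
    return ((hi << 64) | lo) % divisor

hi, lo, d = 5, 7, 10
q, r = div_128_by_64(hi, lo, d), mod_128_by_64(hi, lo, d)
assert q * d + r == (hi << 64) | lo and q <= MASK64 and r < d
```

Requiring `hi < divisor` is what keeps both outputs within one 64-bit register each.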
<lkcl> so basically in goldschmidt, you're trying to get D to be 0xffff_ffff_ffff_ffff [22:52]
<programmerjake> goldschmidt division in hw takes 2*log2(N) or so N-bit multiplies for an N-bit division...algorithm D is much slower at around N N-bit multiplies [22:53]
<lkcl> nnnggggh brain-melt [22:53]
<lkcl> i need to look up a ref implementation [22:54]
<programmerjake> basically goldschmidt division is what i'm saying 128x64-bit division should probably be implemented with [22:54]
<lkcl> using fixed-point math not FP [22:55]
<programmerjake> yup [22:55]
<lkcl> that's what confused me, then, with 0 < D < 1 [22:55]
<programmerjake> it can also be used for implementing fp div if we like... [22:55]
<programmerjake> just pretend your 64-bit divisor is a fraction 0x0.XXXX.... [22:56]
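A minimal fixed-point sketch of the Goldschmidt idea discussed above: the divisor is normalised and read as a fraction D in [0.5, 1), then numerator and divisor are both multiplied by F = 2 - D until D is within rounding of 1. This is not a reference implementation. The function name, the guard-bit count, the fixed count of 7 iterations, and the final correction loops are all assumptions made for illustration, and the numerator is assumed to be below `d << width`.

```python
def goldschmidt_divmod(n, d, width=64, guard=8, iterations=7):
    """Goldschmidt division on integers using fixed-point fractions.
    Intended for 0 <= n < (d << width); returns (quotient, remainder)."""
    if not 0 < d < (1 << width):
        raise ValueError("divisor must be a non-zero width-bit value")
    if n < 0:
        raise ValueError("numerator must be non-negative")
    shift = width - d.bit_length()       # normalise: top bit of divisor set
    frac = 2 * width + guard             # number of fractional bits
    two = 2 << frac                      # the constant 2.0 in fixed point
    D = (d << shift) << (frac - width)   # divisor as 0x0.XXXX..., 0.5 <= D < 1
    N = (n << shift) << (frac - width)   # numerator scaled identically
    for _ in range(iterations):
        F = two - D                      # F = 2 - D
        N = (N * F) >> frac              # two multiplies per iteration
        D = (D * F) >> frac              # 1 - D is roughly squared each pass
    q = N >> frac                        # D is now about 1, so N is about n / d
    r = n - q * d
    while r < 0:                         # correct any small estimate error
        q, r = q - 1, r + d
    while r >= d:
        q, r = q + 1, r - d
    return q, r

n = (0x0123_4567_89AB_CDEF << 64) | 0xFEDC_BA98_7654_3210
d = 0x8765_4321_0FED_CBA9
assert goldschmidt_divmod(n, d) == divmod(n, d)
```

Starting from an error of at most 1/2, seven squarings bring 1 - D down to roughly 2^-128, which costs 14 multiplies here. A better first estimate (for example from a lookup table) would reduce the iteration count; that refinement is not shown.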
* lkcl yeyeh [22:57]
<lkcl> https://stackoverflow.com/questions/2661541/picking-good-first-estimates-for-goldschmidt-division [22:58]
<programmerjake> see also https://en.m.wikipedia.org/wiki/Methods_of_computing_square_roots#Goldschmidt%E2%80%99s_algorithm [23:00]
<lkcl> fascinating [23:02]
