Monday, 2022-04-18

ghostmansdlkcl, I've decided to make it something in the middle between, https://git.libre-soc.org/?p=binutils-gdb.git;a=blob;f=gas/config/tc-ppc-svp64.c;h=66848c2c321751c924bbbff707aec14e3d45d283;hb=dc52061c2a9493a86ade6c50220b6986f477bb4c#l8114:12
ghostmansd(svp64_rm and svp64_prefix types are somewhat artificial, I know; I'd like to stress these serve different purpose, plus ensure we have 24 bits in svp64_rm)14:14
ghostmansdthis is perhaps more complex than bit masks and shifts14:14
ghostmansdon the other hand, hey, it ends up being exactly bit masks and shifts, it's only the table which makes things different14:14
ghostmansdI haven't checked yet, but perhaps this can also be optimized out with aggressive optimization (it should be)14:15
lkclhey that's a good-enough idea15:13
lkcl 142   [SVP64_PREFIX_RM] = {24, {15:14
lkcl 143     6, 8, 10, 11, 12, 13, 14, 15,15:14
lkcl 144     16, 17, 18, 19, 20, 21, 22, 23,15:14
lkcl 145     24, 25, 26, 27, 28, 29, 30, 31,15:14
lkcl 146   }},15:14
lkclthat's the basic concept, bear in mind those are MSB0 numbers, you need to turn them round to LSB0 numbers by subtracting from "sizeof(object)-1"15:14
lkcli deliberately kept the MSB0 numbering and created SelectableInt and FieldSelectableInt to "hide" the MSB0-to-LSB0 conversion as a way to preserve sanity15:16
lkclyou don't _have_ to follow that ;)15:16
lkclfeel free to do what everyone else does which is manually subtract IBM Specification MSB0 numbering from (usually) 31 or 63 or "whatever sizeof(object)-1 comes out to"15:17
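A minimal sketch of the MSB0-to-LSB0 conversion described above. The helper names `msb0_to_lsb0` and `get_bit_msb0` are hypothetical and do not come from the binutils or Libre-SOC sources. Note that the width must be given in bits: `sizeof(uint32_t)` is 4 (bytes), so "sizeof(object)-1" in the discussion is shorthand for "bit width minus one", i.e. 31 for a 32-bit word, 63 for a 64-bit one, 23 for a 24-bit field.

```c
#include <stdint.h>

/* Hypothetical helper: convert an MSB0 bit index (0 = most significant bit)
 * into an LSB0 index (0 = least significant bit) for a field that is
 * `width` bits wide. */
static inline unsigned msb0_to_lsb0(unsigned msb0, unsigned width)
{
    return (width - 1u) - msb0;
}

/* Hypothetical helper: read a single bit of `value`, addressed with an
 * MSB0 index inside a `width`-bit field. */
static inline unsigned get_bit_msb0(uint64_t value, unsigned msb0, unsigned width)
{
    return (unsigned)((value >> msb0_to_lsb0(msb0, width)) & 1u);
}
```

With these, MSB0 bit 6 of a 32-bit word is LSB0 bit 25, and MSB0 bit 0 of a 24-bit field is LSB0 bit 23.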
lkclmy only concern with the static inline containing a for-loop is that it'll not be properly optimised-away15:18
lkclto get it to be properly optimised away down to a pure sequence of (x&M<<N)|(y&M1<<N1) you really need #define macros15:19
lkclunfortunately15:19
lkclalthough it would look terrible you could probably auto-generate the macros from python (sv_analysis.py)16:35
lkclblerg16:35
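To make the loop-versus-masks trade-off concrete, here is a sketch under stated assumptions. It takes the 24-entry table quoted above (MSB0 positions 6, 8, 10..31 of the 32-bit prefix) and extracts the RM field twice: once with a table-driven loop, once as a hand-written sequence of masks and shifts. The names `rm_bits`, `rm_from_prefix_loop` and `rm_from_prefix_masks` are hypothetical and are not the functions in tc-ppc-svp64.c. The sketch only shows that the two forms compute the same value; it does not show what any particular compiler does with the loop.

```c
#include <stddef.h>
#include <stdint.h>

/* MSB0 positions of the 24 RM bits inside the 32-bit prefix
 * (values copied from the table quoted in the discussion). */
static const uint8_t rm_bits[24] = {
    6, 8, 10, 11, 12, 13, 14, 15,
    16, 17, 18, 19, 20, 21, 22, 23,
    24, 25, 26, 27, 28, 29, 30, 31,
};

/* Table-driven version: one bit per iteration. */
static inline uint32_t rm_from_prefix_loop(uint32_t prefix)
{
    uint32_t rm = 0;
    for (size_t i = 0; i < 24; i++) {
        unsigned src = 31u - rm_bits[i];   /* MSB0 -> LSB0, 32-bit prefix */
        unsigned dst = 23u - (unsigned)i;  /* MSB0 -> LSB0, 24-bit RM */
        rm |= ((prefix >> src) & 1u) << dst;
    }
    return rm;
}

/* Hand-unrolled version: the same mapping as three masks and shifts.
 * LSB0 bit 25 -> RM bit 23, LSB0 bit 23 -> RM bit 22, bits 21..0 unchanged. */
static inline uint32_t rm_from_prefix_masks(uint32_t prefix)
{
    return ((prefix >> 2) & 0x800000u)
         | ((prefix >> 1) & 0x400000u)
         | (prefix & 0x3FFFFFu);
}
```

Prefix bits at MSB0 positions 0..5, 7 and 9 are not part of the table, so both functions ignore them.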
programmerjakelkcl, you'd probably be happy to know that rust's zulip instance is changing to make zulip links work without logging in; for details see https://zulip-archive.rust-lang.org/stream/122649-announce/topic/moving.20to.20web-public.20streams.html#27928517017:31
programmerjakelkcl, i'll try to fix that .c file17:49
lkclprogrammerjake, appreciated.17:53
lkcli'm not coping17:53
programmerjakenp17:55
lkclalso can you make it a standard mul-with-sub18:09
lkclit will be very hard to justify a non-standard subtract with such precious space in EXT0418:09
programmerjakek18:10
lkclappreciated.  if there was plenty of space i'd say go for it, but there's only 5 spare slots18:11
programmerjakewell...in that case the original code is already submulborrow18:12
lkclyou can see i tried doing that18:12
lkclhttps://libre-soc.org/openpower/sv/800xNxweirdmuladd.jpg.pagespeed.ic.ldA94nYdxY.webp18:12
lkclby adding "+1-1" into the equation you created18:12
lkclwhich then made:18:13
lkclproduct = RC - (RA*RB)18:14
lkcland18:14
lkclresult = product + CARRY-118:14
lkclwhich is trivial and means mul-with-sub looks "normal"18:15
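The two equations above can be written out as wrapping 64-bit arithmetic. This is only a sketch of the arithmetic shape, not the instruction definition: it assumes 64-bit operands, keeps only the low 64 bits of the product, and uses hypothetical function names. For comparison it includes the Power ISA `subfe` form, RT = ~RA + RB + CA, which in wrapping arithmetic equals RB - RA + CA - 1, the same "+ CARRY - 1" shape.

```c
#include <stdint.h>

/* Assumption: subfe-style subtract, RT = ~RA + RB + CA
 * (equivalently RB - RA + CA - 1, modulo 2^64). */
static inline uint64_t subfe_like(uint64_t ra, uint64_t rb, uint64_t ca)
{
    return ~ra + rb + ca;
}

/* Hypothetical sketch of the shape quoted in the chat:
 *   product = RC - (RA * RB)
 *   result  = product + CARRY - 1
 * Only the low 64 bits of RA * RB are kept here. */
static inline uint64_t mul_with_sub(uint64_t ra, uint64_t rb,
                                    uint64_t rc, uint64_t carry)
{
    uint64_t product = rc - ra * rb;
    return product + carry - 1u;
}
```

Under these assumptions `mul_with_sub(ra, rb, rc, carry)` gives the same value as `subfe_like(ra * rb, rc, carry)`, which is one way to read "mul-with-sub looks normal".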
programmerjakeso, do you want me to leave the .c code in sub-mul-borrow form, or add a loop with renamed variables so you can easily see it's in sub-mul-borrow form, or add the algorithm i gave?18:15
lkclyou decide, i'm really not coping at the moment, i can't explain why18:15
programmerjakejust know that sub-mul-borrow means extra hardware, hence why i didn't choose that18:16
programmerjakeand inconsistency with subfe18:16
programmerjakek, i'll add all the loops for easy comparison.18:18
ghostmansdlkcl, the result is pretty much compiler-dependent; I bet this can be optimized even more (don't see a reason to sacrifice readability for performance here, though)19:03
ghostmansdhttps://godbolt.org/z/G5s1aM4Pj19:03
ghostmansdcheck e.g. svp64_prefix_insn_get with clang trunk19:04
ghostmansdgcc isn't particularly bad either19:04
ghostmansdbut has a loop which, I think, is not that difficult to unroll19:05
ghostmansdbut hey, who gives a shit, all these are called once per field19:05
ghostmansdI think if someone ever proves this is the reason the stuff is slower than a turtle, we might opt for optimizing that, but no sooner19:06
ghostmansdas for bit order... I thought we'd take care of it at the point when we actually emit something, not sooner19:08
ghostmansdbut I'll think about this `sizeof(uint32_t) - 1' inversion, thanks for reminder!19:08
lkclprogrammerjake, code looks fantastic19:30
lkclghostmansd, remember, for RM, it's 24-bit, so that'll be (23-index)19:31
programmerjake:)19:31
lkclMSB0 is... well, 18 months i suddenly had an epiphany19:32
lkcli realised that i'd just read something in MSB0 order and it made sense automatically19:32
lkclbecause the numbering is left-to-right, so left is at the "top".19:32
lkcli'm not sure if that's a bad sign :)19:32
programmerjakeit's a sign of your brain becoming more flexible...either that or we're going crazy from an overdose of msb019:34
lkclit was the automatic bit that had me worried :)19:42
lkcljust code-morphed the SUB_MUL_BORROW into two "instructions"19:45
lkclnow to do MUL_RSUB_CARRY19:45
lkclvn when i == n funnily enough can be done with predication19:46
lkclor can it... hmmm...19:46
programmerjakenope19:46
lkclit can, but only to a mv instruction19:47
lkclwhich would take a copy of the input vector, drat19:47
programmerjakeyou still need to subtract when i==n19:47
lkclyes. so the "vn_i = i < n ? vn[i] : 0" could be predicated19:47
lkclbut not the mul-rsub19:48
programmerjakei'd just make vn 1 bigger19:48
lkcloh yeah, and it's malloc'd anyway19:50
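For reference, a plain-C sketch of the multiply-and-subtract loop being discussed, including the extra iteration at i == n where `vn_i` is taken as 0. This is an illustration under assumptions, not the project's .c file: the function name `mul_sub_digits` is hypothetical, digits are assumed to be 32-bit and stored least significant first, and the borrow is carried as a full word between iterations.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: un[0..n] -= q * vn[0..n-1].
 * `un` must hold n + 1 digits, `vn` holds n digits.
 * Returns the final borrow: 1 if the true result was negative, else 0. */
static uint32_t mul_sub_digits(uint32_t *un, const uint32_t *vn,
                               size_t n, uint32_t q)
{
    uint64_t borrow = 0; /* amount still owed to the next digit */
    for (size_t i = 0; i <= n; i++) {
        /* At i == n there is no vn digit left, only the borrow is
         * subtracted. Making vn one digit longer (with a zero on top)
         * would remove this conditional. */
        uint32_t vn_i = i < n ? vn[i] : 0;
        uint64_t product = (uint64_t)q * vn_i + borrow;
        uint32_t low = (uint32_t)product;
        uint32_t old = un[i];
        un[i] = old - low;                       /* wraps modulo 2^32 */
        borrow = (product >> 32) + (old < low);  /* high half plus wrap */
    }
    return (uint32_t)borrow;
}
```

The borrow never exceeds 32 bits here: the high half of `product` can only reach 2^32 - 1 when its low half is 0, in which case the subtraction cannot wrap.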
programmerjakedo note that mrsubcarry is intentionally 1 instruction to avoid needing fusing 2 separate svp64 instructions...19:51
lkclyes - i explained in an earlier post that it's a 6x 64-bit instruction, there's no way that's going to be accepted19:51
lkclsplitting it into two turns out to be "3-in 2-out" which we can barely get away with19:52
lkcl(given that LD/ST units are already 3-in 2-out)19:52
programmerjakemicroarchitecturally 4-in 2-out is waay better19:53
lkclit's well over 400 wires into a pipeline: that's going to meet with resistance19:54
lkclJean-Paul and i experimented with a layout where the pipelines were placed in their own blocks: we simply couldn't get that many wires in.19:55
programmerjakewell...realistically we want the version with 256+256+64-in and 256+64-out19:55
programmerjakemul is big enough that there should be enough space for wires...19:55
programmerjakeespecially when we add f32/f64/i8/i16/i32/i64 support to the multiplier as well19:56
programmerjakealso, i don't think you'd have much resistance from the isa wg because of the number of in/outs...iirc vsx has instructions like that19:59
lkclPOWER10 had to compensate for the insanity by only having 2 128-bit units20:01
lkclthis is the scalar unit: i don't want the hassle of having to justify the increase20:01
lkcl"but LD/ST pipelines have 3-in 2-out already" is a good reason20:02
programmerjakewell....what if you stored the carry reg in the mrsubcarry pipeline? then it's effectively 3-in 1-out20:03
lkclthat means it's no longer re-entrant (a critical inviolate design characteristic of SVP64)20:07
lkcland wouldn't work on vertical-first20:07
programmerjakehmm...20:08
lkclso many constraints, it's mental20:10
programmerjakescratch that carry reg idea; instead have it so the first lane of the simd multiplier is the only one that can execute mrsubcarry...either 64-bit or 256-bit variants (or smaller)...20:11
programmerjakeit just won't be fast in vertical-first mode20:12
programmerjake(which is fine...it's fast in horizontal mode)20:12
programmerjakehonestly i'd expect gfbinv or cldiv to be harder for the isa wg than mrsubcarry20:14
lkclthe trick that mitch alsup taught me is that vertical-first, as long as the loops are small enough, can be analysed once instructions are in-flight20:15
lkcland macro-op fused into parallel ones20:15
lkclbut at some point, obviously, if you have more instructions than you have OoO ReservationStations, you have to fall back to scalar operation20:16
programmerjakeif encoding space is a concern, we can switch to mrsubcarry overwriting RC (meaning it ends up being RT) instead of being 4-arg20:16
lkclwell that's the other advantage of splitting into two20:16
lkclmsubx (RT,RA,RB,RC with an implicit RS=RT+VL) is standard enough to fly20:17
programmerjakemrsubcarry rt, ra, rb # rt, carry = mrsubcarry(ra, rb, rt, carry)20:17
lkcland then the remaining parts can be done as RT,RA,RB (with implicit RS=RB+VL)20:17
lkcloh wait20:17
lkcl1x 3-in 2-out and20:17
lkcl1x 2-in 2-out20:17
lkcli think20:17
lkclweirdaddx RT, RA, RB (RS=RB+VL for SVP64, RS=RB+1 for scalar)20:18
lkclsorry20:18
lkclencoded in standard X-Form (RT,RA,RB)20:18
lkclbut yes it's still 3-in 2-out20:18
lkclin: RA, RB, (implicit RS=RB+VL)20:18
lkclout: RT, RA20:19
lkclsimilar to LD-with-update20:19
lkclRA is read and overwritten (carry)20:19
lkclhttps://libre-soc.org/openpower/sv/bitmanip/appendix/20:19
programmerjakeone other major downside of splitting into two instructions is now we have to have the microarchitecture have those intermediates as an output of the fused instruction, making the 256-bit mul really 256-bit + 4x 64x64->128-bit muls ... doubling the area required20:19
lkclhttps://libre-soc.org/openpower/isa/svfixedarith/20:20
lkclwell the nice thing about macro-op fusion is, you don't have to do that if the intermediate registers are overwritten20:21
lkclyou can literally replace it internally with whatever-you-like20:21
lkclthat's the strict definition20:22
* lkcl checks20:22
programmerjakewell, there aren't enough results to overwrite all the intermediates...so that won't work unless you fuse a 3-instruction sequence: mul, subcarry, clear-intermediates20:22
programmerjakeimho 1 instruction is by-far the simplest20:23
lkclit's too much. we're developing RISC, not CISC20:24
lkcleven 3-in 2-out is pushing the boundaries20:24
lkcland mul-with-sub [into two halves] is an easy sell for obvious reasons20:25
programmerjakeyup, which is a good reason to have 1 instruction that does the op...it's simple...unlike the instruction fusion monstrosity20:25
programmerjakein reply to cisc ^20:26
lkclit's the old story about "you can do stuff fast or you can do stuff general, but you can't do both"20:26
lkclthe Rijndael / AES instructions on the other hand, doing an entire round as a single instruction, no problem with that at all20:26
lkcli mean, i don't like it, but it's hard to not-justify because it's so insanely common20:27
lkclat least this can be done as 64-bit20:28
lkclthere's no pressure to do the entire loop @ 32-bit20:28
programmerjakewell...nice part about mrsubcarry...the 32-bit variant merged to 256-bits is basically just truncating the carry in/out to 32-bits...the rest is unchanged20:29
lkclyeah makes sense20:33
lkcli wonder what carry-propagation would look like on the 3-in 2-out weird-add20:34
* lkcl need rest, need to get up.20:36
lkclthx jacob, really useful discussion. really appreciated20:36
