ghostmansd | lkcl, I've decided to make it something in the middle between, https://git.libre-soc.org/?p=binutils-gdb.git;a=blob;f=gas/config/tc-ppc-svp64.c;h=66848c2c321751c924bbbff707aec14e3d45d283;hb=dc52061c2a9493a86ade6c50220b6986f477bb4c#l81 | 14:12 |
---|---|---|
ghostmansd | (svp64_rm and svp64_prefix types are somewhat artificial, I know; I'd like to stress these serve different purpose, plus ensure we have 24 bits in svp64_rm) | 14:14 |
ghostmansd | this is perhaps more complex than bit masks and shifts | 14:14 |
ghostmansd | on the other hand, hey, it ends up being exactly bit masks and shifts, it's only the table which makes things different | 14:14 |
ghostmansd | I haven't checked yet, but perhaps this can also be optimized out with aggressive optimization (it should be) | 14:15 |
lkcl | hey that's a good-enough idea | 15:13 |
lkcl | [SVP64_PREFIX_RM] = {24, { | 15:14 |
lkcl | 6, 8, 10, 11, 12, 13, 14, 15, | 15:14 |
lkcl | 16, 17, 18, 19, 20, 21, 22, 23, | 15:14 |
lkcl | 24, 25, 26, 27, 28, 29, 30, 31, | 15:14 |
lkcl | }}, | 15:14 |
lkcl | that's the basic concept, bear in mind those are MSB0 numbers, you need to turn them round to LSB0 numbers by subtracting from "sizeof(object)-1" | 15:14 |
lkcl | i deliberately kept the MSB0 numbering and created SelectableInt and FieldSelectableInt to "hide" the MSB0-to-LSB0 conversion as a way to preserve sanity | 15:16 |
lkcl | you don't _have_ to follow that ;) | 15:16 |
lkcl | feel free to do what everyone else does which is manually subtract IBM Specification MSB0 numbering from (usually) 31 or 63 or "whatever sizeof(object)-1 comes out to" | 15:17 |
lkcl | my only concern with the static inline containing a for-loop is that it'll not be properly optimised-away | 15:18 |
lkcl | to get it to be properly optimised away down to a pure sequence of (x&M<<N)|(y&M1<<N1) you really need #define macros | 15:19 |
lkcl | unfortunately | 15:19 |
lkcl | although it would look terrible you could probably auto-generate the macros from python (sv_analysis.py) | 16:35 |
lkcl | blerg | 16:35 |
programmerjake | lkcl, you'd probably be happy to know that rust's zulip instance is changing to make zulip links work without logging in; for details see https://zulip-archive.rust-lang.org/stream/122649-announce/topic/moving.20to.20web-public.20streams.html#279285170 | 17:31 |
programmerjake | lkcl, i'll try to fix that .c file | 17:49 |
lkcl | programmerjake, appreciated. | 17:53 |
lkcl | i'm not coping | 17:53 |
programmerjake | np | 17:55 |
lkcl | also can you make it a standard mul-with-sub | 18:09 |
lkcl | it will be very hard to justify a non-standard subtract with such precious space in EXT04 | 18:09 |
programmerjake | k | 18:10 |
lkcl | appreciated. if there was plenty of space i'd say go for it, but there's only 5 spare slots | 18:11 |
programmerjake | well...in that case the original code is already submulborrow | 18:12 |
lkcl | you can see i tried doing that | 18:12 |
lkcl | https://libre-soc.org/openpower/sv/800xNxweirdmuladd.jpg.pagespeed.ic.ldA94nYdxY.webp | 18:12 |
lkcl | by adding "+1-1" into the equation you created | 18:12 |
lkcl | which then made: | 18:13 |
lkcl | product = RC - (RA*RB) | 18:14 |
lkcl | and | 18:14 |
lkcl | result = product + CARRY-1 | 18:14 |
lkcl | which is trivial and means mul-with-sub looks "normal" | 18:15 |
programmerjake | so, do you want me to leave the .c code in sub-mul-borrow form, or add a loop with renamed variables so you can easily see it's in sub-mul-borrow form, or add the algorithm i gave? | 18:15 |
lkcl | you decide, i'm really not coping at the moment, i can't explain why | 18:15 |
programmerjake | just know that sub-mul-borrow means extra hardware, hence why i didn't choose that | 18:16 |
programmerjake | and inconsistency with subfe | 18:16 |
programmerjake | k, i'll add all the loops for easy comparison. | 18:18 |
ghostmansd | lkcl, the result is pretty much compiler-dependent; I bet this can be optimized even more (don't see a reason to sacrifice readability for performance here, though) | 19:03 |
ghostmansd | https://godbolt.org/z/G5s1aM4Pj | 19:03 |
ghostmansd | check e.g. svp64_prefix_insn_get with clang trunk | 19:04 |
ghostmansd | gcc isn't particularly bad either | 19:04 |
ghostmansd | but has a loop which, I think, is not that difficult to unroll | 19:05 |
ghostmansd | but hey, who gives a shit, all these are called once per field | 19:05 |
ghostmansd | I think if someone ever proves this is the reason the stuff is slower than a turtle, we might opt for optimizing that, but no sooner | 19:06 |
ghostmansd | as for bit order... I thought we'd take care of it at the point when we actually emit something, not sooner | 19:08 |
ghostmansd | but I'll think about this `sizeof(uint32_t) - 1` inversion, thanks for the reminder! | 19:08 |
lkcl | programmerjake, code looks fantastic | 19:30 |
lkcl | ghostmansd, remember, for RM, it's 24-bit, so that'll be (23-index) | 19:31 |
programmerjake | :) | 19:31 |
lkcl | MSB0 is... well, 18 months i suddenly had an epiphany | 19:32 |
lkcl | i realised that i'd just read something in MSB0 order and it made sense automatically | 19:32 |
lkcl | because the numbering is left-to-right, so left is at the "top". | 19:32 |
lkcl | i'm not sure if that's a bad sign :) | 19:32 |
programmerjake | it's a sign of your brain becoming more flexible...either that or we're going crazy from an overdose of msb0 | 19:34 |
lkcl | it was the automatic bit that had me worried :) | 19:42 |
lkcl | just code-morphed the SUB_MUL_BORROW into two "instructions" | 19:45 |
lkcl | now to do MUL_RSUB_CARRY | 19:45 |
lkcl | vn when i == n funnily enough can be done with predication | 19:46 |
lkcl | or can it... hmmm... | 19:46 |
programmerjake | nope | 19:46 |
lkcl | it can, but only to a mv instruction | 19:47 |
lkcl | which would take a copy of the input vector, drat | 19:47 |
programmerjake | you still need to subtract when i==n | 19:47 |
lkcl | yes. so the "vn_i = i < n ? vn[i] : 0" could be predicated | 19:47 |
lkcl | but not the mul-rsub | 19:48 |
programmerjake | i'd just make vn 1 bigger | 19:48 |
lkcl | oh yeah, and it's malloc'd anyway | 19:50 |
programmerjake | do note that mrsubcarry is intentionally 1 instruction to avoid needing to fuse 2 separate svp64 instructions... | 19:51 |
lkcl | yes - i explained in an earlier post that it's a 6x 64-bit instruction, there's no way that's going to be accepted | 19:51 |
lkcl | splitting it into two turns out to be "3-in 2-out" which we can barely get away with | 19:52 |
lkcl | (given that LD/ST units are already 3-in 2-out) | 19:52 |
programmerjake | microarchitecturally 4-in 2-out is waay better | 19:53 |
lkcl | it's well over 400 wires into a pipeline: that's going to meet with resistance | 19:54 |
lkcl | Jean-Paul and i experimented with a layout where the pipelines were placed in their own blocks: we simply couldn't get that many wires in. | 19:55 |
programmerjake | well...realistically we want the version with 256+256+64-in and 256+64-out | 19:55 |
programmerjake | mul is big enough that there should be enough space for wires... | 19:55 |
programmerjake | especially when we add f32/f64/i8/i16/i32/i64 support to the multiplier as well | 19:56 |
programmerjake | also, i don't think you'd have much resistance from the isa wg because of the number of in/outs...iirc vsx has instructions like that | 19:59 |
lkcl | POWER10 had to compensate for the insanity by only having 2 128-bit units | 20:01 |
lkcl | this is the scalar unit: i don't want the hassle of having to justify the increase | 20:01 |
lkcl | "but LD/ST pipelines have 3-in 2-out already" is a good reason | 20:02 |
programmerjake | well....what if you stored the carry reg in the mrsubcarry pipeline? then it's effectively 3-in 1-out | 20:03 |
lkcl | that means it's no longer re-entrant (a critical inviolate design characteristic of SVP64) | 20:07 |
lkcl | and wouldn't work on vertical-first | 20:07 |
programmerjake | hmm... | 20:08 |
lkcl | so many constraints, it's mental | 20:10 |
programmerjake | scratch that carry reg idea; instead have it so the first lane of the simd multiplier is the only one that can execute mrsubcarry...either 64-bit or 256-bit variants (or smaller)... | 20:11 |
programmerjake | it just won't be fast in vertical-first mode | 20:12 |
programmerjake | (which is fine...it's fast in horizontal mode) | 20:12 |
programmerjake | honestly i'd expect gfbinv or cldiv to be harder for the isa wg than mrsubcarry | 20:14 |
lkcl | the trick that mitch alsup taught me is that vertical-first, as long as the loops are small enough, can be analysed once instructions are in-flight | 20:15 |
lkcl | and macro-op fused into parallel ones | 20:15 |
lkcl | but at some point, obviously, if you have more instructions than you have OoO ReservationStations, you have to fall back to scalar operation | 20:16 |
programmerjake | if encoding space is a concern, we can switch to mrsubcarry overwriting RC (meaning it ends up being RT) instead of being 4-arg | 20:16 |
lkcl | well that's the other advantage of splitting into two | 20:16 |
lkcl | msubx (RT,RA,RB,RC with an implicit RS=RT+VL) is standard enough to fly | 20:17 |
programmerjake | mrsubcarry rt, ra, rb # rt, carry = mrsubcarry(ra, rb, rt, carry) | 20:17 |
lkcl | and then the remaining parts can be done as RT,RA,RB (with implicit RS=RB+VL) | 20:17 |
lkcl | oh wait | 20:17 |
lkcl | 1x 3-in 2-out and | 20:17 |
lkcl | 1x 2-in 2-out | 20:17 |
lkcl | i think | 20:17 |
lkcl | weirdaddx RT, RA, RB (RS=RB+VL for SVP64, RS=RB+1 for scalar) | 20:18 |
lkcl | sorry | 20:18 |
lkcl | encoded in standard X-Form (RT,RA,RB) | 20:18 |
lkcl | but yes it's still 3-in 2-out | 20:18 |
lkcl | in: RA, RB, (implicit RS=RB+VL) | 20:18 |
lkcl | out: RT, RA | 20:19 |
lkcl | similar to LD-with-update | 20:19 |
lkcl | RA is read and overwritten (carry) | 20:19 |
lkcl | https://libre-soc.org/openpower/sv/bitmanip/appendix/ | 20:19 |
programmerjake | one other major downside of splitting into two instructions is now we have to have the microarchitecture have those intermediates as an output of the fused instruction, making the 256-bit mul really 256-bit + 4x 64x64->128-bit muls ... doubling the area required | 20:19 |
lkcl | https://libre-soc.org/openpower/isa/svfixedarith/ | 20:20 |
lkcl | well the nice thing about macro-op fusion is, you don't have to do that if the intermediate registers are overwritten | 20:21 |
lkcl | you can literally replace it internally with whatever-you-like | 20:21 |
lkcl | that's the strict definition | 20:22 |
* lkcl checks | 20:22 | |
programmerjake | well, there aren't enough results to overwrite all the intermediates...so that won't work unless you fuse a 3-instruction sequence: mul, subcarry, clear-intermediates | 20:22 |
programmerjake | imho 1 instruction is by-far the simplest | 20:23 |
lkcl | it's too much. we're developing RISC, not CISC | 20:24 |
lkcl | even 3-in 2-out is pushing the boundaries | 20:24 |
lkcl | and mul-with-sub [into two halves] is an easy sell for obvious reasons | 20:25 |
programmerjake | yup, which is a good reason to have 1 instruction that does the op...it's simple...unlike the instruction fusion monstrosity | 20:25 |
programmerjake | in reply to cisc ^ | 20:26 |
lkcl | it's the old story about "you can do stuff fast or you can do stuff general, but you can't do both" | 20:26 |
lkcl | the Rijndael / AES instructions on the other hand, doing an entire round as a single instruction, no problem with that at all | 20:26 |
lkcl | i mean, i don't like it, but it's hard to not-justify because it's so insanely common | 20:27 |
lkcl | at least this can be done as 64-bit | 20:28 |
lkcl | there's no pressure to do the entire loop @ 32-bit | 20:28 |
programmerjake | well...nice part about mrsubcarry...the 32-bit variant merged to 256-bits is basically just truncating the carry in/out to 32-bits...the rest is unchanged | 20:29 |
lkcl | yeah makes sense | 20:33 |
lkcl | i wonder what carry-propagation would look like on the 3-in 2-out weird-add | 20:34 |
* lkcl need rest, need to get up. | 20:36 | |
lkcl | thx jacob, really useful discussion. really appreciated | 20:36 |
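
For reference, the MSB0-to-LSB0 remapping discussed around 15:14 and 19:31 could be sketched roughly as below. This is only an illustration, not the code in tc-ppc-svp64.c: the names `rm_msb0_bits`, `rm_get` and `rm_set` are made up, and only the table contents come from the log. One detail worth noting is that the subtraction uses the width in bits minus one (31 for the 32-bit prefix, 23 for the 24-bit RM), whereas C's `sizeof` counts bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* MSB0 positions of the 24 RM bits inside the 32-bit prefix
 * (table contents as pasted in the log; the names are hypothetical). */
static const uint8_t rm_msb0_bits[24] = {
     6,  8, 10, 11, 12, 13, 14, 15,
    16, 17, 18, 19, 20, 21, 22, 23,
    24, 25, 26, 27, 28, 29, 30, 31,
};

/* Gather the 24 RM bits out of a 32-bit prefix. */
static inline uint32_t rm_get(uint32_t prefix)
{
    uint32_t rm = 0;
    for (size_t i = 0; i < 24; i++) {
        unsigned src = 31u - rm_msb0_bits[i]; /* MSB0 -> LSB0, 32-bit word */
        unsigned dst = 23u - (unsigned)i;     /* MSB0 -> LSB0, 24-bit field */
        rm |= ((prefix >> src) & 1u) << dst;
    }
    return rm;
}

/* Scatter a 24-bit RM value into a prefix, leaving all other bits alone. */
static inline uint32_t rm_set(uint32_t prefix, uint32_t rm)
{
    assert((rm >> 24) == 0); /* RM must fit in 24 bits */
    for (size_t i = 0; i < 24; i++) {
        unsigned src = 23u - (unsigned)i;
        unsigned dst = 31u - rm_msb0_bits[i];
        prefix &= ~(UINT32_C(1) << dst);
        prefix |= ((rm >> src) & 1u) << dst;
    }
    return prefix;
}
```

With this table the RM bits land at LSB0 positions 25, 23 and 21..0 of the prefix, so the loop is equivalent to a handful of fixed masks and shifts (combined mask 0x02BFFFFF). Whether a given compiler actually reduces the loop to that is, as noted in the log, compiler-dependent.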
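
The .c file being discussed from 17:49 onwards is not shown in the log, so the following is only a guess at the general shape of a multiply-and-subtract loop with a running borrow, as used in schoolbook big-integer division. The function name `mul_sub_borrow`, the 32-bit digit size and the little-endian digit order are all assumptions made for illustration. It does use the `vn_i = i < n ? vn[i] : 0` padding mentioned at 19:47, so that the final borrow is still subtracted when `i == n`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: un[0..n] -= q * vn[0..n-1], digits little-endian.
 * vn is treated as having an extra zero digit at index n.
 * Returns the final borrow (0 or 1). */
static uint32_t mul_sub_borrow(uint32_t *un, const uint32_t *vn,
                               size_t n, uint32_t q)
{
    assert(un != NULL && vn != NULL);
    uint64_t borrow = 0; /* amount still owed by the next digit */
    for (size_t i = 0; i <= n; i++) {
        uint32_t vn_i = i < n ? vn[i] : 0;  /* pad vn with a zero digit */
        /* q * vn_i + borrow cannot overflow 64 bits while borrow < 2^32 */
        uint64_t product = (uint64_t)q * vn_i + borrow;
        uint32_t lo = (uint32_t)product;
        uint64_t hi = product >> 32;
        hi += un[i] < lo;  /* the digit subtraction wraps: owe one more */
        un[i] -= lo;
        borrow = hi;
    }
    return (uint32_t)borrow;
}
```

For example, with `un = {0, 0, 1}` (that is, 2^64), `vn = {0xFFFFFFFF, 0xFFFFFFFF}` and `q = 1`, the loop leaves `un = {1, 0, 0}` and returns a borrow of 0. Making `vn` one digit longer, as suggested at 19:48, would remove the conditional from the loop body.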