Monday, 2022-04-18

ghostmansd	lkcl, I've decided to make it something in the middle between, https://git.libre-soc.org/?p=binutils-gdb.git;a=blob;f=gas/config/tc-ppc-svp64.c;h=66848c2c321751c924bbbff707aec14e3d45d283;hb=dc52061c2a9493a86ade6c50220b6986f477bb4c#l81	14:12
ghostmansd	(svp64_rm and svp64_prefix types are somewhat artificial, I know; I'd like to stress these serve different purpose, plus ensure we have 24 bits in svp64_rm)	14:14
ghostmansd	this is perhaps more complex than bit masks and shifts	14:14
ghostmansd	on the other hand, hey, it ends up being exactly bit masks and shifts, it's only the table which makes things different	14:14
ghostmansd	I haven't checked yet, but perhaps this can also be optimized out with aggressive optimization (it should be)	14:15
lkcl	hey that's a good-enough idea	15:13
lkcl	142 [SVP64_PREFIX_RM] = {24, {	15:14
lkcl	143 6, 8, 10, 11, 12, 13, 14, 15,	15:14
lkcl	144 16, 17, 18, 19, 20, 21, 22, 23,	15:14
lkcl	145 24, 25, 26, 27, 28, 29, 30, 31,	15:14
lkcl	146 }},	15:14
lkcl	that's the basic concept, bear in mind those are MSB0 numbers, you need to turn them round to LSB0 numbers by subtracting from "sizeof(object)-1"	15:14
lkcl	i deliberately kept the MSB0 numbering and created SelectableInt and FieldSelectableInt to "hide" the MSB0-to-LSB0 conversion as a way to preserve sanity	15:16
lkcl	you don't _have_ to follow that ;)	15:16
lkcl	feel free to do what everyone else does which is manually subtract IBM Specification MSB0 numbering from (usually) 31 or 63 or "whatever sizeof(object)-1 comes out to"	15:17
lkcl	my only concern with the static inline containing a for-loop is that it'll not be properly optimised-away	15:18
lkcl	to get it to be properly optimised away down to a pure sequence of (x&M<<N)\|(y&M1<<N1) you really need #define macros	15:19
lkcl	unfortunately	15:19
lkcl	although it would look terrible you could probably auto-generate the macros from python (sv_analysis.py)	16:35
lkcl	blerg	16:35
programmerjake	lkcl, you'd probably be happy to know that rust's zulip instance is changing to make zulip links work without logging in; for details see https://zulip-archive.rust-lang.org/stream/122649-announce/topic/moving.20to.20web-public.20streams.html#279285170	17:31
programmerjake	lkcl, i'll try to fix that .c file	17:49
lkcl	programmerjake, appreciated.	17:53
lkcl	i'm not coping	17:53
programmerjake	np	17:55
lkcl	also can you make it a standard mul-with-sub	18:09
lkcl	it will be very hard to justify a non-standard subtract with such precious space in EXT04	18:09
programmerjake	k	18:10
lkcl	appreciated. if there was plenty of space i'd say go for it, but there's only 5 spare slots	18:11
programmerjake	well...in that case the original code is already submulborrow	18:12
lkcl	you can see i tried doing that	18:12
lkcl	https://libre-soc.org/openpower/sv/800xNxweirdmuladd.jpg.pagespeed.ic.ldA94nYdxY.webp	18:12
lkcl	by adding "+1-1" into the equation you created	18:12
lkcl	which then made:	18:13
lkcl	product = RC - (RA*RB)	18:14
lkcl	and	18:14
lkcl	result = product + CARRY-1	18:14
lkcl	which is trivial and means mul-with-sub looks "normal"	18:15
programmerjake	so, do you want me to leave the .c code in sub-mul-borrow form, or add a loop with renamed variables so you can easily see it's in sub-mul-borrow form, or add the algorithm i gave?	18:15
lkcl	you decide, i'm really not coping at the moment, i can't explain why	18:15
programmerjake	just know that sub-mul-borrow means extra hardware, hence why i didn't choose that	18:16
programmerjake	and inconsistency with subfe	18:16
programmerjake	k, i'll add all the loops for easy comparison.	18:18
ghostmansd	lkcl, the result is pretty much compiler-dependent; I bet this can be optimized even more (don't see a reason to sacrifice readability for performance here, though)	19:03
ghostmansd	https://godbolt.org/z/G5s1aM4Pj	19:03
ghostmansd	check e.g. svp64_prefix_insn_get with clang trunk	19:04
ghostmansd	gcc isn't particularly bad as well	19:04
ghostmansd	but has a loop which, I think, is not that difficult to unroll	19:05
ghostmansd	but hey, who gives a shit, all these are called once per field	19:05
ghostmansd	I think if someone ever proves this is the culprit the stuff is slower than a turtle we might opt optimizing that, but no sooner	19:06
ghostmansd	as for bit order... I thought we'd take care of it at the point when we actually emit something, not sooner	19:08
ghostmansd	but I'll think about this `sizeof(uint32_t) - 1' inversion, thanks for reminder!	19:08
lkcl	programmerjake, code looks fantastic	19:30
lkcl	ghostmansd, remember, for RM, it's 24-bit, so that'll be (23-index)	19:31
programmerjake	:)	19:31
lkcl	MSB0 is... well, 18 months i suddenly had an epiphany	19:32
lkcl	i realised that i'd just read something in MSB0 order and it made sense automatically	19:32
lkcl	because the numbering is left-to-right, so left is at the "top".	19:32
lkcl	i'm not sure if that's a bad sign :)	19:32
programmerjake	it's a sign of your brain becoming more flexible...either that or we're going crazy from an overdose of msb0	19:34
lkcl	it was the automatic bit that had me worried :)	19:42
lkcl	just code-morphed the SUB_MUL_BORROW into two "instructions"	19:45
lkcl	now to do MUL_RSUB_CARRY	19:45
lkcl	vn when i == n funnily enough can be done with predication	19:46
lkcl	or can it... hmmm...	19:46
programmerjake	nope	19:46
lkcl	it can, but only to a mv instruction	19:47
lkcl	which would take a copy of the input vector, drat	19:47
programmerjake	you still need to subtract when i==n	19:47
lkcl	yes. so the "vn_i = i < n ? vn[i] : 0" could be predicated	19:47
lkcl	but not the mul-rsub	19:48
programmerjake	i'd just make vn 1 bigger	19:48
lkcl	oh yeah, and it's malloc'd anyway	19:50
programmerjake	do note that mrsubcarry is intentionally 1 instruction to avoid needing fusing 2 separate svp64 instructions...	19:51
lkcl	yes - i explained in an earlier post that it's a 6x 64-bit instruction, there's no way that's going to be accepted	19:51
lkcl	splitting it into two turns out to be "3-in 2-out" which we can barely get away with	19:52
lkcl	(given that LD/ST units are already 3-in 2-out)	19:52
programmerjake	microarchitecturally 4-in 2-out is waay better	19:53
lkcl	it's well over 400 wires into a pipeline: that's going to meet with resistance	19:54
lkcl	Jean-Paul and i experimented with a layout where the pipelines were placed in their own blocks: we simply couldn't get that many wires in.	19:55
programmerjake	well...realistically we want the version with 256+256+64-in and 256+64-out	19:55
programmerjake	mul is big enough that there should be enough space for wires...	19:55
programmerjake	especially when we add f32/f64/i8/i16/i32/i64 support to the multiplier as well	19:56
programmerjake	also, i don't think you'd have much resistance from the isa wg because of the number of in/outs...iirc vsx has instructions like that	19:59
lkcl	POWER10 had to compensate for the insanity by only having 2 128-bit units	20:01
lkcl	this is the scalar unit: i don't want the hassle of having to justify the increase	20:01
lkcl	"but LD/ST pipelines have 3-in 2-out already" is a good reason	20:02
programmerjake	well....what if you stored the carry reg in the mrsubcarry pipeline? then it's effectively 3-in 1-out	20:03
lkcl	that means it's no longer re-entrant (a critical inviolate design characteristic of SVP64)	20:07
lkcl	and wouldn't work on vertical-first	20:07
programmerjake	hmm...	20:08
lkcl	so many constraints, it's mental	20:10
programmerjake	scratch that carry reg idea; instead have it so the first lane of the simd multiplier is the only one that can execute mrsubcarry...either 64-bit or 256-bit variants (or smaller)...	20:11
programmerjake	it just won't be fast in vertical-first mode	20:12
programmerjake	(which is fine...it's fast in horizontal mode)	20:12
programmerjake	honestly i'd expect gfbinv or cldiv to be harder for the isa wg than mrsubcarry	20:14
lkcl	the trick that mitch alsup taught me is that vertical-first, as long as the loops are small enough, can be analysed once instructions are in-flight	20:15
lkcl	and macro-op fused into parallel ones	20:15
lkcl	but at some point, obviously, if you have more instructions than you have OoO ReservationStations, you have to fall back to scalar operation	20:16
programmerjake	if encoding space is a concern, we can switch to mrsubcarry overwriting RC (meaning it ends up being RT) instead of being 4-arg	20:16
lkcl	well that's the other advantage of splitting into two	20:16
lkcl	msubx (RT,RA,RB,RC with an implicit RS=RT+VL) is standard enough to fly	20:17
programmerjake	mrsubcarry rt, ra, rb # rt, carry = mrsubcarry(ra, rb, rt, carry)	20:17
lkcl	and then the remaining parts can be done as RT,RA,RB (with implicit RS=RB+VL)	20:17
lkcl	oh wait	20:17
lkcl	1x 3-in 2-out and	20:17
lkcl	1x 2-in 2-out	20:17
lkcl	i think	20:17
lkcl	weirdaddx RT, RA, RB (RS=RB+VL for SVP64, RS=RB+1 for scalar)	20:18
lkcl	sorry	20:18
lkcl	encoded in standard X-Form (RT,RA,RB)	20:18
lkcl	but yes it's still 3-in 2-out	20:18
lkcl	in: RA, RB, (implicit RS=RB+VL)	20:18
lkcl	out: RT, RA	20:19
lkcl	similar to LD-with-update	20:19
lkcl	RA is read and overwritten (carry)	20:19
lkcl	https://libre-soc.org/openpower/sv/bitmanip/appendix/	20:19
programmerjake	one other major downside of splitting into two instructions is now we have to have the microarchitecture have those intermediates as an output of the fused instruction, making the 256-bit mul really 256-bit + 4x 64x64->128-bit muls ... doubling the area required	20:19
lkcl	https://libre-soc.org/openpower/isa/svfixedarith/	20:20
lkcl	well the nice thing about macro-op fusion is, you don't have to do that if the intermediate registers are overwritten	20:21
lkcl	you can literally replace it internally with whatever-you-like	20:21
lkcl	that's the strict definition	20:22
* lkcl checks		20:22
programmerjake	well, there aren't enough results to overwrite all the intermediates...so that won't work unless you fuse a 3-instruction sequence: mul, subcarry, clear-intermediates	20:22
programmerjake	imho 1 instruction is by-far the simplest	20:23
lkcl	it's too much. we're developing RISC, not CISC	20:24
lkcl	even 3-in 2-out is pushing the boundaries	20:24
lkcl	and mul-with-sub [into two halves] is an easy sell for obvious reasons	20:25
programmerjake	yup, which is a good reason to have 1 instruction that does the op...it's simple...unlike the instruction fusion mostrosity	20:25
programmerjake	in reply to cisc ^	20:26
lkcl	it's the old story about "you can do stuff fast or you can do stuff general, but you can't do both"	20:26
lkcl	the Rijndael / AES instructions on the other hand, doing an entire round as a single instruction, no problem with that at all	20:26
lkcl	i mean, i don't like it, but it's hard to not-justify because it's so insanely common	20:27
lkcl	at least this can be done as 64-bit	20:28
lkcl	there's no pressure to do the entire loop @ 32-bit	20:28
programmerjake	well...nice part about mrsubcarry...the 32-bit variant merged to 256-bits is basically just truncating the carry in/out to 32-bits..,the rest is unchanged	20:29
lkcl	yeah makes sense	20:33
lkcl	i wonder what carry-propagation would look like on the 3-in 2-out weird-add	20:34
* lkcl need rest, need to get up.		20:36
lkcl	thx jacob, really useful discussion. really appreciated	20:36

Generated by irclog2html.py 2.17.1 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!