Saturday, 2022-10-22

*** kanzure <kanzure!~kanzure@user/kanzure> has quit IRC		06:30
*** kanzure <kanzure!~kanzure@user/kanzure> has joined #libre-soc		06:30
*** awygle <awygle!~quassel@2604:a880:2:d0::5380:3001> has quit IRC		06:53
*** sauce <sauce!~sauce@omae.wa.mou.shindei.ru> has quit IRC		06:53
*** awygle <awygle!~quassel@2604:a880:2:d0::5380:3001> has joined #libre-soc		06:54
*** sauce <sauce!~sauce@omae.wa.mou.shindei.ru> has joined #libre-soc		06:56
lkcl	ghostmansd[m], so what's the primary focus of each of the two current bugreports?	09:47
lkcl	and, just as importantly, "what fits into EUR 8,000"?	09:48
lkcl	for example: if i split into EUR 4000 for 947 and EUR 4000 for the other-top-level-one-for-binutils, is that ok?	09:49
lkcl	i take it https://bugs.libre-soc.org/show_bug.cgi?id=958 is the top-level?	09:51
ghostmansd[m]	This is basically doing the same stuff in two projects, TBH. Where they can re-use each other (e.g. where we can put some code into sv_binutils.py), this is great. Shortly, it's "building asm atop of insndb like we did with disasm, dropping custom tweaks". More details below.	09:52
ghostmansd[m]	binutils: infrastructure is more or less ready. We mostly need to refactor the and/or/shifts/whatever with fields, also establish per-mode handlers (you remember patterns matching in disassembly, I want to do something similar for asm, based on strings representing modes, e.g. /mr should toggle mapreduce and its submodes), etc.	09:53
ghostmansd[m]	After this is done, we need to introduce tests for SVP64. We already have tests for custom word instructions, but nothing for SVP64.	09:53
lkcl	remember the EUR 8000 is "interim", based on the fact that the cavatools budget has been approved	09:56
lkcl	so if there is more to be done, that would take longer, then it can be added to another project	09:57
ghostmansd[m]	openpower-isa: refactor pysvp64asm in the same way (actually this is the one I do now, since it's a reference and we have tests already). Also, as part of this, I'm supporting operand types, and support more advanced assembly for these (r, f, cr prefixes, non-zero operands without hardcoding it for each and every instruction with _custom_insn, CR complex descriptions, etc.).	09:57
lkcl	but please remember that if it has not yet been approved and the work has already been done you can't get paid	09:57
lkcl	so for example	09:57
lkcl	you've stated here:	09:57
lkcl	https://bugs.libre-soc.org/show_bug.cgi?id=958#c1	09:57
lkcl	"all tasks have been completed"	09:57
lkcl	but	09:57
lkcl	it was attached to an UNAPPROVED grant (OPF ISA WG)	09:58
ghostmansd[m]	Once both are done, I'd like to have some test infrastructure common for binutils and openpower-isa.	09:58
lkcl	therefore you cannot get paid for it under that grant	09:58
lkcl	therefore i have instead moved it to cavatools	09:58
ghostmansd[m]	> it was attached to an UNAPPROVED grant (OPF ISA WG)	09:58
ghostmansd[m]	Yeah that's why I asked about cavatools.	09:58
ghostmansd[m]	Because logically these changes are inevitable, if we want to support binutils, we must have all instructions there we have in reference implementation.	09:59
ghostmansd[m]	And also markos needed those, that's how it started. :-)	09:59
lkcl	yehyeh	09:59
ghostmansd[m]	Not all of them but I also did what was nearby.	09:59
ghostmansd[m]	This task is somewhat outside	10:00
lkcl	don't add any grev* bitmanip ones	10:00
ghostmansd[m]	Didn't intend	10:00
ghostmansd[m]	I remember their fate is not yet stated.	10:00
lkcl	programmerjake needs to remove them and complete grevlut*	10:00
ghostmansd[m]	We also even have some task called like "revisit grev stuff once we decide on it"	10:01
ghostmansd[m]	Don't remember the number, though	10:01
ghostmansd[m]	So, yeah:	10:03
ghostmansd[m]	1.Refactor binutils assembly.	10:03
ghostmansd[m]	2. Prepare binutils tests.	10:03
ghostmansd[m]	3. Refactor openpower-isa assembly.	10:03
ghostmansd[m]	4. Handle what's missing/controversial between these.	10:03
ghostmansd[m]	5. If time permits, come up with some tests infra which cross-checks these.	10:03
ghostmansd[m]	We might argue these tasks overlap a lot, and this is actually true, a long meta is "support SVP64 asm and extend it and test it". I'm splitting it for convenience so that you can decide what goes where.	10:05
lkcl	sounds great. what fits into EUR 8,000 and what doesn't?	10:05
lkcl	at present i'm simply putting both #958 and #949 under cavatools @ EUR 8,000.	10:05
ghostmansd[m]	I'm not sure, that's why I asked. Refactoring assembly is a big task per se.	10:06
ghostmansd[m]	I think the code fits well.	10:06
lkcl	if however you can do "some" of it	10:06
lkcl	then continue the rest under another Grant	10:06
lkcl	that would work	10:06
ghostmansd[m]	Tests, both for binutils and cross-check, likely will have to be done on other grant.	10:06
lkcl	that works	10:06
lkcl	boring, but it works	10:07
markos	lkcl, speaking of binutils, I think that since it's more or less working fine, we should use it more for more complicated -ie non-unit test work-. After all, people are going to use binutils and not some reference assembler for real code. Saying that because for me at least it's much harder reading complicated asm code embedded into python as a unit test (eg chacha), I'd much rather have the assembler separate from the python part on	10:07
markos	its own file	10:07
lkcl	most likely under OPF ISA WG one https://bugs.libre-soc.org/show_bug.cgi?id=952	10:07
ghostmansd[m]	Cross-check perhaps will be dropped for now, I suspect. It's big, and we could check each implementation on its own.	10:07
lkcl	markos, what you've created is great, but it doesn't "run from python setup.py test"	10:08
ghostmansd[m]	binutils tests are boring and difficult, but we have to do them if we want the code to be committed.	10:08
markos	that's not what I'm suggesting	10:08
lkcl	and, "run from python setup.py test" is the canonical - easy - method of ensuring that everything is working	10:08
lkcl	so	10:08
markos	I suggest not embedding complicated -ie more than a handful of instructions- within python	10:09
lkcl	that chacha20 inner-core constitutes a unit test, not a demo	10:09
lkcl	yes, i'm not going beyond that. 17 is quite enough	10:09
markos	what I'm suggesting is splitting the asm externally as a .s file and loading it as from python for the unit test, but also making it available for testing in binutils as well	10:10
lkcl	utf8-validation which programmerjake wrote is borderline. it's extremely comprehensive (the testing) and should really have been done in the style you developed	10:10
lkcl	but then it would not be possible to run as a python-based unit test	10:10
markos	of course it would, just load it as a string from file	10:11
lkcl	then it is no longer possible under the unit tests to perform python-based parameterised substitution for the purposes of performing comprehensive unit testing	10:11
lkcl	or it is, but is extremely inconvenient	10:12
lkcl	for example:	10:12
lkcl	i perform substitution in many unit tests of VL as a python-based parameter	10:12
lkcl	re-running the entire unit test multiple times	10:12
markos	yes, but these are not the tests I'm talking about	10:12
lkcl	with completely different assembler, in effect	10:12
lkcl	i know	10:12
markos	anyway	10:13
markos	not saying anything should be changed now, just that we should consider putting big asm chunks externally	10:13
lkcl	when there's a more "stand-alone" style appropriate (like the one you are developing) then yes absolutely i agree 100% the separate .s is good	10:13
lkcl	the 17 instructions in chacha20 turned out to be necessary, they're a bit much	10:14
markos	ack	10:14
lkcl	but i needed to test svindex - it's just that it needed to be called 4 times	10:15
lkcl	which is the limit (there are only 4 SVSHAPEs)	10:15
lkcl	this is not possible to do in a ".s" file:	10:16
lkcl	'addi 16, 0, %d' % nrounds, # set number of rounds	10:16
lkcl	strictly speaking the "32" on setvl should also be parameterised	10:16
lkcl	so that a much shorter test run can be made	10:17
lkcl	(and also experiment and see how number of rounds affects output)	10:17
lkcl	i keep getting tempted to try to do 2D svindex REMAPs	10:18
lkcl	the "normal" design of chacha20 is to SIMD-i-fy every 4th quarter-round in groups	10:19
lkcl	performing lane-based xor/add/rotl32	10:19
lkcl	https://en.wikipedia.org/wiki/File:ChaCha_Cipher_Quarter_Round_Function.svg	10:20
markos	if you do that and add butterfly-and-family int instructions 16/32-bit, we can do pretty much all comples video codec loops in 1/10th of the original code :)	10:20
markos	^complex	10:21
lkcl	:)	10:21
lkcl	yyeahh that one needs to go on the OPF ISA WG grant	10:21
ghostmansd[m]	lkcl, are you fine if I re-split the budget between 947 and 958, say, 5000 and 3000? Because 947 needs way more work, and 958 is simpler.	10:21
ghostmansd[m]	Maybe even 5500 and 2500, perhaps.	10:22
markos	ghostmansd[m], are you btw familiar with LLVM codebase as well?	10:22
lkcl	ghostmansd[m], yes sure, feel free to edit it yourself. it's not "approved" yet so is fine to edit	10:22
lkcl	markos, ohh, 2D DCT schedule as well	10:23
markos	lkcl, yup	10:23
ghostmansd[m]	Ok, thanks!	10:23
lkcl	markos, what was the name of that fantastically-odd instruction?	10:24
lkcl	the one with mul-and-add-and-a-shift-thrown-in-for-good-measure	10:24
markos	vqrdmulhq_s16/vqrdmulhq_s32	10:25
lkcl	ta	10:25
markos	https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:@navigationhierarchiessimdisa=[Neon]&q=vqrdmulhq_s16	10:25
markos	https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SQRDMULH--vector---Signed-saturating-Rounding-Doubling-Multiply-returning-High-half-?lang=en for the asm instruction	10:25
ghostmansd[m]	> it's not "approved" yet so is fine to edit	10:26
ghostmansd[m]	Speaking of that, do you have any estimations on it? Some bits of 947 (948) are almost done, and 958 is reaching the level it's fully done (I need to check instructions outside of av.mdwn).	10:26
markos	however that works for the one coeff only	10:26
markos	tried using it twice for the 2-coeffs butterfly but unfortunately it doesn't work that well	10:27
lkcl	ghostmansd[m], no - see the process i put on one of the bugreports.	10:27
markos	ah but there is SQDMLAL	10:27
lkcl	a) decide the tasks	10:28
markos	yup, that's the 2-coeff variant	10:28
lkcl	b) review the tasks	10:28
lkcl	c) get them approved by NLnet	10:28
lkcl	d) get a signed MoU back	10:28
ghostmansd[m]	Ah OK, good	10:28
ghostmansd[m]	Do we have tasks which are approved?	10:29
ghostmansd[m]	I'll complete these to some degree then until MoU is signed.	10:29
ghostmansd[m]	(Any risk it's not signed?)	10:29
lkcl	no risk whatsoever. it's already approved by the EU	10:30
lkcl	here's the list of all milestones: https://bugs.libre-soc.org/show_bug.cgi?id=938	10:30
lkcl	so, anything that can go under this	10:31
lkcl	https://bugs.libre-soc.org/show_bug.cgi?id=589	10:31
lkcl	can be justified	10:31
lkcl	e.g.... if you for example.... ooo i dunno...	10:32
lkcl	added the two new shift-instructions	10:32
lkcl	and sv.maddedu and sv.divmod2du	10:32
lkcl	those are for bigint which is for cryptographic purposes which can therefore be justified under that grant	10:33
ghostmansd[m]	Ok, no risk, but still no option to send any RFP under cavatools. :-) So I'll have to switch to something else for a while until that one is approved.	10:33
ghostmansd[m]	Hm, maddedu... IIRC this one already exists as word instruction.	10:34
ghostmansd[m]	Not in binutils though.	10:34
lkcl	correct, no risk. correct, no RFP yet. estimated... mmm... 2-3 weeks as i need to write the rest of the cavatools tasks, sigh.	10:34
ghostmansd[m]	Aha, OK, sufficient to do something else :-)	10:35
ghostmansd[m]	OK I'll add maddedu and divmod2du. Do we have task for these?	10:35
lkcl	just looking...	10:35
lkcl	prrooobbabblllyy.... under this one https://bugs.libre-soc.org/show_bug.cgi?id=771	10:35
lkcl	or https://bugs.libre-soc.org/show_bug.cgi?id=772 would be better, it's larger.	10:36
ghostmansd[m]	Ok, cool, thanks!	10:37
ghostmansd[m]	I will clean up what I have so far for assembly, and then proceed with these.	10:37
lkcl	ack	10:37
ghostmansd[m]	lkcl, I noticed we don't have pcdec too; should I added? Is it sufficiently stable?	13:22
ghostmansd[m]	Sorry, I meant pcdec., apparently it lacks non-Rc version?	13:23
lkcl	like "stcix." it's a Rc=1 only, yes	13:29
lkcl	ghostmansd[m], https://bugs.libre-soc.org/show_bug.cgi?id=937#c12	13:38
ghostmansd[m]	Sorry, kinda out of context. Do you mean these should be added too?	13:39
lkcl	the shift instructions will likely become 3-in 1-out so don't add them to the list yet	13:39
lkcl	but yes	13:39
*** toshywoshy <toshywoshy!~toshywosh@ptr-377wf33o3bnthuddmycb.18120a2.ip6.access.telenet.be> has quit IRC		13:42
*** openpowerbot <openpowerbot!~openpower@94-226-188-34.access.telenet.be> has quit IRC		13:42
*** ghostmansd <ghostmansd!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has joined #libre-soc		17:01
ghostmansd	lkcl, I've raised the task about maddedu and divmod2du: https://bugs.libre-soc.org/show_bug.cgi?id=964	17:01
lkcl	ghostmansd, ack	17:02
ghostmansd	And, actually, already done it :-)	17:03
lkcl	doh :)	17:03
ghostmansd	Actually I did so many of these that I already do it semi-automatically	17:04
lkcl	yehyeh	17:04
lkcl	ah - divmod2du has to move btw	17:04
ghostmansd	I got lucky, since there were instructions with the same form	17:04
ghostmansd	What do you mean?	17:04
lkcl	to XO=58	17:04
ghostmansd	Ah OK, ping me when you move it	17:05
ghostmansd	Or should I?	17:05
lkcl	am just in the middle of sorting that out (to accommodate dsld and dsrd)	17:05
ghostmansd	I'll do it, OK	17:05
ghostmansd	Sure, NP	17:05
ghostmansd	1 sec, this will need updates both in openpower-isa and binutils	17:05
lkcl	i'm sorting out openpower-isa	17:06
lkcl	commits outstanding, please don't make modifications to svp64.py minor_4.csv or svfixedarith.mdwn	17:06
ghostmansd	Ah OK	17:07
ghostmansd	I'll handle binutils then	17:07
ghostmansd	ping me when I need to rebase	17:07
ghostmansd	Because I might need to regenerate parts of binutils	17:07
ghostmansd	Some stuff is fixed (e.g. XO), but the tables...	17:08
lkcl	ghostmansd, done	17:15
lkcl	well that's frickin fricked	17:22
lkcl	dsld as a 4-op can only be EXTRA2	17:22
lkcl	frick	17:22
lkcl	and for dsld to be able to do overlapping pickup of a pair of consecutive 64-bit registers has to be EXTRA3	17:22
lkcl	frick	17:22
ghostmansd	1 sec, need to check it and update the tests based on what pysvp64asm produces	17:22
ghostmansd	done!	17:26
lkcl	brilliant	17:32
ghostmansd[m]	Feel free to assign the budget and split as you like	17:34
ghostmansd[m]	Since you did the XO moving for this	17:34
ghostmansd[m]	https://bugs.libre-soc.org/show_bug.cgi?id=964	17:34
lkcl	ack	17:35
*** toshywoshy <toshywoshy!~toshywosh@ptr-377wf33o3bnthuddmycb.18120a2.ip6.access.telenet.be> has joined #libre-soc		17:52
*** markos <markos!~Konstanti@static062038151250.dsl.hol.gr> has quit IRC		17:57
*** markos <markos!~Konstanti@static062038151250.dsl.hol.gr> has joined #libre-soc		18:07
ghostmansd[m]	lkcl, do you have any other tasks in mind that might need my help?	19:18
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC		20:05
lkcl	ghostmansd[m], kinda, yes - mostly we've got to focus on cavatools and the cryptoprimitives	20:54
FUZxxl	Where are most of you from?	20:55
lkcl	all over the place!	20:55
ghostmansd[m]	I ain't no cryptographer, but I can help unless there is math involved:-)	20:55
ghostmansd[m]	Here's Mother Russia, for example	20:56
lkcl	kuwait russia greece canada US germany UK belgium india	20:56
FUZxxl	cool	20:56
lkcl	maaaath	20:56
ghostmansd[m]	Hi FUZxxl	20:56
FUZxxl	hi ghostmansd[m]	20:56
lkcl	btw i'm supposed to be shopping, back in 1/2 hr :)	20:57
ghostmansd[m]	Ok then :-)	20:57
lkcl	aspirin and jelly-babys	20:57
lkcl	no connection	20:57
ghostmansd[m]	lkcl, do you mean Haribos?	20:57
FUZxxl	I want to do a PhD project about something related to SIMD stuff	20:58
FUZxxl	someone on reddit told me you might be interested	20:58
ghostmansd[m]	Well, SVP64 is not a out SIMD, to be honest, quite the opposite	20:59
FUZxxl	I think that was you lkcl ?	20:59
FUZxxl	Vector stuff counts in this regard :-)	20:59
ghostmansd[m]	*about	20:59
FUZxxl	though for some application it might be more tricky to use	20:59
ghostmansd[m]	In this sense, it totally applies!	20:59
ghostmansd[m]	Not more tricky than SIMD for sure	20:59
FUZxxl	We just wrote a fun paper about a SIMD-accelerated algorithm to transcode between UTF-8 and UTF-16 (both directions)	20:59
FUZxxl	it's full of horizontal operations	21:00
FUZxxl	I'm not sure if you can do enough arithmetic on masks in SVP64 to implement it there	21:00
ghostmansd[m]	You might want to talk to programmerjake, he'd recently implemented UTF-8 in SVP64	21:01
ghostmansd[m]	That'd be a great place to start I think	21:01
FUZxxl	programmerjake: hi	21:01
ghostmansd[m]	Just to feel the difference, so to speak	21:01
FUZxxl	sure	21:01
FUZxxl	I can send you a preprint if you like	21:01
FUZxxl	we achieve ~8 GB/s for UTF-16->UTF-8 and ~5 GB/s for the other direction on Icelake	21:02
ghostmansd[m]	We have some nice examples added recently in SVP64 assembly; lkcl, markos and programmerjake can share some tips on directions	21:02
ghostmansd[m]	I'm not sure whether programmerjake measured the throughput, but you can see the simplicity, at least	21:02
ghostmansd[m]	I mean, code-wise	21:03
FUZxxl	I think I kind of have to get used to it	21:03
FUZxxl	I think lkcl has send me code about that before	21:03
FUZxxl	but honestly I didn't get around understanding it	21:03
ghostmansd[m]	Ah, great!	21:03
ghostmansd[m]	Well, the good starting point about PPC assembly would be PowerPC ISA book	21:04
FUZxxl	https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/algorithms/svp64_utf_8_validation.py;hb=HEAD	21:04
FUZxxl	^ this one was a UTF-8 validator	21:04
FUZxxl	I've actually not implemented a dedicated UTF-8 validator, but my code does validation throughout the transcoding process.	21:05
FUZxxl	It really doesn't cost much at all (not on the critical path and there are enough ports with spare cycles to make it essentially free)	21:05
FUZxxl	anyway; that was mostly a warm up	21:06
FUZxxl	I heard you are interested in people writing assemblers for you?	21:06
FUZxxl	I was thinking about designing sort of a "smart assembler" that helps you write assembly code more effectively	21:07
ghostmansd[m]	Well, SVP64 pays a significant attention to write the assembly code smart	21:08
ghostmansd[m]	But I guess lkcl can enlighten more on this part :-)	21:08
FUZxxl	I'm unfortunately not super familiar with SVP64 yet	21:09
FUZxxl	my key idea was to have an assembler that does chores like laying out a stack frame and dealing with obtaining global variables through whatever PIC scheme your platform uses and this sort of thing	21:09
FUZxxl	So you can write intuitive code instead of having to deal with weird relocations and such	21:10
FUZxxl	additional possible objectives include strength reductions and register allocation	21:10
FUZxxl	i.e. instead of taking a compiled language and turning it into a macro assembler, I want to take an assembler and give it useful high-level features so you can make your life easier while still fundamentally programming assembly	21:11
FUZxxl	ghostmansd[m]: what is your focus in the libre-soc project?	21:14
ghostmansd[m]	My recent works are towards supporting SVP64 in binutils, also helping with the other stuff related to instructions.	21:15
FUZxxl	sounds great	21:15
ghostmansd[m]	Surprisingly this hasn't ever needed assembly knowledge. :-)	21:15
FUZxxl	hah	21:15
ghostmansd[m]	> So you can write intuitive code instead of having to deal with weird relocations and such	21:16
ghostmansd[m]	Well, these parts we don't touch, they're kept as they were in vanilla PPC world.	21:17
FUZxxl	yes correct	21:17
FUZxxl	still they suck if you program in assembly	21:17
FUZxxl	so I think a smart assembler could help you do better	21:17
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc		21:17
ghostmansd[m]	Many things suck if you program in assembly :-)	21:17
ghostmansd[m]	That depends on the assembly used	21:17
FUZxxl	they don't have to!	21:17
ghostmansd[m]	But, even then, it's low-level kind of things	21:18
FUZxxl	So?	21:18
ghostmansd[m]	Well, they don't, but then it's not the lowest-level (well not the lowest but you understand what I mean)	21:18
FUZxxl	so?	21:18
FUZxxl	Assembly programming doesn't mean you are forced to ignore any QOL improvements.	21:19
FUZxxl	It's just that you are fundamentally programming in terms of machine instructions.	21:19
ghostmansd[m]	No that's not what I mean	21:19
ghostmansd[m]	I mean, not many people think about it	21:19
ghostmansd[m]	Because they generally code in high-level	21:19
ghostmansd[m]	And the ugly part (e.g. relocations) is hidden	21:20
FUZxxl	True.	21:20
ghostmansd[m]	If everybody coded in asm, people would have thought about it more	21:20
ghostmansd[m]	But the thing is, they simply don't	21:20
FUZxxl	I think you have to write in assembly for fast SIMD/vector code as high level languages do not have the right paradigm to oexpress such algortihms.	21:20
ghostmansd[m]	They just take, say, shiny Python and don't care	21:20
FUZxxl	Improving on this is another possible PhD project.	21:21
FUZxxl	So being able to write assembly code more comfortably is something I consider very useful. I don't particularly care that it's a small niche.	21:21
ghostmansd[m]	True, but, first, this does not really involve diving into low-level weird chunks like relocations	21:22
ghostmansd[m]	And, second, it's mostly about optimizing some sections	21:22
FUZxxl	It does.	21:22
FUZxxl	For example, what do you have to do to obtain the address of a global variable?	21:22
FUZxxl	The answer depends on whether you write position-dependent, PIC, or PIE code.	21:22
ghostmansd[m]	This is true	21:23
ghostmansd[m]	But why would you do it in a small piece of code?	21:23
FUZxxl	So if an assembler could do it for you, it would be much easier for the programmer to focus on the relevant bits	21:23
FUZxxl	the assembly code can be substantial	21:23
ghostmansd[m]	I'd just pass that address in a register	21:23
ghostmansd[m]	To some routine written in assembly	21:23
ghostmansd[m]	What I mean, I would not write everything around in assembly	21:23
ghostmansd[m]	I would write some exclusive performance-critical parts	21:24
ghostmansd[m]	And make these callable from the high-level code	21:24
FUZxxl	but why should there be this restriction?	21:24
ghostmansd[m]	I'm not saying it's right :-)	21:24
ghostmansd[m]	I simply state how it mostly works	21:24
FUZxxl	another example: calling functions is hard because the correct sequence depends on the code model	21:24
ghostmansd[m]	There are exceptions, like KolibriOS	21:24
ghostmansd[m]	Indeed	21:25
FUZxxl	TLS is hard, too	21:25
ghostmansd[m]	I mean, I'm not saying the idea is wrong, I just express my thoughts on how it works now	21:25
FUZxxl	and it's usually OS dependent too	21:25
FUZxxl	sure	21:25
FUZxxl	I know how it works now	21:25
ghostmansd[m]	TLS is fucked as a concept TBH	21:25
FUZxxl	and I am not satisfied.	21:25
ghostmansd[m]	Not unreasonable but still fucked	21:26
FUZxxl	oh, please explain! An interesting view!	21:26
ghostmansd[m]	I mostly mean concepts like errno	21:26
FUZxxl	how is it fucked?	21:26
ghostmansd[m]	Compared to errno, using a negative error codes like in Linux kernel is rather a sane idea	21:27
ghostmansd[m]	Because checking/setting an integer in a register is much cheaper than accessing TLS	21:27
FUZxxl	you only need to check TLS after having established that the call failed	21:27
ghostmansd[m]	BSD use another approach, setting a carry flag	21:27
ghostmansd[m]	But in essense it's the same	21:27
ghostmansd[m]	First, it depends	21:28
ghostmansd[m]	Some routines require setting errno to 0 otherwise you won't be able to differ some failure cases	21:28
FUZxxl	these are not system calls and hence could not benefit from the syscall convention	21:28
ghostmansd[m]	Second, why address TLS if you can simply use register which is already thread-specific enough?	21:29
FUZxxl	also the Linux convention is kind of fucked as it arbitrarily assumes that a certain range of values could not occur in valid results.	21:29
FUZxxl	Not really future proof.	21:29
ghostmansd[m]	Or, well, rather processor specific	21:29
FUZxxl	Sure, use one register for each global variable	21:29
FUZxxl	errno is not performance critical enough to waste such a resource on it	21:29
ghostmansd[m]	Rather, don't use global variables :-)	21:30
FUZxxl	we already keep the TLS pointer in a register, so errno is just one dereference away	21:30
ghostmansd[m]	But I'm not trying to persuade, this is not a holy war on global state :-D	21:30
FUZxxl	the reason for errno is chiefly that you can't return more than one value in C.	21:31
ghostmansd[m]	> we already keep the TLS pointer in a register, so errno is just one dereference away	21:31
ghostmansd[m]	This actually depends on the architecture I think	21:31
FUZxxl	If you can, there's no need to have it, as e.g. Go shows.	21:31
ghostmansd[m]	On most architectures yes	21:31
ghostmansd[m]	Also, enabling TLS per se is a difficult stuff	21:31
FUZxxl	you can't really avoid it	21:31
FUZxxl	You are going to need thread-local variables one way or another	21:31
FUZxxl	e.g. Go, which doesn't have TLS exposed in the language still needs it to implement details of its green threads	21:32
ghostmansd[m]	I wouldn't say; per-CPU, yes, these are unavoidable for many contexts	21:32
ghostmansd[m]	Per-thread? Not that much.	21:32
FUZxxl	what is the distinction in your opinion?	21:32
FUZxxl	there is a 1:1 mapping between threads and CPUs	21:33
FUZxxl	i.e. whenever a thread runs, it runs on one CPU	21:33
ghostmansd[m]	Yes and no. Per-CPU can be just as simple as "take CPU index and refer to entry in say array".	21:34
ghostmansd[m]	And, per-thread would have been the same...	21:34
ghostmansd[m]	...but instead it got buried with a lot of details, like allocate stuff in kernel, provide interface for it in userspace, support it in compilers, support it upon linkage, support it everywhere...	21:35
FUZxxl	So per CPU means that if you have a new time slice, whatever other process runs gets the per CPU variable? :-D	21:35
ghostmansd[m]	I mean, this is kinda the difference between kernel and userspace, I know it's not entirely a fair comparison.	21:36
FUZxxl	also all the support would still be needed for "per CPU" stuff	21:36
FUZxxl	another point is that register space is limited	21:37
ghostmansd[m]	No it doesn't mean. No it won't be, at least not with the amount of details for pthread.	21:37
FUZxxl	so you cannot have programs freely allocate registers for global or per-thread stuff	21:38
ghostmansd[m]	And yes it could've been as simple as "just sacrifice one register"	21:38
FUZxxl	you run out too quickly (as opposed to memory)	21:38
FUZxxl	we do, it's called the TLS register	21:38
FUZxxl	sacrificed to enable use of arbitrary per-thread state.	21:38
ghostmansd[m]	Yeah I know	21:39
ghostmansd[m]	Anyway, the level of magic for supporting per-CPU in Linux kernel is not compared to level of magic needed to support pthreads in userspace	21:40
FUZxxl	another issue with sacrificing registers is that every part of the program needs to know which registers are sacrificed	21:40
ghostmansd[m]	That's my point	21:40
FUZxxl	which is hard to do in the presence of shared libraries and such	21:40
ghostmansd[m]	Well, this is kinda part of the deal, huh?	21:40
FUZxxl	what you call "per-CPU" is just "per-thread" but not touching memory	21:41
ghostmansd[m]	Like, say, with the calling conventions	21:41
ghostmansd[m]	It's not what I call "per-CPU", it's in fact per-CPU	21:41
ghostmansd[m]	It's rather the thing you call per-thread is rather per-CPU :-)	21:41
FUZxxl	semantics	21:42
ghostmansd[m]	But this is buried deep down	21:42
ghostmansd[m]	Semantics matter	21:42
FUZxxl	sorry, I still don't really get the point you are trying to make.	21:42
FUZxxl	Especially with the distinction between per-thread and per-CPU	21:43
ghostmansd[m]	My point is, unless you don't really need to, don't use TLS. Allocate per-thread context, for example.	21:43
FUZxxl	TLS is there anyway, might as well use it.	21:44
ghostmansd[m]	*unless you really need	21:44
FUZxxl	It won't go back into its box if you don't use it.	21:44
ghostmansd[m]	There are many things around which won't; not the reason to use them.	21:44
FUZxxl	so ... you oppose on philosophical grounds?	21:44
FUZxxl	it would be easier to implement if we didn't have them, so even though we do, don't use them?	21:45
ghostmansd[m]	I oppose on the grounds of simplicity. You might as well call these philosophical, if you like. :-)	21:45
FUZxxl	I think TLS makes for some tremendously simple solutions.	21:46
FUZxxl	Just as static variables do.	21:46
FUZxxl	I mean there is a reason objects are so popular: object members are just per-instance static variables.	21:46
FUZxxl	Let's you use them without feeling guilty for having sinned.	21:46
ghostmansd[m]	And simplicity differs. I know declaring something thread_local might appear simpler than calling malloc inside the thread. And having a global variable is simpler than passing function arguments. In this sense, this is not the simplicity I'm talking about.	21:47
FUZxxl	so philosophical grounds :-)	21:48
ghostmansd[m]	I'm talking about simplicity to debug and maintain it, and also about simplicity of supporting the internals needed, etc.	21:48
ghostmansd[m]	Again: if you think these are philosophical grounds, that's just your opinion, I'm fine with it. :-)	21:48
ghostmansd[m]	I don't really intend to convert you into the no-TLS-whatsoever church. :-)	21:49
ghostmansd[m]	Each and every programmer is full of opinionated statements, so am I.	21:49
FUZxxl	debugging is a lot easier with TLS and static variables because you can just look at the variable	21:50
FUZxxl	it's a lot harder if the variable is to be found somewhere deep in an object hierarchy or the call stack	21:50
FUZxxl	you also don't have to lug the variable down the call stack which e.g. helps tremendously when writing recursive-descent parsers.	21:50
ghostmansd[m]	My experience differs, but, again, this is about opinions.	21:50
FUZxxl	sure	21:50
ghostmansd[m]	This diverged a lot from UTF-8, lol :-D	21:51
FUZxxl	heh	21:51
FUZxxl	My current other project is writing a SIMD-accelerated SAT solver	21:52
FUZxxl	let's see if it works	21:52
ghostmansd[m]	Oh, nice!	21:57
FUZxxl	SAT solvers are very branch heavy	21:57
FUZxxl	they constantly (1) load a datum from RAM (2) compare it with something (3) branch on the result	21:58
FUZxxl	almost all the loads are interdependent	21:58
FUZxxl	I believe I can do better with slighly improved data structures, permitting me to then use gather instructions and so on	21:58
ghostmansd[m]	Well, that's one of the benefits with vectors, being able to write branchless code, so I hope you'll find what you're searching for here. :-)	22:00
ghostmansd[m]	So, welcome :-)	22:00
FUZxxl	Oh it's still going to be plenty branchy	22:00
FUZxxl	and I think vectors will not actually give a lot of benefit	22:00
FUZxxl	even being able to do 4 loads independently would be a big plus	22:00
FUZxxl	even with vectors you'll be limited to how many fetch units you have	22:02
FUZxxl	as most loads are random	22:02
FUZxxl	so once your vector holds as many integers as you have fetch units, you have reached the limit of possible perf benefits.	22:02
markos	FUZxxl, hey, welcome, nice to see another fellow SIMD junkie :)	22:06
markos	well, in my case, ex-SIMD junkie, I'll never see SIMD again in the same light after SVP64	22:06
markos	in fact, working with SVP64 in most cases requires to see the actual reference scalar code rather than a SIMD implementation	22:07
*** toshywoshy <toshywoshy!~toshywosh@ptr-377wf33o3bnthuddmycb.18120a2.ip6.access.telenet.be> has quit IRC		22:07
FUZxxl	markos: I like to do something I call "combinatorial SIMD" i.e. SIMD use for combinatorial algorithms	22:08
FUZxxl	The results are usually completely different from the scalar code as they require entirely different algorithm design approaches	22:08
markos	I implemented some example code for SVP64, you'll see that it's much much easier to just write SVP64 assembly directly following the scalar code	22:09
FUZxxl	which also makes me feel uncertain about how advantageous vector architectures are going to be. I don't know how good their horizontal operations are.	22:09
markos	and forget SIMD entirely	22:09
FUZxxl	If you have all vertical code I'm sure it is.	22:09
markos	how about reducing code size to 1/5th or even 1/10th?	22:10
FUZxxl	but alas I do not have enough experience.	22:10
markos	in assembly!	22:10
FUZxxl	sure, but does it do the trick?	22:10
markos	yes	22:10
markos	how fast we don't know yet	22:10
FUZxxl	e.g. could you do it with, say, an integer sorting algorithm?	22:10
FUZxxl	if you have, say, quicksort, it's a very simple scalar procedure.	22:10
markos	sure	22:10
FUZxxl	But what would the SVP64 variant look like?	22:10
FUZxxl	Each iteration has a loop carried dependency.	22:11
FUZxxl	You need to redesign the algorithm to vectorise it.	22:11
markos	we haven't done sorting yet, but it's trivial to find the horizontal min/max across a vector	22:11
FUZxxl	sure, this is because you have reduction instructions I suppose.	22:11
markos	yes	22:11
markos	one instruction to get the min/max across 64 bit registers	22:11
FUZxxl	cheating :-P	22:11
markos	not more than what other engines are doing already	22:12
FUZxxl	an instruction to implement a special case says nothing about the general case	22:12
markos	but instead of adding a gazillion instructions to do one simple job	22:12
markos	SVP64 adds just a few powerful instructions that can be modified to be used in pretty much all scenarios	22:12
FUZxxl	for example, what if you want to compute the gcd of a vector?	22:12
FUZxxl	Is it still that easy?	22:12
FUZxxl	the scalar code is just as easy as "min of a vector"	22:13
markos	yes	22:13
FUZxxl	what would it look like in SVP64? Is there a "reduce with gcd" instruction?	22:13
markos	I did ports of a few algorithms used in VP8/VP9 and AV1	22:13
markos	do you have the git checked out?	22:13
FUZxxl	which git in particular?	22:14
markos	libresoc, openpower-isa	22:14
FUZxxl	can you give me the correct link?	22:14
FUZxxl	(no I do not have it checked out)	22:14
markos	https://git.libre-soc.org/?p=openpower-isa.git;a=summary	22:15
FUZxxl	what is the URL to clone this repository? The site does not say.	22:16
markos	well, if you want to work on it you have to get an actual dev environment, unfortunately it's not that simple as checking out to start playing with it	22:17
markos	but if you want to look at SVP64 asm examples	22:17
markos	https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=media/video;hb=HEAD	22:18
markos	has a port of a few functions for libvpx and dav1d libraries	22:18
FUZxxl	you are evading my question :-)	22:19
markos	built using ghostmansd's binutils port to SVP64 and run on the python simulator	22:19
markos	which is? how would a gcd be implemented in SVP64?	22:19
FUZxxl	no, a "reduction by gcd"	22:19
FUZxxl	I chose gcd because I am pretty sure you don't have an instruction for it.	22:19
FUZxxl	i.e. take an array of numbers and compute the gcd of all numbers.	22:19
FUZxxl	the scalar loop is simple, but how can it be vectorised? I cannot imagine it would be as simple as the scalar loop.	22:20
markos	you have to think outside the box of SIMD	22:20
markos	no there no specific instruction for it	22:20
markos	but it can be turned into SVP64 quite easily	22:20
FUZxxl	But would it look just like the scalar code?	22:21
markos	not exactly, but close	22:21
markos	it's just a bunch of modulos and compare against zero	22:21
FUZxxl	the thing is: I'm sure it's very easy to make the vector code look like the scalar code for simple numerical algorithms	22:21
FUZxxl	but I really don't care about these	22:22
markos	how's DCT?	22:22
markos	I wouldn't call it exactly simple	22:22
*** toshywoshy <toshywoshy!~toshywosh@ptr-377wf33o3bnthuddmycb.18120a2.ip6.access.telenet.be> has joined #libre-soc		22:22
FUZxxl	markos: and you have to somehow deal with it being a reduction	22:22
markos	and yet it's much smaller than SIMD DCT code I'm writing for Arm	22:22
FUZxxl	discrete cosine transform?	22:22
markos	yes	22:22
FUZxxl	sure it is	22:22
FUZxxl	you don't have to deal with edge conditions or chunking	22:22
markos	have you seen SIMD unrolled code?	22:23
markos	for DCT I mean	22:23
FUZxxl	otoh you will not be benefit from more execution units with such simple code	22:23
FUZxxl	*not benefit	22:23
FUZxxl	sure	22:23
FUZxxl	You can write SIMD code in very similar style, especially with AVX-512. The extra complexity is mainly computing masks for the final iteration.	22:23
FUZxxl	But such code is not very efficient as it doesn't have high ILP	22:24
markos	let's leave AVX512 out of the discussion please :)	22:24
FUZxxl	the complexity is needed to enable better ILP and I believe that'll be the same with SVP64 in future implementations	22:24
markos	I'm writing SIMD code for pretty much all architectures, as a profession	22:24
markos	AVX512 apart from a few interesting instructions is a mess	22:24
FUZxxl	so am I :-)	22:24
FUZxxl	AVX-512 is the best ISA for my particular work because it has the best horizontal operations	22:25
markos	if it works for you, fine	22:25
FUZxxl	e.g. VPCOMPRESSB, PDEP, PEXT, VPMULTISHIFTQB, VPERMB	22:25
FUZxxl	other architectures lack them	22:25
FUZxxl	SVE2 has something like VPCOMPRESSB, but no PDEP or PEXT, so I will not be able to port my UTF-8 algorithm to it easily.	22:26
markos	pdep/pext are indeed lacking other architectures	22:26
markos	^in	22:26
FUZxxl	SVP64 would need pdep/pext on mask (?) registers to be effective	22:27
markos	but here's the thing, in the end all that matters is the resulting performance, and so far I managed to get the best performance of my software on a M1 using NEON	22:27
FUZxxl	and compress/expand (which I think you might already have)	22:27
FUZxxl	yeah that chip rules	22:27
FUZxxl	however, NEON sucks for combinatoric programming: no scatter/gather, no compress, no pdep, no pext	22:27
FUZxxl	good for number crunching though	22:28
markos	scatter/gather are a joke on Intel	22:28
FUZxxl	they got a lot better with Icelake	22:28
FUZxxl	could be better though	22:28
FUZxxl	have you worked with NEC Aurora Tsubasa before?	22:28
FUZxxl	Fun system, too.	22:28
markos	no	22:28
FUZxxl	vector ISA w/ 2048 byte vectors iirc. HBM2 memory, no L2 cache	22:29
FUZxxl	mem is so fast that vector loads always bypass the L1 cache. Crazy perf	22:29
markos	that's highly specialized	22:29
FUZxxl	made for number crunching	22:29
markos	yes, and only that	22:29
FUZxxl	correct.	22:29
markos	SVE was based on such a CPU Fujitsu iirc	22:30
FUZxxl	yeah	22:30
FUZxxl	still waiting for SVE2 hardware I can put on my desk	22:30
markos	same here, probably I'll get access next month	22:30
FUZxxl	once that happens ... I gues I'll write a whole lot more ARM asm	22:30
FUZxxl	https://github.com/clausecker/pospop/blob/master/countneon_arm64.s	22:31
markos	anyway	22:31
FUZxxl	^ here's some ARM asm I wrote a while ago.	22:31
markos	the point is that SVP64 is a totally new game	22:31
markos	check the asm in the media/video/ directory	22:31
FUZxxl	that was fun. I wonder if it would work the same or better with SVP64 or other vector isas.	22:31
markos	there is no need to unroll your loop 100 times to get the performance	22:32
markos	eg. I'm doing a DCT 32x32 algorithm for NEON right now	22:32
markos	it's unrolled 1500 lines with a ton of inline functions	22:32
FUZxxl	sure, neither is it for the M1	22:32
markos	so even more	22:32
FUZxxl	the out of order mechanism makes it so multiple iterations happen at once	22:33
markos	with SVP64 there are specialized DCT instructions	22:33
FUZxxl	still brings a little bit of an advantage some times though	22:33
markos	it's not about a specific implementation, eg M1 or another	22:33
markos	it's about the architecture, not the micro-architecture	22:33
FUZxxl	why does the architecture require unrolling?	22:33
FUZxxl	what is fundamentally different with SVP64 that it doesn't?	22:34
markos	the loops are in the hardware instructions themselves	22:34
FUZxxl	so ... each instruction is a loop?	22:35
markos	whereas in SIMD you might have to unroll 4 times, for a simple 128-bit vector addition, with SVP64 you just do a sv.adds having previously set the VL (Vector Length) to whatever you want	22:35
markos	(up to 64 iirc)	22:35
FUZxxl	why do you have to unroll 4 times with SIMD? You can just do the same loop you do with SVP64	22:36
markos	the added bonus of having a full 128 registers to play with, makes for some pretty interesting code	22:36
FUZxxl	and for slow context switches	22:36
markos	in general, branches are bad	22:36
FUZxxl	not if they are easy to predict	22:36
markos	context switching can be solved easily	22:36
markos	no, branches are bad, period	22:36
FUZxxl	I disagree	22:36
FUZxxl	e.g. my UTF-8 conversion routine got twice as fast by making it more branchy	22:37
FUZxxl	I added fast paths for e.g. when there are no surrogates skipping large chunks of the algorithm	22:37
markos	in order to have good branch prediction units, you have to add more silicon to the chip, which in turn increases power usage, which makes it more difficult to scale your cpu, etc etc	22:37
markos	I'd prefer using my silicon for other stuff	22:37
FUZxxl	Most code is branchy business logic, you will not be able to vectorise that. Or make it less branchy.	22:38
markos	critical loops aren't for this reason	22:38
FUZxxl	Mine are :-)	22:38
FUZxxl	You are invited to make them less branchy and faster.	22:38
markos	my plate is already full :-)	22:38
FUZxxl	Sure.	22:39
FUZxxl	e.g. here is my UTF-8 conversion routine: https://github.com/simdutf/simdutf/blob/clausecker/utf8_to_utf16le_avx512.S	22:39
FUZxxl	How would it benefit from SVP64?	22:39
markos	if it was as you say, there would be no need for so much inlining of functions in the hot path	22:39
markos	programmerjake is the utf8 expert here	22:39
FUZxxl	what the inlining helps with is getting rid of the shuffling of registers	22:40
FUZxxl	also, functions are often called with constant arguments so constant propagation helps.	22:40
FUZxxl	The pure function call is usually not what makes a call expensive.	22:40
FUZxxl	markos: one thing this code does e.g. is obtaining masks over vector elements, then moving them to scalar registers and doing arithmetic on them.	22:41
FUZxxl	Not sure how this would be possible with SVP64 given that masks are longer than scalar registers.	22:41
markos	without knowing the algorithm I can't say	22:42
FUZxxl	We also have a thing where the vector length must necessarily be dynamic as UTF8 and UTF16 are variable length encodings, so a variable lenght chunk of input is translated into a variable length chunk of output.	22:42
markos	but it's the same as usual, check the reference algorithm and not the SIMD code first	22:42
FUZxxl	But I guess there is a "set vector length to the number in this register" instruction.	22:42
FUZxxl	the reference algorithm is completely different and cannot be vectorised as it is full of loop carried dependencies.	22:42
FUZxxl	and the SIMD algorithm I designed cannot be expressed in a scalar paradigm.	22:43
markos	to put in other words, if I'm going to port code to SVP64 I found it it much much simpler to just take the C code and work with that	22:43
FUZxxl	Please, I'm interestedin seeing the result.	22:43
markos	SIMD-isms are not possible with SVP64 or are much much harder to implement and with a big doubt of the actual benefit	22:43
FUZxxl	Here is the reference C code I wrote: https://github.com/simdutf/simdutf/blob/clausecker/utf8_to_utf16le_ref.c	22:43
markos	I already told you, check the media/video code	22:43
FUZxxl	It's a completely different algorithm though.	22:43
markos	reference C algorithm and SVP64 in the same dir (*_real.s is the actual code)	22:44
FUZxxl	that code is just straightforward math code. No loop carried dependencies, no permutations, no data-dependent addressing.	22:44
FUZxxl	I 100% agree with you that you can directly translate such code.	22:44
markos	lots of branches but rather straightforward	22:45
FUZxxl	track inpos and outpos	22:46
FUZxxl	these two vary depending on which branch is taken	22:46
FUZxxl	so you cannot just do the usual "do both branches and merge the results" approach as the addresses from which the next elements are fetched vary	22:46
markos	never said you would	22:47
FUZxxl	depening on which branch is taken in a loop-carried-dependency	22:47
FUZxxl	I would really like to see what it would look like	22:47
FUZxxl	it feels hard to imagine.	22:47
markos	but from what I see, the SVP64 version would probably -this is a guess- look closer to the scalar, translated, rather than a SIMD implementation	22:47
FUZxxl	how do you deal with the loop-carried dependency?	22:48
markos	I think that would be a good exercise for you to learn SVP64 assembly :)	22:48
FUZxxl	why do you keep evading my questions?	22:48
lkcl	FUZxxl, i made a start on in-regs insertion-sort, it's about... 8-10-ish instructions, i haven't finished it.	22:48
FUZxxl	I try to understand what the trick is and you keep waving your hands	22:48
markos	I don't really know what you expect me to tell you, if you're expecting me to implement a SVP64 version of that, that's not going to happen	22:48
markos	because I really have a ton of stuff to work on, and this is not in the list	22:49
FUZxxl	I think you may have not understand the problem with vectorising this algorithm	22:49
markos	loop carried dependencies are not as important with SVP64 as they are with SIMD	22:49
FUZxxl	lkcl: it would be interesting to see that.	22:49
FUZxxl	but how are they resolved?	22:49
FUZxxl	e.g. in this algorithm, to fetch c0 for the next iteration, you need to have completed the previous iteration	22:50
markos	what does a loop-carried dependency in the first place?	22:50
FUZxxl	so how would this be vectorised?	22:50
markos	s/does/is	22:50
markos	it's data you have to carry horizontally and you can't do that with SIMD	22:50
markos	you can with scalar	22:50
lkcl	FUZxxl, we're very much learning - all of us - as this is such a new concept that we're literally at the forefront of experimenting and finding out how it works	22:50
FUZxxl	"data you have to carry horizontally" -> it is	22:50
FUZxxl	and the whole point of my SIMD code is that it can do that.	22:51
markos	and you can with SVP64, because the result of each instruction CAN affect the next	22:51
FUZxxl	I think I still do not understand how SVP64 works.	22:51
markos	with SIMD, unless the engine has specific instructions to do that -eg. horizontal adds, min/max/etc- you have to change the order from horizontal to vertical, eg. by transposing the vector matrix	22:52
lkcl	FUZxxl, it's actually not-strictly-speaking a Scalable Vector system at all, it's more classified as a type of "Zero-Overhead Loop-Control"	22:52
lkcl	https://bugs.libre-soc.org/show_bug.cgi?id=659#c5	22:52
markos	even if it does offer specific instructions, you cannot have the same instruction change the results	22:52
lkcl	that's where i began insertion-sort	22:52
lkcl	the idea came from looking at both SIMD and Cray-Style Vectors, and noticing that:	22:53
lkcl	SIMD is "for i in fixed_simd_length: operation(register[j].element[i],....)	22:53
lkcl	and Cray-style Vectors is:	22:54
lkcl	"for i in 0..GLOBAL_DYNAMIC_VECTOR_LENGTH_REGISTER: operation(register[j].element[i], ....)	22:54
lkcl	i decided to separate out the "for-looping" part from "the operation" part, as a prefix-suffix 32-32 encoding.	22:55
FUZxxl	how is this translated into micro operations?	22:55
lkcl	using the Scalar Power ISA as the suffix	22:55
FUZxxl	that's how it's encoded	22:55
FUZxxl	so say the scalar UTF-8 algorithm	22:55
markos	ok, I think I found an example that can explain your question	22:55
markos	from mp3, showing loop-carried dependency	22:56
FUZxxl	if you have a hardware loop mechanism to basically execute the scalar algorithm in a hardware loop, the total latency would still be the same as it's all loop carried	22:56
lkcl	in high-performance implementations it would be done by inserting an additional hardware block in between Decode and Multi-Issue phases	22:56
markos	setvl 0,0,18,0,1,1 # Set VL to 18 elements	22:56
markos	# Load 18 floats from (in)	22:56
markos	sv.lfs *vin, 0(in)	22:56
markos	# equivalent to: for (i = 17; i >= 1; i--) in[i] += in[i-1];	22:56
markos	sv.fadds/mrr vin+1, vin+1, *vin	22:56
markos	check the for loop	22:56
lkcl	nope, not if you use a Multi-Issue Out-of-Order Execution Engine as the underlying hardware	22:56
FUZxxl	if each operation is linearly dependent on the previous one, then I don't see how that would work	22:57
FUZxxl	which is is in the scalar UTF-8 translation algorithm	22:57
FUZxxl	*each iteration	22:57
markos	the actual implementation is a hardware micro-architecture issue	22:57
lkcl	then yes, what you get is, you get the situation where you benefit instead from having a multi-issue OoO Engine as the back-end	22:58
lkcl	where there would be multiple in-flight operations	22:58
FUZxxl	you will not be able to get it faster than the length of the critical path	22:58
lkcl	correct.	22:58
FUZxxl	which is ... 10 times slower than the SIMD algorithm I designed	22:58
lkcl	under which circumstances, a little bit more intelligence in how the algorithm is written has to be done	22:58
lkcl	i.e. you have to think in terms not so much of "SIMD" but of "SIMD elements"	22:59
FUZxxl	ok, done.	22:59
FUZxxl	I can envision how each individual element flows through the algorithm.	22:59
lkcl	you'd have to make an implementation of the algorithm that ensures that the elements have less Hazard-Dependencies	22:59
markos	parallelizing the algorithm does NOT require SIMD	23:00
FUZxxl	which I did, but that only works by viewing a whole vector of elements at once.	23:00
lkcl	this is common practice in Cray-style Vectors (since the 70s)	23:00
lkcl	yes.	23:00
lkcl	that's it. that's what you do.	23:00
lkcl	but	23:00
FUZxxl	i.e. I break the dependencies by computing where the elements go ahead of time and then just permuting.	23:00
FUZxxl	But that does require an entirely different approach, in contrary to what markos said.	23:01
FUZxxl	Which is my whole point.	23:01
lkcl	what you are not restricted to is only thinking in terms of vector-of-elements	23:01
FUZxxl	you aren't restricted to that with SIMD either. It's plausible someone will implement SIMD extensions with individually renamed elements.	23:01
lkcl	you can issue a vector-instruction that very very deliberately has overlaps, e.g. a Prefix-Sum (fibonacci)	23:01
FUZxxl	e.g. AMD already does something like this by splitting vectors into chunks and renaming each chunk separately.	23:01
lkcl	yes but they have to actually explicitly add that as instructions	23:02
FUZxxl	no	23:02
lkcl	oh?	23:02
markos	you asked for an example of SVP64 handling loop-carried dependency, I showed you one, you complained about performance and really there is no way to tell that at the moment	23:02
FUZxxl	it's an implementation detail of how they implement AVX-512 in Zen4	23:02
FUZxxl	My goal is to understand what this is all about	23:02
FUZxxl	so I am proposing problems that would not work with my current understanding of the technology.	23:02
markos	it's not a solution for all problems	23:03
lkcl	no - we're focussing on ISA-level simulations at the moment rather than HDL (architectural) simulations	23:03
FUZxxl	To gain insight into how they could be implemented to then amend my understanding.	23:03
markos	but it does solve many existing problems	23:03
FUZxxl	How do you model the performance without an uarch level simulation?	23:03
lkcl	although the cavatools port will have timing / cycle-accurate counting	23:03
lkcl	at this early phase: we don't.	23:03
FUZxxl	bummer	23:03
lkcl	that's what the next NLnet grant is for	23:03
markos	there is no microarch at the moment, that's what I've been saying	23:03
lkcl	which will take about... 5-6 months to complete	23:04
FUZxxl	the way you explain this algorithm it seems like the details of the uarch design are critical to the perf characteristics of the architecture	23:04
FUZxxl	and for how you need to write code to be fast.	23:04
markos	uarch is always specific to the implementor	23:04
markos	it can be entirely different from each vendor	23:04
lkcl	then we have a base on which to begin maaking performance estimates	23:04
lkcl	ah	23:04
FUZxxl	or more polemically: just because it looks elegant doesn't mean it'll be fast.	23:05
lkcl	one of the things about designing a general-purpose Scalable Vector ISA is: you don't design the ISA based on "performance"	23:05
markos	eg. M1 is Arm Architecture, but NOT Arm Microarchitecture, it's Apple's own, the rest of the vendors (Samsung, Qualcomm, etc) take some parts of the architecture and add their own	23:05
lkcl	that's a very unfortunate conflation that's come from SIMD ISAs	23:05
FUZxxl	I partially agree, but partially disagree	23:05
lkcl	which are near-direct exposure of the underlying SIMD hardware internal micro-architecture directly to/at the programmer	23:05
FUZxxl	you want the programmer to be able to make assumptions about what the performance characteristics are at least	23:06
FUZxxl	it can always be faster in practice	23:06
lkcl	ARM is presently trying to dissuade people from doing exactly that	23:06
lkcl	by designing algorithms to be "vector-length-independent"	23:06
lkcl	as is RISC-V with RVV	23:06
FUZxxl	e.g. something like "if I have a permutation of a vector, do the data elements have to be ready when the permutation happens or only the index elements"	23:07
FUZxxl	which majorly affects how an algorithm requiring a permutation is to be designed	23:07
lkcl	if successful it allows code to be fully portable from embedded systems with zero lanes (a scalar-element) right the way up to massive 1024-way SIMD hardware	23:07
markos	permutation of the vector in SVP64 is just moves based on indices	23:07
markos	actually not even moves	23:07
FUZxxl	renames?	23:07
lkcl	yyeah, permutations are the "massive spanner in the works" for both ARM SVE/2 and RISC-V RVV's "plans", there	23:08
markos	you change the index numbers of the registers that are going to be used in the next iteration	23:08
markos	you want reverse? sure, take a reverse sequence	23:08
FUZxxl	so there is no dependency on the data, only on the indices	23:08
lkcl	you can't possibly have a reliable permutation concept when you've absolutely no idea what the target microarchitectural vector length is!!	23:08
FUZxxl	is that guaranteed or just an implementation detail? That's the sort of stuff I need to know	23:08
FUZxxl	I don't really care about specific instruction timings	23:08
FUZxxl	But about this sort of general stuff.	23:08
markos	too early	23:08
lkcl	there's a hard-guaranteed limit on the Vector Length of 127.	23:08
lkcl	it is required that that Vector Length be supported, across all implementations	23:09
FUZxxl	Same with programming for ARM. You are completely correct about not programming for a specific uarch, but you can make the assumptions that all uarches will be very similar in certain aspects.	23:09
markos	you can never assume anything about specific performance	23:09
FUZxxl	better than on rv64 :-)	23:09
lkcl	even if that implementation is a ridiculously-slow Finite State Machine that's only capable of executing one element per every 10 clock cycles	23:09
FUZxxl	then I will not be able to design fast algorithms for the machine	23:10
FUZxxl	and it will immediately be uninteresting	23:10
lkcl	or it's an insanely-fast one that can do QTY 16of elements per SIMD-back-end-ALU per multi-issue instruction	23:10
markos	one instruction can take 3 cycles on a Qualcomm, and 1 cycle on a M1, and you can never know/assume that	23:10
FUZxxl	sure, not really a problem (though annoying to deal with some times)	23:11
lkcl	you can design portable and highly-efficient (compact) algorithms for any future machine that anyone decides to implement in future	23:11
FUZxxl	what is worse if you e.g. don't know if a shift is a barrel shift or one shift per cycle	23:11
FUZxxl	once again, heavily affects the algorithm design	23:11
markos	or one CPU can have 1 vector ALU and be slow as hell, or it can have 4x128 NEON units (graviton3) and your NEON code can really kick ass	23:11
FUZxxl	or for ARM, you know that perms are happening for the whole register. Elements are not individually renamed.	23:12
FUZxxl	This too affects algorithm design.	23:12
markos	that's a microarchitecture detail	23:12
lkcl	yyeah these kinds of things we're just going to have to live with until there's even one "decent" implementation (let alone 2 or 3)	23:12
markos	you cannot impose that kind of thing to a vendor	23:12
FUZxxl	then I think I'll have to come back once there is more clarity about that	23:12
lkcl	like i said, we're at the software-simulator stage	23:12
markos	unless you sell them the microarchitecture as well as the ISA	23:12
FUZxxl	The thing is, you are proposing a revolutionary ISA design approach but don't give any details about what the implementation is going to look like	23:13
FUZxxl	so it's impossible to optimise complex data flow for it	23:13
lkcl	i've been designing the implementation for 3 years, now	23:13
FUZxxl	e.g. the reduction code you cited above will be really slow (17 cycles minimum) if the CPU just executes all the additions in the order of the dependency chain you define there	23:14
FUZxxl	but perhaps it is faster if it is special cased.	23:14
lkcl	the base i have planned is a multi-issue OoO one...	23:14
lkcl	... yes	23:14
markos	I think you're thinking too much around SIMD to see the benefit	23:14
FUZxxl	In practice it may be 1000 elements and you may really want to know if it is	23:14
FUZxxl	markos: I think in terms of an OOO CPU	23:14
lkcl	right, that's actually coded into the specification as an "Architectural Note/Hint"	23:14
FUZxxl	i.e. in terms of data dependencies	23:14
FUZxxl	ah so you do have assumptions that can be made (?)	23:15
lkcl	that implementors are perfectly permitted to spot those kinds of cases	23:15
FUZxxl	but will they in practice?	23:15
markos	it can also be an in-order CPU, that's again a microarchitecture detail	23:15
lkcl	and replace them with micro-coded optimisations	23:15
lkcl	when we've the VC funding i'm planning to instruct the team on where micro-coded optimisations will be possible, yes	23:15
FUZxxl	I can write possibly beautiful code against your ISA, but until it is known what typical uarches look like, it will not be possible to say if that code is really good code	23:16
lkcl	for example a mapreduce on a Vector-OR-operation is clearly easily paralleliseable	23:16
lkcl	we know	23:16
FUZxxl	the fadd example in particular is tricky because addition is not associative	23:16
lkcl	i have been trying to drum this into one of the team, programmerjake, for some time	23:16
lkcl	yes, and that's also in the specification :)	23:16
FUZxxl	so not sure if the CPU will be able to get rid of the dependency chain	23:17
lkcl	that fadd can never be associative on strict-ordered-linear-reduction	23:17
lkcl	and the implementor is required to respect that.	23:17
lkcl	they are not permitted, on non-associative operations, to choose an arbitrary order. the results are required by the specification to be 100% Deterministic	23:18
lkcl	but	23:18
lkcl	obviously that's not good in terms of performance :)	23:18
lkcl	so there is a parallel reduction mode, as well	23:18
lkcl	which is, again, 100% Deterministic	23:18
lkcl	such that even that can be applied to non-associative base operations	23:18
markos	keep in mind that there are multiple ways to solve a problem, and this code is for a platform that doesn't exist yet, if only to prove a point that it works, it's definitely not "optimized"	23:19
FUZxxl	I believe that it works	23:20
lkcl	the "optimisation" - whether and how to optimise - is down to the hardware implementors, of which there is only one, at present, in HDL, and it was an experimental CPU, done last year.	23:20
markos	in fact, many instructions are added/modified as we try to port existing algorithms to SVP64	23:20
lkcl	(on the basis of which we were able to get more funding. it's a bootstrap process in other words)	23:20
lkcl	the bit that is guaranteed - regardless of implementation - is the reduction in program size	23:21
lkcl	(this is what i have been trying to drum into programmerjake: he has recently also been side-tracked into focussing on "performance")	23:21
markos	imho, that's the best part, we see what's needed for specific algorithms and lkcl willl try to find a way to make such an instruction work, if it's possible :)	23:21
markos	personally I'm not at all worried about performance at this moment	23:22
lkcl	the reduced program size (between 50% and 90%) results clearly in reduced I-Cache hits, reduced TLB lookups, and so on.	23:22
FUZxxl	not really the key factor here	23:23
lkcl	and that's clearly right across the board regardless of the data-path side, number of ALUs, and so on	23:23
FUZxxl	SVP64 doesn't really improve standard business logic which is the big icache hog	23:23
FUZxxl	a hot loop usually fits into the µop cache, even with unrolled SIMD	23:23
lkcl	if the I-side is sitting idle literally doing nothing it's saving power	23:23
markos	I'm more interested in getting eg. the butterfly instructions used in DCT in the ISA, so that we can have a full DCT implementation in just a dozen instruction -compared to the thousands right now	23:24
lkcl	we've found that the hot-loops fit into a single line of an I-Cache, without loop-unrolling!	23:24
FUZxxl	My key argument: I tend not to believe claims that something will eventually perform well unless the way it's going to be implemented is clearly laid out and will obviously work	23:24
markos	stuff like that is more important at this stage	23:24
* lkcl looking for an example		23:24
markos	defining the ISA implies nothing about the underlying architecture	23:24
markos	sorry, micro-architecture	23:25
FUZxxl	partially	23:25
FUZxxl	ISA design choices heavily affect what the uarch has to take into account	23:25
markos	no, it says nothing about X instruction requiring 2 cycles or 10	23:25
markos	that's uarch-specific	23:25
lkcl	in a SIMD ISA, yes.	23:25
lkcl	in a Scalable Vector ISA, very deliberately no.	23:26
FUZxxl	again: I don't really care about these details, but more about "what is renamed in dependence to what? What are the hidden data dependencies? What operations are renames? What are ALU ops?"	23:26
FUZxxl	e.g. if you have some sort of selection instruction, is it an ALU op or a data-dependent rename?	23:26
FUZxxl	This severely affects algorithm design	23:26
lkcl	https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_ldst.py;h=647e7935d1aa9a77713c3a7af4838e16b5e4c028;hb=36ab18ddb338b2b4be00bdbc03d9efa905a2c3f9#l32	23:26
FUZxxl	regardless of what the cycles counts are going to be	23:26
markos	do you need to know such info for example on Intel or Arm?	23:27
markos	when writing SIMD code?	23:27
FUZxxl	yes	23:27
FUZxxl	I mean if I don't care about perf I can ignore it	23:27
markos	but such info always or almost always changes between revisions of the cpu	23:27
FUZxxl	not really	23:28
markos	well, you mentioned scatter/gather before	23:28
FUZxxl	the important bits always stay similar	23:28
markos	before icelake it was crap	23:28
FUZxxl	correct	23:28
markos	even now it's not ideal	23:28
FUZxxl	it's pretty much about as good as it gets with the number of load/store ports they have	23:28
markos	in fact I have yet to see more than a few % increase vs not using gather scatter	23:28
markos	and ALWAYS it depends on the actual cpu model	23:28
lkcl	ok so scatter-gather is possible in SVP64 by using Predicate Mask bits, one on source, one on destination	23:29
markos	ie, is it a i9 or a Xeon? and what model?	23:29
FUZxxl	they all have the same number of load/store ports in one uarch generation	23:29
markos	because Xeon obviously can scale better in memory loads	23:29
lkcl	if you use "twin" predication - you get the gather-scatter back-to-back... in the same instruction	23:29
FUZxxl	no it can't	23:29
lkcl	but	23:29
lkcl	this is "high-level"	23:29
lkcl	how the micro-ops are issued, how many are issued, how they are satisified: this is down to the hardware implementor to implement	23:30
markos	the thing is that with SVP64 you don't have to do anything special with scatter/gather, there is no special instruction needed	23:30
lkcl	and then optimise	23:30
FUZxxl	and this is critical to know	23:30
FUZxxl	so?	23:30
FUZxxl	does it matter if a special instruction is needed?	23:30
FUZxxl	Using scatter/gather or not is a choice of data structure and algorithm design	23:30
lkcl	and it will be down to the hardware implementor to either declare what they have done, or for programmers to find out, once hardware is available	23:30
FUZxxl	not of whether there are special instructions for it or not	23:30
lkcl	(likely in about 3-4 years time)	23:30
FUZxxl	looking forwards to it!	23:31
FUZxxl	I think it'll be a whole lot more interesting for me once there is silicon available	23:31
lkcl	we'll be able to do FPGA simulations a lot sooner before before then.	23:31
FUZxxl	that will be useful	23:31
FUZxxl	the tyranny of CPU design is,	23:31
FUZxxl	if you have the most beautiful programming model	23:32
FUZxxl	but it's faster to just hack it and write uggly-ass code, compilers will do exactly that	23:32
FUZxxl	history has repeatedly proven that	23:32
lkcl	yes, i remember the first SPARC compilers, simply padding nops in between all operations	23:32
lkcl	because the early hardware had absolutely zero hazard-dependency-checking	23:33
FUZxxl	so before it is proven that the SVP64 coding model you envision will actually be faster than, say, just treating it as a SIMD arch	23:33
* lkcl face-palm		23:33
FUZxxl	I am going to be wary	23:33
lkcl	well, in this case it's reasonable to ride off the back of the assumption of a half-decent (advanced) multi-issue out-of-order execution back-end	23:33
markos	it doesn't matter if it's faster in absolute numbers, it probably won't in the beginning anyway	23:33
lkcl	those exist.	23:33
markos	but if you can run a DCT algorithm in 20 instructions vs 1000 on a 1Ghz 5W SoC, it definitely matters	23:34
markos	for example	23:34
markos	it won't beat the intel/amd/Arm cpus in 5y time	23:35
lkcl	if they exist now, and the "looping" system can be "inserted" - as i said - between Decode and Issue - then it's also reasonable to expect parity between "the Multi-Issue System's Performance Before SVP64" and "the Multi-Issue System's Performance After Adding SVP64"	23:35
lkcl	(it won't be precisely the same because it's one extra pipeline stage, at least)	23:35
FUZxxl	markos: I'm not sure if it'll be as effective as you think	23:36
lkcl	(which in turn has a knock-on effect on the ratio of in-flight operations vs the completion latency... blah blah)	23:36
FUZxxl	But I think time will tell	23:37
FUZxxl	I think there's a good chance it'll be amazing and I'm going to love it.	23:37
markos	we'll cross that bridge when we get there, nothing more can be said right now	23:37
FUZxxl	I agree	23:37
lkcl	there's some obvious things on DCT (i'm going to do 2D DCT next, it'll take about 2 months, same number of instructions)	23:38
lkcl	a "normal" (SIMD) way of doing DCT (let alone 2D DCT) requires either pushing the data back to memory and loading it back again	23:38
*** jab <jab!~jab@user/jab> has joined #libre-soc		23:38
lkcl	or it requires copying of the vector of data from register-batch to register-batch	23:38
lkcl	the SVP64 REMAP system leaves the entire DCT in-registers in-place	23:39
markos	yes, 2 passes, transposing the whole matrix between them and after	23:39
lkcl	you have the vector of registers for the coefficients	23:39
lkcl	you have the vector of registers containing the data	23:39
FUZxxl	Imagine a SIMD arch with as large as a register file and as long as a vector length as SVP64	23:40
lkcl	apart from one additional scalar register containing the DCT size, there are no other registers needed!	23:40
FUZxxl	would the benefits persist?	23:40
lkcl	yes!	23:40
lkcl	yes they would	23:40
FUZxxl	ok	23:40
lkcl	because the SIMD arch would firstly have the entire triple-loop as explicit instructions	23:40
lkcl	secondly, they'd go "argh we don't have a bit-reversed Vector-LD instruction, we have to compute the indices for the permutation"	23:41
FUZxxl	the SIMD arch could have that instruction	23:41
FUZxxl	or it could have a general mechanism to do simple permutations on its operands	23:42
FUZxxl	e.g. linear shifts or reversals	23:42
lkcl	thirdly (unless it's ARM, with that funny twin-butterfly instruction) they'd go "argh, this data is in reverse-order, it's going to overwrite and destroy the source which we need later, we have to use double the number of registers"	23:42
lkcl	as a hardware-computed schedule, covering the entire triple-loop?	23:42
FUZxxl	I don't know	23:43
FUZxxl	I don't understand SVP64 well enough to tell	23:43
lkcl	TI's MSP32* series and Qualcomm's Hexagon DSP only have ZOLC for the inner loop of FFT/DCT	23:43
markos	libvpx: Arm NEON: 4k lines of fdct code for 4x4, 8x8, 16x16, 32x32 and that's excluding the transposing functions, x86: 4-8k linesspread across multiple files, multiple implementations, also excluding transposing functions	23:44
lkcl	i haven't been able to find anything in any other ISA like this	23:44
FUZxxl	it could exist	23:44
FUZxxl	just like SVP64 could exist	23:44
lkcl	not in the major ISAs (which, let's be realistic, they're the ones that "matter")	23:44
FUZxxl	make a steelman: imagine how a SIMD architecture could reasonably be extended to solve these issues	23:44
FUZxxl	don't make the mistake of saying "we will add lots of fun stuff but SIMD will stay exactly the same way it is"	23:45
lkcl	i was surprised ARM had that double-butterfly mul-add-shift as a single instruction	23:45
lkcl	SIMD is SIMD (at the back-end)	23:45
FUZxxl	sure. And the double-butterfly mul-add-shift stuff is possible.	23:45
FUZxxl	As is what AVX-512 does.	23:45
FUZxxl	As is probably a whole lot more.	23:45
markos	FUZxxl, the solution followed by intel/arm/etc is to just add more instructions	23:45
lkcl	you can't escape the fact that there's really only one way to build a SIMD ALU: put down a batch of parallel gates.	23:45
FUZxxl	no, you can also shove the register through in batches, as e.g. Zen4 does	23:46
FUZxxl	you could even do each element individually	23:46
FUZxxl	and while e.g. SIMD shuffles are ALU ops right now, they might well be renames in future iterations of the design	23:46
lkcl	yes - that's taking the "Cray Vector Chaining" concept from the 1970s (first implemented in Cray-I) and applying it to modern architectures	23:47
FUZxxl	correct -- it's all uarch	23:47
FUZxxl	so why should SIMD not benefit from better uarch design while you are certain the SVP64 will?	23:47
FUZxxl	*that	23:47
lkcl	because once - in your mind - you have separated the "SIMD-as-a-direct-front-end-ISA" from "SIMD-at-the-back-end-in-hardware"	23:48
lkcl	then realised that Scalable Vector ISAs also require (pretty much) identical back-end Engines as those processors which have a SIMD-front-end-ISA	23:48
lkcl	and accepted that, comparing apples-with-apples you make the back-ends the same	23:49
FUZxxl	ok	23:49
lkcl	then you can focus on comparing the front-ends	23:49
FUZxxl	but then (to loop back to the previous point), the backend will not solve the loop-carried-dependency issues	23:49
markos	imho, SIMD is not future proof, right now Intel+Arm+Power: 17k total C intrinsics, slightly fewer asm instructions, Intel + Arm take teh majority, VSX by comparison is just ~200 intrinsics	23:49
markos	you can't just keep adding more and more instructions	23:50
lkcl	within about.. how long markos? about eight weeks? we've found that people go "holy s*** this is so much easier and drastically reduces instruction count"	23:50
FUZxxl	If you look at AVX-512 and NEON, it doesn't actually have all that many truly distinct instructions	23:50
markos	both AVX512 and SVE -to a lesser extent- have to lower clock frequency when AVX512/SVE* kick in	23:50
FUZxxl	it's mostly the same instruction counted a bunch of times for different data types and vector sizes	23:50
markos	which lowers performance	23:50
lkcl	it was 1/2 hour ago (or so) and i wasn't quite following: what's "loop-carried-dependencies"? you mean like prefix-sum and reduction?	23:51
FUZxxl	Ever since Icelake they no longer do it	23:51
markos	most of the times the instructions are similar -not 1-1 mapping ofc	23:51
FUZxxl	lkcl: e.g. I was talking about the UTF-8 code where to find the start of the next character you have to finish analysing the previous one (in the scalar algorithm)	23:51
FUZxxl	So even an out of order uarch will not be able to process more than one character at a time	23:51
markos	and intel has some pretty specific instructions, eg. movemask set of instructions	23:51
lkcl	ah yes. then you have two choices.	23:51
FUZxxl	markos: these are obsolete with AVX-512	23:52
lkcl	but you can at least chuck the instructions into in-flight buffers	23:52
markos	which do not exist on Arm, so you have to emulate them	23:52
FUZxxl	as AVX-512 adds masks to all loads and stores	23:52
markos	not obsolete, just done differently	23:52
markos	yes I know	23:52
lkcl	and you can at least benefit from that (whilst still kicking yourself in frustration that the scalar-chained-hazards exist)	23:52
lkcl	or	23:52
FUZxxl	lkcl: with e.g. an 8 wide fronted on the Apple M1, you can chuck them quickly enough even in the conventional paradigm	23:52
lkcl	like you mentioned	23:52
markos	surprisingly VSX does offer a very similar actually more powerful instruction that you can use to emulate movemask in a few instructions	23:52
lkcl	yehyeh, i'm counting on that :)	23:53
FUZxxl	So I don't quite see the benefit still	23:53
FUZxxl	(I usually do uarch simulations on my code and the frontend is rarely the bottleneck)	23:53
lkcl	you can parallelise the algorithm and treat the Vector Elements partially as-if they were SIMD	23:53
lkcl	i.e. you can set the Vector Length to an explicit width	23:53
FUZxxl	exactly -- which requires a much more complex algorithm like the one I designed	23:53
lkcl	and effectively "emulate" SIMD.	23:53
lkcl	but here with SVP64 you have the choice	23:54
FUZxxl	And it's unclear if it can be expressed in SVP64	23:54
lkcl	(and a third option)	23:54
lkcl	the fact that we can "emulate" (not quite the right word) SIMD means "yes you can"	23:54
markos	some inspiration, tricks can be emulated, but I think the whole point is that you don't really need a specific algorithm to write SVP64 code	23:54
FUZxxl	So to summarise: for my combinatorial problems, SVP64 will not make the code shorter or easier or better most likely.	23:54
lkcl	i spent 4 years taking every concept from every SIMD and Scalable Vector ISA i could get my hands on and including them	23:54
markos	you don't know that until you try it	23:55
lkcl	ah, i haven't gotten onto "option 3" - Vertical_First, yet :)	23:55
FUZxxl	I am all ears.	23:55
markos	quite possibly it's going to be shorter, no way to tell if it's going to be faster though	23:55
lkcl	i had to do it as a video	23:55
lkcl	https://youtu.be/fn2KJvWyBKg	23:55
lkcl	but there's a diagram that explains it quicker	23:55
lkcl	1 se	23:55
lkcl	c	23:55
FUZxxl	markos: how is it ging to be shorter? The SIMD code is not unrolled and each instruction does some elementary step that needs to be performed.	23:56
markos	"possibly"	23:56
markos	as I said, no way to know that until someone tries	23:56
lkcl	https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/sv_horizontal_vs_vertical.svg;hb=HEAD	23:57
markos	because so far it has been shorter than all SIMD implementations of functions we ported	23:57
markos	(the SVP64 version that is)	23:57
FUZxxl	There's nothing in the code that really needs a fixed vector length, it would most likely be directly translatable to the vector paradigm if it has all the relevant instructions.	23:57
lkcl	Horizontal-First is how both Cray-style Vectors and SIMD ISAs do "elements"	23:57
FUZxxl	However, I doubt it would be faster or shorter.	23:57
lkcl	they progress through elements first before moving onto instruction second	23:57
lkcl	Vertical-First is the other way round	23:57
lkcl	the hardware maintains an element-offset-pointer	23:58
lkcl	which remains at its current index until explicitly incremented	23:58
FUZxxl	markos: the functions you've showed me where either trivially parallelisable or have very simple data-dependencies	23:58
FUZxxl	And I believe you in that they can be expressed much easier	23:58
FUZxxl	lkcl: ok	23:58
lkcl	FUZxxl, the reason for that is exactly the benefit	23:58
lkcl	we had to go back to the scalar implementations because the SIMD implementations were so obtuse, so heavily optimised, they're unreadable	23:59
FUZxxl	I'm 100% certain future SVP64 kernels will be too	23:59
markos	mp3 is anything but trivially parallelizable but ok	23:59
markos	I showed you one line	23:59

Generated by irclog2html.py 2.17.1 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!