Sunday, 2022-10-23

FUZxxl	markos: you showed me all the code in that directory	00:00
lkcl	i then very deliberately took the optimisation path at the ISA level to make sure that those "simple" looking vectorised algorithms could be thrown at multi-issue (parallel) hardware and get high performance	00:00
FUZxxl	lkcl: okay, vertical first makes sense and permits data dependencies	00:00
lkcl	yes.	00:00
FUZxxl	but they'll still be there creating dependency chains	00:00
lkcl	it's a kind-of cheat	00:00
FUZxxl	do you plan to rename each vector element individually?	00:01
lkcl	if the loop is small enough, hardware may go, "oh, hm, i'm getting a batch of non-conflicting non-overlapping elements. i could SIMD-batch those. let me just do that"	00:01
FUZxxl	I am not sure if this will be possible in practice	00:01
lkcl	FUZxxl, at the Scalar-register-element-size level, yes	00:01
markos	lkcl, and that's a micro-architecture specific detail	00:02
lkcl	markos, yes	00:02
FUZxxl	existing OOO architectures cannot re-schedule after more data dependencies are known	00:02
markos	every vendor might choose to implement this one way or another	00:02
markos	or not at all	00:02
FUZxxl	(which really sucks on current Intel uarchs, too)	00:02
lkcl	when the elements are linear and (like MMX) below the 64-bit level, they'll be easily batched	00:02
lkcl	but beyond that, it gets... tricky	00:02
FUZxxl	okay, so you will have to write convoluted code to get the batching right in the non-trivial case.	00:03
lkcl	well, the value of doing that is going to depend on how many implementations there are (in.... 4-10 years time)	00:03
FUZxxl	Looking forwards to it!	00:04
lkcl	ultimately (annoyingly) we will need switches in gcc, per architecture	00:04
lkcl	to say "please generate assembler targeted at v1.2.3.4 vendor's hardware"	00:04
lkcl	it's inevitable, sigh	00:04
FUZxxl	Please don't understand my words as a disapproval of your project. In fact, the ideas are extremely fascinating and likely to lead to interesting results.	00:04
markos	SVP64 is not trivially simple, nor does it lack complexity, but the difference is that instead of having thousands upon thousands of different instructions, it offers very few extra instructions that sit on top of the existing scalar instructions and "vectorize" them	00:04
lkcl	no, not at all	00:05
FUZxxl	Lack of performance portability is going to be tricky if it happens.	00:05
FUZxxl	markos: I don't think a high instruction count is really a problem.	00:05
markos	it is	00:05
lkcl	realistically, RED Semiconductor Ltd (the company i established) will have the only hardware, for at least 6-8 of those years	00:05
FUZxxl	If you e.g. look at ARM, most instructions just combine the existing HW in different ways to reduce the latency of common operations.	00:05
FUZxxl	e.g. ARM has instructions to zero-extend + add at once	00:06
markos	Arm has an orthogonal ISA	00:06
lkcl	FUZxxl, you may not be aware: in the IBM POWER9, there's a bottleneck at the L2 Cache	00:06
markos	so you can predict the exact instruction you need	00:06
lkcl	if you have an algorithm that cannot fit into L1 I-Cache, that is also L1 D-Cache heavy	00:06
lkcl	you get contention!	00:06
FUZxxl	you could do it in two separate instructions but it would be slower. The hardware can already do both at once, so it makes sense to expose that.	00:06
markos	Intel definitely does not have that	00:06
lkcl	not many people are even aware of that limitation of IBM's POWER9 microarchitecture	00:06
FUZxxl	markos: AVX-512 is pretty orthogonal	00:06
markos	there are so many variants you have to constantly check the ISA manual to see which instruction exactly you need	00:07
FUZxxl	so if you look at 750 something ASIMD instructions, it really boils down to not that many truly distinct operations	00:07
FUZxxl	markos: sure, but you could solve that with a better asm syntax (wink wink)	00:07
markos	it's better, but not much because you always have to carry the old baggage of AVX2/SSE	00:07
FUZxxl	e.g. deriving zero extension from the operand type or something	00:07
FUZxxl	same with Intel	00:08
FUZxxl	a better assembler could get rid of vfmadd132pd and friends and just derive the right opcode from the combination of operands	00:08
FUZxxl	lkcl: ah that's an ouch for sure	00:08
lkcl	that same logic ("better asm syntax") is what drove me to create SV.	00:08
markos	FUZxxl, you depend on the compiler in that case	00:09
lkcl	over time i expect it to propagate cleanly up to intrinsics and ultimately to the compilers, without needing new front-end high-level languages	00:09
markos	I prefer not to write asm unless I have to	00:09
FUZxxl	markos: if you have a compiler, why are you spending your time reading ISA manuals?	00:09
markos	and with SVP64 I was able to write a working implementation in a few hours	00:09
lkcl	because markos's company specialises in optimisation for companies	00:10
FUZxxl	I see	00:10
lkcl	such as ARM and Intel	00:10
markos	Arm is our client	00:10
markos	SVP64 is a personal involvement	00:10
lkcl	you did AV1 for them, recently, and that... what was it...	00:10
markos	no, libvpx	00:10
markos	av1 is next :)	00:10
markos	and vectorscan	00:10
FUZxxl	I do not like writing SIMD code in high level languages because compilers suck at generating SIMD code	00:10
lkcl	the "40,000-regex-which-intel-optimised"?	00:11
lkcl	FUZxxl, we know! :)	00:11
markos	ported Intel hyperscan to Arm, Intel didn't accept any non-intel ports to the original project, hence the fork	00:11
FUZxxl	I see	00:11
markos	and porting it to VSX was done just for fun	00:11
lkcl	hyperscan, that was it	00:11
FUZxxl	cool project, really	00:11
lkcl	oh, did toshywoshy's advice help on VSX?	00:12
FUZxxl	Hah I actually have a bit-parallel string matching paper in my pipeline	00:12
lkcl	were there any other areas it got better?	00:12
markos	FUZxxl, if I did every project in hand written asm I'd still be working on the first function :)	00:12
FUZxxl	should publish it some day	00:12
* lkcl FUZxxl: ooOoo		00:12
markos	lkcl, the vec_gb instruction, yes, it doubled performance on the Power9 :)	00:12
markos	s/instruction/intrinsic	00:12
lkcl	bit-parallel string-matchiiing :)	00:12
lkcl	markos, cool!	00:13
lkcl	dang	00:13
FUZxxl	the algorithm is crazy simple	00:13
markos	basically it reduced movemask emulation from a dozen instructions (or more, I don't remember) down to 5	00:13
markos	still the project is full of movemask intellisms and I have to abstract them away so that it doesn't hurt performance so much on Arm/Power	00:13
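
For readers unfamiliar with the "movemask" idiom mentioned above: the x86 SSE2 operation collects the top bit of every byte lane of a vector into one scalar bitmask, and other ISAs have to emulate it. The sketch below only models those semantics in plain Python; the function name `movemask_bytes` is hypothetical, and it is not the vectorscan code nor the vec_gb-based sequence markos describes.

```python
def movemask_bytes(lanes):
    """Model of a byte-wise movemask: bit i of the result is the
    most significant bit of lanes[i]. `lanes` is a sequence of
    integers in the range 0..255 (one per vector byte lane)."""
    mask = 0
    for i, byte in enumerate(lanes):
        if not 0 <= byte <= 0xFF:
            raise ValueError("each lane must be a byte (0..255)")
        mask |= ((byte >> 7) & 1) << i   # keep only the lane's top bit
    return mask


# A 16-lane vector where lanes 0 and 2 have their top bit set.
vector = [0x80, 0x00, 0xFF, 0x7F] + [0x00] * 12
print(bin(movemask_bytes(vector)))
```

On hardware without a single instruction for this, the per-lane loop becomes a short sequence of shifts, permutes or bit-gathers, which is where the instruction counts discussed above come from.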
lkcl	FUZxxl, it's funny, it's often the simple things/ways that get missed	00:13
FUZxxl	basically, it's an improvement over Boyer-Moore and all the other algorithms that have the basic "test char, compute shift amount, go to next iteration" loop	00:13
markos	did it for a few modules but it's all over the place	00:14
lkcl	https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm	00:14
FUZxxl	with the key improvement being that (a) it never forgets any information it gained and (b) it can check multiple characters per iteration, hence benefitting from OOO architectures	00:14
markos	problem is SVE, it doesn't want to play with existing SIMD abstractions	00:14
FUZxxl	I'm kind of scared that someone already came up with the idea	00:15
FUZxxl	and my algorithm can deal with character classes which is nice	00:15
lkcl	FUZxxl, write it up, definitely!	00:16
FUZxxl	e.g. you can match something like photo-19[89][0-9]-[0-9][0-9]-[0-9][0-9].jpg	00:16
FUZxxl	main disadvantage: the length of the search pattern is limited to your register length	00:16
FUZxxl	but you can simply look for a 64 char suffix of the search pattern in most cases which is good enough	00:17
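
FUZxxl's algorithm is not described in enough detail here to reproduce, so the following is explicitly NOT it. It is a minimal sketch of the classic Shift-And (bitap) matcher, a well-known bit-parallel technique that happens to share the two properties just mentioned: character classes cost nothing extra, and the pattern length is bounded by the register width. The helper names `parse_classes` and `shift_and_search`, and the tiny `[...]` syntax they accept, are assumptions made for illustration.

```python
def parse_classes(pattern):
    """Split a pattern into one set of allowed characters per position.
    Only literals and simple classes such as [89] or [0-9] are supported."""
    classes = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            end = pattern.index("]", i)
            body = pattern[i + 1:end]
            allowed = set()
            j = 0
            while j < len(body):
                if j + 2 < len(body) and body[j + 1] == "-":
                    lo, hi = ord(body[j]), ord(body[j + 2])
                    allowed.update(chr(c) for c in range(lo, hi + 1))
                    j += 3
                else:
                    allowed.add(body[j])
                    j += 1
            classes.append(allowed)
            i = end + 1
        else:
            classes.append({pattern[i]})
            i += 1
    return classes


def shift_and_search(pattern, text, word_bits=64):
    """Return the start index of every match of pattern in text."""
    classes = parse_classes(pattern)
    m = len(classes)
    if m == 0 or m > word_bits:
        raise ValueError("pattern must have 1..word_bits positions")
    # masks[ch] has bit p set if character ch is allowed at position p.
    masks = {}
    for pos, allowed in enumerate(classes):
        for ch in allowed:
            masks[ch] = masks.get(ch, 0) | (1 << pos)
    accept = 1 << (m - 1)
    state = 0            # bit p set: positions 0..p match the text ending here
    hits = []
    for i, ch in enumerate(text):
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(i - m + 1)
    return hits


pattern = "photo-19[89][0-9]-[0-9][0-9]-[0-9][0-9].jpg"
print(shift_and_search(pattern, "x photo-1987-03-14.jpg y"))
```

The pattern above occupies 20 bit positions, so it fits comfortably in a 64-bit word; a longer pattern would need the suffix trick mentioned above or multi-word state.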
FUZxxl	lkcl: will do!	00:17
lkcl	it sounds... significant	00:17
lkcl	i mean that	00:17
FUZxxl	in fact, I already have	00:17
* lkcl late, here. and for you, markos, you're 2 hours ahead of me and it's 00:18 for me!		00:19
lkcl	back to vegging out with a book is called for	00:19
lkcl	until next time	00:19
lkcl	thank you both - awesome conversation	00:19
markos	indeed	00:20
FUZxxl	good night and thank you!	00:22
FUZxxl	As for Tuesday, I may have to shift my attendance to next week	00:29
FUZxxl	It's my grandmother's birthday and the celebrations may run late	00:30
*** jab <jab!~jab@user/jab> has quit IRC		03:18
programmerjake	welcome FUZxxl!	03:21
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has quit IRC		09:15
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.172.157> has joined #libre-soc		09:16
*** yambo <yambo!~yambo@69.146.1.110> has quit IRC		09:28
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.172.157> has quit IRC		10:00
*** openpowerbot <openpowerbot!~openpower@94.226.188.34> has joined #libre-soc		10:10
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.162.225> has joined #libre-soc		11:12
ghostmansd[m]	lkcl, hi! Any ideas on math-free tasks? :-)	11:23
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.162.225> has quit IRC		11:30
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has joined #libre-soc		11:34
lkcl	maaath, mmmm loovely	11:52
lkcl	i'm still thinking about it, because everything we're doing right now on the cryptoprimitives thingy is driven by the algorithms	11:53
lkcl	ah, i tell you what	11:54
lkcl	we need an "add with shift by immediate-2/4/8/16" instruction	11:55
lkcl	which you could add to the spec page, and to av.mdwn, and blah blah	11:55
lkcl	under https://bugs.libre-soc.org/show_bug.cgi?id=771	11:56
lkcl	then the unit tests (etc) under https://bugs.libre-soc.org/show_bug.cgi?id=840	11:56
lkcl	with a special note in the spec that the (very same) instruction is needed for LD/ST-address-calculation-with-a-mini-bit-of-a-shift	11:57
ghostmansd[m]	Ok, is there some insn that I should take as reference?	12:08
lkcl	https://libre-soc.org/openpower/sv/bitmanip/#shift-add	12:14
lkcl	there's one in ARM, the syntax uses "#N" on the end of the add-part (we'll not be doing that)	12:17
lkcl	programmerjake, thank you for the unit test on set_masked_reg()	12:20
lkcl	i relied on the unit tests using it "getting things right" (chacha20 for example)	12:21
lkcl	ghostmansd[m], so, bit of a pain (but they have separate budgets), tracking 3 separate bugreports: one for implementation, one for unit tests, one for spec/documentation	12:21
ghostmansd[m]	So we basically need to create everything for these: https://libre-soc.org/openpower/sv/bitmanip/#shift-add?	12:23
ghostmansd[m]	Sigh, IRC thinks ? is a part of URL	12:23
ghostmansd[m]	Ok, will do it	12:23
lkcl	hexchat doesn't :)	12:24
lkcl	yes.	12:24
lkcl	a (new) Z23-Form exists, so the pseudocode can use "sm"	12:24
lkcl	rather than "sh"	12:24
lkcl	do make sure to drop in the git-commit-diff-link under the right bugreport as you do them (just to show some justification for the payment)	12:25
lkcl	https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=HEAD	12:25
lkcl	rather than	12:25
lkcl	https://git.libre-soc.org/?p=openpower-isa.git;a=commit;h=HEAD	12:26
ghostmansd[m]	Ok, cool!	12:59
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC		13:56
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has quit IRC		14:53
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.52.183> has joined #libre-soc		14:53
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.52.183> has quit IRC		15:08
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has joined #libre-soc		15:08
*** yambo <yambo!~yambo@69.146.1.110> has joined #libre-soc		16:37
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc		17:29
*** octavius <octavius!~octavius@149.147.93.209.dyn.plus.net> has joined #libre-soc		18:01
octavius	lkcl, the pseudo-code for shadd only allows "sh" to be 0-3 (2-bit mask), and then 1 is added, so the max possible shift is 4. Is this standard behaviour? It seems a little roundabout (but I guess there's no way to mask and ensure a max of 4 otherwise)	18:03
octavius	Also, which bitfield does "sh" correspond to in the Z23-form?	18:04
lkcl	octavius, yyep.	18:44
lkcl	if you only need a straight add then you use a straight add	18:45
lkcl	and it's "1<<sm" so that's 2, 4, 8 and 16	18:45
octavius	Ah ok, makes sense	19:27
octavius	wait.	19:31
octavius	sm =0,1,2,3	19:31
octavius	1<<sm=1,2,4,8	19:31
octavius	masking this with 0x3, the last two values will give the same shift value	19:31
lkcl	sm = 0,1,2,3	19:59
lkcl	sm &= 0x3	19:59
lkcl	1<<(sm+1) == 2,4,8,16	20:00
lkcl	not	20:00
lkcl	sm = 0,1,2,3	20:00
lkcl	(1<<(sm+1)) & 0x3	20:00
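
The arithmetic in this exchange is easy to get tangled, so here is a minimal Python model of it. It assumes, as octavius describes above, that the 2-bit field is masked first and 1 is then added to form the shift amount. Which operand gets shifted (RB here), the 64-bit wrap-around, and the function names are assumptions for illustration, not the spec pseudocode.

```python
MASK64 = (1 << 64) - 1


def scale_factor(sm):
    """Multiplier implied by the 2-bit field: mask sm, then add 1."""
    return 1 << ((sm & 0x3) + 1)


def shadd(ra, rb, sm):
    """Model of shift-add: RA + (RB << (sm + 1)), truncated to 64 bits.
    Assumption: RB is the shifted operand."""
    shift = (sm & 0x3) + 1
    return (ra + (rb << shift)) & MASK64


def wrong_scale_factor(sm):
    """The misreading: masking the shifted result instead of sm."""
    return (1 << (sm + 1)) & 0x3


print([scale_factor(sm) for sm in range(4)])        # mask applied to sm
print([wrong_scale_factor(sm) for sm in range(4)])  # mask applied to result
print(hex(shadd(0x1000, 5, 3)))                     # base + index * 16
```

With the mask on sm, the four encodings give multipliers of 2, 4, 8 and 16, matching the "immediate-2/4/8/16" wording of the original request; a plain add covers the multiply-by-1 case.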
*** octavius <octavius!~octavius@149.147.93.209.dyn.plus.net> has quit IRC		20:33
programmerjake	lkcl: when linking to stuff in git, please link to an actual commit, not HEAD	21:51
lkcl	programmerjake, i gave it as an example only	22:30
programmerjake	yeah, just this isn't the first time...	22:49
*** openpowerbot <openpowerbot!~openpower@94.226.188.34> has quit IRC		23:18
*** openpowerbot <openpowerbot!~openpower@94-226-188-34.access.telenet.be> has joined #libre-soc		23:25
lkcl	programmerjake, the reason i gave it was not for the purposes of showing the commit itself	23:41
lkcl	the reason i gave it was for comparative purposes of demonstrating to ghostmansd, to ask him to please show the diff link not the commit link	23:42
lkcl	the actual reference was completely irrelevant as to what was actually shown	23:42
lkcl	whether it was HEAD or any other commit was not part of the request to him	23:43
lkcl	consequently it is not in the least bit relevant to ask me to link to an actual commit	23:43
lkcl	as i was not in any way asking him about any actual specific commit, at all.	23:43
lkcl	so just so you know: you're asking me to do something irrelevant on something completely unrelated to the purpose of the conversation.	23:46
programmerjake	i'm pointing it out not because this time it's a problem (though it is a bit misleading for ghostmansd if your demo doesn't contain all the correct pieces of info), but because it has been a problem several times in the past.	23:47
lkcl	therefore i'm going to ignore the request as it is not relevant	23:47
lkcl	i'll eventually successfully communicate with him, through repetition, and expect to catch him at a time that's convenient	23:48
programmerjake	imho he likely figured it out -- he's smart	23:49
