Sunday, 2022-10-23

FUZxxlmarkos: you showed me all the code in that directory00:00
lkcli then very deliberately took the optimisation path at the ISA level to make sure that those "simple" looking vectorised algorithms could be thrown at multi-issue (parallel) hardware and get high performance00:00
FUZxxllkcl: okay, vertical first makes sense and permits data dependencies00:00
lkclyes.00:00
FUZxxlbut they'll still be there creating dependency chains00:00
lkclit's a kind-of cheat00:00
FUZxxldo you plan to rename each vector element individually?00:01
lkclif the loop is small enough, hardware may go, "oh, hm, i'm getting a batch of non-conflicting non-overlapping elements.  i could SIMD-batch those. let me just do that"00:01
FUZxxlI am not sure if this will be possible in practice00:01
lkclFUZxxl, at the Scalar-register-element-size level, yes00:01
markoslkcl, and that's a micro-architecture specific detail00:02
lkclmarkos, yes00:02
FUZxxlexisting OOO architectures cannot re-schedule after more data dependencies are known00:02
markosevery vendor might choose to implement this one way or another00:02
markosor not at all00:02
FUZxxl(which really sucks on current Intel uarchs, too)00:02
lkclwhen the elements are linear and (like MMX) below the 64-bit level, they'll be easily batched00:02
lkclbut beyond that, it gets... tricky00:02
FUZxxlokay, so you will have to write convoluted code to get the batching right in the non-trivial case.00:03
lkclwell, the value of doing that is going to depend on how many implementations there are (in.... 4-10 years time)00:03
FUZxxlLooking forwards to it!00:04
lkclultimately (annoyingly) we will need switches in gcc, per architecture00:04
lkclto say "please generate assembler targeted at v1.2.3.4 vendor's hardware"00:04
lkclit's inevitable, sigh00:04
FUZxxlPlease don't understand my words as a disapproval of your project.  In fact, the ideas are extremely fascinating and likely to lead to interesting results.00:04
markosSVP64 is not trivially simple neither does it lack complexity, but the difference is instead of having thousands upon thousands of different instructions, it offers very few extra instructions that sit on top of the *existing* scalar instructions and "vectorize" them00:04
lkclno, not at all00:05
FUZxxlLack of performance portability is going to be tricky if it happens.00:05
FUZxxlmarkos: I don't think a high instruction count is really a problem.00:05
markosit is00:05
lkclrealistically, RED Semiconductor Ltd (the company i established) will have the only hardware, for at least 6-8 of those years00:05
FUZxxlIf you e.g. look at ARM, most instructions just combine the existing HW in different ways to reduce the latency of common operations.00:05
FUZxxle.g. ARM has instructions to zero-extend + add at once00:06
markosArm has an orthogonal ISA00:06
lkclFUZxxl, you may not be aware: in the IBM POWER9, there's a bottleneck at the L2 Cache00:06
markosso you can predict the exact instruction you need00:06
lkclif you have an algorithm that cannot fit into L1 I-Cache, that is also L1 D-Cache heavy00:06
lkcl*you get contention*!00:06
FUZxxlyou could do it in two separate instructions but it would be slower.  The hardware can already do both at once, so it makes sense to expose that.00:06
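A small sketch of the point FUZxxl makes here about zero-extend + add, modelled in Python on 64-bit values. The function names are hypothetical and only illustrate the idea; on AArch64 the fused form corresponds to the extended-register add, e.g. `add x0, x1, w2, uxtb`. The model shows only that both paths compute the same result, not the latency difference being discussed.

```python
MASK64 = (1 << 64) - 1  # model a 64-bit register


def zero_extend_byte(value):
    """Keep the low 8 bits and clear everything above them."""
    return value & 0xFF


def add_two_steps(a, b):
    """Two separate operations: zero-extend first, then add."""
    widened = zero_extend_byte(b)
    return (a + widened) & MASK64


def add_fused(a, b):
    """One operation that zero-extends the second operand and adds."""
    return (a + (b & 0xFF)) & MASK64
```

Both functions return the same value for any inputs; the fused form simply exposes as one instruction what the hardware can already do in one pass.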
markosIntel definitely does not have that00:06
lkclnot many people are even aware of that limitation of IBM's POWER9 microarchitecture00:06
FUZxxlmarkos: AVX-512 is pretty orthogonal00:06
markosthere are so many variants you have to constantly check the ISA manual to see which instruction exactly you need00:07
FUZxxlso if you look at 750 something ASIMD instructions, it really boils down to not that many truly distinct operations00:07
FUZxxlmarkos: sure, but you could solve that with a better asm syntax (wink wink)00:07
markosit's better, but not much because you always have to carry the old baggage of AVX2/SSE00:07
FUZxxle.g. deriving zero extension from the operand type or something00:07
FUZxxlsame with Intel00:08
FUZxxla better assembler could get rid of vfmadd132pd and friends and just derive the right opcode from the combination of operands00:08
FUZxxllkcl: ah that's an ouch for sure00:08
lkclthat same logic ("better asm syntax") is what drove me to create SV.00:08
markosFUZxxl, you depend on the compiler in that case00:09
lkclover time i expect it to propagate cleanly up to intrinsics and ultimately to the compilers, without needing new front-end high-level languages00:09
markosI prefer not to write asm unless I have to00:09
FUZxxlmarkos: if you have a compiler, why are you spending your time reading ISA manuals?00:09
markosand with SVP64 I was able to write a working implementation in a few hours00:09
lkclbecause markos's company specialises in optimisation for companies00:10
FUZxxlI see00:10
lkclsuch as ARM and Intel00:10
markosArm is our client00:10
markosSVP64 is a personal involvement00:10
lkclyou did AV1 for them, recently, and that... what was it...00:10
markosno, libvpx00:10
markosav1 is next :)00:10
markosand vectorscan00:10
FUZxxlI do not like writing SIMD code in high level languages because compilers suck at generating SIMD code00:10
lkclthe "40,000-regex-which-intel-optimised"?00:11
lkclFUZxxl, we know! :)00:11
markosported Intel hyperscan to Arm, Intel didn't accept any non-intel ports to the original project, hence the fork00:11
FUZxxlI see00:11
markosand porting it to VSX was done just for fun00:11
lkclhyperscan, that was it00:11
FUZxxlcool project, really00:11
lkcloh, did toshywoshy's advice help on VSX?00:12
FUZxxlHah I actually have a bit-parallel string matching paper in my pipeline00:12
lkclwere there any other areas it got better?00:12
markosFUZxxl, if I did every project in hand written asm I'd still be working on the first function :)00:12
FUZxxlshould publish it some day00:12
* lkcl FUZxxl: ooOoo00:12
markoslkcl, the vec_gb instruction, yes, it doubled performance on the Power9 :)00:12
markoss/instruction/intrinsic00:12
lkclbit-parallel string-matchiiing :)00:12
lkclmarkos, cool!00:13
lkcldang00:13
FUZxxlthe algorithm is crazy simple00:13
markosbasically it reduced movemask emulation from a dozen instructions -or more don't remember- down to 500:13
markosstill the project is full of movemask intellisms and I have to abstract them away so that it doesn't hurt performance so much on Arm/Power00:13
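For readers unfamiliar with "movemask": a minimal scalar model of what the operation computes, assuming the byte-wise variant (as in Intel's `_mm_movemask_epi8`). The Python function name below is hypothetical, and this is not the instruction sequence markos used on Arm or Power; it only shows the result that an emulation has to reproduce.

```python
def movemask_epi8(lanes):
    """Pack the most significant bit of each byte lane into one integer.

    `lanes` is a sequence of byte values (0..255); lane 0 maps to bit 0.
    """
    mask = 0
    for index, byte in enumerate(lanes):
        mask |= ((byte >> 7) & 1) << index
    return mask
```

On an ISA without a native movemask, the same bit-gathering has to be built from several vector operations, which is why shrinking that sequence matters so much for code written around it.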
lkclFUZxxl, it's funny, it's often the simple things/ways that get missed00:13
FUZxxlbasically, it's an improvement over Boyer-Moore and all the other algorithms that have the basic "test char, compute shift amount, go to next iteration" loop00:13
markosdid it for a few modules but it's all over the place00:14
lkclhttps://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm00:14
FUZxxlwith the key improvement being that (a) it never forgets any information it gained and (b) it can check multiple characters per iteration, hence benefitting from OOO architectures00:14
markosproblem is SVE, it doesn't want to play with existing SIMD abstractions00:14
FUZxxlI'm kind of scared that someone already came up with the idea00:15
FUZxxland my algorithm can deal with character classes which is nice00:15
lkclFUZxxl, write it up, definitely!00:16
FUZxxle.g. you can match something like photo-19[89][0-9]-[0-9][0-9]-[0-9][0-9].jpg00:16
FUZxxlmain disadvantage: the length of the search pattern is limited to your register length00:16
FUZxxlbut you can simply look for a 64 char suffix of the search pattern in most cases which is good enough00:17
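FUZxxl's algorithm is not described in this log, so the sketch below is not it. It is the classic Shift-And bit-parallel matcher, shown only because it shares the two properties mentioned: character classes come for free, and the pattern length is bounded by the register width. The 64-bit limit and the representation of a pattern as a list of allowed-character strings are assumptions made for illustration.

```python
WORD_BITS = 64  # assumed register width; the pattern must fit in one word


def build_masks(classes):
    """Map each character to a bitmask of the pattern positions it may occupy."""
    if not classes or len(classes) > WORD_BITS:
        raise ValueError("pattern must have between 1 and WORD_BITS positions")
    masks = {}
    for position, allowed in enumerate(classes):
        for char in allowed:
            masks[char] = masks.get(char, 0) | (1 << position)
    return masks


def shift_and_search(text, classes):
    """Return the start index of every match of the class pattern in text."""
    masks = build_masks(classes)
    accept = 1 << (len(classes) - 1)  # bit set when the last position matched
    state = 0
    hits = []
    for index, char in enumerate(text):
        # Bit i of state means: the last i+1 characters match positions 0..i.
        state = ((state << 1) | 1) & masks.get(char, 0)
        if state & accept:
            hits.append(index - len(classes) + 1)
    return hits


# A fragment of the example pattern: 19[89][0-9]
YEAR = ["1", "9", "89", "0123456789"]
```

With these definitions, `shift_and_search("x1985 1979 1990", YEAR)` returns `[1, 11]`. Unlike the Boyer-Moore family, this matcher inspects every character once and never skips ahead.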
FUZxxllkcl: will do!00:17
lkclit sounds... significant00:17
lkcli mean that00:17
FUZxxlin fact, I already have00:17
* lkcl late, here. and for you, markos, you're 2 hours ahead of me and it's 00:18 for me!00:19
lkclback to vegging out with a book is called for00:19
lkcluntil next time00:19
lkclthank you both - awesome conversation00:19
markosindeed00:20
FUZxxlgood night and thank you!00:22
FUZxxlAs for Tuesday, I may have to shift my attendance to next week00:29
FUZxxlIt's my Grandmother's birthday and the celebrations may run late00:30
*** jab <jab!~jab@user/jab> has quit IRC03:18
programmerjakewelcome FUZxxl!03:21
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has quit IRC09:15
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.172.157> has joined #libre-soc09:16
*** yambo <yambo!~yambo@69.146.1.110> has quit IRC09:28
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.172.157> has quit IRC10:00
*** openpowerbot <openpowerbot!~openpower@94.226.188.34> has joined #libre-soc10:10
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.162.225> has joined #libre-soc11:12
ghostmansd[m]lkcl, hi! Any ideas on math-free tasks? :-)11:23
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.162.225> has quit IRC11:30
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has joined #libre-soc11:34
lkclmaaath, mmmm loovely11:52
lkcli'm still thinking about it, because everything we're doing right now on the cryptoprimitives thingy is driven by the algorithms11:53
lkclah, i tell you what11:54
lkclwe need a "add with shift by immediate-2/4/8/16" instruction11:55
lkclwhich you could add to the spec page, and to av.mdwn, and blah blah11:55
lkclunder https://bugs.libre-soc.org/show_bug.cgi?id=77111:56
lkclthen the unit tests (etc) under https://bugs.libre-soc.org/show_bug.cgi?id=84011:56
lkclwith a special note in the spec that the (very same) instruction is needed for LD/ST-address-calculation-with-a-mini-bit-of-a-shift11:57
ghostmansd[m]Ok, is there some insn that I should take as reference?12:08
lkclhttps://libre-soc.org/openpower/sv/bitmanip/#shift-add12:14
lkclthere's one in ARM, the syntax uses "#N" on the end of the add-part (we'll not be doing that)12:17
lkclprogrammerjake, thank you for the unit test on set_masked_reg()12:20
lkcli relied on the unit tests using it "getting things right" (chacha20 for example)12:21
lkclghostmansd[m], so, bit of a pain (but they have separate budgets), tracking 3 separate bugreports: one for implementation, one for unit tests, one for spec/documentation12:21
ghostmansd[m]So we basically need to create everything for these: https://libre-soc.org/openpower/sv/bitmanip/#shift-add?12:23
ghostmansd[m]Sigh, IRC thinks ? is a part of URL12:23
ghostmansd[m]Ok, will do it12:23
lkclhexchat doesn't :)12:24
lkclyes.12:24
lkcla (new) Z23-Form exists, so the pseudocode can use "sm"12:24
lkclrather than "sh"12:24
lkcldo make sure to drop in the git-commit-diff-link under the right bugreport as you do them (just to show some justification for the payment)12:25
lkclhttps://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=HEAD12:25
lkclrather than12:25
lkclhttps://git.libre-soc.org/?p=openpower-isa.git;a=commit;h=HEAD12:26
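A tiny helper sketching how such a diff link could be built for a bug report, following the gitweb URL shape in the two links above. The function name and the example hash are hypothetical; passing a concrete commit hash rather than HEAD keeps the link pointing at the same change later on.

```python
GITWEB_BASE = "https://git.libre-soc.org/"


def commitdiff_url(project, commit_hash):
    """Build a gitweb link that shows the diff of one specific commit."""
    return f"{GITWEB_BASE}?p={project};a=commitdiff;h={commit_hash}"


# "0123abc" is a made-up placeholder, not a real commit in the repository.
example = commitdiff_url("openpower-isa.git", "0123abc")
```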
ghostmansd[m]Ok, cool!12:59
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC13:56
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has quit IRC14:53
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.52.183> has joined #libre-soc14:53
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.52.183> has quit IRC15:08
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has joined #libre-soc15:08
*** yambo <yambo!~yambo@69.146.1.110> has joined #libre-soc16:37
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc17:29
*** octavius <octavius!~octavius@149.147.93.209.dyn.plus.net> has joined #libre-soc18:01
octaviuslkcl, the pseudo-code for shadd only allows "sh" to be 0-3 (2-bit mask), and then 1 is added, so the max possible shift is 4. Is this standard behaviour? It seems a little roundabout (but I guess there's no way to mask and ensure a max of 4 otherwise)18:03
octaviusAlso, which bitfield does "sh" correspond to in the Z23-form?18:04
lkcloctavius, yyep.18:44
lkclif you only need a straight add then you use a straight add18:45
lkcland it's "1<<sm" so that's *2, *4, *8 and *1618:45
octaviusAh ok, makes sense19:27
octaviuswait.19:31
octaviussm   =0,1,2,319:31
octavius1<<sm=1,2,4,819:31
octaviusmasking this with 0x3, the last two values will give the same shift value19:31
lkclsm = 0,1,2,319:59
lkclsm &= 0x319:59
lkcl1<<(sm+1) == 1,2,4,820:00
lkclnot20:00
lkclsm = 0,1,2,320:00
lkcl(1<<(sm+1)) & 0x320:00
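A minimal Python model of the shift-add being discussed, reading lkcl's clarification as: mask `sm` to two bits first, then add one to get the shift amount. The 64-bit width, the function names, and the choice of which operand is shifted (`rb` here) are assumptions for illustration; the spec page linked earlier is the authority.

```python
XLEN = 64                    # assumed register width
MASK = (1 << XLEN) - 1


def shift_amount(sm):
    """Mask sm to 2 bits first, then add 1: shift amounts 1, 2, 3, 4."""
    return (sm & 0x3) + 1


def shadd(ra, rb, sm):
    """Add ra to rb shifted left by (sm + 1), wrapping at XLEN bits."""
    return (ra + (rb << shift_amount(sm))) & MASK


# Multipliers applied to the shifted operand for sm = 0..3.
multipliers = [1 << shift_amount(sm) for sm in range(4)]
```

Under this reading `multipliers` is `[2, 4, 8, 16]`, which matches the "*2, *4, *8 and *16" stated earlier, and all four values of `sm` give distinct shifts because the mask is applied before the shift is formed, not after.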
*** octavius <octavius!~octavius@149.147.93.209.dyn.plus.net> has quit IRC20:33
programmerjakelkcl: when linking to stuff in git, *please* link to an actual commit, not HEAD21:51
lkclprogrammerjake, i gave it as an example only22:30
programmerjakeyeah, just this isn't the first time...22:49
*** openpowerbot <openpowerbot!~openpower@94.226.188.34> has quit IRC23:18
*** openpowerbot <openpowerbot!~openpower@94-226-188-34.access.telenet.be> has joined #libre-soc23:25
lkclprogrammerjake, the reason i gave it was not for the purposes of showing the commit itself23:41
lkclthe reason i gave it was for comparative purposes of demonstrating to ghostmansd, to ask him to please show the diff link not the commit link23:42
lkclthe actual reference was completely irrelevant as to what was actually shown23:42
lkclwhether it was HEAD or any other commit was *not* part of the request to him23:43
lkclconsequently it is not in the least bit relevant to ask me to link to an actual commit23:43
lkclas i was not in any way asking him *about* any actual specific commit, at all.23:43
lkclso just so you know: you're asking me to do something irrelevant on something completely unrelated to the purpose of the conversation.23:46
programmerjakei'm pointing it out not because this time it's a problem (though it is a bit misleading for ghostmansd if your demo doesn't contain all the correct pieces of info), but because it has been a problem several times in the past.23:47
lkcltherefore i'm going to ignore the request as it is not relevant23:47
lkcli'll eventually successfully communicate with him, through repetition, and expect to catch him at a time that's convenient23:48
programmerjakeimho he likely figured it out -- he's smart23:49
