FUZxxl | markos: you showed me all the code in that directory | 00:00 |
---|---|---|
lkcl | i then very deliberately took the optimisation path at the ISA level to make sure that those "simple" looking vectorised algorithms could be thrown at multi-issue (parallel) hardware and get high performance | 00:00 |
FUZxxl | lkcl: okay, vertical first makes sense and permits data dependencies | 00:00 |
lkcl | yes. | 00:00 |
FUZxxl | but they'll still be there creating dependency chains | 00:00 |
lkcl | it's a kind-of cheat | 00:00 |
FUZxxl | do you plan to rename each vector element individually? | 00:01 |
lkcl | if the loop is small enough, hardware may go, "oh, hm, i'm getting a batch of non-conflicting non-overlapping elements. i could SIMD-batch those. let me just do that" | 00:01 |
FUZxxl | I am not sure if this will be possible in practice | 00:01 |
lkcl | FUZxxl, at the Scalar-register-element-size level, yes | 00:01 |
markos | lkcl, and that's a micro-architecture specific detail | 00:02 |
lkcl | markos, yes | 00:02 |
FUZxxl | existing OOO architectures cannot re-schedule after more data dependencies are known | 00:02 |
markos | every vendor might choose to implement this one way or another | 00:02 |
markos | or not at all | 00:02 |
FUZxxl | (which really sucks on current Intel uarchs, too) | 00:02 |
lkcl | when the elements are linear and (like MMX) below the 64-bit level, they'll be easily batched | 00:02 |
lkcl | but beyond that, it gets... tricky | 00:02 |
FUZxxl | okay, so you will have to write convoluted code to get the batching right in the non-trivial case. | 00:03 |
lkcl | well, the value of doing that is going to depend on how many implementations there are (in.... 4-10 years time) | 00:03 |
FUZxxl | Looking forwards to it! | 00:04 |
lkcl | ultimately (annoyingly) we will need switches in gcc, per architecture | 00:04 |
lkcl | to say "please generate assembler targeted at v1.2.3.4 vendor's hardware" | 00:04 |
lkcl | it's inevitable, sigh | 00:04 |
FUZxxl | Please don't understand my words as a disapproval of your project. In fact, the ideas are extremely fascinating and likely to lead to interesting results. | 00:04 |
markos | SVP64 is not trivially simple, nor does it lack complexity, but the difference is that instead of having thousands upon thousands of different instructions, it offers very few extra instructions that sit on top of the *existing* scalar instructions and "vectorize" them | 00:04 |
lkcl | no, not at all | 00:05 |
FUZxxl | Lack of performance portability is going to be tricky if it happens. | 00:05 |
FUZxxl | markos: I don't think a high instruction count is really a problem. | 00:05 |
markos | it is | 00:05 |
lkcl | realistically, RED Semiconductor Ltd (the company i established) will have the only hardware, for at least 6-8 of those years | 00:05 |
FUZxxl | If you e.g. look at ARM, most instructions just combine the existing HW in different ways to reduce the latency of common operations. | 00:05 |
FUZxxl | e.g. ARM has instructions to zero-extend + add at once | 00:06 |
markos | Arm has an orthogonal ISA | 00:06 |
lkcl | FUZxxl, you may not be aware: in the IBM POWER9, there's a bottleneck at the L2 Cache | 00:06 |
markos | so you can predict the exact instruction you need | 00:06 |
lkcl | if you have an algorithm that cannot fit into L1 I-Cache, that is also L1 D-Cache heavy | 00:06 |
lkcl | *you get contention*! | 00:06 |
FUZxxl | you could do it in two separate instructions but it would be slower. The hardware can already do both at once, so it makes sense to expose that. | 00:06 |
markos | Intel definitely does not have that | 00:06 |
lkcl | not many people are even aware of that limitation of IBM's POWER9 microarchitecture | 00:06 |
FUZxxl | markos: AVX-512 is pretty orthogonal | 00:06 |
markos | there are so many variants you have to constantly check the ISA manual to see which instruction exactly you need | 00:07 |
FUZxxl | so if you look at 750 something ASIMD instructions, it really boils down to not that many truly distinct operations | 00:07 |
FUZxxl | markos: sure, but you could solve that with a better asm syntax (wink wink) | 00:07 |
markos | it's better, but not much because you always have to carry the old baggage of AVX2/SSE | 00:07 |
FUZxxl | e.g. deriving zero extension from the operand type or something | 00:07 |
FUZxxl | same with Intel | 00:08 |
FUZxxl | a better assembler could get rid of vfmadd132pd and friends and just derive the right opcode from the combination of operands | 00:08 |
FUZxxl | lkcl: ah that's an ouch for sure | 00:08 |
lkcl | that same logic ("better asm syntax") is what drove me to create SV. | 00:08 |
markos | FUZxxl, you depend on the compiler in that case | 00:09 |
lkcl | over time i expect it to propagate cleanly up to intrinsics and ultimately to the compilers, without needing new front-end high-level languages | 00:09 |
markos | I prefer not to write asm unless I have to | 00:09 |
FUZxxl | markos: if you have a compiler, why are you spending your time reading ISA manuals? | 00:09 |
markos | and with SVP64 I was able to write a working implementation in a few hours | 00:09 |
lkcl | because markos's company specialises in optimisation for companies | 00:10 |
FUZxxl | I see | 00:10 |
lkcl | such as ARM and Intel | 00:10 |
markos | Arm is our client | 00:10 |
markos | SVP64 is a personal involvement | 00:10 |
lkcl | you did AV1 for them, recently, and that... what was it... | 00:10 |
markos | no, libvpx | 00:10 |
markos | av1 is next :) | 00:10 |
markos | and vectorscan | 00:10 |
FUZxxl | I do not like writing SIMD code in high level languages because compilers suck at generating SIMD code | 00:10 |
lkcl | the "40,000-regex-which-intel-optimised"? | 00:11 |
lkcl | FUZxxl, we know! :) | 00:11 |
markos | ported Intel hyperscan to Arm, Intel didn't accept any non-intel ports to the original project, hence the fork | 00:11 |
FUZxxl | I see | 00:11 |
markos | and porting it to VSX was done just for fun | 00:11 |
lkcl | hyperscan, that was it | 00:11 |
FUZxxl | cool project, really | 00:11 |
lkcl | oh, did toshywoshy's advice help on VSX? | 00:12 |
FUZxxl | Hah I actually have a bit-parallel string matching paper in my pipeline | 00:12 |
lkcl | were there any other areas it got better? | 00:12 |
markos | FUZxxl, if I did every project in hand written asm I'd still be working on the first function :) | 00:12 |
FUZxxl | should publish it some day | 00:12 |
* lkcl FUZxxl: ooOoo | 00:12 | |
markos | lkcl, the vec_gb instruction, yes, it doubled performance on the Power9 :) | 00:12 |
markos | s/instruction/intrinsic | 00:12 |
lkcl | bit-parallel string-matchiiing :) | 00:12 |
lkcl | markos, cool! | 00:13 |
lkcl | dang | 00:13 |
FUZxxl | the algorithm is crazy simple | 00:13 |
markos | basically it reduced movemask emulation from a dozen instructions -or more don't remember- down to 5 | 00:13 |
markos | still the project is full of movemask intellisms and I have to abstract them away so that it doesn't hurt performance so much on Arm/Power | 00:13 |
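
For readers unfamiliar with the "movemask" operation being emulated here: on x86, PMOVMSKB gathers the most significant bit of every byte lane of a vector into an integer mask. The following is a minimal Python sketch of that semantic only. The function name `movemask_u8` and the list-of-bytes representation are illustrative assumptions; this is not the vectorscan code, and it does not model the `vec_gb` intrinsic mentioned above.

```python
def movemask_u8(lanes):
    """Collect the top bit of each 8-bit lane into an integer mask.

    Lane 0 maps to bit 0 of the result, mirroring x86 PMOVMSKB.
    `lanes` is a plain list of byte values standing in for a vector register.
    """
    mask = 0
    for i, byte in enumerate(lanes):
        if not 0 <= byte <= 0xFF:
            raise ValueError("lanes must be 8-bit values")
        mask |= (byte >> 7) << i
    return mask


# A 16-lane compare result: 0xFF where the lanes matched, 0x00 elsewhere.
compare_result = [0xFF if i in (0, 3, 15) else 0x00 for i in range(16)]
print(hex(movemask_u8(compare_result)))  # bits 0, 3 and 15 set
```

Architectures without a single instruction for this have to assemble the mask from several operations, which is the cost being described in the conversation.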
lkcl | FUZxxl, it's funny, it's often the simple things/ways that get missed | 00:13 |
FUZxxl | basically, it's an improvement over Boyer-Moore and all the other algorithms that have the basic "test char, compute shift amount, go to next iteration" loop | 00:13 |
markos | did it for a few modules but it's all over the place | 00:14 |
lkcl | https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm | 00:14 |
FUZxxl | with the key improvement being that (a) it never forgets any information it gained and (b) it can check multiple characters per iteration, hence benefitting from OOO architectures | 00:14 |
markos | problem is SVE, it doesn't want to play with existing SIMD abstractions | 00:14 |
FUZxxl | I'm kind of scared that someone already came up with the idea | 00:15 |
FUZxxl | and my algorithm can deal with character classes which is nice | 00:15 |
lkcl | FUZxxl, write it up, definitely! | 00:16 |
FUZxxl | e.g. you can match something like photo-19[89][0-9]-[0-9][0-9]-[0-9][0-9].jpg | 00:16 |
FUZxxl | main disadvantage: the length of the search pattern is limited to your register length | 00:16 |
FUZxxl | but you can simply look for a 64 char suffix of the search pattern in most cases which is good enough | 00:17 |
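
FUZxxl's algorithm itself is not shown in this log. As a point of comparison only, the classic Shift-And (bitap) matcher is a different, well-known bit-parallel technique that shares two of the properties mentioned above: it handles character classes naturally, and the pattern length is bounded by the register width. The sketch below assumes a 64-bit word (`WORD_BITS`) and represents each pattern position as a string of accepted characters; all names are illustrative.

```python
WORD_BITS = 64  # assumed register width; the pattern must fit in one word


def compile_pattern(classes):
    """Build one bitmask per character: bit i is set if position i accepts it.

    `classes` is a list of strings, each listing the characters accepted
    at that pattern position (a one-character string is a literal).
    """
    if not 0 < len(classes) <= WORD_BITS:
        raise ValueError("pattern length must be between 1 and WORD_BITS")
    masks = {}
    for i, accepted in enumerate(classes):
        for ch in accepted:
            masks[ch] = masks.get(ch, 0) | (1 << i)
    return masks


def shift_and_search(text, classes):
    """Return the start index of every match of `classes` in `text`."""
    masks = compile_pattern(classes)
    m = len(classes)
    accept = 1 << (m - 1)
    word_mask = (1 << WORD_BITS) - 1
    state = 0
    hits = []
    for pos, ch in enumerate(text):
        # Bit i of state is set when pattern[0..i] matches the text ending at pos.
        state = ((state << 1) | 1) & masks.get(ch, 0) & word_mask
        if state & accept:
            hits.append(pos - m + 1)
    return hits


# photo-19[89][0-9]-[0-9][0-9]-[0-9][0-9].jpg, written out position by position
# (the dot is treated as a literal character here).
DIGIT = "0123456789"
pattern = (list("photo-19") + ["89", DIGIT, "-", DIGIT, DIGIT, "-", DIGIT, DIGIT]
           + list(".jpg"))
print(shift_and_search("x photo-1994-07-21.jpg photo-2001-01-01.jpg", pattern))
```

Unlike the algorithm described in the conversation, Shift-And inspects every text character exactly once and never skips ahead, so it does not show the claimed improvement over Boyer-Moore.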
FUZxxl | lkcl: will do! | 00:17 |
lkcl | it sounds... significant | 00:17 |
lkcl | i mean that | 00:17 |
FUZxxl | in fact, I already have | 00:17 |
* lkcl late, here. and for you, markos, you're 2 hours ahead of me and it's 00:18 for me! | 00:19 | |
lkcl | back to vegging out with a book is called for | 00:19 |
lkcl | until next time | 00:19 |
lkcl | thank you both - awesome conversation | 00:19 |
markos | indeed | 00:20 |
FUZxxl | good night and thank you! | 00:22 |
FUZxxl | As for Tuesday, I may have to shift my attendance to next week | 00:29 |
FUZxxl | It's my grandmother's birthday and the celebrations may run late | 00:30 |
*** jab <jab!~jab@user/jab> has quit IRC | 03:18 | |
programmerjake | welcome FUZxxl! | 03:21 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has quit IRC | 09:15 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.172.157> has joined #libre-soc | 09:16 | |
*** yambo <yambo!~yambo@69.146.1.110> has quit IRC | 09:28 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.172.157> has quit IRC | 10:00 | |
*** openpowerbot <openpowerbot!~openpower@94.226.188.34> has joined #libre-soc | 10:10 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.162.225> has joined #libre-soc | 11:12 | |
ghostmansd[m] | lkcl, hi! Any ideas on math-free tasks? :-) | 11:23 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.162.225> has quit IRC | 11:30 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has joined #libre-soc | 11:34 | |
lkcl | maaath, mmmm loovely | 11:52 |
lkcl | i'm still thinking about it, because everything we're doing right now on the cryptoprimitives thingy is driven by the algorithms | 11:53 |
lkcl | ah, i tell you what | 11:54 |
lkcl | we need a "add with shift by immediate-2/4/8/16" instruction | 11:55 |
lkcl | which you could add to the spec page, and to av.mdwn, and blah blah | 11:55 |
lkcl | under https://bugs.libre-soc.org/show_bug.cgi?id=771 | 11:56 |
lkcl | then the unit tests (etc) under https://bugs.libre-soc.org/show_bug.cgi?id=840 | 11:56 |
lkcl | with a special note in the spec that the (very same) instruction is needed for LD/ST-address-calculation-with-a-mini-bit-of-a-shift | 11:57 |
ghostmansd[m] | Ok, is there some insn that I should take as reference? | 12:08 |
lkcl | https://libre-soc.org/openpower/sv/bitmanip/#shift-add | 12:14 |
lkcl | there's one in ARM, the syntax uses "#N" on the end of the add-part (we'll not be doing that) | 12:17 |
lkcl | programmerjake, thank you for the unit test on set_masked_reg() | 12:20 |
lkcl | i relied on the unit tests using it "getting things right" (chacha20 for example) | 12:21 |
lkcl | ghostmansd[m], so, bit of a pain (but they have separate budgets), tracking 3 separate bugreports: one for implementation, one for unit tests, one for spec/documentation | 12:21 |
ghostmansd[m] | So we basically need to create everything for these: https://libre-soc.org/openpower/sv/bitmanip/#shift-add? | 12:23 |
ghostmansd[m] | Sigh, IRC thinks ? is a part of URL | 12:23 |
ghostmansd[m] | Ok, will do it | 12:23 |
lkcl | hexchat doesn't :) | 12:24 |
lkcl | yes. | 12:24 |
lkcl | a (new) Z23-Form exists, so the pseudocode can use "sm" | 12:24 |
lkcl | rather than "sh" | 12:24 |
lkcl | do make sure to drop in the git-commit-diff-link under the right bugreport as you do them (just to show some justification for the payment) | 12:25 |
lkcl | https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=HEAD | 12:25 |
lkcl | rather than | 12:25 |
lkcl | https://git.libre-soc.org/?p=openpower-isa.git;a=commit;h=HEAD | 12:26 |
ghostmansd[m] | Ok, cool! | 12:59 |
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC | 13:56 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has quit IRC | 14:53 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.52.183> has joined #libre-soc | 14:53 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.52.183> has quit IRC | 15:08 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has joined #libre-soc | 15:08 | |
*** yambo <yambo!~yambo@69.146.1.110> has joined #libre-soc | 16:37 | |
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc | 17:29 | |
*** octavius <octavius!~octavius@149.147.93.209.dyn.plus.net> has joined #libre-soc | 18:01 | |
octavius | lkcl, the pseudo-code for shadd only allows "sh" to be 0-3 (2-bit mask), and then 1 is added, so the max possible shift is 4. Is this standard behaviour? It seems a little roundabout (but I guess there's no way to mask and ensure a max of 4 otherwise) | 18:03 |
octavius | Also, which bitfield does "sh" correspond to in the Z23-form? | 18:04 |
lkcl | octavius, yyep. | 18:44 |
lkcl | if you only need a straight add then you use a straight add | 18:45 |
lkcl | and it's "1<<sm" so that's *2, *4, *8 and *16 | 18:45 |
octavius | Ah ok, makes sense | 19:27 |
octavius | wait. | 19:31 |
octavius | sm =0,1,2,3 | 19:31 |
octavius | 1<<sm=1,2,4,8 | 19:31 |
octavius | masking this with 0x3, the last two values will give the same shift value | 19:31 |
lkcl | sm = 0,1,2,3 | 19:59 |
lkcl | sm &= 0x3 | 19:59 |
lkcl | 1<<(sm+1) == 2,4,8,16 | 20:00 |
lkcl | not | 20:00 |
lkcl | sm = 0,1,2,3 | 20:00 |
lkcl | (1<<(sm+1)) & 0x3 | 20:00 |
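
The exchange above can be checked with a small Python sketch of the shift-and-add operation as it is described in this log: a 2-bit field in the range 0-3, with 1 added to give a shift of 1 to 4. The operand names `ra` and `rb`, the 64-bit `XLEN`, and the wrap-around behaviour are assumptions made for illustration; the pseudocode on the linked spec page is the authoritative definition.

```python
XLEN = 64  # assumed register width


def shadd(ra, rb, sm):
    """Sketch of shift-and-add: RT = (RB << (sm + 1)) + RA, modulo 2**XLEN."""
    sm &= 0x3        # 2-bit field, so sm is 0..3 (the mask applies to sm itself)
    shift = sm + 1   # shift amount is 1..4
    mask = (1 << XLEN) - 1
    return (((rb << shift) & mask) + ra) & mask


# Each sm value, the multiplier it implies, and the result for RA=0, RB=1.
for sm in range(4):
    print(sm, 1 << (sm + 1), shadd(0, 1, sm))
```

The mask is applied to the field before the increment, not to the shifted result, which is why all four multipliers (2, 4, 8 and 16) stay distinct.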
*** octavius <octavius!~octavius@149.147.93.209.dyn.plus.net> has quit IRC | 20:33 | |
programmerjake | lkcl: when linking to stuff in git, *please* link to an actual commit, not HEAD | 21:51 |
lkcl | programmerjake, i gave it as an example only | 22:30 |
programmerjake | yeah, just this isn't the first time... | 22:49 |
*** openpowerbot <openpowerbot!~openpower@94.226.188.34> has quit IRC | 23:18 | |
*** openpowerbot <openpowerbot!~openpower@94-226-188-34.access.telenet.be> has joined #libre-soc | 23:25 | |
lkcl | programmerjake, the reason i gave it was not for the purposes of showing the commit itself | 23:41 |
lkcl | the reason i gave it was for comparative purposes of demonstrating to ghostmansd, to ask him to please show the diff link not the commit link | 23:42 |
lkcl | the actual reference was completely irrelevant as to what was actually shown | 23:42 |
lkcl | whether it was HEAD or any other commit was *not* part of the request to him | 23:43 |
lkcl | consequently it is not in the least bit relevant to ask me to link to an actual commit | 23:43 |
lkcl | as i was not in any way asking him *about* any actual specific commit, at all. | 23:43 |
lkcl | so just so you know: you're asking me to do something irrelevant on something completely unrelated to the purpose of the conversation. | 23:46 |
programmerjake | i'm pointing it out not because this time it's a problem (though it is a bit misleading for ghostmansd if your demo doesn't contain all the correct pieces of info), but because it has been a problem several times in the past. | 23:47 |
lkcl | therefore i'm going to ignore the request as it is not relevant | 23:47 |
lkcl | i'll eventually successfully communicate with him, through repetition, and expect to catch him at a time that's convenient | 23:48 |
programmerjake | imho he likely figured it out -- he's smart | 23:49 |