*** kanzure <kanzure!~kanzure@user/kanzure> has quit IRC | 06:30 | |
*** kanzure <kanzure!~kanzure@user/kanzure> has joined #libre-soc | 06:30 | |
*** awygle <awygle!~quassel@2604:a880:2:d0::5380:3001> has quit IRC | 06:53 | |
*** sauce <sauce!~sauce@omae.wa.mou.shindei.ru> has quit IRC | 06:53 | |
*** awygle <awygle!~quassel@2604:a880:2:d0::5380:3001> has joined #libre-soc | 06:54 | |
*** sauce <sauce!~sauce@omae.wa.mou.shindei.ru> has joined #libre-soc | 06:56 | |
lkcl | ghostmansd[m], so what's the primary focus of each of the two current bugreports? | 09:47 |
lkcl | and, just as importantly, "what fits into EUR 8,000"? | 09:48 |
lkcl | for example: if i split into EUR 4000 for 947 and EUR 4000 for the other-top-level-one-for-binutils, is that ok? | 09:49 |
lkcl | i take it https://bugs.libre-soc.org/show_bug.cgi?id=958 is the top-level? | 09:51 |
ghostmansd[m] | This is basically doing the same stuff in two projects, TBH. Where they can re-use each other (e.g. where we can put some code into sv_binutils.py), this is great. In short, it's "building asm atop of insndb like we did with disasm, dropping custom tweaks". More details below. | 09:52 |
ghostmansd[m] | binutils: infrastructure is more or less ready. We mostly need to refactor the and/or/shifts/whatever with fields, and also establish per-mode handlers (you remember the pattern matching in disassembly; I want to do something similar for asm, based on strings representing modes, e.g. /mr should toggle mapreduce and its submodes), etc. | 09:53 |
ghostmansd[m] | After this is done, we need to introduce tests for SVP64. We already have tests for custom word instructions, but nothing for SVP64. | 09:53 |
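The string-based mode handling ghostmansd describes (e.g. `/mr` toggling mapreduce) could be sketched roughly like this. All names here (`parse_sv_modes`, the returned dict shape) are illustrative assumptions, not the actual sv_binutils.py or pysvp64asm API:

```python
# Hypothetical sketch of per-mode suffix handling for SVP64 assembly,
# in the spirit of the pattern matching already used for disassembly.

def parse_sv_modes(opcode):
    """Split e.g. 'sv.add/mr/ew=8' into the base mnemonic and a mode dict."""
    base, *specifiers = opcode.split("/")
    modes = {}
    for spec in specifiers:
        if "=" in spec:
            key, value = spec.split("=", 1)
            modes[key] = value       # e.g. /ew=8 sets element width
        else:
            modes[spec] = True       # e.g. /mr toggles mapreduce
    return base, modes
```

A per-mode handler table could then dispatch on the keys of the returned dict, much as the disassembler dispatches on matched patterns.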
lkcl | remember the EUR 8000 is "interim", based on the fact that the cavatools budget has been approved | 09:56 |
lkcl | so if there is more to be done, that would take longer, then it can be added to *another* project | 09:57 |
ghostmansd[m] | openpower-isa: refactor pysvp64asm in the same way (actually this is the one I'm doing now, since it's the reference and we have tests already). Also, as part of this, I'm adding support for operand types, and more advanced assembly for these (r, f, cr prefixes, non-zero operands without hardcoding them for each and every instruction with _custom_insn, CR complex descriptions, etc.). | 09:57 |
lkcl | but please remember that if it has not yet been approved and the work has already been done *you can't get paid* | 09:57 |
lkcl | so for example | 09:57 |
lkcl | you've stated here: | 09:57 |
lkcl | https://bugs.libre-soc.org/show_bug.cgi?id=958#c1 | 09:57 |
lkcl | "all tasks have been completed" | 09:57 |
lkcl | but | 09:57 |
lkcl | it was attached to an *UNAPPROVED* grant (OPF ISA WG) | 09:58 |
ghostmansd[m] | Once both are done, I'd like to have some test infrastructure common for binutils and openpower-isa. | 09:58 |
lkcl | therefore you *cannot* get paid for it under that grant | 09:58 |
lkcl | therefore i have instead moved it to cavatools | 09:58 |
ghostmansd[m] | > it was attached to an *UNAPPROVED* grant (OPF ISA WG) | 09:58 |
ghostmansd[m] | Yeah that's why I asked about cavatools. | 09:58 |
ghostmansd[m] | Because logically these changes are inevitable: if we want to support binutils, we must have all the instructions there that we have in the reference implementation. | 09:59 |
ghostmansd[m] | And also markos needed those, that's how it started. :-) | 09:59 |
lkcl | yehyeh | 09:59 |
ghostmansd[m] | Not all of them but I also did what was nearby. | 09:59 |
ghostmansd[m] | This task is somewhat outside | 10:00 |
lkcl | don't add any grev* bitmanip ones | 10:00 |
ghostmansd[m] | Didn't intend | 10:00 |
ghostmansd[m] | I remember their fate is not yet stated. | 10:00 |
lkcl | programmerjake needs to remove them and complete grevlut* | 10:00 |
ghostmansd[m] | We also even have some task called like "revisit grev stuff once we decide on it" | 10:01 |
ghostmansd[m] | Don't remember the number, though | 10:01 |
ghostmansd[m] | So, yeah: | 10:03 |
ghostmansd[m] | 1. Refactor binutils assembly. | 10:03 |
ghostmansd[m] | 2. Prepare binutils tests. | 10:03 |
ghostmansd[m] | 3. Refactor openpower-isa assembly. | 10:03 |
ghostmansd[m] | 4. Handle what's missing/controversial between these. | 10:03 |
ghostmansd[m] | 5. If time permits, come up with some tests infra which cross-checks these. | 10:03 |
ghostmansd[m] | We might argue these tasks overlap a lot, and that's actually true; the overall meta-task is "support SVP64 asm, extend it and test it". I'm splitting it for convenience so that you can decide what goes where. | 10:05 |
lkcl | sounds great. what fits into EUR 8,000 and what doesn't? | 10:05 |
lkcl | at present i'm simply putting both #958 and #949 under cavatools @ EUR 8,000. | 10:05 |
ghostmansd[m] | I'm not sure, that's why I asked. Refactoring assembly is a big task per se. | 10:06 |
ghostmansd[m] | I think the code fits well. | 10:06 |
lkcl | if however you can do "some" of it | 10:06 |
lkcl | then continue the rest under another Grant | 10:06 |
lkcl | that would work | 10:06 |
ghostmansd[m] | Tests, both for binutils and cross-check, will likely have to be done under another grant. | 10:06 |
lkcl | that works | 10:06 |
lkcl | boring, but it works | 10:07 |
markos | lkcl, speaking of binutils, I think that since it's more or less working fine, we should use it more for complicated work, i.e. non-unit-test work. After all, people are going to use binutils and not some reference assembler for real code. Saying that because, for me at least, it's much harder reading complicated asm code embedded into python as a unit test (eg chacha); I'd much rather have the assembler separate from the python part, in its own file. | 10:07 |
lkcl | most likely under OPF ISA WG one https://bugs.libre-soc.org/show_bug.cgi?id=952 | 10:07 |
ghostmansd[m] | Cross-check perhaps will be dropped for now, I suspect. It's big, and we could check each implementation on its own. | 10:07 |
lkcl | markos, what you've created is great, but it doesn't "run from python setup.py test" | 10:08 |
ghostmansd[m] | binutils tests are boring and difficult, but we have to do them if we want the code to be committed. | 10:08 |
markos | that's not what I'm suggesting | 10:08 |
lkcl | and, "run from python setup.py test" is the canonical - easy - method of ensuring that everything is working | 10:08 |
lkcl | so | 10:08 |
markos | I suggest not embedding complicated -ie more than a handful of instructions- within python | 10:09 |
lkcl | that chacha20 inner-core constitutes a unit test, not a demo | 10:09 |
lkcl | yes, i'm not going beyond that. 17 is quite enough | 10:09 |
markos | what I'm suggesting is splitting the asm externally as a .s file and loading it as from python for the unit test, but *also* making it available for testing in binutils as well | 10:10 |
lkcl | utf8-validation which programmerjake wrote is borderline. it's extremely comprehensive (the testing) and should really have been done in the style you developed | 10:10 |
lkcl | but then it would not be possible to run as a python-based unit test | 10:10 |
markos | of course it would, just load it as a string from file | 10:11 |
lkcl | then it is no longer possible under the unit tests to perform python-based parameterised substitution for the purposes of performing comprehensive unit testing | 10:11 |
lkcl | or it is, but is extremely inconvenient | 10:12 |
lkcl | for example: | 10:12 |
lkcl | i perform substitution in many unit tests of VL as a python-based parameter | 10:12 |
lkcl | *re-running* the entire unit test multiple times | 10:12 |
markos | yes, but these are not the tests I'm talking about | 10:12 |
lkcl | with completely different assembler, in effect | 10:12 |
lkcl | i know | 10:12 |
markos | anyway | 10:13 |
markos | not saying anything should be changed now, just that we should consider putting big asm chunks externally | 10:13 |
lkcl | when there's a more "stand-alone" style appropriate (like the one you are developing) then yes absolutely i agree 100% the separate .s is good | 10:13 |
lkcl | the 17 instructions in chacha20 turned out to be necessary, they're a bit much | 10:14 |
markos | ack | 10:14 |
lkcl | but i needed to test svindex - it's just that it needed to be called 4 times | 10:15 |
lkcl | which is the limit (there are only 4 SVSHAPEs) | 10:15 |
lkcl | this is not possible to do in a ".s" file: | 10:16 |
lkcl | 'addi 16, 0, %d' % nrounds, # set number of rounds | 10:16 |
lkcl | strictly speaking the "32" on setvl should also be parameterised | 10:16 |
lkcl | so that a much shorter test run can be made | 10:17 |
lkcl | (and also experiment and see how number of rounds affects output) | 10:17 |
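The python-based parameter substitution lkcl describes can be illustrated in miniature. `make_listing` here is a stand-in helper, not the actual unit-test code; only the `addi` line is taken from the log:

```python
# The same assembler listing is regenerated for several parameter
# values, then the whole unit test is re-run for each -- which is
# exactly what a static .s file cannot express.

def make_listing(nrounds):
    return [
        'addi 16, 0, %d' % nrounds,  # set number of rounds
        # ... remainder of the chacha20 inner-core listing ...
    ]

# each parameterisation is, in effect, a completely different program
listings = [make_listing(n) for n in (1, 4, 10)]
```

Loading a fixed `.s` file would pin `nrounds` (and VL on `setvl`) to one value, losing the ability to do a much shorter test run or sweep the parameter.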
lkcl | i keep getting tempted to try to do 2D svindex REMAPs | 10:18 |
lkcl | the "normal" design of chacha20 is to SIMD-i-fy every 4th quarter-round in groups | 10:19 |
lkcl | performing lane-based xor/add/rotl32 | 10:19 |
lkcl | https://en.wikipedia.org/wiki/File:ChaCha_Cipher_Quarter_Round_Function.svg | 10:20 |
markos | if you do that and add butterfly-and-family int instructions 16/32-bit, we can do pretty much all complex video codec loops in 1/10th of the original code :) | 10:20 |
lkcl | :) | 10:21 |
lkcl | yyeahh that one needs to go on the OPF ISA WG grant | 10:21 |
ghostmansd[m] | lkcl, are you fine if I re-split the budget between 947 and 958, say, 5000 and 3000? Because 947 needs way more work, and 958 is simpler. | 10:21 |
ghostmansd[m] | Maybe even 5500 and 2500, perhaps. | 10:22 |
markos | ghostmansd[m], are you btw familiar with LLVM codebase as well? | 10:22 |
lkcl | ghostmansd[m], yes sure, feel free to edit it yourself. it's not "approved" yet so is fine to edit | 10:22 |
lkcl | markos, ohh, 2D DCT schedule as well | 10:23 |
markos | lkcl, yup | 10:23 |
ghostmansd[m] | Ok, thanks! | 10:23 |
lkcl | markos, what was the name of that fantastically-odd instruction? | 10:24 |
lkcl | the one with mul-and-add-and-a-shift-thrown-in-for-good-measure | 10:24 |
markos | vqrdmulhq_s16/vqrdmulhq_s32 | 10:25 |
lkcl | ta | 10:25 |
markos | https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:@navigationhierarchiessimdisa=[Neon]&q=vqrdmulhq_s16 | 10:25 |
markos | https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/SQRDMULH--vector---Signed-saturating-Rounding-Doubling-Multiply-returning-High-half-?lang=en for the asm instruction | 10:25 |
ghostmansd[m] | > it's not "approved" yet so is fine to edit | 10:26 |
ghostmansd[m] | Speaking of that, do you have any estimates on it? Some bits of 947 (948) are almost done, and 958 is approaching fully done (I need to check instructions outside of av.mdwn). | 10:26 |
markos | however that works for the one coeff only | 10:26 |
markos | tried using it twice for the 2-coeffs butterfly but unfortunately it doesn't work that well | 10:27 |
lkcl | ghostmansd[m], no - see the process i put on one of the bugreports. | 10:27 |
markos | ah but there is SQDMLAL | 10:27 |
lkcl | a) decide the tasks | 10:28 |
markos | yup, that's the 2-coeff variant | 10:28 |
lkcl | b) review the tasks | 10:28 |
lkcl | c) get them approved by NLnet | 10:28 |
lkcl | d) get a signed MoU back | 10:28 |
ghostmansd[m] | Ah OK, good | 10:28 |
ghostmansd[m] | Do we have tasks which are approved? | 10:29 |
ghostmansd[m] | I'll complete these to some degree then until MoU is signed. | 10:29 |
ghostmansd[m] | (Any risk it's not signed?) | 10:29 |
lkcl | no risk whatsoever. it's already approved by the EU | 10:30 |
lkcl | here's the list of all milestones: https://bugs.libre-soc.org/show_bug.cgi?id=938 | 10:30 |
lkcl | so, anything that can go under this | 10:31 |
lkcl | https://bugs.libre-soc.org/show_bug.cgi?id=589 | 10:31 |
lkcl | can be justified | 10:31 |
lkcl | e.g.... if you for example.... ooo i dunno... | 10:32 |
lkcl | added the two new shift-instructions | 10:32 |
lkcl | and sv.maddedu and sv.divmod2du | 10:32 |
lkcl | those are for bigint which is for cryptographic purposes which can therefore be justified under that grant | 10:33 |
ghostmansd[m] | Ok, no risk, but still no option to send any RFP under cavatools. :-) So I'll have to switch to something else for a while until that one is approved. | 10:33 |
ghostmansd[m] | Hm, maddedu... IIRC this one already exists as word instruction. | 10:34 |
ghostmansd[m] | Not in binutils though. | 10:34 |
lkcl | correct, no risk. correct, no RFP yet. estimated... mmm... 2-3 weeks as i need to write the *rest* of the cavatools tasks, sigh. | 10:34 |
ghostmansd[m] | Aha, OK, sufficient to do something else :-) | 10:35 |
ghostmansd[m] | OK I'll add maddedu and divmod2du. Do we have task for these? | 10:35 |
lkcl | just looking... | 10:35 |
lkcl | prrooobbabblllyy.... under this one https://bugs.libre-soc.org/show_bug.cgi?id=771 | 10:35 |
lkcl | or https://bugs.libre-soc.org/show_bug.cgi?id=772 would be better, it's larger. | 10:36 |
ghostmansd[m] | Ok, cool, thanks! | 10:37 |
ghostmansd[m] | I will clean up what I have so far for assembly, and then proceed with these. | 10:37 |
lkcl | ack | 10:37 |
ghostmansd[m] | lkcl, I noticed we don't have pcdec too; should I add it? Is it sufficiently stable? | 13:22 |
ghostmansd[m] | Sorry, I meant pcdec.; apparently it lacks a non-Rc version? | 13:23 |
lkcl | like "stcix." it's Rc=1 only, yes | 13:29 |
lkcl | ghostmansd[m], https://bugs.libre-soc.org/show_bug.cgi?id=937#c12 | 13:38 |
ghostmansd[m] | Sorry, kinda out of context. Do you mean these should be added too? | 13:39 |
lkcl | the shift instructions will likely become 3-in 1-out so don't add them to the list yet | 13:39 |
lkcl | but yes | 13:39 |
*** toshywoshy <toshywoshy!~toshywosh@ptr-377wf33o3bnthuddmycb.18120a2.ip6.access.telenet.be> has quit IRC | 13:42 | |
*** openpowerbot <openpowerbot!~openpower@94-226-188-34.access.telenet.be> has quit IRC | 13:42 | |
*** ghostmansd <ghostmansd!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has joined #libre-soc | 17:01 | |
ghostmansd | lkcl, I've raised the task about maddedu and divmod2du: https://bugs.libre-soc.org/show_bug.cgi?id=964 | 17:01 |
lkcl | ghostmansd, ack | 17:02 |
ghostmansd | And, actually, already done it :-) | 17:03 |
lkcl | doh :) | 17:03 |
ghostmansd | Actually I did so many of these that I already do it semi-automatically | 17:04 |
lkcl | yehyeh | 17:04 |
lkcl | ah - divmod2du has to move btw | 17:04 |
ghostmansd | I got lucky, since there were instructions with the same form | 17:04 |
ghostmansd | What do you mean? | 17:04 |
lkcl | to XO=58 | 17:04 |
ghostmansd | Ah OK, ping me when you move it | 17:05 |
ghostmansd | Or should I? | 17:05 |
lkcl | am just in the middle of sorting that out (to accommodate dsld and dsrd) | 17:05 |
ghostmansd | I'll do it, OK | 17:05 |
ghostmansd | Sure, NP | 17:05 |
ghostmansd | 1 sec, this will need updates both in openpower-isa and binutils | 17:05 |
lkcl | i'm sorting out openpower-isa | 17:06 |
lkcl | commits outstanding, please don't make modifications to svp64.py minor_4.csv or svfixedarith.mdwn | 17:06 |
ghostmansd | Ah OK | 17:07 |
ghostmansd | I'll handle binutils then | 17:07 |
ghostmansd | ping me when I need to rebase | 17:07 |
ghostmansd | Because I might need to regenerate parts of binutils | 17:07 |
ghostmansd | Some stuff is fixed (e.g. XO), but the tables... | 17:08 |
lkcl | ghostmansd, done | 17:15 |
lkcl | well that's frickin fricked | 17:22 |
lkcl | dsld as a 4-op can only be EXTRA2 | 17:22 |
lkcl | frick | 17:22 |
lkcl | and for dsld to be able to do overlapping pickup of a pair of consecutive 64-bit registers, it has to be EXTRA3 | 17:22 |
lkcl | frick | 17:22 |
ghostmansd | 1 sec, need to check it and update the tests based on what pysvp64asm produces | 17:22 |
ghostmansd | done! | 17:26 |
lkcl | brilliant | 17:32 |
ghostmansd[m] | Feel free to assign the budget and split as you like | 17:34 |
ghostmansd[m] | Since you did the XO moving for this | 17:34 |
ghostmansd[m] | https://bugs.libre-soc.org/show_bug.cgi?id=964 | 17:34 |
lkcl | ack | 17:35 |
*** toshywoshy <toshywoshy!~toshywosh@ptr-377wf33o3bnthuddmycb.18120a2.ip6.access.telenet.be> has joined #libre-soc | 17:52 | |
*** markos <markos!~Konstanti@static062038151250.dsl.hol.gr> has quit IRC | 17:57 | |
*** markos <markos!~Konstanti@static062038151250.dsl.hol.gr> has joined #libre-soc | 18:07 | |
ghostmansd[m] | lkcl, do you have any other tasks in mind that might need my help? | 19:18 |
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC | 20:05 | |
lkcl | ghostmansd[m], kinda, yes - mostly we've got to focus on cavatools and the cryptoprimitives | 20:54 |
FUZxxl | Where are most of you from? | 20:55 |
lkcl | all over the place! | 20:55 |
ghostmansd[m] | I ain't no cryptographer, but I can help unless there is math involved :-) | 20:55 |
ghostmansd[m] | Here's Mother Russia, for example | 20:56 |
lkcl | kuwait russia greece canada US germany UK belgium india | 20:56 |
FUZxxl | cool | 20:56 |
lkcl | maaaath | 20:56 |
ghostmansd[m] | Hi FUZxxl | 20:56 |
FUZxxl | hi ghostmansd[m] | 20:56 |
lkcl | btw i'm supposed to be shopping, back in 1/2 hr :) | 20:57 |
ghostmansd[m] | Ok then :-) | 20:57 |
lkcl | aspirin and jelly-babys | 20:57 |
lkcl | no connection | 20:57 |
ghostmansd[m] | lkcl, do you mean Haribos? | 20:57 |
FUZxxl | I want to do a PhD project about something related to SIMD stuff | 20:58 |
FUZxxl | someone on reddit told me you might be interested | 20:58 |
ghostmansd[m] | Well, SVP64 is not about SIMD, to be honest, quite the opposite | 20:59 |
FUZxxl | I think that was you lkcl ? | 20:59 |
FUZxxl | Vector stuff counts in this regard :-) | 20:59 |
FUZxxl | though for some application it might be more tricky to use | 20:59 |
ghostmansd[m] | In this sense, it totally applies! | 20:59 |
ghostmansd[m] | Not more tricky than SIMD for sure | 20:59 |
FUZxxl | We just wrote a fun paper about a SIMD-accelerated algorithm to transcode between UTF-8 and UTF-16 (both directions) | 20:59 |
FUZxxl | it's full of horizontal operations | 21:00 |
FUZxxl | I'm not sure if you can do enough arithmetic on masks in SVP64 to implement it there | 21:00 |
ghostmansd[m] | You might want to talk to programmerjake, he'd recently implemented UTF-8 in SVP64 | 21:01 |
ghostmansd[m] | That'd be a great place to start I think | 21:01 |
FUZxxl | programmerjake: hi | 21:01 |
ghostmansd[m] | Just to feel the difference, so to speak | 21:01 |
FUZxxl | sure | 21:01 |
FUZxxl | I can send you a preprint if you like | 21:01 |
FUZxxl | we achieve ~8 GB/s for UTF-16->UTF-8 and ~5 GB/s for the other direction on Icelake | 21:02 |
ghostmansd[m] | We have some nice examples added recently in SVP64 assembly; lkcl, markos and programmerjake can share some tips on directions | 21:02 |
ghostmansd[m] | I'm not sure whether programmerjake measured the throughput, but you can see the simplicity, at least | 21:02 |
ghostmansd[m] | I mean, code-wise | 21:03 |
FUZxxl | I think I kind of have to get used to it | 21:03 |
FUZxxl | I think lkcl has sent me code about that before | 21:03 |
FUZxxl | but honestly I didn't get around to understanding it | 21:03 |
ghostmansd[m] | Ah, great! | 21:03 |
ghostmansd[m] | Well, a good starting point for PPC assembly would be the PowerPC ISA book | 21:04 |
FUZxxl | https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/algorithms/svp64_utf_8_validation.py;hb=HEAD | 21:04 |
FUZxxl | ^ this one was a UTF-8 validator | 21:04 |
FUZxxl | I've actually not implemented a dedicated UTF-8 validator, but my code does validation throughout the transcoding process. | 21:05 |
FUZxxl | It really doesn't cost much at all (not on the critical path and there are enough ports with spare cycles to make it essentially free) | 21:05 |
FUZxxl | anyway; that was mostly a warm up | 21:06 |
FUZxxl | I heard you are interested in people writing assemblers for you? | 21:06 |
FUZxxl | I was thinking about designing sort of a "smart assembler" that helps you write assembly code more effectively | 21:07 |
ghostmansd[m] | Well, SVP64 pays significant attention to letting you write assembly code smartly | 21:08 |
ghostmansd[m] | But I guess lkcl can enlighten more on this part :-) | 21:08 |
FUZxxl | I'm unfortunately not super familiar with SVP64 yet | 21:09 |
FUZxxl | my key idea was to have an assembler that does chores like laying out a stack frame and dealing with obtaining global variables through whatever PIC scheme your platform uses and this sort of thing | 21:09 |
FUZxxl | So you can write intuitive code instead of having to deal with weird relocations and such | 21:10 |
FUZxxl | additional possible objectives include strength reductions and register allocation | 21:10 |
FUZxxl | i.e. instead of taking a compiled language and turning it into a macro assembler, I want to take an assembler and give it useful high-level features so you can make your life easier while still fundamentally programming assembly | 21:11 |
FUZxxl | ghostmansd[m]: what is your focus in the libre-soc project? | 21:14 |
ghostmansd[m] | My recent works are towards supporting SVP64 in binutils, also helping with the other stuff related to instructions. | 21:15 |
FUZxxl | sounds great | 21:15 |
ghostmansd[m] | Surprisingly this hasn't ever needed assembly knowledge. :-) | 21:15 |
FUZxxl | hah | 21:15 |
ghostmansd[m] | > So you can write intuitive code instead of having to deal with weird relocations and such | 21:16 |
ghostmansd[m] | Well, these parts we don't touch, they're kept as they were in vanilla PPC world. | 21:17 |
FUZxxl | yes correct | 21:17 |
FUZxxl | still they suck if you program in assembly | 21:17 |
FUZxxl | so I think a smart assembler could help you do better | 21:17 |
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc | 21:17 | |
ghostmansd[m] | Many things suck if you program in assembly :-) | 21:17 |
ghostmansd[m] | That depends on the assembly used | 21:17 |
FUZxxl | they don't have to! | 21:17 |
ghostmansd[m] | But, even then, it's low-level kind of things | 21:18 |
FUZxxl | So? | 21:18 |
ghostmansd[m] | Well, they don't, but then it's not the lowest-level (well not the lowest but you understand what I mean) | 21:18 |
FUZxxl | so? | 21:18 |
FUZxxl | Assembly programming doesn't mean you are forced to ignore any QOL improvements. | 21:19 |
FUZxxl | It's just that you are fundamentally programming in terms of machine instructions. | 21:19 |
ghostmansd[m] | No that's not what I mean | 21:19 |
ghostmansd[m] | I mean, not many people think about it | 21:19 |
ghostmansd[m] | Because they generally code in high-level | 21:19 |
ghostmansd[m] | And the ugly part (e.g. relocations) is hidden | 21:20 |
FUZxxl | True. | 21:20 |
ghostmansd[m] | If everybody coded in asm, people would have thought about it more | 21:20 |
ghostmansd[m] | But the thing is, they simply don't | 21:20 |
FUZxxl | I think you have to write in assembly for fast SIMD/vector code, as high level languages do not have the right paradigm to express such algorithms. | 21:20 |
ghostmansd[m] | They just take, say, shiny Python and don't care | 21:20 |
FUZxxl | Improving on this is another possible PhD project. | 21:21 |
FUZxxl | So being able to write assembly code more comfortably is something I consider very useful. I don't particularly care that it's a small niche. | 21:21 |
ghostmansd[m] | True, but, first, this does not really involve diving into low-level weird chunks like relocations | 21:22 |
ghostmansd[m] | And, second, it's mostly about optimizing some sections | 21:22 |
FUZxxl | It does. | 21:22 |
FUZxxl | For example, what do you have to do to obtain the address of a global variable? | 21:22 |
FUZxxl | The answer depends on whether you write position-dependent, PIC, or PIE code. | 21:22 |
ghostmansd[m] | This is true | 21:23 |
ghostmansd[m] | But why would you do it in a small piece of code? | 21:23 |
FUZxxl | So if an assembler could do it for you, it would be much easier for the programmer to focus on the relevant bits | 21:23 |
FUZxxl | the assembly code can be substantial | 21:23 |
ghostmansd[m] | I'd just pass that address in a register | 21:23 |
ghostmansd[m] | To some routine written in assembly | 21:23 |
ghostmansd[m] | What I mean, I would not write everything around in assembly | 21:23 |
ghostmansd[m] | I would write some exclusive performance-critical parts | 21:24 |
ghostmansd[m] | And make these callable from the high-level code | 21:24 |
FUZxxl | but why should there be this restriction? | 21:24 |
ghostmansd[m] | I'm not saying it's right :-) | 21:24 |
ghostmansd[m] | I simply state how it mostly works | 21:24 |
FUZxxl | another example: calling functions is hard because the correct sequence depends on the code model | 21:24 |
ghostmansd[m] | There are exceptions, like KolibriOS | 21:24 |
ghostmansd[m] | Indeed | 21:25 |
FUZxxl | TLS is hard, too | 21:25 |
ghostmansd[m] | I mean, I'm not saying the idea is wrong, I just express my thoughts on how it works now | 21:25 |
FUZxxl | and it's usually OS dependent too | 21:25 |
FUZxxl | sure | 21:25 |
FUZxxl | I know how it works now | 21:25 |
ghostmansd[m] | TLS is fucked as a concept TBH | 21:25 |
FUZxxl | and I am not satisfied. | 21:25 |
ghostmansd[m] | Not unreasonable but still fucked | 21:26 |
FUZxxl | oh, please explain! An interesting view! | 21:26 |
ghostmansd[m] | I mostly mean concepts like errno | 21:26 |
FUZxxl | how is it fucked? | 21:26 |
ghostmansd[m] | Compared to errno, using negative error codes like in the Linux kernel is rather a sane idea | 21:27 |
ghostmansd[m] | Because checking/setting an integer in a register is much cheaper than accessing TLS | 21:27 |
FUZxxl | you only need to check TLS after having established that the call failed | 21:27 |
ghostmansd[m] | The BSDs use another approach, setting a carry flag | 21:27 |
ghostmansd[m] | But in essence it's the same | 21:27 |
ghostmansd[m] | First, it depends | 21:28 |
ghostmansd[m] | Some routines require setting errno to 0, otherwise you won't be able to distinguish some failure cases | 21:28 |
FUZxxl | these are not system calls and hence could not benefit from the syscall convention | 21:28 |
ghostmansd[m] | Second, why address TLS if you can simply use register which is already thread-specific enough? | 21:29 |
FUZxxl | also the Linux convention is kind of fucked as it arbitrarily assumes that a certain range of values could not occur in valid results. | 21:29 |
FUZxxl | Not really future proof. | 21:29 |
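The Linux kernel convention being argued over here can be sketched in a few lines. This is an illustrative model only (the kernel does this in C, with an unsigned comparison against the reserved range), not its actual macros:

```python
# Linux-kernel-style error convention: a return value in [-4095, -1]
# is a negated errno code, everything else is a valid result.  The
# hard-coded range is precisely the assumption FUZxxl objects to:
# it declares those values can never occur in a valid result.
MAX_ERRNO = 4095

def is_error(ret):
    # true only for the reserved error range
    return -MAX_ERRNO <= ret <= -1

def errno_of(ret):
    # recover the positive errno from a failed call, 0 on success
    return -ret if is_error(ret) else 0
```

The trade-off discussed above: checking a register value like this is cheap, while the errno approach costs a TLS access but reserves no part of the result space.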
ghostmansd[m] | Or, well, rather processor specific | 21:29 |
FUZxxl | Sure, use one register for each global variable | 21:29 |
FUZxxl | errno is not performance critical enough to waste such a resource on it | 21:29 |
ghostmansd[m] | Rather, don't use global variables :-) | 21:30 |
FUZxxl | we already keep the TLS pointer in a register, so errno is just one dereference away | 21:30 |
ghostmansd[m] | But I'm not trying to persuade, this is not a holy war on global state :-D | 21:30 |
FUZxxl | the reason for errno is chiefly that you can't return more than one value in C. | 21:31 |
ghostmansd[m] | > we already keep the TLS pointer in a register, so errno is just one dereference away | 21:31 |
ghostmansd[m] | This actually depends on the architecture I think | 21:31 |
FUZxxl | If you can, there's no need to have it, as e.g. Go shows. | 21:31 |
ghostmansd[m] | On most architectures yes | 21:31 |
ghostmansd[m] | Also, enabling TLS per se is difficult stuff | 21:31 |
FUZxxl | you can't really avoid it | 21:31 |
FUZxxl | You are going to need thread-local variables one way or another | 21:31 |
FUZxxl | e.g. Go, which doesn't have TLS exposed in the language still needs it to implement details of its green threads | 21:32 |
ghostmansd[m] | I wouldn't say; per-CPU, yes, these are unavoidable for many contexts | 21:32 |
ghostmansd[m] | Per-thread? Not that much. | 21:32 |
FUZxxl | what is the distinction in your opinion? | 21:32 |
FUZxxl | there is a 1:1 mapping between threads and CPUs | 21:33 |
FUZxxl | i.e. whenever a thread runs, it runs on one CPU | 21:33 |
ghostmansd[m] | Yes and no. Per-CPU can be just as simple as "take the CPU index and refer to an entry in, say, an array". | 21:34 |
ghostmansd[m] | And per-thread would have been the same... | 21:34 |
ghostmansd[m] | ...but instead it got buried under a lot of details: allocate stuff in the kernel, provide an interface for it in userspace, support it in compilers, support it at link time, support it everywhere... | 21:35 |
FUZxxl | So per CPU means that if you have a new time slice, whatever other process runs gets the per CPU variable? :-D | 21:35 |
ghostmansd[m] | I mean, this is kinda the difference between kernel and userspace, I know it's not entirely a fair comparison. | 21:36 |
FUZxxl | also all the support would still be needed for "per CPU" stuff | 21:36 |
FUZxxl | another point is that register space is limited | 21:37 |
ghostmansd[m] | No, it doesn't mean that. And no, it won't be, at least not with the amount of detail needed for pthreads. | 21:37 |
FUZxxl | so you cannot have programs freely allocate registers for global or per-thread stuff | 21:38 |
ghostmansd[m] | And yes it could've been as simple as "just sacrifice one register" | 21:38 |
FUZxxl | you run out too quickly (as opposed to memory) | 21:38 |
FUZxxl | we do, it's called the TLS register | 21:38 |
FUZxxl | sacrificed to enable use of arbitrary per-thread state. | 21:38 |
ghostmansd[m] | Yeah I know | 21:39 |
ghostmansd[m] | Anyway, the level of magic for supporting per-CPU in the Linux kernel is not comparable to the level of magic needed to support pthreads in userspace | 21:40 |
FUZxxl | another issue with sacrificing registers is that every part of the program needs to know which registers are sacrificed | 21:40 |
ghostmansd[m] | That's my point | 21:40 |
FUZxxl | which is hard to do in the presence of shared libraries and such | 21:40 |
ghostmansd[m] | Well, this is kinda part of the deal, huh? | 21:40 |
FUZxxl | what you call "per-CPU" is just "per-thread" but not touching memory | 21:41 |
ghostmansd[m] | Like, say, with the calling conventions | 21:41 |
ghostmansd[m] | It's not what I call "per-CPU", it's in fact per-CPU | 21:41 |
ghostmansd[m] | Rather, the thing you call per-thread is in fact per-CPU :-) | 21:41 |
FUZxxl | semantics | 21:42 |
ghostmansd[m] | But this is buried deep down | 21:42 |
ghostmansd[m] | Semantics matter | 21:42 |
FUZxxl | sorry, I still don't really get the point you are trying to make. | 21:42 |
FUZxxl | Especially with the distinction between per-thread and per-CPU | 21:43 |
ghostmansd[m] | My point is: unless you really need to, don't use TLS. Allocate a per-thread context, for example. | 21:43 |
FUZxxl | TLS is there anyway, might as well use it. | 21:44 |
FUZxxl | It won't go back into its box if you don't use it. | 21:44 |
ghostmansd[m] | There are many things around which won't; not the reason to use them. | 21:44 |
FUZxxl | so ... you oppose on philosophical grounds? | 21:44 |
FUZxxl | it would be easier to implement if we didn't have them, so even though we do, don't use them? | 21:45 |
ghostmansd[m] | I oppose on the grounds of simplicity. You might as well call these philosophical, if you like. :-) | 21:45 |
FUZxxl | I think TLS makes for some tremendously simple solutions. | 21:46 |
FUZxxl | Just as static variables do. | 21:46 |
FUZxxl | I mean there is a reason objects are so popular: object members are just per-instance static variables. | 21:46 |
FUZxxl | Lets you use them without feeling guilty for having sinned. | 21:46 |
ghostmansd[m] | And simplicity differs. I know declaring something thread_local might appear simpler than calling malloc inside the thread. And having a global variable is simpler than passing function arguments. In this sense, this is not the simplicity I'm talking about. | 21:47 |
FUZxxl | so philosophical grounds :-) | 21:48 |
ghostmansd[m] | I'm talking about simplicity to debug and maintain it, and also about simplicity of supporting the internals needed, etc. | 21:48 |
ghostmansd[m] | Again: if you think these are philosophical grounds, that's just your opinion, I'm fine with it. :-) | 21:48 |
ghostmansd[m] | I don't really intend to convert you into the no-TLS-whatsoever church. :-) | 21:49 |
ghostmansd[m] | Each and every programmer is full of opinionated statements, so am I. | 21:49 |
FUZxxl | debugging is a lot easier with TLS and static variables because you can just look at the variable | 21:50 |
FUZxxl | it's a lot harder if the variable is to be found somewhere deep in an object hierarchy or the call stack | 21:50 |
FUZxxl | you also don't have to lug the variable down the call stack which e.g. helps tremendously when writing recursive-descent parsers. | 21:50 |
ghostmansd[m] | My experience differs, but, again, this is about opinions. | 21:50 |
FUZxxl | sure | 21:50 |
ghostmansd[m] | This diverged a lot from UTF-8, lol :-D | 21:51 |
FUZxxl | heh | 21:51 |
FUZxxl | My current other project is writing a SIMD-accelerated SAT solver | 21:52 |
FUZxxl | let's see if it works | 21:52 |
ghostmansd[m] | Oh, nice! | 21:57 |
FUZxxl | SAT solvers are very branch heavy | 21:57 |
FUZxxl | they constantly (1) load a datum from RAM (2) compare it with something (3) branch on the result | 21:58 |
FUZxxl | almost all the loads are interdependent | 21:58 |
FUZxxl | I believe I can do better with slightly improved data structures, permitting me to then use gather instructions and so on | 21:58 |
ghostmansd[m] | Well, that's one of the benefits with vectors, being able to write branchless code, so I hope you'll find what you're searching for here. :-) | 22:00 |
ghostmansd[m] | So, welcome :-) | 22:00 |
FUZxxl | Oh it's still going to be plenty branchy | 22:00 |
FUZxxl | and I think vectors will not actually give a lot of benefit | 22:00 |
FUZxxl | even being able to do 4 loads independently would be a big plus | 22:00 |
FUZxxl | even with vectors you'll be limited to how many fetch units you have | 22:02 |
FUZxxl | as most loads are random | 22:02 |
FUZxxl | so once your vector holds as many integers as you have fetch units, you have reached the limit of possible perf benefits. | 22:02 |
markos | FUZxxl, hey, welcome, nice to see another fellow SIMD junkie :) | 22:06 |
markos | well, in my case, ex-SIMD junkie, I'll never see SIMD again in the same light after SVP64 | 22:06 |
markos | in fact, working with SVP64 in most cases requires looking at the actual reference scalar code rather than a SIMD implementation | 22:07 |
*** toshywoshy <toshywoshy!~toshywosh@ptr-377wf33o3bnthuddmycb.18120a2.ip6.access.telenet.be> has quit IRC | 22:07 | |
FUZxxl | markos: I like to do something I call "combinatorial SIMD" i.e. SIMD use for combinatorial algorithms | 22:08 |
FUZxxl | The results are usually completely different from the scalar code as they require entirely different algorithm design approaches | 22:08 |
markos | I implemented some example code for SVP64, you'll see that it's much much easier to just write SVP64 assembly directly following the scalar code | 22:09 |
FUZxxl | which also makes me feel uncertain about how advantageous vector architectures are going to be. I don't know how good their horizontal operations are. | 22:09 |
markos | and forget SIMD entirely | 22:09 |
FUZxxl | If you have all vertical code I'm sure it is. | 22:09 |
markos | how about reducing code size to 1/5th or even 1/10th? | 22:10 |
FUZxxl | but alas I do not have enough experience. | 22:10 |
markos | in assembly! | 22:10 |
FUZxxl | sure, but does it do the trick? | 22:10 |
markos | yes | 22:10 |
markos | how fast we don't know yet | 22:10 |
FUZxxl | e.g. could you do it with, say, an integer sorting algorithm? | 22:10 |
FUZxxl | if you have, say, quicksort, it's a very simple scalar procedure. | 22:10 |
markos | sure | 22:10 |
FUZxxl | But what would the SVP64 variant look like? | 22:10 |
FUZxxl | Each iteration has a loop carried dependency. | 22:11 |
FUZxxl | You need to redesign the algorithm to vectorise it. | 22:11 |
markos | we haven't done sorting yet, but it's trivial to find the horizontal min/max across a vector | 22:11 |
FUZxxl | sure, this is because you have reduction instructions I suppose. | 22:11 |
markos | yes | 22:11 |
markos | one instruction to get the min/max across 64 bit registers | 22:11 |
FUZxxl | cheating :-P | 22:11 |
markos | not more than what other engines are doing already | 22:12 |
FUZxxl | an instruction to implement a special case says nothing about the general case | 22:12 |
markos | but instead of adding a gazillion instructions to do one simple job | 22:12 |
markos | SVP64 adds just a few powerful instructions that can be modified to be used in pretty much all scenarios | 22:12 |
FUZxxl | for example, what if you want to compute the gcd of a vector? | 22:12 |
FUZxxl | Is it still that easy? | 22:12 |
FUZxxl | the scalar code is just as easy as "min of a vector" | 22:13 |
markos | yes | 22:13 |
FUZxxl | what would it look like in SVP64? Is there a "reduce with gcd" instruction? | 22:13 |
markos | I did ports of a few algorithms used in VP8/VP9 and AV1 | 22:13 |
markos | do you have the git checked out? | 22:13 |
FUZxxl | which git in particular? | 22:14 |
markos | libresoc, openpower-isa | 22:14 |
FUZxxl | can you give me the correct link? | 22:14 |
FUZxxl | (no I do not have it checked out) | 22:14 |
markos | https://git.libre-soc.org/?p=openpower-isa.git;a=summary | 22:15 |
FUZxxl | what is the URL to clone this repository? The site does not say. | 22:16 |
markos | well, if you want to work on it you have to set up an actual dev environment, unfortunately it's not as simple as checking out and starting to play with it | 22:17 |
markos | but if you want to look at SVP64 asm examples | 22:17 |
markos | https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=media/video;hb=HEAD | 22:18 |
markos | has a port of a few functions for libvpx and dav1d libraries | 22:18 |
FUZxxl | you are evading my question :-) | 22:19 |
markos | built using ghostmansd's binutils port to SVP64 and run on the python simulator | 22:19 |
markos | which is? how would a gcd be implemented in SVP64? | 22:19 |
FUZxxl | no, a "reduction by gcd" | 22:19 |
FUZxxl | I chose gcd because I am pretty sure you don't have an instruction for it. | 22:19 |
FUZxxl | i.e. take an array of numbers and compute the gcd of all numbers. | 22:19 |
FUZxxl | the scalar loop is simple, but how can it be vectorised? I cannot imagine it would be as simple as the scalar loop. | 22:20 |
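The scalar loop being discussed can be sketched in Python (a hypothetical illustration; `gcd_reduce` is not code from either project):

```python
from math import gcd

def gcd_reduce(values):
    # Fold gcd across all elements; iteration i uses the result of
    # iteration i-1, which is the loop-carried dependency at issue
    acc = 0  # gcd(0, x) == x, so 0 is the identity element
    for v in values:
        acc = gcd(acc, v)
    return acc
```

Since gcd is associative with identity 0, the fold could in principle be re-bracketed into a tree, but the scalar loop as written is a strict left fold.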
markos | you have to think outside the box of SIMD | 22:20 |
markos | no, there's no specific instruction for it | 22:20 |
markos | but it can be turned into SVP64 quite easily | 22:20 |
FUZxxl | But would it look just like the scalar code? | 22:21 |
markos | not exactly, but close | 22:21 |
markos | it's just a bunch of modulos and compare against zero | 22:21 |
FUZxxl | the thing is: I'm sure it's very easy to make the vector code look like the scalar code for simple numerical algorithms | 22:21 |
FUZxxl | but I really don't care about these | 22:22 |
markos | how's DCT? | 22:22 |
markos | I wouldn't call it exactly simple | 22:22 |
*** toshywoshy <toshywoshy!~toshywosh@ptr-377wf33o3bnthuddmycb.18120a2.ip6.access.telenet.be> has joined #libre-soc | 22:22 | |
FUZxxl | markos: and you have to somehow deal with it being a reduction | 22:22 |
markos | and yet it's much smaller than SIMD DCT code I'm writing for Arm | 22:22 |
FUZxxl | discrete cosine transform? | 22:22 |
markos | yes | 22:22 |
FUZxxl | sure it is | 22:22 |
FUZxxl | you don't have to deal with edge conditions or chunking | 22:22 |
markos | have you seen SIMD unrolled code? | 22:23 |
markos | for DCT I mean | 22:23 |
FUZxxl | otoh you will not benefit from more execution units with such simple code | 22:23 |
FUZxxl | sure | 22:23 |
FUZxxl | You can write SIMD code in very similar style, especially with AVX-512. The extra complexity is mainly computing masks for the final iteration. | 22:23 |
FUZxxl | But such code is not very efficient as it doesn't have high ILP | 22:24 |
markos | let's leave AVX512 out of the discussion please :) | 22:24 |
FUZxxl | the complexity is needed to enable better ILP and I believe that'll be the same with SVP64 in future implementations | 22:24 |
markos | I'm writing SIMD code for pretty much all architectures, as a profession | 22:24 |
markos | AVX512 apart from a few interesting instructions is a mess | 22:24 |
FUZxxl | so am I :-) | 22:24 |
FUZxxl | AVX-512 is the best ISA for my particular work because it has the best horizontal operations | 22:25 |
markos | if it works for you, fine | 22:25 |
FUZxxl | e.g. VPCOMPRESSB, PDEP, PEXT, VPMULTISHIFTQB, VPERMB | 22:25 |
FUZxxl | other architectures lack them | 22:25 |
FUZxxl | SVE2 has something like VPCOMPRESSB, but no PDEP or PEXT, so I will not be able to port my UTF-8 algorithm to it easily. | 22:26 |
markos | pdep/pext are indeed lacking in other architectures | 22:26 |
FUZxxl | SVP64 would need pdep/pext on mask (?) registers to be effective | 22:27 |
markos | but here's the thing, in the end all that matters is the resulting performance, and so far I managed to get the best performance of my software on a M1 using NEON | 22:27 |
FUZxxl | and compress/expand (which I think you might already have) | 22:27 |
FUZxxl | yeah that chip rules | 22:27 |
FUZxxl | however, NEON sucks for combinatoric programming: no scatter/gather, no compress, no pdep, no pext | 22:27 |
FUZxxl | good for number crunching though | 22:28 |
markos | scatter/gather are a joke on Intel | 22:28 |
FUZxxl | they got a lot better with Icelake | 22:28 |
FUZxxl | could be better though | 22:28 |
FUZxxl | have you worked with NEC Aurora Tsubasa before? | 22:28 |
FUZxxl | Fun system, too. | 22:28 |
markos | no | 22:28 |
FUZxxl | vector ISA w/ 2048 byte vectors iirc. HBM2 memory, no L2 cache | 22:29 |
FUZxxl | mem is so fast that vector loads always bypass the L1 cache. Crazy perf | 22:29 |
markos | that's highly specialized | 22:29 |
FUZxxl | made for number crunching | 22:29 |
markos | yes, and only that | 22:29 |
FUZxxl | correct. | 22:29 |
markos | SVE was based on such a CPU from Fujitsu iirc | 22:30 |
FUZxxl | yeah | 22:30 |
FUZxxl | still waiting for SVE2 hardware I can put on my desk | 22:30 |
markos | same here, probably I'll get access next month | 22:30 |
FUZxxl | once that happens ... I guess I'll write a whole lot more ARM asm | 22:30 |
FUZxxl | https://github.com/clausecker/pospop/blob/master/countneon_arm64.s | 22:31 |
markos | anyway | 22:31 |
FUZxxl | ^ here's some ARM asm I wrote a while ago. | 22:31 |
markos | the point is that SVP64 is a totally new game | 22:31 |
markos | check the asm in the media/video/ directory | 22:31 |
FUZxxl | that was fun. I wonder if it would work the same or better with SVP64 or other vector isas. | 22:31 |
markos | there is no need to unroll your loop 100 times to get the performance | 22:32 |
markos | eg. I'm doing a DCT 32x32 algorithm for NEON right now | 22:32 |
markos | it's unrolled 1500 lines with a ton of inline functions | 22:32 |
FUZxxl | sure, neither is it for the M1 | 22:32 |
markos | so even more | 22:32 |
FUZxxl | the out of order mechanism makes it so multiple iterations happen at once | 22:33 |
markos | with SVP64 there are specialized DCT instructions | 22:33 |
FUZxxl | still brings a little bit of an advantage some times though | 22:33 |
markos | it's not about a specific implementation, eg M1 or another | 22:33 |
markos | it's about the architecture, not the micro-architecture | 22:33 |
FUZxxl | why does the architecture require unrolling? | 22:33 |
FUZxxl | what is fundamentally different with SVP64 that it doesn't? | 22:34 |
markos | the loops are in the hardware instructions themselves | 22:34 |
FUZxxl | so ... each instruction is a loop? | 22:35 |
markos | whereas in SIMD you might have to unroll 4 times, for a simple 128-bit vector addition, with SVP64 you just do a sv.adds having previously set the VL (Vector Length) to whatever you want | 22:35 |
markos | (up to 64 iirc) | 22:35 |
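markos's contrast can be modelled roughly in Python (a behavioural sketch only; `svp64_style_add` and `simd_style_add` are invented names, and in the real ISA the vector length is set via the setvl instruction, not a function argument):

```python
def svp64_style_add(a, b, vl):
    # One vector instruction: VL elements processed, VL having been
    # set beforehand (modelled here as a plain argument)
    return [a[i] + b[i] for i in range(vl)]

def simd_style_add(a, b, n, width=4):
    # Fixed-width SIMD: whole chunks of `width` elements, then a
    # scalar tail loop for the leftover n % width elements
    out = []
    body = n - n % width
    for i in range(0, body, width):
        out.extend(a[i + j] + b[i + j] for j in range(width))  # one SIMD add
    for i in range(body, n):
        out.append(a[i] + b[i])  # scalar tail
    return out
```

The point of the comparison is code shape: the SIMD version needs the chunked body plus a tail, while the VL-based version is a single operation over however many elements VL names.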
FUZxxl | why do you have to unroll 4 times with SIMD? You can just do the same loop you do with SVP64 | 22:36 |
markos | the added bonus of having a full 128 registers to play with, makes for some pretty interesting code | 22:36 |
FUZxxl | and for slow context switches | 22:36 |
markos | in general, branches are bad | 22:36 |
FUZxxl | not if they are easy to predict | 22:36 |
markos | context switching can be solved easily | 22:36 |
markos | no, branches are bad, period | 22:36 |
FUZxxl | I disagree | 22:36 |
FUZxxl | e.g. my UTF-8 conversion routine got twice as fast by making it more branchy | 22:37 |
FUZxxl | I added fast paths, e.g. skipping large chunks of the algorithm when there are no surrogates | 22:37 |
markos | in order to have good branch prediction units, you have to add more silicon to the chip, which in turn increases power usage, which makes it more difficult to scale your cpu, etc etc | 22:37 |
markos | I'd prefer using my silicon for other stuff | 22:37 |
FUZxxl | Most code is branchy business logic, you will not be able to vectorise that. Or make it less branchy. | 22:38 |
markos | critical loops aren't for this reason | 22:38 |
FUZxxl | Mine are :-) | 22:38 |
FUZxxl | You are invited to make them less branchy and faster. | 22:38 |
markos | my plate is already full :-) | 22:38 |
FUZxxl | Sure. | 22:39 |
FUZxxl | e.g. here is my UTF-8 conversion routine: https://github.com/simdutf/simdutf/blob/clausecker/utf8_to_utf16le_avx512.S | 22:39 |
FUZxxl | How would it benefit from SVP64? | 22:39 |
markos | if it was as you say, there would be no need for so much inlining of functions in the hot path | 22:39 |
markos | programmerjake is the utf8 expert here | 22:39 |
FUZxxl | what the inlining helps with is getting rid of the shuffling of registers | 22:40 |
FUZxxl | also, functions are often called with constant arguments so constant propagation helps. | 22:40 |
FUZxxl | The pure function call is usually not what makes a call expensive. | 22:40 |
FUZxxl | markos: one thing this code does e.g. is obtaining masks over vector elements, then moving them to scalar registers and doing arithmetic on them. | 22:41 |
FUZxxl | Not sure how this would be possible with SVP64 given that masks are longer than scalar registers. | 22:41 |
markos | without knowing the algorithm I can't say | 22:42 |
FUZxxl | We also have a thing where the vector length must necessarily be dynamic, as UTF-8 and UTF-16 are variable-length encodings, so a variable-length chunk of input is translated into a variable-length chunk of output. | 22:42 |
markos | but it's the same as usual, check the reference algorithm and not the SIMD code first | 22:42 |
FUZxxl | But I guess there is a "set vector length to the number in this register" instruction. | 22:42 |
FUZxxl | the reference algorithm is completely different and cannot be vectorised as it is full of loop carried dependencies. | 22:42 |
FUZxxl | and the SIMD algorithm I designed cannot be expressed in a scalar paradigm. | 22:43 |
markos | to put it in other words, when porting code to SVP64 I found it much, much simpler to just take the C code and work with that | 22:43 |
FUZxxl | Please, I'm interested in seeing the result. | 22:43 |
markos | SIMD-isms are not possible with SVP64 or are much much harder to implement and with a big doubt of the actual benefit | 22:43 |
FUZxxl | Here is the reference C code I wrote: https://github.com/simdutf/simdutf/blob/clausecker/utf8_to_utf16le_ref.c | 22:43 |
markos | I already told you, check the media/video code | 22:43 |
FUZxxl | It's a completely different algorithm though. | 22:43 |
markos | reference C algorithm and SVP64 in the same dir (*_real.s is the actual code) | 22:44 |
FUZxxl | that code is just straightforward math code. No loop carried dependencies, no permutations, no data-dependent addressing. | 22:44 |
FUZxxl | I 100% agree with you that you can directly translate such code. | 22:44 |
markos | lots of branches but rather straightforward | 22:45 |
FUZxxl | track inpos and outpos | 22:46 |
FUZxxl | these two vary depending on which branch is taken | 22:46 |
FUZxxl | so you cannot just do the usual "do both branches and merge the results" approach as the addresses from which the next elements are fetched vary | 22:46 |
markos | never said you would | 22:47 |
FUZxxl | depending on which branch is taken in a loop-carried dependency | 22:47 |
FUZxxl | I would really like to see what it would look like | 22:47 |
FUZxxl | it feels hard to imagine. | 22:47 |
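The inpos/outpos dependency FUZxxl describes shows up even in a minimal scalar UTF-8 walker (a hypothetical sketch assuming valid input, not the simdutf reference code):

```python
def utf8_codepoint_lengths(data):
    # Walk a UTF-8 byte string and return the byte length of each
    # encoded code point. How far inpos advances each iteration
    # depends on the byte read at the current position, so every
    # iteration depends on the previous one: a loop-carried
    # dependency through inpos.
    inpos, lengths = 0, []
    while inpos < len(data):
        b = data[inpos]
        if b < 0x80:
            step = 1  # ASCII
        elif b < 0xE0:
            step = 2  # lead byte 0xC2..0xDF (valid input assumed)
        elif b < 0xF0:
            step = 3  # lead byte 0xE0..0xEF
        else:
            step = 4  # lead byte 0xF0..0xF4
        lengths.append(step)
        inpos += step  # data-dependent advance
    return lengths
```

Nothing in this loop can start iteration i+1 until iteration i has computed `step`, which is exactly why the vectorised version needs a different algorithm.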
markos | but from what I see, the SVP64 version would probably -this is a guess- look closer to the scalar, translated, rather than a SIMD implementation | 22:47 |
FUZxxl | how do you deal with the loop-carried dependency? | 22:48 |
markos | I think that would be a good exercise for you to learn SVP64 assembly :) | 22:48 |
FUZxxl | why do you keep evading my questions? | 22:48 |
lkcl | FUZxxl, i made a start on in-regs insertion-sort, it's about... 8-10-ish instructions, i haven't finished it. | 22:48 |
FUZxxl | I try to understand what the trick is and you keep waving your hands | 22:48 |
markos | I don't really know what you expect me to tell you, if you're expecting me to implement a SVP64 version of that, that's not going to happen | 22:48 |
markos | because I really have a ton of stuff to work on, and this is not in the list | 22:49 |
FUZxxl | I think you may not have understood the problem with vectorising this algorithm | 22:49 |
markos | loop carried dependencies are not as important with SVP64 as they are with SIMD | 22:49 |
FUZxxl | lkcl: it would be interesting to see that. | 22:49 |
FUZxxl | but how are they resolved? | 22:49 |
FUZxxl | e.g. in this algorithm, to fetch c0 for the next iteration, you need to have completed the previous iteration | 22:50 |
markos | what does a loop-carried dependency in the first place? | 22:50 |
FUZxxl | so how would this be vectorised? | 22:50 |
markos | s/does/is | 22:50 |
markos | it's data you have to carry horizontally and you can't do that with SIMD | 22:50 |
markos | you can with scalar | 22:50 |
lkcl | FUZxxl, we're very much learning - all of us - as this is such a new concept that we're literally at the forefront of experimenting and finding out how it works | 22:50 |
FUZxxl | "data you have to carry horizontally" -> it is | 22:50 |
FUZxxl | and the whole point of my SIMD code is that it can do that. | 22:51 |
markos | and you can with SVP64, because the result of each instruction CAN affect the next | 22:51 |
FUZxxl | I think I still do not understand how SVP64 works. | 22:51 |
markos | with SIMD, unless the engine has specific instructions to do that -eg. horizontal adds, min/max/etc- you have to change the order from horizontal to vertical, eg. by transposing the vector matrix | 22:52 |
lkcl | FUZxxl, it's actually not-strictly-speaking a Scalable Vector system at all, it's more classified as a type of "Zero-Overhead Loop-Control" | 22:52 |
lkcl | https://bugs.libre-soc.org/show_bug.cgi?id=659#c5 | 22:52 |
markos | even if it does offer specific instructions, you cannot have the same instruction change the results | 22:52 |
lkcl | that's where i began insertion-sort | 22:52 |
lkcl | the idea came from looking at both SIMD and Cray-Style Vectors, and noticing that: | 22:53 |
lkcl | SIMD is "for i in fixed_simd_length: operation(register[j].element[i], ....)" | 22:53 |
lkcl | and Cray-style Vectors is: | 22:54 |
lkcl | "for i in 0..GLOBAL_DYNAMIC_VECTOR_LENGTH_REGISTER: operation(register[j].element[i], ....)" | 22:54 |
lkcl | i decided to separate out the "for-looping" part from "the operation" part, as a prefix-suffix 32-32 encoding. | 22:55 |
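lkcl's two loop shapes can be written out as toy Python models (illustrative only; `simd_loop` and `cray_loop` are invented names):

```python
def simd_loop(op, dst, src, simd_len=4):
    # SIMD: the iteration count is baked into the opcode (fixed width),
    # so covering a longer array needs multiple instructions plus a tail
    for i in range(simd_len):
        dst[i] = op(src[i])

def cray_loop(op, dst, src, vl):
    # Cray-style: the iteration count comes from a dynamic VL register,
    # set by the program before the vector instruction issues
    for i in range(vl):
        dst[i] = op(src[i])
```

SVP64 keeps the second shape but factors the `for i in range(vl)` part out into a 32-bit prefix, reusing an unmodified 32-bit scalar Power ISA instruction as the `op` suffix.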
FUZxxl | how is this translated into micro operations? | 22:55 |
lkcl | using the *Scalar* Power ISA as the suffix | 22:55 |
FUZxxl | that's how it's encoded | 22:55 |
FUZxxl | so say the scalar UTF-8 algorithm | 22:55 |
markos | ok, I think I found an example that can explain your question | 22:55 |
markos | from mp3, showing loop-carried dependency | 22:56 |
FUZxxl | if you have a hardware loop mechanism to basically execute the scalar algorithm in a hardware loop, the total latency would still be the same as it's all loop carried | 22:56 |
lkcl | in high-performance implementations it would be done by inserting an additional hardware block in between Decode and Multi-Issue phases | 22:56 |
markos | setvl 0,0,18,0,1,1 # Set VL to 18 elements | 22:56 |
markos | # Load 18 floats from (in) | 22:56 |
markos | sv.lfs *vin, 0(in) | 22:56 |
markos | # equivalent to: for (i = 17; i >= 1; i--) in[i] += in[i-1]; | 22:56 |
markos | sv.fadds/mrr *vin+1, *vin+1, *vin | 22:56 |
markos | check the for loop | 22:56 |
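The C loop in markos's comment can be rendered directly in Python (a plain model of the quoted loop only; whether the actual /mrr mode cascades differently is defined by the SVP64 specification, not this sketch):

```python
def fadds_mrr_model(vin):
    # Direct rendering of the quoted C loop:
    #     for (i = 17; i >= 1; i--) in[i] += in[i-1];
    # generalised to any length. Because i counts down, in[i-1] is
    # still its original value when read, so each output element is
    # original in[i] + original in[i-1], with in[0] unchanged.
    out = list(vin)
    for i in range(len(out) - 1, 0, -1):
        out[i] += out[i - 1]
    return out
```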
lkcl | nope, not if you use a Multi-Issue Out-of-Order Execution Engine as the underlying hardware | 22:56 |
FUZxxl | if each iteration is linearly dependent on the previous one, then I don't see how that would work | 22:57 |
FUZxxl | which it is in the scalar UTF-8 translation algorithm | 22:57 |
markos | the actual implementation is a hardware micro-architecture issue | 22:57 |
lkcl | then yes, what you get is, you get the situation where you benefit instead from having a multi-issue OoO Engine as the back-end | 22:58 |
lkcl | where there would be multiple in-flight operations | 22:58 |
FUZxxl | you will not be able to get it faster than the length of the critical path | 22:58 |
lkcl | correct. | 22:58 |
FUZxxl | which is ... 10 times slower than the SIMD algorithm I designed | 22:58 |
lkcl | under which circumstances, a little bit more intelligence in how the algorithm is written has to be done | 22:58 |
lkcl | i.e. you have to think in terms not so much of "SIMD" but of "SIMD elements" | 22:59 |
FUZxxl | ok, done. | 22:59 |
FUZxxl | I can envision how each individual element flows through the algorithm. | 22:59 |
lkcl | you'd have to make an implementation of the algorithm that ensures that the *elements* have less Hazard-Dependencies | 22:59 |
markos | parallelizing the algorithm does NOT require SIMD | 23:00 |
FUZxxl | which I did, but that only works by viewing a whole vector of elements at once. | 23:00 |
lkcl | this is common practice in Cray-style Vectors (since the 70s) | 23:00 |
lkcl | yes. | 23:00 |
lkcl | that's it. that's what you do. | 23:00 |
lkcl | but | 23:00 |
FUZxxl | i.e. I break the dependencies by computing where the elements go ahead of time and then just permuting. | 23:00 |
FUZxxl | But that does require an entirely different approach, contrary to what markos said. | 23:01 |
FUZxxl | Which is my whole point. | 23:01 |
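FUZxxl's dependency-breaking approach (compute where the elements go ahead of time, then permute) can be sketched like this; `scatter_by_length` is an invented helper, not code from simdutf:

```python
def scatter_by_length(chunks):
    # Instead of advancing an output cursor element by element (a
    # loop-carried dependency), first compute every element's output
    # offset with a prefix sum over the lengths, then place all
    # elements independently: a permute/scatter in vector terms.
    lengths = [len(c) for c in chunks]   # independent per element
    offsets = [0]
    for n in lengths[:-1]:               # exclusive prefix sum
        offsets.append(offsets[-1] + n)
    out = [None] * (offsets[-1] + lengths[-1]) if chunks else []
    for off, chunk in zip(offsets, chunks):
        out[off:off + len(chunk)] = chunk  # order-independent writes
    return out
```

The prefix sum is written sequentially here, but in SIMD it can be computed in logarithmically many steps, which is what removes the serial chain.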
lkcl | what you are *not* restricted to is *only* thinking in terms of vector-of-elements | 23:01 |
FUZxxl | you aren't restricted to that with SIMD either. It's plausible someone will implement SIMD extensions with individually renamed elements. | 23:01 |
lkcl | you *can* issue a vector-instruction that very very deliberately has overlaps, e.g. a Prefix-Sum (fibonacci) | 23:01 |
FUZxxl | e.g. AMD already does something like this by splitting vectors into chunks and renaming each chunk separately. | 23:01 |
lkcl | yes but they have to actually explicitly add that as instructions | 23:02 |
FUZxxl | no | 23:02 |
lkcl | oh? | 23:02 |
markos | you asked for an example of SVP64 handling loop-carried dependency, I showed you one, you complained about performance and really there is no way to tell that at the moment | 23:02 |
FUZxxl | it's an implementation detail of how they implement AVX-512 in Zen4 | 23:02 |
FUZxxl | My goal is to understand what this is all about | 23:02 |
FUZxxl | so I am proposing problems that would not work with my current understanding of the technology. | 23:02 |
markos | it's not a solution for all problems | 23:03 |
lkcl | no - we're focussing on ISA-level simulations at the moment rather than HDL (architectural) simulations | 23:03 |
FUZxxl | To gain insight into how they could be implemented to then amend my understanding. | 23:03 |
markos | but it does solve many existing problems | 23:03 |
FUZxxl | How do you model the performance without an uarch level simulation? | 23:03 |
lkcl | although the cavatools port will have timing / cycle-accurate counting | 23:03 |
lkcl | at this early phase: we don't. | 23:03 |
FUZxxl | bummer | 23:03 |
lkcl | that's what the next NLnet grant is for | 23:03 |
markos | there is no microarch at the moment, that's what I've been saying | 23:03 |
lkcl | which will take about... 5-6 months to complete | 23:04 |
FUZxxl | the way you explain this algorithm it seems like the details of the uarch design are critical to the perf characteristics of the architecture | 23:04 |
FUZxxl | and for how you need to write code to be fast. | 23:04 |
markos | uarch is always specific to the implementor | 23:04 |
markos | it can be entirely different from each vendor | 23:04 |
lkcl | *then* we have a base on which to *begin* maaking performance estimates | 23:04 |
lkcl | ah | 23:04 |
FUZxxl | or more polemically: just because it looks elegant doesn't mean it'll be fast. | 23:05 |
lkcl | one of the things about designing a general-purpose Scalable Vector ISA is: you *don't* design the ISA based on "performance" | 23:05 |
markos | eg. M1 is Arm Architecture, but NOT Arm Microarchitecture, it's Apple's own, the rest of the vendors (Samsung, Qualcomm, etc) take *some* parts of the architecture and add their own | 23:05 |
lkcl | that's a very unfortunate conflation that's come from SIMD ISAs | 23:05 |
FUZxxl | I partially agree, but partially disagree | 23:05 |
lkcl | which are near-direct exposure of the underlying SIMD hardware internal micro-architecture directly to/at the programmer | 23:05 |
FUZxxl | you want the programmer to be able to make assumptions about what the performance characteristics are *at least* | 23:06 |
FUZxxl | it can always be faster in practice | 23:06 |
lkcl | ARM is presently trying to dissuade people from doing exactly that | 23:06 |
lkcl | by designing algorithms to be "vector-length-independent" | 23:06 |
lkcl | as is RISC-V with RVV | 23:06 |
FUZxxl | e.g. something like "if I have a permutation of a vector, do the data elements have to be ready when the permutation happens or only the index elements" | 23:07 |
FUZxxl | which majorly affects how an algorithm requiring a permutation is to be designed | 23:07 |
lkcl | if successful it allows code to be fully portable from embedded systems with zero lanes (a scalar-element) right the way up to massive 1024-way SIMD hardware | 23:07 |
markos | permutation of the vector in SVP64 is just moves based on indices | 23:07 |
markos | actually not even moves | 23:07 |
FUZxxl | renames? | 23:07 |
lkcl | yyeah, permutations are the "massive spanner in the works" for both ARM SVE/2 and RISC-V RVV's "plans", there | 23:08 |
markos | you change the index numbers of the registers that are going to be used in the next iteration | 23:08 |
markos | you want reverse? sure, take a reverse sequence | 23:08 |
FUZxxl | so there is no dependency on the data, only on the indices | 23:08 |
lkcl | you can't possibly have a reliable permutation concept when you've absolutely no idea what the target microarchitectural vector length is!! | 23:08 |
FUZxxl | is that *guaranteed* or just an implementation detail? That's the sort of stuff I need to know | 23:08 |
FUZxxl | I don't really care about specific instruction timings | 23:08 |
FUZxxl | But about this sort of general stuff. | 23:08 |
markos | too early | 23:08 |
lkcl | there's a hard-guaranteed limit on the Vector Length of 127. | 23:08 |
lkcl | it is *required* that that Vector Length be supported, across *all* implementations | 23:09 |
FUZxxl | Same with programming for ARM. You are completely correct about not programming for a specific uarch, but you can make the assumptions that all uarches will be very similar in certain aspects. | 23:09 |
markos | you can never assume anything about specific performance | 23:09 |
FUZxxl | better than on rv64 :-) | 23:09 |
lkcl | even if that implementation is a ridiculously-slow Finite State Machine that's only capable of executing one element per every 10 clock cycles | 23:09 |
FUZxxl | then I will not be able to design fast algorithms for the machine | 23:10 |
FUZxxl | and it will immediately be uninteresting | 23:10 |
lkcl | or it's an insanely-fast one that can do QTY 16of elements per SIMD-back-end-ALU per multi-issue instruction | 23:10 |
markos | one instruction can take 3 cycles on a Qualcomm, and 1 cycle on a M1, and you can never know/assume that | 23:10 |
FUZxxl | sure, not really a problem (though annoying to deal with some times) | 23:11 |
lkcl | you can design *portable* and *highly-efficient* (compact) algorithms for any future machine that anyone decides to implement in future | 23:11 |
FUZxxl | what's worse is if you e.g. don't know whether a shift uses a barrel shifter or shifts one bit per cycle | 23:11 |
FUZxxl | once again, heavily affects the algorithm design | 23:11 |
markos | or one CPU can have 1 vector ALU and be slow as hell, or it can have 4x128 NEON units (graviton3) and your NEON code can really kick ass | 23:11 |
FUZxxl | or for ARM, you know that perms are happening for the whole register. Elements are not individually renamed. | 23:12 |
FUZxxl | This too affects algorithm design. | 23:12 |
markos | that's a microarchitecture detail | 23:12 |
lkcl | yyeah these kinds of things we're just going to have to live with until there's even one "decent" implementation (let alone 2 or 3) | 23:12 |
markos | you cannot impose that kind of thing to a vendor | 23:12 |
FUZxxl | then I think I'll have to come back once there is more clarity about that | 23:12 |
lkcl | like i said, we're at the software-simulator stage | 23:12 |
markos | unless you sell them the microarchitecture as well as the ISA | 23:12 |
FUZxxl | The thing is, you are proposing a revolutionary ISA design approach but don't give any details about what the implementation is going to look like | 23:13 |
FUZxxl | so it's impossible to optimise complex data flow for it | 23:13 |
lkcl | i've been designing the implementation for 3 years, now | 23:13 |
FUZxxl | e.g. the reduction code you cited above will be really slow (17 cycles minimum) if the CPU just executes all the additions in the order of the dependency chain you define there | 23:14 |
FUZxxl | but perhaps it is faster if it is special cased. | 23:14 |
lkcl | the base i have planned is a multi-issue OoO one... | 23:14 |
lkcl | ... yes | 23:14 |
markos | I think you're thinking too much around SIMD to see the benefit | 23:14 |
FUZxxl | In practice it may be 1000 elements and you may really want to know if it is | 23:14 |
FUZxxl | markos: I think in terms of an OOO CPU | 23:14 |
lkcl | right, that's actually coded into the specification as an "Architectural Note/Hint" | 23:14 |
FUZxxl | i.e. in terms of data dependencies | 23:14 |
FUZxxl | ah so you do have assumptions that can be made (?) | 23:15 |
lkcl | that implementors are perfectly permitted to spot those kinds of cases | 23:15 |
FUZxxl | but will they in practice? | 23:15 |
markos | it can also be an in-order CPU, that's again a microarchitecture detail | 23:15 |
lkcl | and replace them with micro-coded optimisations | 23:15 |
lkcl | when we've the VC funding i'm planning to instruct the team on where micro-coded optimisations will be possible, yes | 23:15 |
FUZxxl | I can write possibly beautiful code against your ISA, but until it is known what typical uarches look like, it will not be possible to say if that code is really good code | 23:16 |
lkcl | for example a mapreduce on a Vector-OR-operation is clearly easily paralleliseable | 23:16 |
lkcl | we know | 23:16 |
FUZxxl | the fadd example in particular is tricky because addition is not associative | 23:16 |
lkcl | i have been trying to drum this into one of the team, programmerjake, for some time | 23:16 |
lkcl | yes, and that's also in the specification :) | 23:16 |
FUZxxl | so not sure if the CPU will be able to get rid of the dependency chain | 23:17 |
lkcl | that fadd can never be associative on strict-ordered-linear-reduction | 23:17 |
lkcl | and the implementor is *required* to respect that. | 23:17 |
lkcl | they are not permitted, on non-associative operations, to choose an arbitrary order. the results are *required* by the specification to be 100% Deterministic | 23:18 |
lkcl | but | 23:18 |
lkcl | obviously that's not good in terms of performance :) | 23:18 |
lkcl | so there is a *parallel* reduction mode, as well | 23:18 |
lkcl | which is, again, 100% Deterministic | 23:18 |
lkcl | such that even that can be applied to non-associative base operations | 23:18 |
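The distinction between the two reduction modes can be sketched in Python. This is an illustrative model only, not the pseudocode from the SVP64 specification: the strict-ordered linear reduction follows the exact dependency chain, while the parallel mode uses a fixed pairwise tree, so the schedule is still 100% deterministic even for non-associative operations like fadd, yet independent pairs can execute concurrently.

```python
from functools import reduce
import operator

def linear_reduce(op, xs):
    # strict-ordered linear reduction: ((x0 op x1) op x2) op x3 ...
    # for non-associative ops the spec requires exactly this order
    return reduce(op, xs)

def tree_reduce(op, xs):
    # fixed pairwise tree: a deterministic schedule of log2(n) depth,
    # so the pairs at each level are independent and can run in parallel
    xs = list(xs)
    while len(xs) > 1:
        nxt = []
        for i in range(0, len(xs) - 1, 2):
            nxt.append(op(xs[i], xs[i + 1]))
        if len(xs) % 2:          # odd leftover element carries over unchanged
            nxt.append(xs[-1])
        xs = nxt
    return xs[0]

data = [0.1, 0.2, 0.3, 0.4, 0.5]
# both results are fully deterministic; for floating-point addition they
# may differ from each other (different association), but never run-to-run
lin = linear_reduce(operator.add, data)
par = tree_reduce(operator.add, data)
```

The point of the fixed tree is that the result is defined by the specification, not left to the implementor's choice of order.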
markos | keep in mind that there are multiple ways to solve a problem, and this code is for a platform that doesn't exist yet, if only to prove a point that it works, it's definitely not "optimized" | 23:19 |
FUZxxl | I believe that it works | 23:20 |
lkcl | the "optimisation" - whether and how to optimise - is down to the hardware implementors, of which there is only one, at present, in HDL, and it was an experimental CPU, done last year. | 23:20 |
markos | in fact, many instructions are added/modified as we try to port existing algorithms to SVP64 | 23:20 |
lkcl | (on the basis of which we were able to get more funding. it's a bootstrap process in other words) | 23:20 |
lkcl | the bit that *is* guaranteed - regardless of implementation - is the reduction in program size | 23:21 |
lkcl | (this is what i have been trying to drum into programmerjake: he has recently also been side-tracked into focussing on "performance") | 23:21 |
markos | imho, that's the best part, we see what's needed for specific algorithms and lkcl will try to find a way to make such an instruction work, if it's possible :) | 23:21 |
markos | personally I'm not at all worried about performance at this moment | 23:22 |
lkcl | the reduced program size (between 50% and 90%) results clearly in reduced I-Cache hits, reduced TLB lookups, and so on. | 23:22 |
FUZxxl | not really the key factor here | 23:23 |
lkcl | and that's clearly right across the board *regardless* of the data-path side, number of ALUs, and so on | 23:23 |
FUZxxl | SVP64 doesn't really improve standard business logic which is the big icache hog | 23:23 |
FUZxxl | a hot loop usually fits into the µop cache, even with unrolled SIMD | 23:23 |
lkcl | if the I-side is sitting idle literally doing nothing it's saving power | 23:23 |
markos | I'm more interested in getting eg. the butterfly instructions used in DCT in the ISA, so that we can have a full DCT implementation in just a dozen instructions - compared to the thousands right now | 23:24 |
lkcl | we've found that the hot-loops fit into *a single line* of an I-Cache, *without* loop-unrolling! | 23:24 |
FUZxxl | My key argument: I tend not to believe claims that something will eventually perform well unless the way it's going to be implemented is clearly laid out and will obviously work | 23:24 |
markos | stuff like that is more important at this stage | 23:24 |
* lkcl looking for an example | 23:24 | |
markos | defining the ISA implies nothing about the underlying architecture | 23:24 |
markos | sorry, micro-architecture | 23:25 |
FUZxxl | partially | 23:25 |
FUZxxl | ISA design choices heavily affect what the uarch has to take into account | 23:25 |
markos | no, it says nothing about X instruction requiring 2 cycles or 10 | 23:25 |
markos | that's uarch-specific | 23:25 |
lkcl | in a *SIMD* ISA, yes. | 23:25 |
lkcl | in a *Scalable Vector* ISA, very deliberately no. | 23:26 |
FUZxxl | again: I don't really care about these details, but more about "what is renamed in dependence to what? What are the hidden data dependencies? What operations are renames? What are ALU ops?" | 23:26 |
FUZxxl | e.g. if you have some sort of selection instruction, is it an ALU op or a data-dependent rename? | 23:26 |
FUZxxl | This severely affects algorithm design | 23:26 |
lkcl | https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_ldst.py;h=647e7935d1aa9a77713c3a7af4838e16b5e4c028;hb=36ab18ddb338b2b4be00bdbc03d9efa905a2c3f9#l32 | 23:26 |
FUZxxl | regardless of what the cycles counts are going to be | 23:26 |
markos | do you need to know such info for example on Intel or Arm? | 23:27 |
markos | when writing SIMD code? | 23:27 |
FUZxxl | yes | 23:27 |
FUZxxl | I mean if I don't care about perf I can ignore it | 23:27 |
markos | but such info always or almost always changes between revisions of the cpu | 23:27 |
FUZxxl | not really | 23:28 |
markos | well, you mentioned scatter/gather before | 23:28 |
FUZxxl | the important bits always stay similar | 23:28 |
markos | before icelake it was crap | 23:28 |
FUZxxl | correct | 23:28 |
markos | even now it's not ideal | 23:28 |
FUZxxl | it's pretty much about as good as it gets with the number of load/store ports they have | 23:28 |
markos | in fact I have yet to see more than a few % increase vs not using gather scatter | 23:28 |
markos | and ALWAYS it depends on the actual cpu model | 23:28 |
lkcl | ok so scatter-gather is possible in SVP64 by using Predicate Mask bits, one on source, one on destination | 23:29 |
markos | ie, is it a i9 or a Xeon? and what model? | 23:29 |
FUZxxl | they all have the same number of load/store ports in one uarch generation | 23:29 |
markos | because Xeon obviously can scale better in memory loads | 23:29 |
lkcl | if you use "twin" predication - you get the gather-scatter *back-to-back*... in the same instruction | 23:29 |
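Twin predication as described here can be modelled in a few lines of Python. This is an illustrative sketch, not the spec's pseudocode: the n-th active (unmasked) source element is paired with the n-th active destination slot, so a compress (gather) and an expand (scatter) happen back-to-back in a single predicated move.

```python
def twin_pred_mv(src, src_mask, dst, dst_mask):
    """Model of an SVP64 twin-predicated move: read elements from src
    where src_mask is set (gather/compress) and write them to dst
    where dst_mask is set (scatter/expand), in one instruction.
    Illustrative model only - not the exact specification pseudocode."""
    dst = list(dst)
    srcstep = 0
    for dststep in range(len(dst)):
        if not dst_mask[dststep]:
            continue                     # skip masked-out destination slots
        while srcstep < len(src) and not src_mask[srcstep]:
            srcstep += 1                 # skip masked-out source elements
        if srcstep >= len(src):
            break                        # ran out of active source elements
        dst[dststep] = src[srcstep]
        srcstep += 1
    return dst
```

How many micro-ops this expands into, and in what order they issue, is exactly the part left to the hardware implementor.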
FUZxxl | no it can't | 23:29 |
lkcl | but | 23:29 |
lkcl | this is "high-level" | 23:29 |
lkcl | *how* the micro-ops are issued, how *many* are issued, how they are satisfied: this is down to the *hardware* implementor to implement | 23:30 |
markos | the thing is that with SVP64 you don't have to do anything special with scatter/gather, there is no special instruction needed | 23:30 |
lkcl | and then optimise | 23:30 |
FUZxxl | and this is critical to know | 23:30 |
FUZxxl | so? | 23:30 |
FUZxxl | does it matter if a special instruction is needed? | 23:30 |
FUZxxl | Using scatter/gather or not is a choice of data structure and algorithm design | 23:30 |
lkcl | and it will be down to the hardware implementor to either declare what they have done, or for programmers to find out, once hardware is available | 23:30 |
FUZxxl | not of whether there are special instructions for it or not | 23:30 |
lkcl | (likely in about 3-4 years time) | 23:30 |
FUZxxl | looking forwards to it! | 23:31 |
FUZxxl | I think it'll be a whole lot more interesting for me once there is silicon available | 23:31 |
lkcl | we'll be able to do FPGA simulations a lot sooner than that. | 23:31 |
FUZxxl | that will be useful | 23:31 |
FUZxxl | the tyranny of CPU design is, | 23:31 |
FUZxxl | if you have the most beautiful programming model | 23:32 |
FUZxxl | but it's faster to just hack it and write ugly-ass code, compilers will do exactly that | 23:32 |
FUZxxl | history has repeatedly proven that | 23:32 |
lkcl | yes, i remember the first SPARC compilers, simply padding nops in between all operations | 23:32 |
lkcl | because the early hardware had absolutely zero hazard-dependency-checking | 23:33 |
FUZxxl | so before it is proven that the SVP64 coding model you envision will actually be faster than, say, just treating it as a SIMD arch | 23:33 |
* lkcl face-palm | 23:33 | |
FUZxxl | I am going to be wary | 23:33 |
lkcl | well, in this case it's reasonable to ride off the back of the assumption of a half-decent (advanced) multi-issue out-of-order execution back-end | 23:33 |
markos | it doesn't matter if it's faster in absolute numbers, it probably won't in the beginning anyway | 23:33 |
lkcl | those exist. | 23:33 |
markos | but if you can run a DCT algorithm in 20 instructions vs 1000 on a 1GHz 5W SoC, it definitely matters | 23:34 |
markos | for example | 23:34 |
markos | it won't beat the intel/amd/Arm cpus in 5y time | 23:35 |
lkcl | if they exist now, and the "looping" system can be "inserted" - as i said - between Decode and Issue - then it's also reasonable to expect parity between "the Multi-Issue System's Performance Before SVP64" and "the Multi-Issue System's Performance After Adding SVP64" | 23:35 |
lkcl | (it won't be precisely the same because it's one extra pipeline stage, at least) | 23:35 |
FUZxxl | markos: I'm not sure if it'll be as effective as you think | 23:36 |
lkcl | (which in turn has a knock-on effect on the ratio of in-flight operations vs the completion latency... blah blah) | 23:36 |
FUZxxl | But I think time will tell | 23:37 |
FUZxxl | I think there's a good chance it'll be amazing and I'm going to love it. | 23:37 |
markos | we'll cross that bridge when we get there, nothing more can be said right now | 23:37 |
FUZxxl | I agree | 23:37 |
lkcl | there's some obvious things on DCT (i'm going to do 2D DCT next, it'll take about 2 months, same number of instructions) | 23:38 |
lkcl | a "normal" (SIMD) way of doing DCT (let alone 2D DCT) requires either pushing the data back to memory and loading it back again | 23:38 |
*** jab <jab!~jab@user/jab> has joined #libre-soc | 23:38 | |
lkcl | or it requires copying of the vector of data from register-batch to register-batch | 23:38 |
lkcl | the SVP64 REMAP system *leaves the entire DCT in-registers in-place* | 23:39 |
markos | yes, 2 passes, transposing the whole matrix between them and after | 23:39 |
lkcl | you have the vector of registers for the coefficients | 23:39 |
lkcl | you have the vector of registers containing the data | 23:39 |
FUZxxl | Imagine a SIMD arch with as large a register file and as long a vector length as SVP64 | 23:40 |
lkcl | apart from one additional scalar register containing the DCT size, there *are* no other registers needed! | 23:40 |
FUZxxl | would the benefits persist? | 23:40 |
lkcl | yes! | 23:40 |
lkcl | yes they would | 23:40 |
FUZxxl | ok | 23:40 |
lkcl | because the SIMD arch would firstly have the entire triple-loop as explicit instructions | 23:40 |
lkcl | secondly, they'd go "argh we don't have a bit-reversed Vector-LD instruction, we have to compute the indices for the permutation" | 23:41 |
FUZxxl | the SIMD arch could have that instruction | 23:41 |
FUZxxl | or it could have a general mechanism to do simple permutations on its operands | 23:42 |
FUZxxl | e.g. linear shifts or reversals | 23:42 |
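The permutation being argued about is easy to state concretely. A minimal Python sketch of the bit-reversed index computation that an FFT/DCT input reordering needs — i.e. what a "bit-reversed Vector-LD" would compute in hardware, and what a SIMD ISA without one must compute explicitly:

```python
def bit_reversed_indices(n):
    # permutation indices for reordering a size-n (power-of-two)
    # FFT/DCT input: index i maps to i with its bits reversed
    bits = n.bit_length() - 1
    return [int(format(i, f'0{bits}b')[::-1], 2) for i in range(n)]
```

For n=8 this yields the familiar 0,4,2,6,1,5,3,7 ordering.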
lkcl | thirdly (unless it's ARM, with that funny twin-butterfly instruction) they'd go "argh, this data is in reverse-order, it's going to overwrite and destroy the source which we need later, we have to use double the number of registers" | 23:42 |
lkcl | as a hardware-computed schedule, covering the entire triple-loop? | 23:42 |
FUZxxl | I don't know | 23:43 |
FUZxxl | I don't understand SVP64 well enough to tell | 23:43 |
lkcl | TI's MSP32* series and Qualcomm's Hexagon DSP only have ZOLC for the inner loop of FFT/DCT | 23:43 |
markos | libvpx: Arm NEON: 4k lines of fdct code for 4x4, 8x8, 16x16, 32x32 and that's excluding the transposing functions, x86: 4-8k lines spread across multiple files, multiple implementations, also excluding transposing functions | 23:44 |
lkcl | i haven't been able to find anything in any other ISA like this | 23:44 |
FUZxxl | it could exist | 23:44 |
FUZxxl | just like SVP64 could exist | 23:44 |
lkcl | not in the major ISAs (which, let's be realistic, they're the ones that "matter") | 23:44 |
FUZxxl | make a steelman: imagine how a SIMD architecture could reasonably be extended to solve these issues | 23:44 |
FUZxxl | don't make the mistake of saying "we will add lots of fun stuff but SIMD will stay exactly the same way it is" | 23:45 |
lkcl | i was surprised ARM had that double-butterfly mul-add-shift as a single instruction | 23:45 |
lkcl | SIMD is SIMD (at the back-end) | 23:45 |
FUZxxl | sure. And the double-butterfly mul-add-shift stuff is possible. | 23:45 |
FUZxxl | As is what AVX-512 does. | 23:45 |
FUZxxl | As is probably a whole lot more. | 23:45 |
markos | FUZxxl, the solution followed by intel/arm/etc is to just add more instructions | 23:45 |
lkcl | you can't escape the fact that there's really only one way to build a SIMD ALU: put down a batch of parallel gates. | 23:45 |
FUZxxl | no, you can also shove the register through in batches, as e.g. Zen4 does | 23:46 |
FUZxxl | you could even do each element individually | 23:46 |
FUZxxl | and while e.g. SIMD shuffles are ALU ops right now, they might well be renames in future iterations of the design | 23:46 |
lkcl | yes - that's taking the "Cray Vector Chaining" concept from the 1970s (first implemented in Cray-I) and applying it to modern architectures | 23:47 |
FUZxxl | correct -- it's all uarch | 23:47 |
FUZxxl | so why should SIMD not benefit from better uarch design while you are certain the SVP64 will? | 23:47 |
FUZxxl | *that | 23:47 |
lkcl | because once - in your mind - you have separated the "SIMD-as-a-direct-front-end-ISA" from "SIMD-at-the-back-end-in-hardware" | 23:48 |
lkcl | then realised that Scalable Vector ISAs *also* require (pretty much) identical back-end Engines *as* those processors which have a SIMD-front-end-ISA | 23:48 |
lkcl | and accepted that, comparing apples-with-apples you make the *back-ends* the same | 23:49 |
FUZxxl | ok | 23:49 |
lkcl | then you can focus on comparing the *front-ends* | 23:49 |
FUZxxl | but then (to loop back to the previous point), the backend will not solve the loop-carried-dependency issues | 23:49 |
markos | imho, SIMD is not future proof, right now Intel+Arm+Power: 17k total C intrinsics, slightly fewer asm instructions, Intel + Arm take the majority, VSX by comparison is just ~200 intrinsics | 23:49 |
markos | you can't just keep adding more and more instructions | 23:50 |
lkcl | within about.. how long markos? about eight weeks? we've found that people go "holy s*** this is so much easier *and* drastically reduces instruction count" | 23:50 |
FUZxxl | If you look at AVX-512 and NEON, it doesn't actually have all that many truly distinct instructions | 23:50 |
markos | both AVX512 and SVE -to a lesser extent- *have* to lower clock frequency when AVX512/SVE* kick in | 23:50 |
FUZxxl | it's mostly the same instruction counted a bunch of times for different data types and vector sizes | 23:50 |
markos | which lowers performance | 23:50 |
lkcl | it was 1/2 hour ago (or so) and i wasn't quite following: what's "loop-carried-dependencies"? you mean like prefix-sum and reduction? | 23:51 |
FUZxxl | Ever since Icelake they no longer do it | 23:51 |
markos | most of the times the instructions are similar -not 1-1 mapping ofc | 23:51 |
FUZxxl | lkcl: e.g. I was talking about the UTF-8 code where, to find the start of the next character, you have to finish analysing the previous one (in the scalar algorithm) | 23:51 |
FUZxxl | So even an out of order uarch will not be able to process more than one character at a time | 23:51 |
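The loop-carried dependency FUZxxl describes is concrete in a scalar UTF-8 scan: the position of character k+1 is only known after decoding the length of character k, so the loop iterations form a serial chain no OoO core can break on its own. A minimal sketch (assumes well-formed input starting on a character boundary; continuation-byte validation omitted):

```python
def utf8_char_starts(data):
    """Scalar UTF-8 scan: each iteration's start index depends on the
    previous character's decoded length - a loop-carried dependency."""
    starts = []
    i = 0
    while i < len(data):
        starts.append(i)
        b = data[i]
        if b < 0x80:
            i += 1          # ASCII: 1-byte character
        elif b < 0xE0:
            i += 2          # leading byte 0xC0-0xDF: 2-byte sequence
        elif b < 0xF0:
            i += 3          # leading byte 0xE0-0xEF: 3-byte sequence
        else:
            i += 4          # leading byte 0xF0-0xF7: 4-byte sequence
    return starts
```

Each `i` feeds the next iteration, which is exactly why the parallel SIMD versions of this algorithm are so much more complex than the scalar one.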
markos | and intel has some pretty specific instructions, eg. movemask set of instructions | 23:51 |
lkcl | ah yes. then you have two choices. | 23:51 |
FUZxxl | markos: these are obsolete with AVX-512 | 23:52 |
lkcl | but you can at least chuck the instructions into in-flight buffers | 23:52 |
markos | which do not exist on Arm, so you have to emulate them | 23:52 |
FUZxxl | as AVX-512 adds masks to all loads and stores | 23:52 |
markos | not obsolete, just done differently | 23:52 |
markos | yes I know | 23:52 |
lkcl | and you can at least benefit from that (whilst still kicking yourself in frustration that the scalar-chained-hazards exist) | 23:52 |
lkcl | or | 23:52 |
FUZxxl | lkcl: with e.g. an 8 wide frontend on the Apple M1, you can chuck them quickly enough even in the conventional paradigm | 23:52 |
lkcl | like you mentioned | 23:52 |
markos | surprisingly VSX does offer a very similar, actually more powerful, instruction that you can use to emulate movemask in a few instructions | 23:52 |
lkcl | yehyeh, i'm counting on that :) | 23:53 |
FUZxxl | So I don't quite see the benefit still | 23:53 |
FUZxxl | (I usually do uarch simulations on my code and the frontend is rarely the bottleneck) | 23:53 |
lkcl | you can parallelise the algorithm and treat the Vector Elements partially as-if they were SIMD | 23:53 |
lkcl | i.e. you *can* set the Vector Length to an explicit width | 23:53 |
FUZxxl | exactly -- which requires a much more complex algorithm like the one I designed | 23:53 |
lkcl | and effectively "emulate" SIMD. | 23:53 |
lkcl | but here with SVP64 you have the choice | 23:54 |
FUZxxl | And it's unclear if it can be expressed in SVP64 | 23:54 |
lkcl | (and a third option) | 23:54 |
lkcl | the fact that we can "emulate" (not quite the right word) SIMD means "yes you can" | 23:54 |
markos | some inspiration, tricks can be emulated, but I think the whole point is that you don't really need a specific algorithm to write SVP64 code | 23:54 |
FUZxxl | So to summarise: for my combinatorial problems, SVP64 will not make the code shorter or easier or better most likely. | 23:54 |
lkcl | i spent 4 years taking every concept from every SIMD and Scalable Vector ISA i could get my hands on and including them | 23:54 |
markos | you don't know that until you try it | 23:55 |
lkcl | ah, i haven't gotten onto "option 3" - Vertical_First, yet :) | 23:55 |
FUZxxl | I am all ears. | 23:55 |
markos | quite possibly it's going to be shorter, no way to tell if it's going to be faster though | 23:55 |
lkcl | i had to do it as a video | 23:55 |
lkcl | https://youtu.be/fn2KJvWyBKg | 23:55 |
lkcl | but there's a diagram that explains it quicker | 23:55 |
lkcl | 1 sec | 23:55 |
FUZxxl | markos: how is it going to be shorter? The SIMD code is not unrolled and each instruction does some elementary step that needs to be performed. | 23:56 |
markos | "possibly" | 23:56 |
markos | as I said, no way to know that until someone tries | 23:56 |
lkcl | https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/sv_horizontal_vs_vertical.svg;hb=HEAD | 23:57 |
markos | because so far it has been shorter than all SIMD implementations of functions we ported | 23:57 |
markos | (the SVP64 version that is) | 23:57 |
FUZxxl | There's nothing in the code that really needs a fixed vector length, it would most likely be directly translatable to the vector paradigm if it has all the relevant instructions. | 23:57 |
lkcl | Horizontal-First is how both Cray-style Vectors *and* SIMD ISAs do "elements" | 23:57 |
FUZxxl | However, I doubt it would be faster or shorter. | 23:57 |
lkcl | they progress through *elements* first before moving onto instruction *second* | 23:57 |
lkcl | Vertical-First is the *other way round* | 23:57 |
lkcl | the hardware maintains an element-offset-pointer | 23:58 |
lkcl | which *remains* at its current index until *explicitly* incremented | 23:58 |
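The two loop orderings lkcl contrasts can be sketched directly: Horizontal-First runs each instruction across all VL elements before the next instruction; Vertical-First holds an element-offset pointer fixed, runs every instruction on that one element, then explicitly increments the pointer (the explicit step here stands in for SVP64's svstep; this is a conceptual model, not the spec's pseudocode).

```python
def horizontal_first(ops, regs, vl):
    # Cray-style / SIMD ordering: each instruction processes all VL
    # elements before execution moves on to the next instruction
    for op in ops:
        for el in range(vl):
            op(regs, el)

def vertical_first(ops, regs, vl):
    # SVP64 Vertical-First: the element-offset pointer stays at its
    # current index while every instruction runs on that one element;
    # only then is the pointer explicitly incremented
    for el in range(vl):            # explicit pointer increment per pass
        for op in ops:
            op(regs, el)
```

For purely element-wise operations both orderings produce the same result; the difference matters when instructions carry state between elements.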
FUZxxl | markos: the functions you've showed me were either trivially parallelisable or have very simple data-dependencies | 23:58 |
FUZxxl | And I believe you in that they can be expressed much easier | 23:58 |
FUZxxl | lkcl: ok | 23:58 |
lkcl | FUZxxl, the reason for that is exactly the benefit | 23:58 |
lkcl | we had to go back to the scalar implementations because the SIMD implementations were so obtuse, so heavily optimised, they're unreadable | 23:59 |
FUZxxl | I'm 100% certain future SVP64 kernels will be too | 23:59 |
markos | mp3 is anything but trivially parallelizable but ok | 23:59 |
markos | I showed you one line | 23:59 |