Saturday, 2022-10-22

*** kanzure <kanzure!~kanzure@user/kanzure> has quit IRC06:30
*** kanzure <kanzure!~kanzure@user/kanzure> has joined #libre-soc06:30
*** awygle <awygle!~quassel@2604:a880:2:d0::5380:3001> has quit IRC06:53
*** sauce <sauce!> has quit IRC06:53
*** awygle <awygle!~quassel@2604:a880:2:d0::5380:3001> has joined #libre-soc06:54
*** sauce <sauce!> has joined #libre-soc06:56
lkclghostmansd[m], so what's the primary focus of each of the two current bugreports?09:47
lkcland, just as importantly, "what fits into EUR 8,000"?09:48
lkclfor example: if i split into EUR 4000 for 947 and EUR 4000 for the other-top-level-one-for-binutils, is that ok?09:49
lkcli take it is the top-level?09:51
ghostmansd[m]This is basically doing the same stuff in two projects, TBH. Where they can re-use each other's code (e.g. where we can share some code between them), this is great. Shortly, it's "building asm atop of insndb like we did with disasm, dropping custom tweaks". More details below.09:52
ghostmansd[m]binutils: infrastructure is more or less ready. We mostly need to refactor the and/or/shift bit-twiddling into fields, also establish per-mode handlers (you remember pattern matching in disassembly; I want to do something similar for asm, based on strings representing modes, e.g. /mr should toggle mapreduce and its submodes), etc.09:53
ghostmansd[m]After this is done, we need to introduce tests for SVP64. We already have tests for custom word instructions, but nothing for SVP64.09:53
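The per-mode handler idea ghostmansd describes (mode strings like /mr toggling mapreduce) could be sketched roughly like this. This is a hypothetical Python outline with invented names (handle_mapreduce, MODE_HANDLERS, the "sat" mode), not the actual binutils or openpower-isa code:

```python
# Hypothetical sketch of per-mode handlers for SVP64 assembly mode suffixes.
# Mode names and handler shapes here are invented for illustration.

def handle_mapreduce(state):
    state["mapreduce"] = True

def handle_saturate(state):
    state["saturate"] = True

MODE_HANDLERS = {
    "mr": handle_mapreduce,
    "sat": handle_saturate,
}

def parse_modes(mnemonic):
    # "sv.add/mr/sat" -> base mnemonic plus a state dict built by handlers
    base, *modes = mnemonic.split("/")
    state = {}
    for mode in modes:
        MODE_HANDLERS[mode](state)  # an unknown mode would be an asm error
    return base, state
```

The point of the table-of-handlers shape is that each mode string dispatches to one small function, instead of one big if/else chain per instruction.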
lkclremember the EUR 8000 is "interim", based on the fact that the cavatools budget has been approved09:56
lkclso if there is more to be done, that would take longer, then it can be added to *another* project09:57
ghostmansd[m]openpower-isa: refactor pysvp64asm in the same way (actually this is the one I do now, since it's a reference and we have tests already). Also, as part of this, I'm supporting operand types, and support more advanced assembly for these (r, f, cr prefixes, non-zero operands without hardcoding it for each and every instruction with _custom_insn, CR complex descriptions, etc.).09:57
lkclbut please remember that if it has not yet been approved and the work has already been done *you can't get paid*09:57
lkclso for example09:57
lkclyou've stated here:09:57
lkcl"all tasks have been completed"09:57
lkclit was attached to an *UNAPPROVED* grant (OPF ISA WG)09:58
ghostmansd[m]Once both are done, I'd like to have some test infrastructure common for binutils and openpower-isa.09:58
lkcltherefore you *cannot* get paid for it under that grant09:58
lkcltherefore i have instead moved it to cavatools09:58
ghostmansd[m]> it was attached to an *UNAPPROVED* grant (OPF ISA WG)09:58
ghostmansd[m]Yeah that's why I asked about cavatools.09:58
ghostmansd[m]Because logically these changes are inevitable: if we want to support binutils, we must have all the instructions there that we have in the reference implementation.09:59
ghostmansd[m]And also markos needed those, that's how it started. :-)09:59
ghostmansd[m]Not all of them but I also did what was nearby.09:59
ghostmansd[m]This task is somewhat outside10:00
lkcldon't add any grev* bitmanip ones10:00
ghostmansd[m]Didn't intend10:00
ghostmansd[m]I remember their fate is not yet stated.10:00
lkclprogrammerjake needs to remove them and complete grevlut*10:00
ghostmansd[m]We even have some task called something like "revisit grev stuff once we decide on it"10:01
ghostmansd[m]Don't remember the number, though10:01
ghostmansd[m]So, yeah:10:03
ghostmansd[m]1. Refactor binutils assembly.10:03
ghostmansd[m]2. Prepare binutils tests.10:03
ghostmansd[m]3. Refactor openpower-isa assembly.10:03
ghostmansd[m]4. Handle what's missing/controversial between these.10:03
ghostmansd[m]5. If time permits, come up with some tests infra which cross-checks these.10:03
ghostmansd[m]One might argue these tasks overlap a lot, and this is actually true; the long meta-task is "support SVP64 asm, extend it and test it". I'm splitting it for convenience so that you can decide what goes where.10:05
lkclsounds great. what fits into EUR 8,000 and what doesn't?10:05
lkclat present i'm simply putting both #958 and #949 under cavatools @ EUR 8,000.10:05
ghostmansd[m]I'm not sure, that's why I asked. Refactoring assembly is a big task per se.10:06
ghostmansd[m]I think the code fits well.10:06
lkclif however you can do "some" of it10:06
lkclthen continue the rest under another Grant10:06
lkclthat would work10:06
ghostmansd[m]Tests, both for binutils and cross-check, will likely have to be done under another grant.10:06
lkclthat works10:06
lkclboring, but it works10:07
markoslkcl, speaking of binutils, I think that since it's more or less working fine, we should use it more for more complicated (i.e. non-unit-test) work. After all, people are going to use binutils and not some reference assembler for real code. Saying that because, for me at least, it's much harder reading complicated asm code embedded into python as a unit test (e.g. chacha); I'd much rather have the assembler separate from the python part in10:07
markosits own file10:07
lkclmost likely under OPF ISA WG one
ghostmansd[m]Cross-check perhaps will be dropped for now, I suspect. It's big, and we could check each implementation on its own.10:07
lkclmarkos, what you've created is great, but it doesn't "run from python test"10:08
ghostmansd[m]binutils tests are boring and difficult, but we have to do them if we want the code to be committed.10:08
markosthat's not what I'm suggesting10:08
lkcland, "run from python test" is the canonical - easy - method of ensuring that everything is working10:08
markosI suggest not embedding complicated -ie more than a handful of instructions- within python10:09
lkclthat chacha20 inner-core constitutes a unit test, not a demo10:09
lkclyes, i'm not going beyond that.  17 is quite enough10:09
markoswhat I'm suggesting is splitting the asm externally as a .s file and loading it as from python for the unit test, but *also* making it available for testing in binutils as well10:10
lkclutf8-validation which programmerjake wrote is borderline.  it's extremely comprehensive (the testing) and should really have been done in the style you developed10:10
lkclbut then it would not be possible to run as a python-based unit test10:10
markosof course it would, just load it as a string from file10:11
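The approach markos suggests — keep the asm in a standalone .s file and load it as a string for the python unit test — could look like this minimal sketch. The helper name, the filename convention, and the "#" comment syntax are assumptions for illustration:

```python
from pathlib import Path

def load_asm(path):
    # Read a standalone .s file and return its instruction lines, so the
    # same file can be assembled by binutils *and* fed to the python-based
    # unit tests as a list of strings.
    lines = Path(path).read_text().splitlines()
    # drop blank lines and full-line comments ("#" comments assumed here)
    return [ln.strip() for ln in lines
            if ln.strip() and not ln.strip().startswith("#")]
```

The trade-off discussed below still applies: a file loaded this way is a fixed listing, so python-side parameter substitution becomes harder.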
lkclthen it is no longer possible under the unit tests to perform python-based parameterised substitution for the purposes of performing comprehensive unit testing10:11
lkclor it is, but is extremely inconvenient10:12
lkclfor example:10:12
lkcli perform substitution in many unit tests of VL as a python-based parameter10:12
lkcl*re-running* the entire unit test multiple times10:12
markosyes, but these are not the tests I'm talking about10:12
lkclwith completely different assembler, in effect10:12
lkcli know10:12
markosnot saying anything should be changed now, just that we should consider putting big asm chunks externally10:13
lkclwhen there's a more "stand-alone" style appropriate (like the one you are developing) then yes absolutely i agree 100% the separate .s is good10:13
lkclthe 17 instructions in chacha20 turned out to be necessary, even if they're a bit much10:14
lkclbut i needed to test svindex - it's just that it needed to be called 4 times10:15
lkclwhich is the limit (there are only 4 SVSHAPEs)10:15
lkclthis is not possible to do in a ".s" file:10:16
lkcl            'addi 16, 0, %d' % nrounds,     # set number of rounds10:16
lkclstrictly speaking the "32" on setvl should also be parameterised10:16
lkclso that a much shorter test run can be made10:17
lkcl(and also experiment and see how number of rounds affects output)10:17
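lkcl's point about python-based parameterised substitution — re-running the same unit test with "completely different assembler, in effect" — is essentially this pattern. The `addi` line is the one quoted from the chacha20 test above; the function and loop around it are an illustrative sketch, not the actual test code:

```python
def make_listing(nrounds):
    # Parameters are substituted into the strings *before* assembly, so
    # each call produces a different program; this is what a fixed .s
    # file cannot do without external preprocessing.
    return [
        "addi 16, 0, %d" % nrounds,  # set number of rounds (from the test)
    ]

# re-run the "same" test with a different listing each time
for nrounds in (1, 2, 20):
    listing = make_listing(nrounds)
    # ...assemble and simulate `listing` here...
```

The same trick is what lets VL (the `setvl` operand) be parameterised too, for much shorter test runs.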
lkcli keep getting tempted to try to do 2D svindex REMAPs10:18
lkclthe "normal" design of chacha20 is to SIMD-i-fy every 4th quarter-round in groups10:19
lkclperforming lane-based xor/add/rotl3210:19
markosif you do that and add butterfly-and-family int instructions 16/32-bit, we can do pretty much all complex video codec loops in 1/10th of the original code :)10:20
lkclyyeahh that one needs to go on the OPF ISA WG grant10:21
ghostmansd[m]lkcl, are you fine if I re-split the budget between 947 and 958, say, 5000 and 3000? Because 947 needs way more work, and 958 is simpler.10:21
ghostmansd[m]Maybe even 5500 and 2500, perhaps.10:22
markosghostmansd[m], are you btw familiar with LLVM codebase as well?10:22
lkclghostmansd[m], yes sure, feel free to edit it yourself.  it's not "approved" yet so is fine to edit10:22
lkclmarkos, ohh, 2D DCT schedule as well10:23
markoslkcl, yup10:23
ghostmansd[m]Ok, thanks!10:23
lkclmarkos, what was the name of that fantastically-odd instruction?10:24
lkclthe one with mul-and-add-and-a-shift-thrown-in-for-good-measure10:24
markos for the asm instruction10:25
ghostmansd[m]> it's not "approved" yet so is fine to edit10:26
ghostmansd[m]Speaking of that, do you have any estimates on it? Some bits of 947 (948) are almost done, and 958 is reaching the point where it's fully done (I need to check instructions outside of av.mdwn).10:26
markoshowever that works for the one coeff only10:26
markostried using it twice for the 2-coeffs butterfly but unfortunately it doesn't work that well10:27
lkclghostmansd[m], no - see the process i put on one of the bugreports.10:27
markosah but there is SQDMLAL10:27
lkcla) decide the tasks10:28
markosyup, that's the 2-coeff variant10:28
lkclb) review the tasks10:28
lkclc) get them approved by NLnet10:28
lkcld) get a signed MoU back10:28
ghostmansd[m]Ah OK, good10:28
ghostmansd[m]Do we have tasks which are approved?10:29
ghostmansd[m]I'll complete these to some degree then until MoU is signed.10:29
ghostmansd[m](Any risk it's not signed?)10:29
lkclno risk whatsoever. it's already approved by the EU10:30
lkclhere's the list of all milestones:
lkclso, anything that can go under this10:31
lkclcan be justified10:31
lkcle.g.... if you for example.... ooo i dunno...10:32
lkcladded the two new shift-instructions10:32
lkcland sv.maddedu and sv.divmod2du10:32
lkclthose are for bigint which is for cryptographic purposes which can therefore be justified under that grant10:33
ghostmansd[m]Ok, no risk, but still no option to send any RFP under cavatools. :-) So I'll have to switch to something else for a while until that one is approved.10:33
ghostmansd[m]Hm, maddedu... IIRC this one already exists as word instruction.10:34
ghostmansd[m]Not in binutils though.10:34
lkclcorrect, no risk. correct, no RFP yet.  estimated... mmm... 2-3 weeks as i need to write the *rest* of the cavatools tasks, sigh.10:34
ghostmansd[m]Aha, OK, sufficient to do something else :-)10:35
ghostmansd[m]OK I'll add maddedu and divmod2du. Do we have task for these?10:35
lkcljust looking...10:35
lkclprrooobbabblllyy.... under this one
lkclor would be better, it's larger.10:36
ghostmansd[m]Ok, cool, thanks!10:37
ghostmansd[m]I will clean up what I have so far for assembly, and then proceed with these.10:37
ghostmansd[m]lkcl, I noticed we don't have pcdec too; should I add it? Is it sufficiently stable?13:22
ghostmansd[m]Sorry, I meant pcdec., apparently it lacks non-Rc version?13:23
lkcllike "stcix.", it's Rc=1-only, yes13:29
ghostmansd[m]Sorry, kinda out of context. Do you mean these should be added too?13:39
lkclthe shift instructions will likely become 3-in 1-out so don't add them to the list yet13:39
lkclbut yes13:39
*** toshywoshy <toshywoshy!> has quit IRC13:42
*** openpowerbot <openpowerbot!> has quit IRC13:42
*** ghostmansd <ghostmansd!> has joined #libre-soc17:01
ghostmansdlkcl, I've raised the task about maddedu and divmod2du:
lkclghostmansd, ack17:02
ghostmansdAnd, actually, already done it :-)17:03
lkcldoh :)17:03
ghostmansdActually I did so many of these that I already do it semi-automatically17:04
lkclah - divmod2du has to move btw17:04
ghostmansdI got lucky, since there were instructions with the same form17:04
ghostmansdWhat do you mean?17:04
lkclto XO=5817:04
ghostmansdAh OK, ping me when you move it17:05
ghostmansdOr should I?17:05
lkclam just in the middle of sorting that out (to accommodate dsld and dsrd)17:05
ghostmansdI'll do it, OK17:05
ghostmansdSure, NP17:05
ghostmansd1 sec, this will need updates both in openpower-isa and binutils17:05
lkcli'm sorting out openpower-isa17:06
lkclcommits outstanding, please don't make modifications to minor_4.csv or svfixedarith.mdwn17:06
ghostmansdAh OK17:07
ghostmansdI'll handle binutils then17:07
ghostmansdping me when I need to rebase17:07
ghostmansdBecause I might need to regenerate parts of binutils17:07
ghostmansdSome stuff is fixed (e.g. XO), but the tables...17:08
lkclghostmansd, done17:15
lkclwell that's frickin fricked17:22
lkcldsld as a 4-op can only be EXTRA217:22
lkcland for dsld to be able to do overlapping pickup of a pair of consecutive 64-bit registers it has to be EXTRA317:22
ghostmansd1 sec, need to check it and update the tests based on what pysvp64asm produces17:22
ghostmansd[m]Feel free to assign the budget and split as you like17:34
ghostmansd[m]Since you did the XO moving for this17:34
*** toshywoshy <toshywoshy!> has joined #libre-soc17:52
*** markos <markos!> has quit IRC17:57
*** markos <markos!> has joined #libre-soc18:07
ghostmansd[m]lkcl, do you have any other tasks in mind that might need my help?19:18
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC20:05
lkclghostmansd[m], kinda, yes - mostly we've got to focus on cavatools and the cryptoprimitives20:54
FUZxxlWhere are most of you from?20:55
lkclall over the place!20:55
ghostmansd[m]I ain't no cryptographer, but I can help unless there is math involved:-)20:55
ghostmansd[m]Here's Mother Russia, for example20:56
lkclkuwait russia greece canada US germany UK belgium india20:56
ghostmansd[m]Hi FUZxxl20:56
FUZxxlhi ghostmansd[m]20:56
lkclbtw i'm supposed to be shopping, back in 1/2 hr :)20:57
ghostmansd[m]Ok then :-)20:57
lkclaspirin and jelly-babys20:57
lkclno connection20:57
ghostmansd[m]lkcl, do you mean Haribos?20:57
FUZxxlI want to do a PhD project about something related to SIMD stuff20:58
FUZxxlsomeone on reddit told me you might be interested20:58
ghostmansd[m]Well, SVP64 is not about SIMD, to be honest, quite the opposite20:59
FUZxxlI think that was you lkcl ?20:59
FUZxxlVector stuff counts in this regard :-)20:59
FUZxxlthough for some application it might be more tricky to use20:59
ghostmansd[m]In this sense, it totally applies!20:59
ghostmansd[m]Not more tricky than SIMD for sure20:59
FUZxxlWe just wrote a fun paper about a SIMD-accelerated algorithm to transcode between UTF-8 and UTF-16 (both directions)20:59
FUZxxlit's full of horizontal operations21:00
FUZxxlI'm not sure if you can do enough arithmetic on masks in SVP64 to implement it there21:00
ghostmansd[m]You might want to talk to programmerjake, he'd recently implemented UTF-8 in SVP6421:01
ghostmansd[m]That'd be a great place to start I think21:01
FUZxxlprogrammerjake: hi21:01
ghostmansd[m]Just to feel the difference, so to speak21:01
FUZxxlI can send you a preprint if you like21:01
FUZxxlwe achieve ~8 GB/s for UTF-16->UTF-8 and ~5 GB/s for the other direction on Icelake21:02
ghostmansd[m]We have some nice examples added recently in SVP64 assembly; lkcl, markos and programmerjake can share some tips on directions21:02
ghostmansd[m]I'm not sure whether programmerjake measured the throughput, but you can see the simplicity, at least21:02
ghostmansd[m]I mean, code-wise21:03
FUZxxlI think I kind of have to get used to it21:03
FUZxxlI think lkcl has sent me code about that before21:03
FUZxxlbut honestly I didn't get around understanding it21:03
ghostmansd[m]Ah, great!21:03
ghostmansd[m]Well, the good starting point about PPC assembly would be PowerPC ISA book21:04
FUZxxl^ this one was a UTF-8 validator21:04
FUZxxlI've actually not implemented a dedicated UTF-8 validator, but my code does validation throughout the transcoding process.21:05
FUZxxlIt really doesn't cost much at all (not on the critical path and there are enough ports with spare cycles to make it essentially free)21:05
FUZxxlanyway; that was mostly a warm up21:06
FUZxxlI heard you are interested in people writing assemblers for you?21:06
FUZxxlI was thinking about designing sort of a "smart assembler" that helps you write assembly code more effectively21:07
ghostmansd[m]Well, SVP64 pays significant attention to writing assembly code smartly21:08
ghostmansd[m]But I guess lkcl can enlighten more on this part :-)21:08
FUZxxlI'm unfortunately not super familiar with SVP64 yet21:09
FUZxxlmy key idea was to have an assembler that does chores like laying out a stack frame and dealing with obtaining global variables through whatever PIC scheme your platform uses and this sort of thing21:09
FUZxxlSo you can write intuitive code instead of having to deal with weird relocations and such21:10
FUZxxladditional possible objectives include strength reductions and register allocation21:10
FUZxxli.e. instead of taking a compiled language and turning it into a macro assembler, I want to take an assembler and give it useful high-level features so you can make your life easier while still fundamentally programming assembly21:11
FUZxxlghostmansd[m]: what is your focus in the libre-soc project?21:14
ghostmansd[m]My recent works are towards supporting SVP64 in binutils, also helping with the other stuff related to instructions.21:15
FUZxxlsounds great21:15
ghostmansd[m]Surprisingly this hasn't ever needed assembly knowledge. :-)21:15
ghostmansd[m]> So you can write intuitive code instead of having to deal with weird relocations and such21:16
ghostmansd[m]Well, these parts we don't touch, they're kept as they were in vanilla PPC world.21:17
FUZxxlyes correct21:17
FUZxxlstill they suck if you program in assembly21:17
FUZxxlso I think a smart assembler could help you do better21:17
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc21:17
ghostmansd[m]Many things suck if you program in assembly :-)21:17
ghostmansd[m]That depends on the assembly used21:17
FUZxxlthey don't have to!21:17
ghostmansd[m]But, even then, it's low-level kind of things21:18
ghostmansd[m]Well, they don't, but then it's not the lowest-level (well not the lowest but you understand what I mean)21:18
FUZxxlAssembly programming doesn't mean you are forced to ignore any QOL improvements.21:19
FUZxxlIt's just that you are fundamentally programming in terms of machine instructions.21:19
ghostmansd[m]No that's not what I mean21:19
ghostmansd[m]I mean, not many people think about it21:19
ghostmansd[m]Because they generally code in high-level21:19
ghostmansd[m]And the ugly part (e.g. relocations) is hidden21:20
ghostmansd[m]If everybody coded in asm, people would have thought about it more21:20
ghostmansd[m]But the thing is, they simply don't21:20
FUZxxlI think you have to write in assembly for fast SIMD/vector code as high level languages do not have the right paradigm to express such algorithms.21:20
ghostmansd[m]They just take, say, shiny Python and don't care21:20
FUZxxlImproving on this is another possible PhD project.21:21
FUZxxlSo being able to write assembly code more comfortably is something I consider very useful.  I don't particularly care that it's a small niche.21:21
ghostmansd[m]True, but, first, this does not really involve diving into low-level weird chunks like relocations21:22
ghostmansd[m]And, second, it's mostly about optimizing some sections21:22
FUZxxlIt does.21:22
FUZxxlFor example, what do you have to do to obtain the address of a global variable?21:22
FUZxxlThe answer depends on whether you write position-dependent, PIC, or PIE code.21:22
ghostmansd[m]This is true21:23
ghostmansd[m]But why would you do it in a small piece of code?21:23
FUZxxlSo if an assembler could do it for you, it would be much easier for the programmer to focus on the relevant bits21:23
FUZxxlthe assembly code can be substantial21:23
ghostmansd[m]I'd just pass that address in a register21:23
ghostmansd[m]To some routine written in assembly21:23
ghostmansd[m]What I mean, I would not write everything around in assembly21:23
ghostmansd[m]I would write some exclusive performance-critical parts21:24
ghostmansd[m]And make these callable from the high-level code21:24
FUZxxlbut why should there be this restriction?21:24
ghostmansd[m]I'm not saying it's right :-)21:24
ghostmansd[m]I simply state how it mostly works21:24
FUZxxlanother example: calling functions is hard because the correct sequence depends on the code model21:24
ghostmansd[m]There are exceptions, like KolibriOS21:24
FUZxxlTLS is hard, too21:25
ghostmansd[m]I mean, I'm not saying the idea is wrong, I just express my thoughts on how it works now21:25
FUZxxland it's usually OS dependent too21:25
FUZxxlI know how it works now21:25
ghostmansd[m]TLS is fucked as a concept TBH21:25
FUZxxland I am not satisfied.21:25
ghostmansd[m]Not unreasonable but still fucked21:26
FUZxxloh, please explain!  An interesting view!21:26
ghostmansd[m]I mostly mean concepts like errno21:26
FUZxxlhow is it fucked?21:26
ghostmansd[m]Compared to errno, using negative error codes like in the Linux kernel is rather a sane idea10:27
ghostmansd[m]Because checking/setting an integer in a register is much cheaper than accessing TLS21:27
FUZxxlyou only need to check TLS after having established that the call failed21:27
ghostmansd[m]BSDs use another approach, setting a carry flag21:27
ghostmansd[m]But in essence it's the same21:27
ghostmansd[m]First, it depends21:28
ghostmansd[m]Some routines require setting errno to 0, otherwise you won't be able to distinguish some failure cases21:28
FUZxxlthese are not system calls and hence could not benefit from the syscall convention21:28
ghostmansd[m]Second, why address TLS if you can simply use a register which is already thread-specific enough?21:29
FUZxxlalso the Linux convention is kind of fucked as it arbitrarily assumes that a certain range of values could not occur in valid results.21:29
FUZxxlNot really future proof.21:29
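The Linux convention under discussion reserves the top 4095 values of the return range as negated error codes (the kernel defines MAX_ERRNO as 4095), which is exactly FUZxxl's objection: a call whose valid results can land in that range cannot use the scheme. A minimal sketch of the check, in Python for illustration (the kernel's real macros are IS_ERR_VALUE and friends in C):

```python
MAX_ERRNO = 4095  # kernel convention: returns in [-4095, -1] mean failure

def is_err(ret):
    # mimics the kernel's IS_ERR_VALUE() idea for signed return values:
    # anything in the reserved range is read as a negated errno
    return -MAX_ERRNO <= ret < 0

def errno_of(ret):
    # recover the positive errno from a failing return value
    assert is_err(ret)
    return -ret
```

So `is_err(-2)` flags ENOENT, while a legitimate result of, say, -4096 or any non-negative value passes through untouched.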
ghostmansd[m]Or, well, rather processor specific21:29
FUZxxlSure, use one register for each global variable21:29
FUZxxlerrno is not performance critical enough to waste such a resource on it21:29
ghostmansd[m]Rather, don't use global variables :-)21:30
FUZxxlwe already keep the TLS pointer in a register, so errno is just one dereference away21:30
ghostmansd[m]But I'm not trying to persuade, this is not a holy war on global state :-D21:30
FUZxxlthe reason for errno is chiefly that you can't return more than one value in C.21:31
ghostmansd[m]> we already keep the TLS pointer in a register, so errno is just one dereference away21:31
ghostmansd[m]This actually depends on the architecture I think21:31
FUZxxlIf you can, there's no need to have it, as e.g. Go shows.21:31
ghostmansd[m]On most architectures yes21:31
ghostmansd[m]Also, enabling TLS per se is difficult stuff21:31
FUZxxlyou can't really avoid it21:31
FUZxxlYou are going to need thread-local variables one way or another21:31
FUZxxle.g. Go, which doesn't have TLS exposed in the language still needs it to implement details of its green threads21:32
ghostmansd[m]I wouldn't say; per-CPU, yes, these are unavoidable for many contexts21:32
ghostmansd[m]Per-thread? Not that much.21:32
FUZxxlwhat is the distinction in your opinion?21:32
FUZxxlthere is a 1:1 mapping between threads and CPUs21:33
FUZxxli.e. whenever a thread runs, it runs on one CPU21:33
ghostmansd[m]Yes and no. Per-CPU can be as simple as "take the CPU index and refer to an entry in, say, an array".21:34
ghostmansd[m]And, per-thread would have been the same...21:34
ghostmansd[m]...but instead it got buried with a lot of details, like allocate stuff in kernel, provide interface for it in userspace, support it in compilers, support it upon linkage, support it everywhere...21:35
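The contrast ghostmansd draws — per-CPU as "index into a table" versus per-thread TLS with its whole toolchain of support — can be mimicked in Python: `threading.local()` rides on the runtime's real TLS machinery, while a plain dict keyed by thread id is the "just refer to an entry in a table" scheme. Both sketches are illustrative of the two shapes, not a recommendation:

```python
import threading

# approach 1: real TLS machinery (runtime/linker support behind the scenes)
tls = threading.local()
tls.counter = 0  # visible only to the thread that set it

# approach 2: "take an index and refer to an entry in a table" — the
# per-CPU-style scheme, here keyed by thread id instead of CPU number
table = {}

def bump():
    tid = threading.get_ident()
    table[tid] = table.get(tid, 0) + 1
```

Approach 2 needs nothing from the toolchain at all; the cost is an explicit lookup and, in a real per-CPU setting, care around preemption between reading the index and using the entry.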
FUZxxlSo per CPU means that if you have a new time slice, whatever other process runs gets the per CPU variable? :-D21:35
ghostmansd[m]I mean, this is kinda the difference between kernel and userspace, I know it's not entirely a fair comparison.21:36
FUZxxlalso all the support would still be needed for "per CPU" stuff21:36
FUZxxlanother point is that register space is limited21:37
ghostmansd[m]No, it doesn't mean that. And no, it won't be, at least not with the amount of details pthread needs.21:37
FUZxxlso you cannot have programs freely allocate registers for global or per-thread stuff21:38
ghostmansd[m]And yes it could've been as simple as "just sacrifice one register"21:38
FUZxxlyou run out too quickly (as opposed to memory)21:38
FUZxxlwe do, it's called the TLS register21:38
FUZxxlsacrificed to enable use of arbitrary per-thread state.21:38
ghostmansd[m]Yeah I know21:39
ghostmansd[m]Anyway, the level of magic for supporting per-CPU in the Linux kernel is not comparable to the level of magic needed to support pthreads in userspace21:40
FUZxxlanother issue with sacrificing registers is that every part of the program needs to know which registers are sacrificed21:40
ghostmansd[m]That's my point21:40
FUZxxlwhich is hard to do in the presence of shared libraries and such21:40
ghostmansd[m]Well, this is kinda part of the deal, huh?21:40
FUZxxlwhat you call "per-CPU" is just "per-thread" but not touching memory21:41
ghostmansd[m]Like, say, with the calling conventions21:41
ghostmansd[m]It's not what I call "per-CPU", it's in fact per-CPU21:41
ghostmansd[m]Rather, the thing you call per-thread is in fact per-CPU :-)21:41
ghostmansd[m]But this is buried deep down21:42
ghostmansd[m]Semantics matter21:42
FUZxxlsorry, I still don't really get the point you are trying to make.21:42
FUZxxlEspecially with the distinction between per-thread and per-CPU21:43
ghostmansd[m]My point is, unless you don't really need to, don't use TLS. Allocate per-thread context, for example.21:43
FUZxxlTLS is there anyway, might as well use it.21:44
ghostmansd[m]*unless you really need21:44
FUZxxlIt won't go back into its box if you don't use it.21:44
ghostmansd[m]There are many things around which won't; not the reason to use them.21:44
FUZxxlso ... you oppose on philosophical grounds?21:44
FUZxxlit would be easier to implement if we didn't have them, so even though we do, don't use them?21:45
ghostmansd[m]I oppose on the grounds of simplicity. You might as well call these philosophical, if you like. :-)21:45
FUZxxlI think TLS makes for some tremendously simple solutions.21:46
FUZxxlJust as static variables do.21:46
FUZxxlI mean there is a reason objects are so popular: object members are just per-instance static variables.21:46
FUZxxlLets you use them without feeling guilty for having sinned.21:46
ghostmansd[m]And simplicity differs. I know declaring something thread_local might appear simpler than calling malloc inside the thread. And having a global variable is simpler than passing function arguments. In this sense, this is not the simplicity I'm talking about.21:47
FUZxxlso philosophical grounds :-)21:48
ghostmansd[m]I'm talking about simplicity to debug and maintain it, and also about simplicity of supporting the internals needed, etc.21:48
ghostmansd[m]Again: if you think these are philosophical grounds, that's just your opinion, I'm fine with it. :-)21:48
ghostmansd[m]I don't really intend to convert you into the no-TLS-whatsoever church. :-)21:49
ghostmansd[m]Each and every programmer is full of opinionated statements, so am I.21:49
FUZxxldebugging is a lot easier with TLS and static variables because you can just look at the variable21:50
FUZxxlit's a lot harder if the variable is to be found somewhere deep in an object hierarchy or the call stack21:50
FUZxxlyou also don't have to lug the variable down the call stack which e.g. helps tremendously when writing recursive-descent parsers.21:50
ghostmansd[m]My experience differs, but, again, this is about opinions.21:50
ghostmansd[m]This diverged a lot from UTF-8, lol :-D21:51
FUZxxlMy current other project is writing a SIMD-accelerated SAT solver21:52
FUZxxllet's see if it works21:52
ghostmansd[m]Oh, nice!21:57
FUZxxlSAT solvers are very branch heavy21:57
FUZxxlthey constantly (1) load a datum from RAM (2) compare it with something (3) branch on the result21:58
FUZxxlalmost all the loads are interdependent21:58
FUZxxlI believe I can do better with slighly improved data structures, permitting me to then use gather instructions and so on21:58
ghostmansd[m]Well, that's one of the benefits with vectors, being able to write branchless code, so I hope you'll find what you're searching for here. :-)22:00
ghostmansd[m]So, welcome :-)22:00
FUZxxlOh it's still going to be plenty branchy22:00
FUZxxland I think vectors will not actually give a lot of benefit22:00
FUZxxleven being able to do 4 loads independently would be a big plus22:00
FUZxxleven with vectors you'll be limited to how many fetch units you have22:02
FUZxxlas most loads are random22:02
FUZxxlso once your vector holds as many integers as you have fetch units, you have reached the limit of possible perf benefits.22:02
markosFUZxxl, hey, welcome, nice to see another fellow SIMD junkie :)22:06
markoswell, in my case, ex-SIMD junkie, I'll never see SIMD again in the same light after SVP6422:06
markosin fact, working with SVP64 in most cases requires seeing the actual reference scalar code rather than a SIMD implementation22:07
*** toshywoshy <toshywoshy!> has quit IRC22:07
FUZxxlmarkos: I like to do something I call "combinatorial SIMD" i.e. SIMD use for combinatorial algorithms22:08
FUZxxlThe results are usually completely different from the scalar code as they require entirely different algorithm design approaches22:08
markosI implemented some example code for SVP64, you'll see that it's much much easier to just write SVP64 assembly directly following the scalar code22:09
FUZxxlwhich also makes me feel uncertain about how advantageous vector architectures are going to be.  I don't know how good their horizontal operations are.22:09
markosand forget SIMD entirely22:09
FUZxxlIf you have all vertical code I'm sure it is.22:09
markoshow about reducing code size to 1/5th or even 1/10th?22:10
FUZxxlbut alas I do not have enough experience.22:10
markosin assembly!22:10
FUZxxlsure, but does it do the trick?22:10
markoshow fast we don't know yet22:10
FUZxxle.g. could you do it with, say, an integer sorting algorithm?22:10
FUZxxlif you have, say, quicksort, it's a very simple scalar procedure.22:10
FUZxxlBut what would the SVP64 variant look like?22:10
FUZxxlEach iteration has a loop carried dependency.22:11
FUZxxlYou need to redesign the algorithm to vectorise it.22:11
markoswe haven't done sorting yet, but it's trivial to find the horizontal min/max across a vector22:11
FUZxxlsure, this is because you have reduction instructions I suppose.22:11
markosone instruction to get the min/max across 64 bit registers22:11
FUZxxlcheating :-P22:11
markosnot more than what other engines are doing already22:12
FUZxxlan instruction to implement a special case says nothing about the general case22:12
markosbut instead of adding a gazillion instructions to do one simple job22:12
markosSVP64 adds just a few powerful instructions that can be modified to be used in pretty much all scenarios22:12
FUZxxlfor example, what if you want to compute the gcd of a vector?22:12
FUZxxlIs it still that easy?22:12
FUZxxlthe scalar code is just as easy as "min of a vector"22:13
FUZxxlwhat would it look like in SVP64?  Is there a "reduce with gcd" instruction?22:13
markosI did ports of a few algorithms used in VP8/VP9 and AV122:13
markosdo you have the git checked out?22:13
FUZxxlwhich git in particular?22:14
markoslibresoc, openpower-isa22:14
FUZxxlcan you give me the correct link?22:14
FUZxxl(no I do not have it checked out)22:14
FUZxxlwhat is the URL to clone this repository?  The site does not say.22:16
markoswell, if you want to work on it you have to get an actual dev environment; unfortunately it's not as simple as checking out and starting to play with it22:17
markosbut if you want to look at SVP64 asm examples22:17
markoshas a port of a few functions for libvpx and dav1d libraries22:18
FUZxxlyou are evading my question :-)22:19
markosbuilt using ghostmansd's binutils port to SVP64 and run on the python simulator22:19
markoswhich is? how would a gcd be implemented in SVP64?22:19
FUZxxlno, a "reduction by gcd"22:19
FUZxxlI chose gcd because I am pretty sure you don't have an instruction for it.22:19
FUZxxli.e. take an array of numbers and compute the gcd of all numbers.22:19
FUZxxlthe scalar loop is simple, but how can it be vectorised?  I cannot imagine it would be as simple as the scalar loop.22:20
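The scalar reduction FUZxxl is describing is indeed a one-liner loop with a loop-carried accumulator; a minimal Python sketch (`gcd_reduce` is an illustrative name, not anything from the project):

```python
from math import gcd

def gcd_reduce(xs):
    """Scalar gcd reduction over an array: each iteration depends on
    the accumulator written by the previous one, i.e. the whole loop
    is one loop-carried dependency chain."""
    acc = 0  # gcd(0, x) == x, so 0 acts as the identity element
    for x in xs:
        acc = gcd(acc, x)
    return acc
```

The question in the discussion is precisely whether this chain can be expressed (and broken up) in SVP64 without a dedicated "reduce with gcd" instruction.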
markosyou have to think outside the box of SIMD22:20
markosno, there's no specific instruction for it22:20
markosbut it can be turned into SVP64 quite easily22:20
FUZxxlBut would it look just like the scalar code?22:21
markosnot exactly, but close22:21
markosit's just a bunch of modulos and compare against zero22:21
FUZxxlthe thing is: I'm sure it's very easy to make the vector code look like the scalar code for simple numerical algorithms22:21
FUZxxlbut I really don't care about these22:22
markoshow's DCT?22:22
markosI wouldn't call it exactly simple22:22
*** toshywoshy <toshywoshy!> has joined #libre-soc22:22
FUZxxlmarkos: and you have to somehow deal with it being a reduction22:22
markosand yet it's much smaller than SIMD DCT code I'm writing for Arm22:22
FUZxxldiscrete cosine transform?22:22
FUZxxlsure it is22:22
FUZxxlyou don't have to deal with edge conditions or chunking22:22
markoshave you seen SIMD unrolled code?22:23
markosfor DCT I mean22:23
FUZxxlotoh you will not be benefit from more execution units with such simple code22:23
FUZxxl*not benefit22:23
FUZxxlYou can write SIMD code in very similar style, especially with AVX-512.  The extra complexity is mainly computing masks for the final iteration.22:23
FUZxxlBut such code is not very efficient as it doesn't have high ILP22:24
markoslet's leave AVX512 out of the discussion please :)22:24
FUZxxlthe complexity is needed to enable better ILP and I believe that'll be the same with SVP64 in future implementations22:24
markosI'm writing SIMD code for pretty much all architectures, as a profession22:24
markosAVX512 apart from a few interesting instructions is a mess22:24
FUZxxlso am I :-)22:24
FUZxxlAVX-512 is the best ISA for my particular work because it has the best horizontal operations22:25
markosif it works for you, fine22:25
FUZxxlother architectures lack them22:25
FUZxxlSVE2 has something like VPCOMPRESSB, but no PDEP or PEXT, so I will not be able to port my UTF-8 algorithm to it easily.22:26
markospdep/pext are indeed lacking other architectures22:26
FUZxxlSVP64 would need pdep/pext on mask (?) registers to be effective22:27
markosbut here's the thing, in the end all that matters is the resulting performance, and so far I managed to get the best performance of my software on a M1 using NEON22:27
FUZxxland compress/expand (which I think you might already have)22:27
FUZxxlyeah that chip rules22:27
FUZxxlhowever, NEON sucks for combinatoric programming: no scatter/gather, no compress, no pdep, no pext22:27
FUZxxlgood for number crunching though22:28
markosscatter/gather are a joke on Intel22:28
FUZxxlthey got a lot better with Icelake22:28
FUZxxlcould be better though22:28
FUZxxlhave you worked with NEC Aurora Tsubasa before?22:28
FUZxxlFun system, too.22:28
FUZxxlvector ISA w/ 2048 byte vectors iirc.  HBM2 memory, no L2 cache22:29
FUZxxlmem is so fast that vector loads always bypass the L1 cache.  Crazy perf22:29
markosthat's highly specialized22:29
FUZxxlmade for number crunching22:29
markosyes, and only that22:29
markosSVE was based on such a CPU, from Fujitsu, iirc22:30
FUZxxlstill waiting for SVE2 hardware I can put on my desk22:30
markossame here, probably I'll get access next month22:30
FUZxxlonce that happens ... I guess I'll write a whole lot more ARM asm22:30
FUZxxl^ here's some ARM asm I wrote a while ago.22:31
markosthe point is that SVP64 is a totally new game22:31
markoscheck the asm in the media/video/ directory22:31
FUZxxlthat was fun.  I wonder if it would work the same or better with SVP64 or other vector isas.22:31
markosthere is no need to unroll your loop 100 times to get the performance22:32
markoseg. I'm doing a DCT 32x32 algorithm for NEON right now22:32
markosit's unrolled 1500 lines with a ton of inline functions22:32
FUZxxlsure, neither is it for the M122:32
markosso even more22:32
FUZxxlthe out of order mechanism makes it so multiple iterations happen at once22:33
markoswith SVP64 there are specialized DCT instructions22:33
FUZxxlstill brings a little bit of an advantage some times though22:33
markosit's not about a specific implementation, eg M1 or another22:33
markosit's about the architecture, not the micro-architecture22:33
FUZxxlwhy does the architecture require unrolling?22:33
FUZxxlwhat is fundamentally different with SVP64 that it doesn't?22:34
markosthe loops are in the hardware instructions themselves22:34
FUZxxlso ... each instruction is a loop?22:35
markoswhereas in SIMD you might have to unroll 4 times, for a simple 128-bit vector addition, with SVP64 you just do a sv.adds having previously set the VL (Vector Length) to whatever you want22:35
markos(up to 64 iirc)22:35
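markos's contrast - SIMD needs chunked/unrolled main loops plus tail code, while setvl puts the element loop in the instruction itself - can be sketched in Python (hypothetical helper names, a model of the semantics rather than of any real hardware):

```python
def simd_add(a, b, width=4):
    """Fixed-width SIMD style: process chunks of `width`, then a
    scalar tail loop for the leftover elements."""
    out = []
    n = len(a)
    i = 0
    while i + width <= n:          # chunked/unrolled main loop
        out.extend(a[i + j] + b[i + j] for j in range(width))
        i += width
    while i < n:                   # tail handling, unavoidable in SIMD
        out.append(a[i] + b[i])
        i += 1
    return out

def sv_add(a, b, vl):
    """SVP64 style: one sv.add with VL previously set to the element
    count; the per-element loop lives in hardware, so no tail code."""
    return [a[i] + b[i] for i in range(vl)]
```

Both produce the same result; the difference the discussion is about is how much instruction-stream machinery each one needs.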
FUZxxlwhy do you have to unroll 4 times with SIMD?  You can just do the same loop you do with SVP6422:36
markosthe added bonus of having a full 128 registers to play with, makes for some pretty interesting code22:36
FUZxxland for slow context switches22:36
markosin general, branches are bad22:36
FUZxxlnot if they are easy to predict22:36
markoscontext switching can be solved easily22:36
markosno, branches are bad, period22:36
FUZxxlI disagree22:36
FUZxxle.g. my UTF-8 conversion routine got twice as fast by making it more branchy22:37
FUZxxlI added fast paths for e.g. when there are no surrogates skipping large chunks of the algorithm22:37
markosin order to have good branch prediction units, you have to add more silicon to the chip, which in turn increases power usage, which makes it more difficult to scale your cpu, etc etc22:37
markosI'd prefer using my silicon for other stuff22:37
FUZxxlMost code is branchy business logic, you will not be able to vectorise that.  Or make it less branchy.22:38
markoscritical loops aren't for this reason22:38
FUZxxlMine are :-)22:38
FUZxxlYou are invited to make them less branchy and faster.22:38
markosmy plate is already full :-)22:38
FUZxxle.g. here is my UTF-8 conversion routine:
FUZxxlHow would it benefit from SVP64?22:39
markosif it was as you say, there would be no need for so much inlining of functions in the hot path22:39
markosprogrammerjake is the utf8 expert here22:39
FUZxxlwhat the inlining helps with is getting rid of the shuffling of registers22:40
FUZxxlalso, functions are often called with constant arguments so constant propagation helps.22:40
FUZxxlThe pure function call is usually not what makes a call expensive.22:40
FUZxxlmarkos: one thing this code does e.g. is obtaining masks over vector elements, then moving them to scalar registers and doing arithmetic on them.22:41
FUZxxlNot sure how this would be possible with SVP64 given that masks are longer than scalar registers.22:41
markoswithout knowing the algorithm I can't say22:42
FUZxxlWe also have a thing where the vector length must necessarily be dynamic as UTF8 and UTF16 are variable length encodings, so a variable length chunk of input is translated into a variable length chunk of output.22:42
markosbut it's the same as usual, check the reference algorithm and not the SIMD code first22:42
FUZxxlBut I guess there is a "set vector length to the number in this register" instruction.22:42
FUZxxlthe reference algorithm is completely different and cannot be vectorised as it is full of loop carried dependencies.22:42
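The loop-carried dependency FUZxxl means is structural: in a scalar UTF-8 walk, where each character starts depends on the length of every previous character. A minimal sketch (validation omitted, illustrative name):

```python
def utf8_lengths(buf):
    """Scalar UTF-8 walk over a byte string: classify each leading
    byte to find the character length, then advance by that length.
    The next iteration's position depends on this iteration's result,
    so the loop cannot be naively vectorised as written."""
    lengths = []
    i = 0
    while i < len(buf):
        b = buf[i]
        if b < 0x80:
            n = 1          # ASCII
        elif b < 0xE0:
            n = 2          # 2-byte sequence
        elif b < 0xF0:
            n = 3          # 3-byte sequence
        else:
            n = 4          # 4-byte sequence
        lengths.append(n)
        i += n             # loop-carried: position depends on data
    return lengths
```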
FUZxxland the SIMD algorithm I designed cannot be expressed in a scalar paradigm.22:43
markosto put it in other words, if I'm going to port code to SVP64 I found it much, much simpler to just take the C code and work with that22:43
FUZxxlPlease, I'm interested in seeing the result.22:43
markosSIMD-isms are not possible with SVP64 or are much much harder to implement and with a big doubt of the actual benefit22:43
FUZxxlHere is the reference C code I wrote:
markosI already told you, check the media/video code22:43
FUZxxlIt's a completely different algorithm though.22:43
markosreference C algorithm and SVP64 in the same dir (*_real.s is the actual code)22:44
FUZxxlthat code is just straightforward math code.  No loop carried dependencies, no permutations, no data-dependent addressing.22:44
FUZxxlI 100% agree with you that you can directly translate such code.22:44
markoslots of branches but rather straightforward22:45
FUZxxltrack inpos and outpos22:46
FUZxxlthese two vary depending on which branch is taken22:46
FUZxxlso you cannot just do the usual "do both branches and merge the results" approach as the addresses from which the next elements are fetched vary22:46
markosnever said you would22:47
FUZxxldepending on which branch is taken in a loop-carried-dependency22:47
FUZxxlI would really like to see what it would look like22:47
FUZxxlit feels hard to imagine.22:47
markosbut from what I see, the SVP64 version would probably -this is a guess- look closer to the scalar, translated, rather than a SIMD implementation22:47
FUZxxlhow do you deal with the loop-carried dependency?22:48
markosI think that would be a good exercise for you to learn SVP64 assembly :)22:48
FUZxxlwhy do you keep evading my questions?22:48
lkclFUZxxl, i made a start on in-regs insertion-sort, it's about... 8-10-ish instructions, i haven't finished it.22:48
FUZxxlI try to understand what the trick is and you keep waving your hands22:48
markosI don't really know what you expect me to tell you, if you're expecting me to implement a SVP64 version of that, that's not going to happen22:48
markosbecause I really have a ton of stuff to work on, and this is not in the list22:49
FUZxxlI think you may not have understood the problem with vectorising this algorithm22:49
markosloop carried dependencies are not as important with SVP64 as they are with SIMD22:49
FUZxxllkcl: it would be interesting to see that.22:49
FUZxxlbut how are they resolved?22:49
FUZxxle.g. in this algorithm, to fetch c0 for the next iteration, you need to have completed the previous iteration22:50
markoswhat is a loop-carried dependency in the first place?22:50
FUZxxlso how would this be vectorised?22:50
markosit's data you have to carry horizontally and you can't do that with SIMD22:50
markosyou can with scalar22:50
lkclFUZxxl, we're very much learning - all of us - as this is such a new concept that we're literally at the forefront of experimenting and finding out how it works22:50
FUZxxl"data you have to carry horizontally" -> it is22:50
FUZxxland the whole point of my SIMD code is that it can do that.22:51
markosand you can with SVP64, because the result of each instruction CAN affect the next22:51
FUZxxlI think I still do not understand how SVP64 works.22:51
markoswith SIMD, unless the engine has specific instructions to do that -eg. horizontal adds, min/max/etc- you have to change the order from horizontal to vertical, eg. by transposing the vector matrix22:52
lkclFUZxxl, it's actually not-strictly-speaking a Scalable Vector system at all, it's more classified as a type of "Zero-Overhead Loop-Control"22:52
markoseven if it does offer specific instructions, you cannot have the same instruction change the results22:52
lkclthat's where i began insertion-sort22:52
lkclthe idea came from looking at both SIMD and Cray-Style Vectors, and noticing that:22:53
lkclSIMD is "for i in fixed_simd_length: operation(register[j].element[i], ....)"22:53
lkcland Cray-style Vectors is:22:54
lkcl"for i in 0..GLOBAL_DYNAMIC_VECTOR_LENGTH_REGISTER: operation(register[j].element[i], ....)"22:54
lkcli decided to separate out the "for-looping" part from "the operation" part, as a prefix-suffix 32-32 encoding.22:55
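lkcl's two pseudocode loops, modelled directly in Python (illustrative `simd_op`/`vector_op` names; the only difference is where the element count comes from):

```python
def simd_op(op, dst, a, b, SIMD_WIDTH=4):
    """SIMD: the element count is baked into the instruction encoding."""
    for i in range(SIMD_WIDTH):
        dst[i] = op(a[i], b[i])

def vector_op(op, dst, a, b, VL):
    """Cray-style vectors: the element count comes from a dynamic VL
    register.  SVP64's twist is that `op` here is an ordinary *scalar*
    Power ISA operation, with the loop supplied by the 32-bit prefix."""
    for i in range(VL):
        dst[i] = op(a[i], b[i])
```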
FUZxxlhow is this translated into micro operations?22:55
lkclusing the *Scalar* Power ISA as the suffix22:55
FUZxxlthat's how it's encoded22:55
FUZxxlso say the scalar UTF-8 algorithm22:55
markosok, I think I found an example that can explain your question22:55
markosfrom mp3, showing loop-carried dependency22:56
FUZxxlif you have a hardware loop mechanism to basically execute the scalar algorithm in a hardware loop, the total latency would still be the same as it's all loop carried22:56
lkclin high-performance implementations it would be done by inserting an additional hardware block in between Decode and Multi-Issue phases22:56
markos        setvl 0,0,18,0,1,1                      # Set VL to 18 elements22:56
markos        # Load 18 floats from (in)22:56
markos        sv.lfs *vin, 0(in)22:56
markos        # equivalent to: for (i = 17; i >= 1; i--) in[i] += in[i-1];22:56
markos        sv.fadds/mrr *vin+1, *vin+1, *vin22:56
markoscheck the for loop22:56
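A literal Python model of the C comment in markos's example (hypothetical `mrr_adds` name). Note that, executed in the order the comment writes it (i descending), each add reads the *original* neighbour; whether successive element-operations instead see each other's results is exactly what the mapreduce mode's defined traversal order pins down in hardware:

```python
def mrr_adds(vin):
    """Model of the C comment `for (i = 17; i >= 1; i--) in[i] += in[i-1]`
    for a VL-length vector.  As written, with i descending, in[i-1] has
    not yet been modified when it is read."""
    vin = list(vin)
    for i in range(len(vin) - 1, 0, -1):   # i = 17 .. 1 for 18 elements
        vin[i] += vin[i - 1]
    return vin
```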
lkclnope, not if you use a Multi-Issue Out-of-Order Execution Engine as the underlying hardware22:56
FUZxxlif each operation is linearly dependent on the previous one, then I don't see how that would work22:57
FUZxxlwhich it is in the scalar UTF-8 translation algorithm22:57
FUZxxl*each iteration22:57
markosthe actual implementation is a hardware micro-architecture issue22:57
lkclthen yes, what you get is, you get the situation where you benefit instead from having a multi-issue OoO Engine as the back-end22:58
lkclwhere there would be multiple in-flight operations22:58
FUZxxlyou will not be able to get it faster than the length of the critical path22:58
FUZxxlwhich is ... 10 times slower than the SIMD algorithm I designed22:58
lkclunder which circumstances, a little bit more intelligence in how the algorithm is written has to be done22:58
lkcli.e. you have to think in terms not so much of "SIMD" but of "SIMD elements"22:59
FUZxxlok, done.22:59
FUZxxlI can envision how each individual element flows through the algorithm.22:59
lkclyou'd have to make an implementation of the algorithm that ensures that the *elements* have less Hazard-Dependencies22:59
markosparallelizing the algorithm does NOT require SIMD23:00
FUZxxlwhich I did, but that only works by viewing a whole vector of elements at once.23:00
lkclthis is common practice in Cray-style Vectors (since the 70s)23:00
lkclthat's it. that's what you do.23:00
FUZxxli.e. I break the dependencies by computing where the elements go ahead of time and then just permuting.23:00
FUZxxlBut that does require an entirely different approach, in contrary to what markos said.23:01
FUZxxlWhich is my whole point.23:01
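The "compute where elements go ahead of time, then permute" pattern FUZxxl describes can be sketched generically (hypothetical names; a prefix sum over the element lengths removes the serial position dependency from the placement step):

```python
from itertools import accumulate

def place_by_prefix_sum(lengths, payloads):
    """Instead of walking output positions serially, compute every
    output offset up front via an exclusive prefix sum over the
    lengths; the placement of each payload is then independent of the
    others, i.e. the loop-carried dependency has been broken."""
    offsets = [0] + list(accumulate(lengths))[:-1]  # exclusive prefix sum
    out = [None] * sum(lengths)
    for off, length, p in zip(offsets, lengths, payloads):
        for j in range(length):
            out[off + j] = p   # independent scatter, vectorisable
    return out
```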
lkclwhat you are *not* restricted to is *only* thinking in terms of vector-of-elements23:01
FUZxxlyou aren't restricted to that with SIMD either.  It's plausible someone will implement SIMD extensions with individually renamed elements.23:01
lkclyou *can* issue a vector-instruction that very very deliberately has overlaps, e.g. a Prefix-Sum (fibonacci)23:01
FUZxxle.g. AMD already does something like this by splitting vectors into chunks and renaming each chunk separately.23:01
lkclyes but they have to actually explicitly add that as instructions23:02
markosyou asked for an example of SVP64 handling loop-carried dependency, I showed you one, you complained about performance and really there is no way to tell that at the moment23:02
FUZxxlit's an implementation detail of how they implement AVX-512 in Zen423:02
FUZxxlMy goal is to understand what this is all about23:02
FUZxxlso I am proposing problems that would not work with my current understanding of the technology.23:02
markosit's not a solution for all problems23:03
lkclno - we're focussing on ISA-level simulations at the moment rather than HDL (architectural) simulations23:03
FUZxxlTo gain insight into how they could be implemented to then amend my understanding.23:03
markosbut it does solve many existing problems23:03
FUZxxlHow do you model the performance without an uarch level simulation?23:03
lkclalthough the cavatools port will have timing / cycle-accurate counting23:03
lkclat this early phase: we don't.23:03
lkclthat's what the next NLnet grant is for23:03
markosthere is no microarch at the moment, that's what I've been saying23:03
lkclwhich will take about... 5-6 months to complete23:04
FUZxxlthe way you explain this algorithm it seems like the details of the uarch design are critical to the perf characteristics of the architecture23:04
FUZxxland for how you need to write code to be fast.23:04
markosuarch is always specific to the implementor23:04
markosit can be entirely different from each vendor23:04
lkcl*then* we have a base on which to *begin* making performance estimates23:04
FUZxxlor more polemically: just because it looks elegant doesn't mean it'll be fast.23:05
lkclone of the things about designing a general-purpose Scalable Vector ISA is: you *don't* design the ISA based on "performance"23:05
markoseg. M1 is Arm Architecture, but NOT Arm Microarchitecture, it's Apple's own, the rest of the vendors (Samsung, Qualcomm, etc) take *some* parts of the architecture and add their own23:05
lkclthat's a very unfortunate conflation that's come from SIMD ISAs23:05
FUZxxlI partially agree, but partially disagree23:05
lkclwhich are near-direct exposure of the underlying SIMD hardware internal micro-architecture directly to/at the programmer23:05
FUZxxlyou want the programmer to be able to make assumptions about what the performance characteristics are *at least*23:06
FUZxxlit can always be faster in practice23:06
lkclARM is presently trying to dissuade people from doing exactly that23:06
lkclby designing algorithms to be "vector-length-independent"23:06
lkclas is RISC-V with RVV23:06
FUZxxle.g. something like "if I have a permutation of a vector, do the data elements have to be ready when the permutation happens or only the index elements"23:07
FUZxxlwhich majorly affects how an algorithm requiring a permutation is to be designed23:07
lkclif successful it allows code to be fully portable from embedded systems with zero lanes (a scalar-element) right the way up to massive 1024-way SIMD hardware23:07
markospermutation of the vector in SVP64 is just moves based on indices23:07
markosactually not even moves23:07
lkclyyeah, permutations are the "massive spanner in the works" for both ARM SVE/2 and RISC-V RVV's "plans", there23:08
markosyou change the index numbers of the registers that are going to be used in the next iteration23:08
markosyou want reverse? sure, take a reverse sequence23:08
FUZxxlso there is no dependency on the data, only on the indices23:08
lkclyou can't possibly have a reliable permutation concept when you've absolutely no idea what the target microarchitectural vector length is!!23:08
FUZxxlis that *guaranteed* or just an implementation detail?  That's the sort of stuff I need to know23:08
FUZxxlI don't really care about specific instruction timings23:08
FUZxxlBut about this sort of general stuff.23:08
markostoo early23:08
lkclthere's a hard-guaranteed limit on the Vector Length of 127.23:08
lkclit is *required* that that Vector Length be supported, across *all* implementations23:09
FUZxxlSame with programming for ARM.  You are completely correct about not programming for a specific uarch, but you can make the assumptions that all uarches will be very similar in certain aspects.23:09
markosyou can never assume anything about specific performance23:09
FUZxxlbetter than on rv64 :-)23:09
lkcleven if that implementation is a ridiculously-slow Finite State Machine that's only capable of executing one element per every 10 clock cycles23:09
FUZxxlthen I will not be able to design fast algorithms for the machine23:10
FUZxxland it will immediately be uninteresting23:10
lkclor it's an insanely-fast one that can do QTY 16 of elements per SIMD-back-end-ALU per multi-issue instruction23:10
markosone instruction can take 3 cycles on a Qualcomm, and 1 cycle on a M1, and you can never know/assume that23:10
FUZxxlsure, not really a problem (though annoying to deal with some times)23:11
lkclyou can design *portable* and *highly-efficient* (compact) algorithms for any future machine that anyone decides to implement in future23:11
FUZxxlwhat is worse if you e.g. don't know if a shift is a barrel shift or one shift per cycle23:11
FUZxxlonce again, heavily affects the algorithm design23:11
markosor one CPU can have 1 vector ALU and be slow as hell, or it can have 4x128 NEON units (graviton3) and your NEON code can really kick ass23:11
FUZxxlor for ARM, you know that perms are happening for the whole register.  Elements are not individually renamed.23:12
FUZxxlThis too affects algorithm design.23:12
markosthat's a microarchitecture detail23:12
lkclyyeah these kinds of things we're just going to have to live with until there's even one "decent" implementation (let alone 2 or 3)23:12
markosyou cannot impose that kind of thing to a vendor23:12
FUZxxlthen I think I'll have to come back once there is more clarity about that23:12
lkcllike i said, we're at the software-simulator stage23:12
markosunless you sell them the microarchitecture as well as the ISA23:12
FUZxxlThe thing is, you are proposing a revolutionary ISA design approach but don't give any details about what the implementation is going to look like23:13
FUZxxlso it's impossible to optimise complex data flow for it23:13
lkcli've been designing the implementation for 3 years, now23:13
FUZxxle.g. the reduction code you cited above will be really slow (17 cycles minimum) if the CPU just executes all the additions in the order of the dependency chain you define there23:14
FUZxxlbut perhaps it is faster if it is special cased.23:14
lkclthe base i have planned is a multi-issue OoO one...23:14
lkcl... yes23:14
markosI think you're thinking too much around SIMD to see the benefit23:14
FUZxxlIn practice it may be 1000 elements and you may really want to know if it is23:14
FUZxxlmarkos: I think in terms of an OOO CPU23:14
lkclright, that's actually coded into the specification as an "Architectural Note/Hint"23:14
FUZxxli.e. in terms of data dependencies23:14
FUZxxlah so you do have assumptions that can be made (?)23:15
lkclthat implementors are perfectly permitted to spot those kinds of cases23:15
FUZxxlbut will they in practice?23:15
markosit can also be an in-order CPU, that's again a microarchitecture detail23:15
lkcland replace them with micro-coded optimisations23:15
lkclwhen we've the VC funding i'm planning to instruct the team on where micro-coded optimisations will be possible, yes23:15
FUZxxlI can write possibly beautiful code against your ISA, but until it is known what typical uarches look like, it will not be possible to say if that code is really good code23:16
lkclfor example a mapreduce on a Vector-OR-operation is clearly easily paralleliseable23:16
lkclwe know23:16
FUZxxlthe fadd example in particular is tricky because addition is not associative23:16
lkcli have been trying to drum this into one of the team, programmerjake, for some time23:16
lkclyes, and that's also in the specification :)23:16
FUZxxlso not sure if the CPU will be able to get rid of the dependency chain23:17
lkclthat fadd can never be associative on strict-ordered-linear-reduction23:17
lkcland the implementor is *required* to respect that.23:17
lkclthey are not permitted, on non-associative operations, to choose an arbitrary order.  the results are *required* by the specification to be 100% Deterministic23:18
lkclobviously that's not good in terms of performance :)23:18
lkclso there is a *parallel* reduction mode, as well23:18
lkclwhich is, again, 100% Deterministic23:18
lkclsuch that even that can be applied to non-associative base operations23:18
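The distinction lkcl is drawing - a strict-ordered reduction versus a fixed parallel tree, both 100% deterministic but giving different float results - can be illustrated in Python (hypothetical names; the tree shape here is just one possible fixed order, not the one the SVP64 spec defines):

```python
def strict_reduce(xs):
    """Strict left-to-right reduction: the order implementors must
    respect for non-associative ops such as float add."""
    acc = xs[0]
    for x in xs[1:]:
        acc = acc + x
    return acc

def tree_reduce(xs):
    """A fixed pairwise-tree reduction: a *different* but equally
    deterministic order, parallelisable because each level's adds
    are independent of one another."""
    xs = list(xs)
    while len(xs) > 1:
        nxt = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:            # odd element carries to next level
            nxt.append(xs[-1])
        xs = nxt
    return xs[0]
```

Because float addition is not associative, the two orders can legitimately round differently (e.g. summing ten 0.1s), which is why the spec has to nail the traversal order down rather than leave it to the implementor.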
markoskeep in mind that there are multiple ways to solve a problem, and this code is for a platform that doesn't exist yet, if only to prove a point that it works, it's definitely not "optimized"23:19
FUZxxlI believe that it works23:20
lkclthe "optimisation" - whether and how to optimise - is down to the hardware implementors, of which there is only one, at present, in HDL, and it was an experimental CPU, done last year.23:20
markosin fact, many instructions are added/modified as we try to port existing algorithms to SVP6423:20
lkcl(on the basis of which we were able to get more funding.  it's a bootstrap process in other words)23:20
lkclthe bit that *is* guaranteed - regardless of implementation - is the reduction in program size23:21
lkcl(this is what i have been trying to drum into programmerjake: he has recently also been side-tracked into focussing on "performance")23:21
markosimho, that's the best part, we see what's needed for specific algorithms and lkcl willl try to find a way to make such an instruction work, if it's possible :)23:21
markospersonally I'm not at all worried about performance at this moment23:22
lkclthe reduced program size (between 50% and 90%) results clearly in reduced I-Cache hits, reduced TLB lookups, and so on.23:22
FUZxxlnot really the key factor here23:23
lkcland that's clearly right across the board *regardless* of the data-path side, number of ALUs, and so on23:23
FUZxxlSVP64 doesn't really improve standard business logic which is the big icache hog23:23
FUZxxla hot loop usually fits into the µop cache, even with unrolled SIMD23:23
lkclif the I-side is sitting idle literally doing nothing it's saving power23:23
markosI'm more interested in getting eg. the butterfly instructions used in DCT in the ISA, so that we can have a full DCT implementation in just a dozen instructions - compared to the thousands right now23:24
lkclwe've found that the hot-loops fit into *a single line* of an I-Cache, *without* loop-unrolling!23:24
FUZxxlMy key argument: I tend not to believe claims that something will eventually perform well unless the way it's going to be implemented is clearly laid out and will obviously work23:24
markosstuff like that is more important at this stage23:24
* lkcl looking for an example23:24
markosdefining the ISA implies nothing about the underlying architecture23:24
markossorry, micro-architecture23:25
FUZxxlISA design choices heavily affect what the uarch has to take into account23:25
markosno, it says nothing about X instruction requiring 2 cycles or 1023:25
markosthat's uarch-specific23:25
lkclin a *SIMD* ISA, yes.23:25
lkclin a *Scalable Vector* ISA, very deliberately no.23:26
FUZxxlagain: I don't really care about these details, but more about "what is renamed in dependence to what?  What are the hidden data dependencies?  What operations are renames?  What are ALU ops?"23:26
FUZxxle.g. if you have some sort of selection instruction, is it an ALU op or a data-dependent rename?23:26
FUZxxlThis severely affects algorithm design23:26
FUZxxlregardless of what the cycles counts are going to be23:26
markosdo you need to know such info for example on Intel or Arm?23:27
markoswhen writing SIMD code?23:27
FUZxxlI mean if I don't care about perf I can ignore it23:27
markosbut such info always or almost always changes between revisions of the cpu23:27
FUZxxlnot really23:28
markoswell, you mentioned scatter/gather before23:28
FUZxxlthe important bits always stay similar23:28
markosbefore icelake it was crap23:28
markoseven now it's not ideal23:28
FUZxxlit's pretty much about as good as it gets with the number of load/store ports they have23:28
markosin fact I have yet to see more than a few % increase vs not using gather scatter23:28
markosand ALWAYS it depends on the actual cpu model23:28
lkclok so scatter-gather is possible in SVP64 by using Predicate Mask bits, one on source, one on destination23:29
markosie, is it a i9 or a Xeon? and what model?23:29
FUZxxlthey all have the same number of load/store ports in one uarch generation23:29
markosbecause Xeon obviously can scale better in memory loads23:29
lkclif you use "twin" predication - you get the gather-scatter *back-to-back*... in the same instruction23:29
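lkcl's twin-predication description - compress active source elements, expand into active destination slots, in one instruction - can be modelled like so (hypothetical name; a sketch of the element-level semantics as described in this conversation, not of the full SVP64 rules):

```python
def twin_predicated_move(src, dst, src_mask, dst_mask):
    """Twin predication model: elements at 1-bits of src_mask are
    gathered (compressed), then scattered (expanded) into the 1-bit
    slots of dst_mask.  Inactive destination slots keep their old
    values.  Extra active destination slots beyond the gathered
    element count are left untouched."""
    gathered = [s for s, m in zip(src, src_mask) if m]   # gather step
    out = list(dst)
    it = iter(gathered)
    for i, m in enumerate(dst_mask):                     # scatter step
        if m:
            try:
                out[i] = next(it)
            except StopIteration:
                break
    return out
```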
FUZxxlno it can't23:29
lkclthis is "high-level"23:29
lkcl*how* the micro-ops are issued, how *many* are issued, how they are satisified: this is down to the *hardware* implementor to implement23:30
markosthe thing is that with SVP64 you don't have to do anything special with scatter/gather, there is no special instruction needed23:30
lkcland then optimise23:30
FUZxxland this is critical to know23:30
FUZxxldoes it matter if a special instruction is needed?23:30
FUZxxlUsing scatter/gather or not is a choice of data structure and algorithm design23:30
lkcland it will be down to the hardware implementor to either declare what they have done, or for programmers to find out, once hardware is available23:30
FUZxxlnot of whether there are special instructions for it or not23:30
lkcl(likely in about 3-4 years time)23:30
FUZxxllooking forwards to it!23:31
FUZxxlI think it'll be a whole lot more interesting for me once there is silicon available23:31
lkclwe'll be able to do FPGA simulations a lot sooner than that.23:31
FUZxxlthat will be useful23:31
FUZxxlthe tyranny of CPU design is,23:31
FUZxxlif you have the most beautiful programming model23:32
FUZxxlbut it's faster to just hack it and write ugly-ass code, compilers will do exactly that23:32
FUZxxlhistory has repeatedly proven that23:32
lkclyes, i remember the first SPARC compilers, simply padding nops in between all operations23:32
lkclbecause the early hardware had absolutely zero hazard-dependency-checking23:33
FUZxxlso before it is proven that the SVP64 coding model you envision will actually be faster than, say, just treating it as a SIMD arch23:33
* lkcl face-palm23:33
FUZxxlI am going to be wary23:33
lkclwell, in this case it's reasonable to ride off the back of the assumption of a half-decent (advanced) multi-issue out-of-order execution back-end23:33
markosit doesn't matter if it's faster in absolute numbers, it probably won't in the beginning anyway23:33
lkclthose exist.23:33
markosbut if you can run a DCT algorithm in 20 instructions vs 1000 on a 1Ghz 5W SoC, it definitely matters23:34
markosfor example23:34
markosit won't beat the intel/amd/Arm cpus in 5y time23:35
lkclif they exist now, and the "looping" system can be "inserted" - as i said - between Decode and Issue - then it's also reasonable to expect parity between "the Multi-Issue System's Performance Before SVP64" and "the Multi-Issue System's Performance After Adding SVP64"23:35
lkcl(it won't be precisely the same because it's one extra pipeline stage, at least)23:35
FUZxxlmarkos: I'm not sure if it'll be as effective as you think23:36
lkcl(which in turn has a knock-on effect on the ratio of in-flight operations vs the completion latency... blah blah)23:36
FUZxxlBut I think time will tell23:37
FUZxxlI think there's a good chance it'll be amazing and I'm going to love it.23:37
markoswe'll cross that bridge when we get there, nothing more can be said right now23:37
FUZxxlI agree23:37
lkclthere's some obvious things on DCT (i'm going to do 2D DCT next, it'll take about 2 months, same number of instructions)23:38
lkcla "normal" (SIMD) way of doing DCT (let alone 2D DCT) requires either pushing the data back to memory and loading it back again23:38
*** jab <jab!~jab@user/jab> has joined #libre-soc23:38
lkclor it requires copying of the vector of data from register-batch to register-batch23:38
lkclthe SVP64 REMAP system *leaves the entire DCT in-registers in-place*23:39
markosyes, 2 passes, transposing the whole matrix between them and after23:39
lkclyou have the vector of registers for the coefficients23:39
lkclyou have the vector of registers containing the data23:39
FUZxxlImagine a SIMD arch with as large as a register file and as long as a vector length as SVP6423:40
lkclapart from one additional scalar register containing the DCT size, there *are* no other registers needed!23:40
FUZxxlwould the benefits persist?23:40
lkclyes they would23:40
lkclbecause the SIMD arch would firstly have the entire triple-loop as explicit instructions23:40
lkclsecondly, they'd go "argh we don't have a bit-reversed Vector-LD instruction, we have to compute the indices for the permutation"23:41
FUZxxlthe SIMD arch could have that instruction23:41
FUZxxlor it could have a general mechanism to do simple permutations on its operands23:42
FUZxxle.g. linear shifts or reversals23:42
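A minimal Python sketch of the bit-reversed permutation being discussed: this is the index schedule a SIMD implementation would have to compute explicitly (absent a bit-reversed Vector-LD), and which SVP64 REMAP generates in hardware. Helper names here are illustrative, not from the SVP64 codebase.

```python
def bit_reverse(i, bits):
    """Reverse the low `bits` bits of index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def bit_reversed_indices(n):
    """Input-reordering permutation for a size-n FFT/DCT (n a power of 2)."""
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]

# e.g. bit_reversed_indices(8) gives the familiar 0,4,2,6,1,5,3,7 ordering
```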
lkclthirdly (unless it's ARM, with that funny twin-butterfly instruction) they'd go "argh, this data is in reverse-order, it's going to overwrite and destroy the source which we need later, we have to use double the number of registers"23:42
lkclas a hardware-computed schedule, covering the entire triple-loop?23:42
FUZxxlI don't know23:43
FUZxxlI don't understand SVP64 well enough to tell23:43
lkclTI's MSP32* series and Qualcomm's Hexagon DSP only have ZOLC for the inner loop of FFT/DCT23:43
markoslibvpx: Arm NEON: 4k lines of fdct code for 4x4, 8x8, 16x16, 32x32 and that's excluding the transposing functions; x86: 4-8k lines spread across multiple files, multiple implementations, also excluding transposing functions23:44
lkcli haven't been able to find anything in any other ISA like this23:44
FUZxxlit could exist23:44
FUZxxljust like SVP64 could exist23:44
lkclnot in the major ISAs (which, let's be realistic, they're the ones that "matter")23:44
FUZxxlmake a steelman: imagine how a SIMD architecture could reasonably be extended to solve these issues23:44
FUZxxldon't make the mistake of saying "we will add lots of fun stuff but SIMD will stay exactly the same way it is"23:45
lkcli was surprised ARM had that double-butterfly mul-add-shift as a single instruction23:45
lkclSIMD is SIMD (at the back-end)23:45
FUZxxlsure.  And the double-butterfly mul-add-shift stuff is possible.23:45
FUZxxlAs is what AVX-512 does.23:45
FUZxxlAs is probably a whole lot more.23:45
markosFUZxxl, the solution followed by intel/arm/etc is to just add more instructions23:45
lkclyou can't escape the fact that there's really only one way to build a SIMD ALU: put down a batch of parallel gates.23:45
FUZxxlno, you can also shove the register through in batches, as e.g. Zen4 does23:46
FUZxxlyou could even do each element individually23:46
FUZxxland while e.g. SIMD shuffles are ALU ops right now, they might well be renames in future iterations of the design23:46
lkclyes - that's taking the "Cray Vector Chaining" concept from the 1970s (first implemented in Cray-I) and applying it to modern architectures23:47
FUZxxlcorrect -- it's all uarch23:47
FUZxxlso why should SIMD not benefit from better uarch design while you are certain the SVP64 will?23:47
lkclbecause once - in your mind - you have separated the "SIMD-as-a-direct-front-end-ISA" from "SIMD-at-the-back-end-in-hardware"23:48
lkclthen realised that Scalable Vector ISAs *also* require (pretty much) identical back-end Engines *as* those processors which have a SIMD-front-end-ISA23:48
lkcland accepted that, comparing apples-with-apples you make the *back-ends* the same23:49
lkclthen you can focus on comparing the *front-ends*23:49
FUZxxlbut then (to loop back to the previous point), the backend will not solve the loop-carried-dependency issues23:49
markosimho, SIMD is not future proof, right now Intel+Arm+Power: 17k total C intrinsics, slightly fewer asm instructions, Intel + Arm take the majority, VSX by comparison is just ~200 intrinsics23:49
markosyou can't just keep adding more and more instructions23:50
lkclwithin about.. how long markos? about eight weeks? we've found that people go "holy s*** this is so much easier *and* drastically reduces instruction count"23:50
FUZxxlIf you look at AVX-512 and NEON, it doesn't actually have all that many truly distinct instructions23:50
markosboth AVX512 and SVE -to a lesser extent- *have* to lower clock frequency when AVX512/SVE* kick in23:50
FUZxxlit's mostly the same instruction counted a bunch of times for different data types and vector sizes23:50
markoswhich lowers performance23:50
lkclit was 1/2 hour ago (or so) and i wasn't quite following: what's "loop-carried-dependencies"?  you mean like prefix-sum and reduction?23:51
FUZxxlEver since Icelake they no longer do it23:51
markosmost of the times the instructions are similar -not 1-1 mapping ofc23:51
FUZxxllkcl: e.g. I was talking about the UTF-8 code where to find the start of the next character you have to finish analysing the previous one (in the scalar algorithm)23:51
FUZxxlSo even an out of order uarch will not be able to process more than one character at a time23:51
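A small Python sketch of the loop-carried dependency FUZxxl is describing: in scalar UTF-8 decoding, the start offset of character k+1 is only known after the lead byte of character k has been decoded, so even an out-of-order core cannot begin the next iteration early. The function name is illustrative only.

```python
def utf8_char_starts(data):
    """Return the byte offset of each character in well-formed UTF-8 data."""
    starts = []
    i = 0
    while i < len(data):
        starts.append(i)
        b = data[i]
        if b < 0x80:
            i += 1            # ASCII: 1-byte sequence
        elif b >> 5 == 0b110:
            i += 2            # 2-byte sequence
        elif b >> 4 == 0b1110:
            i += 3            # 3-byte sequence
        else:
            i += 4            # 4-byte sequence
        # the next iteration's `i` depends on this iteration's lead byte:
        # that data dependency chains every character to the previous one
    return starts
```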
markosand intel has some pretty specific instructions, eg. movemask set of instructions23:51
lkclah yes.  then you have two choices.23:51
FUZxxlmarkos: these are obsolete with AVX-51223:52
lkclbut you can at least chuck the instructions into in-flight buffers23:52
markoswhich do not exist on Arm, so you have to emulate them23:52
FUZxxlas AVX-512 adds masks to all loads and stores23:52
markosnot obsolete, just done differently23:52
markosyes I know23:52
lkcland you can at least benefit from that (whilst still kicking yourself in frustration that the scalar-chained-hazards exist)23:52
FUZxxllkcl: with e.g. an 8-wide frontend on the Apple M1, you can chuck them quickly enough even in the conventional paradigm23:52
lkcllike you mentioned23:52
markossurprisingly VSX does offer a very similar, actually more powerful, instruction that you can use to emulate movemask in a few instructions23:52
lkclyehyeh, i'm counting on that :)23:53
FUZxxlSo I don't quite see the benefit still23:53
FUZxxl(I usually do uarch simulations on my code and the frontend is rarely the bottleneck)23:53
lkclyou can parallelise the algorithm and treat the Vector Elements partially as-if they were SIMD23:53
lkcli.e. you *can* set the Vector Length to an explicit width23:53
FUZxxlexactly -- which requires a much more complex algorithm like the one I designed23:53
lkcland effectively "emulate" SIMD.23:53
lkclbut here with SVP64 you have the choice23:54
FUZxxlAnd it's unclear if it can be expressed in SVP6423:54
lkcl(and a third option)23:54
lkclthe fact that we can "emulate" (not quite the right word) SIMD means "yes you can"23:54
markossome inspiration, tricks can be emulated, but I think the whole point is that you don't really need a specific algorithm to write SVP64 code23:54
FUZxxlSo to summarise: for my combinatorial problems, SVP64 will not make the code shorter or easier or better most likely.23:54
lkcli spent 4 years taking every concept from every SIMD and Scalable Vector ISA i could get my hands on and including them23:54
markosyou don't know that until you try it23:55
lkclah, i haven't gotten onto "option 3" - Vertical_First, yet :)23:55
FUZxxlI am all ears.23:55
markosquite possibly it's going to be shorter, no way to tell if it's going to be faster though23:55
lkcli had to do it as a video23:55
lkclbut there's a diagram that explains it quicker23:55
lkcl1 sec23:55
FUZxxlmarkos: how is it going to be shorter?  The SIMD code is not unrolled and each instruction does some elementary step that needs to be performed.23:56
markosas I said, no way to know that until someone tries23:56
markosbecause so far it has been shorter than all SIMD implementations of functions we ported23:57
markos(the SVP64 version that is)23:57
FUZxxlThere's nothing in the code that really needs a fixed vector length, it would most likely be directly translatable to the vector paradigm if it has all the relevant instructions.23:57
lkclHorizontal-First is how both Cray-style Vectors *and* SIMD ISAs do "elements"23:57
FUZxxlHowever, I doubt it would be faster or shorter.23:57
lkclthey progress through *elements* first before moving onto instruction *second*23:57
lkclVertical-First is the *other way round*23:57
lkclthe hardware maintains an element-offset-pointer23:58
lkclwhich *remains* at its current index until *explicitly* incremented23:58
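A hedged Python sketch of the execution-order difference lkcl is describing, assuming a two-instruction loop body over VL elements (function names are illustrative, not actual SVP64 API):

```python
def horizontal_first(body, vl):
    """Cray/SIMD style: each instruction sweeps all elements before the
    next instruction runs."""
    trace = []
    for insn in body:
        for el in range(vl):
            trace.append((insn, el))
    return trace

def vertical_first(body, vl):
    """SVP64 Vertical-First: the element-offset pointer stays fixed while
    the whole loop body executes, and only advances when explicitly
    stepped (the svstep instruction in SVP64)."""
    trace = []
    for el in range(vl):          # element pointer increments only here
        for insn in body:
            trace.append((insn, el))
    return trace
```

Same instructions, same elements; only the iteration order differs, which is what lets Vertical-First keep scalar-style dependency chains per element.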
FUZxxlmarkos: the functions you've shown me were either trivially parallelisable or have very simple data-dependencies23:58
FUZxxlAnd I believe you in that they can be expressed much easier23:58
FUZxxllkcl: ok23:58
lkclFUZxxl, the reason for that is exactly the benefit23:58
lkclwe had to go back to the scalar implementations because the SIMD implementations were so obtuse, so heavily optimised, they're unreadable23:59
FUZxxlI'm 100% certain future SVP64 kernels will be too23:59
markosmp3 is anything but trivially parallelizable but ok23:59
markosI showed you one line23:59