Friday, 2022-04-22

15:28 <lkcl> hooray, i'm getting linux kernel boot messages on the arty a7-100t
15:28 <lkcl> they're a kernel panic, but they're messages
15:28 <lkcl> bout damn time
20:30 <programmerjake> yay!
20:36 <programmerjake> rust-lang finally switched everything to not require a login to view, so you can see project-portable-simd at https://rust-lang.zulipchat.com/#narrow/stream/257879-project-portable-simd
20:50 <lkcl> hoorah
21:12 <lkcl> interesting discussion with riccardo. he's not wrong about the larger designs. we'd need to do a 1,024-way SMP system to take on where NVIDIA etc. is at the moment
21:12 <lkcl> or
21:13 <lkcl> implement that hybrid of EXTRA-V plus ZOLC plus snitch
21:14 <lkcl> and the processing-redistribution idea
21:14 <programmerjake> imho we'd have like 256 cores and f32x16 simd units
21:15 <lkcl> yeah could work
21:15 <programmerjake> and extending svp64 to have more registers -- 128 is too small
21:15 <lkcl> an embedded GPU is easily achievable (without going mad), because the targets are lower
21:16 <lkcl> that brings the maximum clock rate down unless doing an L1 cache for the registers. which would kinda have to be done anyway
21:16 <lkcl> like many people working with GPUs, riccardo doesn't have any knowledge of the internals of SIMT, and both NVIDIA and AMD like it to stay that way
21:16 <programmerjake> well...amd gpus are currently around 2.x ghz, so...
21:16 <lkcl> you have to look at MIAOW to find out what the hell's going on
21:17 <programmerjake> nvidia gpus are still <2ghz
21:17 <programmerjake> iirc
21:17 <lkcl> and you find that it's basically "a bunch of standard processors with a common L1 cache and a common instruction fetch/decode plus instruction-broadcast bus"
21:17 <lkcl> meaning:
21:18 <lkcl> there is *only* 1 PC per group
21:18 <lkcl> consequently, that conditional if/else *has* to be done as predicated
21:18 <lkcl> on one of the cores on the same broadcast bus, the predicate would go the way of "mmm" and on the next clock "a -= b"
21:18 <programmerjake> nvidia's gpus are 2-way superscalar...
21:19 <lkcl> and on another it would go "a += b" followed by "mm"
21:19 <lkcl> intriguing
21:19 <lkcl> if you're 2-way multi-issue that'd be a neat way to ensure at least one instruction got into the queue
21:20 <lkcl> if the pattern of "if x else y" with single-instruction-x and single-instruction-y is very common
21:20 <lkcl> you're guaranteed there to execute one instruction, you just don't know which one
21:21 <lkcl> but by having a buffer-of-two-at-a-time you know *that* one is guaranteed to be executed
21:21 <lkcl> in one case it will be (as if) PC is executed
21:21 <lkcl> and in the other it will be (as if) PC+1 is executed
21:21 <lkcl> neat
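
A minimal sketch (plain Python, not project code) of the single-PC predication described above: both arms of the if/else are broadcast to every lane as predicated instructions, each lane commits only the arm whose predicate bit matches, and the pairing means every lane is guaranteed to retire exactly one of the two. The comparison used to form the predicate is made up for illustration.

    # each lane computes its own predicate bit, but all lanes share one PC,
    # so both predicated instructions are issued to every lane in lockstep
    def simt_if_else(a, b):
        pred = [ai < bi for ai, bi in zip(a, b)]   # hypothetical condition

        # broadcast op 1: "a += b", predicated on the bit being clear
        a = [ai + bi if not p else ai for ai, bi, p in zip(a, b, pred)]
        # broadcast op 2: "a -= b", predicated on the bit being set
        a = [ai - bi if p else ai for ai, bi, p in zip(a, b, pred)]
        return a

    print(simt_if_else([1, 5, 3], [4, 2, 3]))   # [-3, 7, 6]
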
21:23 <programmerjake> so...one way to make the larger number of registers go faster is to have them banked 16 ways -- 1 way per simd lane...and then just make lane crossing take extra cycles
21:24 <lkcl> yes. i planned a cyclic buffer, there.
21:24 <lkcl> two-way-directional (so, actually, a pair of cyclic buffers)
21:25 <lkcl> where the lower bits - modulo number-of-lanes - would be a "conveyor drop-off counter"
21:25 <lkcl> problem is you'd better damn well have workloads that match the striping
21:26 <lkcl> otherwise eeeverything goes to hell
21:26 <lkcl> especially scalar operation
21:26 <lkcl> therefore the additional refinement would be to have r0-r31 be in a standard scalar regfile with tons of ports
21:27 <lkcl> and for r32-r127 to be "striped"
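
A rough sketch of that split, in plain Python with made-up numbers (16 lanes, r0-r31 scalar) rather than anything from the actual design: each striped register has a home lane, and an access from any other lane pays one hop per lane around whichever of the two counter-rotating cyclic buffers is shorter.

    N_LANES = 16        # assumed lane count (the f32x16 figure above)
    SCALAR_REGS = 32    # r0-r31: conventional many-ported scalar regfile

    def home_lane(regnum):
        """which lane's bank holds this register (None = scalar file)"""
        if regnum < SCALAR_REGS:
            return None
        return (regnum - SCALAR_REGS) % N_LANES   # striped, modulo lane count

    def crossing_cost(regnum, requesting_lane):
        """extra cycles to reach regnum from requesting_lane, assuming a
        pair of counter-rotating rings: take the shorter direction"""
        home = home_lane(regnum)
        if home is None:
            return 0    # scalar file reachable from every lane
        fwd = (home - requesting_lane) % N_LANES
        return min(fwd, N_LANES - fwd)

    # e.g. r50 lives in lane (50 - 32) % 16 = 2; from lane 5 that is
    # min(13, 3) = 3 hops, hence the pair of rings rather than a single one
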
21:27 <programmerjake> that's also why I want tree-reduction to use moves and not rely only on remap....because the moves move the values to the correct lanes using the special reduce data paths (small subset of full inter-lane data paths -- they can be used for non-reduce ops too, they're just designed for reduce), avoiding the need to delay a bunch for lane crossing because we decided to leave data in the wrong lane cuz remap has nicer "purity" or something
21:27 <lkcl> no.
21:27 <lkcl> that's an implementation detail, chosen by the implementor (which happens to be us)
21:28 <lkcl> that implementation detail *must not* be back-propagated to the algorithm, poisoning and destroying the SVP64 API in the process
21:28 <lkcl> i've said no multiple times.
21:29 <lkcl> switching the operation internally to a micro-coded MV as a microarchitectural detail is perfectly fine
21:29 <programmerjake> well...afaict that implementation detail needs to be put in the semantics, otherwise it's impossible to implement...maybe
21:30 <lkcl> that'll be up to us to work out, yes, because putting something in the spec that's unworkable in practice is worse than useless
21:30 <programmerjake> since iirc remap produces different behavior than moves...lanes are left unmodified rather than overwritten by moves
21:31 <lkcl> which is why the FFT/DCT took 6-8 weeks, i had to do a maaajor tech-heavy nose-dive
21:31 <lkcl> yes. or, more to the point, REMAP expresses the *desire* to access registers
21:31 <lkcl> it's then up to the micro-architecture to work out how to implement those efficiently and effectively
21:32 <lkcl> this is one of the major, *major* differences between a Cray-style ISA and a SIMD ISA
21:32 <lkcl> the internal micro-architecture of SIMD "bleeds up to" (is exposed to) the ISA and the programmer is basically told, "we couldn't be bothered: here, you deal with it"
21:33 <lkcl> despite what only look like subtle differences as far as the programmer is concerned, a Cray-style ISA goes "i'm going to put a little more thought into the micro-architecture so that you, the programmer, *don't* have to deal with this crap"
21:34 <lkcl> and the inclusion of the MV in the reduce scheme falls into the former category, unfortunately.
21:34 <programmerjake> well...imho we should have moves even if it's not "pure" or whatever (i think it fits fine into svp64's semantics and don't think it has a purity problem), because it makes fast reduces possible for an important set of target microarchitectures (those with simd backends and slow fully-general lane crossing)
21:35 <lkcl> the other reason for not having it is because it makes Vertical-First parallel-reduce mode with predication almost impossible to understand
21:35 <lkcl> principle of MAXIMUM surprise rather than least surprise
21:35 <programmerjake> if you, the programmer, don't want moves, just follow basically every other arch and don't predicate your reductions
21:36 <programmerjake> imho reduction shouldn't be in vertical first mode...
21:36 <lkcl> a user is trying to read / single-step through code, and can't work out why the bloody hell the registers contain the wrong values
21:37 <lkcl> it's there, it's going to be expected to work.
21:37 <lkcl> i'm looking to split out the parallel-reduce implementation as a REMAP option
21:37 <programmerjake> vertical first mode is only well suited to vector ops where lane-crossing doesn't occur....it's waaay too confusing otherwise
21:38 <lkcl> once separated out it can be set up with SV REMAP instructions, at the top of a loop, just like DCT and FFT.
21:39 <lkcl> if the operation changes to a MV half-way through it completely throws off how REMAP works, yes.
21:39 <lkcl> REMAP has 4 "re-targets" which can be applied to RA/RB/RC/RT/RS-or-EA
21:40 <programmerjake> imho reduction is important enough that you should be able to use a single svp64 horizontal instruction to do reduction, not taking several instructions to set up remap
21:40 <lkcl> it's a single instruction, jacob. look at the examples and the implementation and the pseudocode.
21:41 <lkcl> by a happy coincidence the 5 slots happen to fall on different registers so that the 4 REMAPs can be applied once and only once
21:42 <lkcl> which was dead-lucky
21:42 <lkcl> bottom line, for a ton of reasons: hard-no on MV as an explicitly-exposed operation
21:43 <programmerjake> imho moves don't cause problems for remap...they aren't actual move instructions...they're just changing the op to copy the appropriate input to the result without modifying it...so an add reduce with sv.add rt.v, ra.v, rb.v will still use the rb remap option when the add is replaced with a move from rb...even though there isn't a mov instruction in openpower that has rb as an input
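
A small sketch, in plain Python, of one reading of the reduce-with-mv idea above (not spec text, and not the actual SVP64 schedule): when one side of a pairing is masked out, the add degenerates into a copy of the surviving operand into the destination lane, so live values keep migrating down the tree instead of being stranded by a skipped step.

    def tree_reduce_with_moves(vec, mask):
        v, live = list(vec), list(mask)
        n, step = len(vec), 1
        while step < n:
            for lo in range(0, n - step, step * 2):
                hi = lo + step
                if live[lo] and live[hi]:
                    v[lo] = v[lo] + v[hi]     # normal reduce step
                elif live[hi]:
                    v[lo] = v[hi]             # op replaced by a move
                    live[lo] = True
                # if only v[lo] is live (or neither), nothing happens
            step *= 2
        return v[0] if any(mask) else None

    print(tree_reduce_with_moves([1, 2, 3, 4], [False, True, False, True]))  # 6
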
21:43 <lkcl> which means more work for us, sigh, but it's fantastically intriguing
21:43 <lkcl> just... no.
21:43 <lkcl> i've said no.
21:45 <lkcl> probably around ten to fifteen times in total, i've said no
21:45 <programmerjake> yeah, i know... i just think some of your reasons for saying no are based on some misunderstandings of how the reduce-with-mv would work
21:45 <programmerjake> so if you fully understand it you might not say no
21:46 <lkcl> i get it, and i appreciate you going over it
21:46 <lkcl> some things i instinctively get, others i don't. usually i can tell the difference :)
21:47 <lkcl> the index-redirections effectively represent the MVs.
21:47 <lkcl> what i'd be interested to see is whether some pre-processing can or has to be carried out
21:48 <programmerjake> the index-redirections can't replace moves...they give different results...making prefix-sum based on the tree-reduce much harder to implement...
21:48 <lkcl> deep breath, it'll probably need a micro-simulator
21:48 <lkcl> that's what i thought about DCT/FFT
21:49 <lkcl> and it turned out that, retrospectively, the lane-swapping utilises Gray coding!
21:49 <lkcl> i didn't recognise it at the time
21:49 <programmerjake> :)
21:50 <lkcl> that means it's fully 100% deterministic and can be done in hw as a pre-prep stage
21:50 <lkcl> so based on that really nice surprise i wondered - expected - something similar to pop out of the parallel-reduce thing
21:51 <lkcl> there's only *one* thing that's "broken" and it's when the predicate only has a single bit set
21:51 <programmerjake> well...afaict it's something based on count-trailing-zeros of the lane index
21:52 <lkcl> there you go :)
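
For reference, a sketch (plain Python, a reconstruction from the count-trailing-zeros remark, not the actual REMAP pseudocode) of the unpredicated tree-reduce schedule: element i (for i > 0) is consumed exactly once, at step ctz(i), by the element 2**ctz(i) positions below it, so the whole schedule is fixed in advance and could be precomputed in hardware, much like the Gray-code pattern in the DCT/FFT REMAP.

    def ctz(i):
        # count trailing zeros of a positive integer
        return (i & -i).bit_length() - 1

    def reduce_schedule(vl):
        """(step, dest, src) triples for a vl-element tree-reduce into element 0"""
        return sorted((ctz(i), i - (1 << ctz(i)), i) for i in range(1, vl))

    for step, dest, src in reduce_schedule(8):
        print(f"step {step}: r{dest} += r{src}")
    # schedule for vl=8: step 0 pairs (0,1) (2,3) (4,5) (6,7),
    # step 1 pairs (0,2) (4,6), step 2 pair (0,4)
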
21:52 <programmerjake> well...prefix sum needs all lanes, not just lane 0, so that's not the only thing broken by reduce not moving
21:53 <lkcl> remember, the idea is to *work out* the mvs needed, transparently, so that the user doesn't have to know they happened
21:53 <lkcl> which would be something that a high-performance implementation would do
21:53 <lkcl> but an embedded one most definitely would not
21:53 <lkcl> because it would be issuing scalar ops one at a time anyway
21:55 <programmerjake> well...they have to actually move for prefix-sum. even if it's easy to work out where they should move and construct a remap table...they still have to actually move the data otherwise afaict prefix-sum just skips writing the result to some non-masked-off lanes...broken essentially
21:57 <lkcl> yep, entirely skipped.
21:58 <lkcl> only the actual operations hit the result vector
21:58 <programmerjake> only lanes that are masked-out should be skipped, not non-masked-off lanes
21:59 <programmerjake> hence why i said it's broken
22:00 <lkcl> this is where i'd need to see it
22:00 <lkcl> for no predicate the results are as-expected (obvious)
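
A toy illustration (not project code) of the point of contention: a prefix-sum has to write every unmasked lane with its running total, whereas the reduce schedule above only ever writes its destination lanes, so without real data movement the remaining lanes are left holding stale or partial values.

    def prefix_sum(vec):
        out, running = [], 0
        for x in vec:
            running += x
            out.append(running)    # every lane receives a result
        return out

    print(prefix_sum([1, 2, 3, 4]))    # [1, 3, 6, 10]
    # the pure tree-reduce of the same input ends as [10, 2, 7, 4]:
    # only lanes 0 and 2 were ever written, lanes 1 and 3 keep their inputs
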
22:00 <programmerjake> k, i'll write up an algorithm...
22:00 <lkcl> appreciated.
22:52 <programmerjake> well, turns out that that tree-reduction algorithm is the first step of postfix-sum, not prefix-sum...oops
23:03 <programmerjake> welp: https://git.libre-soc.org/?p=nmutil.git;a=commitdiff;h=49023473045e166aff508d75993276b5864b6ef8
23:04 <programmerjake> the first half of the work-efficient prefix-sum algorithm only reduces when the input element-count is a power-of-2
23:06 <programmerjake> I could probably munge the tree-reduction algorithm into a postfix-sum, but not easily...I'm giving up for now
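
For context, the textbook work-efficient (Blelloch-style) exclusive prefix-sum in plain Python; this is the standard algorithm, not the nmutil code in the commit above. Its up-sweep is exactly a tree-reduce (the mirror-image of the schedule sketched earlier, accumulating towards the top element), which is why the reduce only falls out directly when the element count is a power of two.

    def blelloch_scan(vec):
        n = len(vec)
        assert n & (n - 1) == 0, "power-of-two length only"
        v = list(vec)

        # up-sweep: a tree-reduce; v[n-1] ends up holding the total
        step = 1
        while step < n:
            for hi in range(2 * step - 1, n, 2 * step):
                v[hi] += v[hi - step]
            step *= 2

        # down-sweep: turn the partial sums into an exclusive prefix-sum
        v[n - 1] = 0
        step = n // 2
        while step >= 1:
            for hi in range(2 * step - 1, n, 2 * step):
                v[hi - step], v[hi] = v[hi], v[hi] + v[hi - step]
            step //= 2
        return v

    print(blelloch_scan([1, 2, 3, 4]))    # [0, 1, 3, 6]
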
