lkcl | hooray, i'm getting linux kernel boot messages on the arty a7-100t | 15:28 |
---|---|---|
lkcl | they're a kernel panic, but they're messages | 15:28 |
lkcl | bout damn time | 15:28 |
programmerjake | yay! | 20:30 |
programmerjake | rust-lang finally switched everything to not require a login to view, so you can see project-portable-simd at https://rust-lang.zulipchat.com/#narrow/stream/257879-project-portable-simd | 20:36 |
lkcl | hoorah | 20:50 |
lkcl | interesting discussion with riccardo. he's not wrong about the larger designs. we'd need to do a 1,024-way SMP system to take on where NVIDIA etc. is at the moment | 21:12 |
lkcl | or | 21:12 |
lkcl | implement that hybrid of EXTRA-V plus ZOLC plus snitch | 21:13 |
lkcl | and the processing-redistribution idea | 21:14 |
programmerjake | imho we'd have like 256 cores and f32x16 simd units | 21:14 |
lkcl | yeah could work | 21:15 |
programmerjake | and extending svp64 to have more registers -- 128 is too small | 21:15 |
lkcl | an embedded GPU is easily achievable (without going mad), because the targets are lower | 21:15 |
lkcl | that brings the maximum clock rate down unless doing a L1 cache for the registers. which would kinda have to be done anyway | 21:16 |
lkcl | like many people working with GPUs, riccardo doesn't have any knowledge of the internals of SIMT and both NVIDIA and AMD like it to stay that way | 21:16 |
programmerjake | well...amd gpus are currently around 2.x ghz, so... | 21:16 |
lkcl | you have to look at MIAOW to find out what the hell's going on | 21:16 |
programmerjake | nvidia gpus are still <2ghz | 21:17 |
programmerjake | iirc | 21:17 |
lkcl | and you find that it's basically "a bunch of standard processors with a common L1 cache and a common instruction fetch/decode plus instruction-broadcast bus" | 21:17 |
lkcl | meaning: | 21:17 |
lkcl | there is *only* 1 PC per group | 21:18 |
lkcl | consequently, that conditional if/else *has* to be done as predicated | 21:18 |
lkcl | on one of the cores on the same broadcast bus, the predicate would go the way of "mmm" and on the next clock "a -= b" | 21:18 |
programmerjake | nvidia's gpus are 2-way superscalar... | 21:18 |
lkcl | and on another it would go "a += b" followed by "mm" | 21:19 |
lkcl | intriguing | 21:19 |
lkcl | if you're 2-way multi-issue that'd be a neat way to ensure at least one instruction got into the queue | 21:19 |
lkcl | if the pattern of "if x else y" with single-instruction-x and single-instruction-y is very common | 21:20 |
lkcl | you're guaranteed there to execute one instruction, you just don't know which one | 21:20 |
lkcl | but by having a buffer-of-two-at-a-time you know *that* one is guaranteed to be executed | 21:21 |
lkcl | in one case it will be (as if) PC is executed | 21:21 |
lkcl | and in the other it will be (as if) PC+1 is executed | 21:21 |
lkcl | neat | 21:21 |
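The single-PC, predicated if/else scheme described above can be sketched as follows. This is a hypothetical illustration, not MIAOW or NVIDIA code: both sides of `if c: a += b else: a -= b` are broadcast to every lane, and each lane's predicate bit decides which of the two instructions actually writes back, so with 2-way issue exactly one of the pair is guaranteed to execute per lane.

```python
# Sketch (hypothetical) of predicated execution of "if c: a += b else: a -= b"
# on a SIMT group with a single shared PC. Both instructions are broadcast to
# all lanes; the per-lane predicate decides which one takes effect, so every
# lane executes exactly one of the two.

def simt_if_else(a, b, cond):
    """a, b: per-lane register values; cond: per-lane predicate bits."""
    for lane in range(len(a)):
        # broadcast instruction 1: "a += b", predicated on cond
        if cond[lane]:
            a[lane] += b[lane]      # this lane executes; the others go "mmm"
        # broadcast instruction 2: "a -= b", predicated on not-cond
        if not cond[lane]:
            a[lane] -= b[lane]      # the complementary lanes execute here
    return a

print(simt_if_else([10, 10, 10, 10], [3, 3, 3, 3], [1, 0, 1, 0]))
# → [13, 7, 13, 7]
```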
programmerjake | so...one way to make the larger number of registers go faster is to have them banked 16 ways -- 1 way per simd lane...and then just make lane crossing take extra cycles | 21:23 |
lkcl | yes. i planned a cyclic buffer, there. | 21:24 |
lkcl | two-way-directional (so, actually, a pair of cyclic buffers) | 21:24 |
lkcl | where the lower bits - modulo number-of-lanes - would be a "conveyor drop-off counter" | 21:25 |
lkcl | problem is you'd better damn well have workloads that match the striping | 21:25 |
lkcl | otherwise eeeverything goes to hell | 21:26 |
lkcl | especially scalar operation | 21:26 |
lkcl | therefore the additional refinement would be to have r0-r31 be in a standard scalar regfile with tons of ports | 21:26 |
lkcl | and for r32-r127 to be "striped" | 21:27 |
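The split just described (scalar r0-r31 in a many-ported file, r32-r127 striped 16 ways across lane-local banks, with a bidirectional cyclic buffer for lane crossing) can be sketched roughly as below. All names and the cycle numbers are illustrative assumptions, not the actual design:

```python
# Hypothetical sketch of the regfile split discussed above: r0-r31 in a
# many-ported scalar file, r32-r127 striped across 16 lane-local banks.
# The register number modulo the lane count selects the bank, so in-lane
# access is fast and a lane-crossing access pays extra cycles travelling
# round the pair of cyclic (conveyor) buffers.

NLANES = 16

def locate(regnum):
    """Return (file, bank, index) for a register number 0..127."""
    if regnum < 32:
        return ("scalar", None, regnum)   # full-ported scalar regfile
    striped = regnum - 32
    bank = striped % NLANES               # lane-local bank select
    index = striped // NLANES             # entry within that bank
    return ("striped", bank, index)

def access_cost(regnum, lane):
    """1 cycle in-lane; extra cycles per hop to cross lanes (illustrative)."""
    kind, bank, _ = locate(regnum)
    if kind == "scalar":
        return 1
    hops = (bank - lane) % NLANES         # distance around the ring
    return 1 + min(hops, NLANES - hops)   # two directions: take the shorter

print(locate(5))                          # ('scalar', None, 5)
print(locate(50))                         # ('striped', 2, 1)
print(access_cost(50, 2), access_cost(50, 10))
```

This also makes concrete why workloads that don't match the striping "go to hell": every operand access becomes a multi-hop conveyor trip.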
programmerjake | that's also why I want tree-reduction to use moves and not rely only on remap....because the moves move the values to the correct lanes using the special reduce data paths (small subset of full inter-lane data paths -- they can be used for non-reduce ops too, they're just designed for reduce), avoiding the need to delay a bunch for lane crossing because we decided to leave data in the wrong lane cuz remap has nicer "purity" or | 21:27 |
programmerjake | something | 21:27 |
lkcl | no. | 21:27 |
lkcl | that's an implementation detail, chosen by the implementor (which happens to be us) | 21:27 |
lkcl | that implementation detail *must not* be back-propagated to the algorithm, poisoning and destroying the SVP64 API in the process | 21:28 |
lkcl | i've said no multiple times. | 21:28 |
lkcl | switching the operation internally to a micro-coded MV as a microarchitectural detail is perfectly fine | 21:29 |
programmerjake | well...afaict that implementation detail needs to be put in the semantics, otherwise it's impossible to implement...maybe | 21:29 |
lkcl | that'll be up to us to work out, yes, because putting something in the spec that's unworkable in practice is worse than useless | 21:30 |
programmerjake | since iirc remap produces different behavior than moves...lanes are left unmodified rather than overwritten by moves | 21:30 |
lkcl | which is why the FFT/DCT took 6-8 weeks, i had to do a maaajor tech-heavy nose-dive | 21:31 |
lkcl | yes. or, more to the point, REMAP expresses the *desire* to access registers | 21:31 |
lkcl | it's then up to the micro-architecture to work out how to implement those efficiently and effectively | 21:31 |
lkcl | this is one of the major, *major* differences between a Cray-style ISA and a SIMD ISA | 21:32 |
lkcl | the internal micro-architecture of SIMD "bleeds up to" (is exposed to) the ISA and the programmer is basically told, "we couldn't be bothered: here, you deal with it" | 21:32 |
lkcl | despite what only look like subtle differences as far as the programmer is concerned, a Cray-style ISA goes "i'm going to do a little more thought inside the micro-architecture so that you, the programmer, *don't* have to deal with this crap" | 21:33 |
lkcl | and the inclusion of the MV in the reduce scheme falls into the former category, unfortunately. | 21:34 |
programmerjake | well...imho we should have moves even if it's not "pure" or whatever (i think it fits fine into svp64's semantics and don't think it has a purity problem), because it makes fast reduces possible for an important set of target microarchitectures (those with simd backends and slow fully-general lane crossing) | 21:34 |
lkcl | the other reason for not having it is because it makes Vertical-First parallel-reduce mode with predication almost impossible to understand | 21:35 |
lkcl | principle of MAXIMUM surprise rather than least surprise | 21:35 |
programmerjake | if you, the programmer, don't want moves, just follow basically every other arch and don't predicate your reductions | 21:35 |
programmerjake | imho reduction shouldn't be in vertical first mode... | 21:36 |
lkcl | a user is trying to read / single-step through code, and can't work out why the bloody hell the registers contain the wrong values | 21:36 |
lkcl | it's there, it's going to be expected to work. | 21:37 |
lkcl | i'm looking to split out the parallel-reduce implementation as a REMAP option | 21:37 |
programmerjake | vertical first mode is only well suited to vector ops where lane-crossing doesn't occur....it's waaay tooo confusing otherwise | 21:37 |
lkcl | once separated out it can be set up with SV REMAP instructions, at the top of a loop, just like DCT and FFT. | 21:38 |
lkcl | if the operation changes to a MV half-way through it completely throws off how REMAP works, yes. | 21:39 |
lkcl | REMAP has 4 "re-targets" which can be applied to RA/RB/RC/RT/RS-or-EA | 21:39 |
programmerjake | imho reduction is important enough that you should be able to use a single svp64 horizontal instruction to do reduction, not taking several instructions to set up remap | 21:40 |
lkcl | it's a single instruction, jacob. look at the examples and the implementation and the pseudocode. | 21:40 |
lkcl | by a happy coincidence the 5 slots happen to fall on different registers so that the 4 REMAPs can be applied once and only once | 21:41 |
lkcl | which was dead-lucky | 21:42 |
lkcl | bottom line, for a ton of reasons: hard-no on MV as an explicitly-exposed operation | 21:42 |
programmerjake | imho moves don't cause problems for remap...they aren't actual move instructions...they're just changing the op to copy the appropriate input to the result without modifying it...so an add reduce with sv.add rt.v, ra.v, rb.v will still use the rb remap option when the add is replaced with a move from rb...even though there isn't a mov instruction in openpower that has rb as an input | 21:43 |
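The op-substitution idea in the message above can be sketched like this. It is an illustration of the proposal only, not SVP64 semantics, and the pairing is simplified: during a predicated tree-reduce, when one input of a pair is masked out, the reduce op is internally swapped for a copy of the live operand, so the value still migrates toward lane 0 instead of being stranded in the wrong lane.

```python
# Hedged sketch (not the SVP64 spec) of the "op becomes a move" proposal:
# in a predicated tree-reduce, a pair with one masked-out input performs
# an internal copy of the live operand instead of the op, so values keep
# migrating toward lane 0 along the reduce data paths.

def predicated_tree_reduce(vec, mask, op):
    vec, live = list(vec), list(mask)
    stride = 1
    while stride < len(vec):
        for i in range(0, len(vec), stride * 2):
            j = i + stride
            if j >= len(vec):
                continue
            if live[i] and live[j]:
                vec[i] = op(vec[i], vec[j])   # normal reduce op
            elif live[j]:
                vec[i] = vec[j]               # op replaced by a move from "rb"
                live[i] = True
            # only lane i live: value already in place; neither live: skip
        stride *= 2
    return vec[0] if live[0] else None

# single live element: the result reaches lane 0 by moves alone
print(predicated_tree_reduce([9, 9, 9, 5], [0, 0, 0, 1], lambda a, b: a + b))
# → 5
```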
lkcl | which means more work for us, sigh, but it's fantastically intriguing | 21:43 |
lkcl | just... no. | 21:43 |
lkcl | i've said no. | 21:43 |
lkcl | probably around ten to fifteen times in total, i've said no | 21:45 |
programmerjake | yeah, i know..i just think some of your reasons for saying no are based on some misunderstandings of how the reduce-with-mv would work | 21:45 |
programmerjake | so if you fully understand it you might not say no | 21:45 |
lkcl | i get it, and i appreciate you going over it | 21:46 |
lkcl | some things i instinctively get, others i don't. usually i can tell the difference :) | 21:46 |
lkcl | the index-redirections effectively represent the MVs. | 21:47 |
lkcl | what i'd be interested to see is whether some pre-processing can or has to be carried out | 21:47 |
programmerjake | the index-redirections can't replace moves...they give different results...making prefix-sum based on the tree-reduce much harder to implement... | 21:48 |
lkcl | deep breath, it'll probably need a micro-simulator | 21:48 |
lkcl | that's what i thought about DCT/FFT | 21:48 |
lkcl | and it turned out that, retrospectively, the lane-swapping utilises Gray-coding! | 21:49 |
lkcl | i didn't recognise it at the time | 21:49 |
programmerjake | :) | 21:49 |
lkcl | that means it's fully 100% deterministic and can be done in hw as a pre-prep stage | 21:50 |
lkcl | so based on that really nice surprise i wondered - expected - something similar to pop out of the parallel-reduce thing | 21:50 |
lkcl | there's only *one* thing that's "broken" and it's when the predicate only has a single bit set | 21:51 |
programmerjake | well...afaict it's something based on count-trailing-zeros of the lane index | 21:51 |
lkcl | there you go :) | 21:52 |
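The count-trailing-zeros pattern mentioned above can be made concrete with a small sketch. This is an illustration of the pairing, assuming nothing beyond a standard binary tree-reduction: at step s the lanes still combining are exactly those whose index is 0 or has ctz(index) > s (i.e. multiples of 2**(s+1)), each pulling in the lane 2**s away.

```python
# Sketch: tree-reduction where count-trailing-zeros of the lane index
# determines how many steps a lane participates in. Illustrative only.

def ctz(n):
    """count trailing zeros (n > 0)"""
    return (n & -n).bit_length() - 1

def tree_reduce(vec, op):
    vec = list(vec)
    n, step = len(vec), 0
    while (1 << step) < n:
        stride = 1 << step
        for i in range(0, n, stride * 2):
            # lanes combining at this step: index 0, or ctz(index) > step
            assert i == 0 or ctz(i) > step
            if i + stride < n:
                vec[i] = op(vec[i], vec[i + stride])
        step += 1
    return vec[0]

print(tree_reduce([1, 2, 3, 4, 5], lambda a, b: a + b))  # → 15
```

Because the pairing depends only on the indices, it is fully deterministic and could be computed in hardware as a pre-prep stage, as noted above.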
programmerjake | well..,prefix sum needs all lanes, not just lane 0 so that's not the only thing broken by reduce not moving | 21:52 |
lkcl | remember, the idea is to *work out* the mvs needed, transparently, so that the user doesn't have to know they happened | 21:53 |
lkcl | which would be something that a high-performance implementation would do | 21:53 |
lkcl | but an embedded one most definitely would not | 21:53 |
lkcl | because it would be issuing scalar ops one at a time anyway | 21:53 |
programmerjake | well...they have to actually move for prefix-sum. even if it's easy to work out where they should move and construct a remap table...they still have to actually move the data otherwise afaict prefix-sum just skips writing the result to some non-masked-off lanes...broken essentially | 21:55 |
lkcl | yep, entirely skipped. | 21:57 |
lkcl | only the actual operations hit the result vector | 21:58 |
programmerjake | only lanes that are masked-out should be skipped, not non-masked-off lanes | 21:58 |
programmerjake | hence why i said it's broken | 21:59 |
lkcl | this is where i'd need to see it | 22:00 |
lkcl | for no predicate the results are as-expected (obvious) | 22:00 |
programmerjake | k, i'll write up an algorithm... | 22:00 |
lkcl | appreciated. | 22:00 |
programmerjake | well, turns out that that tree-reduction algorithm is the first step of postfix-sum, not prefix-sum...oops | 22:52 |
programmerjake | welp: https://git.libre-soc.org/?p=nmutil.git;a=commitdiff;h=49023473045e166aff508d75993276b5864b6ef8 | 23:03 |
programmerjake | the first half of the work-efficient prefix-sum algorithm only reduces when the input element-count is a power-of-2 | 23:04 |
programmerjake | I could probably munge the tree-reduction algorithm into a postfix-sum, but not easily...I'm giving up for now | 23:06 |
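For reference, the work-efficient prefix-sum mentioned above is, in its textbook (Blelloch-style) form, a tree-reduction up-sweep followed by a down-sweep, and the up-sweep only reduces cleanly when the element count is a power of 2, matching the observation two messages up. This is the generic algorithm, not the nmutil commit:

```python
# Generic work-efficient exclusive prefix-sum (Blelloch-style), for
# power-of-2 element counts only. The up-sweep phase is exactly a tree
# reduction; the down-sweep redistributes the partial sums to all lanes.
# Textbook sketch, not the nmutil implementation.

def exclusive_prefix_sum(vec):
    a = list(vec)
    n = len(a)
    assert n & (n - 1) == 0, "power-of-2 length only"
    # up-sweep: tree-reduce into the rightmost element of each subtree
    stride = 1
    while stride < n:
        for i in range(stride * 2 - 1, n, stride * 2):
            a[i] += a[i - stride]
        stride *= 2
    # down-sweep: push partial sums back down the tree
    a[n - 1] = 0
    stride = n // 2
    while stride >= 1:
        for i in range(stride * 2 - 1, n, stride * 2):
            a[i - stride], a[i] = a[i], a[i] + a[i - stride]
        stride //= 2
    return a

print(exclusive_prefix_sum([3, 1, 7, 0, 4, 1, 6, 3]))
# → [0, 3, 4, 11, 11, 15, 16, 22]
```

Note every lane of the output is written, which is why a reduce that skips non-masked lanes can't be reused for it directly.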
Generated by irclog2html.py 2.17.1 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!