Friday, 2022-04-22

15:28 <lkcl> hooray, i'm getting linux kernel boot messages on the arty a7-100t
15:28 <lkcl> they're a kernel panic, but they're messages
15:28 <lkcl> bout damn time
20:30 <programmerjake> yay!
20:36 <programmerjake> rust-lang finally switched everything to not require a login to view, so you can see project-portable-simd at https://rust-lang.zulipchat.com/#narrow/stream/257879-project-portable-simd
20:50 <lkcl> hoorah
21:12 <lkcl> interesting discussion with riccardo. he's not wrong about the larger designs. we'd need to do a 1,024-way SMP system to take on where NVIDIA etc. is at the moment
21:12 <lkcl> or
21:13 <lkcl> implement that hybrid of EXTRA-V plus ZOLC plus snitch
21:14 <lkcl> and the processing-redistribution idea
21:14 <programmerjake> imho we'd have like 256 cores and f32x16 simd units
21:15 <lkcl> yeah could work
21:15 <programmerjake> and extending svp64 to have more registers -- 128 is too small
21:15 <lkcl> an embedded GPU is easily achievable (without going mad), because the targets are lower
21:16 <lkcl> that brings the maximum clock rate down unless doing an L1 cache for the registers. which would kinda have to be done anyway
21:16 <lkcl> like many people working with GPUs, riccardo doesn't have any knowledge of the internals of SIMT, and both NVIDIA and AMD like it to stay that way
21:16 <programmerjake> well...amd gpus are currently around 2.x ghz, so...
21:16 <lkcl> you have to look at MIAOW to find out what the hell's going on
21:17 <programmerjake> nvidia gpus are still <2ghz
21:17 <programmerjake> iirc
21:17 <lkcl> and you find that it's basically "a bunch of standard processors with a common L1 cache and a common instruction fetch/decode plus instruction-broadcast bus"
21:17 <lkcl> meaning:
21:18 <lkcl> there is *only* 1 PC per group
21:18 <lkcl> consequently, that conditional if/else *has* to be done as predicated
21:18 <lkcl> on one of the cores on the same broadcast bus, the predicate would go the way of "mmm" and on the next clock "a -= b"
21:18 <programmerjake> nvidia's gpus are 2-way superscalar...
21:19 <lkcl> and on another it would go "a += b" followed by "mm"
21:19 <lkcl> intriguing
21:19 <lkcl> if you're 2-way multi-issue that'd be a neat way to ensure at least one instruction got into the queue
21:20 <lkcl> if the pattern of "if x else y" with single-instruction-x and single-instruction-y is very common
21:20 <lkcl> you're guaranteed there to execute one instruction, you just don't know which one
21:21 <lkcl> but by having a buffer-of-two-at-a-time you know *that* one is guaranteed to be executed
21:21 <lkcl> in one case it will be (as if) PC is executed
21:21 <lkcl> and in the other it will be (as if) PC+1 is executed
21:21 <lkcl> neat
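
A minimal sketch (plain Python, not project code) of the single-PC predication described above: both arms of the if/else are broadcast to every lane as predicated instructions, each lane commits only the arm whose predicate bit matches, and the pairing means every lane is guaranteed to retire exactly one of the two. The comparison used to form the predicate is made up for illustration.

    # each lane computes its own predicate bit, but all lanes share one PC,
    # so both predicated instructions are issued to every lane in lockstep
    def simt_if_else(a, b):
        pred = [ai < bi for ai, bi in zip(a, b)]   # hypothetical condition

        # broadcast op 1: "a += b", predicated on the bit being clear
        a = [ai + bi if not p else ai for ai, bi, p in zip(a, b, pred)]
        # broadcast op 2: "a -= b", predicated on the bit being set
        a = [ai - bi if p else ai for ai, bi, p in zip(a, b, pred)]
        return a

    print(simt_if_else([1, 5, 3], [4, 2, 3]))   # [-3, 7, 6]
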
21:23 <programmerjake> so...one way to make the larger number of registers go faster is to have them banked 16 ways -- 1 way per simd lane...and then just make lane crossing take extra cycles
21:24 <lkcl> yes. i planned a cyclic buffer, there.
21:24 <lkcl> two-way-directional (so, actually, a pair of cyclic buffers)
21:25 <lkcl> where the lower bits - modulo number-of-lanes - would be a "conveyor drop-off counter"
21:25 <lkcl> problem is you'd better damn well have workloads that match the striping
21:26 <lkcl> otherwise eeeverything goes to hell
21:26 <lkcl> especially scalar operation
21:26 <lkcl> therefore the additional refinement would be to have r0-r31 be in a standard scalar regfile with tons of ports
21:27 <lkcl> and for r32-r127 to be "striped"
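
A rough sketch of that split, in plain Python with made-up numbers (16 lanes, r0-r31 scalar) rather than anything from the actual design: each striped register has a home lane, and an access from any other lane pays one hop per lane around whichever of the two counter-rotating cyclic buffers is shorter.

    N_LANES = 16        # assumed lane count (the f32x16 figure above)
    SCALAR_REGS = 32    # r0-r31: conventional many-ported scalar regfile

    def home_lane(regnum):
        """which lane's bank holds this register (None = scalar file)"""
        if regnum < SCALAR_REGS:
            return None
        return (regnum - SCALAR_REGS) % N_LANES   # striped, modulo lane count

    def crossing_cost(regnum, requesting_lane):
        """extra cycles to reach regnum from requesting_lane, assuming a
        pair of counter-rotating rings: take the shorter direction"""
        home = home_lane(regnum)
        if home is None:
            return 0    # scalar file reachable from every lane
        fwd = (home - requesting_lane) % N_LANES
        return min(fwd, N_LANES - fwd)

    # e.g. r50 lives in lane (50 - 32) % 16 = 2; from lane 5 that is
    # min(13, 3) = 3 hops, hence the pair of rings rather than a single one
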
21:27 <programmerjake> that's also why I want tree-reduction to use moves and not rely only on remap....because the moves move the values to the correct lanes using the special reduce data paths (small subset of full inter-lane data paths -- they can be used for non-reduce ops too, they're just designed for reduce), avoiding the need to delay a bunch for lane crossing because we decided to leave data in the wrong lane cuz remap has nicer "purity" or something
21:27 <lkcl> no.
21:27 <lkcl> that's an implementation detail, chosen by the implementor (which happens to be us)
21:28 <lkcl> that implementation detail *must not* be back-propagated to the algorithm, poisoning and destroying the SVP64 API in the process
21:28 <lkcl> i've said no multiple times.
21:29 <lkcl> switching the operation internally to a micro-coded MV as a microarchitectural detail is perfectly fine
21:29 <programmerjake> well...afaict that implementation detail needs to be put in the semantics, otherwise it's impossible to implement...maybe
21:30 <lkcl> that'll be up to us to work out, yes, because putting something in the spec that's unworkable in practice is worse than useless
21:30 <programmerjake> since iirc remap produces different behavior than moves...lanes are left unmodified rather than overwritten by moves
21:31 <lkcl> which is why the FFT/DCT took 6-8 weeks, i had to do a maaajor tech-heavy nose-dive
21:31 <lkcl> yes. or, more to the point, REMAP expresses the *desire* to access registers
21:31 <lkcl> it's then up to the micro-architecture to work out how to implement those efficiently and effectively
21:32 <lkcl> this is one of the major, *major* differences between a Cray-style ISA and a SIMD ISA
21:32 <lkcl> the internal micro-architecture of SIMD "bleeds up to" (is exposed to) the ISA and the programmer is basically told, "we couldn't be bothered: here, you deal with it"
21:33 <lkcl> despite what only look like subtle differences as far as the programmer is concerned, a Cray-style ISA goes "i'm going to put a little more thought into the micro-architecture so that you, the programmer, *don't* have to deal with this crap"
21:34 <lkcl> and the inclusion of the MV in the reduce scheme falls into the former category, unfortunately.
21:34 <programmerjake> well...imho we should have moves even if it's not "pure" or whatever (i think it fits fine into svp64's semantics and don't think it has a purity problem), because it makes fast reduces possible for an important set of target microarchitectures (those with simd backends and slow fully-general lane crossing)
21:35 <lkcl> the other reason for not having it is because it makes Vertical-First parallel-reduce mode with predication almost impossible to understand
21:35 <lkcl> principle of MAXIMUM surprise rather than least surprise
21:35 <programmerjake> if you, the programmer, don't want moves, just follow basically every other arch and don't predicate your reductions
21:36 <programmerjake> imho reduction shouldn't be in vertical first mode...
21:36 <lkcl> a user is trying to read / single-step through code, and can't work out why the bloody hell the registers contain the wrong values
21:37 <lkcl> it's there, it's going to be expected to work.
21:37 <lkcl> i'm looking to split out the parallel-reduce implementation as a REMAP option
21:37 <programmerjake> vertical first mode is only well suited to vector ops where lane-crossing doesn't occur....it's waaay too confusing otherwise
21:38 <lkcl> once separated out it can be set up with SV REMAP instructions, at the top of a loop, just like DCT and FFT.
21:39 <lkcl> if the operation changes to a MV half-way through it completely throws off how REMAP works, yes.
21:39 <lkcl> REMAP has 4 "re-targets" which can be applied to RA/RB/RC/RT/RS-or-EA
21:40 <programmerjake> imho reduction is important enough that you should be able to use a single svp64 horizontal instruction to do reduction, not taking several instructions to set up remap
21:40 <lkcl> it's a single instruction, jacob. look at the examples and the implementation and the pseudocode.
21:41 <lkcl> by a happy coincidence the 5 slots happen to fall on different registers so that the 4 REMAPs can be applied once and only once
21:42 <lkcl> which was dead-lucky
21:42 <lkcl> bottom line, for a ton of reasons: hard-no on MV as an explicitly-exposed operation
21:43 <programmerjake> imho moves don't cause problems for remap...they aren't actual move instructions...they're just changing the op to copy the appropriate input to the result without modifying it...so an add reduce with sv.add rt.v, ra.v, rb.v will still use the rb remap option when the add is replaced with a move from rb...even though there isn't a mov instruction in openpower that has rb as an input
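
A small sketch, in plain Python, of one reading of the reduce-with-mv idea above (not spec text, and not the actual SVP64 schedule): when one side of a pairing is masked out, the add degenerates into a copy of the surviving operand into the destination lane, so live values keep migrating down the tree instead of being stranded by a skipped step.

    def tree_reduce_with_moves(vec, mask):
        v, live = list(vec), list(mask)
        n, step = len(vec), 1
        while step < n:
            for lo in range(0, n - step, step * 2):
                hi = lo + step
                if live[lo] and live[hi]:
                    v[lo] = v[lo] + v[hi]     # normal reduce step
                elif live[hi]:
                    v[lo] = v[hi]             # op replaced by a move
                    live[lo] = True
                # if only v[lo] is live (or neither), nothing happens
            step *= 2
        return v[0] if any(mask) else None

    print(tree_reduce_with_moves([1, 2, 3, 4], [False, True, False, True]))  # 6
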
21:43 <lkcl> which means more work for us, sigh, but it's fantastically intriguing
21:43 <lkcl> just... no.
21:43 <lkcl> i've said no.
21:45 <lkcl> probably around ten to fifteen times in total, i've said no
21:45 <programmerjake> yeah, i know... i just think some of your reasons for saying no are based on some misunderstandings of how the reduce-with-mv would work
21:45 <programmerjake> so if you fully understand it you might not say no
21:46 <lkcl> i get it, and i appreciate you going over it
21:46 <lkcl> some things i instinctively get, others i don't. usually i can tell the difference :)
21:47 <lkcl> the index-redirections effectively represent the MVs.
21:47 <lkcl> what i'd be interested to see is whether some pre-processing can or has to be carried out
21:48 <programmerjake> the index-redirections can't replace moves...they give different results...making prefix-sum based on the tree-reduce much harder to implement...
21:48 <lkcl> deep breath, it'll probably need a micro-simulator
21:48 <lkcl> that's what i thought about DCT/FFT
21:49 <lkcl> and it turned out that, retrospectively, the lane-swapping utilises Gray coding!
21:49 <lkcl> i didn't recognise it at the time
21:49 <programmerjake> :)
21:50 <lkcl> that means it's fully 100% deterministic and can be done in hw as a pre-prep stage
21:50 <lkcl> so based on that really nice surprise i wondered - expected - something similar to pop out of the parallel-reduce thing
21:51 <lkcl> there's only *one* thing that's "broken" and it's when the predicate only has a single bit set
21:51 <programmerjake> well...afaict it's something based on count-trailing-zeros of the lane index
21:52 <lkcl> there you go :)
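
For reference, a sketch (plain Python, a reconstruction from the count-trailing-zeros remark, not the actual REMAP pseudocode) of the unpredicated tree-reduce schedule: element i (for i > 0) is consumed exactly once, at step ctz(i), by the element 2**ctz(i) positions below it, so the whole schedule is fixed in advance and could be precomputed in hardware, much like the Gray-code pattern in the DCT/FFT REMAP.

    def ctz(i):
        # count trailing zeros of a positive integer
        return (i & -i).bit_length() - 1

    def reduce_schedule(vl):
        """(step, dest, src) triples for a vl-element tree-reduce into element 0"""
        return sorted((ctz(i), i - (1 << ctz(i)), i) for i in range(1, vl))

    for step, dest, src in reduce_schedule(8):
        print(f"step {step}: r{dest} += r{src}")
    # schedule for vl=8: step 0 pairs (0,1) (2,3) (4,5) (6,7),
    # step 1 pairs (0,2) (4,6), step 2 pair (0,4)
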
21:52 <programmerjake> well...prefix sum needs all lanes, not just lane 0, so that's not the only thing broken by reduce not moving
21:53 <lkcl> remember, the idea is to *work out* the mvs needed, transparently, so that the user doesn't have to know they happened
21:53 <lkcl> which would be something that a high-performance implementation would do
21:53 <lkcl> but an embedded one most definitely would not
21:53 <lkcl> because it would be issuing scalar ops one at a time anyway
21:55 <programmerjake> well...they have to actually move for prefix-sum. even if it's easy to work out where they should move and construct a remap table...they still have to actually move the data otherwise afaict prefix-sum just skips writing the result to some non-masked-off lanes...broken essentially
21:57 <lkcl> yep, entirely skipped.
21:58 <lkcl> only the actual operations hit the result vector
21:58 <programmerjake> only lanes that are masked-out should be skipped, not non-masked-off lanes
21:59 <programmerjake> hence why i said it's broken
22:00 <lkcl> this is where i'd need to see it
22:00 <lkcl> for no predicate the results are as-expected (obvious)
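
A toy illustration (not project code) of the point of contention: a prefix-sum has to write every unmasked lane with its running total, whereas the reduce schedule above only ever writes its destination lanes, so without real data movement the remaining lanes are left holding stale or partial values.

    def prefix_sum(vec):
        out, running = [], 0
        for x in vec:
            running += x
            out.append(running)    # every lane receives a result
        return out

    print(prefix_sum([1, 2, 3, 4]))    # [1, 3, 6, 10]
    # the pure tree-reduce of the same input ends as [10, 2, 7, 4]:
    # only lanes 0 and 2 were ever written, lanes 1 and 3 keep their inputs
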
22:00 <programmerjake> k, i'll write up an algorithm...
22:00 <lkcl> appreciated.
22:52 <programmerjake> well, turns out that that tree-reduction algorithm is the first step of postfix-sum, not prefix-sum...oops
23:03 <programmerjake> welp: https://git.libre-soc.org/?p=nmutil.git;a=commitdiff;h=49023473045e166aff508d75993276b5864b6ef8
23:04 <programmerjake> the first half of the work-efficient prefix-sum algorithm only reduces when the input element-count is a power-of-2
23:06 <programmerjake> I could probably munge the tree-reduction algorithm into a postfix-sum, but not easily...I'm giving up for now
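
For context, the textbook work-efficient (Blelloch-style) exclusive prefix-sum in plain Python; this is the standard algorithm, not the nmutil code in the commit above. Its up-sweep is exactly a tree-reduce (the mirror-image of the schedule sketched earlier, accumulating towards the top element), which is why the reduce only falls out directly when the element count is a power of two.

    def blelloch_scan(vec):
        n = len(vec)
        assert n & (n - 1) == 0, "power-of-two length only"
        v = list(vec)

        # up-sweep: a tree-reduce; v[n-1] ends up holding the total
        step = 1
        while step < n:
            for hi in range(2 * step - 1, n, 2 * step):
                v[hi] += v[hi - step]
            step *= 2

        # down-sweep: turn the partial sums into an exclusive prefix-sum
        v[n - 1] = 0
        step = n // 2
        while step >= 1:
            for hi in range(2 * step - 1, n, 2 * step):
                v[hi - step], v[hi] = v[hi], v[hi] + v[hi - step]
            step //= 2
        return v

    print(blelloch_scan([1, 2, 3, 4]))    # [0, 1, 3, 6]
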
