Monday, 2021-02-08

segherso you are not doing VMX either?08:07
segher(i think simplev is a huge mistake, and implementations using it will be both slower and bigger, but not my decision :-) )08:08
segher(not to mention not compatible, etc.)08:08
segheryou also probably want to look at "addex", which is like "adde" but can use more different inputs (only OV is defined right now, but it has room for expanding to 4 extra carry bits, so 5 total; two is fine for now)08:20
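A minimal sketch of why a second carry bit helps, assuming (as described above) that "addex" performs the same add-with-carry as "adde" but takes and returns its carry through OV rather than CA. The helper names `add_with_carry` and `two_chains` are hypothetical and only model the arithmetic, not the instruction encoding. Two independent 128-bit additions are interleaved limb by limb without either chain clobbering the other's carry.

```python
MASK64 = (1 << 64) - 1


def add_with_carry(ra, rb, carry_in):
    """Model a 64-bit add with carry-in; returns (result, carry_out)."""
    total = (ra & MASK64) + (rb & MASK64) + carry_in
    return total & MASK64, total >> 64


def two_chains(a, b, c, d):
    """Compute a+b and c+d (128-bit each) with interleaved carry chains.

    The a+b chain keeps its carry in "CA" (adde-like), the c+d chain
    keeps its carry in "OV" (addex-like, an assumption of this sketch).
    """
    flags = {"CA": 0, "OV": 0}
    sum_ab = 0
    sum_cd = 0
    for limb in range(2):
        shift = 64 * limb
        # adde-like step: carry travels through CA
        lo, flags["CA"] = add_with_carry(a >> shift, b >> shift, flags["CA"])
        sum_ab |= lo << shift
        # addex-like step: carry travels through OV, CA is untouched
        lo, flags["OV"] = add_with_carry(c >> shift, d >> shift, flags["OV"])
        sum_cd |= lo << shift
    return sum_ab, sum_cd
```

With only one carry bit the two chains could not be interleaved like this, because each add would overwrite the carry the other chain still needs.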
segherfor add implementation, i'd always use a mix between carry skip and carry select...  the problem of most abstractions for creating adders is they do not think enough about locality08:23
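For readers unfamiliar with the term, here is a behavioural sketch of the carry-select half of that mix (not the carry-skip part, and not anyone's actual HDL). Each block computes its sum for both possible carry-ins, and the real carry merely selects one of them. The function name, the 32-bit width and the 8-bit block size are illustrative assumptions; `width` is assumed to be a multiple of `block`.

```python
def carry_select_add(a, b, width=32, block=8):
    """Behavioural model of a carry-select adder.

    In hardware both candidate sums of every block are computed in
    parallel; only the select (a mux per block) waits for the carry.
    """
    mask = (1 << block) - 1
    carry = 0
    result = 0
    for pos in range(0, width, block):
        x = (a >> pos) & mask
        y = (b >> pos) & mask
        sum_if_carry0 = x + y        # candidate for carry-in = 0
        sum_if_carry1 = x + y + 1    # candidate for carry-in = 1
        chosen = sum_if_carry1 if carry else sum_if_carry0
        result |= (chosen & mask) << pos
        carry = chosen >> block      # carry-out selects the next block
    return result, carry
```

The locality point is visible even in this model: each block only ever looks at its own slice of the inputs plus a single carry wire from its neighbour.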
mepylkcl ^10:25
rscAh. Hi!13:25
mepyHi rsc13:54
*** mepy <mepy!~mepy@> has left #libre-soc15:11
lkclno legacy SIMD.  it's too big, and too troublesome.  if we absolutely have to do it, it will delay the hardware implementation by at least 6 months, even to get a bare minimum, and for very little benefit19:16
lkclyes we have addex (we have the full scalar OpenPOWER v3.0B ISA, except madd, that's still TODO)19:16
lkclsegher: when VL=1 and when the Context is zero, we guarantee 100% compatibility with the v3.0B scalar ISA.19:17
seghermadd is easy on integer19:21
segheryour multiplier already is a big addition tree, you just have to add one more input19:22
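A rough model of that remark, assuming an unsigned multiplier built as a tree that sums shifted partial products. The function names are hypothetical, and a real design would use carry-save adders rather than full additions at every level; the point is only that the addend of a multiply-add is one more row fed into the same tree.

```python
def partial_products(a, b, width=64):
    """One shifted copy of a for every set bit of b (unsigned)."""
    return [a << i for i in range(width) if (b >> i) & 1]


def tree_sum(rows):
    """Pairwise reduction, standing in for the hardware addition tree."""
    rows = list(rows)
    while len(rows) > 1:
        paired = [rows[i] + rows[i + 1] for i in range(0, len(rows) - 1, 2)]
        if len(rows) % 2:
            paired.append(rows[-1])  # odd row passes through to next level
        rows = paired
    return rows[0] if rows else 0


def mul(a, b, width=64):
    return tree_sum(partial_products(a, b, width))


def madd(a, b, c, width=64):
    # The addend c is simply one extra input row to the same tree.
    return tree_sum(partial_products(a, b, width) + [c])
```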
segherlkcl: but you redefine some opcodes, which means you can never be said to implement power architecture19:23
segher(or don't you?  i understood you redefine primary 4)19:23
segher(all of this is only criticism that i hope is helpful, btw!)19:24
lkclno we're not redefining opcodes... yet.  if we do, it will be behind a PCR (Program Compatibility Register) bit19:27
lkclwhich, sigh, has yet to be reserved at the OPF level19:28
lkclwhat we _are_ doing is "fitting in" with the EXT01 64-bit prefix19:28
lkclwe're requesting at the OPF ISA WG level QTY 16 of the 64 prefix spaces (bits 7-12)19:29
segheroh good!19:29
lkcljacob came up with a fantastic way to fit in there, in a non-disruptive fashion, into the "higher" reserved bits of that space19:29
segherthat is a lot of prefix space, lol19:29
lkclin each column.19:29
lkclyes :)19:29
lkclwe considered 50% (32 / 64) but this would be a bit greedy :)19:30
segher7 is way greedy already19:30
lkclsection "Prefix Opcode Map"19:31
lkclwe need 24 bits for SVP64's prefix system19:31
segherbut it could be hidden behind a PCR, yes19:31
lkclyeahh... that's not so ideal, but doable19:31
lkclwe miss the entire 8LS line, and the MLS line19:32
lkcland sit at the high end of 8RR, MRR and MMIRR's "reserved" space19:32
seghernot sure how useful prefixed loads/stores are for smaller/slower implementation19:33
lkclyou get LD/ST-multi "for free"19:33
lkclyou also get *predicated* LD/ST-multi "for free"19:34
segheryou need a bigger frontend for prefixed19:34
lkclwhich is useful for context-switching (one single instruction) as well as for function call stack save/restore19:34
segherand almost everything else widened, too, for good performance19:34
lkclnot sure what you mean "frontend"?19:34
segherfetch and decode and sequencer19:34
seghereverything before issue :-)19:35
lkclwell that's the beauty of the Cray-style Vectors: the ISA *does not care* if the back-end is 1 wide ALU, 3 4 7 19 21 or 64 wide19:35
segherbut *you* care what the performance becomes19:35
lkclSimple-V sits "in between" fetch and issue, as literally a hardware for-loop (it's called a Sub-PC for a reason)19:36
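The "hardware for-loop" idea can be sketched in a few lines. This is a deliberately simplified model with hypothetical names (`simple_v_issue`, a flat `regs` list, a bitmask `predicate`); the real SVP64 prefix carries more state than this, such as which operands are scalar and which are vector. With `vl=1` and no predicate the loop degenerates to a single scalar operation, which is the compatibility case mentioned earlier.

```python
import operator


def simple_v_issue(op, dest, src1, src2, regs, vl, predicate=None):
    """Expand one prefixed instruction into up to vl scalar element ops.

    sub_pc plays the role of the Sub-PC: it steps through the elements
    while the real program counter stays on the same instruction.
    """
    for sub_pc in range(vl):
        if predicate is not None and not (predicate >> sub_pc) & 1:
            continue  # masked-out element: nothing is issued
        regs[dest + sub_pc] = op(regs[src1 + sub_pc], regs[src2 + sub_pc])
    return regs


# Example: a 4-element vector add of r8..r11 and r16..r19 into r0..r3.
regs = list(range(32))
simple_v_issue(operator.add, 0, 8, 16, regs, vl=4)
```

How many of those element operations are issued per clock is up to the back-end, which is the sense in which the ISA does not care how wide the ALUs are.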
lkclahh, yes we do :)19:36
lkclhybrid GPU / VPU, go figure19:36
segherand cray vectors work well on in-order single-issue19:36
segherrunning at modest frequencies19:36
segherbut, afk19:37
lkclyes because the instruction decode twiddles its thumbs whilst the back-end ALUs scream 100%.  if we completely run out of time to get proof-of-concepts out there then reluctantly we'll do in-order single-issue19:37
segheri meant to say that i do not see how it can perform well on wider and/or faster cores19:38
segher(and you need OoO for even modestly wider)19:38
lkcl... ok right.  right19:38
segherbut, please prove me wrong :-)19:38
lkcllet's say you want good (high) performance on GPU workloads but you also want good (reasonable) performance on general-purpose workloads, too19:39
lkcli'm assuming here that we've gone through the process of defining a new ABI, the compilers all have auto-vectorisation, yada-yada19:39
segherso you do a 2-wide pipe for the general powerpc thing19:39
lkcl1 sec... yes, thin.FAT rather than big.LITTLE :)19:40
segherthat can be done cheaply and effectively, a little bit OoO but not much19:40
seghersay, like pentiumpro, or 60319:40
segherwell, 604 or 750 really19:40
lkclthe "normal" way to get high performance is to put in back-end SIMD ALUs, 8-wide FP32 or even 16-wide FP32 or potentially even greater19:40
lkcland for a GPU workload this would be absolutely fantastic, yes?19:40
lkclnow, what about when you run a standard general-purpose compute workload?19:41
segheryou do simd only if you cannot get better performance from your process19:41
segherand then you only do short vectors not to hurt your cycle time19:41
lkclwith maybe only 2x FP32 or (gosh) there are 4x FP32 only?19:41
segher4x yes19:42
lkclall the SIMD ALUs are 8x or 16x19:42
lkclthe "utilisation" there is going to be stupidly small19:42
seghernormal Power is 4x fp3219:42
segher(in VMX)19:42
lkclthose 2x FP32 operations when sent to an 8x or 16x FP32 SIMD unit, it's going to be only what... 25% or 12.5% utilisation19:42
lkclthe other 8-2 or 16-2 SIMD lanes will do absolutely f***-all19:43
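The lane arithmetic in this exchange is easy to check with a one-line helper (the name `lane_utilisation` is just for illustration). It assumes one SIMD operation per issue and counts the fraction of lanes doing useful work.

```python
def lane_utilisation(active_elements, simd_width):
    """Fraction of SIMD lanes doing useful work for one issued operation."""
    if active_elements > simd_width:
        raise ValueError("more active elements than lanes")
    return active_elements / simd_width


# 2 useful FP32 elements on back-ends of different widths:
for width in (4, 8, 16):
    idle = width - 2
    print(f"{width:2d}-wide: {lane_utilisation(2, width):.2%} busy, {idle} lanes idle")
```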
segherbut, afk, sorry19:43
segheryes, but there only are 4 lanes normally19:43
lkclbecause the general-purpose code simply can't...19:43
segher16B short vectors19:43
lkclwe're not doing VSX, and not talking about VSX.19:43
segherthis is VMX19:44
lkcli'm talking of a hypothetical Simple-V system19:44
lkclwhere it has a Cray-style Vector front-end, with predicated SIMD back-ends of width up to 8x or 16x FP3219:44
lkclnot Altivec, not VMX, not VSX, which are hard-coded and fixed to 4xFP32.19:44
lkclso, assume that the ABI has been done, that the compilers have all been done to support Simple-V Cray-style Vectors19:45
lkclnow you have a general-purpose program where the auto-vectorisation can only, at best, detect and issue 2x FP32 at once.19:45
lkcljust purely as an academic exercise, i don't know of an actual real-world example19:46
lkclbut let's pretend that such an algorithm exists19:46
segherand the interesting autovectorisation it cannot do at all19:46
lkclso on this hardware, because the SIMD back-end ALUs are 16x wide, the utilisation of those back-end ALUs is only going to EVER have QTY 2 out of its 16 FP32 SIMD "lanes" occupied at any one time19:47
lkclwasted, yes?  (this is with an in-order, single-issue system, mind)19:47
lkcljust like in POWER10, now let's imagine that instead of in-order single-issue we have 4-way or 8-way multi-issue19:48
lkcllet us imagine that the SIMD back-ends are only 4xFP32 (by coincidence this is the same size as VSX)19:48
sorearyou seem to be exhibiting mid-1960s "the purpose of out of order is to keep our expensive FPUs busy" mentality19:49
lkclnow because that is a loop, and because of in-flight data, and branch prediction, the auto-vectorisation will still only issue 2x FP32 but at the hardware level19:49
lkclthe ALUs will be at least 50% occupied.19:49
lkclnot 12.5%19:49
lkclsorear: :)19:49
sorearwhere does keeping the cache busy fit in here19:50
lkclbut, and here's the nice bit: when you run a *GPU* workload, it issues those 8x or 16x Vector instructions19:50
lkcland the Simple-V Engine goes, "oh, you wanted 16x FP32, i have QTY 4 4xFP32 SIMD backends, i'll slam your entire 16x FP32 Vector into all four SIMD back-ends in one clock cycle"19:51
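A sketch of that dispatch step, under the assumption of four back-ends with four FP32 lanes each, as in the example above. The function name `dispatch` and the schedule layout are hypothetical; the model only shows how a vector of length `vl` would be carved into per-cycle, per-back-end chunks.

```python
def dispatch(vl, backends=4, lanes=4):
    """Split a vl-element vector across SIMD back-ends.

    Returns one list per clock cycle; each entry is
    (backend_index, [element indices handled by that back-end]).
    """
    per_cycle = backends * lanes
    schedule = []
    for base in range(0, vl, per_cycle):
        cycle = []
        for unit in range(backends):
            start = base + unit * lanes
            elems = list(range(start, min(start + lanes, vl)))
            if elems:  # back-ends with no elements stay idle this cycle
                cycle.append((unit, elems))
        schedule.append(cycle)
    return schedule
```

In this model `dispatch(16)` fills all four back-ends in a single cycle, while `dispatch(2)` occupies half of one back-end and leaves the other three free for other instructions.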
sorearenergy efficiency would be happier if you were running 32x at half the clock and 200mV less, so there's a bit of a fundamental conflict running vector and scalar workloads at the same time on the same cores19:52
lkclwhich cache, sorear?  the reason i ask is: we'll need to do 3.  1) I-Cache 2) D-Cache 3) Texture-image cache19:52
lkclyes, this is where the idea from jacob stems from, to do thin.FAT19:53
sorearL2/LLC since that's traditionally most of your area19:53
lkclthe "thin" core will be multi-issue and not so wide SIMD, and also run at a high clock rate19:53
lkclthe "FAT" cores will probably be single-issue, *MASSIVE* wide SIMD back-ends, and run at 1/2 the clock rate19:54
lkcl3D workloads, particularly texture maps, are very regular.  they are also typically LOAD-PROCESS-STORE so we may need to do either L2 cache-line pinning or have L2 cache bypass entirely, for Textures19:55
lkclbecause with the Texture maps being of fixed size at 1 Megabyte in the Vulkan Specification, one entire Texture map would end up flushing 50% or 100% of the entire general-purpose L2 cache (!)19:56
lkclstill all TBD properly19:56
lkclanyway, good question19:59
sorearwhen you think about it texturing is just a JOIN and those can be done with a logarithmic number of passes over memory in the worst case19:59
lkcli don't know the full details (jacob's the one been studying the Vulkan spec) i believe the maps are laid out regularly in memory (deliberately)20:00
lkclit's the "interpolation" opcodes that are the CPU-cycles-killer if you don't have special Texture LD/ST opcodes20:00
lkclyou have to take 4 pixels and interpolate them using *X-Y* values from 0.0 to 1.020:01
lkclin *both* the X *and* Y dimension20:01
lkclthis is for image scaling, obviously20:01
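The interpolation being described is ordinary bilinear filtering: four neighbouring texels blended by fractional X and Y positions in the range 0.0 to 1.0. A minimal scalar sketch follows, with hypothetical argument names; a dedicated texture opcode would do the equivalent in one step per pixel, which is why doing it with general-purpose instructions is costly.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0.0, 1.0]."""
    return a + (b - a) * t


def bilinear(p00, p10, p01, p11, fx, fy):
    """Blend four texels: p00 top-left, p10 top-right,
    p01 bottom-left, p11 bottom-right."""
    top = lerp(p00, p10, fx)        # interpolate along X on the top row
    bottom = lerp(p01, p11, fx)     # interpolate along X on the bottom row
    return lerp(top, bottom, fy)    # then interpolate along Y
```

That is three multiplies and six add/subtracts per colour channel per pixel before any lighting or depth work is done.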
lkclyou know how you get that error if you run a "full" OpenGL application on an OpenGL ES 2.0 hardware, "Non-Power-of-2 scaling is not supported"?20:02
sorearI don't really follow gaming but the key texture compression patents expired a year or two ago, you're probably going to be dealing with _mostly_ compressed textures soon20:02
lkcleuuurgh.  that sounds fun20:02
lkclanyway.  i need to stand up, walk around.20:04
sorearthen again you have a fairly high baseline of instructions per pixel to handle normal interpolation, lighting, depth buffer testing and updates...20:04
sorearI'm not even sure what people consider a good benchmark these days20:04
lkcli went through it with Jeff Bush, his Nyuzi paper is really good20:04
sorearyou have Z-order/swizzling right?20:09
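For context, Z-order (Morton) swizzling stores texels so that neighbours in both X and Y end up close together in memory, by interleaving the bits of the two coordinates. A small sketch using the well-known bit-spreading trick, limited to 16-bit coordinates; the function names are illustrative.

```python
def part1by1(n):
    """Spread the low 16 bits of n so there is a zero bit between each."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def morton2d(x, y):
    """Interleave x (even bit positions) and y (odd bit positions)."""
    return part1by1(x) | (part1by1(y) << 1)
```

A 2x2 block of texels maps to four consecutive addresses, so the four inputs of a bilinear filter usually share a cache line.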
rsclkcl: may I ask what the plan is after the unfortunate VSX response? Or are you currently evaluating?20:09
segheryou can be power isa compliant without vector20:52
segher(just the SFFS subset)20:52
segheryou can extend the elfv2 abi pretty easily for it, too20:52
segherbut perhaps you do not have to at all even20:53
segheran elf object that declares it does not use it could otherwise use the same abi20:54
segheryou'll have to do some linux kernel support, too, but that should be easy as well20:54
segherif you want a distro that does not use Vector or Vector Scalar, you'll have to build one yourself, or pay someone else to do one (or bribe them some other way ;-) )20:56
segherbut, you've got an email from Bill; i'll reply to that tomorrow20:57
segherthe core is that it certainly could be done, but you cannot expect other people to do the legwork20:58
segheri hope that isn't bad news for you :-)20:58
rscI understood that a Power ISA compliant CPU can be without VSX, but introducing a new ABI and a new GNU triplet etc. is something which I'm in doubt when it comes to Linux distributions, because it means efforts for "one" CPU.21:00
sorearwhat is the VSX "response"?21:00
rsc"64-Bit ELF V2 ABI Specification: Power Architecture" in at least 1.4 (current version) makes VSX non-optional21:01
sorearyes, that's kind of the point of ELF V221:01
programmerjake[miirc elf v2 also has many other features, such as trying to improve tail call optimizations21:03
sorearwhat are you doing to support ieee 754-2008?21:03
programmerjake[mmostly just relying on the OpenPower spec, though I did write a whole sw implementation of ieee 754 2019 in Rust:
programmerjake[munlike berkeley softfloat all features are always available, no recompilation with different flags necessary21:08
sorearI feel like if you're doing 16-wide SIMD and a "GPU" but don't have hard IEEE support something has gone wrong somewhere21:09
programmerjake[mIt is a full implementation of ieee754 2019 for RISC-V, I still need to finish adding all of Power's weird float status flags and handle NaN propagation for Power21:09
programmerjake[mthe cpu *will* support hardware fp, the library I wrote is intended to be a reference implementation for testing against21:10
programmerjake[mwe currently have an incomplete hw fp implementation, we still need to add support for correct NaN propagation, Power status flags and rounding modes, and optimize to try and share hw with the integer alus if possible21:12
programmerjake[min particular, I'd like to share the int div/rem with the fp div/sqrt/rsqrt unit, and I'd like to share int mul/muladd with fp mul/fma and maybe fp add/sub21:15
segherrsc: you can build your own distro easily.  but *supporting* it will be a lot of work21:17
segherprogrammerjake: almost all of ieee float rules leaves no choice to the implementation, so this is easy21:21
programmerjake[mdistro: that's a large part of why we want to get our code upstream, it will reduce our maintenance burden due to other people's refactors and changes being handled upstream rather than our having to port them to our patch set21:22
segherorder of normalisation and rounding isn't specified, which NaN is taken if there are more than one in the inputs is not specified, and there is a third thing but i forgot right now21:22
segherjake: but why would they spend so much effort for just you?21:23
segherthat's not a realistic thing to expect, imo21:23
programmerjake[mwell, it's specified by the Power spec, also, power splits the invalid status flag into many separate flags21:23
segheryes, and that is perfectly standard compliant21:24
segherboth 754 and 1866121:24
programmerjake[mbecause I'm referring to things like tree-wide changes and non-libre-soc specific changes21:24
segherbut no one else wants it21:25
segherso it is just for libresoc21:25
programmerjake[mthat's not quite true, the a2o and a2i cores don't have altivec iirc, people will probably build stuff based on them21:26
programmerjake[malso, microwatt21:26
segheryes, and there are no distros just for that21:27
programmerjake[myeah, hence why we're (probably) not trying to create a new distro, just get the existing distros to work with libre-soc21:28
segherall current powerpc64le distros support power8 and later only21:28
programmerjake[myeah, because there was nothing else worth supporting when they made that decision... things are potentially different now21:29
segherthose were the only cpus that supported it, even21:30
segherwe did have some power7 before21:30
segherbut that needed so many workarounds, that it was dropped once power8 was mainstream21:30
segherVSX is used a lot, it helps performance quite a bit21:33
programmerjake[mI haven't yet given up on convincing the rest of libre-soc we need to implement altivec and vsx and stuff, but we're going to try to get a working processor before we add really-nice-to-have things21:33
segherand you can!21:33
segherbut you need to recompile everything to not use VMX and VSX21:34
rscsegher: I am a Fedora contributor since ever, so I know what you mean...nevertheless a new architecture for a distribution is usually not going to take place easily.21:36
segherand that is why i said you probably have to pay for it21:38
segheror maybe you can convince people they want to do it.  debian perhaps, or void21:38
programmerjake[mif at all possible, I want to avoid following what Raspberry Pi v1 did with a separate distro, that was really annoying to use21:39
rscHaving to use a niche distribution for Libre-SOC would be sad.21:39
segheri would recommend centos, but :-)21:40
rscsegher: Rocky fixes that hopefully ;-)21:40
programmerjake[mif we were to create our own distro, it would likely be debian-based, since that's what we're currently using for most of our development21:41
segherrsc: i mostly use centos 7, and that is EOL in 2024, so i have time21:42
segheryou could also use a distro where all users build stuff from source21:43
segherthen, you only need extra compile flags, the same for most packages21:43
programmerjake[m(not that I've ever used it...)21:44
segherriseros yes, or arch21:44
segher(i know it is called gentoo, but heh)21:45
sorearpresumably you've considered "implement the VSX registers, loads/stores, and IEEE FP and leave the rest of VSX to privileged software emulation"21:45
rscWhile these options indeed exist, I'm not a fan of it. Especially as it reduces the chance for business usecases IMHO.21:45
sorearthen you can use precompiled sw for everything that's not perf critical21:46
seghersorear: there are 64 128-bit vector registers21:47
segherbut that is what you need at a minimum if you just emulate everything, yup21:48
segherthis is the minimum that was required for FP in old powerpc ISAs21:48
soreara 1kB 1R1W SRAM isn't _that_ big21:48
segherlike, 602 had 64-bit registers, but only implemented 32-bit insns21:48
programmerjake[mone idea I had was instead of having 128x 64-bit fp regs for SimpleV, instead have 64x 128-bit fp regs mapped 1:1 to vsx regs. same thing for int regs.21:49
sorearyour main register file is big because it has a ton of ports, this doesn't need nearly as many21:49
segheryes, if you emulate everything you are slow anyway, so you do not need a sane register file, a block of ram will do fine21:49
seghersorear: that, and a few more things21:50
segherrenames for example21:50
sorearin principle you can do everything with a block of main memory (see: berkeley softfloat) but it would be nice to not penalize context-switch code21:50
programmerjake[m> same thing for int regs.21:50
programmerjake[mexcept for mapping to vsx regs, of course21:50
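A toy model of that aliasing idea: one 1 kB block of storage viewed either as 128 64-bit registers or as 64 128-bit registers. The class name, the pairing of 64-bit registers 2j and 2j+1 into 128-bit register j, and the little-endian layout are all assumptions made for illustration, not anything specified in the discussion.

```python
REGFILE_BYTES = 1024  # 128 x 8 bytes == 64 x 16 bytes


class AliasedRegFile:
    """Same bytes, two views: 64-bit registers and 128-bit registers."""

    def __init__(self):
        self.mem = bytearray(REGFILE_BYTES)

    def write64(self, index, value):
        self.mem[8 * index:8 * index + 8] = value.to_bytes(8, "little")

    def read64(self, index):
        return int.from_bytes(self.mem[8 * index:8 * index + 8], "little")

    def write128(self, index, value):
        self.mem[16 * index:16 * index + 16] = value.to_bytes(16, "little")

    def read128(self, index):
        return int.from_bytes(self.mem[16 * index:16 * index + 16], "little")


rf = AliasedRegFile()
rf.write128(1, (0xAAAA << 64) | 0xBBBB)
# Under the assumed layout, 128-bit reg 1 overlaps 64-bit regs 2 and 3.
low, high = rf.read64(2), rf.read64(3)
```

The overlap is exactly what makes register renaming awkward: a write to one 128-bit register is also a write to two 64-bit registers, and the hazard tracking has to know that.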
segheryou typically just duplicate the whole register file for every write port (or two write ports)21:51
seghersorear: yes, and there are security concerns with that, too (to make sure the kernel will not fault on context switches, etc.)21:52
programmerjake[msince SimpleV has more than enough space to store all vsx regs, we won't need any extra regs (except maybe a few misc sprs)21:52
sorear"we don't need these registers at the same time so make them aliases" is all fun and games until you need to register-rename overlapping registers of different sizes21:53
programmerjake[mhence why I've been planning ahead:
lkclrsc: i mentioned on the fosdem chat room, brian schwartz responded positively, he's contacting people (including you, segher!) to see what the best option is23:07
lkclprogrammerjake[m: yes agreed on sharing INT-FP parts23:08
lkclsegher: so we need to work out how to leverage the fact that A2O, A2I *and* Microwatt *and* Libre-SOC are all in the same boat: no VSX, therefore they're also "ostracised"23:09
lkcli'm counting on the fact that between all four of those, particularly how heavily optimised A2O and A2I are, it could easily be *half a million* in HDL Engineering time to add VSX to all four systems23:09
lkclonce a triplet exists there does exist a solution: it's what's used in RISC-V.  it's not multi-lib, it's not multi-arch, it's not HWCAPs, it's something in between23:11
lkclToshaan informed me that there's one company that's actually provided soft-emulation of VSX.23:12
lkclnote there the implication: ANOTHER company implementing OpenPOWER REFUSED to implement VSX because the cost is so insane.23:12
programmerjake[mone other cpu that doesn't have altivec in powerpc64le is the one used in the power laptop project, it only supports altivec in be mode23:14
lkclyes, the NXP QorIQ.  roberto said he's having to go after a Power BE 64-bit port because of this23:16
lkclhe probably means ELF v123:16
lkclwhich is also an option for us: revive the ELF v1 ABI and stick to BE until this is better clarified and resolved23:17
programmerjake[mBE has much bigger problems for SimpleV with the registers currently specced to be always LE23:19
lkclthat's just internal and the discussion we had already solved that23:20
lkclif people absolutely insist on loading data in a smaller word size then accessing the registers in a larger word size they can use REMAP to perform the byte-swapping transparently, both in and out of any operation.23:21
lkclso that's solved.23:21
programmerjake[mexcept that remap isn't fast to setup, and isn't currently supported by svp6423:22
programmerjake[malso, if we have remap hw anyway, why can't we just enable it to do byteswapping by default?23:23
lkclit's added, it's in, it's there.  we implement it, it's done23:24
segherlkcl: A2 is from before elfv223:24
lkclsegher: exactly.23:24
lkclwhich means they're screwed as well23:24
segherand microwatt is experimental23:24
segherdoes A2 support LE at all?23:25
lkclbeing promoted by IBM and OPF, as providing high performance 3ghz option23:25
lkclbut there's not a single GNU/Linux distro that will run on the A2*s23:25
segherthere are quite many other ABIs still supported23:25
segherlike, the powerpc-linux and powerpc64-linux configs23:26
lkclah do you happen to know what those are?23:26
segherthose existed when A2 was born23:26
segherand they are still supported23:26
lkcland those have glibc6 mainline support?23:26
programmerjake[myes...i'd assume so since they have official debian ports23:27
segherlkcl: sure23:27
lkclthis, right?
segherlkcl: not many distros still support powerpc64-linux though23:27
lkclwell as long as there's... something, we're not completely screwed23:28
seghersles, rhel, and ubuntu all have dropped it (i think)23:28
segherbut there are things, certainly23:28
segher*technically* it's not hard or much work to support the older abis23:29
segherbut for a distro it is another arch essentially23:29
segherso it costs non-trivial machine resources, and support23:30
segher(including testing etc.)23:30
seghersome (less commercial) distros just let the users do the testing :-)23:31
rsc - Debian's PPC64 support is "unofficial" though.23:32
segherand you can usually get machine resources (if you are not in a hurry).  but a lot of human work remains23:32
programmerjake[mI'd assume libre-soc would be donating some of our cpus to debian and fedora so they can test on them once they are ready23:32
segherrsc: but it still is there23:32
programmerjake[mit was previously official iirc23:32
segherthat is quite long ago23:33
segher2017 (stretch)23:35
seghernot as long as i thought, but heh, over 3 years23:35
rscsegher: yes, but projects seem to try to get rid of BE.
programmerjake[myeah, if libre-soc wants wide sw support, we need powerpc64le23:37
segherGo dropped ppc64 (BE) because of human staffing issues23:38
segherand yes, 64LE seems to be the future23:38
segherBE is still marginally faster, but heh23:38
rscEclipse dropped ppc64 (BE), too (both being the reason for Fedora to drop it:
lkcli need to find out what ABI toshywoshy is using in http://powerel.org23:44
rsc says "same ABI as the open source rhel based systems, so you can use existing binaries on PowerEL"23:44
lxoadd guixsd to the list of distros whose packages are built on the user side23:53
lkcllxo: ahh ty23:55
lxolast weekend wasn't very productive for me.  I looked a little into the remaining regressions after the big register renumbering patch in GCC, and found them all to be related with -fstack-check; something's going wrong throwing Ada exceptions out of signal handlers.  still finding my way around that; though I'm reasonably familiar with the stack unwinding code, throwing out of signal handlers is a little special23:56
lxosegher, BTW, I have a patch that prepares for the libre-soc renumbering, using macros instead of literals for FP, CR and VEC registers throughout the codebase.  you think that makes any sense to contribute way ahead of libre-soc extensions?23:58
sorearwhat's the actual long term plan here?  even if you get the patches upstream there's going to be an expectation of maintenance if they're large enough23:59
