Thursday, 2022-08-11

*** matthewcroughan <matthewcroughan!> has quit IRC01:57
*** ghostmansd <ghostmansd!> has quit IRC07:05
*** ghostmansd <ghostmansd!> has joined #libre-soc07:52
alethkitlkcl: Out of curiosity, have you heard of the Sail emulator?07:55
alethkitIt might make writing unit tests easier.07:55
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC08:44
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc08:53
*** ghostmansd[m] <ghostmansd[m]!> has quit IRC09:03
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc09:03
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC10:03
markosreg. last discussion yesterday, fwiw, I don't think separate gather/scatter instructions are needed at all in SVP64, they hardly are of a benefit wherever they have been implemented, they overcomplicate things by adding way too many instructions, eg. this is ridiculous:!=undefined&ig_expand=2886,2887,5486,774,3842,3844,3840,4462,4490,4490,3825,3819,4995,3819,2755,2757,7525,4869,710:13
markosfull link:!=undefined&ig_expand=2886,2887,5486,774,3842,3844,3840,4462,4490,4490,3825,3819,4995,3819,2755,2757,7525,4869,7164,4787,6762,6762,6672,1699,2312,5079,5078,3879,5467&text=gather10:14
markosif you really want to use gather/scatter on SVP64, make sure you have fast concurrent access to RAM/cache10:15
markosArm also suffers from slow gather/scatter10:15
programmerjakegather/scatter are essential for gpus, in a gpu shader the majority of loads/stores are gather/scatter10:20
programmerjakethe plan is to make them fast10:21
programmerjakegpus optimize their hardware by fusing the element ops of a gather/scatter into one wide memory access assuming each gather/scatter instruction is unit-strided or all in the same cache line10:23
programmerjakelibre-soc will want to do something similar (not necessarily 1 instruction, but fusing memory accesses so the l2/l3 cache and memory only see 1 wide access)10:25
programmerjakethe way svp64 gets gather/scatter is just by having a normal load/store, but setting the address register to be a vector -- no custom instructions needed.10:26
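A minimal sketch of the point above, assuming a toy memory modelled as a Python dict: a gather is just the ordinary scalar load repeated once per element, with the address taken from a vector of registers. The names `scalar_load` and `vector_load` are hypothetical and illustrative only; this is not the SVP64 pseudocode.

```python
def scalar_load(memory, address):
    """Ordinary scalar load: one address in, one value out."""
    return memory[address]

def vector_load(memory, address_regs, vl):
    """Apply the same scalar load to each of the first `vl` address registers.

    Because every element supplies its own address, the result is a gather,
    even though no dedicated gather instruction exists.
    """
    return [scalar_load(memory, address_regs[i]) for i in range(vl)]

# Toy memory and a vector of arbitrary (non-contiguous) addresses.
memory = {0x100: 11, 0x108: 22, 0x200: 33, 0x310: 44}
addresses = [0x310, 0x100, 0x200, 0x108]
gathered = vector_load(memory, addresses, vl=4)
```

A scatter would be the same loop wrapped around a scalar store.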
markosif you can make them fast and avoid the mess that Intel (and Arm to a lesser degree) have made, by all means10:26
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc10:26
programmerjakethat's the plan :)10:26
markosI've tried gather/scatter on AVX2/AVX512 so many times, and every time performance is actually worse, it's so bad it hurts10:27
markosthese days, I just leave the scalar code in place and don't bother anymore10:28
markosanyway, glad to hear that no extra instructions are needed :)10:29
programmerjakeat worst, performance should be equivalent to a sequence of scalar load/store ops, a lot of cases it should be better. do note that the fusing i described will most likely also occur with scalar ops (because svp64 is basically just dumping a pile of scalar ops at the wide backend) so scalar != slow10:30
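The fusing idea described above could be sketched roughly as follows: group the element addresses of a batch of loads by cache line, so that the outer cache levels see one wide access per line rather than one access per element. The 64-byte line size and the function name `fuse_by_cache_line` are assumptions for illustration; real hardware would do this with comparators, not a dict.

```python
CACHE_LINE_BYTES = 64  # assumed line size, for illustration only

def fuse_by_cache_line(addresses, line_bytes=CACHE_LINE_BYTES):
    """Group element addresses by the cache line they fall in.

    Each key is a line base address; each group would become one wide access.
    """
    groups = {}
    for addr in addresses:
        line_base = addr - (addr % line_bytes)
        groups.setdefault(line_base, []).append(addr)
    return groups

# Eight unit-strided 64-bit elements fit in a single 64-byte line.
unit_strided = [0x1000 + 8 * i for i in range(8)]
# Arbitrary addresses: two share a line, the other two do not.
scattered = [0x1000, 0x2040, 0x1008, 0x3F80]

wide_accesses_strided = len(fuse_by_cache_line(unit_strided))
wide_accesses_scattered = len(fuse_by_cache_line(scattered))
```

In this toy model the unit-strided case collapses to a single wide access, and the scattered case degrades gracefully to one access per distinct line, which is never worse than one access per element.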
markosooc, how wide is the access to L1/L2 cache going to be, iirc, VMX/VSX has a 128-bit memory bus to L1 cache -don't remember if that also applies to L2 cache, but pretty sure this was the case with VMX in early days10:32
programmerjakel1 should be >= 128-bits, l2 may be narrower in our first design (simplicity & fitting in our silicon budget), they should be a lot wider later10:33
markosI guess number of memory channels depends on the actual implementation also10:33
programmerjakedo count on later designs being more gpu like with wide dram busses10:34
programmerjakee.g. 192-bit as 32-bit to 6 memory chips10:35
markosI'm actually really looking forward to testing it, even on fpga10:35
markoswould make coding low level fun again and not having to go through 5 thousand pages to find out one particular instruction :)10:36
programmerjakeor maybe we could do something like a hypercube with 8 chips each communicating with 3 others and some local ram10:37
markosyeah a 3d topology10:37
programmerjakeor 4d for high-end stuff10:37
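For reference, a hypercube topology is easy to describe in code: number the chips in binary, and each chip's neighbours are the IDs that differ from it in exactly one bit. A small sketch (the function name is hypothetical):

```python
def hypercube_neighbours(node, dimensions):
    """Return the IDs directly linked to `node` in a hypercube.

    Flipping each of the `dimensions` ID bits in turn yields one neighbour.
    """
    return [node ^ (1 << bit) for bit in range(dimensions)]

# 3D: 8 chips, each talking to 3 others.
links_3d = {n: hypercube_neighbours(n, 3) for n in range(2 ** 3)}
# 4D: 16 chips, each talking to 4 others.
links_4d = {n: hypercube_neighbours(n, 4) for n in range(2 ** 4)}
```

The number of links per chip equals the number of dimensions, which is also the worst-case hop count between any two chips.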
markosI have always wondered how come they don't build chips like that yet10:37
markosI know Arm have introduced Neoverse V1 which is using 2D layers interconnected with high speed lanes10:38
markosand upcoming V210:38
programmerjakecuz they're kinda just starting...amd started making mcm cpus mainstream just a few years ago10:38
markosbut I haven't seen 3D yet10:38
programmerjakeit would still likely be on a 2d board, just with 3d topology10:39
programmerjakei guess technically they have 3d -- the raspberry pi puts a ram chip on top of their cpu iirc10:40
markosyeah I was talking about the chip10:41
markoswill be interesting times, as exciting as late 80s, early 90s wrt chip technology :)10:42
programmerjakemaking a 3d si chip is super slow because each layer takes a lot of time and it's really hard to put transistors on top of other transistors10:43
markosthat's what Arm has done with V1/V210:43
markosI was very impressed when I saw the presentation10:43
markosthey managed to reduce paths between units by putting them right on top of each other10:44
markosand in doing so, reduced thermal footprint as well10:45
programmerjakeoh neat, only one i knew about was flash chips -- they're easier because the layers are simpler and repetitive10:45
markosbut tbh, I don't know the process, how long it took them10:45
programmerjakemore time per chip = higher price and lower yield and fewer chips at full fab throughput10:46
markosapologies, this was done for N1 even,
markosI saw that presentation at ArmDevSummit last year10:55
programmerjakeoh, neat!10:56
markosI think the plain N1 chips are not 3D though10:56
markosI remember that this was mentioned as an evolution to be used in the upcoming models10:57
markoshence V1/N210:57
programmerjakewell, it's 3am here, ttyl11:02
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC12:04
*** ghostmansd[m] <ghostmansd[m]!> has joined #libre-soc12:05
lkclalethkit, yes i have. the problem (sigh) is that nobody has created a modern up-to-date Power ISA port / definition.12:23
lkclalthough that *might* have recently changed, because i think Boris Shingarov used our machine-readable markdown versions of the Power ISA Spec, containing actual executable pseudocode, along with our language-translator/parser12:24
lkcland instead of having that language-translator output python i *think* he got it to output a Sail definition12:25
lkclmarkos that's a frickin ridiculous number of intrinsics. illustrates precisely why we cannot go for a 1D intrinsics paradigm.12:27
markosto anyone that's interested, found the Neoverse 3D video presentation:
lkclremember that the gather-scatter capability is not an actual instruction, at all12:27
markoslkcl, yes, programmerjake explained that to me12:27
lkclthere happens to be twin predication as a *general* concept.12:27
markosI've been burned multiple times by trying to implement gather/scatter using avx2/avx512 intrinsics12:28
markoswith *zero* benefit12:28
markosabsolutely no gain, and I spent weeks if not months12:28
markosI was going crazy, trying to figure out what I was doing wrong12:28
markosturned out I did *nothing* wrong, it's just that Intel's gather/scatter is completely useless12:29
lkclif you set src=all1s and dest=mask with zeroing disabled you get the *effect* in the *abstract* of... err... src=1s... scatter(?)12:29
lkcland if you set src=mask and dest=all1s you get gather12:29
lkcl(i get confused which way round it is)12:29
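A rough sketch of twin predication as described above, assuming non-zeroing mode and bit i of each mask controlling element i. The source index skips masked-out source elements and the destination index independently skips masked-out destination elements. The function name and list-based register model are illustrative assumptions, not the SVP64 reference pseudocode.

```python
def twin_predicated_copy(src, src_mask, dest, dest_mask, vl):
    """Copy elements from src to dest under separate source and dest masks.

    Masked-out destination elements keep their old value (zeroing disabled).
    """
    result = list(dest)
    i = j = 0
    while i < vl and j < vl:
        if not (src_mask >> i) & 1:
            i += 1                      # skip masked-out source element
        elif not (dest_mask >> j) & 1:
            j += 1                      # skip masked-out destination element
        else:
            result[j] = src[i]
            i += 1
            j += 1
    return result

src = [10, 20, 30, 40]
dest = [0, 0, 0, 0]

# src = all 1s, dest = mask: consecutive sources spread out into selected slots.
spread = twin_predicated_copy(src, 0b1111, dest, 0b1010, vl=4)
# src = mask, dest = all 1s: selected sources packed into consecutive slots.
packed = twin_predicated_copy(src, 0b1010, dest, 0b1111, vl=4)
```

In this model the first case behaves like an expand (scatter-like) and the second like a compress (gather-like), which matches the way round given above.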
markosyes, that's just great12:30
lkclbut to *stop* (prohibit) that from being possible on LD/ST would actually violate the RISC paradigm!12:30
lkclbut, also it's important to appreciate12:31
lkclwhere other ISAs have SIMD front-ends, they *still* have to decide a back-end micro-architecture12:31
lkcl(ARM's core designers *choose* to make the back-end behind gather/scatter "slow", GPU core designers *have* to make the back-end behind gather/scatter "fast")12:32
lkclthese *hardware designer* choices, based on market forces driving to a particular commercial need, don't actually have anything to do with the actual front-end ISA itself!12:33
lkcliow if we (or more specifically RED) chooses to go after the 3D market then we (or RED) will *have* to do a back-end fast hardware gather-scatter engine12:34
lkclotherwise commercially it will be suicide12:34
markosisn't it "just" the case of enabling multiple concurrent loads/stores between the cpu/cache/memory?12:35
lkclyes programmerjake we will almost certainly have to spot patterns and batch ops together, that's what the L0CacheBuffer is for12:35
lkclthe number of wires going into it is just absolutely mental though12:35
markosI mean, isn't fast gather/scatter an added bonus of multiple concurrent loads/stores?12:36
lkclmarkos, yeees but even for "just" 8 LD/STs @ 64-bit width, with auto-detection of mis-alignment, you have to have QTY 16 LD/ST 64-bit back-end units12:36
markoswhy double?12:37
lkcl64-bit address plus 64-bit data plus some control wires equals about 150 wires times SIXTEEN equals 2,400 wires going into a very small area.12:37
lkclbecause you have 8 bytes but 1 of them could be on the even 64-bit and the other 7 into the odd 64-bit12:37
lkcllast time i calculated it, it was actually around 3,000, iirc12:38
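The doubling and the wire count above can be checked with a few lines of arithmetic. This sketch assumes 64-bit (8-byte) back-end units and takes 22 control wires per port, a made-up figure chosen only so that each port comes to the "about 150" quoted above.

```python
WORD_BYTES = 8  # one 64-bit back-end unit

def words_touched(address, length=8):
    """List the aligned 64-bit words that an access of `length` bytes touches."""
    first = address // WORD_BYTES
    last = (address + length - 1) // WORD_BYTES
    return list(range(first, last + 1))

def wire_estimate(ports, addr_bits=64, data_bits=64, control_bits=22):
    """Rough wire count: address + data + (assumed) control wires per port."""
    return ports * (addr_bits + data_bits + control_bits)

aligned = words_touched(0x1000)       # stays within one word
misaligned = words_touched(0x1007)    # 1 byte in one word, 7 in the next

# 8 LD/STs, each possibly straddling two words, need 16 back-end units.
total_wires = wire_estimate(ports=16)
```

An aligned 8-byte access touches one word, while any misaligned one touches two, hence 16 units for 8 LD/STs and roughly 16 x 150 = 2,400 wires.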
markoswell, you could limit 8 LD/STs for aligned addresses12:38
markosto avoid that overlapping12:38
lkclthat violates the Power ISA specification12:38
markosdamn again12:38
markoswell, not the ISA, but the actual loads12:39
lkclwhat *does not* violate the Power ISA specification is to go above a 4k boundary on Virtual Memory misalignment12:39
lkcland when i did the research i came across something very interesting12:39
markosI mean the gather engine could issue up to 8x LDs if they're aligned, but up to 4x if they are all unaligned12:39
lkclviolates the Power ISA Spec which is dependent on the Scalar operations12:40
lkclso there is a trick that you can do for spotting merging of addresses12:40
lkclusing only 12 bits to recognise that they are to be merged12:40
markosdoes the ISA define how many uops are issued?12:40
lkclif you go to 13 bits it is far too complicated *in hardware*12:40
lkclwe can retrospectively work out that IBM *very specifically* has implemented something similar to what i envisaged it would be best to do12:41
lkclhas already encountered the problem12:41
lkcland *very specifically* allowed the spec to make an exception12:41
lkclmis-aligned LD/STs basically are not allowed to cross a page boundary12:42
lkcland it's down to "attempting to do so is insanely hard in hardware so we allow a trap to be generated"12:42
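As an illustration of why 12 bits is the natural cut-off, assuming 4 KiB pages: the low 12 address bits are the offset within a page and are identical before and after translation, so a page-crossing check needs nothing above bit 11. The helper names below are hypothetical, and this is only a sketch of the check, not of the actual merging hardware.

```python
PAGE_BITS = 12                 # assumed 4 KiB pages
PAGE_SIZE = 1 << PAGE_BITS

def page_offset(address):
    """Low 12 bits: the offset within a 4 KiB page."""
    return address & (PAGE_SIZE - 1)

def crosses_page(address, length):
    """True if an access of `length` bytes starting at `address` spills into the next page."""
    return page_offset(address) + length > PAGE_SIZE

ends_on_boundary = crosses_page(0x1FF8, 8)   # last 8 bytes of the page
spills_over = crosses_page(0x1FFC, 8)        # 4 bytes in each page
```

An access that spills over would need two translations, which is the hard case the discussion above says may be handled with a trap.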
lkclthe linux kernel went ape-shit until i added microwatt-style misaligned LD/STs.12:43
lkclthat's just on scalar ones12:43
lkclit's fundamentally baked-in to all the software that misaligned LD/STs will be supported in hardware.12:44
*** ghostmansd[m] <ghostmansd[m]!> has quit IRC12:47
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc12:48
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC12:59
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc13:00
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC13:02
*** ghostmansd[m] <ghostmansd[m]!> has joined #libre-soc13:02
*** ghostmansd[m] <ghostmansd[m]!> has quit IRC13:20
*** ghostmansd[m] <ghostmansd[m]!> has joined #libre-soc13:22
*** ghostmansd[m] <ghostmansd[m]!> has quit IRC13:50
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc13:51
lkclalethkit: re unit tests what we have is something unimaginatively called "the Test API"18:15
lkclwhich is python unit tests comprising a list of instructions, a "state" (memory, registers), and an "Expected" state (memory, registers post-execution)18:16
lkclthese have been converted to the Test API18:17
lkclthey can currently be used by the Power ISA Simulator we have written, and by TestIssuer (again, unimaginatively, it issues tests to the HDL implementation)18:17
lkclalso there is a way to issue the same tests to qemu18:18
lkcl(by way of python-pygdbmi)18:18
lkcland the next phase is to add18:18
lkclPower ISA virtual machines (kvm)18:19
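To make the shape of such a test concrete, here is a toy sketch: a list of instructions, an initial state, and an expected state, run against an interchangeable back-end. Every name here (`MachineState`, `IsaTestCase`, `toy_simulator`, `run_case`) and the two simplified instructions are hypothetical; this is not the actual Libre-SOC Test API.

```python
from dataclasses import dataclass, field

@dataclass
class MachineState:
    registers: dict = field(default_factory=dict)
    memory: dict = field(default_factory=dict)

@dataclass
class IsaTestCase:
    instructions: list        # e.g. [("addi", 3, 2, 5), ("std", 3, 1)]
    initial: MachineState
    expected: MachineState

def toy_simulator(instructions, state):
    """Stand-in back-end with simplified semantics (not real Power ISA)."""
    regs = dict(state.registers)
    mem = dict(state.memory)
    for op, *args in instructions:
        if op == "addi":              # rt = ra + immediate
            rt, ra, imm = args
            regs[rt] = regs.get(ra, 0) + imm
        elif op == "std":             # store rs at the address held in ra
            rs, ra = args
            mem[regs[ra]] = regs[rs]
        else:
            raise ValueError(f"unsupported op: {op}")
    return MachineState(regs, mem)

def run_case(case, backend):
    """Run one case on any back-end with the same call signature."""
    return backend(case.instructions, case.initial) == case.expected

case = IsaTestCase(
    instructions=[("addi", 3, 2, 5), ("std", 3, 1)],
    initial=MachineState(registers={1: 0x100, 2: 7}),
    expected=MachineState(registers={1: 0x100, 2: 7, 3: 12}, memory={0x100: 12}),
)
passed = run_case(case, toy_simulator)
```

Keeping the case separate from the back-end is what would let the same list of tests be issued to a simulator, an HDL implementation, or qemu.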
lkcldid i mention already this has been a f*** of a lot of work and is going to be a f*** of a lot more? :)18:19
lkclthe only thing that's a little poignant / sad is that there's f***-all help or collaboration from any other team, company, university or OPF Member/Stakeholder18:20
lkclwhich i have to say is *really* bizarre / anomalous.18:21
lkclespecially given that there's *sixty* (!) RISK5 Technical Working Groups18:22
alethkitRISC-V does have nearly all of the "open hardware" mindshare19:17
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC19:19
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc19:20
*** tplaten <tplaten!> has joined #libre-soc19:30
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC19:44
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc19:51
*** choozy <choozy!> has joined #libre-soc20:15
lkclindeed. great for teaching, great for proprietary "never-see-the-light-of-day" scenarios, like Western Digital SSDs/HDDs, Trinamic's TMC2660 stepper ICs, NVIDIA GPU internal architectures, AndesSTAR USB-Audio DSPs and so on21:08
lkclnobody knows that AndesSTAR's market is a billion-units "goes into nearly every USB headset on the planet" kind of market21:09
lkclunnnfortunately, the ISA is so anaemic that it needs 50% more opcodes to reach par with ARM Cortex A73, which is what the Alibaba Group were forced to do... as *rogue* custom instructions (!)21:10
lkcland it's so new that the patents contributed by members are completely inadequate, providing no protection whatsoever to all and any who have patents pre-dating them21:11
lkclit'll eat ARM's lunch in embedded markets (up until the patent lawsuits arrive in droves)21:13
tplatennever heard about AndesSTAR21:15
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC21:15
lkclonly reason i ever heard about them is because they proposed a PackedSIMD extension to RISK521:16
tplatenI assume they have their own custom isa21:18
lkcllike Western Digital, they used to.21:23
lkclTrinamic licensed ARM, and their use of RISK5 shaved $1 off the cost of their Stepper ICs.21:23
lkclwhich is why everyone wants to drop ARM in embedded markets and use the "free" RISK5 instead21:24
tplatenDid PowerPC have similar license fees in the past, what was the licencing model back in G4 times?21:30
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC21:54
choozyPowerPC was AIM's derivative of the POWER architecture, and the licensing fees may well have been non-existent, because every company in the then-formed AIM (Apple, IBM, Motorola) alliance got its share of the profits from sales of these chips22:19
choozyThey were used in Apple Machines, some IBM servers and Amiga systems22:20
*** choozy <choozy!> has quit IRC22:34
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc22:49
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC23:15
*** lx0 <lx0!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc23:15
*** lx0 <lx0!~lxo@gateway/tor-sasl/lxo> has quit IRC23:18
*** lx0 <lx0!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc23:18
