*** matthewcroughan <matthewcroughan!~quassel@static.211.38.12.49.clients.your-server.de> has quit IRC | 01:57 | |
*** ghostmansd <ghostmansd!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has quit IRC | 07:05 | |
*** ghostmansd <ghostmansd!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has joined #libre-soc | 07:52 | |
alethkit | lkcl: Out of curiosity, have you heard of the Sail emulator? | 07:55 |
---|---|---|
alethkit | It might make writing unit tests easier. | 07:55 |
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC | 08:44 | |
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc | 08:53 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has quit IRC | 09:03 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.55.53> has joined #libre-soc | 09:03 | |
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC | 10:03 | |
markos | reg. last discussion yesterday, fwiw, I don't think separate gather/scatter instructions are needed at all in SVP64, they hardly are of a benefit wherever they have been implemented, they overcomplicate things by adding way too many instructions, eg. this is ridiculous: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#!=undefined&ig_expand=2886,2887,5486,774,3842,3844,3840,4462,4490,4490,3825,3819,4995,3819,2755,2757,7525,4869,7 | 10:13 |
markos | 164,4787,6762,6762,6672,1699,2312,5079,5078,3879,5467&text=gather | 10:13 |
markos | full link: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#!=undefined&ig_expand=2886,2887,5486,774,3842,3844,3840,4462,4490,4490,3825,3819,4995,3819,2755,2757,7525,4869,7164,4787,6762,6762,6672,1699,2312,5079,5078,3879,5467&text=gather | 10:14 |
markos | if you really want to use gather/scatter on SVP64, make sure you have fast concurrent access to RAM/cache | 10:15 |
markos | Arm also suffers from slow gather/scatter | 10:15 |
programmerjake | gather/scatter are essential for gpus, in a gpu shader the majority of loads/stores are gather/scatter | 10:20 |
programmerjake | the plan is to make them fast | 10:21 |
programmerjake | gpus optimize their hardware by fusing the element ops of a gather/scatter into one wide memory access assuming each gather/scatter instruction is unit-strided or all in the same cache line | 10:23 |
programmerjake | libre-soc will want to do something similar (not necessarily 1 instruction, but fusing memory accesses so the l2/l3 cache and memory only see 1 wide access) | 10:25 |
programmerjake | the way svp64 gets gather/scatter is just by having a normal load/store, but setting the address register to be a vector -- no custom instructions needed. | 10:26 |
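
A minimal Python sketch of the fusing idea described above: a gather is treated as a list of per-element addresses (the vector address register), and elements that land in the same cache line are grouped so that the memory system would only see one wide access per line. The 64-byte line size and the `coalesce` helper are assumptions made for illustration, not Libre-SOC design parameters.

```python
from collections import OrderedDict

CACHE_LINE_BYTES = 64  # assumed line size, purely illustrative


def coalesce(addresses, line_bytes=CACHE_LINE_BYTES):
    """Group per-element byte addresses by the cache line they fall in.

    Returns an ordered mapping {line_base_address: [element indices]}.
    Each key stands for one wide access that serves all listed elements.
    """
    lines = OrderedDict()
    for element, addr in enumerate(addresses):
        line_base = addr - (addr % line_bytes)
        lines.setdefault(line_base, []).append(element)
    return lines


# Unit-strided case: 8 x 4-byte elements starting at 0x1000 share one line.
unit = coalesce([0x1000 + 4 * i for i in range(8)])

# Scattered case: every element sits in a different line.
scattered = coalesce([0x1000, 0x2040, 0x3080, 0x40C0])

print(len(unit), "access(es) for the unit-strided gather")
print(len(scattered), "access(es) for the scattered gather")
```

In the worst case the number of accesses equals the number of elements, which matches the point made in the discussion that performance should never be worse than the equivalent sequence of scalar loads.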
markos | if you can make them fast and avoid the mess that Intel (and Arm to a lesser degree) have made, by all means | 10:26 |
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc | 10:26 | |
programmerjake | that's the plan :) | 10:26 |
markos | I've tried gather/scatter on AVX2/AVX512 so many times, and every time performance is actually worse, it's so bad it hurts | 10:27 |
markos | these days, I just leave the scalar code in place and don't bother anymore | 10:28 |
markos | anyway, glad to hear that no extra instructions are needed :) | 10:29 |
programmerjake | at worst, performance should be equivalent to a sequence of scalar load/store ops, a lot of cases it should be better. do note that the fusing i described will most likely also occur with scalar ops (because svp64 is basically just dumping a pile of scalar ops at the wide backend) so scalar != slow | 10:30 |
markos | nice | 10:30 |
markos | ooc, how wide is the access to L1/L2 cache going to be, iirc, VMX/VSX has a 128-bit memory bus to L1 cache -don't remember if that also applies to L2 cache, but pretty sure this was the case with VMX in early days | 10:32 |
programmerjake | l1 should be >= 128-bits, l2 may be narrower in our first design (simplicity & fitting in our silicon budget), they should be a lot wider later | 10:33 |
markos | I guess number of memory channels depends on the actual implementation also | 10:33 |
programmerjake | yup | 10:33 |
markos | ok | 10:34 |
programmerjake | do count on later designs being more gpu like with wide dram busses | 10:34 |
programmerjake | e.g. 192-bit as 32-bit to 6 memory chips | 10:35 |
markos | I'm actually really looking forward to testing it, even on fpga | 10:35 |
programmerjake | :) | 10:35 |
markos | would make low-level coding fun again, without having to go through 5 thousand pages to find one particular instruction :) | 10:36 |
programmerjake | or maybe we could do something like a hypercube with 8 chips each communicating with 3 others and some local ram | 10:37 |
markos | yeah a 3d topology | 10:37 |
programmerjake | or 4d for high-end stuff | 10:37 |
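
The hypercube arrangement mentioned above can be illustrated with a short sketch: if the chips are numbered in binary, each chip's neighbours are the chips whose number differs in exactly one bit, so 8 chips (3 bits) give 3 links each and 16 chips (4 bits) give 4 links each. The function name is hypothetical.

```python
def hypercube_neighbours(node, dimensions):
    """Return the nodes directly linked to `node` in a hypercube.

    Nodes are numbered 0 .. 2**dimensions - 1; two nodes are linked
    when their numbers differ in exactly one bit.
    """
    return [node ^ (1 << bit) for bit in range(dimensions)]


# 3D case: 8 chips, each talking to 3 others.
for chip in range(8):
    print(chip, hypercube_neighbours(chip, 3))
```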
markos | I have always wondered how come they don't build chips like that yet | 10:37 |
markos | I know Arm have introduced Neoverse V1 which is using 2D layers interconnected with high speed lanes | 10:38 |
markos | and upcoming V2 | 10:38 |
programmerjake | cuz they're kinda just starting...amd started making mcm cpus mainstream just a few years ago | 10:38 |
markos | but I haven't seen 3D yet | 10:38 |
programmerjake | it would still likely be on a 2d board, just with 3d topology | 10:39 |
programmerjake | i guess technically they have 3d -- the raspberry pi puts a ram chip on top of their cpu iirc | 10:40 |
markos | yeah I was talking about the chip | 10:41 |
markos | will be interesting times, as exciting as late 80s, early 90s wrt chip technology :) | 10:42 |
programmerjake | making a 3d si chip is super slow because each layer takes a lot of time and it's really hard to put transistors on top of other transistors | 10:43 |
markos | that's what Arm has done with V1/V2 | 10:43 |
markos | I was very impressed when I saw the presentation | 10:43 |
markos | they managed to reduce paths between units by putting them right on top of each other | 10:44 |
markos | and in doing so, reduced thermal footprint as well | 10:45 |
programmerjake | oh neat, only one i knew about was flash chips -- they're easier because the layers are simpler and repetitive | 10:45 |
markos | but tbh, I don't know the process, how long it took them | 10:45 |
programmerjake | time per chip = higher price and lower yield and less chips at full fab throughput | 10:46 |
markos | apologies, this was done for N1 even, https://community.cadence.com/cadence_blogs_8/b/breakfast-bytes/posts/taking-arm-neoverse-into-3d-with-digital-full-flow | 10:55 |
markos | I saw that presentation at ArmDevSummit last year | 10:55 |
programmerjake | oh, neat! | 10:56 |
markos | I think the plain N1 chips are not 3D though | 10:56 |
markos | I remember that this was mentioned as an evolution to be used in the upcoming models | 10:57 |
markos | hence V1/N2 | 10:57 |
markos | anyway | 10:57 |
programmerjake | well, it's 3am here, ttyl | 11:02 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.55.53> has quit IRC | 12:04 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has joined #libre-soc | 12:05 | |
lkcl | alethkit, yes i have. the problem (sigh) is that nobody has created a modern up-to-date Power ISA port / definition. | 12:23 |
lkcl | although that *might* have recently changed, because i think Boris Shingarov used our machine-readable markdown versions of the Power ISA Spec, containing actual executable pseudocode, along with our language-translator/parser | 12:24 |
lkcl | and instead of having that language-translator output python i *think* he got it to output a Sail definition | 12:25 |
lkcl | markos that's a frickin ridiculous number of intrinsics. illustrates precisely why we cannot go for a 1D intrinsics paradigm. | 12:27 |
markos | to anyone that's interested, found the Neoverse 3D video presentation: https://www.youtube.com/watch?v=XJWQx8ZswSI | 12:27 |
lkcl | remember that the gather-scatter capability is not an actual instruction, at all | 12:27 |
markos | lkcl, yes, programmerjake explained that to me | 12:27 |
lkcl | there happens to be twin predication as a *general* concept. | 12:27 |
markos | I've been burned multiple times by trying to implement gather/scatter using avx2/avx512 intrinsics | 12:28 |
markos | with *zero* benefit | 12:28 |
markos | absolutely no gain, and I spent weeks if not months | 12:28 |
markos | I was going crazy, trying to figure out what I was doing wrong | 12:28 |
markos | turned out I did *nothing* wrong, it's just that Intel's gather/scatter is completely useless | 12:29 |
lkcl | if you set src=all1s and dest=mask with zeroing disabled you get the *effect* in the *abstract* of... err... src=1s... scatter(?) | 12:29 |
lkcl | and if you set src=mask and dest=all1s you get gather | 12:29 |
lkcl | (i get confused which way round it is) | 12:29 |
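
A rough Python sketch of the twin-predication idea in the abstract, with zeroing disabled: the source mask selects which source elements are read, the destination mask selects which destination slots are written, and the two element counters advance independently. This is an illustration written for this log, not the SVP64 specification pseudocode, and `twin_predicated_copy` is a hypothetical name.

```python
def twin_predicated_copy(src, srcmask, dstmask, dest):
    """Copy elements of `src` selected by `srcmask` into the slots of
    `dest` selected by `dstmask`. Unselected destination slots keep
    their old value (non-zeroing behaviour)."""
    out = list(dest)
    s = d = 0
    while True:
        # Skip source elements and destination slots that are masked out.
        while s < len(src) and not srcmask[s]:
            s += 1
        while d < len(out) and not dstmask[d]:
            d += 1
        if s >= len(src) or d >= len(out):
            break
        out[d] = src[s]
        s += 1
        d += 1
    return out


src = [10, 11, 12, 13]
all_ones = [1, 1, 1, 1]
mask = [0, 1, 0, 1]

# src=all1s, dest=mask: consecutive source elements are spread out
# over the masked destination slots.
spread = twin_predicated_copy(src, all_ones, mask, [0, 0, 0, 0])

# src=mask, dest=all1s: the masked source elements are packed into
# consecutive destination slots.
packed = twin_predicated_copy(src, mask, all_ones, [0, 0, 0, 0])

print(spread)
print(packed)
```

The two calls correspond to the two mask combinations described above; the sketch deliberately avoids deciding which of them deserves the name "gather" and which "scatter".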
markos | yes, that's just great | 12:30 |
lkcl | but to *stop* (prohibit) that from being possible on LD/ST would actually violate the RISC paradigm! | 12:30 |
lkcl | but, also it's important to appreciate | 12:31 |
lkcl | where other ISAs have SIMD front-ends, they *still* have to decide a back-end micro-architecture | 12:31 |
lkcl | (ARM's core designers *choose* to make the back-end behind gather/scatter "slow", GPU core designers *have* to make the back-end behind gather/scatter "fast") | 12:32 |
lkcl | these *hardware designer* choices, based on market forces driving to a particular commercial need, don't actually have anything to do with the actual front-end ISA itself! | 12:33 |
lkcl | iow if we (or more specifically RED) chooses to go after the 3D market then we (or RED) will *have* to do a back-end fast hardware gather-scatter engine | 12:34 |
lkcl | otherwise commercially it will be suicide | 12:34 |
markos | isn't it "just" the case of enabling multiple concurrent loads/stores between the cpu/cache/memory? | 12:35 |
lkcl | yes programmerjake we will almost certainly have to spot patterns and batch ops together, that's what the L0CacheBuffer is for | 12:35 |
lkcl | the number of wires going into it is just absolutely mental though | 12:35 |
markos | I mean, isn't fast gather/scatter an added bonus of multiple concurrent loads/stores? | 12:36 |
lkcl | markos, yeees but even for "just" 8 LD/STs @ 64-bit width, with auto-detection of mis-alignment, you have to have QTY 16 LD/ST 64-bit back-end units | 12:36 |
markos | why double? | 12:37 |
lkcl | 64-bit address plus 64-bit data plus some control wires equals about 150 wires times SIXTEEN equals 2,400 wires going into a very small area. | 12:37 |
markos | damn | 12:37 |
lkcl | because you have 8 bytes but 1 of them could be on the even 64-bit and the other 7 into the odd 64-bit | 12:37 |
lkcl | last time i calculated it, it was actually around 3,000, iirc | 12:38 |
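
The doubling and the wire count above can be sketched as follows. A misaligned access of up to 8 bytes can straddle two aligned 64-bit words, so each of the 8 LD/STs may need two back-end units. The helper name and the figure of 22 control wires (chosen only so that the total comes to the "about 150" quoted above) are assumptions.

```python
def split_misaligned(addr, size):
    """Split an access of `size` bytes (1..8) at byte address `addr`
    into pieces that each stay inside one aligned 8-byte word.

    Returns a list of (word_address, byte_offset, length) tuples.
    """
    assert 1 <= size <= 8
    first_word = addr & ~7          # aligned 64-bit word holding the first byte
    offset = addr & 7               # position of the first byte in that word
    first_len = min(size, 8 - offset)
    pieces = [(first_word, offset, first_len)]
    if first_len < size:            # the rest spills into the next word
        pieces.append((first_word + 8, 0, size - first_len))
    return pieces


# The worst case from the discussion: 1 byte in one word, 7 in the next.
print(split_misaligned(0x1007, 8))

# Back-of-envelope wire count, using the figures quoted in the log.
ADDRESS_WIRES = 64
DATA_WIRES = 64
CONTROL_WIRES = 22                  # assumed, to reach "about 150"
wires_per_unit = ADDRESS_WIRES + DATA_WIRES + CONTROL_WIRES
units = 8 * 2                       # 8 LD/STs, doubled for misalignment
print(wires_per_unit * units, "wires")
```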
markos | well, you could limit 8 LD/STs for aligned addresses | 12:38 |
markos | to avoid that overlapping | 12:38 |
lkcl | that violates the Power ISA specification | 12:38 |
markos | damn again | 12:38 |
markos | well, not the ISA, but the actual loads | 12:39 |
lkcl | what *does not* violate the Power ISA specification is to go above a 4k boundary on Virtual Memory misalignment | 12:39 |
lkcl | and when i did the research i came across something very interesting | 12:39 |
markos | I mean the gather engine could issue up to 8x LDs if they're aligned, but up to 4x if they are all unaligned | 12:39 |
lkcl | violates the Power ISA Spec which is dependent on the Scalar operations | 12:40 |
lkcl | so there is a trick that you can do for spotting merging of addresses | 12:40 |
lkcl | using only 12 bits to recognise that they are to be merged | 12:40 |
markos | does the ISA define how many uops are issued? | 12:40 |
lkcl | if you go to 13 bits it is far too complicated *in hardware* | 12:40 |
lkcl | no | 12:40 |
lkcl | thus | 12:40 |
lkcl | we can retrospectively work out that IBM *very specifically* has implemented something similar to what i envisaged it would be best to do | 12:41 |
lkcl | has already encountered the problem | 12:41 |
lkcl | and *very specifically* allowed the spec to make an exception | 12:41 |
lkcl | mis-aligned LD/STs basically are not allowed to cross a page boundary | 12:42 |
lkcl | and it's down to "attempting to do so is insanely hard in hardware so we allow a trap to be generated" | 12:42 |
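
A small sketch of the page-boundary condition discussed above, assuming 4 KiB pages. With that page size the low 12 address bits are the offset within the page and are not changed by address translation, so the check needs only those 12 bits. The function name is hypothetical, and the sketch only detects the crossing; what the hardware then does (for example raising a trap) is outside its scope.

```python
PAGE_BITS = 12                      # assumed 4 KiB pages
PAGE_SIZE = 1 << PAGE_BITS


def crosses_page(addr, size):
    """True if an access of `size` bytes starting at `addr` touches
    two different 4 KiB pages. Only the low 12 bits of `addr` matter."""
    page_offset = addr & (PAGE_SIZE - 1)
    return page_offset + size > PAGE_SIZE


print(crosses_page(0x1FF8, 8))      # last 8 bytes of a page: stays inside
print(crosses_page(0x1FFC, 8))      # starts 4 bytes before the boundary
```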
lkcl | the linux kernel went ape-shit until i added microwatt-style misaligned LD/STs. | 12:43 |
lkcl | that's just on scalar ones | 12:43 |
lkcl | it's fundamentally baked-in to all the software that misaligned LD/STs will be supported in hardware. | 12:44 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has quit IRC | 12:47 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.55.204> has joined #libre-soc | 12:48 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.55.204> has quit IRC | 12:59 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.55.204> has joined #libre-soc | 13:00 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.55.204> has quit IRC | 13:02 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has joined #libre-soc | 13:02 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has quit IRC | 13:20 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has joined #libre-soc | 13:22 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has quit IRC | 13:50 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.164.233> has joined #libre-soc | 13:51 | |
lkcl | alethkit: re unit tests what we have is something unimaginatively called "the Test API" | 18:15 |
lkcl | which is python unit tests comprising a list of instructions, a "state" (memory, registers), and an "Expected" state (memory, registers post-execution) | 18:16 |
lkcl | these have been converted to the Test API | 18:17 |
lkcl | https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/alu/alu_cases.py;hb=HEAD | 18:17 |
lkcl | they can currently be used by the Power ISA Simulator we have written, and by TestIssuer (again, unimaginatively, it issues tests to the HDL implementation) | 18:17 |
lkcl | also there is a way to issue the same tests to qemu | 18:18 |
lkcl | (by way of python-pygdbmi) | 18:18 |
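
To illustrate the shape of such a test (a list of instructions, an initial state, and an expected state) and how one case can be handed to several back-ends, here is a toy Python sketch. Everything in it is hypothetical: the `Case` class, the tuple instruction format, and the one-instruction `run_on_toy_model` interpreter are invented for illustration and are not the openpower-isa Test API.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """One test: instructions plus register state before and after."""
    name: str
    instructions: list      # e.g. [("add", rt, ra, rb)]
    initial_regs: dict      # register number -> value before execution
    expected_regs: dict     # register number -> value after execution


def run_on_toy_model(case):
    """Stand-in back-end: a tiny interpreter that only knows 64-bit add."""
    regs = dict(case.initial_regs)
    for op, rt, ra, rb in case.instructions:
        if op != "add":
            raise ValueError("unsupported op in toy model: " + op)
        regs[rt] = (regs.get(ra, 0) + regs.get(rb, 0)) & (2**64 - 1)
    return regs


def check(case, backends):
    """Run one case on every back-end and compare the expected registers."""
    results = {}
    for name, run in backends.items():
        final = run(case)
        results[name] = all(final.get(reg) == value
                            for reg, value in case.expected_regs.items())
    return results


case = Case("add_simple", [("add", 3, 1, 2)], {1: 5, 2: 7}, {3: 12})
print(check(case, {"toy_sim": run_on_toy_model}))
```

The point of the structure is that the case itself knows nothing about where it runs, so adding another back-end is a matter of supplying another `run` function with the same signature.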
lkcl | and the next phase is to add | 18:18 |
lkcl | gem5 | 18:18 |
lkcl | microwatt | 18:18 |
lkcl | verilator | 18:18 |
lkcl | icarus | 18:18 |
lkcl | FPGAs | 18:18 |
lkcl | compiling-of-Makefiles-so-as-to-be-able-to-compile-and-execute-standalone-binaries-on-Power-Compliant-Hardware | 18:19 |
lkcl | Power ISA virtual machines (kvm) | 18:19 |
lkcl | did i mention already this has been a f*** of a lot of work and is going to be a f*** of a lot more? :) | 18:19 |
lkcl | the only thing that's a little poignant / sad is that there's f***-all help or collaboration from any other team, company, university or OPF Member/Stakeholder | 18:20 |
lkcl | which i have to say is *really* bizarre / anomalous. | 18:21 |
lkcl | especially given that there's *sixty* (!) RISK5 Technical Working Groups | 18:22 |
alethkit | RISC-V does have nearly all of the "open hardware" mindshare | 19:17 |
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC | 19:19 | |
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc | 19:20 | |
*** tplaten <tplaten!~isengaara@55d4bbca.access.ecotel.net> has joined #libre-soc | 19:30 | |
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC | 19:44 | |
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc | 19:51 | |
*** choozy <choozy!~choozy@75-63-174-82.ftth.glasoperator.nl> has joined #libre-soc | 20:15 | |
lkcl | indeed. great for teaching, great for proprietary "never-see-the-light-of-day" scenarios, like Western Digital SSDs/HDDs, Trinamic's TMC2660 stepper ICs, NVIDIA GPU internal architectures, AndesSTAR USB-Audio DSPs and so on | 21:08 |
lkcl | nobody knows that AndesSTAR's market is a billion-units "goes into nearly every USB headset on the planet" kind of market | 21:09 |
lkcl | unnnfortunately, the ISA is so anaemic that it needs 50% more opcodes to reach par with ARM Cortex A73, which is what the Alibaba Group were forced to do... as *rogue* custom instructions (!) | 21:10 |
lkcl | and it's so new that the patents contributed by members are completely inadequate, providing no protection whatsoever to all and any who have patents pre-dating them | 21:11 |
lkcl | it'll eat ARM's lunch in embedded markets (up until the patent lawsuits arrive in droves) | 21:13 |
tplaten | never heard about AndesSTAR | 21:15 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.164.233> has quit IRC | 21:15 | |
lkcl | exactly! | 21:15 |
lkcl | only reason i ever heard about them is because they proposed a PackedSIMD extension to RISK5 | 21:16 |
tplaten | I assume they have their own custom isa | 21:18 |
lkcl | like Western Digital, they used to. | 21:23 |
lkcl | Trinamic licensed ARM, and their use of RISK5 shaved $1 off the cost of their Stepper ICs. | 21:23 |
lkcl | which is why everyone wants to drop ARM in embedded markets and use the "free" RISK5 instead | 21:24 |
tplaten | Did PowerPC have similar license fees in the past, what was the licencing model back in G4 times? | 21:30 |
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC | 21:54 | |
choozy | PowerPC would be a derivative of the POWER architecture from AIM, but the licensing fees could be non-existent, because every company in the then-formed AIM (Apple, IBM, Motorola) alliance got its share of the profits from the sales of these chips | 22:19 |
choozy | They were used in Apple Machines, some IBM servers and Amiga systems | 22:20 |
*** choozy <choozy!~choozy@75-63-174-82.ftth.glasoperator.nl> has quit IRC | 22:34 | |
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc | 22:49 | |
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC | 23:15 | |
*** lx0 <lx0!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc | 23:15 | |
*** lx0 <lx0!~lxo@gateway/tor-sasl/lxo> has quit IRC | 23:18 | |
*** lx0 <lx0!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc | 23:18 |