Thursday, 2022-08-11

*** matthewcroughan <matthewcroughan!~quassel@static.211.38.12.49.clients.your-server.de> has quit IRC		01:57
*** ghostmansd <ghostmansd!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has quit IRC		07:05
*** ghostmansd <ghostmansd!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has joined #libre-soc		07:52
alethkit	lkcl: Out of curiosity, have you heard of the Sail emulator?	07:55
alethkit	It might make writing unit tests easier.	07:55
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC		08:44
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc		08:53
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has quit IRC		09:03
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.55.53> has joined #libre-soc		09:03
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC		10:03
markos	reg. the last discussion yesterday, fwiw, I don't think separate gather/scatter instructions are needed at all in SVP64; they are hardly of any benefit wherever they have been implemented, and they overcomplicate things by adding way too many instructions, e.g. this is ridiculous: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#!=undefined&ig_expand=2886,2887,5486,774,3842,3844,3840,4462,4490,4490,3825,3819,4995,3819,2755,2757,7525,4869,7164,4787,6762,6762,6672,1699,2312,5079,5078,3879,5467&text=gather	10:13
markos	if you really want to use gather/scatter on SVP64, make sure you have fast concurrent access to RAM/cache	10:15
markos	Arm also suffers from slow gather/scatter	10:15
programmerjake	gather/scatter are essential for gpus, in a gpu shader the majority of loads/stores are gather/scatter	10:20
programmerjake	the plan is to make them fast	10:21
programmerjake	gpus optimize their hardware by fusing the element ops of a gather/scatter into one wide memory access assuming each gather/scatter instruction is unit-strided or all in the same cache line	10:23
programmerjake	libre-soc will want to do something similar (not necessarily 1 instruction, but fusing memory accesses so the l2/l3 cache and memory only see 1 wide access)	10:25
programmerjake	the way svp64 gets gather/scatter is just by having a normal load/store, but setting the address register to be a vector -- no custom instructions needed.	10:26
markos	if you can make them fast while avoiding the mess that Intel (and Arm to a lesser degree) have made, by all means	10:26
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc		10:26
programmerjake	that's the plan :)	10:26
markos	I've tried gather/scatter on AVX2/AVX512 so many times, and every time performance is actually worse, it's so bad it hurts	10:27
markos	these days, I just leave the scalar code in place and don't bother anymore	10:28
markos	anyway, glad to hear that no extra instructions are needed :)	10:29
programmerjake	at worst, performance should be equivalent to a sequence of scalar load/store ops; in a lot of cases it should be better. do note that the fusing i described will most likely also occur with scalar ops (because svp64 is basically just dumping a pile of scalar ops at the wide backend), so scalar != slow	10:30
markos	nice	10:30
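The fusing idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 64-byte cache line, the toy `memory` dictionary and the function names are all hypothetical and are not taken from the Libre-SOC code base. It treats a gather as a plain sequence of scalar loads, then counts how many wide accesses would remain if all element loads falling in the same cache line were merged.

```python
from collections import defaultdict

CACHE_LINE_BYTES = 64  # assumed line size; not stated in the discussion


def gather(memory, addresses):
    """Reference behaviour: one scalar load per element address."""
    return [memory[a] for a in addresses]


def fuse_by_cache_line(addresses, line_bytes=CACHE_LINE_BYTES):
    """Group element addresses by the cache line they fall in.

    Each group stands for one wide access seen by the next cache level.
    """
    lines = defaultdict(list)
    for addr in addresses:
        lines[addr // line_bytes].append(addr)
    return dict(lines)


# Toy word-addressed memory, purely for illustration.
memory = {a: a * 10 for a in range(0, 1024, 4)}

unit_strided = [128 + 4 * i for i in range(8)]   # all inside one 64-byte line
scattered = [0, 64, 128, 512, 516, 900, 904, 1000]

print(gather(memory, unit_strided))
print("wide accesses, unit-strided:", len(fuse_by_cache_line(unit_strided)))
print("wide accesses, scattered:   ", len(fuse_by_cache_line(scattered)))
```

In this sketch the unit-strided case collapses to a single wide access, while the scattered case still needs several, which is the "at worst equivalent to scalar, often better" behaviour described in the conversation.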
markos	ooc, how wide is the access to L1/L2 cache going to be? iirc, VMX/VSX has a 128-bit memory bus to L1 cache - I don't remember if that also applies to L2 cache, but I'm pretty sure this was the case with VMX in the early days	10:32
programmerjake	l1 should be >= 128-bits, l2 may be narrower in our first design (simplicity & fitting in our silicon budget), they should be a lot wider later	10:33
markos	I guess number of memory channels depends on the actual implementation also	10:33
programmerjake	yup	10:33
markos	ok	10:34
programmerjake	do count on later designs being more gpu like with wide dram busses	10:34
programmerjake	e.g. 192-bit as 32-bit to 6 memory chips	10:35
markos	I'm actually very much looking forward to testing it, even on fpga	10:35
programmerjake	:)	10:35
markos	it would make low-level coding fun again, without having to go through 5 thousand pages to find one particular instruction :)	10:36
programmerjake	or maybe we could do something like a hypercube with 8 chips each communicating with 3 others and some local ram	10:37
markos	yeah a 3d topology	10:37
programmerjake	or 4d for high-end stuff	10:37
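For reference, the hypercube wiring mentioned above has a compact description: number the chips in binary, and connect two chips when their numbers differ in exactly one bit. The following sketch (the function name is hypothetical) lists the neighbours for the 8-chip, 3-link case and generalises to 4d.

```python
def hypercube_neighbours(node, dimensions):
    """Return the nodes whose index differs from `node` in exactly one bit."""
    return [node ^ (1 << bit) for bit in range(dimensions)]


# 3d hypercube: 8 chips, each talking to 3 others.
for chip in range(2 ** 3):
    print(chip, hypercube_neighbours(chip, 3))

# 4d hypercube: 16 chips, each talking to 4 others.
print(len(hypercube_neighbours(0, 4)))
```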
markos	I have always wondered how come they don't build chips like that yet	10:37
markos	I know Arm have introduced Neoverse V1 which is using 2D layers interconnected with high speed lanes	10:38
markos	and upcoming V2	10:38
programmerjake	cuz they're kinda just starting...amd started making mcm cpus mainstream just a few years ago	10:38
markos	but I haven't seen 3D yet	10:38
programmerjake	it would still likely be on a 2d board, just with 3d topology	10:39
programmerjake	i guess technically they have 3d -- the raspberry pi puts a ram chip on top of their cpu iirc	10:40
markos	yeah I was talking about the chip	10:41
markos	will be interesting times, as exciting as late 80s, early 90s wrt chip technology :)	10:42
programmerjake	making a 3d si chip is super slow because each layer takes a lot of time and it's really hard to put transistors on top of other transistors	10:43
markos	that's what Arm has done with V1/V2	10:43
markos	I was very impressed when I saw the presentation	10:43
markos	they managed to reduce paths between units by putting them right on top of each other	10:44
markos	and in doing so, reduced thermal footprint as well	10:45
programmerjake	oh neat, only one i knew about was flash chips -- they're easier because the layers are simpler and repetitive	10:45
markos	but tbh, I don't know the process, how long it took them	10:45
programmerjake	more time per chip = higher price, lower yield, and fewer chips at full fab throughput	10:46
markos	apologies, this was done for N1 even, https://community.cadence.com/cadence_blogs_8/b/breakfast-bytes/posts/taking-arm-neoverse-into-3d-with-digital-full-flow	10:55
markos	I saw that presentation at ArmDevSummit last year	10:55
programmerjake	oh, neat!	10:56
markos	I think the plain N1 chips are not 3D though	10:56
markos	I remember that this was mentioned as an evolution to be used in the upcoming models	10:57
markos	hence V1/N2	10:57
markos	anyway	10:57
programmerjake	well, it's 3am here, ttyl	11:02
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.55.53> has quit IRC		12:04
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has joined #libre-soc		12:05
lkcl	alethkit, yes i have. the problem (sigh) is that nobody has created a modern up-to-date Power ISA port / definition.	12:23
lkcl	although that might have recently changed, because i think Boris Shingarov used our machine-readable markdown versions of the Power ISA Spec, containing actual executable pseudocode, along with our language-translator/parser	12:24
lkcl	and instead of having that language-translator output python i think he got it to output a Sail definition	12:25
lkcl	markos that's a frickin ridiculous number of intrinsics. illustrates precisely why we cannot go for a 1D intrinsics paradigm.	12:27
markos	to anyone that's interested, found the Neoverse 3D video presentation: https://www.youtube.com/watch?v=XJWQx8ZswSI	12:27
lkcl	remember that the gather-scatter capability is not an actual instruction, at all	12:27
markos	lkcl, yes, programmerjake explained that to me	12:27
lkcl	there happens to be twin predication as a general concept.	12:27
markos	I've been burned multiple times by trying to implement gather/scatter using avx2/avx512 intrinsics	12:28
markos	with zero benefit	12:28
markos	absolutely no gain, and I spent weeks if not months	12:28
markos	I was going crazy, trying to figure out what I was doing wrong	12:28
markos	turned out I did nothing wrong, it's just that Intel's gather/scatter is completely useless	12:29
lkcl	if you set src=all1s and dest=mask with zeroing disabled you get the effect in the abstract of... err... src=1s... scatter(?)	12:29
lkcl	and if you set src=mask and dest=all1s you get gather	12:29
lkcl	(i get confused which way round it is)	12:29
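The two mask combinations just described can be tried out with a small model of a twin-predicated element copy. This is a hedged sketch, not the SVP64 specification: the function name and the list-based "registers" are hypothetical, and it assumes zeroing is disabled, so unselected destination elements are left untouched. Source and destination masks are stepped through independently.

```python
def twin_predicated_copy(src, dest, src_mask, dest_mask):
    """Copy elements from src to dest, skipping masked-out elements
    on the source side and on the destination side independently.

    Masks are integers with bit i selecting element i. Zeroing is
    assumed disabled: unselected dest elements keep their old value.
    """
    vl = len(src)
    out = list(dest)
    i = j = 0
    while True:
        while i < vl and not (src_mask >> i) & 1:
            i += 1                      # skip unselected source elements
        while j < vl and not (dest_mask >> j) & 1:
            j += 1                      # skip unselected dest elements
        if i >= vl or j >= vl:
            break
        out[j] = src[i]
        i += 1
        j += 1
    return out


src = [10, 11, 12, 13]
dest = [0, 0, 0, 0]
ALL_ONES = 0b1111
MASK = 0b1010

# src=all1s, dest=mask: consecutive sources spread out to selected slots.
print(twin_predicated_copy(src, dest, ALL_ONES, MASK))
# src=mask, dest=all1s: selected sources packed into consecutive slots.
print(twin_predicated_copy(src, dest, MASK, ALL_ONES))
```

Which of the two is better called "gather" and which "scatter" is left open here, as it is in the discussion; the sketch only shows that both effects fall out of one general mechanism rather than from dedicated instructions.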
markos	yes, that's just great	12:30
lkcl	but to stop (prohibit) that from being possible on LD/ST would actually violate the RISC paradigm!	12:30
lkcl	but, also it's important to appreciate	12:31
lkcl	where other ISAs have SIMD front-ends, they still have to decide a back-end micro-architecture	12:31
lkcl	(ARM's core designers choose to make the back-end behind gather/scatter "slow", GPU core designers have to make the back-end behind gather/scatter "fast")	12:32
lkcl	these hardware designer choices, based on market forces driving to a particular commercial need, don't actually have anything to do with the actual front-end ISA itself!	12:33
lkcl	iow if we (or more specifically RED) chooses to go after the 3D market then we (or RED) will have to do a back-end fast hardware gather-scatter engine	12:34
lkcl	otherwise commercially it will be suicide	12:34
markos	isn't it "just" the case of enabling multiple concurrent loads/stores between the cpu/cache/memory?	12:35
lkcl	yes programmerjake we will almost certainly have to spot patterns and batch ops together, that's what the L0CacheBuffer is for	12:35
lkcl	the number of wires going into it is just absolutely mental though	12:35
markos	I mean, isn't fast gather/scatter an added bonus of multiple concurrent loads/stores?	12:36
lkcl	markos, yeees but even for "just" 8 LD/STs @ 64-bit width, with auto-detection of mis-alignment, you have to have QTY 16 LD/ST 64-bit back-end units	12:36
markos	why double?	12:37
lkcl	64-bit address plus 64-bit data plus some control wires equals about 150 wires times SIXTEEN equals 2,400 wires going into a very small area.	12:37
markos	damn	12:37
lkcl	because you have 8 bytes but 1 of them could be on the even 64-bit and the other 7 into the odd 64-bit	12:37
lkcl	last time i calculated it, it was actually around 3,000, iirc	12:38
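The "why double?" point and the wire count can be checked with a short sketch. It is an illustration only: the function name is hypothetical, and the figure of 22 control wires is an assumption chosen so that the per-unit total matches the "about 150" quoted above. A misaligned 8-byte access is split into the 64-bit lanes it touches, each with a byte mask.

```python
LANE_BYTES = 8  # one 64-bit back-end unit


def split_misaligned(addr, length=8):
    """Split an access into (lane_address, byte_mask) pieces,
    one piece per 64-bit lane that the access touches."""
    pieces = []
    end = addr + length
    while addr < end:
        lane = addr - (addr % LANE_BYTES)
        first = addr - lane
        last = min(end, lane + LANE_BYTES) - lane
        mask = 0
        for byte in range(first, last):
            mask |= 1 << byte
        pieces.append((lane, mask))
        addr = lane + LANE_BYTES
    return pieces


# Aligned: one lane, all 8 bytes.
print([(hex(a), bin(m)) for a, m in split_misaligned(0x1000)])
# Misaligned by 7: 1 byte in one lane, the other 7 in the next.
print([(hex(a), bin(m)) for a, m in split_misaligned(0x1007)])

# Rough wire count, following the figures in the conversation.
ADDR_WIRES = 64
DATA_WIRES = 64
CONTROL_WIRES = 22                 # assumption, to reach "about 150"
units = 8 * 2                      # 8 LD/STs, each possibly split in two
wires = (ADDR_WIRES + DATA_WIRES + CONTROL_WIRES) * units
print(units, "units,", wires, "wires")
```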
markos	well, you could limit 8 LD/STs for aligned addresses	12:38
markos	to avoid that overlapping	12:38
lkcl	that violates the Power ISA specification	12:38
markos	damn again	12:38
markos	well, not the ISA, but the actual loads	12:39
lkcl	what does not violate the Power ISA specification is to trap when a misaligned access goes across a 4k Virtual Memory page boundary	12:39
lkcl	and when i did the research i came across something very interesting	12:39
markos	I mean the gather engine could issue up to 8x LDs if they're aligned, but up to 4x if they are all unaligned	12:39
lkcl	violates the Power ISA Spec which is dependent on the Scalar operations	12:40
lkcl	so there is a trick that you can do for spotting merging of addresses	12:40
lkcl	using only 12 bits to recognise that they are to be merged	12:40
markos	does the ISA define how many uops are issued?	12:40
lkcl	if you go to 13 bits it is far too complicated in hardware	12:40
lkcl	no	12:40
lkcl	thus	12:40
lkcl	we can retrospectively work out that IBM very specifically has implemented something similar to what i envisaged it would be best to do	12:41
lkcl	has already encountered the problem	12:41
lkcl	and very specifically allowed the spec to make an exception	12:41
lkcl	mis-aligned LD/STs basically are not allowed to cross a page boundary	12:42
lkcl	and it's down to "attempting to do so is insanely hard in hardware so we allow a trap to be generated"	12:42
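One possible reading of the "12 bits" remark, sketched below, is that the low 12 bits of an address are the offset within a 4k page, and these bits are the same before and after virtual-to-physical translation. The hardware design itself is not described in this log, so everything here is an assumption: the function names are hypothetical, the 16-byte merge granularity is a placeholder, and a match on the low 12 bits only marks two accesses as merge candidates (the upper address bits would still have to agree).

```python
PAGE_BITS = 12
PAGE_SIZE = 1 << PAGE_BITS        # 4k pages


def page_offset(addr):
    """Low 12 bits: unchanged by address translation with 4k pages."""
    return addr & (PAGE_SIZE - 1)


def crosses_page(addr, length):
    """True if an access of `length` bytes starting at `addr` runs past
    the end of its 4k page (the case where a trap is permitted)."""
    return page_offset(addr) + length > PAGE_SIZE


def merge_candidates(addr_a, addr_b, line_bytes=16):
    """True if the low 12 bits place both accesses in the same
    line-sized slot of a page. line_bytes is an assumed granularity."""
    return page_offset(addr_a) // line_bytes == page_offset(addr_b) // line_bytes


print(crosses_page(0x0FF8, 8))    # ends exactly at the page boundary
print(crosses_page(0x0FFC, 8))    # spills into the next page
print(merge_candidates(0x2010, 0x201F))
print(merge_candidates(0x2010, 0x2020))
```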
lkcl	the linux kernel went ape-shit until i added microwatt-style misaligned LD/STs.	12:43
lkcl	that's just on scalar ones	12:43
lkcl	it's fundamentally baked-in to all the software that misaligned LD/STs will be supported in hardware.	12:44
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has quit IRC		12:47
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.55.204> has joined #libre-soc		12:48
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.55.204> has quit IRC		12:59
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.55.204> has joined #libre-soc		13:00
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.55.204> has quit IRC		13:02
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has joined #libre-soc		13:02
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has quit IRC		13:20
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has joined #libre-soc		13:22
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has quit IRC		13:50
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.164.233> has joined #libre-soc		13:51
lkcl	alethkit: re unit tests what we have is something unimaginatively called "the Test API"	18:15
lkcl	which is python unit tests comprising a list of instructions, a "state" (memory, registers), and an "Expected" state (memory, registers post-execution)	18:16
lkcl	these have been converted to the Test API	18:17
lkcl	https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/alu/alu_cases.py;hb=HEAD	18:17
lkcl	they can currently be used by the Power ISA Simulator we have written, and by TestIssuer (again, unimaginatively, it issues tests to the HDL implementation)	18:17
lkcl	also there is a way to issue the same tests to qemu	18:18
lkcl	(by way of python-pygdbmi)	18:18
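The shape of such a test (a list of instructions, an initial state, and an expected state, runnable against more than one back-end) can be sketched as follows. This is not the openpower-isa Test API: the class `IsaTestCase`, the functions `toy_run` and `check`, and the two-instruction toy executor are all hypothetical, and only registers are modelled, with memory left out for brevity.

```python
from dataclasses import dataclass, field

MASK64 = (1 << 64) - 1


@dataclass
class IsaTestCase:
    """Hypothetical test record: instructions plus before/after state."""
    name: str
    instructions: list
    initial_regs: dict = field(default_factory=dict)
    expected_regs: dict = field(default_factory=dict)


def toy_run(case):
    """Toy back-end understanding only `addi rt, ra, imm` and
    `add rt, ra, rb`. A real back-end would be a simulator, HDL, qemu..."""
    regs = dict(case.initial_regs)
    for insn in case.instructions:
        op, operands = insn.split(None, 1)
        a, b, c = (int(x) for x in operands.split(","))
        if op == "addi":
            base = 0 if b == 0 else regs.get(b, 0)   # RA=0 means literal 0
            regs[a] = (base + c) & MASK64
        elif op == "add":
            regs[a] = (regs.get(b, 0) + regs.get(c, 0)) & MASK64
        else:
            raise ValueError(f"unsupported instruction: {insn}")
    return regs


def check(case, backend):
    """Run one case on one back-end and compare against expected state."""
    actual = backend(case)
    return all(actual.get(r) == v for r, v in case.expected_regs.items())


case = IsaTestCase(
    name="add_two_immediates",
    instructions=["addi 1, 0, 5", "addi 2, 0, 7", "add 3, 1, 2"],
    expected_regs={1: 5, 2: 7, 3: 12},
)
print(case.name, "PASS" if check(case, toy_run) else "FAIL")
```

Because `check` takes the back-end as a parameter, the same case list could in principle be handed to any executor with the same calling convention, which is the property that makes it possible to reuse one set of tests across a simulator, an HDL implementation and qemu.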
lkcl	and the next phase is to add	18:18
lkcl	gem5	18:18
lkcl	microwatt	18:18
lkcl	verilator	18:18
lkcl	icarus	18:18
lkcl	FPGAs	18:18
lkcl	compiling-of-Makefiles-so-as-to-be-able-to-compile-and-execute-standalone-binaries-on-Power-Compliant-Hardware	18:19
lkcl	Power ISA virtual machines (kvm)	18:19
lkcl	did i mention already this has been a f* of a lot of work and is going to be a f* of a lot more? :)	18:19
lkcl	the only thing that's a little poignant / sad is that there's f***-all help or collaboration from any other team, company, university or OPF Member/Stakeholder	18:20
lkcl	which i have to say is really bizarre / anomalous.	18:21
lkcl	especially given that there's sixty (!) RISK5 Technical Working Groups	18:22
alethkit	RISC-V does have nearly all of the "open hardware" mindshare	19:17
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC		19:19
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc		19:20
*** tplaten <tplaten!~isengaara@55d4bbca.access.ecotel.net> has joined #libre-soc		19:30
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC		19:44
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc		19:51
*** choozy <choozy!~choozy@75-63-174-82.ftth.glasoperator.nl> has joined #libre-soc		20:15
lkcl	indeed. great for teaching, great for proprietary "never-see-the-light-of-day" scenarios, like Western Digital SSDs/HDDs, Trinamic's TMC2660 stepper ICs, NVIDIA GPU internal architectures, AndesSTAR USB-Audio DSPs and so on	21:08
lkcl	nobody knows that AndesSTAR's market is a billion-units "goes into nearly every USB headset on the planet" kind of market	21:09
lkcl	unnnfortunately, the ISA is so anaemic that it needs 50% more opcodes to reach par with ARM Cortex A73, which is what the Alibaba Group were forced to do... as rogue custom instructions (!)	21:10
lkcl	and it's so new that the patents contributed by members are completely inadequate, providing no protection whatsoever against all and any who hold patents pre-dating them	21:11
lkcl	it'll eat ARM's lunch in embedded markets (up until the patent lawsuits arrive in droves)	21:13
tplaten	never heard about AndesSTAR	21:15
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.164.233> has quit IRC		21:15
lkcl	exactly!	21:15
lkcl	only reason i ever heard about them is because they proposed a PackedSIMD extension to RISK5	21:16
tplaten	I assume they have their own custom isa	21:18
lkcl	like Western Digital, they used to.	21:23
lkcl	Trinamic licensed ARM, and their use of RISK5 shaved $1 off the cost of their Stepper ICs.	21:23
lkcl	which is why everyone wants to drop ARM in embedded markets and use the "free" RISK5 instead	21:24
tplaten	Did PowerPC have similar license fees in the past, what was the licencing model back in G4 times?	21:30
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC		21:54
choozy	PowerPC is a derivative of the POWER architecture, developed by the AIM alliance; the licensing fees could have been non-existent, since every company in the then-formed AIM (Apple, IBM, Motorola) alliance got its share of the profits from the sales of these chips	22:19
choozy	They were used in Apple Machines, some IBM servers and Amiga systems	22:20
*** choozy <choozy!~choozy@75-63-174-82.ftth.glasoperator.nl> has quit IRC		22:34
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc		22:49
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC		23:15
*** lx0 <lx0!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc		23:15
*** lx0 <lx0!~lxo@gateway/tor-sasl/lxo> has quit IRC		23:18
*** lx0 <lx0!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc		23:18
