Wednesday, 2022-03-23

programmerjake	lkcl, reminder to send me the link to the prefix-sum tree thing you said i forgot	00:24
lkcl	programmerjake, tomorrow :)	00:25
programmerjake	k	00:25
programmerjake	i'll send you an email...	00:25
lkcl	it was a rust simd discussion	00:26
lkcl	you cross-ref'd. the algorithm was the same	00:26
markos	lkcl, I was searching for clang pronunciation and found this: https://www.kernel.org/doc/html/latest/kbuild/llvm.html "Clang is a front-end to LLVM that supports C and the GNU C extensions required by the kernel, and is pronounced “klang,” not “see-lang.”" :)	10:32
markos	but also this: https://news.ycombinator.com/item?id=11561046	10:33
lkcl	lol sea lang. klang is shorter	10:54
lkcl	but.. but... that means you have to pronounce "c" as just "kuh"	10:54
lkcl	:)	10:54
lkcl	"oh i'm a kuh-plus-plus developer"	10:55
markos	hahaha	10:55
lkcl	https://www.youtube.com/watch?v=qTvhKZHAP8U	10:56
markos	well, can be worse, I've heard people talking about a strange popular language 'hava' and I had no idea what they were talking about until I made the connection... :D	10:56
lkcl	oh dear. were they from Europe by chance? spain, brazil, portugal, netherlands?	10:57
markos	Belgium :)	10:59
markos	Flanders in particular, same bewilderment when I heard talking about loching and loch monitoring	10:59
lkcl	i lived in holland for just over 3 years so i get that one	11:00
markos	used to work for 3 years myself in Flanders region	11:01
markos	loved the beer and the food	11:01
markos	and made some good friends there	11:01
lkcl	ahh tell me about it. only beer i actually like is duvel. or leffe, brun	11:02
markos	my favourite is Westvleteren, named the best Trappist beer in the world for some years, my 2nd Omer (a local beer to the region)	11:04
markos	I'm sorry that I cannot find these beers here easily	11:04
lkcl	the trappist beers are remarkably strong. belgo nord and belgo centraal in London were the only authorised sellers in the UK	11:05
lkcl	yes the monks just won't let them be sold to anybody.	11:05
markos	apparently you can get them now https://www.westvleterenshop.com/apps/webstore	11:06
markos	but they're not cheap	11:06
markos	I did go there once myself and tried at the source :D	11:06
markos	carried a 6pack all the way to Greece :D	11:07
lkcl	:)	11:07
markos	unfortunately 6 bottles of beer are consumed too fast and now I'm out :)	11:09
lkcl	sigh.	11:09
lkcl	it's a remarkably strategic thinking of the monks.	11:09
lkcl	"yes honey i'm on a pilgrimage to a monastery..."	11:10
lkcl	"i may be in seclusion for some time"	11:10
lkcl	hic	11:10
markos	:D	11:12
mepy	hello, I am officially putting in pause/stop mode my commitment to the project. I know, I have done nothing, but I was keeping reading the mailing list and had a commitment, although very little, to the project still. lkcl: if you can, could you please remove me from the list?	11:36
lkcl	mepy, hey absolutely no problem at all, you've been remarkably supportive and that means a lot. you can actually remove yourself with an unsubscribe message or put it on "hold" by setting "nomail"	11:39
lkcl	i'll set "nomail", if you really want to unsubscribe you can log in to mailman yourself ok?	11:40
lkcl	mepy: i can't remember what your email address is (which is normally why you would use the mailman interface yourself). it's umberto something, isn't it?	11:43
lkcl	i don't want to change somebody else's details	11:43
mepy	yeah, it is that. thank you a lot.	11:49
mepy	i am quitting now. bye everyone!	11:51
lkcl	ok! good luck!	11:51
mepy	ty	11:52
lkcl	programmerjake, https://bugs.libre-soc.org/show_bug.cgi?id=697#c6	12:00
lkcl	found it.	12:00
lkcl	programmerjake, could you kindly alert jubilee about a reply i'm about to make, here? https://zulip-archive.rust-lang.org/stream/257879-project-portable-simd/topic/pairwise.20ops.20would.20be.20nice.html#271073695	12:01
lkcl	in SVP64's Matrix REMAP, we do not do "Horizontal Sum". or, you can if you want to, by scheduling the scalar-ADDs in such a way that they (spectacularly) overload the in-flight Reservation Stations	12:02
lkcl	instead what we do is: swap the order of the inner and outer loops so that the MULs have a chance to filter through the pipelines and catch up with the ADDs	12:02
lkcl	the "normal" way - the one we're all taught in school - is to multiply-and-sum everything that goes into result[0][0]	12:03
lkcl	then move on to result[0][1]	12:03
lkcl	then 0-2 0-3	12:03
lkcl	and then go on to 1-0 1-1 1-2 1-3	12:04
lkcl	this requires a horizontal-sum to perform efficiently in hardware, and it's a pig	12:04
lkcl	if however you change the order of the 3 for-loops...	12:04
lkcl	you use result[0][0..3] as partial accumulators	12:05
lkcl	importantly, make sure to keep a row of one of the matrices in-memory [in-registers] and use that repeatedly	12:06
lkcl	there are 3 loops involved so there are 6 possible permutations - 6 possible ways that you could order those loops - to create the result (the matrix multiply)	12:07
lkcl	one of those is the "usual way taught at school" [and requires - requires - a Horizontal Sum instruction]	12:07
lkcl	some of the others do not.	12:07
lkcl	SVP64 Matrix REMAP can do all 6 permutations. actually with inversion of each of the for-loops it's a lot more than that	12:08
lkcl	you can do inner for i 0..VL-1 or inner for i VL-1..0	12:08
lkcl	sorry you can do inner for i 0..x-1 or inner for i x-1..0	12:08
lkcl	middle for i 0..y-1 or for i y-1..0	12:09
lkcl	outer for i 0..z-1 or for i z-1..0	12:09
lkcl	i can't quite do the math in my head on the number of permutations / options there	12:09
lkcl	but they basically cover in-place rotation and transpose.	12:10
lkcl	even for non-power-of-two matrix sizes.	12:10
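The loop-reordering point above can be sketched in plain Python. This is only an illustration of the arithmetic, not of SVP64 Matrix REMAP itself: the function name `matmul_ordered` and its `order`/`reverse` parameters are hypothetical. With order `("i", "j", "k")` the innermost loop keeps adding into a single `result[i][j]` (the "taught at school" form, a chain of dependent adds). With order `("i", "k", "j")` the innermost loop spreads independent multiply-adds across the row `result[i][0..p-1]`, which act as partial accumulators.

```python
from itertools import permutations, product


def matmul_ordered(a, b, order=("i", "k", "j"), reverse=(False, False, False)):
    """Multiply a (n x m) by b (m x p) with a chosen loop nesting.

    order   : loop names from outermost to innermost, a permutation of i, j, k
    reverse : for each loop (outermost first), True to count downwards
    """
    dims = {"i": len(a), "k": len(b), "j": len(b[0])}
    result = [[0] * dims["j"] for _ in range(dims["i"])]

    # Build the three index sequences, each optionally reversed.
    ranges = []
    for name, rev in zip(order, reverse):
        indices = list(range(dims[name]))
        ranges.append(indices[::-1] if rev else indices)

    for x in ranges[0]:
        for y in ranges[1]:
            for z in ranges[2]:
                idx = dict(zip(order, (x, y, z)))
                i, j, k = idx["i"], idx["j"], idx["k"]
                result[i][j] += a[i][k] * b[k][j]
    return result


a = [[1, 2], [3, 4], [5, 6]]          # 3 x 2, deliberately not a power of two
b = [[7, 8, 9], [10, 11, 12]]         # 2 x 3

# Every loop order and every direction choice gives the same product.
all_results = [
    matmul_ordered(a, b, order, rev)
    for order in permutations("ijk")
    for rev in product((False, True), repeat=3)
]
```

Counting only loop order and per-loop direction gives 3! x 2^3 = 48 combinations, all of which agree here because integer addition is exact. With floating point the results could differ in rounding, and which of these combinations the hardware actually exposes is not something this sketch claims.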
markos	you know, if you have a rectangular 128x128? flat register file, you can build some really unique instructions, eg. defining kernels of 3x3 or 5x5 elements and doing some operations, vertically and horizontally	13:16
markos	check the description of convolutions in the beginning https://towardsdatascience.com/tensorflow-for-computer-vision-how-to-implement-convolutions-from-scratch-in-python-609158c24f82	13:17
markos	but I wouldn't want a specialized instruction, but a generic one that would allow you to perform the same operation on a NxM kernel	13:18
markos	I guess that can be remap?	13:18
*** alMalsamo is now known as lumberjack123		13:18
markos	ooc, I have a spartan-6 lx9 fpga board (xilinx) here that has been in a drawer for a while, can it be used for microwatt?	14:29
markos	it only has eth & usb	14:30
lkcl	markos, yes apparently, but with xilinx tools	14:40
lkcl	https://gitlab.com/nmigen/nmigen/-/blob/master/nmigen/vendor/xilinx.py#L549	14:40
lkcl	apparently symbiflow has had an outstanding issue since 2018 for this https://github.com/f4pga/ideas/issues/10	14:42
lkcl	yeah that explanation is brilliant, and dead easy to understand	14:49
lkcl	i think... to do it justice, though, it's going to need Extra-V (or similar)	14:50
lkcl	coherent memory-load/store	14:50
programmerjake	ooh, is that yet another person (markos) who thinks we should have picked 128-bit int/fp registers as the base for SVP64 rather than sticking with 64-bit registers? :) or was that a 128x128 grid of 64-bit registers?	14:59
markos	not the base, for all I care you could choose 8-bytes as the base :)	15:00
markos	8bits sorry	15:00
markos	yeah, the way I wrote it was wrong	15:01
markos	obviously 128x128 64-bit registers would be fantastic but that's a bit much I guess	15:02
markos	I meant originally 128x128 bytes	15:02
lkcl	2^14 register file entries (16384) is completely out of the question, yes. as is 2048 registers	15:03
lkcl	Extra-V is a means and method of arranging memory to "arrive" (and leave) registers in a coherent deterministic fashion	15:03
markos	but 128x128 = 16kb = 256 64-bit registers in a flat hierarchy	15:04
lkcl	ah bits	15:04
programmerjake	well...amd's gpus have around that many registers (16k or some nearby power of 2)	15:04
programmerjake	so it's not as much out of the question...	15:04
markos	well, the more the better, I'm not going to complain obviously :)	15:05
lkcl	https://arxiv.org/abs/2002.10143	15:05
lkcl	the problem with "more registers" is that each doubling results in routing and delay	15:05
lkcl	meaning that an upper maximum bound is placed on the clock frequency as a result.	15:05
markos	what's the ideal compromise?	15:06
lkcl	32 for a general-purpose processor	15:06
lkcl	32 x 64-bit	15:06
lkcl	most SIMD Processors will actually do batches-of-completely-independent 32x64-bit (or whatever)	15:07
lkcl	and you get "striping" effects	15:07
lkcl	say 4-way-striping	15:07
lkcl	which means in turn that unless you route data via a (slower) path	15:07
lkcl	you can only do r0 = r4+r8	15:07
lkcl	r1 = r5+r9	15:07
lkcl	r2 = r6+r10	15:07
lkcl	r3 = r7+r11	15:07
lkcl	if you try:	15:08
lkcl	r0 = r4+r5	15:08
markos	here's a crazy idea, ok, 128x128 might be too much, but how about 64x64 = 4k in a flat file where registers would be actually configurable "pointers" (if that can be done in hardware) to the flat register file	15:08
lkcl	then because r5 is in a different "stripe" (like RAID striping) from r0 and r4, the contents of r5 have to be routed via special paths, to get from "the bank for r0 r4 r8 r12 r16" and "the bank for r1 r5 r9 r13 r17"	15:09
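A minimal sketch of the striping rule described above, assuming 4-way striping where register number modulo 4 selects the bank (so r0, r4, r8, r12, r16 share one bank and r1, r5, r9, r13, r17 share another). The names `NUM_BANKS`, `bank_of` and `needs_cross_bank_routing` are hypothetical and do not come from any real design.

```python
NUM_BANKS = 4  # assumption: 4-way striping, as in the example in the chat


def bank_of(reg):
    """Return the bank (stripe) a register number falls into."""
    return reg % NUM_BANKS


def needs_cross_bank_routing(dest, src_a, src_b):
    """True if the operands do not all live in the same bank,
    meaning data would have to take the slower routed path."""
    return len({bank_of(dest), bank_of(src_a), bank_of(src_b)}) > 1


same_bank = needs_cross_bank_routing(0, 4, 8)    # r0 = r4 + r8
cross_bank = needs_cross_bank_routing(0, 4, 5)   # r0 = r4 + r5
```

In this model `r0 = r4+r5` is flagged because r5 sits in bank 1 while r0 and r4 sit in bank 0.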
lkcl	i've done the architectural design, it took 18 months to think through, and i'm just not going to change it	15:09
lkcl	it's too much	15:09
markos	no no	15:09
markos	I'm not asking you to change it	15:09
lkcl	yes, have a look at https://arxiv.org/abs/2002.10143	15:10
lkcl	and also Extra-V	15:10
markos	I saw already you mention 128 registers	15:10
lkcl	"pointers" is exactly what Snitch - and Extra-V - do	15:10
lkcl	when you refer to say add fp3, fp5, fp6	15:10
lkcl	what actually happens in Snitch is, it goes:	15:11
lkcl	"hmm, fp5 has been reconfigured as a memory-coherent-Queue-reloader. let me just get the value for you from the front of the Coherent Memory Queue rather than the actual regfile"	15:11
lkcl	fp6 is targeted (configured) to point at a SECOND Coherent Memory Queue	15:11
lkcl	and fp3 is targeted to store into a (third, pre-configured) Coherent Memory Queue	15:12
markos	so the fp* registers can be configured as aliases to an actual memory address?	15:12
lkcl	the pre-configuration of those 3 regs are then set up to run a deterministic algorithm which in the case of the ones serving "fp5 / 6"	15:13
lkcl	yes	15:13
markos	interesting	15:13
lkcl	with implicit auto-load-and-increment	15:13
markos	but without the latency of memory access?	15:13
lkcl	just like the c (*ptr++) thing	15:13
lkcl	yes and no	15:13
lkcl	if latency happens then the main processor stalls	15:13
lkcl	but the focus of the Snitch core was precisely to arrange the memory and the core so that that did not happen	15:14
markos	so the operation is done async	15:14
lkcl	they achieved this i believe by using the snitch core as a barrel processor	15:14
lkcl	ah no.	15:14
lkcl	it's definitely synchronous.	15:14
lkcl	and it's termed "coherent"	15:14
lkcl	okok	15:14
lkcl	the processor sees the FIFOs in a synchronous fashion	15:15
markos	it's synchronous wrt the Queue but the actual memory?	15:15
lkcl	but the memory-side will obviously be under different constraints	15:15
lkcl	but	15:15
lkcl	and this is the important bit	15:15
markos	sorry, bb in 10'	15:15
lkcl	the algorithm running in the Memory Controller that puts the data into the queue is - has to be - fully deterministic	15:16
lkcl	and likewise on destination (result operations) that go into the "outgoing result" FIFO	15:16
lkcl	those also have to - by definition - follow a deterministic schedule	15:17
lkcl	the simplest of such Deterministic Schedules is: *ptr++	15:17
lkcl	but	15:17
lkcl	in the case of Extra-V	15:17
lkcl	they went a whoooole new level of algorithmic fun	15:18
* lkcl afk		15:18
programmerjake	i posted the link to irc chat log on zulip, it should show up here soon: https://zulip-archive.rust-lang.org/stream/257879-project-portable-simd/topic/pairwise.20ops.20would.20be.20nice.html#271073695	15:19
programmerjake	btw, neat idea with using vector modes for crs to encode 1/2/4/8 bits per int reg...	15:35
lkcl	programmerjake, yeah, it was inspired by the sv.ori./ew=8 thing you came up with	15:44
lkcl	and that you mentioned using crweird after, i wondered how to actually back-to-back those two properly/usefully	15:45
lkcl	markos, Extra-V, instead of simple "*ptr++", can do deterministic Graph-Node-walking	15:46
lkcl	where, again, both the Memory and the Core know, in advance, what the Schedule will be	15:46
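The "register as a queue" idea can be modelled in a few lines of Python. This is a toy model under stated assumptions, not the Snitch or Extra-V implementation: the class `StreamRegFile`, its methods and the helper `fadd` are all hypothetical names. It covers only the simplest deterministic schedule mentioned above, the `*ptr++` style walk with a fixed stride. Graph-node walking is not modelled.

```python
from collections import deque


class StreamRegFile:
    """Register file where some registers are redirected to FIFOs."""

    def __init__(self, num_regs=32):
        self.regs = [0.0] * num_regs
        self.read_streams = {}    # reg number -> incoming FIFO
        self.write_streams = {}   # reg number -> outgoing FIFO

    def configure_read_stream(self, reg, memory, start, count, stride=1):
        # The schedule is fixed up front: memory[start], memory[start+stride], ...
        # (a real memory controller would fill the queue over time).
        self.read_streams[reg] = deque(
            memory[start + n * stride] for n in range(count)
        )

    def configure_write_stream(self, reg):
        self.write_streams[reg] = []

    def read(self, reg):
        if reg in self.read_streams:
            # Front of the queue instead of the actual regfile entry.
            # An empty queue raises IndexError here; hardware would stall.
            return self.read_streams[reg].popleft()
        return self.regs[reg]

    def write(self, reg, value):
        if reg in self.write_streams:
            self.write_streams[reg].append(value)
        else:
            self.regs[reg] = value


def fadd(rf, dest, src_a, src_b):
    """Model of 'add fp_dest, fp_a, fp_b' against the stream-aware regfile."""
    rf.write(dest, rf.read(src_a) + rf.read(src_b))


memory = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]
rf = StreamRegFile()
rf.configure_read_stream(5, memory, start=0, count=4)   # fp5 <- first queue
rf.configure_read_stream(6, memory, start=4, count=4)   # fp6 <- second queue
rf.configure_write_stream(3)                            # fp3 -> third queue

for _ in range(4):
    fadd(rf, 3, 5, 6)   # the same instruction, repeated

streamed_out = rf.write_streams[3]
```

The same `fadd 3, 5, 6` consumes a fresh pair of operands each time it is issued, which is the implicit auto-load-and-increment behaviour described in the chat.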
lkcl	programmerjake, i think we can probably use the same concept for 3D Shader Pixel data interpolation	15:46
programmerjake	probably, but we'd want that to go through the cache, not straight to main memory	15:50
programmerjake	texture reads ^	15:50
lkcl	Snitch managed to organise it to be direct (somehow)	15:50
programmerjake	texture reads often reuse nearby bits of memory multiple times -- caching is required...i'd guess snitch could organize their loops to read once and not need the data again, negating the need for caching (maybe? imho it'd still need caching since one loop could read/write it and the next loop could need it again)	15:54
programmerjake	as an example of why we need caching for all gpu stuff...just look at how much of a speedup amd got from their humongous cache...	15:55
* lkcl wonders what results come up by searching "AMD humongous cache"		15:59
programmerjake	they call it "infinity cache"	15:59
programmerjake	their largest gpu has 128MB of cache, probably a major portion of their claimed 50% jump in power efficiency	16:03
programmerjake	though...speaking of cache, they released a server cpu with 768MB L3 cache	16:04
lkcl	probably because IBM POWER10 has something mad as well :)	16:06
programmerjake	idk, i think they were adding extra cache to compete with intel on the desktop and thought, why not do that on the server too?	16:09
markos	I still think these are all half-measures until we can achieve 1:1 zero latency ram :)	16:26
lkcl	hey that's perfectly achievable for a max clock rate of ooo 100 mhz? :)	16:27
markos	yeah I was thinking something like modern systems, though tbh, you do need a cache even then, esp if you have multiple cores	16:30
markos	I remember reading many years ago about photonic transistors which would pave the way for photon chips, and then nothing, I wonder what happened to that technology	16:32
programmerjake	lkcl, jubilee responded "huh."	16:58
programmerjake	well....i'm hoping we eventually get 3d sram, kinda like 3d nand, then we could have 128GB cache!	17:01
markos	lkcl, quote from the mp3_0_apply_window_float_basicsv.s.sv file, "at some point 128 registers will be available", was trying to remember where did I see it and it was right in front of me. So, there is indeed a plan for so many registers, 128x64 = 8kB flat register file, my question is why not double it and get a 128x128=16k register file? Would that require too many changes in the ISA proposal?	17:23
markos	sigh, sorry again I miscalculated	17:24
markos	128x64 bits = 8kbit not kB	17:24
markos	ignore that	17:24
programmerjake	iirc lkcl didn't want to move to 128-bit registers since he seemed to think that it would be too much work to change our hdl to have 128-bit data paths...i disagree	17:25
markos	so, the register file will be 1kB and able to hold a 32x32	17:25
markos	I don't want to put extra work on him, I'm just curious	17:26
programmerjake	the current plan is to have 256 64-bit registers (128 int, 128 fp)	17:26
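The register-file sizes quoted in this exchange are easy to mix up between bits and bytes, so here is the arithmetic as a short Python check. The helper name `regs_needed` is hypothetical, and the only assumption is 64-bit registers and 8-bit bytes.

```python
BITS_PER_BYTE = 8
REG_BITS = 64  # width of one register


def regs_needed(total_bits, reg_bits=REG_BITS):
    """Number of registers of reg_bits each that hold total_bits."""
    return total_bits // reg_bits


# A 128 x 128 grid read as bits versus read as bytes.
grid_as_bits = 128 * 128                    # 16384 bits ("16kb")
grid_as_bytes = 128 * 128 * BITS_PER_BYTE   # 131072 bits

regs_for_bit_grid = regs_needed(grid_as_bits)     # 256 registers
regs_for_byte_grid = regs_needed(grid_as_bytes)   # 2048 registers

# 128 registers of 64 bits each.
file_128_bits = 128 * REG_BITS                    # 8192 bits (8 kbit)
file_128_bytes = file_128_bits // BITS_PER_BYTE   # 1024 bytes (1 kB)
block_32x32_bytes = 32 * 32                       # one 32x32 block of bytes

# 256 registers of 64 bits each (128 int plus 128 fp).
file_256_bytes = 256 * REG_BITS // BITS_PER_BYTE  # 2048 bytes
```

So a 128x128 grid of bits is 256 registers, a 128x128 grid of bytes is 2048 registers, and 128 registers hold exactly one 32x32 block of byte-sized elements.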
markos	my biggest question is how is permute going to work efficiently	17:27
markos	otoh, SVP64 design makes permute needless in many algorithms	17:28
programmerjake	currently i think the plan is to just push all the data through the register file (even though that's waay slower)	17:28
programmerjake	for permute	17:28
programmerjake	the plan will probably change when we aim for higher performance	17:29
markos	I really wish I could help here, but I have no HDL knowledge	17:30
lkcl	markos: in TestIssuer and in the simulator there are currently only QTY 32 64-bit registers	17:55
lkcl	so you kinda have to squeeze the vector ops into the existing 32 regs at the moment	17:56
lkcl	programmerjake, i have made it clear multiple times that i have had to spend 18 months with a massive complex design currently held in my head and i am NOT going to redo that design	18:04
lkcl	will you please STOP demanding that i waste my time throwing absolutely everything away and satisfy your requirement to do a 128 bit architecture	18:05
lkcl	i am getting very fed up of it	18:05
lkcl	it is making me very angry that you're not listening and i cannot take it any longer	18:06
markos	lkcl, sorry, I started it, I don't want to cause you any extra effort, my question was rather towards the use case of fitting large 2D matrices for convolution	18:07
lkcl	markos, it's ok.	18:07
markos	whether you use 64-bit or 128-bit vectors is irrelevant	18:07
markos	s/vectors/registers	18:07
lkcl	the convolution can be done using coherent memory	18:07
lkcl	because it's most useful when doing large 2D arrays that can never fit into regfiles anyway	18:08
markos	modern video codecs do DCT on largish arrays 32x32 or 64x64 even	18:08
markos	so imagine being able to do a FFT/DCT on a single block in a couple of instructions	18:08
lkcl	this is why i want to include Extra-V in SVP64	18:09
markos	right	18:09
lkcl	SV-REMAP only works - at present - when the data is entirely in registers, limiting the max size.	18:10
lkcl	SV-REMAP-on-top-of-Extra-V entirely removes those limits.	18:11
lkcl	a DCT of 2^32 would then be perfectly possible	18:11
lkcl	take a damn long time but be possible	18:11
programmerjake	lkcl, i'm not demanding we switch to 128-bit registers since i know there are other higher-priority things and 128-bits isn't strictly necessary, i'm just not going to say that i agree with you and think 64-bits is the best choice architecturally or at the isa level since my opinion hasn't changed -- 128-bits would be nicer.	18:15
tplaten	I'm working on the tplaten_3d_game branch, getting an external_core_top.v:258195: ERROR: Re-definition of module `\plru_2'!	18:37
tplaten	I need the modules entity but not its architecture behave	18:39
tplaten	In the makefile we have fpga_files and synth_files, I will have a deeper look	18:39
programmerjake	is that for libre-soc inserted into microwatt's peripherals replacing the core? if so, that would be a problem probably for lkcl...	18:41
tplaten	yes that is the case, I want to run the maze game on the libre-soc core	18:45
tplaten	and there are so many changes in microwatt causing merge conflicts if I try to merge microwatt with verilator trace, even in the makefile	18:46
markos	lkcl, I'm trying to test a small change	18:52
markos	setvl 16	18:52
markos	sv.fadds in.v, in.v, in.v	18:52
markos	(I did a .set in. 3 above)	18:52
markos	sorry sv.fadds/mrr	18:52
markos	but I'm getting the following error from pysvp64asm audio/mp3/mp3_1_imdct36_float.s audio/mp3/mp3_1_imdct36_float.s.sv	18:53
programmerjake	hmm, maybe it'd be easier to backport the maze game and usb serial stuff to the branch where libre-soc works?	18:53
programmerjake	rather than trying to merge it	18:54
markos	https://paste.debian.net/1235372/	18:56
markos	(do you use another paste service btw?)	18:56
programmerjake	we mostly just use attachments in bugzilla and ftp.libre-soc.org	18:57
programmerjake	feel free to create a bugzilla bug to track the maze game issues with libre-soc	18:59
programmerjake	https://web.archive.org/web/20220323190212/https://paste.debian.net/1235372/	19:02
programmerjake	the assembly is using an old 1-arg form of setvl, the latest version has 5 args afaict	19:16
programmerjake	6 args...miscounted	19:16
markos	right	19:40
markos	ok, I'm trying to understand how this is going to work, the code I'm trying to convert is rather simple:	19:55
markos	for (i = 17; i >= 1; i--) │ .long 3210589143 # float -0.866025388	19:55
markos	in[i] += in[i-1];	19:55
markos	crap	19:55
markos	tmux doesn't play well with copy+paste with mouse	19:56
markos	anyway	19:56
markos	for (i = 17; i >= 1; i--)	19:56
markos	in[i] += in[i-1];	19:56
markos	so, if I understand this right, and reading the comments in the file, I just need to setvl 16, and do a sv.fadds/mrr	19:57
markos	however, don't I first have to load all the elements into vectors first? using an ld into in.v? (in is 5 here)	19:57
markos	s/vectors/registers/	19:58
markos	so this should be the equivalent of setvl 16 (or the 6-args equivalent)	19:58
markos	or rather wrong	19:59
markos	in is just a pointer	19:59
markos	so I would have to say register 10, the 16 elements, like so	19:59
markos	ld 10.v, (5)	20:00
markos	and then sv.fadds/mrr 11.v, 10.v, 10.v	20:00
markos	as lkcl is writing in the comments	20:00
markos	do I understand it correctly or have I gotten this wrong?	20:00
lkcl	markos, that's pretty much it, yes.	20:38
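A small Python sketch of why the direction of the C loop matters, assuming (as the discussion suggests) that the point of the reverse mode is to walk the elements from the highest index down. The function names are hypothetical and this models only the C semantics, not the assembler syntax.

```python
def shift_add_descending(values):
    """Same as: for (i = n-1; i >= 1; i--) in[i] += in[i-1];"""
    out = list(values)
    for i in range(len(out) - 1, 0, -1):
        # out[i-1] has not been modified yet, so this reads the original value.
        out[i] += out[i - 1]
    return out


def shift_add_ascending(values):
    """Same as: for (i = 1; i < n; i++) in[i] += in[i-1];"""
    out = list(values)
    for i in range(1, len(out)):
        # out[i-1] was already updated, so the sums cascade (a prefix sum).
        out[i] += out[i - 1]
    return out


data = [1, 2, 3, 4]
descending = shift_add_descending(data)
ascending = shift_add_ascending(data)
```

Walking downwards gives each element plus its original neighbour, while walking upwards turns the same statement into a running total. Note that the loop as quoted, with i from 17 down to 1, performs 17 additions over 18 elements.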
lkcl	https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=media/audio/mp3/mp3_0_apply_window_float_basicsv.s;h=3888852461c794eb9836f8320c28aa5080b72b4a;hb=d1b415e4383366cf445fd4ff2db828a612f88099#l29	20:39
lkcl	ah.	20:39
lkcl	that limit about 128 registers is obviously lifted, there	20:40
lkcl	26 # SV floats	20:40
lkcl	27 .set fv0, 32	20:40
lkcl	28 .set fv1, 40	20:40
lkcl	29 .set fv2, 48	20:40
lkcl	r48 is not even remotely possible with standard Power ISA 3.0	20:40
lkcl	so the comment is clearly out-of-date	20:40
lkcl	programmerjake, sorry, i just find it deeply frustrating because you have no idea of the timescales and implications of what you're advocating, the amount of disruption it would cause to abandon everything done and designed so far to do 128-bit	20:46
lkcl	ouaff	20:51
lkcl	meeting	20:51
markos	one more question, what's the equivalent of 'setvl 16' in the 6-arg format?	20:58
