Thursday, 2023-03-09

*** tplaten <tplaten!~tplaten@62.144.45.55> has quit IRC		00:43
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has quit IRC		07:32
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has joined #libre-soc		08:08
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has quit IRC		08:30
markos	programmerjake, yeah, I took a closer look at the instructions, indeed I do prefer the old naming scheme, for that matter, it's easier to remember that the [s] suffix will do 32-bit float copy/conversion, vs having to look at the ISA manual to find out exactly which parameter to pass to do the same thing	09:00
markos	same thing for the rest	09:01
markos	it's the same if you write these instructions a thousand times and know it by heart, but if you only occasionally use them, keeping the same (simple) naming scheme is better, imho	09:02
programmerjake	k, thx! lkcl, can you look at that when you have time? thx	09:09
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has joined #libre-soc		09:42
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has quit IRC		10:02
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has joined #libre-soc		10:34
sadoon[m]	https://www.phoronix.com/review/tyan-power8-server	11:22
sadoon[m]	markos:	11:22
markos	just received it, literally minutes ago :)	11:24
markos	yeah I remember the article	11:24
markos	replacing the fans will be one of the first things I will do	11:25
sadoon[m]	My brother received mine a few days ago in the UK, though he's only coming back in the summer so I can't really have fun with it for some time :p	11:34
markos	I'll let you know how it works out :)	11:34
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has quit IRC		12:13
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has joined #libre-soc		13:00
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has quit IRC		13:12
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has joined #libre-soc		14:44
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has quit IRC		15:03
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has joined #libre-soc		16:30
markos	toshywoshy, do you know what kind of rails the tyan case takes? I know I could just put it there, but I prefer rails on my rack systems	16:41
markos	I could ask the seller, but somehow I doubt they know	16:41
markos	they didn't know the disk tray models...	16:42
markos	lkcl, I'm having trouble understanding how to create the indices for svindex, I need to create the pattern 0, 1, 2, 3, 0, 1, 2, 3 at GPR 16	16:51
markos	I thought I could do that with svshape2 with VL=8, mod 4	16:52
markos	I'm using svshape2 8, 0, 1, 4, 0, 0	16:53
markos	ah f*sck	17:09
markos	all this time, in the unit tests I thought the indices were actually created BY svshape etc	17:10
markos	but I was deceived, the indices were created by setting the initial_regs[] manually	17:10
markos	ffs	17:10
markos	this is so embarrassing	17:11
markos	I was deceived, I thought the indices were actually created by the svindex/svshape instructions in the chacha20 unit test	17:12
markos	but they're created outside the chacha20 code in the unit test prep code using set_masked_reg() functions	17:13
markos	now I get it	17:14
markos	I just have to do this in asm properly	17:14
markos	the "deceived" part was said in humor, obviously :-P	17:46
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has quit IRC		18:01
lkcl	yes, i did explain that :) but it is one thing to hear it explained and another to "realise", if you know what i mean.	18:05
lkcl	markos, if you recall i mentioned in the last irc conversation:	18:06
lkcl	https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_chacha20.py;h=7e11fb4b39e596b11b952f171b349c47278467f7;hb=35851d97718547db731809f6942fe97bb31ba7c9#l151	18:06
lkcl	(or maybe one of the conf calls?)	18:06
lkcl	that call to set_masked_reg() sets up (at elwidth=32) the values 16 12 8 7 16 12 8 7 16 12 8 7	18:07
lkcl	and lines 147-148 take the indices from the schedule list and put them into the registers	18:08
markos	yup	18:08
lkcl	which was why i mentioned, "just put a print() statement in that test_caller_svp64_chacha20.py file" then replicate it in assembler	18:08
lkcl	but you have to print the list out first in order to know what assembler to write that will replicate it	18:09
lkcl	pprint would be easier to read	18:09
lkcl	or just after line 146: "print (i, a,b,c,d)"	18:10
lkcl	or	18:10
lkcl	"print ("keyword to search for so you know to look for this in the simulator output", i, a,b,c,d)"	18:11
lkcl	programmerjake, it's wasting your time (and mine even just to tell you it's wasting time) to jump ahead to write any fgrev instructions when absolutely none of us have had the opportunity to evaluate whether the instructions are even beneficial or harmful.	18:13
markos	lkcl, yes, I see what you're doing there now, but I might change it a bit, not because it's wrong, but because I prefer to make it longer and easier to understand, we can optimize it further later	18:14
lkcl	for example: the exact same effect can be achieved by using svindex with a negative direction and a 2D index that is the original width divided by the target width.	18:14
lkcl	making it completely unnecessary to even have any fgrev instructions	18:14
markos	ie, I need to understand it first myself, so I will expand the code to make it easier for me to understand -and therefore transfer the knowledge to the documentation	18:15
lkcl	but you didn't wait for me to take the time to even think that through, you jumped straight in	18:15
lkcl	yes - i mean, feel free to actually write it in c (replicate the indices in c)	18:15
markos	it's all going to be in asm	18:16
lkcl	then hand-pack the results into target registers using 64-bit mvs	18:16
markos	this particular routine I mean	18:16
lkcl	that would at least allow you to do the trick of printfs() to make sure that the list created (in c) was the same as what was print()ed out	18:16
markos	what I don't understand is why/how the indices are in 8-bit elements	18:16
lkcl	then convert over to assembler	18:16
lkcl	because otherwise they take up one hell of a lot of registers	18:17
lkcl	each index if 64-bit (completely wasting over 56 bits btw)	18:17
lkcl	would take up a whopping 64 frickin registers	18:17
markos	there aren't that many indices	18:17
lkcl	however many it is, it's still a lot of regfile read-ports	18:18
lkcl	and if the number of read-port accesses can be reduced by a factor of EIGHT	18:18
lkcl	(because elwidth=8 for the indices not elwidth=64)	18:19
lkcl	that's a massive reduction	18:19
markos	indeed	18:19
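A minimal Python sketch of the packing discussed above. It builds the 0, 1, 2, 3, 0, 1, 2, 3 pattern asked about earlier (VL=8) and packs it at elwidth=8 into a single 64-bit value. The helper name pack_elements is hypothetical, and the element order (element 0 in the least significant byte) is an assumption for illustration; neither is taken from the unit test.

```python
def pack_elements(values, elwidth=8, regwidth=64):
    """Pack small integers into regwidth-bit values, element 0 in the lowest bits.

    Hypothetical helper: the element ordering is an assumption, not a spec quote.
    """
    per_reg = regwidth // elwidth
    mask = (1 << elwidth) - 1
    regs = []
    for start in range(0, len(values), per_reg):
        reg = 0
        for pos, value in enumerate(values[start:start + per_reg]):
            reg |= (value & mask) << (pos * elwidth)
        regs.append(reg)
    return regs


VL = 8
indices = [i % 4 for i in range(VL)]           # 0, 1, 2, 3, 0, 1, 2, 3
packed = pack_elements(indices, elwidth=8)     # eight indices per register
unpacked = pack_elements(indices, elwidth=64)  # one index per register

print(indices)
print([hex(r) for r in packed])
print(len(unpacked), "registers at elwidth=64 vs", len(packed), "at elwidth=8")
```

Under these assumptions the eight indices fit in one register instead of eight, which is the factor-of-eight reduction in register reads mentioned in the conversation.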
markos	but there's the overhead of having svindex work on individual elements, isn't there?	18:19
lkcl	the hardware can cache the indices	18:20
lkcl	but the hardware still has to read them	18:20
markos	but hardware registers are always available with zero latency, right?	18:20
lkcl	and if that overhead can be reduced by packing them as tightly as possible that's clearly a priority	18:20
lkcl	1 clock cycle per read port, yes	18:20
lkcl	but think about it	18:20
lkcl	let's say you have a 3R1W regfile	18:21
lkcl	and you want to apply svindex to an FMAC operation	18:21
lkcl	the FMAC is 3-in 1-out	18:21
lkcl	normally you could do 1 FMAC every clock cycle, yes?	18:21
markos	depending on the available ALU units/cores, but yes	18:22
lkcl	but if you have an svindex needing to read yet another register, and each index per element is taking up an entire 64-bit register	18:22
markos	I think I get your point	18:22
lkcl	you can only issue 1 instruction every two clock cycles	18:22
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has joined #libre-soc		18:22
lkcl	because you now have to read FOUR registers for each one FMAC	18:22
markos	but svindex is not running on the same unit	18:22
lkcl	1) the index	18:22
lkcl	2) operand A	18:22
lkcl	3) operand B	18:23
lkcl	4) operand C	18:23
lkcl	the regfile reads have nothing to do with the units	18:23
markos	right, it's a bit confusing, so the regfile has a limited number of "ports" so to speak?	18:24
lkcl	if you only have 3 regfile ports per clock and your instruction needs 4 operands even though one of them...	18:24
lkcl	yeeees of course!	18:24
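The read-port arithmetic in this exchange can be sketched in a few lines of Python. This is a deliberately simplified model (one register read per port per clock, no caching of indices, no other units competing for ports), and the function name cycles_per_issue is made up for the illustration.

```python
import math


def cycles_per_issue(reads_needed, read_ports):
    """Clock cycles needed to fetch all operands through a limited set of read ports."""
    return math.ceil(reads_needed / read_ports)


READ_PORTS = 3                   # the 3R1W regfile from the example
fmac_reads = 3                   # operands A, B and C
fmac_with_index = fmac_reads + 1  # plus one register holding the index

print(cycles_per_issue(fmac_reads, READ_PORTS))       # plain FMAC
print(cycles_per_issue(fmac_with_index, READ_PORTS))  # FMAC with a per-element index register
```

In this model the fourth read pushes the operation from one cycle to two, which is why packing several indices into one register read matters.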
programmerjake	lkcl: you already know i disagree on fgrev, we can talk about that later when you're less stressed out. meanwhile imho we should submit the fcvt insns (after resolving #1016) as a rfc without the fmv* insns both to reduce rfc size and because we haven't resolved if we even want fmv* and not fgrev[f/t]gi instead	18:24
lkcl	the cost of doing say a 10R6W regfile is absolutely massive	18:24
markos	isn't that easily solvable by increasing that or is that a Power ISA restriction?	18:24
markos	the number of ports on the regfile that is	18:24
lkcl	programmerjake, no. PLEASE LISTEN. this is the 4th time i have said PLEASE LISTEN in under 16 hours	18:24
lkcl	markos, it becomes an exponential cost to increase the regfile ports	18:25
markos	I see	18:25
lkcl	and power consumption and latency start to push the boundaries of physics	18:25
markos	ok, I understand I will try to follow the same pattern then	18:25
programmerjake	it's a hw cost restriction where a huge number of reg file ports takes up like 30% of the whole cpu's area	18:25
lkcl	indeed.	18:25
lkcl	mitch alsup designed the AMD Opteron's regfile at the gate-level and he said he was barely able to get 10R6W within the required power and speed budget	18:26
lkcl	all execution units then have to "compete" for access to those regfile ports	18:27
markos	what is our own target?	18:27
lkcl	and you need one "Priority Picker" per regfile port so as not to get data corruption (or worse, actually damage the ASIC)	18:27
programmerjake	lkcl: i'm listening, hence why i'm working on ternlogi since you asked me to work on things that need work and why i'm putting fgrevi discussions for later since i'm listening but i disagree	18:27
lkcl	the target will be: "whatever-is-required-for-our-first-real-customer"	18:27
lkcl	programmerjake, thank you	18:27
lkcl	when i have time i will get round to beginning a discussion of alternatives and costs, on the fgrev bugreport	18:28
programmerjake	i did state that before in my email...	18:28
programmerjake	the i'm listening part	18:28
lkcl	you stressed me out so badly i couldn't bring myself to read it	18:29
programmerjake	k	18:29
lkcl	i'm only just recovering from total overwhelm, after four months	18:29
programmerjake	yeah, things can take time...	18:30
programmerjake	for the compilers rfc, i'd like to do basically all the work of writing, submitting, etc. so you don't need to bother	18:32
lkcl	my thoughts on that would be that it may actually be better for it to be a Kazan continuation project	18:33
lkcl	(sotto voce: that happens to require some compiler work)	18:33
lkcl	that took me a few days to think of, apologies i haven't raised the idea before	18:34
lkcl	aside from anything that would give a "real worked example / need" so to speak that would drive the compiler-side	18:35
lkcl	otherwise it's a bit of a fishing expedition if you know what i mean	18:35
programmerjake	imho we need llvm/gcc before kazan, so it should be compilers, unless you want a mainstream-compilers and a kazan rfc?	18:35
lkcl	tied/related... yes. or just the two together but EUR 100k not EUR 50k.	18:36
lkcl	(each)	18:36
programmerjake	though otoh imho cranelift powerisa support gives wasmtime support for power so is easily justifiable as having a european element since europeans with power can then run wasm cli stuff -- cranelift is needed for kazan too	18:37
lkcl	my instincts in nlnet-grant-writing are lighting up more on kazan+cranelift as a first step	18:38
markos	we need gcc/llvm for native code, not wasm :-)	18:38
* lkcl agrees		18:38
programmerjake	i might be able to rope other people (outside of libre-soc) into working on cranelift	18:38
markos	call me old fashioned but wasm is just turning up to be another form of java compile-once-run-everywhere	18:39
markos	though I have to agree it's faster than java	18:39
markos	which isn't saying you cannot do it if you want	18:40
lkcl	FORTH, java, CLR/.NET, JIT, wasm - seen 'em once, seen 'em all...	18:40
programmerjake	except that java never worked very well for non-java languages whereas wasm is intentionally designed for c/c++-style languages too	18:40
markos	but I honestly doubt anyone is really interested for wasm on power at this point in time	18:40
programmerjake	no, forth, java, clr/.net are all programming language specific, wasm is designed to be language independent	18:41
markos	it works fine for jvm languages, clojure, scala, kotlin, etc but I'm not interested in those either	18:41
lkcl	no, CLR/.net is definitely non-programming-language-specific. look up Iron-Python and Iron-Ruby.	18:41
markos	I much more prefer to have a working native compiler for power/svp64	18:41
lkcl	markos, yyep.	18:41
programmerjake	cranelift is a native compiler	18:42
markos	ok, let me rephrase	18:42
* lkcl afk		18:42
markos	a working native C/C++ compiler for power/svp64	18:42
programmerjake	that's also part of the compilers rfc	18:42
markos	anything else at this point in time is just a distraction	18:43
markos	no, those SHOULD be the RFC	18:43
markos	anything else is a side project, call it a pet project	18:43
markos	no one is stopping you from doing it	18:43
markos	but it's definitely not a priority	18:43
markos	and you cannot expect others to adopt your logic, when either gcc/llvm is working then sure	18:44
programmerjake	well, we need a vulkan driver and if we don't get to work on it soon it won't be ready when we need it (e.g. texture isa design), cranelift is part of that	18:44
markos	what are we going to do with a vulkan driver on its own	18:44
markos	plus vulkan drivers can ALSO be written in C/C++	18:45
programmerjake	you don't need svp64 support on the compiler that you use to compile the vulkan driver	18:46
programmerjake	just need it in the shader compiler	18:46
programmerjake	so rustc/llvm as is is sufficient if we have a shader compiler with svp64 support	18:47
programmerjake	(or clang/gcc if the vulkan driver is written in c)	18:47
markos	so your suggestion is that before we can actually compile C/C++ code with SVP64, we invest time in getting rust working with SVP64 first so that we get working vulkan, JUST in case we need software that uses vulkan?	18:48
programmerjake	no, i'm suggesting we need a vulkan driver to properly design the gpu features of our cpus	18:49
programmerjake	e.g. texture instructions	18:49
programmerjake	since being a gpu is a major part of what we want it to eventually do	18:49
programmerjake	and vulkan is the logical gpu api to implement (first), since opengl/opencl can translate to vulkan	18:50
markos	perhaps, but still a compiler is more important	18:50
markos	and we still don't have a gpu	18:50
markos	but we do have a cpu (sortof)	18:51
programmerjake	i'm not saying don't implement llvm/gcc, i'm saying work on cranelift too	18:51
markos	again to avoid any misunderstandings, when I'm saying compiler I'm only talking about C/C++	18:51
markos	as first priority	18:51
markos	anything else just isn't	18:51
programmerjake	since realistically it will likely work best for the shader compiler	18:51
markos	yeah I don't know how to reply to that, you keep repeating about the shader compiler, and so far I have yet to see it mentioned as a high priority task in any of our talks, all I keep hearing/reading is about IoT/edge/computing/crypto/AI/ML/etc	18:53
markos	for all of those we need working compilers	18:53
markos	you are bent on saying we need cranelift too, maybe, add it as a separate task,	18:54
programmerjake	but cranelift lets us try out the very invasive ir changes in llvm/gcc that luke wanted for svp64 support, which i think are going to be very hard to convince gcc/llvm that they should accept the ir changes	18:54
markos	I personally don't want to have it in the same RFC	18:54
markos	it's a distraction	18:54
markos	it will just take time from you and everyone else that's going to work in this particular project	18:55
markos	again, I'm not saying don't do it, it's your call	18:55
programmerjake	ok, then like i suggested: a mainstream-compilers rfc and a vulkan drivers rfc that includes cranelift	18:55
markos	but don't put it in the same task	18:55
markos	yes, no objection from me there	18:55
programmerjake	i wanted them together since lots of stuff we learn while building the cranelift backend (and what i already learned from bigint-presentation-code) will be directly applicable to llvm/gcc except much more complex to implement	18:57
programmerjake	doing it in llvm/gcc first imho is setting us up for failure to some extent	18:58
programmerjake	one other nice feature is the cranelift regalloc is mostly compiler-independent so could be easily slotted into llvm/gcc as a stopgap	19:01
programmerjake	so we only have to implement reg range alloc once at first	19:01
markos	shared knowledge between tasks does not mandate -imho- a common rfc	19:07
markos	you could share code between tasks, it would need modifications anyway so it wouldn't be just a copy paste thing	19:07
programmerjake	maybe, though there are likely tasks that overlap and we want to avoid double-funding quagmire	19:08
markos	I think it's too early to worry about double-funding between 2 almost entirely different projects, I honestly doubt you will have much duplicate code between rust compiler and llvm/gcc	19:09
programmerjake	though for the regalloc stopgap, it would be literally a copy of the cranelift regalloc	19:10
programmerjake	plus some bindings	19:10
programmerjake	or glue code	19:10
markos	well if it has to be exactly the same then so be it, it will still be a needed part of compiler support	19:11
programmerjake	(not literally copied, but probably a crate dependency)	19:11
markos	I still doubt it will be exactly the same, it's like saying 2 entirely different projects use the same hashing function so you can't use it	19:11
markos	crates are a rust thing, for llvm/gcc it has to be integrated	19:12
programmerjake	well the cranelift regalloc is in rust, so we'd need rust -> c ffi adaptor -> glue code to llvm/gcc's c++ internals	19:13
programmerjake	the c ffi adaptor would be written in rust	19:14
markos	er, no, that will never fly with the gcc/llvm people	19:14
markos	the register allocator has to be integrated in C/C++ inside gcc/llvm tree	19:15
programmerjake	hence why i called it a stopgap	19:15
programmerjake	i'd estimate it'd take 1-2 weeks to write the glue code and maybe 1 mo for each of llvm/gcc to rewrite the regalloc into c++	19:16
markos	well, good that you cleared this up now, because I would never agree to this, I would rather we invest the time to develop a proper register allocator in C/C++ and getting it working in llvm/gcc directly	19:16
markos	depending on a rust project to get compiler support is sub optimal to say the least	19:17
markos	for one you add an external dependency for everyone who would want to do compiler development	19:17
programmerjake	well, that's much more complex imho	19:17
markos	not really	19:17
markos	if it's 1mo work to do a rust regalloc, then surely it can't take much more to do it in c++	19:17
markos	let's say 2 months?	19:18
programmerjake	external dependencies in rust are waay easier than in c++/c, you add one line to cargo.toml and it works	19:18
markos	I find it a terrible idea, sorry	19:18
markos	it essentially means you have to fight 2 beasts instead of one (gcc or llvm depending on the case)	19:19
markos	not everyone is as well versed in rust as you and I certainly don't want to have to add yet another dependency to the toolchain	19:19
programmerjake	i'd expect the time scaling for c++/rust to be more exponential since it's more complexity i have to keep in my head and the reg alloc is pushing it already	19:19
markos	I thought you wanted to write the code first using rust and then port it to gcc/llvm sharing the solutions you encountered in the first	19:20
programmerjake	so imho writing the regalloc in c++ at first is less wise	19:20
programmerjake	yes, for the final regalloc once i work out the correct algorithm	19:20
markos	it's anything but less wise	19:20
markos	it's the only solution if we want/expect upstreaming of svp64 compiler support	19:21
markos	and any problem you want to solve with rust, you can easily solve it with C/C++ as well	19:21
markos	I really really dislike this idea	19:21
programmerjake	the rust regalloc is never intended for upstreaming in gcc/llvm	19:21
markos	all the more reason to split the RFCs then	19:22
programmerjake	(though imho llvm might be more open to upstreaming rust code)	19:22
markos	I'd do 2 regallocs then, one in C++ for gcc/llvm and one for rust	19:22
markos	no, LLVM source code is definitely only in C++	19:23
markos	maybe some C/asm for really low level stuff	19:23
programmerjake	right now, yes, but imho they may change their minds if a particularly compelling rust library comes along. in any case i'm definitely not proposing we try upstreaming rust into llvm	19:24
markos	you cannot expect/depend on them changing their mind	19:24
programmerjake	i'm not	19:25
markos	lkcl, I think I may have found a problem in your chacha20 calculation: in each set of the quarterrounds the first 4 are with step=4, (0, 4, 8, 12), etc, the second 4 are with step 5 (0, 5, 10, 15), etc	20:53
markos	however only the first 2 quarterrounds can be calculated independently	20:54
markos	so	20:54
markos	fn(x, 0, 4, 8, 12)	20:55
markos	hm, just as I was pasting the code, I found an error in my logic	20:56
markos	hate it when that happens	20:56
markos	on one hand it's good because it helps me find the problem, otoh, it's annoying and embarrassing when it happens	20:56
markos	hm, actually no, that was correct	20:58
markos	#define QUARTERROUND(a,b,c,d) \	20:58
markos	a = PLUS(a,b); d = ROTATE(XOR(d,a),16); \	20:58
markos	c = PLUS(c,d); b = ROTATE(XOR(b,c),12); \	20:58
markos	a = PLUS(a,b); d = ROTATE(XOR(d,a), 8); \	20:58
markos	c = PLUS(c,d); b = ROTATE(XOR(b,c), 7);	20:58
markos	we can only do the first two PLUS/XOR/ROTATEs independently/parallel	20:58
markos	we can group them together, but we have to redo the adds/XOR and ROTATEs with the next 2 shift values	20:59
markos	I don't know if you actually managed to get the unit test to pass in the past, it fails for me here	21:00
markos	in any case, I'll try to get it working and I'll fix the unit test in the process	21:01
markos	in the end, for the calculations involved, this means VL should be 8 not 16 for each pass of sv.add/sv.xor/sv.rldcl	21:02
markos	then redo with the next shift values, and then move to the next bunch of quarterrounds (with step=5), again using the same logic	21:03
markos	etc	21:03
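For reference, a plain Python sketch of the quarterround macro quoted above, together with the two index schedules (step 4 and step 5). It is only a scalar model for checking values, not the SVP64 code or the unit test; the names rotl32, quarterround, COLUMNS, DIAGONALS and double_round are chosen for this illustration. The rotate amounts 16, 12, 8, 7 are the ones in the macro.

```python
MASK32 = 0xFFFFFFFF


def rotl32(value, count):
    """Rotate a 32-bit value left by count bits."""
    return ((value << count) | (value >> (32 - count))) & MASK32


def quarterround(x, a, b, c, d):
    """Apply one quarterround in place; each line depends on the one before it."""
    x[a] = (x[a] + x[b]) & MASK32
    x[d] = rotl32(x[d] ^ x[a], 16)
    x[c] = (x[c] + x[d]) & MASK32
    x[b] = rotl32(x[b] ^ x[c], 12)
    x[a] = (x[a] + x[b]) & MASK32
    x[d] = rotl32(x[d] ^ x[a], 8)
    x[c] = (x[c] + x[d]) & MASK32
    x[b] = rotl32(x[b] ^ x[c], 7)


# Step-4 schedule: (0, 4, 8, 12), (1, 5, 9, 13), ...
COLUMNS = [(i, i + 4, i + 8, i + 12) for i in range(4)]
# Step-5 schedule: (0, 5, 10, 15), (1, 6, 11, 12), ...
DIAGONALS = [(i, 4 + (i + 1) % 4, 8 + (i + 2) % 4, 12 + (i + 3) % 4)
             for i in range(4)]


def double_round(x):
    """One set of four step-4 quarterrounds followed by four step-5 quarterrounds."""
    for a, b, c, d in COLUMNS + DIAGONALS:
        quarterround(x, a, b, c, d)


# Quarterround test vector from RFC 7539 section 2.1.1.
state = [0x11111111, 0x01020304, 0x9B8D6F43, 0x01234567]
quarterround(state, 0, 1, 2, 3)
print([hex(word) for word in state])
```

Within one quarterround every add/xor/rotate step consumes the result of the previous step, while the four tuples of one schedule touch disjoint state words. Printing COLUMNS and DIAGONALS gives a list that can be compared against the indices the test's prep code puts into the registers.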
markos	anyway, I'm writing the documentation in parallel, trying to explain the algorithm	21:03
markos	I hope I will be done with it over the weekend	21:03
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has quit IRC		22:19
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has joined #libre-soc		22:36
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has quit IRC		23:31