Thursday, 2023-03-09

*** tplaten <tplaten!~tplaten@62.144.45.55> has quit IRC00:43
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has quit IRC07:32
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has joined #libre-soc08:08
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has quit IRC08:30
markosprogrammerjake, yeah, I took a closer look at the instructions, and indeed I do prefer the old naming scheme; for that matter, it's easier to remember that the [s] suffix does a 32-bit float copy/conversion than to have to look in the ISA manual to find out exactly which parameter to pass to do the same thing09:00
markossame thing for the rest09:01
markosit makes no difference if you write these instructions a thousand times and know them by heart, but if you only use them occasionally, keeping the same (simple) naming scheme is better, imho09:02
programmerjakek, thx! lkcl, can you look at that when you have time? thx09:09
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has joined #libre-soc09:42
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has quit IRC10:02
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has joined #libre-soc10:34
sadoon[m]https://www.phoronix.com/review/tyan-power8-server11:22
sadoon[m]markos:11:22
markosjust received it, literally minutes ago :)11:24
markosyeah I remember the article11:24
markosreplacing the fans will be one of the first things I will do11:25
sadoon[m]My brother received mine a few days ago in the UK, though he's only coming back in the summer so I can't really have fun with it for some time :p11:34
markosI'll let you know how it works out :)11:34
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has quit IRC12:13
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has joined #libre-soc13:00
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has quit IRC13:12
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has joined #libre-soc14:44
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has quit IRC15:03
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has joined #libre-soc16:30
markostoshywoshy, do you know what kind of rails the tyan case takes? I know I could just put it there, but I prefer rails on my rack systems16:41
markosI could ask the seller, but somehow I doubt they know16:41
markosthey didn't know the disk tray models...16:42
markoslkcl, I'm having trouble understanding how to create the indices for svindex, I need to create the pattern 0, 1, 2, 3, 0, 1, 2, 3 at GPR 1616:51
markosI thought I could do that with svshape2 with VL=8, mod 416:52
markosI'm using svshape2        8, 0, 1, 4, 0, 016:53
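For reference, the target pattern is just the element number modulo 4 over VL=8. A minimal Python sketch, assuming the indices are packed at elwidth=8 into a single 64-bit GPR with element 0 in the least-significant byte (an assumption about element layout); `pack_elements` is a hypothetical helper, not part of the openpower-isa code:

```python
def pack_elements(values, elwidth):
    """Pack unsigned element values into 64-bit words, element 0 in the low bits.

    Hypothetical helper: assumes little-endian element order within each
    64-bit register and that elwidth divides 64.
    """
    per_reg = 64 // elwidth
    mask = (1 << elwidth) - 1
    words = []
    for start in range(0, len(values), per_reg):
        word = 0
        for j, v in enumerate(values[start:start + per_reg]):
            if not 0 <= v <= mask:
                raise ValueError(f"{v} does not fit in {elwidth} bits")
            word |= v << (j * elwidth)
        words.append(word)
    return words


VL = 8
indices = [i % 4 for i in range(VL)]   # 0, 1, 2, 3, 0, 1, 2, 3
packed = pack_elements(indices, 8)     # fits in one 64-bit GPR at elwidth=8
print(indices, [f"0x{w:016x}" for w in packed])
```

Under that layout assumption, the whole index vector is one 64-bit constant that could be loaded into GPR 16 directly, rather than generated by an svshape instruction.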
markosah f*sck17:09
markosall this time, in the unit tests I thought the indices were actually created BY svshape etc17:10
markosbut I was deceived, the indices were created by setting the initial_regs[] manually17:10
markosffs17:10
markosthis is so embarrassing17:11
markosI was deceived, I thought the indices were actually created by the svindex/svshape instructions in the chacha20 unit test17:12
markosbut they're created outside the chacha20 code in the unit test prep code using set_masked_reg() functions17:13
markosnow I get it17:14
markosI just have to do this in asm properly17:14
markosthe "deceived" part was said in humor, obviously :-P17:46
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has quit IRC18:01
lkclyes, i did explain that :) but it is one thing to hear it explained and another to "realise", if you know what i mean.18:05
lkclmarkos, if you recall i mentioned in the last irc conversation:18:06
lkclhttps://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_chacha20.py;h=7e11fb4b39e596b11b952f171b349c47278467f7;hb=35851d97718547db731809f6942fe97bb31ba7c9#l15118:06
lkcl(or maybe one of the conf calls?)18:06
lkclthat call to set_masked_reg() sets up (at elwidth=32) the values 16 12 8 7 16 12 8 7 16 12 8 718:07
lkcland lines 147-148 take the *indices* from the schedule list and put them into the registers18:08
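To hand-write the equivalent setup in assembler, it helps to see what those elwidth=32 values look like as whole 64-bit registers. A hypothetical reconstruction, assuming two 32-bit elements per GPR with element 0 in the low half; this is not the set_masked_reg() implementation, just the resulting register contents under that assumption:

```python
ELWIDTH = 32
shifts = [16, 12, 8, 7] * 3   # the values set up at elwidth=32, per the log

per_reg = 64 // ELWIDTH       # two 32-bit elements per 64-bit GPR
words = []
for start in range(0, len(shifts), per_reg):
    word = 0
    for j, v in enumerate(shifts[start:start + per_reg]):
        word |= v << (j * ELWIDTH)   # element j occupies bits [32*j, 32*j + 31]
    words.append(word)

for n, w in enumerate(words):
    print(f"reg +{n}: 0x{w:016x}")
```

Twelve 32-bit elements therefore occupy six consecutive 64-bit registers, alternating between two distinct constants.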
markosyup18:08
lkclwhich was why i mentioned, "just put a print() statement in that test_caller_svp64_chacha20.py file" then replicate it in assembler18:08
lkclbut you have to print the list out first in order to know what assembler to write that will replicate it18:09
lkclpprint would be easier to read18:09
lkclor just after line 146: "print (i, a,b,c,d)"18:10
lkclor18:10
lkcl"print ("keyword to search for so you know to look for this in the simulator output", i, a,b,c,d)"18:11
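A Python sketch of that print()-based approach. It assumes the schedule consists of the standard ChaCha20 double-round quarterround indices (four column rounds, then four diagonal rounds, as in the ChaCha20 specification); the list actually built in test_caller_svp64_chacha20.py may be ordered or structured differently, and `SCHEDULE` is just an arbitrary search keyword:

```python
# Standard ChaCha20 double round on a 4x4 state of 32-bit words.
columns = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)]
diagonals = [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]
schedule = columns + diagonals

for i, (a, b, c, d) in enumerate(schedule):
    # Unique keyword so these lines are easy to grep out of the simulator output.
    print("SCHEDULE", i, a, b, c, d)
```

Grepping the simulator output for the keyword then gives exactly the list that the assembler version has to reproduce.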
lkclprogrammerjake, it's wasting your time (and mine even just to tell you it's wasting time) to jump ahead to write *any* fgrev instructions when absolutely none of us have had the opportunity to evaluate whether the instructions are even beneficial or harmful.18:13
markoslkcl, yes, I see what you're doing there now, but I might change it a bit, not because it's wrong, but because I prefer to make it longer and easier to understand, we can optimize it further later18:14
lkclfor example: the exact same effect can be achieved by using svindex with a negative direction and a 2D index that is the original width divided by the target width.18:14
lkclmaking it completely unnecessary to even *have* any fgrev instructions18:14
markosi.e. I need to understand it myself first, so I will expand the code to make it easier for me to understand, and therefore to transfer the knowledge to the documentation18:15
lkclbut you didn't wait for me to take the time to even think that through, you jumped straight in18:15
lkclyes - i mean, feel free to actually write it in c (replicate the indices in c)18:15
markosit's all going to be in asm18:16
lkclthen hand-pack the results into target registers using 64-bit mvs18:16
markosthis particular routine I mean18:16
lkclthat would at least allow you to do the trick of printfs() to make sure that the list created (in c) was the same as what was print()ed out18:16
markoswhat I don't understand is why/how the indices are in 8-bit elements18:16
lkclthen convert over to assembler18:16
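The "replicate the indices, print them, then hand-pack them" workflow could start from the same index lists. This Python sketch (the same logic would carry over to C with printf) splits the assumed standard double-round schedule into one index vector per quarterround operand and packs each at elwidth=8 into a single 64-bit value suitable for loading with 64-bit moves. The schedule and the byte layout (element 0 in the lowest byte) are assumptions, and `pack8` is a hypothetical helper:

```python
schedule = [
    (0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),   # columns
    (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14),   # diagonals
]


def pack8(values):
    """Pack up to eight 8-bit indices into one 64-bit word, element 0 lowest."""
    assert len(values) <= 8 and all(0 <= v < 256 for v in values)
    word = 0
    for j, v in enumerate(values):
        word |= v << (8 * j)
    return word


# Transpose: one index vector per quarterround operand (a, b, c, d).
operands = {name: [q[k] for q in schedule] for k, name in enumerate("abcd")}
packed = {name: pack8(vals) for name, vals in operands.items()}

for name in "abcd":
    print(name, operands[name], f"0x{packed[name]:016x}")
```

Under these assumptions, the `a` vector is exactly the 0, 1, 2, 3, 0, 1, 2, 3 pattern discussed earlier, and each operand's eight indices fit in one GPR.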
lkclbecause otherwise they take up one hell of a lot of registers18:17
lkcleach index, if 64-bit (completely wasting over 56 bits btw),18:17
lkclwould take up a whopping *64* frickin registers18:17
markosthere aren't that many indices18:17
lkclhowever many it is, it's still a lot of regfile read-ports18:18
lkcland if the number of read-port accesses can be reduced by a factor of EIGHT18:18
lkcl(because elwidth=8 for the indices not elwidth=64)18:19
lkclthat's a massive reduction18:19
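A back-of-the-envelope check of that factor of eight, using the 64-index figure from the discussion as an illustrative count (`regs_needed` is a hypothetical helper):

```python
import math


def regs_needed(n_indices, elwidth, reg_bits=64):
    """Number of 64-bit GPRs needed to hold n_indices packed at elwidth bits each."""
    return math.ceil(n_indices * elwidth / reg_bits)


n = 64
wide = regs_needed(n, 64)    # one index per register
packed = regs_needed(n, 8)   # eight indices per register
print(wide, packed, wide // packed)
```

The same ratio applies to regfile reads if each index register is read once: eight times fewer reads at elwidth=8 than at elwidth=64.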
markosindeed18:19
markosbut there's the overhead of having svindex work on individual elements, isn't there?18:19
lkclthe hardware can cache the indices18:20
lkclbut the hardware still has to read them18:20
markosbut hardware registers are always available with zero latency, right?18:20
lkcland if that overhead can be reduced by packing them as tightly as possible that's clearly a priority18:20
lkcl1 clock cycle *per read port*, yes18:20
lkclbut think about it18:20
lkcllet's say you have a 3R1W regfile18:21
lkcland you want to apply svindex to an FMAC operation18:21
lkclthe FMAC is 3-in 1-out18:21
lkcl*normally* you could do 1 FMAC every clock cycle, yes?18:21
markosdepending on the available ALU units/cores, but yes18:22
lkclbut if you have an svindex needing to read *yet another register*, and each index per element is taking up an *entire* 64-bit register18:22
markosI think I get your point18:22
lkclyou can only issue 1 instruction *every two clock cycles*18:22
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has joined #libre-soc18:22
lkclbecause you now have to read *FOUR* registers for each one FMAC18:22
markosbut svindex is not running on the same unit18:22
lkcl1) the index18:22
lkcl2) operand A18:22
lkcl3) operand B18:23
lkcl4) operand C18:23
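The 3R1W example can be written as a simple model: if all of an instruction's register reads must go through the regfile before it can issue, and the regfile provides a fixed number of read ports per clock, the issue interval is the ceiling of reads divided by ports. This is a deliberately simplified sketch (no operand forwarding, no index caching, a single instruction stream):

```python
import math


def cycles_per_issue(reads_per_insn, read_ports):
    """Clocks between issues when every operand read goes through the regfile."""
    return math.ceil(reads_per_insn / read_ports)


READ_PORTS = 3                                   # 3R1W regfile
fmac = cycles_per_issue(3, READ_PORTS)           # operands A, B, C
fmac_indexed = cycles_per_issue(4, READ_PORTS)   # index + A, B, C
print(fmac, fmac_indexed)
```

The extra index read halves the issue rate in this model. Packing indices at elwidth=8 helps because, if the hardware caches them, a single 64-bit index read can serve eight elements instead of one.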
lkclthe regfile reads have nothing to do with the units18:23
markosright, it's a bit confusing, so the regfile has a limited number of "ports" so to speak?18:24
lkclif you only have 3 regfile ports per clock and your instruction needs 4 operands even though one of them...18:24
lkclyeeees of course!18:24
programmerjakelkcl: you already know i disagree on fgrev, we can talk about that later when you're less stressed out. meanwhile imho we should submit the fcvt insns (after resolving #1016) as a rfc without the fmv* insns both to reduce rfc size and because we haven't resolved if we even want fmv* and not fgrev[f/t]gi instead18:24
lkclthe cost of doing say a 10R6W regfile is absolutely massive18:24
markosisn't that easily solvable by increasing that or is that a Power ISA restriction?18:24
markosthe number of ports on the regfile that is18:24
lkclprogrammerjake, no.  PLEASE LISTEN.  this is the 4th time i have said PLEASE LISTEN in under 16 hours18:24
lkclmarkos, it becomes an exponential cost to increase the regfile ports18:25
markosI see18:25
lkcland power consumption and latency start to push the boundaries of physics18:25
markosok, I understand I will try to follow the same pattern then18:25
programmerjakeit's a hw cost restriction where a huge number of reg file ports takes up like 30% of the whole cpu's area18:25
lkclindeed.18:25
lkclmitch alsup designed the AMD Opteron's regfile at the gate-level and he said he was barely able to get 10R6W within the required power and speed budget18:26
lkcl*all* execution units then have to "compete" for access to those regfile ports18:27
markoswhat is our own target?18:27
lkcland you need one "Priority Picker" per regfile port so as not to get data corruption (or worse, actually damage the ASIC)18:27
programmerjakelkcl: i'm listening, hence why i'm working on ternlogi since you asked me to work on things that need work and why i'm putting fgrevi discussions for later since i'm listening but i disagree18:27
lkclthe target will be: "whatever-is-required-for-our-first-real-customer"18:27
lkclprogrammerjake, thank you18:27
lkclwhen i have time i will get round to beginning a discussion of alternatives and costs, on the fgrev bugreport18:28
programmerjakei did state that before in my email...18:28
programmerjakethe i'm listening part18:28
lkclyou stressed me out so badly i couldn't bring myself to read it18:29
programmerjakek18:29
lkcli'm only just recovering from total overwhelm, after four *months*18:29
programmerjakeyeah, things can take time...18:30
programmerjakefor the compilers rfc, i'd like to do basically all the work of writing, submitting, etc. so you don't need to bother18:32
lkclmy thoughts on that: it may actually be better for it to be a Kazan continuation project18:33
lkcl(sotto voce: that happens to require some compiler work)18:33
lkclthat took me a few days to think of, apologies i haven't raised the idea before18:34
lkclaside from anything else, that would give a "real worked example / need", so to speak, that would drive the compiler side18:35
lkclotherwise it's a bit of a fishing expedition if you know what i mean18:35
programmerjakeimho we need llvm/gcc before kazan, so it should be compilers, unless you want a mainstream-compilers and a kazan rfc?18:35
lkcltied/related... yes.  or just the two together but EUR 100k not EUR 50k.18:36
lkcl(each)18:36
programmerjakethough otoh imho cranelift powerisa support gives wasmtime support for power so is easily justifiable as having a european element since europeans with power can then run wasm cli stuff -- cranelift is needed for kazan too18:37
lkclmy instincts in nlnet-grant-writing are lighting up more on kazan+cranelift as a first step18:38
markoswe need gcc/llvm for native code, not wasm :-)18:38
* lkcl agrees18:38
programmerjakei might be able to rope other people (outside of libre-soc) into working on cranelift18:38
markoscall me old-fashioned, but wasm is just turning out to be another form of java compile-once-run-everywhere18:39
markosthough I have to agree it's faster than java18:39
markoswhich isn't saying you cannot do it if you want18:40
lkclFORTH, java, CLR/.NET, JIT, wasm - seen 'em once, seen 'em all...18:40
programmerjakeexcept that java never worked very well for non-java languages whereas wasm is intentionally designed for c/c++-style languages too18:40
markosbut I honestly doubt anyone is really interested in wasm on power at this point in time18:40
programmerjakeno, forth, java, clr/.net are all programming language specific, wasm is designed to be language independent18:41
markosit works fine for jvm languages, clojure, scala, kotlin, etc. but I'm not interested in those either18:41
lkclno, CLR/.net is *definitely* non-programming-language-specific.  look up Iron-Python and Iron-Ruby.18:41
markosI would much prefer to have a working native compiler for power/svp6418:41
lkclmarkos, yyep.18:41
programmerjakecranelift *is* a native compiler18:42
markosok, let me rephrase18:42
* lkcl afk18:42
markosa working native C/C++ compiler for power/svp6418:42
programmerjakethat's also part of the compilers rfc18:42
markosanything else at *this point in time* is just a distraction18:43
markosno, those SHOULD be the RFC18:43
markosanything else is a side project, call it a pet project18:43
markosno one is stopping you from doing it18:43
markosbut it's definitely not a priority18:43
markosand you cannot expect others to adopt your logic; once gcc/llvm is working, then sure18:44
programmerjakewell, we need a vulkan driver and if we don't get to work on it soon it won't be ready when we need it (e.g. texture isa design), cranelift is part of that18:44
markoswhat are we going to do with a vulkan driver on its own18:44
markosplus vulkan drivers can ALSO be written in C/C++18:45
programmerjakeyou don't need svp64 support on the compiler that you use to compile the vulkan driver18:46
programmerjakejust need it in the shader compiler18:46
programmerjakeso rustc/llvm as is is sufficient if we have a shader compiler with svp64 support18:47
programmerjake(or clang/gcc if the vulkan driver is written in c)18:47
markosso your suggestion is that before we can actually compile C/C++ code with SVP64, we invest time in getting rust working with SVP64 first so that we get working vulkan, JUST in case we need software that uses vulkan?18:48
programmerjakeno, i'm suggesting we need a vulkan driver to properly design the gpu features of our cpus18:49
programmerjakee.g. texture instructions18:49
programmerjakesince being a gpu is a major part of what we want it to eventually do18:49
programmerjakeand vulkan is the logical gpu api to implement (first), since opengl/opencl can translate to vulkan18:50
markosperhaps, but still a compiler is more important18:50
markosand we still don't have a gpu18:50
markosbut we do have a cpu (sortof)18:51
programmerjakei'm not saying don't implement llvm/gcc, i'm saying work on cranelift too18:51
markosagain to avoid any misunderstandings, when I'm saying compiler I'm *only* talking about C/C++18:51
markosas first priority18:51
markosanything else just isn't18:51
programmerjakesince realistically it will likely work best for the shader compiler18:51
markosyeah I don't know how to reply to that; you keep bringing up the shader compiler, and so far I have yet to see it mentioned as a high-priority task in any of our talks, all I keep hearing/reading about is IoT/edge/computing/crypto/AI/ML/etc18:53
markosfor all of those we *need* working compilers18:53
markosyou are bent on saying we need cranelift too, maybe, add it as a separate task,18:54
programmerjakebut cranelift lets us try out the very invasive ir changes in llvm/gcc that luke wanted for svp64 support, which i think are going to be very hard to convince gcc/llvm that they should accept the ir changes18:54
markosI personally don't want to have it in the same RFC18:54
markosit's a distraction18:54
markosit will just take time from you and everyone else that's going to work in this particular project18:55
markosagain, I'm not saying don't do it, it's your call18:55
programmerjakeok, then like i suggested: a mainstream-compilers rfc and a vulkan drivers rfc that includes cranelift18:55
markosbut don't put it in the same task18:55
markosyes, no objection from me there18:55
programmerjakei wanted them together since lots of stuff we learn while building the cranelift backend (and what i already learned from bigint-presentation-code) will be directly applicable to llvm/gcc except much more complex to implement18:57
programmerjakedoing it in llvm/gcc first imho is setting us up for failure to some extent18:58
programmerjakeone other nice feature is the cranelift regalloc is mostly compiler-independent so could be easily slotted into llvm/gcc as a stopgap19:01
programmerjakeso we only have to implement reg range alloc once at first19:01
markosshared knowledge between tasks does not mandate -imho- a common rfc19:07
markosyou could share code between tasks, it would need modifications anyway so it wouldn't be just a copy paste thing19:07
programmerjakemaybe, though there are likely tasks that overlap and we want to avoid double-funding quagmire19:08
markosI think it's too early to worry about double-funding between 2 almost entirely different projects, I honestly doubt you will have much duplicate code between rust compiler and llvm/gcc19:09
programmerjakethough for the regalloc stopgap, it would be literally a copy of the cranelift regalloc19:10
programmerjakeplus some bindings19:10
programmerjakeor glue code19:10
markoswell if it has to be exactly the same then so be it, it will still be a needed part of compiler support19:11
programmerjake(not literally copied, but probably a crate dependency)19:11
markosI still doubt it will be exactly the same, it's like saying 2 entirely different projects use the same hashing function so you can't use it19:11
markoscrates are a rust thing, for llvm/gcc it has to be integrated19:12
programmerjakewell the cranelift regalloc is in rust, so we'd need rust -> c ffi adaptor -> glue code to llvm/gcc's c++ internals19:13
programmerjakethe c ffi adaptor would be written in rust19:14
markoser, no, that will never fly with the gcc/llvm people19:14
markosthe register allocator *has* to be integrated in C/C++ inside gcc/llvm tree19:15
programmerjakehence why i called it a stopgap19:15
programmerjakei'd estimate it'd take 1-2 weeks to write the glue code and maybe 1 mo for each of llvm/gcc to rewrite the regalloc into c++19:16
markoswell, good that you cleared this up now, because I would never agree to this, I would rather we invest the time to develop a proper register allocator in C/C++ and getting it working in llvm/gcc directly19:16
markosdepending on a rust project to get compiler support is sub optimal to say the least19:17
markosfor one you add an external dependency for everyone who would want to do compiler development19:17
programmerjakewell, that's much more complex imho19:17
markosnot really19:17
markosif it's 1mo of work to do a rust regalloc, then surely it can't take much more to do it in c++19:17
markoslet's say 2 months?19:18
programmerjakeexternal dependencies in rust are waay easier than in c++/c, you add one line to cargo.toml and it works19:18
markosI find it a terrible idea, sorry19:18
markosit essentially means you have to fight 2 beasts instead of one (gcc or llvm depending on the case)19:19
markosnot everyone is as well versed with rust like you and I certainly don't want to have to add yet another dependency to the toolchain19:19
programmerjakei'd expect the time scaling for c++/rust to be more exponential since it's more complexity i have to keep in my head and the reg alloc is pushing it already19:19
markosI thought you wanted to write the code first using rust and *then* port it to gcc/llvm sharing the solutions you encountered in the first19:20
programmerjakeso imho writing the regalloc in c++ at first is less wise19:20
programmerjakeyes, for the final regalloc once i work out the correct algorithm19:20
markosit's anything but less wise19:20
markosit's the only solution if we want/expect upstreaming of svp64 compiler support19:21
markosand any problem you want to solve with rust, you can easily solve it with C/C++ as well19:21
markosI really really dislike this idea19:21
programmerjakethe rust regalloc is never intended for upstreaming in gcc/llvm19:21
markosall the more reason to split the RFCs then19:22
programmerjake(though imho llvm might be more open to upstreaming rust code)19:22
markosI'd do 2 regallocs then, one in C++ for gcc/llvm and one for rust19:22
markosno, LLVM source code is definitely only in C++19:23
markosmaybe some C/asm for really low level stuff19:23
programmerjakeright now, yes, but imho they may change their minds if a particularly compelling rust library comes along. in any case i'm definitely not proposing we try upstreaming rust into llvm19:24
markosyou *cannot* expect/depend on them changing their mind19:24
programmerjakei'm not19:25
markoslkcl, I think I may have found a problem in your chacha20 calculation: in each set of quarterrounds, the first 4 use step=4 (0, 4, 8, 12), etc., and the second 4 use step=5 (0, 5, 10, 15), etc.20:53
markoshowever only the first 2 quarterrounds can be calculated independently20:54
markosso20:54
markosfn(x, 0, 4,  8, 12)20:55
markoshm, just as I was pasting the code, I found an error in my logic20:56
markoshate it when that happens20:56
markoson one hand it's good because it helps me find the problem, otoh, it's annoying and embarrassing when it happens20:56
markoshm, actually no, that was correct20:58
markos#define QUARTERROUND(a,b,c,d) \20:58
markos    a = PLUS(a,b); d = ROTATE(XOR(d,a),16); \20:58
markos    c = PLUS(c,d); b = ROTATE(XOR(b,c),12); \20:58
markos    a = PLUS(a,b); d = ROTATE(XOR(d,a), 8); \20:58
markos    c = PLUS(c,d); b = ROTATE(XOR(b,c), 7);20:58
markoswe can only do the first two PLUS/XOR/ROTATEs independently/parallel20:58
markoswe can group them together, but we have to redo the adds/XOR and ROTATEs with the next 2 shift values20:59
markosI don't know if you actually managed to get the unit test to pass in the past, it fails for me here21:00
markosin any case, I'll try to get it working and I'll fix the unit test in the process21:01
markosin the end, for the calculations involved, this means VL should be 8 not 16 for each pass of sv.add/sv.xor/sv.rldcl21:02
markosthen redo with the next shift values, and then move to the next bunch of quarterrounds (with step=5), again using the same logic21:03
markosetc21:03
markosanyway, I'm writing the documentation in parallel, trying to explain the algorithm21:03
markosI *hope* I will be done with it over the weekend21:03
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has quit IRC22:19
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has joined #libre-soc22:36
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-109-173-83-100.ip.moscow.rt.ru> has quit IRC23:31

Generated by irclog2html.py 2.17.1 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!