Wednesday, 2022-03-23

programmerjakelkcl, reminder to send me the link to the prefix-sum tree thing you said i forgot00:24
lkclprogrammerjake, tomorrow :)00:25
programmerjakei'll send you an email...00:25
lkclit was a rust simd discussion00:26
lkclyou cross-ref'd.  the algorithm was the same00:26
markoslkcl, I was searching for clang pronunciation and found this: "Clang is a front-end to LLVM that supports C and the GNU C extensions required by the kernel, and is pronounced “klang,” not “see-lang.”" :)10:32
markosbut also this:
lkcllol sea lang. klang is shorter10:54
lkclbut.. but... that means you have to pronounce "c" as just "kuh"10:54
lkcl"oh i'm a kuh-plus-plus developer"10:55
markoswell, can be worse, I've heard people talking about a strange popular language 'hava' and I had no idea what they were talking about until I made the connection... :D10:56
lkcloh dear. were they from Europe by chance? spain, brazil, portugal, netherlands?10:57
markosBelgium :)10:59
markosFlanders in particular, same bewilderment when I heard people talking about loching and loch monitoring10:59
lkcli lived in holland for just over 3 years so i get that one11:00
markosused to work for 3 years myself in Flanders region11:01
markosloved the beer and the food11:01
markosand made some good friends there11:01
lkclahh tell me about it. only beer i actually like is duvel. or leffe, brun11:02
markosmy favourite is Westvleteren, named the best Trappist beer in the world for some years, my 2nd Omer (a local beer to the region)11:04
markosI'm sorry that I cannot find these beers here easily11:04
lkclthe trappist beers are remarkably strong. belgo nord and belgo centraal in London were the only authorised sellers in the UK11:05
lkclyes the monks just won't let them be sold to anybody.11:05
markosapparently you can get them now
markosbut they're not cheap11:06
markosI did go there once myself and tried at the source :D11:06
markoscarried a 6pack all the way to Greece :D11:07
markosunfortunately 6 bottles of beer are consumed too fast and now I'm out :)11:09
lkclit's remarkably strategic thinking by the monks.11:09
lkcl"yes honey i'm on a pilgrimage to a monastery..."11:10
lkcl"i may be in seclusion for some time"11:10
mepyhello, I am officially putting my commitment to the project in pause/stop mode. I know, I have done nothing, but I kept reading the mailing list and still had a commitment, although a very small one, to the project. lkcl: if you can, could you please remove me from the list?11:36
lkclmepy, hey absolutely no problem at all, you've been remarkably supportive and that means a lot. you can actually remove yourself with an unsubscribe message or put it on "hold" by setting "nomail"11:39
lkcli'll set "nomail", if you really want to unsubscribe you can log in to mailman yourself ok?11:40
lkclmepy: i can't remember what your email address is (which is normally why you would use the mailman interface yourself). it's umberto something, isn't it?11:43
lkcli don't want to change somebody else's details11:43
mepyyeah, it is that. thank you a lot.11:49
mepyi am quitting now. bye everyone!11:51
lkclok! good luck!11:51
lkclfound it.12:00
lkclprogrammerjake, could you kindly alert jubilee about a reply i'm about to make, here?
lkclin SVP64's Matrix REMAP, we do not do "Horizontal Sum".  or, you can if you want to, by scheduling the scalar-ADDs in such a way that they (spectacularly) overload the in-flight Reservation Stations12:02
lkclinstead what we do is: *swap* the order of the inner and outer loops so that the MULs have a chance to filter through the pipelines and catch up with the ADDs12:02
lkclthe "normal" way - the one we're all taught in school - is to multiply-and-sum everything that goes into result[0][0]12:03
lkclthen move on to result[0][1]12:03
lkclthen 0-2 0-312:03
lkcland then go on to 1-0 1-1 1-2 1-312:04
lkclthis *requires* a horizontal-sum to perform efficiently in hardware, and it's a pig12:04
lkclif however you change the order of the 3 for-loops...12:04
lkclyou use result[0][0..3] as *partial accumulators*12:05
lkclimportantly, make sure to keep a row of one of the matrices in-memory [in-registers] and use that repeatedly12:06
lkclthere are 3 loops involved so there are 6 possible permutations - 6 possible ways that you could order those loops - to create the result (the matrix multiply)12:07
lkcl*one* of those is the "usual way taught at school" [and requires - *requires* - a Horizontal Sum instruction]12:07
lkclsome of the others do not.12:07
lkclSVP64 Matrix REMAP can do all 6 permutations.  actually with inversion of each of the for-loops it's a lot more than that12:08
lkclyou can do inner for i 0..VL-1 *or* inner for i VL-1..012:08
lkclsorry you can do inner for i 0..x-1 *or* inner for i x-1..012:08
lkclmiddle for i 0..y-1 *or* for i y-1..012:09
lkclouter for i 0..z-1 *or* for i z-1..012:09
lkcli can't quite do the math in my head on the number of permutations / options there12:09
lkclbut they basically cover in-place rotation and transpose.12:10
lkcleven for non-power-of-two matrix sizes.12:10
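The loop-reordering argument above can be sketched in plain Python. This is only an illustration of the arithmetic, not of how SVP64 Matrix REMAP is implemented: the function names (`matmul`, `loop_range`) and the `order`/`backwards` parameters are made up for this sketch. With order "ikj", each inner pass adds a[i][k] times row k of b into row i of the result, so that result row acts as a set of partial accumulators and no horizontal sum is needed.

```python
from itertools import permutations, product

def loop_range(n, backwards):
    """Indices 0..n-1, or n-1..0 when the loop is inverted."""
    return range(n - 1, -1, -1) if backwards else range(n)

def matmul(a, b, order="ijk", backwards=(False, False, False)):
    """Multiply a (n x m) by b (m x p), nesting the three loops in `order`.

    i = result row, j = result column, k = index along the shared dimension.
    order[0] is the outermost loop; backwards[n] inverts loop number n.
    """
    extent = {"i": len(a), "j": len(b[0]), "k": len(b)}
    result = [[0] * extent["j"] for _ in range(extent["i"])]
    outer, middle, inner = order
    for x in loop_range(extent[outer], backwards[0]):
        for y in loop_range(extent[middle], backwards[1]):
            for z in loop_range(extent[inner], backwards[2]):
                idx = {outer: x, middle: y, inner: z}
                i, j, k = idx["i"], idx["j"], idx["k"]
                # Each result element is only ever a partial accumulator.
                result[i][j] += a[i][k] * b[k][j]
    return result

a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
reference = matmul(a, b)  # "ijk": the order usually taught at school

# Every loop order combined with every forwards/backwards choice.
variants = [
    ("".join(order), flips)
    for order in permutations("ijk")
    for flips in product((False, True), repeat=3)
]
assert all(matmul(a, b, order, flips) == reference for order, flips in variants)
print(len(variants), "loop variants checked")
```

Counting the variants gives 3! orderings times 2^3 direction choices, i.e. 48. The integer inputs make every variant agree exactly; with floating point, different summation orders may round differently.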
markosyou know, if you have a rectangular 128x128? flat register file, you can build some really unique instructions, eg. defining kernels of 3x3 or 5x5 elements and doing some operations, vertically and horizontally13:16
markoscheck the description of convolutions in the beginning
markosbut I wouldn't want a specialized instruction, but a generic one that would allow you to perform the same operation on a NxM kernel13:18
markosI guess that can be remap?13:18
*** alMalsamo is now known as lumberjack12313:18
markosooc, I have a spartan-6 lx9 fpga board (xilinx) here that has been in a drawer for a while, can it be used for microwatt?14:29
markosit only has eth & usb14:30
lkclmarkos, yes apparently, but with xilinx tools14:40
lkclapparently symbiflow has had an outstanding issue since 2018 for this
lkclyeah that explanation is brilliant, and dead easy to understand14:49
lkcli think... to do it justice, though, it's going to need Extra-V (or similar)14:50
lkclcoherent memory-load/store14:50
programmerjakeooh, is that yet another person (markos) who thinks we should have picked 128-bit int/fp registers as the base for SVP64 rather than sticking with 64-bit registers? :) or was that a 128x128 grid of 64-bit registers?14:59
markosnot the base, for all I care you could choose 8-bytes as the base :)15:00
markos8bits sorry15:00
markosyeah, the way I wrote it was wrong15:01
markosobviously 128x128 64-bit registers would be fantastic but that's a bit much I guess15:02
markosI meant originally 128x128 bytes15:02
lkcl2^14 register file entries (16384) is completely out of the question, yes.  as is 2048 registers15:03
lkclExtra-V is a means and method of arranging memory to "arrive" (and leave) registers in a coherent deterministic fashion15:03
markosbut 128x128 = 16kb = 256 64-bit registers in a flat hierarchy15:04
lkclah bits15:04
programmerjakewell...amd's gpus have around that many registers (16k or some nearby power of 2)15:04
programmerjakeso it's not as much out of the question...15:04
markoswell, the more the better, I'm not going to complain obviously :)15:05
lkclthe problem with "more registers" is that each doubling results in routing and delay15:05
lkclmeaning that an upper maximum bound is placed on the clock frequency as a result.15:05
markoswhat's the ideal compromise?15:06
lkcl32 for a general-purpose processor15:06
lkcl32 x 64-bit15:06
lkclmost SIMD Processors will actually do batches-of-completely-independent 32x64-bit (or whatever)15:07
lkcland you get "striping" effects15:07
lkclsay 4-way-striping15:07
lkclwhich means in turn that unless you route data via a (slower) path15:07
lkclyou can only do r0 = r4+r815:07
lkclr1 = r5+r915:07
lkclr2 = r6+r1015:07
lkclr3 = r7+r1115:07
lkclif you try:15:08
lkclr0 = r4+r515:08
markoshere's a crazy idea, ok, 128x128 might be too much, but how about 64x64 = 4k in a flat file where registers would be actually configurable "pointers" (if that can be done in hardware) to the flat register file15:08
lkclthen because r5 is in a different "stripe" (like RAID striping) from r0 and r4, the contents of r5 have to be routed via special paths, to get between "the bank for r0 r4 r8 r12 r16" and "the bank for r1 r5 r9 r13 r17"15:09
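A minimal sketch of the striping rule described above, assuming 4-way striping where a register's bank is simply its number modulo 4 (which matches "the bank for r0 r4 r8 r12 r16"). The names `NUM_BANKS`, `bank` and `needs_cross_bank_routing` are hypothetical and only serve the illustration; real hardware may map registers to banks differently.

```python
NUM_BANKS = 4  # assumption: 4-way striping, as in the example above

def bank(reg):
    """Bank (stripe) holding register number `reg`, assuming modulo mapping."""
    return reg % NUM_BANKS

def needs_cross_bank_routing(dest, *sources):
    """True if any source lives in a different bank from the destination."""
    return any(bank(src) != bank(dest) for src in sources)

# r0 = r4 + r8: all three registers sit in bank 0, so the fast path applies.
print(needs_cross_bank_routing(0, 4, 8))   # False
# r0 = r4 + r5: r5 sits in bank 1 and has to take the slower routed path.
print(needs_cross_bank_routing(0, 4, 5))   # True
```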
lkcli've done the architectural design, it took 18 months to think through, and i'm just not going to change it15:09
lkclit's too much15:09
markosno no15:09
markosI'm not asking you to change it15:09
lkclyes, have a look at
lkcland also Extra-V15:10
markosI saw already you mention 128 registers15:10
lkcl"pointers" is *exactly* what Snitch - and Extra-V - do15:10
lkclwhen you refer to say add fp3, fp5, fp615:10
lkclwhat actually happens in Snitch is, it goes:15:11
lkcl"hmm, fp5 has been reconfigured as a memory-coherent-Queue-reloader.  let me just get the value for you from the front of the Coherent Memory Queue rather than the actual regfile"15:11
lkclfp6 is targeted (configured) to point at a *SECOND* Coherent Memory Queue15:11
lkcland fp3 is targeted to *store* into a (third, pre-configured) Coherent Memory Queue15:12
markosso the fp* registers can be configured as aliases to an actual memory address?15:12
lkclthe pre-configuration of those 3 regs are then set up to run a deterministic algorithm which in the case of the ones serving "fp5 / 6"15:13
lkclwith implicit auto-load-and-increment15:13
markosbut without the latency of memory access?15:13
lkcljust like the c (*ptr++) thing15:13
lkclyes and no15:13
lkclif latency happens then the main processor stalls15:13
lkclbut the focus of the Snitch core was precisely to arrange the memory and the core so that that *did not happen*15:14
markosso the operation is done async15:14
lkclthey achieved this i believe by using the snitch core as a barrel processor15:14
lkclah no.15:14
lkclit's definitely synchronous.15:14
lkcland it's termed "coherent"15:14
lkclthe processor sees the FIFOs in a synchronous fashion15:15
markosit's synchronous wrt the Queue but the actual memory?15:15
lkclbut the memory-side will obviously be under different constraints15:15
lkcland this is the important bit15:15
markossorry, bb in 10'15:15
lkclthe algorithm running in the Memory Controller that puts the data *into* the queue is - has to be - fully deterministic15:16
lkcland likewise on destination (result operations) that go into the "outgoing result" FIFO15:16
lkclthose also have to - by definition - follow a deterministic schedule15:17
lkclthe simplest of such Deterministic Schedules is: *ptr++15:17
lkclin the case of Extra-V15:17
lkclthey went a whoooole new level of algorithmic fun15:18
* lkcl afk15:18
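The "register reconfigured as a queue" idea with the simplest schedule (*ptr++) can be modelled as a toy in Python. This is a sketch under loud assumptions: the class `StreamRegFile` and its methods are invented for illustration, it reads memory directly instead of modelling a FIFO filled by a memory controller, and it says nothing about how Snitch or Extra-V are actually built.

```python
class StreamRegFile:
    """Toy register file in which some registers are redirected to memory streams."""

    def __init__(self, memory, num_regs=32):
        self.memory = memory
        self.regs = [0.0] * num_regs
        self.read_ptr = {}    # reg number -> next memory index to load from
        self.write_ptr = {}   # reg number -> next memory index to store to

    def configure_read_stream(self, reg, start):
        self.read_ptr[reg] = start

    def configure_write_stream(self, reg, start):
        self.write_ptr[reg] = start

    def read(self, reg):
        if reg in self.read_ptr:
            addr = self.read_ptr[reg]
            self.read_ptr[reg] = addr + 1      # deterministic schedule: *ptr++
            return self.memory[addr]
        return self.regs[reg]

    def write(self, reg, value):
        if reg in self.write_ptr:
            addr = self.write_ptr[reg]
            self.write_ptr[reg] = addr + 1     # deterministic schedule: *ptr++ = value
            self.memory[addr] = value
        else:
            self.regs[reg] = value

def fadd(rf, dest, src_a, src_b):
    """Model of 'add dest, src_a, src_b': the instruction never sees the streams."""
    rf.write(dest, rf.read(src_a) + rf.read(src_b))

memory = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0, 0.0, 0.0, 0.0, 0.0]
rf = StreamRegFile(memory)
rf.configure_read_stream(5, 0)    # fp5 reads from the first input stream
rf.configure_read_stream(6, 4)    # fp6 reads from a second input stream
rf.configure_write_stream(3, 8)   # fp3 stores into a third, outgoing stream

for _ in range(4):
    fadd(rf, 3, 5, 6)             # the same instruction, issued four times

print(memory[8:12])
```

The point of the sketch is that the instruction stream stays identical on every iteration; only the pre-configured pointers advance, which is why both sides can know the schedule in advance.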
programmerjakei posted the link to irc chat log on zulip, it should show up here soon:
programmerjakebtw, neat idea with using vector modes for crs to encode 1/2/4/8 bits per int reg...15:35
lkclprogrammerjake, yeah, it was inspired by the sv.ori./ew=8 thing you came up with15:44
lkcland since you mentioned using crweird afterwards, i wondered how to actually back-to-back those two properly/usefully15:45
lkclmarkos, Extra-V, instead of simple "*ptr++", can do deterministic Graph-Node-walking15:46
lkclwhere, again, both the Memory and the Core know, *in advance*, what the Schedule will be15:46
lkclprogrammerjake, i think we can probably use the same concept for 3D Shader Pixel data interpolation15:46
programmerjakeprobably, but we'd want that to go through the cache, not straight to main memory15:50
programmerjaketexture reads ^15:50
lkclSnitch managed to organise it to be direct (somehow)15:50
programmerjaketexture reads often reuse nearby bits of memory multiple times -- caching is required...i'd guess snitch could organize their loops to read once and not need the data again, negating the need for caching (maybe? imho it'd still need caching since one loop could read/write it and the next loop could need it again)15:54
programmerjakeas an example of why we need caching for all gpu stuff...just look at how much of a speedup amd got from their humongous cache...15:55
* lkcl wonders what results come up by searching "AMD humongous cache"15:59
programmerjakethey call it "infinity cache"15:59
programmerjaketheir largest gpu has 128MB of cache, probably a major portion of their claimed 50% jump in power efficiency16:03
programmerjakethough...speaking of cache, they released a server cpu with 768MB L3 cache16:04
lkclprobably because IBM POWER10 has something mad as well :)16:06
programmerjakeidk, i think they were adding extra cache to compete with intel on the desktop and thought, why not do that on the server too?16:09
markosI still think these are all half-measures until we can achieve 1:1 zero latency ram :)16:26
lkclhey that's perfectly achievable for a max clock rate of ooo 100 mhz? :)16:27
markosyeah I was thinking something like modern systems, though tbh, you do need a cache even then, esp if you have multiple cores16:30
markosI remember reading many years ago about photonic transistors which would pave the way for photon chips, and then nothing, I wonder what happened to that technology16:32
programmerjakelkcl, jubilee responded "huh."16:58
programmerjakewell....i'm hoping we eventually get 3d sram, kinda like 3d nand, then we could have 128GB cache!17:01
markoslkcl, quote from the file, "at some point 128 registers will be available", was trying to remember where did I see it and it was right in front of me. So, there is indeed a plan for so many registers, 128x64 = 8kB flat register file, my question is why not double it and get a 128x128=16k register file? Would that require too many changes in the ISA proposal?17:23
markossigh, sorry again I miscalculated17:24
markos128x64 bits = 8kbit not kB17:24
markosignore that17:24
programmerjakeiirc lkcl didn't want to move to 128-bit registers since he seemed to think that it would be too much work to change our hdl to have 128-bit data paths...i disagree17:25
markosso, the register file will be 1kB and able to hold a 32x3217:25
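The bits-versus-bytes mix-ups above are easy to check with a few lines of arithmetic. The helper names below are made up for this sketch; the figures are the ones quoted in the conversation.

```python
BITS_PER_BYTE = 8
REG_BITS = 64  # width of one register, as discussed above

def regs_needed(total_bits):
    """Number of 64-bit registers needed to hold `total_bits`."""
    return total_bits // REG_BITS

grid_bits = 128 * 128                      # a 128x128 grid of single bits
print(grid_bits, grid_bits // BITS_PER_BYTE, regs_needed(grid_bits))

grid_bytes_as_bits = 128 * 128 * BITS_PER_BYTE   # a 128x128 grid of bytes
print(regs_needed(grid_bytes_as_bits))

regfile_bits = 128 * REG_BITS              # 128 registers of 64 bits each
regfile_bytes = regfile_bits // BITS_PER_BYTE
print(regfile_bits, regfile_bytes, regfile_bytes == 32 * 32)
```

So a 128x128 grid of bits is 16384 bits (2 KiB, 256 registers), a 128x128 grid of bytes would need 2048 registers, and 128 registers of 64 bits come to 8192 bits, i.e. 1 KiB, which is exactly a 32x32 block of bytes.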
markosI don't want to put extra work on him, I'm just curious17:26
programmerjakethe current plan is to have 256 64-bit registers (128 int, 128 fp)17:26
markosmy biggest question is how is permute going to work efficiently17:27
markosotoh, SVP64 design makes permute needless in many algorithms17:28
programmerjakecurrently i think the plan is to just push all the data through the register file (even though that's waay slower)17:28
programmerjakefor permute17:28
programmerjakethe plan will probably change when we aim for higher performance17:29
markosI really wish I could help here, but I have no HDL knowledge17:30
lkclmarkos: in TestIssuer and in the simulator there are currently only QTY 32 64-bit registers17:55
lkclso you kinda have to squeeze the vector ops into the existing 32 regs at the moment17:56
lkclprogrammerjake, i have made it clear multiple times that i have had to spend 18 months with a massive complex design currently held in my head and i am NOT going to redo that design18:04
lkclwill you please STOP demanding that i waste my time throwing absolutely everything away and satisfy your requirement to do a 128 bit architecture18:05
lkcli am getting very fed up of it18:05
lkclit is making me very angry that you're not listening and i cannot take it any longer18:06
markoslkcl, sorry, I started it, I don't want to cause you any extra effort, my question was rather towards the use case of fitting large 2D matrices for convolution18:07
lkclmarkos, it's ok.18:07
markoswhether you use 64-bit or 128-bit vectors is irrelevant18:07
lkclthe convolution can be done using coherent memory18:07
lkclbecause it's most useful when doing large 2D arrays that can never fit into regfiles anyway18:08
markosmodern video codecs do DCT on largish arrays 32x32 or 64x64 even18:08
markosso imagine being able to do an FFT/DCT on a single block in a couple of instructions18:08
lkclthis is why i want to include Extra-V in SVP6418:09
lkclSV-REMAP only works - at present - when the data is entirely in registers, limiting the max size.18:10
lkclSV-REMAP-on-top-of-Extra-V entirely removes those limits.18:11
lkcla DCT of 2^32 would then be perfectly possible18:11
lkcltake a damn long time but be possible18:11
programmerjakelkcl, i'm not demanding we switch to 128-bit registers since i know there are other higher-priority things and 128-bits isn't strictly necessary, i'm just not going to say that i agree with you and think 64-bits is the best choice architecturally or at the isa level since my opinion hasn't changed -- 128-bits would be nicer.18:15
tplatenI'm working on the tplaten_3d_game branch, getting an external_core_top.v:258195: ERROR: Re-definition of module `\plru_2'!18:37
tplatenI need the module's entity but not its architecture behave18:39
tplatenIn the makefile we have fpga_files and synth_files, I will have a deeper look18:39
programmerjakeis that for libre-soc inserted into microwatt's peripherals replacing the core? if so, that would be a problem probably for lkcl...18:41
tplatenyes that is the case, I want to run the maze game on the libre-soc core18:45
tplatenand there are so many changes in microwatt causing merge conflicts if I try to merge microwatt with verilator trace, even in the makefile18:46
markoslkcl, I'm trying to test a small change18:52
markos      setvl 1618:52
markos      sv.fadds in.v, in.v, in.v18:52
markos(I did a .set in. 3 above)18:52
markossorry sv.fadds/mrr18:52
markosbut I'm getting the following error from pysvp64asm audio/mp3/mp3_1_imdct36_float.s audio/mp3/mp3_1_imdct36_float.s.sv18:53
programmerjakehmm, maybe it'd be easier to backport the maze game and usb serial stuff to the branch where libre-soc works?18:53
programmerjakerather than trying to merge it18:54
markos(do you use another paste service btw?)18:56
programmerjakewe mostly just use attachments in bugzilla and ftp.libre-soc.org18:57
programmerjakefeel free to create a bugzilla bug to track the maze game issues with libre-soc18:59
programmerjakethe assembly is using an old 1-arg form of setvl, the latest version has 5 args afaict19:16
programmerjake6 args...miscounted19:16
markosok, I'm trying to understand how this is going to work, the code I'm trying to convert is rather simple:19:55
markosfor (i = 17; i >= 1; i--)                                                                                                               │        .long   3210589143              # float -0.86602538819:55
markos        in[i] += in[i-1];19:55
markostmux doesn't play well with copy+paste with mouse19:56
markosfor (i = 17; i >= 1; i--)19:56
markos   in[i] += in[i-1];19:56
markosso, if I understand this right, and reading the comments in the file, I just need to setvl 16, and do a sv.fadds/mrr19:57
markoshowever, don't I have to load all the elements into vectors first? using an ld into in.v? (in is 5 here)19:57
markosso this should be the equivalent of setvl 16 (or the 6-args equivalent)19:58
markosor rather wrong19:59
markosin is just a pointer19:59
markosso I would have to say register 10, the 16 elements, like so19:59
markosld 10.v, (5)20:00
markosand then sv.fadds/mrr 11.v, 10.v, 10.v20:00
markosas lkcl is writing in the comments20:00
markosdo I understand it correctly or have I gotten this wrong?20:00
lkclmarkos, that's pretty much it, yes.20:38
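Why the direction of that C loop matters can be shown with a small Python model. This only reproduces the semantics of the scalar loop quoted above; it makes no claim about what sv.fadds/mrr does in hardware, and the function names are invented for the sketch. Walking downwards, in[i-1] has not been modified yet when in[i] is updated, so the loop is just an element-wise add of the array with a copy of itself shifted by one. Walking upwards instead would produce a running (prefix) sum.

```python
def descending_accumulate(values):
    """for (i = n-1; i >= 1; i--) in[i] += in[i-1];"""
    out = list(values)
    for i in range(len(out) - 1, 0, -1):
        out[i] += out[i - 1]      # out[i-1] still holds its original value
    return out

def ascending_accumulate(values):
    """Same statement with i counting upwards: a prefix sum."""
    out = list(values)
    for i in range(1, len(out)):
        out[i] += out[i - 1]      # out[i-1] was already updated
    return out

def shifted_add(values):
    """Element-wise add of the array and itself shifted by one position."""
    return [values[0]] + [values[i] + values[i - 1] for i in range(1, len(values))]

data = list(range(1, 19))         # 18 elements, in[0] .. in[17]
print(descending_accumulate(data) == shifted_add(data))
print(ascending_accumulate(data)[-1], sum(data))
```

The first comparison holds because no iteration of the descending loop depends on a result produced by an earlier iteration.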
lkclthat limit about 128 registers is obviously lifted, there20:40
lkcl  26 # SV floats20:40
lkcl  27 .set fv0, 3220:40
lkcl  28 .set fv1, 4020:40
lkcl  29 .set fv2, 4820:40
lkclr48 is not even remotely possible with standard Power ISA 3.020:40
lkclso the comment is clearly out-of-date20:40
lkclprogrammerjake, sorry, i just find it deeply frustrating because you have no idea of the timescales and implications of what you're advocating, the amount of disruption it would cause to abandon everything done and designed so far to do 128-bit20:46
markosone more question, what's the equivalent of 'setvl 16' in the 6-arg format?20:58
