programmerjake | lkcl, reminder to send me the link to the prefix-sum tree thing you said i forgot | 00:24 |
---|---|---|
lkcl | programmerjake, tomorrow :) | 00:25 |
programmerjake | k | 00:25 |
programmerjake | i'll send you an email... | 00:25 |
lkcl | it was a rust simd discussion | 00:26 |
lkcl | you cross-ref'd. the algorithm was the same | 00:26 |
markos | lkcl, I was searching for clang pronunciation and found this: https://www.kernel.org/doc/html/latest/kbuild/llvm.html "Clang is a front-end to LLVM that supports C and the GNU C extensions required by the kernel, and is pronounced “klang,” not “see-lang.”" :) | 10:32 |
markos | but also this: https://news.ycombinator.com/item?id=11561046 | 10:33 |
lkcl | lol sea lang. klang is shorter | 10:54 |
lkcl | but.. but... that means you have to pronounce "c" as just "kuh" | 10:54 |
lkcl | :) | 10:54 |
lkcl | "oh i'm a kuh-plus-plus developer" | 10:55 |
markos | hahaha | 10:55 |
lkcl | https://www.youtube.com/watch?v=qTvhKZHAP8U | 10:56 |
markos | well, can be worse, I've heard people talking about a strange popular language 'hava' and I had no idea what they were talking about until I made the connection... :D | 10:56 |
lkcl | oh dear. were they from Europe by chance? spain, brazil, portugal, netherlands? | 10:57 |
markos | Belgium :) | 10:59 |
markos | Flanders in particular, same bewilderment when I heard talking about loching and loch monitoring | 10:59 |
lkcl | i lived in holland for just over 3 years so i get that one | 11:00 |
markos | used to work for 3 years myself in Flanders region | 11:01 |
markos | loved the beer and the food | 11:01 |
markos | and made some good friends there | 11:01 |
lkcl | ahh tell me about it. only beer i actually like is duvel. or leffe, brun | 11:02 |
markos | my favourite is Westvleteren, named the best Trappist beer in the world for some years, my 2nd Omer (a local beer to the region) | 11:04 |
markos | I'm sorry that I cannot find these beers here easily | 11:04 |
lkcl | the trappist beers are remarkably strong. belgo nord and belgo centraal in London were the only authorised sellers in the UK | 11:05 |
lkcl | yes the monks just won't let them be sold to anybody. | 11:05 |
markos | apparently you can get them now https://www.westvleterenshop.com/apps/webstore | 11:06 |
markos | but they're not cheap | 11:06 |
markos | I did go there once myself and tried at the source :D | 11:06 |
markos | carried a 6pack all the way to Greece :D | 11:07 |
lkcl | :) | 11:07 |
markos | unfortunately 6 bottles of beer are consumed too fast and now I'm out :) | 11:09 |
lkcl | sigh. | 11:09 |
lkcl | it's a remarkably strategic thinking of the monks. | 11:09 |
lkcl | "yes honey i'm on a pilgrimage to a monastery..." | 11:10 |
lkcl | "i may be in seclusion for some time" | 11:10 |
lkcl | hic | 11:10 |
markos | :D | 11:12 |
mepy | hello, I am officially putting in pause/stop mode my commitment to the project. I know, I have done nothing, but I was keeping reading the mailing list and had a commitment, although very little, to the project still. lkcl: if you can, could you please remove me from the list? | 11:36 |
lkcl | mepy, hey absolutely no problem at all, you've been remarkably supportive and that means a lot. you can actually remove yourself with an unsubscribe message or put it on "hold" by setting "nomail" | 11:39 |
lkcl | i'll set "nomail", if you really want to unsubscribe you can log in to mailman yourself ok? | 11:40 |
lkcl | mepy: i can't remember what your email address is (which is normally why you would use the mailman interface yourself). it's umberto something, isn't it? | 11:43 |
lkcl | i don't want to change somebody else's details | 11:43 |
mepy | yeah, it is that. thank you a lot. | 11:49 |
mepy | i am quitting now. bye everyone! | 11:51 |
lkcl | ok! good luck! | 11:51 |
mepy | ty | 11:52 |
lkcl | programmerjake, https://bugs.libre-soc.org/show_bug.cgi?id=697#c6 | 12:00 |
lkcl | found it. | 12:00 |
lkcl | programmerjake, could you kindly alert jubilee about a reply i'm about to make, here? https://zulip-archive.rust-lang.org/stream/257879-project-portable-simd/topic/pairwise.20ops.20would.20be.20nice.html#271073695 | 12:01 |
lkcl | in SVP64's Matrix REMAP, we do not do "Horizontal Sum". or, you can if you want to, by scheduling the scalar-ADDs in such a way that they (spectacularly) overload the in-flight Reservation Stations | 12:02 |
lkcl | instead what we do is: *swap* the order of the inner and outer loops so that the MULs have a chance to filter through the pipelines and catch up with the ADDs | 12:02 |
lkcl | the "normal" way - the one we're all taught in school - is to multiply-and-sum everything that goes into result[0][0] | 12:03 |
lkcl | then move on to result[0][1] | 12:03 |
lkcl | then 0-2 0-3 | 12:03 |
lkcl | and then go on to 1-0 1-1 1-2 1-3 | 12:04 |
lkcl | this *requires* a horizontal-sum to perform efficiently in hardware, and it's a pig | 12:04 |
lkcl | if however you change the order of the 3 for-loops... | 12:04 |
lkcl | you use result[0][0..3] as *partial accumulators* | 12:05 |
lkcl | importantly, make sure to keep a row of one of the matrices in-memory [in-registers] and use that repeatedly | 12:06 |
lkcl | there are 3 loops involved so there are 6 possible permutations - 6 possible ways that you could order those loops - to create the result (the matrix multiply) | 12:07 |
lkcl | *one* of those is the "usual way taught at school" [and requires - *requires* - a Horizontal Sum instruction] | 12:07 |
lkcl | some of the others do not. | 12:07 |
lkcl | SVP64 Matrix REMAP can do all 6 permutations. actually with inversion of each of the for-loops it's a lot more than that | 12:08 |
lkcl | you can do inner for i 0..VL-1 *or* inner for i VL-1..0 | 12:08 |
lkcl | sorry you can do inner for i 0..x-1 *or* inner for i x-1..0 | 12:08 |
lkcl | middle for i 0..y-1 *or* for i y-1..0 | 12:09 |
lkcl | outer for i 0..z-1 *or* for i z-1..0 | 12:09 |
lkcl | i can't quite do the math in my head on the number of permutations / options there | 12:09 |
lkcl | but they basically cover in-place rotation and transpose. | 12:10 |
lkcl | even for non-power-of-two matrix sizes. | 12:10 |
markos | you know, if you have a rectangular 128x128? flat register file, you can build some really unique instructions, eg. defining kernels of 3x3 or 5x5 elements and doing some operations, vertically and horizontally | 13:16 |
markos | check the description of convolutions in the beginning https://towardsdatascience.com/tensorflow-for-computer-vision-how-to-implement-convolutions-from-scratch-in-python-609158c24f82 | 13:17 |
markos | but I wouldn't want a specialized instruction, but a generic one that would allow you to perform the same operation on a NxM kernel | 13:18 |
markos | I guess that can be remap? | 13:18 |
*** alMalsamo is now known as lumberjack123 | 13:18 | |
markos | ooc, I have a spartan-6 lx9 fpga board (xilinx) here that has been in a drawer for a while, can it be used for microwatt? | 14:29 |
markos | it only has eth & usb | 14:30 |
lkcl | markos, yes apparently, but with xilinx tools | 14:40 |
lkcl | https://gitlab.com/nmigen/nmigen/-/blob/master/nmigen/vendor/xilinx.py#L549 | 14:40 |
lkcl | apparently symbiflow has had an outstanding issue since 2018 for this https://github.com/f4pga/ideas/issues/10 | 14:42 |
lkcl | yeah that explanation is brilliant, and dead easy to understand | 14:49 |
lkcl | i think... to do it justice, though, it's going to need Extra-V (or similar) | 14:50 |
lkcl | coherent memory-load/store | 14:50 |
programmerjake | ooh, is that yet another person (markos) who thinks we should have picked 128-bit int/fp registers as the base for SVP64 rather than sticking with 64-bit registers? :) or was that a 128x128 grid of 64-bit registers? | 14:59 |
markos | not the base, for all I care you could choose 8-bytes as the base :) | 15:00 |
markos | 8bits sorry | 15:00 |
markos | yeah, the way I wrote it was wrong | 15:01 |
markos | obviously 128x128 64-bit registers would be fantastic but that's a bit much I guess | 15:02 |
markos | I meant originally 128x128 bytes | 15:02 |
lkcl | 2^14 register file entries (16384) is completely out of the question, yes. as is 2048 registers | 15:03 |
lkcl | Extra-V is a means and method of arranging memory to "arrive" (and leave) registers in a coherent deterministic fashion | 15:03 |
markos | but 128x128 = 16kb = 256 64-bit registers in a flat hierarchy | 15:04 |
lkcl | ah bits | 15:04 |
programmerjake | well...amd's gpus have around that many registers (16k or some nearby power of 2) | 15:04 |
programmerjake | so it's not as much out of the question... | 15:04 |
markos | well, the more the better, I'm not going to complain obviously :) | 15:05 |
lkcl | https://arxiv.org/abs/2002.10143 | 15:05 |
lkcl | the problem with "more registers" is that each doubling results in routing and delay | 15:05 |
lkcl | meaning that an upper maximum bound is placed on the clock frequency as a result. | 15:05 |
markos | what's the ideal compromise? | 15:06 |
lkcl | 32 for a general-purpose processor | 15:06 |
lkcl | 32 x 64-bit | 15:06 |
lkcl | most SIMD Processors will actually do batches-of-completely-independent 32x64-bit (or whatever) | 15:07 |
lkcl | and you get "striping" effects | 15:07 |
lkcl | say 4-way-striping | 15:07 |
lkcl | which means in turn that unless you route data via a (slower) path | 15:07 |
lkcl | you can only do r0 = r4+r8 | 15:07 |
lkcl | r1 = r5+r9 | 15:07 |
lkcl | r2 = r6+r10 | 15:07 |
lkcl | r3 = r7+r11 | 15:07 |
lkcl | if you try: | 15:08 |
lkcl | r0 = r4+r5 | 15:08 |
markos | here's a crazy idea, ok, 128x128 might be too much, but how about 64x64 = 4k in a flat file where registers would be actually configurable "pointers" (if that can be done in hardware) to the flat register file | 15:08 |
lkcl | then because r5 is in a different "stripe" (like RAID striping) from r0 and r4, the contents of r5 have to be routed via special paths, to get between "the bank for r0 r4 r8 r12 r16" and "the bank for r1 r5 r9 r13 r17" | 15:09 |
lkcl | i've done the architectural design, it took 18 months to think through, and i'm just not going to change it | 15:09 |
lkcl | it's too much | 15:09 |
markos | no no | 15:09 |
markos | I'm not asking you to change it | 15:09 |
lkcl | yes, have a look at https://arxiv.org/abs/2002.10143 | 15:10 |
lkcl | and also Extra-V | 15:10 |
markos | I saw already you mention 128 registers | 15:10 |
lkcl | "pointers" is *exactly* what Snitch - and Extra-V - do | 15:10 |
lkcl | when you refer to say add fp3, fp5, fp6 | 15:10 |
lkcl | what actually happens in Snitch is, it goes: | 15:11 |
lkcl | "hmm, fp5 has been reconfigured as a memory-coherent-Queue-reloader. let me just get the value for you from the front of the Coherent Memory Queue rather than the actual regfile" | 15:11 |
lkcl | fp6 is targeted (configured) to point at a *SECOND* Coherent Memory Queue | 15:11 |
lkcl | and fp3 is targeted to *store* into a (third, pre-configured) Coherent Memory Queue | 15:12 |
markos | so the fp* registers can be configured as aliases to an actual memory address? | 15:12 |
lkcl | the pre-configuration of those 3 regs are then set up to run a deterministic algorithm which in the case of the ones serving "fp5 / 6" | 15:13 |
lkcl | yes | 15:13 |
markos | interesting | 15:13 |
lkcl | with implicit auto-load-and-increment | 15:13 |
markos | but without the latency of memory access? | 15:13 |
lkcl | just like the c (*ptr++) thing | 15:13 |
lkcl | yes and no | 15:13 |
lkcl | if latency happens then the main processor stalls | 15:13 |
lkcl | but the focus of the Snitch core was precisely to arrange the memory and the core so that that *did not happen* | 15:14 |
markos | so the operation is done async | 15:14 |
lkcl | they achieved this i believe by using the snitch core as a barrel processor | 15:14 |
lkcl | ah no. | 15:14 |
lkcl | it's definitely synchronous. | 15:14 |
lkcl | and it's termed "coherent" | 15:14 |
lkcl | okok | 15:14 |
lkcl | the processor sees the FIFOs in a synchronous fashion | 15:15 |
markos | it's synchronous wrt the Queue but the actual memory? | 15:15 |
lkcl | but the memory-side will obviously be under different constraints | 15:15 |
lkcl | *but* | 15:15 |
lkcl | and this is the important bit | 15:15 |
markos | sorry, bb in 10' | 15:15 |
lkcl | the algorithm running in the Memory Controller that puts the data *into* the queue is - has to be - fully deterministic | 15:16 |
lkcl | and likewise on destination (result operations) that go into the "outgoing result" FIFO | 15:16 |
lkcl | those also have to - by definition - follow a deterministic schedule | 15:17 |
lkcl | the simplest of such Deterministic Schedules is: *ptr++ | 15:17 |
lkcl | but | 15:17 |
lkcl | in the case of Extra-V | 15:17 |
lkcl | they went a whoooole new level of algorithmic fun | 15:18 |
* lkcl afk | 15:18 | |
programmerjake | i posted the link to irc chat log on zulip, it should show up here soon: https://zulip-archive.rust-lang.org/stream/257879-project-portable-simd/topic/pairwise.20ops.20would.20be.20nice.html#271073695 | 15:19 |
programmerjake | btw, neat idea with using vector modes for crs to encode 1/2/4/8 bits per int reg... | 15:35 |
lkcl | programmerjake, yeah, it was inspired by the sv.ori./ew=8 thing you came up with | 15:44 |
lkcl | and that you mentioned using crweird after, i wondered how to actually back-to-back those two properly/usefully | 15:45 |
lkcl | markos, Extra-V, instead of simple "*ptr++", can do deterministic Graph-Node-walking | 15:46 |
lkcl | where, again, both the Memory and the Core know, *in advance*, what the Schedule will be | 15:46 |
lkcl | programmerjake, i think we can probably use the same concept for 3D Shader Pixel data interpolation | 15:46 |
programmerjake | probably, but we'd want that to go through the cache, not straight to main memory | 15:50 |
programmerjake | texture reads ^ | 15:50 |
lkcl | Snitch managed to organise it to be direct (somehow) | 15:50 |
programmerjake | texture reads often reuse nearby bits of memory multiple times -- caching is required...i'd guess snitch could organize their loops to read once and not need the data again, negating the need for caching (maybe? imho it'd still need caching since one loop could read/write it and the next loop could need it again) | 15:54 |
programmerjake | as an example of why we need caching for all gpu stuff...just look at how much of a speedup amd got from their humongous cache... | 15:55 |
* lkcl wonders what results come up by searching "AMD humongous cache" | 15:59 | |
programmerjake | they call it "infinity cache" | 15:59 |
programmerjake | their largest gpu has 128MB of cache, probably a major portion of their claimed 50% jump in power efficiency | 16:03 |
programmerjake | though...speaking of cache, they released a server cpu with 768MB L3 cache | 16:04 |
lkcl | probably because IBM POWER10 has something mad as well :) | 16:06 |
programmerjake | idk, i think they were adding extra cache to compete with intel on the desktop and thought, why not do that on the server too? | 16:09 |
markos | I still think these are all half-measures until we can achieve 1:1 zero latency ram :) | 16:26 |
lkcl | hey that's perfectly achievable for a max clock rate of ooo 100 mhz? :) | 16:27 |
markos | yeah I was thinking something like modern systems, though tbh, you do need a cache even then, esp if you have multiple cores | 16:30 |
markos | I remember reading many years ago about photonic transistors which would pave the way for photon chips, and then nothing, I wonder what happened to that technology | 16:32 |
programmerjake | lkcl, jubilee responded "huh." | 16:58 |
programmerjake | well....i'm hoping we eventually get 3d sram, kinda like 3d nand, then we could have 128GB cache! | 17:01 |
markos | lkcl, quote from the mp3_0_apply_window_float_basicsv.s.sv file, "at some point 128 registers will be available", was trying to remember where did I see it and it was right in front of me. So, there is indeed a plan for so many registers, 128x64 = 8kB flat register file, my question is why not double it and get a 128x128=16k register file? Would that require too many changes in the ISA proposal? | 17:23 |
markos | sigh, sorry again I miscalculated | 17:24 |
markos | 128x64 bits = 8kbit not kB | 17:24 |
markos | ignore that | 17:24 |
programmerjake | iirc lkcl didn't want to move to 128-bit registers since he seemed to think that it would be too much work to change our hdl to have 128-bit data paths...i disagree | 17:25 |
markos | so, the register file will be 1kB and able to hold a 32x32 | 17:25 |
markos | I don't want to put extra work on him, I'm just curious | 17:26 |
programmerjake | the current plan is to have 256 64-bit registers (128 int, 128 fp) | 17:26 |
markos | my biggest question is how is permute going to work efficiently | 17:27 |
markos | otoh, SVP64 design makes permute needless in many algorithms | 17:28 |
programmerjake | currently i think the plan is to just push all the data through the register file (even though that's waay slower) | 17:28 |
programmerjake | for permute | 17:28 |
programmerjake | the plan will probably change when we aim for higher performance | 17:29 |
markos | I really wish I could help here, but I have no HDL knowledge | 17:30 |
lkcl | markos: in TestIssuer and in the simulator there are currently only QTY 32 64-bit registers | 17:55 |
lkcl | so you kinda have to squeeze the vector ops into the existing 32 regs at the moment | 17:56 |
lkcl | programmerjake, i have made it clear multiple times that i have had to spend 18 months with a massive complex design currently held in my head and i am NOT going to redo that design | 18:04 |
lkcl | will you please STOP demanding that i waste my time throwing absolutely everything away and satisfy your requirement to do a 128 bit architecture | 18:05 |
lkcl | i am getting very fed up of it | 18:05 |
lkcl | it is making me very angry that you're not listening and i cannot take it any longer | 18:06 |
markos | lkcl, sorry, I started it, I don't want to cause you any extra effort, my question was rather towards the use case of fitting large 2D matrices for convolution | 18:07 |
lkcl | markos, it's ok. | 18:07 |
markos | whether you use 64-bit or 128-bit vectors is irrelevant | 18:07 |
markos | s/vectors/registers | 18:07 |
lkcl | the convolution can be done using coherent memory | 18:07 |
lkcl | because it's most useful when doing large 2D arrays that can never fit into regfiles anyway | 18:08 |
markos | modern video codecs do DCT on largish arrays 32x32 or 64x64 even | 18:08 |
markos | so imagine being able to do a FFT/DCT on a single block in a couple of instructions | 18:08 |
lkcl | this is why i want to include Extra-V in SVP64 | 18:09 |
markos | right | 18:09 |
lkcl | SV-REMAP only works - at present - when the data is entirely in registers, limiting the max size. | 18:10 |
lkcl | SV-REMAP-on-top-of-Extra-V entirely removes those limits. | 18:11 |
lkcl | a DCT of 2^32 would then be perfectly possible | 18:11 |
lkcl | take a damn long time but be possible | 18:11 |
programmerjake | lkcl, i'm not demanding we switch to 128-bit registers since i know there are other higher-priority things and 128-bits isn't strictly necessary, i'm just not going to say that i agree with you and think 64-bits is the best choice architecturally or at the isa level since my opinion hasn't changed -- 128-bits would be nicer. | 18:15 |
tplaten | I'm working on the tplaten_3d_game branch, getting an external_core_top.v:258195: ERROR: Re-definition of module `\plru_2'! | 18:37 |
tplaten | I need the module's entity but not its architecture behave | 18:39 |
tplaten | In the makefile we have fpga_files and synth_files, I will have a deeper look | 18:39 |
programmerjake | is that for libre-soc inserted into microwatt's peripherals replacing the core? if so, that would be a problem probably for lkcl... | 18:41 |
tplaten | yes that is the case, I want to run the maze game on the libre-soc core | 18:45 |
tplaten | and there are so many changes in microwatt causing merge conflicts if I try to merge microwatt with verilator trace, even in the makefile | 18:46 |
markos | lkcl, I'm trying to test a small change | 18:52 |
markos | setvl 16 | 18:52 |
markos | sv.fadds in.v, in.v, in.v | 18:52 |
markos | (I did a .set in. 3 above) | 18:52 |
markos | sorry sv.fadds/mrr | 18:52 |
markos | but I'm getting the following error from pysvp64asm audio/mp3/mp3_1_imdct36_float.s audio/mp3/mp3_1_imdct36_float.s.sv | 18:53 |
programmerjake | hmm, maybe it'd be easier to backport the maze game and usb serial stuff to the branch where libre-soc works? | 18:53 |
programmerjake | rather than trying to merge it | 18:54 |
markos | https://paste.debian.net/1235372/ | 18:56 |
markos | (do you use another paste service btw?) | 18:56 |
programmerjake | we mostly just use attachments in bugzilla and ftp.libre-soc.org | 18:57 |
programmerjake | feel free to create a bugzilla bug to track the maze game issues with libre-soc | 18:59 |
programmerjake | https://web.archive.org/web/20220323190212/https://paste.debian.net/1235372/ | 19:02 |
programmerjake | the assembly is using an old 1-arg form of setvl, the latest version has 5 args afaict | 19:16 |
programmerjake | 6 args...miscounted | 19:16 |
markos | right | 19:40 |
markos | ok, I'm trying to understand how this is going to work, the code I'm trying to convert is rather simple: | 19:55 |
markos | for (i = 17; i >= 1; i--) │ .long 3210589143 # float -0.866025388 | 19:55 |
markos | in[i] += in[i-1]; | 19:55 |
markos | crap | 19:55 |
markos | tmux doesn't play well with copy+paste with mouse | 19:56 |
markos | anyway | 19:56 |
markos | for (i = 17; i >= 1; i--) | 19:56 |
markos | in[i] += in[i-1]; | 19:56 |
markos | so, if I understand this right, and reading the comments in the file, I just need to setvl 16, and do a sv.fadds/mrr | 19:57 |
markos | however, don't I first have to load all the elements into vectors first? using an ld into in.v? (in is 5 here) | 19:57 |
markos | s/vectors/registers/ | 19:58 |
markos | so this should be the equivalent of setvl 16 (or the 6-args equivalent) | 19:58 |
markos | or rather wrong | 19:59 |
markos | in is just a pointer | 19:59 |
markos | so I would have to say register 10, the 16 elements, like so | 19:59 |
markos | ld 10.v, (5) | 20:00 |
markos | and then sv.fadds/mrr 11.v, 10.v, 10.v | 20:00 |
markos | as lkcl is writing in the comments | 20:00 |
markos | do I understand it correctly or have I gotten this wrong? | 20:00 |
lkcl | markos, that's pretty much it, yes. | 20:38 |
lkcl | https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=media/audio/mp3/mp3_0_apply_window_float_basicsv.s;h=3888852461c794eb9836f8320c28aa5080b72b4a;hb=d1b415e4383366cf445fd4ff2db828a612f88099#l29 | 20:39 |
lkcl | ah. | 20:39 |
lkcl | that limit about 128 registers is obviously lifted, there | 20:40 |
lkcl | 26 # SV floats | 20:40 |
lkcl | 27 .set fv0, 32 | 20:40 |
lkcl | 28 .set fv1, 40 | 20:40 |
lkcl | 29 .set fv2, 48 | 20:40 |
lkcl | r48 is not even remotely possible with standard Power ISA 3.0 | 20:40 |
lkcl | so the comment is clearly out-of-date | 20:40 |
lkcl | programmerjake, sorry, i just find it deeply frustrating because you have no idea of the timescales and implications of what you're advocating, the amount of disruption it would cause to abandon everything done and designed so far to do 128-bit | 20:46 |
lkcl | ouaff | 20:51 |
lkcl | meeting | 20:51 |
markos | one more question, what's the equivalent of 'setvl 16' in the 6-arg format? | 20:58 |
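
For reference, the C loop markos quotes at 19:56 walks the array from the top down, so every `in[i-1]` it reads still holds its original value. The sketch below is only an illustration in Python (the function names are hypothetical and nothing here models SVP64 itself): it shows why the element order matters for any vectorised version, because running the same statement upwards turns it into a running sum.

```python
def add_prev_descending(values):
    """Mirror of the quoted C loop: for (i = n-1; i >= 1; i--) in[i] += in[i-1]."""
    out = list(values)
    for i in range(len(out) - 1, 0, -1):
        # out[i-1] has not been modified yet, so this adds the ORIGINAL neighbour.
        out[i] += out[i - 1]
    return out


def add_prev_ascending(values):
    """Same statement, opposite order: each step sees the already-updated neighbour."""
    out = list(values)
    for i in range(1, len(out)):
        # out[i-1] was updated on the previous iteration, giving a prefix sum.
        out[i] += out[i - 1]
    return out


data = [1.0, 2.0, 3.0, 4.0]
descending = add_prev_descending(data)  # [1.0, 3.0, 5.0, 7.0]
ascending = add_prev_ascending(data)    # [1.0, 3.0, 6.0, 10.0]
```

Whatever instruction form ends up being used, it has to reproduce the descending behaviour to match the original C code.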
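
The loop-reordering idea lkcl describes between 12:02 and 12:07 can be sketched in plain Python. This is an assumption-laden illustration, not the Matrix REMAP schedule itself: the function names are hypothetical, and the second function shows just one of the six loop permutations, chosen because it keeps one row of the second matrix live and treats each result row as a set of partial accumulators.

```python
def matmul_dot_order(a, b):
    """'School' order (i, j, k): each result element is completed by one long sum."""
    n, m, p = len(a), len(b), len(b[0])
    result = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                # Every iteration adds into the SAME element: a serial (horizontal) sum.
                result[i][j] += a[i][k] * b[k][j]
    return result


def matmul_row_accumulate(a, b):
    """Reordered loops (k, i, j): result rows act as partial accumulators."""
    n, m, p = len(a), len(b), len(b[0])
    result = [[0] * p for _ in range(n)]
    for k in range(m):
        row_b = b[k]  # one row of b is fetched once and reused for every i
        for i in range(n):
            a_ik = a[i][k]
            for j in range(p):
                # Consecutive iterations update DIFFERENT accumulators,
                # so no multiply-add has to wait for the previous one.
                result[i][j] += a_ik * row_b[j]
    return result


a = [[1, 2, 3], [4, 5, 6]]        # 2x3, deliberately not square
b = [[7, 8], [9, 10], [11, 12]]   # 3x2
same = matmul_dot_order(a, b) == matmul_row_accumulate(a, b)  # True
```

Both orders compute the same product; the difference is only in which additions depend on each other, which is the point made in the chat about avoiding a horizontal sum.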