Sunday, 2022-10-09

jabhaha.  Well I've always wanted an excuse to rob a bank...just kidding.00:18
lkclmarkos, for when you're awake: first two elwidth overrides, w=8 and w=32, on an sv.add, work perfectly fine00:36
lkclbroke just about everything _else_, but hey00:36
jablkcl: are ya'll still doing the weekly virtual meet and greetings?00:37
lkcl2 years now00:38
lkcltuesday 22:00 UTC00:38
lkclyou'd be most welcome to join in.00:40
lkclplease don't publish the jitsi URL publicly because then i have to lock it with a password00:40
jabthat's fine.  I normally work Tuesdays, but thanks for the invite.  Hopefully I'll have it off again at some point.00:42
lkclghostmansd, i don't seriously expect you to be up at 3am either, but when you _are_ awake, elwidth-asm works great, two unit tests created in ISACaller that pass00:42
lkclnot a problem00:43
jablkcl: did ya'll buy a raptor desktop machine yet?00:48
lkclnot yet, i did get a 256 gb RAM space-heater though01:23
lkclarriving tuesday01:23
lkclthe laptop i'm using is now 2 years old and it's concerning me that i've no backup machine01:24
jab256 GB!  Wow!01:43
jabI don't know what I would do with that much RAM.01:44
*** jab <jab!> has quit IRC03:03
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc03:21
*** ghostmansd <ghostmansd!~ghostmans@> has quit IRC06:06
programmerjakewhat cpu does it have? imho if you're getting x86 it'd be a good idea to get the ryzen 7950x since it has the highest single-threaded performance available currently06:09
programmerjakelkcl ^06:11
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC07:48
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc07:49
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC08:35
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc08:36
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC08:55
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc08:56
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC09:03
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc09:04
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC09:08
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc09:11
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC09:55
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc09:55
markoslkcl, how do I set up vertical VL? I have an 8x8 matrix and I need to do horizontal as well as vertical sums of each row/column, iirc you said it's possible to do a vertical mode10:05
programmerjakevertical mode is not what you want here...vertical mode is where you have a loop with several instructions and it vectorizes the whole loop rather than each instruction individually...10:08
programmerjakeyou probably want matrix remap mode, or pack/unpack mode10:08
programmerjakethough pack/unpack may be limited to 4 rather than 810:09
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC10:11
markosah I see10:12
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc10:12
lkclprogrammerjake, it's what's the highest speed available from Dell, with full support, which is more important than absolute highest speed10:15
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC10:17
programmerjakeok, you're giving up a bunch of performance then...i'd expect that there are (or shortly will be) SIs who will build you a PC with a 7950x and provide a warranty and stuff...10:18
lkclyep. tough.10:18
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc10:18
lkclRED Semiconductor Ltd is not about to go risking money buying assets that are at risk from arbitrary individuals going bust, or wasting time on construction and assembly of machines10:19
lkclit cannot think "like a small team of individuals"10:21
programmerjakea SI is a whole company whose job it is to build and warranty computers for those who don't want to and are willing to pay extra for the privilege...10:22
lkclif it was *my* money - and i had time - i would consider it10:22
programmerjakethey generally don't disappear overnight10:22
lkclit's not an option.10:22
lkclare these SIs a billion-dollar company with a 3-decade reputation?10:22
lkclanswer: no.10:23
lkcltherefore they are a risk10:23
lkcltherefore - plain and simple - they are eliminated from consideration as a supplier10:23
markosDell Poweredge?10:23
programmerjakeno, but several of them have >15yr reputation and are worth 10s of millions...10:23
lkclprogrammerjake, then they're 100x smaller in terms of revenue.10:25
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC10:25
lkclthe decision's already made, based on risk assessment and scale/scope10:26
lkclmarkos, something like that.  a tower. absolute monster.10:26
markosI have a PowerEdge T440 (Tower version) which I recently converted to rack, pretty pleased10:26
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc10:26
markosadded 384GB of RAM and a ~100TB of disks10:26
markosI'm never going back to desktop systems, and the reason is BMC10:26
lkclprecision tower 582010:27
markosall my plain desktops are from server motherboards10:27
markosah the WXeons10:27
markosyes these are pretty powerful10:27
markoshow many cores?10:27
markosI opted for the server class Xeons, they are slower, but can scale to many more cores and the goal was to get a build farm10:28
lkcl14 i think10:28
markosI prefer 40 cores at 2Ghz than 14 at 3Ghz :)10:28
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC10:29
markosand the sockets are compatible (LGA3647)10:29
lkclyyeah i needed top speed, for VLSI/FPGA/Simulation10:29
markosI built two more such systems from Asus/Asrock motherboards10:29
markosthe server class cpus are not exactly slow either, and usually they have tons of L3 cache also10:30
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc10:30
markosbut yeah, it depends on your needs10:30
markosI'm running 20 VMs on each of those systems10:30
programmerjakeapparently corsair is a SI now, they've been in business for >25yr and are worth >$1B...10:30
lkcl4.8ghz max was the priority here. other ones were limited - 4.6 or less.10:30
markosjenkins, mail server, file server, even ML/DL models on a Nvidia Titan with gpu passthrough10:30
programmerjakenot that i'm recommending corsair specifically10:31
lkclDell is what resonates with everyone in business10:32
markosCorsair don't build boards, only peripherals10:32
lkclanything else is a risk10:32
markosHP is also good10:32
lkclused to be.  they screwed up about... 8-10 years ago. quality went massively downhill10:32
lkclmarkos, btw you saw i got the first elwidth overrides running?10:33
markosI've been using HP business laptops for the past 6 years and am very pleased with the quality10:33
markosyes, but I will not use it for the av1 task, will gladly convert to elwidth though when done10:33
markosfwiw, I think my next laptop will be an Apple M210:34
lkclit was surprisingly straightforward10:34
markosor M1 Max/Pro, whatever10:34
markosthe raw speed of that chip is amazing10:34
markosit's even faster with Linux installed10:34
markoswas thinking of a mac studio even, but a laptop is convenient10:35
programmerjakeluke, imho if you spent the 2-3hr needed to build it yourself, you can more than make back that time by the time saved in simulations later, the 7950x is really that much faster...10:41
programmerjakeif you got the intel i9 10980xe (pretty similar to the 14 core i9 10940 you probably got), it's *less than half* as fast as the 7950x in ngspice!!10:51
programmerjake57s vs 134s!!10:51
programmerjakeso imho building a pc yourself or just buying a premade one with the 7950x is more than worth the extra trouble, even ignoring cost10:52
markos7950 is definitely impressive10:53
markosI was never an Intel fan10:54
markosthe only reason I went with Xeons was lack of AVX512 on the AMD CPUs10:54
lkclthat's still thinking in terms of small personal projects10:54
programmerjakeand the 10980xe and those xeon w cpus are *particularly* unimpressive...10:54
programmerjakethink of the *time saved* at work!10:54
lkclthat's still thinking in terms of small personal projects10:55
markosprogrammerjake, W-class Xeons are not unimpressive, I can tell you they are really very powerful CPUs, I'd choose a W-class Xeon over any i7/i9 *any* day10:55
lkcli would have to - personally - as a supplier *to* RED Semiconductor Ltd - take out indemnity insurance10:55
lkclplus provide a support contract to RED Semiconductor Ltd10:56
programmerjakeeven if it breaks every 6mo and you have to spend a day fixing it (that's an absurd level) it would still save a bunch of time10:56
lkclneither of which - personally - i am prepared to do10:56
lkclyou are still not getting it10:56
lkcla business has to think in completely different terms10:56
lkcl"bestest fastest" is completely irrelevant10:56
lkcli cannot place *myself* at risk of being sued for failing to supply reliable service to RED Semiconductor Ltd10:57
markosI agree there, for my company I got a Dell myself, even if the Asus-built server Xeon mobo I did later on my own cost less than half and was even more powerful10:57
programmerjakeand, yes, the 10980xe *is* terrible, it was terrible the day it was released. amd's threadripper of the day has more cores and higher single threaded performance iirc10:57
lkcllikewise we got *Vantosh* Ltd - a Ltd company set up with full indemnity insurance - to do RED Semiconductor's email and web hosting10:58
programmerjakethe latest xeon w are basically the same thing10:58
lkclbecause the risk to an individual is too great10:58
lkclthere's almost no point - at all - in discussing how much better the AMD CPUs are, other than to note, in future, "are they available from Dell"10:59
markosyou can always get another system later with a Ryzen if Dell or HP provide one10:59
lkclindeed. exactly10:59
markosI doubt Dell will ever do that, they have a long contract with Intel10:59
markosIntel will *never* allow Dell to provide AMD systems10:59
markosHP otoh already do iirc10:59
lkclthere's supposed to be laws about that, but hey11:00
markosif AMD won over Dell, it's the beginning of the end for Intel11:00
markosthere's just no comparison between those wrt performance11:01
markoslkcl, so what's the best way to get sums in vertical mode with SVP64 on a 8x8 matrix? I've already done the horizontal sums just fine11:08
markosI thought vertical mode was for that reason11:08
lkclMatrix REMAP11:09
lkclvertical mode is still a linear mapping11:10
lkcl1 sec11:10
lkclthose are *both* still linear mappings.11:11
lkclVertical-First changes the **INSTRUCTION**-to-**REGISTER** ordering/relationship11:12
lkclREMAP changes the **REGISTER-ELEMENT** ordering/relationship11:13
lkclyou can still apply Vertical-First on top of REMAP11:13
lkcli have FFT/DCT examples that do that11:13
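A rough way to picture the distinction drawn above, as a Python sketch. The function names and the string "instructions" are illustrative assumptions only; nothing here is SVP64 syntax or simulator API. It simply records the order in which (instruction, element) pairs would be visited.

```python
def horizontal_first(instrs, vl):
    """Default ordering: each instruction loops over every element in turn."""
    trace = []
    for name in instrs:
        for el in range(vl):
            trace.append((name, el))
    return trace


def vertical_first(instrs, vl):
    """Vertical-First: the whole group of instructions runs on element 0,
    then on element 1, and so on (the loop is vectorised, not each instruction)."""
    trace = []
    for el in range(vl):
        for name in instrs:
            trace.append((name, el))
    return trace


def remapped(instrs, vl, remap):
    """REMAP: instruction order is untouched, but the element each step
    touches is passed through a re-ordering function first."""
    return [(name, remap(el)) for name in instrs for el in range(vl)]
```

In this model the first two differ only in which loop is outermost, while the third changes the element numbering; the two ideas are independent, which is why they can be combined.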
markosok, I'll take a look11:14
lkcli did do a unit test for you, showing how to use Matrix REMAP not-for-the-purposes-of-matrix-multiply11:15
lkcli just can't now remember where11:15
markosyes I remember I'll find it11:15
programmerjakewell, luke, considering how slow the cpu is, i'd recommend returning that xeon w computer and finding an amd threadripper pro system (or something with amd ryzen or intel 12th gen desktop cpus) from some vendor that has all the support contracts and stuff, maybe lenovo will do? they were the first with threadripper 5000 iirc.11:15
lkclfor fuck's sake jacob11:15
lkcldrop it11:16
markosunrelated, is there a way to have an "offset" variable in assembler, eg. iteration 1: process registers N+0, iteration 2: process registers N+offset11:16
lkclplease stop wasting time11:16
lkclthe Directors of RED Semiconductor Ltd have, as a group, made a decision that minimises risk for RED Semiconductor Ltd and minimises risk for the individuals associated with RED Semiconductor Ltd11:17
programmerjakeit's not wasted if you get *both* the support/etc. contracts you want so you don't get sued and twice the performance...11:17
markosit's not that important really, as soon as a new fast cpu is released, 3 months from now it will be outdated by something newer11:18
lkclthere is no point in you continuing to waste my time or yours in advising on a decision that was made based on a larger scope than you are used to dealing with or thinking in terms of11:18
markosfor a business longevity is much more important, heck my Power9 is 5 years old and  still running11:18
lkclyou are now wasting everybody's time attempting to discuss something for which a decision has already been made11:19
lkcland to be honest i really didn't want to even tell you that RED Semiconductor Ltd's Directors have voted and made the decision11:19
lkclprecisely because i knew that you would waste everyone's time here by telling everyone how much better AMD is11:20
lkclyou *have* to get the message that there are more factors involved and that the context is completely different11:20
markosprogrammerjake, speaking from a (bad) experience, as a business I will never ever buy again a random built-to-order PC from some random SI because you never know if they're going to be there a few months/years from now11:20
markosDell/HP/etc you at least know they will still be there and will be providing support and you will be able to get the parts you need11:21
lkclif we had USD 40 million we could take the risk of buying multiple such machines11:21
programmerjakelenovo too iirc...11:21
markosor even a replacement system if the contract includes such a clause11:21
lkcland if one failed we could even consider writing it off and moving to the next one out of the storeroom11:22
markosyup, nowadays I always buy in pairs11:22
lkclback to answering priority questions11:22
lkclmarkos, what are you looking to do? extract a single scalar from a vector at an arbitrary point?11:22
programmerjakewell, gn. it's nearly 3:30am here...11:23
markoswell, I don't have enough registers to load the whole 8x8 matrix and do the processing11:23
markosso I load the first 4x8 (32) elements11:23
programmerjakeuse strided load?11:23
markosdo the processing on the first half to the resulting matrixes -which take ALL of the remaining registers11:23
markosthen I want to load the next half 32 registers11:24
markoswhich I can do11:24
lkclbut you want to start "half-way" through of a sorts11:24
markosbut I want to do the exact same processing to the previous matrices using an offset to those matrices11:24
markosthere is a partial_sum_hv matrix, [2][8]11:25
markosfirst 32 registers occupy the left half [2][0-4] of this matrix11:25
markosthe result of the partial summation that is11:25
markosthe other 32 elements of the 8x8 matrix, would sum to the [2][4-8] half of the partial_sum_hv matrix11:26
markoshere are the instructions atm11:26
lkclyeah no - Matrix loops start from the dimension size11:26
markos        setvl           0,0,8,0,1,1                     # Set VL to 8 elements11:26
markos        sv.add/mr       psum_hv+0, psum_hv+0, *img11:26
markos        sv.add/mr       psum_hv+1, psum_hv+1, *img+811:26
markos        sv.add/mr       psum_hv+2, psum_hv+2, *img+1611:26
markos        sv.add/mr       psum_hv+3, psum_hv+3, *img+2411:26
lkcl(and are individually reversible)11:26
markosso for iteration 211:26
markosI will add 32 to img11:27
markosand would *love* to be able to have an offset added to psum_hv11:27
lkclthat's Matrix REMAP11:27
lkclyou've just described Matrix REMAP11:27
markos        sv.add/mr       psum_hv+offset+0, psum_hv+offset+0, *img11:27
markosI really need to practice this one11:27
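For reference, the arithmetic being asked for can be written out in plain Python. This is only a sketch of the intended result, assuming `psum_hv` holds one horizontal sum per row and that the 8x8 image is processed as two halves of 32 elements; the function name and parameters are hypothetical and it says nothing about how REMAP would schedule it.

```python
def row_sums_in_halves(img, rows_per_half=4, width=8):
    """Horizontal (per-row) sums of a row-major image, done in two passes."""
    psum = [0] * (2 * rows_per_half)
    for half in range(2):
        offset = half * rows_per_half        # the extra offset wanted on psum_hv
        base = half * rows_per_half * width  # img advances by 32 elements
        for r in range(rows_per_half):
            start = base + r * width
            # one "sv.add/mr psum_hv+offset+r, psum_hv+offset+r, *img+8r" per row
            psum[offset + r] += sum(img[start:start + width])
    return psum
```

The second pass is identical to the first apart from the two offsets, which is the property the question is trying to exploit.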
lkclit performs the 3 nested loops you expect of Matrix Multiply11:28
lkclwhere as you would expect11:28
lkcli j k11:28
lkcli increments the slowest11:28
lkclj the next slowest11:28
lkclk the fastest11:28
lkclrun the stand-alone program to see what's going on, and play with it11:29
markosok, will do, thanks again11:30
markosah this is standalone11:30
lkclah there's a better demo11:31
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC11:31
lkclwhich sets up all 3 SVSHAPEs11:31
lkcliterates through all 3 SVSHAPEs11:31
lkclzips them up11:31
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC11:32
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc11:32
lkclthen uses the resultant 3 offsets (A-matrix-offset, B-matrix-offset, C-matrix-offset) to perform a mul-add-accumulate "schedule"11:32
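The "three offsets zipped together" idea can be sketched in Python as below. This is a minimal model under stated assumptions: row-major matrices, an i/j/k nest with k innermost as described above, and hypothetical function names. The order and dimension conventions of the real hardware Schedules may differ.

```python
def matrix_schedule(rows, inner, cols):
    """Yield (A-offset, B-offset, C-offset) triples for C += A * B.

    i increments slowest, j next, k fastest.
    """
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                yield (i * inner + k, k * cols + j, i * cols + j)


def matmul_by_schedule(a, b, rows, inner, cols):
    """Run one flat mul-add-accumulate loop driven purely by the schedule."""
    c = [0] * (rows * cols)
    for a_off, b_off, c_off in matrix_schedule(rows, inner, cols):
        c[c_off] += a[a_off] * b[b_off]
    return c
```

Note that the loop body is a single multiply-accumulate; all of the matrix structure lives in the offsets.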
lkclthe important thing to remember about REMAP:11:33
lkclthe Schedules are *NOT* explicitly hard-coded onto actual registers11:33
lkclthere are **TWO** critical instructions11:33
lkcl1) svshape - to set up the offsetting11:34
lkcl2) svremap - to set the relation BETWEEN the Shapes and the registers to which those Shapes must be applied11:34
lkclwhy was it done this way?11:34
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc11:35
lkclbecause fmadds has completely and utterly different register ordering/naming from madds from ternlogi from any-other-instruction11:35
lkclyou *can* set up Matrix REMAP Schedules11:35
lkclthen apply those schedules to a sv.add11:35
lkclapply 2 out of 3 of those Schedules to the RT, RA and RB arguments of an sv.add11:36
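A toy Python model of "apply a Schedule to some operands of an add but not others". Everything here is an assumption made for illustration: the register file is a flat list, each Schedule is a plain function from element number to offset, and `sv_add_remapped` is a made-up name. In the real ISA the Schedules live in SPRs set by svshape and are attached to operands by svremap.

```python
def sv_add_remapped(regs, rt, ra, rb, vl,
                    remap_rt=None, remap_ra=None, remap_rb=None):
    """Element-wise add where each operand may have its own index remap."""
    def ident(e):
        return e

    f_rt = remap_rt or ident
    f_ra = remap_ra or ident
    f_rb = remap_rb or ident
    for e in range(vl):
        regs[rt + f_rt(e)] = regs[ra + f_ra(e)] + regs[rb + f_rb(e)]
    return regs
```

For example, pinning RT and RA to element 0 and giving RB a stride-8 walk accumulates one column of a row-major 8x8 block into a single register, without changing the add itself.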
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC11:41
markosyeah, I'll need to play a bit with this to figure out how it works11:42
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc11:42
markosdamn it, no available registers for the indices :D11:42
markosI might have to rethink this and just do one partial sum matrix at a time and just store it in memory11:43
markosyeah I don't think I can do everything in-register after all11:45
lkclthe indices are hard-coded (deterministically scheduled)12:07
lkclthere is no need - at all - to consider the concept "i must have N registers free for the purposes of use as element-offsets to perform N element operations"12:08
lkclfor that, you are thinking of *Indexed* REMAP12:08
lkclwhich *does* require N registers for the purposes of use as offsets to perform N element operations12:08
lkclMatrix, FFT and DCT REMAP are hardwired Schedules from the SVSHAPE0-3 SPRs12:09
markosyeah, it's too blurry in my mind right now, I need to practice this and see the remapping in execution to understand how it works12:09
lkcllook at the instructions12:10
lkclthere's only 3.12:10
lkcldo you see anywhere in there, "the for-loops are stored in registers to be used as offsets"?12:10
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC12:11
lkclthen why would you think that there are GPRs/FPRs used for the purposes of offsets for the indices? :)12:11
markosI was thinking of *Indexed* REMAP :)12:12
lkclthe indices are computed *in hardware* based on the information given in SVSHAPE0-3 and using SVSHAPE12:12
lkclyes, that's a "last resort" one12:13
markosI won't actually understand at depth until I have actually tested it in practice12:13
lkclprecisely because it does, in fact, need to read GPRs as Indices.12:13
lkclwhich is expensive12:13
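The cost difference can be illustrated with a short Python sketch. The names are hypothetical and the transposed walk is just one example of a deterministic schedule; the point is only that one style derives its indices from the dimensions, while the other has to read them from registers.

```python
def matrix_indices(xdim, ydim, transpose=False):
    """Hardwired-style schedule: indices derived from the shape alone."""
    if transpose:
        return [y * xdim + x for x in range(xdim) for y in range(ydim)]
    return [y * xdim + x for y in range(ydim) for x in range(xdim)]


def indexed_indices(regs, index_base, n):
    """Indexed-style schedule: n indices are read out of n registers."""
    return [regs[index_base + e] for e in range(n)]
```

The first function needs no register storage at all, whereas the second occupies (and must read) one register element per operation, which is the expense referred to above.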
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc12:14
lkclit's a different concept... but it isn't.12:15
lkclthe exact same "features" are provided in other SIMD ISAs...12:15
lkclthey just explicitly embed the "feature" into a (limited range of) instructions12:15
lkclVSX vperm is "Indexing embedded with MV"12:15
lkclwhereas Indexed-REMAP is "a completely separate Indexing Concept applicable *independently* to *any* register(s) of *any* operation"12:16
lkclas are all REMAPs.12:16
*** ghostmansd <ghostmansd!~ghostmans@> has joined #libre-soc13:24
ghostmansd[m]markos, I've added support for fmvis/fishmv for binutils.13:59
markosthank you14:00
ghostmansd[m]np :-)14:02
ghostmansd[m]Sorry it took that long14:02
*** ghostmansd <ghostmansd!~ghostmans@> has quit IRC14:03
ghostmansd[m]I had some family celebration yesterday, so I could only complete this today14:03
ghostmansd[m]lkcl, it'd be great if we could assign some budget to 945 :-)14:03
markoslkcl, I guess for something as complicated as the partial sums of the diagonals (ie, sum[y+x]) I would have to use an indexed remap right?14:21
lkclghostmansd[m], willdo - just not straight away. it'll likely be under the cavatools budget where i still have to plan the MoU and get it signed by NLnet14:48
lkclmarkos, probably :)14:49
ghostmansd[m]Sure, thanks! I'll raise some tasks on the assembly and disassembly, too. Will these be covered by cavatools too?14:50
lkcli think it can easily be justified as "this is needed to be tested under cavatools the simulator", yes14:50
lkclmarkos, i have been thinking about how to do diagonals, because they're needed for e.g cross-product14:51
lkclif you can write up an example of what you need, everything like that helps in the justification14:52
lkcl(i mean, write up as a bugreport)14:52
lkclan example is sufficient.14:54
markosI will, and I will write the method that I'm currently using to calculate these quantities15:33
markosI'm doing diagonal and reverse diagonal partial sums now15:33
markosanother one, how can I reverse the values in a vector? eg, if I have 0,1,2,3,4,5,6,7, can I reverse the values in the registers in a simple way?15:34
markosand end up with 7,6,5,4,3,2,1,0 in the same registers15:35
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC15:38
*** ghostmansd[m] <ghostmansd[m]!> has joined #libre-soc15:39
*** ghostmansd <ghostmansd!> has joined #libre-soc16:30
*** ghostmansd <ghostmansd!> has quit IRC16:38
*** ghostmansd[m] <ghostmansd[m]!> has quit IRC16:56
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc16:57
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC17:32
*** ghostmansd[m] <ghostmansd[m]!> has joined #libre-soc17:32
lkclmarkos, just use /mrr18:00
lkclit enables the "cheat-that-is-misnamed-mapreduce", but /mrr is "the-cheat-misnamed-mapreduce-but-also-in-reverse"18:00
lkclyou know that mapreduce is a misnaming/misnomer, it just switches off the safety-check on scalar-destination18:01
lkcl"if destination is a scalar then stop looping"18:01
lkclwell, "mapreduce" just switches off that safety-check, allowing you to keep using a scalar as both-source-and-destination18:02
lkclwhich of course gives you reduction and prefix-sum18:02
lkclturns out that reverse-gear is really useful for that18:02
lkclyou can still just as well have a vector destination on /mrr18:02
lkclso you get the reverse-effect and ignore the mapreduce-safety-check-thing entirely18:03
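The behaviour described above can be modelled in a few lines of Python. This is a deliberately simplified sketch, not the ISACaller implementation: registers are a flat list, each operand is flagged as vector or scalar by hand, and the function name is made up.

```python
def sv_add(regs, rt, ra, rb, vl, rt_vec, ra_vec, rb_vec,
           mapreduce=False, reverse=False):
    """Toy model of a vectorised add with the scalar-destination check."""
    order = range(vl - 1, -1, -1) if reverse else range(vl)
    for i in order:
        d = rt + (i if rt_vec else 0)
        a = ra + (i if ra_vec else 0)
        b = rb + (i if rb_vec else 0)
        regs[d] = regs[a] + regs[b]
        # Safety check: a scalar destination normally ends the loop.
        # "mapreduce" only switches this check off.
        if not rt_vec and not mapreduce:
            break
    return regs
```

With a scalar RT and RA and the check disabled, the loop keeps accumulating into one register (a reduction). With overlapping vector operands such as RT=*9, RA=*8, RB=*9 it produces a running prefix sum in this model.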
lkclA partial sum of an infinite series is the sum of a finite number of consecutive terms beginning with the first term.18:05
lkclok then you want the mapreduce mode for that anyway18:05
markosproblem is that this is the reverse diagonal, which I'm producing in reverse18:08
markoshere is the calculation of the normal diagonal partial sums:18:08
markos        # First row of diagonal partial sums:18:08
markos        # partial_sum_diag[0][y + x] += px;18:08
markos        sv.add/mr       *psum_diag+0, *psum_diag+0, *img+018:08
markos        sv.add/mr       *psum_diag+1, *psum_diag+1, *img+818:08
markos        sv.add/mr       *psum_diag+2, *psum_diag+2, *img+1618:08
markos        sv.add/mr       *psum_diag+3, *psum_diag+3, *img+2418:08
markos        sv.add/mr       *psum_diag+4, *psum_diag+4, *img+3218:08
markos        sv.add/mr       *psum_diag+5, *psum_diag+5, *img+4018:08
markos        sv.add/mr       *psum_diag+6, *psum_diag+6, *img+4818:08
markos        sv.add/mr       *psum_diag+7, *psum_diag+7, *img+5618:08
markosthis works18:08
markosthis is the first row of a 2x15 array, and it holds the normal diagonals18:09
markoser, wrong term, "normal"18:09
markosanyway the axis is the diagonal from top-left to bottom-right18:09
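As a cross-check, the eight instructions above can be compared against the C-style comment in plain Python. This is a sketch assuming a row-major 8x8 `img`, VL=8, and a vector destination so that each instruction is an element-wise add; the function names are illustrative.

```python
def diag_reference(img, n=8):
    """partial_sum_diag[0][y + x] += px, straight from the comment."""
    psum = [0] * (2 * n - 1)
    for y in range(n):
        for x in range(n):
            psum[y + x] += img[n * y + x]
    return psum


def diag_per_instruction(img, n=8):
    """One sv.add per row k: destination *psum_diag+k, source *img+8k."""
    psum = [0] * (2 * n - 1)
    for k in range(n):        # instruction number (row)
        for el in range(n):   # element within the VL=8 loop
            psum[k + el] += img[n * k + el]
    return psum
```

Sliding the destination base by one register per row is what turns eight element-wise adds into fifteen partial sums.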
markosthe other diagonal I produce thus:18:10
markos        # Second row of diagonal partial sums:18:10
markos        # partial_sum_diag[1][7 + y - x] += px;18:10
markos        sv.add/mr       *psum_diag+15, *psum_diag+15, *img+5618:10
markos        sv.add/mr       *psum_diag+16, *psum_diag+16, *img+4818:10
markos        sv.add/mr       *psum_diag+17, *psum_diag+17, *img+4018:10
markos        sv.add/mr       *psum_diag+18, *psum_diag+18, *img+3218:10
markos        sv.add/mr       *psum_diag+19, *psum_diag+19, *img+2418:10
markos        sv.add/mr       *psum_diag+20, *psum_diag+20, *img+1618:10
markos        sv.add/mr       *psum_diag+21, *psum_diag+21, *img+818:10
markos        sv.add/mr       *psum_diag+22, *psum_diag+22, *img+018:10
markosnow this produces the correct results18:10
markosusing /mrr doesn't work, because with every instruction I move to the next element to the right18:11
markosbut the results are in the reverse order18:11
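The reversal can be reproduced in Python. This is a sketch under the same assumptions as before (row-major 8x8 `img`, VL=8, element-wise adds), with indices taken relative to `psum_diag+15`; the function names are hypothetical.

```python
def rdiag_reference(img, n=8):
    """partial_sum_diag[1][7 + y - x] += px, straight from the comment."""
    psum = [0] * (2 * n - 1)
    for y in range(n):
        for x in range(n):
            psum[(n - 1) + y - x] += img[n * y + x]
    return psum


def rdiag_per_instruction(img, n=8):
    """Instruction k uses destination base +k and source row 7-k."""
    psum = [0] * (2 * n - 1)
    for k in range(n):
        y = (n - 1) - k
        for x in range(n):
            psum[k + x] += img[n * y + x]
    return psum
```

Instruction k writes index k + x = 7 - y + x, whereas the reference wants 7 + y - x. The two indices always add up to 14, so in this model the instruction sequence yields the reference sums mirrored end to end.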
lkclthat's a cumulative-prefix-sum18:19
lkcl(like fibonacci series)18:19
lkclis a *vector* cumulative prefix sum what you actually wanted?18:20
lkclbecause if not you can just use a scalar for RT and RA18:20
lkcli'm assuming you do18:22
lkclusing /mrr should just do "for x in VL-1 downto 0"18:23
lkclbut... you're.. yeah, you're wanting to do a reversal on RB but *not* on RT or RA.18:23
lkclfor that, you'll need to use REMAP18:23
lkcluse svshape2 with a reverse-gear18:24
lkclthen apply it to RB18:24
lkcluse the remap mode "rmm" to get it to apply *only* to RB18:26
lkcldamn. no. there's no option for reverse-gear in svshape2.18:27
lkclthere is however in svindex18:27
lkclfrick, no there isn't.18:28
lkcloo that's annoying18:28
markosok, another issue I found18:29
markosshould normal Power ISA instructions be allowed to use the extra registers?18:29
markosError: operand out of range (78 is not between 0 and 31)18:29
lkclno not at all.18:30
markosso it's expected18:30
lkclPower ISA v3.0 is Power ISA v3.0.18:30
lkclwe are NOT repeat NOT in ANY WAY authorised or permitted to modify Power ISA v3.0.18:30
lkclthat is absolutely out of the question18:30
markossure no problem was just curious, cool, I'll just change the code18:30
lkclthere will be SVP64Single in the future however18:30
lkcland there's a (thoroughly comprehensive) review/audit of whether all registers being scalar should allow VL=1 temporarily just for that one instruction18:31
markoswas a trivial fix, no worries18:32
lkclok :)18:32
markosif all goes well, I will be done with this tonight18:32
markosactually, I may not have to reverse the order of the elements18:38
markosthey're going in a sum anyway18:38
markosvalue does not change :)18:39
markosplease remind me, Error: vector register cannot fit into EXTRA2 means that the offsets used in an instruction go beyond the allowed range?19:07
markossv.maddld/mr    *tmp, *psum+7, *psum+7, *tmp19:07
markosis the instruction that fails19:07
markostmp is 22, psum is 94, and VL=719:07
markosI have finished 70% of the algorithm19:15
markosone last for loop19:15
markosto convert19:15
lkclmaddld is a 4-operand19:49
lkcltherefore to fit extension of 4 operands into 8 bits, there are only 2 bits each per register19:50
lkcl1 is "is this register vector or scalar"19:50
lkclthat just leaves 1 spare bit19:50
lkcl(RT/RA/RB are 5 bit)19:50
lkclbut we have numbers from 0-127 which needs 7 bits19:50
lkclthe LSB has to be zero19:51
lkclyou can only have sv.maddld 0,2,4,819:51
lkclsv.maddld 10,12,14,1619:51
lkclsv.maddld 0,1,3,519:51
lkclsv.maddld *0,*2,*4,*819:52
lkclsv.maddld *10,*12,*14,*1619:52
lkclsv.maddld *1,*3,*5,*719:52
lkclscalar on the other hand on EXTRA2 you are *still* restricted to 6 bit19:53
lkclbut the choice is to have them access r0-r63 in increments of 119:53
lkclrather than have them access r0-126 in increments of 219:53
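The rule as described here can be written as a small Python predicate. It encodes only what is stated in this conversation (vectors: even numbers up to 126, scalars: r0-r63) and is not taken from the assembler source; the function name is made up.

```python
def extra2_encodable(reg, is_vector):
    """Can register number `reg` be expressed with a 2-bit EXTRA2 field?

    One bit marks vector/scalar, leaving one bit to extend the 5-bit field.
    Vectors spend it on reach (steps of 2), scalars on precision (steps of 1).
    """
    if is_vector:
        return 0 <= reg <= 126 and reg % 2 == 0
    return 0 <= reg <= 63
```

With tmp=22 and psum=94, `*psum+7` is register 101, which is odd and therefore rejected, while `*tmp` (22) and `*psum+6` (100) would be accepted by this check.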
programmerjakeif you want odd register numbers, use offset (svoffset iirc), idk if it's implemented yet though19:53
lkclyes it is19:54
lkcla month ago19:54
lkclso that can be used to say "please add an extra 1 onto registers RB and RC in the sv.maddld *tmp, *psum+6, *psum+7, *tmp" instruction19:56
lkcl(or whatever)19:56
programmerjakeit works on elements rather than whole registers, so it can be used to express "add the third byte of r7 with the 4th byte of r4 times the 1st byte of r127 and store in the 8th byte of r3"19:58
lkclyes. really particularly useful for when elwidth overrides are used.20:05
markosso it's the +7 as an offset, because LSB != 0 that's the problem then, if it was an +8 it would work20:22
markosok, I'll rework it a bit20:23
markosso do I understand it correctly like this: svshape2        1, 0, 1, 7, 0, 020:40
markosI'm not sure about the modulo in this case20:40
markosI think it should be 820:40
markosand how do I unset svshape2 after the instruction is executed?20:41
programmerjakeiirc there's a flag you can set on svshape2 that makes it automatically only apply to the next svp64 instruction21:00
markosright, I think we should add *lots* of examples in the documentation as part of the next grants21:02
markosfrom "only RA is re-mapped via svshape2, not RB or RT, but an offset of 1 is included on RA."21:06
markosso sv.maddld/mr    *tmp, *psum+6, *psum+7, *tmp gives me the same error :-(21:06
programmerjakeusing svshape doesn't change how the assembler works on the following svp64 instruction, that still needs even reg numbers. just that when it's run the reg numbers are adjusted by the offset you set previously21:08
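Put as arithmetic, an odd vector register has to be written as an even base in the instruction plus an offset applied at run time. A minimal Python sketch, assuming default element width so that one element corresponds to one register; `split_for_extra2` is a hypothetical helper, not an assembler feature.

```python
def split_for_extra2(reg):
    """Split a register number into (even base to encode, run-time offset)."""
    base = reg - (reg % 2)
    return base, reg - base
```

For psum=94, the operand `*psum+7` (register 101) would become base 100 with an offset of 1, and an already-even register keeps an offset of 0.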
markosthis is highly confusing21:11
markosso is it possible to add offsets to both RA and RB?21:13
programmerjakeyes afaict, by setting the offset to apply to both RA, RB but not RT. that'd be done by setting rmm to which operands you want to offset and mm=0 in svshape221:17
lkclsvshape2 if you use it in "non-persistent" mode it only applies to the next instruction21:18
lkclwe've got questions on ls002 btw
lkclmaskmode (mm) and remap mode (rmm) are the same as for svindex
lkclrmm is 5 bit21:19
lkclfor "non-persistent" mode you want mm=0 as jacob said21:20
lkclthen if you want RA and RB but not RT then that is21:20
lkclRC=0b00100 (you don't want this)21:20
lkclRT=0b01000 (you don't want this)21:20
lkclRS=0b10000 (you don't want this)21:21
lkclso you want rmm=0b0001121:21
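Building the mask can be sketched in Python. The RC and RT bits are as listed above and RS is taken as the fifth bit (0b10000), since rmm is 5 bits. The split of the remaining two bits into RA=0b00001 and RB=0b00010 is an assumption here: the conversation only confirms that the pair together is 0b00011.

```python
# Assumed bit assignment; only RC, RT and the RA|RB total come from the log.
RMM_BITS = {
    "RA": 0b00001,
    "RB": 0b00010,
    "RC": 0b00100,
    "RT": 0b01000,
    "RS": 0b10000,
}


def rmm_mask(*operands):
    """OR together the rmm bits for the operands that should be remapped."""
    mask = 0
    for name in operands:
        mask |= RMM_BITS[name]
    return mask
```

`rmm_mask("RA", "RB")` gives 0b00011, matching the value quoted above.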
lkclit doesn't change (doesn't set) VL or MAXVL21:23
lkclit does however *use* MAXVL if you use it in 2D mode21:23
lkcl(because it uses MAXVL to *calculate* the size of the 2nd dimension, which saves bits in the 32-bit opcode)21:24
lkclremember that you mustn't subtract 1 from the dimension size.21:26
lkclif you want a dimension size of 8 you must *give* an argument SVd=821:26
lkcl(SVd=0 is an illegal instruction)21:26
programmerjakefor question #2 of ls002 we should compare with stfs, not lfd. you can have non-f32 values in the f64 FRS and stfs will use SINGLE to determine which f32 value to store without setting any exception flags. likewise fishmv uses SINGLE to determine which f32 value is inputted, without setting any exception flags. lfd[s] *can't* load an unrepresentable value -- all possible f32/f64 bit patterns can be represented exactly as a f6421:29
programmerjakeother notes -- there *is no* fld instruction, replace with lfd21:32
programmerjakei'll just edit it myself, there's lots of answers that need changing...21:33
programmerjakeI also fixed fmvis to be Shifted, not Single22:17
markosprogrammerjake, I'd be against numbering in the name of the instructions, I'm constantly looking in the ISA manuals when numbers are used, they are too vague, I much more prefer the scheme used by Arm, high/low and use of w/d/q to denote size22:18
markosbut definitely not fli1/2/3/422:19
programmerjakeyeah, you mentioned that before, want me to add it to the RFC answers?22:19
markosand in any case, suggesting a name in the discussion is wrong (imho), let *them* suggest a name if they don't like fmvis/fishmv22:19
markoss/suggesting a name/suggesting yet another name/22:20
programmerjakethey suggested flis, I'm pointing out I suggested that before, along with fli2-4 (no fli1)22:21
markosnot sure it would do anything other than create even more confusion22:21
programmerjakeimho 2-4 totally makes sense, because they are 2nd 16-bit val in f32, 3rd 16-bit val in f64 and 4th 16-bit val in f6422:22
programmerjakecounting from MSB as PowerISA likes to22:23
markosexactly my point22:23
markosyou have to remember the numbering (from MSB)22:23
markosI still prefer low/high, it doesn't allow any confusion22:24
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC22:24
programmerjakebut basically everything in the PowerISA is numbered from MSB...imho it isn't much of a stretch to name one more thing based on MSB-LSB ordering22:24
programmerjakemy issue is hi/lo doesn't really extend to 4 things, it's better to use numbers at that point...22:26
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc22:27
markoseven they suggested flisl (low)22:28
markosand if fli3/4 are unlikely why create the confusion with the initial fli/fli2?22:29
markosI mean in the beginning there will be just those 222:30
markoswhich is what?22:30
markosIF we had 4, perhaps you would be right, but we don't and imho, it's really suboptimal to require 4 instructions to load a 64-bit fp22:31
markosfmvis/flis/fishmv/etc only make sense if you want to create a quick constant to be reused in a loop, with 4 instructions the benefit doesn't seem so obvious anymore22:32
markosif we have to rename, I *still* prefer a clear declaration of high/low for at least one of the instructions22:33
lkcli already went over the cost, in-depth, a few days ago.23:01
programmerjakepaddi loads 34-bits of immediate, optionally PC-relative, so it can't be emulated by addi/oris or addi/addis23:22
lkclthen it's not an appropriate analogy. made a note to that effect23:25
programmerjakeone option for pflis is to use the extra 3 bits of immediate over flis/fishmv (2 bits + the pc-relative flag which is useless for fp) to specify more exponent bits, allowing larger range than f3223:26
programmerjakemaking pflis justifiable23:26
programmerjakealso, the discussion page should probably be linked from the rfc23:27
programmerjakementioned questions on mailing list23:29
programmerjakeproposed pflis pseudocode: v <- DOUBLE(imm[3:34]); v[2:4] <- imm[0:2]; FRT <- v23:33
programmerjakewhere imm is suitably constructed from all the immediate bits23:34
programmerjakelkcl: ^23:34
programmerjakethat covers (except for denormal f32 oddities) the full f64 exponent range23:36
programmerjakesince f64's exponent is 11 bits and f32's exponent is 8 bits23:36
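To see why three extra bits are enough, the proposed pseudocode can be tried out in Python. This is only a sketch of the pseudocode as written: `imm` is treated as a 35-bit value with MSB0 bit numbering (bits 0..34, as the ranges imply), DOUBLE is assumed to be the ordinary f32-to-f64 conversion, and `pflis_sketch` is a made-up name.

```python
import struct


def pflis_sketch(imm):
    """v <- DOUBLE(imm[3:34]); v[2:4] <- imm[0:2]; return v as a float."""
    top3 = (imm >> 32) & 0b111        # imm[0:2], the three most significant bits
    f32_bits = imm & 0xFFFFFFFF       # imm[3:34], an f32 bit pattern
    as_f32 = struct.unpack(">f", struct.pack(">I", f32_bits))[0]
    v = struct.unpack(">Q", struct.pack(">d", as_f32))[0]
    # MSB0 bits 2..4 of a 64-bit value are LSB0 bits 61..59: the exponent bits
    # that a normal f32 conversion always fills with copies of one value.
    v = (v & ~(0b111 << 59)) | (top3 << 59)
    return struct.unpack(">d", struct.pack(">Q", v))[0]
```

For a normal f32, conversion adds 896 to the 8-bit exponent, so those three f64 exponent bits carry no independent information; overwriting them is what opens up the wider exponent range. For example, the f32 pattern for 1.0 with top bits 0b111 stays 1.0, while top bits 0b110 give 2**-128.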
*** josuah <josuah!~irc@> has quit IRC23:37
programmerjakefishmv would still be needed because cpus may not want to implement 64-bit instructions or for svp64-prefixing where 64-bit suffixes aren't allowed23:38
*** josuah <josuah!~irc@> has joined #libre-soc23:38
lkclplease leave all discussion of v3.1 off of this proposal23:44
lkcli absolutely do not want our time wasted discussing things or designing things that are of no immediate benefit23:45
lkclplease consider all v3.1 prefixed instructions and all discussion of any v3.1 prefixed instructions absolutely 100% out of scope23:45
lkclif IBM wants to design v3.1 prefixed instructions they are entirely at liberty to do so.23:46
lkcli will begin eradicating all mention of v3.1 prefixed instructions from the RFC.23:46
lkclwe do not have time to waste, here.23:49
lkclthere are a **HUNDRED** instructions to get through.23:49
lkclplease do not propose any 64-bit instructions23:50
lkclplease do not discuss them further23:50
lkclplease do not add any 64-bit instructions to the discussion23:51
lkclplease do not put any 64-bit instructions in the RFC23:51
