Wednesday, 2022-10-12

programmerjakehope you think that's better reasoning01:16
programmerjakesome verbiage about being able to get all possible bitpatterns produced by lfs (not lfd) could be added.01:18
*** jab <jab!> has quit IRC02:51
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC06:43
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc06:43
*** josuah <josuah!~irc@> has quit IRC07:42
*** josuah <josuah!~irc@> has joined #libre-soc07:43
lkclprogrammerjake, mmm... that's going to be complicated.  a full and comprehensive justification is needed as to why.09:11
lkclor... hang on, that *is* the justification?09:11
lkclas in, there's no change from use of DOUBLE() and that's enough to express all possible f32 values?09:12
programmerjakedouble is how powerisa expresses all f32 values in f64 registers09:13
programmerjakeall possible f32 values, including all quiet/signaling NaNs and all denormals09:14
lkcland all f32 values are still representable?09:24
lkcl(if so that's great, because there will not be any objection from the OPF ISA WG)09:26
programmerjakeassuming flis/fishmv's pseudocode hasn't changed from when i last checked, yes, it covers all possible f32 bitpatterns09:46
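The point made above (every f32 bit pattern, including quiet/signaling NaNs and denormals, has a distinct f64 register form) can be checked with a small integer-only model. This is a sketch that follows the idea behind DOUBLE(), not the Power ISA pseudocode verbatim; the function names `f32_to_f64_bits` and `f64_to_f32_bits` are made up for illustration.

```python
def f32_to_f64_bits(word):
    """Widen a 32-bit single-precision pattern to a 64-bit double pattern."""
    sign = (word >> 31) & 1
    exp = (word >> 23) & 0xFF
    frac = word & 0x7FFFFF
    if exp == 0xFF:                  # infinity or NaN: payload kept as-is
        exp64 = 0x7FF
    elif exp != 0:                   # normal number: rebias the exponent
        exp64 = exp - 127 + 1023
    elif frac == 0:                  # signed zero
        exp64 = 0
    else:                            # f32 denormal: becomes a normal f64
        e = -126
        while not frac & 0x800000:   # shift until the implicit bit appears
            frac <<= 1
            e -= 1
        frac &= 0x7FFFFF             # drop the now-implicit leading 1
        exp64 = e + 1023
    return (sign << 63) | (exp64 << 52) | (frac << 29)


def f64_to_f32_bits(dword):
    """Inverse mapping, valid only for patterns produced by the function above."""
    sign = (dword >> 63) & 1
    exp64 = (dword >> 52) & 0x7FF
    frac = (dword >> 29) & 0x7FFFFF
    if exp64 == 0x7FF:
        exp = 0xFF
    elif exp64 == 0:                 # only signed zero reaches this branch
        exp = 0
    else:
        e = exp64 - 1023
        if e >= -126:
            exp = e + 127
        else:                        # was an f32 denormal: undo normalisation
            frac = (frac | 0x800000) >> (-126 - e)
            exp = 0
    return (sign << 31) | (exp << 23) | frac


if __name__ == "__main__":
    for pattern in (0x3F800000, 0x00000001, 0x7FA00000, 0xFFC00001):
        print(hex(pattern), "->", hex(f32_to_f64_bits(pattern)))
```

Because the widening is reversible, no two f32 patterns collide in the f64 form, which is the property the discussion relies on.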
markosin the discussion, could you please pick one name? eg a question refers to  fmvis, and the answer replies on flis, refering to the same command, either pick one or mention both. (question Other.3)09:51
markosfwiw, I'm fine with flis, but just stick to one, keeping both and referring half the time to fmvis and the other half to flis only leads to confusion09:53
markosotoh, fmvis fits better with fishmv (name-wise) :)09:54
lkclthe OPF ISA WG members have picked some more-conformant (precedent-based) names09:57
markoswhich ones?09:58
lkclwe go with those (for obvious reasons)09:58
lkclin the discussion page.09:58
markosflis/flisl yes09:59
markosthat's what I'm saying09:59
lkclok we're onto phase 3 with the 2 grants. can't be "announced" yet, has to go through an independent audit10:00
lkcl1 short paragraph is needed to describe each project10:00
lkclcongrats at having a lot more work to do? :)10:00
markosyes :)10:01
lkclnggh :)10:01
markosyou knew this from the start, didn't you? you wouldn't decide to design a vector architecture if you wanted to be lazy, you would do ebanking with Java :D10:04
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC10:17
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc10:18
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC10:32
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc10:37
lkclgiven a choice between that and working at McDonald's, i'd choose the burgers10:48
markoshaving actually worked in ebanking with Java for a couple of years, I agree 100%, worst environment ever10:50
markosthe definition of boooring10:50
markosfunny thing is that because I did Java a million years ago, recruiters still contact me for a Java job every now and then10:51
markoswouldn't touch it again unless it was for a ridiculous amount of money and even then I think I'd quit instantly10:52
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC11:32
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc11:32
lkcli'd make better friends at McDonald's.11:33
*** midnight <midnight!~midnight@user/midnight> has quit IRC11:33
*** midnight <midnight!~midnight@user/midnight> has joined #libre-soc11:36
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC14:29
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc14:31
*** Veera <Veera!~veera@> has joined #libre-soc14:39
Veeralkcl: I have been paid for Bug #577 and updated bugzilla page for paid status14:40
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC14:46
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc14:56
*** Veera <Veera!~veera@> has quit IRC14:59
cesarWork on the formal verification of MultiCompUnit is now in git. Will update Bug #879 with the issues that it found, and then proceed to submit the RfP.15:03
lkclcesar, fantastic15:28
*** octavius <octavius!> has joined #libre-soc15:49
lkcli found the x86 optimised assembler strlen/strncpy and it's so depressingly large i can't be bothered to post it for comparison15:55
lkcleven the IBM POWER8 strncpy is awful15:56
lkclcode that checks for address-alignment15:57
lkclcode that checks for a 4k page-boundary crossing15:57
lkclstripmining for up to the first 15 bytes15:58
lkclincluding comments it's 479 lines15:59
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has quit IRC17:39
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@> has joined #libre-soc17:39
*** choozy <choozy!> has joined #libre-soc18:12
markoslkcl, aside from predicates, how can I use svstep to do a sv.add instruction, but every 2 elements?18:38
markosactually because it's a lot of elements (VL=64), they don't fit in one predicate mask18:38
lkclmmm.... i wondered about this one as well18:52
lkclin theory what you could do is use matrix or Index REMAP to set up a 2D arrangement, where one of the dimensions is 218:55
lkclthen override VL to *half* the total18:55
lkclwith a 2nd setvl18:55
lkclbut honestly, if you only have 16 elements you can just set r3/r10/r31 equal to the required predicate directly with "li"18:56
*** jab <jab!> has joined #libre-soc19:08
markosI guess the easiest is to do 4 x 1619:19
*** octavius <octavius!> has quit IRC20:21
lkcl64-bit will fit into one integer predicate.20:31
lkclthis was one of the situations that grevluti was designed for: to be able to hit a regular pattern into a GPR in one single 32-bit instruction20:32
*** choozy <choozy!> has quit IRC20:50
programmerjakeif you want to quickly load a repeating 64-bit pattern, you can use sv.addi/subvl=4/elwid=16 rt, 0, 0x5555 -- note no star on rt21:49
markosdo both subvl and elwid have to be specified? doesn't one imply the other?22:15
markosbut that's a cool trick, thanks22:16
lkcli've never tried it - but it should work perfectly22:24
lkclno.  subvl is a "small inner repeating loop", with the option to be 2,3 or 4 (actually, 1 as well as the degenerate case)22:24
lkclelwidths have nothing to do with subvl, they apply independently22:25
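To see why the subvl=4/elwid=16 trick yields a repeating 64-bit pattern, here is a minimal model of the packing only: four sub-elements, each 16 bits wide, written side by side into one 64-bit register. It is an assumption-laden sketch, not output from the simulator, and `pack_subvector` is a hypothetical helper name.

```python
def pack_subvector(value, elwid, subvl):
    """Pack `subvl` copies of `value`, each `elwid` bits wide, into one integer."""
    mask = (1 << elwid) - 1
    reg = 0
    for i in range(subvl):
        # sub-element i occupies bits [i*elwid, (i+1)*elwid)
        reg |= (value & mask) << (i * elwid)
    return reg


if __name__ == "__main__":
    print(hex(pack_subvector(0x5555, elwid=16, subvl=4)))
```

The two parameters are independent in the model too: elwid sets the width of each slot, subvl sets how many slots are written.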
lkcli got a sv.divmod2du test working!22:25
cesarlkcl: Just sent the RfP for #87922:26
lkclcesar, saw it - approved. awesome.  2 days remaining (!) so in theory it should be fine22:28
lkclfantastic to find the bugs22:28
lkclthat's exactly the point of doing these proofs22:28
lkcldefinitely worthwhile to do an OPF talk or a FOSDEM talk about that22:29
lkclokaaay that rounds off (completes) 2019-10-032
lkclmarkos, last one!
lkcldo what you can, ok?22:32
markoslkcl, I'm *that* close, last stage22:34
markosok doing now the partial_sum_alts (the slanted diagonals y + (x >> 1) trick), I've done pair-wise addition of the 2 elements so the first carries the sum of the two, and need to copy only those into a new location, so instead of 8x8 I will have 8x4 elements,  -yes I know REMAP :)22:55
markosbecause that way the operation is simplified to pretty much the same as the previous steps22:56
markosthe question is how to do that :)22:56
markosI have the sums, checked they are correct22:57
lkcljust use predicate-masking on src-only23:00
lkclsv.addi/sm=r3 *dest,*src,023:01
lkclwhere r3=0b01010101010101010101010....23:01
lkcland it will do *independent*-running-along of the predicate from the source23:01
markosso dest index will not increase?23:01
lkclit will23:01
lkclunconditionally by 1 for every *1* bit in the source predicate mask23:01
markosthat's what I want23:01
lkcl1 sec23:01
markosI want it to *not* increase when bit is zero23:02
markosthis is what I do now23:02
markossetvl                   0,0,16,0,1,1                    # Set VL to 16 elements23:02
markos        ori                     pred, 0, 0b010101010101010123:02
markos        sv.add/sm=r3            *img, *img, *img+123:02
markosunfortunately subvl are not supported on binutils yet so no VL=64 :(23:03
lkclwould sm=~r3 do the trick?23:03
lkclthat simply inverts the bits of r3 (in-place)23:03
markoscan try23:03
lkclah this is an inverted-one.  sm=~r323:04
lkclactually that's a twin-pred23:04
lkclbut you get the general idea23:04
lkcldo you want "compress" effect, or "expand" effect?23:05
markosI think I want the first one, compress sm=r323:05
markosso, what I'm doing basically?23:05
lkclr0,r1,r2... -> r4,r6,r8....23:05
lkclwith that r3 value you should get every *even* reg copied to a contiguous block of regs23:06
markosthat's exactly what I want23:06
lkclr0,r2,r4... -> r0,r1,r2...23:06
lkclahhh but you're doing an add at the same time23:06
lkclso you will get:23:07
lkclr0 = r0+r123:07
lkclr1 = r2+r323:07
lkclr2 = r4+r523:07
lkclr3 = r6+r723:07
lkclwith this:23:07
lkclsv.add/sm=r3            *img, *img, *img+123:07
lkclif you *only* want *copy*, you want this:23:08
lkcl    sv.addi/sm=r3 *img,*img,023:08
lkclwhich will do:23:08
lkclr0 = r0+023:08
lkclr1 = r2+023:08
markosyes, add and copy is what I want23:08
lkclr2 = r4+023:08
markosamazing that this can happen with just one instruction...23:09
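The "compress" behaviour described above can be modelled in a few lines: the source index walks every element, and the destination index advances only when the source predicate bit is 1. This is a simplified sketch of source-only predication, not the actual simulator, and `compress` and its parameters are illustrative names.

```python
def compress(regs, dest, src, mask, vl, op):
    """Simplified model of source-predicated (sm=) element stepping."""
    out = list(regs)
    dststep = 0
    for srcstep in range(vl):
        if not (mask >> srcstep) & 1:
            continue                 # masked-out source: dest does not advance
        out[dest + dststep] = op(out, src + srcstep)
        dststep += 1
    return out


if __name__ == "__main__":
    regs = list(range(10, 26))       # stand-in register contents
    mask = 0b01010101                # selects source elements 0, 2, 4, 6

    # copy only (the sv.addi ... ,0 case): every even element, packed together
    print(compress(regs, 0, 0, mask, 8, lambda r, i: r[i])[:4])

    # add and copy: each selected element plus its right-hand neighbour
    print(compress(regs, 0, 0, mask, 8, lambda r, i: r[i] + r[i + 1])[:4])
```

Running in place is safe in this model because the destination index never overtakes the source index, so each pair is read before anything is written over it.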
lkclthe horizontal map-reduce is supposed to be for this23:09
lkcl(without needing predicate masks)23:09
lkclwhich... i thiiiink.... might be working?23:09
lkclalthough... can't remember.... does it need REMAP?23:09
markosit probably is, but how can I copy with skipping?23:10
lkclso much frickin going on i can't even remember23:10
markosactually it is working, but I need to copy only the sums23:10
markosnot the next element23:10
jabit sounds like you guys are on the verge of proving P=NP.  :)23:11
lkcljab, lol23:11
lkclhigh-performance strncpy (including the zero-copying) in 10 instructions.23:12
lkclmarkos, what do you need (in c)?23:12
jabseems pretty cool.  I'm not completely following.  but it seems awesome! haha23:14
markosI'll just paste this as it's easier:23:15
markos# horiz axis: x, vert axis: y, quantity of y + (x>>1):23:15
markos        #23:15
markos        # |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |23:15
markos        # | 0 | 0 | 0 | 1 | 1 | 2 | 2 | 3 | 3 |23:15
markos        # | 1 | 1 | 1 | 2 | 2 | 3 | 3 | 4 | 4 |23:15
lkclthe compelling part is how depressingly long other ISAs are to do the same job23:15
markos        # | 2 | 2 | 2 | 3 | 3 | 4 | 4 | 5 | 5 |23:15
markos        # | 3 | 3 | 3 | 4 | 4 | 5 | 5 | 6 | 6 |23:16
markos        # | 4 | 4 | 4 | 5 | 5 | 6 | 6 | 7 | 7 |23:16
markos        # | 5 | 5 | 5 | 6 | 6 | 7 | 7 | 8 | 8 |23:16
markos        # | 6 | 6 | 6 | 7 | 7 | 8 | 8 | 9 | 9 |23:16
markos        # | 7 | 7 | 7 | 8 | 8 | 9 | 9 | a | a |23:16
lkcljab, POWER8 is 470 instructions for example23:16
markoswhat I've done is reduced the 8x8 -> 8x423:16
markoswhen reduced I can just calculate the diagonals sums as before23:17
lkclok so the 0,0 (x,y coords) contains the contents of (0,0) plus (1,0)23:17
markosyes, exactly23:17
lkcl(1,0) contains the contents (2,0) plus (3,0) etc.23:17
markosyup, vertical index increases twice as fast as horizontal one23:18
markosthat's what the x>>1 does23:18
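The 8x8 -> 8x4 reduction can be checked with a short model: summing by the index y + (x >> 1) over the full image gives the same partial sums as first adding horizontal pairs and then summing the reduced 8x4 image by y + x. The function names are illustrative and the image contents are arbitrary test data.

```python
def partial_sums_full(img):
    """Sum an 8x8 image along the slanted diagonals indexed by y + (x >> 1)."""
    sums = [0] * 11                  # indices run from 0 to 7 + 3 = 10
    for y in range(8):
        for x in range(8):
            sums[y + (x >> 1)] += img[y][x]
    return sums


def reduce_pairs(img):
    """Pair-wise horizontal addition: 8x8 -> 8x4."""
    return [[row[2 * i] + row[2 * i + 1] for i in range(4)] for row in img]


def partial_sums_reduced(red):
    """Same diagonal sums, computed on the 8x4 image with plain y + x."""
    sums = [0] * 11
    for y in range(8):
        for x in range(4):
            sums[y + x] += red[y][x]
    return sums


if __name__ == "__main__":
    img = [[(3 * y + 5 * x) % 17 for x in range(8)] for y in range(8)]
    print(partial_sums_full(img))
    print(partial_sums_reduced(reduce_pairs(img)))
```

After the reduction the index is an ordinary diagonal, which is why the step becomes "pretty much the same as the previous steps".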
*** jab <jab!> has quit IRC23:18
*** jab <jab!> has joined #libre-soc23:19
lkclif we had the mtcrweird instructions (and 128 CR Fields) you could have blatted the r3 pattern 0b01010101 into the CRs and used that.23:20
lkclinstead, temporarily (sorry!) you'll have to do it as QTY4 of those sv.add/sm=r3 instructions23:21
markosit's ok23:21
markosit's ok I'll figure it out, and then I'll have to do the same for the other 3 rows of the partial_sum_alt matrix :D23:21
markossimilar thing, with y>>1 etc :)23:21
markosunfortunately it means I'll have to reload img matrix from memory, which will break my promise of doing a zero-load implementation :(23:22
markosI don't have enough registers23:22
lkclthat one... you could use 2D REMAP23:22
markosI don't think I have enough time to learn this atm :)23:22
lkclah because you just blatted it?23:23
lkclso sad :)23:23
markosbasically yes23:23
markosunless I reuse some other registers23:23
markoswe'll see, not all is lost :)23:23
lkclif it's towards the end of the algorithm...23:24
markoswhen it's done, I'm going to measure total instructions in the original and SIMD versions and then this, I really want to get this done zero-load23:24
markosit doesn't even store any buffer in the end, just stores a single value to a given pointer :D23:25
lkclridiculous, sigh :)23:29
jablkcl: thanks for explaining23:37
lkcleven the RVV example is 23 instructions
lkcland that's supposed to be a top-of-the-line vector implementation23:40
lkcl      sub a2, a2, t1        # Decrement count.23:40
lkclthat's automatic (implicit, part of the standard Power ISA "Branch-Conditional" CTR decrementing, but improved and linked to the Vector Length)23:41
lkcl      add a3, a3, t1        # Bump dest pointer23:41
lkcl      add a1, a1, t1        # Bump src pointer23:41
lkclboth of those are automatic, by copying what PDP-8, PDP-11, Motorola 68000 do (auto-addressing)23:42
lkclagain, improved and linked to the Vector Length23:42
lkcli'm just... it's hard to explain that it's taken literally... 4 years? to get to this point?23:43
*** jab <jab!> has quit IRC23:55
*** jab <jab!~jab@user/jab> has joined #libre-soc23:55
