programmerjake | lkcl: https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=07d1eac91c7007954fed88332d495a42cd59afef | 01:16 |
---|---|---|
programmerjake | hope you think that's better reasoning | 01:16 |
programmerjake | some verbiage about being able to get all possible bitpatterns produced by lfs (not lfd) could be added. | 01:18 |
*** jab <jab!~jab@courtmarriott2.wintek.com> has quit IRC | 02:51 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.196.73.48> has quit IRC | 06:43 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.196.73.48> has joined #libre-soc | 06:43 | |
*** josuah <josuah!~irc@46.23.94.12> has quit IRC | 07:42 | |
*** josuah <josuah!~irc@46.23.94.12> has joined #libre-soc | 07:43 | |
lkcl | programmerjake, mmm... that's going to be complicated. a full and comprehensive justification is needed as to why. | 09:11 |
lkcl | or... hang on, that *is* the justification? | 09:11 |
lkcl | as in, there's no change from use of DOUBLE() and that's enough to express all possible f32 values? | 09:12 |
programmerjake | double is how powerisa expresses all f32 values in f64 registers | 09:13 |
programmerjake | all possible f32 values, including all quiet/signaling NaNs and all denormals | 09:14 |
lkcl | and all f32 values are still representable? | 09:24 |
lkcl | (if so that's great, because there will not be any objection from the OPF ISA WG) | 09:26 |
programmerjake | assuming flis/fishmv's pseudocode hasn't changed from when i last checked, yes, it covers all possible f32 bitpatterns | 09:46 |
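
The claim above (every f32 bit pattern, including quiet/signaling NaNs and denormals, has its own f64 representation) can be checked with a small bit-level model. This is a minimal sketch written for illustration, not the Power ISA DOUBLE() pseudocode itself; the function names `f32_bits_to_f64_bits`, `as_f32` and `as_f64` are made up for the example.

```python
import struct

def f32_bits_to_f64_bits(w: int) -> int:
    """Widen a 32-bit IEEE 754 pattern to 64 bits without dropping any bits."""
    sign = (w >> 31) & 1
    exp = (w >> 23) & 0xFF
    frac = w & 0x7FFFFF
    if exp == 0xFF:                      # Inf or NaN: payload (and quiet bit) kept as-is
        e64, f64 = 0x7FF, frac << 29
    elif exp != 0:                       # normal number: rebias the exponent
        e64, f64 = exp - 127 + 1023, frac << 29
    elif frac == 0:                      # signed zero
        e64, f64 = 0, 0
    else:                                # f32 denormal: becomes a normal f64
        top = frac.bit_length()          # position of the leading 1 bit (1..23)
        e64 = top - 150 + 1023           # value is 1.xxx * 2**(top - 150)
        f64 = ((frac << (24 - top)) & 0x7FFFFF) << 29
    return (sign << 63) | (e64 << 52) | f64

def as_f32(w: int) -> float:
    return struct.unpack("<f", struct.pack("<I", w))[0]

def as_f64(q: int) -> float:
    return struct.unpack("<d", struct.pack("<Q", q))[0]
```

Because the 23 fraction bits are only shifted (never rounded), two different f32 patterns can never land on the same f64 pattern in this model.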
markos | in the discussion, could you please pick one name? e.g. a question refers to fmvis, and the answer replies with flis, referring to the same command. either pick one or mention both. (question Other.3) | 09:51 |
markos | fwiw, I'm fine with flis, but just stick to one. keeping both and referring half the time to fmvis and the other half to flis only leads to confusion | 09:53 |
markos | otoh, fmvis fits better with fishmv (name-wise) :) | 09:54 |
lkcl | brilliant | 09:57 |
lkcl | the OPF ISA WG members have picked some more-conformant (precedent-based) names | 09:57 |
markos | which ones? | 09:58 |
lkcl | we go with those (for obvious reasons) | 09:58 |
lkcl | in the discussion page. | 09:58 |
markos | flis/flisl yes | 09:59 |
markos | that's what I'm saying | 09:59 |
lkcl | ok we're onto phase 3 with the 2 grants. can't be "announced" yet, has to go through an independent audit | 10:00 |
lkcl | 1 short paragraph is needed to describe each project | 10:00 |
markos | congrats! | 10:00 |
lkcl | congrats at having a lot more work to do? :) | 10:00 |
markos | yes :) | 10:01 |
lkcl | nggh :) | 10:01 |
markos | you knew this from the start, didn't you? you wouldn't decide to design a vector architecture if you wanted to be lazy, you would do ebanking with Java :D | 10:04 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.196.73.48> has quit IRC | 10:17 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.196.73.48> has joined #libre-soc | 10:18 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.196.73.48> has quit IRC | 10:32 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.53.187> has joined #libre-soc | 10:37 | |
lkcl | given a choice between that and working at McDonald's, i'd choose the burgers | 10:48 |
markos | :) | 10:49 |
markos | having actually worked in ebanking with Java for a couple of years, I agree 100%, worst environment ever | 10:50 |
markos | the definition of boooring | 10:50 |
markos | funny thing is that because I did Java a million years ago, recruiters still contact me for a Java job every now and then | 10:51 |
markos | wouldn't touch it again unless it was for a ridiculous amount of money and even then I think I'd quit instantly | 10:52 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.53.187> has quit IRC | 11:32 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.196.73.48> has joined #libre-soc | 11:32 | |
lkcl | i'd make better friends at mcdonalds. | 11:33 |
*** midnight <midnight!~midnight@user/midnight> has quit IRC | 11:33 | |
*** midnight <midnight!~midnight@user/midnight> has joined #libre-soc | 11:36 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.196.73.48> has quit IRC | 14:29 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.160.14> has joined #libre-soc | 14:31 | |
*** Veera <Veera!~veera@117.243.24.160> has joined #libre-soc | 14:39 | |
Veera | Hi | 14:39 |
Veera | lkcl: I have been paid for Bug #577 and updated bugzilla page for paid status | 14:40 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.160.14> has quit IRC | 14:46 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.164.227> has joined #libre-soc | 14:56 | |
*** Veera <Veera!~veera@117.243.24.160> has quit IRC | 14:59 | |
cesar | Work on the formal verification of MultiCompUnit is now in git. Will update Bug #879 with the issues that it found, and then proceed to submit the RfP. | 15:03 |
lkcl | cesar, fantastic | 15:28 |
*** octavius <octavius!~octavius@243.147.93.209.dyn.plus.net> has joined #libre-soc | 15:49 | |
lkcl | i found the x86 optimised assembler strlen/strncpy and it's so depressingly large i can't be bothered to post it for comparison | 15:55 |
lkcl | even the IBM POWER8 strncpy is awful | 15:56 |
lkcl | https://github.com/lattera/glibc/blob/master/sysdeps/powerpc/powerpc64/power8/strncpy.S | 15:56 |
lkcl | code that checks for address-alignment | 15:57 |
lkcl | code that checks for a 4k page-boundary crossing | 15:57 |
lkcl | stripmining for up to the first 15 bytes | 15:58 |
lkcl | including comments it's 479 lines | 15:59 |
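
The three kinds of bookkeeping listed above (alignment check, 4k page-boundary check, handling up to 15 leading bytes) boil down to simple address arithmetic. The following is a generic sketch of that arithmetic, not a transcription of the glibc POWER8 code; `ALIGN = 16` is an assumption inferred from the "first 15 bytes" remark, and both function names are hypothetical.

```python
PAGE_SIZE = 4096   # the 4k page size mentioned above
ALIGN = 16         # assumed vector width in bytes (hence "up to 15" leading bytes)

def crosses_page(addr: int, nbytes: int, page_size: int = PAGE_SIZE) -> bool:
    """True if reading nbytes starting at addr would touch the next page."""
    return (addr % page_size) + nbytes > page_size

def bytes_to_alignment(addr: int, align: int = ALIGN) -> int:
    """Number of leading bytes to handle before addr becomes aligned."""
    return (-addr) % align
```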
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.164.227> has quit IRC | 17:39 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.196.73.48> has joined #libre-soc | 17:39 | |
*** choozy <choozy!~choozy@75-63-174-82.ftth.glasoperator.nl> has joined #libre-soc | 18:12 | |
markos | lkcl, aside from predicates, how can I use svstep to do a sv.add instruction, but every 2 elements? | 18:38 |
markos | actually because it's a lot of elements (VL=64), they don't fit in one predicate mask | 18:38 |
lkcl | mmm.... i wondered about this one as well | 18:52 |
lkcl | in theory what you could do is use matrix or Index REMAP to set up a 2D arrangement, where one of the dimensions is 2 | 18:55 |
lkcl | then override VL to *half* the total | 18:55 |
lkcl | with a 2nd setvl | 18:55 |
lkcl | but honestly, if you only have 16 elements you can just set r3/r10/r31 equal to the required predicate directly with "li" | 18:56 |
*** jab <jab!~jab@courtmarriott2.wintek.com> has joined #libre-soc | 19:08 | |
markos | I guess the easiest is to do 4 x 16 | 19:19 |
*** octavius <octavius!~octavius@243.147.93.209.dyn.plus.net> has quit IRC | 20:21 | |
lkcl | 64-bit will fit into one integer predicate. | 20:31 |
lkcl | this was one of the situations that grevluti was designed for: to be able to hit a regular pattern into a GPR in one single 32-bit instruction | 20:32 |
*** choozy <choozy!~choozy@75-63-174-82.ftth.glasoperator.nl> has quit IRC | 20:50 | |
programmerjake | if you want to quickly load a repeating 64-bit pattern, you can use sv.addi/subvl=4/elwid=16 rt, 0, 0x5555 -- note no star on rt | 21:49 |
markos | do both subvl and elwid have to be specified? doesn't one imply the other? | 22:15 |
markos | but that's a cool trick, thanks | 22:16 |
lkcl | i've never tried it - but it should work perfectly | 22:24 |
lkcl | no. subvl is a "small inner repeating loop", with the option to be 2,3 or 4 (actually, 1 as well as the degenerate case) | 22:24 |
lkcl | elwidths have nothing to do with subvl, they apply independently | 22:25 |
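
To make the subvl/elwid trick concrete: four 16-bit elements, each set to 0x5555, packed into one 64-bit register give the alternating-bit mask. The sketch below models only the resulting value, not how the instruction executes; `replicate_pattern` and its parameter names are invented for the example.

```python
def replicate_pattern(value: int, elwidth: int = 16, count: int = 4) -> int:
    """Pack `count` copies of an `elwidth`-bit value into one integer."""
    mask = (1 << elwidth) - 1
    result = 0
    for i in range(count):
        # element i occupies bits [i*elwidth, (i+1)*elwidth)
        result |= (value & mask) << (i * elwidth)
    return result
```

With the defaults, `replicate_pattern(0x5555)` yields `0x5555555555555555`, i.e. a predicate that selects every even-numbered element of a 64-element vector.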
lkcl | i got a sv.divmod2du test working! | 22:25 |
cesar | lkcl: Just sent the RfP for #879 | 22:26 |
lkcl | cesar, saw it - approved. awesome. 2 days remaining (!) so in theory it should be fine | 22:28 |
lkcl | fantastic to find the bugs | 22:28 |
lkcl | that's exactly the point of doing these proofs | 22:28 |
lkcl | definitely worthwhile to do an OPF talk or a FOSDEM talk about that | 22:29 |
lkcl | okaaay that rounds off (completes) 2019-10-032 https://libre-soc.org/task_db/report/ | 22:30 |
lkcl | markos, last one! https://bugs.libre-soc.org/show_bug.cgi?id=229 | 22:31 |
lkcl | do what you can, ok? | 22:32 |
markos | lkcl, I'm *that* close, last stage | 22:34 |
lkcl | :) | 22:34 |
markos | ok doing now the partial_sum_alts (the slanted diagonals y + (x >> 1) trick), I've done pair-wise addition of the 2 elements so the first carries the sum of the two, and need to copy only those into a new location, so instead of 8x8 I will have 8x4 elements, -yes I know REMAP :) | 22:55 |
markos | because that way the operation is simplified to pretty much the same as the previous steps | 22:56 |
markos | the question is how to do that :) | 22:56 |
markos | I have the sums, checked they are correct | 22:57 |
lkcl | :) | 23:00 |
lkcl | just use predicate-masking on src-only | 23:00 |
lkcl | sv.addi/sm=r3 *dest,*src,0 | 23:01 |
markos | aha! | 23:01 |
lkcl | where r3=0b01010101010101010101010.... | 23:01 |
lkcl | and it will do *independent*-running-along of the predicate from the source | 23:01 |
markos | so dest index will not increase? | 23:01 |
lkcl | it will | 23:01 |
lkcl | unconditionally by 1 for every *1* bit in the source predicate mask | 23:01 |
markos | that's what I want | 23:01 |
lkcl | 1 sec | 23:01 |
lkcl | exaaaampllllle..... | 23:02 |
markos | I want it to *not* increase when bit is zero | 23:02 |
lkcl | ermermerm | 23:02 |
markos | this is what I do now | 23:02 |
markos | setvl 0,0,16,0,1,1 # Set VL to 16 elements | 23:02 |
markos | ori pred, 0, 0b0101010101010101 | 23:02 |
markos | sv.add/sm=r3 *img, *img, *img+1 | 23:02 |
markos | unfortunately subvl is not supported in binutils yet so no VL=64 :( | 23:03 |
lkcl | would sm=~r3 do the trick? | 23:03 |
lkcl | that simply inverts the bits of r3 (in-place) | 23:03 |
markos | can try | 23:03 |
lkcl | https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_predication.py;hb=HEAD#l193 | 23:03 |
lkcl | ah this is an inverted-one. sm=~r3 | 23:04 |
lkcl | https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_predication.py;hb=HEAD#l261 | 23:04 |
lkcl | actually that's a twin-pred | 23:04 |
lkcl | sm=r3/dm=~r3 | 23:04 |
lkcl | but you get the general idea | 23:04 |
lkcl | do you want "compress" effect, or "expand" effect? | 23:05 |
markos | I think I want the first one, compress sm=r3 | 23:05 |
markos | so, what I'm doing basically? | 23:05 |
lkcl | r0,r1,r2... -> r4,r6,r8.... | 23:05 |
lkcl | yes | 23:05 |
lkcl | with that r3 value you should get every *even* reg copied to a contiguous block of regs | 23:06 |
markos | that's exactly what I want | 23:06 |
lkcl | r0,r2,r4... -> r0,r1,r2... | 23:06 |
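
The "compress" and "expand" effects described above can be modelled as two independent indices, one stepping over the source under the source mask and one stepping over the destination under the destination mask. This is a simplified sketch of that idea as described in the chat, not the SVP64 specification pseudocode; `twin_pred_copy` is a hypothetical helper.

```python
def twin_pred_copy(src, dest, vl, sm, dm):
    """Copy src to dest, skipping source elements whose sm bit is 0
    and destination elements whose dm bit is 0."""
    i = j = 0
    while True:
        while i < vl and not (sm >> i) & 1:   # skip masked-out sources
            i += 1
        while j < vl and not (dm >> j) & 1:   # skip masked-out destinations
            j += 1
        if i >= vl or j >= vl:
            break
        dest[j] = src[i]
        i += 1
        j += 1
    return dest

ALL = 0xFF  # all-ones mask for VL=8

# compress: every even source element lands in a contiguous block
compressed = twin_pred_copy(list(range(10, 18)), [0] * 8, 8, 0b01010101, ALL)
# expand: contiguous sources are spread over every even destination
expanded = twin_pred_copy(list(range(10, 18)), [0] * 8, 8, ALL, 0b01010101)
```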
lkcl | ahhh but you're doing an add at the same time | 23:06 |
lkcl | so you will get: | 23:07 |
lkcl | r0 = r0+r1 | 23:07 |
lkcl | r1 = r2+r3 | 23:07 |
lkcl | r2 = r4+r5 | 23:07 |
lkcl | r3 = r6+r7 | 23:07 |
lkcl | ... | 23:07 |
lkcl | with this: | 23:07 |
lkcl | sv.add/sm=r3 *img, *img, *img+1 | 23:07 |
lkcl | if you *only* want *copy*, you want this: | 23:08 |
lkcl | sv.addi/sm=r3 *img,*img,0 | 23:08 |
lkcl | which will do: | 23:08 |
lkcl | r0 = r0+0 | 23:08 |
lkcl | r1 = r2+0 | 23:08 |
markos | yes, add and copy is what I want | 23:08 |
lkcl | r2 = r4+0 | 23:08 |
lkcl | .... | 23:08 |
lkcl | ok. | 23:09 |
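
The add-and-compress result listed above can be written out as a plain loop: for each set bit in the mask, add the selected element to its right-hand neighbour and append the sum to a contiguous output. This models only the outcome described in the chat, under the assumption that the mask never selects the last element; `compress_pair_add` is a made-up name.

```python
def compress_pair_add(regs, mask, vl):
    """For each set bit i in mask, emit regs[i] + regs[i + 1] contiguously."""
    out = []
    for i in range(vl):
        if (mask >> i) & 1:
            # assumption: i + 1 is still a valid index for every selected i
            out.append(regs[i] + regs[i + 1])
    return out

sums = compress_pair_add([1, 2, 3, 4, 5, 6, 7, 8], 0b01010101, 8)
```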
markos | amazing that this can happen with just one instruction... | 23:09 |
lkcl | the horizontal map-reduce is supposed to be for this | 23:09 |
lkcl | (without needing predicate masks) | 23:09 |
lkcl | which... i thiiiink.... might be working? | 23:09 |
lkcl | although... can't remember.... does it need REMAP? | 23:09 |
markos | it probably is, but how can I copy with skipping? | 23:10 |
lkcl | so much frickin going on i can't even remember | 23:10 |
markos | actually it is working, but I need to copy only the sums | 23:10 |
markos | not the next element | 23:10 |
jab | it sounds like you guys are on the verge of proving P=NP. :) | 23:11 |
lkcl | jab, lol | 23:11 |
lkcl | high-performance strncpy (including the zero-copying) in 10 instructions. | 23:12 |
lkcl | https://twitter.com/lkcl/status/1580315193984241665 | 23:12 |
lkcl | markos, what do you need (in c)? | 23:12 |
jab | seems pretty cool. I'm not completely following. but it seems awesome! haha | 23:14 |
markos | I'll just paste this as it's easier: | 23:15 |
markos | # horiz axis: x, vert axis: y, quantity of y + (x>>1): | 23:15 |
markos | # | 23:15 |
markos | # | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | | 23:15 |
markos | # | 0 | 0 | 0 | 1 | 1 | 2 | 2 | 3 | 3 | | 23:15 |
markos | # | 1 | 1 | 1 | 2 | 2 | 3 | 3 | 4 | 4 | | 23:15 |
lkcl | the compelling part is how depressingly long other ISAs are to do the same job | 23:15 |
markos | # | 2 | 2 | 2 | 3 | 3 | 4 | 4 | 5 | 5 | | 23:15 |
markos | # | 3 | 3 | 3 | 4 | 4 | 5 | 5 | 6 | 6 | | 23:16 |
markos | # | 4 | 4 | 4 | 5 | 5 | 6 | 6 | 7 | 7 | | 23:16 |
markos | # | 5 | 5 | 5 | 6 | 6 | 7 | 7 | 8 | 8 | | 23:16 |
markos | # | 6 | 6 | 6 | 7 | 7 | 8 | 8 | 9 | 9 | | 23:16 |
markos | # | 7 | 7 | 7 | 8 | 8 | 9 | 9 | a | a | | 23:16 |
lkcl | jab, POWER8 is 470 instructions for example | 23:16 |
markos | what I've done is reduced the 8x8 -> 8x4 | 23:16 |
markos | when reduced I can just calculate the diagonals sums as before | 23:17 |
lkcl | ok so the 0,0 (x,y coords) contains the contents of (0,0) plus (1,0) | 23:17 |
markos | yes, exactly | 23:17 |
lkcl | (1,0) contains the contents (2,0) plus (3,0) etc. | 23:17 |
markos | yup, vertical index increases twice as fast as horizontal one | 23:18 |
markos | that's what the x>>1 does | 23:18 |
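
The 8x8 to 8x4 reduction works because both elements of each horizontal pair share the same `y + (x >> 1)` index, so they can be added first and binned afterwards. A small sketch comparing the direct computation with the pairwise one follows; the function names and the test image are illustrative, not taken from the code being ported.

```python
def partial_sums_direct(img):
    sums = [0] * 11                      # y + (x >> 1) ranges over 0..10 for 8x8
    for y in range(8):
        for x in range(8):
            sums[y + (x >> 1)] += img[y][x]
    return sums

def partial_sums_via_pairs(img):
    # step 1: pairwise add neighbours, turning each 8-wide row into 4 sums
    pairs = [[row[2 * k] + row[2 * k + 1] for k in range(4)] for row in img]
    # step 2: an ordinary diagonal sum over the 8x4 result
    sums = [0] * 11
    for y in range(8):
        for k in range(4):
            sums[y + k] += pairs[y][k]
    return sums

img = [[y * 8 + x for x in range(8)] for y in range(8)]
index_row0 = [0 + (x >> 1) for x in range(8)]   # first row of the table above
```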
*** jab <jab!~jab@courtmarriott2.wintek.com> has quit IRC | 23:18 | |
*** jab <jab!~jab@courtmarriott2.wintek.com> has joined #libre-soc | 23:19 | |
lkcl | if we had the mtcrweird instructions (and 128 CR Fields) you could have blatted the r3 pattern 0b01010101 into the CRs and used that. | 23:20 |
lkcl | instead, temporarily (sorry!) you'll have to do it as QTY4 of those sv.add/sm=r3 instructions | 23:21 |
markos | it's ok | 23:21 |
markos | it's ok I'll figure it out, and then I'll have to do the same for the other 3 rows of the partial_sum_alt matrix :D | 23:21 |
lkcl | niiice | 23:21 |
markos | similar thing, with y>>1 etc :) | 23:21 |
lkcl | jooooy | 23:21 |
markos | unfortunately it means I'll have to reload img matrix from memory, which will break my promise of doing a zero-load implementation :( | 23:22 |
markos | I don't have enough registers | 23:22 |
markos | :D | 23:22 |
lkcl | that one... you could use 2D REMAP | 23:22 |
markos | I don't think I have enough time to learn this atm :) | 23:22 |
lkcl | ah because you just blatted it? | 23:23 |
lkcl | so sad :) | 23:23 |
markos | basically yes | 23:23 |
markos | unless I reuse some other registers | 23:23 |
markos | we'll see, not all is lost :) | 23:23 |
lkcl | if it's towards the end of the algorithm... | 23:24 |
markos | when it's done, I'm going to measure total instructions in the original and SIMD versions and then this, I really want to get this done zero-load | 23:24 |
markos | it doesn't even store any buffer in the end, just stores a single value to a given pointer :D | 23:25 |
lkcl | ridiculous, sigh :) | 23:29 |
jab | lkcl: thanks for explaining | 23:37 |
lkcl | even the RVV example is 23 instructions https://github.com/riscv/riscv-v-spec/blob/master/example/strncpy.s | 23:40 |
lkcl | and that's supposed to be a top-of-the-line vector implementation | 23:40 |
lkcl | sub a2, a2, t1 # Decrement count. | 23:40 |
lkcl | that's automatic (implicit, part of the standard Power ISA "Branch-Conditional" CTR decrementing, but improved and linked to the Vector Length) | 23:41 |
lkcl | add a3, a3, t1 # Bump dest pointer | 23:41 |
lkcl | add a1, a1, t1 # Bump src pointer | 23:41 |
lkcl | both of those are automatic, by copying what PDP-8, PDP-11, Motorola 68000 do (auto-addressing) | 23:42 |
lkcl | again, improved and linked to the Vector Length | 23:42 |
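
For readers unfamiliar with strip-mining, the three RVV instructions quoted above (decrement count, bump dest pointer, bump src pointer) are the loop bookkeeping in a chunked strncpy. The sketch below models that loop in plain Python, including the zero-fill after the terminator; it is neither the RVV nor the SVP64 code, and `strncpy_stripmined` and `max_vl` are names chosen for the example.

```python
def strncpy_stripmined(dst, src, n, max_vl=8):
    """Copy at most n bytes from NUL-terminated src into dst,
    zero-filling the remainder, in chunks of up to max_vl bytes."""
    d = s = 0
    found_nul = False
    while n > 0:
        vl = min(n, max_vl)              # vector length for this chunk
        for k in range(vl):
            if found_nul:
                dst[d + k] = 0           # pad with zeros after the terminator
            else:
                b = src[s + k]
                dst[d + k] = b
                found_nul = (b == 0)
        n -= vl                          # "Decrement count"
        d += vl                          # "Bump dest pointer"
        s += vl                          # "Bump src pointer"
    return dst

buf = strncpy_stripmined(bytearray(b"\xff" * 10), b"hello\x00", 8, max_vl=3)
```

The last three statements of the loop are exactly the overhead that the chat describes as implicit once the count and pointer updates are tied to the vector length.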
lkcl | i'm just... it's hard to explain that it's taken literally... 4 years? to get to this point? | 23:43 |
*** jab <jab!~jab@courtmarriott2.wintek.com> has quit IRC | 23:55 | |
*** jab <jab!~jab@user/jab> has joined #libre-soc | 23:55 |