*** jab <jab!~jab@courtmarriott2.wintek.com> has quit IRC | 00:01 | |
*** jab <jab!~jab@courtmarriott2.wintek.com> has joined #libre-soc | 00:01 | |
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has quit IRC | 00:06 | |
*** lxo <lxo!~lxo@gateway/tor-sasl/lxo> has joined #libre-soc | 00:06 | |
*** jab <jab!~jab@courtmarriott2.wintek.com> has quit IRC | 00:16 | |
*** jab <jab!~jab@courtmarriott2.wintek.com> has joined #libre-soc | 00:16 | |
*** octavius <octavius!~octavius@230.147.93.209.dyn.plus.net> has quit IRC | 00:33 | |
*** jab <jab!~jab@courtmarriott2.wintek.com> has quit IRC | 01:37 | |
*** jab <jab!~jab@courtmarriott2.wintek.com> has joined #libre-soc | 01:37 | |
*** jab <jab!~jab@courtmarriott2.wintek.com> has quit IRC | 02:56 | |
*** jab <jab!~jab@courtmarriott2.wintek.com> has joined #libre-soc | 02:57 | |
*** jab <jab!~jab@courtmarriott2.wintek.com> has quit IRC | 03:10 | |
*** ghostmansd <ghostmansd!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has joined #libre-soc | 08:10 | |
*** ghostmansd <ghostmansd!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has quit IRC | 08:35 | |
markos | lkcl, ok, so I've created the reverse indices with svstep, so if I understand it, I have to use svindex to set those indices as offsets for RB in sv.mulld | 09:24 |
---|---|---|
markos | rmm=0b10 for RB | 09:25 |
markos | and then these would be added to the register index of RB | 09:25 |
markos | so I have: | 09:25 |
markos | setvl 0,0,7,0,1,1 # Set VL to 7 elements | 09:26 |
markos | sv.svstep/mrr *tmp2, 6, 1 | 09:26 |
markos | then svindex | 09:26 |
markos | and then | 09:26 |
markos | #sv.mulld *tmp, *tmp, *divt | 09:26 |
markos | but instead of *divt being multiplied in order divt+0,divt+1,divt+2,divt+3,divt+4,divt+5,divt+6 | 09:27 |
markos | it would be in reverse order | 09:27 |
markos | divt+6, divt+5, ..., divt+1, divt+0 | 09:27 |
markos | now to figure out the svindex syntax :) | 09:28 |
markos | hm, svindex SVG has to be between 0..31, so if I have created the reverse indices in GPRs above that it won't work | 09:40 |
markos | damn, I need more registers | 09:41 |
programmerjake | use strided load with a negative stride... | 09:43 |
markos | they're already loaded in registers | 09:44 |
programmerjake | well, load again but reversed... | 09:44 |
markos | that's not the problem, the values are already calculated | 09:45 |
markos | I need to evaluate two sums | 09:45 |
markos | 2 sums | 09:45 |
markos | sum_0^N{A[i]*B[i]}, and sum_0^N{A[i]*B[N-i]} | 09:45 |
markos | I need to reverse the order of the second array, and for sure I'm not going to store it to memory and reload it reversed | 09:45 |
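A minimal Python sketch of the two sums markos describes, for reference. The function name `dot_sums` and the sample data are hypothetical; the only thing taken from the log is the pair of formulas sum_0^N{A[i]*B[i]} and sum_0^N{A[i]*B[N-i]}.

```python
def dot_sums(a, b):
    """Return (sum A[i]*B[i], sum A[i]*B[N-i]) for two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    n = len(b) - 1  # N in the formulas: the index of the last element
    forward = sum(a[i] * b[i] for i in range(n + 1))
    # Second sum walks B backwards, which is what the reversed indices are for.
    reverse = sum(a[i] * b[n - i] for i in range(n + 1))
    return forward, reverse


# Hypothetical sample data, 7 elements to match VL=7 in the log.
a = [1, 2, 3, 4, 5, 6, 7]
b = [10, 20, 30, 40, 50, 60, 70]
print(dot_sums(a, b))
```

The second sum only needs B to be read in reverse order, not physically reversed, which is why an index-based REMAP is attractive here.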
programmerjake | so it's intermediate values that need to be reversed...if they're 8-bit values, use grevi | 09:45 |
programmerjake | it can do a byte reverse | 09:46 |
programmerjake | 2 grevi ops can reverse a vector of 16-bit elements | 09:46 |
lkcl | i just got elwidth overrides running on svindex. | 09:46 |
markos | unfortunately these are partial sums -of 8-bit values- so no guarantee they're 8-bit | 09:46 |
lkcl | https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=3e4d137f3a46b712bdcc966ef930e08fe6ecb621 | 09:47 |
markos | are grevi operations inplace? | 09:47 |
programmerjake | grevi can be in-place, they're like shift/rotate except swapping instead of shifting | 09:48 |
lkcl | please don't use grevi it is not available | 09:48 |
programmerjake | in-place -- just use same reg for src and dest | 09:48 |
lkcl | it is to be replaced with grevluti | 09:48 |
markos | so can I use grevluti then? | 09:49 |
lkcl | no, it has not yet been implemented | 09:49 |
programmerjake | grevi is currently implemented and works, and can be changed to grevluti if/when grevluti is implemented | 09:49 |
programmerjake | so imho just use grevi and whoever implements grevluti will change your code to use it | 09:50 |
markos | programmerjake, hm, but I don't need byte reversal, | 09:51 |
markos | just looked at grevi | 09:51 |
markos | I'd need to basically swap B[i] with B[N-i] | 09:52 |
programmerjake | grevi can do 2-bit, nibble, byte, 16-bit, and 32-bit chunk reversal | 09:52 |
markos | svindex seems to do what I want, but I just need to get the register numbering to get the indices within the first 32 GPRs | 09:53 |
markos | why is that btw? | 09:53 |
lkcl | not enough space in the instruction | 09:53 |
markos | right | 09:53 |
lkcl | only 32-bit | 09:53 |
lkcl | actually it's 5-bits shifted up by 2. | 09:53 |
lkcl | starting at 0, 4, 8, 12.... 124 | 09:53 |
markos | it's going to be tight, but with a little effort I can get the algorithm totally free of any loads apart from the initial loads ofc | 09:54 |
lkcl | def index_remap(i): | 09:54 |
lkcl | return GPR((SVSHAPE.SVGPR<<1)+i) + SVSHAPE.offset | 09:54 |
lkcl | sorry | 09:54 |
lkcl | every 2. | 09:54 |
lkcl | https://libre-soc.org/openpower/sv/remap/#svindex | 09:55 |
programmerjake | so to reverse the 16-bit chunks in a 64-bit register, use grevi RT, RA, 0x30 -- 0x30 is 0x20 | 0x10 -- 0x20 means swap adjacent 32-bit chunks, 0x10 means swap adjacent 16-bit chunks, together they reverse the 4 16-bit chunks in a 64-bit register | 09:55 |
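The chunk-swapping behaviour programmerjake describes can be sketched in Python as a generic 64-bit "generalized reverse". This is a behavioural model only, not the Libre-SOC implementation; the name `grev64` and the mask table are assumptions made for illustration.

```python
MASK64 = (1 << 64) - 1

# Each mask selects the low half of every adjacent pair of chunks,
# for chunk widths of 1, 2, 4, 8, 16 and 32 bits respectively.
STAGE_MASKS = [
    0x5555555555555555,
    0x3333333333333333,
    0x0F0F0F0F0F0F0F0F,
    0x00FF00FF00FF00FF,
    0x0000FFFF0000FFFF,
    0x00000000FFFFFFFF,
]


def grev64(value, shamt):
    """Swap adjacent chunks of width 2**k for every bit k set in shamt."""
    x = value & MASK64
    for stage, mask in enumerate(STAGE_MASKS):
        width = 1 << stage
        if shamt & width:
            # Move low chunks up and high chunks down by one chunk width.
            x = ((x & mask) << width) | ((x >> width) & mask)
    return x & MASK64


# 0x30 = 0x20 | 0x10: swap 32-bit halves, then swap 16-bit neighbours.
print(hex(grev64(0x0001000200030004, 0x30)))
```

Because each stage only swaps neighbours of one width, the stages commute, so the order in which the set bits are applied does not matter.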
lkcl | nope, it is 0 4 8 12 .... 124 | 09:55 |
lkcl | https://libre-soc.org/openpower/sv/remap/#svindex | 09:55 |
markos | programmerjake, they are in different registers | 09:55 |
lkcl | SVG - GPR SVG<<2 to be used for Indexing | 09:55 |
markos | one value per register | 09:55 |
markos | it's not packed (yet) | 09:56 |
programmerjake | oh, you don't have elwid packing yet? | 09:56 |
programmerjake | well, nm about grevi then | 09:56 |
markos | once elwidth is complete, I will convert the algorithm to packed | 09:56 |
markos | but right now time is pressing | 09:56 |
markos | I want to get it done and then we can convert it at our leisure | 09:56 |
programmerjake | well, if time is that short, just store and load with negative stride, can figure out svindex stuff later | 09:57 |
markos | that's the point, I don't want to store/load :) | 09:58 |
markos | I wouldn't have spent so much time on it, I could just store/load the whole bunch, it's a direction finding function in an 8x8 matrix, and so far I've done 80% of it without a single store/load :) | 10:00 |
markos | but I've had to rearrange the usage of registers 3 times already :-/ | 10:00 |
markos | partly because some instructions have a limitation to use only GPRs <32 | 10:00 |
programmerjake | maybe write a sequence of scalar mv ops? can figure out the sv version later... | 10:00 |
markos | well, speaking of which, it would be cool to have a sv.mv that would just reverse the order, but then again you can already do it with svindex, so why bother... | 10:01 |
markos | but an sv.mv could be used for other stuff as well | 10:02 |
programmerjake | mv r8, r23; mv r9, r22; ... mv r14, r17; mv r15, r16 | 10:02 |
programmerjake | there *is* sv.mv ... it's spelled sv.ori rt,ra, 0...just like mv is ori iirc | 10:03 |
markos | true | 10:03 |
programmerjake | that said it doesn't do anything special that other 2-arg ops don't do | 10:03 |
lkcl | exactly that's the whole point of the various REMAPs - so that you *don't* have to use mvs. | 10:05 |
markos | you gave me a good idea, I could use svindex to sv.mv the original elements (which *are* in GPR <32), move them to somewhere higher, put the svstep reverse indices in the original array place, run sv.mv with svindex, reverse the element order, and then sv.mv it (in reverse) back to the original lower registers | 10:06 |
lkcl | markos, you didn't read what i wrote above. | 10:07 |
lkcl | SVG is shifted up. | 10:07 |
lkcl | SVG reads its array from register locations starting at | 10:07 |
lkcl | 0 | 10:07 |
lkcl | 4 | 10:07 |
lkcl | 8 | 10:07 |
lkcl | 12 | 10:07 |
lkcl | 16 | 10:07 |
lkcl | ... | 10:07 |
lkcl | .... | 10:07 |
lkcl | 124 | 10:07 |
markos | I did, I just tried it with svindex *116,0b10,7,0,0,0,0 and it complained again | 10:08 |
programmerjake | well, it's 2am here, gn. hope your semi-reversed dot product goes well... | 10:08 |
lkcl | use 29, not 4*29 | 10:08 |
lkcl | :) | 10:08 |
markos | ah wait | 10:08 |
markos | nope | 10:08 |
markos | Error: operand out of range (116 is not between 0 and 31) | 10:08 |
lkcl | use 29, not 4*29 | 10:09 |
lkcl | and it is not a "*" (vector) | 10:09 |
markos | yes, I fixed that | 10:09 |
lkcl | you want | 10:09 |
lkcl | svindex 29,0b10,7,0,0,0,0 | 10:09 |
lkcl | not svindex 116 | 10:09 |
lkcl | or svindex *116 | 10:09 |
lkcl | or svindex *29 | 10:09 |
lkcl | just | 10:09 |
lkcl | svindex 29,0b10,7,0,0,0,0 | 10:10 |
markos | could you please explain the last bit? why is the division by 4? | 10:10 |
lkcl | because otherwise ghostmansd[m] would have had to create another special operand in binutils | 10:10 |
lkcl | a "multiply by 4" operand | 10:10 |
markos | ok, so the index has to be always divisible by 4 then | 10:10 |
lkcl | and there are only 5 bits available | 10:11 |
lkcl | which register(s) did you want the indices to apply to? | 10:11 |
lkcl | (RB iirc) | 10:11 |
lkcl | so yes that's 0b00010 | 10:12 |
lkcl | the order's RA (bit 0) RB (bit 1) RC (bit 2) RT (bit 3) RS/EA (bit 4) | 10:12 |
programmerjake | uuh, LSB0? | 10:13 |
markos | ok, testing it now | 10:13 |
markos | yes RB | 10:13 |
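As a small aid for building the svindex register mask, here is a Python sketch of the bit order lkcl lists (RA bit 0, RB bit 1, RC bit 2, RT bit 3, RS/EA bit 4). It assumes LSB0 numbering, which matches the `0b00010` value quoted for RB; the helper name `rmm_mask` is hypothetical.

```python
# Bit positions as listed in the log, assuming LSB0 numbering.
RMM_BITS = {"RA": 0, "RB": 1, "RC": 2, "RT": 3, "RS/EA": 4}


def rmm_mask(*operands):
    """Build the 5-bit mask selecting which operands the index REMAP applies to."""
    mask = 0
    for name in operands:
        mask |= 1 << RMM_BITS[name]
    return mask


print(format(rmm_mask("RB"), "#07b"))
```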
lkcl | also, just so you know: you can just set the dimension=1 and it disables the Matrix-style "REMAPping" entirely, leaving just indexing | 10:14 |
lkcl | so you could have used | 10:15 |
lkcl | svindex 29,0b10,1,0,0,0,0 | 10:15 |
lkcl | and it's exactly the same thing | 10:15 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has quit IRC | 10:15 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.54.71> has joined #libre-soc | 10:16 | |
lkcl | i almost have enough to do strncpy now i think. | 10:18 |
lkcl | not the LD/ST speculative fail-first | 10:18 |
lkcl | but elwidth overrides and data-dependent fail-first | 10:18 |
lkcl | one thing i really want to do is post-update on EA in LD/ST | 10:19 |
lkcl | as in: you use *just* RA as the input (no offset from RB or immediate) | 10:19 |
lkcl | and write out the new value of RA+imm (or RA+B) *after* the LD/ST | 10:20 |
lkcl | going into a loop that makes it entirely unnecessary to perform a post-vector-LD "add" | 10:21 |
markos | ok, this doesn't yet work, but I'm probably missing something obvious here | 10:38 |
markos | so here is the code so far: | 10:38 |
markos | setvl 0,0,7,0,1,1 | 10:38 |
markos | sv.svstep/mrr *tmp2, 6, 1 | 10:39 |
markos | this produces the sequence 00000006 00000005 00000004 00000003 00000002 00000001 00000000 in registers tmp2 = 116 | 10:39 |
markos | svindex 29,0b10,7,0,0,0,0 | 10:39 |
markos | sv.ori *tmp, 0, *divt | 10:39 |
markos | where tmp = 108, divt = 14 | 10:40 |
markos | just to see if I can move the elements in reverse order and verify that it works | 10:40 |
markos | divt: 00000000 000001a4 00000118 000000d2 000000a8 0000008c 00000078 00000069 | 10:41 |
markos | I expected to see this in reverse | 10:41 |
lkcl | sv.ori.... *tmp,0,*divt - that's... not going to work | 10:42 |
markos | instead I see (in *tmp) 0000000e 0000000e 0000000e 0000000e 0000000e 0000000e 0000000e | 10:42 |
lkcl | you want | 10:42 |
lkcl | sv.ori *tmp,*divt,0 | 10:42 |
markos | aaaargh | 10:42 |
markos | it's an immediate | 10:42 |
lkcl | but that's still RT,RA,0 | 10:42 |
lkcl | so you want | 10:42 |
lkcl | svindex 29, 0b00001, 7,0,0,0 | 10:43 |
markos | yup | 10:43 |
markos | still not right: 0000004f 00000027 00000026 00000076 ffffffffffffffa7 00100070 00000000 | 10:46 |
lkcl | unless i can see the code it's quite inconvenient | 10:46 |
markos | is it ok if I commit everything so far in video/av1? | 10:46 |
lkcl | of course | 10:47 |
markos | ok | 10:47 |
markos | no binaries I know :) | 10:47 |
lkcl | search the log file for the word "indexed_iterator" btw | 10:51 |
markos | ok, pushed | 10:52 |
markos | just run make in video/av1 | 10:52 |
markos | actual SVP64 code is in src/ppc/cdef_tmpl_svp64_real.s | 10:52 |
markos | line in question 147 | 10:53 |
lkcl | found it | 10:53 |
markos | I use SILENCELOG=1 | 10:53 |
markos | up to that line everything works | 10:53 |
lkcl | ah i don't have the modified version of binutils. | 10:54 |
markos | ah yes, that would be needed :) | 10:54 |
lkcl | you'll have to run it. | 10:54 |
lkcl | then look for indexed_iterator in the logs | 10:54 |
lkcl | which prints out from.... | 10:54 |
markos | ah unset SILENCELOG then | 10:54 |
lkcl | here | 10:55 |
lkcl | https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/svshape.py;h=8b4533755a214d243b509ca20994a7b3ff651cdf;hb=17719e8b26d6b198279f8004d90a256e0890a30b#l163 | 10:55 |
lkcl | yes and run >& /tmp/f | 10:55 |
lkcl | or use nohup | 10:55 |
lkcl | you'll also see, after every instruction, now, a regfile dump in the logs | 10:56 |
lkcl | so you can track what each element does | 10:57 |
lkcl | with your best face-palm do ignore the fact that indexed_iterator walks through multiple times (although it is convenient) | 10:58 |
lkcl | so by the time srcstep=6 you have *seven* printouts of indexed_iterator debug statements | 10:59 |
lkcl | you might actually want svindex 29,0b00001, 1,0,0,0 | 10:59 |
lkcl | you miiiight be reading regs from the wrong location. it *might* be from 58, not 29. | 11:02 |
lkcl | yep i think that's it. | 11:02 |
lkcl | check the example | 11:02 |
lkcl | https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svindex.py;hb=HEAD#l178 | 11:03 |
lkcl | 183 isa = SVP64Asm(['svindex 8, 1, 1, 0, 0, 0, 0', | 11:03 |
lkcl | but the indices are stored in: | 11:03 |
lkcl | 192 for i in range(6): | 11:03 |
lkcl | 193 initial_regs[16+i] = idxs[i] | 11:03 |
lkcl | apologies | 11:04 |
markos | aha! it's ok, so in the end I do have to use <32 GPRs? | 11:06 |
markos | so it's not 4*29 but 2? | 11:08 |
markos | 2*29 that is | 11:08 |
markos | which means I cannot really use as high as 116 | 11:08 |
markos | and I'll have to rework the code to move the indices to low registers | 11:09 |
markos | btw, logging that is like a movie, I get to see the registers as they fill up :) | 11:10 |
lkcl | yes :) | 11:22 |
markos | hm, the new logging doesn't honour SILENCELOG | 11:24 |
lkcl | should do | 11:25 |
lkcl | 112 sv.add/mr *psum+0, *psum+0, *img+0 | 11:25 |
lkcl | 113 sv.add/mr *psum+1, *psum+1, *img+8 | 11:25 |
lkcl | blegh! | 11:25 |
lkcl | these are what Matrix REMAP is supposed to be for! :) | 11:26 |
lkcl | but it'll probably be necessary to invent a new REMAP mode, [x+y] | 11:26 |
markos | I know, but I haven't really understood how it works :D | 11:26 |
markos | also reverse diagonal | 11:27 |
markos | 7+y-x | 11:27 |
lkcl | it's not modulo, is it? | 11:27 |
lkcl | it's not [0][(y+x)%8] | 11:27 |
markos | no, it fills 15 values | 11:27 |
lkcl | ahh | 11:27 |
markos | the last alt partial sums, does a slanted diagonal, if you see the comments | 11:28 |
lkcl | 41 int partial_sum_diag[2][15] = { { 0 } }; | 11:28 |
markos | yup | 11:28 |
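The two diagonal index patterns mentioned here, `[x+y]` and `7+y-x`, can be sketched in Python as below. This is an illustration built only from those two expressions and the `partial_sum_diag[2][15]` declaration quoted above; the function name and the test image are hypothetical, and it is not the code in video/av1.

```python
def diagonal_partial_sums(img):
    """Accumulate an 8x8 block along both diagonal directions into 2 x 15 bins."""
    size = 8
    diag = [[0] * 15 for _ in range(2)]
    for y in range(size):
        for x in range(size):
            px = img[y][x]
            diag[0][x + y] += px        # diagonal: index runs 0..14
            diag[1][7 + y - x] += px    # reverse diagonal: index also runs 0..14
    return diag


# Hypothetical test image: all ones, so each bin counts the pixels on its diagonal.
ones = [[1] * 8 for _ in range(8)]
for row in diagonal_partial_sums(ones):
    print(row)
```

The index is a plain sum rather than a sum modulo 8, which is why 15 bins are needed instead of 8.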
markos | you know, if we do this process for the whole AV1 codec, we might be able to just do AV1 *in software* much faster than other CPUs, possibly close to the speed of specialized hardware | 11:33 |
markos | which will be *huge* | 11:33 |
markos | AV1 is set to replace pretty much everything else in the next years -and they're also already designing AV2 | 11:34 |
markos | in the datacenter that is | 11:34 |
markos | most streamed video content will be converted to AV1 in the next years | 11:34 |
markos | and the best part, this will be done in a generic way | 11:35 |
markos | so all these extra instructions/modes/etc will benefit other code as well, it's not just a black box designed and implemented specifically for AV1 | 11:36 |
markos | this is very exciting! | 11:36 |
lkcl | yes :) | 11:48 |
lkcl | with algorithms upgrading faster than hardware can roll out, it's a big deal | 11:49 |
lkcl | you're never going to get to be "better" in terms of power consumption than dedicated hardware | 11:49 |
lkcl | btw this | 11:53 |
lkcl | 49 partial_sum_alt [0][ y + (x >> 1)] += px; | 11:53 |
lkcl | is a *3* dimensional case | 11:53 |
lkcl | where y=7 | 11:53 |
lkcl | x=errr 4? | 11:53 |
lkcl | no | 11:53 |
lkcl | y=8 | 11:53 |
lkcl | x=4 | 11:53 |
lkcl | and z=2 | 11:53 |
lkcl | but there is a "skip" on z | 11:54 |
lkcl | so y increments twice as fast as x | 11:54 |
lkcl | 53 partial_sum_alt [2][3 - (y >> 1) + x ] += px; | 11:55 |
lkcl | this is not dis-similar to the half-offset-reversing of DCT | 11:55 |
markos | right! | 12:00 |
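One possible reading of lkcl's "3-dimensional case" is sketched below in Python: splitting x into a half-index and a low bit z turns `y + (x >> 1)` into a loop nest of sizes 8, 4 and 2 in which z does not affect the index. The function names are hypothetical and the decomposition is an interpretation of the remarks above, not a statement of how a future REMAP mode would work.

```python
def alt_indices_2d():
    """Indices produced by the two quoted expressions over an 8x8 block."""
    return [(y + (x >> 1), 3 - (y >> 1) + x)
            for y in range(8) for x in range(8)]


def alt_index_3d():
    """Same first index as a y=8, x=4, z=2 loop nest; z is the skipped dimension."""
    return [y + xh
            for y in range(8) for xh in range(4) for z in range(2)]


first = [a for a, _ in alt_indices_2d()]
second = [b for _, b in alt_indices_2d()]
print(first == alt_index_3d())
print(min(first), max(first), min(second), max(second))
```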
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.54.71> has quit IRC | 13:02 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has joined #libre-soc | 13:03 | |
markos | lkcl, in which file is the gpr.dump you added for debugging this? I can't silence it :-/ | 14:17 |
lkcl | urrr... | 14:29 |
lkcl | you're using vi? | 14:29 |
lkcl | run "ctags -R" | 14:29 |
lkcl | then type ":tag dump" | 14:30 |
lkcl | isa/caller.py class GPR | 14:30 |
lkcl | it's using print, not log() | 14:30 |
lkcl | i leave it with you to correct, am in the middle of something on a different branch | 14:30 |
markos | ah right found it | 14:32 |
markos | getting a KeyError in the simulator running the binary | 14:36 |
markos | File "/home/markos/src/openpower-isa/src/openpower/decoder/isa/caller.py", line 2112, in get_input | 14:37 |
markos | reg_val = SelectableInt(self.gpr(base, is_vec, offs, ew_src)) | 14:37 |
markos | File "/home/markos/src/openpower-isa/src/openpower/decoder/isa/caller.py", line 129, in __call__ | 14:37 |
markos | return self[ridx+offs] | 14:37 |
markos | ridx 116 offs 89 | 14:50 |
markos | lkcl, I think this is bug | 14:53 |
markos | ^a | 14:53 |
lkcl | 116+89 is definitely overboard! | 15:00 |
lkcl | is the regfile declared with 128 entries? | 15:00 |
lkcl | that'll take some tracking down | 15:01 |
lkcl | can you add a repro case as a test_caller_svp64*.py unit test? | 15:01 |
lkcl | although if you are using Indexing it's possible you've over-run and are using an Index that's simply far too big | 15:03 |
lkcl | in the last sim.gpr.dump() output, what register contains the value "89"? | 15:03 |
markos | running it now | 15:08 |
markos | ok, now a different value -because it's running with different seed | 15:11 |
markos | ridx 116 offs 60 | 15:11 |
lkcl | ok you're likely over-running the index array | 15:11 |
markos | but I see no 60(0x3C) value in the registers | 15:11 |
markos | in order to avoid the >32 GPR problem I modified the code thus: | 15:12 |
lkcl | can you commit again? | 15:12 |
markos | setvl 0,0,7,0,1,1 # Set VL to 7 elements | 15:12 |
markos | sv.ori *tmp2, *divt, 0 | 15:12 |
markos | sv.svstep/mrr *divt, 6, 1 | 15:12 |
markos | svindex 29,0b1,1,0,0,0,0 | 15:12 |
markos | sv.ori *divt, *tmp2, 0 | 15:12 |
markos | tmp2=116, divt=14 | 15:12 |
markos | 14 has the correct values, 6,5,4,3,2,1,0 | 15:13 |
lkcl | ok so from regs (2*29) | 15:13 |
lkcl | what does the... | 15:14 |
lkcl | where the hell is it... | 15:14 |
lkcl | indexed_iterator() debug message say? | 15:14 |
lkcl | that'll tell you where it's starting from | 15:14 |
*** octavius <octavius!~octavius@43.125.93.209.dyn.plus.net> has joined #libre-soc | 15:15 | |
lkcl | it should be 58 as the base | 15:15 |
lkcl | you should be getting: | 15:16 |
lkcl | indexed_iterator 58, 0, 6, 64 | 15:16 |
lkcl | indexed_iterator 58, 1, 5, 64 | 15:16 |
lkcl | indexed_iterator 58, 2, 4, 64 | 15:16 |
lkcl | ... | 15:16 |
markos | sigh, have to rerun because I have SILENCELOG and these entries are with log, not print | 15:16 |
markos | ok, that will take a while... | 15:16 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@broadband-188-32-220-156.ip.moscow.rt.ru> has quit IRC | 15:22 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.166.174> has joined #libre-soc | 15:23 | |
markos | ok, another case with full logs this time | 15:24 |
markos | ridx 116 offs 31 | 15:24 |
markos | setvl 0,0,7,0,1,1 # Set VL to 7 elements | 15:24 |
markos | sv.ori *tmp2, *divt, 0 | 15:24 |
markos | sv.svstep/mrr *divt, 6, 1 | 15:24 |
markos | svindex 29,0b1,1,0,0,0,0 | 15:24 |
markos | sv.ori *divt, *tmp2, 0 | 15:24 |
markos | argh | 15:24 |
markos | sorry | 15:24 |
markos | ridx 58 offs 0 | 15:24 |
markos | indexed_iterator 58 0 11 64 | 15:24 |
markos | ridx 58 offs 1 | 15:24 |
markos | indexed_iterator 58 1 18446744073709551493 64 | 15:24 |
markos | ridx 58 offs 2 | 15:24 |
markos | indexed_iterator 58 2 18446744073709551519 64 | 15:24 |
markos | SVSHAPE 0 idx, end 2 18446744073709551519 0b111 | 15:24 |
markos | overflow? | 15:24 |
markos | negative overflow? | 15:25 |
lkcl | 1 sec... | 15:25 |
lkcl | >>> hex(18446744073709551493) | 15:25 |
lkcl | '0xffffffffffffff85' | 15:25 |
lkcl | >>> hex(1844674407370955151) | 15:25 |
lkcl | '0x199999999999998f' | 15:25 |
lkcl | what's the contents of the regs at that point? | 15:30 |
lkcl | even this is "weird" | 15:31 |
lkcl | <markos> indexed_iterator 58 0 11 64 | 15:31 |
lkcl | that would say that register 58 contains the value "11" | 15:32 |
lkcl | remap = self.gpr(self.svgpr, True, idx, ew_src).value | 15:32 |
lkcl | log ("indexed_iterator", self.svgpr, idx, remap, ew_src) | 15:32 |
markos | 0xb, yes | 15:32 |
markos | that's correct | 15:32 |
lkcl | ok. | 15:32 |
lkcl | what's the full contents of regs 58-63? | 15:32 |
markos | reg 58 onwards: 0000000b ffffffffffffff85 ffffffffffffff9f ffffffffffffff96 00000024 ffffffffffffffc3 | 15:32 |
markos | reg 64 0000001a ffffffffffffff82 | 15:32 |
lkcl | ok then that's the source of the problem | 15:33 |
lkcl | you can't have negative indices. | 15:33 |
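To see why these values count as negative indices, the 64-bit register contents can be reinterpreted as signed integers. A minimal Python sketch (the helper name `to_signed64` is hypothetical):

```python
def to_signed64(value):
    """Interpret an unsigned 64-bit register value as two's-complement signed."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >> 63 else value


# Values taken from the indexed_iterator output quoted above.
for raw in (11, 18446744073709551493, 18446744073709551519):
    print(hex(raw), to_signed64(raw))
```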
markos | well, the source of the problem is that it's still the wrong place for the indices | 15:33 |
lkcl | no, it's the correct place. | 15:33 |
lkcl | <markos> svindex 29,0b1,1,0,0,0,0 | 15:33 |
lkcl | 2*29 = 58 | 15:34 |
markos | hm | 15:34 |
markos | you're correct -as usual- so if I want to use indices in reg. 14, I'll set svindex 7? | 15:35 |
markos | because svindex is shifted? | 15:35 |
lkcl | yes. by 2. | 15:35 |
lkcl | this saved ghostmansd[m] some effort when doing the svshape2 instruction | 15:35 |
lkcl | otherwise he had to define a special custom operand | 15:36 |
lkcl | s/svshape2/svindex | 15:36 |
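The conclusion reached here (the svindex SVG operand is shifted up by one bit, so operand 29 reads indices from r58 and operand 7 reads them from r14) can be modelled with a short Python sketch. It follows the `index_remap` pseudocode quoted earlier in the log; the register file as a plain list, the function names, the range check and the register numbers in the example are all assumptions made for illustration, not the simulator's code.

```python
def indexed_remap(regs, svg, vl, offset=0):
    """Read VL indices starting at GPR (svg << 1), as in the quoted index_remap."""
    base = svg << 1  # e.g. operand 29 -> r58, operand 7 -> r14
    return [regs[base + i] + offset for i in range(vl)]


def remapped_copy(regs, dest, src, svg, vl):
    """Model a vector copy whose source operand is indexed: dest[i] = src[idx[i]]."""
    indices = indexed_remap(regs, svg, vl)
    for step, idx in enumerate(indices):
        # Assumption: reject out-of-range indices here; the log describes the
        # real behaviour as undefined (the simulator raised a KeyError).
        if not 0 <= idx < vl:
            raise ValueError(f"index {idx} at step {step} is out of range")
        regs[dest + step] = regs[src + idx]


regs = [0] * 128
values = [420, 280, 210, 168, 140, 120, 105]   # sample data
for i in range(7):
    regs[14 + i] = 6 - i          # reversed indices 6..0 in r14..r20
    regs[116 + i] = values[i]     # source vector in r116..r122
remapped_copy(regs, dest=108, src=116, svg=7, vl=7)
print(regs[108:115])
```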
markos | ok, so I just caused a hw trap then :D | 15:36 |
lkcl | actually, "undefined" behaviour | 15:37 |
lkcl | the cost in hardware at that extremely early stage is too great to do any error-checking | 15:37 |
markos | as long as it doesn't fry the CPU/FPGA due to an electrical loop, I guess it's ok :) | 15:38 |
lkcl | no sv.HCF instruction. got it. | 15:40 |
lkcl | holy shit, strncpy works | 16:04 |
markos | how many instructions? :) | 16:17 |
lkcl | "mtspr 9, 4", # move r4 to CTR | 16:17 |
lkcl | "setvl 1, 0, %d, 0, 1, 1" % maxvl, # VL (and r1) = MIN(CTR,MAXVL=4) | 16:17 |
lkcl | "sv.lbzu/pi *16, 1(10)", # load VL characters | 16:17 |
lkcl | "sv.cmpi/ff=eq/vli *0,1,*16,0", # compare against zero, truncate | 16:17 |
lkcl | "sv.stbu/pi *16, 1(12)", # scalar r22 += 24 on update | 16:17 |
lkcl | "sv.bc/all 16, *0, -0x1c", # branch, test CTR, reducing by VL | 16:17 |
lkcl | am just fixing a bug where it'll stop if there's a null-char in the middle of the string | 16:19 |
lkcl | string = "hello\x00bye\x00" | 16:19 |
markos | well if it's a null-char in the middle of the string, stopping is correct :) | 16:25 |
lkcl | uhhuhn | 16:26 |
markos | unless you don't use C-strings | 16:26 |
markos | but strncpy IS using C strings | 16:26 |
lkcl | okeeee | 16:27 |
lkcl | oleeee | 16:27 |
lkcl | got it, by a matter of playing "guess the parameter to sv.bc/all" | 16:27 |
lkcl | "mtspr 9, 4", # move r4 to CTR | 16:28 |
lkcl | "setvl 1, 0, %d, 0, 1, 1" % maxvl, # VL (and r1) = MIN(CTR,MAXVL=4) | 16:28 |
lkcl | "sv.lbzu/pi *16, 1(10)", # load VL characters | 16:28 |
lkcl | "sv.cmpi/ff=eq/vli *0,1,*16,0", # compare against zero, truncate | 16:28 |
lkcl | "sv.stbu/pi *16, 1(12)", # scalar r22 += 24 on update | 16:28 |
lkcl | "sv.bc/all 0, *2, -0x1c", # test CTR *and* stop if cmpi failed | 16:28 |
lkcl | compared to 240 VSX instructions. | 16:28 |
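For readers following the listing, here is a behavioural sketch in Python of what the loop appears to do, based on lkcl's inline comments: process at most MAXVL bytes per pass, truncate the vector at the first NUL (keeping the NUL), and stop once CTR is exhausted or a NUL was seen. It is an assumption-laden model, not a translation of the SVP64 semantics; in particular it does not pad the destination with NULs the way C `strncpy` does, and the name `vector_strncpy` is hypothetical.

```python
def vector_strncpy(src, n, maxvl=4):
    """Copy up to n bytes from src in chunks of at most maxvl, stopping after a NUL."""
    dest = bytearray()
    ctr = n   # models CTR: bytes still allowed to be copied
    pos = 0   # models the post-incremented source address
    while ctr > 0:
        vl = min(ctr, maxvl)               # setvl: VL = MIN(CTR, MAXVL)
        chunk = src[pos:pos + vl]          # load up to VL characters
        nul = chunk.find(0)
        if nul != -1:
            chunk = chunk[:nul + 1]        # truncate VL, keeping the NUL itself
        dest += chunk                      # store the (possibly shortened) vector
        pos += len(chunk)
        ctr -= len(chunk)                  # branch step: reduce CTR by VL
        if nul != -1 or len(chunk) == 0:
            break                          # stop once the compare hit a NUL
    return bytes(dest)


print(vector_strncpy(b"hello\x00bye\x00", 10))
```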
markos | lol | 16:28 |
lkcl | and... 20? for RVV? | 16:28 |
lkcl | https://github.com/riscv/riscv-v-spec/blob/master/example/strncpy.s | 16:29 |
lkcl | bear in mind that's 32-bit instructions | 16:30 |
markos | I stopped considering learning Risc-V entirely when I saw how many instructions and intrinsics they added for RVV | 16:30 |
markos | I'd prefer learning 6502/Z80/68k asm | 16:31 |
lkcl | it's still 24 instructions | 16:31 |
markos | and 20k intrinsics | 16:31 |
markos | really they must be insane | 16:31 |
lkcl | total space above.... | 16:31 |
lkcl | mtspr=4 | 16:31 |
lkcl | setvl=4 | 16:31 |
lkcl | 4x sv.xxxx = 4 | 16:32 |
lkcl | sorry | 16:32 |
lkcl | mtspr=1 32-bit | 16:32 |
lkcl | setvl 1 32-bit | 16:32 |
lkcl | 4x sv.xxxx = 4x 64-bit | 16:32 |
markos | even so, it's <10 instructions | 16:32 |
lkcl | = 8x 32-bit | 16:32 |
lkcl | = 10 instructions | 16:32 |
lkcl | 10 32-bit words | 16:32 |
markos | lol, <= 10 | 16:32 |
lkcl | for a vectorised strncpy, based on general-purpose instructions, where MAXVL may be set up to.... 127. | 16:33 |
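The size tally above can be written out as a short Python calculation. The instruction widths are the ones lkcl states (32-bit for the two scalar instructions, 64-bit for each of the four `sv.` prefixed ones); the list and helper name are just for illustration.

```python
# (mnemonic, width in bits) for the six-instruction listing above.
LISTING = [
    ("mtspr", 32),
    ("setvl", 32),
    ("sv.lbzu/pi", 64),
    ("sv.cmpi/ff=eq/vli", 64),
    ("sv.stbu/pi", 64),
    ("sv.bc/all", 64),
]


def size_in_words(instructions, word_bits=32):
    """Total code size expressed in 32-bit words."""
    return sum(bits // word_bits for _, bits in instructions)


print(size_in_words(LISTING), "words,", size_in_words(LISTING) * 4, "bytes")
```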
lkcl | the LD/ST Fault-First variant is... is it any different? | 16:33 |
lkcl | no it isn't (ok, set the ld-fault-first mode - "sv.lbzu/pi/lf *16, 1(10)") | 16:34 |
lkcl | but that's all | 16:34 |
markos | OT, while doing arm fdct, I noticed the instruction vqrdmulhq_s16, I was wondering why such a specialized instruction | 16:37 |
lkcl | meh? | 16:37 |
lkcl | what is it? | 16:37 |
markos | turns out they are *exactly* tailored to calculating the butterfly coefficients for DCT | 16:37 |
markos | Signed saturating Rounding Doubling Multiply returning High half | 16:37 |
lkcl | multiply rounded double.... | 16:38 |
* lkcl screams | 16:38 | |
markos | :D | 16:38 |
lkcl | but only briefly | 16:38 |
markos | it's basically this: fdct_round_shift((a +/- b) * c) | 16:38 |
markos | can be done in one instruction: vqrdmulhq_s16(vaddq_s16(a, b), 2 * c); | 16:38 |
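The equivalence markos points out can be checked with a scalar Python model of one 16-bit lane. Two things here are assumptions not stated in the log: that `fdct_round_shift` rounds and shifts right by 14 bits, and the sample coefficient 11585. The helper names are hypothetical as well.

```python
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1


def sqrdmulh_s16(a, b):
    """Signed saturating Rounding Doubling Multiply returning High half, one lane."""
    product = (2 * a * b + (1 << 15)) >> 16   # double, round, keep the high half
    return max(INT16_MIN, min(INT16_MAX, product))


DCT_CONST_BITS = 14  # assumption: rounding shift used by fdct_round_shift


def fdct_round_shift(x):
    """Round to nearest and shift right by DCT_CONST_BITS."""
    return (x + (1 << (DCT_CONST_BITS - 1))) >> DCT_CONST_BITS


c = 11585  # sample coefficient (assumption); 2 * c must still fit in 16 bits
for x in (-3000, -1, 0, 1, 1234, 3000):
    print(x, sqrdmulh_s16(x, 2 * c), fdct_round_shift(x * c))
```

Doubling the coefficient turns the instruction's effective shift of 15 bits into the 14-bit rounding shift, which is why `2 * c` appears in the one-instruction form.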
lkcl | oh look! that's what the 3-in 2-out butterfly instructions are. | 16:39 |
lkcl | ffmadds | 16:39 |
markos | yup | 16:39 |
lkcl | but integer variants will be needed | 16:39 |
markos | was thinking that you've already done that in a more generic way :) | 16:39 |
lkcl | no it's exactly the same principle, funnily enough | 16:39 |
lkcl | except that you need a scalar instruction | 16:40 |
lkcl | (to which the triple-loop DCT Schedule is applied) | 16:40 |
markos | reading the instruction I was constantly thinking "who would need such a specialized instruction" | 16:40 |
lkcl | :) | 16:40 |
markos | and then I saw the fdct code :) | 16:40 |
markos | otoh, arm only provides int versions | 16:43 |
*** tplaten <tplaten!~isengaara@d536c9d8.access.ecotel.net> has joined #libre-soc | 17:09 | |
tplaten | having a look at sdram_init -> sdram_write_leveling_rst_bitslip | 17:43 |
tplaten | I also found this document: https://www.intel.com/content/www/us/en/docs/programmable/683385/17-0/read-and-write-leveling.html | 17:43 |
tplaten | I guess I found the two commands that I need to configure the bitslip: | 17:55 |
tplaten | static void sdram_read_leveling_rst_bitslip(char m) | 17:55 |
tplaten | static void sdram_read_leveling_inc_bitslip(char m) | 17:55 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.166.174> has quit IRC | 18:25 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.53.150> has joined #libre-soc | 18:28 | |
*** jab <jab!~jab@courtmarriott2.wintek.com> has joined #libre-soc | 18:55 | |
*** tplaten <tplaten!~isengaara@d536c9d8.access.ecotel.net> has quit IRC | 19:33 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.53.150> has quit IRC | 19:41 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.196.73.48> has joined #libre-soc | 19:42 | |
*** jab <jab!~jab@courtmarriott2.wintek.com> has quit IRC | 22:54 | |
*** jab <jab!~jab@courtmarriott2.wintek.com> has joined #libre-soc | 22:55 | |
*** octavius <octavius!~octavius@43.125.93.209.dyn.plus.net> has quit IRC | 23:23 |