*** jn <jn!~quassel@user/jn/x-3390946> has quit IRC | 03:30 | |
*** jn <jn!~quassel@2a02:908:1066:b7c0:20d:b9ff:fe49:15fc> has joined #libre-soc | 03:32 | |
*** jn <jn!~quassel@2a02:908:1066:b7c0:20d:b9ff:fe49:15fc> has quit IRC | 03:32 | |
*** jn <jn!~quassel@user/jn/x-3390946> has joined #libre-soc | 03:32 | |
programmerjake | lkcl: I'm thinking predicated prefix sum is too complex to figure out easily, plus it produces hard-to-use outputs, so what do you think about declaring prefix-sum with predicated off elements as undefined? | 04:11 |
---|---|---|
programmerjake | I'm going to go ahead and do that for now | 04:14 |
programmerjake | another thing I ran into is in iterate_indices, it reverses steps if invxyz[1], however that is actually nonsensical, reversing steps doesn't produce a useful operation (unlike reversing indices, which is equivalent to reversing vector elements before and after the prefix-sum/reduction so is useful) | 04:42 |
programmerjake | it makes it unnecessarily more complex, so I'm going to copy the existing function, remove steps reversing, and add prefix-sum to that. | 04:43 |
programmerjake | note that reversing steps is equivalent to reversing the top half of the following diagram vertically (aka. not useful afaict): https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_prefix_sum.py;h=2b88407216ccad3fc99a7d633331a30a3d3f562f;hb=HEAD#l167 | 04:48 |
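The remark about reversing indices (as opposed to reversing steps) can be illustrated with a small sketch. This is plain Python written for this note, not the `iterate_indices` code, and the function names are made up. For addition, running the accumulation over reversed indices gives the same result as reversing the vector, doing an ordinary prefix sum, and reversing the result again:

```python
from itertools import accumulate


def prefix_sum(vec):
    """Ordinary inclusive prefix sum, element 0 first."""
    return list(accumulate(vec))


def prefix_sum_reversed_indices(vec):
    """Same accumulation, but walking the indices from the last element down."""
    out = list(vec)
    for i in range(len(out) - 2, -1, -1):
        out[i] += out[i + 1]
    return out


def reverse_prefix_reverse(vec):
    """Reverse the elements, prefix-sum them, then reverse the result."""
    return prefix_sum(vec[::-1])[::-1]


vec = [1, 2, 3, 4, 5]
print(prefix_sum_reversed_indices(vec))
print(reverse_prefix_reverse(vec))
```

Both calls print the same list (a suffix sum), which is why reversed indices stay useful while reversed steps do not obviously correspond to anything.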
ghostmansd | lkcl, FYI: https://salsa.debian.org/Kazan-team/mirrors/openpower-isa/-/jobs/4167814 | 04:50 |
ghostmansd | FAILED src/openpower/decoder/isa/test_caller_svp64_ldst.py::DecoderTestCase::test_sv_load_dd_ffirst_excl - AssertionError: 2 != 1 | 04:51 |
ghostmansd | Broken in master | 04:51 |
ghostmansd | other than this test, nopr branch seems to produce the same results as master | 04:52 |
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC | 04:57 | |
ghostmansd[m] | Correction: this test fails both in master and nopr | 05:00 |
programmerjake | yeah, it was broken from the start afaict...just ignore it for now, luke can fix it later | 05:01 |
ghostmansd[m] | Ok, thank you, later today I'll merge nopr branches both into gdb and openpower-isa | 05:02 |
*** jn <jn!~quassel@user/jn/x-3390946> has quit IRC | 05:04 | |
*** jn <jn!~quassel@ip-095-223-044-193.um35.pools.vodafone-ip.de> has joined #libre-soc | 05:06 | |
*** jn <jn!~quassel@ip-095-223-044-193.um35.pools.vodafone-ip.de> has quit IRC | 05:06 | |
*** jn <jn!~quassel@user/jn/x-3390946> has joined #libre-soc | 05:06 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has quit IRC | 05:10 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has joined #libre-soc | 05:11 | |
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has joined #libre-soc | 06:21 | |
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC | 06:27 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has quit IRC | 06:49 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has joined #libre-soc | 06:53 | |
markos | toshywoshy, lkcl, programmerjake what about Thursday afternoon, biweekly for the svp64 meetings? | 09:14 |
*** yambo <yambo!~yambo@069-145-120-113.biz.spectrum.com> has quit IRC | 09:20 | |
*** midnight <midnight!~midnight@user/midnight> has quit IRC | 09:21 | |
programmerjake | oh, i'm busy for some of this thursday afternoon, so idk if i can make it | 09:25 |
programmerjake | oh, wait, it's probably not afternoon for me when you're thinking | 09:26 |
programmerjake | what time? | 09:26 |
*** midnight <midnight!~midnight@user/midnight> has joined #libre-soc | 09:28 | |
*** yambo <yambo!~yambo@069-145-120-113.biz.spectrum.com> has joined #libre-soc | 09:32 | |
markos | right, it's probably going to be morning for you I guess | 09:37 |
markos | I'd say pick a time between 3pm-7pm UK time | 09:38 |
programmerjake | 7pm? if it's earlier than 6pm i likely won't make it | 09:43 |
programmerjake | tbh i prefer later than 7pm if that works for you all | 09:45 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has quit IRC | 09:49 | |
programmerjake | though I am wondering if we have to have meetings, since afaict email and irc have been working fine...sorry, i had missed the part where it was explained why we needed SVP64 meetings. for recording presentations, wouldn't it work fine to record them individually and then publish them | 09:50 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.44.124> has joined #libre-soc | 09:50 | |
programmerjake | or are these meetings where we're expecting non-libre-soc people to show up and ask questions? | 09:51 |
markos | it's basically mostly internal, to share svp64 assembly for people who are not yet up to speed | 09:53 |
markos | but I think it could just as well be for other interested people also | 09:54 |
markos | it's not for recording presentations for conferences etc | 09:54 |
programmerjake | ah, so not a major problem if i miss any | 09:54 |
programmerjake | since afaict i'm mostly up to speed on svp64 | 09:55 |
markos | no, though people will probably benefit from your technical knowledge :) | 09:55 |
markos | you are, others not as much :) | 09:55 |
markos | the point is not to train your or Luke :) | 09:56 |
markos | s/your/you | 09:56 |
programmerjake | ah, ok. | 09:56 |
programmerjake | i think we should see who all wants to attend, e.g. if cesar wants to attend we'd have to work around his work schedule | 09:58 |
lkcl | programmerjake, i solved predication in the parallel-reduction case. if you can write a short (10-20 lines) python script in a non-predicated demo, like you did last time (but this time excluding predication entirely) i can work it out | 10:00 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.44.124> has quit IRC | 10:02 | |
programmerjake | lkcl, you'd want https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/prefix_sum.py;h=23eca36e2bb748c296c5a7ca88b9fa578258c653;hb=HEAD#l35 | 10:03 |
programmerjake | it's short and to-the-point | 10:03 |
lkcl | excellent. | 10:03 |
lkcl | ok so the inverted-bit (going out again) is the bit i need. | 10:03 |
programmerjake | do copy it somewhere else to hack on it... | 10:04 |
lkcl | predication is *solved* jacob. | 10:04 |
cesar12 | No, go ahead, I'm not that interested in SVP64 assembly right now, more focused on low level HDL and Formal Verification. | 10:04 |
programmerjake | ok, cesar | 10:04 |
lkcl | it's done by maintaining a suite of indices where instead of a MV operation the indices are MVed. | 10:04 |
lkcl | such that on the next operation that would *otherwise* have needed a MV, the source operand is taken from the *MVed index* position | 10:05 |
programmerjake | except that prefix sum has no moves | 10:05 |
lkcl | so, do predicated elements remain where they are? | 10:07 |
programmerjake | and if you tried to renumber indices based on skipping lanes predicated out, you'd end up with a highly variable pattern difficult to optimize hw for | 10:07 |
lkcl | tough. | 10:07 |
programmerjake | prefix sum is unpredicated | 10:07 |
lkcl | then the developer must perform a predicated VCOMPRESS/VEXPAND before/after | 10:07 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.73.253.155> has joined #libre-soc | 10:07 | |
programmerjake | ok, fine with me | 10:08 |
lkcl | i'd like to keep the predication-index-moving-thing in because it works, and we may find that someone gets it to work | 10:08 |
lkcl | if they find it's low performance and use VCOMPRESS/VEXPAND, they learned something :) | 10:09 |
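A rough sketch of the "compress, unpredicated prefix sum, expand" approach mentioned above. `vcompress` and `vexpand` here are plain-Python stand-ins invented for illustration, not the SVP64 operations themselves, and leaving masked-out elements unchanged is an assumption of this sketch (earlier in the log they were proposed as undefined):

```python
from itertools import accumulate


def vcompress(vec, mask):
    """Pack the elements whose mask bit is set into a contiguous list."""
    return [v for v, m in zip(vec, mask) if m]


def vexpand(packed, mask, fill):
    """Scatter packed elements back to the positions whose mask bit is set.

    Positions with a clear mask bit take their value from `fill`.
    """
    it = iter(packed)
    return [next(it) if m else f for m, f in zip(mask, fill)]


def predicated_prefix_sum(vec, mask):
    packed = vcompress(vec, mask)      # predicated compress
    summed = list(accumulate(packed))  # unpredicated prefix sum
    return vexpand(summed, mask, vec)  # predicated expand


print(predicated_prefix_sum([1, 2, 3, 4, 5], [1, 0, 1, 1, 0]))
```

The prefix sum in the middle never sees the mask, so the schedule it uses stays regular regardless of which lanes are predicated out.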
lkcl | here: | 10:09 |
lkcl | + # start a loop from the lowest step | 10:09 |
lkcl | + step = 1 | 10:09 |
lkcl | + while step < xd: | 10:09 |
lkcl | +     step *= 2 | 10:09 |
lkcl | +     stepend = step >= xd # note end of steps | 10:09 |
lkcl | is that basically the same as the nmigen prefix_sum_ops algorithm? | 10:10 |
programmerjake | no but it's similar | 10:10 |
programmerjake | step = 2 * dist | 10:11 |
lkcl | but achieves a work-efficient schedule? | 10:11 |
programmerjake | but reduction operates differently than prefix-sum because it does operations toward the other end... | 10:11 |
programmerjake | reduction achieves a work-efficient schedule, but it's somewhat different than the prefix-sum work-efficient schedule | 10:13 |
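To make the difference between the two schedules concrete, here is a generic textbook-style sketch. It is not a copy of `nmutil.prefix_sum` or of `remap_preduce_yield.py`, and the function names are assumptions. The reduction folds everything toward element 0; the work-efficient prefix sum does an up-sweep toward the high end followed by a down-sweep that fills in the remaining elements:

```python
def tree_reduce_to_element0(items):
    """Pairwise tree reduction; the total ends up in element 0."""
    items = list(items)
    n = len(items)
    step = 1
    while step < n:
        for i in range(0, n - step, step * 2):
            items[i] = items[i] + items[i + step]
        step *= 2
    return items


def work_efficient_prefix_sum(items):
    """Inclusive prefix sum using an up-sweep then a down-sweep."""
    items = list(items)
    n = len(items)
    dist = 1
    # up-sweep: partial sums accumulate toward the high-numbered elements
    while dist < n:
        for i in range(dist * 2 - 1, n, dist * 2):
            items[i] = items[i - dist] + items[i]
        dist *= 2
    # down-sweep: fill in the elements the up-sweep skipped
    dist //= 2
    while dist >= 1:
        for i in range(dist * 3 - 1, n, dist * 2):
            items[i] = items[i - dist] + items[i]
        dist //= 2
    return items


print(tree_reduce_to_element0([1, 2, 3, 4, 5])[0])
print(work_efficient_prefix_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```

The doubling step pattern looks alike in both, but the reduction writes to the low index of each pair while the up-sweep writes to the high index, which matches the "operations toward the other end" remark.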
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.73.253.155> has quit IRC | 10:14 | |
lkcl | okaaay. sounding like a separate iterator function is needed: i thought it was identical-first-half | 10:14 |
programmerjake | it's similar, i thought it was identical | 10:14 |
programmerjake | i didn't think through all the details at the time | 10:15 |
lkcl | well, there's room. submode=0b10 and 0b11 | 10:15 |
lkcl | it's all good | 10:15 |
lkcl | ok let me just tie this in... | 10:16 |
programmerjake | they can probably share a lot of hw at least... | 10:16 |
lkcl | yehyeh | 10:16 |
programmerjake | note the code in the `if` that i committed is ported from nmutil.prefix_sum | 10:16 |
programmerjake | so you don't need to re-convert it | 10:17 |
lkcl | i'm just going to link iterate_indices2() into SVSHAPE.get_iterator | 10:18 |
lkcl | that's all | 10:18 |
programmerjake | https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_preduce_yield.py;h=9e9fa2a69a0efd0a7794149353fff14d6fbcd73a;hb=0b6592c574f814d81cfede4c74c50b583590db13#l49 | 10:18 |
programmerjake | if you don't mind my having removed steps.reverse(), just delete the existing iterate_indices and rename iterate_indices2 -> iterate_indices | 10:19 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has joined #libre-soc | 10:20 | |
lkcl | yes i mind. although it should be mirroring rather than total-inversion | 10:20 |
programmerjake | it should function identically for all useful cases | 10:20 |
lkcl | i.e. the end-result of inversion is that the result ends up in element VL-1 rather than element 0 | 10:21 |
programmerjake | well, in that case just copy the `if` block to iterate_indices and delete iterate_indices2 | 10:22 |
lkcl | gimme a sec... ok done | 10:24 |
lkcl | python3 decoder/isa/test_caller_svp64_parallel_reduce.py >& /tmp/f | 10:24 |
lkcl | nothing "damaged" | 10:24 |
lkcl | next step: simplev.mdwn | 10:25 |
programmerjake | k, i'm going to sleep, so ttyl | 10:27 |
lkcl | night jacob, thanks for your help | 10:30 |
markos | argh, how the heck do tables work in markdown? | 10:36 |
lkcl | |heading1|heading2| | 10:36 |
markos | https://libre-soc.org/openpower/sv/cookbook/chacha20/ | 10:36 |
lkcl | |-----|-----| | 10:36 |
lkcl | |rowdata1|rowdata2| | 10:36 |
markos | yeah, I've done that but I'm getting crap formatting | 10:36 |
lkcl | 1 sec let me take a look | 10:36 |
lkcl | you forgot the headings | 10:37 |
programmerjake | now that i look at the time, i'm unlikely to make it in time for a 6pm BST meeting, maybe 7pm? sorry | 10:37 |
programmerjake | don't count on me attending today | 10:38 |
markos | programmerjake, well no one agreed for today anyways, don't worry | 10:38 |
lkcl | markos, fixed the 1st table, you can see what it looks like now. | 10:39 |
markos | aha! | 10:39 |
lkcl | if you add extra "|----|----|"s it just adds "-----" into cells | 10:40 |
lkcl | you nearly had it - just the missing headings | 10:41 |
lkcl | the format you were thinking of is more restructured-text | 10:41 |
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has joined #libre-soc | 10:43 | |
markos | finally | 10:44 |
markos | ok, so this should be ok now | 10:44 |
markos | could you please check if the VF description is adequate? | 10:44 |
markos | so that I can finally get this 'done' :) | 10:45 |
markos | argh, accidentally removed the TOML values | 10:49 |
markos | fixed | 10:51 |
lkcl | yes am good with it. i just added a REMAP Indexing quick intro as well | 10:55 |
markos | great, I'm closing this as fixed then | 10:57 |
markos | ok, now to moving to the butterfly instructions :) | 10:58 |
lkcl | :) | 11:04 |
lkcl | just added an intro section, no conclusion - i think the assembler itself is enough. | 11:09 |
lkcl | aand we're good | 11:09 |
markos | great, (belated) RFPs sent for those :) | 11:12 |
lkcl | got it. do update the toml field(s) | 11:13 |
lkcl | markos = {amount=NNN, submitted=date} i forget the format YYYY-MM-DD? | 11:13 |
lkcl | ok you can see in https://bugs.libre-soc.org/show_bug.cgi?id=1007 | 11:14 |
markos | didn't I fix it? did I do it wrongly? | 11:15 |
lkcl | you need to keep the *bugzilla* records consistent with the RFP | 11:15 |
lkcl | (and you put in EUR 1800 not EUR 1700 which i don't mind) | 11:16 |
markos | argh | 11:16 |
markos | crap, can I edit it? | 11:16 |
lkcl | The table of payments (in EUR) for this task; TOML format: | 11:16 |
lkcl | (edit) | 11:16 |
lkcl | markos=1100 | 11:16 |
lkcl | lkcl={amount=400, submitted=2023-03-25} | 11:16 |
lkcl | nope. it's in, and approved. | 11:16 |
lkcl | so i retrospectively changed the amount to 1800 | 11:16 |
lkcl | https://bugs.libre-soc.org/show_bug.cgi?id=1007 | 11:16 |
lkcl | you need to edit the TOML field and put | 11:17 |
markos | we'll balance it out in the next one :) | 11:17 |
markos | sorry about that | 11:17 |
lkcl | markos={amount=1100, submitted=2023-04-27} | 11:17 |
markos | at worst I'll buy you an expensive bottle of wine :-) | 11:17 |
lkcl | likewise in 1006, put the record of the same date | 11:17 |
lkcl | :) | 11:17 |
lkcl | don't worry about it | 11:17 |
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC | 11:21 | |
lkcl | markos, ok great - let me run the budget-sync thing and you can review your page https://libre-soc.org/task_db/mdwn/markos/ | 11:25 |
lkcl | give it 2 mins... | 11:25 |
lkcl | ok it's updated. normally what i do is actually copy that markdown auto-generated table *into* the "RFP comments/results" | 11:26 |
lkcl | but it requires that you run the budget-sync program and do the TOML-field editing before _that_... | 11:27 |
lkcl | but at least you can see: "Submitted but not yet paid" now contains the two tasks you are waiting for an RFP for, to be paid, yay | 11:29 |
markos | indeed, thanks | 11:36 |
markos | lkcl, btw, reg butterfly insns, should those go to fixedarith.mdwn or own file? | 11:40 |
markos | I'm going to do something better than what Arm is doing, their versions are not as precise so we cannot use them everywhere as expected | 11:51 |
markos | can we do 3-in, 2-out? | 11:51 |
markos | which form is that? | 11:52 |
markos | or 4-in, 1-out | 12:00 |
markos | 4-in might be useful to add in a right-shift immediate | 12:01 |
markos | basically the instructions are trying to emulate fdct_round_shift((a +/- b) * c) | 12:02 |
markos | if we can do 2-out then we can do both fdct_round_shift((a + b) * c) and fdct_round_shift((a - b) * c) in the same instruction | 12:02 |
markos | if not then we have to provide 2 instructions for that, but in that case, we can use an extra instruction for the shifting | 12:03 |
markos | er, extra operand | 12:03 |
markos | fdct_round_shift(x) is essentially ROUND_POWER_OF_TWO(x, DCT_CONST_BITS) | 12:05 |
markos | where #define ROUND_POWER_OF_TWO(value, n) (((value) + (1 << ((n)-1))) >> (n)) | 12:05 |
markos | and DCT_CONST_BITS = 14 | 12:06 |
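The macro above translates directly into Python, which makes it easy to try the proposed twin operation on a few values. This is a sketch for illustration only: `butterfly` is a hypothetical name, and the coefficient 11585 is just an example value (roughly cos(pi/4) scaled by 2**14), not something taken from the proposed instruction:

```python
DCT_CONST_BITS = 14


def round_power_of_two(value, n):
    """Python form of ROUND_POWER_OF_TWO: add half an LSB, then shift right by n."""
    return (value + (1 << (n - 1))) >> n


def fdct_round_shift(x):
    return round_power_of_two(x, DCT_CONST_BITS)


def butterfly(a, b, c):
    """Both halves of the operation: sum and difference, each scaled by c."""
    return fdct_round_shift((a + b) * c), fdct_round_shift((a - b) * c)


print(butterfly(3, 1, 11585))
```

Python's `>>` on negative integers is an arithmetic (flooring) shift, so negative differences round the same way they would with a signed arithmetic shift in C.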
markos | I'd love to be able to do both a+b/a-b in a single instruction though, that would essentially double throughput | 12:08 |
markos | where can I find the possible Forms? | 12:11 |
markos | nevermind, 1.6.1 ISA manual | 12:59 |
markos | lkcl, stupid question, could we assume that an instruction has 2 outputs but only needs one output register? ie, it outputs to RT and RT+1 | 13:11 |
markos | I guess not, but thought I'd ask | 13:14 |
markos | because that way we can have 3-in, RA, RB, RC and 2-outs, RT = (RA+RB)*c and RT+1 = (RA-RB)*c | 13:14 |
markos | if we can squeeze in a 4-bit immediate to right shift, this will be a killer instruction | 13:15 |
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has joined #libre-soc | 13:27 | |
markos | lkcl, actually I think this is already done in fdmadds FRT,FRA,FRC,FRB | 13:37 |
markos | pseudo-code has: FRS <- FPADD32(FRA, FRB) | 13:37 |
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC | 13:37 | |
markos | but there is no FRS declared in the assembly syntax | 13:37 |
markos | this looks a bit wrong | 13:42 |
markos | anyway, could we use the same trick as with svshape and save a bit in output register, and assume a pair of registers written? | 13:43 |
markos | ie instead of RT, provide RT/2, and always assume that this instruction will accumulate both RT and RT+1 | 13:45 |
markos | with accumulate that means you can have the 2-coeffs butterfly operation fdct_round_shift(a * c1 +/- b * c2) with just 2 instructions :) | 13:47 |
markos | you'd just have to swap RA, RB in the second instruction | 13:47 |
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has joined #libre-soc | 14:35 | |
markos | oh well, apparently RT + 1 <- does not work :-/ | 15:06 |
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC | 15:07 | |
lkcl | markos, svfixedarith.mdwn or yes their own file is perfectly fine | 16:14 |
lkcl | yes that's what's been done. the extra operand is declared to exist as RT+1 for scalar-only instructions | 16:15 |
lkcl | and is declared to exist as RT+MAXVL for vectorised instructions | 16:15 |
lkcl | notes are in the spec and they _should_ be at the top of the mdwn file as comments? | 16:15 |
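The rule just described for the implicit second destination can be written down as a one-liner. This is only a restatement of the two chat lines above in Python; the function name and arguments are made up, and the real handling lives in the simulator and decoder rather than in anything like this:

```python
def implicit_second_dest(rt, vectorised=False, maxvl=0):
    """Register number of the implicit second output.

    Scalar instructions: RT+1.  Vectorised instructions: RT+MAXVL.
    """
    if vectorised:
        return rt + maxvl
    return rt + 1


print(implicit_second_dest(6))                           # scalar case
print(implicit_second_dest(6, vectorised=True, maxvl=4)) # vector case
```

Placing the second vector at RT+MAXVL keeps the two result vectors from overlapping whatever the current vector length is.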
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has joined #libre-soc | 16:16 | |
markos | well, I made some progress, mdwn file is written, test case also, added enums, etc. I'm getting the op_maddsubrs (just picked something for now) generated and trying to run it now but getting some errors | 16:16 |
markos | I called the instruction maddsubrs (multiply-add-sub-right-shift) | 16:20 |
lkcl | ok cool! | 16:25 |
lkcl | if you drop it in a branch give me a shout i can take a look. | 16:26 |
lkcl | unless you feel confident it doesn't cause "damage" in which case just shove it in master | 16:26 |
markos | yeah, will do that in a bit, getting some stupid errors right now, don't want to mess up master yet, if I fail to fix it I'll just commit in a branch asap | 16:28 |
markos | it's about GPROperand(RC), getting lists of indices for GPROperand(RT): (6, 7, 8, 9, 10), GPROperand(RA): (11, 12, 13, 14, 15) but GPROperand(RC): None | 16:31 |
markos | in power_insn.py: ~1123: for idx in operand.span: | 16:31 |
markos | lkcl, is it possible to have 3in and a 4-bit immediate for shifting? | 16:32 |
lkcl | yes but you'll need to design a "Form" to do it. 4-bits is a LOT | 16:33 |
markos | ok, where do I put that form? | 16:33 |
lkcl | and you certainly won't get that in the 3-in 2-out ones that already take 4 operands | 16:34 |
lkcl | in fields.text | 16:34 |
markos | perfect | 16:34 |
lkcl | don't rush into that decision: it needs to be a "Researched" RFC / wiki page | 16:34 |
lkcl | (it gets its own budget) | 16:34 |
lkcl | which reminds me to do exactly that, as each of these instructions needs to be listed on a special twin-butterfly page that currently doesn't exist | 16:35 |
lkcl | ooo there's just enough budget | 16:35 |
*** octavius <octavius!~octavius@92.40.169.65.threembb.co.uk> has joined #libre-soc | 16:36 | |
octavius | lkcl, as you've suggested I go back to verilator, that's what I did. Please see bug 1073 when you have some time, I'd really like to figure out what the problem is | 16:38 |
markos | well, started adding form to see how/if I can fit all that | 16:38 |
lkcl | octavius, take a look at the README as well as the source code of the microwatt_verilator main() loop | 16:39 |
octavius | ok | 16:39 |
lkcl | it requires some command-line options | 16:39 |
lkcl | you can probably guess that those command-line options are "the binary to load into RAM" | 16:39 |
lkcl | octavius, you should have worked out that "if it does nothing then you're looking at a black box, stop it" | 16:42 |
octavius | I did stop it | 16:42 |
lkcl | now you've got compiling, a gentle reminder that the purpose of compiling it is to get it to produce gtkwave traces | 16:42 |
octavius | I just ran to remind myself. Last time was in January :) | 16:43 |
lkcl | :) | 16:43 |
lkcl | and that needs *verilator* compile-time options. | 16:43 |
octavius | Yes, I noticed the .vcd file was unreadable | 16:43 |
lkcl | that's probably because it's an fst file (maybe). | 16:43 |
lkcl | use vcd2fst and fst2vcd - whichever one works use that | 16:43 |
octavius | Also the README in the microwatt repo has no info on verilator at all. Looking at microwatt-verilator.cpp as you've suggested | 16:44 |
lkcl | bear in mind that the output from verilator is *not* immediately compatible with gtkwave (sigh) | 16:44 |
octavius | Ah ok | 16:44 |
lkcl | you want the microwatt_verilator branch (only) | 16:44 |
octavius | That's the one I'm using | 16:44 |
lkcl | it's been too long i can't remember everything | 16:44 |
octavius | And I'm guessing you mean "verilator_trace" branch | 16:45 |
lkcl | markos, i'm slightly concerned about the low "XO" bit count of adding shift-immediates, they are incredibly expensive even when you have 3 operands | 16:45 |
lkcl | yes | 16:45 |
lkcl | if it was 2 bits, not so much of a problem, but 4 is a *LOT* | 16:47 |
lkcl | you risk ending up with needing a full Primary Opcode (or 50% of one) | 16:48 |
lkcl | at which point the instruction is highly likely to get rejected by the OPF ISA WG because it is such a "specialist optional" area | 16:48 |
lkcl | something like ternlogi on the other hand brings a massive 256 instructions with it, saving routinely and systematically across general-purpose code | 16:49 |
markos | what do I need to do when I've added a form in the fields.text? plain 'make' chokes, I probably need to run something else, but I forget the sequence | 16:49 |
markos | I've added a BF-Form | 16:49 |
lkcl | but these are *area-specific* (DCT/FFT) and the only reason they can even be considered is because the wikipedia page lists something mad like 120 use-cases | 16:49 |
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC | 16:50 | |
lkcl | please please wait | 16:50 |
markos | yeah, just experimenting now, to see if it's even possible | 16:50 |
lkcl | there is a process for this, we cannot rush ahead adding new Forms arbitrarily without thinking them through and reviewing them | 16:50 |
markos | not going to commit anything | 16:50 |
lkcl | yes, you need to add it to power_enums.py | 16:50 |
lkcl | then add the (new) fields into the later section of fields.text | 16:50 |
lkcl | i don't _mind_ putting them onto the (new) wiki page to see what they look like | 16:51 |
lkcl | (and/or its discussion page) | 16:51 |
markos | I did, BF = 46 at the Form class | 16:51 |
lkcl | oh excellent | 16:52 |
markos | let me paste the form here for starters | 16:52 |
lkcl | good idea | 16:52 |
markos | |0 | 6 |11 |16 |21 | 26 |30 |31 | | 16:52 |
markos | | PO | RT | RA | RB | RC | SH | XO | Rc | | 16:52 |
lkcl | ok so you see how XO is only 1 bit? | 16:52 |
markos | yes, is that a problem? :D | 16:53 |
markos | how many bits does it have to be? can we skip it entirely? | 16:53 |
lkcl | that makes this an absolute top ultra-priority instruction in the same sort of category as "addi" | 16:53 |
lkcl | or "bc" | 16:53 |
markos | ah, I need to add BF to the end of XO(30) | 16:53 |
lkcl | to give some context: if you didn't have "SH" you could add *SIXTEEN* other 4-operand instructions | 16:54 |
lkcl | no, you need to consider that there is limited space and to consider not proposing this instruction *at all* because it risks getting rejected | 16:54 |
markos | well, we could leave the shifting out entirely | 16:54 |
lkcl | the lower the XO, the higher the priority has to be | 16:55 |
markos | I see | 16:55 |
lkcl | and obviously it's an exponential curve | 16:55 |
lkcl | as in, "the higher the number of use-cases" | 16:55 |
markos | well, it's about the gain, if the gain is justified | 16:55 |
lkcl | compared to a 10-bit XO this is destroying the opportunity to add a massive *512* other 2-in 1-out instructions | 16:55 |
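The 16x and 512x figures follow from simple bit counting in a 32-bit instruction word. A back-of-envelope sketch, assuming a 6-bit Primary Opcode, 5-bit register fields and a 1-bit Rc, with everything left over going to XO (the helper names are invented for this note):

```python
WORD_BITS = 32
PO_BITS = 6    # Primary Opcode
REG_BITS = 5   # one register field


def xo_bits(n_regs, imm_bits=0, rc_bits=1):
    """Bits left over for XO after PO, register fields, immediate and Rc."""
    return WORD_BITS - PO_BITS - n_regs * REG_BITS - imm_bits - rc_bits


def insns_per_po(n_xo_bits):
    """How many distinct instructions fit under one Primary Opcode."""
    return 1 << n_xo_bits


bf_form = xo_bits(4, imm_bits=4)  # 4 registers plus 4-bit SH
four_op = xo_bits(4)              # 4 registers, no immediate
three_op = xo_bits(3)             # 2-in 1-out

print(bf_form, four_op, three_op)
print(insns_per_po(four_op) // insns_per_po(bf_form))
print(insns_per_po(three_op) // insns_per_po(bf_form))
```

Under these assumptions one instruction in the 1-bit-XO form occupies the same encoding space as 16 four-operand instructions without the shift field, or 512 instructions with a 10-bit XO.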
markos | I mean Arm did include these instructions but with a fixed shifting value | 16:55 |
lkcl | yes, and they are under similar 32-bit constraints | 16:56 |
lkcl | so you start to appreciate why they did that | 16:56 |
lkcl | they're barely going to pass through as they are, with 3-in 1-out (4 operands taking up 20 bits on their own) | 16:56 |
markos | I do, in a sense, I admit I'm seeing this from my own point of view | 16:56 |
markos | being able to do twin butterfly operations in just 2 instructions is a massive win, from my perspective | 16:57 |
lkcl | which has to be compared against the perspective of millions of programmers doing general-purpose | 16:57 |
lkcl | yes i know! :) | 16:57 |
lkcl | read above: about the 120 use-cases for DCT on the wikipedia page | 16:57 |
lkcl | it's the only reason we can get away with proposing these *at all* | 16:57 |
lkcl | (that, and ARM already added them, we can point at that fact and use it as additional justification) | 16:58 |
markos | well, something like that could bring Power as a top performer in video processing | 16:58 |
lkcl | indeed | 16:58 |
markos | or any kind of media processing | 16:58 |
lkcl | but if it takes up *EIGHT* Primary Opcodes to do so, that's not going to fly | 16:58 |
lkcl | there's only 32 new POs in the EXT2xx area, 10 of which i want to allocate to LD/ST-Post-Increment | 16:59 |
lkcl | (because that *is* a huge saving - every single hot-loop in existence in every general-program benefits) | 16:59 |
markos | I'll play with this a bit | 17:00 |
lkcl | hence, "really high priority" | 17:00 |
markos | I'll try to minimize SH as much as possible | 17:00 |
lkcl | awesome | 17:00 |
markos | would 2-bits be ok? | 17:00 |
lkcl | now, about RC/RS - there's a place in power_decoder2.py that you (or more like i) *may* need to pay attention to | 17:00 |
markos | because if I can assume eg. shifting by a number of bits | 17:01 |
lkcl | not really. that's still two Primary Opcodes | 17:01 |
markos | ok | 17:01 |
lkcl | probably one is ok, and that's risky. it's still an entire PO taken up by the (set of) instructions | 17:01 |
lkcl | because there's what... 8 of them? | 17:01 |
markos | understood | 17:02 |
lkcl | ahhh ok | 17:02 |
lkcl | i remember now | 17:02 |
lkcl | search for "implicit_rs" in power_decoder2.py | 17:02 |
lkcl | that's really important. | 17:02 |
lkcl | it's complicated, but a "special check" is needed for the implicit RS/RC/FRS/FRC instructions, actually right there in the decoder | 17:03 |
lkcl | i.e. you can't just "add instructions to the csv files and hope" | 17:03 |
lkcl | gimme a sec... | 17:03 |
lkcl | sorry i forgot about this, it's been a while | 17:03 |
markos | np | 17:03 |
lkcl | https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/power_decoder2.py;h=88b2023859061d7601a9dc94e052c75ec59fd8b1;hb=d40763cd6e186ad9b17ce6f974a38b4c4877965e#l1057 | 17:04 |
lkcl | actually line 1046 | 17:05 |
lkcl | so you need to decide the XO field (which is in the CSV file) and under which Major (PO) | 17:06 |
lkcl | btw this *really* needs to all go into the wiki page | 17:06 |
lkcl | https://libre-soc.org/openpower/sv/twin_butterfly | 17:06 |
lkcl | so it can be reviewed reaallly carefully | 17:06 |
lkcl | i'll need to do some temporary opcode allocation and find space for them | 17:07 |
lkcl | probably minor_22.csv - i think there's space still | 17:07 |
lkcl | then a section will be needed in power_decoder2.py to match it | 17:07 |
lkcl | 1046 with m.If((major == 59) & xo.matches( | 17:07 |
lkcl | 1047 '-----00100', # ffmsubs | 17:07 |
lkcl | .... | 17:07 |
lkcl | and you can see that one of either RB or RC can be "extended by MAXVL" when Vectorisation is enabled | 17:08 |
lkcl | so you need to decide which that's going to be. | 17:08 |
markos | ok, this needs a lot of thought still | 17:09 |
lkcl | indeed. fortunately there's a trail already blazed | 17:09 |
lkcl | but it's probably best to use the twin_butterfly page to create stub instructions, ultimately intended to be morphed into actual RFC actual Power ISA form | 17:09 |
lkcl | but kept short for now to make it easy to discuss iteratively | 17:10 |
markos | will start adding stuff there asap to discuss | 17:10 |
lkcl | ack | 17:10 |
lkcl | it's got its own budget and bugreport | 17:10 |
lkcl | i'll add the fp butterfly instructions later | 17:11 |
markos | pushed | 17:26 |
markos | This is the original attempt, still with the 4-bit SH | 17:28 |
lkcl | ok great | 17:28 |
markos | pretty sure there are some great misunderstandings on my part here | 17:29 |
markos | ie, I'm not really sure I'm allowed to just write to RT+1 | 17:29 |
markos | and now that I see it, it's probably wrong, it probably adds 1 to the value of RT, not the index | 17:29 |
lkcl | no | 17:30 |
lkcl | it's implicit | 17:30 |
lkcl | you write to RS | 17:30 |
markos | ah, what you said earlier | 17:30 |
lkcl | and ISACaller "knows" to pick that second (implicit) operand up and... yes | 17:30 |
markos | yeah, I need to read about that | 17:30 |
markos | so it's possible then to write to 2 GPRs | 17:31 |
lkcl | have a look at the biginteger page | 17:31 |
markos | nice to know | 17:31 |
lkcl | which contains the kind of spec-wording | 17:31 |
markos | will do | 17:31 |
lkcl | yes but we will get push-back for doing so | 17:31 |
lkcl | because it's what CISC x86 does | 17:31 |
lkcl | so there is a *lot* of "push-back" going to occur on these instructions, hence why if "and we want 8 Primary Opcodes" is part of that, the ISA WG will just flat-out say "no" | 17:32 |
lkcl | prod1 <- MUL(RC, sum) | 17:32 |
lkcl | can just be | 17:32 |
lkcl | RC * sum | 17:33 |
lkcl | just like in fixedarith | 17:33 |
lkcl | let me check... | 17:33 |
lkcl | ah nope, you're right | 17:33 |
lkcl | # Multiply Low Immediate | 17:33 |
lkcl | prod[0:(XLEN*2)-1] <- MULS((RA), EXTS(SI)) | 17:33 |
lkcl | watch out for this: | 17:34 |
lkcl | RT <- prod[XLEN:(XLEN*2)-1] | 17:34 |
lkcl | the result of MUL and MULS is *DOUBLE* the bitwidth | 17:34 |
lkcl | (sum of the length of the two operands) | 17:34 |
markos | right, ofc | 17:34 |
lkcl | and consequently you have to "pick a half" | 17:34 |
lkcl | but of course, you "pick a half in **MSB0** numbering"... sigh | 17:34 |
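A quick way to see which half `prod[XLEN:(XLEN*2)-1]` selects: in MSB0 numbering bit 0 is the most significant bit, so the range XLEN..2*XLEN-1 of a double-width product is the low half. The helper below is a hypothetical illustration using unsigned Python integers, not the pseudocode's actual bit-selection machinery:

```python
XLEN = 64


def msb0_field(value, width, start, end):
    """Extract bits start..end (inclusive) of a width-bit value, MSB0 numbering."""
    length = end - start + 1
    shift = width - 1 - end  # distance of bit `end` from the LSB
    return (value >> shift) & ((1 << length) - 1)


a = 0x123456789ABCDEF0
b = 0x0FEDCBA987654321
prod = a * b  # fits in 2*XLEN bits because both operands fit in XLEN bits

low_half = msb0_field(prod, 2 * XLEN, XLEN, 2 * XLEN - 1)
high_half = msb0_field(prod, 2 * XLEN, 0, XLEN - 1)

print(hex(high_half), hex(low_half))
```

So "Multiply Low" keeps the MSB0 range XLEN..2*XLEN-1, while an instruction returning the high half would take 0..XLEN-1.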
markos | hm, the arm instructions return the high half | 17:35 |
markos | we could add 2 pairs | 17:35 |
lkcl | for accuracy | 17:36 |
markos | one returning the high half and another the low | 17:36 |
lkcl | absolutely no chance of that | 17:36 |
markos | without the shifting bit :) | 17:36 |
lkcl | there's an internal hardware limit we've set of 3-in 2-out | 17:36 |
lkcl | @ 64-bit width | 17:36 |
lkcl | and that's down to the massive complexity that results from doing Register Hazard checking | 17:37 |
lkcl | the only reason we get away with hi-lo-half in the bigint operations is because they're actually a carry-in carry-out chain | 17:37 |
markos | right | 17:37 |
lkcl | so for the internal chain the instructions actually become 2-in 1-out, the first one in the chain is 3-in 1-out, and the last one in the chain is 2-in 2-out | 17:38 |
lkcl | which is the only reason we can get away with such ultra-expensive instructions, that and they'll end up in libgmp | 17:38 |
markos | similarly, these will go in pretty much all video/audio codecs | 17:39 |
*** tplaten <tplaten!~tplaten@195.52.20.159> has joined #libre-soc | 17:39 | |
lkcl | btw no need to put the autogenerated code in the wiki | 17:40 |
lkcl | exactly | 17:40 |
lkcl | like... aaaalll of them | 17:40 |
markos | though, for that reason we could avoid the shifting entirely | 17:40 |
markos | I mean as an operand | 17:40 |
lkcl | which we can easily "fly" on the "IoT / Edge / accelerator" thing | 17:40 |
lkcl | yes pleeease | 17:40 |
markos | only reason I'd want it is for future | 17:41 |
markos | in case a future codec decides to change the number of shift bits | 17:41 |
lkcl | it's too much for me to have to explain, and stake the entire reputation of what we're doing on having the instructions be rejected | 17:41 |
markos | though that's unlikely | 17:41 |
markos | we're good until 2030 | 17:41 |
markos | av1/av2/etc | 17:41 |
markos | :D | 17:41 |
lkcl | ahh if there's specific CODECs that use these instructions explicitly please do list them | 17:41 |
lkcl | that again gives me information i can present in ISA WG meetings, "these are common CODECs, actual implementations, the actual spec says DoThisThing()" | 17:42 |
markos | well, these fdct are all libvpx/av1 | 17:43 |
markos | and av2 | 17:43 |
lkcl | minor_59... what's that supposed to be used for... | 17:43 |
lkcl | _great_! | 17:44 |
lkcl | do put it into the page | 17:44 |
lkcl | every instruction needs a "Rationale" | 17:44 |
lkcl | i.e. | 17:44 |
lkcl | "why as IBM should we invest $50-100 million implementing these instructions" | 17:44 |
lkcl | or {insert-N-E-Other-Power-ISA-Implementor} | 17:44 |
lkcl | opcode 59 is typically stuffed with FP-single | 17:46 |
markos | I just picked 59 randomly :) | 17:46 |
lkcl | yyeah and likely overwrote some official instructions in the process! | 17:47 |
lkcl | *extreme* care needs to be taken here, it's a frickin lot of work | 17:48 |
lkcl | i'm looking at the tables here https://libre-soc.org/openpower/sv/bitmanip/ | 17:48 |
lkcl | how many of these instructions are needed? | 17:48 |
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has joined #libre-soc | 17:48 | |
lkcl | one i know is needed for the inner product, and another for the outer product | 17:48 |
lkcl | so that's at least 2 | 17:49 |
lkcl | then iirc you have to use different ones for iDCT than from DCT, so that's 8 | 17:49 |
lkcl | sorry, 4. then FFT needs the same treatment, that's 8 | 17:49 |
lkcl | fortunately though i think the outer-butterfly is just a twin add-subtract - specified as a 2-in 1-out but having an implicit RS | 17:51 |
lkcl | https://libre-soc.org/openpower/isa/svfparith/ | 17:51 |
markos | added some rationale, mention of the Arm instructions | 17:51 |
lkcl | awesome | 17:51 |
lkcl | btw the DCT subsystem *needs* both the inner-butterfly *and* the outer-butterfly instructions | 17:52 |
lkcl | that's why there's 2 separate uses of svremap in the unit tests. first use does the inner butterfly (the twin-madd) | 17:53 |
markos | well, I'd suggest 2 pairs of instructions | 17:53 |
lkcl | second use of svremap does the outer butterfly (which is i believe just an add-sub) | 17:53 |
markos | from what I see in libvpx though, both fdct and idct use the same kind of instructions | 17:54 |
lkcl | haang on... DCT just uses fadds. ha! | 17:55 |
lkcl | https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD#l594 | 17:55 |
markos | and arm ports also use those special instructions, but again these are of limited precision | 17:55 |
lkcl | so fortunately the outer-butterfly is just "an add" | 17:55 |
lkcl | well, the limited precision occurs when you specify an elwidth | 17:55 |
lkcl | which will be where the biggest efficiency savings come from | 17:56 |
lkcl | so, actually (whew) - at least for DCT - only *two* twin-mul-and-accumulate-and-shift instructions | 17:56 |
markos | well in our case, it would help to be able to do the calculations in a larger width and then just scale/narrow down | 17:56 |
markos | yes | 17:57 |
lkcl | whiiich... means... they can just about fit into opcode 22 | 17:57 |
lkcl | there's an area | 17:57 |
lkcl | https://libre-soc.org/openpower/sv/bitmanip/ | 17:57 |
markos | Arm is full of many versions of these functions because they're fast but not accurate enough | 17:57 |
lkcl | | NN | RT | RA | it/im57 | im0-4 | 0 00 00 |0 | xpermi | TODO-Form | | 17:58 |
lkcl | | NN | | | | | - -- 00 |0 | rsvd | rsvd | | 17:58 |
markos | 23 helper functions to do basically the same thing | 17:58 |
lkcl | yowser | 17:58 |
lkcl | ok so see that entry just below xpermi? | 17:58 |
markos | rsvd? | 17:59 |
lkcl | as long as 26-28 are *not* zero, that's "free encoding space" | 17:59 |
lkcl | you get *one* bit for a shift, there | 17:59 |
lkcl | let me edit it... | 17:59 |
markos | haha, I'll take it | 18:00 |
lkcl | ahh... where the heck's the page... it's in a separate-include... | 18:00 |
lkcl | ah. draft_opcode_tables | 18:00 |
lkcl | ok what's the instruction names? | 18:00 |
lkcl | one is maddsubrs | 18:01 |
markos | I proposed maddsubrs, but open to suggestions | 18:01 |
lkcl | ahh "s" is usually reserved for "FP single"... are there any other instructions ending in "s" in the *fixed*-point set? | 18:02 |
lkcl | maddsubrs it is for now | 18:02 |
markos | this one does both add and sub | 18:02 |
markos | assuming I can write to RT and RT+1 | 18:02 |
markos | or RT and RS | 18:02 |
lkcl | RT and implicit-RS. | 18:04 |
lkcl | https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=5b0a082545185799b7bf053374aa3b60117ef74b | 18:04 |
lkcl | ok so that's your allocation for the instruction | 18:04 |
lkcl | it'll need to go into minor_22.csv | 18:04 |
lkcl | (not minor_59.csv) | 18:04 |
lkcl | and you want a (sigh) XO length i think of 11... | 18:05 |
lkcl | gimme a sec... | 18:05 |
lkcl | see insndb.csv | 18:05 |
lkcl | https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isatables/insndb.csv;hb=HEAD | 18:06 |
lkcl | 7 minor_22.csv,22,21:31,NONE,pattern,normal | 18:06 |
lkcl | 21-31... yes, 11 bits | 18:06 |
lkcl | okaaay | 18:06 |
lkcl | so _now_ you can "interpret" the contents of minor_22.csv, every single "pattern" *has* to be 11 bit in length... | 18:06 |
markos | 11? | 18:07 |
lkcl | the 1st column | 18:07 |
lkcl | -----01011-,ALU,OP_FISHMV | 18:07 |
lkcl | example. | 18:07 |
lkcl | count the total "-" "0" and "1"s | 18:07 |
lkcl | comes to 11 | 18:07 |
lkcl | representing bits 21 thru 31 *inclusive* | 18:07 |
lkcl | sooo... with the new allocation | 18:08 |
markos | but I have 4 operands, RT, RA, RB, RC, which are 6:24 | 18:08 |
lkcl | look at the diff | 18:08 |
lkcl | diff --git a/openpower/sv/draft_opcode_tables.mdwn b/openpower/sv/draft_opcode_tables.mdwn | 18:08 |
lkcl | | 0.5|6.10|11.15|16.20 |21..25 | 26....30 |31| name | Form | | 18:08 |
lkcl | +| NN | RT | RA | RB | RC | sh 01 00 |0 | maddsubrs | BF-Form | | 18:08 |
lkcl | RT RA RB and RC are all allocated to 6:24 | 18:09 |
markos | aaaaaaah | 18:09 |
lkcl | but column *one* of each csv file is allocated to *XO* identification | 18:09 |
lkcl | you will also need to add entries further down in fields.text which tell power_decoder.py where those RT RA RB and RC are, for BF-Form | 18:10 |
lkcl | https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isatables/fields.text;h=b0f91cae74f2dec822138b97c0286d6b6cda76f8;hb=d40763cd6e186ad9b17ce6f974a38b4c4877965e#l799 | 18:10 |
lkcl | 799 RT (6:10) | 18:10 |
lkcl | 800 Field used to specify a GPR to be used as a target. | 18:11 |
lkcl | 801 Formats: A, BM2, D, DQE, DS, DX, MM, VA, VA2, VX, X, XFX, XO, XX2, SVL, XB, TLI, Z23 | 18:11 |
lkcl | aaaand now... | 18:11 |
lkcl | .... | 18:11 |
lkcl | .... | 18:11 |
lkcl | BF | 18:11 |
lkcl | likewise for RA | 18:11 |
lkcl | 747 RA (11:15) | 18:11 |
lkcl | 748 Field used to specify a GPR to be used as a | 18:11 |
lkcl | 749 source or as a target. | 18:11 |
markos | ok, thanks for your patience | 18:11 |
markos | I'll get it eventually | 18:11 |
lkcl | 750 Formats: ...... .... *BF* | 18:11 |
lkcl | it's all in the (various, numerous) diffs | 18:11 |
lkcl | normally it would be straightforward, just look at one already done, but the extra complication is the implicit arguments | 18:12 |
lkcl | so | 18:12 |
lkcl | let me find git link for minor_22.csv | 18:12 |
lkcl | https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isatables/minor_22.csv;h=7cb4785af2ff915acf4c724d72709a470e2c6a48;hb=d40763cd6e186ad9b17ce6f974a38b4c4877965e#l40 | 18:13 |
lkcl | so line 43 | 18:13 |
lkcl | let's take say line 39 - OP_CPROP | 18:13 |
lkcl | that has | 18:13 |
lkcl | 39 0110001110-,ALU,OP_CPROP,R | 18:13 |
lkcl | so that means, that for power_decoder.py to "match" | 18:13 |
lkcl | bit 21 must be 0 | 18:13 |
lkcl | bit 22 must be 1 | 18:13 |
lkcl | bit 23 must be 1 | 18:13 |
lkcl | bit 24 must be 0 | 18:13 |
lkcl | ... | 18:13 |
lkcl | bit 30 must be 0 | 18:14 |
lkcl | and bit 31 we DON'T CARE | 18:14 |
lkcl | (because "-") | 18:14 |
lkcl | so, "translating" the allocation 26:30 from the new allocation | 18:14 |
lkcl | 21-25 is right smack in the middle of RC, therefore must be "don't care" | 18:14 |
lkcl | bit 26 is "sh" so *that* must be "don't care" as well | 18:15 |
lkcl | and bits 27-31 must be "01000" | 18:15 |
lkcl | so! | 18:15 |
lkcl | we have the entry! | 18:15 |
lkcl | and it is... | 18:15 |
lkcl | ------01000 | 18:15 |
lkcl | ta-daaa | 18:15 |
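The pattern-matching walk-through above can be modelled with a short Python sketch. It is not the actual power_decoder.py logic: `xo_matches` is a hypothetical helper written for illustration, assuming a 32-bit instruction word with MSB0 bit numbering and an 11-character pattern covering bits 21 through 31 inclusive.

```python
def xo_matches(pattern, insn):
    """Check an 11-character CSV pattern against MSB0 bits 21..31 of insn.

    Each character is '0', '1' or '-' (don't care). The first character
    corresponds to bit 21 and the last one to bit 31.
    """
    if len(pattern) != 11:
        raise ValueError("pattern must cover exactly bits 21..31")
    for i, ch in enumerate(pattern):
        bitnum = 21 + i                     # MSB0 bit number in the 32-bit word
        bit = (insn >> (31 - bitnum)) & 1   # convert MSB0 position to a shift
        if ch != '-' and int(ch) != bit:
            return False
    return True


def encode_example(rt, ra, rb, rc, sh):
    """Build a word laid out like the draft allocation quoted above:
    major opcode 22 in bits 0..5, RT 6..10, RA 11..15, RB 16..20,
    RC 21..25, sh in bit 26, and 01000 in bits 27..31."""
    return ((22 << 26) | (rt << 21) | (ra << 16) | (rb << 11)
            | (rc << 6) | (sh << 5) | 0b01000)
```

With this layout, `xo_matches("------01000", encode_example(1, 2, 3, 4, 1))` holds for any RC and either value of sh, because those six positions are all don't-care.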
markos | :) | 18:15 |
lkcl | that's the entry to go into minor_22.csv at line... 43. | 18:16 |
lkcl | every single frickin instruction has to go through this process, sigh | 18:16 |
markos | I'll add the entry there | 18:16 |
lkcl | awesome | 18:17 |
lkcl | holy hell barometric pressure change | 18:17 |
lkcl | unbelievably painful even with 4 aspirin and 2 paracetamol | 18:18 |
markos | get some rest | 18:21 |
lkcl | not going to help - weather's changing constantly today | 18:33 |
lkcl | apparently this is a well-known phenomenon in japan | 18:33 |
lkcl | but very much less-recognised in europe / us. | 18:33 |
lkcl | i can feel my ears popping constantly (like in an airplane) hence i know the pressure change is happening | 18:33 |
programmerjake | luke, iirc you removed iterate_indices2 and copied the section to iterate_indices, did you ever push that? | 18:33 |
programmerjake | hope you feel better | 18:34 |
lkcl | no i didn't, i simply called the alternate function if submode=0b10/11 | 18:34 |
lkcl | been a wild ride today | 18:34 |
markos | missing something still: this file (I guess autogenerated) gives me this: | 18:36 |
markos | +maddsubrs,NORMAL,,1P,EXTRA2,NO,d:FRT;d:CR1,s:FRA,s:FRB,s:FRC,RA,RB,RC,RT,0,CR1,0 | 18:36 |
markos | why am I getting FR* registers in there? | 18:36 |
markos | maybe it was generated previously | 18:40 |
*** octavius <octavius!~octavius@92.40.169.65.threembb.co.uk> has quit IRC | 18:41 | |
programmerjake | run `make`, it replaces those files... | 18:41 |
markos | just did | 18:41 |
markos | still getting the same result | 18:41 |
programmerjake | do you have the right form in the csv? | 18:41 |
markos | ah right | 18:42 |
markos | thanks | 18:42 |
markos | weird, still getting the same | 18:44 |
markos | I'm going to commit in a branch | 18:45 |
programmerjake | it's probably going to the wrong case in sv_analysis.py, e.g. when I added pcdec I had to add a case: https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/sv/sv_analysis.py;h=8b89212fd736d65a383ded16f2b770966efe9cb5;hb=HEAD#l605 | 18:48 |
programmerjake | regs comes from https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/sv/sv_analysis.py;h=8b89212fd736d65a383ded16f2b770966efe9cb5;hb=HEAD#l363 | 18:51 |
ghostmansd | markos, check these lines: https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/sv/sv_analysis.py;h=8b89212fd736d65a383ded16f2b770966efe9cb5;hb=HEAD#l378 | 18:54 |
ghostmansd | all fields for CSVs are generated here | 18:55 |
programmerjake | oh, i think i spotted your issue, do you have it writing CR1 instead of CR0? | 18:55 |
programmerjake | since for Rc=1, fp ops write CR1, but int ops write CR0 | 18:58 |
lkcl | it'll be down to what's in sv_analysis.py | 19:43 |
lkcl | but you don't want to be worrying about sv right now | 19:44 |
lkcl | because this is a *scalar* instruction | 19:44 |
lkcl | but just so you know, look at the sv_analysis.py RM-1P-3S1D section | 19:45 |
lkcl | elif value == 'RM-1P-3S1D': | 19:45 |
lkcl | it's a previously-unrecognised pattern | 19:45 |
lkcl | and the fallback is "fmadd*" | 19:45 |
lkcl | i need to know the "key" pattern | 19:46 |
lkcl | regs == [somethingsomething] | 19:46 |
lkcl | 1111011111,ALU,OP_MADDSUBRS,RA,RB,RC,RT,NONE,CR1,0 | 19:47 |
lkcl | ah yes, you put CR1, just like jacob said | 19:47 |
lkcl | make that CR0 | 19:47 |
lkcl | and it *should* then match on | 19:47 |
lkcl | elif regs == ['RA', 'RB', 'RC', 'RT', '', 'CR0']: # pcdec | 19:47 |
lkcl | which will activate this | 19:47 |
lkcl | res['0'] = 'd:RT;d:CR0' # RT,CR0: Rdest1_EXTRA2 | 19:48 |
lkcl | res['1'] = 's:RA' # RA: Rsrc1_EXTRA2 | 19:48 |
lkcl | res['2'] = 's:RB' # RB: Rsrc2_EXTRA2 | 19:48 |
lkcl | res['3'] = 's:RC' # RC: Rsrc3_EXTRA2 | 19:48 |
programmerjake | other issues I spotted, the pseudocode uses rotate left instead of shift right...it'll give the wrong results | 19:49 |
lkcl | re-run sv_analysis.py | 19:50 |
lkcl | i also removed Rc=1 from BF-Form, and fitted it to what went into the bitmanip-opcode-22 table | 19:52 |
lkcl | https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=fa1f3c9e3dce8fdb62b47a34b6c9203293b94026 | 19:52 |
lkcl | sorry! Rc=1 effectively doubles the number of instructions, which we can't really afford to do | 19:53 |
programmerjake | another issue: that's actually a 5-in 2-out op...since it both reads and writes RT and RS | 19:53 |
lkcl | ermmm... ermermerm... | 19:53 |
lkcl | yep that's not going to work | 19:53 |
lkcl | so RA has to be the source-of-where-the-accumulating-happens | 19:56 |
lkcl | which *happens* to be *exactly* the same register as RT | 19:56 |
programmerjake | idea: put the pair of coefficients and accumulated sums each in 1 reg with each value being the lower/upper half of a reg...this should reduce input/output regs to 4-in 1-out | 20:06 |
programmerjake | idk if that'll fit the DCT pattern tho | 20:07 |
programmerjake | this is kinda like how cdtbcd works where the upper and lower halves are independent | 20:08 |
programmerjake | e.g. RT <- ((RT)[0:XLEN/2-1] + prod0) || ((RT)[XLEN/2:XLEN-1] + prod1) | 20:10 |
programmerjake | that way if you set elwid=32 you get 2 16-bit results | 20:11 |
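The half-register idea in the pseudocode line above can be sketched as follows. This is only a model of the suggestion as written (the discussion further down decides against split-register use); `packed_accumulate` is a made-up name, and the two halves are assumed to wrap independently with no carry between them.

```python
def packed_accumulate(rt, prod0, prod1, xlen=64):
    """Model of RT <- ((RT)[0:XLEN/2-1] + prod0) || ((RT)[XLEN/2:XLEN-1] + prod1).

    In MSB0 numbering (RT)[0:XLEN/2-1] is the upper half and
    (RT)[XLEN/2:XLEN-1] is the lower half. Each half is updated
    independently and truncated to XLEN/2 bits.
    """
    half = xlen // 2
    mask = (1 << half) - 1
    upper = ((rt >> half) + prod0) & mask   # upper half accumulates prod0
    lower = ((rt & mask) + prod1) & mask    # lower half accumulates prod1
    return (upper << half) | lower
```

Calling it with `xlen=32` gives two independent 16-bit results, which is the elwid=32 case mentioned above.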
markos | sorry had to be afk for a while to pick up my son | 20:18 |
markos | right, CR0 was the reason | 20:19 |
markos | thanks a lot! | 20:19 |
markos | programmerjake, yeah, it's far from perfect right now, and probably incorrect | 20:19 |
markos | but the half-register coefficient is a good idea, I was actually thinking about it for the results | 20:23 |
markos | ie, high RT -> add, low RT -> sub | 20:23 |
markos | lkcl, I saw you removed the accumulate, is there no way to keep the accumulate there? | 20:30 |
lkcl | markos, RA-when-set-to-the-same-register-as-RT *is* the accumulator | 20:33 |
lkcl | that's the way it works | 20:33 |
lkcl | and no, it will not be ok to do split-use of registers | 20:34 |
lkcl | how would it ever then be possible to do 64-bit DCT? | 20:34 |
lkcl | last thing we need is to fall onto a SIMD-within-a-Register paradigm | 20:35 |
markos | hm, hm, RA == RT only makes sense if we do in-place DCT | 20:37 |
markos | and actually it kind of forces us that way | 20:37 |
lkcl | the DCT Schedules are specifically designed for precisely and exactly that | 20:42 |
lkcl | this is a world-first | 20:42 |
lkcl | the only reason it is possible at all is because the elements are loaded and then traversed in a hybrid bit-reversed *and* gray-coding pattern | 20:43 |
lkcl | such that | 20:43 |
lkcl | when "unravelling" layer by layer, each layer is *not* destructively overwritten when doing the 3210 0123 schedule | 20:44 |
lkcl | because it's *already been loaded such that it becomes a 0123 0123* schedule for that exact moment in the schedule | 20:44 |
lkcl | and consequently you *can* do in-place | 20:45 |
lkcl | all standard SIMD algorithms *need* double the registers | 20:45 |
lkcl | because they try to do 0123 3210 and half-way through that they destroy the data | 20:45 |
lkcl | markos, you'll need i think to experiment by running remap_dct_yield.py | 20:49 |
lkcl | and see what it does. | 20:49 |
lkcl | you'll find that - like Indexed REMAP but without the GPRs - it generates "prerequisite offsets" | 20:49 |
lkcl | that you *must* drop on top of a fully in-place instruction | 20:50 |
lkcl | in this case it will be maddsubrs *0, *0, *16, *0 | 20:50 |
lkcl | where *16 equals the coefficients | 20:50 |
lkcl | sorry | 20:50 |
lkcl | maddsubrs *0, *0, *0, *16 | 20:50 |
lkcl | and the *schedule* system will add on the required offsets to RT, RA, RB and RC *for* you | 20:51 |
lkcl | to make the *entire* triple loop | 20:51 |
lkcl | it's liiitttteralllly three (quantity 3of) instructions | 20:51 |
lkcl | svshape, svremap, sv.maddsubrs. | 20:52 |
lkcl | bdang. | 20:52 |
lkcl | done. | 20:52 |
markos | you're right, I was thinking that we might need to reuse the coeffs, but if we can do the whole thing in one go, all the better and we don't need to reuse | 20:52 |
lkcl | even the coefficients are established in a set order that makes them useable as a vector | 20:52 |
lkcl | and | 20:52 |
lkcl | guess what? | 20:52 |
lkcl | the "coefficient-offseting" Schedule (REMAP SVSHAPE3) is set up *precisely and exactly* to give you the *exact* required coefficient | 20:53 |
lkcl | at the exact and precise required time | 20:53 |
lkcl | it's extremely elegant, sophisticated, and overwhelmingly-confusingly-straightforward | 20:53 |
lkcl | compared to the absolute hell normally subjected onto programmers | 20:53 |
markos | well, you're right, I'll have to play quite a bit with the dct_yield example, in fact I might copy it to work on the maddsubrs | 20:54 |
*** jn <jn!~quassel@user/jn/x-3390946> has quit IRC | 20:54 | |
markos | wait, in that case, I don't need 3-in | 20:54 |
markos | or rather I don't need a separate RT register | 20:54 |
markos | because they are the same | 20:54 |
lkcl | you should be looking to do nothing else other than to copy the way that the FP DCT works | 20:55 |
lkcl | unless there is a really compelling reason to do otherwise | 20:55 |
lkcl | such that you should literally be able to cut/paste the fp dct test examples | 20:55 |
*** jn <jn!~quassel@95.223.44.193> has joined #libre-soc | 20:56 | |
*** jn <jn!~quassel@95.223.44.193> has quit IRC | 20:56 | |
*** jn <jn!~quassel@user/jn/x-3390946> has joined #libre-soc | 20:56 | |
lkcl | replace sv.ffmadds (whatever) with sv.maddsubrs | 20:56 |
markos | what I still haven't figured out | 20:56 |
lkcl | and "It Should Just Work(tm)" | 20:56 |
lkcl | run the tests. and the yield program. and the associated project nayuki dct tests. | 20:56 |
markos | I can't understand where the implicit RS is defined, I mean how does it know where to place the result? | 20:56 |
lkcl | we went over that: that's in power_decoder2.py | 20:57 |
lkcl | search for the word "implicit_rs" | 20:57 |
markos | ah yes, you did say that, sorry | 20:57 |
markos | ok, will continue playing with this | 20:57 |
lkcl | now you'll need a section "with mIf((major==22) & so.matches("------01100") | 20:57 |
lkcl | i'll do that bit | 20:58 |
lkcl | i'll sort it now | 20:58 |
markos | thanks | 21:00 |
lkcl | done | 21:01 |
markos | if RA == RT, can I skip one in the declaration? | 21:05 |
markos | trying to see if I can still shave off some bits for shifting :-) | 21:06 |
lkcl | mmm... maaaybe. maybe not. when using Vertical-First Mode you need to be able to specify some registers as scalar, some as vector | 21:14 |
lkcl | and if they don't exist, you can't do that | 21:14 |
lkcl | Vertical-First Mode would be useful for being able to utilise the Schedule but to run *more than one* instruction, just like in chacha20 | 21:14 |
lkcl | in this case, you could detect "was there an overflow" | 21:14 |
lkcl | and flip to higher bit-width *without* leaving the Schedule Arrangement | 21:15 |
lkcl | just branch to a different area within the loop | 21:15 |
lkcl | you could even go "oop, by Layer 3 or greater we *know* we are going to run out of bit-accuracy in 16-bit therefore let's start using 32-bit for Layer 3 4 and 5" | 21:16 |
lkcl | all sorts of weird stuff | 21:16 |
lkcl | but if you don't have control over the operands it's going to be much more challenging | 21:16 |
lkcl | plus, if you ever need to use this in a scalar context, what should RA, RB and RT be? | 21:17 |
lkcl | if you *really* feel that an overwrite is ok in all circumstances, then yes we can explore that | 21:18 |
lkcl | and it will be ok to do precisely because butterfly will have *two* input operands "in-flight" | 21:18 |
lkcl | (like compare-and-swap) | 21:18 |
programmerjake | in case anyone was wondering, my build server crashed or something since I found it powered off rn, should be up and working now | 22:17 |
markos | lkcl, well, scalar mode is not really the use case here in point, I mean sure one can use it then also, but it doesn't really mean much | 22:22 |
markos | but if it makes a huge difference in in-place DCT applications, and there is no other way, then yes I would be willing to consider it | 22:23 |
markos | again the point is to manage to save some bits for shifting | 22:24 |
markos | eg if instead of maddsubrs RT, RA, RB, RC, SH (=1-bit for shifting), we manage to do the same with maddsubrs RA, RB, RC, SH (4-bits, give back one bit to XO), that makes a huge difference and a very powerful instruction that is future proof for other DCT implementations | 22:25 |
markos | if we leave the shifting out entirely, then it's just a couple of madds | 22:26 |
markos | which sure it can save some instructions but it won't make that much of a difference | 22:26 |
markos | let me give you some examples | 22:26 |
programmerjake | rather than RA, RB, RC, we'd probably name them RT, RA, RB | 22:28 |
programmerjake | maddsubrs RT, RA, RB, SH | 22:28 |
markos | https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/arm/fdct_neon.h | 22:28 |
markos | ok | 22:29 |
markos | what would RC be? | 22:29 |
markos | I can understand RA == RT, but in your example how will they be mapped to the (a +/- b) * c | 22:30 |
programmerjake | a = RT, b = RA, c = RB? | 22:30 |
markos | sigh, ofc | 22:31 |
programmerjake | if there's only 3 args, one of them is almost always named RT or RS | 22:32 |
markos | in any case, if you check the file above, there are about 20+ implementations of these butterfly instructions | 22:32 |
markos | and the reason is that the arm "fast" implementations vqrdmulhq_s16/vqrdmulhq_s32 fail to provide full precision | 22:32 |
markos | so for the single-coeff implementation you have this code: | 22:33 |
markos | const int32x4_t a0 = vmull_n_s16(vget_low_s16(a), constant); | 22:33 |
markos | const int32x4_t a1 = vmull_n_s16(vget_high_s16(a), constant); | 22:33 |
markos | const int32x4_t sum0 = vmlal_n_s16(a0, vget_low_s16(b), constant); | 22:33 |
markos | const int32x4_t sum1 = vmlal_n_s16(a1, vget_high_s16(b), constant); | 22:33 |
markos | const int32x4_t diff0 = vmlsl_n_s16(a0, vget_low_s16(b), constant); | 22:33 |
markos | const int32x4_t diff1 = vmlsl_n_s16(a1, vget_high_s16(b), constant); | 22:33 |
markos | *add_lo = vrshrq_n_s32(sum0, DCT_CONST_BITS); | 22:33 |
markos | *add_hi = vrshrq_n_s32(sum1, DCT_CONST_BITS); | 22:33 |
markos | *sub_lo = vrshrq_n_s32(diff0, DCT_CONST_BITS); | 22:33 |
markos | *sub_hi = vrshrq_n_s32(diff1, DCT_CONST_BITS); | 22:33 |
markos | the DCT_CONST_BITS = 14 | 22:33 |
markos | for vp8/vp9 and av1 | 22:33 |
markos | possibly for av2 as well, and quite likely that applies to other codecs as well | 22:34 |
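A scalar Python model of the NEON sequence quoted above may make the arithmetic easier to follow. It is a sketch under assumptions, not the proposed instruction's pseudocode: the function names are invented, Python's unbounded integers stand in for the widened 32-bit lanes, and the rounding shift is assumed to behave like `vrshrq_n_s32` (add half of the divisor, then arithmetic shift right).

```python
DCT_CONST_BITS = 14  # shift amount quoted above for vp8/vp9 and av1


def round_shift(x, n=DCT_CONST_BITS):
    """Rounding arithmetic shift right: add 2**(n-1), then shift by n.

    Python's >> on negative ints floors, like an arithmetic shift.
    """
    return (x + (1 << (n - 1))) >> n


def butterfly_one_coeff(a, b, c, sh=DCT_CONST_BITS):
    """Return (add, sub) = (round((a + b) * c), round((a - b) * c)).

    Mirrors the quoted code: a0 = a*c, then multiply-accumulate and
    multiply-subtract of b*c, then a rounding shift of both results.
    """
    a0 = a * c
    total = a0 + b * c   # vmlal: a0 + b * constant
    diff = a0 - b * c    # vmlsl: a0 - b * constant
    return round_shift(total, sh), round_shift(diff, sh)
```

Keeping `sh` as a parameter is what the "shift by an immediate" request amounts to: a codec with a different fixed-point scale would only change that argument.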
markos | now what if we have some code that needs another constant for shifting? | 22:34 |
markos | we would have to have another instruction or do what Arm does | 22:34 |
markos | fall-back to less efficient code | 22:34 |
markos | still faster than scalar | 22:35 |
markos | we could do all this code in just a couple of instructions and be future proof, if a) we allow accumulate, b) we allow shifting by an immediate value | 22:35 |
programmerjake | what about putting the constant in a handy SPR? e.g. LR or CTR | 22:36 |
programmerjake | that would be 4-in 2-out then | 22:36 |
markos | can we do that? | 22:36 |
markos | what are the drawbacks vs a normal GPR? | 22:37 |
programmerjake | maybe? | 22:37 |
programmerjake | a normal gpr needs an argument | 22:37 |
programmerjake | a spr needs to be not otherwise used or saved/restored | 22:37 |
markos | problem is that it's not just a single constant for a DCT | 22:38 |
markos | it's essentially a bunch of cospi fractions | 22:38 |
programmerjake | so, hence why I was suggesting LR since we'll probably want CTR for looping | 22:38 |
programmerjake | not c, sh in the spr | 22:38 |
markos | cospi(20/64), cospi(12/64), etc is a pair for the 2-coeff | 22:38 |
markos | aaaa | 22:38 |
markos | sorry | 22:38 |
markos | yes, that would work | 22:39 |
markos | sorry it's late | 22:39 |
markos | because that would remain totally constant throughout the whole code base | 22:39 |
markos | yes indeed | 22:39 |
markos | is it possible that LR is used for something else in the DCT loop? | 22:40 |
programmerjake | other than return address which can easily be stored on stack or in a spare gpr, no | 22:40 |
markos | if lkcl agrees, that solves a problem | 22:40 |
markos | how would I read the value from LR to use as a shift value? | 22:41 |
programmerjake | yeah, just icr if 4-in is too much... | 22:41 |
programmerjake | uuh, just write `blah >> LR`? | 22:41 |
programmerjake | LR[58:63] | 22:42 |
markos | I mean it's directly accessible and I don't have to use a special instruction within the pseudocode | 22:42 |
markos | ok, thanks | 22:42 |
markos | in that case | 22:43 |
markos | we don't even have to force RA=RT | 22:43 |
markos | we can keep the previous syntax and just use A-Form? | 22:43 |
programmerjake | https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isa/branch.mdwn;hb=f8e2c0cb1467391aa7ae4b8b092c281ee2e16a7b#l75 | 22:43 |
markos | ok | 22:44 |
lkcl | ya know what? an overwrite would i think work fine | 22:44 |
programmerjake | if you're not reading RT, sure. if you are reading RT too you have too many inputs | 22:44 |
lkcl | markos, no don't do that. it requires LR as an operand into the Dependency Matrices | 22:45 |
lkcl | which will cause absolute mayhem | 22:45 |
markos | right | 22:45 |
markos | ok, then | 22:46 |
lkcl | register files *have* to be kept separate, otherwise the Dependency Management becomes hell | 22:46 |
lkcl | basically think of a matrix, with every register known on both the rows and the columns | 22:46 |
lkcl | any time you add an extra dependency, you end up with the *entire row* having to have a DM Cell for that register | 22:46 |
lkcl | just in case you ever executed an instruction that read LR just after one that wrote it | 22:47 |
lkcl | if you can keep GPR-GPR-GPR then the Matrix becomes "sparse" and you can miss out the majority of entire rows of Dependencies | 22:47 |
lkcl | CTR is definitely allocated to counting, it's even implementable as special Architectural State | 22:48 |
lkcl | rather than an actual "register" per se | 22:48 |
programmerjake | well, if it can match register usage of some pre-existing op, then LR could be used, e.g. if your op uses the same registers as a branch | 22:49 |
lkcl | i need to experiment to see if ffmadds can be reduced by one operand | 22:49 |
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC | 22:49 | |
programmerjake | since then it can share the dependency matrixes used for branch ops | 22:49 |
programmerjake | LR is the other spr that is likely treated specially | 22:50 |
programmerjake | oh, idea, mush it into the register profile of the GF(p) fft op | 22:53 |
programmerjake | since that reads a spr | 22:54 |
programmerjake | gfpmaddsubr | 22:55 |
programmerjake | it reads the GFPRIME spr | 22:56 |
programmerjake | though otoh that probably would have special state associated with it making writing it much more expensive | 22:57 |
programmerjake | oh, luke, all the [[!inline]] pseudo-code from nmigen-gf.git has disappeared on the wiki: https://libre-soc.org/openpower/sv/bitmanip/#index14h1 | 22:59 |
lkcl | sigh that's an underlay | 22:59 |
lkcl | no idea | 22:59 |
lkcl | not going to look at it now | 23:00 |
*** gnucode <gnucode!~gnucode@user/jab> has joined #libre-soc | 23:57 | |
lkcl | frickineeeelll | 23:57 |
lkcl | never had difficulty with operands before, sigh | 23:58 |
lkcl | okaaay about time | 23:59 |