Thursday, 2023-04-27

*** jn <jn!~quassel@user/jn/x-3390946> has quit IRC03:30
*** jn <jn!~quassel@2a02:908:1066:b7c0:20d:b9ff:fe49:15fc> has joined #libre-soc03:32
*** jn <jn!~quassel@2a02:908:1066:b7c0:20d:b9ff:fe49:15fc> has quit IRC03:32
*** jn <jn!~quassel@user/jn/x-3390946> has joined #libre-soc03:32
programmerjakelkcl: I'm thinking predicated prefix sum is too complex to figure out easily, plus it produces hard-to-use outputs, so what do you think about declaring prefix-sum with predicated off elements as undefined?04:11
programmerjakeI'm going to go ahead and do that for now04:14
programmerjakeanother thing I ran into is in iterate_indices, it reverses steps if invxyz[1], however that is actually nonsensical, reversing steps doesn't produce a useful operation (unlike reversing indices, which is equivalent to reversing vector elements before and after the prefix-sum/reduction so is useful)04:42
programmerjakeit makes it unnecessarily more complex, so I'm going to copy the existing function, remove steps reversing, and add prefix-sum to that.04:43
programmerjake note that reversing steps is equivalent to reversing the top half of the following diagram vertically (aka. not useful afaict): https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_prefix_sum.py;h=2b88407216ccad3fc99a7d633331a30a3d3f562f;hb=HEAD#l167 04:48
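A minimal sketch of the point about reversing indices, assuming a plain serial in-place prefix-sum using addition. The names `prefix_sum_in_place`, `index_map` and `reversed_indices` are hypothetical and are not taken from nmutil or openpower-isa; the sketch only illustrates that remapping every index i to n-1-i gives the same result as reversing the elements before and after an ordinary prefix-sum.

```python
def prefix_sum_in_place(vec, index_map=lambda i: i):
    """Serial inclusive prefix-sum; every element access goes through index_map."""
    for i in range(1, len(vec)):
        vec[index_map(i)] += vec[index_map(i - 1)]
    return vec


def reversed_indices(n):
    """Index remapping that mirrors element positions: 0 <-> n-1, 1 <-> n-2, ..."""
    return lambda i: n - 1 - i


data = [1, 2, 3, 4, 5]

# Variant 1: run the prefix-sum with reversed indices.
via_indices = prefix_sum_in_place(list(data), reversed_indices(len(data)))

# Variant 2: reverse the elements, run the plain prefix-sum, reverse again.
via_elements = prefix_sum_in_place(data[::-1])[::-1]

print(via_indices)   # [15, 14, 12, 9, 5]
print(via_elements)  # [15, 14, 12, 9, 5]
```

Both variants produce the suffix sums of the input, which is why index reversal is a useful mode. Reversing the order of the steps has no equivalent simple description.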
ghostmansd lkcl, FYI: https://salsa.debian.org/Kazan-team/mirrors/openpower-isa/-/jobs/4167814 04:50
ghostmansdFAILED src/openpower/decoder/isa/test_caller_svp64_ldst.py::DecoderTestCase::test_sv_load_dd_ffirst_excl - AssertionError: 2 != 104:51
ghostmansdBroken in master04:51
ghostmansdother than this test, nopr branch seems to produce the same results as master04:52
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC04:57
ghostmansd[m]Correction: this test fails both in master and nopr05:00
programmerjakeyeah, it was broken from the start afaict...just ignore it for now, luke can fix it later05:01
ghostmansd[m]Ok, thank you, later today I'll merge nopr branches both into gdb and openpower-isa05:02
*** jn <jn!~quassel@user/jn/x-3390946> has quit IRC05:04
*** jn <jn!~quassel@ip-095-223-044-193.um35.pools.vodafone-ip.de> has joined #libre-soc05:06
*** jn <jn!~quassel@ip-095-223-044-193.um35.pools.vodafone-ip.de> has quit IRC05:06
*** jn <jn!~quassel@user/jn/x-3390946> has joined #libre-soc05:06
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has quit IRC05:10
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has joined #libre-soc05:11
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has joined #libre-soc06:21
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC06:27
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has quit IRC06:49
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has joined #libre-soc06:53
markostoshywoshy, lkcl, programmerjake what about Thursday afternoon, biweekly for the svp64 meetings?09:14
*** yambo <yambo!~yambo@069-145-120-113.biz.spectrum.com> has quit IRC09:20
*** midnight <midnight!~midnight@user/midnight> has quit IRC09:21
programmerjakeoh, i'm busy for some of this thursday afternoon, so idk if i can make it09:25
programmerjakeoh, wait, it's probably not afternoon for me when you're thinking09:26
programmerjakewhat time?09:26
*** midnight <midnight!~midnight@user/midnight> has joined #libre-soc09:28
*** yambo <yambo!~yambo@069-145-120-113.biz.spectrum.com> has joined #libre-soc09:32
markosright, it's probably going to be morning for you I guess09:37
markosI'd say pick a time between 3pm-7pm UK time09:38
programmerjake7pm? if it's earlier than 6pm i likely won't make it09:43
programmerjaketbh i prefer later than 7pm if that works for you all09:45
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has quit IRC09:49
programmerjakethough I am wondering if we have to have meetings, since afaict email and irc have been working fine...sorry, i had missed the part where it was explained why we needed SVP64 meetings. for recording presentations, wouldn't it work fine to record them individually and then publish them09:50
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.44.124> has joined #libre-soc09:50
programmerjakeor are these meetings where we're expecting non-libre-soc people to show up and ask questions?09:51
markosit's basically mostly internal, to share svp64 assembly for people who are not yet up to speed09:53
markosbut I think it could just as well be for other interested people also09:54
markosit's not for recording presentations for conferences etc09:54
programmerjakeah, so not a major problem if i miss any09:54
programmerjakesince afaict i'm mostly up to speed on svp6409:55
markosno, though people will probably benefit from your technical knowledge :)09:55
markosyou are, others not as much :)09:55
markosthe point is not to train your or Luke :)09:56
markoss/your/you09:56
programmerjakeah, ok.09:56
programmerjakei think we should see who all wants to attend, e.g. if cesar wants to attend we'd have to work around his work schedule09:58
lkclprogrammerjake, i solved predication in the parallel-reduction case.  if you can write a short (10-20 lines) python script in a non-predicated demo, like you did last time (but this time excluding predication entirely) i can work it out10:00
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.44.124> has quit IRC10:02
programmerjake lkcl, you'd want https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/prefix_sum.py;h=23eca36e2bb748c296c5a7ca88b9fa578258c653;hb=HEAD#l35 10:03
programmerjakeit's short and to-the-point10:03
lkclexcellent.10:03
lkclok so the inverted-bit (going out again) is the bit i need.10:03
programmerjakedo copy it somewhere else to hack on it...10:04
lkclpredication is *solved* jacob.10:04
cesar12No, go ahead, I'm not that interested on SVP64 assembly right now, more focused on low level HDL and Formal Verification.10:04
programmerjakeok, cesar10:04
lkclit's done by maintaining a suite of indices where instead of a MV operation the indices are MVed.10:04
lkclsuch that on the next operation that would *otherwise* have needed a MV, the source operand is taken from the *MVed index* position10:05
programmerjakeexcept that prefix sum has no moves10:05
lkclso, do predicated elements remain where they are?10:07
programmerjakeand if you tried to renumber indices based on skipping lanes predicated out, you'd end up with a highly variable pattern difficult to optimize hw for10:07
lkcltough.10:07
programmerjakeprefix sum is unpredicated10:07
lkclthen the developer must perform a predicated VCOMPRESS/VEXPAND before/after10:07
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.73.253.155> has joined #libre-soc10:07
programmerjakeok, fine with me10:08
lkcli'd like to keep the predication-index-moving-thing in because it works, and we may find that someone gets it to work10:08
lkclif they find it's low performance and use VCOMPRESS/VEXPAND, they learned something :)10:09
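A rough sketch of the "compress, unpredicated prefix-sum, expand" sequence suggested above, modelled on plain Python lists. `predicated_prefix_sum` is a hypothetical helper; it models data movement only and does not describe the semantics of any actual VCOMPRESS/VEXPAND instruction. Leaving masked-out elements at their original value is an assumption made for illustration.

```python
from itertools import accumulate


def predicated_prefix_sum(vec, mask):
    """Prefix-sum over the elements whose mask bit is set; others are left alone."""
    # "compress": gather the active elements into a dense vector
    packed = [v for v, m in zip(vec, mask) if m]
    # unpredicated prefix-sum on the dense vector
    summed = list(accumulate(packed))
    # "expand": scatter the results back to the active positions
    results = iter(summed)
    return [next(results) if m else v for v, m in zip(vec, mask)]


print(predicated_prefix_sum([1, 2, 3, 4, 5], [1, 0, 1, 1, 0]))  # [1, 2, 4, 8, 5]
```

The prefix-sum in the middle never sees the predicate, which matches the position that the prefix-sum schedule itself stays unpredicated and regular.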
lkclhere:10:09
lkcl+    # start a loop from the lowest step10:09
lkcl+    step = 110:09
lkcl+    while step < xd:10:09
lkcl+        step *= 210:09
lkcl+        stepend = step >= xd  # note end of steps10:09
lkclis that basically the same as the nmigen prefix_sum_ops algorithm?10:10
programmerjakeno but it's similar10:10
programmerjakestep = 2 * dist10:11
lkclbut achieves a work-efficient schedule?10:11
programmerjakebut reduction operates differently than prefix-sum because it does operations toward the other end...10:11
programmerjakereduction achieves a work-efficient schedule, but it's somewhat different than the prefix-sum work-efficient schedule10:13
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.73.253.155> has quit IRC10:14
lkclokaaay. sounding like a separate iterator function is needed: i thought it was identical-first-half10:14
programmerjakeit's similar, i thought it was identical10:14
programmerjakei didn't think through all the details at the time10:15
lkclwell, there's room. submode=0b10 and 0b1110:15
lkclit's all good10:15
lkclok let me just tie this in...10:16
programmerjakethey can probably share a lot of hw at least...10:16
lkclyehyeh10:16
programmerjakenote the code in the `if` that i committed is ported from nmutil.prefix_sum10:16
programmerjakeso you don't need to re-convert it10:17
lkcli'm just going to link iterate_indices2() into SVSHAPE.get_iterator10:18
lkclthat's all10:18
programmerjake https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_preduce_yield.py;h=9e9fa2a69a0efd0a7794149353fff14d6fbcd73a;hb=0b6592c574f814d81cfede4c74c50b583590db13#l49 10:18
programmerjakeif you don't mind my having removed steps.reverse(), just delete the existing iterate_indices and rename iterate_indices2 -> iterate_indices10:19
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has joined #libre-soc10:20
lkclyes i mind. although it should be mirroring rather than total-inversion10:20
programmerjakeit should function identically for all useful cases10:20
lkcli.e. the end-result of inversion is that the result ends up in element VL-1 rather than element 010:21
programmerjakewell, in that case just copy the `if` block to iterate_indices and delete iterate_indices210:22
lkclgimme a sec... ok done10:24
lkclpython3 decoder/isa/test_caller_svp64_parallel_reduce.py >& /tmp/f10:24
lkclnothing "damaged"10:24
lkclnext step: simplev.mdwn10:25
programmerjakek, i'm going to sleep, so ttyl10:27
lkclnight jacob, thanks for your help10:30
markosargh, how the heck do tables work in markdown?10:36
lkcl|heading1|heading2|10:36
markos https://libre-soc.org/openpower/sv/cookbook/chacha20/ 10:36
lkcl|-----|-----|10:36
lkcl|rowdata1|rowdata2|10:36
markosyeah, I've done that but I'm getting crap formatting10:36
lkcl1 sec let me take a look10:36
lkclyou forgot the headings10:37
programmerjakenow that i look at the time, i'm unlikely to make it in time for a 6pm BST meeting, maybe 7pm? sorry10:37
programmerjakedon't count on me attending today10:38
markosprogrammerjake, well no one agreed for today anyways, don't worry10:38
lkclmarkos, fixed the 1st table, you can see what it looks like now.10:39
markosaha!10:39
lkclif you add extra "|----|----|"s it just adds "-----" into cells10:40
lkclyou nearly had it - just the missing headings10:41
lkclthe format you were thinking of is more restructured-text10:41
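The table layout described above (a heading row, exactly one separator row, then data rows) can be sketched as a small generator. `markdown_table` is a hypothetical helper written for illustration; it is not part of the wiki tooling.

```python
def markdown_table(headings, rows):
    """Render a pipe table: heading row, one separator row, then data rows."""
    lines = ["|" + "|".join(headings) + "|"]
    # exactly one separator row, directly under the headings
    lines.append("|" + "|".join("-----" for _ in headings) + "|")
    for row in rows:
        lines.append("|" + "|".join(str(cell) for cell in row) + "|")
    return "\n".join(lines)


print(markdown_table(["heading1", "heading2"], [["rowdata1", "rowdata2"]]))
```

This prints the same three lines as the example pasted in the chat. Emitting the separator only once avoids the stray "-----" cells mentioned above.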
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has joined #libre-soc10:43
markosfinally10:44
markosok, so this should be ok now10:44
markoscould you please check if the VF description is adequate?10:44
markosso that I can finally get this 'done' :)10:45
markosargh, accidentally removed the TOML values10:49
markosfixed10:51
lkclyes am good with it. i just added a REMAP Indexing quick intro as well10:55
markosgreat, I'm closing this as fixed then10:57
markosok, now to moving to the butterfly instructions :)10:58
lkcl:)11:04
lkcljust added an intro section, no conclusion - i think the assembler itself is enough.11:09
lkclaand we're good11:09
markosgreat, (belated) RFPs sent for those  :)11:12
lkclgot it. do update the toml field(s)11:13
lkclmarkos = {amount=NNN, submitted=date} i forget the format YYYY-MM-DD?11:13
lkcl ok you can see in https://bugs.libre-soc.org/show_bug.cgi?id=1007 11:14
markosdidn't I fix it? did I do it wrongly?11:15
lkclyou need to keep the *bugzilla* records consistent with the RFP11:15
lkcl(and you put in EUR 1800 not EUR 1700 which i don't mind)11:16
markosargh11:16
markoscrap, can I edit it?11:16
lkclThe table of payments (in EUR) for this task; TOML format:11:16
lkcl(edit)11:16
lkclmarkos=110011:16
lkcllkcl={amount=400, submitted=2023-03-25}11:16
lkclnope. it's in, and approved.11:16
lkclso i retrospectively changed the amount to 180011:16
lkcl https://bugs.libre-soc.org/show_bug.cgi?id=1007 11:16
lkclyou need to edit the TOML field and put11:17
markoswe'll balance it out in the next one :)11:17
markossorry about that11:17
lkclmarkos={amount=1100, submitted=2023-04-27}11:17
markosat worst I'll buy you an expensive bottle of wine :-)11:17
lkcllikewise in 1006, put the record of the same date11:17
lkcl:)11:17
lkcldon't worry about it11:17
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC11:21
lkcl markos, ok great - let me run the budget-sync thing and you can review your page https://libre-soc.org/task_db/mdwn/markos/ 11:25
lkclgive it 2 mins...11:25
lkclok it's updated.  normally what i do is actually copy that markdown auto-generated table *into* the "RFP comments/results"11:26
lkclbut it requires that you run the budget-sync program and do the TOML-field editing before _that_...11:27
lkclbut at least you can see: "Submitted but not yet paid" now contains the two tasks you are waiting for an RFP for, to be paid, yay11:29
markosindeed, thanks11:36
markoslkcl, btw, reg butterfly insns, should those go to fixedarith.mdwn or own file?11:40
markosI'm going to do something better than what Arm is doing, their versions are not as precise so we cannot use them everywhere as expected11:51
markoscan we do 3-in, 2-out?11:51
markoswhich form is that?11:52
markosor 4-in, 1-out12:00
markos4-in might be useful to add in a right-shift immediate12:01
markosbasically the instructions are trying to emulate fdct_round_shift((a +/- b) * c)12:02
markosif we can do 2-out then we can both fdct_round_shift((a + b) * c) and fdct_round_shift((a - b) * c) in the same instruction12:02
markosif not then we have to provide 2 instructions for that, but in that case, we can use an extra instruction for the shifting12:03
markoser, extra operand12:03
markosfdct_round_shift(x) is essentially ROUND_POWER_OF_TWO(x, DCT_CONST_BITS)12:05
markoswhere #define ROUND_POWER_OF_TWO(value, n) (((value) + (1 << ((n)-1))) >> (n))12:05
markosand DCT_CONST_BITS = 1412:06
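The rounding shift quoted above translates directly into Python. This is a plain transcription of the macro as pasted in the chat, with lower-case function names chosen here for illustration.

```python
DCT_CONST_BITS = 14


def round_power_of_two(value, n):
    """Add half of 2**n, then shift right by n: round-to-nearest division by 2**n."""
    return (value + (1 << (n - 1))) >> n


def fdct_round_shift(x):
    return round_power_of_two(x, DCT_CONST_BITS)


print(fdct_round_shift(3 * 16384))  # 3
print(fdct_round_shift(8192))       # 1 (exactly half rounds up)
print(fdct_round_shift(8191))       # 0
```

One caveat: Python integers are unbounded and `>>` on a negative value floors, so this sketch does not model any overflow or signed-shift behaviour of a fixed-width C implementation.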
markosI'd love to be able to do both a+b/a-b in a single instruction though, that would essentially double throughput12:08
markoswhere can I find the possible Forms?12:11
markosnevermind, 1.6.1 ISA manual12:59
markoslkcl, stupid question, could we assume that an instruction has 2 outputs but only needs one output register? ie, it outputs to RT and RT+113:11
markosI guess not, but thought I'd ask13:14
markosbecause that way we can have 3-in,  RA, RB, RC and 2-outs, RT = (RA+RB)*c and RT+1 = (RA-RB)*c13:14
markosif we can squeeze in a 4-bit immediate to right shift, this will be a killer instruction13:15
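A sketch of the proposal as worded at this point in the discussion: three inputs, a shift amount, and two results. `maddsubrs_model` is a hypothetical behavioural model and not the final instruction definition. The rounding term is an assumption taken from the `fdct_round_shift` definition above, and the coefficient 11585 is simply cos(pi/4) scaled by 2**14, used as example data.

```python
def maddsubrs_model(ra, rb, rc, sh):
    """Return ((ra + rb) * rc, (ra - rb) * rc), each rounded and shifted right by sh."""
    rnd = (1 << (sh - 1)) if sh > 0 else 0  # assumed round-to-nearest term
    first = ((ra + rb) * rc + rnd) >> sh    # would be written to RT
    second = ((ra - rb) * rc + rnd) >> sh   # would be written to the second register
    return first, second


print(maddsubrs_model(100, 28, 11585, 14))  # (91, 51)
```

Producing both the sum and the difference from one instruction is the doubling of throughput mentioned above, compared with issuing a separate instruction for each half.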
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has joined #libre-soc13:27
markoslkcl, actually I think this is already done in fdmadds FRT,FRA,FRC,FRB13:37
markospseudo-code has: FRS <- FPADD32(FRA, FRB)13:37
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC13:37
markosbut there is no FRS declared in the assembly syntax13:37
markosthis looks a bit wrong13:42
markosanyway, could we use the same trick as with svshape and save a bit in output register, and assume a pair of registers written?13:43
markosie instead of RT, provide RT/2, and always assume that this instruction will accumulate both RT and RT+113:45
markoswith accumulate that means you can have the 2-coeffs  butterfly operation fdct_round_shift(a * c1 +/- b * c2) with just 2 instructions :)13:47
markosyou'd just have to swap RA, RB in the second instruction13:47
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has joined #libre-soc14:35
markosoh well, apparently RT + 1 <- does not work :-/15:06
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC15:07
lkclmarkos, svfixedarith.mdwn or yes their own file is perfectly fine16:14
lkclyes that's what's been done.  the extra operand is declared to exist as RT+1 for scalar-only instructions16:15
lkcland is declared to exist as RT+MAXVL for vectorised instructions16:15
lkclnotes are in the spec and they _should_ be at the top of the mdwn file as comments?16:15
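A small sketch of the register numbering rule just described. `implicit_rs` and `dest_registers` are hypothetical helpers. The per-element numbering (element i using RT+i and RS+i) is an assumption added for illustration, as is the reading that offsetting by MAXVL keeps the two result vectors from overlapping.

```python
def implicit_rs(rt, maxvl, vectorised):
    """Register number of the implicit second destination."""
    return rt + maxvl if vectorised else rt + 1


def dest_registers(rt, vl, maxvl, vectorised):
    """List the (RT, RS) register pairs written, one pair per element."""
    rs = implicit_rs(rt, maxvl, vectorised)
    if not vectorised:
        return [(rt, rs)]
    # assumption: element i writes RT+i and RS+i
    return [(rt + i, rs + i) for i in range(vl)]


print(dest_registers(6, 1, 8, False))  # [(6, 7)]
print(dest_registers(6, 4, 8, True))   # [(6, 14), (7, 15), (8, 16), (9, 17)]
```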
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has joined #libre-soc16:16
markoswell, I made some progress, mdwn file is written, test case also, added enums, etc. I'm getting the op_maddsubrs (just picked something for now) generated and trying to run it now but getting some errors16:16
markosI called the instruction as maddsubrs (multiply-add-sub-right-shift)16:20
lkclok cool!16:25
lkclif you drop it in a branch give me a shout i can take a look.16:26
lkclunless you feel confident it doesn't cause "damage" in which case just shove it in master16:26
markosyeah, will do that in a bit, getting some stupid errors right now, don't want to mess up master yet, if I fail to fix it I'll just commit in a branch asap16:28
markosit's about GPROperand(RC), getting lists of indices for GPROperand(RT): (6, 7, 8, 9, 10), GPROperand(RA): (11, 12, 13, 14, 15) but GPROperand(RC): None16:31
markosin power_insn.py: ~1123: for idx in operand.span:16:31
markoslkcl, is it possible to have 3in and an 4-bit immediate for shifting?16:32
lkclyes but you'll need to design a "Form" to do it.  4-bits is a LOT16:33
markosok, where do I put that form?16:33
lkcland you certainly won't get that in the 3-in 2-out ones that already take 4 operands16:34
lkclin fields.text16:34
markosperfect16:34
lkcldon't rush into that decision: it needs to be a "Researched" RFC / wiki page16:34
lkcl(it gets its own budget)16:34
lkclwhich reminds me to do exactly that, as each of these instructions needs to be listed on a special twin-butterfly page that currently doesn't exist16:35
lkclooo there's just enough budget16:35
*** octavius <octavius!~octavius@92.40.169.65.threembb.co.uk> has joined #libre-soc16:36
octaviuslkcl, as you've suggested I go back to verilator, that's what I did. Please see bug 1073 when you have some time, I'd really like to figure out what the problem is16:38
markoswell, started adding form to see how/if I can fit all that16:38
lkcloctavius, take a look at the README as well as the source code of the microwatt_verilator main() loop16:39
octaviusok16:39
lkclit requires some command-line options16:39
lkclyou can probably guess that those command-line options are "the binary to load into RAM"16:39
lkcloctavius, you should have worked out that "if it does nothing then you're looking at a black box, stop it"16:42
octaviusI did stop it16:42
lkclnow you've got compiling, a gentle reminder that the purpose of compiling it is to get it to produce gtkwave traces16:42
octaviusI just ran to remind myself. Last time was in January :)16:43
lkcl:)16:43
lkcland that needs *verilator* compile-time options.16:43
octaviusYes, I noticed the .vcd file was unreadable16:43
lkclthat's probably because it's an fst file (maybe).16:43
lkcluse vcd2fst and fst2vcd - whichever one works use that16:43
octaviusAlso the README in the microwatt repo has no info on verilator at all. Looking at microwatt-verilator.cpp as you've suggested16:44
lkclbear in mind that the output from verilator is *not* immediately compatible with gtkwave (sigh)16:44
octaviusAh ok16:44
lkclyou want the microwatt_verilator branch (only)16:44
octaviusThat's the one I'm using16:44
lkclit's been too long i can't remember everything16:44
octaviusAnd I'm guessing you mean "verilator_trace" branch16:45
lkclmarkos, i'm slightly concerned about the low "XO" bit count of adding shift-immediates, they are incredibly expensive even when you have 3 operands16:45
lkclyes16:45
lkclif it was 2 bits, not so much of a problem, but 4 is a *LOT*16:47
lkclyou risk ending up with needing a full Primary Opcode (or 50% of one)16:48
lkclat which point the instruction is highly likely to get rejected by the OPF ISA WG because it is such a "specialist optional" area16:48
lkclsomething like ternlogi on the other hand brings a massive 256 instructions with it, saving routinely and systematically across general-purpose code16:49
markoswhat do I need to do when I've added a form in the fields.text? plain 'make' chokes, I probably need to run something else, but I forget the sequence16:49
markosI've added a BF-Form16:49
lkclbut these are *area-specific* (DCT/FFT) and the only reason they can even be considered is because the wikipedia page lists something mad like 120 use-cases16:49
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC16:50
lkclplease please wait16:50
markosyeah, just experimenting now, to see if it's even possible16:50
lkclthere is a process for this, we cannot rush ahead adding new Forms arbitrarily without thinking them through and reviewing them16:50
markosnot going to commit anything16:50
lkclyes, you need to add it to power_enums.py16:50
lkclthen add the (new) fields into the later section of fields.text16:50
lkcli don't _mind_ putting them onto the (new) wiki page to see what they look like16:51
lkcl(and/or its discussion page)16:51
markosI did, BF = 46 at the Form class16:51
lkcloh excellent16:52
markoslet me paste the form here for starters16:52
lkclgood idea16:52
markos    |0     | 6   |11   |16   |21   | 25  |30  |31  |16:52
markos    | PO   | RT  | RA  | RB  | RC  | SH  | XO | Rc |16:52
lkclok so you see how XO is only 1 bit?16:52
markosyes, is that a problem? :D16:53
markoshow many bits does it have to be? can we skip it entirely?16:53
lkclthat makes this an absolute top absolute top ultra-priority instruction in the same sort of category as "addi"16:53
lkclor "bc"16:53
markosah, I need to add BF to the end of XO(30)16:53
lkclto give some context: if you didn't have "SH" you could add *SIXTEEN* other 4-operand instructions16:54
lkclno, you need to consider that there is limited space and to consider not proposing this instruction *at all* because it risks getting rejected16:54
markoswell, we could leave the shifting out entirely16:54
lkclthe lower the XO, the higher the priority has to be16:55
markosI see16:55
lkcland obviously it's an exponential curve16:55
lkclas in, "the higher the number of use-cases"16:55
markoswell, it's about the gain, if the gain is justified16:55
lkclcompared to a 10-bit XO this is destroying the opportunity to add a massive *512* other 2-in 1-out instructions16:55
markosI mean Arm did include these instructions but with a fixed shifting value16:55
lkclyes, and they are under similar 32-bit constraints16:56
lkclso you start to appreciate why they did that16:56
lkclthey're barely going to pass through as they are, with 3-in 1-out  (4 operands taking up 20 bits on their own)16:56
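The opcode-space arithmetic behind these numbers can be checked with a short sketch. It assumes a 32-bit instruction with a 6-bit primary opcode, 5-bit register fields and a 1-bit Rc, and treats whatever is left over as XO. `xo_bits` and `encodings` are hypothetical helper names.

```python
INSN_BITS = 32
PO_BITS = 6    # primary opcode
REG_BITS = 5   # one register operand


def xo_bits(num_regs, imm_bits=0, rc_bit=True):
    """Bits left over for the extended opcode (XO) after the other fields."""
    used = PO_BITS + num_regs * REG_BITS + imm_bits + (1 if rc_bit else 0)
    return INSN_BITS - used


def encodings(xo):
    """Number of distinct instructions an XO field of this width can select."""
    return 1 << xo


with_sh = xo_bits(4, imm_bits=4)  # RT, RA, RB, RC plus a 4-bit SH
without_sh = xo_bits(4)           # same four operands, no SH

print(with_sh, without_sh)                           # 1 5
print(encodings(without_sh) // encodings(with_sh))   # 16
print(encodings(10) // encodings(with_sh))           # 512
```

Each bit taken from XO halves the number of instructions that can share the same primary opcode, which is the exponential curve referred to above.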
markosI do, in a sense, I admit I'm seeing this from my own point of view16:56
markosbeing able to do twin butterfly operations in just 2 instructions is a massive win, from my perspective16:57
lkclwhich has to be compared against the perspective of millions of programmers doing general-purpose16:57
lkclyes i know! :)16:57
lkclread above: about the 120 use-cases for DCT on the wikipedia page16:57
lkclit's the only reason we can get away with proposing these *at all*16:57
lkcl(that, and ARM already added them, we can point at that fact and use it as additional justification)16:58
markoswell, something like that could bring Power as a top performer in video processing16:58
lkclindeed16:58
markosor any kind of media processing16:58
lkclbut if it takes up *EIGHT* Primary Opcodes to do so, that's not going to fly16:58
lkclthere's only 32 new POs in the EXT2xx area, 10 of which i want to allocate to LD/ST-Post-Increment16:59
lkcl(because that *is* a huge saving - every single hot-loop in existence in every general-program benefits)16:59
markosI'll play with this a bit17:00
lkclhence, "really high priority"17:00
markosI'll try to minimize SH as much as possible17:00
lkclawesome17:00
markoswould 2-bits be ok?17:00
lkclnow, about RC/RS - there's a place in power_decoder2.py that you (or more like i) *may* need to pay attention to17:00
markosbecause if I can assume eg. shifting by a number of bits17:01
lkclnot really.  that's still two Primary Opcodes17:01
markosok17:01
lkclprobably one is ok, and that's risky. it's still an entire PO taken up by the (set of) instructions17:01
lkclbecause there's what... 8 of them?17:01
markosunderstood17:02
lkclahhh ok17:02
lkcli remember now17:02
lkclsearch for "implicit_rs" in power_decoder2.py17:02
lkclthat's really important.17:02
lkclit's complicated, but a "special check" is needed for the implicit RS/RC/FRS/FRC instructions, actually right there in the decoder17:03
lkcli.e. you can't just "add instructions to the csv files and hope"17:03
lkclgimme a sec...17:03
lkclsorry i forgot about this, it's been a while17:03
markosnp17:03
lkcl https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/power_decoder2.py;h=88b2023859061d7601a9dc94e052c75ec59fd8b1;hb=d40763cd6e186ad9b17ce6f974a38b4c4877965e#l1057 17:04
lkclactually line 104617:05
lkclso you need to decide the XO field (which is in the CSV file) and under which Major (PO)17:06
lkclbtw this *really* needs to all go into the wiki page17:06
lkcl https://libre-soc.org/openpower/sv/twin_butterfly 17:06
lkclso it can be reviewed reaallly carefully17:06
lkcli'll need to do some temporary opcode allocation and find space for them17:07
lkclprobably minor_22.csv - i think there's space still17:07
lkclthen a section will be needed in power_decoder2.py to match it17:07
lkcl1046             with m.If((major == 59) & xo.matches(17:07
lkcl1047                     '-----00100',  # ffmsubs17:07
lkcl....17:07
lkcland you can see that one of either RB or RC can be "extended by MAXVL" when Vectorisation is enabled17:08
lkclso you need to decide which that's going to be.17:08
markosok, this needs a lot of thought still17:09
lkclindeed.  fortunately there's a trail already blazed17:09
lkclbut it's probably best to use the twin_butterfly page to create stub instructions, ultimately intended to be morphed into actual RFC actual Power ISA form17:09
lkclbut kept short for now to make it easy to discuss iteratively17:10
markoswill start adding stuff there asap to discuss17:10
lkclack17:10
lkclit's got its own budget and bugreport17:10
lkcli'll add the fp butterfly instructions later17:11
markospushed17:26
markosThis is the original attempt, still with the 4-bit SH17:28
lkclok great17:28
markospretty sure there are some great misunderstandings on my part here17:29
markosie, I'm not really sure I'm allowed to just write to RT+117:29
markosand now that I see it, it's probably wrong, it probably adds 1 to the value of RT, not the index17:29
lkclno17:30
lkclit's implicit17:30
lkclyou write to RS17:30
markosah, what you said earlier17:30
lkcland ISACaller "knows" to pick that second (implicit) operand up and... yes17:30
markosyeah, I need to read about that17:30
markosso it's possible then to write to 2 GPRs17:31
lkclhave a look at the biginteger page17:31
markosnice to know17:31
lkclwhich contains the kind of spec-wording17:31
markoswill do17:31
lkclyes but we will get push-back for doing so17:31
lkclbecause it's what CISC x86 does17:31
lkclso there is a *lot* of "push-back" going to occur on these instructions, hence why if "and we want 8 Primary Opcodes" is part of that, the ISA WG will just flat-out say "no"17:32
lkcl prod1 <- MUL(RC, sum)17:32
lkclcan just be17:32
lkclRC * sum17:33
lkcljust like in fixedarith17:33
lkcllet me check...17:33
lkclah nope, you're right17:33
lkcl# Multiply Low Immediate17:33
lkcl    prod[0:(XLEN*2)-1] <- MULS((RA), EXTS(SI))17:33
lkclwatch out for this:17:34
lkcl    RT <- prod[XLEN:(XLEN*2)-1]17:34
lkclthe result of MUL and MULS is *DOUBLE* the bitwidth17:34
lkcl(sum of the length of the two operands)17:34
markosright, ofc17:34
lkcland consequently you have to "pick a half"17:34
lkclbut of course, you "pick a half in **MSB0** numbering"... sigh17:34
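A sketch of what "picking a half in MSB0 numbering" means. In MSB0 numbering bit 0 is the most significant bit, so `prod[XLEN:(XLEN*2)-1]` selects the low-order half of the double-width product. `msb0_slice` is a hypothetical helper, and the example uses unsigned values to keep the bit-slicing obvious, whereas MULS in the pseudocode is signed.

```python
XLEN = 64


def msb0_slice(value, width, start, end):
    """Extract bits start..end (inclusive) of a width-bit value, MSB0 numbering."""
    nbits = end - start + 1
    shift = width - 1 - end  # distance of bit `end` from the least significant bit
    return (value >> shift) & ((1 << nbits) - 1)


a, b = 0xFFFF_FFFF_FFFF_FFFF, 3
prod = a * b  # needs up to 2 * XLEN bits

high = msb0_slice(prod, 2 * XLEN, 0, XLEN - 1)
low = msb0_slice(prod, 2 * XLEN, XLEN, 2 * XLEN - 1)

print(hex(high))  # 0x2
print(hex(low))   # 0xfffffffffffffffd
```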
markoshm, the arm instructions return the high half17:35
markoswe could add 2 pairs17:35
lkclfor accuracy17:36
markosone returning the high half and another the low17:36
lkclabsolutely no chance of that17:36
markoswithout the shifting bit :)17:36
lkclthere's an internal hardware limit we've set of 3-in 2-out17:36
lkcl@ 64-bit width17:36
lkcland that's down to the massive complexity that results from doing Register Hazard checking17:37
lkclthe only reason we get away with hi-lo-half in the bigint operations is because they're actually a carry-in carry-out chain17:37
markosright17:37
lkclso for the internal chain the instructions actually become 2-in 1-out, the first one in the chain is 3-in 1-out, and the last one in the chain is 2-in 2-out17:38
lkclwhich is the only reason we can get away with such ultra-expensive instructions, that and they'll end up in libgmp17:38
markossimilarly, these will go in pretty much all video/audio codecs17:39
*** tplaten <tplaten!~tplaten@195.52.20.159> has joined #libre-soc17:39
lkclbtw no need to put the autogenerated code in the wiki17:40
lkclexactly17:40
lkcllike... aaaalll of them17:40
markosthough, for that reason we could avoid the shifting entirely17:40
markosI mean as an operand17:40
lkclwhich we can easily "fly" on the "IoT / Edge / accelerator" thing17:40
lkclyes pleeease17:40
markosonly reason I'd want it is for future17:41
markosin case a future codec decides to change the number of shift bits17:41
lkclit's too much for me to have to explain, and stake the entire reputation of what we're doing on having the instructions be rejected17:41
markosthough that's unlikely17:41
markoswe're good until 203017:41
markosav1/av2/etc17:41
markos:D17:41
lkclahh if there's specific CODECs that use these instructions explicitly please do list them17:41
lkclthat again gives me information i can present in ISA WG meetings, "these are common CODECs, actual implementations, the actual spec says DoThisThing()"17:42
markoswell, these fdct are all libvpx/av117:43
markosand av217:43
lkclminor_59... what's that supposed to be used for...17:43
lkcl_great_!17:44
lkcldo put it into the page17:44
lkclevery instruction needs a "Rationale"17:44
lkcli.e.17:44
lkcl"why as IBM should we invest $50-100 million implementing these instructions"17:44
lkclor {insert-N-E-Other-Power-ISA-Implementor}17:44
lkclopcode 59 is typically stuffed with FP-single17:46
markosI just picked 59 randomly :)17:46
lkclyyeah and likely overwrote some official instructions in the process!17:47
lkcl*extreme* care needs to be taken here, it's a frickin lot of work17:48
lkcl i'm looking at the tables here https://libre-soc.org/openpower/sv/bitmanip/ 17:48
lkclhow many of these instructions are needed?17:48
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has joined #libre-soc17:48
lkclone i know is needed for the inner product, and another for the outer product17:48
lkclso that's at least 217:49
lkclthen iirc you have to use different ones for iDCT than from DCT, so that's 817:49
lkclsorry, 4. then FFT needs the same treatment, that's 817:49
lkclfortunately though i think the outer-butterfly is just a twin add-subtract - specified as a 2-in 1-out but having an implicit RS17:51
lkcl https://libre-soc.org/openpower/isa/svfparith/ 17:51
markosadded some rationale, mention of the Arm instructions17:51
lkclawesome17:51
lkclbtw the DCT subsystem *needs* both the inner-butterfly *and* the outer-butterfly instructions17:52
lkclthat's why there's 2 separate uses of svremap in the unit tests.  first use does the inner butterfly (the twin-madd)17:53
markoswell, I'd suggest 2 pairs of instructions17:53
lkclsecond use of svremap does the outer butterfly (which is i believe just an add-sub)17:53
markosfrom what I see in libvpx though, both fdct and idct use the same kind of instructions17:54
lkclhaang on... DCT just uses fadds.  ha!17:55
lkcl https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD#l594 17:55
markosand arm ports also use those special instructions, but again these are of limited precision17:55
lkclso fortunately the outer-butterfly is just "an add"17:55
lkclwell, the limited precision occurs when you specify an elwidth17:55
lkclwhich will be where the biggest efficiency savings come from17:56
lkclso, actually (whew) - at least for DCT - only *two* twin-mul-and-accumulate-and-shift instructions17:56
markoswell in our case, it would help to be able to do the calculations in a larger width and then just scale/narrow down17:56
markosyes17:57
lkclwhiiich... means... they can just about fit into opcode 2217:57
lkclthere's an area17:57
lkclhttps://libre-soc.org/openpower/sv/bitmanip/17:57
markosArm is full of many versions of these functions because they're fast but not accurate enough17:57
lkcl| NN | RT | RA | it/im57 | im0-4 | 0 00 00 | 0 | xpermi | TODO-Form |17:58
lkcl| NN |    |    |         |       | - -- 00 | 0 | rsvd   | rsvd      |17:58
markos23 helper functions to do basically the same thing17:58
lkclyowser17:58
lkclok so see that entry just below xpermi?17:58
markosrsvd?17:59
lkclas long as 26-28 are *not* zero, that's "free encoding space"17:59
lkclyou get *one* bit for a shift, there17:59
lkcllet me edit it...17:59
markoshaha, I'll take it18:00
lkclahh... where the heck's the page... it's in a separate-include...18:00
lkclah. draft_opcode_tables18:00
lkclok what's the instruction names?18:00
lkclone is maddsubrs18:01
markosI proposed maddsubrs, but open to suggestions18:01
lkclahh "s" is usually reserved for "FP single"... are there any other instructions ending in "s" in the *fixed*-point set?18:02
lkclmaddsubrs it is for now18:02
markosthis one does both add and sub18:02
markosassuming I can write to RT and RT+118:02
markosor RT and RS18:02
lkclRT and implicit-RS.18:04
lkclhttps://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=5b0a082545185799b7bf053374aa3b60117ef74b18:04
lkclok so that's your allocation for the instruction18:04
lkclit'll need to go into minor_22.csv18:04
lkcl(not minor_59.csv)18:04
lkcland you want a (sigh) XO length i think of 11...18:05
lkclgimme a sec...18:05
lkclsee insndb.csv18:05
lkclhttps://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isatables/insndb.csv;hb=HEAD18:06
lkcl   7 minor_22.csv,22,21:31,NONE,pattern,normal18:06
lkcl21-31... yes, 11 bits18:06
lkclokaaay18:06
lkclso _now_ you can "interpret" the contents of minor_22.csv, every single "pattern" *has* to be 11 bit in length...18:06
markos11?18:07
lkclthe 1st column18:07
lkcl-----01011-,ALU,OP_FISHMV18:07
lkclexample.18:07
lkclcount the total "-" "0" and "1"s18:07
lkclcomes to 1118:07
lkclrepresenting bits 21 thru 31 *inclusive*18:07
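
A minimal sketch of the width rule described above: every pattern in a CSV whose bit range is `21:31` has to contain exactly one character per bit, 11 in total. The helper names `bit_range_width` and `pattern_ok` are hypothetical and are not part of openpower-isa.

```python
def bit_range_width(bit_range):
    """Width of an inclusive 'start:end' bit range, e.g. '21:31' covers 11 bits."""
    start, end = (int(x) for x in bit_range.split(":"))
    return end - start + 1


def pattern_ok(pattern, bit_range):
    """True if the pattern uses only '0', '1', '-' and has one character per bit."""
    return (len(pattern) == bit_range_width(bit_range)
            and set(pattern) <= set("01-"))


if __name__ == "__main__":
    print(bit_range_width("21:31"))             # 11
    print(pattern_ok("-----01011-", "21:31"))   # True: the OP_FISHMV example row
    print(pattern_ok("-----01011", "21:31"))    # False: one character short
```

A check like this could be run over the first column of a CSV before regenerating the decoder, so that a miscounted pattern is caught early.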
lkclsooo... with the new allocation18:08
markosbut I have 4 operands, RT, RA, RB, RC, which are 6:2418:08
lkcllook at the diff18:08
lkcldiff --git a/openpower/sv/draft_opcode_tables.mdwn b/openpower/sv/draft_opcode_tables.mdwn18:08
lkcl | 0.5|6.10|11.15|16.20 |21..25   | 26....30  |31| name     | Form    |18:08
lkcl+| NN | RT | RA  | RB   | RC      | sh 01  00 |0 | maddsubrs | BF-Form  |18:08
lkclRT RA RB and RC are all allocated to 6:2418:09
markosaaaaaaah18:09
lkclbut column *one* of each csv file is allocated to *XO* identification18:09
lkclyou will also need to add entries further down in fields.text which tell power_decoder.py where those RT RA RB and RC are, for BF-Form18:10
lkclhttps://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isatables/fields.text;h=b0f91cae74f2dec822138b97c0286d6b6cda76f8;hb=d40763cd6e186ad9b17ce6f974a38b4c4877965e#l79918:10
lkcl 799     RT (6:10)18:10
lkcl 800         Field used to specify a GPR to be used as a target.18:11
lkcl 801         Formats: A, BM2, D, DQE, DS, DX, MM, VA, VA2, VX, X, XFX, XO, XX2, SVL, XB, TLI, Z2318:11
lkclaaaand now...18:11
lkcl....18:11
lkcl....18:11
lkclBF18:11
lkcllikewise for RA18:11
lkcl 747     RA (11:15)18:11
lkcl 748         Field used to specify a GPR to be used as a18:11
lkcl 749         source or as a target.18:11
markosok, thanks for your patience18:11
markosI'll get it eventually18:11
lkcl  750 Formats: ...... .... *BF*18:11
lkclit's all in the (various, numerous) diffs18:11
lkclnormally it would be straightforward, just look at one already done, but the extra complication is the implicit arguments18:12
lkclso18:12
lkcllet me find git link for minor_22.csv18:12
lkclhttps://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isatables/minor_22.csv;h=7cb4785af2ff915acf4c724d72709a470e2c6a48;hb=d40763cd6e186ad9b17ce6f974a38b4c4877965e#l4018:13
lkclso line 4318:13
lkcllet's take say line 39 - OP_CPROP18:13
lkclthat has18:13
lkcl  39 0110001110-,ALU,OP_CPROP,R18:13
lkclso that means, that for power_decoder.py to "match"18:13
lkclbit 21 must be 018:13
lkclbit 22 must be 118:13
lkclbit 23 must be 118:13
lkclbit 24 must be 018:13
lkcl...18:13
lkclbit 30 must be 018:14
lkcland bit 31 we DON'T CARE18:14
lkcl(because "-")18:14
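
The matching rule spelled out above can be sketched as follows. This is an illustration only, assuming Power ISA MSB0 numbering (bit 0 is the most significant bit of the 32-bit instruction word); `insn_bit` and `matches` are hypothetical helpers, not the power_decoder.py implementation.

```python
def insn_bit(insn, n):
    """Bit n of a 32-bit instruction word, MSB0 numbering (bit 0 is the MSB)."""
    return (insn >> (31 - n)) & 1


def matches(pattern, insn, first_bit=21):
    """Compare a '0'/'1'/'-' pattern against instruction bits starting at first_bit."""
    for offset, ch in enumerate(pattern):
        if ch == "-":
            continue  # don't care: any value is accepted for this bit
        if insn_bit(insn, first_bit + offset) != int(ch):
            return False
    return True


if __name__ == "__main__":
    # major opcode 22 in bits 0..5, XO bits 21..30 = 0110001110, bit 31 = 0
    insn = (22 << 26) | (0b0110001110 << 1)
    print(matches("0110001110-", insn))             # True
    print(matches("0110001110-", insn | 1))         # True: bit 31 is don't-care
    print(matches("0110001110-", insn ^ (1 << 9)))  # False: bit 22 flipped
```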
lkclso, "translating" the allocation 26:30 from the new allocation18:14
lkcl21-25 is right smack in the middle of RC, therefore must be "don't care"18:14
lkclbit 26 is "sh" so *that* must be "don't care" as well18:15
lkcland bits 27-31 must be "01000"18:15
lkclso!18:15
lkclwe have the entry!18:15
lkcland it is...18:15
lkcl------0100018:15
lkclta-daaa18:15
markos:)18:15
lkclthat's the entry to go into minor_22.csv at line... 43.18:16
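
The derivation above (operand and `sh` bits become don't-care, the fixed bits 27-31 become `01000`) can be reproduced with a short sketch. `build_pattern` is a hypothetical helper written for this illustration; it assumes the pattern covers bits 21 through 31 inclusive.

```python
def build_pattern(fixed_bits, first=21, last=31):
    """Build a minor-opcode pattern string.

    fixed_bits maps an instruction bit number to its required value (0 or 1).
    Every bit not listed (operand fields such as RC, or the 'sh' bit) is
    emitted as '-' meaning don't-care.
    """
    return "".join(
        str(fixed_bits[bit]) if bit in fixed_bits else "-"
        for bit in range(first, last + 1)
    )


if __name__ == "__main__":
    # bits 21-25 are part of RC, bit 26 is sh, bits 27-31 are fixed at 01000
    maddsubrs_fixed = {27: 0, 28: 1, 29: 0, 30: 0, 31: 0}
    print(build_pattern(maddsubrs_fixed))  # ------01000
```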
lkclevery single frickin instruction has to go through this process, sigh18:16
markosI'll add the entry there18:16
lkclawesome18:17
lkclholy hell barometric pressure change18:17
lkclunbelievably painful even with 4 aspirin and 2 paracetamol18:18
markosget some rest18:21
lkclnot going to help - weather's changing constantly today18:33
lkclapparently this is a well-known phenomenon in japan18:33
lkclbut very much less-recognised in europe / us.18:33
lkcli can feel my ears popping constantly (like in an airplane) hence i know the pressure change is happening18:33
programmerjakeluke, iirc you removed iterate_indices2 and copied the section to iterate_indices, did you ever push that?18:33
programmerjakehope you feel better18:34
lkclno i didn't, i simply called the alternate function if submode=0b10/1118:34
lkclbeen a wild ride today18:34
markosmissing something still: this file (I guess autogenerated) gives me this:18:36
markos+maddsubrs,NORMAL,,1P,EXTRA2,NO,d:FRT;d:CR1,s:FRA,s:FRB,s:FRC,RA,RB,RC,RT,0,CR1,018:36
markoswhy am I getting FR* registers in there?18:36
markosmaybe it was generated previously18:40
*** octavius <octavius!~octavius@92.40.169.65.threembb.co.uk> has quit IRC18:41
programmerjakerun `make`, it replaces those files...18:41
markosjust did18:41
markosstill getting the same result18:41
programmerjakedo you have the right form in the csv?18:41
markosah right18:42
markosthanks18:42
markosweird, still getting the same18:44
markosI'm going to commit in a branch18:45
programmerjakeit's probably going to the wrong case in sv_analysis.py, e.g. when I added pcdec I had to add a case: https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/sv/sv_analysis.py;h=8b89212fd736d65a383ded16f2b770966efe9cb5;hb=HEAD#l60518:48
programmerjakeregs comes from https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/sv/sv_analysis.py;h=8b89212fd736d65a383ded16f2b770966efe9cb5;hb=HEAD#l36318:51
ghostmansdmarkos, check these lines: https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/sv/sv_analysis.py;h=8b89212fd736d65a383ded16f2b770966efe9cb5;hb=HEAD#l37818:54
ghostmansdall fields for CSVs are generated here18:55
programmerjakeoh, i think i spotted your issue, do you have it writing CR1 instead of CR0?18:55
programmerjakesince for Rc=1, fp ops write CR1, but int ops write CR018:58
lkclit'll be down to what's in sv_analysis.py19:43
lkclbut you don't want to be worrying about sv right now19:44
lkclbecause this is a *scalar* instruction19:44
lkclbut just so you know, look at sv_analysis RM-1P-3S1D section19:45
lkcl    elif value == 'RM-1P-3S1D':19:45
lkclit's a previously-unrecognised pattern19:45
lkcland the fallback is "fmadd*"19:45
lkcli need to know the "key" pattern19:46
lkclregs == [somethingsomething]19:46
lkcl1111011111,ALU,OP_MADDSUBRS,RA,RB,RC,RT,NONE,CR1,019:47
lkclah yes, you put CR1, just like jacob said19:47
lkclmake that CR019:47
lkcland it *should* then match on19:47
lkcl        elif regs == ['RA', 'RB', 'RC', 'RT', '', 'CR0']:  # pcdec19:47
lkclwhich will activate this19:47
lkcl            res['0'] = 'd:RT;d:CR0'  # RT,CR0: Rdest1_EXTRA219:48
lkcl            res['1'] = 's:RA'  # RA: Rsrc1_EXTRA219:48
lkcl            res['2'] = 's:RB'  # RT: Rsrc2_EXTRA219:48
lkcl            res['3'] = 's:RC'  # RT: Rsrc3_EXTRA219:48
programmerjakeother issues I spotted, the pseudocode uses rotate left instead of shift right...it'll give the wrong results19:49
lkclre-run sv_analysis.py19:50
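
A toy model of why `CR1` versus `CR0` changed the generated row: the register profile is compared as a whole list, and an unrecognised profile falls through to a floating-point style default. This is a simplified sketch, not the code in sv_analysis.py. The function name `extra2_for`, the assumption that a `NONE` column becomes an empty string, and the exact fallback values (modelled on the FR* row quoted earlier in the log) are all assumptions.

```python
def extra2_for(regs):
    """Toy lookup from a register profile to EXTRA2 assignments (illustrative only)."""
    if regs == ['RA', 'RB', 'RC', 'RT', '', 'CR0']:
        # integer 3-source, 1-destination profile
        return {'0': 'd:RT;d:CR0', '1': 's:RA', '2': 's:RB', '3': 's:RC'}
    # assumed fallback: the FP (fmadd-like) assignments
    return {'0': 'd:FRT;d:CR1', '1': 's:FRA', '2': 's:FRB', '3': 's:FRC'}


if __name__ == "__main__":
    with_cr1 = ['RA', 'RB', 'RC', 'RT', '', 'CR1']
    with_cr0 = ['RA', 'RB', 'RC', 'RT', '', 'CR0']
    print(extra2_for(with_cr1)['0'])  # d:FRT;d:CR1
    print(extra2_for(with_cr0)['0'])  # d:RT;d:CR0
```

Because the comparison is on the entire list, a single differing column is enough to miss the integer case.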
lkcli also removed Rc=1 from BF-Form, and fitted it to what went into the bitmanip-opcode-22 table19:52
lkclhttps://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=fa1f3c9e3dce8fdb62b47a34b6c9203293b9402619:52
lkclsorry!  Rc=1 effectively doubles the number of instructions, which we can't really afford to do19:53
programmerjakeanother issue: that's actually a 5-in 2-out op...since it both reads and writes RT and RS19:53
lkclermmm... ermermerm...19:53
lkclyep that's not going to work19:53
lkclso RA has to be the source-of-where-the-accumulating-happens19:56
lkclwhich *happens* to be *exactly* the same register as RT19:56
programmerjakeidea: put the pair of coefficients and accumulated sums each in 1 reg with each value being the lower/upper half of a reg...this should reduce input/output regs to 4-in 1-out20:06
programmerjakeidk if that'll fit the DCT pattern tho20:07
programmerjakethis is kinda like how cdtbcd works where the upper and lower halves are independent20:08
programmerjakee.g. RT <- ((RT)[0:XLEN/2-1] + prod0) || ((RT)[XLEN/2:XLEN-1] + prod1)20:10
programmerjakethat way if you set elwid=32 you get 2 16-bit results20:11
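
The packed-halves proposal above can be modelled as below. This is a sketch of the idea only, assuming MSB0 numbering so that `(RT)[0:XLEN/2-1]` is the upper half, and assuming each half wraps independently with no carry between halves. `packed_accumulate` is a hypothetical name.

```python
def packed_accumulate(rt, prod0, prod1, xlen=64):
    """RT <- ((RT)[0:XLEN/2-1] + prod0) || ((RT)[XLEN/2:XLEN-1] + prod1)."""
    half = xlen // 2
    mask = (1 << half) - 1
    hi = ((rt >> half) + prod0) & mask   # upper half accumulates prod0
    lo = ((rt & mask) + prod1) & mask    # lower half accumulates prod1
    return (hi << half) | lo


if __name__ == "__main__":
    print(hex(packed_accumulate(0x0000000100000002, 3, 4)))
    # overflow in the lower half does not carry into the upper half
    print(hex(packed_accumulate(0x00000005FFFFFFFF, 0, 1)))
    # xlen=32 gives two independent 16-bit lanes
    print(hex(packed_accumulate(0x00010002, 1, 1, xlen=32)))
```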
markossorry had to be afk for a while to pick up my son20:18
markosright, CR0 was the reason20:19
markosthanks a lot!20:19
markosprogrammerjake, yeah, it's far from perfect right now, and probably incorrect20:19
markosbut the half-register coefficient is a good idea, I was actually thinking about it for the results20:23
markosie, high RT -> add, low RT -> sub20:23
markoslkcl, I saw you removed the accumulate, is there no way to keep the accumulate there?20:30
lkclmarkos, RA-when-set-to-the-same-register-as-RT *is* the accumulator20:33
lkclthat's the way it works20:33
lkcland no, it will not be ok to do split-use of registers20:34
lkclhow would it ever then be possible to do 64-bit DCT?20:34
lkcllast thing we need is to fall onto a SIMD-within-a-Register paradigm20:35
markoshm, hm, RA == RT only makes sense if we do in-place DCT20:37
markosand actually it kind of forces us that way20:37
lkclthe DCT Schedules are specifically designed for precisely and exactly that20:42
lkclthis is a world-first20:42
lkclthe only reason it is possible at all is because the elements are loaded and then traversed in a hybrid bit-reversed *and* gray-coding pattern20:43
lkclsuch that20:43
lkclwhen "unravelling" layer by layer, each layer is *not* destructively overwritten when doing the 3210 0123 schedule20:44
lkclbecause it's *already been loaded such that it becomes a 0123 0123* schedule for that exact moment in the schedule20:44
lkcland consequently you *can* do in-place20:45
lkclall standard SIMD algorithms *need* double the registers20:45
lkclbecause they try to do 0123 3210 and half-way through that they destroy the data20:45
lkclmarkos, you'll need i think to experiment by running remap_dct_yield.py20:49
lkcland see what it does.20:49
lkclyou'll find that - like Indexed REMAP but without the GPRs - it generates "prerequisite offsets"20:49
lkclthat you *must* drop on top of a fully in-place instruction20:50
lkclin this case it will be maddsubrs *0, *0, *16, *020:50
lkclwhere *16 equals the coefficients20:50
lkclsorry20:50
lkclmaddsubrs *0, *0, *0, *1620:50
lkcland the *schedule* system will add on the required offsets to RT, RA, RB and RC *for* you20:51
lkclto make the *entire* triple loop20:51
lkclit's liiitttteralllly three (quantity 3of) instructions20:51
lkclsvshape, svremap sv.maddsubrs.20:52
lkclbdang.20:52
lkcldone.20:52
markosyou're right, I was thinking that we might need to reuse the coeffs, but if we can do the whole thing in one go, all the better and we don't need to reuse20:52
lkcleven the coefficients are established in a set order that makes them useable as a vector20:52
lkcland20:52
lkclguess what?20:52
lkclthe "coefficient-offsetting" Schedule (REMAP SVSHAPE3) is set up *precisely and exactly* to give you the *exact* required coefficient20:53
lkclat the exact and precise required time20:53
lkclit's extremely elegant, sophisticated, and overwhelmingly-confusingly-straightforward20:53
lkclcompared to the absolute hell normally subjected onto programmers20:53
markoswell, you're right, I'll have to play quite a bit with the dct_yield example, in fact I might copy it to work on the maddsubrs20:54
*** jn <jn!~quassel@user/jn/x-3390946> has quit IRC20:54
markoswait, in that case, I don't need 3-in20:54
markosor rather I don't need a separate RT register20:54
markosbecause they are the same20:54
lkclyou should be looking to do nothing else other than to copy the way that the FP DCT works20:55
lkclunless there is a really compelling reason to do otherwise20:55
lkclsuch that you should literally be able to cut/paste the fp dct test examples20:55
*** jn <jn!~quassel@95.223.44.193> has joined #libre-soc20:56
*** jn <jn!~quassel@95.223.44.193> has quit IRC20:56
*** jn <jn!~quassel@user/jn/x-3390946> has joined #libre-soc20:56
lkclreplace sv.ffmmads (whatever) with sv.maddsubrs20:56
markoswhat I still haven't figured out20:56
lkcland "It Should Just Work(tm)"20:56
lkclrun the tests.  and the yield program.  and the associated project nayuki dct tests.20:56
markosI can't understand where the implicit RS is defined, I mean how does it know where to place the result?20:56
lkclwe went over that: that's in power_decoder2.py20:57
lkclsearch for the word "implicit_rs"20:57
markosah yes, you did say that, sorry20:57
markosok, will continue playing with this20:57
lkclnow you'll need a section "with mIf((major==22) & so.matches("------01100")20:57
lkcli'll do that bit20:58
lkcli'll sort it now20:58
markosthanks21:00
lkcldone21:01
markosif RA == RT, can I skip one in the declaration?21:05
markostrying to see if I can still shave off some bits for shifting :-)21:06
lkclmmm... maaaybe.  maybe not.  when using Vertical-First Mode you need to be able to specify some registers as scalar, some as vector21:14
lkcland if they don't exist, you can't do that21:14
lkclVertical-First Mode would be useful for being able to utilise the Schedule but to run *more than one* instruction, just like in chacha2021:14
lkclin this case, you could detect "was there an overflow"21:14
lkcland flip to higher bit-width *without* leaving the Schedule Arrangement21:15
lkcljust branch to a different area within the loop21:15
lkclyou could even go "oop, by Layer 3 or greater we *know* we are going to run out of bit-accuracy in 16-bit therefore let's start using 32-bit for Layer 3 4 and 5"21:16
lkclall sorts of weird stuff21:16
lkclbut if you don't have control over the operands it's going to be much more challenging21:16
lkclplus, if you ever need to use this in a scalar context, what should RA, RB and RT be?21:17
lkclif you *really* feel that an overwrite is ok in all circumstances, then yes we can explore that21:18
lkcland it will be ok to do precisely because butterfly will have *two* input operands "in-flight"21:18
lkcl(like compare-and-swap)21:18
programmerjakein case anyone was wondering, my build server crashed or something since I found it powered off rn, should be up and working now22:17
markoslkcl, well, scalar mode is not really the use case here in point, I mean sure one can use it then also, but it doesn't really mean much22:22
markosbut if it makes a huge difference in in-place DCT applications, and there is no other way, then yes I would be willing to consider it22:23
markosagain the point is to manage to save some bits for shifting22:24
markoseg if instead of maddsubrs RT, RA, RB, RC, SH (=1-bit for shifting), we manage to do the same with maddsubrs RA, RB, RC, SH (4-bits, give back one bit to XO), that makes a huge difference and a very powerful instruction that is future proof for other DCT implementations22:25
markosif we leave the shifting out entirely, then it's just a couple of madds22:26
markoswhich sure it can save some instructions but it won't make that much of a difference22:26
markoslet me give you some examples22:26
programmerjakerather than RA, RB, RC, we'd probably name them RT, RA, RB22:28
programmerjakemaddsubrs RT, RA, RB, SH22:28
markoshttps://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/arm/fdct_neon.h22:28
markosok22:29
markoswhat would RC be?22:29
markosI can understand RA == RT, but in your example how will they be mapped to the (a +/- b) * c22:30
programmerjakea = RT, b = RA, c = RB?22:30
markossigh, ofc22:31
programmerjakeif there's only 3 args, one of them is almost always named RT or RS22:32
markosin any case, if you check the file above, there are about 20+ implementations of these butterfly instructions22:32
markosand the reason is that the arm "fast" implementations vqrdmulhq_s16/vqrdmulhq_s32 fail to provide full precision22:32
markosso for the single-coeff implementation you have this code:22:33
markosconst int32x4_t a0 = vmull_n_s16(vget_low_s16(a), constant);22:33
markos  const int32x4_t a1 = vmull_n_s16(vget_high_s16(a), constant);22:33
markos  const int32x4_t sum0 = vmlal_n_s16(a0, vget_low_s16(b), constant);22:33
markos  const int32x4_t sum1 = vmlal_n_s16(a1, vget_high_s16(b), constant);22:33
markos  const int32x4_t diff0 = vmlsl_n_s16(a0, vget_low_s16(b), constant);22:33
markos  const int32x4_t diff1 = vmlsl_n_s16(a1, vget_high_s16(b), constant);22:33
markos  *add_lo = vrshrq_n_s32(sum0, DCT_CONST_BITS);22:33
markos  *add_hi = vrshrq_n_s32(sum1, DCT_CONST_BITS);22:33
markos  *sub_lo = vrshrq_n_s32(diff0, DCT_CONST_BITS);22:33
markos  *sub_hi = vrshrq_n_s32(diff1, DCT_CONST_BITS);22:33
markosthe DCT_CONST_BITS = 1422:33
markosfor vp8/vp9 and av122:33
markospossibly for av2 as well, and quite likely that applies to other codecs as well22:34
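
A scalar model of one lane of the NEON sequence quoted above: widening multiply, multiply-accumulate and multiply-subtract, then a rounding shift right by `DCT_CONST_BITS`. It is a sketch for following the arithmetic, not the proposed maddsubrs pseudocode. Python integers are unbounded, so 32-bit overflow is not modelled, and the function names are hypothetical.

```python
DCT_CONST_BITS = 14


def round_shift(x, bits=DCT_CONST_BITS):
    """Rounding arithmetic shift right, as vrshrq_n_s32 does per lane."""
    return (x + (1 << (bits - 1))) >> bits


def butterfly_one_coeff(a, b, constant):
    """Return (round((a + b) * c), round((a - b) * c)) for one lane."""
    a0 = a * constant            # vmull_n_s16(a, constant)
    total = a0 + b * constant    # vmlal_n_s16(a0, b, constant)
    diff = a0 - b * constant     # vmlsl_n_s16(a0, b, constant)
    return round_shift(total), round_shift(diff)


if __name__ == "__main__":
    # with constant = 1 << 14 the scaling cancels, so the result is (a + b, a - b)
    print(butterfly_one_coeff(100, 30, 1 << DCT_CONST_BITS))  # (130, 70)
    print(butterfly_one_coeff(30, 100, 1 << DCT_CONST_BITS))  # (130, -70)
```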
markosnow what if we have some code that needs another constant for shifting?22:34
markoswe would have to have another instruction or do what Arm does22:34
markosfall-back to less efficient code22:34
markosstill faster than scalar22:35
markoswe could do all this code in just a couple of instructions and be future proof, if a) we allow accumulate, b) we allow shifting by an immediate value22:35
programmerjakewhat about putting the constant in a handy SPR? e.g. LR or CTR22:36
programmerjakethat would be 4-in 2-out then22:36
markoscan we do that?22:36
markoswhat are the drawbacks vs a normal GPR?22:37
programmerjakemaybe?22:37
programmerjakea normal gpr needs an argument22:37
programmerjakea spr needs to be not otherwise used or saved/restored22:37
markosproblem is that it's not just a single constant for a DCT22:38
markosit's essentially a bunch of cospi fractions22:38
programmerjakeso, hence why I was suggesting LR since we'll probably want CTR for looping22:38
programmerjakenot c, sh in the spr22:38
markoscospi(20/64), cospi(12/64), etc is a pair for the 2-coeff22:38
markosaaaa22:38
markossorry22:38
markosyes, that would work22:39
markossorry it's late22:39
markosbecause that would remain totally constant throughout the whole code base22:39
markosyes indeed22:39
markosis it possible that LR is used for something else in the DCT loop?22:40
programmerjakeother than return address which can easily be stored on stack or in a spare gpr, no22:40
markosif lkcl agrees, that solves a problem22:40
markoshow would I read the value from LR to use as a shift value?22:41
programmerjakeyeah, just icr if 4-in is too much...22:41
programmerjakeuuh, just write `blah >> LR`?22:41
programmerjakeLR[58:63]22:42
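
For reference, a sketch of what the slice `LR[58:63]` selects, assuming a 64-bit register and Power ISA MSB0 numbering (bit 63 is the least significant bit), so the slice is the low 6 bits. `msb0_field` is a hypothetical helper for illustration.

```python
def msb0_field(value, start, end, width=64):
    """Extract bits start..end (inclusive, MSB0 numbering) from a width-bit value."""
    nbits = end - start + 1
    shift = width - 1 - end
    return (value >> shift) & ((1 << nbits) - 1)


if __name__ == "__main__":
    lr = 0x1234_0000_0000_000E     # arbitrary example value, low bits hold 14
    sh = msb0_field(lr, 58, 63)
    print(sh)                      # 14
    print((1000 << 14) >> sh)      # 1000: shift right by the extracted amount
```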
markosI mean it's directly accessible and I don't have to use a special instruction within the pseudocode22:42
markosok, thanks22:42
markosin that case22:43
markoswe don't even have to force RA=RT22:43
markoswe can keep the previous syntax and just use A-Form?22:43
programmerjakehttps://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isa/branch.mdwn;hb=f8e2c0cb1467391aa7ae4b8b092c281ee2e16a7b#l7522:43
markosok22:44
lkclya know what? an overwrite would i think work fine22:44
programmerjakeif you're not reading RT, sure. if you are reading RT too you have too many inputs22:44
lkclmarkos, no don't do that. it requires LR as an operand into the Dependency Matrices22:45
lkclwhich will cause absolute mayhem22:45
markosright22:45
markosok, then22:46
lkclregister files *have* to be kept separate, otherwise the Dependency Management becomes hell22:46
lkclbasically think of a matrix, with every register known on both the rows and the columns22:46
lkclany time you add an extra dependency, you end up with the *entire row* having to have a DM Cell for that register22:46
lkcljust in case you ever executed an instruction that read LR just after one that wrote it22:47
lkclif you can keep GPR-GPR-GPR then the Matrix becomes "sparse" and you can miss out the majority of entire rows of Dependencies22:47
lkclCTR is definitely allocated to counting, it's even implementable as special Architectural State22:48
lkclrather than an actual "register" per se22:48
programmerjakewell, if it can match register usage of some pre-existing op, then LR could be used, e.g. if your op uses the same registers as a branch22:49
lkcli need to experiment to see if ffmadds can be reduced by one operand22:49
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC22:49
programmerjakesince then it can share the dependency matrixes used for branch ops22:49
programmerjakeLR is the other spr that is likely treated specially22:50
programmerjakeoh, idea, mush it into the register profile of the GF(p) fft op22:53
programmerjakesince that reads a spr22:54
programmerjakegfpmaddsubr22:55
programmerjakeit reads the GFPRIME spr22:56
programmerjakethough otoh that probably would have special state associated with it making writing it much more expensive22:57
programmerjakeoh, luke, all the [[!inline]] pseudo-code from nmigen-gf.git has disappeared on the wiki: https://libre-soc.org/openpower/sv/bitmanip/#index14h122:59
lkclsigh that's an underlay22:59
lkclno idea22:59
lkclnot going to look at it now23:00
*** gnucode <gnucode!~gnucode@user/jab> has joined #libre-soc23:57
lkclfrickineeeelll23:57
lkclnever had difficulty with operands before, sigh23:58
lkclokaaay about time23:59
