lkcl | trying to reduce fdmadds down to 3 operands but as an overwrite, it's really tricky | 00:00 |
---|---|---|
lkcl | gaah that took forever to get right | 00:29 |
lkcl | alriiiight | 00:30 |
lkcl | REMAP seems to work as well, just reintroducing each unit test using fdmadds.. | 00:31 |
lkcl | hooray | 00:32 |
lkcl | dang | 00:32 |
lkcl | ok re-run everything, make absolutely sure nothing's broken | 00:34 |
lkcl | markos, for tomorrow: yes, *actual* operands RT-as-dest *and* RT-as-source, RA-as-source and RB-as-source "works" | 00:35 |
lkcl | the really good news is that this reduces down to 3 operands in the instruction encoding: RT [6-10] RA [11-15] RB [16-20] leaving free some bits for that shift-operand you wanted | 00:43 |
lkcl | which in turn means an XO of around.... 6? 7? bits? | 00:44 |
lkcl | which is ok | 00:44 |
*** gnucode <gnucode!~gnucode@user/jab> has quit IRC | 02:55 | |
*** tplaten <tplaten!~tplaten@195.52.20.159> has quit IRC | 03:23 | |
*** tplaten <tplaten!~tplaten@195.52.31.23> has joined #libre-soc | 03:38 | |
*** openpowerbot_ <openpowerbot_!~openpower@94-226-187-44.access.telenet.be> has quit IRC | 06:04 | |
*** openpowerbot_ <openpowerbot_!~openpower@94-226-187-44.access.telenet.be> has joined #libre-soc | 08:27 | |
programmerjake | lkcl: I think you forgot a continue: https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/caller.py;h=a8bb8487806515bbbe426ad7a018c6cdac10639a;hb=HEAD#l1826 | 08:38 |
programmerjake | assigning 0 to remap_idxs[i] does nothing since it is immediately overwritten with the remapped index | 08:38 |
programmerjake | so you can't disable remap for args when any remap is enabled...unless there's other code overriding the results? | 08:39 |
programmerjake | wait, not args, SVSHAPE[0-3] | 08:41 |
programmerjake | so if SVSHAPE0 is enabled, SVSHAPE[1-3] being zero doesn't disable them | 08:42 |
lkcl | that sounds about right | 08:55 |
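A minimal sketch of the overwrite described above, assuming a four-entry loop over SVSHAPE0-3. This is not the code in caller.py: the function name, the `enabled` and `remapped` lists and the `skip_disabled` flag are all hypothetical, chosen only to show why assigning 0 has no effect unless a `continue` follows it.

```python
def compute_remap_idxs(enabled, remapped, skip_disabled):
    """Hypothetical model of the loop under discussion (not caller.py itself).

    enabled[i]     -- whether SVSHAPE i is switched on
    remapped[i]    -- the index SVSHAPE i would produce if applied
    skip_disabled  -- True adds the `continue` that appears to be missing
    """
    remap_idxs = [0] * 4
    for i in range(4):
        if not enabled[i]:
            remap_idxs[i] = 0
            if skip_disabled:
                continue  # keep the 0 for a disabled shape
        # without the continue, this overwrites the 0 assigned just above
        remap_idxs[i] = remapped[i]
    return remap_idxs


enabled = [True, False, False, False]   # only SVSHAPE0 switched on
remapped = [3, 7, 5, 2]                 # made-up remapped indices

without_continue = compute_remap_idxs(enabled, remapped, skip_disabled=False)
with_continue = compute_remap_idxs(enabled, remapped, skip_disabled=True)
print(without_continue, with_continue)
```

In this model, without the `continue` the disabled shapes still contribute their remapped index, which matches the observation that SVSHAPE[1-3] being zero does not disable them.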
programmerjake | well, I wrote some tests and got it to all work up to the point where I run the actual add, but remapping isn't applying to RA or something... | 08:56 |
lkcl | what did you put in the svremap instruction? | 08:56 |
lkcl | it's a 2-stage process | 08:56 |
lkcl | 1. set up the SHAPEs | 08:57 |
lkcl | 2. say *which* shapes apply to which registers | 08:57 |
lkcl | 3. do an instruction | 08:57 |
programmerjake | svremap 0x1A, 0, 1, 0, 1, 0, 0 | 08:57 |
programmerjake | RB and RT work... | 08:57 |
lkcl | ok so that's 26 | 08:57 |
lkcl | which is 0b11010 | 08:57 |
lkcl | so iirc you have requested: | 08:57 |
lkcl | LSB bit 0: RA disabled | 08:58 |
lkcl | LSB bit 1 RB enabled | 08:58 |
lkcl | LSB bit 2 RC disabled | 08:58 |
programmerjake | wait, they're not MSB0? | 08:58 |
lkcl | honestly i have no idea | 08:58 |
lkcl | you'll need to experiment / check | 08:58 |
lkcl | usually i just cheat and set it to 31 | 08:59 |
programmerjake | well, I expected them to be MSB0...hence 0x11010 for RA, RB, and RT | 08:59 |
programmerjake | 0b11010 i mean | 08:59 |
lkcl | i really don't know. and it's 9am, just woke up, in a lot of pain | 09:00 |
programmerjake | imho the tests shouldn't just set it to 31, since that catches errors in how it's decoded... | 09:00 |
programmerjake | k, I'll debug a bit more... | 09:00 |
lkcl | agreed. | 09:01 |
programmerjake | well, MSB0 is correct, according to https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/remap.mdwn;hb=1c16375d594c7392f787096e9d2b9aa121c67f46#l521 | 09:03 |
lkcl | you'll just have to experiment and investigate, and the spec will have to change to reflect the simulator not the other way round | 09:08 |
lkcl | because there are so many unit tests passing that changing them is too disruptive | 09:08 |
markos | lkcl, yes, 6 bits for XO | 09:18 |
markos | 4 bits for SH | 09:18 |
programmerjake | yes, the simulator uses LSB0: https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/power_decoder2.py;h=2b9af402cfca43edfb53891f13166fc3de1c2d07;hb=f8e2c0cb1467391aa7ae4b8b092c281ee2e16a7b#l1387 | 09:21 |
programmerjake | since remap_active is a Signal | 09:21 |
programmerjake | I'll create a bug... | 09:22 |
programmerjake | created https://bugs.libre-soc.org/show_bug.cgi?id=1075 | 09:25 |
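The MSB0 versus LSB0 mix-up can be sketched by decoding the same 5-bit mask both ways. The operand order below (RA, RB, RC, RT, then a fifth slot labelled EA) is an assumption for illustration; the authoritative layout is in remap.mdwn, and the helper names are hypothetical.

```python
# Operand order is assumed for illustration; check remap.mdwn for the real layout.
OPERANDS = ["RA", "RB", "RC", "RT", "EA"]
WIDTH = len(OPERANDS)


def decode_msb0(mask):
    """Bit 0 is the most significant bit of the 5-bit field."""
    return {name: bool((mask >> (WIDTH - 1 - i)) & 1)
            for i, name in enumerate(OPERANDS)}


def decode_lsb0(mask):
    """Bit 0 is the least significant bit of the 5-bit field."""
    return {name: bool((mask >> i) & 1)
            for i, name in enumerate(OPERANDS)}


def enabled(decoded):
    """List the operand names whose enable bit is set."""
    return [name for name, on in decoded.items() if on]


msb0_enabled = enabled(decode_msb0(0x1A))   # what the spec reading intends
lsb0_enabled = enabled(decode_lsb0(0x1A))   # what an LSB0 decoder sees
print("MSB0:", msb0_enabled)
print("LSB0:", lsb0_enabled)
```

Under these assumptions RB and RT are enabled in both readings while RA is enabled only in the MSB0 one, which is consistent with "RB and RT work" but RA not being remapped. A mask of 31 enables everything under either ordering, so it cannot expose the difference.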
lkcl | it should go into... yep 1042 | 09:29 |
programmerjake | prefix-sum test passed! | 09:29 |
lkcl | frickin awesome! | 09:29 |
lkcl | if you can throw in some comments in #1042 i can justify throwing part of its budget at you :) mention the prefix-sum as well as bug #1075, remember that NLnet actually read the bugreports before approving payments | 09:31 |
programmerjake | done | 09:34 |
lkcl | awesome | 09:36 |
lkcl | GPR for length is, sigh, doable | 09:38 |
lkcl | but *instead* of an immediate | 09:38 |
programmerjake | can we just allocate one more XO bit to select between immediate and GPR for X? | 09:38 |
lkcl | i already put that the number of management instructions is only 5 | 09:39 |
programmerjake | since dynamic could be useful for more than just parallel reduce/prefix-sum... | 09:39 |
lkcl | DCT/FFT it's not a good idea - you use bluestein convolution anyway | 09:40 |
lkcl | matrix i *really* don't want to go there (3 GPRs) | 09:40 |
programmerjake | well, everyone goofs once in a while, whether or not it's only preduce you need a new insn for GPR since it can't share with immed | 09:40 |
programmerjake | then don't, only do X | 09:40 |
programmerjake | so you'll have to tell them 6 insns | 09:41 |
lkcl | also it means dynamically setting MAXVL from a GPR, which is a definite "no" | 09:41 |
programmerjake | hmm, set MAXVL from MAXVL and VL from GPR? | 09:42 |
programmerjake | or if r0, then from VL | 09:42 |
lkcl | needs a lot more thought (and almost certainly a new instruction) | 09:42 |
programmerjake | well, I'm going to be splitting all my work into commits then going to sleep, so ttyl | 09:43 |
lkcl | also do mention the addition of parallel-prefix in 1042 | 09:43 |
lkcl | that's a big addition | 09:44 |
lkcl | ok | 09:44 |
programmerjake | I don't have a comment dedicated to that, but i did mention it in the spec bug report comment | 09:44 |
programmerjake | https://bugs.libre-soc.org/show_bug.cgi?id=1042#c3 | 09:45 |
programmerjake | 600 eur for both prefix-sum and finding the bug, nice :) | 09:46 |
markos | going to rename BF-Form to DCTI-Form -not IDCT as that means Inverse DCT- because there is a field called BF and it's confusing -to me at least- and you already named a DCT form | 10:07 |
markos | also moved it up in the same part of the definitions in fields.txt | 10:08 |
programmerjake | put a comment announcing adding prefix-sum in 1042. I'm done for now, gn | 10:13 |
markos | gn | 10:20 |
markos | lkcl, getting a KeyError in the parser on 'DCTI' when running make | 10:20 |
markos | I was going to commit stuff so far, but I don't want to break existing things | 10:21 |
markos | https://paste.debian.net/1278749/ | 10:24 |
markos | I've committed the changes in the wiki page though | 10:25 |
markos | added DCTI = 47 in the enums, after DCT= 46 | 10:25 |
lkcl | markos, no use at all throwing random errors out unless the accompanying changes that you've made are also included | 10:29 |
markos | never mind | 10:30 |
lkcl | i have absolutely no idea whatsoever what that error means, and have zero context to even deduce why it occurred | 10:30 |
programmerjake | well, sorry, the build server turned off again, i'll work on it tomorrow. i'm guessing the cpu fan died? | 10:30 |
markos | forgot to change BF in the instruction fields Formats | 10:31 |
markos | sorry about that | 10:31 |
lkcl | programmerjake, no problem | 10:31 |
lkcl | markos, this is why i said put things into a branch for now | 10:31 |
lkcl | we *need* to see what you're doing, as you're doing it. | 10:31 |
lkcl | please don't think "it's not ready" - this is the absolute absolute last-resort (worst) way to work | 10:32 |
lkcl | because it burdens everyone with guesswork about what the error means | 10:32 |
lkcl | and as you've seen this is too many moving parts to make such guesses | 10:32 |
lkcl | don't worry about the size of the commits | 10:32 |
lkcl | or if it doesn't work | 10:33 |
lkcl | or causes errors | 10:33 |
lkcl | or anything | 10:33 |
lkcl | just put it into a branch, as work-in-progress | 10:33 |
lkcl | so that everyone can be on the same page | 10:33 |
markos | the problem with this approach is that I then have to chase other people's commits as they're fixing stuff while I'm not even sure I understand the problem | 10:34 |
markos | I don't mind after a certain point, but in the beginning I do need to work on this alone, even if it breaks, so that I learn | 10:35 |
markos | if you jump in and fix it right away then I will not have learned | 10:35 |
markos | and I need to be confident that I've reached a good level of understanding before I submit/commit anything | 10:36 |
lkcl | i've no intention of taking away your right to learn | 10:36 |
markos | in any case, I'm at this level now :) | 10:36 |
markos | it compiles, it runs, and the tests fail :) | 10:37 |
markos | which is a good thing, I didn't expect it to pass right away | 10:37 |
lkcl | then in all seriousness you can't expect to receive answers to questions, they're a burden, as the inadequate context means it is literally impossible to provide an answer | 10:37 |
lkcl | i know you think it's ok to "reach a good level of understanding before committing" | 10:38 |
markos | yes I know, but the funniest thing happens when I ask a question here | 10:38 |
lkcl | *please* don't hold that attitude. i have been trying to ask you to drop it for 2 years | 10:38 |
lkcl | i know - ghostmansd also calls this a "magic forum" | 10:38 |
lkcl | :) | 10:38 |
markos | as I describe the problem, the solution appears itself | 10:38 |
lkcl | yes, that happened to ghostmansd[m] routinely every week :) | 10:39 |
markos | lkcl, well this is how I work, or been working for the past 20+ years, it's not something that can be changed overnight | 10:39 |
markos | and tbh, I like it, it's a mental challenge and it's rewarding when I can solve a problem without constant hand-holding | 10:40 |
lkcl | it's very different when working for "customers" (who will definitely expect you to not have their time wasted) | 10:40 |
lkcl | however this is a complex project | 10:40 |
lkcl | i'm not going to ever take away your right to learn by "fixing" bugs in a branch for you | 10:41 |
markos | I haven't worked on a simple project for the past 10 years :D | 10:41 |
lkcl | :) | 10:41 |
markos | in any case, there is another reason, I know you are swamped with other stuff, I can't constantly ask you for help | 10:41 |
markos | and it's not like there is an abundance of your clones available to ask | 10:42 |
markos | reg. SVP64, it's just you and programmerjake | 10:42 |
markos | both swamped | 10:42 |
markos | so I need to work my way around stuff | 10:43 |
markos | anyway | 10:44 |
markos | this is theoretical now, I promise I'll commit more often, but I'll try not to abuse your willingness to help | 10:44 |
markos | is there a branch naming scheme I should follow? | 10:44 |
markos | not from what I see | 10:45 |
markos | maddsubrs should do I guess | 10:46 |
markos | committed and pushed | 10:47 |
markos | so one thing I definitely need to fix is the bitmask in minor_22.csv now that I've changed the Form and XO bits | 10:47 |
markos | at least I think so | 10:47 |
lkcl | awesome | 10:50 |
markos | the test is test_caller_maddsubrs.py | 10:50 |
markos | I've copied one with simple values that I've verified from C code and Arm vqrdmulhq_s32 | 10:51 |
markos | before I go and tackle something more complicated like a full DCT | 10:52 |
markos | actual test code is in src/openpower/test/alu/maddsubrs_cases.py | 10:53 |
markos | ignore the line e.intregs[11] = 0x00001942 for now, that's the (a-b)*c value for now which is supposed to go to RS | 10:53 |
markos | I think my parsing of SH is wrong | 10:56 |
lkcl | you may have some bits in the wrong place and/or something | 11:00 |
lkcl | use print() statements in the auto-generated code, with easy-to-find markers | 11:00 |
lkcl | yes annoying that it gets overwritten but it's better than nothing | 11:00 |
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has joined #libre-soc | 11:08 | |
ghostmansd | > i know - ghostmansd also calls this a "magic forum" | 11:08 |
ghostmansd | > yes, that happened to ghostmansd[m] routinely every week :) | 11:08 |
ghostmansd | affirmative, and still happens! | 11:09 |
ghostmansd | I even already suggested to monetize the chat instead of the stuff we do :-) | 11:09 |
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC | 11:39 | |
markos | lkcl, the code currently chokes in get_idx_out and I noticed there a comment that PowerDecoder2 should be used instead, how to get the instruction to use that? | 12:00 |
markos | getting "get_idx_out not found" on RA | 12:00 |
markos | the question is not about the error, I'll figure that out, the question is about PowerDecoder2 | 12:03 |
markos | ah, just saw the A-Form variant comment, good catch | 12:16 |
markos | committed changes | 12:29 |
markos | getting there... :) | 12:39 |
markos | ok, it compiles and runs, getting wrong results though, I still haven't fixed the bitmask in minor_22.csv | 12:49 |
*** WhyNotHugo <WhyNotHugo!bc7d0f0b52@2604:bf00:561:2000::28> has joined #libre-soc | 13:28 | |
markos | weird, added prints in the generated code, and I'm getting zeroes for all the elements | 13:33 |
*** octavius <octavius!~octavius@92.40.169.57.threembb.co.uk> has joined #libre-soc | 13:49 | |
markos | lol, ofc it's zero, I didn't pass initial_regs to the Program() call :D | 14:18 |
markos | got ca00000b503f581 expected fffff581 at pc 4 4 | 14:48 |
markos | I suspect that's because the original instructions are doing 32-bit but I'm doing 64-bit and shifting 14-bits right | 14:48 |
markos | I'm going to verify | 14:48 |
markos | without using negative values getting ca000000000aa85 vs 0000aa85 which is a good start, but I need to figure out where the ca comes from | 14:50 |
markos | right it's because of the rotation | 14:53 |
markos | copying what srd does, seems to work | 15:07 |
lkcl | markos, getting there i see :) | 15:29 |
lkcl | toshywoshy, openpowerbot stalled a couple hours ago on mattermost? | 15:30 |
lkcl | yes i suggested using "x <- prod1[XLEN/2-SH:XLEN-1-SH]" instead of ROTL64 | 15:31 |
lkcl | btw DCT-Form can go as well, it's also identical to A-Form | 15:31 |
lkcl | btw are you using "rebase" checkouts? | 15:32 |
lkcl | because if so you can always fast-forward a branch | 15:32 |
markos | yes | 15:34 |
markos | tried the prod1[XLEN/2-SH:XLEN-1-SH], got some wrong results, but why XLEN/2? I mean if I'm shifting eg. 14 bits, the result I need is [14:63-14] right? | 15:35 |
markos | trying to do the same using MASK() but I'm again getting wrong results, trying to figure out the syntax of MASK() | 15:36 |
markos | prod1[XLEN/2-SH:XLEN-1-SH] always gives me zero | 15:44 |
markos | prod1 SelectableInt(value=0x2aa14328, bits=128) | 15:44 |
markos | res1 SelectableInt(value=0x0, bits=32) | 15:44 |
markos | for SH=14 | 15:45 |
markos | that's why I went to use again ROTL64 | 15:45 |
markos | but a mask would work just as well | 15:46 |
markos | essentially I need a mask of the first XLEN-SH bits, ie MASK(0, XLEN-SH) correct? | 15:47 |
markos | if I understand the MASK syntax right | 15:47 |
markos | ah no, MASK(0, XLEN-SH) gives me 0xFFFFFFFFFFFFE000 | 15:49 |
markos | no, it's MASK(n, XLEN-1) | 15:52 |
markos | and I need to do algebraic shifting... | 15:53 |
markos | how do you type the ¬ character? | 16:00 |
*** tplaten <tplaten!~tplaten@195.52.31.23> has quit IRC | 16:01 | |
markos | lol, utf-8 parsing error | 16:01 |
lkcl | err... yes you're right | 16:14 |
lkcl | it's XLEN-SH-1:XLEN-1 | 16:15 |
lkcl | i have no idea, i always get MASK wrong | 16:15 |
lkcl | i usually copy it into a vim paste buffer from another file! | 16:15 |
lkcl | ok i'm going to do the same reduction in number of operands in ffmadds as i did in fdmadds | 16:19 |
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has joined #libre-soc | 16:22 | |
ghostmansd | > markos how do you type the ¬ character? | 16:23 |
ghostmansd | > lkcl i usually copy it into a vim paste buffer from another file! | 16:23 |
ghostmansd | lol :-D | 16:23 |
markos | same here :D | 16:24 |
markos | anyway, I almost got it | 16:24 |
markos | mask is correct | 16:24 |
lkcl | awesome | 16:24 |
markos | I get an assertion on b.bits == self.bits wrong, because some elements are of the wrong size while doing the operations | 16:24 |
markos | probably need to use EXTS() ? | 16:25 |
lkcl | no, you need to make sure that you have the exact bit-length for everything | 16:25 |
lkcl | like in Ada (VHDL) | 16:25 |
markos | ah the results are 128-bits | 16:25 |
lkcl | this is a safety-check because of MSB0 ordering | 16:25 |
lkcl | indeed | 16:25 |
markos | the multiplication results | 16:25 |
lkcl | and there is no ROTL128 and we *really* don't want one added | 16:26 |
lkcl | which i think is why i suggested using the XLEN-SH-1 idea although i got it hopelessly wrong | 16:26 |
lkcl | the other way is - boringly - write it out as a series of if-statements | 16:26 |
lkcl | as part of the specification it does not have to be "efficient" | 16:27 |
lkcl | it *does* have to be really clear | 16:27 |
ghostmansd | markos, if this is about selectable int, they strictly check for size in bits | 16:27 |
markos | wait | 16:27 |
markos | it's LE isn't it? | 16:27 |
markos | so res1[0:XLEN-1] should give me the low half? | 16:28 |
lkcl | you need to not think in terms of LE or BE. | 16:28 |
lkcl | it's bits. | 16:28 |
lkcl | numerically numbered in MSB0 order | 16:28 |
markos | aaaaah | 16:28 |
lkcl | and otherwise having arithmetic properties that you would expect of any other programming language for all arithmetic operations | 16:28 |
markos | so I need to return the res1[XLEN/2:XLEN-1] ? | 16:28 |
lkcl | something like that, yes | 16:29 |
lkcl | or it will be | 16:29 |
lkcl | res1[XLEN:XLEN*2-1] | 16:29 |
lkcl | but | 16:29 |
lkcl | you can't pass a 128-bit number into ROTL64 and expect it to work | 16:29 |
lkcl | you will have to put a *64-bit* number into ROTL64 | 16:30 |
lkcl | and with a little bit of thought i think you'll find that that loses precision | 16:30 |
markos | ok, you're right | 16:30 |
lkcl | which is probably why i suggested writing it out long-hand | 16:30 |
markos | the problem is in the product earlier | 16:30 |
markos | I need to fix that | 16:31 |
lkcl | if SH = 0 then res1 <- prod[XLEN:XLEN*2-1] | 16:31 |
lkcl | if SH = 1 then res1 <- prod[XLEN-1:XLEN*2-2] | 16:31 |
lkcl | etc. etc. | 16:31 |
lkcl | but if you get ROTL64 to work *great* | 16:31 |
markos | doing something like that now | 16:32 |
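The MSB0 slicing in the "if SH = ..." lines can be modelled on plain Python integers. This is a sketch, not SelectableInt: `msb0_slice` is a hypothetical helper, and it treats the product as unsigned, so the algebraic (signed) shift mentioned earlier is not covered.

```python
XLEN = 64


def msb0_slice(value, width, start, end):
    """Return (bits, nbits) for bits start..end inclusive, MSB0-numbered,
    of a width-bit unsigned value."""
    nbits = end - start + 1
    shift = width - 1 - end          # distance of bit `end` from the LSB
    return (value >> shift) & ((1 << nbits) - 1), nbits


prod = 0x2aa14328   # the 128-bit product value quoted earlier in the log
SH = 14

# The slice that always came out as zero: it selects bits far up in the
# (empty) top half of the 128-bit product, and is only 32 bits wide.
bad, bad_bits = msb0_slice(prod, 2 * XLEN, XLEN // 2 - SH, XLEN - 1 - SH)

# The pattern from the "if SH = ..." lines, generalised to any SH:
# prod[XLEN-SH : XLEN*2-1-SH], a 64-bit window moved up by SH bits.
good, good_bits = msb0_slice(prod, 2 * XLEN, XLEN - SH, 2 * XLEN - 1 - SH)

print(hex(bad), bad_bits, hex(good), good_bits)
```

With these inputs the second slice equals `(prod >> SH)` truncated to 64 bits, so for non-negative products it does the same job as a right shift without needing a 128-bit rotate.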
lkcl | ha, ffmadds reduced operands is working | 16:33 |
lkcl | octavius, to clarify: check the linker script not the VHDL. | 16:37 |
lkcl | the signs are that you've compiled the *binary* to be at the wrong address. | 16:45 |
lkcl | but because there is absolutely zero concept of "ELF" support, there's no safety net to check that for you | 16:46 |
lkcl | (because it's raw binary data) | 16:46 |
lkcl | so it's down to you to get it right | 16:46 |
lkcl | for getting hold of searchable irc, use "wget --mirror --no-parent" on the irclogs page | 16:47 |
lkcl | then you have a local copy | 16:48 |
markos | prod1 <- prod128_1[XLEN/2:(XLEN*2)-1] shouldn't that produce a 64-bit value? for some reason I'm getting a 96-bit SelectableInt object | 16:50 |
markos | aaaargh | 16:51 |
markos | idiot | 16:51 |
markos | sorry | 16:51 |
markos | it always happens when I paste the problem | 16:51 |
markos | this is really a magic forum | 16:52 |
markos | got ffffffffffff643d expected ffffffffffff643e at pc 4 4 | 17:01 |
markos | getting there | 17:01 |
markos | lkcl, committed progress so far | 17:02 |
markos | wiki too | 17:03 |
markos | so positive values work (hooray!) but need to fix negatives | 17:04 |
markos | the results are compared against C & NEON results, so I know they are correct | 17:04 |
markos | the reference values I mean | 17:04 |
octavius | Thanks lkcl, will check the linker script | 17:06 |
ghostmansd | > this is really a magic forum | 17:10 |
ghostmansd | join our cult! :-D | 17:10 |
markos | how do I construct a XLEN number but set only the sign bit? eg. s1 | 17:20 |
markos | actually I can just check with an if | 17:22 |
markos | hm... (0|s1) might work | 17:30 |
markos | it doesn't, but (0*(XLEN-1) || s1) does | 17:37 |
markos | Ran 1 test in 5.818s | 17:39 |
markos | OK | 17:39 |
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC | 17:39 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has quit IRC | 17:42 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has joined #libre-soc | 17:42 | |
markos | lkcl, simple test case works, I'm going to add a few more tests and then I'll check with an actual DCT on a matrix | 17:43 |
lkcl | hooray! | 17:43 |
lkcl | btw you need to decide if these are to be signed or unsigned multiplies | 17:43 |
lkcl | and use MULS or MUL as appropriate | 17:43 |
markos | uhm | 17:43 |
markos | would it be too much to ask for both? | 17:44 |
markos | I mean two instructions? | 17:44 |
markos | most likely the signed are more useful | 17:44 |
lkcl | should be fine | 17:44 |
lkcl | indeed | 17:44 |
markos | also the name might need to change | 17:44 |
markos | maddsubrs is ok as a draft | 17:45 |
markos | but s points to single-precision | 17:45 |
markos | feel free to suggest alternatives | 17:45 |
lkcl | just as long as it's not "signed/unsigned multiply but unsigned/signed addition by signed/unsigned result" and then you need *eight* instructions | 17:45 |
markos | :D | 17:45 |
lkcl | btw - also: now you have something working, the hardware cost needs to be estimated | 17:46 |
markos | how would I do that? | 17:46 |
lkcl | (instruction design is ridiculous!) | 17:46 |
markos | is $1M enough? :D | 17:46 |
lkcl | a rule-of-thumb is that a 64-64->128 multiply is around 12,000 gates | 17:46 |
markos | tbh, we don't need those, and for elwidth=16/32 it's an overkill too | 17:47 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has quit IRC | 17:47 | |
markos | :q | 17:48 |
markos | sure | 17:48 |
lkcl | well, we work on the basis of using pre-existing Dynamic-Partitioned SIMD units | 17:48 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has joined #libre-soc | 17:48 | |
lkcl | and a MUX is 5 gates | 17:48 |
lkcl | so for a 5-bit-wide shift amount, you need 5 rows of 64 MUXes | 17:49 |
lkcl | 64 * 5 * 5 = ... | 17:49 |
markos | isn't there an automated way to do this, eg. by counting the instructions or something similar? | 17:49 |
lkcl | 1600 gates | 17:49 |
lkcl | nnnope. | 17:49 |
lkcl | welcome to hardware | 17:49 |
markos | fantastic | 17:49 |
lkcl | so, 1600 gates plus 12,000 is not bad | 17:49 |
markos | in other words AI is going to take our jobs, suuuuuure | 17:49 |
markos | please, have at it! | 17:50 |
lkcl | not a snowball in hell's chance | 17:50 |
lkcl | so, next question, what's the latency? | 17:50 |
lkcl | and it's basically "a multiplier latency plus a shifter latency" | 17:50 |
lkcl | https://www.electronicshub.org/multiplexerandmultiplexing/ | 17:50 |
lkcl | a 2-to-1 mux has a chain of 3 gates | 17:50 |
lkcl | and there are 5 layers | 17:51 |
lkcl | so that's 5*3 = 15 gates of latency in the shifter | 17:51 |
markos | I count 15 cycles | 17:51 |
lkcl | which is about the limit for a 4.8 ghz CPU "gate propagation" | 17:51 |
markos | but probably less as some operations can be done in parallel | 17:51 |
lkcl | so this is going to be a 2 cycle arithmetic operation at high speed | 17:51 |
lkcl | no, those muxes you put in layers, they're unavoidably serial | 17:52 |
lkcl | layer 1 MUX shifts by 1 bit or not-at-all | 17:52 |
lkcl | layer 2 MUX shifts by 2 bits or not-at-all | 17:52 |
markos | ah, I counted the operations as 1-cycle each | 17:52 |
lkcl | layer 3 MUX shifts by 4 bits or not-at-all | 17:52 |
lkcl | ... | 17:52 |
markos | eg. 1 cycle for the add, 1 for the mul, etc | 17:52 |
markos | I guess it doesn't work like that | 17:52 |
lkcl | naah. | 17:52 |
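The rule-of-thumb numbers and the layered shifter can be put into a short sketch. The constants are the figures quoted in the conversation (5 gates per MUX, a chain of 3 gates through a 2-to-1 MUX, roughly 12,000 gates for the multiplier); the function names are hypothetical, and this is a behavioural model rather than a hardware description.

```python
GATES_PER_MUX = 5          # rule of thumb quoted in the log
MUX_GATE_DEPTH = 3         # gate chain through one 2-to-1 mux, as quoted
MULTIPLIER_GATES = 12_000  # quoted estimate for a 64x64->128 multiply


def shifter_estimate(width=64, shift_bits=5):
    """Gate count and gate depth for one row of muxes per shift-amount bit."""
    gates = width * shift_bits * GATES_PER_MUX
    depth = shift_bits * MUX_GATE_DEPTH
    return gates, depth


def layered_shift_right(value, sh, width=64, shift_bits=5):
    """Layer k either shifts by 2**k bits or passes the value through.
    The layers are applied one after another, which is why they are serial."""
    value &= (1 << width) - 1
    for layer in range(shift_bits):
        if (sh >> layer) & 1:
            value >>= 1 << layer
    return value


gates, depth = shifter_estimate()
total_gates = gates + MULTIPLIER_GATES
print(gates, depth, total_gates)
print(hex(layered_shift_right(0x2aa14328, 14)))
```

Each layer depends on the output of the previous one, so the depths add up rather than overlap, which is the point being made about the layers being unavoidably serial.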
markos | if this instruction can be done in 2 cycles, that's HUGE | 17:53 |
lkcl | the *arithmetic* side is two cycles (assuming a 64-bit multiply can be done in 1) | 17:53 |
markos | even so | 17:53 |
markos | so at worst what? 7 cycles? | 17:54 |
markos | total I mean | 17:54 |
lkcl | reading and writing to regfiles will be the kicker | 17:54 |
lkcl | as long as you're not expecting a CPU speed above 2 to 2 ghz, 7 stages sounds about right | 17:54 |
markos | even that is very good, remember right now all other implementations need at least 2 such instructions (for Arm, only in special cases) and in the generic case, you need at least 8 instructions | 17:55 |
lkcl | 4 ghz CPU speed would be more like 12 stage | 17:55 |
lkcl | dang | 17:55 |
markos | and that's for a single fdct round shift, ignoring the complexity of the data arrangement | 17:55 |
markos | with the remap we will do it in a fraction of the time | 17:56 |
markos | and in a fraction of the code size too | 17:56 |
lkcl | yyup. | 17:56 |
lkcl | welcome to the rabbit-hole | 17:56 |
lkcl | https://en.wikipedia.org/wiki/Binary_multiplier | 17:57 |
lkcl | that gives you some idea of what needs adding up (what goes into a multiplier) | 17:57 |
lkcl | think "long multiplication but in binary" | 17:57 |
lkcl | the means and method by which you actually carry out the additions will affect the gate efficiency | 17:58 |
lkcl | wallace tree is fun https://en.wikipedia.org/wiki/Wallace_tree | 17:58 |
lkcl | what you do there is: any carry-over from one column you put it at the *back* of the schedule of "things to add in the next column" | 17:59 |
lkcl | because you want the carry-over gates to have at least a chance to flip their transistors | 17:59 |
* sadoon[m] is intrigued | 18:00 | |
lkcl | normally in FPGAs you use the DSP multiply block, which has all this hard-wired | 18:01 |
sadoon[m] | And if you don't, you bear the consequences bahaha | 18:01 |
sadoon[m] | It takes a lot of space from what I remember | 18:02 |
sadoon[m] | I remember doing a multiplier using repeated shifts back in uni, idk how efficient it is though | 18:04 |
lkcl | heck yes it does. | 18:05 |
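"Long multiplication but in binary" with repeated shifts can be sketched as below. This is a generic textbook shift-and-add model for unsigned operands, assumed here for illustration; it says nothing about how a Wallace tree or an FPGA DSP block schedules the additions.

```python
def shift_add_multiply(a, b, width=64):
    """Unsigned shift-and-add multiply: one partial product per set bit of b.
    The result can be up to 2*width bits wide."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    product = 0
    for bit in range(width):
        if (b >> bit) & 1:
            product += a << bit   # partial product, shifted into its column
    return product


small = shift_add_multiply(13, 11, width=8)
largest = shift_add_multiply(2**64 - 1, 2**64 - 1)
print(small, hex(largest))
```

Every set bit of `b` contributes one shifted copy of `a`; how those copies are summed is what a hardware multiplier design has to decide.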
markos | question, how can the instruction -in pseudocode- know how to limit the multiplier width so that we don't have to do 128-bit multiplications if elwidth=16/32? | 18:14 |
lkcl | you don't. XLEN does | 18:19 |
lkcl | hidden behind the scenes, the MUL function has an OO override such that it "knows" about XLEN | 18:20 |
lkcl | but actually it's more subtle than that | 18:20 |
lkcl | RS RA and RB (etc) are all passed in as *XLEN* bits wide arguments | 18:21 |
lkcl | and MUL goes (or, should... sigh), "oh, i have received 2 8-bit arguments, let me produce a 16-bit result" | 18:21 |
markos | ah, so we don't waste a 128-bit multiplier when dealing with 8-bit elements | 18:23 |
markos | cool | 18:23 |
lkcl | correct. | 18:24 |
lkcl | actually what happens is that in the DynamicPartitionedSIMD multiplier, the "gates" are closed, and it turns into QTY 8 of 8-bit multipliers | 18:24 |
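A behavioural sketch of the partitioning idea, assuming unsigned lanes packed into a 64-bit word. `partitioned_multiply` is a hypothetical helper and models only the arithmetic result per lane, not the actual DynamicPartitionedSIMD hardware or the MUL override in the simulator.

```python
def partitioned_multiply(a, b, lane_bits, total_bits=64):
    """Split two total_bits-wide words into lanes of lane_bits and multiply
    lane by lane. Each product fits in 2*lane_bits. Lane 0 is the least
    significant lane. Unsigned only."""
    lane_mask = (1 << lane_bits) - 1
    products = []
    for lane in range(total_bits // lane_bits):
        shift = lane * lane_bits
        x = (a >> shift) & lane_mask
        y = (b >> shift) & lane_mask
        products.append(x * y)
    return products


# eight independent 8-bit multiplies out of one 64-bit datapath
eight_lanes = partitioned_multiply(0x0102030405060708, 0x0202020202020202, 8)
# a single lane spanning the whole word
one_lane = partitioned_multiply(3, 5, 64)
print(eight_lanes, one_lane)
```

With `lane_bits=8` no carries cross a lane boundary, which is the effect of the closed partition "gates" described above.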
markos | ah reg your comment, I did try to put the shifting immediate last, but it failed | 18:24 |
markos | well it didn't fail | 18:25 |
markos | but putting it 3rd in minor_22.csv didn't accept a CONST_SH as 3rd arg | 18:25 |
lkcl | ah then that needs to be added... | 18:25 |
lkcl | 1 sec.. | 18:25 |
lkcl | hang on let me look at fixedshift.mdwn | 18:26 |
lkcl | ok yes | 18:27 |
lkcl | https://libre-soc.org/openpower/isa/fixedshift/ | 18:27 |
lkcl | i know what that's about. in PowerDecoder2 only certain arguments are "allowed" (decoded properly) in certain positions | 18:28 |
lkcl | SH would need to be added to 3rd argument | 18:28 |
lkcl | in this case to In3Sel (in power_enums.py) | 18:29 |
lkcl | and then to DecodeC (which handles the 3rd operand) a switch/case for CONST_SH | 18:29 |
lkcl | screw it - it's fine :) | 18:29 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has quit IRC | 18:29 | |
lkcl | hmm i _think_ bug #928 is a duplicate of the R&D one i just created, doh | 18:30 |
lkcl | markos, btw just as an aside: i would expect you to be going "holy cow this is mad" and/or "this is so liberating!" | 18:33 |
markos | oh don't worry I'm super excited, but I'm holding it for the actual implementation of a DCT algorithm with REMAP | 18:35 |
markos | can't wait to see what a full DCT will be like :D | 18:36 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has joined #libre-soc | 18:36 | |
markos | btw, you marked #1027 also as duplicate of #1074, but it has its own budget, it might break things budget-wise | 18:37 |
markos | sorry #1028 | 19:15 |
*** gnucode <gnucode!~gnucode@user/jab> has joined #libre-soc | 19:52 | |
*** octavius <octavius!~octavius@92.40.169.57.threembb.co.uk> has quit IRC | 21:08 | |
lkcl | yes that was an accident, i reverted it | 21:49 |
lkcl | the DCT triple-loop schedule *might* not be totally suitable for integer use, which would be annoying | 21:50 |
lkcl | the FP twin-butterfly is an add on one side then a sub-and-mul on the other | 21:51 |
lkcl | but the INT twin-butterfly you're designing is more like the FFT instruction | 21:51 |
lkcl | and presumably that's because it's important to keep the magnitude the same of the two values | 21:52 |
lkcl | to work out if it's suitable it will be necessary to go back to the original diagram of the DCT algorithm and check how it's applied | 21:52 |
lkcl | this might do | 21:52 |
lkcl | https://arxiv.org/pdf/2008.06091.pdf | 21:53 |
lkcl | nope | 21:53 |
markos | that means we will have to design another DCT triple-loop then? | 21:55 |
lkcl | don't know. | 21:57 |
lkcl | am just looking at the spec | 21:57 |
lkcl | https://aomediacodec.github.io/av1-spec/#inverse-dct-array-permutation-process | 21:57 |
lkcl | unpicking that is going to be... ahh... fun? | 21:58 |
lkcl | Inverse DCT array permutation process - that's automatically done by the LD/ST DCT REMAP but it'll need checking | 22:02 |
lkcl | and, also, what i did was merge the LD/ST bit-reversing with a phase that (recursively) solves the 3210-0123 problem | 22:03 |
lkcl | *such that*, on the last layer, the data is in the correct order *and* you never had to have a temporary vector register set | 22:03 |
*** gnucode <gnucode!~gnucode@user/jab> has quit IRC | 23:18 | |
*** gnucode <gnucode!~gnucode@user/jab> has joined #libre-soc | 23:19 | |
*** gnucode <gnucode!~gnucode@user/jab> has quit IRC | 23:59 | |
lkcl | hmmmm a thought: it's probably best to do manual-implementation first, extract as many VL-for-loops as possible, then do Indexed REMAP, at which point it'll be blindingly-obvious what the indexing coefficients are | 23:59 |