octavius | My elf says unknown architecture, is that to do with the Makefile not supplying the relevant metadata, or with the objdump version (from Debian buster repos) | 00:00 |
---|---|---|
programmerjake | you need to use the powerpc64le objdump, the name is something like powerpc64le-linux-gnu-objdump | 00:01 |
octavius | OH! | 00:02 |
octavius | Thank you so much! | 00:02 |
octavius | I've been an idiot :) | 00:02 |
programmerjake | if you use the x86 objdump, it doesn't have the powerpc disassembly code compiled in | 00:02 |
octavius | Yes, I've only been able to see the symbol tables so far | 00:03 |
octavius | I'll include this in the wiki page once I get the code running | 00:03 |
lkcl | ghostmansd[m], awesome on the aliases | 00:09 |
lkcl | yes, sorry, i assumed you knew octavius that binutils versions are specifically-compiled for specific architectures | 00:15 |
lkcl | (with ghostmansd[m] working on binutils compiled for ppc64 for us) | 00:15 |
lkcl | and it being in the Makefile(s) | 00:15 |
lkcl | i do appreciate there's a heck of a lot to keep track of | 00:16 |
programmerjake | lkcl, can i try to improve the integer dct add/sub/mul/shift instruction's pseudocode? | 00:18 |
lkcl | programmerjake, no let markos_ handle it. | 00:21 |
octavius | lkcl, looking at the objdump -d, I don't actually see problems regarding the addresses. At address 0x0 cpu *should* branch to 0x12c (boot_entry). Boot_entry then eventually branches to main (0x1014). As I have already shown though, the verilator sim bram.dump goes through the ff00_0000/4/8, then gets stuck at 0x800. | 00:26 |
octavius | The interesting thing is that this exact hello world code (C code, linker script, startup assembler) works on ls2 fpga. So why is the verilator so finicky? | 00:27 |
octavius | This issue is why I've (foolishly) been avoiding doing any simulations at all, and just wanted to work on FPGAs | 00:27 |
lkcl | you cannot inspect the inside of the FPGA. | 00:29 |
octavius | Of course, that's why sims are so useful | 00:29 |
lkcl | ok so you could have diagnosed this yourself by looking at the RESET_ADDRESS in the Makefile | 00:30 |
octavius | I already have looked at the RESET_ADDRESS in the makefile | 00:30 |
lkcl | https://git.libre-soc.org/?p=microwatt.git;a=blob;f=Makefile;h=610f48d8c89be6d5b9902d7f1bf61f8b6d98ffc0;hb=refs/heads/verilator_trace#l220 | 00:30 |
lkcl | 220 RESET_ADDRESS=65280 # 0xff00_0000>>16 | 00:31 |
lkcl | did you perform a full clean rebuild? | 00:31 |
octavius | https://bugs.libre-soc.org/show_bug.cgi?id=1073#c7 | 00:31 |
octavius | Yes, I always ran make clean before generating a new hello_world | 00:32 |
lkcl | the RESET_ADDRESS #define says where the start address is, yes? | 00:32 |
lkcl | so if you are still executing simulations that start at address 0xff00_0000 when you have specifically and explicitly changed that line in the Makefile to 0x0000_0000 and it still starts at 0xff00_0000 | 00:32 |
octavius | Yes, ff00 (which the VHDL then shifts 16 times to get ff00_0000) | 00:33 |
lkcl | then you've not got rid of everything | 00:33 |
lkcl | yes | 00:33 |
lkcl | so why are you expecting the simulation of the CPU to start at an address other than 0xff00_0000 ? | 00:33 |
octavius | I never changed the Makefile, only the powerpc.lds for the hello_world | 00:33 |
lkcl | so i repeat the question: why are you expecting the simulation of the CPU to start at an address other than the one that is specified at line 220? | 00:34 |
octavius | I thought that the CPU expects the BRAM to start at 0xff00_0000, while the actual address on the BRAM side is 0x0 | 00:34 |
lkcl | start address === RESET_ADDRESS | 00:34 |
lkcl | if you don't tell the CPU to start at the address that matches the linker script's expected start address, how is anything ever going to work? | 00:35 |
lkcl | the verilator simulator is doing precisely and exactly what you've asked it to do. | 00:35 |
octavius | So then the VHDL RESET_ADDRESS needs to change to 0x0? | 00:35 |
lkcl | 1. load a program into memory (probably at 0x0000_0000) | 00:35 |
lkcl | 2. start executing at 0xff00_0000. | 00:35 |
lkcl | what do you think? | 00:36 |
lkcl | or, more to the point, why did it not occur to you to experiment by changing it to anything-at-all and seeing what the effect is? | 00:36 |
octavius | You told me NOT to change the VHDL, so I thought there was a way to do it | 00:36 |
lkcl | that was before i realised you were using the microwatt_verilator directly | 00:37 |
octavius | Yes, I was trying to try microwatt standalone, I apologise for not clarifying earlier | 00:37 |
lkcl | plus (apologies) i've been focussing on the RFCs | 00:37 |
lkcl | and, the HDL != "macro #define options" | 00:38 |
lkcl | you definitely don't want to start modifying the vhdl itself (which strictly speaking isn't the same thing as the compile-time options) | 00:38 |
octavius | Ah, that's what you meant | 00:38 |
lkcl | well, kinda :) honestly, i wasn't paying enough attention | 00:38 |
lkcl | head-spinning from 17 RFCs | 00:39 |
octavius | Ok, I'll make sure to be *even more* specific :) | 00:39 |
octavius | Then tomorrow I'll give some of them a re-read. | 00:39 |
lkcl | but yes, i was expecting you to recompile the binary at address 0xff00_0000 | 00:39 |
octavius | Any RFCs that are going to be submitted soon? | 00:39 |
lkcl | then run options in verilator which load the binary into simulated-memory at that address | 00:39 |
octavius | "i was expecting you to recompile the binary at address 0xff00_0000" - This is what I was trying to do, but I have absolutely no idea which knob I'm meant to change in the .lds file for that | 00:40 |
lkcl | there are several other examples around, some of which are macro'd (i mentioned that a couple of times already) | 00:41 |
octavius | That's why I mentioned changing _start, which after looking at the disassembly, makes no difference | 00:41 |
lkcl | there's some powerpc.lds.in files around somewhere | 00:41 |
lkcl | which *specifically* use macro-substitution of some #defines to create a powerpc.lds file | 00:41 |
lkcl | and guess what one of the options is? | 00:41 |
lkcl | theee.... start addreeeeesss | 00:42 |
lkcl | i just can't remember which project does that. | 00:42 |
octavius | Oh that would've been really useful about a week ago...but then I probably wouldn't have been forced to actually learn some things XD | 00:42 |
octavius | YES! Changing the RESET_ADDRESS define makes microwatt-verilator work! YES!!!!!!! | 00:44 |
lkcl | hoorah | 00:44 |
octavius | Now, need to find this generator file you mentioned | 00:44 |
lkcl | they're around somewhere, i just can't remember where | 00:45 |
lkcl | for loading the ls2 bootloader i think you'll find it does that trick | 00:46 |
lkcl | (the one that reads from QSPI) | 00:46 |
lkcl | or, at least, the programs *it* loads. | 00:46 |
octavius | Sure, I just wanted the Microwatt flow to be confirmed working | 00:46 |
octavius | Would make it easier for new contributors | 00:47 |
lkcl | no, fantastic idea. | 00:52 |
octavius | Found it! https://git.libre-soc.org/?p=ls2.git;a=blob;f=hello_world/Makefile;h=50f039112f54165f8f6f7421ac62be1661889576;hb=HEAD#l9 | 00:54 |
octavius | I guess this is what you meant lkcl | 00:54 |
octavius | Also I'd like to make a video going through the setup and running on Microwatt and Libre-SOC | 00:55 |
lkcl | 28 powerpc.lds: powerpc.lds.S | 00:57 |
lkcl | 29 $(CC) $(CFLAGS) -P -E powerpc.lds.S -o powerpc.lds | 00:57 |
lkcl | yep that's it. | 00:57 |
lkcl | "gcc -E" - gcc's "macro" mode | 00:57 |
lkcl | that's what i was expecting you to be using | 00:57 |
lkcl | BOOT_INIT_BASE as a #define *from CFLAGS* gets pre-process-substituted into powerpc.lds.S | 00:58 |
lkcl | 21 -DBOOT_INIT_BASE=$(BOOT_INIT_BASE) | 00:58 |
lkcl | toshywoshy, thx it's back | 01:04 |
octavius | If you give me write access to Microwatt repo, I'll add this to the hello_world example later today | 01:06 |
octavius | Of course, testing it myself first :) | 01:06 |
octavius | Better go to bed now, quite late. Thanks for the help lkcl, programmerjake! | 01:09 |
*** octavius <octavius!~octavius@92.40.169.163.threembb.co.uk> has quit IRC | 01:12 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has quit IRC | 07:26 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has joined #libre-soc | 07:27 | |
ghostmansd[m] | > lkcl: ghostmansd[m], awesome on the aliases | 07:33 |
ghostmansd[m] | I liked most that the test is even able to demonstrate these are macros :-) | 07:33 |
ghostmansd[m] | Anyway, if we have more of these ahead, we need to generate the records for them, too | 07:34 |
ghostmansd[m] | If you're interested in this, I can think about configuration | 07:34 |
ghostmansd[m] | I'll need some list of insns that have aliases, though, to at least use as example | 07:35 |
ghostmansd[m] | I know about minmax, fminmax, and also vaguely recall something about grevlut et al. | 07:40 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has quit IRC | 07:55 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has joined #libre-soc | 07:56 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has quit IRC | 08:00 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has joined #libre-soc | 08:21 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has quit IRC | 08:30 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has joined #libre-soc | 08:30 | |
ghostmansd[m] | s/macros/aliases | 08:42 |
programmerjake | sounds like we need an aliases.csv | 08:50 |
programmerjake | or some other nicer format | 08:50 |
*** octavius <octavius!~octavius@92.40.169.167.threembb.co.uk> has joined #libre-soc | 09:46 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has quit IRC | 09:46 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has joined #libre-soc | 09:47 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has quit IRC | 10:24 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has joined #libre-soc | 10:31 | |
lkcl | ghostmansd[m], please wait until the next version of Power ISA is released. i cannot say more on that. | 10:51 |
lkcl | ghostmansd[m], i have a specific self-contained task that's reasonably high priority if you're interested | 10:53 |
lkcl | we need an offline instruction-ordering-analyser that models a (simple, initially v3.0-only) in-order core and gives estimates of instructions/clock | 10:54 |
lkcl | (IPC) | 10:54 |
lkcl | it needs to be *very* clear what is going on, nothing fancy (so no metaclasses) | 10:55 |
lkcl | and the Hazard Protection should be a straight simple bit-vector | 10:56 |
lkcl | * take the Write result register number: set a bit | 10:56 |
lkcl | * for all Read registers, check the corresponding bit. if set, STALL (fake/model-stall that is) | 10:56 |
lkcl | the input shall be: | 10:57 |
lkcl | * instruction operands (as an assembler listing) plus an optional memory-address and whether it is read/written | 10:58 |
ghostmansd[m] | lkcl, do you have a link to this task so that I could read and get a better idea? | 11:07 |
ghostmansd[m] | I should have said "a strict no-no" once you mentioned "no metaclasses" :-D | 11:07 |
ghostmansd[m] | Also, "high-priority" — what are the time constraints? | 11:27 |
lkcl | https://bugs.libre-soc.org/show_bug.cgi?id=1039 | 13:30 |
lkcl | there are no details yet - what i wrote above *is* the details | 13:30 |
ghostmansd[m] | Does IPC stand for instructions per cycle? | 13:47 |
lkcl | yes. | 13:47 |
ghostmansd[m] | Ok :-) | 13:47 |
ghostmansd[m] | I'm a system programmer, so I had to ask | 13:47 |
ghostmansd[m] | For me IPC means something else | 13:48 |
lkcl | so the basic principle is: some classes are needed which effectively "model" pipeline stages. fetch, decode, issue, execute | 13:48 |
lkcl | indeed :) | 13:48 |
lkcl | and those classes are (obviously) chained together | 13:48 |
ghostmansd[m] | So, after all, this is a processor pipeline model? | 13:48 |
lkcl | correct. | 13:48 |
lkcl | the first Model needed is of an in-order single-issue scalar core. | 13:49 |
ghostmansd[m] | I developed part of this once, but it was too high-level | 13:49 |
ghostmansd[m] | You probably heard of Intel Cofluent | 13:49 |
lkcl | ah great, so you know what to expect. awesome | 13:49 |
lkcl | have now | 13:49 |
lkcl | this needs to be hardware-cycle-accurate | 13:50 |
ghostmansd[m] | Well I actually modeled only part of Nehalem, the insn decoder | 13:50 |
ghostmansd[m] | Not sure if this covers the task sufficiently | 13:50 |
lkcl | where the most important technical internal "flag" - the one that has the most influence in an in-order system - is the global "STALL" flag. | 13:50 |
ghostmansd[m] | But at least something to start with | 13:50 |
lkcl | indeed. | 13:51 |
ghostmansd[m] | This STALL. Is it like a global barrier where all buses stop? | 13:51 |
lkcl | correct. | 13:51 |
ghostmansd[m] | Ok, still vaguely recall something :-) | 13:51 |
lkcl | it tells the fetch to stop fetching, and because fetch has stopped decode has nothing to process | 13:51 |
lkcl | if decode has nothing to process, it has nothing to tell issue to do anything | 13:52 |
lkcl | if issue has nothing to do then execute (pipelines) run with an empty slot | 13:52 |
lkcl | so each "stall" has a ripple-effect down the chain-of-classes | 13:52 |
ghostmansd[m] | Ok, where to start here? | 13:53 |
lkcl | literally from scratch as a stand-alone program | 13:53 |
lkcl | taking as input a file containing instructions: | 13:53 |
lkcl | addi 3,4,5 | 13:54 |
lkcl | cmpi 1,2,3,4 | 13:54 |
lkcl | but with some "augmentation" if it is a LD/ST, assume that there is the memory address as a comment | 13:54 |
lkcl | ld 1,2(3) # 0x12345678 | 13:54 |
lkcl | it'll need some design document (a real simple one), some discussion etc. to get the concept agreed | 13:55 |
ghostmansd[m] | Any IRL examples to look at? | 13:56 |
lkcl | but ultimately if this is more than 1,000 to 1,500 lines of python there's something desperately wrong - bear that in mind | 13:56 |
lkcl | mmm.... maaybe RITA | 13:56 |
lkcl | and definitely cavatools and gem5 | 13:56 |
lkcl | but gem5 is an insanely-large codebase | 13:56 |
lkcl | https://www.google.com/search?q=RITA+RISC-v | 13:56 |
lkcl | oh - the PC obviously will be in there. | 13:57 |
lkcl | so | 13:57 |
lkcl | addi 3,4,5 # PC=8 | 13:57 |
lkcl | cmpi 1,2,3,4 # PC=12 | 13:57 |
lkcl | ld 1,2(3) # PC=16 EA=0x12345678 | 13:58 |
ghostmansd[m] | Why PC is 8 for the first one? | 13:58 |
lkcl | if you literally expect that to be the input, it will be about 5 minutes work to make ISACaller produce that as output | 13:58 |
ghostmansd[m] | Shouldn't be 4? | 13:58 |
lkcl | no reason at all, i just picked it as an example | 13:58 |
ghostmansd[m] | Ah OK | 13:58 |
ghostmansd[m] | Another question, shouldn't all these insns come as binaries? | 13:59 |
lkcl | but it will matter in an iteratively-improved version, because PC is what the "fetch" comes from | 13:59 |
ghostmansd[m] | I.e. 4 bytes at once | 13:59 |
lkcl | ok that begins to tie in to the full capabilities of the simulator itself | 13:59 |
ghostmansd[m] | Not as asm, but rather as a simple stream of insns | 13:59 |
lkcl | which means duplicating the simulator | 13:59 |
lkcl | which is the last thing we need | 13:59 |
lkcl | my idea here is that ISACaller (or other simulator) *generates* a log file that this tool can use | 14:00 |
lkcl | if you have to decode the instructions in this tool it's doing too much. | 14:00 |
ghostmansd[m] | Well, modeling fetch, decode, issue, execute stages is almost the simulator :-) | 14:00 |
ghostmansd[m] | Ah so it's rather a trace walker | 14:01 |
lkcl | it isn't - because it's not actually going to execute the instructions. at all. | 14:01 |
lkcl | all it cares about is "what's the memory address being loaded or stored" and "what registers are used, and are they available/valid" | 14:01 |
lkcl | it doesn't care *at all* what the actual *values* are in those registers, nor the contents of the memory. | 14:01 |
lkcl | let's say you have 2 instructions: | 14:02 |
lkcl | addi 1,2,2 | 14:02 |
lkcl | muli 3,1,2 | 14:02 |
lkcl | the output from addi is used by muli | 14:02 |
lkcl | therefore you *must* stall | 14:02 |
lkcl | you don't care - at all - what the *contents* of register 1 2 or 3 are | 14:02 |
lkcl | you care solely and exclusively "is the result of the add available in register 1 yet, no it isn't, oh dear we need to STALL until it is" | 14:03 |
lkcl | that's an In-Order core | 14:03 |
lkcl | so the Model needs to go | 14:03 |
lkcl | cycle 1: i have fetched the add | 14:03 |
lkcl | cycle 2: i am decoding the add, AND i am fetching the mul | 14:04 |
lkcl | cycle 3: i am issuing the add, i am decoding the mul | 14:04 |
lkcl | cycle 4: i am EXECUTING the add, but the results are NOT READY THEREFORE I MUST STALL | 14:04 |
lkcl | cycle 4: i am stalled on fetching, i am executing the add | 14:04 |
lkcl | cycle 5: the add result is ready, i am WRITING the add, the MUL is unblocked, i can now ISSUE the mul | 14:05 |
lkcl | cycle 6: i am EXECUTING the mul | 14:05 |
lkcl | cycle 7: the mul result is ready, i am writing the MUL | 14:05 |
lkcl | sorry, cycle 1 2 3 4 5 6 7 8 not 12344567 | 14:06 |
lkcl | but it is NOT cycle 123456 because of the additional STALL at cycle 4 | 14:06 |
lkcl | (because the mul needed the result of the add, which takes another 2 cycles to produce) | 14:06 |
lkcl | and thus the IPC is 0.75 *not* 1.0 | 14:07 |
lkcl | because of the 2 stalls in 8 cycles | 14:07 |
lkcl | so the crucial information is actually "how many stalls occurred" | 14:07 |
lkcl | hence that has to be Modelled | 14:07 |
lkcl | the "Execute" class should *literally* be a queue | 14:08 |
lkcl | and it should contain elements that are extremely simple: "write result will be in GPR 5" | 14:08 |
lkcl | or, | 14:08 |
lkcl | "write result will be in FPR 7 and CR1" | 14:09 |
lkcl | and once you pop() that off the end of the queue | 14:09 |
lkcl | you use it to clear the associated bit in the vector of "we are waiting for this register result" | 14:09 |
lkcl | note: you don't pass the *result itself* down the queue. | 14:10 |
lkcl | we don't care in the least bit what the contents of the regfiles are | 14:10 |
lkcl | we care *only* about *which* register | 14:10 |
ghostmansd[m] | Ok, input are the instructions. What is the output? Log which describes stalls and register contents? | 14:11 |
lkcl | yes. | 14:11 |
lkcl | no - not register contents | 14:11 |
lkcl | just "a stall occurred here" | 14:11 |
lkcl | it would kinda be handy to have a table showing where each instruction is, through the pipelines? | 14:12 |
lkcl | and if "stall" occurs, then the table will show "blank entry" in that pipeline slot | 14:12 |
lkcl | i think that's probably the most visually-useful output (markdown) | 14:12 |
lkcl | \| fetch \| decode \| issue \| execute1 \| execute2 \| | 14:13 |
ghostmansd[m] | What about jumps? These already per se need some bits of simulation, e.g. tracking the PC and the amount of the instructions. | 14:13 |
lkcl | they're "just another instruction" at this point | 14:13 |
ghostmansd[m] | Cough, I meant branches | 14:13 |
lkcl | but later we can add a branch-predictor "thing" which issues (yet more) stalls | 14:13 |
ghostmansd[m] | Yeah but they JUMP | 14:14 |
lkcl | but for now just treat it as "just another instruction" | 14:14 |
lkcl | not the pipeline's problem | 14:14 |
ghostmansd[m] | Say to 4 instructions below | 14:14 |
lkcl | not the pipeline's problem | 14:14 |
ghostmansd[m] | No my point is, we need to know where they jump | 14:14 |
lkcl | instructions don't actually care what the PC is (unless they have to read/write it) | 14:15 |
ghostmansd[m] | To fetch the next insn | 14:15 |
lkcl | the only place that matters is in the next phase where we "Model" the L1 and L2 caches | 14:15 |
ghostmansd[m] | Don't branches write PC? | 14:15 |
lkcl | (which will be later - don't worry about it for now) | 14:15 |
ghostmansd[m] | Or, well, rather update | 14:15 |
lkcl | correct, but you can ignore them for now | 14:16 |
lkcl | PC is extremely weird: it is a non-existent concept as far as the execute pipelines are concerned | 14:16 |
lkcl | and is dealt with in a different/special way | 14:16 |
lkcl | you have to "guess" which way the branch would go, and if you get it wrong, then, whoops, you STALL | 14:17 |
lkcl | but for now treat it as "just another instruction" | 14:17 |
lkcl | this will get sophisticated quite quickly and i don't want you overwhelmed | 14:18 |
lkcl | so one crucial thing about branch-conditional: it reads CR. | 14:18 |
lkcl | therefore if you have this: | 14:18 |
lkcl | cmpi 0,1,2 | 14:18 |
lkcl | bc 0,... | 14:18 |
lkcl | guess what? | 14:18 |
lkcl | bc must STALL waiting for the output from cmpi | 14:19 |
lkcl | that *is* important to model | 14:19 |
lkcl | but the actual PC you can completely ignore - entirely - for now | 14:19 |
lkcl | \| fetch \| decode \| issue \| execute1 \| execute2 \| | 14:19 |
ghostmansd[m] | OK, I'll think about it. But still: any other task in mind, a bit higher-level? :-) | 14:20 |
lkcl | \| addi 1,2,3 \| empty \| empty \| empty \| empty \| | 14:20 |
lkcl | \| muli 3,1,2 \| addi 1,2,3 \| empty \| empty \| empty \| | 14:20 |
ghostmansd[m] | I'm afraid with my current level of competence I'll be dealing with this for months :-) | 14:20 |
lkcl | \| STALL \| muli 3,1,2 \| addi 1,2,3 \| empty \| empty \| | 14:20 |
lkcl | like i said: if it's more than 1,000 lines of code there's something horribly wrong | 14:21 |
lkcl | if the first iteration takes more than 5-7 days to code up, there's something very very wrong | 14:22 |
lkcl | but this is actually a really important task for justification of commercial funding. | 14:22 |
lkcl | we are getting "but what's the performance but what's the performance but what's the performance" | 14:23 |
lkcl | i'll be able to help advise - and probably "chip in" - once things get started | 14:26 |
lkcl | but doing it myself, it just isn't going to happen. | 14:27 |
lkcl | markos_, had some thoughts - i have a sneaking suspicion you might need this for rounding: | 14:34 |
lkcl | round = sign(partialresult) * (abs(partialresult)+1) | 14:35 |
lkcl | then shift it up (arithmetic-shift, because it's either 1 0 or -1) | 14:36 |
lkcl | you get the idea. | 14:36 |
lkcl | if the (a+b) or (a-b) is negative, you want to *subtract* 1, if zero do nothing, if +ve *add* 1. | 14:37 |
lkcl | because - correct me if wrong - this is a *signed* instruction, you want "round towards zero" | 14:37 |
lkcl | in IEEE754 FP that's the default behaviour | 14:37 |
lkcl | by always adding one you are rounding **DOWN** negative partial-results | 14:37 |
markos_ | well, I'm trying to emulate the C code and the arm neon equivalents | 14:38 |
lkcl | ahh :) | 14:38 |
lkcl | it *might* be the case that A is unsigned and B is signed | 14:39 |
markos_ | well, both operands have to be signed | 14:40 |
lkcl | basically what i've described is likely to be a horrible bug in AV1 | 14:40 |
lkcl | but one that if implemented correctly would be *so bad* in the number of instructions (certainly no longer just 8 instructions per butterfly) that it's been deliberately overlooked | 14:41 |
lkcl | either that or we're missing something | 14:41 |
lkcl | i'm serious about this being a bug in AV1, if -ve A or B result is FLOORed but +ve A or B is CEILINGed, that's quite serious | 14:42 |
lkcl | if the c code is the reference is the spec, that's ultimately a bug in the AV1 specification | 14:42 |
markos_ | yes, if the function would be used on unprocessed/unfiltered data | 14:42 |
markos_ | but they are always fed data that is "clamped" within acceptable limits | 14:43 |
lkcl | if it's "offset" in some way such that the (new) A and (new) B are always +ve then that's fine | 14:43 |
lkcl | new-A and new-B *have* to be unsigned results. | 14:43 |
lkcl | which doesn't smell right, to me | 14:44 |
markos_ | also, all DCT functions in the libs are fed signed data | 14:44 |
lkcl | it means that input-A and input-B have to be "offset" in some magic way which, frankly, is impossible to achieve | 14:44 |
markos_ | it's true you can get really bad results from the functions if you feed them bad data | 14:45 |
lkcl | how can you possibly "arrange" the data such that for all butterflies input-A and input-B will *100% guaranteed* produce +ve result-A and result-B? | 14:45 |
markos_ | already bit by it doing the Arm port | 14:45 |
lkcl | urrrr | 14:45 |
markos_ | nothing you can do really, it's like trying to write the perfect tan() function approximation and you keep feeding it inputs close to pi/2 | 14:46 |
markos_ | the cpu just cannot cope | 14:47 |
markos_ | well, it can, using a different algorithm/approximation | 14:47 |
lkcl | this is way more fundamental - i think it's reasonable to assume you're going to get an even distribution of +ve and -ve values for input-A and input-B | 14:47 |
markos_ | well B is used for the cospi constants | 14:48 |
lkcl | but we may be overthinking this: they may just have not performed any rounding at all | 14:48 |
markos_ | these are pretty known and indeed distributed | 14:48 |
markos_ | the RT, RA are from pixel data, and quite random, can be distributed, or not | 14:49 |
lkcl | yes. ok RT,RA (not A and B) | 14:49 |
lkcl | RT and RA i would expect to be 50% each +ve and -ve | 14:49 |
lkcl | so you have 25% -ve -ve | 14:49 |
lkcl | 25% -ve +ve | 14:49 |
lkcl | 25% +ve -ve | 14:49 |
lkcl | 25% +ve +ve | 14:49 |
lkcl | there's just absolutely no way those can be "massaged" to 100% produce unsigned result-RT and result-RS | 14:50 |
markos_ | I could write some edge cases for that if you want | 14:51 |
markos_ | see how it behaves | 14:51 |
lkcl | probably a good idea. | 14:51 |
markos_ | and compare with C/NEON results | 14:51 |
markos_ | well, C for 64-bit, NEON for 16/32 | 14:52 |
lkcl | i bet you it's always rounded down. i.e. it's not an average-add | 14:52 |
lkcl | that's if the c code is taken as the reference | 14:53 |
markos_ | I noticed earlier that the neon code did not do that, ie did not get the rounded down value, but I'll have to do more proper research | 14:57 |
markos_ | but I cannot do it today, I have to finish some neon stuff first :-/ | 15:00 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has quit IRC | 15:11 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has joined #libre-soc | 15:11 | |
sadoon[m] | Doing something absolutely bonkers today, might show you guys during the meeting hahah | 16:44 |
programmerjake | lkcl: it rounds half-way cases towards +inf, otherwise towards nearest (due to the add before shifting). it doesn't need to have any logic for rounding towards zero. e.g. SH=4 prod=0xFFF4=-12 rounds correctly to -1 since it's closer to 0xFFF0, prod=0xFFF8=-8 rounds correctly to 0 since it's halfway, prod=0xFFFC rounds to 0 since it's closer | 16:59 |
programmerjake | try playing around with (v + 8) // 16 in python | 16:59 |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has quit IRC | 17:28 | |
*** choozy <choozy!~choozy@75-63-174-82.ftth.glasoperator.nl> has joined #libre-soc | 18:06 | |
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has joined #libre-soc | 18:21 | |
lkcl | sadoon[m], :) | 18:22 |
lkcl | programmerjake, ahhh okaaay | 18:22 |
lkcl | ghostmansd[m], i made a start https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=e74dfbf1ecfb75affa90b7ce091e15764e1b9ac8 | 18:45 |
lkcl | now let me put in some explanatory comments | 18:46 |
programmerjake | meeting in 6min | 19:55 |
*** octavius <octavius!~octavius@92.40.169.167.threembb.co.uk> has quit IRC | 21:30 | |
*** choozy <choozy!~choozy@75-63-174-82.ftth.glasoperator.nl> has quit IRC | 21:46 |
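
A note on the in-order model lkcl describes above (a bit-vector of pending register writes, an Execute stage that is "literally a queue", and a count of stalls): the following is a minimal Python sketch of that idea, not the tool tracked in bug 1039. Everything named here is hypothetical: the `count_stalls` function, the `(name, write_reg, read_regs)` tuple format (assumed to come from a simulator log, so no instruction decoding happens here), and the two-stage execute depth. It models only issue and execute, so its cycle counts are not comparable with the eight-cycle walkthrough in the log. It also stalls on a pending write to the same register, which is an extra assumption beyond the read-only check described in the log.

```python
from collections import deque


def count_stalls(trace, exec_stages=2):
    """Walk a trace of (name, write_reg, read_regs) tuples.

    Returns (cycles, stalls) for a single-issue in-order model.
    Register *values* are never tracked, only which GPR is pending.
    """
    pending = 0                              # bit N set: GPR N result not ready
    execute = deque([None] * exec_stages)    # fixed-depth execute queue
    todo = deque(trace)
    cycles = stalls = 0
    while todo or any(slot is not None for slot in execute):
        cycles += 1
        # Pop the oldest slot: its result is now written, clear its bit.
        finished = execute.pop()
        if finished is not None and finished[1] is not None:
            pending &= ~(1 << finished[1])
        # Try to issue the next instruction in program order.
        issued = None
        if todo:
            name, write_reg, read_regs = todo[0]
            hazards = list(read_regs)
            if write_reg is not None:
                hazards.append(write_reg)    # assumption: also wait on WAW
            if any((pending >> reg) & 1 for reg in hazards):
                stalls += 1                  # model-stall: a bubble enters
            else:
                issued = todo.popleft()
                if write_reg is not None:
                    pending |= 1 << write_reg
        execute.appendleft(issued)
    return cycles, stalls


# "addi 1,2,2" then "muli 3,1,2": the second reads GPR 1, so it must wait.
dependent = [("addi", 1, (2,)), ("muli", 3, (1,))]
independent = [("addi", 1, (2,)), ("addi", 3, (4,))]
print(count_stalls(dependent))
print(count_stalls(independent))
```

The dependent pair costs one extra cycle and one stall compared with the independent pair. Fetch, decode, CR and FPR hazards, and a per-cycle table of pipeline slots would be layered on top of this.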
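
A note on programmerjake's suggestion to play with `(v + 8) // 16`: the short Python sketch below reproduces the three SH=4 examples from the log and adds two positive cases. The helper names `round_shift` and `to_signed16` are made up for illustration and are not taken from the instruction's pseudocode. Python's `>>` on a negative integer is an arithmetic shift, so it matches floor division by a power of two.

```python
def to_signed16(raw):
    """Interpret a 16-bit pattern as a two's-complement signed value."""
    return raw - 0x10000 if raw & 0x8000 else raw


def round_shift(prod, sh):
    """Add half the divisor, then arithmetic-shift right by sh bits."""
    return (prod + (1 << (sh - 1))) >> sh


# The first three patterns are the examples quoted in the log (SH=4).
for raw in (0xFFF4, 0xFFF8, 0xFFFC, 0x0007, 0x0008):
    value = to_signed16(raw)
    print(f"{raw:#06x} = {value:4d} -> {round_shift(value, 4)}")
```

Half-way cases (such as -8 or +8) go towards +infinity and everything else goes to the nearest integer, so no separate sign-dependent correction is involved in this form.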