Thursday, 2021-12-23

programmerjakelkcl, if you have time, does the proposal in #757 look good to you?00:00
programmerjakegoing through grev again:00:24
programmerjakeself.input = Signal(self.width)       # XXX mark this as an input00:24
programmerjake^ it's already marked as an input...that's what "input" means00:24
mikolajwI just ran a small test I made (just a multiplier) with CRTL and it finally works00:29
programmerjakeyay!00:29 still fails however :(00:30
programmerjakemaybe a different name is better, I initially misread CRTL as Ctrl and was confused00:30
mikolajwI've already got an error a few times because I misspelt it as ctrl00:31
programmerjakehow about rtl2c00:32
mikolajwwe'll see00:32
mikolajwweird, for some reason some functions don't appear via the CFFI, despite their file being generated and linked to the shared object00:44
programmerjakehmm, I haven't actually used cffi myself, so I may not be much help there...sorry00:45
mikolajwand if I do "nm crtl/" I can see the missing function there00:45
programmerjakedid you generate the appropriate code to tell cffi to import them?00:46
programmerjakemaybe the functions are just private to the .so cuz you forgot to tell cffi to make them imported00:47
mikolajwI see them both in the .so and crtl/common.h, which is goes to ffi.cdef(), which declares the functions for CFFI00:49
mikolajwok something stupid is probably messed up00:50
mikolajwI probably didn't clean things up and I'm just reading the wrong file00:52
programmerjakehmm, maybe ask on #cffi on libera?00:56
mikolajwtried invalidating importlib's cache and explicitly reloading the module, didn't help01:12
mikolajwcould be related:
mikolajwI could try to give unique names to the CFFI-generated modules01:21
programmerjakethat seems like trying to reload a new .so from within the same python process...I'd expect your code to only need to load each .so once in each python process01:21
programmerjakeunique names definitely should help, they're probably overwriting eachother's files01:21
mikolajwcurrently there can exist only one .so at a time01:22
programmerjakeah, ok. build the .so as a totally separate process, then load the .so by importing it directly?01:22
mikolajwI'll try to have unique names for now01:23
mikolajwthere's only one .so because there is only one name01:23
programmerjakeas described here:
programmerjakeso, you'd run something like: python, then python run_sim.py01:24
mikolajwI would prefer not to01:24
programmerjakek, though it seems like the main way cffi's intended to be used01:25
mikolajwthe names will have to be unique for doing what oyu suggest as well anyway01:27
programmerjakeif you just need unique names, but don't care what they are, you could use something like:;a=blob;f=src/nmutil/;h=f58ada8dbc7da1fedb9bd823bdabb89decf7a2c5;hb=HEAD01:28
mikolajwworry not, I'll just use a class variable as a counter and append it to the names every time01:35
mikolajwand increment it01:35
mikolajwI'm not as sophisticated as you :)01:36
mikolajw>Python's import mechanism will never dlclose() a shared library. Once loaded, the library will stay until the process terminates.01:37
programmerjakethat works, though you'd probably want a way for users to add some meaningful string to the name, cuz it's really hard to know that 23 means and 485 means, especially when they change around anytime any code changes01:37
mikolajwwe'll see01:37
programmerjakethe code I have in get_test_path just grabs the test's name from the unittest infrastructure and tacks on a per-test counter01:38
mikolajwalternatively (if what I'm doing now won't be good), as that SO answer says, we can move each test in to a separate subprocess, if that's okay (unlikely?)01:40
mikolajwwow!01:41 passes!01:41
mikolajwthe next step I'm going to make is moving the entire simulator to C, because currently it's a Python-C hybrid, and this is slightly cumbersome for me01:44
mikolajwso that the Python interface will just be a thin wrapper over it01:44
mikolajwand yes, I'll do the changes you and Luke suggested to improve readability01:51
cesarlkcl: I pulled, but it didn't solve the issue. Now, it has seemingly got into an infinite loop (it stops printing after the first DMI register dump, until simulation ends).10:32
cesarGood news is, I figured out the VCD problem. It seems that Verilator outputs a signal name containing a dot, which GTKWave considers to be illegal... Will look at the traces now.10:35
*** mepy_ is now known as mepy10:43
cesarGot it, core_stopped_i was not being raised when stopped. Fixed.10:58
cesarlkcl: Comparing DMI output of Microwatt and Libresoc should work now.11:58
mikolajwThe "main" process in tests is always Python, since it executes the Python coroutine (that function with "yield"s) registered in the simulator13:27
mikolajwSo if I'm going to move all simulation to C, I'll need a way to call Python from C, or else the coroutines will have to be converted to C13:28
mikolajws/all simulation/the simulation engine/13:29
mikolajwCFFI gives a way to call Python from C, I'll try that13:29
lkclmikolajw, fantastic13:54
lkclah yes, if the names of the modules are the same that would do it13:55
lkclyou need to delete the module name (manually) from the sys.modules dictionary13:55
lkclwhich is an absolutely awful hack but i've had that work in the past13:56
lkclthe names should be unique in the first place13:56
lkclotherwise python is legitimately thinking they're the same thing13:56
lkcl"<mikolajw> passes!"13:57
lkclholy cow :)13:57
lkcli have to try that13:57
lkclFileNotFoundError: [Errno 2] No such file or directory: 'crtl_template.h'13:57
lkcl$ find . -name crtl_template.h13:58
lkclthere's a trick for getting the abspath, we use it in... mmm.... the get_csv() function13:58
lkcl    filedir = os.path.dirname(os.path.abspath(__file__))13:59
lkcl    basedir = dirname(dirname(dirname(filedir)))13:59
mikolajwI thought I has committed ctrl_template.h14:01
lkclyou had.  i'm dealing with it.14:02
lkclgimme 3mins14:02
mikolajwOK I messed up the path probably14:02
lkclsorry taking a bit longer, it's because the import is at a different location from where i am running the program14:21
lkclokaay got it14:22
lkclmikolajw, done14:26
lkcland, confirmed: working. frickin awesome14:27
lkcli'm kinda stunned :)14:27
lkcli do realise we're not looking for performance here but i thought you should know that preliminary tests show it's only twice as slow as _pyrtl.py14:38
lkclfor a first shot that's stunning14:39
lkclwith no effort at all at optimisation14:39
mikolajwI just realized that the calling the "main" process Python coroutine from C is going to have overhead, probably significant14:45
mikolajwSo maaaybe it would make sense to somehow convert it to C as well somewhere in the future14:46
lkclwell, at this point, the primary objective has been achieved14:48
lkcli mean, "achieved but not unit-test-demonstrated-as-achieved" if you know what i mean14:48
lkclPowerDecode2 is the big one that's needed14:49
lkclbut before that, can you take a look at getting the actual Signal names into the slot names?14:49
lkclthis will be needed for when doing the c-based Power ISA simulator, we need to be able to identify the Signal names so that the (auto-generated) function can be called from c14:50
lkcland if they're all called slot_NNNN they're impossible to identify14:50
lkcli must apologise i did actually successfully do this one time (4 months back) but it was a very quick hack and i forgot how it was done14:50
lkcli think i made some correct notes in the bugreport14:51
lkcli do recall that it was very simple14:51
lkclor, or, or....14:52
lkcleven if there are #defines or code-comments14:52
lkclset(1272, next_1272);14:53
lkcl#define THE_SIGNAL_NAME_FROM_src_1272 127214:53
lkclset(THE_SIGNAL_NAME_FROM_SRC_1272, next_1272);14:53
lkclsomething like that would do the trick14:53
mikolajwYes, I remember, will do14:54
lkclerrm ermermerm i don't actually know how main() works :)14:54
mikolajwI'm not talking about C main(), I'm talking about "def process()" that is provided to the simulator through sim.add_process()15:00
lkclyes i'm with you now15:00
lkcl            for process in self._processes:15:00
lkcleven if that was in c it would make a massive difference15:01
lkclmmmm.... yyyyeah, looking at it: all this has to be in c15:03
lkclbecause from e.g. the linux kernel (or cavatools), one single function has to be called which "produces_an_answer()"15:03
lkclwhich is a leetle more involved15:05
lkclbut, again, hey, it's 428 lines of code in that module15:05
mikolajwSo, you want "def process()" to be converted to C too?15:06
mikolajwWe can do it dynamically, via some converter, or by just rewriting it in C15:07
mikolajwIt's not compiled with _pyrtl.c because it's a PyCoroProcess, while all other processes are PyRTLProcess15:09
mikolajwSorry I'm on mobile so it's more effort to be precise15:09
mikolajwOk, I presume you do, I just wanted an affirmative answer to this question precisely to be sure we understand each other15:41
lkclmikolajw, sorry, was afk16:15
lkclno, not def process()16:16
lkclbut starting at PySimEngine._step()16:16
lkclor at least at first its loop "for process in self._processes"16:17
lkcland progressing incrementally from there16:17
lkclwhen using in the linux kernel or cavatools, what is needed is one single step (what gets triggered by Settle())16:18
lkclso we would manually set up the inputs (aka slots)16:18
lkclrun one single c-based-version-of-PySimEngine._step()16:19
lkcland get the outputs16:19
lkclin this way we will have an input of the raw 32-bit instruction16:19
lkcl(run the steps-loop-in-c until converged==True)16:19
lkcland the outputs will be the decoded instruction16:20
lkclwe *don't* need the full process() function converted to c for that16:20
lkcland even when using this in standard python Simulations, it would be problematic to expect everyone and anyone to convert their entire process() functions to c16:21
lkclcesar: works fantastic17:06
programmerjakemikolaj: if you want higher performance, try running it in pypy, it specifically optimizes cffi to basically just raw call instructions to/from c (cuz it has that nice jit that can do that)17:51
lkclprogrammerjake, interesting. didn't know that.18:14
lkclthe target is however being able to do a single (complete) combinatorial circuit "settling" (reaching "no change") as a complete stand-alone piece of c18:15
lkclfor use inside both cavatools and the linux kernel (trap-and-emulate)18:15
lkclthe irony is, that the easiest way to test is to actually have a full complete simulator18:16
lkcli have a whole stack of potential ideas for optimisation, including merging multiple signals into (the same) 64-bit instruction, but am seeeriously resisting talking about them :)18:17
lkclcesar, that single-stepping allowed me to narrow down on potential sources of the bug18:18
lkclit looks like the dcache is triggering an MMU lookup, which is successful, BUT18:18
lkclthe address that actually gets requested - after the lookup - is the *virtual* address not the looked-up (real) one18:19
lkclbut finding that without having the equivalent microwatt traces in a diff file would have been 10x harder to track down18:20
lkcli can now see libresoc-mmu looking up address 0x2600 on the wishbone bus, where microwatt looks up 0x100018:21
programmerjakemikolaj if you just have a single combinatorial circuit without feedback loops, you should be able to calculate a topological ordering of the signals, such that you don't need a simulate loop cuz it can always calculate all signal values in a18:22
programmerjakesingle step by calculating them in that specific ordering. this should greatly simplify the produced c code and make it run faster cuz you don't need the whole signal change tracking system18:22
lkclthat would also help locate combinatorial loops (which is something not done at the moment, at all, in nmigen Simulation, and it's a pain)18:23
lkclprogrammerjake, can you raise a bugreport about it, so we don't forget18:23
lkclthe only thing being a pain in the neck, that sort takes place across an entire swathe of modules/fragments/processes18:24
mikolajwSo as I understand topological sorting would allow to get rid of that while not converged: loop18:32
programmerjakeyup, as well as the signal change tracking datastructures18:35
mikolajwBut to get this done we need a traversable representation of the signal flow graph18:35
mikolajwWhich will require nontrivial changes to the Nmigen to C compiler18:38
programmerjakehow about deferring actually writing the generated c (write to a string instead and store it temporarily) and instead put it in a graph node associated with each signal along with the edges that are the list of signals read by that signal18:39
mikolajwYeah, that's what I'm thinking about18:40
programmerjakethen visit signals in that topological order writing the c code strings as you get to them18:40
programmerjakeif you can easily do it, it'd be nice to still retain the change tracker stuff (but only if a flag is enabled, or if the topological sort fails) cuz it might be handy for a full featured simulator later, if we want that18:46
mikolajwThat would be cool, but of course this is definitely a thing for very much later18:48
mikolajwWhat is however a low hanging fruit is parallelizing the processes. That's most likely going to give a huuuge boost18:49
programmerjakehmm, i'd expect just running everything in a fixed-at-compile-time topological order and relying on the c compiler to optimize/inline/etc. (cuz handling arithmetic/logical DAGs is usually what the compiler is good at) would give waay more performance than whatever complexity you'd likely have from parallelization18:53
programmerjakeuntil you get to very huge designs with multiple cpu cores, or similar, parallelization would likely have more inter-thread overhead than would be gained by multithreading18:56
lkclyes, VLSI is quite annoying. the interconnectivity is so high that locking becomes not only the highest point of contention but also the very presence of the mutexes actually slows down best-case19:17
lkcljean-paul is running into algorithmic issue with PnR this way, as well.19:17
lkclthe early phases (coarse-grain routing) no problem, parallelise all you like19:17
lkclthe fine-grain routing, all cores but one sit there waiting for contention.19:18
programmerjakeif i were to parallelize it, i'd use an algorithm that subdivides the computation into a graph of tasks with dependencies, where each task writes to signals that aren't read/written by other tasks that run at the same time, allowing the task scheduler to be the only place where any inter-thread synchronization is used, no mutexes/atomics on the individual signals required.19:36
programmerjakethat subdivision can be computed ahead of time by the compiler19:37
programmerjakeeach task would compute a decently sized subgraph of the whole signal dependency graph19:38
programmerjake^ for parallelization of the c hdl simulator19:38
programmerjakei'd expect that, assuming the fine-grain routing can be designed to only look at the results of coarse-grain routing of a block and its nearby blocks, and not the fine-grain routing of any blocks at all, then the fine-grain routing can be computed in parallel at the block level by writing the results to an output datastructure where each block is independently writable. the input datastructure with the coarse grain data would be19:44
programmerjakeread-only during this phase.19:44
programmerjakei've used a very similar algorithm to compute in parallel new chunks of a minecraft-style game world for version 0.7 of my game, named voxels19:46
programmerjakei just got a free YubiKey 5 from the github shop, thanks to the Linux Foundation and GitHub and Rust22:06
programmerjakelkcl, you mentioned you thought they could be backdoored by the people running the OpenSSF worked by them giving me a coupon code for the official GitHub shop, GitHub are the ones who are shipping it, I trust GitHub a lot more to not backdoor the things they're selling22:10

Generated by 2.17.1 by Marius Gedminas - find it at!