Thursday, 2021-12-23

programmerjake	lkcl, if you have time, does the proposal in #757 look good to you?	00:00
programmerjake	going through grev again:	00:24
programmerjake	self.input = Signal(self.width) # XXX mark this as an input	00:24
programmerjake	^ it's already marked as an input...that's what "input" means	00:24
mikolajw	I just ran a small test I made (just a multiplier) with CRTL and it finally works	00:29
programmerjake	yay!	00:29
mikolajw	test_power_decoder.py still fails however :(	00:30
programmerjake	maybe a different name is better, I initially misread CRTL as Ctrl and was confused	00:30
mikolajw	I've already got an error a few times because I misspelt it as ctrl	00:31
programmerjake	how about rtl2c	00:32
mikolajw	we'll see	00:32
programmerjake	:)	00:32
mikolajw	weird, for some reason some functions don't appear via the CFFI, despite their file being generated and linked to the shared object	00:44
programmerjake	hmm, I haven't actually used cffi myself, so I may not be much help there...sorry	00:45
mikolajw	and if I do "nm crtl/crtl.cpython-37m-x86_64-linux-gnu.so" I can see the missing function there	00:45
programmerjake	did you generate the appropriate code to tell cffi to import them?	00:46
programmerjake	maybe the functions are just private to the .so cuz you forgot to tell cffi to make them imported	00:47
mikolajw	I see them both in the .so and crtl/common.h, which goes to ffi.cdef(), which declares the functions for CFFI	00:49
mikolajw	ok something stupid is probably messed up	00:50
mikolajw	I probably didn't clean things up and I'm just reading the wrong file	00:52
programmerjake	hmm, maybe ask on #cffi on libera?	00:56
mikolajw	tried invalidating importlib's cache and explicitly reloading the module, didn't help	01:12
programmerjake	:(	01:13
mikolajw	could be related: https://foss.heptapod.net/pypy/cffi/-/issues/318	01:19
mikolajw	I could try to give unique names to the CFFI-generated modules	01:21
programmerjake	that seems like trying to reload a new .so from within the same python process...I'd expect your code to only need to load each .so once in each python process	01:21
programmerjake	unique names definitely should help, they're probably overwriting each other's files	01:21
mikolajw	currently there can exist only one .so at a time	01:22
programmerjake	ah, ok. build the .so as a totally separate process, then load the .so by importing it directly?	01:22
mikolajw	I'll try to have unique names for now	01:23
mikolajw	there's only one .so because there is only one name	01:23
programmerjake	as described here: https://cffi.readthedocs.io/en/latest/overview.html#main-mode-of-usage	01:23
programmerjake	so, you'd run something like: python build_so.py, then python run_sim.py	01:24
mikolajw	I would prefer not to	01:24
programmerjake	k, though it seems like the main way cffi's intended to be used	01:25
mikolajw	the names will have to be unique for doing what you suggest as well anyway	01:27
programmerjake	if you just need unique names, but don't care what they are, you could use something like: https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/get_test_path.py;h=f58ada8dbc7da1fedb9bd823bdabb89decf7a2c5;hb=HEAD	01:28
mikolajw	worry not, I'll just use a class variable as a counter and append it to the names every time	01:35
mikolajw	and increment it	01:35
mikolajw	I'm not as sophisticated as you :)	01:36
mikolajw	https://stackoverflow.com/questions/8295555/how-to-reload-a-python3-c-extension-module	01:37
mikolajw	>Python's import mechanism will never dlclose() a shared library. Once loaded, the library will stay until the process terminates.	01:37
programmerjake	that works, though you'd probably want a way for users to add some meaningful string to the name, cuz it's really hard to know that 23 means test_lut.py and 485 means mmu.py, especially when they change around anytime any code changes	01:37
mikolajw	we'll see	01:37
programmerjake	the code I have in get_test_path just grabs the test's name from the unittest infrastructure and tacks on a per-test counter	01:38
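A minimal sketch of the naming scheme discussed above: a class-level counter, combined with a caller-supplied label so the generated names stay readable. The class `ModuleNamer`, its method and the default label are hypothetical and are not taken from the crtl code or from get_test_path.py.

```python
import itertools
import re


class ModuleNamer:
    """Hands out unique, readable names for generated CFFI modules."""

    # Class-level counter shared by every caller in this Python process.
    _counter = itertools.count()

    @classmethod
    def unique_name(cls, label="crtl"):
        # Replace anything that is not a letter, digit or underscore,
        # then append the counter so repeated labels never collide.
        safe = re.sub(r"\W", "_", label)
        return f"{safe}_{next(cls._counter)}"


first = ModuleNamer.unique_name("test_lut")
second = ModuleNamer.unique_name("test_lut")
third = ModuleNamer.unique_name("mmu.py")
```

The label keeps names such as `test_lut_0` traceable to their origin, while the counter alone guarantees uniqueness within one process.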
mikolajw	alternatively (if what I'm doing now won't be good), as that SO answer says, we can move each test in test_power_decoder.py to a separate subprocess, if that's okay (unlikely?)	01:40
mikolajw	wow!	01:41
mikolajw	test_power_decoder.py passes!	01:41
programmerjake	yay!!	01:41
mikolajw	the next step I'm going to make is moving the entire simulator to C, because currently it's a Python-C hybrid, and this is slightly cumbersome for me	01:44
mikolajw	so that the Python interface will just be a thin wrapper over it	01:44
mikolajw	and yes, I'll do the changes you and Luke suggested to improve readability	01:51
cesar	lkcl: I pulled, but it didn't solve the issue. Now, it has seemingly got into an infinite loop (it stops printing after the first DMI register dump, until simulation ends).	10:32
cesar	Good news is, I figured out the VCD problem. It seems that Verilator outputs a signal name containing a dot, which GTKWave considers to be illegal... Will look at the traces now.	10:35
*** mepy_ is now known as mepy		10:43
cesar	Got it, core_stopped_i was not being raised when stopped. Fixed.	10:58
cesar	lkcl: Comparing DMI output of Microwatt and Libresoc should work now.	11:58
mikolajw	The "main" process in tests is always Python, since it executes the Python coroutine (that function with "yield"s) registered in the simulator	13:27
mikolajw	So if I'm going to move all simulation to C, I'll need a way to call Python from C, or else the coroutines will have to be converted to C	13:28
mikolajw	s/all simulation/the simulation engine/	13:29
mikolajw	CFFI gives a way to call Python from C, I'll try that	13:29
lkcl	mikolajw, fantastic	13:54
lkcl	ah yes, if the names of the modules are the same that would do it	13:55
lkcl	you need to delete the module name (manually) from the sys.modules dictionary	13:55
lkcl	which is an absolutely awful hack but i've had that work in the past	13:56
lkcl	but	13:56
lkcl	the names should be unique in the first place	13:56
lkcl	otherwise python is legitimately thinking they're the same thing	13:56
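The caching behaviour described above can be reproduced with a throwaway pure-Python module. This is only an illustration: the module name `generated_sim` is hypothetical, and for a compiled extension the shared library stays loaded (per the Stack Overflow quote earlier), so the sys.modules hack is less dependable there than unique names.

```python
import importlib
import pathlib
import sys
import tempfile

sys.dont_write_bytecode = True  # avoid stale .pyc files in this demo
workdir = tempfile.mkdtemp()
sys.path.insert(0, workdir)
module_file = pathlib.Path(workdir) / "generated_sim.py"

module_file.write_text("VALUE = 1\n")
importlib.invalidate_caches()
first = importlib.import_module("generated_sim")

# Regenerate the file under the same name, as the build step would.
module_file.write_text("VALUE = 22\n")
importlib.invalidate_caches()
cached = importlib.import_module("generated_sim")  # served from sys.modules

# The hack: forget the module so the next import reads the file again.
del sys.modules["generated_sim"]
fresh = importlib.import_module("generated_sim")
```

Here `cached` is the very same object as `first`, while `fresh` reflects the regenerated file.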
lkcl	"<mikolajw> test_power_decoder.py passes!"	13:57
lkcl	holy cow :)	13:57
lkcl	i have to try that	13:57
lkcl	FileNotFoundError: [Errno 2] No such file or directory: 'crtl_template.h'	13:57
lkcl	$ find . -name crtl_template.h	13:58
lkcl	./decoder/test/crtl_template.h	13:58
lkcl	there's a trick for getting the abspath, we use it in... mmm.... the get_csv() function	13:58
lkcl	filedir = os.path.dirname(os.path.abspath(__file__))	13:59
lkcl	basedir = dirname(dirname(dirname(filedir)))	13:59
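A small sketch of the abspath trick applied to a data file that lives beside the module. The helper name `path_next_to` and the example source path are hypothetical; only `crtl_template.h` comes from the log.

```python
import os


def path_next_to(source_file, name):
    """Return the absolute path of `name` in the directory holding `source_file`."""
    filedir = os.path.dirname(os.path.abspath(source_file))
    return os.path.join(filedir, name)


# Inside a real module the first argument would be __file__; a made-up
# path is used here so the example does not depend on where it is run.
example = path_next_to("/repo/decoder/test/crtl.py", "crtl_template.h")
```

Resolving against the module's own location means the file is found regardless of the current working directory.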
mikolajw	I thought I had committed crtl_template.h	14:01
lkcl	you had. i'm dealing with it.	14:02
lkcl	gimme 3mins	14:02
mikolajw	Aa	14:02
mikolajw	OK I messed up the path probably	14:02
lkcl	sorry taking a bit longer, it's because the import is at a different location from where i am running the program	14:21
lkcl	okaay got it	14:22
lkcl	mikolajw, done	14:26
lkcl	and, confirmed: working. frickin awesome	14:27
lkcl	i'm kinda stunned :)	14:27
lkcl	i do realise we're not looking for performance here but i thought you should know that preliminary tests show it's only twice as slow as _pyrtl.py	14:38
lkcl	for a first shot that's stunning	14:39
lkcl	with no effort at all at optimisation	14:39
mikolajw	I just realized that calling the "main" process Python coroutine from C is going to have overhead, probably significant	14:45
mikolajw	So maaaybe it would make sense to somehow convert it to C as well somewhere in the future	14:46
lkcl	well, at this point, the primary objective has been achieved	14:48
lkcl	i mean, "achieved but not unit-test-demonstrated-as-achieved" if you know what i mean	14:48
lkcl	PowerDecode2 is the big one that's needed	14:49
lkcl	but before that, can you take a look at getting the actual Signal names into the slot names?	14:49
lkcl	this will be needed for when doing the c-based Power ISA simulator, we need to be able to identify the Signal names so that the (auto-generated) function can be called from c	14:50
lkcl	and if they're all called slot_NNNN they're impossible to identify	14:50
lkcl	i must apologise i did actually successfully do this one time (4 months back) but it was a very quick hack and i forgot how it was done	14:50
mikolajw	Yes	14:50
lkcl	i think i made some correct notes in the bugreport	14:51
lkcl	i do recall that it was very simple	14:51
lkcl	or	14:52
lkcl	or, or, or....	14:52
lkcl	even if there are #defines or code-comments	14:52
lkcl	set(1272, next_1272);	14:53
lkcl	-->	14:53
lkcl	#define THE_SIGNAL_NAME_FROM_SRC_1272 1272	14:53
lkcl	set(THE_SIGNAL_NAME_FROM_SRC_1272, next_1272);	14:53
lkcl	something like that would do the trick	14:53
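One way the #define idea could be generated, sketched under assumptions: the function `slot_defines`, the slot-to-name mapping and the signal names are all hypothetical, and the real compiler may hold this information in a different form.

```python
import re


def slot_defines(slot_names):
    """Map {slot_index: signal_name} to C #define lines and macro names."""
    lines, macros = [], {}
    for index, name in sorted(slot_names.items()):
        # Turn the signal name into an upper-case C identifier; the slot
        # index suffix keeps macros distinct if two names collide.
        ident = re.sub(r"\W", "_", name).upper()
        macros[index] = f"{ident}_{index}"
        lines.append(f"#define {macros[index]} {index}")
    return lines, macros


# Hypothetical signal names, for illustration only.
lines, macros = slot_defines({1272: "dec2.insn_in", 7: "core_stopped_i"})
set_call = f"set({macros[1272]}, next_1272);"
```

The generated C still uses plain integers after preprocessing, so this costs nothing at run time while making each slot identifiable by name.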
mikolajw	Yes, I remember, will do	14:54
lkcl	star	14:54
lkcl	errm ermermerm i don't actually know how main() works :)	14:54
mikolajw	I'm not talking about C main(), I'm talking about "def process()" that is provided to the simulator through sim.add_process()	15:00
lkcl	yes i'm with you now	15:00
lkcl	for process in self._processes:	15:00
lkcl	process.run()	15:00
lkcl	even if that was in c it would make a massive difference	15:01
lkcl	mmmm.... yyyyeah, looking at it: all this has to be in c	15:03
lkcl	because from e.g. the linux kernel (or cavatools), one single function has to be called which "produces_an_answer()"	15:03
lkcl	which is a leetle more involved	15:05
lkcl	but, again, hey, it's 428 lines of code in that module	15:05
mikolajw	So, you want "def process()" to be converted to C too?	15:06
mikolajw	We can do it dynamically, via some converter, or by just rewriting it in C	15:07
mikolajw	It's not compiled with _pyrtl.c because it's a PyCoroProcess, while all other processes are PyRTLProcess	15:09
mikolajw	Sorry I'm on mobile so it's more effort to be precise	15:09
mikolajw	Ok, I presume you do, I just wanted an affirmative answer to this question precisely to be sure we understand each other	15:41
lkcl	mikolajw, sorry, was afk	16:15
lkcl	no, not def process()	16:16
lkcl	but starting at PySimEngine._step()	16:16
lkcl	or at least at first its loop "for process in self._processes"	16:17
lkcl	and progressing incrementally from there	16:17
lkcl	when using in the linux kernel or cavatools, what is needed is one single step (what gets triggered by Settle())	16:18
lkcl	so we would manually set up the inputs (aka slots)	16:18
lkcl	run one single c-based-version-of-PySimEngine._step()	16:19
lkcl	and get the outputs	16:19
lkcl	in this way we will have an input of the raw 32-bit instruction	16:19
lkcl	(run the steps-loop-in-c until converged==True)	16:19
lkcl	and the outputs will be the decoded instruction	16:20
lkcl	we don't need the full test_power_decoder.py process() function converted to c for that	16:20
lkcl	and even when using this in standard python Simulations, it would be problematic to expect everyone and anyone to convert their entire process() functions to c	16:21
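The "set inputs, step until converged, read outputs" flow can be sketched as a plain fixed-point loop. This is a simplification written in Python for brevity, not PySimEngine._step() itself: the function `settle`, the two process functions and the slot names are hypothetical, and slots are updated immediately rather than through a next/commit phase.

```python
def settle(slots, processes, max_steps=1000):
    """Run every process repeatedly until no slot changes (a fixed point)."""
    for _ in range(max_steps):
        converged = True
        for process in processes:
            for name, value in process(slots).items():
                if slots.get(name) != value:
                    slots[name] = value
                    converged = False
        if converged:
            return slots
    raise RuntimeError("did not converge; possible combinatorial loop")


# Hypothetical two-stage decoder: the opcode is the top 6 bits of the
# 32-bit instruction, and is_addi depends on the opcode.
def decode_flag(slots):
    return {"is_addi": int(slots.get("opcode") == 14)}


def extract_opcode(slots):
    return {"opcode": slots["insn"] >> 26}


# The processes are deliberately listed in the "wrong" order, so more
# than one pass is needed before the outputs stop changing.
result = settle({"insn": 0x38600005}, [decode_flag, extract_opcode])
```

The caller only supplies the raw instruction and reads the decoded fields afterwards, which is the shape of interface described for cavatools and the kernel.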
lkcl	cesar: works fantastic	17:06
programmerjake	mikolaj: if you want higher performance, try running it in pypy, it specifically optimizes cffi to basically just raw call instructions to/from c (cuz it has that nice jit that can do that)	17:51
lkcl	programmerjake, interesting. didn't know that.	18:14
lkcl	the target is however being able to do a single (complete) combinatorial circuit "settling" (reaching "no change") as a complete stand-alone piece of c	18:15
programmerjake	yup	18:15
lkcl	for use inside both cavatools and the linux kernel (trap-and-emulate)	18:15
lkcl	the irony is, that the easiest way to test is to actually have a full complete simulator	18:16
lkcl	i have a whole stack of potential ideas for optimisation, including merging multiple signals into (the same) 64-bit instruction, but am seeeriously resisting talking about them :)	18:17
lkcl	cesar, that single-stepping allowed me to narrow down on potential sources of the bug	18:18
lkcl	it looks like the dcache is triggering an MMU lookup, which is successful, BUT	18:18
lkcl	the address that actually gets requested - after the lookup - is the virtual address not the looked-up (real) one	18:19
lkcl	but finding that without having the equivalent microwatt traces in a diff file would have been 10x harder to track down	18:20
lkcl	i can now see libresoc-mmu looking up address 0x2600 on the wishbone bus, where microwatt looks up 0x1000	18:21
programmerjake	mikolaj if you just have a single combinatorial circuit without feedback loops, you should be able to calculate a topological ordering of the signals, such that you don't need a simulate loop cuz it can always calculate all signal values in a	18:22
programmerjake	single step by calculating them in that specific ordering. this should greatly simplify the produced c code and make it run faster cuz you don't need the whole signal change tracking system	18:22
programmerjake	https://en.wikipedia.org/wiki/Topological_sort	18:22
lkcl	that would also help locate combinatorial loops (which is something not done at the moment, at all, in nmigen Simulation, and it's a pain)	18:23
lkcl	programmerjake, can you raise a bugreport about it, so we don't forget	18:23
lkcl	the only thing being a pain in the neck, that sort takes place across an entire swathe of modules/fragments/processes	18:24
mikolajw	So as I understand it, topological sorting would let us get rid of that "while not converged:" loop	18:32
programmerjake	https://bugs.libre-soc.org/show_bug.cgi?id=760	18:34
programmerjake	yup, as well as the signal change tracking datastructures	18:35
mikolajw	But to get this done we need a traversable representation of the signal flow graph	18:35
mikolajw	Which will require nontrivial changes to the Nmigen to C compiler	18:38
programmerjake	how about deferring actually writing the generated c (write to a string instead and store it temporarily) and instead put it in a graph node associated with each signal along with the edges that are the list of signals read by that signal	18:39
mikolajw	Yeah, that's what I'm thinking about	18:40
programmerjake	then visit signals in that topological order writing the c code strings as you get to them	18:40
programmerjake	if you can easily do it, it'd be nice to still retain the change tracker stuff (but only if a flag is enabled, or if the topological sort fails) cuz it might be handy for a full featured simulator later, if we want that	18:46
mikolajw	That would be cool, but of course this is definitely a thing for very much later	18:48
programmerjake	yup!	18:49
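A sketch of the deferred-emission idea: keep each signal's generated C as a string on a graph node, then write the strings out in topological order. It assumes Python 3.9+ for the standard-library `graphlib` module; the function `emit_in_dependency_order`, the node layout and the signal names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+


def emit_in_dependency_order(nodes):
    """nodes maps signal -> (signals it reads, deferred C code string)."""
    graph = {signal: set(reads) for signal, (reads, _) in nodes.items()}
    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None  # combinatorial loop: fall back to change tracking
    # Primary inputs appear in the order but have no generated code.
    return [nodes[s][1] for s in order if s in nodes]


nodes = {
    "is_addi": (["opcode"], "is_addi = (opcode == 14);"),
    "opcode": (["insn"], "opcode = insn >> 26;"),
}
c_lines = emit_in_dependency_order(nodes)
looped = emit_in_dependency_order(
    {"a": (["b"], "a = b;"), "b": (["a"], "b = a;")}
)
```

A failed sort doubles as a combinatorial-loop detector, which is the point at which the change-tracking simulator would still be needed.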
mikolajw	What is however a low hanging fruit is parallelizing the processes. That's most likely going to give a huuuge boost	18:49
programmerjake	hmm, i'd expect just running everything in a fixed-at-compile-time topological order and relying on the c compiler to optimize/inline/etc. (cuz handling arithmetic/logical DAGs is usually what the compiler is good at) would give waay more performance than whatever complexity you'd likely have from parallelization	18:53
programmerjake	until you get to very huge designs with multiple cpu cores, or similar, parallelization would likely have more inter-thread overhead than would be gained by multithreading	18:56
lkcl	yes, VLSI is quite annoying. the interconnectivity is so high that locking becomes not only the highest point of contention but also the very presence of the mutexes actually slows down best-case	19:17
lkcl	jean-paul is running into an algorithmic issue with PnR this way, as well.	19:17
lkcl	the early phases (coarse-grain routing) no problem, parallelise all you like	19:17
lkcl	the fine-grain routing, all cores but one sit there waiting for contention.	19:18
programmerjake	if i were to parallelize it, i'd use an algorithm that subdivides the computation into a graph of tasks with dependencies, where each task writes to signals that aren't read/written by other tasks that run at the same time, allowing the task scheduler to be the only place where any inter-thread synchronization is used, no mutexes/atomics on the individual signals required.	19:36
programmerjake	that subdivision can be computed ahead of time by the compiler	19:37
programmerjake	each task would compute a decently sized subgraph of the whole signal dependency graph	19:38
programmerjake	^ for parallelization of the c hdl simulator	19:38
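The ahead-of-time subdivision could look roughly like the following, which groups signals into batches whose members have no dependencies on each other. It works per signal rather than per "decently sized subgraph", so it is only a starting point. Python 3.9+ is assumed for `graphlib`, and the function and signal names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def parallel_batches(reads):
    """Group signals into batches whose members do not depend on each other."""
    sorter = TopologicalSorter(reads)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorter.get_ready()   # everything whose inputs are computed
        batches.append(sorted(ready))
        sorter.done(*ready)          # the only synchronisation point
    return batches


batches = parallel_batches({
    "opcode": {"insn"},
    "ra": {"insn"},
    "is_addi": {"opcode"},
    "result": {"is_addi", "ra"},
})
```

Everything inside one batch could be handed to separate threads without locks on individual signals, since synchronisation happens only between batches.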
programmerjake	i'd expect that, assuming the fine-grain routing can be designed to only look at the results of coarse-grain routing of a block and its nearby blocks, and not the fine-grain routing of any blocks at all, then the fine-grain routing can be computed in parallel at the block level by writing the results to an output datastructure where each block is independently writable. the input datastructure with the coarse grain data would be	19:44
programmerjake	read-only during this phase.	19:44
programmerjake	i've used a very similar algorithm to compute in parallel new chunks of a minecraft-style game world for version 0.7 of my game, named voxels	19:46
programmerjake	i just got a free YubiKey 5 from the github shop, thanks to the Linux Foundation and GitHub and Rust	22:06
programmerjake	https://github.com/ossf/great-mfa-project	22:06
programmerjake	lkcl, you mentioned you thought they could be backdoored by the people running the OpenSSF project...it worked by them giving me a coupon code for the official GitHub shop, GitHub are the ones who are shipping it, I trust GitHub a lot more to not backdoor the things they're selling	22:10
