Thursday, 2023-04-27

*** jn <jn!~quassel@user/jn/x-3390946> has quit IRC		03:30
*** jn <jn!~quassel@2a02:908:1066:b7c0:20d:b9ff:fe49:15fc> has joined #libre-soc		03:32
*** jn <jn!~quassel@2a02:908:1066:b7c0:20d:b9ff:fe49:15fc> has quit IRC		03:32
*** jn <jn!~quassel@user/jn/x-3390946> has joined #libre-soc		03:32
programmerjake	lkcl: I'm thinking predicated prefix sum is too complex to figure out easily, plus it produces hard-to-use outputs, so what do you think about declaring prefix-sum with predicated off elements as undefined?	04:11
programmerjake	I'm going to go ahead and do that for now	04:14
programmerjake	another thing I ran into is in iterate_indices, it reverses steps if invxyz[1], however that is actually nonsensical, reversing steps doesn't produce a useful operation (unlike reversing indices, which is equivalent to reversing vector elements before and after the prefix-sum/reduction so is useful)	04:42
programmerjake	it makes it unnecessarily more complex, so I'm going to copy the existing function, remove steps reversing, and add prefix-sum to that.	04:43
programmerjake	note that reversing steps is equivalent to reversing the top half of the following diagram vertically (aka. not useful afaict): https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/test/test_prefix_sum.py;h=2b88407216ccad3fc99a7d633331a30a3d3f562f;hb=HEAD#l167	04:48
ghostmansd	lkcl, FYI: https://salsa.debian.org/Kazan-team/mirrors/openpower-isa/-/jobs/4167814	04:50
ghostmansd	FAILED src/openpower/decoder/isa/test_caller_svp64_ldst.py::DecoderTestCase::test_sv_load_dd_ffirst_excl - AssertionError: 2 != 1	04:51
ghostmansd	Broken in master	04:51
ghostmansd	other than this test, nopr branch seems to produce the same results as master	04:52
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC		04:57
ghostmansd[m]	Correction: this test fails both in master and nopr	05:00
programmerjake	yeah, it was broken from the start afaict...just ignore it for now, luke can fix it later	05:01
ghostmansd[m]	Ok, thank you, later today I'll merge nopr branches both into gdb and openpower-isa	05:02
*** jn <jn!~quassel@user/jn/x-3390946> has quit IRC		05:04
*** jn <jn!~quassel@ip-095-223-044-193.um35.pools.vodafone-ip.de> has joined #libre-soc		05:06
*** jn <jn!~quassel@ip-095-223-044-193.um35.pools.vodafone-ip.de> has quit IRC		05:06
*** jn <jn!~quassel@user/jn/x-3390946> has joined #libre-soc		05:06
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has quit IRC		05:10
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has joined #libre-soc		05:11
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has joined #libre-soc		06:21
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC		06:27
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has quit IRC		06:49
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has joined #libre-soc		06:53
markos	toshywoshy, lkcl, programmerjake what about Thursday afternoon, biweekly for the svp64 meetings?	09:14
*** yambo <yambo!~yambo@069-145-120-113.biz.spectrum.com> has quit IRC		09:20
*** midnight <midnight!~midnight@user/midnight> has quit IRC		09:21
programmerjake	oh, i'm busy for some of this thursday afternoon, so idk if i can make it	09:25
programmerjake	oh, wait, it's probably not afternoon for me when you're thinking	09:26
programmerjake	what time?	09:26
*** midnight <midnight!~midnight@user/midnight> has joined #libre-soc		09:28
*** yambo <yambo!~yambo@069-145-120-113.biz.spectrum.com> has joined #libre-soc		09:32
markos	right, it's probably going to be morning for you I guess	09:37
markos	I'd say pick a time between 3pm-7pm UK time	09:38
programmerjake	7pm? if it's earlier than 6pm i likely won't make it	09:43
programmerjake	tbh i prefer later than 7pm if that works for you all	09:45
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has quit IRC		09:49
programmerjake	though I am wondering if we have to have meetings, since afaict email and irc have been working fine...sorry, i had missed the part where it was explained why we needed SVP64 meetings. for recording presentations, wouldn't it work fine to record them individually and then publish them	09:50
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.44.124> has joined #libre-soc		09:50
programmerjake	or are these meetings where we're expecting non-libre-soc people to show up and ask questions?	09:51
markos	it's basically mostly internal, to share svp64 assembly for people who are not yet up to speed	09:53
markos	but I think it could just as well be for other interested people also	09:54
markos	it's not for recording presentations for conferences etc	09:54
programmerjake	ah, so not a major problem if i miss any	09:54
programmerjake	since afaict i'm mostly up to speed on svp64	09:55
markos	no, though people will probably benefit from your technical knowledge :)	09:55
markos	you are, others not as much :)	09:55
markos	the point is not to train your or Luke :)	09:56
markos	s/your/you	09:56
programmerjake	ah, ok.	09:56
programmerjake	i think we should see who all wants to attend, e.g. if cesar wants to attend we'd have to work around his work schedule	09:58
lkcl	programmerjake, i solved predication in the parallel-reduction case. if you can write a short (10-20 lines) python script in a non-predicated demo, like you did last time (but this time excluding predication entirely) i can work it out	10:00
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@176.59.44.124> has quit IRC		10:02
programmerjake	lkcl, you'd want https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/prefix_sum.py;h=23eca36e2bb748c296c5a7ca88b9fa578258c653;hb=HEAD#l35	10:03
programmerjake	it's short and to-the-point	10:03
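For readers without the linked file to hand, here is a minimal sketch of a work-efficient prefix-sum schedule in the spirit of the function being referred to. It is not a copy of the nmutil code: the names `prefix_sum_schedule` and `prefix_sum` are hypothetical, and the schedule is emitted as plain `(out, lhs, rhs)` index triples, which is an assumption made for illustration.

```python
import operator


def prefix_sum_schedule(item_count):
    """Yield (out, lhs, rhs) triples meaning items[out] = items[lhs] + items[rhs]."""
    # Up-sweep: build partial sums as a set of binary trees.
    dist = 1
    while dist < item_count:
        for i in reversed(range(dist * 2 - 1, item_count, dist * 2)):
            yield (i, i - dist, i)
        dist *= 2
    # Down-sweep: express the remaining outputs in terms of the partial sums.
    dist //= 2
    while dist >= 1:
        for i in reversed(range(dist * 3 - 1, item_count, dist * 2)):
            yield (i, i - dist, i)
        dist //= 2


def prefix_sum(items, fn=operator.add):
    """Apply the schedule in place on a copy and return the inclusive prefix sums."""
    items = list(items)
    for out, lhs, rhs in prefix_sum_schedule(len(items)):
        items[out] = fn(items[lhs], items[rhs])
    return items


print(prefix_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```

Every step only reads and writes fixed element positions, so there are no moves; that is the property the discussion that follows turns on.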
lkcl	excellent.	10:03
lkcl	ok so the inverted-bit (going out again) is the bit i need.	10:03
programmerjake	do copy it somewhere else to hack on it...	10:04
lkcl	predication is solved jacob.	10:04
cesar12	No, go ahead, I'm not that interested in SVP64 assembly right now, more focused on low level HDL and Formal Verification.	10:04
programmerjake	ok, cesar	10:04
lkcl	it's done by maintaining a suite of indices where instead of a MV operation the indices are MVed.	10:04
lkcl	such that on the next operation that would otherwise have needed a MV, the source operand is taken from the MVed index position	10:05
programmerjake	except that prefix sum has no moves	10:05
lkcl	so, do predicated elements remain where they are?	10:07
programmerjake	and if you tried to renumber indices based on skipping lanes predicated out, you'd end up with a highly variable pattern difficult to optimize hw for	10:07
lkcl	tough.	10:07
programmerjake	prefix sum is unpredicated	10:07
lkcl	then the developer must perform a predicated VCOMPRESS/VEXPAND before/after	10:07
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.73.253.155> has joined #libre-soc		10:07
programmerjake	ok, fine with me	10:08
lkcl	i'd like to keep the predication-index-moving-thing in because it works, and we may find that someone gets it to work	10:08
lkcl	if they find it's low performance and use VCOMPRESS/VEXPAND, they learned something :)	10:09
lkcl	here:	10:09
lkcl	+ # start a loop from the lowest step	10:09
lkcl	+ step = 1	10:09
lkcl	+ while step < xd:	10:09
lkcl	+ step *= 2	10:09
lkcl	+ stepend = step >= xd # note end of steps	10:09
lkcl	is that basically the same as the nmigen prefix_sum_ops algorithm?	10:10
programmerjake	no but it's similar	10:10
programmerjake	step = 2 * dist	10:11
lkcl	but achieves a work-efficient schedule?	10:11
programmerjake	but reduction operates differently than prefix-sum because it does operations toward the other end...	10:11
programmerjake	reduction achieves a work-efficient schedule, but it's somewhat different than the prefix-sum work-efficient schedule	10:13
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@91.73.253.155> has quit IRC		10:14
lkcl	okaaay. sounding like a separate iterator function is needed: i thought it was identical-first-half	10:14
programmerjake	it's similar, i thought it was identical	10:14
programmerjake	i didn't think through all the details at the time	10:15
lkcl	well, there's room. submode=0b10 and 0b11	10:15
lkcl	it's all good	10:15
lkcl	ok let me just tie this in...	10:16
programmerjake	they can probably share a lot of hw at least..,	10:16
lkcl	yehyeh	10:16
programmerjake	note the code in the `if` that i committed is ported from nmutil.prefix_sum	10:16
programmerjake	so you don't need to re-convert it	10:17
lkcl	i'm just going to link iterate_indices2() into SVSHAPE.get_iterator	10:18
lkcl	that's all	10:18
programmerjake	https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_preduce_yield.py;h=9e9fa2a69a0efd0a7794149353fff14d6fbcd73a;hb=0b6592c574f814d81cfede4c74c50b583590db13#l49	10:18
programmerjake	if you don't mind my having removed steps.reverse(), just delete the existing iterate_indices and rename iterate_indices2 -> iterate_indices	10:19
*** ghostmansd[m] <ghostmansd[m]!~ghostmans@5.32.74.194> has joined #libre-soc		10:20
lkcl	yes i mind. although it should be mirroring rather than total-inversion	10:20
programmerjake	it should function identically for all useful cases	10:20
lkcl	i.e. the end-result of inversion is that the result ends up in element VL-1 rather than element 0	10:21
programmerjake	well, in that case just copy the `if` block to iterate_indices and delete iterate_indices2	10:22
lkcl	gimme a sec... ok done	10:24
lkcl	python3 decoder/isa/test_caller_svp64_parallel_reduce.py >& /tmp/f	10:24
lkcl	nothing "damaged"	10:24
lkcl	next step: simplev.mdwn	10:25
programmerjake	k, i'm going to sleep, so ttyl	10:27
lkcl	night jacob, thanks for your help	10:30
markos	argh, how the heck do tables work in markdown?	10:36
lkcl	|heading1|heading2|	10:36
markos	https://libre-soc.org/openpower/sv/cookbook/chacha20/	10:36
lkcl	|-----|-----|	10:36
lkcl	|rowdata1|rowdata2|	10:36
markos	yeah, I've done that but I'm getting crap formatting	10:36
lkcl	1 sec let me take a look	10:36
lkcl	you forgot the headings	10:37
programmerjake	now that i look at the time, i'm unlikely to make it in time for a 6pm BST meeting, maybe 7pm? sorry	10:37
programmerjake	don't count on me attending today	10:38
markos	programmerjake, well no one agreed for today anyways, don't worry	10:38
lkcl	markos, fixed the 1st table, you can see what it looks like now.	10:39
markos	aha!	10:39
lkcl	if you add extra "|----|----|"s it just adds "-----" into cells	10:40
lkcl	you nearly had it - just the missing headings	10:41
lkcl	the format you were thinking of is more restructured-text	10:41
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has joined #libre-soc		10:43
markos	finally	10:44
markos	ok, so this should be ok now	10:44
markos	could you please check if the VF description is adequate?	10:44
markos	so that I can finally get this 'done' :)	10:45
markos	argh, accidentally removed the TOML values	10:49
markos	fixed	10:51
lkcl	yes am good with it. i just added a REMAP Indexing quick intro as well	10:55
markos	great, I'm closing this as fixed then	10:57
markos	ok, now to moving to the butterfly instructions :)	10:58
lkcl	:)	11:04
lkcl	just added an intro section, no conclusion - i think the assembler itself is enough.	11:09
lkcl	aand we're good	11:09
markos	great, (belated) RFPs sent for those :)	11:12
lkcl	got it. do update the toml field(s)	11:13
lkcl	markos = {amount=NNN, submitted=date} i forget the format YYYY-MM-DD?	11:13
lkcl	ok you can see in https://bugs.libre-soc.org/show_bug.cgi?id=1007	11:14
markos	didn't I fix it? did I do it wrongly?	11:15
lkcl	you need to keep the bugzilla records consistent with the RFP	11:15
lkcl	(and you put in EUR 1800 not EUR 1700 which i don't mind)	11:16
markos	argh	11:16
markos	crap, can I edit it?	11:16
lkcl	The table of payments (in EUR) for this task; TOML format:	11:16
lkcl	(edit)	11:16
lkcl	markos=1100	11:16
lkcl	lkcl={amount=400, submitted=2023-03-25}	11:16
lkcl	nope. it's in, and approved.	11:16
lkcl	so i retrospectively changed the amount to 1800	11:16
lkcl	https://bugs.libre-soc.org/show_bug.cgi?id=1007	11:16
lkcl	you need to edit the TOML field and put	11:17
markos	we'll balance it out in the next one :)	11:17
markos	sorry about that	11:17
lkcl	markos={amount=1100, submitted=2023-04-27}	11:17
markos	at worst I'll buy you an expensive bottle of wine :-)	11:17
lkcl	likewise in 1006, put the record of the same date	11:17
lkcl	:)	11:17
lkcl	don't worry about it	11:17
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC		11:21
lkcl	markos, ok great - let me run the budget-sync thing and you can review your page https://libre-soc.org/task_db/mdwn/markos/	11:25
lkcl	give it 2 mins...	11:25
lkcl	ok it's updated. normally what i do is actually copy that markdown auto-generated table into the "RFP comments/results"	11:26
lkcl	but it requires that you run the budget-sync program and do the TOML-field editing before _that_...	11:27
lkcl	but at least you can see: "Submitted but not yet paid" now contains the two tasks you are waiting for an RFP for, to be paid, yay	11:29
markos	indeed, thanks	11:36
markos	lkcl, btw, reg butterfly insns, should those go to fixedarith.mdwn or own file?	11:40
markos	I'm going to do something better than what Arm is doing, their versions are not as precise so we cannot use them everywhere as expected	11:51
markos	can we do 3-in, 2-out?	11:51
markos	which form is that?	11:52
markos	or 4-in, 1-out	12:00
markos	4-in might be useful to add in a right-shift immediate	12:01
markos	basically the instructions are trying to emulate fdct_round_shift((a +/- b) * c)	12:02
markos	if we can do 2-out then we can both fdct_round_shift((a + b) * c) and fdct_round_shift((a - b) * c) in the same instruction	12:02
markos	if not then we have to provide 2 instructions for that, but in that case, we can use an extra instruction for the shifting	12:03
markos	er, extra operand	12:03
markos	fdct_round_shift(x) is essentially ROUND_POWER_OF_TWO(x, DCT_CONST_BITS)	12:05
markos	where #define ROUND_POWER_OF_TWO(value, n) (((value) + (1 << ((n)-1))) >> (n))	12:05
markos	and DCT_CONST_BITS = 14	12:06
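The three definitions quoted above translate directly into a few lines of Python. This is only a sketch of the arithmetic as stated in the chat; the lower-case function name `round_power_of_two` is chosen here for illustration.

```python
DCT_CONST_BITS = 14  # value quoted in the discussion


def round_power_of_two(value, n):
    """Add half of 2**n, then shift right by n: rounds value / 2**n to nearest."""
    return (value + (1 << (n - 1))) >> n


def fdct_round_shift(x):
    return round_power_of_two(x, DCT_CONST_BITS)


print(fdct_round_shift(1 << 14))  # exactly 1.0 in Q14 fixed point
print(fdct_round_shift(8191))     # just under 0.5, rounds down
print(fdct_round_shift(8192))     # exactly 0.5, rounds up
```

Python's `>>` on negative integers is an arithmetic (floor) shift, so the sketch is also well defined for negative inputs.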
markos	I'd love to be able to do both a+b/a-b in a single instruction though, that would essentially double throughput	12:08
markos	where can I find the possible Forms?	12:11
markos	nevermind, 1.6.1 ISA manual	12:59
markos	lkcl, stupid question, could we assume that an instruction has 2 outputs but only needs one output register? ie, it outputs to RT and RT+1	13:11
markos	I guess not, but thought I'd ask	13:14
markos	because that way we can have 3-in, RA, RB, RC and 2-outs, RT = (RA+RB)c and RT+1 = (RA-RB)c	13:14
markos	if we can squeeze in a 4-bit immediate to right shift, this will be a killer instruction	13:15
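A rough model of the 3-in, 2-out operation being proposed, assuming the shift amount is the fixed DCT_CONST_BITS = 14 mentioned earlier rather than an immediate. The function names are hypothetical and this is not the instruction's pseudocode, only the arithmetic it is meant to perform.

```python
DCT_CONST_BITS = 14  # assumed fixed shift, as quoted earlier in the log


def round_shift(x, n=DCT_CONST_BITS):
    return (x + (1 << (n - 1))) >> n


def twin_butterfly(a, b, c, shift=DCT_CONST_BITS):
    """Return both results at once: ((a + b) * c, (a - b) * c), each rounded and shifted."""
    return round_shift((a + b) * c, shift), round_shift((a - b) * c, shift)


# With c = 1.0 in Q14 the outputs are simply a + b and a - b.
print(twin_butterfly(5, 3, 1 << 14))
```

The first element of the returned pair stands in for the value written to RT, the second for the value written to the paired output register.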
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has joined #libre-soc		13:27
markos	lkcl, actually I think this is already done in fdmadds FRT,FRA,FRC,FRB	13:37
markos	pseudo-code has: FRS <- FPADD32(FRA, FRB)	13:37
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC		13:37
markos	but there is no FRS declared in the assembly syntax	13:37
markos	this looks a bit wrong	13:42
markos	anyway, could we use the same trick as with svshape and save a bit in output register, and assume a pair of registers written?	13:43
markos	ie instead of RT, provide RT/2, and always assume that this instruction will accumulate both RT and RT+1	13:45
markos	with accumulate that means you can have the 2-coeffs butterfly operation fdct_round_shift(a * c1 +/- b * c2) with just 2 instructions :)	13:47
markos	you'd just have to swap RA, RB in the second instruction	13:47
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has joined #libre-soc		14:35
markos	oh well, apparently RT + 1 <- does not work :-/	15:06
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC		15:07
lkcl	markos, svfixedarith.mdwn or yes their own file is perfectly fine	16:14
lkcl	yes that's what's been done. the extra operand is declared to exist as RT+1 for scalar-only instructions	16:15
lkcl	and is declared to exist as RT+MAXVL for vectorised instructions	16:15
lkcl	notes are in the spec and they _should_ be at the top of the mdwn file as comments?	16:15
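The rule just described for where the implicit second output lives can be written down in one line. The helper name `implicit_rs` and its arguments are hypothetical; only the RT+1 / RT+MAXVL rule comes from the discussion.

```python
def implicit_rs(rt, maxvl, vectorised):
    """Register number of the implicit second output, given the RT register number."""
    # Scalar-only: the second result goes to the next register.
    # Vectorised: it goes MAXVL registers further on, clear of the RT vector.
    return rt + maxvl if vectorised else rt + 1


print(implicit_rs(6, maxvl=4, vectorised=False))
print(implicit_rs(6, maxvl=4, vectorised=True))
```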
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has joined #libre-soc		16:16
markos	well, I made some progress, mdwn file is written, test case also, added enums, etc. I'm getting the op_maddsubrs (just picked something for now) generated and trying to run it now but getting some errors	16:16
markos	I named the instruction maddsubrs (multiply-add-sub-right-shift)	16:20
lkcl	ok cool!	16:25
lkcl	if you drop it in a branch give me a shout i can take a look.	16:26
lkcl	unless you feel confident it doesn't cause "damage" in which case just shove it in master	16:26
markos	yeah, will do that in a bit, getting some stupid errors right now, don't want to mess up master yet, if I fail to fix it I'll just commit in a branch asap	16:28
markos	it's about GPROperand(RC), getting lists of indices for GPROperand(RT): (6, 7, 8, 9, 10), GPROperand(RA): (11, 12, 13, 14, 15) but GPROperand(RC): None	16:31
markos	in power_insn.py: ~1123: for idx in operand.span:	16:31
markos	lkcl, is it possible to have 3in and an 4-bit immediate for shifting?	16:32
lkcl	yes but you'll need to design a "Form" to do it. 4-bits is a LOT	16:33
markos	ok, where do I put that form?	16:33
lkcl	and you certainly won't get that in the 3-in 2-out ones that already take 4 operands	16:34
lkcl	in fields.text	16:34
markos	perfect	16:34
lkcl	don't rush into that decision: it needs to be a "Researched" RFC / wiki page	16:34
lkcl	(it gets its own budget)	16:34
lkcl	which reminds me to do exactly that, as each of these instructions needs to be listed on a special twin-butterfly page that currently doesn't exist	16:35
lkcl	ooo there's just enough budget	16:35
*** octavius <octavius!~octavius@92.40.169.65.threembb.co.uk> has joined #libre-soc		16:36
octavius	lkcl, as you've suggested I go back to verilator, that's what I did. Please see bug 1073 when you have some time, I'd really like to figure out what the problem is	16:38
markos	well, started adding form to see how/if I can fit all that	16:38
lkcl	octavius, take a look at the README as well as the source code of the microwatt_verilator main() loop	16:39
octavius	ok	16:39
lkcl	it requires some command-line options	16:39
lkcl	you can probably guess that those command-line options are "the binary to load into RAM"	16:39
lkcl	octavius, you should have worked out that "if it does nothing then you're looking at a black box, stop it"	16:42
octavius	I did stop it	16:42
lkcl	now you've got compiling, a gentle reminder that the purpose of compiling it is to get it to produce gtkwave traces	16:42
octavius	I just ran to remind myself. Last time was in January :)	16:43
lkcl	:)	16:43
lkcl	and that needs verilator compile-time options.	16:43
octavius	Yes, I noticed the .vcd file was unreadable	16:43
lkcl	that's probably because it's an fst file (maybe).	16:43
lkcl	use vcd2fst and fst2vcd - whichever one works use that	16:43
octavius	Also the README in the microwatt repo has no info on verilator at all. Looking at microwatt-verilator.cpp as you've suggested	16:44
lkcl	bear in mind that the output from verilator is not immediately compatible with gtkwave (sigh)	16:44
octavius	Ah ok	16:44
lkcl	you want the microwatt_verilator branch (only)	16:44
octavius	That's the one I'm using	16:44
lkcl	it's been too long i can't remember everything	16:44
octavius	And I'm guessing you mean "verilator_trace" branch	16:45
lkcl	markos, i'm slightly concerned about the low "XO" bit count of adding shift-immediates, they are incredibly expensive even when you have 3 operands	16:45
lkcl	yes	16:45
lkcl	if it was 2 bits, not so much of a problem, but 4 is a LOT	16:47
lkcl	you risk ending up with needing a full Primary Opcode (or 50% of one)	16:48
lkcl	at which point the instruction is highly likely to get rejected by the OPF ISA WG because it is such a "specialist optional" area	16:48
lkcl	something like ternlogi on the other hand brings a massive 256 instructions with it, saving routinely and systematically across general-purpose code	16:49
markos	what do I need to do when I've added a form in the fields.text? plain 'make' chokes, I probably need to run something else, but I forget the sequence	16:49
markos	I've added a BF-Form	16:49
lkcl	but these are area-specific (DCT/FFT) and the only reason they can even be considered is because the wikipedia page lists something mad like 120 use-cases	16:49
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC		16:50
lkcl	please please wait	16:50
markos	yeah, just experimenting now, to see if it's even possible	16:50
lkcl	there is a process for this, we cannot rush ahead adding new Forms arbitrarily without thinking them through and reviewing them	16:50
markos	not going to commit anything	16:50
lkcl	yes, you need to add it to power_enums.py	16:50
lkcl	then add the (new) fields into the later section of fields.text	16:50
lkcl	i don't _mind_ putting them onto the (new) wiki page to see what they look like	16:51
lkcl	(and/or its discussion page)	16:51
markos	I did, BF = 46 at the Form class	16:51
lkcl	oh excellent	16:52
markos	let me paste the form here for starters	16:52
lkcl	good idea	16:52
markos	|0 | 6 |11 |16 |21 | 25 |30 |31 |	16:52
markos	| PO | RT | RA | RB | RC | SH | XO | Rc |	16:52
lkcl	ok so you see how XO is only 1 bit?	16:52
markos	yes, is that a problem? :D	16:53
markos	how many bits does it have to be? can we skip it entirely?	16:53
lkcl	that makes this an absolute top ultra-priority instruction in the same sort of category as "addi"	16:53
lkcl	or "bc"	16:53
markos	ah, I need to add BF to the end of XO(30)	16:53
lkcl	to give some context: if you didn't have "SH" you could add SIXTEEN other 4-operand instructions	16:54
lkcl	no, you need to consider that there is limited space and to consider not proposing this instruction at all because it risks getting rejected	16:54
markos	well, we could leave the shifting out entirely	16:54
lkcl	the lower the XO, the higher the priority has to be	16:55
markos	I see	16:55
lkcl	and obviously it's an exponential curve	16:55
lkcl	as in, "the higher the number of use-cases"	16:55
markos	well, it's about the gain, if the gain is justified	16:55
lkcl	compared to a 10-bit XO this is destroying the opportunity to add a massive 512 other 2-in 1-out instructions	16:55
markos	I mean Arm did include these instructions but with a fixed shifting value	16:55
lkcl	yes, and they are under similar 32-bit constraints	16:56
lkcl	so you start to appreciate why they did that	16:56
lkcl	they're barely going to pass through as they are, with 3-in 1-out (4 operands taking up 20 bits on their own)	16:56
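The bit-budget arithmetic behind the "SIXTEEN" and "512" figures can be checked with a small sketch. It assumes a 32-bit instruction, a 6-bit Primary Opcode, 5 bits per register operand and a 1-bit Rc field; the helper names are hypothetical.

```python
INSN_BITS = 32
PO_BITS = 6
REG_BITS = 5


def xo_bits(num_regs, imm_bits=0, rc_bits=1):
    """Bits left over for the extended opcode (XO) after all other fields."""
    return INSN_BITS - PO_BITS - num_regs * REG_BITS - imm_bits - rc_bits


def slots_per_po(num_regs, imm_bits=0, rc_bits=1):
    """How many distinct instructions of this shape fit under one Primary Opcode."""
    return 1 << xo_bits(num_regs, imm_bits, rc_bits)


with_sh = slots_per_po(num_regs=4, imm_bits=4)     # 4 operands plus 4-bit SH
without_sh = slots_per_po(num_regs=4)              # 4 operands, no SH
two_in_one_out = slots_per_po(num_regs=3)          # 3 operands, 10-bit XO

print(xo_bits(4, imm_bits=4))     # XO width with the 4-bit SH
print(without_sh // with_sh)      # 4-operand slots lost per SH-carrying instruction
print(two_in_one_out // with_sh)  # 10-bit-XO slots lost per SH-carrying instruction
```

Under these assumptions each instruction with a 4-bit SH leaves only 1 bit of XO, so it occupies the space of 16 four-operand instructions without SH, or 512 instructions with a 10-bit XO.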
markos	I do, in a sense, I admit I'm seeing this from my own point of view	16:56
markos	being able to do twin butterfly operations in just 2 instructions is a massive win, from my perspective	16:57
lkcl	which has to be compared against the perspective of millions of programmers doing general-purpose	16:57
lkcl	yes i know! :)	16:57
lkcl	read above: about the 120 use-cases for DCT on the wikipedia page	16:57
lkcl	it's the only reason we can get away with proposing these at all	16:57
lkcl	(that, and ARM already added them, we can point at that fact and use it as additional justification)	16:58
markos	well, something like that could bring Power as a top performer in video processing	16:58
lkcl	indeed	16:58
markos	or any kind of media processing	16:58
lkcl	but if it takes up EIGHT Primary Opcodes to do so, that's not going to fly	16:58
lkcl	there's only 32 new POs in the EXT2xx area, 10 of which i want to allocate to LD/ST-Post-Increment	16:59
lkcl	(because that is a huge saving - every single hot-loop in existence in every general-program benefits)	16:59
markos	I'll play with this a bit	17:00
lkcl	hence, "really high priority"	17:00
markos	I'll try to minimize SH as much as possible	17:00
lkcl	awesome	17:00
markos	would 2-bits be ok?	17:00
lkcl	now, about RC/RS - there's a place in power_decoder2.py that you (or more like i) may need to pay attention to	17:00
markos	because if I can assume eg. shifting by a number of bits	17:01
lkcl	not really. that's still two Primary Opcodes	17:01
markos	ok	17:01
lkcl	probably one is ok, and that's risky. it's still an entire PO taken up by the (set of) instructions	17:01
lkcl	because there's what... 8 of them?	17:01
markos	understood	17:02
lkcl	ahhh ok	17:02
lkcl	i remember now	17:02
lkcl	search for "implicit_rs" in power_decoder2.py	17:02
lkcl	that's really important.	17:02
lkcl	it's complicated, but a "special check" is needed for the implicit RS/RC/FRS/FRC instructions, actually right there in the decoder	17:03
lkcl	i.e. you can't just "add instructions to the csv files and hope"	17:03
lkcl	gimme a sec...	17:03
lkcl	sorry i forgot about this, it's been a while	17:03
markos	np	17:03
lkcl	https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/power_decoder2.py;h=88b2023859061d7601a9dc94e052c75ec59fd8b1;hb=d40763cd6e186ad9b17ce6f974a38b4c4877965e#l1057	17:04
lkcl	actually line 1046	17:05
lkcl	so you need to decide the XO field (which is in the CSV file) and under which Major (PO)	17:06
lkcl	btw this really needs to all go into the wiki page	17:06
lkcl	https://libre-soc.org/openpower/sv/twin_butterfly	17:06
lkcl	so it can be reviewed reaallly carefully	17:06
lkcl	i'll need to do some temporary opcode allocation and find space for them	17:07
lkcl	probably minor_22.csv - i think there's space still	17:07
lkcl	then a section will be needed in power_decoder2.py to match it	17:07
lkcl	1046 with m.If((major == 59) & xo.matches(	17:07
lkcl	1047 '-----00100', # ffmsubs	17:07
lkcl	....	17:07
lkcl	and you can see that one of either RB or RC can be "extended by MAXVL" when Vectorisation is enabled	17:08
lkcl	so you need to decide which that's going to be.	17:08
markos	ok, this needs a lot of thought still	17:09
lkcl	indeed. fortunately there's a trail already blazed	17:09
lkcl	but it's probably best to use the twin_butterfly page to create stub instructions, ultimately intended to be morphed into actual RFC actual Power ISA form	17:09
lkcl	but kept short for now to make it easy to discuss iteratively	17:10
markos	will start adding stuff there asap to discuss	17:10
lkcl	ack	17:10
lkcl	it's got its own budget and bugreport	17:10
lkcl	i'll add the fp butterfly instructions later	17:11
markos	pushed	17:26
markos	This is the original attempt, still with the 4-bit SH	17:28
lkcl	ok great	17:28
markos	pretty sure there are some great misunderstandings on my part here	17:29
markos	ie, I'm not really sure I'm allowed to just write to RT+1	17:29
markos	and now that I see it, it's probably wrong, it probably adds 1 to the value of RT, not the index	17:29
lkcl	no	17:30
lkcl	it's implicit	17:30
lkcl	you write to RS	17:30
markos	ah, what you said earlier	17:30
lkcl	and ISACaller "knows" to pick that second (implicit) operand up and... yes	17:30
markos	yeah, I need to read about that	17:30
markos	so it's possible then to write to 2 GPRs	17:31
lkcl	have a look at the biginteger page	17:31
markos	nice to know	17:31
lkcl	which contains the kind of spec-wording	17:31
markos	will do	17:31
lkcl	yes but we will get push-back for doing so	17:31
lkcl	because it's what CISC x86 does	17:31
lkcl	so there is a lot of "push-back" going to occur on these instructions, hence why if "and we want 8 Primary Opcodes" is part of that, the ISA WG will just flat-out say "no"	17:32
lkcl	prod1 <- MUL(RC, sum)	17:32
lkcl	can just be	17:32
lkcl	RC * sum	17:33
lkcl	just like in fixedarith	17:33
lkcl	let me check...	17:33
lkcl	ah nope, you're right	17:33
lkcl	# Multiply Low Immediate	17:33
lkcl	prod[0:(XLEN*2)-1] <- MULS((RA), EXTS(SI))	17:33
lkcl	watch out for this:	17:34
lkcl	RT <- prod[XLEN:(XLEN*2)-1]	17:34
lkcl	the result of MUL and MULS is DOUBLE the bitwidth	17:34
lkcl	(sum of the length of the two operands)	17:34
markos	right, ofc	17:34
lkcl	and consequently you have to "pick a half"	17:34
lkcl	but of course, you "pick a half in MSB0 numbering"... sigh	17:34
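A small sketch of the "pick a half in MSB0 numbering" point: in MSB0 numbering bit 0 is the most significant bit, so `prod[XLEN:(XLEN*2)-1]` is the low half of the product. The helper `msb0_slice` is hypothetical, and unsigned multiplication is used here for simplicity, whereas the quoted pseudocode uses the signed MULS.

```python
XLEN = 64


def msb0_slice(value, width, start, end):
    """Bits start..end inclusive of a width-bit value, where bit 0 is the MSB."""
    nbits = end - start + 1
    shift = width - 1 - end  # distance of bit `end` from the least significant bit
    return (value >> shift) & ((1 << nbits) - 1)


a = 0xFFFF_FFFF_FFFF_FFFF
b = 2
prod = a * b  # up to 2*XLEN bits wide

hi = msb0_slice(prod, 2 * XLEN, 0, XLEN - 1)         # prod[0:XLEN-1]
lo = msb0_slice(prod, 2 * XLEN, XLEN, 2 * XLEN - 1)  # prod[XLEN:(XLEN*2)-1]

print(hex(hi), hex(lo))
```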
markos	hm, the arm instructions return the high half	17:35
markos	we could add 2 pairs	17:35
lkcl	for accuracy	17:36
markos	one returning the high half and another the low	17:36
lkcl	absolutely no chance of that	17:36
markos	without the shifting bit :)	17:36
lkcl	there's an internal hardware limit we've set of 3-in 2-out	17:36
lkcl	@ 64-bit width	17:36
lkcl	and that's down to the massive complexity that results from doing Register Hazard checking	17:37
lkcl	the only reason we get away with hi-lo-half in the bigint operations is because they're actually a carry-in carry-out chain	17:37
markos	right	17:37
lkcl	so for the internal chain the instructions actually become 2-in 1-out, the first one in the chain is 3-in 1-out, and the last one in the chain is 2-in 2-out	17:38
lkcl	which is the only reason we can get away with such ultra-expensive instructions, that and they'll end up in libgmp	17:38
markos	similarly, these will go in pretty much all video/audio codecs	17:39
*** tplaten <tplaten!~tplaten@195.52.20.159> has joined #libre-soc		17:39
lkcl	btw no need to put the autogenerated code in the wiki	17:40
lkcl	exactly	17:40
lkcl	like... aaaalll of them	17:40
markos	though, for that reason we could avoid the shifting entirely	17:40
markos	I mean as an operand	17:40
lkcl	which we can easily "fly" on the "IoT / Edge / accelerator" thing	17:40
lkcl	yes pleeease	17:40
markos	only reason I'd want it is for future	17:41
markos	in case a future codec decides to change the number of shift bits	17:41
lkcl	it's too much for me to have to explain, and stake the entire reputation of what we're doing on having the instructions be rejected	17:41
markos	though that's unlikely	17:41
markos	we're good until 2030	17:41
markos	av1/av2/etc	17:41
markos	:D	17:41
lkcl	ahh if there's specific CODECs that use these instructions explicitly please do list them	17:41
lkcl	that again gives me information i can present in ISA WG meetings, "these are common CODECs, actual implementations, the actual spec says DoThisThing()"	17:42
markos	well, these fdct are all libvpx/av1	17:43
markos	and av2	17:43
lkcl	minor_59... what's that supposed to be used for...	17:43
lkcl	_great_!	17:44
lkcl	do put it into the page	17:44
lkcl	every instruction needs a "Rationale"	17:44
lkcl	i.e.	17:44
lkcl	"why as IBM should we invest $50-100 million implementing these instructions"	17:44
lkcl	or {insert-N-E-Other-Power-ISA-Implementor}	17:44
lkcl	opcode 59 is typically stuffed with FP-single	17:46
markos	I just picked 59 randomly :)	17:46
lkcl	yyeah and likely overwrote some official instructions in the process!	17:47
lkcl	extreme care needs to be taken here, it's a frickin lot of work	17:48
lkcl	i'm looking at the tables here https://libre-soc.org/openpower/sv/bitmanip/	17:48
lkcl	how many of these instructions are needed?	17:48
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has joined #libre-soc		17:48
lkcl	one i know is needed for the inner product, and another for the outer product	17:48
lkcl	so that's at least 2	17:49
lkcl	then iirc you have to use different ones for iDCT than from DCT, so that's 8	17:49
lkcl	sorry, 4. then FFT needs the same treatment, that's 8	17:49
lkcl	fortunately though i think the outer-butterfly is just a twin add-subtract - specified as a 2-in 1-out but having an implicit RS	17:51
lkcl	https://libre-soc.org/openpower/isa/svfparith/	17:51
markos	added some rationale, mention of the Arm instructions	17:51
lkcl	awesome	17:51
lkcl	btw the DCT subsystem needs both the inner-butterfly and the outer-butterfly instructions	17:52
lkcl	that's why there's 2 separate uses of svremap in the unit tests. first use does the inner butterfly (the twin-madd)	17:53
markos	well, I'd suggest 2 pairs of instructions	17:53
lkcl	second use of svremap does the outer butterfly (which is i believe just an add-sub)	17:53
markos	from what I see in libvpx though, both fdct and idct use the same kind of instructions	17:54
lkcl	haang on... DCT just uses fadds. ha!	17:55
lkcl	https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD#l594	17:55
markos	and arm ports also use those special instructions, but again these are of limited precision	17:55
lkcl	so fortunately the outer-butterfly is just "an add"	17:55
lkcl	well, the limited precision occurs when you specify an elwidth	17:55
lkcl	which will be where the biggest efficiency savings come from	17:56
lkcl	so, actually (whew) - at least for DCT - only two twin-mul-and-accumulate-and-shift instructions	17:56
markos	well in our case, it would help to be able to do the calculations in a larger width and then just scale/narrow down	17:56
markos	es	17:57
markos	yes	17:57
lkcl	whiiich... means... they can just about fit into opcode 22	17:57
lkcl	there's an area	17:57
lkcl	https://libre-soc.org/openpower/sv/bitmanip/	17:57
markos	Arm is full of many versions of these functions because they're fast but not accurate enough	17:57
lkcl	\| NN \| RT \| RA \| it/im57 \| im0-4 \| 0 00 00 \|0 \| xpermi \| TODO-Form \|	17:58
lkcl	\| NN \| \| \| \| \| - -- 00 \|0 \| rsvd \| rsvd \|	17:58
markos	23 helper functions to do basically the same thing	17:58
lkcl	yowser	17:58
lkcl	ok so see that entry just below xpermi?	17:58
markos	rsvd?	17:59
lkcl	as long as 26-28 are not zero, that's "free encoding space"	17:59
lkcl	you get one bit for a shift, there	17:59
lkcl	let me edit it...	17:59
markos	haha, I'll take it	18:00
lkcl	ahh... where the heck's the page... it's in a separate-include...	18:00
lkcl	ah. draft_opcode_tables	18:00
lkcl	ok what's the instruction names?	18:00
lkcl	one is maddsubrs	18:01
markos	I proposed maddsubrs, but open to suggestions	18:01
lkcl	ahh "s" is usually reserved for "FP single"... are there any other instructions ending in "s" in the fixed-point set?	18:02
lkcl	maddsubrs it is for now	18:02
markos	this one does both add and sub	18:02
markos	assuming I can write to RT and RT+1	18:02
markos	or RT and RS	18:02
lkcl	RT and implicit-RS.	18:04
lkcl	https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=5b0a082545185799b7bf053374aa3b60117ef74b	18:04
lkcl	ok so that's your allocation for the instruction	18:04
lkcl	it'll need to go into minor_22.csv	18:04
lkcl	(not minor_59.csv)	18:04
lkcl	and you want a (sigh) XO length i think of 11...	18:05
lkcl	gimme a sec...	18:05
lkcl	see insndb.csv	18:05
lkcl	https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isatables/insndb.csv;hb=HEAD	18:06
lkcl	7 minor_22.csv,22,21:31,NONE,pattern,normal	18:06
lkcl	21-31... yes, 11 bits	18:06
lkcl	okaaay	18:06
lkcl	so _now_ you can "interpret" the contents of minor_22.csv, every single "pattern" has to be 11 bit in length...	18:06
markos	11?	18:07
lkcl	the 1st column	18:07
lkcl	-----01011-,ALU,OP_FISHMV	18:07
lkcl	example.	18:07
lkcl	count the total "-" "0" and "1"s	18:07
lkcl	comes to 11	18:07
lkcl	representing bits 21 thru 31 inclusive	18:07
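A minimal sketch of the width rule just described, assuming only the insndb.csv row and the OP_FISHMV pattern quoted above. The helper names `field_width` and `pattern_fits` are hypothetical and are not part of openpower-isa.

```python
def field_width(bit_range):
    """Width of an inclusive 'start:end' bit range such as '21:31'."""
    start, end = (int(part) for part in bit_range.split(":"))
    return end - start + 1


def pattern_fits(pattern, bit_range):
    """True if the pattern has exactly one of '-', '0', '1' per bit in the range."""
    return len(pattern) == field_width(bit_range) and set(pattern) <= set("-01")


# Row quoted in the chat from insndb.csv (third column is the bit range).
row = "minor_22.csv,22,21:31,NONE,pattern,normal".split(",")
bit_range = row[2]

print(field_width(bit_range))                   # bits 21 through 31 inclusive
print(pattern_fits("-----01011-", bit_range))   # the OP_FISHMV pattern
```

A check like this could be run over every first-column entry of a CSV before committing, since a pattern of the wrong length would not describe bits 21 through 31.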
lkcl	sooo... with the new allocation	18:08
markos	but I have 4 operands, RT, RA, RB, RC, which are 6:24	18:08
lkcl	look at the diff	18:08
lkcl	diff --git a/openpower/sv/draft_opcode_tables.mdwn b/openpower/sv/draft_opcode_tables.mdwn	18:08
lkcl	\| 0.5\|6.10\|11.15\|16.20 \|21..25 \| 26....30 \|31\| name \| Form \|	18:08
lkcl	+\| NN \| RT \| RA \| RB \| RC \| sh 01 00 \|0 \| maddsubrs \| BF-Form \|	18:08
lkcl	RT RA RB and RC are all allocated to 6:24	18:09
markos	aaaaaaah	18:09
lkcl	but column one of each csv file is allocated to XO identification	18:09
lkcl	you will also need to add entries further down in fields.text which tell power_decoder.py where those RT RA RB and RC are, for BF-Form	18:10
lkcl	https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isatables/fields.text;h=b0f91cae74f2dec822138b97c0286d6b6cda76f8;hb=d40763cd6e186ad9b17ce6f974a38b4c4877965e#l799	18:10
lkcl	799 RT (6:10)	18:10
lkcl	800 Field used to specify a GPR to be used as a target.	18:11
lkcl	801 Formats: A, BM2, D, DQE, DS, DX, MM, VA, VA2, VX, X, XFX, XO, XX2, SVL, XB, TLI, Z23	18:11
lkcl	aaaand now...	18:11
lkcl	....	18:11
lkcl	....	18:11
lkcl	BF	18:11
lkcl	likewise for RA	18:11
lkcl	747 RA (11:15)	18:11
lkcl	748 Field used to specify a GPR to be used as a	18:11
lkcl	749 source or as a target.	18:11
markos	ok, thanks for your patience	18:11
markos	I'll get it eventually	18:11
lkcl	750 Formats: ...... .... BF	18:11
lkcl	it's all in the (various, numerous) diffs	18:11
lkcl	normally it would be straightforward, just look at one already done, but the extra complication is the implicit arguments	18:12
lkcl	so	18:12
lkcl	let me find git link for minor_22.csv	18:12
lkcl	https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isatables/minor_22.csv;h=7cb4785af2ff915acf4c724d72709a470e2c6a48;hb=d40763cd6e186ad9b17ce6f974a38b4c4877965e#l40	18:13
lkcl	so line 43	18:13
lkcl	let's take say line 39 - OP_CPROP	18:13
lkcl	that has	18:13
lkcl	39 0110001110-,ALU,OP_CPROP,R	18:13
lkcl	so that means, that for power_decoder.py to "match"	18:13
lkcl	bit 21 must be 0	18:13
lkcl	bit 22 must be 1	18:13
lkcl	bit 23 must be 1	18:13
lkcl	bit 24 must be 0	18:13
lkcl	...	18:13
lkcl	bit 30 must be 0	18:14
lkcl	and bit 31 we DON'T CARE	18:14
lkcl	(because "-")	18:14
lkcl	so, "translating" the allocation 26:30 from the new allocation	18:14
lkcl	21-25 is right smack in the middle of RC, therefore must be "don't care"	18:14
lkcl	bit 26 is "sh" so that must be "don't care" as well	18:15
lkcl	and bits 27-31 must be "01000"	18:15
lkcl	so!	18:15
lkcl	we have the entry!	18:15
lkcl	and it is...	18:15
lkcl	------01000	18:15
lkcl	ta-daaa	18:15
markos	:)	18:15
lkcl	that's the entry to go into minor_22.csv at line... 43.	18:16
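The matching rule walked through above can be sketched as follows. This is an illustration only, not the power_decoder.py implementation: `bits_msb0`, `matches` and `encode_maddsubrs` are hypothetical names, and the encoder simply places the fields where the draft table row puts them (RT 6:10, RA 11:15, RB 16:20, RC 21:25, sh 26, fixed bits 27:31).

```python
def bits_msb0(insn, start, end, width=32):
    """Extract bits start..end inclusive, Power ISA numbering (bit 0 is the MSB)."""
    length = end - start + 1
    return (insn >> (width - 1 - end)) & ((1 << length) - 1)


def matches(pattern, insn, start=21):
    """Compare every '0'/'1' in the pattern with the instruction bit; '-' is don't-care."""
    for offset, want in enumerate(pattern):
        if want == "-":
            continue
        bit = start + offset
        if bits_msb0(insn, bit, bit) != int(want):
            return False
    return True


# RC occupies 21-25 and sh occupies 26, so both are don't-care; 27-31 are fixed.
MADDSUBRS_PATTERN = "-" * 5 + "-" + "01000"


def encode_maddsubrs(rt, ra, rb, rc, sh, major=22):
    """Hypothetical encoder following the draft allocation discussed in the chat."""
    return ((major << 26) | (rt << 21) | (ra << 16) | (rb << 11)
            | (rc << 6) | (sh << 5) | 0b01000)


insn = encode_maddsubrs(rt=1, ra=2, rb=3, rc=4, sh=1)
print(MADDSUBRS_PATTERN)
print(matches(MADDSUBRS_PATTERN, insn))
print(matches("0110001110-", insn))   # the OP_CPROP pattern from line 39
```

Because RC and sh are don't-care in the pattern, any register number or shift bit still matches, while the OP_CPROP pattern does not overlap with it.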
lkcl	every single frickin instruction has to go through this process, sigh	18:16
markos	I'll add the entry there	18:16
lkcl	awesome	18:17
lkcl	holy hell barometric pressure change	18:17
lkcl	unbelievably painful even with 4 aspirin and 2 paracetamol	18:18
markos	get some rest	18:21
lkcl	not going to help - weather's changing constantly today	18:33
lkcl	apparently this is a well-known phenomenon in japan	18:33
lkcl	but very much less-recognised in europe / us.	18:33
lkcl	i can feel my ears popping constantly (like in an airplane) hence i know the pressure change is happening	18:33
programmerjake	luke, iirc you removed iterate_indices2 and copied the section to iterate_indices, did you ever push that?	18:33
programmerjake	hope you feel better	18:34
lkcl	no i didn't, i simply called the alternate function if submode=0b10/11	18:34
lkcl	been a wild ride today	18:34
markos	missing something still: this file (I guess autogenerated) gives me this:	18:36
markos	+maddsubrs,NORMAL,,1P,EXTRA2,NO,d:FRT;d:CR1,s:FRA,s:FRB,s:FRC,RA,RB,RC,RT,0,CR1,0	18:36
markos	why am I getting FR* registers in there?	18:36
markos	maybe it was generated previously	18:40
*** octavius <octavius!~octavius@92.40.169.65.threembb.co.uk> has quit IRC		18:41
programmerjake	run `make`, it replaces those files...	18:41
markos	just did	18:41
markos	still getting the same result	18:41
programmerjake	do you have the right form in the csv?	18:41
markos	ah right	18:42
markos	thanks	18:42
markos	weird, still getting the same	18:44
markos	I'm going to commit in a branch	18:45
programmerjake	it's probably going to the wrong case in sv_analysis.py, e.g. when I added pcdec I had to add a case: https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/sv/sv_analysis.py;h=8b89212fd736d65a383ded16f2b770966efe9cb5;hb=HEAD#l605	18:48
programmerjake	regs comes from https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/sv/sv_analysis.py;h=8b89212fd736d65a383ded16f2b770966efe9cb5;hb=HEAD#l363	18:51
ghostmansd	markos, check these lines: https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/sv/sv_analysis.py;h=8b89212fd736d65a383ded16f2b770966efe9cb5;hb=HEAD#l378	18:54
ghostmansd	all fields for CSVs are generated here	18:55
programmerjake	oh, i think i spotted your issue, do you have it writing CR1 instead of CR0?	18:55
programmerjake	since for Rc=1, fp ops write CR1, but int ops write CR0	18:58
lkcl	it'll be down to what's in sv_analysis.py	19:43
lkcl	but you don't want to be worrying about sv right now	19:44
lkcl	because this is a scalar instruction	19:44
lkcl	but just so you know, look at sv_analysis RM-1P-3S1D section	19:45
lkcl	elif value == 'RM-1P-3S1D':	19:45
lkcl	it's a previously-unrecognised pattern	19:45
lkcl	and the fallback is "fmadd*"	19:45
lkcl	i need to know the "key" pattern	19:46
lkcl	regs == [somethingsomething]	19:46
lkcl	1111011111,ALU,OP_MADDSUBRS,RA,RB,RC,RT,NONE,CR1,0	19:47
lkcl	ah yes, you put CR1, just like jacob said	19:47
lkcl	make that CR0	19:47
lkcl	and it should then match on	19:47
lkcl	elif regs == ['RA', 'RB', 'RC', 'RT', '', 'CR0']: # pcdec	19:47
lkcl	which will activate this	19:47
lkcl	res['0'] = 'd:RT;d:CR0' # RT,CR0: Rdest1_EXTRA2	19:48
lkcl	res['1'] = 's:RA' # RA: Rsrc1_EXTRA2	19:48
lkcl	res['2'] = 's:RB' # RT: Rsrc2_EXTRA2	19:48
lkcl	res['3'] = 's:RC' # RT: Rsrc3_EXTRA2	19:48
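A toy model of why CR1 versus CR0 changed the generated row, assuming only what is quoted in the chat. The real logic is in sv_analysis.py and is much larger; the function name and the exact shape of the `regs` list for the CR1 case are assumptions made for illustration.

```python
# Toy model only, not the sv_analysis.py code.
INT_3S1D = ['RA', 'RB', 'RC', 'RT', '', 'CR0']   # the key quoted for pcdec


def extra2_for_rm_1p_3s1d(regs):
    """Return the EXTRA2 column assignments for an RM-1P-3S1D instruction."""
    if regs == INT_3S1D:
        # Integer case, as in the res[...] lines quoted above.
        return {'0': 'd:RT;d:CR0', '1': 's:RA', '2': 's:RB', '3': 's:RC'}
    # Unrecognised key: fall back to the fmadd*-style FP assignment,
    # which is where the FR* names in the generated row came from.
    return {'0': 'd:FRT;d:CR1', '1': 's:FRA', '2': 's:FRB', '3': 's:FRC'}


with_cr1 = ['RA', 'RB', 'RC', 'RT', '', 'CR1']   # assumed key for the CR1 row
print(extra2_for_rm_1p_3s1d(with_cr1)['0'])
print(extra2_for_rm_1p_3s1d(INT_3S1D)['0'])
```

The point of the sketch is only that the lookup is an exact list comparison, so a single differing field such as CR1 sends an integer instruction down the FP fallback.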
programmerjake	other issues I spotted, the pseudocode uses rotate left instead of shift right...it'll give the wrong results	19:49
lkcl	re-run sv_analysis.py	19:50
lkcl	i also removed Rc=1 from BF-Form, and fitted it to what went into the bitmanip-opcode-22 table	19:52
lkcl	https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=fa1f3c9e3dce8fdb62b47a34b6c9203293b94026	19:52
lkcl	sorry! Rc=1 effectively doubles the number of instructions, which we can't really afford to do	19:53
programmerjake	another issue: that's actually a 5-in 2-out op...since it both reads and writes RT and RS	19:53
lkcl	ermmm... ermermerm...	19:53
lkcl	yep that's not going to work	19:53
lkcl	so RA has to be the source-of-where-the-accumulating-happens	19:56
lkcl	which happens to be exactly the same register as RT	19:56
programmerjake	idea: put the pair of coefficients and accumulated sums each in 1 reg with each value being the lower/upper half of a reg...this should reduce input/output regs to 4-in 1-out	20:06
programmerjake	idk if that'll fit the DCT pattern tho	20:07
programmerjake	this is kinda like how cdtbcd works where the upper and lower halves are independent	20:08
programmerjake	e.g. RT <- ((RT)[0:XLEN/2-1] + prod0) \|\| ((RT)[XLEN/2:XLEN-1] + prod1)	20:10
programmerjake	that way if you set elwid=32 you get 2 16-bit results	20:11
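The packed-halves idea in the pseudocode line above can be modelled in a few lines. This is a sketch under assumptions: `packed_accumulate` is a hypothetical name, `prod0` and `prod1` stand for the two products, and each half is assumed to wrap independently at XLEN/2 bits. In Power ISA numbering bits 0 to XLEN/2-1 are the upper half.

```python
def packed_accumulate(rt, prod0, prod1, xlen=64):
    """Add prod0 to the upper half of rt and prod1 to the lower half.

    Each half wraps on its own, so a carry out of the lower half
    never reaches the upper half.
    """
    half = xlen // 2
    mask = (1 << half) - 1
    upper = ((rt >> half) + prod0) & mask
    lower = ((rt & mask) + prod1) & mask
    return (upper << half) | lower


print(hex(packed_accumulate(0x0000000100000002, 3, 4)))
print(hex(packed_accumulate(0x00010002, 1, 1, xlen=32)))   # two 16-bit results
```

With `xlen=32` the same function produces two independent 16-bit results, which is the elwid=32 case mentioned in the chat.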
markos	sorry had to be afk for a while to pick up my son	20:18
markos	right, CR0 was the reason	20:19
markos	thanks a lot!	20:19
markos	programmerjake, yeah, it's far from perfect right now, and probably incorrect	20:19
markos	but the half-register coefficient is a good idea, I was actually thinking about it for the results	20:23
markos	ie, high RT -> add, low RT -> sub	20:23
markos	lkcl, I saw you removed the accumulate, is there no way to keep the accumulate there?	20:30
lkcl	markos, RA-when-set-to-the-same-register-as-RT is the accumulator	20:33
lkcl	that's the way it works	20:33
lkcl	and no, it will not be ok to do split-use of registers	20:34
lkcl	how would it ever then be possible to do 64-bit DCT?	20:34
lkcl	last thing we need is to fall onto a SIMD-within-a-Register paradigm	20:35
markos	hm, hm, RA == RT only makes sense if we do in-place DCT	20:37
markos	and actually it kind of forces us that way	20:37
lkcl	the DCT Schedules are specifically designed for precisely and exactly that	20:42
lkcl	this is a world-first	20:42
lkcl	the only reason it is possible at all is because the elements are loaded and then traversed in a hybrid bit-reversed and gray-coding pattern	20:43
lkcl	such that	20:43
lkcl	when "unravelling" layer by layer, each layer is not destructively overwritten when doing the 3210 0123 schedule	20:44
lkcl	because it's already been loaded such that it becomes a 0123 0123 schedule for that exact moment in the schedule	20:44
lkcl	and consequently you can do in-place	20:45
lkcl	all standard SIMD algorithms need double the registers	20:45
lkcl	because they try to do 0123 3210 and half-way through that they destroy the data	20:45
lkcl	markos, you'll need i think to experiment by running remap_dct_yield.py	20:49
lkcl	and see what it does.	20:49
lkcl	you'll find that - like Indexed REMAP but without the GPRs - it generates "prerequisite offsets"	20:49
lkcl	that you must drop on top of a fully in-place instruction	20:50
lkcl	in this case it will be maddsubrs 0, 0, 16, 0	20:50
lkcl	where *16 equals the coefficients	20:50
lkcl	sorry	20:50
lkcl	maddsubrs 0, 0, 0, 16	20:50
lkcl	and the schedule system will add on the required offsets to RT, RA, RB and RC for you	20:51
lkcl	to make the entire triple loop	20:51
lkcl	it's liiitttteralllly three (quantity 3of) instructions	20:51
lkcl	svshape, svremap, sv.maddsubrs.	20:52
lkcl	bdang.	20:52
lkcl	done.	20:52
markos	you're right, I was thinking that we might need to reuse the coeffs, but if we can do the whole thing in one go, all the better and we don't need to reuse	20:52
lkcl	even the coefficients are established in a set order that makes them useable as a vector	20:52
lkcl	and	20:52
lkcl	guess what?	20:52
lkcl	the "coefficient-offseting" Schedule (REMAP SVSHAPE3) is set up precisely and exactly to give you the exact required coefficient	20:53
lkcl	at the exact and precise required time	20:53
lkcl	it's extremely elegant, sophisticated, and overwhelmingly-confusingly-straightforward	20:53
lkcl	compared to the absolute hell normally subjected onto programmers	20:53
markos	well, you're right, I'll have to play quite a bit with the dct_yield example, in fact I might copy it to work on the maddsubrs	20:54
*** jn <jn!~quassel@user/jn/x-3390946> has quit IRC		20:54
markos	wait, in that case, I don't need 3-in	20:54
markos	or rather I don't need a separate RT register	20:54
markos	because they are the same	20:54
lkcl	you should be looking to do nothing else other than to copy the way that the FP DCT works	20:55
lkcl	unless there is a really compelling reason to do otherwise	20:55
lkcl	such that you should literally be able to cut/paste the fp dct test examples	20:55
*** jn <jn!~quassel@95.223.44.193> has joined #libre-soc		20:56
*** jn <jn!~quassel@95.223.44.193> has quit IRC		20:56
*** jn <jn!~quassel@user/jn/x-3390946> has joined #libre-soc		20:56
lkcl	replace sv.ffmmads (whatever) with sv.maddsubrs	20:56
markos	what I still haven't figured out	20:56
lkcl	and "It Should Just Work(tm)"	20:56
lkcl	run the tests. and the yield program. and the associated project nayuki dct tests.	20:56
markos	I can't understand where the implicit RS is defined, I mean how does it know where to place the result?	20:56
lkcl	we went over that: that's in power_decoder2.py	20:57
lkcl	search for the word "implicit_rs"	20:57
markos	ah yes, you did say that, sorry	20:57
markos	ok, will continue playing with this	20:57
lkcl	now you'll need a section "with mIf((major==22) & so.matches("------01100")	20:57
lkcl	i'll do that bit	20:58
lkcl	i'll sort it now	20:58
markos	thanks	21:00
lkcl	done	21:01
markos	if RA == RT, can I skip one in the declaration?	21:05
markos	trying to see if I can still shave off some bits for shifting :-)	21:06
lkcl	mmm... maaaybe. maybe not. when using Vertical-First Mode you need to be able to specify some registers as scalar, some as vector	21:14
lkcl	and if they don't exist, you can't do that	21:14
lkcl	Vertical-First Mode would be useful for being able to utilise the Schedule but to run more than one instruction, just like in chacha20	21:14
lkcl	in this case, you could detect "was there an overflow"	21:14
lkcl	and flip to higher bit-width without leaving the Schedule Arrangement	21:15
lkcl	just branch to a different area within the loop	21:15
lkcl	you could even go "oop, by Layer 3 or greater we know we are going to run out of bit-accuracy in 16-bit therefore let's start using 32-bit for Layer 3 4 and 5"	21:16
lkcl	all sorts of weird stuff	21:16
lkcl	but if you don't have control over the operands it's going to be much more challenging	21:16
lkcl	plus, if you ever need to use this in a scalar context, what should RA, RB and RT be?	21:17
lkcl	if you really feel that an overwrite is ok in all circumstances, then yes we can explore that	21:18
lkcl	and it will be ok to do precisely because butterfly will have two input operands "in-flight"	21:18
lkcl	(like compare-and-swap)	21:18
programmerjake	in case anyone was wondering, my build server crashed or something since I found it powered off rn, should be up and working now	22:17
markos	lkcl, well, scalar mode is not really the use case here in point, I mean sure one can use it then also, but it doesn't really mean much	22:22
markos	but if it makes a huge difference in in-place DCT applications, and there is no other way, then yes I would be willing to consider it	22:23
markos	again the point is to manage to save some bits for shifting	22:24
markos	eg if instead of maddsubrs RT, RA, RB, RC, SH (=1-bit for shifting), we manage to do the same with maddsubrs RA, RB, RC, SH (4-bits, give back one bit to XO), that makes a huge difference and a very powerful instruction that is future proof for other DCT implementations	22:25
markos	if we leave the shifting out entirely, then it's just a couple of madds	22:26
markos	which sure it can save some instructions but it won't make that much of a difference	22:26
markos	let me give you some examples	22:26
programmerjake	rather than RA, RB, RC, we'd probably name them RT, RA, RB	22:28
programmerjake	maddsubrs RT, RA, RB, SH	22:28
markos	https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/arm/fdct_neon.h	22:28
markos	ok	22:29
markos	what would RC be?	22:29
markos	I can understand RA == RT, but in your example how will they be mapped to the (a +/- b) * c	22:30
programmerjake	a = RT, b = RA, c = RB?	22:30
markos	sigh, ofc	22:31
programmerjake	if there's only 3 args, one of them is almost always named RT or RS	22:32
markos	in any case, if you check the file above, there are about 20+ implementations of these butterfly instructions	22:32
markos	and the reason is that the arm "fast" implementations vqrdmulhq_s16/vqrdmulhq_s32 fail to provide full precision	22:32
markos	so for the single-coeff implementation you have this code:	22:33
markos	const int32x4_t a0 = vmull_n_s16(vget_low_s16(a), constant);	22:33
markos	const int32x4_t a1 = vmull_n_s16(vget_high_s16(a), constant);	22:33
markos	const int32x4_t sum0 = vmlal_n_s16(a0, vget_low_s16(b), constant);	22:33
markos	const int32x4_t sum1 = vmlal_n_s16(a1, vget_high_s16(b), constant);	22:33
markos	const int32x4_t diff0 = vmlsl_n_s16(a0, vget_low_s16(b), constant);	22:33
markos	const int32x4_t diff1 = vmlsl_n_s16(a1, vget_high_s16(b), constant);	22:33
markos	*add_lo = vrshrq_n_s32(sum0, DCT_CONST_BITS);	22:33
markos	*add_hi = vrshrq_n_s32(sum1, DCT_CONST_BITS);	22:33
markos	*sub_lo = vrshrq_n_s32(diff0, DCT_CONST_BITS);	22:33
markos	*sub_hi = vrshrq_n_s32(diff1, DCT_CONST_BITS);	22:33
markos	the DCT_CONST_BITS = 14	22:33
markos	for vp8/vp9 and av1	22:33
markos	possibly for av2 as well, and quite likely that applies to other codecs as well	22:34
markos	now what if we have some code that needs another constant for shifting?	22:34
markos	we would have to have another instruction or do what Arm does	22:34
markos	fall-back to less efficient code	22:34
markos	still faster than scalar	22:35
markos	we could do all this code in just a couple of instructions and be future proof, if a) we allow accumulate, b) we allow shifting by an immediate value	22:35
programmerjake	what about putting the constant in a handy SPR? e.g. LR or CTR	22:36
programmerjake	that would be 4-in 2-out then	22:36
markos	can we do that?	22:36
markos	what are the drawbacks vs a normal GPR?	22:37
programmerjake	maybe?	22:37
programmerjake	a normal gpr needs an argument	22:37
programmerjake	a spr needs to be not otherwise used or saved/restored	22:37
markos	problem is that it's not just a single constant for a DCT	22:38
markos	it's essentially a bunch of cospi fractions	22:38
programmerjake	so, hence why I was suggesting LR since we'll probably want CTR for looping	22:38
programmerjake	not c, sh in the spr	22:38
markos	cospi(20/64), cospi(12/64), etc is a pair for the 2-coeff	22:38
markos	aaaa	22:38
markos	sorry	22:38
markos	yes, that would work	22:39
markos	sorry it's late	22:39
markos	because that would remain totally constant throughout the whole code bae	22:39
markos	base	22:39
markos	yes indeed	22:39
markos	is it possible that LR is used for something else in the DCT loop?	22:40
programmerjake	other than return address which can easily be stored on stack or in a spare gpr, no	22:40
markos	if lkcl agrees, that solves a problem	22:40
markos	how would I read the value from LR to use as a shift value?	22:41
programmerjake	yeah, just icr if 4-in is too much...	22:41
programmerjake	uuh, just write `blah >> LR`?	22:41
programmerjake	LR[58:63]	22:42
markos	I mean it's directly accessible and I don't have to use a special instruction within the pseudocode	22:42
markos	ok, thanks	22:42
markos	in that case	22:43
markos	we don't even have to force RA=RT	22:43
markos	we can keep the previous syntax and just use A-Form?	22:43
programmerjake	https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isa/branch.mdwn;hb=f8e2c0cb1467391aa7ae4b8b092c281ee2e16a7b#l75	22:43
markos	ok	22:44
lkcl	ya know what? an overwrite would i think work fine	22:44
programmerjake	if you're not reading RT, sure. if you are reading RT too you have too many inputs	22:44
lkcl	markos, no don't do that. it requires LR as an operand into the Dependency Matrices	22:45
lkcl	which will cause absolute mayhem	22:45
markos	right	22:45
markos	ok, then	22:46
lkcl	register files have to be kept separate, otherwise the Dependency Management becomes hell	22:46
lkcl	basically think of a matrix, with every register known on both the rows and the columns	22:46
lkcl	any time you add an extra dependency, you end up with the entire row having to have a DM Cell for that register	22:46
lkcl	just in case you ever executed an instruction that read LR just after one that wrote it	22:47
lkcl	if you can keep GPR-GPR-GPR then the Matrix becomes "sparse" and you can miss out the majority of entire rows of Dependencies	22:47
lkcl	CTR is definitely allocated to counting, it's even implementable as special Architectural State	22:48
lkcl	rather than an actual "register" per se	22:48
programmerjake	well, if it can match register usage of some pre-existing op, then LR could be used, e.g. if your op uses the same registers as a branch	22:49
lkcl	i need to experiment to see if ffmadds can be reduced by one operand	22:49
*** ghostmansd <ghostmansd!~ghostmans@5.32.74.194> has quit IRC		22:49
programmerjake	since then it can share the dependency matrixes used for branch ops	22:49
programmerjake	LR is the other spr that is likely treated specially	22:50
programmerjake	oh, idea, mush it into the register profile of the GF(p) fft op	22:53
programmerjake	since that reads a spr	22:54
programmerjake	gfpmaddsubr	22:55
programmerjake	it reads the GFPRIME spr	22:56
programmerjake	though otoh that probably would have special state associated with it making writing it much more expensive	22:57
programmerjake	oh, luke, all the [[!inline]] pseudo-code from nmigen-gf.git has disappeared on the wiki: https://libre-soc.org/openpower/sv/bitmanip/#index14h1	22:59
lkcl	sigh that's an underlay	22:59
lkcl	no idea	22:59
lkcl	not going to look at it now	23:00
*** gnucode <gnucode!~gnucode@user/jab> has joined #libre-soc		23:57
lkcl	frickineeeelll	23:57
lkcl	never had difficulty with operands before, sigh	23:58
lkcl	okaaay about time	23:59
