Sunday, 2022-04-10

programmerjakelkcl: ascii art -- thx!08:06
programmerjakePCB manufacturing nightmare, lkcl can probably relate...
programmerjakeback to ascii art, i was thinking about just outputting svg instead of ascii art, but thought ascii art would be easier to understand from a command line and/or in the code editor08:10
lkclyes definitely. and if put into the docstrings, sphinx plugins can convert it to images automatically. really neat09:29
josuahhello! thanks to lkcl for pointing me to this place14:11
josuahso, OpenPOWER is a thing! is that really an open ISA like RISC-V is?14:11
josuahwhat is the advantage over RISC-V? better coverage of performance features (which seems to be one of the goals of libre-soc)?14:12
lkcljosuah, welcome14:13
lkclyes it is. 1 sec let me find the link14:13
lkclyes performance, but also proper patent indemnification14:14
lkclthis is the best technically-independent link i can find which explains how RISC-V simply isn't up to the job:
lkclit's perfect for *embedded* purposes though14:15
lkclTrinamic was one of the first companies to use it in a commercial product, to save themselves a fortune in ARM Licensing costs: the absolutely superb TMC2660 Stepper14:15
josuahI am more tempted to pursue with embedded, but also very curious about how open hardware can reach these high-perf use-cases14:16
lkclthe only problem being, they exposed themselves to patent litigation in the process, because RISC-V's Members simply aren't old enough to have a decent patent portfolio14:16
lkcladrian_b's post explains it really well14:17
josuah> inefficient because it requires more instructions to do the same work as other ISAs14:17
lkclbut it's important to note that that discussion was sparked by the Alibaba Group releasing a paper about their high-performance RISC-V core14:17
josuahit seems to meet RISC-V goals of keeping the design as simple as possible14:17
lkclthe problem is that they had to add a staggering *50%* additional "rogue" custom instructions in order to (just) exceed the performance of an ARM Cortex A7314:18
lkclunfortunately, as both adrian_b's post and the original Alibaba Group paper make clear, they've over-simplified14:18
lkclto compensate for that oversimplification, the burden is on the hardware architect to make fantastically-complex hardware14:19
lkclmulti-issue out-of-order superscalar designs, with full register renaming14:19
lkclidentification of sequences of instructions and fusing them into internal micro-coded CISC ALUs14:20
lkcland much more14:20
lkclthese are extremely costly and complex to implement14:20
josuahA Trinamic stepper motor controller using OpenPOWER or Risc-V?14:20
lkcl1 sec14:20
josuah> "rogue" custom instructions [...] performance of an ARM Cortex A7314:21
josuahtaking something and stretching as hard as possible for far-reached goals might be sub-par14:21
lkclthey were successful, but they had to go as far as making modifications to gcc to do it14:22
josuahand better start with something targeted at the main goal right away indeed14:22
lkcland because those gcc modifications were using rogue custom instructions, there's no way it could be accepted "upstream"14:23
lkclyes, exactly.14:23
lkclbut unfortunately, when you first look at the Power ISA, it's a "holy s***" moment14:23
josuahit could result in something simpler in the end14:23
lkcli cannot begin to describe how dismayed i was on realising we had to implement 214 instructions just for the Scalar Fixed-and-Floating Point subset14:24
lkcland a stunning *750* extra ones for Packed SIMD (called VSX)14:24
lkclbut over time - like... 18 months... - it became clear *why* there are 214 instructions14:24
lkcland because of the Microwatt source code, actually it turns out that you only need something like 80-100 "internal micro-coded" instructions14:25
josuahbut I assume it is the same for many designs over the industry14:25
lkclhigh-performance ISAs, yes.14:25
josuahI had the same reflection about interrupts (but unlike people here, I am a naive beginner :P): a lot of interrupts in the sipeed longan nano sounded like bloat14:26
lkclthe China ICT Group who created the Loongson (MIPS64) have a binary-translation mode for x8614:26
josuahbut it might not cost a lot: it is more or less data: keeping things separate, and might not be a huge burden design-wise14:26
lkclinterrupts are fun and actually quite straightforward, give me a sec to complete about Loongson14:27
josuahmy bad, carry on14:27
lkclwhat they found was that when doing JIT binary-translation of x86 Branches into native MIPS64, it required a stunning *ten* instructions14:27
lkcland that's because MIPS64 branches do not use Condition Codes.14:27
lkclx86 does (and so does the Power ISA)14:28
lkclif you don't have Condition Codes, you need to emulate them by doing subtraction, ANDing, ORing and more, and that's where it goes to complete hell14:29
lkclRISC-V's designers made a *really deliberate* decision not to include Condition Codes "because it's too complicated"14:29
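To make the "subtraction, ANDing, ORing" point concrete, here is a minimal sketch of what has to be computed by hand when an ISA has no Condition Codes. It is generic two's-complement arithmetic, not the instruction sequence the Loongson translator actually emits; the function name `compare_flags` and the flag names are hypothetical, chosen only for illustration.

```python
def compare_flags(a, b, bits=64):
    """Return the flags an x86-style `cmp a, b` would set, computed explicitly.

    Hypothetical helper: on an ISA with Condition Codes the compare
    instruction produces all of these as a side effect.
    """
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    a &= mask
    b &= mask
    diff = (a - b) & mask  # the subtraction itself, truncated to `bits`
    return {
        "zero": diff == 0,
        "sign": bool(diff & sign),
        "carry": a < b,  # unsigned borrow out of the subtraction
        # signed overflow: operands differ in sign AND the result's sign
        # differs from the first operand's sign
        "overflow": bool((a ^ b) & (a ^ diff) & sign),
    }


flags_equal = compare_flags(5, 5)
flags_borrow = compare_flags(3, 5)
flags_overflow = compare_flags(1 << 63, 1)
```

Each flag costs at least one extra operation, and a translated branch then still has to test the result, which is one way to see how the instruction count climbs.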
lkclback to interrupts... :)14:29
josuahkeeping numbers low is also a good thing to put on titles of publications ;)14:30
lkclheh, yes. but also, identifying the high numbers is also important, it means "area for improvement" :14:30
josuahbut it looks like getting things complex/simple is more nuanced than just the "number of $x per $y"14:31
lkclit's a multi-dimensional space, nowhere near as black-and-white as $x $y14:31
lkclwith a lead time on discovery of mistakes somewhere in the 5-7 year range14:32
lkclby which time it's too late.14:32
lkclwe seriously lucked out by picking the Power ISA.  it's not perfect - there's no LD/ST-shifted like there is with x86 and ARM14:33
lkclbut at least there's LD-ST-with-update, and Condition Codes, and Carry14:33
lkclbut for us, the most absolutely crucial aspect is IBM's involvement and good sense.  not just the patent indemnification14:34
josuahvery convenient for loading a base address (like one of a peripheral or struct) and picking around (register or fields), but I see there are alternatives14:34
josuahit feels nice to have projects looking toward different directions14:34
lkclbut also the fact that IBM insisted that contributions to the ISA be possible *without* having to join the OpenPOWER Foundation14:34
josuahhaving one ISA trying to go tiny and huge (high-perf) at the same time might not be best14:35
lkclyes, so, for example, you can use the 1st instruction to calculate the base of a struct14:35
josuahnice move: making it easy to contribute is opening the way to contribution14:35
lkcland you save at least one instruction not needing to do an ADD within a hot-loop, because the LD-ST-with-update has already done it14:35
lkclyes, RISC-V is perfect for that kind of "tiny" implementation14:36
lkclwe're really struggling to fit Libre-SOC into low-cost FPGAs because it's (a) 64-bit (b) implements a RADIX MMU (c) has PTEs integrated into the L1 I/D-Caches14:37
lkclmicrowatt has the same problem14:37
lkclwhat's the longan nano?14:38
josuahDoes the Artix-7 count as low-cost? :)14:38
lkclyes :)14:38
josuahlinks at the bottom:
lkclbut you need the 100T version14:39
lkcloo nice14:39
lkcli love the GD32 processors14:39
josuaha STM32F103 clone (GD32F103) clone (GD32VF103)14:39
lkclyes, i encountered them when i was living in Taiwan14:39
lkclhave you heard of libopencm3?14:40
lkclhooray, looks like someone did a port
josuahlibopencm3 is very nice! great work from these folks14:40
josuahI used it a lot to understand and get started14:41
lkclyeah i mean, duh.  if you've ever tried to use ST's own library, you know that it's s*** :)14:41
josuahthat was a lot of insightful information today: a glimpse into ISA design. thank you!14:41
lkclany time14:41
ghostmansdhi folks, that's me again, and, as usual, with some questions :-)17:37
ghostmansdfirst, I'm not sure we have an equivalent to this code:;a=blob;f=src/openpower/sv/trans/
ghostmansdparticularly to extra = svp64_src.get / svp64_dst.get17:39
ghostmansdsecond, I've just discovered that stuff that comes as D(RA) or DS(RA) or whatever that comes in parentheses comes as _two_ operands in binutils17:40
ghostmansdit'd be great if you could check svp64_decode_reg function in svp64 branch of binutils-gdb17:43
ghostmansdand it'd be really amazing if someone could help me with conversion17:45
ghostmansdplease keep in mind that our svp64 record is quite limited: we don't have everything `rm = svp64.instrs[v30b_op]` has17:46
ghostmansdwhat we have for now is;a=blob;f=include/opcode/ppc-svp64.h;h=f93a5f61a69221e4e0955fb81e28b24c6a9f802f;hb=refs/heads/svp6417:47
ghostmansd(I can add new fields, though, if needed)17:47
ghostmansdas an example, `sv.add./m=r3 5.v, 2.v, 1.v'17:49
ghostmansdhere we have the following debug printouts17:49
ghostmansdbut, on binutils side...17:51
ghostmansd...are all we have for now17:52
ghostmansdwith two links above, and considering difference between operands for ld/st (e.g. "D(RA)"/{D, RA} in, what'd be the right and sweet way to make C part work identically to Python?17:53
lkclallo me-again18:21
lkclyyep that i _think_ is what i was talking about, yesterday, with a (small) mapping-table from SVP64-table-entry RA, RB, ...18:24
lkclinto ppc-opc.c RA, RB, ...18:24
lkcland i just added a section in the appendix which describes why it's needed, 1 sec...18:24
lkcl "Extra Field Mapping"18:25
lkclso in effect the input to decode_extra is a dictionary of key-value pairs where you need to *invert* that and make the value the key and the key the value18:28
lkclmade more fun by the fact that some of the entries are shared.18:29
lkclabout D(RA) / DS(RA) - the "D" is an immediate, so what i do is: store that, note it, and use it purely for "reconstruction" purposes, later18:30
lkclit's a horrible hack.18:31
lkcli strongly recommend you *do not* try to merge the decoding of the re-constructed v3.0B suffix into the SVP64 identification18:31
lkclsimply reconstruct the v3.0B suffix in as brain-dead a fashion as possible, and hand it over to the rest of binutils to deal with18:32
lkclin that way you should easily be able to deploy the exact same tricks used at lines 591 and 463, not even caring about whether the immediate (D, DS) is even syntactically valid or not18:33
lkclghostmansd, ok yes;a=blob;f=src/openpower/decoder/;h=ea7f465c9d4f299151e2785b80ab4665f2d87fe9;hb=HEAD#l3318:37
lkclright, that needs some explanation18:37
lkclbasically what it does is, takes the svp64-opc table information,18:39
lkcl  34             .in1 = SVP64_IN1SEL_RA,18:39
lkcl  35             .in2 = SVP64_IN2SEL_RB,18:39
lkcland turns it around into a key-value store where key={REGISTERNAME} and value={EXTRA_INDEX}18:40
lkclit does *two* such key-value stores.18:40
lkcl* one for source registers (anything INSEL)18:41
lkcl* one for dest registers (anything OUTSEL)18:41
lkclit also tells you if there was a CR used as one of the srces, and also tells you if there was a CR used as one of the dest regs18:42
lkclonce you have that EXTRA index (0-3) *then*, ta-daaa, you can (finally) work out which bits in the EXTRA field should be set, based on the instruction format (add RT, RA, RB)18:44
lkcldecode_extra is the "glue" function therefore.18:44
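The "turn it around" step can be sketched roughly as follows. This is not the real decode_extra: the record layout is an assumption modelled on the `.in1`/`.in2`/`sv_in1`/`sv_in2`/`sv_out` fields mentioned in this discussion, the helper name `build_extra_maps` is hypothetical, and the CR-used flags are left out.

```python
# Hypothetical record for `add`, shaped after the fields quoted above.
ADD_RECORD = {
    "in1": "RA", "in2": "RB", "out": "RT",   # which register each selector names
    "sv_in1": 1, "sv_in2": 2, "sv_out": 0,   # which EXTRA index each selector uses
}

SRC_SELECTORS = ("in1", "in2")  # anything INSEL
DST_SELECTORS = ("out",)        # anything OUTSEL


def build_extra_maps(record):
    """Return (src, dst) stores with key=register name, value=EXTRA index."""
    def invert(selectors):
        store = {}
        for sel in selectors:
            reg = record.get(sel)
            idx = record.get("sv_" + sel)
            if reg is not None and idx is not None:
                store[reg] = idx
        return store

    return invert(SRC_SELECTORS), invert(DST_SELECTORS)


src_extra, dst_extra = build_extra_maps(ADD_RECORD)
```

Keeping sources and destinations in separate stores matters because the same register name can appear on both sides of an instruction.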
lkcltake that add. record at line 31 of ppc-svp64-opc.c18:45
lkcllet us take that example sv.add 5.v, 2.v, 1.v18:45
lkcl* first you match RT=5.v, RA=2.v, RB=1.v18:46
lkcl* then you look at the 1st operand, and line 37 says that the "OUT" is named "RT".  so, good so far18:46
lkcl* then you look at line 46, sv_out = SVP64_SVEXTRA_IDX0, and (thanks to decode_extra) you now know that RT (5.v) must go into EXTRA index ZERO (0)18:47
lkcl* second, you look at the 2nd operand, and line 34 says that IN1 is RA.  so, RA=2.v and this is good18:48
lkcl* then you look at line 43 (sv_in1) and you find it has an EXTRA index ONE (1). RA (2.v) must go into EXTRA IDX 118:48
lkcl* for RB you see it is in2, then look up sv_in2 which is IDX2, therefore RB (1.v) must go into EXTRA IDX 218:49
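The walk-through above can be written out as a small sketch. The two index stores are typed in by hand from the steps just described (RT to index 0, RA to index 1, RB to index 2); the function name `place_operands` is hypothetical, and nothing here encodes actual EXTRA bits.

```python
# Operand order of the instruction format: add RT, RA, RB
OPERAND_NAMES = ["RT", "RA", "RB"]

# Register name -> EXTRA index, as worked out step by step above.
DST_EXTRA = {"RT": 0}
SRC_EXTRA = {"RA": 1, "RB": 2}


def place_operands(operands):
    """Map each assembler operand to the EXTRA index it must occupy."""
    placed = {}
    for name, value in zip(OPERAND_NAMES, operands):
        # Destinations are looked up in the dest store, sources in the src store.
        idx = DST_EXTRA[name] if name in DST_EXTRA else SRC_EXTRA[name]
        placed[idx] = (name, value)
    return placed


# sv.add 5.v, 2.v, 1.v
placement = place_operands(["5.v", "2.v", "1.v"])
```

Only once each operand has its EXTRA index can the per-register EXTRA bits be filled in, which is the step that follows in the real assembler.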
programmerjakelkcl, did nlnet get back to you about the gigabit router grant?18:52
lkclno, not yet.18:53
lkclcan you ping michiel again, cc me?18:53
lkclalso, can you remember where that yosys bug is? about carry4?18:54
lkcli'm trying to find it so that paul mackerras has some context on #microwatt18:54
programmerjakeyou can find the other bugs from that nextpnr-xilinx bug18:59
lkclahhh :)18:59
programmerjakealso, imho it's not a yosys bug, just that yosys can do a workaround19:00
programmerjakepinged michiel19:02
lkcli know you think it's "not a yosys bug"19:14
lkclthink of it in these terms: if this was gcc, ld, and binaries, would you be saying "the best way to fix a problem due to adder inefficiency is to create a program that hand-patches the binary executables"19:15
lkcl"i recommend that after all the ELF linking, all the ABI encoding, all the function calls have been encoded, that you should run objdump, *disassemble* the binary, hunt for all occurrences of an add instruction, patch the binary, and re-assemble it"19:17
lkclbecause that's the VLSI-equivalent, here, of what you're advocating!19:17
lkclno kidding! :)19:17
lkclthere does actually exist a script in symbiflow which does one type of god-awful binary-level-patching, already19:18
lkclit takes over a *minute* to complete because it exports to JSON format, runs in python, then re-exports to JSON format19:19
lkcland finally yosys can re-import the JSON into binary-format in order to carry on processing19:19
lkclall because the task that it performs is *not* carried out by a yosys techmap!19:19
lkcla yosys techmap would have the task performed already, in tens to hundreds of milliseconds, even for massive projects like libre-soc19:27
programmerjakedisassembling, modifying, and reassembling is not the equivalent...the equivalent is more like gas's -momit-lock-prefix=yes option where gas *is* the appropriate place to insert that workaround, not gcc...19:27
lkclinstead, because it's *not* being carried out by a yosys techmap, the downstream tools are forced to do FULL node-tree-walking looking for CARRY4 blocks19:28
lkclno, really, it isn't.19:28
programmerjake(was trying to find gas's option for some arm errata that i remember seeing years ago, but my google-fu fails me...)19:28
lkclyou're assuming that there's an equivalent of gas as a "helper", here19:28
lkclsome sort of plugin-helper-assistance19:28
programmerjakenextpnr and vtr are the equivalent of gas...19:29
lkclthe god-awful-script-bodge-job is *literally* a full dump19:29
lkclfull node-walk19:29
lkclfull node-search-and-replace (DOM-style, in-memory)19:30
lkclit's awful19:30
lkcland requires hundreds of megabytes, if not several gigabytes of memory to perform19:31
programmerjakei'm not talking about the json-dump thing...i mean nextpnr where you should be able to give it a chain of carry4 blocks and it will figure out what wires and where it needs to insert (even crossing routing channels which is where it fails now iirc) to connect all the carry4 blocks you asked for19:31
lkclbecause this particular script is done in python, on ASCII-based JSON, not in a binary-form19:31
lkclat that point, it's already working with 10x the amount of even binary-formatted data19:32
programmerjakesince yosys *shouldn't care* where all the wires are routed, that's nextpnr's job19:32
lkcland VTR doesn't even have the capability to *do* the work, because it's not designed for the task19:32
lkclok, this is not true.19:33
lkcli posted something earlier (yesterday) which is relevant19:33
lkcl1 sec19:33
programmerjakeyosys doesn't have the capability to represent a routing path through a fpga19:33
lkclthat's correct: it has NETLISTs instead19:34
lkclthose NETLISTs are where the problems lie, because yosys "naive" xilinx-add techmap is producing s***-for-brains chains of CARRY4 blocks19:34
lkclthen expecting downstream tools to sort out the mess19:35
lkcltdene and stineje's synth_opt_adders are doing it the "right" way, by producing *alternative* highly-optimised yosys techmaps19:36
programmerjakethose chains of carry4 blocks aren't the problem, the problem is nextpnr doesn't know how to route a carry4-carry4 wire across a routing channel19:36
lkclcorrect, it doesn't.19:36
lkcland the complexity of the blocks that were produced *by yosys* are so insane (binary-level) that the task of nextpnr *and* symbiflow is made 10x harder19:37
lkclyosys doesn't just produce CARRY4-CARRY4 blocks19:37
programmerjakeso, my point is nextpnr should gain the knowledge of how to perform the *routing* task of *routing a wire across the routing channel*19:37
lkclit produces CARRY4-to-OBUF-to-IBUF-to-god-knows-what-else-BUFs19:38
lkclif it *only* produced CARRY4-CARRY4 blocks, i would be agreeing with you 100% that it's a dead-simple task that both nextpnr-xilinx and symbiflow could handle, quickly, easily, and efficiently.19:38
lkclit's not19:38
lkclat all19:38
lkclyou should look at the god-awful mess produced: it's extremely complex (and quite easy to check, just run synth_xilinx on a simple 128-bit or greater add)19:39
programmerjakeif nextpnr can't route with just wires, it should insert the appropriate buffers to let it route...just like coriolis2 inserts inverters on long wires19:40
lkclthe knowledge of the internal buffers and how to interface between them is produced by *yosys*19:40
lkclthe problem is that even *identifying* those buffer locations is an absolute f*****g pig.19:40
lkclbecause the output from yosys is already deeply complex and contains far more than just "CARRY4-CARRY4 chains"19:41
programmerjakeso...that's still nextpnr's problem even if it's hard19:41
lkclok. sure.19:41
lkcli'm not interested in discussing this further. i have too much to do.19:41
programmerjakek, good luck! i'll be busy with my brother and grandmother's birthday party today, so ttyl19:42
* lkcl programmerjake sorry, i'm barely keeping back from going into shock again.19:44
lkclah bless19:44
programmerjakei'd seen the adder tree thing earlier ... cool, but i don't think the prefix_sum fn needs to be quite that complex...using 500 instead of 486 gates on a prefix sum shouldn't matter that much19:45
ghostmansdlkcl, thanks for help!20:39
ghostmansdI'm trying to make a routine which maps the register type to some category (is_CR_3bit, is_CR_5bit, et al.)20:40
ghostmansdthe code that gets generated goes like this:
ghostmansd...with stuff like RA, BF. etc. coming from ppc-opc.c (;a=blob;f=opcodes/ppc-opc.c;h=ddb9c100c76bb846a618f3bda17eadf8b1a6a7cc;hb=refs/heads/svp64)20:44
ghostmansdat the same time, I see that BC has no counter-part in binutils, that is, this symbol is undefined20:44
ghostmansd...and it seems that the only insn that needs it is `isel'20:45
ghostmansdand this one is defined in binutils as `{"isel",    XISEL(31,15,0), XISEL_MASK, PPCISEL|TITAN, 0,        {RT, RA0, RB, CRB}},'20:46
ghostmansdmeanwhile we have `[['RT', 'RA', 'RB', 'BC']]'20:46
ghostmansdso, should we map BC to CRB?20:47
ghostmansddid it for now, cf. openpower-isa latest commit to sv_binutils.py20:53
ghostmansdI also pushed binutils-gdb:svp64; we should now be able to retrieve the following information for each insn:21:02
ghostmansd1. its powerpc_opcode pointer which contains most of vanilla PPC stuff;21:08
ghostmansd2. the former includes all operands, like RA, RB, etc., so we can now map reg name to reg category;21:08
ghostmansd3. I guess that we can use operand "names" as indices to powerpc_operands array;21:08
ghostmansd4. we already can decode stuff like `4(1.v)` to (1, 4, SVP64_REG_MODE_VECTOR).21:08
ghostmansdcf. svp64_decode_reg at gas/config/tc-ppc-svp64.c (binutils-gdb:svp64)21:09
ghostmansd(I think keeping reg->type is redundant, we already have it at powerpc_opcode->operands anyway)21:09
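For comparison on the Python side, a minimal sketch of the kind of operand decoding described in point 4. It is not svp64_decode_reg: the function name `decode_reg`, the mode constants, and the rule that a `.s` suffix or no suffix means scalar are all assumptions made for illustration. The tuple order (register, immediate, mode) follows the `4(1.v)` example above.

```python
import re

# Hypothetical stand-ins for the SVP64_REG_MODE_* constants.
MODE_SCALAR = "scalar"
MODE_VECTOR = "vector"


def decode_reg(text):
    """Decode e.g. '4(1.v)' into (1, 4, MODE_VECTOR).

    The immediate is None when the operand has no D(RA)-style wrapper.
    """
    text = text.strip()
    imm = None
    wrapped = re.fullmatch(r"(-?\d+)\((.+)\)", text)
    if wrapped:
        imm = int(wrapped.group(1))  # the D / DS part, kept only for reconstruction
        text = wrapped.group(2)
    reg = re.fullmatch(r"(\d+)(?:\.([vs]))?", text)
    if reg is None:
        raise ValueError("unrecognised register operand: %r" % text)
    mode = MODE_VECTOR if reg.group(2) == "v" else MODE_SCALAR
    return (int(reg.group(1)), imm, mode)


decoded_ldst = decode_reg("4(1.v)")
decoded_vec = decode_reg("5.v")
decoded_plain = decode_reg("3")
```

The immediate is carried along untouched, in line with the earlier advice to store it and use it purely for reconstructing the v3.0B suffix.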
lkclprogrammerjake, it's about gate delay.22:05
lkclthat function svp64_reg_category looks perfectly reasonable22:06
lkclwhich bits does CRB map to?22:07
* lkcl checks the tables, hang on...22:07
lkclwhere the heck is isel... it's somewhere weird.22:08
lkclline 9322:09
lkcl  93 * isel RT,RA,RB,BC22:09
lkcl  9422:09
lkclok so they have...  {RT, RA0, RB, CRB}},'22:09
lkclso yes, that would seem to match, but let's check the bitfields22:10
lkcl  89 # Integer Select22:10
lkcl  9022:10
lkcl  91 A-Form22:10
lkcl  9222:10
lkclit's an A-Form...22:10
lkclwhich is here...;a=blob;f=openpower/isatables/fields.text;h=d4b5075f2b3c16252c6686163c0147d2546e1971;hb=HEAD#l17422:10
lkclline 17422:10
lkcl 175    |0     |6     |11      |16     |21      |26    |31 |22:11
lkcl 180    | PO   |   RT |   RA   |   RB  |    BC  |   XO |  /|22:11
lkclbear in mind (sigh) those are in barse-ackwards MSB0 order (sigh)22:11
lkclso BC is in MSB0 order bits 21..2522:11
lkclwhich is (31-21)..(31-25) which is22:11
lkcl10..6 aka 6..1022:12
lkcl(in the sane-and-normal LSB0 order)22:12
lkclso we should expect to see an offset of 6 and a mask of 0b1111122:12
lkclfor CRB, that is22:12
lkcl2898 #define MB CRB22:14
lkcl2899 #define MB_MASK (0x1f << 6)22:14
lkcl2900   { 0x1f, 6, NULL, NULL, 0 },22:14
lkclanswer yes!22:14
lkcl0x1f == 0b1111122:15
lkcland MB_MASK (aka CRB_MASK) is (0b11111<<6)22:15
lkclso that confirms the expectation that CRB === BC.22:15
lkcltotally the wrong comments for CRB in ppc-opc.c :)22:16
lkclmust be referring to a much older version of the Power ISA spec22:16
lkclghostmansd, ^22:26
