External RFC ls012: Discuss priorities of Libre-SOC Scalar(Vector) ops

Date: 2023apr10. v2 released: TODO

The purpose of this RFC is:

  • to give a full list of upcoming Scalar opcodes developed by Libre-SOC (being cognisant that all of them are Vectorizeable)
  • to give OPF Members and non-Members alike the opportunity to comment and get involved early in RFC submission
  • to formally agree a priority order on an iterative basis with new versions of this RFC,
  • to decide which ones should go in the EXT022 Sandbox, which in EXT0xx, which in EXT2xx, and which should not be proposed at all,
  • to keep readers summarily informed of ongoing Detailed RFC submissions, with new versions of this RFC,
  • for IBM (in their capacity as Allocator of Opcodes) to get a clear advance picture of Opcode Allocation prior to submission (as Detailed RFCs)

As this is a Formal ISA RFC the evaluation shall ultimately define (in advance of the actual submission of the instructions themselves) which instructions will be submitted over the next 1-18 months.

It is expected that readers visit and interact with the Libre-SOC resources in order to do due-diligence on the prioritisation evaluation. Otherwise the ISA WG is overwhelmed by "drip-fed" RFCs that may turn out not to be useful, against a background of having no guiding overview or pre-filtering, and everybody's precious time is wasted.

Also note that the Libre-SOC Team, being funded by NLnet under Privacy and Enhanced Trust Grants, are prohibited from signing Commercial-Confidentiality NDAs, as doing so is a direct conflict of interest with their funding body's Charitable Foundation Status and remit, and therefore the entire set of almost 150 new SFFS instructions can only go via the External RFC Process.

Also be advised and aware that "Libre-SOC" != "RED Semiconductor Ltd". The two are completely separate organisations.

Worth bearing in mind during evaluation that every "Defined Word-instruction" may or may not be Vectorizeable, but that every "Defined Word-instruction" should have merits on its own, not just when Vectorized, precisely because the instructions are Scalar. An example of a borderline Vectorizeable Defined Word-instruction is mv.swizzle which only really becomes high-priority for Audio/Video, Vector GPU and HPC Workloads, but has less merit as a Scalar-only operation, yet when SVP64Single-Prefixed can be part of an atomic Compare-and-Swap sequence.

Although one of the top world-class ISAs, Power ISA Scalar (SFFS) has not been significantly advanced in 12 years: IBM's primary focus has understandably been on PackedSIMD VSX. Unfortunately, with VSX being 914 instructions and 128-bit it is far too much for any new team to consider (10+ years development effort) and far outside of Embedded or Tablet/Desktop/Laptop power budgets. Thus bringing Power Scalar up-to-date to modern standards and on its own merits is a reasonable goal, and the advantage of the reduced focus is that SFFS remains RISC-paradigm, with lessons being learned from other ISAs from the intervening years. Good examples here include bmask.

SVP64 Prefixing - also known by the terms "Zero-Overhead-Loop-Prefixing" as well as "True-Scalable-Vector Prefixing" - literally brings new dimensions to the Power ISA. Thus when adding new Scalar "Defined Word-instructions", their value when Vector-Prefixed, as well as SVP64Single-Prefixed, must unavoidably and simultaneously be taken into consideration.

Target areas

Whilst entirely general-purpose, there are some categories that these instructions are targeting: Bit-manipulation, Big-integer, Cryptography, Audio/Visual, High-Performance Compute, GPU workloads and DSP.

\newpage{}

Instruction count guide and approximate priority order

| qty | description | RFC | URL |
|-----|-------------|-----|-----|
| 6 | SVP64 Management | ls008 ls009 ls010 | |
| 5 | CR weirds | ls015 | cr int predication |
| 4 | INT<->FP mv | ls006.fpintmv | |
| 19 | GPR LD/ST-PostInc-Update (saves hugely in hot-loops) | ls011 | |
| ~12 | FPR LD/ST-PostInc-Update (ditto) | ls011 | |
| 11 | GPR LD/ST-Shifted-PostInc-Update (saves in hot-loops) | ls011 | |
| 4 | FPR LD/ST-Shifted-PostInc-Update (ditto) | ls011 | |
| 26 | GPR LD/ST-Shifted (again saves hugely in hot-loops) | ls004 | |
| 11 | FPR LD/ST-Shifted (ditto) | ls004 | |
| 2 | Float-Load-Immediate (always saves D-Cache) | ls002.fmi | |
| 5 | Big-Integer Chained 3-in 2-out (64-bit Carry) | ls003.bignum | biginteger |
| 6 | Bitmanip LUT2/3 ops. high cost high reward | ls007 | bitmanip |
| 1 | fclass (Scalar variant of xvtstdcsp) | TBD | fclass |
| 5 | Audio-Video | TBD | av opcodes |
| 2 | Shift-and-Add (mitigates LD-ST-Shift; Crypto twofish) | ls004 | |
| 2 | BMI group | ls014 | vector ops |
| 2 | GPU swizzle | TBD | mv.swizzle |
| 9 | FP DCT/FFT Butterfly (2/3-in 2-out) | ls016 | twin butterfly |
| ~2? | Integer DCT/FFT Butterfly | ls016 | twin butterfly |
| 18 | Trigonometric (1-arg) | ? | transcendentals |
| 15 | Transcendentals (1-arg) | ? | transcendentals |
| 25 | Transcendentals (2-arg) | ? | transcendentals |

Summary tables are created below by different sort categories. Additional columns (and tables) as necessary can be requested to be added as part of update revisions to this RFC.

\newpage{}

Target Area summaries

Please note that there are some instructions developed thanks to NLnet funding that have not been included here for assessment. Examples include pcdec and the Galois Field arithmetic operations. From a purely practical perspective due to the quantity the lower-priority instructions were simply left out. However they remain in the Libre-SOC resources.

Some of these SFFS instructions appear to be duplicates of VSX. A frequent argument comes up that if instructions are already in VSX they should not be added to SFFS, especially if they are nominally the same. The logic that this effectively damages performance of an SFFS-only implementation was raised earlier; however, there is a more subtle reason why the instructions are needed.

Future versions of SVP64 and SVP64Single are expected to be developed by future Power ISA Stakeholders on top of VSX. The decisions made there about the meaning of Prefixed Vectorized VSX may be completely different from those made for Prefixed SFFS instructions. At which point the lack of SFFS equivalents would penalise SFFS implementors in a much more severe way, effectively expecting them and SFFS programmers to work with a non-orthogonal paradigm, to their detriment. The solution is to give the SFFS Subset the space and respect that it deserves and allow it to be stand-alone on its own merits.

SVP64 Management instructions

These without question have to go in EXT0xx. Future extended variants, bringing even more powerful capabilities, can be followed up later with EXT1xx prefixed variants, which is not possible if placed in EXT2xx. Only svstep is actually Vectorizeable, all other Management instructions are Unvectorizeable. PO1-Prefixed examples include adding psvshape in order to support both Inner and Outer Product Matrix Schedules, by providing the option to directly reverse the order of the triple loops. Outer is used for standard Matrix Multiply (on top of a standard MAC or FMAC instruction), but Inner is required for Warshall Transitive Closure (on top of a cumulatively-applied max instruction).

Except for svstep, which is Vectorizeable, the Management instructions themselves are all 32-bit Defined Word-instructions (Scalar Operations), so PO1-Prefixing is perfectly reasonable. The SVP64 Management instructions, of which there are only 6, are all 5 or 6-bit XO, meaning that the opcode space they take up in EXT0xx is not alarmingly high for their intrinsic strategic value.

Transcendentals

Found at transcendentals, these subdivide into high priority for accelerating general-purpose and High-Performance Compute, specialist 3D GPU operations suited to 3D visualisation, and low-priority less common instructions where IEEE754 full bit-accuracy is paramount. In 3D GPU scenarios, for example, even 12-bit accuracy can be overkill, but for HPC Scientific scenarios 12-bit would be disastrous.

There are a lot of operations here, and they also bring Power ISA up-to-date to IEEE754-2019. Fortunately the number of critical instructions is quite low, but the caveat is that if those operations are utilised to synthesise other IEEE754 operations (divide by pi for example) full bit-level accuracy (a hard requirement for IEEE754) is lost.

Also worth noting is that the Khronos Group defines minimum acceptable bit-accuracy levels for 3D Graphics: these are nowhere near the full accuracy demanded by IEEE754. The reason for the Khronos definitions is a massive reduction, often four-fold, in power consumption and gate count when 3D Graphics simply has no need for full accuracy.

For 3D GPU markets this definitely needs addressing.

These instructions are therefore only likely to be proposed if a Stakeholder comes forward and needs them. If for example RED Semiconductor Ltd had a customer requiring a GPS/GNSS Correlator DSP then the SIN/COS Transcendentals would become a high priority but still be optional, as DSP (and 3D) is still specialist.

Audio/Video

Found at av opcodes these do not require Saturated variants because Saturation is added via svp64 (Vector Prefixing) and via svp64-single (Scalar Prefixing). This is important to note for Opcode Allocation because placing these operations in the Unvectorizeable areas would irredeemably damage their value. Unlike PackedSIMD ISAs the actual number of AV Opcodes is remarkably small once the usual cascading-option-multipliers (SIMD width, bitwidth, saturation, HI/LO) are abstracted out to RISC-paradigm Prefixing, leaving just absolute-diff-accumulate, min-max, average-add etc. as "basic primitives".

The min/max set are under their own RFC, ls013. They are of sufficiently high priority: emulating fmax requires an astounding 32 SFFS instructions.

Twin-Butterfly FFT/DCT/DFT for DSP/HPC/AI/AV

The number of uses in Computer Science for DCT, NTT, FFT and DFT, is astonishing. The wikipedia page lists over a hundred separate and distinct areas: Audio, Video, Radar, Baseband processing, AI, Reed-Solomon Error Correction; the list goes on and on. ARM has special dedicated Integer Twin-butterfly instructions. TI's MSP Series DSPs have had FFT Inner loop support for over 30 years. Qualcomm's Hexagon VLIW Baseband DSP can do full FFT triple loops in one VLIW group.

It should be pretty clear this is high priority.

With SVP64 remap providing the Loop Schedules it falls to the Scalar side of the ISA to add the prerequisite "Twin Butterfly" operations, typically performing, for example, one multiply but in-place subtracting that product from one operand and adding it to the other. The in-place aspect is strategically extremely important for significant reductions in Vectorized register usage, particularly for DCT. Further: even without Simple-V the number of instructions saved is huge: 8 for integer and 4 for floating-point, versus one.
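
The Twin Butterfly semantics described above can be sketched behaviourally (a Python model; the exact operand order and FP format are defined in ls016, so treat the naming here as illustrative):

```python
def twin_butterfly(a, b, c):
    """3-in 2-out sketch: one multiply, both operands overwritten
    in place, replacing a multiply, an add, a subtract and a
    temporary register in the FFT/DCT inner loop."""
    prod = b * c                  # the single multiply
    return a + prod, a - prod     # sum and difference, in place

# one radix-2 step combining two half-size transform outputs
x0, x1 = twin_butterfly(1.0, 2.0, 0.5)
```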

CR Weird group

Outlined in cr int predication, these instructions massively save on CR-Field instruction count. Multi-bit to single-bit conversions and vice-versa, normally requiring several CR-ops (crand, crxor), are done in one single instruction. The reason for their addition comes down to SVP64 overloading CR Fields as Vector Predicate Masks. Reducing instruction count in hot-loops is considered high priority.

An additional need is to do popcount on CR Field bit vectors but adding such instructions to the Condition Register side was deemed to be far too much. Therefore, priority was given instead to transferring several CR Field bits into GPRs, whereupon the full set of Standard Scalar GPR Logical Operations may be used. This strategy has the side-effect of keeping the CRweird group down to only five instructions.
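
The transfer-then-popcount strategy can be modelled as follows (bit numbering here is illustrative LSB0 rather than Power's MSB0 convention, and the transfer function is a hypothetical stand-in for the ls015 instructions):

```python
def crfields_to_gpr(cr, fields, bit=3):
    """Gather one chosen bit from each of several 4-bit CR Fields
    into the low bits of a GPR (hypothetical model of the crweird
    transfer concept; exact semantics are defined in ls015)."""
    gpr = 0
    for i, f in enumerate(fields):
        gpr |= ((cr >> (f * 4 + bit)) & 1) << i
    return gpr

# once in a GPR, the standard Scalar popcount does the rest
mask = crfields_to_gpr(0b1000_0000_1000, fields=[0, 1, 2])
popcount = bin(mask).count("1")
```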

Big-integer Math

biginteger has always been a high priority area for commercial applications: privacy, Banking, HPC Numerical Accuracy (libgmp), as well as cryptographic uses in Asymmetric Ciphers. poly1305 and ec25519 are finding their way into everyday use via OpenSSL.

A very early variant of the Power ISA had a 32-bit Carry-in Carry-out SPR. Its removal from subsequent revisions is regrettable. An alternative concept is to add six explicit 3-in 2-out operations that, on close inspection, always turn out to be supersets of existing Scalar operations that discard upper or lower DWords, or parts thereof.

Thus it is critical to note that not one single one of these operations expands the bitwidth of any existing Scalar pipelines.

The dsld instruction, for example, merely places additional LSBs into the 64-bit shift (64-bit carry-in), and then places the (normally discarded) MSBs into the second output register (64-bit carry-out). It does not require a 128-bit shifter to replace the existing Scalar Power ISA 64-bit shifters.
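
A behavioural model of that description (treating RA:RB as a conceptual 128-bit value purely for clarity; real hardware keeps the existing 64-bit shifter):

```python
def dsld(ra, rb, sh):
    """Double-Shift-Left sketch: RA shifted left by sh, the vacated
    LSBs filled from RB's top bits (64-bit carry-in), and RA's
    shifted-out MSBs returned separately (64-bit carry-out).
    Exact semantics are per ls003; this is only a reference model."""
    mask = (1 << 64) - 1
    shifted = ((ra << 64) | rb) << sh     # conceptual 128-bit shift
    rt = (shifted >> 64) & mask           # main result
    rs = (shifted >> 128) & mask          # carried-out MSBs
    return rt, rs
```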

The reduction in instruction count these operations bring, in critical hot loops, is remarkably high, to the extent where a Scalar-to-Vector operation of arbitrary length becomes just the one Vector-Prefixed instruction.
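
As a sketch of why: an arbitrary-length big-integer multiply-by-scalar reduces to one chained 3-in 2-out operation per word, with the 64-bit carry threaded through a register (the name maddedu and its operand order follow the ls003 draft and are illustrative here):

```python
def maddedu(ra, rb, rc):
    """64x64+64 -> 128: low half to RT, high half to RS (carry-out).
    Reference model of the ls003 big-integer chain element."""
    mask = (1 << 64) - 1
    prod = ra * rb + rc
    return prod & mask, prod >> 64

def bigint_mul_scalar(words, scalar):
    """Multi-word big integer times 64-bit scalar: under SVP64 this
    whole loop becomes a single Vector-Prefixed maddedu."""
    carry, out = 0, []
    for w in words:
        lo, carry = maddedu(w, scalar, carry)
        out.append(lo)
    out.append(carry)
    return out
```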

Whilst these are 5-6 bit XO, their utility is considered of high strategic value and as such they are strongly advocated to be in EXT04. The alternative is to bring back a 64-bit Carry SPR, but how that could retrospectively be made applicable to pre-existing Scalar Power ISA multiply, divide, and shift operations at this late stage of maturity of the Power ISA is an entire area of research on its own, deemed unlikely to be achievable.

Note: none of these instructions are in VSX. They are a different paradigm and have more in common with their x86 equivalents.

Critical to note regarding 2-out instructions:

https://groups.google.com/g/comp.arch/c/_-dp_ZU6TN0/m/hVuZt86_BgAJ

>For example, having instructions with 2 dest registers changes
>the cost for a multi-lane OoO renamer from BigO(n^2) to BigO((2n)^2)
>so a 4-lane 2-dest renamer costs 16 times as much.
>And this is for a feature that would be rarely used and is redundant. 

Further down, an additional author observes that Operand-Forwarding mitigates this problem but that sufficient "advance notice" is needed. An inline-assembler chained sequence such as those required for bigint would be considered such, and thus the high cost may be avoided.

fclass and GPR-FPR moves

fclass - just one instruction. With SFFS being locked down to exclude VSX, and there being no desire within the nascent OpenPOWER ecosystem outside of IBM to implement the VSX PackedSIMD paradigm, it becomes necessary to upgrade SFFS such that it is stand-alone capable. One omission based on the assumption that VSX would always be present is an equivalent to xvtstdcsp.

Similar arguments apply to the GPR-FPR move operations, proposed in ls006.fpintmv, with the opportunity taken to add rounding modes present in other ISAs that Power ISA VSX PackedSIMD does not have. JavaScript rounding, one of the worst offenders in Computer Science, requires a phenomenal 35 instructions with six branches to emulate in the Power ISA! For Desktop as well as Server HTML/JS back-end execution of JavaScript this becomes an obvious priority, recognised already by ARM as just one example.
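
For reference, the JavaScript conversion in question is ECMAScript's ToInt32 (the semantics behind `|0`), sketched here:

```python
import math

def js_toint32(x):
    """ECMAScript ToInt32: NaN and infinities become 0, the value is
    truncated toward zero, wrapped modulo 2**32, and reinterpreted as
    signed -- the behaviour a dedicated rounding mode would capture."""
    if math.isnan(x) or math.isinf(x):
        return 0
    n = int(x) & 0xFFFFFFFF           # truncate toward zero, then wrap
    return n - 0x100000000 if n >= 0x80000000 else n
```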

Whilst some of these instructions have VSX equivalents they must not be excluded on that basis. SVP64/VSX may have a different meaning from SVP64/SFFS, i.e. the two Vectorized instructions may not be equivalent.

Bitmanip LUT2/3

These LUT2/3 operations are high cost high reward. Outlined in bitmanip, the simplest ones already exist in PackedSIMD VSX: xxeval. The same reasoning applies as to fclass: SFFS needs to be stand-alone on its own merits and should an implementor choose not to implement any aspect of PackedSIMD VSX the performance of their product should not be penalised for making that decision.

With Predication being such a high priority in GPUs and HPC, CR Field variants of Ternary and Binary LUT instructions were considered high priority, and again just like in the CRweird group the opportunity was taken to work on all bits of a CR Field rather than just one bit as is done with the existing CR operations crand, cror etc.

The other high strategic value instruction is grevlut (and grevluti, which can generate a remarkably large number of regular-patterned magic constants). The grevlut set requires on the order of 20,000 gates but provides an astonishing plethora of innovative bit-permuting instructions never seen in any other ISA.
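
The butterfly-permutation kernel that grevlut builds on is the classic generalized bit-reverse (grev); grevlut itself replaces the fixed swap at each stage with a 2-bit LUT, which is where the extra expressive power (and gate count) comes from. A sketch of the underlying kernel:

```python
def grev64(x, k):
    """Generalized bit-reverse: each set bit of k swaps adjacent
    groups of that size. k=63 is full bit-reversal, k=56 is
    byte-reverse."""
    for m, sh in [(0x5555555555555555, 1), (0x3333333333333333, 2),
                  (0x0F0F0F0F0F0F0F0F, 4), (0x00FF00FF00FF00FF, 8),
                  (0x0000FFFF0000FFFF, 16), (0x00000000FFFFFFFF, 32)]:
        if k & sh:
            x = ((x & m) << sh) | ((x >> sh) & m)
    return x
```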

The downside of all of these instructions is their extremely low XO bit count: only 2-3 XO bits remain, due to the large immediates and the number of operands required. The LUT3 instructions are already compacted down to "Overwrite" variants. (By contrast the Float-Load-Immediate instructions have a much larger XO because, despite having a 16-bit immediate, only one Register Operand is needed).

Realistically these high-value instructions should be proposed in EXT2xx where their XO cost does not overwhelm EXT0xx.

(f)mv.swizzle

mv.swizzle is dicey. It is a 2-in 2-out operation whose value as a Scalar instruction is limited except if combined with cmpi and SVP64Single Predication, whereupon the end result is the RISC-synthesis of Compare-and-Swap, in two instructions.

Where this instruction comes into its full value is when Vectorized. 3D GPU and HPC numerical workloads astonishingly contain between 10 and 15% swizzle operations: accessing YYZ or XY of an XYZW Quaternion, or performing rebalancing of ARGB pixel data. The usage is so high that 3D GPU ISAs make Swizzle a first-class priority in their VLIW words. Even 64-bit Embedded GPU ISAs have a staggering 24 bits dedicated to 2-operand Swizzle.
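
Behaviourally a swizzle is just lane permutation with duplication, a sketch of what a single mv.swizzle encodes in its immediate field:

```python
LANES = {"X": 0, "Y": 1, "Z": 2, "W": 3}

def swizzle(vec, pattern):
    """Select lanes of an XYZW vector by name, allowing repeats,
    as in the YYZ and XY accesses mentioned above."""
    return [vec[LANES[c]] for c in pattern]

xyzw = [1.0, 2.0, 3.0, 4.0]
yyz = swizzle(xyzw, "YYZ")
```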

So as not to radicalise the Power ISA the Libre-SOC team decided to introduce mv Swizzle operations, which can always be Macro-op fused in exactly the same way that ARM SVE predicated-move extends 3-operand "overwrite" opcodes to full independent 3-in 1-out.

BMI (bit-manipulation) group

Whilst the vector ops instructions are only two in number, in reality the bmask instruction has a Mode field allowing it to cover 24 instructions, more than ARM, Intel or AMD have added to any of their CPUs. Analysis of the BMI sets of these CPUs shows simple patterns that can greatly simplify both Decode and implementation. These are sufficiently commonly used, saving instruction count regularly, that they justify going into EXT0xx.
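
Three classic members of the pattern family (combine x with x-1, -x and so on using AND/OR/XOR) illustrate the regularity that a single Mode field can enumerate:

```python
def bmi_patterns(x):
    """A sample of the x86/ARM BMI idioms that bmask's Mode field
    covers as one instruction (a sketch; the full Mode encoding is
    defined in ls014)."""
    return {
        "blsr":   x & (x - 1),   # clear lowest set bit
        "blsi":   x & -x,        # isolate lowest set bit
        "blsmsk": x ^ (x - 1),   # mask up to and including it
    }
```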

The other instruction is cprop - Carry-Propagation - which takes the P and Q from carry-propagation algorithms and generates carry look-ahead. It greatly increases the efficiency of arbitrary-precision integer arithmetic by combining what would otherwise be half a dozen instructions into one. However it is still not as high a priority as bmask, so is probably best placed in EXT2xx.
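
A bit-serial reference model of what carry look-ahead computes (the exact P/Q encoding of cprop is defined in ls014; this only shows the recurrence being collapsed into one instruction):

```python
def carry_lookahead(p, g, width=64, cin=0):
    """Given propagate (p) and generate (g) bit-vectors, produce the
    carry vector via c[i+1] = g[i] | (p[i] & c[i])."""
    carries, c = 0, cin
    for i in range(width):
        c = ((g >> i) & 1) | (((p >> i) & 1) & c)
        carries |= c << (i + 1)
    return carries
```

For any a and b the result matches the carries of a real addition: `carry_lookahead(a ^ b, a & b) == (a + b) ^ a ^ b`.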

Float-Load-Immediate

Very easily justified. As explained in ls002.fmi these always save one LD L1/2/3 D-Cache memory-lookup operation, by virtue of the Immediate FP value being on the I-Cache side. It is such a high priority that these instructions are easily justifiable to add into EXT0xx, despite requiring a 16-bit immediate. By designing the second-half instruction as a Read-Modify-Write it saves on XO bit-length (only 5 bits), and can be macro-op fused with its first half to store a full IEEE754 FP32 immediate into a register.
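
The two-half mechanism can be modelled as follows (which 16 bits each half carries is defined in ls002, so this split is illustrative):

```python
import struct

def fp32_from_halves(hi16, lo16):
    """Sketch of the Float-Load-Immediate pairing: the first
    instruction materialises the top 16 bits of an IEEE754 FP32
    constant (already a usable BF16-accuracy value on its own); the
    Read-Modify-Write second half supplies the low 16 bits."""
    bits = ((hi16 & 0xFFFF) << 16) | (lo16 & 0xFFFF)
    return struct.unpack(">f", struct.pack(">I", bits))[0]

one = fp32_from_halves(0x3F80, 0x0000)    # 1.0 from one instruction
pi32 = fp32_from_halves(0x4049, 0x0FDB)   # full FP32 pi from the pair
```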

There is little point in putting these instructions into EXT2xx. Their very benefit and inherent value is as 32-bit instructions, not 64-bit ones. Likewise there is less value in taking up EXT1xx Encoding space because EXT1xx only brings an additional 16 bits (approx) to the table, and that is provided already by the second-half instruction.

Thus they qualify as both high priority and also EXT0xx candidates.

FPR/GPR LD/ST-PostIncrement-Update

These instructions, outlined in ls011, save hugely in hot-loops. Early ISAs such as the PDP-8 and PDP-11, which inspired the iconic Motorola 68000 and 88100 and Mitch Alsup's My 66000, all had both pre- and post-increment Addressing Modes; the concept can even be traced back to the iconic ultra-RISC CDC 6600.

The reason is very simple: it is a direct recognition of the practice in C of frequently utilising both *p++ and *++p, which itself stems from common need in Computer Science algorithms.

The problem for the Power ISA is - was - that the opcode space needed to support both was far too great, and the decision was made to go with pre-increment, on the basis that outside the loop a "pre-subtraction" may be performed.

Whilst this is a "solution" it is less than ideal, and the opportunity exists now with the EXT2xx Primary Opcodes to correct this and bring Power ISA up a level.

Where things begin to get more than a little hairy is if both Post-Increment and Shifted are included. If SVP64 keeps one single bit (/pi) dedicated in the RM.Mode field then this problem goes away, at the cost of reducing SVP64's effectiveness. However again, given that even the Shifted-Post-Increment instructions are all 9-bit XO, it is not outside the realm of possibility to include them in EXT2xx.

Shift-and-add (and LD/ST Indexed-Shift)

Shift-and-Add are proposed in ls004. They mitigate the need to add LD-ST-Shift instructions which are a high-priority aspect of both x86 and ARM. LD-ST-Shift is normally just the one instruction: Shift-and-add brings that down to two, where Power ISA presently requires three. Cryptography e.g. twofish also makes use of Integer double-and-add, so the value of these instructions is not limited to Effective Address computation. They will also have value in Audio DSP.
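
The saving is easy to see in a table-indexing sketch (the name shadd and the 2-bit shift range follow ls004):

```python
MASK64 = (1 << 64) - 1

def shadd(ra, rb, sh):
    """RT = RA + (RB << sh), sh in 0..3: one instruction where the
    Power ISA currently needs a shift followed by an add."""
    return (ra + ((rb << sh) & MASK64)) & MASK64

# effective address of element i in a table of 8-byte entries
base, i = 0x10000, 5
ea = shadd(base, i, 3)
```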

Being a 10-bit XO it would be somewhat punitive to place these in EXT2xx when their whole purpose and value is to reduce binary size in Address offset computation, thus they are best placed in EXT0xx.

The upside as far as adding them is concerned is that existing hardware will already have amalgamated pipelines with very few actual back-end (Micro-Coded) internal operations (likely just two: one load, one store). Passing a 2-bit additional immediate field down to those pipelines really is not hard.

(Readers unfamiliar with Micro-coding should look at the Microwatt VHDL source code)

Also included because it is important to see the quantity of instructions: LD/ST-Indexed-Shifted. Across Update variants, Byte-reverse variants, Arithmetic and FP, the total is a slightly-eye-watering 37 instructions, only ameliorated by the fact that they are all 9-bit XO. Even when adding the Post-Increment-Shifted group it is still only 52 9-bit XO instructions, which is not unreasonable to consider (in EXT2xx).

\newpage{}

Vectorization: SVP64 and SVP64Single

To be submitted as part of ls001, ls008, ls009 and ls010, with SVP64Single to follow in a subsequent RFC, SVP64 is conceptually identical to the decades-old 8086 REP prefix and the Zilog Z80 CPIR and LDIR instructions. Parallelism is best achieved by exploiting a Multi-Issue Out-of-Order Micro-architecture. It is extremely important to bear in mind that at no time does SVP64 add even one single actual Vector instruction. It is a pure RISC-paradigm Prefixing concept only.
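
The "hardware for-loop over a Scalar instruction" idea can be expressed in a few lines (register numbering and stepping are simplified; the real Prefix adds predication, element-width overrides and more):

```python
def svp64_add(regs, rt, ra, rb, vl):
    """Behavioural model of an SVP64-Prefixed scalar add: the Prefix
    contributes only the REP-style loop, never a new Vector opcode."""
    mask = (1 << 64) - 1
    for i in range(vl):
        regs[rt + i] = (regs[ra + i] + regs[rb + i]) & mask
    return regs

regs = list(range(32))
svp64_add(regs, rt=0, ra=8, rb=16, vl=4)
```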

This has some implications which need unpacking. Firstly: in the future, the Prefixing may be applied to VSX. The only reason it was not included in the initial proposal of SVP64 is that, given the number of VSX instructions, the Due Diligence required is obviously five times higher than the 3+ years of work done so far on the SFFS Subset.

Secondly: any Scalar instruction involving registers automatically becomes a candidate for Vector-Prefixing. This in turn means that when a new instruction is proposed, it becomes a hard requirement to consider not only the implications of its inclusion as a Scalar-only instruction, but how it will best be utilised as a Vectorized instruction as well. Extreme examples of this are the Big-Integer 3-in 2-out instructions that use one 64-bit register effectively as a Carry-in and Carry-out. The instructions were designed in a Scalar context to be inline-efficient in hardware (use of Operand-Forwarding to reduce the chain down to 2-in 1-out), but in a Vector context it is extremely straightforward to Micro-code an entire batch onto 128-bit SIMD pipelines, 256-bit SIMD pipelines, and to perform a large internal Forward-Carry-Propagation on for example the Vectorized-Multiply instruction.

Thirdly: as far as Opcode Allocation is concerned, SVP64 needs to be considered as an independent stand-alone instruction (just like REP). In other words, the Suffix never gets decoded as a completely different instruction just because of the Prefix. The cost of doing so is simply too high in hardware.


Guidance for evaluation

Deciding which instructions go into an ISA is extremely complex, costly, and a huge responsibility. In public standards mistakes are irrevocable, and in the case of an ISA the Opcode Allocation is a finite resource, meaning that mistakes punish future instructions as well. This section therefore provides some Evaluation Guidance on the decision process, particularly for people new to ISA development, given that this RFC is circulated widely and publicly. Constructive feedback from experienced ISA Architects welcomed to improve this section.

Does anyone want it?

Sounds like an obvious question but if there is no driving need (no "Stakeholder") then why is the instruction being proposed? If it is purely out of curiosity or part of a Research effort not intended for production then it's probably best left in the EXT022 Sandbox.

Common, Frequent, Rare

The same question may even be asked of a register file, not just of an instruction: is the instruction (or other feature) intended to be:

  • Common (used all of the time, typically built-in to toolchain)
  • Frequent (specialised tasks but time or resource critical)
  • Rare (when you need them you need them)

A good example would be the addition of 128-bit operations, or even (for Elliptic Curve Cryptography - ec25519) 512-bit ALUs.

How many registers does it need?

The basic RISC Paradigm is not only to make instruction encoding simple (often "wasting" encoding space compared to highly-compacted ISAs such as x86), but also to keep the number of registers used down to a minimum.

A counter-example is FMAC, which had to be added to IEEE754 because the internal product requires more accuracy than can fit into a register (it is well-known that FMUL followed by FADD performs an additional rounding on the intermediate register which loses accuracy compared to FMAC). Another would be a dot-product instruction, which again requires an accumulator of at least double the width of the two vector inputs. And in the AMDGPU ISA, there are Texture-mapping instructions taking up to an astounding twelve input operands!
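
The double-rounding loss is easy to demonstrate; exact rational arithmetic stands in for the fused operation here:

```python
from fractions import Fraction

a = 1.0 + 2.0 ** -27
b = a
c = -(1.0 + 2.0 ** -26)

# fmul-then-fadd: the 2**-54 term of the product is rounded away
separate = a * b + c

# FMAC keeps the full-width product; emulated via exact arithmetic
fused = float(Fraction(a) * Fraction(b) + Fraction(c))
```

Here `separate` is exactly 0.0 while `fused` is 2**-54: the entire answer lives in the bits the intermediate rounding discarded.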

The downside of going too far however has to be a trade-off with the next question. Both MIPS and RISC-V lack Condition Codes, which means that emulating x86 Branch-Conditional requires ten MIPS instructions.

The downside of creating too complex instructions is that the Dependency Hazard Management in high-performance multi-issue out-of-order microarchitectures becomes infeasibly large, and even simple in-order systems may have performance severely compromised by an overabundance of stalls. Also worth remembering is that register file ports are insanely costly: not only are they difficult to design, they also use considerable power.

That said there do exist genuine reasons why more registers is better than less: Compare-and-Swap has huge benefits but is costly to implement, and DCT/FFT Twin-Butterfly instructions allow creation of in-place in-register algorithms reducing the number of registers needed and thus saving power due to making the overall algorithm more efficient, as opposed to micro-focussing on a localised power increase.

As a general rule of thumb, though:

  • going beyond 3-in 2-out is an extremely bad idea
  • 3-in 2-out is extreme borderline (including Condition Codes)
  • 3-in 1-out needs really good justification
  • 2-in 1-out (or 2-in 2-out if one is a Condition Code or Status Register) is acceptable

Remember to include all Register Files (Status Registers, Condition Fields) in the assessment: each register will need its own Hazard Protection, and in an Out-of-Order system that results in significant resource utilisation in silicon.

How many register files does it use?

Complex instructions pulling in data from multiple register files can create unnecessary issues surrounding Dependency Hazard Management in Out-of-Order systems. As a general rule it is better to keep complex instructions reading and writing to the same register file, relying on much simpler (1-in 1-out) instructions to transfer data between register files. This rule-of-thumb allows the Dependency Matrices to be made sparse or significantly reduced in both row and column entries.

Can other existing instructions (plural) do the same job?

The general rule being: if two or more instructions can do the same job, leave it out... unless the number of occurrences of that instruction being missing is causing huge increases in binary size. RISC-V has gone too far in this regard, as explained here: https://news.ycombinator.com/item?id=24459314

Good examples are LD-ST-Indexed-Shifted (multiply RB by 2, 4, 8 or 16), which are high-priority instructions in x86 and ARM, but lacking in Power ISA, MIPS, and RISC-V. With many critical hot-loops in Computer Science having to perform the shift and add as explicit instructions, adding LD/ST-shifted should be considered high priority, except that the sheer number of such instructions needing to be added takes us into the next question.

How costly is the encoding?

This can either be a single instruction that is costly (several operands or a few long ones) or it could be a group of simpler ones that purely due to their number increases overall encoding cost. An example of an extreme costly instruction would be those with their own Primary Opcode: addi is a good candidate. However the sheer overwhelming number of times that instruction is used easily makes a case for its inclusion.

Mentioned above was Load-Store-Indexed-Shifted, which only needs 2 bits to specify how much to shift: x2, x4, x8 or x16. And they are all in a 10-bit XO Field, so not that costly for any one given instruction. Unfortunately there are around 30 Load-Store-Indexed Instructions in the Power ISA, which means an extra five bits of precious XO space taken up. Adding the two bits needed for the Shift amount, the group as a whole effectively consumes seven bits, leaving only a 3-bit XO.

Is this a worthwhile tradeoff? Honestly it could well be. And that's the decision process that the OpenPOWER ISA Working Group could use some assistance on, to make the evaluation easier.

How many gates does it need?

grevlut comes in at an astonishing 20,000 gates, where for comparison an FP64 Multiply typically takes between 12,000 and 15,000. Not counting the cost in hardware terms is just asking for trouble.

If the number of gates gets too large it has an unintended side-effect: power consumption goes up but so does the distance between functions on-chip. A good illustration here is the CDC6600 and Cray Supercomputers where speed was limited by the size of the room. In other words larger functions cause communication delays, and communication delays reduce top speed.

How long will it take to complete?

In the case of divide or Transcendentals the algorithms needed are so complex that simple implementations can often take an astounding 128 clock cycles to complete (Goldschmidt reduces that significantly). Other instructions waiting for the results will back up and eventually stall, where in-order systems pretty much just stall straight away.

Less extreme examples include instructions that take only a few cycles to complete, but if commonly used in tight loops with Conditional Branches, an Out-of-Order system with Speculative capability may need significantly more Reservation Stations to hold in-flight data for all instructions when some take longer, so even a single clock cycle reduction could become important.

A rule of thumb is that in Hardware, at 4.8 GHz, the budget for what is called "gate propagation delay" is only around 16 to 19 gates chained one after the other. Anything beyond that budget will need to be stored in DFFs (Flip-flops) and another set of 16-19 gates continues on the next clock cycle. Thus for example with grevlut above it is almost certainly the case that high-performance high-clock-rate systems would need at least two clock cycles (two pipeline stages) to produce a valid result. This in turn brings us to the next question, as it is common to consider subdividing complex instructions into smaller parts.

Can one instruction do the job of many?

Large numbers of disparate instructions adversely affect resource utilisation in In-Order systems. However it is not always that simple: every one of the Power ISA "add" and "subtract" instructions, as shown by the Microwatt source code, may be micro-coded as one single instruction where RA may optionally be inverted, the output likewise, and Carry-In set to 1, 0 or XER.CA. From these options the entire suite of add/subtract may be synthesised: subtraction inverts RA and adds an extra 1, producing the twos-complement of RA.
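
That single micro-coded adder can be sketched as follows, loosely modelled on the Microwatt approach described above (the function and parameter names here are illustrative, not Microwatt's actual identifiers):

```python
MASK64 = (1 << 64) - 1

def adder(ra, rb, invert_ra=False, carry_in=0, invert_out=False):
    """One adder, micro-coded to cover the whole add/subtract suite."""
    if invert_ra:
        ra = ~ra & MASK64
    result = (ra + rb + carry_in) & MASK64
    if invert_out:
        result = ~result & MASK64
    return result

def add(ra, rb):   # add RT,RA,RB
    return adder(ra, rb)

def subf(ra, rb):  # subf RT,RA,RB computes RB - RA
    return adder(ra, rb, invert_ra=True, carry_in=1)
```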

bmask for example is to be proposed as a single instruction with a 5-bit "Mode" operand, greatly simplifying some micro-architectural implementations. Likewise the FP-INT conversion instructions are grouped as a set of four, instead of over 30 separate instructions. Aside from anything else, this strategy makes the ISA Working Group's evaluation task easier, as well as reducing the work of writing a Compliance Test Suite.

In the case of the MIPS 3D ASE Extension, a Reciprocal-Square-Root instruction was proposed that was split into two halves: the first delivering 12-14 bits of accuracy in 7 cycles, the second a "Carry On And Get Better Accuracy" refinement! With 3D only needing reduced accuracy the saving in power consumption and time was definitely worthwhile, and it neatly illustrates a counter-example to trying to make one instruction do too much.
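
The two-halves idea can be illustrated in software as a cheap low-accuracy estimate followed by optional Newton-Raphson refinement for callers that need more accuracy. This is a hypothetical sketch of the split, not the actual MIPS 3D hardware algorithm:

```python
import math

def rsqrt_estimate(x):
    """Crude seed accurate to only a few bits, standing in for a
    small hardware lookup table (the cheap 'first half')."""
    e = math.frexp(x)[1]     # exponent such that x = m * 2**e, 0.5 <= m < 1
    return 2.0 ** (-e / 2.0)

def rsqrt_refine(x, y):
    """One Newton-Raphson step (the 'carry on and get better accuracy'
    second half): y' = y * (3 - x*y*y) / 2."""
    return y * (3.0 - x * y * y) / 2.0

def rsqrt(x, steps=5):
    y = rsqrt_estimate(x)
    for _ in range(steps):
        y = rsqrt_refine(x, y)
    return y
```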

Another good example is the Integer Twin-butterfly instructions, ((a +/- b) * c) >> sh, which would otherwise require eight instructions and temporary registers. Although expensive they save so many other instructions - and registers - that it is hard to disregard them even if their internal implementation is Micro-coded.
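
A sketch of what the Twin-butterfly pair computes, fused versus the discrete sequence of operations (and temporaries) it replaces, in illustrative Python standing in for the scalar instruction sequence:

```python
def butterfly_fused(a, b, c, sh):
    """The pair as two fused twin-butterfly operations."""
    return ((a + b) * c) >> sh, ((a - b) * c) >> sh

def butterfly_discrete(a, b, c, sh):
    """The same computation as discrete add/sub/mul/shift steps."""
    t0 = a + b      # add
    t1 = a - b      # subtract
    t2 = t0 * c     # multiply
    t3 = t1 * c     # multiply
    r0 = t2 >> sh   # shift
    r1 = t3 >> sh   # shift
    return r0, r1   # six operations plus four temporary registers
```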

Is it general-purpose or have a compelling use-case?

The more specialised the instruction the less power it uses, but the less opportunity it has for being used elsewhere. A good example of a bad instruction comes from an MSc thesis proposing a chacha20 SIMD add-xor-rotate-by-7 instruction, when chacha20 has nowhere near the decades-established usage of Rijndael (AES) or SHA. Although the instruction halved the number of inline-unrolled instructions in chacha20 it is clearly so specific as to be useless for any other purpose.
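
For reference, the ChaCha20 quarter-round (RFC 8439) shows the add-xor-rotate pattern that such an instruction would fuse; note the rotation amounts 16, 12, 8 and 7:

```python
def rotl32(v, n):
    """Rotate a 32-bit value left by n bits."""
    v &= 0xffffffff
    return ((v << n) | (v >> (32 - n))) & 0xffffffff

def quarter_round(a, b, c, d):
    """One ChaCha20 quarter-round: four add-xor-rotate steps."""
    a = (a + b) & 0xffffffff; d = rotl32(d ^ a, 16)
    c = (c + d) & 0xffffffff; b = rotl32(b ^ c, 12)
    a = (a + b) & 0xffffffff; d = rotl32(d ^ a, 8)
    c = (c + d) & 0xffffffff; b = rotl32(b ^ c, 7)
    return a, b, c, d
```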

Good examples of good specialist instructions are the AES and SHA round-acceleration instructions in VSX, because these algorithms are so heavily used that nearly all ISAs have them.

Perhaps this point should be placed first, but it is a different angle on the cost-benefit analysis, which starts with "Does anyone want it?" That alone is not quite enough: although a given Stakeholder might want a particular instruction to accelerate their application, the expression of need is only where the evaluation process begins.

Summary

There are many tradeoffs here and the list of considerations is long: if any others are known about, please do submit feedback so they may be included here. Then the evaluation process may take place: again, constructive feedback as to which instructions are a priority is also appreciated. The above helps explain the columns in the tables that follow.

\newpage{}

Tables

The original tables are available publicly as a CSV file at https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/rfc/ls012/optable.csv;hb=HEAD. A python program auto-generates the tables in the following sections by sorting them into different useful priorities.

The key to headings and sections are as follows:

  • Area - Target Area as described in above sections
  • XO Cost - the number of bits required in the XO Field. Whilst not the full picture it is a good indicator of how costly, in terms of Opcode Allocation, a given instruction will be: a lower number means a higher cost in the Power ISA's precious remaining Opcode space. "PO" indicates that an entire Primary Opcode is required.
  • rfc - the Libre-SOC External RFC resource, https://libre-soc.org/openpower/sv/rfc/ where advance notice of upcoming RFCs in development may be found. Reading advance Draft RFCs and providing feedback is strongly advised: it saves time and effort for the OPF ISA Workgroup.
  • SVP64 - Vectorizeable (SVP64-Prefixable) - also implies that SVP64Single is also permitted (required).
  • page - Libre-SOC wiki page at which further information can be found. Again: advance reading strongly advised due to the sheer volume of information.
  • PO1 - the instruction is capable of being PO1-Prefixed (given an EXT1xx Opcode Allocation). Bear in mind that this option is mutually exclusive with Vectorization.
  • group - the Primary Opcode Group recommended for this instruction. Options are EXT0xx (EXT000-EXT063), EXT1xx and EXT2xx. A third area (Unvectorizeable), EXT3xx, was available in an early Draft RFC but has been made "RESERVED" instead. see po9 encoding.
  • Level - Compliancy Subset and Simple-V Level. SFFS indicates "mandatory" in SFFS. All else is "optional" however some instructions are further Subsetted within Simple-V: SV/Embedded, SV/DSP and SV/Supercomputing.
  • regs - a guide to register usage, to how costly Hazard Management will be, in hardware:
     - 1R: reads one GPR/FPR/SPR/CR.
     - 1W: writes one GPR/FPR/SPR/CR.
     - 1r: reads one CR *Field* (not necessarily the entire CR)
     - 1w: writes one CR *Field* (not necessarily the entire CR)