16 bit Compressed

Similar to VLE (but without immediate-prefixing) this encoding is designed to fit on top of OpenPOWER ISA v3.0B when a "Modeswitch" bit is set (PCR is recommended). Note that Compressed is incompatible with OpenPOWER v3.1B "prefixing" because it uses (requires) both EXT000 and EXT001. Hypothetically it could be made to use anything other than EXT001, with some inconvenience (extra gates). The incompatibility is "fixed" by swapping out of "Compressed" Mode and back into "Normal" (v3.1B) Mode, at runtime, as needed.

Although initially intended to be augmented by Simple-V Prefixing (to add Vector context, width overrides, e.g. IEEE754 FP16, and predication) without putting pressure on I-Cache power or size, this Compressed Encoding is not critically dependent on SV Prefixing, and may be used stand-alone.

This one is a conundrum. OpenPOWER ISA was never designed with 16 bit in mind. VLE was added 10 years ago but only by way of marking an entire 64k page as "VLE". Because VLE has not been maintained, it is not fully compatible with current PowerISA.

Here, in order to embed 16 bit into a predominantly 32 bit stream, using an entire 16 bits just to switch into Compressed mode is itself a significant overhead. The situation is made worse by OpenPOWER ISA being fundamentally designed with 6 bits uniformly taking up Major Opcode space, leaving only 10 bits to allocate to actual instructions.

Contrast this with RVC which takes 3 out of 4 combinations of the first 2 bits for indicating 16-bit (anything with 0b00 to 0b10 in the LSBs), and uses the 4th (0b11) as a Huffman-style escape-sequence, easily allowing standard 32 bit and 16 bit to intermingle cleanly. To achieve the same thing on OpenPOWER would require a whopping 24 6-bit Major Opcodes which is clearly impractical: other schemes need to be devised.

In addition we would like to add SV-C32 which is a Vectorised version of 16 bit Compressed, and ideally have a variant that adds the 27-bit prefix format from SV-P64, as well.

Potential ways to reduce pressure on the 16 bit space are:

  • To use more than one v3.0B Major Opcode, preferably an odd-even contiguous pair
  • To provide "paging". This involves bank-switching to alternative optimised encodings for specific workloads
  • To enter "16 bit mode" for durations specified at the start
  • To reserve one bit of every 16 bit instruction to indicate that the 16 bit mode is to continue to be sustained

This latter would be useful in the Vector context to have an alternative meaning: as the bit which determines whether the instruction is 11-bit prefixed or 27-bit prefixed:

0 1 2 3 4 5 6 7 8 9 a b c d e f |
|major op | 11 bit vector prefix|
|16 bit opcode  alt vec. mode ^ |
| extra vector prefix if alt set|

Using a major opcode to enter 16 bit mode leaves 11 bits for which a use has to be found:

0 1 2 3 4 5 6 7 8 9 a b c d e f |
|major op | what to do here   1 |
|16 bit    stay in 16bit mode 1 |
|16 bit    stay in 16bit mode 1 |
|16 bit       exit 16bit mode 0 |

One possibility is that the 11 bits are used for bank selection, with some room for additional context such as altering the registers used for the 16 bit operations (bank selection of which scalar regs). However the downside is that short sequences of Compressed instructions become penalised by the fixed overhead. Even a single 16 bit instruction requires a 16 bit overhead to "gain access" to 16 bit "mode", making the exercise pointless.
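The fixed-overhead argument above can be checked with a little arithmetic. The following is a minimal sketch, assuming every instruction in the run would otherwise be a 32-bit v3.0B instruction; the function name `bits_saved` and its parameters are hypothetical.

```python
def bits_saved(run_length: int, switch_overhead_bits: int = 16) -> int:
    """Bits saved by encoding `run_length` instructions as 16-bit instead of 32-bit."""
    standard = 32 * run_length                            # all instructions as v3.0B
    compressed = switch_overhead_bits + 16 * run_length   # mode switch + 16-bit run
    return standard - compressed


# A run of one instruction saves nothing; longer runs amortise the overhead.
for n in (1, 2, 4, 8):
    print(n, bits_saved(n))
```

With `switch_overhead_bits=0` the same function models a scheme in which the entry instruction itself does useful work, so even a single instruction saves 16 bits.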

An alternative is to use the first 11 bits for only the most commonly used instructions. That being the case, one of those 11 bits could be dedicated to saying whether 16 bit mode is to be continued, at which point all 16 bits can be used for Compressed. 10 bits remain for actual opcodes, which is ridiculously tight; however, the opportunity to subsequently use all 16 bits is worth it.

The reason for picking 2 contiguous Major v3.0B opcodes is illustrated below:

|0 1 2 3 4 5 6 7 8 9 a b c d e f|
|major op..0| LO Half C space   |
|major op..1| HI Half C space   |
|N N N N N|<--11 bits C space-->|

If NNNNN is the same value (two contiguous Major v3.0B Opcodes) this saves gates at a critical part of the decode phase.
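As an illustration of why a contiguous pair is cheap to detect, here is a minimal sketch assuming MSB0 bit numbering (bit 0 is the most significant bit, as in the diagrams) and assuming the pair is the one whose shared prefix NNNNN is all zeros. The helper names are hypothetical.

```python
def top5(halfword: int) -> int:
    """Bits 0-4 (MSB0) of a 16-bit value: the shared 'NNNNN' prefix."""
    return (halfword >> 11) & 0b11111


def is_compressed_entry(halfword: int, nnnnn: int = 0b00000) -> bool:
    """True if the halfword falls in either of the two contiguous major opcodes."""
    # One 5-bit comparison covers both opcodes of the pair.
    return top5(halfword) == nnnnn


def c_space(halfword: int) -> int:
    """The 11 bits left for Compressed; the top one selects the LO or HI half."""
    return halfword & 0x7FF
```

A single 5-bit compare replaces two separate 6-bit compares, and the sixth bit simply becomes the top bit of the 11-bit C space.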

ABI considerations

Unlike RVC, the above "context" encodings require state to be stored in the PCR, MSR, or a dedicated SPR. These bits (just like LE/BE 32bit mode and the IEEE754 FPSCR mode) all require taking that context into consideration.

In particular it is critically important to recognise that context (in general) is an implicit part of the ABI implemented for example by glibc6. Therefore (in specific) Compressed Mode Context must not be permitted to cross into or out of a function call.

Thus it is the mandatory responsibility of the compiler to ensure that context returns to "v3.0B Standard" prior to entering a function call (responsibility of caller) and prior to exit from a function call (responsibility of callee).

Trap Handlers also take responsibility for saving and restoring of Compressed Mode state, just as they already take responsibility for other critical state. This makes traps transparent to functions as far as Compressed Mode Context is concerned, just as traps are already transparent to functions.

Note however that there are exceptions in a compiler to the otherwise hard rule that Compressed Mode context not be permitted to cross function boundaries: inline functions and static functions. Static functions, if correctly identified as never being called externally, may, as an optimisation, disregard standard ABIs, bearing in mind that this is fraught (pointers to functions) and not easy to get right.

Opcode Allocation Ideas

Opcodes exploration (Attempt 1)

Switching between different encoding modes is controlled by M (alone) in 10-bit mode, and M and N in 16-bit mode.

  • M in 10-bit mode if zero indicates that following instructions are standard OpenPOWER ISA 32-bit encoded (including, redundantly, further 10/16-bit instructions)
  • M in 10-bit mode if 1 indicates that following instructions are in 16-bit encoding mode

Once in 16-bit mode:

  • 0b01 (M=1, N=0): stay in 16-bit mode
  • 0b00: leave 16-bit mode permanently (return to standard OpenPOWER ISA)
  • 0b10: leave 16-bit mode for one cycle (return to standard OpenPOWER ISA)
  • 0b11: free to be used for something completely different.

The current "top" idea for 0b11 is to use it for a new encoding format of predominantly "immediates-based" 16-bit instructions (branch-conditional, addi, mulli etc.)

  • The Compressed Major Opcode is in bits 5-7.
  • Minor opcode in bit 8.
  • In some cases bit 9 is taken as an additional sub-opcode, followed by bits 0-4 (for CR operations)
  • M+N mode-switching is not available for C-Major.minor 0b001.1
  • 10 bit mode may be expanded by 16 bit mode, adding capabilities that do not fit in the extremely limited space.

Mode-switching FSM showing relationship between v3.0B, C 10bit and C 16bit. 16-bit immediate mode remains in 16-bit.

| 0 | 1234 | 567  8 | 9abcde | f | explanation
| EXT000/1 | Cmaj.m | fields | 0 | 10bit then v3.0B
| EXT000/1 | Cmaj.m | fields | 1 | 10bit then 16bit
| 0 | flds | Cmaj.m | fields | 0 | 16bit then v3.0B
| 0 | flds | Cmaj.m | fields | 1 | 16bit then 16bit
| 1 | flds | Cmaj.m | fields | 0 | 16b then 1x v3.0B
| 1 | flds | Cmaj.m | fields | 1 | 16b/imm then 16bit

Notes:

  • Cmaj.m is the C major/minor opcode: 3 bits for major, 1 for minor
  • EXT000 and EXT001 are v3.0B Major Opcodes. The first 5 bits are zero, therefore the 6th bit is actually part of Cmaj.
  • "10bit then 16bit" means "this instruction is encoded C 10bit and the following one in C 16bit"
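The mode-switching rules above can be sketched as a small state machine. This is only an illustration under stated assumptions: MSB0 bit numbering (N is bit 0, M is bit 15), a hypothetical third state for the "one cycle" excursion that always returns to 16-bit mode after one instruction, and, for Cmaj.m 0b001.1, a reading in which only M is honoured because bit 0 is used as a field. All names are hypothetical.

```python
V30B = "v3.0B"            # standard 32-bit encoding
C16 = "C16"               # 16-bit Compressed encoding
V30B_ONCE = "v3.0B-once"  # one v3.0B instruction, then back to C16


def bits(hw: int, start: int, end: int) -> int:
    """Extract bits start..end (inclusive, MSB0 numbering) from a 16-bit value."""
    width = end - start + 1
    return (hw >> (15 - end)) & ((1 << width) - 1)


def next_mode(mode: str, hw: int) -> str:
    """Return the encoding mode of the instruction that follows `hw`.

    In v3.0B mode, `hw` is the first halfword of the instruction.
    """
    m = bits(hw, 15, 15)
    if mode == V30B_ONCE:
        return C16                      # assumption: always return after one insn
    if mode == V30B:
        if bits(hw, 0, 4) == 0:         # EXT000/EXT001: a C 10-bit instruction
            return C16 if m else V30B
        return V30B                     # ordinary v3.0B instruction
    # mode == C16
    if bits(hw, 5, 8) == 0b0011:        # Cmaj.m 001.1: bit 0 is a field, not N
        return C16 if m else V30B       # assumption: only M is honoured here
    n = bits(hw, 0, 0)
    return {(0, 0): V30B, (0, 1): C16, (1, 0): V30B_ONCE, (1, 1): C16}[(n, m)]
```

For example, `next_mode(V30B, 0x0401)` yields `C16` (a 10-bit instruction with M=1), and `next_mode(C16, 0x8400)` yields the one-instruction excursion state.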

C Instruction Encoding types

10-bit Opcode formats (all start with v3.0B EXT000 or EXT001 Major Opcodes)

| 01234    | 567  8 | 9  | a b | c  | d e | f | enc
| E01      | Cmaj.m | fld1     | fld2     | M | 10b
| E01      | Cmaj.m | offset              | M | 10b b
| E01      | 001.1  | S1 | fd1 | S2 | fd2 | M | 10b sub
| E01      | 111.m  | fld1     | fld2     | M | 10b LDST

16-bit Opcode formats (including 10/16/v3.0B Switching)

| 0 | 1234 | 567  8 | 9  | a b | c  | d e | f | enc
| N | immf | Cmaj.m | fld1     | fld2     | M | 16b
| 1 | immf | Cmaj.m | fld1     | imm      | 1 | 16b imm
| fd3      | 001.1  | S1 | fd1 | S2 | fd2 | M | 16b sub
| N | fd4  | 111.m  | fld1     | fld2     | M | 16b LDST

Notes:

  • fld1 and fld2 can contain reg numbers, immediates, or opcode fields (BO, BI, LK)
  • S1 and S2 are further sub-selectors of C 001.1
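To make the generic "16b" row of the table concrete, the following sketch splits a 16-bit value into the named fields. It assumes MSB0 numbering (column 0 is the most significant bit, column f the least) and covers only the generic row, not the sub, imm or LDST variants; the function name is hypothetical.

```python
def decode_16b(hw: int) -> dict:
    """Split a 16-bit C instruction into the fields of the generic '16b' format."""
    return {
        "N": (hw >> 15) & 0x1,     # bit 0
        "immf": (hw >> 11) & 0xF,  # bits 1-4
        "Cmaj": (hw >> 8) & 0x7,   # bits 5-7
        "m": (hw >> 7) & 0x1,      # bit 8
        "fld1": (hw >> 4) & 0x7,   # bits 9-b
        "fld2": (hw >> 1) & 0x7,   # bits c-e
        "M": hw & 0x1,             # bit f
    }
```

The 10-bit format uses the same layout with bits 0-4 fixed at zero, so the same function applies there with `N` and `immf` always reading as 0.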

Immediate Opcodes

Only available in 16-bit mode, and only when M=1 and N=1 and Cmaj.m is not 0b001.1.

instruction counts from objdump on /bin/bash:

  466 extsw r1,r1
  649 stw r1,1(r1)
  691 lwz r1,1(r1)
  705 cmpdi r1,1
  791 cmpwi r1,1
  794 addis r1,r1,1
 1474 std r1,1(r1)
 1846 li r1,1
 2031 mr r1,r1
 2473 addi r1,r1,1
 3012 nop
 3028 ld r1,1(r1)


| 0 | 1  | 2 | 3 4 | | 567.8 | 9ab  | cde | f |
| 1 | 0  | 0   0 0 | | 001.0 |      | 000 | 1 | TBD
| 1 | 0  |  sh2    | | 001.0 | RA   | sh  | 1 | sradi.
| 1 | 1  | 0   0 0 | | 001.0 |      | 000 | 1 | TBD
| 1 | 1  | 0 | sh2 | | 001.0 | RA   | sh  | 1 | srawi.
| 1 | 1  | 1 |     | | 001.0 |      |     | 1 | TBD
| 1 | i2 |  RT     | | 010.0 | RA|0 | imm | 1 | addi
| 1 | 0  | i2      | | 010.1 | RA   | imm | 1 | cmpdi
| 1 | 1  | i2      | | 010.1 | RA   | imm | 1 | cmpwi
| 1 | 0  | i2      | | 011.0 | RT   | imm | 1 | ldspi
| 1 | 1  | i2      | | 011.0 | RT   | imm | 1 | lwspi
| 1 | 0  | i2      | | 011.1 | RT   | imm | 1 | stwspi
| 1 | 1  | i2      | | 011.1 | RT   | imm | 1 | stdspi
| 1 | i2 | RA      | | 100.0 | RT   | imm | 1 | stwi
| 1 | i2 | RA      | | 100.1 | RT   | imm | 1 | stdi
| 1 | i2 | RT      | | 101.0 | RA   | imm | 1 | ldi
| 1 | i2 | RT      | | 101.1 | RA   | imm | 1 | lwi
| 1 | i2 | RA      | | 110.0 | RT   | imm | 1 | fsti
| 1 | i2 | RA      | | 110.1 | RT   | imm | 1 | fstdi
| 1 | i2 | RT      | | 111.0 | RA   | imm | 1 | flwi
| 1 | i2 | RT      | | 111.1 | RA   | imm | 1 | fldi

Construction of immediate:

  • LD/ST r1 (SP) variants should be offset by -256; see https://bugs.libre-soc.org/show_bug.cgi?id=238#c43
    • SP variants map to e.g. ld RT, imm(r1)
    • SV Prefixing can be used to map r1 to alternate regs
  • [1] not the same as v3.0B addis: the shift amount is smaller and actually still maps to within the v3.0B addi immediate range.
  • addi is EXTS(i2||imm) to give a 4-bit range -8 to +7
  • addis is EXTS(i2||imm||000) to give an 11-bit range -1024 to +1023 in increments of 8
  • all others are EXTS(i2||imm) to give a 7-bit range -128 to +127 (further for LD/ST due to word/dword-alignment)
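The addi case of the immediate construction can be sketched as follows. This assumes MSB0 numbering and the field positions shown in the addi row of the table (i2 in bit 1, imm in bits c-e); `exts` is a hypothetical helper standing in for the ISA's EXTS sign-extension.

```python
def exts(value: int, width: int) -> int:
    """Sign-extend the low `width` bits of `value` (two's complement)."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)


def addi_imm(hw: int) -> int:
    """EXTS(i2 || imm) for the 16-bit immediate-format addi."""
    i2 = (hw >> 14) & 0x1    # bit 1 (MSB0)
    imm = (hw >> 1) & 0x7    # bits c-e (MSB0)
    return exts((i2 << 3) | imm, 4)
```

The i2 bit acts as the sign bit, so the four bits together cover -8 to +7 as stated in the list.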

Further Notes:

  • bc also has an immediate mode, listed separately below in Branch section
  • for LD/ST, offset is aligned. 8-byte: i2||imm||0b000; 4-byte: i2||imm||0b00
  • SV Prefix over-rides help provide alternative bitwidths for LD/ST
  • RA|0 if RA is zero, addi. becomes "li"
    • this only works if RT takes part of opcode
    • mv is also possible by specifying an immediate of zero

Illegal and nop

Note that illeg is all zeros, including in the 16-bit mode. Given that C is allocated to OpenPOWER ISA Major opcodes EXT000 and EXT001 this ensures that in both 10-bit and 16-bit mode, a 16-bit run of all zeros is considered "illegal" whilst 0b0000.0000.1000.0000 is "nop"

| 16-bit mode | | 10-bit mode                 |
| 0 | 1 | 234 | | 567.8  | 9  ab | c   de | f |
| 0 | 0   000 | | 000.0  | 0  00 | 0   00 | 0 | illeg
| 0 | 0   000 | | 000.0  | 0  00 | 0   00 | 1 | nop

16 bit mode only:

| 1 | 0   000 | | 000.0  | 0  00 | 0   00 | 0 | nop
| 1 | nonzero | | 000.0  | 0  00 | 0   00 | 0 | TBD

Notes:

  • All-zeros being an illegal instruction is normal for ISAs. Ensuring that this remains true at all times i.e. for both 10 bit and 16 bit mode is common sense.
  • The 10-bit nop (bit 15, M=1) is intended for circumstances where alignment to 32-bit before returning to v3.0B is required. M=1 being an indication "return to Standard v3.0B Encoding Mode".
  • The 16-bit nop (bit 0, N=1) is intended for circumstances where a return to Standard v3.0B Encoding is required for one cycle, and alignment to a 32-bit boundary is needed for that cycle. An example would be a return to "strict" (non-C) mode where the PC may not be on a non-word-aligned boundary.
  • If for any reason multiple 16 bit nops are needed in succession the M=1 variant can be used, because each one returns to Standard v3.0B Encoding Mode, each time.

In essence the 2 nops are needed due to there being 2 different C forms: 10 and 16 bit.

Branch

| 16-bit mode | | 10-bit mode                 |
| 0 | 1 | 234 | | 567.8  | 9  ab | c   de | f |
| N | offs2   | | 000.LK | offs!=0        | M | b, bl
| 1 | offs2   | | 000.LK | BI    | BO1 oo | 1 | bc, bcl
| N | BO3 BI3 | | 001.0  | LK BI | BO     | M | bclr, bclrl

16 bit mode:

  • bc only available when N,M=0b11
  • offs2 extends offset in MSBs
  • BI3 extends BI in MSBs to allow selection of full CR
  • BO3 extends BO
  • bc offset constructed from oo as LSBs and offs2 as MSBs
  • bc BI allows selection of all bits from CR0 or CR1
  • bc CR check is always active (as if BO0=1) therefore BO1 inverts

10 bit mode:

  • illegal (all zeros) covers part of branch (offs=0,M=0,LK=0)
  • nop also covers part of branch (offs=0,M=0,LK=1)
  • bc not available in 10-bit mode
  • BO[0] enables CR check, BO[1] inverts check
  • BI refers to CR0 only (4 bits of)
  • no Branch Conditional with immediate
  • no Absolute Address
  • CTR mode allowed with BO[2] for b only.
  • offs is signed and 2-byte aligned
  • all branches are to 2-byte aligned addresses
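The branch offset rules in the two lists above can be sketched as below. Several things are assumptions here rather than settled encoding: MSB0 numbering, that the signed, 2-byte-aligned rule applies equally to the 16-bit forms, and that the resulting value is a byte displacement. Function names are hypothetical.

```python
def exts(value: int, width: int) -> int:
    """Sign-extend the low `width` bits of `value` (two's complement)."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)


def b_offset_10bit(hw: int) -> int:
    """b/bl in 10-bit mode: 6-bit offs in bits 9-e, in units of 2 bytes."""
    return exts((hw >> 1) & 0x3F, 6) * 2


def b_offset_16bit(hw: int) -> int:
    """b/bl in 16-bit mode: offs2 (bits 1-4) supplies the MSBs above offs."""
    offs2 = (hw >> 11) & 0xF
    offs = (hw >> 1) & 0x3F
    return exts((offs2 << 6) | offs, 10) * 2


def bc_offset_16bit(hw: int) -> int:
    """bc/bcl: offs2 (bits 1-4) as MSBs, oo (bits d-e) as LSBs."""
    offs2 = (hw >> 11) & 0xF
    oo = (hw >> 1) & 0x3
    return exts((offs2 << 2) | oo, 6) * 2
```

Under these assumptions the 10-bit b reaches -64 to +62 bytes, the 16-bit b reaches -1024 to +1022 bytes, and bc has the same reach as the 10-bit b.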

LD/ST

| 16-bit mode      | | 10-bit mode               |
| 0   | 1  | 2 3 4 | | 567.8 | 9 a b | c d e | f |
| RA2 | SZ |  RB   | | 001.1 | 1  RA | 0  RT | M | st
| RA2 | SZ |  RB   | | 001.1 | 1  RA | 1  RT | M | fst
| N   | SZ |  RT   | | 111.0 |  RA   |  RB   | M | ld
| N   | SZ |  RT   | | 111.1 |  RA   |  RB   | M | fld
  • elwidth overrides can set different widths

16 bit mode:

  • SZ=1 is 64 bit, SZ=0 is 32 bit
  • RA2 extends RA to 3 bits (MSB)
  • RT2 extends RT to 3 bits (MSB)

10 bit mode:

  • RA and RB are only 2 bit (0-3)
  • for LD, RT is implicitly RB: "ld RT=RB, RA(RB)"
  • for ST, there is no offset: "st RT, RA(0)"

Arithmetic

| 16-bit mode | | 10-bit mode             |
| 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
| N | 0 | RT  | | 010.0 | RB  | RA!=0 | M | add
| N | 0 | RT  | | 010.1 | RB  | RA|0  | M | sub.
| N | 0 | BF  | | 011.0 | RB  | RA|0  | M | cmpl

Notes:

  • sub. and cmpl: default CR target is CR0
  • for (RA|0) when RA=0 the input is a zero immediate, meaning that sub. becomes neg. and cmp becomes cmpi against zero
  • RT is implicitly RB: "add RT(=RB), RA, RB"
  • Opcode 0b010.0 RA=0 is not missing from the above: it is a system-wide instruction, "cbank" (section below)
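The way RA=0 is reused in the 10-bit arithmetic group can be sketched as a small decoder. This assumes MSB0 numbering and covers only the three 10-bit rows of the table plus cbank; the function name and the returned tuple layout are hypothetical.

```python
def decode_c10_arith(hw: int):
    """Decode a 10-bit arithmetic encoding to (mnemonic, RB, RA), or None."""
    if (hw >> 11) != 0:
        raise ValueError("not a 10-bit (EXT000/EXT001) encoding")
    cmaj_m = (hw >> 7) & 0xF   # bits 5-8: Cmaj.m
    rb = (hw >> 4) & 0x7       # bits 9-b; RT is implicitly RB
    ra = (hw >> 1) & 0x7       # bits c-e
    if cmaj_m == 0b0100:
        # RA=0 is not an add: that slot is borrowed by cbank
        return ("add" if ra != 0 else "cbank", rb, ra)
    if cmaj_m == 0b0101:
        # (RA|0): with RA=0 the input is zero, so sub. becomes neg.
        return ("sub." if ra != 0 else "neg.", rb, ra)
    if cmaj_m == 0b0110:
        return ("cmpl", rb, ra)
    return None                # belongs to another group
```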

16 bit mode only:

| 0 | 1 | 234 | | 567.8 | 9ab | cde   | f |
| N | 1 | RA  | | 010.0 | RB  | RS    | 0 | sld.
| N | 1 | RA  | | 010.1 | RB  | RS!=0 | 0 | srd.
| N | 1 | RA  | | 010.1 | RB  | 000   | 0 | srad.
| N | 1 | BF  | | 011.0 | RB  | RA|0  | 0 | cmpw

Notes:

  • for srad, RS=RA: "srad. RA(=RS), RS, RB"

Logical

| 16-bit mode   | | 10-bit mode             |
| 0 | 1 | 2 3 4 | | 567.8 | 9ab | c d e | f |
| N | 0 |  RT   | | 100.0 | RB  | RA!=0 | M | and
| N | 0 |  RT   | | 100.1 | RB  | RA!=0 | M | nand
| N | 0 |  RT   | | 101.0 | RB  | RA!=0 | M | or
| N | 0 |  RT   | | 101.1 | RB  | RA!=0 | M | nor
| N | 0 |  RT   | | 100.0 | RB  | 0 0 0 | M | extsw
| N | 0 |  RT   | | 100.1 | RB  | 0 0 0 | M | cntlz
| N | 0 |  RT   | | 101.0 | RB  | 0 0 0 | M | popcnt
| N | 0 |  RT   | | 101.1 | RB  | 0 0 0 | M | not

16-bit mode only:

| 0 | 1 | 2 3 4 | | 567.8 | 9ab | c d e | f |
| N | 1 |  RT   | | 100.0 | RB  | RA!=0 | 0 | TBD
| N | 1 |  RT   | | 100.1 | RB  | RA!=0 | 0 | TBD
| N | 1 |  RT   | | 101.0 | RB  | RA!=0 | 0 | xor
| N | 1 |  RT   | | 101.1 | RB  | RA!=0 | 0 | eqv (xnor)
| N | 1 |  RT   | | 100.0 | RB  | 0 0 0 | 0 | extsb
| N | 1 |  RT   | | 100.1 | RB  | 0 0 0 | 0 | cnttz
| N | 1 |  RT   | | 101.0 | RB  | 0 0 0 | 0 | TBD
| N | 1 |  RT   | | 101.1 | RB  | 0 0 0 | 0 | extsh

10 bit mode:

  • for (RA|0) when RA=0 the input is a zero immediate, meaning that nor becomes not
  • cntlz, popcnt, exts not available in 10-bit mode
  • RT is implicitly RB: "and RT(=RB), RA, RB"

Floating Point

Note here that elwidth overrides (SV Prefix) can be used to select FP16/32/64

| 16-bit mode   | | 10-bit mode             |
| 0 | 1 | 2 3 4 | | 567.8 | 9ab | c d e | f |
| N |   |  RT   | | 011.1 | RB  | RA!=0 | M | fsub.
| N | 0 |  RT   | | 110.0 | RB  | RA!=0 | M | fadd
| N | 0 |  RT   | | 110.1 | RB  | RA!=0 | M | fmul
| N | 0 |  RT   | | 011.1 | RB  | 0 0 0 | M | fneg.
| N | 0 |  RT   | | 110.0 | RB  | 0 0 0 | M |
| N | 0 |  RT   | | 110.1 | RB  | 0 0 0 | M |

16-bit mode only:

| 0 | 1 | 2 3 4 | | 567.8 | 9ab | c d e | f |
| N | 1 |  RT   | | 011.1 | RB  | RA!=0 | 0 |
| N | 1 |  RT   | | 110.0 | RB  | RA!=0 | 0 |
| N | 1 |  RT   | | 110.1 | RB  | RA!=0 | 0 | fdiv
| N | 1 |  RT   | | 011.1 | RB  | 0 0 0 | 0 | fabs.
| N | 1 |  RT   | | 110.0 | RB  | 0 0 0 | 0 | fmr.
| N | 1 |  RT   | | 110.1 | RB  | 0 0 0 | 0 |

16 bit only, FP to INT convert (using C 0b001.1 subencoding)

| 0123 | 4 | | 567.8 | 9 ab | cde  | f |
| 0010 | X | | 001.1 | 0 RA | Y RT | M | fp2int
| 0011 | X | | 001.1 | 0 RA | Y RT | M | int2fp
  • X: signed=1, unsigned=0
  • Y: FP32=0, FP64=1

10 bit mode:

  • fsub. fneg. and fmr. default target is CR1
  • fmr. is not available in 10-bit mode
  • fdiv is not available in 10-bit mode

16 bit mode:

  • fmr. copies RB to RT (and sets CR1)

Condition Register

| 16-bit mode   | | 10-bit mode            |
| 0 1 2 3 | 4   | | 567.8 | 9 ab | cde | f |
| 0 0 0 0 | BF2 | | 001.1 | 0 BF | BFA | M | mcrf
| 0 0 0 1 | BA2 | | 001.1 | 0 BA | BB  | M | crnor
| 0 1 0 0 | BA2 | | 001.1 | 0 BA | BB  | M | crandc
| 0 1 1 0 | BA2 | | 001.1 | 0 BA | BB  | M | crxor
| 0 1 1 1 | BA2 | | 001.1 | 0 BA | BB  | M | crnand
| 1 0 0 0 | BA2 | | 001.1 | 0 BA | BB  | M | crand
| 1 0 0 1 | BA2 | | 001.1 | 0 BA | BB  | M | creqv
| 1 1 0 1 | BA2 | | 001.1 | 0 BA | BB  | M | crorc
| 1 1 1 0 | BA2 | | 001.1 | 0 BA | BB  | M | cror

10 bit mode:

  • mcrf BF is only 2 bits which means the destination is only CR0-CR3
  • CR operations: not available in 10-bit mode (but mcrf is)

16 bit mode:

  • mcrf BF2 extends BF (in MSB) to 3 bits
  • CR operations: destination register is same as BA.
  • CR operations: only possible on CR0 and CR1

SV (Vector Mode):

  • CR operations: greatly extended reach/range (useful for predicates)

System

cbank: Selection of Compressed-encoding "Bank". Different "banks" give different meanings to opcodes. Example: CBank=0b001 is heavily optimised for Audio/Video Encode/Decode. cbank borrows from add's encoding space (when RA==0).

| 16-bit mode | | 10-bit mode             |
| 0 | 1 2 3 4 | | 567.8 | 9ab   | cde | f |
| N | 0 Bank2 | | 010.0 | CBank | 000 | M | cbank

not available in 10-bit mode:

| 0 1 2 3 | 4  | | 567.8 | 9 ab | cde  | f |
| 1 1 1 1 | 0  | | 001.1 | 0 00 |  RT  | M | mtlr
| 1 1 1 1 | 0  | | 001.1 | 0 01 |  RT  | M | mtctr
| 1 1 1 1 | 0  | | 001.1 | 0 11 |  RT  | M | mtcr
| 1 1 1 1 | 1  | | 001.1 | 0 00 |  RA  | M | mflr
| 1 1 1 1 | 1  | | 001.1 | 0 01 |  RA  | M | mfctr
| 1 1 1 1 | 1  | | 001.1 | 0 11 |  RA  | M | mfcr

Unallocated

| 0 1 2 3 | 4  | | 567.8 | 9 ab | cde  | f |
| 0 1 0 1 |    | | 001.1 | 0    |      | M |
| 1 0 1 0 |    | | 001.1 | 0    |      | M |
| 1 0 1 1 |    | | 001.1 | 0    |      | M |
| 1 1 0 0 |    | | 001.1 | 0    |      | M |
| 1 1 1 1 |    | | 001.1 | 0 10 |      | M |

Other ideas (Attempt 2)

8-bit mode-switching instructions, odd addresses for C mode

Drop the complexity of a 16-bit encoding that is further reduced to 10-bit, and use a single byte instead of two to switch between modes. This would place compressed (C) mode instructions at odd bytes, so that the processor can use the LSB of the PC to tell which mode it is in.

To switch from traditional to compressed mode, the single-byte instruction would be at the MSByte, that holds the EXT bits. (When we break up a 32-bit instruction across words, the most significant half should go in the word with the lower address.)

To switch from compressed mode to traditional mode, the single-byte instruction would also be at the opcode/format portion, placed in the lower-address word if split across words, so that the instruction can be recognized as the mode-switching one without going for its second byte.

The C-mode nop should be encoded so that its second byte encodes a switch to compressed mode, if decoded in traditional mode. This enables such a nop to straddle across a label:

8-bit first half of nop
Label:
8-bit second half of nop AKA switch to compressed mode
16-bit insns...

so that if traditional code jumps to the word-aligned label (because traditional branches drop the 2 LSB), it immediately switches to compressed mode; if we fall-through, we remain in 16-bit mode; and if we branch to it from compressed mode, whether we jump to the odd or the even address, we end up in compressed mode as desired.
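The three entry paths described above can be traced with a toy model. It is a sketch only, assuming the mode is derived purely from the LSB of the PC, that the one-byte switch advances the PC by 1, and that the C-mode nop advances it by 2. All names are hypothetical.

```python
def mode_of(pc: int) -> str:
    """Mode implied by the PC: odd addresses are compressed mode."""
    return "compressed" if pc & 1 else "traditional"


def traditional_branch_target(addr: int) -> int:
    """Traditional branches drop the 2 LSBs of the target."""
    return addr & ~0b11


def step_at_label(pc: int, label: int) -> int:
    """Advance the PC past a C-mode nop that straddles a word-aligned label."""
    if pc == label:         # entered on the even byte: decodes as 1-byte switch
        return pc + 1
    if pc == label - 1:     # entered on the odd byte: decodes as 2-byte C nop
        return pc + 2
    raise ValueError("pc is not on the straddling nop")


label = 0x1000
for entry in (traditional_branch_target(0x1002), label - 1, label):
    print(hex(entry), mode_of(step_at_label(entry, label)))
```

Every path lands on `label + 1`, an odd address, which is why the straddling nop leaves the processor in compressed mode regardless of how the label was reached.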

Tables explaining encoding:

| byte 0 | byte 1 | byte 2 | byte 3 |
| v3.0B standard 32 bit instruction |
| EXT000 | 16 bit          | 16...  |
| .. bit | 8nop   | v3.0b stand...  |
| .. ard 32 bit   | EXT000 | 16...  |
| .. bit | 16 bit          | 8nop   |
| v3.0B standard 32 bit instruction |

TODO

  • make a preliminary assessment of branch in/out viability
  • confirm FSM encoding (is LSB of PC really enough?)
  • guesstimate opcode and register allocation (without necessarily doing a full encoding)
  • write throwaway python program that estimates compression ratio from objdump raw parsing
  • finally do full opcode allocation
  • rerun objdump compression ratio estimates

Use 2- rather than 3-register opcodes

Successful compact ISAs have used 2- rather than 3-register insns, in which the same register serves as input and output. Some 20% of general-purpose 3-register insns already use either input register as output, without any effort by the compiler to do so.
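A figure like the one above can be estimated with a crude count over disassembly text. The sketch below assumes input lines already reduced to "mnemonic operands" form with register names such as r3, and it treats the first operand as the output without knowing which mnemonics actually write it (indexed stores, for instance, do not), so it is an approximation. The names are hypothetical.

```python
import re

# mnemonic followed by exactly three GPR operands, e.g. "add r3,r3,r4"
THREE_REG = re.compile(r'^\s*([a-z.]+)\s+r(\d+),r(\d+),r(\d+)\s*$')


def reuse_counts(lines):
    """Return (reused, total) over 3-register instructions in `lines`."""
    total = reused = 0
    for line in lines:
        m = THREE_REG.match(line)
        if not m:
            continue                       # not a 3-GPR-operand instruction
        total += 1
        rt, ra, rb = m.group(2, 3, 4)
        if rt in (ra, rb):                 # output register is also an input
            reused += 1
    return reused, total


sample = ["add r3,r3,r4", "add r5,r3,r4", "or r9,r10,r9", "addi r1,r1,16", "nop"]
print(reuse_counts(sample))
```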

Repurposing the 3 bits used to encode one of the input registers in arithmetic, logical and floating-point instructions, and the 2 bits used to encode the mode of the next two insns, we could make the full register files available to the opcodes already selected for compressed mode, with one bit to spare to bring additional opcodes in.

An opcode could be assigned to an instruction that combines and extends with the subsequent instruction, providing it with a separate input operand to use rather than the output register, or with additional range for immediate and offset operands, effectively forming a 32-bit operation, enabling us to remain in compressed mode even longer.

Analysis techniques and tools

objdump -d --no-show-raw-insn /bin/bash | sed 'y/\t/ /;
  s/^[ x0-9A-Fa-f]*: *\([a-z.]\+\) *\(.*\)/\1 \2 /p; d' |
  sed 's/\([, (]\)r[1-9][0-9]*/\1r1/g;
  s/\([ ,]\)-*[0-9]\+\([^0-9]\)/\11\2/g' | sort | uniq --count |
  sort -n | less
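For experiments that go beyond counting, the same normalisation can be sketched in Python. This is an approximation of the sed pipeline rather than a drop-in replacement: it accepts lower-case hex digits in the address column (an assumption about objdump's output), strips the trailing space, and works on lines of text so that no particular binary is needed. The function names are hypothetical.

```python
import re
from collections import Counter

# address column, colon, mnemonic, then the operand text
INSN = re.compile(r'^[ x0-9A-Fa-f]*: *([a-z.]+) *(.*)')
# any GPR r1..r31 after a separator becomes r1 (r0 is left alone, as in the sed)
REG = re.compile(r'([, (])r[1-9][0-9]*')
# any decimal number after a space or comma becomes 1
NUM = re.compile(r'([ ,])-*[0-9]+([^0-9])')


def normalise(line: str):
    """Reduce one objdump line to 'mnemonic operands' with regs/numbers unified."""
    m = INSN.match(line.rstrip('\n').replace('\t', ' '))
    if not m:
        return None                        # header, label or blank line
    text = f"{m.group(1)} {m.group(2)} "   # trailing space lets NUM see the last number
    text = REG.sub(r'\1r1', text)
    text = NUM.sub(r'\g<1>1\2', text)
    return text.rstrip()


def histogram(lines):
    """Count normalised instruction shapes, most common last (like sort -n)."""
    counts = Counter(n for n in map(normalise, lines) if n)
    return sorted(counts.items(), key=lambda kv: kv[1])
```

Feeding it the output of `objdump -d --no-show-raw-insn` line by line should give counts in the same shape as the list near the top of this page.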

gcc register allocation

FTR, information extracted from gcc's gcc/config/rs6000/rs6000.h about fixed registers (assigned to special purposes) and register allocation order:

Special-purpose registers on ppc are:

r0: constant zero/throw-away
r1: stack pointer
r2: thread-local storage pointer in 32-bit mode
r2: non-minimal TOC register
r10: EH return stack adjust register
r11: static chain pointer
r13: thread-local storage pointer in 64-bit mode
r30: minimal-TOC/-fPIC/-fpic base register
r31: frame pointer
lr: return address register

The register allocation order in GCC (i.e., it takes the earliest available register that fits the constraints) is:

 We allocate in the following order:
fp0     (not saved or used for anything)
fp13 - fp2  (not saved; incoming fp arg registers)
fp1     (not saved; return value)
fp31 - fp14 (saved; order given to save least number)
cr7, cr5    (not saved or special)
cr6     (not saved, but used for vector operations)
cr1     (not saved, but used for FP operations)
cr0     (not saved, but used for arithmetic operations)
cr4, cr3, cr2   (saved)
r9      (not saved; best for TImode)
r10, r8-r4  (not saved; highest first for less conflict with params)
r3      (not saved; return value register)
r11     (not saved; later alloc to help shrink-wrap)
r0      (not saved; cannot be base reg)
r31 - r13   (saved; order given to save least number)
r12     (not saved; if used for DImode or DFmode would use r13)
ctr     (not saved; when we have the choice ctr is better)
lr      (saved)
r1, r2, ap, ca  (fixed)
v0 - v1     (not saved or used for anything)
v13 - v3    (not saved; incoming vector arg registers)
v2      (not saved; incoming vector arg reg; return value)
v19 - v14   (not saved or used for anything)
v31 - v20   (saved; order given to save least number)
vrsave, vscr    (fixed)
sfp     (fixed)