16 bit Compressed

Similar to VLE (but without immediate-prefixing) this encoding is designed to fit on top of OpenPOWER ISA v3.0B when a "Modeswitch" bit is set (PCR is recommended). Note that Compressed is mutually exclusively incompatible with OpenPOWER v3.1B "prefixing" due to using (requiring) both EXT000 and EXT001. Hypothetically it could be made to use anything other than EXT001, with some inconvenience (extra gates). The incompatibility is "fixed" by swapping out of "Compressed" Mode and back into "Normal" (v3.1B) Mode, at runtime, as needed.

Although initially intended to be augmented by Simple-V Prefixing (to add Vector context, width overrides, e.g IEEE754 FP16, and predication) yet not put pressure on I-Cache power or size, this Compressed Encoding is not critically dependent on SV Prefixing, and may be used stand-alone.

See:

This one is a conundrum. OpenPOWER ISA was never designed with 16 bit in mind. VLE was added 10 years ago but only by way of marking an entire 64k page as "VLE". With VLE not maintained it is not fully compatible with current PowerISA.

Here, in order to embed 16 bit into a predominantly 32 bit stream the overhead of using an entire 16 bits just to switch into Compressed mode is itself a significant overhead. The situation is made worse by OpenPOWER ISA being fundamentally designed with 6 bits uniformly taking up Major Opcode space, leaving only 10 bits to allocate to actual instructions.

Contrast this with RVC which takes 3 out of 4 combinations of the first 2 bits for indicating 16-bit (anything with 0b00 to 0b10 in the LSBs), and uses the 4th (0b11) as a Huffman-style escape-sequence, easily allowing standard 32 bit and 16 bit to intermingle cleanly. To achieve the same thing on OpenPOWER would require a whopping 24 6-bit Major Opcodes which is clearly impractical: other schemes need to be devised.

In addition we would like to add SV-C32 which is a Vectorized version of 16 bit Compressed, and ideally have a variant that adds the 27-bit prefix format from SV-P64, as well.

Potential ways to reduce pressure on the 16 bit space are:

  • To use more than one v3.0B Major Opcode, preferably an odd-even contiguous pair
  • To provide "paging". This involves bank-switching to alternative optimised encodings for specific workloads
  • To enter "16 bit mode" for durations specified at the start
  • To reserve one bit of every 16 bit instruction to indicate that the 16 bit mode is to continue to be sustained

This latter would be useful in the Vector context to have an alternative meaning: as the bit which determines whether the instruction is 11-bit prefixed or 27-bit prefixed:

0 1 2 3 4 5 6 7 8 9 a b c d e f |
|major op | 11 bit vector prefix|
|16 bit opcode  alt vec. mode ^ |
| extra vector prefix if alt set|

Using a major opcode to enter 16 bit mode, leaves 11 bits to find something to use them for:

0 1 2 3 4 5 6 7 8 9 a b c d e f |
|major op | what to do here   1 |
|16 bit    stay in 16bit mode 1 |
|16 bit    stay in 16bit mode 1 |
|16 bit       exit 16bit mode 0 |

One possibility is that the 11 bits are used for bank selection, with some room for additional context such as altering the registers used for the 16 bit operations (bank selection of which scalar regs). However the downside is that short sequences of Compressed instructions become penalised by the fixed overhead. Even a single 16 bit instruction requires a 16 bit overhead to "gain access" to 16 bit "mode", making the exercise pointless.

An alternative is to use the first 11 bits for only the utmost commonly used instructions. That being the case then one of those 11 bits could be dedicated to saying if 16 bit mode is to be continued, at which point all 16 bits can be used for Compressed. 10 bits remain for actual opcodes, which is ridiculously tight, however the opportunity to subsequently use all 16 bits is worth it.

The reason for picking 2 contiguous Major v3.0B opcodes is illustrated below:

|0 1 2 3 4 5 6 7 8 9 a b c d e f|
|major op..0| LO Half C space   |
|major op..1| HI Half C space   |
|N N N N N|<--11 bits C space-->|

If NNNNN is the same value (two contiguous Major v3.0B Opcodes) this saves gates at a critical part of the decode phase.

ABI considerations

Unlike RISC-V RVC, the above "context" encodings require state, to be stored in the PCR, MSR, or a dedicated SPR. These bits (just like LE/BE 32bit mode and the IEEE754 FPCSR mode) all require taking that context into consideration.

In particular it is critically important to recognise that context (in general) is an implicit part of the ABI implemented for example by glibc6. Therefore (in specific) Compressed Mode Context must not be permitted to cross into or out of a function call.

Thus it is the mandatory responsibility of the compiler to ensure that context returns to "v3.0B Standard" prior to entering a function call (responsibility of caller) and prior to exit from a function call (responsibility of callee) by setting appropriate M and N bits.

If however it is known to the compiler that certain static leaf node functions and their immediate callers will never, under any circumstances, be called by externsl ABI compliant code, then of course the compiler may choose to write such static functions as it sees fit.

Trap Handlers also take responsibility for saving and restoring of Compressed Mode state, just as they already take responsibility for other critical state. This makes traps transparent to functions as far as Compressed Mode Context is concerned, just as traps are already transparent to functions.

Note however that there are exceptions in a compiler to the otherwise hard rule that Compressed Mode context not be permitted to cross function boundaries: inline functions and static functions. static functions, if correctly identified as never to be called externally, may, as an optimisation, disregard standard ABIs, bearing in mind that this will be fraught (pointers to functions) and not easy to get right.

Opcode Allocation Ideas

Opcodes exploration (Attempt 1)

Switching between different encoding modes is controlled by M (alone) in 10-bit mode, and M and N in 16-bit mode.

  • M in 10-bit mode if zero indicates that following instructions are standard OpenPOWER ISA 32-bit encoded (including, redundantly, further 10/16-bit instructions)
  • M in 10-bit mode if 1 indicates that following instructions are in 16-bit encoding mode

Once in 16-bit mode:

  • 0b01 (M=1, N=0): stay in 16-bit mode
  • 0b00: leave 16-bit mode permanently (return to standard OpenPOWER ISA)
  • 0b10: leave 16-bit mode for one cycle (return to standard OpenPOWER ISA)
  • 0b11: free to be used for something completely different.

The current "top" idea for 0b11 is to use it for a new encoding format of predominantly "immediates-based" 16-bit instructions (branch-conditional, addi, mulli etc.)

  • The Compressed Major Opcode is in bits 5-7.
  • Minor opcode in bit 8.
  • In some cases bit 9 is taken as an additional sub-opcode, followed by bits 0-4 (for CR operations)
  • M+N mode-switching is not available for C-Major.minor 0b001.1
  • 10 bit mode may be expanded by 16 bit mode, adding capabilities that do not fit in the extreme limited space.

Mode-switching FSM showing relationship between v3.0B, C 10bit and C 16bit. 16-bit immediate mode remains in 16-bit.

| 0 | 1234 | 567  8 | 9abcde | f | explanation
| - | ---- | ------ | ------ | - | -----------
| EXT000/1 | Cmaj.m | fields | 0 | 10bit then v3.0B
| EXT000/1 | Cmaj.m | fields | 1 | 10bit then 16bit
| 0 | flds | Cmaj.m | fields | 0 | 16bit then v3.0B
| 0 | flds | Cmaj.m | fields | 1 | 16bit then 16bit
| 1 | flds | Cmaj.m | fields | 0 | 16b, 1x v3.0B, 16b
| 1 | flds | Cmaj.m | fields | 1 | 16b/imm then 16bit

Notes:

  • Cmaj.m is the C major/minor opcode: 3 bits for major, 1 for minor
  • EXT000 and EXT001 are v3.0B Major Opcodes. The first 5 bits are zero, therefore the 6th bit is actually part of Cmaj.
  • "10bit then 16bit" means "this instruction is encoded C 10bit and the following one in C 16bit"
  • "16b, 1x v3.0B, 16b" means, "this instruction is encoded C 16bit, the following one is V3.0B Standard, and the one after that is back to 16bit".

C Instruction Encoding types

10-bit Opcode formats (all start with v3.0B EXT000 or EXT001 Major Opcodes)

| 01234    | 567  8 | 9  | a b | c  | d e | f | enc
| E01      | Cmaj.m | fld1     | fld2     | M | 10b
| E01      | Cmaj.m | offset              | M | 10b b
| E01      | 001.1  | S1 | fd1 | S2 | fd2 | M | 10b sub
| E01      | 111.m  | fld1     | fld2     | M | 10b LDST

16-bit Opcode formats (including 10/16/v3.0B Switching)

| 0 | 1234 | 567  8 | 9  | a b | c  | d e | f | enc
| N | immf | Cmaj.m | fld1     | fld2     | M | 16b
| 1 | immf | Cmaj.m | fld1     | imm      | 1 | 16b imm
| N | fd3  | 001.1  | S1 | fd1 | S2 | fd2 | M | 16b sub
| N | fd4  | 111.m  | fld1     | fld2     | M | 16b LDST

Notes:

  • fld1 and fld2 can contain reg numbers, immediates, or opcode fields (BO, BI, LK)
  • S1 and S2 are further sub-selectors of C 001.1

Immediate Opcodes

only available in 16-bit mode, only available when M=1 and N=1 and when Cmaj.min is not 0b001.1.

instruction counts from objdump on /bin/bash:

  466 extsw r1,r1
  649 stw r1,1(r1)
  691 lwz r1,1(r1)
  705 cmpdi r1,1
  791 cmpwi r1,1
  794 addis r1,r1,1
 1474 std r1,1(r1)
 1846 li r1,1
 2031 mr r1,r1
 2473 addi r1,r1,1
 3012 nop
 3028 ld r1,1(r1)


| 0 | 1  | 2 | 3 4 | | 567.8 | 9ab  | cde | f |
| 1 | 0  | 0   0 0 | | 001.0 |      | 000 | 1 | TBD
| 1 | 0  |  sh2    | | 001.0 | RA   | sh  | 1 | sradi.
| 1 | 1  | 0   0 0 | | 001.0 |      | 000 | 1 | TBD
| 1 | 1  | 0 | sh2 | | 001.0 | RA   | sh  | 1 | srawi.
| 1 | 1  | 1 |     | | 001.0 | 000  | imm | 1 | TBD
| 1 | 1  | 1 | i2  | | 001.0 | RA!=0| imm | 1 | addis
| 1 | 0  | i2      | | 010.0 | 000  | imm | 1 | setvli
| 1 | 1  | i2      | | 010.0 | 000  | imm | 1 | setmvli
| 1 | i2           | | 010.0 | RA!=0| imm | 1 | addi
| 1 | 0  | i2      | | 010.1 | RA   | imm | 1 | cmpdi
| 1 | 1  | i2      | | 010.1 | RA   | imm | 1 | cmpwi
| 1 | 0  | i2      | | 011.0 | RT   | imm | 1 | ldspi
| 1 | 1  | i2      | | 011.0 | RT   | imm | 1 | lwspi
| 1 | 0  | i2      | | 011.1 | RT   | imm | 1 | stwspi
| 1 | 1  | i2      | | 011.1 | RT   | imm | 1 | stdspi
| 1 | i2 | RA      | | 100.0 | RT   | imm | 1 | stwi
| 1 | i2 | RA      | | 100.1 | RT   | imm | 1 | stdi
| 1 | i2 | RT      | | 101.0 | RA   | imm | 1 | ldi
| 1 | i2 | RT      | | 101.1 | RA   | imm | 1 | lwi
| 1 | i2 | RA      | | 110.0 | RT   | imm | 1 | fsti
| 1 | i2 | RA      | | 110.1 | RT   | imm | 1 | fstdi
| 1 | i2 | RT      | | 111.0 | RA   | imm | 1 | flwi
| 1 | i2 | RT      | | 111.1 | RA   | imm | 1 | fldi

Construction of immediate:

  • LD/ST r1 (SP) variants should be offset by -256 see https://bugs.libre-soc.org/show_bug.cgi?id=238#c43
    • SP variants map to e.g ld RT, imm(r1)
    • SV Prefixing can be used to map r1 to alternate regs
  • [1] not the same as v3.0B addis: the shift amount is smaller and actually still maps to within the v3.0B addi immediate range.
  • addi is EXTS(i2||imm) to give a 4-bit range -8 to +7
  • addis is EXTS(i2||imm||000) to give a 11-bit range -1024 to +1023 in increments of 8
  • all others are EXTS(i2||imm) to give a 7-bit range -128 to +127 (further for LD/ST due to word/dword-alignment)

Further Notes:

  • bc also has an immediate mode, listed separately below in Branch section
  • for LD/ST, offset is aligned. 8-byte: i2||imm||0b000 4-byte: 0b00
  • SV Prefix over-rides help provide alternative bitwidths for LD/ST
  • RA|0 if RA is zero, addi. becomes "li"
    • this only works if RT takes part of opcode
    • mv is also possible by specifying an immediate of zero

Illegal, nop and attn

Note that illeg is all zeros, including in the 16-bit mode. Given that C is allocated to OpenPOWER ISA Major opcodes EXT000 and EXT001 this ensures that in both 10-bit and 16-bit mode, a 16-bit run of all zeros is considered "illegal" whilst 0b0000.0000.1000.0000 is "nop"

| 16-bit mode | | 10-bit mode                 |
| 0 | 1 | 234 | | 567.8  | 9  ab | c   de | f |
| - | - | --- | | -----  | ----- | ------ | - |
| 0 | 0   000 | | 000.0  | 0  00 | 0   00 | 0 | illeg
| 0 | 0   000 | | 000.0  | 0  00 | 0   00 | 1 | nop

16 bit mode only:

| 0 | 1 | 234 | | 567.8  | 9  ab | c   de | f |
| - | - | --- | | -----  | ----- | ------ | - |
| 1 | 0   000 | | 000.0  | 0  00 | 0   00 | 0 | nop
| 1 | 0   000 | | 000.0  | 0  00 | 0   00 | 1 | nop
| N | 1   000 | | 000.0  | 0  00 | 0   00 | M | attn

Notes:

  • All-zeros being an illegal instruction is normal for ISAs. Ensuring that this remains true at all times i.e. for both 10 bit and 16 bit mode is common sense.
  • The 10-bit nop (bit 15, M=1) is intended for circumstances where alignment to 32-bit before returning to v3.0B is required. M=1 being an indication "return to Standard v3.0B Encoding Mode".
  • The 16-bit nop (bit 0, N=1) is intended for circumstances where a return to Standard v3.0B Encoding is required for one cycle but one cycle where alignment to a 32-bit boundary is needed. Examples of this would be to return to "strict" (non-C) mode where the PC may not be on a non-word-aligned boundary.
  • If for any reason multiple 16 bit nops are needed in succession the M=1 variant can be used, because each one returns to Standard v3.0B Encoding Mode, each time.

In essence the 2 nops are needed due to there being 2 different C forms: 10 and 16 bit.

Branch

TODO: document that branching whilst using mode-switching bits (M/N) is perfectly well permitted, the caveat being: it is specifically and wholly the complier/assembler writers responsibility to obey ABI rules and ensure that even with branches and returns that, at no time, is an incorrect mode entered or left that could result in any instruction being misinterpreted.

| 16-bit mode | | 10-bit mode                 |
| 0 | 1 | 234 | | 567.8  | 9  ab | c   de | f |
| - | - | --- | | -----  | ----- | ------ | - |
| N | offs2   | | 000.LK | offs!=0        | M | b, bl
| N |         | | 000.1  | 0  00 | 0   00 | M | TBD
| 1 | offs2   | | 000.LK | BI    | BO1 oo | 1 | bc, bcl
| N | BO3 BI3 | | 001.0  | LK BI | BO     | M | bclr, bclrl

16 bit mode:

  • bc only available when N,M=0b11
  • offs2 extends offset in MSBs
  • BI3 extends BI in MSBs to allow selection of full CR
  • BO3 extends BO
  • bc offset constructed from oo as LSBs and offs2 as MSBs
  • bc BI allows selection of all bits from CR0 or CR1
  • bc CR check is always active (as if BO0=1) therefore BO1 inverts

10 bit mode:

  • illegal (all zeros) covers part of branch (offs=0,M=0,LK=0)
  • nop also covers part of branch (offs=0,M=0,LK=1)
  • bc not available in 10-bit mode
  • BO[0] enables CR check, BO[1] inverts check
  • BI refers to CR0 only (4 bits of)
  • no Branch Conditional with immediate
  • no Absolute Address
  • CTR mode allowed with BO[2] for b only.
  • offs is to 2 byte (signed) aligned
  • all branches to 2 byte aligned

LD/ST

Note: for 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)

| 16-bit mode  | | 10-bit mode               |
| 0 | 1  | 234 | | 567.8 | 9 a b | c d e | f |
| - | -- | --- | | ----- | ----- | ----- | - |
| N | SZ |  RB | | 001.1 | 1  RA | 0  RT | M | st
| N | SZ |  RB | | 001.1 | 1  RA | 1  RT | M | fst
| N | SZ |  RT | | 111.0 |  RA   |  RB   | M | ld
| N | SZ |  RT | | 111.1 |  RA   |  RB   | M | fld
  • elwidth overrides can set different widths

16 bit mode:

  • SZ=1 is 64 bit, SZ=0 is 32 bit

10 bit mode:

  • RA and RB are only 2 bit (0-3)
  • for LD, RT is implicitly RB: "ld RT=RB, RA(RB)"
  • for ST, there is no offset: "st RT, RA(0)"

Arithmetic

  • 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
  • 16-bit: note that bit 1==0 (sub-sub-encoding)

10 and 16 bit:

| 16-bit mode | | 10-bit mode             |
| 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
| - | - | --- | | ----- | --- | ----- | - |
| N | 0 | RT  | | 010.0 | RB  | RA!=0 | M | add
| N | 0 | RT  | | 010.1 | RB  | RA|0  | M | sub.
| N | 0 | BF  | | 011.0 | RB  | RA|0  | M | cmpl

Notes:

  • sub. and cmpl: default CR target is CR0
  • for (RA|0) when RA=0 the input is a zero immediate, meaning that sub. becomes neg. and cmp becomes cmpi against zero
  • RT is implicitly RB: "add RT(=RB), RA, RB"
  • Opcode 0b010.0 RA=0 is not missing from the above: it is a system-wide instruction, "cbank" (section below)

16 bit mode only:

| 0 | 1 | 234 | | 567.8 | 9ab | cde   | f |
| - | - | --- | | ----- | --- | ----- | - |
| N | 1 | RA  | | 010.0 | RB  | RS    | M | sld.
| N | 1 | RA  | | 010.1 | RB  | RS!=0 | M | srd.
| N | 1 | RA  | | 010.1 | RB  | 000   | M | srad.
| N | 1 | BF  | | 011.0 | RB  | RA|0  | M | cmpw

Notes:

  • for srad, RS=RA: "srad. RA(=RS), RS, RB"

Logical

  • 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
  • 16-bit: note that bit 1==0 (sub-sub-encoding)

10 and 16 bit:

| 16-bit mode | | 10-bit mode             |
| 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
| - | - | --- | | ----- | --- | ----- | - |
| N | 0 |  RT | | 100.0 | RB  | RA!=0 | M | and
| N | 0 |  RT | | 100.1 | RB  | RA!=0 | M | nand
| N | 0 |  RT | | 101.0 | RB  | RA!=0 | M | or
| N | 0 |  RT | | 101.1 | RB  | RA!=0 | M | nor/mr
| N | 0 |  RT | | 100.0 | RB  | 0 0 0 | M | popcnt
| N | 0 |  RT | | 100.1 | RB  | 0 0 0 | M | cntlz
| N | 0 |  RT | | 101.0 | RB  | 0 0 0 | M | extsw
| N | 0 |  RT | | 101.1 | RB  | 0 0 0 | M | not

16-bit mode only (note that bit 1 == 1):

| 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
| - | - | --- | | ----- | --- | ----- | - |
| N | 1 |  RT | | 100.0 | RB  | RA!=0 | M | TBD
| N | 1 |  RT | | 100.1 | RB  | RA!=0 | M | TBD
| N | 1 |  RT | | 101.0 | RB  | RA!=0 | M | xor
| N | 1 |  RT | | 101.1 | RB  | RA!=0 | M | eqv (xnor)
| N | 1 |  RT | | 100.0 | RB  | 0 0 0 | M | setvl.
| N | 1 |  RT | | 100.1 | RB  | 0 0 0 | M | cnttz
| N | 1 |  RT | | 101.0 | RB  | 0 0 0 | M | extsb
| N | 1 |  RT | | 101.1 | RB  | 0 0 0 | M | extsh

10 bit mode:

  • idea: for 10bit mode, nor is actually 'mr' because mr is a more common operation. in 16bit however, this encoding (Cmaj.min=0b101.1, N=0) is 'nor'
  • for (RA|0) when RA=0 the input is a zero immediate, meaning that nor becomes not
  • cntlz, popcnt, exts not available in 10-bit mode
  • RT is implicitly RB: "and RT(=RB), RA, RB"

Floating Point

Note here that elwidth overrides (SV Prefix) can be used to select FP16/32/64

  • 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
  • 16-bit: note that bit 1==0 (sub-sub-encoding)

10 and 16 bit:

| 16-bit mode | | 10-bit mode             |
| 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
| - | - | --- | | ----- | --- | ----- | - |
| N |   |  RT | | 011.1 | RB  | RA!=0 | M | fsub.
| N | 0 |  RT | | 110.0 | RB  | RA!=0 | M | fadd
| N | 0 |  RT | | 110.1 | RB  | RA!=0 | M | fmul
| N | 0 |  RT | | 011.1 | RB  | 0 0 0 | M | fneg.
| N | 0 |     | | 110.0 |     | 0 0 0 | M | TBD
| N | 0 |     | | 110.1 |     | 0 0 0 | M | TND

16-bit mode only (note that bit 1 == 1):

| 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
| - | - | --- | | ----- | --- | ----- | - |
| N | 1 |     | | 011.1 |     | RA!=0 | M | TBD
| N | 1 |     | | 110.0 |     | RA!=0 | M | TBD
| N | 1 |  RT | | 110.1 | RB  | RA!=0 | M | fdiv
| N | 1 |  RT | | 011.1 | RB  | 0 0 0 | M | fabs.
| N | 1 |  RT | | 110.0 | RB  | 0 0 0 | M | fmr.
| N | 1 |     | | 110.1 |     | 0 0 0 | M | TBD

16 bit only, FP to INT convert (using C 0b001.1 subencoding)

| 0 | 123 | 4 | | 567.8 | 9 ab | cde  | f |
| - | --- | - | | ----- | ---- | ---- | - |
| N | 101 | X | | 001.1 | 0 RA | Y RT | M | fp2int
| N | 110 | X | | 001.1 | 0 RA | Y RT | M | int2fp
  • X: signed=1, unsigned=0
  • Y: FP32=0, FP64=1

10 bit mode:

  • fsub. fneg. and fmr. default target is CR1
  • fmr. is not available in 10-bit mode
  • fdiv is not available in 10-bit mode

16 bit mode:

  • fmr. copies RB to RT (and sets CR1)

Condition Register

10-bit or 16 bit:

| 16-bit mode| | 10-bit mode            |
| 0 | 123 | 4   | | 567.8 | 9 ab | cde | f |
| - | --- | --- | | ----- | ---- | --- | - |
| N | 000 | BF2 | | 001.1 | 0 BF | BFA | M | mcrf

16-bit only:

| 0 | 1234 | | 567.8 | 9 ab | cde | f |
| - | ---- | | ----- | ---- | --- | - |
| N | 0010 | | 001.1 | 0 BA | BB  | M | crnor
| N | 0011 | | 001.1 | 0 BA | BB  | M | crandc
| N | 0100 | | 001.1 | 0 BA | BB  | M | crxor
| N | 0101 | | 001.1 | 0 BA | BB  | M | crnand
| N | 0110 | | 001.1 | 0 BA | BB  | M | crand
| N | 0111 | | 001.1 | 0 BA | BB  | M | creqv
| N | 1000 | | 001.1 | 0 BA | BB  | M | crorc
| N | 1001 | | 001.1 | 0 BA | BB  | M | cror

Notes

10 bit mode:

  • mcrf BF is only 2 bits which means the destination is only CR0-CR3
  • CR operations: not available in 10-bit mode (but mcrf is)

16 bit mode:

  • mcrf BF2 extends BF (in MSB) to 3 bits
  • CR operations: destination register is same as BA.
  • CR operations: only possible on CR0 and CR1

SV (Vector Mode):

  • CR operations: greatly extended reach/range (useful for predicates)

System

cbank: Selection of Compressed-encoding "Bank". Different "banks" give different meanings to opcodes. Example: CBank=0b001 is heavily optimised to A/Video Encode/Decode. cbank borrows from add's encoding space (when RA==0)

| 16-bit mode | | 10-bit mode             |
| 0 | 1 2 3 4 | | 567.8 | 9ab   | cde | f |
| - | ------- | | ----- | ----- | --- | - |
| N | 0 Bank2 | | 010.0 | CBank | 000 | M | cbank

not available in 10-bit mode, only in 16-bit mode:

| 0 | 1 | 234 | | 567.8 | 9 ab | cde  | f |
| - | ------- | | ----- | ---- | ---- | - |
| N | 1 | 111 | | 001.1 | 0 00 |  RT  | M | mtlr
| N | 1 | 111 | | 001.1 | 0 01 |  RT  | M | mtctr
| N | 1 | 111 | | 001.1 | 0 00 |  RA  | M | mflr
| N | 1 | 111 | | 001.1 | 0 01 |  RA  | M | mfctr
| N | 0 RA!=0 | | 000.0 | 0 00 |  000 | M | mtcr
| N | 1 RT!=0 | | 000.0 | 0 00 |  000 | M | mfcr

Unallocated

16-bit only:

| 0 | 1 | 234 | | 567.8 | 9 ab | cde  | f |
| - | - | --- | | ----- | ---- | ---- | - |
| N | 1 | 111 | | 001.1 | 0 10 |      | M |
| N | 1 | 111 | | 001.1 | 0 11 |      | M |

Other ideas (Attempt 2)

8-bit mode-switching instructions, odd addresses for C mode

Drop the complexity of the 16-bit encoding further reduced to 10-bit, and use a single byte instead of two to switch between modes. This would place compressed (C) mode instructions at odd bytes, so the LSB of the PC can be used for the processor to tell which mode it is in.

To switch from traditional to compressed mode, the single-byte instruction would be at the MSByte, that holds the EXT bits. (When we break up a 32-bit instruction across words, the most significant half should go in the word with the lower address.)

To switch from compressed mode to traditional mode, the single-byte instruction would also be at the opcode/format portion, placed in the lower-address word if split across words, so that the instruction can be recognized as the mode-switching one without going for its second byte.

The C-mode nop should be encoded so that its second byte encodes a switch to compressed mode, if decoded in traditional mode. This enables such a nop to straddle across a label:

8-bit first half of nop
Label:
8-bit second half of nop AKA switch to compressed mode
16-bit insns...

so that if traditional code jumps to the word-aligned label (because traditional branches drop the 2 LSB), it immediately switches to compressed mode; if we fall-through, we remain in 16-bit mode; and if we branch to it from compressed mode, whether we jump to the odd or the even address, we end up in compressed mode as desired.

Tables explaining encoding:

| byte 0 | byte 1 | byte 2 | byte 3 |
| v3.0B standard 32 bit instruction |
| EXT000 | 16 bit          | 16...  |
| .. bit | 8nop   | v3.0b stand...  |
| .. ard 32 bit   | EXT000 | 16...  |
| .. bit | 16 bit          | 8nop   |
| v3.0B standard 32 bit instruction |

Other ideas (v3)

FSM state switching and mode switching deemed too complex. Instead cut back to

  1. 10bit only (actually, 11 bit)
  2. SV-Prefixed 16bit only (aka SV-C32)

Each will be entirely different which is a huge amount of work.

TODO

  • make a preliminary assessment of branch in/out viability
  • confirm FSM encoding (is LSB of PC really enough?)
  • guestimate opcode and register allocation (without necessarily doing a full encoding)
  • write throwaway python program that estimates compression ratio from objdump raw parsing
  • finally do full opcode allocation
  • rerun objdump compression ratio estimates
  • check in FSM if "return to v3.0B then 16bit" if it is ok to have the v3.0B be a 10bit Compressed. should this be ignored and carry on? should a trap occur?

Use 2- rather than 3-register opcodes

Successful compact ISAs have used 2- rather than 3-register insns, in which the same register serves as input and output. Some 20% of general-purpose 3-register insns already use either input register as output, without any effort by the compiler to do so.

Repurposing the 3 bits used to encode one one of the input registers in arithmetic, logical and floating-pointer registers, and the 2 bits used to encode the mode of the next two insns, we could make the full register files available to the opcodes already selected for compressed mode, with one bit to spare to bring additional opcodes in.

An opcode could be assigned to an instruction that combines and extends with the subsequent instruction, providing it with a separate input operand to use rather than the output register, or with additional range for immediate and offset operands, effectively forming a 32-bit operation, enabling us to remain in compressed mode even longer.

Appendix

Analysis techniques and tools

objdump -d --no-show-raw-insn /bin/bash | sed 'y/\t/ /;
  s/^[ x0-9A-F]*: *\([a-z.]\+\) *\(.*\)/\1 \2 /p; d' |
  sed 's/\([, (]\)r[1-9][0-9]*/\1r1/g;
  s/\([ ,]\)-*[0-9]\+\([^0-9]\)/\11\2/g' | sort | uniq --count |
  sort -n | less

gcc register allocation

FTR, information extracted from gcc's gcc/config/rs6000/rs6000.h about fixed registers (assigned to special purposes) and register allocation order:

Special-purpose registers on ppc are:

r0: constant zero/throw-away
r1: stack pointer
r2: thread-local storage pointer in 32-bit mode
r2: non-minimal TOC register
r10: EH return stack adjust register
r11: static chain pointer
r13: thread-local storage pointer in 64-bit mode
r30: minimal-TOC/-fPIC/-fpic base register
r31: frame pointer
lr: return address register

the register allocation order in GCC (i.e., it takes the earliest available register that fits the constraints) is:

We allocate in the following order:

fp0     (not saved or used for anything)
fp13 - fp2  (not saved; incoming fp arg registers)
fp1     (not saved; return value)
fp31 - fp14 (saved; order given to save least number)
cr7, cr5    (not saved or special)
cr6     (not saved, but used for vector operations)
cr1     (not saved, but used for FP operations)
cr0     (not saved, but used for arithmetic operations)
cr4, cr3, cr2   (saved)
r9      (not saved; best for TImode)
r10, r8-r4  (not saved; highest first for less conflict with params)
r3      (not saved; return value register)
r11     (not saved; later alloc to help shrink-wrap)
r0      (not saved; cannot be base reg)
r31 - r13   (saved; order given to save least number)
r12     (not saved; if used for DImode or DFmode would use r13)
ctr     (not saved; when we have the choice ctr is better)
lr      (saved)
r1, r2, ap, ca  (fixed)
v0 - v1     (not saved or used for anything)
v13 - v3    (not saved; incoming vector arg registers)
v2      (not saved; incoming vector arg reg; return value)
v19 - v14   (not saved or used for anything)
v31 - v20   (saved; order given to save least number)
vrsave, vscr    (fixed)
sfp     (fixed)

Comparison to VLE

VLE was a means to reduce executable size through three interleaved methods:

  • (1) invention of 16 bit encodings (of exactly 16 bit in length)
  • (2) invention of 16+16 bit encodings (a 16 bit instruction format but with an additional 16 bit immediate "tacked on" to the end, actually making a 32-bit instruction format)
  • (3) seamless and transparent embedding and intermingling of the above in amongst arbitrary v2.06/7 BE 32 bit instruction sequences, with no additional state, including when the PC was not aligned on a 4-byte boundary.

Whilst (1) and (3) make perfect sense, (2) makes no sense at all given that, as inspection of "ori" and others show, I-Form 16 bit immediates is the "norm" for v2.06/7 and v3.0B standard instructions. (2) in effect is a 32 bit instruction. (2) is not a 16 bit instruction.

Why "reinvent" an encoding that is 32 bit, when there already exists a 32 bit encoding that does the exact same job?

Consequently, we do not envisage a scenario where (2) would ever be implemented, nor in the future would this Compressed Encoding be extended beyond 16 bit. Compressed is Compressed and is by definition limited to precisely - and only - 16 bit.

The additional reason why that is the case is because VLE is exceptionally complex to implement. In a single-issue, low clock rate "Embedded" environment for which VLE was originally designed, VLE was perfectly well matched.

However this Compressed Encoding is designed for High performance multi-issue systems as well as Embedded scenarios, and consequently, the complexity of "deep packet inspection" down into the depths of a 16 bit sequence in order to ascertain if it might not be 16 bit after all, is wholly unacceptable.

By eliminating such 16+16 (actually, 32bit conflation) tricks outlined in (2), Compressed is specifically designed to fit into a very small FSM, suitable for multi-issue, that in no way requires "deep-dive" analysis. Yet, despite it never being designed with 16 bit encodings in mind, is still suitable for retro-fitting onto OpenPOWER.

Compressed Decoder Phases

Phase 1 (stage 1 of a 2-stage pipelined decoder) is defined as the minimum necessary FSM required to determine instruction length and mode. This is implemented with the absolute bare minimum of gates and is based on the 6 encodings involving N, M and EXTNNN (see table, below)

Phase 2 (stage 2 of a 2-stage pipelined decoder) is defined as the "full decoder" that includes taking into account the length and mode from Phase 1. Given a 2-stage pipelined decoder it is categorically impossible for Phase 2 to go backwards in time and affect the decisions made in Phase 1.

These two phases are specifically designed to take multi-issue execution into account. Phase 1 is intended to be part of an O(log N) algorithm that can use a form of carry-lookahead propagation. Phase 2 is intended to be on a 2nd pipelined clock cycle, comprising a separate suite of independent local-state-only parallel pipelines that do not require any inter-communication of any kind.

Table: Reminder of the 6 16-bit encodings:

| 0 | 1234 | 567  8 | 9abcde | f | explanation
| - | ---- | ------ | ------ | - | -----------
| EXT000/1 | Cmaj.m | fields | 0 | 10bit then v3.0B
| EXT000/1 | Cmaj.m | fields | 1 | 10bit then 16bit
| 0 | flds | Cmaj.m | fields | 0 | 16bit then v3.0B
| 0 | flds | Cmaj.m | fields | 1 | 16bit then 16bit
| 1 | flds | Cmaj.m | fields | 0 | 16b, 1x v3.0B, 16b
| 1 | flds | Cmaj.m | fields | 1 | 16b/imm then 16bit

Phase 1

The Phase 1 length/mode identification takes into account only 3 pieces of information:

  • extc_id: insn[0:4] == EXTNNN (Compressed)
  • N: insn[0]
  • M: insn[15]

The Phase 1 length/mode produces the following lengths/modes:

  • 32 - v3.0B (includes v3.0B followed by 16bit)
  • 16 - 10bit
  • 16 - 16bit

NOTE THAT FURTHER SUBIDENTIFICATION OF C MODES IS NOT CARRIED OUT AT PHASE 1. In particular note specifically that 16 bit "immediate mode" is not part of the Phase 1 FSM, but is specifically isolated to Phase 2.

Pseudocode:

# starting point for FSM
previ = v3.0B

if previ.mode == v3.0B:
    # previous was v3.0B, look for compressed tag
    if extc_id:
         # found it.  move to 10bit mode
         nexti.length = 16
         nexti.mode = 10bit
    else:
         # nope. stay in v3.0B
         nexti.length = 32
         nexti.mode = v3.0B

elif previ.mode == 10bit:
     # previous was v3.0B, move to v3.0B or 16bit?
    if M == 0:
         next.length = 32
         nexti.mode = v3.0B
     else:
         # otherwise stay in 16bit mode
         nexti.length = 16
         nexti.mode = 16bit

elif previ.mode == 16bit:
      # previous was 16bit, stay there or move?
      if M == 0:
         # back to v3.0B
         next.length = 32
         if N == 1:
              # ... but only for 1 insn
              nexti.mode = v3.0B_then_16bit
         else:
              nexti.mode = v3.0B
     else:
         # otherwise stay in 16bit mode
         nexti.length = 16
         nexti.mode = 16bit

# rest of FSM involving 3.0B to 16bit
# and back transitions left to implementor
# (or for someone else to add)

Phase 2: Compressed mode

At this phase, knowing that the length is 16bit and the mode is either 10b or 16b, further analysis is required to determine if the 16bit.immediate encoding is active, and so on. This is a fully combinatorial block that at no time steps outside of the strict bounds already determined by Phase 1.

op_001_1 = insn[5:8] != 0b001.1
if mode == 10bit:
    decode_10bit(insn)
elif mode == 16bit:
    if N == 1 & M == 1 & op_001_1
        # see immediate opcodes table
        decode_16bit_immed_mode(insn)
    if op_001_1:
        # see CR and System tables
        # (16 bit ones at least)
        decode_16bit_cr_or_sys(insn)
    else:
        decode_16bit_nonimmed_mode(insn)

From this point onwards each of the decode_xx functions perform straightforward combinatorial decoding of the 16 bits of "insn". In sone cases this involves further analysis of bit 1, in some cases (Cmaj.m = 0b010.1) even further deep-dive decoding is required (CR ops). All of it is entirely combinatorial and at no time involves changing of, or interaction with, or disruption of, the Phase 1 determination of Length+Mode (that has already taken place in an earlier decoding pipeline time-schedule)

Phase 2: v3.0B mode

Standard v3.0B decoders are deployed. Absolutely no interaction occurs with any 16 bit decoders or state. Absolutely no interaction with the earlier Phase 1 decoding occurs. Absolutely no interaction occurs whatsoever (assuming an implementation that does not perform macro-op fusion) between other multi-issued v3.0B instructions being decoded in parallel at this time.

Demo of encoding that's backward-compatible with PowerISA v3.1 in both LE and BE mode

demo

Efficient Decoding Algorithm

decoding