SimpleV Overview review comments and questions

BE Regfile confusion

Defining how an element-addressable variable-width regfile is to be accessed is exceptionally difficult to pin down when conventions are used that number from the MSB starting at zero (MSB0) rather than following the Industry-standard convention, LSB0. The confusion arises because MSB0 numbering is potentially applicable to:

  • bits
  • bytes, halfwords, words, doublewords and quadwords
  • register numbers
  • element numbers

In other words, when communicating in a Specification there are eight possible areas for confusion.

Then there is the completely separate issue of whether:

  • Memory should be the sole location where LE/BE byteswapping is applicable, Architecturally.
  • Both the register file and Memory should be, Architecturally, LE/BE reorderable, with the implication that direct and explicit control over reordering of Memory and of the bytes inside regfile data needs to be provided at the ISA Level.

The latter introduces at least another four potential areas for reordering, leading to a combinatorial explosion in the potential for confusion in communication, causing utter chaos and a proliferation of complexity.

Simple-V has made the decision that the burden of providing programmers with explicit control over the byte-ordering when data is transferred in and out of the register files is too great, and has chosen to require that, Architecturally, the register file is to be treated as a Little-Endian byte-ordered SRAM with elements to be accessed as if in a Little-Endian Software Environment. This does not impact Memory-ordering in any way and it does not impact arithmetic operations.

"Architecturally" is defined by Industry-standard convention to mean that how the Hardware is implemented is entirely down to the implementor, but that as far as software running on that hardware is concerned, "Architecturally" everything appears - to programs - as described. Mathematics would describe a formal definition of any given implementation and an Architecturally correctly-defined specification as Topologically "Homeomorphic".

In other words once loaded, software may be written that considers arithmetic values in a uniform environment regardless of whether MSR.BE, which is considered to affect Memory only, is set or unset. If byteswapping is required it may be performed explicitly, by either loading to/from Memory or by performing Simple-V byte-level element-width override "Reverse Gear" element walking, or using Matrix REMAP with Dimensional Inversion, or any number of methods.

In the overview page, where bytes are numbered in LSB0 order left to right, is this:

   | byte0 | byte1 | byte2 | byte3 | byte4 | byte5 | byte6 | byte7 |
   | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
r0 | b[0]  | b[1]  | b[2]  | b[3]  | b[4]  | b[5]  | b[6]  | b[7]  |
r1 | b[8]  | b[9]  | b[10] | b[11] | b[12] | b[13] | b[14] | b[15] |

Starting an elwidth=16 loop from r0 and extending for 7 elements would begin at r0 and extend partly over r1. Note that b0 indicates the low byte (lowest 8 bits) of each 16-bit word, and b1 represents the top byte:

   | byte0 | byte1 | byte2 | byte3 | byte4 | byte5 | byte6 | byte7 |
   | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
r0 | s[0].b0  b1   | s[1].b0  b1   | s[2].b0  b1   |  s[3].b0  b1  |
r1 | s[4].b0  b1   | s[5].b0  b1   | s[6].b0  b1   |  unmodified   |
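The elwidth=16 loop in the table above can be sketched in C as follows. This is only an illustrative model, assuming the two registers are represented as 16 consecutive LE-ordered bytes; the names `store_elem16` and `fill_elems16` are hypothetical and do not come from the specification.

```c
#include <stdint.h>
#include <stddef.h>

/* Store one 16-bit element at element index idx, counting from the
 * first byte of the starting register; the low byte (b0) goes first. */
void store_elem16(uint8_t *regbytes, size_t idx, uint16_t value)
{
    regbytes[2 * idx]     = (uint8_t)(value & 0xff);  /* s[idx].b0 */
    regbytes[2 * idx + 1] = (uint8_t)(value >> 8);    /* s[idx].b1 */
}

/* Run a vl-element elwidth=16 loop starting at the first register.
 * Bytes beyond the last element written are left untouched. */
void fill_elems16(uint8_t *regbytes, size_t vl, const uint16_t *src)
{
    for (size_t i = 0; i < vl; i++)
        store_elem16(regbytes, i, src[i]);
}
```

With vl=7 the loop touches all eight bytes of r0 and the first six bytes of r1, leaving the last two bytes of r1 unmodified, as in the table.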

If data is loaded at elwidth=16 as above and then accessed at elwidth=8, getting at it in an arithmetically-expected order of lowest byte first followed by highest byte would require this ordering:

   | byte0 | byte1 | byte2 | byte3 | byte4 | byte5 | byte6 | byte7 |
   | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
r0 | b[1]  | b[0]  | b[3]  | b[2]  | b[5]  | b[4]  | b[7]  | b[6]  |

Whilst this is technically possible to do (with 2D REMAP) it is not reasonable to "punish" BE developers by forcing them into the situation of having to perform such element-level reordering, nor is it reasonable for the complexity to bleed over into LE Mode in order to "compensate" by making BE easier but LE more difficult.

It is actually possible to check what POWER9/10 do here, using vaddubm and vadduhm (or other suitable instructions):

  • start from a register src of zero
  • perform an 8-bit add of 0x05
  • using the result of the previous calculation add a 16 bit value where only the upper half is set (0x0a00)

The expected results for every single element regardless of BE or LE order will be 0x0a05 not 0x00a5 in BE Mode.

This anticipated result will be down to the Architects having chosen a fixed bit-ordering and fixed byte ordering internally in the underlying hardware (without revealing that fact).

Microwatt and Libre-SOC perform the following in hardware:

LD -> brev(XOR(MSR.BE, ldbrx/ld)) -> GPR

such that internally there is no confusion: arithmetic operations are consistent arithmetic operations regardless of how data came to be stored in memory and also completely irrespective of the setting of MSR.BE:

GPR -> add -> GPR

not:

LD -> brev(ldbrx/ld) -> GPR
GPR -> brev(MSR.BE) -> add -> brev(MSR.BE) -> GPR
GPR -> brev(stbrx/st) -> ST
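The first of the two schemes above, LD -> brev(XOR(MSR.BE, ldbrx/ld)) -> GPR followed by a plain GPR -> add -> GPR, might be modelled along these lines. This is a sketch under stated assumptions, not a description of the Microwatt or Libre-SOC sources: the function names and the `msr_be` / `is_ldbrx` parameters are invented for illustration, and the model assumes the internal datapath holds values with byte 0 as least significant.

```c
#include <stdint.h>
#include <stdbool.h>

/* Reverse the eight bytes of a 64-bit value (the "brev" step). */
uint64_t brev64(uint64_t v)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (r << 8) | (v & 0xff);
        v >>= 8;
    }
    return r;
}

/* Hypothetical model of LD -> brev(XOR(MSR.BE, ldbrx/ld)) -> GPR.
 * mem points at 8 bytes in memory order. */
uint64_t load64(const uint8_t *mem, bool msr_be, bool is_ldbrx)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)        /* assemble as little-endian */
        v = (v << 8) | mem[i];
    /* one single reversal, decided by XOR of the two controls */
    return (msr_be != is_ldbrx) ? brev64(v) : v;
}

/* Arithmetic never consults MSR.BE: GPR -> add -> GPR. */
uint64_t gpr_add(uint64_t ra, uint64_t rb)
{
    return ra + rb;
}
```

The point of the design is that the only byte-reversal sits on the Memory path; once a value is in the GPR, `gpr_add` behaves identically whatever mode was used to load it.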

In a fixed-width (64-bit) architecture the above addition of byte-reversing in front of read-ports and write-ports in the regfile is pointless: they make no difference and would be gates completely wasted. However for element-width overrides (which amount to what happens to the typedef c struct union if compiled on BE hardware) it turns out it does matter.

The choice for the register file to be as if it is an LE-ordered byte-addressable typedef c struct union comes down to not having to do the above double-byte-reversing trick on all Register reads/writes in order to artificially preserve an order-flexibility that should not have been allowed on the contents of the register file in the first place.

The alternative architecture which causes huge problems in an element-based context is:

LD -> brev(ldbrx/ld) -> GPR
GPR -> brev(MSR.BE) -> add -> brev(MSR.BE) -> GPR

The reason is that, as in the above 8/16 table, the ordering of bytes, which become synonymous with elements, becomes inverted.

The "correct" fix to this problem is to view arithmetic LE/BE-reversing (reading/writing to/from the regfile on the way into and out of Arithmetic Operations) as completely separate and distinct from Memory LE/BE reordering, which a quick check on an IBM POWER8/9/10 system using VSX arithmetic operations on 8/16 in BE order will confirm is already considered.

The "correct" place to add such reversing would be the REMAP subsystem but the number of available bits is already under pressure so some care is needed. Even better would be the 24-bit RM but that is also precious and under pressure.

In theory there is one bit free:

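| 0-1 | 2 | 3  | 4  | description                    |
| --- | - | -- | -- | ------------------------------ |
| 00  | 0 | dz | sz | simple mode                    |
| 00  | 1 | 0  | RG | scalar reduce mode (mapreduce) |
| 00  | 1 | 1  | /  | reserved                       |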

which could be utilised to instruct hardware between the regfile and the Arithmetic ALUs to perform element-level byte-reversing:

GPR -> brev(RM.normal.br) -> add -> brev(RM.normal.br) -> GPR

However even this does not solve the problem caused by loading the data in 8-byte (ld/ldbrx) followed by accessing it as element-width-overridden half/word/byte elements: the situation occurs in the 8/16 table, above.
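For concreteness, the speculative GPR -> brev(RM.normal.br) -> add -> brev(RM.normal.br) -> GPR path might behave as in the sketch below for elwidth=16. Everything here is an assumption made for illustration: `br` stands for the hypothetical free bit discussed above (it is not an existing field), and `add_elems16` is an invented name.

```c
#include <stdint.h>
#include <stdbool.h>

/* Swap the two bytes of one 16-bit element. */
uint16_t brev16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Hypothetical element-level byte-reversing around a modulo add,
 * applied to the four 16-bit elements of one 64-bit register. */
uint64_t add_elems16(uint64_t ra, uint64_t rb, bool br)
{
    uint64_t rt = 0;
    for (int e = 0; e < 4; e++) {
        uint16_t a = (uint16_t)(ra >> (16 * e));
        uint16_t b = (uint16_t)(rb >> (16 * e));
        if (br) {                           /* reverse on the way in  */
            a = brev16(a);
            b = brev16(b);
        }
        uint16_t t = (uint16_t)(a + b);     /* add modulo 2^16        */
        if (br)                             /* reverse on the way out */
            t = brev16(t);
        rt |= (uint64_t)t << (16 * e);
    }
    return rt;
}
```

Note that the reversal is confined to each element: it changes which byte of an element receives the carry, but never moves data between elements, which is why it cannot by itself repair the 8/16 aliasing problem.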

In addition, it would still be necessary to instruct Hardware Architects to ensure that Memory-Load to Regfile-byte-order is still strictly defined (architecturally, not actual implementation).

Then also there is the completely separate issue of how to describe this in MSB0 numbering, which becomes a nightmare all on its own: one that has to be solved in the ISACaller Simulator when elwidth overrides are completed (recall that the ISACaller Simulator uses a python class where numbers are indeed strict MSB0 defined, arithmetically).

Architecturally LE Regfile

The meaning of "Architecturally" is that the programmer sees one definition, where that definition has absolutely no connection or imposition on precisely how the implementation is implemented (internally), except to comply with the definition as far as external usage is concerned (programs). Mathematics uses the term "Topologically Homeomorphic" to refer to the relationship between external (specification, programs) and any given (internal) hardware implementation.

This is the canonical Architectural definition of the regfile assuming a Little-Endian system, in the c programming language:

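#include <stdint.h>

#pragma pack
/* The unsized arrays are deliberate notation: indexing past the end of
 * one register continues into the next. ISO C only permits flexible
 * array members in structs, so many compilers reject this as written. */
typedef union {
    uint8_t  b[];              /* elwidth=8  elements */
    uint16_t s[];              /* elwidth=16 elements */
    uint32_t i[];              /* elwidth=32 elements */
    uint64_t l[];              /* elwidth=64 elements */
    uint8_t  actual_bytes[8];  /* the 8 bytes of one register */
} elreg_t;

elreg_t int_regfile[128];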

Where things become complicated is how to transliterate the above into MSB0 numbering, which unfortunately significantly harms understanding as to how to clearly access elements within the regfile, and thus requires a very careful and comprehensive explanation. We start with some definitions:

  • bit-ordering below is in MSB0 order, prefix "b"
  • byte-ordering is in MSB0 order, prefix "B"
  • halfword-ordering is in MSB0 order, prefix "H"
  • word-ordering is in MSB0 order, prefix "W"
  • doubleword-ordering is in MSB0 order, prefix "D"
  • register-numbering is in LSB0 order, prefix "r"
  • element-numbering is in LSB0 order, prefix "e"

The reasoning behind the LSB0 numbering for elements and registers is that, unlike a PackedSIMD Architecture which has fixed-width registers and fixed-size element numbering, a Scalable Vector ISA using MSB0 would require numbering element zero as VL-1 and element VL-1 as 0, which would result in complete incomprehensibility. Likewise, because registers are sequential, serial aliases onto the same underlying byte-addressable Memory, the register numbering must also be LSB0-ordered. The choice of bit- and byte- numbering in MSB0 is not done to increase understanding: it is done to match the precedent set when the Power ISA was first developed, over 25 years ago.

First we define the contents of 64-bit registers:

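| name  | hi byte/bit             | ... | lo byte/bit                     |
| ----- | ----------------------- | --- | ------------------------------- |
| bits  | b0.b1.b2.b3.b4.b5.b6.b7 | ... | b56.b57.b58.b59.b60.b61.b62.b63 |
| bytes | B0                      | ... | B7                              |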

In pseudocode we may now define B0 to B7 in terms of b0..b63 and thus also define Registers:

 B0 = b0 || b1 || b2 || b3 || b4 || b5 || b6 || b7
 B1 = b8 || ..........                 ....  || b15
 ...
 B7 = b56 || .........                 ....  || b63
 RA = B0 || B1 || B2 || B3 || B4 || B5 || B6 || B7

Further we may now also define half-words, words, and double-words, confirming the definition of registers, RA as the same example, above, where all definitions of RA below are consistent / identical:

 H0 = B0||B1, H1 = B2||B3, H2 = B4||B5, H3 = B6||B7
 RA = H0||H1||H2||H3
 W0 = H0||H1, W1 = H2||H3
 D0 = W0||W1
 RA = H0||H1||H2||H3
 RA = W0||W1
 RA = D0

If we then perform the following arithmetic operations, again using the same Pseudocode notation (Power ISA Public v3.1 Book I Section 1.3.2):

RA = 1 << 63
RB = 128
RT = 3 + 4

then the following bits are set (all others zero):

RA.b0 = 1     RA.B0 = 0x80  RA.H0 = 0x8000  RA.W0 = 0x80000000
RB.b56 = 1    RB.B7 = 0x80  RB.H3 = 0x0080  RB.W1 = 0x00000080
RT.b61-63 = 1 RT.B7 = 0x07  RT.H3 = 0x0007  RT.W1 = 0x00000007
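The MSB0 definitions of bits, bytes, half-words and words given earlier can be turned into small C accessors, which gives one way to check the values listed for RA, RB and RT. This is a minimal sketch; the `msb0_*` function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* MSB0 accessors for a 64-bit register value: index 0 is always the
 * most significant bit, byte, half-word or word. */
unsigned msb0_bit(uint64_t r, unsigned n)
{
    return (unsigned)((r >> (63 - n)) & 1);
}

uint8_t msb0_byte(uint64_t r, unsigned n)      /* B0..B7 */
{
    return (uint8_t)(r >> (8 * (7 - n)));
}

uint16_t msb0_half(uint64_t r, unsigned n)     /* H0..H3 */
{
    return (uint16_t)(r >> (16 * (3 - n)));
}

uint32_t msb0_word(uint64_t r, unsigned n)     /* W0..W1 */
{
    return (uint32_t)(r >> (32 * (1 - n)));
}
```

For instance, with RA = 1 << 63 held in a `uint64_t`, `msb0_bit(ra, 0)` is 1 and `msb0_byte(ra, 0)` is 0x80.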

Now taking the c struct definition elreg_t above we may provide a link between the two definitions, noting that our c "hardware" is a LE-ordered "machine" but that RA is as above:

elreg_t x = RA;
x.actual_bytes[0] = RA.B7; // byte zero in LE is byte 7 in MSB0
x.actual_bytes[1] = RA.B6;
...
x.actual_bytes[7] = RA.B0; // byte 7 in LE is byte 0 in MSB0
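The same byte mapping can be written as a loop. The sketch below assumes the register value is held in a plain `uint64_t` and uses invented helper names; it is meant only to show that `actual_bytes[i]` receives MSB0 byte B(7-i), and that the mapping is reversible.

```c
#include <stdint.h>

/* Fill actual_bytes[] (LE order) from a register value:
 * actual_bytes[0] is the least significant byte, i.e. MSB0 byte B7. */
void reg_to_actual_bytes(uint64_t ra, uint8_t actual_bytes[8])
{
    for (int i = 0; i < 8; i++)
        actual_bytes[i] = (uint8_t)(ra >> (8 * i));
}

/* Inverse: rebuild the register value from its LE-ordered bytes. */
uint64_t actual_bytes_to_reg(const uint8_t actual_bytes[8])
{
    uint64_t ra = 0;
    for (int i = 7; i >= 0; i--)
        ra = (ra << 8) | actual_bytes[i];
    return ra;
}
```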

The following arithmetic operations - in c - which are exactly the same arithmetic operations as given in Pseudocode form above - produce the exact same results as above:

elreg_t x = RA, y = RB, z = RT;
x.l[0] = 1ULL << 63;  // e0 is element 0 (64-bit constant avoids int overflow)
y.l[0] = 128;      // e0, defined above
z.l[0] = 3 + 4;    // e0 (LSB0 element-numbering)

Next, elements are introduced. Using the definition int_regfile above let us perform two operations:

int_regfile[2].s[1] = 1<<3 # e1 in the element numbering definition
int_regfile[2].s[4] = 1<<9 # e4 in the same definition, above
RA = GPR(2) # r2 in the register numbering-definition, above
RB = GPR(3) # r3, again, same numbering definition

The contents of RA and RB are then found to be:

RA.H0 = 0x0000  RA.H1 = 0x0000  RA.H2 = 0x0008  RA.H3 = 0x0000
RB.H0 = 0x0000  RB.H1 = 0x0000  RB.H2 = 0x0000  RB.H3 = 0x0200
RA = 0x0000_0000_0008_0000
RB = 0x0000_0000_0000_0200

The reason why GPR(3) contains the value 0x200 (1<<9), when it was the second Vector Element write that targeted GPR(2), is the sequential conceptual overlap between all registers: ultimately the regfile must be considered arbitrarily byte-addressable just like any Memory, and therefore writing to half-word element e4 starting from GPR(2) actually wrote to half-word element e0 of GPR(3).
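The overlap can be reproduced with a strictly-conforming stand-in for the union, in which the whole regfile is a single LE-ordered byte array. This is a sketch under that assumption only: `regfile_t`, `elem_write` and `gpr_read` are hypothetical names, and no bounds checking is performed.

```c
#include <stdint.h>
#include <stddef.h>

enum { NUM_REGS = 128, REG_BYTES = 8 };

/* Hypothetical model: 128 registers as one LE-ordered byte array. */
typedef struct {
    uint8_t bytes[NUM_REGS * REG_BYTES];
} regfile_t;

/* Write element idx of width elwidth bytes (1, 2, 4 or 8), counting
 * from the first byte of register reg. Elements beyond the end of
 * that register land in the registers that follow it. */
void elem_write(regfile_t *rf, size_t reg, size_t elwidth,
                size_t idx, uint64_t value)
{
    size_t base = reg * REG_BYTES + idx * elwidth;
    for (size_t i = 0; i < elwidth; i++)
        rf->bytes[base + i] = (uint8_t)(value >> (8 * i));
}

/* Read a whole 64-bit register; byte 0 is the least significant. */
uint64_t gpr_read(const regfile_t *rf, size_t reg)
{
    uint64_t v = 0;
    for (size_t i = REG_BYTES; i-- > 0; )
        v = (v << 8) | rf->bytes[reg * REG_BYTES + i];
    return v;
}
```

Calling `elem_write(&rf, 2, 2, 1, 1 << 3)` and `elem_write(&rf, 2, 2, 4, 1 << 9)` on a zeroed regfile corresponds to the two half-word stores above: the second call computes a byte offset that falls in register 3, which is exactly the aliasing being described.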

The correspondence between the MSB0-ordered Bytes B0-B7, Half-words H0-H3, Words W0-W1 and Doubleword D0, and the LE-ordered c union, for one single register, r0=GPR(0), is as follows:

| B0     | B1    | B2    | B3    | B4    |    B5 |    B6 |    B7   |
|        H0      |      H1       |      H2       |      H3         |
|                W0              |               W1                |
|                                D0                                |
| r0.b[7] r0.b[6] r0.b[5] r0.b[4] r0.b[3] r0.b[2] r0.b[1] r0.b[0]  |
|     r0.s[3]        r0.s[2]         r0.s[1]         r0.s[0]       |
|             r0.i[1]                         r0.i[0]              |
|                            r0.l[0]                               |

It is however just as critical to note that the following are also aliases, where r0=GPR(0) and r1=GPR(1):

| B0     | B1    | B2    | B3    | B4    |   B5  |   B6  |    B7   |
| B8     | B9    | B10   | B11   | B12   |   B13 |   B14 |    B15  |
|        H0      |      H1       |      H2       |      H3         |
|        H4      |      H5       |      H6       |      H7         |
|                W0              |               W1                |
|                W2              |               W3                |
|                                D0                                |
|                                D1                                |
|     r0.s[7]        r0.s[6]         r0.s[5]         r0.s[4]       |
|     r1.s[3]        r1.s[2]         r1.s[1]         r1.s[0]       |
|             r0.i[3]                         r0.i[2]              |
|             r1.i[1]                         r1.i[0]              |
|                            r0.l[1]                               |
|                            r1.l[0]                               |

These "aliases" which extend fully for all elements e0 onwards and all registers r0 onwards are down to the intentionally-defined overlaps in the canonical definition from the LE-organised packed c union.

If the element aliases were defined in MSB0 ordering then the sequential progression through elements so numbered, VL-1 to 0, would not only be in the opposite order from the natural loop sequence numbering expected by any computer program, but worse than that it would be on top of the already-complex half-word ordering, H3 H2 H1 H0 H7 H6 H5 H4... and word-ordering W1 W0 W3 W2 W5 W4 W7 W6....

Thus it can also be clearly seen why LSB0-numbering for elements and registers was picked: it would otherwise be near-impossible to explain the overlapping when Scalability (VL, MAXVL) is introduced.
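The half-word and word orderings quoted above follow a simple rule, which the sketch below makes explicit. It assumes 64-bit registers and that MSB0 labels continue from one register to the next as in the alias table; `msb0_label` is a hypothetical helper name.

```c
#include <stddef.h>

/* For LSB0 element number e, with per_reg elements in each 64-bit
 * register (8 bytes, 4 half-words or 2 words), return the index that
 * the same storage carries when labelled in per-register MSB0 order. */
size_t msb0_label(size_t e, size_t per_reg)
{
    size_t reg  = e / per_reg;   /* which register the element is in */
    size_t slot = e % per_reg;   /* LSB0 position inside that one    */
    return reg * per_reg + (per_reg - 1 - slot);
}
```

Walking e = 0, 1, 2, ... with per_reg = 4 visits the half-word labels in the zig-zag order H3 H2 H1 H0 H7 H6 H5 H4, whereas the LSB0 element numbers themselves simply count upwards.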