Register Files

Discussion:

These register files are required for POWER:

  • Floating-point
  • Integer
  • Control and Condition Code Registers (CR0-7)
  • SPRs (Special Purpose Registers)
  • Fast Registers (CTR, LR, SRR0, SRR1 etc.)
  • "State" Registers (CIA, MSR, SimpleV VL)

Source code:

For a GPU, the FP and Integer registers need to be a massive 128 x 64-bit.

Video walkthrough of regfile relationship to Function Units in core: https://youtu.be/7Th1b-jq40k

Regfile groups, Port Allocations and bit-widths

  • INT regfile: 32x 64-bit with 4R1W
  • SPR regfile: 1024x 64-bit (!) needs a "map" on that 1R1W
  • CR regfile: 8x 4-bit with full 8R8W (for full 32-bit read/write)
    • CR0-7: 4-bit
  • XER regfile: 2x 2-bit, 1x 1-bit with full 3R3W
    • CA(32) - 2-bit
    • OV(32) - 2-bit
    • SO - 1 bit
  • FAST regfile: 5x 64-bit, full 3R2W (possibly greater)
    • LR: 64-bit
    • CTR: 64-bit
    • TAR: 64-bit
    • SRR0: 64-bit
    • SRR1: 64-bit
  • STATE regfile: 3x 64-bit, 2R1W (possibly greater)
    • MSR: 64-bit
    • PC: 64-bit
    • SVSTATE: 64-bit
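
The groups above can be written down as a small table and sanity-checked. This is only an illustrative sketch: the dictionary layout and the `storage_bits` helper are hypothetical names, not part of the project's source, and the port counts are simply transcribed from the list (ignoring the "possibly greater" remarks).

```python
# Each regfile: a list of (number of entries, bits per entry) plus
# (read ports, write ports), transcribed from the list above.
REGFILES = {
    'INT':   {'entries': [(32, 64)],       'ports': (4, 1)},
    'SPR':   {'entries': [(1024, 64)],     'ports': (1, 1)},
    'CR':    {'entries': [(8, 4)],         'ports': (8, 8)},
    'XER':   {'entries': [(2, 2), (1, 1)], 'ports': (3, 3)},
    'FAST':  {'entries': [(5, 64)],        'ports': (3, 2)},
    'STATE': {'entries': [(3, 64)],        'ports': (2, 1)},
}

def storage_bits(name):
    """Total number of storage bits in the named regfile."""
    return sum(count * width for count, width in REGFILES[name]['entries'])

# Eight 4-bit CR fields make up the full 32-bit Condition Register,
# which is why 8R8W is needed to read or write all of it at once.
cr_bits = storage_bits('CR')    # 8 * 4
xer_bits = storage_bits('XER')  # 2 * 2 + 1 * 1
```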

Connectivity between regfiles and Function Units

The target for the first ASICs is a minimum of four 32-bit FMACs per clock cycle. If it is acceptable that this be achieved only on sequentially-adjacent-numbered registers, a significant reduction in the amount of regfile porting may be achieved (down from the 12R4W that four FMACs, each with three reads and one write, would otherwise need).

It does however require that the register file be broken into four completely separate and independent quadrants, each with its own separate and independent 3R1W (or 4R1W) ports.
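
The port arithmetic can be sketched in a few lines. The sketch assumes (the text above does not say so explicitly) that registers are striped across the quadrants by register number modulo 4, so that four sequentially-numbered registers always fall into four different quadrants. `quadrant_of` and `ports_needed` are hypothetical helper names.

```python
N_QUADRANTS = 4   # the regfile is split into four independent banks
FMAC_READS = 3    # an FMAC reads three operands (a * b + c)
FMAC_WRITES = 1   # and writes one result

def quadrant_of(regnum):
    # ASSUMPTION: registers are striped across quadrants by regnum % 4.
    return regnum % N_QUADRANTS

def ports_needed(n_fmacs):
    """(read ports, write ports) needed to feed n_fmacs every cycle."""
    return n_fmacs * FMAC_READS, n_fmacs * FMAC_WRITES

# One monolithic regfile feeding four FMACs per cycle:
monolithic = ports_needed(4)
# Each quadrant only has to feed one FMAC per cycle:
per_quadrant = ports_needed(1)

# Four sequentially-numbered registers hit each quadrant exactly once.
base = 8
quadrants = [quadrant_of(base + i) for i in range(4)]
```

Under this assumption the 12R4W requirement falls to 3R1W per quadrant, provided the four FMACs work on adjacent register numbers.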

This then requires some Bus Architecture to connect the pipelines and keep them busy. The connectivity diagram shows the following:

  • A single Dynamic PartitionedSignal capable 64-bit-wide pipeline is at the top left and top right.
  • Multiple pairs of 32-bit Function Units (making up a 64-bit data path) connect, as "Concurrent Units", to each pipeline.
  • The number of pairs of Function Units must match (or preferably exceed) the number of pipeline stages.
  • Connected to each of the Operand and Result Ports on each Function Unit is a cyclic buffer.
  • Read-operands may "cycle" to reach their destination.
  • Write-operands may be "cycled" so as to pick an appropriate destination.
  • Independent Common Data Buses, one for each Quadrant of the Regfile, connect between the Function Unit's cyclic buffers and the global cyclic buffers dedicated to that Quadrant.
  • Within each Quadrant's global cyclic buffers, inter-buffer transfer ports allow for copies of regfile data to be transferred from write-side to read-side. This constitutes the entirety of what is known as an Operand Forwarding Bus.
  • Between each Quadrant's global cyclic buffers, there exists a 4x4 Crossbar that allows data to move (slowly, and if necessary) across Quadrants.

Notes:

  • There is only one 4x4 crossbar (or, one for reads, one for writes?) and thus only one inter-Quadrant 32-bit-wide data path (total bandwidth 4x32 bits). These are to be shared by five groups of operand ports at each of the Quadrant Global Cyclic Buffers.
  • The only way for register results and operands to cross over between quadrants of the regfile is through that 4x4 crossbar. Because data transfer bandwidth is limited, the placement of an operation can adversely affect its completion time. Thus, given that read operands outnumber write operands, allocation of operations to Function Units should prioritise placing the operation where the "reads" may go straight through.
  • As outlined in this comment https://bugs.libre-soc.org/show_bug.cgi?id=296#10 the infrastructure above can, by way of the cyclic buffers, cope with and automatically adapt between a serial delivery of operands and a parallel delivery of operands. Performance is not adversely affected by the serial delivery, although the latency of an FMAC is extended by 3 cycles, because only one CDB is available to deliver operands.
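
The placement rule in the second note can be illustrated with a small cost function. This is a sketch only, not the project's allocator: it assumes registers are striped across quadrants by register number modulo 4, and `crossbar_cost` and `best_quadrant` are hypothetical names.

```python
def crossbar_cost(quadrant, src_regs, dest_regs):
    """Count operands that would have to cross the 4x4 crossbar.

    Returns (reads crossing, writes crossing). Tuples compare left to
    right, so reads are prioritised over writes when minimising.
    ASSUMPTION: a register lives in quadrant (regnum % 4).
    """
    reads = sum(1 for r in src_regs if r % 4 != quadrant)
    writes = sum(1 for r in dest_regs if r % 4 != quadrant)
    return (reads, writes)

def best_quadrant(src_regs, dest_regs):
    """Pick the quadrant whose Function Units need the fewest crossings."""
    return min(range(4), key=lambda q: crossbar_cost(q, src_regs, dest_regs))

# FMAC r4 = r1 * r5 + r9: all three reads live in quadrant 1, the
# result lives in quadrant 0. Placing the operation in quadrant 1 lets
# the reads go straight through and only the write crosses over.
choice = best_quadrant([1, 5, 9], [4])
cost = crossbar_cost(choice, [1, 5, 9], [4])
```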


Regspecs

"Regspecs" is a term used for describing the relationship between register files, register file ports, register widths, and the Computation Units that they connect to.

Regspecs are defined, in python, as follows:

Regfile name    CompUnit Record name    Bit range register mapping
INT             ra                      0:3,5

Description of each heading:

  • Regfile name: INT corresponds to the INTEGER file, CR to Condition Register etc.
  • CompUnit Record name: the Input or Output Record contains a signal with this name. The field refers to that record signal, and the position of the entry in the list gives the fields their sequential ordering.
  • Bit range: this is specified as an inclusive range of the form "start:end" or just a single bit, "N". Multiple ranges may be specified, and are comma-separated.
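
The bit-range syntax is simple enough to decode in a few lines. The helpers below are a minimal sketch with hypothetical names (`parse_bitrange`, `bitrange_width`); the project's own decoding code may well differ.

```python
def parse_bitrange(spec):
    """Return the list of bit positions named by a bit-range string.

    Accepts comma-separated items, each either an inclusive
    "start:end" range or a single bit "N".
    """
    bits = []
    for part in spec.split(','):
        part = part.strip()
        if ':' in part:
            start, end = (int(x) for x in part.split(':'))
            bits.extend(range(start, end + 1))  # end is inclusive
        else:
            bits.append(int(part))
    return bits

def bitrange_width(spec):
    """Number of bits covered by a bit-range string."""
    return len(parse_bitrange(spec))

example_bits = parse_bitrange('0:3,5')   # bits 0 to 3, plus bit 5
example_width = bitrange_width('0:63')   # a full 64-bit register
```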

Here is how they are used:

class CRInputData(IntegerData):
    regspec = [('INT', 'a', '0:63'),      # 64 bit range
               ('INT', 'b', '0:63'),      # 64 bit range
               ('CR', 'full_cr', '0:31'), # 32 bit range
               ('CR', 'cr_a', '0:3'),     # 4 bit range
               ('CR', 'cr_b', '0:3'),     # 4 bit range
               ('CR', 'cr_c', '0:3')]     # 4 bit range

This tells us, when used by MultiCompUnit, that:

  • CompUnit src reg 0 is from the INT regfile, is linked to CRInputData.a, 64-bit
  • CompUnit src reg 1 is from the INT regfile, is linked to CRInputData.b, 64-bit
  • CompUnit src reg 2 is from the CR regfile, is CRInputData.full_cr, and 32-bit
  • CompUnit src reg 3 is from the CR regfile, is CRInputData.cr_a, and 4-bit
  • CompUnit src reg 4 is from the CR regfile, is CRInputData.cr_b, and 4-bit
  • CompUnit src reg 5 is from the CR regfile, is CRInputData.cr_c, and 4-bit
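
The list above follows mechanically from the position of each entry in the regspec. As an illustration (this is not how MultiCompUnit is implemented, and `width` and `describe` are hypothetical helpers), the same summary can be produced by enumerating the list:

```python
# The regspec from CRInputData, copied here so the sketch is standalone.
regspec = [('INT', 'a', '0:63'),
           ('INT', 'b', '0:63'),
           ('CR', 'full_cr', '0:31'),
           ('CR', 'cr_a', '0:3'),
           ('CR', 'cr_b', '0:3'),
           ('CR', 'cr_c', '0:3')]

def width(bitrange):
    """Bit count of a regspec range such as '0:63' or '0:3,5'."""
    total = 0
    for part in bitrange.split(','):
        ends = part.split(':')  # a single bit 'N' gives one element
        total += int(ends[-1]) - int(ends[0]) + 1
    return total

def describe(regspec, record_name):
    """One summary line per source register, in regspec order."""
    lines = []
    for idx, (regfile, field, bitrange) in enumerate(regspec):
        lines.append("src reg %d: %s regfile, %s.%s, %d-bit"
                     % (idx, regfile, record_name, field, width(bitrange)))
    return lines

summary = describe(regspec, 'CRInputData')
```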

Likewise there is a corresponding regspec for CROutputData. The two are combined and associated with the Pipeline:

class CRPipeSpec(CommonPipeSpec):
    regspec = (CRInputData.regspec, CROutputData.regspec)
    opsubsetkls = CompCROpSubset

In this way the pipeline can be connected up to a generic, general-purpose class (MultiCompUnit), which would otherwise know nothing about the details of the ALU (Pipeline) that it is being connected to.

In addition, on the other side of the MultiCompUnit, the regspecs contain enough information to be able to wire up batches of MultiCompUnits (now known, because of their association with an ALU, as FunctionUnits), associating the MultiCompUnits correctly with their corresponding Register File.

Note: there are two exceptions to the "generic-ness and abstraction" where MultiCompUnit "knows nothing":

  1. When the Operand Subset has a member "zero_a": this tells MultiCompUnit to create a multiplexer that, if operand.zero_a is set, will put ZERO into its first src operand (src_i[0]) and will NOT put out a read request (NOT raise rd.req[0]) for that first register.
  2. When the Operand Subset has a member "imm_data": this tells MultiCompUnit to create a multiplexer that, if operand.imm_data.ok is set, will copy operand.imm_data into its second src operand (src_i[1]) and will NOT put out a read request (NOT raise rd.req[1]) for that second register.

These two multiplexers should only be activated for INTEGER and Logical pipelines, and the regspecs for them must note and respect the requirements: input regspec[0] may only be associated with operand.zero_a, and input regspec[1] may only be associated with operand.imm_data. The POWER9 Decoder and the actual INTEGER and Logical pipelines have these expectations specifically hard-coded into them.
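
The effect of the two exceptions can be modelled in plain Python. This is a behavioural sketch only, not the hardware description: the `Operand` and `ImmData` tuples and the `select_sources` function are hypothetical, and the `data` field name on `imm_data` is an assumption (only `ok` is named in the text above).

```python
from collections import namedtuple

# Hypothetical stand-ins for the operand subset record.
ImmData = namedtuple('ImmData', ['data', 'ok'])
Operand = namedtuple('Operand', ['zero_a', 'imm_data'])

def select_sources(operand, n_src):
    """Model the two multiplexers.

    Returns (overrides, rd_req): overrides maps a src index to the
    value forced onto it, and rd_req[i] says whether a regfile read
    request is still raised for src operand i.
    """
    rd_req = [True] * n_src
    overrides = {}
    if operand.zero_a:
        overrides[0] = 0          # ZERO goes into src_i[0]
        rd_req[0] = False         # and rd.req[0] is not raised
    if operand.imm_data.ok:
        overrides[1] = operand.imm_data.data  # immediate into src_i[1]
        rd_req[1] = False         # and rd.req[1] is not raised
    return overrides, rd_req

# An operation with both zero_a and an immediate reads no registers
# for its first two operands; a third operand is still requested.
op = Operand(zero_a=True, imm_data=ImmData(data=5, ok=True))
overrides, rd_req = select_sources(op, 3)
```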