Single-Issue, In-Order Processor Core

Note: as of the time of writing, this task is 95-98% complete and needs approximately 10-15 lines of Python code to get a first unit test running.

The Libre-SOC TestIssuer core utilises a Finite-State Machine (FSM) to control the fetch/dec/issue/exec Computational Units, with only one such CompUnit (an FSM or a pipeline) being active at any given time. This is good for debugging the HDL, but severely restricts performance, as a single instruction will take tens of clock cycles to complete. In development (Andrey to research and link to the relevant bugreport) is an in-order core, and following on from that will be an out-of-order core.

A Single-Issue In-Order control unit (written 12+ months ago) will allow every pipeline to be active, and raises the ideal maximum throughput to 1 instruction per clock cycle, barring any register hazards.

The first version of this control unit was written in HDL 12+ months ago: it is in soc/ and there are options in the Makefile to enable it. However, there is currently a task to develop the model for the simulator first. The model will be used to determine performance.

Diagram that Luke drew comparing pipelines and FSMs, which allows for a transition from FSM to in-order to out-of-order and also allows "Micro-Coding".

The Model

Brief

pypowersim (which calls the in-house Python simulator, ISACaller) outputs an execution trace log which may, after the fact, be passed to any model, of which the Single-Issue In-Order model is just the very first. The model is therefore not added to the simulator itself; it uses the trace log to give basic performance estimates.

This model resides outside the simulator: it is completely standalone and will always remain standalone.

A subtask, to be carried out as incremental development, is to study the avatools source code in order to extract power consumption estimation and add it to the in-order model.

Task given

The offline instruction ordering analyser needs to be completed (it is currently 98% complete). It models a (simple, initially V3.0-only) in-order core and gives an estimate of instructions per clock (IPC).

Hazard Protection, which is already completed, is not a simple bit vector. It is a "length of pipeline countdown until result is ready", which models the clock cycles needed in the actual pipeline(s). The only "bit" involved is whether or not there is an entry in the Python set() for a given register.

  • Take the write result register number: add num-cycles-until-ready to the set().
  • For all read registers, call the function that checks whether there is an entry in the Python set() of expected outstanding results to be written. If there is, STALL (a fake/model stall).
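The countdown approach described above can be sketched as follows. This is only an illustration under stated assumptions, not the code in the repository: the class name PendingWrites, its method names, the use of a dict to hold the cycles remaining, and the "GPR:3" register key are all hypothetical.

```python
class PendingWrites:
    """Hypothetical sketch: tracks registers whose results are still in flight."""

    def __init__(self):
        self.pending = {}  # register key -> cycles until the result is ready

    def expect_write(self, reg, cycles_until_ready):
        # record that `reg` will be written after the given number of cycles
        self.pending[reg] = cycles_until_ready

    def must_stall(self, read_regs):
        # a model-stall is needed if any read register has an outstanding write
        return any(reg in self.pending for reg in read_regs)

    def tick(self):
        # one clock cycle passes: count down, dropping results that are now ready
        self.pending = {r: c - 1 for r, c in self.pending.items() if c > 1}


hazards = PendingWrites()
hazards.expect_write("GPR:3", 2)      # e.g. addi 3, 4, 5 will write r3
stalled = []
for _ in range(3):
    stalled.append(hazards.must_stall({"GPR:3"}))  # e.g. cmpi reads r3
    hazards.tick()
print(stalled)
```

The reader stalls for as long as the entry exists, and proceeds once the countdown has expired and the entry has been removed.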

A stall is defined as a delay in the execution of an instruction in order to resolve a hazard (e.g. trying to read a register while it is still being written to). See the Wikipedia article on Pipeline Stall.

Input (98% completed):

  • Instruction with its operands (as assembler listing)
  • plus an optional memory-address and whether it is read or written.

The input will come as a trace output from the ISACaller simulator, see bug comments #7-#16

The classes which "model" the pipeline stages (fetch, decode, issue, execute) have already been written.

One global "STALL" flag will cause all buses to stop:

  • Tells fetch to stop fetching
  • Decode stops (either because it is empty, or because it holds an instruction whose read registers are being written to).
  • Issue stops.
  • Execute (pipelines) run as an empty slot (except for the initial instruction causing the stall)

Example (PC chosen arbitrarily):

addi 3, 4, 5    #PC=8
cmpi 1, 0, 3, 4 #PC=12
ld   1, 2(3)    #PC=16 EA=0x12345678

The third operand of cmpi is the register to use in the comparison, so register 3 needs to be read. However, addi will be writing to this register, and thus a STALL will occur when cmpi is in the decode phase.

The output diagram will look like this:

TODO, move this to a separate file then include it twice, once with triple-quotes and once without. grep "inline raw=yes" for examples on how to include in mdwn

| clk # |    fetch     |    decode    |   issue      |   execute    |
|:-----:|:------------:|:------------:|:------------:|:------------:|
|   1   | addi 3,4,5   |              |              |              |
|   2   | cmpi 1,0,3,4 | addi 3,4,5   |              |              |
|   3   | STALL        | cmpi 1,0,3,4 | addi 3,4,5   |              |
|   4   | STALL        | cmpi 1,0,3,4 |              | addi 3,4,5   |
|   5   | ld 1,2(3)    |              | cmpi 1,0,3,4 |              |
|   6   |              | ld 1,2(3)    |              | cmpi 1,0,3,4 |
|   7   |              |              | ld 1,2(3)    |              |
|   8   |              |              |              | ld 1,2(3)    |

Explanation:

1: Fetched addi.
2: Decoded addi, fetched cmpi.
3: Issued addi, decoded cmpi, must stall decode phase, stop fetching.
4: Executed addi, everything else stalled.
5: Issued cmpi, fetched ld.
6: Executed cmpi, decoded ld.
7: Issued ld.
8: Executed ld.

For this initial model, it is assumed that all instructions take one cycle to execute (this is not the case for mul/div etc., but that will be dealt with later).
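As a small worked example of the IPC estimate, the cycle count in the table above can be reproduced with simple arithmetic. This sketch assumes the four-stage pipeline and single-cycle execute described on this page; the function name total_cycles is hypothetical.

```python
PIPELINE_DEPTH = 4  # fetch, decode, issue, execute


def total_cycles(n_insns, stall_cycles, depth=PIPELINE_DEPTH):
    # the first instruction needs `depth` cycles to pass through the pipeline,
    # each later instruction adds one cycle, and every stall adds one more
    return depth + (n_insns - 1) + stall_cycles


n_insns = 3                                   # addi, cmpi, ld
cycles = total_cycles(n_insns, stall_cycles=2)  # clk 3 and clk 4 are stalled
ipc = n_insns / cycles
print(cycles, ipc)
```

Without the two stall cycles the same three instructions would need 6 cycles, which shows how register hazards lower the estimated IPC.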

In-progress TODO

Code Explanation - IN PROGRESS

(Not all of the code has been explained, just the general classes.)

Source code: https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/cyclemodel

Hazard namedtuple data structure

A namedtuple object stores the attributes of a register access. The Python namedtuple is immutable (like a normal tuple), while also allowing elements to be accessed by predefined names. Immutability is useful because the register access attributes do not change from the fetch stage to the execute stage, which is why a normal list or dict would not be appropriate.

A namedtuple is also ordered (the initially defined field order is preserved), so fields can be accessed by position as well as by name. (Since Python 3.7 a normal dictionary also preserves insertion order, but it is mutable and cannot be stored in a set.) See the Python wiki on namedtuple, online namedtuple tutorial, [sta].

namedtuple instances are hashable, so they can also be stored in sets, which is exactly how they are used with the RegisterWrite class. One instruction trace may contain zero or more Hazard register access objects (depending on which registers the instruction needs).
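A minimal sketch of these properties is given below. The field names are taken from the example trace shown later on this page; the exact definition in the source code may differ.

```python
from collections import namedtuple

# field names as they appear in the example trace on this page
Hazard = namedtuple("Hazard", ["action", "target", "ident", "offs", "elwid"])

h = Hazard(action="w", target="GPR", ident="1", offs="0", elwid="64")
print(h.target, h[1])      # the same field, accessed by name and by position

immutable = False
try:
    h.ident = "2"          # attempting to modify a field raises AttributeError
except AttributeError:
    immutable = True

# instances are hashable, so they can be stored in a set; equal entries collapse
pending = {h, Hazard("w", "GPR", "1", "0", "64")}
print(immutable, len(pending))
```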

HazardProfiles

A dictionary of currently supported register file types. Each entry (register file type) defines the number of read and write ports, written as a tuple, with the first entry being the number of read ports, and second entry being the number of write ports.

Having multiple read and/or write ports means that multiple different entries in the same register file can be read from and/or written to in the same clock cycle. This doesn't prevent a stall if the same register entry is used by a consecutive instruction, even if a spare port is available (Read-after-Write hazard).
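The shape of such a dictionary might look like the sketch below. Note that the name example_profiles, the port counts, and the helper fits_ports are all hypothetical, chosen for illustration rather than taken from the source code.

```python
# Hypothetical values: register file type -> (read ports, write ports)
example_profiles = {
    "GPR": (4, 1),
    "FPR": (3, 1),
    "CR": (2, 1),
}


def fits_ports(profiles, regfile, n_reads, n_writes):
    """True if the requested accesses fit within the regfile's ports in one cycle."""
    read_ports, write_ports = profiles[regfile]
    return n_reads <= read_ports and n_writes <= write_ports


ok = fits_ports(example_profiles, "GPR", n_reads=2, n_writes=1)
too_many = fits_ports(example_profiles, "GPR", n_reads=2, n_writes=2)
print(ok, too_many)
```

A port check like this is separate from Read-after-Write detection: passing it only means that enough ports exist, not that the register contents are ready.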

Parsing trace file dump using read_file function

The CPU model class takes a single instruction trace list object as input.

This trace list object is produced by the function read_file, which itself reads an instruction trace file from the modified ISACaller (link to code needed). From now on, the trace list object will simply be referred to as trace.

Each line of the trace dump is of the form [{rw}:FILE:regnum:offset:width]* # insn where:

  • rw is the access type: r if the register is read (operands), or w if it is written (to store a result, condition codes, etc.).
  • FILE is the register file type (GPR/integer, FPR/floating-point, etc. see Additional Information section at the end of this page). (TODO: use section reference link instead).
  • regnum is the register number
  • offset TODO: Perhaps the offset of data in bytes??? no idea (right now not important, as examples all show 0 offset)
  • width is the length of the data in bits to be accessed from the register.
  • insn is the full instruction written in PowerISA assembler.

The block [{rw}:FILE:regnum:offset:width] is used zero or more times, based on the total number of read and write registers used for the instruction.

Example trace file with three instructions:

r:GPR:0:0:64 w:GPR:1:0:64              # addi 1, 0, 0x0010
r:GPR:0:0:64 w:GPR:2:0:64              # addi 2, 0, 0x1234
r:GPR:1:0:64 r:GPR:2:0:64              # stw 2, 0(1)

The instruction trace file is processed line by line, where each line is split into the register access attributes (from which a new namedtuple is created using _make() and the Hazard definition; see python wiki on _make() method).

Each line is converted to a trace object of the form: [insn, Hazard(...), Hazard(...), ...]. An example trace looks like this:

['addi 1, 0, 0x0010',
 Hazard(action='r', target='GPR', ident='0', offs='0', elwid='64'),
 Hazard(action='w', target='GPR', ident='1', offs='0', elwid='64')]

The function read_file yields (see python wiki on yield) a single trace for each line of the trace file. Because it yields, calling read_file with the filename of the ISACaller instruction trace dump returns a generator rather than a list. The generator can be iterated over directly by the CPU model, or passed to list() and assigned to a new variable to produce a full list of trace objects.
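The parsing steps above can be sketched as follows. This is not the read_file implementation: the helpers parse_line and read_traces are hypothetical, and they operate on a list of strings instead of a file so that the example is standalone.

```python
from collections import namedtuple

Hazard = namedtuple("Hazard", ["action", "target", "ident", "offs", "elwid"])


def parse_line(line):
    """Hypothetical helper: turn one trace line into [insn, Hazard, ...]."""
    accesses, _, insn = line.partition("#")   # split at the first '#'
    trace = [insn.strip()]
    for field in accesses.split():            # e.g. "r:GPR:0:0:64"
        trace.append(Hazard._make(field.split(":")))
    return trace


def read_traces(lines):
    # generator: yields one trace per non-empty line
    for line in lines:
        if line.strip():
            yield parse_line(line)


dump = [
    "r:GPR:0:0:64 w:GPR:1:0:64              # addi 1, 0, 0x0010",
    "r:GPR:1:0:64 r:GPR:2:0:64              # stw 2, 0(1)",
]
traces = list(read_traces(dump))   # list() consumes the generator
print(traces[0])
```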

RegisterWrite

A class which is based on a Python set, and is used to keep track of current registers used for writing (for detecting Read-after-Write Hazards).

A Python set is an unordered collection with no duplicate elements (see the Python wiki on sets).

By checking whether the next instruction's read registers match any of the write registers in the RegisterWrite set, the model can raise a STALL.

Anything in the set MUST STALL at the Decode phase because the currently issued/executed instruction's result has not been written to the register/s needed for the consecutive instruction.

Methods

def __init__(self):
    self.storage = set()

Initialise RegisterWrite set.

def expect_write(self, regs):
    return self.storage.update(regs)

If there are new registers to be written to, add them to the current RegisterWrite set.

def write_expected(self, regs):
    return (len(self.storage.intersection(regs)) != 0)

Returns True if any of the given (read) registers is still waiting to be written to by a previous instruction, in which case a STALL is needed; returns False otherwise.

def retire_write(self, regs):
    return self.storage.difference_update(regs)

Remove the given registers from the RegisterWrite set once their results have been written back to the register file.
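Putting the four methods together, a short usage sketch follows. The string "GPR:3" used as a register key is an assumption made for illustration; the real model may store different objects in the set.

```python
class RegisterWrite:
    """The methods listed above, assembled into one class."""

    def __init__(self):
        self.storage = set()

    def expect_write(self, regs):
        return self.storage.update(regs)

    def write_expected(self, regs):
        return len(self.storage.intersection(regs)) != 0

    def retire_write(self, regs):
        return self.storage.difference_update(regs)


regs = RegisterWrite()
regs.expect_write({"GPR:3"})                  # addi 3, 4, 5 will write r3
before = regs.write_expected({"GPR:3"})       # cmpi wants to read r3: hazard
regs.retire_write({"GPR:3"})                  # addi result has been written back
after = regs.write_expected({"GPR:3"})        # hazard has cleared
print(before, after)
```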

get_input_regs and get_output_regs functions

CPU class

The CPU class models the in-order, single-issue core. It contains the RegisterWrite set for tracking Read-after-Write Hazards, the fetch, decode, issue, and execute stages, and a stall flag indicating whether the CPU is currently stalled.

The input to the model is a trace list object.

The main method used while running the model is process_instructions(), which is called every time an instruction trace list object is read from a trace file.

Methods

def __init__(self):
    self.regs = RegisterWrite()
    self.fetch = Fetch(self)
    self.decode = Decode(self)
    self.issue = Issue(self)
    self.exe = Execute(self)
    self.stall = False

def reads_possible(self, regs):
    # TODO: subdivide this down by GPR FPR CR-field.
    # currently assumes total of 3 regs are readable at one time
    possible = set()
    r = regs.copy()
    while len(possible) < 3 and len(r) > 0:
        possible.add(r.pop())
    return possible

def writes_possible(self, regs):
    # TODO: subdivide this down by GPR FPR CR-field.
    # currently assumes total of 1 reg is possible regardless of what it is
    possible = set()
    r = regs.copy()
    while len(possible) < 1 and len(r) > 0:
        possible.add(r.pop())
    return possible

def process_instructions(self):
    stall = self.stall
    stall = self.fetch.process_instructions(stall)
    stall = self.decode.process_instructions(stall)
    stall = self.issue.process_instructions(stall)
    stall = self.exe.process_instructions(stall)
    self.stall = stall
    if not stall:
        self.fetch.tick()
        self.decode.tick()
        self.issue.tick()
        self.exe.tick()

Execute class

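The Execute class models the execute phase of the processor. It contains a list of stages (self.stages), in which the entry at index N holds the instructions due to complete N cycles from now.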

Methods

def __init__(self, cpu):
    self.stages = []
    self.cpu = cpu

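def add_stage(self, cycles_away, stage):
    # grow the list until index cycles_away exists (">=" avoids an
    # IndexError when cycles_away == len(self.stages))
    while cycles_away >= len(self.stages):
        self.stages.append([])
    self.stages[cycles_away].append(stage)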

def add_instruction(self, insn, writeregs):
    self.add_stage(2, {'insn': insn, 'writes': writeregs})

def tick(self):
    self.stages.pop(0) # tick drops anything at time "zero"

def process_instructions(self, stall):
    instructions = self.stages[0] # get list of instructions
    to_write = set()              # need to know total writes
    for instruction in instructions:
        to_write.update(instruction['writes'])
    # see if all writes can be done, otherwise stall
    writes_possible = self.cpu.writes_possible(to_write)
    if writes_possible != to_write:
        stall = True
    # retire the writes that are possible in this cycle (regfile writes)
    self.cpu.regs.retire_write(writes_possible)
    # and now go through the instructions, removing those regs written
    for instruction in instructions:
        instruction['writes'].difference_update(writes_possible)
    return stall
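A minimal sketch of how the stages list behaves as a schedule is shown below. It is written as free functions rather than methods (an assumption made to keep the example standalone), the loop grows the list until index cycles_away exists, tick is guarded against an empty list, and the register key "GPR:3" is illustrative only.

```python
def add_stage(stages, cycles_away, stage):
    # grow the schedule until index `cycles_away` exists
    while cycles_away >= len(stages):
        stages.append([])
    stages[cycles_away].append(stage)


def tick(stages):
    # one clock cycle passes: drop the slot at time "zero"
    if stages:
        stages.pop(0)


stages = []
add_stage(stages, 2, {'insn': 'addi 3, 4, 5', 'writes': {'GPR:3'}})
depth_before = len(stages)        # slots 0, 1 and 2 now exist
tick(stages)
tick(stages)
due_now = [s['insn'] for s in stages[0]]   # the instruction has reached slot 0
print(depth_before, due_now)
```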

Additional Information

On register file types

Currently (20th Aug 2023), the following register files are included in the CPU model:

  • General Purpose Registers (GPR) - stores integers (0-31 in default PowerISA, 0-127 for Libre-SOC with SVP64)
  • Floating Point Registers (FPR) - stores floating-point numbers
  • Condition Register (CR) - broken up into 4-bit fields
  • Condition Register Fields (CRf) - stores arithmetic condition of an operation (less than, greater than, equal to zero, overflow)
  • Fixed-Point Exception Register (XER)
  • Machine State Register (MSR)
  • Floating-Point Status and Control Register (FPSCR)
  • Program Counter (PC); the PowerISA spec primarily calls this the Current Instruction Address (CIA). See PowerISA v3.1, section 1.3.4 Description of Instruction Operation
  • Slow Special Purpose Registers (SPRs)
  • Fast SPR (SPRf)

TODO: Special Purpose Registers and fields need a better explanation. The initial writer of this page (Andrey) has very little understanding of whether SPR is actually a register, or if it's just a category of registers (XER, etc.)

See the PowerISA 3.1 spec for detailed information on register files (Book I, Chapters 1.3.4, 2.3, 3.2, 4.2, 5.2, 5.3).