SPRs
The full list of SPRs for Simple-V is:
SPR | Width | Description |
---|---|---|
SVSTATE | 64-bit | Zero-Overhead Loop Architectural State |
SVLR | 64-bit | SVSTATE equivalent of LR-to-PC |
SVSHAPE0 | 32-bit | REMAP Shape 0 |
SVSHAPE1 | 32-bit | REMAP Shape 1 |
SVSHAPE2 | 32-bit | REMAP Shape 2 |
SVSHAPE3 | 32-bit | REMAP Shape 3 |
Future versions of Simple-V will have at least 7 more SVSTATE SPRs, in a small "stack", as part of a full Zero-Overhead Loop Control subsystem.
SVSTATE SPR
The format of the SVSTATE SPR is as follows:
Field | Name | Description |
---|---|---|
0:6 | maxvl | Max Vector Length |
7:13 | vl | Vector Length |
14:20 | srcstep | for srcstep = 0..VL-1 |
21:27 | dststep | for dststep = 0..VL-1 |
28:29 | dsubstep | for substep = 0..SUBVL-1 |
30:31 | ssubstep | for substep = 0..SUBVL-1 |
32:33 | mi0 | REMAP RA/FRA/BFA SVSHAPE0-3 |
34:35 | mi1 | REMAP RB/FRB/BFB SVSHAPE0-3 |
36:37 | mi2 | REMAP RC/FRT SVSHAPE0-3 |
38:39 | mo0 | REMAP RT/FRT/BF SVSHAPE0-3 |
40:41 | mo1 | REMAP EA/RS/FRS SVSHAPE0-3 |
42:46 | SVme | REMAP enable (RA-RT) |
47:52 | rsvd | reserved |
53 | pack | PACK (srcstep reorder) |
54 | unpack | UNPACK (dststep order) |
55:61 | hphint | Horizontal Hint |
62 | RMpst | REMAP persistence |
63 | vfirst | Vertical First mode |
Notes:
- The entries are truncated to be within range. Attempts to set VL to greater than MAXVL will truncate VL.
- Setting srcstep, dststep to 64 or greater, or VL or MVL to greater than 64 is reserved and will cause an illegal instruction trap.
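The field table above can be sketched as a pair of helper functions. This is only an illustrative model, not part of the specification: it assumes the Power ISA MSB0 convention (bit 0 is the most significant bit of the 64-bit SPR), and the names SVSTATE_FIELDS, get_field and set_field are hypothetical.

```python
# Field layout copied from the SVSTATE table: name -> (first_bit, last_bit).
# Assumption: bits are numbered MSB0, so bit 0 is the most significant bit
# and bit 63 (vfirst) is the least significant bit of the 64-bit value.
SVSTATE_FIELDS = {
    "maxvl": (0, 6), "vl": (7, 13),
    "srcstep": (14, 20), "dststep": (21, 27),
    "dsubstep": (28, 29), "ssubstep": (30, 31),
    "mi0": (32, 33), "mi1": (34, 35), "mi2": (36, 37),
    "mo0": (38, 39), "mo1": (40, 41),
    "SVme": (42, 46), "rsvd": (47, 52),
    "pack": (53, 53), "unpack": (54, 54),
    "hphint": (55, 61), "RMpst": (62, 62), "vfirst": (63, 63),
}


def _shift_and_mask(name):
    """Convert an MSB0 bit range into a right-shift amount and a value mask."""
    first, last = SVSTATE_FIELDS[name]
    width = last - first + 1
    return 63 - last, (1 << width) - 1


def get_field(svstate, name):
    """Extract one named field from a 64-bit SVSTATE value."""
    shift, mask = _shift_and_mask(name)
    return (svstate >> shift) & mask


def set_field(svstate, name, value):
    """Return a copy of svstate with one field replaced (value is masked to fit)."""
    shift, mask = _shift_and_mask(name)
    return (svstate & ~(mask << shift)) | ((value & mask) << shift)
```

The helpers only pack and unpack bits: the truncation and reserved-value rules in the Notes above are not modelled.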
SVSTATE Fields
SVSTATE is a standard SPR that (if REMAP is not activated) contains sufficient self-contained information for a full context save/restore. SVSTATE contains (and permits setting of):
- MVL (the Maximum Vector Length) - declares (statically) how much of a regfile is to be reserved for Vector elements
- VL - Vector Length
- dststep - the destination element offset of the current parallel instruction being executed
- srcstep - for twin-predication, the source element offset as well.
- ssubstep - the source subvector element offset of the current parallel instruction being executed
- dsubstep - the destination subvector element offset of the current parallel instruction being executed
- vfirst - Vertical First mode. srcstep, dststep and substep do not advance unless explicitly requested to do so with svstep
- RMpst - REMAP persistence. REMAP will apply only to the following instruction unless this bit is set, in which case REMAP "persists". Reset (cleared) on use of the setvl instruction if used to alter VL or MVL.
- Pack - if set then srcstep/ssubstep VL/SUBVL loop-ordering is inverted.
- UnPack - if set then dststep/dsubstep VL/SUBVL loop-ordering is inverted.
- hphint - Horizontal Parallelism Hint. Indicates that no Hazards exist between the elements within each group, where groups are formed from sequential multiples of this number (before REMAP). By definition: elements for which FLOOR(step/hphint) is equal before REMAP are in the same parallelism "group", for both srcstep and dststep. In Vertical First Mode hardware MUST respect Strict Program Order but is permitted to merge multiple scalar loops into parallel batches, if Reservation Station resources are sufficient. Set to zero to indicate "no hint".
- SVme - REMAP enable bits, indicating which register is to be REMAPed: RA, RB, RC, RT and EA are the canonical (typical) register names associated with each bit, with RA being the LSB and EA being the MSB. See table below for ordering. When SVme is zero (0b00000) REMAP is fully disabled and inactive regardless of the contents of SVSTATE, mi0-mi2/mo0-mo1, or the four SVSHAPEn SPRs.
- mi0-mi2/mo0-mo1 - these indicate the SVSHAPE (0-3) that the corresponding register (RA etc) should use, as long as the register's corresponding SVme bit is set.
Programmer's Note: the fact that REMAP is entirely dormant when SVme is zero allows establishment of REMAP context well in advance, followed by utilising svremap at a precise (or the very last) moment. Some implementations may exploit this to cache (or take some time to prepare caches) in the background whilst other (unrelated) instructions are being executed. This is particularly important to bear in mind when using svindex which will require hardware to perform (and cache) additional GPR reads.
Programmer's Note: when REMAP is activated it becomes necessary on any context-switch (Interrupt or Function call) to detect (or know in advance) that REMAP is enabled and to additionally explicitly save/restore the four SVSHAPE SPRs, SVSHAPE0-3. Given that this is expected to be a rare occurrence it was deemed unreasonable to burden every context-switch or function call with mandatory save/restore of SVSHAPEs, and consequently it is a callee (and Trap Handler) responsibility. Callees (and Trap Handlers) MUST avoid using any SVP64 instructions during the period where state could be adversely affected. SVP64 purely relies on Scalar instructions, so Scalar instructions (except the SVP64 Management ones and mtspr and mfspr) are 100% guaranteed to have zero impact on SVP64 state.
SVme REMAP area
Each bit of SVSTATE.SVme indicates whether the SVSHAPE (0-3) is active and to which register the REMAP applies. The application goes by assembler operand names on a per-mnemonic basis. Some instructions may have RT as a source and as a destination: REMAP applies separately to each use in this case. For Load/Store with Update the Effective Address (stored in EA) may also be separately REMAPed from RA as a source operand.
bit | applies | register applied |
---|---|---|
46 | mi0 | source RA / FRA / BA / BFA / RT / FRT |
45 | mi1 | source RB / FRB / BB |
44 | mi2 | source RC / FRC / BC |
43 | mo0 | result RT / FRT / BT / BF |
42 | mo1 | result Effective Address (RA) / FRS / RS |
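The relationship between SVme and the mi0-mi2/mo0-mo1 selectors can be sketched as follows. This is a minimal illustration under stated assumptions: REMAP_SLOTS and active_remaps are hypothetical names, SVme is passed as a plain 5-bit integer whose least significant bit corresponds to SVSTATE bit 46 (RA), and the register roles use only the canonical names.

```python
# (canonical register role, bit position within the 5-bit SVme value, selector)
# Per the table: SVSTATE bit 46 (the LSB of SVme) enables mi0 for RA, and
# SVSTATE bit 42 (the MSB of SVme) enables mo1 for the Effective Address.
REMAP_SLOTS = [
    ("RA", 0, "mi0"),
    ("RB", 1, "mi1"),
    ("RC", 2, "mi2"),
    ("RT", 3, "mo0"),
    ("EA", 4, "mo1"),
]


def active_remaps(svme, selectors):
    """Return {role: SVSHAPE index} for each operand whose SVme bit is set.

    selectors maps "mi0".."mo1" to a 2-bit SVSHAPE number (0-3).
    When svme is zero the result is empty: REMAP is fully inactive.
    """
    return {
        role: selectors[sel] & 0b11
        for role, bit, sel in REMAP_SLOTS
        if (svme >> bit) & 1
    }
```

For example, with SVme=0b01001 only RA and RT are REMAPed, each through the SVSHAPE named by mi0 and mo0 respectively, whilst the remaining selectors are ignored.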
MAXVECTORLENGTH is a static (immediate-operand only) compile-time declaration of the maximum number of elements in a Vector. MVL is limited to 7 bits (in the first version of SVP64) and consequently the maximum number of elements is limited to between 0 and 127.
MAXVL is normally (in other True-Scalable Vector ISAs) an Architecturally-defined quantity related indirectly to the total available number of bits in the Vector Register File. Cray Vectors had a Hardware-Architectural set limit of MAXVL=64. RISC-V RVV has MAXVL defined in terms of a Silicon-Partner-selectable fixed number of bits. MAXVL in Simple-V is set in terms of the number of elements and may change at runtime.
Programmer's Note: Except by directly using mtspr on SVSTATE, which may result in performance penalties on some hardware implementations, SVSTATE's maxvl field may only be set statically as an immediate, by the setvl instruction. It may NOT be set dynamically from a register. Compiler writers and assembly programmers are expected to perform static register file analysis, subdivision, and allocation and only utilise setvl. Direct writing to SVSTATE in order to "bypass" this Note could, in less-advanced implementations, potentially cause stalling, particularly if SVP64 instructions are issued directly after the mtspr to SVSTATE.
The actual Vector length, the number of elements in a "Vector", SVSTATE.vl may be set entirely dynamically at runtime from a number of sources. setvl is the primary instruction for setting Vector Length.
setvl is conceptually similar to, but different from, the Cray, SX Aurora, and RISC-V RVV equivalents. Similar to RVV, VL is set to be within the range 0 <= VL <= MVL. Unlike RVV, VL is set exactly according to the following:
VL = (RT|0) = MIN(vlen, MVL)
where 0 <= MVL <= 127, and vlen may come from an immediate, RA, or from the CTR SPR, depending on options selected with the setvl instruction.
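The formula above can be illustrated with a short sketch. It models only the VL = MIN(vlen, MVL) calculation, using the 0 <= MVL <= 127 bound stated alongside it. It is not a model of the setvl instruction itself, and compute_vl is a hypothetical name.

```python
def compute_vl(vlen, mvl):
    """Return VL = MIN(vlen, MVL), as in the formula above.

    vlen stands in for the requested length, whichever source it came from
    (immediate, RA or CTR); mvl is the static Maximum Vector Length.
    """
    if not 0 <= mvl <= 127:
        raise ValueError("MVL must fit in 7 bits (0..127)")
    if vlen < 0:
        raise ValueError("vlen must be non-negative")
    return min(vlen, mvl)
```

A request for 200 elements with MVL=8 therefore yields VL=8, whilst a request for 3 elements yields VL=3: the result is exact rather than implementation-chosen.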
Programmer's Note: conceptual understanding of Cray-style Vectors is far beyond the scope of the Power ISA Technical Reference. Guidance on the 50-year-old Cray Vector paradigm is best sought elsewhere: good studies include Academic Courses given on the 1970s Cray Supercomputers over at least the past three decades.
Horizontal Parallelism
A problem exists for hardware where it may not be able to detect that a programmer (or compiler) knows of opportunities for parallelism and lack of overlap between loops, despite these being easy for a compiler to statically detect and potentially express. hphint is such an expression, declaring that elements within a batch are independent of each other (no Register or Memory Hazards).
Elements are considered to be in the same source batch if they have the same value of FLOOR(srcstep/hphint). Likewise in the same destination batch for the same value FLOOR(dststep/hphint).
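The batch rule can be sketched in a few lines. This is illustrative only: batch_of and same_batch are hypothetical names, steps are the pre-REMAP srcstep or dststep values, and hphint=0 ("no hint") is treated here as declaring no independence at all.

```python
def batch_of(step, hphint):
    """Return the batch index FLOOR(step/hphint), or None when hphint is 0."""
    if hphint == 0:
        return None  # "no hint": no batches are declared
    return step // hphint


def same_batch(step_a, step_b, hphint):
    """True only if both steps are declared Hazard-independent by hphint."""
    if hphint == 0:
        return False  # assumption: without a hint nothing is declared independent
    return step_a // hphint == step_b // hphint


# Batches for eight elements with hphint=3: [0, 0, 0, 1, 1, 1, 2, 2]
batches = [batch_of(step, 3) for step in range(8)]
```

Note that hphint=3 is used deliberately: the hint need not be a power of two, and the final batch may be shorter than the others.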
Four key observations here:
- predication is not involved here. The number of actual elements involved is considered before predicate masks are applied.
- twin predication can result in srcstep and dststep being in different batches
- batch evaluation is done before REMAP, making Hazard elimination easier for Multi-Issue systems.
- hphint is not limited to power-of-two. Hardware implementors may choose a lower parallelism hint up to hphint and may find power-of-two more convenient.
Regarding the fourth observation: if a smaller hint is chosen by hardware, actual parallelism (Dependency Hazard relaxation) must never exceed hphint and must still respect the batch boundaries, even if this results in just one element being considered Hazard-independent. Even under these circumstances Multi-Issue Register-renaming is possible, to introduce parallelism by a different route.
Hardware Architect note: each element within the same group may be treated as 100% independent from any other element within that group, and therefore neither Register Hazards nor Memory Hazards exist between elements of the group, but crucially inter-group Hazards definitely remain. This makes implementation far easier on resources because the Hazard Dependencies are effectively at a much coarser granularity than a single register. With element-width overrides extending down to the byte level, reducing Dependency Hazard hardware complexity becomes even more important.
hphint may legitimately be set greater than MAXVL. This indicates to Multi-Issue hardware that even though MAXVL is relatively small the batches are still independent and therefore if Multi-Issue hardware chooses to allocate several batches up to MAXVL in size they are still independent, even if Register-renaming is deployed. This helps greatly simplify Multi-Issue systems by significantly reducing Hazards.
Considerable care must be taken when setting hphint. Matrix Outer Product could produce corrupted results if hphint is set to greater than the innermost loop depth. Parallel Reduction, DCT and FFT REMAP are all similarly critically affected by hphint: used correctly it greatly increases ease of parallelism, but used incorrectly it will result in data corruption. Reduction/Iteration also requires care to correctly declare in hphint how many elements are independent. In the case of most Reduction use-cases the answer is almost certainly "none".
hphint must never be set on Atomic Memory operations, Cache-Inhibited Memory operations, or Load-Reservation Store-Conditional. Also if Load-with-Update Data-Dependent Fail-First is ever used for linked-list pointer-chasing, hphint should again definitely be disabled. Failure to do so results in UNDEFINED behaviour.
hphint may only be ignored by Hardware Implementors as long as element-level Register and Memory Hazards are implemented in full (including right down to individual bytes of each register for when elwidth=8/16/32). In other words if hphint is to be ignored then implementations must consider the situation as if hphint=0.
Horizontal Parallelism in Vertical-First Mode
Setting hphint with Vertical-First is perfectly legitimate. Under these circumstances single-element strict Program Execution Order must be preserved at all times, but should there be a small enough program loop, then Out-of-Order Hardware may take the opportunity to merge consecutive element-based instructions into the same Reservation Stations, for multiple operations to be passed to massive-wide back-end SIMD ALUs or Vector-Chaining ALUs. Only elements within the same hphint group (across multiple such looped instructions) may be treated as mergeable in this fashion.
Note that if the loop of Vertical-First instructions cannot fit entirely into Reservation Stations then Hardware clearly cannot exploit the above optimisation opportunity, but at least there is no harm done: the loop is still correctly executed as Scalar instructions. Programmers do need to be aware though that short loops on some Hardware Implementations can be made considerably faster than on other Implementations.
SVLR
SV Link Register, exactly analogous to LR (Link Register), may be used for temporary storage of SVSTATE, and, in particular, Vectorized Branch-Conditional instructions may interchange SVLR and SVSTATE whenever LR and NIA are interchanged.
Note that there is no equivalent Link variant of SVREMAP or SVSHAPE0-3 (it would be too costly), so SVLR has limited applicability: REMAP SPRs must be saved and restored explicitly.