SV Load and Store

Rationale

All Vector ISAs dating back fifty years have extensive and comprehensive Load and Store operations that go far beyond the capabilities of Scalar RISC and most CISC processors, yet at their heart, on an individual element basis, may be found to be no different from RISC Scalar equivalents.

The resource savings from Vector LD/ST are significant and stem from the fact that one single instruction can trigger a dozen (or, in some microarchitectures such as Cray or NEC SX Aurora, hundreds of) element-level Memory accesses.

Additionally, and simply: if the Arithmetic side of an ISA supports Vector Operations, then in order to keep the ALUs 100% occupied the Memory infrastructure (and the ISA itself) correspondingly needs Vector Memory Operations as well.

Vectorised Load and Store also presents an extra dimension (literally) which creates scenarios unique to Vector applications, that a Scalar (and even a SIMD) ISA simply never encounters. SVP64 endeavours to add the modes typically found in all Scalable Vector ISAs, without changing the behaviour of the underlying Base (Scalar) v3.0B operations in any way.

Modes overview

Vectorisation of Load and Store requires the creation, from scalar operations, of a number of different modes:

  • fixed aka "unit" stride - contiguous sequence with no gaps
  • element strided - sequential but regularly offset, with gaps
  • vector indexed - vector of base addresses and vector of offsets
  • Speculative fail-first - where it makes sense to do so
  • Structure Packing - covered in SV by remap and Pack/Unpack Mode.

Despite being constructed from Scalar LD/ST, none of these Modes exist or make sense in any Scalar ISA: they only exist in Vector ISAs.
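
By way of illustration (a non-normative Python sketch, with a hypothetical effective_addresses helper), the per-element Effective Addresses produced by the first three of these modes are:

# illustrative sketch only: hypothetical helper, not the spec pseudocode
def effective_addresses(mode, base, immed, op_width, offsets, VL):
    eas = []
    for i in range(VL):
        if mode == "unitstride":          # contiguous: base + immed + i*op_width
            eas.append(base + immed + i * op_width)
        elif mode == "elementstride":     # regular gaps: base + i*immed
            eas.append(base + i * immed)
        elif mode == "indexed":           # per-element offset from a vector of offsets
            eas.append(base + offsets[i])
    return eas

# example: four 4-byte loads in each mode
print(effective_addresses("unitstride",    0x1000,  0, 4, None,           4))
print(effective_addresses("elementstride", 0x1000, 16, 4, None,           4))
print(effective_addresses("indexed",       0x1000,  0, 4, [0, 64, 8, 32], 4))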

Also included in SVP64 LD/ST is both signed and unsigned Saturation, as well as Element-width overrides and Twin-Predication.

Note also that Indexed remap mode may be applied to both v3.0 LD/ST Immediate instructions and v3.0 LD/ST Indexed instructions. LD/ST-Indexed should not be conflated with Indexed REMAP mode: clarification is provided below.

Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector features to a Scalar ISA) is that certain features do not exactly make sense or are considered a security risk. Fail-first on Vector Indexed would allow attackers to probe large numbers of pages from userspace, where strided fail-first (by creating contiguous sequential LDs) does not.

In addition, reduce mode makes no sense for LD/ST. Realistically, an alternative table definition is needed for svp64 RM.MODE. The following modes make sense:

  • saturation
  • predicate-result (mostly for cache-inhibited LD/ST)
  • simple (no augmentation)
  • fail-first (where Vector Indexed is banned)
  • Signed Effective Address computation (Vector Indexed only)
  • Pack/Unpack (on LD/ST immediate operations only)

More than that, however, it is necessary to fit the usual Vector ISA capabilities onto both Power ISA LD/ST with immediate and LD/ST Indexed. These present subtly different Mode tables, which, due to lack of space, have the following quirks:

  • LD/ST Immediate has no individual control over src/dest zeroing, whereas LD/ST Indexed does.
  • LD/ST Immediate has no Saturated Pack/Unpack (Arithmetic Mode does)
  • LD/ST Indexed has no Pack/Unpack (REMAP may be used instead)

Format and fields

Fields used in tables below:

  • sz / dz: if predication is enabled, will put zeros into the dest (or as src in the case of twin predication) when the predicate bit is zero; otherwise the element is ignored or skipped, depending on context.
  • zz: both sz and dz are set equal to this flag.
  • inv CR-bit: just as in branches (BO), these bits allow testing of a CR bit and whether it is set (inv=0) or unset (inv=1).
  • N: sets signed/unsigned saturation.
  • RC1: as if Rc=1, stores CRs but not the result.
  • SEA: Signed Effective Address; if enabled, performs sign-extension on registers that have been reduced due to elwidth overrides.
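
A rough sketch of the sz/dz zeroing behaviour (a Python illustration assuming a simple dense predicate; not the formal definition):

# sketch: dest-zeroing (dz) vs skipping, for a single-source vector op
def apply_predicate(dest, result, pred, dz):
    for i in range(len(dest)):
        if (pred >> i) & 1:
            dest[i] = result[i]      # predicate bit set: element is computed
        elif dz:
            dest[i] = 0              # dz=1: masked-out element is zeroed
        # dz=0: masked-out element is skipped (dest left untouched)
    return dest

print(apply_predicate([7, 7, 7, 7], [1, 2, 3, 4], 0b0101, dz=True))   # [1, 0, 3, 0]
print(apply_predicate([7, 7, 7, 7], [1, 2, 3, 4], 0b0101, dz=False))  # [1, 7, 3, 7]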

LD/ST immediate

The svp64 table for immed(RA), which is RM.MODE (bits 19:23 of RM), is:

0-1  2    3       4     description
00   0    zz      els   simple mode
00   1    /       /     reserved
01   inv  CR-bit        Rc=1: ffirst CR sel
01   inv  els     RC1   Rc=0: ffirst z/nonz
10   N    zz      els   sat mode: N=0/1 u/s
11   inv  CR-bit        Rc=1: pred-result CR sel
11   inv  els     RC1   Rc=0: pred-result z/nonz
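
A hedged sketch of reading that table (the field names and the packing of the two-bit CR-bit field are illustrative assumptions, not the formal decoder):

# illustrative decode of the LD/ST-immediate RM.MODE table above
def decode_ldst_imm_mode(mode2, bit2, bit3, bit4, Rc):
    if mode2 == 0b00:
        if bit2 == 0:
            return dict(kind="simple", zz=bit3, els=bit4)
        return dict(kind="reserved")
    if mode2 == 0b01:
        if Rc:
            return dict(kind="ffirst", inv=bit2, CRbit=(bit3 << 1) | bit4)
        return dict(kind="ffirst", inv=bit2, els=bit3, RC1=bit4)
    if mode2 == 0b10:
        return dict(kind="saturate", signed=bit2, zz=bit3, els=bit4)
    if Rc:
        return dict(kind="pred-result", inv=bit2, CRbit=(bit3 << 1) | bit4)
    return dict(kind="pred-result", inv=bit2, els=bit3, RC1=bit4)

print(decode_ldst_imm_mode(0b10, 1, 1, 0, Rc=0))  # signed saturation, zz=1, els=0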

The els bit is only relevant when RA.isvec is clear: this indicates whether stride is unit or element:

if RA.isvec:
    svctx.ldstmode = indexed
elif els == 0:
    svctx.ldstmode = unitstride
elif immediate != 0:
    svctx.ldstmode = elementstride

An immediate of zero is a safety-valve to allow LD-VSPLAT: in effect the multiplication of the immediate-offset by zero results in reading from the exact same memory location, even with a Vector register. (Normally this type of behaviour is reserved for the mapreduce modes)

For LD-VSPLAT, on non-cache-inhibited Loads, the read can occur just the once and be copied, rather than hitting the Data Cache multiple times with the same memory read at the same location. The benefit of Cache-inhibited LD-splats is that it allows for memory-mapped peripherals to have multiple data values read in quick succession and stored in sequentially numbered registers (but, see Note below).
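
A minimal sketch of the non-cache-inhibited LD-VSPLAT effect (Python, with memory modelled as a simple dict):

# sketch: immediate of zero with scalar RA and Vector RT reads one location,
# splatting the same value into successive destination elements
def ld_vsplat(mem, base, VL):
    value = mem[base]                 # one read (non-cache-inhibited case)
    return [value] * VL               # copied into VL destination elements

mem = {0x2000: 0xDEADBEEF}
print(ld_vsplat(mem, 0x2000, 4))      # [0xDEADBEEF, 0xDEADBEEF, 0xDEADBEEF, 0xDEADBEEF]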

For non-cache-inhibited ST from a vector source onto a scalar destination: with the Vector loop effectively creating multiple memory writes to the same location, we can deduce that the last of these will be the "successful" one. Thus, implementations are free and clear to optimise out the overwriting STs, leaving just the last one as the "winner". Bear in mind that predicate masks will skip some elements (in source non-zeroing mode). Cache-inhibited ST operations on the other hand MUST write out a Vector source multiple successive times to the exact same Scalar destination. Just like Cache-inhibited LDs, multiple values may be written out in quick succession to a memory-mapped peripheral from sequentially-numbered registers.
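
The two ST cases described above can be sketched as follows (non-normative; predication omitted):

# sketch: vector source, scalar destination address
def st_vector_to_scalar(mem, addr, src, cache_inhibited):
    if cache_inhibited:
        for v in src:                 # CI: every element MUST be written, in order
            mem[addr] = v
    else:
        mem[addr] = src[-1]           # non-CI: only the last ("winning") write need occur
    return mem

print(st_vector_to_scalar({}, 0x3000, [1, 2, 3, 4], cache_inhibited=False))  # {0x3000: 4}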

Note that any memory location may be Cache-inhibited (Power ISA v.1, Book III, 1.6.1, p1033)

LD/ST Indexed

The modes for the RA+RB indexed version are slightly different, but use the same RM.MODE bits (19:23 of RM):

0-1  2    3       4     description
00   SEA  dz      sz    simple mode
01   SEA  dz      sz    Strided (scalar only source)
10   N    dz      sz    sat mode: N=0/1 u/s
11   inv  CR-bit        Rc=1: pred-result CR sel
11   inv  zz      RC1   Rc=0: pred-result z/nonz

Vector Indexed Strided Mode is qualified as follows:

if mode = 0b01 and !RA.isvec and !RB.isvec:
    svctx.ldstmode = elementstride

A summary of the effect of Vectorisation of src or dest:

 imm(RA)  RT.v   RA.v   no stride allowed
 imm(RA)  RT.s   RA.v   no stride allowed
 imm(RA)  RT.v   RA.s   stride-select allowed
 imm(RA)  RT.s   RA.s   not vectorised
 RA,RB    RT.v  {RA|RB}.v Standard Indexed
 RA,RB    RT.s  {RA|RB}.v Indexed but single LD (no VSPLAT)
 RA,RB    RT.v  {RA&RB}.s VSPLAT possible. stride selectable
 RA,RB    RT.s  {RA&RB}.s not vectorised (scalar identity)

Signed Effective Address computation is only relevant for Vector Indexed Mode, when elwidth overrides are applied. The source elwidth override applies to RB: if SEA is set, RB is sign-extended from elwidth bits to the full 64 bits before being added to RA to calculate the Effective Address. For other Modes (ffirst, saturate), all EA computation with elwidth overrides is unsigned.
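
A small sketch of the SEA computation, assuming an 8-bit elwidth override on RB:

# sketch: sign-extension of the RB offset prior to EA computation
def sext(value, bits):
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

RA = 0x10000
RB_elem = 0xF0                        # 8-bit element, i.e. -16 when signed
print(hex(RA + RB_elem))              # SEA=0: 0x100f0 (unsigned offset)
print(hex(RA + sext(RB_elem, 8)))     # SEA=1: 0xfff0  (signed offset)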

Note that cache-inhibited LD/ST when VSPLAT is activated will perform multiple LD/ST operations, sequentially. Even with a scalar src, a Cache-inhibited LD will read the same memory location multiple times, storing the result in successive Vector destination registers. This is because the cache-inhibit instructions are typically used to read and write memory-mapped peripherals. If a genuine cache-inhibited LD-VSPLAT is required then a single scalar cache-inhibited LD should be performed, followed by a VSPLAT-augmented mv, copying the one scalar value into multiple register destinations.

Note also that cache-inhibited VSPLAT with Predicate-result is possible. This allows, for example, a large batch of memory-mapped peripheral reads to be issued, stopping at the first NULL character and truncating VL to that point. No branch is needed to issue that large burst of LDs, which may be valuable in Embedded scenarios.
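
A rough sketch of that idea (the read_peripheral callback is hypothetical; this is not actual Power ISA semantics):

# sketch: cache-inhibited reads from a peripheral, truncating VL at the first NUL byte
def ci_read_until_nul(read_peripheral, VL):
    out = []
    for i in range(VL):
        ch = read_peripheral()        # each element is a fresh CI read of the same address
        if ch == 0:
            return out, i             # VL truncated at the NUL element
        out.append(ch)
    return out, VL

data = iter([0x48, 0x69, 0x00, 0x99])
print(ci_read_until_nul(lambda: next(data), 4))   # ([0x48, 0x69], 2)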

Vectorisation of Scalar Power ISA v3.0B

Scalar Power ISA Load/Store operations may be seen from fixedload and fixedstore pseudocode to be of the form:

lbux RT, RA, RB
EA <- (RA) + (RB)
RT <- MEM(EA)

and for immediate variants:

lb RT,D(RA)
EA <- RA + EXTS(D)
RT <- MEM(EA)

Thus in the first example, the source registers may each be independently marked as scalar or vector, and likewise the destination; in the second example only the one source and one dest may be marked as scalar or vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated with the pseudocode below, the immediate can be used to give unit stride or element stride. Since there is no way to tell which from the Power v3.0B Scalar opcode alone, the choice is provided instead by the SV Context.

# LD not VLD!  format - ldop RT, immed(RA)
# op_width: lb=1, lh=2, lw=4, ld=8
op_load(RT, RA, op_width, immed, svctx, RAupdate):
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, u=0; i < VL && j < VL;):
    # skip non-predicated elements
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if svctx.ldstmode == elementstride:
      # element stride mode
      srcbase = ireg[RA]
      offs = i * immed              # j*immed for a ST
    elif svctx.ldstmode == unitstride:
      # unit stride mode
      srcbase = ireg[RA]
      offs = immed + (i * op_width) # j*op_width for ST
    elif RA.isvec:
      # quirky Vector indexed mode but with an immediate
      srcbase = ireg[RA+i]
      offs = immed;
    else
      # standard scalar mode (but predicated)
      # no stride multiplier means VSPLAT mode
      srcbase = ireg[RA]
      offs = immed

    # compute EA
    EA = srcbase + offs
    # update RA?
    if RAupdate: ireg[RAupdate+u] = EA;
    # load from memory
    ireg[RT+j] <= MEM[EA];
    if (!RT.isvec)
        break # destination scalar, end now
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RT.isvec) j++;

Indexed LD is:

# format: ldop RT, RA, RB
function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
    # skip nonpredicated RA, RB and RT
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RB.isvec) while (!(ps & 1<<k)) k++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if svctx.ldstmode == elementstride:
        EA = ireg[RA] + ireg[RB]*j   # register-strided
    else
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
    if RAupdate: ireg[RAupdate+u] = EA
    ireg[RT+j] <= MEM[EA];
    if (!RT.isvec)
        break # destination scalar, end immediately
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RB.isvec) k++;
    if (RT.isvec) j++;

Note in both cases that svp64 allows RA-as-a-dest in "update" mode (ldux) to be effectively a completely different register from RA-as-a-source. This is because there is room in svp64 to extend RA-as-src as well as RA-as-dest, both independently as scalar or vector and independently extending their range.

Programmer's note: being able to set RA-as-a-source separately from RA-as-a-destination (e.g. keeping the source Scalar) is extremely valuable once it is remembered that Simple-V element operations must be in Program Order; this saves on multiple address computations, especially in loops. Care does have to be taken, however, that RA-as-src is not overwritten by RA-as-dest unless intentionally desired, especially in element-strided Mode.

LD/ST Indexed vs Indexed REMAP

Unfortunately the word "Indexed" is used twice in completely different contexts, potentially causing confusion.

  • Instructions of the form ld RT,RA,RB have existed in the Power ISA since its creation: these are called "LD/ST Indexed" instructions and their name and meaning is well-established.
  • There now exists, in Simple-V, a remap mode called "Indexed" Mode that can be applied to any instruction including those named LD/ST Indexed.

Whilst allowing REMAP Indexed Mode to be applied to any Vectorised LD/ST Indexed operation such as sv.ld *RT,RA,*RB may be costly in terms of register reads, and may even be misleadingly labelled as redundant, firstly the strict application of the RISC Paradigm that Simple-V follows makes it awkward to consider preventing the application of Indexed REMAP to such operations, and secondly the two are not actually the same at all.

Indexed REMAP, as applied to RB in the instruction sv.ld *RT,RA,*RB, effectively performs an in-place re-ordering of the offsets, RB. To achieve the same effect without Indexed REMAP would require taking a copy of the Vector of offsets starting at RB, manually and explicitly reordering them, and finally using the copy of re-ordered offsets in a non-REMAP'ed sv.ld. Using non-strided LD as an example, the pseudocode below shows what actually occurs (the pseudocode for indexed_remap may be found in remap):

# sv.ld *RT,RA,*RB with Index REMAP applied to RB
for i in 0..VL-1:
    if remap.indexed:
        rb_idx = indexed_remap(i) # remap
    else:
        rb_idx = i # use the index as-is
    EA = GPR(RA) + GPR(RB+rb_idx)
    GPR(RT+i) = MEM(EA, 8)

Thus it can be seen that the use of Indexed REMAP saves copying and manual reordering of the Vector of RB offsets.
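
For contrast, a sketch of the manual alternative without Indexed REMAP (explicit copy and reorder of the offsets, followed by plain indexed loads; the register numbers below are arbitrary):

# sketch: what software must do if Indexed REMAP were not available
def manual_indexed(mem, GPR, RA, RB, RT, VL, remap_order):
    reordered = [GPR[RB + remap_order[i]] for i in range(VL)]   # explicit copy + reorder
    for i in range(VL):
        GPR[RT + i] = mem[GPR[RA] + reordered[i]]               # plain sv.ld, no REMAP
    return GPR

GPR = {0: 0x100, 8: 0, 9: 8, 10: 16, 16: 0, 17: 0, 18: 0}
mem = {0x100: 11, 0x108: 22, 0x110: 33}
print(manual_indexed(mem, GPR, 0, 8, 16, 3, [2, 0, 1]))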

LD/ST ffirst

LD/ST ffirst treats the first LD/ST in a vector (element 0 if REMAP is not active) as an ordinary one, with all behaviour with respect to Interrupts, Exceptions, Page Faults and Memory Management being identical in every regard to Scalar v3.0 Power ISA LD/ST. However for elements 1 and above, if an exception would occur, then VL is truncated to the previous element: the exception is not then raised because the LD/ST that would otherwise have caused an exception is required to be cancelled. Additionally an implementor may choose to truncate VL for any arbitrary reason, except at the very first element.
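
A hedged sketch of that rule, with fault detection abstracted into a hypothetical would_fault check:

# sketch: element 0 behaves as an ordinary LD; a fault at element i>0 truncates VL to i
def ffirst_load(mem, base, stride, VL, would_fault):
    result = []
    for i in range(VL):
        ea = base + i * stride
        if would_fault(ea):
            if i == 0:
                raise MemoryError("element 0 faults exactly like a Scalar LD")
            return result, i          # VL truncated; no exception raised
        result.append(mem.get(ea, 0))
    return result, VL

mem = {0x0: 5, 0x8: 6}
print(ffirst_load(mem, 0x0, 8, 4, lambda ea: ea >= 0x10))   # ([5, 6], 2)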

ffirst LD/ST to multiple pages via a Vectorised Index base is considered a security risk due to the abuse of probing multiple pages in rapid succession and getting speculative feedback on which pages would fail. Therefore Vector Indexed LD/ST is prohibited entirely, and the Mode bit is instead used for element-strided LD/ST, of the form shown below. See https://bugs.libre-soc.org/show_bug.cgi?id=561

for(i = 0; i < VL; i++)
    reg[rt + i] = mem[reg[ra] + i * reg[rb]];

High security implementations where any kind of speculative probing of memory pages is considered a risk should take advantage of the fact that implementations may truncate VL at any point, without requiring software to be rewritten and made non-portable. Such implementations may choose to always set VL=1 which will have the effect of terminating any speculative probing (and also adversely affect performance), but will at least not require applications to be rewritten.

Low-performance simpler hardware implementations may also choose to always set VL=1 as the bare minimum compliant implementation of LD/ST Fail-First. It is however critically important to remember that the first element LD/ST MUST be treated as an ordinary LD/ST, i.e. MUST raise exceptions exactly like an ordinary LD/ST.

For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value for any implementation-specific reason. For example: it is perfectly reasonable for implementations to alter VL when ffirst LD or ST operations are initiated on a nonaligned boundary, such that within a loop the subsequent iteration of that loop begins the following ffirst LD/ST operations on an aligned boundary such as the beginning of a cache line, or beginning of a Virtual Memory page. Likewise, to reduce workloads or balance resources.

Vertical-First Mode is slightly strange in that only one element at a time is ever executed anyway. Given that programmers may legitimately choose to alter srcstep and dststep in non-sequential order as part of explicit loops, it is neither possible nor safe to make speculative assumptions about future LD/STs. Therefore, Fail-First LD/ST in Vertical-First is UNDEFINED. This is very different from Arithmetic (Data-dependent) FFirst where Vertical-First Mode is fully deterministic, not speculative.

LOAD/STORE Elwidths

Loads and Stores are almost unique in that the Power Scalar ISA provides a width for the operation (lb, lh, lw, ld). Only extsb and others like it provide an explicit operation width. There are therefore three widths involved:

  • operation width (lb=8, lh=16, lw=32, ld=64)
  • src element width override (8/16/32/default)
  • destination element width override (8/16/32/default)

Some care is therefore needed to express and make clear the transformations, which are expressly in this order:

  • Calculate the Effective Address from RA at full width but (on Indexed Load) allow srcwidth overrides on RB
  • Load at the operation width (lb/lh/lw/ld) as usual
  • byte-reversal as usual
  • Non-saturated mode:
    • zero-extension or truncation from operation width to dest elwidth
    • place result in destination at dest elwidth
  • Saturated mode:
    • Sign-extension or truncation from operation width to dest width
    • signed/unsigned saturation down to dest elwidth
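
As a worked sketch of those steps, consider a hypothetical sv.lw (32-bit operation width) with an 8-bit destination elwidth override (the value 300 is purely illustrative):

# sketch of the ordering above: load at op width, then adjust to dest elwidth
def clamp_signed(value, bits):
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))

memread = 300                         # lw reads 32 bits; dest elwidth override is 8
print(memread & 0xFF)                 # non-saturated: truncate to 8 bits -> 44
print(clamp_signed(memread, 8))       # saturated signed: clamp to 8-bit range -> 127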

In order to respect Power v3.0B Scalar behaviour the memory side is treated effectively as completely separate and distinct from SV augmentation. This is primarily down to quirks surrounding LE/BE and byte-reversal.

It is rather unfortunately possible to request an elwidth override on the memory side which does not mesh with the overridden operation width: this results in UNDEFINED behaviour. The reason is that the effect of attempting a 64-bit sv.ld operation with a source elwidth override of 8/16/32 would be overlapping memory requests, particularly on unit and element strided operations. Thus it is UNDEFINED when the elwidth is smaller than the memory operation width. Examples include sv.lw/sw=16/els, which requests (overlapping) 4-byte memory reads offset from each other at 2-byte intervals. Store likewise is also UNDEFINED where the dest elwidth override is less than the operation width.
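
The sv.lw/sw=16/els example works out as follows (a sketch purely to show why the requests overlap):

# sketch: element-strided lw (4-byte reads) with a 16-bit source elwidth override
op_width = 4                          # lw reads 4 bytes
elwidth_bytes = 2                     # the sw=16 override shrinks the stride to 2 bytes
for i in range(4):
    offs = i * elwidth_bytes
    print(f"element {i}: reads bytes {offs}..{offs + op_width - 1}")  # ranges overlap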

Note the following regarding the pseudocode to follow:

  • scalar identity behaviour SV Context parameter conditions turn this into a straight absolute fully-compliant Scalar v3.0B LD operation
  • brev selects whether the operation is the byte-reversed variant (ldbrx rather than ld)
  • op_width specifies the operation width (lb, lh, lw, ld) as a "normal" part of Scalar v3.0B LD
  • imm_offs specifies the immediate offset ld r3, imm_offs(r5), again as a "normal" part of Scalar v3.0B LD
  • svctx specifies the SV Context and includes VL as well as source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector capability). Observe in particular that RA, as the base address in both Immediate and Indexed LD/ST, does not have element-width overriding applied to it.

Note that predication, predication-zeroing, and other modes except saturation have all been removed, for clarity and simplicity:

# LD not VLD!
# this covers unit stride mode and a type of vector offset
function op_ld(RT, RA, op_width, imm_offs, svctx)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.unit/el-strided:
        # strange vector mode, compute 64 bit address which is
        # not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # unit / element stride mode, compute 64 bit address
        srcbase = get_polymorphed_reg(RA, 64, 0)
        # adjust for unit/el-stride
        srcbase += ....

    # read the underlying memory
    memread <= MEM(srcbase + imm_offs, op_width)

    # check saturation.
    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;

Note above that the source elwidth is not used at all in LD-immediate.

For LD/Indexed, the key is that in the calculation of the Effective Address, RA has no elwidth override but RB does. Pseudocode below is simplified for clarity: predication and all modes except saturation are removed:

# LD not VLD! ld*rx if brev else ld*
function op_ld(RT, RA, RB, op_width, svctx, brev)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.el-strided:
        # RA not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # element stride mode, again RA not polymorphic
        srcbase = get_polymorphed_reg(RA, 64, 0)
    # RB *is* polymorphic
    offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)
    # sign-extend
    if svctx.SEA: offs = sext(offs, svctx.src_elwidth, 64)

    # takes care of (merges) processor LE/BE and ld/ldbrx
    bytereverse = brev XNOR MSR.LE

    # read the underlying memory
    memread <= MEM(srcbase + offs, op_width)

    # optionally performs byteswap at op width
    if (bytereverse):
        memread = byteswap(memread, op_width)

    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;

Remapped LD/ST

In the remap page the concept of "Remapping" is described. Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64 elements worth of LDs or STs. The usual interest in such re-mapping is, for example, in separating out 24-bit RGB channel data into separate contiguous registers. NEON covers this capability with its dedicated structure load/store instructions.

Remap easily covers this capability, and with dest elwidth overrides and saturation may do so with built-in conversion that would normally require additional width-extension, sign-extension and min/max Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes because the generic abstracted concept of "Remapping", when applied to LD/ST, will give that same capability, with far more flexibility.

It is worth noting that Pack/Unpack Modes of SVSTATE, which may be established through sv.setvl, are also an easy way to perform regular Structure Packing, at the vec2/vec3/vec4 granularity level. Beyond that, REMAP will need to be used.
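
As a rough illustration of the underlying index transformation (not the actual REMAP pseudocode, which is defined in remap), de-interleaving packed RGB into three contiguous register groups amounts to:

# sketch: 24-bit RGB structure packing as a 2D index remap (3 channels x N pixels)
def rgb_deinterleave(packed, npixels):
    channels = [[], [], []]
    for i in range(3 * npixels):
        channel, pixel = i % 3, i // 3          # remapped (transposed) element index
        channels[channel].append(packed[pixel * 3 + channel])
    return channels

packed = [10, 20, 30, 11, 21, 31, 12, 22, 32]   # R,G,B interleaved, 3 pixels
print(rgb_deinterleave(packed, 3))              # [[10,11,12], [20,21,22], [30,31,32]]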