RFC ls010 Simple-V Zero-Overhead Loop Prefix Subsystem

Fri Jun 23 13:53:20 2023

    New Book: new Zero-Overhead-Loop
    New Appendix, Zero-Overhead-Loop

    Adds a Zero-Overhead-Loop Subsystem based on the Cray True-Scalable Vector concept
    in a RISC-paradigm fashion. Total of six instructions (5-bit XO), plus a Prefix format (PO9).

    Adds a new "Zero-Overhead-Loop-Control" DSP-style, Vector-style
    subsystem. In simple low-end (Embedded) systems it may be implemented
    minimally and easily by inserting a new fully-independent Pipeline Stage
    between Decode and Issue, with very little disruption. In pre-existing
    higher-performance Multi-Issue Out-of-Order systems it likewise fits in
    seamlessly and significantly boosts performance.

    Requires support for new instructions in assembler, debuggers, and related tools.
    Dramatically reduces instruction count. Requires introduction of the term "High-Level Assembler".

    Cray Supercomputing, Vectorization, Zero-Overhead-Loop-Control (ZOLC),
    True-Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model,
    Digital Signal Processing (DSP), High-level Assembler

    There are no conceptual arithmetic ordering or other changes over the
    Scalar Power ISA definitions to registers or register files or to
    arithmetic or Logical Operations, beyond element-width subdivision.

    #pragma pack
    typedef union {
        uint8_t actual_bytes[8];
        // all of these are very deliberately unbounded arrays
        // that intentionally "wrap" into subsequent actual_bytes...
        uint8_t  bytes[]; // elwidth 8
        uint16_t hwords[]; // elwidth 16
        uint32_t words[]; // elwidth 32
        uint64_t dwords[]; // elwidth 64

    } el_reg_t;

    // ... here, as packed statically-defined GPRs.
    el_reg_t int_regfile[128];

    // use element 0 as the destination
    void get_register_element(el_reg_t* el, int gpr, int element, int width) {
        switch (width) {
            case 64: el->dwords[0] = int_regfile[gpr].dwords[element]; break;
            case 32: el->words[0] = int_regfile[gpr].words[element]; break;
            case 16: el->hwords[0] = int_regfile[gpr].hwords[element]; break;
            case 8 : el->bytes[0] = int_regfile[gpr].bytes[element]; break;
        }
    }

    // use element 0 as the source
    void set_register_element(el_reg_t* el, int gpr, int element, int width) {
        switch (width) {
            case 64: int_regfile[gpr].dwords[element] = el->dwords[0]; break;
            case 32: int_regfile[gpr].words[element] = el->words[0]; break;
            case 16: int_regfile[gpr].hwords[element] = el->hwords[0]; break;
            case 8 : int_regfile[gpr].bytes[element] = el->bytes[0]; break;
        }
    }

    # vector-add RT, RA, RB using the "uint64_t" union member "dwords"
    for i in range(VL):
        int_regfile[RT].dwords[i] = int_regfile[RA].dwords[i] + int_regfile[RB].dwords[i]

    # vector-add RT, RA, RB using the "uint16_t" union member "hwords"
    for i in range(VL):
        int_regfile[RT].hwords[i] = int_regfile[RA].hwords[i] + int_regfile[RB].hwords[i]

    | MSB0:  | 0:15    | 16:31   | 32:47   | 48:63   |
    | LSB0:  | 63:48   | 47:32   | 31:16   | 15:0    |
    |--------|---------|---------|---------|---------|
    | GPR(0) | same    | same    | same    | same    |
    | GPR(1) | result3 | result2 | result1 | result0 |
    | GPR(2) | same    | same    | same    | result4 |
    | GPR(3) | same    | same    | same    | same    |
    | ...    | ...     | ...     | ...     | ...     |
    | ...    | ...     | ...     | ...     | ...     |

    | MSB0:  | 0:31                 | 32:63                |
    | LSB0:  | 63:32                | 31:0                 |
    |--------|----------------------|----------------------|
    | GPR(0) | same                 | same                 |
    | GPR(1) | (result3 || result2) | (result1 || result0) |
    | GPR(2) | same                 | (same    || result4) |
    | GPR(3) | same                 | same                 |
    | ...    | ...                  | ...                  |
    | ...    | ...                  | ...                  |
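The wrap-around behaviour shown in the two tables above can be sketched in Python. This is a minimal illustration, not the normative definition: the register file is modelled as one flat little-endian `bytearray`, and the helper names `get_element` and `set_element`, as well as the register numbers chosen for RA and RB, are assumptions made only for this example.

```python
# Hypothetical model: 128 GPRs of 8 bytes each, stored as one flat
# little-endian byte array, so that element indices may run past the
# end of one register into the next (as the unbounded union arrays do).
NUM_GPRS = 128
regfile = bytearray(NUM_GPRS * 8)

def get_element(gpr, element, width):
    """Read element number `element` of `width` bits, counting from `gpr`."""
    nbytes = width // 8
    start = gpr * 8 + element * nbytes
    return int.from_bytes(regfile[start:start + nbytes], "little")

def set_element(gpr, element, width, value):
    """Write the low `width` bits of `value` into the given element."""
    nbytes = width // 8
    start = gpr * 8 + element * nbytes
    masked = value & ((1 << width) - 1)
    regfile[start:start + nbytes] = masked.to_bytes(nbytes, "little")

# Vector-add at elwidth=16 with VL=5. RT=1 matches the tables;
# RA=8 and RB=16 are arbitrary non-overlapping choices.
VL, RT, RA, RB = 5, 1, 8, 16
for i in range(VL):
    set_element(RA, i, 16, 100 + i)
    set_element(RB, i, 16, 1000)
for i in range(VL):
    total = get_element(RA, i, 16) + get_element(RB, i, 16)
    set_element(RT, i, 16, total)

# result0..result3 fill GPR(1); result4 lands in the low 16 bits of GPR(2)
gpr1 = get_element(1, 0, 64)
gpr2 = get_element(2, 0, 64)
```

Reading GPR(1) and GPR(2) back as 64-bit values shows four packed halfword results in the first register and the fifth result in bits 15:0 (LSB0) of the second, with the rest of GPR(2) untouched.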

    #pragma pack
    typedef union {
        // these do NOT match their Power ISA VSX numbering directly, they are all reversed
        // bytes[15] is actually VSR.byte[0] for example.  if this convention is not
        // followed then everything ends up in the wrong place
        uint8_t  bytes[16]; // elwidth 8, QTY 16 FIXED total
        uint16_t hwords[8]; // elwidth 16, QTY 8 FIXED total
        uint32_t words[4]; // elwidth 32, QTY 4 FIXED total
        uint64_t dwords[2]; // elwidth 64, QTY 2 FIXED total
        uint8_t actual_bytes[16]; // totals 128-bit
    } el_reg_t;

    el_reg_t VSR_regfile[64];

    static void check_num_elements(int elt, int width) {
        switch (width) {
            case 64: assert(elt < 2); break;
            case 32: assert(elt < 4); break;
            case 16: assert(elt < 8); break;
            case 8 : assert(elt < 16); break;
        }
    }
    void get_VSR_element(el_reg_t* el, int gpr, int elt, int width) {
        check_num_elements(elt, width);
        switch (width) {
            case 64: el->dwords[0] = VSR_regfile[gpr].dwords[1-elt]; break;
            case 32: el->words[0] = VSR_regfile[gpr].words[3-elt]; break;
            case 16: el->hwords[0] = VSR_regfile[gpr].hwords[7-elt]; break;
            case 8 : el->bytes[0] = VSR_regfile[gpr].bytes[15-elt]; break;
        }
    }
    void set_VSR_element(el_reg_t* el, int gpr, int elt, int width) {
        check_num_elements(elt, width);
        switch (width) {
            case 64: VSR_regfile[gpr].dwords[1-elt] = el->dwords[0]; break;
            case 32: VSR_regfile[gpr].words[3-elt] = el->words[0]; break;
            case 16: VSR_regfile[gpr].hwords[7-elt] = el->hwords[0]; break;
            case 8 : VSR_regfile[gpr].bytes[15-elt] = el->bytes[0]; break;
        }
    }

    int calc_VSR_reg_offs(int elt, int width) {
        switch (width) {
            case 64: return floor(elt / 2);
            case 32: return floor(elt / 4);
            case 16: return floor(elt / 8);
            case 8 : return floor(elt / 16);
        }
    }
    int calc_VSR_elt_offs(int elt, int width) {
        switch (width) {
            case 64: return (elt % 2);
            case 32: return (elt % 4);
            case 16: return (elt % 8);
            case 8 : return (elt % 16);
        }
    }
    void _set_VSR_element(el_reg_t* el, int gpr, int elt, int width) {
        int new_elt = calc_VSR_elt_offs(elt, width);
        int new_reg = calc_VSR_reg_offs(elt, width);
        set_VSR_element(el, gpr+new_reg, new_elt, width);
    }

    # VSX-add RT, RA, RB using the "uint16_t" union member "hwords"
    for i in range(VL):
        el_reg_t result, ra, rb;
        _get_VSR_element(&ra, RA, i, 16);
        _get_VSR_element(&rb, RB, i, 16);
        result.hwords[0] = ra.hwords[0] + rb.hwords[0]; // use array 0 elements
        _set_VSR_element(&result, RT, i, 16);

    sv.crand *cr8.eq, *cr16.le, *cr40.so # all CR8-CR127
    sv.mfcr cr5, *cr40                   # only one source (CR40) copied to CR5
    sv.mfcr *cr16, cr40                  # Vector-Splat CR40 onto CR16,17,18...
    sv.mfcr *cr16, cr3                  # Vector-Splat CR3 onto CR16,17,18...

    sv.mfcr *cr0, cr40        # Vector-Splat onto CR0,1,2
    sv.crand cr7, cr9, cr10   # crosses over between CR0-7 and CR8-127

    Machine-readable CSV files have been autogenerated which will make the
    task of creating SV-aware ISA decoders, documentation, assembler tools,
    compiler tools, Simulators, and all other aspects of SVP64 easier
    and less prone to mistakes.  Please avoid manual re-creation of
    information from the written specification wording in this chapter,
    and use the CSV files or use the Canonical tool which creates the CSV
    files, named sv_analysis.py. The information contained within
    sv_analysis.py is considered to be part of this Specification, even
    encoded as it is in python3.

    if extra3_mode:
        spec = EXTRA3
    elif EXTRA2[0]:  # vector mode, can express even registers in r0-126
        spec = EXTRA2 << 1  # same as EXTRA3, shifted
    else:            # scalar mode, can express registers in r0-63
        spec = (EXTRA2[0] << 2) | EXTRA2[1]
    if spec[0]: # vector
         return (RA << 2) | spec[1:2]
    else:         # scalar
         return (spec[1:2] << 5) | RA
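The register-number decode above can be restated as a small runnable sketch. It assumes that `spec` is held as an ordinary 3-bit integer in which value `0b100` is `spec[0]` (MSB0 numbering) and the low two bits are `spec[1:2]`. The function names `extra2_to_spec` and `decode_gpr` are hypothetical and chosen only for this illustration.

```python
def extra2_to_spec(extra2):
    """Widen a 2-bit EXTRA2 field to the 3-bit spec used by EXTRA3."""
    if extra2 & 0b10:          # EXTRA2[0] set: vector mode
        return extra2 << 1     # same layout as EXTRA3, shifted up by one
    # scalar mode: EXTRA2[0] is zero, so only EXTRA2[1] survives
    return extra2 & 0b01

def decode_gpr(ra, spec):
    """Return (register_number, is_vector) for a 5-bit RA and 3-bit spec."""
    is_vector = bool(spec & 0b100)   # spec[0]
    low_bits = spec & 0b011          # spec[1:2]
    if is_vector:
        # vector: RA supplies the high bits, spec the low two bits
        return (ra << 2) | low_bits, True
    # scalar: spec supplies the high two bits, RA the low five
    return (low_bits << 5) | ra, False
```

For example, `decode_gpr(5, 0b001)` selects scalar r37 and `decode_gpr(5, 0b101)` selects a vector starting at r21. Through `extra2_to_spec`, the highest reachable numbers are r126 for vectors and r63 for scalars, which agrees with the comments in the pseudocode.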

    if RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

    if els and !RA.isvec and !RB.isvec:
        svctx.ldstmode = elementstride

    imm(RA)  RT.v   RA.v   no stride allowed
    imm(RA)  RT.s   RA.v   no stride allowed
    imm(RA)  RT.v   RA.s   stride-select allowed
    imm(RA)  RT.s   RA.s   not vectorized
    RA,RB    RT.v  {RA|RB}.v Standard Indexed
    RA,RB    RT.s  {RA|RB}.v Indexed but single LD (no VSPLAT)
    RA,RB    RT.v  {RA&RB}.s VSPLAT possible. stride selectable
    RA,RB    RT.s  {RA&RB}.s not vectorized (scalar identity)

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)

    # LD not VLD!  format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip nonpredicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if postinc:
            offs = 0; # added afterwards
            if RA.isvec: srcbase = ireg[RA+i]
            else         srcbase = ireg[RA]
        elif svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed              # j*immed for a ST
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width) # j*op_width for ST
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # load from memory
        ireg[RT+j] <= MEM[EA];
        # check post-increment of EA
        if postinc: EA = srcbase + immed;
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        if (!RT.isvec)
            break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;
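To make the difference between the addressing modes concrete, the following sketch computes only the sequence of Effective Addresses that `op_load` would generate. It is a simplification under stated assumptions: predication, post-increment and RA-update are all left out, and the function name `effective_addresses` and the string mode names are hypothetical.

```python
def effective_addresses(ireg, ra, immed, op_width, vl, mode, ra_isvec=False):
    """List the EA for each element, ignoring predication and postinc."""
    eas = []
    for i in range(vl):
        if mode == "elementstride":
            # the immediate acts as the stride between elements
            ea = ireg[ra] + i * immed
        elif mode == "unitstride":
            # contiguous elements, immediate is a fixed offset
            ea = ireg[ra] + immed + i * op_width
        elif ra_isvec:
            # vector of base addresses, each with the same immediate
            ea = ireg[ra + i] + immed
        else:
            # scalar base, no stride: every element reads the same EA
            ea = ireg[ra] + immed
        eas.append(ea)
    return eas

ireg = [0] * 128
ireg[3] = 0x1000
unit = effective_addresses(ireg, 3, 16, 8, 4, "unitstride")
strided = effective_addresses(ireg, 3, 24, 8, 4, "elementstride")
```

With a base of `0x1000`, unit-stride with an immediate of 16 and 8-byte operations steps through `0x1010, 0x1018, 0x1020, 0x1028`, while element-stride with an immediate of 24 steps through `0x1000, 0x1018, 0x1030, 0x1048`.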

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
        # skip nonpredicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
            EA = ireg[RA] + ireg[RB]*j   # register-strided
        else
            EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

    # sv.ld *RT,RA,*RB with Index REMAP applied to RB
    for i in 0..VL-1:
        if remap.indexed:
            rb_idx = indexed_remap(i) # remap
        else:
            rb_idx = i # use the index as-is
        EA = GPR(RA) + GPR(RB+rb_idx)
        GPR(RT+i) = MEM(EA, 8)

    for(i = 0; i < VL; i++)
        reg[rt + i] = mem[reg[ra] + i * reg[rb]];

   RT=1 # vec - deliberately overlaps by one with RA
   RA=0 # vec - first one is valid, contains ptr
   imm = 8 # offset_of(ptr->next)
   for i in range(VL):
       # this part is the Scalar Defined Word-instruction (standard scalar ld operation)
       EA = GPR(RA+i) + imm          # ptr + offset(next)
       data = MEM(EA, 8)             # 64-bit address of ptr->next
       # was a normal vector-ld up to this point. now the Data-Fail-First
       cr_test = conditions(data)
       if Rc=1 or RC1: CR.field(i) = cr_test # only store if Rc=1/RC1
       action_load = True
       stop = False
       if cr_test.EQ == testbit:             # check if zero
           if VLI then
              VL = i+1            # update VL, inclusive
           else
              VL = i              # update VL, exclusive current
              action_load = False # current load excluded
           stop = True            # stop looping
       if action_load:
          GPR(RT+i) = data        # happens to be read on next loop!
       if stop: break
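The linked-list walk can be simulated directly. The sketch below assumes memory is a plain dictionary from address to 64-bit value, that the test is "loaded value equals zero", and that CR field updates are not needed. The function name `walk_linked_list` and the node addresses are made up for the example.

```python
def walk_linked_list(gpr, mem, ra, rt, imm, vl, vli=False):
    """Data-dependent fail-first pointer chase; returns the truncated VL."""
    for i in range(vl):
        ea = gpr[ra + i] + imm       # ptr + offset_of(next)
        data = mem[ea]               # value of ptr->next
        if data == 0:                # null pointer: fail-first triggers
            if vli:
                gpr[rt + i] = data   # inclusive: the failing load is kept
                return i + 1
            return i                 # exclusive: failing load is dropped
        # RT overlaps RA by one, so this is read as the next base pointer
        gpr[rt + i] = data
    return vl

# three nodes at 0x100, 0x200, 0x300; "next" field lives at offset 8
mem = {0x108: 0x200, 0x208: 0x300, 0x308: 0}
gpr = [0] * 16
gpr[0] = 0x100
new_vl = walk_linked_list(gpr, mem, ra=0, rt=1, imm=8, vl=8)
```

Starting with MAXVL-style `vl=8`, the walk stops as soon as the null pointer is loaded and reports a truncated VL of 2, leaving the three node addresses in `gpr[0]` to `gpr[2]`.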

    # LD not VLD!
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += .... uses op_width here

        # read the underlying memory
        memread <= MEM(srcbase + imm_offs, op_width)

        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # using Element-Packing starting at register RT, respecting destination
        # element bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.elwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;

    # LD not VLD! ld*rx if brev else ld*
    function op_ld(RT, RA, RB, op_width, svctx, brev)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
        if not svctx.el-strided:
            # RA not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # element stride mode, again RA not polymorphic
            srcbase = get_polymorphed_reg(RA, 64, 0)
        # RB *is* polymorphic
        offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)
        # sign-extend
        if svctx.SEA: offs = sext(offs, svctx.src_elwidth, 64)

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= MEM(srcbase + offs, op_width)

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # truncate/extend to over-ridden dest width.
        dest_width = op_width if RT.isvec else 64
        memread = adjust_wid(memread, op_width, dest_width)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, dest_width, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;

    if (mode_is_64bit) then M <- 0
    else M <- 32
    if ¬BO[2] then CTR <- CTR - 1
    ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
    cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
    if ctr_ok & cond_ok then
      if AA then NIA <-iea EXTS(BD || 0b00)
      else       NIA <-iea CIA + EXTS(BD || 0b00)
    if LK then LR  <-iea  CIA + 4
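As a reference point for the Vectorized variants, the scalar decision logic above (`ctr_ok` and `cond_ok`) can be written as a short Python sketch. It covers only whether the branch is taken and the new CTR value; the NIA and LR updates are omitted. `BO` is assumed to be a 5-bit integer in which `BO[0]` is the most significant bit, and the helper names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def bo_bit(bo, k):
    """Extract BO[k] using MSB0 numbering on a 5-bit field."""
    return (bo >> (4 - k)) & 1

def branch_conditional(bo, ctr, cr_bit, mode_is_64bit=True):
    """Return (taken, new_ctr) for a scalar bc with the given BO."""
    if not bo_bit(bo, 2):
        ctr = (ctr - 1) & MASK64            # CTR <- CTR - 1
    # CTR[M:63]: all 64 bits, or only the low 32 bits in 32-bit mode
    tested = ctr if mode_is_64bit else ctr & 0xFFFF_FFFF
    ctr_ok = bool(bo_bit(bo, 2)) or ((tested != 0) != bool(bo_bit(bo, 3)))
    cond_ok = bool(bo_bit(bo, 0)) or (cr_bit == bo_bit(bo, 1))
    return ctr_ok and cond_ok, ctr
```

For instance `branch_conditional(0b10000, 2, 0)` decrements CTR to 1 and takes the branch, whereas starting from a CTR of 1 it decrements to 0 and falls through.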

    if (mode_is_64bit) then M <- 0
    else M <- 32
    # the bit of CR to test, if the predicate bit is zero,
    # is overridden
    testbit = CR[BI+32]
    if ¬predicate_bit then testbit = SVRMmode.SNZ
    # otherwise apart from the override ctr_ok and cond_ok
    # are exactly the same
    ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
    cond_ok <- BO[0] | ¬(testbit ^ BO[1])
    if ¬predicate_bit & ¬SVRMmode.sz then
      # this is entirely new: CTR-test mode still decrements CTR
      # even when predicate-bits are zero
      if ¬BO[2] & CTRtest & ¬CTi then
        CTR = CTR - 1
      # instruction finishes here
    else
      # usual BO[2] CTR-mode now under CTR-test mode as well
      if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
      # new VLset mode, conditional test truncates VL
      if VLSET and VSb = (cond_ok & ctr_ok) then
        if SVRMmode.VLI then SVSTATE.VL = srcstep+1
        else                 SVSTATE.VL = srcstep
      # usual LR is now conditional, but also joined by SVLR
      lr_ok <- LK
      svlr_ok <- SVRMmode.SL
      if ctr_ok & cond_ok then
        if AA then NIA <-iea EXTS(BD || 0b00)
        else       NIA <-iea CIA + EXTS(BD || 0b00)
        if SVRMmode.LRu then lr_ok <- ¬lr_ok
        if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
      if lr_ok   then LR   <-iea CIA + 4
      if svlr_ok then SVLR <- SVSTATE

    if (mode_is_64bit) then M <- 0
    else M <- 32
    cond_ok = not SVRMmode.ALL
    for srcstep in range(VL):
        # select predicate bit or zero/one
        if predicate[srcstep]:
            # get SVP64 extended CR field 0..127
            SVCRf = SVP64EXTRA(BI>>2)
            CRbits = CR{SVCRf}
            testbit = CRbits[BI & 0b11]
            # testbit = CR[BI+32+srcstep*4]
        else if not SVRMmode.sz:
            # inverted CTR test skip mode
            if ¬BO[2] & CTRtest & ¬CTi then
              CTR = CTR - 1
            continue # skip to next element
        else
            testbit = SVRMmode.SNZ
        # actual element test here
        ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
        el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
        # check if CTR dec should occur
        ctrdec = ¬BO[2]
        if CTRtest & (el_cond_ok ^ CTi) then
           ctrdec = 0b0
        if ctrdec then CTR <- CTR - 1
        # merge in the test
        if SVRMmode.ALL:
            cond_ok &= (el_cond_ok & ctr_ok)
        else
            cond_ok |= (el_cond_ok & ctr_ok)
        # test for VL to be set (and exit)
        if VLSET and VSb = (el_cond_ok & ctr_ok) then
            if SVRMmode.VLI then SVSTATE.VL = srcstep+1
            else                 SVSTATE.VL = srcstep
            break
        # early exit?
        if SVRMmode.ALL != (el_cond_ok & ctr_ok):
             break
        # SVP64 rules about Scalar registers still apply!
        if SVCRf.scalar:
           break
    # loop finally done, now test if branch (and update LR)
    lr_ok <- LK
    svlr_ok <- SVRMmode.SL
    if cond_ok then
        if AA then NIA <-iea EXTS(BD || 0b00)
        else       NIA <-iea CIA + EXTS(BD || 0b00)
        if SVRMmode.LRu then lr_ok <- ¬lr_ok
        if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
    if lr_ok then LR <-iea CIA + 4
    if svlr_ok then SVLR <- SVSTATE

    # get SVP64 extended CR field 0..127
    SVCRf = SVP64EXTRA(BI>>2)
    CRbits = CR{SVCRf}
    # select predicate bit or zero/one
    if predicate[srcstep]:
        if BRc = 1 then # CR0 vectorized
            CR{SVCRf+srcstep} = CRbits
        testbit = CRbits[BI & 0b11]
    else if not SVRMmode.sz:
        # inverted CTR test skip mode
        if ¬BO[2] & CTRtest & ¬CTi then
           CTR = CTR - 1
        SVSTATE.srcstep = new_srcstep
        exit # no branch testing
    else
        testbit = SVRMmode.SNZ
    # actual element test here
    cond_ok <- BO[0] | ¬(testbit ^ BO[1])
    # test for VL to be set (and exit)
    if VLSET and cond_ok = VSb then
        if SVRMmode.VLI
            SVSTATE.VL = new_srcstep+1
        else
            SVSTATE.VL = new_srcstep

    // assume f() g() or h() modify a and/or b
    while(a > 2) {
        if(b < 5)
            f();
        else
            g();
        h();
    }

    vec<i32> a, b;
    // ...
    pred loop_pred = a > 2;
    // loop continues while any of a elements greater than 2
    while(loop_pred.any()) {
        // vector of predicate bits
        pred if_pred = loop_pred & (b < 5);
        // only call f() if at least 1 bit set
        if(if_pred.any()) {
            f(if_pred);
        }
    label1:
        // loop mask ANDs with inverted if-test
        pred else_pred = loop_pred & ~if_pred;
        // only call g() if at least 1 bit set
        if(else_pred.any()) {
            g(else_pred);
        }
        h(loop_pred);
    }

       # start from while loop test point
       b looptest
    while_loop:
       sv.cmpi CR80.v, b.v, 5     # vector compare b into CR80 Vector
       sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
       # only calculate loop_pred & pred_b because needed in f()
       sv.crand CR80.v.SO, CR60.v.GT, CR80.v.LT # if = loop & pred_b
       f(CR80.v.SO)
    skip_f:
       # illustrate inversion of pred_b. invert r30, test ALL
       # rather than SOME, but masked-out zero test would FAIL,
       # therefore masked-out instead is tested against 1 not 0
       sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
       # else = loop & ~pred_b, need this because used in g()
       sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.v.LT
       g(CR80.v.SO)
    skip_g:
       # conditionally call h(r30) if any loop pred set
       sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
    looptest:
       sv.cmpi CR60.v, a.v, 2     # vector compare a into CR60 vector
       sv.crweird r30, CR60.GT # transfer GT vector to r30
       sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
    end:

    for (int i = 0; i < 8; i++) {
        if (x < y) break;
    }

    if (mode_is_64bit) then M <- 0
    else M <- 32
    if ¬BO[2]  then CTR <- CTR - 1
    ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
    cond_ok <- BO[0] | ¬(CR[BI+32] ^  BO[1])
    if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
    if LK then LR <-iea CIA + 4

    for i in 0 to VL-1:
        ...
        ...
        cond_ok <- BO[0] | ¬(CR[BI+32] ^  BO[1])
        lr_ok <- LK
        if ctr_ok & cond_ok then
           NIA <-iea LR[0:61] || 0b00
           if SVRMmode.LRu then lr_ok <- ¬lr_ok
        if lr_ok then LR <-iea CIA + 4
        # if NIA modified exit loop

    for i in 0 to VL-1:
        ...
        ...
        cond_ok <- BO[0] | ¬(CR[BI+32] ^  BO[1])
        if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
    # only at the end of looping is LK checked.
    # this completely violates the design principle of SVP64
    # and would actually need to be a separate (scalar)
    # instruction "set LR to CIA+4 but retrospectively"
    # which is clearly impossible
    if LK then LR <-iea CIA + 4

    sv.cmpi/ew=8 *B,*ra,0    # compare bytes against zero
    sv.cmpi/ew=8 *B2,*ra,13  # and against newline
    sv.cror PM.EQ,B.EQ,B2.EQ # OR compares to create mask
    sv.stb/sm=EQ    ...      # store only nonzero/newline

    # sv.crand/mr/rg CR4.ge.v, CR5.ge.v, CR4.ge.v
    for i in VL-1 downto 0 # reverse gear
         CR.field[4+i].ge &= CR.field[5+i].ge

    cmpli BF,L,RA,UI
    cmpeqb BF,RA,RB

    # assume VL=4, this results in 4 sequential ops (below)
    sv.adde r0.v, r4.v, r8.v

    # instructions that get executed in backend hardware:
    adde r0, r4, r8 # takes carry-in, produces carry-out
    adde r1, r5, r9 # takes carry from previous
    ...
    adde r3, r7, r11 # likewise
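The effect of chaining the carry from one element to the next is a multi-word addition. The sketch below models this in Python under the assumption that element 0 holds the least significant 64 bits, since it is the first to be processed. The names `sv_adde`, `to_limbs` and `from_limbs` are hypothetical helpers for the illustration.

```python
MASK64 = (1 << 64) - 1

def sv_adde(gpr, rt, ra, rb, vl, carry_in=0):
    """Element-by-element add with the carry chained through; returns carry-out."""
    carry = carry_in
    for i in range(vl):
        total = gpr[ra + i] + gpr[rb + i] + carry
        gpr[rt + i] = total & MASK64   # low 64 bits go to the register
        carry = total >> 64            # carry-out feeds the next element
    return carry

def to_limbs(value, count):
    """Split an integer into `count` 64-bit limbs, least significant first."""
    return [(value >> (64 * i)) & MASK64 for i in range(count)]

def from_limbs(limbs):
    """Reassemble an integer from 64-bit limbs, least significant first."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

# VL=4: r0-r3 = r4-r7 + r8-r11, i.e. one 256-bit addition
a = (1 << 200) + MASK64
b = (1 << 130) + 1
gpr = [0] * 16
gpr[4:8] = to_limbs(a, 4)
gpr[8:12] = to_limbs(b, 4)
carry_out = sv_adde(gpr, 0, 4, 8, 4)
result = from_limbs(gpr[0:4]) + (carry_out << 256)
```

Here `result` equals `a + b`, showing that the four sequential `adde` operations together behave as a single 256-bit add.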

    def index():
        for i in range(VL):
            for j in range(SUBVL):
                yield i*SUBVL+j

    for idx in index():
        operation_on(RA+idx)

    # yield an outer-SUBVL or inner VL loop with SUBVL
    def index_p(outer):
        if outer:
            for j in range(SUBVL):   # subvl is outer
                for i in range(VL):  # vl is inner
                    yield i+VL*j
        else:
            for i in range(VL):        # vl is outer
                for j in range(SUBVL): # subvl is inner
                    yield i*SUBVL+j

     # walk through both source and dest indices simultaneously
     for src_idx, dst_idx in zip(index_p(PACK), index_p(UNPACK)):
         move_operation(RT+dst_idx, RA+src_idx)

     srcstep=0   srcstep=1
     0   1   2   3   4   5

     dststep=0  dststep=1  dststep=2
     0   3      1   4      2   5

     # add RT, RA,RB but when RT==RA
     for i in range(VL):
          iregs[RA] += iregs[RB+i] # RT==RA

    # assume VL=4:
    # * Vector of shift-offsets contained in RC (r12.v)
    # * Vector of masks contained in RB (r8.v)
    # * Vector of values to be masked-in in RA (r4.v)
    # * Scalar destination RT (r0) to receive all mask-offset values
    sv.bmset/mr r0, r4.v, r8.v, r12.v

for i in range(VL):
   GPR[RT+i], CR[i] = operation(GPR[RA+i]... )

for i in range(VL):
   GPR[RT+i], CR[i] = operation(GPR[RA+i]... )
   if test(CR[i]) == failure:
      VL = i+VLi
      break

     CR{n} = CR[32+n*4:35+n*4]

    CR_index = (BA>>2)      # top 3 bits
    bit_index = (BA & 0b11) # low 2 bits
    CR_reg = CR{CR_index}     # get the CR
    # finally get the bit from the CR.
    CR_bit = (CR_reg & (1<<bit_index)) != 0

    if extra3_mode:
        spec = EXTRA3
    elif EXTRA2[0]:  # vector mode
        spec = EXTRA2 << 1  # same as EXTRA3, shifted
    else:            # scalar mode
        spec = (EXTRA2[0] << 2) | EXTRA2[1]
    if spec[0]:
       # vector constructs "BA[0:2] spec[1:2] 00 BA[3:4]"
       return ((BA >> 2)<<6) | # hi 3 bits shifted up
              (spec[1:2]<<4) | # to make room for these
              (BA & 0b11)      # CR_bit on the end
    else:
       # scalar constructs "00 spec[1:2] BA[0:4]"
       return (spec[1:2] << 5) | BA

    CR_index = (BA>>2)      # top 3 bits 
    if spec[0]:
        # vector mode, 0-124 increments of 4
        CR_index = (CR_index<<4) | (spec[1:2] << 2)
    else:
        # scalar mode, 0-31 increments of 1
        CR_index = (spec[1:2]<<3) | CR_index
    # same as for v3.0/v3.1 from this point onwards
    bit_index = (BA & 0b11) # low 2 bits
    CR_reg = CR{CR_index}     # get the CR
    # finally get the bit from the CR.
    CR_bit = (CR_reg & (1<<bit_index)) != 0
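The extended CR field selection can be exercised with a small sketch. As with the GPR decode, `spec` is assumed to be a 3-bit integer whose value `0b100` is `spec[0]` and whose low two bits are `spec[1:2]`. The function name `decode_cr_bit` is hypothetical, and only the index calculation is modelled, not the CR read itself.

```python
def decode_cr_bit(ba, spec):
    """Return (cr_index, bit_index, is_vector) for a 5-bit BA and 3-bit spec."""
    cr_index = ba >> 2               # top 3 bits of BA select the CR field
    low_bits = spec & 0b011          # spec[1:2]
    is_vector = bool(spec & 0b100)   # spec[0]
    if is_vector:
        # vector mode: CR fields 0-124 in increments of 4
        cr_index = (cr_index << 4) | (low_bits << 2)
    else:
        # scalar mode: CR fields 0-31 in increments of 1
        cr_index = (low_bits << 3) | cr_index
    bit_index = ba & 0b11            # low 2 bits of BA select the bit
    return cr_index, bit_index, is_vector
```

With `BA = 0b10110`, a scalar spec of `0b001` selects CR field 13 and a vector spec of `0b101` selects a vector starting at CR field 84; in both cases bit index 2 within the field is tested.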

    for i in range(VL):
         # calculate the vector result of an add
         iregs[RT+i] = iregs[RA+i] + iregs[RB+i]
         # now calculate CR bits
         CRs{8+i}.eq = iregs[RT+i] == 0
         CRs{8+i}.gt = iregs[RT+i] > 0
         ... etc

    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      for (i = 0; i < VL; i++)
        STATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
        if (!int_vec[rd].isvec) break;
        if (rd.isvec)  { id += 1; }
        if (rs1.isvec) { irs1 += 1; }
        if (rs2.isvec) { irs2 += 1; }
        if (id == VL or irs1 == VL or irs2 == VL) {
          # end VL hardware loop
          STATE.srcoffs = 0; # reset
          return;
        }

    svp64 [field=value]*

    sv.asmcode/mode.vec{N}.ew=8,sw=16,m={pred},sm={pred} reg.v, src.s

def preduce_yield(vl, vec, pred):
    step = 1
    ix = list(range(vl))
    while step < vl:
        step *= 2
        for i in range(0, vl, step):
            other = i + step // 2
            ci = ix[i]
            oi = ix[other] if other < vl else None
            other_pred = other < vl and pred[oi]
            if pred[ci] and other_pred:
                yield ci, oi
            elif other_pred:
                ix[i] = oi

def preduce_y(vl, vec, pred):
   for i, other in preduce_yield(vl, vec, pred):
       vec[i] += vec[other]

    #pragma pack
    typedef union {
        uint8_t  b[];
        uint16_t s[];
        uint32_t i[];
        uint64_t l[];
        uint8_t actual_bytes[8];
    } el_reg_t;

    el_reg_t int_regfile[128];

    el_reg_t get_polymorphed_reg(el_reg_t const& reg, bitwidth, offset):
        el_reg_t res; // result
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if !reg.isvec: // scalar access has no element offset
            offset = 0
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(elreg_t& reg, bitwidth, offset, val):
        if (!reg.isvec):
            # for safety mask out hi bits
            bytemask = (1 << bitwidth) - 1
            val &= bytemask
            # not a vector: first element only, overwrites high bits.
            # and with the *Architectural* definition being LE,
            # storing in the first DWORD works perfectly.
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

      for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication
           src1 = get_polymorphed_reg(RA, srcwid, irs1)
           src2 = get_polymorphed_reg(RB, srcwid, irs2)
           result = src1 + src2 # actual add here
           set_polymorphed_reg(RT, destwid, ird, result)
           if (!RT.isvec) break
        if (RT.isvec)  { ird += 1; }
        if (RA.isvec)  { irs1 += 1; }
        if (RB.isvec)  { irs2 += 1; }

     # demo of maddedu
     for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication
           src1 = get_polymorphed_reg(RA, srcwid, irs1)
           src2 = get_polymorphed_reg(RB, srcwid, irs2)
           src3 = get_polymorphed_reg(RC, srcwid, irs3)
           result = src1*src2 + src3
           destmask = (1<<destwid)-1
           # store two halves of result, both start from RT.
           set_polymorphed_reg(RT, destwid, ird      , result&destmask)
           set_polymorphed_reg(RT, destwid, ird+MAXVL, result>>destwid)
           if (!RT.isvec) break
        if (RT.isvec)  { ird += 1; }
        if (RA.isvec)  { irs1 += 1; }
        if (RB.isvec)  { irs2 += 1; }
        if (RC.isvec)  { irs3 += 1; }

     LSB0:  63:32     31:0
     MSB0:  0:31      32:63
     r0    unchanged unchanged
     r1    RT1.lo    RT0.lo
     r2    unchanged RT2.lo
     r3    RT0.hi    unchanged
     r4    RT2.hi    RT1.hi
     r5    unchanged unchanged

setvli r0, 4            # sets VL equal to 4
sv.addi r5, r0, 1       # raises an 0x700 trap
setvli r0, 1            # sets VL equal to 1
sv.addi r5, r0, 1       # gets executed by hardware
sv.addi/ew=8 r5, r0, 1  # raises an 0x700 trap
sv.ori/sm=EQ r5, r0, 1  # executed by hardware

0-5	6	7	8-31	32-37	38-63	Description
PO	0	1	RM[0:23]	1nnnnn	xxxxxxxx	SVP64:EXT232-263
PO	1	1	RM[0:23]	nnnnnn	xxxxxxxx	SVP64:EXT000-063

Field Name	Field bits	Description
MASKMODE	`0`	Execution (predication) Mask Kind
MASK	`1:3`	Execution Mask
SUBVL	`8:9`	Sub-vector length

Field Name	Field bits	Description
ELWIDTH	`4:5`	Element Width
ELWIDTH_SRC	`6:7`	Element Width for Source (or MASK_SRC in 2PM)
EXTRA	`10:18`	Register Extra encoding
MODE	`19:23`	changes Vector behaviour

Value	Mnemonic	Description
00	DEFAULT	default behaviour for operation
01	`ELWIDTH=w`	Word: 32-bit integer
10	`ELWIDTH=h`	Halfword: 16-bit integer
11	`ELWIDTH=b`	Byte: 8-bit integer

Value	Mnemonic	Description
00	DEFAULT	default behaviour for FP operation
01	`ELWIDTH=f32`	32-bit IEEE 754 Single floating-point
10	`ELWIDTH=f16`	16-bit IEEE 754 Half floating-point
11	`ELWIDTH=bf16`	Reserved for `bf16`

Value	Mnemonic	Subvec	Description
00	`SUBVL=1`	single	Sub-vector length of 1
01	`SUBVL=2`	vec2	Sub-vector length of 2
10	`SUBVL=3`	vec3	Sub-vector length of 3
11	`SUBVL=4`	vec4	Sub-vector length of 4

Value	Description
0	MASK/MASK_SRC are encoded using Integer Predication
1	MASK/MASK_SRC are encoded using CR-based Predication

Value	Mnemonic	Element `i` enabled if:
000	ALWAYS	predicate effectively all 1s
001	1 << R3	`i == R3`
010	R3	`R3 & (1 << i)` is non-zero
011	~R3	`R3 & (1 << i)` is zero
100	R10	`R10 & (1 << i)` is non-zero
101	~R10	`R10 & (1 << i)` is zero
110	R30	`R30 & (1 << i)` is non-zero
111	~R30	`R30 & (1 << i)` is zero
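
The Integer Predication table can be read as a single lookup. A sketch under stated assumptions: the function name `int_pred_enabled` is hypothetical, the three predicate registers are passed in by value, and the element index `i` is assumed to be below 64.

```c
#include <stdint.h>
#include <stdbool.h>

/* Returns true if element i is enabled for the given 3-bit MASK value. */
static bool int_pred_enabled(unsigned mask, unsigned i,
                             uint64_t r3, uint64_t r10, uint64_t r30) {
    uint64_t bit = UINT64_C(1) << i;          /* caller guarantees i < 64 */
    switch (mask & 7u) {
    case 0:  return true;                     /* ALWAYS   */
    case 1:  return r3 == (uint64_t)i;        /* 1 << R3  */
    case 2:  return (r3  & bit) != 0;         /* R3       */
    case 3:  return (r3  & bit) == 0;         /* ~R3      */
    case 4:  return (r10 & bit) != 0;         /* R10      */
    case 5:  return (r10 & bit) == 0;         /* ~R10     */
    case 6:  return (r30 & bit) != 0;         /* R30      */
    default: return (r30 & bit) == 0;         /* ~R30     */
    }
}
```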

Value	Mnemonic	Element `i` is enabled if
000	lt	`CR[offs+i].LT` is set
001	nl/ge	`CR[offs+i].LT` is clear
010	gt	`CR[offs+i].GT` is set
011	ng/le	`CR[offs+i].GT` is clear
100	eq	`CR[offs+i].EQ` is set
101	ne	`CR[offs+i].EQ` is clear
110	so/un	`CR[offs+i].FU` is set
111	ns/nu	`CR[offs+i].FU` is clear
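
In the CR-based table the upper two bits of the value select which CR field bit is tested and the lowest bit inverts the test. A sketch, assuming the CR field is supplied as a 4-bit value with LT as its most significant bit (LT=8, GT=4, EQ=2, SO/FU=1); `cr_pred_enabled` is a hypothetical name, and selecting `CR[offs+i]` is left to the caller.

```c
#include <stdbool.h>

/* crfield: the 4-bit contents of CR[offs+i], LT in the top bit (assumed). */
static bool cr_pred_enabled(unsigned mask, unsigned crfield) {
    unsigned which  = (mask >> 1) & 3u;   /* 0=LT, 1=GT, 2=EQ, 3=FU */
    unsigned invert = mask & 1u;          /* odd values test for "clear" */
    unsigned bit    = (crfield >> (3u - which)) & 1u;
    return bit != invert;
}
```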

Field Name	Field bits	Description
Rdest_EXTRA2	`10:11`	extends Rdest (R*_EXTRA2 Encoding)
Rsrc1_EXTRA2	`12:13`	extends Rsrc1 (R*_EXTRA2 Encoding)
Rsrc2_EXTRA2	`14:15`	extends Rsrc2 (R*_EXTRA2 Encoding)
Rsrc3_EXTRA2	`16:17`	extends Rsrc3 (R*_EXTRA2 Encoding)
EXTRA2_MODE	`18`	used by `divmod2du` and `maddedu` for RS

Field Name	Field bits	Description
Rdest_EXTRA3	`10:12`	extends Rdest
Rsrc1_EXTRA3	`13:15`	extends Rsrc1
Rsrc2_EXTRA3	`16:18`	extends Rsrc2

Value	Mode	Range/Inc	6..0
000	Scalar	`r0-r31`/1	`0b00 RA`
001	Scalar	`r32-r63`/1	`0b01 RA`
010	Scalar	`r64-r95`/1	`0b10 RA`
011	Scalar	`r96-r127`/1	`0b11 RA`
100	Vector	`r0-r124`/4	`RA 0b00`
101	Vector	`r1-r125`/4	`RA 0b01`
110	Vector	`r2-r126`/4	`RA 0b10`
111	Vector	`r3-r127`/4	`RA 0b11`
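
Read row by row, the table above places the two extension bits above the 5-bit register field for Scalar and below it for Vector. A sketch of that mapping; `extra3_regnum` is a hypothetical name, and the 3-bit EXTRA3 value is assumed to be passed with its Scalar/Vector bit as the most significant bit.

```c
#include <stdbool.h>

/* Expand a 5-bit register field to a 7-bit register number (0..127). */
static unsigned extra3_regnum(unsigned extra3, unsigned ra, bool *isvec) {
    unsigned sel = extra3 & 3u;          /* the two extension bits */
    ra &= 31u;
    *isvec = (extra3 & 4u) != 0;         /* top bit: 0 = Scalar, 1 = Vector */
    return *isvec ? (ra << 2) | sel      /* Vector: RA then 0bNN, step of 4 */
                  : (sel << 5) | ra;     /* Scalar: 0bNN then RA            */
}
```

For example, EXTRA3 `0b001` with a register field of 5 selects scalar r37, while EXTRA3 `0b100` with a register field of 31 selects a vector starting at r124.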

Value	Mode	Range/Inc	8..5	4..2	1..0
000	Scalar	`CR0-CR7`/1	0b0000	BA[0:2]	BA[3:4]
001	Scalar	`CR8-CR15`/1	0b0001	BA[0:2]	BA[3:4]
010	Scalar	`CR16-CR23`/1	0b0010	BA[0:2]	BA[3:4]
011	Scalar	`CR24-CR31`/1	0b0011	BA[0:2]	BA[3:4]
100	Vector	`CR0-CR112`/16	BA[0:2] 0	0b000	BA[3:4]
101	Vector	`CR4-CR116`/16	BA[0:2] 0	0b100	BA[3:4]
110	Vector	`CR8-CR120`/16	BA[0:2] 1	0b000	BA[3:4]
111	Vector	`CR12-CR124`/16	BA[0:2] 1	0b100	BA[3:4]
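
The CR table follows the same pattern, but only the CR field number (`BA[0:2]`) is extended, while the bit within the field (`BA[3:4]`) always stays in the lowest two positions. A sketch, assuming `ba` is the 5-bit operand with `BA[0:2]` in its upper three bits; `cr_extra3_bitnum` is a hypothetical name.

```c
#include <stdbool.h>

/* Expand a 5-bit CR bit operand to a 9-bit CR bit number. */
static unsigned cr_extra3_bitnum(unsigned extra3, unsigned ba, bool *isvec) {
    unsigned field = (ba >> 2) & 7u;     /* BA[0:2]: CR field number   */
    unsigned bit   = ba & 3u;            /* BA[3:4]: bit within field  */
    unsigned sel   = extra3 & 3u;        /* the two extension bits     */
    unsigned crfield;
    *isvec = (extra3 & 4u) != 0;
    if (*isvec)
        crfield = (field << 4) | (sel << 2);   /* CR0, CR16, ... plus 0/4/8/12 */
    else
        crfield = (sel << 3) | field;          /* CR0-7, CR8-15, ...           */
    return (crfield << 2) | bit;
}
```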

0-1	2	3 4	description
0 0	0	dz sz	simple mode
0 0	1	RG 0	scalar reduce mode (mapreduce)
0 0	1	/ 1	reserved
1 0	N	dz sz	sat mode: N=0/1 u/s
VLi 1	inv	CR-bit	Rc=1: ffirst CR sel
VLi 1	inv	zz RC1	Rc=0: ffirst z/nonz
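
The rows of the table above can be distinguished by testing bit 1 first, then bit 0, then bits 2 and 4. A sketch of that decision order, assuming the 5-bit MODE is passed as an integer whose most significant bit is MODE bit 0; `sv_mode_t` and `classify_mode` are hypothetical names.

```c
typedef enum {
    MODE_SIMPLE, MODE_REDUCE, MODE_RESERVED, MODE_SAT, MODE_FFIRST
} sv_mode_t;

static sv_mode_t classify_mode(unsigned mode5) {
    unsigned b0 = (mode5 >> 4) & 1u;     /* MODE bit 0 (MSB0 numbering) */
    unsigned b1 = (mode5 >> 3) & 1u;
    unsigned b2 = (mode5 >> 2) & 1u;
    unsigned b4 = mode5 & 1u;

    if (b1) return MODE_FFIRST;          /* "VLi 1": Rc decides CR-bit vs zz/RC1 */
    if (b0) return MODE_SAT;             /* "1 0": bit 2 is N (unsigned/signed)  */
    if (!b2) return MODE_SIMPLE;         /* "0 0 0": bits 3,4 are dz,sz          */
    return b4 ? MODE_RESERVED            /* "0 0 1 / 1"                          */
              : MODE_REDUCE;             /* "0 0 1 RG 0"                         */
}
```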

0	1	2	3 4	description
els	0	PI	zz LF	post-increment and Fault-First
VLi	1	inv	CR-bit	Data-Dependent ffirst CR sel

srcstep	dststep	comment
0	0	both mask[src=0] and mask[dst=0] are 1
1	2	sz=1 but dz=0: dst skips mask[1], src does not
2	3	mask[src=2] and mask[dst=3] are 1
3	end	loop has ended because dst reached VL-1

0	1	2	3 4	description
els	0	PI	zz SEA	post-increment and Fault-First
VLi	1	inv	CR-bit	Data-Dependent ffirst CR sel

4	5	6	7	17	18	19	20	21	22 23	description
ALL	SNZ	/	/	SL	SLu	0	0	/	LRu sz	simple mode
ALL	SNZ	/	VSb	SL	SLu	0	1	VLI	LRu sz	VLSET mode
ALL	SNZ	CTi	/	SL	SLu	1	0	/	LRu sz	CTR-test mode
ALL	SNZ	CTi	VSb	SL	SLu	1	1	VLI	LRu sz	CTR-test+VLSET mode

srcstep	dststep	comment
0	0	both mask[src=0] and mask[dst=0] are 1
2	1	sz=0 but dz=1: src skips mask[1], dst does not
3	2	mask[src=3] and mask[dst=2] are 1
end	3	loop has ended because src reached VL-1

srcstep	dststep	comment
0	0	both mask[src=0] and mask[dst=0] are 1
2	2	sz=0 and dz=0: both src and dst skip mask[1]
3	3	mask[src=3] and mask[dst=3] are 1
end	end	loop has ended because src and dst reached VL-1
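
The srcstep/dststep tables in this section can be reproduced by a short stepping loop. This is a sketch under assumptions not stated in the tables themselves: VL=4, the same predicate mask `0b1101` (only `mask[1]` clear) on both source and destination, and at most 64 elements. `step_pair_t` and `twin_pred_steps` are hypothetical names. A side with zeroing enabled (`sz` or `dz`) does not skip masked-out elements; a side without it does.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { int src, dst; } step_pair_t;

/* Records each (srcstep, dststep) pair that executes; returns the count.
   The loop ends as soon as either step runs past VL-1. */
static int twin_pred_steps(uint64_t srcmask, uint64_t dstmask, int vl,
                           bool sz, bool dz, step_pair_t *out) {
    int src = 0, dst = 0, n = 0;
    for (;;) {
        if (!sz)  /* no source zeroing: skip masked-out source elements */
            while (src < vl && !((srcmask >> src) & 1u)) src++;
        if (!dz)  /* no dest zeroing: skip masked-out destination elements */
            while (dst < vl && !((dstmask >> dst) & 1u)) dst++;
        if (src >= vl || dst >= vl)
            break;
        out[n].src = src;
        out[n].dst = dst;
        n++;
        src++;
        dst++;
    }
    return n;
}
```

With `sz=1, dz=0` this yields the pairs (0,0), (1,2), (2,3); with `sz=0, dz=1` it yields (0,0), (2,1), (3,2); with both clear it yields (0,0), (2,2), (3,3).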

SVP64 Zero-Overhead Loop Prefix Subsystem

Introduction

SVP64 encoding features

Definition of "PO9-Prefixed"

Definition of "SVP64-Prefix"

Definition of "Vectorizable" and "Unvectorizable"

Definition of Strict Element-Level Execution Order

Precise Interrupt Guarantees

Register files, elements, and Element-width Overrides

Scalar Identity Behaviour

Register Naming and size

Future expansion.

SVP64 Remapped Encoding (RM[0:23])

Common RM fields

Mode

ELWIDTH Encoding

Elwidth for Integers:

Elwidth for FP Registers:

Elwidth for CRs (no meaning)

SUBVL Encoding

MASK/MASK_SRC & MASKMODE Encoding

Integer Predication (MASKMODE=0)

CR-based Predication (MASKMODE=1)

Extra Remapped Encoding

RM-1P-3S1D

RM-1P-2S1D

RM-2P-1S1D/2S

RM-1P-2S1D

RM-2P-2S1D/1S2D/3S

RM-2PM-2S1D/1S2D/3S

R*_EXTRA2/3

INT/FP EXTRA3

INT/FP EXTRA2

CR Field EXTRA3

CR EXTRA2

Appendix

Normal SVP64 Modes, for Arithmetic and Logical Operations

Mode

Rounding, clamp and saturate

Reduce mode

Data-dependent Fail-on-first

Data-dependent fail-first on CR operations (crand etc)

SV Load and Store

Rationale

Modes overview

Format and fields

Vectorization of Scalar Power ISA v3.0B

LD/ST Indexed vs Indexed REMAP

LD/ST ffirst (Fault-First)

Data-Dependent Fail-First (not Fail/Fault-First)

LOAD/STORE Elwidths

Remapped LD/ST

SVP64 Branch Conditional behaviour

Rationale

Overview

Format and fields

Vectorized CR Field numbering, and Scalar behaviour

Horizontal-First and Vertical-First Modes

Description and Modes

Link Register Update

CTR-test

VLSET Mode

VLSET and CTR-test combined

Boolean Logic combinations

Pseudocode and examples

Example Shader code

LRu example

Condition Register SVP64 Operations

Format

Data-dependent fail-first on CR operations

Reduction and Iteration

Unusual and quirky CR operations

Effectively-separate Vector and Scalar Condition Register file

Appendix

Partial Implementations

XER, SO and other global flags

EXTRA Field Mapping

Single Predication

Twin Predication

SVP64 Remapped Encoding (`RM[0:23]`)