sof/sif/sbfm etc.

These all it turns out can be done as bitmanip of the form x/~x &/|/^ (x / -x / x+1 / x-1) so are being superceded. needs some work though

    if _RB = 0 then mask <- [1] * XLEN else mask = (RB)
    a1 <- (RA) & mask
    if mode[1] then a1 <- ¬ra
    mode2 <- mode[2:3]
    if mode2 = 0 then a2 <- (¬ra)+1
    if mode2 = 1 then a2 <- ra-1
    if mode2 = 2 then a2 <- ra+1
    if mode2 = 3 then a2 <- ¬(ra+1)
    a1 <- a1 & mask
    a2 <- a2 & mask
    # select operator
    mode3 <- mode[3:4]
    if mode3 = 0 then result <- a1 | a2
    if mode3 = 1 then result <- a1 & a2
    if mode3 = 2 then result <- a1 ^ a2
    if mode3 = 3 then result <- UNDEFINED
    result <- result & mask
    # optionally restore masked-out bits
    if L = 1 then
        result <- result | (RA & ¬mask)
    RT <- result

SBF = 0b01010 # set before first
SOF = 0b01001 # set only first
SIF = 0b10000 # set including first 10011 also works no idea why yet

OBSOLETE sbfm

sbfm RT, RA, RB!=0

Example

 7 6 5 4 3 2 1 0   Bit index

 1 0 0 1 0 1 0 0   v3 contents
                   vmsbf.m v2, v3
 0 0 0 0 0 0 1 1   v2 contents

 1 0 0 1 0 1 0 1   v3 contents
                   vmsbf.m v2, v3
 0 0 0 0 0 0 0 0   v2

 0 0 0 0 0 0 0 0   v3 contents
                   vmsbf.m v2, v3
 1 1 1 1 1 1 1 1   v2

 1 1 0 0 0 0 1 1   RB vcontents
 1 0 0 1 0 1 0 0   v3 contents
                   vmsbf.m v2, v3, v0.t
 0 1 x x x x 1 1   v2 contents

The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.

Executable pseudocode demo:


def sbf(RA, mask=None, zero=False):
    RT = RA if mask is not None and not zero else 0
    i = 0
    # start setting if no predicate or if 1st predicate bit set
    setting_mode = mask is None
    while i < 16:
        bit = 1<<i
        if not setting_mode and mask is not None and (mask & bit):
            setting_mode = True # back into "setting" mode
        if setting_mode and mask is not None and not (mask & bit):
            setting_mode = False # disable when no mask bit
        if setting_mode:
            if RA & bit: # found a bit in rs1: stop setting RT
                setting_mode = False
            else:
                RT |= bit
        i += 1
    return RT

if __name__ == '__main__':
     m  = 0b11000011
     v3 = 0b10010100 # vmsbf.m v2, v3
     v2 = 0b01000011 # v2
     RT = sbf(v3, m, zero=True)
     print(bin(v3), bin(v2), bin(RT))
     v3 = 0b10010100 # vmsbf.m v2, v3
     v2 = 0b00000011 # v2 contents
     RT = sbf(v3)
     print(bin(v3), bin(v2), bin(RT))
     v3 = 0b10010101 # vmsbf.m v2, v3
     v2 = 0b00000000 # v2
     RT = sbf(v3)
     print(bin(v3), bin(v2), bin(RT))
     v3 = 0b00000000 # vmsbf.m v2, v3
     v2 = 0b11111111 # v2
     RT = sbf(v3)
     print(bin(v3), bin(v2), bin(RT))

OBSOLETE sifm

The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.

sifm RT, RA, RB!=0

# Example

 7 6 5 4 3 2 1 0   Bit number

 1 0 0 1 0 1 0 0   v3 contents
                   vmsif.m v2, v3
 0 0 0 0 0 1 1 1   v2 contents

 1 0 0 1 0 1 0 1   v3 contents
                   vmsif.m v2, v3
 0 0 0 0 0 0 0 1   v2

 1 1 0 0 0 0 1 1   RB vcontents
 1 0 0 1 0 1 0 0   v3 contents
                   vmsif.m v2, v3, v0.t
 1 1 x x x x 1 1   v2 contents

Executable pseudocode demo:


def sif(RA, mask=None, zero=False):
    RT = RA if mask is not None and not zero else 0
    i = 0
    # start setting if no predicate or if 1st predicate bit set
    setting_mode = mask is None
    while i < 16:
        bit = 1<<i
        if not setting_mode and mask is not None and (mask & bit):
            setting_mode = True # back into "setting" mode
        if setting_mode and mask is not None and not (mask & bit):
            setting_mode = False # masked out, stop setting
        if setting_mode:
            RT |= bit    # inclusively set bit in RT
            if RA & bit: # found a bit in rs1: stop setting RT
                setting_mode = False
        i += 1
    return RT

if __name__ == '__main__':
     m  = 0b11000011
     v3 = 0b10010100 # vmsif.m v2, v3
     v2 = 0b11000011 # v2
     RT = sif(v3, m, zero=True)
     print(bin(v3), bin(v2), bin(RT))
     v3 = 0b10010100 # vmsif.m v2, v3
     v2 = 0b00000111 # v2 contents
     RT = sif(v3)
     print(bin(v3), bin(v2), bin(RT))
     v3 = 0b10010101 # vmsif.m v2, v3
     v2 = 0b00000001 # v2
     RT = sif(v3)
     print(bin(v3), bin(v2), bin(RT))
     v3 = 0b00000000 # vmsif.m v2, v3
     v2 = 0b11111111 # v2
     RT = sif(v3)
     print(bin(v3), bin(v2), bin(RT))

OBSOLETE vmsof

The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.

sofm RT, RA, RB

Example

 7 6 5 4 3 2 1 0   Bit number

 1 0 0 1 0 1 0 0   v3 contents
                   vmsof.m v2, v3
 0 0 0 0 0 1 0 0   v2 contents

 1 0 0 1 0 1 0 1   v3 contents
                   vmsof.m v2, v3
 0 0 0 0 0 0 0 1   v2

 1 1 0 0 0 0 1 1   RB vcontents
 1 1 0 1 0 1 0 0   v3 contents
                   vmsof.m v2, v3, v0.t
 0 1 x x x x 0 0   v2 content

Executable pseudocode demo:


def sof(RA, mask=None, zero=False):
    RT = RA if mask is not None and not zero else 0
    i = 0
    # start setting if no predicate or if 1st predicate bit set
    setting_mode = mask is None
    found = False
    while i < 16:
        bit = 1<<i
        if not setting_mode and mask is not None and (mask & bit):
            setting_mode = True # back into "setting" mode
            found = False       # start finding first
        if setting_mode:
            if mask is not None and not (mask & bit):
                setting_mode = False
            elif RA & bit and not found: # found a bit: set if not found
                RT |= bit
                found = True # don't set another bit
        i += 1
    return RT

if __name__ == '__main__':
     m  = 0b11000011
     v3 = 0b11010100 # vmsof.m v2, v3
     v2 = 0b01000000 # v2
     RT = sof(v3, m, True)
     print(bin(v3), bin(v2), bin(RT))
     v3 = 0b10010100 # vmsof.m v2, v3
     v2 = 0b00000100 # v2 contents
     RT = sof(v3)
     print(bin(v3), bin(v2), bin(RT))
     v3 = 0b10010101 # vmsof.m v2, v3
     v2 = 0b00000001 # v2
     RT = sof(v3)
     print(bin(v3), bin(v2), bin(RT))
     v3 = 0b00000000 # vmsof.m v2, v3
     v2 = 0b00000000 # v2
     RT = sof(v3)
     print(bin(v3), bin(v2), bin(RT))

SV Vector Operations not added

Links:

Both of these instructions may be synthesised from SVP64 Vector instructions. conflictd is an O(N2) instruction based on sv.cmpi and iota is an O(N) instruction based on sv.addi with the appropriate predication

conflictd

This is based on the AVX512 conflict detection instruction. Internally the logic is used to detect address conflicts in multi-issue LD/ST operations. Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion. the instruction may be used for histograms (computed in parallel)

input = [100, 100,   3, 100,   5, 100, 100,   3]
conflict result = [
     0b00000000,    // Note: first element always zero
     0b00000001,    // 100 is present on #0
     0b00000000,
     0b00000011,    // 100 is present on #0 and #1
     0b00000000,
     0b00001011,    // 100 is present on #0, #1, #3
     0b00011011,    // .. and #4
     0b00000100     // 3 is present on #2
]

Pseudocode:

for i in range(VL):
    for j in range(1, i):
        if src1[i] == src2[j]:
            result[j] |= 1<<i

Idea 1: implement this as a Triangular Schedule, Vertical-First Mode, using mfcrweird and cmpi. first triangular schedule on src1, secpnd on src2.

Idea 2: implement using outer loop on varying setvl Horizontal-First with 1<<r3 predicate mask for src2 as scalar, creates CR field vector, transfer into INT with mfcrweird then OR into the result.

li r3, 1
li result, 0
for i in range(target):
    setvl target
    sv.addi/sm=1<<r3 t0, src1.v, 0 # copy src1[i]
    sv.cmpi src2.v, t0 # compare src2 vector to scalar
    sv.mfcrweird t1, cr0.v, eq # copy CR eq result bits to t1
    srr t1, t1, i # shift up by i before ORing
    or result, result, t1
    srr r3, r3, 1 # shift r3 predicate up by one

See cr int predication for full details on the crweird instructions: the primary important aspect here is that a Vector of CR Field's EQ bits is transferred into a single GPR. The secondary important aspect is that VL is being adjusted in each loop, testing successively more of the input vector against a given scalar, each time.

To investigate:

iota

Based on RVV vmiota. vmiota may be viewed as a cumulative variant of popcount, generating multiple results. successive iterations include more and more bits of the bitstream being tested.

When masked, only the bits not masked out are included in the count process.

viota RT/v, RA, RB

Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).

Example

 7 6 5 4 3 2 1 0   Element number

 1 0 0 1 0 0 0 1   v2 contents
                   viota.m v4, v2 # Unmasked
 2 2 2 1 1 1 1 0   v4 result

 1 1 1 0 1 0 1 1   v0 contents
 1 0 0 1 0 0 0 1   v2 contents
 2 3 4 5 6 7 8 9   v4 contents
                   viota.m v4, v2, v0.t # Masked
 1 1 1 5 1 7 1 0   v4 results

 def iota(RT, RA, RB): 
    mask = RB ? iregs[RB] : 0b111111...1
    val = RA ? iregs[RA] : 0b111111...1
    for i in range(VL):
        if RA.scalar:
        testmask = (1<<i)-1 # only count below
        to_test = val & testmask & mask
        iregs[RT+i] = popcount(to_test)

a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2] where bit 2 is inv, bits 0:1 select the bit of the CR.

 def test_CR_bit(CR, BO):
     return CR[BO[0:1]] == BO[2]

 def iotacr(RT, BA, BO): 
    mask = get_src_predicate()
    count = 0
    for i in range(VL):
        if mask & (1<<i) == 0:
            count = 0 # reset back to zero
            continue
        iregs[RT+i] = count
        if test_CR_bit(CR[i+BA], BO):
             count += 1

the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway. The integer version covers it, by not reading the int regfile at all.

scalar variant which can be Vectorised to give iotacr:

 def crtaddi(RT, RA, BA, BO, D): 
     if test_CR_bit(BA, BO):
         RT = RA + EXTS(D)
     else:
         RT = RA

a Vector for-loop with zero-ing on dest will give the mask-out effect of resetting the count back to zero. However close examination shows that the above may actually be sv.addi/mr/sm=EQ/dz r0.v, r0.v, 1