sof/sif/sbfm etc.
These all it turns out can be done as bitmanip of the form x/~x &/|/^ (x / -x / x+1 / x-1) so are being superceded. needs some work though
if _RB = 0 then mask <- [1] * XLEN else mask = (RB)
a1 <- (RA) & mask
if mode[1] then a1 <- ¬ra
mode2 <- mode[2:3]
if mode2 = 0 then a2 <- (¬ra)+1
if mode2 = 1 then a2 <- ra-1
if mode2 = 2 then a2 <- ra+1
if mode2 = 3 then a2 <- ¬(ra+1)
a1 <- a1 & mask
a2 <- a2 & mask
# select operator
mode3 <- mode[3:4]
if mode3 = 0 then result <- a1 | a2
if mode3 = 1 then result <- a1 & a2
if mode3 = 2 then result <- a1 ^ a2
if mode3 = 3 then result <- UNDEFINED
result <- result & mask
# optionally restore masked-out bits
if L = 1 then
result <- result | (RA & ¬mask)
RT <- result
SBF = 0b01010 # set before first
SOF = 0b01001 # set only first
SIF = 0b10000 # set including first 10011 also works no idea why yet
OBSOLETE sbfm
sbfm RT, RA, RB!=0
Example
7 6 5 4 3 2 1 0 Bit index
1 0 0 1 0 1 0 0 v3 contents
vmsbf.m v2, v3
0 0 0 0 0 0 1 1 v2 contents
1 0 0 1 0 1 0 1 v3 contents
vmsbf.m v2, v3
0 0 0 0 0 0 0 0 v2
0 0 0 0 0 0 0 0 v3 contents
vmsbf.m v2, v3
1 1 1 1 1 1 1 1 v2
1 1 0 0 0 0 1 1 RB vcontents
1 0 0 1 0 1 0 0 v3 contents
vmsbf.m v2, v3, v0.t
0 1 x x x x 1 1 v2 contents
The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
Executable pseudocode demo:
def sbf(RA, mask=None, zero=False):
RT = RA if mask is not None and not zero else 0
i = 0
# start setting if no predicate or if 1st predicate bit set
setting_mode = mask is None
while i < 16:
bit = 1<<i
if not setting_mode and mask is not None and (mask & bit):
setting_mode = True # back into "setting" mode
if setting_mode and mask is not None and not (mask & bit):
setting_mode = False # disable when no mask bit
if setting_mode:
if RA & bit: # found a bit in rs1: stop setting RT
setting_mode = False
else:
RT |= bit
i += 1
return RT
if __name__ == '__main__':
m = 0b11000011
v3 = 0b10010100 # vmsbf.m v2, v3
v2 = 0b01000011 # v2
RT = sbf(v3, m, zero=True)
print(bin(v3), bin(v2), bin(RT))
v3 = 0b10010100 # vmsbf.m v2, v3
v2 = 0b00000011 # v2 contents
RT = sbf(v3)
print(bin(v3), bin(v2), bin(RT))
v3 = 0b10010101 # vmsbf.m v2, v3
v2 = 0b00000000 # v2
RT = sbf(v3)
print(bin(v3), bin(v2), bin(RT))
v3 = 0b00000000 # vmsbf.m v2, v3
v2 = 0b11111111 # v2
RT = sbf(v3)
print(bin(v3), bin(v2), bin(RT))
OBSOLETE sifm
The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
sifm RT, RA, RB!=0
# Example
7 6 5 4 3 2 1 0 Bit number
1 0 0 1 0 1 0 0 v3 contents
vmsif.m v2, v3
0 0 0 0 0 1 1 1 v2 contents
1 0 0 1 0 1 0 1 v3 contents
vmsif.m v2, v3
0 0 0 0 0 0 0 1 v2
1 1 0 0 0 0 1 1 RB vcontents
1 0 0 1 0 1 0 0 v3 contents
vmsif.m v2, v3, v0.t
1 1 x x x x 1 1 v2 contents
Executable pseudocode demo:
def sif(RA, mask=None, zero=False):
RT = RA if mask is not None and not zero else 0
i = 0
# start setting if no predicate or if 1st predicate bit set
setting_mode = mask is None
while i < 16:
bit = 1<<i
if not setting_mode and mask is not None and (mask & bit):
setting_mode = True # back into "setting" mode
if setting_mode and mask is not None and not (mask & bit):
setting_mode = False # masked out, stop setting
if setting_mode:
RT |= bit # inclusively set bit in RT
if RA & bit: # found a bit in rs1: stop setting RT
setting_mode = False
i += 1
return RT
if __name__ == '__main__':
m = 0b11000011
v3 = 0b10010100 # vmsif.m v2, v3
v2 = 0b11000011 # v2
RT = sif(v3, m, zero=True)
print(bin(v3), bin(v2), bin(RT))
v3 = 0b10010100 # vmsif.m v2, v3
v2 = 0b00000111 # v2 contents
RT = sif(v3)
print(bin(v3), bin(v2), bin(RT))
v3 = 0b10010101 # vmsif.m v2, v3
v2 = 0b00000001 # v2
RT = sif(v3)
print(bin(v3), bin(v2), bin(RT))
v3 = 0b00000000 # vmsif.m v2, v3
v2 = 0b11111111 # v2
RT = sif(v3)
print(bin(v3), bin(v2), bin(RT))
OBSOLETE vmsof
The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
sofm RT, RA, RB
Example
7 6 5 4 3 2 1 0 Bit number
1 0 0 1 0 1 0 0 v3 contents
vmsof.m v2, v3
0 0 0 0 0 1 0 0 v2 contents
1 0 0 1 0 1 0 1 v3 contents
vmsof.m v2, v3
0 0 0 0 0 0 0 1 v2
1 1 0 0 0 0 1 1 RB vcontents
1 1 0 1 0 1 0 0 v3 contents
vmsof.m v2, v3, v0.t
0 1 x x x x 0 0 v2 content
Executable pseudocode demo:
def sof(RA, mask=None, zero=False):
RT = RA if mask is not None and not zero else 0
i = 0
# start setting if no predicate or if 1st predicate bit set
setting_mode = mask is None
found = False
while i < 16:
bit = 1<<i
if not setting_mode and mask is not None and (mask & bit):
setting_mode = True # back into "setting" mode
found = False # start finding first
if setting_mode:
if mask is not None and not (mask & bit):
setting_mode = False
elif RA & bit and not found: # found a bit: set if not found
RT |= bit
found = True # don't set another bit
i += 1
return RT
if __name__ == '__main__':
m = 0b11000011
v3 = 0b11010100 # vmsof.m v2, v3
v2 = 0b01000000 # v2
RT = sof(v3, m, True)
print(bin(v3), bin(v2), bin(RT))
v3 = 0b10010100 # vmsof.m v2, v3
v2 = 0b00000100 # v2 contents
RT = sof(v3)
print(bin(v3), bin(v2), bin(RT))
v3 = 0b10010101 # vmsof.m v2, v3
v2 = 0b00000001 # v2
RT = sof(v3)
print(bin(v3), bin(v2), bin(RT))
v3 = 0b00000000 # vmsof.m v2, v3
v2 = 0b00000000 # v2
RT = sof(v3)
print(bin(v3), bin(v2), bin(RT))
SV Vector Operations not added
Links:
- https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions
- http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html conflictd example
- https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html
- https://bugs.libre-soc.org/show_bug.cgi?id=213
Both of these instructions may be synthesised from SVP64 Vector
instructions. conflictd is an O(N2) instruction based on
sv.cmpi
and iota is an O(N) instruction based on sv.addi
with the appropriate predication
conflictd
moved to conflictd
iota
Based on RVV vmiota. vmiota may be viewed as a cumulative variant of popcount, generating multiple results. successive iterations include more and more bits of the bitstream being tested.
When masked, only the bits not masked out are included in the count process.
viota RT/v, RA, RB
Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
Example
7 6 5 4 3 2 1 0 Element number
1 0 0 1 0 0 0 1 v2 contents
viota.m v4, v2 # Unmasked
2 2 2 1 1 1 1 0 v4 result
1 1 1 0 1 0 1 1 v0 contents
1 0 0 1 0 0 0 1 v2 contents
2 3 4 5 6 7 8 9 v4 contents
viota.m v4, v2, v0.t # Masked
1 1 1 5 1 7 1 0 v4 results
def iota(RT, RA, RB):
mask = RB ? iregs[RB] : 0b111111...1
val = RA ? iregs[RA] : 0b111111...1
for i in range(VL):
if RA.scalar:
testmask = (1<<i)-1 # only count below
to_test = val & testmask & mask
iregs[RT+i] = popcount(to_test)
a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2] where bit 2 is inv, bits 0:1 select the bit of the CR.
def test_CR_bit(CR, BO):
return CR[BO[0:1]] == BO[2]
def iotacr(RT, BA, BO):
mask = get_src_predicate()
count = 0
for i in range(VL):
if mask & (1<<i) == 0:
count = 0 # reset back to zero
continue
iregs[RT+i] = count
if test_CR_bit(CR[i+BA], BO):
count += 1
the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway. The integer version covers it, by not reading the int regfile at all.
scalar variant which can be Vectorized to give iotacr:
def crtaddi(RT, RA, BA, BO, D):
if test_CR_bit(BA, BO):
RT = RA + EXTS(D)
else:
RT = RA
a Vector for-loop with zero-ing on dest will give the
mask-out effect of resetting the count back to zero.
However close examination shows that the above may actually
be sv.addi/mr/sm=EQ/sz *r1, *r0, 1
TODO check this
this looks promising:
sv.add *v4+1, *v4, *v2
where v2 is guaranteed to contain 0 or 1, the dependency chain accumulates v2 via the overlap between src1 and dest vectors.
cprop
- https://www.geeksforgeeks.org/carry-look-ahead-adder/
- https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png
- https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge
- https://i.stack.imgur.com/QSLKY.png
- https://stackoverflow.com/questions/27971757/big-integer-addition-code
((P|G)+G)^P
- https://en.m.wikipedia.org/wiki/Carry-lookahead_adder
From QLSKY.png:
x0 = nand(CIn, P0)
C0 = nand(x0, ~G0)
x1 = nand(CIn, P0, P1)
y1 = nand(G0, P1)
C1 = nand(x1, y1, ~G1)
x2 = nand(CIn, P0, P1, P2)
y2 = nand(G0, P1, P2)
z2 = nand(G1, P2)
C1 = nand(x2, y2, z2, ~G2)
# Gen*
x3 = nand(G0, P1, P2, P3)
y3 = nand(G1, P2, P3)
z3 = nand(G2, P3)
G* = nand(x3, y3, z3, ~G3)
P = (A | B) & Ci
G = (A & B)
Stackoverflow algorithm ((P|G)+G)^P
works on the cumulated bits of P and G from associated vector units (P and G are integers here). The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.
At each id, compute C[id] = A[id]+B[id]+0
Get G[id] = C[id] > radix -1
Get P[id] = C[id] == radix-1
Join all P[id] together, likewise G[id]
Compute newC = ((P|G)+G)^P
result[id] = (C[id] + newC[id]) % radix
two versions: scalar int version and CR based version.
scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument. the end bits go into XER.CA and CR0.ge
vector version takes CR0.so as carry in, stores in CR0.so and CR.ge end bits.
if zero (no propagation) then CR0.eq is zero
CR based version, TODO.