https://libre-soc.org/openpower/sv/vector_ops/discussion/

This is based on the AVX512 conflict detection instruction. Internally the logic is used to detect address conflicts in multi-issue LD/ST operations. Two arrays of values are given: the indices are compared and duplicates are reported in a triangular fashion, meaning each element is checked only against the elements before it. The instruction may be used for histograms (computed in parallel).

input = [100, 100,   3, 100,   5, 100, 100,   3]
conflict result = [
     0b00000000,    // Note: first element always zero
     0b00000001,    // 100 is present on #0
     0b00000000,
     0b00000011,    // 100 is present on #0 and #1
     0b00000000,
     0b00001011,    // 100 is present on #0, #1, #3
     0b00101011,    // .. and #5
     0b00000100     // 3 is present on #2
]
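
As a rough illustration of the histogram use mentioned above, the Python sketch below adds one vector of bin indices to a histogram. The names `conflict_masks` and `histogram_step` are hypothetical, and the "last lane" test is written as a plain scan for clarity: it is an assumption about how the masks would be consumed, not a description of any hardware sequence.

```python
def conflict_masks(values):
    # Bit j of masks[i] is set when an earlier lane j < i holds the same value.
    return [sum(1 << j for j in range(i) if values[j] == values[i])
            for i in range(len(values))]

def histogram_step(hist, bins):
    """Add one vector of bin indices to hist without losing duplicate updates."""
    masks = conflict_masks(bins)
    for i, b in enumerate(bins):
        earlier = bin(masks[i]).count("1")  # earlier lanes hitting the same bin
        is_last = b not in bins[i + 1:]     # assumption: found by a plain scan
        if is_last:
            hist[b] += earlier + 1          # one write carries the whole count
    return hist

hist = histogram_step([0] * 8, [1, 1, 3, 1, 5, 1, 1, 3])
print(hist)
```

Each bin is written exactly once per vector, by the last lane that targets it, so duplicate indices within the vector do not overwrite one another.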

Pseudocode:

for i in range(VL):
    result[i] = 0
    for j in range(i):             # earlier elements only: j < i
        if src1[i] == src2[j]:
            result[i] |= 1<<j      # bit j records a match with element j
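
A minimal runnable model in Python may help when checking the worked example by hand. It assumes the convention used in the example's comments: bit j of entry i is set when element i matches an earlier element j. The function name `conflict_detect` is hypothetical.

```python
def conflict_detect(src1, src2):
    """Return one bitmask per element of src1.

    Bit j of result[i] is set when src1[i] == src2[j] for a position j < i,
    so result[0] is always zero.
    """
    vl = len(src1)
    assert len(src2) == vl, "both inputs must have the same length"
    result = [0] * vl
    for i in range(vl):
        for j in range(i):          # triangular: compare against j < i only
            if src1[i] == src2[j]:
                result[i] |= 1 << j
    return result

values = [100, 100, 3, 100, 5, 100, 100, 3]
for mask in conflict_detect(values, values):
    print(f"0b{mask:08b}")
```

Passing the same list as both arguments gives the single-input behaviour shown in the example; passing two different lists compares each element of the first against the earlier elements of the second.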

Idea 1: implement this as a Triangular Schedule in Vertical-First Mode, using mfcrweird and cmpi: the first triangular schedule on src1, the second on src2.

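Idea 2: implement using an outer loop with a varying setvl in Horizontal-First Mode, with a 1<<r3 predicate mask selecting a single element of src1 as the scalar to compare src2 against. The compare creates a Vector of CR Fields, which is transferred into an INT with mfcrweird and then ORed into the result.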

    li r3, target
    li result, 0
    for i in range(target):
        setvl target
        addi r3, r3, -1 # shift 1<<r3 predicate down by one
        sv.addi/sm=1<<r3 t0, src1.v, 0 # copy src1[i]
        sv.cmpi src2.v, t0 # compare src2 vector to scalar
        sv.mfcrweird t1, cr0.v, eq # copy CR eq result bits to t1
        or result, result, t1

See cr int predication for full details on the crweird instructions. The primary important aspect here is that the EQ bits of a Vector of CR Fields are transferred into a single GPR. The secondary important aspect is that VL is adjusted in each loop iteration, so that successively more of the input vector is tested against a given scalar each time.
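
The two aspects can be modelled in plain Python as a sanity check on the idea. This is only a sketch: `pack_eq_bits` stands in for the vector compare followed by mfcrweird, and the bit order it uses (element j lands in bit j) is an assumption, not taken from the crweird specification. Both function names are hypothetical.

```python
def pack_eq_bits(vector, scalar, vl):
    """Compare the first vl elements with scalar and gather the EQ results.

    Assumption: the EQ bit of element j ends up in bit j of the integer.
    """
    mask = 0
    for j in range(vl):
        if vector[j] == scalar:
            mask |= 1 << j
    return mask

def conflict_by_scalar_sweep(src1, src2):
    """Outer loop over scalars, with the vector length growing each time."""
    result = []
    for i in range(len(src1)):
        scalar = src1[i]            # models the predicated single-element copy
        result.append(pack_eq_bits(src2, scalar, vl=i))  # VL = i this pass
    return result

values = [100, 100, 3, 100, 5, 100, 100, 3]
print([bin(m) for m in conflict_by_scalar_sweep(values, values)])
```

With VL set to i on pass i, only elements before position i take part in the compare, which is what produces the triangular pattern without any explicit inner bound check.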

To investigate: