ISA Comparison Table to DRAFT SVP64. Discussion and research at https://bugs.libresoc.org/show_bug.cgi?id=893

| ISA name | No. of opcodes | No. of intrinsics | Taxonomy / Class | Binary Compat | setvl scalable | Pred. Masks | Twin Pred | Vector regs | 128-bit ops | Big int | LD/ST Fault-First | Data-dep. Fail-First | Pred Result | HW Matrix | DCT/FFT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SVP64 | 6^{1} | see^{2} | Scalable^{3} | yes | yes | yes | yes^{4} | no^{5} | see^{6} | yes^{7} | yes^{8} | yes^{9} | yes^{10} | yes^{11} | yes^{12} |
| VSX | 700+ | 700?^{13} | PackedSIMD | yes | no | no | no | yes^{14} | yes | no | no | no | no | yes^{15} | no |
| NEON | ~250^{16} | 7088^{17} | PackedSIMD | yes | no | no | no | yes | see^{18} | no | no | no | no | no | no |
| SVE2 | ~1000^{19} | 6040^{20} | PredSIMD^{21} | NO^{22} | no [^e3] | yes | no | yes | see [^b1] | no | yes [^8] | no | no | yes^{23} | no |
| AVX512^{24} | ~1000s^{25} | 7256^{26} | PredSIMD | yes | no | yes | no | yes | see [^b1] | no | no | no | no | yes^{27} | no |
| RVV^{28} | ~190^{29} | ~25000^{30} | Scalable^{31} | NO [^nc] | yes | yes | no | yes | yes^{32} | no | yes | no | no | no | no |
| AuroraSX^{33} | ~200^{34} | unknown^{35} | Scalable^{36} | yes | yes | yes | no | yes | no | no | no | no | no | ? | no |
| 66000^{37} | ~200 | unknown | AutoVec [^m1] | yes | see [^m1] | see [^m1] | no | see [^m1] | no | yes^{38} | see [^m1] | no | no | no | no |

^{1} Plus EXT001 24-bit prefixing, using 25% of the EXT001 space. See svp64.
^{2} If treated as a 1-Dimensional ISA and designed badly, the 24-bit Prefix expands 200+ scalar instructions to well over a million intrinsics (N ~= 10^{4} times M ~= 10^{2}). If treated as a 2-Dimensional ISA and designed well, there are far fewer: N prefix intrinsics plus M scalar instruction intrinsics, where N and M are each likely to be of the order of 10^{2}.
^{3} A 2-Dimensional Scalable Vector ISA specifically designed for the Power ISA, with both Horizontal-First and Vertical-First Modes. See vector isa comparison.
^{4} On specific operations. See opcode regs deduped for the full list. Key: 2P is Twin Predication, 1P is Single-Predicate.
^{5} SVP64 provides a Vector concept on top of the Scalar GPR, FPR and CR Fields, extended to 128 entries.
^{6} SVP64 Vectorizes Scalar ops. It is up to the implementor to choose (optionally) whether to apply SVP64 to e.g. VSX Quad-Precision (128-bit) instructions, to create 128-bit Vector ops.
^{7} Big-integer add is just `sv.adde`. For optimal performance, big-integer multiply and divide first require the addition of two scalar operations (in turn, naturally Vectorized by SVP64). See analysis.
^{8} LD/ST Fault-First: see appendix and ARM SVE Fault-First.
^{9} Data-dependent Fail-First: based on LD/ST Fail-First, extended to data. Truncates VL based on a failing Rc=1 test. Similar to Z80 CPIR. See appendix.
^{10} Predicate-result effectively turns any standard op into a type of "cmp". See appendix.
^{11} Any non-power-of-two Matrices, up to 127 FMACs or other FMA-style ops including Ternary Logical, with a full triple-loop Schedule. See remap.
^{12} DCT (Lee) and FFT full triple-loops supported, RADIX-2 only. Normally only found in VLIW DSPs (TI TMS320, Qualcomm Hexagon). See remap.
^{13} Altivec gcc intrinsics; the page contains links to additional VSX intrinsics for ISA 2.05/6/7, 3.0 and 3.1.
^{14} VSX's Vector Registers are misnamed: they are 100% PackedSIMD. AVX512 is not a Vector ISA either. See Flynn's Taxonomy.
^{15} Power ISA v3.1 contains "Matrix Multiply Assist" (MMA) which, due to PackedSIMD, is restricted to RADIX-2 and requires inline assembler loop-unrolling for non-power-of-two Matrix dimensions.
^{16} Difficult to ascertain; see NEON/VFP. Critically depends on ARM Scalar instructions.
^{17} NEON 32-bit: 2754 intrinsics; NEON 64-bit: 4334 intrinsics.
^{18} Although registers may be 128-bit in NEON, SVE2, and AVX, unlike VSX there are very few (or no) actual arithmetic 128-bit operations. Only RVV and SVP64 have the possibility of 128-bit ops.
^{19} Difficult to ascertain exactly; see ARM Architecture Reference Manual Supplement, DDI 0584. Critically depends on ARM Scalar instructions.
^{20} SVE: 4140 intrinsics; SVE2: 1900 intrinsics.
^{21} ARM states that the Scalability is a Silicon-partner choice. Scalability in the ISA is not available to the programmer: there is no `setvl` instruction in SVE2, which is already causing difficulties for assembler programmers. Quote: "you may be stuck with only using the bottom 128 bits of the vector, or need to code specifically for each width".
^{22} "Silicon-Partner" Scaling is achieved by allowing the same instruction to act on different regfile sizes and bit-widths. This catastrophically results in binary non-interoperability.
^{23} The optional Scalable Matrix Extension provides outer-product instructions (SMOPA) which are power-2, based on the Silicon-partner SIMD width. Non-power-2 is not supported, but zero-input masking is.
^{24} AVX512 Wikipedia; Lifecycle of an instruction set, including full slides.
^{25} Difficult to ascertain exactly; contains subsets. Critically depends on ISA support from earlier x86 ISA subsets (several thousand more instructions). See SIMD ISA listing.
^{26} Count includes SSE, SSE2, AVX, AVX2 and all AVX512 variants.
^{27} Advanced Matrix Extensions (AMX) support BF16 and INT8 only. Separate regfile, power-of-two "tiles". Not general-purpose at all.
^{28} RVV Spec.
^{29} RISC-V Vectors are not stand-alone: like SVE2 and AVX512, they are critically dependent on the Scalar ISA (an additional ~96 instructions for the Scalar RV64GC set, needed for Linux).
^{30} The RVV intrinsics listing page is 25,000 lines long.
^{31} Like the original Cray, RVV is a truly scalable Vector ISA (Cray `setvl` instruction). However, like SVE2, the Maximum Vector Length is a Silicon-partner choice, which creates similar limitations that SVP64 does not have. The RISC-V Founders strongly discourage efforts by programmers to find out the Silicon's Maximum Vector Length, as an effort to steer programmers towards Silicon-independent assembler. This requires all algorithms to contain a loop construct. MAXVL in SVP64 is a Spec-hard-fixed quantity, therefore loop constructs are not necessary 100% of the time.
^{32} Like SVP64, it is up to the hardware implementor (Silicon partner) to choose whether to support 128-bit elements.
^{33} NEC SX Aurora is based on the original Cray Vectors.
^{34} Aurora ISA guide, Appendix3 11.1, p508.
^{35} Unknown. Estimated to be of the order of the length of RVV's, due to also being a Cray-style Scalable ISA. NEC maintains an LLVM hard fork.
^{36} Like the original Cray Vectors, the ISA Vector Length is independent of the underlying hardware; however, Generation 1 has 256 elements per Vector register (3.2.4 p24, Aurora ISA guide).
^{37} Mitch Alsup's MyISA 66000 is available on request. A powerful RISC ISA with a Hardware-level auto-vectorization LOOP built-in as an extension named VVM. Classified as "Vertical-First".
^{38} MyISA 66000 has a CARRY register up to 64-bit. Repeated application of FMA (esp. within Auto-Vectored LOOPs) automatically and inherently creates bigint operations with zero effort.
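
The intrinsic-count estimate in the SVP64 intrinsics note is a product in the 1-Dimensional case and a sum in the 2-Dimensional case. A minimal Python sketch of that arithmetic, using only the orders of magnitude quoted in the note (the function names are hypothetical, introduced here for illustration):

```python
def intrinsics_1d(n_prefix, m_scalar):
    """1-Dimensional view: every prefix variant is paired with every scalar op."""
    return n_prefix * m_scalar


def intrinsics_2d(n_prefix, m_scalar):
    """2-Dimensional view: prefix and scalar intrinsics are listed separately."""
    return n_prefix + m_scalar


# Orders of magnitude quoted in the note, not measured counts.
badly_designed = intrinsics_1d(10**4, 10**2)
well_designed = intrinsics_2d(10**2, 10**2)
print(badly_designed, well_designed)
```

With those inputs the product is 10^{6} against a sum of 2 x 10^{2}, which is the gap the note is pointing at.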
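
The big-integer note says that add is just `sv.adde`, i.e. an add-with-carry applied element by element. A rough behavioural model in Python, not SVP64 code: it assumes 64-bit limbs stored least-significant first, and `adde` / `bigint_add` are hypothetical helper names.

```python
MASK64 = (1 << 64) - 1


def adde(a, b, carry_in):
    """Scalar add-with-carry on 64-bit values: returns (sum, carry_out)."""
    total = a + b + carry_in
    return total & MASK64, total >> 64


def bigint_add(a_limbs, b_limbs, carry_in=0):
    """Chain adde across equal-length limb lists (assumed little-endian)."""
    if len(a_limbs) != len(b_limbs):
        raise ValueError("limb lists must have the same length")
    result = []
    carry = carry_in
    for a, b in zip(a_limbs, b_limbs):
        limb, carry = adde(a, b, carry)  # carry-out feeds the next element
        result.append(limb)
    return result, carry


# (2**128 - 1) + 1 overflows two limbs: both limbs become 0, carry-out is 1.
limbs, carry_out = bigint_add([MASK64, MASK64], [1, 0])
print(limbs, carry_out)
```

The only state passed between elements is the carry, which is why a Vectorized scalar instruction is sufficient for this case.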
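
The Data-dependent Fail-First note describes VL being truncated when an element fails the Rc=1 test. A rough behavioural sketch in Python, not SVP64 pseudocode: `ddffirst`, its `test` callback and the `inclusive` flag (whether the failing element is kept) are hypothetical names and assumptions made for illustration.

```python
def ddffirst(op, test, a, b, vl, inclusive=False):
    """Apply op element by element; stop at the first result failing test.

    Returns (results, new_vl), where new_vl is the truncated vector length.
    """
    results = []
    for i in range(vl):
        r = op(a[i], b[i])
        if not test(r):
            if inclusive:
                results.append(r)  # assumption: optionally keep the failing element
            return results, len(results)
        results.append(r)
    return results, vl


def sub(x, y):
    return x - y


def nonzero(r):
    return r != 0


# Subtract until the first equal pair, in the spirit of a compare-and-repeat loop.
res, new_vl = ddffirst(sub, nonzero, [3, 5, 7, 9], [1, 2, 7, 4], vl=4)
print(res, new_vl)
```

Here the third pair is equal, so the test fails there and the returned length is 2 (or 3 with `inclusive=True`).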
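
The HW Matrix note refers to non-power-of-two matrices driven by a full triple-loop Schedule of up to 127 FMACs. A minimal Python sketch of what flattening a triple loop into a list of FMAC index triples looks like. The cap of 127 is taken from the note; the plain row-major loop order and the names `fmac_schedule` and `matmul_from_schedule` are assumptions for illustration and are not the REMAP specification.

```python
def fmac_schedule(rows, inner, cols, max_fmacs=127):
    """Flatten C[i][j] += A[i][k] * B[k][j] into a list of (i, j, k) triples."""
    total = rows * inner * cols
    if total > max_fmacs:
        raise ValueError(f"{total} FMACs exceeds the cap of {max_fmacs}")
    return [(i, j, k)
            for i in range(rows)
            for j in range(cols)
            for k in range(inner)]


def matmul_from_schedule(a, b, rows, inner, cols):
    """Run one multiply-accumulate per schedule entry."""
    c = [[0] * cols for _ in range(rows)]
    for i, j, k in fmac_schedule(rows, inner, cols):
        c[i][j] += a[i][k] * b[k][j]
    return c


# A 2x3 by 3x2 product: dimensions that are not powers of two, 12 FMACs.
a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
c = matmul_from_schedule(a, b, rows=2, inner=3, cols=2)
print(c)
```

Under this model a 5x5 by 5x5 product (125 FMACs) fits within the cap, while 8x4 by 4x4 (128 FMACs) does not.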
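
The RVV note contrasts Cray-style `setvl` loops, where the hardware vector length is unknown to the programmer, with SVP64's spec-fixed MAXVL. A minimal Python sketch of the loop construct in question (strip-mining), assuming `setvl` simply returns min(remaining, MAXVL); the function names are hypothetical.

```python
def setvl(remaining, maxvl):
    """Assumed behaviour: grant as many elements as remain, capped at maxvl."""
    return min(remaining, maxvl)


def vector_add_stripmined(a, b, maxvl):
    """Add two equal-length lists in chunks of at most maxvl elements."""
    out = []
    chunk_sizes = []
    i = 0
    n = len(a)
    while i < n:
        vl = setvl(n - i, maxvl)
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        chunk_sizes.append(vl)
        i += vl
    return out, chunk_sizes


total, chunks = vector_add_stripmined(list(range(10)), [1] * 10, maxvl=4)
print(total, chunks)
```

When the data length does not exceed `maxvl` the loop body runs exactly once, which corresponds to the case in the note where a known MAXVL lets the loop construct be dropped.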