ISA Comparison Table for DRAFT SVP64 - discussion and research at

| ISA | Instruction Count | Intrinsics | Taxonomy | setvl | Predication | Twin-Predication | Own Vector Regfile | 128-bit Ops | Bigint | LDST Fault-First | Data-Dep. Fail-First | Predicate-Result | Matrix | FFT/DCT |
|-----|-------------------|------------|----------|-------|-------------|------------------|--------------------|-------------|--------|------------------|----------------------|------------------|--------|---------|
| SVP64 | 5 [1] | see [2] | Scalable [3] | yes | yes | yes [4] | no [5] | see [6] | yes [7] | yes [8] | yes [9] | yes [10] | yes [11] | yes [12] |
| VSX | 700+ | 700? [13] | PackedSIMD | no | no | no | yes [14] | yes | no | no | no | no | yes [15] | no |
| NEON | ~250 [16] | 7088 [17] | PackedSIMD | no | no | no | yes | see [18] | no | no | no | no | no | no |
| SVE2 | ~1000 [19] | 6040 [20] | Predicated SIMD [21] | no [21] | yes | no | yes | see [18] | no | yes [8] | no | no | yes [22] | no |
| AVX512 [23] | ~1000s [24] | 7256 [25] | Predicated SIMD | no | yes | no | yes | see [18] | no | no | no | no | yes [26] | no |
| RVV [27] | ~190 [28] | ~25000 [29] | Scalable [30] | yes | yes | no | yes | yes [31] | no | yes | no | no | no | no |
| Aurora SX [32] | ~200 [33] | unknown [34] | Scalable [35] | yes | yes | no | yes | no | no | no | no | no | ? | no |
| 66000 [36] | ~200 | unknown | AutoVec [36] | see [36] | see [36] | no | see [36] | no | yes [37] | see [36] | no | no | no | no |

Numbers in brackets refer to the notes below.

  1. plus EXT001 24-bit prefixing using 25% of EXT001 space. See svp64
  2. If treated as a 1-Dimensional ISA, and designed badly, the 24-bit Prefix expands 200+ scalar instructions to well over a million intrinsics (N~=10^4 times M~=10^2). If treated as a 2-Dimensional ISA and designed well, there are far fewer: N prefix intrinsics plus M scalar-instruction intrinsics, where N and M are each likely to be of the order of 10^2.
  3. A 2-Dimensional Scalable Vector ISA specifically designed for the Power ISA with both Horizontal-First and Vertical-First Modes. See vector isa comparison
  4. on specific operations. See opcode regs deduped for full list. Key: 2P - Twin Predication, 1P - Single-Predicate
  5. SVP64 provides a Vector concept on top of the Scalar GPR, FPR and CR Fields, extended to 128 entries.
  6. SVP64 Vectorises Scalar ops. It is up to the implementor to choose (optionally) whether to apply SVP64 to e.g. VSX Quad-Precision (128-bit) instructions, to create 128-bit Vector ops.
  7. big-integer add is just sv.adde. For optimal performance, Bigint Mul and Divide first require the addition of two scalar operations (in turn naturally Vectorised by SVP64). See analysis
  8. LD/ST Fault-First: see appendix and ARM SVE Fault-First
  9. Data-dependent Fail-First: Based on LD/ST Fail-first, extended to data. Truncates VL based on failing Rc=1 test. Similar to Z80 CPIR. See appendix
  10. Predicate-result effectively turns any standard op into a type of "cmp". See appendix
  11. Any non-power-of-two Matrices up to 127 FMACs or other FMA-style op including Ternary Logical, full triple-loop Schedule. See remap
  12. DCT (Lee) and FFT Full Triple-loops supported, RADIX2-only. Normally only found in VLIW DSPs (TI TMS320, Qualcomm Hexagon). See remap
  13. Altivec gcc intrinsics, contains links to additional VSX intrinsics for ISA 2.05/6/7, 3.0 and 3.1
  14. VSX's Vector Registers are mis-named: they are 100% PackedSIMD. AVX-512 is not a Vector ISA either. See Flynn's Taxonomy
  15. Power ISA v3.1 contains "Matrix Multiply Assist" (MMA) which due to PackedSIMD is restricted to RADIX2 and requires inline assembler loop-unrolling for non-power-of-two Matrix dimensions
  16. difficult to ascertain, see NEON/VFP. Critically depends on ARM Scalar instructions
  17. NEON 32-bit 2754 intrinsics, NEON 64-bit 4334 intrinsics.
  18. Although registers may be 128-bit in NEON, SVE2, and AVX, unlike VSX there are very few (or no) actual arithmetic 128-bit operations. Only RVV and SVP64 have the possibility of 128-bit ops
  19. difficult to exactly ascertain, see ARM Architecture Reference Manual Supplement, DDI 0584. Critically depends on ARM Scalar instructions.
  20. SVE: 4140 intrinsics, SVE2 1900 intrinsics
  21. ARM states that Scalability is a Silicon-partner choice. The Scalability in the ISA is not available to the programmer: there is no setvl instruction in SVE2, which is already causing difficulties for assembler programmers. Quote: "you may be stuck with only using the bottom 128 bits of the vector, or need to code specifically for each width"
  22. The optional Scalable Matrix Extension outer-product instructions (SMOPA) are power-of-two based on the Silicon-partner SIMD width. Non-power-of-two is not supported, but zero-input masking is.
  23. AVX512 Wikipedia, Lifecycle of an instruction set including full slides
  24. difficult to ascertain exactly, contains subsets. Critically depends on ISA support from earlier x86 ISA subsets (several more thousand instructions). See SIMD ISA listing
  25. Count includes SSE, SSE2, AVX, AVX2 and all AVX512 variants
  26. Advanced Matrix Extensions (AMX) supports BF16 and INT8 only. Separate regfile, power-of-two "tiles". Not general-purpose at all.
  27. RVV Spec
  28. RISC-V Vectors are not stand-alone: like SVE2 and AVX-512 they are critically dependent on the Scalar ISA (the Scalar RV64GC set, an additional ~96 instructions, needed for Linux).
  29. RVV intrinsics listing page is 25,000 lines long.
  30. Like the original Cray ISA, RVV is a truly scalable Vector ISA (with a Cray-style setvl instruction). However, like SVE2, the Maximum Vector Length is a Silicon-partner choice, which creates similar limitations that SVP64 does not have. The RISC-V Founders strongly discourage efforts by programmers to find out the Silicon's Maximum Vector Length, in an effort to steer programmers towards Silicon-independent assembler. This requires all algorithms to contain a loop construct. MAXVL in SVP64 is a Spec-hard-fixed quantity, therefore loop constructs are not necessary 100% of the time.
  31. like SVP64 it is up to the hardware implementor (Silicon partner) to choose whether to support 128-bit elements.
  32. NEC SX Aurora is based on the original Cray Vectors
  33. Aurora ISA guide Appendix-3 11.1 p508
  34. Unknown: estimated to be of the order of the length of RVV's, due to Aurora also being a Cray-style Scalable ISA. NEC maintains an LLVM hard fork.
  35. Like the original Cray Vectors, the ISA Vector Length is independent of the underlying hardware; however, Generation 1 has 256 elements per Vector register (3.2.4 p24, Aurora ISA guide)
  36. Mitch Alsup's MyISA 66000 is available on request. A powerful RISC ISA with a Hardware-level auto-vectorisation LOOP built-in as an extension named VVM. Classified as "Vertical-First".
  37. MyISA 66000 has a CARRY register up to 64-bit. Repeated application of FMA (esp. within Auto-Vectorised LOOPs) automatically and inherently creates big-int operations with zero effort.
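
The big-integer addition described in note 7 works by chaining the carry from each element-add into the next, which is what Vectorising an add-with-carry instruction such as sv.adde achieves. A minimal Python sketch of the underlying idea (the helper name and word layout are illustrative assumptions, not SVP64 semantics):

```python
# Sketch of vectorised big-integer addition via carry chaining,
# the behaviour note 7 attributes to sv.adde (illustrative only).
def bigint_add(a, b, word_bits=64):
    """Add two big integers stored as little-endian lists of words.
    Each element-add propagates its carry into the next element,
    exactly the add-with-carry chain that a Vectorised adde forms."""
    mask = (1 << word_bits) - 1
    result, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry
        result.append(s & mask)   # low word_bits become the element
        carry = s >> word_bits    # overflow feeds the next element
    return result, carry

# two 128-bit numbers as pairs of 64-bit words (little-endian)
words, carry = bigint_add([0xFFFFFFFFFFFFFFFF, 0x1], [0x1, 0x2])
# words == [0x0, 0x4], carry == 0
```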
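The data-dependent Fail-First of note 9 (truncate VL at the first element whose Rc=1-style test fails, as in Z80 CPIR) can be sketched as follows; the function name and test callback are hypothetical, chosen only to show the VL-truncation behaviour:

```python
# Sketch of data-dependent fail-first: process elements until a
# data-dependent test fails, then truncate VL to the number of
# elements actually completed (illustrative, not the SVP64 encoding).
def fail_first(src, vl, test):
    """Copy up to vl elements of src, stopping at the first element
    for which test() fails; returns (copied_elements, truncated_vl)."""
    out = []
    for i in range(min(vl, len(src))):
        if not test(src[i]):
            break                 # VL truncated at the failing element
        out.append(src[i])
    return out, len(out)

# CPIR-style search: stop at a zero terminator,
# e.g. a building-block for a vectorised strlen
data = [104, 105, 33, 0, 99]
copied, new_vl = fail_first(data, vl=8, test=lambda x: x != 0)
# copied == [104, 105, 33], new_vl == 3
```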
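The loop construct that note 30 says Cray-style ISAs such as RVV require (and that SVP64's Spec-fixed MAXVL can sometimes avoid) is the classic stripmine pattern: ask setvl for the remaining element count, receive at most the Silicon's maximum, and repeat. A Python sketch under the assumption of a hypothetical hardware maximum of 4:

```python
# Sketch of a Cray-style stripmined vector loop: setvl grants
# min(requested, MAXVL) and the loop repeats until every element
# is processed (illustrative; MAXVL is Silicon-dependent in RVV/SVE2).
MAXVL = 4  # hypothetical Silicon-chosen maximum vector length

def setvl(requested):
    """Model of a setvl instruction: grant at most MAXVL elements."""
    return min(requested, MAXVL)

def vector_add(a, b):
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        vl = setvl(n - i)          # elements granted this iteration
        for e in range(vl):        # one "vector" operation of length vl
            out[i + e] = a[i + e] + b[i + e]
        i += vl                    # advance by the granted VL
    return out

# 6 elements with MAXVL=4 takes two passes (vl=4, then vl=2)
print(vector_add([1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]))
# prints [11, 22, 33, 44, 55, 66]
```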