ISA Comparison Table for DRAFT SVP64 - discussion and research at

| ISA | Instruction Count | Intrinsics | Taxonomy | setvl | Predication | Twin-Predication | Own Vector Regfile | 128-bit Ops | Bigint | LDST Fault-First | Data-Dep. Fail-First | Predicate-Result | Matrix | FFT/DCT |
|-----|-------------------|------------|----------|-------|-------------|------------------|--------------------|-------------|--------|------------------|----------------------|------------------|--------|---------|
| SVP64 | 5 [1] | see [2] | Scalable [3] | yes | yes | yes [4] | no [5] | see [6] | yes [7] | yes [8] | yes [9] | yes [10] | yes [11] | yes [12] |
| VSX | 700+ | 700? [13] | PackedSIMD | no | no | no | yes [14] | yes | no | no | no | no | yes [15] | no |
| NEON | ~250 [16] | 7088 [17] | PackedSIMD | no | no | no | yes | see [18] | no | no | no | no | no | no |
| SVE2 | ~1000 [19] | 6040 [20] | Predicated SIMD [21] | no [21] | yes | no | yes | see [18] | no | yes [8] | no | no | yes [22] | no |
| AVX512 [23] | ~1000s [24] | 7256 [25] | Predicated SIMD | no | yes | no | yes | see [18] | no | no | no | no | yes [26] | no |
| RVV [27] | ~190 [28] | ~25000 [29] | Scalable [30] | yes | yes | no | yes | yes [31] | no | yes | no | no | no | no |
| Aurora SX [32] | ~200 [33] | unknown [34] | Scalable [35] | yes | yes | no | yes | no | no | no | no | no | ? | no |
| 66000 [36] | ~200 | unknown | AutoVec [36] | see [36] | see [36] | no | see [36] | no | yes [37] | see [36] | no | no | no | no |

Numbers in brackets refer to the notes below.

  1. plus EXT001 24-bit prefixing using 25% of EXT001 space. See svp64
  2. If treated as a 1-Dimensional ISA, and designed badly, the 24-bit Prefix expands 200+ scalar instructions to well over a million intrinsics (N~=10^4 times M~=10^2). If treated as a 2-Dimensional ISA and designed well, there are far fewer: N prefix intrinsics plus M scalar-instruction intrinsics, where N and M are each likely to be of the order of 10^2.
  3. A 2-Dimensional Scalable Vector ISA specifically designed for the Power ISA with both Horizontal-First and Vertical-First Modes. See vector isa comparison
  4. on specific operations. See opcode regs deduped for full list. Key: 2P - Twin Predication, 1P - Single-Predicate
  5. SVP64 provides a Vector concept on top of the Scalar GPR, FPR and CR Fields, extended to 128 entries.
  6. SVP64 Vectorises Scalar ops. It is up to the implementor to choose (optionally) whether to apply SVP64 to e.g. VSX Quad-Precision (128-bit) instructions, to create 128-bit Vector ops.
  7. big-integer add is just sv.adde. For optimal performance, Bigint Mul and Divide first require the addition of two scalar operations (in turn naturally Vectorised by SVP64). See analysis
  8. LD/ST Fault-First: see appendix and ARM SVE Fault-First
  9. Data-dependent Fail-First: Based on LD/ST Fail-first, extended to data. Truncates VL based on failing Rc=1 test. Similar to Z80 CPIR. See appendix
  10. Predicate-result effectively turns any standard op into a type of "cmp". See appendix
  11. Any non-power-of-two Matrices up to 127 FMACs or other FMA-style op including Ternary Logical, full triple-loop Schedule. See remap
  12. DCT (Lee) and FFT Full Triple-loops supported, RADIX2-only. Normally only found in VLIW DSPs (TI TMS320, Qualcomm Hexagon). See remap
  13. Altivec gcc intrinsics, contains links to additional VSX intrinsics for ISA 2.05/6/7, 3.0 and 3.1
  14. VSX's Vector Registers are mis-named: they are 100% PackedSIMD. AVX-512 is not a Vector ISA either. See Flynn's Taxonomy
  15. Power ISA v3.1 contains "Matrix Multiply Assist" (MMA) which due to PackedSIMD is restricted to RADIX2 and requires inline assembler loop-unrolling for non-power-of-two Matrix dimensions
  16. difficult to ascertain, see NEON/VFP. Critically depends on ARM Scalar instructions
  17. NEON 32-bit 2754 intrinsics, NEON 64-bit 4334 intrinsics.
  18. Although registers may be 128-bit in NEON, SVE2, and AVX, unlike VSX there are very few (or no) actual arithmetic 128-bit operations. Only RVV and SVP64 have the possibility of 128-bit ops
  19. difficult to exactly ascertain, see ARM Architecture Reference Manual Supplement, DDI 0584. Critically depends on ARM Scalar instructions.
  20. SVE: 4140 intrinsics, SVE2 1900 intrinsics
  21. ARM states that Scalability is a Silicon-partner choice. The Scalability in the ISA is not available to the programmer: there is no setvl instruction in SVE2, which is already causing difficulties for assembler programmers. Quote: "you may be stuck with only using the bottom 128 bits of the vector, or need to code specifically for each width"
  22. The optional Scalable Matrix Extension outer-product instructions (SMOPA) are power-of-two based on the Silicon-partner SIMD width. Non-power-of-two is not supported, but zero-input masking is.
  23. AVX512 Wikipedia, Lifecycle of an instruction set including full slides
  24. difficult to ascertain exactly, contains subsets. Critically depends on ISA support from earlier x86 ISA subsets (several more thousand instructions). See SIMD ISA listing
  25. Count includes SSE, SSE2, AVX, AVX2 and all AVX512 variants
  26. Advanced Matrix Extensions (AMX) supports BF16 and INT8 only. Separate regfile, power-of-two "tiles". Not general-purpose at all.
  27. RVV Spec
  28. RISC-V Vectors are not stand-alone: like SVE2 and AVX-512 they are critically dependent on the Scalar ISA (the Scalar RV64GC set, an additional ~96 instructions, needed for Linux).
  29. RVV intrinsics listing page is 25,000 lines long.
  30. Like the original Cray ISA, RVV is a truly scalable Vector ISA (with a Cray-style setvl instruction). However, like SVE2, the Maximum Vector Length is a Silicon-partner choice, which creates similar limitations that SVP64 does not have. The RISC-V Founders strongly discourage efforts by programmers to find out the Silicon's Maximum Vector Length, in an effort to steer programmers towards Silicon-independent assembler. This requires all algorithms to contain a loop construct. MAXVL in SVP64 is a Spec-hard-fixed quantity, therefore loop constructs are not necessary 100% of the time.
  31. like SVP64 it is up to the hardware implementor (Silicon partner) to choose whether to support 128-bit elements.
  32. NEC SX Aurora is based on the original Cray Vectors
  33. Aurora ISA guide Appendix-3 11.1 p508
  34. Unknown: estimated to be of the order of the length of RVV's, due to Aurora also being a Cray-style Scalable ISA. NEC maintains an LLVM hard fork.
  35. Like the original Cray Vectors, the ISA Vector Length is independent of the underlying hardware; however, Generation 1 has 256 elements per Vector register (3.2.4 p24, Aurora ISA guide)
  36. Mitch Alsup's MyISA 66000 is available on request. A powerful RISC ISA with a Hardware-level auto-vectorisation LOOP built-in as an extension named VVM. Classified as "Vertical-First".
  37. MyISA 66000 has a CARRY register up to 64-bit. Repeated application of FMA (esp. within Auto-Vectorised LOOPs) automatically and inherently creates big-int operations with zero effort.
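
The big-integer addition described in note 7 works by chaining the carry from each element-add into the next, which is what Vectorising an add-with-carry instruction such as sv.adde achieves. A minimal Python sketch of the underlying idea (the helper name and word layout are illustrative assumptions, not SVP64 semantics):

```python
# Sketch of vectorised big-integer addition via carry chaining,
# the behaviour note 7 attributes to sv.adde (illustrative only).
def bigint_add(a, b, word_bits=64):
    """Add two big integers stored as little-endian lists of words.
    Each element-add propagates its carry into the next element,
    exactly the add-with-carry chain that a Vectorised adde forms."""
    mask = (1 << word_bits) - 1
    result, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry
        result.append(s & mask)   # low word_bits become the element
        carry = s >> word_bits    # overflow feeds the next element
    return result, carry

# two 128-bit numbers as pairs of 64-bit words (little-endian)
words, carry = bigint_add([0xFFFFFFFFFFFFFFFF, 0x1], [0x1, 0x2])
# words == [0x0, 0x4], carry == 0
```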
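The data-dependent Fail-First of note 9 (truncate VL at the first element whose Rc=1-style test fails, as in Z80 CPIR) can be sketched as follows; the function name and test callback are hypothetical, chosen only to show the VL-truncation behaviour:

```python
# Sketch of data-dependent fail-first: process elements until a
# data-dependent test fails, then truncate VL to the number of
# elements actually completed (illustrative, not the SVP64 encoding).
def fail_first(src, vl, test):
    """Copy up to vl elements of src, stopping at the first element
    for which test() fails; returns (copied_elements, truncated_vl)."""
    out = []
    for i in range(min(vl, len(src))):
        if not test(src[i]):
            break                 # VL truncated at the failing element
        out.append(src[i])
    return out, len(out)

# CPIR-style search: stop at a zero terminator,
# e.g. a building-block for a vectorised strlen
data = [104, 105, 33, 0, 99]
copied, new_vl = fail_first(data, vl=8, test=lambda x: x != 0)
# copied == [104, 105, 33], new_vl == 3
```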
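The loop construct that note 30 says Cray-style ISAs such as RVV require (and that SVP64's Spec-fixed MAXVL can sometimes avoid) is the classic stripmine pattern: ask setvl for the remaining element count, receive at most the Silicon's maximum, and repeat. A Python sketch under the assumption of a hypothetical hardware maximum of 4:

```python
# Sketch of a Cray-style stripmined vector loop: setvl grants
# min(requested, MAXVL) and the loop repeats until every element
# is processed (illustrative; MAXVL is Silicon-dependent in RVV/SVE2).
MAXVL = 4  # hypothetical Silicon-chosen maximum vector length

def setvl(requested):
    """Model of a setvl instruction: grant at most MAXVL elements."""
    return min(requested, MAXVL)

def vector_add(a, b):
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        vl = setvl(n - i)          # elements granted this iteration
        for e in range(vl):        # one "vector" operation of length vl
            out[i + e] = a[i + e] + b[i + e]
        i += vl                    # advance by the granted VL
    return out

# 6 elements with MAXVL=4 takes two passes (vl=4, then vl=2)
print(vector_add([1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]))
# prints [11, 22, 33, 44, 55, 66]
```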