Comparative analysis

The documents referenced on this page are all, deep breath, basically required reading, in addition to a comprehensive technical understanding of the Power ISA, in order to understand the depth and background of SVP64 as a 3D GPU and VPU Extension.

I am keenly aware that each of them is 300 to 1,000 pages (just like the Power ISA itself).

This is just how it is.

Given the sheer size and scope of SVP64, we have gone to considerable lengths to provide justification and rationalisation for adding the various sub-extensions to the Base Scalar Power ISA.

Scalar bit manipulation is justifiable for the exact same reasons that the equivalent extensions are justifiable for other ISAs. The additional justification for including them, where some instructions are already (sort-of) present in VSX, is that VSX is not mandatory, and the complexity of implementing VSX is too high a price to pay at the Embedded SFFS Compliancy Level.
Scalar FP-to-INT conversions, likewise. ARM has a JavaScript conversion instruction; Power ISA does not (and it costs a ridiculous 45 instructions to implement, including 6 branches!).
Scalar Transcendentals (SIN, COS, ATAN2, LOG) are easily justifiable for High-Performance Compute workloads.
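
To give a feel for what a JavaScript conversion instruction has to compute, here is a minimal Python sketch. It assumes that "javascript conversion" means the ECMAScript ToInt32 rules (truncate toward zero, wrap modulo 2³², NaN and infinities become 0). The function name js_to_int32 is made up for illustration and is not taken from any ISA document.

```python
import math


def js_to_int32(x: float) -> int:
    """Sketch of ECMAScript ToInt32 applied to a double (assumed semantics)."""
    # NaN and +/-infinity convert to 0 instead of trapping or saturating.
    if math.isnan(x) or math.isinf(x):
        return 0
    # Truncate toward zero, then keep only the low 32 bits.
    wrapped = math.trunc(x) % (1 << 32)
    # Reinterpret those 32 bits as a signed two's-complement integer.
    return wrapped - (1 << 32) if wrapped >= (1 << 31) else wrapped
```

Each of the three steps (special-case test, truncation, modular wrap with sign fix-up) needs its own compare-and-branch or masking sequence when no dedicated instruction exists, which is where the instruction count comes from.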

It also has to be pointed out that this work would normally be covered by multiple separate full-time Workgroups, with multiple Members contributing their time and resources. In RISC-V there are over sixty Technical Working Groups: https://riscv.org/community/directory-of-working-groups/

Overall, the contributions that we are developing take the Power ISA out of the specialist highly-focussed market it is presently best known for, and expand it into areas with much wider general adoption and broader uses.

The OpenCL specifications are linked here; they become relevant when we get to a 3D GPU / High Performance Compute ISA WG RFC: transcendentals

(Failure to add Transcendentals to a 3D GPU is directly equivalent to willfully designing a product that is 100% destined for commercial rejection, due to the extremely high competitive performance/watt achieved by today's mass-volume GPUs.)

I mention these because they will be encountered in every single commercial GPU ISA, but they're not part of the "Base" (core design) of a Vector Processor. Transcendentals can be added as a sub-RFC.

SIMD ISAs commonly mistaken for Vector

There is considerable confusion surrounding Vector ISAs because of a misuse of the word "Vector" in the marketing material of most well-known Packed SIMD ISAs of the past 3 decades. These Packed SIMD ISAs used features "inspired" by Scalable Vector ISAs.

PackedSIMD VSX. VSX, which has the word "Vector" in its name, is "inspired" by Vector Processing but has no "Scaling" capability and no Predicate masking. Both of these factors put pressure on developers to use "inline assembler unrolling" and data repetition, which in turn is detrimental to both the L1 Data and Instruction Caches. Adding Predicate Masks to the PackedSIMD VSX ISA would effectively double the number of PackedSIMD instructions (750 becomes 1,500), even if it were practical to do so (there is no available 32-bit encoding space).
AVX / AVX2 / AVX128 / AVX256 / AVX512 again have the word "Vector" in their name, but this in no way makes any of them a Vector ISA. None of the AVX-* family is "Scalable"; however, there is at least Predicate Masking in AVX-512.
ARM NEON - accurately described as a Packed SIMD ISA in all literature.
ARM SVE / SVE2 - not a Scalable Vector ISA: it is actually a hybrid PackedSIMD/PredicatedSIMD ISA. To fit into 32 bits, its 4-operand instructions are overwrite (one source register doubles as the destination), which left no room for a predicate mask. The "Scaling" is, rather unfortunately, a parameter that is chosen by the Hardware Architect rather than by the programmer. The actual "Scalable" part, as far as the programmer is concerned, is supposed to be the Predicate Masks. In practice, however, ARM NEON programmers have found it too hard to adapt and have instead attempted to fit the NEON SIMD paradigm on top of SVE. This has resulted in programmers writing multiple variants of near-identical hand-coded assembler in order to target different machines with different hardware widths, going directly against the advice given in ARM's developer documentation.
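
The difference that Predicate Masks make can be sketched in a few lines of Python. This is a toy model only: the function names and the default width of 4 lanes are hypothetical and do not correspond to any of the ISAs listed above. The first function models a fixed-width Packed SIMD loop, which needs a separate scalar clean-up loop for the leftover elements. The second models a predicated loop, where a per-lane mask disables the lanes that would run past the end of the data, so the same loop body handles any length and any hardware width.

```python
def packed_simd_add(a, b, width=4):
    """Fixed-width SIMD model: full-width main loop plus a scalar tail loop."""
    n = len(a)
    out = [0] * n
    main = n - (n % width)               # largest multiple of width <= n
    for i in range(0, main, width):      # each iteration models one SIMD instruction
        for lane in range(width):
            out[i + lane] = a[i + lane] + b[i + lane]
    for i in range(main, n):             # separate clean-up code for the tail
        out[i] = a[i] + b[i]
    return out


def predicated_add(a, b, width=4):
    """Predicated model: one loop body, tail handled by a per-lane mask."""
    n = len(a)
    out = [0] * n
    for i in range(0, n, width):
        mask = [i + lane < n for lane in range(width)]   # lanes still in range
        for lane in range(width):
            if mask[lane]:               # masked-out lanes do nothing
                out[i + lane] = a[i + lane] + b[i + lane]
    return out
```

In the unpredicated version, the tail loop (and any unrolled variants of it) is extra code that has to be written and cached per width, which is the pressure on the Instruction Cache described above.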

A good analogy explaining why "Silicon-Partner Scalability" is catastrophic is to note that the situation is near-identical to when IBM extended Power ISA from 32 to 64-bit. Existing 32-bit systems were unable to run or trap-and-emulate 64-bit instructions because they were the exact same opcodes. The "Silicon Scalability" of both RVV and ARM SVE/2 is the exact same mistake, but much worse. At least IBM provided an MSR.SF bit.

The saving grace of PackedSIMD VSX is that it did not succumb to the seduction outlined in the "SIMD Considered Harmful" article https://www.sigarch.org/simd-instructions-considered-harmful/. It is clear that implementations are expected to deploy Multi-Issue to achieve high performance, which is a much cleaner approach that has not resulted in ISA poisoning such as that suffered by x86 (AVX).

Actual 3D GPU Architectures and ISAs (all SIMD)

None of these are Scalable Vector ISAs: they are all SIMD ISAs.

Broadcom Videocore https://github.com/hermanhermitage/videocoreiv
Etnaviv https://github.com/etnaviv/etna_viv/tree/master/doc
Nyuzi http://www.cs.binghamton.edu/~millerti/nyuziraster.pdf
MALI https://github.com/cwabbott0/mali-isa-docs
AMD https://developer.amd.com/wp-content/resources/RDNA_Shader_ISA.pdf and https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf
MIAOW, which is NOT a 3D GPU: it is a processor that happens to implement a subset of the AMDGPU ISA (Southern Islands), aka a "GPGPU" https://miaowgpu.org/

Actual Scalable Vector Processor Architectures and ISAs

NEC SX Aurora https://www.hpc.nec/documents/guide/pdfs/Aurora_ISA_guide.pdf
Cray ISA http://www.bitsavers.org/pdf/cray/CRAY_Y-MP/HR-04001-0C_Cray_Y-MP_Computer_Systems_Functional_Description_Jun90.pdf
RISC-V RVV https://github.com/riscv/riscv-v-spec
MRISC32 ISA Manual (under active development) https://github.com/mrisc32/mrisc32/tree/master/isa-manual
Mitch Alsup's MyISA 66000 Vector Processor ISA Manual is available from Mitch under NDA on direct contact with him. It takes a different approach from the others, which may all be termed "Cray-Style Horizontal-First" Vectorization: 66000 is a Vertical-First Vector ISA with hardware-level auto-vectorization.
ETA-10, an extremely rare Scalable Vector Architecture from 1986, similar to the CDC Cyber 205. Only 25 machines were ever delivered. Page 3-220 of its ISA shows that it had Predicate Masks and Horizontal Reduction. Appendix H-1 shows it is likely a Memory-to-Memory Vector Architecture, and that it overcame the penalties normally associated with this by adding an explicit "Vector operand forwarding/chaining" instruction (Page 3-69). It is however clearly Scalable, up to 2¹⁶ Vector elements.

The terms Horizontal and Vertical allude to the Matrix "Row-First" or "Column-First" technique, where:

Horizontal-First processes all elements in a Vector before moving on to the next instruction
Vertical-First processes ONE element per instruction, and requires loop constructs to explicitly step to the next element.
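
The two orderings can be illustrated with a small Python sketch that records which (instruction, element) pairs execute and in what order. The function names, and the idea of representing a program as a list of instruction names, are purely illustrative assumptions. This is not how any of the listed ISAs is implemented.

```python
def horizontal_first(program, vl):
    """Each instruction covers all vl elements before the next instruction starts."""
    trace = []
    for insn in program:
        for element in range(vl):
            trace.append((insn, element))
    return trace


def vertical_first(program, vl):
    """Each instruction touches ONE element; an explicit loop steps the element index."""
    trace = []
    element = 0
    while element < vl:          # explicit loop construct
        for insn in program:
            trace.append((insn, element))
        element += 1             # explicit step to the next element
    return trace


print(horizontal_first(["mul", "add"], 3))
# [('mul', 0), ('mul', 1), ('mul', 2), ('add', 0), ('add', 1), ('add', 2)]
print(vertical_first(["mul", "add"], 3))
# [('mul', 0), ('add', 0), ('mul', 1), ('add', 1), ('mul', 2), ('add', 2)]
```

Both orderings visit the same set of (instruction, element) pairs. Only the traversal order differs, exactly as with Row-First versus Column-First traversal of a matrix.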

Vector-type Support by Architecture

Architecture	Horizontal	Vertical
MyISA 66000		X
Cray	X
SX Aurora	X
RVV	X
SVP64	X	X

Horizontal vs Vertical