DRAFT Scalar Transcendentals

Summary:

This proposal extends Power ISA scalar floating point operations to add IEEE754 transcendental functions (pow, log etc) and trigonometric functions (sin, cos etc). These functions are also 98% shared with the Khronos Group OpenCL Extended Instruction Set.

Authors/Contributors:

Luke Kenneth Casson Leighton
Jacob Lifshay
Dan Petroski
Mitch Alsup
Allen Baum
Andrew Waterman
Luis Vitorio Cargnini

DRAFT Scalar Transcendentals
TODO:
Requirements
Proposed Opcodes vs Khronos OpenCL vs IEEE754-2019
Opcode Tables for PO=59/63 XO=1---011--
DRAFT List of 2-arg opcodes
DRAFT List of 1-arg transcendental opcodes
DRAFT List of 1-arg trigonometric opcodes
Subsets
1. Transcendental Subsets
2. Trigonometric subsets
Synthesis, Pseudo-code ops and macro-ops
Evaluation and commentary

See:

http://bugs.libre-soc.org/show_bug.cgi?id=127
https://bugs.libre-soc.org/show_bug.cgi?id=899 transcendentals in simulator
https://bugs.libre-soc.org/show_bug.cgi?id=923 under review
https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html
power trans ops for opcode listing.

Extension subsets:

TODO: rename extension subsets -- we're not on RISC-V anymore.

Zftrans: standard transcendentals (best suited to 3D)
ZftransExt: extra functions (useful, not generally needed for 3D, can be synthesised using Zftrans)
Ztrigpi: trig. xxx-pi sinpi cospi tanpi
Ztrignpi: trig non-xxx-pi sin cos tan
Zarctrigpi: arc-trig. a-xxx-pi: atan2pi asinpi acospi
Zarctrignpi: arc-trig. non-a-xxx-pi: atan2, asin, acos
Zfhyp: hyperbolic/inverse-hyperbolic. sinh, cosh, tanh, asinh, acosh, atanh (can be synthesised - see below)
ZftransAdv: much more complex to implement in hardware
Zfrsqrt: Reciprocal square-root.
Zfminmax: Min/Max.

Minimum recommended requirements for 3D: Zftrans, Ztrignpi, Zarctrignpi, with Ztrigpi and Zarctrigpi as augmentations.

Minimum recommended requirements for Mobile-Embedded 3D: Ztrignpi, Zftrans, with Ztrigpi as an augmentation.

The Platform Requirements for 3D are driven by cost-competitive factors, and it is the Trademarked Vulkan Specification that provides clear direction for 3D GPU markets, but for nothing else (such as IEEE754). Implementors must note that minimum Compliance with the Third Party Vulkan Specification (for power-area competitive reasons with other 3D GPU manufacturers) will not qualify for strict IEEE754 accuracy Compliance, or vice-versa.

Implementors must make it clear which accuracy level is implemented and provide a switching mechanism and throw Illegal Instruction traps if fully compliant accuracy cannot be achieved. It is also the Implementor's responsibility to comply with all Third Party Certification Marks and Trademarks (Vulkan, OpenCL). Nothing in this specification in any way implies that any Third Party Certification Mark Compliance is granted, nullified, altered or overridden by this document.

TODO:

Decision on accuracy, moved to zfpacc proposal http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002355.html
Errors MUST be repeatable.
How about four Platform Specifications? 3DUNIX, UNIX, 3DEmbedded and Embedded? http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002361.html Accuracy requirements for dual (triple) purpose implementations must meet the higher standard.
Reciprocal Square-root is in its own separate extension (Zfrsqrt) as it is desirable on its own by other implementors. This is to be evaluated.

Requirements

This proposal is designed to meet a wide range of extremely diverse needs, allowing implementors from all of them to benefit from the tools and hardware cost reductions associated with common standards adoption in Power ISA (primarily IEEE754 and Vulkan).

The use-cases are:

3D GPUs
Numerical Computation
(Potentially) A.I. / Machine-learning (1)

(1) Approximations suffice in this field, however, making it more likely to use a custom extension. High-end ML is definitely excluded.

The power and die-area requirements range across:

Ultra-low-power (smartwatches where GPU power budgets are in milliwatts)
Mobile-Embedded (good performance with high efficiency for battery life)
Desktop Computing
Server / HPC / Supercomputing

The software requirements are:

Full public integration into GNU math libraries (libm)
Full public integration into well-known Numerical Computation systems (numpy)
Full public integration into upstream GNU and LLVM Compiler toolchains
Full public integration into Khronos OpenCL SPIR-V compatible Compilers seeking public Certification and Endorsement from the Khronos Group under their Trademarked Certification Programme.

Proposed Opcodes vs Khronos OpenCL vs IEEE754-2019

This list shows the (direct) equivalence between proposed opcodes, their Khronos OpenCL equivalents, and their IEEE754-2019 equivalents. 98% of the opcodes in this proposal that are in the IEEE754-2019 standard are present in the Khronos Extended Instruction Set.

See https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html and https://ieeexplore.ieee.org/document/8766229

Special FP16 opcodes are not being proposed, except by indirect / inherent use of elwidth overrides that is already present in the SVP64 Specification.
"Native" opcodes are not being proposed: implementors will be expected to use the (equivalent) proposed opcode covering the same function.
"Fast" opcodes are not being proposed, because the Khronos Specification fast_length, fast_normalise and fast_distance OpenCL opcodes require vectors (or can be done as scalar operations using other Power ISA instructions).

The OpenCL FP32 opcodes are direct equivalents to the proposed opcodes. Deviation from conformance with the Khronos Specification - including the Khronos Specification accuracy requirements - is not an option, as it results in non-compliance, and the vendor may not use the Trademarked words "Vulkan" etc. in conjunction with their product.

IEEE754-2019 Table 9.1 lists "additional mathematical operations". Interestingly the only functions missing when compared to OpenCL are compound, exp2m1, exp10m1, log2p1, log10p1, pown (integer power) and powr.

opcode	OpenCL FP32	OpenCL FP16	OpenCL native	IEEE754	Power ISA	My 66000 ISA
fsin	sin	half_sin	native_sin	sin	NONE	sin
fcos	cos	half_cos	native_cos	cos	NONE	cos
ftan	tan	half_tan	native_tan	tan	NONE	tan
NONE (1)	sincos	NONE	NONE	NONE	NONE
fasin	asin	NONE	NONE	asin	NONE	asin
facos	acos	NONE	NONE	acos	NONE	acos
fatan	atan	NONE	NONE	atan	NONE	atan
fsinpi	sinpi	NONE	NONE	sinPi	NONE	sinpi
fcospi	cospi	NONE	NONE	cosPi	NONE	cospi
ftanpi	tanpi	NONE	NONE	tanPi	NONE	tanpi
fasinpi	asinpi	NONE	NONE	asinPi	NONE	asinpi
facospi	acospi	NONE	NONE	acosPi	NONE	acospi
fatanpi	atanpi	NONE	NONE	atanPi	NONE	atanpi
fsinh	sinh	NONE	NONE	sinh	NONE
fcosh	cosh	NONE	NONE	cosh	NONE
ftanh	tanh	NONE	NONE	tanh	NONE
fasinh	asinh	NONE	NONE	asinh	NONE
facosh	acosh	NONE	NONE	acosh	NONE
fatanh	atanh	NONE	NONE	atanh	NONE
fatan2	atan2	NONE	NONE	atan2	NONE	atan2
fatan2pi	atan2pi	NONE	NONE	atan2pi	NONE	atan2pi
frsqrt	rsqrt	half_rsqrt	native_rsqrt	rSqrt	fsqrte, fsqrtes (4)	rsqrt
fcbrt	cbrt	NONE	NONE	NONE (2)	NONE
fexp2	exp2	half_exp2	native_exp2	exp2	NONE	exp2
flog2	log2	half_log2	native_log2	log2	NONE	ln2
fexpm1	expm1	NONE	NONE	expm1	NONE	expm1
flog1p	log1p	NONE	NONE	logp1	NONE	logp1
fexp	exp	half_exp	native_exp	exp	NONE	exp
flog	log	half_log	native_log	log	NONE	ln
fexp10	exp10	half_exp10	native_exp10	exp10	NONE	exp10
flog10	log10	half_log10	native_log10	log10	NONE	log
fpow	pow	NONE	NONE	pow	NONE	pow
fpown	pown	NONE	NONE	pown	NONE
fpowr	powr	half_powr	native_powr	powr	NONE
frootn	rootn	NONE	NONE	rootn	NONE
fhypot	hypot	NONE	NONE	hypot	NONE
frecip	NONE	half_recip	native_recip	NONE (3)	fre, fres (4)	rcp
NONE	NONE	NONE	NONE	compound	NONE
fexp2m1	NONE	NONE	NONE	exp2m1	NONE	exp2m1
fexp10m1	NONE	NONE	NONE	exp10m1	NONE	exp10m1
flog2p1	NONE	NONE	NONE	log2p1	NONE	ln2p1
flog10p1	NONE	NONE	NONE	log10p1	NONE	logp1
fminnum08	fmin	fmin	NONE	minNum	xsmindp (5)
fmaxnum08	fmax	fmax	NONE	maxNum	xsmaxdp (5)
fmin19	fmin	fmin	NONE	minimum	NONE	fmin
fmax19	fmax	fmax	NONE	maximum	NONE	fmax
fminnum19	fmin	fmin	NONE	minimumNumber	vminfp (6), xsminjdp (5)
fmaxnum19	fmax	fmax	NONE	maximumNumber	vmaxfp (6), xsmaxjdp (5)
fminc	fmin	fmin	NONE	NONE	xsmincdp (5)	fmin*
fmaxc	fmax	fmax	NONE	NONE	xsmaxcdp (5)	fmax*
fminmagnum08	minmag	minmag	NONE	minNumMag	NONE
fmaxmagnum08	maxmag	maxmag	NONE	maxNumMag	NONE
fminmag19	minmag	minmag	NONE	minimumMagnitude	NONE
fmaxmag19	maxmag	maxmag	NONE	maximumMagnitude	NONE
fminmagnum19	minmag	minmag	NONE	minimumMagnitudeNumber	NONE
fmaxmagnum19	maxmag	maxmag	NONE	maximumMagnitudeNumber	NONE
fminmagc	minmag	minmag	NONE	NONE	NONE
fmaxmagc	maxmag	maxmag	NONE	NONE	NONE
fmod	fmod	fmod		NONE	NONE
fremainder	remainder	remainder		remainder	NONE

from Mitch Alsup:

Brian's LLVM compiler converts fminc and fmaxc into fmin and fmax instructions.
These are all IEEE 754-2019 compliant.
These are native instructions, not extensions.
All listed functions are available in both F32 and F64 formats.
There is some confusion (in my head) about fmin and fmax. I intend both instructions to perform 754-2019 semantics, but I don't know if this is minimum/maximum or minimumNumber/maximumNumber.
fmad and remainder are a 2-instruction sequence; don't know how to "edit it in".

Note (1) fsincos is macro-op fused (see below).

Note (2) synthesised in IEEE754-2019 as "rootn(x, 3)"

Note (3) synthesised in IEEE754-2019 using "1.0 / x"

Note (4) these are estimate opcodes that help accelerate software emulation

Note (5) f64-only (though can be used on f32 stored in f64 format), requires VSX.

Note (6) 4xf32-only, requires VMX.

List of 2-arg opcodes

opcode	Description	pseudocode	Extension
fatan2	atan2 arc tangent	FRT = atan2(FRB, FRA)	Zarctrignpi
fatan2pi	atan2 arc tangent / pi	FRT = atan2(FRB, FRA) / pi	Zarctrigpi
fpow	x power of y	FRT = pow(FRA, FRB)	ZftransAdv
fpown	x power of n (n int)	FRT = pow(FRA, RB)	ZftransAdv
fpowr	x power of y (x +ve)	FRT = exp(FRB * log(FRA))	ZftransAdv
frootn	x power 1/n (n integer)	FRT = pow(FRA, 1/RB)	ZftransAdv
fhypot	hypotenuse	FRT = sqrt(FRA² + FRB²)	ZftransAdv
fminnum08	IEEE 754-2008 minNum	FRT = minNum(FRA, FRB) (1)	Zfminmax
fmaxnum08	IEEE 754-2008 maxNum	FRT = maxNum(FRA, FRB) (1)	Zfminmax
fmin19	IEEE 754-2019 minimum	FRT = minimum(FRA, FRB)	Zfminmax
fmax19	IEEE 754-2019 maximum	FRT = maximum(FRA, FRB)	Zfminmax
fminnum19	IEEE 754-2019 minimumNumber	FRT = minimumNumber(FRA, FRB)	Zfminmax
fmaxnum19	IEEE 754-2019 maximumNumber	FRT = maximumNumber(FRA, FRB)	Zfminmax
fminc	C ternary-op minimum	FRT = FRA < FRB ? FRA : FRB	Zfminmax
fmaxc	C ternary-op maximum	FRT = FRA > FRB ? FRA : FRB	Zfminmax
fminmagnum08	IEEE 754-2008 minNumMag	FRT = minmaxmag(FRA, FRB, False, fminnum08) (2)	Zfminmax
fmaxmagnum08	IEEE 754-2008 maxNumMag	FRT = minmaxmag(FRA, FRB, True, fmaxnum08) (2)	Zfminmax
fminmag19	IEEE 754-2019 minimumMagnitude	FRT = minmaxmag(FRA, FRB, False, fmin19) (2)	Zfminmax
fmaxmag19	IEEE 754-2019 maximumMagnitude	FRT = minmaxmag(FRA, FRB, True, fmax19) (2)	Zfminmax
fminmagnum19	IEEE 754-2019 minimumMagnitudeNumber	FRT = minmaxmag(FRA, FRB, False, fminnum19) (2)	Zfminmax
fmaxmagnum19	IEEE 754-2019 maximumMagnitudeNumber	FRT = minmaxmag(FRA, FRB, True, fmaxnum19) (2)	Zfminmax
fminmagc	C ternary-op minimum magnitude	FRT = minmaxmag(FRA, FRB, False, fminc) (2)	Zfminmax
fmaxmagc	C ternary-op maximum magnitude	FRT = minmaxmag(FRA, FRB, True, fmaxc) (2)	Zfminmax
fmod	modulus	FRT = fmod(FRA, FRB)	ZftransExt
fremainder	IEEE 754 remainder	FRT = remainder(FRA, FRB)	ZftransExt

Note (1): for the purposes of minNum/maxNum, -0.0 is defined to be less than +0.0. This is left unspecified in IEEE 754-2008.
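
The differences between the min variants in the table are easiest to see on NaN and signed-zero inputs. The following is a minimal Python sketch, not a normative definition: the function bodies are assumptions written from the table's pseudocode and Note (1), and signalling NaNs (where minNum and minimumNumber differ) are not modelled at all.

```python
import math

def fminnum08(a, b):
    """IEEE 754-2008 minNum sketch: a quiet NaN operand is ignored."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    if a == b == 0.0:
        # Both operands are zeros: prefer the negative one, per Note (1).
        return a if math.copysign(1.0, a) < 0 else b
    return a if a < b else b

def fmin19(a, b):
    """IEEE 754-2019 minimum sketch: any NaN operand produces NaN."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return fminnum08(a, b)

def fminc(a, b):
    """C ternary-op minimum, exactly as in the table: a < b ? a : b."""
    return a if a < b else b

for name, op in (("fminnum08", fminnum08), ("fmin19", fmin19), ("fminc", fminc)):
    print(name, op(math.nan, 1.0), op(1.0, math.nan), op(0.0, -0.0), op(-0.0, 0.0))
```

Note that fminc is sensitive to operand order for both NaN and zero inputs, because every comparison involving NaN is false and -0.0 compares equal to +0.0.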

Note (2): minmaxmag(x, y, is_max, fallback) is defined as:

def minmaxmag(x, y, is_max, fallback):
    a = abs(x) < abs(y)
    b = abs(x) > abs(y)
    if is_max:
        a, b = b, a  # swap
    if a:
        return x
    if b:
        return y
    # equal magnitudes, or NaN input(s)
    return fallback(x, y)
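
As a usage illustration, the sketch below wires minmaxmag to the C ternary-op fallbacks to form fminmagc and fmaxmagc. The definition is repeated only so that the block runs on its own; the Python wrappers fminc, fmaxc, fminmagc and fmaxmagc are illustrative stand-ins for the opcodes, not part of the proposal text.

```python
import math

def minmaxmag(x, y, is_max, fallback):
    # Same definition as Note (2), repeated so this block is self-contained.
    a = abs(x) < abs(y)
    b = abs(x) > abs(y)
    if is_max:
        a, b = b, a  # swap
    if a:
        return x
    if b:
        return y
    # equal magnitudes, or NaN input(s)
    return fallback(x, y)

def fminc(a, b):
    return a if a < b else b

def fmaxc(a, b):
    return a if a > b else b

def fminmagc(a, b):
    return minmaxmag(a, b, False, fminc)

def fmaxmagc(a, b):
    return minmaxmag(a, b, True, fmaxc)

print(fminmagc(-1.0, 3.0))      # smaller magnitude wins, sign preserved
print(fmaxmagc(-5.0, 3.0))      # larger magnitude wins, sign preserved
print(fminmagc(-2.0, 2.0))      # equal magnitudes: decided by fminc
print(fminmagc(math.nan, 1.0))  # NaN input: decided by fminc
```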

List of 1-arg transcendental opcodes

opcode	Description	pseudocode	Extension
frsqrt	Reciprocal Square-root	FRT = 1.0 / sqrt(FRA)	Zfrsqrt
fcbrt	Cube Root	FRT = pow(FRA, 1.0 / 3)	ZftransAdv
frecip	Reciprocal	FRT = 1.0 / FRA	Zftrans
fexp2m1	power-2 minus 1	FRT = pow(2, FRA) - 1.0	ZftransExt
flog2p1	log2 plus 1	FRT = log(2, 1 + FRA)	ZftransExt
fexp2	power-of-2	FRT = pow(2, FRA)	Zftrans
flog2	log2	FRT = log(2, FRA)	Zftrans
fexpm1	exponential minus 1	FRT = pow(e, FRA) - 1.0	ZftransExt
flog1p	log plus 1	FRT = log(e, 1 + FRA)	ZftransExt
fexp	exponential	FRT = pow(e, FRA)	ZftransExt
flog	natural log (base e)	FRT = log(e, FRA)	ZftransExt
fexp10m1	power-10 minus 1	FRT = pow(10, FRA) - 1.0	ZftransExt
flog10p1	log10 plus 1	FRT = log(10, 1 + FRA)	ZftransExt
fexp10	power-of-10	FRT = pow(10, FRA)	ZftransExt
flog10	log base 10	FRT = log(10, FRA)	ZftransExt

List of 1-arg trigonometric opcodes

opcode	Description	pseudocode	Extension
fsin	sin (radians)	FRT = sin(FRA)	Ztrignpi
fcos	cos (radians)	FRT = cos(FRA)	Ztrignpi
ftan	tan (radians)	FRT = tan(FRA)	Ztrignpi
fasin	arcsin (radians)	FRT = asin(FRA)	Zarctrignpi
facos	arccos (radians)	FRT = acos(FRA)	Zarctrignpi
fatan	arctan (radians)	FRT = atan(FRA)	Zarctrignpi
fsinpi	sin times pi	FRT = sin(pi * FRA)	Ztrigpi
fcospi	cos times pi	FRT = cos(pi * FRA)	Ztrigpi
ftanpi	tan times pi	FRT = tan(pi * FRA)	Ztrigpi
fasinpi	arcsin / pi	FRT = asin(FRA) / pi	Zarctrigpi
facospi	arccos / pi	FRT = acos(FRA) / pi	Zarctrigpi
fatanpi	arctan / pi	FRT = atan(FRA) / pi	Zarctrigpi
fsinh	hyperbolic sin (radians)	FRT = sinh(FRA)	Zfhyp
fcosh	hyperbolic cos (radians)	FRT = cosh(FRA)	Zfhyp
ftanh	hyperbolic tan (radians)	FRT = tanh(FRA)	Zfhyp
fasinh	inverse hyperbolic sin	FRT = asinh(FRA)	Zfhyp
facosh	inverse hyperbolic cos	FRT = acosh(FRA)	Zfhyp
fatanh	inverse hyperbolic tan	FRT = atanh(FRA)	Zfhyp

Opcode Tables for PO=59/63 XO=1---011--

Power ISA v3.1B opcodes extracted from:

Power ISA v3.1B Appendix D Table 23 sheet 2/3 of 4 page 1391/1392
Power ISA v3.1B Appendix D Table 25 sheet 2/3 of 4 page 1399/1400

Parenthesized entries are not part of fptrans.

Entries whose mnemonic ends in s are only in PO=59.
Entries whose mnemonic does not end in s are only in PO=63.
Entries whose mnemonic ends in (s) are in both PO=59 and PO=63.

XO LSB half → XO MSB half ↓	01100	01101	01110	01111
10000	`10000 01100` fcbrt(s) (draft)	`10000 01101` fsinpi(s) (draft)	`10000 01110` fatan2pi(s) (draft)	`10000 01111` fasinpi(s) (draft)
10001	`10001 01100` fcospi(s) (draft)	`10001 01101` ftanpi(s) (draft)	`10001 01110` facospi(s) (draft)	`10001 01111` fatanpi(s) (draft)
10010	`10010 01100` frsqrt(s) (draft)	`10010 01101` fsin(s) (draft)	`10010 01110` fatan2(s) (draft)	`10010 01111` fasin(s) (draft)
10011	`10011 01100` fcos(s) (draft)	`10011 01101` ftan(s) (draft)	`10011 01110` facos(s) (draft)	`10011 01111` fatan(s) (draft)
10100	`10100 01100` frecip(s) (draft)	`10100 01101` fsinh(s) (draft)	`10100 01110` fhypot(s) (draft)	`10100 01111` fasinh(s) (draft)
10101	`10101 01100` fcosh(s) (draft)	`10101 01101` ftanh(s) (draft)	`10101 01110` facosh(s) (draft)	`10101 01111` fatanh(s) (draft)
10110	`10110 01100`	`10110 01101`	`10110 01110`	`10110 01111`
10111	`10111 01100`	`10111 01101`	`10111 01110`	`10111 01111`

XO LSB half → XO MSB half ↓	01100	01101	01110	01111
11000	`11000 01100` fexp2m1(s) (draft)	`11000 01101` flog2p1(s) (draft)	`11000 01110` (cffpro) (draft)	`11000 01111` (ctfpr(s)) (draft)
11001	`11001 01100` fexpm1(s) (draft)	`11001 01101` flogp1(s) (draft)	`11001 01110` (fctid)	`11001 01111` (fctidz)
11010	`11010 01100` fexp10m1(s) (draft)	`11010 01101` flog10p1(s) (draft)	`11010 01110` (fcfid(s))	`11010 01111` fmod(s) (draft)
11011	`11011 01100` fpown(s) (draft)	`11011 01101` frootn(s) (draft)	`11011 01110`	`11011 01111`
11100	`11100 01100` fexp2(s) (draft)	`11100 01101` flog2(s) (draft)	`11100 01110` (mffpr(s)) (draft)	`11100 01111` (mtfpr(s)) (draft)
11101	`11101 01100` fexp(s) (draft)	`11101 01101` flog(s) (draft)	`11101 01110` (fctidu)	`11101 01111` (fctiduz)
11110	`11110 01100` fexp10(s) (draft)	`11110 01101` flog10(s) (draft)	`11110 01110` (fcfidu(s))	`11110 01111` fremainder(s) (draft)
11111	`11111 01100` fpowr(s) (draft)	`11111 01101` fpow(s) (draft)	`11111 01110`	`11111 01111`

XO LSB half → XO MSB half ↓	10000	10001	10010	10011
////0	`....0 10000` fminmax (draft)	`////0 10001`	`////0 10010` (fdiv(s))	`////0 10011`
////1	`////1 10000`	`////1 10001`	`////1 10010` (fdiv(s))	`////1 10011`

DRAFT List of 2-arg opcodes

These are X-Form, recommended Major Opcode 63 for full-width and 59 for half-width (ending in s).

0..5	6..10	11..15	16..20	21..30	31	name	Form
NN	FRT	FRA	FRB	1xxxx011xx	Rc	transcendental	X-Form
NN	FRT	FRA	RB	1xxxx011xx	Rc	transcendental	X-Form
NN	FRT	FRA	FRB	xxxxx10000	Rc	transcendental	X-Form

Recommended 10-bit XO assignments:

opcode	Description	Major 59 and 63	bits 16..20
fatan2(s)	atan2 arc tangent	10010 01110	FRB
fatan2pi(s)	atan2 arc tangent / π	10000 01110	FRB
fpow(s)	x^y	11111 01101	FRB
fpown(s)	xⁿ (n ∈ ℤ)	11011 01100	RB
fpowr(s)	x^y (x >= 0)	11111 01100	FRB
frootn(s)	ⁿ√x (n ∈ ℤ)	11011 01101	RB
fhypot(s)	√(x² + y²)	10100 01110	FRB
fminmax	min/max	....0 10000	FRB
fmod(s)	modulus	11010 01111	FRB
fremainder(s)	IEEE 754 remainder	11110 01111	FRB
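
To make the field layout concrete, here is a small Python sketch that checks a few of the draft XO values against the 1xxxx011xx pattern and packs an X-Form word. It assumes the standard Power ISA MSB0 bit numbering (bit 0 is the most significant bit of the 32-bit word); the helper names matches and encode_x_form are hypothetical, and the XO values are draft assignments that may change.

```python
def matches(xo, pattern):
    """Return True if the 10-bit XO matches a pattern such as '1xxxx011xx'."""
    bits = format(xo, "010b")
    return all(p in ("x", b) for p, b in zip(pattern, bits))

def encode_x_form(po, frt, fra, frb, xo, rc=0):
    """Pack X-Form fields: PO 0..5, FRT 6..10, FRA 11..15, FRB 16..20, XO 21..30, Rc 31."""
    assert 0 <= po < 64 and 0 <= xo < 1024 and rc in (0, 1)
    assert all(0 <= r < 32 for r in (frt, fra, frb))
    return (po << 26) | (frt << 21) | (fra << 16) | (frb << 11) | (xo << 1) | rc

# Draft XO values copied from the table above (spaces removed).
DRAFT_XO = {
    "fatan2": 0b1001001110,
    "fpow": 0b1111101101,
    "fhypot": 0b1010001110,
}

for name, xo in DRAFT_XO.items():
    assert matches(xo, "1xxxx011xx"), name

# Major Opcode 63 selects the full-width form; 59 would select the "s" form.
word = encode_x_form(63, 1, 2, 3, DRAFT_XO["fatan2"])
print(f"fatan2 FRT=1 FRA=2 FRB=3 -> 0x{word:08X}")
```

The same matches helper can be applied with the pattern xxxxx10000 for the fminmax row.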

DRAFT List of 1-arg transcendental opcodes

These are X-Form, and are mostly identical in Special Registers Altered to fsqrt (the exact fp exceptions they can produce differ). Recommended Major Opcode 63 for full-width and 59 for half-width (ending in s).

Special Registers Altered (FIXME: come up with correct list):

FPRF FR FI FX OX UX XX
VXSNAN VXIMZ VXZDZ
CR1                    (if Rc=1)

0..5	6..10	11..15	16..20	21..30	31	name	Form
NN	FRT	///	FRB	1xxxx011xx	Rc	transcendental	X-Form

Recommended 10-bit XO assignments:

opcode	Description	Major 59 and 63
frsqrt(s)	1 / √x	10010 01100
fcbrt(s)	∛x	10000 01100
frecip(s)	1 / x	10100 01100
fexp2m1(s)	2^x - 1	11000 01100
flog2p1(s)	log₂ (x + 1)	11000 01101
fexp2(s)	2^x	11100 01100
flog2(s)	log₂ x	11100 01101
fexpm1(s)	e^x - 1	11001 01100
flogp1(s)	log_e (x + 1)	11001 01101
fexp(s)	e^x	11101 01100
flog(s)	log_e x	11101 01101
fexp10m1(s)	10^x - 1	11010 01100
flog10p1(s)	log₁₀ (x + 1)	11010 01101
fexp10(s)	10^x	11110 01100
flog10(s)	log₁₀ x	11110 01101

DRAFT List of 1-arg trigonometric opcodes

Special Registers Altered:

FPRF FR FI FX OX UX XX
VXSNAN VXIMZ VXZDZ
CR1                    (if Rc=1)

0..5	6..10	11..15	16..20	21..30	31	name	Form
NN	FRT	///	FRB	1xxxx011xx	Rc	trigonometric	X-Form

Recommended 10-bit XO assignments:

opcode	Description	Major 59 and 63
fsin(s)	sin (radians)	10010 01101
fcos(s)	cos (radians)	10011 01100
ftan(s)	tan (radians)	10011 01101
fasin(s)	arcsin (radians)	10010 01111
facos(s)	arccos (radians)	10011 01110
fatan(s)	arctan (radians)	10011 01111
fsinpi(s)	sin(π * x)	10000 01101
fcospi(s)	cos(π * x)	10001 01100
ftanpi(s)	tan(π * x)	10001 01101
fasinpi(s)	arcsin(x) / π	10000 01111
facospi(s)	arccos(x) / π	10001 01110
fatanpi(s)	arctan(x) / π	10001 01111
fsinh(s)	hyperbolic sin	10100 01101
fcosh(s)	hyperbolic cos	10101 01100
ftanh(s)	hyperbolic tan	10101 01101
fasinh(s)	inverse hyperbolic sin	10100 01111
facosh(s)	inverse hyperbolic cos	10101 01110
fatanh(s)	inverse hyperbolic tan	10101 01111

Subsets

The full set is based on the Khronos OpenCL opcodes. Implemented in its entirety, it would be too much for both Embedded and 3D.

The subsets are organised by hardware complexity and by need (3D, HPC). However, because synthesis produces inaccurate results at the range limits, the less common subsets are still required for IEEE754 HPC.

MALI Midgard, an embedded / mobile 3D GPU, for example only has the following opcodes:

28 - fmin
2C - fmax
E8 - fatan_pt2
F0 - frcp (reciprocal)
F2 - frsqrt (inverse square root, 1/sqrt(x))
F3 - fsqrt (square root)
F4 - fexp2 (2^x)
F5 - flog2
F6 - fsin1pi
F7 - fcos1pi
F9 - fatan_pt1

These are in FP32 and FP16 only: there is no FP64 hardware at all.

Vivante Embedded/Mobile 3D (etnaviv https://github.com/laanwj/etna_viv/blob/master/rnndb/isa.xml) only has the following:

fmin/fmax (implemented using SELECT)
sin, cos2pi
cos, sin2pi
log2, exp
sqrt and rsqrt
recip.

It also has fast variants of some of these, as a CSR Mode.

AMD's R600 GPU (R600_Instruction_Set_Architecture.pdf) and the RDNA ISA (RDNA_Shader_ISA_5August2019.pdf, Table 22, Section 6.3) have:

MIN/MAX/MIN_DX10/MAX_DX10
COS2PI (appx)
EXP2
LOG (IEEE754)
RECIP
RSQRT
SQRT
SIN2PI (appx)

AMD RDNA has F16 and F32 variants of all the above, and also has F64 variants of SQRT, RSQRT, MIN, MAX, and RECIP. It is interesting that even the modern high-end AMD GPU does not have TAN or ATAN, where MALI Midgard does.

As a general point, customised, optimised hardware targeting FP32 3D with reduced accuracy simply cannot be used for either IEEE754 or FP64 (except as a starting point for a hardware- or software-driven Newton-Raphson or other iterative method).

Also in cost/area sensitive applications even the extra ROM lookup tables for certain algorithms may be too costly.

These wildly differing and incompatible driving factors lead to the subset subdivisions, below.

Transcendental Subsets

Zftrans

LOG2 EXP2 RECIP RSQRT

Zftrans contains the minimum standard transcendentals best suited to 3D. They are also the minimum subset for synthesising log10, exp10, expm1, log1p, the hyperbolic trigonometric functions (sinh and so on).

They are therefore considered "base" (essential) transcendentals.

ZftransExt

LOG, EXP, EXP10, LOG10, LOGP1, EXPM1, fmod, fremainder

These are extra transcendental functions that are not generally needed for 3D but may be useful for Numerical Computation.

Although they can be synthesised using Zftrans (LOG2 multiplied by a constant), there is both a performance penalty and an accuracy penalty towards the limits, which for IEEE754 compliance is unacceptable. In particular, a dedicated hardware LOGP1, computing LOG(1+FRA) directly, may give much better accuracy at the lower end (very small FRA) than adding 1 to FRA and then using LOG.

Their forced inclusion would be inappropriate as it would penalise embedded systems with tight power and area budgets. However if they were completely excluded the HPC applications would be penalised on performance and accuracy.

Therefore they are their own subset extension.
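
The accuracy argument above can be observed in software. The sketch below uses Python's math module as a stand-in for correctly behaving reference functions; log_via_log2, log1p_naive and rel_err are hypothetical helper names, and the test values are chosen purely for illustration. It says nothing about any particular hardware implementation.

```python
import math

def log_via_log2(x):
    """Natural log synthesised as LOG2 multiplied by the constant ln(2)."""
    return math.log2(x) * math.log(2.0)

def log1p_naive(x):
    """LOG(1 + x) computed without a dedicated LOGP1 operation."""
    return math.log(1.0 + x)

def rel_err(approx, exact):
    """Relative error of approx with respect to a non-zero exact value."""
    return abs(approx - exact) / abs(exact)

x = 1e-10
print("log via log2, x=10.0  :", rel_err(log_via_log2(10.0), math.log(10.0)))
print("naive log(1+x), x=1e-10:", rel_err(log1p_naive(x), math.log1p(x)))
```

For mid-range inputs the synthesised LOG is close to the reference, whereas the naive LOG(1+x) loses roughly half of the available significant digits for very small x, because 1 + x is rounded before the logarithm is taken.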

Zfhyp

SINH, COSH, TANH, ASINH, ACOSH, ATANH

These are the hyperbolic/inverse-hyperbolic functions. Their use in 3D is limited.

They can all be synthesised using LOG, SQRT and so on, so depend on Zftrans. However, once again, at the limits of the range, IEEE754 compliance becomes impossible, and thus a hardware implementation may be required.

HPC and high-end GPUs are likely markets for these.

ZftransAdv

CBRT, POW, POWN, POWR, ROOTN

These are simply much more complex to implement in hardware, and typically will only be put into HPC applications.

Note that pow is commonly used in Blinn-Phong shading (the shading model used by OpenGL 1.0 and commonly used by shader authors that need basic 3D graphics with specular highlights), however it can be sufficiently emulated using pow(b, n) = exp2(n*log2(b)).

Zfrsqrt: Reciprocal square-root.

Trigonometric subsets

Ztrigpi vs Ztrignpi

Ztrigpi: SINPI COSPI TANPI
Ztrignpi: SIN COS TAN

Ztrignpi are the basic trigonometric functions through which all others could be synthesised, and they are typically the base trigonometrics provided by GPUs for 3D, warranting their own subset.

(programmerjake: actually, all other GPU ISAs mentioned in this document have sinpi/cospi or equivalent, and often not sin/cos, because sinpi/cospi are actually waay easier to implement because range reduction is simply a bitwise mask, whereas for sin/cos range reduction is a full division by pi)

(Mitch: My patent USPTO 10,761,806 shows that the above statement is no longer true.)

In the case of the Ztrigpi subset, these are commonly used in for loops with a power of two number of subdivisions, and the cost of multiplying by PI inside each loop (or cumulative addition, resulting in cumulative errors) is not acceptable.

In CORDIC, for example, the multiplication by PI may be moved outside of the hardware algorithm as a loop invariant, with no power or area penalty.

Again, therefore, if SINPI (etc.) were excluded, programmers would be penalised by being forced to divide by PI in some circumstances. Likewise, if SIN were excluded, programmers would be penalised by being forced to multiply by PI in some circumstances.

Thus again, a slightly different application of the same general argument applies to give Ztrignpi and Ztrigpi as subsets. 3D GPUs will almost certainly provide both.
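
The range-reduction point raised in the discussion above can be illustrated in software. This is a minimal sketch, assuming math.fmod as a software stand-in for the bit-level masking that hardware would use; sinpi_sketch is a hypothetical name and the code is not a proposed hardware algorithm.

```python
import math

def sinpi_sketch(x):
    """sin(pi * x), removing whole periods before multiplying by pi."""
    # fmod by 2.0 is exact in binary floating point; r lies in (-2, 2).
    r = math.fmod(x, 2.0)
    # Fold into [-1, 1]; these subtractions are exact.
    if r > 1.0:
        r -= 2.0
    elif r < -1.0:
        r += 2.0
    # Fold into [-0.5, 0.5] using sin(pi * r) = sin(pi * (1 - r)).
    if r > 0.5:
        r = 1.0 - r
    elif r < -0.5:
        r = -1.0 - r
    return math.sin(math.pi * r)

x = 1e6
print("sinpi_sketch(x):", sinpi_sketch(x))
print("sin(pi * x)    :", math.sin(math.pi * x))
```

Because the reduction is exact, sinpi_sketch returns exactly zero at integer arguments, whereas sin(pi * x) inherits the rounding error of the product pi * x, which grows with the magnitude of x.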

Zarctrigpi and Zarctrignpi

Zarctrigpi: ATAN2PI ASINPI ACOSPI
Zarctrignpi: ATAN2 ACOS ASIN

These are extra trigonometric functions that are useful in some applications, but even for 3D GPUs, particularly embedded and mobile class GPUs, they are not so common and so are typically synthesised there.

Although they can be synthesised using Ztrigpi and Ztrignpi, there is, once again, both a performance penalty and an accuracy penalty towards the limits, which for IEEE754 compliance is unacceptable, yet is acceptable for 3D.

Therefore they are their own subset extensions.

Zfminmax

fminnum08 fmaxnum08
fmin19 fmax19
fminnum19 fmaxnum19
fminc fmaxc
fminmagnum08 fmaxmagnum08
fminmag19 fmaxmag19
fminmagnum19 fmaxmagnum19
fminmagc fmaxmagc

These are commonly used for vector reductions, where having them be a single instruction is critical. They are also commonly used in GPU shaders, HPC, and general-purpose FP algorithms.

These min and max operations are quite cheap to implement hardware-wise, being comparable in cost to fcmp + some muxes. They're all in one extension because once you implement some of them, the rest require only slightly more hardware complexity.

Therefore they are their own subset extension.

Synthesis, Pseudo-code ops and macro-ops

The pseudo-ops are best left up to the compiler rather than being actual pseudo-ops, by allocating one scalar FP register for use as a constant (loop invariant) set to "1.0" at the beginning of a function or other suitable code block.

fsincos - fused macro-op between fsin and fcos (issued in that order).
fsincospi - fused macro-op between fsinpi and fcospi (issued in that order).

fatanpi example pseudo-code:

fmvis ft0, 0x3F80 // upper bits of f32 1.0 (BF16)
fatan2pis FRT, FRA, ft0
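
A short Python model of the two steps above may help: 0x3F80 placed in the upper 16 bits of an f32 bit pattern is 1.0, and atanpi then follows from the identity atan(x) = atan2(x, 1). The helper names are hypothetical, and the sketch deliberately does not settle which of FRA/FRB maps to which atan2 argument in the instruction encoding.

```python
import math
import struct

def f32_from_upper16(imm16):
    """Interpret a 16-bit immediate as the upper half of an f32 bit pattern."""
    return struct.unpack(">f", struct.pack(">I", (imm16 & 0xFFFF) << 16))[0]

def fatan2pi(y, x):
    """Software model of atan2(y, x) / pi."""
    return math.atan2(y, x) / math.pi

def fatanpi(x):
    """atanpi synthesised from atan2pi with the second argument fixed at 1.0."""
    one = f32_from_upper16(0x3F80)
    return fatan2pi(x, one)

print(f32_from_upper16(0x3F80))
print(fatanpi(1.0))
```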

Hyperbolic function example (obviates need for Zfhyp except for high-performance or correctly-rounding):

ASINH( x ) = ln( x + SQRT(x**2+1))
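
The limitation of this synthesis is visible even in double precision. The sketch below evaluates the formula directly and compares it against Python's math.asinh, which is assumed here to be an accurate reference; asinh_synth is a hypothetical name.

```python
import math

def asinh_synth(x):
    """ASINH(x) = ln(x + SQRT(x**2 + 1)), evaluated directly."""
    return math.log(x + math.sqrt(x * x + 1.0))

for x in (1.0, 1e-9):
    exact = math.asinh(x)
    err = abs(asinh_synth(x) - exact) / abs(exact)
    print(f"x={x:g} relative error={err:.3g}")
```

For x near 1 the two agree closely, but for very small x the term x**2 + 1 rounds to 1 and most of the significant digits are lost, which is the kind of range-limit inaccuracy that motivates Zfhyp for correctly-rounding implementations.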

pow sufficient for 3D Graphics:

pow(b, x) = exp2(x * log2(b))
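
A minimal Python sketch of this identity follows, assuming a strictly positive base (as is typical for a Blinn-Phong specular term). pow_3d is a hypothetical name, 2.0 ** y stands in for an exp2 operation, and no claim is made about meeting IEEE754 pow accuracy or special-case behaviour.

```python
import math

def pow_3d(b, x):
    """pow(b, x) = exp2(x * log2(b)); this sketch only handles b > 0."""
    if b <= 0.0:
        raise ValueError("this sketch only handles b > 0")
    return 2.0 ** (x * math.log2(b))

# Example: a specular term with an illustrative shininess exponent of 32.
print(pow_3d(0.9, 32.0), math.pow(0.9, 32.0))
```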

Evaluation and commentary

Moved to discussion