Draft proposal for improved atomic operations for the Power ISA

NOTE: THIS PROPOSAL IS NOT BEING SUBMITTED, DUE TO THE DISCOVERY DURING INVESTIGATION THAT ATOMICS ARE DESIGNED FOR MASSIVE DISTRIBUTED CLUSTERS. SIGNIFICANT ADDITIONAL RESEARCH IS REQUIRED, SO THIS PROPOSAL IS ON HOLD UNTIL BUDGET IS AVAILABLE.

Links:

https://bugs.libre-soc.org/show_bug.cgi?id=236
OpenCAPI spec p47-49 for AMO section
RISC-V A
discussion
http://www.rdrop.com/~paulmck/scalability/paper/N2745r.2011.03.04a.html

TODO:

investigate Power ISA 3.1 p1077 eh hint

Motivation

Power ISA currently has some issues with its atomic operations support. These are exacerbated by 3D data structure processing in 3D shader binaries, which needs on the order of 10⁵ or more atomic locks per second per SMP core.

Power ISA's current atomic operations are inefficient

Implementations have a hard time recognizing existing atomic operations via macro-op fusion because they would often have to detect and fuse a large number of instructions, including branches. This is contrary to the RISC paradigm.

There is also the issue that Power ISA's memory fences are unnecessarily strong, particularly isync, which is used for many acquire and stronger fences. isync forces the CPU to do a full pipeline flush, which is unnecessary when all that is needed is a memory barrier.

atomic_fetch_add_seq_cst is 6 instructions including a loop:

# address in r4, addend in r5
    sync
loop:
    ldarx 3, 0, 4
    add 6, 5, 3
    stdcx. 6, 0, 4
    bne 0, loop
    lwsync
# output in r3

atomic_load_seq_cst is 5 instructions, including a branch, and an unnecessarily-strong memory fence:

# address in r3
    sync
    ld 3, 0(3)
    cmpw 0, 3, 3
    bne- 0, skip
    isync
skip:
# output in r3

atomic_compare_exchange_strong_seq_cst is 7 instructions, including a loop with 2 branches, and an unnecessarily-strong memory fence:

# address in r4, compared-to value in r5, replacement value in r6
    sync
loop:
    ldarx 3, 0, 4
    cmpd 0, 3, 5
    bne 0, not_eq
    stdcx. 6, 0, 4
    bne 0, loop
not_eq:
    isync
# output loaded value in r3, store-occurred flag in cr0.eq

atomic_load_acquire is 4 instructions, including a branch and an unnecessarily-strong memory fence:

# address in r3
    ld 3, 0(3)
    cmpw 0, 3, 3
    bne- 0, skip
    isync
skip:
# output in r3

Having single atomic operations is useful for implementations that want to send atomic operations to a shared cache since they can be more efficient to execute there, rather than having to move a whole cache block. Relying exclusively on TODO

Power ISA doesn't align well with C++11 atomics

P0668R5: Revising the C++ memory model:

Existing implementation schemes on Power and ARM are not correct with respect to the current memory model definition. These implementation schemes can lead to results that are disallowed by the current memory model when the user combines acquire/release ordering with seq_cst ordering. On some architectures, especially Power and Nvidia GPUs, it is expensive to repair the implementations to satisfy the existing memory model. Details are discussed in (Lahav et al) http://plv.mpi-sws.org/scfix/paper.pdf (which this discussion relies on heavily).

Power ISA's Atomic-Memory-Operations have issues

PowerISA v3.1 Book II section 4.5: Atomic Memory Operations

The existing Atomic Memory Operations still lack better fences, combined operation/fence instructions, and operations on 8-bit and 16-bit values. They also carry unnecessary restrictions:

only 32-bit and 64-bit atomic operations are provided.

see discussion for proposed operations and thoughts TODO remove this sentence

DRAFT atomic instructions

These two instructions, lat and stat, are identical to lwat/ldat and stwat/stdat, except that they add guaranteed acquire and release ordering semantics as well as 8-bit and 16-bit memory widths.

AT-Form (TODO)

lat. RT,RA,FC,aq,rl,ew
stat RS,RA,FC,aq,rl,ew

DRAFT EXT031 and XO assignments. These sit close to the existing atomic memory operations.

0.5	6.10	11.15	16.20	21	22	23.24	25.30	31	name	Form
31	RT	RA	FC	aq	rl	ew	000101	Rc	lat	TODO-Form
31	RS	RA	FC	aq	rl	ew	100101	/	stat	TODO-Form
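
To make the draft field layout concrete, the following is a minimal sketch of an encoder for it. It is not part of the proposal: the function names are hypothetical, bits 21 and 22 are assumed to carry the aq and rl bits described in the text, and the reserved bit 31 of stat is assumed to be written as 0. Power ISA numbers bits from the most significant end, so a field ending at bit e is shifted left by 31 - e.

```python
# Hypothetical encoder for the DRAFT lat/stat layout (illustration only).
PO_EXT031 = 31        # primary opcode, bits 0..5
XO_LAT = 0b000101     # draft XO for lat, bits 25..30
XO_STAT = 0b100101    # draft XO for stat, bits 25..30


def _pack(reg, ra, fc, aq, rl, ew, xo, bit31):
    """Pack the fields into a 32-bit word using MSB0 bit numbering."""
    # (value, field width in bits, last bit of the field in MSB0 numbering)
    fields = [
        (PO_EXT031, 6, 5),
        (reg, 5, 10),     # RT for lat, RS for stat
        (ra, 5, 15),
        (fc, 5, 20),
        (aq, 1, 21),      # assumption: bit 21 is aq
        (rl, 1, 22),      # assumption: bit 22 is rl
        (ew, 2, 24),
        (xo, 6, 30),
        (bit31, 1, 31),   # Rc for lat, reserved for stat
    ]
    word = 0
    for value, width, end_bit in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word |= value << (31 - end_bit)
    return word


def encode_lat(rt, ra, fc, aq=0, rl=0, ew=3, rc=0):
    return _pack(rt, ra, fc, aq, rl, ew, XO_LAT, rc)


def encode_stat(rs, ra, fc, aq=0, rl=0, ew=3):
    # Assumption: the reserved bit 31 is encoded as 0.
    return _pack(rs, ra, fc, aq, rl, ew, XO_STAT, 0)
```

For example, `hex(encode_lat(3, 4, 0, aq=1, rl=1, ew=3))` gives the word for a 64-bit Fetch and Add with both ordering bits set, under the assumptions above.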

ew specifies the memory operation width: ew=0/1/2/3 selects 8/16/32/64 bits respectively
If the aq bit is set, then no later atomic memory operations can be observed to take place before the AMO in this or other cores. (A global Write-after-Read Memory Hazard is created)
If the rl bit is set, then other cores will not observe the AMO before memory accesses preceding the AMO. (A global Read-after-Write Memory Hazard is created)
Setting both the aq and the rl bit makes the sequence sequentially consistent, meaning that it cannot be reordered with respect to earlier or later atomic memory operations. (Both a RaW and WaR are simultaneously created)
FC is identical to the Function tables used in Power ISA v3 for lwat and stwat
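
The ew, aq and rl rules above can be summarised in a small decoding sketch. The helper names are hypothetical, and the mapping onto C++11-style ordering names is only an analogy drawn for illustration. In particular, treating aq=0, rl=0 as "relaxed" is an assumption, since the draft does not name that case.

```python
def access_width_bits(ew):
    """Return the memory operation width in bits for a 2-bit ew field."""
    if ew not in (0, 1, 2, 3):
        raise ValueError("ew is a 2-bit field")
    return 8 << ew   # 0 -> 8, 1 -> 16, 2 -> 32, 3 -> 64


def ordering_name(aq, rl):
    """Name the ordering implied by the aq and rl bits (illustrative only)."""
    names = {
        (0, 0): "relaxed",   # assumption: not stated in the draft
        (1, 0): "acquire",   # later AMOs cannot be observed before this one
        (0, 1): "release",   # earlier accesses are observed before this one
        (1, 1): "seq_cst",   # both bits set: sequentially consistent
    }
    return names[(aq, rl)]
```

The shift form `8 << ew` works because each step of ew doubles the width.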

read functions v3.1 book II section 4.5.1 p1071

opcode	regs	memory	description
00000	RT, RT+1	mem(EA,s)	Fetch and Add
00001	RT, RT+1	mem(EA,s)	Fetch and XOR
00010	RT, RT+1	mem(EA,s)	Fetch and OR
00011	RT, RT+1	mem(EA,s)	Fetch and AND
00100	RT, RT+1	mem(EA,s)	Fetch and Maximum Unsigned
00101	RT, RT+1	mem(EA,s)	Fetch and Maximum Signed
00110	RT, RT+1	mem(EA,s)	Fetch and Minimum Unsigned
00111	RT, RT+1	mem(EA,s)	Fetch and Minimum Signed
01000	RT, RT+1	mem(EA,s)	Swap
10000	RT, RT+1, RT+2	mem(EA,s)	Compare and Swap Not Equal
11000	RT	mem(EA,s) mem(EA+s, s)	Fetch and Increment Bounded
11001	RT	mem(EA,s) mem(EA+s, s)	Fetch and Increment Equal
11100	RT	mem(EA-s,s) mem(EA, s)	Fetch and Decrement Bounded
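
As a rough reference model of the simpler read functions (function codes 00000 to 01000), the sketch below applies one operation to a single memory element. It assumes that RT receives the old memory value and that RT+1 supplies the operand. Memory is modelled as a plain dictionary holding one element per address, and the atomicity and ordering of real hardware are not modelled. The names are hypothetical. The compare-and-swap and bounded forms are left out.

```python
def _signed(value, bits):
    """Reinterpret an unsigned bits-wide value as two's complement."""
    return value - (1 << bits) if value >> (bits - 1) else value


# FC -> function of (old memory value, operand, width in bits)
FETCH_FUNCTIONS = {
    0b00000: lambda old, op, bits: old + op,                   # Fetch and Add
    0b00001: lambda old, op, bits: old ^ op,                   # Fetch and XOR
    0b00010: lambda old, op, bits: old | op,                   # Fetch and OR
    0b00011: lambda old, op, bits: old & op,                   # Fetch and AND
    0b00100: lambda old, op, bits: max(old, op),               # Max Unsigned
    0b00101: lambda old, op, bits: max(
        old, op, key=lambda v: _signed(v, bits)),              # Max Signed
    0b00110: lambda old, op, bits: min(old, op),               # Min Unsigned
    0b00111: lambda old, op, bits: min(
        old, op, key=lambda v: _signed(v, bits)),              # Min Signed
    0b01000: lambda old, op, bits: op,                         # Swap
}


def fetch_op(memory, ea, fc, operand, ew):
    """Update memory[ea] in place and return the old value (as RT would)."""
    bits = 8 << ew                      # ew=0/1/2/3 -> 8/16/32/64 bits
    mask = (1 << bits) - 1
    old = memory[ea] & mask
    new = FETCH_FUNCTIONS[fc](old, operand & mask, bits)
    memory[ea] = new & mask             # results wrap at the element width
    return old
```

Calling `fetch_op(mem, 0x100, 0b00000, 1, 0)` on an 8-bit element holding 0xFF returns 0xFF and leaves 0 in memory, which shows the wrap-around the proposed narrow widths would imply under this model.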

store functions

opcode	regs	memory	description
00000	RS	mem(EA,s)	Store Add
00001	RS	mem(EA,s)	Store XOR
00010	RS	mem(EA,s)	Store OR
00011	RS	mem(EA,s)	Store AND
00100	RS	mem(EA,s)	Store Maximum Unsigned
00101	RS	mem(EA,s)	Store Maximum Signed
00110	RS	mem(EA,s)	Store Minimum Unsigned
00111	RS	mem(EA,s)	Store Minimum Signed
11000	RS	mem(EA,s)	Store Twin

These functions are also recognised as being part of the OpenCAPI Specification.