discussion moved here

moved here because the spec page itself should contain only settled decisions, not "undecided potential ideas"

the operations it has that I was going to propose:

fetch_add
fetch_xor
fetch_or
fetch_and
fetch_umax
fetch_smax
fetch_umin
fetch_smin
exchange

as well as a few I wasn't going to propose (they seem less useful to me):

compare-and-swap-not-equal
fetch-and-increment-bounded
fetch-and-increment-equal
fetch-and-decrement-bounded
store-twin
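
To make the naming in the first list concrete, here is a small single-threaded Python model. It assumes the conventional fetch-and-op meaning (the old memory value is returned, and op(old, operand) is stored back), that the u/s prefix selects an unsigned or signed comparison, and that memory words are 32 bits wide. None of that is quoted from the spec, and `atomic_rmw`, `OPS` and `to_signed` are illustrative names only.

```python
MASK = 0xFFFF_FFFF  # assumption: model 32-bit memory words


def to_signed(x):
    """Reinterpret a 32-bit pattern as a two's-complement signed integer."""
    x &= MASK
    return x - (1 << 32) if x & 0x8000_0000 else x


# Each entry maps (old memory value, operand) to the new memory value.
OPS = {
    "fetch_add":  lambda old, v: old + v,
    "fetch_xor":  lambda old, v: old ^ v,
    "fetch_or":   lambda old, v: old | v,
    "fetch_and":  lambda old, v: old & v,
    "fetch_umax": lambda old, v: max(old, v),                 # unsigned compare
    "fetch_smax": lambda old, v: max(old, v, key=to_signed),  # signed compare
    "fetch_umin": lambda old, v: min(old, v),
    "fetch_smin": lambda old, v: min(old, v, key=to_signed),
    "exchange":   lambda old, v: v,
}


def atomic_rmw(mem, addr, op, value):
    """Apply OPS[op] to mem[addr] and return the previous contents.

    Real hardware would do the read-modify-write as one indivisible step;
    this model is single-threaded, so it only shows the data semantics.
    """
    old = mem[addr] & MASK
    mem[addr] = OPS[op](old, value & MASK) & MASK
    return old
```

For example, with `mem = [0xFFFF_FFFF]`, `fetch_umax` with operand 1 leaves memory unchanged because the unsigned value is already the largest possible, while `fetch_smax` with operand 1 replaces it with 1 because the same bit pattern is -1 when read as signed.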

The spec also basically says that the atomic memory operations are intended only for cases where you want to perform an atomic operation on memory without loading that memory into your L1 cache.

imho that restriction is unwanted, because there are plenty of cases where atomic operations should happen in the L1 cache.

I'd guess that restriction is part of why gcc and clang don't use those operations as their default implementation of atomic operations (when the appropriate ISA feature is enabled).

imho the cpu should be able to (but not required to) predict whether to send an atomic operation out to the L2 cache/L3 cache/etc./memory or to execute it directly in the L1 cache. The prediction could be based on how often that cache block is accessed by different cpus, e.g. with a small saturating counter and a last-accessing-cpu field. The counter tracks how many times in a row the same cpu has accessed the block: once that count exceeds some limit, the operation is done in the L1 cache; if the limit hasn't been reached, or a different cpu accesses the block, the operation is done in the L2/L3/etc. cache instead.
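
A rough behavioural sketch of that predictor idea in Python, just to pin down the counting rule. The class name, the `limit` and `counter_max` values, and the "L1"/"far" return strings are all made up for illustration; real hardware would keep this state per cache block, and where exactly it lives is left open here.

```python
class BlockPredictor:
    """Hypothetical per-cache-block state: last accessing cpu plus a
    saturating count of consecutive accesses by that cpu."""

    def __init__(self, limit=3, counter_max=7):
        self.limit = limit              # assumed threshold, not from any spec
        self.counter_max = counter_max  # saturation value of the small counter
        self.last_cpu = None
        self.run_length = 0

    def access(self, cpu):
        """Record an atomic access by `cpu` and return where to execute it:
        "L1" for the local L1 cache, "far" for L2/L3/etc./memory."""
        if cpu == self.last_cpu:
            # Same cpu again: count up, but never past the saturation value.
            self.run_length = min(self.run_length + 1, self.counter_max)
        else:
            # A different cpu touched the block: restart the run.
            self.last_cpu = cpu
            self.run_length = 1
        return "L1" if self.run_length > self.limit else "far"


p = BlockPredictor()
trace = [p.access(cpu) for cpu in [0, 0, 0, 0, 0, 1, 0]]
```

With the default limit of 3, the first three accesses by cpu 0 go far, the fourth and fifth run in L1, and the access by cpu 1 resets the run, so the following access by cpu 0 goes far again.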