Bug 236 - Atomics Standard writeup needed
Summary: Atomics Standard writeup needed
Status: RESOLVED FIXED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Specification
Version: unspecified
Hardware: PC Linux
Importance: High enhancement
Assignee: Jacob Lifshay
URL: https://libre-soc.org/openpower/atomics
Depends on:
Blocks: 900 174
Reported: 2020-03-13 12:50 GMT by Luke Kenneth Casson Leighton
Modified: 2022-09-07 12:52 BST

See Also:
NLnet milestone: NLNet.2019.10.046.Standards
total budget (EUR) for completion of task and all subtasks: 2500
budget (EUR) for this task, excluding subtasks' budget: 2500
parent task for budget allocation: 174
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
red = { amount = 700, submitted = 2022-08-26, paid = 2022-08-31 }
jacob = { amount = 1800, submitted = 2022-08-29, paid = 2022-09-03 }


Description Luke Kenneth Casson Leighton 2020-03-13 12:50:07 GMT
A formal write-up of the c++ Atomics protocol is needed, to be proposed
to the OpenPOWER Foundation.  This will be useful for OpenPOWER as well.
Comment 1 Jacob Lifshay 2020-03-13 18:22:27 GMT
See this thread for the reason we need something other than Power's existing atomic instructions: http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-October/003085.html
Comment 2 Luke Kenneth Casson Leighton 2022-06-25 10:19:08 BST
are these basically already covered by Power ISA load-store with
reservations?

or the OpenCAPI atomic memory operations?
https://opencapi.org/wp-content/uploads/2016/09/OpenCAPI-Overview.10.14.16.pdf
Comment 3 Luke Kenneth Casson Leighton 2022-06-25 18:38:36 BST
these look to me like they're ok:
https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
Comment 4 Jacob Lifshay 2022-06-26 08:32:39 BST
(In reply to Luke Kenneth Casson Leighton from comment #2)
> are these basically already covered by Power ISA load-store with
> reservations?

yes, if you don't care about efficiency -- you'll just get large/slow functions whenever you use atomics (for one atomic op: iirc >5 instructions, with ~4 of them in a loop).

I'll assume efficiency is something we care about.

some atomics on powerpc64le, x86_64, and amdgpu:
https://gcc.godbolt.org/z/9a6EKjhjh

note how every atomic on powerpc64le is a giant pile of instructions in a loop -- having the decoder macro-op fuse a 4-instruction (or more) loop is absurd imho. x86 has a single instruction for add and for exchange (it has more if you don't need the return value), and amdgpu has dedicated instructions for all the operations I tried (clang crashes for 8-bit atomics). riscv (not in that godbolt link) also supports a bunch of operations.
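
for concreteness, here's the kind of source behind that godbolt link, with the typical ppc64le expansion sketched in a comment (from memory, so treat it as illustrative rather than exact):

    #include <atomic>

    // On powerpc64le (gcc -O2) a seq_cst fetch_add expands to roughly
    // the following (from memory -- see the godbolt link for exact output):
    //         sync                 # leading barrier
    //     1:  lwarx  r9,0,r3       # load-reserve the old value
    //         add    r10,r9,r4     # compute the new value
    //         stwcx. r10,0,r3      # store-conditional
    //         bne-   1b            # reservation lost: retry the loop
    //         isync                # trailing (acquire) barrier
    // whereas x86_64 emits a single `lock xadd`.
    int fetch_add(std::atomic<int> &a, int v) {
        return a.fetch_add(v, std::memory_order_seq_cst);
    }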

we need short instructions for at least atomic-fetch-add and atomic-exchange, since they're quite common in cpu code. for gpu code it would be nice to support the full list of atomic ops supported by vulkan/opencl:
https://www.khronos.org/registry/SPIR-V/specs/unified1/SPIRV.html#_atomic_instructions

atomics supported by vulkan/opencl:
load float/int (already supported by power)
store float/int (already supported by power)
exchange float/int
compare exchange float/int (sufficiently supported by power)
fetch_increment int (covered by fetch_add int)
fetch_decrement int (covered by fetch_add int)
fetch_add int
fetch_sub int (covered by fetch_add int)
fetch_min[u] int
fetch_max[u] int
fetch_and int
fetch_or int
fetch_xor int
flag_test_and_set (covered by exchange int)
flag_clear (covered by store int)
fetch_min float 
fetch_max float
fetch_add float

int/float fetch_min/max are particularly important for gpu code since they can be used for depth buffer ops.
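
as a sketch of why that matters: C++11 has no fetch_min, so today a depth test has to be emulated with a compare-exchange loop -- exactly the inefficiency a dedicated instruction removes (minimal sketch, u32 depth values assumed):

    #include <atomic>
    #include <cstdint>

    // Returns true if this fragment is nearer and its depth was stored.
    bool depth_test(std::atomic<uint32_t> &depth, uint32_t z) {
        uint32_t old = depth.load(std::memory_order_relaxed);
        // emulate fetch_min with a CAS loop: retry while a smaller z
        // still needs to be stored
        while (z < old &&
               !depth.compare_exchange_weak(old, z, std::memory_order_relaxed))
            ;  // compare_exchange_weak reloads `old` on failure
        return z < old;
    }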

we will want 8/16/32/64-bit int and 16/32/64-bit float support.

we also need 128-bit atomics support: they're relatively uncommon, but they're used in some critical data structures and are waaayy faster than having to use a global mutex. power's existing instructions are sufficient for that -- we just need to implement them: lq, stq, lqarx, stqcx.
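
for reference, a hedged sketch of what 128-bit CAS looks like from C++ with the gcc/clang builtins (the -mquad-memory-atomic flag name is from memory -- double-check it):

    using u128 = unsigned __int128;

    // With gcc on ppc64le and (iirc) -mquad-memory-atomic this can lower
    // to a lqarx/stqcx. loop rather than a libatomic fallback.
    bool cas128(u128 *p, u128 *expected, u128 desired) {
        return __atomic_compare_exchange_n(p, expected, desired,
                                           /*weak=*/false,
                                           __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }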

> 
> or the OpenCAPI atomic memory operations?

we need actual instructions to express what we want, otherwise all the fancy hardware support is useless...

that pdf doesn't elaborate at all which atomics opencapi supports.
Comment 5 Jacob Lifshay 2022-06-26 08:49:04 BST
(In reply to Luke Kenneth Casson Leighton from comment #3)
> these look to me like they're ok:
> https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

that's the bare minimum for c++11 to operate correctly...it doesn't cover most of the ops we want to be efficient.

also, c++11 has problems with atomic fences on power:
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0668r1.html

imho we should introduce fences where needed that follow c++11 semantics.

imho we should also introduce fused atomic/fence ops, where the atomic instruction is also a memory fence, like risc-v, armv8, and x86_64. this would reduce the number of instructions needed for an acq_rel atomic add to 1, rather than the 5-7 (with a 4-instruction loop) we currently have.

imho we should mirror risc-v's aq/rl bits:
aq rl:
0  0  relaxed
0  1  release
1  0  acquire
1  1  seq_cst
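
for illustration, a hypothetical helper encoding a C++11 ordering into those two bits per the table above (not a real API, just the mapping in code form):

    #include <atomic>

    struct AqRl { bool aq, rl; };

    // Hypothetical helper: encode a C++11 memory order as risc-v-style
    // aq/rl bits, following the table above.
    constexpr AqRl aqrl_for(std::memory_order mo) {
        switch (mo) {
        case std::memory_order_relaxed: return {false, false};
        case std::memory_order_release: return {false, true};
        case std::memory_order_acquire: return {true, false};
        default:                        return {true, true};  // seq_cst
        }
    }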
Comment 6 Jacob Lifshay 2022-06-26 08:50:27 BST
(In reply to Jacob Lifshay from comment #5)
> also, c++11 has problems with atomic fences on power:
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0668r1.html

problems as in giving the wrong answer kind of problems...not just slow.
Comment 7 Luke Kenneth Casson Leighton 2022-06-26 09:40:15 BST
(In reply to Jacob Lifshay from comment #5)
> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > these look to me like they're ok:
> > https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
> 
> that's the bare minimum for c++11 to operate correctly...it doesn't cover
> most of the ops we want to be efficient.

bleh. ok.

> imho we should also introduce fused atomic/fence ops, where the atomic
> instruction is also a memory fence, like risc-v and armv8 and x86_64, this
> would reduce the number of needed instructions for an acq_rel atomic add to
> 1, rather than the 5-7 with a 4-instruction loop we currently have.

ok, brilliant, can you create a page and design them?  keep it simple
and straightforward, bear in mind i can help in a limited way as i
don't fully grasp these instructions, and they need some serious
justification to put to the ISA WG.
Comment 8 Jacob Lifshay 2022-06-26 09:47:37 BST
(In reply to Luke Kenneth Casson Leighton from comment #7)
> (In reply to Jacob Lifshay from comment #5)
> > imho we should also introduce fused atomic/fence ops, where the atomic
> > instruction is also a memory fence, like risc-v and armv8 and x86_64, this
> > would reduce the number of needed instructions for an acq_rel atomic add to
> > 1, rather than the 5-7 with a 4-instruction loop we currently have.
> 
> ok, brilliant, can you create a page and design them?  keep it simple
> and straightforward, bear in mind i can help in a limited way as i
> don't fully grasp these instructions, and they need some serious
> justification to put to the ISA WG.

yeah, i can work on it, though it'll happen after the fadd formal stuff.
Comment 9 Luke Kenneth Casson Leighton 2022-06-26 10:08:32 BST
(In reply to Jacob Lifshay from comment #8)

> yeah, i can work on it, though it'll happen after the fadd formal stuff.

not a problem - please make sure that you cut the fadd/fmul formal
stuff *real* short - to leave enough time.  absolute basics. i'll
put in another grant request to cover the rest of what "needs" to
be done.
Comment 10 Jacob Lifshay 2022-07-07 11:45:36 BST
handy pdf with formal semantics they proposed as "repaired C11" atomics semantics:
https://pure.mpg.de/rest/items/item_2543041/component/file_3332448/content
Kang, Jeehoon, Chung-Kil Hur, Ori Lahav, Viktor Vafeiadis, and Derek Dreyer. "A promising semantics for relaxed-memory concurrency." ACM SIGPLAN Notices 52, no. 1 (2017): 175-189.
Comment 11 Jacob Lifshay 2022-07-08 07:05:24 BST
After doing some research, it turns out that PowerISA actually already has a lot of the atomic operations I was going to propose; they just aren't really used by gcc or clang. It is still missing better fences, combined operation/fence instructions, and operations on 8/16-bit values, and it has some unnecessary restrictions.

PowerISA v3.1 Book II section 4.5: Atomic Memory Operations

it has only 32-bit and 64-bit atomic operations.

the operations it has that I was going to propose:
fetch_add
fetch_xor
fetch_or
fetch_and
fetch_umax
fetch_smax
fetch_umin
fetch_smin
exchange

as well as a few I wasn't going to propose (they seem less useful to me):
compare-and-swap-not-equal
fetch-and-increment-bounded
fetch-and-increment-equal
fetch-and-decrement-bounded
store-twin

The spec also basically says that the atomic memory operations are only intended for when you want to do atomic operations on memory, but don't want that memory to be loaded into your L1 cache.

imho that restriction is specifically *not* wanted, because there are plenty of cases where atomic operations should happen in your L1 cache.

I'd guess that part of why those atomic operations weren't included in gcc or clang as the default implementation of atomic operations (when the appropriate ISA feature is enabled) is because of that restriction.

imho the cpu should be able to (but not required to) predict whether to send an atomic operation to the L2/L3/etc. caches or memory, or to execute it directly in the L1 cache. The prediction could be based on how often that cache block is accessed from different cpus: e.g. a small saturating counter and a last-accessing-cpu field would count how many times the same cpu accessed the block in a row, executing in the L1 cache once that count exceeds some limit, and otherwise doing the operation in the L2/L3/etc. cache (when the limit wasn't reached or a different cpu tried to access it).
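
a rough software-terms sketch of that predictor (all names and the limit value are invented for illustration):

    #include <cstdint>

    // Invented illustration of the predictor; field names and the limit
    // are made up. One instance would live per cache block.
    struct AmoPredictor {
        std::uint8_t last_cpu = 0;  // last cpu to issue an AMO here
        std::uint8_t count = 0;     // saturating same-cpu run length
        static constexpr std::uint8_t LIMIT = 3;

        // true: execute the AMO in the requesting cpu's L1 cache;
        // false: push it out to the L2/L3/etc. cache or memory.
        bool execute_in_l1(std::uint8_t cpu) {
            if (cpu == last_cpu) {
                if (count < LIMIT)
                    count++;        // same cpu again: count up (saturating)
            } else {
                last_cpu = cpu;     // different cpu: reset the run
                count = 0;
            }
            return count >= LIMIT;
        }
    };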
Comment 12 Jacob Lifshay 2022-07-08 07:17:35 BST
I started writing the spec, currently it just has a motivation section:

https://libre-soc.org/openpower/atomics/
Comment 13 Luke Kenneth Casson Leighton 2022-07-08 09:02:49 BST
jacob the non-L1-cache instructions are intended for pushing directly
through OpenCAPI, which has direct corresponding capability.  ariane/pulpino
*very specifically* implemented amo* as a single ALU directly built-in to
the L2 cache.
Comment 14 Jacob Lifshay 2022-07-08 09:10:09 BST
(In reply to Luke Kenneth Casson Leighton from comment #13)
> jacob the non-L1-cache instructions are intended for pushing directly
> through OpenCAPI, which has direct corresponding capability.  ariane/pulpino
> *very specifically* implemented amo* as a single ALU directly built-in to
> the L2 cache.

regardless, the technique I described can be used to optimize AMOs to work well at the L1 cache and at other caches/memory/opencapi -- it would just be modified to handle access from outside the cache where the tracking is implemented.

My point is that it can be done, not that it has to use the algorithm i gave.

Having fast atomics that may operate in the L1 cache is imho a requirement for a good cpu/gpu.
Comment 15 Jacob Lifshay 2022-07-08 12:15:00 BST
found this interesting paper on implementing high performance amo operations, with performance approaching that of normal load/store ops:
Free Atomics: Hardware Atomic Operations without Fences
https://dl.acm.org/doi/pdf/10.1145/3470496.3527385
Comment 16 Luke Kenneth Casson Leighton 2022-07-08 15:02:14 BST
IBM will have to correspondingly begin a process of creating modifications
to OpenCAPI.  being aware of how much work the changes actually ask of
them is a matter of respect.

they may not be in the least bit happy to have their multi billion dollar
high performance compute market "damaged" by unthinking, unreasonable
and drastic changes to a spec.

when i skim-read the OpenCAPI Specification i had not looked for a memory
width field, not realising at the time that it was pretty much directly
linked to the Book II instructions (the FC of stwat RT,RA,FC).
i'll see if i can find it.
Comment 17 Luke Kenneth Casson Leighton 2022-07-08 15:19:36 BST
https://opencapi.org/wp-content/uploads/2016/09/OC-DL-Specification.10.14.16.pdf
nope

http://opencapi.org/wp-content/uploads/2017/02/OpenCAPI-TL.WGsec_.V3p1.2017Jan27.pdf

yes.

table 2.5 and 2.6  p49.

page 48 covers length:

    The command address specified is naturally aligned based on the
    operand length. The operand length is [specified by pLength] for
    operations with the exception of fetch and swap operations where
    the cmd_flag is specified as {x‘8’...x‘A’} and pLength shall be
    specified as {‘110’, ‘111’}. Refer to the specification of pLength
    on page 28.

paaaage 28......

    pLength 3 Partial length. Specifies the number of data 
    bytes specified for a partial write command. The address
    specified shall be naturally aligned based on the pLength
    specified. The data may be sent in a data flit, or an 8- or
    32-byte datum field specified for some control flits.
    000 1 byte. Reserved when the command is an amo*. 
    001 2 bytes. Reserved when the command is an amo*. 
    010 4 bytes 
    011 8 bytes 
    100 16 bytes. Reserved when the command is an amo*. 
    101 32 bytes. Reserved when the command is an amo*.
    110 Specifies 4-byte operands when the command is amo_rw and
        the operation is specified as a Fetch and swap. That is,
        the command flag is {x‘8’..x‘A’}. Otherwise, this field is reserved.
    111 Specifies 8-byte operands when the command is amo_rw and 
        the operation is specified as a Fetch and swap. That is, the
        command flag is {x‘8’..x‘A’}. Otherwise, this field is reserved.

conclusion: oops.  it's ok for amo* but not ok for amo_rw, that would
require a drastic (multi-million-dollar impact) redesign of OpenCAPI.
Comment 18 Luke Kenneth Casson Leighton 2022-07-08 15:34:27 BST
table 2.5

 ‘0000’ Fetch and Add
 ‘0001’ Fetch and XOR 
 ‘0010’ Fetch and OR 
 ‘0011’ Fetch and AND 
 ‘0100’ Fetch and maximum unsigned
 ‘0101’ Fetch and maximum signed
 ‘0110’ Fetch and minimum unsigned
 ‘0111’ Fetch and minimum signed 
 ‘1000’ Fetch and swap 
 ‘1001’ Fetch and swap equal 
 ‘1010’ Fetch and swap not equal 
 ‘1011’ through ‘1111’ Reserved
Comment 19 Jacob Lifshay 2022-07-08 19:25:32 BST
(In reply to Luke Kenneth Casson Leighton from comment #17)
> conclusion: oops.  it's ok for amo* but not ok for amo_rw, that would
> require a drastic (multi-million-dollar impact) redesign of OpenCAPI.

well, all that needs to happen is that 8/16-bit atomics transfer a cache block to the cache (if not already there) instead of using opencapi atomics...unless it's something like atomic or, where the cpu can just or zeros into the bytes it doesn't want to write to.

this is exactly what would happen for things like atomic fadd, too.
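
the or-widening trick, concretely -- a hedged sketch (little-endian byte lanes assumed, strict-aliasing formalities ignored, helper name invented):

    #include <atomic>
    #include <cstdint>

    // Widen an 8-bit atomic OR into the 4-byte AMO that opencapi does
    // support, by or-ing zeros into the three bytes we must not modify.
    uint8_t atomic_or_u8_via_u32(uint8_t *p, uint8_t v) {
        auto addr = reinterpret_cast<uintptr_t>(p);
        auto *word =
            reinterpret_cast<std::atomic<uint32_t> *>(addr & ~uintptr_t(3));
        unsigned shift = (addr & 3) * 8;  // little-endian byte lane
        uint32_t old = word->fetch_or(uint32_t(v) << shift);
        return uint8_t(old >> shift);     // old value of our byte
    }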
Comment 20 Luke Kenneth Casson Leighton 2022-07-08 20:34:59 BST
(In reply to Jacob Lifshay from comment #19)
> (In reply to Luke Kenneth Casson Leighton from comment #17)
> > conclusion: oops.  it's ok for amo* but not ok for amo_rw, that would
> > require a drastic (multi-million-dollar impact) redesign of OpenCAPI.
> 
> well, all that needs to happen is 8/16-bit atomics have to transfer a cache
> block to the cache (if not already there) instead of using opencapi
> atomics...

you're missing the point.

there is no "all that needs to happen" here.  i had not realised
how tightly coupled lwat/stwat are into opencapi (i was half expecting
it)

IBM will not think in terms of "instead of using opencapi".

they have not designed "just a processor" they have designed
"a massive data processing business" where Power ISA is one
tiny component of a much bigger ecosystem.

their immediate thought will not be, "these are great for c++ guarantees"

their immediate thought will be, "how much is this going to **** up our
customers spending billions of dollars on coherent parallel distributed
systems for which OpenCAPI is the bedrock".

bottom line if what is proposed does not have a way to fit into opencapi
it is highly likely to be rejected.

aq and rl (acquire and release) will need to be additional opcodes in
opencapi.
Comment 21 Luke Kenneth Casson Leighton 2022-07-08 20:47:33 BST
these are the opencapi opcodes:

AMO read amo_rd       amo_rd.n
         ‘0011 1000’ ‘0011 1100’

AMO read write amo_rw       amo_rw.n
               ‘0100 0000’ ‘0100 0100’

they need to be joined by three more (each):

* AMO read aq
* AMO rdwr aq
* AMO read rl
* AMO rdwr rl
* AMO read aq rl
* AMO rdwr aq rl

which means finding out if there are two bits available
somewhere in the opencapi opcodes

*that have not been used by IBM for private OMI custom extensions*

which leaves us running into a Commercial Confidentiality brick wall.
Comment 22 Luke Kenneth Casson Leighton 2022-07-08 21:05:50 BST
ok so i now have a handle on things, the rationale you did is excellent.
i had no idea IBM had added stwat etc although having seen amos in
opencapi i was half expecting it.

the lack of AQ RL is going to hit hard (IBM's entire design paradigm does
not recognise it).  opencapi section 1.3:

    Command ordering Ordering within a VC is maintained
    through the TL/TLX, but it is not assured after the
    command has moved to the upper protocol layers (host and AFU)
    as described in Section 3 Virtual channel and data credit
    pool specification on page 77.

which fundamentally conflicts with acquire / release ordering requirements.

updates to the page, summary:

* moved 1st person narrative to discussion page
* tracked down function tables
* added new draft "AT" Form
* added lat/stat with elwidth, aq and rl
* found a suitable place in EXT031.

for the function table, recommending anything not EXACTLY equal to what IBM
already has will have a hard time.  we don't need the hassle.  doesn't
matter in the least if we like it, understand it, agree with it or want to
eat it with mayonnaise: it's a table that's already defined.
Comment 23 Luke Kenneth Casson Leighton 2022-07-09 10:35:37 BST
a reasonable justification is needed as to why 8 and 16 bit is to be proposed.
neither Power nor RISC-V has them.  what data structures and algorithms
specifically benefit?
Comment 24 Jacob Lifshay 2022-07-12 06:47:57 BST
(In reply to Luke Kenneth Casson Leighton from comment #20)
> (In reply to Jacob Lifshay from comment #19)
> > (In reply to Luke Kenneth Casson Leighton from comment #17)
> > > conclusion: oops.  it's ok for amo* but not ok for amo_rw, that would
> > > require a drastic (multi-million-dollar impact) redesign of OpenCAPI.
> > 
> > well, all that needs to happen is 8/16-bit atomics have to transfer a cache
> > block to the cache (if not already there) instead of using opencapi
> > atomics...
> 
> you're missing the point.

i got your point, imho your point is just partially incorrect.
> 
> there is no "all that needs to happen" here.  i had not realised
> how tightly coupled lwat/stwat are into opencapi (i was half expecting
> it)
> 
> IBM will not think in terms of "instead of using opencapi".

I was never advocating for "instead of using opencapi", it would just use a different part of opencapi: the part that allows a cpu to obtain exclusive access to a cache block (M/E states in the MESI cache coherence model).

> aq and rl (acquire and release) will need to be additional opcodes in
> opencapi.

no they won't, memory ordering is handled by the cpu deciding when to issue memory operations:
acquire -- all following memory ops are delayed till after the atomic op is executed.
release -- the atomic op is delayed till after all preceding memory ops are executed.
acq_rel -- do both the above
seq_cst -- do both the above, as well as synchronize across the whole system...essentially what the `sync` instruction already mostly does.

as an example, tilelink has no memory fence operations, and all amo operations have no aq/rl bits in tilelink -- they're all handled by the cpu.
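
a small sketch of that cpu-side handling in C++ fence terms (acquire shown; release is the mirror image, with the fence before the op):

    #include <atomic>

    // An "acquire" AMO built from a relaxed AMO plus a trailing acquire
    // fence: the fence delays all following memory ops, exactly as
    // described above -- no aq/rl bits need to cross the interconnect.
    int fetch_add_acquire(std::atomic<int> &a, int v) {
        int old = a.fetch_add(v, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);
        return old;
    }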
Comment 25 Jacob Lifshay 2022-07-12 06:53:19 BST
(In reply to Luke Kenneth Casson Leighton from comment #23)
> a reaaomable justification is needed as to why 8 and 16 bit is to be
> proposed.
> neither Power nor RISCV have them.  what data structures and algorithms
> specifically benefit?

off the top of my head, a very commonly-used (4.3M downloads per month) Rust mutex library uses 8-bit atomics for implementing their mutexes:

https://lib.rs/crates/parking_lot

i'm sure there're many more.
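
for illustration (in C++ rather than Rust, since parking_lot itself is Rust): the point is that the whole lock word is one byte, so many locks pack into a single cache line -- a minimal sketch:

    #include <atomic>
    #include <cstdint>

    struct ByteLock {
        std::atomic<uint8_t> locked{0};  // the whole mutex is one byte

        bool try_lock() {
            // wants an efficient 8-bit atomic exchange
            return locked.exchange(1, std::memory_order_acquire) == 0;
        }
        void unlock() { locked.store(0, std::memory_order_release); }
    };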
Comment 27 Jacob Lifshay 2022-07-14 11:45:49 BST
(In reply to Luke Kenneth Casson Leighton from comment #26)
> https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-July/005071.html
> 
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;
> h=27dca7419e3b5f8df2fdb9aae45481aea6754bc7

That use of isync actually has very little to do with atomics; instead, it makes the cpu see the instructions that were just modified by writing to memory (instruction-cache synchronization).
Comment 28 Luke Kenneth Casson Leighton 2022-07-15 14:07:59 BST
i assumed that because this is an atomics bugreport that you would
post something related to atomic operations.

from the discussion with Paul it is clear that we do not know
enough and that the Power ISA 3.0 Spec is not detailed enough.

we also cannot possibly submit an enhancement where we have absolutely
no idea whatsoever of what IBM has done and why.

what this tells us is that this bugreport is completely inadequately
funded to do a full investigation.

can you write some assembler tests of existing atomics and locking, both multi-process,
to give a clear idea of performance, so that we are properly informed.

i aim to close this bugreport within one week and to submit a new grant request
to cover it in more depth.
Comment 29 Jacob Lifshay 2022-07-15 19:42:54 BST
(In reply to Luke Kenneth Casson Leighton from comment #28)
> i assumed that because this is an atomics bugreport that you would
> post something related to atomic operations.

I posted the stuff you linked in comment #26 to the mailing list rather than this bug because it wasn't really related to atomics.
> 
> from the discussion with Paul it is clear that we do not know
> enough and that the Power ISA 3.0 Spec is not detailed enough.
> 
> we also cannot possibly submit an enhancement where we have absolutely
> no idea whatsoever of what IBM has done and why.
> 
> what this tells us is that this bugreport is completely inadequately
> funded to do a full investigation.
> 
> can you do some assembler of existing atomics and locking, both
> multi-process,
> to give a clear idea of performance, so that we are properly informed.

sure.
> 
> i aim to close this bugreport within one week and to submit a new grant
> request
> to cover it in more depth.

please don't close it, just reassign the funding and reuse this bug as one of the tasks for the grant, since that allows us to not lose context (not technically lost, but less readily accessible). imho we don't need a full eur 50k, so we should include other things in the grant.
Comment 30 Luke Kenneth Casson Leighton 2022-07-15 21:20:36 BST
(In reply to Jacob Lifshay from comment #29)

> please don't close it, just reassign the funding and reuse this bug as one
> of the tasks for the grant, since that allows us to not lose context (not
> technically lost, but less readily accessible). imho we don't need a full
> eur 50k, so we should include  other things in the grant.

ah you misunderstood: i want this task done and dusted and declared
closed so that you and i can get the EUR 2500 associated with it into
our bank accounts, promptly.

what i do *not* want is for 10 weeks of work to go by on a task that
gets longer and longer and longer and longer now that we have learned
that the Power ISA spec is not sufficient and have learned also the
extent of the attention paid by IBM to atomics *which we know nothing about*

so i am making the decision to declare the scope of this task to be
of length no greater than (as of right now) 6 days further work,
and *no more*.

i will take care of remembering that it needs to be cross-referenced
to the new grant.  however it shall remain closed at the end of the
task because the scope of payments are for this task and the work
that is done within it.
Comment 31 Jacob Lifshay 2022-07-15 21:24:50 BST
(In reply to Luke Kenneth Casson Leighton from comment #30)
> ah you misunderstood: i want this task done and dusted and declared
> closed so that you and i can get the EUR 2500 associated with it into
> our bank accounts, promptly.

ah, then can you create a subtask for "initial research on atomics" or something so the €2500 can be assigned to that one and it can be closed while leaving this task to track all atomics stuff?
Comment 32 Luke Kenneth Casson Leighton 2022-07-15 22:13:04 BST
(In reply to Jacob Lifshay from comment #31)

> ah, then can you create a subtask for "initial research on atomics" or
> something so the €2500 can be assigned to that one and it can be closed
> while leaving this task to track all atomics stuff?

no, because the money is allocated to this task and this task only.
it is a top-level milestone and part of the MoU, and both of us have
done work on the write-up.

it *must* be completed, and due to the unforeseen change in circumstances
as part of wednesday's meeting with Paul i am *defining* it to be 100%
complete once the evaluation and research in comment #28 is performed
and written up.

keep it simple please.
Comment 33 Jacob Lifshay 2022-07-15 22:17:59 BST
(In reply to Luke Kenneth Casson Leighton from comment #32)
> it is a top-level milestone and part of the MoU, 

ah, didn't notice that.
Comment 34 Jacob Lifshay 2022-07-16 05:38:16 BST
(In reply to Jacob Lifshay from comment #29)
> (In reply to Luke Kenneth Casson Leighton from comment #28)
> > can you do some assembler of existing atomics and locking, both
> > multi-process,
> > to give a clear idea of performance, so that we are properly informed.
> 
> sure.

added initial atomics assembler, along with script that generates it.
Comment 35 Luke Kenneth Casson Leighton 2022-07-16 12:32:50 BST
(In reply to Jacob Lifshay from comment #34)

> added initial atomics assembler, along with script that generates it.

which i've now had to remove and deal with yet another force-master push.

please *do not* break the hard rule of adding auto-generated output to
repositories.

*especially* given that it is a massive batch of identical code.


allow me to be clearer in the instructions.

we need to demonstrate that the POWER9 recommended c++ spinlocks and
atomics are or are not efficient, and to what extent.

the program therefore that needs to be created must:

1) have an option to specify the number of SMT forked processes to run
2) have an option to specify how many lock and unlocks shall be performed
   per forked process
3) have an option to specify the range of memory addresses to be lock/unlocked
   ("1" being "all processes lock and unlock the same address)
4) use RECOMMENDED sequences known to be used in c, c++, and the linux
   kernel, such as these (or others already present in the linux kernel
   and other common libraries)
   http://www.rdrop.com/~paulmck/scalability/paper/N2745r.2011.03.04a.html
5) have an option to use the "eh" hints that Paul mentioned are in
   Power ISA 3.1 p1077 (see the sketch after this list)
6) time the operations ("time ./locktest" would do).
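
as a starting point for item 5, here is a hedged sketch of the Power ISA recommended lock-acquisition sequence with the EH hint set, as gcc inline assembler (based on the Book II programming examples -- verify against the spec before relying on it):

    #include <cstdint>

    // Spin until the lock word goes 0 -> 1. lwarx's 4th operand is the
    // EH hint: 1 tells the core this reservation is a lock acquisition.
    static inline void spin_lock(uint32_t *lock) {
        uint32_t old, one = 1;
        asm volatile(
            "1: lwarx  %0,0,%2,1\n\t"  // load-reserve with EH=1
            "   cmpwi  %0,0\n\t"       // lock already held?
            "   bne-   1b\n\t"         //  yes: keep spinning
            "   stwcx. %1,0,%2\n\t"    // try to store 1 (claim the lock)
            "   bne-   1b\n\t"         // reservation lost: retry
            "   isync"                 // acquire (import) barrier
            : "=&r"(old)
            : "r"(one), "r"(lock)
            : "cr0", "memory");
    }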

there is no need to add this program in markdown form.

it is purely experimental in nature for the purposes of research.

it is not for the publication of a specification.

it is for the purposes of actually being executed to obtain
information for which a report (manually) will have to be written.

when executed on the TALOS-II workstation with different numbers of
processes and different memory ranges, this will tell us whether
IBM properly designed the ISA or not.  it will not tell us exactly
*how* they actually implemented them but will give at least some
black-box hints.

if the locking remains linear for up to all 72 hyperthreads and it
is of the order of a million locks per second per core regardless
of the number of memory addresses then we can reasonably conclude
that they did a damn good job.

if they do *not* work then we are 100% justified in proposing additional
enhancements to the ISA

but *not* until the *actual* statistics have *actually* been measured
and real-world reports obtained.

we do not have access to an IBM POWER10 system so IBM POWER9 will have to do.

bottom line is that if we cannot demonstrate good knowledge of IBM's
atomics then we have absolutely no business whatsoever in proposing
alternatives or enhancements.
Comment 36 Luke Kenneth Casson Leighton 2022-07-16 13:07:03 BST
please do not use rust for this task

the people reviewing this bugreport and the source code of the
research will be c and assembler programmers.

it is perfectly fine to have #ifdefs around functions to create
different binaries and for the Makefile to generate (and run)
them all.  it would also be equally perfectly fine to have a runtime
option to select the function / other options (e.g. which eh hint)
Comment 37 Jacob Lifshay 2022-07-19 03:50:28 BST
(In reply to Luke Kenneth Casson Leighton from comment #35)
> (In reply to Jacob Lifshay from comment #34)
> 
> > added initial atomics assembler, along with script that generates it.
> 
> which i've now had to remove and deal with yet another force-master push.

why'd you remove the python script too? it is not autogenerated... I spent a lot of time writing it...it is a separate commit from the commit adding the generated output so is easy to retain.
> 
> please *do not* break the hard rule of adding auto-generated output to
> repositories.

as explained in the commit message, I added the autogenerated markdown specifically because the repo is the wiki, and there isn't really an easy way to see the results on the website otherwise -- you'd have to download and run the script, which is quite inconvenient.

Sorry, I didn't think to ask first.

> 
> *especially* given that it is a massive batch of identical code.

it's not actually identical...it's the assembly for all atomic ops in the c11 standard (except consume memory ordering...no one uses that).
> 
> 
> allow me to be clearer in the instructions.
> 
> we need to demonstrate that the POWER9 recommended c++ spinlocks and
> atomics are or are not efficient, and to what extent.
> 
> the program therefore that needs to be created must:

I'm assuming by "process" you really mean threads.

> 1) have an option to specify the number of SMT forked processes to run
> 2) have an option to specify how many lock and unlocks shall be performed
>    per forked process
> 3) have an option to specify the range of memory addresses to be
> lock/unlocked
>    ("1" being "all processes lock and unlock the same address)
> 4) use RECOMMENDED sequences known to be used in c, c++, and the linux
>    kernel.

the sequences generated for the standard c11 atomic operations (as in the python script I wrote) are the recommended sequences for those standard c11 operations.

>    such as these (or other already present in the linux kernel
>    and other common libraries)
>    http://www.rdrop.com/~paulmck/scalability/paper/N2745r.2011.03.04a.html
> 5) have an option to use the "eh" hints that Paul mentioned are in
>    Power ISA 3.1 p1077 eh hint
> 6) time the operations ("time ./locktest" would do).

no, `time` has poor accuracy for shorter runs. using clock_gettime (or APIs wrapping it) inside the program is better: we can use realtime clocks and avoid measuring program load/terminate time, as well as loop the measured section multiple times to do statistical analysis on it, discarding outliers -- this avoids measuring the extra time linux takes to map the program/data into memory or allocate memory.
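
a minimal sketch of the in-process timing, using std::chrono::steady_clock (which wraps clock_gettime(CLOCK_MONOTONIC) on linux):

    #include <chrono>

    // Time `iters` calls of `f`, excluding program load/terminate time;
    // call this repeatedly and aggregate to discard outliers.
    template <typename F>
    double ns_per_iter(F &&f, long iters) {
        auto start = std::chrono::steady_clock::now();
        for (long i = 0; i < iters; i++)
            f();
        std::chrono::duration<double, std::nano> dt =
            std::chrono::steady_clock::now() - start;
        return dt.count() / iters;
    }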

> 
> there is no need to add this program in markdown form.
> 
> it is purely experimental in nature for the purposes of research.

well, for the purposes of research it would be quite handy to see what assembly is used for each standard atomic operation without having to run the compiler yourself or write the input c program.
> 
> it is not for the publication of a specification.

yup.
> 
> it is for the purposes of actually being executed to obtain
> information for which a report (manually) will have to be written.
> 
> when executed on the TALOS-II workstation with different numbers of
> processes and different memory ranges, this will tell us whether
> IBM properly designed the ISA or not.  it will not tell us exactly
> *how* they actually implemented them but will give at least some
> black-box hints.
> 
> if the locking remains linear
it won't due to hyperthreads on the same core conflicting with each other...expect at least an inflection point at 18 threads since that's where it has to start sharing cores between multiple threads.

> for up to all 72 hyperthreads and it
> is of the order of a million locks per second per core regardless
> of the number of memory addresses then we can reasonably conclude
> that they did a damn good job.
> 
> if they do *not* work then we are 100% justified in proposing additional
> enhancements to the ISA

even if they do work, we still will want improvements to support atomic fadd, fmin/fmax, and a few others.
> 
> but *not* until the *actual* statistics have *actually* been measured
> and real-world reports obtained.
> 
> we do not have access to an IBM POWER10 system so IBM POWER9 will have to do.
> 
> bottom line is that if we cannot demonstrate good knowledge of IBM's
> atomics then we have absolutely no business whatsoever in proposing
> alternatives or enhancements.

Some other things we should test are some common 3d gpu shaders that use atomics, as well as parking_lot's performance (we'll need to use Rust for parking_lot, since it is a Rust library).

parking_lot is used by firefox.
Comment 38 Luke Kenneth Casson Leighton 2022-07-19 19:48:55 BST
(In reply to Jacob Lifshay from comment #37)

> I'm assuming by "process" you really mean threads.

makes no odds, if you consider threads to be easier to check correctness, go
for it. watch out for threading in libc6 transparently doing locking without
your knowledge on data structures and system calls though.

> >    such as these (or other already present in the linux kernel
> >    and other common libraries)
> >    http://www.rdrop.com/~paulmck/scalability/paper/N2745r.2011.03.04a.html
> > 5) have an option to use the "eh" hints that Paul mentioned are in
> >    Power ISA 3.1 p1077 eh hint
> > 6) time the operations ("time ./locktest" would do).
> 
> no, it has poor accuracy for shorter times...

i was thinking of the order of tens of seconds per run.

> well, for the purposes of research it would be quite handy to see what
> assembly is used for each standard atomic operation without having to run
> the compiler yourself or write the input c program.

sorry, yes, that's what i meant: find out what the libraries do and
explicitly implement it in assembler. otherwise we have no concrete
direct evidence to present to OPF ISA WG.

> > if the locking remains linear
> it won't due to hyperthreads on the same core conflicting with each
> other...expect at least an inflection point at 18 threads since that's where
> it has to start sharing cores between multiple threads.

as POWER9 is multi-issue that assumption may not hold [if they did a decent
job]. be really interesting to find out.

> even if they do work, we still will want improvements to support atomic
> fadd, fmin/fmax, and a few others.

priority is to demonstrate first that we know what the hell we're talking about.

> atomics, as well as parking_lot's performance (we'll need to use Rust for
> parking_lot, since it is a Rust library).
> 
> parking_lot is used by firefox.

we are directly testing and exploring IBM hardware, not the capability of a
library that runs *on* IBM hardware.
Comment 39 Jacob Lifshay 2022-07-20 01:08:50 BST
some benchmarking of the AMO operations added in ARM v8.1a:
https://web.archive.org/web/20210619193003/https://cpufun.substack.com/p/atomics-in-aarch64
Comment 40 Jacob Lifshay 2022-07-20 09:08:05 BST
I've started writing a benchmarking program in c++. I'm compiling it for ppc64le, x86, and aarch64, since just running it on ppc64le doesn't tell you whether the performance is good or kinda terrible -- so I want to compare with aarch64, which has most of the RMW operations I want to add to ppc64le. x86 basically forces all RMW atomics to be sequentially-consistent, which involves a system-wide synchronization; aarch64's atomics should be able to be more efficient.

aarch64 can be tested by running on AWS's graviton2 instances, though that may not be the best option.

I'm waiting on lkcl creating a atomic-benchmarks.git repo, I'll push my work once that's done.
Comment 41 Jacob Lifshay 2022-07-21 11:39:43 BST
I got the benchmarks to run correctly:
I tested it on both my desktop and on the talos server. It currently just has the standard c++11 atomic ops, I'll add more OpenPower-specific benchmarks later.

If you don't specify iteration counts and/or thread counts, it automatically picks good values, using all available cpu cores and aiming for the average elapsed time to be 0.5-1s by doubling the iteration count until a run takes >0.5s.
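
a hedged sketch of that calibration loop (run_bench is an invented stand-in for the real benchmark kernel):

    #include <chrono>
    #include <cstdint>

    // Invented stand-in for the real benchmark kernel: returns elapsed
    // seconds for the given iteration count.
    static double run_bench(std::uint64_t iters) {
        auto start = std::chrono::steady_clock::now();
        volatile std::uint64_t sink = 0;
        for (std::uint64_t i = 0; i < iters; i++)
            sink = sink + 1;  // dummy work
        std::chrono::duration<double> dt =
            std::chrono::steady_clock::now() - start;
        return dt.count();
    }

    // Double the iteration count until one run takes >0.5s.
    std::uint64_t pick_iter_count() {
        std::uint64_t iters = 1;
        while (run_bench(iters) <= 0.5)
            iters *= 2;
        return iters;
    }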

Demo:
./build-x86_64/benchmarks --bench atomic_fetch_add_i64_seq_cst --bench atomic_load_i64_relaxed -j4
Running: atomic_fetch_add_i64_seq_cst
Thread #0 took 0.936348 sec for 33554432 iterations -- 27.9053 ns/iter.
Thread #1 took 0.909495 sec for 33554432 iterations -- 27.1051 ns/iter.
Thread #2 took 0.91187 sec for 33554432 iterations -- 27.1758 ns/iter.
Thread #3 took 0.936216 sec for 33554432 iterations -- 27.9014 ns/iter.
Average elapsed time: 0.923482 sec for 33554432 iterations -- 27.5219 ns/iter.

Running: atomic_load_i64_relaxed
Thread #0 took 0.681217 sec for 1073741824 iterations -- 0.634432 ns/iter.
Thread #1 took 0.679786 sec for 1073741824 iterations -- 0.6331 ns/iter.
Thread #2 took 0.67332 sec for 1073741824 iterations -- 0.627078 ns/iter.
Thread #3 took 0.675972 sec for 1073741824 iterations -- 0.629548 ns/iter.
Average elapsed time: 0.677574 sec for 1073741824 iterations -- 0.63104 ns/iter.


./build-x86_64/benchmarks --help
Usage: ./build-x86_64/benchmarks [-h|--help] [-j|--thread-count <value>] [-n|--iter-count <value>] [--log2-mem-loc-count <value>] [--log2-stride <value>] [-b|--bench <value>]
Options:
-h|--help   Display usage and exit.
-j|--thread-count   Number of threads to run on
-n|--iter-count   Number of iterations to run per thread
--log2-mem-loc-count   Log base 2 of the number of memory locations to access
--log2-stride   Log base 2 of the stride used for accessing memory locations
-b|--bench   List of benchmarks that should be run


./build-x86_64/benchmarks --bench list
Available Benchmarks:
atomic_exchange_u8_relaxed
atomic_fetch_add_u8_relaxed
atomic_fetch_sub_u8_relaxed
atomic_fetch_and_u8_relaxed
atomic_fetch_or_u8_relaxed
atomic_fetch_xor_u8_relaxed
atomic_exchange_u8_acquire
atomic_fetch_add_u8_acquire
atomic_fetch_sub_u8_acquire
atomic_fetch_and_u8_acquire
atomic_fetch_or_u8_acquire
atomic_fetch_xor_u8_acquire
atomic_exchange_u8_release
atomic_fetch_add_u8_release
atomic_fetch_sub_u8_release
atomic_fetch_and_u8_release
atomic_fetch_or_u8_release
atomic_fetch_xor_u8_release
atomic_exchange_u8_acq_rel
atomic_fetch_add_u8_acq_rel
atomic_fetch_sub_u8_acq_rel
atomic_fetch_and_u8_acq_rel
atomic_fetch_or_u8_acq_rel
atomic_fetch_xor_u8_acq_rel
atomic_exchange_u8_seq_cst
atomic_fetch_add_u8_seq_cst
atomic_fetch_sub_u8_seq_cst
atomic_fetch_and_u8_seq_cst
atomic_fetch_or_u8_seq_cst
atomic_fetch_xor_u8_seq_cst
atomic_load_u8_relaxed
atomic_load_u8_acquire
atomic_load_u8_seq_cst
atomic_store_u8_relaxed
atomic_store_u8_release
atomic_store_u8_seq_cst
atomic_compare_exchange_weak_u8_relaxed_relaxed
atomic_compare_exchange_strong_u8_relaxed_relaxed
atomic_compare_exchange_weak_u8_acquire_relaxed
atomic_compare_exchange_strong_u8_acquire_relaxed
atomic_compare_exchange_weak_u8_acquire_acquire
atomic_compare_exchange_strong_u8_acquire_acquire
atomic_compare_exchange_weak_u8_release_relaxed
atomic_compare_exchange_strong_u8_release_relaxed
atomic_compare_exchange_weak_u8_acq_rel_relaxed
atomic_compare_exchange_strong_u8_acq_rel_relaxed
atomic_compare_exchange_weak_u8_acq_rel_acquire
atomic_compare_exchange_strong_u8_acq_rel_acquire
atomic_compare_exchange_weak_u8_seq_cst_relaxed
atomic_compare_exchange_strong_u8_seq_cst_relaxed
atomic_compare_exchange_weak_u8_seq_cst_acquire
atomic_compare_exchange_strong_u8_seq_cst_acquire
atomic_compare_exchange_weak_u8_seq_cst_seq_cst
atomic_compare_exchange_strong_u8_seq_cst_seq_cst
<same for u16 u32 u64 i8 i16 i32 i64>
Comment 42 Jacob Lifshay 2022-07-21 11:41:22 BST
(In reply to Jacob Lifshay from comment #41)
> I got the benchmarks to run correctly:
> I tested it on both my desktop and on the talos server. It currently just
> has the standard c++11 atomic ops, I'll add more OpenPower-specific
> benchmarks later.
> 

Forgot to post the repo link:
https://git.libre-soc.org/?p=benchmarks.git;a=tree;h=141d2e40aa82d1aa4268fc1595d5362a239ce309;hb=ecb29549cf226ed129caa7d43ce79e8b2e4d9575
Comment 43 Luke Kenneth Casson Leighton 2022-07-21 13:06:26 BST
excellllent, ok.

so IBM decided to use "cache barriers"; it needs to be determined whether
those are directly equivalent to lr/sc's aq/rl flags.
we also need to know if multiple atomic operations can
be multi-issue in-flight (i seem to recall POWER9 is 8-way multi-issue?)

also we need to know what the granularity of internal single-locking
is, by that i mean that if there are multiple requests to the same
{insert thing} then it is 100% guaranteed that, like intel, only
one will ever be serviced.

i suspect, from reading the Power ISA Spec, that {thing} is a Cache
Block.

however that needs to be explicitly determined by deliberately hammering
a POWER9 core with requests at different addresses, varying the differences
and seeing if the throughput drops to single-contention.

at exactly the same address is no good, we can assume that will definitely
cause contention.

the other important fact to know is how the forward-progress guarantee
works, i.e. how these "cache barriers" work; i suspect they are similar
to IBM's "Transactions". there is probably an internal counter/tag which goes
up by one on each lwsync.

other architectures are not exactly of no interest, but there are only
2-3 days left before this bugreport gets closed, so please focus on POWER9
Comment 44 Jacob Lifshay 2022-07-26 05:44:10 BST
(In reply to Luke Kenneth Casson Leighton from comment #43)
> excellllent, ok.
> 
> so IBM decided to use "cache barriers"

you mean "memory barriers" (aka. memory fences).

> which needs to be determined if
> that is directly equivalent to lr/sc's aq/rl flags.

They are a similar kind of memory fence. PowerISA's memory fences are not 1:1 identical to RISC-V's aq/rl flags -- RISC-V's aq/rl flags are basically C++11's memory orderings adapted to work on a cpu (by changing most non-atomic load/stores to be memory_order_relaxed atomic load/store ops).
> 
> we also need to know if multiple atomic operations can
> be multi-issue in-flight (i seem to recall POWER9 is 8-way multi-issue?)
> 
> also we need to know what the granularity of internal single-locking
> is, by that i mean that if there are multiple requests to the same
> {insert thing} then it is 100% guaranteed that, like intel, only
> one will ever be serviced.

The spec constrains the reservation granule to be >= 16B and <= the minimum supported page size.

According to:
https://wiki.raptorcs.com/w/images/8/89/POWER9_um_OpenPOWER_v20GA_09APR2018_pub.pdf

POWER9's reservation granule is 128B.
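
a small illustration of what that means in practice: two lock words in the same 128B granule will make each other's stwcx. fail, so hot atomics want granule-sized padding (sketch; alignment value per the POWER9 manual above):

    #include <atomic>

    // Pad each hot atomic to its own 128-byte reservation granule so
    // that stwcx. on one counter can't be killed by traffic on the other.
    struct alignas(128) PaddedCounter {
        std::atomic<long> value{0};
    };

    PaddedCounter counters[2];  // never share a reservation granule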
> 
> i suspect, from reading the Power ISA Spec, that {thing} is a Cache
> Block.

It's not necessarily...but most reasonable implementations use a cache block.
> 
> however that needs to be explicitly determined by deliberately hammering
> a POWER9 core with requests at different addresses, varying the differences
> and seeing ifthe throughput drops to single-contention.
> 
> at exactly the same address is no good, we can assume that will definitely
> cause contention.
> 
> the other important fact to know is, how does the forward-progress guarantee
> work, i.e. how do these "cache barriers" work and i suspect they are similar
> to IBM's "Transactions".

I'd expect most of them work by just stopping further instructions from executing and restarting the instruction fetch process...afaict that's what the spec says has to happen for sync and lwsync -- lwsync would be cheaper by not requiring as much store buffer flushing i guess.

Note that none of the restarting instruction fetch stuff is needed for any of the C++11 memory fences...they only care about load/store/atomic execution order.

> there is probably an internal counter/tag which goes
> up by one on each lwsync.
> 
> other architectures are not exacctly of no interest but please really there
> is
> only 2-3 days left before this bugreport gets closed so focus on POWER9

Why would we close it now...isn't there still around a month before we have to get all RFPs in to nlnet? (giving them a month to meet their deadline)
Comment 45 Luke Kenneth Casson Leighton 2022-07-26 08:00:30 BST
(In reply to Jacob Lifshay from comment #44)

> Why would we close it now...isn't there still around a month before we have
> to get all RFPs in to nlnet? (giving them a month to meet their deadline)

because there's greater than 2 months' worth of work to get done in 2 months
or lose the money.
Comment 46 Jacob Lifshay 2022-07-27 22:50:42 BST
as paul mentioned in the meeting, we should move the suggested implementation description in the discussion page back into the spec.

https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/atomics/discussion.mdwn;h=050e83458bdb6727d5eb43111de71277ef46a44a;hb=HEAD#l37
Comment 47 Andrey Miroshnikov 2022-07-27 23:05:49 BST
From the meeting:
Jacob Lifshay says:i've been working on benchmarking atomics on power9 and comparing to x86 and armv8.1a to come up with another justification for improving powerisa atomics 
Jacob Lifshay says:
https://git.libre-soc.org/?p=benchmarks.git;a=summary
Jacob Lifshay says:i'm borrowing my proposed implementation from risc-v where AMOs can be sent to L1/2/3 caches adaptively, wherever the cache is shared between the different cpus that are accessing that atomic 
Dmitry says:Speaking of C, I've always found a bit of confusion between barriers and atomicity per se. That's probably C tends to speak of architecture-agnostic stuff and instead talks of abstract state machine. 
Jacob Lifshay says:you do conflict detection on physical addresses, you can use conflict detection on effective addresses as a good way to predict physical address conflicts 
Jacob Lifshay says:all loads/stores in in a core...*not* across all cores 
Jacob Lifshay says:the other cores have to rely on cache coherencey protocols 
Jacob Lifshay says:yes, but the fetch_add we need are the ones that can be done in l1/l2/l3/memory, wherever is fastest 
Dmitry says:Basically all but cmpxchg (or lol/sc) are for performance reasons, aren't they? 
Dmitry says:Because you can implement anything in terms of cmpxchg. 
Jacob Lifshay says:cmpxchg or ll/sc is slow tho.... 
Konstantinos says:might be a stupid question, but would it be possible to have 2 implementations for the same atomics? ie keep the existing L2 ones for scalable systems and replace them with lighter implementations for CPUs that target embedded/desktop/workstations/etc, non-multicore systems at any rate 
Jacob Lifshay says:we're keeping the existing ll/sc atomics for back compat and because they're fully general...fetch_add instructions are useful cuz they can run faster. 
Dmitry says:This would be strange to have different implementations across the same arch. 
Konstantinos says:I *did* say it might be a stupid question 😃 
Jacob Lifshay says:you can't just use the existing instructions since you'd have to combine 5-7 instructions into one fetch_add microop -- terrible 
Dmitry says:ll/sc is notoriously difficult to use, and has no direct counterpart in memory model 
Dmitry says:IIRC even ARM got its cmpxchg recently 
Konstantinos says:most of the time it's easier to change the hardware than the software 
Dmitry says:Yeah I agree. 
Jacob Lifshay says:uuh, ll/sc works just fine -- when you need the full generality. it's slower otherwise 
Dmitry says:Granted that software is ready for hardware changes 
Dmitry says:😃 
Dmitry says:You can have the generality with cmpxchg. This basically boils down to loop. 
Jacob Lifshay says:all of cmpxchg/ll/sc are often slower than a dedicated fetch_add instruction...no loop needed 
Dmitry says:And you cannot have it with ll/sc, because it doesn't map well to higher-level languages. 
Dmitry says:Yes that's why I started with "for performance reasons". 
Jacob Lifshay says:high-level languages nap to a ll/sc loop currently 
Jacob Lifshay says:map* 
Dmitry says:I guess some cycles can be saved with relaxed semantics 
Dmitry says:I mean C memory model 
Jacob Lifshay says:yes, but you often need acquire/release/sequentially-consistency 
Dmitry says:That is, cmpxchg made its way to higher level 
Dmitry says:Yeah 
Dmitry says:Ok we basically flooded the chat with atomics 😃 
Jacob Lifshay says:sve2 is only scalable from the hw end, sw sees a fixed impl. dependent size 
Jacob Lifshay says:suggested atomic implementation: 
https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/atomics/discussion.mdwn;h=050e83458bdb6727d5eb43111de71277ef46a44a;hb=HEAD
Jacob Lifshay says:paul: you wanted a description of a proposed atomics implementation: 
https://bugs.libre-soc.org/show_bug.cgi?id=236#c46
Jacob Lifshay says:iirc aws graviton3 has sve2 support 
Jacob Lifshay says:amazon, not alibaba 
Jacob Lifshay says:for graviton3
Comment 48 Luke Kenneth Casson Leighton 2022-07-31 17:46:34 BST
Edit by programmerjake: Very preliminary benchmarks -- see comment #51 for details:

https://ftp.libre-soc.org/talos-benchmarks.json.gz
Comment 49 Luke Kenneth Casson Leighton 2022-07-31 17:55:35 BST
notes from the conversation on wednesday (not recorded)

Paul mentioned again that there is one person in IBM responsible for
the IBM POWER Architecture Fabric.  that is their sole job for probably
the past 20 years.  Atomic Memory operations are part of the Memory
Controller for which they are directly responsible, and their responsibility
includes ensuring Atomic Consistency when the number of Nodes in a system
exceeds several hundred thousand cores.

this is far beyond anything envisaged for "mere" commercial mass-volume
SoCs and anything that is and has been proposed and discussed at
considerable length including internally in IBM can and has been
rejected on the basis of it damaging the ability to ensure Atomic
Consistency and performance for massive Distributed Computing

therefore, having discovered this, and the proposal being far out of the
realm of possibility for completion, it is being REDEFINED in terms of
the exploration achieved so far and is being CLOSED; a new proposal
is to be submitted later with a much greater budget.

jacob if you can briefly summarise the discoveries from the atomics
benchmarks specifically focussing on Power ISA that would help.
Comment 50 Jacob Lifshay 2022-08-02 06:04:31 BST
(In reply to Luke Kenneth Casson Leighton from comment #49)
> jacob if you can briefly summarise the discoveries from the atomics
> benchmarks specifically focussing on Power ISA that would help.

basically, the benchmarks require more work. the only thing of note I've discovered so far is that x86 atomics (Ryzen 3900X) are a lot slower than relaxed/acquire/release atomics on POWER9 (but not much slower than POWER9's seq-cst atomics) -- which is entirely expected, since x86 atomics are almost all seq-cst due to needing the LOCK prefix. AArch64 (which is what is more useful to compare to) has not been tested yet; I just tested x86 because that's what my desktop is and because x86 is widely available, so it's nice to have.
Comment 51 Jacob Lifshay 2022-08-02 06:12:21 BST
(In reply to Luke Kenneth Casson Leighton from comment #48)
> https://ftp.libre-soc.org/talos-benchmarks.json.gz

Note that those are very preliminary -- they test atomic ops that always conflict -- because that's what I ran right after adding json output support, not because that's necessarily the benchmarks we care the most about.