A formal write-up of the C++ atomics protocol is needed, to be proposed to the OpenPOWER Foundation. This will be useful for OpenPOWER as well.

See this thread for the reason we need something other than Power's already-existing atomic instructions: http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-October/003085.html
are these basically already covered by Power ISA load-store with reservations? or the OpenCAPI atomic memory operations? https://opencapi.org/wp-content/uploads/2016/09/OpenCAPI-Overview.10.14.16.pdf
these look to me like they're ok: https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
(In reply to Luke Kenneth Casson Leighton from comment #2)
> are these basically already covered by Power ISA load-store with
> reservations?

yes, if you don't care about efficiency -- you'll just get large/slow functions whenever you use atomics (for one atomic op: iirc >5 instructions, with about 4 of them in a loop). I'll assume efficiency is something we care about.

some atomics on powerpc64le, x86_64, and amdgpu: https://gcc.godbolt.org/z/9a6EKjhjh

note how every atomic on powerpc64le is a giant pile of instructions in a loop -- having the decoder need to macro-op fuse a 4-instruction (or more) loop is absurd imho... x86 has a single instruction for add and for exchange (it has more if you don't need the return value), amdgpu has dedicated instructions for all the operations I tried (clang crashes for 8-bit atomics). riscv (not in that godbolt link) also supports a bunch of operations.

we need short instructions for at least atomic-fetch-add and atomic-exchange since they're quite common in cpu code; for gpu code it would be nice to support the full list of atomic ops supported by vulkan/opencl: https://www.khronos.org/registry/SPIR-V/specs/unified1/SPIRV.html#_atomic_instructions

atomics supported by vulkan/opencl:

* load float/int (already supported by power)
* store float/int (already supported by power)
* exchange float/int
* compare exchange float/int (sufficiently supported by power)
* fetch_increment int (covered by fetch_add int)
* fetch_decrement int (covered by fetch_add int)
* fetch_add int
* fetch_sub int (covered by fetch_add int)
* fetch_min[u] int
* fetch_max[u] int
* fetch_and int
* fetch_or int
* fetch_xor int
* flag_test_and_set (covered by exchange int)
* flag_clear (covered by store int)
* fetch_min float
* fetch_max float
* fetch_add float

int/float fetch_min/max are particularly important for gpu code since they can be used for depth buffer ops. we will want 8/16/32/64-bit int and 16/32/64-bit float support.

we also need 128-bit atomics support: they're relatively uncommon but used in some critical data-structures, and are way faster than having to use a global mutex. power's existing instructions are sufficient for that -- we just need to implement them: lq, stq, lqarx, stqcx.

> or the OpenCAPI atomic memory operations?

we need actual instructions to express what we want, otherwise all the fancy hardware support is useless... that pdf doesn't elaborate at all on which atomics opencapi supports.
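to make the godbolt link concrete: a single C++ fetch_add becomes one locked instruction on x86_64 but a load-reserve/store-conditional retry loop on powerpc64le. a sketch (the asm comments are approximate; exact sequences vary by compiler version and memory ordering):

    #include <atomic>

    std::atomic<long> counter;

    long bump() {
        // one atomic read-modify-write, sequentially consistent
        return counter.fetch_add(1, std::memory_order_seq_cst);
    }

    // x86_64 (gcc), roughly:          powerpc64le (gcc), roughly:
    //   mov  eax, 1                     sync              ; heavyweight fence
    //   lock xadd [counter], rax     1: ldarx  9,0,3      ; load-and-reserve
    //                                   addi   10,9,1
    //                                   stdcx. 10,0,3     ; store-conditional
    //                                   bne-   1b         ; retry if lost
    //                                   isync             ; acquire barrier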
(In reply to Luke Kenneth Casson Leighton from comment #3)
> these look to me like they're ok:
> https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

that's the bare minimum for c++11 to operate correctly... it doesn't cover most of the ops we want to be efficient.

also, c++11 has problems with atomic fences on power: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0668r1.html

imho we should introduce fences where needed that follow c++11 semantics. imho we should also introduce fused atomic/fence ops, where the atomic instruction is also a memory fence, like risc-v and armv8 and x86_64; this would reduce the number of needed instructions for an acq_rel atomic add to 1, rather than the 5-7 (with a 4-instruction loop) we currently have.

imho we should mirror risc-v's aq/rl bits:

    aq rl
     0  0  relaxed
     0  1  release
     1  0  acquire
     1  1  seq_cst
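for clarity, a trivial sketch of how those two bits would map onto C++11-style orderings (names and function are illustrative only, not proposed ISA syntax):

    enum class Ordering { Relaxed, Release, Acquire, SeqCst };

    // decode RISC-V-style aq/rl bits into a C++11-style memory ordering
    constexpr Ordering decode_aq_rl(bool aq, bool rl) {
        if (aq && rl) return Ordering::SeqCst;   // aq=1 rl=1
        if (aq)       return Ordering::Acquire;  // aq=1 rl=0
        if (rl)       return Ordering::Release;  // aq=0 rl=1
        return Ordering::Relaxed;                // aq=0 rl=0
    }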
(In reply to Jacob Lifshay from comment #5)
> also, c++11 has problems with atomic fences on power:
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0668r1.html

problems as in giving-the-wrong-answer kind of problems... not just slow.
(In reply to Jacob Lifshay from comment #5)
> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > these look to me like they're ok:
> > https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
>
> that's the bare minimum for c++11 to operate correctly...it doesn't cover
> most of the ops we want to be efficient.

bleh. ok.

> imho we should also introduce fused atomic/fence ops, where the atomic
> instruction is also a memory fence, like risc-v and armv8 and x86_64, this
> would reduce the number of needed instructions for an acq_rel atomic add to
> 1, rather than the 5-7 with a 4-instruction loop we currently have.

ok, brilliant, can you create a page and design them? keep it simple and straightforward, bear in mind i can help in a limited way as i don't fully grasp these instructions, and they need some serious justification to put to the ISA WG.
(In reply to Luke Kenneth Casson Leighton from comment #7)
> (In reply to Jacob Lifshay from comment #5)
> > imho we should also introduce fused atomic/fence ops, where the atomic
> > instruction is also a memory fence, like risc-v and armv8 and x86_64, this
> > would reduce the number of needed instructions for an acq_rel atomic add to
> > 1, rather than the 5-7 with a 4-instruction loop we currently have.
>
> ok, brilliant, can you create a page and design them? keep it simple
> and straightforward, bear in mind i can help in a limited way as i
> don't fully grasp these instructions, and they need some serious
> justification to put to the ISA WG.

yeah, i can work on it, though it'll happen after the fadd formal stuff.
(In reply to Jacob Lifshay from comment #8)
> yeah, i can work on it, though it'll happen after the fadd formal stuff.

not a problem - please make sure that you cut the fadd/fmul formal stuff *real* short - to leave enough time. absolute basics. i'll put in another grant request to cover the rest of what "needs" to be done.
handy pdf with the formal semantics they proposed as "repaired C11" atomics semantics: https://pure.mpg.de/rest/items/item_2543041/component/file_3332448/content

Kang, Jeehoon, Chung-Kil Hur, Ori Lahav, Viktor Vafeiadis, and Derek Dreyer. "A promising semantics for relaxed-memory concurrency." ACM SIGPLAN Notices 52, no. 1 (2017): 175-189.
After doing some research, it turns out that the Power ISA actually already has a lot of the atomic operations I was going to propose -- they just aren't really implemented in gcc or clang. It is still missing better fences, combined operation/fence instructions, and operations on 8/16-bit values, and there are issues with unnecessary restrictions.

PowerISA v3.1 Book II section 4.5: Atomic Memory Operations. it has only 32-bit and 64-bit atomic operations.

the operations it has that I was going to propose:

* fetch_add
* fetch_xor
* fetch_or
* fetch_and
* fetch_umax
* fetch_smax
* fetch_umin
* fetch_smin
* exchange

as well as a few I wasn't going to propose (they seem less useful to me):

* compare-and-swap-not-equal
* fetch-and-increment-bounded
* fetch-and-increment-equal
* fetch-and-decrement-bounded
* store-twin

The spec also basically says that the atomic memory operations are only intended for when you want to do atomic operations on memory but don't want that memory to be loaded into your L1 cache. imho that restriction is specifically *not* wanted, because there are plenty of cases where atomic operations should happen in your L1 cache. I'd guess that part of why those atomic operations weren't included in gcc or clang as the default implementation of atomic operations (when the appropriate ISA feature is enabled) is because of that restriction.

imho the cpu should be able to (but not required to) predict whether to send an atomic operation to the L2/L3/etc. cache or memory, or to execute it directly in the L1 cache. The prediction could be based on how often a cache block is accessed from different cpus: e.g. a small saturating counter and a last-accessing-cpu field per block, counting how many times the same cpu accessed it in a row, executing in the L1 cache if the count exceeds some limit, otherwise doing the operation in the L2/L3/etc. cache (if the limit wasn't reached, or a different cpu tried to access it). a sketch of this predictor is below.
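a minimal sketch of that predictor, purely illustrative -- the field widths, the threshold, and the routing enum are all made-up assumptions, not a proposed design:

    #include <cstdint>

    // where to execute the atomic memory operation
    enum class AmoRoute { NearCache, L1Cache };  // NearCache = L2/L3/memory-side ALU

    struct AmoPredictor {
        uint8_t last_cpu = 0;        // last cpu to touch this cache block
        uint8_t same_cpu_count = 0;  // small saturating counter (0..3)

        AmoRoute predict(uint8_t cpu) {
            if (cpu == last_cpu) {
                if (same_cpu_count < 3)
                    same_cpu_count++;   // saturate
            } else {
                last_cpu = cpu;
                same_cpu_count = 0;     // different cpu: reset, keep op near-memory
            }
            // one cpu has had the block to itself long enough:
            // pull the block into its L1 and do the op there
            return same_cpu_count >= 2 ? AmoRoute::L1Cache : AmoRoute::NearCache;
        }
    };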
I started writing the spec, currently it just has a motivation section: https://libre-soc.org/openpower/atomics/
jacob the non-L1-cache instructions are intended for pushing directly through OpenCAPI, which has direct corresponding capability. ariane/pulpino *very specifically* implemented amo* as a single ALU directly built-in to the L2 cache.
(In reply to Luke Kenneth Casson Leighton from comment #13)
> jacob the non-L1-cache instructions are intended for pushing directly
> through OpenCAPI, which has direct corresponding capability. ariane/pulpino
> *very specifically* implemented amo* as a single ALU directly built-in to
> the L2 cache.

regardless, the technique I described can be used to optimize AMOs to work well at the L1 cache and at other caches/memory/opencapi -- it would just be modified to handle access from outside the cache where the tracking is implemented. My point is that it can be done, not that it has to use the algorithm i gave. Having fast atomics that may operate in the L1 cache is imho a requirement for a good cpu/gpu.
found this interesting paper on implementing high-performance amo operations, with performance approaching that of normal load/store ops:

Free Atomics: Hardware Atomic Operations without Fences
https://dl.acm.org/doi/pdf/10.1145/3470496.3527385
IBM will correspondingly have to begin a process of creating modifications to OpenCAPI. being aware of how much work a change actually asks of them is a matter of respect: they may not be in the least bit happy to have their multi-billion-dollar high-performance-compute market "damaged" by unthinking, unreasonable and drastic changes to a spec.

when i skim-read the OpenCAPI Specification, not realising at the time that it was pretty much directly linked to the Book II instructions (the FC of stwat RT,RA,FC), i had not looked for a memory width field. i'll see if i can find it.
https://opencapi.org/wp-content/uploads/2016/09/OC-DL-Specification.10.14.16.pdf nope

http://opencapi.org/wp-content/uploads/2017/02/OpenCAPI-TL.WGsec_.V3p1.2017Jan27.pdf yes: tables 2.5 and 2.6, p49.

page 48 covers length:

    The command address specified is naturally aligned based on the operand
    length. The operand length is [specified by pLength for all] operations,
    with the exception of fetch and swap operations where the cmd_flag is
    specified as {x'8'...x'A'} and pLength shall be specified as
    {'110', '111'}. Refer to the specification of pLength on page 28.

paaaage 28......

    pLength (3 bits). Partial length. Specifies the number of data bytes
    specified for a partial write command. The address specified shall be
    naturally aligned based on the pLength specified. The data may be sent
    in a data flit, or an 8- or 32-byte datum field specified for some
    control flits.
    000  1 byte. Reserved when the command is an amo*.
    001  2 bytes. Reserved when the command is an amo*.
    010  4 bytes
    011  8 bytes
    100  16 bytes. Reserved when the command is an amo*.
    101  32 bytes. Reserved when the command is an amo*.
    110  Specifies 4-byte operands when the command is amo_rw and the
         operation is specified as a Fetch and swap. That is, the command
         flag is {x'8'..x'A'}. Otherwise, this field is reserved.
    111  Specifies 8-byte operands when the command is amo_rw and the
         operation is specified as a Fetch and swap. That is, the command
         flag is {x'8'..x'A'}. Otherwise, this field is reserved.

conclusion: oops. it's ok for amo* but not ok for amo_rw; that would require a drastic (multi-million-dollar impact) redesign of OpenCAPI.
table 2.5:

    '0000'  Fetch and Add
    '0001'  Fetch and XOR
    '0010'  Fetch and OR
    '0011'  Fetch and AND
    '0100'  Fetch and maximum unsigned
    '0101'  Fetch and maximum signed
    '0110'  Fetch and minimum unsigned
    '0111'  Fetch and minimum signed
    '1000'  Fetch and swap
    '1001'  Fetch and swap equal
    '1010'  Fetch and swap not equal
    '1011' through '1111'  Reserved
(In reply to Luke Kenneth Casson Leighton from comment #17)
> conclusion: oops. it's ok for amo* but not ok for amo_rw, that would
> require a drastic (multi-million-dollar impact) redesign of OpenCAPI.

well, all that needs to happen is that 8/16-bit atomics have to transfer a cache block to the cache (if not already there) instead of using opencapi atomics... unless it's something like atomic-or, where the cpu can just or zeros into the bytes it doesn't want to write to (see the sketch below). this is exactly what would happen for things like atomic fadd too.
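to illustrate the or-zeros trick: an 8-bit fetch_or can be widened to a naturally-aligned 4-byte AMO by shifting the operand into the right byte lane, since or-ing zero into the other three bytes leaves them unchanged. a sketch only (little-endian lane math; the cast is not strictly valid C++ aliasing, it just models what a 4-byte-wide AMO engine would see):

    #include <atomic>
    #include <cstdint>

    // emulate an 8-bit atomic fetch_or using a 4-byte atomic fetch_or
    uint8_t fetch_or_u8_via_u32(uint8_t *p, uint8_t val) {
        // naturally-aligned 4-byte word containing *p
        auto *word = reinterpret_cast<std::atomic<uint32_t> *>(
            reinterpret_cast<uintptr_t>(p) & ~uintptr_t(3));
        unsigned shift = (reinterpret_cast<uintptr_t>(p) & 3) * 8;  // LE lane
        // zeros in the other byte lanes: or-ing them is a no-op
        uint32_t old = word->fetch_or(uint32_t(val) << shift);
        return uint8_t(old >> shift);
    }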
(In reply to Jacob Lifshay from comment #19)
> well, all that needs to happen is 8/16-bit atomics have to transfer a cache
> block to the cache (if not already there) instead of using opencapi
> atomics...

you're missing the point. there is no "all that needs to happen" here. i had not realised how tightly coupled lwat/stwat are into opencapi (i was half expecting it).

IBM will not think in terms of "instead of using opencapi". they have not designed "just a processor": they have designed "a massive data processing business" where the Power ISA is one tiny component of a much bigger ecosystem.

their immediate thought will not be "these are great for c++ guarantees"; their immediate thought will be "how much is this going to **** up our customers spending billions of dollars on coherent parallel distributed systems for which OpenCAPI is the bedrock".

bottom line: if what is proposed does not have a way to fit into opencapi it is highly likely to be rejected. aq and rl (acquire and release) will need to be additional opcodes in opencapi.
these are the opencapi opcodes:

    AMO read        amo_rd  amo_rd.n  '0011 1000'  '0011 1100'
    AMO read write  amo_rw  amo_rw.n  '0100 0000'  '0100 0100'

they need to be joined by three more (each):

* AMO read aq
* AMO rdwr aq
* AMO read rl
* AMO rdwr rl
* AMO read aq rl
* AMO rdwr aq rl

which means finding out if there are two bits available somewhere in the opencapi opcodes *that have not been used by IBM for private OMI custom extensions*, which leaves us running into a Commercial Confidentiality brick wall.
ok so i now have a handle on things; the rationale you did is excellent. i had no idea IBM had added stwat etc., although having seen amos in opencapi i was half expecting it. the lack of AQ/RL is going to hit hard (IBM's entire design paradigm does not recognise it). opencapi section 1.3:

    Command ordering
    Ordering within a VC is maintained through the TL/TLX, but it is not
    assured after the command has moved to the upper protocol layers (host
    and AFU) as described in Section 3 Virtual channel and data credit pool
    specification on page 77.

which fundamentally conflicts with acquire/release ordering requirements.

updates to the page, summary:

* moved 1st-person narrative to discussion page
* tracked down function tables
* added new draft "AT" Form
* added lat/stat with elwidth, aq and lr
* found a suitable place in EXT031 for the function table

recommending anything not EXACTLY equal to what IBM already has will have a hard time. we don't need the hassle. doesn't matter in the least if we like it, understand it, agree with it or want to eat it with mayonnaise: it's a table that's already defined.
a reasonable justification is needed as to why 8- and 16-bit is to be proposed: neither Power nor RISC-V has them. what data structures and algorithms specifically benefit?
(In reply to Luke Kenneth Casson Leighton from comment #20)
> you're missing the point.

i got your point; imho your point is just partially incorrect.

> there is no "all that needs to happen" here. i had not realised
> how tightly coupled lwat/stwat are into opencapi (i was half expecting
> it)
>
> IBM will not think in terms of "instead of using opencapi".

I was never advocating for "instead of using opencapi", it would just use a different part of opencapi: the part that allows a cpu to obtain exclusive access to a cache block (M/E states in the MESI cache coherence model).

> aq and rl (acquire and release) will need to be additional opcodes in
> opencapi.

no they won't, memory ordering is handled by the cpu deciding when to issue memory operations:

* acquire -- all following memory ops are delayed till after the atomic op is executed.
* release -- the atomic op is delayed till after all preceding memory ops are executed.
* acq_rel -- do both the above.
* seq_cst -- do both the above, as well as synchronize across the whole system... essentially what the `sync` instruction already mostly does.

(the publish/consume sketch below shows what acquire/release actually have to guarantee.) as an example, tilelink has no memory fence operations, and all amo operations have no aq/rl bits in tilelink -- they're all handled by the cpu.
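as a concrete reminder of what those orderings buy you, the classic release/acquire publish pattern in C++ (a sketch; nothing here is Power-specific):

    #include <atomic>

    int payload;                    // plain, non-atomic data
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;
        // release: the preceding write to payload becomes visible
        // before any other thread can see ready == true
        ready.store(true, std::memory_order_release);
    }

    int consumer() {
        // acquire: no following reads may be hoisted above this load
        while (!ready.load(std::memory_order_acquire)) { /* spin */ }
        return payload;  // guaranteed to read 42
    }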
(In reply to Luke Kenneth Casson Leighton from comment #23)
> a reasonable justification is needed as to why 8 and 16 bit is to be
> proposed. neither Power nor RISC-V has them. what data structures and
> algorithms specifically benefit?

off the top of my head, a very commonly-used (4.3M downloads per month) Rust mutex library uses 8-bit atomics for implementing its mutexes: https://lib.rs/crates/parking_lot

i'm sure there are many more. (a rough sketch of why a mutex wants a byte-sized atomic is below.)
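for context on the parking_lot point: its mutex state fits in one byte, with a locked bit and a "someone is parked/waiting" bit, so an 8-bit atomic RMW is exactly what its fast path wants. a loose C++ rendering of that idea (bit names borrowed from parking_lot; the code itself is just a sketch, not its real implementation):

    #include <atomic>
    #include <cstdint>

    constexpr uint8_t LOCKED_BIT = 1 << 0;  // mutex is held
    constexpr uint8_t PARKED_BIT = 1 << 1;  // at least one thread is blocked

    struct TinyMutex {
        std::atomic<uint8_t> state{0};

        bool try_lock() {
            uint8_t expected = 0;
            // 8-bit compare-exchange: the whole fast path is one atomic RMW
            return state.compare_exchange_strong(
                expected, LOCKED_BIT, std::memory_order_acquire);
        }

        void unlock() {
            // fast path: nobody parked, just clear the byte
            uint8_t expected = LOCKED_BIT;
            if (!state.compare_exchange_strong(
                    expected, 0, std::memory_order_release)) {
                // PARKED_BIT was set: a real impl hands off to a slow path here
            }
        }
    };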
https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-July/005071.html https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=27dca7419e3b5f8df2fdb9aae45481aea6754bc7
(In reply to Luke Kenneth Casson Leighton from comment #26)
> https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-July/005071.html
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=27dca7419e3b5f8df2fdb9aae45481aea6754bc7

That use of isync actually has very little to do with atomics; instead it is about making the cpu see instructions that were modified by writing to memory (instruction-cache coherence for self-modifying code).
i assumed that because this is an atomics bugreport you would post something related to atomic operations.

from the discussion with Paul it is clear that we do not know enough and that the Power ISA 3.0 spec is not detailed enough. we also cannot possibly submit an enhancement where we have absolutely no idea whatsoever of what IBM has done and why. what this tells us is that this bugreport is completely inadequately funded to do a full investigation.

can you do some assembler versions of existing atomics and locking, multi-process, to give a clear idea of performance, so that we are properly informed? i aim to close this bugreport within one week and to submit a new grant request to cover it in more depth.
(In reply to Luke Kenneth Casson Leighton from comment #28)
> i assumed that because this is an atomics bugreport that you would
> post something related to atomic operations.

I posted the stuff you linked in comment #26 to the mailing list rather than this bug because it wasn't really related to atomics.

> from the discussion with Paul it is clear that we do not know
> enough and that the Power ISA 3.0 Spec is not detailed enough.
>
> we also cannot possibly submit an enhancement where we have absolutely
> no idea whatsoever of what IBM has done and why.
>
> what this tells us is that this bugreport is completely inadequately
> funded to do a full investigation.
>
> can you do some assembler of existing atomics and locking, both
> multi-process, to give a clear idea of performance, so that we are
> properly informed.

sure.

> i aim to close this bugreport within one week and to submit a new grant
> request to cover it in more depth.

please don't close it, just reassign the funding and reuse this bug as one of the tasks for the grant, since that allows us to not lose context (not technically lost, but less readily accessible). imho we don't need a full eur 50k, so we should include other things in the grant.
(In reply to Jacob Lifshay from comment #29)
> please don't close it, just reassign the funding and reuse this bug as one
> of the tasks for the grant, since that allows us to not lose context (not
> technically lost, but less readily accessible). imho we don't need a full
> eur 50k, so we should include other things in the grant.

ah, you misunderstood: i want this task done and dusted and declared closed so that you and i can get the EUR 2500 associated with it into our bank accounts, promptly.

what i do *not* want is for 10 weeks of work to go by on a task that gets longer and longer and longer and longer, now that we have learned that the Power ISA spec is not sufficient and have learned also the extent of the attention paid by IBM to atomics *which we know nothing about*.

so i am making the decision to declare the scope of this task to be of length no greater than (as of right now) 6 days further work, and *no more*. i will take care of remembering that it needs to be cross-referenced to the new grant. however it shall remain closed at the end of the task because the scope of payments are for this task and the work that is done within it.
(In reply to Luke Kenneth Casson Leighton from comment #30)
> ah you misunderstood: i want this task done and dusted and declared
> closed so that you and i can get the EUR 2500 associated with it into
> our bank accounts, promptly.

ah, then can you create a subtask for "initial research on atomics" or something, so the €2500 can be assigned to that one and it can be closed while leaving this task to track all atomics stuff?
(In reply to Jacob Lifshay from comment #31)
> ah, then can you create a subtask for "initial research on atomics" or
> something so the €2500 can be assigned to that one and it can be closed
> while leaving this task to track all atomics stuff?

no, because the money is allocated to this task and this task only. it is a top-level milestone and part of the MoU, and both of us have done work on the write-up. it *must* be completed, and due to the unforeseen change in circumstances as part of wednesday's meeting with Paul i am *defining* it to be 100% complete once the evaluation and research in comment #28 is performed and written up. keep it simple please.
(In reply to Luke Kenneth Casson Leighton from comment #32)
> it is a top-level milestone and part of the MoU,

ah, didn't notice that.
(In reply to Jacob Lifshay from comment #29)
> (In reply to Luke Kenneth Casson Leighton from comment #28)
> > can you do some assembler of existing atomics and locking, both
> > multi-process, to give a clear idea of performance, so that we are
> > properly informed.
>
> sure.

added initial atomics assembler, along with the script that generates it.
(In reply to Jacob Lifshay from comment #34)
> added initial atomics assembler, along with script that generates it.

which i've now had to remove and deal with yet another force-master push. please *do not* break the hard rule of adding auto-generated output to repositories, *especially* given that it is a massive batch of identical code.

allow me to be clearer in the instructions. we need to demonstrate that the POWER9 recommended c++ spinlocks and atomics are or are not efficient, and to what extent. the program therefore that needs to be created must (a rough skeleton is sketched below):

1) have an option to specify the number of SMT forked processes to run
2) have an option to specify how many lock and unlocks shall be performed per forked process
3) have an option to specify the range of memory addresses to be lock/unlocked ("1" being "all processes lock and unlock the same address")
4) use RECOMMENDED sequences known to be used in c, c++, and the linux kernel, such as these (or others already present in the linux kernel and other common libraries): http://www.rdrop.com/~paulmck/scalability/paper/N2745r.2011.03.04a.html
5) have an option to use the "eh" hints that Paul mentioned are in Power ISA 3.1 (p1077, eh hint)
6) time the operations ("time ./locktest" would do)

there is no need to add this program in markdown form. it is purely experimental in nature, for the purposes of research; it is not for the publication of a specification. it is for the purposes of actually being executed to obtain information, for which a report will (manually) have to be written.

when executed on the TALOS-II workstation with different numbers of processes and different memory ranges, this will tell us whether IBM properly designed the ISA or not. it will not tell us exactly *how* they actually implemented it, but will give at least some black-box hints.

if the locking remains linear for up to all 72 hyperthreads, and it is of the order of a million locks per second per core regardless of the number of memory addresses, then we can reasonably conclude that they did a damn good job. if they do *not* work then we are 100% justified in proposing additional enhancements to the ISA -- but *not* until the *actual* statistics have *actually* been measured and real-world reports obtained. we do not have access to an IBM POWER10 system so IBM POWER9 will have to do.

bottom line is that if we cannot demonstrate good knowledge of IBM's atomics then we have absolutely no business whatsoever in proposing alternatives or enhancements.
please do not use rust for this task. the people reviewing this bugreport and the source code of the research will be c and assembler programmers. it is perfectly fine to have #ifdefs around functions to create different binaries and for the Makefile to generate (and run) them all. it would be equally fine to have a runtime option to select the function / other options (which eh hint).
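a bare-bones sketch of the requested harness, just to pin down the shape (thread count, iteration count, and address-range options; the spinlock uses the generic C++11 compare-exchange sequence, not any tuned POWER9 sequence, and every name here is hypothetical):

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <thread>
    #include <vector>

    // one cache-line-padded lock word per address in the tested range
    struct alignas(128) Lock { std::atomic<unsigned> word{0}; };

    static void worker(Lock *locks, size_t nlocks, long iters) {
        for (long i = 0; i < iters; i++) {
            Lock &l = locks[i % nlocks];       // spread across the address range
            unsigned expected = 0;
            while (!l.word.compare_exchange_weak(expected, 1,
                                                 std::memory_order_acquire))
                expected = 0;                  // spin until acquired
            l.word.store(0, std::memory_order_release);  // unlock
        }
    }

    int main(int argc, char **argv) {
        // ./locktest <threads> <iters-per-thread> <number-of-lock-addresses>
        int nthreads = argc > 1 ? atoi(argv[1]) : 1;
        long iters   = argc > 2 ? atol(argv[2]) : 1000000;
        size_t naddr = argc > 3 ? strtoul(argv[3], nullptr, 10) : 1;

        std::vector<Lock> locks(naddr);
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> threads;
        for (int t = 0; t < nthreads; t++)
            threads.emplace_back(worker, locks.data(), naddr, iters);
        for (auto &t : threads) t.join();
        std::chrono::duration<double> dt =
            std::chrono::steady_clock::now() - start;
        printf("%f sec, %f locks/sec\n", dt.count(),
               nthreads * double(iters) / dt.count());
    }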
(In reply to Luke Kenneth Casson Leighton from comment #35)
> (In reply to Jacob Lifshay from comment #34)
> > added initial atomics assembler, along with script that generates it.
>
> which i've now had to remove and deal with yet another force-master push.

why'd you remove the python script too? it is not autogenerated... I spent a lot of time writing it... it is a separate commit from the commit adding the generated output, so is easy to retain.

> please *do not* break the hard rule of adding auto-generated output to
> repositories.

as explained in the commit message, I added the autogenerated markdown specifically because it is the wiki and there isn't really an easy way to be able to see the results on the website otherwise; you'd have to download and run the script, which is quite inconvenient. Sorry, I didn't think to ask first.

> *especially* given that it is a massive batch of identical code.

it's not actually identical... it's the assembly for all atomic ops in the c11 standard (except consume memory ordering... no one uses that).

> allow me to be clearer in the instructions.
>
> we need to demonstrate that the POWER9 recommended c++ spinlocks and
> atomics are or are not efficient, and to what extent.
>
> the program therefore that needs to be created must:

I'm assuming by "process" you really mean threads.

> 1) have an option to specify the number of SMT forked processes to run
> 2) have an option to specify how many lock and unlocks shall be performed
> per forked process
> 3) have an option to specify the range of memory addresses to be
> lock/unlocked ("1" being "all processes lock and unlock the same address)
> 4) use RECOMMENDED sequences known to be used in c, c++, and the linux
> kernel.

the sequences generated for the standard c11 atomic operations (as in the python script I wrote) are the recommended sequences for those standard c11 operations.

> such as these (or other already present in the linux kernel
> and other common libraries)
> http://www.rdrop.com/~paulmck/scalability/paper/N2745r.2011.03.04a.html
> 5) have an option to use the "eh" hints that Paul mentioned are in
> Power ISA 3.1 p1077 eh hint
> 6) time the operations ("time ./locktest" would do).

no, `time` has poor accuracy for shorter runs... using clock_gettime (or APIs wrapping it) inside the program is better, because we can use realtime clocks and not measure program load/terminate time, as well as loop the measuring process multiple times to do statistical analysis on it, discarding outliers -- this avoids measuring the extra time used by linux to map the program/data into memory or allocate memory. (see the timing sketch below.)

> there is no need to add this program in markdown form.

well, for the purposes of research it would be quite handy to see what assembly is used for each standard atomic operation without having to run the compiler yourself or write the input c program.

> it is not for the publication of a specification.

yup.

> it is for the purposes of actually being executed to obtain
> information for which a report (manually) will have to be written.
>
> when executed on the TALOS-II workstation with different numbers of
> processes and different memory ranges, this will tell us whether
> IBM properly designed the ISA or not. it will not tell us exactly
> *how* they actually implemented them but will give at least some
> black-box hints.
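a small sketch of the in-program timing approach argued for above (std::chrono::steady_clock wraps clock_gettime(CLOCK_MONOTONIC) on linux; the repetition count and median-as-outlier-filter policy are illustrative assumptions):

    #include <algorithm>
    #include <chrono>
    #include <vector>

    // time one run of `f`, repeated `reps` times; report the median, which
    // discards warm-up outliers (page faults, allocation, etc.)
    template <typename F>
    double median_seconds(F f, int reps = 9) {
        std::vector<double> samples;
        for (int i = 0; i < reps; i++) {
            auto t0 = std::chrono::steady_clock::now();
            f();
            std::chrono::duration<double> dt =
                std::chrono::steady_clock::now() - t0;
            samples.push_back(dt.count());
        }
        std::sort(samples.begin(), samples.end());
        return samples[samples.size() / 2];
    }

    // usage: double secs = median_seconds([] { /* benchmark body */ });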
> > if the locking remains linear

it won't, due to hyperthreads on the same core conflicting with each other... expect at least an inflection point at 18 threads since that's where it has to start sharing cores between multiple threads.

> > for up to all 72 hyperthreads and it
> > is of the order of a million locks per second per core regardless
> > of the number of memory addresses then we can reasonably conclude
> > that they did a damn good job.
> >
> > if they do *not* work then we are 100% justified in proposing additional
> > enhancements to the ISA

even if they do work, we still will want improvements to support atomic fadd, fmin/fmax, and a few others.

> > but *not* until the *actual* statistics have *actually* been measured
> > and real-world reports obtained.
> >
> > we do not have access to an IBM POWER10 system so IBM POWER9 will have to do.
> >
> > bottom line is that if we cannot demonstrate good knowledge of IBM's
> > atomics then we have absolutely no business whatsoever in proposing
> > alternatives or enhancements.

Some other things we should test are some common 3d gpu shaders that use atomics, as well as parking_lot's performance (we'll need to use Rust for parking_lot, since it is a Rust library). parking_lot is used by firefox.
(In reply to Jacob Lifshay from comment #37)
> I'm assuming by "process" you really mean threads.

makes no odds; if you consider threads to be easier to check correctness, go for it. watch out for threading in libc6 transparently doing locking without your knowledge on data structures and system calls, though.

> > 6) time the operations ("time ./locktest" would do).
>
> no, it has poor accuracy for shorter times...

i was thinking of the order of tens of seconds per run.

> well, for the purposes of research it would be quite handy to see what
> assembly is used for each standard atomic operation without having to run
> the compiler yourself or write the input c program.

sorry, yes, that's what i meant: find out what the libraries do and explicitly implement it in assembler. otherwise we have no concrete direct evidence to present to the OPF ISA WG.

> > if the locking remains linear
>
> it won't due to hyperthreads on the same core conflicting with each
> other...expect at least an inflection point at 18 threads since that's where
> it has to start sharing cores between multiple threads.

as POWER9 is multi-issue that assumption may not hold [if they did a decent job]. be really interesting to find out.

> even if they do work, we still will want improvements to support atomic
> fadd, fmin/fmax, and a few others.

priority is to demonstrate first that we know what the hell we're talking about.

> atomics, as well as parking_lot's performance (we'll need to use Rust for
> parking_lot, since it is a Rust library).
>
> parking_lot is used by firefox.

we are directly testing and exploring IBM hardware, not the capability of a library that runs *on* IBM hardware.
some benchmarking of the AMO operations added in ARM v8.1a: https://web.archive.org/web/20210619193003/https://cpufun.substack.com/p/atomics-in-aarch64
I've started writing a benchmarking program in c++. I'm compiling it for ppc64le, x86, and aarch64, since just running it on ppc64le doesn't tell you whether the performance is good or kinda terrible -- so I want to compare with aarch64, which has most of the RMW operations I want to add to ppc64le. x86 basically forces all RMW atomics to be sequentially consistent, which involves a system-wide synchronization; aarch64's atomics should be able to be more efficient. aarch64 can be tested by running on AWS's graviton2 instances, though that may not be the best option. I'm waiting on lkcl creating an atomic-benchmarks.git repo; I'll push my work once that's done.
I got the benchmarks to run correctly: I tested it on both my desktop and on the talos server. It currently just has the standard c++11 atomic ops; I'll add more OpenPower-specific benchmarks later. If you don't specify iteration counts and/or thread counts, it automatically picks a good value, using all available cpu cores and aiming for the average elapsed time to be 0.5-1s, by doubling iteration counts until running for that iteration count took >0.5s.

Demo:

    ./build-x86_64/benchmarks --bench atomic_fetch_add_i64_seq_cst --bench atomic_load_i64_relaxed -j4
    Running: atomic_fetch_add_i64_seq_cst
    Thread #0 took 0.936348 sec for 33554432 iterations -- 27.9053 ns/iter.
    Thread #1 took 0.909495 sec for 33554432 iterations -- 27.1051 ns/iter.
    Thread #2 took 0.91187 sec for 33554432 iterations -- 27.1758 ns/iter.
    Thread #3 took 0.936216 sec for 33554432 iterations -- 27.9014 ns/iter.
    Average elapsed time: 0.923482 sec for 33554432 iterations -- 27.5219 ns/iter.
    Running: atomic_load_i64_relaxed
    Thread #0 took 0.681217 sec for 1073741824 iterations -- 0.634432 ns/iter.
    Thread #1 took 0.679786 sec for 1073741824 iterations -- 0.6331 ns/iter.
    Thread #2 took 0.67332 sec for 1073741824 iterations -- 0.627078 ns/iter.
    Thread #3 took 0.675972 sec for 1073741824 iterations -- 0.629548 ns/iter.
    Average elapsed time: 0.677574 sec for 1073741824 iterations -- 0.63104 ns/iter.

    ./build-x86_64/benchmarks --help
    Usage: ./build-x86_64/benchmarks [-h|--help] [-j|--thread-count <value>]
        [-n|--iter-count <value>] [--log2-mem-loc-count <value>]
        [--log2-stride <value>] [-b|--bench <value>]
    Options:
    -h|--help             Display usage and exit.
    -j|--thread-count     Number of threads to run on
    -n|--iter-count       Number of iterations to run per thread
    --log2-mem-loc-count  Log base 2 of the number of memory locations to access
    --log2-stride         Log base 2 of the stride used for accessing memory locations
    -b|--bench            List of benchmarks that should be run

    ./build-x86_64/benchmarks --bench list
    Available Benchmarks:
    atomic_exchange_u8_relaxed    atomic_fetch_add_u8_relaxed   atomic_fetch_sub_u8_relaxed
    atomic_fetch_and_u8_relaxed   atomic_fetch_or_u8_relaxed    atomic_fetch_xor_u8_relaxed
    atomic_exchange_u8_acquire    atomic_fetch_add_u8_acquire   atomic_fetch_sub_u8_acquire
    atomic_fetch_and_u8_acquire   atomic_fetch_or_u8_acquire    atomic_fetch_xor_u8_acquire
    atomic_exchange_u8_release    atomic_fetch_add_u8_release   atomic_fetch_sub_u8_release
    atomic_fetch_and_u8_release   atomic_fetch_or_u8_release    atomic_fetch_xor_u8_release
    atomic_exchange_u8_acq_rel    atomic_fetch_add_u8_acq_rel   atomic_fetch_sub_u8_acq_rel
    atomic_fetch_and_u8_acq_rel   atomic_fetch_or_u8_acq_rel    atomic_fetch_xor_u8_acq_rel
    atomic_exchange_u8_seq_cst    atomic_fetch_add_u8_seq_cst   atomic_fetch_sub_u8_seq_cst
    atomic_fetch_and_u8_seq_cst   atomic_fetch_or_u8_seq_cst    atomic_fetch_xor_u8_seq_cst
    atomic_load_u8_relaxed        atomic_load_u8_acquire        atomic_load_u8_seq_cst
    atomic_store_u8_relaxed       atomic_store_u8_release       atomic_store_u8_seq_cst
    atomic_compare_exchange_weak_u8_relaxed_relaxed
    atomic_compare_exchange_strong_u8_relaxed_relaxed
    atomic_compare_exchange_weak_u8_acquire_relaxed
    atomic_compare_exchange_strong_u8_acquire_relaxed
    atomic_compare_exchange_weak_u8_acquire_acquire
    atomic_compare_exchange_strong_u8_acquire_acquire
    atomic_compare_exchange_weak_u8_release_relaxed
    atomic_compare_exchange_strong_u8_release_relaxed
    atomic_compare_exchange_weak_u8_acq_rel_relaxed
    atomic_compare_exchange_strong_u8_acq_rel_relaxed
    atomic_compare_exchange_weak_u8_acq_rel_acquire
    atomic_compare_exchange_strong_u8_acq_rel_acquire
    atomic_compare_exchange_weak_u8_seq_cst_relaxed
    atomic_compare_exchange_strong_u8_seq_cst_relaxed
    atomic_compare_exchange_weak_u8_seq_cst_acquire
    atomic_compare_exchange_strong_u8_seq_cst_acquire
    atomic_compare_exchange_weak_u8_seq_cst_seq_cst
    atomic_compare_exchange_strong_u8_seq_cst_seq_cst
    <same for u16 u32 u64 i8 i16 i32 i64>
(In reply to Jacob Lifshay from comment #41)
> I got the benchmarks to run correctly:
> I tested it on both my desktop and on the talos server. It currently just
> has the standard c++11 atomic ops, I'll add more OpenPower-specific
> benchmarks later.

Forgot to post the repo link: https://git.libre-soc.org/?p=benchmarks.git;a=tree;h=141d2e40aa82d1aa4268fc1595d5362a239ce309;hb=ecb29549cf226ed129caa7d43ce79e8b2e4d9575
excellllent, ok.

so IBM decided to use "cache barriers", and it needs to be determined whether that is directly equivalent to lr/sc's aq/rl flags. we also need to know if multiple atomic operations can be multi-issue in-flight (i seem to recall POWER9 is 8-way multi-issue?)

also we need to know what the granularity of internal single-locking is. by that i mean that if there are multiple requests to the same {insert thing} then it is 100% guaranteed that, like intel, only one will ever be serviced. i suspect, from reading the Power ISA spec, that the {thing} is a Cache Block. however that needs to be explicitly determined, by deliberately hammering a POWER9 core with requests at different addresses, varying the address differences and seeing if the throughput drops to single-contention (a probe sketch is below). at exactly the same address is no good; we can assume that will definitely cause contention.

the other important fact to know is how the forward-progress guarantee works, i.e. how these "cache barriers" work; i suspect they are similar to IBM's "Transactions". there is probably an internal counter/tag which goes up by one on each lwsync.

other architectures are not exactly of no interest, but please, really: there are only 2-3 days left before this bugreport gets closed, so focus on POWER9.
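a sketch of the stride-probe idea: two threads hammer fetch_add at addresses separated by a varying byte offset; throughput should collapse to the same-address level for every offset smaller than the reservation/lock granule (128 bytes on POWER9 per its user manual, quoted below). illustrative only -- the real benchmarks repo has proper option handling:

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    alignas(4096) static std::atomic<long> buf[4096 / sizeof(std::atomic<long>)];

    // two threads fetch_add `iters` times at byte offsets 0 and `stride`
    static double ns_per_op(size_t stride, long iters) {
        auto t0 = std::chrono::steady_clock::now();
        std::thread t([&] {
            auto &x = buf[stride / sizeof(std::atomic<long>)];
            for (long i = 0; i < iters; i++) x.fetch_add(1);
        });
        for (long i = 0; i < iters; i++) buf[0].fetch_add(1);
        t.join();
        std::chrono::duration<double> dt =
            std::chrono::steady_clock::now() - t0;
        return dt.count() * 1e9 / iters;
    }

    int main() {
        // watch for the step in throughput as the stride crosses the granule
        for (size_t stride = 8; stride <= 1024; stride *= 2)
            printf("stride %4zu bytes: %8.2f ns/op\n",
                   stride, ns_per_op(stride, 10000000));
    }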
(In reply to Luke Kenneth Casson Leighton from comment #43) > excellllent, ok. > > so IBM decided to use "cache barriers" you mean "memory barriers" (aka. memory fences). > which needs to be determined if > that is directly equivalent to lr/sc's aq/rl flags. They are a similar kind of memory fence. PowerISA's memory fences are not 1:1 identical to RISC-V's aq/rl flags -- RISC-V's aq/rl flags are basically C++11's memory orderings adapted to work on a cpu (by changing most non-atomic load/stores to be memory_order_relaxed atomic load/store ops). > > we also need to know if multiple atomic operations can > be multi-issue in-flight (i seem to recall POWER9 is 8-way multi-issue?) > > also we need to know what the granularity of internal single-locking > is, by that i mean that if there are multiple requests to the same > {insert thing} then it is 100% guaranteed that, like intel, only > one will ever be serviced. The spec defines that (the reservation granule) to be >= 16B and <= minimum supported page size. According to: https://wiki.raptorcs.com/w/images/8/89/POWER9_um_OpenPOWER_v20GA_09APR2018_pub.pdf POWER9's reservation granule is 128B. > > i suspect, from reading the Power ISA Spec, that {thing} is a Cache > Block. It's not necessarily...but most reasonable implementations use a cache block. > > however that needs to be explicitly determined by deliberately hammering > a POWER9 core with requests at different addresses, varying the differences > and seeing ifthe throughput drops to single-contention. > > at exactly the same address is no good, we can assume that will definitely > cause contention. > > the other important fact to know is, how does the forward-progress guarantee > work, i.e. how do these "cache barriers" work and i suspect they are similar > to IBM's "Transactions". I'd expect most of them work by just stopping further instructions from executing and restarting the instruction fetch process...afaict that's what the spec says has to happen for sync and lwsync -- lwsync would be cheaper by not requiring as much store buffer flushing i guess. Note that none of the restarting instruction fetch stuff is needed for any of the C++11 memory fences...they only care about load/store/atomic execution order. > there is probably an internal counter/tag which goes > up by one on each lwsync. > > other architectures are not exacctly of no interest but please really there > is > only 2-3 days left before this bugreport gets closed so focus on POWER9 Why would we close it now...isn't there still around a month before we have to get all RFPs in to nlnet? (giving them a month to meet their deadline)
(In reply to Jacob Lifshay from comment #44)
> Why would we close it now...isn't there still around a month before we have
> to get all RFPs in to nlnet? (giving them a month to meet their deadline)

because there's greater than 2 months worth of work to get done in 2 months, or lose the money.
as paul mentioned in the meeting, we should move the suggested implementation description in the discussion page back into the spec. https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/atomics/discussion.mdwn;h=050e83458bdb6727d5eb43111de71277ef46a44a;hb=HEAD#l37
From the meeting:

Jacob Lifshay says: i've been working on benchmarking atomics on power9 and comparing to x86 and armv8.1a to come up with another justification for improving powerisa atomics
Jacob Lifshay says: https://git.libre-soc.org/?p=benchmarks.git;a=summary
Jacob Lifshay says: i'm borrowing my proposed implementation from risc-v where AMOs can be sent to L1/2/3 caches adaptively, wherever the cache is shared between the different cpus that are accessing that atomic
Dmitry says: Speaking of C, I've always found a bit of confusion between barriers and atomicity per se. That's probably because C tends to avoid architecture-specific stuff and instead talks of an abstract state machine.
Jacob Lifshay says: you do conflict detection on physical addresses; you can use conflict detection on effective addresses as a good way to predict physical address conflicts
Jacob Lifshay says: all loads/stores in a core... *not* across all cores
Jacob Lifshay says: the other cores have to rely on cache coherency protocols
Jacob Lifshay says: yes, but the fetch_add we need are the ones that can be done in l1/l2/l3/memory, wherever is fastest
Dmitry says: Basically all but cmpxchg (or ll/sc) are for performance reasons, aren't they?
Dmitry says: Because you can implement anything in terms of cmpxchg.
Jacob Lifshay says: cmpxchg or ll/sc is slow tho....
Konstantinos says: might be a stupid question, but would it be possible to have 2 implementations for the same atomics? ie keep the existing L2 ones for scalable systems and replace them with lighter implementations for CPUs that target embedded/desktop/workstations/etc, non-multicore systems at any rate
Jacob Lifshay says: we're keeping the existing ll/sc atomics for back compat and because they're fully general... fetch_add instructions are useful cuz they can run faster.
Dmitry says: This would be strange to have different implementations across the same arch.
Konstantinos says: I *did* say it might be a stupid question 😃
Jacob Lifshay says: you can't just use the existing instructions since you'd have to combine 5-7 instructions into one fetch_add microop -- terrible
Dmitry says: ll/sc is notoriously difficult to use, and has no direct counterpart in memory model
Dmitry says: IIRC even ARM got its cmpxchg recently
Konstantinos says: most of the time it's easier to change the hardware than the software
Dmitry says: Yeah I agree.
Jacob Lifshay says: uuh, ll/sc works just fine -- when you need the full generality. it's slower otherwise
Dmitry says: Granted that software is ready for hardware changes
Dmitry says: 😃
Dmitry says: You can have the generality with cmpxchg. This basically boils down to a loop.
Jacob Lifshay says: all of cmpxchg/ll/sc are often slower than a dedicated fetch_add instruction... no loop needed
Dmitry says: And you cannot have it with ll/sc, because it doesn't map well to higher-level languages.
Dmitry says: Yes that's why I started with "for performance reasons".
Jacob Lifshay says: high-level languages map to a ll/sc loop currently
Dmitry says: I guess some cycles can be saved with relaxed semantics
Dmitry says: I mean C memory model
Jacob Lifshay says: yes, but you often need acquire/release/sequential-consistency
Dmitry says: That is, cmpxchg made its way to higher level
Dmitry says: Yeah
Dmitry says: Ok we basically flooded the chat with atomics 😃
Jacob Lifshay says: sve2 is only scalable from the hw end, sw sees a fixed implementation-dependent size
Jacob Lifshay says: suggested atomic implementation: https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/atomics/discussion.mdwn;h=050e83458bdb6727d5eb43111de71277ef46a44a;hb=HEAD
Jacob Lifshay says: paul: you wanted a description of a proposed atomics implementation: https://bugs.libre-soc.org/show_bug.cgi?id=236#c46
Jacob Lifshay says: iirc aws graviton3 has sve2 support
Jacob Lifshay says: amazon, not alibaba
Jacob Lifshay says: for graviton3
Edit by programmerjake: Very preliminary benchmarks -- see comment #51 for details: https://ftp.libre-soc.org/talos-benchmarks.json.gz
notes from the conversation on wednesday (not recorded):

Paul mentioned again that there is one person in IBM responsible for the IBM POWER Architecture Fabric; that has been their sole job for probably the past 20 years. Atomic Memory operations are part of the Memory Controller for which they are directly responsible, and their responsibility includes ensuring Atomic Consistency when the number of Nodes in a system exceeds several hundred thousand cores. this is far beyond anything envisaged for "mere" commercial mass-volume SoCs, and anything that is and has been proposed and discussed at considerable length, including internally in IBM, can be and has been rejected on the basis of it damaging the ability to ensure Atomic Consistency and performance for massive Distributed Computing.

therefore, having discovered this, the proposal is far out of the realm of possibility for completion: it is being REDEFINED in terms of the exploration achieved so far, is being CLOSED, and a new proposal is to be submitted later with a much greater budget.

jacob, if you can briefly summarise the discoveries from the atomics benchmarks, specifically focussing on Power ISA, that would help.
(In reply to Luke Kenneth Casson Leighton from comment #49)
> jacob if you can briefly summarise the discoveries from the atomics
> benchmarks specifically focussing on Power ISA that would help.

basically that the benchmarks require more work... the only thing of note that I've discovered so far is that x86 atomics (Ryzen 3900X) are a lot slower than relaxed/acquire/release atomics on POWER9 (but not really slower than seq-cst atomics) -- which is entirely expected, since x86 atomics are almost all seq-cst due to needing the LOCK prefix. AArch64 (which is what is more useful to compare to) has not been tested yet; I just tested x86 because that's what my desktop is, and because x86 is widely available, so is nice to have.
(In reply to Luke Kenneth Casson Leighton from comment #48)
> https://ftp.libre-soc.org/talos-benchmarks.json.gz

Note that those are very preliminary -- they test atomic ops that always conflict -- because that's what I ran right after adding json output support, not because that's necessarily the benchmarks we care the most about.