Here's my favorite practically applicable cache-related fact: even on x86 on recent server CPUs, cache-coherency protocols may be operating at a different granularity than the cache line size. A typical case with new Intel server CPUs is operating at the granularity of 2 consecutive cache lines. Some thread-pool implementations, like Crossbeam in Rust and my ForkUnion in Rust and C++, explicitly document that and align objects to 128 bytes [1]:
/**
* @brief Defines variable alignment to avoid false sharing.
* @see https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size
* @see https://docs.rs/crossbeam-utils/latest/crossbeam_utils/struct.CachePadded.html
*
* The C++ STL way to do it is to use `std::hardware_destructive_interference_size` if available:
*
* @code{.cpp}
* #if defined(__cpp_lib_hardware_interference_size)
* static constexpr std::size_t default_alignment_k = std::hardware_destructive_interference_size;
* #else
* static constexpr std::size_t default_alignment_k = alignof(std::max_align_t);
* #endif
* @endcode
*
* That, however, results in all kinds of ABI warnings with GCC and a suboptimal alignment choice,
* unless you hard-code `--param destructive-interference-size=64` or disable the warning
* with `-Wno-interference-size`.
*/
static constexpr std::size_t default_alignment_k = 128;
As mentioned in the docstring above, using STL's `std::hardware_destructive_interference_size` won't help you. On ARM, this issue becomes even more pronounced, so concurrency-heavy code should ideally be compiled multiple times for different coherence protocols and leverage "dynamic dispatch", similar to how I & others handle SIMD instructions in libraries that need to run on a very diverse set of platforms.

[1] https://github.com/ashvardanian/ForkUnion/blob/46666f6347ece...
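For illustration, here is a minimal sketch of how that constant gets applied (the `padded_counter_t` and `bump` names are made up): each per-thread slot is padded out to its own 128-byte granule, so neighboring slots never share a cache line, prefetch-paired or otherwise.

#include <atomic>
#include <cstddef>

static constexpr std::size_t default_alignment_k = 128;

// One slot per worker thread; `alignas` rounds each slot up to its own
// 128-byte granule, so adjacent workers never false-share.
struct alignas(default_alignment_k) padded_counter_t {
    std::atomic<std::size_t> value{0};
};

static padded_counter_t per_thread_counters[16];

void bump(std::size_t thread_index) {
    per_thread_counters[thread_index].value.fetch_add(1, std::memory_order_relaxed);
}

Note that `sizeof(padded_counter_t)` becomes 128 as well, since an object's size must be a multiple of its alignment.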
> even on x86 on recent server CPUs, cache-coherency protocols may be operating at a different granularity than the cache line size. A typical case with new Intel server CPUs is operating at the granularity of 2 consecutive cache lines
I don’t think it is accurate that Intel CPUs use 2 cache lines / 128 bytes as the coherency protocol granule.
Yes, there can be additional destructive interference effects at that granularity, but that's due to prefetching (of two cachelines with coherency managed independently) rather than coherency operating on one 128-byte granule.
AFAIK 64 bytes is still the correct granule for avoiding false sharing, with two cores modifying two consecutive cachelines having way less destructive interference than two cores modifying one cacheline.
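For what it's worth, here's a rough benchmark sketch to check this yourself; the iteration count and offsets are arbitrary, and pinning the two threads to separate physical cores would give cleaner numbers:

#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <thread>

// Two threads hammer counters `offset_bytes` apart inside one buffer:
// 8 = same cache line (false sharing), 64 = adjacent lines, 128 = separate 128-byte granules.
alignas(256) static std::atomic<long> buffer[32];

static double run(std::size_t offset_bytes) {
    std::atomic<long> *first = &buffer[0];
    std::atomic<long> *second = &buffer[offset_bytes / sizeof(std::atomic<long>)];
    auto work = [](std::atomic<long> *counter) {
        for (long i = 0; i < 50'000'000; ++i)
            counter->fetch_add(1, std::memory_order_relaxed);
    };
    auto start = std::chrono::steady_clock::now();
    std::thread a(work, first), b(work, second);
    a.join();
    b.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    for (std::size_t offset : {8, 64, 128})
        std::printf("offset %3zu bytes: %.2f s\n", offset, run(offset));
    return 0;
}

On most x86 parts the 8-byte case should be dramatically slower than the other two; whatever gap remains between 64 and 128 is the adjacent-line prefetching effect discussed above.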
This makes attempts at cargo-culting __attribute__((aligned(64))) without benchmarking even more hilarious. :-)
It’s not a cargo cult if the actions directly cause cargo to arrive based on well understood mechanics.
Regardless of whether it would be better in some situations to align to 128 bytes, 64 bytes really is the cache line size on all common x86 cpus and it is a good idea to avoid threads modifying the same cacheline.
It indeed isn't, but I've seen my share of systems where nobody checked if cargo arrived. (The code was checked in without any benchmarks done, and after many years, it was found that the macros used were effectively no-ops :-) )
I may be completely out of line here, but isn't the story on ARM very very different? I vaguely recall the whole point of having stuff like weak atomics being that on x86, those don't do anything, but on ARM they are essential for cache coherency and memory ordering? But then again, I may just be conflating memory ordering and coherency.
Well, since this is a thread about how programmers use the wrong words to model how they think a CPU cache works, I think it bears mentioning that you've used "atomics" here to mean something irrelevant. It is not true that x86 atomics do nothing. Atomic instructions or, on x86, their prefix, make a naturally non-atomic operation such as a read-modify-write atomic. The ARM ISA actually lacked such a facility until ARMv8.1.
The instructions to which you refer are not atomics, but rather instructions that influence the ordering of loads and stores. x86 has total store ordering by design. On ARM, the program has to use LDAR/STLR to establish ordering.
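A small C++ sketch of both points, with illustrative names: the fetch_add is the read-modify-write that needs the LOCK prefix on x86, while the release/acquire pair is the load/store ordering that x86 TSO gives away for free and ARM spells out with STLR/LDAR.

#include <atomic>

std::atomic<int> counter{0};
std::atomic<bool> ready{false};
int payload = 0;

void atomic_rmw() {
    // Read-modify-write: `lock xadd` on x86, LDADD on ARMv8.1+ (an LL/SC loop before that).
    counter.fetch_add(1, std::memory_order_relaxed);
}

void publisher() {
    payload = 42;
    // Release store: a plain MOV on x86 (TSO already orders it), STLR on ARM.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // Acquire load: a plain MOV on x86, LDAR on ARM.
    while (!ready.load(std::memory_order_acquire)) { }
    return payload; // guaranteed to observe 42
}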
This post, along with the tutorial links it and the comments contain, provide good insights on the topic of caches, coherence, and related topics. I would like to add a link that I feel is also very good, maybe better:
<https://marabos.nl/atomics/hardware.html>
While the book this chapter is from is about Rust, this chapter is pretty much language-agnostic.
Did this article share more than one myth?
The reason why programmers don’t believe in cache coherency is that they have experienced a closely related phenomenon, memory reordering. This requires you to use a memory fence when accessing a shared value between multiple cores - as if they needed to synchronize.
I'm pretty sure that most cases of x86 reordering issues are a matter of the compiler reordering things, which isn't (afaik) solved with just "volatile". Caveat - haven't dealt with this for at least over a year (multicore sync without using OS primitives in general).
The volatile keyword in Java literally exists to make the compiler insert memory barriers. That only guarantees consistent reads or writes; it doesn't make a write after a read (a read-modify-write) thread-safe, which is what atomics are for.
Also, it's not only compilers that reorder things; most processors nowadays do out-of-order execution. Even if the order emitted by the compiler is perfect in theory, different latencies for different instruction operands may lead the CPU to execute later instructions earlier so as not to stall.
Note that this is only true of the Java/C# volatile keyword. The volatile keyword in C/C++ is solely about direct access in hardware to memory-mapped locations, for such purposes as controlling external devices; it is entirely unrelated to the C11 memory model for concurrency, does not provide the same guarantees, and should never be used for that purpose.
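A short sketch of that contrast in C++ terms (the device address and names here are made up for illustration):

#include <atomic>
#include <cstdint>

// `volatile` in C/C++: every access really happens, at this address, in program order.
// Right tool for memory-mapped I/O (the address below is fictional), wrong tool for threads.
volatile std::uint32_t *device_status = reinterpret_cast<volatile std::uint32_t *>(0xFFFF0000);

// C11/C++11 atomics: what actually gives cross-thread visibility and ordering guarantees.
std::atomic<std::uint32_t> shared_flag{0};

void poll_device() {
    while ((*device_status & 0x1u) == 0) { } // re-reads the hardware register every iteration
}

void publish_to_other_thread() {
    shared_flag.store(1, std::memory_order_release); // pairs with an acquire load elsewhere
}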
x86 has a Total Store Order (TSO) memory model, which effectively means (in a mental model where only one shared memory operation happens at a time and completes before the next) that stores are queued, while loads can execute immediately, even when earlier stores are still sitting in the store buffer.
On a single core a load can be served from the store buffer (queue), but other cores can't see those stores yet, which is where all the inconsistencies come from.
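This is the classic store-buffer litmus test. A minimal C++ sketch (it needs to be re-run in a tight loop, ideally with the threads on different cores, to actually catch the outcome):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread_a() {
    x.store(1, std::memory_order_release);  // sits in the store buffer for a while
    r1 = y.load(std::memory_order_acquire); // can complete before the store is globally visible
}

void thread_b() {
    y.store(1, std::memory_order_release);
    r2 = x.load(std::memory_order_acquire);
}

int main() {
    // Run this a few million times and r1 == r2 == 0 shows up occasionally, even on x86.
    // Making all four operations std::memory_order_seq_cst forces a fence and rules it out.
    std::thread a(thread_a), b(thread_b);
    a.join();
    b.join();
    std::printf("r1 = %d, r2 = %d\n", r1, r2);
    return 0;
}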
In Golang there are sync.Mutex, the sync/atomic package, and channels to create this fence and prevent data races. I prefer sync.Mutex.
Does anyone understand how Go handles the CPU cache?
Yes. Locks will use a memory fence. More advanced programs will need a fence without locking.
Since the CPU is doing cache coherency transparently, perhaps there should be some way for an application to promise it is well-behaved in exchange for access to a lower-level, non-transparent instruction set that manages cache coherency manually from the application level. Or perhaps applications can never be trusted with that level of control over the hardware. The MESI model reminded me of Rust's ownership and borrowing. The pattern also appears in OpenGL vs Vulkan drivers, implicit sync vs explicit sync. Yet another example would be the cache management work involved in squeezing maximum throughput out of CUDA on an enterprise GPU.
There are some knobs that newer processors give for cache control, mostly to partition or reserve cache space to improve security or reduce cache contention between processes.
Actual manual cache management is way too much of an implementation detail for a general-purpose CPU to expose; doing so would deeply tie code to a specific set of processor behavior. Cache sizes and even hierarchies change often between processor generations, and some internal cache behavior has changed within a generation as a result of microcode and/or hardware steppings. Actual cache control would be like MIPS exposing delay slots but so much worse (at least older delay slots really only turn into performance issues, older cache control would easily turn into correctness issues).
Really the only way to make this work is for the final compilation/"specialization" step to occur on the specific device in question, like with a processor using binary translation (e.g. Transmeta, Nvidia Denver) or specialization (e.g. Mill) or a system that effectively enforces runtime compilation (e.g. runtime shader/program compilation in OpenGL and OpenCL).
A coherent cache is transparent to the memory model. So if someone trying to explain the memory model and ordering mentioned the cache as affecting the memory model, it was pretty much a sign they didn't fully understand what they were talking about.
2018, discussed on HN in 2023: https://news.ycombinator.com/item?id=36333034
Myths Programmers Believe about CPU Caches (2018) (rajivprab.com)
176 points by whack on June 14, 2023 | 138 comments

Very interesting. I am hardly an expert, but this gives the impression that if it weren't for that meddling software, we would all live in a synchronous world.
This ignores store buffers and, consequently, memory fencing, which is the basis for the nightmarish std::memory_order, the worst API documentation you will ever meet.
If the committee had any good taste at all, they would have thrown DEC AXP overboard before C++11, which would have cut the majority of the words out of the specification for std::memory_order. It was only the obsolete and quite impossible-to-program Alpha CPU that required the ultimate level of complexity.