Windows gained a WaitForMultipleObjects, which Linux 5.16 (late 2021) aped with the new futex2 work. https://www.phoronix.com/news/Linux-5.16-sys_futex_waitv
There's been a nice stream of improvements to futex2 since.
- NUMA support (finally landing!): https://www.phoronix.com/news/FUTEX2-NUMA-Small-Futex https://www.phoronix.com/news/FUTEX2-Improvements-Linux-6.16 (see also this fantastic recent submission on NUMA in general, absolutely critical performance stuff: https://news.ycombinator.com/item?id=44936575)
- io_uring support in 6.7 (2024), with a nice write-up on it speeding up PostgreSQL AIO: https://www.phoronix.com/news/IO_uring-FUTEX-Linux-6.7
- Small requeue and single-wait additions in 6.7: https://www.phoronix.com/news/Linux-6.7-Locking-FUTEX2
Windows did not gain a WaitForMultipleObjects, it had it since the first Windows NT, more than 30 years ago.
While WaitForMultipleObjects was an advantage of Windows NT over UNIX, it was nothing new. IBM PL/I had an equivalent function already in 1965, almost 30 years before Windows NT.
The "wait" function of IBM PL/I was actually the model for the UNIX "wait", but the UNIX function was extremely simplified and much weaker than its model, like it was also the case with the many features inherited by UNIX from Multics. Unfortunately, many decades had to pass until the descendants of UNIX began to gain features comparable in sophistication with those of the ancestors of UNIX.
However the Microsoft WaitForSingleObject and WaitForMultipleObjects did not have an efficient implementation, which is why they had to add WaitOnAddress, the equivalent of Linux futex.
It is true, however, that the Linux futex had, and still has, some annoying limitations, like the futex value being only 32 bits instead of 64, and the fact that it is possible to wait on only a single event. Using atomic bit operations on the futex value it is actually possible to wait on multiple events, though not in the most efficient way. However, here is where the 32-bit size of the futex value becomes annoying.
Therefore the work that attempts to combine the advantages of "futex" with some of the advantages of WaitForMultipleObjects is very welcome.
However, this does not ape Windows; it just reimplements techniques that are much older than the Microsoft company and were well known more than half a century ago.
>However the Microsoft WaitForSingleObject and WaitForMultipleObjects did not have an efficient implementation, which is why they had to add WaitOnAddress
WaitForSingle/MultipleObjects wait for kernel objects, similar to poll. WaitOnAddress is lightweight synchronization, equivalent to futex. Windows doesn't have something like WaitForMultipleAddresses. futex_waitv was used by Wine patches because they implement NT kernel objects in userspace, and there were some semantic differences that made it hard to represent them as eventfds.
PS: But using futexes to emulate kernel objects breaks security boundaries of a process. That's why it was never merged into upstream Wine, and NTSYNC was developed.
That's very interesting history, thanks.
I agree: Linux still only supporting 32-bit futexes is a bit baffling. The only reason the width matters is for the race condition check, but that's a huge reason. I'd want the option to wait on values as wide as whatever the underlying hardware supports, at least!
The futex width matters if you implement your own waiting on multiple events, with one bit per event. The events can be signaled with atomic bit operations.
Ideally the value should be two words -- 16 bytes on 64-bit systems.
Emm, what? Why? If you mean two processor words, which I gather from what you are saying, then I think you are already in the space of full memory barriers.
In that case, why not just say, ideally it would be 256K words, or whatever?
Because mainstream modern architectures (practically speaking, x86-64-v2+ and ARMv8+) give you[1] a two-word compare-and-swap or LL/SC.
However using compare-and-swap as the atomic operation for implementing multiple events can be very inefficient, because it introduces waiting loops where the threads can waste a lot of time when there is high contention.
The signaling of multiple events is implemented efficiently with atomic bit set and bit clear operations, which have been supported since the Intel 80386 in 1985, so they are available in all Intel/AMD CPUs. They are also available in 64-bit Arm CPUs starting with Armv8.2-A, i.e. since Cortex-A55 and Cortex-A75 in 2017-2018 (in theory the atomic bit operations were added in Armv8.1-A, but there were no consumer CPUs with that ISA).
With atomic bit operations, each thread signals its event independently of the others and there are no waiting loops. The monitoring thread can determine very fast the event with the highest priority using the LZCNT (count leading zeroes) or equivalent instructions, which is also available on all modern Arm-based or Intel/AMD CPUs.
When a futex is used to implement waiting for multiple events, despite not having proper support in the kernel, the thread that sets its bit to signal an event must also execute a FUTEX_WAKE, so that the monitoring thread will examine the futex value. Because the atomic bit operations are fetch-and-OP operations, like fetch-and-add or atomic exchange, the thread that signals an event can determine whether the previous event signaled by it has been handled or not by the monitoring thread, so it can act accordingly.
So currently on Linux you are limited to waiting for up to 32 events. The number of events can be extended by using a multi-level bitmap, but then the overhead increases significantly. Using a 64-bit futex value would have been much better.
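To make the bitmap scheme concrete, here is a minimal sketch of what's described above (hypothetical names, GCC/Clang builtins, no error handling; the 32 events are the bits of a single futex word):

```cpp
#include <atomic>
#include <cstdint>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// One bit per event source; bit 31 is the highest priority.
std::atomic<uint32_t> events{0};

static long futex(void* addr, int op, uint32_t val) {
    return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
}

// Signaler: a single fetch-and-OR, never a retry loop. The returned old
// value also tells us whether our previous signal was already consumed.
void signal_event(unsigned bit) {
    uint32_t prev = events.fetch_or(1u << bit, std::memory_order_release);
    if (prev == 0)                              // monitor may be asleep on a zero word
        futex(&events, FUTEX_WAKE_PRIVATE, 1);
}

// Monitor: sleep while the word is zero, then take the highest-priority bit.
unsigned wait_for_event() {
    for (;;) {
        uint32_t v = events.load(std::memory_order_acquire);
        if (v != 0) {
            unsigned bit = 31 - __builtin_clz(v);        // CLZ/LZCNT equivalent
            events.fetch_and(~(1u << bit), std::memory_order_acq_rel);
            return bit;
        }
        // The kernel rechecks that the word is still 0 before sleeping,
        // which closes the race with a concurrent signal_event().
        futex(&events, FUTEX_WAIT_PRIVATE, 0);
    }
}
```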
In theory compare-and-swap or the equivalent instruction pair load-exclusive/store-conditional are more universal, but in practice they should be avoided whenever high contention is expected. The high performance algorithms for accessing shared resources are all based on using only fetch-and-add, atomic exchange, atomic bit operations and load-acquire/store-release instructions.
This fact forced the Arm company to correct their mistake from the first version of the 64-bit ARM ISA, which had no single-instruction atomic read-modify-write operations (only load-exclusive/store-conditional loops), so they added all such operations in the first revision of the ISA, i.e. Armv8.1-A.
> In theory compare-and-swap or the equivalent instruction pair load-exclusive/store-conditional are more universal, but in practice they should be avoided whenever high contention is expected. The high performance algorithms for accessing shared resources are all based on using only fetch-and-add, atomic exchange, atomic bit operations and load-acquire/store-release instructions.
> This fact has forced ... there were no atomic read-modify-write operations, so they have added all such operations in the first revision of the ISA, i.e. Armv8.1-A.
I'm not sure if you meant for these two paragraphs to be related, but asking to make sure:
- Isn't compare-and-swap (CMPXCHG on x86) also read-modify-write, which in the first quoted paragraph you mention is slow?
- I think I've benchmarked LOCK CMPXCHG vs LOCK OR before, with various configurations of reading/writing threads. I was almost sure it was going to be an optimization, and it ended up being unobservable. IIRC, some StackOverflow posts led me to the notion that LOCK OR still needs to acquire ownership of the target address in memory (RMW). Do you have any more insights? Cases where LOCK OR is better? Or should I have used a different instruction to set a single bit atomically?
In terms of the relative cycle cost for instructions, the answer definitely has changed a lot over time.
As CAS has become more and more important as the world has scaled out, hardware companies have been more willing to favor "performance" in the cost / performance tradeoff. Meaning, it shouldn't surprise you if uncontended CAS is as fast as a fetch-and-or, even though the latter is obviously a much simpler operation logically.
But hardware platforms are a very diverse place.
Generally, if you can design your algorithm with a load-and-store, there's a very good chance you're going to deal with contention much better than w/ CAS. But, if the best you can do is use load-and-store but then have a retry loop if the value isn't right, that probably isn't going to be better.
For instance, I have an in-memory debugging "ring buffer" that keeps an "epoch"; threads logging to the ring buffer fetch-and-add the epoch counter, then mod the returned value by the buffer size to find their slot.
Typically, the best performance will happen when I'm keeping one ring buffer per thread-- not too surprising, as there's no contention (but impact of page faults can potentially slow this down).
If the ring buffer is big enough that there's never a collision (where a slow writer is still writing when the next cycle through the ring buffer happens), then the only contention is around the counter, and everything still tends to be pretty good; but the work the hardware has to do to sync the counter value will 100% slow it down, despite there being no retries. If you don't use a big buffer, you either have to do something different to get a true ring buffer, or you can lock each record and send a fast writer back for a new index if it sees a lock. The contention still has the effect of slowing things down either way.
The worst performance will come with the CAS operation though, because when lots of threads are debugging lots of things, there will be lots of retries.
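A sketch of the shared-buffer variant described above (hypothetical names; a single fetch-and-add claims a slot, and note the lapped-writer race called out earlier):

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

constexpr uint64_t kSlots = 4096;  // power of two, so % compiles to a cheap mask

struct Record { uint64_t epoch; char msg[56]; };

Record ring[kSlots];
std::atomic<uint64_t> epoch{0};

void log_msg(const char* s) {
    // Claim a slot with a single fetch-and-add: no CAS retry loop, but the
    // counter's cache line still bounces between writers under contention.
    uint64_t e = epoch.fetch_add(1, std::memory_order_relaxed);
    Record& r = ring[e % kSlots];
    r.epoch = e;  // a lapped slow writer can still race here (see above)
    strncpy(r.msg, s, sizeof r.msg - 1);
    r.msg[sizeof r.msg - 1] = '\0';
}
```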
One thing to add here, I've enjoyed reasonably extensive support for `atomic_compare_exchange_strong()` and the `_explicit` variant for quite a long time (despite the need for the cache line lock on x86).
But, last I checked (the last release, early last year), musl still does not provide a 128-bit version, which is disappointing; hopefully the AVX-related semantics changes will encourage them to add it? :)
Because there are apps that internally pass and return two-word values by value, so it'd be convenient.
It doesn't matter if the ancient Egyptians were the first ones to await multiple event sources.
What matters is where the designers of the system in question got their inspiration. If they learned it from Windows, they're aping Windows. If they invented it independently, they're not imitating anyone at all.
But still no futex_swap[0][1]
[0] https://lore.kernel.org/all/414e292195d720c780fab2781c749df3... [1] https://blog.linuxplumbersconf.org/2013/ocw/system/presentat...
A very sad state of affairs :( I want fibers that I can introspect in gdb!
Futex has nothing to do with WFMO. Futex is equivalent to keyed events.
The linux equivalent of WFMO is select/poll/epoll.
Perhaps they meant to write WaitOnAddress, which is the same as a basic futex.
yes, sorry, I meant WaitOnAddress. I thought they were also called keyed events (or at least that was the NT name) but I might be wrong.
The keyed event is an older thing, somewhat similar, but it lacks the value comparison.
WaitOnAddress() supports several value sizes: "[in] AddressSize The size of the value, in bytes. This parameter can be 1, 2, 4, or 8."
Thank god there's no Linus at MSFT to pointlessly veto 64-bit futex support.
Perhaps, but it's not clear to me what makes you argue that, and it doesn't match up with what I've read. From the article:
> People often describe the futex as, "Wait on memory address". That overlooks the notification side, but it’s a much more apt name, and why Windows’ name for this API (WaitOnAddress) is superior API naming (to be fair, they did have a decade to think about the name).
The difference between an Address and an Object feels pretty abstract to me. The API surface otherwise feels extremely similar. So I'm not sure there's a ton of ground to stand on for the distinction you are trying to draw. Your assertions could use some argumentation to back them up.
From the Futex2 pull requests in 5.16:
> Add a new sys_futex_waitv() syscall which allows to wait on multiple futexes. The main use case is emulating Windows' WaitForMultipleObjects which allows Wine to improve the performance of Windows Games.
There is a huge difference. The Unix equivalent of an NT kernel object is a file descriptor (or more accurately an open file object). A futex is just a 32-bit word anywhere in memory, hashed to a kernel wait queue on first use. It is much lighter-weight than a kernel file object.
NT handles are a bit more lightweight compared to file descriptors. Their values are not dense, so creating a new one doesn't require searching for the first available number.
io_uring support for futexes is really nice. I used it to implement mutexes and queues for working with Ruby fibers:
https://github.com/digital-fabric/uringmachine/blob/main/ext...
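For flavor, here is roughly what the submission side looks like; a hedged sketch assuming liburing >= 2.5 and a 6.7+ kernel, with the prep-call name and arguments as I recall them from the liburing man pages:

```cpp
#include <liburing.h>
#include <linux/futex.h>
#include <atomic>
#include <cstdint>

std::atomic<uint32_t> word{0};

int main() {
    io_uring ring;
    io_uring_queue_init(8, &ring, 0);

    // Queue an asynchronous futex wait: the completion arrives when the
    // word is woken, without parking a thread inside futex() itself.
    io_uring_sqe* sqe = io_uring_get_sqe(&ring);
    io_uring_prep_futex_wait(sqe, reinterpret_cast<uint32_t*>(&word),
                             /*val=*/0,                    // wait while value == 0
                             /*mask=*/FUTEX_BITSET_MATCH_ANY,
                             /*futex_flags=*/FUTEX2_SIZE_U32,
                             /*flags=*/0);
    io_uring_submit(&ring);

    // ... elsewhere: word.store(1) followed by a futex wake ...

    io_uring_cqe* cqe;
    io_uring_wait_cqe(&ring, &cqe);
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
}
```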
I’ve tried to write something similar for mruby, saw the 90% performance hit compared to just using callbacks and never released it as a mgem.
Let's be clear here: the book sets itself up as a way to gain understanding of multiprocessor programming, in a way that promotes skilled reasoning that is applicable across many subdomains. In many places it points out that you should avoid implementing constructs yourself, and instead use a library/language/system-provided construct. It specifically calls this out for mutexes.
The book is quite clearly about concurrency in general, and not for a specific platform. The author of this article has set up a straw man to facilitate the writing and marketing of an otherwise moderately interesting article on futexes.
Personally I find the approach taken by this article more than a little distasteful - presenting from a point of exaggerated conflict is both tiresome and likely to confuse. This article could easily have been written from the perspective "what TAoMP doesn't tell you" and in that vein be taken a lot more collaboratively.
Of course it doesn't escape me that this blog is new, this article was posted by Phil, and Phil has promoted one of their other articles before.
I wrote the article; it was motivated by reading the book, which by my estimation is not well aimed at either academics or practitioners. That's a problem across a big chunk of academia right now, and I hear it not just from industry, who would like people coming out of college better prepared, but from masters students who realize that they're not learning what they need in order to be good.
So in no way was it meant to be a strawman around a "hey, learn about the futex!" post (as evidenced by the other complaints at the end about things that are lacking). The fact is, I was disappointed enough with the book that I put aside another post I was writing for it.
But as for Phil, we did work together several years ago, and he reads my stuff. I didn't just start writing, and have never had problems finding an audience in the past, Phil or not.
You're right. I had students complain about this all the time at both the undergrad and grad level, that academic material often seems like it's teaching them wrong as a joke.
The correct way to teach material at the collegiate level is to teach it correctly. Useful lies and omissions are for grade school. Where material is simplified for narrative clarity, it should be in a framing that immediately points to the materials for the more fully correct information.
Please think about removing the “2025” footer, as it takes almost half my screen on the phone in landscape. I had to switch to reader mode. I suppose that's ok, but I assume the article is meant to be read…
Yeah, a few other people have mentioned this too; I looked and it is indeed abysmal. I'm not the person responsible for the layout, but I will make sure it gets fixed before posting anything else, even if I have to go tweak it myself.
Yeah, the part about not even calling the previous sysv style a dinosaur, because that would imply it was once mighty, is like standing on the shoulders of giants and pooping on their heads. A little humility is all that is required.
Fair. That was meant as tongue in cheek-- sysv was impressive for its day, and its day lasted a long time.
I appreciate that not everyone loves my style of humor, but I know when I read things with similar styles, it keeps me more engaged with the material.
Still, I am not trying to make jokes at the expense of actual people, so I'll take the note and try to avoid, thanks.
I think the coolest part of the futex is that it's a handle-less concept. There's no allocation or deallocation via syscall, just a kernel-based memory watcher that turns out to be incredibly useful as a primitive.
Everything goes cleanly away when there are no more waiters, and the kernel never even sees a mutex where there's no contention.
I would be interested in a technical deep dive of how the kernel manages these in a performant way, however.
EDIT: TIL about futex2 as well: https://docs.kernel.org/userspace-api/futex2.html
Exactly! At the same time, you also don't want the kernel to call into its internal malloc() to allocate the data structures needed to track queues of blocked threads whenever a thread blocks on a contended lock.
To prevent that, many operating systems allocate these 'queue objects' whenever threads are created and will attach a pointer to it from the thread object. Whenever a thread then stumbles upon a contended lock, it will effectively 'donate' this queue object to that lock, meaning that every lock having one or more waiters will have a linked list of 'queue objects' attached to it. When threads are woken up, they will each take one of those objects with them on the way out. But there's no guarantee that they will get their own queue object back; they may get shuffled! So by the time a thread terminates, it will free one of those objects, but that may not necessarily be the one it created.
I think the first operating system to use this method was Solaris. There they called these 'queue objects' turnstiles. The BSDs adopted the same approach, and kept the same name.
https://www.oreilly.com/library/view/solaristm-internals-cor...
https://www.bsdcan.org/2012/schedule/attachments/195_locking...
This is a really dumb question from someone unfamiliar with the kernel's futex implementation, so bear with me. In userspace locks (e.g. in parking_lot), wait queues can be threaded through nodes allocated on the stack of each waiting thread, so no static or dynamic allocation of wait queue nodes is necessary. Is it possible to allocate wait queue nodes on the kernel stacks of waiting threads as well?
> Is it possible to allocate wait queue nodes on the kernel stacks of waiting threads as well?
Yes, this is exactly what's done. The queue node is declared as a local variable in kernel code, i.e. on the kernel stack, and its address is passed to the wait function, which links it into the waitqueue and then blocks in the scheduler.
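A userspace analogue of the pattern, as a sketch with hypothetical names (the kernel version links the node into the waitqueue and blocks in the scheduler rather than on a per-node condvar, but the stack-resident node is the same idea):

```cpp
#include <condition_variable>
#include <mutex>

// Each waiter's node lives on its own stack frame, so the queue itself
// never allocates. (LIFO here for brevity; real queues are FIFO.)
struct Waiter {
    Waiter* next = nullptr;
    std::mutex m;
    std::condition_variable cv;
    bool woken = false;
};

struct WaitQueue {
    std::mutex qlock;
    Waiter* head = nullptr;

    void wait() {
        Waiter node;                              // the "queue object", on our stack
        {
            std::lock_guard<std::mutex> g(qlock);
            node.next = head;
            head = &node;
        }
        std::unique_lock<std::mutex> g(node.m);
        node.cv.wait(g, [&] { return node.woken; });
    }                                             // node is unlinked before it dies

    void wake_one() {
        Waiter* w = nullptr;
        {
            std::lock_guard<std::mutex> g(qlock);
            w = head;
            if (w) head = w->next;                // unlink first, then wake
        }
        if (w) {
            std::lock_guard<std::mutex> g(w->m);
            w->woken = true;
            w->cv.notify_one();
        }
    }
};
```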
The original Unix in-kernel wait queues were also like that.
> Going back to the original futuex paper in 2002, it was immediately clear that the futex was a huge improvement in highly concurrent environments. Just in that original paper, their tests with 1000 parallel tasks ran 20-120 times faster than sysv locks..
I think this is a misunderstanding.
The baseline isn’t sysv locks. The baseline isn’t even what Linux was doing before futexes (Linux had a very immature lock implementation before futexes).
The baseline is all of the ways folks implement locks if they don’t have futexes, which end up having roughly the same properties as a futex based lock:
- fast path that doesn’t hit kernel for either lock or unlock
- slow path that somehow makes the thread wait until the lock is available using some kernel waiting primitive.
The thing futexes improve is the size of the user level data structure that is used for representing the lock in the waiting state. That’s it.
And futexes aren’t the only way to get there. Alternatives:
- thin locks (what JVMs use)
- ParkingLot (a futex-like primitive that works entirely in userland and doesn’t require that the OS have futexes)
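For reference, the fast-path/slow-path shape described above, as a futex-based sketch along the lines of Drepper's "Futexes Are Tricky" (simplified: no spinning, no error handling, Linux-only):

```cpp
#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Raw futex syscall; glibc provides no wrapper.
static long futex(std::atomic<int>* addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
}

// Lock states: 0 = unlocked, 1 = locked, 2 = locked with possible waiters.
void mutex_lock(std::atomic<int>& m) {
    int c = 0;
    if (m.compare_exchange_strong(c, 1))
        return;                                  // fast path: one CAS, no syscall
    if (c != 2)
        c = m.exchange(2);                       // announce that we'll wait
    while (c != 0) {
        futex(&m, FUTEX_WAIT_PRIVATE, 2);        // sleep only if the word is still 2
        c = m.exchange(2);                       // retry; keep the contended mark
    }
}

void mutex_unlock(std::atomic<int>& m) {
    if (m.exchange(0) == 2)                      // fast path: no waiters, no syscall
        futex(&m, FUTEX_WAKE_PRIVATE, 1);        // otherwise wake exactly one waiter
}
```

The user-level state is a single int, which is the size point being made above.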
If you baseline against what people do when they skip builtin locks, then yes, that's spot on.
Though, I was more coming from the assumption that most people are really learning about the primitives that are going to be there, and they're going to be reaching for in the real world.
So I was thinking more, "what does my language's standard library provide", which has mostly moved from sysv (or worse, sure) -> futex.
It's true though, there have been some recent defections to custom waiting. I haven't looked at any of the well-used parking lot implementations in any depth.
It's a bit of a tangent, but obviously, such constructs can be done completely in userland if you're running your own scheduler. But I assume most of them are not, so if they're not using a futex, I'd assume they're instead writing to a blocking FD and managing their own queues?
If so, how much of a win is that, really? I'm surprised it'd be worth the effort.
> If you baseline against what people do when they skip builtin locks, then yes, that's spot on.
No idea what you mean by "builtin locks".
For example, the best baseline at the time futexes were introduced would have been MS's CRITICAL_SECTION, which was:
- Builtin in the sense that it shipped with the OS and part of that OS's standard library.
- Not a kernel call except on slow paths.
Prior to futexes, LinuxThreads already provided a lock implementation in pthread_mutex that was better than sysv, just worse than CRITICAL_SECTION, and worse than the futex-based one. But that old implementation already achieved the no-kernel-call fast path.
Also worth noting that today, the fastest widely available lock impl is MS's SRWLock, which obviously isn't futex based. So it's not like futexes are even the best
> So I was thinking more, "what does my language's standard library provide", which has mostly moved from sysv (or worse, sure) -> futex.
No, it hasn't. Before and after futexes, folks use pthread_mutex. Before and after futexes, pthread_mutex was faster than sysv. Before and after futexes, pthread_mutex was not based on sysv.
> It's a bit of a tangent, but obviously, such constructs can be done completely in userland if you're running your own scheduler. But I assume most of them are not, so if they're not using a futex, I'd assume they're instead writing to a blocking FD and managing their own queues?
Both ParkingLot and thin locks bottom out in a lock-and-condition-variable-per-thread implementation (or semaphore-per-thread if that's what the OS has). Then the lock and condition variable (or semaphore) may or may not use futexes.
> If so, how much of a win is that, really? I'm surprised it'd be worth the effort.
Portability. If you build a fast lock implementation on futexes, then it won't work on all OSes. You could get it to be portable to different OSes by virtue of the fact that the major OSes these days provide futex-like primitives, but even then, you'd be in a world of hurt because the details are different.
But ParkingLot can be built out of pthread_mutex/pthread_cond so it'll work well on any POSIX
> And futexes aren’t the only way to get there. Alternatives:
> - thin locks (what JVMs use)
> - ParkingLot (a futex-like primitive that works entirely in userland and doesn’t require that the OS have futexes)
Worth noting that somewhere under the hood, any modern lock is going to be using a futex (if supported). futex is the most efficient way to park on Linux, so you even want to be using it on the slow path. Your language's thread.park() primitive is almost certainly using a futex.
ParkingLot just uses pthread mutex and cond.
Sure that uses futex under the hood, but the point is, you use futexes on Linux because that’s just what Linux gives you
> ParkingLot just uses pthread mutex and cond.
That's interesting, I'm more familiar with the Rust parking-lot implementation, which uses futex on Linux [0].
> Sure that uses futex under the hood, but the point is, you use futexes on Linux because that’s just what Linux gives you
It's a little more than that though, using a pthread_mutex or even thread.park() on the slow path is less efficient than using a futex directly. A futex lets you manage the atomic condition yourself, while generic parking utilities encode that state internally. A mutex implementation generally already has a built-in atomic condition with simpler state transitions for each thread in the queue, and so can avoid the additional overhead by making the futex call directly.
[0]: https://github.com/Amanieu/parking_lot/blob/739d370a809878e4...
> It's a little more than that though, using a pthread_mutex or even thread.park() on the slow path is less efficient than using a futex directly.
No, it absolutely isn’t.
The dominant cost of parking is whatever happens in the kernel and at the microarchitectural level when your thread goes to sleep. That cost is so dominant that whether you park with a futex wait or with a condition variables doesn’t matter at all.
(Source: I’ve done that experiment to death as a lock implementer back when I maintained Jikes RVM’s thin locks and then again when I wrote and maintained ParkingLot.)
Thanks for that reference. Do you know if the JVM still uses thin locks, or did they migrate to futexes? I ask because I found a 9 year old reference with a JVM calling futex: https://stackoverflow.com/questions/32262946/java-periodical....
OpenJDK has used thin locks since back when it was called HotSpot.
The fact that the JVM is hanging in futexes doesn’t meant anything for the purpose of this discussion. Futexes are _the_ OS locking primitive on Linux so I would expect modern JVM thin locks to bottom out in a futex syscall. That doesn’t mean that the JVM is “using” futexes in the sense of the original post.
> And in practice, behavior across common implementations [of recursive locks] is not remotely consistent. There’s a good reason why this was left undefined – it’s kind of hard.
This is such a frustrating stance that most standards have, honestly. "Well, obviously we can't expect the OS/language implementers to reliably implement feature X ― let's just leave it to the application programmer to deal with; they are, after all, expected to have great skill sets and could easily work around it." Or rather, "well, we can't force feature X on the people who will actually implement the standard (they are the members of this very committee, after all), but we can trivially force the downstream users to cope with the feature's absence, because seriously, what can those losers do? Switch vendors?"
If you overspecify, you close the door to better implementations. This is why, for example, C++ standard hash tables and regexes are an order of magnitude slower than third party ones.
The standard didn't say "you must implement std::unordered_map as a hash table with chained buckets and extra memory allocations", but the standard specified several guarantees that make it very difficult to implement hash tables with open addressing.
Every constraint that you specify potentially locks out a better implementation.
For recursive rwlocks, there's a lot of ways to implement them. Do you want to lock out high performance implementations that do less error checking, for example?
This is exactly the issue, and unordered_map exemplifies it perfectly.
On paper, unordered_map sounds great. It lists all the admirable properties you would theoretically want in a hashtable. Then in practice when you go to implement it, you realize that you've painted yourself into a garbage fire, as the saying goes.
I suppose this is a failing of the design by committee method, where the committee isn't directly responsible for implementation either before or during standard writing.
OTOH, 99% of use cases don't care about performance and just want a portable implementation.
While it could be useful to have a "fast" variation that offers no guarantees at all, what you would end up with (because people are vain) is too many people using it instead "because perf", even though their actual usage is not performance critical, and having code that breaks whenever the compiler or platform changes.
> OTOH, 99% of use cases don't care about performance
For many other languages, I'd agree with '99%', but in the case of c++, performance is one of the main reasons it's still used, so I doubt the number is nearly that high.
With hindsight, I'd say that std::unordered_map is fine for what it is, but there are a lot of cases where it's really not suitable due to having a dynamic allocation for each element and the resulting lack of cache locality, and because of that many people have to go looking elsewhere for a usable hash map. There are good reasons why we have both std::vector and std::list.
To clarify, I believe the issue is C++ unordered map iterators and when / where they are allowed to go invalid.
Open addressing means that the address of map[thing] could change on insert, which means iterators and pointers can go stale on insert.
C++11 standard for unordered_map guarantees this won't happen. But that forces slower implementations.
And now people rely upon the standard so we can't change it. At best we do fast_unordered_map or unordered_map2 with different guarantees.
There are numerous factors where std::unordered_map makes API design choices that you would not make today, and the invalidation of iterators (or lack thereof) is just one. A particularly frustrating issue is that std::unordered_map promises some things we suspect nobody cares about; but whereas Rust routinely does "crater runs", in which they discover to a good approximation whether their changes break the ecosystem (via hidden ABI dependencies, Hyrum's law, etc.), there is little equivalent for C++, and much scare mongering about claimed large codebases hidden from sight which may use absolutely anything.
The most stupid thing about std::unordered_map is that it was standardized in 2011, so it isn't from 1998 like much of the C++ standard library containers, it's newer and yet apparently nothing was learned.
It was standardized in C++11, but it was standardizing the hash_map that was in the original STL pre-C++98 and was available as an extension in most STL-derived standard libraries.
For me the remaining reason I still reach for unordered_map is if I need reference stability as most faster hash tables don't provide it (and I don't care enough about performance to build reference stability on top of a better hash map).
When you want reference stability, do you need references to stay working even where something moved to a different hash table?
That is, suppose we have two std::unordered_map<Goose> instances labelled A and B, and we put a particular Goose in A; later we get a reference to it, and we might, or might not, move it to B, but it definitely still exists. Do you need that? (As I understand it, this is in fact what you get in std::unordered_map today.)
Or for your purposes is it enough that it works only so long as the goose we're referring to is in A still and if we moved it out of A then it's fine that the reference is invalidated ?
If I need the lifetime of an object to outlive its presence in the map, then I use boost intrusive or simply a map of (smart) pointers.
For me it is usually when I need a multi indexed map where I use a fast map of pointers for the fast access and unordered_map both to keep the object alive and as a secondary access key.
Really I should use boost intrusive or boost multiindex, but these days I value the implementation simplicity in many cases, even if I have to keep the indices in sync myself
Can you use the CPython approach: an open-addressed hashtable mapping keys to indices into a dynamic array holding (pointers to) the actual keys and values? (Reference stability of course precludes implementing deletes by moving the last key/value entry into the vacated slot; instead you must track empty slots for reuse.)
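Roughly the layout the question describes, as a sketch with hypothetical names (and per the reply below, the dense entry array still reallocates on growth, so entries must hold pointers, or the array must be segmented, for true reference stability):

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct CompactMap {
    // Sparse, open-addressed probe table: holds indices, never entries,
    // so rehashing only shuffles small integers.
    std::vector<int32_t> index;  // -1 = empty slot

    // Dense, insertion-ordered storage for the actual keys and values.
    std::vector<std::pair<std::string, int>> entries;

    // Vacated entry slots kept for reuse instead of compacting,
    // so surviving entries never move on erase.
    std::vector<size_t> free_slots;
};
```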
As soon as the hashmap needs to resize the array will be realloc’d, so it won’t be stable unless you add virtual memory tricks or an indirection via some sort of segmented array.
Of course. That's just map<obj*>. The pointer-to-pointers could move but the pointers are now staying put. But now you have a bit of a lifetime worry.
Map<shared_ptr<obj>> with weak_ptr probably is the best general solution. If the object decided to delete itself the weakptr will be set to null.
And if you underspecify, you force the users to invent their own hacks and workarounds, often very poorly.
And no, I don't want "high performance" lock implementations that regularly deadlock the whole process (deadlocked processes are not very performant) unless I wrap every single one of the several hundred uses of them in a test with dubiously-predictable branches, or that, worse, completely break the lock invariants (e.g., the lock is now permanently unlocked, even after a lock() on it succeeds). It really is not that important how fast you can do the wrong thing.
However, recursive locking is a bad idea overall, and the tracking needed to make it work on rwlock contexts isn't worth the overhead it imposes in the common case of non-recursive rwlocks.
It would be better if the spec simply said it was disallowed.
I think part of it is that you shouldn't be using recursive locks, so why bother specifying support for them? IMO.
More on this phenomenon: https://en.wikipedia.org/wiki/Worse_is_better
I hate it, but it's true
Maybe you'd like Anthony Williams's _C++ Concurrency in Action_, which doesn't cover futexes (or how to write your own synchronization primitives in general), but does cover real-world details like memory orderings and SMR for lock-free data structures. If that's still too high-level, then maybe check out Paul McKenney's excellent free monograph "Is Parallel Programming Hard, And, If So, What Can You Do About It?" for a more hardware-focused perspective (which doesn't cover futexes in detail either but does direct you to the canonical reference for implementing futex-based primitives, namely Ulrich Drepper's "Futexes Are Tricky").
I think TAOMPP is fine for what it is (teaching high-level concurrency concepts) and discussing OS-level implementation details would be out of place. The important thing it teaches is how to think about concurrency, not how to write your own synchronization primitives. E.g., the Peterson or bakery locks are useless in the real world (as the book admits), but understanding their proofs of correctness will help you reason about the concurrent algorithms you have to write yourself.
Bakery locks are good for spin locks. They're more cache friendly. Plus you can do reader/writer spin locks. They're going to be strictly FIFO though.
I guess you could tack on a futex wait for the spin wait in user space, but it's going to be really inefficient. You are going to get a lot of spurious wake-ups. Not one of the things futexes are designed for.
Lock-free with hazard pointers or RCU* is still kind of tricky. It's going to be data structure specific and you really have to know what you are doing.
Fun fact: you can make hazard pointers wait-free. Actually wait-free, not the dubious bounded-retry-loop hack.
* Doing copy on write with RCU is fairly straight forward but probably expensive if updates are frequent.
Maybe you mean ticket locks? That’s kind of the practical version of the bakery algorithm. It’s not the same because ticket locks require fetch-and-add (or its emulation over another atomic RMW like CAS), while the bakery algorithm uses only reads and writes (which makes it impractical).
Using a futex to allow ticket locks to block in the kernel does have a thundering herd problem. Fortunately you can do “smart sleeps” by measuring the time elapsed since the last wakeup and the change in the turn counter since then, and using that to estimate the time until the current thread’s turn (we don’t want to block the queue by oversleeping, so sleep for half of that estimate, then redo the estimate again, etc). (The same logic can be applied to spinning during the waiter’s timeslice, to avoid unnecessary cache coherence traffic.)
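For reference, the basic ticket lock looks like this (a sketch with hypothetical names; the "smart sleep" estimation described above would replace the bare spin):

```cpp
#include <atomic>
#include <cstdint>

// FIFO by construction: one fetch-and-add to join the queue, no CAS retry loop.
struct TicketLock {
    std::atomic<uint32_t> next_ticket{0};
    std::atomic<uint32_t> now_serving{0};

    void lock() {
        uint32_t my = next_ticket.fetch_add(1, std::memory_order_relaxed);
        while (now_serving.load(std::memory_order_acquire) != my) {
            // A "smart sleep" would estimate remaining turns from
            // (my - now_serving) here instead of spinning; omitted.
        }
    }

    void unlock() {
        now_serving.fetch_add(1, std::memory_order_release);  // hand off the turn
    }
};
```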
I assume you’re referring to Guy Blelloch’s work on wait-free hazard pointers? That’s very cool although I doubt it’s solving a practical problem. The practical issue with hazard pointers is the StoreLoad barrier on the read side (which can be moved to the GC side via membarrier(2) etc).
https://groups.google.com/g/comp.programming.threads/c/XU6Bt... https://groups.google.com/g/linux.kernel/c/gk6AUkXR9As/m/-1W... Yes, I am aware of the asymmetric memory barriers trick and the atomic memory move trick also.
There's a couple of other ways to achieve wait freedom on hazard pointer loads (protect) but they haven't been published so they're kind of moot.
FWIW I resonate a lot with what the author says about the art of multiprocessor programming. When I learned it in college (from the author himself no less), it felt like a very theoretical introduction to a deeply practical problem domain. I could complete the practice problems well, but that didn’t teach me the intuition for writing multithreaded programs. Later on once I’ve been through the trenches in the industry and came back to the book, everything started to make more sense and I started to see parallels with distributed systems.
Honestly, I think if you really want to understand this topic, getting exposed to atomic memory ordering early on and writing simplified spin locks / SPSC queues is the way to go. You don't even have to implement them correctly - the implementation just needs to teach you what could go wrong. The book operates at an abstraction level that makes it not very useful for inexperienced students.
As the article mentions, Windows introduced a futex-like thing in Windows 8. I know that the original Win32 critical section is based on a kernel-level semaphore. What about the SRW lock introduced in Vista?
Neither CRITICAL_SECTION nor SRWLock enters the kernel when uncontended. (SRWLock is based on keyed events, CRITICAL_SECTION nowadays creates kernel object on-demand but falls back to keyed event on failure)
What _is_ the big difference between CRITICAL_SECTION and a futex, really? I always assumed that futexes were meant to be pretty similar to CRITICAL_SECTION (mostly-userspace locks that didn't have fatal spinning issues on contention).
A critical_section is a mutex, a futex is a general synchronization primitive (a critical_section might be implemented on a more general primitive of course, I'm not a Windows expert).
Critical section was IIRC built on top of Windows manual/auto-reset events, which are a different primitive, useful for more than just mutexes but without the userspace coordination aspect (32-bit value) of futexes.
SRW lock uses the WaitOnAddress primitives nowadays, not keyed events.
Well, technically both WaitOnAddress and SRWLOCK use the same "wait/wake by thread ID" primitive. WaitOnAddress uses a hash table to store the thread ID to wake for an address, whereas SRWLOCK can just store that in the SRWLOCK itself (well, in an object on the waiter's stack, pointed to by the SRWLOCK).
A particularly tricky exploit in the linux futex implementation from 2014, by Pinkie Pie, https://issues.chromium.org/issues/40079619
"The requeue-once rule is enforced by only allowing requeueing to the futex previously passed to futex_wait_requeue_pi as uaddr2, so it's not possible to requeue from A to B, then from B to C - but it is possible to requeue from B to B.
When this happens, if (!q.rt_waiter) passes, so rt_mutex_finish_proxy_lock is never called. (Also, AFAIK, free_pi_state is never called, which is true even without this weird requeue; in the case where futex_requeue calls requeue_pi_wake_futex directly, pi_state will sit around until it gets cleaned up in exit_pi_state_list when the thread exits. This is not a vulnerability.) futex_wait_requeue_pi exits, and various pointers to rt_waiter become dangling. "
This led me down a bit of a rabbit-hole involving linux's limitation of only supporting 32 bit integers for futexes, and why that is, and the implementation of semaphores in glibc.
In one discussion of supporting 64 bit integers for futexes, Linus suggested that you can just use 64 bit atomics in userspace, but only use 32 bits of that 64 bit integer for the futex, as long as the bits that change on a wakeup are in those 32 bits.
However, AFAICT, using mixed-size atomics in C/C++ is undefined behavior, although on most modern hardware it does work. But looking at the implementation of semaphores in glibc, that is exactly what it does. The semaphore uses a 64-bit integer, with the number of waiters in the high 32 bits and the value of the semaphore in the low 32 bits, with userspace atomic operations on the whole 64-bit integer, but using just the low 32 bits for the futex.
So my question is: does gcc specifically consider this defined behavior, or is it ok because the 32-bit access happens in a separate (kernel) process that isn't part of the same compilation, or is it actually undefined behavior in glibc?
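To make the trick concrete, here is a sketch of that layout as I read the description above (little-endian assumed, hypothetical names; it deliberately performs the mixed-size access the question is about, so it inherits the same UB concern):

```cpp
#include <atomic>
#include <cstdint>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// One 64-bit word: semaphore value in the low half, waiter count in the
// high half. Userspace does 64-bit atomics; the kernel sees only the low word.
std::atomic<uint64_t> sem{0};

static uint32_t* low_word() {
    return reinterpret_cast<uint32_t*>(&sem);     // first 4 bytes on little-endian
}

void post() {
    // Bump the value with a full 64-bit atomic add (the waiter count in the
    // high half is untouched), then wake via the 32-bit low word only.
    uint64_t prev = sem.fetch_add(1, std::memory_order_release);
    if (prev >> 32)                               // any waiters recorded?
        syscall(SYS_futex, low_word(), FUTEX_WAKE_PRIVATE, 1,
                nullptr, nullptr, 0);
}
// (A waiter would fetch_add(1ull << 32), then FUTEX_WAIT on low_word().)
```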
> Many people won’t worry about crashed threads, as they often will crash the whole program. However, you can catch the signal a crash generates and keep the overall process from terminating.
That doesn't help if the entire process dies for any reason and you want to clean up the locks. The solution to that is called "robust" locks. You can register a list of held futexes with the kernel using sys_set_robust_list, and when the thread dies the kernel will, for each entry, set a specific bit and wake a waiter if there is one.
> You can register list of held futexes with the kernel using sys_set_robust_list, and when the thread dies kernel for each entry will set a specific bit and wake waiter if there's one.
My biggest worry with that kind of thing is that the lock was guarding something which is now in an inconsistent state.
Without thoroughly understanding how/why the particular thread crashed, there's no guarantee that the data is in any sort of valid or recoverable state. In that case, crashing the whole app is absolutely a better thing to do.
It's really cool that the capabilities exist to do cleanup/recovery after a single thread crashed. But I think (off-the-cuff guess) that 95% of engineers won't know how to properly utilize robust locks with robust data structures, 4% won't have the time to engineer (including documentation) that kind of solution, and the last 1% are really really well-paid (or, should be) and would find better ways to prevent the crash from happening in the first place.
The concern about state consistency applies to all error conditions, not just those that occur while holding a mutex lock.
It doesn’t matter if multiple threads are running or just one - the process could be in the middle of updating a memory-mapped file, or communicating with another process via shared memory, or a thousand other things.
Ensuring consistency is excruciatingly hard. If it truly matters, the best we have is databases.
One option is to use a non-blocking algorithm, which by definition maintains consistency at instruction boundaries. In fact you won't even need robust mutexes this way.
But of course consistency is only maintained if the algorithm is implemented correctly; if you are trying to robustly protect against a process randomly crashing, you might not want to rely on the correctness of that process (or that process randomly writing to shared memory on its way to a segfault).
That also assumes that your data only needs consistency within a machine word (i.e., data types where the CPU can support atomic instructions). If you're just trying to ensure that 64 bits of data are consistent, that's fine, but that usually wasn't so hard anyway.
You need atomic operations at the synchronization points, but you can build arbitrary complex data structures on top of it. For example a lock free tree will stay consistent even if a process dies in the middle of an update (you might need to garbage collect orphaned unpublished nodes).
I think the point isn't to expose this to normal developers, but instead do stuff like enable rust like poisoned states in locks.
If you're using futexes across processes (or any other cross-process state), one generic approach is for a watchdog process to keep a SOCK_STREAM or SOCK_SEQPACKET Unix domain socket open for each process so it can reliably detect when a process crashes and clean up its per-process state.
Maybe the blessed way to do this nowadays is epoll with pidfds?
Yes, good comment on something I glossed over for sure (I tried to stop the mutex discussion at the process boundary, to keep from going forever).
It's not that deep. The futex was developed just to save you from issuing a special system call to ask the OS to put you on a wait queue.
The whole point is that implementing a mutex requires doing things that only the privileged OS kernel can do (e.g. efficiently blocking/unblocking processes). Therefore, for systems like Linux, it made sense to combine the features for a fast implementation.
Also, I should say, in user-land you can efficiently enough save thread state, go off and do something else with that thread, then come back to it, never hitting the kernel while something blocks. That's pretty much async in a nutshell (or green threads).
The point of the article anyway is that it's inexcusable to have a modern concurrency textbook and not cover the futex, since it's at the core of any efficient primitive on modern hardware.
The problem with green threads, historically, was that there was no way to do arbitrary syscalls async; if your syscall blocks it blocks all your other green threads. Doh.
io_uring is supposed to be about solving this, but it's quite the kitchen sink so I have no idea how complete it is on the "arbitrary syscall async" front.
Yes, it's gotten quite large, but I think with far fewer wrong turns in the API compared to the futex. Enough was available async via `epoll()` + having fd interfaces to things that I never was as worried about the arbitrary latency of syscalls, but it's still incredibly cool, especially in the number of calls it avoids outright.
`epoll` doesn’t actually do any IO though, so it doesn’t help with syscall latency. It just avoids the overhead of doing IO via a large number of threads (memory overhead, hard context switches, etc.).
No it doesn't, which is one key reason why I am a fan of `io_uring`. I brought `epoll` up because it does help with the blocking though, for most of the things that matter when it comes to async (at a cost to latency, of course).
You actually issue the `futex` system call to get yourself on the wait queue tied to the memory address. It separates out the waiting from the locking.
And that can absolutely save a bunch of system calls, especially vs. polling mixed with `sleep()` or similar.
> It separates out the waiting from the locking.
It does not, in fact the two are fundamentally inseparable and the state of the memory address must be treated atomically with the waiting state. The magic of futex is that you can use a hardware atomic operation (c.f. lock cmpxchg on x86) to get the lock in the common/uncontended case, but if you have to wait you need to tell the kernel both that you need to wait and the address on which you're waiting, so it can use the same hardware interlocks along with its own state locking to put you to sleep race-free.
It quite does; the kernel is not the keeper of the lock, it only needs to detect the race condition that would result in a spurious sleep. It cares not one bit about the rest of your semantics.
It's true you could use it that way, but it's not the way it's meant to be used, defeating the purpose by requiring a system call even for uncontended locks.
I think you're misunderstanding how futexes work, or else making an essentially irrelevant semantic argument around a definition for "keeper". The kernel is, 100%, absolutely, the "keeper" of that lock data for the duration of the system call. It knows that (hardware) address and matches it to any other such syscall from any other process on the system. And that requires tracking and intelligence and interaction with the VM system and arch layer up and down the stack.
It just doesn't "allocate" it on its own and lets the process use its own mapped memory. But to pretend that it doesn't have to do any work or that the memory is "separated" betrays some misunderstandings about what is actually happening here.
The kernel is responsible for maintaining the wait queues, and making sure that there is no race condition on the state that should preclude queueing.
It does not care how you use the queue, at all. It doesn't have to be done with a locking primitive, whatsoever. You absolutely can use the exact same mechanism to implement a thread pool with a set of dormant threads, for instance.
The state check in the basic futex is only done to avoid a race condition. None of the logic of preventing threads from entering critical sections is in the purview of the kernel, either. That's all application-level.
And most importantly, no real lock uses a futex for the locking parts. As mentioned in the article, typically a mutex will directly try to acquire the lock with an atomic operation, like an atomic fetch-and-or, fetch-and-add, or even compare-and-swap.
A single atomic op, even if you go for full sequential consistency (which comes w/ full pipeline stalls), is still a lot better than a trip into the kernel when you can avoid it.
Once again, I'm not saying you couldn't use the futex state check to decide what's locked and what's not. I'm saying nobody should, and it was never the intent.
The intent from the beginning was to separate out the locking from the waiting, and I think that's pretty clear in the original futex paper (linked to in my article).
I like to think of a futex as the simplest possible condition variable, where the predicate is just the state of the memory word (note that a mutex guarding the predicate is unnecessary since the word can be read and written atomically). It turns out that this is simple enough to implement efficiently in the kernel, yet expressive enough to implement pretty much any userspace synchronization primitive over it.
You are of course completely right. In fact, sometimes I wish the kernel would do slightly more with the memory location, like optionally reserving a bit to show the empty/non-empty state of the queue: the kernel should be able to keep it up to date cheaply as part of the wait/wake operations, while it is more complicated to do from userspace.
This sort of thing can be implemented in the kernel without special hardware operations by adding the thread to the wake list, then suspending the thread, then checking the value. If the value has changed, undo that and return, else just return leaving the thread suspended. Special hardware multi-atomic instructions are not required, but the details are very dependent on how kernel's thread switching is designed.
Why is this gray!? This is absolutely correct. Futex was added as an ad hoc solution to the obvious needs of SMP processes communicating via atomic memory operations who still wanted blocking IPC. And it had to be updated and reworked repeatedly as it moved out of the original application (locks and semaphores) into stuff like condition variables and priority inheritance where it didn't work nearly as well.
In point of fact futex is really not a particularly simple syscall and has a lot of traps, see the man page. But the core idea is indeed "not that deep".
As the article says, the futex system call is overly complicated. But one shouldn't downplay its importance. Every major OS has had a slimmed down equivalent for about a decade, and the futex is at the core of any good modern lock.
Many things are obvious in hindsight, but there was plenty of time before for other people to do the same thing; it's not like we didn't know sysv semaphores didn't scale well.
"ad hoc" feels like an opinion here. My opinion is that when separation of concerns leads to big gains like the futex did, that's elegant, and an achievement. No need to diminish the good work done!
If this is ad hoc solution, what's the "right" approach?
Futex is a fine solution for locks and semaphores (FUTEX_WAIT/WAKE operations). It's been extended repeatedly to handle the needs of condition variables, priority inheritance, timeouts, interop with file descriptors and async/io_uring, etc... with the result that a lot of the API exists to support newer operations with oddball semantics and not a few genuine mistakes and traps (often undocumented). See the glibc condition variable code for how complicated this can get.
Also, while googling for some examples for you I was reminded of this LWN article from a few years back that details some of the issues: https://lwn.net/Articles/823513/
Just because the Linux futex call is currently a Swiss Army knife with some parts that add no value (which I do say in the article) doesn't mean that it's not valuable, or important.
The fact that Linux has extended it in so many ways is, in fact, a testament to it to how impactful the futex concept has been to the world of concurrency.
The fact that it's also at the core of other OS primitives does as well. On the macOS side at least, those primitives have much simpler APIs. For instance, here's the main wait function:
`extern int os_sync_wait_on_address(void * addr, uint64_t value, size_t size, os_sync_wait_on_address_flags_t flags);`
There's also one with a timeout.
The wake side is equally simple, with two calls, one to wake one thread, one to wake all threads. No other number matters, so it's a great simplification in my view.
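A minimal wait/wake pair might look like the following; hedged sketch, with the flag constants and wake function names as I recall them from Apple's os_sync_wait_on_address.h (macOS 14.4+):

```cpp
#include <os/os_sync_wait_on_address.h>
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> flag{0};

// Waiter: sleeps only while the word still holds the value it read.
void wait_for_flag() {
    while (flag.load(std::memory_order_acquire) == 0)
        os_sync_wait_on_address(&flag, /*value=*/0, sizeof(uint64_t),
                                OS_SYNC_WAIT_ON_ADDRESS_NONE);
}

// Waker: publish the change, then wake one waiter.
void set_flag() {
    flag.store(1, std::memory_order_release);
    os_sync_wake_by_address_any(&flag, sizeof(uint64_t),
                                OS_SYNC_WAKE_BY_ADDRESS_NONE);
}
```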
Your fundamental point is that the futex is actually a pretty unimportant construct. Clearly I don't agree, and it's okay not to agree, but I really am struggling to see your counter-argument.
If futexes aren't important to good locks, then, if modern OSes all felt compelled to jettison the futex for some reason, you'd have pthread implementations do ... what exactly??
They are so good that they keep being reinvented even in userspace, for example WTF::ParkingLot.
Or C++ also adding std::atomic::wait which is basically a thin [1] wrapper over futex.
[1] implementations still manage to mess it up.
The WTF::ParkingLot example is interesting because it shows that you don't actually need futexes in the kernel to implement them efficiently; you just need something like a spinlock (with sane backoff!) to guard the userspace wait queue and a per-thread eventfd to wake up waiters.
Yes, you can do a good futex impression in userspace and add any missing functionality you need. Most importantly for webkit, I think, you get portability.
The advantage of futex provided by the kernel is ABI stability and cross process support.
It gives you an implementation. The implementation does not have to be ideal, though.
This is mostly the fault of pthreads, where baroque POSIX semantics had to be shoehorned into the kernel for efficiency.
Wait/Wake are enough for the vast majority of the use cases. The rest is a mix of niche cases (PI, robust mutexes) and failed experiments.
Mara Bos' Rust Atomics and Locks has a section on futexes [0].
So what's recommended as a better alternative to The Art of Multiprocessor Programming?
The Art of Multiprocessor Programming. It does talk about reentrant locks and other things this review says it doesn't. The more interesting parts of it though are the back half, after going through lock implementations and such it actually starts solving problems using both lock-based and lock-free designs.
Follow it up with something appropriate to the language you're using, like C++ Concurrency in Action for C++ (much of it transfers to other languages).
Although a bit outdated, Java Concurrency in Practice is also a very good one, written by Brian Goetz, before he became one of the leading architects on Java.
I will also say, it's likely this is the best textbook on concurrency. There are some first principles where its explanations are engaging and clear.
It's just that the book ignores the very real changes that are at the core of modern concurrency (including async), and I don't believe that is a positive.
And while it did mention recursion, I looked really hard for it to cover things like shared lock-cleanup on `fork()` or thread crash, and a number of other important (often encountered) real-world concerns, and didn't find anything.
I didn't say it doesn't deal with recursive locks. I said it is so cursory that they don't even acknowledge that their guidance can't really be applied to a RW lock.
Where does everyone browse the C standard library, posix library, and separately glibc?
Is there any really nice listing/walk through these standard libraries/rustdoc like website?
Does everyone just dig through header files and man pages all day?
cppreference [1] also has entries for the C standard library. POSIX/SUS is directly available from the Open Group [2]. glibc is better browsed through local man pages or, for a more organic discussion, the info pages. Searching 'man function-name' also usually works.
[1] https://en.cppreference.com/w/c.html
[2] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition...
Thanks! Yeah, cppreference is what I tend to find on the internet, but I'm quite surprised there isn't something a bit more "modern" that combines all of these.
Searching also revealed some good university guides, the standards, etc. It's just very surprising that people doing day-to-day work with C are fine with that tooling-wise, though of course people have jump-to-definition, so maybe reading header files is "good enough".
I still haven't seen a good comparison between futexes and benaphores. Benaphores I understand; they predate futexes by almost a decade. But what do futexes add to the equation, since hardly anyone talks about benaphores (or is it a case of not-invented-here)?
As others have explained, the Benaphore is a speed-up for an existing OS primitive, while the futex is a new primitive that's often much better suited to the problem you have. Benoit Schillings uses the name "Benaphore" in the article that coined it, but never explicitly claims to have invented it. Either way, Benoit worked for Be Inc., which was building an entire OS including its kernel, so they could have provided the better primitive. But they didn't: BeOS provided the limited semaphore primitive you'd have seen in a typical 1980s or 1990s Unix.
Given that primitive, the Benaphore is a good way to use it. It's as if you've got a 1930s refrigerator and a clever technique to reduce frost build-up: a modern fridge has a smarter controller, so it just defrosts itself automatically, no sweat. The Benaphore is thus redundant today, like that anti-frost technique for your 90-year-old fridge.
A benaphore is a kernel synchronization object behind an atomic counter, so that the uncontended path stays in userspace. But the kernel object always needs to be there, initialized and destroyed at the appropriate times.
A futex doesn't need any kernel initialization, and from the kernel's perspective it doesn't exist at all when there are no waiters.
(See also CRITICAL_SECTION, which originally always had a kernel object created with it, but was later changed to create them on demand, falling back to a keyed event on failure. SRWLock only uses keyed events. A keyed event needs only one object per process, and otherwise consumes kernel resources only if there are waiters.)
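Concretely, the classic benaphore shape looks like this; a sketch with POSIX sem_t standing in for the BeOS semaphore. Note that the kernel object has to exist up front, which is exactly the initialization cost described above:

```cpp
#include <atomic>
#include <cstdint>
#include <semaphore.h>

struct Benaphore {
    std::atomic<int32_t> count{0};
    sem_t sem;                       // kernel-backed; must be created eagerly

    Benaphore()  { sem_init(&sem, /*pshared=*/0, /*value=*/0); }
    ~Benaphore() { sem_destroy(&sem); }

    void lock() {
        // First acquirer sees 0 and never touches the kernel.
        if (count.fetch_add(1, std::memory_order_acquire) > 0)
            sem_wait(&sem);          // contended: block on the semaphore
    }

    void unlock() {
        // If someone incremented behind us, hand them the semaphore.
        if (count.fetch_sub(1, std::memory_order_release) > 1)
            sem_post(&sem);
    }
};
```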
Other than avoiding a syscall on the uncontended path, they are not really similar. A benaphore is just a semaphore with an extra atomic counter in userspace to count waiters.
You can't really use semaphores to implement anything that mutexes or semaphores can't already express, so their overall utility is limited compared to futexes, which you can use to build condvars and other primitives too.
I mean, you can use semaphores to implement condition variables, but it's extremely tricky, to the point where you've got to be Mike Burrows to do it right [1] (in fact, that's how it's done in the nsync library, as a fallback when every faster option is unavailable).
> I mean, you can use semaphores
You absolutely can use semaphores for it, but the benefits of a benaphore would not carry over when you build a condition variable with it. The wait on the CV would need to go through the kernel.
Another great read on futexes is Ulrich Drepper’s paper "Futexes Are Tricky" [1].
[1] https://cis.temple.edu/~giorgio/cis307/readings/futex.pdf
Half of the source code is colored very-light-on-white, which is impossible to read. I'm using Chrome on Android.
Interesting, and thanks; that's not how it's supposed to look for sure. I'll ask someone to look at the whole mobile experience, definitely not my area.
I'm right now working with this topic, so I'm very happy to find it here. The only problem: I like to read with the phone horizontal, and if you do that, the 2025 footer takes 45% of the screen… I miss plain HTML so much!
Yeah, I don't love the mobile experience. I'll try to get that fixed soon, thanks.
Can anyone suggest a good explanation of memory barriers?
If you mean memory barriers in terms of concurrency primitives: a barrier counts down atomically once per participant (e.g. each thread in a group), and then each participant waits until the counter reaches zero before continuing. It's used to synchronize (i.e. put into lockstep) a set of concurrent threads or processes, so that all of them must reach a given point before any of them continues, often as part of a larger phased computation.
If you mean a barrier in terms of a memory "fence": that's an instruction on CPUs whereby any pending memory operations that have been pipelined, and thus not yet committed, are forced to commit and complete before execution continues. It's issued on a single core, but it's used to make sure other cores will see your pending writes (or, conversely, to make sure your own core sees reads from other cores as fresh as possible before the actual read op).
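A sketch of the first meaning, using C++20's std::barrier: three threads each finish phase 1, block at the barrier, then all start phase 2 together:

```cpp
#include <barrier>
#include <thread>
#include <vector>

int main() {
    std::barrier sync_point(3);            // three participants

    auto worker = [&] {
        // ... phase-1 work for this thread ...
        sync_point.arrive_and_wait();      // counts down, blocks until zero
        // ... phase-2 work; nobody gets here until everyone finished phase 1
    };

    std::vector<std::thread> threads;
    for (int i = 0; i < 3; ++i) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
}
```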
Shameless plug: https://lwn.net/Articles/844224/ covers, in my opinion, exactly the parts that are missing from the book that the author criticizes. I focus on how threads synchronize, independent of the actual primitives you use, and then explain how that synchronization is usually realized.
"Memory Barriers: a Hardware View for Software Hackers"
https://www.researchgate.net/publication/228824849_Memory_Ba...
Paul McKenney's "Memory Barriers: a Hardware View for Software Hackers" is excellent, with Preshing's blog (preshing.com) offering more approachable explanations for beginners.
tl;dr:
In a multi-threaded context, memory reads and writes can be reordered by the hardware. It gets more complicated when cores don't share a cache. Imagine core 1 writing to some address at (nearly) the same time that core 2 reads from it. Does core 2 read the old value or the new one? Especially if they don't share the same cache: core 1 might "write" to a given address, but the write only lands in core 1's cache and is then "scheduled" to be written out to main memory. Meanwhile, core 2 later tries to read that address; it's not in its cache, so it pulls from main memory before core 1's cache has been flushed. As far as core 2 is concerned, the write happened after its read, even though physically the write finished in core 1 before core 2's read instruction even started.
A memory barrier tells the hardware to ensure that a write that occurred before also "happens-before" (or after) a given read of the same address. It's often (but not always) implemented as a cache and memory synchronization across cores.
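Here's the tl;dr scenario as code: the release/acquire pair is the "barrier" that turns "written before" into "happens-before" once the flag is observed (C++ atomics chosen just for illustration):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int data = 0;
std::atomic<bool> ready{false};

void writer() {                                    // "core 1"
    data = 42;                                     // plain write
    ready.store(true, std::memory_order_release);  // nothing above may sink below
}

void reader() {                                    // "core 2"
    while (!ready.load(std::memory_order_acquire)) {}  // nothing below may hoist above
    assert(data == 42);  // guaranteed; with relaxed ordering it could see stale data
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join(); t2.join();
}
```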
I found Fedor Pikus's CppCon 2017 presentation [1] to be informative, and Michael Wong's 2015 presentation [0] filled in some of the gaps.
C++, being a generic language targeting many hardware implementations, provides much more fine-grained memory-ordering concepts [2], which matters for hardware with more granularity in its barrier types than most people are used to from x86-derived memory models.
[0]: https://www.youtube.com/watch?v=DS2m7T6NKZQ
[1]: https://www.youtube.com/watch?v=ZQFzMfHIxng
[2]: https://en.cppreference.com/w/cpp/atomic/memory_order.html
Is it just me, or does moving things out of kernel space improve performance in general? Like context switching, mutexes, or an entire TCP stack. I wonder what else can be moved into user space.
Piling on with others; it's avoiding the context switching and copying between user and kernel that improves performance.
If you can stay in user space, or stay in kernel space, that is likely to be better performance than going back and forth.
This is why sendfile is great: rather than storage -> kernel memory -> user memory -> kernel memory -> network, you get storage -> kernel memory -> network. That's two fewer copies and fewer context switches.
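A sketch of that path (Linux; error handling elided, and the connected socket is assumed to be set up elsewhere):

```cpp
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

void send_whole_file(int sock, const char* path) {
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    off_t offset = 0;
    while (offset < st.st_size) {
        // The kernel moves bytes file -> socket directly; no read()/write()
        // pair, no bounce through a userspace buffer.
        ssize_t n = sendfile(sock, fd, &offset, st.st_size - offset);
        if (n <= 0) break;
    }
    close(fd);
}
```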
Most things can be moved into user space.
You're right, with the caveat that it's the moving in and out of the kernel that's the expensive part, so putting things entirely inside the kernel also helps performance... at the expense of security and reliability, of course. So userspace it is!
It depends on the relative cost of what you're doing, of course, but in general, yes, the cost of entering the kernel can be significant if you are making a lot of calls into it, since a lot of work has to happen on every crossing.
There are definitely user-land TCP/IP implementations, thread implementations (Go's being particularly notable) and even file systems (FUSE).
FUSE probably isn't a good example here, because, if I'm not mistaken, you still have to enter kernel space, then cross out again to the filesystem driver in userspace, then probably back into kernel space (block driver). FUSE has many upsides, but I don't think performance is one of them.
Well, you still have to enter the kernel to actually put an OS-level thread to sleep with a futex. The kernel supporting the ability to move stuff to userland doesn't by itself guarantee better performance; the main opportunity from that perspective is minimizing how often you cross the boundary.
You're 100% right that there are plenty of other considerations, often positive for lifting things out, like minimization of ring 0 attack surface.
Yes, it is. In HFT (or any algo trading nowadays) usermode networking has been commonplace since ~2010 or even earlier.
Part of the NIC's address space (control registers and memory rings) is mapped into a process, and the whole driver just runs in your process without going through the kernel. Of course, this bypasses the usual stuff like firewalls, iptables, etc., but who cares when all you need is the lowest latency possible :)
io_uring alleviates a lot of those problems by essentially providing a batched, asynchronous syscall API, including futex calls, as mentioned in other comments.
Futexes can be moved into user space!
While I'm sure it sounds great to Apple fans, this ends up just being a circle in practice. We need a way to fall asleep when contended. Apple calls this a "parking lot", Linux calls it a "futex", and as we've seen, Microsoft has yet another name for it.
Like Microsoft's, Apple's approach is slightly smaller (one byte, not four) and arrived over a decade later.
Well, with a parking-lot abstraction you only need one wait primitive per thread (as a thread_local, for example), so it doesn't need to be a futex. It could be an eventfd, which is interesting for integrating with other event loops.
By putting the wait queues in userspace you can customize behavior with callbacks, as the article shows.