What Is Io_uring? (matklad.github.io)
Submitted by todsacerdoti 10 hours ago
  • jclulow 3 hours ago

    This article, and indeed much of the discourse about the facility out in the wild, could really use some expansion on at least these two points:

    > The application submits several syscalls by writing their codes & arguments to a lock-free shared-memory ring buffer.

    That's not all, though, right? Unless you're using the dubious kernel-driven polling thread, which I gather is uncommon, you _also_ have to make a submit system call of some kind after you've finished appending stuff to your submit queue. Otherwise, how would the kernel know you'd done it?

    > The kernel reads the syscalls from this shared memory and executes them at its own pace.

    I think this leaves out most of the interesting information about how it actually works in practice. Some system calls are not especially asynchronous: they perform some amount of work that isn't asynchronously submitted to some hardware in quite the same way as, say, a disk write may be. What thread is actually performing that work? Does it occur immediately during the submit system call, and is completed by the time that call returns? Is some other thread co-opted to perform the work? Whose scheduling quantum is consumed by this?

    Without concrete answers that succinctly describe this behaviour it feels impossible to see beyond the hype and get to the actual trade-offs being made.

    • Tuna-Fish 3 hours ago

      > That's not all, though, right? Unless you're using the dubious kernel-driven polling thread, which I gather is uncommon, you _also_ have to make a submit system call of some kind after you've finished appending stuff to your submit queue. Otherwise, how would the kernel know you'd done it?

      Correct, the "check for new entries" system call is called io_uring_enter(). (0)
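
      To make that concrete, here's a minimal sketch using liburing (assumed installed; error handling trimmed). Nothing written into the ring is visible to the kernel until io_uring_submit() issues that io_uring_enter() call:

        /* cc demo.c -luring */
        #include <liburing.h>
        #include <stdio.h>

        int main(void) {
            struct io_uring ring;
            if (io_uring_queue_init(8, &ring, 0) < 0)
                return 1;

            /* Append one no-op entry to the shared-memory submission
               queue. The kernel has not been told anything yet. */
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_nop(sqe);

            /* This issues io_uring_enter(2): without it (or SQPOLL),
               the kernel never looks at the entry we just wrote. */
            io_uring_submit(&ring);

            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);   /* reap the completion */
            printf("result: %d\n", cqe->res);
            io_uring_cqe_seen(&ring, cqe);

            io_uring_queue_exit(&ring);
            return 0;
        }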

      > Some system calls are not especially asynchronous: they perform some amount of work that isn't asynchronously submitted to some hardware in quite the same way as, say, a disk write may be. What thread is actually performing that work? Does it occur immediately during the submit system call, and is completed by the time that call returns? Is some other thread co-opted to perform the work?

      A kernel thread. The submit system call can optionally be made to wait for completion, but by default it returns immediately.
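
      For reference, the raw syscall covers both behaviours; a sketch of the two liburing wrappers:

        #include <liburing.h>

        /* From io_uring_enter(2):
             int io_uring_enter(unsigned int fd, unsigned int to_submit,
                                unsigned int min_complete,
                                unsigned int flags, sigset_t *sig);
           With min_complete > 0 and IORING_ENTER_GETEVENTS in flags, the
           same call both submits and blocks for completions. */
        void submit_modes(struct io_uring *ring) {
            io_uring_submit(ring);              /* submit, return at once   */
            io_uring_submit_and_wait(ring, 1);  /* submit, wait for >=1 CQE */
        }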

      > Whose scheduling quantum is consumed by this?

      That's a good question. The IO scheduler correctly sees them as belonging to the submitting thread, but if you issue a bunch of computation-heavy syscalls, I would not be surprised if they were not correctly accounted for.

      (0) https://unixism.net/loti/ref-iouring/io_uring_enter.html#c.i...

      • Joker_vD an hour ago

        > Otherwise, how would the kernel know you'd done it?

        Well, you can make a design where a submission queue spans N+1 pages, and to notify the kernel you write something in the last page which is actually write-protected, so it triggers the kernel trap. I believe VirtIO has a similar scheme?
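
        As a toy user-space illustration of that mechanism (here the trap bounces back to us as SIGSEGV, where a real design would handle the fault kernel-side; 4096-byte pages assumed, error handling trimmed):

          #define _GNU_SOURCE
          #include <signal.h>
          #include <sys/mman.h>
          #include <unistd.h>

          static char *doorbell;

          static void on_fault(int sig, siginfo_t *info, void *ctx) {
              (void)sig; (void)ctx;
              if (info->si_addr != (void *)doorbell)
                  _exit(1);                     /* a genuine crash         */
              write(1, "doorbell rang\n", 14);  /* "kernel" noticed        */
              /* Re-enable writes so the faulting store can complete. */
              mprotect(doorbell, 4096, PROT_READ | PROT_WRITE);
          }

          int main(void) {
              struct sigaction sa = { .sa_sigaction = on_fault,
                                      .sa_flags = SA_SIGINFO };
              sigaction(SIGSEGV, &sa, NULL);

              /* The write-protected "last page" of the queue. */
              doorbell = mmap(NULL, 4096, PROT_READ,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              *doorbell = 1;   /* traps; handler runs; store then succeeds */
              return 0;
          }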

        > What thread is actually performing that work?

        None? You don't really need a user-space thread to execute code in the kernel: otherwise, starting process 1 would have to be the very first thing the kernel does when booting, while in reality it's about the last thing it does in the boot process.

        With the multi-core systems we have today, arguably dedicating a whole core exclusively to some core OS functionality could be more performant than having that core "constantly" switch contexts?
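
        For what it's worth, io_uring already ships a flavour of this: IORING_SETUP_SQPOLL asks the kernel for a dedicated polling thread that watches the submission queue, so io_uring_enter() can usually be skipped entirely (this is the "kernel-driven polling thread" mentioned upthread). A sketch, assuming liburing:

          #include <liburing.h>

          int setup_sqpoll(struct io_uring *ring) {
              struct io_uring_params p = {0};
              p.flags = IORING_SETUP_SQPOLL;
              p.sq_thread_idle = 2000;   /* ms idle before the thread sleeps */
              /* Optionally pin the poller to one core:
                 p.flags |= IORING_SETUP_SQ_AFF;  p.sq_thread_cpu = 3; */
              return io_uring_queue_init_params(64, ring, &p);
          }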

        • jclulow 34 minutes ago

          > notify the kernel you write something in the last page which is actually write-protected, so it triggers the kernel trap

          Right, that's a page fault; i.e., a context and privilege level switch. I can't imagine that's going to be any cheaper than a system call.

          > None? You don't really need a user-space thread to execute code in the kernel

          I didn't mention user space. There are threads in the kernel. From a scheduling perspective, somebody needs to be billed for the work they're doing. If it's not the process that invoked the system call, that seems like a pretty easy way to induce a bunch of noisy work on the system in excess of what the regular scheduling algorithm and resource capping would allow.

          > With the multi-core systems we have today, arguably dedicating a whole core exclusively to some core OS functionality could be more performant than having that core "constantly" switch contexts?

          Perhaps! Without a design and some measurement it seems impossible to know. Logically, though, you'll still have (kernel) threads of some kind executing in that special partition. They'll compete for execution time, just like user mode threads compete for execution time, at which point for fairness you'll have to figure out how to bill the work back to the user process somehow.

    • davedx an hour ago

      > Without concrete answers that succinctly describe this behaviour it feels impossible to see beyond the hype and get to the actual trade-offs being made.

      This article seems to be an attempt to succinctly describe the high-level goals of io_uring to people who don't know them, and in that respect I think it succeeds. The questions you're asking are more about how io_uring (or its API) is implemented, which is something else. I would hope anyone deciding whether to build on io_uring would do more detailed research on the trade-offs before pushing anything to production.

      I appreciated the brevity, personally...

  • markles 41 minutes ago

    I've been playing with io_uring for the last month or so, just for fun. I'm working on building an async runtime from scratch on top of it. I've been documenting the process thus far (think of these more as notes to myself):

    Creating bindings to io_uring (just to see the process): https://www.thespatula.io/rust/rust_io_uring_bindings/

    Writing an echo server using those bindings: https://www.thespatula.io/rust/rust_io_uring_echo_server/
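
    For a sense of the shape this takes, here's a rough C analogue of the core loop (the posts themselves use Rust; liburing assumed, handlers stubbed out):

      #include <liburing.h>

      enum op { OP_ACCEPT, OP_RECV, OP_SEND };

      /* One loop reaps completions; user_data says which operation each
         completion belongs to, and each handler queues the next step. */
      void echo_loop(struct io_uring *ring, int listen_fd) {
          struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
          io_uring_prep_accept(sqe, listen_fd, NULL, NULL, 0);
          io_uring_sqe_set_data64(sqe, OP_ACCEPT);
          io_uring_submit(ring);

          for (;;) {
              struct io_uring_cqe *cqe;
              if (io_uring_wait_cqe(ring, &cqe) < 0)
                  break;
              switch ((enum op)io_uring_cqe_get_data64(cqe)) {
              case OP_ACCEPT: /* cqe->res is the new fd: queue a recv  */ break;
              case OP_RECV:   /* echo back: queue a send of the bytes  */ break;
              case OP_SEND:   /* done echoing: queue the next recv     */ break;
              }
              io_uring_cqe_seen(ring, cqe);
          }
      }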

    • rwmj 20 minutes ago

      Although I'm a fan of io_uring, one reason to avoid it is that it involves quite major program restructuring (versus traditional event loops). If you don't want to depend on Linux >= 6 forever, but also want to support other OSes, you'll have to maintain both versions.
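
      A sketch of why the restructuring runs deep (epoll vs. liburing, details elided): a readiness loop learns an fd is ready and does the I/O inline, so state can live in the loop body; a completion loop learns an operation already finished, so everything its continuation needs has to travel in per-operation context.

        #include <liburing.h>
        #include <sys/epoll.h>

        void readiness_loop(int ep) {            /* traditional event loop */
            struct epoll_event ev[64];
            for (;;) {
                int n = epoll_wait(ep, ev, 64, -1);
                for (int i = 0; i < n; i++) {
                    /* read(ev[i].data.fd, ...) happens here, inline */
                }
            }
        }

        void completion_loop(struct io_uring *ring) {  /* io_uring style */
            for (;;) {
                struct io_uring_cqe *cqe;
                if (io_uring_wait_cqe(ring, &cqe) < 0)
                    break;
                /* the read already happened; resume via cqe->user_data */
                io_uring_cqe_seen(ring, cqe);
            }
        }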

  • larsrc 3 hours ago

    This bit is unfortunate, I hope it improves: "You might want to avoid io_uring [if] you want to use features with a good security track record."

    At least he's clear about it.

    • landr0id 2 hours ago

      I'm kind of curious what Alex meant by this, as the security problems relating to io_uring are, to my knowledge, unrelated to the user-space program. It makes sense if you want to disable the feature in your own kernel or remove potential sandbox-escape attack surface, but it's like saying "You might want to avoid win32k if you want to use features with a good security track record" (I know this is kind of apples to oranges, but you get the point).

      • FridgeSeal 2 hours ago

        IIUC, io_uring surfaced a bunch of pre-existing but rarely-hit code paths that had issues, which was widely taken to mean “io_uring has issues”. Google also disabled it on all machines in GCP; it's not clear whether that was because of the same issues or something else. The aforementioned issues have since been fixed.

  • twen_ty 28 minutes ago

    So it's 2024 and Linux still doesn't have what Windows NT had 30 years ago?

  • watt 3 hours ago

    io_uring also caused a ton of problems for our containers in Kubernetes when Node 20 enabled it by default. They scrambled and turned it off by default in https://github.com/nodejs/node/commit/686da19abb

    • clhodapp 2 hours ago

      That sounds like a dubious integration contract for initialization between libuv and node, not an issue with io_uring.

      I would assume the same thing would happen if they used traditional POSIX APIs to open file handles before dropping their privileges.

  • v3gas 4 hours ago

    > Oct 2, 2024

    From the future!

    • Brajeshwar 4 hours ago

      Nice! I think he had a post marked for the future (the 32nd of September), but his tool somehow published it anyway:

        https://github.com/matklad/matklad.github.io/blob/master/content/posts/2024-09-32--what-is-io-uring.dj

      • chrismorgan 31 minutes ago

        I think the “32-” should have been “23”. Transposed digits and an extra hyphen-minus.

    • kzrdude 4 hours ago

      Published already on the 32nd of September, wonderful.

    • boarush 4 hours ago

      I wonder if matklad schedules their blogs after penning them down?