Comments Page - Rewriting Every Syscall in a Linux Binary at Load Time

Rewriting Every Syscall in a Linux Binary at Load Time (amitlimaye1.substack.com)
Submitted by riteshnoronha16 5 days ago

xelaboi 9 hours ago
You either have a writing style that is uncannily similar to what an LLM generates, or this article was substantially written by an LLM. I don't know what it is about the style, but I just find it a bit exhausting, like an overfit on "engaging writing" that strips away sincerity.
- nonameiguess 7 hours ago
  Name sounds very likely not an English speaker. And the one reply here to a top-level comment is extremely obvious. I think it's unfortunate that people who write English poorly feel the need to do it, but I get it at least. The person behind this probably has a real interest and knowledge in the space but feels they can't communicate it without assistance.
  It is too bad, though. People bad at English will themselves be reading this forever now and think this is the way real people write, speak, or are supposed to.
  It's many things. The relentless enthusiasm about everything. Prefacing any answer to a question with an affirmation that it was a good question first. And yes, sorry, pedants of the web who feel witch-hunted because you knew how to employ keyboard shortcuts and used em-dashes in 2015 and have the receipts to prove it -- you never used 17 in the span of a single page. I think that was the first I can remember using ever and I had to contrive a way to do it where a semi-colon wouldn't clearly work better.
- qbane 2 hours ago
  There is even a table copy-pasted into a paragraph without noticing.
  > What’s needed is something different:
  > Requirement ptrace seccomp eBPF Binary rewrite Low overhead per syscall No (~10-20µs) Yes Yes Yes [...]
- renewiltord 8 hours ago
  It’s clearly LLM written but the idea was interesting enough that I read it. I suspect based on username the writer is cleaning up their voice.
  I think the idea of sharing the raw prompt traces is good. Then I can feed that to an LLM and get the original information prior to expansion.
jmillikin 10 hours ago
This might be a very dumb question, but if the process is being run under KVM to catch `int 0x03` then couldn't you also use KVM to catch `syscall` and execute the original binary as-is? I don't understand what value the instruction rewriting is providing here.
- rep_lodsb 8 hours ago
  Yes, that seems unnecessary. The overhead of trapping and rewriting every syscall instruction once can't be (much) greater than that required for rewriting them at the start either.
  Even if you disallow executing anything outside of the .text section, you still need the syscall trap to protect against adversarial code which hides the instruction inside an immediate value:
      foo: mov eax, 0xc3050f  ; return a perfectly harmless constant
           ret
           ...
           call foo+1
  (this could be detected if the tracing went by control flow instead of linearly from the top, but what if it's called through a function pointer?)
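A minimal sketch of why the example above works, assuming the standard x86-64 encodings: `mov eax, imm32` is `b8` followed by the little-endian immediate, so `0xc3050f` puts the bytes `0f 05` (`syscall`) and `c3` (`ret`) at `foo+1`. The helper name `find_syscall_bytes` is made up for illustration. A raw byte scan like this finds the hidden pair, but a rewriter that only patches at decoded instruction boundaries would see a single `mov` and leave it alone.

```python
SYSCALL = bytes([0x0F, 0x05])  # x86-64 `syscall` opcode bytes


def find_syscall_bytes(code: bytes) -> list:
    """Return every offset where the two syscall opcode bytes appear,
    regardless of instruction boundaries."""
    offsets = []
    pos = code.find(SYSCALL)
    while pos != -1:
        offsets.append(pos)
        pos = code.find(SYSCALL, pos + 1)
    return offsets


# foo: mov eax, 0x00c3050f  ->  b8 0f 05 c3 00
#      ret                  ->  c3
foo = bytes([0xB8, 0x0F, 0x05, 0xC3, 0x00, 0xC3])

hits = find_syscall_bytes(foo)
print(hits)  # offset 1 is foo+1, the target of `call foo+1`
```

Patching at that offset would corrupt the `mov` on the legitimate path, which is one reason a trap-based fallback is still needed for this case.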
  rep_lodsb 5 hours ago
  Thinking a bit more about it (and reading TFA more carefully), what's the point of rewriting the instructions anyway?
  I first assumed it was redirecting them to a library in user mode somehow, but actually the syscall is replaced with "int3", which also goes to the kernel. The whole reason why the "syscall" instruction was introduced in the first place was that it's faster than the old software interrupt mechanism which has to load segment descriptors.
  So why not simply use KVM to intercept syscall (as well as int 80h), and then emulate its effect directly, instead of replacing the opcode with something else? Should be both faster and also less obviously detectable.
  jacobgorm 3 hours ago
  Good point, an int3 is not going to be faster than a syscall, and if they implement the sandboxing policy in guest userspace it seems it would be quite easy to disable.
  jacobgorm 4 hours ago
  I think the point here is optimizing for the common case, the untrusted code is still running inside a VM, so you can still trap malicious or corner cases using a more heavy-handed method. The blog post does mention "self-healing" of JIT-generated code for instance.
  It is possible to restrict the call-flow graph to avoid the case you described; the canonical references here are the CFI and XFI papers by Ulfar Erlingsson et al. In XFI they/we did have a binary rewriter that tried to handle all the corner cases, but I wouldn't recommend going that deep. Instead you should just patch the compiler (which funnily we couldn't do, because the MSVC source code was kept secret even inside MSFT, and GCC source code was strictly off-limits due to being GPL-radioactive...)
- ghoul2 6 hours ago
  Isn't that exactly what gvisor does?
  twic 5 hours ago
  Yes: https://gvisor.dev/docs/
coppsilgold 10 hours ago
You mentioned SECCOMP_RET_TRACE, but there is also SECCOMP_RET_TRAP[1] which appears to perform better. There is also KVM. Both of these are options for gVisor: <https://github.com/google/gvisor>
[1] <https://github.com/google/gvisor/blob/master/pkg/sentry/plat...>
- monocasa 10 hours ago
  There's also SECCOMP_RET_USER_NOTIF, which is typically used by container runtimes for their sandboxing.
  coppsilgold 10 hours ago
  SECCOMP_RET_USER_NOTIF seems to involve sending a struct over an fd on each syscall. Do they really use it? Performance ought to suffer.
  Also gVisor (aka runsc) is a container runtime as well. And it doesn't gatekeep syscalls but chooses to re-implement them in userland.
  xuhu 6 hours ago
  SECCOMP_RET_USER_NOTIF appears to switch between the tracee and tracer processes for each syscall. Using SECCOMP_RET_TRAP to trigger a SIGSYS for every syscall in IO intensive apps introduces 5% overhead (and avoids a separate tracer).
  I wonder if there's any mechanism that works for intercepting static ELF's like Go programs and such.
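For readers who haven't seen what a SECCOMP_RET_TRAP filter looks like, here is a rough sketch that only assembles the classic-BPF program as bytes and does not install it. The constants are the values from the Linux UAPI headers; the helper names `insn` and `build_trap_filter` are made up for this example. It allows one syscall number and returns SECCOMP_RET_TRAP (SIGSYS delivered to the calling thread) for everything else on x86-64.

```python
import struct

# Values from <linux/seccomp.h>, <linux/filter.h>, and <linux/audit.h>.
SECCOMP_RET_TRAP = 0x00030000
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_KILL_PROCESS = 0x80000000
AUDIT_ARCH_X86_64 = 0xC000003E
BPF_LD_W_ABS = 0x20  # BPF_LD | BPF_W | BPF_ABS
BPF_JEQ_K = 0x15     # BPF_JMP | BPF_JEQ | BPF_K
BPF_RET_K = 0x06     # BPF_RET | BPF_K


def insn(code, jt, jf, k):
    """Pack one struct sock_filter: u16 code, u8 jt, u8 jf, u32 k."""
    return struct.pack("=HBBI", code, jt, jf, k)


def build_trap_filter(allowed_nr):
    """Allow `allowed_nr`, trap every other syscall (hypothetical helper)."""
    return b"".join([
        insn(BPF_LD_W_ABS, 0, 0, 4),               # A = seccomp_data.arch
        insn(BPF_JEQ_K, 1, 0, AUDIT_ARCH_X86_64),  # x86-64: skip the kill
        insn(BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
        insn(BPF_LD_W_ABS, 0, 0, 0),               # A = seccomp_data.nr
        insn(BPF_JEQ_K, 0, 1, allowed_nr),         # no match: skip the allow
        insn(BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW),
        insn(BPF_RET_K, 0, 0, SECCOMP_RET_TRAP),   # SIGSYS for the rest
    ])


prog = build_trap_filter(1)  # 1 is write(2) on x86-64
print(len(prog) // 8, "instructions")
```

Installing it would go through `prctl(PR_SET_NO_NEW_PRIVS, ...)` and the `seccomp(2)` syscall with a `struct sock_fprog` pointing at these bytes. A real filter would also need to let through whatever the SIGSYS handler itself calls (`rt_sigreturn` at minimum), which this sketch leaves out.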
Thaxll 6 hours ago
It's pretty much what gVisor does.
https://gvisor.dev/
- Thaxll 2 hours ago
  So why not use it instead of re-implementing the exact same thing?
CableNinja 5 days ago
I assume this would break observability through existing methods, right? If you were to strace a process that has been patched, would you see regular syscall data (as if it wasn't patched) or would your syscall replacement appear along the way?
- amitlimaye 5 days ago
  Good question. I didn't cover this in the post — the binary doesn't run on the host kernel directly. It runs inside a lightweight KVM-based VM with no operating system. The shim is the only thing handling syscalls inside the guest. So strace on the host wouldn't see anything — no syscalls reach the host kernel from the guest. From the host side, the only visible activity is the hypervisor process making syscalls on behalf of the guest.
  Inside the guest, there's no kernel to attach strace to — the shim IS the syscall handler. But we do have full observability: every syscall that hits the shim is logged to a trace ring buffer with the syscall number, arguments, and TSC timestamp. It's more complete than strace in some ways — you see denied calls too, with the policy verdict, and there's no observer overhead because the logging is part of the dispatch path.
  So existing tools don't work, but you get something arguably better: a complete, tamper-proof record of every syscall the process attempted, including the ones that were denied before they could execute. I'll publish a follow-on tomorrow that details how we load and execute this rewritten binary and what the VMM architecture looks like.
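To make the trace ring buffer described above concrete, here is a small sketch of one possible shape. The record layout, field order, and all names (`RECORD`, `TraceRing`, `ALLOW`, `DENY`) are assumptions for illustration; the post and the comment don't specify them. Each record holds the syscall number, the policy verdict, a TSC-style timestamp, and six arguments, and the oldest records are overwritten once the buffer is full.

```python
import struct
from collections import namedtuple

# Hypothetical 64-byte record: u32 nr, u32 verdict, u64 tsc, six u64 args.
RECORD = struct.Struct("<IIQ6Q")
TraceRecord = namedtuple("TraceRecord", "nr verdict tsc args")
ALLOW, DENY = 0, 1


class TraceRing:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = bytearray(capacity * RECORD.size)
        self.head = 0  # total number of records ever written

    def log(self, nr, verdict, tsc, args):
        """Write one record, overwriting the oldest slot when full."""
        slot = self.head % self.capacity
        RECORD.pack_into(self.buf, slot * RECORD.size, nr, verdict, tsc, *args)
        self.head += 1

    def records(self):
        """Yield the surviving records, oldest first."""
        start = max(0, self.head - self.capacity)
        for i in range(start, self.head):
            f = RECORD.unpack_from(self.buf, (i % self.capacity) * RECORD.size)
            yield TraceRecord(f[0], f[1], f[2], f[3:])


ring = TraceRing(capacity=2)
ring.log(1, ALLOW, 100, (1, 0, 5, 0, 0, 0))  # write(2), allowed
ring.log(59, DENY, 200, (0,) * 6)            # execve(2), denied by policy
ring.log(0, ALLOW, 300, (0,) * 6)            # read(2), overwrites the first
kept = list(ring.records())
print([(r.nr, r.verdict) for r in kept])
```

Logging denied calls alongside allowed ones falls out naturally here because the verdict is just another field written on the dispatch path.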
ozgrakkurt 10 hours ago
Really informative writing, thank you.
How secure does this make a binary? For example would you be able to run untrusted binary code inside a browser using a method like this?
Then can websites just use C++ instead of javascript for example?
- lmz 9 hours ago
  They already can use C++ if they want to. Emscripten? Jslinux?
  ozgrakkurt 8 hours ago
  I mean just distributing the regular compiled x86_64 binary and then running it as a normal executable on the client side but just using that syscall shim so it is safe.
  direwolf20 7 hours ago
  If you think about the fundamentals involved here, what you actually need is for the OS to refuse to implement any syscalls, and not share an address space.
  A process is already a hermetically sealed sandbox. Running untrusted code in a process is safe. But then the kernel comes along and pokes holes in your sandbox without your permission.
  On Linux you should be able to turn off the holes by using seccomp.
JSR_FDED 9 hours ago
Love the detailed write up, thanks!
This is the kind of foundation that I would feel comfortable running agents on. It’s not the whole solution of course ("yes agent, you’re allowed to delete this email but not that email" can’t be solved at this level)… let me know when you tackle that next :-)
foota 10 hours ago
Hah, I've been looking into something amusingly similar to track mmap syscalls for a process :)
- pocksuppet 5 hours ago
  Why not just use ptrace?
hparadiz 9 hours ago
I've been thinking of making a kernel patch that disables eBPF for certain processes as a privacy tool. Everyone is using eBPF now.
im3w1l 9 hours ago
What about int 80h?
- jcalvinowens an hour ago
  Yeah, I had the same question. But I'd guess they probably disable IA32 completely.
szmarczak 7 hours ago
> It can’t detect the interception
What's stopping the process from reading its own memory and seeing that the syscall was patched?