Great writeup. I'd love to know more about how the Supervisor works and how it "fork[s] a separate process for each supervised worker/dispatcher/scheduler".
In a Rails app served with Puma, I've always had a hard time understanding the canonical way to run a loop that does some periodic work.
I know Puma has plugin support but I don't see much documentation there.
Forking processes / threads is something we're used to having Rails / Puma take care of for us.
Pressed for time and unable to do a deep dive, we ended up settling on sidekiq-cron, and it's been serving us nicely.
Under the hood, it uses good ol' fork and keeps track of the generated process IDs.
It's surprisingly simple. You can check out the relevant source here: https://github.com/rails/solid_queue/blob/main/lib%2Fsolid_q...
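The pattern the linked source implements can be sketched in plain Ruby. This is a toy illustration of "fork a process per worker and track the PIDs", not Solid Queue's actual code:

```ruby
# Toy supervisor: fork one child process per "worker" and remember the
# PIDs so the parent can reap them (and, in a real system, restart them).
pids = 2.times.map do
  fork do
    # Child process: a real worker would poll its queue here.
    sleep 0.1 # stand-in for doing work
  end
end

# Parent process: wait on each tracked PID and note how it exited.
pids.each do |pid|
  _, status = Process.wait2(pid)
  puts "worker #{pid} exited (success=#{status.success?})"
end
```

The real supervisor also has to handle signals, restart crashed children, and clean up stale PID files, but fork plus PID bookkeeping is the core of it.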
Author here. What a pleasant surprise to see this on HN!
Happy to answer any questions.
thanks for this write up.
some of us are happy rails users so more rails content is always welcome
I briefly worked at a YC company that was a ruby shop. Their answer to every performance problem was to stick it on a queue. There were, I don't know, dozens of them. Then they decided they needed to be multi-region, because reasons. But the queues weren't set up to be multi-region, so they built an entirely new service whose job was to decide which queue in which region jobs needed to go on. So now you had jobs crisscrossing datacenters, and tracking any issue became literally impossible. Massively turned me off to both that company and ruby in general.
We were offloading every long-running activity to jobs in the Elixir/Phoenix project I worked on years ago. There is no other way: the response to a web request must complete in a short time and free the server for further requests.
We solved debugging by sending all log lines to a centralized server. We were running on the Google cloud.
We were not multiregion though.
My current Rails project uses sidekiq a lot to send mail, generate PDFs, any activity that does not have to necessarily complete before we return the response. We keep the interactive web app up to date by websockets and with callbacks for clients using our public API. I don't think we would have done it differently in any other language.
By the way, we built our own slimmer version of Sidekiq for Elixir, because the language plus the OTP libraries provide a lot of functionality, but we still need to persist jobs, retry them even after a complete reboot, do exponential backoff, etc.
Bad architecture can happen in any language. I don't see how the language choice could ever protect you against the described structural problem you built.
Also, you will see that the answer to most actual performance problems tends to be queues in other languages too. At least in mature places, mostly because it is possible to inspect what a queue is doing. Though it will of course be a problem if it is part of a big spaghetti architecture.
Yeah this is not the fault of ruby. Sounds more like bad choices that could be made with any language or framework.
You're totally correct. It's not a problem of ruby per se, but engineers would basically just throw their hands up and say "ruby can't handle this" and sidekiq became the One True Way™. What ensued was the most byzantine software architecture I've ever seen.
Culture around a language influences what choices are made.
Funny part: I initially thought you were referring to the word “Byzantine” itself, which tends to carry a negative connotation in English, mostly due to historical bias. But you’re actually talking about Ruby!
If we were to take Byzantine in a more accurate, historical sense, something truly “Byzantine” should be evolving, enduring, top-tier, and built to last for 1k years.
I’ve never felt like “throw everything into a queue” was a mindset within the Ruby community, nor have we done that at my companies. And multi-region is a business decision.
Resque was a staple for a long long time. In the jvm world, throw everything into Kafka is also a staple of a lot of "enterprise" shops. Or SQS for AWS places I've worked at. I think it is not a ruby language thing, but a certain kind of architecture thing.
True that it is not uncommon to use Sidekiq or Resque, but Rails 8 is going to be the first version to ship with a queuing system (SolidQueue), later this year. So queueing has been an add-on for 20 years. I don't think it is quite a staple.
Rails 8 came out in November, and `rails new` generates an app with the solid trio in the Gemfile. Been fun playing around with it for new side projects :)
Doesn't Ruby, like Python, have a GIL? I always found that to be enough to encourage some "premature scalable architecture".
That depends on the Ruby implementation.
MRI (CRuby) has a GVL which is why you might use a forking web server like Puma or Pitchfork.
JRuby and TruffleRuby though have true multi-threading and no GVL.
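The practical effect of the GVL on CPU-bound threads is easy to demonstrate. Timings vary by machine, but on MRI the two numbers come out roughly equal, while on JRuby/TruffleRuby the threaded version should be noticeably faster:

```ruby
require "benchmark"

# Naive CPU-bound function for the demonstration.
def fib(n)
  n < 2 ? n : fib(n - 1) + fib(n - 2)
end

# On MRI the GVL means these two threads cannot execute Ruby code in
# parallel, so the threaded version takes about as long as sequential.
# On JRuby/TruffleRuby the threads can actually run on two cores.
sequential = Benchmark.realtime { 2.times { fib(22) } }
threaded = Benchmark.realtime do
  2.times.map { Thread.new { fib(22) } }.each(&:join)
end

puts format("sequential: %.3fs, threaded: %.3fs", sequential, threaded)
```

Note this only applies to CPU-bound work: MRI releases the GVL around blocking I/O, which is why threaded web servers like Puma still help for I/O-heavy request handling.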
I've used the Concurrent Ruby library with JRuby and Tomcat quite a bit and find it works very well for what I need.
It does have a GIL. You're not wrong, but by the same logic, there are pitfalls when using multi-threading as well, even in languages where it's native (e.g., Elixir).
Regardless, in my experience, when you run into scenarios that need queueing, multi-threading, etc., you need to know what you’re doing.
Don’t be silly. Bad choices are made in all sorts of languages and teams - this has nothing to do with language. High pressure situations can lead teams to make choices they don’t always foresee as bad until after they are paying the consequences.
Sure bad choices are made everywhere, but I was essentially claiming that when a community has a hammer, they will see nails.
Queues are not a ruby specific thing, nor are they particularly pervasive within Rails apps. Having a good framework to handle them doesn’t make it the only tool in the tool belt. On the contrary, the fact that Rails has good tools to fit many different types of system architecture needs is a counterpoint of your assertion.
I'm going to be brave (but still use a throwaway) and ask the dumb question - what is wrong with putting things in queues to help with performance problems?
If some endpoint is too slow to return a response to the frontend within a reasonable time, enqueueing it via a worker makes sense to me.
That doesn't cover all performance issues but it handles a lot of them. You should also do things like optimize SQL queries, cache in redis or the db, perhaps run multiple threads within an endpoint, etc. but I don't see anything wrong with specifically having dozens of workers/queues. We have that in my work's Rails app.
Happy to hear how I can do things better if I'm missing something.
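The shape of "enqueue it and return" can be shown with an in-process toy: a worker thread drains a queue so the "request handler" can respond immediately. Sidekiq/Solid Queue do the same thing with separate processes and persistent storage, so treat this as an illustration of the flow, not a substitute:

```ruby
# A shared queue and one background worker thread draining it.
JOBS = Queue.new

worker = Thread.new do
  while (job = JOBS.pop)
    break if job == :shutdown
    sleep 0.05 # stand-in for PDF generation, mail delivery, ...
    puts "processed #{job}"
  end
end

# "Controller": enqueue the slow work and return right away.
JOBS << "weekly_report"
response = { status: 202, body: "report queued" }

JOBS << :shutdown
worker.join
puts response[:status]
```

The 202 ("Accepted") status is the HTTP-level way of saying "we took your request, the result will arrive later", which pairs naturally with websockets or callbacks for delivering the eventual result.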
There are two primary areas where I've personally seen teams get bitten by this:
1) Designers don't understand that things are going to happen async, and the UI ends up wanting to make assumptions that everything is happening in real time. Even if it works with the current design, it's one small change away from being impossible to implement.
This is a general difficulty with working in eventually consistent systems, but if you're putting something in a queue because you're too lazy to optimize (rather than the natural complexity of the workload demanding it) you're going to be hurting yourself unnecessarily.
2) Errors get swallowed really easily. Instead of being properly reported to the team and surfaced to a user in a timely manner, the default setting of some configurations to just keep retrying the job later means if you're not monitoring closely you'll end up with tens of thousands of jobs retrying over and over at various intervals.
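The fix for point 2 is to cap retries and report loudly when a job is exhausted. A toy version of such a policy, in plain Ruby rather than any particular job framework's API:

```ruby
# Toy retry policy: cap attempts and surface the error, instead of
# silently re-enqueueing the job forever.
def run_with_retries(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue => e
    retry if attempts < max_attempts # real systems add exponential backoff here
    # Exhausted: surface it instead of letting it churn in the background.
    puts "job dead after #{attempts} attempts: #{e.message}"
    raise
  end
end

begin
  run_with_retries { raise "downstream API down" }
rescue RuntimeError
  puts "reported to error tracker"
end
```

Sidekiq and Solid Queue both have built-in retry/discard configuration for this; the point is that whatever the framework, the "job is dead" path needs to end in an alert a human sees, not a silently growing retry set.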
We have 100s of queues processing millions of jobs in sidekiq queues at any given time.
These are data and compute heavy workloads that take anywhere from minutes to hours for a request to be completed, but the UI takes this into account.
Users submit a request and then continue onto whatever is the next thing they intend to do and then they can subscribe to various async notification channels.
It's not the right choice for everything, but it's the right choice for some things.
These are good points, in answer to them:
1. Yes this is true but Rails now comes with nice support for async UI built to push updates to the browser via Hotwire and Turbo.
You’d need something like that anyway anytime you’re calling an external service you don’t control.
2. Again this is also a good point but even running every request synchronously you still need good error logging because you don’t want to share details of an error with your frontend.
With background jobs you definitely need to be on top of monitoring. I also think you need to be very careful about idempotency and retry logic.
I see that as the engineering trade offs for that pattern. There’s very little in the way of silver bullets in engineering; different solutions just come with different trade offs.
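The idempotency point deserves a concrete shape: if a job can be retried, running it twice must not apply its effect twice. A toy guard, assuming nothing beyond plain Ruby (in production this would be a unique DB constraint or an idempotency key, not an in-memory Set):

```ruby
require "set"

# Records which job ids have already been applied.
PROCESSED = Set.new

def charge_once(job_id, amount, ledger)
  # Set#add? returns nil if the id was already present, so a retried
  # job falls through to :skipped instead of charging again.
  return :skipped unless PROCESSED.add?(job_id)
  ledger << amount
  :applied
end

ledger = []
charge_once("job-1", 100, ledger)
charge_once("job-1", 100, ledger) # a retry of the same job: no double charge
puts ledger.sum # 100
```

With a guard like this in place, aggressive retry policies become safe rather than scary, which is exactly the trade-off being described.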
Error handling was a huge issue, along with other weird distributed system bugs. Backed up queues, job shedding, thundering herds, you name it. When you have jobs on queues kicking off new jobs on different queues, tracing issues is just miserable. Sure, it's not a problem of ruby per se, but engineers would basically just throw their hands up and say "ruby can't handle this" and sidekiq became the One True Way™.
Maybe "ruby can't handle this" was a short form for "we can't run this in the Rails controller because the response would take too long" possibly calling 3rd party APIs, "and we would run out of threads."
Anything running in sidekiq is written in Ruby too.
I think your question is well-asked, and I lament any work environments that have led you to think asking a question like this would be A Problem(tm).
IMO - there’s a lot of things that queues are an excellent answer to. Potentially including performance.
But - queues (generally and among other things) solve the problem of “this will take some time AND the user doesn’t need an immediate response.”
If that's not your problem, then queues might not be the solution. If something is taking too long and the user DOES need a response, then (as you say) optimizing is what you should try, not queues. Or some product redesign so the user doesn't need an immediate response. Or finding a way to split up the part producing an immediate response and the part that takes a while.
For example: validating uploaded bulk data is in the right “shape”, and then enqueuing the full validation and insertion.
Also really really avoid jobs that enqueue jobs. Sometimes they’re necessary (spacing out some operation on chunks of a group; or a job that ONLY spawns other jobs) but mostly they’re a route to spaghetti.
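The bulk-upload split in the example above could look something like this (all names here are invented for illustration): a cheap synchronous "shape" check in the request path, with the heavy per-row validation and insertion deferred to a job.

```ruby
require "csv"

REQUIRED_HEADERS = %w[email name].freeze

# Fast check run synchronously: is the upload even the right shape?
def quick_shape_check(csv_text)
  headers = CSV.parse_line(csv_text) || []
  missing = REQUIRED_HEADERS - headers
  raise ArgumentError, "missing columns: #{missing.join(', ')}" unless missing.empty?
  true
end

upload = "email,name\na@example.com,Ada\n"
if quick_shape_check(upload)
  # The slow part happens out of band, e.g.:
  # BulkImportJob.perform_later(upload)  # hypothetical job class
  puts "shape ok, full import enqueued"
end
```

The user gets an immediate "your file is malformed" error for the common mistakes, while the minutes-long import runs in the background.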
You’re not missing anything and are correct in that there are plenty of reasons to use queues and defer work that can be handled asynchronously outside of a request/response. This is not specific to ruby or any language for that matter.
The parent indicated the cross region dynamic required extra routing logic and introduced debugging problems.
Queues have several problems:

- if the caller is HTTP it may time out and retry, leading to more jobs being queued

- the caller may no longer care because it took so long, and the work is wasted

- if the caller is itself called from a queue it can cause cascades

- you can fill a disk up and crash the system
To me the question is, so what's a better alternative? At least queues can be designed to handle timeouts, errors, and flakey APIs.
Pretty simple - people often think queues are magic, and an unbounded queue is just your previous problem but now way worse; what happens when the queue "gets full"?
But if you are just smoothing out some work it's pretty normal; just make sure you are modeling things instead of putting it all in the magic queue.
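Ruby's standard library makes the "what happens when the queue gets full" question concrete: SizedQueue bounds the queue and either blocks producers or, with the non-blocking flag, raises so the caller can shed load instead of growing without bound. A small sketch:

```ruby
# A bounded queue with capacity 2.
queue = SizedQueue.new(2)
accepted = []
rejected = []

3.times do |i|
  begin
    queue.push(i, true) # non_block = true: raises ThreadError when full
    accepted << i
  rescue ThreadError
    rejected << i # backpressure: tell the caller "try again later"
  end
end

puts "accepted=#{accepted.inspect} rejected=#{rejected.inspect}"
```

An unbounded queue just moves the overload somewhere you can't see it; a bounded one forces you to decide up front whether to block, reject, or drop.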
Don't let one company's misuse of a language turn you off the entire thing!
Neither sticking everything into a queue nor going multi-region are Ruby’s fault.
Ruby and the age of “I don’t care what type this variable is, it quacks like a duck!” is over and dead. Improvements to type systems have shown there is a better way to do software development.
Static typing isn't free. Dynamic typing is perfectly viable for building software of any size; there's living proof, and it's far from dead.
So we're only left with personal opinion
Or the standard "it depends" answer that everyone eventually realizes is the only correct answer. ;-)
Maybe, the cycle has happened before and maybe come back around again. Dynamic typing is really nice when most of your data looks like bags of strings. Compilers and tools just don’t add a lot when you’re passing around glorified blobs of stringy json-like stuff. Type gymnastics can eat a lot of time where you could otherwise be shipping something useful.
I would argue when you're passing around stringy JSON-like thingies is when typing is most useful :)
You're not going to misuse an API that takes a Person or Cart, but mixing up two hashes because you used two different strings as keys can happen easily.
(I do think dynamic typing is mostly fine, but I do wish ruby had optional static typing with some nice syntax instead of RBS)
It’s on the way thankfully.
I’m really excited about Sorbet getting behind the new RBS-inline comment syntax and the prospect of both runtime and static analysis as optional tools when needed.
I haven't really felt a need for static typing in my job's Rails app (small team, and I've been working in this codebase a really long time) but I think LLMs can be a huge help for automating the type gymnastics.
Typescript is literally a language describing what the quack “sounds like” so you can attempt to ensure any particular variable makes those kinds of quacks. Typescript doesn’t care that it also quacks like a dog.
Plus, Ruby has lots of easy ways for you to check typing, if you want to.
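A few of the built-in runtime checks that comment alludes to, from explicit class checks through duck typing to pattern matching:

```ruby
value = { name: "Ada", age: 36 }

puts value.is_a?(Hash)          # explicit class check: true
puts value.respond_to?(:fetch)  # duck-typing check: true

# Pattern matching (Ruby 3.x) checks the shape and destructures in one go.
case value
in { name: String => name, age: Integer => age }
  result = "#{name}, #{age}"
else
  result = "unexpected shape"
end
puts result # Ada, 36
```

Pattern matching in particular addresses the "two hashes with different string keys" mix-up mentioned upthread, since a shape mismatch falls through to the else branch instead of silently returning nil.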
Man, that brings back memories. Very early in my career I tried to use Postgres as a task queue, thinking that with O(hundreds) of jobs it wasn't worth setting up something like RabbitMQ. Sadly I knew pretty much nothing about DB design and the performance was horrible, so I ended up ripping it out and installing RabbitMQ after all (and having a whole new set of headaches with random Rabbit admin issues, but at least when it worked it was fast).
Any performance comparisons to Oban?