Good Retry, Bad Retry (medium.com), submitted by misonic 2 days ago
  • ramchip 2 days ago

    AWS also say they do something interesting:

    > When adding jitter to scheduled work, we do not select the jitter on each host randomly. Instead, we use a consistent method that produces the same number every time on the same host. This way, if there is a service being overloaded, or a race condition, it happens the same way in a pattern. We humans are good at identifying patterns, and we're more likely to determine the root cause. Using a random method ensures that if a resource is being overwhelmed, it only happens - well, at random. This makes troubleshooting much more difficult.

    https://aws.amazon.com/builders-library/timeouts-retries-and...
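
    A minimal sketch of that idea in Python (the hashing scheme here is an assumption; AWS doesn't publish their exact method): derive the jitter from a stable per-host value, so the same host always lands at the same offset.

```python
import hashlib
import socket

def host_jitter(base_interval: float, max_jitter: float) -> float:
    """Deterministic jitter: hash the hostname so the same host always
    gets the same offset within [0, max_jitter). Any overload then
    recurs in a recognizable pattern rather than at random."""
    digest = hashlib.sha256(socket.gethostname().encode()).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return base_interval + fraction * max_jitter
```

    Swapping the hostname hash for `random.random()` gives the conventional jittered schedule; the trade-off the AWS article describes is reproducibility versus spreading that is random every time.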

    • cpeterso 8 hours ago

      I’ve read a suggestion to use prime numbers for retry timers to reduce the chance of multiple timers synchronizing if they have common factors. I don’t know if that’s a real concern, but it wouldn’t hurt to pick a random prime number instead of some other random number.
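
      For illustration, a hypothetical helper along those lines: two timers with prime periods p and q only coincide every p*q seconds, whereas periods sharing a common factor line up far more often.

```python
import random

def is_prime(n: int) -> bool:
    """Trial-division primality test; fine for small timer values."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n**0.5) + 1))

def random_prime_period(lo: int, hi: int) -> int:
    """Pick a random prime in [lo, hi] to use as a retry period (seconds)."""
    primes = [n for n in range(lo, hi + 1) if is_prime(n)]
    return random.choice(primes)
```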

    • guideamigo_com 2 days ago

      I never get this desire for microservices. Your IDE can help if there are 500 functions, but nothing will help you if you have 500 microservices. Almost no one fully understands such a system, it is hard to tell which parts of the code are unused, and large-scale refactoring is impossible.

      The upside seems to be some mythical infinite scalability which will collapse under such positive feedback loops.

      • liampulles 2 hours ago

        What you are describing, where 1 function = 1 service, is serverless architecture. The "ideal" with any service (micro or macro) is to maximise richness of functionality over scale of API.

        But I agree one should do monolith by default.

        • morningsam 2 days ago

          >The upside seems to be some mythical infinite scalability which will collapse under such positive feedback loops.

          Unless I misunderstand something here, they say pretty early in the article that they didn't have autoscaling configured for the service in question and there is no indication they scaled up the number of replicas manually after the downtime to account for the accumulated backlog of requests. So, in my mind, of course there can be no infinite, or really any, scalability if the service isn't allowed to scale...

          • vrosas 8 hours ago

            I’ve seen monumental engineering effort go into managing systems because for one reason or another people refused to use (or properly configure) autoscaling.

          • FooBarWidget 2 days ago

            The point of microservices is not technical, it's so that the deployment- and repository ownership structure matches your organization structure, and that clear lines are drawn between responsibilities.

            • sim7c00 2 days ago

              It's also easier to find devs who have the skills to create and maintain thin services than a large, complicated monolith, despite the difficulties of having to debug a constellation of microservices during a crisis.

              • phil21 2 days ago

                For the folks who downvoted this - why? I hire developers and this is the absolute truth of the matter.

                You can get away with hiring devs able to only debug their little micro empire so long as you can retain some super senior rockstar level folks able to see the big picture when it inevitably breaks down in production under load. These skills are becoming rarer by the day, when they used to be nearly table stakes for a “senior” dev.

                Microservices have their place, but many times you can see that it’s simply developers saying “not my problem” to the actual hard business case things.

                • kunley 14 minutes ago

                  Btw, important factor: you can only see the big picture properly if you co-created the setup. Hiring senior rockstars as a reaction to problems will satisfy some short-term goals but not solve the problems overall

                  • pards 2 days ago

                    > retain some super senior rockstar level folks able to see the big picture

                    This is the critical piece that many organisations miss.

                    Microservices are the bricks, but the customer needs them assembled into a house.

                    • mannyv a day ago

                      You need those senior folks who can see the big picture, whether you use monoliths or microservices.

                      The real benefit of a microservice is that it's easier to see the interactions, because you can't call into some random and unexpected part of the codebase...or at least it's much harder to do something that's not noticeable like that.

                      • paulryanrogers 11 hours ago

                        At the cost of network boundaries everywhere, and all that entails

                        • actionfromafar 3 hours ago

                          It's so funny we always use technical solutions to solve social problems, while confusing which parts are what. :)

                    • ton618 9 hours ago

                      The reality is, the organizational structure is likely to change over time; would anyone then want to mirror it in the repo structure?

                      • actionfromafar 3 hours ago

                        Not likely, no.

                      • Joker_vD 10 hours ago

                        But, uh, both Google and Yandex use monorepo-style of development; and microservices style of deployment, yes. Go figure.

                      • delusional 2 days ago

                        I think the dream is that you can reason locally. I'm not convinced that it actually helps any, but the dream is that by having everything as services, complete with external boundaries and enforced constraints, you're able to more accurately reason about the orchestration of services. It's hard to reason about your order flow if half of it depends on some implicit procedure that's part of your shopping cart.

                        The business I'm part of isn't really after "scalable" technology, so that might color my opinion, but a lot of the arguments for microservices I hear from my colleagues are actually benefits of modular programs. Those two have just become synonyms in their minds.

                        • klabb3 2 days ago

                          > […] the dream is that having everything as services, […], you're able to more accurately reason about the orchestration of services.

                          Well.. I mean that’s an entirely circular point. Maybe you mean something else? That you can individually deploy and roll back different functionality that belong to a team? There’s some appeal for operations yeah.

                          > but a lot of the arguments for microservices I hear from my colleagues are actually benefits of modular programs

                          Yes, I mean from a development perspective a library call is far, far superior to an HTTP call. It is much more performant and orders of magnitude easier to reason about, since the caller and callee are running the same version of the code. That means a breaking change is a refactor and a single commit, whereas with a service boundary you need a whole migration.

                          You can't avoid services altogether; some, like a payment portal run by a completely different company, are external by nature. But deliberately creating more of these expensive boundaries for no reason, within the same small org or team, is madness, imo.

                          • scubbo 9 hours ago

                            > That means that breaking changes is a refactor and single commit, whereas with a service boundary you need a whole migration.

                            This decoupling-of-updates-across-a-call-boundary is one of the key reasons why I _prefer_ microservices. Monoliths _force_ you to update your caller and callee at the same time, which appears attractive when they are 1-1 but becomes prohibitively difficult when there are multiple callers of the same logic - changes take longer and longer to be approved, and you drift further from CD. Microservices allow you to gradually roll out a change across the company at an appropriate rate - the new logic can be provided at a different endpoint for early adopters, and other consumers can gradually migrate to it as they are encouraged or compelled to do so.

                            Similarly with updates to cross-cutting concerns. Say there's a breaking change to your logging or testing framework, or an encryption library, or something like that. You can force all your teams to down tools and to synchronize in collaborating on one monster commit to The Monolith that will update everything at once - or you can tell everyone to update their own microservices, at their own pace (but by a given deadline, if InfoSec so demands), without blocking each other. Making _and testing and deploying_ one large commit containing lots of changes is, counter-intuitively, much harder than making lots of small commits containing the same quantity of actual change - your IDE can find-and-replace easily across the monorepo, but most updates due to breaking changes require human intervention and cannot be scripted. The ability for different microservices within the same company to consume different versions of the same utility library at the same time (as they are gradually, independently, updated) is a _benefit_, not a drawback.

                            > a library call[...]is much more performant [...than] these expensive boundaries

                            I mean, no argument here - but latency tends to be excessively sought by developers, beyond the point of actual experience improvement. If it's your limiting factor, then by all means look for ways to improve it - but designing for fast development and deployment has paid far greater dividends, in my experience, than overly optimizing for latency.

                            • sa46 5 hours ago

                              > Monoliths _force_ you to update your caller and callee at the same time

                              It's possible to migrate method calls incrementally (create a new method or add a parameter). In large codebases, it's necessary to migrate incrementally. The techniques overlap those of changing an RPC method.
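
                              A sketch of that incremental pattern with hypothetical names: the old signature becomes a shim that delegates to the new one, callers migrate commit by commit, and the shim is deleted last.

```python
# Step 1: introduce the new signature; the old name becomes a shim
# that delegates, so every existing caller keeps working.
def charge_v2(order_id: str, amount_cents: int, currency: str) -> str:
    return f"charged {amount_cents} {currency} for order {order_id}"

def charge(order_id: str, amount_cents: int) -> str:
    # Deprecated shim; migrate callers to charge_v2, then delete this.
    return charge_v2(order_id, amount_cents, currency="USD")

# Step 2: move callers to charge_v2 one commit at a time.
# Step 3: remove charge() once a repo-wide search finds no callers.
```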

                        • lmm 8 hours ago

                          The real reason is that it's impossible to safely upgrade a dependency in Python. And by the time you realise this you're probably already committed to building your system in Python (for mostly good reasons). So the only way to get access to new functionality is to break off parts of your system into new deployables that can have new versions of your dependencies, and you keep doing this forever.

                          • dropofwill 2 days ago

                            The concepts here apply to any client-server networking setup. Monoliths can still have web clients, native apps, IoT sensors, third-party APIs, databases, etc.

                            • jeffbee 6 hours ago

                              > but nothing would help you if you have 500 micro services.

                              Have you pondered the likelihood that your IDE sucks?

                            • jrochkind1 4 hours ago

                              I just learned quite a bit about retries. I really liked this tour of one area of the domain in the form of a narrative. When written by someone who clearly knows the area and also has skill at writing it, that's a great way to learn more techniques.

                              Would love to read more things like this in different areas.

                              • patrakov 2 days ago

                                To counter an avalanche of retries across layers, I have also seen a custom header added to all requests that are retries. Upon receiving a request with this header, the microservice would turn off its own retry logic for that request.
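
                                A sketch of that scheme (the header name and the `send` client are hypothetical): the wrapper marks retries, and a receiving service that sees the marker disables its own retry loop, so failures can't amplify multiplicatively across layers.

```python
RETRY_MARKER = "X-Is-Retry"  # hypothetical header name

def call_with_retries(send, url, max_attempts=3):
    """Call `send(url, headers)`, marking every attempt after the first.

    `send` stands in for whatever HTTP client you use and is expected
    to raise ConnectionError on failure.
    """
    headers = {}
    for attempt in range(max_attempts):
        if attempt > 0:
            headers[RETRY_MARKER] = "1"
        try:
            return send(url, headers)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise

def retries_allowed(incoming_headers) -> bool:
    """Server side: don't retry downstream if this request is itself a retry."""
    return RETRY_MARKER not in incoming_headers
```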

                                • pyrolistical 6 hours ago

                                  Ya. Instead of blind retries, I have the server respond with a "try after timestamp" header. This way it can tell everybody to back off. If there's no response at all, then welp.
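
                                  This is essentially the standard HTTP `Retry-After` header; a minimal client-side sketch of honoring a server-chosen resume time:

```python
import time

def wait_until(retry_after_ts: float) -> float:
    """Sleep until the server-supplied 'try after' timestamp and return
    the time actually slept. The overloaded server knows its own load,
    so it, not each client independently, decides when traffic resumes."""
    delay = max(0.0, retry_after_ts - time.time())
    if delay:
        time.sleep(delay)
    return delay
```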

                                • davedx 2 days ago

                                  This is the kind of well written, in depth technical narrative I visit HN for. I definitely learned from it. Thanks for posting!

                                  • chipdart 2 days ago

                                    I agree. What a treat. One of the best submissions gracing HN in months.

                                  • patrakov 2 days ago

                                    It's worth noting that the logic in the article only applies to idempotent requests. See this article (by the same author) for the non-idempotent counter-part: https://habr.com/ru/companies/yandex/articles/442762/ (unfortunately, in Russian). I am sure somebody posted a human-written English translation back then, but I cannot find it. So here is a Google-translated version (scroll past the internal error, the text is below):

                                    https://habr-com.translate.goog/ru/companies/yandex/articles...

                                    • Rygian 2 days ago

                                      Reading this excellent article made me wonder whether job interviews for developer positions include enough questions about queue management.

                                      "Ben" developed retries without exponential back-off, and only learned about that concept in code review. Exponential back-off should be part of any basic developer curriculum (except if that curriculum does not mention networks of any sort at all).

                                      • sim7c00 2 days ago

                                        If you have too many deep questions, you rule out a lot of eager juniors who can learn and grow on the job. It's a fine balance, but looking at the article, Ben is taking his lessons and growing; that's more important, I think, than having someone who's a guru from the get-go. Everyone has things they are better or worse at, and it's really a team effort to do everything right. Presumably someone reviewed and accepted his code, and that person also didn't catch it. No developer knows everything and produces perfect code and design; it's a well-balanced team that can move things in that direction.

                                        • Rygian 2 days ago

                                          I wholeheartedly agree, and realize my comment was not really clear.

                                          Any training curriculum needs to include exponential back-off as a core concept of any system-to-system interaction.

                                          Ben was let out of school without proper training. Kudos on the employer for finishing up the training that was missed earlier on.

                                      • brabarossa 4 hours ago

                                        Strange architecture. They clearly have a queue, but instead of checking the previous request, they create a new one. It's like they managed to get the worst of both pub/sub and a task queue.

                                        • duffmancd 2 days ago

                                          I missed it on the first read-through but there is a link to the code used to run the simulations in the first appendix.

                                          Homegrown Python code (i.e. not a library), very nicely laid out, and it would form a good basis for more experiments for anyone interested. I think I'll have a play around later and try to train my intuition.

                                          • easylion 2 days ago

                                            Really good article about retries, their consequences, and how load amplification works. Loved it.

                                            • azlev 2 days ago

                                              Good reading.

                                              In my last job, the service mesh was responsible for retries. It was a startup and the system was changing every day.

                                              After a while, we suspected that some services were not reliable enough and that retries were hiding this fact. Turning off retries exposed that quality had, in fact, gone down.

                                              In the end, we put retries in just some services.

                                              I never tested either retry budgets or deadline propagation. I will suggest them in the future.
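
                                              A retry budget can be sketched as a token bucket (the ratio and cap here are illustrative, not from any particular service mesh): first attempts deposit a fraction of a token, retries withdraw a whole one, so retry load is capped at roughly `ratio` of normal traffic.

```python
class RetryBudget:
    """Token-bucket retry budget: when the bucket is empty, retries are
    simply dropped, which caps amplification during an outage."""

    def __init__(self, ratio: float = 0.25, max_tokens: float = 100.0):
        self.ratio = ratio
        self.max_tokens = max_tokens
        self.tokens = max_tokens  # start with a full bucket

    def on_request(self) -> None:
        """Call on every first attempt; earns back retry capacity."""
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def can_retry(self) -> bool:
        """Spend one token per retry; refuse when the budget is exhausted."""
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```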

                                              • vrosas 8 hours ago

                                                Why not just add telemetry to see when requests are retried?

                                              • k3vinw 2 days ago

                                                Great food for thought! I'm currently on an endeavor at work to stabilize some pre-existing REST service integration tests executed in parallel.

                                                • sim7c00 2 days ago

                                                  Very nice read with lots of interesting points, examples, and examination; very thorough, imo. I'm not a microservices guy, but it covers a lot of general concepts that also apply outside that domain. Very good, thanks!