• shaggie76 3 hours ago

    We had a similar CDN problem releasing major Warframe updates: our CDN partner would inadvertently DDoS our origin servers when we launched an update, because thousands of cold edges would call home simultaneously as all players relogged at the same time.

    One CDN vendor didn't even offer a tiered distribution system, so every edge called home at the same time. Another vendor did have a tiered distribution system designed to avoid this problem, but it was overwhelmed by the absurd number of files we'd serve multiplied by the large user count, so we'd still end up with too much traffic hitting the origin.

    The interesting thing was that no vendor we evaluated offered a robust preheating solution, if they offered one at all. One vendor even went so far as to say they wouldn't allow it, because it would let customers unfairly dominate the shared storage cache at the edge (which felt a bit like airlines overbooking seats on a flight to me).

    These days we run an army of VMs that fetch all assets from every point of presence we can cover right before launching an update.
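
    A rough sketch of what that tooling looks like, in TypeScript (the hostnames, paths, and concurrency figure are all made up, and it assumes the vendor exposes per-PoP endpoints you can fetch from directly; it is not our actual tooling):

        // warm-edges.ts -- illustrative sketch only.
        // Assumes each point of presence is reachable via its own
        // hostname (e.g. vendor-provided per-PoP endpoints).
        const popHosts = ["pop-ams.cdn.example.com", "pop-iad.cdn.example.com"];
        const assetPaths = ["/updates/36.1/manifest.bin", "/updates/36.1/0001.cache"];
        const CONCURRENCY = 32; // parallel fetches per warmer VM

        async function warm(): Promise<void> {
          const jobs: string[] = [];
          for (const host of popHosts)
            for (const path of assetPaths) jobs.push(`https://${host}${path}`);

          // Simple worker pool: N workers pull URLs off a shared queue.
          const workers = Array.from({ length: CONCURRENCY }, async () => {
            let url: string | undefined;
            while ((url = jobs.pop()) !== undefined) {
              const res = await fetch(url);
              await res.arrayBuffer(); // drain the body so the edge caches the whole file
              console.log(res.status, url);
            }
          });
          await Promise.all(workers);
        }

        warm().catch(console.error);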

    Another thing we've had to deal with, mentioned in the article, is overloading back-end nodes. Our solution is somewhat ham-fisted but works quite well for us: we cap the connection counts to the back end and return 503s when we saturate. The trick, however, is getting your load balancer to leave the client connection open when this happens -- by default, multiple LBs we've used would slam the connection closed, so that when you're serving up 50K 503s a second the firewall would buckle under the runaway connection pool lingering in TIME_WAIT. Good times.
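
    For what it's worth, the cap-and-503 part looks roughly like this in TypeScript (Node; the cap and the proxy stub are placeholders, not our real setup). Note that Node's HTTP/1.1 server keeps the client connection alive on a 503 by default, which is exactly the behavior we had to fight some LBs for:

        // cap-and-503.ts -- minimal sketch of the idea, not production code.
        import http from "node:http";

        const MAX_IN_FLIGHT = 1000; // tune to what the back end can actually take
        let inFlight = 0;

        const server = http.createServer(async (req, res) => {
          if (inFlight >= MAX_IN_FLIGHT) {
            // Shed load fast, but do NOT send "Connection: close" -- the
            // client can retry on the same socket instead of reconnecting,
            // so you don't accumulate sockets stuck in TIME_WAIT.
            res.writeHead(503, { "Retry-After": "1" });
            res.end("busy");
            return;
          }
          inFlight++;
          try {
            await handleProxied(req, res); // forward to the real back end
          } finally {
            inFlight--;
          }
        });

        // Stand-in for the actual proxying logic.
        async function handleProxied(req: http.IncomingMessage, res: http.ServerResponse) {
          res.writeHead(200);
          res.end("ok");
        }

        server.listen(8080);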

    • bolognafairy 2 hours ago

      Really one of those “has anyone that built this tried using it for its intended purpose?” things. Not having a carefully considered cache warming solution* is like… building a CDN from a description someone gave you instead of actually understanding the problem a CDN sets out to solve.

      * EDIT: actually, any solution that at least attempts to mitigate a thundering herd. I am at least somewhat empathetic to the “indiscriminately allowing pre-warming destroys the shared cache” viewpoint. But there are still plenty of things that can be done!

    • Animats 2 hours ago

      This problem is similar to what electric utilities call "cold load pickup". After a power outage, when power is turned back on, many loads draw more power at startup than they do in steady state.

      The shortest-term effects are power supplies recharging their capacitors and incandescent bulbs warming up. That's over within a second.

      Then it's the motors, which draw 2x-3x their running load when starting as they bring their rotating mass up to speed. That extra load lasts for tens of seconds.

      If power has been off for more than a few minutes, everything in heating and cooling that normally cycles on and off will want to start at once. That high load lasts for minutes.

      Bringing up a power grid is thus done by sections, not all at once.

      • _heimdall 2 hours ago

        I live in a somewhat rural area and we got bit hard by this last winter.

        Our road used to have a handful of houses on it but now has around 85 (a mix of smaller lots around an acre and larger farming parcels). Power infrastructure to our street hasn't been updated recently and it just barely keeps up.

        We had a few days that didn't get above freezing (very unusual here). Power was out for about 6 hours after a limb fell on a line. The power company was actually pretty quick to fix it, but the power then went out 3 more times in rapid succession.

        Apparently a breaker kept tripping as every house regained power and all the various compressors surged on. The solution at the time was for them to jam in a larger breaker. I hope they came back pretty quickly to undo that "fix", but we still haven't had any infrastructure updates to increase capacity.

        • alvah an hour ago

          "The solution at the time was for them to jam in a larger breaker"

          I've seen some cowboy sh!t in my time but jeez, that's rough.

      • emmanueloga_ an hour ago

        The whole incident report is interesting, but I feel like the most important part of the solution is buried here [0]:

        * "We're adding timeouts to prevent user requests from waiting excessively long to retrieve assets."

        When you get to the size of Canva, you can't forget your AbortController and exponential backoff on your Fetch API calls.
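
        Something like this sketch (the timeout, retry count, and backoff base are made-up numbers, not Canva's):

            // fetch-with-backoff.ts -- illustrative sketch.
            // Abort slow asset fetches and retry with exponential backoff
            // plus jitter, so a degraded origin isn't held down by piles
            // of long-lived, synchronized retries.
            async function fetchAsset(url: string, attempts = 5): Promise<Response> {
              for (let i = 0; i < attempts; i++) {
                const controller = new AbortController();
                const timer = setTimeout(() => controller.abort(), 5_000); // 5s cap
                try {
                  const res = await fetch(url, { signal: controller.signal });
                  if (res.ok || res.status < 500) return res; // only retry 5xx
                } catch {
                  // aborted or network error; fall through to backoff
                } finally {
                  clearTimeout(timer);
                }
                // Full jitter: sleep a random 0..(250ms * 2^i)
                const delay = Math.random() * 250 * 2 ** i;
                await new Promise((r) => setTimeout(r, delay));
              }
              throw new Error(`gave up fetching ${url} after ${attempts} attempts`);
            }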

        --

        0: https://www.canva.dev/blog/engineering/canva-incident-report...

        • tryauuum 20 minutes ago

          fuck canva, I remember visiting it from Georgia and being greeted by a non-working page and a banner shaming me for the war in Ukraine

          I know there's probably some US sanctions list somewhere that the company had to adhere to. But experiencing it in Georgia, where streets are covered with Ukrainian flags and people are very open with their opinions on the war, is just surreal.

          • perching_aix 10 minutes ago

            that indeed sounds remarkably puzzling, so much so that i find it a bit hard to believe