• fjni an hour ago

    Wait… railway runs on GCP? Didn’t they make a whole thing about not “building a cloud on top of another cloud?”

    Or did they just mean that they’re not renting VPSs but only metal from the cloud provider?

    In my mind I was so excited that there was another provider not just paying one of the hyperscalers but at a minimum colocating and owning more of their stack. https://blog.railway.com/p/heroku-walked-railway-run

    • miniman1337 an hour ago

      from the blog linked via Wayback Machine. "From Day 1, we had this notion at the forefront.

      The other notion that we have intuited is that you can’t build a cloud on another cloud. We have devoted years of practice running our own metal (and playing well with other clouds) to make sure that Railway’s business, which invariably becomes your customer’s business, is as rock solid as possible."

      • MrDarcy 14 minutes ago

        That’s strange, when I interviewed with the founder a few years ago he told me they were on AWS wanting to move to firecracker.

      • eoswald an hour ago

        Yep, and this is why I'm pissed. They lied. They're completely dependent on GCP. So, I gotta do some research; I need something a little more stable (and less dependent on one company's whims) than this. This is bad for them, because it really strikes at the heart of their 'big claim,' peaceful software deployments. This is chaos.

        • ndneighbor 41 minutes ago

          Yea, I mean, that's the whole MO of our platform and we failed at that. So yea, that's disappointing and more so for our customers.

          I can provide an explanation about the GCP dependency. Yes, we host workloads off GCP, and we have been able to build a good business by performing a cloud exit. However, we were worried that we would have a circular dependency on our own cloud. I don't think we expected to get auto-modded out of our own account, hence we left our DB on CloudSQL.

          It was never our intent to deceive people that we didn't own our own destiny with our business. The last GCP issue, we were assured that this scenario wouldn't happen (when we got auto-ratelimited, which was bad, but survivable) - but it seems like we have further work to do. Apologies.

          • fontain 35 minutes ago

            I’m very sympathetic and understand that decisions are easy to criticize in hindsight but leaving your database in GCP while moving everything else to your own data centres seems so backwards I can’t even begin to imagine how that could happen. Was this really an intentional design decision?

            • arjie 22 minutes ago

              I have exactly the same architecture. You can easily administer a postgres/mysql on your own infrastructure, but it's also the one thing where backups and availability are super strict. I can easily support multi-region in Google Cloud or AWS and that's way harder to do on-prem, and it's also hard to handle the replication story as safely as with Google Cloud. The hope is that GCP et al. give you safety and availability for the control plane stuff and you can run your data plane on-prem.

              At $2m/mo spend, this kind of thing is insane. GCP has never been the most reliable of clouds but this is pretty awful. I would never have expected this.

              • ndneighbor 26 minutes ago

                > decisions are easy to criticize in hindsight

                I mean, the pain we have caused our customers ultimately proves you correct. That said, we made our decisions with the information and constraints that we knew at that moment in time. Railway has hosts in AWS, GCP, and co-los, so coordinating those workloads in a fully distributed manner would be ideal, but at the end of the day, we didn't foresee that we would have our project deleted just like that.

                (Even if we did get assurances from them in 2024, that it wouldn't happen again, although we just got auto-rate limited the last time.)

                • r_lee 17 minutes ago

                  could you clarify, did an automated process by Google delete a GCP project/account/resource(s)? like, what exactly were you seeing when trying to get access or see what happened?

                  • ndneighbor a minute ago

                    They deleted our GCP proj. sans warning. Still working the details, but that's how this whole thing began.

        • eoswald an hour ago

          Sorry, I have a hard time blaming Google for this, when Railway seems to be having increasing trouble keeping the platform stable. Something like this should NOT take down an ENTIRE service. There should be a backup when literally your business is about being the reliable backend. This just seems like poor planning to me.

          • ryanisnan an hour ago

            I don't quite know what you mean. Do you really expect Railway to use a multi-cloud architecture to host all of their clients' projects? I suspect that would lead to lower availability, all things considered.

            • eoswald an hour ago

              Well, by the same token, is it smart to base your ENTIRE architecture on a single cloud? Isn't that why some of us build in fallbacks for AWS-hosted services? I mean, their entire platform, both public and private facing, is running on the same thing. One error, one problem, takes out the entire service.

            • impulser_ an hour ago

              They literally own their own data centers. That's what's surprising about this. They are lying to their customers when they say they operate their own data center, because obviously they don't if everyone's apps are down with GCP blocking their account.

              • brookst 33 minutes ago

                Is it not possible that they own their own data center and have an unfortunate Google dependency?

                Obviously a fiasco but I’m not prepared to call them liars when it could be an honest mistake.

                • Terr_ 25 minutes ago

                  I imagine there's also an important difference between:

                  1. We depend on X but could gracefully migrate to an alternate in a week if we really needed to.

                  2. All data is mirrored instantly so that we can do seamless fail-over in case X has its own outage.

                • ryanisnan an hour ago

                  Oh, I see what you mean. Eh, it's possibly the same reason that AWS essentially goes down when us-east-1 goes down.

              • cactusplant7374 an hour ago

                Disaster recovery is pretty expensive, right? Especially for their size.

              • Avicebron an hour ago

                Isn't Railway the "the API key to delete the backups is in the prod database, because that's where the backups live duh" guys?

                • enahs-sf an hour ago

                  I respect what railway is doing but also would never run my business on such a platform.

                  • eoswald an hour ago

                    Today changed my opinion on them completely. Was willing to give them the benefit of the doubt that they're growing fast, but now seeing that they've failed to scale properly, and are missing little things that become big things later. I can't take that risk.

                    • dpark an hour ago

                      That kind of sounds like you don’t respect what they are doing.

                    • TheTaytay 31 minutes ago

                      I’ve seen a few smug “all your eggs in one basket” comments here.

                      I’m aware of some companies hosting their own metal and infra, but I’m not aware of large companies mitigating risk by hosting on separate cloud providers as a fallback mechanism. We might disagree with cloud provider choice, or think they should have been hosting their own metal, but that’s still an “all your eggs in one basket” choice, right?

                      Heck, they might even have multi-region fallback with GCP, but if GCP bans your account, that doesn’t matter.

                      Are there good examples of running a company of railway’s size so redundantly that their host could nuke one of their accounts and they’d just keep on trucking?

                      • fontain 29 minutes ago

                        They do run their own metal. That’s their entire ethos. Railway is their own cloud.

                        • chradams 24 minutes ago

                          Just google multi-cloud. Yes. It's a thing.

                          • wmf 10 minutes ago

                            99% of multi-cloud is fake though. True multi-cloud is incredibly rare.

                        • faangguyindia 2 hours ago

                          Google Cloud also locked out a Korean government organization recently. The guy posted on the GCP subreddit.

                          Google really need to improve their support team. It's strange such a big corp can't even afford to have proper support team.

                          • choilive 18 minutes ago

                            Not strange, Google has never had a proper support team unless you are an "Enterprise" level customer.

                            • benwoodward 20 minutes ago

                              pretty sure their support team is a flaky ML model that is haplessly flagging random accounts

                              • danpalmer an hour ago

                                > It's strange such a big corp can't even afford to have proper support team

                                Railway say they are in touch with that support team.

                                • shooker435 25 minutes ago

                                  god help them

                                • King-Aaron an hour ago

                                  > It's strange such a big corp can't even afford to have proper support team

                                  This seems to be by design.

                                  • ndneighbor 40 minutes ago

                                    We have a CSM, Head of Customer Support contact, and further contacts with GCP. Despite that, we still had this issue.

                                  • add-sub-mul-div 31 minutes ago

                                    Automating support, automating everything is the key to their whole deal. Tech giants leapfrogged the rest of the economy by innovating a company that can scale its customers without having to scale itself proportionally.

                                  • whh 27 minutes ago

                                    This could kill a startup. I really don't like Google's automated and silent account murder functionality.

                                    • MrDarcy 11 minutes ago

                                      There’s no way this was automated or silent.

                                      The only reasonable explanation is Railway lost control of their estate and something was happening that warranted a group of humans to decide flipping the kill switch was the best of a set of bad alternatives.

                                      • macintux 4 minutes ago

                                        You’re giving Google far more credit than they’ve earned.

                                    • dwa3592 28 minutes ago

                                      Wait, I thought Railway was a cloud provider like AWS or GCP, but better and more agile. At least that's the impression I got from their website.

                                      • Mengkudulangsat an hour ago

                                        That explains why all my vibe-coded hobby projects are down.

                                        Thank God I'm not dealing with any public-facing sites! Would have been an expensive lesson for a newbie coder if my job depended on this.

                                        • throwaranay4933 3 hours ago

                                          This screenshot from Discord suggests that the outage was caused by an automated GCP account ban: https://x.com/acgfbr/status/2056866780866351323

                                          • brokenodo an hour ago

                                            I’m a new customer and have been falling in love with Railway over the last 2 weeks, but this is quite the wake up call.

                                            • choilive 16 minutes ago

                                              Been a customer with them for over a year now, small incidents here and there but never anything this major.

                                              • csw-001 an hour ago

                                                Literally in the same boat. I've been really happy with it, but this is a major eye opener... It's been down for a looooong time by provider standards.

                                                • reelvideocap an hour ago

                                                  same

                                                • TheAtomic 34 minutes ago

                                                  same same

                                                • Drew-Aetherwave 25 minutes ago

                                                  It is killing me...

                                                  • Osborn_Ojure 20 minutes ago

                                                    compute recovered, get ready boys!

                                                      • bshack0 an hour ago

                                                        so....what are we switching to y'all? cloud-run ? ;P

                                                        • auxiliarymoose an hour ago

                                                          federated hardware (a bunch of raspberry pis networked into a high availability kubernetes cluster, hidden across various local coffee shops for free power and bandwidth)

                                                          • throwatdem12311 an hour ago

                                                            raspberry-pi cluster in my closet

                                                            • frio 5 minutes ago

                                                              16GiB Raspberry Pi 5s in my country are now going for ~$450USD, so I've gotta say that's out of reach for me now :(.

                                                          • ryanisnan an hour ago

                                                            Yikes. I was wondering why my TLS certs were coming up as invalid.

                                                            • mcontrerazCL 2 hours ago

                                                              all my fkn postgres DBs are in Railway! what do I do now?

                                                              • eoswald 44 minutes ago

                                                                Hahah at least you're not getting called every five minutes because you can't shut off the alerts, because it's apparently deployed SOMEWHERE but good luck finding how to access it. Can't wait to see the bill from Twilio because of this lol

                                                                • cactusplant7374 an hour ago

                                                                  Take a walk. Breathe in the fresh air. It feels good.

                                                                • iloveplants 3 hours ago

                                                                  seems like it's every day

                                                                      • rekabis an hour ago

                                                                        TL;DR: putting all your eggs into one basket is bad, man.

                                                                        • lfx an hour ago

                                                                          That’s true; however, having only a few eggs and shopping for several baskets does not make sense in the early days. Not sure how big Railway is, but usually you start small with one egg.

                                                                          • christophilus 44 minutes ago

                                                                            You’d think they wouldn’t have started with GCP. There are plenty of datacenters where you can buy racks and racks of servers, and talk to a human when something goes wrong, and even walk in and access your servers. That’s what I’d be using if I were to build a Rackspace today.

                                                                            • tomschlick 38 minutes ago

                                                                              They started on GCP and have been migrating to their own "Metal" DC doing exactly what you're describing. But GCP is still their overflow given how rapidly they are growing and holds some amount of networking that routes to their DC.

                                                                              • wmf 6 minutes ago

                                                                                Colo is worse than cloud when you're getting started. Sure, you can talk to a person but everything else is much lower quality. People are obsessed with having someone to yell at but yelling does not fix outages.

                                                                          • bshack0 an hour ago

                                                                            so...what are we switching to yall? cloud-run :P