• 000ooo000 16 hours ago

    Strange choice of language for the actions:

    >To route traffic through the proxy to a web application, you *deploy* instances of the application to the proxy. *Deploying* an instance makes it available to the proxy, and replaces the instance it was using before (if any).

    >e.g. `kamal-proxy deploy service1 --target web-1:3000`

    'Deploy' is a fairly overloaded term already. Fun conversations ahead. Is the app deployed? Yes? No I mean is it deployed to the proxy? Hmm our Kamal proxy script is gonna need some changes and a redeployment so that it deploys the deployed apps to the proxy correctly.

    Unsure why they couldn't have picked something like 'bind', or 'intercept', or even just 'proxy'... why 'deploy'?

    • nahimn 13 hours ago

      “Yo dawg, i heard you like deployments, so we deployed a deployment in your deployment so your deployment can deploy” -Xzibit

      • chambored 6 hours ago

        "Pimp my deployment!"

      • irundebian 15 hours ago

        Deployed = running and registered to the proxy.

        • vorticalbox 12 hours ago

          Then wouldn’t “register” be a better term?

          • rad_gruchalski 10 hours ago

            It’s registered but is it deployed?

            • IncRnd 7 hours ago

              Who'd a thunk it?

        • 8organicbits 13 hours ago

          If your ingress traffic comes from a proxy, what would deploy mean other than that traffic from the proxy is now flowing to the new app instance?

        • viraptor 12 hours ago

          It's an interesting choice to make this a whole app, when zero-downtime deployments can be achieved trivially with other servers these days. For example, any app + web proxy that supports Unix sockets can do zero-downtime by moving the socket file: it's atomic, and you can send the warm-up requests with curl. Building a whole system with registration feels like overkill.
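
          A rough sketch of that socket-swap approach (the app command, paths and health endpoint are made up; it assumes the web proxy is pointed at /run/app.sock and relies on rename(2) being atomic):

              # start the new version on a temporary socket next to the live one
              ./myapp --listen unix:/run/app-new.sock &

              # warm it up before it takes real traffic
              curl --unix-socket /run/app-new.sock -fsS http://localhost/up

              # atomically swap the path the proxy connects to
              mv /run/app-new.sock /run/app.sock

              # the old process keeps serving its in-flight requests; stop it once they drain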

        • ksajadi 11 hours ago

          This primarily exists to take care of a fundamental issue in Docker Swarm (Kamal's orchestrator of choice) where replacing containers of a service disrupts traffic. We had the same problem (when building JAMStack servers at Cloud 66) and used Caddy instead of writing our own proxy and also looked at Traefik which would have been just as suitable.

          I don't know why Kamal chose Swarm over k8s or k3s (simplicity, perhaps?), but then complexity needs a home; you can push it around but you cannot hide it, hence a home-grown proxy.

          I have not tried Kamal proxy, but I am highly skeptical of something like this, because I am pretty sure I will be chasing it for support for everything from WebSockets to SSE to HTTP/3 to various types of compression and encryption.

          • hipadev23 10 hours ago

            I feel like you’re conflating the orchestration with proxying. There’s no reason they couldn’t be using caddy or traefik or envoy for the proxy (just like k8s ends up using them as an ingress controller), while still using docker.

            • ksajadi 9 hours ago

              Docker is the container engine. Swarm is the orchestration, the same as Kubernetes. The concept of "Service" in k8s takes care of a lot of the proxying, while still using Docker (not anymore tho). In Swarm, services exist but only take care of container lifecycle and not traffic. While networking is left to the containers, Swarm services always get in the way, causing issues that will require a proxy.

              In k8s for example, you can use Docker and won't need a proxy for ZDD (while you might want one for Ingress and other uses)

            • jauntywundrkind 10 hours ago

              Kamal feels built around the premise that "Kubernetes is too complicated" (after Basecamp got burned by some hired help), and from that justification it goes out and recreates a sizable chunk of the things Kubernetes does.

              Your list of things a reverse proxy might do is a good example to me of how I expect this to go: what starts out as an ambition to be simple inevitably has to grow more and more of the complexity it sought to avoid.

              Part of me strongly thinks we need competition & need other things trying to create broad, ideally extensible, ways of running systems. But a huge part of me sees Kamal & thinks, man, this is a lot of work being done only to keep walking backwards into the complexity they were trying to avoid. Usually second-system syndrome is the first system being simple and the second being overly complicated, and on the tin this case is the inverse, but man, the competency of Kube & its flexibility/adaptability as a framework for Desired State Management really shows through for me.

              • ksajadi 8 hours ago

                I agree with you and, at the risk of self-promotion, that's why we built Cloud 66 (which takes care of the Day-1 (build and deploy) as well as the Day-2 (scale and maintenance) parts of infrastructure). As we can all see, there is a lot more to this than just wrapping code in a Dockerfile and pushing it out to a Swarm cluster.

            • seized an hour ago

              This seems like something HAProxy can do effortlessly... Especially with its hitless reload.

              • simonw 9 hours ago

                Does this implement the “traffic pausing” pattern?

                That’s where you have a proxy which effectively pauses traffic for a few seconds - incoming requests appear to take a couple of seconds longer than usual, but are still completed after that short delay.

                During those couple of seconds you can run a blocking infrastructure change - could be a small database migration, or could be something a little more complex as long as you can get it finished in less than about 5 seconds.

                • tinco 6 hours ago

                  Have you seen that done in production? It sounds really dangerous, I've worked for an app server company for years and this is the first I've heard of this pattern. I'd wave it away if I didn't notice in your bio that you co-created Django so you've probably seen your fair share of deployments.

                  • written-beyond 6 hours ago

                    Just asking, isn't this what every serverless platform uses while it spins up an instance? Like it's why cold starts are a topic at all, or else the first few requests would just fail until the instance spun up to handle the request.

                    • simonw 5 hours ago

                      I first heard about it from Braintree. https://simonwillison.net/2011/Jun/30/braintree/

                    • francislavoie 4 hours ago

                      Caddy does this! (As you know, I think. I feel I remember we discussed this some time ago)

                      • ignoramous 8 hours ago

                        tbh, sounds like "living dangerously" pattern to me.

                        • francislavoie 4 hours ago

                          Not really, it works quite well as long as your proxy/server has enough memory to hold the requests for a little while. As long as you're not serving near your max load all the time, it's a breeze.

                      • blue_pants 15 hours ago

                        Can someone briefly explain how ZDD works in general?

                        I guess both versions of the app must be running simultaneously, with new traffic being routed to the new version of the app.

                        But what about DB migrations? Assuming the app uses a single database, and the new version of the app introduces changes to the DB schema, the new app version would modify the schema during startup via a migration script. However, the previous version of the app still expects the old schema. How is that handled?

                        • diggan 14 hours ago

                          First step is to decouple migrations from deploys; you want manual control over when the migrations run, contrary to many frameworks' default of running migrations when you deploy the code.

                          Secondly, each code version has to work with the current schema and the schema after a future migration, making all code effectively backwards compatible.

                          Your deploys end up being something like:

                          - Deploy new code that works with current and future schema

                          - Verify everything still works

                          - Run migrations

                          - Verify everything still works

                          - Clean up the acquired technical debt (the code that worked with the schema that no longer exists) at some point, or run out of runway and it won't be an issue
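
                          Sketched with Kamal-flavoured commands (the health URL and migrate task are placeholders for whatever your stack uses):

                              # 1. ship code that works with both the current and the future schema
                              kamal deploy

                              # 2. verify
                              curl -fsS https://app.example.com/up

                              # 3. run migrations as their own deliberate step
                              kamal app exec "bin/rails db:migrate"

                              # 4. verify again, then clean up the compatibility code in a later release
                              curl -fsS https://app.example.com/up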

                          • wejick 14 hours ago

                            This is a very good explanation, no judgment and simply educational. Appreciated.

                            Though I'm still surprised that some people run DB alterations on application startup. I never saw that in real life.

                            • whartung 9 hours ago

                              We do this. It has worked very well for us.

                              There are a couple of fundamental rules to follow. First, don't put anything with insane impact into the application deploy changes. 99% of the DB changes are very cheap and very minor. If a change is going to be very expensive, then just don't include it; we'll do it out of band. This has not been a problem in practice with our 20-ish person team.

                              Second, it's kind of like double-entry accounting. Once you've committed a change, you cannot go back and "fix it". If you did something really wrong (see above), then sure, but if not, you commit a correcting entry instead, because you don't know who has recently downloaded your commit and run it against their database.

                              The changes are a list of incremental steps that the system applies in order, if they had not been applied before. So, they are treated as, essentially, append only.

                              And it has worked really well for us, keeping the diverse developers who develop against local databases in sync with little drama.

                              I've incorporated the same concept in my GUI programs that stand up their own DB. It's a very simple system.
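
                              For illustration, the append-only change list ends up looking something like this (file names invented):

                                  # applied in order, never edited once committed
                                  db/changes/0001_create_invoices.sql
                                  db/changes/0002_add_invoices_due_date.sql
                                  db/changes/0003_fix_invoices_due_date_type.sql   # a correcting entry, not an edit to 0002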

                              • noisy_boy 2 hours ago

                                The main challenge I have noticed with that approach is maintaining the sequencing across different branches being worked upon by different developers - solvable by allocating/locking the numbers from a common place. The other is rolling back multiple changes for a given view/stored proc where, say, each change added a separate column - if only one is rolled back, how do you automate that? Easily done manually though.

                                • whartung 22 minutes ago

                                  I will say that stored procs are specifically messy, and we did not have many of those. They had a tendency to really explode the change file. With DDL, you can fix a table column in isolation. Fixing a typo in a 100 line stored proc is another 100 lines. And we certainly didn't have multiple people working on the same proc at the same time.

                                  We had no real need to address that aspect, and I would do something more formal with those if I had to, such as keeping a separate file with the stored proc and simply noting in the change file that it has changed. I mean, that's a bit of a trick.

                              • miki123211 13 hours ago

                                It makes things somewhat easier if your app is smallish and your workflow is something like e.g. Github Actions automatically deploying all commits on main to Fly or Render.

                                • e_y_ 9 hours ago

                                  At my company, DB migrations on startup was a flag that was enabled for local development and disabled for production deploys. Some teams had it enabled for staging/pre-production deploys, and a few teams had it turned on for production deploys (although those teams only had infrequent, minor changes like adding a new column).

                                  Personally I found the idea of having multiple instances running the same schema update job at the same time (even if locks would keep it from running in practice) to be concerning so I always had it disabled for deploys.

                                  • diggan 13 hours ago

                                    > Though I'm still surprised that some people run DB alteration on application start up

                                    I think I've seen it more commonly in the Golang ecosystem, for some reason. Also not sure how common it is nowadays, but I've seen lots of deployments (contained in Ansible scripts, Makefiles, Bash scripts or whatever) where the migration+deploy is run directly in sequence automatically for each deploy, rather than as discrete steps.

                                    Edit: Maybe it's more of an educational problem than something else, where learning resources either don't specify when to actually run migrations or straight up recommend people to run migrations on application startup (one example: https://articles.wesionary.team/integrating-migration-tool-i...)

                                  • svvvy 11 hours ago

                                    I thought it was correct to run the DB migrations for the new code first, then deploy the new code. While making sure that the DB schema is backwards compatible with both versions of the code that will be running during the deployment.

                                    So maybe there's something I'm missing about running DB migrations after the new code has been deployed - could you explain?

                                    • ffsm8 11 hours ago

                                      I'm not the person you've asked, but I've worked in devops before.

                                      It kinda doesn't matter which you do first. And if you squint a little, it's effectively the same thing, because the migration will likely only become available via a deployment too

                                      So yeah, the only thing that's important is that the DB migration can't cause an incompatibility with any currently deployed version of the code - and if it would, you'll have to split the change so it doesn't. It'll force another deploy for the change you want to make, but it's what you're forced to do if maintenance windows aren't an option. Which is kinda a given for most B2C products.

                                    • shipp02 9 hours ago

                                        So if you add any constraints/data, you can't rely on them being there until version n+2, or you need to have 2 paths: 1 for the old data, 1 for the new?

                                      • simonw 9 hours ago

                                        Effectively yes. Zero downtime deployments with database migrations are fiddly.

                                      • globular-toast 5 hours ago

                                          There's a little bit more to it. Firstly, you can deploy the migration first as long as it's forwards compatible (i.e. old code can read from it). That migration needs to be zero downtime; it can't, for example, rewrite whole tables or otherwise lock them, or requests will time out. Doing a whole new schema is one way to do it, but not always necessary. In any case you probably then need a backfill job to fill up the new schema with data before possibly removing the old one.

                                        There's a good post about it here: https://rtpg.co/2021/06/07/changes-checklist.html
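
                                          A minimal expand/backfill sketch of that, assuming Postgres and made-up table/column names:

                                              # expand: a nullable column with no default is a metadata-only change, no table rewrite
                                              psql "$DATABASE_URL" -c "ALTER TABLE users ADD COLUMN email_verified boolean;"

                                              # backfill: in batches, so no long-held locks
                                              psql "$DATABASE_URL" -c "UPDATE users SET email_verified = false WHERE email_verified IS NULL AND id <= 100000;"

                                              # contract: only once the old code is gone
                                              psql "$DATABASE_URL" -c "ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;"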

                                        • jacobsimon 14 hours ago

                                          This is the way

                                        • andrejguran 15 hours ago

                                          Migrations have to be backwards compatible so the DB schema can serve both versions of the app. It's an extra price to pay for having ZDD or rolling deployments and something to keep in mind. But it's generally done by all the larger companies

                                          • efxhoy 4 hours ago

                                            Strong Migrations helps with writing migrations that are safe for ZDD deploys. We use it in our Rails app; it catches quite a few potential footguns. https://github.com/ankane/strong_migrations

                                            • gsanderson 14 hours ago

                                              I haven't tried it but it looks like Xata has come up with a neat solution to DB migrations (at least for postgres). There can be two versions of the app running.

                                              https://xata.io/blog/multi-version-schema-migrations

                                              • efortis 15 hours ago

                                                Yes, both versions must be running at some point.

                                                The load balancer starts accepting connections on Server2 and stops accepting new connections on Server1. Then, Server1 disconnects when all of its connections are closed.

                                                It could be different Servers or multiple Workers on one server.

                                                During that window, as the other comments said, migrations have to be backwards compatible.

                                                • stephenr 14 hours ago

                                                  Others have described the how part if you do need truly zero downtime deployments, but I think it's worth pointing out that for most organisations, and most migrations, the amount of downtime due to a db migration is virtually indistinguishable from zero, particularly if you have a regional audience, and can aim for "quiet" hours to perform deployments.

                                                  • diggan 13 hours ago

                                                    > the amount of downtime due to a db migration is virtually indistinguishable from zero

                                                    Besides, once you've run a service for a while that has acquired enough data for migrations to take a while, you realize that there are in fact two different types of migrations. "Schema migrations" which are generally fast and "Data migrations" that depending on the amount of data can take seconds or days. Or you can do the "data migrations" when needed (on the fly) instead of processing all the data. Can get gnarly quickly though.

                                                    Splitting those also allows you to reduce maintenance downtime if you don't have zero-downtime deployments already.

                                                    • sgarland 11 hours ago

                                                      Schema migrations can be quite lengthy, mostly if you made a mistake earlier. Some things that come to mind are changing a column’s type, or extending VARCHAR length (with caveats; under certain circumstances it’s instant).

                                                      • lukevp 10 hours ago

                                                        Not OP, but I would consider this a data migration as well. Anything that requires an operation on every row in a table would qualify. Really changing the column type is just a built in form of a data migration.

                                                      • globular-toast 5 hours ago

                                                        Lengthy migrations don't matter. What matters is whether they hold long locks or not. Data migrations might take a while but they won't lock anything. Schema migrations, on the other hand, can easily do so, for example if you add a new column with a default value: the whole table must be rewritten and it's locked for the entire time.

                                                        • stephenr 13 hours ago

                                                          Very much so, we handle these very differently for $client.

                                                          Schema migrations are versioned in git with the app, with up/down (or forward/reverse) migration scripts and are applied automatically during deployment of the associated code change to a given environment.

                                                          SQL Data migrations are stored in git so we have a record but are never applied automatically, always manually.

                                                          The other thing we've used along these lines, is having one or more low priority job(s) added to a queue, to apply some kind of change to records. These are essentially still data migrations, but they're written as part of the application code base (as a Job) rather than in SQL.

                                                        • jakjak123 9 hours ago

                                                          Most apps are not affected by DB migrations, in the sense that migrations are run before the service starts the web server during boot. The database might block traffic for other already-running connections though, in which case you have a problem with your database design.

                                                      • shafyy 15 hours ago

                                                        Also exciting that Kamal 2 (currently RC https://github.com/basecamp/kamal/releases) will support auto-SSL and make it easy to run multiple apps on one server with Kamal.

                                                        • francislavoie 7 hours ago

                                                          They're using the autocert package, which is the bare minimum. It's brittle and doesn't allow for horizontal scaling of your proxy instances, because you're subject to Let's Encrypt rate limits and simultaneous cert limits. (Disclaimer: I help maintain Caddy.) Caddy/CertMagic solves this by writing the data to shared storage, so only a single cert will be issued and it's reused/coordinated across all instances through that storage. Autocert also doesn't have issuer fallback, doesn't do rate limit avoidance, doesn't respect ARI, etc.

                                                          Holding requests until an upstream is available is also something Caddy does well, just configure the reverse_proxy with try_duration and try_interval, it will keep trying to choose a healthy upstream (determined via active health checks done in a separate goroutine) for that request until it times out.
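
                                                          Roughly, in Caddyfile terms (hosts and the health path are placeholders):

                                                              example.com {
                                                                  reverse_proxy app-1:3000 app-2:3000 {
                                                                      lb_try_duration 5s      # keep retrying to pick a healthy upstream for up to 5s
                                                                      lb_try_interval 250ms   # pause between attempts
                                                                      health_uri /up          # active health check endpoint
                                                                      health_interval 2s
                                                                  }
                                                              }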

                                                          Their proxy header handling doesn't consider trusted IPs, so if it's enabled, someone could spoof their IP by setting X-Forwarded-For. At least it's off by default, but they don't warn about this.

                                                          This looks pretty undercooked. I get that it's simple and that's the point, but I would never use this for anything in its current state. There's just so many better options out there.

                                                          • markusw 7 hours ago

                                                            I just want to say I use Caddy for this exact thing and it works beautifully! :D Thank you all for your work!

                                                            • shafyy 5 hours ago

                                                              Interesting, thanks for the detailed explanation! I'm not very experienced with devops, so this is very helpful!

                                                          • kh_hk 15 hours ago

                                                            I don't understand how to use this, maybe I am missing something.

                                                            Following the example, it starts 4 replicas of a 'web' service. You can create a service by running a deploy to one of the replicas, let's say example-web-1. What do the other 3 replicas do?

                                                            Now, let's say I update 'web'. Let's assume I want to do a zero-downtime deployment. That means I should be able to run a build command on the 'web' service, start this service somehow (maybe by adding an extra replica), and then run a deploy against the new target?

                                                            If I run a `docker compose up --build --force-recreate web` this will bring down the old replica, turning everything moot.

                                                            Instructions unclear, can anyone chime in and help me understand?

                                                            • sisk 11 hours ago

                                                              For the first part of your question about the other replicas, docker will load balance between all of the replicas, either with a VIP or by returning multiple IPs in the DNS request[0]. I didn't check if this proxy balances across multiple records returned in a DNS request but, at least in the case of VIP-based load balancing, it should work like you would expect.

                                                              For the second part about updating the service, I'm a little less clear. I guess the expectation would be to bring up a differently-named service within the same network, and then `kamal-proxy deploy` it? So maybe the expectation is for service names to include a version number? Keeping the old version hot makes sense if you want to quickly be able to route back to it.

                                                              [0]: https://docs.docker.com/reference/compose-file/deploy/#endpo...
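
                                                              Concretely, I'd expect the update flow to look something like this (service/container names are invented):

                                                                  # bring up the new version alongside the old one
                                                                  docker compose up -d --no-deps web-v2

                                                                  # point the proxy at it; traffic switches to the new target
                                                                  kamal-proxy deploy web --target web-v2:3000

                                                                  # retire the old version once traffic has moved (or keep it hot for quick rollback)
                                                                  docker compose stop web-v1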

                                                              • thelastparadise 14 hours ago

                                                                Why would I not just do k8s rollout restart deployment?

                                                                Or just switch my DNS or router between two backends?

                                                                • joeatwork 14 hours ago

                                                                  I think this is part of a lighter weight Kubernetes alternative.

                                                                  • ianpurton 14 hours ago

                                                                    Lighter than the existing lightweight Kubernetes alternatives, i.e. k3s :)

                                                                    • diggan 14 hours ago

                                                                      Or, hear me out: Kubernetes alternatives that don't involve any parts of Kubernetes at all :)

                                                                  • ozgune 11 hours ago

                                                                    I think the parent project, Kamal, positions itself as a simpler alternative to K8s when deploying web apps. They have a question on this on their website: https://kamal-deploy.org

                                                                    "Why not just run Capistrano, Kubernetes or Docker Swarm?

                                                                    ...

                                                                    Docker Swarm is much simpler than Kubernetes, but it’s still built on the same declarative model that uses state reconciliation. Kamal is intentionally designed around imperative commands, like Capistrano.

                                                                    Ultimately, there are a myriad of ways to deploy web apps, but this is the toolkit we’ve used at 37signals to bring HEY and all our other formerly cloud-hosted applications home to our own hardware."

                                                                    • jgalt212 14 hours ago

                                                                      You still need some warm-up routine to run for the newly online server before the hand-off occurs. I'm not a k8s expert, but the above-described steps can be easily handled by a bash or fab script.
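
                                                                      Something along these lines is usually enough (host and paths are placeholders):

                                                                          # hit the new instance directly before it goes into rotation
                                                                          for path in / /login "/search?q=warmup"; do
                                                                            curl -fsS -o /dev/null "http://10.0.0.12:3000$path"
                                                                          done
                                                                          # ...then flip the proxy/DNS over to the new backend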

                                                                      • ahoka 13 hours ago

                                                                        What events do you mean? If the app needs a warm up, then it can use its readiness probe to ask for some delay until it gets request routed to it.

                                                                        • jgalt212 13 hours ago

                                                                          GET requests to pages that fill caches or those that make apache start up more than n processes.

                                                                        • thelastparadise 11 hours ago

                                                                          This is a health/readiness probe in k8s. It's already solved quite solidly.
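
                                                                          e.g. a readiness probe along these lines (path and port are whatever the app exposes; the pod only receives traffic once it passes):

                                                                              readinessProbe:
                                                                                httpGet:
                                                                                  path: /up
                                                                                  port: 3000
                                                                                initialDelaySeconds: 5
                                                                                periodSeconds: 2
                                                                                failureThreshold: 3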

                                                                      • mt42or 14 hours ago

                                                                        NIH. Nothing else to add.

                                                                      • ianpurton 14 hours ago

                                                                        DHH in the past has said "This setup helped us dodge the complexity of Kubernetes"

                                                                        But this looks somewhat like a re-invention of what Kubernetes provides.

                                                                        Kubernetes has come a long way in terms of ease of deployment on bare metal.

                                                                        • wejick 14 hours ago

                                                                          Zero-downtime deployment was around long before Kube. And this does look about as simple as it's ever been, not like Kube for sure.

                                                                        • moondev 14 hours ago

                                                                          Does this handle a host reboot?

                                                                          • francislavoie 4 hours ago

                                                                            In theory it should, because they do health checking to track status of the upstreams. The upstream server being down would be a failed TCP connection which would fail the health check.

                                                                            Obviously, rebooting the machine the proxy is running on is trickier though. I don't feel confident they've done enough to properly support having multiple proxy instances running side by side (no shared storage mechanism for TLS certs at least), which would allow upgrading one at a time and using a router/firewall/DNS in front of it to route to both normally, then switch it to one at a time while doing maintenance to reboot them, and back to both during normal operations.

                                                                          • risyachka 14 hours ago

                                                                            Did they mention anywhere why they decided to write their own proxy instead of using Traefik or something else battle tested?

                                                                            • yla92 14 hours ago

                                                                              They were actually using Traefik until this "v2.0.0" (pre-release right now) version.

                                                                              There is some context about why they switched and decided to roll their own in the PR:

                                                                              https://github.com/basecamp/kamal/pull/940

                                                                              • stackskipton 6 hours ago

                                                                                As an SRE, that PR scares me. There is no long explanation of why they're throwing out third-party, extremely battle-tested HTTP proxy software for their own homegrown one, except "Traefik didn't do what we wanted 100%".

                                                                                Man, I've been there, wishing third-party software had some feature, but writing your own is the WORST thing you can do for a company 9/10 times. My current company is dealing with massive tech debt because of all this homegrown software.

                                                                            • oDot 16 hours ago

                                                                              DHH mentioned they built it to move from the cloud to bare metal. He glorifies the simplicity but I can't help thinking they are a special use case of predictable, non-huge load.

                                                                              Uber, for example, moved to the cloud. I feel like in the span between them there are far more companies for which Kamal is not enough.

                                                                              I hope I'm wrong, though. It'll be nice for many companies to have the choice of exiting the cloud.

                                                                              • martinald 15 hours ago

                                                                                I don't think that's the real point. The real point is that 'big 3' cloud providers are so overpriced that you could run hugely over provisioned infra 24/7 for your load (to cope with any spikes) and still save a fortune.

                                                                                The other thing is that cloud hardware is generally very very slow and many engineers don't seem to appreciate how bad it is. Slow single thread performance because of using the most parallel CPUs possible (which are the cheapest per W for the hyperscalers), very poor IO speeds, etc.

                                                                                So often a lot of this devops/infra work is solved by just using much faster hardware. If you have a fairly IO heavy workload then switching from slow storage to PCIe4 7gbyte/sec NVMe drives is going to solve so many problems. If your app can't do much work in parallel then CPUs with much faster single threading performance can have huge gains.

                                                                                • igortg 11 hours ago

                                                                                  I'm using a managed Postgres instance at a well known provider and holy shit, I couldn't believe how slow it is. For small datasets I couldn't notice, but when one of the tables reached 100K rows, queries started to take 5-10 seconds (the same query takes 0.5-0.6 seconds on my standard i5 Dell laptop).

                                                                                  I wasn't expecting blazing speed on the lowest tier, but 10x slower is bonkers.

                                                                                  • mrkurt 2 hours ago

                                                                                    Laptop SSDs are _shockingly_ fast, and getting equivalent speed from something in a datacenter (where you'll want at least two disks) is pretty expensive. It's so annoying.

                                                                                    • mwcampbell 7 minutes ago

                                                                                      To clarify, are you talking about when you buy your own servers, or when you rent from an IaaS provider?

                                                                                  • sgarland 11 hours ago

                                                                                    > The other thing is that cloud hardware is generally very very slow and many engineers don't seem to appreciate how bad it is.

                                                                                    This. Mostly disk latency, for me. People who have only ever known DBaaS have no idea how absurdly fast they can be when you don’t have compute and disk split by network hops, and your disks are NVMe.

                                                                                    Of course, it doesn’t matter, because the 10x latency hit is overshadowed by the miasma of everything else in a modern stack. My favorite is introducing a caching layer because you can’t write performant SQL, and your DB would struggle to deliver it anyway.

                                                                                    • jsheard 13 hours ago

                                                                                      It's sad that what should have been a huge efficiency win, amortizing hardware costs across many customers, ended up often being more expensive than just buying big servers and letting them idle most of the time. Not to say the efficiency isn't there, but the cloud providers are pocketing the savings.

                                                                                      • toomuchtodo 13 hours ago

                                                                                        If you want a compute co-op, build a co-op (think VCs building their own GPU compute clusters for portfolio companies). Public cloud was always about using marketing and the illusion of need for dev velocity (which is real, hypergrowth startups and such, just not nearly as prevalent as the zeitgeist would have you believe) to justify the eye watering profit margin.

                                                                                        Most businesses have fairly predictable interactive workload patterns, and their batch jobs are not high priority and can be managed as such (with the usual scheduling and bin packing orchestration). Wikipedia is one of the top 10 visited sites on the internet, and they run in their own datacenter, for example. The FedNow instant payment system the Federal Reserve recently went live with still runs on a mainframe. Bank of America was saving $2B a year running their own internal cloud (although I have heard they are making an attempt to try to move to a public cloud).

                                                                                        My hot take is public cloud was an artifact of ZIRP and cheap money, where speed and scale were paramount, cost being an afterthought (Russ Hanneman pre-revenue bit here, "get big fast and sell"; great fit for cloud). With that macro over, and profitability over growth being the go forward MO, the equation might change. Too early to tell imho. Public cloud margins are compute customer opportunities.

                                                                                        • miki123211 12 hours ago

                                                                                          Wikipedia is often brought up in these discussions, but it's a really bad example.

                                                                                          To a vast majority of Wikipedia users who are not logged in, all it needs to do is show (potentially pre-rendered) article pages with no dynamic, per-user content. Those pages are easy to cache or even offload to a CDN. For all the users care, it could be a giant key-value store, mapping article slugs to HTML pages.

                                                                                          This simplicity allows them to keep costs down, and the low costs mean that they don't have to be a business and care about time-on-page, personalized article recommendations or advertising.

                                                                                          Other kinds of apps (like social media or messaging) have very different usage patterns and can't use this kind of structure.

                                                                                          • toomuchtodo 12 hours ago

                                                                                            > Other kinds of apps (like social media or messaging) have very different usage patterns and can't use this kind of structure.

                                                                                            Reddit can’t turn a profit, Signal is in financial peril. Meta runs their own data centers. WhatsApp could handle ~3M open TCP connections per server, running the operation with under 300 servers [1] and serving ~200M users. StackOverflow was running their Q&A platform off of 9 on prem servers as of 2022 [2]. Can you make a profitable business out of the expensive complex machine? That is rare, based on the evidence. If you’re not a business, you’re better off on Hetzner (or some other dedicated server provider) boxes with backups. If you’re down you’re down, you’ll be back up shortly. Downtime is cheaper than five 9s or whatever.

                                                                                            I’m not saying “cloud bad,” I’m saying cloud where it makes sense. And those use cases are the exception, not the rule. If you're not scaling to an event where you can dump these cloud costs on someone else (acquisition event), or pay for them yourself (either donations, profitability, or wealthy benefactor), then it's pointless. It's techno performance art or fancy make work, depending on your perspective.

                                                                                            [1] https://news.ycombinator.com/item?id=33710911

                                                                                            [2] https://www.datacenterdynamics.com/en/news/stack-overflow-st...

                                                                                      • miki123211 13 hours ago

                                                                                        You can always buy some servers to handle your base load, and then get extra cloud instances when needed.

                                                                                        If you're running an ecommerce store for example, you could buy some extra capacity from AWS for Christmas and Black Friday, and rely on your own servers exclusively for the rest of the year.

                                                                                      • olieidel 16 hours ago

                                                                                        > I feel like in the span between them there are far more companies for which Kamal is not enough.

                                                                                        I feel like this is a bias in the HN bubble: In the real world, 99% of companies with any sort of web servers (cloud or otherwise) are running very boring, constant, non-Uber workloads.

                                                                                        • ksec 13 hours ago

                                                                                            Not just HN but the whole internet overall, because all the news, articles and tech achievements are pumped out by Uber and other big tech companies.

                                                                                            I am pretty sure Uber belongs to the 1% of internet companies in terms of scale. 37signals isn't exactly small either. They spent $3M a year on infrastructure in 2019. Likely a lot higher now.

                                                                                            The whole tech cycle needs to stop having a top-down approach where everyone does what Big Tech is using. Instead we should try to push the simplest tool from the low end all the way to the 95% mark.

                                                                                          • nchmy 13 hours ago

                                                                                              They spend considerably less on infra now - this was the entire point of moving off cloud. DHH has written and spoken lots about it, providing real numbers. They bought their own servers and the savings paid for it all in like 6 months. Now it's just money in the bank until they replace the hardware in 5 years.

                                                                                            Cloud is a scam for the vast majority of companies.

                                                                                        • toberoni 15 hours ago

                                                                                          I feel Uber is the outlier here. For every unicorn company there are 1000s of companies that don't need to scale to millions of users.

                                                                                              And due to the insane markup of many cloud services, it can make sense to just use beefier servers 24/7 to deal with the peaks. In my experience, crazy traffic outliers that need sophisticated auto-scaling rarely happen outside of VC-fueled growth trajectories.

                                                                                          • appendix-rock 16 hours ago

                                                                                            You can’t talk about typical cases and then bring up Uber.

                                                                                            • pinkgolem 16 hours ago

                                                                                                I mean, most B2B companies have a pretty predictable load when providing services to employees.

                                                                                                I can get weeks of advance notice before we have a load increase from new users.

                                                                                            • rohvitun 13 hours ago

                                                                                              Aya

                                                                                              • 0xblinq 13 hours ago

                                                                                                3 years from now they'll have invented their own AWS. NIH syndrome in full swing.

                                                                                                • bdcravens 12 hours ago

                                                                                                  It's a matter of cost, not NIH syndrome. In Basecamp's case, saving $2.4M a year isn't something to ignore.

                                                                                                  https://basecamp.com/cloud-exit

                                                                                                  Of course, it's fair to say that rebuilding the components that the industry uses for hosting on bare metal is NIH syndrome.

                                                                                                  • mannyv 7 hours ago

                                                                                                     A new proxy is a proxy filled with issues. It's nice that it's Go, but in production I'd go with nginx or something else and replay traffic to Kamal. There are enough weird behaviors out there (and bad actors) that I'd be worried about exploits etc.