• carlsborg 16 hours ago

    The main lichess engine (lila, open source) is a single monolithic program that's deployed on a single server. It serves ~5 million games per day. But there are several other pieces too. They discuss the architecture here: https://www.youtube.com/watch?v=crKNBSpO2_I

    BTW consider donating if you use lichess.

    • justinclift 15 hours ago

      Wow. ~US$40k/mo running costs, with about US$5k/mo for server hosting:

      https://lichess.org/costs

      It looks like the servers are individually managed via OVH or similar, rather than running their own gear in co-location or similar. Wonder why?

      • tormeh 9 hours ago

        Easy: If something is wrong with the physical gear it's OVH's problem rather than theirs. It also means no one has to ever go to the data center which is probably important for a geographically distributed team (I assume they are). Cheap, no-frills cloud is extremely underrated, IMO.

        • squigz 12 hours ago

          Surprising numbers, and really goes to show how cheap the hardware/software side is for this sort of thing if you do it right.

          I wonder what the "Misc dev salaries" is for - only curious because it's a flat $5k

          • justinclift 11 hours ago

            Heh heh heh.

            To me those numbers seem on the high side as I'm (personally) used to (for cheap projects) scavenging together stuff from Ebay before deploying to a data centre. ;)

            • squigz 11 hours ago

              lichess is hardly a "cheap project" though :P It's one of the most popular chess platforms

              • justinclift 10 hours ago

                Sure, but they seem to be extremely budget constrained. ;)

                • me_me_me 7 hours ago

                  no surprise there tbh

                  Here is a comparison of their free and premium accounts:

                  https://lichess.org/features

                  • justinclift 6 hours ago

                    Looks like they're fulfilling their mission?

          • benmmurphy 6 hours ago

            It's also crazy how much cheaper it is than AWS. The primary DB server is around $500/month with 32 CPUs, 256 GB of RAM, and 7 TB of storage. AWS RDS's db.m6gd.8xlarge, which has 32 CPUs and 128 GB of RAM, costs $2,150/month before you even pay for storage.
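
            A back-of-envelope check of those figures (the prices are the ones quoted above; AWS storage costs excluded):

```shell
# Rough comparison using the numbers quoted in the comment above.
ovh=500    # ~$/month: dedicated box, 32 CPU / 256 GB RAM / 7 TB disk
aws=2150   # ~$/month: RDS db.m6gd.8xlarge, 32 CPU / 128 GB RAM, storage extra
echo "AWS is $(( aws / ovh ))x+ the price, with half the RAM and no storage"
```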

            • bryan_w 5 hours ago

              Yeah, but you get what you pay for. That m6gd.8xlarge would never be subject to such a long network outage: once the hardware fault was detected, it would be moved to another machine.

              • beaviskhan 40 minutes ago

                Yup, and you also get to make AWS deal with OS upgrades, DB upgrades, backups, etc.

            • hilux 5 hours ago

              I'm a patron!

              I really appreciate the benefits package for patrons. Thibault is zee best.

            • theideaofcoffee 5 hours ago

              I guess some of my questions are addressed in the latter half of the post, but I'm still puzzled why a prominent service didn't have a plan for what looked like a run-of-the-mill hardware outage. It's hard to know exactly what happened, as I'm having trouble parsing some of the post (what is a 'network connector'? A cable? A NIC?). What were some of the 'increasingly outlandish' workarounds? Are they actually standing up production hosts manually, and was that the cause of the delay or the unwillingness to get new hardware going? I think it would be important to have all of that set down either in documentation or code, seeing as most of their technical staff are either volunteers, who may come and go, or part-timers. Maybe they did; it's not clear.

              It's also weird seeing that they are still waiting on their provider to tell them exactly what was done to the hardware to get it going again, that's usually one of the first things a tech mentions: "ok, we replaced the optics in port 1" or "I replaced that cable after seeing increased error rates", something like that.

              • holsta 9 hours ago

                This response and post-mortem are superior to those of most commercial services I have seen in recent years.

                • hyperbovine 7 hours ago

                  That's basically every aspect of their service. The founder Thibault Duplessis is criminally undercompensated (his choice) for running a site that is better designed, faster, and more popular than 99% of commercial websites out there.

                  • agentcoops 6 hours ago

                    I worked with him once on a job -- an incredibly nice guy and obviously a talented developer, who used to work for the French agency responsible for the Scala Play Framework. https://github.com/lichess-org/lila and https://github.com/lichess-org/scalachess are great resources for anyone curious to see a production-quality Scala 3 web application using Cats and all the properly functional features of the language.

                    • notagoodidea 5 hours ago

                      Would you recommend it as a deep-dive to observe Scala in production?

                      • agentcoops an hour ago

                        I haven't looked at the code in ages, but it's probably the only scaled consumer web application written in Scala, and moreover running on Scala 3, that you can see the end-to-end source for. You have all the Twitter open-source Scala projects, of course, but that's just infrastructure for running a web application rather than an actual production-quality app -- and my sense is that in 2024 there aren't many product teams outside of Twitter using their application tooling (as opposed to some of their data infrastructure, certainly the area where Scala sees the most use today, with Spark etc.).

                        TLDR if you want to see production-quality Scala code that this very second is serving 40k chess games -- and mostly bullet/blitz where ms latency is of course crucial -- definitely take a look.

                        Not as much hype for the language at the moment over Rust or Kotlin, say, but it remains my language of choice for web backends by far.

                  • nomilk 8 hours ago

                    Exact same thought went through my head. Also note in the first few paragraphs they acknowledge the worst impacts to users. That's very selfless - often corporate postmortems downplay the impact, which frustrates users more. Incidentally, a critical service I use (Postmark) had an outage this week and I didn't even hear from them (I found out via a random twitter post). Shows the difference.

                    • CSMastermind 7 hours ago

                      Presumably because Lichess is free thus doesn't have contractual obligations and SLAs that they'll be sued for breaching.

                    • redbell 7 hours ago

                      > so you, as our beneficiaries and stakeholders, who support us and encourage us — deserve to get clarification on what happened

                      Is it that complicated for big tech to reply politely with a statement like the above when they suddenly disable your account for no obvious reason?

                      • mewpmewp2 7 hours ago

                        It may not be complicated, but it does require caring about what you do and your customers as opposed to going through basic minimum requirements to appear that you are doing something.

                        It is much more difficult for corporate cogs to have that level of care compared to someone who does their things with passion.

                      • morgante 5 hours ago

                        The post-mortem is honest, but the infrastructure is well below what I'd expect from commercial services.

                        If a commercial provider told me they're dependent on a single physical server, with no real path or plans to fail over to another server if they need to, I would consider it extremely negligent.

                        It's fine to not use big cloud providers, but frankly it's pretty incompetent to not have the ability to quickly deploy to a new server.

                        • lukhas 3 hours ago

                          We're an understaffed charity.

                          • morgante 3 hours ago

                            Yeah I'm not criticizing it as a charity, just pointing out this definitely isn't "superior to most commercial services."

                            That being said, removing dependence on single hardware nodes isn't something you need a big team for. I've done failover at 1-person startups.

                          • KolmogorovComp 3 hours ago

                            And yet even Meta recently had multiple hours of downtime, despite a budget thousands if not millions of times higher. Would you call them negligent too?

                            By increasing the complexity you multiply the failure points and increase ongoing maintenance, which is the bottleneck (even more than money) for volunteer-driven projects.

                            • morgante 3 hours ago

                              To be clear, you don't need to make it more complex / failure-prone. I didn't say failover needs to be automated.

                              Kubernetes or complex cloud services are not required to have some basic deployment automation.

                              You can do it with a simple bash script if you need to. It's just pretty surprising to see the reaction to a hardware failure being to wait around for it to be repaired instead of simply spinning up a new host.
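
                              A sketch of what such a script might look like (every name in it -- the standby host, the backup URL, the restore-db/update-dns helpers -- is hypothetical, and it defaults to a dry run that only prints the commands):

```shell
#!/usr/bin/env bash
# Sketch of the "simple bash script" failover idea: restore the latest
# backup onto a standby box and repoint DNS. All names here are made up
# (standby host, backup URL, restore-db / update-dns helpers) -- this is
# an illustration, not anyone's actual tooling.
set -euo pipefail

STANDBY="standby.example.org"                  # hypothetical spare server
BACKUP="s3://example-backups/db/latest.dump"   # hypothetical backup location
: "${DRY_RUN:=1}"                              # default to printing, not doing

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

run ssh "$STANDBY" "systemctl stop app"
run ssh "$STANDBY" "restore-db $BACKUP"        # hypothetical restore helper
run ssh "$STANDBY" "systemctl start app"
run update-dns app.example.org "$STANDBY"      # hypothetical DNS helper
echo "failover to $STANDBY prepared (DRY_RUN=$DRY_RUN)"
```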

                        • ctippett 7 hours ago

                          Once the private link was reestablished, could they not have tunneled out to the internet via another server acting as a sort of gateway?

                          Disclaimer: I'm not a network engineer so I may be misunderstanding the practicality and complexity of such a workaround.
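
                          That's essentially NAT gatewaying over the private link, and it is feasible in principle. A sketch with assumed addresses (the 10.0.0.0/24 LAN, a gateway at 10.0.0.2, and public interface eth0 are all invented for the example; the post doesn't say whether this was tried):

```shell
# On a healthy machine that still has working internet (the improvised gateway):
sysctl -w net.ipv4.ip_forward=1                        # allow packet forwarding
iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o eth0 -j MASQUERADE

# On the cut-off server, over the still-working private link
# (10.0.0.2 being the gateway's private address):
ip route replace default via 10.0.0.2
```

                          Even with egress restored this way, anything pinned to the dead server's public IP (DNS records, TLS endpoints, peers' firewall rules) would still need separate handling.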

                          • lazyant 4 hours ago

                            summary for the lazy: OVH