100M-Row Challenge with PHP (github.com)
Submitted by brentroose 9 hours ago
  • brentroose 9 hours ago

    A month ago, I went on a performance quest trying to optimize a PHP script that took 5 days to run. With the help of many talented developers, I eventually got it to run in under 30 seconds. This optimization process was so much fun, and so many people pitched in with their ideas, that I eventually decided I wanted to do something more.

    That's why I built a performance challenge for the PHP community

    The goal of this challenge is to parse 100 million rows of data with PHP, as efficiently as possible. The challenge will run for about two weeks, and at the end there are some prizes for the best entries (amongst the prizes is the very sought-after PhpStorm Elephpant, of which we only have a handful left).

    I hope people will have fun with it :)

    • Tade0 6 hours ago

      Pitch this to whoever is in charge of performance at Wordpress.

      A Wordpress instance will happily take over 20 seconds to fully load if you disable cache.

      • tracker1 an hour ago

        I cannot believe Wordpress hasn't been displaced by a service that uses separate applications for editing and delivery.

        It seems like vercel/cloudflare could host the content side, published as a worker serving mostly-static content from a larger application; that would be more beneficial and run better with less risk. Serving the app's editing and auth from the same location is just begging for the issues WP and plugins have seen.

        • devmor 34 minutes ago

          As someone who built full ecommerce websites on wordpress over 15 years ago, I can tell you exactly why it hasn't been replaced - the plugin/theme ecosystem.

          There are tens of thousands of plugins and themes to make a Wordpress website do whatever you want and look however you want, either for free or a very low fee. You have to replace that entire ecosystem for the same price to replace Wordpress.

          No matter how many times people get hacked, the perceived value of getting something for nothing outweighs the eventual cost.

          • shimman 5 minutes ago

            Maybe it's just my poor imagination, but how many plugins are truly unique to WP that you can't find on other CMSs? The only ones that come to mind would be those plugins that help connect to various B2B or B2C workflows. Is that where the gold is mostly found?

            • hparadiz 13 minutes ago

              I did a short contract a few years back where multiple WordPress plugins were pulling different versions of guzzle and I had to use a namespace rewriter to be able to run multiple guzzle versions at the same time.

              The thing about WordPress is you can put it on a box and lock it down so hard you just treat it as an untrusted process on your server.

          • rectang 2 hours ago

            Are you talking about a new, empty WordPress instance running the default theme? Because if so, that doesn't match my anecdotal experience.

            If you're talking about a WordPress instance with arbitrary plugins running an arbitrary theme, then sure — but that's an observation about those plugins and themes, not core.

            As someone who has to work with WordPress, I have all kinds of issues with it, but "20 seconds to load core with caching disabled" isn't one of them.

            • embedding-shape 5 hours ago

              Microbenchmarks are very different from optimizing performance in real applications in wide use, though: they could do great on this specific benchmark but still have no clue how to actually make something large like Wordpress perform OK out of the box.

              • rkozik1989 3 hours ago

                Much like anything else, your performance is going to vary a lot based on the architecture of the implementation. You really shouldn't deploy anything into production without some kind of caching, whether that's done in the application itself or with memcached/redis, varnish, or OPcache.

                • slopinthebag 2 minutes ago

                  Either you use a slow language and deal with caching or you use a fast language and just put Cloudflare/Bunny/etc in front.

                  • LoganDark 2 hours ago

                      > You really shouldn't deploy anything into production without some kind of caching.

                    Citation needed? You only need cache if a render is expensive to produce.

                  • monkey_monkey 5 hours ago

                    That's often a skill issue.

                    • almosthere 2 hours ago

                      skill issue being they only know php

                  • ge96 an hour ago

                    5 days to 30 seconds? What kind of factor/order of magnitude is that damn

                    What takes 5 days to run

                    • slopinthebag 2 minutes ago

                      One query per column per row

                      • hosteur 38 minutes ago

                        Poorly made analytics/datawarehouse stuff.

                      • onion2k 3 hours ago

                        > A month ago, I went on a performance quest trying to optimize a PHP script that took 5 days to run. Together with the help of many talented developers, I eventually got it to run in under 30 seconds.

                        When people say leetcode interviews are pointless, I might share a link to this post. If that sort of optimization is possible, there is a data structures and algorithms problem in the background somewhere.

                        • nicoburns 3 hours ago

                          I find that these kinds of optimizations are usually more about technical architecture than leetcode. Last time I got speedups this crazy, the biggest win was reducing the number of network/database calls. There were also optimisations around reducing allocations and pulling expensive work out of hot loops. But leetcode interview questions don't tend to cover any of that.

                          They tend to be about the implementation details of specific algorithms and data structures. Whereas the important skill in most real-world scenarios would be to understand the trade-offs between different algorithms and data structures so that you pick an appropriate off-the-shelf implementation to use.
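                          As a tiny PHP sketch of the "pulling expensive work out of hot loops" idea — the rows and dates below are invented for illustration — memoizing a per-row call pays off when the input has many duplicate values:

```php
<?php
// Memoize an expensive per-row call when inputs repeat heavily
// (rows and dates here are invented for illustration).
$rows = [
    ['date' => '2025-01-24'],
    ['date' => '2025-01-24'],
    ['date' => '2025-01-25'],
];

$cache = [];
$timestamps = [];
foreach ($rows as $row) {
    // strtotime() now runs once per distinct date, not once per row.
    $timestamps[] = $cache[$row['date']] ??= strtotime($row['date']);
}

echo count($cache); // 2 distinct dates parsed
```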

                          • LollipopYakuza 2 hours ago

                            I agree. The "advanced" leetcode is about those last % of optimization. But when network latency is involved in a flow, it is usually the most obvious low hanging fruit.

                          • tuetuopay 2 hours ago

                            Well leetcode asks you to implement the data structure, not how and when to use which data structure. I don’t need to know how to implement a bloom filter on a whiteboard off the top of my head to know when to use it.

                          • CyberDildonics 2 hours ago

                            Using a language that is 100x slower than naive native programs to do a "speed challenge" is like spending your entire day speed walking to run errands when you can just learn how to drive a car.

                            • gib444 6 hours ago

                              > A month ago, I went on a performance quest trying to optimize a PHP script that took 5 days to run. Together with the help of many talented developers, I eventually got it to run in under 30 seconds

                              That's a huge improvement! How much was low-hanging fruit unrelated to the PHP interpreter itself, out of curiosity? (e.g. parallelism, faster SQL queries, etc.)

                              • brentroose 6 hours ago

                                Almost all, actually. I wrote about it here: https://stitcher.io/blog/11-million-rows-in-seconds

                                A couple of things I did:

                                - Cursor-based pagination
                                - Combining insert statements
                                - Using database transactions to prevent fsync calls
                                - Moving calculations from the database to PHP
                                - Avoiding serialization where possible
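                                As a minimal sketch of two of those ideas — a single multi-row INSERT wrapped in one transaction — here's an example against an in-memory SQLite database (the table and column names are invented; assumes the pdo_sqlite extension is available):

```php
<?php
// Sketch: combine many rows into one INSERT and wrap it in a transaction
// so the storage engine syncs once instead of once per statement.
// Table/column names are invented; requires the pdo_sqlite extension.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE visits (url TEXT, date TEXT)');

$rows = [
    ['/blog/a', '2025-01-24'],
    ['/blog/b', '2025-01-24'],
    ['/blog/a', '2025-01-25'],
];

// One statement with N value tuples instead of N separate statements.
$placeholders = implode(',', array_fill(0, count($rows), '(?, ?)'));
$stmt = $pdo->prepare("INSERT INTO visits (url, date) VALUES $placeholders");

$pdo->beginTransaction();
$stmt->execute(array_merge(...$rows));
$pdo->commit();

echo $pdo->query('SELECT COUNT(*) FROM visits')->fetchColumn(); // 3
```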

                                • tiffanyh 6 hours ago

                                  Aren’t these optimizations less about PHP, and more about optimizing how you’re using the database?

                                  • toast0 2 hours ago

                                    PHP is kind of like C. It can be very fast if you do things right, and it gives you more than enough rope to tie yourself in knots.

                                    Making your application fast is less about tuning your runtime and more about carefully selecting what you do at runtime.

                                    Runtime choice does still matter, an environment where you can reasonably separate sending database queries and receiving the result (async communication) or otherwise lets you pipeline requests will tend to have higher throughput, if used appropriately, batching queries can narrow the gap though. Languages with easy parallelism can make individual requests faster at least while you have available resources. Etc.

                                    A lot of popular PHP programs and frameworks start by spending lots of time assembling a beautiful sculpture of objects that will be thrown away at the end of the request. Almost everything is going to be thrown away at the end of the request; making your garbage beautiful doesn't usually help performance.

                                    • hu3 5 hours ago

                                      It's still valid as an example to the language community of how to apply these optimizations.

                                      • swasheck 5 hours ago

                                        in all my years doing database tuning/admin/reliability/etc, performance issues have overwhelmingly been in the bad query/bad data pattern categories. the data platform is rarely the issue

                                        • tosti 3 hours ago

                                          The worst offenders I've seen were looping over a shitty ORM

                                          • cobbzilla 2 hours ago

                                            hey don’t forget, that shitty ORM also empowers you to write beautiful, fluent code that, under the hood, generates a 12-way join that brings down your entire database.

                                            • edoceo 2 hours ago

                                              And that is true across languages.

                                      • Joel_Mckay 4 hours ago

                                        In general, it is bad practice to touch transaction datasets in PHP script space. Like all foot-guns, it leads to read-modify-write bugs eventually.

                                        Depending on the SQL engine, there are many PHP Cursor optimizations that save moving around large chunks of data.

                                        Clean cached PHP can be fast for REST transactional data parsing, but it is also often used as a bodge language by amateurs. PHP is not slow by default or meant to run persistently (low memory use is nice), but it still gets a lot of justified criticism.

                                        Erlang and Elixir are much better for clients/host budgets, but less intuitive than PHP =3

                                      • user3939382 6 hours ago

                                        exec(‘c program that does the parsing’);

                                        Where do I get my prize? ;)

                                        • brentroose 6 hours ago

                                          The FAQ states that solutions like FFI are not allowed because the goal is to solve it with PHP :)

                                          • kpcyrd 5 hours ago

                                            What about using the filesystem as an optimized dict implementation?

                                            • olmo23 4 hours ago

                                              this is never going to be faster because it requires syscalls

                                    • Xeoncross 4 hours ago

                                      This is why I jumped from PHP to Go, then why I jumped from Go to Rust.

                                      Go is the most batteries-included language I've ever used. Instant compile times mean I can run tests bound to ctrl/cmd+s every time I save the file. It's more performant (way less memory, similar CPU time) than C# or Java (and certainly all the scripting languages) and contains a massive stdlib for anything you could want to do. It's what scripting languages should have been. Anyone can read it, just like Python.

                                      Rust takes the last 20% I couldn't get in a GC language and removes it. Sure, its syntax doesn't make sense to an outsider and you end up with third-party packages for a lot of things, but you can't beat its performance and safety. It removes a whole lot of tests, as those situations just aren't possible.

                                      If Rust scares you use Go. If Go scares you use Rust.

                                      • tracker1 an hour ago

                                        Can't speak for go... but for the handful of languages I've thrown at Claude Code, I'd say it's doing the best job with Rust. Maybe the Rust examples in the wild are just better compared to say C#, but I've had a much smoother time of it with Rust than anything else. TS has been decent though.

                                        • thinkingtoilet 4 hours ago

                                          It's almost comical how often people bring up Rust. "Here's a fun PHP challenge!" "Let's talk about Rust..."

                                          • hu3 5 minutes ago

                                            Yep. It's like a crossfit vegan religion at this point.

                                            You don't even have to ask. They will tell you and usually add nothing to the conversation while doing so.

                                            Quite off-putting.

                                            • Xeoncross 4 hours ago

                                              Sorry, but it's honestly just how a lot of our journeys went. Started on scripting languages like PHP/Ruby/Lua (self-taught) or Java/VB/C#/Python (college) and then slowly expanded to other languages as we realized we were being held back by our own tools. Each new language/relationship makes you kick yourself for putting up with things so long.

                                              • tracker1 an hour ago

                                                I don't know about that... I like Rust a lot... but I also like a lot of things about C# or TS/JS... I'll still reach for TS first (Deno) for most things, including shell scripting.

                                                • thinkingtoilet 3 hours ago

                                                  I understand that but there's a time and a place. Rust has nothing to do with this. 100% of the people on this site understand that this challenge can be done faster in C, or Rust, or whatever. This is a PHP challenge. Perhaps we could discuss the actual submission as opposed to immediately derailing it.

                                              • codegeek 4 hours ago

                                                I am not smart enough to use Rust, so take this with a grain of salt. However, its syntax just makes me go crazy. Go/Golang, on the other hand, is a breath of fresh air. I think unless you really need that additional 20% improvement that Rust provides, Go should be the default for most projects between the two.

                                                • Xeoncross 4 hours ago

                                                  I hear you, advanced generics (for complex unions and such) with TypeScript and Rust are honestly unreadable. It's code you spend a day getting right and then no one touches it.

                                                  I'm just glad modern languages stopped throwing and catching exceptions at random levels in their call chain. PHP, JavaScript and Java can (not always) have unreadable error handling paths, not to mention hardly augmenting the error with any useful information; you're left relying on the stack trace to try to piece together what happened.

                                              • pxtail 6 hours ago

                                                Side note - I wasn't aware that there is active collectors scene for Elephpants, awesome!

                                                https://elephpant.me/

                                                • t1234s 6 hours ago

                                                  Elephpants should be for second and third place. First place should be the double-clawed hammer.

                                                  • thih9 5 hours ago

                                                    Excellent project. My favorites: the joker, php storm, phplashy, Molly.

                                                    • normie3000 3 hours ago

                                                      I love Mollie!

                                                  • semiquaver 4 hours ago

                                                    Are they just confused about what characters require escaping in JSON strings or is PHP weirder than I remember?

                                                        {
                                                            "\/blog\/11-million-rows-in-seconds": {
                                                                "2025-01-24": 1,
                                                                "2026-01-24": 2
                                                            },
                                                            "\/blog\/php-enums": {
                                                                "2024-01-24": 1
                                                            }
                                                        }
                                                    • idoubtit 2 hours ago

                                                      The weirdness is partly in JSON itself. In the JSON spec, the slash (named "solidus" there) is the only character that can be written plainly or prefixed with a backslash (AKA "reverse solidus").

                                                      See page 4, section 9 of the latest ECMA for JSON: https://ecma-international.org/wp-content/uploads/ECMA-404_2...

                                                      • daviddoran 3 hours ago

                                                        PHP has always escaped forward slashes to help prevent malicious JSON from injecting tags into JavaScript, I believe. It was common for PHP users to json_encode some data and then write it out into the HTML in a script tag. A malicious actor could include a closing script tag, and then inject their own HTML tags and scripts, etc.
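                                                        A small illustration of that defense (the payload string is invented):

```php
<?php
// A user-controlled string containing a closing </script> tag.
$payload = ['comment' => '</script><script>alert(1)</script>'];

// json_encode() escapes "/" by default, so "</script>" becomes "<\/script>"
// and can no longer terminate a surrounding <script> block when the JSON
// is inlined into an HTML page.
echo json_encode($payload);
// {"comment":"<\/script><script>alert(1)<\/script>"}
```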

                                                        • CapitaineToinon 4 hours ago

                                                          That's the default output when using json_encode with the JSON_PRETTY_PRINT flag in php.

                                                          • idoubtit 3 hours ago

                                                            > That's the default output when using json_encode with the JSON_PRETTY_PRINT flag in php.

                                                            JSON_PRETTY_PRINT is irrelevant. Escaping slashes is the default behavior of json_encode(). To switch it off, use JSON_UNESCAPED_SLASHES.
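                                                            A quick demonstration of the two flags:

```php
<?php
$data = ['/blog/php-enums' => ['2024-01-24' => 1]];

// Default behavior: slashes are escaped.
echo json_encode($data), "\n";
// {"\/blog\/php-enums":{"2024-01-24":1}}

// JSON_UNESCAPED_SLASHES disables the escaping; JSON_PRETTY_PRINT only
// changes whitespace, not which characters get escaped.
echo json_encode($data, JSON_UNESCAPED_SLASHES | JSON_PRETTY_PRINT);
```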

                                                            • CapitaineToinon an hour ago

                                                              Ah, my bad. Thanks for the correction, TIL!

                                                          • poizan42 4 hours ago

                                                            > The output should be encoded as a pretty JSON string.

                                                            So apparently that is what they consider "pretty JSON". I really don't want to see what they would consider "ugly JSON".

                                                            (I think the term they may have been looking for is "pretty-printed JSON" which implies something about the formatting rather than being a completely subjective term)

                                                            • ourmandave 37 minutes ago

                                                              Pretty JSON not meaning formatting, but more "That was pretty JSON of you."

                                                          • tveita 6 hours ago

                                                            > Also, the generator will use a seeded randomizer so that, for local development, you work on the same dataset as others

                                                            Except that the generator script generates dates relative to time() ?

                                                            • tomaytotomato 2 hours ago

                                                              Tempted to submit a Java app wrapped in PHP exec() :D

                                                              • Retr0id 6 hours ago

                                                                How large is a sample 100M row file in bytes? (I tried to run the generator locally but my php is not bleeding-edge enough)

                                                                • brentroose 6 hours ago

                                                                  Around 7GB

                                                                • csjh 3 hours ago

                                                                  Obligatory DuckDB solution:

                                                                  > duckdb -s "COPY (SELECT url[20:] as url, date, count(*) as c FROM read_csv('data.csv', columns = { 'url': 'VARCHAR', 'date': 'DATE' }) GROUP BY url, date) TO 'output.json' (ARRAY)"

                                                                  Takes about 8 seconds on my M1 Macbook. JSON not in the right format, but that wouldn't dominate the execution time.

                                                                • poizan42 4 hours ago

                                                                  > The output should be encoded as a pretty JSON string.

                                                                  ...

                                                                  > Your parser should store the following output in $outputPath as a JSON file:

                                                                      {
                                                                          "\/blog\/11-million-rows-in-seconds": {
                                                                              "2025-01-24": 1,
                                                                              "2026-01-24": 2
                                                                          },
                                                                          "\/blog\/php-enums": {
                                                                              "2024-01-24": 1
                                                                          }
                                                                      }
                                                                  
                                                                  They don't define what exactly "pretty" means, but superfluous escapes are not very pretty in my opinion.

                                                                  • kijin 4 hours ago

                                                                    They probably mean "Should look like the output of json_encode($data, JSON_PRETTY_PRINT)". Which most PHP devs would be familiar with.

                                                                    • poizan42 4 hours ago

                                                                      It sounds plausible, but they really need to spell out exactly what the formatting requirements are, because it can make a huge difference in how efficiently you can write the JSON out.

                                                                      • maleldil 2 hours ago

                                                                        It's a challenge for PHP programmers. I imagine the relevant people would recognise that format.

                                                                  • spiderfarmer 7 hours ago

                                                                    Awesome. I’ll be following this. I’ll probably learn a ton.