Replacing long pieces of code with small ones is a daily routine.
But more on topic, I would say calling ffmpeg as an external binary to handle some multimedia processing is one of those cases where simple is better.
Generally, I would say that implementing your own solution instead of an external one (like a library, service, or product) will always fall under this umbrella. Mostly because you can implement only what you need, add things that might be missing, and adjust things that don't work exactly as you need them to, so you can avoid patches, translation layers, or external pipelines.
For example, right now I am implementing my own layout library, because the Clay bindings for Go were not working and the manual translations were missing features or were otherwise incomplete or non-functional. So I learned the principles Clay (by Nic Barker) is built on and wrote my own version in Go. It's a little over 2k lines of code right now and will take me about two weeks to finish, with all or most of the features I wanted. Now I have a fully native Go layout library that does what I need and that I can use and modify ad infinitum.
So I would say that I equate a "dumb solution" with "my own" solution.
PS: Looking back, when I used to work at an advertising/marketing/web agency, we made websites in CMSs (I did Drupal, a colleague did WordPress). Before I left the field entirely, I came to the conclusion that if we had used static site generators, we could have saved an unimaginable number of work hours and delivered the same products, as 99% of clients never managed their websites themselves; by the nature of the job, we were building presentational websites, not complicated functional ones. And when clients did need changes, they were so tiny that it would have made far more sense to do them for free upon request. For example, imagine you charge someone 5000€ for a website that takes you two months to ship, because you need to design it, build it functionally, fit the visual style, and tweak whatever is needed. With a static site generator the work would take two weeks: a week for the design and a week for coding the website itself. Now you've saved yourself six weeks of work while getting paid the same amount of money. Unfortunately, I never had the chance to try this out and push the company in a new direction, as it was at the end of my time there.
I was working at an influencer marketing company a while ago, back when Instagram still allowed pretty much complete access through their API. As we were indexing the entire Instagram universe for our internal tooling, we had a graph-traversal setup to crawl Instagram profiles, then each of their followers, and so on. We needed to keep track of visited profiles to avoid loops, and had an Apache Storm cluster running the entire scraping pipeline. It worked, but it was cumbersome to operate and monitor, and it couldn't reach our desired throughput.
Given there were about a billion IG profiles total at the time, I just replaced the entire setup with a single Go script that iterated from 1 to a billion and tried to scrape every id in between. That gave us 10k requests per second on a single machine, which was more than enough.
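In Python rather than Go, and with a stub standing in for the real HTTP call, the brute-force version might look roughly like this (the stub's behavior and all names here are my own illustration, not the actual script):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_profile(profile_id):
    """Stub for the real HTTP call to the profile endpoint.
    Here we pretend ids divisible by 3 don't exist."""
    if profile_id % 3 == 0:
        return None
    return {"id": profile_id}

def scrape_range(start, stop, workers=8):
    """Try every id in [start, stop) instead of traversing the follower
    graph, so there is no visited-set or work queue to manage."""
    found = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for profile in pool.map(fetch_profile, range(start, stop)):
            if profile is not None:
                found.append(profile)
    return found
```

The whole state of the crawl is a single counter, which is why it fits on one machine.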
People forget that a billion rows isn’t big data anymore.
I recently wrote a command-line full-text search engine [1]. I needed to implement an inverted index. I chose what seems like the "dumb" solution at first glance: a trie (prefix tree).
There are "smarter" solutions like radix tries, hash tables, or even skip lists, but for any design choice, you also have to examine the tradeoffs. A goal of my project is to make the code simpler to understand and less of a black box, so a simpler data structure made sense, especially since other design choices would not have been all that much faster or use that much less memory for this application.
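A minimal sketch of that kind of trie-backed inverted index (in Python; the names and structure are my own, not the linked project's):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # letter -> TrieNode
        self.postings = set()  # ids of documents containing this word

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word, doc_id):
        """Walk down the tree one letter at a time, creating nodes
        as needed, then record the document at the word's end."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.postings.add(doc_id)

    def lookup(self, word):
        """Return the set of documents containing the exact word."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return set()
        return node.postings
```

Because every word shares its prefix path, wildcard queries like `car*` reduce to walking the subtree under the prefix, which is the property that makes the trie attractive here.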
I guess the moral of the story is to just examine all your options during the design stage. Machine learning solutions are just that, another tool in the toolbox. If another simpler and often cheaper solution gets the job done without all of that fuss, you should consider using it, especially if it ends up being more reliable.
Similarly, I have a script that I invoke in the following format: "q replace all instances of http: with https: in all txt files recursively"
And off it goes to ChatGPT, which comes back with the appropriate command, and the script runs it.
> I chose what seems like the "dumb" solution at first glance: a trie (prefix tree).
> There are "smarter" solutions like... hash tables.... A goal of my project is to make the code simpler to understand and less of a black box, so a simpler data structure made sense, especially since other design choices would not have been all that much faster or use that much less memory for this application.
Strangely, my own software-related answer is the opposite for the same reason.
I was implementing something for which I wanted to approximate a https://en.wikipedia.org/wiki/Shortest_common_supersequence , and my research at the time led me to a trie-based approach. But I was working in Python, and didn't want to actually define a node class and all the logic to build the trie, so I bodged it together with a dict (i.e., a hash table).
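The bodge amounts to using nested dicts as trie nodes; a reconstruction (not the original code) looks like this:

```python
def trie_insert(trie, word):
    """Walk down the nested dicts, creating child dicts as needed."""
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node["$"] = True  # end-of-word marker

def trie_contains(trie, word):
    """Follow the word's letters; it's a hit only if we land on
    a node carrying the end-of-word marker."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node
```

No node class, no pointer bookkeeping: the hash table already gives you child lookup for free.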
What body of knowledge (books, tutorials etc) did you use while developing it?
Before I started the project, I was already vaguely familiar with the notion of an inverted index [1]. That small bit of knowledge meant that I knew where to start looking for more information and saved me a ton of time. Inverted indices form the bulk of many search engines, with the big unknown being how you implement it. I just had to find an adequate data structure for my application.
To figure that out, I remember searching for articles on how to implement inverted indices. Once I had a list of candidate strategies and data structures, I used Wikipedia supplemented by some textbooks like Skiena's [2] and occasionally some (somewhat outdated) information from NIST [3]. I found Wikipedia quite detailed for all of the data structures for this problem, so it was pretty easy to compare the tradeoffs between different design choices here. I originally wanted to implement the inverted index as a hash table but decided to use a trie because it makes wildcard search easier to implement.
After I developed most of the backend, I looked for books on "information retrieval" in general. I found a history book (Bourne and Hahn 2003) on the development of these kinds of search systems [4]. I read some portions of this book, and that helped confirm many of the design choices that I made. I was actually just doing what people traditionally did when they first built these systems in the 1960s and 1970s, albeit with more modern tools and much more information on hand.
The harder part of this project for me was writing the interpreter. I actually found YouTube videos on how to write recursive descent parsers to be the most helpful there, particularly this one [5]. Textbooks were too theoretical and not concrete enough, though Crafting Interpreters was sometimes helpful [6].
[1] https://en.wikipedia.org/wiki/Inverted_index
[2] https://doi.org/10.1007/978-3-030-54256-6
[3] https://xlinux.nist.gov/dads/
[4] https://doi.org/10.7551/mitpress/3543.001.0001
Thanks for the details. How much time have you invested in it?
I spent around 170 hours on this so far, with only 60% of that being coding. The rest was mostly research or writing.
Heuristics often work well enough that an AI/ML approach isn't needed. If it is needed, you still need the heuristics. If you were writing a chess engine, you wouldn't just pass the board state and history to a model. You'd still work with chess experts to come up with scores and heuristics for the material and strategic state of the board. You'd come up with detectors for certain conditions or patterns that experts have noted. Along with the board state, that's the input. And you'd still have a long way to go.
----
For storage, people often overcomplicate things. Maybe you do need RAID 5 in a NAS, etc. Maybe what you need is a simple server with a single disk and an offsite backup that rsyncs every night. That RAID 5 doesn't stop 'rm -rf' from destroying everything.
For databases, people often shove a database into an app or product much too early. The rule of thumb that I use is that you should switch to a database (from flat files) when you would have to implement foreign keys, or when data won't fit in memory anymore and memory-mapped files aren't sufficient. Using a database before that just complicates your data model, introducing ORM too early seriously complicates your code.
For algorithms, there are an awful lot of O(n log n) solutions deployed for problems with small n. An O(n) solution is often faster to write, and still solves the problem. O(n) is often actually faster when things fit in L1 or L2 cache.
For software architecture, we often forget that the client has CPU and storage (and network) that we can use. Even if you don't trust the client, you can sign a cache entry to be saved on the client, and let the client forward it later. Greatly reduces the need for consistency on the backend. If you don't trust the client to compute, you can have the server compute a spot check at lower resolution, a subset, etc.
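The signed-cache-entry idea can be sketched with an HMAC (Python stdlib; the key handling and payload shape here are illustrative assumptions, not a prescribed protocol):

```python
import hashlib
import hmac
import json

SECRET_KEY = b"server-side-secret"  # hypothetical; never sent to the client

def sign_entry(entry):
    """Server: serialize the cache entry and attach an HMAC tag
    before handing it to the client for safekeeping."""
    payload = json.dumps(entry, sort_keys=True).encode()
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload, tag

def verify_entry(payload, tag):
    """Server: when the client forwards the entry back, recompute
    the tag; a match proves the client didn't tamper with it."""
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)
```

The server never has to remember what it cached; the tag makes the client's copy trustworthy, which is what relaxes the backend consistency requirement.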
Several times I have rewritten overly multithreaded (and intermittently buggy) processes as single-threaded versions, reducing LoC to roughly 1/20th and binary size to 1/10th, while also obtaining a severalfold speedup, reducing memory usage, and entirely eliminating many bugs.
When I was on Google Docs, I watched the Google Forms team build a sophisticated ML model that attempted to detect when people were using it for nefarious purposes.
It underperformed banning the word "password" from a Google Form.
So that's what they went with.
I may have written about this before on HN, but once I wrote a simple Perl script that could run the daily trade reconciliation for an entire US primary exchange. It could run on my laptop and complete the process in under 20 minutes. Ten years later, I watched a team spending days setting up a Spark cluster to handle a comparable amount of data in a somewhat simpler business domain.
I wrote a tiny game that was basically a Dice Wars clone and needed to implement an enemy AI. I researched the probability formula for throwing a higher number with N dice versus M dice and spent days on the math. In the end I simulated every possible combination (i.e., every fight) up to 12 dice (which was the max amount) with a simple Python script and stored the results in a key-value table. It was so much easier.
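The same table can even be built exactly rather than by simulation, by convolving the sum distributions, which stays cheap at 12 dice; a sketch (my variation on the comment's approach, not the original script):

```python
def sum_distribution(n_dice):
    """Probability distribution of the total of n six-sided dice,
    built by repeatedly convolving with a single die."""
    dist = {0: 1.0}
    for _ in range(n_dice):
        new = {}
        for total, p in dist.items():
            for face in range(1, 7):
                new[total + face] = new.get(total + face, 0.0) + p / 6
        dist = new
    return dist

def win_probability(attacker, defender):
    """P(attacker's total is strictly higher), as in Dice Wars."""
    pa, pd = sum_distribution(attacker), sum_distribution(defender)
    return sum(p1 * p2
               for s1, p1 in pa.items()
               for s2, p2 in pd.items()
               if s1 > s2)

# Precompute the whole table once; every in-game fight is then a lookup.
TABLE = {(n, m): win_probability(n, m)
         for n in range(1, 13) for m in range(1, 13)}
```

At runtime the "AI" consults `TABLE` instead of doing any math, which is the whole trick.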
Still happens all the time in certain finance tasks (eg trying to predict stock prices), but I'm not sure how long that will hold. As for why that might be, I don't think I can do any better than linking to this comment about a comment about your question: <https://news.ycombinator.com/item?id=45306256>.
I suspect that locating the referenced comment would require a semantic search system that incorporates "fancy models with complex decision boundaries". A human applying simple heuristics could use that system to find the comment.
In the "Dictionary of Heuristic" chapter, Polya's "How to Solve It" says this: *The feeling that harmonious simple order cannot be deceitful guides the discoverer both in the mathematical and in the other sciences, and is expressed by the Latin saying simplex sigillum veri (simplicity is the seal of truth).*
It was a very long time ago, but during a programming competition one of the warm-up questions was something to do with a modified sudoku puzzle. The naive algorithmic solution was too slow, the fancy algorithm took quite a bit of effort... and then there were people who realised that the threshold for max points was higher than you needed for a brute force check of all possible boards. (I wasn't one of them)
This generalises to a few situations where going faster just doesn't matter. For example, for many CLI tools it matters whether they finish in 1s or 10s. But once you get to 10ms vs 100ms, you can ask: "is anyone ever likely to run this in a loop on a massive amount of data?" And if the answer is yes, "should they write their own optimised version then?"
I often favour low-maintenance, low-overhead solutions. Most recently I made a stupidly large static website with over 50k items (i.e., pages).
I think a lot of people would have used a database at this point, but the site didn't need to be updated once built so serving a load of static files via S3 makes ongoing maintenance very low.
I also feel a slight sense of superiority when I see colleagues write a load of pandas scripts to generate basic summary stats vs. my usual throwaway approach based around awk.
I’m mostly a hardware engineer.
I needed to test pumping water through a special tube, but didn’t have access to a pump. I spent days searching how to rig a pump to this thing.
Then I remembered I could just hang a bucket of water up high to generate enough head pressure. Free instant solution!
You would have made Maxime Faget proud.
Aside from https://news.ycombinator.com/item?id=46665611, way back in my engineering classes in university we had this design project... I'm not sure I've ever told the story publicly before and it brings a smile to remember it more than 20 years later.
My group (and some others) had to design a device to transport an egg from one side of a very simple "obstacle course" to the other, with the aid of beacons (to indicate the egg location and target, each along opposite ends) and light sensors. There was basically a single obstacle, a barrier running most of the way across the middle. The field was fairly small, I think 4 metres long by 3 metres wide.
The other teams followed tutorials, created beacons that emitted high-frequency light pulses and circuitry to filter out 60Hz ambient light and detect the pulse; various robots (I think at least one repurposed a remote-control car) and feedback control to steer them toward the beacons, etc. There were a few different microcontrollers on offer to us for this task, and groups generally had three people: someone responsible for the mechanical parts, someone doing circuitry, and someone doing assembly programming.
My group was just the two of us.
I designed extenders for the central barrier, a carriage to straddle the barrier, and a see-saw the length of the field. The machine would find the egg, scoop it into one end, tilt the see-saw (the other person's innovation: by releasing a stop allowing the counterweighted far side to fall), find the target and release the scoop on the other end. Our light sensors were pointed directly at the ceiling (the source of the "noise"), and put through a simple RC circuit to see that light as more or less constant. Our "beacons" were pieces of construction paper used to block the light physically. All controlled by a 3-bit finite state machine implemented directly in TTL/CMOS (I forget which).
And it worked in testing (praise for my partner; I would never have gotten the mechanics robust enough), but on presentation day the real barrier (made sloppily out of wood) was noticeably wider than specified and the carriage didn't fit on it.
As I recall, in later years the obstacle course was made considerably more complex, ruling out solutions like mine entirely. (There were other projects to choose from, for my year and later years, that as far as I know didn't require modification.)
Depends on your criteria. For me, about 90% of the time, the dumb solution beats the sophisticated one: it's not just the performance, but the maintenance burden etc.
I have a silly little internal website I use for bookmarks, searching internal tools, and some little utilities. I keep getting pressure to put it into our heavy and bespoke enterprise CI/CD process. I’ve seen people quit over trying to onboard onto this thing… more than one. It’s complete overkill for my silly little site.
My “dumb” solution is a little Ansible job that just runs a git pull on the server. It gets the new code and I’m done. The job also has an option to set everything up, so if the server is wiped out for some reason I can be back up and running in a couple minutes by running the job with a different flag.
For me, CP-SAT is the "dumb" solution that works in a lot of situations. Whenever a hackathon has a problem definable in constraints, that tends to be the first path I take, and it generally scores in the top 5.
I recently needed AI memory and instead of setting up a vector db and RAG, I just used git as a history graph and a knowledge graph in one.
- Before ML, try linear or polynomial regression.
- Buying a bigger server is almost always better than a distributed system.
- A few lines of bash can often replace hundreds of lines of Python.
I wrote a clone of Battlezone, the old Atari tank game. For the enemy tank "AI" I just used a simple state machine with some basic heuristics.
This gave a great impression of an intelligent adversary with very minimal code and low CPU overhead.
Game design is filled with simple ideas that interact in fun ways. Every time I have tried to come up with complex AIs I ended up scrapping them in favor of "stupid" solutions that turned out to be more enjoyable and easier to tune.
I can vouch from my experience of turn-based games that exploiting a dumb AI often makes the game more fun (and gives the developer license to throw more/tougher enemies at the player), and noticing the faults really doesn't degrade the experience like you'd expect.
Unless enemies have entirely non-functional pathing. Then it's just funny.
I once modeled user journeys on a website using fancy ML models that honored sequence information, i.e., order of page visits, only to be beaten by bag-of-words (i.e., page url becomes a vector dimension, but order is lost) decision tree model, which was supposed to be my baseline.
What I had overlooked was that journeys on that particular website were fairly constrained by design, i.e., if you landed on the home page, did a bunch of stuff, put product X in the cart - there was pretty much one sequence of pages (or in the worst case, a small handful) that you'd traverse for the journey. Which means the bag-of-words (BoW) representation was more or less as expressive as the sequence model; certain pages showing up in the BoW vector corresponded to a single sequence (mostly). But the DT could learn faster with less data.
I was once on a project where we couldn't use third-party libs. We needed a substring search, but the needle could be 1 of N letters. My teammate loves SIMD and wanted to write a solution. I took a look at all of our data: most strings were < 2kb, with many being empty or < 40 letters. SIMD would have been overkill. So I wrote a simple dumb for loop checking each letter for the 3 interesting characters (`";\n`).
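The dumb loop version is about this much code (a sketch in Python, using the three characters from the comment):

```python
INTERESTING = {'"', ';', '\n'}

def find_interesting(text):
    """Single pass over the string, recording position and character
    of every hit; for mostly-short inputs this is hard to beat."""
    return [(i, ch) for i, ch in enumerate(text) if ch in INTERESTING]
```

For inputs that are mostly empty or a few dozen characters long, the per-call setup cost of anything vectorized would dominate the scan itself.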
Great question, I could answer with many stories but here are two:
The (deliberately) very limited analytics software I wrote for my personal website [0] could have used a database, but I didn't want to add a dependency to what was a very simple project, so I hacked up an in-memory data structure that periodically dumps itself to disk as a JSON file. This gives persistence across reboots, and at a pinch I can just edit the file with a text editor.
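The pattern is roughly this (a sketch of the idea, not the actual code from the site):

```python
import json
import os

class Store:
    """In-memory dict that persists itself as a human-editable JSON file."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def bump(self, key):
        """Increment a counter, e.g. a page-view tally."""
        self.data[key] = self.data.get(key, 0) + 1

    def flush(self):
        # Write to a temp file and rename, so a crash mid-write
        # can't corrupt the previous dump.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.data, f, indent=2)
        os.replace(tmp, self.path)
```

The atomic-rename trick in `flush` is what makes the text-editor escape hatch safe: the file on disk is always a complete dump.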
Game design is filled with "stupid" ideas that work well. I wrote a text-based game[1] that includes Trek-style starship combat. I played around with a bunch of different ideas for enemy AI before just reverting to a simple action drawn off the top of a small deck. It's a very easy system to balance and expand, and just as fun for the player.
I remember Scalyr, at least before they were bought by SentinelOne, basically did a parallel/SIMD grep for each search query, and consistently beat the likes of Splunk and Elasticsearch, which continually indexed the data.
They had a great article on this too.
The common one I fought long ago was folks who always use regular expressions when what they want is a string match, or contains, or another string library function.
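That is, reaching for `re` when the plain string methods already answer the question; a trivial illustration:

```python
import re

line = "2024-01-01 ERROR disk full"

# The regex version...
found_re = re.search(r"ERROR", line) is not None
# ...and the string-method version, which says what it means
# and skips compiling a pattern entirely.
found = "ERROR" in line
prefix = line.startswith("2024")
```

Both give the same answer here; the difference is that `in` and `startswith` can't surprise you with metacharacter behavior.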
I've seen a lot of the opposite, especially having done a lot of string parsing in PHP: some developers would nest half a dozen string functions just to prepare and extract a line of data while a simple regular expression would have handled the operation much more concisely and accurately
I occasionally see people complaining about long TypeScript compile times where a small code base can take multiple minutes (possibly 10 minutes). I think to myself WTF, because large code bases should take no more than 20 seconds on ancient hardware.
On another note, I recently wrote a large single-page app that is just a collection of functions organized by page section, following a nearly flat TypeScript interface. It's stupidly simple to follow in the code and loads in as little as an eighth of a second. Of course that didn't stop HN users from crying like children because I avoided their favorite framework.
I've seen people get tripped up by DynamoDB-like stores, especially when they have a misleading SQL-ish interface like Azure Tables.
You can't be "agile" with them; you need to design your data storage upfront. Like a system design interview :).
Just use Postgres (or friends) until you are webscale, unless you really have a problem amenable to key/value storage.
That's fun to read, I remember when NoSQL was getting cargo-culted, it was specifically because it was more "agile". The reason being you don't need to worry about a schema. Just stick your data in there and figure it out later.
Interesting to hear now that the opinion is the opposite.