In my experience, the most pernicious temptation is to take the buggy, non-working code you have now and try to modify it with "fixes" until it works. You often cannot get broken code to become working code because there are too many possible changes to make. In my view, it is much easier to break working code than it is to fix broken code.
Suppose you have a complete chain of N Christmas lights and they do not work when turned on. The temptation is to go through all the lights and to substitute in a single working light until you identify the non-working light.
But suppose there are multiple non-working lights? You'll never find the error with this approach. Instead, you need to start with the minimal working approach -- possibly just a single light (if your Christmas lights work that way), adding more lights until you hit an error. In fact, the best case is if you have a broken string of lights and a similar but working string of lights! Then you can easily swap a test bulb out of the broken string and into the working chain until you find all the bad bulbs in the broken string.
Starting with a minimal working example is the best way I have found to fix a bug. And you will find you resist this, because you believe you are close and it feels too time-consuming to start from scratch. In practice, it tends to be a real time-saver, not the opposite.
The quickest solution, assuming learning from the problem isn't the priority, might be to replace the entire chain of lights without testing any of them. I've been part of some elusive production issues where eventually 1-2 team members attempted a rewrite of the offending routine while everyone else debugged it, and the rewrite "won" and shipped to production before we found the bug. Heresy I know. In at least one case we never found the bug, because we could only dedicate a finite amount of time to a "fixed" issue.
> The quickest solution, assuming learning from the problem isn't the priority, might be to replace the entire chain of lights without testing any of them.
So as a metaphor for software debugging, this is "throw away the code, buy a working solution from somewhere else." It may be a way to run a business, but it does not explain how to debug software.
The worst cases in debugging are the ones where the mental model behind the code is wrong and, in those cases, "throw it away" is the way out.
Those cases are highly seductive because the code seems to work 98% of the way and you'd think it's just another 2% to get it working, but you never get to the end of that 2%; it's like pushing a bubble around under a rug. (Many of today's LLM enthusiasts will, after they "pay their dues", wind up sounding like me.)
This is documented in https://www.amazon.com/Friends-High-Places-W-Livingston/dp/0... and takes its worst form when you've got an improperly designed database that has been in production for some time and cannot be 100% correctly migrated to a correct database structure.
I’m not sure it implies buying from elsewhere. I understood it to mean make a new one, rather than try to repair the broken one.
Depending on what your routine looks like, you could run both on a given input and see when / where they differ.
Rule 0: Don't panic
Really, that's important. You need to think clearly; deadlines and angry customers are a distraction. That's also when having a good manager who trusts you is important: his job is to shield you from all that so that you can devote all of your attention to solving the problem.
There's a story in the book - on nuclear submarines there's a brass bar in front of all the dials and knobs, and the engineers are trained to "grab the bar" when something goes wrong rather than jumping right to twiddling knobs to see what happens.
I read this book and took this advice to heart. I don't have a brass bar in the office, but when I'm about to push a button that could cause destructive changes, especially in prod, my hands reflexively fly up into the air while I double-check everything.
A weird, yet effective recommendation from someone at my last job: If it's a destructive or dangerous action in prod, touch both your elbows first. This forces you to take your hands away from the keyboard, stop any possible auto-pilot, and look at what you're doing.
Related: write down what you're seeing (or rather, what you _think_ you're seeing), and do so with pen and paper, not the keyboard. You can type way faster than you can write, and the slowness of writing makes you think harder about what you think you know. Often you do know the answer; you just have to tease it out. Or there are gaps in your knowledge that you hadn't clocked. After all, an assumption is something you don't realise you've made.
This also works well in conjunction with debug tooling -- the tooling gives you the raw information, writing down that information helps join the dots.
Reminds me of https://en.wikipedia.org/wiki/Pointing_and_calling
"a good chess player sits on his hands" (NN). It's good advice as it prevents you from playing an impulsive move.
I have to sit on my hands at the dentist to prevent impulse moves.
Thank you for explaining that phrase! I couldn't find it with a quick Google.
I always wanted that to be a true story, but I don't think it is.
100% agree. I remember I had an on-call and our pagerduty started going off for a SEV-2 and naturally a lot of managers from teams that are affected are in there sweating bullets because their products/features/metrics are impacted. It can get pretty frustrating having so many people try to be cooks in the kitchen. We had a great manager who literally just moved us to a different call/meeting and he told us "ignore everything those people are saying; just stay focused and I'll handle them." Everyone's respect for our manager really went up from there.
Slow is smooth and smooth is fast. If you don't have time to do it right, what makes you think there is time to do it twice?
"We have to do something!"
And this is something, so I’m doing it.
I had a boss who used to say that her job was to be a crap umbrella, so that the engineers under her could focus on their actual jobs.
I once worked with a company that provided IM services to hyper competitive, testosterone poisoned options traders. On the first fine trading day of a January new year, our IM provider rolled out an incompatible "upgrade" to some DLL that we (our software, hence our customers) relied on, that broke our service. Our customers, ahem, let their displeasure be known.
Another developer and I were tasked with fixing it. The Customer Service manager (though one of the most conniving, politically destructive assholes I have ever not-quite worked with) actually carried a crap umbrella. Instead of constantly flaming us with how many millions of dollars our outage was costing every minute, he held up that umbrella and diverted the crap. His forbearance let us focus. He discreetly approached every 20 minutes, toes not quite entering the office, calmly inquiring how it was going. In just over an hour (between his visits 3 and 4), Nate and I had the diagnosis, the fix, and had rolled it out to production, to the relief of pension funds worldwide.
As much as I dislike the memory of that manager to this day, I praise his wisdom every chance I get.
At first I thought you meant an umbrella that doesn't work very well.
Ah, the unintentional ambiguity of language, the reason there are so many lawyers in the world and why they are so expensive. The GP's phrasing is not incorrect but your comment made me realize: I only parsed it correctly the first time because I've heard managers use similar phrases so I recognized the metaphor immediately. But for the sake of reducing miscommunication, which sadly tends to trigger so many conflicts, I could offer a couple of disambiguatory alternatives:
- "her job was to be a crap-umbrella": hyphenate into a compound noun, implies "an umbrella of/for crap" to clarify the intended meaning
- "her job was to be a crappy umbrella": make the adjective explicit if the intention was instead to describe an umbrella that doesn't work well
I always say this too. But the real trick is knowing what to let thru. You can’t just shield your team from everything going on in the organisation. You’re all a part of the organisation and should know enough to have an opinion.
A better analogy is you’re there to turn down the noise. The team hears what they need to hear and no more.
Equally, the job of a good manager is to help escalate team concerns. But just as there’s a filter stopping the shit flowing down, you have to know what to flow up too.
Ideally it's crap umbrellas all the way down. Everyone should be shielding everyone below them from the crap slithering its way down.
Agreed. Even as a relatively junior engineer at my first job, I realized that a certain amount of the job was not worth exposing interns to (like excessive amounts of Jira ticket wrangling) because it would take parts of their already limited time away from doing things that would benefit them far more. Unless there's quite literally no one "below" you, there's probably _something_ you can do to stop shit from flowing down past you.
That must be the trickle-down effect everyone was talking about in the '80s. ;)
A corollary to this is always have a good roll-back plan. It's much nicer to be able to roll-back to a working version and then be able to debug without the crisis-level pressure.
Rollback ability is a must—it can be the most used mitigation if done right.
Not all issues can be fixed with a rollback though.
I once worked on a team where, when a serious visible incident occurred, a company VP would pace the floor, occasionally yelling, describing how much money we were losing per second (or how much customer trust, if that number was too low) or otherwise communicating that we were in a battlefield situation and things were Very Critical.
Later I worked for a company with a much bigger and more critical website, and the difference in tone during urgent incidents was amazing. The management made itself available for escalations and took a role in externally communicating what was going on, but besides that they just trusted us to do our jobs. We could even go get a glass of water during the incident without a VP yelling at us. I hadn't realized until that point that being calm adults was an option.
> Rule 0: Don’t panic
“Slow is smooth, smooth is fast.”
I know this and still get caught out sometimes…
Edit: hahaha ‘Cerium saying ~same thing: https://news.ycombinator.com/item?id=42683671
Also, a pager/phone going off incessantly isn't useful either. Manage your alarms or you'll be throwing your phone at a wall.
This is very underrated. An extension to this: don't be afraid to break things further to probe. I often see a lot of devs, mid-level included, panicking, which prevents them from even knowing where to start. I've come to believe that some people just have an inherent intuition and some just need to learn it.
Yes, sometimes instinct takes over when you're on the spot in a pinch, but there are institutional things you can do to be prepared in advance that expand your set of options in the moment, much like a pre-prepared fire-drill playbook you can pull from. There are also training courses like Kepner-Tregoe. But you are right, there are just some people who do better than others when _it's hitting the fan.
agreed. it’s practically a prerequisite for everything else in the book. Staying calm and thinking clearly is foundational
Uff, yeah. I used to work with a guy who would immediately turn the panic up to 11 at the first thought of a bug in prod. We would end up with worse architecture after his "fix" or he would end up breaking something else.
For #4 (divide and conquer), I've found `git bisect` helps a lot. If you have a known good commit and one of dozens or hundreds of commits after that is bad, this can help you identify the bad commit / code in a few steps.
Here's a walk through on using it: https://nickjanetakis.com/blog/using-git-bisect-to-help-find...
I jumped into a pretty big unknown code base in a live consulting call and we found the problem pretty quickly using this method. Without that, the scope of where things could be broken was too big given the context (unfamiliar code base, multiple people working on it, only able to chat with 1 developer on the project, etc.).
"git bisect" is why I maintain the discipline that all commits to the "real" branch, however you define that term, should all individually build and pass all (known-at-the-time) tests and generally be deployable in the sense that they would "work" to the best of your knowledge, even if you do not actually want to deploy that literal release. I use this as my #1 principle, above "I should be able to see every keystroke ever written" or "I want every last 'Fixes.' commit" that is sometimes advocated for here, because those principles make bisect useless.
The thing is, I don't even bisect that often... the discipline necessary to maintain that in your source code heavily overlaps with the disciplines to prevent code regression and bugs in the first place, but when I do finally use it, it can pay for itself in literally one shot once a year, because we get bisect out for the biggest, most mysterious bugs, the ones that I know from experience can involve devs staring at code for potentially weeks, and while I'm yet to have a bisect that points at a one-line commit, I've definitely had it hand us multiple-day's-worth of clue in one shot.
If I was maintaining that discipline just for bisect we might quibble with the cost/benefits, but since there's a lot of other reasons to maintain that discipline anyhow, it's a big win for those sorts of disciplines.
Sometimes you'll find a repo where that isn't true. Fortunately, git bisect has a way to deal with failed builds, etc: three-value logic. The test program that git bisect runs can return an exit value that means that the failure didn't happen, a different value that means that it did, or a third that means that it neither failed nor succeeded. I wrote up an example here:
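If it helps anyone, here's a minimal sketch of the kind of script `git bisect run` expects, in Python (the `make` build step and the `./reproduce_bug.sh` repro script are made-up placeholders; the exit-code convention itself is from the git documentation):

```python
#!/usr/bin/env python3
"""Test script for use as: git bisect run python3 bisect_test.py

git bisect run interprets this script's exit code:
    0                        -> current commit is good
    125                      -> current commit cannot be tested, skip it
    1-127 (other than 125)   -> current commit is bad
"""
import subprocess
import sys

# Placeholder build step: if the tree doesn't even build, it's neither
# good nor bad, so tell bisect to skip this commit.
if subprocess.run(["make"]).returncode != 0:
    sys.exit(125)

# Placeholder reproduction script: exits non-zero when the bug shows up.
bug_reproduced = subprocess.run(["./reproduce_bug.sh"]).returncode != 0
sys.exit(1 if bug_reproduced else 0)
```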
Very handy! I forgot about `git bisect skip`.
I do bisecting almost as a last resort. I've used it when all else fails only a few times. Especially as I've never worked on code where it was very easy to just build and deploy a working debug system from a random past commit.
Edit to add: I will study old diffs when there is a bug, particularly for bugs that seem correlated with a new release. Asking "what has changed since this used to work?" often leads to an obvious cause or at least helps narrow where to look. Also asking the person who made those changes for help looking at the bug can be useful, as the code may be more fresh in their mind than in yours.
Same. Every branch apart from the “real” one and release snapshots is transient and WIP. They don’t get merged back unless tests pass.
> why I maintain the discipline that all commits to the "real" branch, however you define that term, should all individually build and pass all (known-at-the-time) tests and generally be deployable in the sense that they would "work" to the best of your knowledge, even if you do not actually want to deploy that literal release
You’re spot on.
However it’s clearly a missing feature that Git/Mercurial can’t tag diffs as “passes” or “bisectable”.
This is especially annoying when you want to merge a stack of commits and the top passes all tests but the middle does not. It’s a monumental and valueless waste of time to fix the middle of the stack. But it’s required if you want to maintain bisectability.
It’s very annoying and wasteful. :(
This is why we use squash like here https://docs.gitlab.com/ee/user/project/merge_requests/squas...
As someone who doesn't like to see history lost via "rebase" and "squashing" branches, I have had to think through some of these things, since my personal preferences are often trampled on by company policy.
I have only been in one place where "rebase" is used regularly, and now that I'm a little more familiar with it, I don't mind using it to bring in changes from a parent branch into a working branch, if the working branch hasn't been pushed to origin. It still weirds me out somewhat, and I don't see why a simple merge can't just be the preferred way.
I have, however, seen "squashing" regularly (and my current position uses it as well as rebasing) -- and I don't particularly like it, because sometimes I put in notes and trials that get "lost" as the task progresses, but nonetheless might be helpful for future work. While it's often standard to delete "squashed" branches, I cannot help but think that, for history-minded folks like me, a good compromise would be to "squash and keep" -- so that the individual commits don't pollute the parent branch, while the details are kept around for anyone needing to review them.
Having said that, I've never been in a position where I felt like I need to "forcibly" push for my preferences. I just figure I might as well just "go with the flow", even if a tiny bit of me dies every time I squash or rebase something, or delete a branch upon merging!
> I cannot help but think that, for history-minded folks like me, a good compromise would be to "squash and keep" -- so that the individual commits don't pollute the parent branch, while the details are kept around for anyone needing to review them.
But not linked together and those "closed" branches are mixed in with the current ones.
Instead, try out "git merge --no-ff" to merge back into master (forcing a merge commit to exist even if a fast-forward was possible) and "git log --first-parent" to only look at those merge commits. Kinda squash-like, but with all the commits still there.
I use git-format-patch to create a list of diffs for the individual commits before the branch gets squashed, and tuck them away in a private directory. Several times have I gone back to peek at those lists to understand my own thoughts later.
Well dang, we are not restricted to git as the only place we can put historical metadata. You know, discursive comments, for starters?
git bisect supports --first-parent, no need to squash if you just merge.
I explicitly don’t want squash. The commits are still worth keeping separate. There’s lots of distinct pieces of work. But sometimes you break something and fix it later. Or you add something new but support different environments/platforms later.
But if you don't squash, doesn't this render git bisect almost useless?
I think every commit that gets merged to main should be an atomic believed-to-work thing. Not only does this make bisect way more effective, but it's a much better narrative for others to read. You should write code to be as readable by others as possible, and your git history likewise.
Individual atomic working commits don't necessarily make a full feature. Most of the time I build features up in stages and each commit works on its own, even without completing the feature in question.
git bisect --first-parent
You should read my original comment.
git bisect should operate on bisectable commits. Which may not be all commits. Git is missing information. This is, imho, a flaw in Git.
If there's a way to identify those incomplete commits, git bisect does support "skip" - a commit that's neither good nor bad, just ignored.
Can we use git trailers for this? Something like "commit: incomplete" in the commit message.
Could you not use --first-parent option to test only at the merge-points?
Back in the 1990s, while debugging some network configuration issue a wiser older colleague taught me the more general concept that lies behind git bisect, which is "compare the broken system to a working system and systematically eliminate differences to find the fault." This can apply to things other than software or computer hardware. Back in the 90s my friend and I had identical jet-skis on a trailer we shared. When working on one of them, it was nice to have its twin right there to compare it to.
The principle here "bisection" is a lot more general than just "git bisect" for identifying ranges of commits. It can also be used for partitioning the space of systems. For instance, if a workflow with 10 steps is broken, can you perform some tests to confirm that 5 of the steps functioned correctly? Can you figure out that it's definitely not a hardware issue (or definitely a hardware issue) somewhere?
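As a sketch of that more general idea (nothing specific to git, just binary search over the steps of a broken pipeline, assuming you have some cheap way to check whether the output after step i still looks correct):

```python
from typing import Callable, Sequence

def first_bad_step(steps: Sequence[str],
                   works_up_to: Callable[[int], bool]) -> str:
    """Return the first step whose output is already broken.

    Assumes the overall workflow fails (the final output is bad) and that
    `works_up_to(i)` reports whether everything up to and including step i
    still looks correct, e.g. by inspecting that step's intermediate output.
    """
    lo, hi = 0, len(steps) - 1            # the first bad step lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if works_up_to(mid):
            lo = mid + 1                  # breakage happens after `mid`
        else:
            hi = mid                      # breakage is at or before `mid`
    return steps[lo]

# e.g. first_bad_step(["ingest", "parse", "enrich", "render"], check_fn)
# where check_fn is whatever inspection you can afford for that pipeline.
```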
This is critical to apply in cases where the problem might not even be caused by a code commit in the repo you're bisecting!
Tacking on my own article about git bisect run. It really is an amazing little tool.
You can also use divide and conquer when dealing with a complex system.
Like, traffic going from A to B can turn ... complicated with VPNs and such. You kinda have source firewalls, source routing, connectivity of the source to a router, routing on the router, firewalls on the router, various VPN configs that can go wrong, and all of that on the destination side as well. There can easily be 15+ things that can cause the traffic to disappear.
That's why our runbook recommends starting troubleshooting by dumping traffic on the VPN nodes. That's a very low-effort, quick step to figure out which of the six-ish legs of the journey drops traffic - to VPN, through VPN, to destination, back to VPN node, back through VPN, back to source. Then you realize traffic back to the VPN node disappears and you can dig into that.
And this is a powerful concept to think through in system troubleshooting: Can I understand my system as a number of connected tubes, so that I have a simple, low-effort way to pinpoint one tube to look further into?
As another example, for many services, the answer here is to look at the requests on the loadbalancer. This quickly isolates which services are throwing errors blowing up requests, so you can start looking at those. Or, system metrics can help - which services / servers are burning CPU and thus do something, and which aren't? Does that pattern make sense? Sometimes this can tell you what step in a pipeline of steps on different systems fails.
git bisect is an absolute power feature everybody should be aware of. I use it maybe once or twice a year at most but it's the difference between fixing a bug in an hour vs spending days or weeks spinning your wheels
Bisection is also useful when debugging css.
When you don't know what is breaking that specific scroll or layout somewhere in the page, you can just remove half the DOM in the dev tools and check if the problem is still there.
Rinse and repeat, it's a basic binary search.
I am often surprised that leetcode black belts are absolutely unable to apply what they learn in the real world, neither in code nor in debugging, which always reminds me of what a useless metric it is for hiring engineers.
Binary search rules. Being systematic about dividing the problem in half, determining which half the issue is in, and then repeating applies to non-software problems quite well. I use the strategy all the time while troubleshooting issues with cars, etc.
Not to complain about bisect, which is great. But IMHO it's really important to distinguish the philosophy and mindspace aspect to this book (the "rules") from the practical advice ("tools").
Someone who thinks about a problem via "which tool do I want" (c.f. "git bisect helps a lot"[1]) is going to be at a huge disadvantage to someone else coming at the same decisions via "didn't this used to work?"[2]
The world is filled to the brim with tools. Trying to file away all the tools in your head just leads to madness. Embrace philosophy first.
[1] Also things like "use a time travel debugger", "enable logging", etc...
[2] e.g. "This state is illegal, where did it go wrong?", "What command are we trying to process here?"
I've spent the past two decades working on a time travel debugger so obviously I'm massively biased, but IMO most programmers are not nearly as proficient in the available debug tooling as they should be. Consider how long it takes to pick up a tool so that you at least have a vague understanding of what it can do, and compare that to how much time a programmer spends debugging. Too many just spend hour after hour hammering out printf's.
I find the tools are annoyingly hard to use, particularly when a program is using a build system you aren't familiar with. I love time travelling debuggers, but I've also lost hours to getting large java, or C++ programs, into any working debuggers along with their debugging symbols (for C++).
This is one area where I've been disappointed by rust, they cleaned up testing, and getting dependencies, by getting them into core, but debugging is still a mess with several poorly supported cargo extensions, none of which seem to work consistently for me (no insult to their authors, who are providing something better than nothing!)
Different mindsets, I find many problems easier to reason about from traces.
It's very rare for me to use debuggers.
Just be careful to not contradict #3 “Quit thinking and look”.
Touché
Make sure you're editing the correct file on the correct machine.
Yep, this is a variation of "check the plug"
I find myself doing this all the time now. I will temporarily add a line to cause a fatal error, to check that it's the right file (and, depending on the situation, also the right line).
This is also covered by "make it fail"
I'm glad I'm not the only one doing this after I wasted too much time trying to figure out why my docker build was not reflecting the changes ... never again..
How much time I've wasted unknowingly editing generated files, out of version files, forgetting to save, ... only god knows.
The biggest thing I've always told myself and anyone I've taught: make sure you're running the code you think you're running.
Baby steps, if the foundation is shaky no amount of reasoning on top is going to help.
Poor Yorick!
That's why you make it break differently first. To see your changes have any effect.
When working on a test that has several asserts, I have adopted the process of adding one final assert, "assert 'TEST DEBUGGED' is False", so that even when I succeed, the test fails -- and I could review to consider if any other tests should be added or adjusted.
Once I'm satisfied with the test, I remove the line.
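Roughly what that looks like in a pytest-style test (the module and function under test here are made up; the point is the final tripwire assert):

```python
from myproject.parser import parse_record   # hypothetical code under test

def test_parse_handles_empty_input():
    result = parse_record("")
    assert result.errors == []
    assert result.fields == {}

    # Tripwire: keeps the test red even when everything above passes,
    # forcing a review of whether more asserts or cases belong here.
    # Removed once I'm satisfied with the test.
    assert False, "TEST DEBUGGED"
```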
Also that it's in the correct folder
very first order of business: git stash && git checkout main && git pull
...and you are building and running the correct clone of a repository
Some additional rules:

- "It is your own fault". Always suspect your code changes before anything else. It can be a compiler bug or even a hardware error, but those are very rare.
- "When you find a bug, go back and hunt down its family and friends". Think where else the same kind of thing could have happened, and check those.
- "Optimize for the user first, the maintenance programmer second, and last if at all for the computer".
The first one is known in the Pragmatic Programmer as “select isn’t broken.” Summarized at https://blog.codinghorror.com/the-first-rule-of-programming-...
Alternatively, I've found the "Maybe it's a bug. I'll try and make a test case I can report on the mailing list" approach useful at times.
Usually, in the process of reducing my error-generating code down to a simpler case, I find the bug in my logic. I've been fortunate that heisenbugs have been rare.
Once or twice, I have ended up with something to report to the devs. Generally, those were libraries (probably from sourceforge/github) with only a few hundred or less users that did not get a lot of testing.
I always have the mindset of "its my fault". My Linux workstation constantly crashing because of the i9-13900k in it was honestly humiliating. Was very relieved when I realized it was the CPU and not some impossible to find code error.
Linux is like an abusive relationship in that way--it's always your fault.
About "family and friends", a couple of times by fixing minor and a priori unrelated side issues, it revealed the bug I was after.
It's healthier to assume your code is wrong than otherwise. But it's best to simply bisect the cause-effect chain a few more times and be sure.
Step 10, add the bug as a test to the CI to prevent regressions? Make sure the CI fails before the fix and works after the fix.
The largest purely JavaScript repo I ever worked on (150k LoC) had this rule and it was a life saver, particularly because the project had commits dating back more than five years and since it was a component/library, it had quite few strange hacks for IE.
I don't think this is always worth it. Some tests can be time consuming or complex to write, have to be maintained, and we accept that a test suite won't be testing all edge cases anyway. A bug that made it to production can mean that particular bug might happen again, but it could be a silly mistake and no more likely to happen again than 100s of other potential silly mistakes. It depends, and writing tests isn't free.
Writing tests isn't free but writing non-regression tests for bugs that were actually fixed is one of the best test cases to consider writing right away, before the bug is fixed. You'll be reproducing the bug anyway (so already consider how to reproduce). You'll also have the most information about it to make sure the test is well written anyway, after building a mental model around the bug.
Writing tests isn't free, I agree, but in this case a good chunk of the cost of writing them will have already been paid in a way.
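For what it's worth, a minimal sketch of such a write-it-before-the-fix regression test (the module, function, and the bug itself are invented for illustration):

```python
from billing import compute_invoice_total   # hypothetical module containing the bug

def test_invoice_total_ignores_cancelled_items():
    """Regression test for an invented bug: cancelled items were still billed.

    Written before the fix so it fails first, then passes once the fix lands,
    and it stays in CI so the same mistake can't quietly come back.
    """
    items = [
        {"price_cents": 1000, "cancelled": False},
        {"price_cents": 2500, "cancelled": True},   # must not be counted
    ]
    assert compute_invoice_total(items) == 1000
```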
For people who aren't getting the value of unit tests, this is my intro to the idea. You had to do some sort of testing on your code. At its core, the concept of unit testing is just, what if instead of throwing away that code, you kept it?
To the extent that other concerns get in the way of the concept, like the general difficulty of testing that GUIs do what they are supposed to do, I don't blame the concept of unit testing; I blame the techs that make the testing hard.
I also think that this is a great way to emphasize their value.
If anything I'd only keep those if it's hard to write them, if people push back against it (and I myself don't like them sometimes, e.g. when the goal is just to push up the coverage metric but without actually testing much, which only add test code to maintain but no real testing value...).
Like any topic there's no universal truth and lots of ways to waste time and effort, but this specifically is extremely practical and useful in a very explicit manner: just fix it once and catch it the next time before production. Massively reduce the chance one thing has to be fixed twice or more.
I can't count how many times when other people ask me "how can I use this API?", I just send a test case to them. Best example you can give to someone that is never out of sync.
> Writing tests isn't free, I agree, but in this case a good chunk of the cost of writing them will have already been paid in a way.
Some examples that come to mind are bugs to do with UI interactions, visuals/styling, external online APIs/services, gaming/simulation stuff, and asynchronous/thread code, where it can be a substantial effort to write tests for, vs fixing the bug that might just be a typo. This is really different compared to if you're testing some pure functions that only need a few inputs.
It depends on what domain you're working in, but I find people very rarely mention how much work certain kinds of test can be to write, especially if there aren't similar tests written already and you have to do a ton of setup like mocking, factories, and UI automation.
Definitely agree with you on the fact that there are tests which are complicated to write and will take effort.
But I think all other things considered my impression still holds, and that I should maybe rather say they're easier to write in a way, though not necessarily easy.
You (or other people) will thank yourself in a few months/years when refactoring the code, knowing that they don't need to worry about missing edge cases, because all known edge cases are covered with these non regression tests.
There's really no situation you wouldn't write a test? Have you not had situations where writing the test would take a lot of effort vs the risk/impact of the bug it's checking for? Your test suite isn't going to be exhaustive anyway so there's always a balance of weighing up what's worth testing. Going overkill with tests can actually get in the way of refactoring as well when a change to the UI or API isn't done because it would require updating too many tests.
I have, and in 90% of cases that's because my code is seriously messed up, so that (1) it is not easy to write a unit test (2) there has not been enough effort in testing so that mocks are not available when I need it.
Adding tests is not easy, and you can always find excuses to not do that and instead "promise" to do it later, which almost never happens. I have seen enough to know this. Which is why I myself have put more work in writing unit tests, refactoring code and creating test mocks than anyone else in my team.
And I can't tell you how much I appreciate it when I find that this has benefitted me personally -- when I need to write a new test, often I find that I can reuse 80% if not more of the test setup and focus on the test body.
After adding the first test, it becomes much easier to add the second and third test. If you don't add the first test or put the effort into making your code actually testable, it's never going to be easy to test anything.
I'm assuming here the code is written in a testable way, but the behaviour is hard or time-consuming to test.
It's not about being lazy or making excuses, it's not free to write exhaustive tests and the resources have to come from somewhere. For MVPs, getting the first release out is going to be much more important for example.
You are not supposed to spend a significant amount of time testing "behavior". Unit tests should be the vast majority of testing.
And I can tell you I have first hand experience of an "MVP" product getting delayed, multiple times, because management realizes that nobody wants to purchase our product when they discover how buggy and unusable it is.
We probably work on different kinds of projects and/or different kinds of teams? I'd rather have the majority being end-to-end/UI tests than unit tests most of the time because you're not going to know if your app or UI is broken otherwise. I tend to work in smaller teams where everyone is experienced, and use languages with decent type checking.
It's very possible to write great software without any tests at all too. It's like people forget that coders used to write assembly code without source control, CI, mocking and all that other stuff.
What do you do with the years old bug fixes? How fast can one run the CI after a long while of accumulating tests? Do they still make sense to be kept in the long run?
I would say yes, your CI should accumulate all of those regression tests. Where I work we now have many, many thousands of regression test cases. There's a subset to be run prior to merge which runs in reasonable time, but the full CI just cycles through.
For this to work all the regression tests must be fast, and 100% reliable. It's worth it though. If the mistake was made once, unless there's a regression test to catch it, it'll be made again at some point.
> For this to work all the regression tests must be fast,
Doesn't matter how fast it is, if you're continually adding tests for every single line of code introduced eventually it will get so slow you will want to prune away old tests.
I'm not particularly passionate about arguing the exact details of "unit" versus "integration" testing, let alone breaking down the granularity beyond that as some do, but I am passionate that they need to be fast, and this is why. By that, I mean, it is a perfectly viable use of engineering time to make changes that deliberately make running the tests faster.
A lot of slow tests are slow because nobody has even tried to speed them up. They just wrote something that worked, probably literally years ago, that does something horrible like fully build a docker container and fully initialize a complicated database and fully do an install of the system and starts processes for everything and liberally uses "sleep"-based concurrency control and so on and so forth, which was fine when you were doing that 5 times but becomes a problem when you're trying to run it hundreds of times, and that's a problem, because we really ought to be running it hundreds of thousands or millions of times.
I would love to work on a project where we had so many well-optimized automated tests that despite their speed they were still a problem for building. I'm sure there's a few out there, but I doubt it's many.
This is a great problem to have, if (IME) rare. Step 1 Understand the System helps you figure out when tests can be eliminated as no longer relevant and/or which tests can be merged.
Why would you want to stop knowing that your old bug fixes still worked in the context of your system?
Saying "oh its been good for awhile now" has nothing to do with breaking it in the future.
I think for some types of bugs a CI test would be valuable if the developer believes regressions may occur, for other bugs they would be useless.
Yes, just more generally document it
I've lost count of how many things I've fixed only to see:
1) It recurs because a deeper "cause" of the bug reactivated it.
2) Nobody knew I fixed something so everyone continued to operate workarounds as if the bug was still there.
I realise these are related and arguably already fall under "You didn't fix it". That said a bit of writing-up and root-cause analysis after getting to "It's fixed!" seems helpful to others.
> #1 Understand the system: Read the manual, read everything in depth, know the fundamentals, know the road map, understand your tools, and look up the details.
Maybe I'm misunderstanding, but "Read the manual, read everything in depth" sounds like: oh, I have a bug in my code, so first read the entire manual of the library I'm using, all 700 pages, then read 7 books on the library's details, and now that a month or two has passed, go look at the bug.
I'd be curious if there's a single programmer that follows this advice.
I think we have a bit different interpretations here.
> read everything in depth
Is not necessarily
> first read the entire manual of the library I'm using, all 700 pages
If I have a problem with “git bisect”, I can either just go to stackoverflow, try several snippets and see what sticks, or I can go to https://git-scm.com/docs/git-bisect to get a bit deeper knowledge on the topic.
Essentially yes, that's correct. Your mistake is thinking that the outcome of those months of work is being able to kinda-probably fix one single bug. No: the point of all that effort is to truly fix all the bugs of that kind (or as close to "all" as is feasible), and to stop writing them in the first place.
The alternative is paradropping into an unknown system with a weird bug, messing randomly with things you don't understand until the tests turn green, and then submitting a PR and hoping you didn't just make everything even worse. It's never really knowing whether your system actually works or not. While I understand that is sometimes how it goes, doing that regularly is my nightmare.
P.S. if the manual of a library you're using is 700 pages, you're probably using the wrong library.
> if the manual of a library you're using is 700 pages, you're probably using the wrong library.
Statement bordering on papyrophobia. (Seems that is a real phobia)
More like fear of needless complexity and over-broad scope, which is actually rampant in library design.
This was written in 2004, the year of Google's IPO. Atwood and Spolsky didn't found Stack Overflow until 2008. [0] People knew things as the "Camel book" [1] and generally just knew things.
0. https://stackoverflow.blog/2021/12/14/podcast-400-an-oral-hi...
1. https://www.perl.com/article/extracting-the-list-of-o-reilly...
great strawman, guy that refuses to read documentation
I mean, he has a point. Things are incredibly complex nowadays; I don't think most people have time to "understand the system."
I would be much more interested in rules that don't start with that... Like "Rules for debugging when you don't have the capacity to fully understand every part of the system."
Bisecting is a great example here. If you are Bisecting, by definition you don't fully understand the system (or you would know which change caused the problem!)
If folks want to instill this mindset in their kids, themselves or others I would recommend at least
The Martian by Andy Weir https://en.wikipedia.org/wiki/The_Martian_(Weir_novel)
https://en.wikipedia.org/wiki/Zen_and_the_Art_of_Motorcycle_...
https://en.wikipedia.org/wiki/The_Three-Body_Problem_(novel)
To Engineer Is Human - The Role of Failure in Successful Design By Henry Petroski https://pressbooks.bccampus.ca/engineeringinsociety/front-ma...
https://en.wikipedia.org/wiki/Surely_You%27re_Joking,_Mr._Fe...!
What part of Three Body Problem has anything to do with debugging or troubleshooting or any form of planning?
Things just sort of happen with wild leaps of logic. The book is actually a fantasy book with thinnest layer of science babble on top.
Agree. I think Zen and the Art of Motorcycle Maintenance encapsulates the art of troubleshooting the best. Especially the concept of "gumption traps",
"What you have to do, if you get caught in this gumption trap of value rigidity, is slow down...you're going to have to slow down anyway whether you want to or not...but slow down deliberately and go over ground that you've been over before to see if the things you thought were important were really important and to -- well -- just stare at the machine. There's nothing wrong with that. Just live with it for a while. Watch it the way you watch a line when fishing and before long, as sure as you live, you'll get a little nibble, a little fact asking in a timid, humble way if you're interested in it. That's the way the world keeps on happening. Be interested in it."
Words to live by
I'm slogging through Zen; it's a bit trite so far (opening pages). I'm struggling to continue. When will it stop talking about the climate and blackbirds and start saying something interesting?
Yes it starts slow, but keep at it. It starts to get interesting about halfway into the book, if I remember correctly.
One also doesn't need to enjoy something to get something very worthwhile out of it. Not apologizing for Zen, but all books have rough patches to someone.
Yeah, it is probably what I would start with and all the messages the book is sending you will resurface in the others. You have to cultivate that debugging mind, but once it starts to grow, it can't be stopped.
I wrote a fairly similar take on this a few years ago (without having read the original book mentioned here) -- https://explog.in/notes/debugging.html
Julia Evans also has a very nice zine on debugging: https://wizardzines.com/zines/debugging-guide/
I love Julia Evans’ zine! Bought several copies when it came out, gave some to coworkers and donated one to our office library.
Then, after successful debugging your job isn't finished. The outline of "Three Questions About Each Bug You Find" <http://www.multicians.org/thvv/threeq.html> is:
1. Is this mistake somewhere else also?
2. What next bug is hidden behind this one?
3. What should I do to prevent bugs like this?
The article is a 2024 "review" (really more of a very brief summary) of a 2002 book about debugging.
The list is fun for us to look at because it is so familiar. The enticement to read the book is the stories it contains. Plus the hope that it will make our juniors more capable of handling complex situations that require meticulous care...
The discussion on the article looks nice but the submitted title breaks the HN rule about numbering (IMO). It's a catchy take on the post anyway. I doubt I would have looked at a more mundane title.
> The article is a 2024 "review"
2004.
I’m not sure that doesn’t sit well with me.
Rule 1 should be: Reproduce with most minimal setup.
99% you’ll already have found the bug.
The 1% for me was a font that couldn’t do a certain combination of letters in a row ("ft" just didn’t work), and that’s why it made mistakes in the PDF.
No way I could’ve ever known that if I wouldn’t have reproduced it down to the letter.
Just split code in half till you find what’s the exact part that goes wrong.
Related: decrease your iterating time as much as possible. If you can test your fix in 30 seconds vs 5 minutes, you’ll fix it in hours instead of days.
Rule 4 is divide and conquer, which is the 'splitting code in half' you reference.
I'd argue that you can't effectively split something in half unless you first understand the system.
The book itself really is wonderful - the author is quite approachable and anything but dogmatic.
I have been bitten more than once thinking that my initial assumption was correct, diving deeper and deeper - only to realize I had to ascend and look outside of the rabbit hole to find the actual issue.
> Assumption is the mother of all screwups.
This is how I view debugging, aligning my mental model with how the system actually works. Assumptions are bugs in the mental model. The problem is conflating what is knowledge with what is an assumption.
I've once heard from an RTS game caster (IIRC it was Day9 about Starcraft) "Assuming... Is killing you".
When playing Captain Obvious (i.e. the human rubber duck) with other devs, every time they state something to be true my response is, "prove it!" It's amazing how quickly you find bugs when somebody else is making you question your assumptions.
I also think it is worthwhile stepping thru working code with a debugger. The actual control flow reveals what is actually happening and will tell you how to improve the code. It is also a great way to demystify how others' code runs.
I agree and have found using a time travel debugger very useful because you can go backwards and forwards to figure out exactly what the code is doing. I made a video of me using our debugger to compare two recordings - one where the program worked and one where a very intermittent bug occurred. This was in code I was completely unfamiliar with so would have been hard for me to figure out without this. The video is pretty rubbish to be honest - I could never work in sales - but if you skip the first few minutes it might give you a flavour of what you can do. (I basically started at the end - where it failed - and worked backwards comparing the good and bad recordings) https://www.youtube.com/watch?v=GyKrDvQ2DdI
That is rule #3. quit thinking and look. Use whatever tool you need and look at what is going on. The next few rules (4-6) are what you need to do while you are doing step #3.
Make sure through pure logic that you have correctly identified the Root Cause. Don't fix other probable causes. This is very important.
I think that fits nicely under rule 1 ("Understand the system"). The rules aren't about tools and methods, they're about core tasks and the reason behind them.
This is necessary sometimes when you’re simply working on an existing code base.
> Check the plug
I just spent a whole day trying to figure out what was going on with a radio. Turns out I had tx/rx swapped. When I went to check tx/rx alignment I misread the documentation in the same way as the first. So, I would even add "try switching things anyways" to the list. If you have solid (but wrong) reasoning for why you did something then you won't see the error later even if it's right in front of you.
Yes the human brain can really be blind when its a priori assumptions turn out to be wrong.
One good timesaver: debug in the easiest environment that you can reproduce the bug in. For instance, if it’s an issue with a website on an iPad, first see if you reproduce in chrome using the responsive tools in web developer. If that doesn’t work, see if it reproduces in desktop safari. Then the iPad simulator, and only then the real hardware. Saves a lot of frustration and time, and each step towards the actual hardware eliminates a whole category of bugs.
I used to manage a team that supported an online banking platform and gave a copy of this book to each new team member. If nothing else, it helped create a shared vocabulary.
It's useful to get the poster and make sure everyone knows the rules.
Also sometimes: the bug is not in the code, it's in the data.
A few times I looked for a bug like "something is not happening when it should" or "This is not the expected result", when the issue was with some config file, database records, or thing sent by a server.
For instance, particularly nasty are non-printable characters in text files that you don't see when you open the file.
"simulate the failure" is sometimes useful, actually. Ask yourself "how would I implement this behavior", maybe even do it.
Also: never reason on the absence of a specific log line. The logs can be wrong (bugged) too, sometimes. If you're printf-debugging a problem around a conditional, for instance, log both branches.
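A tiny Python illustration of that last point (the user object and the helper are made up):

```python
import logging

log = logging.getLogger(__name__)

def maybe_send_reminder(user):
    if user.has_active_subscription():
        log.debug("sending renewal reminder to %s", user.id)
        send_renewal_reminder(user)          # hypothetical helper
    else:
        # Log the branch you expect NOT to be taken as well; the absence of
        # the line above then actually means something, instead of possibly
        # being a bug in the logging itself.
        log.debug("skipping %s: no active subscription", user.id)
```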
> Make it fail: Do it again, start at the beginning, stimulate the failure, don't simulate the failure, find the uncontrolled condition that makes it intermittent, record everything and find the signature of intermittent bugs
Unfortunately, I've found that many times this is actually the most difficult step. I've lost count of how many times our QA reported an intermittent bug in their env, only to never be able to reproduce it again in the lab. Until it hits 1 or 2 customers in the field, but then when we try to take a look at the customer's env, it's gone and we don't know when it could come back again.
Over twenty five odd years, I have found the path to a general debugging prowess can best be achieved by doing it. I'd recommend taking the list/buying the book, using https://up-for-grabs.net to find bugs on github/bugzilla, etc. and doing the following:
1. set up the dev environment
2. fork/clone the code
3. create a new branch to make changes and tests
4. use the list to try to find the root cause
5. create a pull request if you think you have fixed the bug
And use Rule 0 from GuB-42: Don't panic
(edited for line breaks)
Very good online course on debugging: Software Debugging on Udacity by Andreas Zeller
Udacity is owned by Accenture? That is...Surprising.
For folks who love to read books, here's an excerpt from the Debugging book's accompanying website (https://debuggingrules.com/):
"Dave was asked as the author of Debugging to create a list of 5 books he would recommend to fans, and came up with this.
https://shepherd.com/best-books/to-give-engineers-new-perspe..."
thanks for the link
The tenth golden rule:
10) Enable frame pointers [1].
[1] The return of the frame pointers:
Nice classic that sticks to timeless principles. The nine rules are practical, with war stories that make them stick. But agree that "don't panic" should be added.
Taking the time to speed up my iteration cycles has always been incredibly valuable. It can be really painful because it's not directly contributing to determining/fixing the bug (which could be exacerbated if there is external pressure), but it's always been worth it. Of course, this only applies to instances where it takes ~4+ minutes to run a single 'experiment' (test, startup, etc.). I find that when I do just try to push through with long-running tests, I'll often forget the exact variable I tweaked during the course of the run. Further, these tweaks can be very nuanced and require you to maintain a lot of the larger system in your head.
A good bug is the most fun thing about software development
Just let a LLM create even better bugs for you - Erik Meijer https://www.youtube.com/live/SsJqmV3Wtkg?si=MUoiNbWpsunsZ39y...
Sometimes I am actually happy when there is an obvious bug. It is like solving a murder mystery.
And often you’re the culprit too!
The victim, the murderer, and the detective.
I had the incredible luck to stumble upon this book early in my career and it helped me tremendously in so many ways. If I could name only one it would be that it helped me get over the sentiment of being helpless in front of a difficult situation. This book brought me to peace with imperfection and me being an artisan of imperfection.
> Check that it's really fixed, check that it's really your fix that fixed it, know that it never just goes away by itself
I wish this were true, and maybe it was in 2004, but when you've got noise coming in from the cloud provider and noise coming in from all of your vendors I think it's actually quite likely that you'll see a failure once and never again.
I know I've fixed things for people without asking if they ever noticed it was broken, and I'm sure people are doing that to me also.
I've had trouble keeping the audit trail. It can distract from the flow of debugging, and there can be lots of details to it, many of which end up being irrelevant; i.e. all the blind rabbit holes that were not on the maze path to the bug. Unless you're a consultant who needs to account for the hours, or a teller of engaging debugging war stories, the red herrings and blind alleys are not that useful later.
> Quit thinking and look (get data first, don't just do complicated repairs based on guessing)
From my experience, this is the single most important part of the process. Once you keep in mind that nothing paranormal ever happens in systems and everything has an explanation, it is your job to find the reason for things, not guess them.
I tell my team: just put your brain aside and start following the flow of events checking the data and eventually you will find where things mismatch.
There's a book I love and always talk about called "Stop Guessing: The 9 Behaviors of Great Problem Solvers" by Nat Greene. It's coincidental, I guess, that they both have 9 steps. Some of the steps are similar so I think the two books would be complementary, so I'm going to check out "Debugging" as well.
I worked at a place once where the process was "Quit thinking, and have a meeting where everyone speculates about what it might be." "Everyone" included all the nontechnical staff to whom the computer might as well be magic, and all the engineers who were sitting there guessing and as a consequence not at a keyboard looking.
I don't miss working there.
Acquire a rubber duck. Teach the duck how the system works.
One thing I have been doing is to have the software create a directory called "debug" and write lots of different files of debugging information as the main program executes (only writing files outside of hot loops), and then visually inspect the logs once the program has exited.
For intermediate representations this is better than printf to stdout
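Something like this, say, in Python (the names and the file format are just one way to do it):

```python
import json
from pathlib import Path

DEBUG_DIR = Path("debug")
DEBUG_DIR.mkdir(exist_ok=True)

def dump_debug(name: str, payload) -> None:
    """Write one intermediate representation per file so it can be inspected
    after the program exits. Only call this outside hot loops."""
    (DEBUG_DIR / f"{name}.json").write_text(
        json.dumps(payload, indent=2, default=str)
    )

# e.g. after the parsing stage, before the transform stage (made-up names):
# dump_debug("parsed_records", records)
```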
I found this book so helpful I created a worksheet based on it - might be helpful for some: https://andypi.co.uk/2024/01/26/concise-guide-to-debugging-a...
I can't comment further on David A. Wheeler's review because his words are from 2004 (and everything he said rings true), and I can't comment on the book either because I haven't read it yet.
Thank you for introducing me to this book.
One of my favorite rules of debugging is to read the code in plain language. If the words don't make sense somewhere, you have found the problem or part of it.
First time hearing about these 9 rules, but I learned most of them by experience, over many years of resolving or trying to resolve bugs.
The only thing I don't agree with is the book costing US$ 4.291,04 on Amazon
btw it's the hardcover that's this price
My first rule for debugging debutants:
Don't be too embarrassed to scatter debug log messages in the code. It helps.
My second rule:
Don't forget to remove them when you're done.
My rule for a long time has been: any time I add a print or log (except for the first time I am writing some new code with tricky logic, which I try not to do), never delete it. Lower it to the lowest possible debug or trace level, but if it was useful once it will be useful again, even if only to document the flow through the code on full debug.
The nicest log package I had would always count the number of times a log msg was hit, even if the debug level meant nothing happened. The C preprocessor made this easy; I haven't been able to get a short way to do this counting efficiently in other languages.
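I don't know of a drop-in equivalent elsewhere either, but a rough Python approximation of the same idea (count every hit, even when the output itself is suppressed) might look like this; it won't match the C macro version for efficiency:

```python
import atexit
from collections import Counter

_hits = Counter()
TRACE_ENABLED = False        # flip to True for full debug output

def trace(site: str, msg: str = "") -> None:
    """Count every time a trace point is reached, even when printing is off."""
    _hits[site] += 1
    if TRACE_ENABLED and msg:
        print(f"{site}: {msg}")

# Dump the counters at exit, so the flow through the code is visible
# even when the trace messages themselves were suppressed.
atexit.register(lambda: print(dict(_hits)))
```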
I really like this.
#7 Check the plug: Question your assumptions, start at the beginning, and test the tool.
I have found that 90% of network problems, are bad cables.
That's not an exaggeration. Most IT folks I know throw out ethernet cables immediately. They don't bother testing them. They just toss 'em in the trash, and break a new one out of the package.
I prefer to cut the connectors off with savage vengeance before tossing the faulty cable ;-)
Post: "9 rules of debugging"
Each comment: "..and this is my 10th rule: <insert witty rule>"
Total number of rules when reaching the end of the post: 9 + n + n * m, with n being number of users commenting, m being the number of users not posting but still mentally commenting on the other users' comments.
> Ask for fresh insights (just explaining the problem to a mannequin may help!)
You can’t trust a thing this person says if they’re not recommending a duck.
Rule 11: If you haven't solved it and reach this rule, one of your assertions is incorrect. Start over.
One I learned on Friday: Check your solder connections under a microscope before hacking the firmware.
The worst is when it works only when your oscilloscope probe is pushing down on the pin.
Review was good enough to make me snag the entire book. I'm taking a break from algorithmic content for a bit and this will help. Besides, I've got an OOM bug at work and it will be fun to formalize the steps of troubleshooting it. Thanks, OP!
I recommend this book to all Jr. devs. Many feel very overwhelmed by the process. Putting it into nice interesting stories and how to be methodical is a good lesson for everyone.
Wasn't Bryan Cantrill writing a book about debugging? I'd love to read that.
I was! (Along with co-author Dave Pacheco.) And I still have the dream that we'll finish it one day: we had written probably a third of it, but then life intervened in various dimensions. And indeed, as part of our preparation to write our book (which we titled The Joy of Debugging), we read Wheeler's Debugging. On the one hand, I think it's great to have anything written about debugging, as it's a subject that has not been treated with the weight that it deserves. But on the other, the "methodology" here is really more of a collection of aphorisms; if folks find it helpful, great -- but I came away from Debugging thinking that the canonical book on debugging has yet to be written.
Fortunately, my efforts with Dave weren't for naught: as part of testing our own ideas on the subject, I gave a series of presentations from ~2015 to ~2017 that described our thinking. A talk that pulls many of these together is my GOTO Chicago talk in 2017, on debugging production systems.[0] That talk doesn't incorporate all of our thinking, but I think it gets to a lot of it -- and I do think it stands at a bit of contrast to Wheeler's work.
It's a great talk! I have stolen your "if you smell smoke, find the source" advice and put it in some of my own talks on the subject.
I’m so bad at #1.
I know it is the best route, I do know the system (maybe I wrote it) and yet time and again I don’t take the time to read what I should… and I make assumptions in hopes of speeding up the process/ fix, and I cost myself time…
Rule #10 - it’s probably DNS
Years ago, my boss thought he was being clever and set our server’s DNS to the root nameservers. We kept getting sporadic timeouts on requests. That took a while to track down… I think I got a pizza out of the deal.
That sounds hellish
I would almost change 4 into "Binary search".
Wheeler gets close to it by suggesting to locate which side of the bug you're on, but often I find myself doing this recursively until I locate it.
Yeah people say use git bisect but that's another dimension (which change introduced the bug).
Bisecting is just as useful when searching for the layer of the application which has the bug (including external libraries, OS, hardware, etc.) or the data ranges that trigger the bug. There are just no handy tools like git bisect for that. So this amounts to writing down what you tested and removing the possibilities that you excluded with each test.
Personally, I’d start with divide and conquer. If you’re working on a relevant code base chances are that you can’t learn all the API spec and documentation because it’s just too much.
Also: Fix every bug twice: Both the implementation and the “call site” — if at all possible
Ye ol’ “belt and suspenders” approach?
Check the plug should be first
Did anyone say debugging?
I've followed https://debugbetter.com/ for a few weeks and the content has been great!
The unspoken rule is talking to the rubber duck :)
That is literally #8 in the list
Rule #10 - read everything twice
Hmm. Perhaps, but mannequin is not nearly as whimsical sounding as a rubber duck which inspires you to bounce your ideas off of the inanimate object.
Go on a walk or take a shower...
This is related to the classic debugging book with the same title. I first discovered it here in HN.
I’d add “a logging module done today will save you a lot of overtime next year”.
(2004)
Title is: David A. Wheeler's Review of Debugging by David J. Agans
Great book +1
I love the "if you didn't fix it, it ain't fixed". It's too easy to convince yourself something is fixed when you haven't fully root-caused it. If you don't understand exactly how the thing your seeing manifested, papering over the cracks will only cause more pain later on.
As someone who has been working on a debugging tool (https://undo.io) for close to two decades now, I totally agree that it's just weird how little attention debugging as a whole gets. I'm somewhat encouraged to see this topic staying near the top of hacker news for as long as it has.
> If you didn't fix it, it ain't fixed
AKA: “Problems that go away by themselves come back by themselves.”
rule -1: don't trust the bug issuer
bad rule for bad programmers.
trust, but verify.
yes, that was the point.
Halo
>Rule 1: Understand the system: Read the manual, read everything in depth (...)
Yeah, ain't nobody got time for that. If e.g. debugging a compile issue meant we read the compiler manual, we'd get nothing done...