• hughdbrown 6 months ago

    In my experience, the most pernicious temptation is to take the buggy, non-working code you have now and try to modify it with "fixes" until it works. You often cannot turn broken code into working code that way because there are too many possible changes to make. In my view, it is much easier to break working code than it is to fix broken code.

    Suppose you have a complete chain of N Christmas lights and they do not work when turned on. The temptation is to go through the chain, swapping a single known-working bulb into each position in turn until you identify the non-working light.

    But suppose there are multiple non-working lights? You'll never find the fault with this approach. Instead, you need to start from a minimal working configuration -- possibly just a single light (if your Christmas lights work that way) -- and add more lights until you hit an error. In fact, the best case is when you have the broken string of lights and a similar but working string! Then you can easily swap a test bulb out of the broken string and into the working string until you have found all the bad bulbs in the broken one.

    Starting with a minimal working example is the best way I have found to fix a bug. You will resist this because you believe you are close and starting from scratch feels too time-consuming. In practice, it tends to be a real time-saver, not the opposite.
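
    A toy sketch of the same build-up approach applied to a misbehaving config file (every file name here is invented for illustration): start from a stripped-down config that is known to work, then graft pieces of the broken one back in until the failure reappears.

        # Start from a minimal config that is known to work...
        cp minimal.conf app.conf && ./run_app              # works
        # ...then re-add sections of the broken config one at a time.
        cat broken/logging.conf >> app.conf && ./run_app   # still works
        cat broken/network.conf >> app.conf && ./run_app   # fails -> culprit found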

    • randerson 6 months ago

      The quickest solution, assuming learning from the problem isn't the priority, might be to replace the entire chain of lights without testing any of them. I've been part of chasing some elusive production issues where eventually 1-2 team members attempted a rewrite of the offending routine while everyone else kept debugging, and the rewrite "won", shipping to production before we found the bug. Heresy, I know. In at least one case we never found the bug, because we could only dedicate a finite amount of time to a "fixed" issue.

    • GuB-42 6 months ago

      Rule 0: Don't panic

      Really, that's important. You need to think clearly; deadlines and angry customers are a distraction. That's also when having a good manager who can trust you is important: their job is to shield you from all that so that you can devote all of your attention to solving the problem.

      • augbog 6 months ago

        100% agree. I remember being on call when our PagerDuty started going off for a SEV-2, and naturally a lot of managers from the affected teams were in there sweating bullets because their products/features/metrics were impacted. It can get pretty frustrating having that many cooks in the kitchen. We had a great manager who literally just moved us to a different call/meeting and told us, "ignore everything those people are saying; just stay focused and I'll handle them." Everyone's respect for our manager really went up from there.

        • ianmcgowan 6 months ago

          There's a story in the book - on nuclear submarines there's a brass bar in front of all the dials and knobs, and the engineers are trained to "grab the bar" when something goes wrong rather than jumping right to twiddling knobs to see what happens.

          • Cerium 6 months ago

            Slow is smooth and smooth is fast. If you don't have time to do it right, what makes you think there is time to do it twice?

            • adamc 6 months ago

              I had a boss who used to say that her job was to be a crap umbrella, so that the engineers under her could focus on their actual jobs.

              • o_nate 6 months ago

                A corollary to this is to always have a good rollback plan. It's much nicer to be able to roll back to a working version and then debug without the crisis-level pressure.

                • CobrastanJorji 6 months ago

                  I once worked on a team where, when a serious visible incident occurred, a company VP would pace the floor, occasionally yelling, describing how much money we were losing per second (or how much customer trust, if the dollar figure was too low) or otherwise communicating that we were in a battlefield situation and things were Very Critical.

                  Later I worked for a company with a much bigger and more critical website, and the difference in tone during urgent incidents was amazing. The management made itself available for escalations and took a role in externally communicating what was going on, but besides that they just trusted us to do our jobs. We could even go get a glass of water during the incident without a VP yelling at us. I hadn't realized until that point that being calm adults was an option.

                  • chrsig 6 months ago

                    Also, a pager/phone going off incessantly isn't useful. Manage your alarms, or you'll end up throwing your phone at a wall.

                    • bilekas 6 months ago

                      This is very underrated. An extension to it: don't be afraid to break things further in order to probe. I often see devs, mid-level included, panic to the point of not even knowing where to start. I've come to believe that some people just have an inherent intuition for this and some need to learn it.

                      • bch 6 months ago

                        > Rule 0: Don’t panic

                        “Slow is smooth, smooth is fast.”

                        I know this and still get caught out sometimes…

                        Edit: hahaha ‘Cerium saying ~same thing: https://news.ycombinator.com/item?id=42683671

                        • ahci8e 6 months ago

                          Uff, yeah. I used to work with a guy who would immediately turn the panic up to 11 at the first thought of a bug in prod. We would end up with worse architecture after his "fix" or he would end up breaking something else.

                          • roxyrox 6 months ago

                              agreed. it’s practically a prerequisite for everything else in the book. Staying calm and thinking clearly is foundational.

                          • nickjj 6 months ago

                            For #4 (divide and conquer), I've found `git bisect` helps a lot. If you have a known good commit and one of dozens or hundreds of commits after that is bad, this can help you identify the bad commit / code in a few steps.

                            Here's a walk through on using it: https://nickjanetakis.com/blog/using-git-bisect-to-help-find...

                            I jumped into a pretty big unknown code base in a live consulting call and we found the problem pretty quickly using this method. Without that, the scope of where things could be broken was too big given the context (unfamiliar code base, multiple people working on it, only able to chat with 1 developer on the project, etc.).
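
                            For anyone who hasn't used it, a session looks roughly like this (the tag name and the test step are made up for illustration):

                                git bisect start
                                git bisect bad HEAD          # the current commit is broken
                                git bisect good v1.4.0       # a release known to be fine
                                # git checks out the midpoint commit; build and test it, then mark it:
                                git bisect good              # or: git bisect bad
                                # repeat until git names the first bad commit, then clean up:
                                git bisect reset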

                            • jerf 6 months ago

                              "git bisect" is why I maintain the discipline that all commits to the "real" branch, however you define that term, should all individually build and pass all (known-at-the-time) tests and generally be deployable in the sense that they would "work" to the best of your knowledge, even if you do not actually want to deploy that literal release. I use this as my #1 principle, above "I should be able to see every keystroke ever written" or "I want every last 'Fixes.' commit" that is sometimes advocated for here, because those principles make bisect useless.

                              The thing is, I don't even bisect that often... the discipline necessary to maintain that in your source code heavily overlaps with the disciplines that prevent regressions and bugs in the first place. But when I do finally use it, it can pay for itself in literally one shot once a year, because we get bisect out for the biggest, most mysterious bugs, the ones that I know from experience can involve devs staring at code for weeks. And while I've yet to have a bisect point at a one-line commit, I've definitely had it hand us multiple days' worth of clue in one shot.

                              If I were maintaining that discipline just for bisect, we might quibble over the cost/benefit, but since there are plenty of other reasons to maintain that discipline anyhow, it's a big win.
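
                              One cheap way to check that discipline before merging, assuming the test suite sits behind something like `make test` (a placeholder command here), is git's `--exec` flag, which replays the branch and runs the command on every commit rather than just the tip:

                                  # The rebase stops at the first commit whose tests fail,
                                  # so you can fix that commit in place.
                                  git rebase --exec 'make test' origin/main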

                              • smcameron 6 months ago

                                Back in the 1990s, while debugging a network configuration issue, a wiser, older colleague taught me the more general concept that lies behind git bisect, which is "compare the broken system to a working system and systematically eliminate differences to find the fault." This can apply to things other than software or computer hardware. Back in the 90s my friend and I had identical jet-skis on a trailer we shared. When working on one of them, it was nice to have its twin right there to compare it to.

                                • yuliyp 6 months ago

                                  The principle here, bisection, is a lot more general than just `git bisect` for identifying ranges of commits. It can also be used for partitioning the space of systems. For instance, if a workflow with 10 steps is broken, can you perform some tests to confirm that 5 of the steps functioned correctly? Can you figure out that it's definitely not a hardware issue (or definitely a hardware issue) somewhere?

                                  This is critical to apply in cases where the problem might not even be caused by a code commit in the repo you're bisecting!
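
                                  As a concrete sketch, suppose a hypothetical 10-step pipeline writes each step's output to out/step-NN.json (the layout is invented for illustration); checking the midpoint artifact first halves the search space:

                                      # If step 5's output is correct, the fault is in steps 6-10;
                                      # otherwise it's in steps 1-5. Recurse on the guilty half.
                                      diff <(jq -S . out/step-05.json) <(jq -S . expected/step-05.json) \
                                        && echo "steps 1-5 look fine; check 6-10" \
                                        || echo "fault is somewhere in steps 1-5"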

                                  • ajross 6 months ago

                                    Not to complain about bisect, which is great. But IMHO it's really important to distinguish the philosophy and mindspace aspect of this book (the "rules") from the practical advice ("tools").

                                    Someone who thinks about a problem via "which tool do I want?" (cf. "git bisect helps a lot"[1]) is going to be at a huge disadvantage to someone coming at the same decisions via "didn't this used to work?"[2]

                                    The world is filled to the brim with tools. Trying to file away all the tools in your head just leads to madness. Embrace philosophy first.

                                    [1] Also things like "use a time travel debugger", "enable logging", etc...

                                    [2] e.g. "This state is illegal, where did it go wrong?", "What command are we trying to process here?"

                                    • Icathian 6 months ago

                                      Tacking on my own article about git bisect run. It really is an amazing little tool.

                                      https://andrewrepp.com/git_bisect_run
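
                                      The short version, for anyone who hasn't seen it: bisect can drive itself if you hand it a test command. The script name below is hypothetical; it just needs to exit 0 on good commits and non-zero on bad ones.

                                          git bisect start HEAD v2.1.0   # bad revision first, then good
                                          git bisect run ./test.sh       # git checks out and tests each midpoint
                                          git bisect reset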

                                      • tetha 6 months ago

                                        You can also use divide and conquer when dealing with a complex system.

                                        Like, traffic going from A to B can turn ... complicated with VPNs and such. You kinda have source firewalls, source routing, connectivity of the source to a router, routing on the router, firewalls on the router, various VPN configs that can go wrong, and all of that on the destination side as well. There can easily be 15+ things that can cause the traffic to disappear.

                                        That's why our runbook recommends starting the troubleshooting by dumping traffic on the VPN nodes. That's a very low-effort, quick step to figure out which of the six-ish legs of the journey drops the traffic -- to the VPN, through the VPN, to the destination, back to the VPN node, back through the VPN, back to the source. Once you see that, say, traffic back to the VPN node disappears, you can dig into that.

                                        And this is a powerful concept to think through in system troubleshooting: Can I understand my system as a number of connected tubes, so that I have a simple, low-effort way to pinpoint one tube to look further into?

                                        As another example, for many services the answer is to look at the requests on the load balancer. That quickly isolates which services are throwing errors or blowing up requests, so you can start looking at those. Or system metrics can help: which services/servers are burning CPU (and thus doing something), and which aren't? Does that pattern make sense? Sometimes this can tell you which step in a pipeline of steps across different systems is failing.
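
                                        A minimal sketch of that first dumping step, with the interface names and addresses invented for illustration:

                                            # On the VPN node: does the source's traffic arrive at all,
                                            # and does a reply ever come back?
                                            tcpdump -ni eth0 host 10.1.2.3 and port 443
                                            # The same capture on the tunnel interface shows whether
                                            # packets survive the VPN leg or vanish before it.
                                            tcpdump -ni tun0 host 10.1.2.3 and port 443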

                                        • jvans 6 months ago

                                          git bisect is an absolute power feature everybody should be aware of. I use it maybe once or twice a year at most, but it's the difference between fixing a bug in an hour vs. spending days or weeks spinning your wheels.

                                          • epolanski 6 months ago

                                            Bisection is also useful when debugging CSS.

                                            When you don't know what is breaking that specific scroll or layout somewhere in the page, you can just remove half the DOM in the dev tools and check if the problem is still there.

                                            Rinse and repeat; it's a basic binary search.

                                            I am often surprised that leetcode black belts are absolutely unable to apply what they learn in the real world, in either code or debugging, which always reminds me what a useless metric it is for hiring engineers.

                                            • rozap 6 months ago

                                              Binary search rules. Being systematic about dividing the problem in half, determining which half the issue is in, and then repeating applies quite well to non-software problems too. I use the strategy all the time while troubleshooting issues with cars, etc.

                                            • qwertox 6 months ago

                                              Make sure you're editing the correct file on the correct machine.

                                              • ZedZark 6 months ago

                                                Yep, this is a variation of "check the plug"

                                                  I find myself doing this all the time now: I will temporarily add a line that causes a fatal error, to check that it's the right file (and, depending on the situation, also the right line).

                                                • eddd-ddde 6 months ago

                                                    How much time I've wasted unknowingly editing generated files or out-of-version files, forgetting to save, ... only god knows.

                                                  • netcraft 6 months ago

                                                      The biggest thing I've always told myself and anyone I've taught: make sure you're running the code you think you're running.

                                                    • ajuc 6 months ago

                                                        That's why you make it break differently first: to see that your changes have any effect.

                                                      • chupasaurus 6 months ago

                                                        Poor Yorick!