Thoughts on Debugging (catskull.net)
Submitted by todsacerdoti 7 hours ago
  • vardump 3 hours ago

    From the article:

      Here’s my one simple rule for debugging:
      
      Reproduce the issue.
    
    Unless it's one of those cursed things installed at the customer thousands of miles away that never happens back in the lab.

    Some things can be incredibly hard to debug and can depend on the craziest of things you'd never even consider possible. Like a thunderstorm causing voltage spikes that very subtly damage the equipment causing subtle failures a few months later. Sometimes that "software bug" turns out to be hardware in weird ways. Or issues like https://web.mit.edu/jemorris/humor/500-miles – every person who's debugging weird issues should read that.

    Once you can actually reproduce the issue, you've often done 80-99+% of the work already.

    • virgilp 2 hours ago

      It is true that for many (especially concurrency/distributed-systems-related) bugs, reproducing the issue might be the hardest part... but that's not always true.

      A long time ago, I was working on a C++ compiler for an embedded processor. A customer complains that when they turn on optimization, the code fails. 100% reproducible: just pass "-O3" and it fails; with "-O0" it works. Now, we're used to these bug reports; it's often a bug in the original software (like relying on undefined behaviour), but we manage to get the code and it looks good / we don't find anything strange. We can't run it, since it only works on the customer's board (their custom proprietary hardware, using our microprocessor). We look through the assembly and can't find any obvious optimization error the compiler made. After much (remote) debugging, it turns out it was a fault in their memory system.

      • vardump 27 minutes ago

        These kinds of things happen all the time in embedded development. They're tough to debug, because you initially can't know whether it's software or hardware.

      • vrighter 42 minutes ago

        I had some HP MicroServers once, some (but not all) of which got occasional bit corruption due to EMF interference from themselves.

      • avidiax 4 hours ago

        > Here’s my one simple rule for debugging:

        > Reproduce the issue.

        > . . .

        > I’m not sure that I’ve ever worked on a hard problem . . .

        I agree, the author has probably not worked on hard problems.

        There are many situations where either a) reproducing the problem is extremely elusive, or b) reproduction is easy, but the cause is especially elusive, or c) organizational issues prevent access to the system, to source code, or to people willing to investigate and/or solve the issue.

        Some examples:

        For A, the elusive reproduction, I saw an issue where we had an executive escalation that their laptop would always blue screen shortly after boot up. Nobody could reproduce this issue. Telemetry showed nobody else had this issue. Changing hardware didn't fix it. Only this executive had the anti-Midas touch to cause the issue. Turned out the executive lived on a large parcel, and had exactly one WiFi network visible. Some code parsing that list of WiFi APs had an off-by-one error which caused a BSOD. A large portion of wireless technology (Bluetooth/Thread/WiFi/cellular) bugs fall into this category.
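
        Not the actual code, of course, but the shape of that bug class is easy to sketch (hypothetical names, C):

            /* Hypothetical sketch, not the real code: an off-by-one walk over
             * a Wi-Fi scan list that reads one element past the end. */
            #include <stdio.h>

            struct ap_info { char ssid[33]; int rssi; };

            static void log_visible_aps(const struct ap_info *aps, int count)
            {
                for (int i = 0; i <= count; i++)          /* BUG: "<=" should be "<" */
                    printf("ssid=%s rssi=%d\n", aps[i].ssid, aps[i].rssi);
            }

            int main(void)
            {
                struct ap_info aps[1] = { { "OnlyNetworkForMiles", -40 } };
                log_visible_aps(aps, 1);   /* reads aps[1]: undefined behaviour;
                                              in kernel-mode code this can bugcheck */
                return 0;
            }

        With a long scan list the stray read tends to land in adjacent, harmless memory, which would explain why only the one-network laptop ever hit it.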

        For B, the easy to repro but still difficult, I've seen numerous bugs that cause stack corruption, random memory corruption, or trigger a hardware flaw that freezes or resets the system. These types of issues are terrible to debug, because either the logs aren't available (system comes down before the final moments), or because the culprit doesn't log anything and never harms themselves, only an innocent victim. Time-travel tracing is often the best solution, but is also often unavailable. Bisecting the code changes is sometimes little help in a busy codebase, since the culprit is often far away from their victims.
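
        The "never harms themselves" part is worth a tiny, contrived illustration (not any particular bug):

            /* Contrived sketch: module A overruns its own buffer and silently
             * corrupts state owned by module B, so the crash surfaces later,
             * far away from the code that caused it. */
            #include <stdio.h>
            #include <string.h>

            static struct {
                char  scratch[8];      /* owned by "module A" (the culprit) */
                long  record_count;    /* owned by "module B" (the victim)  */
            } g = { "", 42 };

            int main(void)
            {
                strcpy(g.scratch, "0123456789");           /* A overruns scratch[8]... */
                printf("records: %ld\n", g.record_count);  /* ...and B's counter is now
                                                              garbage; B fails later,
                                                              nowhere near A's code    */
                return 0;
            }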

        Category C is also pretty common if you are integrating systems. Vendors will have closed source and be unable or unwilling to admit even the possibility of fault, help with an investigation, or commit to a fix. Partners will have ship blocking bugs in hardware that they just can't show you or share with you, but it must nonetheless get fixed. You will often end up shipping workarounds for errors in code you don't control, or carefully instrumenting code to uncover the partner's issues.

        • pjc50 27 minutes ago

          My favourite case of this was simply "touchscreen sometimes doesn't work". That took over a year before it even made it to us developers, being reported intermittently across a large number of point-of-sale systems. Sensibly, people were reluctant to pass on something that they couldn't reproduce, and it was often blamed on hardware issues or user error, such as damage from people operating the (resistive) touchscreen with a bunch of keys or the smooth metal Dallas key fob that was also used to log in.

          We eventually got a robot to reproduce some other issues with PIN pads (which for obvious reasons do not support any kind of software injection of input!), and got some time using the robot to press the touchscreen over and over. That way we were able to confirm it happening, and start the debug process.

          (anticlimactically, I can't quite remember what the resolution was, something to do with debounce logic and interrupt events)

          • vardump 21 minutes ago

            > something to do with debounce logic and interrupt events

            Oh, the classic, an unfiltered hardware button connected to a GPIO that generates an IRQ? What could go wrong... :-D

            For the uninitiated: https://en.wikipedia.org/wiki/Switch#Contact_bounce
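
            A minimal fix is just a settle window; rough sketch with made-up HAL names (a millisecond tick and a raw edge flag are assumed):

                /* Debounce sketch (made-up HAL names): ignore edges that arrive
                 * within a short settle window after the last accepted one. */
                #include <stdbool.h>
                #include <stdint.h>

                #define DEBOUNCE_MS 20u

                extern uint32_t millis(void);          /* assumed: monotonic ms tick        */
                extern bool touch_edge_pending(void);  /* assumed: raw edge seen by the ISR */

                static uint32_t last_edge_ms;

                bool debounced_touch_event(void)
                {
                    if (!touch_edge_pending())
                        return false;
                    uint32_t now = millis();
                    if (now - last_edge_ms < DEBOUNCE_MS)
                        return false;                  /* still bouncing: swallow it */
                    last_edge_ms = now;
                    return true;                       /* one real press */
                }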

          • forrestthewoods 4 hours ago

            > I agree, the author has probably not worked on hard problems.

            Conversely, if you focus on creating a repro you make problems easy.

              I’m a little surprised the OP doesn’t mention debuggers. For whatever reason, many modern Linux programmers seem to have never used a debugger and rely almost entirely on printf. Which is utterly bonkers to me! If you can repro an issue and trap it in a debugger, then you’ve done 90% of the work.

            • vardump 3 hours ago

                Debuggers are not great when you have a bug that involves interaction between multiple CPU cores, hardware device timing, etc.

              In those cases, a debugger just can't stop the world, so back to printf it is.

              • forrestthewoods an hour ago

                printf also mucks with sensitive timing issues. So it’s not great either.

                  Also, time-traveling debuggers like WinDbg and rr are the ideal way to debug complex timing issues.

                • vardump 37 minutes ago

                  > printf also mucks with sensitive timing issues. So it’s not great either.

                    True, but this can be worked around, for example with lock-free ring buffers.
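
                    Something like this is what I mean: a bare-bones single-producer/single-consumer sketch (C11 atomics, made-up names), so the hot path is a couple of atomic operations instead of a printf:

                        /* SPSC ring buffer sketch: the timing-sensitive code records
                         * small events; a background thread (or a post-mortem dump of
                         * the buffer) formats them later. */
                        #include <stdatomic.h>
                        #include <stdint.h>

                        #define LOG_SLOTS 1024u   /* power of two */

                        struct log_event { uint64_t tstamp; uint32_t id; uint32_t arg; };

                        static struct log_event slots[LOG_SLOTS];
                        static atomic_uint head;   /* written only by the producer */
                        static atomic_uint tail;   /* written only by the consumer */

                        static inline void log_fast(uint32_t id, uint32_t arg, uint64_t now)
                        {
                            unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
                            if (h - atomic_load_explicit(&tail, memory_order_acquire) >= LOG_SLOTS)
                                return;            /* full: drop rather than block */
                            slots[h & (LOG_SLOTS - 1)] = (struct log_event){ now, id, arg };
                            atomic_store_explicit(&head, h + 1, memory_order_release);
                        }

                        static inline int log_pop(struct log_event *out)
                        {
                            unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
                            if (t == atomic_load_explicit(&head, memory_order_acquire))
                                return 0;          /* empty */
                            *out = slots[t & (LOG_SLOTS - 1)];
                            atomic_store_explicit(&tail, t + 1, memory_order_release);
                            return 1;
                        }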

                  > Also time traveling debuggers like WinDbg and rr are the ideal way to debug complex timing issues.

                    When applicable, rr can indeed be great. Not sure how to apply TTD tools in, say, a kernel context, though.

                  Nice to see Microsoft keeps further improving WinDbg. (Shame Visual Studio's debugger is so weak in comparison.)

              • glandium 3 hours ago

                And if you can repro in rr, you've done 99% of the work most of the time.

            • lordnacho 3 hours ago

              > Reproduce the issue.

              Once you have done this, you are already over the hump. It's like being the first rider over the last mountain on a Tour de France stage: you've more or less won by doing that.

              I'm not sure I even consider it a challenge if the issue is easily reproduced. You will simply grind out the solution once you have the reproduction done.

              The real bugs are the ones that you cannot replicate. The kind of thing that breaks once a week on 10 continuously running machines. You can't scale that system to 1,000 or more machines with the bug around; you'll be swamped with reports. But you also can't fix it, because the conditions to reproduce it are so elusive that your logs aren't useful. You don't know if all the errors have the same root cause.

              Typically the kind of thing that creates a lot of paths to check is "manual multithreading": a bunch of threads, each locking and unlocking (or forgetting to) access to shared data in a particular way. The number of possible "this reads and then that writes" interleavings explodes quite fast with such code, and it explodes in a way that isn't obvious from reading it. Sprinkling log output over such code can change the frequency of the errors.
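
              A toy version of what I mean (made-up example, pthreads): one writer takes the lock, another forgets, and a shared invariant quietly breaks:

                  /* Toy "manual multithreading" bug: one writer forgets the lock,
                   * so the two counters (which should stay equal) drift apart. */
                  #include <pthread.h>
                  #include <stdio.h>

                  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
                  static long balance, audit;   /* invariant: balance == audit */

                  static void *careful_writer(void *arg)
                  {
                      for (int i = 0; i < 1000000; i++) {
                          pthread_mutex_lock(&lock);
                          balance++; audit++;
                          pthread_mutex_unlock(&lock);
                      }
                      return arg;
                  }

                  static void *sloppy_writer(void *arg)
                  {
                      for (int i = 0; i < 1000000; i++) {
                          balance++; audit++;   /* BUG: no lock, races with the other thread */
                      }
                      return arg;
                  }

                  int main(void)
                  {
                      pthread_t a, b;
                      pthread_create(&a, NULL, careful_writer, NULL);
                      pthread_create(&b, NULL, sloppy_writer, NULL);
                      pthread_join(a, NULL);
                      pthread_join(b, NULL);
                      /* Final values are typically wrong and differ from run to run;
                       * adding printf inside the loops changes how often it shows. */
                      printf("balance=%ld audit=%ld\n", balance, audit);
                      return 0;
                  }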

              • fch42 3 hours ago

                I follow, yet I disagree that the "first priority" must always be a reproducer. A lot of conditions can be root-caused clearly from diagnostics alone; say, a Linux kernel deadlock can show up as two repeated "task stuck for more than ... seconds" messages with two different stacks, and the rest follows from reading the code (spotting the ABBA lock-ordering violation). There's a certain fetishisation of reproducers, not unlike the fetishisation of build-time testing: denigrating a bug because "you can't reproduce it", or "if it doesn't show in the tests it needn't be changed". Personally, that mindset irks me. Fortunately, most developers are happy to learn more about their code any which way. And debugging, tracing and monitoring are cool in themselves.
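
                For reference, the ABBA pattern in miniature (userspace pthreads as a stand-in for the kernel case), where the two stuck stacks really are all you need:

                    /* ABBA deadlock sketch: thread 1 takes A then B, thread 2 takes
                     * B then A; each ends up waiting on the lock the other holds. */
                    #include <pthread.h>
                    #include <unistd.h>

                    static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
                    static pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;

                    static void *t1(void *arg)
                    {
                        pthread_mutex_lock(&A);
                        sleep(1);                   /* widen the race window */
                        pthread_mutex_lock(&B);     /* stuck: t2 holds B, wants A */
                        pthread_mutex_unlock(&B);
                        pthread_mutex_unlock(&A);
                        return arg;
                    }

                    static void *t2(void *arg)
                    {
                        pthread_mutex_lock(&B);
                        sleep(1);
                        pthread_mutex_lock(&A);     /* stuck: t1 holds A, wants B */
                        pthread_mutex_unlock(&A);
                        pthread_mutex_unlock(&B);
                        return arg;
                    }

                    int main(void)
                    {
                        pthread_t x, y;
                        pthread_create(&x, NULL, t1, NULL);
                        pthread_create(&y, NULL, t2, NULL);
                        pthread_join(x, NULL);      /* never returns; the fix is a
                                                       consistent lock order */
                        pthread_join(y, NULL);
                        return 0;
                    }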

                • viraptor 20 minutes ago

                  It's a nice ideal to aim for. But I agree, it's not "must" and not "exactly". If I run into a "very-interesting-problem" which depends on synchronisation between multiple machines and includes some geographic dependencies and caching, I may do the mental calculation:

                  - reproducing the issue in a repeatable way would take a week or more

                  - I can make 2 educated-guess fixes a day which don't affect production negatively

                  I would be a fool to choose the first (or at least start with the first) in most companies. I mean, I want to display pictures on the internet, not send people into space.

                • jillesvangurp 2 hours ago

                  My rule for debugging is to park your assumptions and scientifically invalidate each thing you think might be the issue. It's usually something simple and your initial assumptions are probably wrong. Which is why trying to invalidate those is a productive course of action.

                  If you catch yourself thinking "it's probably X", then you should try to prove yourself wrong, because if you are wrong, you are looking in the wrong place. And if you are struggling to understand why a thing is happening, you can safely assume that something you believe to be true is in fact not true. Invalidating that assumption is how you figure out why.

                  Assumptions can range from "there's a bug in a library we are using", "the problem must have been introduced recently", "the problem only happens when we do X", etc. Most of these things are fairly simple to test.

                  The other day I was debugging someone else's code that I inherited. I started looking at the obvious place in the code, adding some logging, and was getting nowhere. Then I decided to try to reproduce the problem in a place where that code was definitely not used, to challenge my assumption that the problem was even in that part of the code. I instantly managed to reproduce the issue. I had wasted two hours staring at that code and trying to understand it.

                  In the end, the issue was with a weird bug that only showed up when using our software in the US (or as it turns out, the western hemisphere). The problem wasn't the functionality I was testing but everything that used negative coordinates.

                  Once I narrowed it down to a simple math problem with negative longitudes, I realized the problem was a missing call to abs where we were subtracting values (subtracting a negative value means you are adding it). That function was used in four different places; each of those was broken. Easy fix, and the problem went away. Being in Europe (only positive longitudes), we had just never properly tested that part of our software in the US. The bug had lurked there for over a year. Kind of embarrassing, really.

                  Which is why randomizing your inputs in unit tests is important. We were testing with just one hard-coded coordinate. The fix included me adding proper unit tests for the algorithm.
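
                  Roughly the shape of it (hypothetical helper names, not our actual code), plus the kind of randomized check that would have caught it:

                      /* Sketch of the bug class: a span computed by plain subtraction,
                       * which only behaves for the one hard-coded coordinate we tested. */
                      #include <assert.h>
                      #include <math.h>
                      #include <stdlib.h>

                      double lon_span_buggy(double a, double b) { return a - b; }       /* negative whenever a < b */
                      double lon_span_fixed(double a, double b) { return fabs(a - b); } /* the missing abs */

                      int main(void)
                      {
                          for (int i = 0; i < 1000; i++) {
                              double a = (rand() / (double)RAND_MAX) * 360.0 - 180.0;
                              double b = (rand() / (double)RAND_MAX) * 360.0 - 180.0;
                              assert(lon_span_fixed(a, b) >= 0.0);
                              /* assert(lon_span_buggy(a, b) >= 0.0);   <- trips almost immediately */
                          }
                          return 0;
                      }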

                  • dave333 4 hours ago

                    Reproducing the problem not only allows in-depth debugging; the conditions needed to reproduce it can also give clues as to the cause. The most significant/interesting bug of my career was a problem in 1978 with a Linotron 606 phototypesetter at the Auckland Star newspaper in NZ that would occasionally produce a small patch of mangled text at a random place in the job. Reprint the text and the problem would disappear. The problem had been outstanding for several months, as it wasn't a showstopper. The hardware engineer and I figured it might be related to how the fonts were pulled off disk and put into the typesetting memory buffer, so we set up artificial disk transfer errors where every 50th transfer would fail, and sure enough this made the problem happen 100% of the time. From there, simply by inspecting the code that transferred the fonts, we found the problem: an extra blank character, used for filling the background in reverse video (white text on black background) and normally placed at the top of the buffer, was omitted when things were redone after a disk transfer error. So all the character addresses in the buffer were incorrect, resulting in mangled characters.
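
                    The fault-injection trick generalizes well; the same idea, sketched in C with made-up names (wrap the real transfer routine so every 50th call reports failure):

                        /* Fault-injection sketch (made-up names): force the rare
                         * error-recovery path to run constantly instead of rarely. */
                        #include <stdbool.h>
                        #include <stddef.h>

                        extern bool disk_transfer(void *dst, size_t len);  /* assumed: the real routine */

                        #define INJECT_EVERY 50

                        bool disk_transfer_with_faults(void *dst, size_t len)
                        {
                            static unsigned calls;
                            if (++calls % INJECT_EVERY == 0)
                                return false;          /* pretend the transfer failed */
                            return disk_transfer(dst, len);
                        }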

                    • spc476 3 hours ago

                      But you had a theory you could test. The hardest bug I had to debug (over a month of constant work) was very difficult to reproduce on demand. The program (a single threaded server) would crash after running a few hours in production. I was able to get it to crash on the development server only if I let it run handling requests for a few days. The core dumps (and yes, there were plenty of core dumps to look at) were inconsistent---each crash was in a different location. There was no reason I could find that caused the crashes, so no theory I could really test.

                      I was able to locate the root cause, which was calling a non-async-signal safe function from a signal handler, but that came only after staring at the code and thinking hard for a long time.
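
                      For anyone who hasn't hit this class of bug: the usual rule is that a handler only sets a flag (or writes to a self-pipe) and the main loop does the real work. A minimal sketch:

                          /* Sketch: do nothing async-signal-unsafe in the handler;
                           * just record the fact and handle it in the main loop. */
                          #include <signal.h>
                          #include <stdio.h>
                          #include <unistd.h>

                          static volatile sig_atomic_t got_sigint;

                          static void handler(int sig)
                          {
                              (void)sig;
                              got_sigint = 1;   /* safe: sig_atomic_t flag only            */
                                                /* printf()/malloc() here is the classic bug */
                          }

                          int main(void)
                          {
                              signal(SIGINT, handler);
                              while (!got_sigint)
                                  sleep(1);     /* real code would use sigsuspend or a self-pipe */
                              printf("got SIGINT, shutting down cleanly\n");  /* safe: outside the handler */
                              return 0;
                          }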

                    • pdpi 4 hours ago

                      In a narrower sense of the word, one technique I developed early on in my career that I don’t see mentioned very often is exploratory debugging. At its most basic, you run the program, look at the output and find the output string in the source code. Then you set a breakpoint there, and go again. You’re now in a place where you can start understanding the program by working backwards from the output.

                      One thing that makes me sad about the pervasive use of async/await-style programming is that it usually breaks the stack in a way that makes this technique a bit useless.

                      • toolslive 2 hours ago

                        He has a point about reproducing the issue. However, tracing is better than logging, and for god's sake, set a f*cking breakpoint.

                        • z33k 4 hours ago

                          “If you don’t love your logging system, proactively fix that problem.”

                          Really, you have a “one-system” where you can see _ALL_ the logs? I don’t believe that. This whole software thing is abstractions everywhere, and we are probably using some abstraction somewhere that isn’t compatible with this fabled “one-system”.

                          Often the most debugging takes place on the least observable systems.