Thoughts on Debugging (catskull.net)
Submitted by todsacerdoti 9 months ago
  • vardump 9 months ago

    From the article:

      Here’s my one simple rule for debugging:
      
      Reproduce the issue.
    
    Unless it's one of those cursed things installed at the customer thousands of miles away that never happens back in the lab.

    Some things can be incredibly hard to debug and can depend on the craziest of things you'd never even consider possible. Like a thunderstorm causing voltage spikes that subtly damage the equipment, leading to equally subtle failures a few months later. Sometimes that "software bug" turns out to be hardware in weird ways. Or issues like https://web.mit.edu/jemorris/humor/500-miles – every person who's debugging weird issues should read that.

    Once you can actually reproduce the issue, you've often done 80-99+% of the work already.

    • metaphor 9 months ago

      > Unless it's one of those cursed things installed at the customer thousands of miles away that never happens back in the lab.

      > Once you can actually reproduce the issue, you've often done 80-99+% of the work already.

      War story...

      I spent over a year on-and-off between adjacent sprints circa peak COVID chasing an elusive intermittent bug that kept coming back after failing formal acceptance testing performed at a designated off-site location. Within days of the initial report, and again after every subsequent failed test report, I requested to be flown to this location to personally replicate the issue (we were unable to replicate it in the local lab, and I was familiar enough with the system's design to intuit the superficially unbounded depth of the rabbit hole), but the PM persistently decided to allocate funds elsewhere and just kept kicking this product to the back of the pipeline.

      Nearing project end, as the last backlogged issue in the queue, I managed to isolate the root cause to a race condition in a firmware ISR of an embedded microcontroller buried 3 levels deep in mixed-signal hardware. This subroutine's execution path interacted with an async signal from an instrument that was occasionally ~10 ms slower between bursts than any of the references we had in our lab...which really shouldn't have mattered, because the entire flow control setup was designed to be timing-insensitive and executed by a non-realtime host...except this one buried onion subroutine that really wasn't.

      Attaining local replication was an imperative, which required some creative speculation and a kludge hardware mock, but once I got to a state of repeatably mirroring the failure mode at will, the problem was effectively 95% solved; what followed was a quick subroutine rewrite, anti-pattern scrubdown review, host software update to account for new payload, and instructing off-site techs to field reprogram before the next scheduled test event. I'd never been happier to tag an issue resolved and finally merge an all but abandoned WIP branch.

      To this day, I still don't know with certainty what caused that catalyst instrument to occasionally perform ~10 ms slower relative to our references. In the end, we indirectly absorbed at least 50x the notional cost of that site visit when the dust finally settled.

      • catskull 9 months ago

        The deeper you get into debugging, the more you have to reverse engineer, modify, even outright hack the target. Awesome war story, I know how good that must have felt.

      • virgilp 9 months ago

        It is true that for many (especially concurrency/distributed-systems-related) bugs, reproducing the issue might be the hardest part... but that's not always true.

        A long time ago, I was working on a C++ compiler for an embedded processor. Customer complains that when they turn on optimization, the code fails. 100% reproducible: just use "-O3" and it fails, with "-O0" it works. Now, we're used to these bug reports, it's often a bug in the original software (like relying on undefined behaviour), but we manage to get the code, it looks good / we don't find anything strange. Can't run it, it only works on the customer's board (which is their custom proprietary hardware, using our microprocessor). We look through the assembly, can't find any obvious optimization error that the compiler made. After much (remote) debugging, it turns out that it was a fault in their memory system.

        • vardump 9 months ago

          These kinds of things happen all the time in embedded development. Tough to debug, because you can't initially know whether it's software or hardware.

        • dexen 9 months ago

          >Unless it's one of those cursed things installed at the customer thousands of miles away that never happens back in the lab.

          I had a bug like that in my previous (telecom embedded dev) career. Ended up driving to the customer premises (luckily a mere 200 km), working two weeks on collecting traces for the repro, and a day on the patch. Once I figured out how to trace the repro, the rest was trivial: the bug was glaringly obvious in the trace. Which in hindsight means I didn't really need to drive there at all; I merely needed to properly implement tracing of the repro and send the built artifact to the customer.

          The problem would have been trivially solved if I had sufficient experience with tracing, or had found a colleague with sufficient experience. However, this one time the experience was bought & paid for with the trip.

          • vardump 9 months ago

            > Which in hindsight means I didn't really need to drive there at all, I merely needed to properly implement tracing the repro, and send the built artifact to the customer.

            In the alternate reality where you chose not to drive there, you'd now be complaining how it took 3 weeks to troubleshoot with the customer and how you should have really driven there instead.

          • vrighter 9 months ago

            I had some HP MicroServers once, some of which, but not all, got occasional bit corruption due to EMF interference from themselves.

          • avidiax 9 months ago

            > Here’s my one simple rule for debugging:

            > Reproduce the issue.

            > . . .

            > I’m not sure that I’ve ever worked on a hard problem . . .

            I agree, the author has probably not worked on hard problems.

            There are many situations where either a) reproducing the problem is extremely elusive, or b) reproduction is easy, but the cause is especially elusive, or c) organizational issues prevent access to the system, to source code, or to people willing to investigate and/or solve the issue.

            Some examples:

            For A, the elusive reproduction, I saw an issue where we had an executive escalation that their laptop would always blue screen shortly after boot up. Nobody could reproduce this issue. Telemetry showed nobody else had this issue. Changing hardware didn't fix it. Only this executive had the anti-Midas touch to cause the issue. Turned out the executive lived on a large parcel, and had exactly one WiFi network visible. Some code parsing that list of WiFi APs had an off-by-one error which caused a BSOD. A large portion of wireless technology (Bluetooth/Thread/WiFi/cellular) bugs fall into this category.

            For B, the easy to repro but still difficult, I've seen numerous bugs that cause stack corruption, random memory corruption, or trigger a hardware flaw that freezes or resets the system. These types of issues are terrible to debug, because either the logs aren't available (system comes down before the final moments), or because the culprit doesn't log anything and never harms themselves, only an innocent victim. Time-travel tracing is often the best solution, but is also often unavailable. Bisecting the code changes is sometimes little help in a busy codebase, since the culprit is often far away from their victims.

            Category C is also pretty common if you are integrating systems. Vendors will have closed source and be unable or unwilling to admit even the possibility of fault, help with an investigation, or commit to a fix. Partners will have ship blocking bugs in hardware that they just can't show you or share with you, but it must nonetheless get fixed. You will often end up shipping workarounds for errors in code you don't control, or carefully instrumenting code to uncover the partner's issues.

            • forrestthewoods 9 months ago

              > I agree, the author has probably not worked on hard problems.

              Conversely, if you focus on creating a repro you make problems easy.

              I’m a little surprised the OP doesn’t mention debuggers. For whatever reason many modern Linux programmers seem to have never even used a debugger and rely almost entirely on printf. Which is utterly bonkers to me! If you can repro an issue and trap it in a debugger then you’ve done 90% of the work.

              • vardump 9 months ago

                Debuggers are not great when you have a bug that involves interaction between multiple CPU cores, hardware device timing, etc.

                In those cases, a debugger just can't stop the world, so back to printf it is.

                • Neikius 9 months ago

                  Well, I am aware there are ring 0 debuggers out there. Also, running in a VM could help. The question is, is there anything useful around you could just grab (for Linux)? It should be possible, at least.

                  • forrestthewoods 9 months ago

                    printf also mucks with sensitive timing issues. So it’s not great either.

                    Also time traveling debuggers like WinDbg and rr are the ideal way to debug complex timing issues.

                    • vardump 9 months ago

                      > printf also mucks with sensitive timing issues. So it’s not great either.

                      True, but this can be worked around, for example with lock-free ring buffers.
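
                      A minimal sketch of the idea (illustrative only, single-producer/single-consumer, C11 atomics; the record layout and names are made up): the hot path writes a small fixed-size record and bumps an index, and something else drains and formats the ring later, so the timing impact is a few stores instead of a printf.

                          /* Per-core/per-thread trace ring: no locks, no formatting,
                             no I/O on the hot path. */
                          #include <stdatomic.h>
                          #include <stdint.h>

                          #define RING_SIZE 1024                      /* power of two */

                          typedef struct { uint32_t event; uint32_t arg; uint64_t tstamp; } trace_rec;

                          static trace_rec ring[RING_SIZE];
                          static _Atomic uint32_t head;               /* written by the producer */
                          static _Atomic uint32_t tail;               /* advanced by the consumer */

                          /* Hot path: a handful of stores instead of a printf.
                             Drops the record if the ring is full rather than blocking. */
                          static inline void trace(uint32_t event, uint32_t arg, uint64_t now)
                          {
                              uint32_t h = atomic_load_explicit(&head, memory_order_relaxed);
                              uint32_t t = atomic_load_explicit(&tail, memory_order_acquire);
                              if (h - t >= RING_SIZE)
                                  return;                             /* full: drop, don't stall */
                              ring[h % RING_SIZE] = (trace_rec){ event, arg, now };
                              atomic_store_explicit(&head, h + 1, memory_order_release);
                          }

                      The consumer (a debugger script, an idle task, or a post-mortem dump) reads the records between tail and head and advances tail with a release store, so even after a crash the most recent events are recoverable from memory.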

                      > Also time traveling debuggers like WinDbg and rr are the ideal way to debug complex timing issues.

                      When applicable, rr can indeed be great. Not sure how to apply TTD tools in, for example, a kernel context.

                      Nice to see Microsoft continues improving WinDbg. (Shame Visual Studio's debugger is so weak in comparison.)

                  • glandium 9 months ago

                    And if you can repro in rr, you've done 99% of the work most of the time.

                  • pjc50 9 months ago

                    My favourite case of this was simply "touchscreen sometimes doesn't work". That took over a year before it even made it to us developers, being reported intermittently across a large number of point of sale systems. Sensibly, people were reluctant to pass on something that they couldn't reproduce, and it was often blamed on hardware issues or user error, such as damage from people operating the (resistive) touchscreen with a bunch of keys or the smooth metal Dallas key fob that was also used for log in.

                    We eventually got a robot to reproduce some other issues with PIN pads (which for obvious reasons do not support any kind of software injection of input!), and got some time using the robot to press the touchscreen over and over. That way we were able to confirm it happening, and start the debug process.

                    (anticlimactically, I can't quite remember what the resolution was, something to do with debounce logic and interrupt events)

                    • vardump 9 months ago

                      > something to do with debounce logic and interrupt events

                      Oh, the classic, an unfiltered hardware button connected to a GPIO that generates an IRQ? What could go wrong... :-D

                      For the uninitiated: https://en.wikipedia.org/wiki/Switch#Contact_bounce
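
                      And a minimal sketch of the usual software-side fix (read_gpio() and on_press() are hypothetical hooks, nothing from pjc50's system): sample the pin from a periodic timer tick and only accept a level change once it has been stable for a few consecutive samples, instead of acting on the raw edge interrupt.

                          #include <stdbool.h>
                          #include <stdint.h>

                          extern bool read_gpio(void);              /* raw, bouncy pin level */
                          extern void on_press(void);               /* action to debounce */

                          #define STABLE_SAMPLES 5                  /* e.g. 5 ticks of 2 ms each */

                          void debounce_tick(void)                  /* called from a periodic timer ISR */
                          {
                              static bool stable_state = false;
                              static uint8_t counter = 0;
                              bool raw = read_gpio();

                              if (raw == stable_state) {
                                  counter = 0;                      /* no change pending */
                              } else if (++counter >= STABLE_SAMPLES) {
                                  stable_state = raw;               /* accept the new, settled level */
                                  counter = 0;
                                  if (stable_state)
                                      on_press();                   /* one clean event per press */
                              }
                          }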

                  • lordnacho 9 months ago

                    > Reproduce the issue.

                    Once you have done this, you are already over the hump. It's like being the first rider over the last mountain on a Tour de France stage: you've more or less won by doing that.

                    I'm not sure I even consider it a challenge if the issue is easily reproduced. You will simply grind out the solution once you have the reproduction done.

                    The real bugs are the ones that you cannot replicate. The kind of thing that breaks once a week on 10 continuously running machines. You can't scale that system to 1000 or more with the bug around, you'll be swamped with reports. But you also can't fix it because the conditions to reproduce it are so elusive that your logs aren't useful. You don't know if all the errors have the same root cause.

                    Typically the kind of thing that creates a lot of paths to check is "manual multithreading": the kind of thing where you have a bunch of threads, each locking and unlocking (or forgetting to do either) access to shared data in a particular way. The number of ways in which "this reads and then that writes" explodes quite fast with such code, and it also explodes in a way that isn't obvious from the code. Sprinkling log output over such code can change the frequency of the errors.

                    • fch42 9 months ago

                      I follow, yet I disagree that "first priority" must always be a reproducer. There are a lot of conditions that can be root-caused clearly from diagnostics; say, Linux kernel code deadlocks can exhibit as two different (in their stacks) repeatedly shown "task stuck for more than ... seconds" messages; the remainder follows from the code (to see the ABBA lock-ordering violation). There's a certain fetishisation of reproducers, not unlike the fetishisation of build-time testing - to denigrate a bug because "you can't reproduce it" or "if it doesn't show in the tests it needn't be changed". Personally, that mindset irks me. Fortunately, most developers are happy to learn more about their code any which way. And debugging, tracing, monitoring is cool in itself.
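
                      For concreteness, a user-space sketch of the ABBA pattern (pthread mutexes standing in for the kernel locks; everything here is illustrative): one path takes A then B, the other takes B then A, and when the two interleave the wrong way each sits waiting on the other forever - which is exactly what surfaces as two different "task stuck" stacks.

                          #include <pthread.h>

                          static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
                          static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

                          static void *path_one(void *arg)      /* run on one thread */
                          {
                              (void)arg;
                              pthread_mutex_lock(&lock_a);      /* A ... */
                              pthread_mutex_lock(&lock_b);      /* ... then B */
                              pthread_mutex_unlock(&lock_b);
                              pthread_mutex_unlock(&lock_a);
                              return NULL;
                          }

                          static void *path_two(void *arg)      /* run on another thread */
                          {
                              (void)arg;
                              pthread_mutex_lock(&lock_b);      /* B ... */
                              pthread_mutex_lock(&lock_a);      /* ... then A: can deadlock */
                              pthread_mutex_unlock(&lock_a);
                              pthread_mutex_unlock(&lock_b);
                              return NULL;
                          }

                      The fix is the usual one: pick a global lock order and stick to it; the two stacks in the stuck-task messages tell you which two paths disagree.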

                      • viraptor 9 months ago

                        It's a nice ideal to aim for. But I agree, it's not "must" and not "exactly". If I run into a "very-interesting-problem" which depends on synchronisation between multiple machines and includes some geographic dependencies and caching, I may do the mental calculation:

                        - reproducing the issue in a repeatable way would take a week or more

                        - I can make 2 educated-guess fixes a day which don't affect production negatively

                        I would be a fool to choose the first (or at least start with the first) in most companies. I mean, I want to display pictures on the internet, not send people into space.

                        • catskull 9 months ago

                          Excellent points! I considered how to work in those cheap "Hail Mary" kind of commits. In the right situation those can be effective, but they can also be incredibly risky. This is coming from someone with "Revert Revert Revert..." commits with my name tagged on them hahaha!

                      • jillesvangurp 9 months ago

                        My rule for debugging is to park your assumptions and scientifically invalidate each thing you think might be the issue. It's usually something simple and your initial assumptions are probably wrong. Which is why trying to invalidate those is a productive course of action.

                        If you catch yourself thinking "it's probably X", then you should try to prove yourself wrong. Because if you are, you are looking in the wrong place. And if you are struggling to understand why a thing is happening, you can safely assume that something you assume to be true is in fact not true. Invalidating that assumption would be how you figure out why.

                        Assumptions can range from "there's a bug in a library we are using", "the problem must have been introduced recently", "the problem only happens when we do X", etc. Most of these things are fairly simple to test.

                        The other day I was debugging someone else's code that I inherited. I started looking at the obvious place in the code, adding some logging, and I was getting nowhere. Then I decided to try to reproduce the problem in a place where that code was definitely not used, to challenge my assumption that the problem even was in that part of the code. I instantly managed to reproduce the issue. I had wasted two hours staring at that code and trying to understand it.

                        In the end, the issue was with a weird bug that only showed up when using our software in the US (or as it turns out, the western hemisphere). The problem wasn't the functionality I was testing but everything that used negative coordinates.

                        Once I narrowed it down to a simple math problem with negative longitudes, I realized the problem was a missing call to abs where we were subtracting values (subtracting a negative value means you are adding it). That function was used in four different places; each of those was broken. Easy fix and the problem went away. Being in Europe (only positive longitudes), we just never properly tested that part of our software in the US. The bug had lurked there for over a year. Kind of embarrassing really.

                        Which is why randomizing your inputs in unit tests is important. We were testing with just one hard coded coordinate. The fix included me adding proper unit tests for the algorithm.
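
                        A sketch of what the fix and the new test looked like in spirit (C just for illustration; the lon_distance helper, the buggy variant, and the property checks are mine, not the actual code): take the magnitude of the difference so operand signs stop mattering, and hammer it with randomized longitudes from both hemispheres instead of one hard-coded coordinate.

                            #include <assert.h>
                            #include <math.h>
                            #include <stdlib.h>

                            /* Roughly the shape of the bug: fine for two positive (eastern)
                               longitudes, wrong as soon as an operand goes negative. */
                            static double lon_distance_buggy(double a, double b)
                            {
                                return a - b;             /* subtracting a negative b adds it */
                            }

                            /* The fix: magnitude of the difference, sign no longer matters. */
                            static double lon_distance(double a, double b)
                            {
                                return fabs(a - b);
                            }

                            int main(void)
                            {
                                /* Randomized longitudes in [-180, 180] exercise the western
                                   hemisphere that a single European test coordinate never did. */
                                for (int i = 0; i < 1000; i++) {
                                    double a = (rand() / (double)RAND_MAX) * 360.0 - 180.0;
                                    double b = (rand() / (double)RAND_MAX) * 360.0 - 180.0;
                                    assert(lon_distance(a, b) >= 0.0);
                                    assert(lon_distance(a, b) == lon_distance(b, a));
                                    if (a < b)
                                        assert(lon_distance_buggy(a, b) < 0.0);   /* the lurking bug */
                                }
                                return 0;
                            }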

                        • catskull 9 months ago

                          I love war stories!

                          "All truths are easy to understand once they are discovered." - Galileo

                          I always think about this. "Once I figured out the solution, it was easy!"

                        • dave333 9 months ago

                          Reproducing the problem not only allows in-depth debugging, but the conditions needed to reproduce can give clues as to the cause. The most significant/interesting bug of my career was a problem in 1978 with a Linotron 606 phototypesetter at the Auckland Star newspaper in NZ that occasionally would have a small patch of mangled text at a random place in the job. Reprint the text again and the problem would disappear. The problem had been outstanding for several months as it wasn't a showstopper. The hardware engineer and I figured it might be related to how the fonts were pulled off disk and put in the typesetting memory buffer, so we set up some artificial disk transfer errors where every 50th transfer would fail, and sure enough this made the problem happen 100% of the time. From there, simply inspecting the code that transferred the fonts, we found the problem: an extra blank character, used for filling the background in reverse video (white text on black background) and placed at the top of the buffer, was omitted when things were redone after a disk transfer error. So all the character addresses in the buffer were incorrect, resulting in mangled characters.

                          • spc476 9 months ago

                            But you had a theory you could test. The hardest bug I had to debug (over a month of constant work) was very difficult to reproduce on demand. The program (a single threaded server) would crash after running a few hours in production. I was able to get it to crash on the development server only if I let it run handling requests for a few days. The core dumps (and yes, there were plenty of core dumps to look at) were inconsistent---each crash was in a different location. There was no reason I could find that caused the crashes, so no theory I could really test.

                            I was able to locate the root cause, which was calling a non-async-signal safe function from a signal handler, but that came only after staring at the code and thinking hard for a long time.
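
                            For anyone hitting the same class of bug, the generic shape of the fix (a sketch, not my actual code): the handler does nothing except set a volatile sig_atomic_t flag, and everything that isn't async-signal-safe (printf, malloc, free, ...) moves out into the main loop.

                                #include <signal.h>
                                #include <stdio.h>
                                #include <unistd.h>

                                static volatile sig_atomic_t got_signal = 0;

                                static void handler(int sig)
                                {
                                    (void)sig;
                                    got_signal = 1;       /* setting a flag is all that's safe here */
                                }

                                int main(void)
                                {
                                    signal(SIGUSR1, handler);
                                    for (;;) {
                                        pause();          /* wait for a signal to arrive */
                                        if (got_signal) {
                                            got_signal = 0;
                                            printf("handling the event outside the handler\n");
                                        }
                                        /* a real server would use sigaction() plus sigsuspend()
                                           or the self-pipe trick to close the check/pause race */
                                    }
                                }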

                          • dave333 9 months ago

                            A corollary rule is what makes a problem go away. I was pulled into a war room situation on a remote call (pre Zoom days) to debug a problem where after some digging it seemed some variable was being corrupted deep in the code. I suspected it might be a scope issue with two variables with the same name in different functions that the author thought were separate but which were actually one value. I descended the function tree from the top changing every variable "i" to something unique until luckily the problem went away. Made my boss happy.

                            • catskull 9 months ago

                              These are the moments that make our careers worthwhile, in my opinion.

                            • pdpi 9 months ago

                              In a narrower sense of the word, one technique I developed early on in my career that I don’t see mentioned very often is exploratory debugging. At its most basic, you run the program, look at the output and find the output string in the source code. Then you set a breakpoint there, and go again. You’re now in a place where you can start understanding the program by working backwards from the output.

                              One thing that makes me sad about the pervasive use of async/await-style programming is that it usually breaks the stack in a way that makes this technique a bit useless.

                              • eternityforest 9 months ago

                                I don't know why they don't fix this. Can't they reconstruct the chain of awaits somehow by adding some metadata to whatever objects they use to keep track of async stuff?

                                • catskull 9 months ago

                                  In Safari, I've had good success setting breakpoints in client-side JS. The call stack isn't perfect, but usually it's at least enough to get me to the next step. The other thing I'd mention is request/response overrides for those cases where you're debugging in production!

                              • toolslive 9 months ago

                                He has a point about reproducing the issue. However, tracing is better than logging, and for god's sake, set a f*cking breakpoint.

                                • wakawaka28 9 months ago

                                  This seems kinda basic. Of course you need a way to reproduce an issue. But is that all you got? The talking down to juniors at the end, as if he laid out some huge insights, is also slightly hilarious.

                                  • catskull 9 months ago

                                    Thanks for reading my blog! I agree, it is basic.

                                  • z33k 9 months ago

                                    ”If you don’t love your logging system, proactively fix that problem.”

                                    Really, you have a ”one-system” where you can see _ALL_ the logs? I don’t believe that. This whole software thing is abstractions everywhere, and we are probably using some abstraction somewhere that isn’t compatible with this fabled ”one-system”.

                                    Often the most debugging takes place on the least observable systems.

                                    • radus 9 months ago

                                        I like using EFK (Elasticsearch-Fluentd-Kibana) for this.

                                      • z33k 9 months ago

                                          You're able to aggregate all logs from the control plane (all events?), firewall/ingress, storage, autoscaler, CNI, operating system, and audit events?