A debugging story from my Microsoft days

Yesterday's debugging-by-thinking story reminded me of another tricky debugging problem, one I ran into while I was working at Microsoft.

We had a bunch of bugs that had come in through "Watson", the system that receives automatic bug reports from users' machines (the "This program has encountered a problem. Send error report to Microsoft?" dialog) and categorizes them in an attempt to find common causes. Yes, people really do look at those reports and act on them. We had a mandate to fix all the Watson reports with more than some number (10 or so) of occurrences. Watson reports with too few crashes aren't worth looking at (they're usually due to some hardware fault on one particular user's machine or something similar).

Most of these bugs had been dealt with, but one of them was rather tricky - at least one other developer had tried to figure it out, given up on it and thrown it back into the pile for someone else to look at. Somehow it ended up on my plate. It wasn't a huge number of crashes, but it was too many to ignore (maybe hundreds - the top Watson reports had thousands).

Developers don't (or didn't, the situation might have changed since I left) usually like looking at Watson bugs because they contain very little information - usually just a call stack and a list of the modules (DLLs) loaded into the process along with their versions. The call stack is the important bit for developers - they can see what piece of code the program was executing when it died, what piece of code called that code and so on. Often it includes useful variable values as well (function arguments and local variables). Most Watson bugs are treated as just failures to validate parameters: "This function should never be called with a NULL pointer argument but given this Watson report we can see that somebody did just that. We can't tell how it got into that state but we can add a NULL check and either throw an exception or return an error code here so that there's a chance the user will see a useful error message instead of crashing the entire program."

This particular bug could not be dismissed so easily - it was a double-release of some reference-counted COM pointer, and the nasty thing about these is that where they are detected bears absolutely no relation to the piece of code that actually did the double-release. So adding a check at the point where the program was crashing would not have helped at all. Sometimes you can fix these by just looking at all the pieces of code that use the object in question, but in this case that was out of the question - this was a very common object used in dozens if not hundreds of disparate places.
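To make the "detected somewhere unrelated" point concrete, here is a toy sketch in Python. It is not the real code: the `RefCounted` class and the two component functions are made up for illustration, and a real COM object in native code would not raise a tidy exception but would touch freed memory instead.

```python
class RefCounted:
    """Toy stand-in for a reference-counted COM object (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.refs = 1        # the creator holds the first reference
        self.alive = True

    def add_ref(self):
        if not self.alive:
            raise RuntimeError(f"{self.name}: AddRef on destroyed object")
        self.refs += 1

    def release(self):
        if not self.alive:
            raise RuntimeError(f"{self.name}: Release on destroyed object")
        self.refs -= 1
        if self.refs == 0:
            self.alive = False   # the object destroys itself at zero


def buggy_component(obj):
    obj.add_ref()
    obj.release()
    obj.release()   # the bug: one release too many, yet nothing fails here


def innocent_component(obj):
    obj.add_ref()   # the failure only surfaces here, in correct code
    obj.release()


def run():
    shared = RefCounted("shared")
    shared.add_ref()             # a second owner takes a reference
    buggy_component(shared)      # silently steals one of the two references
    shared.release()             # first owner lets go; object is destroyed
    try:
        innocent_component(shared)   # second owner still believes it holds one
    except RuntimeError as err:
        return str(err)
    return "no crash"
```

The call stack at the point of failure would show `innocent_component`, with no trace of `buggy_component` anywhere in it, which is why a check at the crash site does nothing useful.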

After banging my head against it for a few hours, I stepped back and thought "what information do we have to work with here?" The call stack wasn't much help, so on a whim I went looking for clues in the module list. The critical piece of information I needed was right there, hiding in plain sight. At the time (I'm not sure if it's still true) there were two slightly different implementations of the managed project system in Visual Studio - one for Visual Basic and the other for C#. The one thing all the occurrences of this bug had in common was that the C# project system was loaded in all of them, and the VB project system was loaded in none of them (or it might have been the other way around, I don't recall). That fingered something C#-specific as the culprit.
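I did this by eyeballing the reports, but the same check is easy to sketch in code. The following is a minimal Python illustration, assuming each report's module list is available as a plain list of names; the function name and the DLL names are invented for the example and are not the real Visual Studio module names.

```python
def module_signature(reports, candidates):
    """Split candidate modules into those loaded in every report
    and those loaded in none of them."""
    loaded_sets = [set(modules) for modules in reports]
    if not loaded_sets:
        return set(), set()
    in_all = set.intersection(*loaded_sets)   # present in every report
    in_any = set.union(*loaded_sets)          # present in at least one
    wanted = set(candidates)
    return in_all & wanted, wanted - in_any


# Hypothetical module lists from three reports in the same crash bucket.
reports = [
    ["host.exe", "csharp_project_system.dll", "shell.dll"],
    ["host.exe", "csharp_project_system.dll", "shell.dll", "addin.dll"],
    ["host.exe", "csharp_project_system.dll"],
]

always, never = module_signature(
    reports, ["csharp_project_system.dll", "vb_project_system.dll"]
)
```

A module that appears in every report and a sibling module that appears in none is only a hint, not proof, but it narrows the search from "everything that touches the object" to "whatever only one of the two does".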

There was precisely one place where the object in question was used by the C# project system but not the VB project system. Close inspection of this particular function showed that there was a very bizarre set of circumstances in which maybe-just-maybe the object could get double-released if everything lined up just right and something else had gotten into a bad state first. We had no idea how to trigger that particular set of circumstances, so we had no way to be sure this was the right fix, but making that change was enough to close the bug. I left Microsoft before finding out whether it really was the right fix. The only way to tell would be to cross-reference the Watson reports from a more recent version of VS and see whether an equivalent report still showed up - and even that wouldn't be 100% reliable, since something else could have changed to make the bug go away.

Still, the lack of definitive proof didn't stop me from feeling rather proud of myself for figuring that one out.
