
We don’t support test systems?!?!

Posted by tier777 on 2009-05-23

This was one of those days… the kind where you want to bite a chunk out of your keyboard or desk and then instead just shout at the top of your lungs all the way home until you feel slightly better – and this was already the second day in a row to end that way. So what happened? Here’s what.

I’ve had my first close encounter with the customer support of a certain big software vendor, which I’ll do my best not to name here. We have a product that works closely with one of theirs, and because of that several of our customers notified us of a problem that, after close examination, turned out to be what I’m by now certain is a bug in that vendor’s software. One of their recent updates introduced a significant change in behaviour in a somewhat obscure area that just couldn’t have been intentional (it’s also mentioned nowhere in the release notes). Since that particular obscure area also touches on the correct functioning of our own tool, we had a vested interest in getting this sorted out.

Since that company no longer has a public bug tracker (“because 90% of the filed items were really just requests for free tech support”), the only way to notify them of bugs in their products nowadays is to open a regular support case – at the risk of having to pay for the tech support if the tech handling the case determines that the problem was on our end after all. So, with that in mind, I wanted to make extra sure I knew as much as possible about the issue (and especially how to trigger it) before I contacted them. I set up a test system in a virtual machine with nothing but the supposedly defective software on it and tried to reproduce and analyze the issue as thoroughly as I could. Once I was sure I knew how to reliably provoke the issue, I phoned the first screener. That is, I phoned first-level support, who noted down my details and a rough description of the problem, gave me a case number and told me that a tech would contact me within four business hours. All fine. The call even came less than an hour later.

I started to explain the problem again, but the helpful support person quickly suggested we do a remote access session so he could see the issue first-hand on my machine. All the better, I thought, shared my desktop with the VM on it, and Bob’s your uncle. Well, no. Not really. After three days of investigating, discussing with several unrelated affected customers, analyzing, probing, diagnosing, testing and consistently provoking the issue – all of a sudden the issue no longer occurred! Everything just worked exactly as it was supposed to. It hadn’t been five minutes since my last batch of tests, in which every single attempt made the issue plainly visible. And now – nothing to show. Quite embarrassing. I was completely dumbfounded. As the issue could not be reproduced at the time of first contact, the nice support person kindly offered to close the case for free. If I managed to reliably reproduce the issue again I could still contact them and have the case reopened – but then I had better have something to show.

Needless to say, less than five minutes after hanging up I was able to reproduce the issue again – and again – and again. Not quite as reliably as before the call, but still quite definitely there. Then I remembered that the release notes for the update that introduced the issue mentioned that several operations previously carried out sequentially had been parallelized or moved to background threads to improve the app’s performance and responsiveness. OK, so timing obviously plays a role here as well – and that could well be affected by remote support software hooking into the video driver; we had a similar case ourselves a mere three months earlier.
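To illustrate why that kind of change can turn a dependable bug into a now-you-see-it, now-you-don’t one, here is a minimal sketch – purely hypothetical Python, nothing to do with the vendor’s actual code – in which a value that used to be computed before it was read is now updated on a background thread, so whether the stale value is ever observed depends entirely on how fast the main path happens to run:

    # Minimal, hypothetical sketch: a step that used to run sequentially is
    # moved to a background thread; whether the bug shows up now depends
    # purely on timing.
    import threading
    import time

    def process(delay_main: float, delay_worker: float) -> str:
        result = {"value": "OLD"}          # state the main path reads later

        def background_update():
            time.sleep(delay_worker)       # simulates work now done "in parallel"
            result["value"] = "NEW"

        t = threading.Thread(target=background_update)
        t.start()

        time.sleep(delay_main)             # main path no longer waits for the worker
        snapshot = result["value"]         # reads whatever happens to be there
        t.join()
        return snapshot

    # Main path races ahead of the background work and reads stale state:
    # prints "OLD" – the bug is visible.
    print(process(delay_main=0.005, delay_worker=0.01))

    # Slow the main path down a little (screen-sharing hooks, video capture,
    # general load) and the background work finishes in time: prints "NEW" –
    # the bug is suddenly unreproducible.
    print(process(delay_main=0.02, delay_worker=0.01))

Anything that drags on the foreground – a remote-desktop hook, a screen recorder, plain system load – shifts that race and can make the bug vanish right on cue.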

Once I was able to trigger the issue again after the call, I installed Camtasia in the VM and began stress testing in earnest. At first I only managed to reproduce the issue about 4 times out of 30, but soon I was getting better. Apparently the issue was more likely to happen right after the app had been restarted. Most importantly, I finally had the bugger on camera. I trimmed the video, sent it to the support people – and waited.

Today, less than two hours before closing time, I sent another reminder asking for confirmation that the case had been reopened. I quickly received a response that they were still checking whether the video would actually suffice for “producing evidence”. I was also told that it would probably be better if I were able to reproduce the issue in a really clean environment, without any other software involved at all (keep this snippet in mind for later).

I replied just as quickly, pointing out that the only other software installed on the machine where the video was produced was Camtasia, and that anyway we could simply do another remote access session and just try a little harder to trigger the issue this time – after all, the last time we tried I fully expected the issue to occur on each and every attempt, so when it didn’t, we immediately aborted the whole show. I also reported that by now I was able to reproduce the issue in my test environment in more than 20 out of 30 attempts, and I asked (neither for the first nor the last time during that exchange) whether there was some kind of diagnostic logging I could turn on, or whether sending the produced output files (which exhibited the error) would help.

No direct reply to any of those offers and questions. What I did get was a message that they had now at last watched the video I sent and that they would reopen the case – but that all further reproduction tests should be done in a non-virtualized environment. Sure, no big deal, we have lots of unused physical boxes lying around here that could be prepared for this (remember: “ideally without any other software involved at all”), not to mention the limitless amounts of time on our hands – not!

After my somewhat snappy and already more than slightly annoyed response, the phone was quickly ringing again. The case had now been officially reopened and we started noting down the specs of the test environment (OS and program versions, network setup, etc.), possibly in preparation for another, more in-depth analysis session. Somewhere in the middle of all that came the sentence “well, we do not support test systems really” and then “we’d prefer to see the issue reproduced on a live production system”… WTF?!?!?!

“Yes, just imagine we fix that issue in your test environment and then you take the fix to the production environment and it fails again.”

Um, I don’t know about you, but the first thing I do when I’m hunting down a bug in my own programs is to eliminate outside factors until I have distilled the exact minimal set of conditions responsible for causing the bug. The last thing I would want to do is maximize the number of outside factors, which is exactly what was being proposed here.

At that point I realized that what the friendly tech guy was doing had nothing at all to do with trying to find the bug. He was still trying to prove (or rather make me admit) that there was no bug in the first place! After having supposedly watched the video of the bug happening! This was probably just the second screener I had to get past before someone actually started taking me seriously and finally looked into the issue. I’m fully convinced that no one there has taken any steps to try to reproduce or otherwise investigate the issue themselves so far.

Did I mention that by then I was able to reproduce the issue in the test environment again with something approaching a 95% success rate?

Right, so if that’s the way they want it… I agreed to contact some of the affected customers and ask them to participate in “producing evidence” (I guess I should mention that we are not using the affected version of the software in production ourselves).

So I called one of the customers, who immediately agreed to assist. As I had tentatively scheduled the remote access session for after the weekend, I wanted to “rehearse” the whole thing to be on the slightly safer side this time, and connected to the customer (who had also been experiencing the issue consistently ever since the update). Guess what? Yep, no longer reproducible on that box either as soon as I started watching… then again, we only tried three or four times because it was already late.

What a way to leave for the weekend…

And yes, as you can probably tell, I am so looking forward to Monday…

(to be continued)

Posted in Rants