You Can’t Always Have Both

When you fix an issue, you often want two things: to fix the problem quickly, and to determine the root cause so you can prevent it from happening again. You usually only get one or the other; it is a rare occasion when you can do both.

Yesterday I dealt with a customer Sev 0 issue that made me go off on a bit of a rant about how I deal with my team and how I will work to protect them. The other idea from that incident that has been going around in my head was reinforced today, when I had a call with the customer about the RCA that I wrote for them. It basically said that, due to the severity of the issue, we focused on fixing it and were not able to determine the root cause.

What I suspect was wrong is that something was corrupted in one of the mailboxes that we moved, and that this was causing the error. I think it was due to an error in the folder index, but I cannot confirm that. To determine whether my guess is right, we would have had to make a copy of the database, open a case with Microsoft, upload the database to Microsoft, and have them analyze it and determine the cause. During that testing time we would have had to leave the database offline, and the 100 users on the database would not have been able to work with the Exchange server. I made the call that this was not worth the downtime.

The following is what I gave them as the root cause:

One of the following 5 users, XXXXXX, had a corrupted folder or item in their mailbox that was causing the information store service to run out of memory handles and crash. Because this incident was a Severity 0 (server down) issue, we took the quickest path we could to bring the server back to full functionality. Because the server needed to be fixed as soon as possible, we were not able to leave it in a down state, or capture the down state, for an extended period of time (up to 48 hours) to allow Microsoft to analyze the database and determine the true root cause.

The customer argues that Exchange is a sophisticated system and that the root cause should be determinable. I agree with that statement, but I would add that determining the root cause takes time, and that time is not always available. The error message that we were receiving was incomplete and did not give us much to go on. What ended up fixing the issues was a keen understanding of the product and brute force repairs. Rather than digging into the issues, we fixed them with a bit of a shotgun approach.

Back to my point: during troubleshooting we can take two roads, and each has its own issues:

  • Fix the problem as quickly as possible: There are different ways to fix and troubleshoot things. When I am in a time-sensitive situation I generally take the quickest path. The quickest path is often to blanket fix things all at once. When you perform blanket fixes, it is hard to determine for a fact which of your fixes actually resolved the issue.
  • Determine root cause: Oftentimes the root cause is hidden deep in a memory stack dump, or in some maximum-level log, where you have to find it among the millions of spam messages that you get when you turn logging up. On rare occasions the error that you get tells you the cause and leads to the fix. In those cases the customer almost never asks for a root cause. They seem to only want to know the root cause when there is none to be found.

As a result of all of this, I am now going to preface all of my fixes with the following statement:

I can take two approaches to solving this issue. I can solve the problem as quickly as possible, or I can determine what caused the issue. I cannot do both in a timely manner. If you give me the OK to solve this as quickly as possible I make no guarantees that I will be able to tell you the root cause. Do you accept that and give me permission to fix the issue?

This is a little bit of a bonus link. I say Let’s go here, all of the Heres


About Kevinm