I remember the largest outage of my career. Late on a Friday night, I received a call from my incident center saying that the entire development side of my VMware environment was down, and that a rolling outage seemed possible, one that could quite conceivably reach my production environment.
What followed was a weekend of finger pointing and root cause analysis between my team, the Virtual Data Center group, and the Storage group. Our org had hired IBM as the first line of defense on these Sev-1 calls, and IBM pulled EMC and VMware into the resolution process as the issue went higher up the call chain. Still the finger pointing continued. By 7am on Monday, we had the environment back up and running for our user community, had isolated the root cause, and had ensured that this particular issue would never recur. Others certainly would, but not this one.
Have you had circumstances like this arise in your work experience? I can imagine most of you have.
So, what do you do? What may seem obvious to one person may not be obvious to another. For my part, I troubleshoot the way I always have: Occam's Razor, the principle of parsimony, is my course of action. Apply logic, and force yourself to try the easiest and least painful solutions first. Once you've exhausted those, move on to the less likely and less obvious.
Early in my career, I was asked what I'd do as my first troubleshooting maneuver for a Windows workstation having difficulty connecting to the network. My response was to save any open work locally, then reboot. If that didn't solve the connectivity issue, I'd check the cabling, first at the desktop and then at the cross-connect, before I even looked at driver issues.
Simple parsimony (defined as economy in the use of means to an end) is often the ideal approach.
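The parsimony principle can be sketched as a cost-ordered checklist: try the cheapest, least painful fixes first and escalate only when they're exhausted. This is a minimal illustration, not any particular tool; the check names, costs, and outcomes below are invented for the workstation example.

```python
# Parsimony-driven troubleshooting sketch: run checks in ascending
# order of cost/pain, and stop at the first one that fixes the issue.
# The checks themselves are hypothetical placeholders.

def troubleshoot(checks):
    """Run checks cheapest-first; return a note about what resolved it."""
    for name, cost, attempt_fix in sorted(checks, key=lambda c: c[1]):
        if attempt_fix():
            return f"resolved by: {name} (cost {cost})"
    return "escalate: all known checks exhausted"

# Hypothetical checks for the workstation example: (name, cost, fix_fn).
# Here the reseated cable happens to be the actual fault.
checks = [
    ("reinstall NIC driver", 30, lambda: False),
    ("reboot workstation",    5, lambda: False),
    ("reseat desktop cable", 10, lambda: True),
    ("swap cross-connect",   20, lambda: False),
]

print(troubleshoot(checks))  # → resolved by: reseat desktop cable (cost 10)
```

The ordering is the whole point: the expensive driver reinstall never runs, because a cheaper check resolved the issue first.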
Today's data centers have complex architectures. Often they've grown up over long periods of time, with many hands in the architectural mix, and the logic behind why things were done the way they were has been lost. Consequently, troubleshooting application or infrastructure issues can be just as complex.
Understanding recent changes, patching, and the like can help direct your efforts. For example, patching Windows servers has been known to break applications, and a new firewall rule can certainly break the ways in which the tiers of an application stack interact. These are important things to know when approaching the troubleshooting of a new issue.
But what do you do if there is no record of these changes? There are a great number of monitoring applications out there that can track key changes in the environment and point the troubleshooter toward potential issues. I am an advocate for integrating change management software with help desk software, and I would add some feed from a SIEM collection element into that operations picture as well. If at all possible, a CMDB (Configuration Management Database) with solid historical information is critical. The difficulty is the number of these components an organization already has in place: would the company rather replace those tools with an all-in-one solution, or try to cobble the pieces together? Given the nature of enterprise architectural choices, it is hard to find a single component that accommodates every choice made throughout an organization's history.
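The payoff of that change-management and CMDB integration is simple to state: given an incident time, pull the change records that landed shortly before it, because those are your first suspects. Here is a minimal sketch of that correlation, assuming an in-memory list of records rather than any real CMDB or SIEM API; the field names and change records are hypothetical.

```python
# Sketch of change correlation: find changes applied within a window
# before an incident, most recent first (the likeliest suspects).
# Records and field names are hypothetical stand-ins for a CMDB feed.
from datetime import datetime, timedelta

def changes_before(incident_at, changes, window_hours=48):
    """Return changes applied within window_hours before the incident,
    most recent first."""
    cutoff = incident_at - timedelta(hours=window_hours)
    suspects = [c for c in changes if cutoff <= c["applied_at"] <= incident_at]
    return sorted(suspects, key=lambda c: c["applied_at"], reverse=True)

incident = datetime(2024, 5, 3, 22, 15)
change_log = [
    {"id": "CHG-101", "summary": "Windows patch batch",  "applied_at": datetime(2024, 5, 3, 20, 0)},
    {"id": "CHG-100", "summary": "Firewall rule update", "applied_at": datetime(2024, 5, 2, 9, 30)},
    {"id": "CHG-090", "summary": "Storage firmware",     "applied_at": datetime(2024, 4, 20, 2, 0)},
]

for c in changes_before(incident, change_log):
    print(c["id"], c["summary"])
# → CHG-101 Windows patch batch
# → CHG-100 Firewall rule update
```

The old storage firmware change falls outside the window and is filtered out, which is exactly the kind of noise reduction you want when the weekend finger pointing starts.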
Again, this is a caveat emptor situation. Do the research, find the solution that best addresses your issues, determine the appropriate course of action, and get as close as you can to an overall answer to the problem at hand.