High Availability Part II: Storage

In Part I, I discussed HA for server-level and application-level workloads. While this is important for uptime, that’s only one key criterion. As we all know, some level of data integrity may be lost during key events in an outage. Storage technology has taken some very important steps to mitigate these types of data integrity issues.

One of the key rationales for having highly available storage is to prevent any server level workload outage or hiccup from having negative impact on the data being stored.

To be sure, many applications are very tolerant of things like this. If you’re writing a Word document and the file server disconnects, and you then save a later version after reconnecting, well, you’ve lost nothing, right? If, alternately, you’re running a high-transaction database with multiple connections coming in to the database server, and that server experiences some fault in communication, transactions can be lost, data may be lost, and, potentially even worse, the database may become corrupted. This is certainly “restore from most recent backup” time, a moment we would all like to avoid.

Toward that end, storage manufacturers have been building systems with tolerance, caching, and other technologies to hopefully prevent these scenarios from causing damage to our most precious commodity: Data.

My first experience with a massive data-related event took place years ago, when I was working on the VMware environment at Zurich Insurance. We were using very well-designed Clariion storage from EMC. These devices were architected as two mirrored pylons of storage, providing failover in the event that one side of the array went down. That seems very appropriate, but in our case, our storage administrator had inadvertently provisioned each side of the array to 60%. There’s no physical way one side could handle the data from both: that would mean 120% of the workload on a system designed to max out at 100%. Big failure…

Again, please understand that I’m not disparaging the Clariion. Had we provisioned our storage to a maximum of 45% utilization on both sides, we’d have weathered this storm just fine. However, of course, we hadn’t.
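
To put numbers to that, here is a minimal Python sketch of the failover arithmetic. The function names are hypothetical, and it assumes a simplification: each side's load is expressed as a percentage of one side's capacity, and the surviving side has to absorb all of it.

```python
def load_after_failover(side_a_pct: float, side_b_pct: float) -> float:
    """Load the surviving side carries once it takes over its partner's work.

    Assumption: both inputs are percentages of a single side's capacity.
    """
    return side_a_pct + side_b_pct


def survives_failover(side_a_pct: float, side_b_pct: float,
                      max_pct: float = 100.0) -> bool:
    """True if one side alone can carry the combined load."""
    return load_after_failover(side_a_pct, side_b_pct) <= max_pct


if __name__ == "__main__":
    # The two scenarios from the story: 60% per side versus 45% per side.
    for a, b in [(60, 60), (45, 45)]:
        combined = load_after_failover(a, b)
        verdict = "survives" if survives_failover(a, b) else "fails"
        print(f"{a}% + {b}% -> {combined}% on the surviving side: {verdict}")
```

At 60% per side the survivor is asked for 120% and fails; at 45% per side it carries 90% with a little headroom to spare.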

So, how has the concept of data integrity via HA in storage matured over the years? Honestly, the same rule applies. Build for failure! Never over-provision!

I look at some of the “dual-headed” HA storage on the market today, built for resiliency and intended to run on commodity equipment, and I see an attempt to build an architecture similar to the monolithic builds of the past. Many benefits exist in this model, yet the failover can take too long. Many of the custom architectures have technological gaps that make them prohibitive. Again, buyer beware. In the case of Hardware Compatibility List builds, ensure you stick closely, if not identically, to the reference architecture.

Meanwhile, the goal here is to ensure uptime and application consistency whenever possible. I have said for a long time that your storage must be as stable and reliable as your firewall. Remember, backups are not restores… If you find yourself with lost or corrupt files and cannot get them back, you could be in for serious trouble. If your storage experiences a split-brain, or inconsistency due to a faulty failover, restoring from backup may be your only choice.

I also believe that you should never have a storage conversation without having a disaster recovery conversation. Ensure that what you use today is viable for tomorrow’s architecture, not just in sizing but in compatibility. And if you must upgrade the DR environment, be sure you have the ability to restore from older backups as well. This may mean keeping a pristine tape drive in the same configuration as the one you’re retiring, restoring everything and then re-backing up all data to the new tape file formats, or using disk-to-disk backup.

As with the previous posting, I want to mention that the buyer must beware. Test the failover process for an HA event. See whether the limitations of the process presented by the storage vendor are something your organization can tolerate.

I’ve had managers who refused to build robust systems, accepting the risk of an outage and its ramifications, however negative, because the cost of the HA environment was simply too much for them to accept. Understand that this is not at all what I’m advocating. But without proper education, the answer to the question of whether you can tolerate that outage is “I don’t know.”

For me, the “I don’t know” answer is dissatisfying. I like to answer these questions before issuing a purchase order.

