High availability solutions do not magically guarantee the safety and availability of your systems even if they’re working flawlessly. That n+1 failover cluster you spent all that money to build? It could just be an impressively expensive disaster waiting to happen.
This article will not talk about the obvious tenets of a successful high availability deployment. I’m going to assume everyone reading this has taken “HA 101”.
Before you even think about purchasing a high availability system of some sort, you need to understand a few important points. It may come as a surprise to some people just how many points of consideration in a HA system are not strictly technical in nature. Here are seven things to keep in mind when designing a high availability solution.
Oh, but before you read any further, skip down to point four. Yes, read the fourth fact first. Come on back when you’re done with it.
Are you back? Great! Carry on.
First, your high availability system will fail you if you’re protecting the wrong things. Before any technical decisions are made, you need to understand the business’s processes. Get input from the business leaders and see what daily tasks take place in the business that are critical. There are probably a few applications and processes out there that you either don’t know anything about or have no idea how crucial they are. A recent exploration into one of my client’s online payment gateway systems exposed a tangled web of credit card slips, multiple payment gateways and unnecessary systems. It exposed several areas that were previously unknown to me and some things that I hadn’t realized were as important as they were. I’ve worked for them for over two years and didn’t know any of that!
Get with the leaders of the business units and understand what is truly important to the operation of the business. It’s possible that you will end up spending a lot of your company’s money on a failover system that protects services that weren’t as important as you first thought, or were less important than other systems you weren’t aware of. Knowing what should be made highly available is more important than knowing how to make something highly available.
Second, a high availability system will tumble down like the proverbial house of cards if ownership is monolithic and imperial. You should assign “ownership” of the service that is made highly available in two ways. First, one person or group should be a technical owner (i.e., an IT person or team) that knows the underpinnings of what makes the service tick. Next, a second person or group needs to be the business owner of the service, knowing what purpose that service has and how important it is. If ownership isn’t clearly defined and comes into dispute, resolving a problem with a HA system gets much harder than it needs to be.
An IT person may want to perform maintenance at a certain time that isn’t the best option, but only someone that uses that service would know what window of time would be best for the service. A service owner may demand an upgrade at a time when technical windows are not optimal. HA is reduced to an expensive buzzword if there is no communication between the service users and the service stewards. It should probably be noted here that the most important part of a HA system is the service’s availability, not the hardware’s availability.
Really, this step isn’t much different from the steps that should be taken with a non-HA system. However, the tendency of thought is that if a system is highly available, sometimes unnecessary risks will be taken because “Hey! It’ll fail over!” Having a few extra eyes on it can prevent problems born from presumption.
Thirdly, your HA system will fail if the SysAdmin(s) caring for it are slobs. You need disciplined IT habits. If you administer your high availability systems with all the grace of a drunken unicyclist, you will be introduced to cold, hard pavement in a most unpleasant manner. Or at least, the unemployment line. You need to follow change control processes, update software and hardware in approved manners and have rollback procedures tested and at the ready. Of course, you can’t forget the SysAdmin trifecta either: documentation, documentation and documentation. Ignoring these habits will ensure that your high availability system will become a “wistfully available” system.
Fourth, your HA systems will make you gnash your teeth in agony if they’re not properly secured. I would venture to say that most IT people do not consider security to be a crucial part of highly available systems. In fact, I am cursed by my own words because I placed it fourth in the list. To prove a point, I’ll leave it here.
If you have a HA system that isn’t patched, you will probably still have a HA system. However, you will also have made a hacker very happy for the highly available spam cannon that you provided to him after he rooted your box. Or maybe it was compromised by a more rudimentary virus that just destroyed data and you now have highly available space heaters that look suspiciously like servers. Possibly worst of all, a targeted attack could allow an intruder to steal information any time they wanted with the exception of 5 minutes and 15 seconds per year (the downtime implied with a 99.999% uptime SLA). Keep your stuff secured and patched.
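For anyone who wants to check the "5 minutes and 15 seconds" figure, here is a minimal sketch of the arithmetic. It assumes a 365-day year, and the function names are mine, purely for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def downtime_budget_seconds(availability_percent: float) -> float:
    """Seconds of downtime per year allowed by an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


def format_budget(seconds: float) -> str:
    """Render a number of seconds as hours, minutes and whole seconds."""
    minutes, secs = divmod(round(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_seconds(target)
        print(f"{target}% uptime allows {format_budget(budget)} of downtime per year")
```

Swap in your own SLA target to see how little room each extra nine leaves you, and remember that an intruder gets the other side of that ratio.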
Fifth, highly available systems will fail in breathtaking ways if you have tunnel vision. Yes, your fancy NAS installation is geoclustered and all of your offices are replicating to an offsite datacenter. High-fives for everyone! However, it relies on LDAP authentication, and while you have two LDAP servers replicating with each other, they’re located in only one of the company’s server rooms. And there’s only one WAN link to some of the offices. And one of the geoclustered NAS nodes is in an office’s server room running on just a single circuit. And the switch closet it connects to before the line leaves the building has no lock on the door. And I think I just made you cry.
When designing HA solutions, don’t get caught up in the minutiae of an individual implementation of some form of HA. Remember, HA is about service availability, not system availability. It doesn’t matter if the servers have had uninterrupted run-time if access to the vital service that the servers are running has been interrupted because of an unreliable network connection, unavailable authentication methods or any number of other problems.
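To put rough numbers on the tunnel vision problem, here is a back-of-the-envelope sketch. It assumes the service needs every dependency in the chain to be up and that the dependencies fail independently of one another, which is a simplification. The availability figures are invented for illustration, not measurements from any real site.

```python
from math import prod


def serial_availability(components: dict[str, float]) -> float:
    """Availability of a service that needs ALL of its dependencies up.

    Assumes independent failures, so the individual availabilities
    simply multiply together.
    """
    return prod(components.values())


# Hypothetical figures: a five-nines NAS sitting behind weaker links.
chain = {
    "geoclustered NAS": 0.99999,
    "LDAP pair in one server room": 0.999,
    "single WAN link": 0.995,
}

overall = serial_availability(chain)

if __name__ == "__main__":
    print(f"Service availability: {overall:.4%}")
    print(f"Weakest single component: {min(chain.values()):.4%}")
```

Under these assumptions the service as a whole can never be more available than its weakest dependency, no matter how many nines the NAS brochure promised.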
Sixth, high availability does not change the fact that your backups are only as good as your restores. You still need verified backup and disaster recovery plans. Just because your fancy schmancy HA product is working like a charm and you’ve ramped up your change management and security habits to clinical OCD levels doesn’t mean that things can’t go horribly wrong. If an errant application stomps all over a database, do you know what you have? That’s correct. You have a highly available stomped-on database. You do have backups in some form, correct? Government agencies have been known to demand old emails on occasion. You do keep archived copies of all communication, right? You don’t have potentially sensitive information scattered around in PSTs, right? (*cough* Red Gate’s PST Importer 2010 *cough*) A configuration change on the spiffy firewall cluster hosed some ACLs. You do take regular configuration backups, right? Furthermore, you know how to restore those backups because you regularly practice restoration procedures… right??
Many times we SysAdmins can forget what we’re not protected from by HA systems. We are not protected from the necessity of backups, backup versions, archiving, restoration drills and disaster recovery plans.
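One small piece of a restoration drill can be automated: confirming that what came back from the backup is byte-for-byte what went in. The sketch below assumes you recorded a SHA-256 digest of each file at backup time. The function names and that workflow are my own illustration, not a feature of any particular backup product.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups don't have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(recorded_digest: str, restored: Path) -> bool:
    """True if the restored file hashes to the digest recorded at backup time."""
    return sha256_of(restored) == recorded_digest
```

A matching digest only proves the bytes survived the round trip. It does not prove the application can actually start up on the restored data, so the full drill still has to be done by hand every so often.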
Seventh, HA systems do not guarantee that the product they are protecting will support being run in that configuration. Check the applications and services that you’re planning on making HA to see if they’re compatible with your plans. Some applications don’t like some kinds of HA. The clustering abilities of an operating system can be less robust than those of a third-party application. For instance, Windows Server 2008 offers some improved clustering features compared to previous versions, but some applications just don’t support being run in a Windows cluster. In that case, you might have to try a third-party clustering / synchronizing product, either software or hardware.
However, it’s no fun to purchase servers and set up a cluster only to have the application flop around like a freshly caught salmon. Is that transparent clustering tool really, really transparent in every way? Equally un-fun is having your support contract nullified while you’re on the phone during an emergency when the technician discovers you’re running it in an unsupported environment. Be careful and research your choices thoroughly.
Those are the seven ways that high availability systems will help you to fail in even more spectacular ways than you ever thought possible. Remember, technology is fun and useful, but it’s no silver bullet and it’s certainly no substitute for the scientific method and some good, ol’ fashioned horse sense.