It turns out that, for all my insistence on dual power supplies in all data center equipment, server 'power events' (as we classify them in our NOC log) are three times more likely to be caused by Acts of Man (human error) than by actual power supply failures. I must talk to some people about that... but designing resilience into our systems has repeatedly saved us the embarrassment of explaining who tripped a PDU, pulled the wrong cord, knocked a cable etc.
In an earlier blog entry I differentiated between making business systems 'resilient', 'high availability' and 'disaster-proof' - and argued that IT's minimum obligation to any business is resilience. Unlike our non-IT colleagues, who fret over whether to buy the extended warranty on their new PC at the weekend, we are kept awake at night by issues such as drive and power supply failures, bad cables, bad code and bad decisions (literally, for those of us lucky enough to carry pagers for a living). And so to ensure a. continued employment and b. some sleep, we routinely configure RAID sets, connect redundant power supplies, mesh switches, stock spares... and invest in warranties and services that guarantee far more than 'return to retailer'.
A few years ago a series of online articles titled 'Ten Commandments of the Sysadmin'** did the usual rounds, and while a little skewed towards one class of systems (honestly, the list should be much longer!) I still have a copy on my office wall (and weighing on my conscience):
Thou shalt make regular and complete backups
Thou shalt establish absolute trust in thy servers
Thou shalt be the first to know when something goes down
Thou shalt keep server logs on everything
Thou shalt document complete and effective policies and procedures
Thou shalt know what cable goes where
Thou shalt use encryption for insecure services
Thou shalt not lose system logs when a server dies
Thou shalt know the openings into your servers
Thou shalt not waste time doing repetitive and mundane tasks
Unsurprisingly, every one of these can be related to good BC practices and to maintaining a resilient business systems infrastructure:
1. Thou shalt trust ten verified regular backups slightly more than one. Never be complacent about having included 'everything' every time; rehearse your restores; know what is on which physical or virtual tape and where every physical tape is, all the time. You must monitor for successful backup completion and investigate failures. Make sure that updating your backup jobs and a restore test are both part of your go-live checklist procedure for every new server. Have a go-live checklist!
2. You provide services without which your employer cannot do business; the minimum standard is 'resilient', which means the business can RELY on servers being available. "My son wouldn't do a thing like that." Know how your servers are used, who their friends are (dependencies), when they are getting tired, which ones are grumpy and if they need a break then set a curfew (maintenance window). If you don't trust them any more the business certainly can't.
3. Never ever let a user tell you your server is down before your monitoring does. Expect that users will tell you every server is down all the time, but remind them to first power on / reconnect / log in... Seriously, effective monitoring doesn't just tell you the binary server up/down: it can give you advance warning of a performance or capacity issue (both of which can lead to a BC event), and it can provide assurances (backup job completed successfully, hotspare drive now in use so time to go replace the failed drive, etc.). Again, have a go-live process that ensures new servers are added to your monitoring!
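For anyone who wants to see what 'more than up/down' might look like, here is a minimal Python sketch rather than anything we actually run. The ServerStatus fields, the 85% disk threshold, the 26-hour backup window and both function names are assumptions made up for illustration; a real monitoring product would collect and express these in its own way.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Set


@dataclass
class ServerStatus:
    """Hypothetical snapshot that a monitoring agent might report for one server."""
    name: str
    reachable: bool
    disk_used_pct: float
    last_backup_ok: Optional[datetime]  # None if no successful backup is on record
    hotspare_in_use: bool


def evaluate(status: ServerStatus, now: datetime,
             disk_warn_pct: float = 85.0,
             max_backup_age: timedelta = timedelta(hours=26)) -> List[str]:
    """Return alert messages for one server; an empty list means nothing to report.

    The thresholds are illustrative defaults, not recommendations.
    """
    if not status.reachable:
        # The other readings are stale if the server cannot be reached.
        return [f"CRITICAL {status.name}: server unreachable"]

    alerts = []
    if status.disk_used_pct >= disk_warn_pct:
        alerts.append(f"WARNING {status.name}: disk {status.disk_used_pct:.0f}% full")
    if status.last_backup_ok is None or now - status.last_backup_ok > max_backup_age:
        alerts.append(f"WARNING {status.name}: no successful backup within {max_backup_age}")
    if status.hotspare_in_use:
        alerts.append(f"WARNING {status.name}: hotspare in use, replace the failed drive")
    return alerts


def go_live_gaps(inventory: Set[str], monitored: Set[str],
                 backed_up: Set[str]) -> Dict[str, List[str]]:
    """List servers in the inventory that monitoring or backup jobs do not cover."""
    return {
        "not_monitored": sorted(inventory - monitored),
        "not_backed_up": sorted(inventory - backed_up),
    }


if __name__ == "__main__":
    now = datetime(2024, 1, 2, 12, 0)
    db01 = ServerStatus("db01", True, 91.0, datetime(2024, 1, 1, 2, 0), True)
    for line in evaluate(db01, now):
        print(line)
    print(go_live_gaps({"db01", "web01", "web02"}, {"db01", "web01"}, {"db01"}))
```

The point of the second function is the go-live checklist: compare what you own against what is monitored and backed up, and treat any difference as a failure in its own right.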
More on this in the next post, but from our own experiences this week: greatly improve your BC preparedness by holding on-site stock of a couple of commonly used (consumed) spares, irrespective of your levels of service cover with the vendor; the most obvious examples are power supplies and disk drives. Even with a RAID array and a hotspare drive, or a redundant power supply, as soon as you experience a failure you are immediately at high risk of degraded performance, data loss or both - and it's another great reason to standardize your server, storage and networking vendors. Two failed drives, an internal array backplane and ultimately an entire (standard model, held spare-on-site) server later, we minimized the impact of catastrophic hardware failure on one of our own business systems.
(** see http://www.linux.com/articles/44315 for the original Ten Commandments)
--
James Deveson
Director Information Technology