So far: Thou shalt make regular and complete backups, Thou shalt establish absolute trust in thy servers, Thou shalt be the first to know when something goes down, Thou shalt keep server logs on everything, Thou shalt document complete and effective policies and procedures, Thou shalt know what cable goes where, Thou shalt use encryption for insecure services
8. Thou shalt not lose system logs when a server dies
9. Thou shalt know the openings into your servers
10. Thou shalt not waste time doing repetitive and mundane tasks
Now, in BC terms:
8. Consolidate your logs to a separate system via traps, alerts, emails, syslog, etc. Otherwise the only evidence of the root cause of a failure may be destroyed at the same time as that critical business system dies. Without the logs, you restore the database, restart the system and watch the same 'routine' maintenance script self-destruct all over again. With them, you can look in the logs, see what happened last (or sometimes 'who' happened), and reconfigure or interrogate as required.
9. Security issues aside, know your weak spots, especially Single Points of Failure (SPOFs). You may not be able to afford to mesh core routers and switches, to cluster every server or to hold every conceivable spare, but in your risk assessment identify those items whose failure would cause the most disruption, and have a backup plan for each. Be prepared to improvise using dissimilar hardware, especially when it comes to server restores. We once tested a financial app recovery capability using only hardware that could be purchased at a local computer store, and in the process found we couldn't possibly meet the SLAs for recovery time. (Tip: know where to get a tape drive same-day or store a spare off-site with your tapes, and locate your backup media, VTLs and libraries in a different physical room/building from the data center they protect!) Baffle and distract your auditors with your 'SPOF risk assessment methodology' ;-).
10. Don't rely on an engineer taking the time to back up a critical configuration file before every change (why would we, we're busy people!) if you can automate and schedule the process, especially with appliances, routers, switches, firewalls, VPN concentrators etc. Somehow those of us who started our careers religiously updating Emergency Repair Disks for every one of our servers forget that there are now dozens of uniquely configured critical devices that can't be directly protected by our 'backup' software. All backup software packages include features to automatically notify you of a job failure - use them. Routine maintenance is required - know how frequently key systems need to be restarted or failed over to ensure best performance, plan and use the required maintenance windows, and formalize that maintenance plan as part of the go-live process for each new system.
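As a rough illustration of automating the configuration backup in commandment 10, here is a minimal Python sketch. It assumes the device's configuration has already been exported to a local file (how you pull it from a router or firewall is vendor-specific and not shown), and the function names and archive naming scheme are hypothetical. It keeps a numbered, timestamped copy whenever the file has changed and does nothing when it hasn't, so it can safely be run from cron or any other scheduler.

```python
import hashlib
import shutil
import time
from pathlib import Path
from typing import Optional


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot_config(source: Path, archive_dir: Path) -> Optional[Path]:
    """Copy `source` into `archive_dir` if it differs from the newest snapshot.

    Returns the path of the new snapshot, or None when the newest
    existing snapshot already has identical contents.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)

    # Snapshots are named <file>.<sequence>.<timestamp>; the zero-padded
    # sequence number makes a plain name sort put the newest one last.
    existing = sorted(archive_dir.glob(f"{source.name}.*"))
    if existing and sha256_of(existing[-1]) == sha256_of(source):
        return None  # nothing changed since the last snapshot

    stamp = time.strftime("%Y%m%dT%H%M%S")
    target = archive_dir / f"{source.name}.{len(existing):06d}.{stamp}"
    shutil.copy2(source, target)  # copy2 also preserves the modification time
    return target


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    result = snapshot_config(Path("exports/core-switch.cfg"), Path("config-archive"))
    print("new snapshot:" if result else "unchanged", result or "")
```

Keeping the archive directory on a different system from the device it protects is the same idea as commandment 8: the copy should survive the failure it is there to help you recover from.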
One final 'mundane task' that you shouldn't try to avoid: testing your backups. Make sure you know how to restore to dissimilar hardware, what the limitations of any bare metal restore process are, how to rebuild your backup media server (you can download, or hold off-site, the OS and the backup vendor's software, right?), how to rebuild media catalogs, and how to identify which tapes you need and where they are if your backup server is 'otherwise unavailable'. Check that you have secure off-site records of key database and media passwords or encryption keys that may be needed for restores - and the passwords you'll need to revive each business system (you do know how to restore a domain controller from scratch, right? That obscure recovery console password may be needed after all...). If you're backing up because one day you might need to restore - make sure you can!
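One way to make a restore test less of a judgment call is to compare checksums. The Python sketch below is only an illustration under some assumptions: that you can record a manifest of file digests when the backup is taken, and that the test restore lands in a directory you can read. The function names are hypothetical, and it does not replace a real application-level test (can the database actually start?).

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to `root`) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads each file fully into memory; fine for a sketch, but
            # very large files would be better hashed in chunks.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def verify_restore(manifest: Dict[str, str], restored_root: Path) -> List[str]:
    """Compare a restored tree against a manifest taken at backup time.

    Returns a list of problems; an empty list means every file in the
    manifest was restored with identical contents.
    """
    restored = build_manifest(restored_root)
    problems = []
    for name, digest in sorted(manifest.items()):
        if name not in restored:
            problems.append(f"missing: {name}")
        elif restored[name] != digest:
            problems.append(f"changed: {name}")
    return problems
```

Store the manifest off-site alongside the media records and passwords mentioned above; it is of little use if it lives only on the server you are trying to restore.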
(** see http://www.linux.com/articles/44315 for the original Ten Commandments)
--
James Deveson
Director Information Technology