Thou Shalt Avoid Seeking Employment Elsewhere

 

(last time: Thou shalt make regular and complete backups, Thou shalt establish absolute trust in thy servers, Thou shalt be the first to know when something goes down)

 

4. Thou shalt keep server logs on everything
5. Thou shalt document complete and effective policies and procedures
6. Thou shalt know what cable goes where
7. Thou shalt use encryption for insecure services

 

Continuing with the Ten Sysadmin Commandments, and some practical implications for Business Continuity:

 

4. Know what changed and when (optionally, why and by whom). Changes, including those made in error or through poor judgement, are the most likely cause of any BC event, and tying the timing of an event back to a documented change can eliminate hours of troubleshooting. Consolidate the logs and preserve them for as long as practical so you can relate symptoms to known or detectable changes (when did the server start recording that error? what changed two weeks ago, when the users say it slowed down? which day exactly did you last reboot, and when did you begrudgingly let that consultant patch that troublesome app-of-the-day?). Remember that many 'events' that need logging won't be recorded directly by the servers, so some kind of manual change log or journal may be needed. If you're really unlucky your SOX auditors will help you with this, and more!
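As a rough illustration of relating an event back to recent changes, here is a minimal Python sketch. The journal entries, names and the 14-day window are all made up for the example; a real change log would come from whatever consolidated log or manual journal you keep.

```python
from datetime import datetime, timedelta

# Hypothetical manual change journal: (when, who, what). All entries are invented.
CHANGE_LOG = [
    (datetime(2007, 10, 1, 9, 30), "consultant", "patched troublesome app"),
    (datetime(2007, 10, 8, 22, 0), "jd", "rebooted file server"),
    (datetime(2007, 10, 14, 18, 15), "ops", "replaced switch in rack 3"),
]


def changes_before(incident, log, window_days=14):
    """Return the changes made in the window before the incident, newest first."""
    start = incident - timedelta(days=window_days)
    hits = [entry for entry in log if start <= entry[0] <= incident]
    return sorted(hits, key=lambda entry: entry[0], reverse=True)


# Users report a slowdown; list every recorded change in the preceding two weeks.
incident = datetime(2007, 10, 15, 14, 44)
for when, who, what in changes_before(incident, CHANGE_LOG):
    print(f"{when:%Y-%m-%d %H:%M}  {who:<12} {what}")
```

The most recent change is listed first because it is usually the first suspect; widen the window if the symptoms may have gone unnoticed for a while.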

 

5. You've probably been here already with your auditors, but just in case: document your backup procedures and, to the greatest extent possible, any information that may be needed for a restore, and cross-train on those procedures. Where practical, document configurations; at a minimum, know the basic hardware requirements to support each business system in case you are recovering from a complete equipment loss. Even for the most paperwork-averse sysadmin, simple checklists for routine events such as new server deployments and retirements, regular maintenance, reboots, manual backup procedures and datacenter walkthroughs greatly reduce the risk of human error (and keep the auditors busy while you do some real work!).
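One way to keep yourself honest about this is to treat the documentation itself as a checklist. The sketch below is only an illustration: the system names, file paths, field names and the "at least two trained people" rule are assumptions, not a standard.

```python
# Hypothetical recovery inventory; every name, path and value here is illustrative.
REQUIRED_FIELDS = ("backup_procedure", "restore_procedure", "min_hardware", "trained_staff")

SYSTEMS = {
    "payroll": {
        "backup_procedure": "docs/payroll-backup.txt",
        "restore_procedure": "docs/payroll-restore.txt",
        "min_hardware": "2 CPU, 4 GB RAM, 200 GB disk",
        "trained_staff": ["alice", "bob"],
    },
    "email": {
        "backup_procedure": "docs/email-backup.txt",
        "restore_procedure": "",
        "min_hardware": "",
        "trained_staff": ["alice"],
    },
}


def documentation_gaps(systems):
    """Map each business system to the recovery documentation it is missing."""
    gaps = {}
    for name, record in systems.items():
        missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
        # Assumed rule: cross-training means at least two people know the procedure.
        if "trained_staff" not in missing and len(record["trained_staff"]) < 2:
            missing.append("cross_training")
        if missing:
            gaps[name] = missing
    return gaps


for system, missing in documentation_gaps(SYSTEMS).items():
    print(f"{system}: missing {', '.join(missing)}")
```

Anything this reports is something you would otherwise discover in the middle of a restore.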

 

6. If your monitoring is any good you'll know immediately which cable just got unplugged, but when you're troubleshooting a network issue nothing burns time like tugging on cables and lifting floor tiles, trying to figure out where that 'temporary' cable from last year really connects. Inside the data center, require every network, SAN and WAN cable to be consistently labelled, and make it someone's responsibility to check, trace and/or remove those that aren't. Hold spares, and know which cables are potential single points of failure. (Tip: know the cabling paths into your building for the main utility and telecom providers as well. You may not be able to protect them, but when a crew starts trenching inches from your only telco cable you might want to make polite inquiries while the phones still work. Beware spending money on a backup telco/internet provider that is just going to follow the same path as your primary!) Every time you add a server in your data center, make cable and switch port identification a step in the go-live checklist.
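If the cable inventory lives in a spreadsheet or a simple file, both checks (unlabelled cables, and "redundant" links that share one physical path) are easy to automate. A minimal Python sketch, assuming a hypothetical inventory with invented labels, ports and paths:

```python
from collections import defaultdict

# Hypothetical cable inventory; labels, roles, paths and ports are invented.
CABLES = [
    {"label": "WAN-01", "role": "telco", "path": "north conduit", "port": "edge-sw1/1"},
    {"label": "WAN-02", "role": "telco", "path": "north conduit", "port": "edge-sw2/1"},
    {"label": "SAN-07", "role": "san", "path": "rack 3 overhead tray", "port": "fc-sw1/7"},
    {"label": "SAN-08", "role": "san", "path": "rack 3 underfloor", "port": "fc-sw2/7"},
    {"label": "", "role": "network", "path": "rack 5 underfloor", "port": "core-sw1/22"},
]


def unlabelled(cables):
    """Switch ports whose cable has no label and needs tracing or removing."""
    return [c["port"] for c in cables if not c["label"].strip()]


def single_path_roles(cables):
    """Roles whose cables all follow one physical path (a single point of failure)."""
    paths = defaultdict(set)
    for c in cables:
        paths[c["role"]].add(c["path"])
    return sorted(role for role, seen in paths.items() if len(seen) == 1)


print("Unlabelled:", unlabelled(CABLES))
print("Single physical path:", single_path_roles(CABLES))
```

In this made-up inventory the two telco links count as a single point of failure even though there are two of them, because both run through the same conduit.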

 

7. If you transfer your backup data to a third party (e.g. send tapes to an off-site vault) or to an uncontrolled location (e.g. replicate data to a site where you can't ensure physical security), encrypt the data in transit and at its destination if possible. Make sure you have a robust, documented and tested process for storing and recovering the encryption keys. And on the subject of insecurity, make sure you know exactly who can access your data center, IDFs and equipment rooms, and make sure you can get at the access log for them. Interrogation is a form of retrospective change management: "What the @#$@#$ did you do?..." (it works so much better if you know who to shout at).

 

PS: if any of our auditors are reading this, I'm just kidding.

 

Next: Thou shalt not lose system logs when a server dies, Thou shalt know the openings into your servers, Thou shalt not waste time doing repetitive and mundane tasks

 

(** see http://www.linux.com/articles/44315 for the original Ten Commandments)

 

--
James Deveson

Director Information Technology

Posted on Monday, October 15, 2007 2:44 PM
