Does this sound familiar? No one notices IT until something breaks (exception: budget season), and everyone assumes that IT has planned and is ready for the day the sky falls. In fact, many business leaders assume that IT ("with their huge budgets and all") keeps things from breaking in the first place.
The reality, as we all know, is that the very nature of IT operations is managing risk - and therefore Business Continuance (BC). Equipment failures, human errors (Acts of Man), and application and data issues are inevitable; only their impact is in question. Understanding those risks and directing investment in BC shelters the business from the disruption and losses that routine issues would otherwise cause. Some regulatory requirements aside, fundamentally we back up because one day we may need to restore.
IT managers worldwide face the same problem: leading business executives to the point of 'risk acceptance' without unnecessarily scaring or alienating them with the reality that they can't possibly afford what they really want, which is never to have to worry about it again. The worst time to have the conversation with executives about risk acceptance is after a failure has impacted the business (although those conversations are usually very short!). Creating a culture of continuance within IT and exposing it to the rest of the business will support better decisions, a clearer shared understanding of the risks, and likely more resources.
The business case or ROI statement for any new business system must clearly address the chance of its subsequent failure (the cost of being without its benefits), and I find that's the ideal place to start addressing risk acceptance and BC costs. Business managers need to know the difference between a 'resilient' system and a 'highly available' (HA) system (consider using those terms consistently, even classifying your existing systems as one or the other). They also need to know the cost of the extra infrastructure, people and training that High Availability requires, and how that changes the ROI. You may even submit two budgets for the new system depending on the level of risk acceptance; not surprisingly, a lot of projects that start out with business demands for 'HA' end up 'resilient'.
The design of any new business system and its business case should address:
• What SLAs are needed for this system? What are the expectations for availability (including maintenance windows) and recoverability (Recovery Time Objective and Recovery Point Objective, i.e. the required recovery time and the acceptable data loss)?
• What is the cost to the business of the system being unavailable, and of recovering from an outage? (This helps build the business sponsor's case for HA vs resilient.)
• What requirements are there for contingency manual processes, failover capability and failover testing?
• What demonstrations of resilience or high availability are a prerequisite for go-live?
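To make the 'two budgets' comparison concrete, here is a minimal sketch of the arithmetic. Every figure in it (annual costs, expected outage hours, hourly cost of downtime) is a hypothetical placeholder, and the names Option, expected_annual_total and break_even_cost_per_hour are invented for illustration. It simply adds the annual spend on each option to the expected cost of being without the system, and finds the hourly downtime cost above which HA pays for itself.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    annual_cost: float            # extra infrastructure, people, training per year
    expected_outage_hours: float  # expected unplanned downtime per year


def expected_annual_total(option: Option, downtime_cost_per_hour: float) -> float:
    """Annual spend plus the expected cost of being without the system."""
    return option.annual_cost + option.expected_outage_hours * downtime_cost_per_hour


def cheaper_option(a: Option, b: Option, downtime_cost_per_hour: float) -> Option:
    """Return whichever option has the lower expected annual total."""
    return min((a, b), key=lambda o: expected_annual_total(o, downtime_cost_per_hour))


def break_even_cost_per_hour(resilient: Option, ha: Option) -> float:
    """Hourly downtime cost above which the HA option becomes cheaper overall."""
    hours_saved = resilient.expected_outage_hours - ha.expected_outage_hours
    if hours_saved <= 0:
        raise ValueError("HA option must reduce expected outage hours")
    return (ha.annual_cost - resilient.annual_cost) / hours_saved


# Hypothetical figures, for illustration only.
resilient = Option("resilient", annual_cost=40_000, expected_outage_hours=16.0)
ha = Option("HA", annual_cost=120_000, expected_outage_hours=2.0)

break_even = break_even_cost_per_hour(resilient, ha)
print(f"HA pays for itself above {break_even:.2f} per hour of downtime")
for cost_per_hour in (2_000, 10_000):
    choice = cheaper_option(resilient, ha, cost_per_hour)
    print(f"At {cost_per_hour} per hour, the cheaper option is {choice.name}")
```

With these made-up numbers the break-even sits a little under 6,000 per hour, which is the kind of single figure a business sponsor can argue with. The model deliberately ignores recovery costs and reputational damage; those would need to be folded into the hourly figure.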
Aside from new system implementations, there are many other ways to engage the business in risk management and risk acceptance decisions:
• Keep a record of system availability, and if possible publish it as a metric visible outside the IT department - the objective for the metric (close to, but not quite, 100%) directly reflects the necessary investment. At Overland Storage we have separate goals for our 'HA' systems during business hours and out of hours, and a third measure for 'resilient' systems where the expectation is lower. Each month we report those metrics along with a brief summary of any business-impacting incidents that contributed to them.
• Understand the impact of a business system being down - perhaps through an informal annual poll of managers and application 'sponsors', or through a more formal business impact analysis (you wouldn't be asking if downtime wasn't a possibility, right?). How long does it take to recover from the system being down for a day? Which systems do departments rely on most and least? Can you tie lost revenue or increased expense to those systems, for example based on the number of affected users?
• Regularly share with application owners - the line-of-business sponsors for your major business systems - the 'near misses' that didn't impact the center of their universe precisely because IT maintained a resilient environment that handled a basic equipment failure, a cluster failover, a rollback. Take the opportunity to discuss any lessons learnt by your team, and remind them of the possible exposures that you can't engineer for or afford. Making the business case for ensuring continuity should ultimately be their responsibility.
• Share your own risk assessments, for example by classifying your major systems as resilient or HA - and encourage discussion about why a perhaps critical business system is just 'resilient'. Be prepared to give real examples of problems that have brought down one of your resilient systems but wouldn't have disrupted an HA system.
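The monthly availability metric described above is simple to compute once downtime is logged per category. The sketch below is one possible way to do it, not a description of any particular reporting tool: the category names, scheduled minutes, incident durations and targets are all hypothetical, and it assumes planned maintenance windows have already been excluded from the scheduled minutes.

```python
from collections import defaultdict


def availability_pct(scheduled_minutes: float, downtime_minutes: float) -> float:
    """Percentage of scheduled time during which the service was up."""
    if scheduled_minutes <= 0:
        raise ValueError("scheduled_minutes must be positive")
    return 100.0 * (scheduled_minutes - downtime_minutes) / scheduled_minutes


def monthly_report(scheduled, incidents, targets):
    """Return {category: (availability %, target met?)} for one month."""
    downtime = defaultdict(float)
    for category, minutes in incidents:
        downtime[category] += minutes
    report = {}
    for category, minutes_scheduled in scheduled.items():
        pct = availability_pct(minutes_scheduled, downtime[category])
        report[category] = (round(pct, 3), pct >= targets[category])
    return report


# Hypothetical 30-day month: 22 working days of 10 business hours each.
scheduled = {
    "HA business hours": 22 * 10 * 60,                  # 13,200 minutes
    "HA out of hours": 30 * 24 * 60 - 22 * 10 * 60,     # 30,000 minutes
    "resilient": 30 * 24 * 60,                          # 43,200 minutes
}
# Hypothetical business-impacting incidents: (category, minutes of downtime).
incidents = [
    ("HA business hours", 12),
    ("HA out of hours", 90),
    ("resilient", 300),
    ("resilient", 120),
]
# Hypothetical goals: close to, but not quite, 100%.
targets = {"HA business hours": 99.9, "HA out of hours": 99.8, "resilient": 99.0}

report = monthly_report(scheduled, incidents, targets)
for category, (pct, met) in report.items():
    print(f"{category}: {pct}% ({'met' if met else 'missed'} target {targets[category]}%)")
```

Publishing the result next to the incident summaries keeps the link between the target and the investment visible: a missed out-of-hours goal is a prompt to discuss either the target or the budget.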
In addition to 'resilient' and 'HA' there is a third class of BC that business managers need to understand: disaster-proof. Far beyond the costs of HA, disaster-proofing business systems requires an entirely different level of planning and expense.
Resilient, HA, disaster-proof - ensure that your team and your executives understand those terms, and challenge them for the business case for each. Resilience is your job, always ("one power supply or two, sir?"); HA is a business-driven decision to invest against less likely but more damaging failures; and disaster-proof is out there with the money tree.
--
James Deveson
Director Information Technology