GreenPages' Newsletter Apr 2009

The Challenges of Business Continuity Planning: The Top 3 Tips for BC Data Protection and Recovery Planning

By John Ferraro, CEO InMage

Business Continuity Planning (BCP) is a critical task for enterprises of all sizes. Statistics collected over the past twenty years consistently show that an extremely high percentage of businesses that sustain multi-day outages go out of business within one year. But BCP can be a complex process in today’s rapidly changing IT environments, presenting formidable challenges for IT managers.

Tip 1: Understand Your RPO/RTO Requirements

Recovery point objective (RPO) defines how much data loss you can sustain on recovery, while recovery time objective (RTO) defines how fast your recovery must occur. Together, these two metrics define the service level agreements (SLAs) that you must maintain for any given application environment, based on end user requirements for local recovery. As RPO/RTO requirements become more stringent, the cost of the infrastructure required to meet them tends to rise. To ensure that you’re spending where you should to meet stringent RPO/RTO requirements, and not overspending where you shouldn’t, you need to understand the SLAs required by each of your application environments. This should be the first thing you do when determining the data protection infrastructure you need to meet your business continuity requirements.

Applications can be divided into different tiers, depending on their RPO/RTO requirements. A common approach is to set up a 3-tier model, with tier 1 limited to applications that require RPOs and RTOs of under 1 hour, tier 2 including applications with RPOs and RTOs of under 4 hours, and tier 3 comprising applications that can sustain RPOs and RTOs of under 1 day. As an IT manager, you’ll want to work with your end users to help determine where applications fit. A chargeback system that makes it very clear to each end user department what its SLA costs will encourage applications to be tiered appropriately. Many end users will state that they require very stringent RPOs and RTOs until they understand the cost to them of each tier. Depending on the type of clients you have, there may be fees or penalties that accrue when contractually defined SLAs are not met.
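
The 3-tier model described above can be sketched as a simple lookup. This is a minimal illustration in Python that assumes the thresholds given above (under 1 hour, under 4 hours, under 1 day). The function name `assign_tier`, the rule of tiering by the stricter of the two metrics, and the treatment of tier 3 as a catch-all are assumptions made for the example, not part of any product.

```python
from datetime import timedelta


def assign_tier(rpo: timedelta, rto: timedelta) -> int:
    """Return 1, 2, or 3 based on the stricter of the two requirements.

    Assumptions for this sketch: an application is tiered by whichever
    requirement (RPO or RTO) is more demanding, and anything that does
    not qualify for tier 1 or tier 2 falls into tier 3.
    """
    strictest = min(rpo, rto)
    if strictest < timedelta(hours=1):
        return 1
    if strictest < timedelta(hours=4):
        return 2
    return 3
```

For example, an application with a 15-minute RPO and a 2-hour RTO lands in tier 1 under this rule, because the RPO is the stricter requirement. Your own tiering policy may weigh the two metrics differently.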

Understanding the impact on your business when data is unavailable helps immensely in determining SLAs. Online applications that drive revenue in real time, or that directly affect critical services such as health care, police, and fire, are examples of applications that fall into the tier 1 definition. Mail systems may fall into tier 1 or tier 2, depending upon the criticality of e-mail to the daily flow of business. Home directories may be an example of a tier 3 application environment, again depending upon your business. Generally, very few applications are defined as tier 1 or tier 2, and it’s not uncommon for 80% or more of a company’s application environments to fall into its tier 3 definition.

Tip 2: Implement the Appropriate Data Protection Infrastructure

Once you understand where your applications fall in terms of this “recovery tiering”, you can then turn to implementing the infrastructure necessary to meet your recovery requirements. First, it’s likely that you have a heterogeneous environment, with different applications, operating systems, and servers and storage from different vendors. You’ll want to ensure that your data protection solution (i.e., your backup software plus the supporting hardware infrastructure for it) supports this heterogeneity, so that you can minimize complexity by using a common approach across your environment and have the maximum amount of flexibility when purchasing new hardware or re-purposing existing hardware. Server virtualization technology is implemented in over 80% of enterprises of all sizes in at least some capacity, and this technology will be used more and more in production environments. Ensure that your data protection infrastructure can not only accommodate both physical and virtual server environments but also take advantage of some of the optimizations available only with server virtualization platforms.

Second, you’ll need to understand the scale of storage you need to support over time, and ensure that your infrastructure can be reliably built out to accommodate it. With data growing at 50% to 60% or more per year across the board, even many small enterprises will be managing 100TB or more over the next several years. Storage area networks (SANs) provide more flexibility than direct-attached storage, are particularly interesting for virtual server environments, and can offer ease of use and cost advantages through centralized management. SANs facilitate the deployment of storage management technologies critical to addressing data protection problems in the SAN fabric, providing a very cost-effective way to leverage sophisticated storage functionality (thin provisioning from a centralized storage pool, off-host backups through proxy servers, WAN optimization technologies like compression, security capabilities such as encryption, etc.) while minimizing the overhead imposed on production servers. The storage consolidation obtained with a SAN also allows replication to be efficiently used to create and maintain copies of critical data sets at remote sites for DR purposes.
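
To see how quickly the growth rates cited above compound, a short Python sketch can project capacity forward. The 20 TB starting point and the four-year horizon are hypothetical values chosen for illustration, and `project_capacity_tb` is a name invented for this example.

```python
def project_capacity_tb(current_tb: float, annual_growth: float, years: int) -> float:
    """Compound growth: capacity after `years` at `annual_growth` per year."""
    return current_tb * (1 + annual_growth) ** years


# Hypothetical starting point of 20 TB, projected four years out
# at the 50% and 60% annual growth rates mentioned above.
for rate in (0.50, 0.60):
    projected = project_capacity_tb(20, rate, 4)
    print(f"{rate:.0%} annual growth: {projected:.2f} TB after 4 years")
```

With these hypothetical inputs, 20 TB grows to roughly 101 TB at 50% annual growth and roughly 131 TB at 60%, which is why even a modest environment can cross the 100TB mark within a few years.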

Third, you’ll want to evaluate the technologies available that will enable you to meet your RPO and RTO requirements across your defined tiers. Disk-based backup offers many performance and reliability advantages over tape, and gives you access to other technologies, such as snapshot backups, continuous data protection (CDP), and replication, that are important in meeting stringent recovery requirements. Snapshot backups can be used to minimize the production impact of backups and increase the frequency of backups relative to the conventional tape-based “once a day” backups, offering multiple recovery points per day. CDP can transparently maintain an up-to-date copy of your data in real time, offering the ability to almost instantly recover your data with zero data loss, and is the right data protection technology to meet the most stringent RPO/RTO requirements. SATA disk technologies offer the performance and reliability required for secondary storage applications like backup and DR at aggressive price points, narrowing the cost difference between disk and tape-based data protection infrastructures.
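
The link between backup frequency and RPO can be made concrete with a little arithmetic. The Python sketch below assumes that the worst-case data loss with periodic snapshots equals the gap between two consecutive snapshots, and it ignores the time a snapshot takes to complete. The function name `snapshots_per_day` is an assumption made for the example.

```python
import math
from datetime import timedelta


def snapshots_per_day(rpo: timedelta) -> int:
    """Minimum number of evenly spaced snapshots per day whose spacing
    does not exceed the given RPO.

    Assumption for this sketch: worst-case data loss equals the interval
    between two consecutive snapshots.
    """
    if rpo <= timedelta(0):
        # A zero RPO cannot be met by periodic snapshots at all;
        # that is the case continuous data protection addresses.
        raise ValueError("an RPO of zero cannot be met with periodic snapshots")
    return math.ceil(timedelta(days=1) / rpo)
```

Under this assumption, a conventional once-a-day backup only satisfies an RPO of a full day, a 4-hour RPO needs at least 6 snapshots per day, and a 1-hour RPO needs at least 24.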

Tip 3: Document and Test Recovery

Even if you’ve spent the planning time up front to design a good recovery process, have you documented it in writing? If there is only one administrator at your company who knows the recovery process in its entirety and that process is not documented, you are running major risks. What if that person leaves the company or is out on a day when a recovery is required? If you’ve spent the time up front to plan appropriately, then make sure that you will be able to reliably take advantage of that planning to get expected recovery results: document your recovery processes in a “run book”. And make sure you keep two copies of it in two different places.

Let’s distinguish, however, between local and disaster recovery. Most enterprise backup software products do a pretty good job of tracking local data protection operations (e.g. backup jobs, file-level recoveries) and notifying you of associated problems. Most enterprises are recovering at least some data locally almost every day in response to user requests for deleted or corrupt files, effectively testing those processes on a regular basis. If problems arise in your local recovery processes, they are generally discovered and resolved quickly.

When you’re dealing with multi-site DR solutions, however, it’s a different story. Compared to recovering data locally, DR is generally a more complicated process with more steps and is therefore riskier. In the natural course of operation, DR configurations tend to devolve over time, a process called “configuration drift”. Patches are applied on systems, new data volumes are added, configuration parameters are changed; to ensure that DR systems recover as expected, these changes all have to be made at both the replication source and target locations. If they are not, your recovery process may not perform as you expect. Regular testing is the way to confirm that DR solutions will perform as expected, but historically DR testing has been disruptive to production operations and therefore has not been done very often. Most enterprises with a DR plan rarely test it, and many enterprises never test it after initial deployment. This is a disaster waiting to happen.
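
Configuration drift of the kind described above can be caught by comparing an inventory of the replication source against the DR target. The Python sketch below is a minimal illustration: the function name `find_drift`, the setting names, and the sample values are all hypothetical, and a real environment would gather these inventories from its own management tools.

```python
def find_drift(source: dict, target: dict) -> dict:
    """Return {setting: (source_value, target_value)} for every setting
    that differs between the two sites.

    A setting missing on one side is reported with None for that side.
    """
    drift = {}
    for key in sorted(source.keys() | target.keys()):
        src, tgt = source.get(key), target.get(key)
        if src != tgt:
            drift[key] = (src, tgt)
    return drift


# Hypothetical inventories for the replication source and the DR target.
production = {"os_patch_level": "2009-03", "data_volumes": 6, "app_version": "4.2"}
dr_site = {"os_patch_level": "2009-01", "data_volumes": 5, "app_version": "4.2"}

for setting, (src, tgt) in find_drift(production, dr_site).items():
    print(f"{setting}: source={src} target={tgt}")
```

Running a comparison like this on a schedule is one way to surface a missed patch or an unreplicated volume before a recovery depends on it.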

Testing your DR plan may be the difference between experiencing a disaster recovery (where your recovery processes perform as expected) and a “disaster” recovery (where you run into unexpected problems that prevent you from meeting your RPO and RTO requirements). A comprehensive discussion of this issue is beyond the scope of this article, but there are two key points that can be succinctly made. Point 1: leveraging automation is a very smart way to reduce the risk associated with DR scenarios, make DR testing faster and easier, and provide the framework for improving your DR processes over time. Many of the recovery steps identified in your run book can be automated through scripting or other software tools: if it is at all possible, do it. Point 2: if you’re using server virtualization technology, you may be able to leverage snapshot and replication functionality within your data protection solution to perform DR testing very cost-effectively and non-disruptively. Look into the tools your server virtualization vendor offers to support this. If testing is automated and can be done non-disruptively, you’re likely to do it more often. And if you do it more often, you will enjoy more reliable recoveries and fewer surprises. How often is often enough? We recommend that companies do DR testing at least every 6 months.

Crafting and maintaining the data protection solution that meets your business continuity requirements is an ongoing job, but mature technologies exist that meet the range of RPO and RTO requirements. With the right planning and the right technologies in place, you’ll be able to meet your requirements. The 3 tips discussed in this article will take you a long way towards designing the right data protection solution to ensure business continuity for your environment.

For more information about InMage’s BC, Data Protection, and Recovery Planning solutions, please call your GreenPages account manager at 800-989-2989.

Visit GreenPages Online: www.greenpages.com
©2009 GreenPages Technology Solutions. All rights reserved.