Leveraging Cloud for Disaster Recovery

March 24th, 2019

As public cloud infrastructures mature and storage costs decrease, more and more enterprises are looking to the cloud to implement their disaster recovery (DR) plans. Virtually every recent survey of IT trends shows that secondary backup in general and DR in particular are highly compelling cloud use cases, and are often the first forays of an organization into the cloud.

In this blog post we discuss the benefits and challenges of cloud disaster recovery and examine the different DR options available in the public cloud.

What Is Disaster Recovery?

Disaster recovery in cloud computing comprises the IT policies, tools, and procedures that ensure critical business infrastructure and systems will function despite disruptive events such as natural disasters, cyber security attacks, or even planned upgrades/maintenance that require shutting down production infrastructures. DR supports business continuity objectives by quickly restoring infrastructure, applications, and data. The key to any DR strategy is data replication.

The Need to Minimize Downtime

In today’s always-connected, always-on global economy, users expect apps and services to be available 24/7/365. With an average cost of ~$8,850 per minute, an average of 95 minutes of downtime per outage, and 31% of organizations having experienced an IT downtime incident over the previous 12 months, the direct costs of downtime are clearly significant. Indirect costs in the form of lost customers, business opportunities, and productivity, as well as damage to reputation, can be just as damaging or even more so. No matter what the cause, when disaster strikes, rapid recovery is essential.
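As a rough worked example, the direct cost of one average outage can be estimated by multiplying the two averages quoted above. This is only a sketch: the function and constant names are illustrative, and the per-minute figure is the approximate average from the text, not a number for any particular organization.

```python
def outage_cost(cost_per_minute: float, minutes: float) -> float:
    """Direct cost of one outage: per-minute cost times outage duration."""
    return cost_per_minute * minutes


# Averages quoted in the text (the per-minute cost is approximate).
AVG_COST_PER_MINUTE = 8_850  # US dollars
AVG_OUTAGE_MINUTES = 95

if __name__ == "__main__":
    cost = outage_cost(AVG_COST_PER_MINUTE, AVG_OUTAGE_MINUTES)
    # 8,850 * 95 = 840,750
    print(f"Estimated direct cost of an average outage: ${cost:,.0f}")
```

On these averages, a single outage costs roughly $840,750 in direct costs alone, before any indirect costs are counted.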

DR Solution Challenges

Disaster recovery, in cloud deployments or on-prem, is a business-critical issue. In many verticals, it is also important to maintain redundant data sets in order to comply with long-term data retention regulations. However, the costs of setting up, testing, and maintaining a DR site in a failover data center are very high, especially given that the organization hopes never to have to use that replicated site. Other DR challenges include:

  • Finding and maintaining the minimal DR data footprint that provides adequate protection while minimizing data replication costs.
  • Continuously synchronizing the data between the production data center and the DR replicated site.
  • Establishing failover and failback processes that are as seamless and as automated as possible.
  • Allocating the human and IT resources to test the DR environment at regular intervals to make sure that it will work if and when needed.

In order to meet these challenges, enterprises of all sizes often seek to implement their DR strategies by leveraging on-demand public cloud compute, network, and storage resources. In this way they can reduce their data center footprints and costs, shift CAPEX to OPEX, enhance data safety, and benefit from limitless scalability.

Why Use Public Cloud for DR?

The traditional approach to DR requires significant investment of time and resources. At minimum, users must consider how they would replicate their primary infrastructure to a secondary site. That secondary site needs to be procured, installed, and maintained. During normal operations, the secondary site will typically be under-utilized or over-provisioned.

The cost of such an investment is beyond the means of many companies. Even for companies with the means, DR is seen as a sunk cost that delivers little return quarter over quarter. However, not having an adequate DR strategy is also something no company can afford.

The public cloud offers a way for companies of all sizes to build DR sites with little upfront cost through a pay-as-you-go model.

Options for Disaster Recovery in the Cloud

Every major public cloud vendor offers multiple options for building a DR site using its cloud. AWS, for example, describes four options, or scenarios, in a white paper published in 2014. Each scenario, which can also be created with the other public cloud vendors, comes in at a different price point and delivers a different Recovery Time Objective (RTO) and a different Recovery Point Objective (RPO).


Companies can choose the option that best meets their RTO and RPO requirements and budget. In general, public cloud enables customers to build solutions with better RTO and RPO at a lower cost than a secondary DR site.
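The selection process can be sketched as a simple filter: discard any option that misses the required RTO or RPO, then take the cheapest one that remains. The RTO, RPO, and cost figures below are hypothetical placeholders chosen only to preserve the relative ordering of the four scenarios described in this post; they are not vendor numbers, and the `DROption` type and `cheapest_option` function are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DROption:
    name: str
    rto_hours: float     # time needed to restore service
    rpo_hours: float     # maximum window of data that may be lost
    monthly_cost: float  # relative cost, arbitrary units


# HYPOTHETICAL figures for illustration only; real values depend on the
# workload, the vendor, and how each scenario is implemented.
OPTIONS = [
    DROption("Backup and Restore", rto_hours=24.0, rpo_hours=24.0, monthly_cost=100.0),
    DROption("Pilot Light", rto_hours=4.0, rpo_hours=1.0, monthly_cost=400.0),
    DROption("Warm Standby", rto_hours=1.0, rpo_hours=0.25, monthly_cost=1200.0),
    DROption("Hot Site", rto_hours=0.1, rpo_hours=0.0, monthly_cost=3000.0),
]


def cheapest_option(
    options: Sequence[DROption], max_rto_hours: float, max_rpo_hours: float
) -> Optional[DROption]:
    """Return the lowest-cost option meeting both objectives, or None."""
    candidates = [
        o for o in options
        if o.rto_hours <= max_rto_hours and o.rpo_hours <= max_rpo_hours
    ]
    return min(candidates, key=lambda o: o.monthly_cost, default=None)


if __name__ == "__main__":
    choice = cheapest_option(OPTIONS, max_rto_hours=6, max_rpo_hours=2)
    print(choice.name if choice else "No option meets the requirements")
```

With these placeholder figures, a requirement of a 6-hour RTO and a 2-hour RPO selects Pilot Light, because Backup and Restore misses both objectives and the remaining options cost more.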

Backup and Restore

Traditionally, companies have used off-site backup tapes as their primary means for restoring data in the event of a disaster. This typically involves retrieving tapes from cold storage and recovering data either once the primary facility has been restored or after the tapes have been sent to a cold secondary site that is turned on only when a disaster occurs.

Companies have started to leverage public cloud storage services such as Amazon S3 and Azure Blob Storage as alternatives to archiving tape to an off-site facility. Not only is this a more cost-effective solution, it also delivers better RTO and RPO, since the data is already in the cloud where it can be used to launch a DR site on demand.

Source: White paper: “Using Amazon Web Services for Disaster Recovery” – 2014

There are various approaches for transferring data from the user’s on-premises infrastructure to the public cloud. These include migration tools specific to a particular cloud vendor, as well as vendor-neutral data management platforms such as Rubrik.

In a disaster, users create cloud resources, restore data to them, and launch new server instances/VMs to run production workloads in the cloud.

Pilot Light

The Pilot Light option is named after the constantly-on gas heater pilot light that is used to quickly light the furnace. With this approach, a minimal copy of the production environment is maintained in the cloud. Core components whose state must be maintained and updated, such as a production database, run continuously in the cloud and are synced regularly with production. Other servers can either be provisioned in the cloud but kept turned off until a disaster, or be maintained as server images from which instances/VMs are launched when needed.

Source: White paper: “Using Amazon Web Services for Disaster Recovery” – 2014

Compared to the Backup and Restore option, the Pilot Light scenario offers a better RTO since the core components are already running in the cloud and servers are already provisioned or ready to be provisioned. It also offers better RPO since core services are regularly updated and synced with production. However, the cost is typically higher.
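Because the RPO in this scenario depends on how recently the core components were synced with production, one check that could be automated is whether the replication lag is still within the RPO. The sketch below assumes the timestamp of the last successful sync is available from whatever replication tooling is in use; the function names and the example values are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(last_sync: datetime, now: datetime) -> timedelta:
    """Time elapsed since the last successful sync to the DR copy."""
    return now - last_sync


def meets_rpo(last_sync: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if a failure at `now` would lose no more data than the RPO allows."""
    return replication_lag(last_sync, now) <= rpo


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    rpo = timedelta(minutes=15)
    last_sync = datetime(2019, 3, 24, 12, 0, tzinfo=timezone.utc)
    now = datetime(2019, 3, 24, 12, 20, tzinfo=timezone.utc)
    # The lag is 20 minutes against a 15-minute RPO, so this prints False.
    print(meets_rpo(last_sync, now, rpo))
```

A check like this could run on a schedule and raise an alert whenever it returns False, so that a sync problem is noticed before a disaster rather than during one.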

Warm Standby

The Warm Standby option requires a scaled-down copy of production to be provisioned and run continuously in the cloud. Stateful core components are also updated and synced regularly with production. A subset of the production servers runs continuously as instances/VMs in the cloud and can be scaled up as needed.

Source: White paper: “Using Amazon Web Services for Disaster Recovery” – 2014

Compared to the previous two options, the Warm Standby scenario offers a better RTO since the core components are already running in the cloud and critical servers are already provisioned and running. In a disaster, production traffic for critical workloads can be redirected to the cloud while additional instances/VMs are launched to take on additional workloads. The Warm Standby option also offers better RPO since core services are being regularly updated and synced with production. The cost is higher than the earlier two options since more resources are provisioned and continuously running.
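The scale-up step during failover comes down to simple arithmetic: the number of instances to launch is the gap between what production needs and what the standby already runs. The sketch below assumes all instances are the same size, which is a simplification, and `instances_to_launch` and its parameters are illustrative names rather than part of any cloud API.

```python
import math


def instances_to_launch(
    production_instances: int,
    standby_instances: int,
    target_fraction: float = 1.0,
) -> int:
    """Additional instances needed for the standby to carry
    `target_fraction` of production capacity (equal-size instances assumed)."""
    if not 0.0 <= target_fraction <= 1.0:
        raise ValueError("target_fraction must be between 0 and 1")
    required = math.ceil(production_instances * target_fraction)
    # Never negative: the standby may already exceed the target.
    return max(0, required - standby_instances)


if __name__ == "__main__":
    # Hypothetical sizes: 20 production servers, 4 already running in the cloud.
    print(instances_to_launch(20, 4))       # full production load
    print(instances_to_launch(20, 4, 0.5))  # critical workloads only
```

With these hypothetical sizes, 16 more instances are needed to carry the full production load, or 6 more to carry half of it.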

Hot Site

As with the Warm Standby option, a copy of the production environment runs continuously in the cloud. In the Hot Site scenario, however, it is a copy of the full production environment rather than a scaled-down one. This allows for immediate failover during a disaster, with the cloud provisioned to run the same amount of workload as production. In addition, if core components are being updated synchronously, then the cloud can be used for production, along with the user’s on-premises infrastructure, in an active-active setup.

Source: White paper: “Using Amazon Web Services for Disaster Recovery” – 2014

This option has the best RTO and RPO since the user is running an exact replica of the on-premises infrastructure in the cloud. As expected, it also has the highest cost, particularly if core components for both the on-premises and cloud environments are being completely synced.


Conclusion

With disaster recovery a business imperative, enterprises are focused on optimizing their DR strategies in order to provide bullet-proof protection at minimal cost. The cloud has come to play an important role in DR, offering services that leverage global data centers and flexible storage tiering for cost-effective yet robust DR replication targets.
