Does your business have a disaster recovery strategy? How do you identify a disaster? What could be the impact of a disaster on your business and your revenue?
If you feel puzzled when answering these questions, then this blog post is for you. I’m going to cover the basics of disaster recovery (DR) following AWS best practices and how you can set up a disaster recovery strategy starting from zero.
What’s a disaster?
A disaster is an event that prevents a workload or system from fulfilling its business objective and has a serious negative impact on your business. Disasters are usually caused by nature (e.g. floods, storms), by a technical failure (e.g. power or network connectivity issues), or by human action (e.g. a misconfiguration or unauthorized modification).
Why plan for a disaster?
Stop for a second and think about this horror scenario: you lost all the data in the production database. What do you do? Could this be the end of your business? OK, you are smart and you have backups but are you sure they work? Have you ever tested them? How long is it going to take to recover all the data in production? How much data have you lost since the disaster hit you? How much revenue have you lost? What impact does this have on your customers and on your reputation?
If you’ve never thought about these questions, you should. Setting up a disaster recovery strategy is a methodical way to answer these questions and to be prepared for a disaster before it occurs. If you’re an owner, founder, CTO, or senior IT engineer, you must think about the kind of events that could impair your business and how you might recover from them.
Disaster Recovery Plan
As part of your overall approach to risk management and business continuity, you should assess the impact of a low-likelihood, high-severity incident on your business. We work with a lot of SaaS businesses where obligations to their customers around data and availability are a key concern. Assuming you’re in a similar situation, you’ll need to make a DR plan based on formally performing:
- a business impact analysis to quantify the business impact of a disruption on your systems or workloads. This analysis determines how much downtime and data loss you can reasonably tolerate without having a major impact on your business operations.
- a risk assessment determining the likelihood of a disaster and the mitigation strategies you can put in place.
A business impact analysis and a risk assessment allow you to make an informed decision on the most suitable disaster recovery strategy for your business and to quantify the cost of a disaster recovery solution.
RTO & RPO
When performing a business impact analysis, the most important metrics to determine are the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO):
- Recovery Time Objective (RTO) is the maximum acceptable delay between the interruption of service and restoration of service, i.e. how much downtime can you afford?
- Recovery Point Objective (RPO) is the maximum acceptable amount of time since the last data recovery point, i.e. how much data can you afford to lose?
The image below shows when RTO and RPO come into place during a disaster and the levels of downtime and data loss that can occur during an incident.
You may need to have discussions with various stakeholders in your organization in order to set your RTO and RPO. As you can imagine, these metrics have an impact on the whole business and they rely on both business and technical decisions.
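The two definitions above can be turned into simple arithmetic on an incident timeline. The following is a minimal sketch, not a prescribed method: the function names, the timestamps, and the four-hour targets are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def achieved_objectives(last_recovery_point, disaster_at, service_restored_at):
    """Return (data_loss_window, downtime) for a single incident."""
    # Everything written after the last recovery point is lost.
    data_loss_window = disaster_at - last_recovery_point
    # Downtime runs from the interruption until service is restored.
    downtime = service_restored_at - disaster_at
    return data_loss_window, downtime


def meets_targets(data_loss_window, downtime, rpo, rto):
    """True only if both the RPO and the RTO were respected."""
    return data_loss_window <= rpo and downtime <= rto


# Hypothetical timeline: nightly backup at 02:00, disaster at 09:30,
# service restored at 12:00 on the same day.
last_backup = datetime(2024, 1, 10, 2, 0)
disaster = datetime(2024, 1, 10, 9, 30)
restored = datetime(2024, 1, 10, 12, 0)

loss, downtime = achieved_objectives(last_backup, disaster, restored)
ok = meets_targets(loss, downtime, rpo=timedelta(hours=4), rto=timedelta(hours=4))
print(f"data loss window: {loss}, downtime: {downtime}, within targets: {ok}")
```

In this made-up timeline the service comes back within the RTO, but seven and a half hours of data are lost, so a nightly backup alone would not satisfy a four-hour RPO.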
Disaster Recovery Strategies
Using AWS’ terminology for different cloud recovery strategies, you can choose from:
- Backup & Restore
- Pilot Light
- Warm Standby
- Multi-site Active/Active
The image below summarizes these four options with their associated RTO and RPO, ranging from the low cost and low complexity of the Backup & Restore option to the more effective, complex and costly option of Multi-site Active/Active:
In the next sections I’m going to focus on each disaster recovery strategy and provide examples of how each solution is achieved on AWS.
Backup & Restore
Backup & Restore is the simplest and cheapest DR strategy on AWS. It’s suitable for lower-priority use cases and can be a good starting point if you have no disaster recovery strategy yet. It involves backing up data as well as infrastructure, configuration, and application code at regular intervals according to your RPO, then restoring data and redeploying resources after a disaster occurs. This strategy has RTO and RPO measured in hours, so it’s only suitable for less critical workloads and systems.
You can leverage AWS features such as RDS snapshots (including point-in-time recovery), EBS snapshots, AMIs, S3 cross-region replication for continuous replication to another region, and S3 object versioning to mitigate erroneous deletion or modification of objects, as well as Infrastructure-as-Code to automatically deploy resources.
The diagram below shows an example of Backup & Restore architecture with cross-region backup.
You can go a step further in terms of automation and security by using AWS Backup, which provides a centralized location to configure, schedule, and monitor backups for several AWS services across multiple AWS accounts and regions.
Don’t forget to test your backups at regular intervals: backups are only useful if you’ve verified that they actually work.
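As a rough illustration of what an AWS Backup plan with a cross-region copy can look like, the sketch below builds the plan document in plain Python. The field names follow the `BackupPlan` structure accepted by the AWS Backup `CreateBackupPlan` API; the plan name, vault names, account ID, schedule, and retention period are placeholder assumptions, and nothing is sent to AWS.

```python
import json


def build_backup_plan(plan_name, vault_name, dr_vault_arn,
                      hour_utc=5, retention_days=35):
    """Build a daily backup plan that also copies each recovery point
    to a vault in another region (or account)."""
    return {
        "BackupPlanName": plan_name,
        "Rules": [
            {
                "RuleName": "daily-with-cross-region-copy",
                "TargetBackupVaultName": vault_name,
                # Daily at the given hour (UTC); tighten this to match your RPO.
                "ScheduleExpression": f"cron(0 {hour_utc} ? * * *)",
                "Lifecycle": {"DeleteAfterDays": retention_days},
                "CopyActions": [
                    {
                        "DestinationBackupVaultArn": dr_vault_arn,
                        "Lifecycle": {"DeleteAfterDays": retention_days},
                    }
                ],
            }
        ],
    }


# All identifiers below are hypothetical placeholders.
plan = build_backup_plan(
    plan_name="production-daily",
    vault_name="production-vault",
    dr_vault_arn="arn:aws:backup:eu-west-1:111122223333:backup-vault:dr-vault",
)
print(json.dumps(plan, indent=2))
```

With boto3 installed and credentials configured, a document like this would be passed as the `BackupPlan` argument of the Backup client's `create_backup_plan` call; a backup selection is still needed to say which resources the plan covers.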
Pilot Light
With the Pilot Light strategy, a scaled-down core infrastructure is always on and ready to be scaled up to match a real production environment. This approach requires data replication to be enabled for your database and S3 buckets. Application servers are switched off to reduce cost but are ready to be scaled up to match the production configuration. The Pilot Light strategy minimizes the cost of a disaster recovery solution and usually has RTO and RPO in the tens of minutes (less than an hour). It’s recommended to use a separate AWS account for the disaster recovery solution to increase security isolation (e.g. in case security credentials are compromised in the production account).
RPO is kept low thanks to continuous, asynchronous data replication. Failing over the database requires promoting an RDS read replica to become the primary instance, which may cause downtime while the database fails over and reboots (this applies to all flavours of RDS except Aurora, which has mechanisms to avoid downtime). You must also scale up the application servers and containers, so an Infrastructure-as-Code tool like Terraform that automates scaling up the environment can help here. If you already have Terraform code to provision infrastructure, bear in mind that you may need to adapt it to run in a disaster recovery account.
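To make the order of operations concrete, here is a minimal sketch of the two failover steps described above. It calls the AWS API directly rather than going through Terraform, purely to keep the example short. It assumes a non-Aurora RDS read replica and an EC2 Auto Scaling group; the function name and all identifiers are hypothetical. The clients are passed in as arguments so the function can be exercised with stand-ins before it is pointed at real boto3 clients.

```python
def activate_pilot_light(rds, autoscaling, replica_id, asg_name, desired_capacity):
    """Promote the replica database, then scale up the application tier."""
    # 1. Promote the read replica to a standalone primary instance.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # 2. Wait until the instance reports as available. A real runbook
    #    should also verify that the promotion actually completed, because
    #    the status may not change immediately after the call above.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=replica_id)

    # 3. Switch the application servers on by raising the group size
    #    from zero to the desired capacity.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=desired_capacity,
        MaxSize=desired_capacity,
        DesiredCapacity=desired_capacity,
    )


# Example wiring (needs boto3 and credentials for the DR account):
#   import boto3
#   activate_pilot_light(
#       boto3.client("rds", region_name="eu-west-1"),
#       boto3.client("autoscaling", region_name="eu-west-1"),
#       replica_id="app-db-replica",   # hypothetical identifier
#       asg_name="app-servers-dr",     # hypothetical identifier
#       desired_capacity=4,
#   )
```

Promoting the database before scaling the application tier means the servers come up against a writable database; whether that ordering suits your workload is a design decision to validate in a DR simulation.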
In a Pilot Light strategy you normally rely on Amazon Route 53 to switch traffic from your production account to the disaster recovery account, which often runs in a different region. Amazon Route 53 health checks can help you automate monitoring the health of a workload. You can use those health checks both to trigger failover (you may need to implement a manual trigger for the initial scale-out) and to alert you to follow additional steps, such as verifying a successful failover.
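One way to express this in Route 53 is a pair of failover records: a primary record tied to a health check and a secondary record pointing at the DR environment. The sketch below only builds the change batch in plain Python, using the record-set fields of the Route 53 `ChangeResourceRecordSets` API. The domain name, IP addresses, and health check ID are placeholders, and real setups often use alias records to load balancers instead of plain A records.

```python
import json


def failover_change_batch(record_name, primary_ip, secondary_ip,
                          primary_health_check_id, ttl=60):
    """Build a change batch with PRIMARY and SECONDARY failover records."""

    def record(role, ip, health_check_id=None):
        record_set = {
            "Name": record_name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-endpoint",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,        # a short TTL lets clients pick up the switch quickly
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id:
            # Route 53 answers with the secondary record while this check fails.
            record_set["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": record_set}

    return {
        "Comment": "Failover routing between production and DR",
        "Changes": [
            record("PRIMARY", primary_ip, primary_health_check_id),
            record("SECONDARY", secondary_ip),
        ],
    }


# Placeholder values for illustration only.
batch = failover_change_batch(
    record_name="app.example.com.",
    primary_ip="203.0.113.10",
    secondary_ip="198.51.100.20",
    primary_health_check_id="hypothetical-health-check-id",
)
print(json.dumps(batch, indent=2))
```

With boto3, a batch like this would be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, together with the hosted zone ID.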
Depending on the implementation approach, recovering back to the primary production account might need manual work.
The image below shows an example of Pilot Light for a three-tier architecture.
Warm Standby
The Warm Standby strategy extends the concepts of the Pilot Light strategy and decreases RTO and RPO by keeping a fully functional but scaled-down copy of your production environment always on, ready to be scaled up to match production capacity. The diagram below illustrates an example of Warm Standby architecture.
Pilot Light and Warm Standby can look similar but there are some key differences between them:
- A Pilot Light solution cannot handle production traffic without initial action (e.g. turning on application servers) whereas a Warm Standby solution can immediately handle traffic (although at a reduced capacity when compared to production).
- A Pilot Light solution requires you to turn on or deploy certain infrastructure components (e.g. application servers/containers) and then scale up other components, whereas a Warm Standby solution only requires you to scale up (i.e. all the infrastructure is already deployed and running at a reduced capacity).
- A Pilot Light solution usually has RTO/RPO measured in terms of tens of minutes / up to an hour whereas the Warm Standby usually has RTO/RPO of minutes.
- In terms of total cost a Pilot Light solution is less expensive than a Warm Standby solution.
Multi-site Active/Active
Multi-site Active/Active is the most reliable DR solution and the only strategy that can achieve close to zero downtime and data loss. However, it’s also the most complex and costly strategy, so it’s mainly suitable for mission-critical services where you can’t tolerate downtime or data loss. It involves running a full copy of your workload in a second region that serves traffic alongside production, with data stores that are continuously kept in sync. If the production region fails, Route 53 or Global Accelerator automatically route traffic towards the DR region. Route 53 health checks can be used as a failover mechanism and be triggered manually or automatically (if you need to create automatic failover mechanisms with Route 53, check out this AWS blog post).
The diagram below illustrates an example of Multi-site Active/Active architecture:
Bear in mind that there is still a risk of data loss or downtime in case of data corruption or data deletion on your production systems. As data is continuously synchronized between the production and the DR data stores, you may need to restore from a data backup - which may involve downtime or data loss - to recover from corrupted data.
The Multi-site Active/Active strategy is also relevant if the workload must provide low latency to customers in different locations while maintaining consistency guarantees. If your application already solves that hard part, the case for doing the remaining work is stronger.
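The trade-off between the four strategies can be sketched as a simple lookup from your targets to the cheapest strategy that could plausibly meet them. The thresholds below are assumptions loosely based on the ranges quoted in this post (hours, tens of minutes, minutes, near zero); they are not AWS guarantees, and the function name is hypothetical. A DR simulation is still the only way to confirm what a given setup actually achieves.

```python
from datetime import timedelta

# Ordered from cheapest to most expensive. Each entry gives the tightest
# target (assumed, for illustration) that the strategy can typically meet.
STRATEGIES = [
    ("Backup & Restore", timedelta(hours=1)),
    ("Pilot Light", timedelta(minutes=10)),
    ("Warm Standby", timedelta(minutes=1)),
    ("Multi-site Active/Active", timedelta(0)),
]


def cheapest_strategy(rto, rpo):
    """Return the cheapest strategy whose typical range covers both targets."""
    if rto < timedelta(0) or rpo < timedelta(0):
        raise ValueError("RTO and RPO must not be negative")
    tightest = min(rto, rpo)  # the stricter target drives the choice
    for name, floor in STRATEGIES:
        if tightest >= floor:
            return name
    return STRATEGIES[-1][0]


print(cheapest_strategy(rto=timedelta(hours=4), rpo=timedelta(hours=24)))
print(cheapest_strategy(rto=timedelta(minutes=30), rpo=timedelta(minutes=15)))
print(cheapest_strategy(rto=timedelta(minutes=5), rpo=timedelta(minutes=5)))
print(cheapest_strategy(rto=timedelta(seconds=10), rpo=timedelta(seconds=5)))
```

Using the stricter of the two targets is a simplification: in practice RTO and RPO can be addressed separately, for example by combining frequent data replication with a slower infrastructure recovery.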
Common questions for cloud DR
We’ve helped several SaaS businesses to implement DR strategies and we’ve seen some recurring questions I want to share here.
We have backups, we are fine, right?
Not necessarily. As we’ve seen, Backup & Restore is the simplest DR strategy and has RTO and RPO measured in hours. Can your business tolerate downtime and data loss lasting several hours? Even if your backups work (and you should test them regularly), what if a malicious attacker or rogue employee accesses your production AWS account and deletes both the RDS database and its snapshots? Are the snapshots in the same AWS account as the database because that was the easiest way to create a backup? If so, an incident like this could mean the end of your business, so it’s worth storing backups in a separate, more segregated AWS account (AWS Backup can help here).
As you can see, implementing a DR strategy is much more than doing backups and requires thought from both business and technical perspectives.
We don’t have any DR strategy. Where to start?
Start by gathering business and technical stakeholders and defining your RTO and RPO. This will tell you which DR strategy is the most suitable to implement. We can help you organize a workshop covering these areas and produce a plan of action to implement a DR plan. If you don’t even have backups or are unsure what you’re backing up, start by implementing the Backup & Restore strategy and build up to the next DR strategy according to your RTO and RPO.
What’s the best DR strategy for my business?
Consider the RTO and RPO you need to achieve to guarantee business continuity and compare the cost of the DR solutions that can meet them. If you want to go beyond the basic Backup & Restore strategy and reduce your RTO and RPO, start with the Pilot Light strategy. Then perform a disaster recovery simulation and assess the achieved RTO and RPO as well as the associated cost. You can then reassess your business requirements with real figures. We can help you perform a disaster recovery simulation and further assist you in assessing the best solution for your use case.
I hope this blog post helped you understand DR strategies on AWS and provided you with some guidelines to implement the most suitable DR strategy for your business. Whether you’re starting from zero or already have backups or a DR plan, regularly assess your RTO and RPO and test your DR strategy with simulations and game days. Being prepared for a disaster is the best way to minimize its consequences.
Do you have a disaster recovery plan, and do you trust it? If not, book a health check with one of our AWS experts.
This blog is written exclusively by The Scale Factory team. We do not accept external contributions.