Disaster Recovery and Failover in AWS – Some Notes from the Field
(See also, a newer post on advanced D.R. in the cloud)
D.R. is not the same as failover, though the two terms are often used interchangeably. Failover typically means that a passive node (or nodes) is available that can quickly become the active node if needed, within the same clustered environment. The physical location of the standby node is not that important – it is typically in the same datacenter as the primary node. For example, your web server farm may have 2 or 3 nodes configured, and Oracle RAC has built-in support for multiple failover nodes.
Disaster Recovery differs from this in that the RECOVERY node (or nodes) CANNOT (should not) reside in the same data center – or anywhere close to it. A disaster would affect the entire data center, so it wouldn’t make sense to have your recovery node in the same location as the primary node. Hence, provisioning for DR is different from provisioning for failover.
Replicate all tiers – not just RDS (data tier)
1. Replicate the entire Application – not just RDS. Use Multi-AZ for PROD environments.
- Web and Middle Tier – Your application must be able to function in both the source and target locations. That means replicating the rest of your application to the same target location and pointing it at the read-replica (the failover RDS instance). For the web tier, this is a little more involved than for the data tier. Essentially, your options are:
- a) Full-blown EC2 instance creation using CloudFormation (server templates).
- b) A live S3 bucket that serves as the failover node, with the appropriate Route 53 configuration.
- For option a) – full-blown web tier replication – one would need to set up custom server templates that can spin up a new web server from either an AMI or a VMware template, with the app code copied from a ‘configuration’ server.
- For option b) there is a shortcut, provided your web tier is simple enough to run off an S3 bucket. In that case, all you do is set up a live replica of your website in the S3 bucket, keep monitoring the EC2 instance hosting the active website, and if the health check fails, simply route traffic to the S3 bucket (Route 53 is the easiest way to accomplish this).
- Data Tier – RDS (in AWS) offers a LIVE read-replica of your RDS instance – which is an ‘out of the box’ solution for a D.R. scenario.
- Production – For PROD environments, use Multi-AZ deployment (mirroring) and provisioned IOPS. Both are much harder to change after the fact, should you want to ‘upsize’ your RDS instance.
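The Route 53 side of option b) can be sketched as a pair of failover records: a PRIMARY record pointing at the EC2-hosted site (guarded by a health check) and a SECONDARY alias pointing at the S3 website endpoint. A minimal sketch in Python/boto3 terms – the domain, IP, health-check ID, and the S3 website hosted-zone ID below are all placeholders, not real resources:

```python
# Sketch: build a Route 53 change batch for DNS failover between an
# EC2-hosted site (PRIMARY) and an S3 static-website replica (SECONDARY).
# All identifiers here are illustrative placeholders.

def failover_change_batch(domain, primary_ip, health_check_id,
                          s3_website_zone_id, s3_website_endpoint):
    """Return a ChangeBatch dict for route53.change_resource_record_sets()."""
    return {
        "Comment": "DR failover: EC2 primary, S3 bucket secondary",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": domain,
                    "Type": "A",
                    "SetIdentifier": "primary-ec2",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": primary_ip}],
                    # Route 53 only fails over when this health check fails
                    "HealthCheckId": health_check_id,
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": domain,
                    "Type": "A",
                    "SetIdentifier": "secondary-s3",
                    "Failover": "SECONDARY",
                    # Alias to the S3 website endpoint for the bucket's region
                    "AliasTarget": {
                        "HostedZoneId": s3_website_zone_id,
                        "DNSName": s3_website_endpoint,
                        "EvaluateTargetHealth": False,
                    },
                },
            },
        ],
    }

# Applied against a real hosted zone with something like:
#   import boto3
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="Z_EXAMPLE",
#       ChangeBatch=failover_change_batch(
#           "www.example.com.", "203.0.113.10", "hc-example",
#           "Z_S3_WEBSITE_ZONE", "s3-website-us-east-1.amazonaws.com."))
```

The bucket name must match the domain name for S3 website hosting to answer for it, and the SECONDARY record uses an alias (not an IP) because S3 website endpoints have no fixed address.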
2. Set up replication (failure) alerts
Every service has the potential to fail, and AWS-based replication services (including RDS replication) are no exception. You can configure service alerts (using Amazon SNS) to notify you of the success or failure of your environment replication.
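As a sketch of what that configuration looks like: an SNS topic with an email subscriber, plus an RDS event subscription filtered to failure/failover categories. The topic name, email address, and instance identifier below are illustrative, not prescribed by AWS:

```python
# Sketch: parameters for an RDS event subscription that pushes
# failure/failover notifications to an SNS topic. Names are illustrative.

def rds_alert_subscription(topic_arn, instance_id):
    """Return kwargs for rds.create_event_subscription()."""
    return {
        "SubscriptionName": f"{instance_id}-dr-alerts",
        "SnsTopicArn": topic_arn,
        "SourceType": "db-instance",
        # Limit noise to the categories relevant to replication/DR health
        "EventCategories": ["failure", "failover", "availability"],
        "SourceIds": [instance_id],
        "Enabled": True,
    }

# Usage against real AWS would look roughly like:
#   import boto3
#   sns = boto3.client("sns")
#   topic = sns.create_topic(Name="rds-dr-alerts")
#   sns.subscribe(TopicArn=topic["TopicArn"],
#                 Protocol="email", Endpoint="ops@example.com")
#   boto3.client("rds").create_event_subscription(
#       **rds_alert_subscription(topic["TopicArn"], "prod-db"))
```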
3. Database/Data Tier – Perform daily backups (see the side note below on what you can and cannot do in RDS)
Enabling backups is a good idea because it is simple and effective. Also, to work with read-replicas, you need backups enabled.
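Enabling backups is a one-call change. A sketch (the retention period and backup window below are illustrative values); note that to RDS, “backups enabled” simply means a non-zero retention period, which is also the prerequisite for creating read-replicas:

```python
# Sketch: enable automated daily backups on an RDS instance.
# A BackupRetentionPeriod of 0 disables backups; 1-35 enables them
# (and a non-zero value is required before read-replicas can be created).

def enable_backups(instance_id, retention_days=7):
    """Return kwargs for rds.modify_db_instance()."""
    if not 1 <= retention_days <= 35:
        raise ValueError("RDS backup retention must be between 1 and 35 days")
    return {
        "DBInstanceIdentifier": instance_id,
        "BackupRetentionPeriod": retention_days,
        # Daily window (UTC) during which the automated backup runs
        "PreferredBackupWindow": "03:00-04:00",
        "ApplyImmediately": False,  # defer to the maintenance window
    }
```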
4. Utilize multi-AZ (availability zone) architecture
Make use of multi-AZ architecture on AWS for availability of mission critical applications. In particular, enabling multi-AZ on RDS is the simplest way to replicate the instance within the same region.
If you’re replicating across regions, keeping RDS isolated in a single AZ can introduce downtime as a direct result of the replication mechanism itself (backups, a read-replica, or a combination of the two).
- Recreating a prod environment from a dev or staging environment (create a backup, then restore the database from that backup – read this post).
- Pushing large amounts of data into an RDS instance – using SQL Bulk Copy (bcp) – read this post.
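The two replication mechanisms discussed above boil down to two RDS API calls: setting MultiAZ on modify_db_instance for same-region mirroring, and create_db_instance_read_replica for a (possibly cross-region) replica. A sketch of the parameters involved – the instance identifiers and region are illustrative:

```python
# Sketch: the two RDS replication options discussed above.
# Instance identifiers, ARN, and region are illustrative placeholders.

def enable_multi_az(instance_id):
    """Kwargs for rds.modify_db_instance(): synchronous standby in another AZ."""
    return {
        "DBInstanceIdentifier": instance_id,
        "MultiAZ": True,               # same-region mirroring
        "ApplyImmediately": False,     # defer to the maintenance window
    }

def cross_region_replica(source_arn, replica_id, source_region):
    """Kwargs for rds.create_db_instance_read_replica(), called in the
    TARGET region; cross-region replicas reference the source by ARN."""
    return {
        "DBInstanceIdentifier": replica_id,
        "SourceDBInstanceIdentifier": source_arn,
        "SourceRegion": source_region,
    }
```

Multi-AZ gives you synchronous mirroring and automatic failover within one region; the read-replica is asynchronous and must be manually promoted, which is why it maps to the D.R. (rather than failover) side of the distinction drawn at the top of this post.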
Side Note – Backup and Restore in RDS
Restoring from a backup essentially involves dropping your existing RDS instance and creating a new one. This has implications: the endpoint address associated with your original RDS instance is lost in the process, and you will need to remap anything using that instance to the new endpoint.
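A sketch of that remapping step: restore the snapshot under a new (non-colliding) identifier, then pull the freshly assigned endpoint out of describe_db_instances and point your application configuration at it. The identifiers below are illustrative; the helper parses the documented response shape:

```python
# Sketch: restore creates a NEW instance, so the endpoint changes.
# Identifiers are illustrative placeholders.

def restore_params(snapshot_id, new_instance_id):
    """Kwargs for rds.restore_db_instance_from_db_snapshot()."""
    return {
        "DBInstanceIdentifier": new_instance_id,   # must not collide
        "DBSnapshotIdentifier": snapshot_id,
    }

def endpoint_of(describe_response):
    """Extract host:port from an rds.describe_db_instances() response."""
    ep = describe_response["DBInstances"][0]["Endpoint"]
    return f"{ep['Address']}:{ep['Port']}"

# Example of the remap, using a response shaped like the RDS API's:
stub = {"DBInstances": [{"Endpoint": {
    "Address": "prod-db-restored.abc123.us-east-1.rds.amazonaws.com",
    "Port": 1433}}]}
```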
What can you NOT do with RDS?
- You can’t copy, paste or create files in the underlying disk system. If your on-site DB server has non-SQL related files on disk, they can’t be ported across.
- You can’t run batch files, Windows Command Shell files or PowerShell scripts in the host.
- You can’t directly monitor disk space, CPU usage or memory usage from the host. AWS provides a different way for monitoring.
- You can’t copy backup files into the local disk from another location and restore databases from there.
- You can’t decide which drive your database files go to; AWS has a default location for that.
Summary
Recovering from a DR scenario is not something you want to take a chance with. Fortunately, applications hosted in the cloud (or utilizing cloud infrastructure) are a lot easier to recover than conventional apps, thanks to several built-in features – such as live read-replicas and multi-AZ architecture – that make it possible to experiment with different DR configurations.
Contact Anuj Varma to see if he can help with your cloud DR needs. (See also, a newer post on advanced D.R. in the cloud)