
Cloud Disaster Recovery: Plan, Test and Automate

abemon | 8 min read | Written by practitioners

The plan nobody tests

Most companies have a disaster recovery plan. Few have tested it. Fewer still have tested it in the last 12 months. And many of those that do test discover the plan does not work as expected.

A Zerto report (2024) states that 76% of organizations have experienced at least one significant data loss or disruption event in the past two years. Of those, 40% discovered during the incident that their recovery plan had critical gaps. An untested plan is a hypothesis, not a guarantee. A robust cybersecurity framework includes DR as an integral part of its strategy.

In the cloud, disaster recovery has structural advantages over the on-premises model (elasticity, geographic distribution, native automation), but also traps. The ease of provisioning resources creates a false sense of security. “If something happens, we’ll spin everything up in another region.” That sounds fine until you discover your database is 3 TB, cross-region replication is not configured, and “spinning everything up” takes 14 hours.

RTO and RPO: the two numbers that matter

Before designing any strategy, two parameters must be quantified.

RTO (Recovery Time Objective): how long your service can be down before business impact becomes unacceptable. This is not a technical question. It is a business question. An ecommerce site with a 4-hour RTO loses sales during those 4 hours. An internal HR system with a 24-hour RTO is probably tolerable.

RPO (Recovery Point Objective): how much data you can afford to lose. If your RPO is 1 hour, you need backups or replication that guarantee you never lose more than 1 hour of data. If your RPO is zero, you need synchronous replication. And synchronous replication has cost and latency implications that must be understood before committing.

The common trap is defining aspirational rather than realistic RTO and RPO. An RTO of “15 minutes” sounds professional, but if your recovery process takes 2 hours in practice, your real RTO is 2 hours regardless of what the document says.

The productive conversation with leadership is: “what is the cost per hour of downtime?” If the answer is EUR 5,000/hour, investing EUR 2,000/month in infrastructure that reduces RTO from 8 hours to 1 hour is an obvious decision. If the answer is EUR 200/hour, the justification changes.
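The break-even reasoning above fits in a few lines of arithmetic. A minimal sketch (the function name and the one-incident-per-year default are our assumptions, not figures from the text):

```python
def dr_breakeven(downtime_cost_per_hour: float,
                 rto_current_h: float,
                 rto_improved_h: float,
                 extra_monthly_cost: float,
                 incidents_per_year: float = 1.0) -> float:
    """Expected yearly savings from the RTO improvement, minus the
    yearly cost of the extra DR infrastructure. Positive = worth it."""
    hours_saved = (rto_current_h - rto_improved_h) * incidents_per_year
    return hours_saved * downtime_cost_per_hour - extra_monthly_cost * 12

# The EUR 5,000/hour scenario from the text: one incident per year,
# RTO cut from 8 h to 1 h, EUR 2,000/month of extra infrastructure.
print(dr_breakeven(5000, 8, 1, 2000))   # 7 * 5000 - 24000 = 11000.0
```

Run the same numbers at EUR 200/hour and the result goes negative, which is exactly the point: the same infrastructure spend can be obvious or unjustifiable depending on the downtime cost.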

Three DR levels and their costs

Hot standby

A complete, functional environment in a second region, receiving data in real time or near real time. If the primary region goes down, traffic redirects to the standby. RTO is minutes. RPO can be seconds with synchronous replication, or minutes with asynchronous.

The cost is high: you essentially pay double for compute and storage infrastructure. In AWS, a hot standby with RDS Multi-AZ, ECS/EKS in the second region, and Route53 for DNS failover can add 80% to 100% to the cost of the primary infrastructure.

Justified for: high-volume ecommerce, financial platforms, critical healthcare systems. Any service where the cost of one hour of downtime exceeds the monthly cost of the standby.

Warm standby

A scaled-down environment in the second region with base infrastructure deployed but at minimum scale. Data replicates continuously. In case of disaster, infrastructure is scaled up (more instances, more capacity) before redirecting traffic. RTO is 15-60 minutes depending on how much scaling is needed.

The cost is moderate: 30% to 50% additional. The database and storage represent most of the cost because replication is continuous. Compute is minimal until activation.

Justified for: most enterprise applications, SaaS with 99.9% SLAs, operations management systems. It is the sweet spot between cost and recovery time.

Cold standby (pilot light)

Only data replicates to a second region. Compute infrastructure does not exist until needed. In case of disaster, infrastructure is provisioned from scratch (using Infrastructure as Code), connected to replicated data, and traffic is redirected. RTO is hours.

The cost is low: just the storage for replicated data and cross-region transfer costs. In AWS, cross-region storage of a 500 GB RDS snapshot costs approximately EUR 50/month. Compute is zero until activation.

Justified for: internal systems, development environments, applications with hours-level downtime tolerance. And as a complement to warm standby for historical data that does not need immediate recovery.

Backups: the overlooked foundation

Backups are the most basic and most underestimated form of disaster recovery. Three mistakes we see repeatedly:

Not verifying restoration. A backup that cannot be restored is not a backup. It is a file that occupies space. The minimum test is restoring a backup to a test environment at least once per quarter. Automating this test is ideal: a weekly job that restores the latest backup to an ephemeral environment, runs basic validations, and reports the result.
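One way to sketch that weekly restore-verify job is as a harness where the restore, validation, and teardown steps are injected callables (here stubbed; in a real job they would drive your cloud provider's APIs):

```python
import datetime
from typing import Callable, Iterable

def verify_latest_backup(restore: Callable[[], str],
                         validations: Iterable[Callable[[str], bool]],
                         destroy: Callable[[str], None]) -> dict:
    """Restore the latest backup into an ephemeral environment, run
    basic validations, tear the environment down, and return a report.
    restore() returns an environment identifier; each validation takes
    that identifier and returns True/False."""
    started = datetime.datetime.now(datetime.timezone.utc)
    env = restore()
    try:
        results = {v.__name__: v(env) for v in validations}
    finally:
        destroy(env)   # always clean up the ephemeral environment
    elapsed = datetime.datetime.now(datetime.timezone.utc) - started
    return {
        "started": started.isoformat(),
        "duration_s": elapsed.total_seconds(),
        "passed": all(results.values()),
        "checks": results,
    }

# Stub run; real callables would restore a snapshot and query the result.
def fake_restore() -> str: return "env-test-001"
def row_count_nonzero(env: str) -> bool: return True
def fake_destroy(env: str) -> None: pass

report = verify_latest_backup(fake_restore, [row_count_nonzero], fake_destroy)
print(report["passed"])  # True
```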

Not protecting against accidental deletion. Backups stored in the same account and region as primary infrastructure are vulnerable to accidental (or malicious) deletion. A misguided terraform destroy or ransomware can wipe production and backups simultaneously. Backups should be in a separate account with immutability policies. AWS S3 Object Lock and Azure Immutable Blob Storage provide this guarantee.
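As a sketch, this is the request shape boto3's S3 client accepts for a default Object Lock retention rule (the bucket name is hypothetical, and Object Lock itself must have been enabled when the bucket was created):

```python
# Request shape for boto3's s3.put_object_lock_configuration.
object_lock_request = {
    "Bucket": "backups-dr-account",          # hypothetical bucket in the separate account
    "ObjectLockConfiguration": {
        "ObjectLockEnabled": "Enabled",
        "Rule": {
            "DefaultRetention": {
                "Mode": "COMPLIANCE",        # immutable for the period, even for root
                "Days": 90,                  # keep every backup object 90 days
            }
        },
    },
}

# With real credentials you would apply it like:
#   import boto3
#   boto3.client("s3").put_object_lock_configuration(**object_lock_request)
print(object_lock_request["ObjectLockConfiguration"]["Rule"]["DefaultRetention"]["Mode"])
```

COMPLIANCE mode is the stronger guarantee here: unlike GOVERNANCE mode, nobody in the account can shorten or remove the retention before it expires.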

Not calculating restoration time. Restoring a 500 GB backup from S3 to an RDS instance takes time. We have measured between 45 minutes and 3 hours depending on instance type and backup format. If your RTO is 1 hour and restoration takes 2 hours, your backup strategy does not meet your RTO.

Automated failover

Manual failover works if someone is available, if that person knows what to do, and if they do it without errors under pressure at 3 AM. Three conditions that fail with uncomfortable frequency.

Automated failover removes the human factor from the critical path. The typical implementation uses:

Active health checks. An external component (Route53 health checks, Cloudflare health checks, a dedicated service like Pingdom or UptimeRobot) verifies primary environment availability. If N consecutive checks fail, failover triggers.
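The N-consecutive-failures logic is provider-independent and worth getting right: one success must reset the counter, and the failover action must fire exactly once. A sketch (class name and threshold are illustrative):

```python
class FailoverTrigger:
    """Trips after n_failures consecutive failed health checks; one
    success resets the counter. The failover action is injected."""
    def __init__(self, n_failures: int, on_failover):
        self.n_failures = n_failures
        self.on_failover = on_failover
        self.consecutive = 0
        self.tripped = False

    def record(self, healthy: bool) -> None:
        if healthy:
            self.consecutive = 0
            return
        self.consecutive += 1
        if self.consecutive >= self.n_failures and not self.tripped:
            self.tripped = True      # fire once, not on every later failure
            self.on_failover()

events = []
trigger = FailoverTrigger(3, lambda: events.append("failover"))
for healthy in (True, False, False, True, False, False, False):
    trigger.record(healthy)
print(events)  # ['failover'] -- only after 3 consecutive failures
```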

DNS failover. The DNS record updates to point to the secondary environment. With low TTLs (60 seconds), propagation is fast. AWS Route53 offers native failover routing policies. Cloudflare enables similar logic with load balancing.

Database failover. For RDS, Multi-AZ failover is automatic and takes 60 to 120 seconds. For self-managed databases, promoting a replica to primary requires scripting (pg_promote in PostgreSQL, STOP SLAVE + RESET SLAVE in MySQL) and verification.
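A minimal sketch wrapping the promotion statements mentioned above so a failover script can select them by engine (the strings follow PostgreSQL 12+ and classic MySQL replication syntax; verify against your exact versions):

```python
def promotion_commands(engine: str) -> list[str]:
    """SQL statements to promote a replica to primary, per engine."""
    if engine == "postgres":
        # pg_promote(wait => true) blocks until promotion completes
        return ["SELECT pg_promote(wait => true);"]
    if engine == "mysql":
        # stop applying changes from the old primary, then clear replica state
        return ["STOP SLAVE;", "RESET SLAVE;"]
    raise ValueError(f"unsupported engine: {engine}")

print(promotion_commands("postgres"))
```

The verification step after running these is not optional: confirm the promoted node accepts writes before pointing traffic at it.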

Automated failover without automated failback creates a different problem: once the primary region recovers, data must be resynchronized and traffic reverted. This process (failback) is typically more complex than failover and must be equally documented and tested.

Testing cadence

A DR plan that is not tested degrades with every infrastructure change. Every new database table, every new microservice, every new external dependency can invalidate assumptions in the plan.

Our recommended cadence:

  • Monthly: Automated backup restoration verification. A script that restores, validates, and destroys.
  • Quarterly: Controlled failover to the secondary environment during low-traffic hours. The entire team observes and documents.
  • Semi-annually: Full disaster simulation. The primary environment is “destroyed” (without actually destroying it: traffic is cut) and actual RTO is measured.

Each test produces a report with: measured actual recovery time, issues found, differences from the documented plan, and corrective actions. Corrective actions have dates and owners. Without this, the report is dead paper.
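A sketch of such a report as a data structure, making the owner and date mandatory fields rather than afterthoughts (names and values are illustrative):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class CorrectiveAction:
    description: str
    owner: str          # every action has a named owner...
    due: date           # ...and a date, or the report is dead paper

@dataclass
class DrTestReport:
    test_type: str                  # e.g. "backup-restore", "failover", "full-simulation"
    measured_rto_minutes: float     # actual recovery time, not the documented one
    issues_found: list[str] = field(default_factory=list)
    plan_deviations: list[str] = field(default_factory=list)
    actions: list[CorrectiveAction] = field(default_factory=list)

report = DrTestReport(
    test_type="failover",
    measured_rto_minutes=42.0,
    issues_found=["DNS TTL was 3600s, not 60s"],
    actions=[CorrectiveAction("Lower TTL on app record", "j.doe", date(2025, 7, 1))],
)
print(report.measured_rto_minutes)
```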

DR is not a project. It is a continuous practice. The best time to discover your plan does not work is during a scheduled test, not during an actual disaster.

About the author

abemon engineering

Engineering team

Multidisciplinary engineering, data and AI team headquartered in the Canary Islands. We build, deploy and operate custom software solutions for companies at any scale.