Disaster Recovery Planning: Strategies, Tiers

In my career, I’ve lived through three genuine disasters, not “the server went down” incidents, but actual disasters where we lost entire facilities. A flooded data center in Houston. A fire that took out a colocation facility in New Jersey. And a cloud region outage that lasted eleven hours and affected half the internet.

Each of these events taught me something different about disaster recovery. The flood taught me that untested DR plans are fiction. The fire taught me that people panic and documentation matters more than you think. The cloud outage taught me that “the cloud” doesn’t eliminate the need for DR. It changes the shape of it.

If your disaster recovery plan is a dusty PDF that nobody has read since it was written three years ago, you don’t have a DR plan. You have a liability. Let me walk you through how to build one that actually works when the building is on fire, metaphorically or literally.

What Disaster Recovery Actually Covers

Let’s get clear on scope. Disaster recovery (DR) is specifically about restoring IT systems and data after a significant disruptive event. It’s a subset of business continuity planning (BCP), which covers the broader question of how the entire business operates during and after a disaster.

DR focuses on:

Restoring application services to operational state
Recovering data to an acceptable point (your RPO)
Meeting time-to-recovery targets (your RTO)
Ensuring the recovered environment is fully functional

If you haven’t defined your RTO and RPO yet, stop here and read my guide on RTO vs RPO. Those objectives are the foundation of everything that follows. Without them, you’re building a plan without requirements.

DR Strategy Tiers

The industry has standardized on a tiering model that maps recovery capabilities to cost and complexity. I use a simplified version of the SHARE tier model that I’ve found more practical than the original seven-tier version.

Tier 0: No DR (Don’t Be Here)

Backups exist on the same infrastructure as the primary system. If the infrastructure goes down, the backups go with it. I still encounter this more often than I’d like, usually in smaller organizations or for systems that “aren’t that important,” until they are.

Tier 1: Cold Standby

You have offsite backups and documented recovery procedures, but no standing infrastructure at the recovery site. When disaster strikes, you provision new infrastructure, restore from backups, and bring systems online.

Recovery time: Days to weeks
Cost: Low (just backup storage and documentation maintenance)
Best for: Non-critical systems, development environments, archival systems

The biggest risk with cold standby: your recovery procedures are wrong. They were written months or years ago. The infrastructure has changed. The team that wrote them has turned over. I’ve seen cold DR recovery take three times longer than planned because of procedure rot.

Tier 2: Warm Standby

Infrastructure exists at the recovery site but is not actively serving traffic. Data is replicated asynchronously, so the standby always trails the primary by some lag. When disaster strikes, you activate the standby infrastructure, verify data consistency, and switch traffic.

Recovery time: Hours
Cost: Moderate (standby infrastructure runs at reduced capacity)
Best for: Business-critical internal systems, secondary customer-facing systems

Warm standby is the sweet spot for many organizations. The infrastructure is there, the data is reasonably current, and the recovery process is well-defined. The key is automation. The more of the failover process you automate, the closer you get to your target RTO.

Tier 3: Hot Standby

The recovery site runs a complete copy of the production environment. Data is replicated in near-real-time. Failover is automatic or semi-automatic.

Recovery time: Minutes
Cost: High (essentially doubling your infrastructure)
Best for: Revenue-generating customer-facing systems, regulated industries

This is where most high-availability architectures operate. The secondary environment is always running, always synchronized, and ready to take over at a moment’s notice.

Tier 4: Active-Active

Both sites actively serve traffic simultaneously. There is no primary and secondary; both are primary. If one goes down, the other absorbs all traffic automatically.

Recovery time: Near zero (seconds)
Cost: Very high (full infrastructure at every site, plus complexity overhead)
Best for: Systems where any downtime is unacceptable

Active-active is the most resilient and the most complex. It requires careful handling of data consistency, request routing, and conflict resolution. It’s not always the right answer, even for critical systems, because the complexity introduces its own risks.
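
To make the trade-off concrete, here is a minimal Python sketch that picks the cheapest tier whose recovery time fits a target RTO. The numeric ceilings are illustrative assumptions chosen to roughly match the ranges above (weeks, hours, minutes, seconds); they are not part of the SHARE model, and the names `Tier`, `TIERS`, and `cheapest_tier_for` are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Tier:
    number: int
    name: str
    worst_case_recovery: timedelta  # assumed ceiling, not a standard


# Ordered cheapest first. The ceilings are illustrative assumptions that
# roughly match "days to weeks", "hours", "minutes", and "seconds".
TIERS = [
    Tier(1, "Cold Standby", timedelta(weeks=2)),
    Tier(2, "Warm Standby", timedelta(hours=8)),
    Tier(3, "Hot Standby", timedelta(minutes=15)),
    Tier(4, "Active-Active", timedelta(seconds=30)),
]


def cheapest_tier_for(rto: timedelta) -> Tier:
    """Return the lowest-cost tier whose assumed ceiling fits within the RTO."""
    for tier in TIERS:
        if tier.worst_case_recovery <= rto:
            return tier
    # Nothing in this model recovers faster than active-active.
    return TIERS[-1]


print(cheapest_tier_for(timedelta(hours=12)).name)
print(cheapest_tier_for(timedelta(minutes=20)).name)
```

Walking the list cheapest-first reflects the point above: you only pay for a higher tier when the RTO actually demands it.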

Figure: DR tier comparison showing recovery time, cost, and complexity for each tier

Building Your DR Plan: A Step-by-Step Guide

Step 1: Inventory and Classify

Document every system, every database, every integration, every external dependency. Then classify each one using your tiered approach. This inventory becomes the backbone of your DR plan.

For each system, document:

What it does and who depends on it
Where it runs (cloud provider, region, account)
What data it stores and how that data is backed up
What other systems it depends on
What systems depend on it
Its assigned DR tier (which determines RTO/RPO targets)

This inventory exercise alone is worth doing even if you never finish the DR plan, because it forces you to understand your system landscape in a way that day-to-day operations never does.
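
One way to keep the inventory honest is to store it as structured data rather than a wiki page. The following is a minimal sketch, assuming Python 3.9+. The field names, the example systems (`checkout`, `orders-db`, `email-relay`), and the mismatch rule are all illustrative assumptions, not a prescribed schema. The check flags any system that depends on something assigned a weaker tier, since a hot-standby service will likely miss its RTO if it has to wait on a cold-standby dependency.

```python
from dataclasses import dataclass, field


@dataclass
class SystemRecord:
    name: str
    purpose: str
    location: str      # e.g. provider/region/account
    data_backup: str   # how the stored data is backed up
    dr_tier: int       # 1 (cold standby) through 4 (active-active)
    depends_on: list[str] = field(default_factory=list)


def dependents_of(name: str, inventory: dict[str, SystemRecord]) -> list[str]:
    """Answer "what systems depend on it?" from the depends_on fields."""
    return sorted(s.name for s in inventory.values() if name in s.depends_on)


def tier_mismatches(inventory: dict[str, SystemRecord]) -> list[tuple[str, str]]:
    """Return (system, dependency) pairs where the dependency has a weaker tier."""
    problems = []
    for system in inventory.values():
        for dep_name in system.depends_on:
            if inventory[dep_name].dr_tier < system.dr_tier:
                problems.append((system.name, dep_name))
    return problems


# Hypothetical example inventory.
records = [
    SystemRecord("checkout", "Takes customer payments", "aws/us-east-1/prod",
                 "n/a (stateless)", dr_tier=3,
                 depends_on=["orders-db", "email-relay"]),
    SystemRecord("orders-db", "Order history", "aws/us-east-1/prod",
                 "cross-region replica", dr_tier=3),
    SystemRecord("email-relay", "Sends receipts", "aws/us-east-1/shared",
                 "nightly offsite backup", dr_tier=1),
]
inventory = {r.name: r for r in records}

print(tier_mismatches(inventory))
print(dependents_of("orders-db", inventory))
```

A mismatch is not automatically a bug. In this example you might decide checkout can run without receipts for a day, but the decision should be explicit.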

Step 2: Design Recovery Architecture

For each DR tier, design the specific architecture that will deliver the required RTO and RPO.

For cold standby systems: Document the infrastructure provisioning process. Use infrastructure-as-code (Terraform, CloudFormation, Pulumi) so that the recovery site can be provisioned automatically. Store these templates in a separate location from your primary infrastructure. If your primary cloud account is compromised, you need access to these templates from elsewhere.

For warm standby systems: Provision the standby infrastructure. Set up data replication. Database replication is typically the most complex piece, so make sure it’s configured, monitored, and tested. Write automated failover scripts that handle DNS updates, connection string changes, and service startup.

For hot standby and active-active: This is a full architecture project. You need load balancing across sites, data replication with conflict resolution, health checking and automatic failover, and a plan for split-brain scenarios.

Step 3: Document Recovery Procedures

Write runbooks for every recovery scenario. These runbooks should be usable by someone who has never done this before, because the person doing the recovery might be a junior engineer at 3 AM on a Saturday.

Every runbook should include:

Triggering conditions: When should this runbook be executed?
Prerequisites: What access, tools, and information do you need?
Step-by-step procedures: Explicit, numbered, copy-pasteable commands
Verification steps: How do you confirm each step worked?
Rollback procedures: What if a step fails?
Escalation contacts: Who to call if you’re stuck

I format my runbooks as checklists because under stress, people skip steps. A checkbox forces you to confirm each step before moving on.
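
The same checklist discipline carries over when you automate a runbook. Here is a minimal sketch, assuming Python 3.9+, in which every step carries its own verification and rollback, and execution stops at the first step that fails to verify. The `Step` and `run_checklist` names and the in-memory `state` dictionary are hypothetical stand-ins; real actions would call your own tooling.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    description: str
    action: Callable[[], None]
    verify: Callable[[], bool]    # how do you confirm the step worked?
    rollback: Callable[[], None]  # what to do if it did not


def run_checklist(steps: list[Step]) -> list[str]:
    """Run steps in order; tick each verified step, stop on the first failure."""
    log = []
    for number, step in enumerate(steps, start=1):
        step.action()
        if not step.verify():
            step.rollback()
            log.append(f"[ ] {number}. {step.description} "
                       "(verification failed, rolled back)")
            log.append("STOP: escalate using the runbook's contact list")
            break
        log.append(f"[x] {number}. {step.description}")
    return log


# Stand-in environment so the sketch runs without real infrastructure.
state = {"db": "replica", "dns": "primary"}

steps = [
    Step("Promote standby database",
         action=lambda: state.update(db="primary"),
         verify=lambda: state["db"] == "primary",
         rollback=lambda: state.update(db="replica")),
    Step("Point DNS at recovery site",
         action=lambda: None,  # simulates a change that silently did nothing
         verify=lambda: state["dns"] == "recovery",
         rollback=lambda: state.update(dns="primary")),
    Step("Start application services",
         action=lambda: state.update(app="running"),
         verify=lambda: state.get("app") == "running",
         rollback=lambda: state.pop("app", None)),
]

log = run_checklist(steps)
print("\n".join(log))
```

Rolling back only the failed step is a simplifying assumption. Some scenarios call for unwinding every completed step, and the runbook should say which.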

Step 4: Implement Monitoring and Alerting

Your DR infrastructure needs its own monitoring. Specifically:

Replication lag monitoring: If your RPO is 15 minutes, alert when replication lag exceeds 5 minutes
Backup verification: Don’t just monitor that backups ran. Monitor that they completed successfully and can be restored
Standby health checks: Verify that standby infrastructure is actually healthy and ready to receive traffic
DR infrastructure access: Can your team actually reach the DR environment? VPN access, credentials, SSH keys? Verify these regularly
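
The replication lag rule is easy to express as code. This is a minimal sketch: the warning threshold of one third of the RPO reproduces the 15-minute/5-minute example above, while the extra "critical" level once lag reaches the RPO itself is an assumption added here. The function name `lag_status` is hypothetical, and how you measure lag depends on your database.

```python
from datetime import timedelta


def lag_status(lag: timedelta, rpo: timedelta, warn_divisor: int = 3) -> str:
    """Classify replication lag against the RPO.

    "warning" fires when lag exceeds rpo / warn_divisor (5 minutes for a
    15-minute RPO). "critical" means the RPO is already blown (assumed level).
    """
    if lag >= rpo:
        return "critical"
    if lag > rpo / warn_divisor:
        return "warning"
    return "ok"


rpo = timedelta(minutes=15)
for minutes in (4, 6, 20):
    print(minutes, lag_status(timedelta(minutes=minutes), rpo))
```

Alerting well before the RPO is reached gives you time to fix replication while a failover would still meet the objective.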

Step 5: Test, Test, Test

I’m giving testing its own step because it’s the most important and most neglected part of DR planning. I covered testing approaches in my RTO vs RPO guide, but let me add some DR-specific testing guidance here.

Backup restore tests (monthly): Pick a random system, restore its latest backup to a clean environment, and verify the data. Time the restore. Compare to your RTO. I’ve found corrupted backups, incomplete backups, and backups that restored successfully but were missing critical data. Better to find these problems on a Tuesday afternoon than during a real disaster.
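
A restore test like this can be wrapped in a small harness so the timing and the RTO comparison are recorded the same way every month. This is a minimal sketch, assuming Python 3.9+. The `restore` and `verify` hooks are hypothetical placeholders for your own backup tooling and data checks, and the `billing-db` system with a one-hour RTO is made up for the demo.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RestoreResult:
    system: str
    seconds: float
    data_ok: bool
    within_rto: bool


def run_restore_test(
    rto_seconds: dict[str, float],
    restore: Callable[[str], None],
    verify: Callable[[str], bool],
    rng: Optional[random.Random] = None,
    clock: Callable[[], float] = time.monotonic,
) -> RestoreResult:
    """Restore one randomly chosen system and compare the duration to its RTO."""
    rng = rng or random.Random()
    system = rng.choice(sorted(rto_seconds))
    started = clock()
    restore(system)  # hypothetical hook: restore latest backup to a clean env
    elapsed = clock() - started
    return RestoreResult(
        system=system,
        seconds=elapsed,
        data_ok=verify(system),  # hypothetical hook: checksums, row counts, etc.
        within_rto=elapsed <= rto_seconds[system],
    )


# Demo with stand-in hooks and a fake clock that reports a 90-minute restore.
fake_clock = iter([0.0, 5400.0]).__next__
result = run_restore_test(
    {"billing-db": 3600.0},
    restore=lambda name: None,
    verify=lambda name: True,
    clock=fake_clock,
)
print(result)
```

In the demo the data verifies but the restore takes longer than the RTO, which is exactly the kind of result you want to discover on a Tuesday afternoon.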

Component failover tests (quarterly): Fail over individual components (a database, a cache cluster, an application service) and verify that the system continues operating. This tests your redundancy at the component level.

Site failover tests (semi-annually): Fail over an entire service or application to the DR site. Run real traffic against it. Verify performance, functionality, and data integrity. Then fail back.

Full DR simulation (annually): Simulate loss of the primary site. Recover everything. Involve leadership, communication teams, and customer support. Time everything. Document every problem encountered. This is the gold standard of DR testing. These full simulations share a lot of DNA with chaos engineering game days, where you deliberately inject failures to validate your assumptions about system resilience. I wrote a dedicated guide on chaos engineering and resilience testing that covers how to plan and run these exercises safely.
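
A schedule only helps if something tells you when you’ve fallen behind it. Here is a minimal sketch, assuming Python 3.9+, that lists which of the four test types are overdue. The day counts are rough assumptions standing in for "monthly", "quarterly", "semi-annually", and "annually", and the dates in the demo are invented.

```python
from datetime import date, timedelta

# Maximum allowed gap between runs (approximate day counts, assumed).
CADENCE = {
    "backup restore test": timedelta(days=31),
    "component failover test": timedelta(days=92),
    "site failover test": timedelta(days=183),
    "full DR simulation": timedelta(days=366),
}


def overdue_tests(last_run: dict[str, date], today: date) -> list[str]:
    """Return test types never run, or last run longer ago than their cadence."""
    overdue = []
    for test, max_gap in CADENCE.items():
        last = last_run.get(test)
        if last is None or today - last > max_gap:
            overdue.append(test)
    return overdue


last_run = {
    "backup restore test": date(2024, 6, 10),
    "component failover test": date(2024, 2, 1),
    "site failover test": date(2024, 3, 1),
    # no full DR simulation on record
}
print(overdue_tests(last_run, today=date(2024, 7, 1)))
```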

Figure: DR testing schedule and types, from monthly to annual

Cloud-Specific DR Considerations

The cloud hasn’t eliminated the need for DR. It’s changed the tooling and the failure modes.

Multi-Region Is Your DR Strategy

In cloud environments, your “DR site” is another region. AWS us-east-1 goes down? Fail over to us-west-2. This is conceptually the same as traditional DR but with cloud-native tooling.

Key cloud DR services:

AWS: S3 cross-region replication, RDS cross-region read replicas, Route 53 health checks and failover routing, Aurora Global Database
GCP: Cloud SQL cross-region replicas, Cloud Storage dual-region/multi-region, Global Load Balancer with failover
Azure: Azure Site Recovery, Cosmos DB multi-region, Traffic Manager with priority routing

Don’t Forget Cloud Account Security

A common DR blind spot: what if you lose access to your cloud account? Compromised credentials, billing issues, or vendor disputes can lock you out of your entire infrastructure. Have a plan that addresses:

Multi-factor authentication recovery
Break-glass access procedures
Backups stored outside your primary cloud provider (even if it’s just critical data)
Infrastructure-as-code stored in a separate version control system

Availability Zones vs. Regions

Deploying across availability zones within a region gives you high availability against single-facility failures. It does not give you disaster recovery against regional failures. For true DR, you need multi-region capability. These are different design points with different costs and different risk profiles.

The Human Element

The best DR plan in the world fails if people don’t know their roles. When the disaster happens, you need:

An incident commander: One person who makes decisions and coordinates the response. Not a committee. One person.

Clear communication channels: How do you communicate when your primary tools are down? If Slack is hosted in the same region that just failed, how does the team coordinate? Have a backup communication plan (phone tree, secondary Slack workspace, Microsoft Teams, carrier pigeon, whatever works).

Customer communication: Who tells customers what’s happening? What channels? How frequently? Pre-draft templates for common scenarios so you’re not wordsmithing a status page update during an outage.

Decision authority: Who can authorize a failover that might cause brief data inconsistency? Who can authorize spending on emergency infrastructure? Pre-authorize these decisions so you’re not waiting for executive approval at 2 AM.

Figure: DR team roles and communication flow during an incident

A Real-World DR Playbook

Let me share the skeleton of a DR playbook I’ve used for a SaaS platform running on AWS:

Scenario: Complete loss of primary region (us-east-1)

Detection: Route 53 health checks fail for all primary endpoints. Monitoring alerts fire. On-call engineer confirms regional outage via AWS status page and independent testing.

Decision: Incident commander authorizes failover (pre-authorized for this scenario).

Execution (automated via runbook):

1. Promote RDS read replica in us-west-2 to primary (5 minutes)
2. Update application configuration to point to new primary database (automated via Parameter Store, 2 minutes)
3. Scale up ECS services in us-west-2 to handle full traffic (3 minutes)
4. Verify application health checks pass in us-west-2 (2 minutes)
5. Update Route 53 to route all traffic to us-west-2 (immediate, DNS propagation 60-300 seconds)
6. Run smoke test suite against production endpoints (5 minutes)
7. Monitor error rates and latency for 15 minutes

Total target RTO: 30 minutes
Actual tested RTO: 22-28 minutes across four tests

Post-failover: Customer communication sent. Engineering team begins root cause analysis. Failback planning begins once primary region is confirmed stable (typically 24-48 hours after region recovery).

This playbook works because we tested it. The first time we ran it, it took 90 minutes and we found eleven issues. The fourth time, it was under 30 minutes and smooth.

Start Now

If you’ve read this far without a DR plan, let me leave you with a simple starting point: back up your critical data to a different geographic location, write down how to restore it, and test the restore. That’s Tier 1, cold standby. It’s not perfect, but it’s infinitely better than nothing.

Then work your way up from there. Classify your systems. Define your tiers. Build the architecture. Write the runbooks. Test the recovery. Each step makes your organization more resilient.

Disasters are not theoretical. They happen. I’ve been through them. The organizations that recover quickly are the ones that planned and practiced. The ones that don’t… well, some of them aren’t around anymore.

Related Articles

RTO vs RPO: Recovery Time and Recovery Point Objectives Explained

Snapshots vs Volumes: Understanding Cloud Storage Primitives

Multi-Region Active-Active Architecture: Designing Systems That Serve Traffic from Everywhere

High Availability Explained: Designing Systems That Don't Go Down
