How to Eliminate Single Points of Failure in Your Architecture

In 2009, I was brought in to review the architecture of an e-commerce platform that had experienced four major outages in six months. The CTO described the system as “highly available.” He pointed to a nice architecture diagram showing load balancers, multiple application servers, and a database cluster. It looked great on the whiteboard.

Then I started asking questions.

“Where does your SSL certificate termination happen?” One load balancer. “Where does your session data live?” One Redis instance. “How does your application connect to the database?” Through a single connection proxy. “Where do your DNS records point?” A single Elastic IP on the single load balancer.

The “highly available” system had four single points of failure that I found in the first thirty minutes. The outages weren’t bad luck. They were inevitable. Every component I identified had failed at least once in the preceding six months. The architecture guaranteed failure; the only question was which component would fail next.

That engagement changed how I approach every architecture review. I now start by mapping every component in the request path and asking one question about each: “What happens when this fails?”

What Is a Single Point of Failure?

A single point of failure (SPOF) is any component whose failure causes the entire system (or a critical function) to become unavailable. If removing one component takes down your service, that component is a SPOF.

The concept is simple. Identifying SPOFs in real systems is harder than you’d think, because they hide in places you don’t expect.

The Obvious SPOFs

Let me start with the ones everybody knows about, because even these still show up in production systems with alarming frequency.

Single Database Instance

One database server with no replication, no standby, no failover. If it dies, the application dies. This is the most common SPOF I encounter, usually in systems that grew organically from a small deployment without re-evaluating the architecture.

Fix: Primary-standby replication with automated failover. RDS Multi-AZ does this out of the box. For self-managed databases, set up streaming replication and use a failover manager like Patroni (PostgreSQL) or Orchestrator (MySQL).

Single Application Server

One server handling all requests. No redundancy, no failover.

Fix: Multiple instances behind a load balancer. At minimum, two instances across two availability zones. Configure health checks so the load balancer stops sending traffic to failed instances.
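
To make the health-check idea concrete, here is a minimal sketch of the bookkeeping a load balancer does. The `Backend` class, the `UNHEALTHY_THRESHOLD` value, and the instance names are assumptions for illustration, not any particular product's behavior.

```python
from dataclasses import dataclass

# Hypothetical setting: consecutive failed checks before an instance is removed.
UNHEALTHY_THRESHOLD = 3


@dataclass
class Backend:
    name: str
    zone: str
    consecutive_failures: int = 0

    def record_check(self, ok: bool) -> None:
        # A passing check resets the counter; a failing one increments it.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1


def healthy_backends(backends):
    """Return only the instances that should still receive traffic."""
    return [b for b in backends if b.consecutive_failures < UNHEALTHY_THRESHOLD]
```

Requiring several consecutive failures avoids pulling an instance out of rotation because of one slow response, at the cost of a slightly slower reaction to a real failure.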

Single Load Balancer

I see this more than I should. Teams add multiple application servers behind a load balancer and congratulate themselves on eliminating the application tier SPOF. Then the load balancer itself fails.

Fix: Managed load balancers (AWS ALB/NLB, Azure Load Balancer, Google Cloud Load Balancing) are redundant by design. The provider manages multiple load balancer nodes across AZs. If you’re running your own load balancers (HAProxy, NGINX), you need at least two in an active-passive or active-active configuration with a floating IP or DNS failover.

The Hidden SPOFs

These are the ones that trip up experienced architects. They don’t show up on typical architecture diagrams, and they lurk in the shadows until they bite.

DNS

Your application depends on DNS to resolve domain names. If your DNS provider goes down, your application is unreachable even though every server is healthy. This happened famously in October 2016, when a DDoS attack on the DNS provider Dyn made many major sites unreachable for hours.

Fix: Use multiple DNS providers. Route 53 is highly available, but having a secondary DNS provider (Cloudflare, Google Cloud DNS) that serves the same zone provides protection against a single DNS provider outage. This requires keeping zone records synchronized, which tools like OctoDNS can automate.
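
Whatever tool does the synchronization, it is worth verifying that the two providers really serve the same records. The sketch below is not OctoDNS's API. It assumes you have already exported each provider's zone into a plain dict keyed by `(name, record_type)`, and the `zone_drift` helper is a hypothetical name.

```python
def zone_drift(primary, secondary):
    """Return records that differ between two providers' copies of a zone.

    Each argument maps (name, record_type) to a frozenset of record values.
    A value of None in the result means the record is missing on that side.
    """
    drift = {}
    for key in primary.keys() | secondary.keys():
        ours, theirs = primary.get(key), secondary.get(key)
        if ours != theirs:
            drift[key] = (ours, theirs)
    return drift
```

An empty result means the exports match. Anything else is a record that would resolve differently, or not at all, if traffic shifted to the secondary provider.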

Certificate Management

SSL/TLS certificates expire. When they do, your site shows scary browser warnings or stops working entirely. I’ve seen major outages caused by expired certificates, including one at a company that had spent millions on HA infrastructure.

Fix: Automate certificate renewal (Let’s Encrypt + cert-manager, ACM in AWS). Monitor certificate expiration dates. Alert well in advance (30 days, 14 days, 7 days).
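
A monitoring check along these lines can be sketched with Python's standard library. The thresholds mirror the 30/14/7 days above; the function names and the way an alert is represented are assumptions for illustration.

```python
import socket
import ssl
import time
from typing import Optional

ALERT_THRESHOLDS_DAYS = (30, 14, 7)


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'.

    Note: with the default context, an already-expired certificate fails the
    handshake with an SSL error, which is itself a signal worth alerting on.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` (default: the current time) and the expiry date."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires_at - now) / 86400


def alert_level(days_left: float) -> Optional[int]:
    """Return the tightest threshold crossed (7, 14, or 30), or None."""
    crossed = [t for t in ALERT_THRESHOLDS_DAYS if days_left <= t]
    return min(crossed) if crossed else None
```

Run something like `alert_level(days_until_expiry(fetch_not_after("example.com")))` on a schedule, from somewhere outside the infrastructure being checked.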

External Dependencies

Your application probably depends on external services: payment processors, email providers, authentication services, CDNs, third-party APIs. Each one is a SPOF unless you’ve designed for its failure.

Fix: Circuit breakers for external calls. Fallback behavior when dependencies are unavailable. Retry with exponential backoff. Timeouts on every external call (I’ve seen systems lock up because they waited indefinitely for a response from a dead third-party API). Cache external responses where possible.
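
Here is a minimal sketch of a circuit breaker combined with retries and exponential backoff. The class and function names, the thresholds, and the jitter strategy are my own assumptions for illustration; in production you would more likely use a maintained library than this code.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Stops calling a dependency after repeated failures, then retries later."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_after = reset_after              # seconds to stay open
        self.clock = clock                          # injectable for tests
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unavailable")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def backoff_delays(retries, base=0.5, cap=8.0):
    """Upper bounds for each retry delay: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** i) for i in range(retries)]


def call_with_retry(func, attempts=3, base=0.5, cap=8.0, sleep=time.sleep):
    """Call func up to `attempts` times, sleeping with jittered backoff between tries."""
    delays = backoff_delays(attempts - 1, base, cap)
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # the breaker is open; retrying immediately would be pointless
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(random.uniform(0, delays[attempt]))
```

The pieces compose as `call_with_retry(lambda: breaker.call(fetch))`, where `fetch` is your own function and carries its own timeout (for example `urllib.request.urlopen(url, timeout=2)`). When `CircuitOpenError` surfaces, that is the cue to serve the fallback or cached response.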

Shared State

This is the subtle one. Two application instances sharing a single Redis instance for session data, a single Elasticsearch node for search, a single message queue broker for async processing. The application tier is redundant, but it depends on non-redundant shared state.

Fix: Make every shared state component redundant. Redis Cluster or Redis Sentinel for cache/session. Multi-node Elasticsearch cluster. Replicated message queue (SQS is inherently redundant; self-managed RabbitMQ needs clustering).

Configuration and Secrets Management

Where do your application configurations and secrets live? If they’re in a single configuration server or a single secrets manager, that’s a SPOF.

Fix: Use managed services with built-in redundancy (AWS Secrets Manager, Parameter Store). Cache configuration locally so the application can operate temporarily if the configuration service is unavailable.
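
The local-cache fallback can be sketched as follows. `fetch_remote` stands in for whatever client call retrieves your configuration, and `load_config` and the cache path are hypothetical names. For secrets, weigh the convenience of an on-disk copy against the exposure it creates.

```python
import json
import os
import tempfile


def load_config(fetch_remote, cache_path):
    """Return (config, source), where source is 'remote' or 'cache'.

    On success the fresh config is written to cache_path. If the remote call
    fails, the last good copy is returned instead. If no cached copy exists
    yet, the error from opening the file propagates to the caller.
    """
    try:
        config = fetch_remote()
    except Exception:
        with open(cache_path, encoding="utf-8") as fh:
            return json.load(fh), "cache"

    # Write to a temp file and rename, so a crash never leaves a half-written cache.
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(config, fh)
    os.replace(tmp_path, cache_path)
    return config, "remote"
```

Returning the source alongside the config lets the application log or alert when it is running on a stale copy, so the fallback does not quietly become the normal state.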

Diagram showing hidden single points of failure in a typical web application architecture

The Process: Finding SPOFs Systematically

I use a systematic approach to find SPOFs because gut instinct misses things. Here’s the process:

Step 1: Map the Request Path

Trace every request from the user’s browser to the database and back. Every component the request touches is a potential SPOF. Include DNS, CDN, load balancers, web servers, application servers, caches, databases, file storage, and external services.

Don’t skip the “boring” components. DNS, NTP, and service discovery are critical infrastructure that gets overlooked.

Step 2: Map the Dependencies

For each component, list its dependencies. The application server depends on the database, the configuration service, and the secrets manager. The database depends on its storage subsystem and network connectivity. Map the full dependency tree.
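
Once the dependency tree is written down as data, walking it is easy to automate. This is a rough sketch: the component names and the two dicts are made up for illustration, and an instance count of two or more only shows that redundancy exists, not that failover works.

```python
def find_spofs(entry, depends_on, instances):
    """Walk the dependency tree from `entry`; return components with fewer than two instances.

    depends_on maps a component to the components it needs.
    instances maps a component to how many independent copies exist
    (components missing from the map are assumed to have one).
    """
    seen, stack, spofs = set(), [entry], []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if instances.get(node, 1) < 2:
            spofs.append(node)
        stack.extend(depends_on.get(node, []))
    return sorted(spofs)


# Hypothetical system resembling the one in the opening story.
depends_on = {
    "load_balancer": ["app"],
    "app": ["session_store", "db_proxy"],
    "db_proxy": ["database"],
}
instances = {"load_balancer": 1, "app": 2, "session_store": 1, "db_proxy": 1, "database": 2}
single_points = find_spofs("load_balancer", depends_on, instances)
```

Here `single_points` contains the load balancer, the session store, and the database proxy. None of them would stand out on a diagram that only drew the application servers and the database cluster.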

Step 3: Ask “What If?”

For every component and dependency, ask: “What happens when this fails?” Document the answer honestly. Don’t write “failover to standby” unless you’ve tested the failover and verified it works.

Step 4: Classify by Impact

Not all SPOFs are equal. A SPOF that causes total system failure is more critical than one that degrades a non-essential feature. Prioritize elimination based on impact.

Step 5: Eliminate or Mitigate

For each SPOF:

Eliminate by adding redundancy (preferred).
Mitigate by designing graceful degradation, where the system continues operating with reduced functionality when the SPOF fails.

Some SPOFs can’t be economically eliminated. A single payment processor integration might be a SPOF, but integrating a second payment processor just for redundancy might cost more than the downtime it prevents. In those cases, design for graceful degradation: show users a helpful message and allow retries rather than crashing.
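
For the payment processor case, graceful degradation might look like the sketch below. `PaymentUnavailable`, `checkout`, and the response fields are hypothetical; a real integration would raise its own client-specific errors.

```python
class PaymentUnavailable(Exception):
    """Hypothetical error raised when the payment processor cannot be reached."""


def checkout(order_id, charge):
    """Attempt a charge; degrade to a retryable response instead of crashing."""
    try:
        receipt = charge(order_id)
    except PaymentUnavailable:
        return {
            "status": "payment_unavailable",
            "order_id": order_id,
            "retryable": True,
            "message": "We could not reach our payment provider. "
                       "Your cart is saved; please try again in a few minutes.",
        }
    return {"status": "paid", "order_id": order_id, "receipt": receipt}
```

The SPOF is still there, but its failure now costs a delayed sale instead of an error page.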

The Layer Cake of Redundancy

High availability requires eliminating SPOFs at every layer. Miss one layer and your “redundant” architecture has a single point of failure hiding inside it.
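
A little arithmetic shows why one missed layer matters so much. This sketch assumes component failures are independent, which real systems only approximate, and the 99.9% figure is an illustrative number, not a measurement.

```python
from math import prod


def serial(*availabilities):
    """Availability of a chain where every component must be up."""
    return prod(availabilities)


def redundant(availability, copies):
    """Availability of a group that is up as long as at least one copy is up."""
    return 1 - (1 - availability) ** copies


def downtime_hours_per_year(availability):
    return (1 - availability) * 8760


# Five layers, each built from components that are individually 99.9% available.
pair = redundant(0.999, 2)
all_layers_redundant = serial(pair, pair, pair, pair, pair)
one_layer_single = serial(0.999, pair, pair, pair, pair)
```

Under these assumptions, `one_layer_single` comes out below 99.9%: the one non-redundant layer accounts for almost all of the expected downtime, no matter how well the other four are built.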

Physical Layer

Redundant power supplies in servers
Redundant network interfaces (bonded NICs)
Multiple power feeds to racks
Generator backup for data center power

Network Layer

Multiple network paths between components
Redundant switches and routers
Multiple internet uplinks
DNS redundancy

Compute Layer

Multiple instances per service
Cross-AZ distribution
Auto-scaling to replace failed instances
Health checks for detection

Data Layer

Database replication
Multi-AZ storage
Backup and recovery procedures
Connection pooling with failover

Application Layer

Stateless design enabling instance replacement
Circuit breakers for dependency failures
Retry logic with exponential backoff
Feature flags for graceful degradation

Operations Layer

Automated monitoring and alerting
Runbooks for manual intervention
On-call rotation (don’t be a SPOF yourself)
Cross-trained team members

That last bullet deserves emphasis. I’ve seen organizations where one person knew how to manage the database, and when that person was unavailable during an outage, recovery took hours instead of minutes. People are single points of failure too.

Layer cake diagram showing redundancy required at every architectural layer

Real-World War Stories

The Hidden NAT Gateway SPOF

A client’s multi-AZ architecture looked perfect on paper. Application servers in two AZs, RDS Multi-AZ, redundant ALB. But all outbound traffic from both AZs routed through a single NAT Gateway in AZ-A. When AZ-A had a networking issue, the NAT Gateway became unreachable. Application servers in AZ-B were healthy but couldn’t reach external services (payment processor, email service, third-party APIs).

Lesson: NAT Gateways should be per-AZ, with route tables directing each AZ’s traffic to its local NAT Gateway.
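
A check for this misconfiguration is simple once the routing is expressed as data. The sketch below uses plain dicts, not a cloud provider API, and the subnet and gateway names are invented for illustration.

```python
def cross_az_nat_routes(subnets, nat_gateway_az):
    """Return subnets whose default route uses a NAT gateway in a different AZ.

    subnets maps a subnet name to {"az": ..., "nat": ...}, where "nat" is the
    gateway its route table points at. nat_gateway_az maps a gateway to its AZ.
    """
    return sorted(
        name
        for name, subnet in subnets.items()
        if nat_gateway_az[subnet["nat"]] != subnet["az"]
    )


# The layout from the story: both AZs route through one gateway in az-a.
subnets = {
    "private-a": {"az": "az-a", "nat": "nat-1"},
    "private-b": {"az": "az-b", "nat": "nat-1"},
}
flagged = cross_az_nat_routes(subnets, {"nat-1": "az-a"})
```

In this layout `flagged` contains `private-b`, the subnet that lost outbound connectivity when az-a had trouble.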

The Deployment SPOF

A team had excellent runtime redundancy: multiple instances, multi-AZ, the works. But their deployment process updated all instances simultaneously. A bad deployment took down every instance at once. Their “highly available” system had a deployment process that was a SPOF.

Lesson: Rolling deployments, blue-green deployments, or canary deployments ensure that a bad deployment doesn’t take down all instances simultaneously. Your deployment process must respect the redundancy your runtime architecture provides.
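
The core of a rolling deployment is small enough to sketch. Here `deploy` and `is_healthy` are callables you would supply, and the function name and return shape are assumptions for illustration, not a real tool's interface.

```python
def rolling_deploy(instances, deploy, is_healthy, batch_size=1):
    """Update instances in batches; stop at the first batch that fails its health check."""
    updated = []
    for start in range(0, len(instances), batch_size):
        batch = instances[start:start + batch_size]
        for instance in batch:
            deploy(instance)
        updated.extend(batch)
        if not all(is_healthy(instance) for instance in batch):
            # Halt here: the remaining instances still run the old version
            # and keep serving traffic while the bad batch is rolled back.
            return {
                "status": "halted",
                "updated": updated,
                "untouched": instances[start + batch_size:],
            }
    return {"status": "complete", "updated": updated, "untouched": []}
```

With four instances and a batch size of one, a bad release stops after the first instance and the other three never see it.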

The Monitoring SPOF

During an outage, the team discovered that their monitoring system was running on the same infrastructure that was failing. They lost visibility into the outage precisely when they needed visibility most.

Lesson: Monitoring infrastructure must be independent from the systems it monitors. External monitoring (Pingdom, StatusCake) that checks your system from outside provides visibility even when your infrastructure is down.

The Fault Tolerance Connection

Eliminating SPOFs gets you to high availability. The next step, fault tolerance, ensures that failover between redundant components is so fast that users never notice. The progression is:

1. No redundancy: Single point of failure. Component fails, system fails.
2. Redundancy with manual failover: SPOF eliminated, but recovery requires human intervention.
3. Redundancy with automated failover: SPOF eliminated, recovery is automatic but may involve a brief interruption.
4. Fault-tolerant redundancy: SPOF eliminated, failure is completely invisible to users.

Most systems should aim for level 3. Level 4 is reserved for systems where any interruption is unacceptable.

Audit Checklist

Here’s the checklist I use when reviewing architectures. For every item, I ask: “Is there more than one? What happens when one fails?”

DNS provider(s)
CDN/edge nodes
Load balancer(s)
Web/API server instances
Application server instances
Cache instances
Database primary and standby
Message queue/broker
File/object storage
Secrets/configuration management
SSL/TLS certificate management
NAT gateways (per AZ)
External service integrations
Monitoring and alerting systems
Deployment pipeline
Team knowledge (bus factor)

If any of those is a single instance with no redundancy or fallback, you have a SPOF. Whether you need to fix it depends on the impact analysis, but you should at least know it’s there.

Comprehensive SPOF audit checklist visualization covering all architectural layers

The uncomfortable truth about single points of failure is that every system has them. The difference between a reliable system and an unreliable one isn’t the absence of SPOFs; it’s the awareness of where they are and the deliberate decision about which ones to eliminate, which ones to mitigate, and which ones to accept.

After thirty years, I’ve never reviewed an architecture that didn’t have at least one SPOF I wasn’t expecting. The search never ends. But each one you find and fix makes the next 2 AM call a little less likely.
