Understanding Uptime SLAs and Availability
A complete guide to uptime SLAs and high availability. Covers the nines of uptime, SLA structure, SLI/SLO/SLA frameworks, high availability architectures, error budgets, and tracking availability.
Every hosting provider, cloud platform, and SaaS vendor promises high uptime. The numbers sound impressive: 99.9%, 99.99%, 99.999%. But the difference between those numbers is enormous when you translate them into actual downtime minutes per year. And the gap between what a vendor promises in their SLA and what they actually deliver is often even bigger.
Understanding uptime SLAs, what they cover, what they exclude, and how to hold vendors accountable, is essential for anyone responsible for keeping a website or service available. This guide breaks down the math, the contracts, the architectures, and the monitoring practices that separate teams who hit their availability targets from those who do not.
The nines of uptime
Uptime is expressed as the percentage of total time that a service is available. The industry refers to availability levels as "nines" because they are distinguished by how many nines the percentage contains: 99.9% is three nines, 99.99% is four nines, and so on. [1]
| Availability | Annual downtime | Monthly downtime | Weekly downtime |
|:---:|:---:|:---:|:---:|
| 99% (two nines) | 3 days, 15 hours | 7 hours, 18 minutes | 1 hour, 41 minutes |
| 99.5% | 1 day, 19 hours | 3 hours, 39 minutes | 50 minutes |
| 99.9% (three nines) | 8 hours, 46 minutes | 43 minutes, 50 seconds | 10 minutes, 5 seconds |
| 99.95% | 4 hours, 23 minutes | 21 minutes, 55 seconds | 5 minutes, 2 seconds |
| 99.99% (four nines) | 52 minutes, 36 seconds | 4 minutes, 23 seconds | 1 minute, 0 seconds |
| 99.999% (five nines) | 5 minutes, 16 seconds | 26 seconds | 6 seconds |
The jump from 99.9% to 99.99% looks like a gain of only 0.09 percentage points. In practice it is a tenfold reduction in allowed downtime, from nearly 9 hours per year to under 53 minutes. Each additional nine cuts the allowance by another factor of ten and demands disproportionately more investment in infrastructure, redundancy, and operational discipline.
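The conversion from a percentage to a downtime allowance can be sketched in a few lines. This is a minimal illustration, not a standard tool: the function name `allowed_downtime` and the `PERIODS` table are hypothetical, and the period lengths (an average year of 365.25 days, a month of one twelfth of that) are assumptions.

```python
from datetime import timedelta

# Assumed period lengths: an average year of 365.25 days,
# and a month defined as one twelfth of that year.
PERIODS = {
    "year": timedelta(days=365.25),
    "month": timedelta(days=365.25 / 12),
    "week": timedelta(days=7),
}


def allowed_downtime(availability_pct: float, period: str) -> timedelta:
    """Return the downtime an availability target permits over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = (100 - availability_pct) / 100
    return PERIODS[period] * unavailable_fraction


# Compare the yearly allowance for three nines and four nines.
for target in (99.9, 99.99):
    print(target, allowed_downtime(target, "year"))
```

Changing the period lengths (for example, a fixed 30-day month) shifts the results by a few percent, which is one reason published tables differ slightly.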
For a deeper breakdown with examples, see uptime nines explained and the uptime calculator.
What counts as downtime
This sounds like a simple question, but the answer varies wildly depending on who is defining it. Downtime could mean:
- The server returns a 5xx error
- The page takes longer than a threshold (e.g., 10 seconds) to load
- The service is unreachable from one or more monitoring locations
- A critical feature (like checkout or login) is broken, even if the homepage loads
- Scheduled maintenance windows (some SLAs exclude these)
Your definition of downtime determines your measured availability. A vendor that excludes scheduled maintenance, single-location failures, and performance degradation from their downtime calculation will report higher availability than one that counts everything. Always read the fine print.
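To see how much the definition matters, the sketch below scores the same set of monitoring checks under two definitions. The `Check` record, the sample data, and the `availability` function are all hypothetical; a real SLA would spell out its own rules.

```python
from dataclasses import dataclass


@dataclass
class Check:
    status: int           # HTTP status code, 0 if the service was unreachable
    latency_s: float      # response time in seconds
    in_maintenance: bool  # True if the check fell in a scheduled window


def availability(checks, slow_threshold_s=None, exclude_maintenance=False):
    """Percentage of counted checks that were 'up' under the chosen rules."""
    counted = [c for c in checks if not (exclude_maintenance and c.in_maintenance)]
    if not counted:
        raise ValueError("no checks to evaluate")

    def is_down(c):
        if c.status == 0 or c.status >= 500:
            return True
        # Slow responses count as down only if a threshold is configured.
        return slow_threshold_s is not None and c.latency_s > slow_threshold_s

    up = sum(1 for c in counted if not is_down(c))
    return 100 * up / len(counted)


# Hypothetical sample: 96 healthy checks, 2 very slow ones,
# and 2 errors during a maintenance window.
checks = (
    [Check(200, 0.3, False)] * 96
    + [Check(200, 12.0, False)] * 2
    + [Check(503, 0.1, True)] * 2
)
strict = availability(checks, slow_threshold_s=10)
lenient = availability(checks, exclude_maintenance=True)
print(strict, lenient)
```

The strict definition counts slow responses and maintenance errors as downtime; the lenient one ignores both, so it reports a higher figure from identical data.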
For calculating your own uptime, see how to calculate uptime.
SLA structure and terminology
A Service Level Agreement (SLA) is a contract between a service provider and a customer that defines the expected level of service, how it is measured, and what happens when the provider fails to meet it. [2]
SLI, SLO, and SLA
Google's Site Reliability Engineering team popularized a three-part framework for thinking about service levels: [3]
Service Level Indicator (SLI) is the metric you measure. It is a specific, quantifiable measure of service health. Examples:
- Availability: the percentage of successful requests
- Latency: the percentage of requests completed within a time threshold
- Error rate: the percentage of requests that return errors
- Throughput: requests per second successfully handled
Service Level Objective (SLO) is the target value for an SLI. It is an internal goal, not a contractual commitment. Examples:
- "99.95% of requests should return a 2xx status code"
- "95% of requests should complete in under 200ms"
- "Error rate should stay below 0.1%"
Service Level Agreement (SLA) is the contractual commitment. It is an SLO with consequences attached: if the provider fails to meet the target, the customer gets something (usually service credits). SLAs are typically set below SLOs to provide a buffer. If your SLO is 99.95%, your SLA might promise 99.9%.
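The relationship between the three terms can be sketched as follows, using the 99.95% SLO and 99.9% SLA from the example above as defaults. The function names and the returned status strings are illustrative assumptions.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: the measured percentage of successful requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100 * successful_requests / total_requests


def evaluate(sli_pct: float, slo_pct: float = 99.95, sla_pct: float = 99.9) -> str:
    """Compare a measured SLI against the internal SLO and the contractual SLA."""
    if sli_pct >= slo_pct:
        return "meeting SLO"
    if sli_pct >= sla_pct:
        # The buffer between SLO and SLA: an internal miss, no contractual breach.
        return "SLO missed, SLA still met"
    return "SLA breached"


sli = availability_sli(999_400, 1_000_000)
print(sli, evaluate(sli))
```

The middle state is the point of the buffer: the team gets an internal warning before customers are owed anything.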
For a complete breakdown, see SLI vs SLO vs SLA.
Anatomy of an uptime SLA
A typical uptime SLA includes:
Uptime commitment. The headline number, usually expressed as a monthly percentage. "99.9% monthly uptime" is the most common for standard hosting.
Definition of downtime. What the provider considers an outage. Read this carefully. Common exclusions include:
- Scheduled maintenance (often with advance notice requirements)
- Force majeure events
- Customer-caused outages (misconfiguration, exceeding resource limits)
- Issues caused by third-party services
- DDoS attacks (some providers exclude these, others do not)
- Single-region failures in multi-region setups
Measurement method. How the provider measures uptime. Some use their own internal monitoring (which can be optimistic). Others accept external monitoring data. Some measure per-region, per-service, or per-instance rather than globally.
Remedies. What you get when the SLA is breached. Almost always service credits, not refunds. Typical credit schedules:
| Monthly uptime | Service credit |
|:---:|:---:|
| 99.0% to 99.9% | 10% |
| 95.0% to 99.0% | 25% |
| Below 95.0% | 50% |
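A schedule like this one maps directly to a lookup. The sketch below assumes a 99.9% commitment (no credit at or above it) and resolves the shared endpoints in the provider's favor; both choices are assumptions, since real SLA documents state their own boundaries.

```python
def service_credit_pct(monthly_uptime_pct: float) -> int:
    """Return the credit percentage for a measured monthly uptime.

    Assumptions: the commitment is 99.9%, and a value that sits exactly
    on a tier boundary falls into the smaller-credit tier.
    """
    if not 0 <= monthly_uptime_pct <= 100:
        raise ValueError("uptime must be between 0 and 100")
    if monthly_uptime_pct >= 99.9:
        return 0
    if monthly_uptime_pct >= 99.0:
        return 10
    if monthly_uptime_pct >= 95.0:
        return 25
    return 50


for uptime in (99.95, 99.5, 97.0, 90.0):
    print(uptime, service_credit_pct(uptime))
```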
Claim process. How to request credits. Most providers require you to file a claim within a specific window (often 30 days) with evidence of the outage. If you do not file, you do not get credits, even if the outage was widely publicized.
What SLAs do not cover
SLA credits rarely come close to covering the actual cost of downtime. If your e-commerce site earns $10,000 per hour and goes down for 4 hours, the $40,000 in lost revenue dwarfs a 25% credit on a $500/month hosting bill.
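The arithmetic behind that comparison, as a small sketch. The function names are hypothetical and the figures are the ones from the example above.

```python
def downtime_loss(revenue_per_hour: float, outage_hours: float) -> float:
    """Revenue lost while the service was down."""
    return revenue_per_hour * outage_hours


def credit_value(monthly_fee: float, credit_pct: float) -> float:
    """Cash value of a service credit on one month's bill."""
    return monthly_fee * credit_pct / 100


loss = downtime_loss(10_000, 4)   # 40,000 in lost revenue
credit = credit_value(500, 25)    # 125 back as a credit
print(f"loss={loss}, credit={credit}, ratio={loss / credit:.0f}x")
```

In this example the credit recovers well under 1% of the loss, which is why the credit schedule should not drive your availability planning.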
SLAs are a baseline commitment, not insurance. They tell you the minimum service level the provider is willing to be held accountable for. Your actual availability needs to exceed the SLA, and you need your own monitoring and redundancy to achieve that.
For more on the real cost implications, see cost of website downtime.
SLA credits are designed to compensate you for the inconvenience, not for business losses. If your business requires true high availability, you need to build it yourself through redundant architecture rather than relying on a single provider's SLA.
What hosting providers promise vs. deliver
Cloud provider SLAs
Major cloud providers offer tiered SLAs based on the service and architecture:
AWS offers 99.99% for EC2 instances within a Region (requires multi-AZ deployment). Single-instance SLA is lower. Compute, storage, and networking each have separate SLAs.
Google Cloud Platform offers 99.99% for Compute Engine with instances spread across multiple zones. Their SLA is notable for being relatively straightforward in its definitions.
Microsoft Azure offers 99.99% for Virtual Machines deployed across two or more Availability Zones. Single-instance VMs with premium SSD get 99.9%.
Cloudflare offers a 100% uptime SLA on its Enterprise plan. Its standard plans have no formal SLA.
The pattern is clear: higher SLAs require you to architect for redundancy. A single server, regardless of provider, will not achieve 99.99% availability.
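The effect of redundancy on availability can be approximated with a simple probability model. The sketch below assumes instance failures are independent and that failover is instant and perfect, which real systems only approximate; the 99% per-instance figure is a hypothetical input, not a measured value.

```python
def redundant_availability(instance_availability_pct: float, instances: int) -> float:
    """Availability of N redundant instances where any one can serve traffic.

    Assumes failures are independent; correlated failures (shared power,
    shared network, a bad deploy) make the real number lower.
    """
    if instances < 1:
        raise ValueError("need at least one instance")
    failure_probability = 1 - instance_availability_pct / 100
    return 100 * (1 - failure_probability ** instances)


for n in (1, 2, 3):
    print(n, redundant_availability(99.0, n))
```

Under these assumptions, two instances that are each down 1% of the time are both down only 0.01% of the time, which is why the higher SLA tiers are tied to multi-zone deployments.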
Hosting provider SLAs
Traditional hosting providers often promise impressive numbers that are less impressive once you read the exclusions:
- "99.9% uptime guarantee" frequently excludes scheduled maintenance, which can add up to hours per month
- Some providers measure uptime based on network availability, not application availability. Your server can have crashed while the network still counts as "up."
- Budget hosting providers may offer "99.9% uptime" with a remedy of "prorated credit for downtime." On a $5/month plan, a full month of credit is $5.
How to evaluate SLA claims
When evaluating a provider's SLA:
- Read the actual SLA document, not the marketing page. The marketing page says "99.99% uptime." The SLA document defines exactly what that means.
- Check the exclusions. The longer the exclusion list, the less meaningful the uptime guarantee.
- Look at the measurement method. Internal monitoring vs. external monitoring. Per-region vs. global.
- Calculate the maximum credit. If the maximum credit is capped at one month's fee, the financial incentive for the provider to maintain uptime is limited.
- Check historical performance. Status page archives, third-party monitoring reports, and community forums reveal actual availability. Use services like Is That Down to track vendor status.
See SLA monitoring tools for options.
High availability architectures
If your availability target exceeds what a single server can provide, you need to architect for high availability (HA). HA is the practice of designing systems that continue to operate when individual components fail. [4]
Redundancy at every layer
High availability requires eliminating single points of failure throughout your stack:
DNS. Use multiple DNS providers or a provider with an anycast network. A DNS outage makes your site unreachable even if your servers are running. See what is high availability for foundational concepts.
Load balancers. Multiple load balancers in an active-active or active-passive configuration. If the load balancer is a single point of failure, redundant servers behind it do not help.
Application servers. Multiple instances behind a load balancer. If one instance fails or needs maintenance, others handle the traffic.
Databases. Primary-replica replication with automatic failover. Consider multi-region replication for the highest availability tiers.
Storage. Redundant storage (RAID, distributed storage systems, cloud object storage with built-in redundancy).
Network. Multiple network paths, multiple ISPs, multiple data centers.
SSL certificates. Monitor certificate expiry because an expired cert takes down HTTPS availability even when servers are healthy.
Active-active vs. active-passive
Active-active means all instances are serving traffic simultaneously. If one fails, the remaining instances absorb its load. This provides the best utilization and fastest failover, but requires your application to handle concurrent writes and session consistency.
Active-passive means one instance (or set of instances) serves traffic while the standby waits. If the active fails, the passive takes over. This is simpler to implement but wastes the standby resources during normal operation, and failover takes longer.
For the tradeoffs, see high availability vs fault tolerance.
Geographic distribution
For the highest availability, distribute across geographic regions:
- Multi-region deployment protects against entire data center failures, regional network outages, and natural disasters.
- CDN edge caching serves static content from locations close to users, providing availability for cached content even if your origin is down.
- DNS-based failover routes users to the nearest healthy region.
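The routing decision behind DNS-based failover can be sketched as "pick the closest region that is passing health checks." The `Region` record, the region names, and the latency figures below are hypothetical; managed DNS services implement this logic for you with their own health check configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Region:
    name: str
    latency_ms: float  # measured latency from the user's location
    healthy: bool      # result of the most recent health check


def pick_region(regions: list) -> Optional[Region]:
    """Return the lowest-latency healthy region, or None if all are down."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda r: r.latency_ms)


regions = [
    Region("eu-west", 20, healthy=False),   # nearest, but failing checks
    Region("us-east", 90, healthy=True),
    Region("ap-south", 180, healthy=True),
]
choice = pick_region(regions)
print(choice.name if choice else "no healthy region")
```

Because DNS answers are cached, real failover is only as fast as the record's TTL allows, so a short TTL on failover records is usually part of the design.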
See high availability hosting for provider-specific guidance.
Fault tolerance vs. disaster recovery
These terms are related but address different problems. Understanding the distinction helps you invest appropriately. [5]
Fault tolerance
Fault tolerance is the ability to continue operating when a component fails, with no user-visible interruption. The system detects the failure and routes around it automatically. Examples:
- A load balancer that detects a failed server and stops routing traffic to it
- A database replica that automatically promotes to primary when the primary fails
- A CDN that serves cached content when the origin server is unreachable
Fault tolerance provides continuous availability. Users never know a failure occurred.
Disaster recovery
Disaster recovery (DR) is the process of restoring service after a major failure. Unlike fault tolerance, DR accepts that there will be downtime. The goal is to minimize recovery time (RTO, Recovery Time Objective) and data loss (RPO, Recovery Point Objective). [5]
DR planning addresses scenarios like:
- Complete data center failure
- Catastrophic data corruption
- Ransomware attacks
- Major cloud provider outages
DR involves backups, runbooks, tested recovery procedures, and often a separate "cold" or "warm" standby environment that can be brought online when the primary fails.
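A quick way to sanity-check a DR plan is to compare it with the objectives. The sketch below assumes the worst-case data loss equals the interval between backups and that the restore time comes from a tested recovery drill; the function name and sample values are hypothetical.

```python
from datetime import timedelta


def meets_objectives(
    backup_interval: timedelta,
    estimated_restore_time: timedelta,
    rpo: timedelta,
    rto: timedelta,
) -> dict:
    """Check a backup schedule and restore estimate against RPO and RTO."""
    return {
        # Worst case: the failure happens just before the next backup runs.
        "rpo_met": backup_interval <= rpo,
        "rto_met": estimated_restore_time <= rto,
    }


result = meets_objectives(
    backup_interval=timedelta(hours=24),
    estimated_restore_time=timedelta(hours=2),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=4),
)
print(result)
```

Here nightly backups cannot satisfy a 4-hour RPO even though the restore itself is fast enough, so the fix is more frequent backups or replication rather than a faster runbook.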
For a full comparison, see high availability vs disaster recovery.
Choosing the right approach
| Factor | Fault tolerance | Disaster recovery |
|--------|:---:|:---:|
| Cost | High (redundant infrastructure always running) | Lower (standby can be scaled down) |
| Recovery time | Near-zero (automatic) | Minutes to hours (manual or semi-automated) |
| Complexity | High (must handle failover logic) | Moderate (must maintain and test procedures) |
| Suitable for | Mission-critical, zero-tolerance services | Important but tolerates brief outages |
Most organizations use a combination: fault tolerance for their most critical services and disaster recovery for everything else.
Error budgets
The error budget concept, introduced by Google's SRE team, transforms availability from an abstract target into a practical decision-making tool. [3]
How error budgets work
If your SLO is 99.9% availability per month, that means you can tolerate 0.1% unavailability, which is approximately 43 minutes of downtime per month. That 43 minutes is your error budget.
The error budget is not just a number. It is a resource that gets "spent" by:
- Unplanned outages
- Planned maintenance windows
- Deployments that cause brief unavailability
- Performance degradation that causes timeouts
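The budget and what remains of it can be computed directly. This is a minimal sketch assuming a 30-day month of 43,200 minutes; the function names and the sample outage durations are hypothetical.

```python
def error_budget_minutes(slo_pct: float, period_minutes: float = 43_200) -> float:
    """Total downtime the SLO allows over the period (default: 30 days)."""
    return period_minutes * (100 - slo_pct) / 100


def remaining_budget(slo_pct, downtime_events_minutes, period_minutes=43_200):
    """Budget left after subtracting every event that spent it."""
    budget = error_budget_minutes(slo_pct, period_minutes)
    return budget - sum(downtime_events_minutes)


# Hypothetical month: an outage, a deployment blip, and a maintenance window.
spent = [12, 5.5, 20]
print(error_budget_minutes(99.9), remaining_budget(99.9, spent))
```

A negative result means the SLO has already been missed for the period.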
Using error budgets for decisions
When you have a healthy error budget (plenty of budget remaining), you can:
- Deploy more aggressively
- Run experiments and migrations
- Accept some risk for faster feature delivery
When your error budget is depleted (you have already used your allowed downtime):
- Slow down deployments
- Focus on reliability improvements
- Require additional review for changes
- Invest in automated testing and canary deployments
This turns the tension between velocity (shipping features fast) and reliability (keeping the service stable) into a quantifiable tradeoff.
Tracking error budget burn
Monitor your error budget consumption over time. A burn rate that is too high early in the month signals that you will exceed your budget. A burn rate that is consistently near zero might indicate that your SLO is too conservative and you could be moving faster.
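One common way to express this is a burn rate: the fraction of budget consumed divided by the fraction of the period elapsed. The sketch below uses that definition as an assumption; the function name and sample numbers are hypothetical.

```python
def burn_rate(
    downtime_so_far_min: float,
    elapsed_min: float,
    slo_pct: float,
    period_min: float = 43_200,
) -> float:
    """Ratio of budget consumed to time elapsed.

    1.0 means the budget runs out exactly at the end of the period;
    above 1.0 it runs out early; near 0 it is barely being used.
    """
    if elapsed_min <= 0:
        raise ValueError("elapsed_min must be positive")
    budget = period_min * (100 - slo_pct) / 100
    consumed_fraction = downtime_so_far_min / budget
    elapsed_fraction = elapsed_min / period_min
    return consumed_fraction / elapsed_fraction


# Ten days into a 30-day month with 21.6 minutes of downtime on a 99.9% SLO.
rate = burn_rate(21.6, elapsed_min=10 * 24 * 60, slo_pct=99.9)
print(round(rate, 2))
```

In this example half the budget is gone after a third of the month, so at the current pace the budget would be exhausted around day 20.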
See how to calculate uptime for the math behind tracking these numbers.
Error budgets only work if your organization commits to the consequences. If depleting the error budget does not actually slow down deployments, it is just a dashboard metric that gets ignored. The value of error budgets comes from the policy decisions attached to them.
Calculating and tracking availability
The basic formula
Availability = (Total time - Downtime) / Total time * 100
For a 30-day month (43,200 minutes):
- 43 minutes of downtime = (43,200 - 43) / 43,200 * 100 ≈ 99.90%
- 4 minutes of downtime = (43,200 - 4) / 43,200 * 100 ≈ 99.99%
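The same formula as a small function, with input validation added. The name `availability_pct` is hypothetical.

```python
def availability_pct(total_minutes: float, downtime_minutes: float) -> float:
    """Availability = (total time - downtime) / total time * 100."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime must be between 0 and total_minutes")
    return (total_minutes - downtime_minutes) / total_minutes * 100


month = 30 * 24 * 60  # 43,200 minutes in a 30-day month
for downtime in (43, 4):
    print(downtime, round(availability_pct(month, downtime), 3))
```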
What to include in "downtime"
Define your measurement criteria before you start tracking:
- Full outages (HTTP 5xx or connection refused): Always count these.
- Partial outages (some pages or features broken): Count these proportionally or based on affected user percentage.
- Performance degradation (slow but functional): Set a threshold (e.g., page load over 10 seconds counts as down).
- Scheduled maintenance: Some organizations exclude this; others include it. Excluding it makes the number look better but does not reflect the user's experience.
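Counting partial outages proportionally can be sketched by weighting each incident by the share of users it affected. The `Incident` record and the sample figures are hypothetical, and the weighting rule is one reasonable choice among several.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    minutes: float
    affected_fraction: float  # 1.0 for a full outage, 0.25 if a quarter of users were hit


def weighted_downtime(incidents) -> float:
    """Downtime in minutes, scaled by the share of users affected."""
    return sum(i.minutes * i.affected_fraction for i in incidents)


incidents = [
    Incident(minutes=20, affected_fraction=1.0),   # full outage
    Incident(minutes=60, affected_fraction=0.25),  # checkout broken for some users
]
print(weighted_downtime(incidents))
```

With this rule, an hour-long partial outage affecting a quarter of users counts the same as 15 minutes of full downtime.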
Monitoring methods
Synthetic monitoring sends automated requests to your site at regular intervals and measures the response. This gives you consistent, comparable data but only tests from the monitoring service's locations and only tests the specific URLs you configure.
Real user monitoring (RUM) captures data from actual user interactions. This gives you real-world availability data but is noisier and only captures data when users are visiting.
For best results, use both. Synthetic monitoring provides the baseline and catches outages outside business hours. RUM provides real-world context.
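A single synthetic check can be sketched with the Python standard library alone. The `probe` and `classify` names, the "up/degraded/down" labels, and the 10-second threshold are assumptions for illustration; a monitoring service would run checks like this from several locations on a schedule.

```python
import time
import urllib.error
import urllib.request


def classify(status, elapsed_s, slow_threshold_s=10.0):
    """Turn a status code (None if unreachable) and a timing into a verdict."""
    if status is None or status >= 500:
        return "down"
    if elapsed_s > slow_threshold_s:
        return "degraded"
    # Note: 4xx responses are treated as "up" here, since the server answered.
    return "up"


def probe(url, timeout_s=10.0):
    """Request the URL once and return (verdict, elapsed seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server responded with a 4xx or 5xx code
    except OSError:
        status = None      # DNS failure, connection refused, or timeout
    elapsed = time.monotonic() - start
    return classify(status, elapsed), elapsed


# Example usage (performs a real network request):
# print(probe("https://example.com/"))
```

Keeping the classification separate from the network call makes the downtime definition easy to test and easy to change.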
See what is uptime monitoring and what is endpoint monitoring for method details.
Incident response metrics
Beyond overall availability, track these operational metrics:
- MTTD (Mean Time to Detect): How long between the start of an outage and when your team learns about it. Good monitoring drives this toward zero. See MTTA and MTTD explained.
- MTTA (Mean Time to Acknowledge): How long between detection and someone taking ownership of the incident.
- MTTR (Mean Time to Resolve): How long between the start of an outage and full recovery. See MTTR explained.
- MTBF (Mean Time Between Failures): How long your service runs between outages. See MTBF explained.
These metrics tell you not just how available you are, but how effective your operational response is. See incident response metrics for a complete framework.
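Given timestamps for each incident, all four metrics fall out of simple averages. The sketch below follows the definitions in the list above; the `Incident` record and function names are hypothetical, and MTBF is measured here from one incident's resolution to the next incident's start.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime
    detected: datetime
    acknowledged: datetime
    resolved: datetime


def mean_delta(deltas) -> timedelta:
    """Average a collection of timedeltas."""
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def response_metrics(incidents) -> dict:
    """MTTD, MTTA, and MTTR across a non-empty list of incidents."""
    return {
        "MTTD": mean_delta(i.detected - i.started for i in incidents),
        "MTTA": mean_delta(i.acknowledged - i.detected for i in incidents),
        "MTTR": mean_delta(i.resolved - i.started for i in incidents),
    }


def mtbf(incidents):
    """Mean healthy time between consecutive incidents, or None if fewer than two."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [later.started - earlier.resolved for earlier, later in zip(ordered, ordered[1:])]
    return mean_delta(gaps) if gaps else None
```

Tracking these per month, alongside the availability percentage, shows whether improvements are coming from fewer failures or from faster response.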
Reporting availability
When reporting availability to stakeholders:
- Report over consistent time periods (monthly, quarterly)
- Include the definition of downtime used
- Show both the raw number and the number of incidents
- Compare against the SLO/SLA target
- Include context for significant incidents
- Track the trend over time, not just the current number
Building your availability program
Step 1: Define your targets
Start by determining what availability level your service actually needs, not what sounds impressive. A personal blog does not need five nines. An emergency response system does.
Consider:
- What is the cost of downtime for your service?
- What do your customers or users expect?
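- What is the availability of your dependencies? (If your service cannot function without a dependency, it cannot be more available than that dependency, and it is usually less available because the failure odds of each hard dependency compound.)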
- What can you afford to invest in infrastructure and operations?
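The dependency point can be made concrete by multiplying availabilities. The sketch assumes every dependency is a hard requirement and that failures are independent; both are simplifications, and the sample figures are hypothetical.

```python
from math import prod


def serial_availability(dependency_availabilities_pct) -> float:
    """Upper bound on availability when every listed component must be up.

    Assumes independent failures and no fallbacks or caching.
    """
    return 100 * prod(a / 100 for a in dependency_availabilities_pct)


# Hypothetical stack: hosting, database, and a payment API at 99.9% each.
print(round(serial_availability([99.9, 99.9, 99.9]), 4))
```

Three components at 99.9% each yield roughly 99.7% combined, so a 99.9% target for the whole service would need either better dependencies or ways to degrade gracefully when one is down.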
Step 2: Measure your current state
Before committing to those targets, measure where you are today. Set up uptime monitoring if you have not already. Collect at least 30 days of data to establish a baseline.
Step 3: Set SLOs
Based on your needs and current state, set internal SLOs. Make them achievable but aspirational. If you are currently at 99.5%, setting a 99.99% SLO is a stretch goal that requires architectural changes. Setting 99.9% is a reasonable next step.
Step 4: Implement monitoring and alerting
Monitoring is the foundation. Without it, you cannot measure availability, detect outages, or track improvement. Set up alerts that follow best practices. See uptime alerts best practices.
Step 5: Invest in reliability
Based on your gap analysis (where you are vs. where you need to be), invest in the areas with the highest impact:
- If most downtime is from deployments, invest in deployment safety (canary releases, automated rollback)
- If most downtime is from infrastructure, invest in redundancy
- If most downtime is from slow detection, invest in monitoring
- If most downtime is from slow resolution, invest in runbooks and automation
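Finding where most downtime comes from is a matter of totaling incident minutes by cause. The cause labels and sample data below are hypothetical; in practice they would come from your post-incident reviews.

```python
from collections import defaultdict


def downtime_by_cause(incidents):
    """Total downtime minutes per cause, largest first.

    `incidents` is an iterable of (cause, minutes) pairs.
    """
    totals = defaultdict(float)
    for cause, minutes in incidents:
        totals[cause] += minutes
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


history = [
    ("deployment", 30),
    ("infrastructure", 12),
    ("deployment", 25),
    ("slow detection", 8),
]
for cause, minutes in downtime_by_cause(history):
    print(f"{cause}: {minutes} min")
```

In this sample, deployments dominate, so deployment safety would be the first investment.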
See how to reduce website downtime and the uptime SLA guide for tactical recommendations.
Step 6: Review and iterate
Review your availability metrics monthly. Conduct post-incident reviews for every significant outage. Update your SLOs and investments based on what you learn.
Availability is not a destination. It is a practice that improves over time through consistent measurement, honest assessment, and targeted investment.
References
[1] Google, "Site Reliability Engineering: Monitoring Distributed Systems." https://sre.google/sre-book/monitoring-distributed-systems/
[2] ITIL, "Service Level Management," ITIL Foundation, AXELOS. https://www.axelos.com/best-practice-solutions/itil
[3] Google, "Site Reliability Engineering," O'Reilly, 2016. https://sre.google/sre-book/table-of-contents/
[4] AWS, "Reliability Pillar - AWS Well-Architected Framework." https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html
[5] NIST, "Contingency Planning Guide for Federal Information Systems," NIST SP 800-34 Rev 1. https://csrc.nist.gov/publications/detail/sp/800-34/rev-1/final
[6] Gartner, "The Cost of Downtime." https://www.gartner.com/en/documents/3956882
[7] Microsoft Azure, "SLA for Virtual Machines." https://azure.microsoft.com/en-us/support/legal/sla/virtual-machines/
[8] AWS, "Amazon Compute Service Level Agreement." https://aws.amazon.com/compute/sla/