SLAs & Availability Guide

Every hosting provider, cloud platform, and SaaS vendor promises high uptime. The numbers sound impressive: 99.9%, 99.99%, 99.999%. But the difference between those numbers is enormous when you translate them into actual downtime minutes per year. And the gap between what a vendor promises in their SLA and what they actually deliver is often even bigger.

Understanding uptime SLAs (what they cover, what they exclude, and how to hold vendors accountable) is essential for anyone responsible for keeping a website or service available. This guide breaks down the math, the contracts, the architectures, and the monitoring practices that separate teams who hit their availability targets from those who do not.

The nines of uptime

Uptime is expressed as the percentage of total time that a service is available. The industry refers to availability levels as "nines" because each level is named for the number of nines in the percentage: 99.9% is "three nines," 99.99% is "four nines," and so on. [1]

The jump from 99.9% to 99.99% looks like a 0.09 percentage-point improvement, but it is a tenfold reduction in allowed downtime: from about 8.8 hours per year to under 53 minutes. Each additional nine cuts the allowance by another factor of ten and typically requires substantially more investment in infrastructure, redundancy, and operational discipline.
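The arithmetic behind those figures can be sketched in a few lines of Python. This is a minimal illustration that assumes a 365-day year and a 30-day month; the function and constant names are hypothetical and not taken from any particular tool.

```python
# Allowed downtime for a given availability target.
# Assumptions: a 365-day year and a 30-day month.

MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200


def allowed_downtime_minutes(availability_pct: float, period_minutes: int) -> float:
    """Return the downtime allowance, in minutes, for the given period."""
    return (1 - availability_pct / 100) * period_minutes


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        per_year = allowed_downtime_minutes(target, MINUTES_PER_YEAR)
        per_month = allowed_downtime_minutes(target, MINUTES_PER_MONTH)
        print(f"{target}%: {per_year:9.1f} min/year, {per_month:7.2f} min/month")
```

Under these assumptions, 99.9% allows roughly 525.6 minutes (8.76 hours) per year and 99.99% allows roughly 52.6 minutes.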

For a deeper breakdown with examples, see uptime nines explained and the uptime calculator.

What counts as downtime

This sounds like a simple question, but the answer varies wildly depending on who is defining it. Downtime could mean:

The server returns a 5xx error
The page takes longer than a threshold (e.g., 10 seconds) to load
The service is unreachable from one or more monitoring locations
A critical feature (like checkout or login) is broken, even if the homepage loads
Scheduled maintenance windows (some SLAs exclude these)

Your definition of downtime determines your measured availability. A vendor that excludes scheduled maintenance, single-location failures, and performance degradation from their downtime calculation will report higher availability than one that counts everything. Always read the fine print.
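One way to make a downtime definition explicit is to write it as code. The sketch below is only an illustration: the `ProbeResult` type, the 10-second latency threshold, and the "at least two failing locations" rule are assumptions chosen for the example, not a standard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProbeResult:
    location: str
    status_code: Optional[int]   # None means the connection failed
    latency_seconds: float


def probe_is_down(result: ProbeResult, latency_threshold: float = 10.0) -> bool:
    """A probe counts as down on connection failure, a 5xx, or a slow response."""
    if result.status_code is None:
        return True
    if 500 <= result.status_code <= 599:
        return True
    return result.latency_seconds > latency_threshold


def service_is_down(results: List[ProbeResult], min_failing_locations: int = 2) -> bool:
    """Declare downtime only when enough locations agree (an assumed policy)."""
    failing = sum(1 for r in results if probe_is_down(r))
    return failing >= min_failing_locations
```

Lowering `min_failing_locations` to 1 counts single-location failures as downtime, which produces a stricter (lower) availability number from the same probe data.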

For calculating your own uptime, see how to calculate uptime.

SLA structure and terminology

A Service Level Agreement (SLA) is a contract between a service provider and a customer that defines the expected level of service, how it is measured, and what happens when the provider fails to meet it. [2]

SLI, SLO, and SLA

Google's Site Reliability Engineering team popularized a three-part framework for thinking about service levels: [3]

Service Level Indicator (SLI) is the metric you measure. It is a specific, quantifiable measure of service health. Examples:

Availability: the percentage of successful requests
Latency: the percentage of requests completed within a time threshold
Error rate: the percentage of requests that return errors
Throughput: requests per second successfully handled

Service Level Objective (SLO) is the target value for an SLI. It is an internal goal, not a contractual commitment. Examples:

"99.95% of requests should return a 2xx status code"
"95% of requests should complete in under 200ms"
"Error rate should stay below 0.1%"

Service Level Agreement (SLA) is the contractual commitment. It is an SLO with consequences attached: if the provider fails to meet the target, the customer gets something (usually service credits). The SLA target is typically set looser than the internal SLO so that the provider has a buffer before a contractual breach. If your SLO is 99.95%, your SLA might promise 99.9%.
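The relationship between the three terms can be sketched as follows. This is a hedged illustration using the 99.95% SLO and 99.9% SLA from the example above; the function names and the "no traffic counts as 100%" convention are assumptions.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Availability SLI: the percentage of requests that succeeded."""
    if total_requests == 0:
        return 100.0  # no traffic, nothing failed (an assumed convention)
    return 100.0 * successful_requests / total_requests


def evaluate(sli_pct: float, slo_pct: float = 99.95, sla_pct: float = 99.9) -> str:
    """Compare a measured SLI with the internal SLO and the contractual SLA."""
    if sli_pct < sla_pct:
        return "SLA breached"
    if sli_pct < slo_pct:
        return "SLO missed, SLA still met"
    return "SLO met"
```

A month with 999,400 successful requests out of 1,000,000 gives an SLI of 99.94%, which misses the internal SLO but still sits inside the SLA. That gap is the buffer.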

For a complete breakdown, see SLI vs SLO vs SLA.

Anatomy of an uptime SLA

A typical uptime SLA includes:

Uptime commitment. The headline number, usually expressed as a monthly percentage. "99.9% monthly uptime" is the most common for standard hosting.

Definition of downtime. What the provider considers an outage. Read this carefully. Common exclusions include:

Scheduled maintenance (often with advance notice requirements)
Force majeure events
Customer-caused outages (misconfiguration, exceeding resource limits)
Issues caused by third-party services
DDoS attacks (some providers exclude these, others do not)
Single-region failures in multi-region setups

Measurement method. How the provider measures uptime. Some use their own internal monitoring (which can be optimistic). Others accept external monitoring data. Some measure per-region, per-service, or per-instance rather than globally.

Remedies. What you get when the SLA is breached. Almost always service credits, not refunds. Typical credit schedules:

| Monthly uptime | Service credit |
|:---:|:---:|
| 99.0% to 99.9% | 10% |
| 95.0% to 99.0% | 25% |
| Below 95.0% | 50% |

Claim process. How to request credits. Most providers require you to file a claim within a specific window (often 30 days) with evidence of the outage. If you do not file, you do not get credits, even if the outage was widely publicized.
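To see what a credit schedule means in money terms, the typical tiers above can be expressed as a small function. This is a sketch under stated assumptions: a 99.9% commitment, tier lower bounds treated as inclusive, and hypothetical function names. Real SLA documents define their own boundaries.

```python
def service_credit_pct(monthly_uptime_pct: float) -> int:
    """Map measured monthly uptime to a credit percentage (example schedule)."""
    if monthly_uptime_pct >= 99.9:
        return 0    # commitment met, no credit
    if monthly_uptime_pct >= 99.0:
        return 10
    if monthly_uptime_pct >= 95.0:
        return 25
    return 50


def credit_amount(monthly_fee: float, monthly_uptime_pct: float) -> float:
    """Credit owed for the month, in the same currency as the fee."""
    return monthly_fee * service_credit_pct(monthly_uptime_pct) / 100
```

For example, `credit_amount(500, 98.0)` returns `125.0`: a month at 98% uptime on a $500 bill yields a $125 credit, and only if the claim is filed in time.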

What SLAs do not cover

SLA credits rarely come close to covering the actual cost of downtime. If your e-commerce site earns $10,000 per hour and goes down for 4 hours, the $40,000 in lost revenue dwarfs a 25% credit on a $500/month hosting bill, which comes to $125.

SLAs are a baseline commitment, not insurance. They tell you the minimum service level the provider is willing to be held accountable for. Your actual availability needs to exceed the SLA, and you need your own monitoring and redundancy to achieve that.

For more on the real cost implications, see cost of website downtime.

SLA credits are designed to compensate you for the inconvenience, not for business losses. If your business requires true high availability, you need to build it yourself through redundant architecture rather than relying on a single provider's SLA.

What hosting providers promise vs. deliver

Cloud provider SLAs

Major cloud providers offer tiered SLAs based on the service and architecture:

AWS offers 99.99% for EC2 at the Region level, which requires deploying instances across two or more Availability Zones. The single-instance SLA is lower. Compute, storage, and networking each have separate SLAs.

Google Cloud Platform offers 99.99% for Compute Engine with instances spread across multiple zones. Their SLA is notable for being relatively straightforward in its definitions.

Microsoft Azure offers 99.99% for Virtual Machines deployed across two or more Availability Zones. Single-instance VMs with premium SSD get 99.9%.

Cloudflare offers a 100% uptime SLA on its Enterprise plan. Its standard plans have no formal SLA.

The pattern is clear: higher SLAs require you to architect for redundancy. A single server, regardless of provider, will not achieve 99.99% availability.
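A rough way to see why redundancy raises the achievable number is the standard formula for components in parallel. The sketch below is idealized: it assumes instance failures are independent and that any one healthy instance can carry all traffic, which real deployments only approximate. The 99.5% per-instance figure in the usage note is hypothetical.

```python
def parallel_availability(instance_availability: float, instances: int) -> float:
    """Availability of N redundant instances.

    Assumes independent failures and that a single healthy instance
    can serve all traffic. Availability is a fraction, e.g. 0.995.
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return 1 - (1 - instance_availability) ** instances
```

With a hypothetical 99.5% per instance, two instances give about 99.9975% under these assumptions. Correlated failures (a shared zone, a bad deploy) pull the real figure lower, which is why providers tie their highest SLAs to multi-zone layouts.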

Hosting provider SLAs

Traditional hosting providers often promise impressive numbers that are less impressive once you read the exclusions:

"99.9% uptime guarantee" frequently excludes scheduled maintenance, which can add up to hours per month
Some providers measure uptime based on network availability, not application availability. Your server can have crashed while the network still counts as "up."
Budget hosting providers may offer "99.9% uptime" with a remedy of "prorated credit for downtime." On a $5/month plan, a full month of credit is $5.

How to evaluate SLA claims

When evaluating a provider's SLA:

Read the actual SLA document, not the marketing page. The marketing page says "99.99% uptime." The SLA document defines exactly what that means.
Check the exclusions. The longer the exclusion list, the less meaningful the uptime guarantee.
Look at the measurement method. Internal monitoring vs. external monitoring. Per-region vs. global.
Calculate the maximum credit. If the maximum credit is capped at one month's fee, the financial incentive for the provider to maintain uptime is limited.
Check historical performance. Status page archives, third-party monitoring reports, and community forums reveal actual availability. Use services like Is That Down to track vendor status.

See SLA monitoring tools for options.

High availability architectures

If your availability target exceeds what a single server can provide, you need to architect for high availability (HA). HA is the practice of designing systems that continue to operate when individual components fail. [4]

Redundancy at every layer

High availability requires eliminating single points of failure throughout your stack:

DNS. Use multiple DNS providers or a provider with an anycast network. A DNS outage makes your site unreachable even if your servers are running. See what is high availability for foundational concepts.

Load balancers. Multiple load balancers in an active-active or active-passive configuration. If the load balancer is a single point of failure, redundant servers behind it do not help.

Application servers. Multiple instances behind a load balancer. If one instance fails or needs maintenance, others handle the traffic.

Databases. Primary-replica replication with automatic failover. Consider multi-region replication for the highest availability tiers.

Storage. Redundant storage (RAID, distributed storage systems, cloud object storage with built-in redundancy).

Network. Multiple network paths, multiple ISPs, multiple data centers.

SSL/TLS certificates. Monitor certificate expiry because an expired cert takes down HTTPS availability even when servers are healthy.
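Certificate expiry is one of the easier items in this list to check automatically. The following is a minimal sketch using only Python's standard library; the function names are hypothetical, and any alerting threshold (for example, 14 days) is left to the caller as an assumption.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter timestamp.

    not_after uses the format found in getpeercert(),
    e.g. 'Jan  5 09:34:43 2018 GMT'. now is seconds since the epoch.
    """
    current = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - current) / 86400


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to the host and return its certificate's notAfter string.

    With the default context, an already-expired certificate fails the
    handshake and raises an SSL error instead of returning a value.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `days_until_expiry(fetch_not_after("example.com"))` and alert when the result drops below the chosen threshold.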

Active-active vs. active-passive

Active-active means all instances are serving traffic simultaneously. If one fails, the remaining instances absorb its load. This provides the best utilization and fastest failover, but requires your application to handle concurrent writes and session consistency.

Active-passive means one instance (or set of instances) serves traffic while the standby waits. If the active fails, the passive takes over. This is simpler to implement but wastes the standby resources during normal operation, and failover takes longer.

For the tradeoffs, see high availability vs fault tolerance.

Geographic distribution

For the highest availability, distribute across geographic regions:

Multi-region deployment protects against entire data center failures, regional network outages, and natural disasters.
CDN edge caching serves static content from locations close to users, providing availability for cached content even if your origin is down.
DNS-based failover routes users to the nearest healthy region.

See high availability hosting for provider-specific guidance.

Fault tolerance vs. disaster recovery

These terms are related but address different problems. Understanding the distinction helps you invest appropriately. [5]

Fault tolerance

Fault tolerance is the ability to continue operating when a component fails, with no user-visible interruption. The system detects the failure and routes around it automatically. Examples:

A load balancer that detects a failed server and stops routing traffic to it
A database replica that automatically promotes to primary when the primary fails
A CDN that serves cached content when the origin server is unreachable

Fault tolerance provides continuous availability. Users never know a failure occurred.
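The first example, a load balancer that routes around a failed server, can be sketched as a round-robin selector that skips backends marked unhealthy. This is an illustration only: the class and method names are hypothetical, and a real load balancer would drive `mark_down` and `mark_up` from periodic health checks.

```python
from itertools import cycle
from typing import List


class HealthAwareBalancer:
    """Round-robin over backends, skipping any that are marked down."""

    def __init__(self, backends: List[str]) -> None:
        self._backends = list(backends)
        self._healthy = set(backends)
        self._ring = cycle(self._backends)

    def mark_down(self, backend: str) -> None:
        self._healthy.discard(backend)

    def mark_up(self, backend: str) -> None:
        if backend in self._backends:
            self._healthy.add(backend)

    def next_backend(self) -> str:
        # Try each backend at most once per call before giving up.
        for _ in range(len(self._backends)):
            candidate = next(self._ring)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends available")
```

As long as at least one backend stays healthy, callers keep getting a usable target and never see the failure directly.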

Disaster recovery

Disaster recovery (DR) is the process of restoring service after a major failure. Unlike fault tolerance, DR accepts that there will be downtime. The goal is to minimize recovery time (RTO, Recovery Time Objective) and data loss (RPO, Recovery Point Objective). [5]

DR planning addresses scenarios like:

Complete data center failure
Catastrophic data corruption
Ransomware attacks
Major cloud provider outages

DR involves backups, runbooks, tested recovery procedures, and often a separate "cold" or "warm" standby environment that can be brought online when the primary fails.
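RTO and RPO are easiest to reason about as simple time comparisons. The sketch below assumes that data written after the last successful backup is lost, so the gap between that backup and the failure is the effective data loss; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup: datetime, failure_time: datetime, rpo: timedelta) -> bool:
    """Data written after the last backup is lost, so that gap must fit in the RPO."""
    return failure_time - last_backup <= rpo


def meets_rto(failure_time: datetime, restored_time: datetime, rto: timedelta) -> bool:
    """Service must be restored within the RTO, measured from the failure."""
    return restored_time - failure_time <= rto
```

Running checks like these against the timestamps from a recovery drill shows whether the backup interval and the restore procedure actually fit the stated objectives.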

For a full comparison, see high availability vs disaster recovery.

Choosing the right approach

| Factor | Fault tolerance | Disaster recovery |
|--------|:---:|:---:|
| Cost | High (redundant infrastructure always running) | Lower (standby can be scaled down) |
| Recovery time | Near-zero (automatic) | Minutes to hours (manual or semi-automated) |
| Complexity | High (must handle failover logic) | Moderate (must maintain and test procedures) |
| Suitable for | Mission-critical, zero-tolerance services | Important but tolerates brief outages |

Most organizations use a combination: fault tolerance for their most critical services and disaster recovery for everything else.

Error budgets

The error budget concept, introduced by Google's SRE team, transforms availability from an abstract target into a practical decision-making tool. [3]

How error budgets work

If your SLO is 99.9% availability per month, that means you can tolerate 0.1% unavailability, which is approximately 43 minutes of downtime in a 30-day month. That 43 minutes is your error budget.

The error budget is not just a number. It is a resource that gets "spent" by:

Unplanned outages
Planned maintenance windows
Deployments that cause brief unavailability
Performance degradation that causes timeouts
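Treating the budget as something that gets spent can be sketched directly. This example assumes a 30-day month and that each event's cost is recorded in minutes of unavailability; the function names are hypothetical.

```python
from typing import List

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200


def error_budget_minutes(slo_pct: float,
                         period_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Total unavailability the SLO allows over the period."""
    return (1 - slo_pct / 100) * period_minutes


def budget_remaining(slo_pct: float,
                     spent_minutes: List[float],
                     period_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Budget left after outages, maintenance, and bad deploys are subtracted.

    A negative result means the budget is exhausted and the SLO is missed.
    """
    return error_budget_minutes(slo_pct, period_minutes) - sum(spent_minutes)
```

With a 99.9% SLO, a 12-minute outage plus a 5.5-minute deployment blip leaves about 25.7 of the 43.2 minutes for the rest of the month.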

Using error budgets for decisions

When you have a healthy error budget (plenty of budget remaining), you can:

Deploy more aggressively
Run experiments and migrations
Accept some risk for faster feature delivery

When your error budget is depleted (you have already used your allowed downtime):

Slow down deployments
Focus on reliability improvements
Require additional review for changes
Invest in automated testing and canary deployments

This turns the tension between velocity (shipping features fast) and reliability (keeping the service stable) into a quantifiable tradeoff.

Tracking error budget burn

Monitor your error budget consumption over time. A burn rate that is too high early in the month signals that you will exceed your budget. A burn rate that is consistently near zero might indicate that your SLO is too conservative and you could be moving faster.
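Burn rate is commonly expressed as the fraction of budget consumed divided by the fraction of the period elapsed, so that 1.0 means "on pace to use exactly the whole budget." The sketch below follows that definition and assumes a 30-day month; the function name is hypothetical.

```python
def burn_rate(downtime_so_far_minutes: float,
              elapsed_minutes: float,
              slo_pct: float,
              period_minutes: float = 30 * 24 * 60) -> float:
    """Budget consumption relative to elapsed time.

    1.0 = on pace to spend exactly the budget by period end,
    above 1.0 = on pace to exceed it, near 0 = barely touching it.
    """
    if elapsed_minutes <= 0:
        raise ValueError("elapsed_minutes must be positive")
    budget = (1 - slo_pct / 100) * period_minutes
    budget_used_fraction = downtime_so_far_minutes / budget
    time_elapsed_fraction = elapsed_minutes / period_minutes
    return budget_used_fraction / time_elapsed_fraction
```

For a 99.9% SLO, 21.6 minutes of downtime after 7.5 days (10,800 minutes) is half the budget in a quarter of the month, a burn rate of about 2.0.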

See how to calculate uptime for the math behind tracking these numbers.

Error budgets only work if your organization commits to the consequences. If depleting the error budget does not actually slow down deployments, it is just a dashboard metric that gets ignored. The value of error budgets comes from the policy decisions attached to them.

Calculating and tracking availability

The basic formula

Availability = (Total time - Downtime) / Total time * 100

For a 30-day month (43,200 minutes):

43 minutes of downtime: (43,200 - 43) / 43,200 * 100 ≈ 99.90%
4 minutes of downtime: (43,200 - 4) / 43,200 * 100 ≈ 99.99%

What to include in "downtime"

Define your measurement criteria before you start tracking:

Full outages (HTTP 5xx or connection refused): Always count these.
Partial outages (some pages or features broken): Count these proportionally or based on affected user percentage.
Performance degradation (slow but functional): Set a threshold (e.g., page load over 10 seconds counts as down).
Scheduled maintenance: Some organizations exclude this; others include it. Excluding it makes the number look better but does not reflect the user's experience.
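The proportional approach to partial outages can be folded into the basic formula by weighting each outage by the share of users affected. This is one possible convention, sketched with a hypothetical `Outage` type and a 30-day month; it is not the only valid way to count.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Outage:
    duration_minutes: float
    affected_fraction: float = 1.0  # 1.0 = full outage, 0.25 = a quarter of users


def availability_pct(outages: List[Outage], period_minutes: float = 30 * 24 * 60) -> float:
    """Availability = (total time - weighted downtime) / total time * 100."""
    weighted_downtime = sum(o.duration_minutes * o.affected_fraction for o in outages)
    return (period_minutes - weighted_downtime) / period_minutes * 100
```

A 30-minute full outage plus a 60-minute incident affecting a quarter of users counts as 45 weighted minutes, or roughly 99.896% for the month.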

Monitoring methods

Synthetic monitoring sends automated requests to your site at regular intervals and measures the response. This gives you consistent, comparable data but only tests from the monitoring service's locations and only tests the specific URLs you configure.

Real user monitoring (RUM) captures data from actual user interactions. This gives you real-world availability data but is noisier and only captures data when users are visiting.

For best results, use both. Synthetic monitoring provides the baseline and catches outages outside business hours. RUM provides real-world context.

See what is uptime monitoring and what is endpoint monitoring for method details.

Incident response metrics

Beyond overall availability, track these operational metrics:

MTTD (Mean Time to Detect): How long between the start of an outage and when your team learns about it. Good monitoring drives this toward zero. See MTTA and MTTD explained.
MTTA (Mean Time to Acknowledge): How long between detection and someone taking ownership of the incident.
MTTR (Mean Time to Resolve): How long between the start of an outage and full recovery. See MTTR explained.
MTBF (Mean Time Between Failures): How long your service runs between outages. See MTBF explained.

These metrics tell you not just how available you are, but how effective your operational response is. See incident response metrics for a complete framework.
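Given timestamped incident records, all four metrics reduce to averaging time differences. The sketch below follows the definitions above; the `Incident` fields and function names are hypothetical, and MTBF is computed here as the gap between one incident's resolution and the next incident's start.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable, List


@dataclass
class Incident:
    started: datetime
    detected: datetime
    acknowledged: datetime
    resolved: datetime


def _mean_minutes(deltas: Iterable[timedelta]) -> float:
    return mean(d.total_seconds() for d in deltas) / 60


def mttd(incidents: List[Incident]) -> float:
    return _mean_minutes(i.detected - i.started for i in incidents)


def mtta(incidents: List[Incident]) -> float:
    return _mean_minutes(i.acknowledged - i.detected for i in incidents)


def mttr(incidents: List[Incident]) -> float:
    return _mean_minutes(i.resolved - i.started for i in incidents)


def mtbf(incidents: List[Incident]) -> float:
    """Mean uptime between incidents, in minutes. Needs at least two incidents."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [later.started - earlier.resolved
            for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("MTBF needs at least two incidents")
    return _mean_minutes(gaps)
```

All four functions return minutes, so the results can go straight into a monthly report alongside the availability percentage.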

Reporting availability

When reporting availability to stakeholders:

Report over consistent time periods (monthly, quarterly)
Include the definition of downtime used
Show both the raw number and the number of incidents
Compare against the SLO/SLA target
Include context for significant incidents
Track the trend over time, not just the current number

Building your availability program

Step 1: Define your targets

Start by determining what availability level your service actually needs, not what sounds impressive. A personal blog does not need five nines. An emergency response system does.

Consider:

What is the cost of downtime for your service?
What do your customers or users expect?
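What is the availability of your dependencies? (Without redundancy or graceful degradation, your service cannot be more available than its least available critical dependency, and chained dependencies multiply together to something lower still.)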
What can you afford to invest in infrastructure and operations?
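The dependency question has a simple worst-case model: when every dependency must be up for your service to work, their availabilities multiply. The sketch below assumes independent failures and hard (non-degradable) dependencies; the function name and the figures in the usage note are hypothetical.

```python
from math import prod
from typing import List


def serial_availability(dependency_availabilities: List[float]) -> float:
    """Combined availability when every dependency must be up.

    Availabilities are fractions (0.999 for 99.9%). Assumes independent
    failures and no fallback when a dependency is down.
    """
    return prod(dependency_availabilities)
```

For instance, a service that needs a 99.9% database, a 99.9% payment API, and a 99.95% DNS provider tops out at about 99.75% before counting any failures of its own.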

Step 2: Measure your current state

Before committing to specific SLOs, measure where you are today. Set up uptime monitoring if you have not already. Collect at least 30 days of data to establish a baseline.

Step 3: Set SLOs

Based on your needs and current state, set internal SLOs. Make them achievable but aspirational. If you are currently at 99.5%, setting a 99.99% SLO is a stretch goal that requires architectural changes. Setting 99.9% is a reasonable next step.

Step 4: Implement monitoring and alerting

Monitoring is the foundation. Without it, you cannot measure availability, detect outages, or track improvement. Set up alerts that follow best practices. See uptime alerts best practices.

Step 5: Invest in reliability

Based on your gap analysis (where you are vs. where you need to be), invest in the areas with the highest impact:

If most downtime is from deployments, invest in deployment safety (canary releases, automated rollback)
If most downtime is from infrastructure, invest in redundancy
If most downtime is from slow detection, invest in monitoring
If most downtime is from slow resolution, invest in runbooks and automation

See how to reduce website downtime and the uptime SLA guide for tactical recommendations.

Step 6: Review and iterate

Review your availability metrics monthly. Conduct post-incident reviews for every significant outage. Update your SLOs and investments based on what you learn.

Availability is not a destination. It is a practice that improves over time through consistent measurement, honest assessment, and targeted investment.

References

[1] Google, "Site Reliability Engineering: Monitoring Distributed Systems." https://sre.google/sre-book/monitoring-distributed-systems/
[2] ITIL, "Service Level Management," ITIL Foundation, AXELOS. https://www.axelos.com/best-practice-solutions/itil
[3] Google, "Site Reliability Engineering," O'Reilly, 2016. https://sre.google/sre-book/table-of-contents/
[4] AWS, "Reliability Pillar - AWS Well-Architected Framework." https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html
[5] NIST, "Contingency Planning Guide for Federal Information Systems," NIST SP 800-34 Rev 1. https://csrc.nist.gov/publications/detail/sp/800-34/rev-1/final
[6] Gartner, "The Cost of Downtime." https://www.gartner.com/en/documents/3956882
[7] Microsoft Azure, "SLA for Virtual Machines." https://azure.microsoft.com/en-us/support/legal/sla/virtual-machines/
[8] AWS, "Amazon Compute Service Level Agreement." https://aws.amazon.com/compute/sla/
