How to Set Up Uptime Alerts That Don't Cry Wolf
A practical guide to configuring uptime alerts that catch real problems without drowning your team in noise. Covers channels, thresholds, escalation, and alert fatigue.
The Alert That Nobody Noticed
Your monitoring tool detected a full outage at 3:17 AM. It sent an alert. Nobody responded. The site stayed down for two hours until a customer tweeted about it. When you checked the alert channel later, you found it buried under 47 other notifications from the past week, most of them false alarms or low-priority warnings that trained your team to stop paying attention.
This is alert fatigue. It is the most common failure mode in uptime monitoring, and it has nothing to do with the monitoring tool itself. The tool did its job. The problem was how the alerts were configured.
Getting alerts right is the difference between a monitoring setup that protects your business and one that just generates noise. This guide covers how to configure uptime alerts that surface real problems, reach the right people, and actually get a response. For the broader monitoring picture, see our complete uptime monitoring guide.
Choose the Right Alert Channels
Not every alert belongs in the same place. The channel you use should match the urgency and the audience.
Email
Email works for non-urgent notifications: weekly uptime reports, resolved-incident summaries, or informational alerts about degraded performance. It does not work for critical alerts. Email is asynchronous by nature. People check it on their own schedule. If your site is down, you need a channel that interrupts.
Use email for: summary reports, resolved notifications, low-severity warnings.
Slack and Microsoft Teams
Chat tools are a step up from email because they are more visible during working hours. A dedicated #alerts channel can work well for teams that are actively monitoring chat throughout the day. The risk is channel noise. If your alerts channel also contains deployment notifications, CI/CD updates, and bot messages, critical alerts get lost.
Use Slack/Teams for: business-hours alerting, team visibility into incidents, secondary notification for context.
SMS
Text messages interrupt. That is exactly what you want for critical alerts. SMS reaches people regardless of whether they are at their desk, in a meeting, or asleep. The tradeoff is that SMS is intrusive, which means it must be reserved for genuinely critical situations. If you send SMS alerts for every minor blip, your team will start ignoring those too.
Use SMS for: confirmed outages, critical services only, after-hours alerting.
PagerDuty, Opsgenie, and Incident Management Tools
Dedicated incident management platforms add structure that individual channels lack. They support on-call schedules, automatic escalation, acknowledgment tracking, and incident timelines. If an alert is not acknowledged within a defined window, it escalates to the next person, so no alert falls through the cracks.
Use incident management tools for: teams with on-call rotations, businesses where response time SLAs matter, any team larger than two or three people.
Combining Channels
The best setups layer multiple channels. A critical alert goes to Slack for team visibility plus SMS and PagerDuty for the on-call engineer. A warning goes to Slack only. A resolved notification goes to email. The key is that each channel has a purpose, and critical alerts always reach an interrupting channel.
Set Thresholds That Reduce Noise
A single failed check does not necessarily mean your site is down. Network blips, temporary DNS hiccups, and monitoring probe issues can all produce a single failed check followed by an immediate recovery. If you alert on every single failure, you will drown in false positives.
Confirm Before Alerting
Most monitoring tools let you require multiple consecutive failures before triggering an alert. Requiring two or three consecutive failed checks from different locations filters out transient issues while still catching real outages quickly. If your checks run every minute and you require two consecutive failures, you add one minute of delay before alerting. That is a worthwhile tradeoff for eliminating most false positives.
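Most tools expose this as a setting, but the logic is simple enough to sketch. The snippet below is a minimal illustration, not any particular tool's implementation: `FailureConfirmer` and `added_delay_seconds` are hypothetical names, and the default of two consecutive failures is an assumption taken from the example above.

```python
from dataclasses import dataclass


@dataclass
class FailureConfirmer:
    """Fires only after `required` consecutive failed checks (hypothetical helper)."""

    required: int = 2  # assumed confirmation count
    streak: int = 0    # consecutive failures seen so far

    def record(self, check_ok: bool) -> bool:
        """Record one check result; return True on the check that should trigger the alert."""
        if check_ok:
            # Any successful check resets the streak, so transient blips never alert.
            self.streak = 0
            return False
        self.streak += 1
        # Fire exactly once, on the check that completes the streak.
        return self.streak == self.required


def added_delay_seconds(interval_seconds: int, required: int) -> int:
    """Extra time between the first failed check and the alert."""
    return interval_seconds * (required - 1)
```

Feeding it the sequence pass, fail, pass, fail, fail, fail fires only on the fifth check: the lone failure in position two is absorbed, and the alert does not repeat on the sixth. With 60-second checks, `added_delay_seconds(60, 2)` gives the one-minute delay described above, and requiring three failures would make it two minutes.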
Use Multiple Check Locations
A check failing from one location might indicate a regional network issue or a problem with the monitoring probe itself. A check failing from three locations simultaneously almost certainly means your site is actually down. Configure your monitoring to check from multiple geographic locations and require failures from more than one location before alerting.
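As a rough sketch of the same idea, the function below treats a site as down only when enough probe locations agree. The function name, the location labels, and the default quorum of two are all assumptions for illustration.

```python
def confirmed_down(results_by_location: dict[str, bool], min_failing: int = 2) -> bool:
    """Return True when at least `min_failing` locations report a failed check.

    `results_by_location` maps a location label to True (check passed)
    or False (check failed).
    """
    failing = [location for location, ok in results_by_location.items() if not ok]
    return len(failing) >= min_failing
```

A failure seen only from `"us-east"` while `"eu-west"` and `"ap-south"` pass returns `False`, which points to a regional or probe issue rather than an outage. The same call with all three locations failing returns `True`.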
Distinguish Between Down and Degraded
Not every problem is a full outage. Your site might respond but with 5-second load times, or it might return a 200 status code but with an error message in the body. These are real problems that deserve attention, but they do not need the same urgency as a complete outage.
Set up separate alert levels:
- Critical: Site is completely unreachable (multiple consecutive failures from multiple locations)
- Warning: Site is responding but slowly (response time exceeds threshold) or returning unexpected content
- Info: Minor anomalies worth logging but not alerting on (single failed check that recovered immediately)
Only critical alerts should go to interrupting channels. Warnings can go to Slack. Info-level events can go to a log or dashboard.
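One way to make these levels concrete is a small classifier like the sketch below. Everything here is illustrative: the `CheckSummary` fields, the 5-second slow threshold (borrowed from the example above), and the confirmation counts are assumptions, and treating any unconfirmed failure as info is one possible policy, not the only one.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"
    OK = "ok"


@dataclass
class CheckSummary:
    consecutive_failures: int  # failed checks in a row
    failing_locations: int     # probe locations currently failing
    response_seconds: float    # latest response time
    content_ok: bool           # body matched the expected content


def classify(
    summary: CheckSummary,
    slow_after: float = 5.0,  # assumed response-time threshold
    min_failures: int = 2,    # assumed confirmation count
    min_locations: int = 2,   # assumed location quorum
) -> Severity:
    """Map a check summary to one of the alert levels described above."""
    confirmed = (
        summary.consecutive_failures >= min_failures
        and summary.failing_locations >= min_locations
    )
    if confirmed:
        return Severity.CRITICAL
    if summary.consecutive_failures > 0:
        # Failing but not yet confirmed: log it, do not alert.
        return Severity.INFO
    if summary.response_seconds > slow_after or not summary.content_ok:
        # Reachable, but slow or returning unexpected content.
        return Severity.WARNING
    return Severity.OK
```

The order of the checks matters: a confirmed outage wins over everything else, so a down site never shows up as a mere slowdown.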
Prevent Alert Fatigue
Alert fatigue is what happens when your team receives so many alerts that they stop treating any of them as urgent. According to PagerDuty's State of Digital Operations report, teams that receive more than a handful of alerts per day start to experience significant response degradation. The alerts keep coming, but the response times get longer and longer until critical alerts are treated the same as routine noise.
The Root Causes of Alert Fatigue
Too many low-value alerts. Every warning, every transient blip, every resolved notification adds to the noise. If your team gets 20 alerts a day and only 1 of them requires action, the other 19 are training them to ignore alerts.
Duplicate alerts. The same incident triggers alerts from multiple monitors, multiple channels, and multiple tools simultaneously, so one outage produces a cascade of 15 notifications. After a few of these, the instinct is to mute everything.
Alerts with no clear owner. When an alert goes to a group channel and nobody is specifically responsible, everyone assumes someone else will handle it. This is the bystander effect applied to incident response.
No differentiation between severity levels. When a minor API slowdown and a complete site outage both produce the same notification format in the same channel, neither feels urgent.
How to Fix It
Reduce alert volume ruthlessly. Review every alert rule you have. For each one, ask: "If this fires at 3 AM, does someone need to wake up and act?" If the answer is no, it should not be a critical alert. Downgrade it to a warning or remove it entirely.
Deduplicate. Group related alerts so that a single incident produces one notification, not ten. If you monitor five endpoints on the same server and the server goes down, you should get one "server down" alert, not five individual endpoint alerts.
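A minimal sketch of grouping, assuming each failure arrives as a hypothetical `(server, endpoint)` pair. Real tools usually group on a configurable key; the server name stands in for that key here.

```python
from collections import defaultdict


def group_by_server(failed_endpoints: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (server, endpoint) failures so each server appears once."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for server, endpoint in failed_endpoints:
        grouped[server].append(endpoint)
    return dict(grouped)


def notifications(failed_endpoints: list[tuple[str, str]]) -> list[str]:
    """Build one message per affected server instead of one per endpoint."""
    return [
        f"{server}: {len(endpoints)} endpoint(s) failing ({', '.join(sorted(endpoints))})"
        for server, endpoints in group_by_server(failed_endpoints).items()
    ]
```

Five failing endpoints on the same server collapse into a single message that still lists what is affected, so the responder loses no detail while the channel receives one notification.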
Assign clear ownership. Every alert should have an explicit recipient or on-call person. "The team" is not an owner. Use on-call schedules so there is always exactly one person responsible for responding.
Make severity visible. Use different formats, channels, or sounds for different severity levels. A critical alert should look and feel different from a warning. This is covered well in our guide on MTTA and MTTD, which explains how detection and acknowledgment speed depend on alert quality.
Build Escalation Chains
An escalation chain defines what happens when an alert is not acknowledged within a certain time window. Without escalation, an alert that reaches an unavailable person just sits there.
A Simple Escalation Model
- Minute 0: Alert goes to the primary on-call engineer via SMS and Slack
- Minute 10: If not acknowledged, alert escalates to the secondary on-call engineer
- Minute 20: If still not acknowledged, alert goes to the engineering manager
- Minute 30: If still not acknowledged, alert goes to the CTO or a leadership contact
The specific timeframes depend on your response time requirements. A business with a 15-minute response SLA needs tighter windows than one targeting a 1-hour response time.
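The model above can be sketched as data plus one function. This is an illustration under assumptions: `EscalationStep`, `CHAIN`, and `notified_so_far` are hypothetical names, the 10-minute steps mirror the example, and the chain is assumed to be sorted by time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes since the alert fired
    contact: str        # who gets notified at this step


# Assumed chain, sorted by after_minutes, mirroring the example above.
CHAIN = [
    EscalationStep(0, "primary on-call"),
    EscalationStep(10, "secondary on-call"),
    EscalationStep(20, "engineering manager"),
    EscalationStep(30, "leadership contact"),
]


def notified_so_far(
    minutes_elapsed: float,
    acknowledged_at: Optional[float] = None,
    chain: list[EscalationStep] = CHAIN,
) -> list[str]:
    """Return the contacts notified so far; acknowledgment stops further escalation."""
    notified: list[str] = []
    for step in chain:
        if step.after_minutes > minutes_elapsed:
            break  # this step's time has not arrived yet
        if notified and acknowledged_at is not None and acknowledged_at <= step.after_minutes:
            break  # someone acknowledged before this step was due
        notified.append(step.contact)
    return notified
```

At minute 25 with no acknowledgment, the first three contacts have been notified. If the primary acknowledges at minute 4, the list stays at just the primary no matter how much time passes. Tightening the windows for a stricter SLA means editing only the numbers in `CHAIN`.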
Keep Escalation Chains Short
Three to four levels is usually enough. If an alert has not been acknowledged after reaching four people, adding a fifth is unlikely to help. At that point, you have a process problem that needs a different solution.
Include Contact Information
Each level in the escalation chain should include how to reach the person, not just who they are. Phone numbers, Slack handles, and PagerDuty IDs should all be up to date. Stale contact information is a common point of failure.
Use Maintenance Windows
Planned maintenance causes expected downtime. If your monitoring tool does not know about scheduled maintenance, it will alert your team about downtime they already know about. These false alarms contribute directly to alert fatigue.
Most monitoring tools support maintenance windows: scheduled periods where alerts are suppressed for specific monitors. Use them every time you have planned downtime.
Before maintenance:
- Schedule the maintenance window in your monitoring tool
- Notify your team that alerts will be suppressed during that period
- Set the window slightly longer than the expected maintenance duration to account for overruns
After maintenance:
- Verify that monitoring has resumed and checks are passing
- Close the maintenance window if it has not auto-expired
Forgetting to set a maintenance window is a minor annoyance. Forgetting to close one is dangerous, because real outages during an open maintenance window will be silently suppressed. For more on minimizing downtime around these events, see our guide on how to reduce website downtime.
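One way to guard against a window that never closes is to give every window an explicit end time when it is created. The sketch below assumes hypothetical names (`MaintenanceWindow`, `schedule`) and a 15-minute overrun buffer; your monitoring tool's own scheduling feature is the real mechanism.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class MaintenanceWindow:
    monitor: str     # which monitor the window applies to
    start: datetime
    end: datetime    # explicit end, so the window expires on its own

    def suppresses(self, monitor: str, at: datetime) -> bool:
        """True if an alert for `monitor` at time `at` should be suppressed."""
        return monitor == self.monitor and self.start <= at < self.end


def schedule(
    monitor: str,
    start: datetime,
    expected: timedelta,
    buffer: timedelta = timedelta(minutes=15),  # assumed overrun allowance
) -> MaintenanceWindow:
    """Create a window slightly longer than the expected maintenance duration."""
    return MaintenanceWindow(monitor, start, start + expected + buffer)
```

Because suppression is scoped to one monitor and bounded by `end`, an outage on a different service, or on the same service after the window lapses, still alerts normally.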
Test Your Alerts Regularly
An alert configuration that you have never tested is an assumption, not a guarantee. Alert testing should happen in three situations:
After Initial Setup
When you first configure monitoring and alerts, trigger a test alert. Most tools have a "send test notification" button. Use it for every channel. Verify that the notification arrives, that it reaches the right person, and that it contains useful information (which monitor failed, from which location, at what time).
After Changes
Any change to your alert configuration, on-call schedule, notification channels, or escalation chains should be followed by a test. This includes changes on the receiving end, such as someone switching phone numbers or leaving the company.
On a Regular Schedule
Run a quarterly or monthly alert audit. Send test alerts through every channel and escalation level. Verify that on-call schedules are current, that phone numbers work, that Slack channels still exist, and that escalation contacts are still with the company. This short exercise prevents the scenario where your alerts have been silently broken for months.
Route Alerts by Severity and Service
Not every service needs the same alert configuration. Route by both service criticality and alert severity:
- Revenue-critical services (checkout, payment API): Critical alerts via SMS/PagerDuty to the on-call engineer, 24/7
- Customer-facing services (main site, documentation): Critical alerts via Slack and SMS during business hours, SMS-only after hours
- Internal services (admin panel, staging): Warning alerts via Slack during business hours only
For severity, reserve interrupting channels for critical alerts (site down). Use Slack for warnings (slow response, partial failure). Use dashboards or email digests for informational events (recovered, minor anomaly). This ensures the right people are interrupted for the right reasons.
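The routing rules above can be written down as a small function. This is a sketch under several assumptions: the tier and channel names are placeholder strings, business hours are taken to be 09:00 to 17:00 with weekends ignored, and internal-service alerts outside business hours are sent to a dashboard, which the rules above leave unspecified.

```python
from datetime import time

# Assumed business hours; weekends are ignored to keep the sketch short.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def is_business_hours(now: time) -> bool:
    return BUSINESS_START <= now < BUSINESS_END


def route(tier: str, severity: str, now: time) -> list[str]:
    """Pick notification channels from service tier, severity, and time of day."""
    if severity == "info":
        return ["dashboard"]
    if tier == "internal":
        # Internal services never interrupt anyone.
        return ["slack"] if is_business_hours(now) else ["dashboard"]
    if severity == "warning":
        return ["slack"]
    # From here on the alert is critical.
    if tier == "revenue-critical":
        return ["sms", "pagerduty"]  # same channels around the clock
    # Customer-facing (and any unrecognized tier).
    return ["slack", "sms"] if is_business_hours(now) else ["sms"]
```

A critical checkout alert at 3:17 AM resolves to SMS and PagerDuty, while a warning on staging at the same hour goes only to the dashboard. Keeping the rules in one function makes it easy to review them against the 3 AM question from earlier.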
The metrics that tell you whether your alerting is healthy are incident response metrics like MTTA (mean time to acknowledge). If MTTA is creeping up, your alert setup needs attention. If your false positive rate is high, you need stricter confirmation thresholds, such as more consecutive failures or more failing locations before an alert fires.
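Both numbers are easy to compute from an export of past alerts. The sketch below assumes a hypothetical `AlertRecord` shape with a fired time, an optional acknowledgment time, and a flag for whether the alert required action. MTTA is the mean of acknowledgment time minus fired time over acknowledged alerts, and the false positive rate is the share of alerts that needed no action.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AlertRecord:
    fired_at: datetime
    acknowledged_at: Optional[datetime]  # None if nobody acknowledged
    actionable: bool                     # did the alert require action?


def mtta(records: list[AlertRecord]) -> Optional[timedelta]:
    """Mean time to acknowledge, over acknowledged alerts only."""
    waits = [
        r.acknowledged_at - r.fired_at
        for r in records
        if r.acknowledged_at is not None
    ]
    if not waits:
        return None
    return sum(waits, timedelta()) / len(waits)


def false_positive_rate(records: list[AlertRecord]) -> float:
    """Share of alerts that did not require action."""
    if not records:
        return 0.0
    return sum(1 for r in records if not r.actionable) / len(records)
```

Note that unacknowledged alerts are excluded from the mean here, which can make MTTA look better than it is. Tracking the count of never-acknowledged alerts alongside it gives a more honest picture.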
Getting alerts right is an ongoing process, not a one-time configuration. Review your setup quarterly, adjust thresholds based on what you learn, and always prioritize signal over noise. For a broader look at how vendor outages affect your alerting strategy, see the vendor outage response playbook.
The goal is not more alerts. The goal is the right alerts reaching the right people at the right time. Every alert that does not require action is actively making your monitoring worse by training your team to ignore notifications.
References
- PagerDuty, "State of Digital Operations," https://www.pagerduty.com/resources/reports/digital-operations/
- Beyer, B., Jones, C., Petoff, J., Murphy, N.R., Site Reliability Engineering, O'Reilly Media, https://sre.google/sre-book/table-of-contents/
- Atlassian, "Incident Management Best Practices," https://www.atlassian.com/incident-management/on-call/alert-fatigue