Uptime monitoring is so simple that everyone underestimates how easy it is to set up wrong. A poorly configured monitor either misses real outages (silent and dangerous) or pages you on every regional networking blip (loud and exhausting). Here's how to get it right.
Probe from multiple regions
A check from a single region cannot tell you whether your site is actually down or whether the network between the probe and your origin is having a bad day. Use a service that probes from at least 3 regions (for example US, EU, and Asia) and requires consensus, such as failures in a majority of regions, before alerting.
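As a rough illustration of the consensus idea, here is a minimal sketch in Python. The function name `should_alert`, the region labels, and the default quorum of 2 out of 3 are assumptions made for this example, not the behavior of any particular monitoring service.

```python
from typing import Mapping


def should_alert(results: Mapping[str, bool], quorum: int = 2) -> bool:
    """Return True when at least `quorum` regions report a failed probe.

    `results` maps a region name to True (probe succeeded) or False (failed).
    """
    failed = sum(1 for ok in results.values() if not ok)
    return failed >= quorum


# One failing region is more likely a network problem near that probe.
print(should_alert({"us": True, "eu": False, "asia": True}))   # False
# Two failing regions meet the quorum, so this counts as an outage.
print(should_alert({"us": False, "eu": False, "asia": True}))  # True
```

With three regions, a quorum of 2 amounts to a simple majority. A stricter setup could require all regions to fail, at the cost of missing outages that only affect part of the world.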
Validate the response, not just the status code
A 200 OK doesn't mean your app is working. It might be returning an empty body, a "site maintenance" page, or a stale cached response. Configure your monitor to:
- Check response status code (default: 2xx).
- Check that the response body contains a known string (e.g. "Welcome to Acme").
- Check response time is under your SLA (e.g. < 2s).
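The three checks above can be sketched as a single function. This is only an illustration: `CheckResult` and `evaluate_response` are hypothetical names, and the defaults simply reuse the example values from the list ("Welcome to Acme", 2 seconds).

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    ok: bool
    reason: str


def evaluate_response(status: int, body: str, elapsed_s: float,
                      expected_text: str = "Welcome to Acme",
                      max_seconds: float = 2.0) -> CheckResult:
    """Apply the status, body, and response-time checks in order."""
    # 1. Status code must be in the 2xx range.
    if not 200 <= status < 300:
        return CheckResult(False, f"unexpected status {status}")
    # 2. Body must contain the known string (catches empty or maintenance pages).
    if expected_text not in body:
        return CheckResult(False, "expected text missing from body")
    # 3. Response time must be strictly under the limit.
    if elapsed_s >= max_seconds:
        return CheckResult(False, f"slow response: {elapsed_s:.2f}s")
    return CheckResult(True, "ok")


# A 200 response serving a maintenance page still fails the check.
print(evaluate_response(200, "<h1>Site maintenance</h1>", 0.3))
```

The function takes the status, body, and elapsed time as plain values, so it works with whatever HTTP client collects them and can be tested without touching the network.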
Monitor your dependencies, not just your site
Your site can be "up" while your payment processor, CDN, or auth provider is down — and your users still can't do what they came for. Add monitors for:
- Stripe / payment provider status endpoint.
- Auth0 / Cognito / your auth provider.
- Critical SaaS API endpoints you depend on.
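One way to picture this is a monitor list that includes dependencies alongside the site itself. The sketch below is hypothetical throughout: the `Monitor` type, the `blocking_failures` helper, and the `example.com` URLs are placeholders, not real provider endpoints. Look up each provider's actual status or health endpoint in its documentation.

```python
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class Monitor:
    name: str
    url: str        # placeholder URL; replace with the real endpoint
    critical: bool  # True if users are blocked when this is down


MONITORS = [
    Monitor("site", "https://example.com/health", True),
    Monitor("payments", "https://payments.example.com/status", True),
    Monitor("auth", "https://auth.example.com/status", True),
    Monitor("saas-api", "https://api.example.com/ping", False),
]


def blocking_failures(up: Mapping[str, bool]) -> List[str]:
    """Return names of critical monitors that are down.

    A monitor with no result in `up` is treated as down.
    """
    return [m.name for m in MONITORS if m.critical and not up.get(m.name, False)]


# The site responds, but users still cannot pay.
print(blocking_failures({"site": True, "payments": False, "auth": True, "saas-api": True}))
# ['payments']
```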
Require multiple consecutive failures before alerting
Single-failure alerts page you for transient blips. Require 3 consecutive failures across regions before sending an alert. The false-positive rate drops dramatically, while real-outage MTTR rises by the added detection delay (roughly 2 to 3 minutes with 1-minute checks), an acceptable trade-off for most teams.
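A minimal sketch of the consecutive-failure rule, assuming one call per check cycle. The class name `FailureStreak` and its `record` method are made up for this example.

```python
class FailureStreak:
    """Fire an alert only after `threshold` failed checks in a row."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.streak = 0

    def record(self, ok: bool) -> bool:
        """Record one check result; return True if an alert should fire now."""
        if ok:
            # Any success resets the streak, so isolated blips never alert.
            self.streak = 0
            return False
        self.streak += 1
        # Fire exactly once, when the streak first reaches the threshold.
        return self.streak == self.threshold


tracker = FailureStreak(threshold=3)
checks = [False, False, True, False, False, False, False]
print([tracker.record(ok) for ok in checks])
# [False, False, False, False, False, True, False]
```

The first two failures are cancelled by the success that follows them. The alert fires on the third failure of the later run, and the fourth failure stays quiet so the same outage does not send repeat alerts.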
Send a recovery alert
The "is it back yet" question is asked 50 times during an incident. A recovery alert (a single email when monitors return to up) saves the on-call engineer from constantly refreshing the status page.
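In code, a recovery alert comes down to notifying on state changes only. The sketch below assumes `healthy` is the monitor's overall verdict for each check cycle; `AlertState` and the "down" / "recovered" labels are hypothetical names for this example.

```python
from typing import Optional


class AlertState:
    """Emit one notification per transition, not one per check."""

    def __init__(self) -> None:
        self.down = False

    def update(self, healthy: bool) -> Optional[str]:
        """Return "down", "recovered", or None if nothing changed."""
        if not healthy and not self.down:
            self.down = True
            return "down"
        if healthy and self.down:
            self.down = False
            return "recovered"
        return None


state = AlertState()
print([state.update(h) for h in [True, False, False, True, True]])
# [None, 'down', None, 'recovered', None]
```

Each incident produces exactly two messages: one when it starts and one when it ends.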
Don't monitor what you can't fix
A monitor that pages you for things outside your control (e.g. a third-party blog you don't host) just adds noise. If you can't act on the alert, don't create it.