“The shift was quiet. They'd been using Datadog for weeks, mostly out of obligation. Then custom metrics and dashboards solved a problem they'd been routing around, and suddenly the friction of alert fatigue (too many alerts means none of them get taken seriously) felt absurd. They couldn't go back.”
When it's 2:47 AM and PagerDuty fires, I want alerts that fire only for real problems and stay quiet for everything else, so I can rely on dashboards that help during incidents, not ones that just look good in reviews.
A site reliability engineer or DevOps engineer responsible for the uptime and performance of production systems. They chose Datadog because it combines metrics, traces, logs, and alerts in one place — but now they're paying for all of it and the bill is terrifying. They've built dashboards that are beautiful, alerts that are precise, and runbooks that nobody reads. They are the person who gets paged at 3 AM and needs to determine in 90 seconds whether this is a real incident or a flapping alert.
To build alerts that fire for real problems and stay quiet for everything else, reliably, without workarounds, and without becoming the team's single point of failure for Datadog, while leveraging APM and distributed tracing across microservices.
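A minimal sketch of what "distributed tracing across microservices" looks like in practice, assuming the team instruments Python services with Datadog's ddtrace client; the service and resource names ("checkout-api", "orders-db") and the data-access helper are illustrative, not part of the persona's actual stack.

```python
# Manual span instrumentation with Datadog's ddtrace client (sketch).
from ddtrace import tracer

@tracer.wrap(service="checkout-api", resource="GET /orders/<id>")
def get_order(order_id: str) -> dict:
    # Child span around the database call, so a slow query shows up
    # as its own segment in the trace flame graph.
    with tracer.trace("postgres.query", service="orders-db") as span:
        span.set_tag("order.id", order_id)
        return fetch_order_from_db(order_id)  # hypothetical data-access helper

def fetch_order_from_db(order_id: str) -> dict:
    # Stand-in for the real query; replace with the actual data layer.
    return {"id": order_id, "status": "shipped"}
```

With spans like these in place, the 2:47 AM trace view can point directly at the one slow query holding connections, which is the investigative path the scenario below walks through.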
A site reliability engineer or DevOps engineer who trusts their setup. Alerting that fires for real problems and stays quiet for everything else is reliable enough that they've stopped double-checking it. Deployment-correlated alerting, which automatically links metric changes to recent deploys, reduces investigation time. They've moved from configuring Datadog to using it.
At 2:47 AM, PagerDuty fires. The SRE opens Datadog on their phone and sees API latency spiked to 5 seconds. They switch to their laptop, open the service dashboard, and see the database connection pool is saturated. They check traces — one slow query is holding connections. They check logs — the query started misbehaving after last night's deployment. Root cause: a missing index on a new table. Time to detect: 12 minutes. Time to identify: 25 minutes. Time to fix: 3 minutes. Total incident: 40 minutes. If the dashboard had correlated the deployment event with the latency spike automatically, they'd have saved 15 minutes.
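One way to get that deployment correlation is to have the deploy pipeline record each release as a Datadog event, so deploy markers overlay on dashboard graphs and show up next to the latency spike. A hedged sketch using the `datadog` Python package's Events API; the tag names, service name, and git SHA variable are illustrative.

```python
# Record each deploy as a Datadog event so dashboards and monitors can
# correlate metric changes with recent releases (sketch).
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

def record_deploy(service: str, git_sha: str, env: str = "production") -> None:
    api.Event.create(
        title=f"Deployed {service} {git_sha[:7]} to {env}",
        text=f"CI pipeline deployed {service} at commit {git_sha}.",
        tags=[f"service:{service}", f"env:{env}", "event_type:deploy"],
        alert_type="info",
    )

# Called from the deploy pipeline, e.g.:
# record_deploy("checkout-api", "<git sha>", "production")
```

Overlaying these events on the service dashboard is exactly the 15-minute saving the scenario describes: the latency spike and last night's deployment appear on the same graph.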
Monitors 50–500 services across cloud infrastructure (AWS, GCP, or Azure). Manages 100–1,000 active monitors (alerts). Maintains 20–50 dashboards across teams. Handles 3–15 incidents per month. Uses Datadog APM for distributed tracing, Logs for debugging, and Infrastructure for host monitoring. Spends 5–10 hours per week on monitoring configuration and optimization. Reviews the Datadog bill monthly and has cut custom metrics at least once to reduce costs.
Two things you'd notice: they reference Datadog in conversation without being asked, and they've built workflows on top of it that weren't in the original plan. Anomaly detection and intelligent alerting have become part of their muscle memory. They're now focused on creating dashboards that help during incidents, not just ones that look good in reviews, a sign the basics are solved.
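For concreteness, a hedged sketch of the kind of "intelligent alerting" this persona reaches for: an anomaly-detection monitor created through the `datadog` Python package. The metric name, service tag, thresholds, and PagerDuty handle are illustrative and would be tuned to the team's actual traffic pattern.

```python
# Anomaly-detection monitor that alerts on deviation from the seasonal
# baseline instead of a fixed threshold (sketch).
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

api.Monitor.create(
    type="query alert",
    query=(
        "avg(last_4h):anomalies("
        "avg:trace.http.request.duration{service:checkout-api}, 'agile', 2"
        ") >= 1"
    ),
    name="checkout-api latency is anomalous",
    message=(
        "Latency deviates from its seasonal baseline. "
        "Check recent deploy events before escalating. @pagerduty-checkout"
    ),
    options={
        "thresholds": {"critical": 1.0},
        "threshold_windows": {"trigger_window": "last_30m",
                              "recovery_window": "last_30m"},
        "notify_no_data": False,     # do not page on missing data
        "renotify_interval": 0,      # do not re-page while still triggered
        "require_full_window": True, # evaluate only complete windows
    },
)
```

The noise-reduction options matter as much as the query: suppressing no-data pages and re-notifications is how an alert stays quiet for everything except real problems.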
The trigger is specific: custom metrics pricing that turns every new metric into a cost decision rather than a technical one, combined with a high-stakes deadline. Datadog fails them at exactly the wrong moment. Alert fatigue meant the team ignored Datadog notifications, defeating the purpose entirely. What makes it irreversible: they fundamentally believe that monitoring which generates noise is worse than no monitoring at all, because it trains people to ignore alerts, and Datadog just proved it doesn't share that belief.
Pairs with datadog-primary-user for the developer vs. SRE perspective on the same platform. Contrast with sentry-primary-user for the error monitoring specialist comparison. Use with pagerduty-primary-user for the alerting-to-incident workflow.