“Not a single dramatic moment. More like a Tuesday at 3pm when they realized they hadn't thought about alert fatigue in two weeks; Datadog had absorbed it. The shift registered when a custom metric dashboard showed a correlation between deploy frequency and error rates that nobody had noticed.”
When an alert has fired (p99 latency on the checkout service is above 2 seconds), I want to know something is wrong before a customer reports it, so I can get from "alert fired" to "root cause identified" in under 15 minutes.
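A minimal sketch of the monitor behind that page, assuming the official `datadog-api-client` Python package; the metric name, tags, and notification handle are illustrative, not this persona's actual configuration.

```python
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.monitors_api import MonitorsApi
from datadog_api_client.v1.model.monitor import Monitor
from datadog_api_client.v1.model.monitor_type import MonitorType

# Page when checkout p99 latency stays above 2 seconds over the last 5 minutes.
# "checkout.request.latency" is a hypothetical distribution metric.
body = Monitor(
    name="Checkout p99 latency above 2s",
    type=MonitorType("metric alert"),
    query="avg(last_5m):p99:checkout.request.latency{env:production} > 2",
    message="p99 checkout latency is above 2s. @pagerduty-checkout",  # illustrative handle
    tags=["service:checkout", "team:sre"],
)

configuration = Configuration()  # reads DD_API_KEY / DD_APP_KEY from the environment
with ApiClient(configuration) as api_client:
    MonitorsApi(api_client).create_monitor(body=body)
```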
A site reliability engineer or platform engineer at a company with a production system that people depend on. Datadog is their window into that system. They've built dashboards that tell the story of what's happening in production. They've written monitors that page them when something goes wrong. They've been paged at 2am by monitors they wrote themselves and have opinions about that experience. They are better at Datadog than most people at their company and still feel like they're using 30% of what it can do.
To make Datadog the system of record for knowing something is wrong before a customer reports it. Not aspirationally but operationally. The kind of intention that shows up as a daily habit, not a quarterly goal.
The tangible result: they know something is wrong before a customer reports it, consistently, without manual intervention, and without the anxiety of alert fatigue from monitors that fire on normal variance (the cry-wolf problem). Datadog has earned a place in the daily workflow rather than being tolerated in it.
It's 11:47pm. An alert has fired: p99 latency on the checkout service is above 2 seconds. They've acknowledged the page. They're in Datadog. They're looking at the service map. Latency is elevated on one downstream dependency. They're correlating with deployment events from the last 4 hours. There was a deploy 90 minutes ago. They're pulling the logs for that service. They need to determine if this is the deploy or a traffic pattern before they decide whether to wake up the on-call engineer.
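That deploy-correlation step can be scripted as well as clicked through. A rough sketch, again assuming the `datadog-api-client` Python package; the event tags are a guess at how deployments might be labeled, not a known convention for this team.

```python
import time

from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.events_api import EventsApi

# Pull the last 4 hours of deployment events for the checkout service so they
# can be lined up against the start of the latency spike.
now = int(time.time())
configuration = Configuration()
with ApiClient(configuration) as api_client:
    response = EventsApi(api_client).list_events(
        start=now - 4 * 3600,
        end=now,
        tags="service:checkout,deployment",  # hypothetical event tags
    )

for event in response.events:
    print(event.date_happened, event.title)
```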
Works on a team of 2–5 SREs supporting 20–100 engineers. Manages Datadog for infrastructure, APM, logs, and synthetics. Has 40–120 active monitors. Runs a weekly alert review to tune monitors that are firing too often or not catching real incidents. Uses Datadog Notebooks for incident postmortems. Reviews the Service Catalog as a reference during unfamiliar incidents. Has Datadog costs that are significant enough to have a meeting about quarterly.
They've stopped comparing alternatives. Datadog is open before their first meeting. On-call rotations use Datadog dashboards as the first stop during incidents. The strongest signal: they've started onboarding teammates into their setup unprompted.
It's not one thing; it's the accumulation. Data retention costs force difficult decisions about what to keep, a friction they've reported, worked around, and accepted. Then a competitor demo shows the same workflow without the friction, and the sunk-cost argument collapses. Their worldview, that an alert nobody acts on is an alert that trains people to ignore alerts, makes them unwilling to compromise once a better option is visible.
Pairs with `platform-engineer` for the full infrastructure and reliability workflow. Contrast with `startup-developer` who uses Datadog's free tier and hasn't yet needed its depth. Use with `stripe-primary-user` for the payment reliability and latency monitoring use case.