datadog · technical · APP-019

The Datadog SRE

#datadog #observability #sre #monitoring #on-call #infrastructure
Aha Moment

“What was the moment this product clicked?” —

Identity

A site reliability engineer or platform engineer at a company with a production system that people depend on. Datadog is their window into that system. They've built dashboards that tell the story of what's happening in production. They've written monitors that page them when something goes wrong. They've been paged at 2am by monitors they wrote themselves and have opinions about that experience. They are better at Datadog than most people at their company and still feel like they're using 30% of what it can do.

Intention

What are they trying to do? —

Outcome

What do they produce? —

Goals
  • Know something is wrong before a customer reports it
  • Get from "alert fired" to "root cause identified" in under 15 minutes
  • Build monitors that page when something matters and stay silent when it doesn't (a monitor sketch follows this list)
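
A concrete reading of that last goal, sketched with the legacy `datadog` Python client (datadogpy): a latency monitor with an explicit critical threshold, a warning tier, and an evaluation window wide enough to ride out momentary spikes. The metric name `checkout.request.latency.p99`, the thresholds, the environment variables, and the notification handle are illustrative assumptions, not details from this persona's actual setup.

```python
import os

from datadog import initialize, api  # legacy datadogpy client

# Credentials are assumed to live in the environment; adjust to taste.
initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])

# A metric alert that pages only when checkout p99 latency stays above 2s
# averaged over 10 minutes, with a warning tier below the paging threshold.
api.Monitor.create(
    type="metric alert",
    query="avg(last_10m):avg:checkout.request.latency.p99{env:prod} > 2",
    name="Checkout p99 latency above 2s",
    message="Checkout p99 latency has been above 2s for 10 minutes. @pagerduty-checkout",
    tags=["service:checkout", "team:sre"],
    options={
        "thresholds": {"critical": 2, "warning": 1.5},
        "notify_no_data": False,   # silence over noise for this signal
        "renotify_interval": 0,    # one page per incident, not a drumbeat
    },
)
```

The evaluation window and the warning tier are the knobs that keep this monitor quiet on normal variance, and they are exactly what the weekly alert review would tune.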
Frustrations
  • Alert fatigue from monitors that fire on normal variance: the cry-wolf problem that makes on-call feel like a punishment
  • Dashboards that look great in a postmortem presentation but aren't useful during the incident itself
  • Log queries that time out on the data volume that matters most — peak traffic incidents
  • The cost of Datadog scaling faster than engineering's understanding of what they're paying for
Worldview
  • An alert that nobody acts on is an alert that trains people to ignore alerts
  • Observability is not monitoring: monitoring tells you something is wrong; observability tells you why
  • The 2am page is the grade on the work done in the quiet hours
Scenario

It's 11:47pm. An alert has fired: p99 latency on the checkout service is above 2 seconds. They've acknowledged the page. They're in Datadog. They're looking at the service map. Latency is elevated on one downstream dependency. They're correlating with deployment events from the last 4 hours. There was a deploy 90 minutes ago. They're pulling the logs for that service. They need to determine whether this is the deploy or a traffic pattern before they decide whether to wake up the checkout team's on-call engineer.
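
The deciding step in that scenario is a before/after comparison at the deploy boundary. Below is a minimal, self-contained sketch of that reasoning in Python; in practice the (timestamp, value) pairs would come from Datadog metric queries and the deploy timestamp from the deployments or events stream, and the function names, the 30-minute window, and the 1.5x factor are illustrative assumptions.

```python
from statistics import median

def shifted(points, deploy_ts, window_s=1800, factor=1.5):
    """Did the series step up at the deploy?

    points: (unix_ts, value) pairs, e.g. flattened from a metric query's results.
    Compares the median in the window before the deploy to the median after it.
    """
    before = [v for ts, v in points if deploy_ts - window_s <= ts < deploy_ts]
    after = [v for ts, v in points if deploy_ts <= ts < deploy_ts + window_s]
    if not before or not after:
        return False  # not enough data on one side to say anything
    return median(after) > factor * median(before)

def deploy_or_traffic(latency_points, request_rate_points, deploy_ts):
    """First pass at the 'deploy or traffic pattern?' question."""
    latency_up = shifted(latency_points, deploy_ts)
    traffic_up = shifted(request_rate_points, deploy_ts)
    if latency_up and not traffic_up:
        return "latency stepped up at the deploy while traffic stayed flat: suspect the deploy"
    if latency_up and traffic_up:
        return "latency and traffic moved together: suspect load, not the deploy"
    return "no clean step at the deploy boundary: keep digging in the logs"
```

The point is not the arithmetic; it is that the scenario's judgment call reduces to comparing two series across one timestamp, which is also what the eyeball does on the dashboard.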

Context

Works on a team of 2–5 SREs supporting 20–100 engineers. Manages Datadog for infrastructure, APM, logs, and synthetics. Has 40–120 active monitors. Runs a weekly alert review to tune monitors that are firing too often or not catching real incidents. Uses Datadog Notebooks for incident postmortems. Reviews the Service Catalog as a reference during unfamiliar incidents. Has Datadog costs significant enough to warrant a quarterly meeting about them.

Impact
  • Alert grouping that combines related signals into a single incident notification reduces the pager volume during cascading failures to something actionable
  • Log query performance that holds at peak traffic volume removes the investigative gap that occurs precisely when logs are most needed
  • Monitor configuration suggestions that recommend thresholds based on historical baseline variance reduce the manual tuning loop that produces alert fatigue (see the sketch after this list)
  • Cost visibility at the team and service level makes Datadog spend defensible without requiring a FinOps specialist
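
The threshold-suggestion point above is easy to make concrete. A minimal sketch, assuming the historical samples have already been pulled from Datadog and ignoring seasonality and trend, which a real implementation would have to handle; the function name, the 4-sigma default, and the optional floor are illustrative assumptions.

```python
from statistics import mean, stdev

def suggest_threshold(history, k=4.0, floor=None):
    """Suggest a monitor threshold from a metric's historical baseline.

    history: recent samples of the metric at the monitor's evaluation cadence,
             e.g. two weeks of p99 latency values in seconds.
    k:       how many standard deviations above the baseline mean the monitor
             tolerates before paging; a larger k means a quieter monitor.
    floor:   optional hard minimum (for example an SLO bound) that wins if higher.
    """
    if len(history) < 2:
        raise ValueError("need at least two samples to estimate variance")
    suggested = mean(history) + k * stdev(history)
    return max(suggested, floor) if floor is not None else suggested

# Usage: feed it a trailing window of samples and compare the suggestion to
# the hand-picked round number currently sitting in the monitor.
# suggest_threshold(two_weeks_of_p99_samples, k=4.0, floor=2.0)
```

For this persona, the appeal is that the weekly alert review becomes a diff against a defensible baseline instead of an argument about gut feel.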
Composability Notes

Pairs with `platform-engineer` for the full infrastructure and reliability workflow. Contrast with `startup-developer` who uses Datadog's free tier and hasn't yet needed its depth. Use with `stripe-primary-user` for the payment reliability and latency monitoring use case.