“What was the moment this product clicked?” —
A site reliability engineer or platform engineer at a company with a production system that people depend on. Datadog is their window into that system. They've built dashboards that tell the story of what's happening in production. They've written monitors that page them when something goes wrong. They've been paged at 2am by monitors they wrote themselves and have opinions about that experience. They are better at Datadog than most people at their company, yet they still feel like they're using only 30% of what it can do.
What are they trying to do? —
What do they produce? —
It's 11:47pm. An alert has fired: p99 latency on the checkout service is above 2 seconds. They've acknowledged the page. They're in Datadog. They're looking at the service map. Latency is elevated on one downstream dependency. They're correlating with deployment events from the last 4 hours. There was a deploy 90 minutes ago. They're pulling the logs for that service. They need to determine whether this is the deploy or a shift in traffic before deciding whether to wake up the engineer who owns that service.
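The kind of monitor behind that page might look like this Datadog monitor query (a sketch only; the metric name, tags, and time window are assumptions, not taken from any real configuration):

```
avg(last_5m):p99:trace.http.request{env:prod,service:checkout} > 2
```

A monitor like this evaluates the p99 of request duration over a rolling five-minute window and pages when it crosses the 2-second threshold, which is consistent with the alert described above.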
Works on a team of 2–5 SREs supporting 20–100 engineers. Manages Datadog for infrastructure, APM, logs, and synthetics. Has 40–120 active monitors. Runs a weekly alert review to tune monitors that fire too often or fail to catch real incidents. Uses Datadog Notebooks for incident postmortems. Reviews the Service Catalog as a reference during unfamiliar incidents. Has Datadog costs significant enough to warrant a quarterly meeting.
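The weekly alert-review triage could be sketched as a short script. This is a hypothetical illustration: the monitor names, the weekly-page counts, and the noise threshold are all invented, and in practice the counts would come from an export of monitor history rather than a hardcoded list.

```python
# Triage monitors for the weekly alert review: flag monitors that page
# too often (candidates for tuning) and monitors that never fire
# (candidates for review as possibly stale or miscalibrated).
# Input: (monitor_name, pages_in_last_7_days) pairs.

NOISY_THRESHOLD = 10  # assumption: >10 pages/week means the monitor needs tuning

monitors = [
    ("checkout p99 latency", 14),
    ("disk space on db hosts", 0),
    ("5xx rate on api gateway", 3),
]

def triage(monitors, noisy_threshold=NOISY_THRESHOLD):
    """Split monitors into noisy (over threshold) and silent (zero pages)."""
    noisy = [name for name, pages in monitors if pages > noisy_threshold]
    silent = [name for name, pages in monitors if pages == 0]
    return noisy, silent

noisy, silent = triage(monitors)
print("tune these:", noisy)      # fired too often this week
print("review these:", silent)   # never fired; stale or badly scoped?
```

The useful design choice here is treating silence as a signal too: a monitor that never fires is either protecting nothing or scoped so narrowly it would miss a real incident.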
Pairs with `platform-engineer` for the full infrastructure and reliability workflow. Contrast with `startup-developer` who uses Datadog's free tier and hasn't yet needed its depth. Use with `stripe-primary-user` for the payment reliability and latency monitoring use case.