pagerduty · technical · APP-103

The PagerDuty On-Call Engineer

#pagerduty #on-call #incident-management #devops #alerting #reliability
Aha Moment

The shift was quiet. They'd been using PagerDuty for weeks, mostly out of obligation. Then one feature clicked into place — and suddenly the alert fatigue they had been tolerating, the noisy, low-signal pages that train them to under-respond, felt absurd. They couldn't go back.

Job Story (JTBD)

When payment processing latency is above threshold, I want to be paged only for things that require human intervention right now, so I can diagnose and resolve incidents fast enough to minimize user impact.
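
To make this job story concrete: a hedged sketch of gating alerts at the sender, so only actionable, high-severity events ever reach PagerDuty. The endpoint and payload shape follow PagerDuty's public Events API v2; the severity threshold, the `is_actionable` policy, and the incoming alert's field names are illustrative assumptions, not PagerDuty features.

```python
"""Hypothetical alert gate: page only when human intervention is needed.

The Events API v2 endpoint and payload shape are PagerDuty's public ones;
the severity threshold and is_actionable() policy are illustrative only.
"""
import requests

EVENTS_API = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_INTEGRATION_ROUTING_KEY"   # per-service integration key
ACTIONABLE_SEVERITIES = {"critical", "error"}  # assumed team policy


def is_actionable(alert: dict) -> bool:
    """Team policy, not a PagerDuty feature: suppress alerts that
    auto-remediate or fall below the paging severity threshold."""
    if alert.get("auto_remediated"):
        return False
    return alert.get("severity") in ACTIONABLE_SEVERITIES


def page_if_actionable(alert: dict) -> bool:
    """Send a trigger event to PagerDuty only for actionable alerts."""
    if not is_actionable(alert):
        return False  # stays in the monitoring tool; nobody is woken up
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "dedup_key": alert["fingerprint"],  # dedupe repeat firings
        "payload": {
            "summary": alert["summary"],
            "source": alert["source"],
            "severity": alert["severity"],
        },
    }
    requests.post(EVENTS_API, json=event, timeout=5).raise_for_status()
    return True


if __name__ == "__main__":
    page_if_actionable({
        "fingerprint": "payments-latency-p99",
        "summary": "Payment processing latency above threshold",
        "source": "payments-api",
        "severity": "critical",
        "auto_remediated": False,
    })
```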

Identity

A software engineer or site reliability engineer who is on a rotating on-call schedule and whose relationship with PagerDuty is defined by the moments it wakes them up. They've been paged at 3am. They've resolved incidents from their phone in bed. They've also been paged for something that wasn't an incident — a flaky alert, a threshold set too low, a monitoring rule that was never updated after the system changed. Every false positive erodes their trust in the alert and their willingness to respond with full urgency next time. They manage this tension carefully.

Intention

To make PagerDuty the system of record for paging, where a page means something requires human intervention right now. Not aspirationally — operationally. The kind of intention that shows up as a daily habit, not a quarterly goal.

Outcome

The tangible result: they are paged only for things that require human intervention right now, without manually triaging every alert, and without the anxiety of alert fatigue from noisy, low-signal pages that train them to under-respond. PagerDuty has earned a place in the daily workflow rather than being tolerated in it.

Goals
  • Get paged only for things that require human intervention right now
  • Diagnose and resolve incidents fast enough to minimize user impact
  • Hand off incidents cleanly and build a post-mortem record that actually prevents recurrence
Frustrations
  • Alert fatigue from noisy, low-signal pages that train them to under-respond
  • Runbooks that are out of date because the last person to update them left 18 months ago
  • Escalation policies that don't match how the team actually works
  • Post-mortems that identify root causes but don't produce action items that stick
Worldview
  • Every page is a hypothesis: "this is real, and you need to act now"
  • Alert quality is the on-call team's shared responsibility — nobody else will fix it
  • A post-mortem that produces no lasting change is documentation theater
Scenario

It's 2:47am. PagerDuty fires. Payment processing latency is above threshold. They're awake, phone in hand. They open the incident. Linked to a Datadog alert. They open Datadog. The latency spike started 12 minutes ago and is ongoing. They check the deployment log — a deploy happened 40 minutes ago. They roll back. Latency normalizes in 3 minutes. Total time: 19 minutes. They write the incident summary, flag the deploy for post-mortem, and go back to sleep. This is the best version of this scenario. They know this.
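
The rollback call in this scenario rests on one correlation: a deploy landed shortly before the latency spike began. A minimal sketch of that check, assuming a hypothetical deploy-log record shape; the PagerDuty and Datadog lookups themselves are out of scope here.

```python
"""Hypothetical triage helper: flag deploys that landed shortly before an
incident started. The deploy-log record shape is assumed, not any specific
tool's API."""
from datetime import datetime, timedelta, timezone


def suspect_deploys(incident_start: datetime, deploys: list[dict],
                    window_minutes: int = 60) -> list[dict]:
    """Return deploys that finished within `window_minutes` before the
    incident started, newest first; these are the first rollback candidates."""
    window = timedelta(minutes=window_minutes)
    candidates = [
        d for d in deploys
        if timedelta(0) <= incident_start - d["finished_at"] <= window
    ]
    return sorted(candidates, key=lambda d: d["finished_at"], reverse=True)


if __name__ == "__main__":
    # Mirrors the scenario above: incident at 2:47, deploy 40 minutes earlier.
    incident_start = datetime(2024, 5, 14, 2, 47, tzinfo=timezone.utc)
    deploys = [
        {"service": "payments-api", "sha": "a1b2c3d",
         "finished_at": datetime(2024, 5, 14, 2, 7, tzinfo=timezone.utc)},
        {"service": "web", "sha": "9f8e7d6",
         "finished_at": datetime(2024, 5, 13, 18, 30, tzinfo=timezone.utc)},
    ]
    for d in suspect_deploys(incident_start, deploys):
        print(f"rollback candidate: {d['service']} @ {d['sha']}")
```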

Context

Is on an on-call rotation that cycles every 1–2 weeks. Has the PagerDuty mobile app with escalating alert tones. Has been on-call for 1–5 years. Manages their own alert rules — or inherits ones they didn't write. Reviews alert noise monthly — or plans to. Has written at least one runbook. Knows which runbooks are out of date. Has escalated an incident to a senior engineer at least twice. Has been that senior engineer at least once. Has strong opinions about alert thresholds that they will share at any retrospective.

Success Signal

They've stopped comparing alternatives. PagerDuty is open before their first meeting. Being paged only for things that require human intervention right now has become the default, not something they had to enforce. The strongest signal: they've started onboarding teammates into their setup unprompted.

Churn Trigger

It's not one thing — it's the accumulation. Alert fatigue from noisy, low-signal pages that train them to under-respond, fatigue they've reported, worked around, and accepted. Then a competitor demo shows the same workflow without the friction, and the sunk-cost argument collapses. Their worldview — every page is a hypothesis: "this is real, and you need to act now" — makes them unwilling to compromise once a better option is visible.

Impact
  • Alert noise scoring that surfaces which alert rules generate the most false positives enables systematic noise reduction rather than tolerance of the status quo (a rough scoring sketch follows this list)
  • Runbook linking directly from the incident, with the last-edited date visible, removes the "is this runbook current?" uncertainty during incident response
  • Incident timeline that auto-populates from linked monitoring tools (Datadog, Sentry, Grafana) removes the manual documentation step during a live incident (a timeline-merge sketch follows this list)
  • Post-mortem action item tracking that follows up with assignees removes the "we identified this three months ago and nothing changed" pattern
Composability Notes

Pairs with `sentry-primary-user` for the error-detection-to-incident-response chain. Contrast with `datadog-primary-user` for the monitoring-as-prevention vs. incident-response-when-it-fails distinction. Use with `gitlab-primary-user` for DevOps teams where the deployment pipeline is the most common incident source.