pagerduty · technical · APP-103

The PagerDuty On-Call Engineer

#pagerduty #on-call #incident-management #devops #alerting #reliability
Aha Moment

The shift was quiet. They'd been using PagerDuty for weeks, mostly out of obligation. Then one feature clicked into place — and suddenly the alert fatigue they had been tolerating, the noisy, low-signal pages that train them to under-respond, felt absurd. They couldn't go back.

Job Story (JTBD)

When payment processing latency is above threshold, I want to be paged only for things that require human intervention right now, so I can diagnose and resolve incidents fast enough to minimize user impact.
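
To make this job story concrete: a hedged sketch of gating alerts at the sender, so only actionable, high-severity events ever reach PagerDuty. The endpoint and payload shape follow PagerDuty's public Events API v2; the severity threshold, the `is_actionable` policy, and the incoming alert's field names are illustrative assumptions, not PagerDuty features.

```python
"""Hypothetical alert gate: page only when human intervention is needed.

The Events API v2 endpoint and payload shape are PagerDuty's public ones;
the severity threshold and is_actionable() policy are illustrative only.
"""
import requests

EVENTS_API = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_INTEGRATION_ROUTING_KEY"   # per-service integration key
ACTIONABLE_SEVERITIES = {"critical", "error"}  # assumed team policy


def is_actionable(alert: dict) -> bool:
    """Team policy, not a PagerDuty feature: suppress alerts that
    auto-remediate or fall below the paging severity threshold."""
    if alert.get("auto_remediated"):
        return False
    return alert.get("severity") in ACTIONABLE_SEVERITIES


def page_if_actionable(alert: dict) -> bool:
    """Send a trigger event to PagerDuty only for actionable alerts."""
    if not is_actionable(alert):
        return False  # stays in the monitoring tool; nobody is woken up
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "dedup_key": alert["fingerprint"],  # dedupe repeat firings
        "payload": {
            "summary": alert["summary"],
            "source": alert["source"],
            "severity": alert["severity"],
        },
    }
    requests.post(EVENTS_API, json=event, timeout=5).raise_for_status()
    return True


if __name__ == "__main__":
    page_if_actionable({
        "fingerprint": "payments-latency-p99",
        "summary": "Payment processing latency above threshold",
        "source": "payments-api",
        "severity": "critical",
        "auto_remediated": False,
    })
```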

Identity

A software engineer or site reliability engineer who is on a rotating on-call schedule and whose relationship with PagerDuty is defined by the moments it wakes them up. They've been paged at 3am. They've resolved incidents from their phone in bed. They've also been paged for something that wasn't an incident — a flaky alert, a threshold set too low, a monitoring rule that was never updated after the system changed. Every false positive erodes their trust in the alert and their willingness to respond with full urgency next time. They manage this tension carefully.

Intention

To make PagerDuty the system of record for paging, where a page means something requires human intervention right now. Not aspirationally — operationally. The kind of intention that shows up as a daily habit, not a quarterly goal.

Outcome

The tangible result: they are paged only for things that require human intervention right now, without manually triaging every alert, and without the anxiety of alert fatigue from noisy, low-signal pages that train them to under-respond. PagerDuty has earned a place in the daily workflow rather than being tolerated in it.

Goals
  • Get paged only for things that require human intervention right now
  • Diagnose and resolve incidents fast enough to minimize user impact
  • Hand off incidents cleanly and build a post-mortem record that actually prevents recurrence
Frustrations
  • Alert fatigue from noisy, low-signal pages that train them to under-respond
  • Runbooks that are out of date because the last person to update them left 18 months ago
  • Escalation policies that don't match how the team actually works
  • Post-mortems that identify root causes but don't produce action items that stick
Worldview
  • Every page is a hypothesis: "this is real, and you need to act now"
  • Alert quality is the on-call team's shared responsibility — nobody else will fix it
  • A post-mortem that produces no lasting change is documentation theater
Scenario

It's 2:47am. PagerDuty fires. Payment processing latency is above threshold. They're awake, phone in hand. They open the incident. Linked to a Datadog alert. They open Datadog. The latency spike started 12 minutes ago and is ongoing. They check the deployment log — a deploy happened 40 minutes ago. They roll back. Latency normalizes in 3 minutes. Total time: 19 minutes. They write the incident summary, flag the deploy for post-mortem, and go back to sleep. This is the best version of this scenario. They know this.
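
The rollback call in this scenario rests on one correlation: a deploy landed shortly before the latency spike began. A minimal sketch of that check, assuming a hypothetical deploy-log record shape; the PagerDuty and Datadog lookups themselves are out of scope here.

```python
"""Hypothetical triage helper: flag deploys that landed shortly before an
incident started. The deploy-log record shape is assumed, not any specific
tool's API."""
from datetime import datetime, timedelta, timezone


def suspect_deploys(incident_start: datetime, deploys: list[dict],
                    window_minutes: int = 60) -> list[dict]:
    """Return deploys that finished within `window_minutes` before the
    incident started, newest first; these are the first rollback candidates."""
    window = timedelta(minutes=window_minutes)
    candidates = [
        d for d in deploys
        if timedelta(0) <= incident_start - d["finished_at"] <= window
    ]
    return sorted(candidates, key=lambda d: d["finished_at"], reverse=True)


if __name__ == "__main__":
    # Mirrors the scenario above: incident at 2:47, deploy 40 minutes earlier.
    incident_start = datetime(2024, 5, 14, 2, 47, tzinfo=timezone.utc)
    deploys = [
        {"service": "payments-api", "sha": "a1b2c3d",
         "finished_at": datetime(2024, 5, 14, 2, 7, tzinfo=timezone.utc)},
        {"service": "web", "sha": "9f8e7d6",
         "finished_at": datetime(2024, 5, 13, 18, 30, tzinfo=timezone.utc)},
    ]
    for d in suspect_deploys(incident_start, deploys):
        print(f"rollback candidate: {d['service']} @ {d['sha']}")
```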

Context

Is on an on-call rotation that cycles every 1–2 weeks. Has the PagerDuty mobile app with escalating alert tones. Has been on-call for 1–5 years. Manages their own alert rules — or inherits ones they didn't write. Reviews alert noise monthly — or plans to. Has written at least one runbook. Knows which runbooks are out of date. Has escalated an incident to a senior engineer at least twice. Has been that senior engineer at least once. Has strong opinions about alert thresholds that they will share at any retrospective.

Success Signal

They've stopped comparing alternatives. PagerDuty is open before their first meeting. Being paged only for things that require human intervention right now has become the default, not something they had to enforce. The strongest signal: they've started onboarding teammates into their setup unprompted.

Churn Trigger

It's not one thing — it's the accumulation. Alert fatigue from noisy, low-signal pages that train them to under-respond, fatigue they've reported, worked around, and accepted. Then a competitor demo shows the same workflow without the friction, and the sunk-cost argument collapses. Their worldview — every page is a hypothesis: "this is real, and you need to act now" — makes them unwilling to compromise once a better option is visible.

Impact
  • Alert noise scoring that surfaces which alert rules generate the most false positives enables systematic noise reduction rather than tolerance of the status quo (a rough scoring sketch follows this list)
  • Runbook linking directly from the incident, with the last-edited date visible, removes the "is this runbook current?" uncertainty during incident response
  • Incident timeline that auto-populates from linked monitoring tools (Datadog, Sentry, Grafana) removes the manual documentation step during a live incident (a timeline-merge sketch follows this list)
  • Post-mortem action item tracking that follows up with assignees removes the "we identified this three months ago and nothing changed" pattern
Composability Notes

Pairs with `sentry-primary-user` for the error-detection-to-incident-response chain. Contrast with `datadog-primary-user` for the monitoring-as-prevention vs. incident-response-when-it-fails distinction. Use with `gitlab-primary-user` for DevOps teams where the deployment pipeline is the most common incident source.