PagerDuty Integration
PagerDuty integration enables sophisticated incident response workflows, on-call management, and automated escalation paths. This guide covers complete setup, incident routing, and advanced features for enterprise incident management.
Quick Setup
- Create PagerDuty Service
  - Log into your PagerDuty account
  - Go to Services > Service Directory
  - Click “New Service” → “Build Your Own”
  - Choose “Events API v2” integration
- Add Integration in 9n9s
  - Go to Organization Settings > Notification Channels
  - Click “Add Channel” → “PagerDuty”
  - Enter your Integration Key and configure options
  - Click “Create Channel”
- Test Integration
  - Click “Send Test Alert”
  - Verify incident appears in PagerDuty
  - Check routing and escalation behavior
Prerequisites
PagerDuty Requirements
- PagerDuty Account: Free or paid PagerDuty subscription
- Service Configuration: At least one PagerDuty service for receiving alerts
- Escalation Policy: Configured escalation policy with on-call schedules
- User Permissions: Admin or manager role to create integrations
9n9s Requirements
- Organization Admin: Admin role to configure notification channels
- Monitor Access: Appropriate permissions for monitors you want to alert on
- Plan Features: PagerDuty integration available on all plans
Setting Up PagerDuty Integration
Step 1: Configure PagerDuty Service
- Create New Service
  - Log into pagerduty.com
  - Navigate to Services → Service Directory
  - Click “New Service” button
- Service Configuration
  - Service Name: “9n9s Production Monitoring” (descriptive name)
  - Description: “Alerts from 9n9s monitoring system”
  - Escalation Policy: Select or create appropriate escalation policy
  - Alert Grouping: Choose grouping strategy (recommended: “Intelligent”)
  - Incident Settings: Configure auto-resolution and urgency rules
- Integration Setup
  - Integration Type: Select “Events API v2”
  - Integration Name: “9n9s Integration”
  - Copy Integration Key: Save this key securely for 9n9s configuration
Step 2: Add PagerDuty Channel in 9n9s
- Navigate to Notification Channels
  - Log into app.9n9s.com
  - Go to Organization Settings in the sidebar
  - Click “Notification Channels”
- Create PagerDuty Channel
  - Click “Add Channel” button
  - Select “PagerDuty” from integration types
  - This opens the PagerDuty configuration form
- Basic Configuration
  - Channel Name: “PagerDuty Production” (internal reference name)
  - Integration Key: Paste the key from PagerDuty service
  - Default Severity: Choose default incident severity
  - Description: Optional description of channel purpose
Step 3: Configure Alert Mapping
- Severity Mapping
  - 9n9s Critical → PagerDuty Critical (highest urgency)
  - 9n9s Warning → PagerDuty Warning (medium urgency)
  - 9n9s Info → PagerDuty Info (low urgency)
- Incident Details Configuration
  - Incident Title Format: Customize how incident titles appear
  - Incident Description: Configure detail level in incident body
  - Custom Fields: Map 9n9s monitor tags to PagerDuty fields
  - Deduplication Key: Configure how similar alerts are grouped
- Advanced Settings
  - Auto-Resolution: Enable automatic incident resolution when monitor recovers
  - Event Action: Configure trigger/acknowledge/resolve behavior
  - Client Information: Include 9n9s details in incident context
  - Rate Limiting: Control incident creation frequency
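The severity mapping in this step can be pictured as a small lookup table. The sketch below is illustrative only and is not how 9n9s implements the mapping: the function name `to_pagerduty_severity`, the lowercase 9n9s level strings, and the fallback default are all assumptions. The set of accepted values (`critical`, `error`, `warning`, `info`) is what the PagerDuty Events API v2 allows in `payload.severity`.

```python
# Hypothetical sketch: map a 9n9s alert level to a PagerDuty Events API v2 severity.
# Severities accepted by the Events API v2 "payload.severity" field.
PAGERDUTY_SEVERITIES = {"critical", "error", "warning", "info"}

# Mapping taken from the Severity Mapping list above (level names assumed lowercase).
SEVERITY_MAP = {
    "critical": "critical",
    "warning": "warning",
    "info": "info",
}


def to_pagerduty_severity(level: str, default: str = "warning") -> str:
    """Return the PagerDuty severity for a 9n9s level, or the default if unmapped."""
    if default not in PAGERDUTY_SEVERITIES:
        raise ValueError(f"unsupported default severity: {default!r}")
    return SEVERITY_MAP.get(level.strip().lower(), default)
```

An unmapped level falls back to the channel's default severity here, which mirrors the "Default Severity" option in the channel form; whether 9n9s behaves this way is an assumption.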
Step 4: Test and Verify
- Send Test Incident
  - Click “Send Test Alert” in channel configuration
  - Choose test severity level (Critical, Warning, Info)
  - Add test message and details
- Verify in PagerDuty
  - Check that incident appears in PagerDuty within 30 seconds
  - Verify incident details, severity, and routing
  - Confirm escalation policy triggers correctly
  - Test acknowledgment and resolution workflows
- Test Auto-Resolution
  - Create a test monitor that will fail and recover
  - Verify incidents auto-resolve when monitor recovers
  - Check incident timeline for proper event sequence
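If the in-app test does not produce an incident, it can help to send a minimal event straight to the PagerDuty Events API v2 endpoint (`https://events.pagerduty.com/v2/enqueue`) to confirm the integration key works on its own. The following is a sketch using only the Python standard library; the helper names `build_test_event` and `send_event` and the `9n9s-manual-test` dedup key are made up for illustration.

```python
import json
import urllib.request

EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"


def build_test_event(routing_key: str, severity: str = "info") -> dict:
    """Build a minimal trigger event for an Events API v2 integration."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Hypothetical fixed key so the same test incident can be resolved later.
        "dedup_key": "9n9s-manual-test",
        "payload": {
            "summary": "Test alert from 9n9s integration check",
            "source": "9n9s",
            "severity": severity,
        },
    }


def send_event(event: dict, timeout: float = 10.0) -> int:
    """POST the event and return the HTTP status code (202 means accepted).

    urllib raises urllib.error.HTTPError for 4xx and 5xx responses.
    """
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `send_event(build_test_event("YOUR_INTEGRATION_KEY"))` should open a test incident on the service; sending the same event with `event_action` set to `resolve` and the same `dedup_key` closes it again.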
PagerDuty Service Configuration
Escalation Policies
Example Basic Escalation Policy:
Level 1 (Immediate):
• Primary On-Call Engineer
• Escalate after: 5 minutes

Level 2 (Secondary):
• Senior On-Call Engineer
• Secondary On-Call Engineer
• Escalate after: 15 minutes

Level 3 (Management):
• Engineering Manager
• Ops Manager
• Repeat escalation every: 30 minutes
Advanced Escalation with Time Restrictions:
Business Hours (9 AM - 6 PM):
Level 1: Primary Engineer (5 min)
Level 2: Team Lead (10 min)
Level 3: Manager (repeat every 30 min)

After Hours/Weekends:
Level 1: On-Call Engineer (immediate)
Level 2: Backup On-Call (5 min)
Level 3: Emergency Contact (15 min)
Alert Grouping Strategies
Intelligent Grouping (Recommended):
- Automatically groups related alerts using machine learning
- Reduces noise during incidents
- Groups by service, component, or similar characteristics
Time-Based Grouping:
- Groups alerts occurring within specified time window
- Useful for services with predictable failure patterns
- Configurable grouping window (1-60 minutes)
Content-Based Grouping:
- Groups alerts with similar titles or descriptions
- Uses exact matching or fuzzy matching
- Good for standardized alert formats
Custom Grouping Rules:
{
  "grouping_config": {
    "type": "content_based",
    "fields": ["monitor.name", "tags.environment"],
    "time_window": 300,
    "max_group_size": 10
  }
}
Advanced Configuration
Multiple PagerDuty Services
Configure different services for different alert types:
Critical Production Service:
Service: "Critical Production Alerts"
Integration Key: [critical-service-key]
Escalation Policy: "Critical Production Policy"
Urgency: High
Auto-Resolution: Enabled
Non-Critical Service:
Service: "Non-Critical Alerts"
Integration Key: [non-critical-service-key]
Escalation Policy: "Standard Policy"
Urgency: Low
Auto-Resolution: Enabled
Development Environment Service:
Service: "Development Alerts"
Integration Key: [dev-service-key]
Escalation Policy: "Development Policy"
Urgency: Low
Auto-Resolution: Enabled
Custom Event Payloads
Standard Event Payload:
{
  "routing_key": "YOUR_INTEGRATION_KEY",
  "event_action": "trigger",
  "dedup_key": "{{monitor.id}}-{{incident.id}}",
  "payload": {
    "summary": "{{monitor.name}} is {{monitor.status}}",
    "source": "9n9s",
    "severity": "critical",
    "component": "{{tags.component}}",
    "group": "{{tags.team}}",
    "class": "{{monitor.type}}",
    "custom_details": {
      "monitor_id": "{{monitor.id}}",
      "project": "{{project.name}}",
      "environment": "{{tags.environment}}",
      "duration": "{{incident.duration}}",
      "response_code": "{{metadata.response_code}}",
      "response_time": "{{metadata.response_time}}",
      "monitor_url": "{{links.monitor}}",
      "logs_url": "{{links.logs}}"
    }
  },
  "client": "9n9s Monitoring",
  "client_url": "{{links.monitor}}"
}
Enhanced Event with Context:
{
  "routing_key": "YOUR_INTEGRATION_KEY",
  "event_action": "trigger",
  "dedup_key": "{{monitor.id}}-{{incident.id}}",
  "payload": {
    "summary": "[{{tags.environment}}] {{monitor.name}} - {{monitor.status}}",
    "source": "{{tags.service}}.{{tags.environment}}",
    "severity": "{{alert.severity}}",
    "component": "{{tags.component}}",
    "group": "{{tags.team}}",
    "class": "monitoring",
    "custom_details": {
      "incident_details": {
        "monitor_name": "{{monitor.name}}",
        "monitor_type": "{{monitor.type}}",
        "current_status": "{{monitor.status}}",
        "previous_status": "{{monitor.previous_status}}",
        "incident_started": "{{incident.started_at}}",
        "incident_duration": "{{incident.duration}}"
      },
      "monitor_configuration": {
        "check_frequency": "{{monitor.frequency}}",
        "timeout": "{{monitor.timeout}}",
        "regions": "{{monitor.regions}}"
      },
      "failure_details": {
        "response_code": "{{metadata.response_code}}",
        "response_time": "{{metadata.response_time}}",
        "error_message": "{{metadata.error_message}}"
      },
      "environment_context": {
        "project": "{{project.name}}",
        "environment": "{{tags.environment}}",
        "service": "{{tags.service}}",
        "team": "{{tags.team}}",
        "criticality": "{{tags.criticality}}"
      },
      "actions": {
        "view_monitor": "{{links.monitor}}",
        "view_logs": "{{links.logs}}",
        "view_project": "{{links.project}}",
        "create_incident": "{{links.incident_creation}}"
      }
    }
  },
  "client": "9n9s Monitoring System",
  "client_url": "{{links.organization}}"
}
Intelligent Routing Rules
Configure routing based on monitor attributes:
Environment-Based Routing:
routing_rules:
  - name: "Production Critical"
    conditions:
      tags:
        environment: production
        criticality: critical
    pagerduty_settings:
      service_key: "critical-production-key"
      severity: "critical"
      escalation_policy: "Critical Production Policy"

  - name: "Production Standard"
    conditions:
      tags:
        environment: production
        criticality: [standard, low]
    pagerduty_settings:
      service_key: "standard-production-key"
      severity: "warning"
      escalation_policy: "Standard Production Policy"

  - name: "Development Alerts"
    conditions:
      tags:
        environment: [development, staging]
    pagerduty_settings:
      service_key: "development-key"
      severity: "info"
      escalation_policy: "Development Policy"
Team-Based Routing:
routing_rules:
  - name: "Backend Team Alerts"
    conditions:
      tags:
        team: backend
    pagerduty_settings:
      service_key: "backend-team-key"
      custom_details:
        runbook: "https://wiki.company.com/backend-runbooks"

  - name: "Frontend Team Alerts"
    conditions:
      tags:
        team: frontend
    pagerduty_settings:
      service_key: "frontend-team-key"
      custom_details:
        runbook: "https://wiki.company.com/frontend-runbooks"
Event Actions and Lifecycle
Event Action Types
Trigger Events:
- Create new incidents in PagerDuty
- Sent when monitor status changes to DOWN/DEGRADED
- Includes full context and failure details
Acknowledge Events:
- Acknowledge existing incidents
- Can be triggered automatically or manually from 9n9s
- Updates incident status without resolving
Resolve Events:
- Automatically resolve incidents when monitor recovers
- Sent when monitor status changes to UP
- Includes recovery time and resolution details
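The trigger and resolve behavior described above amounts to a mapping from monitor status transitions to an Events API v2 `event_action`. The sketch below is one plausible reading, not 9n9s's actual logic: the function name `event_action_for` is hypothetical, and the choice to send no new event when a monitor moves between two failing states (for example DEGRADED to DOWN) is an assumption.

```python
from typing import Optional

# Statuses that open an incident, per the Trigger Events description above.
FAILING_STATUSES = {"DOWN", "DEGRADED"}


def event_action_for(previous: str, current: str) -> Optional[str]:
    """Return the event_action for a status transition, or None if no event is needed."""
    previous, current = previous.upper(), current.upper()
    if current in FAILING_STATUSES and previous not in FAILING_STATUSES:
        return "trigger"  # healthy -> failing: open an incident
    if current == "UP" and previous in FAILING_STATUSES:
        return "resolve"  # failing -> healthy: close the incident
    return None  # no state change that PagerDuty needs to hear about
```

Acknowledge events are left out because, as noted above, they are driven by a person or an explicit automation rather than by a status change.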
Incident Lifecycle Management
Automatic Incident Management:
lifecycle_settings:
  auto_trigger: true       # Create incidents for failed monitors
  auto_acknowledge: false  # Manual acknowledgment required
  auto_resolve: true       # Resolve when monitor recovers
  resolve_timeout: 300     # Auto-resolve after 5 minutes of UP status
  escalation_timeout: 900  # Re-escalate if unacknowledged for 15 minutes
Manual Incident Management:
lifecycle_settings:
  auto_trigger: true        # Create incidents for failed monitors
  auto_acknowledge: false   # Manual acknowledgment required
  auto_resolve: false       # Manual resolution required
  resolve_timeout: null     # No automatic resolution
  escalation_timeout: 1800  # Re-escalate after 30 minutes
Deduplication Strategies
Monitor-Based Deduplication:
Dedup Key: "{{monitor.id}}"
Result: One incident per monitor, updates existing incident
Incident-Based Deduplication:
Dedup Key: "{{monitor.id}}-{{incident.id}}"
Result: New incident for each failure occurrence
Service-Based Deduplication:
Dedup Key: "{{tags.service}}-{{tags.environment}}"
Result: One incident per service environment combination
Integration with 9n9s Features
Alert Rules Integration
Simple Alert Rule:
alert_rule:
  name: "Critical Production Alert"
  conditions:
    - monitor_status: "down"
    - tags.environment: "production"
    - tags.criticality: "critical"
  actions:
    - channel: "pagerduty-critical"
      delay: 0 # Immediate
Escalation Alert Rule:
alert_rule:
  name: "Escalating Production Alert"
  conditions:
    - monitor_status: "down"
    - tags.environment: "production"
  actions:
    - channel: "slack-alerts"
      delay: 0 # Immediate Slack notification
    - channel: "pagerduty-standard"
      delay: 300 # PagerDuty after 5 minutes
    - channel: "pagerduty-critical"
      delay: 900 # Critical escalation after 15 minutes
Monitor Tag Mapping
Environment Mapping:
tag_mapping:
  environment:
    production:
      service_key: "prod-service-key"
      severity: "critical"
    staging:
      service_key: "staging-service-key"
      severity: "warning"
    development:
      service_key: "dev-service-key"
      severity: "info"
Team Mapping:
tag_mapping:
  team:
    backend:
      service_key: "backend-service-key"
      escalation_policy: "Backend Team Policy"
    frontend:
      service_key: "frontend-service-key"
      escalation_policy: "Frontend Team Policy"
    infrastructure:
      service_key: "infra-service-key"
      escalation_policy: "Infrastructure Policy"
Incident Response Workflows
Automated Response Actions
Incident Creation Workflow:
- Monitor Failure Detected
  - 9n9s detects monitor failure
  - Evaluates alert rules and conditions
  - Determines appropriate PagerDuty service
- Incident Creation
  - Creates incident with full context
  - Includes monitor details and failure information
  - Triggers escalation policy
- Escalation Management
  - PagerDuty handles escalation based on policy
  - Notifications sent via configured channels
  - Incident assigned to on-call personnel
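The "determines appropriate PagerDuty service" step can be sketched as first-match evaluation over the environment-based routing rules shown earlier in this guide. This is an illustration under assumptions: the rule data mirrors that YAML example in simplified form, the helper names `matches` and `select_rule` are hypothetical, and first-match-wins ordering is a guess at how 9n9s resolves overlapping rules.

```python
from typing import Optional

# Simplified copy of the environment-based routing rules from this guide.
ROUTING_RULES = [
    {
        "name": "Production Critical",
        "conditions": {"environment": "production", "criticality": "critical"},
        "service_key": "critical-production-key",
    },
    {
        "name": "Production Standard",
        "conditions": {"environment": "production", "criticality": ["standard", "low"]},
        "service_key": "standard-production-key",
    },
    {
        "name": "Development Alerts",
        "conditions": {"environment": ["development", "staging"]},
        "service_key": "development-key",
    },
]


def matches(conditions: dict, tags: dict) -> bool:
    """True if every condition is satisfied; a list means any of its values is accepted."""
    for key, expected in conditions.items():
        allowed = expected if isinstance(expected, list) else [expected]
        if tags.get(key) not in allowed:
            return False
    return True


def select_rule(tags: dict, rules: list = ROUTING_RULES) -> Optional[dict]:
    """Return the first rule whose conditions match the monitor's tags, else None."""
    for rule in rules:
        if matches(rule["conditions"], tags):
            return rule
    return None
```

A monitor tagged `environment: staging` would route to `development-key`, while a monitor whose tags match no rule returns `None` and would need a fallback channel.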
Recovery Workflow:
- Monitor Recovery Detected
  - 9n9s detects monitor recovery
  - Waits for configured stability period
  - Validates sustained recovery
- Incident Resolution
  - Sends resolve event to PagerDuty
  - Updates incident with recovery details
  - Logs resolution in incident timeline
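The stability period in the recovery workflow exists so that a flapping monitor does not resolve and re-open incidents repeatedly. A minimal sketch of that check follows, assuming check results arrive as chronologically ordered `(timestamp, status)` pairs; the function name `recovery_is_sustained` and the data shape are hypothetical, and the 300-second period in the usage note corresponds to the `resolve_timeout: 300` example earlier in this guide.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple


def recovery_is_sustained(
    checks: Sequence[Tuple[datetime, str]],
    stability_period: timedelta,
    now: datetime,
) -> bool:
    """True if the monitor has been continuously UP for at least stability_period.

    `checks` must be ordered oldest to newest. Any non-UP result resets the clock.
    """
    up_since: Optional[datetime] = None
    for timestamp, status in checks:
        if status == "UP":
            if up_since is None:
                up_since = timestamp  # start of the current UP streak
        else:
            up_since = None  # failure seen: streak is broken
    return up_since is not None and now - up_since >= stability_period
```

With `stability_period=timedelta(seconds=300)`, a resolve event would only be sent once the current UP streak is at least five minutes old.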
Manual Incident Actions
From 9n9s Dashboard:
- Acknowledge Alert: Mark incident as acknowledged in PagerDuty
- Resolve Incident: Manually resolve PagerDuty incident
- Snooze Alerts: Temporarily suspend notifications
- Create Related Incident: Create additional incident for investigation
From PagerDuty:
- View Monitor: Direct link to 9n9s monitor dashboard
- View Logs: Access monitor logs and check history
- Create Runbook: Link to incident response procedures
- Escalate: Manual escalation to next level or different team
Incident Response Best Practices
Response Procedure:
- Acknowledge Quickly: Acknowledge incidents within SLA timeframes
- Assess Impact: Determine scope and business impact
- Communicate Status: Update stakeholders via status pages
- Investigate Root Cause: Use 9n9s logs and monitoring data
- Implement Resolution: Fix underlying issues
- Document Learning: Update runbooks and improve monitoring
Escalation Guidelines:
- Level 1: Individual contributor response (5-15 minutes)
- Level 2: Senior engineer or team lead (15-30 minutes)
- Level 3: Management involvement (30+ minutes or high impact)
- Emergency: Immediate escalation for critical business impact
Security and Compliance
API Security
Integration Key Management:
- Store integration keys securely in 9n9s encrypted storage
- Rotate keys quarterly or as required by security policy
- Use separate keys for different environments
- Monitor integration key usage and access
Network Security:
- All communications use HTTPS/TLS encryption
- PagerDuty webhook verification for inbound events
- IP allowlisting for enhanced security
- Audit logging for all API interactions
Compliance Features
Audit Trail:
- Complete log of all incidents and actions
- Integration activity tracking
- User action attribution
- Retention policies for compliance requirements
Data Protection:
- Minimal sensitive data in incident payloads
- Configurable data masking for sensitive fields
- Regional data storage compliance
- GDPR and SOC 2 compliance support
Troubleshooting
Common Issues
Incidents Not Created in PagerDuty:
- Check Integration Key
  - Verify integration key is correct and active
  - Test key with PagerDuty Events API directly
  - Check for key expiration or service changes
- Review Event Payload
  - Check 9n9s notification logs for error details
  - Verify event payload format and required fields
  - Test with minimal payload to isolate issues
- Service Configuration
  - Ensure PagerDuty service is active
  - Check service integration settings
  - Verify escalation policy is properly configured
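When reviewing a payload, a quick local check for the fields the Events API v2 requires can rule out the most common causes of a rejected event before anything is sent. The sketch below encodes those requirements as understood from the PagerDuty Events API v2 documentation: a `routing_key` and `event_action` on every event, `summary`, `source`, and `severity` on trigger events, and a `dedup_key` on acknowledge and resolve events. The function name `validate_event` is hypothetical.

```python
VALID_ACTIONS = {"trigger", "acknowledge", "resolve"}
VALID_SEVERITIES = {"critical", "error", "warning", "info"}


def validate_event(event: dict) -> list:
    """Return a list of human-readable problems; an empty list means the event looks valid."""
    problems = []
    if not event.get("routing_key"):
        problems.append("routing_key is missing")

    action = event.get("event_action")
    if action not in VALID_ACTIONS:
        problems.append(f"event_action must be one of {sorted(VALID_ACTIONS)}")

    if action == "trigger":
        payload = event.get("payload") or {}
        for field in ("summary", "source", "severity"):
            if not payload.get(field):
                problems.append(f"payload.{field} is missing")
        severity = payload.get("severity")
        if severity and severity not in VALID_SEVERITIES:
            problems.append(f"payload.severity {severity!r} is not valid")
    elif action in ("acknowledge", "resolve") and not event.get("dedup_key"):
        problems.append("dedup_key is required for acknowledge and resolve")
    return problems
```

Starting from the smallest event that passes this check and adding custom fields back one at a time is a simple way to isolate which part of a larger payload is being rejected.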
Incidents Not Auto-Resolving:
- Auto-Resolution Settings
  - Verify auto-resolution is enabled in 9n9s channel
  - Check resolve timeout configuration
  - Ensure monitor actually recovers before timeout
- Deduplication Key Issues
  - Verify deduplication key consistency
  - Check for key format changes
  - Review incident grouping behavior
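Auto-resolution depends on the resolve event carrying exactly the same `dedup_key` as the trigger event that opened the incident. One way to reason about key consistency is to render the template by hand for both events and compare the results. The renderer below is a hypothetical stand-in that handles only the `{{dotted.path}}` placeholders used in this guide; the real 9n9s template engine may behave differently.

```python
import re

# Matches placeholders such as {{monitor.id}} or {{ tags.environment }}.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render_dedup_key(template: str, context: dict) -> str:
    """Replace each {{dotted.path}} placeholder with the matching value from context.

    Raises KeyError if a placeholder refers to a value that is not present.
    """
    def lookup(match: re.Match) -> str:
        value = context
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)

    return PLACEHOLDER.sub(lookup, template)


# Example: the trigger and resolve events must render to the same key.
context = {"monitor": {"id": "mon_123"}, "incident": {"id": "inc_9"}}
trigger_key = render_dedup_key("{{monitor.id}}-{{incident.id}}", context)
resolve_key = render_dedup_key("{{monitor.id}}-{{incident.id}}", context)
```

If the template is edited while an incident is open (for example from `{{monitor.id}}-{{incident.id}}` to `{{monitor.id}}`), the resolve event renders to a different key and the original incident stays open.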
Performance Issues:
- Delayed Incident Creation
  - Check PagerDuty API status and response times
  - Review 9n9s notification queue and processing
  - Verify network connectivity and latency
- Missing Escalations
  - Check PagerDuty escalation policy configuration
  - Verify on-call schedule coverage
  - Review notification method settings
Error Codes and Resolution
Common PagerDuty API Errors:
- 400 Bad Request: Invalid event payload format
  - Solution: Check payload syntax and required fields
- 401 Unauthorized: Invalid or expired integration key
  - Solution: Verify integration key and regenerate if needed
- 429 Rate Limited: Too many API requests
  - Solution: Review rate limiting settings and reduce frequency
- 500 Server Error: PagerDuty service unavailable
  - Solution: Check PagerDuty status page and retry later
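The error list above splits into two groups: errors that will not fix themselves (400, 401) and errors worth retrying after a pause (429 and 5xx). For anyone scripting their own calls to the Events API, that distinction might be sketched as follows. The helper names and the backoff parameters are illustrative assumptions, not values taken from 9n9s or PagerDuty.

```python
def should_retry(status: int) -> bool:
    """429 and 5xx responses are transient; other 4xx responses need a fixed request."""
    return status == 429 or 500 <= status <= 599


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0) -> list:
    """Exponential backoff schedule in seconds: base * 2**n, capped at `cap`."""
    return [min(cap, base * (2 ** n)) for n in range(attempts)]
```

For example, `backoff_delays(4)` yields waits of 1, 2, 4, and 8 seconds, so a rate-limited sender backs off progressively instead of retrying immediately.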
Getting Help
Support Resources:
- 9n9s Support: Use in-app chat or email the 9n9s support team
- PagerDuty Documentation: PagerDuty Developer Docs
- PagerDuty Support: Contact PagerDuty support for service issues
- Community: Join our Discord for peer support and best practices
Best Practices
Service Organization
- Logical Service Grouping
  - Create services based on team ownership
  - Separate critical and non-critical services
  - Use environment-specific services for different alerting needs
- Escalation Policy Design
  - Keep escalation paths simple and well-documented
  - Include backup coverage for all time periods
  - Regular review and testing of escalation policies
- Alert Hygiene
  - Regular review of incident volume and patterns
  - Tune monitor sensitivity to reduce noise
  - Implement proper alert routing to avoid fatigue
Incident Management
- Response Procedures
  - Document clear incident response procedures
  - Provide runbook links in incident details
  - Train team members on PagerDuty features and workflows
- Post-Incident Reviews
  - Regular review of incident patterns and response times
  - Use PagerDuty analytics to identify improvement opportunities
  - Update monitoring and alerting based on learnings
- Continuous Improvement
  - Monitor escalation effectiveness and adjust policies
  - Regular testing of notification channels and workflows
  - Keep integration configurations updated with service changes
PagerDuty integration provides enterprise-grade incident management capabilities for your 9n9s monitoring. Start with basic service setup and gradually implement advanced features like intelligent routing and custom workflows as your incident response processes mature.