PagerDuty Integration

PagerDuty integration enables sophisticated incident response workflows, on-call management, and automated escalation paths. This guide covers complete setup, incident routing, and advanced features for enterprise incident management.

Quick Setup:

  1. Create PagerDuty Service

    • Log into your PagerDuty account
    • Go to Services > Service Directory
    • Click “New Service”, then “Build Your Own”
    • Choose “Events API v2” integration
  2. Add Integration in 9n9s

    • Go to Organization Settings > Notification Channels
    • Click “Add Channel” and select “PagerDuty”
    • Enter your Integration Key and configure options
    • Click “Create Channel”
  3. Test Integration

    • Click “Send Test Alert”
    • Verify incident appears in PagerDuty
    • Check routing and escalation behavior
Prerequisites:

  • PagerDuty Account: Free or paid PagerDuty subscription
  • Service Configuration: At least one PagerDuty service for receiving alerts
  • Escalation Policy: Configured escalation policy with on-call schedules
  • User Permissions: Admin or manager role to create integrations
  • Organization Admin: Admin role to configure notification channels
  • Monitor Access: Appropriate permissions for monitors you want to alert on
  • Plan Features: PagerDuty integration available on all plans
Creating the PagerDuty Service:

  1. Create New Service

    • Log into pagerduty.com
    • Navigate to Services > Service Directory
    • Click “New Service” button
  2. Service Configuration

    • Service Name: “9n9s Production Monitoring” (descriptive name)
    • Description: “Alerts from 9n9s monitoring system”
    • Escalation Policy: Select or create appropriate escalation policy
    • Alert Grouping: Choose grouping strategy (recommended: “Intelligent”)
    • Incident Settings: Configure auto-resolution and urgency rules
  3. Integration Setup

    • Integration Type: Select “Events API v2”
    • Integration Name: “9n9s Integration”
    • Copy Integration Key: Save this key securely for 9n9s configuration
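Before wiring the key into 9n9s, you can verify it directly against the public Events API v2 endpoint (https://events.pagerduty.com/v2/enqueue). The sketch below uses only the Python standard library; `YOUR_INTEGRATION_KEY` is a placeholder for the key you just copied:

```python
import json
import urllib.request

EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

def build_test_event(routing_key: str) -> dict:
    """Smallest payload Events API v2 accepts: a routing key, an event
    action, and a summary/source/severity triple."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": "9n9s integration test",
            "source": "9n9s",
            "severity": "info",
        },
    }

def send_event(event: dict) -> dict:
    """POST the event; a 202 response with status 'success' means the
    integration key is valid and the service is receiving events."""
    req = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Uncomment with a real key to fire a test incident:
# send_event(build_test_event("YOUR_INTEGRATION_KEY"))
```

A successful call creates an incident on the service, which you can then acknowledge and resolve from PagerDuty to confirm routing.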
Adding the Channel in 9n9s:

  1. Navigate to Notification Channels

    • Log into app.9n9s.com
    • Go to Organization Settings in the sidebar
    • Click “Notification Channels”
  2. Create PagerDuty Channel

    • Click “Add Channel” button
    • Select “PagerDuty” from integration types
    • This opens the PagerDuty configuration form
  3. Basic Configuration

    • Channel Name: “PagerDuty Production” (internal reference name)
    • Integration Key: Paste the key from PagerDuty service
    • Default Severity: Choose default incident severity
    • Description: Optional description of channel purpose
Channel Configuration Options:

  1. Severity Mapping

    • 9n9s Critical → PagerDuty Critical (highest urgency)
    • 9n9s Warning → PagerDuty Warning (medium urgency)
    • 9n9s Info → PagerDuty Info (low urgency)
  2. Incident Details Configuration

    • Incident Title Format: Customize how incident titles appear
    • Incident Description: Configure detail level in incident body
    • Custom Fields: Map 9n9s monitor tags to PagerDuty fields
    • Deduplication Key: Configure how similar alerts are grouped
  3. Advanced Settings

    • Auto-Resolution: Enable automatic incident resolution when monitor recovers
    • Event Action: Configure trigger/acknowledge/resolve behavior
    • Client Information: Include 9n9s details in incident context
    • Rate Limiting: Control incident creation frequency
Testing the Integration:

  1. Send Test Incident

    • Click “Send Test Alert” in channel configuration
    • Choose test severity level (Critical, Warning, Info)
    • Add test message and details
  2. Verify in PagerDuty

    • Check that incident appears in PagerDuty within 30 seconds
    • Verify incident details, severity, and routing
    • Confirm escalation policy triggers correctly
    • Test acknowledgment and resolution workflows
  3. Test Auto-Resolution

    • Create a test monitor that will fail and recover
    • Verify incidents auto-resolve when monitor recovers
    • Check incident timeline for proper event sequence

Example Basic Escalation Policy:

Level 1 (Immediate):
• Primary On-Call Engineer
• Escalate after: 5 minutes
Level 2 (Secondary):
• Senior On-Call Engineer
• Secondary On-Call Engineer
• Escalate after: 15 minutes
Level 3 (Management):
• Engineering Manager
• Ops Manager
• Repeat escalation every: 30 minutes

Advanced Escalation with Time Restrictions:

Business Hours (9 AM - 6 PM):
Level 1: Primary Engineer (5 min)
Level 2: Team Lead (10 min)
Level 3: Manager (repeat every 30 min)
After Hours/Weekends:
Level 1: On-Call Engineer (immediate)
Level 2: Backup On-Call (5 min)
Level 3: Emergency Contact (15 min)

Intelligent Grouping (Recommended):

  • Automatically groups related alerts using machine learning
  • Reduces noise during incidents
  • Groups by service, component, or similar characteristics

Time-Based Grouping:

  • Groups alerts occurring within specified time window
  • Useful for services with predictable failure patterns
  • Configurable grouping window (1-60 minutes)

Content-Based Grouping:

  • Groups alerts with similar titles or descriptions
  • Uses exact matching or fuzzy matching
  • Good for standardized alert formats

Custom Grouping Rules:

{
  "grouping_config": {
    "type": "content_based",
    "fields": ["monitor.name", "tags.environment"],
    "time_window": 300,
    "max_group_size": 10
  }
}

Configure different services for different alert types:

Critical Production Service:

Service: "Critical Production Alerts"
Integration Key: [critical-service-key]
Escalation Policy: "Critical Production Policy"
Urgency: High
Auto-Resolution: Enabled

Non-Critical Service:

Service: "Non-Critical Alerts"
Integration Key: [non-critical-service-key]
Escalation Policy: "Standard Policy"
Urgency: Low
Auto-Resolution: Enabled

Development Environment Service:

Service: "Development Alerts"
Integration Key: [dev-service-key]
Escalation Policy: "Development Policy"
Urgency: Low
Auto-Resolution: Enabled

Standard Event Payload:

{
  "routing_key": "YOUR_INTEGRATION_KEY",
  "event_action": "trigger",
  "dedup_key": "{{monitor.id}}-{{incident.id}}",
  "payload": {
    "summary": "{{monitor.name}} is {{monitor.status}}",
    "source": "9n9s",
    "severity": "critical",
    "component": "{{tags.component}}",
    "group": "{{tags.team}}",
    "class": "{{monitor.type}}",
    "custom_details": {
      "monitor_id": "{{monitor.id}}",
      "project": "{{project.name}}",
      "environment": "{{tags.environment}}",
      "duration": "{{incident.duration}}",
      "response_code": "{{metadata.response_code}}",
      "response_time": "{{metadata.response_time}}",
      "monitor_url": "{{links.monitor}}",
      "logs_url": "{{links.logs}}"
    }
  },
  "client": "9n9s Monitoring",
  "client_url": "{{links.monitor}}"
}
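The `{{…}}` placeholders in these payloads resolve against the alert's context at send time. A minimal sketch of that substitution is shown below; the context fields are illustrative examples, since the actual variable set is defined by 9n9s:

```python
import re

def lookup(path: str, context: dict) -> str:
    """Walk a dotted path like 'monitor.name' through nested dicts,
    returning '' for anything missing."""
    value = context
    for part in path.split("."):
        if not isinstance(value, dict):
            return ""
        value = value.get(part, "")
    return str(value)

def render(template: str, context: dict) -> str:
    """Replace every {{dotted.path}} placeholder with its context value."""
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}",
                  lambda m: lookup(m.group(1), context), template)

context = {"monitor": {"name": "API Health", "status": "down"}}
summary = render("{{monitor.name}} is {{monitor.status}}", context)
# summary == "API Health is down"
```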

Enhanced Event with Context:

{
  "routing_key": "YOUR_INTEGRATION_KEY",
  "event_action": "trigger",
  "dedup_key": "{{monitor.id}}-{{incident.id}}",
  "payload": {
    "summary": "[{{tags.environment}}] {{monitor.name}} - {{monitor.status}}",
    "source": "{{tags.service}}.{{tags.environment}}",
    "severity": "{{alert.severity}}",
    "component": "{{tags.component}}",
    "group": "{{tags.team}}",
    "class": "monitoring",
    "custom_details": {
      "incident_details": {
        "monitor_name": "{{monitor.name}}",
        "monitor_type": "{{monitor.type}}",
        "current_status": "{{monitor.status}}",
        "previous_status": "{{monitor.previous_status}}",
        "incident_started": "{{incident.started_at}}",
        "incident_duration": "{{incident.duration}}"
      },
      "monitor_configuration": {
        "check_frequency": "{{monitor.frequency}}",
        "timeout": "{{monitor.timeout}}",
        "regions": "{{monitor.regions}}"
      },
      "failure_details": {
        "response_code": "{{metadata.response_code}}",
        "response_time": "{{metadata.response_time}}",
        "error_message": "{{metadata.error_message}}"
      },
      "environment_context": {
        "project": "{{project.name}}",
        "environment": "{{tags.environment}}",
        "service": "{{tags.service}}",
        "team": "{{tags.team}}",
        "criticality": "{{tags.criticality}}"
      },
      "actions": {
        "view_monitor": "{{links.monitor}}",
        "view_logs": "{{links.logs}}",
        "view_project": "{{links.project}}",
        "create_incident": "{{links.incident_creation}}"
      }
    }
  },
  "client": "9n9s Monitoring System",
  "client_url": "{{links.organization}}"
}

Configure routing based on monitor attributes:

Environment-Based Routing:

routing_rules:
  - name: "Production Critical"
    conditions:
      tags:
        environment: production
        criticality: critical
    pagerduty_settings:
      service_key: "critical-production-key"
      severity: "critical"
      escalation_policy: "Critical Production Policy"
  - name: "Production Standard"
    conditions:
      tags:
        environment: production
        criticality: [standard, low]
    pagerduty_settings:
      service_key: "standard-production-key"
      severity: "warning"
      escalation_policy: "Standard Production Policy"
  - name: "Development Alerts"
    conditions:
      tags:
        environment: [development, staging]
    pagerduty_settings:
      service_key: "development-key"
      severity: "info"
      escalation_policy: "Development Policy"
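First-match evaluation of rules like these can be sketched as follows. The rule shape mirrors the YAML above; the matching semantics (first match wins, a list condition means "any of these values") are an assumption, since 9n9s's actual evaluation order isn't specified here:

```python
def match_routing_rule(rules: list, monitor_tags: dict):
    """Return the pagerduty_settings of the first rule whose tag
    conditions all match. A condition value may be a scalar or a list
    of allowed values (e.g. environment: [development, staging])."""
    for rule in rules:
        wanted = rule["conditions"]["tags"]
        if all(
            monitor_tags.get(key) in (val if isinstance(val, list) else [val])
            for key, val in wanted.items()
        ):
            return rule["pagerduty_settings"]
    return None  # no rule matched; fall back to the channel default

rules = [
    {"name": "Production Critical",
     "conditions": {"tags": {"environment": "production",
                             "criticality": "critical"}},
     "pagerduty_settings": {"service_key": "critical-production-key"}},
    {"name": "Development Alerts",
     "conditions": {"tags": {"environment": ["development", "staging"]}},
     "pagerduty_settings": {"service_key": "development-key"}},
]
```

With these rules, a monitor tagged `environment: production, criticality: critical` routes to the critical service key, while a `staging` monitor routes to the development key.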

Team-Based Routing:

routing_rules:
  - name: "Backend Team Alerts"
    conditions:
      tags:
        team: backend
    pagerduty_settings:
      service_key: "backend-team-key"
      custom_details:
        team_lead: "[email protected]"
        runbook: "https://wiki.company.com/backend-runbooks"
  - name: "Frontend Team Alerts"
    conditions:
      tags:
        team: frontend
    pagerduty_settings:
      service_key: "frontend-team-key"
      custom_details:
        team_lead: "[email protected]"
        runbook: "https://wiki.company.com/frontend-runbooks"

Trigger Events:

  • Create new incidents in PagerDuty
  • Sent when monitor status changes to DOWN/DEGRADED
  • Includes full context and failure details

Acknowledge Events:

  • Acknowledge existing incidents
  • Can be triggered automatically or manually from 9n9s
  • Updates incident status without resolving

Resolve Events:

  • Automatically resolve incidents when monitor recovers
  • Sent when monitor status changes to UP
  • Includes recovery time and resolution details
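All three event actions share one `dedup_key`, which is how PagerDuty ties them to a single incident. A minimal sketch of event construction (the `lifecycle_event` helper and field values are illustrative, not a 9n9s API):

```python
def lifecycle_event(action: str, routing_key: str, dedup_key: str,
                    summary: str = "") -> dict:
    """Build an Events API v2 event. trigger/acknowledge/resolve events
    that share a dedup_key all act on the same PagerDuty incident."""
    event = {
        "routing_key": routing_key,
        "event_action": action,  # "trigger", "acknowledge", or "resolve"
        "dedup_key": dedup_key,
    }
    if action == "trigger":
        # Only trigger events carry a payload; acknowledge and resolve
        # just reference the existing incident via dedup_key.
        event["payload"] = {
            "summary": summary,
            "source": "9n9s",
            "severity": "critical",
        }
    return event

# One incident's lifecycle: the same dedup_key throughout.
down = lifecycle_event("trigger", "KEY", "mon-42-inc-7", "API Health is down")
back = lifecycle_event("resolve", "KEY", "mon-42-inc-7")
```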

Automatic Incident Management:

lifecycle_settings:
  auto_trigger: true        # Create incidents for failed monitors
  auto_acknowledge: false   # Manual acknowledgment required
  auto_resolve: true        # Resolve when monitor recovers
  resolve_timeout: 300      # Auto-resolve after 5 minutes of UP status
  escalation_timeout: 900   # Re-escalate if unacknowledged for 15 minutes

Manual Incident Management:

lifecycle_settings:
  auto_trigger: true        # Create incidents for failed monitors
  auto_acknowledge: false   # Manual acknowledgment required
  auto_resolve: false       # Manual resolution required
  resolve_timeout: null     # No automatic resolution
  escalation_timeout: 1800  # Re-escalate after 30 minutes
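The `resolve_timeout` setting can be read as "hold the resolve event until the monitor has stayed UP this long." A hedged sketch of that decision; the function name and settings shape are assumptions mirroring the YAML above:

```python
def should_auto_resolve(settings: dict, seconds_up: float) -> bool:
    """Send a resolve event only when auto_resolve is enabled and the
    monitor has remained UP for at least resolve_timeout seconds."""
    if not settings.get("auto_resolve"):
        return False  # manual resolution required
    # A null/missing timeout means resolve as soon as the monitor is UP.
    timeout = settings.get("resolve_timeout") or 0
    return seconds_up >= timeout

automatic = {"auto_resolve": True, "resolve_timeout": 300}
manual = {"auto_resolve": False, "resolve_timeout": None}
```

Under the automatic settings, a monitor UP for 100 seconds keeps its incident open, while 300 seconds of sustained recovery triggers resolution; the manual settings never auto-resolve.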

Monitor-Based Deduplication:

Dedup Key: "{{monitor.id}}"
Result: One incident per monitor, updates existing incident

Incident-Based Deduplication:

Dedup Key: "{{monitor.id}}-{{incident.id}}"
Result: New incident for each failure occurrence

Service-Based Deduplication:

Dedup Key: "{{tags.service}}-{{tags.environment}}"
Result: One incident per service environment combination
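The three strategies differ only in how the key is assembled. A small illustrative helper (the function and input shapes are assumptions):

```python
def build_dedup_key(strategy: str, monitor: dict,
                    incident: dict, tags: dict) -> str:
    """Assemble a PagerDuty dedup_key per the strategies above."""
    if strategy == "monitor":
        return monitor["id"]                        # one incident per monitor
    if strategy == "incident":
        return f"{monitor['id']}-{incident['id']}"  # new incident per failure
    if strategy == "service":
        # one incident per service/environment combination
        return f"{tags['service']}-{tags['environment']}"
    raise ValueError(f"unknown strategy: {strategy!r}")

monitor = {"id": "mon-42"}
incident = {"id": "inc-7"}
tags = {"service": "api", "environment": "production"}
```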

Simple Alert Rule:

alert_rule:
  name: "Critical Production Alert"
  conditions:
    - monitor_status: "down"
    - tags.environment: "production"
    - tags.criticality: "critical"
  actions:
    - channel: "pagerduty-critical"
      delay: 0  # Immediate

Escalation Alert Rule:

alert_rule:
  name: "Escalating Production Alert"
  conditions:
    - monitor_status: "down"
    - tags.environment: "production"
  actions:
    - channel: "slack-alerts"
      delay: 0    # Immediate Slack notification
    - channel: "pagerduty-standard"
      delay: 300  # PagerDuty after 5 minutes
    - channel: "pagerduty-critical"
      delay: 900  # Critical escalation after 15 minutes

Environment Mapping:

tag_mapping:
  environment:
    production:
      service_key: "prod-service-key"
      severity: "critical"
    staging:
      service_key: "staging-service-key"
      severity: "warning"
    development:
      service_key: "dev-service-key"
      severity: "info"

Team Mapping:

tag_mapping:
  team:
    backend:
      service_key: "backend-service-key"
      escalation_policy: "Backend Team Policy"
    frontend:
      service_key: "frontend-service-key"
      escalation_policy: "Frontend Team Policy"
    infrastructure:
      service_key: "infra-service-key"
      escalation_policy: "Infrastructure Policy"

Incident Creation Workflow:

  1. Monitor Failure Detected

    • 9n9s detects monitor failure
    • Evaluates alert rules and conditions
    • Determines appropriate PagerDuty service
  2. Incident Creation

    • Creates incident with full context
    • Includes monitor details and failure information
    • Triggers escalation policy
  3. Escalation Management

    • PagerDuty handles escalation based on policy
    • Notifications sent via configured channels
    • Incident assigned to on-call personnel

Recovery Workflow:

  1. Monitor Recovery Detected

    • 9n9s detects monitor recovery
    • Waits for configured stability period
    • Validates sustained recovery
  2. Incident Resolution

    • Sends resolve event to PagerDuty
    • Updates incident with recovery details
    • Logs resolution in incident timeline

From 9n9s Dashboard:

  • Acknowledge Alert: Mark incident as acknowledged in PagerDuty
  • Resolve Incident: Manually resolve PagerDuty incident
  • Snooze Alerts: Temporarily suspend notifications
  • Create Related Incident: Create additional incident for investigation

From PagerDuty:

  • View Monitor: Direct link to 9n9s monitor dashboard
  • View Logs: Access monitor logs and check history
  • Create Runbook: Link to incident response procedures
  • Escalate: Manual escalation to next level or different team

Response Procedure:

  1. Acknowledge Quickly: Acknowledge incidents within SLA timeframes
  2. Assess Impact: Determine scope and business impact
  3. Communicate Status: Update stakeholders via status pages
  4. Investigate Root Cause: Use 9n9s logs and monitoring data
  5. Implement Resolution: Fix underlying issues
  6. Document Learning: Update runbooks and improve monitoring

Escalation Guidelines:

  • Level 1: Individual contributor response (5-15 minutes)
  • Level 2: Senior engineer or team lead (15-30 minutes)
  • Level 3: Management involvement (30+ minutes or high impact)
  • Emergency: Immediate escalation for critical business impact

Integration Key Management:

  • Store integration keys securely in 9n9s encrypted storage
  • Rotate keys quarterly or as required by security policy
  • Use separate keys for different environments
  • Monitor integration key usage and access

Network Security:

  • All communications use HTTPS/TLS encryption
  • PagerDuty webhook verification for inbound events
  • IP allowlisting for enhanced security
  • Audit logging for all API interactions

Audit Trail:

  • Complete log of all incidents and actions
  • Integration activity tracking
  • User action attribution
  • Retention policies for compliance requirements

Data Protection:

  • Minimal sensitive data in incident payloads
  • Configurable data masking for sensitive fields
  • Regional data storage compliance
  • GDPR and SOC 2 compliance support

Incidents Not Created in PagerDuty:

  1. Check Integration Key

    • Verify integration key is correct and active
    • Test key with PagerDuty Events API directly
    • Check for key expiration or service changes
  2. Review Event Payload

    • Check 9n9s notification logs for error details
    • Verify event payload format and required fields
    • Test with minimal payload to isolate issues
  3. Service Configuration

    • Ensure PagerDuty service is active
    • Check service integration settings
    • Verify escalation policy is properly configured

Incidents Not Auto-Resolving:

  1. Auto-Resolution Settings

    • Verify auto-resolution is enabled in 9n9s channel
    • Check resolve timeout configuration
    • Ensure monitor actually recovers before timeout
  2. Deduplication Key Issues

    • Verify deduplication key consistency
    • Check for key format changes
    • Review incident grouping behavior

Performance Issues:

  1. Delayed Incident Creation

    • Check PagerDuty API status and response times
    • Review 9n9s notification queue and processing
    • Verify network connectivity and latency
  2. Missing Escalations

    • Check PagerDuty escalation policy configuration
    • Verify on-call schedule coverage
    • Review notification method settings

Common PagerDuty API Errors:

  • 400 Bad Request: Invalid event payload format

    • Solution: Check payload syntax and required fields
  • 401 Unauthorized: Invalid or expired integration key

    • Solution: Verify integration key and regenerate if needed
  • 429 Rate Limited: Too many API requests

    • Solution: Review rate limiting settings and reduce frequency
  • 500 Server Error: PagerDuty service unavailable

    • Solution: Check PagerDuty status page and retry later
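A sensible client-side policy for these statuses is to retry 429 and 5xx responses with exponential backoff while failing fast on 400/401, so payload and key errors surface immediately. A sketch under those assumptions (the `send` callable abstracts over whatever HTTP client is in use):

```python
import time

# Statuses worth retrying: rate limiting and transient server errors.
RETRYABLE = {429, 500, 502, 503, 504}

def send_with_retry(send, event, max_attempts=4,
                    base_delay=1.0, sleep=time.sleep):
    """Call send(event) -> (status_code, body). Retry retryable statuses
    with exponential backoff; return non-retryable responses (e.g. 400,
    401) immediately so configuration errors are not masked."""
    for attempt in range(max_attempts):
        status, body = send(event)
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return status, body
```

Injecting `sleep` makes the backoff schedule testable without real delays.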

Support Resources:

  1. 9n9s Support: Use in-app chat or email [email protected]
  2. PagerDuty Documentation: PagerDuty Developer Docs
  3. PagerDuty Support: Contact PagerDuty support for service issues
  4. Community: Join our Discord for peer support and best practices
Best Practices:

  1. Logical Service Grouping

    • Create services based on team ownership
    • Separate critical and non-critical services
    • Use environment-specific services for different alerting needs
  2. Escalation Policy Design

    • Keep escalation paths simple and well-documented
    • Include backup coverage for all time periods
    • Regular review and testing of escalation policies
  3. Alert Hygiene

    • Regular review of incident volume and patterns
    • Tune monitor sensitivity to reduce noise
    • Implement proper alert routing to avoid fatigue
  1. Response Procedures

    • Document clear incident response procedures
    • Provide runbook links in incident details
    • Train team members on PagerDuty features and workflows
  2. Post-Incident Reviews

    • Regular review of incident patterns and response times
    • Use PagerDuty analytics to identify improvement opportunities
    • Update monitoring and alerting based on learnings
  3. Continuous Improvement

    • Monitor escalation effectiveness and adjust policies
    • Regular testing of notification channels and workflows
    • Keep integration configurations updated with service changes

PagerDuty integration provides enterprise-grade incident management capabilities for your 9n9s monitoring. Start with basic service setup and gradually implement advanced features like intelligent routing and custom workflows as your incident response processes mature.