PagerDuty Integration

PagerDuty integration enables sophisticated incident response workflows, on-call management, and automated escalation paths. This guide covers complete setup, incident routing, and advanced features for enterprise incident management.

Quick Setup:

  1. Create PagerDuty Service

    • Log into your PagerDuty account
    • Go to Services > Service Directory
    • Click “New Service”, then “Build Your Own”
    • Choose “Events API v2” integration
  2. Add Integration in 9n9s

    • Go to Organization Settings > Notification Channels
    • Click “Add Channel” and select “PagerDuty”
    • Enter your Integration Key and configure options
    • Click “Create Channel”
  3. Test Integration

    • Click “Send Test Alert”
    • Verify incident appears in PagerDuty
    • Check routing and escalation behavior
Prerequisites:

  • PagerDuty Account: Free or paid PagerDuty subscription
  • Service Configuration: At least one PagerDuty service for receiving alerts
  • Escalation Policy: Configured escalation policy with on-call schedules
  • User Permissions: Admin or manager role to create integrations
  • Organization Admin: Admin role to configure notification channels
  • Monitor Access: Appropriate permissions for monitors you want to alert on
  • Plan Features: PagerDuty integration available on all plans
Creating the PagerDuty Service:

  1. Create New Service

    • Log into pagerduty.com
    • Navigate to Services > Service Directory
    • Click “New Service” button
  2. Service Configuration

    • Service Name: “9n9s Production Monitoring” (descriptive name)
    • Description: “Alerts from 9n9s monitoring system”
    • Escalation Policy: Select or create appropriate escalation policy
    • Alert Grouping: Choose grouping strategy (recommended: “Intelligent”)
    • Incident Settings: Configure auto-resolution and urgency rules
  3. Integration Setup

    • Integration Type: Select “Events API v2”
    • Integration Name: “9n9s Integration”
    • Copy Integration Key: Save this key securely for 9n9s configuration
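Before wiring the key into 9n9s, you can verify it directly against the public Events API v2 endpoint (https://events.pagerduty.com/v2/enqueue). The sketch below uses only the Python standard library; `YOUR_INTEGRATION_KEY` is a placeholder for the key you just copied:

```python
import json
import urllib.request

EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

def build_test_event(routing_key: str) -> dict:
    """Smallest payload Events API v2 accepts: a routing key, an event
    action, and a summary/source/severity triple."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": "9n9s integration test",
            "source": "9n9s",
            "severity": "info",
        },
    }

def send_event(event: dict) -> dict:
    """POST the event; a 202 response with status 'success' means the
    integration key is valid and the service is receiving events."""
    req = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Uncomment with a real key to fire a test incident:
# send_event(build_test_event("YOUR_INTEGRATION_KEY"))
```

A successful call creates an incident on the service, which you can then acknowledge and resolve from PagerDuty to confirm routing.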
Adding the Channel in 9n9s:

  1. Navigate to Notification Channels

    • Log into app.9n9s.com
    • Go to Organization Settings in the sidebar
    • Click “Notification Channels”
  2. Create PagerDuty Channel

    • Click “Add Channel” button
    • Select “PagerDuty” from integration types
    • This opens the PagerDuty configuration form
  3. Basic Configuration

    • Channel Name: “PagerDuty Production” (internal reference name)
    • Integration Key: Paste the key from PagerDuty service
    • Default Severity: Choose default incident severity
    • Description: Optional description of channel purpose
Channel Configuration Options:

  1. Severity Mapping

    • 9n9s Critical → PagerDuty Critical (highest urgency)
    • 9n9s Warning → PagerDuty Warning (medium urgency)
    • 9n9s Info → PagerDuty Info (low urgency)
  2. Incident Details Configuration

    • Incident Title Format: Customize how incident titles appear
    • Incident Description: Configure detail level in incident body
    • Custom Fields: Map 9n9s monitor tags to PagerDuty fields
    • Deduplication Key: Configure how similar alerts are grouped
  3. Advanced Settings

    • Auto-Resolution: Enable automatic incident resolution when monitor recovers
    • Event Action: Configure trigger/acknowledge/resolve behavior
    • Client Information: Include 9n9s details in incident context
    • Rate Limiting: Control incident creation frequency
Testing the Integration:

  1. Send Test Incident

    • Click “Send Test Alert” in channel configuration
    • Choose test severity level (Critical, Warning, Info)
    • Add test message and details
  2. Verify in PagerDuty

    • Check that incident appears in PagerDuty within 30 seconds
    • Verify incident details, severity, and routing
    • Confirm escalation policy triggers correctly
    • Test acknowledgment and resolution workflows
  3. Test Auto-Resolution

    • Create a test monitor that will fail and recover
    • Verify incidents auto-resolve when monitor recovers
    • Check incident timeline for proper event sequence

Example Basic Escalation Policy:

Level 1 (Immediate):
• Primary On-Call Engineer
• Escalate after: 5 minutes
Level 2 (Secondary):
• Senior On-Call Engineer
• Secondary On-Call Engineer
• Escalate after: 15 minutes
Level 3 (Management):
• Engineering Manager
• Ops Manager
• Repeat escalation every: 30 minutes

Advanced Escalation with Time Restrictions:

Business Hours (9 AM - 6 PM):
Level 1: Primary Engineer (5 min)
Level 2: Team Lead (10 min)
Level 3: Manager (repeat every 30 min)
After Hours/Weekends:
Level 1: On-Call Engineer (immediate)
Level 2: Backup On-Call (5 min)
Level 3: Emergency Contact (15 min)

Intelligent Grouping (Recommended):

  • Automatically groups related alerts using machine learning
  • Reduces noise during incidents
  • Groups by service, component, or similar characteristics

Time-Based Grouping:

  • Groups alerts occurring within specified time window
  • Useful for services with predictable failure patterns
  • Configurable grouping window (1-60 minutes)

Content-Based Grouping:

  • Groups alerts with similar titles or descriptions
  • Uses exact matching or fuzzy matching
  • Good for standardized alert formats

Custom Grouping Rules:

{
  "grouping_config": {
    "type": "content_based",
    "fields": ["monitor.name", "tags.environment"],
    "time_window": 300,
    "max_group_size": 10
  }
}

Configure different services for different alert types:

Critical Production Service:

Service: "Critical Production Alerts"
Integration Key: [critical-service-key]
Escalation Policy: "Critical Production Policy"
Urgency: High
Auto-Resolution: Enabled

Non-Critical Service:

Service: "Non-Critical Alerts"
Integration Key: [non-critical-service-key]
Escalation Policy: "Standard Policy"
Urgency: Low
Auto-Resolution: Enabled

Development Environment Service:

Service: "Development Alerts"
Integration Key: [dev-service-key]
Escalation Policy: "Development Policy"
Urgency: Low
Auto-Resolution: Enabled

Standard Event Payload:

{
  "routing_key": "YOUR_INTEGRATION_KEY",
  "event_action": "trigger",
  "dedup_key": "{{monitor.id}}-{{incident.id}}",
  "payload": {
    "summary": "{{monitor.name}} is {{monitor.status}}",
    "source": "9n9s",
    "severity": "critical",
    "component": "{{tags.component}}",
    "group": "{{tags.team}}",
    "class": "{{monitor.type}}",
    "custom_details": {
      "monitor_id": "{{monitor.id}}",
      "project": "{{project.name}}",
      "environment": "{{tags.environment}}",
      "duration": "{{incident.duration}}",
      "response_code": "{{metadata.response_code}}",
      "response_time": "{{metadata.response_time}}",
      "monitor_url": "{{links.monitor}}",
      "logs_url": "{{links.logs}}"
    }
  },
  "client": "9n9s Monitoring",
  "client_url": "{{links.monitor}}"
}
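The `{{…}}` placeholders in these payloads resolve against the alert's context at send time. A minimal sketch of that substitution is shown below; the context fields are illustrative examples, since the actual variable set is defined by 9n9s:

```python
import re

def lookup(path: str, context: dict) -> str:
    """Walk a dotted path like 'monitor.name' through nested dicts,
    returning '' for anything missing."""
    value = context
    for part in path.split("."):
        if not isinstance(value, dict):
            return ""
        value = value.get(part, "")
    return str(value)

def render(template: str, context: dict) -> str:
    """Replace every {{dotted.path}} placeholder with its context value."""
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}",
                  lambda m: lookup(m.group(1), context), template)

context = {"monitor": {"name": "API Health", "status": "down"}}
summary = render("{{monitor.name}} is {{monitor.status}}", context)
# summary == "API Health is down"
```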

Enhanced Event with Context:

{
  "routing_key": "YOUR_INTEGRATION_KEY",
  "event_action": "trigger",
  "dedup_key": "{{monitor.id}}-{{incident.id}}",
  "payload": {
    "summary": "[{{tags.environment}}] {{monitor.name}} - {{monitor.status}}",
    "source": "{{tags.service}}.{{tags.environment}}",
    "severity": "{{alert.severity}}",
    "component": "{{tags.component}}",
    "group": "{{tags.team}}",
    "class": "monitoring",
    "custom_details": {
      "incident_details": {
        "monitor_name": "{{monitor.name}}",
        "monitor_type": "{{monitor.type}}",
        "current_status": "{{monitor.status}}",
        "previous_status": "{{monitor.previous_status}}",
        "incident_started": "{{incident.started_at}}",
        "incident_duration": "{{incident.duration}}"
      },
      "monitor_configuration": {
        "check_frequency": "{{monitor.frequency}}",
        "timeout": "{{monitor.timeout}}",
        "regions": "{{monitor.regions}}"
      },
      "failure_details": {
        "response_code": "{{metadata.response_code}}",
        "response_time": "{{metadata.response_time}}",
        "error_message": "{{metadata.error_message}}"
      },
      "environment_context": {
        "project": "{{project.name}}",
        "environment": "{{tags.environment}}",
        "service": "{{tags.service}}",
        "team": "{{tags.team}}",
        "criticality": "{{tags.criticality}}"
      },
      "actions": {
        "view_monitor": "{{links.monitor}}",
        "view_logs": "{{links.logs}}",
        "view_project": "{{links.project}}",
        "create_incident": "{{links.incident_creation}}"
      }
    }
  },
  "client": "9n9s Monitoring System",
  "client_url": "{{links.organization}}"
}

Configure routing based on monitor attributes:

Environment-Based Routing:

routing_rules:
  - name: "Production Critical"
    conditions:
      tags:
        environment: production
        criticality: critical
    pagerduty_settings:
      service_key: "critical-production-key"
      severity: "critical"
      escalation_policy: "Critical Production Policy"
  - name: "Production Standard"
    conditions:
      tags:
        environment: production
        criticality: [standard, low]
    pagerduty_settings:
      service_key: "standard-production-key"
      severity: "warning"
      escalation_policy: "Standard Production Policy"
  - name: "Development Alerts"
    conditions:
      tags:
        environment: [development, staging]
    pagerduty_settings:
      service_key: "development-key"
      severity: "info"
      escalation_policy: "Development Policy"
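First-match evaluation of rules like these can be sketched as follows. The rule shape mirrors the YAML above; the matching semantics (first match wins, a list condition means "any of these values") are an assumption, since 9n9s's actual evaluation order isn't specified here:

```python
def match_routing_rule(rules: list, monitor_tags: dict):
    """Return the pagerduty_settings of the first rule whose tag
    conditions all match. A condition value may be a scalar or a list
    of allowed values (e.g. environment: [development, staging])."""
    for rule in rules:
        wanted = rule["conditions"]["tags"]
        if all(
            monitor_tags.get(key) in (val if isinstance(val, list) else [val])
            for key, val in wanted.items()
        ):
            return rule["pagerduty_settings"]
    return None  # no rule matched; fall back to the channel default

rules = [
    {"name": "Production Critical",
     "conditions": {"tags": {"environment": "production",
                             "criticality": "critical"}},
     "pagerduty_settings": {"service_key": "critical-production-key"}},
    {"name": "Development Alerts",
     "conditions": {"tags": {"environment": ["development", "staging"]}},
     "pagerduty_settings": {"service_key": "development-key"}},
]
```

With these rules, a monitor tagged `environment: production, criticality: critical` routes to the critical service key, while a `staging` monitor routes to the development key.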

Team-Based Routing:

routing_rules:
  - name: "Backend Team Alerts"
    conditions:
      tags:
        team: backend
    pagerduty_settings:
      service_key: "backend-team-key"
      custom_details:
        team_lead: "[email protected]"
        runbook: "https://wiki.company.com/backend-runbooks"
  - name: "Frontend Team Alerts"
    conditions:
      tags:
        team: frontend
    pagerduty_settings:
      service_key: "frontend-team-key"
      custom_details:
        team_lead: "[email protected]"
        runbook: "https://wiki.company.com/frontend-runbooks"

Trigger Events:

  • Create new incidents in PagerDuty
  • Sent when monitor status changes to DOWN/DEGRADED
  • Includes full context and failure details

Acknowledge Events:

  • Acknowledge existing incidents
  • Can be triggered automatically or manually from 9n9s
  • Updates incident status without resolving

Resolve Events:

  • Automatically resolve incidents when monitor recovers
  • Sent when monitor status changes to UP
  • Includes recovery time and resolution details
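All three event actions share one `dedup_key`, which is how PagerDuty ties them to a single incident. A minimal sketch of event construction (the `lifecycle_event` helper and field values are illustrative, not a 9n9s API):

```python
def lifecycle_event(action: str, routing_key: str, dedup_key: str,
                    summary: str = "") -> dict:
    """Build an Events API v2 event. trigger/acknowledge/resolve events
    that share a dedup_key all act on the same PagerDuty incident."""
    event = {
        "routing_key": routing_key,
        "event_action": action,  # "trigger", "acknowledge", or "resolve"
        "dedup_key": dedup_key,
    }
    if action == "trigger":
        # Only trigger events carry a payload; acknowledge and resolve
        # just reference the existing incident via dedup_key.
        event["payload"] = {
            "summary": summary,
            "source": "9n9s",
            "severity": "critical",
        }
    return event

# One incident's lifecycle: the same dedup_key throughout.
down = lifecycle_event("trigger", "KEY", "mon-42-inc-7", "API Health is down")
back = lifecycle_event("resolve", "KEY", "mon-42-inc-7")
```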

Automatic Incident Management:

lifecycle_settings:
  auto_trigger: true        # Create incidents for failed monitors
  auto_acknowledge: false   # Manual acknowledgment required
  auto_resolve: true        # Resolve when monitor recovers
  resolve_timeout: 300      # Auto-resolve after 5 minutes of UP status
  escalation_timeout: 900   # Re-escalate if unacknowledged for 15 minutes

Manual Incident Management:

lifecycle_settings:
  auto_trigger: true        # Create incidents for failed monitors
  auto_acknowledge: false   # Manual acknowledgment required
  auto_resolve: false       # Manual resolution required
  resolve_timeout: null     # No automatic resolution
  escalation_timeout: 1800  # Re-escalate after 30 minutes
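The `resolve_timeout` setting can be read as "hold the resolve event until the monitor has stayed UP this long." A hedged sketch of that decision; the function name and settings shape are assumptions mirroring the YAML above:

```python
def should_auto_resolve(settings: dict, seconds_up: float) -> bool:
    """Send a resolve event only when auto_resolve is enabled and the
    monitor has remained UP for at least resolve_timeout seconds."""
    if not settings.get("auto_resolve"):
        return False  # manual resolution required
    # A null/missing timeout means resolve as soon as the monitor is UP.
    timeout = settings.get("resolve_timeout") or 0
    return seconds_up >= timeout

automatic = {"auto_resolve": True, "resolve_timeout": 300}
manual = {"auto_resolve": False, "resolve_timeout": None}
```

Under the automatic settings, a monitor UP for 100 seconds keeps its incident open, while 300 seconds of sustained recovery triggers resolution; the manual settings never auto-resolve.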

Monitor-Based Deduplication:

Dedup Key: "{{monitor.id}}"
Result: One incident per monitor, updates existing incident

Incident-Based Deduplication:

Dedup Key: "{{monitor.id}}-{{incident.id}}"
Result: New incident for each failure occurrence

Service-Based Deduplication:

Dedup Key: "{{tags.service}}-{{tags.environment}}"
Result: One incident per service environment combination
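The three strategies differ only in how the key is assembled. A small illustrative helper (the function and input shapes are assumptions):

```python
def build_dedup_key(strategy: str, monitor: dict,
                    incident: dict, tags: dict) -> str:
    """Assemble a PagerDuty dedup_key per the strategies above."""
    if strategy == "monitor":
        return monitor["id"]                        # one incident per monitor
    if strategy == "incident":
        return f"{monitor['id']}-{incident['id']}"  # new incident per failure
    if strategy == "service":
        # one incident per service/environment combination
        return f"{tags['service']}-{tags['environment']}"
    raise ValueError(f"unknown strategy: {strategy!r}")

monitor = {"id": "mon-42"}
incident = {"id": "inc-7"}
tags = {"service": "api", "environment": "production"}
```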

Simple Alert Rule:

alert_rule:
  name: "Critical Production Alert"
  conditions:
    - monitor_status: "down"
    - tags.environment: "production"
    - tags.criticality: "critical"
  actions:
    - channel: "pagerduty-critical"
      delay: 0  # Immediate

Escalation Alert Rule:

alert_rule:
  name: "Escalating Production Alert"
  conditions:
    - monitor_status: "down"
    - tags.environment: "production"
  actions:
    - channel: "slack-alerts"
      delay: 0    # Immediate Slack notification
    - channel: "pagerduty-standard"
      delay: 300  # PagerDuty after 5 minutes
    - channel: "pagerduty-critical"
      delay: 900  # Critical escalation after 15 minutes

Environment Mapping:

tag_mapping:
  environment:
    production:
      service_key: "prod-service-key"
      severity: "critical"
    staging:
      service_key: "staging-service-key"
      severity: "warning"
    development:
      service_key: "dev-service-key"
      severity: "info"

Team Mapping:

tag_mapping:
  team:
    backend:
      service_key: "backend-service-key"
      escalation_policy: "Backend Team Policy"
    frontend:
      service_key: "frontend-service-key"
      escalation_policy: "Frontend Team Policy"
    infrastructure:
      service_key: "infra-service-key"
      escalation_policy: "Infrastructure Policy"

Incident Creation Workflow:

  1. Monitor Failure Detected

    • 9n9s detects monitor failure
    • Evaluates alert rules and conditions
    • Determines appropriate PagerDuty service
  2. Incident Creation

    • Creates incident with full context
    • Includes monitor details and failure information
    • Triggers escalation policy
  3. Escalation Management

    • PagerDuty handles escalation based on policy
    • Notifications sent via configured channels
    • Incident assigned to on-call personnel

Recovery Workflow:

  1. Monitor Recovery Detected

    • 9n9s detects monitor recovery
    • Waits for configured stability period
    • Validates sustained recovery
  2. Incident Resolution

    • Sends resolve event to PagerDuty
    • Updates incident with recovery details
    • Logs resolution in incident timeline

From 9n9s Dashboard:

  • Acknowledge Alert: Mark incident as acknowledged in PagerDuty
  • Resolve Incident: Manually resolve PagerDuty incident
  • Snooze Alerts: Temporarily suspend notifications
  • Create Related Incident: Create additional incident for investigation

From PagerDuty:

  • View Monitor: Direct link to 9n9s monitor dashboard
  • View Logs: Access monitor logs and check history
  • Create Runbook: Link to incident response procedures
  • Escalate: Manual escalation to next level or different team

Response Procedure:

  1. Acknowledge Quickly: Acknowledge incidents within SLA timeframes
  2. Assess Impact: Determine scope and business impact
  3. Communicate Status: Update stakeholders via status pages
  4. Investigate Root Cause: Use 9n9s logs and monitoring data
  5. Implement Resolution: Fix underlying issues
  6. Document Learning: Update runbooks and improve monitoring

Escalation Guidelines:

  • Level 1: Individual contributor response (5-15 minutes)
  • Level 2: Senior engineer or team lead (15-30 minutes)
  • Level 3: Management involvement (30+ minutes or high impact)
  • Emergency: Immediate escalation for critical business impact

Integration Key Management:

  • Store integration keys securely in 9n9s encrypted storage
  • Rotate keys quarterly or as required by security policy
  • Use separate keys for different environments
  • Monitor integration key usage and access

Network Security:

  • All communications use HTTPS/TLS encryption
  • PagerDuty webhook verification for inbound events
  • IP allowlisting for enhanced security
  • Audit logging for all API interactions

Audit Trail:

  • Complete log of all incidents and actions
  • Integration activity tracking
  • User action attribution
  • Retention policies for compliance requirements

Data Protection:

  • Minimal sensitive data in incident payloads
  • Configurable data masking for sensitive fields
  • Regional data storage compliance
  • GDPR and SOC 2 compliance support

Incidents Not Created in PagerDuty:

  1. Check Integration Key

    • Verify integration key is correct and active
    • Test key with PagerDuty Events API directly
    • Check for key expiration or service changes
  2. Review Event Payload

    • Check 9n9s notification logs for error details
    • Verify event payload format and required fields
    • Test with minimal payload to isolate issues
  3. Service Configuration

    • Ensure PagerDuty service is active
    • Check service integration settings
    • Verify escalation policy is properly configured

Incidents Not Auto-Resolving:

  1. Auto-Resolution Settings

    • Verify auto-resolution is enabled in 9n9s channel
    • Check resolve timeout configuration
    • Ensure monitor actually recovers before timeout
  2. Deduplication Key Issues

    • Verify deduplication key consistency
    • Check for key format changes
    • Review incident grouping behavior

Performance Issues:

  1. Delayed Incident Creation

    • Check PagerDuty API status and response times
    • Review 9n9s notification queue and processing
    • Verify network connectivity and latency
  2. Missing Escalations

    • Check PagerDuty escalation policy configuration
    • Verify on-call schedule coverage
    • Review notification method settings

Common PagerDuty API Errors:

  • 400 Bad Request: Invalid event payload format

    • Solution: Check payload syntax and required fields
  • 401 Unauthorized: Invalid or expired integration key

    • Solution: Verify integration key and regenerate if needed
  • 429 Rate Limited: Too many API requests

    • Solution: Review rate limiting settings and reduce frequency
  • 500 Server Error: PagerDuty service unavailable

    • Solution: Check PagerDuty status page and retry later
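A sensible client-side policy for these statuses is to retry 429 and 5xx responses with exponential backoff while failing fast on 400/401, so payload and key errors surface immediately. A sketch under those assumptions (the `send` callable abstracts over whatever HTTP client is in use):

```python
import time

# Statuses worth retrying: rate limiting and transient server errors.
RETRYABLE = {429, 500, 502, 503, 504}

def send_with_retry(send, event, max_attempts=4,
                    base_delay=1.0, sleep=time.sleep):
    """Call send(event) -> (status_code, body). Retry retryable statuses
    with exponential backoff; return non-retryable responses (e.g. 400,
    401) immediately so configuration errors are not masked."""
    for attempt in range(max_attempts):
        status, body = send(event)
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return status, body
```

Injecting `sleep` makes the backoff schedule testable without real delays.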

Support Resources:

  1. 9n9s Support: Use in-app chat or email [email protected]
  2. PagerDuty Documentation: PagerDuty Developer Docs
  3. PagerDuty Support: Contact PagerDuty support for service issues
  4. Community: Join our Discord for peer support and best practices
Best Practices:

  1. Logical Service Grouping

    • Create services based on team ownership
    • Separate critical and non-critical services
    • Use environment-specific services for different alerting needs
  2. Escalation Policy Design

    • Keep escalation paths simple and well-documented
    • Include backup coverage for all time periods
    • Regular review and testing of escalation policies
  3. Alert Hygiene

    • Regular review of incident volume and patterns
    • Tune monitor sensitivity to reduce noise
    • Implement proper alert routing to avoid fatigue
  1. Response Procedures

    • Document clear incident response procedures
    • Provide runbook links in incident details
    • Train team members on PagerDuty features and workflows
  2. Post-Incident Reviews

    • Regular review of incident patterns and response times
    • Use PagerDuty analytics to identify improvement opportunities
    • Update monitoring and alerting based on learnings
  3. Continuous Improvement

    • Monitor escalation effectiveness and adjust policies
    • Regular testing of notification channels and workflows
    • Keep integration configurations updated with service changes

PagerDuty integration provides enterprise-grade incident management capabilities for your 9n9s monitoring. Start with basic service setup and gradually implement advanced features like intelligent routing and custom workflows as your incident response processes mature.