Monitor Status & States
Monitor status provides real-time visibility into the health of your services. Understanding monitor states and status transitions helps you respond appropriately to incidents and maintain reliable systems.
## Monitor States

### Basic States
All monitors have these fundamental states:
| State | Description | Visual Indicator |
|---|---|---|
| Up | Monitor is passing all checks | 🟢 Green |
| Down | Monitor is failing one or more checks | 🔴 Red |
| Unknown | Monitor status cannot be determined | 🟡 Yellow |
| Paused | Monitor is temporarily disabled | ⏸️ Gray |
### State Transitions

Understanding how monitors transition between states:
```mermaid
graph TD
  A[Unknown] --> B[Up]
  A --> C[Down]
  B --> C
  C --> B
  B --> D[Paused]
  C --> D
  D --> A
```

Transition Triggers (a code sketch of these rules follows the list):
- Unknown → Up: First successful check
- Unknown → Down: First failed check
- Up → Down: Check failure occurs
- Down → Up: Check succeeds after failure
- Any → Paused: Manual pause or scheduled maintenance
- Paused → Unknown: Monitor resumed (next check determines state)
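The transition triggers above form a small state machine. The sketch below is purely illustrative (it is not part of any 9n9s SDK) and shows how check results and pause/resume actions map onto the four states:

```python
from enum import Enum

class State(Enum):
    UNKNOWN = "unknown"
    UP = "up"
    DOWN = "down"
    PAUSED = "paused"

def next_state(current: State, event: str) -> State:
    """Apply the transition triggers listed above.

    Events: "check_passed", "check_failed", "pause", "resume".
    """
    if event == "pause":
        return State.PAUSED          # Any -> Paused
    if current is State.PAUSED:
        # A resumed monitor is Unknown until its next check runs.
        return State.UNKNOWN if event == "resume" else State.PAUSED
    if event == "check_passed":
        return State.UP              # Unknown/Down -> Up
    if event == "check_failed":
        return State.DOWN            # Unknown/Up -> Down
    return current
```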
## Uptime Monitor Status

### Status Determination

For uptime monitors, status is determined by assertion results:
All Assertions Pass = Up:
```yaml
assertions:
  - type: STATUS_CODE
    operator: EQUALS
    value: "200"
    result: ✅ PASS
  - type: RESPONSE_TIME
    operator: LESS_THAN
    value: "2000"
    result: ✅ PASS
# Overall Status: UP
```

Any Assertion Fails = Down:
```yaml
assertions:
  - type: STATUS_CODE
    operator: EQUALS
    value: "200"
    result: ✅ PASS
  - type: RESPONSE_TIME
    operator: LESS_THAN
    value: "2000"
    result: ❌ FAIL (3500ms)
# Overall Status: DOWN
```
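In other words, a check is UP only when every assertion passes; a single failure makes it DOWN. A minimal sketch of that rule (the dictionaries mirror the YAML above; the helper itself is hypothetical, not a 9n9s API):

```python
def overall_status(assertion_results: list) -> str:
    """UP only when every assertion passed; any single failure means DOWN."""
    return "UP" if all(a["result"] == "PASS" for a in assertion_results) else "DOWN"

results = [
    {"type": "STATUS_CODE", "result": "PASS"},
    {"type": "RESPONSE_TIME", "result": "FAIL"},  # 3500ms exceeded the 2000ms limit
]
print(overall_status(results))  # DOWN
```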
### Common Failure Reasons

Connection Issues:
- DNS resolution failure
- Connection timeout
- Network unreachable
- SSL/TLS handshake failure
HTTP Response Issues:
- Unexpected status code (e.g., 500, 404)
- Response timeout
- Malformed HTTP response
- Content encoding issues
Assertion Failures:
- Response time exceeded threshold
- Response body doesn’t contain expected content
- JSON structure validation failed
- HTTP header missing or incorrect
### Status Details

Each uptime check provides detailed information:
{ "status": "DOWN", "timestamp": "2024-01-15T10:30:00Z", "response_time": 3500, "status_code": 200, "error": null, "assertions": [ { "type": "STATUS_CODE", "expected": "200", "actual": "200", "result": "PASS" }, { "type": "RESPONSE_TIME", "expected": "< 2000ms", "actual": "3500ms", "result": "FAIL" } ], "response_headers": { "content-type": "application/json", "content-length": "1234" }, "response_body_preview": "{\"status\":\"ok\",\"timestamp\":..."}Heartbeat Monitor Status
### Status Determination

Heartbeat monitors track when pulses are expected vs received:
On Schedule = Up:
```
Expected: Every hour (0 * * * *)
Last Pulse: 2024-01-15 14:00:00 (on time)
Grace Period: 15 minutes
Current Time: 2024-01-15 14:45:00
Status: UP (next pulse due at 15:00:00)
```

Overdue = Down:
```
Expected: Every hour (0 * * * *)
Last Pulse: 2024-01-15 14:00:00
Grace Period: 15 minutes
Current Time: 2024-01-15 15:20:00
Status: DOWN (pulse was due at 15:00:00, now 20 minutes late)
```
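The grace-period rule in these examples can be sketched in a few lines. This is only an illustration using the timestamps from the "Overdue = Down" example, not how the platform implements scheduling:

```python
from datetime import datetime, timedelta, timezone

def heartbeat_status(expected_at, grace_period, last_pulse_at, now):
    """UP if the expected pulse arrived, or if the grace period has not yet expired."""
    if last_pulse_at is not None and last_pulse_at >= expected_at:
        return "UP"
    return "DOWN" if now > expected_at + grace_period else "UP"

expected = datetime(2024, 1, 15, 15, 0, tzinfo=timezone.utc)    # pulse due at 15:00
last_pulse = datetime(2024, 1, 15, 14, 0, tzinfo=timezone.utc)  # last pulse at 14:00
now = datetime(2024, 1, 15, 15, 20, tzinfo=timezone.utc)        # 20 minutes late
print(heartbeat_status(expected, timedelta(minutes=15), last_pulse, now))  # DOWN
```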
### Pulse Information

Each pulse provides timing data:
{ "status": "UP", "last_pulse": { "timestamp": "2024-01-15T14:00:00Z", "status": "success", "message": "Backup completed successfully", "duration": 1800, "source_ip": "192.168.1.100" }, "next_expected": "2024-01-15T15:00:00Z", "grace_period_expires": "2024-01-15T15:15:00Z", "schedule": "0 * * * *", "timezone": "UTC"}Pulse Status Types
Pulses can indicate different outcomes:
Success Pulse (Default):
```bash
curl -X POST https://hb.9n9s.com/abc123
# Indicates successful completion
```

Failure Pulse:
```bash
curl -X POST https://hb.9n9s.com/abc123/fail
# Indicates process failed but is still running
```

Start Pulse:
```bash
curl -X POST https://hb.9n9s.com/abc123/start
# Indicates process started (optional)
```

Log Pulse:
```bash
curl -X POST https://hb.9n9s.com/abc123/log \
  -d "Processing 1000 records"
# Includes log message with pulse
```
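A common pattern is to wrap the monitored job so it sends a start pulse, then a success or failure pulse, using the endpoints shown above. The sketch below assumes the `requests` library and the example check UUID `abc123`; adapt it to your own job and URL:

```python
import requests

HEARTBEAT_URL = "https://hb.9n9s.com/abc123"  # example check UUID from above

def run_backup():
    """Placeholder for the actual job being monitored."""
    ...

requests.post(f"{HEARTBEAT_URL}/start", timeout=10)      # process started
try:
    run_backup()
except Exception:
    requests.post(f"{HEARTBEAT_URL}/fail", timeout=10)   # process failed
    raise
else:
    requests.post(HEARTBEAT_URL, timeout=10)              # successful completion
```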
## Status History and Trends

### Uptime Calculations

Monitor uptime percentage over different periods:
Calculation Method:
```
Uptime % = (Successful Checks / Total Checks) × 100
```

Time Periods:
- Last 24 hours
- Last 7 days
- Last 30 days
- Last 90 days
- Custom date ranges
Example:
```
Period: Last 30 days
Total Checks: 43,200 (1 check per minute)
Successful: 43,056
Failed: 144
Uptime: 99.67%
```
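As a quick sanity check, the same arithmetic in code (illustrative only):

```python
def uptime_percent(successful_checks: int, total_checks: int) -> float:
    # Uptime % = (Successful Checks / Total Checks) × 100, rounded to two decimals
    return round(successful_checks / total_checks * 100, 2)

print(uptime_percent(43_056, 43_200))  # 99.67
```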
### Incident Tracking

Monitors track incidents (periods of downtime):
{ "incident_id": "inc_abc123", "started_at": "2024-01-15T10:30:00Z", "ended_at": "2024-01-15T10:35:00Z", "duration": 300, "affected_checks": 5, "root_cause": "HTTP 500 errors", "resolved_by": "automatic_recovery"}Performance Trends
Track response time trends over time (a calculation sketch follows the list):
- Average Response Time: Mean response time for successful checks
- 95th Percentile: 95% of requests completed within this time
- 99th Percentile: 99% of requests completed within this time
- Maximum Response Time: Slowest successful response
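These trend metrics can be reproduced from the raw response times of successful checks. A small sketch using a nearest-rank percentile (illustrative only; the platform's exact aggregation method may differ):

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the sample at or below which pct% of samples fall."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

response_times_ms = [210, 230, 240, 245, 248, 250, 250, 255, 260, 3200]

print(sum(response_times_ms) / len(response_times_ms))  # average response time
print(percentile(response_times_ms, 95))                # 95th percentile
print(percentile(response_times_ms, 99))                # 99th percentile
print(max(response_times_ms))                           # maximum response time
```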
## Status Notifications

### Alert Conditions

Configure when alerts are triggered:
State Change Alerts:
- Down Alert: Triggered when monitor goes from UP to DOWN
- Recovery Alert: Triggered when monitor goes from DOWN to UP
- Flapping Alert: Triggered when monitor changes state frequently
Performance Alerts:
- Slow Response: Response time exceeds threshold consistently
- High Error Rate: Error rate exceeds threshold over time period
- Degraded Performance: Performance drops below baseline
### Alert Timing

Control alert timing to reduce noise:
Grace Periods:
```yaml
# Wait 5 minutes before sending DOWN alert
down_alert_delay: 5m

# Send recovery alert immediately
recovery_alert_delay: 0s

# Re-alert every 30 minutes while DOWN
repeat_interval: 30m
```

Escalation:
```yaml
# Initial alert to Slack
- delay: 0m
  channels: [slack-alerts]

# Escalate to PagerDuty after 15 minutes
- delay: 15m
  channels: [pagerduty-critical]

# Escalate to SMS after 30 minutes
- delay: 30m
  channels: [sms-oncall]
```

## Status API and Webhooks
### Retrieving Status

Get current monitor status via API:
```bash
# Get single monitor status
curl -H "Authorization: Bearer $API_KEY" \
  https://api.9n9s.com/v1/monitors/mon_abc123/status

# Get status for all monitors
curl -H "Authorization: Bearer $API_KEY" \
  https://api.9n9s.com/v1/monitors/status

# Filter by status
curl -H "Authorization: Bearer $API_KEY" \
  "https://api.9n9s.com/v1/monitors/status?status=DOWN"
```

Response Format:
{ "monitor_id": "mon_abc123", "name": "API Health Check", "status": "UP", "last_check": "2024-01-15T10:30:00Z", "uptime_24h": 99.5, "uptime_7d": 99.8, "response_time": 250, "incident_count": 2}Status Webhooks
### Status Webhooks

Receive real-time status updates:
{ "event": "monitor.down", "timestamp": "2024-01-15T10:30:00Z", "monitor": { "id": "mon_abc123", "name": "API Health Check", "url": "https://api.example.com/health" }, "status": { "current": "DOWN", "previous": "UP", "changed_at": "2024-01-15T10:30:00Z" }, "check_result": { "status_code": 500, "response_time": 5000, "error": "HTTP 500 Internal Server Error" }}Maintenance and Scheduled Downtime
## Maintenance and Scheduled Downtime

### Maintenance Windows

Schedule maintenance to prevent false alerts:
```yaml
maintenance_window:
  name: "Database Maintenance"
  start: "2024-01-15T02:00:00Z"
  end: "2024-01-15T04:00:00Z"
  recurrence: "weekly"
  affected_monitors:
    - "Database Connection Check"
    - "API Health Check"
```

During Maintenance:
- Monitors continue running but don’t trigger alerts
- Status shows as “MAINTENANCE” instead of UP/DOWN
- Uptime calculations exclude maintenance periods
- Incidents are not recorded during maintenance
### Manual Pausing

Temporarily disable monitors:
```bash
# Pause monitor
9n9s-cli monitor pause mon_abc123 --reason "Planned deployment"

# Resume monitor
9n9s-cli monitor resume mon_abc123

# Pause with automatic resume
9n9s-cli monitor pause mon_abc123 --duration 30m
```

## Best Practices
### Status Interpretation

Don’t Panic on Single Failures:
- One DOWN check might be a temporary network issue
- Look for patterns and duration of failures
- Consider implementing confirmation checks
Monitor Your Monitoring:
- Set up alerts for monitors that haven’t checked in
- Monitor 9n9s platform status pages
- Track your overall monitoring health metrics
Use Grace Periods Wisely:
- Short grace periods for critical services (1-2 minutes)
- Longer grace periods for less critical services (5-10 minutes)
- Consider service restart times and deployment windows
### Alert Fatigue Prevention

Right-Size Your Alerting:
- Critical alerts should require immediate action
- Important alerts can wait for business hours
- Informational alerts might only need dashboard visibility
Use Intelligent Escalation:
- Start with team channels (Slack, Teams)
- Escalate to paging systems for sustained issues
- Include escalation delays to prevent spam
### Historical Analysis

Regular Reviews:
- Weekly review of incidents and response times
- Monthly analysis of uptime trends
- Quarterly assessment of monitoring effectiveness
- Annual review of SLA compliance
Learn from Incidents:
- Document root causes and resolutions
- Update monitoring based on incident learnings
- Adjust thresholds based on real-world patterns
- Share knowledge across teams
## Troubleshooting Status Issues

### Common Problems

Flapping Monitors:
- Status changes frequently between UP and DOWN
- Often caused by network instability or marginal performance
- Solutions: Increase grace periods, improve service reliability
False Positives:
- Monitor shows DOWN but service is actually healthy
- Often caused by overly strict assertions or timeouts
- Solutions: Adjust thresholds, improve assertions
Missed Incidents:
- Service has issues but monitor shows UP
- Often caused by insufficient assertions or monitoring gaps
- Solutions: Add more comprehensive checks, monitor dependencies
### Debugging Steps

For Uptime Monitors:
- Check recent check results for error details
- Verify assertions are appropriate for the service
- Test the endpoint manually from different locations
- Review network connectivity and DNS resolution
For Heartbeat Monitors:
- Verify the process is running and sending pulses
- Check pulse timing against expected schedule
- Confirm grace period is appropriate for the process
- Validate pulse endpoint URL and authentication