Monitor Status & States

Monitor status provides real-time visibility into the health of your services. Understanding monitor states and status transitions helps you respond appropriately to incidents and maintain reliable systems.

All monitors have these fundamental states:

State   | Description                           | Visual Indicator
Up      | Monitor is passing all checks         | 🟢 Green
Down    | Monitor is failing one or more checks | 🔴 Red
Unknown | Monitor status cannot be determined   | 🟡 Yellow
Paused  | Monitor is temporarily disabled       | ⏸️ Gray

The diagram below shows how monitors transition between states:

graph TD
    A[Unknown] --> B[Up]
    A --> C[Down]
    B --> C
    C --> B
    B --> D[Paused]
    C --> D
    D --> A

Transition Triggers:

  • Unknown → Up: First successful check
  • Unknown → Down: First failed check
  • Up → Down: Check failure occurs
  • Down → Up: Check succeeds after failure
  • Any → Paused: Manual pause or scheduled maintenance
  • Paused → Unknown: Monitor resumed (next check determines state)
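
These allowed transitions can be captured in a small lookup table. A minimal Python sketch, assuming an in-memory representation (the structure is illustrative, not the platform's implementation):

from enum import Enum

class MonitorState(Enum):
    UNKNOWN = "UNKNOWN"
    UP = "UP"
    DOWN = "DOWN"
    PAUSED = "PAUSED"

# Allowed transitions, mirroring the diagram and trigger list above.
ALLOWED = {
    MonitorState.UNKNOWN: {MonitorState.UP, MonitorState.DOWN, MonitorState.PAUSED},
    MonitorState.UP: {MonitorState.DOWN, MonitorState.PAUSED},
    MonitorState.DOWN: {MonitorState.UP, MonitorState.PAUSED},
    MonitorState.PAUSED: {MonitorState.UNKNOWN},
}

def transition(current: MonitorState, target: MonitorState) -> MonitorState:
    """Return the target state if the transition is allowed, otherwise raise."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target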

For uptime monitors, status is determined by assertion results:

All Assertions Pass = Up:

assertions:
  - type: STATUS_CODE
    operator: EQUALS
    value: "200"
    result: ✅ PASS
  - type: RESPONSE_TIME
    operator: LESS_THAN
    value: "2000"
    result: ✅ PASS

# Overall Status: UP

Any Assertion Fails = Down:

assertions:
  - type: STATUS_CODE
    operator: EQUALS
    value: "200"
    result: ✅ PASS
  - type: RESPONSE_TIME
    operator: LESS_THAN
    value: "2000"
    result: ❌ FAIL (3500ms)

# Overall Status: DOWN
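
A sketch of that rule in Python, assuming each assertion result arrives as a simple pass/fail record (field names follow the examples above):

def overall_status(assertion_results: list[dict]) -> str:
    """UP only when every assertion passed; any failure means DOWN."""
    if all(a["result"] == "PASS" for a in assertion_results):
        return "UP"
    return "DOWN"

# Mirrors the second example above, where the response-time assertion failed.
print(overall_status([
    {"type": "STATUS_CODE", "result": "PASS"},
    {"type": "RESPONSE_TIME", "result": "FAIL"},
]))  # -> DOWN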

A check can fail, taking the monitor DOWN, for several categories of reasons.

Connection Issues:

  • DNS resolution failure
  • Connection timeout
  • Network unreachable
  • SSL/TLS handshake failure

HTTP Response Issues:

  • Unexpected status code (e.g., 500, 404)
  • Response timeout
  • Malformed HTTP response
  • Content encoding issues

Assertion Failures:

  • Response time exceeded threshold
  • Response body doesn’t contain expected content
  • JSON structure validation failed
  • HTTP header missing or incorrect

Each uptime check provides detailed information:

{
  "status": "DOWN",
  "timestamp": "2024-01-15T10:30:00Z",
  "response_time": 3500,
  "status_code": 200,
  "error": null,
  "assertions": [
    {
      "type": "STATUS_CODE",
      "expected": "200",
      "actual": "200",
      "result": "PASS"
    },
    {
      "type": "RESPONSE_TIME",
      "expected": "< 2000ms",
      "actual": "3500ms",
      "result": "FAIL"
    }
  ],
  "response_headers": {
    "content-type": "application/json",
    "content-length": "1234"
  },
  "response_body_preview": "{\"status\":\"ok\",\"timestamp\":..."
}
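
When diagnosing a DOWN result like this, it often helps to pull out just the failing assertions. A small sketch, assuming the payload shape shown above:

import json

def failed_assertions(check_result_json: str) -> list[dict]:
    """Return the assertions that did not pass from a check result payload."""
    result = json.loads(check_result_json)
    return [a for a in result.get("assertions", []) if a.get("result") != "PASS"]

# With the example payload above, this returns the RESPONSE_TIME assertion
# (expected "< 2000ms", actual "3500ms").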

Heartbeat monitors track when pulses are expected vs received:

On Schedule = Up:

Expected: Every hour (0 * * * *)
Last Pulse: 2024-01-15 14:00:00 (on time)
Grace Period: 15 minutes
Current Time: 2024-01-15 14:45:00
Status: UP (next pulse due at 15:00:00)

Overdue = Down:

Expected: Every hour (0 * * * *)
Last Pulse: 2024-01-15 14:00:00
Grace Period: 15 minutes
Current Time: 2024-01-15 15:20:00
Status: DOWN (pulse was due at 15:00:00, now 20 minutes late)
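
A minimal sketch of that overdue check, with the hourly schedule hardcoded for simplicity (a real implementation would evaluate the cron expression):

from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(minutes=15)
INTERVAL = timedelta(hours=1)  # stand-in for the "0 * * * *" schedule

def heartbeat_status(last_pulse: datetime, now: datetime) -> str:
    """UP while the next pulse is not yet late beyond the grace period."""
    next_expected = last_pulse + INTERVAL
    return "UP" if now <= next_expected + GRACE_PERIOD else "DOWN"

last = datetime(2024, 1, 15, 14, 0, tzinfo=timezone.utc)
print(heartbeat_status(last, datetime(2024, 1, 15, 14, 45, tzinfo=timezone.utc)))  # UP
print(heartbeat_status(last, datetime(2024, 1, 15, 15, 20, tzinfo=timezone.utc)))  # DOWN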

Each pulse provides timing data:

{
  "status": "UP",
  "last_pulse": {
    "timestamp": "2024-01-15T14:00:00Z",
    "status": "success",
    "message": "Backup completed successfully",
    "duration": 1800,
    "source_ip": "192.168.1.100"
  },
  "next_expected": "2024-01-15T15:00:00Z",
  "grace_period_expires": "2024-01-15T15:15:00Z",
  "schedule": "0 * * * *",
  "timezone": "UTC"
}

Pulses can indicate different outcomes:

Success Pulse (Default):

curl -X POST https://hb.9n9s.com/abc123
# Indicates successful completion

Failure Pulse:

curl -X POST https://hb.9n9s.com/abc123/fail
# Indicates process failed but is still running

Start Pulse:

curl -X POST https://hb.9n9s.com/abc123/start
# Indicates process started (optional)

Log Pulse:

curl -X POST https://hb.9n9s.com/abc123/log \
  -d "Processing 1000 records"
# Includes log message with pulse
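
The same pulses are often sent from code rather than a shell. A Python sketch using requests, wrapping a job with start, success, and failure pulses (the ping URL is the placeholder from the examples above):

import requests

PING_URL = "https://hb.9n9s.com/abc123"  # placeholder from the examples above

def run_with_heartbeat(job) -> None:
    """Send a start pulse, run the job, then report success or failure."""
    requests.post(f"{PING_URL}/start", timeout=10)
    try:
        job()
    except Exception:
        # Report the failure, then re-raise so the caller still sees the error.
        requests.post(f"{PING_URL}/fail", timeout=10)
        raise
    requests.post(PING_URL, timeout=10)

run_with_heartbeat(lambda: print("backup completed"))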

Uptime percentage is tracked for each monitor over different time periods:

Calculation Method:

Uptime % = (Successful Checks / Total Checks) × 100

Time Periods:

  • Last 24 hours
  • Last 7 days
  • Last 30 days
  • Last 90 days
  • Custom date ranges

Example:

Period: Last 30 days
Total Checks: 43,200 (1 check per minute)
Successful: 43,056
Failed: 144
Uptime: 99.67%
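
The same arithmetic as a quick sketch:

def uptime_percent(successful: int, total: int) -> float:
    """Uptime % = (successful checks / total checks) x 100."""
    return round(successful / total * 100, 2)

# The 30-day example above: one check per minute for 30 days.
print(uptime_percent(43_056, 43_200))  # 99.67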

Monitors track incidents (periods of downtime):

{
  "incident_id": "inc_abc123",
  "started_at": "2024-01-15T10:30:00Z",
  "ended_at": "2024-01-15T10:35:00Z",
  "duration": 300,
  "affected_checks": 5,
  "root_cause": "HTTP 500 errors",
  "resolved_by": "automatic_recovery"
}
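
Records like this can be aggregated for reporting, for example to total downtime over a period. A rough sketch, assuming duration is in seconds as in the five-minute example above:

def total_downtime_seconds(incidents: list[dict]) -> int:
    """Sum incident durations (in seconds) over a reporting period."""
    return sum(incident["duration"] for incident in incidents)

print(total_downtime_seconds([{"duration": 300}, {"duration": 120}]))  # 420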

Track response time trends over time:

  • Average Response Time: Mean response time for successful checks
  • 95th Percentile: 95% of requests completed within this time
  • 99th Percentile: 99% of requests completed within this time
  • Maximum Response Time: Slowest successful response
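
A sketch of how those percentiles can be computed from raw response times using the nearest-rank method (the platform's exact method may differ; the sample values are illustrative):

import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

times_ms = [110, 115, 120, 125, 130, 135, 140, 150, 160, 170,
            180, 190, 200, 220, 250, 300, 400, 600, 900, 3500]
print(percentile(times_ms, 95))  # 900
print(percentile(times_ms, 99))  # 3500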

Configure when alerts are triggered:

State Change Alerts:

  • Down Alert: Triggered when monitor goes from UP to DOWN
  • Recovery Alert: Triggered when monitor goes from DOWN to UP
  • Flapping Alert: Triggered when monitor changes state frequently

Performance Alerts:

  • Slow Response: Response time exceeds threshold consistently
  • High Error Rate: Error rate exceeds threshold over time period
  • Degraded Performance: Performance drops below baseline

Control alert timing to reduce noise:

Grace Periods:

# Wait 5 minutes before sending DOWN alert
down_alert_delay: 5m
# Send recovery alert immediately
recovery_alert_delay: 0s
# Re-alert every 30 minutes while DOWN
repeat_interval: 30m

Escalation:

# Initial alert to Slack
- delay: 0m
  channels: [slack-alerts]
# Escalate to PagerDuty after 15 minutes
- delay: 15m
  channels: [pagerduty-critical]
# Escalate to SMS after 30 minutes
- delay: 30m
  channels: [sms-oncall]

Get current monitor status via API:

# Get single monitor status
curl -H "Authorization: Bearer $API_KEY" \
  https://api.9n9s.com/v1/monitors/mon_abc123/status

# Get status for all monitors
curl -H "Authorization: Bearer $API_KEY" \
  https://api.9n9s.com/v1/monitors/status

# Filter by status
curl -H "Authorization: Bearer $API_KEY" \
  "https://api.9n9s.com/v1/monitors/status?status=DOWN"

Response Format:

{
  "monitor_id": "mon_abc123",
  "name": "API Health Check",
  "status": "UP",
  "last_check": "2024-01-15T10:30:00Z",
  "uptime_24h": 99.5,
  "uptime_7d": 99.8,
  "response_time": 250,
  "incident_count": 2
}
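
The same queries from Python using requests, assuming the list endpoint returns an array of objects shaped like the response above (error handling kept minimal):

import os
import requests

API_KEY = os.environ["API_KEY"]
BASE_URL = "https://api.9n9s.com/v1"

def down_monitors() -> list[dict]:
    """Fetch monitors currently reported as DOWN."""
    resp = requests.get(
        f"{BASE_URL}/monitors/status",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"status": "DOWN"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

for monitor in down_monitors():
    print(monitor["name"], monitor["last_check"])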

Receive real-time status updates:

{
  "event": "monitor.down",
  "timestamp": "2024-01-15T10:30:00Z",
  "monitor": {
    "id": "mon_abc123",
    "name": "API Health Check",
    "url": "https://api.example.com/health"
  },
  "status": {
    "current": "DOWN",
    "previous": "UP",
    "changed_at": "2024-01-15T10:30:00Z"
  },
  "check_result": {
    "status_code": 500,
    "response_time": 5000,
    "error": "HTTP 500 Internal Server Error"
  }
}
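
A minimal Flask sketch of a webhook receiver for these events, assuming the payload shape above (the route path is arbitrary and signature verification is omitted):

from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/9n9s", methods=["POST"])
def handle_event():
    payload = request.get_json(force=True)
    if payload.get("event") == "monitor.down":
        monitor = payload["monitor"]
        check = payload.get("check_result", {})
        # Hook point: page the on-call, open a ticket, post to chat, etc.
        print(f"{monitor['name']} is DOWN: {check.get('error')}")
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)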

Schedule maintenance to prevent false alerts:

maintenance_window:
  name: "Database Maintenance"
  start: "2024-01-15T02:00:00Z"
  end: "2024-01-15T04:00:00Z"
  recurrence: "weekly"
  affected_monitors:
    - "Database Connection Check"
    - "API Health Check"

During Maintenance:

  • Monitors continue running but don’t trigger alerts
  • Status shows as “MAINTENANCE” instead of UP/DOWN
  • Uptime calculations exclude maintenance periods
  • Incidents are not recorded during maintenance

Temporarily disable monitors:

# Pause monitor
9n9s-cli monitor pause mon_abc123 --reason "Planned deployment"
# Resume monitor
9n9s-cli monitor resume mon_abc123
# Pause with automatic resume
9n9s-cli monitor pause mon_abc123 --duration 30m

Don’t Panic on Single Failures:

  • One DOWN check might be a temporary network issue
  • Look for patterns and duration of failures
  • Consider implementing confirmation checks

Monitor Your Monitoring:

  • Set up alerts for monitors that haven’t checked in
  • Monitor 9n9s platform status pages
  • Track your overall monitoring health metrics

Use Grace Periods Wisely:

  • Short grace periods for critical services (1-2 minutes)
  • Longer grace periods for less critical services (5-10 minutes)
  • Consider service restart times and deployment windows

Right-Size Your Alerting:

  • Critical alerts should require immediate action
  • Important alerts can wait for business hours
  • Informational alerts might only need dashboard visibility

Use Intelligent Escalation:

  • Start with team channels (Slack, Teams)
  • Escalate to paging systems for sustained issues
  • Include escalation delays to prevent spam

Regular Reviews:

  • Weekly review of incidents and response times
  • Monthly analysis of uptime trends
  • Quarterly assessment of monitoring effectiveness
  • Annual review of SLA compliance

Learn from Incidents:

  • Document root causes and resolutions
  • Update monitoring based on incident learnings
  • Adjust thresholds based on real-world patterns
  • Share knowledge across teams

Flapping Monitors:

  • Status changes frequently between UP and DOWN
  • Often caused by network instability or marginal performance
  • Solutions: Increase grace periods, improve service reliability

False Positives:

  • Monitor shows DOWN but service is actually healthy
  • Often caused by overly strict assertions or timeouts
  • Solutions: Adjust thresholds, improve assertions

Missed Incidents:

  • Service has issues but monitor shows UP
  • Often caused by insufficient assertions or monitoring gaps
  • Solutions: Add more comprehensive checks, monitor dependencies

For Uptime Monitors:

  1. Check recent check results for error details
  2. Verify assertions are appropriate for the service
  3. Test the endpoint manually from different locations
  4. Review network connectivity and DNS resolution

For Heartbeat Monitors:

  1. Verify the process is running and sending pulses
  2. Check pulse timing against expected schedule
  3. Confirm grace period is appropriate for the process
  4. Validate pulse endpoint URL and authentication