Monitor Status & States
Monitor status provides real-time visibility into the health of your services. Understanding monitor states and status transitions helps you respond appropriately to incidents and maintain reliable systems.
## Monitor States

### Basic States
All monitors have these fundamental states:
| State | Description | Visual Indicator |
|---|---|---|
| Up | Monitor is passing all checks | 🟢 Green |
| Down | Monitor is failing one or more checks | 🔴 Red |
| Unknown | Monitor status cannot be determined | 🟡 Yellow |
| Paused | Monitor is temporarily disabled | ⏸️ Gray |
### State Transitions

Understanding how monitors transition between states:
```mermaid
graph TD
  A[Unknown] --> B[Up]
  A --> C[Down]
  B --> C
  C --> B
  B --> D[Paused]
  C --> D
  D --> A
```

Transition Triggers (a code sketch of these rules follows the list):
- Unknown → Up: First successful check
- Unknown → Down: First failed check
- Up → Down: Check failure occurs
- Down → Up: Check succeeds after failure
- Any → Paused: Manual pause or scheduled maintenance
- Paused → Unknown: Monitor resumed (next check determines state)
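The transition triggers above form a small state machine. The sketch below is purely illustrative (it is not part of any 9n9s SDK) and shows how check results and pause/resume actions map onto the four states:

```python
from enum import Enum

class State(Enum):
    UNKNOWN = "unknown"
    UP = "up"
    DOWN = "down"
    PAUSED = "paused"

def next_state(current: State, event: str) -> State:
    """Apply the transition triggers listed above.

    Events: "check_passed", "check_failed", "pause", "resume".
    """
    if event == "pause":
        return State.PAUSED          # Any -> Paused
    if current is State.PAUSED:
        # A resumed monitor is Unknown until its next check runs.
        return State.UNKNOWN if event == "resume" else State.PAUSED
    if event == "check_passed":
        return State.UP              # Unknown/Down -> Up
    if event == "check_failed":
        return State.DOWN            # Unknown/Up -> Down
    return current
```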
## Uptime Monitor Status

### Status Determination

For uptime monitors, status is determined by assertion results:
All Assertions Pass = Up:
```yaml
assertions:
  - type: STATUS_CODE
    operator: EQUALS
    value: "200"
    result: ✅ PASS
  - type: RESPONSE_TIME
    operator: LESS_THAN
    value: "2000"
    result: ✅ PASS
# Overall Status: UP
```

Any Assertion Fails = Down:
```yaml
assertions:
  - type: STATUS_CODE
    operator: EQUALS
    value: "200"
    result: ✅ PASS
  - type: RESPONSE_TIME
    operator: LESS_THAN
    value: "2000"
    result: ❌ FAIL (3500ms)
# Overall Status: DOWN
```
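In other words, a check is UP only when every assertion passes; a single failure makes it DOWN. A minimal sketch of that rule (the dictionaries mirror the YAML above; the helper itself is hypothetical, not a 9n9s API):

```python
def overall_status(assertion_results: list) -> str:
    """UP only when every assertion passed; any single failure means DOWN."""
    return "UP" if all(a["result"] == "PASS" for a in assertion_results) else "DOWN"

results = [
    {"type": "STATUS_CODE", "result": "PASS"},
    {"type": "RESPONSE_TIME", "result": "FAIL"},  # 3500ms exceeded the 2000ms limit
]
print(overall_status(results))  # DOWN
```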
### Common Failure Reasons

Connection Issues:
- DNS resolution failure
- Connection timeout
- Network unreachable
- SSL/TLS handshake failure
HTTP Response Issues:
- Unexpected status code (e.g., 500, 404)
- Response timeout
- Malformed HTTP response
- Content encoding issues
Assertion Failures:
- Response time exceeded threshold
- Response body doesn’t contain expected content
- JSON structure validation failed
- HTTP header missing or incorrect
### Status Details

Each uptime check provides detailed information:
{ "status": "DOWN", "timestamp": "2024-01-15T10:30:00Z", "response_time": 3500, "status_code": 200, "error": null, "assertions": [ { "type": "STATUS_CODE", "expected": "200", "actual": "200", "result": "PASS" }, { "type": "RESPONSE_TIME", "expected": "< 2000ms", "actual": "3500ms", "result": "FAIL" } ], "response_headers": { "content-type": "application/json", "content-length": "1234" }, "response_body_preview": "{\"status\":\"ok\",\"timestamp\":..."}Heartbeat Monitor Status
### Status Determination

Heartbeat monitors track when pulses are expected vs received:
On Schedule = Up:
```
Expected: Every hour (0 * * * *)
Last Pulse: 2024-01-15 14:00:00 (on time)
Grace Period: 15 minutes
Current Time: 2024-01-15 14:45:00
Status: UP (next pulse due at 15:00:00)
```

Overdue = Down:
```
Expected: Every hour (0 * * * *)
Last Pulse: 2024-01-15 14:00:00
Grace Period: 15 minutes
Current Time: 2024-01-15 15:20:00
Status: DOWN (pulse was due at 15:00:00, now 20 minutes late)
```
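The grace-period rule in these examples can be sketched in a few lines. This is only an illustration using the timestamps from the "Overdue = Down" example, not how the platform implements scheduling:

```python
from datetime import datetime, timedelta, timezone

def heartbeat_status(expected_at, grace_period, last_pulse_at, now):
    """UP if the expected pulse arrived, or if the grace period has not yet expired."""
    if last_pulse_at is not None and last_pulse_at >= expected_at:
        return "UP"
    return "DOWN" if now > expected_at + grace_period else "UP"

expected = datetime(2024, 1, 15, 15, 0, tzinfo=timezone.utc)    # pulse due at 15:00
last_pulse = datetime(2024, 1, 15, 14, 0, tzinfo=timezone.utc)  # last pulse at 14:00
now = datetime(2024, 1, 15, 15, 20, tzinfo=timezone.utc)        # 20 minutes late
print(heartbeat_status(expected, timedelta(minutes=15), last_pulse, now))  # DOWN
```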
### Pulse Information

Each pulse provides timing data:
{ "status": "UP", "last_pulse": { "timestamp": "2024-01-15T14:00:00Z", "status": "success", "message": "Backup completed successfully", "duration": 1800, "source_ip": "192.168.1.100" }, "next_expected": "2024-01-15T15:00:00Z", "grace_period_expires": "2024-01-15T15:15:00Z", "schedule": "0 * * * *", "timezone": "UTC"}Pulse Status Types
Pulses can indicate different outcomes:
Success Pulse (Default):
```bash
curl -X POST https://hb.9n9s.com/abc123
# Indicates successful completion
```

Failure Pulse:
```bash
curl -X POST https://hb.9n9s.com/abc123/fail
# Indicates process failed but is still running
```

Start Pulse:
```bash
curl -X POST https://hb.9n9s.com/abc123/start
# Indicates process started (optional)
```

Log Pulse:
```bash
curl -X POST https://hb.9n9s.com/abc123/log \
  -d "Processing 1000 records"
# Includes log message with pulse
```
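A common pattern is to wrap the monitored job so it sends a start pulse, then a success or failure pulse, using the endpoints shown above. The sketch below assumes the `requests` library and the example check UUID `abc123`; adapt it to your own job and URL:

```python
import requests

HEARTBEAT_URL = "https://hb.9n9s.com/abc123"  # example check UUID from above

def run_backup():
    """Placeholder for the actual job being monitored."""
    ...

requests.post(f"{HEARTBEAT_URL}/start", timeout=10)      # process started
try:
    run_backup()
except Exception:
    requests.post(f"{HEARTBEAT_URL}/fail", timeout=10)   # process failed
    raise
else:
    requests.post(HEARTBEAT_URL, timeout=10)              # successful completion
```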
## Status History and Trends

### Uptime Calculations

Monitor uptime percentage over different periods:
Calculation Method:
```
Uptime % = (Successful Checks / Total Checks) × 100
```

Time Periods:
- Last 24 hours
- Last 7 days
- Last 30 days
- Last 90 days
- Custom date ranges
Example:
```
Period: Last 30 days
Total Checks: 43,200 (1 check per minute)
Successful: 43,056
Failed: 144
Uptime: 99.67%
```
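As a quick sanity check, the same arithmetic in code (illustrative only):

```python
def uptime_percent(successful_checks: int, total_checks: int) -> float:
    # Uptime % = (Successful Checks / Total Checks) × 100, rounded to two decimals
    return round(successful_checks / total_checks * 100, 2)

print(uptime_percent(43_056, 43_200))  # 99.67
```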
### Incident Tracking

Monitors track incidents (periods of downtime):
{ "incident_id": "inc_abc123", "started_at": "2024-01-15T10:30:00Z", "ended_at": "2024-01-15T10:35:00Z", "duration": 300, "affected_checks": 5, "root_cause": "HTTP 500 errors", "resolved_by": "automatic_recovery"}Performance Trends
Track response time trends over time (a calculation sketch follows the list):
- Average Response Time: Mean response time for successful checks
- 95th Percentile: 95% of requests completed within this time
- 99th Percentile: 99% of requests completed within this time
- Maximum Response Time: Slowest successful response
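These trend metrics can be reproduced from the raw response times of successful checks. A small sketch using a nearest-rank percentile (illustrative only; the platform's exact aggregation method may differ):

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the sample at or below which pct% of samples fall."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

response_times_ms = [210, 230, 240, 245, 248, 250, 250, 255, 260, 3200]

print(sum(response_times_ms) / len(response_times_ms))  # average response time
print(percentile(response_times_ms, 95))                # 95th percentile
print(percentile(response_times_ms, 99))                # 99th percentile
print(max(response_times_ms))                           # maximum response time
```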
## Status Notifications

### Alert Conditions

Configure when alerts are triggered:
State Change Alerts:
- Down Alert: Triggered when monitor goes from UP to DOWN
- Recovery Alert: Triggered when monitor goes from DOWN to UP
- Flapping Alert: Triggered when monitor changes state frequently
Performance Alerts:
- Slow Response: Response time exceeds threshold consistently
- High Error Rate: Error rate exceeds threshold over time period
- Degraded Performance: Performance drops below baseline
### Alert Timing

Control alert timing to reduce noise:
Grace Periods:
```yaml
# Wait 5 minutes before sending DOWN alert
down_alert_delay: 5m

# Send recovery alert immediately
recovery_alert_delay: 0s

# Re-alert every 30 minutes while DOWN
repeat_interval: 30m
```

Escalation:
```yaml
# Initial alert to Slack
- delay: 0m
  channels: [slack-alerts]

# Escalate to PagerDuty after 15 minutes
- delay: 15m
  channels: [pagerduty-critical]

# Escalate to SMS after 30 minutes
- delay: 30m
  channels: [sms-oncall]
```

## Status API and Webhooks
### Retrieving Status

Get current monitor status via API:
```bash
# Get single monitor status
curl -H "Authorization: Bearer $API_KEY" \
  https://api.9n9s.com/v1/monitors/mon_abc123/status

# Get status for all monitors
curl -H "Authorization: Bearer $API_KEY" \
  https://api.9n9s.com/v1/monitors/status

# Filter by status
curl -H "Authorization: Bearer $API_KEY" \
  "https://api.9n9s.com/v1/monitors/status?status=DOWN"
```

Response Format:
{ "monitor_id": "mon_abc123", "name": "API Health Check", "status": "UP", "last_check": "2024-01-15T10:30:00Z", "uptime_24h": 99.5, "uptime_7d": 99.8, "response_time": 250, "incident_count": 2}Status Webhooks
### Status Webhooks

Receive real-time status updates:
{ "event": "monitor.down", "timestamp": "2024-01-15T10:30:00Z", "monitor": { "id": "mon_abc123", "name": "API Health Check", "url": "https://api.example.com/health" }, "status": { "current": "DOWN", "previous": "UP", "changed_at": "2024-01-15T10:30:00Z" }, "check_result": { "status_code": 500, "response_time": 5000, "error": "HTTP 500 Internal Server Error" }}Maintenance and Scheduled Downtime
## Maintenance and Scheduled Downtime

### Maintenance Windows

Schedule maintenance to prevent false alerts:
```yaml
maintenance_window:
  name: "Database Maintenance"
  start: "2024-01-15T02:00:00Z"
  end: "2024-01-15T04:00:00Z"
  recurrence: "weekly"
  affected_monitors:
    - "Database Connection Check"
    - "API Health Check"
```

During Maintenance:
- Monitors continue running but don’t trigger alerts
- Status shows as “MAINTENANCE” instead of UP/DOWN
- Uptime calculations exclude maintenance periods
- Incidents are not recorded during maintenance
### Manual Pausing

Temporarily disable monitors:
```bash
# Pause monitor
9n9s-cli monitor pause mon_abc123 --reason "Planned deployment"

# Resume monitor
9n9s-cli monitor resume mon_abc123

# Pause with automatic resume
9n9s-cli monitor pause mon_abc123 --duration 30m
```

## Best Practices
### Status Interpretation

Don’t Panic on Single Failures:
- One DOWN check might be a temporary network issue
- Look for patterns and duration of failures
- Consider implementing confirmation checks
Monitor Your Monitoring:
- Set up alerts for monitors that haven’t checked in
- Monitor 9n9s platform status pages
- Track your overall monitoring health metrics
Use Grace Periods Wisely:
- Short grace periods for critical services (1-2 minutes)
- Longer grace periods for less critical services (5-10 minutes)
- Consider service restart times and deployment windows
### Alert Fatigue Prevention

Right-Size Your Alerting:
- Critical alerts should require immediate action
- Important alerts can wait for business hours
- Informational alerts might only need dashboard visibility
Use Intelligent Escalation:
- Start with team channels (Slack, Teams)
- Escalate to paging systems for sustained issues
- Include escalation delays to prevent spam
### Historical Analysis

Regular Reviews:
- Weekly review of incidents and response times
- Monthly analysis of uptime trends
- Quarterly assessment of monitoring effectiveness
- Annual review of SLA compliance
Learn from Incidents:
- Document root causes and resolutions
- Update monitoring based on incident learnings
- Adjust thresholds based on real-world patterns
- Share knowledge across teams
## Troubleshooting Status Issues

### Common Problems

Flapping Monitors:
- Status changes frequently between UP and DOWN
- Often caused by network instability or marginal performance
- Solutions: Increase grace periods, improve service reliability
False Positives:
- Monitor shows DOWN but service is actually healthy
- Often caused by overly strict assertions or timeouts
- Solutions: Adjust thresholds, improve assertions
Missed Incidents:
- Service has issues but monitor shows UP
- Often caused by insufficient assertions or monitoring gaps
- Solutions: Add more comprehensive checks, monitor dependencies
### Debugging Steps

For Uptime Monitors:
- Check recent check results for error details
- Verify assertions are appropriate for the service
- Test the endpoint manually from different locations
- Review network connectivity and DNS resolution
For Heartbeat Monitors:
- Verify the process is running and sending pulses
- Check pulse timing against expected schedule
- Confirm grace period is appropriate for the process
- Validate pulse endpoint URL and authentication