
Monitor Management

Effective monitor management is crucial for maintaining a reliable monitoring system as your infrastructure grows. This guide covers best practices for organizing, maintaining, and scaling your monitoring setup.

Organize monitors into logical projects based on your team structure and infrastructure:

By Environment:

├── Production Services
│   ├── API Gateway
│   ├── User Authentication
│   └── Payment Processing
├── Staging Environment
│   ├── API Testing
│   └── Integration Tests
└── Development Environment
    ├── Local Services
    └── Feature Branches

By Team Ownership:

├── Backend Team
│   ├── Database Monitors
│   ├── API Health Checks
│   └── Background Jobs
├── Frontend Team
│   ├── Website Monitoring
│   ├── CDN Performance
│   └── User Flows
└── Infrastructure Team
    ├── Server Health
    ├── Network Monitoring
    └── Security Scans

By Service Type:

├── Critical Services
│   ├── Core APIs
│   ├── Payment Systems
│   └── User Authentication
├── Supporting Services
│   ├── Logging Systems
│   ├── Analytics
│   └── Backup Processes
└── Development Tools
    ├── CI/CD Pipelines
    ├── Testing Frameworks
    └── Development Environments

Establish consistent naming patterns for easy identification:

Recommended Format: [Environment] - [Service] - [Check Type]

Examples:

  • Prod - User API - Health Check
  • Staging - Payment Gateway - SSL Certificate
  • Dev - Database - Connection Test

Alternative Format: [Team] - [Environment] - [Service]

Examples:

  • Backend - Prod - Authentication Service
  • Frontend - Staging - Main Website
  • Infra - Prod - Load Balancer

Use tags to enable powerful filtering and organization:

Essential Tags:

environment: production | staging | development
team: backend | frontend | infrastructure | data
criticality: critical | high | medium | low
service: api | website | database | cache
component: authentication | payment | analytics

Optional Tags:

owner: john.doe
region: us-east-1 | eu-west-1
version: v2.1.0
deployment: 2024-01-15

Tag Usage Examples:

# Find all critical production monitors
tags: environment=production AND criticality=critical
# Find all backend team monitors that are down
tags: team=backend AND status=down
# Find all payment-related monitors
tags: component=payment OR service=payment
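
The same tag queries can also be run programmatically against the monitors API. A minimal sketch, reusing the GET /v1/monitors call shown later in this guide; combining multiple key:value pairs in one tags parameter is an assumption, so check the API reference for the exact filter syntax:

import requests

API_KEY = "YOUR_API_KEY"

# Fetch monitors tagged as critical production services.
response = requests.get(
    "https://api.9n9s.com/v1/monitors",
    params={"tags": "environment:production,criticality:critical"},
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()

for monitor in response.json()["data"]:
    print(monitor["name"], monitor["tags"])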

1. Planning Phase:

  • Identify what needs monitoring
  • Determine appropriate monitor type (heartbeat vs uptime)
  • Define success criteria and thresholds
  • Plan alert routing and escalation

2. Configuration:

  • Set appropriate check frequencies
  • Configure realistic timeouts and grace periods
  • Define comprehensive assertions for uptime monitors (a configuration sketch follows this list)
  • Set up proper tagging for organization
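
As a concrete illustration of these settings, a single uptime monitor definition might look like the sketch below. The field names mirror the YAML and API examples later in this guide and are illustrative assumptions rather than an exhaustive schema:

# Illustrative uptime monitor definition; not an exhaustive schema.
checkout_api_monitor = {
    "name": "Prod - Checkout API - Health Check",
    "type": "uptime",
    "url": "https://api.example.com/health",  # hypothetical endpoint
    "frequency": "1m",                        # frequent checks for a critical service
    "timeout": "10s",                         # fail fast instead of waiting on a hung request
    "assertions": [
        {"type": "status_code", "value": 200},
    ],
    "tags": {
        "environment": "production",
        "team": "backend",
        "criticality": "critical",
        "service": "api",
    },
}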

3. Testing:

  • Verify monitor functionality with test cases
  • Confirm alert delivery to all channels
  • Test edge cases and failure scenarios
  • Validate recovery notifications

4. Documentation:

  • Document monitor purpose and configuration
  • Create runbooks for common issues
  • Define escalation procedures
  • Maintain contact information

Regular Reviews:

  • Monthly review of monitor effectiveness
  • Quarterly assessment of alert noise vs signal
  • Annual audit of monitor relevance and accuracy
  • Continuous optimization based on incidents

Configuration Updates:

  • Adjust thresholds based on performance trends
  • Update contact information and escalation paths
  • Modify check frequencies based on service criticality
  • Update assertions as services evolve

Performance Tuning:

  • Optimize check frequencies to balance coverage and cost
  • Adjust grace periods based on historical data (see the sketch after this list)
  • Fine-tune alert sensitivity to reduce false positives
  • Update timeout values based on service performance
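
One way to ground these adjustments in data is to derive grace periods from historical run durations, taking a high percentile and adding headroom. The sketch below is illustrative; the percentile, headroom factor, and duration samples are assumptions, not platform defaults:

import math

def suggest_grace_period(durations_minutes, percentile=0.99, headroom=1.5):
    """Suggest a grace period (in minutes) from historical run durations.

    Uses a high percentile of observed durations multiplied by a headroom
    factor so normal variance does not trigger alerts.
    """
    ordered = sorted(durations_minutes)
    index = min(len(ordered) - 1, math.ceil(percentile * len(ordered)) - 1)
    return math.ceil(ordered[index] * headroom)

# Example: nightly backup durations (in minutes) observed over the last month.
history = [42, 45, 44, 47, 51, 43, 46, 49, 58, 44]
print(suggest_grace_period(history))  # prints 87, suggesting roughly a 90-minute grace period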

When to Remove Monitors:

  • Service has been permanently shut down
  • Monitoring responsibility transferred to another team
  • Monitor provides redundant or obsolete information
  • Resource optimization requirements

Decommissioning Process:

  1. Notification: Inform stakeholders of planned removal
  2. Grace Period: Allow time for feedback and concerns
  3. Backup: Export historical data if needed (see the sketch after this list)
  4. Removal: Delete monitor and update documentation
  5. Verification: Confirm no dependent processes or alerts
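
For the backup and removal steps, monitor data can be exported through the API before deletion. The sketch below snapshots a monitor's configuration using the /v1/monitors endpoint shown elsewhere in this guide; the monitor ID is a placeholder, and the DELETE call is an assumption (the dashboard or CLI can be used for removal instead):

import json
import requests

API_KEY = "YOUR_API_KEY"
MONITOR_ID = "mon_123"  # placeholder ID of the monitor being retired
headers = {"Authorization": f"Bearer {API_KEY}"}

# Snapshot the monitor's configuration before removing it.
config = requests.get(
    f"https://api.9n9s.com/v1/monitors/{MONITOR_ID}", headers=headers
).json()

with open(f"{MONITOR_ID}-backup.json", "w") as f:
    json.dump(config, f, indent=2)

# Assumed DELETE endpoint; remove via the dashboard or CLI if unavailable.
requests.delete(f"https://api.9n9s.com/v1/monitors/{MONITOR_ID}", headers=headers)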

Update multiple monitors efficiently:

Via CLI:

# Update all production monitors with new tags
9n9s-cli monitors update \
  --filter "tags.environment=production" \
  --add-tags "reviewed=2024-01-15,sla=99.9"

# Adjust grace periods for all heartbeat monitors
9n9s-cli heartbeat update \
  --filter "type=heartbeat" \
  --grace-period "30m"

# Update check frequency for non-critical monitors
9n9s-cli uptime update \
  --filter "tags.criticality=low" \
  --frequency "10m"

Via API:

import requests

# Get all monitors with specific tags
response = requests.get(
    "https://api.9n9s.com/v1/monitors",
    params={"tags": "environment:staging"},
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
monitors = response.json()["data"]

# Update each monitor
for monitor in monitors:
    update_data = {
        "tags": monitor["tags"] + ["updated:2024-01-15"]
    }
    requests.patch(
        f"https://api.9n9s.com/v1/monitors/{monitor['id']}",
        json=update_data,
        headers={"Authorization": "Bearer YOUR_API_KEY"},
    )

Manage monitors using version-controlled configuration:

YAML Configuration:

monitors.yml

projects:
  production:
    heartbeats:
      - name: "Daily Backup Job"
        schedule: "0 2 * * *"
        grace_period: "2h"
        tags:
          environment: production
          team: infrastructure
          criticality: high
    uptime:
      - name: "Main Website"
        url: "https://example.com"
        frequency: "1m"
        assertions:
          - type: status_code
            value: 200
        tags:
          environment: production
          team: frontend
          criticality: critical
Deployment Process:

# Preview changes
9n9s-cli config diff --file monitors.yml
# Apply changes
9n9s-cli config apply --file monitors.yml
# Verify deployment
9n9s-cli monitors list --tags environment=production

Check Frequency Optimization:

  • Critical services: 30 seconds - 1 minute
  • Important services: 1 - 5 minutes
  • Standard services: 5 - 15 minutes
  • Background processes: 15 minutes - 1 hour

Resource Management:

  • Distribute check timing to avoid load spikes (see the sketch after this list)
  • Use appropriate timeout values to prevent resource waste
  • Monitor your monitoring system’s resource usage
  • Scale monitoring infrastructure with service growth
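
One simple way to distribute check timing is to give each monitor a stable start offset derived from its ID, spreading checks evenly across the interval. This is an illustrative sketch rather than a platform feature; the hashing scheme and interval are assumptions:

import hashlib

def start_offset_seconds(monitor_id: str, interval_seconds: int) -> int:
    """Spread monitors evenly across a shared check interval.

    Hashing the monitor ID gives each monitor a stable offset, so checks
    on the same interval do not all fire at the same instant.
    """
    digest = hashlib.sha256(monitor_id.encode()).hexdigest()
    return int(digest, 16) % interval_seconds

# Example: two monitors on a 5-minute (300-second) interval.
for monitor_id in ("mon_api_health", "mon_checkout_flow"):
    print(monitor_id, start_offset_seconds(monitor_id, 300))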

Access Control:

# Example RBAC setup
teams:
  backend:
    projects:
      - name: "API Services"
        role: admin
      - name: "Database Systems"
        role: admin
  frontend:
    projects:
      - name: "Web Applications"
        role: admin
      - name: "API Services"
        role: viewer

Shared Responsibilities:

  • Monitor Owners: Responsible for specific monitors and their alerts
  • Project Admins: Manage project-level configuration and access
  • Organization Admins: Handle global settings and team management

Automated Monitor Creation:

# Create monitors for new services automatically
def create_service_monitors(service_config):
    """Create standard monitors for a new service"""
    base_url = service_config["base_url"]
    service_name = service_config["name"]
    team = service_config["team"]

    # Health check monitor
    health_monitor = {
        "name": f"{service_name} - Health Check",
        "type": "uptime",
        "url": f"{base_url}/health",
        "frequency": "1m",
        "tags": {
            "service": service_name.lower(),
            "team": team,
            "environment": "production",
            "auto_created": "true",
        },
    }

    # Create monitor via API
    create_monitor(health_monitor)

    # SSL certificate monitor for HTTPS services
    if base_url.startswith("https://"):
        ssl_monitor = {
            "name": f"{service_name} - SSL Certificate",
            "type": "uptime",
            "url": base_url,
            "frequency": "1d",
            "assertions": [
                {"type": "tls_cert_expiry", "operator": "more_than_days", "value": "14"}
            ],
            "tags": {
                "service": service_name.lower(),
                "team": team,
                "environment": "production",
                "type": "ssl",
                "auto_created": "true",
            },
        }
        create_monitor(ssl_monitor)
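
A possible way to invoke the helper above, for example from a service-registration hook; create_monitor is assumed to wrap the monitor-creation API call, and the configuration values are illustrative:

new_service = {
    "name": "Order Service",
    "base_url": "https://orders.example.com",
    "team": "backend",
}

# Registers a health check and, because the URL is HTTPS, an SSL certificate monitor.
create_service_monitors(new_service)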

Integration with CI/CD:

# GitHub Actions example
- name: Update Monitoring
  run: |
    # Update monitor configuration on deployment
    9n9s-cli heartbeat update $MONITOR_ID \
      --tags "version=${{ github.sha }},deployed_at=$(date -Iseconds)"

    # Create temporary monitor for deployment validation
    9n9s-cli uptime create \
      --name "Post-Deploy Validation" \
      --url "$HEALTH_CHECK_URL" \
      --frequency "30s" \
      --timeout "30s" \
      --tags "temporary=true,deployment=${{ github.sha }}"

Start Simple:

  • Begin with basic monitors for critical services
  • Use simple naming conventions consistently
  • Implement essential tags from the beginning
  • Grow complexity as your team and infrastructure scale

Plan for Growth:

  • Design tag taxonomy that scales with your organization
  • Establish clear ownership and responsibility models
  • Create templates for common monitor types
  • Document processes and conventions

Regular Health Checks:

  • Review alert effectiveness monthly
  • Update thresholds based on service evolution
  • Remove or update obsolete monitors
  • Validate contact information and escalation paths

Documentation:

  • Maintain up-to-date runbooks for each monitor
  • Document the purpose and context of each monitor
  • Keep contact information current
  • Share knowledge across team members

Monitor Quality Metrics:

  • Alert-to-incident ratio (measures false positives; see the sketch after this list)
  • Time to detection for real issues
  • Coverage of critical service components
  • Team satisfaction with monitoring effectiveness
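
As a rough illustration of the first metric, the alert-to-incident ratio and false-positive rate can be computed from counts pulled out of your alerting history; the numbers below are placeholders:

alerts_fired = 120        # alerts sent over the review period
confirmed_incidents = 18  # alerts that corresponded to real issues

alert_to_incident_ratio = alerts_fired / confirmed_incidents
false_positive_rate = (alerts_fired - confirmed_incidents) / alerts_fired

print(f"alert-to-incident ratio: {alert_to_incident_ratio:.1f}:1")  # 6.7:1
print(f"false positive rate: {false_positive_rate:.0%}")            # 85%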

Continuous Improvement:

  • Collect feedback from incident responses
  • Analyze patterns in monitor failures
  • Optimize based on business impact
  • Regular training on monitoring best practices

Too Many False Positives:

  • Review and adjust alert thresholds
  • Implement proper grace periods
  • Use maintenance windows for scheduled work
  • Consider monitor sensitivity settings

Missing Critical Issues:

  • Audit monitor coverage for all critical services
  • Review assertion completeness for uptime monitors
  • Verify alert delivery mechanisms
  • Test escalation procedures

Organization Confusion:

  • Implement consistent naming conventions
  • Use tags effectively for filtering and search
  • Create clear ownership documentation
  • Provide team training on monitor organization

High Check Volume:

  • Optimize check frequencies based on service criticality
  • Distribute check timing to avoid load spikes
  • Consider regional monitoring for global services
  • Monitor your monitoring system’s performance

Resource Constraints:

  • Review and optimize timeout settings
  • Remove unnecessary or redundant monitors
  • Use appropriate grace periods
  • Consider upgrading subscription plan