Monitoring & Alerting: Keeping an Eye on Your Systems
Good monitoring surfaces problems before your users notice them. Learn how to build effective monitoring.
The Three Pillars
Observability
├── Metrics → numbers over time (CPU, RAM, requests)
├── Logs    → events and errors
└── Traces  → a request's path through the system
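To make the pillars concrete, here is roughly what each looks like for a single failing request (all names and values are illustrative):

```text
# Metric – a numeric sample with labels, collected over time
http_requests_total{method="GET", route="/checkout", status="500"}  27

# Log – a single structured event
{"time":"2024-05-01T12:00:03Z","level":"error","msg":"payment declined","order_id":"A-1042"}

# Trace – the request's path through the system, one span per hop
frontend -> checkout-service -> payment-api   (trace_id=abc123, total 412 ms)
```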
Key Metrics
| Category | Metrics |
|---|---|
| System | CPU, RAM, Disk, Network I/O |
| Application | Request Rate, Error Rate, Latency |
| Business | Orders/min, Active Users, Revenue |
| Database | Connections, Query Time, Cache Hit Rate |
RED Method (Services)
R - Rate: requests per second
E - Errors: failed requests per second
D - Duration: latency (p50, p95, p99)
# Prometheus Queries
# Rate
rate(http_requests_total[5m])
# Errors
rate(http_requests_total{status=~"5.."}[5m])
/ rate(http_requests_total[5m])
# Duration
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
USE Method (Resources)
U - Utilization: how much of the resource is in use (%)
S - Saturation: how overloaded it is (e.g. queue length)
E - Errors: error count

# CPU
Utilization: CPU usage %
Saturation:  load average / number of cores
Errors:      CPU errors (rare)

# Disk
Utilization: disk usage %
Saturation:  I/O wait
Errors:      disk errors
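A hedged sketch of USE-style queries against standard node_exporter metrics (node_cpu_seconds_total, node_load1, node_filesystem_avail_bytes); the exact labels depend on your exporter version:

```promql
# CPU utilization (%) per instance
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# CPU saturation: 1-minute load average relative to the number of cores
node_load1 / count by (instance) (node_cpu_seconds_total{mode="idle"})

# Disk utilization (%) per mountpoint
100 * (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes)
```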
Prometheus + Grafana Setup
# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  grafana-data:
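With this file in place, one command starts the two containers defined above; Prometheus is then reachable on port 9090 and Grafana on port 3000 (Grafana's initial login is admin/admin and it asks you to change the password):

```bash
docker compose up -d        # or: docker-compose up -d on older installs
# Prometheus UI:  http://localhost:9090
# Grafana UI:     http://localhost:3000   (initial login: admin / admin)
```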
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'app'
    static_configs:
      - targets: ['app:8080']
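Note that the scrape targets node-exporter:9100 and app:8080 have to resolve to running services on the same Docker network. A minimal sketch of adding node-exporter to the docker-compose.yml above; the app service is assumed to be whatever container exposes the /metrics endpoint from the next section:

```yaml
  # add to the services: block of docker-compose.yml
  node-exporter:
    image: prom/node-exporter
    ports:
      - "9100:9100"
```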
Application Metrics (Node.js)
npm install prom-client
const express = require('express');
const client = require('prom-client');

const app = express();

// Default metrics (CPU, memory, event loop, etc.)
client.collectDefaultMetrics();

// Custom metrics
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.1, 0.5, 1, 2, 5]
});

const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status']
});

// Middleware: record duration and count for every finished response
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route?.path || 'unknown';
    httpRequestDuration
      .labels(req.method, route, res.statusCode)
      .observe(duration);
    httpRequestsTotal
      .labels(req.method, route, res.statusCode)
      .inc();
  });
  next();
});

// Metrics endpoint scraped by Prometheus
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.send(await client.register.metrics());
});

app.listen(8080);
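Once this is wired up, a scrape of GET /metrics returns the Prometheus text exposition format, roughly like this (label and sample values are illustrative):

```text
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",route="/users/:id",status="200"} 123

# HELP http_request_duration_seconds HTTP request duration in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1",method="GET",route="/users/:id",status="200"} 117
```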
Alerting Rules
# prometheus/alerts.yml
groups:
  - name: app
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile latency > 2s"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"

  - name: infrastructure
    rules:
      # High CPU
      - alert: HighCPU
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning

      # Low disk space
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space < 10% on {{ $labels.mountpoint }}"
Alertmanager Configuration
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-email'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'team@example.com'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
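The 'slack' receiver above is defined but no route points to it yet. A hedged sketch of an extended routes: section that sends warning-level alerts to Slack (using the same classic match syntax as above; newer Alertmanager versions prefer matchers):

```yaml
route:
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'
```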
Uptime Monitoring
# Simple health check
GET /health
→ 200 OK { "status": "healthy" }

# Detailed health check (readiness)
GET /health/ready
{
  "status": "healthy",
  "checks": {
    "database": { "status": "up", "latency_ms": 5 },
    "redis": { "status": "up", "latency_ms": 1 },
    "external_api": { "status": "up", "latency_ms": 120 }
  }
}
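A minimal sketch of these two endpoints for the Express app from the metrics section; checkDatabase and checkRedis are hypothetical placeholders to replace with real client pings (e.g. a SELECT 1 against your database, a Redis PING):

```javascript
// Hypothetical placeholder checks: swap in real pings against your dependencies.
const checkDatabase = async () => { /* await db.query('SELECT 1') */ };
const checkRedis = async () => { /* await redis.ping() */ };

// Run a check, measure its latency, and never let it throw.
async function timedCheck(check) {
  const start = Date.now();
  try {
    await check();
    return { status: 'up', latency_ms: Date.now() - start };
  } catch (err) {
    return { status: 'down', latency_ms: Date.now() - start, error: err.message };
  }
}

// Liveness: the process is up and can answer HTTP requests.
app.get('/health', (req, res) => {
  res.json({ status: 'healthy' });
});

// Readiness: the dependencies needed to serve traffic are reachable.
app.get('/health/ready', async (req, res) => {
  const checks = {
    database: await timedCheck(checkDatabase),
    redis: await timedCheck(checkRedis)
  };
  const healthy = Object.values(checks).every((c) => c.status === 'up');
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'unhealthy',
    checks
  });
});
```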
💡 Recommendation:
Use the Enjyn Status Monitor for simple uptime monitoring of your websites and APIs, with alerts by e-mail.
Best Practices
✅ Good monitoring:
- Alert on symptoms, not causes
- Actionable alerts (it is clear what needs to be done)
- A runbook for every alert (see the annotation sketch below)
- An on-call rotation
- Post-mortems after incidents
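One way to make an alert actionable is to ship the runbook link and a concrete description along with it. A sketch of extended annotations for the HighErrorRate rule from above; the runbook URL is hypothetical:

```yaml
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"
          description: "More than 5% of requests on {{ $labels.instance }} are failing."
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
```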