Infrastructure Monitoring: Key Metrics
Effective monitoring is built on the right metrics. This article covers the most important infrastructure metrics and how to monitor them with Prometheus and Grafana.
The Four Golden Signals
FOUR GOLDEN SIGNALS (Google SRE book)

1. Latency:    response time (p50, p95, p99)
2. Traffic:    requests/sec, throughput
3. Errors:     error rate %, 5xx responses
4. Saturation: CPU, memory, disk, network
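Each signal maps directly onto the kind of PromQL query used throughout this article. As a quick orientation, the sketch below assumes an application that exposes http_requests_total and http_request_duration_seconds histograms (as in the RED section further down) plus node_exporter on the hosts.

# 1. Latency: p99 response time
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# 2. Traffic: requests per second
sum(rate(http_requests_total[5m]))
# 3. Errors: share of 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# 4. Saturation: e.g. memory usage per host (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100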
USE Method (for Resources)
For each resource (CPU, memory, disk, network), measure:

┌────────────────┬─────────────────────────────────────────┐
│ U - Utilization│ % of capacity in use                    │
├────────────────┼─────────────────────────────────────────┤
│ S - Saturation │ work that has to wait (queue)           │
├────────────────┼─────────────────────────────────────────┤
│ E - Errors     │ errors on the resource                  │
└────────────────┴─────────────────────────────────────────┘

Example, CPU:
- Utilization: 75% CPU usage
- Saturation: load average > CPU cores = processes are waiting
- Errors: CPU hardware errors (rare)
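The same scheme carries over to the other resources. As a sketch, USE applied to a disk with node_exporter metrics could look like this (node_disk_io_time_weighted_seconds_total is a standard node_exporter metric that is not used elsewhere in this article):

# Utilization: share of time the device was busy with I/O
irate(node_disk_io_time_seconds_total[5m]) * 100
# Saturation: weighted I/O time, grows when requests queue up
irate(node_disk_io_time_weighted_seconds_total[5m])
# Errors: node_exporter has no generic disk error counter; check SMART data or kernel logs instead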
CPU Metrics
# Prometheus queries for CPU
# CPU utilization per instance (%)
100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# CPU breakdown by mode
sum by(mode)(irate(node_cpu_seconds_total[5m])) * 100
# Modes: user, system, iowait, idle, steal, nice
# Load average (1, 5, 15 minutes)
node_load1
node_load5
node_load15
# Load average, normalized per CPU core
node_load1 / on(instance) count(node_cpu_seconds_total{mode="idle"}) by (instance)
# Important: load > number of cores = CPU saturation
# If load15 > number of cores, the system is overloaded
# Alert rules for CPU
groups:
  - name: cpu_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.1f\" }}%"

      - alert: CPUSaturation
        expr: node_load15 / on(instance) count(node_cpu_seconds_total{mode="idle"}) by (instance) > 1.5
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "CPU saturation on {{ $labels.instance }}"
Memory Metrics
# Prometheus queries for memory

# Available memory (%)
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Used memory (%)
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

# Memory breakdown
node_memory_MemTotal_bytes       # Total
node_memory_MemFree_bytes        # Completely free
node_memory_MemAvailable_bytes   # Available (incl. buffers/cache)
node_memory_Buffers_bytes        # Buffers
node_memory_Cached_bytes         # Page cache

# Swap usage (bytes)
node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes

# Swap usage (%) - a warning sign when high
(1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100

# OOM kills (containers)
container_oom_events_total
# Alert rules for memory
- alert: HighMemoryUsage
  expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
  for: 5m
  labels:
    severity: warning

- alert: SwapInUse
  expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) > 0
  for: 10m
  labels:
    severity: warning
  annotations:
    description: "Swap usage indicates memory pressure"

- alert: OOMKillsDetected
  expr: increase(container_oom_events_total[5m]) > 0
  labels:
    severity: critical
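Note that container_oom_events_total comes from cAdvisor (or the kubelet's embedded cAdvisor), not from node_exporter, so it needs its own scrape target; the job below is a sketch with an assumed standalone cAdvisor on port 8080. On newer kernels node_exporter also exposes node_vmstat_oom_kill for host-level OOM kills.

# Addition to prometheus.yml (target address is an assumption)
- job_name: 'cadvisor'
  static_configs:
    - targets: ['cadvisor:8080']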
Disk Metrics
# Prometheus queries for disk

# Disk space usage (%)
100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)

# Disk I/O utilization (%)
irate(node_disk_io_time_seconds_total[5m]) * 100

# Read/write throughput
irate(node_disk_read_bytes_total[5m])
irate(node_disk_written_bytes_total[5m])

# IOPS
irate(node_disk_reads_completed_total[5m])
irate(node_disk_writes_completed_total[5m])

# Disk latency (avg ms per operation)
irate(node_disk_read_time_seconds_total[5m]) / irate(node_disk_reads_completed_total[5m]) * 1000
irate(node_disk_write_time_seconds_total[5m]) / irate(node_disk_writes_completed_total[5m]) * 1000

# Inodes (limit on the number of files)
(node_filesystem_files - node_filesystem_files_free) / node_filesystem_files * 100
# Alert rules for disk
- alert: DiskSpaceLow
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
  for: 5m
  labels:
    severity: warning

- alert: DiskSpaceCritical
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 5
  for: 1m
  labels:
    severity: critical

- alert: DiskWillFillIn24h
  expr: predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    description: "Disk will fill within 24 hours at current rate"

- alert: HighDiskIOUtilization
  expr: irate(node_disk_io_time_seconds_total[5m]) * 100 > 80
  for: 10m
  labels:
    severity: warning
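As written, the disk-space alerts also fire for tmpfs, overlay, and other pseudo-filesystems. A common refinement is to restrict the expression by fstype; the exclusion list below is an example and depends on your hosts:

- alert: DiskSpaceLow
  expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs"}) * 100 < 15
  for: 5m
  labels:
    severity: warning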
Network Metrics
# Prometheus queries for network

# Bandwidth (bytes/sec)
irate(node_network_receive_bytes_total[5m])
irate(node_network_transmit_bytes_total[5m])

# Packets per second
irate(node_network_receive_packets_total[5m])
irate(node_network_transmit_packets_total[5m])

# Errors and drops
irate(node_network_receive_errs_total[5m])
irate(node_network_transmit_errs_total[5m])
irate(node_network_receive_drop_total[5m])
irate(node_network_transmit_drop_total[5m])

# TCP connections
node_netstat_Tcp_CurrEstab      # Established connections
node_netstat_Tcp_InSegs         # Incoming segments
node_netstat_Tcp_OutSegs        # Outgoing segments
node_netstat_Tcp_RetransSegs    # Retransmissions (quality indicator)

# Socket saturation
node_sockstat_TCP_tw            # TIME_WAIT sockets
node_sockstat_TCP_alloc         # Allocated sockets
# Alert rules for network
- alert: NetworkErrors
  expr: rate(node_network_receive_errs_total[5m]) > 0 or rate(node_network_transmit_errs_total[5m]) > 0
  for: 5m
  labels:
    severity: warning

- alert: HighNetworkUtilization
  expr: irate(node_network_receive_bytes_total[5m]) > 100000000  # 100 MB/s
  for: 10m
  labels:
    severity: warning

- alert: TcpRetransmissions
  expr: rate(node_netstat_Tcp_RetransSegs[5m]) / rate(node_netstat_Tcp_OutSegs[5m]) > 0.01
  for: 5m
  labels:
    severity: warning
  annotations:
    description: "TCP retransmission rate above 1%"
Application Metrics (RED)
# RED method for services:
# Rate, Errors, Duration
# Request Rate
rate(http_requests_total[5m])
# Error Rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Duration (Latency Percentiles)
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m])) # p50
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) # p95
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) # p99
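# Note: with several instances per job, aggregate the buckets before applying
# histogram_quantile, otherwise you get one percentile per series:
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))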
# Apdex Score (Application Performance Index)
# T = satisfactory response time (e.g. 0.5s); tolerable up to 4T = 2s
(
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) +
  sum(rate(http_request_duration_seconds_bucket{le="2"}[5m]))
) / 2 / sum(rate(http_request_duration_seconds_count[5m]))
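These queries translate directly into service-level alerts on the RED signals; a sketch with example thresholds (the 5% error rate and the 1s p99 limit are placeholder values):

- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
- alert: HighLatencyP99
  expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
  for: 10m
  labels:
    severity: warning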
Grafana Dashboard Example
Dashboard JSON snippet:
{
"panels": [
{
"title": "CPU Usage",
"type": "graph",
"targets": [{
"expr": "100 - (avg(irate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)",
"legendFormat": "CPU %"
}],
"yaxes": [{ "max": 100, "min": 0, "format": "percent" }],
"thresholds": [
{ "value": 70, "colorMode": "warning" },
{ "value": 90, "colorMode": "critical" }
]
},
{
"title": "Memory Usage",
"type": "gauge",
"targets": [{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
}],
"thresholds": { "steps": [
{ "value": 0, "color": "green" },
{ "value": 70, "color": "yellow" },
{ "value": 85, "color": "red" }
]}
},
{
"title": "Request Rate",
"type": "stat",
"targets": [{
"expr": "sum(rate(http_requests_total[5m]))",
"legendFormat": "req/s"
}]
},
{
"title": "Error Rate",
"type": "stat",
"targets": [{
"expr": "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) * 100",
"legendFormat": "errors %"
}],
"thresholds": { "steps": [
{ "value": 0, "color": "green" },
{ "value": 1, "color": "yellow" },
{ "value": 5, "color": "red" }
]}
}
]
}
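To load a dashboard like this automatically instead of importing it by hand, Grafana supports file-based provisioning; a minimal provider sketch (the provider name and path are assumptions for a default Grafana installation, placed under /etc/grafana/provisioning/dashboards/):

apiVersion: 1
providers:
  - name: 'infrastructure-dashboards'    # assumption: any unique name works
    type: file
    options:
      path: /var/lib/grafana/dashboards  # directory containing the dashboard JSON files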
Prometheus Setup
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - '/etc/prometheus/rules/*.yml'

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter (host metrics)
  - job_name: 'node'
    static_configs:
      - targets:
          - 'server1:9100'
          - 'server2:9100'

  # Application metrics
  - job_name: 'app'
    static_configs:
      - targets:
          - 'app1:8080'
          - 'app2:8080'
    metrics_path: '/metrics'

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
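With the relabel rule above, the kubernetes-pods job only scrapes pods that opt in via an annotation. A minimal pod manifest sketch (name, image, and port are placeholders; the config shown keys only on prometheus.io/scrape, so port and path would need additional relabel rules):

apiVersion: v1
kind: Pod
metadata:
  name: example-app                    # placeholder
  annotations:
    prometheus.io/scrape: "true"       # matched by the relabel rule above
spec:
  containers:
    - name: app
      image: example-app:latest        # placeholder
      ports:
        - containerPort: 8080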
💡 Best Practices:
1. USE for resources, RED for services
2. Alert on symptoms, not on causes
3. Percentiles (p99) instead of averages for latency
4. Prediction functions (predict_linear) for proactive alerts
5. Group dashboards by service, not by metric type
Further Reading
- 📊 Log Aggregation
- 🚨 Alerting Best Practices
- 📈 Enjyn Status Monitor - uptime monitoring for your services