
Last updated: 20.01.2026 at 11:26

Infrastructure Monitoring: Wichtige Metriken

Effective monitoring is built on the right metrics. Learn the most important infrastructure metrics and how to monitor them with Prometheus and Grafana.

The Four Golden Signals

┌─────────────────────────────────────────────────────────────┐
│                 FOUR GOLDEN SIGNALS                         │
│                 (Google SRE Book)                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   1. LATENCY                      2. TRAFFIC                │
│   ┌─────────────────┐            ┌─────────────────┐       │
│   │ Response time   │            │ Requests/sec    │       │
│   │ p50, p95, p99   │            │ Throughput      │       │
│   └─────────────────┘            └─────────────────┘       │
│                                                             │
│   3. ERRORS                       4. SATURATION            │
│   ┌─────────────────┐            ┌─────────────────┐       │
│   │ Error Rate %    │            │ CPU, Memory     │       │
│   │ 5xx Responses   │            │ Disk, Network   │       │
│   └─────────────────┘            └─────────────────┘       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

USE Method (for Resources)

For each resource (CPU, memory, disk, network), measure:

┌────────────────┬─────────────────────────────────────────┐
│ U - Utilization│ % of capacity in use                    │
├────────────────┼─────────────────────────────────────────┤
│ S - Saturation │ Work that has to wait (queue)           │
├────────────────┼─────────────────────────────────────────┤
│ E - Errors     │ Errors on the resource                  │
└────────────────┴─────────────────────────────────────────┘

Example: CPU
- Utilization: 75% CPU utilization
- Saturation: load average > CPU cores = processes are waiting
- Errors: CPU hardware faults (rare)
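The USE checklist above can be sketched as a small classifier. This is a minimal illustration with made-up thresholds (80% utilization as "high"), not part of any monitoring stack:

```python
# Minimal sketch of the USE method applied to CPU; the 80% threshold
# is an illustrative assumption, not a fixed rule.
def use_cpu_report(utilization_pct, load_avg, cores, hw_errors=0):
    """Classify CPU health along Utilization, Saturation, Errors."""
    return {
        "utilization_high": utilization_pct > 80,  # capacity in use
        "saturated": load_avg > cores,             # work is queueing
        "errors": hw_errors > 0,                   # rare hardware faults
    }

# The example from the text: 75% utilization, load 6 on 8 cores
print(use_cpu_report(75, 6.0, 8))
```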

CPU Metrics

# Prometheus queries for CPU

# CPU utilization per instance, averaged across all cores (%)
100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# CPU breakdown by mode
sum by(mode)(irate(node_cpu_seconds_total[5m])) * 100
# Modes: user, system, iowait, idle, steal, nice

# Load average (1, 5, 15 minutes)
node_load1
node_load5
node_load15

# Load average normalized (per CPU core)
node_load1 / count(node_cpu_seconds_total{mode="idle"}) by (instance)

# Important: load > cores = CPU saturation
# If load15 > number of cores → the system is overloaded
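To see what the utilization query computes: node_cpu_seconds_total is a counter of seconds spent in each mode, so the idle rate over a window is the idle fraction, and 100 minus that is utilization. A sketch with made-up counter samples:

```python
# Sketch of the irate()-based CPU query: idle seconds accumulated per
# wall-clock second is the idle fraction. Sample values are illustrative.
def cpu_utilization_pct(idle_prev, idle_curr, t_prev, t_curr):
    """100 - idle-rate*100, mirroring the PromQL above for one core."""
    idle_rate = (idle_curr - idle_prev) / (t_curr - t_prev)  # idle sec/sec
    return 100.0 - idle_rate * 100.0

# 30s window in which the CPU was idle for 24s -> ~20% utilization
print(round(cpu_utilization_pct(1000.0, 1024.0, 0.0, 30.0), 1))  # 20.0
```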
# Alert rules for CPU

groups:
- name: cpu_alerts
  rules:
  - alert: HighCPUUsage
    expr: 100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is {{ $value | printf \"%.1f\" }}%"

  - alert: CPUSaturation
    expr: node_load15 / count(node_cpu_seconds_total{mode="idle"}) by (instance) > 1.5
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "CPU saturation on {{ $labels.instance }}"

Memory Metrics

# Prometheus queries for memory

# Available memory (%)
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Used memory (%)
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

# Memory breakdown
node_memory_MemTotal_bytes       # Total
node_memory_MemFree_bytes        # Completely free
node_memory_MemAvailable_bytes   # Available (incl. buffers/cache)
node_memory_Buffers_bytes        # Buffers
node_memory_Cached_bytes         # Cache

# Swap usage
node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes

# Swap usage (%) - a warning sign when high
(1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100

# OOM kills (containers)
container_oom_events_total
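The memory queries combine as follows: MemAvailable (not MemFree) is the right basis because buffers and cache can be reclaimed. A sketch with illustrative byte values:

```python
# Sketch of the memory and swap percentage queries above.
# Byte values are made up for illustration.
def memory_report(total, available, swap_total, swap_free):
    """Return (used %, swap used %), mirroring the PromQL expressions."""
    used_pct = (1 - available / total) * 100
    swap_used_pct = (1 - swap_free / swap_total) * 100 if swap_total else 0.0
    return used_pct, swap_used_pct

# 16 GB total with 4 GB available -> 75% used; no swap in use
used, swap = memory_report(total=16e9, available=4e9,
                           swap_total=8e9, swap_free=8e9)
print(used, swap)  # 75.0 0.0
```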
# Alert rules for memory

- alert: HighMemoryUsage
  expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
  for: 5m
  labels:
    severity: warning

- alert: SwapInUse
  expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) > 0
  for: 10m
  labels:
    severity: warning
  annotations:
    description: "Swap usage indicates memory pressure"

- alert: OOMKillsDetected
  expr: increase(container_oom_events_total[5m]) > 0
  labels:
    severity: critical

Disk Metrics

# Prometheus queries for disk

# Disk space usage (%)
100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)

# Disk I/O utilization (%)
irate(node_disk_io_time_seconds_total[5m]) * 100

# Read/write throughput
irate(node_disk_read_bytes_total[5m])
irate(node_disk_written_bytes_total[5m])

# IOPS
irate(node_disk_reads_completed_total[5m])
irate(node_disk_writes_completed_total[5m])

# Disk latency (avg ms per operation)
irate(node_disk_read_time_seconds_total[5m]) / irate(node_disk_reads_completed_total[5m]) * 1000
irate(node_disk_write_time_seconds_total[5m]) / irate(node_disk_writes_completed_total[5m]) * 1000

# Inodes (file-count limit)
(node_filesystem_files - node_filesystem_files_free) / node_filesystem_files * 100
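The latency queries divide two counter rates: total seconds spent on operations over operations completed, times 1000 for milliseconds. A sketch with made-up counter deltas:

```python
# Sketch of the disk latency query: time spent divided by operations
# completed gives average ms per op. Deltas are illustrative.
def avg_latency_ms(time_delta_s, ops_delta):
    """Average latency per operation in milliseconds."""
    if ops_delta == 0:
        return 0.0  # avoid division by zero on an idle disk
    return time_delta_s / ops_delta * 1000.0

# 0.5s of read time spread over 250 reads -> 2 ms per read
print(round(avg_latency_ms(0.5, 250), 2))  # 2.0
```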
# Alert rules for disk

- alert: DiskSpaceLow
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
  for: 5m
  labels:
    severity: warning

- alert: DiskSpaceCritical
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 5
  for: 1m
  labels:
    severity: critical

- alert: DiskWillFillIn24h
  expr: predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    description: "Disk will fill within 24 hours at current rate"

- alert: HighDiskIOUtilization
  expr: irate(node_disk_io_time_seconds_total[5m]) * 100 > 80
  for: 10m
  labels:
    severity: warning
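The DiskWillFillIn24h rule relies on predict_linear(), which fits a straight line to recent samples and extrapolates. A sketch of that idea via least squares, with fabricated sample data:

```python
# Sketch of what predict_linear() does: least-squares fit over
# (timestamp, value) samples, extrapolated `horizon_s` past the last one.
def predict_linear(samples, horizon_s):
    """samples: list of (timestamp_s, value); predict value at t_last + horizon."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + horizon_s) + intercept

# Disk losing 1 GB/hour, 5 GB free now: negative within 24h -> alert fires
samples = [(h * 3600, 10e9 - h * 1e9) for h in range(6)]  # 6h of data
print(predict_linear(samples, 24 * 3600) < 0)  # True
```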

Network Metrics

# Prometheus queries for network

# Bandwidth (bytes/sec)
irate(node_network_receive_bytes_total[5m])
irate(node_network_transmit_bytes_total[5m])

# Packets per second
irate(node_network_receive_packets_total[5m])
irate(node_network_transmit_packets_total[5m])

# Errors and drops
irate(node_network_receive_errs_total[5m])
irate(node_network_transmit_errs_total[5m])
irate(node_network_receive_drop_total[5m])
irate(node_network_transmit_drop_total[5m])

# TCP connections
node_netstat_Tcp_CurrEstab                 # Active connections
node_netstat_Tcp_InSegs                    # Incoming segments
node_netstat_Tcp_OutSegs                   # Outgoing segments
node_netstat_Tcp_RetransSegs               # Retransmissions (quality signal)

# Socket saturation
node_sockstat_TCP_tw                       # TIME_WAIT sockets
node_sockstat_TCP_alloc                    # Allocated sockets
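The retransmission alert below compares two counter rates: retransmitted segments over sent segments, alerting above 1%. A sketch with made-up deltas:

```python
# Sketch of the TCP retransmission ratio used in the alert rule.
# Counter deltas over the evaluation window are illustrative.
def retrans_ratio(retrans_delta, out_segs_delta):
    """Fraction of sent TCP segments that were retransmitted."""
    return retrans_delta / out_segs_delta if out_segs_delta else 0.0

# 150 retransmissions out of 10,000 sent segments -> 1.5%, above the 1% threshold
print(retrans_ratio(150, 10_000) > 0.01)  # True
```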
# Alert rules for network

- alert: NetworkErrors
  expr: rate(node_network_receive_errs_total[5m]) > 0 or rate(node_network_transmit_errs_total[5m]) > 0
  for: 5m
  labels:
    severity: warning

- alert: HighNetworkUtilization
  expr: irate(node_network_receive_bytes_total[5m]) > 100000000  # 100 MB/s
  for: 10m
  labels:
    severity: warning

- alert: TcpRetransmissions
  expr: rate(node_netstat_Tcp_RetransSegs[5m]) / rate(node_netstat_Tcp_OutSegs[5m]) > 0.01
  for: 5m
  labels:
    severity: warning
  annotations:
    description: "TCP retransmission rate above 1%"

Application Metrics (RED)

# RED method for services:
# Rate, Errors, Duration

# Request Rate
rate(http_requests_total[5m])

# Error Rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

# Duration (Latency Percentiles)
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))  # p50
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))  # p95
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))  # p99

# Apdex Score (Application Performance Index)
# T = satisfactory response time (e.g. 0.5s); requests between T and 4T count half
(
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) +
  (
    sum(rate(http_request_duration_seconds_bucket{le="2"}[5m])) -
    sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  ) / 2
) / sum(rate(http_request_duration_seconds_count[5m]))
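The Apdex arithmetic can be sketched directly. Prometheus histogram buckets are cumulative, so the "tolerating" band (between T and 4T) is the 4T bucket minus the T bucket; subtracting avoids double-counting satisfied requests. Counts are illustrative:

```python
# Sketch of the Apdex formula with cumulative histogram buckets.
# satisfied_cum: requests with latency <= T; tolerable_cum: requests <= 4T.
def apdex(satisfied_cum, tolerable_cum, total):
    tolerating = tolerable_cum - satisfied_cum  # avoid double-counting
    return (satisfied_cum + tolerating / 2) / total

# 800 requests <= 0.5s, 950 <= 2s, 1000 total -> (800 + 150/2) / 1000
print(apdex(800, 950, 1000))  # 0.875
```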

Grafana Dashboard Example

Dashboard JSON snippet:

{
  "panels": [
    {
      "title": "CPU Usage",
      "type": "graph",
      "targets": [{
        "expr": "100 - (avg(irate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)",
        "legendFormat": "CPU %"
      }],
      "yaxes": [{ "max": 100, "min": 0, "format": "percent" }],
      "thresholds": [
        { "value": 70, "colorMode": "warning" },
        { "value": 90, "colorMode": "critical" }
      ]
    },
    {
      "title": "Memory Usage",
      "type": "gauge",
      "targets": [{
        "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
      }],
      "thresholds": { "steps": [
        { "value": 0, "color": "green" },
        { "value": 70, "color": "yellow" },
        { "value": 85, "color": "red" }
      ]}
    },
    {
      "title": "Request Rate",
      "type": "stat",
      "targets": [{
        "expr": "sum(rate(http_requests_total[5m]))",
        "legendFormat": "req/s"
      }]
    },
    {
      "title": "Error Rate",
      "type": "stat",
      "targets": [{
        "expr": "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) * 100",
        "legendFormat": "errors %"
      }],
      "thresholds": { "steps": [
        { "value": 0, "color": "green" },
        { "value": 1, "color": "yellow" },
        { "value": 5, "color": "red" }
      ]}
    }
  ]
}

Prometheus Setup

# prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['alertmanager:9093']

rule_files:
  - '/etc/prometheus/rules/*.yml'

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  # Node Exporter (server metrics)
  - job_name: 'node'
    static_configs:
    - targets:
      - 'server1:9100'
      - 'server2:9100'

  # Application metrics
  - job_name: 'app'
    static_configs:
    - targets:
      - 'app1:8080'
      - 'app2:8080'
    metrics_path: '/metrics'

  # Kubernetes Service Discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true

💡 Best Practices:
1. USE for resources, RED for services
2. Alert on symptoms, not causes
3. Percentiles (p99) instead of averages for latency
4. predict_* functions for proactive alerts
5. Group dashboards by service, not by metric
