
Last updated: 20.01.2026, 11:26

Infrastructure Monitoring: Key Metrics

Effective monitoring is built on the right metrics. Learn the most important infrastructure metrics and how to monitor them with Prometheus and Grafana.

The Four Golden Signals

┌─────────────────────────────────────────────────────────────┐
│                 FOUR GOLDEN SIGNALS                         │
│                 (Google SRE Book)                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   1. LATENCY                      2. TRAFFIC                │
│   ┌─────────────────┐            ┌─────────────────┐       │
│   │ Response time   │            │ Requests/sec    │       │
│   │ p50, p95, p99   │            │ Throughput      │       │
│   └─────────────────┘            └─────────────────┘       │
│                                                             │
│   3. ERRORS                       4. SATURATION            │
│   ┌─────────────────┐            ┌─────────────────┐       │
│   │ Error Rate %    │            │ CPU, Memory     │       │
│   │ 5xx Responses   │            │ Disk, Network   │       │
│   └─────────────────┘            └─────────────────┘       │
│                                                             │
└─────────────────────────────────────────────────────────────┘
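Expressed as PromQL, the four signals map to queries like the following (a sketch; it assumes the http_requests_total and http_request_duration_seconds metrics that the application examples later in this article use):

# 1. Latency: p95 response time
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# 2. Traffic: requests per second
sum(rate(http_requests_total[5m]))

# 3. Errors: share of 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# 4. Saturation: e.g. load average normalized per CPU core
node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})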

USE Method (for Resources)

For every resource (CPU, memory, disk, network), measure:

┌────────────────┬─────────────────────────────────────────┐
│ U - Utilization│ % of capacity in use                    │
├────────────────┼─────────────────────────────────────────┤
│ S - Saturation │ Work that has to wait (queue)           │
├────────────────┼─────────────────────────────────────────┤
│ E - Errors     │ Errors at the resource level            │
└────────────────┴─────────────────────────────────────────┘

Example: CPU
- Utilization: 75% CPU usage
- Saturation: load average > CPU cores = processes are waiting
- Errors: CPU hardware errors (rare)
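Applied to a disk, the same checklist looks like this (a sketch using node_exporter metrics; note that node_exporter exposes no per-disk error counter):

# U - Utilization: % of time the device was busy
irate(node_disk_io_time_seconds_total[5m]) * 100

# S - Saturation: weighted I/O time, roughly the average request queue depth
irate(node_disk_io_time_weighted_seconds_total[5m])

# E - Errors: not exposed by node_exporter; collect SMART data instead
# (e.g. via the separate smartctl_exporter)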

CPU Metrics

# Prometheus queries for CPU

# CPU utilization per instance (%), averaged across all cores
100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# CPU breakdown by mode (%, averaged across cores)
avg by(instance, mode)(irate(node_cpu_seconds_total[5m])) * 100
# Modes: user, system, iowait, idle, steal, nice

# Load average (1, 5, 15 minutes)
node_load1
node_load5
node_load15

# Load average normalized (per CPU core)
node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})

# Important: load > cores = CPU saturation
# If load15 > number of cores, the system is overloaded
# Alert rules for CPU

groups:
- name: cpu_alerts
  rules:
  - alert: HighCPUUsage
    expr: 100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is {{ $value | printf \"%.1f\" }}%"

  - alert: CPUSaturation
    expr: node_load15 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 1.5
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "CPU saturation on {{ $labels.instance }}"

Memory Metrics

# Prometheus queries for memory

# Available memory (%)
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Used memory (%)
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

# Memory breakdown
node_memory_MemTotal_bytes       # Total
node_memory_MemFree_bytes        # Completely free
node_memory_MemAvailable_bytes   # Available (incl. buffers/cache)
node_memory_Buffers_bytes        # Buffers
node_memory_Cached_bytes         # Page cache

# Swap usage (bytes)
node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes

# Swap usage (%) - a warning sign when high
(1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100

# OOM kills (containers, exposed by cAdvisor)
container_oom_events_total
# Alert rules for memory

- alert: HighMemoryUsage
  expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
  for: 5m
  labels:
    severity: warning

- alert: SwapInUse
  expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) > 0
  for: 10m
  labels:
    severity: warning
  annotations:
    description: "Swap usage indicates memory pressure"

- alert: OOMKillsDetected
  expr: increase(container_oom_events_total[5m]) > 0
  labels:
    severity: critical
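container_oom_events_total comes from cAdvisor; the same source also shows how close a container is to its memory limit before the OOM killer strikes. A sketch (assumes cAdvisor metrics are scraped):

# Container working set relative to its memory limit
# (the "> 0" filter drops containers without a configured limit)
container_memory_working_set_bytes / (container_spec_memory_limit_bytes > 0)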

Disk Metrics

# Prometheus queries for disk

# Disk space usage (%)
100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)

# Disk I/O utilization (%)
irate(node_disk_io_time_seconds_total[5m]) * 100

# Read/write throughput (bytes/sec)
irate(node_disk_read_bytes_total[5m])
irate(node_disk_written_bytes_total[5m])

# IOPS
irate(node_disk_reads_completed_total[5m])
irate(node_disk_writes_completed_total[5m])

# Disk latency (avg ms per operation)
irate(node_disk_read_time_seconds_total[5m]) / irate(node_disk_reads_completed_total[5m]) * 1000
irate(node_disk_write_time_seconds_total[5m]) / irate(node_disk_writes_completed_total[5m]) * 1000

# Inode usage (% of the file-count limit)
(node_filesystem_files - node_filesystem_files_free) / node_filesystem_files * 100
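In practice you will want to restrict the filesystem queries above to real mounts, because node_exporter also reports tmpfs and overlay filesystems; a sketch:

# Disk space usage (%), real filesystems only
100 - ((node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100)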
# Alert rules for disk

- alert: DiskSpaceLow
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
  for: 5m
  labels:
    severity: warning

- alert: DiskSpaceCritical
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 5
  for: 1m
  labels:
    severity: critical

- alert: DiskWillFillIn24h
  expr: predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    description: "Disk will fill within 24 hours at current rate"

- alert: HighDiskIOUtilization
  expr: irate(node_disk_io_time_seconds_total[5m]) * 100 > 80
  for: 10m
  labels:
    severity: warning

Network Metrics

# Prometheus queries for network

# Bandwidth (bytes/sec)
irate(node_network_receive_bytes_total[5m])
irate(node_network_transmit_bytes_total[5m])

# Packets per second
irate(node_network_receive_packets_total[5m])
irate(node_network_transmit_packets_total[5m])

# Errors and drops
irate(node_network_receive_errs_total[5m])
irate(node_network_transmit_errs_total[5m])
irate(node_network_receive_drop_total[5m])
irate(node_network_transmit_drop_total[5m])

# TCP connections
node_netstat_Tcp_CurrEstab                 # Established connections
node_netstat_Tcp_InSegs                    # Incoming segments
node_netstat_Tcp_OutSegs                   # Outgoing segments
node_netstat_Tcp_RetransSegs               # Retransmissions (quality indicator)

# Socket saturation
node_sockstat_TCP_tw                       # TIME_WAIT sockets
node_sockstat_TCP_alloc                    # Allocated sockets
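To judge network saturation, set throughput in relation to the link speed, which node_exporter exposes as node_network_speed_bytes; a sketch:

# Transmit bandwidth as % of link speed
irate(node_network_transmit_bytes_total[5m]) / node_network_speed_bytes * 100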
# Alert rules for network

- alert: NetworkErrors
  expr: rate(node_network_receive_errs_total[5m]) > 0 or rate(node_network_transmit_errs_total[5m]) > 0
  for: 5m
  labels:
    severity: warning

- alert: HighNetworkUtilization
  expr: irate(node_network_receive_bytes_total[5m]) > 100000000  # 100 MB/s
  for: 10m
  labels:
    severity: warning

- alert: TcpRetransmissions
  expr: rate(node_netstat_Tcp_RetransSegs[5m]) / rate(node_netstat_Tcp_OutSegs[5m]) > 0.01
  for: 5m
  labels:
    severity: warning
  annotations:
    description: "TCP retransmission rate above 1%"

Application Metrics (RED)

# RED method for services:
# Rate, Errors, Duration

# Request rate
rate(http_requests_total[5m])

# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

# Duration (latency percentiles)
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))  # p50
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))  # p95
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))  # p99

# Apdex score (Application Performance Index)
# T = satisfactory response time (e.g. 0.5s); tolerable up to 4T (2s)
(
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) +
  sum(rate(http_request_duration_seconds_bucket{le="2"}[5m]))
) / 2 / sum(rate(http_request_duration_seconds_count[5m]))
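These RED queries translate directly into service-level alerts in the same style as the resource rules above; a sketch with an example threshold of 5%:

- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    description: "More than 5% of requests are failing"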

Grafana Dashboard Example

Dashboard JSON snippet

{
  "panels": [
    {
      "title": "CPU Usage",
      "type": "graph",
      "targets": [{
        "expr": "100 - (avg(irate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)",
        "legendFormat": "CPU %"
      }],
      "yaxes": [{ "max": 100, "min": 0, "format": "percent" }],
      "thresholds": [
        { "value": 70, "colorMode": "warning" },
        { "value": 90, "colorMode": "critical" }
      ]
    },
    {
      "title": "Memory Usage",
      "type": "gauge",
      "targets": [{
        "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
      }],
      "thresholds": { "steps": [
        { "value": 0, "color": "green" },
        { "value": 70, "color": "yellow" },
        { "value": 85, "color": "red" }
      ]}
    },
    {
      "title": "Request Rate",
      "type": "stat",
      "targets": [{
        "expr": "sum(rate(http_requests_total[5m]))",
        "legendFormat": "req/s"
      }]
    },
    {
      "title": "Error Rate",
      "type": "stat",
      "targets": [{
        "expr": "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) * 100",
        "legendFormat": "errors %"
      }],
      "thresholds": { "steps": [
        { "value": 0, "color": "green" },
        { "value": 1, "color": "yellow" },
        { "value": 5, "color": "red" }
      ]}
    }
  ]
}
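Dashboards like this can be version-controlled and loaded automatically through Grafana's file-based provisioning. A minimal sketch (paths and folder name are examples):

# /etc/grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
- name: 'default'
  folder: 'Infrastructure'
  type: file
  options:
    path: /var/lib/grafana/dashboards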

Prometheus Setup

# prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['alertmanager:9093']

rule_files:
  - '/etc/prometheus/rules/*.yml'

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  # Node Exporter (host metrics)
  - job_name: 'node'
    static_configs:
    - targets:
      - 'server1:9100'
      - 'server2:9100'

  # Application metrics
  - job_name: 'app'
    static_configs:
    - targets:
      - 'app1:8080'
      - 'app2:8080'
    metrics_path: '/metrics'

  # Kubernetes Service Discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
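For a quick local start, the whole stack behind this configuration can be brought up with Docker Compose. A sketch (volume mounts assume the file layout above):

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
    ports:
      - "9090:9090"
  alertmanager:
    image: prom/alertmanager
    ports:
      - "9093:9093"
  node-exporter:
    image: prom/node-exporter
    ports:
      - "9100:9100"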
💡 Best Practices:
1. USE for resources, RED for services
2. Alert on symptoms, not on causes
3. Percentiles (p99) instead of averages for latency
4. predict_linear for proactive alerts
5. Group dashboards by service, not by metric
