Last updated: 20.01.2026 at 10:05

Monitoring & Alerting: Keeping an Eye on Your Systems

Good monitoring surfaces problems before users notice them. Learn how to build effective monitoring.

The Three Pillars

Observability
├── Metrics     → numbers over time (CPU, RAM, requests)
├── Logs        → events and errors
└── Traces      → the request's path through the system
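
To make the three signals concrete, here is a minimal, hypothetical Node.js sketch of how a single request can emit all three: a counter (metric), a structured log line (log), and a trace ID that ties the log entry to a trace. The traceId is just a random placeholder here, not one produced by a real tracing library.

const { randomUUID } = require('crypto');

let requestCount = 0;                         // Metric: a number aggregated over time

function handleRequest(req) {
    const traceId = randomUUID();             // Trace: identifies this request across services
    requestCount += 1;

    // Log: a discrete event; the traceId links this line to the matching trace
    console.log(JSON.stringify({
        level: 'info',
        msg: 'request handled',
        traceId,
        path: req.path
    }));
}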

Key Metrics

Category     Metrics
System       CPU, RAM, Disk, Network I/O
Application  Request Rate, Error Rate, Latency
Business     Orders/min, Active Users, Revenue
Database     Connections, Query Time, Cache Hit Rate

RED Method (Services)

R - Rate:     requests per second
E - Errors:   errors per second
D - Duration: latency (p50, p95, p99)

# Prometheus Queries
# Rate
rate(http_requests_total[5m])

# Errors
rate(http_requests_total{status=~"5.."}[5m])
  / rate(http_requests_total[5m])

# Duration
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

USE Method (Resources)

U - Utilization: how much is in use (%)
S - Saturation:  how overloaded it is (queue length)
E - Errors:      error count

# CPU
Utilization: CPU usage %
Saturation:  Load average / CPU cores
Errors:      CPU errors (rare)

# Disk
Utilization: Disk usage %
Saturation:  I/O wait
Errors:      Disk errors
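
As a small illustration of the saturation formula above (load average divided by CPU cores), here is a hedged Node.js sketch using only the built-in os module; note that os.loadavg() returns zeros on Windows, so treat it as a Linux/macOS example.

const os = require('os');

// Saturation: 1-minute load average relative to the number of CPU cores.
// Values consistently above 1.0 mean runnable work is queuing up.
const [load1] = os.loadavg();
const saturation = load1 / os.cpus().length;

console.log(`CPU saturation (load1 / cores): ${saturation.toFixed(2)}`);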

Prometheus + Grafana Setup

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana

  node-exporter:
    image: prom/node-exporter
    ports:
      - "9100:9100"

volumes:
  grafana-data:

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'app'
    static_configs:
      - targets: ['app:8080']

Application Metrics (Node.js)

npm install prom-client

const client = require('prom-client');

// Default Metrics (CPU, Memory, etc.)
client.collectDefaultMetrics();

// Custom Metrics
const httpRequestDuration = new client.Histogram({
    name: 'http_request_duration_seconds',
    help: 'HTTP request duration in seconds',
    labelNames: ['method', 'route', 'status'],
    buckets: [0.1, 0.5, 1, 2, 5]
});

const httpRequestsTotal = new client.Counter({
    name: 'http_requests_total',
    help: 'Total HTTP requests',
    labelNames: ['method', 'route', 'status']
});

// Middleware
app.use((req, res, next) => {
    const start = Date.now();

    res.on('finish', () => {
        const duration = (Date.now() - start) / 1000;
        const route = req.route?.path || 'unknown';

        httpRequestDuration
            .labels(req.method, route, res.statusCode)
            .observe(duration);

        httpRequestsTotal
            .labels(req.method, route, res.statusCode)
            .inc();
    });

    next();
});

// Metrics Endpoint
app.get('/metrics', async (req, res) => {
    res.set('Content-Type', client.register.contentType);
    res.send(await client.register.metrics());
});
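
The same pattern extends to the business metrics from the table above. As a hedged sketch, a Gauge sampled on a timer can expose a value such as active users; the metric name app_active_users and the getActiveUserCount() helper are illustrative assumptions, not part of prom-client.

// Gauge for a business-level metric, sampled periodically instead of per request
const activeUsers = new client.Gauge({
    name: 'app_active_users',
    help: 'Currently active users'
});

setInterval(async () => {
    // getActiveUserCount() is a hypothetical application-specific lookup
    activeUsers.set(await getActiveUserCount());
}, 15000);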

Alerting Rules

# prometheus/alerts.yml
groups:
  - name: app
    rules:
      # High Error Rate
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
          / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"

      # High Latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile latency > 2s"

      # Service Down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"

  - name: infrastructure
    rules:
      # High CPU
      - alert: HighCPU
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning

      # Disk Space
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space < 10% on {{ $labels.mountpoint }}"

Alertmanager Configuration

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-email'

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'team@example.com'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'

  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
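
Besides the built-in receivers, Alertmanager can POST alerts to any HTTP endpoint via webhook_configs. Below is a hedged Node.js/Express sketch of such a receiver; the port and the /alerts path are arbitrary choices, while the payload fields used (status, labels, annotations) come from Alertmanager's documented webhook format.

const express = require('express');
const app = express();
app.use(express.json());

// Alertmanager sends a JSON body whose "alerts" array holds one entry per alert
app.post('/alerts', (req, res) => {
    for (const alert of req.body.alerts || []) {
        console.log(`[${alert.status}] ${alert.labels.alertname}: ${alert.annotations.summary || ''}`);
    }
    res.sendStatus(200);
});

app.listen(5001);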

Uptime Monitoring

# Simple health check
GET /health
→ 200 OK { "status": "healthy" }

# Detailed health check
GET /health/ready
{
  "status": "healthy",
  "checks": {
    "database": { "status": "up", "latency_ms": 5 },
    "redis": { "status": "up", "latency_ms": 1 },
    "external_api": { "status": "up", "latency_ms": 120 }
  }
}
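
A hedged sketch of how the detailed endpoint above could be implemented in the Express app from the metrics section; pingDatabase() and pingRedis() are hypothetical probe functions that resolve when the dependency answers and throw otherwise.

app.get('/health/ready', async (req, res) => {
    const probes = { database: pingDatabase, redis: pingRedis };  // hypothetical checks
    const checks = {};
    let healthy = true;

    for (const [name, probe] of Object.entries(probes)) {
        const start = Date.now();
        try {
            await probe();
            checks[name] = { status: 'up', latency_ms: Date.now() - start };
        } catch (err) {
            checks[name] = { status: 'down', error: err.message };
            healthy = false;
        }
    }

    res.status(healthy ? 200 : 503).json({ status: healthy ? 'healthy' : 'unhealthy', checks });
});
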
💡 Recommendation: Use the Enjyn Status Monitor for simple uptime monitoring of your websites and APIs, with alerts via e-mail.

Best Practices

✅ Good monitoring:
  • Alert on symptoms, not causes
  • Actionable alerts (it's clear what to do)
  • Runbooks for every alert
  • On-call rotation
  • Post-mortems after incidents
