Awesome Prometheus Alerts

Awesome Prometheus Alerts is an open-source project that collects alerting rules for a wide range of Prometheus setups and exporters; we can use its rules as a reference when building monitoring alerts for our own projects.

Awesome Prometheus Alerts includes 28 rules for monitoring Prometheus itself. Let's walk through them.
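
Before walking through the individual rules, a quick note on how they are loaded. The snippets below show single alert rules; in a real setup they have to be wrapped in a rule group and referenced from prometheus.yml via rule_files. A minimal sketch (the file names and group name here are only examples, not something the project prescribes):

# prometheus.yml (fragment): point Prometheus at the rule file
rule_files:
  - "rules/prometheus-self-monitoring.yml"

# rules/prometheus-self-monitoring.yml: the alert rules below go under a group
groups:
  - name: prometheus-self-monitoring
    rules:
      - alert: PrometheusJobMissing
        expr: absent(up{job="prometheus"})
        for: 0m
        labels:
          severity: warning

The file can be checked with promtool check rules rules/prometheus-self-monitoring.yml, and Prometheus picks it up after a configuration reload.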

The 28 Prometheus self-monitoring alert rules

1. Prometheus job missing

A Prometheus job has disappeared.

- alert: PrometheusJobMissing
  expr: absent(up{job="prometheus"})
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus job missing (instance {{ $labels.instance }})
    description: "A Prometheus job has disappeared\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

2. Prometheus target missing

A Prometheus target has disappeared; an exporter might have crashed.

- alert: PrometheusTargetMissing
  expr: up == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target missing (instance {{ $labels.instance }})
    description: "A Prometheus target has disappeared. An exporter might be crashed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
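
If you want to verify that a rule behaves as expected before deploying it, promtool can unit-test it against synthetic series. A minimal sketch, assuming the rules are stored in rules/prometheus-self-monitoring.yml (the paths and the fake target labels are made up for illustration):

# tests.yml, run with: promtool test rules tests.yml
rule_files:
  - rules/prometheus-self-monitoring.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # a target that is scraped but reports down from the start
      - series: 'up{job="node", instance="10.0.0.1:9100"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      - eval_time: 1m
        alertname: PrometheusTargetMissing
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: 10.0.0.1:9100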

3. Prometheus all targets missing

A Prometheus job no longer has any living targets.

- alert: PrometheusAllTargetsMissing
  expr: sum by (job) (up) == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus all targets missing (instance {{ $labels.instance }})
    description: "A Prometheus job does not have living target anymore.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

4. Prometheus target missing with warmup time

Allow a job some startup time (10 minutes), a so-called warmup period, before alerting that it is down.

- alert: PrometheusTargetMissingWithWarmupTime
  expr: sum by (instance, job) ((up == 0) * on (instance) group_right(job) (node_time_seconds - node_boot_time_seconds > 600))
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target missing with warmup time (instance {{ $labels.instance }})
    description: "Allow a job time to start up (10 minutes) before alerting that it's down.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
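
The join above relies on node_exporter's node_time_seconds and node_boot_time_seconds on the same instance, so each target's own uptime serves as the warmup clock. If node_exporter is not available on the target, a rougher substitute (not part of Awesome Prometheus Alerts, just a sketch) is to lean on the for clause instead:

# Simpler variant: only alert once the target has been down for 10 minutes,
# which also gives freshly started jobs a 10-minute grace period.
- alert: PrometheusTargetMissingSimple
  expr: up == 0
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target down for 10 minutes (instance {{ $labels.instance }})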

5. Prometheus configuration reload failure

The Prometheus configuration reload has failed.

- alert: PrometheusConfigurationReloadFailure
  expr: prometheus_config_last_reload_successful != 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus configuration reload failure (instance {{ $labels.instance }})
    description: "Prometheus configuration reload error\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

6. Prometheus too many restarts

Prometheus has restarted more than twice in the last 15 minutes; it might be crash-looping.

- alert: PrometheusTooManyRestarts
  expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus too many restarts (instance {{ $labels.instance }})
    description: "Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

7. Prometheus AlertManager job missing

A Prometheus AlertManager job has disappeared.

- alert: PrometheusAlertmanagerJobMissing
  expr: absent(up{job="alertmanager"})
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus AlertManager job missing (instance {{ $labels.instance }})
    description: "A Prometheus AlertManager job has disappeared\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

8. Prometheus AlertManager configuration reload failure

The AlertManager configuration reload has failed.

- alert: PrometheusAlertmanagerConfigurationReloadFailure
  expr: alertmanager_config_last_reload_successful != 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }})
    description: "AlertManager configuration reload error\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

9. Prometheus AlertManager config not synced

The configurations of the AlertManager cluster instances are out of sync.

- alert: PrometheusAlertmanagerConfigNotSynced
  expr: count(count_values("config_hash", alertmanager_config_hash)) > 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus AlertManager config not synced (instance {{ $labels.instance }})
    description: "Configurations of AlertManager cluster instances are out of sync\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

10. Prometheus AlertManager E2E dead man switch

The Prometheus DeadManSwitch is an always-firing alert. It is used as an end-to-end (E2E) test of Prometheus through the Alertmanager.

- alert: PrometheusAlertmanagerE2eDeadManSwitch
  expr: vector(1)
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus AlertManager E2E dead man switch (instance {{ $labels.instance }})
    description: "Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
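
For this alert to be useful, Alertmanager should forward it to an external system that pages you when the heartbeat stops arriving. A minimal sketch of such a route, assuming a hypothetical heartbeat webhook endpoint (the receiver name and URL are placeholders):

# alertmanager.yml (fragment): send the always-firing alert to a dedicated receiver
route:
  routes:
    - receiver: deadmansswitch
      matchers:
        - alertname = "PrometheusAlertmanagerE2eDeadManSwitch"
      repeat_interval: 5m    # keep re-sending so the external service sees a steady heartbeat
receivers:
  - name: deadmansswitch
    webhook_configs:
      - url: "https://healthchecks.example.com/ping/prometheus"    # hypothetical endpoint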

11. Prometheus not connected to alertmanager

Prometheus cannot connect to the alertmanager.

- alert: PrometheusNotConnectedToAlertmanager
  expr: prometheus_notifications_alertmanagers_discovered < 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus not connected to alertmanager (instance {{ $labels.instance }})
    description: "Prometheus cannot connect the alertmanager\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

12. Prometheus rule evaluation failures

Prometheus has encountered rule evaluation failures, which may lead to alerts being ignored.

- alert: PrometheusRuleEvaluationFailures
  expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus rule evaluation failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

13. Prometheus template text expansion failures

Prometheus has encountered template text expansion failures.

- alert: PrometheusTemplateTextExpansionFailures
  expr: increase(prometheus_template_text_expansion_failures_total[3m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus template text expansion failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} template text expansion failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

14. Prometheus rule evaluation slow

Prometheus rule evaluation is taking longer than the scheduled interval, which indicates slow storage backend access or an overly complex query.

- alert: PrometheusRuleEvaluationSlow
  expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus rule evaluation slow (instance {{ $labels.instance }})
    description: "Prometheus rule evaluation took more time than the scheduled interval. It indicates a slower storage backend access or too complex query.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

15. Prometheus notifications backlog

The Prometheus notification queue has not been empty for 10 minutes.

- alert: PrometheusNotificationsBacklog
  expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus notifications backlog (instance {{ $labels.instance }})
    description: "The Prometheus notification queue has not been empty for 10 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

16. Prometheus AlertManager notification failing

AlertManager is failing to send notifications.

- alert: PrometheusAlertmanagerNotificationFailing
  expr: rate(alertmanager_notifications_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus AlertManager notification failing (instance {{ $labels.instance }})
    description: "Alertmanager is failing sending notifications\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

17. Prometheus target empty

Prometheus has no targets in service discovery.

- alert: PrometheusTargetEmpty
  expr: prometheus_sd_discovered_targets == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target empty (instance {{ $labels.instance }})
    description: "Prometheus has no target in service discovery\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

18. Prometheus target scraping slow

Prometheus is scraping exporters more slowly than the requested scrape interval, which suggests the Prometheus server is under-provisioned. The expression flags targets whose 90th-percentile observed scrape interval is more than 5% above the median.

- alert: PrometheusTargetScrapingSlow
  expr: prometheus_target_interval_length_seconds{quantile="0.9"} / on (interval, instance, job) prometheus_target_interval_length_seconds{quantile="0.5"} > 1.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus target scraping slow (instance {{ $labels.instance }})
    description: "Prometheus is scraping exporters slowly since it exceeded the requested interval time. Your Prometheus server is under-provisioned.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

19. Prometheus large scrape

Prometheus has many scrapes that exceed the sample limit.

- alert: PrometheusLargeScrape
  expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus large scrape (instance {{ $labels.instance }})
    description: "Prometheus has many scrapes that exceed the sample limit\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

20. Prometheus target scrape duplicate

Prometheus has many samples rejected because of duplicate timestamps with different values.

- alert: PrometheusTargetScrapeDuplicate
  expr: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus target scrape duplicate (instance {{ $labels.instance }})
    description: "Prometheus has many samples rejected due to duplicate timestamps but different values\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

21. Prometheus TSDB checkpoint creation failures

Prometheus has encountered checkpoint creation failures.

- alert: PrometheusTsdbCheckpointCreationFailures
  expr: increase(prometheus_tsdb_checkpoint_creations_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} checkpoint creation failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

22. Prometheus TSDB checkpoint deletion failures

Prometheus has encountered checkpoint deletion failures.

- alert: PrometheusTsdbCheckpointDeletionFailures
  expr: increase(prometheus_tsdb_checkpoint_deletions_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} checkpoint deletion failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

23. Prometheus TSDB compactions failed

Prometheus has encountered TSDB compaction failures.

- alert: PrometheusTsdbCompactionsFailed
  expr: increase(prometheus_tsdb_compactions_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB compactions failed (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB compactions failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

24. Prometheus TSDB head truncations failed

Prometheus has encountered TSDB head truncation failures.

- alert: PrometheusTsdbHeadTruncationsFailed
  expr: increase(prometheus_tsdb_head_truncations_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB head truncations failed (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB head truncation failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

25. Prometheus TSDB reload failures

Prometheus has encountered TSDB reload failures.

- alert: PrometheusTsdbReloadFailures
  expr: increase(prometheus_tsdb_reloads_failures_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB reload failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB reload failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

26. Prometheus TSDB WAL corruptions

Prometheus has encountered TSDB WAL corruptions.

- alert: PrometheusTsdbWalCorruptions
  expr: increase(prometheus_tsdb_wal_corruptions_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB WAL corruptions (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB WAL corruptions\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

27. Prometheus TSDB WAL truncations failed

Prometheus has encountered TSDB WAL truncation failures.

- alert: PrometheusTsdbWalTruncationsFailed
  expr: increase(prometheus_tsdb_wal_truncations_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB WAL truncations failed (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB WAL truncation failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

28. Prometheus timeserie cardinality

The cardinality of a metric name (the number of unique time series behind it) is getting very high.

- alert: PrometheusTimeserieCardinality
  expr: label_replace(count by(__name__) ({__name__=~".+"}), "name", "$1", "__name__", "(.+)") > 10000
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus timeserie cardinality (instance {{ $labels.instance }})
    description: "The \"{{ $labels.name }}\" timeserie cardinality is getting very high: {{ $value }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
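
When this alert fires, the next step is usually to find out which metric names are responsible. An ad-hoc query along these lines in the Prometheus expression browser lists the heaviest ones; note that matching every series this way can be expensive on a large server, so run it sparingly:

# Top 10 metric names by number of series
topk(10, count by (__name__) ({__name__=~".+"}))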