Awesome Prometheus Alerts

Awesome Prometheus Alerts is an open-source project that collects alerting rules for a wide range of Prometheus setups and exporters; we can use its rules as a reference when building monitoring alerts for our own projects.

Awesome Prometheus Alerts includes 28 rules for monitoring Prometheus itself. Let's walk through them.
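
Before walking through the individual rules, a quick note on how they are loaded. The snippets below show single alert rules; in a real setup they have to be wrapped in a rule group and referenced from prometheus.yml via rule_files. A minimal sketch (the file names and group name here are only examples, not something the project prescribes):

# prometheus.yml (fragment): point Prometheus at the rule file
rule_files:
  - "rules/prometheus-self-monitoring.yml"

# rules/prometheus-self-monitoring.yml: the alert rules below go under a group
groups:
  - name: prometheus-self-monitoring
    rules:
      - alert: PrometheusJobMissing
        expr: absent(up{job="prometheus"})
        for: 0m
        labels:
          severity: warning

The file can be checked with promtool check rules rules/prometheus-self-monitoring.yml, and Prometheus picks it up after a configuration reload.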

The 28 Prometheus self-monitoring alert rules

1. Prometheus job missing

A Prometheus job has disappeared.

- alert: PrometheusJobMissing
  expr: absent(up{job="prometheus"})
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus job missing (instance {{ $labels.instance }})
    description: "A Prometheus job has disappeared\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

2. Prometheus target missing

A Prometheus target has disappeared; an exporter might have crashed.

- alert: PrometheusTargetMissing
  expr: up == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target missing (instance {{ $labels.instance }})
    description: "A Prometheus target has disappeared. An exporter might be crashed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
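
If you want to verify that a rule behaves as expected before deploying it, promtool can unit-test it against synthetic series. A minimal sketch, assuming the rules are stored in rules/prometheus-self-monitoring.yml (the paths and the fake target labels are made up for illustration):

# tests.yml, run with: promtool test rules tests.yml
rule_files:
  - rules/prometheus-self-monitoring.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # a target that is scraped but reports down from the start
      - series: 'up{job="node", instance="10.0.0.1:9100"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      - eval_time: 1m
        alertname: PrometheusTargetMissing
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: 10.0.0.1:9100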

3. Prometheus all targets missing

A Prometheus job no longer has any living targets.

- alert: PrometheusAllTargetsMissing
  expr: sum by (job) (up) == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus all targets missing (instance {{ $labels.instance }})
    description: "A Prometheus job does not have living target anymore.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

4. Prometheus target missing with warmup time

Allow a job some startup time (10 minutes), a so-called warmup period, before alerting that it is down.

- alert: PrometheusTargetMissingWithWarmupTime
  expr: sum by (instance, job) ((up == 0) * on (instance) group_right(job) (node_time_seconds - node_boot_time_seconds > 600))
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target missing with warmup time (instance {{ $labels.instance }})
    description: "Allow a job time to start up (10 minutes) before alerting that it's down.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
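
The join above relies on node_exporter's node_time_seconds and node_boot_time_seconds on the same instance, so each target's own uptime serves as the warmup clock. If node_exporter is not available on the target, a rougher substitute (not part of Awesome Prometheus Alerts, just a sketch) is to lean on the for clause instead:

# Simpler variant: only alert once the target has been down for 10 minutes,
# which also gives freshly started jobs a 10-minute grace period.
- alert: PrometheusTargetMissingSimple
  expr: up == 0
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target down for 10 minutes (instance {{ $labels.instance }})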

5. Prometheus configuration reload failure

The Prometheus configuration reload has failed.

- alert: PrometheusConfigurationReloadFailure
  expr: prometheus_config_last_reload_successful != 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus configuration reload failure (instance {{ $labels.instance }})
    description: "Prometheus configuration reload error\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

6. Prometheus too many restarts

Prometheus has restarted more than twice in the last 15 minutes; it might be crash-looping.

- alert: PrometheusTooManyRestarts
  expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus too many restarts (instance {{ $labels.instance }})
    description: "Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

7. Prometheus AlertManager job missing

A Prometheus AlertManager job has disappeared.

- alert: PrometheusAlertmanagerJobMissing
  expr: absent(up{job="alertmanager"})
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus AlertManager job missing (instance {{ $labels.instance }})
    description: "A Prometheus AlertManager job has disappeared\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

8. Prometheus AlertManager configuration reload failure

The AlertManager configuration reload has failed.

- alert: PrometheusAlertmanagerConfigurationReloadFailure
  expr: alertmanager_config_last_reload_successful != 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }})
    description: "AlertManager configuration reload error\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

9. Prometheus AlertManager config not synced

The configurations of the AlertManager cluster instances are out of sync.

- alert: PrometheusAlertmanagerConfigNotSynced
  expr: count(count_values("config_hash", alertmanager_config_hash)) > 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus AlertManager config not synced (instance {{ $labels.instance }})
    description: "Configurations of AlertManager cluster instances are out of sync\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

10. Prometheus AlertManager E2E dead man switch

The Prometheus DeadManSwitch is an always-firing alert. It is used as an end-to-end (E2E) test of Prometheus through the Alertmanager.

- alert: PrometheusAlertmanagerE2eDeadManSwitch
  expr: vector(1)
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus AlertManager E2E dead man switch (instance {{ $labels.instance }})
    description: "Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
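
For this alert to be useful, Alertmanager should forward it to an external system that pages you when the heartbeat stops arriving. A minimal sketch of such a route, assuming a hypothetical heartbeat webhook endpoint (the receiver name and URL are placeholders):

# alertmanager.yml (fragment): send the always-firing alert to a dedicated receiver
route:
  routes:
    - receiver: deadmansswitch
      matchers:
        - alertname = "PrometheusAlertmanagerE2eDeadManSwitch"
      repeat_interval: 5m    # keep re-sending so the external service sees a steady heartbeat
receivers:
  - name: deadmansswitch
    webhook_configs:
      - url: "https://healthchecks.example.com/ping/prometheus"    # hypothetical endpoint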

11. Prometheus not connected to alertmanager

Prometheus cannot connect to the alertmanager.

- alert: PrometheusNotConnectedToAlertmanager
  expr: prometheus_notifications_alertmanagers_discovered < 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus not connected to alertmanager (instance {{ $labels.instance }})
    description: "Prometheus cannot connect the alertmanager\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

12. Prometheus rule evaluation failures

Prometheus has encountered rule evaluation failures, which may lead to alerts being ignored.

- alert: PrometheusRuleEvaluationFailures
  expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus rule evaluation failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

13. Prometheus template text expansion failures

Prometheus has encountered template text expansion failures.

- alert: PrometheusTemplateTextExpansionFailures
  expr: increase(prometheus_template_text_expansion_failures_total[3m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus template text expansion failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} template text expansion failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

14. Prometheus rule evaluation slow

Prometheus rule evaluation is taking longer than the scheduled interval, which indicates slow storage backend access or an overly complex query.

- alert: PrometheusRuleEvaluationSlow
  expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus rule evaluation slow (instance {{ $labels.instance }})
    description: "Prometheus rule evaluation took more time than the scheduled interval. It indicates a slower storage backend access or too complex query.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

15. Prometheus notifications backlog

The Prometheus notification queue has not been empty for 10 minutes.

- alert: PrometheusNotificationsBacklog
  expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus notifications backlog (instance {{ $labels.instance }})
    description: "The Prometheus notification queue has not been empty for 10 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

16. Prometheus AlertManager notification failing

AlertManager is failing to send notifications.

- alert: PrometheusAlertmanagerNotificationFailing
  expr: rate(alertmanager_notifications_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus AlertManager notification failing (instance {{ $labels.instance }})
    description: "Alertmanager is failing sending notifications\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

17. Prometheus target empty

Prometheus has no targets in service discovery.

- alert: PrometheusTargetEmpty
  expr: prometheus_sd_discovered_targets == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target empty (instance {{ $labels.instance }})
    description: "Prometheus has no target in service discovery\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

18. Prometheus target scraping slow

Prometheus is scraping exporters more slowly than the requested scrape interval, which suggests the Prometheus server is under-provisioned. The expression flags targets whose 90th-percentile observed scrape interval is more than 5% above the median.

- alert: PrometheusTargetScrapingSlow
  expr: prometheus_target_interval_length_seconds{quantile="0.9"} / on (interval, instance, job) prometheus_target_interval_length_seconds{quantile="0.5"} > 1.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus target scraping slow (instance {{ $labels.instance }})
    description: "Prometheus is scraping exporters slowly since it exceeded the requested interval time. Your Prometheus server is under-provisioned.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

19. Prometheus large scrape

Prometheus has many scrapes that exceed the sample limit.

- alert: PrometheusLargeScrape
  expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus large scrape (instance {{ $labels.instance }})
    description: "Prometheus has many scrapes that exceed the sample limit\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

20. Prometheus target scrape duplicate

Prometheus has many samples rejected because of duplicate timestamps with different values.

- alert: PrometheusTargetScrapeDuplicate
  expr: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus target scrape duplicate (instance {{ $labels.instance }})
    description: "Prometheus has many samples rejected due to duplicate timestamps but different values\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

21. Prometheus TSDB checkpoint creation failures

Prometheus has encountered checkpoint creation failures.

- alert: PrometheusTsdbCheckpointCreationFailures
  expr: increase(prometheus_tsdb_checkpoint_creations_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} checkpoint creation failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

22. Prometheus TSDB checkpoint deletion failures

Prometheus has encountered checkpoint deletion failures.

- alert: PrometheusTsdbCheckpointDeletionFailures
  expr: increase(prometheus_tsdb_checkpoint_deletions_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} checkpoint deletion failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

23. Prometheus TSDB compactions failed

Prometheus has encountered TSDB compaction failures.

- alert: PrometheusTsdbCompactionsFailed
  expr: increase(prometheus_tsdb_compactions_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB compactions failed (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB compactions failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

24. Prometheus TSDB head truncations failed

Prometheus has encountered TSDB head truncation failures.

- alert: PrometheusTsdbHeadTruncationsFailed
  expr: increase(prometheus_tsdb_head_truncations_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB head truncations failed (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB head truncation failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

25. Prometheus TSDB reload failures

Prometheus has encountered TSDB reload failures.

- alert: PrometheusTsdbReloadFailures
  expr: increase(prometheus_tsdb_reloads_failures_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB reload failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB reload failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

26. Prometheus TSDB WAL corruptions

Prometheus has encountered TSDB WAL corruptions.

- alert: PrometheusTsdbWalCorruptions
  expr: increase(prometheus_tsdb_wal_corruptions_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB WAL corruptions (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB WAL corruptions\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

27. Prometheus TSDB WAL truncations failed

Prometheus has encountered TSDB WAL truncation failures.

- alert: PrometheusTsdbWalTruncationsFailed
  expr: increase(prometheus_tsdb_wal_truncations_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB WAL truncations failed (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB WAL truncation failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

28. Prometheus timeserie cardinality

The cardinality of a metric name (the number of unique time series behind it) is getting very high.

- alert: PrometheusTimeserieCardinality
  expr: label_replace(count by(__name__) ({__name__=~".+"}), "name", "$1", "__name__", "(.+)") > 10000
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus timeserie cardinality (instance {{ $labels.instance }})
    description: "The \"{{ $labels.name }}\" timeserie cardinality is getting very high: {{ $value }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
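
When this alert fires, the next step is usually to find out which metric names are responsible. An ad-hoc query along these lines in the Prometheus expression browser lists the heaviest ones; note that matching every series this way can be expensive on a large server, so run it sparingly:

# Top 10 metric names by number of series
topk(10, count by (__name__) ({__name__=~".+"}))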