- alert: PrometheusTargetMissing
  expr: up == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target missing (instance {{ $labels.instance }})
    description: "A Prometheus target has disappeared. An exporter might be crashed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
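The rules in this list are bare list items. Prometheus only loads a rule when it sits inside a named group in a rule file. The following is a minimal sketch of such a file, assuming a hypothetical group name `prometheus-self-monitoring`; the file itself must also be listed under `rule_files` in `prometheus.yml`.

```yaml
# Hypothetical rule file, e.g. saved as alerts.yml.
# The group name below is an assumption made for illustration.
groups:
  - name: prometheus-self-monitoring
    rules:
      # Any rule from this list can be placed here unchanged.
      - alert: PrometheusTargetMissing
        expr: up == 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Prometheus target missing (instance {{ $labels.instance }})
          description: "A Prometheus target has disappeared. An exporter might be crashed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```

Running `promtool check rules` against the file should catch syntax errors before Prometheus is reloaded.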
3. Prometheus all targets missing
A Prometheus job no longer has any living targets.
- alert: PrometheusAllTargetsMissing
  expr: sum by (job) (up) == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus all targets missing (instance {{ $labels.instance }})
    description: "A Prometheus job does not have living target anymore.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
4. Prometheus target missing with warmup time
Allow a job time to start up (10 minutes, the so-called warmup time) before alerting that it is down.
- alert: PrometheusTargetMissingWithWarmupTime
  expr: sum by (instance, job) ((up == 0) * on (instance) group_right(job) (node_time_seconds - node_boot_time_seconds > 600))
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target missing with warmup time (instance {{ $labels.instance }})
    description: "Allow a job time to start up (10 minutes) before alerting that it's down.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PrometheusTooManyRestarts
  expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus too many restarts (instance {{ $labels.instance }})
    description: "Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PrometheusAlertmanagerConfigNotSynced
  expr: count(count_values("config_hash", alertmanager_config_hash)) > 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus AlertManager config not synced (instance {{ $labels.instance }})
    description: "Configurations of AlertManager cluster instances are out of sync\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
10. Prometheus AlertManager E2E dead man switch
The Prometheus dead man switch is an alert that always fires. It serves as an end-to-end (E2E) test of Prometheus through Alertmanager.
- alert: PrometheusAlertmanagerE2eDeadManSwitch
  expr: vector(1)
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus AlertManager E2E dead man switch (instance {{ $labels.instance }})
    description: "Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
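An always-firing alert is only useful if something notices when it stops arriving. One common arrangement is to route it to an external heartbeat service that raises its own alarm when the pings stop. The following is a sketch of the Alertmanager side, not a definitive setup: the receiver names and the webhook URL are hypothetical, and the `matchers` syntax assumes Alertmanager 0.22 or later.

```yaml
# Sketch of an alertmanager.yml fragment. Receiver names and the URL
# are placeholders chosen for illustration.
route:
  receiver: default
  routes:
    # Send only the dead man switch alert to the heartbeat receiver.
    - matchers:
        - 'alertname = "PrometheusAlertmanagerE2eDeadManSwitch"'
      receiver: deadman-heartbeat
      group_wait: 0s
      group_interval: 1m
      # Re-send frequently so the external service sees a steady heartbeat.
      repeat_interval: 1m

receivers:
  - name: default
  - name: deadman-heartbeat
    webhook_configs:
      # Hypothetical endpoint that alarms when the pings stop arriving.
      - url: https://heartbeat.example.invalid/ping
        send_resolved: false
```

If Prometheus, the rule evaluation, or Alertmanager breaks, the heartbeat stops and the external service is the one that reports the outage.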
- alert: PrometheusRuleEvaluationSlow
  expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus rule evaluation slow (instance {{ $labels.instance }})
    description: "Prometheus rule evaluation took more time than the scheduled interval. It indicates a slower storage backend access or too complex query.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
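The rule above compares how long a group's last evaluation took with that group's scheduled interval. The interval can be set per group in the rule file, so one option for a group of expensive rules is to give it a longer interval than the global default. The following is a sketch under stated assumptions: the group name, the recording rule name, and the `http_requests_total` metric are hypothetical.

```yaml
# Sketch of a rule file with a per-group evaluation interval.
groups:
  - name: expensive-aggregations   # hypothetical group name
    # Overrides the global evaluation_interval for this group only.
    interval: 2m
    rules:
      # Hypothetical recording rule over a hypothetical metric.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Lengthening the interval only hides a slow query; simplifying the expression or splitting a large group is usually worth trying first.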
15. Prometheus notifications backlog
The Prometheus notification queue has not been empty for 10 minutes.
- alert: PrometheusNotificationsBacklog
  expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus notifications backlog (instance {{ $labels.instance }})
    description: "The Prometheus notification queue has not been empty for 10 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
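The notification queue drains to the Alertmanager endpoints configured in `prometheus.yml`, so a queue that never empties often points at one of those endpoints being unreachable or slow. The following sketch shows where they are defined; the target address is a hypothetical placeholder.

```yaml
# Sketch of the alerting section of prometheus.yml.
alerting:
  alertmanagers:
    # Per-request timeout when pushing alerts (10s is the default).
    - timeout: 10s
      static_configs:
        - targets:
            # Hypothetical Alertmanager address.
            - alertmanager.example.internal:9093
```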
- alert: PrometheusTargetScrapingSlow
  expr: prometheus_target_interval_length_seconds{quantile="0.9"} / on (interval, instance, job) prometheus_target_interval_length_seconds{quantile="0.5"} > 1.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus target scraping slow (instance {{ $labels.instance }})
    description: "Prometheus is scraping exporters slowly since it exceeded the requested interval time. Your Prometheus server is under-provisioned.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
19. Prometheus large scrape
Prometheus has many scrapes that exceed the sample limit.
- alert: PrometheusLargeScrape
  expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus large scrape (instance {{ $labels.instance }})
    description: "Prometheus has many scrapes that exceed the sample limit\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
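The counter in this rule only increases for jobs that have a `sample_limit` configured, because a scrape is rejected only when it returns more samples than that limit. The following sketch shows where the limit is set; the job name, the target, and the value of 10000 are hypothetical.

```yaml
# Sketch of a scrape job with a per-scrape sample limit.
scrape_configs:
  - job_name: node               # hypothetical job name
    # A scrape returning more samples than this is treated as failed.
    sample_limit: 10000          # hypothetical limit
    static_configs:
      - targets:
          - localhost:9100       # hypothetical target
```

When the alert fires, either the target is exposing more series than intended or the limit is too low for that job.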
20. Prometheus target scrape duplicate
Prometheus has many samples rejected because they have duplicate timestamps but different values.
- alert: PrometheusTargetScrapeDuplicate
  expr: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus target scrape duplicate (instance {{ $labels.instance }})
    description: "Prometheus has many samples rejected due to duplicate timestamps but different values\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"