In Alertmanager, the route attribute defines the dispatch policy for alerts, that is, which receiver each alert is sent to.

Configuring Alertmanager

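The matching rules below work as follows: an alert whose service label matches the regular expression mysql|redis is sent to group1 (a Gmail mailbox), and an alert whose service label equals linux is sent to group2 (a QQ Mail mailbox).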

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: 'xxx@qq.com'
  smtp_auth_username: 'xxx@qq.com'
  smtp_auth_password: 'xxxxxxxxxx'
  smtp_require_tls: false

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1m
  receiver: 'group1'
  # Alerts that match none of the child routes below stay at the root node
  # and are sent to the root node's receiver, which acts as the default.
  routes:
    - receiver: 'group1'
      # Regular expression match
      match_re:
        service: mysql|redis
    - receiver: 'group2'
      match:
        service: linux
receivers:
  - name: 'group1'
    email_configs:
      - to: 'xxx@gmail.com'
        send_resolved: true
  - name: 'group2'
    email_configs:
      - to: 'xxx@qq.com'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
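
The routing decision described above can be sketched in a few lines of Python. This is only an illustration of the matching semantics, not Alertmanager's own code: the names ROUTES, DEFAULT_RECEIVER, route_matches and pick_receiver are hypothetical, and Python's re module stands in for the Go regular expressions Alertmanager actually uses. It assumes the default behaviour where the first matching child route wins (no continue: true), match means exact equality, match_re must match the whole label value, and a missing label is treated as an empty string.

```python
import re

# Child routes mirrored from the configuration above, in order.
# "match" is exact equality; "match_re" is a regular expression that
# must match the whole label value.
ROUTES = [
    {"receiver": "group1", "match_re": {"service": "mysql|redis"}},
    {"receiver": "group2", "match": {"service": "linux"}},
]
DEFAULT_RECEIVER = "group1"  # receiver of the root route


def route_matches(route, labels):
    """Return True if every matcher of the route accepts the alert labels."""
    for name, value in route.get("match", {}).items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in route.get("match_re", {}).items():
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def pick_receiver(labels):
    """Return (receiver, index of the matched child route or None)."""
    for index, route in enumerate(ROUTES):
        if route_matches(route, labels):
            return route["receiver"], index
    # No child route matched: the alert stays at the root node.
    return DEFAULT_RECEIVER, None
```

For example, pick_receiver({"service": "linux"}) selects group2 through the second child route, while an alert without a service label falls back to the root receiver. Because the root receiver here is also group1, the second element of the result is what tells a regular expression match apart from the default.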

Configuring rules

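To make the alerts easy to trigger, we use one CPU usage rule and one memory usage rule: the CPU rule fires when usage exceeds 0.5%, and the memory rule fires when usage exceeds 5%.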

groups:
  - name: general
    rules:
      - alert: CPU usage is more than 75%
        # Threshold lowered to 0.5 for testing (the alert name still says 75%).
        # "avg by (instance, job)" keeps the labels used in the annotations;
        # a bare avg() would drop them.
        expr: 100 - (avg by (instance, job) (irate(node_cpu_seconds_total{job="prometheus",mode="idle"}[1m])) * 100) > 0.5
        for: 1m
        labels:
          severity: error
          service: linux
        annotations:
          summary: "instance {{ $labels.instance }} CPU usage is more than 75%"
          description: "{{ $labels.instance }} of job {{ $labels.job }} CPU usage is more than 75% for more than 1 minute."
  - name: db
    rules:
      - alert: Memory usage is more than 75%
        # Threshold lowered to 5 for testing (the alert name still says 75%).
        expr: ((node_memory_MemTotal_bytes{job="prometheus"} - node_memory_MemFree_bytes{job="prometheus"} - node_memory_Buffers_bytes{job="prometheus"} - node_memory_Cached_bytes{job="prometheus"}) / (node_memory_MemTotal_bytes{job="prometheus"})) * 100 > 5
        for: 1m
        labels:
          severity: error
          service: mysql
        annotations:
          summary: "instance {{ $labels.instance }} Memory usage is more than 75%"
          description: "{{ $labels.instance }} of job {{ $labels.job }} Memory usage is more than 75% for more than 1 minute."

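As shown above, the CPU alert carries the label service: linux, so it should be delivered to the QQ Mail mailbox, while the memory alert carries the label service: mysql, so it should be delivered to the Gmail mailbox.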

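In the expr expressions we lowered the alert thresholds (0.5 and 5 instead of the 75 mentioned in the alert names) so that the alerts fire more easily.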
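
The arithmetic behind the two expressions can be checked by hand. The sketch below reproduces it in Python with made-up sample values; the function names and the numbers are hypothetical and serve only to show why the lowered thresholds fire on an almost idle machine while a 75% threshold would not.

```python
CPU_THRESHOLD = 0.5     # percent, as in the CPU rule above
MEMORY_THRESHOLD = 5    # percent, as in the memory rule above


def cpu_usage_percent(idle_fraction):
    """100 - idle * 100, where idle_fraction is the averaged idle rate (0..1)."""
    return 100 - idle_fraction * 100


def memory_usage_percent(total, free, buffers, cached):
    """(total - free - buffers - cached) / total * 100."""
    return (total - free - buffers - cached) / total * 100


# Hypothetical sample: CPU idle 98% of the time, so usage is about 2%.
cpu = cpu_usage_percent(0.98)
cpu_fires = cpu > CPU_THRESHOLD

# Hypothetical sample: 8000 MB total, 2000 free, 500 buffers, 1500 cached.
mem = memory_usage_percent(8000, 2000, 500, 1500)
mem_fires = mem > MEMORY_THRESHOLD
```

With these sample values both conditions hold, whereas neither value would exceed a threshold of 75.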

Testing

After restarting the relevant services, the alert status is visible on the Prometheus page, and the alert emails arrive in the mailboxes.

The Gmail mailbox received the memory alert email.

The QQ Mail mailbox received the CPU alert email.

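With this mechanism, alerts for different services can be routed to the people responsible for them: for example, database alerts go to the DBAs, Linux system alerts go to the system operations staff, and so on.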