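In production, we need to probe, monitor, and alert on our services and ports so that we learn about a state change as soon as it happens. blackbox_exporter provides probers for ICMP, TCP, DNS, HTTP, and HTTPS.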

Installing blackbox_exporter

Download: https://github.com/prometheus/blackbox_exporter
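
The release archives are attached to the repository's GitHub releases. As a small sketch, the helper below prints the archive URL for a given version, OS, and architecture; the function name `release_url` is only illustrative, and the URL layout is assumed to follow the usual GitHub release pattern for this repository.

```shell
# Print the release tarball URL for a given version, OS, and architecture.
# release_url is a hypothetical helper name, not part of blackbox_exporter.
release_url() {
  version=$1; os=$2; arch=$3
  printf 'https://github.com/prometheus/blackbox_exporter/releases/download/v%s/blackbox_exporter-%s.%s-%s.tar.gz\n' \
    "$version" "$version" "$os" "$arch"
}

# Example: fetch the 0.18.0 linux-amd64 archive.
#   wget "$(release_url 0.18.0 linux amd64)"
```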

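tar zxvf blackbox_exporter-0.18.0.linux-amd64.tar.gz
cd blackbox_exporter-0.18.0.linux-amd64
cp blackbox_exporter /usr/local/bin
# Start in the background from this directory, so the bundled blackbox.yml is picked up as the config file
mkdir -p logs
nohup blackbox_exporter >> logs/blackbox_exporter.log 2>&1 &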

Configuring Prometheus

# Add these jobs under scrape_configs in prometheus.yml
  - job_name: icmp_probe
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets: ['180.101.49.11']
    relabel_configs:
      # Pass the target to the exporter as the ?target= URL parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed target as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the blackbox_exporter itself
      - target_label: __address__
        replacement: 127.0.0.1:9115
  - job_name: tcp_probe
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets: ['172.19.205.2:8080']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115
  - job_name: http_probe
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['http://www.baidu.com']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115
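
The relabel_configs in each job turn the listed target into a URL parameter and point the scrape at the exporter instead. The sketch below shows roughly which request results for two of the jobs above. `probe_url` is a hypothetical helper written for this illustration, and it skips the URL encoding that Prometheus applies to parameter values.

```shell
# Build the URL that Prometheus ends up requesting after relabeling.
# $1 = exporter address (the replacement written into __address__)
# $2 = module name (from params)
# $3 = probed target (copied from __address__ into __param_target)
probe_url() {
  printf 'http://%s/probe?module=%s&target=%s\n' "$1" "$2" "$3"
}

probe_url 127.0.0.1:9115 icmp 180.101.49.11
probe_url 127.0.0.1:9115 tcp_connect 172.19.205.2:8080

# With the exporter running, the same request can be made by hand:
#   curl -s "$(probe_url 127.0.0.1:9115 icmp 180.101.49.11)" | grep '^probe_success'
```

A `probe_success 1` line in the response means the probe worked; `probe_success 0` usually points at the module settings or at network reachability from the exporter host.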

Configuring the Grafana dashboard

Import the dashboard with ID 13659.

Configuring alerting rules

groups:
  - name: blackbox_rules
    rules:
      - alert: BlackboxProbeFailed
        expr: probe_success == 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Blackbox probe failed (instance {{ $labels.instance }})
          description: "Probe failed\n VALUE = {{ $value }}"
      - alert: BlackboxSlowProbe
        expr: probe_duration_seconds > 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: Blackbox slow probe (instance {{ $labels.instance }})
          description: "Blackbox probe took more than 1s to complete\n VALUE = {{ $value }}"
      - alert: BlackboxProbeHttpFailure
        expr: probe_http_status_code <= 199 or probe_http_status_code >= 400
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Blackbox probe HTTP failure (instance {{ $labels.instance }})
          description: "HTTP status code is not 200-399\n VALUE = {{ $value }}"
      - alert: BlackboxSslCertificateWillExpireSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 3600 / 24 < 7
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
          description: "SSL certificate expires in less than 7 days\n VALUE = {{ $value }}"
      - alert: BlackboxSslCertificateWillExpireSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 3600 / 24 < 3
        for: 0m
        labels:
          # critical, so this alert does not share a label set with the 7-day warning
          severity: critical
        annotations:
          summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
          description: "SSL certificate expires in less than 3 days\n VALUE = {{ $value }}"
      - alert: BlackboxSslCertificateExpired
        expr: (probe_ssl_earliest_cert_expiry - time()) <= 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Blackbox SSL certificate expired (instance {{ $labels.instance }})
          description: "SSL certificate has expired already\n VALUE = {{ $value }}"
      - alert: BlackboxProbeSlowHttp
        expr: probe_http_duration_seconds > 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: Blackbox probe slow HTTP (instance {{ $labels.instance }})
          description: "HTTP request took more than 1s\n VALUE = {{ $value }}"
      - alert: BlackboxProbeSlowPing
        expr: probe_icmp_duration_seconds > 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: Blackbox probe slow ping (instance {{ $labels.instance }})
          description: "Blackbox ping took more than 1s\n VALUE = {{ $value }}"
      - alert: BlackboxProbeSlowDNS
        expr: probe_dns_lookup_time_seconds > 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: Blackbox probe slow DNS (instance {{ $labels.instance }})
          description: "Blackbox DNS lookup took more than 1s\n VALUE = {{ $value }}"
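
The certificate rules compare `probe_ssl_earliest_cert_expiry`, a Unix timestamp, with `time()`. As a worked illustration of that arithmetic, the sketch below classifies a certificate using the same 7-day, 3-day, and expired thresholds. `cert_state` and its return values are hypothetical names chosen for this example; in Prometheus the rules are evaluated independently, so a certificate under 3 days also matches the 7-day rule.

```shell
# Classify a certificate with the thresholds used by the SSL rules.
# $1 = certificate expiry as a Unix timestamp (probe_ssl_earliest_cert_expiry)
# $2 = current Unix timestamp (what time() returns)
cert_state() {
  remaining=$(( $1 - $2 ))                      # seconds until expiry
  if [ "$remaining" -le 0 ]; then
    echo expired                                # (expiry - time()) <= 0
  elif [ "$remaining" -lt $(( 3 * 86400 )) ]; then
    echo under_3_days                           # remaining / 3600 / 24 < 3
  elif [ "$remaining" -lt $(( 7 * 86400 )) ]; then
    echo under_7_days                           # remaining / 3600 / 24 < 7
  else
    echo ok
  fi
}

# 432000 seconds is exactly 5 days, so only the 7-day threshold is crossed.
cert_state 1700432000 1700000000
```

Before reloading Prometheus, the rule file can be checked with `promtool check rules <file>`; the file name is whatever you saved the rules under.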