Prometheus Configuration Manual

Compiled, organized, and written by snow chuai --- 2020/07/21


1. Deploying Prometheus
1.1 Configuring and Installing the Prometheus Repos
1) Set up the Prometheus repo
[root@srv1 ~]# cat > /etc/yum.repos.d/prometheus.repo <<'EOF' 
[prometheus]
name=prometheus
baseurl=https://packagecloud.io/prometheus-rpm/release/el/$releasever/$basearch
repo_gpgcheck=0
enabled=1
gpgkey=https://packagecloud.io/prometheus-rpm/release/gpgkey
       https://raw.githubusercontent.com/lest/prometheus-rpm/master/RPM-GPG-KEY-prometheus-rpm
gpgcheck=0
EOF
[root@srv1 ~]# vim /etc/hosts
......
54.193.34.251    packagecloud.io
54.215.161.51    packagecloud.io
13.33.179.161    d28dx6y1hfq314.cloudfront.net
13.33.179.25     d28dx6y1hfq314.cloudfront.net
13.33.179.43     d28dx6y1hfq314.cloudfront.net
13.33.179.64     d28dx6y1hfq314.cloudfront.net

2) Install Prometheus
# Also install node_exporter to collect system resource metrics (such as CPU and memory usage)
[root@srv1 ~]# yum install prometheus2 node_exporter -y
3) Configure Prometheus
[root@srv1 ~]# vim /etc/prometheus/prometheus.yml
# Go to the very bottom and append the following
......
......
    static_configs:
    - targets: ['srv1.1000y.cloud:9090']

  # Gather this machine's statistics via node_exporter
  - job_name: srv1
    static_configs:
    - targets: ['srv1.1000y.cloud:9100']
[root@srv1 ~]# systemctl enable --now prometheus node_exporter
4) Configure the firewall
[root@srv1 ~]# firewall-cmd --add-service=prometheus --permanent
success
[root@srv1 ~]# firewall-cmd --reload
success
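# Optionally, before opening the web UI, confirm from the shell that both targets are being
# scraped. This is a minimal check against the /api/v1/targets endpoint; the grep pattern is
# just one way to pick out the health fields.
[root@srv1 ~]# curl -s http://srv1.1000y.cloud:9090/api/v1/targets | grep -o '"health":"[a-z]*"'
# expect one "health":"up" per target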
1.2 Accessing Prometheus
1) Access Prometheus
[Browser]===>http://srv1.1000y.cloud:9090
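# Once the UI loads, run a query in the expression browser to confirm node_exporter data is
# arriving. Two example PromQL expressions (a sketch; metric names assume node_exporter 0.16+):
# Available memory in bytes on each node
node_memory_MemAvailable_bytes
# Per-instance CPU busy percentage over the last 5 minutes
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))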

2. Monitoring Target Hosts
2.1 Configuring and Installing the Prometheus Repos
1) Set up the Prometheus repo
[root@srv2 ~]# cat > /etc/yum.repos.d/prometheus.repo <<'EOF' 
[prometheus]
name=prometheus
baseurl=https://packagecloud.io/prometheus-rpm/release/el/$releasever/$basearch
repo_gpgcheck=0
enabled=1
gpgkey=https://packagecloud.io/prometheus-rpm/release/gpgkey
       https://raw.githubusercontent.com/lest/prometheus-rpm/master/RPM-GPG-KEY-prometheus-rpm
gpgcheck=0
EOF
[root@srv2 ~]# vim /etc/hosts
......
54.193.34.251    packagecloud.io
54.215.161.51    packagecloud.io
13.33.179.161    d28dx6y1hfq314.cloudfront.net
13.33.179.25     d28dx6y1hfq314.cloudfront.net
13.33.179.43     d28dx6y1hfq314.cloudfront.net
13.33.179.64     d28dx6y1hfq314.cloudfront.net

2) Install node_exporter
[root@srv2 ~]# yum install node_exporter -y
[root@srv2 ~]# systemctl enable --now node_exporter
3) Configure the firewall
# Note: if your firewalld's 'prometheus' service only opens 9090/tcp, also open 9100/tcp
# so that srv1 can reach node_exporter
[root@srv2 ~]# firewall-cmd --add-service=prometheus --permanent
success
[root@srv2 ~]# firewall-cmd --reload
success
4) On srv1, add the new node
[root@srv1 ~]# vim /etc/prometheus/prometheus.yml
# Go to the very bottom and append the following
......
......
  - job_name: node
    static_configs:
    - targets: ['localhost:9100', 'srv2.1000y.cloud:9100']
......
......
# If you prefer to give the newly added host its own job name, configure it like this instead
......
......
  - job_name: node
    static_configs:
    - targets: ['localhost:9100']
  - job_name: srv2
    static_configs:
    - targets: ['srv2.1000y.cloud:9100']
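# Before restarting, the edited file can be validated; the promtool utility should be
# available from the prometheus2 package (an optional sanity check):
[root@srv1 ~]# promtool check config /etc/prometheus/prometheus.yml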
[root@srv1 ~]# systemctl restart prometheus
2.2 Accessing Prometheus
1) Access Prometheus (both nodes under the same job_name)
[Browser]===>http://srv1.1000y.cloud:9090

2) Access Prometheus (the two nodes under different job_names)

3. Setting Up Alerts via Email
3.1 Installing and Configuring Postfix
1) Configure Postfix
[root@srv1 ~]# vim /etc/postfix/main.cf
# Uncomment line 113
inet_interfaces = all
# Comment out line 116
#inet_interfaces = localhost
# On line 183, append the variable $mydomain at the end
mydestination = $myhostname, localhost.$mydomain, localhost, $mydomain
# On line 267, add the local network
mynetworks = 192.168.10.0/24, 127.0.0.0/8
2) Restart Postfix
[root@srv1 ~]# systemctl restart postfix
3) Configure the firewall
[root@srv1 ~]# firewall-cmd --add-service=smtp --permanent
success
[root@srv1 ~]# firewall-cmd --reload
success
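# Optionally, confirm that Postfix accepts local mail before wiring it into Alertmanager;
# a quick test with the mailx client (also used for reading alerts later in this section).
# The subject line is arbitrary.
[root@srv1 ~]# echo "postfix test" | mail -s "test from srv1" root@localhost
[root@srv1 ~]# mail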
3.2 Installing and Configuring Alertmanager
1) Install Alertmanager
[root@srv1 ~]# yum install alertmanager -y
2) Configure Alertmanager
[root@srv1 ~]# mv /etc/prometheus/alertmanager.yml /etc/prometheus/alertmanager.yml.bak
[root@srv1 ~]# vim /etc/prometheus/alertmanager.yml
global:
  # Define the SMTP smarthost
  smtp_smarthost: 'localhost:25'
  # Do not use TLS
  smtp_require_tls: false
  # Define the sender address
  smtp_from: 'Alertmanager <root@srv1.1000y.cloud>'
  # If your SMTP server requires authentication, define the credentials here (kept commented in this example)
  # smtp_auth_username: 'alertmanager'
  # smtp_auth_password: 'password'
route:
  # Receiver to send notifications to (defined below)
  receiver: 'email-notice'
  # Define the grouping labels
  group_by: ['alertname', 'Service', 'Stage', 'Role']
  # Initial wait before sending the first notification for a group (default 30s);
  # the wait lets more alerts of the same kind be merged into one mail
  group_wait: 30s
  # How long to wait before sending a notification about new alerts added to a group; usually 5m or more
  group_interval: 5m
  # How long to wait before repeating a notification that has already been sent; usually 3 hours or more
  repeat_interval: 4h
receivers:
  # Receiver name (can be chosen freely)
  - name: 'email-notice'
    email_configs:
      # Recipient address
      - to: "root@localhost"
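# The finished file can be validated with amtool, which ships alongside Alertmanager
# (a sketch; adjust the path if your package places the config elsewhere):
[root@srv1 ~]# amtool check-config /etc/prometheus/alertmanager.yml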

3) Define an alert rule --- this example monitors whether node_exporter is up or down
[root@srv1 ~]# vim /etc/prometheus/alert_rules.yml
groups:
- name: Instances
  rules:
  # Alert name
  - alert: InstanceDown
    # up == 0 means the target is down
    expr: up == 0
    # Duration: the alert fires only after the condition has held for 5 minutes; 0 would fire immediately
    for: 5m
    # Severity level of this alert rule
    labels:
      severity: critical
    annotations:
      description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
      summary: 'Instance {{ $labels.instance }} down'
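# The rule file can likewise be checked with promtool before Prometheus loads it:
[root@srv1 ~]# promtool check rules /etc/prometheus/alert_rules.yml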
[root@srv1 ~]# vim /etc/prometheus/prometheus.yml
......
......
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # Uncomment line 12 and point it at the local Alertmanager
      - 'localhost:9093'
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  # On line 19, add the alert rules file created above
  - "alert_rules.yml"
...... ......
[root@srv1 ~]# systemctl restart prometheus
[root@srv1 ~]# systemctl enable --now alertmanager
4) Test --- read the alert mail with mailx
[root@srv2 ~]# systemctl stop node_exporter
# The alert notification arrives
[root@srv1 ~]# mail
Heirloom Mail version 12.5 7/5/10.  Type ? for help.
"/var/spool/mail/root": 1 message 1 new
>N  1 Alertmanager          Tue Jul 21 16:24 189/10272 "[FIRING:1] InstanceDown (srv2.1000y.cloud:9100 srv2 critical)"
& 1
Message  1:
From root@srv1.1000y.cloud  Tue Jul 21 01:26:39 2020
Return-Path: <root@srv1.1000y.cloud>
X-Original-To: root@localhost
Delivered-To: root@localhost
Subject: [FIRING:1] InstanceDown (srv2.1000y.cloud:9100 srv2 critical)
To: root@localhost
From: Alertmanager <root@srv1.1000y.cloud>
Date: Tue, 21 Jul 2020 01:26:38 +0800
Content-Type: multipart/alternative; boundary=9eeff1f44d60cbf4257b4023aa2410a3ddd97f7acbbfd60383f749299a6e
Status: RO
Part 1:
Content-Type: text/html; charset=UTF-8
......
......
& q
[root@srv1 ~]#
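# The firing alert can also be inspected through Alertmanager's own HTTP API instead of the
# mailbox; a sketch, assuming an Alertmanager version that serves the v2 API (0.16 or later):
[root@srv1 ~]# curl -s http://localhost:9093/api/v2/alerts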
4. Deleting Data
1) Deleting time-series data requires Prometheus version 2.1 or later
2) The deletion API endpoint is: http://<prometheus>/api/v1/admin/tsdb/delete_series
3) The Admin API must be enabled before the deletion API can be used
[root@srv1 ~]# vim /etc/default/prometheus
# Add the --web.enable-admin-api option at the end of this line
PROMETHEUS_OPTS='--config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus/data --web.enable-admin-api'
[root@srv1 ~]# systemctl restart prometheus
4) Deletion examples
(1) Delete data whose [job] label matches [Blackbox_tcp]
[root@srv1 ~]# curl -X POST -g 'http://srv1.1000y.cloud:9090/api/v1/admin/tsdb/delete_series?match[]={job="Blackbox_tcp"}'
(2) Delete data whose [instance] label matches [srv2.1000y.cloud]
[root@srv1 ~]# curl -X POST -g 'http://srv1.1000y.cloud:9090/api/v1/admin/tsdb/delete_series?match[]={instance="srv2.1000y.cloud"}'
(3) Delete all data
[root@srv1 ~]# curl -X POST -g 'http://srv1.1000y.cloud:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".+"}'
(4) Deleted data still remains on disk; to actually free the space, run the following
[root@srv1 ~]# curl -X POST -g 'http://srv1.1000y.cloud:9090/api/v1/admin/tsdb/clean_tombstones'
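# To confirm that a deletion took effect, issue an instant query for the affected series;
# for example, after step (2) the following should return an empty result list (the query
# endpoint and -g flag usage mirror the commands above):
[root@srv1 ~]# curl -g 'http://srv1.1000y.cloud:9090/api/v1/query?query=up{instance="srv2.1000y.cloud"}'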
5. Visualizing with Grafana
5.1 Installing Grafana
1) The Grafana web panel
(1) Install the Grafana web panel from a yum repository
[root@srv1 ~]# cat > /etc/yum.repos.d/grafana.repo <<'EOF'
[grafana]
name=grafana
baseurl=https://mirror.tuna.tsinghua.edu.cn/grafana/yum/el7
gpgkey=https://packagecloud.io/gpg.key https://grafanarel.s3.amazonaws.com/RPM-GPG-KEY-grafana
enabled=0
gpgcheck=0
EOF
[root@srv1 ~]# yum --enablerepo=grafana install grafana initscripts fontconfig -y
(2) Install the latest version of Grafana --- recommended
[Browser]===>https://grafana.com/grafana/download===>check the latest version
[root@srv1 ~]# wget https://dl.grafana.com/oss/release/grafana-7.1.1-1.x86_64.rpm
[root@srv1 ~]# yum install grafana-7.1.1-1.x86_64.rpm
2) Configure Grafana
[root@srv1 ~]# vim /etc/grafana/grafana.ini
# On line 32, uncomment and set the protocol to use (http/https)
protocol = http
# On line 38, set the listening port
http_port = 3000
# On line 41, set this host's FQDN
domain = srv1.1000y.cloud
[root@srv1 ~]# systemctl enable --now grafana-server
3) Configure the firewall
[root@srv1 ~]# firewall-cmd --add-port=3000/tcp --permanent
success
[root@srv1 ~]# firewall-cmd --reload
success
5.2 Accessing Grafana
1) Access Grafana
[Browser]====>http://srv1.1000y.cloud:3000===>username: admin=====>password: admin

5.3 Adding Prometheus Data to Grafana and Verifying
[Browser]====>http://srv1.1000y.cloud:3000
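# The screenshots for this step are not reproduced here. As an alternative to the web UI,
# the data source can also be added through Grafana's HTTP API; a minimal sketch, assuming
# the default admin/admin credentials from above:
[root@srv1 ~]# curl -s -X POST -u admin:admin \
        -H "Content-Type: application/json" \
        -d '{"name":"Prometheus","type":"prometheus","url":"http://srv1.1000y.cloud:9090","access":"proxy","isDefault":true}' \
        http://srv1.1000y.cloud:3000/api/datasources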

# More dashboards can be found at https://grafana.com/grafana/dashboards
# The import steps are as follows --- the Grafana version shown is 7.1.1

6. Deploying the Blackbox Exporter
6.1 Installing and Configuring the Blackbox Exporter
1) Install blackbox_exporter on a monitored node
[root@srv2 ~]# yum install blackbox_exporter -y
2) Location of the blackbox_exporter configuration file
[root@srv2 ~]# cat /etc/prometheus/blackbox.yml
......
......
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
3) Start the blackbox_exporter service
[root@srv2 ~]# systemctl enable --now blackbox_exporter
4) Configure the firewall
[root@srv2 ~]# firewall-cmd --add-port=9115/tcp --permanent
success
[root@srv2 ~]# firewall-cmd --reload
success
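# Before touching the Prometheus server, you can check that the exporter answers probe
# requests; a quick test of the ssh_banner module against the local sshd
# (probe_success 1 indicates success):
[root@srv2 ~]# curl -s 'http://srv2.1000y.cloud:9115/probe?module=ssh_banner&target=srv2.1000y.cloud:22' | grep probe_success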
5) Configure the Prometheus server
[root@srv1 ~]# vim /etc/prometheus/prometheus.yml
......
......
# Append the following at the very end of the file
  # Use the icmp module
  - job_name: 'Blackbox_icmp'
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
        - srv2.1000y.cloud
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        # The blackbox exporter's (prober's) Host:Port
        replacement: srv2.1000y.cloud:9115
  # Use the ssh_banner module
  - job_name: 'Blackbox_ssh'
    metrics_path: /probe
    params:
      module: [ssh_banner]
    static_configs:
      - targets:
        - srv2.1000y.cloud:22
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: srv2.1000y.cloud:9115
  # Use the tcp_connect module
  - job_name: 'Blackbox_tcp'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
        - srv2.1000y.cloud:3306
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: srv2.1000y.cloud:9115

[root@srv1 ~]# systemctl restart prometheus
6.2 Using the Blackbox Exporter
[Browser]===>http://srv1.1000y.cloud:9090

Check the data for the [probe_success] metric: [1] means the probe succeeded, [0] means it failed.


In this example MariaDB is not installed, so the tcp_connect probe shows [0], i.e. failure.
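# A natural follow-up (not part of the original setup above) is to alert on failed probes,
# just as InstanceDown alerts on dead exporters; a sketch that could be appended to the
# rules in /etc/prometheus/alert_rules.yml:
  # Fires when any blackbox probe has been failing for 5 minutes
  - alert: ProbeFailed
    expr: probe_success == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: 'Probe {{ $labels.instance }} failed'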