A Hands-On Guide to Sending Prometheus Alerts Through WeChat Work

Author: 运维之美 2023-11-24 16:57:53

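Prometheus is often described as the next generation of monitoring, and it handles monitoring for Kubernetes clusters in the cloud. Deployed together with Alertmanager, it can send alerts in real time by email or webhook. This article sends alerts through WeChat Work (企业微信), so the ops engineer can finally put their feet up.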

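1. Prometheus architecture

Components

Prometheus Server: collects metrics, stores them as time series, and provides a query interface
PushGateway: short-term storage for metrics, mainly used by short-lived jobs
Exporters: the source of monitoring data. They collect metrics from existing third-party services and expose them as metrics endpoints. Common choices are node-exporter for hosts and mysql-exporter for databases, installed as needed. Prometheus Server collects data from exporters by pulling.
Alertmanager: receives the alerts raised by Prometheus Server and sends them out by SMS, email, and similar channels
Web UI: a simple web console. For dashboards, install Grafana and configure Prometheus as a data source.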
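
To make the pull model more concrete, here is a minimal sketch of what Prometheus Server does with an exporter's response: it fetches plain text in the exposition format and reads one sample per line. The sample text and the parse_metrics helper are illustrative assumptions, and the parser is deliberately simplified (it assumes no timestamps on the lines).

```python
# Sample text in the Prometheus exposition format, as an exporter might serve it.
# The metric values here are made up for illustration.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_filesystem_free_bytes{mountpoint="/",fstype="xfs"} 1.5e+10
"""


def parse_metrics(text):
    """Return a list of (series, value) pairs, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Simplification: the value is taken as the last whitespace-separated field.
        series, value = line.rsplit(None, 1)
        samples.append((series, float(value)))
    return samples


samples = parse_metrics(SAMPLE)
for series, value in samples:
    print(series, value)
```

A real scraper also handles timestamps, escaping, and histogram types; this only shows the shape of the data an exporter exposes.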

Prerequisites: deploy Prometheus, Grafana, and node-exporter in advance. They are not covered in detail here.

### Deploy Prometheus
docker run -d --name=prometheus -p 9090:9090 prom/prometheus
# The configuration file can be …
Access URL: http://IP:9090
### Deploy Grafana
docker run -d --name=grafana -p 3000:3000 grafana/grafana
Access URL: http://IP:3000
### Deploy node-exporter ###
wget https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz
tar xvfz node_exporter-*.*-amd64.tar.gz
cd node_exporter-*.*-amd64
./node_exporter

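2. Prerequisites

Environment: the Prometheus server and Alertmanager run on the same machine. This walkthrough assumes the Prometheus server is already installed.

Operating system: CentOS 7.4

Alert handling in Prometheus has two parts. Alerting rules are defined on the Prometheus server, which pulls metrics from the exporters. When a metric crosses its alert threshold, the server hands the alert to Alertmanager, which manages it (silencing, inhibition, aggregation) and sends notifications by email, WeChat Work, DingTalk, and similar channels.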

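The main steps for setting up alerts and notifications are:

Deploy Prometheus on one machine [skipped in this article]
Deploy node-exporter on every node to be monitored, similar to an agent [skipped in this article]
Install and start Alertmanager on the same node as Prometheus
Configure Prometheus to reach Alertmanager, and configure the alerting rules
Configure the WeChat Work admin console, connect Alertmanager to WeChat Work, and configure the alert template
Change a threshold to trigger an alert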

The prerequisites can also be deployed from offline packages.

### Deploy Prometheus
# Create the docker-compose.yml for Prometheus
services:
  prometheus:
    command:
    - --web.listen-address=0.0.0.0:9090
    - --config.file=/etc/prometheus/prometheus.yml
    - --storage.tsdb.path=/var/lib/prometheus
    - --storage.tsdb.retention.time=30d
    - --web.enable-lifecycle
    - --web.external-url=prometheus
    - --web.enable-admin-api
    container_name: prometheus
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 8g
    hostname: prometheus
    image: prom/prometheus
    labels:
    - docker-compose-reset=true
    - midware-group=monitor
    network_mode: host
    restart: always
    volumes:
    - /usr/share/zoneinfo/Hongkong:/etc/localtime
    - /data/prometheus/data:/var/lib/prometheus
    - /data/prometheus/config:/etc/prometheus
    working_dir: /var/lib/prometheus
version: '3'
# Run docker-compose up -d to start the Prometheus service
### Deploy Grafana
docker run -d --name=grafana -p 3000:3000 grafana/grafana
Access URL: http://IP:3000
### Deploy node-exporter ###
wget https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz
tar xvfz node_exporter-*.*-amd64.tar.gz
cd node_exporter-*.*-amd64
./node_exporter

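3. Install Alertmanager

This example uses the latest release from the official site. Download the Alertmanager package from https://prometheus.io/download/

Upload the package to the server, then install and start the Alertmanager service with the following steps.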

[root@localhost ~]# mkdir -p /data/alertmanager
[root@localhost ~]# tar -xvf alertmanager-0.22.2.linux-amd64.tar.gz -C /data/alertmanager --strip-components=1
[root@localhost ~]# cd /data/alertmanager/
[root@localhost alertmanager]# nohup ./alertmanager &

4. Configure Prometheus alerting rules

Add configuration in Prometheus so that it monitors the Alertmanager server.

Add the following to prometheus.yml:

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 192.168.61.123:9093
rule_files:
  - "rules/*_rules.yml"
  - "rules/*_alerts.yml"
scrape_configs:
  - job_name: 'alertmanager' # scrape Alertmanager; add this once Alertmanager is deployed
    static_configs:
    - targets: ['localhost:9093']
  - job_name: 'node_exporter'   # scrape node-exporter
    static_configs:
    - targets: ['192.168.61.123:9100']

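rule_files lists the rule files that define when alerts fire.

Create a rules directory under the current Prometheus path, then create the following files in it: one for node alerts and one for pod (container) alerts.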

[root@prometheus prometheus]# cd rules/
[root@prometheus rules]# ls
node_alerts.yml  pod_rules.yml
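
As a quick sanity check on the two rule_files globs, the sketch below tests which file names they would pick up. It uses Python's fnmatch as a stand-in for the glob matching Prometheus performs, which is an assumption for illustration; the third file name is hypothetical.

```python
from fnmatch import fnmatch

# The two patterns from rule_files in prometheus.yml.
RULE_FILE_PATTERNS = ["rules/*_rules.yml", "rules/*_alerts.yml"]


def is_loaded(path, patterns=RULE_FILE_PATTERNS):
    """Return True if `path` matches at least one rule_files pattern."""
    return any(fnmatch(path, pattern) for pattern in patterns)


# node.yml is a hypothetical file that matches neither pattern.
for path in ["rules/node_alerts.yml", "rules/pod_rules.yml", "rules/node.yml"]:
    print(path, is_loaded(path))
```

A rule file whose name ends in neither _rules.yml nor _alerts.yml would be ignored without any error, which is worth remembering when a new rule does not show up.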

Node alerts

node_alerts.yml  # host-level alerts

[root@localhost rules]# cat node_alerts.yml
groups:
- name: host-status-alerts
  rules:
  - alert: HostStatus
    expr: up{job="node_exporter"} == 0
    for: 15s
    labels:
      status: critical
    annotations:
      summary: "{{ $labels.instance }}: server is down"
      description: "{{ $labels.instance }}: server has been unreachable for more than 15s"
  - alert: CPUUsage
    expr: 100-(avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100) > 60
    for: 1m
    labels:
      status: warning
    annotations:
      summary: "{{$labels.instance}}: High CPU Usage Detected"
      description: "{{$labels.instance}}: CPU usage is {{$value}}, above 60%"

  - alert: NodeFilesystemUsage
    expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }}: usage of partition {{ $labels.mountpoint }} is too high"
      description: "{{ $labels.instance }}: partition {{ $labels.mountpoint }} is more than 80% used (current value: {{ $value }})"
  - alert: MemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{ $labels.instance }}: memory usage is too high"
      description: "{{ $labels.instance }}: memory usage is above 80% (currently {{ $value }}%)"
  - alert: IOPerformance
    expr: (avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) > 60
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{ $labels.instance }}: disk I/O utilization is too high"
      description: "{{ $labels.instance }}: disk I/O utilization is above 60% (currently {{ $value }})"
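
The arithmetic behind the memory and CPU expressions can be checked by hand. The sketch below mirrors the two formulas with made-up input values (a 16 GiB host with 2 GiB available, and two cores with illustrative idle rates); the function names are hypothetical.

```python
def memory_usage_percent(mem_total_bytes, mem_available_bytes):
    """Mirror (MemTotal - MemAvailable) / MemTotal * 100 from the memory rule."""
    return (mem_total_bytes - mem_available_bytes) / mem_total_bytes * 100


def cpu_usage_percent(idle_rates):
    """Mirror 100 - avg(idle rate) * 100; each rate is idle seconds per second for one core."""
    return 100 - sum(idle_rates) / len(idle_rates) * 100


GIB = 1024 ** 3
mem = memory_usage_percent(16 * GIB, 2 * GIB)  # 14 of 16 GiB in use
cpu = cpu_usage_percent([0.125, 0.375])        # two cores, 12.5% and 37.5% idle

print(f"memory {mem:.1f}% -> above 80% threshold: {mem > 80}")
print(f"cpu {cpu:.1f}% -> above 60% threshold: {cpu > 60}")
```

Both sample values cross their thresholds, so each rule would enter the pending state and fire once its for duration has elapsed.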

Pod alert configuration

pod_rules.yml  # pod-level alerts

[root@localhost rules]# cat pod_rules.yml
groups:
- name: k8s_pod.rules
  rules:
  - alert: pod-status
    expr: kube_pod_container_status_running != 1
    for: 5s
    labels:
      severity: warning
    annotations:
      description: pod {{ $labels.pod }} is not running
      summary: Pod restart alert
  - alert: Pod_all_cpu_usage
    expr: (sum by(name)(rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 10
    for: 5m
    labels:
      severity: critical
      service: pods
    annotations:
      description: Container {{ $labels.name }} CPU utilization is above 10% (current value is {{ $value }})
      summary: Dev CPU load alert
  - alert: Pod_all_memory_usage
    expr: avg by(name)(container_memory_usage_bytes{name!=""}) > 2*1024^3
    for: 10m
    labels:
      severity: critical
    annotations:
      description: Container {{ $labels.name }} memory usage is above 2G (current value is {{ $value }})
      summary: Dev memory load alert
  - alert: Pod_all_network_receive_usage
    expr: sum by (name)(irate(container_network_receive_bytes_total{container_name="POD"}[1m])) > 1024*1024*50
    for: 10m
    labels:
      severity: critical
    annotations:
      description: Container {{ $labels.name }} network_receive rate is above 50M (current value is {{ $value }})
      summary: network_receive load alert

More alerting rules [may require a proxy to reach]

https://samber.github.io/awesome-prometheus-alerts/rules

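The for clause: Prometheus uses the rule in expr as the trigger condition. When for is set, Prometheus checks on every evaluation that the alert is still active before it fires the alert. An alert that is active but has not yet fired is in the pending state. The duration in for is how long the condition must hold continuously before the alert fires.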
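
The behavior of the for clause can be pictured as a small state machine. The following is a simplified sketch, not Prometheus's actual implementation: it assumes a fixed evaluation interval and a list of booleans saying whether expr returned a result at each evaluation. The function and parameter names are hypothetical.

```python
def alert_states(condition_by_eval, for_seconds, interval_seconds):
    """Return the alert state after each rule evaluation."""
    states = []
    active_for = None  # seconds the condition has held continuously; None when inactive
    for condition in condition_by_eval:
        if not condition:
            # Any evaluation where the condition is false resets the timer.
            active_for = None
            states.append("inactive")
            continue
        active_for = 0 if active_for is None else active_for + interval_seconds
        states.append("firing" if active_for >= for_seconds else "pending")
    return states


# Condition evaluated every 15s; the rule uses for: 1m (60s).
history = [False, True, True, True, True, True, False]
states = alert_states(history, for_seconds=60, interval_seconds=15)
print(states)
```

The alert stays pending for four evaluations and only fires on the fifth consecutive true result, 60 seconds after it first became active.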

After adding the configuration, hot-reload the Prometheus service:

curl -XPOST http://localhost:9090/-/reload

Note: hot reload only works if Prometheus was started with the --web.enable-lifecycle flag.
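
If the reload needs to be scripted rather than run by hand, the same POST can be built with Python's standard library. This is a minimal sketch; the build_reload_request helper is a hypothetical name, and the actual network call is left commented out so the snippet runs without a server.

```python
import urllib.request


def build_reload_request(base_url):
    """Build (but do not send) the POST request for the /-/reload endpoint."""
    return urllib.request.Request(base_url.rstrip("/") + "/-/reload", method="POST")


req = build_reload_request("http://localhost:9090")
print(req.get_method(), req.full_url)

# To trigger the reload against a running server:
# urllib.request.urlopen(req, timeout=5)
```

The same helper works for Alertmanager by passing its base URL on port 9093.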

$ ./promtool check config prometheus.yml 
Checking prometheus.yml
  SUCCESS: 0 rule files found

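The command above checks whether the configuration file is valid.

On the Prometheus targets page, Alertmanager now appears as a monitored target.

Check that the Prometheus alerting rules have taken effect.

The node and pod rules have both been loaded. Perfect, one step closer to success.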

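5. Configure Alertmanager notifications

To send alerts through WeChat Work, first create an application in the WeChat Work admin console and name it prometheus.

Record the corp ID, secret, and agentid. The configuration file below needs them.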

[root@localhost alertmanager]# cat alertmanager.yml
global:
  resolve_timeout: 1m   # mark an alert as resolved after 1m if it carries no EndsAt
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_corp_id: 'xxxxxxxxx'      # corp ID from WeChat Work
  wechat_api_secret: 'xxxxxxxx'
templates:
  - '/data/alertmanager/template/*.tmpl'
route:
  receiver: 'wechat'
  group_by: ['env','instance','type','group','job','alertname']
  group_wait: 10s
  group_interval: 5s
  repeat_interval: 1h

receivers:
- name: 'wechat'
  wechat_configs:
  - send_resolved: true
    message: '{{ template "wechat.default.message" . }}'
    to_party: '57'
    agent_id: 'xxxx'   # agentid from the WeChat Work admin console
    to_user : "@all"
    api_secret: 'xxxxxxx'  # secret from the admin console
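
The effect of group_by in the route can be sketched in a few lines. This is only an illustration of the idea, not Alertmanager's code: alerts whose group_by labels all have the same values land in the same group and are notified together. The sample alerts and their label values are made up.

```python
from collections import defaultdict

# The group_by list from the route section of alertmanager.yml.
GROUP_BY = ["env", "instance", "type", "group", "job", "alertname"]


def group_alerts(alerts, group_by=GROUP_BY):
    """Bucket alerts by the values of the group_by labels (a missing label counts as empty)."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical alerts, each represented by its label set.
alerts = [
    {"alertname": "NodeFilesystemUsage", "instance": "192.168.61.123:9100",
     "job": "node_exporter", "mountpoint": "/"},
    {"alertname": "NodeFilesystemUsage", "instance": "192.168.61.123:9100",
     "job": "node_exporter", "mountpoint": "/data"},
    {"alertname": "Pod_all_cpu_usage", "name": "demo-container"},
]

groups = group_alerts(alerts)
for key, members in groups.items():
    print(key, len(members))
```

The two filesystem alerts differ only in mountpoint, which is not in group_by, so they share one group and one notification. The pod alert forms a group of its own.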

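Notes

wechat_api_url is the WeChat Work API endpoint, so the server running Alertmanager must be able to reach the public internet.
to_user must be set. @all sends to every user in the application's visible range. Without this field the alert could not be sent in my own testing. Users who should receive alerts can be added to the visible range in the WeChat Work admin console.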
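
For orientation, the sketch below shows roughly what a client has to send to WeChat Work: a token request built from the corp ID and secret, then a JSON message body. Alertmanager does this internally. The endpoint paths and field names here reflect my understanding of the WeChat Work API and should be checked against its official documentation; the helper names and sample values are hypothetical, and nothing is sent over the network.

```python
import json
import urllib.parse

API_BASE = "https://qyapi.weixin.qq.com/cgi-bin/"  # same base URL as wechat_api_url


def token_url(corp_id, secret):
    """URL used to exchange the corp ID and application secret for an access token."""
    query = urllib.parse.urlencode({"corpid": corp_id, "corpsecret": secret})
    return f"{API_BASE}gettoken?{query}"


def text_message(agent_id, content, to_user="@all", to_party=""):
    """JSON body for a plain-text message (assumed shape of the message/send payload)."""
    return json.dumps({
        "touser": to_user,
        "toparty": to_party,
        "msgtype": "text",
        "agentid": agent_id,
        "text": {"content": content},
    }, ensure_ascii=False)


# Placeholder credentials and IDs, for illustration only.
print(token_url("CORP_ID", "SECRET"))
print(text_message("1000002", "disk usage above 80%", to_party="57"))
```

This also shows why to_user and the visible range matter: the recipients are part of the message body, and WeChat Work only delivers to users the application is allowed to see.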
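Field reference
global: global configuration
resolve_timeout: resolve timeout. If a received alert has no EndsAt field, it is marked as resolved after this period. This does not come into play with Prometheus, whose alerts always carry EndsAt.
route: alert routing configuration
group_by: labels to group by. Any label that appears on an alert can be used for grouping. To group by all labels, use '...'.
group_wait: how long to wait before sending a group's first notification; a longer wait lets more alerts aggregate
group_interval: interval between successive notifications for the same group
repeat_interval: interval before a repeated alert is sent again
receiver: the receiver that alerts are routed to
receivers: alert receivers; obtain this information as described in step 1
name: receiver name, matching receiver in route; here it is WeChat Work
corp_id: unique WeChat Work ID, found under My Company -> Company Information
to_party: the group that alerts are sent to
agent_id: ID of the application you created, shown on its details page
api_secret: secret of the application you created, shown on its details page
send_resolved: whether to send a notification when an alert is resolved
inhibit_rules: alert inhibition rules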

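When a new alert matches target_match, an already-sent alert matches source_match, and the labels listed in equal have identical values on both, the new alert is inhibited.

With a rule of this kind, for the same alertname on the same instance, a major alert inhibits a warning alert. This is easy to understand: for a threshold alert, a value that reaches critical has certainly also reached warning, so there is no need to send two alerts.

In actual testing, however, the inhibit rule only took effect when alerts fired, not for resolved notifications. This may be a bug, or the version I used may be too old. I will look at the source code when I have time.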
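
The matching logic described for inhibit_rules can be sketched as follows. This is a simplified model under stated assumptions, not Alertmanager's implementation: matchers are exact label equality only, and the rule shown (critical mutes warning for the same alertname and instance) is a hypothetical example since the article does not list its own rule.

```python
def is_inhibited(target, firing_alerts, source_match, target_match, equal):
    """Return True if `target` should be muted by some alert in `firing_alerts`."""
    def matches(labels, matcher):
        return all(labels.get(key) == value for key, value in matcher.items())

    if not matches(target, target_match):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, source_match):
            continue
        # Every label listed in `equal` must have the same value on both alerts.
        if all(source.get(key) == target.get(key) for key in equal):
            return True
    return False


# Hypothetical rule: critical mutes warning for the same alertname and instance.
rule = dict(
    source_match={"severity": "critical"},
    target_match={"severity": "warning"},
    equal=["alertname", "instance"],
)

critical = {"alertname": "NodeFilesystemUsage", "instance": "node1", "severity": "critical"}
warning = {"alertname": "NodeFilesystemUsage", "instance": "node1", "severity": "warning"}
other = {"alertname": "NodeFilesystemUsage", "instance": "node2", "severity": "warning"}

print(is_inhibited(warning, [critical], **rule))
print(is_inhibited(other, [critical], **rule))
```

The warning on node1 is muted because a critical alert with the same alertname and instance is firing; the warning on node2 is not, because its instance differs.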

templates: alert message templates

For the WeChat Work alert template, create a template directory under the current path.

[root@localhost alertmanager]# cat template/wechat.tmpl
{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
{{- if eq $index 0 }}
========= xxx environment monitoring alert =========
Alert status: {{   .Status }}
Severity: {{ .Labels.severity }}
Alert type: {{ $alert.Labels.alertname }}
Affected host: {{ $alert.Labels.instance }} {{ $alert.Labels.pod }}
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};
Trigger value: {{ .Annotations.value }}
Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
========= = end =  =========
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
{{- if eq $index 0 }}
========= xxx environment recovered =========
Alert type: {{ .Labels.alertname }}
Alert status: {{   .Status }}
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};
Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved at: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
Instance: {{ $alert.Labels.instance }}
{{- end }}
========= = end =  =========
{{- end }}
{{- end }}
{{- end }}
{{- end }}
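
The template's Add 28800e9 call adds 28800 seconds (8 hours, expressed in nanoseconds) to the UTC timestamps so that they display in UTC+8. The same arithmetic in Python, using an illustrative timestamp and a hypothetical helper name:

```python
from datetime import datetime, timedelta, timezone


def to_utc8_string(timestamp_utc):
    """Shift a UTC timestamp by 28800 s (8 h) and format it the way the template does."""
    shifted = timestamp_utc + timedelta(seconds=28800)  # 28800e9 ns in the Go template
    return shifted.strftime("%Y-%m-%d %H:%M:%S")


# Illustrative start time in UTC.
starts_at = datetime(2023, 11, 24, 8, 57, 53, tzinfo=timezone.utc)
formatted = to_utc8_string(starts_at)
print(formatted)
```

A fixed offset like this is fine for a time zone without daylight saving time; for other zones the offset would need to change during the year.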

After changing the configuration, hot-reload Alertmanager:

curl -XPOST http://localhost:9093/-/reload

Configuration is complete. Now adjust an alert threshold to test it.

Change the disk alert threshold in /usr/local/prometheus/rules/node_alerts.yml:

expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 10

With the threshold lowered to > 10, the alert shows up in the management UI almost immediately.
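
To see why lowering the threshold triggers the alert right away, the disk expression can be evaluated by hand. The sketch below uses made-up sizes (a 100 GiB filesystem with 75 GiB free) and a hypothetical function name.

```python
def filesystem_usage_percent(free_bytes, size_bytes):
    """Mirror 100 - (free / size * 100) from the NodeFilesystemUsage rule."""
    return 100 - (free_bytes / size_bytes * 100)


GIB = 1024 ** 3
usage = filesystem_usage_percent(free_bytes=75 * GIB, size_bytes=100 * GIB)
print(f"usage {usage:.0f}%: above 80? {usage > 80}; above 10? {usage > 10}")
```

A filesystem at 25% usage is well below the original 80% threshold but above the test threshold of 10%, so the rule starts matching as soon as the modified file is reloaded.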

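A note on alert states: a Prometheus alert is in one of three states, Inactive, Pending, or Firing.

Inactive: the rule is being evaluated, but no alert has been triggered.
Pending: the alert condition is true but has not yet held for the duration set in for. Once it has held for that long, the alert moves to Firing.
Firing: the alert is sent to Alertmanager, which delivers it to all receivers according to its configuration. Once the alert clears, the state returns to Inactive, and the cycle repeats.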

All done. Cue the applause!

Editor: 庞桂玉  Source: 运维之美
