Six Steps to Monitoring and Alerting for Cloud-Native Applications
Once a cloud-native system is up and running, establishing observability and alerting helps you understand how the whole system is behaving. Prometheus-based monitoring and alerting is the industry-standard solution for cloud-native systems, and every cloud-native practitioner should be familiar with it.
This article walks through the hands-on setup of monitoring and alerting for a cloud-native application, using a Spring Boot app as the example; theory is kept to a minimum. Once you have the practical steps down, filling in the theory afterwards makes the whole system much easier to understand.
1. Choosing the Monitoring and Alerting Stack
A Kubernetes cluster is complex to monitor: there are container-level resource metrics, cluster Node metrics, metrics from the business applications running in the cluster, and more. Faced with this volume of metrics, a traditional monitoring tool such as Zabbix offers poor support for cloud-native environments.
A solution better suited to cloud native is Prometheus. Prometheus grew up alongside cloud native and has become the de facto standard for monitoring in the cloud-native ecosystem. The following sections build a Prometheus-based monitoring and alerting stack step by step.
Prometheus works on a pull model: it actively scrapes metrics from **the monitored systems**, stores them in its own time-series database, and then either renders them in dashboards or fires alerts according to alerting rules. Each monitored system must expose an endpoint for Prometheus to scrape.
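As a concrete picture of the pull model: a scrape target is nothing more than an HTTP endpoint serving metrics in Prometheus's plain-text exposition format. A minimal illustration (these metric names are generic examples, not ones from this article's setup):

# What a target's metrics endpoint returns: plain text, one sample per line
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 42.7
# HELP http_requests_total Total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027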
2. Prerequisites
This article assumes Docker and Kubernetes are already installed, and that a Spring Boot application is already deployed in the K8s cluster.
Assume the cluster has four nodes: k8s-master (10.20.1.21), k8s-worker-1 (10.20.1.22), k8s-worker-2 (10.20.1.23), and k8s-worker-3 (10.20.1.24).
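Before starting, it is worth confirming the cluster is healthy; the node names and IPs above are the ones assumed throughout this article:

# All four nodes should report Ready
kubectl get nodes -o wide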
3. Installing Prometheus
3.1. Create a namespace on the k8s-master node
kubectl create ns monitoring
3.2. Prepare the ConfigMap file
Prepare the ConfigMap file prometheus-config.yaml. For now the YAML only configures a scrape job for Prometheus's own metrics; we will keep extending this file below:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']
3.3. Create the ConfigMap
kubectl apply -f prometheus-config.yaml
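A quick check that the ConfigMap landed where expected:

kubectl get configmap prometheus-config -n monitoring
kubectl describe configmap prometheus-config -n monitoring   # prints the embedded prometheus.yml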
3.4. Prepare the Prometheus Deployment file
Prepare the Prometheus Deployment file prometheus-deploy.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - image: prom/prometheus:v2.31.1
        name: prometheus
        securityContext:
          runAsUser: 0
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"   # TSDB data path
        - "--storage.tsdb.retention.time=24h"
        - "--web.enable-admin-api"   # enables the admin HTTP API, including features such as deleting time series
        - "--web.enable-lifecycle"   # enables hot reload: a POST to localhost:9090/-/reload takes effect immediately
        ports:
        - containerPort: 9090
          name: http
        volumeMounts:
        - mountPath: "/etc/prometheus"
          name: config-volume
        - mountPath: "/prometheus"
          name: data
        resources:
          requests:
            cpu: 200m
            memory: 1024Mi
          limits:
            cpu: 200m
            memory: 1024Mi
      - image: jimmidyson/configmap-reload:v0.4.0   # hot-reloads Prometheus whenever the config changes
        name: prometheus-reload
        securityContext:
          runAsUser: 0
        args:
        - "--volume-dir=/etc/config"
        - "--webhook-url=http://localhost:9090/-/reload"
        volumeMounts:
        - mountPath: "/etc/config"
          name: config-volume
        resources:
          requests:
            cpu: 100m
            memory: 50Mi
          limits:
            cpu: 100m
            memory: 50Mi
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: prometheus-data
      - configMap:
          name: prometheus-config
        name: config-volume
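Two of the args above matter for day-2 operation: --web.enable-lifecycle enables hot reload via an HTTP endpoint, and the configmap-reload sidecar watches the mounted ConfigMap and calls that endpoint automatically whenever the config changes. Once the NodePort Service from step 3.10 exists, you can also trigger a reload by hand (the node IP and port below are the assumed values from this walkthrough):

# Hot-reload the config without restarting the pod (requires --web.enable-lifecycle)
curl -X POST http://10.20.1.21:32459/-/reload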
3.5. Prepare the Prometheus storage file
Prepare the Prometheus storage file prometheus-storage.yaml:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-local
  labels:
    app: prometheus
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 20Gi
  storageClassName: local-storage
  local:
    path: /data/k8s/prometheus   # make sure this directory exists on the node
  persistentVolumeReclaimPolicy: Retain
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - k8s-worker-2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: prometheus
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: local-storage
Here I use the k8s-worker-2 node to back the storage; change the node name to one of your own, and create the directory /data/k8s/prometheus on that node. The time-series database will ultimately store its data in this directory.
The YAML above uses PV, PVC, and StorageClass, which I will cover in a separate article. In short, they are the components Kubernetes uses to provision and bind storage for pods.
3.6. Create the storage resources
kubectl apply -f prometheus-storage.yaml
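Check that the objects were created; note that with volumeBindingMode: WaitForFirstConsumer the PVC stays Pending until the Prometheus pod is actually scheduled:

kubectl get sc,pv
kubectl get pvc -n monitoring   # prometheus-data becomes Bound once the pod lands on k8s-worker-2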
3.7. Prepare the ServiceAccount, role, and permission files
Prepare the RBAC file prometheus-rbac.yaml:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - "extensions"
  resources:
  - ingresses
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
3.8. Create the RBAC resources
kubectl apply -f prometheus-rbac.yaml
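You can verify that the ServiceAccount received the permissions Prometheus needs:

# Both commands should print "yes"
kubectl auth can-i list nodes --as=system:serviceaccount:monitoring:prometheus
kubectl auth can-i watch pods --as=system:serviceaccount:monitoring:prometheus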
3.9. Create the Deployment
kubectl apply -f prometheus-deploy.yaml
3.10. Prepare the Service object file
Prepare the Service object file prometheus-svc.yaml. It uses NodePort so Prometheus can be reached from outside the cluster:
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  selector:
    app: prometheus
  type: NodePort
  ports:
  - name: web
    port: 9090
    targetPort: http
3.11. Create the Service object:
kubectl apply -f prometheus-svc.yaml
3.12. Access Prometheus
Run kubectl get svc -n monitoring to find the node port that was assigned; Prometheus is then reachable at any cluster node's IP plus that port, e.g. http://10.20.1.21:32459/. On the web UI's Targets page you can see the scrape target configured in prometheus-config.yaml above.
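The Service listing looks roughly like this (the node port is assigned randomly from the NodePort range; 32459 is simply the value in this walkthrough, and the cluster IP is illustrative):

kubectl get svc -n monitoring
# NAME         TYPE       CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
# prometheus   NodePort   10.96.45.12   <none>        9090:32459/TCP   2m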
That completes the Prometheus installation; next, Grafana.
4. Installing Grafana
Prometheus's built-in graphing is fairly limited, so Grafana is the usual choice for visualizing Prometheus data. Let's install it.
4.1. Prepare the Grafana deployment file
Prepare the Grafana deployment file grafana-deploy.yaml. It is an all-in-one file that bundles the Deployment, Service, PV, and PVC manifests:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      volumes:
      - name: storage
        persistentVolumeClaim:
          claimName: grafana-data
      containers:
      - name: grafana
        image: grafana/grafana:8.3.3
        imagePullPolicy: IfNotPresent
        securityContext:
          runAsUser: 0
        ports:
        - containerPort: 3000
          name: grafana
        env:
        - name: GF_SECURITY_ADMIN_USER
          value: admin
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: admin
        readinessProbe:
          failureThreshold: 10
          httpGet:
            path: /api/health
            port: 3000
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/health
            port: 3000
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 400m
            memory: 1024Mi
          requests:
            cpu: 200m
            memory: 512Mi
        volumeMounts:
        - mountPath: /var/lib/grafana
          name: storage
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - port: 3000
  selector:
    app: grafana
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana-local
  labels:
    app: grafana
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1Gi
  storageClassName: local-storage
  local:
    path: /data/k8s/grafana   # make sure this directory exists on the node
  persistentVolumeReclaimPolicy: Retain
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - k8s-worker-2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-data
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: grafana
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: local-storage
As before, this uses PV, PVC, and StorageClass. Node affinity pins the volume to the k8s-worker-2 node, and the directory /data/k8s/grafana must be created on that node.
4.2. Deploy the Grafana resources
kubectl apply -f grafana-deploy.yaml
4.3. Access Grafana
Look up the Service's node port mapping with kubectl get svc -n monitoring.
Then open Grafana at a node IP plus that port, e.g. http://10.20.1.21:31881/, log in with the username and password set in the deployment file (admin/admin), and add Prometheus as a data source.
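If you prefer configuration as code over clicking through the UI, Grafana can also provision the data source from a YAML file placed under /etc/grafana/provisioning/datasources. A minimal sketch, assuming the in-cluster DNS name of the Prometheus Service created earlier:

# prometheus-datasource.yaml — mount into /etc/grafana/provisioning/datasources/
apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  access: proxy
  # Service "prometheus" in namespace "monitoring", port 9090
  url: http://prometheus.monitoring.svc.cluster.local:9090
  isDefault: true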
5. Configuring Data Scraping
5.1. Scraping node metrics
Before node data can be scraped, node-exporter must run on the nodes; Prometheus then pulls metrics from the endpoint node-exporter exposes.
5.1.1. Prepare the node-exporter deployment file
Prepare the node-exporter DaemonSet file node-exporter-daemonset.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: kube-system
  labels:
    app: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      nodeSelector:
        kubernetes.io/os: linux
      containers:
      - name: node-exporter
        image: prom/node-exporter:v1.3.1
        args:
        - --web.listen-address=$(HOSTIP):9100
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        - --path.rootfs=/host/root
        - --no-collector.hwmon   # disable collectors we don't need
        - --no-collector.nfs
        - --no-collector.nfsd
        - --no-collector.nvme
        - --no-collector.dmi
        - --no-collector.arp
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/containerd/.+|/var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
        - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
        ports:
        - containerPort: 9100
        env:
        - name: HOSTIP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        resources:
          requests:
            cpu: 150m
            memory: 200Mi
          limits:
            cpu: 300m
            memory: 400Mi
        securityContext:
          runAsNonRoot: true
          runAsUser: 65534
        volumeMounts:
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
        - name: root
          mountPath: /host/root
          mountPropagation: HostToContainer
          readOnly: true
      tolerations:   # tolerate all taints so the exporter also runs on the master node
      - operator: "Exists"
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: dev
        hostPath:
          path: /dev
      - name: sys
        hostPath:
          path: /sys
      - name: root
        hostPath:
          path: /
5.1.2. Deploy node-exporter
kubectl apply -f node-exporter-daemonset.yaml
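Because it is a DaemonSet running with hostNetwork, one pod appears per node, each listening on port 9100 of its node:

kubectl get pods -n kube-system -l app=node-exporter -o wide
# Spot-check one node's endpoint (IP from the assumed cluster layout):
curl -s http://10.20.1.22:9100/metrics | head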
5.1.3. Point Prometheus at node-exporter
Add another job to the earlier prometheus-config.yaml:
- job_name: kubernetes-nodes
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - source_labels: [__address__]
    regex: '(.*):10250'
    replacement: '${1}:9100'
    target_label: __address__
    action: replace
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
The complete prometheus-config.yaml is attached at the end of this article.
Shortly after the modified config is picked up, several new targets appear on the Targets page; these are the node endpoints Prometheus now monitors.
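The relabel rule above rewrites each discovered node address from the kubelet's port 10250 to node-exporter's 9100. To confirm data is flowing, try a standard node-exporter query in the Prometheus UI:

# Fraction of CPU busy per node over the last 5 minutes
1 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]))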
5.2. Scraping Spring Boot actuator metrics
5.2.1. Configure the Spring Boot application
- Add the dependencies to the application's pom:
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
- Configure the application's properties file:
management.endpoint.health.probes.enabled=true
management.health.probes.enabled=true
management.endpoint.health.enabled=true
management.endpoint.health.show-details=always
management.endpoints.web.exposure.include=*
management.endpoints.web.exposure.exclude=env,beans
management.endpoint.shutdown.enabled=true
management.server.port=9090
- Check the metrics endpoint
Once configured, rebuild the image and redeploy to the cluster (not shown here). The application then exposes its metrics at the /actuator/prometheus endpoint:
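For reference, the endpoint returns Micrometer metrics in the Prometheus text format; a trimmed, illustrative sample (exact metrics, labels, and values will differ per application):

# curl http://<pod-ip>:9090/actuator/prometheus
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="PS Eden Space",} 1.1534336E7
# HELP system_cpu_usage The "recent cpu usage" for the whole system
# TYPE system_cpu_usage gauge
system_cpu_usage 0.12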
5.2.2. Point Prometheus at the application
Extend prometheus-config.yaml with one more job:
- job_name: 'spring-actuator-many'
  metrics_path: '/actuator/prometheus'
  scrape_interval: 5s
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: 'test1'
    target_label: namespace
    action: keep
  - source_labels: [__address__]
    regex: '(.*):9090'
    target_label: __address__
    action: keep
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
In short, this job keeps only pods whose namespace is test1 and whose address ends in port 9090. For the full relabeling syntax, consult the Prometheus documentation.
After a short wait, the Spring Boot application appears as an additional monitoring target.
6. Configuring Dashboards
With the metric data in place, the next step is dashboards. Grafana offers a rich catalog of ready-made dashboards to choose from on its website. Below we configure one dashboard for the nodes and one for the Spring Boot application.
A dashboard can be imported in three ways: uploading a JSON file, entering a dashboard ID, or pasting JSON content.
6.1. Configure the node dashboard
On the import screen, choose the dashboard-ID option and enter ID 8919; the node dashboard renders right away.
6.2. Configure the Spring Boot dashboard
This time choose the paste-JSON option and paste the JSON from https://img.mangod.top/blog/13-6-jvm-micrometer.json; the JVM dashboard appears.
That completes node monitoring and Spring Boot application monitoring. For anything beyond this, you will need to consult further resources.
7. Installing Alertmanager
With monitoring in place, the next step is the alerting component, Alertmanager. It can be installed with Docker on any node of the cluster.
7.1. Install Alertmanager
7.1.1. Pull the Docker image
docker pull prom/alertmanager:v0.25.0
7.1.2. Create the alerting configuration file
Before creating alertmanager.yml, create the directory /data/prometheus/alertmanager on the node where Alertmanager will run, then create alertmanager.yml in that directory with the following content:
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'mail_163'
global:
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '294931067@qq.com'
  smtp_auth_username: '294931067@qq.com'
  # this is the mailbox's SMTP authorization code, not its password
  smtp_auth_password: 'your authorization code, e.g. sdfasdfsdffsfa'
  smtp_require_tls: false
receivers:
- name: 'mail_163'
  email_configs:
  - to: 'yclxiao@163.com'
    send_resolved: true
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
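Before starting the container, you can validate the file with amtool, which ships inside the Alertmanager image:

docker run --rm -v /data/prometheus/alertmanager:/etc/alertmanager \
  --entrypoint amtool prom/alertmanager:v0.25.0 \
  check-config /etc/alertmanager/alertmanager.yml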
7.1.3. Install and start it:
docker run --name alertmanager -d -p 9093:9093 \
  -v /data/prometheus/alertmanager:/etc/alertmanager \
  prom/alertmanager:v0.25.0
7.1.4. Access Alertmanager
Once it is up, open http://10.20.1.21:9093/#/alerts to reach the Alertmanager UI.
7.2. Connecting Prometheus to Alertmanager
Add the following alerting block to prometheus-config.yaml to wire Prometheus up to Alertmanager; change the target address to wherever your own Alertmanager runs:
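alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 10.20.1.21:9093

(The same block appears in the complete config at the end of this section.)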
7.3. Configuring Alert Trigger Rules
7.3.1. Add the rules directory
Add the following rule_files block to prometheus-config.yaml to tell Prometheus where to load alerting rules from:
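rule_files:
- /prometheus/rules/*.rules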
Note that /prometheus/ here is the Prometheus storage path (backed by /data/k8s/prometheus on k8s-worker-2 in this setup); create a rules folder under that storage directory to hold the rule files.
With that, prometheus-config.yaml is fully assembled. Here is the complete file:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - 10.20.1.21:9093
    rule_files:
    - /prometheus/rules/*.rules
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']
    - job_name: "cadvisor"
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
        replacement: $1
      - replacement: /metrics/cadvisor   # <nodeip>/metrics -> <nodeip>/metrics/cadvisor
        target_label: __metrics_path__
    - job_name: kubernetes-nodes
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: 'spring-actuator-many'
      metrics_path: '/actuator/prometheus'
      scrape_interval: 5s
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        separator: ;
        regex: 'test1'
        target_label: namespace
        action: keep
      - source_labels: [__address__]
        regex: '(.*):9090'
        target_label: __address__
        action: keep
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
7.3.2. Write the alert trigger rules
With the rules directory wired up, the next step is the rules themselves. Create two rule files in that directory: one for node alerts and one for Spring Boot application alerts. Note that the file names must end in .rules to match the *.rules glob configured above.
- Node alert rules, hoststats-alert.rules:
groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr: sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[5m]))) by (instance) > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage high"
      description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
  - alert: hostMemUsageAlert
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} MEM usage high"
      description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"
- Spring Boot application alert rules, jvm-metrics-rules.rules:
groups:
- name: jvm-metrics-rules
  rules:
  # GC time exceeds 10% of the last 5 minutes
  - alert: GcTimeTooMuch
    expr: increase(jvm_gc_collection_seconds_sum[5m]) > 30
    for: 5m
    labels:
      severity: red
    annotations:
      summary: "{{ $labels.app }} spent more than 10% of time in GC"
      message: "ns:{{ $labels.namespace }} pod:{{ $labels.pod }} GC time ratio above 10%, current value {{ $value }}%"
  # Too many GCs
  - alert: GcCountTooMuch
    expr: increase(jvm_gc_collection_seconds_count[1m]) > 30
    for: 1m
    labels:
      severity: red
    annotations:
      summary: "{{ $labels.app }} more than 30 GCs in 1 minute"
      message: "ns:{{ $labels.namespace }} pod:{{ $labels.pod }} more than 30 GCs in 1 minute, current value {{ $value }}"
  # Too many full GCs
  - alert: FgcCountTooMuch
    expr: increase(jvm_gc_collection_seconds_count{gc="ConcurrentMarkSweep"}[1h]) > 3
    for: 1m
    labels:
      severity: red
    annotations:
      summary: "{{ $labels.app }} more than 3 full GCs in 1 hour"
      message: "ns:{{ $labels.namespace }} pod:{{ $labels.pod }} more than 3 full GCs in 1 hour, current value {{ $value }}"
  # Non-heap memory usage above 80%
  - alert: NonheapUsageTooMuch
    expr: jvm_memory_bytes_used{job="spring-actuator-many", area="nonheap"} / jvm_memory_bytes_max * 100 > 80
    for: 1m
    labels:
      severity: red
    annotations:
      summary: "{{ $labels.app }} non-heap memory usage above 80%"
      message: "ns:{{ $labels.namespace }} pod:{{ $labels.pod }} non-heap memory usage above 80%, current value {{ $value }}%"
  # RSS memory usage warning
  - alert: HighMemUsage
    expr: process_resident_memory_bytes{job="spring-actuator-many"} / os_total_physical_memory_bytes * 100 > 15
    for: 1m
    labels:
      severity: red
    annotations:
      summary: "{{ $labels.app }} RSS memory usage above 15% of physical memory"
      message: "ns:{{ $labels.namespace }} pod:{{ $labels.pod }} RSS memory usage above 15% of physical memory, current value {{ $value }}%"
  # JVM heap usage warning
  - alert: JavaHighMemUsage
    expr: sum(jvm_memory_used_bytes{area="heap",job="spring-actuator-many"}) by(app,instance) / sum(jvm_memory_max_bytes{area="heap",job="spring-actuator-many"}) by(app,instance) * 100 > 85
    for: 1m
    labels:
      severity: red
    annotations:
      summary: "{{ $labels.app }} heap usage above 85%"
      message: "ns:{{ $labels.namespace }} pod:{{ $labels.pod }} heap usage above 85%, current value {{ $value }}%"
  # CPU usage warning
  - alert: JavaHighCpuUsage
    expr: system_cpu_usage{job="spring-actuator-many"} * 100 > 85
    for: 1m
    labels:
      severity: red
    annotations:
      summary: "{{ $labels.app }} CPU usage above 85%"
      message: "ns:{{ $labels.namespace }} pod:{{ $labels.pod }} CPU usage above 85%, current value {{ $value }}%"
- Once the rule files are in place, restart Alertmanager first, then Prometheus:
kubectl delete -f prometheus-deploy.yaml
kubectl apply -f prometheus-deploy.yaml
- Check the UIs
Alertmanager's Status page now shows the loaded configuration.
Prometheus's Rules page now lists the alert rules you created.
7.3.3. Notes
Changes to Alertmanager's alerting configuration take effect only after Alertmanager is restarted.
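For the Docker-based installation used here, that restart is simply:

docker restart alertmanager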
The smtp_auth_password in alertmanager.yml is the mailbox's SMTP authorization code, not the mailbox password. The authorization code is generated in the mailbox's settings; QQ Mail, used here, is one example.
With that, Prometheus- and Grafana-based monitoring and alerting is fully installed.
8. Testing Alerts
With everything installed, run a quick test of the alerting path. There are two ways.
Option 1: temporarily lower an alert rule's threshold; an alert email arrives shortly.
Option 2: drive up CPU on a node or in a container, e.g. with cat /dev/zero > /dev/null, and wait for the alert email.
9. Summary
This article walked through a hands-on, Prometheus + Grafana based setup for cloud-native application monitoring and alerting, to help you stand up your own system quickly. I hope it helps!