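A Deep Dive into a ServiceMonitor That Has No Effect in a Kubernetes Cluster
Introduction
Several new services recently went live in production, and their metrics need to be monitored. I used the Prometheus-Operator ServiceMonitor resource for this.
Let's get started.
Getting Started
Here is the YAML file: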
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: lobby-bank-consumer
  namespace: lobby
  labels:
    app.kubernetes.io/name: lobby-bank-consumer
    app.kubernetes.io/part-of: lobby
spec:
  selector:
    matchLabels:
      app: lobby-bank-consumer
  namespaceSelector:
    matchNames:
      - lobby
  endpoints:
    - port: tcp-63200
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: lobby-bank-producer
  namespace: lobby
  labels:
    app.kubernetes.io/name: lobby-bank-producer
    app.kubernetes.io/part-of: lobby
spec:
  selector:
    matchLabels:
      app: lobby-bank-producer
  namespaceSelector:
    matchNames:
      - lobby
  endpoints:
    - port: tcp-63100
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: lobby-bank-server
  namespace: lobby
  labels:
    app.kubernetes.io/name: lobby-bank-server
    app.kubernetes.io/part-of: lobby
spec:
  selector:
    matchLabels:
      app: lobby-bank-server
  namespaceSelector:
    matchNames:
      - lobby
  endpoints:
    - port: tcp-63001
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
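For a ServiceMonitor to produce scrape targets, spec.selector.matchLabels has to match labels on a Service (not on the Pods), and endpoints[].port has to be the name of a port defined on that Service. The following is a minimal sketch of a Service that the lobby-bank-consumer ServiceMonitor would select. The Service manifests are not shown in this post, so the Service name, the Pod selector, and the port number 63200 are assumptions made purely for illustration.

```yaml
# Hypothetical Service for lobby-bank-consumer; not taken from the original setup.
apiVersion: v1
kind: Service
metadata:
  name: lobby-bank-consumer        # assumed name
  namespace: lobby                 # must be listed in namespaceSelector.matchNames
  labels:
    app: lobby-bank-consumer       # matched by the ServiceMonitor's spec.selector.matchLabels
spec:
  selector:
    app: lobby-bank-consumer       # assumed Pod label
  ports:
    - name: tcp-63200              # the ServiceMonitor refers to this port by name
      port: 63200                  # assumed from the port name
      targetPort: 63200            # assumed container port
```

If the label or the port name on the real Service differs from what the ServiceMonitor declares, no target is generated, so this is worth ruling out before looking elsewhere.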
Deploy it:
$ kubectl apply -f lobby-bank-sm.yaml
After the deployment finished, no data showed up:
[Image]
Time to troubleshoot.
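Troubleshooting
I carefully checked my ServiceMonitor YAML file and found nothing wrong with it, which was strange.
After thinking it over for a long time, I did not expect the cause to be something like RBAC, but with no other leads left I had to look at the Prometheus logs.
To my surprise, that is exactly where the problem was:
[Image]
I added the corresponding resources and verbs to the Prometheus ClusterRole: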
  - apiGroups:
      - "monitoring.coreos.com"
    resources:
      - servicemonitors
      - podmonitors
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
      - list
      - watch
Here is the complete YAML file:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 3.0.1
  name: prometheus-k8s
rules:
  - apiGroups:
      - "monitoring.coreos.com"
    resources:
      - servicemonitors
      - podmonitors
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
      - list
      - watch
  - nonResourceURLs:
      - /metrics
      - /metrics/slis
    verbs:
      - get
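The ClusterRole grants list and watch on Services, Endpoints, and Pods in every namespace. If cluster-wide access is broader than needed, a narrower option is a namespaced Role and RoleBinding in the namespace being scraped. The sketch below is one possible way to do that, not what was deployed here. It assumes Prometheus runs as the ServiceAccount prometheus-k8s in the monitoring namespace (the kube-prometheus defaults, which this post does not confirm), so adjust both to match the actual deployment.

```yaml
# Hypothetical namespaced alternative to the cluster-wide rules.
# The ServiceAccount name and namespace below are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prometheus-k8s
  namespace: lobby               # namespace that holds the scraped Services
rules:
  - apiGroups:
      - ""
    resources:
      - services
      - endpoints
      - pods
    verbs:
      - get
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prometheus-k8s
  namespace: lobby
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-k8s
subjects:
  - kind: ServiceAccount
    name: prometheus-k8s         # assumed ServiceAccount used by Prometheus
    namespace: monitoring        # assumed namespace where Prometheus runs
```

The trade-off is that a Role like this has to be created for each additional namespace that should be monitored, whereas the ClusterRole covers all namespaces at once.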
Redeploy Prometheus-Operator:
$ kubectl delete -f .
$ kubectl create -f .
Wait for all components to finish starting up.
Check again:
[Image]
It is also worth verifying with PromQL:
[Image]