Monitoring and Visualizing SLIs/SLOs with Prometheus
What are SLI and SLO?
SLI stands for Service Level Indicator: a metric used to measure the stability of a system.
SLO stands for Service Level Objective: the stability target we set for the system, such as "four nines" or "five nines".
SREs usually rely on these two concepts to measure system stability. The main idea is to judge the SLO through SLIs, that is, to use a set of indicators to check whether the system has reached its target number of nines.
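To make the "nines" concrete, here is a minimal sketch that converts a number of nines into an availability ratio and into the amount of full downtime that target tolerates. The function names and the 30-day period are assumptions chosen for illustration; they do not come from any particular tool.

```python
def nines_to_target(nines: int) -> float:
    """Convert a count of nines into an availability ratio, e.g. 4 -> 0.9999."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return 1 - 10 ** (-nines)


def allowed_downtime_minutes(target: float, days: int = 30) -> float:
    """Minutes of complete outage that still satisfy the target over the period."""
    total_minutes = days * 24 * 60
    return (1 - target) * total_minutes


if __name__ == "__main__":
    for n in (3, 4, 5):
        target = nines_to_target(n)
        minutes = allowed_downtime_minutes(target)
        print(f"{n} nines: target={target:.5f}, allowed downtime={minutes:.2f} min / 30 days")
```

Under these assumptions, four nines leaves only a few minutes of full outage per 30 days, which is why the target has to be chosen deliberately rather than by habit.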
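How to choose SLIs
A system exposes many kinds of metrics, for example:
- System level: CPU usage, memory usage, disk usage, etc.
- Application server level: port liveness, JVM state, etc.
- Application runtime level: status codes, latency, QPS, etc.
- Middleware level: QPS, TPS, latency, etc.
- Business level: success rate, growth rate, etc.
With so many metrics, how should you choose? Two principles are enough:
- Choose metrics that indicate whether a given subject is stable. Metrics that do not belong to the subject itself, or that do not reflect its stability, should be excluded.
- Prefer metrics that are strongly tied to user experience or that users can clearly perceive.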
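In most cases you can directly apply the VALET method described in Google's Site Reliability Workbook.
- V: Volume, the maximum capacity the service commits to
- A: Availability, whether the service is working normally
- L: Latency, the response time of the service
- E: Error, the request error rate
- T: Ticket, whether manual intervention is required
These five dimensions are the sample Google gives for the VALET method.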
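The above is only a brief introduction to SLI/SLO. To learn more, see 《SRE:Google运维解密》 (the Chinese edition of Site Reliability Engineering: How Google Runs Production Systems) and Zhao Cheng's Geek Time course 《SRE实践手册》. The rest of this article briefly shows how to monitor SLIs/SLOs with Prometheus.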
service-level-operator
service-level-operator measures the service level of applications running in Kubernetes through SLIs/SLOs, and the results can be displayed in Grafana.
The operator reads SLO definitions and generates new metrics from them. For example:

    apiVersion: monitoring.spotahome.com/v1alpha1
    kind: ServiceLevel
    metadata:
      name: awesome-service
    spec:
      serviceLevelObjectives:
        - name: "9999_http_request_lt_500"
          description: 99.99% of requests must be served with <500 status code.
          disable: false
          availabilityObjectivePercent: 99.99
          serviceLevelIndicator:
            prometheus:
              address: http://myprometheus:9090
              totalQuery: sum(increase(http_request_total{host="awesome_service_io"}[2m]))
              errorQuery: sum(increase(http_request_total{host="awesome_service_io", code=~"5.."}[2m]))
          output:
            prometheus:
              labels:
                team: a-team
                iteration: "3"

- availabilityObjectivePercent: the SLO target
- totalQuery: the total number of requests
- errorQuery: the number of failed requests
From totalQuery and errorQuery, the operator can compute the SLO metrics.
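The relationship between the two query results and the objective can be sketched as follows. This is only an illustration of the arithmetic, not the operator's actual implementation; the function names are hypothetical, and `total` and `errors` stand for the numeric results of totalQuery and errorQuery.

```python
def error_ratio(total: float, errors: float) -> float:
    """Share of failed requests; treated as 0 when there was no traffic."""
    if total <= 0:
        return 0.0
    return errors / total


def meets_objective(total: float, errors: float, objective_percent: float) -> bool:
    """True when the measured availability reaches availabilityObjectivePercent."""
    availability = 1 - error_ratio(total, errors)
    return availability >= objective_percent / 100


if __name__ == "__main__":
    # Hypothetical sample values for one evaluation window.
    print(meets_objective(total=100_000, errors=5, objective_percent=99.99))
    print(meets_objective(total=100_000, errors=20, objective_percent=99.99))
```

Treating a window with no traffic as error-free is a design choice made for this sketch; a real system may prefer to report "no data" instead.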
Deploying service-level-operator
- Prerequisite: Prometheus is already deployed in the Kubernetes cluster. Here it was deployed with Prometheus-Operator.
(1) First, create the RBAC resources

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: service-level-operator
      namespace: monitoring
      labels:
        app: service-level-operator
        component: app
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: service-level-operator
      labels:
        app: service-level-operator
        component: app
    rules:
      # Register and check CRDs.
      - apiGroups:
          - apiextensions.k8s.io
        resources:
          - customresourcedefinitions
        verbs:
          - "*"
      # Operator logic.
      - apiGroups:
          - monitoring.spotahome.com
        resources:
          - servicelevels
          - servicelevels/status
        verbs:
          - "*"
    ---
    kind: ClusterRoleBinding
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: service-level-operator
    subjects:
      - kind: ServiceAccount
        name: service-level-operator
        namespace: monitoring
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: service-level-operator

(2) Then create the Deployment

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: service-level-operator
      namespace: monitoring
      labels:
        app: service-level-operator
        component: app
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: service-level-operator
          component: app
      strategy:
        rollingUpdate:
          maxUnavailable: 0
      template:
        metadata:
          labels:
            app: service-level-operator
            component: app
        spec:
          serviceAccountName: service-level-operator
          containers:
            - name: app
              imagePullPolicy: Always
              image: quay.io/spotahome/service-level-operator:latest
              ports:
                - containerPort: 8080
                  name: http
                  protocol: TCP
              readinessProbe:
                httpGet:
                  path: /healthz/ready
                  port: http
              livenessProbe:
                httpGet:
                  path: /healthz/live
                  port: http
              resources:
                limits:
                  cpu: 220m
                  memory: 254Mi
                requests:
                  cpu: 120m
                  memory: 128Mi

(3) Create the Service

    apiVersion: v1
    kind: Service
    metadata:
      name: service-level-operator
      namespace: monitoring
      labels:
        app: service-level-operator
        component: app
    spec:
      ports:
        - port: 80
          protocol: TCP
          name: http
          targetPort: http
      selector:
        app: service-level-operator
        component: app

(4) Create the Prometheus ServiceMonitor

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: service-level-operator
      namespace: monitoring
      labels:
        app: service-level-operator
        component: app
        prometheus: myprometheus
    spec:
      selector:
        matchLabels:
          app: service-level-operator
          component: app
      namespaceSelector:
        matchNames:
          - monitoring
      endpoints:
        - port: http
          interval: 10s

At this point service-level-operator is deployed, and the corresponding target should appear on the Targets page in Prometheus.
Next, create the service level definition itself. The following is an example.

    apiVersion: monitoring.spotahome.com/v1alpha1
    kind: ServiceLevel
    metadata:
      name: prometheus-grafana-service
      namespace: monitoring
    spec:
      serviceLevelObjectives:
        - name: "9999_http_request_lt_500"
          description: 99.99% of requests must be served with <500 status code.
          disable: false
          availabilityObjectivePercent: 99.99
          serviceLevelIndicator:
            prometheus:
              address: http://prometheus-k8s.monitoring.svc:9090
              totalQuery: sum(increase(http_request_total{service="grafana"}[2m]))
              errorQuery: sum(increase(http_request_total{service="grafana", code=~"5.."}[2m]))
          output:
            prometheus:
              labels:
                team: prometheus-grafana
                iteration: "3"

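The definition above sets a "four nines" SLO for the grafana application.
The resulting metrics can then be queried in Prometheus.
Next, import the dashboard with ID 8793 into Grafana to generate the charts.
In that dashboard, the upper part shows the SLI, and the lower part shows the total error budget and the errors already consumed.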
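The error budget shown in the dashboard follows from the objective. The sketch below illustrates the arithmetic only; the function names and request counts are hypothetical, and this is not how the dashboard itself is implemented.

```python
def error_budget(total_requests: float, objective_percent: float) -> float:
    """Number of failed requests the objective tolerates for this traffic volume."""
    return total_requests * (1 - objective_percent / 100)


def budget_consumed(errors: float, total_requests: float, objective_percent: float) -> float:
    """Fraction of the error budget already used (1.0 means fully spent)."""
    budget = error_budget(total_requests, objective_percent)
    if budget <= 0:
        # No budget at all: any error overspends it.
        return float("inf") if errors > 0 else 0.0
    return errors / budget


if __name__ == "__main__":
    # Hypothetical month: one million requests, 25 of them failed, 99.99% objective.
    print(f"budget:   {error_budget(1_000_000, 99.99):.1f} failed requests")
    print(f"consumed: {budget_consumed(25, 1_000_000, 99.99):.1%}")
```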
You can then define alerting rules so that you are notified as soon as the SLO starts to degrade, for example:

    groups:
      - name: slo.rules
        rules:
          - alert: SLOErrorRateTooFast1h
            expr: |
              (
                increase(service_level_sli_result_error_ratio_total[1h])
                /
                increase(service_level_sli_result_count_total[1h])
              ) > (1 - service_level_slo_objective_ratio) * 14.6
            labels:
              severity: critical
              team: a-team
            annotations:
              summary: The monthly SLO error budget consumed for 1h is greater than 2%
              description: The error rate for 1h in the {{$labels.service_level}}/{{$labels.slo}} SLO error budget is being consumed too fast, is greater than 2% monthly budget.
          - alert: SLOErrorRateTooFast6h
            expr: |
              (
                increase(service_level_sli_result_error_ratio_total[6h])
                /
                increase(service_level_sli_result_count_total[6h])
              ) > (1 - service_level_slo_objective_ratio) * 6
            labels:
              severity: critical
              team: a-team
            annotations:
              summary: The monthly SLO error budget consumed for 6h is greater than 5%
              description: The error rate for 6h in the {{$labels.service_level}}/{{$labels.slo}} SLO error budget is being consumed too fast, is greater than 5% monthly budget.

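The first rule fires when more than 2% of the 30-day error budget is consumed within 1 hour. The second rule fires when more than 5% of the 30-day error budget is consumed within 6 hours.
These thresholds follow Google's baseline.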
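The multipliers 14.6 and 6 in the rules above are burn rates: how many times faster than "exactly on budget" errors may occur before the alert fires. A minimal sketch of the derivation follows; the function names are hypothetical. Note one assumption: with a strict 30-day period (720 hours) the 1-hour factor works out to 14.4, while 14.6 appears to correspond to an average month of about 730 hours.

```python
def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_hours: float = 30 * 24) -> float:
    """Burn rate at which `budget_fraction` of the budget is spent in `window_hours`."""
    return budget_fraction * period_hours / window_hours


def alert_error_ratio(objective_ratio: float, burn_rate: float) -> float:
    """Error ratio above which the alert expression becomes true."""
    return (1 - objective_ratio) * burn_rate


if __name__ == "__main__":
    print(f"1h, 2% of a 30-day budget:  {burn_rate_threshold(0.02, 1):.1f}")
    print(f"1h, 2% of a 730-hour month: {burn_rate_threshold(0.02, 1, period_hours=730):.1f}")
    print(f"6h, 5% of a 30-day budget:  {burn_rate_threshold(0.05, 6):.1f}")
    # For a 99.99% objective, the 1h rule fires above this error ratio.
    print(f"1h alert error ratio: {alert_error_ratio(0.9999, 14.6):.5f}")
```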
Closing thoughts
When talking about system stability, we also have to talk about availability: SRE work improves stability ultimately in order to increase uptime and reduce downtime. So how do we measure availability?
The industry currently measures availability in two ways: by time and by requests. The time dimension evaluates stability from the perspective of outages. The request dimension evaluates stability from the share of successful requests.
Time dimension: availability = service time / (service time + downtime)
Request dimension: availability = successful requests / total requests
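The two formulas can give different answers for the same incident, because an outage during quiet hours affects few requests. The sketch below illustrates this with made-up numbers; the function names and the sample figures are assumptions for illustration only.

```python
def availability_by_time(service_minutes: float, downtime_minutes: float) -> float:
    """Time dimension: service time / (service time + downtime)."""
    total = service_minutes + downtime_minutes
    if total <= 0:
        raise ValueError("the observed period must be longer than zero")
    return service_minutes / total


def availability_by_requests(successful: float, total: float) -> float:
    """Request dimension: successful requests / total requests."""
    if total <= 0:
        raise ValueError("at least one request is needed")
    return successful / total


if __name__ == "__main__":
    # Hypothetical 30-day period with one 10-minute outage at night.
    period_minutes = 30 * 24 * 60
    by_time = availability_by_time(period_minutes - 10, 10)
    # Hypothetical traffic: 10 million requests, only 500 hit the outage.
    by_requests = availability_by_requests(10_000_000 - 500, 10_000_000)
    print(f"time dimension:    {by_time:.6f}")
    print(f"request dimension: {by_requests:.6f}")
```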
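In SRE practice, the request dimension is usually chosen to measure stability, as in the example above. Still, judging stability by a single dimension is somewhat arbitrary. It should be combined with more indicators, such as latency and error rate, and the SLIs for core applications and core request paths should be more fine-grained.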
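References
[1] Zhao Cheng, 《SRE实践手册》, Geek Time course
[2] 《SRE:Google运维解密》 (Chinese edition of Site Reliability Engineering: How Google Runs Production Systems)
[3] https://github.com/spotahome/service-level-operator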