Monitoring and Visualizing SLIs/SLOs with Prometheus

Author: 乔克 2021-04-07 14:53:09

What are SLI and SLO

SLI, short for Service Level Indicator, is a metric used to measure the stability of a system.

SLO, short for Service Level Objective, is the stability target we set, such as "four nines" or "five nines".

SREs typically use the two together to gauge system stability. The main idea is to judge the SLO through the SLIs, that is, to measure a set of indicators and determine whether the target of "N nines" has been met.
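To make "N nines" concrete, the short sketch below converts an SLO percentage into the downtime it tolerates. The function name and the 30-day period are assumptions chosen for illustration; they do not come from the article.

```python
def allowed_downtime_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Return the minutes of downtime a time-based SLO tolerates over the period.

    The 30-day default period is an assumption made for this illustration.
    """
    total_minutes = period_days * 24 * 60
    # The error budget is whatever the SLO leaves over: 100% minus the target.
    return total_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    for slo in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(slo)
        print(f"{slo}% over 30 days allows about {minutes:.2f} minutes of downtime")
```

Each additional nine shrinks the budget by a factor of ten, which is why the choice of target matters more than it first appears.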

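How to choose SLIs

A system exposes many kinds of metrics, for example:

System level: CPU usage, memory usage, disk usage, etc.
Application server level: port liveness, JVM state, etc.
Application runtime level: status codes, latency, QPS, etc.
Middleware level: QPS, TPS, latency, etc.
Business level: success rate, growth rate, etc.

With so many metrics, how should you choose? Following two principles is enough:

Choose metrics that indicate whether a subject is stable. Exclude any metric that does not belong to the subject itself or that does not reflect its stability.
Prefer metrics that are strongly tied to user experience or that users can clearly perceive.

In most cases you can directly apply Google's VALET method.

V: Volume, the maximum capacity the service commits to
A: Availability, whether the service is working normally
L: Latency, the response time of the service
E: Error, the request error rate
T: Ticket, whether manual intervention is required

These are the dimensions in the sample Google gives for the VALET method.

The above is only a brief introduction to SLI/SLO. To learn more, see the book 《SRE：Google运维解密》 and Zhao Cheng's Geek Time (极客时间) course 《SRE实践手册》. The rest of this article briefly shows how to use Prometheus for SLI/SLO monitoring.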

service-level-operator

Service level operator measures the service level of applications running in Kubernetes using SLI/SLO metrics, and the results can be displayed in Grafana.

The Operator reads SLO definitions and generates new metrics from them. For example:

apiVersion: monitoring.spotahome.com/v1alpha1 
kind: ServiceLevel 
metadata: 
  name: awesome-service 
spec: 
  serviceLevelObjectives: 
    - name: "9999_http_request_lt_500" 
      description: 99.99% of requests must be served with <500 status code. 
      disable: false 
      availabilityObjectivePercent: 99.99 
      serviceLevelIndicator: 
        prometheus: 
          address: http://myprometheus:9090 
          totalQuery: sum(increase(http_request_total{host="awesome_service_io"}[2m])) 
          errorQuery: sum(increase(http_request_total{host="awesome_service_io", code=~"5.."}[2m])) 
      output: 
        prometheus: 
          labels: 
            team: a-team 
            iteration: "3"

availabilityObjectivePercent: the SLO target
totalQuery: the total number of requests
errorQuery: the number of failed requests

From totalQuery and errorQuery the Operator can compute the SLO metrics.
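The arithmetic behind that calculation can be sketched as follows. This is an illustration of the idea only, not the Operator's actual implementation; the function name, the returned keys, and the handling of zero traffic are assumptions.

```python
def evaluate_slo(total: float, errors: float, objective_percent: float) -> dict:
    """Compare an observed error ratio with the error budget implied by an SLO.

    total and errors stand for the values returned by totalQuery and errorQuery.
    Treating "no traffic" as "no errors" is an assumption of this sketch.
    """
    error_ratio = errors / total if total > 0 else 0.0
    # The budget is the share of requests that may fail, e.g. 0.01% for 99.99%.
    error_budget = 1 - objective_percent / 100
    return {
        "error_ratio": error_ratio,
        "availability": 1 - error_ratio,
        "error_budget": error_budget,
        "slo_met": error_ratio <= error_budget,
    }


if __name__ == "__main__":
    # 5 failed requests out of 100,000 against a 99.99% objective.
    print(evaluate_slo(total=100_000, errors=5, objective_percent=99.99))
```

With the same 5 errors but only 10,000 requests, the error ratio would be 0.05%, which exceeds the 0.01% budget of a 99.99% objective.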

Deploying service-level-operator

Prerequisite: Prometheus is already deployed in the Kubernetes cluster. Here I deployed it with Prometheus-Operator.

(1) First, create the RBAC resources

apiVersion: v1 
kind: ServiceAccount 
metadata: 
  name: service-level-operator 
  namespace: monitoring 
  labels: 
    app: service-level-operator 
    component: app 
 
--- 
apiVersion: rbac.authorization.k8s.io/v1 
kind: ClusterRole 
metadata: 
  name: service-level-operator 
  labels: 
    app: service-level-operator 
    component: app 
rules: 
  # Register and check CRDs. 
  - apiGroups: 
      - apiextensions.k8s.io 
    resources: 
      - customresourcedefinitions 
    verbs: 
      - "*" 
 
  # Operator logic. 
  - apiGroups: 
      - monitoring.spotahome.com 
    resources: 
      - servicelevels 
      - servicelevels/status 
    verbs: 
      - "*" 
 
--- 
kind: ClusterRoleBinding 
apiVersion: rbac.authorization.k8s.io/v1 
metadata: 
  name: service-level-operator 
subjects: 
  - kind: ServiceAccount 
    name: service-level-operator 
    namespace: monitoring  
roleRef: 
  apiGroup: rbac.authorization.k8s.io 
  kind: ClusterRole 
  name: service-level-operator

(2) Then create the Deployment

apiVersion: apps/v1  
kind: Deployment 
metadata: 
  name: service-level-operator 
  namespace: monitoring 
  labels: 
    app: service-level-operator 
    component: app 
spec: 
  replicas: 1 
  selector: 
    matchLabels: 
      app: service-level-operator 
      component: app 
  strategy: 
    rollingUpdate: 
      maxUnavailable: 0 
  template: 
    metadata: 
      labels: 
        app: service-level-operator 
        component: app 
    spec: 
      serviceAccountName: service-level-operator 
      containers: 
        - name: app 
          imagePullPolicy: Always 
          image: quay.io/spotahome/service-level-operator:latest 
          ports: 
            - containerPort: 8080 
              name: http 
              protocol: TCP 
          readinessProbe: 
            httpGet: 
              path: /healthz/ready 
              port: http 
          livenessProbe: 
            httpGet: 
              path: /healthz/live 
              port: http 
          resources: 
            limits: 
              cpu: 220m 
              memory: 254Mi 
            requests: 
              cpu: 120m 
              memory: 128Mi

(3) Create the Service

apiVersion: v1 
kind: Service 
metadata: 
  name: service-level-operator 
  namespace: monitoring 
  labels: 
    app: service-level-operator 
    component: app 
spec: 
  ports: 
    - port: 80 
      protocol: TCP 
      name: http 
      targetPort: http 
  selector: 
    app: service-level-operator 
    component: app

(4) Create the Prometheus ServiceMonitor

apiVersion: monitoring.coreos.com/v1 
kind: ServiceMonitor 
metadata: 
  name: service-level-operator 
  namespace: monitoring 
  labels: 
    app: service-level-operator 
    component: app 
    prometheus: myprometheus 
spec: 
  selector: 
    matchLabels: 
      app: service-level-operator 
      component: app 
  namespaceSelector: 
    matchNames: 
      - monitoring  
  endpoints: 
    - port: http 
      interval: 10s

At this point Service Level Operator is deployed, and the corresponding target can be seen in Prometheus.

Next, create the service level definition for the application. The following is an example.

apiVersion: monitoring.spotahome.com/v1alpha1 
kind: ServiceLevel 
metadata: 
  name: prometheus-grafana-service 
  namespace: monitoring 
spec: 
  serviceLevelObjectives: 
    - name: "9999_http_request_lt_500" 
      description: 99.99% of requests must be served with <500 status code. 
      disable: false 
      availabilityObjectivePercent: 99.99 
      serviceLevelIndicator: 
        prometheus: 
          address: http://prometheus-k8s.monitoring.svc:9090 
          totalQuery: sum(increase(http_request_total{service="grafana"}[2m])) 
          errorQuery: sum(increase(http_request_total{service="grafana", code=~"5.."}[2m])) 
      output: 
        prometheus: 
          labels: 
            team: prometheus-grafana  
            iteration: "3"

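The definition above sets a "four nines" SLO for the grafana application.

The resulting metrics can then be seen in Prometheus.

Next, import the dashboard with ID 8793 into Grafana to generate the charts.

The upper part shows the SLI; the lower part shows the total error budget and the errors consumed so far.

Alerting rules can then be defined so that you are notified as soon as the SLO degrades, for example: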

groups: 
  - name: slo.rules 
    rules: 
      - alert: SLOErrorRateTooFast1h 
        expr: | 
          ( 
            increase(service_level_sli_result_error_ratio_total[1h]) 
            / 
            increase(service_level_sli_result_count_total[1h]) 
          ) > (1 - service_level_slo_objective_ratio) * 14.6 
        labels: 
          severity: critical 
          team: a-team 
        annotations: 
          summary: The monthly SLO error budget consumed for 1h is greater than 2% 
          description: The error rate for 1h in the {{$labels.service_level}}/{{$labels.slo}} SLO error budget is being consumed too fast, is greater than 2% monthly budget. 
      - alert: SLOErrorRateTooFast6h 
        expr: | 
          ( 
            increase(service_level_sli_result_error_ratio_total[6h]) 
            / 
            increase(service_level_sli_result_count_total[6h]) 
          ) > (1 - service_level_slo_objective_ratio) * 6 
        labels: 
          severity: critical 
          team: a-team 
        annotations: 
          summary: The monthly SLO error budget consumed for 6h is greater than 5% 
          description: The error rate for 6h in the {{$labels.service_level}}/{{$labels.slo}} SLO error budget is being consumed too fast, is greater than 5% monthly budget.

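The first rule fires when the error rate over the last 1h is high enough to consume more than 2% of the 30-day error budget. The second rule fires when the error rate over the last 6h consumes more than 5% of the 30-day error budget.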
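The multipliers in the two expressions can be read as burn rates: how many times faster than "exactly on budget" errors may occur before the stated share of the budget is gone within the alert window. The sketch below derives them under stated assumptions. The function name is hypothetical, and reading the 14.6 factor as a 730-hour average month (365 × 24 / 12) is my interpretation, not something the rules state.

```python
def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_hours: float = 30 * 24) -> float:
    """Burn rate at which `budget_fraction` of the period's error budget
    is consumed within `window_hours`.

    A burn rate of 1 means the budget lasts exactly one full period.
    """
    return budget_fraction * period_hours / window_hours


if __name__ == "__main__":
    # 2% of the budget in 1h and 5% in 6h, over a 30-day (720h) period.
    print(burn_rate_threshold(0.02, 1))
    print(burn_rate_threshold(0.05, 6))
    # The same 2%-in-1h rule over a 730h average month (an assumed reading of 14.6).
    print(burn_rate_threshold(0.02, 1, period_hours=730))
    # Error-ratio threshold for a 99.99% objective with a 14.6 multiplier.
    print((1 - 0.9999) * 14.6)
```

For a 99.99% objective, the 1h rule therefore fires once roughly 0.146% of requests in the last hour have failed.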

These thresholds follow Google's baseline values.

Final thoughts

Any discussion of system stability has to mention system availability. SRE improves stability ultimately to increase the time a system is available and reduce the time it is failing. So how is availability measured?

The industry currently measures availability in two ways: by time and by requests. The time dimension evaluates stability starting from failures. The request dimension evaluates it from the share of successful requests.

Time dimension: availability = service time / (service time + failure time)

Request dimension: availability = successful requests / total requests
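The two formulas can be written out as a small sketch. The function names and the sample figures are made up for illustration only.

```python
def time_availability(service_minutes: float, failure_minutes: float) -> float:
    """Time dimension: service time / (service time + failure time)."""
    total = service_minutes + failure_minutes
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return service_minutes / total


def request_availability(successful_requests: int, total_requests: int) -> float:
    """Request dimension: successful requests / total requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


if __name__ == "__main__":
    # Hypothetical month: 5 minutes of outage in 30 days (43,200 minutes).
    print(f"time-based:    {time_availability(43_195, 5):.5%}")
    # Hypothetical month: 100 failed requests out of 1,000,000.
    print(f"request-based: {request_availability(999_900, 1_000_000):.5%}")
```

The two numbers can differ for the same incident: a short outage during peak traffic costs few minutes but many requests, while a long outage at night does the opposite.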

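In SRE practice the request dimension is usually chosen to measure stability, as in the example above. However, judging stability by a single dimension is somewhat arbitrary. More indicators, such as latency and error rate, should be combined, and the SLIs for core applications and core request paths should be more fine-grained.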

References

[1] 《SRE实践手册》, Zhao Cheng

[2] 《SRE：Google运维解密》

[3] https://github.com/spotahome/service-level-operator

Editor: Jiang Hua (姜华). Source: 运维开发故事
