Exploring PrometheusRule: A Powerful Tool for Monitoring and Alerting

Author: 劉俊夏, 2025-01-17 09:54:54

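Introduction

If you have followed this series through every deep dive so far, you have already done well, so hang in there a little longer.

Today's topic is PrometheusRule, a CRD dedicated to defining alerting rules (and, as we will see, recording rules). It is an important part of the stack, so today we get up close with it and study it thoroughly.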

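Getting Started

Where do alerts come from, and how do they reach us? In a hand-rolled setup, we point Prometheus at the AlertManager instances and the alerting rules files directly in its configuration file. What about a deployment managed by the Operator? The Config page of the Prometheus dashboard shows the AlertManager-related configuration: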

alerting:
  alert_relabel_configs:
  - separator: ;
    regex: prometheus_replica
    replacement: $1
    action: labeldrop
  alertmanagers:
  - follow_redirects: true
    enable_http2: true
    scheme: http
    path_prefix: /
    timeout: 10s
    api_version: v2
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: alertmanager-main
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: web
      replacement: $1
      action: keep
    kubernetes_sd_configs:
    - role: endpoints
      kubeconfig_file: ""
      follow_redirects: true
      enable_http2: true
      namespaces:
        own_namespace: false
        names:
        - monitoring
rule_files:
- /etc/prometheus/rules/prometheus-k8s-rulefiles-0/*.yaml

The alertmanagers section above is resolved through Kubernetes service discovery with role endpoints. Its relabel rules keep only targets whose Service name is alertmanager-main and whose port name is web. We can inspect the alertmanager-main Service:

$ kubectl describe svc alertmanager-main -n monitoring
Name:              alertmanager-main
Namespace:         monitoring
Labels:            app.kubernetes.io/component=alert-router
                   app.kubernetes.io/instance=main
                   app.kubernetes.io/name=alertmanager
                   app.kubernetes.io/part-of=kube-prometheus
                   app.kubernetes.io/version=0.27.0
Annotations:       <none>
Selector:          app.kubernetes.io/component=alert-router,app.kubernetes.io/instance=main,app.kubernetes.io/name=alertmanager,app.kubernetes.io/part-of=kube-prometheus
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                172.16.29.21
IPs:               172.16.29.21
Port:              web  9093/TCP
TargetPort:        web/TCP
Endpoints:         192.168.0.173:9093,192.168.0.175:9093,192.168.0.176:9093
Port:              reloader-web  8080/TCP
TargetPort:        reloader-web/TCP
Endpoints:         192.168.0.173:8080,192.168.0.175:8080,192.168.0.176:8080
Session Affinity:  ClientIP
Events:            <none>

The Service is indeed named alertmanager-main and its port is named web, which matches the rules above, so Prometheus and AlertManager are wired together correctly. The alerting rules are all the YAML files under /etc/prometheus/rules/prometheus-k8s-rulefiles-0/. We can open a shell in the Prometheus Pod to check that the directory contains YAML files:

$ kubectl exec -it prometheus-k8s-0 -n monitoring -- sh

/prometheus $ cd /etc/prometheus/rules/prometheus-k8s-rulefiles-0/
/etc/prometheus/rules/prometheus-k8s-rulefiles-0 $ ls
monitoring-alertmanager-main-rules-b6e33381-7319-4dd3-8b96-6b99ec46bcf3.yaml
monitoring-grafana-rules-eedbd431-e04a-4a08-83a0-17cbd96fe942.yaml
monitoring-kube-prometheus-rules-0da776fa-e5f4-498e-bcc6-5e310b2995ba.yaml
monitoring-kube-state-metrics-rules-2242037f-5eaf-41a7-b122-036d05526c6b.yaml
monitoring-kubernetes-monitoring-rules-e31ffca7-ac96-4f9e-bb82-6a37e1ce6500.yaml
monitoring-prometheus-k8s-prometheus-rules-55f68dd4-e3f1-45ca-9c5b-bb6b952d2ae4.yaml
monitoring-prometheus-operator-rules-6e51dbe0-410b-468e-b4f2-84b32cd32665.yaml

Each of these YAML files holds the content of a PrometheusRule object that was created earlier. The Grafana file, for example, comes from this manifest:

$ cat grafana-prometheusrule.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app.kubernetes.io/component: grafana
    app.kubernetes.io/name: grafana
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 11.4.0
    prometheus: k8s
    role: alert-rules
  name: grafana-rules
  namespace: monitoring
spec:
  groups:
  - name: GrafanaAlerts
    rules:
    - alert: GrafanaRequestsFailing
      annotations:
        message: '{{ $labels.namespace }}/{{ $labels.job }}/{{ $labels.handler }} is experiencing {{ $value | humanize }}% errors'
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/grafana/grafanarequestsfailing
      expr: |
        100 * namespace_job_handler_statuscode:grafana_http_request_duration_seconds_count:rate5m{handler!~"/api/datasources/proxy/:id.*|/api/ds/query|/api/tsdb/query", status_code=~"5.."}
        / ignoring (status_code)
        sum without (status_code) (namespace_job_handler_statuscode:grafana_http_request_duration_seconds_count:rate5m{handler!~"/api/datasources/proxy/:id.*|/api/ds/query|/api/tsdb/query"})
        > 50
      for: 5m
      labels:
        severity: warning
  - name: grafana_rules
    rules:
    - expr: |
        sum by (namespace, job, handler, status_code) (rate(grafana_http_request_duration_seconds_count[5m]))
      record: namespace_job_handler_statuscode:grafana_http_request_duration_seconds_count:rate5m

The PrometheusRule here is named grafana-rules and lives in the monitoring namespace. Comparing it with the directory listing, we can infer that creating a PrometheusRule object automatically generates a corresponding file under prometheus-k8s-rulefiles-0, named {namespace}-{name} plus a UUID-like suffix (which appears to be the object's UID). So to add a custom alert later, all we need to do is define a PrometheusRule object. How does Prometheus recognize such an object? The answer is in the prometheus resource we created: its ruleSelector field is a label selector for rules, and here it requires PrometheusRule objects to carry the labels prometheus=k8s and role=alert-rules.

Setting ruleSelector is optional and depends on your scenario, but it is best to set it. The previous article on the Prometheus resource did not cover it, so we fill that gap here:

ruleSelector:
  matchLabels:
    prometheus: k8s
    role: alert-rules

The main difference between setting and not setting ruleSelector is how the Prometheus instance selects and loads alerting rules.

What is ruleSelector

ruleSelector is a configuration option of the Prometheus Operator that specifies which PrometheusRule resources a Prometheus instance loads. Because it works by label matching, it gives precise control over where an instance's rules come from.
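
To make the placement concrete, here is a minimal sketch of where ruleSelector sits in a Prometheus resource. The object name k8s is inferred from the Pod name prometheus-k8s-0 seen earlier; the rest of the spec is omitted, so treat this as a fragment rather than a complete manifest. The ruleNamespaceSelector line is included only to show the companion field, and its comment reflects the Operator API reference as I understand it.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s                 # inferred from the Pod name prometheus-k8s-0
  namespace: monitoring
spec:
  # Only PrometheusRule objects carrying BOTH labels are loaded.
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  # Empty selector: look for PrometheusRule objects in all namespaces.
  # Leaving this field out restricts the search to the Prometheus
  # object's own namespace.
  ruleNamespaceSelector: {}
```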

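Configuration Differences

Without a label filter

• Behavior

a. The Prometheus instance tries to load every PrometheusRule resource in its own namespace (or in other namespaces, depending on ruleNamespaceSelector). Be aware that, per the Prometheus Operator API reference, this behavior corresponds to an explicitly empty selector (ruleSelector: {}). When the field is left out entirely, the selector is null and matches no PrometheusRule at all.

• Advantages

a. Simple to use; individual PrometheusRule resources need no extra labels.

b. Suitable for small deployments or development environments.

• Disadvantages

a. In production, the Prometheus instance may load unrelated alerting rules, which adds load.

b. With several Prometheus instances running, rules can easily conflict or be loaded more than once.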
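
One detail is worth checking against your Operator version: the Prometheus Operator API reference describes an empty label selector as matching all objects and a null (omitted) selector as matching none. Under that reading, "load everything" is spelled with an explicit empty selector, as in this fragment of a Prometheus spec:

```yaml
spec:
  # Empty selector: every PrometheusRule in the searched namespaces is loaded.
  # Omitting the field entirely would select no rules at all.
  ruleSelector: {}
```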

With a ruleSelector label filter

ruleSelector:
  matchLabels:
    prometheus: k8s
    role: alert-rules

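• Behavior

The Prometheus instance loads only the PrometheusRule resources whose labels match ruleSelector. With the selector above, for example, only a PrometheusRule that carries both of the following labels is loaded:

a. prometheus: k8s

b. role: alert-rules

• Advantages

a. Precise control over which alerting rules the Prometheus instance loads, so unrelated rules stay out.

b. With multiple Prometheus instances, labels isolate the rule sources from one another.

c. Easier management of rule sets in large deployments.

• Disadvantages

a. Every PrometheusRule resource needs the matching labels.

b. Slightly more configuration, and a forgotten label means a rule is silently skipped.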

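Applying This in Practice

Scenario 1: a single Prometheus instance

If the cluster runs only one Prometheus instance, you can skip label filtering and let it load all rules automatically (with the Prometheus Operator this means an empty selector, ruleSelector: {}, since an omitted selector matches nothing). This approach suits simple deployments and test environments.

Scenario 2: multiple Prometheus instances

In more complex environments, such as a cluster shared by several teams or monitoring that spans environments, use ruleSelector to give each Prometheus instance its own set of alerting rules:

• Different teams maintain different rules, isolated by labels.
• Production and test each get their own Prometheus, loading different rules.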
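
The label-based isolation described above could look like the following sketch. Everything named here is hypothetical: the team label key, the two instance names, the payments-api job, and the alert itself. The Prometheus specs are reduced to the selector, so they are fragments rather than deployable manifests.

```yaml
# Hypothetical instance for the platform team
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: platform
  namespace: monitoring
spec:
  ruleSelector:
    matchLabels:
      team: platform
---
# Hypothetical instance for the payments team
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: payments
  namespace: monitoring
spec:
  ruleSelector:
    matchLabels:
      team: payments
---
# Loaded only by the "payments" instance, because of its team label
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-rules
  namespace: monitoring
  labels:
    team: payments
spec:
  groups:
  - name: payments.rules
    rules:
    - alert: PaymentsTargetDown
      expr: up{job="payments-api"} == 0
      for: 5m
      labels:
        severity: critical
```

A rule that should be evaluated by both instances would need a label both selectors match, which is one reason to plan the label scheme before splitting instances.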

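Summary

Configuration	Loading behavior	Advantages	Disadvantages
No label filter (ruleSelector: {})	Loads all PrometheusRule resources	Simple to use; every rule is loaded automatically	Unsuited to large environments; rule management becomes messy
ruleSelector with labels	Loads only PrometheusRule resources whose labels match	Precise control over rule loading, avoids conflicts, suited to production	More configuration; labels must be set correctly

In production, setting ruleSelector is recommended so that alerting rules are easier to manage and isolate.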

PrometheusRule in Depth: Writing Effective Alerting Rules

PrometheusRule is a custom resource provided by the Prometheus Operator for defining and managing Prometheus alerting rules and recording rules. These rules are what give Prometheus its alerting and data-processing capabilities in a monitoring setup.

The rest of this article covers the structure of PrometheusRule, how to configure it, typical use cases, and how to tune an alerting strategy to real needs.

What PrometheusRule Does

PrometheusRule serves two main purposes:

Alerting Rules

• Define conditions on monitoring data; an alert fires when a condition is met.

• For example, raise a warning when an application's request latency exceeds a threshold.

Recording Rules

• Periodically evaluate an expensive query and store the result as a new time series, which speeds up later queries.

• For example, store the average CPU usage over a time window as a new metric.

PrometheusRule keeps rule definitions separate from the management of Prometheus itself, so rules can be grouped flexibly and managed in one place.

The Basic Structure of PrometheusRule

A PrometheusRule resource consists of the following key parts:

Metadata

Defines the name, namespace, and labels of the PrometheusRule.

metadata:
  name: example-rules
  namespace: monitoring
  labels:
    app.kubernetes.io/name: prometheus

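The spec section

spec defines rule groups (groups); each group contains one or more rules.

• groups:

a. name: the name of the rule group, used for logical grouping.

b. rules: the list of rules, each either an alerting rule or a recording rule.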

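The overall shape of a PrometheusRule looks like this (the rule entries themselves are described field by field in the next section):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules
  namespace: monitoring
spec:
  groups:
  - name: example-group
    rules: []  # alerting and recording rule entries go here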

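Rule Fields in Detail

Alerting Rules

The core fields of an alerting rule are:

alert

• The alert name; a concise description of what the alert is about.

• Example: HighRequestLatency.

expr

• A PromQL expression that defines the condition under which the alert fires.

• Example: rate(http_request_duration_seconds_bucket{le="0.5"}[5m]) < 0.1 is true when, averaged over the last 5 minutes, fewer than 0.1 requests per second completed within 0.5 seconds. Note that this compares a rate, not a percentage; to alert on the proportion of slow requests, divide by the rate of http_request_duration_seconds_count.

for

• Duration. How long the condition must hold continuously before the alert fires.

• Example: 2m means the condition must hold for 2 minutes.

labels

• Custom labels used to classify alerts.

• Example: severity: warning.

annotations

• Annotations, typically used for a detailed description of the alert.

• summary: a brief summary.

• description: a detailed description; it may contain template variables such as {{ $labels.instance }}.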
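
As an illustration of how these fields fit together, here is a sketch of a single alerting rule entry, which would sit under a group's rules list. It assumes the application exports a histogram named http_request_duration_seconds with an le="0.5" bucket and a job label; the 10% threshold and the annotation wording are placeholders, not values from the original article.

```yaml
- alert: HighRequestLatency
  # Fraction of requests slower than 0.5s over the last 5 minutes, per job.
  expr: |
    1 - (
      sum by (job) (rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
      /
      sum by (job) (rate(http_request_duration_seconds_count[5m]))
    ) > 0.1
  # The condition must hold for 2 minutes before the alert fires.
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "High request latency on {{ $labels.job }}"
    description: "{{ $value | humanizePercentage }} of requests to {{ $labels.job }} took longer than 0.5s over the last 5 minutes."
```

Dividing the bucket rate by the count rate turns the expression into a proportion, so the threshold reads as "more than 10% of requests were slow" regardless of traffic volume.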

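Recording Rules

The core fields of a recording rule are:

record

• The record name; it becomes the name of the newly generated time series.

• Example: job:http_inprogress_requests:sum.

expr

• A PromQL expression that computes the new metric.

• Example: sum(http_inprogress_requests) by (job) sums the HTTP requests currently in progress, grouped by job.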
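
Put into a full manifest, the recording rule described above might look like the following sketch. The object and group names are hypothetical, the two labels are the ones the ruleSelector shown earlier in this article expects, and the interval value is only an example of the optional per-group evaluation interval.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-recording-rules   # hypothetical name
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
  - name: example-recording       # hypothetical group name
    interval: 30s                 # optional; defaults to the global evaluation interval
    rules:
    # Stores the per-job sum as a new series that dashboards can query cheaply.
    - record: job:http_inprogress_requests:sum
      expr: sum by (job) (http_inprogress_requests)
```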

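Use Cases

System resource alerts

Monitor node resource usage, such as CPU, and set an alerting rule. The expression below fires when the idle share of CPU time on an instance stays under 20%:

alert: HighCpuUsage
expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.2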
for: 1m
labels:
  severity: critical
annotations:
  summary: "High CPU usage detected"
  description: "Instance {{ $labels.instance }} has high CPU usage."

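Application performance alerts

For example, monitor the HTTP request error rate. Both sides are aggregated per instance so that the division yields the share of 5xx responses:

alert: HighHttpErrorRate
expr: sum by (instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total[5m])) > 0.05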
for: 5m
labels:
  severity: warning
annotations:
  summary: "High HTTP error rate detected"
  description: "Instance {{ $labels.instance }} has an HTTP error rate above 5%."

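Data aggregation and optimization

Recording rules can speed up queries. For example, record the per-job request rate (the record name follows the level:metric:operations convention, so it ends in rate5m rather than total):

record: job:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (job)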
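
One way to see the benefit is to let an alert reuse the precomputed series. The sketch below is self-contained and defines its own recording rule; the group name, the alert name, the 1 req/s threshold, and the 10m duration are all hypothetical values chosen for illustration.

```yaml
groups:
- name: http.rules
  rules:
  # Rules within a group are evaluated in order, so the recorded
  # series exists before the alert below is evaluated.
  - record: job:http_requests:rate5m
    expr: sum by (job) (rate(http_requests_total[5m]))
  # Reuses the recorded series instead of repeating the aggregation.
  - alert: HttpTrafficDropped
    expr: job:http_requests:rate5m < 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Request rate for {{ $labels.job }} is below 1 req/s"
```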

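Going Further

To define a custom alerting rule, all we need is a PrometheusRule object carrying the labels prometheus=k8s and role=alert-rules. Here I declare one for ArgoCD. The metric names are the ones used in this example, so check them against what your ArgoCD version actually exposes on its metrics endpoints before applying it: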

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-rules
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
    app.kubernetes.io/name: argocd
    app.kubernetes.io/part-of: argocd
spec:
  groups:
  - name: argocd.rules
    rules:
    # ArgoCD Application Controller high-latency alert
    - alert: ArgoCDApplicationControllerHighLatency
      expr: histogram_quantile(0.95, rate(argocd_app_controller_reconciliation_duration_seconds_bucket[5m])) > 1
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "ArgoCD Application Controller Reconciliation Latency High ({{ $labels.namespace }})"
        description: "The reconciliation latency for ArgoCD application controller in namespace {{ $labels.namespace }} is over 1 second for more than 2 minutes."
        
    # ArgoCD Server high request error rate alert
    - alert: ArgoCDServerHighRequestErrorRate
      # Share of 5xx responses among all requests, per namespace
      expr: |
        sum by (namespace) (rate(argocd_server_http_requests_total{status=~"5.."}[5m]))
        / sum by (namespace) (rate(argocd_server_http_requests_total[5m])) > 0.05
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "ArgoCD Server High Error Rate ({{ $labels.namespace }})"
        description: "The ArgoCD server in namespace {{ $labels.namespace }} has an HTTP error rate over 5% for the past 5 minutes."

    # ArgoCD Repo Server sync failure alert
    - alert: ArgoCDRepoServerSyncFailed
      expr: increase(argocd_repo_server_git_request_failures_total[5m]) > 5
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "ArgoCD Repo Server Sync Failures ({{ $labels.namespace }})"
        description: "ArgoCD Repo Server in namespace {{ $labels.namespace }} has more than 5 sync failures in the last 5 minutes."

    # ArgoCD Dex Server unavailable alert
    - alert: ArgoCDDexServerDown
      expr: up{job="argocd-dex-server"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "ArgoCD Dex Server Down ({{ $labels.namespace }})"
        description: "The ArgoCD Dex Server in namespace {{ $labels.namespace }} is not running for more than 1 minute."

    # ArgoCD Application unhealthy alert
    - alert: ArgoCDApplicationUnhealthy
      expr: argocd_app_info{health_status!="Healthy"} > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "ArgoCD Applications Unhealthy ({{ $labels.namespace }})"
        description: "There are applications in ArgoCD that have not been in a healthy state for more than 5 minutes."

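Tuning the Alerting Strategy

Avoid alert storms

• Use the for field so that short-lived fluctuations do not trigger a flood of alerts.

Tiered alerts

• Define alerting rules at different severity levels, such as critical and warning.

Base thresholds on historical data

• Query historical data to determine reasonable alert thresholds.

Review and tune rules regularly

• Periodically check how often and how accurately each rule fires, and adjust its expression or threshold.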
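
The first two points can be combined in one rule group, as in the sketch below. It assumes node_exporter filesystem metrics are being scraped; the group name, alert names, thresholds, and durations are hypothetical and would normally come from looking at historical data, as suggested above.

```yaml
groups:
- name: disk.rules               # hypothetical group name
  rules:
  # Early warning: less than 20% free space, sustained for 15 minutes.
  - alert: DiskSpaceLow
    expr: |
      node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
        / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.20
    for: 15m
    labels:
      severity: warning
  # Urgent: less than 5% free space; a shorter "for" so it escalates quickly.
  - alert: DiskSpaceCritical
    expr: |
      node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
        / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.05
    for: 5m
    labels:
      severity: critical
```

Because the critical condition implies the warning condition, both alerts fire together once free space drops below 5%; the severity label is what lets the routing layer treat them differently.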

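Summary

PrometheusRule provides powerful management of alerting and recording rules. Well-designed rules can noticeably improve the reliability and availability of a monitoring system. In practice, tune the rules to your business needs and system characteristics, and combine them with Prometheus's fast query engine to build an efficient monitoring and alerting setup.

Closing

That wraps up PrometheusRule. How well has it sunk in? We will find out soon, because the final exam of this series is just around the corner. I look forward to seeing how you do, so leave no regrets!

Editor: 武曉燕  Source: 云原生運維圈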
