
When monitoring a Kubernetes cluster with Prometheus, kube-state-metrics (KSM) is practically a required component. It watches the API server and generates metrics about the state of resource objects. It is not concerned with the health of individual Kubernetes components but with the state of resource objects such as Deployment, Node, Pod, Ingress, Job, and Service. Each resource type exposes its own set of metrics, which are listed in the official documentation at https://github.com/kubernetes/kube-state-metrics/tree/main/docs.
Installing KSM is straightforward: the repository ships the resource manifests. Make sure, though, that the KSM version you install matches your Kubernetes cluster version.

My test cluster runs v1.25, so I first check out the matching release tag:
$ git clone https://github.com/kubernetes/kube-state-metrics && cd kube-state-metrics
$ git checkout v2.7.0
$ kubectl apply -f examples/standard
This deploys a single KSM instance as a Deployment:
$ kubectl get deploy -n kube-system kube-state-metrics
NAME READY UP-TO-DATE AVAILABLE AGE
kube-state-metrics 1/1 1 1 2m49s
$ kubectl get pods -n kube-system -l app.kubernetes.io/name=kube-state-metrics
NAME READY STATUS RESTARTS AGE
kube-state-metrics-548546fc89-zgkx5 1/1 Running 0 2m51s
After that, Prometheus only needs to discover the KSM instance. There are several ways to do this: automatic discovery through annotations, a dedicated scrape job for KSM, or, if you use Prometheus Operator, a ServiceMonitor object that collects the KSM metrics.
This setup works well enough for small clusters: the data volume is modest and the service responds normally, so it can run in production as long as KSM is highly available. For large clusters it becomes very difficult. For example, we have a cluster with roughly 8K Pods, which is not especially large.

Yet serving all metrics from a single KSM instance is already a strain. Most scrapes may fail to return any metrics at all, because the metrics endpoint returns too much data.

Even when a scrape occasionally succeeds, it takes a long time. Keep in mind that Prometheus hits this endpoint every scrape_interval, so one request may still be in flight when the next one is due. The fix has to come from the KSM side. Among KSM's startup flags are options for filtering out metrics and labels you do not need:
$ kube-state-metrics -h
kube-state-metrics is a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects.
Usage:
kube-state-metrics [flags]
kube-state-metrics [command]
Available Commands:
completion Generate completion script for kube-state-metrics.
help Help about any command
version Print version information.
Flags:
--add_dir_header If true, adds the file directory to the header of the log messages
--alsologtostderr log to standard error as well as files (no effect when -logtostderr=true)
--apiserver string The URL of the apiserver to use as a master
--config string Path to the kube-state-metrics options config file
--custom-resource-state-config string Inline Custom Resource State Metrics config YAML (experimental)
--custom-resource-state-config-file string Path to a Custom Resource State Metrics config file (experimental)
--custom-resource-state-only Only provide Custom Resource State metrics (experimental)
--enable-gzip-encoding Gzip responses when requested by clients via 'Accept-Encoding: gzip' header.
-h, --help Print Help text
--host string Host to expose metrics on. (default "::")
--kubeconfig string Absolute path to the kubeconfig file
--log_backtrace_at traceLocation when logging hits line file:N, emit a stack trace (default :0)
--log_dir string If non-empty, write log files in this directory (no effect when -logtostderr=true)
--log_file string If non-empty, use this log file (no effect when -logtostderr=true)
--log_file_max_size uint Defines the maximum size a log file can grow to (no effect when -logtostderr=true). Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
--logtostderr log to standard error instead of files (default true)
--metric-allowlist string Comma-separated list of metrics to be exposed. This list comprises of exact metric names and/or regex patterns. The allowlist and denylist are mutually exclusive.
--metric-annotations-allowlist string Comma-separated list of Kubernetes annotations keys that will be used in the resource' labels metric. By default the metric contains only name and namespace labels. To include additional annotations provide a list of resource names in their plural form and Kubernetes annotation keys you would like to allow for them (Example: '=namespaces=[kubernetes.io/team,...],pods=[kubernetes.io/team],...)'. A single '*' can be provided per resource instead to allow any annotations, but that has severe performance implications (Example: '=pods=[*]').
--metric-denylist string Comma-separated list of metrics not to be enabled. This list comprises of exact metric names and/or regex patterns. The allowlist and denylist are mutually exclusive.
--metric-labels-allowlist string Comma-separated list of additional Kubernetes label keys that will be used in the resource' labels metric. By default the metric contains only name and namespace labels. To include additional labels provide a list of resource names in their plural form and Kubernetes label keys you would like to allow for them (Example: '=namespaces=[k8s-label-1,k8s-label-n,...],pods=[app],...)'. A single '*' can be provided per resource instead to allow any labels, but that has severe performance implications (Example: '=pods=[*]'). Additionally, an asterisk (*) can be provided as a key, which will resolve to all resources, i.e., assuming '--resources=deployments,pods', '=*=[*]' will resolve to '=deployments=[*],pods=[*]'.
--metric-opt-in-list string Comma-separated list of metrics which are opt-in and not enabled by default. This is in addition to the metric allow- and denylists
--namespaces string Comma-separated list of namespaces to be enabled. Defaults to ""
--namespaces-denylist string Comma-separated list of namespaces not to be enabled. If namespaces and namespaces-denylist are both set, only namespaces that are excluded in namespaces-denylist will be used.
--node string Name of the node that contains the kube-state-metrics pod. Most likely it should be passed via the downward API. This is used for daemonset sharding. Only available for resources (pod metrics) that support spec.nodeName fieldSelector. This is experimental.
--one_output If true, only write logs to their native severity level (vs also writing to each lower severity level; no effect when -logtostderr=true)
--pod string Name of the pod that contains the kube-state-metrics container. When set, it is expected that --pod and --pod-namespace are both set. Most likely this should be passed via the downward API. This is used for auto-detecting sharding. If set, this has preference over statically configured sharding. This is experimental, it may be removed without notice.
--pod-namespace string Name of the namespace of the pod specified by --pod. When set, it is expected that --pod and --pod-namespace are both set. Most likely this should be passed via the downward API. This is used for auto-detecting sharding. If set, this has preference over statically configured sharding. This is experimental, it may be removed without notice.
--port int Port to expose metrics on. (default 8080)
--resources string Comma-separated list of Resources to be enabled. Defaults to "certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments"
--shard int32 The instances shard nominal (zero indexed) within the total number of shards. (default 0)
--skip_headers If true, avoid header prefixes in the log messages
--skip_log_headers If true, avoid headers when opening log files (no effect when -logtostderr=true)
--stderrthreshold severity logs at or above this threshold go to stderr when writing to files and stderr (no effect when -logtostderr=true or -alsologtostderr=false) (default 2)
--telemetry-host string Host to expose kube-state-metrics self metrics on. (default "::")
--telemetry-port int Port to expose kube-state-metrics self metrics on. (default 8081)
--tls-config string Path to the TLS configuration file
--total-shards int The total number of shards. Sharding is disabled when total shards is set to 1. (default 1)
--use-apiserver-cache Sets resourceVersion=0 for ListWatch requests, using cached resources from the apiserver instead of an etcd quorum read.
-v, --v Level number for the log level verbosity
--vmodule moduleSpec comma-separated list of pattern=N settings for file-filtered logging
Use "kube-state-metrics [command] --help" for more information about a command.
You can filter with --metric-allowlist or --metric-denylist. But what if the metrics endpoint still returns a very large payload even after unneeded metrics and labels are filtered out?
If the payload of a single request to the metrics endpoint stays large no matter how much you filter, the only remaining option is to split the metric data: deploy multiple KSM instances and let each one serve part of the data, which relieves the pressure. This is what is usually called horizontal sharding. kube-state-metrics has implemented some sharding capabilities for this purpose, configured through the following flags:
- --shard (zero-indexed)
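As a rough illustration of how the two lists behave, here is a minimal Python sketch. It relies only on what the help text states (entries are exact metric names and/or regex patterns, and the two lists are mutually exclusive). The assumption that a pattern must match the whole metric name, and the function name `is_exposed`, are mine and not taken from KSM.

```python
import re


def is_exposed(metric, allowlist=(), denylist=()):
    """Return True if `metric` would be exposed under the given lists.

    Assumption: every entry is an exact name or a regex that must match
    the whole metric name. Only one of the two lists may be non-empty.
    """
    if allowlist and denylist:
        raise ValueError("allowlist and denylist are mutually exclusive")

    def matches(patterns):
        return any(re.fullmatch(p, metric) for p in patterns)

    if allowlist:
        return matches(allowlist)
    # With no allowlist, everything is exposed unless it is denied.
    return not matches(denylist)


deny = ["kube_pod_status_reason", "kube_.*_annotations"]
print(is_exposed("kube_pod_info", denylist=deny))         # True
print(is_exposed("kube_pod_annotations", denylist=deny))  # False
```

A denylist is usually the smaller change when only a few high-cardinality metrics cause trouble; an allowlist suits setups where only a known handful of metrics is consumed.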
- --total-shards
Sharding works by taking an MD5 hash of each Kubernetes object's UID and applying a modulo operation with the total number of shards; each shard then decides whether the object is handled by its instance of kube-state-metrics. Note, however, that every kube-state-metrics instance, even when sharded, still incurs the network traffic and resource consumption for all objects, not only for the ones it is responsible for. Optimizing this would require the Kubernetes API to support sharded list/watch. In the best case, each shard's memory consumption is 1/n of that of an unsharded setup. In general, kube-state-metrics needs memory and latency tuning so that it can return its metrics to Prometheus quickly. One way to reduce latency between kube-state-metrics and kube-apiserver is to run KSM with the --use-apiserver-cache flag. Besides reducing latency, this option also lowers the load on etcd, so we recommend enabling it.
When running in sharded mode, it is best to monitor the sharding-related metrics to make sure the setup behaves as expected. The following two alerting rules can be used:
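The hash-and-modulo scheme described above can be sketched in a few lines of Python. This follows the description in the text (MD5 of the object UID, modulo the total number of shards); the hash actually used by a given KSM release may differ, and the function names and UIDs below are hypothetical.

```python
import hashlib


def shard_for(uid: str, total_shards: int) -> int:
    """Map an object UID to a shard index in [0, total_shards)."""
    digest = hashlib.md5(uid.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % total_shards


def keep(uid: str, shard: int, total_shards: int) -> bool:
    """Decide whether the instance running as `shard` handles this object."""
    return shard_for(uid, total_shards) == shard


# Hypothetical UIDs, for illustration only.
uids = [
    "5f1c2e1a-0b7d-4c55-9a57-2d1f0c6a9e01",
    "a3d9b7f4-6c21-4e0e-8f3b-7b9f4d2c1e02",
    "0e8c4b6d-91aa-4f7c-b2d3-3c5e6f7a8b03",
]
total = 4
for uid in uids:
    owners = [s for s in range(total) if keep(uid, s, total)]
    print(uid, "-> handled by shard", owners[0])
```

Every instance runs the same check on every object it receives, which is why each shard still pays the cost of receiving and decoding all objects even though it only keeps its own share.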
- alert: KubeStateMetricsShardingMismatch
  annotations:
    description: kube-state-metrics pods are running with different --total-shards configuration, some Kubernetes objects may be exposed multiple times or not exposed at all.
    summary: kube-state-metrics sharding is misconfigured.
  expr: |
    stdvar (kube_state_metrics_total_shards{job="kube-state-metrics"}) != 0
  for: 15m
  labels:
    severity: critical
- alert: KubeStateMetricsShardsMissing
  annotations:
    description: kube-state-metrics shards are missing, some Kubernetes objects are not being exposed.
    summary: kube-state-metrics shards are missing.
  expr: |
    2^max(kube_state_metrics_total_shards{job="kube-state-metrics"}) - 1
      -
    sum( 2 ^ max by (shard_ordinal) (kube_state_metrics_shard_ordinal{job="kube-state-metrics"}) )
    != 0
  for: 15m
  labels:
    severity: critical
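The arithmetic behind these two expressions is easier to see with concrete numbers. The Python sketch below mirrors it outside of PromQL: the first rule fires when instances disagree on the total shard count (non-zero variance), and the second treats each shard ordinal as one bit and fires when any expected bit is missing. The function names are illustrative only.

```python
import statistics


def sharding_mismatch(total_shards_per_instance):
    """Mirror of `stdvar(total_shards) != 0`: do instances disagree?"""
    return statistics.pvariance(total_shards_per_instance) != 0


def missing_shards_value(total_shards, reporting_ordinals):
    """Mirror of `2^total - 1 - sum(2^ordinal)`; non-zero means a gap."""
    expected = 2 ** total_shards - 1  # bits 0..total_shards-1 all set
    # `max by (shard_ordinal)` collapses duplicates, hence the set().
    seen = sum(2 ** o for o in set(reporting_ordinals))
    return expected - seen


print(sharding_mismatch([4, 4, 4, 4]))          # False
print(sharding_mismatch([4, 4, 2, 4]))          # True
print(missing_shards_value(4, [0, 1, 2, 3]))    # 0
print(missing_shards_value(4, [0, 1, 3]))       # 4, i.e. 2**2: shard 2 is absent
```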
Because configuring shards by hand is error-prone, KSM also offers automatic sharding: you deploy multiple KSM replicas as a StatefulSet, and each shard discovers its own position within the StatefulSet, which makes it possible to configure sharding automatically. To enable automatic sharding, kube-state-metrics must therefore be run by a StatefulSet, and the pod name and namespace must be passed to the kube-state-metrics process through the --pod and --pod-namespace flags, as shown below:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  serviceName: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: 2.8.0
    spec:
      automountServiceAccountToken: true
      containers:
      - args:
        - --pod=$(POD_NAME)
        - --pod-namespace=$(POD_NAMESPACE)
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.8.0
        # ......
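Pods created by a StatefulSet are named `<statefulset-name>-<ordinal>`, which is what lets an instance work out its own position from the pod name it receives. The Python sketch below shows one way such a lookup could work, assuming the shard index equals the pod ordinal and the total equals the StatefulSet's replica count. It is an illustration of the idea, not KSM's actual code, and `shard_from_pod_name` is a made-up name.

```python
def shard_from_pod_name(pod_name: str) -> int:
    """Extract the StatefulSet ordinal from a pod name like 'name-3'."""
    _prefix, sep, ordinal = pod_name.rpartition("-")
    if not sep or not (ordinal.isascii() and ordinal.isdigit()):
        raise ValueError(f"not a StatefulSet pod name: {pod_name!r}")
    return int(ordinal)


# Assumed inputs: the value passed via --pod, and spec.replicas
# of the StatefulSet that owns the pod.
pod_name = "kube-state-metrics-1"
replicas = 2

shard = shard_from_pod_name(pod_name)
print(f"shard={shard} totalShards={replicas}")  # shard=1 totalShards=2
```

Because the position is derived from the pod identity, scaling the StatefulSet changes the total for every instance at once, which avoids the mismatches that hand-set --shard and --total-shards values can produce.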
Deploying shards this way is useful when you want to manage the KSM shards through a single Kubernetes resource (here, one StatefulSet) instead of one Deployment per shard. The advantage is especially significant when deploying a large number of shards.
Automatic sharding also has a drawback, which stems from the rolling update strategy of StatefulSets. A StatefulSet replaces pods one at a time, terminating each pod before recreating it. Such upgrades are slower and may cause brief downtime for each shard; if Prometheus scrapes during the upgrade, it may miss some of the metrics exported by kube-state-metrics.
Example manifests for automatic sharding are in the examples/autosharding directory and can be deployed directly with the following command:
$ kubectl apply -k examples/autosharding
The command above deploys 2 KSM instances as a StatefulSet:
$ kubectl get pods -n kube-system -l app.kubernetes.io/name=kube-state-metrics
NAME READY STATUS RESTARTS AGE
kube-state-metrics-0 1/1 Running 0 70m
kube-state-metrics-1 1/1 Running 0 65m
Check the logs of either Pod:
$ kubectl logs -f kube-state-metrics-1 -nkube-system
I0216 05:53:23.151163 1 wrapper.go:78] Starting kube-state-metrics
I0216 05:53:23.154495 1 server.go:125] "Used default resources"
I0216 05:53:23.154923 1 types.go:184] "Using all namespaces"
I0216 05:53:23.155556 1 server.go:166] "Metric allow-denylisting" allowDenyStatus="Excluding the following lists that were on denylist: "
W0216 05:53:23.155792 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0216 05:53:23.178553 1 server.go:311] "Tested communication with server"
I0216 05:53:23.241024 1 server.go:316] "Run with Kubernetes cluster version" major="1" minor="25" gitVersion="v1.25.3" gitTreeState="clean" gitCommit="434bfd82814af038ad94d62ebe59b133fcb50506" platform="linux/arm64"
I0216 05:53:23.241169 1 server.go:317] "Communication with server successful"
I0216 05:53:23.245132 1 server.go:263] "Started metrics server" metricsServerAddress="[::]:8080"
I0216 05:53:23.246148 1 metrics_handler.go:103] "Autosharding enabled with pod" pod="kube-system/kube-state-metrics-1"
I0216 05:53:23.246233 1 metrics_handler.go:104] "Auto detecting sharding settings"
I0216 05:53:23.246267 1 server.go:252] "Started kube-state-metrics self metrics server" telemetryAddress="[::]:8081"
I0216 05:53:23.253477 1 server.go:69] level=info msg="Listening on" address=[::]:8081
I0216 05:53:23.253477 1 server.go:69] level=info msg="Listening on" address=[::]:8080
I0216 05:53:23.253944 1 server.go:69] level=info msg="TLS is disabled." http2=false address=[::]:8080
I0216 05:53:23.254534 1 server.go:69] level=info msg="TLS is disabled." http2=false address=[::]:8081
I0216 05:53:23.297524 1 metrics_handler.go:80] "Configuring sharding of this instance to be shard index (zero-indexed) out of total shards" shard=1 totalShards=2
I0216 05:53:23.411710 1 builder.go:257] "Active resources" activeStoreNames="certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments"
The log contains a line like "Configuring sharding of this instance to be shard index (zero-indexed) out of total shards" shard=1 totalShards=2, which indicates that automatic sharding succeeded. You can fetch the metrics from each shard and compare their size with the unsharded output; each shard serves noticeably fewer metrics. If a single instance still returns too much data, increase the number of StatefulSet replicas.
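One way to make that comparison concrete is to save each instance's /metrics output and count the samples per metric name. The Python sketch below assumes the standard Prometheus text exposition format; the sample payload is invented for illustration and the function name is hypothetical.

```python
from collections import Counter


def samples_per_metric(exposition: str) -> Counter:
    """Count sample lines per metric name in a text-format payload."""
    counts = Counter()
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


# Invented payload standing in for one shard's /metrics response.
sample = """\
# HELP kube_pod_info Information about pod.
# TYPE kube_pod_info gauge
kube_pod_info{namespace="default",pod="a"} 1
kube_pod_info{namespace="default",pod="b"} 1
kube_node_info{node="n1"} 1
"""

counts = samples_per_metric(sample)
print(counts.most_common(1))          # [('kube_pod_info', 2)]
print(sum(counts.values()), "samples,", len(sample.encode()), "bytes")
```

Running the same function over the unsharded output and over each shard's output shows both the total reduction and which metric families dominate the payload, which also helps when choosing entries for --metric-denylist.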

In addition, pod metrics alone can be sharded per node by adding the --node and --resources flags. In that case, create the KSM instances with a DaemonSet, as shown below:
apiVersion: apps/v1
kind: DaemonSet
spec:
  template:
    spec:
      containers:
      - image: registry.k8s.io/kube-state-metrics/kube-state-metrics:IMAGE_TAG
        name: kube-state-metrics
        args:
        - --resources=pods
        - --node=$(NODE_NAME)
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
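Other resources can likewise be deployed separately by selecting them with --resources, or can continue to use the sharding approach. In summary, running kube-state-metrics on a large cluster calls for several optimizations:
- Filter out metrics and labels you do not need
- Use sharding to reduce the load on each KSM instance
- Optionally deploy a DaemonSet dedicated to pod metrics
Some may ask what to do when their own application metrics are also extremely large. In that case the application team has to help: first establish whether that volume of metrics is actually expected, and if it is a genuine requirement, find a way to support sharding there as well.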