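K8sgpt-Operator: An Automated Diagnostics Tool for Kubernetes
Background
On Kubernetes, many things can go wrong between creating a Deployment and actually serving traffic; interested readers can browse A Visual Guide to Troubleshooting Kubernetes Deployments (2021 Chinese edition) [1]. The guide also shows that these problems usually leave traces, and the error message is often enough to point to a fix. With the rise of ChatGPT, LLM-based text generation projects keep appearing, and k8sgpt [2] is one of them.
k8sgpt is a tool for scanning Kubernetes clusters and diagnosing and triaging issues. It codifies SRE experience into its analyzers and uses AI to help extract and enrich the relevant information.
It ships with many built-in analyzers: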
- podAnalyzer
- pvcAnalyzer
- rsAnalyzer
- serviceAnalyzer
- eventAnalyzer
- ingressAnalyzer
- statefulSetAnalyzer
- deploymentAnalyzer
- cronJobAnalyzer
- nodeAnalyzer
- hpaAnalyzer (optional)
- pdbAnalyzer (optional)
- networkPolicyAnalyzer (optional)
k8sgpt exposes its capabilities through a CLI, which can quickly diagnose errors in a cluster.
k8sgpt analyze --explain --filter=Pod --namespace=default --output=json
{
  "status": "ProblemDetected",
  "problems": 1,
  "results": [
    {
      "kind": "Pod",
      "name": "default/test",
      "error": [
        {
          "Text": "Back-off pulling image \"flomesh/pipy2\"",
          "Sensitive": []
        }
      ],
      "details": "The Kubernetes system is experiencing difficulty pulling the requested image named \"flomesh/pipy2\". \n\nThe solution may be to check that the image is correctly spelled or to verify that it exists in the specified container registry. Additionally, ensure that the networking infrastructure that connects the container registry and Kubernetes system is working properly. Finally, check if there are any access restrictions or credentials required to pull the image and ensure they are provided correctly.",
      "parentObject": "test"
    }
  ]
}
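However, running a command for every diagnosis is tedious and limiting. What most of us really want is for problems to be detected and diagnosed automatically. That is where k8sgpt-operator [3], the subject of this post, comes in.
Introduction
In short, k8sgpt-operator runs k8sgpt automatically inside the cluster. It provides two CRDs: K8sGPT and Result. The former configures k8sgpt and its behavior; the latter presents the diagnosis for a problematic resource.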
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-sample
  namespace: kube-system
spec:
  model: gpt-3.5-turbo
  backend: openai
  noCache: false
  version: v0.2.7
  enableAI: true
  secret:
    name: k8sgpt-sample-secret
    key: openai-api-key
Demo
The test environment is a k3s cluster.
export INSTALL_K3S_VERSION=v1.23.8+k3s2
curl -sfL https://get.k3s.io | sh -s - --disable traefik --disable local-storage --disable servicelb --write-kubeconfig-mode 644 --write-kubeconfig ~/.kube/config
Install k8sgpt-operator
helm repo add k8sgpt https://charts.k8sgpt.ai/
helm repo update
helm install release k8sgpt/k8sgpt-operator -n openai --create-namespace
Once the installation finishes, you can see the two CRDs installed with the operator: k8sgpts and results.
kubectl api-resources | grep -i gpt
k8sgpts core.k8sgpt.ai/v1alpha1 true K8sGPT
results core.k8sgpt.ai/v1alpha1 true Result
Before starting, generate an OpenAI key [4] and store it in a secret.
OPENAI_TOKEN=xxxx
kubectl create secret generic k8sgpt-sample-secret --from-literal=openai-api-key=$OPENAI_TOKEN -n openai
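Interpolating the variable directly works, but an empty variable would silently produce a useless secret. The following is a minimal sketch of a guard around the same command. The function name create_openai_secret is hypothetical; the secret name, key, and namespace are the ones used in this post. It pipes the output of --dry-run=client -o yaml into kubectl apply so that running it again updates the secret instead of failing.

```shell
# Hypothetical helper: create or update the secret read by the K8sGPT resource.
# Usage: create_openai_secret <token> [namespace]
create_openai_secret() {
  local token="$1"
  local namespace="${2:-openai}"

  # Refuse to create a secret from an empty token.
  if [ -z "$token" ]; then
    echo "error: OpenAI token is empty" >&2
    return 1
  fi

  # Render the secret locally, then apply it, so the call is repeatable.
  kubectl create secret generic k8sgpt-sample-secret \
    --from-literal=openai-api-key="$token" \
    --namespace "$namespace" \
    --dry-run=client -o yaml |
    kubectl apply --namespace "$namespace" -f -
}
```

It would be called as create_openai_secret "$OPENAI_TOKEN" once the token has been exported.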
Next, create the K8sGPT resource.
kubectl apply -n openai -f - << EOF
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-sample
spec:
  model: gpt-3.5-turbo
  backend: openai
  noCache: false
  version: v0.2.7
  enableAI: true
  secret:
    name: k8sgpt-sample-secret
    key: openai-api-key
EOF
After running the command above, a Deployment named k8sgpt-deployment is created automatically in the openai namespace.
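Before creating a failing workload, it can help to confirm that k8sgpt has actually rolled out. Here is a minimal sketch built on kubectl rollout status. The function name wait_for_k8sgpt and the 120-second default timeout are assumptions; the Deployment name and namespace are the ones mentioned above.

```shell
# Hypothetical helper: block until the k8sgpt Deployment is ready.
# Usage: wait_for_k8sgpt [namespace] [timeout]
wait_for_k8sgpt() {
  local namespace="${1:-openai}"
  local timeout="${2:-120s}"

  # rollout status exits non-zero if the Deployment is not ready in time.
  kubectl rollout status deployment/k8sgpt-deployment \
    --namespace "$namespace" \
    --timeout="$timeout"
}
```

If the operator has not created the Deployment yet, the command fails with a not-found error, so it may need to be retried after a short pause.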
Test
Create a pod with an image that does not exist.
kubectl run test --image flomesh/pipy2 -n default
A resource named defaulttest then appears in the openai namespace.
kubectl get result -n openai
NAME AGE
defaulttest 5m7s
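In this run the Result is named defaulttest, which looks like the namespace and pod name from default/test joined without a separator. The sketch below builds on that assumption, which is inferred only from this single example and not from the operator's documentation. Both function names are hypothetical.

```shell
# Assumption: the Result name is "<namespace><name>" with no separator,
# as observed for default/test -> defaulttest in this post.
result_name_for() {
  local namespace="$1"
  local name="$2"
  printf '%s%s\n' "$namespace" "$name"
}

# Hypothetical helper: fetch the Result for one workload as YAML.
# Usage: get_result_for <workload-namespace> <workload-name> [result-namespace]
get_result_for() {
  local result_namespace="${3:-openai}"
  kubectl get result "$(result_name_for "$1" "$2")" \
    --namespace "$result_namespace" -o yaml
}
```

For the pod created above, get_result_for default test would look up defaulttest in the openai namespace.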
Its details show the diagnosis and the resource that has the problem.
kubectl get result -n openai defaulttest -o yaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: Result
metadata:
  creationTimestamp: "2023-05-02T09:00:32Z"
  generation: 1
  name: defaulttest
  namespace: openai
  resourceVersion: "1466"
  uid: 2ee27c26-61c1-4ef5-ae27-e1301a40cd56
spec:
  details: "The error message is indicating that Kubernetes is having trouble pulling
    the image \"flomesh/pipy2\" and is therefore backing off from trying to do so.
    \n\nThe solution to this issue would be to check that the image exists and that
    the spelling and syntax of the image name is correct. Additionally, check that
    the image is accessible from the Kubernetes cluster and that any required authentication
    or authorization is in place. If the issue persists, it may be necessary to troubleshoot
    the network connectivity between the Kubernetes cluster and the image repository."
  error:
  - text: Back-off pulling image "flomesh/pipy2"
  kind: Pod
  name: default/test
  parentObject: test
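Once several resources are failing, reading each Result as full YAML gets unwieldy. As a rough sketch, the fields shown above (spec.kind, spec.name, and the first entry of spec.error) can be printed one line per Result with kubectl's JSONPath output. The function name list_results is hypothetical, and the field paths are taken from the output in this post rather than from a schema reference.

```shell
# Hypothetical helper: print kind, name and first error text of every Result.
# Usage: list_results [namespace]
list_results() {
  local namespace="${1:-openai}"
  kubectl get results --namespace "$namespace" \
    -o jsonpath='{range .items[*]}{.spec.kind}{"\t"}{.spec.name}{"\t"}{.spec.error[0].text}{"\n"}{end}'
}
```

Only the first error is shown per resource; the full list and the AI-generated details remain available through kubectl get result with -o yaml.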
References
[1] A Visual Guide to Troubleshooting Kubernetes Deployments (2021 Chinese edition): https://atbug.com/troubleshooting-kubernetes-deployment-zh-v2/
[2] k8sgpt: https://github.com/k8sgpt-ai/k8sgpt
[3] k8sgpt-operator: https://github.com/k8sgpt-ai/k8sgpt-operator
[4] OpenAI key: https://platform.openai.com/account/api-keys