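Background
On Kubernetes, plenty can go wrong between creating a Deployment and having it serve traffic normally. If you are interested, see A Visual Guide to Troubleshooting Kubernetes Deployments (2021 Chinese edition)[1]. The guide also shows that these problems generally leave traces, and that the error message usually points to the fix. With the rise of ChatGPT, LLM-based text generation projects keep appearing, and k8sgpt[2] is one of them.
k8sgpt is a tool for scanning Kubernetes clusters and for diagnosing and triaging issues. It codifies SRE experience into its analyzers and uses AI to help extract and enrich the relevant information.
It ships with a large number of built-in analyzers: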
- podAnalyzer
- pvcAnalyzer
- rsAnalyzer
- serviceAnalyzer
- eventAnalyzer
- ingressAnalyzer
- statefulSetAnalyzer
- deploymentAnalyzer
- cronJobAnalyzer
- nodeAnalyzer
- hpaAnalyzer (optional)
- pdbAnalyzer (optional)
- networkPolicyAnalyzer (optional)
k8sgpt exposes its capabilities through a CLI, which makes it quick to diagnose errors in a cluster.
k8sgpt analyze --explain --filter=Pod --namespace=default --output=json
{
  "status": "ProblemDetected",
  "problems": 1,
  "results": [
    {
      "kind": "Pod",
      "name": "default/test",
      "error": [
        {
          "Text": "Back-off pulling image \"flomesh/pipy2\"",
          "Sensitive": []
        }
      ],
      "details": "The Kubernetes system is experiencing difficulty pulling the requested image named \"flomesh/pipy2\". \n\nThe solution may be to check that the image is correctly spelled or to verify that it exists in the specified container registry. Additionally, ensure that the networking infrastructure that connects the container registry and Kubernetes system is working properly. Finally, check if there are any access restrictions or credentials required to pull the image and ensure they are provided correctly.",
      "parentObject": "test"
    }
  ]
}
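The JSON report is easy to post-process. The following is a minimal sketch, assuming the report has the shape shown above (`results`, `kind`, `name`, `error[].Text`); the `summarize` helper and the shortened sample are illustrative and not part of k8sgpt.

```python
import json

# Sample taken from the `k8sgpt analyze --output=json` run shown above
# (the long "details" text is shortened here).
SAMPLE = r'''
{
  "status": "ProblemDetected",
  "problems": 1,
  "results": [
    {
      "kind": "Pod",
      "name": "default/test",
      "error": [
        {"Text": "Back-off pulling image \"flomesh/pipy2\"", "Sensitive": []}
      ],
      "details": "The Kubernetes system is experiencing difficulty pulling the requested image named \"flomesh/pipy2\".",
      "parentObject": "test"
    }
  ]
}
'''


def summarize(report_json):
    """Return one line per reported error: '<kind> <namespace/name>: <error text>'."""
    report = json.loads(report_json)
    lines = []
    # Treat a missing or null "results" field as "nothing to report".
    for result in report.get("results") or []:
        for err in result.get("error") or []:
            lines.append(f'{result["kind"]} {result["name"]}: {err["Text"]}')
    return lines


if __name__ == "__main__":
    for line in summarize(SAMPLE):
        print(line)
```

In practice the sample string would be replaced by the real CLI output, for example by saving it to a file and reading it back.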
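However, running a command for every diagnosis is somewhat tedious and limiting. What most people really want is for problems to be detected and diagnosed automatically. That is where k8sgpt-operator[3], the subject of this post, comes in.
Introduction
In short, k8sgpt-operator runs k8sgpt automatically inside the cluster. It provides two CRDs: K8sGPT and Result. The former configures k8sgpt and its behavior; the latter presents the diagnostic results for problematic resources.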
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-sample
  namespace: kube-system
spec:
  model: gpt-3.5-turbo
  backend: openai
  noCache: false
  version: v0.2.7
  enableAI: true
  secret:
    name: k8sgpt-sample-secret
    key: openai-api-key
Demo
The test environment is a k3s cluster.
export INSTALL_K3S_VERSION=v1.23.8+k3s2
curl -sfL https://get.k3s.io | sh -s - --disable traefik --disable local-storage --disable servicelb --write-kubeconfig-mode 644 --write-kubeconfig ~/.kube/config
Install k8sgpt-operator
helm repo add k8sgpt https://charts.k8sgpt.ai/
helm repo update
helm install release k8sgpt/k8sgpt-operator -n openai --create-namespace
Once the installation finishes, you can see the two CRDs installed along with the operator: k8sgpts and results.
kubectl api-resources | grep -i gpt
k8sgpts core.k8sgpt.ai/v1alpha1 true K8sGPT
results core.k8sgpt.ai/v1alpha1 true Result
Before starting, generate an OpenAI key[4] and store it in a secret.
OPENAI_TOKEN=xxxx
kubectl create secret generic k8sgpt-sample-secret --from-literal=openai-api-key=$OPENAI_TOKEN -n openai
Next, create the K8sGPT resource.
kubectl apply -n openai -f - << EOF
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-sample
spec:
  model: gpt-3.5-turbo
  backend: openai
  noCache: false
  version: v0.2.7
  enableAI: true
  secret:
    name: k8sgpt-sample-secret
    key: openai-api-key
EOF
After running the command above, a Deployment named k8sgpt-deployment is created automatically in the openai namespace.
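If you would rather generate the K8sGPT manifest than hand-write it, for instance to vary the model or version per cluster, a small script can build it. This is a minimal sketch: the field names and defaults are copied from the manifest above, while the `k8sgpt_manifest` helper is a hypothetical name and not part of the operator.

```python
import json


def k8sgpt_manifest(name, secret_name, secret_key,
                    model="gpt-3.5-turbo", backend="openai",
                    version="v0.2.7"):
    """Build a K8sGPT custom resource as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "core.k8sgpt.ai/v1alpha1",
        "kind": "K8sGPT",
        "metadata": {"name": name},
        "spec": {
            "model": model,
            "backend": backend,
            "noCache": False,
            "version": version,
            "enableAI": True,
            # Reference to the Secret that stores the OpenAI key.
            "secret": {"name": secret_name, "key": secret_key},
        },
    }


if __name__ == "__main__":
    manifest = k8sgpt_manifest(
        "k8sgpt-sample", "k8sgpt-sample-secret", "openai-api-key"
    )
    # kubectl accepts JSON as well as YAML, so the output can be piped to
    # `kubectl apply -n openai -f -`.
    print(json.dumps(manifest, indent=2))
```

Emitting JSON keeps the sketch within the standard library; a YAML dump would work equally well if a YAML library is available.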
Test
Create a pod that uses an image that does not exist.
kubectl run test --image flomesh/pipy2 -n default
A resource named defaulttest then appears in the openai namespace.
kubectl get result -n openai
NAME AGE
defaulttest 5m7s
Its details show the diagnosis and the resource that has the problem.
kubectl get result -n openai defaulttest -o yaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: Result
metadata:
  creationTimestamp: "2023-05-02T09:00:32Z"
  generation: 1
  name: defaulttest
  namespace: openai
  resourceVersion: "1466"
  uid: 2ee27c26-61c1-4ef5-ae27-e1301a40cd56
spec:
  details: "The error message is indicating that Kubernetes is having trouble pulling
    the image \"flomesh/pipy2\" and is therefore backing off from trying to do so.
    \n\nThe solution to this issue would be to check that the image exists and that
    the spelling and syntax of the image name is correct. Additionally, check that
    the image is accessible from the Kubernetes cluster and that any required authentication
    or authorization is in place. If the issue persists, it may be necessary to troubleshoot
    the network connectivity between the Kubernetes cluster and the image repository."
  error:
  - text: Back-off pulling image "flomesh/pipy2"
  kind: Pod
  name: default/test
  parentObject: test
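Because the diagnoses are ordinary Kubernetes resources, they can be consumed by scripts. The sketch below assumes the JSON form of the listing (`kubectl get result -n openai -o json`, which wraps the resources in an `items` list) and the `spec` fields shown above. The `diagnoses` helper and the hand-written, shortened sample are illustrative only.

```python
import json

# Hand-written sample in the shape `kubectl get result -n openai -o json`
# would return for the Result above (details shortened).
SAMPLE_LIST = r'''
{
  "apiVersion": "v1",
  "kind": "List",
  "items": [
    {
      "apiVersion": "core.k8sgpt.ai/v1alpha1",
      "kind": "Result",
      "metadata": {"name": "defaulttest", "namespace": "openai"},
      "spec": {
        "details": "The error message is indicating that Kubernetes is having trouble pulling the image \"flomesh/pipy2\" and is therefore backing off from trying to do so.",
        "error": [{"text": "Back-off pulling image \"flomesh/pipy2\""}],
        "kind": "Pod",
        "name": "default/test",
        "parentObject": "test"
      }
    }
  ]
}
'''


def diagnoses(result_list_json):
    """Yield (result name, affected resource, error texts, details) per Result."""
    doc = json.loads(result_list_json)
    for item in doc.get("items") or []:
        spec = item.get("spec", {})
        # Note the lowercase "text" key here, unlike "Text" in the CLI report.
        errors = [e.get("text", "") for e in spec.get("error") or []]
        yield (
            item["metadata"]["name"],
            f'{spec.get("kind")} {spec.get("name")}',
            errors,
            spec.get("details", ""),
        )


if __name__ == "__main__":
    for name, resource, errors, details in diagnoses(SAMPLE_LIST):
        print(f"{name} -> {resource}")
        for text in errors:
            print(f"  error: {text}")
        print(f"  details: {details}")
```

To run it against a live cluster, replace the sample with the real kubectl output, for example by redirecting it to a file and reading that file.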
References
[1] A Visual Guide to Troubleshooting Kubernetes Deployments (2021 Chinese edition): https://atbug.com/troubleshooting-kubernetes-deployment-zh-v2/
[2] k8sgpt: https://github.com/k8sgpt-ai/k8sgpt
[3] k8sgpt-operator: https://github.com/k8sgpt-ai/k8sgpt-operator
[4] OpenAI key: https://platform.openai.com/account/api-keys