[CKA] KodeKloud - Worker Node Failure
Hello, this is 쯀리.
Following the earlier posts on Application Failure and ControlPlane Failure,
today I'll cover Worker Node Failure.
https://kubernetes.io/docs/tasks/debug/debug-application/
https://funlife-julie.tistory.com/80
https://funlife-julie.tistory.com/81
Quiz.
1. Fix the broken cluster (node01)
node01 ~ ➜ systemctl status containerd
● containerd.service - containerd container runtime
Loaded: loaded (/lib/systemd/system/containerd.service; enabled; vendor >
Active: active (running) since Thu 2024-08-22 15:38:21 UTC; 15min ago
Docs: https://containerd.io
Main PID: 1183 (containerd)
Tasks: 99
Memory: 132.9M
CGroup: /system.slice/containerd.service
├─1183 /usr/bin/containerd
├─2713 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 26>
├─2714 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 53>
├─kubepods-besteffort-pod6836abc1_6b74_4f0c_a21e_eb71f551449d.sl>
│ └─2763 /pause
├─kubepods-besteffort-pod6836abc1_6b74_4f0c_a21e_eb71f551449d.sl>
│ └─2965 /usr/local/bin/kube-proxy --config=/var/lib/kube-proxy/>
├─kubepods-burstable-pod8fc07dc8_a6ce_486a_94ea_91973b22a39c.sli>
│ └─2765 /pause
└─kubepods-burstable-pod8fc07dc8_a6ce_486a_94ea_91973b22a39c.sli>
└─3387 /opt/bin/flanneld --ip-masq --kube-subnet-mgr --iface=e>
Aug 22 15:39:29 node01 containerd[1183]: time="2024-08-22T15:39:29.920803087Z>
Aug 22 15:39:30 node01 containerd[1183]: time="2024-08-22T15:39:30.182698528Z>
Aug 22 15:39:30 node01 containerd[1183]: time="2024-08-22T15:39:30.202544490Z>
Aug 22 15:39:30 node01 containerd[1183]: time="2024-08-22T15:39:30.202610264Z>
Aug 22 15:39:30 node01 containerd[1183]: time="2024-08-22T15:39:30.202620287Z>
Aug 22 15:39:30 node01 containerd[1183]: time="2024-08-22T15:39:30.271256324Z>
Aug 22 15:39:30 node01 containerd[1183]: time="2024-08-22T15:39:30.888125268Z>
Aug 22 15:39:30 node01 containerd[1183]: time="2024-08-22T15:39:30.922267880Z>
node01 ~ ➜ systemctl status kubelet
○ kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor pre>
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: inactive (dead) since Thu 2024-08-22 15:48:02 UTC; 5min ago
Docs: https://kubernetes.io/docs/
Process: 2562 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELE>
Main PID: 2562 (code=exited, status=0/SUCCESS)
Aug 22 15:39:27 node01 kubelet[2562]: I0822 15:39:27.781386 2562 reconcile>
Aug 22 15:39:27 node01 kubelet[2562]: I0822 15:39:27.781401 2562 reconcile>
Aug 22 15:39:27 node01 kubelet[2562]: I0822 15:39:27.781417 2562 reconcile>
Aug 22 15:39:27 node01 kubelet[2562]: I0822 15:39:27.781429 2562 reconcile>
Aug 22 15:39:28 node01 kubelet[2562]: I0822 15:39:28.805667 2562 kuberunti>
Aug 22 15:39:28 node01 kubelet[2562]: I0822 15:39:28.884666 2562 kuberunti>
Aug 22 15:39:29 node01 kubelet[2562]: I0822 15:39:29.884279 2562 kuberunti>
Aug 22 15:39:29 node01 kubelet[2562]: I0822 15:39:29.903389 2562 pod_start>
Aug 22 15:39:30 node01 kubelet[2562]: I0822 15:39:30.203314 2562 kubelet_n>
Aug 22 15:39:30 node01 kubelet[2562]: I0822 15:39:30.887778 2562 kuberunti>
containerd is running, but the kubelet is inactive (dead), so let's start it.
node01 ~ ✖ systemctl start kubelet
node01 ~ ➜ systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor pre>
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Thu 2024-08-22 15:54:13 UTC; 2s ago
Docs: https://kubernetes.io/docs/
Main PID: 8277 (kubelet)
Tasks: 26 (limit: 251379)
Memory: 50.3M
CGroup: /system.slice/kubelet.service
└─8277 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/b>
Done:
controlplane ~ ➜ k get node
NAME STATUS ROLES AGE VERSION
controlplane Ready control-plane 15m v1.30.0
node01 Ready <none> 15m v1.30.0
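The check in this quiz comes down to reading the Active: line of the systemctl status output. As a rough sketch, the helper below pulls the state word out of that output. The name unit_state is my own (hypothetical, not part of the lab), and on a live node `systemctl is-active kubelet` prints the same word directly.

```shell
# unit_state: hypothetical helper, not part of the lab.
# Reads `systemctl status <unit>` output on stdin and prints the state
# word from the "Active:" line (for example "active" or "inactive").
unit_state() {
  awk '$1 == "Active:" { print $2; exit }'
}

# Usage on a node (commented out because it needs systemd):
#   systemctl status kubelet | unit_state
#   systemctl start kubelet
```

If it prints inactive, as it did here, starting the service is the first thing to try before digging into logs.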
2. The cluster is broken again. Investigate and fix the issue.
node01 ~ ✖ journalctl -u kubelet -f
Aug 22 15:57:32 node01 kubelet[10069]: I0822 15:57:32.040698 10069 server.go:205] "--pod-infra-container-image will not be pruned by the image garbage collector in kubelet and should also be set in the remote runtime"
Aug 22 15:57:32 node01 kubelet[10069]: E0822 15:57:32.044187 10069 run.go:74] "command failed" err="failed to construct kubelet dependencies: unable to load client CA file /etc/kubernetes/pki/WRONG-CA-FILE.crt: open /etc/kubernetes/pki/WRONG-CA-FILE.crt: no such file or directory"
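The kubelet journal is noisy, and the line that matters here is the one whose klog prefix starts with E (error) rather than I (info). A small sketch for filtering those lines follows; kubelet_errors is a hypothetical name of mine, and the pattern assumes the journal line format shown above.

```shell
# kubelet_errors: hypothetical helper, not part of the lab.
# Keeps only journal lines whose klog severity is E (error), i.e. the
# message part starts with "E" followed by month and day (E0822 ...).
kubelet_errors() {
  grep -E 'kubelet\[[0-9]+\]: E[0-9]{4} '
}

# Usage on a node (commented out because it needs journald):
#   journalctl -u kubelet --no-pager -n 200 | kubelet_errors
```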
https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/
On a kubeadm cluster, the kubelet's default config file is /var/lib/kubelet/config.yaml.
node01 /var/lib/kubelet ➜ vi config.yaml
## Before
x509:
clientCAFile: /etc/kubernetes/pki/WRONG-CA-FILE.crt
## After
x509:
clientCAFile: /etc/kubernetes/pki/ca.crt
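To double-check an edit like this, you can confirm that the path in clientCAFile actually exists on disk. The sketch below assumes a simple unquoted path without spaces; client_ca_file and ca_file_exists are hypothetical helper names, not something the lab provides.

```shell
# client_ca_file: hypothetical helper, not part of the lab.
# Prints the value of the first "clientCAFile:" key in a kubelet config
# file. Assumes an unquoted path without spaces.
client_ca_file() {
  awk '$1 == "clientCAFile:" { print $2; exit }' "$1"
}

# ca_file_exists: succeeds only if that path points at a regular file.
ca_file_exists() {
  ca=$(client_ca_file "$1")
  [ -n "$ca" ] && [ -f "$ca" ]
}

# Usage on a node (commented out because it touches system paths):
#   ca_file_exists /var/lib/kubelet/config.yaml || echo "clientCAFile is missing"
#   systemctl restart kubelet
```

Restarting the kubelet after the fix makes the change take effect right away.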
3. The cluster is broken again. Investigate and fix the issue.
Aug 22 16:03:31 node01 kubelet[13373]: E0822 16:03:31.828174 13373 reflector.go:150] k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.RuntimeClass: failed to list *v1.RuntimeClass: Get "https://controlplane:6553/apis/node.k8s.io/v1/runtimeclasses?limit=500&resourceVersion=0": dial tcp 192.35.51.3:6553: connect: connection refused
node01 cannot reach the controlplane: the connection to port 6553 is refused.
First, check which port the kube-apiserver on the controlplane is actually using.
controlplane ~ ➜ cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep port
- --secure-port=6443
port: 6443
port: 6443
port: 6443
The kubelet on node01 is configured to use port 6553, so it needs to be changed to 6443.
node01 ~ ➜ cd /etc/kubernetes/
node01 /etc/kubernetes ➜ vi kubelet.conf
....
### Change the controlplane port on this line: 6553 -> 6443
server: https://controlplane:6553
...
## Restart the kubelet
node01 ~ ✖ systemctl restart kubelet
node01 ~ ➜ exit
logout
Connection to node01 closed.
controlplane ~ ➜ k get node
NAME STATUS ROLES AGE VERSION
controlplane Ready control-plane 36m v1.30.0
node01 Ready <none> 36m v1.30.0
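The comparison in this quiz (the port the kubelet dials versus the port the API server listens on) can be sketched with two small helpers. Both function names are hypothetical, and the first one assumes the server URL carries an explicit port.

```shell
# Hypothetical helpers, not part of the lab.

# kubeconfig_server_port: prints the port of the first "server:" URL in a
# kubeconfig file. Assumes the URL carries an explicit port.
kubeconfig_server_port() {
  awk '$1 == "server:" { print $2; exit }' "$1" | sed -E 's#^.*:([0-9]+)/?$#\1#'
}

# apiserver_secure_port: prints the --secure-port value from a
# kube-apiserver static pod manifest.
apiserver_secure_port() {
  sed -n -E 's/^.*--secure-port=([0-9]+).*$/\1/p' "$1" | head -n 1
}

# Usage (commented out because the two files live on different nodes):
#   on node01:        kubeconfig_server_port /etc/kubernetes/kubelet.conf
#   on controlplane:  apiserver_secure_port /etc/kubernetes/manifests/kube-apiserver.yaml
```

If the two numbers differ, as 6553 and 6443 did here, the kubeconfig on the worker is the side to fix.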
Today we looked at Worker Node Failure. Next time, we'll look at Troubleshoot Network.
References
※ Udemy Labs - Certified Kubernetes Administrator with Practice Tests