Asking For Help: Kubernetes Pods Unable To Reach Each Other (Flannel, Networking)
I am not entirely sure whether this sub is also meant for asking questions, but after posting the question on Stack Overflow and only getting an AI-generated answer, I thought it would be worth a shot to ask here. What follows is rather long, but most of it is debugging information included to pre-empt the obvious follow-up questions.
I have created a small Kubernetes cluster (one control-plane node and six workers) using kubeadm, with Flannel as the CNI, in an OpenStack project. This is my first time running anything more than a single-node Kubernetes cluster.
After bootstrapping the control-plane node, I manually rebooted it and installed Flannel via kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml. The workers are created in a similar way (without Flannel). I am omitting their setup script for now, but I can add it if it seems important.
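For context, a minimal sketch of what that bootstrap boils down to, assuming Flannel's default pod CIDR of 10.244.0.0/16 (the actual script does more than this):

# on k8s-master-0: init with the pod CIDR that the default Flannel manifest expects
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
# (after the reboot, the kubectl apply of the Flannel manifest mentioned above follows)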
I then ran into DNS resolution issues with a Helm chart, which is why I started investigating the networking and noticed that pods on different nodes are unable to ping each other.
I am unsure how to debug this issue further.
Debug Info
kubectl get nodes
NAME           STATUS   ROLES           AGE     VERSION
k8s-master-0   Ready    control-plane   4h38m   v1.30.14
k8s-worker-0   Ready    <none>          4h35m   v1.30.14
k8s-worker-1   Ready    <none>          4h35m   v1.30.14
k8s-worker-2   Ready    <none>          4h35m   v1.30.14
k8s-worker-3   Ready    <none>          4h35m   v1.30.14
k8s-worker-4   Ready    <none>          4h35m   v1.30.14
k8s-worker-5   Ready    <none>          4h34m   v1.30.14
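One thing that can also be verified here is that every node got a podCIDR assigned, since Flannel derives its per-node subnet from it (command only, output not included):

# list each node's assigned podCIDR
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'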
ip addr show flannel.1
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN group default
link/ether a2:4a:11:1f:84:ef brd ff:ff:ff:ff:ff:ff
inet 10.244.0.0/32 scope global flannel.1
valid_lft forever preferred_lft forever
inet6 fe80::a04a:11ff:fe1f:84ef/64 scope link
valid_lft forever preferred_lft forever
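The VXLAN parameters of that interface can be shown as well (command only, I have not pasted the output; by default Flannel's VXLAN backend uses VNI 1 and port 8472/udp):

# show VXLAN details (VNI, local address, dstport) of the Flannel interface
ip -d link show flannel.1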
ip route
default via 192.168.33.1 dev ens3 proto dhcp src 192.168.33.117 metric 100
10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1
10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink
10.244.3.0/24 via 10.244.3.0 dev flannel.1 onlink
10.244.4.0/24 via 10.244.4.0 dev flannel.1 onlink
10.244.5.0/24 via 10.244.5.0 dev flannel.1 onlink
10.244.6.0/24 via 10.244.6.0 dev flannel.1 onlink
169.254.169.254 via 192.168.33.3 dev ens3 proto dhcp src 192.168.33.117 metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.33.0/24 dev ens3 proto kernel scope link src 192.168.33.117 metric 100
192.168.33.1 dev ens3 proto dhcp scope link src 192.168.33.117 metric 100
192.168.33.2 dev ens3 proto dhcp scope link src 192.168.33.117 metric 100
192.168.33.3 dev ens3 proto dhcp scope link src 192.168.33.117 metric 100
192.168.33.4 dev ens3 proto dhcp scope link src 192.168.33.117 metric 100
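One check I have not done yet is watching for Flannel's VXLAN traffic on the node NICs while pinging across nodes (sketch; 8472/udp is Flannel's default VXLAN port and ens3 is the node interface from the routes above):

# run on the destination worker while pinging its pod from a pod on another node;
# if no packets show up, the encapsulated traffic is being dropped before it arrives
sudo tcpdump -ni ens3 udp port 8472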
kubectl run -it --rm dnsutils --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default
If you don't see a command prompt, try pressing enter.
Address 1: 10.96.0.10
nslookup: can't resolve 'kubernetes.default'
pod "dnsutils" deleted
pod default/dnsutils terminated (Error)
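To separate the Service/kube-proxy path from plain pod-to-pod reachability, the same lookup can be pointed at a CoreDNS pod IP directly (sketch; <coredns-pod-ip> is a placeholder for an IP taken from kubectl get pods -n kube-system -o wide):

# query a CoreDNS pod directly instead of the 10.96.0.10 Service VIP
kubectl run -it --rm dnsutils --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default <coredns-pod-ip>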
kubectl get pods -n kube-system -l k8s-app=kube-dns
NAME                       READY   STATUS    RESTARTS        AGE
coredns-55cb58b774-6vb7p   1/1     Running   1 (4h19m ago)   4h38m
coredns-55cb58b774-wtrz6   1/1     Running   1 (4h19m ago)   4h38m
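For completeness, the kube-dns Service and its endpoints can be checked too (commands only, output not included here):

# confirm the 10.96.0.10 ClusterIP actually has the CoreDNS pods behind it
kubectl get svc -n kube-system kube-dns
kubectl get endpoints -n kube-system kube-dns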
Ping Test
ubuntu@k8s-master-0:~$ kubectl run pod1 --image=busybox:1.28 --restart=Never --command -- sleep 3600
pod/pod1 created
ubuntu@k8s-master-0:~$ kubectl run pod2 --image=busybox:1.28 --restart=Never --command -- sleep 3600
pod/pod2 created
ubuntu@k8s-master-0:~$ kubectl get pods -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
pod1   1/1     Running   0          15m   10.244.5.2   k8s-worker-1   <none>           <none>
pod2   1/1     Running   0          15m   10.244.4.2   k8s-worker-3   <none>           <none>
ubuntu@k8s-master-0:~$ kubectl exec -it pod1 -- sh
/ # ping 10.244.5.2
PING 10.244.5.2 (10.244.5.2): 56 data bytes
64 bytes from 10.244.5.2: seq=0 ttl=64 time=0.107 ms
64 bytes from 10.244.5.2: seq=1 ttl=64 time=0.091 ms
64 bytes from 10.244.5.2: seq=2 ttl=64 time=0.090 ms
^C
--- 10.244.5.2 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.090/0.096/0.107 ms
/ # 10.244.4.2
sh: 10.244.4.2: not found
/ # ping 10.244.4.2
PING 10.244.4.2 (10.244.4.2): 56 data bytes
^C
--- 10.244.4.2 ping statistics ---
2 packets transmitted, 0 packets received, 100% packet loss
/ # exit
command terminated with exit code 1
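Something I still want to look at are the Flannel DaemonSet pods and their logs on the two workers involved (sketch; depending on the manifest version the pods live in kube-system or kube-flannel, and <flannel-pod-name> is a placeholder):

# locate the flannel pods (namespace depends on the manifest version used)
kubectl get pods -n kube-system -l app=flannel -o wide
kubectl get pods -n kube-flannel -o wide
# then check the log of the pod on e.g. k8s-worker-1 or k8s-worker-3
kubectl logs -n kube-system <flannel-pod-name>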
If I understand Flannel correctly, it is fine that the two pods are in different per-node subnets (10.244.5.0/24 vs. 10.244.4.0/24), since the ip routes shown above send that traffic over flannel.1.
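As a sanity check on that assumption, the kernel's routing decision for the unreachable pod IP can be confirmed on the source node (sketch, run on k8s-worker-1 where pod1 is scheduled):

# should report the destination as reachable via flannel.1
ip route get 10.244.4.2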
That's a good idea. While I have opened some of the ports commonly used by Kubernetes, I didn't consider that Flannel might require additional open ports. I will look into this. It is also mentioned in the troubleshooting guide, but I only looked there for debugging advice and merely skimmed the other sections: https://github.com/flannel-io/flannel/blob/master/Documentation/troubleshooting.md#firewalls
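If the firewall turns out to be the culprit, I assume the fix on the OpenStack side would look roughly like this (sketch; <secgroup> is a placeholder for the security group used by the nodes, 8472/udp is the VXLAN backend port and 8285/udp the legacy udp backend port from the Flannel docs):

# allow Flannel VXLAN traffic between the nodes (adjust group name and CIDR as needed)
openstack security group rule create --protocol udp --dst-port 8472 --remote-ip 192.168.33.0/24 <secgroup>
# only needed if the udp backend is used instead of vxlan
openstack security group rule create --protocol udp --dst-port 8285 --remote-ip 192.168.33.0/24 <secgroup>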
Can you elaborate on that? I am not sure whether you are recommending Calico over Flannel in general, whether you think Calico would not run into this issue, or whether Calico would just be easier to debug if such an issue arises.
I decided to use Flannel because it was described as the easiest option; I know that Calico is the most popular.