Asking For Help: Kubernetes Pods Unable To Reach Each Other (Flannel, Networking)
I am not entirely sure whether this sub is also meant for asking questions, but after posting the question on Stack Overflow and only getting an AI-generated answer, I thought it would be worth a shot to ask here. What follows is rather long, but most of it is debugging information included to pre-empt the obvious follow-up questions.
I have created a small Kubernetes cluster (one control-plane node and six workers) using kubeadm, with Flannel as the CNI, in an OpenStack project. This is my first time running anything more than a single-node Kubernetes cluster.
After bootstrapping the control-plane node, I manually rebooted it and installed Flannel via kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml. The workers are created in a similar way (without Flannel). I am omitting their setup script for now, but I can add it if it seems important.
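For context, a minimal sketch of what that bootstrap boils down to, assuming Flannel's default pod CIDR of 10.244.0.0/16 (the actual script does more than this):

# on k8s-master-0: init with the pod CIDR that the default Flannel manifest expects
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
# (after the reboot, the kubectl apply of the Flannel manifest mentioned above follows)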
I then ran into DNS resolution issues with a Helm chart, which is why I started investigating the networking and noticed that pods on different nodes are unable to ping each other.
I am unsure how to debug this issue further.
Debug Info
kubectl get nodes
NAME           STATUS   ROLES           AGE     VERSION
k8s-master-0   Ready    control-plane   4h38m   v1.30.14
k8s-worker-0   Ready    <none>          4h35m   v1.30.14
k8s-worker-1   Ready    <none>          4h35m   v1.30.14
k8s-worker-2   Ready    <none>          4h35m   v1.30.14
k8s-worker-3   Ready    <none>          4h35m   v1.30.14
k8s-worker-4   Ready    <none>          4h35m   v1.30.14
k8s-worker-5   Ready    <none>          4h34m   v1.30.14
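One thing that can also be verified here is that every node got a podCIDR assigned, since Flannel derives its per-node subnet from it (command only, output not included):

# list each node's assigned podCIDR
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'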
ip addr show flannel.1
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN group default
link/ether a2:4a:11:1f:84:ef brd ff:ff:ff:ff:ff:ff
inet 10.244.0.0/32 scope global flannel.1
valid_lft forever preferred_lft forever
inet6 fe80::a04a:11ff:fe1f:84ef/64 scope link
valid_lft forever preferred_lft forever
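The VXLAN parameters of that interface can be shown as well (command only, I have not pasted the output; by default Flannel's VXLAN backend uses VNI 1 and port 8472/udp):

# show VXLAN details (VNI, local address, dstport) of the Flannel interface
ip -d link show flannel.1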
ip route
default via 192.168.33.1 dev ens3 proto dhcp src 192.168.33.117 metric 100
10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1
10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink
10.244.3.0/24 via 10.244.3.0 dev flannel.1 onlink
10.244.4.0/24 via 10.244.4.0 dev flannel.1 onlink
10.244.5.0/24 via 10.244.5.0 dev flannel.1 onlink
10.244.6.0/24 via 10.244.6.0 dev flannel.1 onlink
169.254.169.254 via 192.168.33.3 dev ens3 proto dhcp src 192.168.33.117 metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.33.0/24 dev ens3 proto kernel scope link src 192.168.33.117 metric 100
192.168.33.1 dev ens3 proto dhcp scope link src 192.168.33.117 metric 100
192.168.33.2 dev ens3 proto dhcp scope link src 192.168.33.117 metric 100
192.168.33.3 dev ens3 proto dhcp scope link src 192.168.33.117 metric 100
192.168.33.4 dev ens3 proto dhcp scope link src 192.168.33.117 metric 100
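One check I have not done yet is watching for Flannel's VXLAN traffic on the node NICs while pinging across nodes (sketch; 8472/udp is Flannel's default VXLAN port and ens3 is the node interface from the routes above):

# run on the destination worker while pinging its pod from a pod on another node;
# if no packets show up, the encapsulated traffic is being dropped before it arrives
sudo tcpdump -ni ens3 udp port 8472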
kubectl run -it --rm dnsutils --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default
If you don't see a command prompt, try pressing enter.
Address 1: 10.96.0.10
nslookup: can't resolve 'kubernetes.default'
pod "dnsutils" deleted
pod default/dnsutils terminated (Error)
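To separate the Service/kube-proxy path from plain pod-to-pod reachability, the same lookup can be pointed at a CoreDNS pod IP directly (sketch; <coredns-pod-ip> is a placeholder for an IP taken from kubectl get pods -n kube-system -o wide):

# query a CoreDNS pod directly instead of the 10.96.0.10 Service VIP
kubectl run -it --rm dnsutils --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default <coredns-pod-ip>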
kubectl get pods -n kube-system -l k8s-app=kube-dns
NAME                       READY   STATUS    RESTARTS        AGE
coredns-55cb58b774-6vb7p   1/1     Running   1 (4h19m ago)   4h38m
coredns-55cb58b774-wtrz6   1/1     Running   1 (4h19m ago)   4h38m
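For completeness, the kube-dns Service and its endpoints can be checked too (commands only, output not included here):

# confirm the 10.96.0.10 ClusterIP actually has the CoreDNS pods behind it
kubectl get svc -n kube-system kube-dns
kubectl get endpoints -n kube-system kube-dns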
Ping Test
ubuntu@k8s-master-0:~$ kubectl run pod1 --image=busybox:1.28 --restart=Never --command -- sleep 3600
pod/pod1 created
ubuntu@k8s-master-0:~$ kubectl run pod2 --image=busybox:1.28 --restart=Never --command -- sleep 3600
pod/pod2 created
ubuntu@k8s-master-0:~$ kubectl get pods -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
pod1   1/1     Running   0          15m   10.244.5.2   k8s-worker-1   <none>           <none>
pod2   1/1     Running   0          15m   10.244.4.2   k8s-worker-3   <none>           <none>
ubuntu@k8s-master-0:~$ kubectl exec -it pod1 -- sh
/ # ping 10.244.5.2
PING 10.244.5.2 (10.244.5.2): 56 data bytes
64 bytes from 10.244.5.2: seq=0 ttl=64 time=0.107 ms
64 bytes from 10.244.5.2: seq=1 ttl=64 time=0.091 ms
64 bytes from 10.244.5.2: seq=2 ttl=64 time=0.090 ms
^C
--- 10.244.5.2 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.090/0.096/0.107 ms
/ # 10.244.4.2
sh: 10.244.4.2: not found
/ # ping 10.244.4.2
PING 10.244.4.2 (10.244.4.2): 56 data bytes
^C
--- 10.244.4.2 ping statistics ---
2 packets transmitted, 0 packets received, 100% packet loss
/ # exit
command terminated with exit code 1
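Something I still want to look at are the Flannel DaemonSet pods and their logs on the two workers involved (sketch; depending on the manifest version the pods live in kube-system or kube-flannel, and <flannel-pod-name> is a placeholder):

# locate the flannel pods (namespace depends on the manifest version used)
kubectl get pods -n kube-system -l app=flannel -o wide
kubectl get pods -n kube-flannel -o wide
# then check the log of the pod on e.g. k8s-worker-1 or k8s-worker-3
kubectl logs -n kube-system <flannel-pod-name>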
If I understand Flannel correctly, it is fine that the two pods are in different per-node subnets (10.244.5.0/24 vs. 10.244.4.0/24), since the ip routes shown above send that traffic over flannel.1.
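As a sanity check on that assumption, the kernel's routing decision for the unreachable pod IP can be confirmed on the source node (sketch, run on k8s-worker-1 where pod1 is scheduled):

# should report the destination as reachable via flannel.1
ip route get 10.244.4.2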
That's a good idea. While I have opened some of the ports commonly used by Kubernetes, I didn't consider that Flannel might require additional open ports. I will look into this. It is also mentioned in the troubleshooting guide, but I only looked there for debugging advice and merely skimmed the other sections: https://github.com/flannel-io/flannel/blob/master/Documentation/troubleshooting.md#firewalls
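If the firewall turns out to be the culprit, I assume the fix on the OpenStack side would look roughly like this (sketch; <secgroup> is a placeholder for the security group used by the nodes, 8472/udp is the VXLAN backend port and 8285/udp the legacy udp backend port from the Flannel docs):

# allow Flannel VXLAN traffic between the nodes (adjust group name and CIDR as needed)
openstack security group rule create --protocol udp --dst-port 8472 --remote-ip 192.168.33.0/24 <secgroup>
# only needed if the udp backend is used instead of vxlan
openstack security group rule create --protocol udp --dst-port 8285 --remote-ip 192.168.33.0/24 <secgroup>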
Can you elaborate on that? I am not sure whether you are recommending Calico over Flannel in general, whether you think Calico would not run into this issue, or whether Calico would just be easier to debug if such an issue arises.
I decided to use Flannel because it was described as the easiest option; I know that Calico is the most popular.