Kubernetes Networking
Every Pod in a Kubernetes cluster is given its own IP address. This address is unique within the entire cluster and allows Pods in the same cluster to communicate with each other. However, we rarely deploy just a Pod: a Pod is considered ephemeral and is the most basic unit of compute. We often use a Deployment instead, which provides horizontal scaling and a promise that a minimum number of Pods will be maintained. When a Pod goes down, it can be recreated on any node, with a completely different IP address.
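As a minimal sketch (the name, labels, and image are placeholders for illustration), a Deployment that keeps three replicas of a Pod running looks roughly like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                   # hypothetical name
spec:
  replicas: 3                 # Kubernetes keeps three Pods running, rescheduling them on any node
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web              # Pods created from this template carry this label
    spec:
      containers:
        - name: nginx
          image: nginx        # placeholder image
          ports:
            - containerPort: 80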
Kubernetes Services abstract away the IP addresses of individual Pods and provide a stable IP address we can use instead. Services come in several types (a minimal manifest sketch follows the list):
- ClusterIP (default): Provides an IP only usable within the cluster.
- NodePort: Exposes the service on a port on every node.
- LoadBalancer: Relies on external cloud provider to provision a load balancer.
- ExternalName: Maps to an external DNS name without proxying.
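As a hedged sketch, a Service that selects the Pods from the Deployment above and exposes them on every node might look like this (the name and nodePort value are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: web                   # hypothetical name
spec:
  type: NodePort              # omit, or use ClusterIP, for the default
  selector:
    app: web                  # matches the Pod labels
  ports:
    - port: 80                # port on the Service's virtual IP
      targetPort: 80          # port on the backend Pods
      nodePort: 30080         # optional; must fall inside the node port range (30000-32767 by default)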
A Service uses Kubernetes selectors to determine which Pods are valid backends for it. A Service is backed by an EndpointSlice, a list of valid IP addresses that the Service's virtual IP address should route to. This routing of traffic is, by default, handled by kube-proxy.
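Roughly, the EndpointSlice the controller generates for such a Service looks like the sketch below; the object name and Pod IP are illustrative, not taken from a real cluster:

apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: web-abc12                       # generated name, illustrative
  labels:
    kubernetes.io/service-name: web     # ties the slice back to its Service
addressType: IPv4
ports:
  - protocol: TCP
    port: 80
endpoints:
  - addresses:
      - 10.244.1.2                      # a Pod IP matched by the Service's selector
    conditions:
      ready: true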
kube-proxy is a component that runs on every node in the cluster. It can operate using two different backends:
- iptables (default)
- ipvs
It watches Services and EndpointSlices to determine how to configure the backend to redirect traffic from the Service's virtual IP to one of the IP addresses in the EndpointSlice.
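The backend is chosen through kube-proxy's configuration. In a kubeadm-style cluster this lives in the kube-proxy ConfigMap in kube-system (an assumption; check your own distribution), and the relevant fragment for switching to ipvs looks roughly like:

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"                  # leave empty or set "iptables" for the default backend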
A Pod's networking is established by a CNI (Container Network Interface) plugin. The plugin is responsible for:
- Creating the network interfaces within the Pod.
- Assigning IP addresses from the Pod CIDR range.
- Managing Pod-to-Pod communication across nodes.
While Pods have their own network namespaces, they share the node's kernel. Each Pod has an entire copy of the Linux networking stack, which is why we can run so many Pods on localhost:80 without any collision.
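To make that concrete, here is a hedged sketch of two Pods that both listen on port 80 (names and image are placeholders). They can even land on the same node without conflict, because each gets its own network namespace:

apiVersion: v1
kind: Pod
metadata:
  name: web-a
spec:
  containers:
    - name: nginx
      image: nginx            # binds port 80 inside its own network namespace
      ports:
        - containerPort: 80
---
apiVersion: v1
kind: Pod
metadata:
  name: web-b
spec:
  containers:
    - name: nginx
      image: nginx            # also binds port 80; no collision with web-a
      ports:
        - containerPort: 80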
A CNI plugin like Flannel uses a VXLAN backend by default. VXLAN wraps a MAC frame in a UDP datagram for transport across an IP network, creating an overlay network that spans all nodes.
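Flannel's backend choice lives in its net-conf.json, usually delivered via a ConfigMap. The fragment below mirrors the upstream manifests, but the object name and namespace vary by Flannel version, so treat them as assumptions and verify on your cluster:

kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-flannel-cfg       # as in the stock manifests; may differ in your deployment
  namespace: kube-flannel
data:
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }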
To prevent collisions, the Service CIDR and the Pod CIDR are separate.
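With kind, both ranges can be set explicitly in the cluster config; the values below match a typical kind setup (treat the exact subnets as assumptions) and illustrate that the two ranges do not overlap:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  podSubnet: "10.244.0.0/16"      # Pod CIDR; each node is allocated a /24 slice of this
  serviceSubnet: "10.96.0.0/16"   # Service (ClusterIP) CIDR; a separate range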
Example
To demonstrate some of this, here is output from a kind cluster of 3 nodes.
❯ k get nodes -o=custom-columns='NAME:.metadata.name,CIDR:.spec.podCIDR,ExternalIP:.status.addresses[0].address'
NAME                 CIDR            ExternalIP
kind-control-plane   10.244.0.0/24   172.18.0.3
kind-worker          10.244.1.0/24   172.18.0.2
kind-worker2         10.244.2.0/24   172.18.0.4
On a node, running ip route shows how traffic for Pods on other nodes is routed.
> docker exec -it kind-worker ip route
default via 172.18.0.1 dev eth0
10.244.0.0/24 via 172.18.0.3 dev eth0
10.244.1.2 dev vethf0b11f5a scope host
10.244.2.0/24 via 172.18.0.4 dev eth0
172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.2
Inside a Pod, ip route returns:
root@my-shell:/# ip route
default via 10.244.1.1 dev eth0
10.244.1.0/24 via 10.244.1.1 dev eth0 src 10.244.1.2
10.244.1.1 dev eth0 scope link src 10.244.1.2
This shows how Pod traffic is routed: traffic for Pods on the same node goes directly through the Pod's eth0 interface, while traffic for Pods on other nodes is sent to that node first, which then forwards it appropriately. Where these routes overlap, the more specific route (longest prefix match) takes precedence.