Cloud Native Short Take – Kubernetes Pod Scheduling

(Sep. 23 2021) – Welcome to another Cloud Native Short Take from RX-M. My name is Chris Hanson, and today we're going to review the Kubernetes module: pod scheduling. In this module we typically cover things like the default scheduling process, selectors, affinities, pod priorities, and preemption.

The demo covers pod anti-affinities and uses a four-node cluster with labeled nodes. The odd workers (1 and 3) are in zone A and the even workers (2 and 4) are in zone B. These zone labels, together with the built-in hostname label, are used by the Deployment's anti-affinity expressions to spread pods in two ways: across nodes and across zones.
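As a rough sketch of that setup, the node labeling might look like the following; the node names and the `zone` label key are assumptions, since the demo does not show the exact commands.

```bash
# Hypothetical node names and label key -- adjust to match your cluster.
kubectl label node worker-1 worker-3 zone=a
kubectl label node worker-2 worker-4 zone=b

# Confirm the labels (-L adds a column for the given label key).
kubectl get nodes -L zone
```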

We first take a look at the Deployment’s anti-affinity, which has two expressions, the first of which is requiredDuringSchedulingIgnoredDuringExecution. The key is intentionally long because it makes something explicit: the expression is evaluated during scheduling, but at execution time (when the pods are already running on a node) it is not reevaluated. Scheduling is not dynamic; it happens once, unless something recreates a pod (like a rolling update or an eviction, both of which cause the replacement pod to be scheduled again). The label selector in the expression matches the labels of the pods themselves. Combined with the “hostname” topology key (kubernetes.io/hostname), this means no two pods from the same Deployment should land on the same host.

The second expression is preferredDuringSchedulingIgnoredDuringExecution. This one uses the zone label to spread across the zones. The difference between “required” and “preferred” is that a “required” expression acts as a filter (a scheduling predicate), removing non-matching nodes from consideration entirely, while a “preferred” expression merely scores nodes, ranking nodes in zones without pods from this Deployment higher than nodes in zones that already hold pods from the set.
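A minimal sketch of the kind of Deployment the demo describes is shown below. The Deployment name, pod label, container image, and the `zone` topology key are assumptions; only the shape of the two anti-affinity expressions reflects what the module covers.

```bash
# Sketch of a Deployment using both anti-affinity expressions (names are hypothetical).
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spread-demo
  template:
    metadata:
      labels:
        app: spread-demo              # both anti-affinity selectors match this pod label
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: no two pods from this set on the same node (hostname topology).
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: spread-demo
            topologyKey: kubernetes.io/hostname
          # Soft rule: prefer zones that do not already hold a pod from this set.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: spread-demo
              topologyKey: zone       # assumed zone label key from the node labels above
      containers:
      - name: web
        image: nginx:1.21
EOF
```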

The demo starts by deploying one replica. All nodes in all zones are viable candidates, so the pod ends up on worker 3, which is in zone A. No more pods from this Deployment can run on worker 3 (the required hostname rule), and the next pod we create should not be scheduled on worker 1, which is also in zone A (the preferred zone rule).

The next pod should land on worker 2 or worker 4 because those nodes are in a different zone. We scale the Deployment and see that the second replica lands on worker 2, in the opposite zone. At this point, because we have a pod in each zone, both zones are treated equally. The next pod in a scale-up can be scheduled to either zone.

The third pod is scheduled on worker 4, in the same zone as worker 2. That is okay: both zones already hold a pod, so the zone preference no longer favors either zone, but we continue to spread across our workers. When we scale up to four pods, we should expect the fourth pod to land on our last unused node, worker 1 (which it does).
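The scale-out steps might look like the following sketch; the Deployment name is the hypothetical one from above, and the exact node each pod lands on will vary from cluster to cluster.

```bash
# Walk the replica count up and watch placement after each step.
kubectl scale deployment spread-demo --replicas=2
kubectl get pods -o wide              # second pod should land in the other zone

kubectl scale deployment spread-demo --replicas=3
kubectl get pods -o wide              # third pod reuses a zone but not a node

kubectl scale deployment spread-demo --replicas=4
kubectl get pods -o wide              # fourth pod fills the last unused node
```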

Combining these two expressions, we’ve created an explicit spread across both nodes and zones!

What if we scale to five? Do we just reuse existing nodes because we only have a four-node cluster? The demo reveals that replica five stays in the Pending state because no node can satisfy the “required during scheduling” expression! Whenever there are more pods than nodes, the excess pods sit in Pending, which is also why we don’t use “required during scheduling” for zone-based expressions. You could if you only wanted one pod per zone, but if you want more than one pod in a zone you cannot use “required”; you have to use “preferred”.
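To see that failure mode, one might scale past the node count and inspect the stuck pod; again, the Deployment name and label are the hypothetical ones from the sketch above.

```bash
# Scale beyond the number of nodes; the fifth replica cannot satisfy the
# required hostname anti-affinity, so it stays Pending.
kubectl scale deployment spread-demo --replicas=5
kubectl get pods                      # one pod remains in the Pending state

# The Events section of the pending pod should show a FailedScheduling
# message explaining that no node satisfied the pod anti-affinity rule.
kubectl describe pods -l app=spread-demo
```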

This is just one of the things that you will learn when you attend one of our Kubernetes courses! You can go to the RX-M custom courseware builder (https://rx-m.com/cloud-native-training-custom-course-builder-list/), find the pod scheduling module and add it to a custom course along with a number of other modules.

That is our Cloud Native Short Take on the pod scheduling module.