Using kube-scheduler – KubeCon + CloudNativeCon NA 2021

Thanks to everyone who attended the kube-scheduler discussion/demo! Here are the steps to recreate the demos.

Setup

The setup is a 3-node cluster (t3.large instances, though smaller would work) on AWS.

  • 1 control plane node
  • 2 worker nodes

There is no special setup otherwise (aside from the node labels applied for scheduling during the demo).

Control Plane Node

This script (https://github.com/RX-M/classfiles/blob/master/k8s.sh) will install everything you need for the control plane node including:

  • Docker
  • Kubeadm, Kubectl
  • K8s 1.22 (as of today), and related items
  • Weave for networking
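
One way to fetch and run the script (a sketch; the raw URL is inferred from the repo link above, adjust for your environment):

wget https://raw.githubusercontent.com/RX-M/classfiles/master/k8s.sh
chmod +x k8s.sh
./k8s.sh    # run as a user with sudo privileges

Once the script completes, verify the control plane node is Ready:
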
kubectl get nodes
NAME               STATUS   ROLES                  AGE   VERSION
ip-172-31-34-142   Ready    control-plane,master   10m   v1.22.2

Worker Nodes

Use the following commands to set up the worker nodes.

wget -qO- https://get.docker.com/ | sh

cat <<EOF | sudo tee /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}
EOF

sudo systemctl restart docker
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee -a /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubeadm
sudo swapoff -a

echo "Run on control plane to retrieve join command: kubeadm token create --print-join-command"

Retrieve Worker Join Command

kubeadm token create --print-join-command

Run the resulting command on each worker node. Below is example output from running the join command generated by the “token create” command above.

sudo kubeadm join 172.31.34.142:6443 --token s58ey2.53btft6swiun00fm --discovery-token-ca-cert-hash sha256:8de0c0d115e2611581fd4d166dd1be736121d42d6ee1fcbc9cb232cd22bbef3d 
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.

Review Nodes

View the nodes and their labels before making any further changes.

kubectl get nodes --show-labels
NAME               STATUS   ROLES                  AGE     VERSION   LABELS
ip-172-31-33-243   Ready    <none>                 3d20h   v1.22.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-172-31-33-243,kubernetes.io/os=linux
ip-172-31-34-142   Ready    control-plane,master   3d20h   v1.22.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-172-31-34-142,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=
ip-172-31-39-32    Ready    <none>                 3d20h   v1.22.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-172-31-39-32,kubernetes.io/os=linux

Looks good; we are now ready to test.

Scheduler Logs and Defaults

Here we take a look at the scheduler logs (your pod name will include your control plane node’s name). While they give some details, they don’t explain how scheduling decisions are made (even if you increase the log level).

kubectl logs kube-scheduler-ip-172-31-34-142 -n kube-system
I1011 15:46:12.855811       1 serving.go:347] Generated self-signed cert in-memory
W1011 15:46:18.979651       1 requestheader_controller.go:193] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLEBINDING_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
W1011 15:46:18.979718       1 authentication.go:345] Error looking up in-cluster authentication configuration: configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-scheduler" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
W1011 15:46:18.979736       1 authentication.go:346] Continuing without authentication configuration. This may treat all requests as anonymous.
W1011 15:46:18.979745       1 authentication.go:347] To require authentication configuration lookup to succeed, set --authentication-tolerate-lookup-failure=false
I1011 15:46:19.089201       1 secure_serving.go:200] Serving securely on 127.0.0.1:10259
I1011 15:46:19.089391       1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I1011 15:46:19.091699       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I1011 15:46:19.089425       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I1011 15:46:19.189737       1 leaderelection.go:248] attempting to acquire leader lease kube-system/kube-scheduler...
I1011 15:46:19.193325       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file 
I1011 15:46:19.202629       1 leaderelection.go:258] successfully acquired lease kube-system/kube-scheduler 
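
If you do want more detail, one option (a sketch; --v is the standard Kubernetes component verbosity flag, and the kubelet restarts a static pod when its manifest changes) is to add a verbosity argument to the scheduler command in the static pod manifest we look at next, for example an extra line under “command”:

    - --v=5

Even at higher verbosity, though, the logs report activity rather than explaining how the scoring decisions were made.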

The scheduler runs as a static pod, which is created from the manifest file below. The key things to notice are the “command” and the “volumes”.

~$ sudo cat /etc/kubernetes/manifests/kube-scheduler.yaml 
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --port=0
    image: k8s.gcr.io/kube-scheduler:v1.22.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
  hostNetwork: true
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
status: {}

As we progress and make changes, you will want to confirm the correct manifest was launched. A reliable way to know what is running is to inspect the scheduler process itself. The following command prints the process’s full command line (rather than just listing the pods); notice it matches the “command” in the manifest above.

ps -p $(pidof kube-scheduler) -o cmd -hww | sed -e 's/--/\n--/g'
kube-scheduler 
--authentication-kubeconfig=/etc/kubernetes/scheduler.conf 
--authorization-kubeconfig=/etc/kubernetes/scheduler.conf 
--bind-address=127.0.0.1 
--kubeconfig=/etc/kubernetes/scheduler.conf 
--leader-elect=true 
--port=0

We can view the scheduler’s kubeconfig (the credentials it uses to reach the API server) as well.

sudo kubectl config view --kubeconfig /etc/kubernetes/scheduler.conf
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: DATA+OMITTED
    server: https://172.31.34.142:6443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: system:kube-scheduler
  name: system:kube-scheduler@kubernetes
current-context: system:kube-scheduler@kubernetes
kind: Config
preferences: {}
users:
- name: system:kube-scheduler
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED

For those who are new, here is how you can see which “user” the scheduler runs as (i.e., the identity it uses when making requests to the Kubernetes API server).

sudo apt install jq -y
sudo kubectl config view --kubeconfig /etc/kubernetes/scheduler.conf -o json --raw | jq '.users[].user."client-certificate-data"' -r | base64 -d | openssl x509 -in - -text | grep CN
        Issuer: CN = kubernetes
        Subject: CN = system:kube-scheduler
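
Another quick way to reason about that identity (a convenience check, not something from the original demo) is to ask the API server what it can do while impersonating the scheduler user:

kubectl auth can-i list nodes --as=system:kube-scheduler

This should report “yes”, since the system:kube-scheduler ClusterRole grants read access to nodes.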

For those with more interest, here are some of the RBAC resources that grant the scheduler its access.

kubectl get clusterrolebindings.rbac.authorization.k8s.io | grep kube-scheduler
kubectl describe clusterroles.rbac.authorization.k8s.io system:kube-scheduler
kubectl describe clusterrolebindings.rbac.authorization.k8s.io system:kube-scheduler
kubectl get rolebindings.rbac.authorization.k8s.io -A
kubectl describe roles.rbac.authorization.k8s.io system::leader-locking-kube-scheduler -n kube-system

Basic Test

Here we start learning how to follow when (and how) scheduling happens.

kubectl run nginx --image=nginx

kubectl get events -A
NAMESPACE   LAST SEEN   TYPE     REASON      OBJECT      MESSAGE
default     2m13s       Normal   Scheduled   pod/nginx   Successfully assigned default/nginx to ip-172-31-33-243
default     2m12s       Normal   Pulling     pod/nginx   Pulling image "nginx"
default     2m8s        Normal   Pulled      pod/nginx   Successfully pulled image "nginx" in 4.221582414s
default     2m8s        Normal   Created     pod/nginx   Created container nginx
default     2m7s        Normal   Started     pod/nginx   Started container nginx

Notice the “REASON” column: the first thing that happens is scheduling; the remaining events are execution-side activities on the node. While simple in appearance, node selection did happen, via predicates (filtering) and priorities (scoring).
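
A quick way to see the scheduling result for a single pod, without scanning all events (again, a convenience rather than part of the original demo):

kubectl describe pod nginx                              # the Events section shows the Scheduled event
kubectl get pod nginx -o jsonpath='{.spec.nodeName}'    # prints just the node the pod was bound to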

Using Pod (Anti)Affinity

Next we try to influence pod placement with affinity configuration.

We first launch a pod that has a label (security=S1).

kubectl run like-minded-pod --image=k8s.gcr.io/pause:2.0 -l security=S1

Next we run a pod with ‘affinity’ for that label; this pod should land on the same node as the security=S1 pod!

cat podaffinity.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: meal 
  containers:
  - name: with-pod-affinity
    image: k8s.gcr.io/pause:2.0

kubectl apply -f podaffinity.yaml

If you list the pods and nodes:

kubectl get pods -o=custom-columns=NAME:.metadata.name,Namespace:.metadata.namespace,NODE:.spec.nodeName,LABELS:.metadata.labels | sort

You will see that “with-pod-affinity” did not get placed (it stays Pending). This has to do with the topologyKey: that key is used to group nodes into ‘zones’, and no node has the meal label yet. We need to add a label called meal to a node (the value doesn’t matter except to group nodes by it).

Here we label one of the workers (your IP/DNS will differ).

kubectl label nodes ip-172-31-39-32 meal=breakfast

If you list the pods and nodes now, you should see it running on the same worker.
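
A quick way to double-check which nodes carry the label (just a verification step):

kubectl get nodes -L meal

The -L flag adds a MEAL column showing each node’s value for that label (empty for unlabeled nodes).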

We are now ready to demo anti-affinity. It is very similar to the above, except it “avoids” nodes running pods with a certain label.

cat podantiaffinity.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-antiaffinity
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: meal 
  containers:
  - name: with-pod-antiaffinity
    image: k8s.gcr.io/pause:2.0
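
Apply it and list the pods again:

kubectl apply -f podantiaffinity.yaml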

This works even though we didn’t label the other node (you can, just give it a different meal value); having no label effectively acts as its own default group.

Policy Configuration

We next see how to do scheduler policy configuration. In our case we want to make image locality (the ImageLocality plugin) the only scoring factor.

First clean up the prior work and restore the stock scheduler manifest from a backup copy (make that backup, kube-scheduler.yaml.bak, from /etc/kubernetes/manifests/kube-scheduler.yaml before you modify anything).

kubectl label node --all meal-
sudo cp kube-scheduler.yaml.bak /etc/kubernetes/manifests/kube-scheduler.yaml

Create our new scheduler configuration file with a custom profile.

cat /etc/kubernetes/myscheduler.conf 
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
profiles:
- plugins:
    queueSort:
      disabled:
      - name: '*'
      enabled:
      - name: PrioritySort
    preFilter:
      disabled:
      - name: '*'
    filter:
      disabled:
      - name: '*'
      enabled:
      - name: TaintToleration
    postFilter:
      disabled:
      - name: '*'
    preScore:
      disabled:
      - name: '*'
    score:
      disabled:
      - name: '*'
      enabled:
      - name: ImageLocality
        weight: 10
    reserve:
      disabled:
      - name: '*'
    permit:
      disabled:
      - name: '*'
    preBind:
      disabled:
      - name: '*'
    bind:
      disabled:
      - name: '*'
      enabled:
      - name: DefaultBinder
    postBind:
      disabled:
      - name: '*'

Update the scheduler manifest to the following:

vi kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --config=/etc/kubernetes/myscheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    image: k8s.gcr.io/kube-scheduler:v1.22.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
    - mountPath: /etc/kubernetes/myscheduler.conf
      name: mysched
      readOnly: true
  hostNetwork: true
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate    
    name: kubeconfig
  - hostPath:
      path: /etc/kubernetes/myscheduler.conf
      type: FileOrCreate
    name: mysched      

Notice we mount the new configuration file as a volume and the command changed to point to it via --config.

Overwrite the manifest.

sudo cp kube-scheduler.yaml /etc/kubernetes/manifests/kube-scheduler.yaml

Confirm the new scheduler is running (it could take a minute); the command line should now include the --config flag shown above.

ps -p $(pidof kube-scheduler) -o cmd -hww | sed -e 's/--/\n--/g'
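
If the process check is inconvenient, another option (a convenience, not part of the original demo) is to read the logs of whichever scheduler pod is currently running by selecting on its label:

kubectl logs -n kube-system -l component=kube-scheduler --tail=20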

Set Up Nodes

To make image locality visible in the demo, we need to prep the nodes first by pre-pulling a different nginx image on each worker.

On one node (doesn’t matter which worker), pull nginx:1.10.

sudo docker image pull nginx:1.10

On the other node, pull nginx:1.20.

sudo docker image pull nginx:1.20
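
You can confirm what each worker has cached (a quick sanity check):

sudo docker image ls nginx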

Wait about 10 seconds (remember the cluster is eventually consistent; the kubelet reports its cached images to the API server as part of its node status updates).

Now we can launch some pods. Create the following script and test.

cat test.sh 
#!/bin/bash

kubectl get pods -o=custom-columns=NAME:.metadata.name,Namespace:.metadata.namespace,NODE:.spec.nodeName,LABELS:.metadata.labels | sort

for i in {0..20..1}
do
  kubectl run nx1-10-$i --image=nginx:1.10
  kubectl run nx1-20-$i --image=nginx:1.20
  sleep 1
done

kubectl get pods -o=custom-columns=NAME:.metadata.name,Namespace:.metadata.namespace,NODE:.spec.nodeName,LABELS:.metadata.labels | sort
chmod +x test.sh
./test.sh
...

If all went well, you will see the nginx:1.10 pods packed onto one node and the nginx:1.20 pods onto the other, showing how we influenced the scheduler.
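
A quick way to tally placements per node and image (a convenience; the custom-columns paths are standard pod fields):

kubectl get pods -o=custom-columns=NODE:.spec.nodeName,IMAGE:.spec.containers[0].image --no-headers | sort | uniq -c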

If you need a cleanup script, you can use this.

cat cleanup.sh 
#!/bin/bash -x

echo "Start kubectl proxy --port=8080&"

# no paging, use api - kubectl get pods | head | awk '{print $1}' | xargs -I% kubectl delete pod %

curl -s http://localhost:8080/api/v1/pods | jq '.items[] as $i | select($i.metadata.namespace|contains("default")) | $i.metadata.name' -r | xargs -I% kubectl delete pod %

echo "Run on each worker to remove the pulled nginx images:"
echo "sudo docker image ls nginx | awk '{print \$2}' | xargs -I% sudo docker image rm nginx:%"

chmod +x cleanup.sh

kubectl proxy --port=8080&

./cleanup.sh

The proxy is there because kubectl doesn’t expose the pagination we want here, so we call the API directly instead.
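
For a demo-sized cluster where pagination isn’t a concern, a simpler alternative (not what we used in the session) is:

kubectl delete pods --all -n default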

If you have any questions please let me know! Thanks for stopping by.

Ronald Petty

ronald.petty@rx-m.com 

https://www.linkedin.com/in/ronaldpetty/