Using kube-scheduler – KubeCon + CloudNativeCon NA 2021
(Oct. 12, 2021) – Thanks to everyone who attended the kube-scheduler discussion/demo! Here are the steps to recreate the demos.
Setup
The setup is a 3-node cluster on AWS (t3.large instances; smaller would also work).
- 1 control plane node
- 2 worker nodes
There is no special setup otherwise (aside from the node labels used for scheduling later in the demo).
Control Plane Node
This script (https://github.com/RX-M/classfiles/blob/master/k8s.sh) will install everything you need for the control plane node including:
- Docker
- Kubeadm, Kubectl
- K8s 1.22 (as of today), and related items
- Weave Net for pod networking
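One way to fetch and run it on the control plane node is sketched below; review the script first and run it as a regular user with sudo rights (cloning the repo is just one option).
git clone https://github.com/RX-M/classfiles.git
bash ./classfiles/k8s.sh
When it finishes, the control plane node should report Ready: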
kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-172-31-34-142 Ready control-plane,master 10m v1.22.2
Worker Nodes
Use the following commands to set up the worker nodes.
sudo apt-get update
wget -qO- https://get.docker.com/ | sh
cat <<EOF | sudo tee /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}
EOF
sudo systemctl restart docker
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee -a /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubeadm
sudo swapoff -a
echo "Run on control plane to retrieve join command: kubeadm token create --print-join-command"
Retrieve Worker Join Command
kubeadm token create --print-join-command
Run the printed join command on each worker node. Below is example output from the join command generated by the "token create" command above.
sudo kubeadm join 172.31.34.142:6443 --token s58ey2.53btft6swiun00fm --discovery-token-ca-cert-hash sha256:8de0c0d115e2611581fd4d166dd1be736121d42d6ee1fcbc9cb232cd22bbef3d
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
Review Nodes
View the nodes and their labels before making any further changes.
kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
ip-172-31-33-243 Ready <none> 3d20h v1.22.2 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-172-31-33-243,kubernetes.io/os=linux
ip-172-31-34-142 Ready control-plane,master 3d20h v1.22.2 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-172-31-34-142,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=
ip-172-31-39-32 Ready <none> 3d20h v1.22.2 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-172-31-39-32,kubernetes.io/os=linux
Looks good; we are now ready to test.
Scheduler Logs and Defaults
Here we take a look at the scheduler logs. While they give some details, they don't explain how scheduling decisions are made (even if you increase the log level).
kubectl logs kube-scheduler-ip-172-31-34-142 -n kube-system
I1011 15:46:12.855811 1 serving.go:347] Generated self-signed cert in-memory
W1011 15:46:18.979651 1 requestheader_controller.go:193] Unable to get configmap/extension-apiserver-authentication in kube-system. Usually fixed by 'kubectl create rolebinding -n kube-system ROLEBINDING_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
W1011 15:46:18.979718 1 authentication.go:345] Error looking up in-cluster authentication configuration: configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-scheduler" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
W1011 15:46:18.979736 1 authentication.go:346] Continuing without authentication configuration. This may treat all requests as anonymous.
W1011 15:46:18.979745 1 authentication.go:347] To require authentication configuration lookup to succeed, set --authentication-tolerate-lookup-failure=false
I1011 15:46:19.089201 1 secure_serving.go:200] Serving securely on 127.0.0.1:10259
I1011 15:46:19.089391 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I1011 15:46:19.091699 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I1011 15:46:19.089425 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I1011 15:46:19.189737 1 leaderelection.go:248] attempting to acquire leader lease kube-system/kube-scheduler...
I1011 15:46:19.193325 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I1011 15:46:19.202629 1 leaderelection.go:258] successfully acquired lease kube-system/kube-scheduler
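If you do want more detail, one option (a sketch, not required for the rest of the demo) is to raise the klog verbosity by adding a --v flag to the scheduler's static pod manifest, which we look at next. The sed pattern below assumes the default kubeadm manifest layout; the kubelet restarts the static pod automatically after the edit.
# assumes the 4-space indentation kubeadm uses under "command:"; --v=5 is quite chatty
sudo sed -i 's|- --leader-elect=true|- --leader-elect=true\n    - --v=5|' /etc/kubernetes/manifests/kube-scheduler.yaml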
The scheduler is running in a pod. That pod is created from the static pod manifest below. The key things to notice are the "command" and the "volumes".
sudo cat /etc/kubernetes/manifests/kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --port=0
    image: k8s.gcr.io/kube-scheduler:v1.22.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
  hostNetwork: true
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
status: {}
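The liveness and startup probes above hit the scheduler's secure health endpoint on 127.0.0.1:10259. You can check it by hand from the control plane node; a quick sketch (-k skips certificate verification, since this is the scheduler's self-signed serving cert):
curl -k https://127.0.0.1:10259/healthz ; echo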
As we progress and make changes, you will want to confirm the correct manifest was launched. A reliable way to know what is running is to inspect the process itself (rather than just listing the pods). Notice the output matches the "command" in the manifest above.
ps -p $(pidof kube-scheduler) -o cmd -hww | sed -e 's/--/\n--/g'
kube-scheduler
--authentication-kubeconfig=/etc/kubernetes/scheduler.conf
--authorization-kubeconfig=/etc/kubernetes/scheduler.conf
--bind-address=127.0.0.1
--kubeconfig=/etc/kubernetes/scheduler.conf
--leader-elect=true
--port=0
We can view the kubeconfig the scheduler authenticates with as well.
sudo kubectl config view --kubeconfig /etc/kubernetes/scheduler.conf
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: DATA+OMITTED
    server: https://172.31.34.142:6443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: system:kube-scheduler
  name: system:kube-scheduler@kubernetes
current-context: system:kube-scheduler@kubernetes
kind: Config
preferences: {}
users:
- name: system:kube-scheduler
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED
For those who are new, here is how you can see which "user" the scheduler runs as, i.e., the identity it uses when making requests to the Kubernetes API server.
sudo apt install jq -y
sudo kubectl config view --kubeconfig /etc/kubernetes/scheduler.conf -o json --raw | jq '.users[].user."client-certificate-data"' -r | base64 -d | openssl x509 -in - -text | grep CN
Issuer: CN = kubernetes
Subject: CN = system:kube-scheduler
For those with more interest, here are the RBAC objects that define what the scheduler is allowed to access.
kubectl get clusterrolebindings.rbac.authorization.k8s.io | grep kube-scheduler
kubectl describe clusterroles.rbac.authorization.k8s.io system:kube-scheduler
kubectl describe clusterrolebindings.rbac.authorization.k8s.io system:kube-scheduler
kubectl get rolebindings.rbac.authorization.k8s.io -A
kubectl describe roles.rbac.authorization.k8s.io system::leader-locking-kube-scheduler -n kube-system
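You can also spot-check individual permissions by impersonating the scheduler's user. A quick sketch; the verbs and resources below are just examples:
kubectl auth can-i create pods/binding --as=system:kube-scheduler
kubectl auth can-i delete pods --as=system:kube-scheduler
kubectl auth can-i list nodes --as=system:kube-scheduler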
Basic Test
Here we start learning how to observe when scheduling happens (and how it happens).
kubectl run nginx --image=nginx
kubectl get events -A
NAMESPACE LAST SEEN TYPE REASON OBJECT MESSAGE
default 2m13s Normal Scheduled pod/nginx Successfully assigned default/nginx to ip-172-31-33-243
default 2m12s Normal Pulling pod/nginx Pulling image "nginx"
default 2m8s Normal Pulled pod/nginx Successfully pulled image "nginx" in 4.221582414s
default 2m8s Normal Created pod/nginx Created container nginx
default 2m7s Normal Started pod/nginx Started container nginx
Notice the REASON column: the first thing that happens is scheduling; the remaining events are execution-side activities on the node. While simple in appearance, node selection happened via filtering (the old predicates) and scoring (the old priorities).
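A couple of other quick ways to see the decision for this pod (a sketch):
kubectl get pod nginx -o jsonpath='{.spec.nodeName}{"\n"}'
kubectl get events --field-selector involvedObject.name=nginx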
Using Pod (Anti)Affinity
Next we try to influence pod placement with affinity configuration.
We first launch a pod that has a label (security=S1).
kubectl run like-minded-pod --image=k8s.gcr.io/pause:2.0 -l security=S1
Next we run a pod with 'affinity' for that label; this pod should land on the same node as the security=S1 pod!
cat podaffinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: meal
  containers:
  - name: with-pod-affinity
    image: k8s.gcr.io/pause:2.0
kubectl apply -f podaffinity.yaml
If you list the pods and nodes:
kubectl get pods -o=custom-columns=NAME:.metadata.name,Namespace:.metadata.namespace,NODE:.spec.nodeName,LABELS:.metadata.labels | sort
You will see that "with-pod-affinity" did not get placed (it stays Pending). This has to do with the topologyKey: that key names a node label used to group nodes into 'zones', and no node currently has a meal label. We need to add a node label called meal (the value doesn't matter except to group nodes by it).
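Before labeling anything, you can confirm the pod is stuck Pending and see the scheduler's reason (a sketch):
kubectl get pod with-pod-affinity
kubectl describe pod with-pod-affinity | grep -A5 Events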
Here we label the worker where like-minded-pod landed (your IP/DNS will differ).
kubectl label nodes ip-172-31-39-32 meal=breakfast
If you list the pods and nodes now, you should see it running on the same worker.
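For example, a quick filter on the wide listing (a sketch):
kubectl get pods -o wide | grep -E 'like-minded-pod|with-pod-affinity'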
We are now ready to demo anti-affinity. It is very similar to the above, except it "avoids" topology domains (here, nodes grouped by the meal label) that already run pods with a certain label.
cat podantiaffinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-antiaffinity
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: meal
  containers:
  - name: with-pod-affinity
    image: k8s.gcr.io/pause:2.0
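Apply it just as we did for the affinity pod:
kubectl apply -f podantiaffinity.yaml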
This works even though we didn't label the other node (you can; just give it a different meal value). Nodes without the label effectively form their own default group.
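If you prefer every node to carry the label, you can give the other worker a different value (the node name and value here are just examples; yours will differ):
kubectl label nodes ip-172-31-33-243 meal=lunch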
Policy Configuration
We next see how to do policy (scheduler profile) configuration. In our case we want the ImageLocality plugin to be the only scoring factor.
First clean up the prior work and make a backup of the scheduler manifest (so we can restore the default later).
kubectl label node --all meal-
sudo cp /etc/kubernetes/manifests/kube-scheduler.yaml kube-scheduler.yaml.bak
Create our new profile.
cat /etc/kubernetes/myscheduler.conf
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
profiles:
- plugins:
    queueSort:
      disabled:
      - name: '*'
      enabled:
      - name: PrioritySort
    preFilter:
      disabled:
      - name: '*'
    filter:
      disabled:
      - name: '*'
      enabled:
      - name: TaintToleration
    postFilter:
      disabled:
      - name: '*'
    preScore:
      disabled:
      - name: '*'
    score:
      disabled:
      - name: '*'
      enabled:
      - name: ImageLocality
        weight: 10
    reserve:
      disabled:
      - name: '*'
    permit:
      disabled:
      - name: '*'
    preBind:
      disabled:
      - name: '*'
    bind:
      disabled:
      - name: '*'
      enabled:
      - name: DefaultBinder
    postBind:
      disabled:
      - name: '*'
Update the scheduler manifest to the following:
vi kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --config=/etc/kubernetes/myscheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    image: k8s.gcr.io/kube-scheduler:v1.22.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
    - mountPath: /etc/kubernetes/myscheduler.conf
      name: mysched
      readOnly: true
  hostNetwork: true
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
  - hostPath:
      path: /etc/kubernetes/myscheduler.conf
      type: FileOrCreate
    name: mysched
Notice we mount the new configuration file as an additional volume, and the command now points to it via --config.
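If you made the kube-scheduler.yaml.bak backup earlier, a quick diff highlights exactly what changed:
diff kube-scheduler.yaml.bak kube-scheduler.yaml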
Overwrite the manifest.
sudo cp kube-scheduler.yaml /etc/kubernetes/manifests/kube-scheduler.yaml
Confirm the new scheduler is running (this could take a minute); the command should match the manifest above (look for --config=).
ps -p $(pidof kube-scheduler) -o cmd -hww | sed -e 's/--/\n--/g'
Set Up Nodes
Because the ImageLocality plugin scores nodes based on the images they already have, we prep the nodes first by pre-pulling a different image on each worker.
On one of the worker nodes (it doesn't matter which), pull nginx:1.10.
sudo docker image pull nginx:1.10
On the other worker node, pull nginx:1.20.
sudo docker image pull nginx:1.20
Wait about 10 seconds (remember the system is eventually consistent; the kubelet reports its images to the API server via periodic node status updates).
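You can confirm the kubelet has reported the pre-pulled image in the node status (a sketch; substitute one of your worker node names):
kubectl get node ip-172-31-39-32 -o jsonpath='{range .status.images[*]}{.names}{"\n"}{end}' | grep nginx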
Now we can launch the test. Create the following script and run it.
cat test.sh
#!/bin/bash
kubectl get pods -o=custom-columns=NAME:.metadata.name,Namespace:.metadata.namespace,NODE:.spec.nodeName,LABELS:.metadata.labels | sort
for i in {0..20..1}
do
kubectl run nx1-10-$i --image=nginx:1.10
kubectl run nx1-20-$i --image=nginx:1.20
sleep 1
done
kubectl get pods -o=custom-columns=NAME:.metadata.name,Namespace:.metadata.namespace,NODE:.spec.nodeName,LABELS:.metadata.labels | sort
chmod +x test.sh
./test.sh
...
If all went well, you will see all of the nginx:1.10 pods on one node and all of the nginx:1.20 pods on the other, showing how we influenced the scheduler.
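A quick way to tally the split by node and image (a sketch using custom columns):
kubectl get pods -o custom-columns=NODE:.spec.nodeName,IMAGE:.spec.containers[0].image --no-headers | sort | uniq -c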
If you need a cleanup script, you can use this.
cat cleanup.sh
#!/bin/bash -x
echo "Start kubectl proxy --port=8080&"
# no paging, use api - kubectl get pods | head | awk '{print $1}' | xargs -I% kubectl delete pod %
curl -s http://localhost:8080/api/v1/pods | jq '.items[] as $i | select($i.metadata.namespace|contains("default")) | $i.metadata.name' -r | xargs -I% kubectl delete pod %
echo "Run this on each worker to remove the pulled images:"
echo "sudo docker image ls nginx | awk '{print \$2}' | xargs -I% sudo docker image rm nginx:%"
chmod +x cleanup.sh
kubectl proxy --port=8080&
./cleanup.sh
The proxy is there so we can list the pods through the API directly, avoiding paging issues with kubectl output.
If you have any questions please let me know! Thanks for stopping by.
Ronald Petty