1. Overview of the Pod scheduling strategy
The central problem that container orchestration in Kubernetes must solve is scheduling each newly created Pod onto a Node. So how does a Pod decide which Node to run on? This is the job of the kube-scheduler component introduced earlier when we installed the Kubernetes cluster.
kube-scheduler selects a node for a Pod in two steps:
(1) Filtering: nodes that cannot run the Pod are filtered out according to the Pod's scheduling requirements, such as whether resources fit, whether labels match, and whether taints are tolerated.
(2) Scoring: the nodes that pass filtering are scored, and the node with the highest score is chosen to run the Pod.
The Pod's resource requests were introduced in the previous content; a Pod can be scheduled onto a Node only if that Node's available resources satisfy the Pod's requests.
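As a reminder, these resource requests are declared under each container in the Pod spec. A minimal sketch (the request values here are illustrative, not taken from this chapter's examples):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nginx
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:          # the scheduler filters out nodes that cannot satisfy these
        cpu: "250m"
        memory: "128Mi"
```

During filtering, a node qualifies only if its allocatable capacity, minus the requests of Pods already placed on it, still covers these values.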
There are two other mechanisms that directly affect Pod scheduling, namely:
(1) Labels
(2) Taints
2. Scheduling practice with labels
Labels can be defined on all kinds of resources in a Kubernetes cluster, and higher-level resources can select the resources they manage by label. There are three common uses of labels:
(1) A Deployment or Service selects its Pods by label. This will be introduced later.
(2) A PVC selects, via its label selector, a PV that carries the label and meets its resource requirements. This will be introduced later.
(3) When a Pod is defined, it can explicitly declare that it should be deployed only on nodes carrying certain labels.
Next, we label a node through an example and declare in the Pod definition which node label it requires, to verify the effect.
Labels can be defined in the metadata of a resource's yaml file, as shown below:
metadata:
  labels:
    key1: value1
    key2: value2
For a Worker node in the Kubernetes cluster, we can manage labels from the command line. The commands to add and to remove a label are as follows:
(1)kubectl label nodes node1 key1=value1
(2)kubectl label nodes node1 key1-
First, label the node:
kubectl label nodes kubernetes-worker01 scheduler-node=node1
Then verify that the node has been labeled. The query result below shows that the label has been applied successfully.
[root@kubernetes-master01 ~]# kubectl describe node kubernetes-worker01
Name:    kubernetes-worker01
Roles:   <none>
Labels:  beta.kubernetes.io/arch=amd64
         beta.kubernetes.io/os=linux
         kubernetes.io/arch=amd64
         kubernetes.io/hostname=kubernetes-worker01
         kubernetes.io/os=linux
         scheduler-node=node1
In the second step, we define a Pod in yaml. The file contents are as follows:
apiVersion: v1
kind: Pod
metadata:
  name: pod-nginx
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
      hostPort: 80
  nodeSelector:
    scheduler-node: node1
Create this Pod:
kubectl apply -f pod-nginx-select.yaml
Let's see which node the Pod was scheduled to. As the output below shows, the Pod landed on the node we specified.
[root@kubernetes-master01 ~]# kubectl get pod -o wide
NAME        READY   STATUS    RESTARTS   AGE    IP            NODE
pod-nginx   1/1     Running   0          102s   10.244.1.28   kubernetes-worker01
Step 3: modify the Pod definition, changing the value of the scheduler-node label selector to node2. The change in the yaml file is as follows:
  nodeSelector:
    scheduler-node: node2
Step 4: delete the previous Pod and re-create it with the following commands.
[root@kubernetes-master01 ~]# kubectl delete pod pod-nginx
pod "pod-nginx" deleted
[root@kubernetes-master01 ~]# kubectl apply -f pod-nginx-select.yaml
pod/pod-nginx created
Step 5: check the status of the Pod. As the output below shows, the Pod is stuck in Pending.
[root@kubernetes-master01 ~]# kubectl get pod -o wide
NAME        READY   STATUS    RESTARTS   AGE   IP       NODE
pod-nginx   0/1     Pending   0          7s    <none>   <none>
Step 6: inspect the Pod to see why scheduling failed. As the events below show, no node matched the nodeSelector.
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  78s (x4 over 4m11s)  default-scheduler  0/3 nodes are available: 3 node(s) didn't match node selector.
Let's summarize labels. A label marks a resource, in this example a node, and the Pod then uses its label-selection parameter nodeSelector to decide which kind of node it may be deployed to.
Higher-level resources such as Deployment and Service select Pods on the same principle: you put labels on the Pods and configure a matching label selector on the Deployment or Service, so that these higher-level resources can find their Pods.
In a yaml file, the format of the label selector specified by the higher-level resource is as follows:
selector:
  matchLabels:
    component: redis
  matchExpressions:
  - {key: tier, operator: In, values: [cache]}
  - {key: environment, operator: NotIn, values: [dev]}
Of which:
(1) matchLabels is a map of {key, value} pairs.
(2) A single {key, value} pair in the matchLabels map is equivalent to an element of matchExpressions whose key field is "key", whose operator is In, and whose values array contains only "value".
(3) matchExpressions is a list of Pod selector requirements.
(4) Valid operators include In, NotIn, Exists, and DoesNotExist.
(5) For In and NotIn, the values set must be non-empty.
(6) All requirements from matchLabels and matchExpressions are combined with a logical AND: they must all be satisfied for a match.
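To illustrate point (2), the two selectors below are equivalent; the matchLabels entry is shorthand for a matchExpressions element using the In operator (a sketch reusing the redis example above):

```yaml
# shorthand form
selector:
  matchLabels:
    component: redis

# equivalent long form
selector:
  matchExpressions:
  - {key: component, operator: In, values: [redis]}
```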
3. Scheduling practice with taints
In addition to matching resources with labels, there is another mechanism called taints. A taint generally marks something undesirable: when, for some reason, we do not want Pods scheduled onto a node, we can taint that node. Pods will not be scheduled to tainted nodes unless the Pod explicitly declares that it tolerates the corresponding taint.
The command to taint a node is:
kubectl taint nodes node1 key1=value1:NoSchedule
Of which:
(1) node1: the name of the node.
(2) key1: the key of the taint.
(3) value1: the value of the taint.
(4) NoSchedule: the effect of the taint.
The effect of a taint can take the following values:
(1) NoSchedule: Pods will not be scheduled to this node.
(2) PreferNoSchedule: the scheduler tries to avoid placing Pods on this node, but may still do so.
(3) NoExecute: new Pods will not be scheduled to this node, and Pods already running on the node that do not tolerate the taint will be evicted.
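For reference, a taint is removed by repeating the taint command with a trailing hyphen after the effect. The commands below sketch adding a taint with each of the three effects on the example node node1, and then removing one:

```shell
# add taints with each of the three effects
kubectl taint nodes node1 key1=value1:NoSchedule
kubectl taint nodes node1 key2=value2:PreferNoSchedule
kubectl taint nodes node1 key3=value3:NoExecute

# remove a taint: same key and effect, followed by "-"
kubectl taint nodes node1 key1=value1:NoSchedule-
```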
Let's practice taints with an example. We taint a node, create a Pod normally, and check the Pod's scheduling; then we modify the Pod definition to tolerate the taint and check the scheduling again.
Step 1: taint both Worker nodes.
kubectl taint node kubernetes-worker01 key1=value1:NoSchedule
kubectl taint node kubernetes-worker02 key1=value1:NoSchedule
Check the taints on the nodes. As the output below shows, both nodes have been tainted successfully.
[root@kubernetes-master01 ~]# kubectl describe node kubernetes-worker01
Name:               kubernetes-worker01
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=kubernetes-worker01
                    kubernetes.io/os=linux
                    scheduler-node=node1
Annotations:        flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"82:5b:4d:5a:49:f8"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 192.168.8.22
                    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Sun, 14 Mar 2021 10:47:25 +0800
Taints:             key1=value1:NoSchedule
[root@kubernetes-master01 ~]# kubectl describe node kubernetes-worker02
Name:               kubernetes-worker02
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=kubernetes-worker02
                    kubernetes.io/os=linux
Annotations:        flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"96:a9:e7:6c:c6:32"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 192.168.8.33
                    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Sun, 14 Mar 2021 10:47:25 +0800
Taints:             key1=value1:NoSchedule
Step 2: define an ordinary Pod; its yaml file contents are as follows:
apiVersion: v1
kind: Pod
metadata:
  name: pod-nginx
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
      hostPort: 80
Create this Pod:
kubectl apply -f pod-nginx-toleration.yaml
Step 3: check the status of the Pod. As the output below shows, the Pod has not been scheduled.
[root@kubernetes-master01 k8s-yaml]# kubectl get pod
NAME        READY   STATUS    RESTARTS   AGE
pod-nginx   0/1     Pending   0          67s
Step 4: check why the Pod failed to schedule. As the events below show, the master and both worker nodes carry taints, so scheduling failed.
Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  3s (x3 over 93s)  default-scheduler  0/3 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had taint {key1: value1}, that the pod didn't tolerate.
Step 5: we delete the previous Pod.
kubectl delete pod pod-nginx
Step 6: modify the Pod definition so that it tolerates the taint.
apiVersion: v1
kind: Pod
metadata:
  name: pod-nginx
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
      hostPort: 80
  tolerations:
  - key: "key1"
    operator: "Exists"
    effect: "NoSchedule"
Create this Pod:
kubectl apply -f pod-nginx-toleration.yaml
Step 7: check the Pod's scheduling status. As the output below shows, the Pod has been scheduled successfully.
[root@kubernetes-master01 k8s-yaml]# kubectl get pod
NAME        READY   STATUS    RESTARTS   AGE
pod-nginx   1/1     Running   0          56s
Inspect the Pod's events; as shown below, it has been scheduled to node kubernetes-worker01.
Events:
  Type    Reason     Age   From                          Message
  ----    ------     ----  ----                          -------
  Normal  Scheduled  21s   default-scheduler             Successfully assigned default/pod-nginx to kubernetes-worker01
  Normal  Pulling    20s   kubelet, kubernetes-worker01  Pulling image "nginx"
  Normal  Pulled     4s    kubelet, kubernetes-worker01  Successfully pulled image "nginx"
  Normal  Created    4s    kubelet, kubernetes-worker01  Created container nginx
  Normal  Started    4s    kubelet, kubernetes-worker01  Started container nginx
Let's summarize taints. A taint generally marks a node that is in a poor or special state. Under normal circumstances, Pods will not be scheduled to tainted nodes unless a toleration for the taint is defined on the Pod. Kubernetes also adds the following taints to nodes automatically according to their condition:
(1) node.kubernetes.io/not-ready: the node is not ready, equivalent to the node condition Ready being "False".
(2) node.kubernetes.io/unreachable: the node controller cannot reach the node, equivalent to the node condition Ready being "Unknown".
(3) node.kubernetes.io/out-of-disk: the node's disk is exhausted.
(4) node.kubernetes.io/memory-pressure: the node is under memory pressure.
(5) node.kubernetes.io/disk-pressure: the node is under disk pressure.
(6) node.kubernetes.io/network-unavailable: the node's network is unavailable.
(7) node.kubernetes.io/unschedulable: the node is not schedulable.
(8) node.cloudprovider.kubernetes.io/uninitialized: when the kubelet is started with an "external" cloud provider, this taint is added to mark the node as unusable; after a controller from cloud-controller-manager initializes the node, the kubelet removes the taint.
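When tolerating the NoExecute taints that Kubernetes adds automatically, such as node.kubernetes.io/unreachable, a toleration may also carry tolerationSeconds, which bounds how long an already-running Pod stays on the affected node before being evicted. A sketch (the 300-second value is illustrative):

```yaml
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300   # stay bound for 5 minutes after the node becomes unreachable, then evict
```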
So under what circumstances should we actively taint nodes in production? Generally in two situations:
(1) Dedicated nodes: when certain Pods must run on specific nodes, we can taint those nodes and make the corresponding Pods tolerate the taints.
(2) Nodes with special hardware: some Pods have special resource requirements, and the taint mechanism can keep ordinary Pods off these nodes, so that the Pods needing the special resources can run there.
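For the dedicated-node case in (1), note that a taint alone only repels other Pods; to also steer the dedicated Pods onto the node, the toleration is usually combined with a nodeSelector. A sketch, assuming the node has been tainted with a hypothetical key dedicated=gpu:NoSchedule and labeled dedicated=gpu:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-dedicated
spec:
  containers:
  - name: app
    image: nginx
  tolerations:             # allows scheduling onto the tainted dedicated node
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  nodeSelector:            # steers the Pod onto that node (the taint alone does not attract it)
    dedicated: gpu
```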
This section explained the Pod scheduling strategy. First, a Node must satisfy the Pod's resource requests; then, the label and taint mechanisms let us influence scheduling, so that a Pod can be made to run, not run, or preferentially run on certain nodes. Finally, among the Nodes that pass filtering, the scoring mechanism selects the optimal Node to run the Pod.