Resource scheduling (nodeSelector, nodeAffinity, Taints, Tolerations)

Kubernetes uses a master-worker distributed architecture, consisting mainly of the Master Node and Worker Nodes, along with the kubectl command line client and various add-ons.

  • Master Node: the control node that manages and schedules the cluster; it consists of the API Server, Scheduler, Cluster State Store (etcd), and Controller-Manager Server.
  • Worker Node: the node that actually runs business application containers; it contains the kubelet, kube-proxy, and a Container Runtime.
  • kubectl: the command line tool used to interact with the API Server in order to add, delete, modify, and query the various resources in the cluster.
  • Add-on: an extension of Kubernetes' core functionality, for example adding networking and network policy capabilities.
  • ReplicationController: scales the number of Pod replicas.
  • Endpoints: tracks the endpoints (Pod addresses) that back a Service.
  • Scheduler: assigns Pods to Nodes.
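
In a kubeadm-installed cluster these control-plane components themselves run as Pods in the kube-system namespace, so a quick way to see them is the following (a sketch, assuming a kubeadm-style setup; other installations may run the components differently):

[root@master ~]# kubectl get pods -n kube-system
# typically lists kube-apiserver-*, kube-scheduler-*, kube-controller-manager-*, etcd-*, kube-proxy-* and coredns-* pods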

The overall architecture of Kubernetes is as follows:

Scheduling process for Kubernetes:

  • A user submits a request to create a Pod, either through the API Server's REST API or through the kubectl command line tool; both JSON and YAML formats are supported.
  • The API Server handles the request and stores the Pod data in etcd.
  • The Scheduler notices the new Pod through the API Server's watch mechanism and tries to bind a Node to it:
  • Filter hosts: the scheduler uses a set of rules to filter out hosts that do not meet the requirements; for example, if the Pod specifies resource requests, hosts without enough free resources are filtered out.
  • Host scoring: the hosts that pass the filtering step are scored. During scoring the scheduler applies overall optimization strategies, such as spreading the replicas of a ReplicationController across different hosts and preferring the least loaded hosts.
  • Select a host: the host with the highest score is chosen, the Pod is bound to it, and the result is stored in etcd.
  • The kubelet creates the Pod based on the scheduling result: after a successful binding, the scheduler calls the API Server to create a "bound pod" object in etcd describing all the Pods running on a worker node. The kubelet running on each worker node periodically synchronizes the bound-pod information from etcd; once it finds a bound Pod that should be running on its node but has not been started yet, it calls the container runtime (for example the Docker API) to create and start the Pod's containers.
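
After a Pod has been created, the scheduler's decision can be observed from the Pod's events; kubectl describe shows a Scheduled event naming the chosen node (shown here for the Pod named test that is created later in this article; the exact event wording may vary by Kubernetes version):

[root@master ~]# kubectl describe pod test
# the Events section contains a line similar to:
#   Normal  Scheduled  default-scheduler  Successfully assigned default/test to node1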

1. nodeSelector

nodeSelector is the simplest form of node selection constraint: it is a field of the Pod spec.

The labels of a given node can be viewed with --show-labels:

[root@master ~]# kubectl get node node1 --show-labels
NAME    STATUS   ROLES    AGE     VERSION   LABELS
node1   Ready    <none>   4d18h   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node1,kubernetes.io/os=linux
[root@master haproxy]# 

If no additional node labels have been added, only the default labels shown above appear. We can add a label to a given node with the kubectl label node command:
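
For example, to add the disktype=ssd label used in the rest of this article (the same command is applied to node1.example.com and node2 later on), one would run:

[root@master mainfest]# kubectl label node node1 disktype=ssd

After that, --show-labels shows the new label: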

[root@master mainfest]# kubectl get node node1 --show-labels  
NAME    STATUS   ROLES    AGE     VERSION   LABELS
node1   Ready    <none>   4d18h   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=node1,kubernetes.io/os=linux
[root@master haproxy]# 

You can also delete a label with kubectl label node by appending a dash to the label key:

[root@master mainfest]# kubectl label node node1 disktype-
node/node1 labeled
[root@master haproxy]# kubectl get node node1 --show-labels
NAME    STATUS   ROLES    AGE     VERSION   LABELS
node1   Ready    <none>   4d18h   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node1,kubernetes.io/os=linux
[root@master haproxy]# 

Create a test Pod and use the nodeSelector field to bind it to a node:

[root@master mainfest]# kubectl label node node1.example.com disktype=ssd
node/node1.example.com labeled
[root@master mainfest]# kubectl get node node1.example.com --show-labels
NAME                STATUS   ROLES    AGE    VERSION   LABELS
node1.example.com   Ready    <none>   4d18h   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=node1.example.com,kubernetes.io/os=linux

[root@master mainfest]# vi test.yml
[root@master mainfest]# cat test.yml 
apiVersion: v1
kind: Pod
metadata:
  name: test
  labels:
    env: test
spec:
  containers:
  - name: test
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    disktype: ssd

//View which node the Pod was scheduled to (kubectl get pod -o wide)
NAME   READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
test   1/1     Running   0          6s    10.244.1.92   node1   <none>           <none>
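
Note that nodeSelector is a hard requirement: if no node carries the requested label, the Pod stays Pending. The reason can be read from the Pod's events (a diagnostic sketch; the exact message varies by Kubernetes version):

[root@master mainfest]# kubectl describe pod test | grep -A 5 -i events
# a non-matching nodeSelector typically produces a FailedScheduling event such as:
#   0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector.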

2. nodeAffinity

nodeAffinity is a node affinity scheduling policy, introduced as a more expressive replacement for nodeSelector. There are currently two types of node affinity expressions:

  • RequiredDuringSchedulingIgnoredDuringExecution:
    The rules must be satisfied for a Pod to be scheduled onto a Node; this is the equivalent of a hard constraint.

  • PreferredDuringSchedulingIgnoredDuringExecution:
    The scheduler tries to place the Pod on a Node that satisfies the rules, but does not require it; this is the equivalent of a soft constraint. Multiple preferred rules can be given weight values to define their relative priority.
    IgnoredDuringExecution means:
    if the labels of the node a Pod is running on change so that the node no longer satisfies the Pod's node affinity rules, the system ignores the label change and the Pod keeps running on the node it was originally scheduled to.

  • Operators supported by the nodeAffinity syntax include the following (see the snippet after this list):

  • In: the label's value is in a given list

  • NotIn: the label's value is not in a given list

  • Exists: a label with the given key exists

  • DoesNotExist: no label with the given key exists

  • Gt: the label's value is greater than a given value

  • Lt: the label's value is less than a given value
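
As a small illustration (a hypothetical fragment of a requiredDuringSchedulingIgnoredDuringExecution section, not part of the manifest below; the label keys zone and gpu-count are made up, disktype is the label used in this article), several operators can be combined in one matchExpressions list:

nodeSelectorTerms:
- matchExpressions:
  - key: zone              # hypothetical label: exclude zone-a
    operator: NotIn
    values:
    - zone-a
  - key: gpu-count         # hypothetical numeric label: more than 4 GPUs
    operator: Gt
    values:
    - "4"
  - key: disktype          # the label used elsewhere in this article must simply exist
    operator: Exists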

Notes on setting nodeAffinity rules:

  • If both nodeSelector and nodeAffinity are defined, both must be satisfied for the Pod to be scheduled onto a node.
  • If nodeAffinity specifies multiple nodeSelectorTerms, it is enough for any one of them to match.
  • If a nodeSelectorTerms entry contains multiple matchExpressions, a node must satisfy all of the matchExpressions for the Pod to run on it.

[root@master mainfest]# cat test.yml 
apiVersion: v1
kind: Pod
metadata:
  name: test1
  labels:
    app: nginx
spec:
  containers:
  - name: test1
    image: nginx
    imagePullPolicy: IfNotPresent
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            values:
            - ssd
            operator: In
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 10
        preference:
          matchExpressions:
          - key: name
            values:
            - test
            operator: In 

//Label the node2 host with disktype=ssd as well
[root@master mainfest]# kubectl label node node2 disktype=ssd
node/node2 labeled
[root@master mainfest]# kubectl get node node2 --show-labels
NAME    STATUS   ROLES    AGE     VERSION   LABELS
node2   Ready    <none>   4d22h   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=node2,kubernetes.io/os=linux

Label node1 with name=test

[root@master ~]# kubectl label node node1 name=test
node/node1 labeled
[root@master ~]# kubectl get node node1 --show-labels
NAME    STATUS   ROLES    AGE     VERSION   LABELS
node1   Ready    <none>   4d18h   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=node1,kubernetes.io/os=linux,name=sb
[root@master ~]# 

Now label node1 with name=test (re-applying it if an older value was set), then apply the manifest and check the result. Since both node1 and node2 carry disktype=ssd, the preferred name=test rule should tip the scheduler toward node1:

[root@master ~]# kubectl label node node1 name=test
node/node1 labeled
[root@master ~]# kubectl get node node1 --show-labels
NAME    STATUS   ROLES    AGE     VERSION   LABELS
node1   Ready    <none>   4d22h   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=node1,kubernetes.io/os=linux,name=test

[root@master mainfest]# kubectl apply -f test1.yaml 
pod/test1 created
[root@master mainfest]# kubectl get pod -o wide
NAME    READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
test    1/1     Running   0          46m   10.244.1.92   node1   <none>           <none>
test1   1/1     Running   0          28s   10.244.1.93   node1   <none>           <none>

3. Taints and Tolerations

Taints: prevent Pods from being scheduled onto a specific Node.
Tolerations: allow a Pod to be scheduled onto a Node that has Taints.
Typical scenarios:

  • Dedicated nodes: Nodes are grouped and managed per business line; by default they accept no scheduling, and Pods are placed on them only if they are configured to tolerate the taint.

  • Nodes with special hardware: some Nodes are equipped with SSD disks or GPUs, and you want them not to receive ordinary workloads by default; Pods are placed on them only if they are configured to tolerate the taint.

  • Taint-based eviction (the effect values that drive it are explained below).

The following is a brief explanation of the value of effect:

  • NoSchedule: if a Pod does not declare a toleration for this Taint, the system will not schedule the Pod onto the node carrying this Taint.
  • PreferNoSchedule: a soft version of NoSchedule. If a Pod does not declare a toleration for this Taint, the system tries to avoid scheduling the Pod onto this node, but it is not mandatory.
  • NoExecute: defines eviction behavior, for example in response to node problems. A Taint with the NoExecute effect affects Pods already running on the node as follows:
  • Pods without a matching toleration are evicted immediately.
  • Pods with a matching toleration but no tolerationSeconds value remain bound to the node.
  • Pods with a matching toleration and a tolerationSeconds value are evicted after the specified time.
  • Since Kubernetes 1.6, an alpha feature marks node problems as Taints (currently only for node unreachable and node not ready, corresponding to the NodeCondition "Ready" values Unknown and False). When the TaintBasedEvictions feature is enabled (by adding TaintBasedEvictions=true to the feature-gates parameter), the NodeController automatically sets these Taints on the Node, and the previous eviction logic based on the node's "Ready" status is disabled. Note that in case of node failure, to keep the existing rate limit on Pod eviction, the system adds the Taints gradually in a rate-limited way, which prevents a large number of Pods from being evicted in certain scenarios (such as the master temporarily losing contact with nodes). This feature works together with tolerationSeconds, allowing a Pod to define how long a node failure may last before the Pod is evicted (see the example toleration after this list).
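
As a sketch of how tolerationSeconds works together with taint-based eviction (the key node.kubernetes.io/unreachable is the taint that recent Kubernetes versions apply to unreachable nodes; treat the exact key as version-dependent), a Pod could declare:

# fragment of a Pod spec: tolerate the "node unreachable" NoExecute taint for 5 minutes,
# after which the Pod is evicted from the failed node
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300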

Using the kubectl taint command you can set taints on a node. Once a node is tainted, there is a mutually exclusive relationship between the node and Pods: the node can refuse to schedule Pods, or even evict Pods that are already running on it.

A taint is composed as follows:
each taint has a key and a value (the value may be empty) as its label, plus an effect that describes what the taint does, i.e. key[=value]:effect.
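
In general form the commands look like this (a syntax sketch; the concrete commands used below and at the end of this section follow the same pattern):

kubectl taint nodes <node-name> <key>=<value>:<effect>    # add a taint
kubectl taint nodes <node-name> <key>-                     # remove the taint with that key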

[root@master mainfest]# kubectl taint node node1 name=test:NoSchedule
node/node1 tainted
[root@master mainfest]# kubectl describe node node1|grep -i taint
Taints:             name=test:NoSchedule

Create a test Pod named test2:

apiVersion: v1
kind: Pod
metadata:
  name: test2
  labels:
    app: nginx
spec:
  containers:
  - name: test2
    image: nginx
    imagePullPolicy: IfNotPresent
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            values:
            - ssd
            operator: In
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: name
            values:
            - test
            operator: In

            
[root@master mainfest]# kubectl apply -f test2.yaml 
pod/test2 created
[root@master mainfest]# kubectl get pod -o wide
NAME    READY   STATUS    RESTARTS   AGE     IP             NODE     NOMINATED NODE   READINESS GATES



Normally test2 should be created on node1, but because we set a taint on node1 it is not created there. Once we remove the taint, it can again be scheduled onto node1:

[root@master mainfest]# kubectl taint node node1 name-
node/node1 untainted
[root@master mainfest]# kubectl describe node node1|grep -i taint
Taints:             <none>

[root@master mainfest]# kubectl apply -f test2.yaml 
pod/test2 created
[root@master mainfest]# kubectl get pod -o wide
NAME    READY   STATUS    RESTARTS   AGE     IP            NODE    NOMINATED NODE   READINESS GATES
test    1/1     Running   0          58m     10.244.1.92    node1      <none>           <none>
test1   1/1     Running   0          5m12s   10.244.2.93    node2      <none>           <none>
test2   1/1     Running   0          12m38s  10.244.2.132   node2      <none>           <none>
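
Alternatively, instead of removing the taint from node1, test2 could declare a toleration for it. A minimal sketch, matching the name=test:NoSchedule taint set earlier, would add the following under spec: in test2.yaml:

  tolerations:           # lets the Pod be scheduled onto nodes tainted with name=test:NoSchedule
  - key: "name"
    operator: "Equal"
    value: "test"
    effect: "NoSchedule"

Tolerating a taint only permits scheduling onto the tainted node; it does not force it, so the node affinity rules above still decide the final placement.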
