Rematch k8s (11): taints, tolerations, affinity


The story comes from the Internet.

Preface

This article reflects my own understanding; if anything is inaccurate, please point it out.

Before diving into the related technical terms, let's try to explain the love-hate relationship between Node and Pod in Kubernetes with an easy-to-understand story.

Male (node) | female (pod)

On a planet outside the Milky Way lives a group of creatures of two sexes: female (Pod) and male (Node). Females are in the majority, while only three high-quality males (Nodes) remain after survival of the fittest. When males and females come together they attract each other easily, and the following situations occur:

(1) There are many females and only three males. The three high-quality males have identical personalities and strengths, so the females have no reason to prefer any particular one; they split into three equal groups, one group per male. This is similar to the principle of even distribution.

Concept in k8s: this is the most common Pod distribution mode in k8s, typically seen with the Deployment and DaemonSet controllers.

(2) Over time, the creatures began to evolve and the three males grew apart. Male No. 1 learned to farm (node01 label skill='grow'); male No. 2 learned to hunt (node02 label skill='hunt'); male No. 3 learned nothing and kept the defaults. Ordinary females (Pods) still follow the even-distribution principle, but some fast-evolving females noticed the differences between the three males and began to have preferences. Some prefer the male who can farm (nodeSelector skill='grow'); some prefer the male who can hunt (nodeSelector skill='hunt'); and some like an exotic trait: they want a male who can fly, which none of the three males can offer. If one day a male evolves the ability to fly, they will attach themselves to him; until then they simply wait. In k8s this corresponds to using nodeSelector in the YAML file.

Concept in k8s: when a node carries a specific label, the following situations can occur:

  1. If the Pod does not specify a nodeSelector, the default scheduling algorithm applies;
  2. If the Pod specifies a nodeSelector whose key and value match a key and value on some node, the Pod is scheduled directly onto that node;
  3. If the Pod specifies a nodeSelector whose key and value are not present on any node, the Pod stays in Pending state.

(3) The three males (Nodes) have not only gained strengths but also picked up flaws. Male No. 1 likes slapping faces (node01 taint hobby='face'); male No. 2 likes spanking (node02 taint hobby='hunkers'); male No. 3 likes stepping on feet (node03 taint hobby='foot'). Ordinary females (Pods) still follow the even-distribution principle, but some further-evolved females have their own personalities: a female who can put up with face-slapping stays with male No. 1 (toleration hobby='face'), and this tolerance may be permanent or last only a day (tolerationSeconds=86400). The three males occasionally fool around together, and male No. 1 may one day switch his hobby to sleeping in; females who cannot tolerate that habit will leave him, either after a while or immediately.

Concept in k8s: these are taints and tolerations.

Taints and tolerations have further options and parameters, which are explained later.

nodeSelector

Assign the pod to the specified node.

The clusters are as follows:

$ kubectl get nodes
NAME         STATUS   ROLES    AGE   VERSION
k8s-master   Ready    master   20h   v1.19.7
k8s-node01   Ready    <none>   20h   v1.19.7
k8s-node02   Ready    <none>   20h   v1.19.7

Add a label to k8s-node01

$ kubectl label nodes k8s-node01 disktype=ssd
node/k8s-node01 labeled

View label

$ kubectl get nodes --show-labels
NAME         STATUS   ROLES    AGE   VERSION   LABELS
k8s-master   Ready    master   20h   v1.19.7   ..., kubernetes.io/hostname=k8s-master
k8s-node01   Ready    <none>   20h   v1.19.7   ..., disktype=ssd,kubernetes.io/hostname=k8s-node01
k8s-node02   Ready    <none>   20h   v1.19.7   ..., kubernetes.io/hostname=k8s-node02

You can see that k8s-node01 now carries the label disktype=ssd.
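
If you ever need to undo this, a label is removed by appending a dash to its key (a quick sketch using the node from above):

### Remove the disktype label from k8s-node01 (note the trailing dash)
$ kubectl label nodes k8s-node01 disktype-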

Create a Pod scheduled to the selected node

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: ngx
  name: ngx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ngx
  template:
    metadata:
      labels:
        app: ngx
    spec:
      containers:
      - image: nginx:alpine-arm64
        name: nginx
      nodeSelector:
        disktype: ssd    ## schedule only onto nodes carrying this key: value label

Create the Pods and check whether they are scheduled to the specified node

$ kubectl apply -f ngx.yaml
deployment.apps/ngx created
$ kubectl get po -o wide
NAME                   READY   STATUS    RESTARTS   AGE   IP            NODE         NOMINATED NODE   READINESS GATES
ngx-5f4df66559-hjmzg   1/1     Running   0          10s   10.244.1.13   k8s-node01   <none>           <none>
ngx-5f4df66559-wqgdb   1/1     Running   0          10s   10.244.1.14   k8s-node01   <none>           <none>
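
As a quick check of case 3 from the story, pointing nodeSelector at a label that no node carries should leave the Pod in Pending state (kubectl describe pod would then show a FailedScheduling event). A minimal hypothetical sketch; the Pod name and the disktype: nvme label are made up and exist on no node in this cluster:

apiVersion: v1
kind: Pod
metadata:
  name: ngx-pending-demo          ## hypothetical name
spec:
  containers:
  - image: nginx:alpine-arm64
    name: nginx
  nodeSelector:
    disktype: nvme                ## no node has this label, so the Pod stays Pending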

k8s affinity

Affinity falls into two main categories: nodeAffinity and podAffinity.

nodeAffinity

nodeAffinity is node affinity. Its rules come in a soft policy and a hard policy. With the soft policy, if no node satisfies the requirement, the Pod ignores the rule and scheduling continues; in short, "meet it if you can, it's fine if you can't". The hard policy is strict: if no node satisfies the condition, the scheduler keeps retrying until one does; in short, "meet my requirement or I won't run". nodeAffinity therefore has two policies:

  • requiredDuringSchedulingIgnoredDuringExecution: hard policy
  • preferredDuringSchedulingIgnoredDuringExecution: soft policy

Hard policy

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: ngx
  name: ngx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ngx
  template:
    metadata:
      labels:
        app: ngx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: disktype ## label key
                operator: In ## the node's label value must be in the list below
                values:
                - ssd ## label value
      containers:
      - image: nginx:alpine-arm64
        name: nginx
      nodeSelector:
        disktype: ssd

The two Pods above will only run on nodes carrying the label disktype=ssd. If no node meets this condition, they remain in Pending state.

$ kubectl get po -o wide
NAME                  READY   STATUS    RESTARTS   AGE     IP            NODE         NOMINATED NODE   READINESS GATES
ngx-7b65b44bc-gff9x   1/1     Running   0          3m53s   10.244.1.15   k8s-node01   <none>           <none>
ngx-7b65b44bc-w7bsf   1/1     Running   0          3m53s   10.244.1.16   k8s-node01   <none>           <none>

$ kubectl get nodes k8s-node01 --show-labels
NAME         STATUS   ROLES    AGE   VERSION   LABELS
k8s-node01   Ready    <none>   22h   v1.19.7   ..., disktype=ssd,kubernetes.io/arch=arm64,kubernetes.io/hostname=k8s-node01

Soft policy

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: ngx
  name: ngx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ngx
  template:
    metadata:
      labels:
        app: ngx
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 10
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - hdd
      containers:
      - image: nginx:alpine-arm64
        name: nginx

With the soft policy, nodes labeled disktype=hdd are preferred; if there are none, the default scheduler behaviour applies. Nothing is mandatory.

$ kubectl get po -o wide
NAME                  READY   STATUS    RESTARTS   AGE     IP            NODE         NOMINATED NODE   READINESS GATES
ngx-d4754b6fd-lr96g   1/1     Running   0          2m55s   10.244.2.13   k8s-node02   <none>           <none>
ngx-d4754b6fd-ns7hs   1/1     Running   0          2m55s   10.244.1.28   k8s-node01   <none>           <none>
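
The weight field (a number from 1 to 100) matters when several preferred terms are listed: for every candidate node, the scheduler adds up the weights of the terms that node satisfies and favours nodes with the higher total. A hypothetical sketch with two preferences (the concrete weights and the ssd/hdd values are only examples):

      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80                   ## strongly prefer ssd nodes
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
          - weight: 20                   ## otherwise mildly prefer hdd nodes
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - hdd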

The operator field supports the following operators:

  • In: the label value is in a given list
  • NotIn: the label value is not in a given list
  • Gt: the label value is greater than a given value
  • Lt: the label value is less than a given value
  • Exists: a label with the key exists
  • DoesNotExist: no label with the key exists

If nodeSelectorTerms contains multiple entries, satisfying any one of them is enough; if a single matchExpressions contains multiple expressions, all of them must be satisfied at the same time for the Pod to be scheduled. A sketch follows.
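
A hypothetical fragment to make this concrete: the two entries under nodeSelectorTerms are ORed (matching either one is enough), while the two matchExpressions inside the first entry are ANDed (both must hold on the same node). The skill and gpu labels are only illustrative:

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:          ## term 1: disktype=ssd AND skill=grow (both required)
              - key: disktype
                operator: In
                values:
                - ssd
              - key: skill
                operator: In
                values:
                - grow
            - matchExpressions:          ## term 2 (ORed with term 1): any node that has a gpu label
              - key: gpu
                operator: Exists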

Taints and tolerations

In Kubernetes, node affinity is an attribute defined on the Pod that attracts the Pod to a set of Nodes. Taints are the opposite: they are an attribute of the Node that repels Pods from it, and can even evict Pods already running on the tainted Node. Correspondingly, Kubernetes lets you set tolerations on a Pod so the Pod can tolerate the taints set on a Node; the taints are then ignored during scheduling and the Pod can be placed on that Node. In general, taints and tolerations are used together.

Taints

Viewing taints

View the taints of a node

$ kubectl describe nodes k8s-master
...
Taints:             node-role.kubernetes.io/master:NoSchedule
...

# You can also view it through the following operations:
$ kubectl get nodes k8s-master -o go-template={{.spec.taints}}
[map[effect:NoSchedule key:node-role.kubernetes.io/master]]

A taint generally consists of three parts: a key, a value, and an effect:

<key>=<value>:<effect>

The value can be empty; the taint above breaks down as:

node-role.kubernetes.io/master:NoSchedule
- key: node-role.kubernetes.io/master
- value: empty
- effect: NoSchedule

Setting taints

Often we want a node to accept only specific Pods. To do that we taint the node with kubectl taint node [node] key=value:[effect], where effect can be one of the following:

  • PreferNoSchedule: try not to schedule Pods here.
  • NoSchedule: do not schedule new Pods here.
  • NoExecute: do not schedule new Pods here, and evict Pods already running on the node.

Taints are usually set as in the following examples:

### Set a taint that forbids scheduling Pods onto this node
$ kubectl taint node k8s-master key1=value1:NoSchedule
### Set a taint that discourages scheduling Pods onto this node
$ kubectl taint node k8s-master key2=value2:PreferNoSchedule
### Set a taint that forbids scheduling ordinary Pods onto this node and evicts the Pods already running there
$ kubectl taint node k8s-master key3=value3:NoExecute

Removing taints

The previous section showed how to add a taint to a Node to keep Pods away. To delete a taint from a Node, use the following command:

kubectl taint node [node] [key]-

The syntax is similar to creating a taint; note that you need to know the key and append a trailing "-" to delete it. An example follows.

For the demonstration, set some taints first:

### Set taint 1
$ kubectl taint node k8s-master key1=value1:PreferNoSchedule
node/k8s-master tainted
### Set taint 2
$ kubectl taint node k8s-master key2=value2:NoSchedule
node/k8s-master tainted
### Set taint 3 with an empty value
$ kubectl taint node k8s-master key2=:PreferNoSchedule
node/k8s-master tainted

Looking at the taints, you can see the three just added (plus the pre-existing master taint):

$ kubectl describe nodes k8s-master
...
Taints:             key2=value2:NoSchedule
                    node-role.kubernetes.io/master:NoSchedule
                    key1=value1:PreferNoSchedule
                    key2:PreferNoSchedule
...

Then remove the taints:

### Delete a taint by specifying its key and effect; the value does not need to be given
$ kubectl taint node k8s-master key1:PreferNoSchedule-

### Or delete all effects of key2 at once by giving only the key:
$ kubectl taint node k8s-master key2-

Check the taints again; the ones added above are gone:

$ kubectl describe nodes k8s-master
...
Taints:             node-role.kubernetes.io/master:NoSchedule
...

Tolerations

Setting tolerations on a Pod

Taints keep Pods off specific nodes. If you still want certain Pods to ignore a node's taints and be scheduled there, give those Pods tolerations so they can tolerate the taints set on the node. For example:

Taint a node:

kubectl taint node k8s-node01 key=value:NoSchedule

A Pod can tolerate this taint in either of the following two ways:

### With Equal, the toleration's key, value and effect must match the taint exactly
......
tolerations:
- key: "key"
  operator: "Equal"
  value: "value"
  effect: "NoSchedule"
### With Exists, only the key and effect need to match the taint; no value is required
......
tolerations:
- key: "key"
  operator: "Exists"
  effect: "NoSchedule"

Basic behaviour of taints (Node) and tolerations (Pod)

Concepts

  • A node can have multiple taints;
  • A Pod can have multiple tolerations;
  • Kubernetes matches multiple taints and tolerations like a filter.

If a node has multiple taints and the Pod has multiple tolerations, the Pod can be scheduled to the node as long as its tolerations cover all of the node's taints. If the tolerations do not cover all of the taints, but every uncovered taint has the effect PreferNoSchedule, the Pod may still be scheduled to the node.

Notes

Note that a toleration only allows scheduling onto a tainted node, it does not force it: a Pod with matching tolerations may still end up on an untainted node; the tolerations simply stop the tainted node from being excluded.

  • If the node has a taint with effect NoSchedule and the Pod has no matching toleration, Kubernetes will not schedule the Pod to this node.
  • If the node has a taint with effect PreferNoSchedule, Kubernetes will try not to schedule the Pod to this node.
  • If the node has a taint with effect NoExecute, Pods already running on the node are evicted, and Pods not yet running there cannot be scheduled to it. This is typically what happens when a node goes NotReady: its Pods are quickly restarted on other healthy nodes (a rough command sketch follows this list).
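
A hypothetical command-line walk-through of the NoExecute case referenced above (the maintenance key is made up for the demo):

### Taint the node: Pods on k8s-node01 without a matching toleration are evicted
$ kubectl taint node k8s-node01 maintenance=true:NoExecute

### Watch the Pods being rescheduled onto the remaining nodes
$ kubectl get po -o wide --watch

### Remove the taint again once the maintenance is finished
$ kubectl taint node k8s-node01 maintenance-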

Setting tolerations in a Deployment

Tolerations are set in the Pod template of a Deployment. An example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: 5
  template:
    spec:
      ......
      tolerations:
      - key: "key"
        operator: "Equal"
        value: "value"
        effect: "NoSchedule"

Setting a toleration time

Normally, when a taint with effect NoExecute is added to a Node, every Pod that does not tolerate the taint is evicted immediately, while Pods that do tolerate it are never evicted. However, a toleration with effect NoExecute can also specify tolerationSeconds, which defines how long the Pod may keep running after the taint is added. For example:

tolerations:
- key: "key"
  operator: "Equal"
  value: "value"
  effect: "NoExecute"
  tolerationSeconds: 3600

If a Pod is already running on the node when the NoExecute taint is added, it can keep running for 3600 seconds before being evicted. If the taint is removed from the node within that time, the Pod is not evicted.

Toleration examples

operator defaults to Equal and can be either Equal or Exists. Examples follow.

Operator is Exists

Tolerating any taint

An empty key with operator Exists matches every key, value and effect, i.e. it tolerates any taint:

tolerations:
- operator: "Exists"

Tolerating all taints with a given key

If the effect is left empty and only a key is given, every taint with that key is matched, regardless of its effect:

tolerations:
- key: "key"
  operator: "Exists"

Operator is Equal

A single taint on the node

If the taint's key, value and effect on the Node all match the Pod's toleration, the Pod can be scheduled:

# Taint
key1=value1:NoSchedule

# Pod toleration
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"

Multiple taints on the node

If the Pod's tolerations match all of the Node's taints (key, value and effect), the Pod can be scheduled:

## Taints
key1=value1:NoSchedule
key2=value2:NoExecute

## Pod tolerations
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key2"
  operator: "Equal"
  value: "value2"
  effect: "NoExecute"

If the Pod tolerates most of the Node's taints and the only uncovered taint has effect PreferNoSchedule, the Pod may still be scheduled:

## Taints
key1=value1:NoSchedule
key2=value2:NoExecute
key3=value3:PreferNoSchedule

## Pod tolerations
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key2"
  operator: "Equal"
  value: "value2"
  effect: "NoExecute"

If the Pod tolerates most of the Node's taints but the uncovered taint has effect NoExecute (key2 below), the Pod will not be scheduled, and a running Pod would be evicted:

## Taints
key1=value1:NoSchedule
key2=value2:NoExecute
key3=value3:PreferNoSchedule

## Pod tolerations
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key3"
  operator: "Equal"
  value: "value3"
  effect: "PreferNoSchedule"

Compare and understand the difference between Exists and Equal:

  • Exists means "contains" while Equal means "equals exactly": Exists matches more broadly, Equal is a precise match.
  • If a Node has a NoExecute taint and none of the Pod's tolerations covers that NoExecute effect, the Pod will not be scheduled to that node.
  • Exists does not need a value, while Equal must specify the corresponding value.

Taint-based eviction

When using Kubernetes you will sometimes see a node that is clearly down: its state changes from Ready to NotReady, yet the Pods on it stay in Running state and only switch to Terminating after quite a while. Why?

When a NoExecute taint is applied to a node, it affects the Pods already running on that node as follows:

  • Pods without a matching toleration are evicted immediately.
  • Pods with a matching toleration but no tolerationSeconds keep tolerating forever and are not evicted.
  • Pods with a matching toleration and a tolerationSeconds value are tolerated for that long and then evicted. (By default, Pods are given tolerationSeconds=300.)

In addition, when certain conditions hold, the node controller automatically taints the node. The built-in taints are:

  • node.kubernetes.io/not-ready: the node is not ready. Corresponds to NodeCondition Ready being "False".
  • node.kubernetes.io/unreachable: the node is unreachable from the node controller. Corresponds to NodeCondition Ready being "Unknown".
  • node.kubernetes.io/out-of-disk: the node is low on disk.
  • node.kubernetes.io/memory-pressure: the node has memory pressure.
  • node.kubernetes.io/disk-pressure: the node has disk pressure.
  • node.kubernetes.io/network-unavailable: the node's network is unavailable.
  • node.kubernetes.io/unschedulable: the node is not schedulable.
  • node.cloudprovider.kubernetes.io/uninitialized: set when the kubelet starts with an "external" cloud provider to mark the node as unusable; the kubelet removes this taint once a controller from the cloud-controller-manager has initialized the node.
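
To check which of these built-in taints are currently present, the taints of every node can be listed in one command, for example with jsonpath (a sketch; the exact output depends on your cluster):

$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'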

With that background, what taints does the Kubernetes cluster actually put on a node when it goes down?

A node in Ready status

$ kubectl get node k8s-node02 -o go-template={{.spec.taints}}
<no value>

A node with NotReady status

$ kubectl get node k8s-node02 -o go-template={{.spec.taints}}
[map[effect:NoSchedule key:node.kubernetes.io/unreachable timeAdded:2021-12-23T13:49:58Z] map[effect:NoExecute key:node.kubernetes.io/unreachable timeAdded:2021-12-23T13:50:03Z]]

A node in NotReady state is marked with the following two taints:

Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule

Next, check what tolerations the Kubernetes cluster assigns to a Pod by default.

$ kubectl get po nginx-745b4df97d-mgdmp -o yaml
...
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300  ## 300/60 = 5 min
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300  ## 300/60 = 5 min
...

Now the Pod eviction mechanism is clear: when a node becomes NotReady or unreachable, the Pods on it are tolerated for 5 minutes and only then evicted. During those five minutes the Pods may still show Running but cannot actually serve traffic. You can therefore explicitly set the toleration to 0 seconds in the YAML manifest, as in the following file:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 0
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 0
      containers:
      - image: nginx:alpine
        name: nginx

Create the Pods

$ kubectl get po -o wide
NAME                    READY   STATUS    RESTARTS   AGE   IP            NODE         NOMINATED NODE   READINESS GATES
nginx-84f6f75c6-c76fm   1/1     Running   0          6s    10.244.3.16   k8s-node02   <none>           <none>
nginx-84f6f75c6-hsxq5   1/1     Running   0          6s    10.244.3.15   k8s-node02   <none>           <none>
nginx-84f6f75c6-wkt52   1/1     Running   0          6s    10.244.1.63   k8s-node01   <none>           <none>
nginx-84f6f75c6-xmkjs   1/1     Running   0          6s    10.244.3.17   k8s-node02   <none>           <none>

Next, forcibly shut down the k8s-node02 node and check whether the Pods are moved.

$ kubectl get po -o wide
NAME                    READY   STATUS        RESTARTS   AGE    IP            NODE         NOMINATED NODE   READINESS GATES
nginx-84f6f75c6-c76fm   1/1     Terminating   0          116s   10.244.3.16   k8s-node02   <none>           <none>
nginx-84f6f75c6-csqf4   1/1     Running       0          13s    10.244.1.66   k8s-node01   <none>           <none>
nginx-84f6f75c6-hsxq5   1/1     Terminating   0          116s   10.244.3.15   k8s-node02   <none>           <none>
nginx-84f6f75c6-r2v4p   1/1     Running       0          13s    10.244.1.64   k8s-node01   <none>           <none>
nginx-84f6f75c6-v4knq   1/1     Running       0          13s    10.244.1.65   k8s-node01   <none>           <none>
nginx-84f6f75c6-wkt52   1/1     Running       0          116s   10.244.1.63   k8s-node01   <none>           <none>
nginx-84f6f75c6-xmkjs   1/1     Terminating   0          116s   10.244.3.17   k8s-node02   <none>           <none>

Once the node is NotReady, the Pods are moved immediately. That was achieved by explicitly setting the toleration time in the YAML manifest. Alternatively, you can change the default toleration time cluster-wide in the kube-apiserver configuration.

vim /etc/kubernetes/manifests/kube-apiserver.yaml
...
spec:
  containers:
  - command:
    - kube-apiserver
    - --advertise-address=192.168.1.11
    - --default-not-ready-toleration-seconds=1    ## newly added line
    - --default-unreachable-toleration-seconds=1  ## newly added line
...

After the change is saved, the kube-apiserver-k8s-master static Pod automatically picks up the new configuration.

$ kubectl get po nginx-84f6f75c6-wkt52 -o yaml
...
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 0
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 0
...

For small clusters it can be simpler to change these global defaults directly.

Note: in a single-node Kubernetes cluster, the Pods have nowhere else to go, so this kind of failover cannot happen.
