Pod Scheduling in Kubernetes
Author: Hud.
By default, which Node a Pod runs on is decided by the Scheduler component using the appropriate algorithm; the process is not subject to manual control.
In practice, however, this is often not enough: in many situations we want to steer certain Pods onto certain nodes. How do we do that?
This requires understanding the scheduling rules Kubernetes applies to Pods. Kubernetes provides four broad categories of scheduling:
- Automatic scheduling: the node a Pod runs on is determined entirely by the Scheduler through a series of computations
- Directed scheduling: NodeName, NodeSelector
- Affinity scheduling: NodeAffinity, PodAffinity, PodAntiAffinity
- Taint (toleration) scheduling: Taints, Tolerations
1. Directed Scheduling
Directed scheduling means declaring nodeName or nodeSelector on a Pod to place it on the desired node. Note that this scheduling is mandatory: even if the target node does not exist, the Pod is still assigned to it; it simply fails to run.
1.1 NodeName
NodeName forcibly constrains a Pod to the node with the given name. This approach skips the Scheduler's logic entirely and binds the Pod directly to the named node's Pod list.
Next, let's create pod-nodename.yaml for this directed-scheduling example:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodename
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeName: k8s-node1 # schedule onto the k8s-node1 node
```

```shell
# Create the Pod
[root@master ~]# vi pod-nodename.yaml
[root@master ~]# kubectl create -f pod-nodename.yaml
pod/pod-nodename created

# Edit the file to target node2, create a second Pod, and compare
[root@master ~]# kubectl get pod -n dev -o wide
NAME            READY   STATUS    RESTARTS   AGE     IP               NODE        NOMINATED NODE   READINESS GATES
pod-nodename    1/1     Running   0          2m25s   10.244.36.91     k8s-node1   <none>           <none>
pod-nodename2   1/1     Running   0          5s      10.244.169.163   k8s-node2   <none>           <none>
```
By the way, to look up the names of the Node objects, use:
```shell
[root@master ~]# kubectl get node
NAME        STATUS   ROLES                  AGE   VERSION
k8s-node1   Ready    <none>                 12d   v1.23.5
k8s-node2   Ready    <none>                 12d   v1.23.5
master      Ready    control-plane,master   12d   v1.23.5
```
What happens if I change nodeName in the YAML file to k8s-node3, which doesn't exist? Let's try:
```shell
[root@master ~]# kubectl create -f pod-nodename.yaml
pod/pod-nodename2 created
[root@master ~]# kubectl get pod -n dev
NAME            READY   STATUS    RESTARTS   AGE
pod-nodename2   0/1     Pending   0          14s
```
1.2 NodeSelector
NodeSelector schedules a Pod onto nodes that carry specified labels. It is implemented through Kubernetes' label-selector mechanism: before the Pod is created, the scheduler uses the MatchNodeSelector scheduling policy to match labels and find the target node, then schedules the Pod there. This match is a hard constraint.
Let's walk through a small example to get familiar with the operation:
① First, add a label to each node
```shell
[root@master ~]# kubectl label nodes k8s-node1 nodeenv=first
node/k8s-node1 labeled
[root@master ~]# kubectl label nodes k8s-node2 nodeenv=second
node/k8s-node2 labeled
```
② Create a pod-nodeselector.yaml file and use it to create a Pod
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeselector
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeSelector:
    nodeenv: first # schedule the Pod onto a node labeled nodeenv=first
```

```shell
# Create the Pod and check whether it landed on the intended node
[root@master ~]# vi pod-nodeselector.yaml
[root@master ~]# kubectl create -f pod-nodeselector.yaml
pod/pod-nodeselector created
[root@master ~]# kubectl get pod -n dev -o wide
NAME               READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
pod-nodeselector   1/1     Running   0          19s   10.244.36.92   k8s-node1   <none>           <none>
```
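To see the "hard constraint" part for yourself, one quick experiment is to point the selector at a label value that no node carries (nodeenv: third is a hypothetical value here); the Pod should then sit in Pending indefinitely:

```yaml
# Sketch: a nodeSelector no node satisfies; the Pod stays Pending
spec:
  nodeSelector:
    nodeenv: third
```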
2. Affinity Scheduling
Directed scheduling is very convenient, but it has a drawback: if no node satisfies the condition, the Pod will not run, even when other usable nodes remain in the cluster. This limits the scenarios where it can be used.
To address this, Kubernetes also provides affinity scheduling (Affinity). It extends nodeSelector: through configuration, nodes that satisfy the conditions are preferred, but if none exist the Pod can still be scheduled onto a node that does not satisfy them, making scheduling more flexible.
Affinity comes in three main flavors:
- nodeAffinity (node affinity): targets nodes; decides which nodes a Pod may be scheduled to
- podAffinity (pod affinity): targets Pods; decides which existing Pods a Pod may be co-located with in the same topology domain
- podAntiAffinity (pod anti-affinity): targets Pods; decides which existing Pods a Pod must not be co-located with in the same topology domain
On when to use affinity versus anti-affinity:
- Affinity: if two applications interact frequently, it is worth using affinity to keep them as close together as possible, reducing the performance cost of network communication
- Anti-affinity: when an application is deployed with multiple replicas, anti-affinity spreads the instances across nodes, improving the service's availability
2.1 NodeAffinity configuration options
```shell
[root@master ~]# kubectl explain pod.spec.affinity.nodeAffinity
KIND:     Pod
VERSION:  v1

# Prefer nodes that satisfy the rules; a soft constraint (preference)
preferredDuringSchedulingIgnoredDuringExecution  <[]Object>
  preference          # a node selector term, associated with the corresponding weight
    matchFields       # node selector requirements by node fields
    matchExpressions  # node selector requirements by node labels (recommended)
      key             # key
      values          # values
      operator        # operator
  weight              # preference weight, in the range 1-100

# The node must satisfy all of the rules; a hard constraint
requiredDuringSchedulingIgnoredDuringExecution  <Object>
  nodeSelectorTerms   # list of node selector terms
    matchFields       # node selector requirements by node fields
    matchExpressions  # node selector requirements by node labels (recommended)
      key             # key
      values          # values
      operator        # operator
```
How the operators are used:

```yaml
- matchExpressions:
  - key: nodeenv        # match nodes that have a label whose key is nodeenv
    operator: Exists
  - key: nodeenv        # match nodes whose nodeenv label value is "xxx" or "yyy"
    operator: In
    values: ["xxx","yyy"]
  - key: nodeenv        # match nodes whose nodeenv label value is greater than "xxx"
    operator: Gt
    values: ["xxx"]
```
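For completeness, the operator set also includes NotIn, DoesNotExist, and Lt; a brief sketch (the key and values are placeholders, reusing the nodeenv label from above):

```yaml
- matchExpressions:
  - key: nodeenv        # match nodes whose nodeenv label value is neither "xxx" nor "yyy"
    operator: NotIn
    values: ["xxx","yyy"]
  - key: nodeenv        # match nodes that do not carry a nodeenv label at all
    operator: DoesNotExist
  - key: nodeenv        # match nodes whose nodeenv label value is less than "100"
    operator: Lt
    values: ["100"]
```

Note that Gt and Lt interpret the single value as an integer, so they only make sense for numeric label values; NotIn and DoesNotExist are the usual way to express node anti-affinity behavior.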
Let's first demonstrate requiredDuringSchedulingIgnoredDuringExecution by creating pod-nodeaffinity-required.yaml:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-required
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity:          # affinity settings
    nodeAffinity:    # node affinity
      requiredDuringSchedulingIgnoredDuringExecution: # hard constraint
        nodeSelectorTerms:
        - matchExpressions: # match nodes whose nodeenv label value is in ["xxx","yyy"]
          - key: nodeenv
            operator: In
            values: ["xxx","yyy"]
```

```shell
# Create the Pod
[root@master ~]# kubectl create -f pod-nodeaffinity-required.yaml
pod/pod-nodeaffinity-required created

# First, look at the node labels
[root@master ~]# kubectl get node --show-labels
NAME        STATUS   ROLES                  AGE   VERSION   LABELS
k8s-node1   Ready    <none>                 12d   v1.23.5   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node1,kubernetes.io/os=linux,nodeenv=first
k8s-node2   Ready    <none>                 12d   v1.23.5   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node2,kubernetes.io/os=linux,nodeenv=second
master      Ready    control-plane,master   12d   v1.23.5   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=
```
Among the current nodes, neither node1's nor node2's nodeenv label is xxx or yyy, which means this Pod should fail to schedule:
```shell
[root@master ~]# kubectl get pod -n dev
NAME                        READY   STATUS    RESTARTS   AGE
pod-nodeaffinity-required   0/1     Pending   0          5m4s

# Use describe to see why scheduling failed
[root@master ~]# kubectl describe pod pod-nodeaffinity-required -n dev
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  22s (x10 over 10m)  default-scheduler  0/3 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) didn't match Pod's node affinity/selector.
```
It fails as expected. Delete the Pending Pod, change xxx to first, and recreate it:
```shell
[root@master ~]# vi pod-nodeaffinity-required.yaml
[root@master ~]# kubectl create -f pod-nodeaffinity-required.yaml
pod/pod-nodeaffinity-required created
[root@master ~]# kubectl get pod -n dev -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
pod-nodeaffinity-required   1/1     Running   0          12s   10.244.36.93   k8s-node1   <none>           <none>
```
Now let's look at the soft constraint, preferredDuringSchedulingIgnoredDuringExecution.
Create pod-nodeaffinity-preferred.yaml:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-preferred
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity:          # affinity settings
    nodeAffinity:    # node affinity
      preferredDuringSchedulingIgnoredDuringExecution: # soft constraint
      - weight: 1
        preference:
          matchExpressions: # match nodes whose nodeenv label value is in ["xxx","yyy"] (no such node exists here)
          - key: nodeenv
            operator: In
            values: ["xxx","yyy"]
```

```shell
# Create the Pod and check: the preference cannot be satisfied, but the Pod still runs
[root@master ~]# kubectl create -f pod-nodeaffinity-preferred.yaml
pod/pod-nodeaffinity-preferred created
[root@master ~]# kubectl get pod -n dev -o wide
NAME                         READY   STATUS    RESTARTS   AGE    IP               NODE        NOMINATED NODE   READINESS GATES
pod-nodeaffinity-preferred   1/1     Running   0          119s   10.244.169.164   k8s-node2   <none>           <none>
```
Note that node affinity only takes effect during scheduling: once the Pod has been placed, it stays put even if the node's labels later change. That is exactly what the IgnoredDuringExecution suffix means.
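A quick way to see this for yourself (a sketch; it assumes pod-nodeaffinity-required from above is still running on k8s-node1):

```shell
# Overwrite the label that the hard affinity rule matched on...
[root@master ~]# kubectl label nodes k8s-node1 nodeenv=changed --overwrite
# ...the already-scheduled Pod keeps running; the rule is not re-evaluated after scheduling
[root@master ~]# kubectl get pod pod-nodeaffinity-required -n dev -o wide
```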
2.2 PodAffinity configuration options

```shell
[root@master ~]# kubectl explain pod.spec.affinity.podAffinity
FIELDS:
# Hard constraint
requiredDuringSchedulingIgnoredDuringExecution  <[]Object>
  namespaces        # namespaces of the reference Pods
  topologyKey       # scheduling scope
  labelSelector     # label selector
    matchExpressions
      key           # key
      values        # values
      operator      # operator
    matchLabels     # a map equivalent to a set of matchExpressions

# Soft constraint
preferredDuringSchedulingIgnoredDuringExecution  <[]Object>
  namespaces        # namespaces of the reference Pods
  topologyKey       # scheduling scope
  labelSelector     # label selector
    matchExpressions
      key           # key
      values        # values
      operator      # operator
    matchLabels     # a map equivalent to a set of matchExpressions
  weight            # weight
```

topologyKey specifies the scope used at scheduling time, for example (see the zone sketch after this list):
- kubernetes.io/hostname distinguishes nodes individually, so the scope is a single node
- beta.kubernetes.io/os distinguishes nodes by operating system type
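For instance, to co-locate Pods by availability zone rather than by node, a podAffinity term can use the well-known zone label (a sketch; topology.kubernetes.io/zone is a standard upstream label, but your nodes must actually carry it for the rule to match):

```yaml
podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchLabels:
        podenv: target
    topologyKey: topology.kubernetes.io/zone # same zone as the matching Pod, not necessarily the same node
```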
1) First, create a reference Pod, pod-podaffinity-target.yaml:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-target
  namespace: dev
  labels:
    podenv: target # set a label
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeName: k8s-node1 # pin it to node1
```

```shell
# Create and check the Pod
[root@master ~]# vi pod-podaffinity-target.yaml
[root@master ~]# kubectl create -f pod-podaffinity-target.yaml
pod/pod-podaffinity-target created
[root@master ~]# kubectl get pod -n dev
NAME                     READY   STATUS    RESTARTS   AGE
pod-podaffinity-target   1/1     Running   0          9s
```
2) Create pod-podaffinity-required.yaml:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-required
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity:          # affinity settings
    podAffinity:     # pod affinity
      requiredDuringSchedulingIgnoredDuringExecution: # hard constraint
      - labelSelector:
          matchExpressions: # match Pods whose podenv label value is in ["xxx","yyy"]
          - key: podenv
            operator: In
            values: ["xxx","yyy"]
        topologyKey: kubernetes.io/hostname # scope: the same node as the matching Pod
```

```shell
# Create and check; no running Pod carries podenv=xxx or podenv=yyy, so scheduling fails
[root@master ~]# kubectl create -f pod-podaffinity-required.yaml
pod/pod-podaffinity-required created
[root@master ~]# kubectl get pod -n dev -o wide
NAME                       READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
pod-podaffinity-required   0/1     Pending   0          8s    <none>         <none>      <none>           <none>
pod-podaffinity-target     1/1     Running   0          14m   10.244.36.94   k8s-node1   <none>           <none>
[root@master ~]# kubectl describe pod pod-podaffinity-required -n dev
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  13s (x2 over 101s)  default-scheduler  0/3 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) didn't match pod affinity rules.
```
Two of the nodes fail the pod affinity rules, and the remaining one, the master, carries a taint; that is why the message reports 0/3 nodes available.
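You can confirm the master's taint directly (a command sketch; the events above already show the taint is node-role.kubernetes.io/master:NoSchedule, the default on a kubeadm-built cluster of this version):

```shell
[root@master ~]# kubectl describe node master | grep Taints
```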
2.3 A podAntiAffinity (anti-affinity) example
Using the target Pod set up earlier as the reference again: with anti-affinity against that Pod, the new Pod should be scheduled onto node2.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-podantiaffinity-required
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity:             # affinity settings
    podAntiAffinity:    # pod anti-affinity
      requiredDuringSchedulingIgnoredDuringExecution: # hard constraint
      - labelSelector:
          matchExpressions: # match Pods whose podenv label value is in ["target"]
          - key: podenv
            operator: In
            values: ["target"]
        topologyKey: kubernetes.io/hostname # scope: must NOT share a node with the matching Pod
```

```shell
# Create and check the Pod
[root@master ~]# kubectl create -f pod-podantiaffinity-required.yaml
pod/pod-podantiaffinity-required created
[root@master ~]# kubectl get pod -n dev -o wide
NAME                           READY   STATUS    RESTARTS   AGE   IP               NODE        NOMINATED NODE   READINESS GATES
pod-podaffinity-required       1/1     Running   0          10m   10.244.36.95     k8s-node1   <none>           <none>
pod-podaffinity-target         1/1     Running   0          28m   10.244.36.94     k8s-node1   <none>           <none>
pod-podantiaffinity-required   1/1     Running   0          43s   10.244.169.165   k8s-node2   <none>           <none>
```
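Anti-affinity also comes in a soft form. A sketch of the preferred variant (the weight is illustrative, reusing the target Pod's label from above):

```yaml
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchLabels:
          podenv: target
      topologyKey: kubernetes.io/hostname # prefer, but do not insist on, a different node
```

This soft form is typically how replicas are spread: the scheduler avoids co-locating them when it can, but still schedules them if only one node is available.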
3. Taint Scheduling: from the Node's Perspective
3.1 Taints
The scheduling approaches so far all take the Pod's point of view: attributes added to the Pod decide whether it should be scheduled onto a given node. We can also take the Node's point of view, adding taint attributes to a node to decide whether Pods are allowed to be scheduled onto it.
Once a node is tainted, a repelling relationship exists between it and Pods: the node can refuse to let Pods be scheduled onto it, and can even evict Pods that are already there.
A taint takes the form key=value:effect, where key and value are the taint's label and effect describes what the taint does. Three effects are supported:
- PreferNoSchedule: Kubernetes tries to avoid placing Pods on a node with this taint, unless no other node is schedulable
- NoSchedule: Kubernetes will not place new Pods on a node with this taint, but Pods already running on the node are unaffected
- NoExecute: Kubernetes will not place new Pods on a node with this taint, and will also evict the Pods already running there
```shell
# Set a taint
kubectl taint nodes k8s-node1 key=value:effect
# Remove a taint (by key and effect)
kubectl taint nodes k8s-node1 key:effect-
# Remove all taints with a given key
kubectl taint nodes k8s-node1 key-
```
Next, let's demonstrate the effect of taints:
- 1. Prepare node1 (temporarily stop node2)
- 2. Give node1 the taint tag=qty:PreferNoSchedule, then create pod1 (pod1 runs)
- 3. Change node1's taint to tag=qty:NoSchedule, then create pod2 (pod1 stays normal, pod2 fails to schedule)
- 4. Change node1's taint to tag=qty:NoExecute, then create pod3 (everything fails)
```shell
[root@master ~]# kubectl get node
NAME        STATUS     ROLES                  AGE   VERSION
k8s-node1   Ready      <none>                 13d   v1.23.5
k8s-node2   NotReady   <none>                 13d   v1.23.5
master      Ready      control-plane,master   13d   v1.23.5

# Taint node1 with PreferNoSchedule
[root@master ~]# kubectl taint nodes k8s-node1 tag=qty:PreferNoSchedule
node/k8s-node1 tainted

# Create pod1
[root@master ~]# kubectl run taint1 --image=nginx:1.17.1 -n dev
pod/taint1 created
[root@master ~]# kubectl get pod -n dev
NAME     READY   STATUS    RESTARTS   AGE
taint1   1/1     Running   0          53s

# Replace the taint with NoSchedule
[root@master ~]# kubectl taint nodes k8s-node1 tag:PreferNoSchedule-
node/k8s-node1 untainted
[root@master ~]# kubectl taint nodes k8s-node1 tag=qty:NoSchedule
node/k8s-node1 tainted

# Create pod2
[root@master ~]# kubectl run taint2 --image=nginx -n dev
pod/taint2 created
[root@master ~]# kubectl get pod -n dev
NAME     READY   STATUS    RESTARTS   AGE
taint1   1/1     Running   0          7m14s
taint2   0/1     Pending   0          11s

# Replace the taint with NoExecute
[root@master ~]# kubectl taint nodes k8s-node1 tag:NoSchedule-
node/k8s-node1 untainted
[root@master ~]# kubectl taint nodes k8s-node1 tag=qty:NoExecute
node/k8s-node1 tainted

# Create pod3; the existing pods are evicted, and pod3 cannot be scheduled either
[root@master ~]# kubectl run taint3 --image=nginx -n dev
pod/taint3 created
[root@master ~]# kubectl get pod -n dev
NAME     READY   STATUS    RESTARTS   AGE
taint3   0/1     Pending   0          4s

# Remove all taints under the tag key
[root@master ~]# kubectl taint nodes k8s-node1 tag-
node/k8s-node1 untainted
```
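Between steps it can help to confirm which taint the node currently carries (a command sketch; output varies with cluster state):

```shell
[root@master ~]# kubectl describe node k8s-node1 | grep Taints
```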
3.2 Tolerations
A taint is a refusal; a toleration ignores it. A node uses taints to refuse Pods, and a Pod uses tolerations to ignore the refusal.
Create pod-toleration.yaml. (For the toleration to be meaningful, k8s-node1 needs the tag=qty:NoExecute taint in place; re-apply it if you removed it at the end of the previous step.)
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-toleration
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  tolerations:          # add a toleration
  - key: "tag"          # key of the taint to tolerate
    operator: "Equal"   # operator
    value: "qty"        # value of the taint to tolerate
    effect: "NoExecute" # the effect to tolerate; must match the taint's effect
```

```shell
# Create and check the Pod
[root@master ~]# kubectl create -f pod-toleration.yaml
pod/pod-toleration created
[root@master ~]# kubectl get pod -n dev
NAME             READY   STATUS    RESTARTS   AGE
pod-toleration   1/1     Running   0          39s
```
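To tolerate every taint with a given key regardless of its value, the Exists operator can be used instead (a sketch; when operator is Exists, the value field must be omitted):

```yaml
tolerations:
- key: "tag"
  operator: "Exists"  # matches any value of the key "tag"
  effect: "NoExecute"
```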
Summary
The above is based on my personal experience; I hope it serves as a useful reference.