How Kubernetes pods use SR-IOV
Author: 魏志标
This article explains how to use Multus to give pods access to SR-IOV devices.
1. Introduction to SR-IOV
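SR-IOV is defined by a PCI-SIG specification, and Intel began promoting it around 2010. As container technology spread, Intel also released open-source components for using SR-IOV in containers, such as sriov-cni and sriov-device-plugin, so SR-IOV is now widely used in the container world as well.
In traditional virtualization, a virtual machine's NIC is usually attached through a bridge (Linux Bridge or OVS) because that is the simplest and most convenient approach, but its biggest drawback is performance. SR-IOV (Single-Root I/O Virtualization) is a hardware-based virtualization solution: it lets multiple cloud hosts share a PCIe device efficiently while getting I/O performance comparable to that of the physical device, which improves both performance and scalability.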
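SR-IOV works by exposing virtualized channels to users. There are two kinds of channel:
- PF (Physical Function): the channel function that manages the PCIe device at the physical level. It can be regarded as a complete PCIe device that includes the SR-IOV capability structure and is able to manage and configure VFs.
- VF (Virtual Function): the channel function of the PCIe device at the virtual level. It contains only I/O functionality, and VFs share the physical resources. A VF is a trimmed-down PCIe device that may configure only its own resources, so a virtual machine cannot manage the SR-IOV NIC through a VF. Every VF is derived from a PF, and some SR-IOV NIC models can create up to 256 VFs.
Packet dispatch in an SR-IOV device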
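Logically, a physical NIC with SR-IOV enabled can be thought of as containing a special built-in switch that connects all PF and VF ports and dispatches packets according to the MAC addresses and VLAN IDs of the VFs and the PF.
- On ingress (entering the NIC from outside): if the packet's destination MAC address and VLAN ID both match a VF, the packet is delivered to that VF; otherwise it goes to the PF. If the destination MAC address is the broadcast address, the packet is broadcast within that VLAN, and every VF with the same VLAN ID receives it.
- On egress (sent by the PF or a VF): if the packet's destination MAC address does not match any port (VF or PF) in the same VLAN, the packet is forwarded out of the NIC; otherwise it is forwarded internally to the matching port. If the destination MAC address is the broadcast address, the packet is broadcast within the same VLAN and also sent out of the NIC.
Note: all VFs and PFs without a VLAN ID can be regarded as sharing one LAN, and untagged packets are handled in that LAN by the rules above. In addition, a VF with a VLAN configured has the VLAN tag added automatically to the packets it sends, and on receive you can choose whether the hardware strips the VLAN header.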
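The ingress rule above can be sketched as a small lookup. This is only an illustrative model of the description in this article, not how any NIC firmware is implemented: the function name `ingress_dispatch`, the table file format (`<vf> <mac> <vlan>` per line) and the sample addresses are all assumptions made for the example.

```shell
# ingress_dispatch TABLE DST_MAC VLAN
# TABLE is a text file with one VF per line: "<vf-name> <mac> <vlan-id>".
# Prints the port(s) that would receive the packet under the rules above:
#   - unicast: the VF whose MAC and VLAN both match, otherwise "pf"
#   - broadcast: every VF in the same VLAN (or "pf" if there is none)
ingress_dispatch() {
    awk -v mac="$2" -v vlan="$3" '
        BEGIN { mac = tolower(mac); bcast = (mac == "ff:ff:ff:ff:ff:ff") }
        $3 == vlan {
            if (bcast) { print $1; hit = 1 }
            else if (tolower($2) == mac) { print $1; hit = 1; exit }
        }
        END { if (!hit) print "pf" }
    ' "$1"
}

# Usage (hypothetical table file):
#   ingress_dispatch vf-table.txt 02:00:00:00:00:02 100
```

Egress could be modeled the same way, with "forward out of the NIC" replacing the fallback to the PF.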
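2. SR-IOV devices and container networking
Intel released the SR-IOV CNI plugin, which lets a Kubernetes pod attach directly to an SR-IOV virtual function (VF) in either of two modes.
- The first mode uses the standard SR-IOV VF driver in the container host's kernel.
- The second mode supports DPDK VNFs that run the VF driver and the network protocol stack in user space.
This article covers the first mode, attaching the SR-IOV virtual function (VF device) directly, as shown in the figure below:
The figure shows the components used on a node: kubelet, sriov-device-plugin, sriov-cni and multus-cni.
The VF devices on a node must be created in advance; sriov-device-plugin then advertises them to the Kubernetes cluster.
When a pod is created, the kubelet calls multus-cni, and multus-cni in turn calls the default CNI plugin and the sriov-cni plugin to build the pod's network environment.
sriov-cni moves the VF device on the host into the container's network namespace and configures its IP address.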
3. Environment preparation
- Kubernetes environment
[root@node1 ~]# kubectl get node
NAME    STATUS   ROLES                  AGE   VERSION
node1   Ready    control-plane,master   47d   v1.23.17
node2   Ready    control-plane,master   47d   v1.23.17
node3   Ready    control-plane,master   47d   v1.23.17
- Hardware environment
[root@node1 ~]# lspci -nn | grep -i eth
23:00.0 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
23:00.1 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
41:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
41:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
42:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
42:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
63:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
63:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
a1:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
a1:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
[root@node1 ~]#
This environment uses the Mellanox Technologies MT27710 for the tests.
# Check whether the NIC supports SR-IOV
[root@node1 ~]# lspci -v -s 41:00.0
41:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
    Subsystem: Mellanox Technologies Stand-up ConnectX-4 Lx EN, 25GbE dual-port SFP28, PCIe3.0 x8, MCX4121A-ACAT
    Physical Slot: 19
    Flags: bus master, fast devsel, latency 0, IRQ 195, IOMMU group 56
    Memory at 2bf48000000 (64-bit, prefetchable) [size=32M]
    Expansion ROM at c6f00000 [disabled] [size=1M]
    Capabilities: [60] Express Endpoint, MSI 00
    Capabilities: [48] Vital Product Data
    Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
    Capabilities: [c0] Vendor Specific Information: Len=18 <?>
    Capabilities: [40] Power Management version 3
    Capabilities: [100] Advanced Error Reporting
    Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [180] Single Root I/O Virtualization (SR-IOV)    ## SR-IOV is supported
    Capabilities: [1c0] Secondary PCI Express
    Capabilities: [230] Access Control Services
    Kernel driver in use: mlx5_core
    Kernel modules: mlx5_core    #### driver used by the NIC
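The same check can be scripted instead of read by eye. The sketch below is a minimal helper, assuming the capability line has the wording shown in the output above; the function name `has_sriov_capability` is made up for this example.

```shell
# has_sriov_capability: reads `lspci -v` output on stdin and succeeds
# (exit status 0) only if an SR-IOV capability line is present.
has_sriov_capability() {
    grep -q 'Single Root I/O Virtualization (SR-IOV)'
}

# Usage, run as root so that lspci can read the capability list:
#   lspci -v -s 41:00.0 | has_sriov_capability && echo "SR-IOV supported"
```

On an SR-IOV capable PF the kernel also exposes the maximum VF count in sysfs, for example `cat /sys/class/net/ens19f0/device/sriov_totalvfs`.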
- Enabling VFs
[root@node1 ~]# echo 8 > /sys/class/net/ens19f0/device/sriov_numvfs
# List the enabled VFs on the physical host
[root@node1 ~]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
4: ens19f0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether e8:eb:d3:33:be:ea brd ff:ff:ff:ff:ff:ff
    vf 0 link/ether 00:00:00:00:00:00, spoof checking off, link-state auto, trust off, query_rss off
    vf 1 link/ether 00:00:00:00:00:00, spoof checking off, link-state auto, trust off, query_rss off
    vf 2 link/ether 00:00:00:00:00:00, spoof checking off, link-state auto, trust off, query_rss off
    vf 3 link/ether 00:00:00:00:00:00, spoof checking off, link-state auto, trust off, query_rss off
    vf 4 link/ether 00:00:00:00:00:00, spoof checking off, link-state auto, trust off, query_rss off
    vf 5 link/ether 00:00:00:00:00:00, spoof checking off, link-state auto, trust off, query_rss off
    vf 6 link/ether 00:00:00:00:00:00, spoof checking off, link-state auto, trust off, query_rss off
    vf 7 link/ether 00:00:00:00:00:00, spoof checking off, link-state auto, trust off, query_rss off
# Confirm that the VFs have been created
[root@node1 ~]# lspci -nn | grep -i ether
23:00.0 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
23:00.1 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
41:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
41:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
41:00.2 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] [15b3:1016]
41:00.3 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] [15b3:1016]
41:00.4 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] [15b3:1016]
41:00.5 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] [15b3:1016]
41:00.6 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] [15b3:1016]
41:00.7 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] [15b3:1016]
41:01.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] [15b3:1016]
41:01.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] [15b3:1016]
42:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
42:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
63:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
63:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
a1:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
a1:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
# Use ip a to confirm that the system recognizes the VF interfaces
[root@node1 ~]# ip a | grep ens19f0v
18: ens19f0v0: <BROADCAST,MULTICAST,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
19: ens19f0v1: <BROADCAST,MULTICAST,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
20: ens19f0v2: <BROADCAST,MULTICAST,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
21: ens19f0v3: <BROADCAST,MULTICAST,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
22: ens19f0v4: <BROADCAST,MULTICAST,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
23: ens19f0v5: <BROADCAST,MULTICAST,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
24: ens19f0v6: <BROADCAST,MULTICAST,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
25: ens19f0v7: <BROADCAST,MULTICAST,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
[root@node1 ~]#
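Writing to `sriov_numvfs` can fail in two common ways: the requested count exceeds `sriov_totalvfs`, or VFs are already enabled and the kernel refuses to change a non-zero count directly. The helper below is a minimal sketch that guards against both. The function name `set_numvfs` is an assumption for this example; it takes the sysfs device directory as a parameter so it can be tried against any path.

```shell
# set_numvfs DEVICE_DIR COUNT
# DEVICE_DIR is the PF's sysfs device directory, for example
# /sys/class/net/ens19f0/device. Returns non-zero on failure.
set_numvfs() {
    vf_dir=$1
    vf_want=$2
    vf_total=$(cat "$vf_dir/sriov_totalvfs") || return 1
    vf_current=$(cat "$vf_dir/sriov_numvfs") || return 1

    if [ "$vf_want" -gt "$vf_total" ]; then
        echo "requested $vf_want VFs, device supports at most $vf_total" >&2
        return 1
    fi
    if [ "$vf_current" -eq "$vf_want" ]; then
        return 0    # nothing to do
    fi
    # A non-zero VF count cannot be changed in place: reset to 0 first.
    if [ "$vf_current" -ne 0 ]; then
        echo 0 > "$vf_dir/sriov_numvfs" || return 1
    fi
    echo "$vf_want" > "$vf_dir/sriov_numvfs"
}

# Usage (as root):
#   set_numvfs /sys/class/net/ens19f0/device 8
```

Note that a value written through sysfs does not survive a reboot, so the command has to be reapplied at boot (for example from a systemd unit or udev rule) if the VFs should be permanent.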
4. Installing SR-IOV
- Installing sriov-device-plugin
[root@node1 ~]# git clone https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin.git
[root@node1 ~]# cd sriov-network-device-plugin/
[root@node1 ~]# make image    ### build the image
[root@node1 ~]#
Alternatively, pull the image directly:
[root@node1 ~]# docker pull ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:latest-amd
The PF and VF resources of an SR-IOV device must be advertised to the Kubernetes cluster before pods can use them, and that is the job of the device plugin. The device plugin pods are deployed as a DaemonSet and run on every node. The plugin registers its service with the kubelet through the Register method. The kubelet on the node then calls the plugin's ListAndWatch interface over gRPC to obtain the device information for all SR-IOV devices on the node. When the kubelet needs to allocate an SR-IOV device to a pod, it calls the plugin's Allocate method with the device ID to obtain the device details.
# Edit the ConfigMap. It selects which SR-IOV VF devices on the node are registered with the Kubernetes cluster.
[root@node1 ~]# vim sriov-network-device-plugin/deployments/configMap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
      "resourceList": [{
          "resourcePrefix": "Mellanox.com",
          "resourceName": "Mellanox_sriov_switchdev_MT27710_ens19f0_vf",
          "selectors": {
            "drivers": ["mlx5_core"],
            "pfNames": ["ens19f0#0-7"]
          }
        }
      ]
    }
In pfNames, enter the device name as the system recognizes it. Devices can also be selected by vendor ID (vendors); several selector styles are available.
# Deploy sriov-device-plugin
[root@node1 ~]# kubectl create -f deployments/configMap.yaml
[root@node1 ~]# kubectl create -f deployments/sriovdp-daemonset.yaml
# Check that the plugin pods are running
[root@node1 ~]# kubectl get po -A -o wide | grep sriov
kube-system   kube-sriov-device-plugin-amd64-d7ctb   1/1   Running   0   6d5h   172.28.30.165   node3   <none>   <none>
kube-system   kube-sriov-device-plugin-amd64-h86dl   1/1   Running   0   6d5h   172.28.30.164   node2   <none>   <none>
kube-system   kube-sriov-device-plugin-amd64-rlpwb   1/1   Running   0   6d5h   172.28.30.163   node1   <none>   <none>
[root@node1 ~]#
# Use describe node to confirm that the VFs are registered on the node
[root@node1 ~]# kubectl describe node node1
---------
Capacity:
  cpu:                            128
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              256374468Ki
  hugepages-1Gi:                  120Gi
  Mellanox.com/Mellanox_sriov_switchdev_MT27710_ens19f0_vf:  8    ## registered
  memory:                         527839304Ki
  pods:                           110
Allocatable:
  cpu:                            112
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              236274709318
  hugepages-1Gi:                  120Gi
  Mellanox.com/Mellanox_sriov_switchdev_MT27710_ens19f0_vf:  8    ## number available for allocation
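When several nodes have to be checked, it is easier to extract the two counters than to scan the full describe output. The following is a small sketch that filters `kubectl describe node` output for one resource; the function name `resource_counts` is invented for this example, and it assumes the section headings look like the output above (Capacity, Allocatable, then System Info).

```shell
# resource_counts RESOURCE_NAME
# Reads `kubectl describe node <node>` output on stdin and prints
# "capacity=<n>" and "allocatable=<n>" for the given extended resource.
resource_counts() {
    awk -v res="$1:" '
        /^Capacity:/    { section = "capacity" }
        /^Allocatable:/ { section = "allocatable" }
        /^System Info:/ { section = "" }
        section != "" && $1 == res { print section "=" $2 }
    '
}

# Usage:
#   kubectl describe node node1 |
#       resource_counts Mellanox.com/Mellanox_sriov_switchdev_MT27710_ens19f0_vf
```

If either line is missing, the device plugin has not advertised the resource on that node, and its pod log is the first place to look.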
- Installing sriov-cni
[root@node1 ~]# git clone https://github.com/k8snetworkplumbingwg/sriov-cni.git
[root@node1 ~]# cd sriov-cni
[root@node1 ~]# make    ### build sriov-cni
# Copy the binary and set its permissions on every SR-IOV node
[root@node1 ~]# cp build/sriov /opt/cni/bin/
[root@node1 ~]# chmod 755 /opt/cni/bin/sriov    # executable is enough; avoid a world-writable 777
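The copy and permission steps can be combined into one helper that is easy to repeat on each node. This is a sketch, not part of sriov-cni itself: the function name `install_cni_plugin` is made up here, and mode 0755 is a choice made for this example rather than something the project mandates.

```shell
# install_cni_plugin SRC [DEST_DIR]
# Copies the built plugin binary to DEST_DIR/sriov (default /opt/cni/bin)
# with mode 0755, creating the directory if it does not exist.
install_cni_plugin() {
    cni_src=$1
    cni_dest=${2:-/opt/cni/bin}
    if [ ! -f "$cni_src" ]; then
        echo "missing plugin binary: $cni_src" >&2
        return 1
    fi
    mkdir -p "$cni_dest" || return 1
    install -m 0755 "$cni_src" "$cni_dest/sriov"
}

# Usage (as root, from the sriov-cni checkout):
#   install_cni_plugin build/sriov
```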
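What sriov-cni does:
First, once sriov-cni is deployed, an executable named sriov is present in the /opt/cni/bin directory.
Then, when a pod is created, the kubelet calls the multus-cni plugin, and multus-cni calls each plugin in its delegates array. The delegates array contains the SR-IOV entry, so multus-cni runs /opt/cni/bin/sriov to build the container's network environment. That work consists of:
- Finding the device by the SR-IOV device ID that the kubelet allocated and moving it into the container's network namespace
- Assigning an IP address to the device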
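Done by hand, these two steps correspond roughly to a handful of iproute2 commands. The sketch below only prints the commands so that it can be inspected safely. It is not sriov-cni's actual code, and the function name `vf_attach_plan` as well as the interface, namespace and address in the usage line are hypothetical values chosen for illustration.

```shell
# vf_attach_plan VF_NETDEV NETNS POD_IFNAME CIDR
# Prints (does not run) the iproute2 commands that move a VF netdev into
# a named network namespace, rename it, give it an address and bring it up.
vf_attach_plan() {
    plan_vf=$1
    plan_ns=$2
    plan_if=$3
    plan_cidr=$4
    echo "ip link set dev $plan_vf netns $plan_ns"
    echo "ip -n $plan_ns link set dev $plan_vf name $plan_if"
    echo "ip -n $plan_ns addr add $plan_cidr dev $plan_if"
    echo "ip -n $plan_ns link set dev $plan_if up"
}

# Usage (hypothetical values):
#   vf_attach_plan ens19f0v4 pod-ns net1 10.0.0.5/24
```

In the real flow the address comes from the IPAM plugin named in the network attachment definition, not from a fixed argument.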
- Installing Multus
For the installation steps, see https://www.jb51.net/server/325044x0h.htm
5. Using SR-IOV in a pod
- Creating the net-attach-def
[root@node1 ~]# vim sriov-attach.yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-attach
  annotations:
    k8s.v1.cni.cncf.io/resourceName: Mellanox.com/Mellanox_sriov_switchdev_MT27710_ens19f0_vf
spec:
  config: '{
    "cniVersion": "0.3.1",
    "name": "sriov-attach",
    "type": "sriov",
    "ipam": {
      "type": "calico-ipam",
      "range": "222.0.0.0/8"
    }
  }'
[root@node1 ~]# kubectl apply -f sriov-attach.yaml
networkattachmentdefinition.k8s.cni.cncf.io/sriov-attach created
[root@node1 ~]# kubectl get net-attach-def
NAME           AGE
sriov-attach   12s
- Defining the pod YAML
[root@node1 ~]# cat sriov-attach.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sriov
  labels:
    app: sriov-attach
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sriov-attach
  template:
    metadata:
      annotations:
        k8s.v1.cni.cncf.io/networks: sriov-attach
      labels:
        app: sriov-attach
    spec:
      containers:
      - name: sriov-attach
        image: docker.io/library/nginx:latest
        imagePullPolicy: IfNotPresent
        resources:
          requests:
            cpu: 1
            memory: 1Gi
            Mellanox.com/Mellanox_sriov_switchdev_MT27710_ens19f0_vf: '1'
          limits:
            cpu: 1
            memory: 1Gi
            Mellanox.com/Mellanox_sriov_switchdev_MT27710_ens19f0_vf: '1'
[root@node1 ~]#
# Start the pod to test
[root@node1 ~]# kubectl apply -f sriov-attach.yaml
deployment.apps/sriov created
[root@node1 ~]# kubectl get po -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP             NODE    NOMINATED NODE   READINESS GATES
sriov-65c8f754f9-jlcd5   1/1     Running   0          6s    172.25.36.87   node1   <none>           <none>
- Inspecting the pod
# 1: use describe pod to check the resource allocation
[root@node1 wzb]# kubectl describe po sriov-65c8f754f9-jlcd5
Name:         sriov-65c8f754f9-jlcd5
Namespace:    default
Priority:     0
Node:         node1/172.28.30.163
Start Time:   Wed, 28 Feb 2024 20:56:24 +0800
Labels:       app=sriov-attach
              pod-template-hash=65c8f754f9
Annotations:  cni.projectcalico.org/containerID: 21ec82394a00c893e5304577b59984441bd3adac82929b5f9b5538f988245bf5
              cni.projectcalico.org/podIP: 172.25.36.87/32
              cni.projectcalico.org/podIPs: 172.25.36.87/32
              k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "k8s-pod-network",
                    "ips": [
                        "172.25.36.87"
                    ],
                    "default": true,
                    "dns": {}
                },{
                    "name": "default/sriov-attach",
                    "interface": "net1",
                    "ips": [
                        "172.25.36.90"
                    ],
                    "mac": "f6:c2:e5:d1:7b:fa",
                    "dns": {},
                    "device-info": {
                        "type": "pci",
                        "version": "1.1.0",
                        "pci": {
                            "pci-address": "0000:41:00.6"
                        }
                    }
                }]
              k8s.v1.cni.cncf.io/networks: sriov-attach
Status:       Running
IP:           172.25.36.87
IPs:
  IP:           172.25.36.87
Controlled By:  ReplicaSet/sriov-65c8f754f9
Containers:
  nginx:
    Container ID:   containerd://6d5246c3e36a125ba60bad6af63f8bffe4710d78c2e14e6afb0d466c3f0f5d6e
    Image:          docker.io/library/nginx:latest
    Image ID:       sha256:12766a6745eea133de9fdcd03ff720fa971fdaf21113d4bc72b417c123b15619
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Wed, 28 Feb 2024 20:56:28 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                                                 1
      intel.com/intel_sriov_switchdev_MT27710_ens19f0_vf:  1
      memory:                                              1Gi
    Requests:
      cpu:                                                 1
      intel.com/intel_sriov_switchdev_MT27710_ens19f0_vf:  1    ## the SR-IOV resource has been allocated to the pod
      memory:                                              1Gi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pnl9d (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-pnl9d:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age   From               Message
  ----    ------          ----  ----               -------
  Normal  Scheduled       105s  default-scheduler  Successfully assigned default/sriov-65c8f754f9-jlcd5 to node1
  Normal  AddedInterface  103s  multus             Add eth0 [172.25.36.87/32] from k8s-pod-network
  Normal  AddedInterface  102s  multus             Add net1 [172.25.36.90/26] from default/sriov-attach    #### the SR-IOV NIC was added successfully
  Normal  Pulled          102s  kubelet            Container image "docker.io/library/nginx:latest" already present on machine
  Normal  Created         102s  kubelet            Created container nginx
  Normal  Started         102s  kubelet            Started container nginx
[root@node1 wzb]#
# 2: enter the pod's network namespace; the net1 interface is present and has an address
[root@node1 ~]# crictl ps | grep sriov-attach
6d5246c3e36a1   12766a6745eea   4 minutes ago   Running   nginx   0   21ec82394a00c
[root@node1 ~]# crictl inspect 6d5246c3e36a1 | grep -i pid
    "pid": 2775224,
            "pid": 1
            "type": "pid"
[root@node1 ~]# nsenter -t 2775224 -n bash
[root@node1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ip_vti0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if30729: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default
    link/ether ae:f9:85:03:13:2f brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.25.36.87/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::acf9:85ff:fe03:132f/64 scope link
       valid_lft forever preferred_lft forever
21: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether f6:c2:e5:d1:7b:fa brd ff:ff:ff:ff:ff:ff
    inet 172.25.36.90/26 brd 172.25.36.127 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::f4c2:e5ff:fed1:7bfa/64 scope link
       valid_lft forever preferred_lft forever
[root@node1 ~]#
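The PID lookup above was read off the grep output by eye. A small filter can pick the host-side PID automatically; the sketch below assumes the `crictl inspect` JSON contains a `"pid": <number>` line for the container process, as in the output shown here, and the function name `container_pid` is invented for this example.

```shell
# container_pid: reads `crictl inspect <container-id>` output on stdin and
# prints the first "pid" value greater than 1, which is the container's
# PID on the host (the value 1 belongs to the in-container namespace entry).
container_pid() {
    awk -F'[:,]' '
        /"pid":/ {
            gsub(/[^0-9]/, "", $2)
            if (($2 + 0) > 1) { print $2; exit }
        }
    '
}

# Usage:
#   pid=$(crictl inspect 6d5246c3e36a1 | container_pid)
#   nsenter -t "$pid" -n ip addr show net1
```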
Summary
The above is based on my personal experience; I hope it serves as a useful reference.