How to use SR-IOV in a k8s pod

Author: 魏志标

This article walks through how to use Multus to give k8s pods SR-IOV network interfaces.

1. SR-IOV overview

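SR-IOV (a PCI-SIG specification) was championed by Intel around 2010. As container technology spread, Intel also released open-source components for using SR-IOV in containers, such as sriov-cni and sriov-device-plugin, so SR-IOV is now widely used in the container space as well.

In traditional virtualization, a virtual machine's NIC is usually attached through a bridge (Linux bridge or OVS), because that is the simplest and most convenient approach; its main drawback is performance. SR-IOV (Single-Root I/O Virtualization) is a hardware-based virtualization solution that lets multiple cloud hosts share a PCIe device efficiently while getting I/O performance comparable to the physical device, which improves both performance and scalability.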

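SR-IOV works by virtualizing channels (functions) that are handed to users. There are two kinds:

PF (Physical Function): the full PCIe function of the device, which manages the hardware and creates and configures VFs.

VF (Virtual Function): a lightweight PCIe function derived from a PF, which can be assigned directly to a virtual machine or container.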

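Logically, a physical NIC with SR-IOV enabled can be thought of as containing a special built-in switch that connects all PF and VF ports and forwards packets based on the MAC addresses and VLAN IDs of the VFs and PFs.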
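To see which interfaces on a host act as PFs, the kernel's sysfs attributes can be inspected. The following is a minimal sketch, assuming a Linux host with the standard `sriov_numvfs` and `sriov_totalvfs` attributes; the `SYSFS_NET` variable is a hypothetical override added only so the function can be pointed at a test directory.

```shell
#!/usr/bin/env bash
# List network interfaces that expose SR-IOV, with current and maximum VF counts.
# SYSFS_NET is a hypothetical override (not a kernel or tool convention).
list_sriov_pfs() {
    local root="${SYSFS_NET:-/sys/class/net}"
    local dev name
    for dev in "$root"/*; do
        # Only SR-IOV capable physical functions carry sriov_totalvfs.
        [ -r "$dev/device/sriov_totalvfs" ] || continue
        name="${dev##*/}"
        printf '%s numvfs=%s totalvfs=%s\n' "$name" \
            "$(cat "$dev/device/sriov_numvfs")" \
            "$(cat "$dev/device/sriov_totalvfs")"
    done
}
```

Calling `list_sriov_pfs` on a node prints one line per PF; interfaces without SR-IOV support, and the VFs themselves, are skipped.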

2. SR-IOV devices and container networking

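Intel released the SR-IOV CNI plugin, which lets a Kubernetes pod attach directly to an SR-IOV virtual function (VF) in either of two modes.

This article covers the first mode, in which the pod is attached directly to the SR-IOV virtual function (the VF device).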

Four components are involved on each node: kubelet, sriov-device-plugin, sriov-cni, and multus-cni.

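The VF devices on a node must be created in advance; sriov-device-plugin then advertises them to the k8s cluster.

When a pod is created, kubelet calls multus-cni, and multus-cni in turn calls the default CNI and the sriov-cni plugin to build the pod's network environment.

sriov-cni moves a VF device from the host into the container's network namespace and configures its IP address.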

3. Environment preparation

[root@node1 ~]# kubectl get node 
NAME    STATUS   ROLES                  AGE   VERSION
node1   Ready    control-plane,master   47d   v1.23.17
node2   Ready    control-plane,master   47d   v1.23.17
node3   Ready    control-plane,master   47d   v1.23.17
[root@node1 ~]# lspci -nn  | grep -i eth  
23:00.0 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
23:00.1 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
41:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
41:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
42:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
42:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
63:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
63:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
a1:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
a1:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
[root@node1 ~]# 

This environment uses the Mellanox Technologies MT27710 for the tests.
######## Check whether the NIC supports SR-IOV
[root@node1 ~]# lspci -v -s 41:00.0
41:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
	Subsystem: Mellanox Technologies Stand-up ConnectX-4 Lx EN, 25GbE dual-port SFP28, PCIe3.0 x8, MCX4121A-ACAT
	Physical Slot: 19
	Flags: bus master, fast devsel, latency 0, IRQ 195, IOMMU group 56
	Memory at 2bf48000000 (64-bit, prefetchable) [size=32M]
	Expansion ROM at c6f00000 [disabled] [size=1M]
	Capabilities: [60] Express Endpoint, MSI 00
	Capabilities: [48] Vital Product Data
	Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
	Capabilities: [c0] Vendor Specific Information: Len=18 <?>
	Capabilities: [40] Power Management version 3
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [180] Single Root I/O Virtualization (SR-IOV)   ## SR-IOV is supported
	Capabilities: [1c0] Secondary PCI Express
	Capabilities: [230] Access Control Services
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core    #### driver module for this NIC

#### Create 8 VFs on the PF ens19f0
[root@node1 ~]# echo 8 > /sys/class/net/ens19f0/device/sriov_numvfs

#### List the enabled VFs on the physical host
[root@node1 ~]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
4: ens19f0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether e8:eb:d3:33:be:ea brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 00:00:00:00:00:00, spoof checking off, link-state auto, trust off, query_rss off
    vf 1     link/ether 00:00:00:00:00:00, spoof checking off, link-state auto, trust off, query_rss off
    vf 2     link/ether 00:00:00:00:00:00, spoof checking off, link-state auto, trust off, query_rss off
    vf 3     link/ether 00:00:00:00:00:00, spoof checking off, link-state auto, trust off, query_rss off
    vf 4     link/ether 00:00:00:00:00:00, spoof checking off, link-state auto, trust off, query_rss off
    vf 5     link/ether 00:00:00:00:00:00, spoof checking off, link-state auto, trust off, query_rss off
    vf 6     link/ether 00:00:00:00:00:00, spoof checking off, link-state auto, trust off, query_rss off
    vf 7     link/ether 00:00:00:00:00:00, spoof checking off, link-state auto, trust off, query_rss off

### Confirm that the VFs are enabled
[root@node1 ~]# lspci -nn  | grep -i ether
23:00.0 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
23:00.1 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
41:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
41:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
41:00.2 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] [15b3:1016]
41:00.3 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] [15b3:1016]
41:00.4 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] [15b3:1016]
41:00.5 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] [15b3:1016]
41:00.6 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] [15b3:1016]
41:00.7 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] [15b3:1016]
41:01.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] [15b3:1016]
41:01.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] [15b3:1016]
42:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
42:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
63:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
63:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
a1:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
a1:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]


##### Use ip a to confirm that the system recognizes the VFs
[root@node1 ~]# ip a | grep ens19f0v
18: ens19f0v0: <BROADCAST,MULTICAST,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
19: ens19f0v1: <BROADCAST,MULTICAST,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
20: ens19f0v2: <BROADCAST,MULTICAST,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
21: ens19f0v3: <BROADCAST,MULTICAST,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
22: ens19f0v4: <BROADCAST,MULTICAST,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
23: ens19f0v5: <BROADCAST,MULTICAST,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
24: ens19f0v6: <BROADCAST,MULTICAST,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
25: ens19f0v7: <BROADCAST,MULTICAST,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
[root@node1 ~]#
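The `echo 8 > .../sriov_numvfs` step above can be wrapped in a small guard. This is a sketch under stated assumptions, not part of the original procedure: it checks the request against `sriov_totalvfs` and writes 0 first, because the kernel refuses to change one non-zero VF count directly to another. `SYSFS_NET` is a hypothetical override for testing; on a real node the function needs root.

```shell
#!/usr/bin/env bash
# Set the VF count on a PF, refusing values above what the hardware reports.
# SYSFS_NET is a hypothetical override used only for testing.
set_numvfs() {
    local pf="$1" want="$2"
    local dev="${SYSFS_NET:-/sys/class/net}/$pf/device"
    local total
    total="$(cat "$dev/sriov_totalvfs")" || return 1
    if [ "$want" -gt "$total" ]; then
        echo "requested $want VFs, but $pf supports at most $total" >&2
        return 1
    fi
    # Reset to 0 first so that an existing non-zero count can be changed.
    echo 0 > "$dev/sriov_numvfs"
    echo "$want" > "$dev/sriov_numvfs"
}
```

For example, `set_numvfs ens19f0 8` reproduces the step used in this article. A value written through sysfs does not survive a reboot, so it has to be reapplied at boot by whatever mechanism the node uses.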

4. Installing SR-IOV

[root@node1 ~]# git clone https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin.git
[root@node1 ~]# cd sriov-network-device-plugin/
[root@node1 ~]# make image  ### build the image
[root@node1 ~]#
Alternatively, pull the image directly:
[root@node1 ~]# docker pull ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:latest-amd

##############################
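The PF and VF resources of SR-IOV devices have to be advertised to the k8s cluster before pods can use them, which is the job of the device plugin. The device-plugin pod is deployed as a DaemonSet and runs on every node. It registers its service with kubelet through the Register method, and kubelet calls the plugin's ListAndWatch interface over gRPC to obtain the device information for all SR-IOV devices on the node. When kubelet needs to allocate an SR-IOV device to a pod, it calls the plugin's Allocate method with the device ID and gets back the device details.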
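Once the plugin is running, the devices it reports through ListAndWatch show up as an extended resource on the node, which is what the `kubectl describe node` step later in this section checks by eye. As a rough sketch, the same check can be scripted; `resource_counts` is a hypothetical helper name, and it simply parses the describe output from stdin.

```shell
#!/usr/bin/env bash
# Print the Capacity and Allocatable values of one extended resource,
# reading `kubectl describe node <node>` output from stdin.
resource_counts() {
    awk -v res="$1:" '
        $1 == "Capacity:"    { section = "capacity" }
        $1 == "Allocatable:" { section = "allocatable" }
        $1 == res            { print section "=" $2 }
    '
}
# Usage on a cluster (resource name as configured in this article):
#   kubectl describe node node1 | \
#       resource_counts "Mellanox.com/Mellanox_sriov_switchdev_MT27710_ens19f0_vf"
```

If the resource is missing from the output, the plugin's selectors most likely matched no devices on that node.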

############## Edit the ConfigMap. It selects which SR-IOV VF devices on the node are registered with the k8s cluster.
############## pfNames takes the device name as the system recognizes it; other selectors such as vendors also work, so there are several ways to configure this.
[root@node1 ~]# vim sriov-network-device-plugin/deployments/configMap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
        "resourceList": [{
                "resourcePrefix": "Mellanox.com",
                "resourceName": "Mellanox_sriov_switchdev_MT27710_ens19f0_vf",
                "selectors": {
                    "drivers": ["mlx5_core"],
                    "pfNames": ["ens19f0#0-7"]
                }
            }
        ]
    }
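As an illustration of the vendor-based alternative, the sketch below writes a config.json that selects VFs by PCI vendor and device ID instead of by PF name. The IDs `15b3` and `1016` are taken from the `lspci -nn` output earlier in this article; the file name and the resource name `cx4lx_vf_by_id` are hypothetical, and the resource prefix simply mirrors the one used above.

```shell
#!/usr/bin/env bash
# Write a hypothetical alternative config.json that selects VFs by PCI IDs.
# vendors/devices/drivers are selector keys of sriov-network-device-plugin.
cat > sriovdp-config-by-id.json <<'EOF'
{
    "resourceList": [{
            "resourcePrefix": "Mellanox.com",
            "resourceName": "cx4lx_vf_by_id",
            "selectors": {
                "vendors": ["15b3"],
                "devices": ["1016"],
                "drivers": ["mlx5_core"]
            }
        }
    ]
}
EOF
echo "wrote sriovdp-config-by-id.json"
```

Unlike `pfNames: ["ens19f0#0-7"]`, this selector would match every ConnectX-4 Lx VF on the node that is bound to mlx5_core, regardless of which PF it belongs to, so it suits nodes where all such VFs should form one pool.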


####### Deploy sriov-device-plugin
[root@node1 ~]# kubectl create -f deployments/configMap.yaml
[root@node1 ~]# kubectl create -f deployments/sriovdp-daemonset.yaml

###### Confirm that the SR-IOV pods are running
[root@node1 ~]# kubectl get po -A  -o wide | grep sriov
kube-system   kube-sriov-device-plugin-amd64-d7ctb              1/1     Running     0               6d5h    172.28.30.165    node3   <none>           <none>
kube-system   kube-sriov-device-plugin-amd64-h86dl              1/1     Running     0               6d5h    172.28.30.164    node2   <none>           <none>
kube-system   kube-sriov-device-plugin-amd64-rlpwb              1/1     Running     0               6d5h    172.28.30.163    node1   <none>           <none>
[root@node1 ~]# 

##### Use describe node to confirm that the VFs are registered on the node
[root@node1 ~]# kubectl describe  node node1 
---------
Capacity:
  cpu:                                                  128
  devices.kubevirt.io/kvm:                              1k
  devices.kubevirt.io/tun:                              1k
  devices.kubevirt.io/vhost-net:                        1k
  ephemeral-storage:                                    256374468Ki
  hugepages-1Gi:                                        120Gi
  Mellanox.com/Mellanox_sriov_switchdev_MT27710_ens19f0_vf:   8  ## registered
  memory:                                               527839304Ki
  pods:                                                 110
Allocatable:
  cpu:                                                  112
  devices.kubevirt.io/kvm:                              1k
  devices.kubevirt.io/tun:                              1k
  devices.kubevirt.io/vhost-net:                        1k
  ephemeral-storage:                                    236274709318
  hugepages-1Gi:                                        120Gi
  Mellanox.com/Mellanox_sriov_switchdev_MT27710_ens19f0_vf:   8  ## allocatable count

##### Build and install sriov-cni
[root@node1 ~]# git clone https://github.com/k8snetworkplumbingwg/sriov-cni.git
[root@node1 ~]# cd sriov-cni
[root@node1 ~]# make  ### build sriov-cni
[root@node1 ~]# cp build/sriov /opt/cni/bin/ # copy to every SR-IOV node, then set the permissions below on each of them
[root@node1 ~]# chmod 755 /opt/cni/bin/sriov

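What sriov-cni does:

First, once sriov-cni is deployed, an executable named sriov is present in /opt/cni/bin.

Then, when kubelet calls the multus-cni plugin, multus-cni invokes each plugin in its delegates array. That array contains the SR-IOV entry, so /opt/cni/bin/sriov is executed to build the container's network environment. The work it does is:

It finds the device by the SR-IOV device ID that kubelet allocated and moves it into the container's network namespace, then assigns an IP address to the device.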
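To make those two steps concrete, the sketch below prints (without running) iproute2 commands that have roughly the same effect when done by hand. It is an illustration only: sriov-cni performs this work programmatically rather than by calling `ip`, and the VF name, the named network namespace `demo-ns`, and the address are all hypothetical arguments.

```shell
#!/usr/bin/env bash
# Print, but do not execute, manual commands that roughly mirror the steps above.
# All arguments are illustrative; `ip -n` expects a named network namespace.
print_vf_attach_steps() {
    local vf="$1" netns="$2" ifname="$3" cidr="$4"
    echo "ip link set dev $vf netns $netns"             # move the VF into the namespace
    echo "ip -n $netns link set dev $vf name $ifname"   # rename it to the pod-side name
    echo "ip -n $netns addr add $cidr dev $ifname"      # assign the address chosen by IPAM
    echo "ip -n $netns link set dev $ifname up"         # bring the interface up
}

print_vf_attach_steps ens19f0v0 demo-ns net1 172.25.36.90/26
```

Printing instead of executing keeps the example safe to run anywhere; on a test host the printed lines could be run as root after creating the namespace with `ip netns add demo-ns`.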

For the installation steps, see https://www.jb51.net/server/325044x0h.htm

5. Using SR-IOV in a pod

[root@node1 ~]# vim sriov-attach.yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-attach
  annotations:
    k8s.v1.cni.cncf.io/resourceName: Mellanox.com/Mellanox_sriov_switchdev_MT27710_ens19f0_vf
spec:
  config: '{
  "cniVersion": "0.3.1",
  "name": "sriov-attach",
  "type": "sriov",
  "ipam": {
    "type": "calico-ipam",
    "range": "222.0.0.0/8"
  }
}'
[root@node1 ~]# kubectl apply -f  sriov-attach.yaml 
networkattachmentdefinition.k8s.cni.cncf.io/sriov-attach created
[root@node1 ~]# kubectl get net-attach-def
NAME           AGE
sriov-attach   12s
[root@node1 ~]# cat sriov-attach.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sriov
  labels:
    app: sriov-attach
spec:
  replicas: 1
  selector: 
    matchLabels:
      app: sriov-attach
  template: 
    metadata:
      annotations: 
        k8s.v1.cni.cncf.io/networks: sriov-attach
      labels:
        app: sriov-attach
    spec:
      containers:
      - name: sriov-attach
        image: docker.io/library/nginx:latest
        imagePullPolicy: IfNotPresent
        resources:
          requests:
            cpu: 1
            memory: 1Gi
            Mellanox.com/Mellanox_sriov_switchdev_MT27710_ens19f0_vf: '1'
          limits: 
            cpu: 1
            memory: 1Gi
            Mellanox.com/Mellanox_sriov_switchdev_MT27710_ens19f0_vf: '1'
[root@node1 ~]# 

##### Start a pod to test
[root@node1 ~]# kubectl apply -f sriov-attach.yaml 
deployment.apps/sriov created
[root@node1 ~]# kubectl get po -o wide 
NAME                     READY   STATUS    RESTARTS   AGE   IP             NODE    NOMINATED NODE   READINESS GATES
sriov-65c8f754f9-jlcd5   1/1     Running   0          6s    172.25.36.87   node1   <none>           <none>
######### Use describe pod to check the resource allocation
[root@node1 wzb]# kubectl describe po sriov-65c8f754f9-jlcd5
Name:         sriov-65c8f754f9-jlcd5
Namespace:    default
Priority:     0
Node:         node1/172.28.30.163
Start Time:   Wed, 28 Feb 2024 20:56:24 +0800
Labels:       app=sriov-attach
              pod-template-hash=65c8f754f9
Annotations:  cni.projectcalico.org/containerID: 21ec82394a00c893e5304577b59984441bd3adac82929b5f9b5538f988245bf5
              cni.projectcalico.org/podIP: 172.25.36.87/32
              cni.projectcalico.org/podIPs: 172.25.36.87/32
              k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "k8s-pod-network",
                    "ips": [
                        "172.25.36.87"
                    ],
                    "default": true,
                    "dns": {}
                },{
                    "name": "default/sriov-attach",
                    "interface": "net1",
                    "ips": [
                        "172.25.36.90"
                    ],
                    "mac": "f6:c2:e5:d1:7b:fa",
                    "dns": {},
                    "device-info": {
                        "type": "pci",
                        "version": "1.1.0",
                        "pci": {
                            "pci-address": "0000:41:00.6"
                        }
                    }
                }]
              k8s.v1.cni.cncf.io/networks: sriov-attach
Status:       Running
IP:           172.25.36.87
IPs:
  IP:           172.25.36.87
Controlled By:  ReplicaSet/sriov-65c8f754f9
Containers:
  nginx:
    Container ID:   containerd://6d5246c3e36a125ba60bad6af63f8bffe4710d78c2e14e6afb0d466c3f0f5d6e
    Image:          docker.io/library/nginx:latest
    Image ID:       sha256:12766a6745eea133de9fdcd03ff720fa971fdaf21113d4bc72b417c123b15619
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Wed, 28 Feb 2024 20:56:28 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                                                 1
      intel.com/intel_sriov_switchdev_MT27710_ens19f0_vf:  1
      memory:                                              1Gi
    Requests:
      cpu:                                                 1
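      intel.com/intel_sriov_switchdev_MT27710_ens19f0_vf:  1  ## the SR-IOV resource has been allocated to the pod (this output shows an intel.com/... resource name, unlike the Mellanox.com/... name in the manifest above)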
      memory:                                              1Gi
    Environment:                                           <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pnl9d (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  kube-api-access-pnl9d:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age   From               Message
  ----    ------          ----  ----               -------
  Normal  Scheduled       105s  default-scheduler  Successfully assigned default/sriov-65c8f754f9-jlcd5 to node1
  Normal  AddedInterface  103s  multus             Add eth0 [172.25.36.87/32] from k8s-pod-network
  Normal  AddedInterface  102s  multus             Add net1 [172.25.36.90/26] from default/sriov-attach  #### the SR-IOV NIC was added successfully
  Normal  Pulled          102s  kubelet            Container image "docker.io/library/nginx:latest" already present on machine
  Normal  Created         102s  kubelet            Created container nginx
  Normal  Started         102s  kubelet            Started container nginx
[root@node1 wzb]# 
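The `k8s.v1.cni.cncf.io/network-status` annotation in the output above records which VF the pod received. As a small sketch, the PCI address can be pulled out of that JSON with standard text tools; `pci_addresses` is a hypothetical helper name, and it reads the annotation text from stdin.

```shell
#!/usr/bin/env bash
# Extract every "pci-address" value from network-status JSON on stdin.
pci_addresses() {
    grep -o '"pci-address": *"[^"]*"' | sed 's/.*"\([^"]*\)"$/\1/'
}
# Usage on a cluster (pod name from this article):
#   kubectl get pod sriov-65c8f754f9-jlcd5 \
#     -o jsonpath='{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}' \
#     | pci_addresses
```

The address can then be matched against the `lspci -nn` listing from the environment preparation section to see which VF was handed out.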


#######################
Enter the pod's network namespace to confirm that the net1 interface is present and has obtained an address
[root@node1 ~]# crictl ps | grep sriov-attach
6d5246c3e36a1       12766a6745eea       4 minutes ago       Running             nginx                      0                   21ec82394a00c
[root@node1 ~]# crictl inspect  6d5246c3e36a1 | grep -i pid 
    "pid": 2775224,
            "pid": 1
            "type": "pid"
[root@node1 ~]# nsenter -t 2775224 -n bash 
[root@node1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ip_vti0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if30729: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default 
    link/ether ae:f9:85:03:13:2f brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.25.36.87/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::acf9:85ff:fe03:132f/64 scope link 
       valid_lft forever preferred_lft forever
21: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether f6:c2:e5:d1:7b:fa brd ff:ff:ff:ff:ff:ff
    inet 172.25.36.90/26 brd 172.25.36.127 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::f4c2:e5ff:fed1:7bfa/64 scope link 
       valid_lft forever preferred_lft forever
[root@node1 ~]# 
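The manual lookup above (find the container, grep for its pid, then nsenter) can be condensed. The sketch below assumes the `crictl inspect` JSON lists the host pid first, as it does in the output shown in this article; `container_pid` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Pull the first "pid" value out of `crictl inspect` JSON on stdin.
container_pid() {
    grep -m1 -o '"pid": *[0-9][0-9]*' | grep -o '[0-9][0-9]*$'
}
# Usage on a node (container ID taken from `crictl ps`):
#   pid="$(crictl inspect 6d5246c3e36a1 | container_pid)"
#   nsenter -t "$pid" -n ip -br addr show net1
```

Running a single command through `nsenter` rather than starting an interactive shell avoids accidentally leaving a terminal inside the pod's network namespace.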

