Calling GPUs from Docker and a Kubernetes Cluster

2024/10/12 11:09:29

References:
Installing the NVIDIA Container Toolkit
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
Using Docker with GPUs on Ubuntu
https://blog.csdn.net/dw14132124/article/details/140534628
https://www.cnblogs.com/li508q/p/18444582

  1. Check the environment
    OS environment
# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.4 LTS
Release:	22.04
Codename:	jammy
# cat /etc/redhat-release 
Rocky Linux release 9.3 (Blue Onyx)

Software environment

# kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.25.16
WARNING: version difference between client (1.30) and server (1.25) exceeds the supported minor version skew of +/-1
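The warning above is kubectl's skew check: the client's minor version may differ from the server's by at most 1. A minimal sketch of that rule, using the version strings from the output above:

```python
# kubectl/apiserver skew rule: client minor version must be within
# +/-1 of the server minor version.
def skew_ok(client: str, server: str, max_skew: int = 1) -> bool:
    """client/server are version strings like 'v1.30.2'."""
    minor = lambda v: int(v.lstrip("v").split(".")[1])
    return abs(minor(client) - minor(server)) <= max_skew

print(skew_ok("v1.30.2", "v1.25.16"))  # False: a skew of 5 exceeds +/-1
```

Everything in this post still worked despite the warning, but staying within the supported skew avoids surprises with newer API fields.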
  2. Install the NVIDIA Docker plugin
    Install it on the host that has GPU resources; this host will serve as a Node in the K8s cluster.
    Configure the package repository
# curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

(Optional) Configure the repository to use the experimental packages:

# sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list

This uncomments the experimental entries in the list file.
Update the package index

# sudo apt-get update

Install the toolkit

# sudo apt-get install -y nvidia-container-toolkit

Configure Docker to use the NVIDIA runtime

# sudo nvidia-ctk runtime configure --runtime=docker
INFO[0000] Loading config from /etc/docker/daemon.json  
INFO[0000] Wrote updated config to /etc/docker/daemon.json 
INFO[0000] It is recommended that docker daemon be restarted. 

This command edits /etc/docker/daemon.json and adds a `runtimes` entry:

# cat /etc/docker/daemon.json 
{
  "insecure-registries": ["192.168.3.61"],
  "registry-mirrors": [
    "https://7sl94zzz.mirror.aliyuncs.com",
    "https://hub.atomgit.com",
    "https://docker.awsl9527.cn"
  ],
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
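Conceptually, `nvidia-ctk runtime configure` performs a JSON merge into `daemon.json`: existing keys (mirrors, insecure registries) are preserved and a `runtimes.nvidia` entry is added. A rough sketch of that merge, not the tool's actual implementation:

```python
import json

def add_nvidia_runtime(daemon_json: str) -> str:
    """Merge the nvidia runtime entry into a Docker daemon.json string,
    keeping any existing settings intact."""
    cfg = json.loads(daemon_json) if daemon_json.strip() else {}
    runtimes = cfg.setdefault("runtimes", {})
    runtimes["nvidia"] = {"args": [], "path": "nvidia-container-runtime"}
    return json.dumps(cfg, indent=2)

print(add_nvidia_runtime('{"insecure-registries": ["192.168.3.61"]}'))
```

The important point is that the tool edits the file in place rather than replacing it, so site-specific settings survive.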

Restart Docker

# systemctl daemon-reload
# systemctl restart docker
  3. Use Docker to call the GPU
    Verify the configuration
    Start a container and check the GPU information
# docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Sat Oct 12 01:33:33 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
|  0%   53C    P2             59W /  450W |    4795MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The output shows the GPU's details, including model, temperature, power draw, and memory usage. This confirms that the Docker container can access the NVIDIA GPU and that the NVIDIA Container Toolkit is installed and configured correctly.
  4. Use a K8s cluster Pod to call the GPU
    Run the following on the K8s Master node.
    Install the K8s device plugin
    Deploy the latest version

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml

The yml file's contents are as follows

# cat nvidia-device-plugin.yml 
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.16.1
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

Deploying as a DaemonSet places the plugin on every node in the cluster.
Check the Pod logs

# kubectl logs -f nvidia-device-plugin-daemonset-8bltf -n kube-system
I1012 02:15:37.171056       1 main.go:199] Starting FS watcher.
I1012 02:15:37.171239       1 main.go:206] Starting OS watcher.
I1012 02:15:37.172177       1 main.go:221] Starting Plugins.
I1012 02:15:37.172236       1 main.go:278] Loading configuration.
I1012 02:15:37.173224       1 main.go:303] Updating config with default resource matching patterns.
I1012 02:15:37.173717       1 main.go:314] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": ["envvar"],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [{"pattern": "*", "name": "nvidia.com/gpu"}]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I1012 02:15:37.173760       1 main.go:317] Retrieving plugins.
E1012 02:15:37.174052       1 factory.go:87] Incompatible strategy detected auto
E1012 02:15:37.174086       1 factory.go:88] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E1012 02:15:37.174096       1 factory.go:89] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E1012 02:15:37.174104       1 factory.go:90] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E1012 02:15:37.174113       1 factory.go:91] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I1012 02:15:37.174123       1 main.go:346] No devices found. Waiting indefinitely.

The plugin failed to find a device, and the error messages state the possible causes clearly:

  1. The Node is not a GPU node, i.e. it has no GPU resources
  2. The Node has GPU resources, but Docker is not configured with the NVIDIA runtime
    A node without a GPU obviously cannot serve one, but a node that does have a GPU will also report this error when misconfigured.
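The two causes above can be written down as a tiny triage helper. This is purely illustrative; in practice the inputs come from `nvidia-smi` on the node and from `/etc/docker/daemon.json`:

```python
from typing import Optional

def triage(node_has_gpu: bool, default_runtime: Optional[str]) -> str:
    """Map the device plugin's 'Incompatible strategy detected' error
    to its likely cause on a given node."""
    if not node_has_gpu:
        return "not a GPU node: use a nodeSelector/toleration so the plugin skips it"
    if default_runtime != "nvidia":
        return "GPU node, but Docker's default-runtime is not nvidia: fix daemon.json"
    return "runtime configured: restart docker and re-check the plugin logs"

print(triage(True, None))
```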
    The fix for GPU nodes is to edit the Docker config file and add the following setting:
# cat /etc/docker/daemon.json
{
  "insecure-registries": ["192.168.3.61"],
  "registry-mirrors": [
    "https://7sl94zzz.mirror.aliyuncs.com",
    "https://hub.atomgit.com",
    "https://docker.awsl9527.cn"
  ],
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "/usr/bin/nvidia-container-runtime"
    }
  }
}

The key line is `"default-runtime": "nvidia"`, which makes the NVIDIA runtime Docker's default so the device plugin container can see the GPU.

Check the Pod logs again

# kubectl logs -f nvidia-device-plugin-daemonset-mp5ql -n kube-system
I1012 02:22:00.990246       1 main.go:199] Starting FS watcher.
I1012 02:22:00.990278       1 main.go:206] Starting OS watcher.
I1012 02:22:00.990373       1 main.go:221] Starting Plugins.
I1012 02:22:00.990382       1 main.go:278] Loading configuration.
I1012 02:22:00.990692       1 main.go:303] Updating config with default resource matching patterns.
I1012 02:22:00.990776       1 main.go:314] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": ["envvar"],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [{"pattern": "*", "name": "nvidia.com/gpu"}]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I1012 02:22:00.990780       1 main.go:317] Retrieving plugins.
I1012 02:22:01.010950       1 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
I1012 02:22:01.011281       1 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I1012 02:22:01.012376       1 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet

Check the GPU node's information

# kubectl describe node aiserver003087

The node description now lists `nvidia.com/gpu: 1` under Capacity and Allocatable.
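The same check can be scripted against `kubectl get node <name> -o json`; a minimal sketch (the sample below is a trimmed-down stand-in for the real API output):

```python
import json

def gpu_allocatable(node_json: str) -> int:
    """Return the number of allocatable nvidia.com/gpu devices on a node,
    given the output of `kubectl get node <name> -o json`."""
    node = json.loads(node_json)
    return int(node["status"]["allocatable"].get("nvidia.com/gpu", "0"))

# Trimmed-down sample of what the API returns for the GPU node:
sample = '{"status": {"allocatable": {"cpu": "32", "nvidia.com/gpu": "1"}}}'
print(gpu_allocatable(sample))  # 1
```

A node that returns 0 here will never be picked by the scheduler for a Pod that requests `nvidia.com/gpu`.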
Test GPU resource scheduling in K8s
Test Pod
# cat gpu_test.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: ffmpeg-pod
spec:
  nodeName: aiserver003087   # pin to the node that has a GPU
  containers:
  - name: ffmpeg-container
    # Public image for now; the Aliyun private registry had issues in this cluster
    image: nightseas/ffmpeg:latest
    command: [ "/bin/bash", "-ce", "tail -f /dev/null" ]
    resources:
      limits:
        nvidia.com/gpu: 1   # request 1 GPU
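One detail worth checking before applying a spec like this: `nvidia.com/gpu` limits must be positive whole numbers, because the stock device plugin hands out whole cards. A small validation sketch over a pod-spec dict (the dict shape mirrors the YAML above):

```python
def valid_gpu_limit(pod_spec: dict) -> bool:
    """nvidia.com/gpu must be a positive integer; fractional requests
    like 0.5 are rejected because whole devices are allocated."""
    for c in pod_spec.get("containers", []):
        limit = c.get("resources", {}).get("limits", {}).get("nvidia.com/gpu")
        if limit is None:
            continue  # container does not ask for a GPU
        value = float(limit)
        if value < 1 or value != int(value):
            return False
    return True

spec = {"containers": [{"name": "ffmpeg-container",
                        "resources": {"limits": {"nvidia.com/gpu": 1}}}]}
print(valid_gpu_limit(spec))  # True
```

To share one card between Pods you would need the plugin's time-slicing or MIG configuration instead of a fractional limit.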

Create the Pod

# kubectl apply -f gpu_test.yaml 
pod/ffmpeg-pod configured

Copy a video into the Pod for a transcode test

# kubectl cp test.mp4 ffmpeg-pod:/root

Enter the Pod

# kubectl exec -it ffmpeg-pod -- bash

Transcode the test video

# ffmpeg -hwaccel cuvid -c:v h264_cuvid -i test.mp4 -vf scale_npp=1280:720 -vcodec h264_nvenc out.mp4

If the conversion succeeds and out.mp4 is produced, the Pod has successfully used the GPU.
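The transcode command is easy to get wrong by hand, so it helps to build the argv programmatically. A minimal sketch that only constructs the command from the flags used above (actually running it requires the NVENC-enabled ffmpeg build inside the image):

```python
def nvenc_cmd(src: str, dst: str, width: int = 1280, height: int = 720) -> list:
    """Build the GPU transcode command used above: NVDEC decode
    (h264_cuvid), NPP scaling, NVENC encode (h264_nvenc)."""
    return ["ffmpeg", "-hwaccel", "cuvid", "-c:v", "h264_cuvid", "-i", src,
            "-vf", f"scale_npp={width}:{height}",
            "-vcodec", "h264_nvenc", dst]

print(" ".join(nvenc_cmd("test.mp4", "out.mp4")))
```

Passing this list to `subprocess.run` inside the Pod reproduces the test exactly.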

什么是 Nacos 官网中的概述:Nacos官网链接 Nacos /nɑ:kəʊs/ 是 Dynamic Naming and Configuration Service的首字母简称,一个更易于构建云原生应用的动态服务发现、配置管理和服务管理平台。 Nacos 致力于帮助您发现、配置和管理微服务。Nacos 提供了一组简单易用的特性集…