
Capping the GPU Memory a Process Can Use in PyTorch

2024-03-28 08:34:29 Author: 小锋学长生活大爆炸

Starting with version 1.8, PyTorch provides torch.cuda.set_per_process_memory_fraction(fraction, device), which lets you cap the share of a given GPU's memory that the current process may use. This article walks through how to use it.

Test code:

import torch

torch.cuda.empty_cache()

# Cap this process at 50% of the memory on GPU 0
torch.cuda.set_per_process_memory_fraction(0.5, device=0)

# Total memory on the device, in bytes
total_memory = torch.cuda.get_device_properties(0).total_memory
print("Total memory:", round(total_memory / (1024 * 1024), 1), "MB")

# Try allocations of different sizes
try:
    # Use 10% of the memory (int8 means one byte per element)
    tmp_tensor = torch.empty(int(total_memory * 0.1), dtype=torch.int8, device='cuda:0')
    print("Allocated memory:", round(torch.cuda.memory_allocated(0) / (1024 * 1024), 1), "MB")
    print("Reserved memory:", round(torch.cuda.memory_reserved(0) / (1024 * 1024), 1), "MB")
    # Free the tensor and return cached blocks to the driver
    del tmp_tensor
    torch.cuda.empty_cache()
    # Use 50% of the memory; this is expected to hit the cap
    torch.empty(int(total_memory * 0.5), dtype=torch.int8, device='cuda:0')
except RuntimeError as e:
    print("Error allocating tensor:", e)

# Print the current memory usage on GPU 0
print("Allocated memory:", torch.cuda.memory_allocated(0) / (1024 * 1024), "MB")
print("Reserved memory:", torch.cuda.memory_reserved(0) / (1024 * 1024), "MB")

The two figures in the output mean the following:

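Allocated memory: queried with torch.cuda.memory_allocated(device). It returns the total amount of GPU memory currently occupied by tensors, that is, memory that live Tensor objects are actually using.

Reserved memory: queried with torch.cuda.memory_reserved(device). It covers the allocated memory plus the extra blocks that PyTorch's CUDA caching allocator holds on to in order to speed up allocation and avoid repeated CUDA calls. The extra part does not store tensor data at the moment; it acts as a buffer that lets future allocation requests be served quickly.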
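The relationship between the two figures can be made concrete with a small helper. This is only a sketch: describe_memory is a hypothetical name that does not exist in PyTorch, and the try/except around the import is there only so the arithmetic can be tried on a machine without PyTorch or a GPU.

```python
try:
    import torch
except ImportError:  # the helper below is pure Python and works without PyTorch
    torch = None

MIB = 1024 * 1024


def describe_memory(allocated: int, reserved: int) -> dict:
    """Split reserved memory into the part tensors use and the idle cached part (MiB)."""
    return {
        "allocated_mib": allocated / MIB,
        "reserved_mib": reserved / MIB,
        # Held by the caching allocator but not backing any tensor right now
        "cached_idle_mib": (reserved - allocated) / MIB,
    }


if torch is not None and torch.cuda.is_available():
    x = torch.empty(1 * MIB, dtype=torch.int8, device="cuda:0")
    print(describe_memory(torch.cuda.memory_allocated(0),
                          torch.cuda.memory_reserved(0)))
```

A large cached_idle_mib value is usually not a leak; torch.cuda.empty_cache() hands those idle blocks back to the driver.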

Additional notes

Besides the method above, here are a few other ways of limiting GPU usage in PyTorch for reference.

Limiting memory usage

import torch

# Run all subsequent operations on GPU 3
torch.cuda.set_device(3)

# Cap memory usage on GPU 3 at 50% (no device argument means the current device)
desired_memory_fraction = 0.5  # 50% of the memory
torch.cuda.set_per_process_memory_fraction(desired_memory_fraction)

# Total memory on GPU 3, in bytes
total_memory = torch.cuda.get_device_properties(3).total_memory

# Allocate on GPU 3; "cuda" here resolves to the current device, GPU 3
tmp_tensor = torch.empty(int(total_memory * 0.4999), dtype=torch.int8, device="cuda")

# Read the memory currently allocated and compute what is left
allocated_memory = torch.cuda.memory_allocated()
available_memory = total_memory - allocated_memory

# Print the results
print(f"Total GPU Memory: {total_memory / (1024**3):.2f} GB")
print(f"Allocated GPU Memory: {allocated_memory / (1024**3):.2f} GB")
print(f"Available GPU Memory: {available_memory / (1024**3):.2f} GB")

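With 0.4999 the tensor takes up just under 50% of the memory, whereas changing 0.4999 to 0.5 triggers an out-of-memory error. Floating-point precision may play a part, but a more likely cause is that the caching allocator rounds each request up to a multiple of its block size, which pushes a request for exactly 50% over the limit.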
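One plausible explanation for the 0.4999 versus 0.5 difference is size rounding inside the caching allocator rather than floating-point error. The sketch below models that idea in plain Python. The 2 MiB rounding granularity for large requests is an assumption about allocator internals and may differ between PyTorch versions; the function names and the device size are made up for illustration.

```python
MIB = 1024 * 1024
ROUND_LARGE = 2 * MIB  # assumed rounding granularity for large requests


def rounded_request(nbytes: int) -> int:
    """Round a large request up to the next multiple of ROUND_LARGE."""
    return -(-nbytes // ROUND_LARGE) * ROUND_LARGE


def fits_under_cap(total_bytes: int, cap_fraction: float, request_fraction: float) -> bool:
    """Check whether the rounded request stays within the per-process cap."""
    cap = int(total_bytes * cap_fraction)
    request = rounded_request(int(total_bytes * request_fraction))
    return request <= cap


# Hypothetical device whose size is not a multiple of 4 MiB
total = 8 * 1024**3 + MIB
print(fits_under_cap(total, 0.5, 0.4999))  # request rounds up but stays under the cap
print(fits_under_cap(total, 0.5, 0.5))     # request rounds up past the cap
```

Under these assumptions a request for exactly half the device only fits when half the device size happens to be a multiple of the rounding granularity, so leaving a small margin below the cap is the safer choice.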

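The PyTorch function for limiting GPU memory and how to use it

Function signature

torch.cuda.set_per_process_memory_fraction(0.5, 0)

Parameter 1: fraction, the upper limit as a share of the device's total memory. For example, 0.5 means half of the GPU's memory; any float between 0 and 1 is accepted.

Parameter 2: device, the device index. For example, 0 means GPU 0.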
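Because the function takes a fraction rather than a byte count, it can help to derive the fraction from an absolute budget. The following is a minimal sketch; fraction_for_budget is a hypothetical helper, not part of PyTorch, and the 6 GiB budget is an arbitrary example value.

```python
try:
    import torch
except ImportError:  # the helper below is pure Python and works without PyTorch
    torch = None

GIB = 1024**3


def fraction_for_budget(budget_bytes: int, total_bytes: int) -> float:
    """Convert an absolute memory budget into a fraction of the device total."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    if not 0 <= budget_bytes <= total_bytes:
        raise ValueError("budget must lie between 0 and the device total")
    return budget_bytes / total_bytes  # true division always yields a float


if torch is not None and torch.cuda.is_available():
    total = torch.cuda.get_device_properties(0).total_memory
    budget = min(6 * GIB, total)  # never ask for more than the card has
    torch.cuda.set_per_process_memory_fraction(fraction_for_budget(budget, total), 0)
```

For example, fraction_for_budget(6 * GIB, 12 * GIB) gives 0.5, which matches the 12 GB card example used in this article.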

Usage example:

import torch
# Limit device 0 to a fraction of 0.5, i.e. half the card; on a 12 GB card, 0.5 means 6 GB.
torch.cuda.set_per_process_memory_fraction(0.5, 0)
torch.cuda.empty_cache()
# Total memory on the device, in bytes.
total_memory = torch.cuda.get_device_properties(0).total_memory
# Use 0.499 of the memory:
tmp_tensor = torch.empty(int(total_memory * 0.499), dtype=torch.int8, device='cuda')

# Release that memory:
del tmp_tensor
torch.cuda.empty_cache()

# The next line triggers an out-of-memory error because the request lands right on the limit:
torch.empty(total_memory // 2, dtype=torch.int8, device='cuda')

"""
It raises an error as follows: 
RuntimeError: CUDA out of memory. Tried to allocate 5.59 GiB (GPU 0; 11.17 GiB total capacity; 0 bytes already allocated; 10.91 GiB free; 5.59 GiB allowed; 0 bytes reserved in total by PyTorch)
"""
When the limit is exceeded, the error message contains one extra field compared with the unlimited case: "5.59 GiB allowed;".
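If you want to tell programmatically whether an out-of-memory error came from the per-process cap, one option is to look for that "allowed" field in the message. The sketch below assumes the message format quoted in this article; the wording may differ in other PyTorch versions, and parse_allowed_bytes is a hypothetical helper.

```python
import re
from typing import Optional

_UNITS = {"bytes": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def parse_allowed_bytes(message: str) -> Optional[float]:
    """Return the cap reported in an OOM message, or None if no cap is mentioned."""
    match = re.search(r"([\d.]+) (bytes|KiB|MiB|GiB) allowed", message)
    if match is None:
        return None
    return float(match.group(1)) * _UNITS[match.group(2)]


# Message copied from the example in this article
msg = ("CUDA out of memory. Tried to allocate 5.59 GiB (GPU 0; 11.17 GiB total "
       "capacity; 0 bytes already allocated; 10.91 GiB free; 5.59 GiB allowed; "
       "0 bytes reserved in total by PyTorch)")
print(parse_allowed_bytes(msg) / 1024**3, "GiB allowed")
```

A return value of None suggests the error was raised without a memory fraction in effect, at least for messages in this format.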

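Note:

The function limits memory per process, so other processes on the same GPU are not affected. In this respect it is similar to the GPU memory limit in TensorFlow.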
