Handling errors encountered while using PyTorch: the out-of-memory problem

Author: great-wind

This article walks through handling an out-of-memory error encountered while using PyTorch. It is intended as a practical reference; if anything here is wrong or incomplete, corrections are welcome.

Out of memory

While running inference with a model trained in PyTorch, the following error appeared:

RuntimeError: CUDA out of memory. Tried to allocate 416.00 MiB (GPU 0; 2.00 GiB total capacity; 1.32 GiB already allocated; 0 bytes free; 1.34 GiB reserved in total by PyTorch)

The message says that GPU 0 has 2.00 GiB of total capacity, 1.32 GiB has already been allocated, 0 bytes are free, and PyTorch has reserved 1.34 GiB in total.

Check the GPU's usage with the following command:

> nvidia-smi
Wed Jul 13 15:20:18 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 512.95       Driver Version: 512.95       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   39C    P0    N/A /  N/A |      0MiB /  2048MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Yet no running process is occupying GPU resources.
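The same numbers can also be read from Python. A minimal sketch, assuming a PyTorch build recent enough to provide torch.cuda.mem_get_info (it reports bytes, straight from the CUDA driver):

import torch

# Ask the CUDA driver for free and total memory on device 0, in bytes.
free, total = torch.cuda.mem_get_info(0)
print(f"free: {free / 2**20:.0f} MiB / total: {total / 2**20:.0f} MiB")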

Next, inspect memory usage with the torch package itself; the result is as follows:

> print(torch.cuda.memory.memory_summary())
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| GPU reserved memory   |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|===========================================================================|

The summary likewise shows that no memory is in use.
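The headline figures in this summary are also exposed as individual counters. A quick sketch using the standard allocator-accounting calls; note that they cover only the current process's caching allocator, which is consistent with the zeros above:

import torch

# Per-process CUDA caching-allocator counters, all in bytes.
print(torch.cuda.memory_allocated(0))      # held by live tensors
print(torch.cuda.memory_reserved(0))       # reserved by the caching allocator
print(torch.cuda.max_memory_allocated(0))  # peak since process start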

Running the code again still raised the same error. Could the code itself simply need more memory than the GPU provides? That seemed unlikely: this is pure inference code and should not need anywhere near this much memory. After some searching, the cause turned out to be gradient tracking. During inference, the main forward pass should be wrapped in torch.no_grad() so that autograd does not track gradients; gradient tracking keeps every intermediate activation alive and therefore consumes a large amount of memory.
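The effect is easy to reproduce. Below is a toy comparison; the model and tensor sizes are invented for illustration and have nothing to do with the original code:

import torch

# Hypothetical stack of linear layers, large enough that the retained
# activations are clearly visible in the allocator statistics.
model = torch.nn.Sequential(
    *[torch.nn.Linear(4096, 4096) for _ in range(8)]
).cuda()
x = torch.randn(1024, 4096, device="cuda")

y = model(x)  # autograd records the graph and keeps intermediates alive
print(f"with autograd: {torch.cuda.memory_allocated() / 2**20:.0f} MiB")

del y
torch.cuda.empty_cache()

with torch.no_grad():
    y = model(x)  # no graph: intermediates are freed as soon as consumed
print(f"with no_grad:  {torch.cuda.memory_allocated() / 2**20:.0f} MiB")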

Solution

Wrap the forward pass in torch.no_grad():

with torch.no_grad():
    outputs = model(samples)  # main inference code
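A self-contained version of the same pattern; the model and samples below are hypothetical stand-ins for the original code, which the article does not show:

import torch
import torch.nn as nn

# Hypothetical stand-ins for the article's model and input batch.
model = nn.Linear(128, 10).cuda()
samples = torch.randn(32, 128, device="cuda")

model.eval()           # inference behavior for dropout / batch norm layers
with torch.no_grad():  # do not record the autograd graph
    outputs = model(samples)
print(outputs.shape)

On PyTorch 1.9 and later, torch.inference_mode() can be used in place of torch.no_grad(); it enforces the same no-graph rule slightly more strictly and can be a bit faster.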

Summary

The above is personal experience. I hope it provides a useful reference, and I hope you will continue to support 脚本之家.
