pytorch_memlab

PyPI PyPI - 下载量

这是一个简单而精确的pytorch CUDA内存管理实验室，它包含了不同的内存相关部分：

功能：
- 内存分析器：一个类似line_profiler风格的CUDA内存分析器，具有简单的API。
- 内存报告器：用于检查占用CUDA内存的张量的报告器。
- 礼貌功能：一个有趣的功能，可以临时将所有CUDA张量移动到CPU内存中以释放资源，当然也包括反向传输。
- 通过%mlrun/%%mlrun行/单元魔法命令支持IPython。
目录

安装

已发布版本：

pip install pytorch_memlab

最新版本：

pip install git+https://github.com/stonesjtu/pytorch_memlab

用途

在pytorch中经常会遇到内存不足（OOM）错误，无论是新手还是经验丰富的程序员都会遇到。一个常见的原因是大多数人并不真正了解pytorch和GPU的底层内存管理原理。他们编写了内存效率低下的代码，然后抱怨pytorch消耗了太多的CUDA内存。

在这个仓库中，我将分享一些有用的工具，帮助调试OOM问题，或者为那些对底层机制感兴趣的人提供洞察。

用户文档

内存分析器

内存分析器是对Python的line_profiler的修改，它为指定函数/方法中的每行代码提供内存使用信息。

示例：

import torch
from pytorch_memlab import LineProfiler

def inner():
    torch.nn.Linear(100, 100).cuda()

def outer():
    linear = torch.nn.Linear(100, 100).cuda()
    linear2 = torch.nn.Linear(100, 100).cuda()
    linear3 = torch.nn.Linear(100, 100).cuda()

work()

脚本结束或被键盘中断后，如果你在Jupyter notebook中，它会给出以下分析信息：

或者如果你在纯文本终端中，会显示以下信息：

## outer

active_bytes reserved_bytes line  code
         all            all
        peak           peak
       0.00B          0.00B    7  def outer():
      40.00K          2.00M    8      linear = torch.nn.Linear(100, 100).cuda()
      80.00K          2.00M    9      linear2 = torch.nn.Linear(100, 100).cuda()
     120.00K          2.00M   10      inner()


## inner

active_bytes reserved_bytes line  code
         all            all
        peak           peak
      80.00K          2.00M    4  def inner():
     120.00K          2.00M    5      torch.nn.Linear(100, 100).cuda()

关于每列含义的解释可以在Torch文档中找到。memory_stats()中的任何字段名都可以传递给display()以查看相应的统计信息。

如果你使用profile装饰器，内存统计信息会在多次运行中收集，最后只显示最大值。我们还提供了一个更灵活的API，叫做profile_every，它可以每执行N次函数就打印一次内存信息。你可以简单地将@profile替换为@profile_every(1)来打印每次执行的内存使用情况。

@profile和@profile_every可以混合使用，以获得更精细的调试粒度控制。

你也可以在模块类中添加装饰器：

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
    @profile
    def forward(self, inp):
        #do_something

Line Profiler默认分析CUDA设备0的内存使用情况，你可能想通过set_target_gpu切换要分析的设备。GPU选择是全局的，这意味着你需要在整个过程中记住你正在分析哪个GPU：

import torch
from pytorch_memlab import profile, set_target_gpu
@profile
def func():
    net1 = torch.nn.Linear(1024, 1024).cuda(0)
    set_target_gpu(1)
    net2 = torch.nn.Linear(1024, 1024).cuda(1)
    set_target_gpu(0)
    net3 = torch.nn.Linear(1024, 1024).cuda(0)

func()

更多示例可以在test/test_line_profiler.py中找到。

IPython支持

确保你已安装IPython，或者通过pip install pytorch-memlab[ipython]安装了带IPython支持的pytorch-memlab。

首先，加载扩展：

%load_ext pytorch_memlab

这使得 %mlrun 和 %%mlrun 行/单元魔法命令可以使用。例如，在新的单元格中运行以下代码来分析整个单元格

%%mlrun -f func
import torch
from pytorch_memlab import profile, set_target_gpu
def func():
    net1 = torch.nn.Linear(1024, 1024).cuda(0)
    set_target_gpu(1)
    net2 = torch.nn.Linear(1024, 1024).cuda(1)
    set_target_gpu(0)
    net3 = torch.nn.Linear(1024, 1024).cuda(0)

或者你可以通过 %mlrun 单元魔法命令为单个语句调用分析器。

import torch
from pytorch_memlab import profile, set_target_gpu
def func(input_size):
    net1 = torch.nn.Linear(input_size, 1024).cuda(0)
%mlrun -f func func(2048)

查看 %mlrun? 以获取支持的参数帮助。你可以设置要分析的GPU设备，将分析结果保存到文件，并返回 LineProfiler 对象以进行分析后检查。

通过查看演示Jupyter笔记本了解更多信息。

内存报告器

由于内存分析器只提供按行的整体内存使用信息，可以通过内存报告器获得更低级别的内存使用信息。

内存报告器遍历所有 Tensor 对象并获取底层的 UntypedStorage（之前称为 Storage）对象，以获取实际的内存使用情况，而不是表面的 Tensor.size。

查看 UntypedStorage 获取详细信息

示例

最简单的示例：

import torch
from pytorch_memlab import MemReporter
linear = torch.nn.Linear(1024, 1024).cuda()
reporter = MemReporter()
reporter.report()

输出：

Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
Parameter0                                      (1024, 1024)     4.00M
Parameter1                                           (1024,)     4.00K
-------------------------------------------------------------------------------
Total Tensors: 1049600  Used Memory: 4.00M
The allocated memory on cuda:0: 4.00M
-------------------------------------------------------------------------------

你也可以传入一个模型对象以自动推断名称。

import torch
from pytorch_memlab import MemReporter

linear = torch.nn.Linear(1024, 1024).cuda()
inp = torch.Tensor(512, 1024).cuda()
# 传入模型以自动推断张量名称
reporter = MemReporter(linear)
out = linear(inp).mean()
print('========= 反向传播前 =========')
reporter.report()
out.backward()
print('========= 反向传播后 =========')
reporter.report()

输出：

========= 反向传播前 =========
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
weight                                          (1024, 1024)     4.00M
bias                                                 (1024,)     4.00K
Tensor0                                          (512, 1024)     2.00M
Tensor1                                                 (1,)   512.00B
-------------------------------------------------------------------------------
Total Tensors: 1573889  Used Memory: 6.00M
The allocated memory on cuda:0: 6.00M
-------------------------------------------------------------------------------
========= 反向传播后 =========
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
weight                                          (1024, 1024)     4.00M
weight.grad                                     (1024, 1024)     4.00M
bias                                                 (1024,)     4.00K
bias.grad                                            (1024,)     4.00K
Tensor0                                          (512, 1024)     2.00M
Tensor1                                                 (1,)   512.00B
-------------------------------------------------------------------------------
Total Tensors: 2623489  Used Memory: 10.01M
The allocated memory on cuda:0: 10.01M
-------------------------------------------------------------------------------

报告器会自动处理共享权重参数：

import torch
from pytorch_memlab import MemReporter

linear = torch.nn.Linear(1024, 1024).cuda()
linear2 = torch.nn.Linear(1024, 1024).cuda()
linear2.weight = linear.weight
container = torch.nn.Sequential(
    linear, linear2
)
inp = torch.Tensor(512, 1024).cuda()
# 传入模型以自动推断张量名称

out = container(inp).mean()
out.backward()

# verbose 显示存储如何在多个张量间共享
reporter = MemReporter(container)
reporter.report(verbose=True)

输出：

Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
0.weight                                        (1024, 1024)     4.00M
0.weight.grad                                   (1024, 1024)     4.00M
0.bias                                               (1024,)     4.00K
0.bias.grad                                          (1024,)     4.00K
1.bias                                               (1024,)     4.00K
1.bias.grad                                          (1024,)     4.00K
Tensor0                                          (512, 1024)     2.00M
Tensor1                                                 (1,)   512.00B
-------------------------------------------------------------------------------
Total Tensors: 2625537  Used Memory: 10.02M
The allocated memory on cuda:0: 10.02M
-------------------------------------------------------------------------------

你可以更好地理解更复杂模块的内存布局：

import torch
from pytorch_memlab import MemReporter

lstm = torch.nn.LSTM(1024, 1024).cuda()
reporter = MemReporter(lstm)
reporter.report(verbose=True)
inp = torch.Tensor(10, 10, 1024).cuda()
out, _ = lstm(inp)
out.mean().backward()
reporter.report(verbose=True)

如下所示，(->)表示重复使用相同的存储后端输出:

元素类型                                            大小  已用内存
-------------------------------------------------------------------------------
cuda:0上的存储
weight_ih_l0                                    (4096, 1024)    32.03M
weight_hh_l0(->weight_ih_l0)                    (4096, 1024)     0.00B
bias_ih_l0(->weight_ih_l0)                           (4096,)     0.00B
bias_hh_l0(->weight_ih_l0)                           (4096,)     0.00B
Tensor0                                       (10, 10, 1024)   400.00K
-------------------------------------------------------------------------------
总张量数: 8499200  已用内存: 32.42M
cuda:0上分配的内存: 32.52M
内存差异是由于矩阵对齐造成的
-------------------------------------------------------------------------------
元素类型                                            大小  已用内存
-------------------------------------------------------------------------------
cuda:0上的存储
weight_ih_l0                                    (4096, 1024)    32.03M
weight_ih_l0.grad                               (4096, 1024)    32.03M
weight_hh_l0(->weight_ih_l0)                    (4096, 1024)     0.00B
weight_hh_l0.grad(->weight_ih_l0.grad)          (4096, 1024)     0.00B
bias_ih_l0(->weight_ih_l0)                           (4096,)     0.00B
bias_ih_l0.grad(->weight_ih_l0.grad)                 (4096,)     0.00B
bias_hh_l0(->weight_ih_l0)                           (4096,)     0.00B
bias_hh_l0.grad(->weight_ih_l0.grad)                 (4096,)     0.00B
Tensor0                                       (10, 10, 1024)   400.00K
Tensor1                                       (10, 10, 1024)   400.00K
Tensor2                                        (1, 10, 1024)    40.00K
Tensor3                                        (1, 10, 1024)    40.00K
-------------------------------------------------------------------------------
总张量数: 17018880         已用内存: 64.92M
cuda:0上分配的内存: 65.11M
内存差异是由于矩阵对齐造成的
-------------------------------------------------------------------------------

注意:

当使用grad_mode=True进行前向传播时，PyTorch会在C层级维护用于未来反向传播的张量缓冲区。因此这些缓冲区不会被PyTorch管理或回收。但如果你将这些中间结果存储为Python变量，它们就会被报告出来。

你也可以通过传递额外参数来筛选要报告的设备: report(device=torch.device(0))
由于PyTorch的C端张量缓冲区导致的失败示例

在下面的例子中，在inp * (inp + 2)处创建了一个临时缓冲区来存储inp和inp + 2，不幸的是Python只知道inp的存在，所以我们丢失了2M内存，这与张量inp的大小相同。

import torch
from pytorch_memlab import MemReporter

linear = torch.nn.Linear(1024, 1024).cuda()
inp = torch.Tensor(512, 1024).cuda()
# 传入一个模型以自动推断张量名称
reporter = MemReporter(linear)
out = linear(inp * (inp + 2)).mean()
reporter.report()

输出:

元素类型                                            大小  已用内存
-------------------------------------------------------------------------------
cuda:0上的存储
weight                                          (1024, 1024)     4.00M
bias                                                 (1024,)     4.00K
Tensor0                                          (512, 1024)     2.00M
Tensor1                                                 (1,)   512.00B
-------------------------------------------------------------------------------
总张量数: 1573889  已用内存: 6.00M
cuda:0上分配的内存: 8.00M
内存差异是由于矩阵对齐或不可见的梯度缓冲区张量造成的
-------------------------------------------------------------------------------

礼让

有时人们想抢占你正在运行的任务，但你不想保存检查点然后再加载，实际上他们只需要GPU资源（通常在GPU集群中CPU资源和CPU内存总是有富余的），所以你可以将所有工作空间从GPU移到CPU，然后暂停你的任务，直到触发重启信号，而不是保存和加载检查点并从头开始引导。

仍在开发中.....但你可以尝试使用:

from pytorch_memlab import Courtesy

iamcourtesy = Courtesy()
for i in range(num_iteration):
    if something_happens:
        iamcourtesy.yield_memory()
        wait_for_restart_signal()
        iamcourtesy.restore()

已知问题

如上文Memory_Reporter中所述，中间张量没有被正确覆盖，所以你可能想在backward之后或forward之前插入这样的礼让逻辑。
目前PyTorch的CUDA上下文需要约1 GB的CUDA内存，这意味着即使所有张量都在CPU上，仍有1GB的CUDA内存被浪费，:-(。然而，我仍在调查是否可以完全销毁上下文然后重新初始化。