Multi-GPU Parallel Training in Deep Learning

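Multi-GPU training generally falls into two cases: a single machine with multiple GPUs, and multiple machines with multiple GPUs each. PyTorch implements the two differently, because the multi-machine case also requires configuring how the machines communicate (the communication protocol and related settings).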

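Single-machine multi-GPU training is easy to set up in PyTorch. The basic idea is this: suppose we read in one batch of data of size [16, 10, 5] and have 4 GPUs available. The computation then proceeds as follows: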

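PyTorch first copies the model to all 4 GPUs, so that every GPU holds the same parameters.
It then splits the batch into 4 chunks along the first dimension and sends them, in order, to the models on the 4 GPUs; each chunk has size [4, 10, 5].
Each GPU runs the forward pass on its own chunk.
Once the forward passes finish, PyTorch gathers the outputs from the 4 GPUs (say each is [4, 10, 5]), concatenates them in order into [16, 10, 5], and computes the loss.
In short, the whole process is: synchronize model parameters → run the forward passes separately → compute the loss → backpropagate the gradients.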
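
The steps above can be imitated on the CPU to make the shapes concrete. The following is only a sketch under stated assumptions: the toy nn.Linear model, the random target, and the MSE loss are hypothetical stand-ins, and a single shared model plays the role of the four replicas (the real DataParallel copies the model to each GPU and handles the scatter and gather itself).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

num_replicas = 4                    # stands in for the 4 GPUs
model = nn.Linear(5, 5)             # hypothetical toy model
batch = torch.randn(16, 10, 5)      # one batch of size [16, 10, 5]
target = torch.randn(16, 10, 5)     # hypothetical regression target

# Split the batch into 4 chunks along dim 0, each of size [4, 10, 5].
chunks = torch.chunk(batch, num_replicas, dim=0)
chunk_shapes = [tuple(c.shape) for c in chunks]

# Forward pass per chunk (on real hardware, one chunk per GPU).
outputs = [model(c) for c in chunks]

# Concatenate the outputs in order, back to [16, 10, 5].
gathered = torch.cat(outputs, dim=0)

# Compute the loss on the gathered output and backpropagate.
loss = nn.functional.mse_loss(gathered, target)
loss.backward()
```

Because the chunks are concatenated in their original order, the gathered output matches what a single forward pass over the whole batch would produce.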

PyTorch implementation

When the machine really does have multiple GPUs, the simplest approach is to wrap the model in torch.nn.DataParallel:

net = torch.nn.DataParallel(model)

By default, all available GPUs are then used.

If the machine has many GPUs (say eight) but we only want to use GPUs 0, 1, and 2, we can write:

net = torch.nn.DataParallel(model, device_ids=[0, 1, 2])
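
For context, here is a minimal sketch of how the wrapped model might be used end to end. The toy nn.Linear model and the random batch are hypothetical, and the device checks are an assumption added so the snippet also runs on a machine with fewer than three GPUs (with no GPU at all, DataParallel simply calls the underlying module).

```python
import torch
import torch.nn as nn

model = nn.Linear(5, 5)  # hypothetical toy model

# Fall back to the CPU when no GPU is present.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

if torch.cuda.device_count() >= 3:
    # Restrict DataParallel to GPUs 0, 1 and 2.
    net = nn.DataParallel(model, device_ids=[0, 1, 2])
else:
    # Use whatever devices are available.
    net = nn.DataParallel(model)

# The parameters must live on the first device in device_ids.
net = net.to(device)

batch = torch.randn(16, 10, 5, device=device)
out = net(batch)  # scatter, forward, and gather happen inside this call
```

The wrapped network is called exactly like the original model; the original module stays reachable as net.module.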

DistributedDataParallel

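The other approach is DistributedDataParallel. Although it is aimed mainly at distributed training, it can also be used for single-machine multi-GPU training. It takes a little more setup than DataParallel, but it generally trains faster, and the PyTorch documentation recommends it over DataParallel even on a single machine.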

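With this approach the script is started through the launcher, with a command of the form python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE YOUR_TRAINING_SCRIPT.py. This differs slightly from the usual way of running a script: it uses the torch.distributed.launch module and runs it with python -m, which amounts to having torch/distributed/launch.py run our YOUR_TRAINING_SCRIPT.py. The launcher starts nproc_per_node processes and passes some variables to each of them.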
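
To make concrete what the launcher hands to each process, the sketch below reads the local rank and the world size. The helper name read_launch_info is hypothetical. The snippet assumes the launcher's documented behavior: older versions pass --local_rank, PyTorch 2.0 and later pass --local-rank, and the environment variables LOCAL_RANK and WORLD_SIZE are set for every process.

```python
import argparse
import os


def read_launch_info(argv, environ):
    """Return (local_rank, world_size) from launcher arguments and environment.

    Hypothetical helper for illustration; it is not part of PyTorch.
    """
    parser = argparse.ArgumentParser()
    # Accept both spellings: "--local_rank" (older launchers) and
    # "--local-rank" (PyTorch 2.0 and later).
    parser.add_argument("--local_rank", "--local-rank", type=int, default=None)
    # Ignore any other arguments that belong to the training script.
    args, _ = parser.parse_known_args(argv)

    local_rank = args.local_rank
    if local_rank is None:
        # Fall back to the environment variable set by the launcher.
        local_rank = int(environ.get("LOCAL_RANK", 0))
    world_size = int(environ.get("WORLD_SIZE", 1))
    return local_rank, world_size


print(read_launch_info(["--local_rank=2"], {"WORLD_SIZE": "4"}))        # (2, 4)
print(read_launch_info([], {"LOCAL_RANK": "1", "WORLD_SIZE": "4"}))     # (1, 4)
print(read_launch_info([], {}))                                         # (0, 1)
```

In a real script one would call read_launch_info(sys.argv[1:], os.environ); passing both in explicitly just keeps the example easy to try without a launcher.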

Accordingly, our training code in YOUR_TRAINING_SCRIPT.py is written as follows (non-essential code is omitted; only the core is kept):

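import argparse
import os

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch passes this option to every process it starts, so we
# declare an argument to receive it; local_rank is the index of the GPU that the
# current process uses. "--local-rank" is accepted too, since PyTorch 2.0+ passes that form.
parser.add_argument("--local_rank", "--local-rank", type=int, default=0)
args = parser.parse_args()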
 
def synchronize():
    """
    Helper function to synchronize (barrier) among all processes when
    using distributed training
    """
    if not dist.is_available():
        return
    if not dist.is_initialized():
        return
    world_size = dist.get_world_size()
    if world_size == 1:
        return
    dist.barrier()
 
 
# WORLD_SIZE is set by torch.distributed.launch; its value is nproc_per_node * nnodes (the number of machines, 1 here)
num_gpus = int(os.environ["WORLD_SIZE"]) if "WORLD_SIZE" in os.environ else 1
 
is_distributed = num_gpus > 1
 
if is_distributed:
    torch.cuda.set_device(args.local_rank)  # pin each process to one fixed GPU
    torch.distributed.init_process_group(
        backend="nccl", init_method="env://"
    )
    synchronize()
 
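# Wrap the model in DistributedDataParallel; after this, training can start
# (the model is assumed to have been moved to GPU args.local_rank already)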
if is_distributed:
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[args.local_rank], output_device=args.local_rank,
        # this should be removed if we update BatchNorm stats
        broadcast_buffers=False,
    )
 
# Note: at test time, unwrap the model first with model = model.module
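
The unwrapping step mentioned in the last comment can be sketched as a small helper. The name unwrap_model is hypothetical, and DataParallel is used in the demonstration only because it can be constructed without initializing a process group; DistributedDataParallel exposes the wrapped model through the same .module attribute.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


def unwrap_model(model):
    """Return the underlying module if the model is wrapped for parallel training.

    Hypothetical helper for illustration; it is not part of PyTorch.
    """
    if isinstance(model, (nn.DataParallel, DistributedDataParallel)):
        return model.module
    return model


plain = nn.Linear(5, 5)            # hypothetical toy model
wrapped = nn.DataParallel(plain)   # stands in for the wrapped training model

eval_model = unwrap_model(wrapped)
eval_model.eval()
with torch.no_grad():
    out = eval_model(torch.randn(2, 5))
```

A helper like this lets the same evaluation code accept a model whether or not it was wrapped during training.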
