PyTorch multi-GPU training with DistributedDataParallel

In the previous article we introduced how to use multithreading in the data module to speed up model training. This article mainly introduces how to use PyTorch modules such as DistributedDataParallel and torch.multiprocessing for multi-GPU parallel training to improve training speed.

The following sections describe PyTorch's data-parallel processing (DataParallel), multi-GPU multi-process parallel processing (DistributedDataParallel), and how to adapt single-GPU code for multi-GPU training.

DataParallel (DP)

DataParallel parallelizes over the data and is relatively simple to use:

model = nn.DataParallel(model, device_ids=gpu_ids)

However, in practice the speedup is often not obvious, and there is a serious load imbalance. The main reason is that although the model processes the data in parallel across multiple GPUs, the outputs are gathered back to the first GPU to compute the loss, so the load on the first GPU is much higher than on the other GPUs.

"DataParallel is data parallel, but the gradient computation is gathered on the first GPU, which makes the load on the first GPU much higher than on the other graphics cards.

In the forward pass, the input data is split into several sub-batches (replica inputs) that are sent to different devices, and the model module is replicated on each device. In other words, the input batch is divided evenly among the devices, the model is copied to every device, and each replica only processes its own sub-batch; of course, the batch size should be larger than the number of GPUs. In the backward pass, the gradients from each replica are accumulated into the original module. In summary, DataParallel automatically splits the data and loads it onto the corresponding GPUs, copies the model to each GPU, runs the forward pass, and computes and gathers the gradients."

For a detailed analysis, see: https://zhuanlan.zhihu.com/p/102697821
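
As a concrete illustration, a minimal DP training step might look like the following sketch (the model, data, and GPU ids here are placeholders rather than code from this article):

import torch
import torch.nn as nn

gpu_ids = [0, 1]                             # GPUs to use
model = nn.Linear(128, 10).cuda(gpu_ids[0])
model = nn.DataParallel(model, device_ids=gpu_ids)
criterion = nn.CrossEntropyLoss()

x = torch.randn(64, 128).cuda(gpu_ids[0])    # the batch is split across the GPUs in forward()
y = torch.randint(0, 10, (64,)).cuda(gpu_ids[0])

out = model(x)              # replicas run on each GPU, outputs are gathered on gpu_ids[0]
loss = criterion(out, y)    # loss is computed on the first GPU, hence its extra load
loss.backward()             # gradients from each replica are reduced back to the original module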

DistributedDataParallel (DDP)

In practice, DP's GPU load is unbalanced, so it cannot make full use of multiple GPUs. The official recommendation is now to use the DDP parallel mode, torch.nn.parallel.DistributedDataParallel.

Unlike DP, which is single-process multi-threaded, DDP is implemented with multiple processes, one process per GPU. The parameter update scheme also differs: in DDP each process computes gradients independently, the gradients are reduced and averaged across processes, and every process then updates its own copy of the parameters; in DP the gradients are gathered on GPU 0, the parameters are updated there, and the updated parameters are then broadcast to the other GPUs. As a result, DDP is faster and avoids the multi-GPU load imbalance problem.

For the differences between DP and DDP, see: https://zhuanlan.zhihu.com/p/206467852
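
For intuition, the gradient averaging that DDP performs for you is roughly equivalent to the manual all-reduce below. This is only a conceptual sketch of what happens under the hood, not code you need to write when using DDP:

import torch.distributed as dist

def average_gradients(model, world_size):
    # After all_reduce every process holds the sum of all local gradients,
    # which is then divided by the number of processes to get the average.
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size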

The rest of this article looks, from a code perspective, at how to go from single-GPU training to multi-GPU training with DDP.

Single-GPU training logic:

def train(args, gpu_id, is_dist=False):
    # Create model
    model_builder = ModelBuilder()
    models, optimizers = model_builder.build_net(args, is_dist)
    # Create loss
    model_builder.build_loss()
    # Create data
    train_loader, test_loader = build_data(args, is_dist)

    steps = 0
    for epoch in range(start_epoch, max_epoch):
        for x_input, x_gt in train_loader:
            # forward
            model_builder.forward(x_input, x_gt)
            # build loss
            model_builder.get_loss()
            # compute loss
            model_builder.criterion(args)
            # backward
            model_builder.backward()
            steps += 1

Multi-GPU training logic:

import os
import torch
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.multiprocessing as mp

def train_worker(gpu_id, nprocs, cfg, is_dist):
    '''Multi-GPU distributed training; each worker runs in its own process.
    '''
    os.environ['NCCL_BLOCKING_WAIT'] = '1'
    os.environ['NCCL_ASYNC_ERROR_HANDLING'] = '1'
    cudnn.deterministic = True
    # benchmark mainly speeds things up when the input shape is fixed;
    # with dynamic shapes it can slow training down
    cudnn.benchmark = True
    dist.init_process_group(backend='nccl',
                            init_method='tcp://127.0.0.1:' + str(cfg['port']),
                            world_size=len(cfg['gpu_ids']),
                            rank=gpu_id)
    torch.cuda.set_device(gpu_id)
    # Split the global batch size evenly across the GPUs
    cfg['batch_size'] = int(cfg['batch_size'] / nprocs)
    train(cfg, gpu_id, is_dist)

def main():
    # gpu_nums = len(cfg['gpu_ids']): one worker process is spawned per GPU
    mp.spawn(train_worker, nprocs=gpu_nums, args=(gpu_nums, cfg, True))

In the build_net interface, if the is_dist argument passed in is True, DistributedDataParallel needs to be set up:

if is_dist:
    d_net = DistributedDataParallel(
        net, device_ids=[gpu_id], find_unused_parameters=find_unused_parameters)
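
For context, the model-wrapping step inside build_net might look roughly like the hypothetical wrap_net helper below (the helper name and body are assumptions; the key point is that the network is moved to the process's GPU before it is wrapped):

from torch.nn.parallel import DistributedDataParallel

def wrap_net(net, gpu_id, is_dist, find_unused_parameters=False):
    # Each DDP process owns exactly one GPU, so move the model there first
    net = net.cuda(gpu_id)
    if is_dist:
        # Gradients are then all-reduced across processes during backward()
        net = DistributedDataParallel(
            net,
            device_ids=[gpu_id],
            find_unused_parameters=find_unused_parameters)
    return net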

In the build_data interface, if is_dist is True, a DistributedSampler needs to be set:

from torch.utils.data.distributed import DistributedSampler

sampler = None
if is_dist:
    # The sampler automatically assigns a disjoint shard of the data to each GPU
    sampler = DistributedSampler(dataset)
# pin_memory=True: use page-locked memory to speed up host-to-GPU transfers
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=batch_size,
    shuffle=False,          # shuffling is handled by the DistributedSampler
    num_workers=num_workers,
    pin_memory=pin_memory,
    sampler=sampler,
    drop_last=True,
)
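
One detail worth noting: when the DistributedSampler shuffles the data (its default behaviour), the training loop should call sampler.set_epoch(epoch) at the start of every epoch so that each epoch uses a different shuffle order. A minimal sketch, reusing the loop variables from the earlier training code:

for epoch in range(start_epoch, max_epoch):
    if sampler is not None:
        # Re-seed the sampler so this epoch sees a different shuffle order
        sampler.set_epoch(epoch)
    for x_input, x_gt in train_loader:
        ...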

To summarize, the main changes required are:

1. Create multiple processes with mp.spawn

mp.spawn(train_worker)

2. Initialize the process group

In train_worker, set the GPU ids and configure the process group with dist.init_process_group.

3. Modify the model

Wrap the model with DistributedDataParallel when it is created.

4. Modify the data

Use a DistributedSampler when building the DataLoader.
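
Putting the four steps together, a self-contained toy example might look like the sketch below. The model, dataset, port, and hyperparameters are made up purely for illustration, and it assumes at least one (ideally several) CUDA GPUs:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def worker(gpu_id, world_size, port):
    # 1. + 2. one process per GPU, each joining the same process group
    dist.init_process_group(backend='nccl',
                            init_method='tcp://127.0.0.1:' + str(port),
                            world_size=world_size,
                            rank=gpu_id)
    torch.cuda.set_device(gpu_id)

    # 3. wrap the model with DistributedDataParallel
    model = nn.Linear(128, 10).cuda(gpu_id)
    model = DistributedDataParallel(model, device_ids=[gpu_id])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    # 4. shard the data across processes with DistributedSampler
    dataset = TensorDataset(torch.randn(1024, 128),
                            torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler, drop_last=True)

    for epoch in range(2):
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.cuda(gpu_id), y.cuda(gpu_id)
            loss = criterion(model(x), y)
            optimizer.zero_grad()
            loss.backward()   # DDP averages gradients across processes here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(worker, nprocs=world_size, args=(world_size, 29500))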
