PyTorch common code snippet collection

This article is a collection of common PyTorch code snippets covering five areas: basic configuration, tensor processing, model definition and operation, data processing, and model training and testing. It also gives several noteworthy tips.

The best source of information on PyTorch is the official documentation. This article collects common PyTorch code snippets, with some revisions made on the basis of reference [1] (Zhang Hao: PyTorch Cookbook), for easy reference.

1. Basic configuration

Import package and version query

import torch
import torch.nn as nn
import torchvision
print(torch.__version__)
print(torch.version.cuda)
print(torch.backends.cudnn.version())
print(torch.cuda.get_device_name(0))

Reproducibility

Full reproducibility cannot be guaranteed across different hardware (CPU, GPU), even with identical random seeds. On the same device, however, results should be reproducible. To achieve this, fix the random seeds of torch and numpy at the beginning of the program.

import numpy as np

np.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
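
The settings above can be bundled into one helper. The following is a minimal sketch; the function name seed_everything is hypothetical and not part of PyTorch, and seeding Python's own random module is an extra assumption beyond the text.

```python
import random

import numpy as np
import torch


def seed_everything(seed: int = 0) -> None:
    """Seed the Python, NumPy and PyTorch generators (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(0)
a = torch.rand(3)
seed_everything(0)
b = torch.rand(3)
print(torch.equal(a, b))  # the two draws match after re-seeding
```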

GPU settings

If you only need one GPU

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

If you need to specify multiple GPUs, for example GPUs 0 and 1:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

You can also select the GPUs when running the code from the command line:

CUDA_VISIBLE_DEVICES=0,1 python train.py

Clear GPU memory

torch.cuda.empty_cache()

You can also reset the GPU from the command line:

nvidia-smi --gpu-reset -i [gpu_id]

2. Tensor processing

Data type of tensor

PyTorch has 9 CPU tensor types and 9 GPU tensor types.

Tensor basic information

tensor = torch.randn(3,4,5)
print(tensor.type())  # data type
print(tensor.size())  # Shape of the tensor (torch.Size, a subclass of tuple)
print(tensor.dim())   # Number of dimensions

Named tensor

Named tensors let you refer to dimensions by name when indexing or in other operations, which improves readability and ease of use and helps prevent errors.

# Before PyTorch 1.3, you need to use comments
# Tensor[N, C, H, W]
images = torch.randn(32, 3, 56, 56)
images.sum(dim=1)
images.select(dim=1, index=0)

# After PyTorch 1.3
NCHW = ['N', 'C', 'H', 'W']
images = torch.randn(32, 3, 56, 56, names=NCHW)
images.sum('C')
images.select('C', index=0)
# You can also set it this way
tensor = torch.rand(3,4,1,2,names=('C', 'N', 'H', 'W'))
# align_to makes it easy to reorder dimensions by name
tensor = tensor.align_to('N', 'C', 'H', 'W')

Data type conversion

# Set the default tensor type; FloatTensor in PyTorch is much faster than DoubleTensor
# (newer PyTorch releases deprecate this call in favor of torch.set_default_dtype(torch.float32))
torch.set_default_tensor_type(torch.FloatTensor)

# Type conversion
tensor = tensor.cuda()
tensor = tensor.cpu()
tensor = tensor.float()
tensor = tensor.long()

torch.Tensor and np.ndarray conversion

Except for CharTensor, all CPU tensors support conversion to NumPy format and back.

ndarray = tensor.cpu().numpy()
tensor = torch.from_numpy(ndarray).float()
tensor = torch.from_numpy(ndarray.copy()).float() # If ndarray has negative stride.
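
A short sketch of why the copy() in the last line matters, using made-up sample data: torch.from_numpy shares memory with the source array and rejects arrays with negative strides.

```python
import numpy as np
import torch

ndarray = np.arange(6, dtype=np.float32)
shared = torch.from_numpy(ndarray)  # shares memory with ndarray
ndarray[0] = 100.0                  # the change is visible through the tensor

flipped = ndarray[::-1]             # a view with a negative stride
try:
    torch.from_numpy(flipped)
    negative_stride_ok = True
except ValueError:
    negative_stride_ok = False

copied = torch.from_numpy(flipped.copy())  # copying first makes the strides positive
```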

torch.Tensor and PIL.Image conversion

import numpy as np
import PIL.Image

# Tensors in PyTorch use the [N, C, H, W] order by default with values in [0, 1];
# a single [C, H, W] image must be permuted to [H, W, C] and rescaled to [0, 255]
# torch.Tensor -> PIL.Image
image = PIL.Image.fromarray(torch.clamp(tensor*255, min=0, max=255).byte().permute(1,2,0).cpu().numpy())
image = torchvision.transforms.functional.to_pil_image(tensor)  # Equivalent way

# PIL.Image -> torch.Tensor
path = r'./figure.jpg'
tensor = torch.from_numpy(np.asarray(PIL.Image.open(path))).permute(2,0,1).float() / 255
tensor = torchvision.transforms.functional.to_tensor(PIL.Image.open(path)) # Equivalent way

np.ndarray and PIL.Image conversion

image = PIL.Image.fromarray(ndarray.astype(np.uint8))

ndarray = np.asarray(PIL.Image.open(path))

Extract values from tensors that contain only one element

value = torch.rand(1).item()

Tensor reshaping

# When the output of a convolution layer is fed into a fully connected layer, the tensor usually needs to be reshaped.
# Compared with torch.view, torch.reshape automatically handles non-contiguous input tensors.
tensor = torch.rand(2,3,4)
shape = (6, 4)
tensor = torch.reshape(tensor, shape)
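
To make the difference concrete, here is a small sketch with a transposed (and therefore non-contiguous) tensor; the variable names are illustrative only.

```python
import torch

x = torch.arange(6).reshape(2, 3)
xt = x.t()  # transposed view of x, not contiguous in memory

try:
    xt.view(6)  # view needs a compatible memory layout
    view_ok = True
except RuntimeError:
    view_ok = False

flat = xt.reshape(6)  # reshape copies the data when a view is impossible
```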

Shuffle order

tensor = tensor[torch.randperm(tensor.size(0))]  # Shuffle along the first dimension

Flip horizontally

# PyTorch does not support negative-step slicing such as tensor[::-1]; a horizontal flip can be done through tensor indexing
# Suppose the dimensions of the tensor are [N, D, H, W]
tensor = tensor[:,:,:,torch.arange(tensor.size(3) - 1, -1, -1).long()]
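
As an alternative to index-based flipping, torch.flip reverses a tensor along the given dimensions. The sketch below checks on sample data that both approaches agree.

```python
import torch

tensor = torch.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)  # sample [N, D, H, W] data

# Index-based flip of the last (width) dimension, as in the snippet above
by_index = tensor[:, :, :, torch.arange(tensor.size(3) - 1, -1, -1).long()]

# Same result with torch.flip
by_flip = torch.flip(tensor, dims=[3])
```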

Copying tensors

# Operation                 |  New/Shared memory | Still in computation graph |
tensor.clone()            # |        New         |          Yes               |
tensor.detach()           # |      Shared        |          No                |
tensor.detach().clone()   # |        New         |          No                |
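
The table can be checked directly. This sketch compares memory addresses with data_ptr() and inspects requires_grad on a small example tensor.

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x * 2  # a tensor that is part of the computation graph

cloned = y.clone()           # new memory, still in the graph
detached = y.detach()        # shared memory, cut from the graph
both = y.detach().clone()    # new memory, cut from the graph

print(cloned.data_ptr() == y.data_ptr(), cloned.requires_grad)
print(detached.data_ptr() == y.data_ptr(), detached.requires_grad)
print(both.data_ptr() == y.data_ptr(), both.requires_grad)
```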

Tensor concatenation

'''
Note the difference between torch.cat and torch.stack: torch.cat concatenates along a given dimension,
while torch.stack adds a new dimension. For example, given three 10x5 tensors, torch.cat returns a 30x5 tensor,
while torch.stack returns a 3x10x5 tensor.
'''
tensor = torch.cat(list_of_tensors, dim=0)
tensor = torch.stack(list_of_tensors, dim=0)
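
A runnable version of the example in the note, with three 10x5 tensors created just for illustration:

```python
import torch

list_of_tensors = [torch.rand(10, 5) for _ in range(3)]

concatenated = torch.cat(list_of_tensors, dim=0)  # joins along an existing dimension
stacked = torch.stack(list_of_tensors, dim=0)     # creates a new leading dimension

print(concatenated.shape, stacked.shape)
```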

Convert integer labels to one-hot encoding

# PyTorch labels start at 0 by default
tensor = torch.tensor([0, 2, 1, 3])
N = tensor.size(0)
num_classes = 4
one_hot = torch.zeros(N, num_classes).long()
one_hot.scatter_(dim=1, index=torch.unsqueeze(tensor, dim=1), src=torch.ones(N, num_classes).long())
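
Recent PyTorch versions also provide torch.nn.functional.one_hot. The sketch below compares it with a scatter-based encoding on the same labels; it is an alternative, not the snippet's original method.

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([0, 2, 1, 3])
num_classes = 4

# Scatter-based encoding
via_scatter = torch.zeros(labels.size(0), num_classes).long()
via_scatter.scatter_(dim=1, index=labels.unsqueeze(1), value=1)

# Built-in helper, returns an int64 tensor
via_builtin = F.one_hot(labels, num_classes=num_classes)
```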

Get non-zero elements

torch.nonzero(tensor)               # index of non-zero elements
torch.nonzero(tensor==0)            # index of zero elements
torch.nonzero(tensor).size(0)       # number of non-zero elements
torch.nonzero(tensor == 0).size(0)  # number of zero elements

Check whether two tensors are equal

torch.allclose(tensor1, tensor2)  # float tensor
torch.equal(tensor1, tensor2)     # int tensor

Tensor expansion

# Expand tensor of shape 64*512 to shape 64*512*7*7.
tensor = torch.rand(64,512)
torch.reshape(tensor, (64, 512, 1, 1)).expand(64, 512, 7, 7)

Matrix multiplication

# Matrix multiplication: (m*n) * (n*p) -> (m*p).
result = torch.mm(tensor1, tensor2)

# Batch matrix multiplication: (b*m*n) * (b*n*p) -> (b*m*p)
result = torch.bmm(tensor1, tensor2)

# Element-wise multiplication.
result = tensor1 * tensor2

Calculate the Euclidean distance between two sets of data

Using broadcast mechanism

dist = torch.sqrt(torch.sum((X1[:,None,:] - X2) ** 2, dim=2))
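
For reference, torch.cdist computes the same pairwise distances. The sketch below, with random sample data whose shapes are assumptions, checks that the broadcast formula and torch.cdist agree.

```python
import torch

torch.manual_seed(0)
X1 = torch.rand(5, 3)  # 5 points with 3 features
X2 = torch.rand(7, 3)  # 7 points with 3 features

# X1[:, None, :] has shape (5, 1, 3) and broadcasts against X2 of shape (7, 3)
dist_broadcast = torch.sqrt(torch.sum((X1[:, None, :] - X2) ** 2, dim=2))
dist_cdist = torch.cdist(X1, X2, p=2)
```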

3. Model definition and operation

An example of a simple two-layer convolution network

# convolutional neural network (2 convolutional layers)
class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out


model = ConvNet(num_classes).to(device)

The calculation and display of convolution layer can be assisted by this website.

Bilinear pooling

X = torch.reshape(X, (N, D, H * W))                   # Assume X has shape N*D*H*W
X = torch.bmm(X, torch.transpose(X, 1, 2)) / (H * W)  # Bilinear pooling
assert X.size() == (N, D, D)
X = torch.reshape(X, (N, D * D))
X = torch.sign(X) * torch.sqrt(torch.abs(X) + 1e-5)   # Signed-sqrt normalization
X = torch.nn.functional.normalize(X)                  # L2 normalization

Multi-GPU synchronized BN (Batch normalization)

When torch.nn.DataParallel runs the code on multiple GPUs, PyTorch's BN layers by default compute the mean and standard deviation independently on each card. Synchronized BN uses the data on all cards to compute the mean and standard deviation of the BN layer together, which alleviates inaccurate estimates when the batch size is small. It is an effective trick for improving performance in tasks such as object detection.

sync_bn = torch.nn.SyncBatchNorm(num_features, eps=1e-05, momentum=0.1, affine=True, 
                                 track_running_stats=True)

Change all BN layers of the existing network to synchronous BN layer

def convertBNtoSyncBN(module, process_group=None):
    '''Recursively replace all BN layers with SyncBN layers.

    Args:
        module [torch.nn.Module]: Network
        process_group: Process group passed on to torch.nn.SyncBatchNorm
    '''
    if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
        sync_bn = torch.nn.SyncBatchNorm(module.num_features, module.eps, module.momentum,
                                         module.affine, module.track_running_stats, process_group)
        sync_bn.running_mean = module.running_mean
        sync_bn.running_var = module.running_var
        if module.affine:
            # Parameters must be wrapped in torch.nn.Parameter when assigned to a module
            sync_bn.weight = torch.nn.Parameter(module.weight.clone().detach())
            sync_bn.bias = torch.nn.Parameter(module.bias.clone().detach())
        return sync_bn
    else:
        for name, child_module in module.named_children():
            setattr(module, name, convertBNtoSyncBN(child_module, process_group=process_group))
        return module
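
PyTorch also provides a built-in conversion, torch.nn.SyncBatchNorm.convert_sync_batchnorm, that serves the same purpose as the hand-written function. A minimal sketch on a toy network (the network itself is an assumption for illustration; actually running SyncBatchNorm in training still requires a distributed setup):

```python
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.BatchNorm2d(8), nn.ReLU())

# Replaces every BatchNorm layer in the module tree with SyncBatchNorm
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
layer_types = [type(m).__name__ for m in sync_model.children()]
print(layer_types)
```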

Moving average similar to BN

To implement an operation similar to BN's moving average, update the moving average with an in-place operation in the forward function.

class BN(torch.nn.Module):
    def __init__(self):
        ...
        self.register_buffer('running_mean', torch.zeros(num_features))

    def forward(self, X):
        ...
        self.running_mean += momentum * (current - self.running_mean)
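
A complete, runnable variant of the skeleton above. The class name RunningMean, the mean-subtraction in forward and the default momentum are assumptions chosen for illustration.

```python
import torch


class RunningMean(torch.nn.Module):
    """Tracks a moving average of the batch mean in a buffer (illustrative sketch)."""

    def __init__(self, num_features, momentum=0.1):
        super().__init__()
        self.momentum = momentum
        # A buffer is saved in state_dict and moved by .to(device), but is not a parameter
        self.register_buffer('running_mean', torch.zeros(num_features))

    def forward(self, X):
        if self.training:
            current = X.detach().mean(dim=0)
            # In-place update keeps the same buffer object registered on the module
            self.running_mean += self.momentum * (current - self.running_mean)
        return X - self.running_mean


layer = RunningMean(num_features=4)
out = layer(torch.ones(8, 4))
```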

Count the total number of model parameters

num_parameters = sum(torch.numel(parameter) for parameter in model.parameters())

View parameters in the network

You can use model.state_dict() or model.named_parameters() to view all trainable parameters (including parameters in parent classes obtained through inheritance).

params = list(model.named_parameters())
(name, param) = params[28]
print(name)
print(param.grad)
print('-------------------------------------------------')
(name2, param2) = params[29]
print(name2)
print(param2.grad)
print('----------------------------------------------------')
(name1, param1) = params[30]
print(name1)
print(param1.grad)

Model visualization (using pytorchviz)

https://github.com/szagoruyko/pytorchviz

Print model information similar to Keras's model.summary(), using pytorch-summary

https://github.com/sksq96/pytorch-summary

Model weight initialization

Note the difference between model.modules() and model.children(): model.modules() iterates recursively through all sublayers of the model, while model.children() only traverses the layers directly under the model.

# Common practice for initialization.
for layer in model.modules():
    if isinstance(layer, torch.nn.Conv2d):
        torch.nn.init.kaiming_normal_(layer.weight, mode='fan_out',
                                      nonlinearity='relu')
        if layer.bias is not None:
            torch.nn.init.constant_(layer.bias, val=0.0)
    elif isinstance(layer, torch.nn.BatchNorm2d):
        torch.nn.init.constant_(layer.weight, val=1.0)
        torch.nn.init.constant_(layer.bias, val=0.0)
    elif isinstance(layer, torch.nn.Linear):
        torch.nn.init.xavier_normal_(layer.weight)
        if layer.bias is not None:
            torch.nn.init.constant_(layer.bias, val=0.0)

# Initialization with given tensor.
layer.weight = torch.nn.Parameter(tensor)
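
The difference between model.modules() and model.children() is easy to see on a small nested model (a toy network assumed here for illustration):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Sequential(nn.Conv2d(1, 4, kernel_size=3), nn.ReLU()),
    nn.Linear(4, 2),
)

children = [type(m).__name__ for m in model.children()]  # direct sublayers only
modules = [type(m).__name__ for m in model.modules()]    # the model itself and all nested layers
print(children)
print(modules)
```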

Extract a layer in the model

modules() returns an iterator over all modules in the model and can reach the innermost layers, such as self.layer1.conv1. The corresponding named_children() and named_modules() methods return the name of each network layer along with the module.

# Take the first two layers in the model
new_model = nn.Sequential(*list(model.children())[:2])
# If you want to extract all the convolution layers in the model, you can do the following:
conv_model = nn.Sequential()
for name, layer in model.named_modules():
    if isinstance(layer, nn.Conv2d):
        # Module names may not contain '.', so nested names such as 'layer1.0' are rewritten
        conv_model.add_module(name.replace('.', '_'), layer)

Use a pre-trained model for some layers

Note that if the saved model is a torch.nn.DataParallel, the current model also needs to be a torch.nn.DataParallel.

model.load_state_dict(torch.load('model.pth'), strict=False)

Load a model saved on GPU onto the CPU

model.load_state_dict(torch.load('model.pth', map_location='cpu'))

Import the matching part of another model into the new model

When importing parameters from another model, loading them directly raises an error if the structures of the two models differ. The following method imports only the part that the two models have in common.

# model_new stands for the new model
# model_saved stands for the other model's state dict, such as one loaded with torch.load
model_new_dict = model_new.state_dict()
model_common_dict = {k:v for k, v in model_saved.items() if k in model_new_dict.keys()}
model_new_dict.update(model_common_dict)
model_new.load_state_dict(model_new_dict)
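
A self-contained sketch of the same idea with two toy models (both models are assumptions for illustration). It additionally filters on tensor shape, which the snippet above does not do, so that same-named layers of different size are skipped.

```python
import torch
import torch.nn as nn

model_saved = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model_new = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2),
                          nn.ReLU(), nn.Linear(2, 1))

saved_dict = model_saved.state_dict()
new_dict = model_new.state_dict()

# Keep only entries whose name and shape both match the new model
common = {k: v for k, v in saved_dict.items()
          if k in new_dict and v.shape == new_dict[k].shape}
new_dict.update(common)
model_new.load_state_dict(new_dict)
```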

4. Data processing

Calculate the mean and standard deviation of the data set

import os
import cv2
import numpy as np
from torch.utils.data import Dataset
from PIL import Image


def compute_mean_and_std(dataset):
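    # Takes a PyTorch dataset yielding (PIL image, label) pairs and returns the per-channel
    # mean and standard deviation scaled to [0, 1]. Results are in channel order 0, 1, 2;
    # the b/g/r variable names assume BGR data, whereas PIL images are RGB.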
    mean_r = 0
    mean_g = 0
    mean_b = 0

    for img, _ in dataset:
        img = np.asarray(img) # change PIL Image to numpy array
        mean_b += np.mean(img[:, :, 0])
        mean_g += np.mean(img[:, :, 1])
        mean_r += np.mean(img[:, :, 2])

    mean_b /= len(dataset)
    mean_g /= len(dataset)
    mean_r /= len(dataset)

    diff_r = 0
    diff_g = 0
    diff_b = 0

    N = 0

    for img, _ in dataset:
        img = np.asarray(img)

        diff_b += np.sum(np.power(img[:, :, 0] - mean_b, 2))
        diff_g += np.sum(np.power(img[:, :, 1] - mean_g, 2))
        diff_r += np.sum(np.power(img[:, :, 2] - mean_r, 2))

        N += np.prod(img[:, :, 0].shape)

    std_b = np.sqrt(diff_b / N)
    std_g = np.sqrt(diff_g / N)
    std_r = np.sqrt(diff_r / N)

    mean = (mean_b.item() / 255.0, mean_g.item() / 255.0, mean_r.item() / 255.0)
    std = (std_b.item() / 255.0, std_g.item() / 255.0, std_r.item() / 255.0)
    return mean, std

Get basic information of video data

import cv2
video = cv2.VideoCapture(mp4_path)
height = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT))
width = int(video.get(cv2.CAP_PROP_FRAME_WIDTH))
num_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
fps = int(video.get(cv2.CAP_PROP_FPS))
video.release()

TSN samples one frame of video per segment

K = self._num_segments
if is_train:
    if num_frames > K:
        # Random index for each segment.
        frame_indices = torch.randint(
            high=num_frames // K, size=(K,), dtype=torch.long)
        frame_indices += num_frames // K * torch.arange(K)
    else:
        frame_indices = torch.randint(
            high=num_frames, size=(K - num_frames,), dtype=torch.long)
        frame_indices = torch.sort(torch.cat((
            torch.arange(num_frames), frame_indices)))[0]
else:
    if num_frames > K:
        # Middle index for each segment (integer division keeps the indices integral).
        frame_indices = num_frames // K // 2
        frame_indices += num_frames // K * torch.arange(K)
    else:
        frame_indices = torch.sort(torch.cat((                              
            torch.arange(num_frames), torch.arange(K - num_frames))))[0]
assert frame_indices.size() == (K,)
return [frame_indices[i] for i in range(K)]

Common training and validation data preprocessing

The ToTensor operation converts a PIL Image, or an np.ndarray of shape H×W×D with values in [0, 255], into a torch.Tensor of shape D×H×W with values in [0.0, 1.0].

train_transform = torchvision.transforms.Compose([
    torchvision.transforms.RandomResizedCrop(size=224,
                                             scale=(0.08, 1.0)),
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(mean=(0.485, 0.456, 0.406),
                                     std=(0.229, 0.224, 0.225)),
])
val_transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize(256),
    torchvision.transforms.CenterCrop(224),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(mean=(0.485, 0.456, 0.406),
                                     std=(0.229, 0.224, 0.225)),
])

5. Model training and testing

Classification model training code

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i+1) % 100 == 0:
            print('Epoch: [{}/{}], Step: [{}/{}], Loss: {}'
                  .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

Classification model test code

# Test the model
model.eval()  # eval mode(batch norm uses moving mean/variance 
              #instead of mini-batch mean/variance)
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Test accuracy of the model on the 10000 test images: {} %'
          .format(100 * correct / total))

Custom loss

Write your own loss by inheriting from the torch.nn.Module class.

class MyLoss(torch.nn.Module):
    def __init__(self):
        super(MyLoss, self).__init__()

    def forward(self, x, y):
        loss = torch.mean((x - y) ** 2)
        return loss

Label smoothing

Write a label_smoothing.py file, import it in the training code, and use LSR instead of the cross-entropy loss. The contents of label_smoothing.py are as follows:

import torch
import torch.nn as nn


class LSR(nn.Module):

    def __init__(self, e=0.1, reduction='mean'):
        super().__init__()

        self.log_softmax = nn.LogSoftmax(dim=1)
        self.e = e
        self.reduction = reduction

    def _one_hot(self, labels, classes, value=1):
        """
            Convert labels to one hot vectors

        Args:
            labels: torch tensor in format [label1, label2, label3, ...]
            classes: int, number of classes
            value: label value in one hot vector, default to 1

        Returns:
            return one hot format labels in shape [batchsize, classes]
        """

        one_hot = torch.zeros(labels.size(0), classes)

        #labels and value_added  size must match
        labels = labels.view(labels.size(0), -1)
        value_added = torch.Tensor(labels.size(0), 1).fill_(value)

        value_added = value_added.to(labels.device)
        one_hot = one_hot.to(labels.device)

        one_hot.scatter_add_(1, labels, value_added)

        return one_hot

    def _smooth_label(self, target, length, smooth_factor):
        """convert targets to one-hot format, and smooth
        them.
        Args:
            target: target in form with [label1, label2, label_batchsize]
            length: length of one-hot format(number of classes)
            smooth_factor: smooth factor for label smooth

        Returns:
            smoothed labels in one hot format
        """
        one_hot = self._one_hot(target, length, value=1 - smooth_factor)
        one_hot += smooth_factor / (length - 1)

        return one_hot.to(target.device)

    def forward(self, x, target):

        if x.size(0) != target.size(0):
            raise ValueError('Expected input batchsize ({}) to match target batch_size({})'
                    .format(x.size(0), target.size(0)))

        if x.dim() < 2:
            raise ValueError('Expected input tensor to have at least 2 dimensions (got {})'
                    .format(x.dim()))

        if x.dim() != 2:
            raise ValueError('Only 2 dimension tensor are implemented, (got {})'
                    .format(x.size()))


        smoothed_target = self._smooth_label(target, x.size(1), self.e)
        x = self.log_softmax(x)
        loss = torch.sum(- x * smoothed_target, dim=1)

        if self.reduction == 'none':
            return loss

        elif self.reduction == 'sum':
            return torch.sum(loss)

        elif self.reduction == 'mean':
            return torch.mean(loss)

        else:
            raise ValueError('unrecognized option, expect reduction to be one of none, mean, sum')

Or do label smoothing directly in the training file

for images, labels in train_loader:
    images, labels = images.cuda(), labels.cuda()
    N = labels.size(0)
    # C is the number of classes.
    smoothed_labels = torch.full(size=(N, C), fill_value=0.1 / (C - 1)).cuda()
    smoothed_labels.scatter_(dim=1, index=torch.unsqueeze(labels, dim=1), value=0.9)

    score = model(images)
    log_prob = torch.nn.functional.log_softmax(score, dim=1)
    loss = -torch.sum(log_prob * smoothed_labels) / N
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
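
Newer PyTorch releases also accept a label_smoothing argument in the cross-entropy loss. Note that the built-in variant spreads the smoothing mass over all C classes (eps / C each, including the true class), which differs slightly from the eps / (C - 1) convention used in the snippets above. The sketch below, on random sample data, reproduces the built-in value by hand.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, C, eps = 4, 5, 0.1
scores = torch.randn(N, C)
labels = torch.tensor([0, 2, 1, 4])

# Built-in label smoothing: target = (1 - eps) * one_hot + eps / C
builtin = F.cross_entropy(scores, labels, label_smoothing=eps)

# The same target distribution built by hand
target = torch.full((N, C), eps / C)
target.scatter_(1, labels.unsqueeze(1), 1 - eps + eps / C)
manual = -(target * F.log_softmax(scores, dim=1)).sum(dim=1).mean()
```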

Mixup training

beta_distribution = torch.distributions.beta.Beta(alpha, alpha)
for images, labels in train_loader:
    images, labels = images.cuda(), labels.cuda()

    # Mixup images and labels.
    lambda_ = beta_distribution.sample([]).item()
    index = torch.randperm(images.size(0)).cuda()
    mixed_images = lambda_ * images + (1 - lambda_) * images[index, :]
    label_a, label_b = labels, labels[index]

    # Mixup loss.
    scores = model(mixed_images)
    loss = (lambda_ * loss_function(scores, label_a)
            + (1 - lambda_) * loss_function(scores, label_b))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

L1 regularization

l1_regularization = torch.nn.L1Loss(reduction='sum')
loss = ...  # Standard cross-entropy loss
for param in model.parameters():
    loss += torch.sum(torch.abs(param))
loss.backward()

Do not apply weight decay to the bias terms

Weight decay in PyTorch is equivalent to L2 regularization.

bias_list = (param for name, param in model.named_parameters() if name[-4:] == 'bias')
others_list = (param for name, param in model.named_parameters() if name[-4:] != 'bias')
# Each parameter group must use the key 'params'
parameters = [{'params': bias_list, 'weight_decay': 0},
              {'params': others_list}]
optimizer = torch.optim.SGD(parameters, lr=1e-2, momentum=0.9, weight_decay=1e-4)
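
A runnable sketch of per-group weight decay on a toy model (the model is an assumption for illustration), showing that the group-level setting overrides the optimizer default:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

bias_params = [p for name, p in model.named_parameters() if name.endswith('bias')]
other_params = [p for name, p in model.named_parameters() if not name.endswith('bias')]

optimizer = torch.optim.SGD(
    [{'params': bias_params, 'weight_decay': 0.0},  # no decay for biases
     {'params': other_params}],                      # falls back to the default below
    lr=1e-2, momentum=0.9, weight_decay=1e-4)

decays = [group['weight_decay'] for group in optimizer.param_groups]
print(decays)
```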

Gradient clipping

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=20)
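
clip_grad_norm_ rescales all gradients in place so that their combined norm does not exceed max_norm, and returns the norm measured before clipping. A small sketch with a deliberately large loss (the model and data are made up for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
loss = (model(torch.randn(16, 10)) * 100).pow(2).mean()  # scaled up to get large gradients
loss.backward()

# Call between loss.backward() and optimizer.step()
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
```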

Get current learning rate

# If there is one global learning rate (which is the common case).
lr = next(iter(optimizer.param_groups))['lr']

# If there are multiple learning rates for different layers.
all_lr = []
for param_group in optimizer.param_groups:
    all_lr.append(param_group['lr'])

Alternatively, inside the training loop for a batch, the current learning rate is optimizer.param_groups[0]['lr'].

Learning rate decay

# Reduce learning rate when validation accuracy plateaus.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', patience=5, verbose=True)
for t in range(0, 80):
    train(...)
    val(...)
    scheduler.step(val_acc)

# Cosine annealing learning rate.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=80)
# Reduce learning rate by a factor of 10 at given epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 70], gamma=0.1)
for t in range(0, 80):
    train(...)
    val(...)
    scheduler.step()  # Since PyTorch 1.1.0, call this after the optimizer updates of the epoch

# Learning rate warmup by 10 epochs.
# (t + 1) / 10 keeps the first epoch at a non-zero learning rate when step() runs after training
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda t: (t + 1) / 10)
for t in range(0, 10):
    train(...)
    val(...)
    scheduler.step()
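
To see what a scheduler actually does, it helps to record the learning rate per epoch. The sketch below uses a dummy parameter and short milestones, both chosen only for illustration.

```python
import torch

params = [torch.nn.Parameter(torch.zeros(1))]  # dummy parameter standing in for a model
optimizer = torch.optim.SGD(params, lr=0.1)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[2, 4], gamma=0.1)

history = []
for epoch in range(6):
    history.append(optimizer.param_groups[0]['lr'])  # learning rate used in this epoch
    optimizer.step()    # stands in for the training of one epoch
    scheduler.step()    # update the learning rate for the next epoch

print(history)
```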

Chained scheduler updates

Starting with version 1.4, torch.optim.lr_scheduler supports chaining: users can define two schedulers and step them one after another during training, so that both act on the same learning rate.

import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import ExponentialLR, StepLR
model = [torch.nn.Parameter(torch.randn(2, 2, requires_grad=True))]
optimizer = SGD(model, 0.1)
scheduler1 = ExponentialLR(optimizer, gamma=0.9)
scheduler2 = StepLR(optimizer, step_size=3, gamma=0.1)
for epoch in range(4):
    print(epoch, scheduler2.get_last_lr()[0])
    optimizer.step()
    scheduler1.step()
    scheduler2.step()

Model training visualization

PyTorch can use tensorboard to visualize the training process.

Install and run TensorBoard.

pip install tensorboard
tensorboard --logdir=runs

Use the SummaryWriter class to collect the data to visualize. To keep things easy to browse, you can group scalars under different tags, such as 'Loss/train' and 'Loss/test'.

from torch.utils.tensorboard import SummaryWriter
import numpy as np

writer = SummaryWriter()

for n_iter in range(100):
    writer.add_scalar('Loss/train', np.random.random(), n_iter)
    writer.add_scalar('Loss/test', np.random.random(), n_iter)
    writer.add_scalar('Accuracy/train', np.random.random(), n_iter)
    writer.add_scalar('Accuracy/test', np.random.random(), n_iter)

Save and load checkpoints

Note that in order to resume training, we need to save the state of the model and the optimizer, as well as the current epoch number.

import os
import shutil

start_epoch = 0
best_acc = 0.0
# Load checkpoint.
if resume:  # resume is a flag: 0 for the first training run, 1 when resuming an interrupted run
    model_path = os.path.join('model', 'best_checkpoint.pth.tar')
    assert os.path.isfile(model_path)
    checkpoint = torch.load(model_path)
    best_acc = checkpoint['best_acc']
    start_epoch = checkpoint['epoch']
    model.load_state_dict(checkpoint['model'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    print('Load checkpoint at epoch {}.'.format(start_epoch))
    print('Best accuracy so far {}.'.format(best_acc))

# Train the model
for epoch in range(start_epoch, num_epochs): 
    ... 

    # Test the model
    ...

    # save checkpoint
    is_best = current_acc > best_acc
    best_acc = max(current_acc, best_acc)
    checkpoint = {
        'best_acc': best_acc,
        'epoch': epoch + 1,
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
    }
    model_path = os.path.join('model', 'checkpoint.pth.tar')
    best_model_path = os.path.join('model', 'best_checkpoint.pth.tar')
    torch.save(checkpoint, model_path)
    if is_best:
        shutil.copy(model_path, best_model_path)
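
A self-contained round trip through a temporary directory, as a sketch of the same save and restore pattern; the toy model, the stored values and the file location are assumptions for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
checkpoint = {
    'best_acc': 0.5,
    'epoch': 3,
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'checkpoint.pth.tar')
    torch.save(checkpoint, path)
    restored = torch.load(path, map_location='cpu')

# Restore into freshly created objects
new_model = nn.Linear(4, 2)
new_model.load_state_dict(restored['model'])
new_optimizer = torch.optim.SGD(new_model.parameters(), lr=0.01, momentum=0.9)
new_optimizer.load_state_dict(restored['optimizer'])  # also restores the saved learning rate
start_epoch = restored['epoch']
```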

Extract the convolutional features of one layer from an ImageNet pre-trained model

import collections

# VGG-16 relu5-3 feature.
model = torchvision.models.vgg16(pretrained=True).features[:-1]
# VGG-16 pool5 feature.
model = torchvision.models.vgg16(pretrained=True).features
# VGG-16 fc7 feature.
model = torchvision.models.vgg16(pretrained=True)
model.classifier = torch.nn.Sequential(*list(model.classifier.children())[:-3])
# ResNet GAP feature.
model = torchvision.models.resnet18(pretrained=True)
model = torch.nn.Sequential(collections.OrderedDict(
    list(model.named_children())[:-1]))

with torch.no_grad():
    model.eval()
    conv_representation = model(image)

Extract the convolutional features of several layers from an ImageNet pre-trained model

class FeatureExtractor(torch.nn.Module):
    """Helper class to extract several convolution features from the given
    pre-trained model.

    Attributes:
        _model, torch.nn.Module.
        _layers_to_extract, list<str> or set<str>

    Example:
        >>> model = torchvision.models.resnet152(pretrained=True)
        >>> model = torch.nn.Sequential(collections.OrderedDict(
                list(model.named_children())[:-1]))
        >>> conv_representation = FeatureExtractor(
                pretrained_model=model,
                layers_to_extract={'layer1', 'layer2', 'layer3', 'layer4'})(image)
    """
    def __init__(self, pretrained_model, layers_to_extract):
        torch.nn.Module.__init__(self)
        self._model = pretrained_model
        self._model.eval()
        self._layers_to_extract = set(layers_to_extract)

    def forward(self, x):
        with torch.no_grad():
            conv_representation = []
            for name, layer in self._model.named_children():
                x = layer(x)
                if name in self._layers_to_extract:
                    conv_representation.append(x)
            return conv_representation

Fine-tune the fully connected layer

model = torchvision.models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(512, 100)  # Replace the last fc layer
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4)

Fine-tune the fully connected layer with a larger learning rate and the convolution layers with a smaller learning rate

model = torchvision.models.resnet18(pretrained=True)
finetuned_parameters = list(map(id, model.fc.parameters()))
conv_parameters = (p for p in model.parameters() if id(p) not in finetuned_parameters)
parameters = [{'params': conv_parameters, 'lr': 1e-3}, 
              {'params': model.fc.parameters()}]
optimizer = torch.optim.SGD(parameters, lr=1e-2, momentum=0.9, weight_decay=1e-4)

6. Other precautions

Do not use overly large linear layers. nn.Linear(m, n) uses O(mn) memory, so a linear layer that is too large can easily exceed the available GPU memory.

Do not use RNNs on overly long sequences. RNN backpropagation uses the BPTT algorithm, whose memory requirement grows linearly with the length of the input sequence.

Before calling model(x), use model.train() and model.eval() to switch the network state.

Wrap code blocks that do not need gradients in with torch.no_grad().

The difference between model.eval() and torch.no_grad(): model.eval() switches the network to the test state; for example, BN and dropout use different calculation methods in the training and test stages. torch.no_grad() turns off PyTorch's automatic differentiation to reduce memory usage and speed up computation; loss.backward() cannot be called on results obtained this way.
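
The two mechanisms are independent, which a small sketch can show (the toy model with a dropout layer is an assumption for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(2, 4)

model.eval()                 # dropout is disabled, but autograd still records the graph
out_eval = model(x)
out_eval_again = model(x)

with torch.no_grad():        # autograd is off, the layer modes are unchanged
    out_no_grad = model(x)
```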

model.zero_grad() resets the gradients of all parameters of the whole model to zero, while optimizer.zero_grad() only zeroes the gradients of the parameters passed to the optimizer.

The input of torch.nn.CrossEntropyLoss does not need to go through Softmax. torch.nn.CrossEntropyLoss is equivalent to torch.nn.functional.log_softmax + torch.nn.NLLLoss.

Call optimizer.zero_grad() before loss.backward() to clear the accumulated gradients.
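
The stated equivalence between torch.nn.CrossEntropyLoss and log_softmax followed by torch.nn.NLLLoss can be checked numerically on random sample data:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(4, 5)            # raw, unnormalized model outputs
labels = torch.tensor([0, 2, 1, 4])

cross_entropy = torch.nn.CrossEntropyLoss()(scores, labels)
nll = torch.nn.NLLLoss()(F.log_softmax(scores, dim=1), labels)
```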

Try to set pin_memory=True in torch.utils.data.DataLoader; for very small datasets such as MNIST, pin_memory=False is faster. The best value of num_workers has to be found by experiment.

Delete unused intermediate variables promptly with del to save GPU memory.

Using inplace operation can save GPU storage, such as

x = torch.nn.functional.relu(x, inplace=True)

Reduce data transfer between CPU and GPU. For example, if you want the loss and accuracy of every mini-batch in an epoch, accumulate them on the GPU first and transfer them to the CPU together at the end of the epoch, which is faster than one GPU-to-CPU transfer per mini-batch.

Using half-precision floating-point numbers via half() gives some speedup; how much depends on the GPU model. Beware of stability problems caused by the lower numerical precision.

Use assert tensor.size() == (N, D, H, W) often as a debugging aid to make sure the tensor dimensions match your assumptions.

Except for the label y, use one-dimensional tensors as little as possible; use two-dimensional n*1 tensors instead to avoid unexpected results from one-dimensional tensor calculations.

Time each part of the code

with torch.autograd.profiler.profile(enabled=True, use_cuda=False) as profile:
    ...
print(profile)

# Or run on the command line:
python -m torch.utils.bottleneck main.py

Use TorchSnooper to debug PyTorch code. While the program runs, it automatically prints the shape, data type, device, and gradient information of the tensor produced by each line.

# pip install torchsnooper
import torchsnooper

# For functions, use the decorator
@torchsnooper.snoop()

# If it is not a function, wrap the code to debug in a with statement inside the training function
with torchsnooper.snoop():
    ...  # original code

References

  1. Zhang Hao: PyTorch Cookbook (collection of common code snippets), https://zhuanlan.zhihu.com/p/59205847?
  2. PyTorch official documentation and examples
  3. PyTorch FAQ, https://pytorch.org/docs/stable/notes/faq.html
  4. pytorchviz, https://github.com/szagoruyko/pytorchviz
  5. pytorch-summary, https://github.com/sksq96/pytorch-summary
  6. Other


Added by Lefu on Mon, 14 Feb 2022 03:48:00 +0200