pytorch_lightning notes

Original address: pytorch_lightning whole journey notes - Zhihu

preface

This article will be updated continuously. Once my algorithm finishes training, I will write another post recording my experience of using pytorch_lightning for reinforcement learning.

A lot has already been written about pytorch_lightning (pl). In a word, this framework is genuinely pleasant to use: installation and converting existing pytorch code to pytorch_lightning are both easy. The real question is how to use it well. Many articles and blogs introduce pl; here I mainly record the practical tips I applied over the past few days.

For individuals, why use pytorch and pytorch_lightning:

I originally used tensorflow, but since the whole laboratory switched to pytorch, I switched as well so that code could circulate within the lab. Because tensorflow builds static graphs while pytorch is a dynamic-graph implementation, the transition was quick. Pytorch has its shortcomings, though. For example, if you want half-precision training, synchronized BatchNorm parameters, or single-machine multi-GPU training, you have to set up apex, and installing apex is a real pain; in my experience it throws all kinds of errors, and even after installation the errors keep coming. pl is different: all of this is already built in, and you only need to set a few parameters. In addition, for the model I am training, four-GPU training is roughly three times faster and the training results (image generation) are noticeably better. It really is a pleasure to use. Another feature is that all your hyperparameters are saved inside the model: if you do a lot of parameter tuning you no longer need to annotate the hyperparameters of every trained model, and when restoring a model the hyperparameters are restored directly, which greatly reduces both code and workload.

environment

OS (e.g., Linux): Ubuntu 18.04
python: 3.6.2
pytorch: 1.7
pytorch_lightning: 1.0.7
cudatoolkit version: 10.0.130
GPU models and configuration: 2080Ti 10GB x 4

pl process

The pl workflow is very simple: the pipeline has a fixed order:

initialization def __init__(self) --> training training_step(self, batch, batch_idx) --> validation validation_step(self, batch, batch_idx) --> test test_step(self, batch, batch_idx). That's it; the key point is to override these three functions.

Of course, besides these three main functions there are others. To make it easier to implement extra functionality, the fuller picture is that training_step, validation_step, and test_step are each followed by corresponding training_step_end(self, batch_parts) and training_epoch_end(self, training_step_outputs) functions; validation and test naturally have their own *_step_end and *_epoch_end functions as well. Because the validation and test versions work the same way, only training is used as the example here.

*_step_end  -- called after each individual step finishes.

*_epoch_end  -- called automatically after each epoch of the corresponding phase finishes.

def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.model(x)
    loss = F.cross_entropy(y_hat, y)
    pred = ...
    return {'loss': loss, 'pred': pred}

def training_step_end(self, batch_parts):
    '''
    When gpus=0 or 1, batch_parts is the return value of training_step (verified).
    When gpus>1, batch_parts is a list; each element is a return value of training_step,
    and batch_parts[i] comes from GPU i (not verified here).
    '''
    gpu_0_prediction = batch_parts[0]['pred']
    gpu_1_prediction = batch_parts[1]['pred']

    # do something with both outputs
    return (batch_parts[0]['loss'] + batch_parts[1]['loss']) / 2

def training_epoch_end(self, training_step_outputs):
    '''
    When gpus=0 or 1, training_step_outputs is a list whose length equals the number of training steps
    (validation steps excluded; the step count shown during training also counts validation steps, so if
    you set limit_val_batches=0. to turn validation off, the displayed step count will match the length
    of training_step_outputs). Each element of the list is a dictionary holding the key/value pairs
    returned by training_step_end(); the keys are the variable names returned by training_step(), and
    each value also records which device (which GPU) it lives on, e.g. device='cuda:0'.
    '''
    for out in training_step_outputs:
        ...  # do something with each step's output, e.g. its preds

Train

Training mainly means overriding the def training_step(self, batch, batch_idx) function and returning the loss to be backpropagated, where batch is a batch of data sampled from train_dataloader and batch_idx is the index of the current batch.

def training_step(self, batch, batch_idx):
    image, label = batch
    pred = self.forward(image)
    loss = ...
    # Be sure to return the loss
    return loss

Validation

Set the validation frequency

Validate every n training epochs

By default validation runs after every epoch, i.e. the validation_step() function is called automatically:

trainer = Trainer(check_val_every_n_epoch=1)

Validation frequency within a single epoch

When an epoch is very large, you may need to validate several times within a single epoch. In that case change the validation frequency by passing val_check_interval: a float means a fraction of the epoch, an int means a number of batches:

# Run validation once every 25% of a single training epoch. Note: a float must be passed in
trainer = Trainer(val_check_interval=0.25)
# Alternatively, run validation every N training batches within an epoch; this must be an int
trainer = Trainer(val_check_interval=100) # validate every 100 batches

Validation works like training: override def validation_step(self, batch, batch_idx); it does not need to return a value:

def validation_step(self, batch, batch_idx):
    image, label = batch
    pred = self.forward(image)
    loss = ...
    # Log the loss so it can be monitored when saving the model
    self.log('val_loss', loss)

test

In the pytorch_lightning framework, test is not called during training at all; only training and validation run while training. Therefore, if you need to save some evaluation information during training, put it into validation.

Testing happens after training is finished, so the following assumes training has already completed:

# Load the model, restoring both the weights and the hyperparameters
model = MODEL.load_from_checkpoint(checkpoint_path='my_model_path/heiheihei.ckpt')
# Modify any parameters needed for testing, e.g. the number of prediction steps
model.pred_step = 1000
# Define a trainer; limit_test_batches=0.05 means only 5% of the test set is used
trainer = pl.Trainer(gpus=1, precision=16, limit_test_batches=0.05)
# Test; test_step() is called automatically. dm is the datamodule, described below
trainer.test(model=model, datamodule=dm)

data set

Data sets can be implemented in two ways:

Of course, you should first have a Dataset definition. You can use existing datasets such as MNIST; if you use your own data, you need to subclass torch.utils.data.Dataset and write a custom class. This part is not covered in detail here (a minimal sketch follows); see other materials.
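For reference, a minimal hedged sketch of such a custom Dataset; the file list and the load_sample() helper are hypothetical placeholders for your own data and loading logic, not part of pl or of the original code in this article:

from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, file_list, transform=None):
        # file_list and load_sample() below are placeholders for your own data handling
        self.file_list = file_list
        self.transform = transform

    def __len__(self):
        return len(self.file_list)

    def __getitem__(self, idx):
        image, label = self.load_sample(self.file_list[idx])  # replace with your own loading code
        if self.transform is not None:
            image = self.transform(image)
        return image, label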

Direct implementation

Direct implementation means overriding def train_dataloader(self) (and the corresponding val/test functions) in the Model so that they return the dataloaders:

class ExampleModel(pl.LightningModule):
    def __init__(self, args):
        super().__init__()
        self.train_dataset = ...
        self.val_dataset = ...
        self.test_dataset = ...
        ...
    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=False, num_workers=0)

    def val_dataloader(self):
        return DataLoader(self.val_dataset, batch_size=self.batch_size, shuffle=False)
    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=1, shuffle=True)

That completes the dataset and dataloader code. Note that you must first write the Dataset yourself or use an existing public dataset.

Custom DataModule

This method inherits pl.LightningDataModule to provide training, verification and test data.

class MyDataModule(pl.LightningDataModule):
    def __init__(self):
        super().__init__()
        ...blablabla...
    def setup(self, stage):
        # Define the datasets here; every GPU process executes this function, and stage marks which phase the data is for
        if stage == 'fit' or stage is None:
            self.train_dataset = DCKDataset(self.train_file_path, self.train_file_num, transform=None)
            self.val_dataset = DCKDataset(self.val_file_path, self.val_file_num, transform=None)
        if stage == 'test' or stage is None:
            self.test_dataset = DCKDataset(self.test_file_path, self.test_file_num, transform=None)
    def prepare_data(self):
        # Dataset downloading is usually implemented here; only one process (cuda:0) executes this function
        pass
    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=False, num_workers=0)

    def val_dataloader(self):
        return DataLoader(self.val_dataset, batch_size=self.batch_size, shuffle=False)

    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=1, shuffle=True)

use

dm = MyDataModule(args)
if not is_predict:  # train
    # Define the callback that saves the model; see the details later in this article
    checkpoint_callback = ModelCheckpoint(monitor='val_loss')
    # Define model
    model = MyModel()
    # Define logger
    logger = TensorBoardLogger('log_dir', name='test_PL')
    # Set the datamodule up for the fit (training/validation) stage
    dm.setup('fit')
    # Define trainer
    trainer = pl.Trainer(gpus=gpu, logger=logger, callbacks=[checkpoint_callback])
    # Start training
    trainer.fit(model, datamodule=dm)
else:
    # Test phase
    dm.setup('test')
    # Restore the model
    model = MyModel.load_from_checkpoint(checkpoint_path='trained_model.ckpt')
    # Define the trainer and test it
    trainer = pl.Trainer(gpus=1, precision=16, limit_test_batches=0.05)
    trainer.test(model=model, datamodule=dm)

Model saving and recovery

Model saving

https://pytorch-lightning.readthedocs.io/en/latest/weights_loading.html#weights-loading

https://pytorch-lightning.readthedocs.io/en/latest/api/pytorch_lightning.callbacks.model_checkpoint.html?highlight=save

Auto save

Lightning automatically saves a checkpoint of the most recently trained epoch to the current working directory (os.getcwd()), or you can specify the location when defining the Trainer:

trainer = Trainer(default_root_dir='/your/path/to/save/checkpoints')

Of course, you can also turn off auto save model:

trainer = Trainer(checkpoint_callback=False)

ModelCheckpoint (callbacks)

Under automatic saving, you can also customize the quantity to be monitored to save the model. The steps are as follows:

  1. Compute the quantity to be monitored, e.g. the validation loss
  2. Use the log() function to mark the quantity to be monitored
  3. Initialize the ModelCheckpoint callback and set the quantity to be monitored, as described in detail below
  4. Pass it back to the Trainer

Step example code:

from pytorch_lightning.callbacks import ModelCheckpoint

class LitAutoEncoder(pl.LightningModule):
    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.backbone(x)

        # 1. Calculate the quantity to be monitored
        loss = F.cross_entropy(y_hat, y)

        # 2. Use the log() function to mark the quantity to be monitored; its name is 'val_loss'
        self.log('val_loss', loss)

# 3. Initialize the 'ModelCheckpoint' callback and set the quantity to be monitored
checkpoint_callback = ModelCheckpoint(monitor='val_loss')

# 4. Put this callback into the list of other callbacks
trainer = Trainer(callbacks=[checkpoint_callback])

CLASS pytorch_lightning.callbacks.model_checkpoint.ModelCheckpoint(filepath=None, monitor=None, verbose=False, save_last=None, save_top_k=None, save_weights_only=False, mode='auto', period=1, prefix='', dirpath=None, filename=None)

Parameter description (all parameters are optional):

filepath -- deprecated and will be removed in a later version; path of the saved model file. Two other parameters below (dirpath and filename) replace it.
monitor -- the quantity to monitor, string type, e.g. 'val_loss' (marked in training_step() or validation_step() via self.log('val_loss', loss)). Defaults to None, which saves only the parameters of the last epoch (my understanding: only the last epoch's checkpoint is kept; a checkpoint is written after every epoch and overwrites the previous one).
verbose -- verbose mode, False by default.
save_last -- bool type, default None; when True, a copy of the latest epoch is always saved as last.ckpt, overwritten each time so only one such file is kept.
save_top_k -- int type; when save_top_k == k, the k best models according to the monitored quantity are kept. Whether "best" means the largest or the smallest monitored value is controlled by the mode parameter below. save_top_k == 0 means do not save; save_top_k == -1 saves every model without overwriting. If save_top_k >= 2 and the save routine is called several times within a single epoch, a version number starting at v0 is appended to the model file name.
mode -- string type, one of {'auto', 'min', 'max'}; when save_top_k != 0 this decides how models are overwritten when saving. If the monitored quantity is something like val_loss, where smaller means a better model, set 'min'; if it is something like val_acc (validation accuracy), where larger is better, set 'max'. 'auto' guesses from the monitor's name (this is my personal understanding and may be wrong; for example, if you happen to use a name like val_loss for an accuracy-style metric, the saved model could end up being the worst one).
save_weights_only -- bool type; True saves only the model weights (model.save_weights(filepath)), otherwise the whole model is saved. Saving only the weights is recommended; saving the whole model costs more time and storage.
period -- int type; interval between checkpoints, in epochs, i.e. save automatically every N epochs.
prefix -- string type; prefix for the saved model file name.
dirpath -- string type, for example: dirpath='my/path_to_save_model/'
filename -- string type; as mentioned above, the filepath argument is deprecated, so dirpath + filename is the recommended way to build the model path. The file name can embed epoch, val_loss, and other metrics, giving a model name like my/path/epoch=2-val_loss=0.02-other_metric=0.03.ckpt:

checkpoint_callback = ModelCheckpoint(..., dirpath='my/path', filename='{epoch}-{val_loss:.2f}-{other_metric:.2f}')

Case:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# saves checkpoints to 'my/path/' at every epoch
checkpoint_callback = ModelCheckpoint(dirpath='my/path/')
trainer = Trainer(callbacks=[checkpoint_callback])

# save epoch and val_loss in name
# saves a file like: my/path/sample-mnist-epoch=02-val_loss=0.32.ckpt
checkpoint_callback = ModelCheckpoint(monitor='val_loss', dirpath='my/path/', filename='sample-mnist-{epoch:02d}-{val_loss:.2f}')

Get the best model

Because the settings above may save multiple models, use best_model_path to find and restore the best one.

Case:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
checkpoint_callback = ModelCheckpoint(dirpath='my/path/')
trainer = Trainer(callbacks=[checkpoint_callback])
model = ...
trainer.fit(model)
# After training, several models may have been saved. The attribute below gives the path of the
# checkpoint whose monitored quantity was best; load it to apply those weights to the network
checkpoint_callback.best_model_path
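A minimal sketch of actually reloading that best checkpoint (MyLightningModule is a placeholder name; substitute your own LightningModule class):

best_path = checkpoint_callback.best_model_path
best_model = MyLightningModule.load_from_checkpoint(best_path)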

Save model manually

When using this framework for reinforcement learning, the training data set is not fixed: it is generated in real time by interacting with the environment. As a result the epoch counter stays at 0 for the whole run and the model is never saved automatically, so we have to save it manually. In addition, the saved models are usually quite large, so keeping only the best three or so is enough. This can be maintained with a queue: save the new one and delete the oldest:

from collections import deque
import os
# Maintain a queue (put this line in the __init__ of the class that inherits pl.LightningModule)
self.save_models = deque(maxlen=3)
# The function below also lives in that class, at the same level as training_step()
def manual_save_model(self):
    model_path = 'your_model_save_path_%s' % (your_loss)
    if len(self.save_models) >= 3:
        # When the queue is full, take the path of the oldest model and delete it
        old_model = self.save_models.popleft()
        if os.path.exists(old_model):
            os.remove(old_model)
    # Save manually
    self.trainer.save_checkpoint(model_path)
    # To save Model path Add to queue
    self.save_models.append(model_path)

The function above can be wrapped in a simple check: if the loss gets smaller (or the reward gets larger), call it to save the model. To be safe, you can also save the latest model every so often. The function below is lifted from pl's own source code, so the save path is the one configured on the checkpoint callback, i.e. the dirpath from the ModelCheckpoint (callbacks) section earlier in this article; a latest .ckpt file is written automatically under that path.

# Save the latest checkpoint
def save_latest_model(self):
    checkpoint_callbacks = [c for c in self.trainer.callbacks if isinstance(c, ModelCheckpoint)]
    print("Saving latest checkpoint...")
    model = self.trainer.get_model()
    [c.on_validation_end(self.trainer, model) for c in checkpoint_callbacks]

Some other functions

format_checkpoint_name(epoch, step, metrics, ver=None)

The filename parameter above defines the format of the saved model file name. This function fills in the variables and returns the resulting file name as a string:

>>> tmpdir = os.path.dirname(__file__)
>>> ckpt = ModelCheckpoint(dirpath=tmpdir, filename='{epoch}')
>>> os.path.basename(ckpt.format_checkpoint_name(0, 1, metrics={}))
'epoch=0.ckpt'
>>> ckpt = ModelCheckpoint(dirpath=tmpdir, filename='{epoch:03d}')
>>> os.path.basename(ckpt.format_checkpoint_name(5, 2, metrics={}))
'epoch=005.ckpt'
>>> ckpt = ModelCheckpoint(dirpath=tmpdir, filename='{epoch}-{val_loss:.2f}')
>>> os.path.basename(ckpt.format_checkpoint_name(2, 3, metrics=dict(val_loss=0.123456)))
'epoch=2-val_loss=0.12.ckpt'
>>> ckpt = ModelCheckpoint(dirpath=tmpdir, filename='{missing:d}')
>>> os.path.basename(ckpt.format_checkpoint_name(0, 4, metrics={}))
'missing=0.ckpt'
>>> ckpt = ModelCheckpoint(filename='{step}')
>>> os.path.basename(ckpt.format_checkpoint_name(0, 0, {}))
'step=0.ckpt'

Save manually

In addition to automatic saving, you can also save and load models manually

model = MyLightningModule(hparams)
trainer.fit(model)
trainer.save_checkpoint("example.ckpt")
new_model = MyLightningModule.load_from_checkpoint(checkpoint_path="example.ckpt")

Load Checkpoint

Load the model weights, biases, and hyperparameters:

model = MyLightingModule.load_from_checkpoint(PATH)

print(model.learning_rate)
# prints the learning_rate you used in this checkpoint

model.eval()
y_hat = model(x)

If you need to be able to modify the hyperparameters later, allow for overriding them when writing the Module:

class LitModel(LightningModule):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.save_hyperparameters()
        # Here, use the new hyperparameters instead of the hyperparameters loaded from the model
        self.l1 = nn.Linear(self.hparams.in_dim, self.hparams.out_dim)

In this case, the recovery model can be as follows:

# For example, during training the model was initialized with in_dim=32 and out_dim=10
LitModel(in_dim=32, out_dim=10)
# Restoring the model like this uses the saved values in_dim=32 and out_dim=10
model = LitModel.load_from_checkpoint(PATH)
# Of course these can also be overridden, e.g. changed to in_dim=128 and out_dim=10
model = LitModel.load_from_checkpoint(PATH, in_dim=128, out_dim=10)

load_from_checkpoint method

LightningModule.load_from_checkpoint(checkpoint_path, map_location=None, hparams_file=None, strict=True, **kwargs)

This is the main method for loading a model from a checkpoint.
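As a small hedged sketch (the class name and checkpoint path are placeholders), a typical call that also remaps GPU-saved tensors onto the CPU looks like this:

model = MyLightningModule.load_from_checkpoint(
    checkpoint_path='example.ckpt',
    map_location='cpu',  # remap storages saved on GPU onto the CPU
)
model.eval()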

Restore model and Trainer

If you want not only to restore the model but also to continue training, you can restore the Trainer as well:

model = LitModel()
trainer = Trainer(resume_from_checkpoint='some/path/to/my_checkpoint.ckpt')
# Automatically recover model, epoch, step, learning rate information (including LR schedulers), accuracy, etc
# automatically restores model, epoch, step, LR schedulers, apex, etc...
trainer.fit(model)

Training assistance

Early Stopping

Monitor a quantity logged in validation_step() and stop training early if it stops improving.

pytorch_lightning.callbacks.early_stopping.EarlyStopping(monitor='early_stop_on', min_delta=0.0, patience=3, verbose=False, mode='auto', strict=True)
monitor (str) -- the quantity to monitor; default 'early_stop_on'; mark the quantity with self.log('var_name', val_loss).
min_delta (float) -- minimum change that counts as an improvement; default 0.0, i.e. a change whose absolute value is smaller than this is not considered an improvement.
patience (int) -- default 3; stop training if the monitored quantity goes this many consecutive validation epochs without improving.
verbose (bool) -- default False.
mode (str) -- one of {auto, min, max}, with the same meaning as in ModelCheckpoint above. If the monitored quantity is like val_loss, where smaller is better, set 'min'; if it is like val_acc (validation accuracy), where larger is better, set 'max'. 'auto' guesses from the monitor's name (personal understanding, may be wrong; a misleading name such as val_loss for an accuracy-style metric could make it behave the opposite way).
strict (bool) -- default True; if the monitored quantity is not found among the values logged in validation_step(), an error is raised and training is terminated.
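For completeness, a short usage sketch (assuming a 'val_loss' value is logged in validation_step() as in the earlier examples; in this version the callback can, as far as I recall, simply be placed in the callbacks list):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# stop if val_loss has not improved for 5 consecutive validation epochs
early_stop_callback = EarlyStopping(monitor='val_loss', min_delta=0.0, patience=5, mode='min')
trainer = Trainer(callbacks=[early_stop_callback])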

Logging

Only TensorBoard is covered here; for everything else see the official Logging documentation. There are two basic ways to log to TensorBoard: scalars can be logged directly with self.log(), while images, weights, and so on go through the TensorBoard API.

# When defining the Trainer object, pass in the TensorBoardLogger
logger = TensorBoardLogger(args['log_dir'], name='DCK_PL')
trainer = pl.Trainer(logger=logger)
# Get the tensorboard logger inside the module; validation_step() is used as the example
def validation_step(self, batch, batch_idx):
    tensorboard = self.logger.experiment
    # For example, compute the validation loss:
    loss = ...
    # Log it directly
    self.log('val_loss', loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
    # For images, use the tensorboard API
    tensorboard.add_image(...)
    # Log several scalars at the same time ('losses' is just the group tag)
    other_loss = ...
    loss_dict = {'val_loss': loss, 'loss': other_loss}
    tensorboard.add_scalars('losses', loss_dict)
    # Log weights, histograms, etc.
    tensorboard.add_histogram(...)

Note that if you use anaconda, activate your env first. Also note that for --logdir=my_log_dir/, the logdir should point down to the version_0/ directory, which is the folder that holds the logged variables.

# The viewing method is the same as that of the tensorboard, under the terminal
tensorboard --logdir=my/log_path

Of course, you can also inherit the LightningLoggerBase class to write a custom Logger; see the official documentation for details.

optimizer and lr_scheduler

Controlling the learning rate during training is also very important: a sensible learning rate helps the final result, and for learning-rate decay you can refer to the article on four learning rate decay methods. So how is this set up in pytorch_lightning? It is essentially the same as in plain pytorch and barely needs modification:

# Just override the configure_optimizers() function
# Set up the optimizer
def configure_optimizers(self):
    weight_decay = 1e-6  # l2 regularization coefficient
    # Suppose there are two networks, an encoder and a decoder
    optimizer = optim.Adam([{'params': self.encoder.parameters()}, {'params': self.decoder.parameters()}], lr=learning_rate, weight_decay=weight_decay)
    # If there is only one network, it is even more direct
    optimizer = optim.Adam(my_model.parameters(), lr=learning_rate, weight_decay=weight_decay)
    # Here the learning rate is halved after 2000 epochs and stays constant afterwards
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[2000], gamma=0.5)
    optim_dict = {'optimizer': optimizer, 'lr_scheduler': scheduler}
    return optim_dict

That's all: as long as training_step() returns the loss, backpropagation is handled automatically and loss.backward(), optimizer.step(), and scheduler.step() are called for you.

Multiple optimizers are used for network structures such as multiple models

When training a complex architecture there may be several sub-models that need different update orders and different learning rates. In that case we define multiple optimizers and call the gradient backpropagation manually.

 # multiple optimizer case (e.g.: GAN)
 def configure_optimizers(self):
     opt_d = Adam(self.model_d.parameters(), lr=0.01)
     opt_g = Adam(self.model_g.parameters(), lr=0.02)
     return opt_d, opt_g

Then turn off automatic optimization so that, just like in plain pytorch, you control the optimizer weight updates and any complex update order yourself, while still keeping pytorch_lightning's advantages, such as synchronized BatchNorm parameters across multiple GPUs.

 # Turn off auto optimization when the new Trainer object
 trainer = Trainer(automatic_optimization=False)

At this point training_step() no longer has to return a loss or a dictionary, because the weight-update functions are called manually inside it. Also note: 1. do not call loss.backward() any more; use self.manual_backward(loss, opt) instead, which keeps half-precision training working; 2. the optimizer_idx argument can be ignored.

 def training_step(self, batch, batch_idx, opt_idx):
     # Get the optimizers returned in configure_optimizers()
     (opt_d, opt_g) = self.optimizers()
     loss_g = self.acquire_loss_g()

     # Note: loss.backward() is no longer used. Taking a GAN as an example, the generator's graph
     # must be kept so that the discriminator update can reuse it, hence retain_graph=True.
     self.manual_backward(loss_g, opt_g, retain_graph=True)
     # The second backward call releases the graph
     self.manual_backward(loss_g, opt_g)
     opt_g.step()
     # Zero the generator's gradients before updating the discriminator
     opt_g.zero_grad()

     # Update the discriminator
     loss_d = self.acquire_loss_d()
     self.manual_backward(loss_d, opt_d)
     opt_d.step()
     opt_d.zero_grad()

other

Other important settings include synchronized BatchNorm parameters, half-precision training (originally apex features, but pl handles them much more pleasantly than apex), multi-GPU training, and so on.

Multi GPU training

For CPU training, simply omit the gpus parameter when defining the Trainer, or set it to 0:

trainer = pl.Trainer(gpus=0)

Multi GPU training is also very convenient. Just set this parameter to the number of GPUs you want to use, for example, 4 GPUs:

trainer = pl.Trainer(gpus=4)

If the machine has many GPUs but you have to share them with your labmates, just declare which GPUs are visible at the top of the program. For example, the server has four cards but only cards 0 and 2 are yours to use:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0, 2'
trainer = pl.Trainer(gpus=2)

Half precision training

Half-precision training is another major apex feature: it cuts GPU memory usage (by roughly 50%) and speeds up training without hurting the results. pytorch_lightning now gives you all of this; just set the following parameter:

trainer = pl.Trainer(precision=16)

Cumulative gradient

By default the gradients are applied after every batch, but they can also be accumulated over N batches before updating, which mimics a large batch size. For example, when GPU memory is tight and the training batch size has to be small, gradient accumulation helps:

# Gradient accumulation is off by default (i.e. accumulate over 1 batch)
trainer = Trainer(accumulate_grad_batches=1)
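For example, accumulating over 4 batches gives an effective batch size four times larger. A small sketch (the concrete numbers are only illustrative; as far as I recall, a dict keyed by epoch is also accepted to change the factor during training):

# accumulate gradients over 4 batches, i.e. an effective batch size of 4 * batch_size
trainer = Trainer(accumulate_grad_batches=4)

# or schedule it: accumulate 2 batches starting from epoch 5 and 4 batches starting from epoch 10
trainer = Trainer(accumulate_grad_batches={5: 2, 10: 4})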

Autoscale batch_size

This feature still has many limitations: a plain trainer.fit(model) does not trigger it (you have to call trainer.tune(model)), and it feels troublesome overall, so I don't recommend it.

A larger batch_size gives better gradient estimates, but each step takes longer, and if memory fills up the machine can freeze. 'power' doubles the batch size starting from 1 (1 -> 2 -> 4 -> ...) until out-of-memory (OOM); 'binsearch' also doubles until it hits OOM, but then continues with a binary search to find a better batch size. The batch size found will never exceed the size of the dataset.

# Not on by default
trainer = Trainer(auto_scale_batch_size=None)

# Automatically find a batch size that fits in memory: 'power' or 'binsearch'
trainer = Trainer(auto_scale_batch_size='power')  # or auto_scale_batch_size='binsearch'

# Run the search and apply the result to the model
trainer.tune(model)

Save all hyperparameters to the model

Saving all hyperparameters into the model means that when restoring it you no longer have to drag the hyperparameters along separately; they come back with the checkpoint. This is a really nice feature:

# Suppose the hyperparameter dictionary you pass in is params_dict
self.hparams.update(params_dict)    # merge your hyperparameters into the pl model's hparams dictionary
# This way the hyperparameters are stored whenever the model is saved
self.save_hyperparameters()
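For context, a minimal sketch of how this snippet might sit inside a LightningModule's __init__ (MyModel and params_dict are placeholder names, not from the original article):

import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def __init__(self, params_dict):
        super().__init__()
        self.hparams.update(params_dict)   # e.g. {'lr': 1e-3, 'batch_size': 32}
        self.save_hyperparameters()        # stored in the .ckpt alongside the weights

# after training, the hyperparameters come back automatically with the checkpoint:
# model = MyModel.load_from_checkpoint('some.ckpt'); print(model.hparams)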

Of course, for different models we still want to be able to check their hyperparameters without loading them in code, so the hyperparameter dictionary can also be dumped to a local txt file for later viewing:

def save_dict_as_txt(list_dict, save_dir):
    with open(save_dir, 'w') as fw:
        # accept either a single dict or a list of dicts
        if isinstance(list_dict, list):
            for d in list_dict:
                for key in d.keys():
                    fw.write(key + ': ' + str(d.get(key)) + '\n')
        else:
            for key in list_dict.keys():
                fw.write(key + ': ' + str(list_dict.get(key)) + '\n')
# Save the hyperparameter dictionary to txt        
save_dict_as_txt(self.hparams, save_dir)

Gradient clipping

When gradient explosion needs to be avoided, gradient clipping can be used; the gradient norm is computed over all model weights:

# The default is no clipping
trainer = Trainer(gradient_clip_val=0)

# The upper limit of the gradient norm is 0.5
trainer = Trainer(gradient_clip_val=0.5)

Set the minimum and maximum epochs for training

By default, training runs for at least 1 epoch and at most 1000 epochs:

trainer = Trainer(min_epochs=1, max_epochs=1000)

Small data set

When the dataset is too large, or while debugging, we may not want to load the whole dataset; we can load just a small part of it:

The default is to use everything, i.e. the following parameters all default to 1.0:

# Use only 10% of the training set, 20% of the validation set and 30% of the test set; an int can be passed instead to mean a number of batches
trainer = Trainer(
    limit_train_batches=0.1,
    limit_val_batches=0.2,
    limit_test_batches=0.3
)

Pay particular attention to the validation and test fractions: pytorch_lightning runs a full epoch (not a single step) every time it validates or tests, so if your validation set is large, a lot of time goes into validation during training. Usually we just want a rough idea of how training is going and don't need a full validation epoch, so limit_val_batches can be set fairly small. Likewise for test: if you don't want to test on all the data after training, this parameter controls how much is used.

In addition, the framework has a num_sanity_val_steps parameter, which runs that many validation batches before training starts, so you don't waste time training for a while only to have the program crash once validation begins. It is passed when constructing the trainer:

# The default is a sanity check of two validation batches
trainer = Trainer(num_sanity_val_steps=2)

# Skip the sanity check and start training directly
trainer = Trainer(num_sanity_val_steps=0)

# Run the whole validation set as the sanity check (may waste a lot of time)
trainer = Trainer(num_sanity_val_steps=-1)

exception handling

The framework was only released this October, so bugs are to be expected. Here are the ones I ran into (some of which were pits I dug myself):

Asynchronous problem of multi GPU CUDA devices

During multi-GPU training, validation runs whenever an epoch (or the configured fraction of an epoch) completes, and after validation finished this error was reported:

RuntimeError: All input tensors must be on the same device. Received cuda:2 and cuda:0

I opened an issue about this on GitHub; someone has already fixed the bug and it should be included in a later pytorch_lightning release, but if you hit this error with your version you can patch it by following these changes:

https://github.com/PyTorchLightning/pytorch-lightning/pull/4138/files/b20f383acaac4662caee86b76ec56c5c478f44a0

Problems with DataLoader

RuntimeError: DataLoader worker (pid(s) 6700, 10620) exited unexpectedly

This problem usually shows up with multiple GPUs and is mainly about how the DataLoader is configured; num_workers=0 works. In one task num_workers=8 was also fine, but in another task with larger images it may have been running out of memory: loading the dataset became extremely slow, almost frozen.

If you hit this error, reduce batch_size, set num_workers=0, or set the backend when defining the trainer:

trainer = pl.Trainer(distributed_backend='ddp')
# or
trainer = pl.Trainer(distributed_backend='dp')
