Recording the bugs I ran into the first time I used PyTorch's DistributedDataParallel for distributed training: RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'
Background: single-machine multi-GPU parallel training on a Linux server, PyTorch 1.7, Python 3.7, CUDA 10.1.
After consulting a lot of related material, the common explanation is that the model's input and the model's parameters are not on the same device (GPU). However, none of the fixes described there solved my problem, including posts such as:
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'
In any case, the core idea is always the same: make sure the model's input and the model's parameters end up on the same GPU. The exact step that goes wrong, however, varies from person to person.
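For reference, here is a minimal, self-contained sketch of that general fix idea (the single-process setup values such as MASTER_ADDR/MASTER_PORT and local_rank = 0 are illustrative assumptions, not my actual training script): wrap the model with device_ids=[local_rank] and create or move every input tensor onto that same GPU.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Single-process stand-in for the launcher's environment (illustrative values).
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('nccl', rank=0, world_size=1)

    local_rank = 0                              # normally comes from the launcher
    torch.cuda.set_device(local_rank)
    device = torch.device('cuda', local_rank)

    model = DDP(torch.nn.Linear(10, 2).to(device), device_ids=[local_rank])  # toy model
    x = torch.randn(4, 10, device=device)       # data on the same GPU as the parameters
    out = model(x)                               # no device-mismatch error

    dist.destroy_process_group()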
In my case, I printed the device (CUDA index) of the input and of the model parameters at every training iteration. When the error occurred, all model parameters were on cuda:1 while the data was on the correct device.
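The check itself can be as simple as the following helper (a sketch with made-up names; report_devices is not part of my project code):

    import torch

    def report_devices(model, batch):
        # Print where the data and the model parameters currently live,
        # to spot a mismatch such as input on cuda:0 but weights on cuda:1.
        print('input device :', batch.device)
        print('param device :', next(model.parameters()).device)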
I went through the code line by line and found that the error came from the model-saving step during training. The original code is as follows:
    if rank == 0 and total_iters_all % opt.save_latest_freq == 0:   # cache our latest model every <save_latest_freq> iterations
        print('saving the latest model (epoch %d, total_iters %d)' % (epoch, total_iters_all))
        save_suffix = 'iter_%d' % total_iters_all if opt.save_by_iter else 'latest'
        model.save_networks(save_suffix)
The save_networks method is defined as follows:
    def save_networks(self, epoch):
        """Save all the networks to the disk."""
        for name in self.model_names:
            if isinstance(name, str):
                save_filename = '%s_net_%s.pth' % (epoch, name)
                save_path = os.path.join(self.save_dir, save_filename)
                net = getattr(self, 'net' + name)
                if len(self.gpu_ids) > 0 and torch.cuda.is_available():
                    torch.save(net.module.cpu().state_dict(), save_path)  # There's something wrong with these two lines
                    net.cuda(self.gpu_ids[0])                             # There's something wrong with these two lines
                else:
                    torch.save(net.cpu().state_dict(), save_path)
The two lines I marked are the problem. Their purpose is to move the model from GPU to CPU for saving and then move it back to GPU. In this back-and-forth, the model parameters end up on self.gpu_ids[0], which under my hyperparameter settings is cuda:1, exactly matching the device mismatch I had printed.
I tried changing net.cuda(self.gpu_ids[0]) to use the corresponding local_rank, but that still did not work. The final solution was to save the model's GPU parameters directly and skip the step of moving them back to the CPU for saving. The modified code is as follows:
    def save_networks(self, epoch):
        """Save all the networks to the disk."""
        for name in self.model_names:
            if isinstance(name, str):
                save_filename = '%s_net_%s.pth' % (epoch, name)
                save_path = os.path.join(self.save_dir, save_filename)
                net = getattr(self, 'net' + name)
                if len(self.gpu_ids) > 0 and torch.cuda.is_available():
                    torch.save(net.module.state_dict(), save_path)  # changed to this line
                else:
                    torch.save(net.cpu().state_dict(), save_path)
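One consequence of saving the GPU state_dict directly is that the checkpoint now contains CUDA tensors, so when loading it later it is safer to remap them with map_location. A minimal sketch (load_checkpoint and its arguments are illustrative, not part of the original code):

    import torch

    def load_checkpoint(net, path, local_rank):
        # Remap the saved CUDA tensors to this process's device (or to 'cpu').
        state_dict = torch.load(path, map_location='cuda:%d' % local_rank)
        net.load_state_dict(state_dict)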
After this change, training ran through successfully.
In addition, in train.py I use os.system() to call test.py, and another error occurred there: AssertionError: Invalid device id.
The cause seems to be that the CUDA device id used when initializing the model in test.py does not match what train.py expects; in the end I simply forced the model in test.py to be initialized on cuda:0.
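What "forcing it to cuda:0" amounts to is roughly the following (a sketch; the toy Linear layer stands in for the model actually built in test.py):

    import torch

    torch.cuda.set_device(0)                     # pin this process to the first GPU
    device = torch.device('cuda:0')
    model = torch.nn.Linear(10, 2).to(device)    # toy stand-in for the model built in test.py
    print(next(model.parameters()).device)       # cuda:0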
In short, everyone's bugs are different. While referring to other people's bug reports, we still have to dig into our own code, step by step...