PyTorch: the DataLoader data-reading mechanism for batch training

First, let's clarify the meaning of several common terms: batch, epoch and iteration.
Batch: usually, we divide a dataset into several small sample sets and feed one small set to the neural network at a time. Each such small set of samples is called a batch.
Epoch: one complete pass of all the data in the training set through the model (including one forward pass and one backward pass) is called an epoch.
Iteration: the process of using one batch to update the parameters of the model is called an iteration.
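
For example, with 1000 samples and a batch size of 10, one epoch consists of 100 iterations. A minimal sketch of this relationship (the numbers simply anticipate the dataset we build below):

import math

sample_num = 1000  # total number of training samples
batch_size = 10    # samples fed to the network per iteration
iterations_per_epoch = math.ceil(sample_num / batch_size)
print(iterations_per_epoch)  # 100: one epoch = 100 parameter updates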

Import the packages first. The classes described below are located in torch.utils.data.

Let's manually create a dataset of 1000 samples, each with two features. The sample set is therefore a 1000 × 2 tensor, and the label set is a tensor of 1000 labels, one per sample. The two features of each sample are drawn at random, and the label satisfies y = 2x_1 - 5x_2 + 3 plus an error term drawn from a normal distribution.

import numpy as np
import torch
import torch.utils.data as Data

# Generate the dataset
feature_num = 2   # two input features x1, x2
sample_num = 1000 # 1000 samples
true_w = [2, -5]
true_b = 3
samples = torch.randn(sample_num, feature_num, dtype=torch.float32) # 1000 x 2 tensor of input samples
labels = true_w[0] * samples[:, 0] + true_w[1] * samples[:, 1] + true_b
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()), dtype=torch.float32) # add Gaussian noise
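
A quick check of the shapes confirms that the samples and labels line up as described (a sanity check, not part of the recipe itself):

print(samples.shape)          # torch.Size([1000, 2])
print(labels.shape)           # torch.Size([1000])
print(samples[0], labels[0])  # one sample and its noisy label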

1, TensorDataset() function: wrap the data into a dataset

Given the data, i.e. the samples and their labels, TensorDataset wraps them into a Dataset.

Note: 1. The input parameters must be tensors.
2. The first dimension of the two parameters must match, so that samples and labels can be paired up row by row.

Now we wrap the samples and labels generated above into a dataset.

dataset = Data.TensorDataset(samples, labels)
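
The wrapped dataset supports len() and integer indexing; each index returns a (sample, label) pair. For instance:

print(len(dataset))  # 1000
x0, y0 = dataset[0]  # the first sample and its label
print(x0, y0)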

2, DataLoader() function: load dataset

DataLoader loads the dataset and returns an iterator over it according to the given parameters.

The function signature is

torch.utils.data.DataLoader(dataset, batch_size=1,
		shuffle=False, sampler=None,
		batch_sampler=None, num_workers=0,
		collate_fn=None, pin_memory=False,
		drop_last=False, timeout=0,
		worker_init_fn=None, multiprocessing_context=None,
		generator=None, *, 
		prefetch_factor=2, persistent_workers=False)

Common parameters have the following meanings.

  1. dataset: Dataset type; the dataset from which to load the data.
  2. batch_size: int type. The number of samples contained in each batch. The default value is 1.
  3. shuffle: bool type. When set to True, the data is reshuffled at every epoch.
  4. sampler: Sampler type. Specifies the strategy for drawing samples from the dataset. If sampler is specified, shuffle must be left as False.
    Several common values are as follows.
  • torch.utils.data.SequentialSampler(dataset): samples elements sequentially, always in the same order.
  • torch.utils.data.RandomSampler(dataset): samples elements randomly, without replacement.
  • torch.utils.data.SubsetRandomSampler(indices): samples elements randomly from the given list of indices, without replacement.
  5. num_workers: int type. How many subprocesses to use for loading data. 0 means the data is loaded in the main process.
  6. pin_memory: bool type. If True, the tensors are copied into CUDA pinned memory before being returned.
  7. drop_last: bool type. When the dataset size is not divisible by batch_size, setting this to True drops the last incomplete batch; setting it to False keeps it (see the sketch after this list).
  8. timeout: numeric type, must be greater than or equal to 0. Sets the timeout for collecting a batch; if positive, an error is raised when no data arrives within this time.
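
A small sketch of how drop_last and an explicit sampler affect the batches. The batch size of 300 is chosen here only so that 1000 is not evenly divisible:

loader_keep = Data.DataLoader(dataset, batch_size=300, drop_last=False)
loader_drop = Data.DataLoader(dataset, batch_size=300, drop_last=True)
print(len(loader_keep))  # 4: the last batch holds the remaining 100 samples
print(len(loader_drop))  # 3: the incomplete batch is discarded

# Draw only from the first 100 samples, in random order without replacement
subset_loader = Data.DataLoader(dataset, batch_size=10,
		sampler=torch.utils.data.SubsetRandomSampler(range(100)))
print(len(subset_loader))  # 10 batches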

Now let's read the dataset generated above in random mini-batches.

batch_size = 10
# An explicit RandomSampler requires shuffle=False; this is equivalent to
# simply passing shuffle=True.
data_iter = Data.DataLoader(dataset, batch_size, shuffle=False,
		sampler=torch.utils.data.RandomSampler(dataset))

We can read and print the samples of each batch. Each pass through the loop automatically pulls the next batch from data_iter; each X is a 10 × 2 tensor of samples and each y a length-10 tensor of labels.

for X, y in data_iter:
  print(X, "\n", y)
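
If only a single batch is needed, for example for inspection, the iterator can be advanced manually instead of looping over the whole epoch:

X, y = next(iter(data_iter))
print(X.shape)  # torch.Size([10, 2]) - one mini-batch of samples
print(y.shape)  # torch.Size([10])    - the matching labels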
