# Preface

This note introduces the autograd module in PyTorch. It mainly covers the Python code under torch/autograd and does not go into the underlying C++ implementation. The source code discussed here is that of PyTorch 1.7.

- torch.autograd.function (back propagation of function)
- torch.autograd.functional (back propagation of computational graph)
- torch.autograd.gradcheck (numerical gradient check)
- torch.autograd.anomaly_mode (detect error generation path during automatic derivation)
- torch.autograd.grad_mode (set whether gradient is required)
- model.eval() and torch.no_grad()
- torch.autograd.profiler (provides function level statistics)

## torch.autograd.function (back propagation of function)

When we build a network, we usually use the nn.Module subclasses provided by PyTorch (such as nn.Conv2d, nn.ReLU, etc.) as basic units. These modules usually wrap an autograd function that does the real work. For example, nn.ReLU simply calls torch.nn.functional.relu (F.relu):

    from torch import Tensor
    from torch.nn import Module
    from torch.nn import functional as F

    class ReLU(Module):
        __constants__ = ['inplace']
        inplace: bool

        def __init__(self, inplace: bool = False):
            super(ReLU, self).__init__()
            self.inplace = inplace

        def forward(self, input: Tensor) -> Tensor:
            return F.relu(input, inplace=self.inplace)
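As a quick check of how thin this wrapper is, the following sketch inspects the types involved. It assumes that F.relu ends up calling torch.relu, which holds for the PyTorch versions I am aware of but is an internal detail that may change.

```python
import torch
from torch.nn import functional as F

# F.relu is an ordinary Python function defined in torch/nn/functional.py
print(type(F.relu).__name__)

# torch.relu is exposed by the C++ extension, so Python sees it as a builtin
print(type(torch.relu).__name__)

# Both entry points compute the same result
x = torch.tensor([-1.0, 0.0, 2.0])
same = torch.equal(F.relu(x), torch.relu(x))
print(same)
```

The Python layer only handles arguments and dispatch; the arithmetic happens in the builtin.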

F.relu itself is an ordinary Python function. If we peel off one more layer, the function it wraps has type builtin_function_or_method, and this is the part that actually performs the computation. These parts are usually implemented in C++ (for example in ATen). So far, we know that the computational part of a model is composed of autograd functions. Each autograd function defines forward and backward, which describe the forward computation and the back propagation of gradients. Composed together, they give the forward pass and the back propagation of the whole model. The Function class defined in torch.autograd.function is the base class of these functions, and we can subclass it to implement a custom autograd function. A custom function must provide forward and backward methods. We take Exp and GradCoeff as examples:

    import torch
    from torch.autograd import Function

    class Exp(Function):  # this layer computes e^x
        @staticmethod
        def forward(ctx, i):  # forward pass
            result = i.exp()
            # Save what backward will need. Saved values are available in the
            # ctx.saved_tensors tuple. Only tensors can be saved this way; values of
            # other types (int, etc.) can be assigned directly as attributes of ctx.
            ctx.save_for_backward(result)
            return result

        @staticmethod
        def backward(ctx, grad_output):  # gradient back propagation
            result, = ctx.saved_tensors  # retrieve the result saved in forward
            return grad_output * result  # compute and return the gradient

    # Try it out
    x = torch.tensor([1.], requires_grad=True)  # requires_grad must be True for gradients to be propagated back to x
    ret = Exp.apply(x)  # a custom autograd function is called through its apply method
    print(ret)  # tensor([2.7183], grad_fn=<ExpBackward>)
    ret.backward()  # back propagation
    print(x.grad)  # tensor([2.7183])

The forward pass of Exp is very simple: it directly calls the exp method of the tensor. For the backward pass, we know that

$$\frac{\mathrm{d}}{\mathrm{d}x} e^{x} = e^{x}$$

Therefore, we simply multiply the saved result $e^{x}$ by grad_output to obtain the gradient. We see that our custom function Exp performs the forward and backward passes correctly. We also note that the result of the forward pass carries a grad_fn attribute, which points to the function used to compute its gradient (here, the backward function of Exp). This is explained in more detail in the next section. Next, let's look at another function, GradCoeff, which multiplies the back-propagated gradient by a user-defined coefficient.

    class GradCoeff(Function):
        @staticmethod
        def forward(ctx, x, coeff):  # forward pass
            ctx.coeff = coeff  # save coeff as an attribute of ctx
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):  # gradient back propagation
            # backward must return as many values as forward has inputs.
            # coeff needs no gradient, so None is returned for it.
            return ctx.coeff * grad_output, None

    # Try it out
    x = torch.tensor([2.], requires_grad=True)
    ret = GradCoeff.apply(x, -0.1)  # forward takes both x and coeff; coeff is set to -0.1
    ret = ret ** 2
    print(ret)  # tensor([4.], grad_fn=<PowBackward0>)
    ret.backward()
    print(x.grad)  # tensor([-0.4000]), the gradient has been multiplied by the coefficient
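The rule that backward returns one value per forward input is easiest to see with two tensor inputs. The following is a minimal sketch; the class name Mul is hypothetical and only serves as an illustration, and the result is compared against PyTorch's built-in multiplication.

```python
import torch
from torch.autograd import Function

class Mul(Function):  # hypothetical example: elementwise product of two tensors
    @staticmethod
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)  # both inputs are needed in backward
        return a * b

    @staticmethod
    def backward(ctx, grad_output):
        a, b = ctx.saved_tensors
        # One gradient per forward input, in the same order as (a, b)
        return grad_output * b, grad_output * a

a = torch.tensor([2.0, 3.0], requires_grad=True)
b = torch.tensor([4.0, 5.0], requires_grad=True)
Mul.apply(a, b).sum().backward()

# Reference gradients from the built-in operator
a_ref = a.detach().clone().requires_grad_(True)
b_ref = b.detach().clone().requires_grad_(True)
(a_ref * b_ref).sum().backward()

print(torch.equal(a.grad, a_ref.grad), torch.equal(b.grad, b_ref.grad))
```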

## torch.autograd.functional (back propagation of computational graph)

In the previous section, we described back propagation through a single function and how to write a custom autograd function. In this section, we briefly introduce the interfaces PyTorch provides for back propagation through a computational graph.

During training, we usually compute a loss from the prediction and the ground-truth label (the loss is a Tensor), and then call loss.backward() to back-propagate gradients. The backward method of the Tensor class actually calls the torch.autograd.backward interface. This Python interface implements back propagation at the level of the computational graph.

    class Tensor(torch._C._TensorBase):
        def backward(self, gradient=None, retain_graph=None, create_graph=False):
            relevant_args = (self,)
            ...
            torch.autograd.backward(self, gradient, retain_graph, create_graph)
            # gradient: has the same shape as the tensor and can be understood as the
            #   intermediate result of the chain rule. It can be omitted if the tensor
            #   is a scalar (it then defaults to 1).
            # retain_graph: by default, the intermediate buffers of the graph are freed
            #   after one backward pass. To back-propagate through the same graph several
            #   times (gradients then accumulate), set retain_graph=True to keep them.
            # create_graph: also builds a computational graph for the backward pass
            #   itself, which can be used to compute second-order derivatives.
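The effect of retain_graph and create_graph can be seen on a one-variable function. The sketch below uses y = x^3 as an arbitrary example of my own choosing, and uses torch.autograd.grad for the second derivative.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# retain_graph: back-propagate twice through the same graph.
# Gradients accumulate in x.grad.
y = x ** 3
y.backward(retain_graph=True)   # dy/dx = 3 * x^2
first = x.grad.clone()
y.backward()                    # the second pass adds the same amount again
accumulated = x.grad.clone()

# create_graph: the backward pass is itself recorded,
# so its result can be differentiated once more.
x.grad = None
y = x ** 3
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)   # 3 * x^2
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)                # 6 * x

print(first, accumulated, dy_dx, d2y_dx2)
```

Without retain_graph=True on the first call, the second y.backward() would fail because the buffers of the graph have already been freed.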

In PyTorch's implementation, autograd records every operation that produces the current variable as the user performs it, and builds a directed acyclic graph (DAG). The graph records the operations (Functions), and the position of each variable in the graph can be inferred from its grad_fn attribute. During back propagation, autograd traverses this graph backwards from the current variable (the root node F), and the gradients of all leaf nodes are computed with the chain rule. Each forward operation has a corresponding backward function that computes the gradient of each of its inputs. The names of these functions usually end with Backward. We construct a simplified computational graph and use it as an example.

    import torch

    A = torch.tensor(2., requires_grad=True)
    B = torch.tensor(.5, requires_grad=True)
    E = torch.tensor(1., requires_grad=True)
    C = A * B
    D = C.exp()
    F = D + E
    print(F)  # tensor(3.7183, grad_fn=<AddBackward0>)
    # The result shows that the grad_fn of F points to AddBackward0, the operation that produced F

    print([x.is_leaf for x in [A, B, C, D, E, F]])  # [True, True, False, False, True, False]
    # Tensors created by the user with requires_grad=True are leaf nodes

    print([x.grad_fn for x in [F, D, C, A]])
    # [<AddBackward0 object at 0x7f972de8c7b8>, <ExpBackward0 object at 0x7f972de8c278>, <MulBackward0 object at 0x7f972de8c2b0>, None]
    # The grad_fn of each variable points to the backward function of the operator
    # that produced it; the grad_fn of a leaf node is None

    print(F.grad_fn.next_functions)
    # ((<ExpBackward0 object at 0x7f972de8c390>, 0), (<AccumulateGrad object at 0x7f972de8c5f8>, 0))
    # Since F = D + E, F.grad_fn.next_functions has two entries, for D and E.
    # In each tuple, the first item is the grad_fn of the corresponding variable and
    # the second item is the index of that variable among the outputs of the op that
    # produced it. E is a leaf node without grad_fn, but it has a gradient accumulation
    # function, AccumulateGrad (a leaf may receive several gradients during back
    # propagation, which have to be accumulated)

    F.backward(retain_graph=True)  # back-propagate gradients
    print(A.grad, B.grad, E.grad)  # tensor(1.3591) tensor(5.4366) tensor(1.)
    # The computed gradients agree with those derived by hand

    print(C.grad, D.grad)  # None None
    # To save memory, gradients of intermediate nodes are not kept after back propagation
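The structure described above can be explored programmatically by following next_functions recursively. The helper walk below is a hypothetical name introduced for this sketch; node names such as AddBackward0 can differ slightly between PyTorch versions, so nothing here relies on the exact suffix.

```python
import torch

def walk(fn, depth=0, lines=None):
    """Collect the names of the backward nodes reachable from fn, depth first."""
    if lines is None:
        lines = []
    if fn is None:  # inputs that do not require grad have no node
        return lines
    lines.append("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1, lines)
    return lines

A = torch.tensor(2., requires_grad=True)
B = torch.tensor(.5, requires_grad=True)
E = torch.tensor(1., requires_grad=True)
F = (A * B).exp() + E

lines = walk(F.grad_fn)
print("\n".join(lines))
```

The indentation of the printed names mirrors the depth in the graph: the root is the addition, and each leaf tensor appears as an AccumulateGrad node.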

Let's look at another computational graph and simulate by hand the work autograd does on it:

    A = torch.tensor([3.], requires_grad=True)
    B = torch.tensor([2.], requires_grad=True)
    C = A ** 2
    D = B ** 2
    E = C * D
    F = D + E

    # manual_grad holds the gradients we compute by hand, simulating the autograd
    # process when the structure of the computational graph is known
    F.manual_grad = torch.tensor(1)
    D.manual_grad, E.manual_grad = F.grad_fn(F.manual_grad)
    C.manual_grad, tmp2 = E.grad_fn(E.manual_grad)
    D.manual_grad = D.manual_grad + tmp2  # first finish accumulating the gradient on D, then propagate further back
    A.manual_grad = C.grad_fn(C.manual_grad)
    B.manual_grad = D.grad_fn(D.manual_grad)
    # A.manual_grad, B.manual_grad:
    # (tensor([24.], grad_fn=<MulBackward0>), tensor([40.], grad_fn=<MulBackward0>))

Next, we write a simple function that performs autograd on this computational graph, and verify that the result is correct:

    # This example only works when every op produces a single output, and it is very
    # inefficient: each gradient arriving at a node is immediately propagated all the
    # way to the leaf nodes, instead of waiting until all gradients for that node
    # have arrived.
    def autograd(grad_fn, gradient):
        queue = [[grad_fn, gradient]]
        while queue != []:
            item = queue.pop()
            gradients = item[0](item[1])
            functions = [x[0] for x in item[0].next_functions]
            if type(gradients) is not tuple:
                gradients = (gradients, )
            for grad, func in zip(gradients, functions):
                if type(func).__name__ == 'AccumulateGrad':
                    if hasattr(func.variable, 'auto_grad'):
                        func.variable.auto_grad = func.variable.auto_grad + grad
                    else:
                        func.variable.auto_grad = grad
                else:
                    queue.append([func, grad])

    A = torch.tensor([3.], requires_grad=True)
    B = torch.tensor([2.], requires_grad=True)
    C = A ** 2
    D = B ** 2
    E = C * D
    F = D + E

    autograd(F.grad_fn, torch.tensor(1))
    print(A.auto_grad, B.auto_grad)  # tensor(24., grad_fn=<UnbindBackward>) tensor(40., grad_fn=<AddBackward0>)

    # This autograd can also be applied to a model. We will see that it produces
    # the same result as the backward provided by PyTorch.
    from torch import nn

    class MLP(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(10, 5)
            self.relu = nn.ReLU()
            self.fc2 = nn.Linear(5, 2)
            self.fc3 = nn.Linear(5, 2)
            self.fc4 = nn.Linear(2, 2)

        def forward(self, x):
            x = self.fc1(x)
            x = self.relu(x)
            x1 = self.fc2(x)
            x2 = self.fc3(x)
            x2 = self.relu(x2)
            x2 = self.fc4(x2)
            return x1 + x2

    x = torch.ones([10], requires_grad=True)
    mlp = MLP()
    mlp_state_dict = mlp.state_dict()

    # Custom autograd
    mlp = MLP()
    mlp.load_state_dict(mlp_state_dict)
    y = mlp(x)
    z = torch.sum(y)
    autograd(z.grad_fn, torch.tensor(1.))
    print(x.auto_grad)  # tensor([-0.0121, 0.0055, -0.0756, -0.0747, 0.0134, 0.0867, -0.0546, 0.1121, -0.0934, -0.1046], grad_fn=<AddBackward0>)

    # PyTorch's own backward
    mlp = MLP()
    mlp.load_state_dict(mlp_state_dict)
    y = mlp(x)
    z = torch.sum(y)
    z.backward()
    print(x.grad)  # tensor([-0.0121, 0.0055, -0.0756, -0.0747, 0.0134, 0.0867, -0.0546, 0.1121, -0.0934, -0.1046])
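The inefficiency mentioned in the comment above can be avoided by letting each node wait until all of its incoming gradients have arrived. The following is a minimal sketch of that idea, not PyTorch's actual engine: toy_backward is a hypothetical helper, it assumes every op has a single output, and it relies on backward nodes being callable and hashable from Python.

```python
import torch
from collections import deque

def toy_backward(root_fn, gradient):
    """Hypothetical helper: run each backward node once, after all its inputs arrived."""
    # Pass 1: count how many edges lead into each node
    pending = {root_fn: 0}
    stack = [root_fn]
    while stack:
        fn = stack.pop()
        for next_fn, _ in fn.next_functions:
            if next_fn is None:
                continue
            if next_fn not in pending:
                pending[next_fn] = 0
                stack.append(next_fn)
            pending[next_fn] += 1

    # Pass 2: a node becomes ready only when every incoming gradient has been summed
    grads = {root_fn: gradient}
    ready = deque([root_fn])
    leaf_grads = {}
    while ready:
        fn = ready.popleft()
        grad = grads.pop(fn)
        if type(fn).__name__ == 'AccumulateGrad':
            leaf_grads[fn.variable] = grad  # gradient of a leaf tensor
            continue
        outputs = fn(grad)
        if not isinstance(outputs, tuple):
            outputs = (outputs,)
        for out, (next_fn, _) in zip(outputs, fn.next_functions):
            if next_fn is None:
                continue
            grads[next_fn] = out if next_fn not in grads else grads[next_fn] + out
            pending[next_fn] -= 1
            if pending[next_fn] == 0:
                ready.append(next_fn)
    return leaf_grads

A = torch.tensor([3.], requires_grad=True)
B = torch.tensor([2.], requires_grad=True)
C = A ** 2
D = B ** 2
E = C * D
F = D + E

leaf_grads = toy_backward(F.grad_fn, torch.ones(1))
print(leaf_grads[A], leaf_grads[B])
```

Here the backward node of D is called once with the summed gradient, instead of once per incoming edge.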

PyTorch uses dynamic graphs: the computational graph is rebuilt from scratch on every forward pass, so Python control statements (such as for, if, etc.) can be used to build the graph as needed. Here is an example:

    def f(x):
        result = 1
        for ii in x:
            if ii.item() > 0:
                result = ii * result
        return result

    x = torch.tensor([0.3071, 1.1043, 1.3605, -0.3471], requires_grad=True)
    y = f(x)  # y = x[0]*x[1]*x[2]
    y.backward()
    print(x.grad)  # tensor([1.5023, 0.4178, 0.3391, 0.0000])

    x = torch.tensor([1.2817, 1.7840, -1.7033, 0.1302], requires_grad=True)
    y = f(x)  # y = x[0]*x[1]*x[3]
    y.backward()
    print(x.grad)  # tensor([0.2323, 0.1669, 0.0000, 2.2866])

The previous examples used the Tensor.backward() interface (which calls autograd.backward internally). We now introduce the jacobian() and hessian() interfaces provided by autograd, which can be used directly for automatic differentiation. Both take a function (which accepts input tensors and returns output tensors) and the input tensors, and return the Jacobian and Hessian matrices respectively. For the jacobian interface, the input and output can be n-dimensional tensors. For the hessian interface, the output must be a scalar. The tensor returned by jacobian has shape output_dim x input_dim (if the function output is a scalar, output_dim is omitted), and the tensor returned by hessian has shape input_dim x input_dim. In addition, both interfaces accept functions that take several input tensors, and jacobian also supports functions that return several output tensors.

    import torch
    from torch.autograd.functional import jacobian, hessian
    from torch.nn import Linear, AvgPool2d

    fc = Linear(4, 2)
    pool = AvgPool2d(kernel_size=2)

    def scalar_func(x):
        y = x ** 2
        z = torch.sum(y)
        return z

    def vector_func(x):
        y = fc(x)
        return y

    def mat_func(x):
        x = x.reshape((1, 1,) + x.shape)
        x = pool(x)
        x = x.reshape(x.shape[2:])
        return x ** 2

    vector_input = torch.randn(4, requires_grad=True)
    mat_input = torch.randn((4, 4), requires_grad=True)

    j = jacobian(scalar_func, vector_input)
    assert j.shape == (4, )
    assert torch.all(jacobian(scalar_func, vector_input) == 2 * vector_input)

    h = hessian(scalar_func, vector_input)
    assert h.shape == (4, 4)
    assert torch.all(hessian(scalar_func, vector_input) == 2 * torch.eye(4))

    j = jacobian(vector_func, vector_input)
    assert j.shape == (2, 4)
    assert torch.all(j == fc.weight)

    j = jacobian(mat_func, mat_input)
    assert j.shape == (2, 2, 4, 4)
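For functions with several inputs, the results come back as tuples. The sketch below uses two hypothetical functions, product and dot, chosen so that the expected matrices are easy to verify by hand.

```python
import torch
from torch.autograd.functional import jacobian, hessian

def product(x, y):  # hypothetical two-input function with a vector output
    return x * y

def dot(x, y):      # hypothetical two-input function with a scalar output
    return (x * y).sum()

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])

# With a tuple of inputs, jacobian returns one Jacobian per input:
# d(x*y)/dx = diag(y) and d(x*y)/dy = diag(x)
jac_x, jac_y = jacobian(product, (x, y))

# hessian returns a nested tuple: hess[i][j] is the block for inputs i and j
hess = hessian(dot, (x, y))

print(jac_x.shape, jac_y.shape, hess[0][1].shape)
```

For dot, the mixed block hess[0][1] is the identity matrix, while hess[0][0] is zero because the function is linear in x.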

As noted in the earlier example, autograd.backward() keeps only the gradients of leaf nodes in order to save memory. If we want the gradient of the output with respect to an intermediate result, we can use the autograd.grad() interface or the hook mechanism:

    A = torch.tensor(2., requires_grad=True)
    B = torch.tensor(.5, requires_grad=True)
    C = A * B
    D = C.exp()
    torch.autograd.grad(D, (C, A))  # (tensor(2.7183), tensor(1.3591))
    # The gradients are returned as a tuple; grad supports several variables at once

    def variable_hook(grad):  # a hook is registered on a tensor; its input is the gradient propagated back to that tensor
        print('the gradient of C is: ', grad)

    A = torch.tensor(2., requires_grad=True)
    B = torch.tensor(.5, requires_grad=True)
    C = A * B
    hook_handle = C.register_hook(variable_hook)  # register the hook on the intermediate variable C
    D = C.exp()
    D.backward()  # prints during back propagation: the gradient of C is: tensor(2.7183)
    hook_handle.remove()  # remove the hook once it is no longer needed

## torch.autograd.gradcheck (numerical gradient check)

After writing your own autograd function, you can use the gradcheck and gradgradcheck interfaces provided in torch.autograd.gradcheck to compare the numerically computed gradient with the analytically derived one, and so check whether backward is written correctly. Taking a function $f(x)$ as an example, the gradient at a point $x$ obtained by the numerical method is:

$$\frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon}$$

where $\epsilon$ corresponds to the eps argument. In the following example, we implement the Sigmoid function ourselves and use gradcheck to check whether its backward is written correctly.

    import torch
    from torch.autograd import Function

    class Sigmoid(Function):
        @staticmethod
        def forward(ctx, x):
            output = 1 / (1 + torch.exp(-x))
            ctx.save_for_backward(output)
            return output

        @staticmethod
        def backward(ctx, grad_output):
            output, = ctx.saved_tensors
            grad_x = output * (1 - output) * grad_output
            return grad_x

    test_input = torch.randn(4, requires_grad=True)  # tensor([-0.4646, -0.4403, 1.2525, -0.5953], requires_grad=True)
    torch.autograd.gradcheck(Sigmoid.apply, (test_input,), eps=1e-3)  # pass
    torch.autograd.gradcheck(torch.sigmoid, (test_input,), eps=1e-3)  # pass
    torch.autograd.gradcheck(Sigmoid.apply, (test_input,), eps=1e-4)  # fail
    torch.autograd.gradcheck(torch.sigmoid, (test_input,), eps=1e-4)  # fail

We find that with eps=1e-3, both our own Sigmoid and the sigmoid built into torch pass the gradient check, but when eps drops to 1e-4, neither passes. Intuitively, the smaller eps is, the closer the numerical gradient should be to the true gradient. The anomaly here is caused by machine precision: test_input has type torch.float32, so when eps is too small the rounding error becomes large (eps is the divisor when computing the numerical gradient), and the numerical gradient deviates noticeably from the true gradient. After test_input is changed to a float64 tensor, the phenomenon no longer occurs. This is also a reminder that when writing backward, we should keep the properties of numerical computation in mind and preserve as much precision as possible.

    test_input = torch.randn(4, requires_grad=True, dtype=torch.float64)  # tensor([-0.4646, -0.4403, 1.2525, -0.5953], dtype=torch.float64, requires_grad=True)
    torch.autograd.gradcheck(Sigmoid.apply, (test_input,), eps=1e-4)  # pass
    torch.autograd.gradcheck(torch.sigmoid, (test_input,), eps=1e-4)  # pass
    torch.autograd.gradcheck(Sigmoid.apply, (test_input,), eps=1e-6)  # pass
    torch.autograd.gradcheck(torch.sigmoid, (test_input,), eps=1e-6)  # pass
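The precision effect can be reproduced by hand with a central difference. This is a simplified sketch and not what gradcheck does internally: numerical_grad and max_error are hypothetical helpers, the shortcut only works for elementwise functions such as sigmoid, and the input values are the ones printed in the comments above.

```python
import torch

def numerical_grad(fn, x, eps):
    """Central difference (fn(x + eps) - fn(x - eps)) / (2 * eps), elementwise."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

def max_error(dtype, eps):
    """Largest absolute gap between the numerical and the analytic sigmoid gradient."""
    x = torch.tensor([-0.4646, -0.4403, 1.2525, -0.5953], dtype=dtype)
    s = torch.sigmoid(x.double())
    exact = s * (1 - s)  # analytic derivative, evaluated in float64 as the reference
    approx = numerical_grad(torch.sigmoid, x, eps).double()
    return (approx - exact).abs().max().item()

for eps in (1e-3, 1e-4, 1e-6):
    print(eps, max_error(torch.float32, eps), max_error(torch.float64, eps))
```

In float32 the rounding error of the numerator is divided by a very small 2 * eps, so the error grows as eps shrinks. In float64 the rounding error is many orders of magnitude smaller, so the same eps values stay accurate.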

## torch.autograd.anomaly_mode (detect error generation path during automatic derivation)

Anomaly mode can be used to find the forward operation that led to an error during automatic differentiation. It is enabled with the context manager with autograd.detect_anomaly(): or with torch.autograd.set_detect_anomaly(True):

    >>> import torch
    >>> from torch import autograd
    >>>
    >>> class MyFunc(autograd.Function):
    ...
    ...     @staticmethod
    ...     def forward(ctx, inp):
    ...         return inp.clone()
    ...
    ...     @staticmethod
    ...     def backward(ctx, gO):
    ...         # Error during the backward pass
    ...         raise RuntimeError("Some error in backward")
    ...         return gO.clone()
    >>>
    >>> def run_fn(a):
    ...     out = MyFunc.apply(a)
    ...     return out.sum()
    >>>
    >>> inp = torch.rand(10, 10, requires_grad=True)
    >>> out = run_fn(inp)
    >>> out.backward()
    Traceback (most recent call last):
      Some Error Log
    RuntimeError: Some error in backward
    >>> with autograd.detect_anomaly():
    ...     inp = torch.rand(10, 10, requires_grad=True)
    ...     out = run_fn(inp)
    ...     out.backward()
    Traceback of forward call that caused the error:  # the forward call that led to the error is reported
      File "tmp.py", line 53, in <module>
        out = run_fn(inp)
      File "tmp.py", line 44, in run_fn
        out = MyFunc.apply(a)
    Traceback (most recent call last):
      Some Error Log
    RuntimeError: Some error in backward
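Besides exceptions raised in backward, anomaly mode also reports backward functions that produce nan values. The sketch below builds such a case on purpose; the function run is a hypothetical example of my own, not part of PyTorch.

```python
import torch
from torch import autograd

def run():
    x = torch.tensor([0.0], requires_grad=True)
    y = torch.sqrt(x) * 0.0   # the forward result is fine: sqrt(0) * 0 = 0
    y.sum().backward()        # the backward of sqrt computes 0 / (2 * sqrt(0)) = nan
    return x.grad

# Without anomaly mode no error is raised, the nan silently ends up in x.grad
print(run())

# With anomaly mode the nan is turned into an error that names the backward function
caught = None
try:
    with autograd.detect_anomaly():
        run()
except RuntimeError as err:
    caught = err
print(type(caught).__name__)
```

Anomaly mode slows execution down, so it is best enabled only while debugging.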

## torch.autograd.grad_mode (set whether gradient is required)

During inference, we do not want autograd to track tensors, because it has to cache many intermediate results, which adds memory / GPU memory overhead. Turning off automatic differentiation during inference gives some speed-up and saves a lot of memory and GPU memory (the savings are not limited to the part that would have been used to store gradients). We can use torch.no_grad() from grad_mode to turn off automatic differentiation:

    from torchvision.models import resnet50
    import torch

    net = resnet50().cuda(0)
    num = 128
    inp = torch.ones([num, 3, 224, 224]).cuda(0)
    net(inp)  # without torch.no_grad(), a batch size of 128 runs out of memory (on a 1080 Ti)

    net = resnet50().cuda(1)
    num = 512
    inp = torch.ones([num, 3, 224, 224]).cuda(1)
    with torch.no_grad():  # with torch.no_grad(), inference still runs at a batch size of 512 (more than 4x GPU memory saved)
        net(inp)
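What no_grad changes can also be observed on the CPU without a large model. The sketch below checks whether the results are attached to a graph. It also uses the decorator form and torch.enable_grad(), and predict is a hypothetical function name.

```python
import torch

w = torch.ones(3, requires_grad=True)

# Default: the result is part of a graph and has a grad_fn
y = (w * 2).sum()
print(y.requires_grad, y.grad_fn is not None)

# Inside no_grad nothing is recorded, so no intermediate results are kept for backward
with torch.no_grad():
    y_ng = (w * 2).sum()
print(y_ng.requires_grad, y_ng.grad_fn)

# no_grad also works as a decorator
@torch.no_grad()
def predict(t):
    return (t * 2).sum()

# enable_grad switches recording back on inside a no_grad block
with torch.no_grad():
    with torch.enable_grad():
        y_en = (w * 2).sum()

print(predict(w).requires_grad, y_en.requires_grad)
```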

## model.eval() and torch.no_grad()

These two are actually unrelated, and both should be used during inference: model.eval() puts modules such as BatchNorm and Dropout into eval mode to ensure correct inference results, but it does not save GPU memory; torch.no_grad() declares that gradients are not computed, which saves a lot of memory and GPU memory.
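The independence of the two switches can be checked on a tiny model. The following sketch uses an arbitrary Linear plus Dropout stack chosen for illustration.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(2, 4)

# eval() only changes module behaviour: Dropout becomes the identity,
# but the output is still tracked by autograd
model.eval()
out_eval = model(x)
eval_tracks_grad = out_eval.requires_grad
eval_deterministic = torch.equal(model(x), model(x))

# no_grad() only disables graph recording: in train mode Dropout is still active
model.train()
with torch.no_grad():
    out_no_grad = model(x)
no_grad_tracks_grad = out_no_grad.requires_grad

# For inference, combine both
model.eval()
with torch.no_grad():
    out_inference = model(x)

print(eval_tracks_grad, eval_deterministic, no_grad_tracks_grad, out_inference.requires_grad)
```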

## torch.autograd.profiler (provides function level statistics)

    import torch
    from torchvision.models import resnet18

    x = torch.randn((1, 3, 224, 224), requires_grad=True)
    model = resnet18()
    with torch.autograd.profiler.profile() as prof:
        for _ in range(100):
            y = model(x)
            y = torch.sum(y)
            y.backward()
    # NOTE: some columns were removed for brevity
    print(prof.key_averages().table(sort_by="self_cpu_time_total"))

The output includes CPU time, percentage, number of calls and other information. Since a kernel may call other kernels, Self CPU refers to the time consumed by the kernel itself, excluding the time spent in the kernels it calls:

| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | # of Calls |
|:---|---:|---:|---:|---:|---:|---:|
| aten::mkldnn_convolution_backward_input | 18.69% | 1.722s | 18.88% | 1.740s | 870.001us | 2000 |
| aten::mkldnn_convolution | 17.07% | 1.573s | 17.28% | 1.593s | 796.539us | 2000 |
| aten::mkldnn_convolution_backward_weights | 16.96% | 1.563s | 17.21% | 1.586s | 792.996us | 2000 |
| aten::native_batch_norm | 9.51% | 876.994ms | 15.06% | 1.388s | 694.049us | 2000 |
| aten::max_pool2d_with_indices | 9.47% | 872.695ms | 9.48% | 873.802ms | 8.738ms | 100 |
| aten::select | 7.00% | 645.298ms | 10.06% | 926.831ms | 7.356us | 126000 |
| aten::native_batch_norm_backward | 6.67% | 614.718ms | 12.16% | 1.121s | 560.466us | 2000 |
| aten::as_strided | 3.07% | 282.885ms | 3.07% | 282.885ms | 2.229us | 126900 |
| aten::add_ | 2.85% | 262.832ms | 2.85% | 262.832ms | 37.350us | 7037 |
| aten::empty | 1.23% | 113.274ms | 1.23% | 113.274ms | 4.089us | 27700 |
| aten::threshold_backward | 1.10% | 101.094ms | 1.17% | 107.383ms | 63.166us | 1700 |
| aten::add | 0.88% | 81.476ms | 0.99% | 91.350ms | 32.625us | 2800 |
| aten::max_pool2d_with_indices_backward | 0.86% | 79.174ms | 1.02% | 93.706ms | 937.064us | 100 |
| aten::threshold_ | 0.56% | 51.678ms | 0.56% | 51.678ms | 30.399us | 1700 |
| torch::autograd::AccumulateGrad | 0.40% | 36.909ms | 2.81% | 258.754ms | 41.072us | 6300 |
| aten::empty_like | 0.35% | 32.532ms | 0.63% | 57.630ms | 6.861us | 8400 |
| NativeBatchNormBackward | 0.32% | 29.572ms | 12.48% | 1.151s | 575.252us | 2000 |
| aten::_convolution | 0.31% | 28.182ms | 17.63% | 1.625s | 812.258us | 2000 |
| aten::mm | 0.27% | 24.983ms | 0.32% | 29.522ms | 147.611us | 200 |
| aten::stride | 0.27% | 24.665ms | 0.27% | 24.665ms | 0.583us | 42300 |
| aten::mkldnn_convolution_backward | 0.22% | 20.025ms | 36.33% | 3.348s | 1.674ms | 2000 |
| MkldnnConvolutionBackward | 0.21% | 19.112ms | 36.53% | 3.367s | 1.684ms | 2000 |
| aten::relu_ | 0.20% | 18.611ms | 0.76% | 70.289ms | 41.346us | 1700 |
| aten::_batch_norm_impl_index | 0.16% | 14.298ms | 15.32% | 1.413s | 706.254us | 2000 |
| aten::addmm | 0.14% | 12.684ms | 0.15% | 14.138ms | 141.377us | 100 |
| aten::fill_ | 0.14% | 12.672ms | 0.14% | 12.672ms | 21.120us | 600 |
| ReluBackward1 | 0.13% | 11.845ms | 1.29% | 119.228ms | 70.134us | 1700 |
| aten::as_strided_ | 0.13% | 11.674ms | 0.13% | 11.674ms | 1.946us | 6000 |
| aten::div | 0.11% | 10.246ms | 0.13% | 12.288ms | 122.876us | 100 |
| aten::batch_norm | 0.10% | 8.894ms | 15.42% | 1.421s | 710.700us | 2000 |
| aten::convolution | 0.08% | 7.478ms | 17.71% | 1.632s | 815.997us | 2000 |
| aten::sum | 0.08% | 7.066ms | 0.10% | 9.424ms | 31.415us | 300 |
| aten::conv2d | 0.07% | 6.851ms | 17.78% | 1.639s | 819.423us | 2000 |
| aten::contiguous | 0.06% | 5.597ms | 0.06% | 5.597ms | 0.903us | 6200 |
| aten::copy_ | 0.04% | 3.759ms | 0.04% | 3.980ms | 7.959us | 500 |
| aten::t | 0.04% | 3.526ms | 0.06% | 5.561ms | 11.122us | 500 |
| aten::view | 0.03% | 2.611ms | 0.03% | 2.611ms | 8.702us | 300 |
| aten::div_ | 0.02% | 1.973ms | 0.04% | 4.051ms | 40.512us | 100 |
| aten::expand | 0.02% | 1.720ms | 0.02% | 2.225ms | 7.415us | 300 |
| AddmmBackward | 0.02% | 1.601ms | 0.37% | 34.141ms | 341.414us | 100 |
| aten::to | 0.02% | 1.596ms | 0.04% | 3.871ms | 12.902us | 300 |
| aten::mean | 0.02% | 1.485ms | 0.10% | 9.204ms | 92.035us | 100 |
| AddBackward0 | 0.01% | 1.381ms | 0.01% | 1.381ms | 1.726us | 800 |
| aten::transpose | 0.01% | 1.297ms | 0.02% | 2.035ms | 4.071us | 500 |
| aten::empty_strided | 0.01% | 1.163ms | 0.01% | 1.163ms | 3.877us | 300 |
| MaxPool2DWithIndicesBackward | 0.01% | 1.095ms | 1.03% | 94.802ms | 948.018us | 100 |
| MeanBackward1 | 0.01% | 974.822us | 0.16% | 14.393ms | 143.931us | 100 |
| aten::resize_ | 0.01% | 911.689us | 0.01% | 911.689us | 3.039us | 300 |
| aten::zeros_like | 0.01% | 884.496us | 0.11% | 10.384ms | 103.843us | 100 |
| aten::clone | 0.01% | 798.993us | 0.04% | 3.687ms | 18.435us | 200 |
| aten::reshape | 0.01% | 763.804us | 0.03% | 2.604ms | 13.021us | 200 |
| aten::zero_ | 0.01% | 689.598us | 0.13% | 11.919ms | 59.595us | 200 |
| aten::resize_as_ | 0.01% | 562.349us | 0.01% | 776.967us | 7.770us | 100 |
| aten::max_pool2d | 0.01% | 492.109us | 9.49% | 874.295ms | 8.743ms | 100 |
| aten::adaptive_avg_pool2d | 0.01% | 469.736us | 0.10% | 9.673ms | 96.733us | 100 |
| aten::ones_like | 0.00% | 460.352us | 0.01% | 1.377ms | 13.766us | 100 |
| SumBackward0 | 0.00% | 399.188us | 0.01% | 1.206ms | 12.057us | 100 |
| aten::flatten | 0.00% | 397.053us | 0.02% | 1.917ms | 19.165us | 100 |
| ViewBackward | 0.00% | 351.824us | 0.02% | 1.436ms | 14.365us | 100 |
| TBackward | 0.00% | 308.947us | 0.01% | 1.315ms | 13.150us | 100 |
| detach | 0.00% | 127.329us | 0.00% | 127.329us | 2.021us | 63 |
| torch::autograd::GraphRoot | 0.00% | 114.731us | 0.00% | 114.731us | 1.147us | 100 |
| aten::detach | 0.00% | 106.170us | 0.00% | 233.499us | 3.706us | 63 |

Self CPU time total: 9.217s
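For a quick experiment that does not need torchvision, the same profiler can be run on a small matrix product. This is a sketch under assumptions: the label "my_matmul" is arbitrary, record_function is used to mark a region of code so that it should appear as its own row, and the operator names in the output may vary between PyTorch versions.

```python
import torch
from torch.autograd import profiler

a = torch.randn(64, 64)
b = torch.randn(64, 64)

with profiler.profile() as prof:
    with profiler.record_function("my_matmul"):  # "my_matmul" is an arbitrary label
        for _ in range(10):
            torch.mm(a, b)

averages = prof.key_averages()
names = [evt.key for evt in averages]  # one entry per distinct operator or label
table = averages.table(sort_by="self_cpu_time_total", row_limit=5)
print(table)
```

The row_limit argument restricts the printed table to the most expensive entries, which helps with large models.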
