1, Background
in the Encoder part of so net classification model, first_ After pointnet gives the eigenvector, there is an operation to find the index:
M = node.size()[2] with torch.cuda.device(self.first_pn_out.get_device()): gather_index = index_max.forward_cuda(self.first_pn_out.detach(), min_idx.int(), M).detach().long()
Index here_ Max in. / models/index_max_ext file is the CUDA extension of pytorch, and the interface is C + + interface. Because there is no basis for CUDA Programming, I had to find relevant books, briefly understand the basic knowledge, read relevant blogs and posts, and gradually understand what the function of this extension program is.
2, Concrete implementation
2.1 document structure
implement a CUDA extension of pytorch and take C + + as the interface. At least three parts are required, including setup.py,. Cu and. CPP files. Refer to index for details_ Max is setup.py, index_max.cpp and index_max_cuda.cu file. The following describes the specific functions of these three files one by one.
2.2 setup.py
this file is used to compile the following files. The specific code is as follows:
import setuptools import torch from setuptools import setup from torch.utils.cpp_extension import CppExtension, CUDAExtension, BuildExtension setup(name='index_max', # Compiled link library name ext_modules=[CUDAExtension('index_max', ['index_max.cpp', 'index_max_cuda.cu'])], # Files to be compiled and compiled functions cmdclass={'build_ext': BuildExtension}) # Execute compile command settings
2.3 index_max.cpp
some parts in the original document are not used, so they will not be introduced, and the parts used will be explained directly. The functions used in. cu are declared in the. cpp file to realize the specific functions of the function.
#include <torch/extension.h> #include <iostream> #include <vector> #include <thread> torch::Tensor index_max_forward_cpu(const torch::Tensor data,const torch::Tensor index,const int K) { int B = data.size(0); int C = data.size(1); int N = data.size(2); torch::Tensor max_idx = torch::zeros({B, C, K}, torch::TensorOptions().dtype(torch::kInt32).requires_grad(false)); torch::Tensor max_val = torch::ones({B, C, K}, torch::TensorOptions().dtype(torch::kFloat32)) * -1000.0; auto data_a = data.accessor<float, 3>(); auto index_a = index.accessor<int, 2>(); auto max_idx_a = max_idx.accessor<int, 3>(); auto max_val_a = max_val.accessor<float, 3>(); for (int b=0; b<B; ++b) { for (int c=0; c<C; ++c) { for (int n=0; n<N; ++n) { int k = index_a[b][n]; float data_point = data_a[b][c][n]; if (data_point > max_val_a[b][c][k]) { max_val_a[b][c][k] = data_point; max_idx_a[b][c][k] = n; } } } } return max_idx; } #define CHECK_CUDA(x) AT_ASSERTM(x.type().is_cuda(), #x " must be a CUDA tensor/variable") #define CHECK_CONTIGUOUS(x) AT_ASSERTM(x.is_contiguous(), #x " must be contiguous") #define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x) torch::Tensor index_max_forward_cuda_wrapper(const torch::Tensor data,const torch::Tensor index,const int K){ CHECK_INPUT(data); CHECK_INPUT(index); return index_max_forward_cuda(data, index, K); } //binding PYBIND11_MODULE(index_max, m) { m.def("forward_cuda", &index_max_forward_cuda_wrapper, "CUDA code without shared memory"); }
the parameters passed in by this function are:
self.first_pn_out.detach() # [B,C,kN] min_idx.int(): # [B,kN] M # Number of nodes
in the program, k=index_a[b][n], indicating the index of a nearest neighbor node of a point in a batch, data_point is the characteristic value of a dimension of the corresponding node. If the characteristic value is greater than the existing value, update the value and record the corresponding index. This is equal to: Max all points with each node as the nearest neighbor node (the number of points corresponding to each node is different)_ Pool, for example, suppose that the original point cloud number is from 0-1024 and the node number is from 0-64. For node 1, there are four points numbered 01235891024, which are regarded as nearest neighbors. Then the function is to do a max for the eigenvectors of these four nodes_ Pooling, whose result is the characteristic of node 0. Only the index of the point corresponding to the maximum value is returned in the program. Combined with the following code, it can be seen that the function is to do Max once for the eigenvector of the corresponding original point of each node_ Pooling, whose output is used as the new node eigenvector.
self.first_pn_out_masked_max = self.first_pn_out.gather(dim=2, index=gather_index * mask_row_max.unsqueeze(1).long()) # BxCxM
2.4 index_max_cuda.cu
CUDA Programming aims to accelerate the operation of the program and improve efficiency. The main part of the code is the accelerated version of specific functions in the. cpp file.
#include <ATen/ATen.h> #include <torch/extension.h> #include <cuda.h> #include <cuda_runtime.h> #include <iostream> __global__ void index_max_forward_cuda_kernel(const float* __restrict__ data, const int* __restrict__ index, int* __restrict__ max_idx, float* __restrict__ max_val, const int B, const int C, const int N, const int K){ int b = threadIdx.x; //Thread id 0 ~ B-1 int c = blockIdx.x; //Block id 0 ~ C-1 for(int n=0;n<N;++n){ int k = index[b*N+n]; float data_point = data[b*C*N+c*N+n]; if (data_point > max_val[b*C*K+c*K+k]){ max_val[b*C*K+c*K+k] = data_point; max_idx[b*C*K+c*K+k] = n; } } } torch::Tensor index_max_forward_cuda(const torch::Tensor data, const torch::Tensor index, const int K){ int B = data.size(0); // batch_size int C = data.size(1); // feature channels int N = data.size(2); // number of points auto device_idx = data.device().index(); auto max_idx = torch::zeros({B, C, K}, torch::TensorOptions().dtype(torch::kInt32).device(torch::kCUDA, device_idx)); // index auto max_val = torch::ones({B, C, K}, torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA, device_idx)) * -1000.0; // value index_max_forward_cuda_kernel<<<C, B>>>(data.data<float>(), index.data<int>(), max_idx.data<int>(), max_val.data<float>(), B, C, N, K); // Start C blocks, each block contains B threads, and the parameters are data,index,max_index and max_value,batch_size,channels,number of points, number of neighbors return max_idx; }
3, Summary
due to the different number of original points corresponding to each node, the efficiency of serial computing is not as efficient as that of parallel computing. Therefore, the CUDA extension program of pytorch is designed. The calling order is:
./models/networks: index_max.forward_cuda() ./models/index_max_ext/index_max.cpp index_max_forward_cuda_wrapper() ./models/index_max_ext/index_max.cpp index_max_forward_cuda() ./models/index_max_ext/index_max_cuda.cu index_max_forward_cuda() ./models/index_max_ext/index_max_cuda.cu index_max_forward_cuda_kernel<<<C,B>>>(data.data<float>(),index.data<int>(),max_idx.data<int>(),max_val.data<float>(),B, C, N, K);
The last function is__ global__ Flag, indicating that it is performed on the device rather than the host.