Index in so net_ Function and implementation of Max

1, Background

  in the Encoder part of so net classification model, first_ After pointnet gives the eigenvector, there is an operation to find the index:

		M = node.size()[2]
        with torch.cuda.device(self.first_pn_out.get_device()):
            gather_index = index_max.forward_cuda(self.first_pn_out.detach(),
                                                  min_idx.int(),
                                                  M).detach().long()

Index here_ Max in. / models/index_max_ext file is the CUDA extension of pytorch, and the interface is C + + interface. Because there is no basis for CUDA Programming, I had to find relevant books, briefly understand the basic knowledge, read relevant blogs and posts, and gradually understand what the function of this extension program is.

2, Concrete implementation

2.1 document structure

    implement a CUDA extension of pytorch and take C + + as the interface. At least three parts are required, including setup.py,. Cu and. CPP files. Refer to index for details_ Max is setup.py, index_max.cpp and index_max_cuda.cu file. The following describes the specific functions of these three files one by one.

2.2 setup.py

  this file is used to compile the following files. The specific code is as follows:

import setuptools
import torch
from setuptools import setup
from torch.utils.cpp_extension import CppExtension, CUDAExtension, BuildExtension


setup(name='index_max',  # Compiled link library name
      ext_modules=[CUDAExtension('index_max', ['index_max.cpp', 'index_max_cuda.cu'])],  # Files to be compiled and compiled functions
      cmdclass={'build_ext': BuildExtension})  # Execute compile command settings

2.3 index_max.cpp

  some parts in the original document are not used, so they will not be introduced, and the parts used will be explained directly. The functions used in. cu are declared in the. cpp file to realize the specific functions of the function.

#include <torch/extension.h>
#include <iostream>
#include <vector>
#include <thread>

torch::Tensor index_max_forward_cpu(const torch::Tensor data,const torch::Tensor index,const int K) 
{
	int B = data.size(0);
	int C = data.size(1);
	int N = data.size(2);
	
	torch::Tensor max_idx = torch::zeros({B, C, K}, torch::TensorOptions().dtype(torch::kInt32).requires_grad(false));
	torch::Tensor max_val = torch::ones({B, C, K}, torch::TensorOptions().dtype(torch::kFloat32)) * -1000.0;
	
	auto data_a = data.accessor<float, 3>();
    auto index_a = index.accessor<int, 2>();
    auto max_idx_a = max_idx.accessor<int, 3>();
    auto max_val_a = max_val.accessor<float, 3>();

	for (int b=0; b<B; ++b) 
	{
		for (int c=0; c<C; ++c) 
		{
			for (int n=0; n<N; ++n) 
			{
				int k = index_a[b][n];
				float data_point = data_a[b][c][n];
				if (data_point > max_val_a[b][c][k]) 
				{
					max_val_a[b][c][k] = data_point;
					max_idx_a[b][c][k] = n;
				}
			}
		}
	}
    return max_idx;
}

#define CHECK_CUDA(x) AT_ASSERTM(x.type().is_cuda(), #x " must be a CUDA tensor/variable")
#define CHECK_CONTIGUOUS(x) AT_ASSERTM(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

torch::Tensor index_max_forward_cuda_wrapper(const torch::Tensor data,const torch::Tensor index,const int K){
	CHECK_INPUT(data);
	CHECK_INPUT(index);

    return index_max_forward_cuda(data, index, K);
}

//binding
PYBIND11_MODULE(index_max, m)
{
    m.def("forward_cuda", &index_max_forward_cuda_wrapper, "CUDA code without shared memory");
}

  the parameters passed in by this function are:

self.first_pn_out.detach()  # [B,C,kN]
min_idx.int():  # [B,kN]
M  # Number of nodes

  in the program, k=index_a[b][n], indicating the index of a nearest neighbor node of a point in a batch, data_point is the characteristic value of a dimension of the corresponding node. If the characteristic value is greater than the existing value, update the value and record the corresponding index. This is equal to: Max all points with each node as the nearest neighbor node (the number of points corresponding to each node is different)_ Pool, for example, suppose that the original point cloud number is from 0-1024 and the node number is from 0-64. For node 1, there are four points numbered 01235891024, which are regarded as nearest neighbors. Then the function is to do a max for the eigenvectors of these four nodes_ Pooling, whose result is the characteristic of node 0. Only the index of the point corresponding to the maximum value is returned in the program. Combined with the following code, it can be seen that the function is to do Max once for the eigenvector of the corresponding original point of each node_ Pooling, whose output is used as the new node eigenvector.

self.first_pn_out_masked_max = self.first_pn_out.gather(dim=2, index=gather_index * mask_row_max.unsqueeze(1).long())  # BxCxM

2.4 index_max_cuda.cu

   CUDA Programming aims to accelerate the operation of the program and improve efficiency. The main part of the code is the accelerated version of specific functions in the. cpp file.

#include <ATen/ATen.h>
#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>

__global__ void index_max_forward_cuda_kernel(const float* __restrict__ data,
		const int* __restrict__ index,
		int* __restrict__ max_idx,
		float* __restrict__ max_val,
		const int B, const int C, const int N, const int K){
	int b = threadIdx.x;  //Thread id 0 ~ B-1
	int c = blockIdx.x;  //Block id 0 ~ C-1

	for(int n=0;n<N;++n){
		int k = index[b*N+n];
		float data_point = data[b*C*N+c*N+n];
		if (data_point > max_val[b*C*K+c*K+k]){
			max_val[b*C*K+c*K+k] = data_point;
			max_idx[b*C*K+c*K+k] = n;
		}
	}
}

torch::Tensor index_max_forward_cuda(const torch::Tensor data, const torch::Tensor index, const int K){
	int B = data.size(0);  // batch_size
	int C = data.size(1);  // feature channels
	int N = data.size(2);  // number of points

	auto device_idx = data.device().index();
        auto max_idx = torch::zeros({B, C, K}, torch::TensorOptions().dtype(torch::kInt32).device(torch::kCUDA, device_idx));  // index
        auto max_val = torch::ones({B, C, K}, torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA, device_idx)) * -1000.0; // value

	index_max_forward_cuda_kernel<<<C, B>>>(data.data<float>(),
			index.data<int>(),
			max_idx.data<int>(),
			max_val.data<float>(),
			B, C, N, K);  // Start C blocks, each block contains B threads, and the parameters are data,index,max_index and max_value,batch_size,channels,number of points, number of neighbors

	return max_idx;
}

3, Summary

   due to the different number of original points corresponding to each node, the efficiency of serial computing is not as efficient as that of parallel computing. Therefore, the CUDA extension program of pytorch is designed. The calling order is:

./models/networks:                          index_max.forward_cuda()
./models/index_max_ext/index_max.cpp        index_max_forward_cuda_wrapper()
./models/index_max_ext/index_max.cpp        index_max_forward_cuda()
./models/index_max_ext/index_max_cuda.cu    index_max_forward_cuda()
./models/index_max_ext/index_max_cuda.cu    index_max_forward_cuda_kernel<<<C,B>>>(data.data<float>(),index.data<int>(),max_idx.data<int>(),max_val.data<float>(),B, C, N, K);

The last function is__ global__ Flag, indicating that it is performed on the device rather than the host.

Keywords: Python AI Pytorch

Added by deltawing on Thu, 04 Nov 2021 20:36:29 +0200