CUDA parallel programming review (histogram, convolution, scan, prefix sum)

Chapter 5 thread execution efficiency and SIMD

  1. Single instruction, multiple data (SIMD) execution: when a warp executes, all threads in the warp execute the same instruction at any given time
  2. Control divergence: occurs when threads in a warp take different control-flow paths because they make different control decisions. The different paths end up being executed serially, one after the other. It can occur when a branch or loop condition depends on the thread index. It happens within a block (each block is partitioned into 32-thread warps)
  3. The impact depends on the data. For programs over large data sets the impact is small: the control divergence caused by boundary checks is insignificant, and a large number of control-flow statements in a kernel does not necessarily mean a large amount of control divergence; see the sketch below
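
    A minimal sketch of the two cases (the kernel name and data are illustrative, not from the course):

    __global__ void divergence_demo(float *data, int n)
    {
      int i = threadIdx.x + blockIdx.x*blockDim.x;
      if (i >= n) return;            // boundary check: diverges only in the last warp

      if (threadIdx.x % 2 == 0)      // condition on the thread index: every warp
        data[i] *= 2.0f;             // diverges and runs both paths serially
      else
        data[i] += 1.0f;

      if (blockIdx.x % 2 == 0)       // same value for every thread of the block:
        data[i] -= 0.5f;             // no divergence inside any warp
    }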

Chapter 7 histogram calculation

  1. Histogram: a method to extract salient features and patterns from large data sets

    For example: object recognition in images, feature extraction, credit card fraud detection, correlating the motion of celestial objects

    Basic histogram

  2. Segmented partitioning leads to poor memory access: adjacent threads do not access adjacent memory locations, so the accesses cannot be coalesced and DRAM bandwidth utilization is low

    With interleaved partitioning, all threads process one contiguous section of elements, then all move on to the next section and repeat, so memory accesses are coalesced; the two indexing patterns are sketched below
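
    A minimal sketch of the two partitionings (t, total_threads, and process() are illustrative names):

    int t = threadIdx.x + blockIdx.x*blockDim.x;
    int total_threads = blockDim.x*gridDim.x;

    // Segmented: thread t owns one contiguous chunk. In any given iteration,
    // neighboring threads touch addresses a whole chunk apart -> not coalesced.
    int chunk = (size + total_threads - 1) / total_threads;
    for (int k = 0; k < chunk; k++) {
      int idx = t*chunk + k;
      if (idx < size) process(b[idx]);
    }

    // Interleaved: in each iteration, neighboring threads touch neighboring
    // addresses -> coalesced.
    for (int idx = t; idx < size; idx += total_threads)
      process(b[idx]);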

  3. data race

    When multiple threads operate on the same piece of data, the interleaving of reads and writes can cause a thread to work with a value that has already been modified, so part of the updates is lost. There are four such interleaving cases, and all of them can be avoided with atomic operations

  4. CUDA atomics: add, sub, inc, dec, min, max, exch, CAS

    int atomicAdd(int *address, int val); overloads exist for int, unsigned int, unsigned long long int, and float (four operand types)
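
    atomicCAS can be used to build atomics that have no native overload; a sketch of the retry-loop pattern from the CUDA C Programming Guide (the wrapper name atomicAddDouble is ours):

    __device__ double atomicAddDouble(double *address, double val)
    {
      unsigned long long int *p = (unsigned long long int *)address;
      unsigned long long int old = *p, assumed;
      do {
        assumed = old;   // try to swap in (assumed + val); retry if another thread won
        old = atomicCAS(p, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
      } while (assumed != old);
      return __longlong_as_double(old);
    }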

  5. Basic histogram code

    __global__ void hist(uchar *b, long size, int *histo)   // uchar = unsigned char
    {
      int i = threadIdx.x + blockIdx.x*blockDim.x;   // interleaved partitioning
      int stride = blockDim.x*gridDim.x;

      while (i < size) {
        int pos = b[i] - 'a';
        if (pos >= 0 && pos < 26) {                  // letters a-z only
          atomicAdd(&histo[pos/4], 1);               // 7 bins of 4 letters each
        }
        i += stride;
      }
    }
    
  6. Histogram privatization

    Creating private histograms adds the space cost of the copies and the cost of merging all copies into the final result. However, contention, and therefore serialization, on both the private histograms and the final merge is greatly reduced, improving performance by 10x or more

  7. Histogram privatization code

    __global__ void hist_p(uchar *b, long size, int *histo)
    {
      __shared__ int histo_s[7];         // per-block private copy ("private" is a C++ keyword)
      if (threadIdx.x < 7)
        histo_s[threadIdx.x] = 0;
      __syncthreads();

      int i = threadIdx.x + blockIdx.x*blockDim.x;
      int stride = blockDim.x*gridDim.x;
      while (i < size) {
        int pos = b[i] - 'a';
        if (pos >= 0 && pos < 26) {
          atomicAdd(&histo_s[pos/4], 1); // contention only within this block
        }
        i += stride;
      }
      __syncthreads();

      if (threadIdx.x < 7)               // merge the private copy into the global histogram
        atomicAdd(&histo[threadIdx.x], histo_s[threadIdx.x]);
    }
    
  8. The operation must be associative and commutative, and the histogram should be small enough to fit in shared memory. If it is too large, partial privatization can be used: keep part of the bins in shared memory and send the rest to global memory, selected by a range test (see the sketch below)
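
    A minimal sketch of partial privatization, assuming the first PRIV_BINS bins are the frequently hit ones (PRIV_BINS, nbins, and the bin index are illustrative):

    #define PRIV_BINS 64

    __global__ void hist_partial(uchar *b, long size, int *histo, int nbins)
    {
      __shared__ int histo_s[PRIV_BINS];
      for (int j = threadIdx.x; j < PRIV_BINS; j += blockDim.x)
        histo_s[j] = 0;
      __syncthreads();

      for (long i = threadIdx.x + (long)blockIdx.x*blockDim.x;
           i < size; i += (long)blockDim.x*gridDim.x) {
        int pos = b[i];                    // bin index (illustrative)
        if (pos < PRIV_BINS)
          atomicAdd(&histo_s[pos], 1);     // range test: hot bins stay in shared memory
        else if (pos < nbins)
          atomicAdd(&histo[pos], 1);       // cold bins go straight to global memory
      }
      __syncthreads();

      for (int j = threadIdx.x; j < PRIV_BINS; j += blockDim.x)
        atomicAdd(&histo[j], histo_s[j]);  // merge the private bins
    }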

Chapter 8 stencil convolution

  1. Convolution: an array operation in which each output element is a weighted sum of neighboring input elements. The weights are given by the convolution kernel (mask), which remains unchanged during the computation

  2. One dimensional convolutional code

    __global__ void con_1(float *N, float *M, float *P, int Mwid, int wid)
    {
      int i = threadIdx.x + blockIdx.x*blockDim.x;

      int start = i - Mwid/2;                      // left edge of this output's window
      float pValue = 0;
      for (int j = 0; j < Mwid; j++) {
        if (start + j >= 0 && start + j < wid)     // skip ghost elements at the boundary
          pValue += N[start + j] * M[j];
      }
      P[i] = pValue;
    }
    
  3. Two dimensional convolutional code

    __global__ void con_2(uchar *in, uchar *m, uchar *out, int mwid, int w, int h)
    {
      int col = threadIdx.x + blockIdx.x*blockDim.x;
      int row = threadIdx.y + blockIdx.y*blockDim.y;

      if (col < w && row < h) {
        int pvalue = 0;

        int startc = col - mwid/2;
        int startr = row - mwid/2;

        for (int i = 0; i < mwid; i++)
          for (int j = 0; j < mwid; j++) {
            int ci = i + startr;
            int cj = j + startc;
            if (ci > -1 && ci < h && cj > -1 && cj < w) {
              pvalue += in[ci*w + cj] * m[i*mwid + j];
            }
          }

        out[row*w + col] = (uchar)pvalue;
      }
    }
    
  4. Tiled 1D convolution: each block loads a whole tile plus its halo into shared memory; the threads that only load halo elements do not produce an output

    __global__ void con1_s(float *N, float *M, float *P, int Arraywid)
    {
      // launched with blockDim.x == O_TILE_WID + Mwid - 1: one load per thread
      __shared__ float Ns[O_TILE_WID + Mwid - 1];

      int o_idx = O_TILE_WID*blockIdx.x + threadIdx.x;   // output index
      int i_idx = o_idx - Mwid/2;                        // input index, shifted by the halo
      if (i_idx >= 0 && i_idx < Arraywid)
        Ns[threadIdx.x] = N[i_idx];
      else
        Ns[threadIdx.x] = 0.0f;                          // ghost elements
      __syncthreads();

      if (threadIdx.x < O_TILE_WID) {                    // only tile threads produce output
        float pvalue = 0;
        for (int j = 0; j < Mwid; j++)
          pvalue += M[j] * Ns[threadIdx.x + j];
        P[o_idx] = pvalue;
      }
    }
    
  5. Declaring a mask parameter const with __restrict__ tells the compiler the data is read-only and not aliased, so it can automatically serve the accesses through the constant/read-only cache path; a sketch follows
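
    A minimal sketch of the declaration (the kernel name and body are illustrative):

    // const + __restrict__ lets the compiler route these loads through the
    // read-only data cache (as with __ldg on newer architectures)
    __global__ void use_mask(float *out, const float* __restrict__ M, int Mwid)
    {
      out[threadIdx.x] = M[threadIdx.x % Mwid] * 2.0f;
    }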

  6. Tiled 2D convolution with shared memory

    __global__ void con_2s(float *P, float *N, int height, int wid, int pitch,
                           int channels, const float* __restrict__ M)
    {
      __shared__ float Ns[O_TILE_WID + Mwid - 1][O_TILE_WID + Mwid - 1];
      int ty = threadIdx.y;
      int tx = threadIdx.x;

      int o_col = blockIdx.x*O_TILE_WID + tx;
      int o_row = blockIdx.y*O_TILE_WID + ty;
      int i_col = o_col - Mwid/2;
      int i_row = o_row - Mwid/2;

      if (i_col >= 0 && i_col < wid && i_row >= 0 && i_row < height)
        Ns[ty][tx] = N[i_row*pitch + i_col];   // pitch: padded row width of N
      else
        Ns[ty][tx] = 0.0f;
      __syncthreads();

      if (ty < O_TILE_WID && tx < O_TILE_WID) {
        float pvalue = 0;
        for (int i = 0; i < Mwid; i++)
          for (int j = 0; j < Mwid; j++)
            pvalue += M[i*Mwid + j] * Ns[ty + i][tx + j];
        if (o_col < wid && o_row < height)
          P[o_row*wid + o_col] = pvalue;
      }
    }
    
  7. computational efficiency

    • For 1D convolution, the naive kernel performs O_TILE_WID * Mwid global-memory loads per tile; with shared memory only O_TILE_WID + Mwid - 1 loads are needed. Ignoring boundary elements, the bandwidth reduction is the ratio of the two
      • O_TILE_WID * Mwid / (O_TILE_WID + Mwid - 1)
    • For 2D convolution the ratio is squared, so the saving is larger
      • O_TILE_WID^2 * Mwid^2 / (O_TILE_WID + Mwid - 1)^2
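
    For example, with O_TILE_WID = 16 and Mwid = 5 (illustrative values): the 1D ratio is 16*5 / 20 = 4x, and the 2D ratio is (16^2 * 5^2) / 20^2 = 6400/400 = 16x.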

Chapter 9 parallel reduction

  1. Partition and summarize: requires that the elements of the data set can be processed in any order (the operation is associative and commutative). The data is divided into smaller chunks, one per thread, and the per-chunk results are combined through a reduction tree, similar to MapReduce

  2. Conventional computations: summation, factorial, and the like

    A sequential reduction is O(N): N values take N-1 reduction operations

  3. The reduction-tree algorithm performs the N-1 operations in only log(N) steps, similar to repeated halving

    For N input values: N-1 operations in log(N) steps, so the average parallelism is (N-1)/log(N). For example, N = 1024 takes 1023 adds in 10 steps, an average parallelism of about 102

    The total work matches the sequential algorithm, but the resource usage while it runs is high

  4. The reduction kernel loads two elements per thread into shared memory, so each block processes 2 * blockDim.x elements

    __shared__ float Ns[2*BLOCK_SIZE];  // shared array size must be a compile-time constant
    int t = threadIdx.x;
    int start = 2*blockDim.x*blockIdx.x;
    Ns[t] = input[start + t];
    Ns[blockDim.x + t] = input[start + t + blockDim.x];

    for (int stride = 1; stride <= blockDim.x; stride *= 2) {
      __syncthreads();   // every partial sum must be ready before the next step
      if (t % stride == 0)
        Ns[2*t] += Ns[2*t + stride];
    }
    
  5. Improved reduction: keep the active threads contiguous so that whole warps retire early and divergence is reduced (full kernel sketched below)

    for (int s = blockDim.x; s > 0; s /= 2) {
      __syncthreads();
      if (t < s)
        Ns[t] += Ns[t + s];
    }
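
    The fragments above omit the kernel wrapper; a minimal sketch of the complete improved kernel, assuming each block writes its partial sum to output[blockIdx.x] (BLOCK_SIZE and the names are ours):

    #define BLOCK_SIZE 256

    __global__ void reduce_sum(float *input, float *output, int n)
    {
      __shared__ float Ns[2*BLOCK_SIZE];
      int t = threadIdx.x;
      int start = 2*blockDim.x*blockIdx.x;

      // load two elements per thread, padding with 0 past the end of the input
      Ns[t] = (start + t < n) ? input[start + t] : 0.0f;
      Ns[blockDim.x + t] = (start + blockDim.x + t < n) ? input[start + blockDim.x + t] : 0.0f;

      for (int s = blockDim.x; s > 0; s /= 2) {
        __syncthreads();
        if (t < s)
          Ns[t] += Ns[t + s];
      }
      if (t == 0)
        output[blockIdx.x] = Ns[0];   // partial sum of this block
    }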
    

Chapter 10 prefix sum (scan)

  1. Scan is often used for parallel work assignment and resource allocation, radix sort, quicksort, and solving recurrences

  2. Efficient serial algorithm O(N)

    y[0] = x[0];
    for (int i = 1; i <= max_i; i++)
      y[i] = y[i-1] + x[i];

    Naive parallel version: y0 = x0; y1 = x0 + x1; ... (each output computed independently, which is O(n^2) work)
    
  3. A more effective parallel prefix sum: in each step every thread adds to its element the value one stride to the left, doubling the stride each step

    __global__ void scan(float *X, float *Y, int inputsize)
    {
      __shared__ float XY[SECTIONSIZE];
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < inputsize)
        XY[threadIdx.x] = X[i];
      else
        XY[threadIdx.x] = 0.0f;   // pad the section past the end of the input

      // all threads run the same number of iterations so that __syncthreads()
      // is reached uniformly; only threads with threadIdx.x >= s have a partner
      for (int s = 1; s < blockDim.x; s *= 2) {
        __syncthreads();
        float temp = (threadIdx.x >= s) ? XY[threadIdx.x - s] : 0.0f;
        __syncthreads();          // everyone reads before anyone overwrites
        if (threadIdx.x >= s)
          XY[threadIdx.x] += temp;
      }
      __syncthreads();
      if (i < inputsize)
        Y[i] = XY[threadIdx.x];
    }
    
  4. work efficiency

    Total adds: n*log(n) - (n-1) => O(n log n); for example, n = 1024 costs 1024*10 - 1023 = 9217 adds, versus 1023 for the sequential scan

  5. Prefix sum based on balanced tree

    __global__ void presum(float *go, float *gi, int n)
    {
      extern __shared__ float tmp[];   // n floats, size passed at kernel launch
      int tid = threadIdx.x;
      int offset = 1;

      tmp[2*tid] = gi[2*tid];          // load two elements per thread
      tmp[2*tid + 1] = gi[2*tid + 1];

      // up-sweep (reduction) phase
      for (int i = n/2; i > 0; i /= 2) {
        __syncthreads();
        if (tid < i) {
          int a = offset*(2*tid + 1) - 1;
          int b = offset*(2*tid + 2) - 1;
          tmp[b] += tmp[a];
        }
        offset *= 2;
      }

      if (tid == 0) tmp[n-1] = 0;      // clear the root: exclusive scan

      // down-sweep phase
      for (int i = 1; i < n; i *= 2) {
        offset /= 2;
        __syncthreads();
        if (tid < i) {
          int a = offset*(2*tid + 1) - 1;
          int b = offset*(2*tid + 2) - 1;
          float t = tmp[a];
          tmp[a] = tmp[b];
          tmp[b] += t;
        }
      }
      __syncthreads();
      go[2*tid] = tmp[2*tid];          // write back two elements per thread
      go[2*tid + 1] = tmp[2*tid + 1];
    }
    
  6. operating efficiency

    Up-sweep (reduction) phase: log(n) iterations, n-1 adds in total => O(n)

    Down-sweep phase: log(n)-1 iterations, n-2-(log(n)-1) = n-1-log(n) adds => O(n)

    Together: 2(n-1) - log(n) operations, which never exceeds 2(n-1)

Chapter 11 bank conflict

  1. Banks are the way shared memory is divided into storage units; GPU shared memory is built on a banked (interleaved) memory architecture

  2. Compute capability 1.x: warp = 32 threads, 16 banks; 2.x: warp = 32 = number of banks; 3.x: the bank width is configurable

    Each bank can serve only one access per cycle, i.e. the bandwidth of each bank is 32 bits per cycle.

  3. When does it happen?

    Different threads in the same warp access different locations in the same bank

  4. When does it not happen?

    Broadcast: all threads in a warp read the same shared-memory address; this is the best case

    Multicast: several (but not all) threads in a warp read the same address; suboptimal but still conflict-free

    Even if the threads of a warp access banks in a random pattern, no bank conflict occurs as long as no two threads touch different locations within the same bank (see the sketch below)
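
    The classic illustration is a shared-memory transpose tile (a standard textbook example, not from these notes); assuming 32 banks of 4-byte words and w a multiple of 32, a [32][32] tile puts a whole column into one bank, and padding each row by one element removes the conflict:

    __global__ void transpose_tile(float *out, const float *in, int w)
    {
      __shared__ float tile[32][32 + 1];   // +1 padding: a column now spans 32 different banks

      int x = blockIdx.x*32 + threadIdx.x;
      int y = blockIdx.y*32 + threadIdx.y;
      tile[threadIdx.y][threadIdx.x] = in[y*w + x];    // row write: conflict-free
      __syncthreads();

      x = blockIdx.y*32 + threadIdx.x;
      y = blockIdx.x*32 + threadIdx.y;
      out[y*w + x] = tile[threadIdx.x][threadIdx.y];   // column read: padded, so no conflict
    }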

  5. Principles of data parallelism

    • Use parallelism effectively
    • Coalesce memory accesses whenever possible
    • Use shared memory
    • Exploit other memories, such as texture and constant memory
    • Reduce bank conflicts

Texture memory

  1. Texture memory does not correspond to a special memory in hardware

  2. Texture memory provides address mapping, data filtering, caching, and related functions

  3. Texture reference: provides the binding between data and texture memory. A CUDA array can be bound to a texture; the elements of linear memory or arrays bound to a texture are called texels

  4. texture<Type, Dim, ReadMode> texRef; ReadMode controls whether the value read back is normalized: cudaReadModeElementType means no (the default), cudaReadModeNormalizedFloat means yes

    struct cudaChannelFormatDesc { int x, y, z, w; enum cudaChannelFormatKind f; };

    cudaChannelFormatKind gives the member type: cudaChannelFormatKindFloat (floating point), cudaChannelFormatKindSigned (signed integer), cudaChannelFormatKindUnsigned (unsigned integer)

    cudaMalloc3DArray() can allocate 1-, 2-, and 3-dimensional arrays; cudaMallocArray() is generally used for 2D allocation; cudaFreeArray() frees an array; cudaMemcpyToArray() copies data into one

  5. Texture fetching: texture fetch functions access texture memory using texture coordinates.

    • Linear memory bound to a texture is accessed with the tex1Dfetch() function, using integer texture coordinates
    • Textures bound to 1D, 2D, and 3D CUDA arrays are accessed with tex1D(), tex2D(), and tex3D() respectively, using floating-point texture coordinates
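
    A minimal sketch of the legacy texture-reference workflow (the names are ours; this API was later deprecated in favor of texture objects):

    texture<float, 2, cudaReadModeElementType> texRef;  // file-scope texture reference

    __global__ void read_tex(float *out, int w, int h)
    {
      int x = blockIdx.x*blockDim.x + threadIdx.x;
      int y = blockIdx.y*blockDim.y + threadIdx.y;
      if (x < w && y < h)
        out[y*w + x] = tex2D(texRef, x + 0.5f, y + 0.5f);  // float coordinates
    }

    // host side: allocate a CUDA array, copy data in, bind it to the texture
    void setup(const float *h_data, int w, int h)
    {
      cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
      cudaArray *cuArray;
      cudaMallocArray(&cuArray, &desc, w, h);
      cudaMemcpyToArray(cuArray, 0, 0, h_data, w*h*sizeof(float),
                        cudaMemcpyHostToDevice);
      cudaBindTextureToArray(texRef, cuArray, desc);
    }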
