Chapter 5 thread execution efficiency and SIMD
- Single instruction, multiple data (SIMD) execution: when a warp executes, all threads in the warp execute the same instruction at any given time
- Control divergence: occurs when threads in a warp take different control-flow paths because they make different control decisions. The divergent paths end up being executed serially, one after another. Divergence typically appears when a branch or loop condition depends on the thread index. It happens within a block, since each block is partitioned into 32-thread warps
- The performance impact depends on the data. For kernels that process large amounts of data the impact is small: divergence caused by boundary checks only affects the warps near the edge of the data and is insignificant, and many control-flow statements in a kernel do not necessarily mean a lot of control divergence (see the sketch below)
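- A minimal sketch (hypothetical kernels) contrasting a branch on the lane index, which diverges inside every warp, with a boundary check that only diverges in the warps straddling the end of the data:
__global__ void diverge_demo(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Condition depends on the lane index, so both paths occur in every warp -> control divergence
        if (threadIdx.x % 2 == 0) d[i] *= 2.0f;
        else                      d[i] += 1.0f;
    }
}
__global__ void boundary_demo(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Only the warps straddling n take both paths; for large n the cost is negligible
    if (i < n) d[i] *= 2.0f;
}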
Chapter 7 histogram calculation
-
Histogram: a method for extracting salient features and patterns from large data sets
Examples: object recognition and feature extraction in images, credit card fraud detection, correlating celestial object movements
Basic histogram
-
Sectioned (blocked) partitioning: adjacent threads do not access adjacent memory locations, so accesses cannot be coalesced and DRAM bandwidth utilization is low
Interleaved partitioning: in each iteration all threads process one contiguous section of elements, then all move on to the next section; memory accesses are coalesced
-
Data race
When multiple threads perform a read-modify-write on the same data, the interleaving of reads and writes can leave a thread working on a value that another thread has already modified, so some updates are lost. There are four possible interleavings; the problem is solved with atomic operations
-
CUDA atomic operations: add, sub, inc, dec, min, max, exch, CAS
int atomicAdd(int *address, int val); overloads also exist for unsigned int, unsigned long long int, and float (four operand types)
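atomicCAS can be used to build atomic operations that have no native overload; a common sketch (shown here for a double-precision add, the usual compare-and-swap retry pattern) is:
__device__ double atomicAddDouble(double *address, double val) {
    // Reinterpret the double as a 64-bit integer so atomicCAS can operate on it
    unsigned long long int *addr_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *addr_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(addr_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);   // retry if another thread changed the value in between
    return __longlong_as_double(old);
}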
-
Basic histogram code
__global__ void hist(unsigned char *b, long size, int *histo) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;          // interleaved (grid-stride) partitioning
    while (i < size) {
        int pos = b[i] - 'a';
        if (pos >= 0 && pos < 26)                 // only count letters a..z
            atomicAdd(&histo[pos / 4], 1);        // 4 letters per bin
        i += stride;
    }
}
-
Histogram privatization
Creating a private histogram per block costs extra space for the copies and an extra step to merge all copies into the global histogram. However, contention (and the resulting serialization) when updating the private histograms and when merging them is much lower, so performance can improve by 10x or more
-
Histogram privatization code
__global__ void hist_p(unsigned char *b, long size, int *histo) {
    __shared__ int histo_s[7];                    // private per-block copy (26 letters / 4 per bin = 7 bins)
    if (threadIdx.x < 7) histo_s[threadIdx.x] = 0;
    __syncthreads();
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        int pos = b[i] - 'a';
        if (pos >= 0 && pos < 26)
            atomicAdd(&histo_s[pos / 4], 1);      // atomics on shared memory: much less contention
        i += stride;
    }
    __syncthreads();
    if (threadIdx.x < 7)                          // merge the private copy into the global histogram
        atomicAdd(&histo[threadIdx.x], histo_s[threadIdx.x]);
}
-
The update operation must be associative and commutative, and the histogram must be small enough to fit in shared memory. If it is too large, only part of the histogram can be privatized: a range test on the bin index decides whether an update goes to the private copy or directly to global memory (see the sketch below)
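A minimal sketch of such partial privatization (NUM_BINS, NUM_PRIVATE_BINS and the kernel name are assumptions; the input is taken to be precomputed bin indices): only the low-numbered bins live in shared memory, and the range test routes the rest directly to global memory.
#define NUM_BINS 1024
#define NUM_PRIVATE_BINS 256
__global__ void hist_partial(const int *bins, long size, unsigned int *histo) {
    __shared__ unsigned int histo_s[NUM_PRIVATE_BINS];
    for (int b = threadIdx.x; b < NUM_PRIVATE_BINS; b += blockDim.x) histo_s[b] = 0;
    __syncthreads();
    long i = threadIdx.x + (long)blockIdx.x * blockDim.x;
    long stride = (long)blockDim.x * gridDim.x;
    for (; i < size; i += stride) {
        int pos = bins[i];
        if (pos < NUM_PRIVATE_BINS) atomicAdd(&histo_s[pos], 1u);   // privatized bins
        else if (pos < NUM_BINS)    atomicAdd(&histo[pos], 1u);     // rest goes straight to global memory
    }
    __syncthreads();
    for (int b = threadIdx.x; b < NUM_PRIVATE_BINS; b += blockDim.x)
        atomicAdd(&histo[b], histo_s[b]);                           // merge the private part
}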
Chapter 8 convolution
-
Convolution: an array operation in which each output element is the weighted sum of adjacent input elements. The weighting method is usually determined by the convolution kernel, which remains unchanged during operation
-
One dimensional convolutional code
__global__ void con_1(float *N, float *M, float *P, int Mwid, int wid) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int start = i - Mwid / 2;                     // left edge of the convolution window
    float pValue = 0;
    for (int j = 0; j < Mwid; j++) {
        if (start + j >= 0 && start + j < wid)    // ghost elements contribute 0
            pValue += N[start + j] * M[j];
    }
    if (i < wid)
        P[i] = pValue;
}
-
Two dimensional convolutional code
__global__ void con_2(unsigned char *in, unsigned char *m, unsigned char *out,
                      int mwid, int w, int h) {
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    if (col < w && row < h) {
        int pvalue = 0;
        int startc = col - mwid / 2;              // top-left corner of the window
        int startr = row - mwid / 2;
        for (int i = 0; i < mwid; i++)
            for (int j = 0; j < mwid; j++) {
                int ci = startr + i;
                int cj = startc + j;
                if (ci > -1 && ci < h && cj > -1 && cj < w)   // skip ghost elements
                    pvalue += in[ci * w + cj] * m[i * mwid + j];
            }
        out[row * w + col] = (unsigned char)pvalue;
    }
}
-
Tiled 1D convolution (threads that only load halo elements do not produce output)
__global__ void con1_s(float *N, float *M, float *P) {
    // O_TILE_WID, Mwid, Arraywid are compile-time constants; block size is O_TILE_WID + Mwid - 1
    __shared__ float Ns[O_TILE_WID + Mwid - 1];
    int o_idx = O_TILE_WID * blockIdx.x + threadIdx.x;    // output index
    int i_idx = o_idx - Mwid / 2;                         // input index, shifted left by the halo
    if (i_idx >= 0 && i_idx < Arraywid)
        Ns[threadIdx.x] = N[i_idx];
    else
        Ns[threadIdx.x] = 0.0f;                           // ghost elements
    __syncthreads();
    if (threadIdx.x < O_TILE_WID) {                       // halo-only threads do not write output
        float pvalue = 0;
        for (int j = 0; j < Mwid; j++)
            pvalue += M[j] * Ns[threadIdx.x + j];
        P[o_idx] = pvalue;
    }
}
-
const ... __restrict__: marks a pointer as read-only and non-aliased, so the compiler can automatically route its accesses through the appropriate storage (e.g. the read-only / constant cache)
-
Tiled 2D convolution (mask passed through const __restrict__, served from the read-only/constant cache)
__global__ void con_2_tiled(float *P, float *N, int height, int wid, int pitch,
                            int channels, const float * __restrict__ M) {
    // O_TILE_WID and Mwid are compile-time constants; pitch is the row stride of N in elements
    __shared__ float Ns[O_TILE_WID + Mwid - 1][O_TILE_WID + Mwid - 1];
    int ty = threadIdx.y;
    int tx = threadIdx.x;
    int o_col = blockIdx.x * O_TILE_WID + tx;         // output column
    int o_row = blockIdx.y * O_TILE_WID + ty;         // output row
    int i_col = o_col - Mwid / 2;                     // input column, shifted by the halo
    int i_row = o_row - Mwid / 2;
    if (i_row >= 0 && i_row < height && i_col >= 0 && i_col < wid)
        Ns[ty][tx] = N[i_row * pitch + i_col];
    else
        Ns[ty][tx] = 0.0f;                            // ghost elements
    __syncthreads();
    float pvalue = 0;
    if (ty < O_TILE_WID && tx < O_TILE_WID) {
        for (int i = 0; i < Mwid; i++)
            for (int j = 0; j < Mwid; j++)
                pvalue += M[i * Mwid + j] * Ns[ty + i][tx + j];
        if (o_row < height && o_col < wid)
            P[o_row * wid + o_col] = pvalue;
    }
}
-
computational efficiency
- For 1D convolution, the untiled kernel loads O_TILE_WID * MWID elements from global memory per output tile; with shared memory this drops to O_TILE_WID + MWID - 1 loads. The bandwidth reduction factor is the ratio of the two (ignoring boundary/ghost elements)
- O_TILE_WID * MWID / (O_TILE_WID + MWID - 1)
- For 2D convolution the reduction factor is larger:
- (O_TILE_WID^2 * MWID^2) / (O_TILE_WID + MWID - 1)^2
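- For example, with O_TILE_WID = 16 and MWID = 5: the 1D reduction factor is 16*5 / (16+5-1) = 80 / 20 = 4, and the 2D factor is (16^2 * 5^2) / 20^2 = 6400 / 400 = 16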
Chapter 9 parallel reduction
-
Partition and summarize: requires that the order in which elements are processed does not matter, i.e. the combining operation is associative and commutative. The data set is divided into smaller chunks, each thread processes one chunk, and the per-chunk results are combined into the final answer with a reduction tree, similar to MapReduce
-
Typical reduction computations: sum, product
A sequential reduction algorithm is generally O(N): N values take N-1 reduction operations
-
The reduction tree needs only log(N) steps to perform the N-1 operations, similar to repeated halving
For N input values: N-1 operations in log(N) steps, so the average parallelism is (N-1)/log(N)
The total amount of work is the same as the sequential algorithm, but the peak resource (thread) usage is much higher
-
Reduction sum: each thread loads two elements into shared memory, so each block processes 2 * blockDim.x elements
__shared__ float Ns[2 * BLOCK_SIZE];              // BLOCK_SIZE == blockDim.x, a compile-time constant
int t = threadIdx.x;
int start = 2 * blockDim.x * blockIdx.x;
Ns[t] = input[start + t];
Ns[blockDim.x + t] = input[start + blockDim.x + t];
for (int stride = 1; stride <= blockDim.x; stride *= 2) {
    __syncthreads();                              // make sure every partial sum of the previous step is ready
    if (t % stride == 0)
        Ns[2 * t] += Ns[2 * t + stride];
}
-
Improved reduction
for (int stride = blockDim.x; stride > 0; stride /= 2) {
    __syncthreads();
    if (t < stride)                               // active threads stay contiguous, so fewer warps diverge
        Ns[t] += Ns[t + stride];
}
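A minimal complete kernel sketch combining the load phase with this improved loop (BLOCK_SIZE and the kernel name reduce_sum are assumptions; each block writes one partial sum that still has to be combined afterwards):
#define BLOCK_SIZE 256
__global__ void reduce_sum(float *input, float *blockSums, int n) {
    __shared__ float Ns[2 * BLOCK_SIZE];
    int t = threadIdx.x;
    int start = 2 * blockDim.x * blockIdx.x;
    // Load two elements per thread, padding out-of-range elements with 0
    Ns[t]              = (start + t              < n) ? input[start + t]              : 0.0f;
    Ns[blockDim.x + t] = (start + blockDim.x + t < n) ? input[start + blockDim.x + t] : 0.0f;
    for (int stride = blockDim.x; stride > 0; stride /= 2) {
        __syncthreads();
        if (t < stride) Ns[t] += Ns[t + stride];
    }
    if (t == 0) blockSums[blockIdx.x] = Ns[0];    // one partial sum per block
}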
Chapter 10 prefix sum (scan)
-
Scan is frequently used for parallel work assignment and resource allocation, radix sort, quicksort, and solving recurrences
-
Efficient sequential algorithm, O(N)
y[0] = x[0];
for (int i = 1; i <= max_i; i++)
    y[i] = y[i - 1] + x[i];
Naive parallel version: each output is computed independently, y0 = x0; y1 = x0 + x1; y2 = x0 + x1 + x2; ...
-
A more efficient parallel prefix sum within a block: in each step every thread adds to its own element the element one stride to its left, with the stride doubling each step
__global__ void scan(float *X, float *Y, int inputsize) {
    __shared__ float XY[SECTIONSIZE];             // SECTIONSIZE == blockDim.x, a compile-time constant
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    XY[threadIdx.x] = (i < inputsize) ? X[i] : 0.0f;
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        float temp = 0.0f;
        if (threadIdx.x >= stride)
            temp = XY[threadIdx.x - stride];      // read before any thread overwrites it
        __syncthreads();
        if (threadIdx.x >= stride)
            XY[threadIdx.x] += temp;
    }
    __syncthreads();
    if (i < inputsize)
        Y[i] = XY[threadIdx.x];
}
-
work efficiency
Total adds: N*log(N) - (N-1) => O(N log N)
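For example, for a section of N = 1024 elements: 1024*10 - 1023 = 9217 adds, compared with 1023 adds for the sequential scan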
-
Prefix sum based on balanced tree
__global__ void presum(float *go, float *gi, int n) {
    extern __shared__ float temp[];               // n floats, supplied as dynamic shared memory at launch
    int tid = threadIdx.x;
    int offset = 1;
    temp[2 * tid]     = gi[2 * tid];              // each thread loads two elements
    temp[2 * tid + 1] = gi[2 * tid + 1];
    for (int d = n / 2; d > 0; d /= 2) {          // up-sweep (reduction) phase
        __syncthreads();
        if (tid < d) {
            int a = offset * (2 * tid + 1) - 1;
            int b = offset * (2 * tid + 2) - 1;
            temp[b] += temp[a];
        }
        offset *= 2;
    }
    if (tid == 0) temp[n - 1] = 0;                // clear the last element (exclusive scan)
    for (int d = 1; d < n; d *= 2) {              // down-sweep phase
        offset /= 2;
        __syncthreads();
        if (tid < d) {
            int a = offset * (2 * tid + 1) - 1;
            int b = offset * (2 * tid + 2) - 1;
            float t = temp[a];
            temp[a] = temp[b];
            temp[b] += t;
        }
    }
    __syncthreads();
    go[2 * tid]     = temp[2 * tid];
    go[2 * tid + 1] = temp[2 * tid + 1];
}
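A possible launch for a single section of n elements (n a power of two, handled by one block of n/2 threads; the third launch argument supplies the dynamic shared memory used by extern __shared__; d_out, d_in, n are assumed host-side variables):
presum<<<1, n / 2, n * sizeof(float)>>>(d_out, d_in, n);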
-
operating efficiency
Reduction (up-sweep) phase: log(n) iterations, n-1 adds in total => O(n)
Distribution (down-sweep) phase: log(n)-1 iterations, (n-2) - (log(n)-1) adds in total => O(n)
Together the two phases never exceed 2(n-1) operations; the exact count is 2(n-1) - log(n)
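For example, with n = 1024: 2(n-1) - log(n) = 2046 - 10 = 2036 adds, roughly 4.5x fewer than the ~9217 adds of the simpler parallel scan above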
Chapter 11 bank conflicts
-
Banks are the way the storage units of shared memory are partitioned; GPU shared memory is built on a banked (interleaved) memory architecture
-
Compute capability 1.x: warp = 32 threads, 16 banks; 2.x: warp size = number of banks = 32; 3.x: the bank width can be configured
Each bank can serve only one access per cycle, i.e. the bandwidth of each bank is 32 bits per cycle
-
When does a bank conflict occur?
When different threads in the same warp access different locations within the same bank
-
When does it not occur?
Broadcast: all threads in a warp access the same shared-memory address; this is the best case
Multicast: several (but not all) threads in a warp access the same shared-memory address; this is suboptimal but conflict-free
Even if the threads of a warp access banks in a random pattern, there is no bank conflict as long as no two of them touch different locations within the same bank
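A minimal sketch (assuming 32 banks of 4-byte width; launch with a 32x32 thread block): reading a 32x32 shared-memory tile by columns puts every lane of a warp into the same bank, while padding each row to 33 elements spreads the column across all banks:
__global__ void bank_demo(const float *in, float *out) {
    __shared__ float tile[32][32];                // row stride 32: a column lands entirely in one bank
    __shared__ float tile_pad[32][33];            // row stride 33: a column is spread over all banks
    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x]     = in[y * 32 + x];              // row access: conflict-free
    tile_pad[y][x] = in[y * 32 + x];
    __syncthreads();
    float a = tile[x][y];                         // column access: 32-way bank conflict
    float b = tile_pad[x][y];                     // column access: conflict-free thanks to padding
    out[y * 32 + x] = a + b;
}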
-
Data parallelism principles
- Use parallelism effectively
- Coalesce memory accesses whenever possible
- Use shared memory
- Exploit the other memory spaces, such as texture and constant memory
- Reduce bank conflicts
Texture memory
-
Texture memory does not correspond to a dedicated piece of physical memory in hardware
-
Texture memory provides address mapping (boundary handling), data filtering (interpolation), caching, and related functions
-
Texture reference: provides the binding between data and texture memory. A CUDA array can be bound to a texture. The elements of linear memory or of a CUDA array bound to a texture are called texels
-
texture<Type, Dim, ReadMode> texRef; ReadMode specifies whether the value is normalized when read: cudaReadModeElementType (no normalization, the default) or cudaReadModeNormalizedFloat (integer values mapped to floating point in [0,1] or [-1,1])
struct cudaChannelFormatDesc { int x, y, z, w; enum cudaChannelFormatKind f; };
cudaChannelFormatKind describes the component type: cudaChannelFormatKindFloat (floating point), cudaChannelFormatKindSigned (signed integer), cudaChannelFormatKindUnsigned (unsigned integer)
cudaMalloc3DArray() can allocate 1D, 2D and 3D CUDA arrays; cudaMallocArray() is generally used for 2D allocations; cudaFreeArray() frees a CUDA array; cudaMemcpyToArray() copies data into one
-
Texture fetching: texture fetch functions use texture coordinates to access texture memory
- Linear memory bound to a texture is accessed with tex1Dfetch(), using an integer texture coordinate
- Textures bound to 1D, 2D and 3D CUDA arrays are accessed with tex1D(), tex2D() and tex3D() respectively, using floating-point texture coordinates
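A minimal sketch of the binding and fetching flow using the legacy texture-reference API described above (deprecated in recent CUDA toolkits); texRef, setup, fetch, h_data, width, height are illustrative names:
texture<float, 2, cudaReadModeElementType> texRef;    // file-scope texture reference
void setup(const float *h_data, int width, int height) {
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *cuArray;
    cudaMallocArray(&cuArray, &desc, width, height);
    cudaMemcpyToArray(cuArray, 0, 0, h_data,
                      width * height * sizeof(float), cudaMemcpyHostToDevice);
    cudaBindTextureToArray(texRef, cuArray, desc);    // bind the CUDA array to the texture
}
__global__ void fetch(float *out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = tex2D(texRef, x + 0.5f, y + 0.5f);   // floating-point coordinates
}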
Fragment (from tiled matrix multiplication, kept for reference):
ds_M[ty][tx] = M[G_tx * WIDTH + G_ty + m * TILE];
ds_N[ty][tx] = N[(G_ty + m * TILE) * WIDTH + G_tx];
value += ds_M[ty][k] * ds_N[k][tx];
__syncthreads();
P[G_tx * WIDTH + G_ty] = value;