# Program Analysis and Optimization - 11 Divergence Analysis

## 11.1 What Is a GPU

### 11.1.1 A Brief History of GPUs

The birth of GPU instruction sets → generalized general-purpose integer processing → processing with branch support

……

### 11.1.3 The Programming Environment

```
void saxpy_serial(int n, float alpha, float *x, float *y) {
  for (int i = 0; i < n; i++)
    y[i] = alpha * x[i] + y[i];
}
// Invoke the serial function:
saxpy_serial(n, 2.0, x, y);
```

```
__global__ void saxpy_parallel(int n, float alpha, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = alpha * x[i] + y[i];
}
// Invoke the parallel kernel:
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
```

The organization of NVIDIA GPUs: Grids → Blocks → Warps → Threads

CUDA programs = CPU programs + kernels

Kernel invocation syntax:

`kernel<<<dGrd, dBck>>>(A, B, w, C);`

CPU programs → host programs

kernels → PTX (Parallel Thread Execution) → SASS (Streaming ASSembly)

## 11.2 Divergence

### 11.2.1 Strengths and Weaknesses of SIMD

The SIMD model amortizes one instruction fetch and decode across many lanes, but because the lanes execute in lockstep it suffers from two kinds of divergence:

• Memory divergence, caused by lanes accessing inconsistent memory addresses
• Control-flow divergence, caused by lanes taking different sides of a branch

### 11.2.2 An Example of Control-Flow Divergence

```
__global__ void ex(float *v) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  if (v[tid] < 0.0) {
    v[tid] /= 2;
  } else {
    v[tid] = 0.0;
  }
}
```

### 11.2.3 Which Input Performs Best?

```
__global__ void dec2zero(int *v, int N) {
  int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
  if (xIndex < N) {
    while (v[xIndex] > 0) {
      v[xIndex]--;
    }
  }
}
```

```
void vecIncInit(int *data, int size) {
  for (int i = 0; i < size; ++i) {
    data[i] = size - i - 1;
  }
}
void vecConsInit(int *data, int size) {
  int cons = size / 2;
  for (int i = 0; i < size; ++i) {
    data[i] = cons;
  }
}
void vecAltInit(int *data, int size) {
  for (int i = 0; i < size; ++i) {
    if (i % 2) {
      data[i] = size;
    }
  }
}
void vecRandomInit(int *data, int size) {
  for (int i = 0; i < size; ++i) {
    data[i] = random() % size;
  }
}
void vecHalfInit(int *data, int size) {
  for (int i = 0; i < size / 2; ++i) {
    data[i] = 0;
  }
  for (int i = size / 2; i < size; ++i) {
    data[i] = size;
  }
}
```

• `vecIncInit`: decreasing values, `size - i - 1`
• `vecConsInit`: every entry is the constant `size / 2`
• `vecAltInit`: odd entries are `size`; even entries are left untouched
• `vecRandomInit`: uniformly random values in `[0, size)`
• `vecHalfInit`: the first half is 0, the second half is `size`

## 11.3 Dynamic Detection of Divergence

### 11.3.1 Divergence Profiling

```
int writer = 0;
bool gotWriter = false;
while (!gotWriter) {
  bool iAmWriter = false;
  if (laneid == writer) {
    iAmWriter = true;
  }
  // "there exists a thread t in warp w with iAmWriter == true":
  // in CUDA this warp-wide test is the __any_sync intrinsic
  if (__any_sync(__activemask(), iAmWriter)) {
    gotWriter = true;
  } else {
    writer++;
  }
}
```

### 11.3.2 The Classic Bitonic Sort

```
__global__ static void bitonicSort(int *values) {
  extern __shared__ int shared[];
  const unsigned int tid = threadIdx.x;
  shared[tid] = values[tid];
  __syncthreads();  // wait until every thread has loaded its element
  for (unsigned int k = 2; k <= NUM; k *= 2) {
    for (unsigned int j = k / 2; j > 0; j /= 2) {
      unsigned int ixj = tid ^ j;
      if (ixj > tid) {
        if ((tid & k) == 0) {
          if (shared[tid] > shared[ixj]) {
            swap(shared[tid], shared[ixj]);
          }
        } else {
          if (shared[tid] < shared[ixj]) {
            swap(shared[tid], shared[ixj]);
          }
        }
      }
      __syncthreads();  // finish this compare-exchange phase before the next
    }
  }
  values[tid] = shared[tid];
}
```

The divergent inner test can be rewritten: both sides perform the same swap and differ only in which index plays `a` and which plays `b`, so the swap itself moves out of the branch:

```
unsigned int a, b;
if ((tid & k) == 0) {
  b = tid;
  a = ixj;
} else {
  b = ixj;
  a = tid;
}
if (sh[b] > sh[a]) {
  swap(sh[b], sh[a]);
}
```

Going further, the remaining if/else reduces to two predicated selects, eliminating that branch entirely:

```
int p = (tid & k) == 0;
unsigned b = p ? tid : ixj;
unsigned a = p ? ixj : tid;
if (sh[b] > sh[a]) {
  swap(sh[b], sh[a]);
}
```

## 11.4 Static Detection of Divergence

### 11.4.1 Divergent and Uniform Variables

• `tid` is a divergent variable
• Variables produced by atomic operations are divergent
• If v has a data dependence on a divergent variable, v is also divergent
• If v has a control dependence on a divergent variable, v is also divergent

## 11.5 Divergence Optimizations

### 11.5.1 Synchronization Barrier Elimination

In CUDA's PTX instruction set, branch instructions are assumed potentially divergent by default, unless they carry the `.uni` suffix:
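For instance, a hypothetical PTX fragment (register and label names invented for illustration) contrasting the two forms:

```
    setp.lt.s32  %p1, %r1, %r2;   // %p1 = (%r1 < %r2), may differ per thread
@%p1 bra         THEN;            // potentially divergent branch
    ...
    bra.uni      EXIT;            // .uni: all active threads are guaranteed
                                  // to branch together, no divergence
```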

### 11.5.3 Data Relocation

```
__global__ static void maxSort1(int *values, int N) {
  // 1) COPY-INTO: Copy data from the values vector
  // into shared memory:
  for (unsigned k = 0; k < THREAD_WORK_SIZE; k++) {
    unsigned loc = k * blockDim.x + threadIdx.x;
    if (loc < N) {
      shared[loc] = values[loc + blockIdx.x * blockDim.x];
    }
  }

  // 2) SORT: each thread sorts its chunk of data
  // with a small sorting net.
  // (computation of index1..index4 elided in the original)
  if (index4 < N) {
    swapIfNecessary(shared, index1, index3);
    swapIfNecessary(shared, index2, index4);
    swapIfNecessary(shared, index1, index2);
    swapIfNecessary(shared, index3, index4);
    swapIfNecessary(shared, index2, index3);
  }

  // 3) SCATTER: the threads distribute their data
  // along the array.
  __shared__ int scattered[THREAD_WORK_SIZE * 300];
  unsigned int nextLoc = threadIdx.x;
  for (unsigned i = 0; i < THREAD_WORK_SIZE; i++) {
    // (store into scattered[nextLoc] elided in the original)
    nextLoc += blockDim.x;
  }

  // 4) COPY-BACK: Copy the data back from the shared
  // memory into the values vector:
  for (unsigned k = 0; k < THREAD_WORK_SIZE; k++) {
    unsigned loc = k * blockDim.x + threadIdx.x;
    if (loc < N) {
      values[loc + blockIdx.x * blockDim.x] = scattered[loc];
    }
  }
}
```

## 11.6 A History of Divergence Research

GPUs are a relatively young platform, so the literature on GPU divergence analysis is also recent:

1. Ryoo, S., Rodrigues, C., Baghsorkhi, S., Stone, S., Kirk, D., and Hwu, W.-M. "Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA", PPoPP, pp. 73-82 (2008). An introduction to CUDA.

2. Coutinho, B., Sampaio, D., Pereira, F., and Meira, W. "Divergence Analysis and Optimizations", PACT, pp. 320-329 (2011). Divergence analysis and optimizations.

3. Sampaio, D., Martins, R., Collange, S., and Pereira, F. "Divergence Analysis", TOPLAS (2013). Divergence analysis.

Original author: 周荣华 (Zhou Ronghua)
Original article: https://www.cnblogs.com/zhouronghua/p/16484014.html
This article is reposted from the web to share knowledge; if it infringes any rights, please contact the blogger for removal.