Vector Optimization – Vector Mask Registers
One of the reasons for low levels of vectorization is the presence of conditionals (IF statements) inside loops. IF statements introduce control dependencies into a loop.
for (i = 0; i < 64; i = i + 1) {
    if (X[i] != 0) {          /* the body runs only for nonzero elements */
        X[i] = X[i] * 2;
    }
}
This loop cannot normally be vectorized because its body executes conditionally; however, if the loop body could be run only for the iterations for which X[i] ≠ 0, then the multiplication could be vectorized.
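As a rough illustration of running the body only where X[i] ≠ 0, the loop can be split into a compare that produces one Boolean per element and a multiply that consults it. This is a scalar emulation in plain C, not real vector code; the function names `set_mask_nonzero` and `masked_double` and the `bool` array standing in for a mask register are assumptions made for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Step 1: emulate a vector compare that writes one mask entry per element.
   (Hypothetical helper; a real machine would do this in one instruction.) */
static void set_mask_nonzero(const int *x, bool *mask, size_t n) {
    for (size_t i = 0; i < n; i++) {
        mask[i] = (x[i] != 0);
    }
}

/* Step 2: emulate a masked vector multiply. Elements whose mask entry
   is false are left untouched, mirroring mask-register semantics. */
static void masked_double(int *x, const bool *mask, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (mask[i]) {
            x[i] = x[i] * 2;
        }
    }
}
```

Calling `set_mask_nonzero(X, mask, 64)` followed by `masked_double(X, mask, 64)` would give the same result as the original loop. Each helper stands for a single vector instruction with no data-dependent branch; the `if (mask[i])` inside `masked_double` only emulates what the hardware does per element.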
Mask registers essentially provide conditional execution of each element operation in a vector instruction. The vector-mask control uses a Boolean vector to control the execution of a vector instruction. When the vector-mask register is enabled, any vector instructions executed operate only on the vector elements whose corresponding entries in the vector-mask register are one. The entries in the destination vector register that correspond to a zero in the mask register are unaffected by the vector operation. Clearing the vector-mask register sets it to all ones, making subsequent vector instructions operate on all vector elements.
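The mask semantics described above can be modeled in a few lines of C. This is only a sketch: the type `VMask`, the length `MVL`, and the functions `vmask_clear` and `vadd_masked` are hypothetical names chosen for illustration and do not correspond to any particular instruction set.

```c
#include <stdbool.h>
#include <stddef.h>

#define MVL 8  /* assumed maximum vector length for this sketch */

/* Hypothetical model of a vector-mask register: one Boolean per element. */
typedef struct {
    bool bit[MVL];
} VMask;

/* "Clearing" the mask register sets every entry to one,
   so later vector operations act on all elements. */
static void vmask_clear(VMask *vm) {
    for (size_t i = 0; i < MVL; i++) {
        vm->bit[i] = true;
    }
}

/* Masked vector add: dst[i] is written only where the mask entry is one;
   entries that correspond to a zero in the mask keep their old value. */
static void vadd_masked(int *dst, const int *a, const int *b, const VMask *vm) {
    for (size_t i = 0; i < MVL; i++) {
        if (vm->bit[i]) {
            dst[i] = a[i] + b[i];
        }
    }
}
```

With some mask entries set to zero, `vadd_masked` leaves the matching entries of `dst` unchanged; after `vmask_clear`, the same call updates every element.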
Consider the following snippet of code.
for (i = 0; i < 64; i = i + 1) {
    if (a[i] >= b[i]) {
        c[i] = a[i];          /* keep the larger element */
    } else {
        c[i] = b[i];
    }
}
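With vector masks, c can be populated in two masked passes: the mask is first set for the elements where a[i] >= b[i] and a[i] is copied into c under that mask; the mask is then inverted and b[i] is copied into the remaining elements.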
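One plausible way to picture the masking steps for this loop is the scalar C emulation below. The function name `if_converted_max` and the caller-supplied `mask` scratch array are assumptions of this sketch; on a real vector machine each inner loop would correspond to roughly one vector instruction.

```c
#include <stdbool.h>
#include <stddef.h>

/* Fills c[i] with the larger of a[i] and b[i] using masked copies.
   mask is scratch storage with n entries (stands in for the mask register). */
static void if_converted_max(const int *a, const int *b, int *c,
                             bool *mask, size_t n) {
    /* Step 1: vector compare sets the mask where a[i] >= b[i]. */
    for (size_t i = 0; i < n; i++) {
        mask[i] = (a[i] >= b[i]);
    }
    /* Step 2: masked copy of a into c (the "then" side). */
    for (size_t i = 0; i < n; i++) {
        if (mask[i]) c[i] = a[i];
    }
    /* Step 3: invert the mask to select the remaining elements. */
    for (size_t i = 0; i < n; i++) {
        mask[i] = !mask[i];
    }
    /* Step 4: masked copy of b into c (the "else" side). */
    for (size_t i = 0; i < n; i++) {
        if (mask[i]) c[i] = b[i];
    }
}
```

Every element of c is written exactly once, because the two masks are complements of each other. Both copies sweep all n elements, which is the source of the overhead that masking introduces.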
The transformation that changes an IF statement into a straight-line code sequence using conditional execution is called if-conversion.
Masking has a cost: conditionally executed instructions still take execution time when the condition is not satisfied. Nonetheless, eliminating a branch and its associated control dependences can make a conditional instruction faster than scalar execution, even if it sometimes does useless work. A vector instruction executed under a vector mask takes the same execution time as an unmasked one, including for the elements where the mask is zero.
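To make the overhead concrete, the sketch below counts how many element slots a masked instruction occupies versus how many produce a useful result. It is a simplified accounting model, not a timing model of any real machine; the `MaskCost` struct and `masked_cost` function are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one masked vector instruction. */
typedef struct {
    size_t slots_spent;  /* element slots the instruction occupies */
    size_t useful;       /* slots whose mask entry was one */
} MaskCost;

/* Under the simplifying assumption that every element costs one slot,
   a masked instruction pays for all n elements regardless of the mask. */
static MaskCost masked_cost(const bool *mask, size_t n) {
    MaskCost cost = { n, 0 };
    for (size_t i = 0; i < n; i++) {
        if (mask[i]) {
            cost.useful++;
        }
    }
    return cost;
}
```

For a mask with only 2 of 8 entries set, this model reports 8 slots spent and 2 useful. Sparse masks therefore waste most of the work, even though the branch has been removed.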
Vector processors make the mask registers part of the architectural state and rely on compilers to manipulate mask registers explicitly. GPUs get the same effect using hardware to manipulate internal mask registers that are invisible to GPU software. In both cases, the hardware spends the time to execute a vector element whether the mask is zero or one.