4.1 Optimizing Vector & GPU

Computer Architecture (undergraduate), 2024-12-02, periods 3-5, lecture slides

Optimizing Vector Performance

Vector Chaining

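Chaining extends the concept of forwarding to vector registers: it is the vector version of register bypassing. A dependent vector instruction can start consuming elements as soon as the producing instruction generates them, instead of waiting for the whole vector to be written.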


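Convoy: a set of vector instructions that could potentially execute together, i.e. with no structural hazards among them

Chime: the unit of time taken to execute one convoy. With chaining, sequences with RAW dependency hazards can be placed in the same convoy.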


Vector Conditional Execution

The goal is to vectorize loops like the following, where every iteration contains a conditional:

for (I = 0; I < N; I++)
    if (A[I] != B[I]) A[I] -= B[I];

  • Add vector flag registers with single-bit elements
  • Use a vector compare to set a flag register
    • An element whose flag bit is 1 is computed; an element whose flag bit is 0 is skipped
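vld           V1, Ra            # V1 = A
vld           V2, Rb            # V2 = B
vcmp.neq.vv   F0, V1, V2        # vector compare: F0[i] = (A[i] != B[i])
vsub.vv       V1, V1, V2, F0    # conditional vsub: V1[i] = A[i] - B[i] where F0[i] = 1
vst           V1, Ra            # store A back; masked-off elements are unchanged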

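The value in the vector mask (flag) register determines whether each element's result is written back; it acts as a per-element write-enable signal.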

This lets us keep the computed results for selected elements of the two vectors and discard the results for the others.

Sparse Matrices

See slide 33 of the lecture slides.

Multi-lane Implementation

See the lecture slides.

Graphics Processing Units

See slide 52 of the lecture slides.