4.1 Optimizing Vector & GPU
Computer Architecture (undergraduate), 2024-12-02, periods 3-5, slides
Optimizing Vector Performance
Vector Chaining
The concept of forwarding extended to vector registers: the vector version of register bypassing
Convey: Set of vector instructions that could potentially execute together
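Chaining: lets sequences with RAW dependency hazards be placed in the same convey
Chime: the unit of time taken to execute one convey; m conveys take m chimes, roughly m × n clock cycles for vector length n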
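The convey and chime terms above can be illustrated with a rough timing estimate. The sketch below is an assumption-laden model, not something from the slides: it assumes chaining is available (so RAW dependencies never split a convey), one functional unit of each kind, one element per cycle, and no start-up overhead. The function names `split_into_conveys` and `estimated_cycles`, the unit labels, and the five-instruction sequence are all hypothetical.

```python
def split_into_conveys(instructions):
    """Greedily group (name, unit) pairs into conveys.

    A new convey starts only on a structural hazard, i.e. when the
    functional unit an instruction needs is already used in the current
    convey. RAW dependencies are assumed to be handled by chaining.
    """
    conveys, current, busy_units = [], [], set()
    for name, unit in instructions:
        if unit in busy_units:
            conveys.append(current)
            current, busy_units = [], set()
        current.append(name)
        busy_units.add(unit)
    if current:
        conveys.append(current)
    return conveys


def estimated_cycles(conveys, vector_length):
    """m conveys take m chimes, about m * n cycles (start-up ignored)."""
    return len(conveys) * vector_length


# Hypothetical sequence with a single load/store unit.
sequence = [
    ("vld V1", "load_store"),
    ("vmul V2", "multiply"),
    ("vld V3", "load_store"),
    ("vadd V4", "add"),
    ("vst V4", "load_store"),
]
conveys = split_into_conveys(sequence)
cycles = estimated_cycles(conveys, vector_length=32)
```

Under these assumptions the sequence splits into three conveys, because the single load/store unit is needed three times, so it takes three chimes.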
[!EXAMPLE]
Vector Conditional Execution
Optimizes loops that test a condition on every iteration:
- Add vector flag registers with single-bit elements
- Use a vector compare to set a flag register
- Where the flag bit is 1 the operation is performed; where it is 0 it is skipped
vld V1, Ra
vld V2, Rb
vcmp.neq.vv F0, V1, V2 # vector compare
vsub.vv V3, V2, V1, F0 # conditional vsub
vst V3, Ra
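The value in the mask vector register decides whether each element's result is written back, so it acts as a per-element write-enable signal.
This lets us keep the computed results for selected elements of the two vectors and discard the results for the rest.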
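To make the write-enable behaviour concrete, here is a minimal Python emulation of the compare-then-masked-subtract sequence. It is only a sketch: the helper names are hypothetical, plain lists stand in for vector registers, and it assumes the masked subtract leaves destination elements unchanged where the flag bit is 0.

```python
def vector_compare_neq(v1, v2):
    """Emulate a vector compare: flag bit is 1 where elements differ."""
    return [1 if x != y else 0 for x, y in zip(v1, v2)]


def masked_sub(dest, lhs, rhs, flags):
    """Write lhs[i] - rhs[i] into element i only where flags[i] is 1.

    Elements whose flag bit is 0 keep the old value from dest, which is
    the write-enable behaviour assumed in this sketch.
    """
    return [l - r if f else d for d, l, r, f in zip(dest, lhs, rhs, flags)]


def scalar_reference(a, b):
    """The equivalent scalar loop with a condition in every iteration."""
    out = list(a)
    for i in range(len(a)):
        if a[i] != b[i]:
            out[i] = b[i] - a[i]
    return out


a = [5, 2, 7, 4]
b = [5, 9, 7, 1]
flags = vector_compare_neq(a, b)
result = masked_sub(a, b, a, flags)
```

The masked version produces the same values as `scalar_reference(a, b)` without a per-element branch; the condition has become data (the flag bits) rather than control flow.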
Sparse Matrices
See slide 33
Multi-lane Implementation
See the slides
Graphics Processing Units
See slide 52