Chapter 4 Data-level Parallelism Vector, SIMD, and GPU

:material-circle-edit-outline: 约 454 个字 :fontawesome-solid-code: 9 行代码 :material-clock-time-two-outline: 预计阅读时间 2 分钟

计算机体系结构（本）2024-11-25 第 3-4 节 PPT

这章重点掌握向量处理器及其优化、convey的判断

SIMD

SIMD 是一个能发挥数据并行性的架构

VECTOR Process

向量处理器架构下，单条指令能完成大量处理（类似 Python 的一些操作不需要循环？）

Single vector instruction implies lots of work ( loop)，Each result independent of previous result

Vector instructions that access memory have a known access pattern，Reduces branches and branch problems in pipelines

Types of Vector Architectures

根据向量运算的操作数来源，分为两类：

memory-memory vector processors
- 使用 Vector memory-memory architectures (VMMA)
- all vector operations are memory to memory
  - require greater main memory bandwidth
  - 不需要 load store 指令
- VMMA make if difficult to overlap execution of multiple vector operations
  - Because it must check dependencies on memory addresses
- VMMAs incur greater startup latency
vector-register processors
- all vector operations between vector registers, except load and store)
- 需要先将操作数取到 reg 再用 reg 计算，需要 load store 指令
- 比 VMMA 更成功，也是我们后面讲的重点

Components of Vector Processor

Vector Register
Vector Functional Units (FUs)
- Fully pipelined, start new operation every clock
Vector Load-Store Units (LSUs)
- fully pipelined unit to load or store a vector
- Multiple elements fetched/stored per cycle
- may have multiple LSUs
Scalar registers
- single element for FP scalar or address
Cross-bar to connect FUs , LSUs, registers

优缺点

Pro.

No dependencies within a vector
Each instruction generates a lot of results
Regular Memory access pattern
Fewer branches in the instruction sequence

Cons.

Very inefficient if parallelism is irregular
Memory bandwidth can become a bottleneck

[!EXAMPLE] VECTOR vs. SCALAR

Vector Length

A vector register can hold some maximum number of elements for each data width (maximum vector length or MVL)

Vector-length (VL) register controls the length of any vector

VL can be anything from 0 to MVL

Strip Mining

when application vector length > MVL

Strip mining: Generation of a loop that handles MVL elements per iteration

就是说，实际的 VL 超过了 MVL（机器能处理的最长的向量长度），就把这个向量切成一段一段的分别处理

[!EXAMPLE]

Example: vector daxpy of N elements

First loop handles (N mod MVL) elements, the rest handle MVL

VL = (N mod MVL);       // set VL = N mod MVL
for (I=0; I<VL; I++)    // 1st loop is a single set of
   Y[I]=A*X[I]+Y[I];   //   vector instructions

low = (N mod MVL);

VL = MVL;               // set VL to MVL
for (I=low; I<N; I++)   // 2nd loop requires N/MVL
   Y[I]=A*X[I]+Y[I];   //   sets of vector instructions