Skip to content

Chapter 4 Data-level Parallelism Vector, SIMD, and GPU

:material-circle-edit-outline: 约 454 个字 :fontawesome-solid-code: 9 行代码 :material-clock-time-two-outline: 预计阅读时间 2 分钟

计算机体系结构(本)2024-11-25 第 3-4 节 PPT

这章重点掌握向量处理器及其优化、convey的判断

SIMD

SIMD 是一个能发挥数据并行性的架构

VECTOR Process

向量处理器架构下,单条指令能完成大量处理(类似 Python 的一些操作不需要循环?)

image-20241208141821551

Single vector instruction implies lots of work (­ loop),Each result independent of previous result

Vector instructions that access memory have a known access pattern,Reduces branches and branch problems in pipelines

Types of Vector Architectures

根据向量运算的操作数来源,分为两类:

  • memory-memory vector processors
    • 使用 Vector memory-memory architectures (VMMA)
    • all vector operations are memory to memory
      • require greater main memory bandwidth
      • 不需要 load store 指令
    • VMMA make if difficult to overlap execution of multiple vector operations
      • Because it must check dependencies on memory addresses
    • VMMAs incur greater startup latency
  • vector-register processors

    • all vector operations between vector registers, except load and store)
    • 需要先将操作数取到 reg 再用 reg 计算,需要 load store 指令
    • 比 VMMA 更成功,也是我们后面讲的重点

image-20241208142339630

Components of Vector Processor

  1. Vector Register
  2. Vector Functional Units (FUs)
    • Fully pipelined, start new operation every clock
  3. Vector Load-Store Units (LSUs)
    • fully pipelined unit to load or store a vector
    • Multiple elements fetched/stored per cycle
    • may have multiple LSUs
  4. Scalar registers
    • single element for FP scalar or address
  5. Cross-bar to connect FUs , LSUs, registers

优缺点

Pro.

  1. No dependencies within a vector
  2. Each instruction generates a lot of results
  3. Regular Memory access pattern
  4. Fewer branches in the instruction sequence

Cons.

  1. Very inefficient if parallelism is irregular
  2. Memory bandwidth can become a bottleneck

[!EXAMPLE] VECTOR vs. SCALAR

image-20241208144551025

Vector Length

A vector register can hold some maximum number of elements for each data width (maximum vector length or MVL)

Vector-length (VL) register controls the length of any vector

VL can be anything from 0 to MVL

Strip Mining

when application vector length > MVL

Strip mining: Generation of a loop that handles MVL elements per iteration

就是说,实际的 VL 超过了 MVL(机器能处理的最长的向量长度),就把这个向量切成一段一段的分别处理

[!EXAMPLE]

Example: vector daxpy of N elements

First loop handles (N mod MVL) elements, the rest handle MVL

VL = (N mod MVL);       // set VL = N mod MVL
for (I=0; I<VL; I++)    // 1st loop is a single set of
   Y[I]=A*X[I]+Y[I];   //   vector instructions

low = (N mod MVL);

VL = MVL;               // set VL to MVL
for (I=low; I<N; I++)   // 2nd loop requires N/MVL
   Y[I]=A*X[I]+Y[I];   //   sets of vector instructions