HW2
3220104929 241024
1
(a)
How many bits are there in the fields of tag, index, and block offset of the physical memory address?
block offset = \(\log_2 32 = 5\) bits
index = \(\log_2(2^{13}/32/2) = 7\) bits (8 KB cache, 32 B blocks, 2-way)
tag = 38 - 7 - 5 = 26 bits
(b)
Draw a graph to show a cache line (including tag, data, and some other control bits) in the cache.
1 bit | 1 bit | 26 bits | 32 bytes |
---|---|---|---|
dirty bit | valid bit | tag | data |
(c)
Draw a graph to show if it is implemented in the way of virtually indexed and physically tagged cache. Draw both cache and TLB.
(d)
Please describe the access procedure to the memory hierarchy in (c) when a CPU address (virtual address) is given to access the cache.
- The virtual address (VA) is issued by the CPU.
- The VPN goes to the TLB for virtual-to-physical page number translation; in parallel, the index field of the VA selects the cache set.
- On a TLB miss, the page table is walked to fetch the mapping.
- On a page fault (the page is not in memory), the page is brought in from disk.
- The physical page number returned by the TLB is compared against the tags of the lines in the selected set to locate the data.
- On a cache miss, the request goes down to the next level of the hierarchy.
- The cache access completes.
2
Which machine will have better performance in memory access (AMAT)? Why?
Average Memory Access Time (AMAT) = hit time + miss rate × miss penalty
\(AMAT_A = 8+0.08\times 50 = 12ns\)
\(AMAT_B=2+0.15\times (20+0.1\times 50)=5.75ns\)
So machine B has the better AMAT.
[!IMPORTANT]
The above is off: the cache access time should only be charged on a hit (i.e., the hit time must be weighted by the hit rate), not on every access.
3
Hint: the L1 miss penalty is taken as the time to transfer the replacement block from L2 to L1; the response (latency) time is ignored.
a
What is the average memory access time for instruction accesses?
1 cycle = 1/\(10^9\) s = 1 ns (at 1 GHz)
transfer rate \(_{L1}\) = 16 / (1 GHz / 266 MHz) = 532/125 bytes/cycle
transfer time \(_{IL1}\) = 32 / (532/125) = 1000/133 cycles/block
transfer rate \(_{L2}\) = 16 / (1 GHz / 133 MHz) = 266/125 bytes/cycle
transfer time \(_{L2}\) = 64 / (266/125) = 4000/133 cycles/block
miss penalty \(_{L2}\) = mem access time + transfer time \(_{L2}\)
= 80 + 4000/133 cycle/block
miss penalty \(_{L2dirty}\) = (80 + 4000/133) * 1.5
= 120 + 6000/133 cycle/block
miss penalty \(_{IL1}\) = 12 + 0.15 * (120 + 6000/133) + 1000/133
= 30 + 1900/133 cycle/block
miss penalty \(_{IL1dirty}\) = (30 + 1900/133) * 1.5 = 45 + 2850/133 cycle/block
\(AMAT_I=hit\;time_{L1}+miss\;rate_{L1}\times miss\;penalty_{L1}\)
\(=0+0.02\times ( 45 + 2850/133)\)
\(= 0.9 + 57/133 \approx 1.33\;cycles\)
b
What is the average memory access time for data reads?
transfer time \(_{DL1}\) = 16 / (532/125) = 500/133 cycle/block
miss penalty \(_{RDL1}\) = 12 + 0.15 * (120 + 6000/133) + 500/133 = 30 + 1400/133 cycles/block
(L1 is write-through, so replaced L1 blocks are never dirty; no write-back term is needed for reads.)
\(AMAT_{DR}=hit\;time_{L1}+miss\;rate_{L1}\times miss\;penalty_{L1}\)
$=0+0.05\times(12 + 0.15 * (120 + 6000/133) + 500/133) $
\(=1.5+70/133\approx 2.03\;cycles\)
c
What is the average memory access time for data writes?
Note: write-through has no notion of a dirty block, so "50% of all blocks replaced are dirty" is a red herring.
miss penalty \(_{WDL1}\) = 12 + 0.15 * (120 + 6000/133) + 500/133 + 0.1 * 80
= 38 + 1400/133 cycles/block
\(AMAT_{DW}=hit\;time_{L1}+miss\;rate_{L1}\times miss\;penalty_{L1}\)
\(=0+0.05\times ( 38 + 1400/133 )\)
\(=1.9+70/133\approx 2.43\;cycles\)
d
What is the overall CPI, including memory accesses?
CPI = 1.35 + 1 × 1.33 + 0.2 × 2.03 + 0.1 × 2.43 ≈ 3.33
4
a
What should be the minimum size of the cache to take advantage of blocked execution?
A single-precision float in C takes 4 bytes.
One cache block = 64 bytes = 16 floats, so a 16 × 16 tile is the natural sub-block size.
Given the row-major layout of the matrices in memory, each row of a tile occupies exactly one cache block.
The input tile and the output tile each need their own set of blocks, so the cache must hold 16 × 2 × 64 = 2048 bytes = 2 KB.
b
How do the relative number of misses in the blocked and unblocked versions compare in the minimum sized cache above?
In the blocked version, every float brought into the cache is used before its block is evicted, so each float is transferred into the cache exactly once; since 16 consecutive floats arrive together in one block, every 16 accesses see 15 hits and 1 miss.
So miss rate = 1/16.
The unblocked version is far worse. The input matrix is still traversed along rows, so its misses behave like the blocked version.
The output matrix, however, is written column-wise across many rows; the cache cannot hold all the touched blocks at once, so each block must be brought in 16 times to write its 16 floats, i.e., 100% misses.
So miss rate = 1/16 × 0.5 + 1 × 0.5 = 17/32.
c
Write code to perform a transpose with a block size parameter B which uses BxB blocks.
```c
int i, j, k, l;
int input[256][256], output[256][256];

for (i = 0; i < 256; i += B)
    for (j = 0; j < 256; j += B)
        for (k = 0; k < B; k++)
            for (l = 0; l < B; l++)
                output[j + l][i + k] = input[i + k][j + l];
```
d
What is the minimum associativity required of the L1 cache for consistent performance independent of both arrays’ position in memory?
2-way: this prevents an input block and an output block that happen to map to the same index from evicting each other; with exactly two tiles live at once, 2 ways are just enough.
5
Explain why the plot above is U-shaped.
At first the block size is small, so spatial locality is under-exploited; as the block size grows, the benefit of spatial locality increases, up to a sweet spot where both spatial and temporal locality are well exploited and performance peaks.
As the block size grows further with the cache size fixed, the number of blocks shrinks and temporal locality suffers, so the miss rate climbs back up; the miss penalty also grows because larger blocks take longer to transfer, so performance falls off.
6
- Reduce the miss penalty
- Reduce the miss rate
- Reduce the miss penalty and miss rate via parallelism
- Reduce the time to hit in the cache
- Multilevel caches: ( 1 )
- Critical word first: ( 1 )
- Larger block size: ( 2 )
- Non-blocking caches: ( 3 )
- Hardware prefetching: ( 3 )
- Small and simple caches: ( 4 )
- Higher associativity: ( 2 )
- Avoiding address translation: ( 4 )
- Victim caches: ( 1 )
- Compiler optimizations: ( 2 )
- Way prediction and pseudo-associativity: ( 4 )
- Pipelined cache access: ( 4 )