Temporal Prefetcher

0. Temporal Pattern

Pasted image 20230128110653.png|400


Pasted image 20230530163957.png


1. Category 1: GHB-Based

1.1. STMS (Sampled Temporal Memory Streaming)



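STMS records address correlations over the global access stream (pairs of correlated addresses; for example, an access to A tends to be followed by an access to B) and issues prefetch requests based on those correlations.
STMS records the global stream's address correlations in a FIFO buffer. For example, the access sequence A, B, C, D, E is appended to the FIFO buffer. If A is accessed again later, STMS predicts that B will be accessed next.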
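A minimal sketch of the record-then-replay idea described above, not the STMS hardware design. The class name `StmsSketch`, the `degree` parameter, and the unbounded Python list standing in for the FIFO buffer are assumptions made for illustration. A real history buffer is bounded and old entries are overwritten.

```python
class StmsSketch:
    """Toy model: a global history buffer plus an index of last positions."""

    def __init__(self, degree=1):
        self.history = []      # global access stream in arrival order (unbounded here)
        self.index = {}        # address -> position of its most recent occurrence
        self.degree = degree   # hypothetical knob: successors prefetched per trigger

    def access(self, addr):
        """Record addr and return the addresses predicted to follow it."""
        predictions = []
        pos = self.index.get(addr)
        if pos is not None:
            # Replay what followed the previous occurrence of addr.
            predictions = self.history[pos + 1 : pos + 1 + self.degree]
        self.index[addr] = len(self.history)
        self.history.append(addr)
        return predictions


stms = StmsSketch(degree=1)
for addr in ["A", "B", "C", "D", "E"]:
    stms.access(addr)
print(stms.access("A"))  # ['B']
```

The second access to A finds its earlier position in the history and returns the address that followed it.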

a. Structure

Pasted image 20230203115638.png|400

b. Process

c. Comments


1.2. Domino



STMS looks up the GHB with a single address, which is not accurate enough.
Domino improves on STMS: it can match the GHB using either one or two addresses.
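A rough sketch of why a two-address match can be more accurate than a one-address match. This is not Domino's actual index-table organization: the history is a plain Python list, the lookup is a linear scan, and the function name `find_stream` and the "prefer the pair, fall back to a single address" policy are assumptions made for illustration.

```python
def find_stream(history, last_two):
    """Return the history position just past the most recent match, or None.

    last_two is (previous address, current address).
    """
    prev, cur = last_two
    # Prefer the most recent place where prev and cur occurred back to back.
    for i in range(len(history) - 1, 0, -1):
        if history[i - 1] == prev and history[i] == cur:
            return i + 1
    # Otherwise fall back to the most recent occurrence of cur alone.
    for i in range(len(history) - 1, -1, -1):
        if history[i] == cur:
            return i + 1
    return None


history = ["A", "B", "C", "X", "B", "D"]
print(history[find_stream(history, ("A", "B"))])  # C
print(history[find_stream(history, ("Z", "B"))])  # D
```

With only B as the key, the most recent occurrence predicts D. Knowing that the access before B was A selects the earlier occurrence and predicts C instead.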

As the figure below shows, A, B, C, X (top) is a global access stream, and below it are the index table structures of three GHB-based prefetchers.

a. Structure

b. Process

The triggering events for prefetching are cache misses and prefetch hits.

b.i. Recording

b.ii. Replaying (replay first, record second)

PointBuf holds the address stream that Domino is going to prefetch (obtained from the HT).
On a prefetch-hit triggering event, Domino continues prefetching the addresses in PointBuf.
On a cache-miss triggering event, Domino treats the sequence in PointBuf as stale and finds a new stream.
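The two replay cases can be sketched as follows. This is a simplified model under stated assumptions: the HT is a fixed Python list (recording is not modelled), the new-stream lookup uses a single address, and the names `ReplaySketch`, `on_event`, and `degree` are hypothetical.

```python
from collections import deque


class ReplaySketch:
    def __init__(self, history, degree=2):
        self.history = history     # stand-in for the HT: past addresses in order
        self.point_buf = deque()   # stand-in for PointBuf
        self.degree = degree       # hypothetical: prefetches issued after a miss

    def _find_new_stream(self, addr):
        # Refill PointBuf with what followed the most recent occurrence of addr.
        self.point_buf = deque()
        for i in range(len(self.history) - 1, -1, -1):
            if self.history[i] == addr:
                self.point_buf = deque(self.history[i + 1 :])
                return

    def on_event(self, kind, addr):
        """kind is "miss" or "prefetch_hit"; returns the addresses to prefetch."""
        if kind == "miss":
            self._find_new_stream(addr)   # current stream is considered stale
        budget = self.degree if kind == "miss" else 1
        issued = []
        while self.point_buf and len(issued) < budget:
            issued.append(self.point_buf.popleft())
        return issued


replay = ReplaySketch(["A", "B", "C", "D", "E"])
print(replay.on_event("miss", "A"))          # ['B', 'C']
print(replay.on_event("prefetch_hit", "B"))  # ['D']
```

A prefetch hit keeps draining the same stream, while a miss discards it and restarts from the missing address.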

Find a New Stream:

2. Irregular Stream Buffer

2.1. Simple ISB



Both STMS and Domino use the global access stream. The ISB paper argues that a PC-localized stream (the miss stream of each individual load instruction) works better.
Pasted image 20230530155828.png

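However, PC localization is hard to implement in a GHB-based prefetcher. As the figure below shows, the PC is used to index the stream, and the entries within a PC-localized stream are chained together with a linked list.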
Pasted image 20230530160300.png|400
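ISB combines address correlation with PC localization, and it also supports prefetching of metadata.
ISB maps correlated physical addresses into a new address space, the structural address space. A temporal stream is assigned consecutive addresses in the structural address space, so it can be prefetched with a next-line prefetcher.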

Pasted image 20230203115137.png|400
Pasted image 20230128144808.png

a. Components

Pasted image 20230128154912.png|500

a.i. Training Unit: Segregates global stream by PC and assigns structural addresses

Input

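This unit tracks, for each PC in the PC-localized streams, the last address that PC accessed (the last observed addr).
When an input arrives, the unit uses the input PC to find the matching stream, assigns the input address the structural address that follows the one held by the last observed addr, and then updates the last observed addr to the current input address.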

a.ii. Address Mapping Caches (AMCs)

a.iii. Stream Predictor

Similar to a stream buffer.

a.iv. TLB Sync

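ISB's on-chip caches hold mapping information only for cache lines in TLB-resident pages, and they are kept in sync with TLB misses: when a page is evicted, its mappings in the cache are written back, and when a page is loaded, the cache loads that page's mappings.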

b. Process

ISB’s three key functions: training, prediction, and TLB eviction

b.i. Training

The training process assigns consecutive structural addresses to the correlated physical addresses that are observed by the training unit.

Example:
The Training Unit’s last observed address is 0xba1f00, whose structural address is 0x1100.
When the Training Unit receives the physical address 0xca4b00 in the same localized stream, it performs three steps.

  1. It assigns 0xca4b00 the structural address following 0xba1f00’s structural address, namely 0x1101.
  2. It updates the PS-AMC entry indexed by physical address 0xca4b00, and it updates the SP-AMC entry indexed by structural address 0x1101.
  3. It changes the last observed address in the Training Unit to 0xca4b00.
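The three steps above can be sketched as follows. This is a minimal model, assuming dictionaries in place of the PS-AMC and SP-AMC. It omits how the first address of a stream gets a structural address and how conflicting mappings are resolved, and the PC value 0x400 is made up for the example.

```python
class IsbTrainingSketch:
    def __init__(self):
        self.last_addr = {}   # Training Unit: PC -> last observed physical address
        self.ps_amc = {}      # physical address -> structural address
        self.sp_amc = {}      # structural address -> physical address

    def train(self, pc, phys):
        last = self.last_addr.get(pc)
        if last is not None and last in self.ps_amc and phys not in self.ps_amc:
            # Step 1: assign the structural address following the last one.
            structural = self.ps_amc[last] + 1
            # Step 2: update both mapping caches.
            self.ps_amc[phys] = structural
            self.sp_amc[structural] = phys
        # Step 3: remember this address as the stream's last observed address.
        self.last_addr[pc] = phys


isb = IsbTrainingSketch()
isb.last_addr[0x400] = 0xBA1F00          # state before the example starts
isb.ps_amc[0xBA1F00] = 0x1100
isb.sp_amc[0x1100] = 0xBA1F00
isb.train(0x400, 0xCA4B00)
print(hex(isb.ps_amc[0xCA4B00]))         # 0x1101
```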

Pasted image 20230128155334.png|400

b.ii. Prediction

Each L2 cache access becomes a trigger address for the prefetcher, causing the PS-AMC to retrieve the trigger address’ structural address.
Pasted image 20230130133929.png|400
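A sketch of the prediction path, assuming the flow is: translate the trigger address to its structural address, step forward in structural space like a next-line prefetcher, then translate back to physical addresses. The dictionaries stand in for the PS-AMC and SP-AMC, and the function name `predict` and the `degree` parameter are hypothetical.

```python
def predict(ps_amc, sp_amc, trigger, degree=2):
    """Return the physical addresses to prefetch for a trigger address."""
    structural = ps_amc.get(trigger)
    if structural is None:
        return []                      # no mapping yet, nothing to predict
    prefetches = []
    for s in range(structural + 1, structural + 1 + degree):
        phys = sp_amc.get(s)           # translate back to a physical address
        if phys is not None:
            prefetches.append(phys)
    return prefetches


ps_amc = {0xBA1F00: 0x1100, 0xCA4B00: 0x1101}
sp_amc = {0x1100: 0xBA1F00, 0x1101: 0xCA4B00}
print([hex(a) for a in predict(ps_amc, sp_amc, 0xBA1F00)])  # ['0xca4b00']
```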

b.iii. TLB evictions

During a TLB eviction, the ISB writes to DRAM any modified mappings for the evicted page, and it fetches from DRAM the structural mapping for the incoming page.
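A small sketch of the write-back and fetch behavior tied to TLB events. The 4 KB page size, the per-page dictionaries, and the names `TlbSyncSketch`, `on_tlb_evict`, and `on_tlb_fill` are assumptions. DRAM is modelled as another dictionary.

```python
PAGE_SIZE = 4096  # assumed page size


class TlbSyncSketch:
    def __init__(self):
        self.on_chip = {}    # page number -> {physical addr: structural addr}
        self.off_chip = {}   # stand-in for mappings stored in DRAM
        self.dirty = set()   # pages whose on-chip mappings were modified

    def update(self, phys, structural):
        page = phys // PAGE_SIZE
        self.on_chip.setdefault(page, {})[phys] = structural
        self.dirty.add(page)

    def on_tlb_evict(self, page):
        mappings = self.on_chip.pop(page, None)
        if mappings is not None and page in self.dirty:
            self.off_chip[page] = mappings     # write back modified mappings
        self.dirty.discard(page)

    def on_tlb_fill(self, page):
        # Fetch the incoming page's mappings (empty if none were ever stored).
        self.on_chip[page] = dict(self.off_chip.get(page, {}))


sync = TlbSyncSketch()
sync.on_tlb_fill(0xBA1)
sync.update(0xBA1F00, 0x1100)
sync.on_tlb_evict(0xBA1)
sync.on_tlb_fill(0xBA1)
print(hex(sync.on_chip[0xBA1][0xBA1F00]))  # 0x1100
```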

c. Overhead

c.i. Traffic

8.4% memory traffic overhead due to meta-data accesses

c.ii. Storage

2.2. Optimized MISB



MISB improves on ISB's metadata management by introducing metadata prefetching.

a. Components

a.i. Metadata Cache

Each logical metadata cache line in MISB's PS and SP caches holds one mapping, and on an eviction, the least recently used mapping is evicted.

a.ii. Metadata Prefetching

On PS and SP metadata cache misses, MISB gets prefetching benefits from fetching a metadata cache line with 8 mappings.
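The combination of per-mapping LRU replacement and 8-mapping line fetches can be sketched as below. Grouping mappings into lines by `key // 8`, the `backing` dictionary standing in for off-chip metadata, and the class name `MetadataCacheSketch` are assumptions made for illustration.

```python
from collections import OrderedDict

MAPPINGS_PER_LINE = 8


class MetadataCacheSketch:
    def __init__(self, capacity, backing):
        self.capacity = capacity       # number of mappings held on chip
        self.backing = backing         # stand-in for off-chip metadata
        self.cache = OrderedDict()     # key -> value, least recently used first
        self.offchip_reads = 0

    def _insert(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the LRU mapping

    def lookup(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        # Miss: one off-chip read brings in the whole line of 8 mappings.
        self.offchip_reads += 1
        base = key - key % MAPPINGS_PER_LINE
        for k in range(base, base + MAPPINGS_PER_LINE):
            if k != key and k in self.backing:
                self._insert(k, self.backing[k])
        if key in self.backing:
            self._insert(key, self.backing[key])  # demanded mapping is most recent
            return self.backing[key]
        return None


backing = {k: k + 100 for k in range(16)}
cache = MetadataCacheSketch(capacity=16, backing=backing)
cache.lookup(3)
cache.lookup(5)
print(cache.offchip_reads)  # 1
```

The lookup of key 5 hits because the miss on key 3 already brought in its neighbors.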

a.iii. Metadata Filtering

Many PS load requests are to physical addresses for which MISB has no mapping since they have never been seen before.
To filter these requests, MISB uses a Bloom Filter.
In particular, when a new PS mapping is written to off-chip storage, the mapping is added to a Bloom filter. Future PS misses are sent to memory only if the physical address is found in the Bloom filter.
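A minimal Bloom filter sketch showing how the filtering decision could work. The filter size, the number of hash functions, the use of SHA-256 as the hash, and the helper name `should_fetch_ps_mapping` are all assumptions, not MISB's parameters.

```python
import hashlib


class BloomFilterSketch:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, addr):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, addr):
        for p in self._positions(addr):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, addr):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(addr))


def should_fetch_ps_mapping(bloom, phys):
    """On a PS miss, go off chip only if the mapping may exist there."""
    return bloom.might_contain(phys)


bloom = BloomFilterSketch()
print(should_fetch_ps_mapping(bloom, 0xCA4B00))  # False: nothing written yet
bloom.add(0xBA1F00)                              # mapping written off chip
print(should_fetch_ps_mapping(bloom, 0xBA1F00))  # True
```

A Bloom filter can return false positives but never false negatives, so a filtered-out request is guaranteed to have no off-chip mapping.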

b. Process

Pasted image 20230129235019.png

c. Overhead

c.i. Traffic

c.ii. Storage

32KB for the on-chip metadata cache and 17KB for the Bloom filter (is the Bloom filter really that large?)

Pasted image 20230129235941.png|500

3. Category 3: Table-Based

3.1. Markov Prefetcher

Omitted.

3.2. Triage



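Triage proposes a temporal prefetcher whose metadata lives entirely on chip (the metadata is stored in the LLC). Triage argues that most of the performance gain comes from a small fraction of the metadata.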
Pasted image 20230131164852.png|500

a. Components

a.i. Training Unit

Similar to the training unit in ISB.
It tracks, for each PC in the PC-localized streams, the last address that PC accessed (the last observed addr).

a.ii. Metadata Store

A portion of the LLC is set aside as the metadata store, which holds the metadata.
Triage stores address correlations in table form.
Pasted image 20230203133100.png|400

b. Process

b.i. Training

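The following is a training example.
Suppose the last observed addr of the stream with PC 4 in the training unit is A. When the input (PC, Addr) = (4, B) arrives, the training unit concludes that A and B are correlated and sends this correlation to Triage's metadata store.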
Pasted image 20230131170247.png|500
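The training example can be sketched as follows. It is a minimal model in which the training unit is a dictionary keyed by PC and the correlation is returned as a tuple instead of being sent to the metadata store. The names `TrainingUnitSketch` and `observe` are hypothetical.

```python
class TrainingUnitSketch:
    def __init__(self):
        self.last_addr = {}   # PC -> last observed addr of that PC's stream

    def observe(self, pc, addr):
        """Return the (previous, current) correlation for this PC, or None."""
        prev = self.last_addr.get(pc)
        self.last_addr[pc] = addr
        if prev is None:
            return None       # first access of this stream, nothing to correlate
        return (prev, addr)


unit = TrainingUnitSketch()
unit.observe(4, "A")
print(unit.observe(4, "B"))  # ('A', 'B')
```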

b.ii. Prediction

When an address A arrives, Triage searches the metadata store, finds the matching entry (A, B), and issues a prefetch request for B.
Pasted image 20230131170538.png|500
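A sketch of the table lookup behind prediction, assuming one successor per trigger address and a Python dictionary in place of the LLC-resident table. Capacity limits and replacement are not modelled, and the names `MetadataStoreSketch`, `record`, and `prefetch_candidates` are hypothetical.

```python
class MetadataStoreSketch:
    def __init__(self):
        self.table = {}   # trigger address -> correlated next address

    def record(self, prev, nxt):
        self.table[prev] = nxt   # a newer correlation overwrites the older one

    def prefetch_candidates(self, addr):
        nxt = self.table.get(addr)
        return [] if nxt is None else [nxt]


store = MetadataStoreSketch()
store.record("A", "B")
print(store.prefetch_candidates("A"))  # ['B']
print(store.prefetch_candidates("C"))  # []
```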

b.iii. Metadata Replacement

Metadata entries are replaced using the Hawkeye replacement policy.

b.iv. Metadata Partition Updates

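After every fixed number of metadata accesses, Triage redoes the LLC way partitioning, dynamically growing or shrinking the share of the LLC occupied by the metadata store.
If Triage needs to grow, dirty cache lines are flushed and marked invalid so that Triage can use them.
If Triage needs to shrink, the locations it occupied are marked invalid so that the LLC can use them.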
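The grow and shrink cases can be sketched for a single cache set as below. How the new partition size is chosen and how often repartitioning runs are outside this sketch. The `Line` record, the convention that the first N ways belong to the metadata store, and the function name `repartition` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Line:
    owner: str            # "data" or "metadata"
    valid: bool = True
    dirty: bool = False


def repartition(ways, new_metadata_ways):
    """Reassign ways of one set; return how many dirty data lines were flushed."""
    flushed = 0
    for i, line in enumerate(ways):
        wanted = "metadata" if i < new_metadata_ways else "data"
        if line.owner == wanted:
            continue
        if line.owner == "data" and line.valid and line.dirty:
            flushed += 1  # growing: write the dirty line back before reuse
        # In both directions the way is handed over as an invalid line.
        line.owner, line.valid, line.dirty = wanted, False, False
    return flushed


ways = [Line("metadata"), Line("metadata"), Line("data", dirty=True), Line("data")]
print(repartition(ways, 3))  # 1: one dirty data line flushed while growing
print(repartition(ways, 1))  # 0: shrinking only invalidates metadata ways
```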

c. Overhead

c.i. Storage

Occupies 512KB or 1MB of the LLC, or a dynamically allocated share.