Prefetchers on Commercial CPUs

AMD

AMD Family 19h processors implement data prefetch logic for their L1 data cache and L2 cache. In general, the L1 data prefetchers fetch lines into both the L1 data cache and the L2 cache, while the L2 data prefetchers fetch lines into the L2 cache only. (That is, L1 prefetches fill both levels, while L2 prefetches fill only the L2.)

The following prefetchers are included:

Source: Software Optimization Guide for AMD EPYC 7003 Processors.

Intel

Which prefetchers does Intel use?

Disclosure of H/W prefetcher control on some Intel processors

| Prefetcher | Bit# in MSR 0x1A4 | Description |
| --- | --- | --- |
| L2 hardware prefetcher | 0 | Fetches additional lines of code or data into the L2 cache |
| L2 adjacent cache line prefetcher | 1 | Fetches the cache line that comprises a cache line pair (128 bytes) |
| DCU prefetcher | 2 | Fetches the next cache line into L1-D cache |
| DCU IP prefetcher | 3 | Uses sequential load history (based on Instruction Pointer of previous loads) to determine whether to prefetch additional lines |

If any of the above bits is set to 1 on a core, that prefetcher is disabled on that core. Clearing the bit (setting it to 0) enables the corresponding prefetcher.
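The bit semantics above (1 disables, 0 enables) are easy to get backwards, so here is a minimal Python sketch that decodes and edits a raw MSR 0x1A4 value. The function names `decode_prefetcher_msr` and `set_prefetcher` are hypothetical helpers for illustration. The sketch only manipulates an integer; reading or writing the register itself is not shown.

```python
# Bit positions in MSR 0x1A4, taken from the table above.
PREFETCHER_BITS = {
    "L2 hardware prefetcher": 0,
    "L2 adjacent cache line prefetcher": 1,
    "DCU prefetcher": 2,
    "DCU IP prefetcher": 3,
}


def decode_prefetcher_msr(value: int) -> dict:
    """Map each prefetcher name to True if it is enabled (its bit is 0)."""
    return {
        name: ((value >> bit) & 1) == 0
        for name, bit in PREFETCHER_BITS.items()
    }


def set_prefetcher(value: int, name: str, enabled: bool) -> int:
    """Return a new MSR value with one prefetcher enabled or disabled.

    Enabling clears the bit; disabling sets it.
    """
    mask = 1 << PREFETCHER_BITS[name]
    return (value & ~mask) if enabled else (value | mask)
```

For example, `set_prefetcher(0, "DCU prefetcher", False)` yields `0x4`, which disables only the DCU prefetcher. On Linux the raw value is commonly obtained with a tool such as `rdmsr` from msr-tools, which needs root privileges and the `msr` kernel module.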

Pasted image 20230222143800.png
Pasted image 20230222144621.png

Prefetcher Performance

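Performance ranking: L2P > DCUI > DCUP > L2A (L2P = L2 hardware prefetcher, i.e. the streamer; DCUI = DCU IP prefetcher; DCUP = DCU prefetcher; L2A = L2 adjacent cache line prefetcher)
Open question: why does the streamer perform best? Could L2A be a stride prefetcher?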
Pasted image 20230317115617.png

Stream Prefetcher (L2/LLC)


16 instruction streams, 32 data streams
Each stream is for a 4K page and is core specific
Streams are replaced using round robin when there's no empty stream
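The three properties above (a fixed number of streams, one stream per 4K page, round-robin replacement when full) can be sketched as a small per-core table. This is a simplified Python model, not the hardware design: the class name `StreamTable` and its methods are hypothetical, and each stream is reduced to just its page number.

```python
from dataclasses import dataclass, field

PAGE_SHIFT = 12  # 4K pages: the page number is the address shifted right by 12


@dataclass
class StreamTable:
    """One table per core; capacity would be 32 for data, 16 for instructions."""
    capacity: int
    pages: list = field(default_factory=list)  # page number tracked by each slot
    next_victim: int = 0                       # round-robin replacement pointer

    def lookup(self, addr: int):
        """Return the slot tracking this address's page, or None (CAM miss)."""
        page = addr >> PAGE_SHIFT
        return self.pages.index(page) if page in self.pages else None

    def allocate(self, addr: int) -> int:
        """Create a stream for the page, evicting round-robin when full."""
        page = addr >> PAGE_SHIFT
        if len(self.pages) < self.capacity:
            self.pages.append(page)          # an empty stream is available
            return len(self.pages) - 1
        slot = self.next_victim              # no empty stream: replace in order
        self.pages[slot] = page
        self.next_victim = (slot + 1) % self.capacity
        return slot
```

With `capacity=2`, allocating pages for `0x1000`, `0x2000`, and `0x3000` in turn fills slots 0 and 1 and then overwrites slot 0, so a later lookup of `0x1000` misses.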

| Request is Demand Type | CAM Match | Request got L2 Hit | Action |
| --- | --- | --- | --- |
| 0 | 0 | x | Do nothing |
| 1 | 0 | 1 | Do nothing |
| 1 | 0 | 0 | Stream Creation |
| x | 1 | x | Stream Update |
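The decision table above reduces to two checks: a CAM match always updates the matching stream, and otherwise only a demand request that misses in L2 creates a new one. A minimal Python sketch, where the function name `stream_action` and the returned strings are hypothetical labels:

```python
def stream_action(is_demand: bool, cam_match: bool, l2_hit: bool) -> str:
    """Return the stream-table action for one incoming request."""
    if cam_match:
        # Row "x 1 x": the page already has a stream, so update it.
        return "update"
    if is_demand and not l2_hit:
        # Row "1 0 0": demand miss with no matching stream, so create one.
        return "create"
    # Rows "0 0 x" and "1 0 1": nothing to do.
    return "none"
```

Checking the CAM match first mirrors the don't-care entries in the last row, where neither the request type nor the L2 result matters.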

The leading field denotes the cache line.
Pasted image 20230313143938.png
Pasted image 20230314094355.png

A Stream Life Cycle

A stream has 4 states: Searching, Forward, Backward, Done

D-Side State Machine

Pasted image 20230314101418.png

I-Side State Machine

Stream's Key Fields

Stream Prefetcher: LLC

Stream Prefetcher: SendStream (Send Prefetch Requests to L2Q)

Processes

Bitmap Filter (For each stream)

L2Q credit

Prefetch Acceleration and Throttling

Pasted image 20230314161207.png

Prefetcher Blocking

Pasted image 20230314161154.png

Intel Stream Prefetcher Reverse Engineering

ReadPaper

Structure

The Stream prefetcher gives special treatment to streams whose first access to a page falls in the page's first or last two lines (that is, an access to the first two or last two lines of a page triggers a prefetch).
Otherwise, if confident enough, it prefetches a pair of consecutive lines starting on the last fetched line or the current line, whichever is furthest along the direction of the stream. (So there are two possible starting points.)
Accesses 32 or more lines away from the last fetched line are treated differently. (The current line and the last fetched line must not be too far apart.)
Prefetches issued will safely wrap around page limits, which may issue pointless prefetches but causes no potentially dangerous prefetches across page limits.
The prefetcher seems to output a confidence metric used to decide whether to prefetch, but suppressed prefetches may update the prefetcher state. Lastly, the prefetcher is reluctant to start streams too close to the page end.
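The pair-selection and wrap-around behavior described above can be sketched in Python. Several details are assumptions made only for illustration: a 4 KiB page of 64-byte lines (64 lines per page), a pair that begins one line past the chosen starting point, and accesses 32 or more lines away simply producing no prefetch here, since the notes do not say how they are handled. The name `next_prefetch_pair` is hypothetical, and the confidence check is not modeled.

```python
LINES_PER_PAGE = 64   # assumption: 4 KiB page / 64-byte cache lines
FAR_THRESHOLD = 32    # accesses this far from the last fetched line differ


def next_prefetch_pair(last_fetched: int, current: int, direction: int):
    """Return two line indices (within the page) to prefetch, or None.

    last_fetched, current: line indices in the page (0..63)
    direction: +1 for a forward stream, -1 for a backward stream
    """
    if abs(current - last_fetched) >= FAR_THRESHOLD:
        return None  # treated differently by the hardware; not modeled here
    # Start from whichever line is furthest along the stream direction.
    if direction > 0:
        start = max(last_fetched, current)
    else:
        start = min(last_fetched, current)
    # Assumption: the pair is the next two lines past the starting point.
    # The modulo keeps targets inside the page, so a stream at the page edge
    # wraps around instead of crossing into the neighboring page.
    return [(start + direction * k) % LINES_PER_PAGE for k in (1, 2)]
```

A forward stream with `last_fetched=62` and `current=63` returns `[0, 1]`: a pointless prefetch, but one that stays within the same page, matching the wrap-around observation.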

AMP Prefetcher

Glossary

Data Structure

Pasted image 20230316150555.png|600

Delta Bucket Encoding (Index Into SPT0/SPT1)

Pasted image 20230316145042.png|400

Steps

Pasted image 20230316150410.png
Pasted image 20230316152202.png

Pasted image 20230316145439.png

Pasted image 20230316150313.png

Pasted image 20230316152718.png

Pasted image 20230316152743.png

Pasted image 20230316153229.png

Pasted image 20230316151610.png

Pasted image 20230316170245.png

ARM

APPLE

X3

The X3 features a dozen prefetch engines.
One new engine looks for sequences of indirect loads, while the other seeks three-dimensional (spatial) patterns.