Software Prefetcher

Profile-Guided Optimization (PGO) Overview

A. Overall Process

Step 1: Running & Profiling
Step 2: Analysis
The profiling and tracing data, together with the unmodified binaries, are then fed into an analysis tool
Step 3: Injection
Pasted image 20230223232525.png|500

B. Profiling & Tracing Tools

| Tool | Information obtained |
| --- | --- |
| VTune | instruction mix |
| PMU counters | instruction mix |
| PEBS (precise event-based sampling) | execution frequency, AMAT |
| LBR (last branch record) | |
| PT (processor trace) | execution frequency |

C. Representative work

Heiner Litz


Classifying Memory Access Patterns

This approach executes the target binary once to obtain execution traces and miss profiles, performs dataflow analysis, and finally injects prefetches into a new binary.

Pasted image 20230223133644.png
classify


Memory Access Pattern -- The Recurrence Relation

A_n = f(A_{n-1})

| Pattern | Recurrence relation | Example | Note |
| --- | --- | --- | --- |
| Constant | A_n = A_{n-1} | `*ptr` | |
| Delta | A_n = A_{n-1} + d | streaming, array traversal | d = stride |
| Pointer Chase | A_n = Ld(A_{n-1}) | `next = current->next` | load address is derived from the value of the previous load |
| Indirect Delta | A_n = Ld(B_{n-1} + d) | `*(M[i])` | |
| Indirect Index | A_n = Ld(B_{n-1} + d) + c | `M[N[i]]` | c = base addr of M, d = stride |
| Constant plus offset reference | A_n = B_n + c1; B_n = Ld(B_{n-1} + c2) | linked-list traversal | c1 = data offset, c2 = next-pointer offset, B_n = address of the n-th struct |

To formalize prefetch patterns, the paper defines a prefetch kernel:

For a given load instruction, the function f(x) that computes its next address can be expressed as a prefetch kernel: a set of data sources together with the operations on those sources that generate the delinquent load address.


Extract Prefetch Kernel

| Machine classification (kernel ops) | Corresponding pattern |
| --- | --- |
| Constant | Constant |
| Add | Delta |
| Add, Multiply | Complex |
| Load, Add | Linked List, Indirect Index, ... |
| Load, Add, Multiply | Complex |

Preparation


Process


Analysis

Pasted image 20230223112632.png


Findings:


CRISP: Critical Slice Prefetching


Overview

Pasted image 20230223232525.png|500


Motivation

Reordering instructions --> critical first

Pasted image 20230223144004.png|600

What is CRISP:

CRISP is a lightweight mechanism to hide the high latency of irregular memory access patterns by leveraging criticality-based scheduling.
CRISP executes delinquent loads and their load slices as early as possible, hiding a significant fraction of their latency. Furthermore, the latency induced by branch mispredictions and other high-latency instructions can be hidden with a similar approach.


Two Tasks:


Task 1: Instruction Criticality


Step1: Determining Delinquent Loads

A. What is a critical load

Define: A load is critical if


There are other factors, e.g.


B. How to obtain this information


Step2a: Load Slice Extraction


Step2b: Branch Slice Extraction

For loops that contain hard-to-predict branches, the execution time of a loop iteration is determined to a large degree by the time required to resolve the branch outcome.
--> Prioritize hard-to-predict branch slices


Step3: Inject into binary

Rewrite the compiled assembly code, adding the new instruction prefix to every critical instruction, using post-link-time compilation.


Task 2: Hardware Scheduling

Pasted image 20230223232214.png|400

APT-GET: Profile-Guided Timely Software Prefetching


Overview

The automated pipeline:

  1. Use perf record to find the loads with the most cache misses and their basic blocks
  2. Collect the corresponding LBR profiles
  3. Compute the prefetch distance and the prefetch injection site
  4. Feed the results to LLVM for injection

Background

Hardware prefetchers cannot detect complex patterns such as indirect memory accesses. Existing software prefetchers can achieve decent accuracy and coverage from static information, but without dynamic execution information (e.g., execution time) they struggle to be timely. This paper argues that timely prefetching hinges on two things, the prefetch distance and the prefetch injection site, and therefore uses a profile-guided approach to collect dynamic information that guides timely prefetching.

Pasted image 20230223234223.png|400

Motivation

Pasted image 20230223234333.png
Pasted image 20230223234357.png

Step1: Profiling

The paper uses the Last Branch Record (LBR) on Intel CPUs, shown above.
Each LBR entry has a PC recording where a taken branch occurred, a Target recording the jump destination, and a cycle count for that branch's execution.
The LBR is a buffer holding the last 32 basic blocks (BBLs) executed by the CPU.
A BBL is the contiguous sequence of instructions executed between two taken branches.

Step2: Analysis

Overview

Loop latency contains two components of execution time.

Once IC_latency and MC_latency are obtained, the prefetch distance can be computed with the formula below; at this distance the prefetch hides the memory latency.

IC_latency × prefetch_distance = MC_latency

The next question is how to obtain IC_latency and MC_latency.

A. Computing loop execution time & loop trip count

Pasted image 20230223235046.png|400

B. Determining the Optimal Prefetch Distance

Knowing the loop execution time alone is still not enough to derive IC_latency and MC_latency.
To separate them, the paper builds a distribution of loop execution latencies.

Pasted image 20230223235205.png|500
The distribution shows four peaks, at 80, 230, 400, and 650 cycles, corresponding to loads served from L1, L2, LLC, and DRAM respectively.
From it, IC_latency = 80 cycles and MC_latency = 650 − IC_latency = 570 cycles.
Finally, prefetch_distance = 570 / 80 ≈ 7.

Step3: Injection -- Finding the Optimal Prefetch Injection Site

Whether to inject into the outer loop or the inner loop is chosen by the formula below; if it holds, inject into the outer loop:

loop_trip_count × k < prefetch_distance

where k denotes ???
If the prefetch is injected into the outer loop, the prefetch distance must be recomputed from the outer loop's execution latency.


Resources

Disclosure of H/W prefetcher control on some Intel processors

| Prefetcher | Bit# in MSR 0x1A4 | Description |
| --- | --- | --- |
| L2 hardware prefetcher | 0 | Fetches additional lines of code or data into the L2 cache |
| L2 adjacent cache line prefetcher | 1 | Fetches the cache line that comprises a cache line pair (128 bytes) |
| DCU prefetcher | 2 | Fetches the next cache line into the L1-D cache |
| DCU IP prefetcher | 3 | Uses sequential load history (based on the Instruction Pointer of previous loads) to determine whether to prefetch additional lines |
If any of the above bits are set to 1 on a core, then that particular prefetcher on that core is disabled. Clearing that bit (setting it to 0) will enable the corresponding prefetcher.

Pasted image 20230222143800.png
Pasted image 20230222144621.png