Software Prefetcher
Profile-Guided Optimization (PGO) Overview
A. Overall Process
Step 1: Running & Profiling
Step 2: Analysis
The profiling and tracing data, as well as the unmodified binaries, are then fed into an analysis tool that identifies the delinquent loads and determines what to prefetch and where to inject the prefetches.
Step 3: Injection
B. Profiling & Tracing Tools
Tool | Information collected |
---|---|
VTune | instruction mix |
PMU counters | instruction mix |
PEBS (precise event-based sampling) | execution frequency, AMAT |
LBR (last branch record) | loop trip count, per-iteration execution time |
PT (processor trace) | execution frequency |
C. Representative Work
- [MICRO'22 Best Paper] Whisper: Profile-Guided Branch Misprediction Elimination for Data Center Applications
- [ISCA'22] Thermometer: Profile-Guided BTB Replacement for Data Center Applications
- [EuroSys'22] APT-GET: Profile-Guided Timely Software Prefetching
- [ASPLOS'22] CRISP: Critical Slice Prefetching
- [MICRO'21] Twig: Profile-Guided BTB Prefetching for Data Center Applications
- [ISCA'21] Ripple: Profile-Guided Instruction Cache Replacement for Data Center Applications
- [MICRO'20] I-SPY: Context-Driven Conditional Instruction Prefetching with Coalescing
- [ASPLOS'20] Classifying Memory Access Patterns for Prefetching
- [ISCA'19] AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers
- [CGO'19] BOLT: A Practical Binary Optimizer for Data Centers and Beyond
- [CGO'16] AutoFDO: Automatic Feedback-Directed Optimization for Warehouse-Scale Applications
Classifying Memory Access Patterns
This approach executes the target binary once to obtain execution traces and miss profiles, performs dataflow analysis, and finally injects prefetches into a new binary.
Memory Access Pattern -- The Recurrence Relation
$A_{n+1} = f(A_n)$, where $A_n$ is the memory address of the nth execution of a particular load instruction and $f$ is an arbitrary function. The complexity of $f$ determines the capabilities a prefetcher requires to predict and prefetch a certain future cache miss.
Pattern | Recurrence relation | Example | Note |
---|---|---|---|
Constant | $A_{n+1} = A_n$ | *ptr | |
Delta | $A_{n+1} = A_n + d$ | streaming, array traversal | d = stride |
Pointer Chase | $A_{n+1} = Ld(A_n)$ | next = current->next | load addr is derived from the value of the previous load |
Indirect Delta | $A_{n+1} = Ld(B_n + d)$ | *(M[i]) | B = address stream over M, advanced by stride d |
Indirect Index | $A_{n+1} = Ld(B_n + d) + c$ | M[N[i]] | c = base addr of M, d = stride |
Constant plus offset reference | $A_{n+1} = Ld(A_n) + c$ | linked list traversal | c = offset of the pointer field |
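To make the patterns concrete, here is a small C++ sketch (the container and node types are illustrative, not taken from the paper):

```cpp
#include <cstddef>
#include <vector>

struct Node { Node* next; int payload; };

// Delta: A_{n+1} = A_n + d, with d = sizeof(int); a stride prefetcher covers this.
long sum_array(const std::vector<int>& M) {
    long s = 0;
    for (std::size_t i = 0; i < M.size(); ++i) s += M[i];
    return s;
}

// Indirect Index: the address of M[N[i]] depends on the value just loaded
// from N, so a purely spatial/stride prefetcher cannot predict it.
long sum_indirect(const std::vector<int>& M, const std::vector<int>& N) {
    long s = 0;
    for (int idx : N) s += M[idx];
    return s;
}

// Pointer Chase: A_{n+1} = Ld(A_n); each address is the value returned by
// the previous load.
long sum_list(const Node* head) {
    long s = 0;
    for (const Node* p = head; p != nullptr; p = p->next) s += p->payload;
    return s;
}
```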
To formalize prefetch patterns, the paper defines the prefetch kernel:
For a given load instruction and its address-generating function $f$, $f$ can be expressed as a prefetch kernel, which consists of a number of data sources and operations on those sources that generate the delinquent load address.
- Data source types
- constants
- registers
- memory locations
- Operations
- instructions or micro-ops specified by the ISA
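As an illustration of a kernel materialized in software (my own sketch, not code from the paper): for the indirect-index load M[N[i]], the data sources are the array N (memory), the base address of M (a constant), and the loop index (a register), and the operations are a load plus a scaled add. The prefetch distance below is an arbitrary placeholder:

```cpp
#include <cstddef>

constexpr std::size_t kDist = 16;  // illustrative prefetch distance, not tuned

long sum_with_kernel_prefetch(const int* M, const int* N, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kDist < n) {
            // Prefetch kernel, executed kDist iterations ahead of the demand
            // load: load N[i + kDist], scale it, add the base address of M.
            __builtin_prefetch(&M[N[i + kDist]], /*rw=*/0, /*locality=*/1);
        }
        s += M[N[i]];
    }
    return s;
}
```

__builtin_prefetch is the GCC/Clang intrinsic; MSVC would use _mm_prefetch instead.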
Extract Prefetch Kernel
Machine Classification | Corresponding Pattern |
---|---|
Constant | Constant |
Add | Delta |
Add, Multiply | Complex |
Load, Add | Linked List, Indirect Index, ... |
Load, Add, Multiply | Complex |
Preparation
- program & library binaries
- application trace: the PC and the memory address of every load and store instruction
- cache miss profile: a ranked list of PCs that cause cache misses
Process
- Dataflow Analysis
- Locate the execution history window (from miss PC back to the same PC)
- Build data dependencies
- Compaction
- store-load bypassing
- arithmetic distribution and compaction
- assignment and zero pruning
Analysis
Findings:
- Indirection degrades the effectiveness of spatial prefetchers
- A large fraction of programs have no indirection, but their reuse distances are very large
- Even simple kernels may still require complex computation
- The number of distinct patterns is large
CRISP: Critical Slice Prefetching
Overview
Motivation
Reordering instructions --> critical first
What is CRISP:
CRISP is a lightweight mechanism to hide the high latency of irregular memory access patterns by leveraging criticality-based scheduling.
CRISP executes delinquent loads and their load slices as early as possible, hiding a significant fraction of their latency. Furthermore, we observe that the latency induced by branch mispredictions and other high latency instructions can be hidden with a similar approach.
Two Tasks:
- Task 1: Identifying critical instructions. CRISP identifies high-latency load instructions that frequently induce pipeline stalls due to cache misses and tracks their load-address-generating instructions (slices).
- Task 2: Tagging and prioritizing critical instructions' execution. By tagging these instructions as critical and prioritizing their execution, the instruction scheduler can hide a large fraction of the memory access latency, improving memory-level parallelism (MLP) and overall performance.
Task 1: Instruction Criticality
Step1: Determining Delinquent Loads
A. What is a critical load
Define: A load is critical if
- its LLC miss rate is higher than a threshold (e.g. 20%)
- its memory addr cannot be easily predicted by the hardware prefetcher
- the number of independent instructions behind the load in the sequential instruction stream is small
There are other factors, e.g.
- the load's execution ratio over other loads in the program
- the LLC miss rate of the load
- the pipeline stalls induced by the load
- the baseline IPC and instruction mix of a program
- the MLP of the program at the time the load occurs
- the time a load becomes ready to be scheduled, determined by its dependency on other high latency instructions
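A minimal sketch of this selection rule (the struct fields and thresholds below are illustrative assumptions, not CRISP's actual heuristic):

```cpp
// Per-load statistics collected during profiling (illustrative fields).
struct LoadProfile {
    double llc_miss_rate;      // fraction of executions missing in the LLC
    bool   hw_prefetchable;    // address easily predicted by the HW prefetcher
    int    indep_insts_after;  // independent instructions behind the load
};

// A load is treated as critical if it misses often, the hardware prefetcher
// cannot cover it, and little independent work follows it in program order.
// The 20% threshold comes from the definition above; 32 is an assumed cut-off.
bool is_critical(const LoadProfile& p) {
    return p.llc_miss_rate > 0.20 &&
           !p.hw_prefetchable &&
           p.indep_insts_after < 32;
}
```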
B. How to Obtain This Information
How to collect this information:
- IPC: directly measured
- instruction mix: VTune or PMU counters
- the execution frequency of a specific load: Precise event-based sampling (PEBS), Processor trace (PT)
- A load's average memory access time (AMAT): PEBS
- the pipeline stalls induced by a load and MLP: observing precise back-end stalls and load queue occupancy
Step2a: Load Slice Extraction
- After obtaining a trace, we perform load slice extraction by iterating through the trace until one of the delinquent load instructions (see Section 3.2) is found
- traverse in reverse order
The algorithm uses a queue/worklist; it is roughly equivalent to a level-order (BFS) traversal of the load's dependence tree (see the sketch after this list).
- flag all load slice instructions as critical by prepending the new 'critical' instruction prefix during the FDO pass.
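A minimal sketch of that backward worklist walk (the trace record layout and the register-set bookkeeping are my own simplification, not CRISP's data structures):

```cpp
#include <cstdint>
#include <set>
#include <vector>

// One dynamic instruction from the execution trace (simplified: register
// dependences only; memory dependences through stores are ignored here).
struct TraceInst {
    uint64_t pc;
    std::vector<int> src_regs;  // registers read
    std::vector<int> dst_regs;  // registers written
};

// Walk the trace backwards from a delinquent load and collect the PCs of
// all instructions that feed its address registers (the load slice).
// The "live" set acts as the worklist: it holds registers whose producers
// are still unknown, so the walk expands the dependence graph one level at
// a time, much like a BFS / level-order traversal.
std::set<uint64_t> extract_load_slice(const std::vector<TraceInst>& trace,
                                      std::size_t delinquent_idx) {
    std::set<uint64_t> slice_pcs{trace[delinquent_idx].pc};
    std::set<int> live(trace[delinquent_idx].src_regs.begin(),
                       trace[delinquent_idx].src_regs.end());

    for (std::size_t i = delinquent_idx; i-- > 0 && !live.empty(); ) {
        const TraceInst& inst = trace[i];
        bool produces_live_reg = false;
        for (int r : inst.dst_regs)
            if (live.erase(r)) produces_live_reg = true;   // found a producer
        if (produces_live_reg) {
            slice_pcs.insert(inst.pc);                      // flag as critical
            live.insert(inst.src_regs.begin(), inst.src_regs.end());
        }
    }
    return slice_pcs;
}
```

In the FDO pass, every static instruction whose PC appears in the returned set would then be emitted with the 'critical' prefix.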
Step2b: Branch Slice Extraction
For loops that contain hard-to-predict branches, the execution time of a loop iteration is determined to a large degree by the time required to resolve the branch outcome.
--> Prioritize hard-to-predict branch slices
Step3: Inject into the binary
Rewrite the compiled assembly code, adding the new instruction prefix to every critical instruction, using post-link-time compilation.
Task 2: Hardware Scheduling
- Extend the instruction decoder to interpret the new latency-critical instruction prefix and tag all critical instructions as they progress through the CPU pipeline
- Extend the IQ scheduler to prioritize these critical instructions.
APT-GET: Profile-Guided Timely Software Prefetching
Overview
A fully automated flow:
- Use perf record to find the loads with frequent cache misses and their basic blocks
- Collect the corresponding LBR profile
- Compute the prefetch distance and the prefetch injection site
- Feed the results into LLVM, which injects the prefetches
Background
Hardware prefetchers cannot detect complex patterns such as indirect memory accesses. Existing software prefetchers can achieve good accuracy and coverage using static information, but because they lack dynamic execution information such as execution time, they struggle to be timely. This paper argues that timely prefetching hinges on two things, the prefetch distance and the prefetch injection site, and therefore uses a profile-guided approach to collect dynamic information that guides timely prefetching.
Motivation
Step1: Profiling
- Cache miss profile --> collects information about the demand loads of interest, e.g. each load's hit/miss ratio
- LBR (Last Branch Record) samples --> give the trip count of the loop containing the load and the execution time of a single loop iteration
With this information, the analysis can determine the time between two executions of the same load (the loop latency, or execution time), which is what timely prefetching requires.
The paper uses the Last Branch Record (LBR) facility of Intel CPUs.
Each LBR entry has a PC recording where a taken branch occurred, a Target recording the jump destination, and a cycle count recording when the branch executed.
The LBR is a buffer that stores the last 32 basic blocks (BBLs) executed by the CPU.
A BBL is the stretch of instructions executed consecutively between two taken branches.
Step2: Analysis
Overview
Loop latency consists of two execution-time components:
- instruction component (IC)
- includes all (non-load) instructions implementing the loop
- IC_latency depends on the number of instructions and their control-flow dependences, and is essentially constant
- memory component (MC)
- includes the loads causing frequent cache misses
- MC_latency depends on the state of the memory hierarchy and can vary widely
Once IC_latency and MC_latency are known, the prefetch distance is computed as prefetch_distance = MC_latency / IC_latency; at this distance, the prefetch can hide the memory latency.
The next question is how to obtain IC_latency and MC_latency.
A. Computing the loop execution time & loop trip count
- Loop execution time: by finding two instances of the same branch PC implementing a loop and subtracting their cycle counts, we can compute the execution time of a loop iteration.
- Loop trip count: for a nested loop, if we know the branch PC of the outer loop and the branch PC of the inner loop, we can count the number of inner branch PCs between two outer branch PCs in the LBR to compute the number of inner-loop iterations. (A sketch of both computations follows.)
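A sketch of both computations over decoded LBR samples (the entry layout and the assumption that each entry carries a running cycle counter are mine, taken from the description above rather than from APT-GET's code):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// One decoded LBR entry: a taken branch, its target, and a cycle counter
// value recorded when the branch executed (illustrative layout).
struct LbrEntry {
    uint64_t branch_pc;
    uint64_t target;
    uint64_t cycle_count;
};

// Loop execution time: cycle difference between two consecutive instances
// of the loop's branch PC within one LBR sample (one loop iteration).
std::optional<uint64_t> loop_iteration_cycles(const std::vector<LbrEntry>& sample,
                                              uint64_t loop_branch_pc) {
    std::optional<std::size_t> first;
    for (std::size_t i = 0; i < sample.size(); ++i) {
        if (sample[i].branch_pc != loop_branch_pc) continue;
        if (!first) { first = i; continue; }
        return sample[i].cycle_count - sample[*first].cycle_count;
    }
    return std::nullopt;  // fewer than two instances in this sample
}

// Inner-loop trip count: number of inner-loop branch PCs recorded between
// two occurrences of the outer-loop branch PC.
std::optional<std::size_t> inner_trip_count(const std::vector<LbrEntry>& sample,
                                            uint64_t outer_branch_pc,
                                            uint64_t inner_branch_pc) {
    std::optional<std::size_t> first_outer;
    for (std::size_t i = 0; i < sample.size(); ++i) {
        if (sample[i].branch_pc != outer_branch_pc) continue;
        if (!first_outer) { first_outer = i; continue; }
        std::size_t count = 0;
        for (std::size_t j = *first_outer + 1; j < i; ++j)
            if (sample[j].branch_pc == inner_branch_pc) ++count;
        return count;
    }
    return std::nullopt;
}
```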
B. Determining the Optimal Prefetch Distance
Knowing the loop execution time alone still does not yield IC_latency and MC_latency separately.
To separate them, the paper builds a latency distribution of the loop execution time:
- for all LBR samples that contain at least two instances of the BBL containing the delinquent load, measure the loop execution time by subtracting the cycle counts of the two subsequent branches
- analyze the latency distribution of the loop's execution time to predict the latency in the case that the load is served from the L1 or L2 cache
The resulting distribution shows four peaks, at roughly 80, 230, 400, and 650 cycles, corresponding to the load being served from the L1, L2, LLC, and DRAM respectively.
From this, IC_latency = 80 cycles and MC_latency = 650 - IC_latency = 570 cycles.
Finally, prefetch_distance = 570 / 80 ≈ 7.
Step3: Injection -- Finding the Optimal Prefetch Injection Site
Whether to inject in the outer loop or the inner loop is decided by a condition from the paper (a formula involving a factor k that relates the prefetch distance to the inner loop's trip count); if the condition holds, the prefetch is injected in the outer loop.
Here k denotes ???
If the prefetch is injected in the outer loop, the prefetch distance must be recomputed based on the outer loop's execution latency.
Resources
Disclosure of H/W prefetcher control on some Intel processors
Prefetcher | Bit# in MSR 0x1A4 | Description |
---|---|---|
L2 hardware prefetcher | 0 | Fetches additional lines of code or data into the L2 cache |
L2 adjacent cache line prefetcher | 1 | Fetches the cache line that comprises a cache line pair (128 bytes) |
DCU prefetcher | 2 | Fetches the next cache line into L1-D cache |
DCU IP prefetcher | 3 | Uses sequential load history (based on Instruction Pointer of previous loads) to determine whether to prefetch additional lines |
If any of the above bits are set to 1 on a core, then that particular prefetcher on that core is disabled. Clearing the bit (setting it to 0) enables the corresponding prefetcher.
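For example, the current setting can be inspected from user space through the Linux msr driver (a sketch that assumes the msr kernel module is loaded and the program runs with root privileges):

```cpp
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

int main() {
    // MSR 0x1A4 on core 0 holds the four prefetcher-disable bits listed above.
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { std::perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t val = 0;
    if (pread(fd, &val, sizeof(val), 0x1A4) != static_cast<ssize_t>(sizeof(val))) {
        std::perror("pread");
        close(fd);
        return 1;
    }
    close(fd);

    std::printf("L2 hardware prefetcher           : %s\n", (val & 1) ? "disabled" : "enabled");
    std::printf("L2 adjacent cache line prefetcher: %s\n", (val & 2) ? "disabled" : "enabled");
    std::printf("DCU prefetcher                   : %s\n", (val & 4) ? "disabled" : "enabled");
    std::printf("DCU IP prefetcher                : %s\n", (val & 8) ? "disabled" : "enabled");
    return 0;
}
```

Writing the register back (e.g. with wrmsr from msr-tools) toggles the corresponding prefetchers per core.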
- The algorithms found in the L1 cache hardware are the Data Cache Unit (DCU) Prefetcher (INTEL, 2019) and the DCU IP Prefetcher (INTEL, 2019). The DCU Prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line. The DCU IP Prefetcher keeps track of individual load instructions (based on their instruction pointer value). If a load instruction is detected to have a regular stride, then a prefetch is sent to the next address which is the sum of the current address and the stride.
- The L2 Hardware Prefetcher (INTEL, 2019) and the L2 Adjacent Cache Prefetcher (INTEL, 2019) are the prefetcher algorithms found in the real machine L2 cache. The L2 Hardware Prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 data cache requests initiated by load and store operations and also by the L1 prefetchers, and L1 instruction cache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. This prefetcher may issue two prefetch requests on every L2 lookup and run up to 20 lines ahead of the load request. The L2 Adjacent Cache Prefetcher fetches two 64-byte cache lines into a 128-byte sector instead of only one, regardless of whether the additional cache line has been requested or not.