Skia: Exposing Shadow Branches

Chrysanthos Pepi, Bhargav Reddy Godala, Krishnam Tibrewala, Gino A. Chacon, Paul V. Gratz, Daniel A. Jiménez, Gilles A. Pokam, and David I. August
Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2025.
Awarded all top ACM Reproducibility Badges offered by the Artifact Evaluation Committee.
Modern processors implement a decoupled front-end, often
using a form of Fetch Directed Instruction Prefetching (FDIP), to
avoid front-end stalls. FDIP is driven by the Branch Prediction Unit
(BPU), relying on the BPU's accuracy and branch target tracking
structures to speculatively fetch instructions into the Instruction
Cache (L1-I cache). As contemporary data center applications become
more complex, their code footprints also grow, resulting in a high
number of Branch Target Buffer (BTB) misses. These BTB-missing
branches have typically been decoded and placed in the BTB before,
but have since been evicted, so they now miss. FDIP can
alleviate L1-I cache misses, but its reliance on the BPU's tracking
structures means that when it encounters a BTB miss, the BPU may not
identify the current instruction as a branch to FDIP. This can prevent
FDIP from prefetching or cause it to speculate down the wrong path,
further polluting the L1-I cache.
We observe that the vast majority, 75%, of BTB-missing, unidentified
branches are actually present in instruction cache lines that FDIP has
previously fetched. Nevertheless, these missing branches have not yet
been decoded and inserted into the BTB. This is because the
instruction line is decoded only from an entry point (the target
of the previous taken branch) until an exit point (a taken branch). We
call branch instructions present in the ignored portion of the cache
line "Shadow Branches." Here we present Skia, a novel shadow branch
decoding technique that identifies and decodes unused bytes in cache
lines fetched by FDIP, inserting the branches found there into a Shadow Branch Buffer
(SBB). The SBB is accessed in parallel with the BTB, allowing FDIP to
speculate despite a BTB miss.
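The shadow-branch idea can be illustrated with a toy model. The sketch below is not the paper's mechanism: it assumes the instruction boundaries in a cache line are already known (real variable-length decoding is harder), it models the BTB and SBB as plain dictionaries keyed by branch address, and it assumes the BTB takes priority when both hit. The names Insn, shadow_branches, and lookup are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Insn:
    offset: int                   # byte offset inside the cache line
    is_branch: bool
    target: Optional[int] = None  # branch target address, if any


def shadow_branches(insns: List[Insn], entry: int, exit_: int) -> List[Insn]:
    """Return branches lying outside the decoded window [entry, exit_]."""
    return [i for i in insns if i.is_branch and not (entry <= i.offset <= exit_)]


def lookup(pc: int, btb: Dict[int, int], sbb: Dict[int, int]) -> Tuple[Optional[int], str]:
    """Probe BTB and SBB together; the BTB wins if both hit (assumed policy)."""
    if pc in btb:
        return btb[pc], "btb"
    if pc in sbb:
        return sbb[pc], "sbb"
    return None, "miss"


LINE_BASE = 0x1000  # hypothetical address of the fetched cache line
line = [
    Insn(0, True, 0x2000),   # before the entry point: shadow branch
    Insn(2, False),
    Insn(8, False),          # entry point (target of the previous taken branch)
    Insn(12, True, 0x3000),  # taken branch: exit point
    Insn(17, False),
    Insn(20, True, 0x4000),  # after the exit point: shadow branch
]

btb = {LINE_BASE + 12: 0x3000}  # only the branch on the executed path is known
sbb = {LINE_BASE + i.offset: i.target
       for i in shadow_branches(line, entry=8, exit_=12)}

print(lookup(LINE_BASE + 20, btb, sbb))  # misses the BTB, hits the SBB
```

In this toy line, the branches at offsets 0 and 20 are never decoded on the executed path, yet their bytes are already in the fetched line, so an SBB filled from them can supply a target where the BTB alone would miss.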
With a minimal storage state of 12.25KB, Skia delivers a geomean
speedup of ~5.7% over an 8K-entry BTB (78KB) and ~2% versus adding
an equal amount of state to the BTB, across 16 front-end bound
applications. Since many branches stored in the SBB are distinct
from those in a similarly sized BTB, we consistently observe
greater performance gains with Skia across all examined sizes until
saturation.
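To put the storage figures in proportion, the following back-of-the-envelope calculation uses only the numbers quoted above. It assumes that "KB" means 1024 bytes, that "8K-entry" means 8192 entries, and that all BTB entries cost the same number of bits. These are assumptions for illustration, not figures taken from the paper.

```python
KIB = 1024                 # assumption: "KB" means 1024 bytes

btb_entries = 8 * 1024     # assumption: "8K-entry" means 8192 entries
btb_bytes = 78 * KIB       # baseline BTB storage quoted in the abstract
skia_bytes = 12.25 * KIB   # Skia storage quoted in the abstract

# Average cost of one baseline BTB entry, in bits.
bits_per_entry = btb_bytes * 8 / btb_entries

# Skia's state as a fraction of the baseline BTB.
extra_fraction = skia_bytes / btb_bytes

# How many whole BTB entries the same budget would buy at that per-entry cost.
equiv_entries = int(skia_bytes * 8 // bits_per_entry)

print(bits_per_entry, round(extra_fraction, 3), equiv_entries)
```

Under these assumptions, Skia's state is roughly 16% of the baseline BTB's, which is the budget the "equal amount of state" comparison hands to the BTB instead.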