Decoupled Software Pipelining
Compiler and microarchitectural techniques have been largely successful in improving program performance by exposing instruction-level parallelism (ILP). However, performance is still far from ideal in applications that experience long, variable-latency stalls (e.g., recursive data structure (RDS) traversals). Static scheduling techniques, including software pipelining, cannot schedule optimally for variable latencies, and building processors with large monolithic instruction windows that exploit distant ILP to tolerate such stalls is impractical. Decoupled software pipelining (DSWP) addresses this problem by preferentially fetching and executing instructions from a program's critical path (CP). This allows DSWP to achieve high instructions per cycle (IPC) with fast, simple execution cores.
How does DSWP work? It statically splits programs into critical path (CP) and off-critical path (off-CP) threads that run concurrently on thread-parallel architectures such as simultaneous multithreading (SMT) processors or chip multiprocessors (CMP). Special microarchitectural support, called the synchronization array (SA), provides low-latency inter-thread synchronization and value communication and acts as a decoupling buffer between the threads. Because execution is decoupled, a stall in one thread does not directly stall the other. Dedicating a thread to the CP improves CP performance. The CP and off-CP threads execute concurrently in a pipelined fashion, resulting in staged parallelism.
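The split described above can be illustrated with a small software sketch. It is only an analogy under stated assumptions, not the code our compiler generates: a bounded queue.Queue stands in for the hardware synchronization array, Python threads stand in for SMT/CMP hardware threads, and all names (Node, build_list, dswp_sum_of_squares, buffer_slots) are hypothetical. The loop is a linked-list traversal in which the pointer chase is treated as the critical path and the per-node computation as the off-critical path work. Because of Python's global interpreter lock, the sketch shows only the structure of the pipeline, not a speedup.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: int
    next: Optional["Node"] = None


def build_list(values):
    """Build a singly linked list from an iterable of ints."""
    head = None
    for v in reversed(list(values)):
        head = Node(v, head)
    return head


def sequential_sum_of_squares(head):
    """Original single-threaded loop: traversal and work are interleaved."""
    total = 0
    node = head
    while node is not None:
        total += node.value * node.value  # off-CP work
        node = node.next                  # CP: pointer chase
    return total


_DONE = object()  # sentinel marking the end of the traversal


def dswp_sum_of_squares(head, buffer_slots=8):
    """Same loop split into a CP thread and an off-CP thread.

    The bounded queue is an assumed software stand-in for the
    synchronization array: it carries values between the threads and
    lets the CP thread run up to buffer_slots nodes ahead.
    """
    sa = queue.Queue(maxsize=buffer_slots)
    result = []

    def cp_thread():
        # Critical path: only the pointer chase, plus a produce per node.
        node = head
        while node is not None:
            sa.put(node)      # blocks only when the buffer is full
            node = node.next
        sa.put(_DONE)

    def off_cp_thread():
        # Off-critical path: consume nodes and do the per-node work.
        total = 0
        while True:
            node = sa.get()   # blocks only when the buffer is empty
            if node is _DONE:
                break
            total += node.value * node.value
        result.append(total)

    threads = [threading.Thread(target=cp_thread),
               threading.Thread(target=off_cp_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

In this sketch, communication flows in one direction only, from the CP thread to the off-CP thread, which is what lets the two stages run as a pipeline. The buffer size bounds how far the traversal can run ahead of the per-node work.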
Automatic Multithreading for DSWP: Unlike prior attempts at automatically multithreading sequential programs, DSWP does not try to partition programs into fully independent threads. Instead, it pipelines programs into dependent, communicating threads. This makes DSWP a very practical multithreading technique. We have a working compiler implementation that automatically generates DSWP code. We are currently developing heuristics to improve the quality of the generated code. We are also exploring other options, such as repartitioning the compiler-generated code at run time to achieve greater performance.
For more information, see our paper in PACT '04.