Performance Scalability of Decoupled Software Pipelining
Ram Rangan, Neil Vachharajani, Guilherme Ottoni, and David I. August
ACM Transactions on Architecture and Code Optimization (TACO), Volume 5, Number 2, August 2008.

Any successful solution to using multi-core processors to scale general-purpose program performance will have to contend with rising inter-core communication costs while exposing coarse-grained parallelism. Recently proposed pipelined multithreading (PMT) techniques have been demonstrated to have general-purpose applicability and are also able to effectively tolerate inter-core latencies through pipelined inter-thread communication. These desirable properties make PMT techniques strong candidates for program parallelization on current and future multi-core processors, and understanding their performance characteristics is critical to their deployment.

To that end, this paper evaluates the performance scalability of a general-purpose PMT technique called decoupled software pipelining (DSWP) and presents a thorough analysis of the communication bottlenecks that must be overcome for optimal DSWP scalability.