Automatic Thread Extraction with Decoupled Software Pipelining [abstract] (IEEE Xplore, PDF)
Guilherme Ottoni, Ram Rangan, Adam Stoler, and David I. August
Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture (MICRO), November 2005.
One of five papers nominated for the Best Paper Award by the Program Committee.

Until recently, a steadily rising clock rate and other uniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance for a wide range of applications. Current difficulties in maintaining this trend have lead microprocessor companies to add value by incorporating multiple processors on a chip. Unfortunately, since decades of compiler research have not succeeded in delivering automatic threading for prevalent code properties, this approach demonstrates no improvement for a large class of existing codes.

To find useful work for chip multiprocessors, we propose an automatic approach to thread extraction, called Decoupled Software Pipelining (DSWP). DSWP exploits the fine-grained pipeline parallelism lurking in most applications to extract long-running, concurrently executing threads. Use of the non-speculative and truly decoupled threads produced by DSWP can increase execution efficiency and provide significant latency tolerance, mitigating design complexity by reducing inter-core communication and per-core resource requirements. Using our initial fully automatic compiler implementation and a validated processor model, we prove the concept by demonstrating significant gains for dual-core chip multiprocessor models running a variety of codes. Then, we explore simple opportunities missed by our initial compiler implementation which suggest a promising future for this approach.