Parallelization Techniques with Improved Dependence Handling [abstract] (PDF)
Easwaran Raman
Ph.D. Thesis, Department of Computer Science,
Princeton University, June 2009.
Continuing exponential growth in transistor density and diminishing
returns from the increasing transistor count have forced processor
manufacturers to pack multiple processor cores onto a single chip.
These processors, known as multi-core processors, generally do not
improve the performance of single-threaded applications. Automatic
parallelization has a key role to play in improving the performance of
legacy and newly written single-threaded applications in this new
multi-threaded era. Automatic parallelizations transform single-threaded code into a
semantically equivalent multi-threaded code by preserving the
dependences of the original code. This dissertation proposes two new
automatic parallelization techniques that differ from related existing
techniques in their handling of dependences. This difference in
dependence handling enables the proposed techniques to outperform
related techniques. The first technique is known as parallel-stage decoupled software
pipelining (PS-DSWP). PS-DSWP extends pipelined parallelization
techniques like DSWP by allowing certain pipelined stages to be
executed by multiple threads. Such a parallel execution of pipelined
stages requires distinguishing inter-iteration dependences
of the loop being parallelized from the rest of the dependences.
The applicability and effectiveness of PS-DSWP is
further enhanced by applying speculation to remove some dependences. The second technique, known as speculative iteration chunk execution
(Spice), uses value speculation to ignore inter-iteration dependences,
enabling speculative execution of chunks of iterations in parallel.
Unlike other value-speculation based parallelization techniques, Spice
speculates only a few dynamic instances of those inter-iteration
dependences. Both these techniques are implemented in the VELOCITY compiler and are
evaluated using a multi-core Itanium 2 simulator. PS-DSWP results in
a geometric mean loop speedup of 2.13 over single-threaded
performance with five threads on a set of loops from five benchmarks.
The use of speculation improves the performance of PS-DSWP resulting
in a geometric mean loop speedup of 3.67 over single-threaded
performance on a set of loops from six benchmarks. Spice shows a
geometric mean loop speedup of 2.01 on a set of loops from four
benchmarks. Based on the above experimental results and qualitative
and quantitative comparisons with related techniques, this
dissertation demonstrates the effectiveness of the proposed
techniques.