Outer-Loop Vectorization - Revisited for Short SIMD Architectures
Dorit Nuzman and Ayal Zaks
Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2008.
Vectorization has been an important method of using data level
parallelism to accelerate scientific workloads on vector
machines such as Cray for the past three decades. In the
last decade it has also proven useful for accelerating multimedia
and embedded applications on short SIMD architectures
such as MMX, SSE and AltiVec. Most of the focus has
been directed at innermost loops, effectively executing their
iterations concurrently as much as possible. Outer loop vectorization
refers to vectorizing a level of a loop nest other
than the innermost, which can be beneficial if the outer loop
exhibits greater data-level parallelism and locality than the
innermost loop. Outer loop vectorization has traditionally
been performed by interchanging an outer-loop with the innermost
loop, followed by vectorizing it at the innermost
position. A more direct unroll-and-jam approach can be
used to vectorize an outer-loop without involving loop interchange,
which can be especially suitable for short SIMD
architectures.
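
As an illustration of the direct approach described above, the following is a minimal sketch in plain C, not the paper's GCC implementation. It assumes a FIR-style loop nest whose inner loop is a reduction and whose outer loop has independent iterations. The names `fir_scalar`, `fir_outer_vectorized` and `VF` are hypothetical, and the `lane` loops stand in for single SIMD instructions operating on `VF` elements at once.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4 /* assumed vector width: 4 floats, as in a 128-bit register */

/* Scalar reference: out[i] = sum over j of in[i + j] * coef[j].
 * The caller must provide at least n + taps - 1 elements in `in`. */
void fir_scalar(const float *in, const float *coef, float *out,
                size_t n, size_t taps)
{
    for (size_t i = 0; i < n; i++) {
        float sum = 0.0f;
        for (size_t j = 0; j < taps; j++)
            sum += in[i + j] * coef[j];
        out[i] = sum;
    }
}

/* Outer-loop vectorization without interchange: the outer loop advances
 * by VF, and the VF copies of the inner loop are jammed into one inner
 * loop that updates VF accumulators, one per outer iteration. */
void fir_outer_vectorized(const float *in, const float *coef, float *out,
                          size_t n, size_t taps)
{
    assert(n % VF == 0); /* sketch only: no epilogue for leftover iterations */
    for (size_t i = 0; i < n; i += VF) {
        float vsum[VF] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (size_t j = 0; j < taps; j++) {
            float c = coef[j]; /* scalar that would be splat across lanes */
            for (size_t lane = 0; lane < VF; lane++)
                vsum[lane] += in[i + j + lane] * c; /* one vector multiply-add */
        }
        for (size_t lane = 0; lane < VF; lane++)
            out[i + lane] = vsum[lane]; /* one vector store */
    }
}
```

In this sketch the loop order is unchanged, so the reduction stays in the inner loop and no cross-lane sum is needed at the end. Consecutive inner iterations also read overlapping windows of `in` at shifting offsets, which is one place where data and alignment reuse could plausibly be exploited.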
In this paper we revisit the method of outer loop vectorization,
paying special attention to properties of modern short
SIMD architectures. We show that even though current
optimizing compilers for such targets do not apply outer-loop
vectorization in general, it can provide significant performance
improvements over innermost loop vectorization.
Our implementation of direct outer-loop vectorization, available
in GCC 4.3, achieves speedup factors of 3.13 and 2.77
on average across a set of benchmarks, compared to 1.53
and 1.39 achieved by innermost loop vectorization, when
running on Cell BE SPU and PowerPC970 processors, respectively.
Moreover, outer-loop vectorization provides new
reuse opportunities that can be vital for such short SIMD
architectures, including efficient handling of alignment. We
present an optimization tapping such opportunities, capable
of further boosting the performance obtained by outer-loop
vectorization to achieve average speedup factors of 5.26 and
3.64 on the same two processors, respectively.