Communication Optimizations for Global Multi-Threaded Instruction Scheduling
Guilherme Ottoni and David I. August
Proceedings of the 13th ACM International Conference on
Architectural Support for Programming Languages and Operating
Systems (ASPLOS), March 2008.
The recent shift in the industry towards chip
multiprocessor (CMP) designs has brought the need for multi-threaded
applications to mainstream computing. Given the difficulty of finding
coarse-grained parallelism in general-purpose, sequential
applications, computer architects have proposed CMPs with
lightweight, fine-grained (scalar) communication
mechanisms.
As observed in several limit studies, most parallelization
opportunities require looking for parallelism beyond local regions of
code. To exploit these opportunities, especially for sequential
applications, researchers have recently proposed global
multi-threaded instruction scheduling techniques, including Decoupled
Software Pipelining (DSWP) and
GREMIO. These techniques simultaneously
schedule instructions from large regions of code, such as arbitrary
loop nests or whole procedures, and have been shown effective at
extracting threads for many applications. A key enabler of these
global instruction scheduling techniques is the Multi-Threaded
Code Generation (MTCG) algorithm proposed in [Ottoni et al., '05],
which generates multi-threaded code for any partition of the
instructions into threads. This algorithm inserts communication and
synchronization instructions in order to satisfy all inter-thread
dependences.
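To make this concrete, here is a minimal sketch of the produce/consume insertion an MTCG-style pass performs, assuming straight-line code and register dependences only; the names (Instr, insert_communication) are hypothetical illustrations, not the paper's data structures. The full MTCG algorithm additionally builds a control-flow graph for each thread and handles control dependences, which this sketch omits.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # identity-based hashing so an Instr can key a dict
class Instr:
    op: str
    defs: list = field(default_factory=list)  # values this instruction writes
    uses: list = field(default_factory=list)  # values this instruction reads

def insert_communication(instructions, thread_of):
    """Split instructions into per-thread lists, inserting a produce in
    the defining thread and a matching consume in the using thread for
    every value that flows across the partition."""
    threads = {t: [] for t in set(thread_of.values())}
    def_thread = {}   # value -> thread that defines it
    sent = set()      # (value, destination thread) pairs already communicated
    next_queue = 0    # fresh scalar queue for each communicated value
    for instr in instructions:
        t = thread_of[instr]
        for v in instr.uses:
            src = def_thread[v]
            if src != t and (v, t) not in sent:  # inter-thread dependence
                q, next_queue = next_queue, next_queue + 1
                threads[src].append(("produce", q, v))  # sender side
                threads[t].append(("consume", q, v))    # receiver side
                sent.add((v, t))
        threads[t].append(instr)
        for v in instr.defs:
            def_thread[v] = t
    return threads
```

For instance, placing a definition and its use in different threads yields one produce/consume pair over a fresh queue:

```python
a = Instr("a = load [p]", defs=["a"])
b = Instr("b = a + 1", defs=["b"], uses=["a"])
print(insert_communication([a, b], {a: 0, b: 1}))
# thread 0: [a, ("produce", 0, "a")]; thread 1: [("consume", 0, "a"), b]
```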
In this paper, we present a general compiler framework, COCO, to
optimize the communication and synchronization instructions inserted
by the MTCG algorithm. This framework, based on thread-aware data-flow
analyses and graph min-cut algorithms, appropriately models and
optimizes all kinds of inter-thread dependences, including register,
memory, and control dependences. Our experiments, using a fully
automatic compiler implementation of these techniques, demonstrate
significant reductions (about 30% on average) in the number of
dynamic communication instructions in code parallelized with DSWP and
GREMIO. This reduction in communication translates to performance
gains of up to 40%.
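As a rough illustration of the min-cut component (a sketch of the idea, not COCO's actual formulation), the fragment below treats profiled CFG edge frequencies as capacities and finds the cheapest set of edges separating a value's definition from all of its cross-thread uses; placing produces on those cut edges covers every definition-to-use path at minimal dynamic cost. The function name and graph encoding are assumptions for illustration.

```python
import networkx as nx

def cheapest_produce_edges(cfg_edges, freq, def_block, use_blocks):
    """cfg_edges: (src, dst) basic-block pairs; freq: profiled execution
    count per edge. Returns the CFG edges on which to place produces.
    Assumes def_block is not itself one of the consuming blocks."""
    g = nx.DiGraph()
    for u, v in cfg_edges:
        g.add_edge(u, v, capacity=freq[(u, v)])
    # Collapse all consuming blocks into one artificial sink so a single
    # s-t minimum cut covers every definition-to-use path at once.
    for b in use_blocks:
        g.add_edge(b, "_sink", capacity=float("inf"))
    _, (src_side, _) = nx.minimum_cut(g, def_block, "_sink")
    # The cut consists of the CFG edges crossing from the source side.
    return {(u, v) for u, v in cfg_edges
            if u in src_side and v not in src_side}
```

Because the capacities are execution frequencies, the minimum cut naturally steers communication away from hot edges and toward infrequently executed program points.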