Parallelization Project[Thomas Jablin is responsible for this page]
For decades, each new processor generation could be relied upon to continue the exponentially increasing performance trend at the historical rate, enhancing existing applications and enabling new ones. Recently, however, this frequency scaling approach has faltered. Concurrently, there is an ongoing renaissance in the development of parallel architectures. Modern multi-core, cluster, and GPU architectures offer enormous parallel resources, but present commensurate performance, portability, and debugging challenges to programmers. Without automatic parallelization and tuning technology, multicore violates every existing layer of abstraction from the processor to the programmer. Programming is hard enough, and parallel architectures make it worse by requiring programmers to perform the additional tasks of parallelizing and tuning their applications. These notoriously difficult tasks, once the domain of only highly skilled experts, are becoming a new barrier to effectively using all types of computer systems, including smart phones, tablets, and embedded devices.
Conventional wisdom has long held that new parallel programming languages and libraries will address the problems programmers have parallelizing and tuning applications. Our interviews of 114 scientists and engineers across 20 disciplines at Princeton University show little evidence of progress in this direction. With hundreds of parallel programming languages, parallel libraries, and other tools available, top scientists and engineers continue to struggle to make parallel computation work for them [SC 2011]. The difficulty of parallel programming is limiting the pace of scientific progress.
To address this important problem, we created a unified automatically parallelizing compiler targeting multicore, clusters, and GPU architectures. This unified compiler framework incorporates next-generation auto-parallelization techniques including: Decoupled Software Pipelining (DSWP) [CAL 2005, MICRO 2005, PACT 2004, PAC2 2005, EPIC 2005], Degree of Parallelism Executive (DoPE) [PLDI 2011], and speculative techniques [ASPLOS 2010, CGO 2012]. Additionally, we are extending the sequential programming model with semantic annotations that allow programmers to relax unnecessary restrictions on program behavior without thinking about concurrency [IEEE MICRO 2008, MICRO 2007, PLDI 2010].
Decoupled Software Pipelining (DSWP)
With the exception of some scientific codes, application generally contain irregular memory dependence patterns, complex control flow, and hard cyclic dependences. Decoupled Software Pipelining (DSWP) makes automatic parallelization possible for general purpose codes [CAL 2005, MICRO 2005, PACT 2004, PAC2 2005, EPIC 2005]. By keeping cyclic dependences thread-local, DSWP makes all inter-thread communication acylic, resulting in code that is extremely tolerant to long and variable inter-thread latencies (unlike DOACROSS).
While DSWP by itself scales to a few threads, it is more importantly the first step in achieving previously unattainable levels of scalable parallelism [CGO 2008, CGO 2010]. When used to isolate hard cyclic dependencies, DSWP makes decades of automatic parlalleization research more applicable to general purpose codes. For example, DSWP can create a pipeline stage with no cyclic dependencies for DOALL execution (Parallel-Stage DSWP) [CGO 2008] or create a pipeline stage suitable for the LOCALWRITE transformation (DSWP+LOCALWRITE) [CGO 2010].
Implicitly Parallel Programming
For a variety of reasons, programmers generally write programs that deterministically produce a single answer even when many equally acceptable answers exist. While determinism has value, specifying a few points of non-determinism in a program can dramatically improve its scalability. Such effort on the part of the programmer is a form of implicitly parallel programming because the programmer aids parallelization but does not explicitly direct it. The programmer remains focused on the algorithm, not the target machine or a particular parallel implementation.
Semantic commutivity is one approach to implicit parallel programming already implemented in our compiler framework. Sequential programming models express a total program order, of which a partial order must be respected. This inhibits parallelizing tools from extracting scalable performance. Programmer written semantic commutativity points provide a natural way of relaxing this partial order, thereby exposing parallelism implicitly in a program. Our compiler extends C and C++ with commutivity points that allow programmers to specify commutativity relations between groups of arbitrary structured code blocks. Using only this construct, serializing constraints that inhibit parallelization can be relaxed, independent of any particular parallelization strategy or concurrency control mechanism [MICRO 2007, IEEE MICRO 2008, PLDI 2011]. In just two weeks, one student added 60 semantic commutivity points to over 500,000 lines of code (SPEC CINT 2000). Combined with Parallel-Stage DSWP, semantic commutivity restored the historic processor performance growth rate for five generations (32 cores) [MICRO 2007, IEEE MICRO 2008].
SpeculationOur compiler framework uses efficient targeted speculation to dramatically increase the applicability of many auto-parallelization techniques. Speculation enables auto-parallelization in two ways: first by overcoming the limits of static analysis and second by speculating away dependences that almost never manifest. For multi-core processors and clusters our compiler currently supports the SMTX and DSMTX speculation frameworks respectively [ASPLOS 2010, MICRO 2010]. The MTX family of speculation techniques use multi-threaded transactions that can be entered by multiple threads, but still commit atomically. SMTX and DSMTX support value, alias, and control speculation. To overcome the high cost of communication in large clusters where DSMTX may not be appropriate, our compiler also supports classifier-based speculation [CGO 2012]. Unlike DSMTX, classifier-based speculation techniques allow each node in a cluster to detect speculation independently without a centralized checker node. Frequently, programs feature large data-structures that change very slowly over time. The SPICE speculation and parallelization technique enables parallel traversal over complex recursive data-structures by speculating that the data-structure is mostly conserved over time [CGO 2008].
Targeting Multiple Parallel Architectures
The current renaissance of parallel architectures has meant a massive increase in complexity for parallel programmers as different techniques are needed to achieve best performance for multicore processors, clusters, and GPUs. By parallelizing sequential codes automatically, our compiler framework enables portability parallel performance. While our original DSWP technique was for multi-core architectures, we have expanded it for both cluster and GPU architectures [MICRO 2010, CGO 2012, PLDI 2011, CGO 2012]. We believe we are the first to automatically produce scalable parallel code for clusters from ordinary C programs. Furthermore, our compiler currently enables fully-automatic DOALL parallelization for GPUs [PLDI 2011, CGO 2012] and speculative DOALL parallelization for clusters [CGO 2012].
Automatic and Dynamic Tuning of Parallelized Applications
Traditionally, developers tune and parallelize an application to meet a particular performance goal on an unloaded platform under a given set of workload characteristics. In deployed systems, performance generally degrades as the situation deviates from this assumed environment and performance goal. Compile-time automatic parallelization naturally suffers from the same problem. The optimal degree and type of parallelization can only be known after a user chooses a desired performance goal and the program starts executing on the selected platform under the actual workload. To solve this problem, we freed programmers and compilers from the concerns of parallelizing and tuning for particular platforms, performance goals, and workloads. Programmers and compilers expose parallelism in the usual way except that they defer important performance decisions (such as selecting the type of parallelism) to an API. After deployment, users can express their desired performance goal as a fitness function of throughput, latency, response time, power, time of day, etc. The run-time system then makes all previously deferred decisions dynamically and automatically in response to the execution environment (platform and workload), optimizing for the user’s desired performance goal. Using this method in our compiler framework, we found dramatic improvements over static tuning for user goals like "Maximize Throughput at 90% peak power" or "Minimize Response Time on 24 Cores" [PLDI 2011].
Project Ph.D. Graduates
Selected Project Publications
All Project Publications
The document has moved here.