
The Liberty Research Group
[Figure: SPEC INT performance trend, 1993-2012]

Parallelization Project


For decades, each new processor generation could be relied upon to continue the exponentially increasing performance trend at the historical rate, enhancing existing applications and enabling new ones. Recently, however, frequency scaling, the approach that drove this trend, has faltered. Concurrently, there is an ongoing renaissance in the development of parallel architectures. Modern multi-core, cluster, and GPU architectures offer enormous parallel resources, but present commensurate performance, portability, and debugging challenges to programmers. Without automatic parallelization and tuning technology, multicore violates every existing layer of abstraction from the processor to the programmer. Programming is hard enough, and parallel architectures make it worse by requiring programmers to perform the additional tasks of parallelizing and tuning their applications. These notoriously difficult tasks, once the domain of only highly skilled experts, are becoming a new barrier to effectively using all types of computer systems, including smart phones, tablets, and embedded devices.

Conventional wisdom has long held that new parallel programming languages and libraries will address the problems programmers have parallelizing and tuning applications. Our interviews of 114 scientists and engineers across 20 disciplines at Princeton University show little evidence of progress in this direction. With hundreds of parallel programming languages, parallel libraries, and other tools available, top scientists and engineers continue to struggle to make parallel computation work for them [SC 2011]. The difficulty of parallel programming is limiting the pace of scientific progress.

To address this important problem, we created a unified automatically parallelizing compiler targeting multicore, cluster, and GPU architectures. This unified compiler framework incorporates next-generation auto-parallelization techniques, including Decoupled Software Pipelining (DSWP) [CAL 2005, MICRO 2005, PACT 2004, PAC2 2005, EPIC 2005], the Degree of Parallelism Executive (DoPE) [PLDI 2011], and speculative techniques [ASPLOS 2010, CGO 2012]. Additionally, we are extending the sequential programming model with semantic annotations that allow programmers to relax unnecessary restrictions on program behavior without thinking about concurrency [IEEE MICRO 2008, MICRO 2007, PLDI 2010].

Decoupled Software Pipelining (DSWP)

With the exception of some scientific codes, applications generally contain irregular memory dependence patterns, complex control flow, and hard cyclic dependences. Decoupled Software Pipelining (DSWP) makes automatic parallelization possible for general-purpose codes [CAL 2005, MICRO 2005, PACT 2004, PAC2 2005, EPIC 2005]. By keeping cyclic dependences thread-local, DSWP makes all inter-thread communication acyclic, resulting in code that is extremely tolerant of long and variable inter-thread latencies (unlike DOACROSS).
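
As a minimal sketch of this structure (not the compiler's actual output or runtime), consider a linked-list traversal: the pointer-chasing recurrence stays in one thread, while the loop body runs in a second thread fed through a one-way queue. The Channel type and stage decomposition below are illustrative assumptions, not the DSWP runtime's real interface.

```cpp
// Minimal illustration of DSWP's pipeline structure. A sequential
// list traversal:
//
//   while (n) { process(n->value); n = n->next; }
//
// is split so the cyclic dependence (n = n->next) stays in one thread,
// and all cross-thread communication flows one way through a queue.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

struct Node { int value; Node* next; };

// Single-producer single-consumer queue. A real DSWP runtime would use
// a low-overhead lock-free design; a mutex keeps the sketch simple.
template <typename T>
class Channel {
  std::queue<T> q_;
  std::mutex m_;
  std::condition_variable cv_;
public:
  void push(T v) {
    { std::lock_guard<std::mutex> l(m_); q_.push(v); }
    cv_.notify_one();
  }
  T pop() {
    std::unique_lock<std::mutex> l(m_);
    cv_.wait(l, [&] { return !q_.empty(); });
    T v = q_.front(); q_.pop(); return v;
  }
};

int process(int v) { return v * v; }  // stand-in for the loop body's work

int main() {
  Node c = {3, nullptr}, b = {2, &c}, a = {1, &b};
  Channel<Node*> ch;

  // Stage 1: the pointer-chasing recurrence remains thread-local.
  std::thread produce([&] {
    for (Node* n = &a; n; n = n->next) ch.push(n);
    ch.push(nullptr);  // end-of-stream marker
  });

  // Stage 2: consumes values acyclically; never feeds data back to
  // stage 1, so it tolerates long and variable queue latencies.
  std::thread consume([&] {
    while (Node* n = ch.pop()) std::printf("%d\n", process(n->value));
  });

  produce.join(); consume.join();
}
```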

While DSWP by itself scales to only a few threads, it is, more importantly, the first step in achieving previously unattainable levels of scalable parallelism [CGO 2008, CGO 2010]. By isolating hard cyclic dependences, DSWP makes decades of automatic parallelization research applicable to general-purpose codes. For example, DSWP can create a pipeline stage with no cyclic dependences for DOALL execution (Parallel-Stage DSWP) [CGO 2008] or create a pipeline stage suitable for the LOCALWRITE transformation (DSWP+LOCALWRITE) [CGO 2010], as sketched below.
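
Continuing the previous sketch (and reusing its Node, Channel, and process definitions), the fragment below illustrates the Parallel-Stage idea: the stage without cross-iteration dependences is replicated across workers. The worker count and round-robin distribution are purely illustrative.

```cpp
// Parallel-Stage DSWP sketch, building on the previous example
// (Node, Channel, and process as defined there). Stage 1 still owns
// the cyclic pointer-chasing dependence; stage 2 carries no
// cross-iteration dependence, so it is replicated for DOALL-style
// execution.
#include <vector>

int main() {
  Node c = {3, nullptr}, b = {2, &c}, a = {1, &b};
  const int kWorkers = 4;  // degree chosen for illustration
  std::vector<Channel<Node*>> chans(kWorkers);

  // Stage 1 (sequential): traverses the list and distributes work.
  std::thread produce([&] {
    int i = 0;
    for (Node* n = &a; n; n = n->next)
      chans[i++ % kWorkers].push(n);          // round-robin into the parallel stage
    for (auto& ch : chans) ch.push(nullptr);  // terminate every worker
  });

  // Stage 2 (parallel): each replica processes its iterations independently.
  std::vector<std::thread> workers;
  for (int w = 0; w < kWorkers; ++w)
    workers.emplace_back([&, w] {
      while (Node* n = chans[w].pop())
        std::printf("%d\n", process(n->value));
    });

  produce.join();
  for (auto& t : workers) t.join();
}
```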

Implicitly Parallel Programming

For a variety of reasons, programmers generally write programs that deterministically produce a single answer even when many equally acceptable answers exist. While determinism has value, specifying a few points of non-determinism in a program can dramatically improve its scalability. Such effort on the part of the programmer is a form of implicitly parallel programming because the programmer aids parallelization but does not explicitly direct it. The programmer remains focused on the algorithm, not the target machine or a particular parallel implementation.

Semantic commutativity is one approach to implicitly parallel programming already implemented in our compiler framework. Sequential programming models express a total order on program operations, of which only a partial order must actually be respected; the unnecessary ordering constraints inhibit parallelizing tools from extracting scalable performance. Programmer-written semantic commutativity points provide a natural way of relaxing this order, thereby exposing the parallelism implicit in a program. Our compiler extends C and C++ with commutativity points that allow programmers to specify commutativity relations between groups of arbitrary structured code blocks. Using only this construct, serializing constraints that inhibit parallelization can be relaxed, independent of any particular parallelization strategy or concurrency control mechanism [MICRO 2007, IEEE MICRO 2008, PLDI 2011]. In just two weeks, one student added 60 semantic commutativity points to over 500,000 lines of code (SPEC CINT 2000). Combined with Parallel-Stage DSWP, semantic commutativity restored the historic processor performance growth rate for five generations (32 cores) [MICRO 2007, IEEE MICRO 2008].
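
The snippet below is a hedged illustration of the idea, not the compiler's actual extension syntax: a hypothetical [[commutative]] annotation declares that calls to a seeded random-number generator may execute in any relative order, dissolving the loop-carried dependence on the seed.

```cpp
// Illustrative only: the [[commutative]] annotation is hypothetical,
// standing in for the commutativity points described above.

unsigned seed = 42;

// Declares that any two dynamic calls to next_random may execute in
// either order: every interleaving of seed updates yields an equally
// acceptable stream, even though each call updates shared state.
[[commutative]]  // hypothetical annotation; unknown attributes are ignored
unsigned next_random() {
  seed = seed * 1664525u + 1013904223u;  // LCG update: a loop-carried dependence
  return seed;
}

void fill_samples(double* results, int n) {
  // Without the annotation, the seed update serializes this loop.
  // With it, the iterations commute, and the loop is parallelizable
  // by any strategy the compiler chooses.
  for (int i = 0; i < n; ++i)
    results[i] = next_random() / 4294967296.0;  // map to [0, 1)
}

int main() {
  double r[8];
  fill_samples(r, 8);
}
```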

Speculation

Our compiler framework uses efficient, targeted speculation to dramatically increase the applicability of many auto-parallelization techniques. Speculation enables auto-parallelization in two ways: first, by overcoming the limits of static analysis, and second, by speculating away dependences that almost never manifest. For multi-core processors and clusters, our compiler currently supports the SMTX and DSMTX speculation frameworks, respectively [ASPLOS 2010, MICRO 2010]. The MTX family of speculation techniques uses multi-threaded transactions, which can be entered by multiple threads but still commit atomically. SMTX and DSMTX support value, alias, and control speculation. To overcome the high cost of communication in large clusters, where DSMTX may not be appropriate, our compiler also supports classifier-based speculation [CGO 2012]. Unlike DSMTX, classifier-based speculation allows each node in a cluster to detect misspeculation independently, without a centralized checker node. Frequently, programs feature large data structures that change very slowly over time. The SPICE speculation and parallelization technique enables parallel traversal of complex recursive data structures by speculating that the data structure is mostly conserved over time [CGO 2008].
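
The sketch below illustrates the speculate-check-recover pattern conceptually; it is not the SMTX/DSMTX runtime, which isolates speculative state and commits multi-threaded transactions atomically. Here a dependence that profiling says almost never manifests is speculated away, flagged if it does manifest, and recovered by sequential re-execution. The loop and its rare dependence are invented for illustration.

```cpp
// Conceptual speculate-check-recover sketch (not SMTX/DSMTX).
// Profiling says in[i] < 0 almost never occurs, so the loop-carried
// update to `scale` is speculated away, letting the loop run as DOALL.
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

std::atomic<bool> misspeculated{false};

// Speculative body: assumes `scale` never changes. If the rare case
// appears, flag a violation instead of performing the cross-iteration write.
void speculative_chunk(const std::vector<int>& in, std::vector<int>& out,
                       int scale, std::size_t lo, std::size_t hi) {
  for (std::size_t i = lo; i < hi; ++i) {
    if (in[i] < 0) { misspeculated.store(true); return; }
    out[i] = in[i] * scale;
  }
}

void parallel_loop(const std::vector<int>& in, std::vector<int>& out) {
  int scale = 2;
  std::size_t mid = in.size() / 2;
  std::thread t1([&] { speculative_chunk(in, out, scale, 0, mid); });
  std::thread t2([&] { speculative_chunk(in, out, scale, mid, in.size()); });
  t1.join(); t2.join();

  if (misspeculated.load()) {
    // Recovery: discard speculative results and re-execute the original
    // sequential loop, honoring the rare loop-carried dependence.
    for (std::size_t i = 0; i < in.size(); ++i) {
      if (in[i] < 0) scale = -in[i];  // the dependence speculated away above
      out[i] = in[i] * scale;
    }
  }
}

int main() {
  std::vector<int> in(1000, 7), out(in.size());
  parallel_loop(in, out);
}
```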

Targeting Multiple Parallel Architectures

The current renaissance of parallel architectures has meant a massive increase in complexity for parallel programmers, since different techniques are needed to achieve the best performance on multicore processors, clusters, and GPUs. By parallelizing sequential codes automatically, our compiler framework enables portable parallel performance. While our original DSWP technique targeted multi-core architectures, we have extended it to both cluster and GPU architectures [MICRO 2010, PLDI 2011, CGO 2012]. We believe we are the first to automatically produce scalable parallel code for clusters from ordinary C programs. Furthermore, our compiler currently enables fully automatic DOALL parallelization for GPUs [PLDI 2011, CGO 2012] and speculative DOALL parallelization for clusters [CGO 2012].

Automatic and Dynamic Tuning of Parallelized Applications

Traditionally, developers tune and parallelize an application to meet a particular performance goal on an unloaded platform under a given set of workload characteristics. In deployed systems, performance generally degrades as the situation deviates from this assumed environment and performance goal. Compile-time automatic parallelization naturally suffers from the same problem. The optimal degree and type of parallelization can only be known after a user chooses a desired performance goal and the program starts executing on the selected platform under the actual workload. To solve this problem, we freed programmers and compilers from the concerns of parallelizing and tuning for particular platforms, performance goals, and workloads. Programmers and compilers expose parallelism in the usual way except that they defer important performance decisions (such as selecting the type of parallelism) to an API. After deployment, users can express their desired performance goal as a fitness function of throughput, latency, response time, power, time of day, etc. The run-time system then makes all previously deferred decisions dynamically and automatically in response to the execution environment (platform and workload), optimizing for the user’s desired performance goal. Using this method in our compiler framework, we found dramatic improvements over static tuning for user goals like "Maximize Throughput at 90% peak power" or "Minimize Response Time on 24 Cores" [PLDI 2011].
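
A hypothetical sketch of this deferred-decision pattern appears below; the dope namespace, Metrics fields, and decide function are illustrative stand-ins, not DoPE's published interface. The point is the division of labor: the program exposes parallelism and a fitness function, and the runtime picks the type and degree of parallelism.

```cpp
// Illustrative sketch of deferring parallelization decisions to a
// runtime executive. All names here are assumptions for exposition.
#include <functional>

namespace dope {

enum class ParallelismType { Pipeline, DataParallel, Sequential };

struct Decision { ParallelismType type; int degree; };

// Metrics the runtime observes; the user's goal is a fitness function
// over them (throughput, latency, power, and so on).
struct Metrics { double throughput, latency, watts; };
using FitnessFn = std::function<double(const Metrics&)>;

// Stub: a real executive would measure Metrics online and re-decide as
// the platform and workload change; here we return a fixed choice.
inline Decision decide(const FitnessFn& goal) {
  Metrics sample{1000.0, 0.01, 85.0};
  (void)goal(sample);
  return {ParallelismType::DataParallel, 8};
}

}  // namespace dope

int main() {
  // Goal in the spirit of "Maximize Throughput at 90% peak power":
  // throughput counts only while the power cap is respected.
  dope::FitnessFn goal = [](const dope::Metrics& m) {
    return m.watts <= 0.9 * 100.0 ? m.throughput : 0.0;
  };

  dope::Decision d = dope::decide(goal);
  // The program dispatches on the runtime's choice instead of
  // hard-coding one: launch d.degree workers as d.type parallelism.
  (void)d;
}
```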

All Project Publications

Hardware MultiThreaded Transactions [abstract]
Jordan Fix, Nayana P. Nagendra, Sotiris Apostolakis, Hansen Zhang, Sophie Qiu, and David I. August
To Appear: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2018.
Accept Rate: 17% (56/319).
"This is without doubt the best paper on speculative threading I've read in the past five years." -- ASPLOS Reviewer

A Collaborative Dependence Analysis Framework [abstract] (ACM DL, PDF)
Nick P. Johnson, Jordan Fix, Taewook Oh, Stephen R. Beard, Thomas B. Jablin, and David I. August
Proceedings of the 2017 International Symposium on Code Generation and Optimization (CGO), February 2017.
Accept Rate: 22% (26/114).
Highest ranked paper in double-blind review process.

Static Dependence Analysis in an Infrastructure for Automatic Parallelization [abstract] (PDF)
Nick P. Johnson
Ph.D. Thesis, Department of Computer Science, Princeton University, September 2015.

Automatic Exploitation of Input Parallelism [abstract] (PDF)
Taewook Oh
Ph.D. Thesis, Department of Computer Science, Princeton University, September 2015.

Automatically Exploiting Cross-Invocation Parallelism Using Runtime Information [abstract] (PDF)
Jialu Huang
Ph.D. Thesis, Department of Computer Science, Princeton University, September 2013.

Semantic Language Extensions for Implicit Parallel Programming [abstract] (PDF)
Prakash Prabhu
Ph.D. Thesis, Department of Computer Science, Princeton University, September 2013.

ASAP: Automatic Speculative Acyclic Parallelization for Clusters [abstract] (PDF)
Hanjun Kim
Ph.D. Thesis, Department of Computer Science, Princeton University, September 2013.

Automatic Parallelization for GPUs [abstract] (PDF)
Thomas B. Jablin
Ph.D. Thesis, Department of Computer Science, Princeton University, April 2013.

Automatically Exploiting Cross-Invocation Parallelism Using Runtime Information [abstract] (PDF)
Jialu Huang, Thomas B. Jablin, Stephen R. Beard, Nick P. Johnson, and David I. August
Proceedings of the 2013 International Symposium on Code Generation and Optimization (CGO), February 2013.
Accept Rate: 28% (33/117).

Parcae: A System for Flexible Parallel Execution [abstract] (PDF)
Arun Raman, Ayal Zaks, Jae W. Lee, and David I. August
Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2012.
Accept Rate: 18% (48/255).

Speculative Separation for Privatization and Reductions [abstract] (ACM DL, PDF)
Nick P. Johnson, Hanjun Kim, Prakash Prabhu, Ayal Zaks, and David I. August
Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2012.
Accept Rate: 18% (48/255).

Dynamically Managed Data for CPU-GPU Architectures [abstract] (PDF)
Thomas B. Jablin, James A. Jablin, Prakash Prabhu, Feng Liu, and David I. August
Proceedings of the 2012 International Symposium on Code Generation and Optimization (CGO), March 2012.
Accept Rate: 28% (26/90).

Automatic Speculative DOALL for Clusters [abstract] (PDF)
Hanjun Kim, Nick P. Johnson, Jae W. Lee, Scott A. Mahlke, and David I. August
Proceedings of the 2012 International Symposium on Code Generation and Optimization (CGO), March 2012.
Accept Rate: 28% (26/90).

A System for Flexible Parallel Execution [abstract] (PDF)
Arun Raman
Ph.D. Thesis, Department of Electrical Engineering, Princeton University, December 2011.

Automatic Extraction of Parallelism from Sequential Code
David I. August, Jialu Huang, Thomas B. Jablin, Hanjun Kim, Thomas R. Mason, Prakash Prabhu, Arun Raman, and Yun Zhang
Fundamentals of Multicore Software Development (ISBN: 978-1439812730)
Edited by Ali-Reza Adl-Tabatabai, Victor Pankratius, and Walter Tichy. Chapman & Hall / CRC Press, December 2011.

A Survey of the Practice of Computational Science [abstract] (ACM DL, PDF)
Prakash Prabhu, Thomas B. Jablin, Arun Raman, Yun Zhang, Jialu Huang, Hanjun Kim, Nick P. Johnson, Feng Liu, Soumyadeep Ghosh, Stephen Beard, Taewook Oh, Matthew Zoufaly, David Walker, and David I. August
Proceedings of the 24th ACM/IEEE Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2011.

Commutative Set: A Language Extension for Implicit Parallel Programming [abstract] (ACM DL, PDF)
Prakash Prabhu, Soumyadeep Ghosh, Yun Zhang, Nick P. Johnson, and David I. August
Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2011.
Accept Rate: 23% (55/236).

Automatic CPU-GPU Communication Management and Optimization [abstract] (ACM DL, PDF)
Thomas B. Jablin, Prakash Prabhu, James A. Jablin, Nick P. Johnson, Stephen R. Beard, and David I. August
Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2011.
Accept Rate: 23% (55/236).

Parallelism Orchestration using DoPE: the Degree of Parallelism Executive [abstract] (ACM DL, PDF)
Arun Raman, Hanjun Kim, Taewook Oh, Jae W. Lee, and David I. August
Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2011.
Accept Rate: 23% (55/236).

Tolerant Value Speculation in Coarse-Grain Streaming Computations [abstract] (IEEE Xplore, PDF)
Nathaniel Azuelos, Idit Keidar, and Ayal Zaks
Proceedings of the 25th IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2011.
Accept Rate: 14% (24/169).

Scalable Speculative Parallelization on Commodity Clusters [abstract] (ACM DL, PDF)
Hanjun Kim, Arun Raman, Feng Liu, Jae W. Lee, and David I. August
Proceedings of the 43rd IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2010.
Accept Rate: 18% (45/248).
Highest ranked paper in double-blind review process.

Safe Programmable Speculative Parallelism [abstract] (ACM DL, PDF)
Prakash Prabhu, G. Ramalingam, and Kapil Vaswani
Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2010.
Accept Rate: 20% (41/204).

Programming Multicores: Do Applications Programmers Need to Write Explicitly Parallel Programs? [abstract] (IEEE Xplore, PDF)
Arvind, David I. August, Keshav Pingali, Derek Chiou, Resit Sendag, and Joshua J. Yi
IEEE Micro, Volume 30, Number 3, May 2010.

Decoupled Software Pipelining Creates Parallelization Opportunities [abstract] (ACM DL, PDF)
Jialu Huang, Arun Raman, Yun Zhang, Thomas B. Jablin, Tzu-Han Hung, and David I. August
Proceedings of the 2010 International Symposium on Code Generation and Optimization (CGO), April 2010.
Accept Rate: 41% (29/70).

Liberty Queues for EPIC Architectures [abstract] (PDF)
Thomas B. Jablin, Yun Zhang, James A. Jablin, Jialu Huang, Hanjun Kim, and David I. August
Proceedings of the Eighth Workshop on Explicitly Parallel Instruction Computer Architectures and Compiler Technology (EPIC), April 2010.

Speculative Parallelization Using Software Multi-threaded Transactions [abstract] (ACM DL, PDF)
Arun Raman, Hanjun Kim, Thomas R. Mason, Thomas B. Jablin, and David I. August
Proceedings of the Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2010.
Accept Rate: 17% (32/181).

Compilation Strategies and Challenges for Multicore Signal Processing [abstract] (IEEE Xplore, PDF)
Mojtaba Mehrara, Thomas B. Jablin, Dan Upton, David I. August, Kim Hazelwood, and Scott Mahlke
IEEE Signal Processing Magazine, November 2009.

LAMPVIEW: A Loop-Aware Toolset for Facilitating Parallelization [abstract] (PDF)
Thomas Rorie Mason
Master's Thesis, Department of Electrical Engineering, Princeton University, August 2009.

Parallelization Techniques with Improved Dependence Handling [abstract] (PDF)
Easwaran Raman
Ph.D. Thesis, Department of Computer Science, Princeton University, June 2009.

Intelligent Speculation for Pipelined Multithreading [abstract] (PDF)
Neil Amar Vachharajani
Ph.D. Thesis, Department of Computer Science, Princeton University, November 2008.

The VELOCITY Compiler: Extracting Efficient Multicore Execution from Legacy Sequential Programs [abstract] (PDF)
Matthew John Bridges
Ph.D. Thesis, Department of Computer Science, Princeton University, November 2008.

Global Instruction Scheduling for Multi-Threaded Architectures [abstract] (PDF)
Guilherme de Lima Ottoni
Ph.D. Thesis, Department of Computer Science, Princeton University, September 2008.

Performance Scalability of Decoupled Software Pipelining [abstract] (ACM DL, PDF)
Ram Rangan, Neil Vachharajani, Guilherme Ottoni, and David I. August
ACM Transactions on Architecture and Code Optimization (TACO), Volume 5, Number 2, August 2008.

Spice: Speculative Parallel Iteration Chunk Execution [abstract] (ACM DL, PDF)
Easwaran Raman, Neil Vachharajani, Ram Rangan, and David I. August
Proceedings of the 2008 International Symposium on Code Generation and Optimization (CGO), April 2008.
Accept Rate: 31% (21/66).

Parallel-Stage Decoupled Software Pipelining [abstract] (ACM DL, PDF)
Easwaran Raman, Guilherme Ottoni, Arun Raman, Matthew Bridges, and David I. August
Proceedings of the 2008 International Symposium on Code Generation and Optimization (CGO), April 2008.
Accept Rate: 31% (21/66).

Communication Optimizations for Global Multi-Threaded Instruction Scheduling [abstract] (ACM DL, PDF)
Guilherme Ottoni and David I. August
Proceedings of the 13th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2008.
Accept Rate: 24% (31/127).

Revisiting the Sequential Programming Model for the Multicore Era [abstract] (Original Full Paper, IEEE Xplore, PDF)
Matthew J. Bridges, Neil Vachharajani, Yun Zhang, Thomas B. Jablin, and David I. August
IEEE Micro, January 2008.
Accept Rate: 14% (10/70).
IEEE Micro's "Top Picks" special issue for papers "most relevant to industry and significant in contribution to the field of computer architecture" in 2007.

Revisiting the Sequential Programming Model for Multi-Core [abstract] (IEEE Xplore, PDF, Top Picks Version)
Matthew J. Bridges, Neil Vachharajani, Yun Zhang, Thomas B. Jablin, and David I. August
Proceedings of the 40th IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2007.
Accept Rate: 21% (35/166).
Selected for IEEE Micro's "Top Picks" special issue for papers "most relevant to industry and significant in contribution to the field of computer architecture" in 2007.

Global Multi-Threaded Instruction Scheduling [abstract] (IEEE Xplore, PDF)
Guilherme Ottoni and David I. August
Proceedings of the 40th IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2007.
Accept Rate: 21% (35/166).

Operating System Support for Pipeline Parallelism on Multicore Architectures [abstract] (PDF)
John Giacomoni and Manish Vachharajani
2007 Workshop on Operating Systems Support for Heterogeneous Multicore Architectures (OSHMA), September 2007.

Speculative Decoupled Software Pipelining [abstract] (IEEE CS, PDF)
Neil Vachharajani, Ram Rangan, Easwaran Raman, Matthew J. Bridges, Guilherme Ottoni, and David I. August
Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2007.
Accept Rate: 19% (34/175).

Pipelined Multithreading Transformations and Support Mechanisms [abstract] (PDF)
Ram Rangan
Ph.D. Thesis, Department of Computer Science, Princeton University, June 2007.

Toward a Toolchain for Pipeline Parallel Programming on CMPs [abstract] (PDF)
John Giacomoni, Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald
Proceedings of the 2007 Workshop on Software Tools for Multicore Systems (STMCS), March 2007.

Global Multi-Threaded Instruction Scheduling: Technique and Initial Results [abstract] (CiteSeerX, PDF)
Guilherme Ottoni and David I. August
Proceedings of the Sixth Workshop on Explicitly Parallel Instruction Computer Architectures and Compiler Technology (EPIC), March 2007.

Eliminating Scope and Selection Restrictions in Compiler Optimizations [abstract] (PDF)
Spyridon Triantafyllis
Ph.D. Thesis, Department of Computer Science, Princeton University, September 2006.

Amortizing Software Queue Overhead for Pipelined Inter-Thread Communication [abstract] (PDF)
Ram Rangan and David I. August
Proceedings of the Workshop on Programming Models for Ubiquitous Parallelism (PMUP), September 2006.
Accept Rate: 41% (10/24).

Automatic Thread Extraction with Decoupled Software Pipelining [abstract] (IEEE Xplore, PDF)
Guilherme Ottoni, Ram Rangan, Adam Stoler, and David I. August
Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture (MICRO), November 2005.
Accept Rate: 18% (27/143).
One of five papers nominated for the Best Paper Award by the Program Committee.

A New Approach to Thread Extraction for General-Purpose Programs [abstract] (PDF)
Guilherme Ottoni, Ram Rangan, Adam Stoler, and David I. August
Proceedings of the 2nd Watson Conference on Interaction between Architecture, Circuits, and Compilers (PAC2), September 2005.
Accept Rate: 44% (21/47).

From Sequential Programs to Concurrent Threads [abstract] (IEEE Xplore, PDF)
Guilherme Ottoni, Ram Rangan, Adam Stoler, Matthew J. Bridges, and David I. August
IEEE Computer Architecture Letters (CAL), June 2005.
Accept Rate: 20%

Finding Parallelism for Future EPIC Machines [abstract] (PDF)
Matthew Iyer, Chinmay Ashok, Joshua Stone, Neil Vachharajani, Daniel A. Connors, and Manish Vachharajani
Proceedings of the Fourth Workshop on Explicitly Parallel Instruction Computer Architectures and Compiler Technology (EPIC), March 2005.

Decoupled Software Pipelining: A Promising Technique to Exploit Thread-Level Parallelism [abstract]
Guilherme Ottoni, Ram Rangan, Neil Vachharajani, and David I. August
Proceedings of the Fourth Workshop on Explicitly Parallel Instruction Computer Architectures and Compiler Technology (EPIC), March 2005.

Decoupled Software Pipelining with the Synchronization Array [abstract] (IEEE Xplore, PDF)
Ram Rangan, Neil Vachharajani, Manish Vachharajani, and David I. August
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2004.
Accept Rate: 18% (23/122).
Highest ranked paper in double-blind review process.

Program Slicing of Explicitly Parallel Programs [abstract] (PDF)
Matthew Bridges
Senior Thesis, Department of Computer Science, University of Delaware, May 2002.