Publications
Revisiting Computation for Research: Practices and Trends [abstract]
Jeremiah Giordani*, Ziyang Xu*, Ella Colby, August Ning, Bhargav Reddy Godala, Ishita Chaturvedi, Shaowei Zhu, Yebin Chon, Greg Chan, Zujun Tan, Galen Collier, Jonathan D. Halverson, Enrico Armenio Deiana, Jasper Liang, Federico Sossai, Yian Su, Atmn Patel, Bangyen Pham, Nathan Greiner, Simone Campanoni, and David I. August
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), November 2024.
(*Co-first authors)
PROMPT: A Fast and Extensible Memory Profiling Framework [abstract] (arXiv)
Ziyang Xu, Yebin Chon, Yian Su, Zujun Tan, Sotiris Apostolakis, Simone Campanoni, and David I. August
Proceedings of the ACM on Programming Languages, Volume 8, Issue OOPSLA (OOPSLA), October 2024.
GhOST: a GPU Out-of-Order Scheduling Technique for Stall Reduction [abstract] (PDF)
Ishita Chaturvedi, Bhargav Reddy Godala, Yucan Wu, Ziyang Xu, Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis, Sotirios Xydis, Dimitrios Soudris, Tyler Sorensen, Simone Campanoni, Tor M. Aamodt, and David I. August
Proceedings of the 51st International Symposium on Computer Architecture (ISCA), June 2024.
Awarded all top ACM Reproducibility Badges offered by the Artifact Evaluation Committee.
PDIP: Priority Directed Instruction Prefetching [abstract] (PDF)
Bhargav Reddy Godala, Sankara Prasad Ramesh, Gilles A. Pokam, Jared Stark, Andre Seznec, Dean Tullsen, and David I. August
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 2024.
Awarded Best Paper (one of six) out of 922 submissions.
Awarded all top ACM Reproducibility Badges offered by the Artifact Evaluation Committee.
PROMPT: A Fast and Extensible Memory Profiling Framework
Ziyang Xu, Yebin Chon, Yian Su, Zujun Tan, Sotiris Apostolakis, Simone Campanoni, and David I. August
Available at https://github.com/PrincetonUniversity/PROMPT/, January 2024.
Transpilation Utilizing Language-Agnostic IR and Interactivity for Parallelization [abstract] (PDF)
Zujun Tan
Ph.D. Thesis, Department of Computer Science,
Princeton University, 2024.
Migrating codes between architectures is difficult because different execution models require different types of parallelism for optimal performance. Previous approaches, like libraries or source-level tools, generate correct and natural-looking syntax for the new parallel model with limited optimization and largely leave performance engineering to the programmer. Recent approaches, such as transpilation at the compiler intermediate representation (IR) level, can automate performance engineering, but profitability can be limited by not having facts known only to the programmer. Decompiling the optimized program could leverage the strength of existing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with arbitrary compilers of the source language.
This thesis provides a method for migrating code such that the compiler and programmer work together to generate code with optimal performance. To achieve this,
it introduces Tulip, which retargets parallelism via IR-level transpilation. Transpiling at the IR level enables Tulip to generalize the transformations applied to retarget parallelism. Furthermore, Tulip integrates a state-of-the-art automatic parallelization framework to explore additional parallelism expressible only in the target parallel programming model. It then generates natural source code through a novel decompiler, SPLENDID, in a high-level parallel programming language (OpenMP), which can be interactively optimized and tuned with programmer intervention. For 19 Polybench benchmarks, Tulip-generated OpenMP offloading programs perform 14% faster than the original CUDA sources on NVIDIA GPUs. Moreover, transpilation to the CPU leads to a 2.9x speedup over the best state-of-the-art transpiler. Tulip-generated portable parallel code is also more natural than what existing decompilers produce, resulting in a 39x higher average BLEU score.
This thesis includes contributions from Yebin Chon, Ziyang Xu, Sophia Zhang, and David I. August from Princeton Liberty Research Group, Brian Homerding, Yian Su, and Simone Campanoni from Northwestern Arcana Lab, Michael Kruse (AMD), Johannes Doerfert (LLNL), William S. Moses (UIUC), and Ivan R. Ivanov (Tokyo Tech).
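To make the target concrete, here is a hand-written sketch of the kind of portable OpenMP offloading code a Tulip/SPLENDID-style transpiler aims to emit for a simple vector-add kernel. This is illustrative only, not actual tool output, and the function name and pragma choices are assumptions.

```c
/* Illustrative sketch only: the kind of natural OpenMP target-offloading
 * code a Tulip/SPLENDID-style transpiler aims to emit for a simple
 * CUDA-like vector-add kernel.  Not actual tool output. */
void vadd(const float *a, const float *b, float *c, int n) {
    /* Offload the loop to an accelerator and map the arrays explicitly. */
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```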
Criticality-Aware Front-end [abstract] (PDF)
Bhargav Reddy Godala
Ph.D. Thesis, Department of Computer Science,
Princeton University, 2024.
Code footprints continue to grow faster than instruction caches, putting additional pressure on existing front-end structures. Even with aggressive front-ends using fetch-directed instruction prefetching (FDIP), modern processors experience significant front-end stalls. With the end of Moore's Law, increasing cache sizes raises critical-path latency, yielding modest returns for scaling instruction cache sizes. This dissertation addresses front-end bottlenecks based on two key observations: in FDIP-enabled processors, cache misses have unequal costs, and a small fraction of critical instruction cache lines contributes to most of the front-end stalls.
EMISSARY, the first cost-aware replacement policy tailored to the L1 instruction cache (L1I), defies conventional wisdom: unlike traditional replacement policies, it improves performance even while increasing instruction cache misses. However, EMISSARY proves less effective on datacenter workloads with large code footprints, because such workloads have more critical lines than the L1I can hold. This dissertation first presents EMISSARY-L2, the first family of criticality-aware cache replacement policies designed specifically for datacenter workloads. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY-L2 resists evicting from L2 those cache lines whose misses cause costly decode starvations. In the context of a modern FDIP-enabled processor, EMISSARY-L2 delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) on datacenter workloads. This speedup is 21.6% of that obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses.
This dissertation then proposes Priority Directed Instruction Prefetching (PDIP), a novel cost-aware instruction prefetching technique that complements FDIP by issuing prefetches for targets along the resteer path where FDIP stalls occur. PDIP identifies these targets and associates them with a trigger for future prefetch. When paired with EMISSARY-L2, PDIP achieves a geomean IPC speedup of 3.7% across a set of datacenter workloads using a budget of only 43.5KB. PDIP achieves 62% of the ideal prefetching performance.
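As a rough illustration of the criticality-aware replacement idea, the following is a minimal software sketch with assumed data structures (not the EMISSARY-L2 hardware design): victim selection prefers lines whose misses have not caused decode starvation, falling back to LRU only when every resident line is critical.

```c
#include <stdint.h>

typedef struct {
    uint8_t  valid;
    uint8_t  critical;  /* set when a past miss on this line starved decode */
    uint32_t age;       /* larger = least recently used */
} Way;

/* Pick a victim way: free slots first, then the oldest non-critical line,
 * falling back to plain LRU only when every resident line is critical. */
int pick_victim(const Way *set, int ways) {
    int victim = -1;
    uint32_t oldest = 0;
    for (int w = 0; w < ways; ++w) {
        if (!set[w].valid) return w;                 /* free slot */
        if (!set[w].critical && set[w].age >= oldest) {
            oldest = set[w].age;
            victim = w;
        }
    }
    if (victim >= 0) return victim;                  /* cheap-to-miss line */
    for (int w = 0; w < ways; ++w)                   /* all critical: LRU */
        if (set[w].age >= oldest) { oldest = set[w].age; victim = w; }
    return victim;
}
```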
QPoints: QEMU to gem5 ARM Full System Checkpointing [abstract]
Bhargav Reddy Godala, Ishita Chaturvedi, Yucan Wu, Simone Campanoni, and David I. August
ISCA 2023: The gem5 Workshop, June 2023.
Modern Front-end Support in gem5 [abstract]
Bhargav Reddy Godala, Nayana Prasad Nagendra, Ishita Chaturvedi, Simone Campanoni, and David I. August
ISCA 2023: The gem5 Workshop, June 2023.
EMISSARY: Enhanced Miss Awareness Replacement Policy For L2 Instruction Caching [abstract] (PDF)
Nayana Prasad Nagendra*, Bhargav Reddy Godala*, Ishita Chaturvedi, Atmn Patel, Svilen Kanev, Tipp Moseley, Jared Stark, Gilles A. Pokam, Simone Campanoni, and David I. August
Proceedings of the 50th International Symposium on Computer Architecture (ISCA), June 2023.
(*Co-first authors)
SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development [abstract] (PDF)
Zujun Tan, Yebin Chon, Michael Kruse, Johannes Doerfert, Ziyang Xu, Brian Homerding, Simone Campanoni, and David I. August
Proceedings of the Twenty-Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2023.
Awarded all top ACM Reproducibility Badges offered by the Artifact Evaluation Committee.
LAMP: A Practical Loop-Aware Memory Dependence Profiler [abstract]
Yebin Chon, Ziyang Xu, Zujun Tan, and David I. August
The Fourth Young Architect Workshop (YArch), March 2022.
NOELLE Offers Empowering LLVM Extensions [abstract] (PDF)
Angelo Matni, Enrico Armenio Deiana, Yian Su, Lukas Gross, Souradip Ghosh, Sotiris Apostolakis, Ziyang Xu, Zujun Tan, Ishita Chaturvedi, Brian Homerding, Tommy McMichen, David I. August, and Simone Campanoni
Proceedings of the 2022 International Symposium on Code Generation
and Optimization (CGO), February 2022.
Awarded ACM Artifact Available, Artifact Functional, and Results Reproduced Badges.
Safer at Any Speed: Automatic Context-Aware Safety Enhancement for Rust [abstract] (PDF)
Natalie Popescu*, Ziyang Xu*, Sotiris Apostolakis, David I. August, and Amit Levy
Proceedings of the ACM on Programming Languages, Volume 5, Issue OOPSLA (OOPSLA), October 2021.
(*Co-first authors) Awarded all top ACM Reproducibility Badges offered by the Artifact Evaluation Committee.
Improving Instruction Cache Performance For Modern Processors With Growing Workloads [abstract] (PDF)
Nayana Prasad Nagendra
Ph.D. Thesis, Department of Computer Science,
Princeton University, September 2021.
Cloud computing is the backbone of today's digital world. Billions of devices, ranging from internet-of-things devices to mobile phones, connect to services running on massive data centers, dubbed Warehouse-Scale Computers (WSCs). Because of the scale at which WSCs operate, their performance has a huge cost impact: even the slightest percentage improvement in performance generates significant profits for the companies that operate them. These data centers also have a hugely adverse effect on the environment because of their massive carbon footprint.
Surprisingly, our work with Google shows significant inefficiencies in WSC processor utilization. More than two-thirds of CPU cycles are wasted in many forms. Specifically, nearly 20% of the wasted cycles come from the processor front-end, which is responsible for keeping the pipeline fed with useful instructions. Since a substantial fraction of front-end cycle wastage occurs due to instruction cache misses, this dissertation takes a full-stack approach to studying and solving those misses.
Data center workloads and their complexity have grown vastly. Consequently, programs' working set sizes are much larger than the instruction caches of modern processors. With the nearing end of Moore's Law, innovations are necessary to increase the efficiency of instruction cache usage for these growing workloads.
In this dissertation, we first present AsmDB, a fleet-wide study of Google workloads to understand processor front-end bottlenecks. To our knowledge, this is the first work to study the in-depth behavior of instruction caches at such a large scale. Among its many findings, AsmDB highlights inefficient usage of instruction caches caused by significant fragmentation, even in highly optimized code.
The second part of this dissertation describes Emissary, a new approach that alleviates this problem by using the available cache capacity more efficiently. Observing that not all cache misses cause stalls, Emissary employs a novel cache replacement algorithm that reserves cache capacity for lines whose misses cause the most significant performance impact. Emissary outperforms all known cache replacement algorithms in performance and energy consumption. In some cases, it produces speedups that exceed Belady's OPT, the perfect-knowledge, minimal-miss ideal cache replacement algorithm.
NOELLE Offers Empowering LLVM Extensions [abstract] (arXiv, PDF)
Angelo Matni, Enrico Armenio Deiana, Yian Su, Lukas Gross, Souradip Ghosh, Sotiris Apostolakis, Ziyang Xu, Zujun Tan, Ishita Chaturvedi, David I. August, and Simone Campanoni
arXiv:2102.05081 [cs.PL], February 2021.
A Sensible Approach to Speculative Automatic Parallelization [abstract] (PDF)
Sotiris Apostolakis
Ph.D. Thesis, Department of Computer Science,
Princeton University, January 2021.
The promise of automatic parallelization, freeing programmers from the error-prone and time-consuming process of making efficient use of parallel processing resources, remains unrealized. For decades, the imprecision of memory analysis limited the applicability of automatic parallelization. The introduction of speculation to automatic parallelization overcame these applicability limitations but caused profitability problems due to high communication and bookkeeping costs for speculation validation and commit.
This dissertation shifts the focus from applicability to profitability by making speculative automatic parallelization more efficient. Unlike current approaches that perform analysis and transformations independently and in sequence, the proposed system integrates, without loss of modularity, memory analysis, speculative techniques, and enabling transformations.
Specifically, this dissertation involves three main contributions. First, it introduces a novel speculation-aware collaborative analysis framework (SCAF) that minimizes the need for high-overhead speculation. Second, it proposes a new parallelizing-compiler design that enables careful planning. Third, it presents new efficient speculative privatization transformations that avoid privatization overheads of prior work.
SCAF learns of available speculative information via profiling, computes its full impact on memory dependence analysis, and makes this resulting information available for all code optimizations. SCAF is modular (adding new analysis modules is easy) and collaborative (modules cooperate to produce a result more precise than the confluence of all individual results). Relative to the best prior speculation-aware dependence analysis technique, SCAF dramatically reduces the need for expensive-to-validate memory speculation in the hot loops of all 16 evaluated C/C++ SPEC benchmarks.
The introduction of planning in the compiler enables the selection of parallelization-enabling transformations based on their cost and overall impacts. The benefit is twofold. First, the compiler only applies a minimal-cost set of enabling transformations. Second, this decision-making process makes the addition of new efficient speculative enablers possible, including the proposed privatization transformations.
This dissertation integrates these contributions into a fully-automatic speculative-DOALL parallelization framework for commodity hardware. By reducing speculative parallelization overheads in ways not possible with prior parallelization systems, this work obtains higher overall program speedup (23.0x for 12 general-purpose C/C++ programs running on a 28-core shared-memory machine) than Privateer (11.5x), the prior automatic speculative-DOALL system with the highest applicability.
SCAF: A Speculation-Aware Collaborative Dependence Analysis Framework (ACM DL)
Sotiris Apostolakis, Ziyang Xu, Zujun Tan, Greg Chan, Simone Campanoni, and David I. August
Available at https://liberty.princeton.edu/Projects/AutoPar/SCAF/, December 2020.
NOELLE Offers Empowering LLVM Extensions
Sotiris Apostolakis, Ziyang Xu, Zujun Tan, David I. August, and Simone Campanoni
Available at https://liberty.princeton.edu/Projects/NOELLE/, December 2020.
Perspective: A Sensible Approach to Speculative Automatic Parallelization
Sotiris Apostolakis, Ziyang Xu, Zujun Tan, Greg Chan, Simone Campanoni, and David I. August
Available at https://liberty.princeton.edu/Projects/AutoPar/Perspective/, December 2020.
SCAF: A Speculation-Aware Collaborative Dependence Analysis Framework [abstract] (ACM DL, PDF)
Sotiris Apostolakis, Ziyang Xu, Zujun Tan, Greg Chan, Simone Campanoni, and David I. August
Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2020.
Awarded all top ACM Reproducibility Badges offered by the Artifact Evaluation Committee.
Automatic and Speculative Parallel-Stage Decoupled Software Pipelining [abstract]
Zujun Tan, Greg Chan, Ziyang Xu, Sotiris Apostolakis, and David I. August
The Second Young Architect Workshop (YArch), March 2020.
Perspective: A Sensible Approach to Speculative Automatic Parallelization [abstract] (ACM DL, PDF)
Sotiris Apostolakis, Ziyang Xu, Greg Chan, Simone Campanoni, and David I. August
Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2020.
Awarded all top ACM Reproducibility Badges offered by the Artifact Evaluation Committee.
Hardware MultiThreaded Transactions: Enabling Speculative MultiThreaded Pipeline Parallelization For Complex Programs [abstract] (PDF)
Jordan Fix
Ph.D. Thesis, Department of Computer Science,
Princeton University, 2020.
Speculation with transactional memory systems helps programmers and compilers produce profitable thread-level parallel programs. Prior work shows that supporting transactions that can span multiple threads, rather than requiring transactions be contained within
a single thread, enables new types of speculative parallelization techniques for both programmers and parallelizing compilers.
Unfortunately, software support for multi-threaded transactions (MTXs) comes with
significant additional inter-thread communication overhead for speculation validation. This
overhead can make otherwise good parallelization unprofitable for programs with sizeable
read and write sets.
Some programs using these prior software MTXs overcame this problem through significant efforts by expert programmers to minimize these sets and optimize communication, capabilities which compiler technology has been unable to achieve to date. Instead,
this dissertation makes speculative parallelization less laborious and more feasible through
low-overhead speculation validation, presenting the first complete design, implementation,
and evaluation of hardware MTXs.
Even with maximal speculation validation of every load and store inside transactions of
tens to hundreds of millions of instructions, profitable parallelization of complex programs
can be achieved. Across 8 benchmarks, this system achieves a geomean speedup of 104%
over sequential execution on a multicore machine with 4 cores, with modest increases in
power and energy consumption. This work could be used as a building block to enable
more realistic automatic parallelization of complex programs, by providing low-overhead,
long-running, resilient transactions that support a diverse set of parallel paradigm options.
AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers [abstract] (PDF, Top Picks Version)
Grant Ayers, Nayana P. Nagendra, David I. August, Hyoun K. Cho, Svilen Kanev, Christos Kozyrakis, Trivikram Krishnamurthy, Heiner Litz, Tipp Moseley, and Parthasarathy Ranganathan
Proceedings of the 46th International Symposium on Computer Architecture (ISCA), June 2019.
Selected for IEEE Micro's "Top Picks" special issue "based on novelty and potential for long-term impact in the field of computer architecture" in 2019.
Architectural Support for Containment-based Security [abstract] (PDF)
Hansen Zhang, Soumyadeep Ghosh, Jordan Fix, Sotiris Apostolakis, Stephen R. Beard, Nayana P. Nagendra, Taewook Oh, and David I. August
Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 2019.
Collaborative Parallelization Framework [abstract]
Ziyang Xu, Greg Chan, Sotiris Apostolakis, and David I. August
The First Young Architect Workshop (YArch), February 2019.
Automatic parallelization frees programmers from the error-prone and
time-consuming process of making applications run efficiently on multi-core
systems. Each existing parallelizing compiler is the best option for a
different set of benchmarks; none is universally performant. While new
multicore-optimized parallelization transformations, stronger analyses, and
more efficient speculation techniques are powerful enough to extract
parallelism reliably from even irregular and complex applications, they often
fail to do so because they lack proper coordination and information. This work
introduces the Collaborative Parallelization Framework (CPF), a means to
address these deficiencies. To address the lack of coordination, CPF
orchestrates collaborations among parallelization techniques and enabling
transformations. In cases where this is insufficient, CPF seeks additional
information from the programmer. Orchestration of all these components allows
for a modular design that facilitates extensibility.
TrustGuard: A Model for Practical Trust in Real Systems [abstract] (PDF)
Stephen R. Beard
Ph.D. Thesis, Department of Computer Science,
Princeton University, 2019.
All layers of today's computing systems, from hardware to software, are
vulnerable to attack. Market and economic pressure push companies to focus
on performance and features, leaving security and correctness as secondary
concerns. As a consequence, systems must frequently be patched after
deployment to fix security vulnerabilities. While this non-virtuous
exploit-patch-exploit cycle is insecure, it is practical enough for
companies to use.
Formal methods in both software and hardware can guarantee the security they
provide. Ideally, modern systems would be composed of formally verified and
secure components. Unfortunately, such methods have not seen widespread
adoption for a number of reasons, such as difficulty in scaling, lack of
tools, and high skill requirements. Additionally, the economics involved in
securing and replacing every component in all systems, both new and
currently deployed, result in clean slate solutions being impractical. A
practical solution should rely on a few, simple components and should be
adoptable incrementally.
TrustGuard, the first implementation of the Containment Architecture with Verified
Output (CAVO) model developed at Princeton, showed how a simple,
trusted Sentry could protect against malicious or buggy hardware components
to ensure integrity of external communications. This was accomplished by
ensuring the correct execution of signed software, with support from a
modified CPU and system architecture. However, TrustGuard's practicality was
limited due to its reliance on modified host hardware and its requirement to
trust entire application stacks, including the operating system.
The work presented in this dissertation seeks to make the CAVO model a
practical solution for ensuring the integrity of data communicated
externally in an untrusted commodity system. This work extends CAVO in two
ways. First, it makes the Sentry compatible with a wide range of devices
without requiring hardware modifications to the host system, thus increasing
its flexibility and ease of integration into existing environments. Second,
it gives developers the option to use trusted code to verify the execution
of untrusted code, thus reducing the size of the trusted code base. This is
analogous to the small, trusted Sentry ensuring correctness
of execution of a large amount of complex, untrusted hardware.
MemoDyn: Exploiting Weakly Consistent Data Structures for Dynamic Parallel Memoization [abstract] (ACM DL, PDF)
Prakash Prabhu, Stephen R. Beard, Sotiris Apostolakis, Ayal Zaks, and David I. August
Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT), November 2018.
Hardware MultiThreaded Transactions [abstract] (ACM DL, PDF)
Jordan Fix, Nayana P. Nagendra, Sotiris Apostolakis, Hansen Zhang, Sophie Qiu, and David I. August
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2018.
Static and Dynamic Instruction Mapping for Spatial Architectures [abstract] (PDF)
Feng Liu
Ph.D. Thesis, Department of Electrical Engineering,
Princeton University, 2018.
In response to the technology scaling trends, spatial architectures have emerged
as a new style of processors for executing programs more efficiently. Unlike
traditional out-of-order (OoO) processors, which time-share a small set of
functional units, a spatial computer is composed of hundreds or even thousands
of simple and replicated functional units. Spatial architectures avoid the
overheads of time-sharing and of generating schedules repeatedly, by mapping
instruction sequences onto the functional units explicitly and reusing the mapping across multiple invocations.
Currently, spatial architectures mainly use static methods to map and schedule
instructions onto the arrays of functional units. The existing methods have
several limitations: First, for programs with irregular memory accesses and
control flows, they yield poor performance because the functional units need
to be invoked sequentially to respect data and control dependences. Second,
static methods cannot fully exploit speculation techniques, which are the
dominant performance sources in OoO processors. Finally, static methods cannot
adapt to changing workloads and are not compatible across hardware generations.
To address these issues and improve the applicability of spatial architectures,
this dissertation proposes two techniques. The first, Coarse-Grained Pipelined
Accelerators (CGPA), is a static compiling framework that exploits the hidden
parallelism within irregular C/C++ loops and translates them into spatial
hardware modules. The proposed technique has been implemented as a compiler pass
and the experiment shows 3.3x speedup over the performance achieved by an
open-source tool baseline.
The second technique, Dynamic Spatial Architecture Mapping (DYNASPAM), reuses
the speculation system in the OoO processors to dynamically produce high
performance scheduling and execution on a dedicated spatial fabric. The proposed
technique is modeled by a cycle accurate simulator and the experiment shows the
new technique can achieve 1.4x geomean performance improvement and 23.9%
energy consumption reduction, compared to an aggressive OoO processor baseline.
A Generalized Framework for Automatic Scripting Language Parallelization [abstract] (PDF)
Taewook Oh, Stephen R. Beard, Nick P. Johnson, Sergiy Popovych, and David I. August
Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2017.
Computational scientists are typically not expert programmers, and thus work
in easy-to-use dynamic languages. However, they have very high performance
requirements, due to their large datasets and experimental setups. Thus,
the performance required for computational science must be extracted from
dynamic languages in a manner that is transparent to the programmer.
Current approaches to optimize and parallelize dynamic languages, such as
just-in-time compilation and highly optimized interpreters, require a huge
amount of implementation effort and are typically only effective for a
single language. However, scientists in different fields use different
languages, depending upon their needs.
This paper presents techniques to enable automatic extraction of parallelism
within scripts that are universally applicable across multiple different
dynamic scripting languages. The key insight is that combining a script
with its interpreter, through program specialization techniques, will embed
any parallelism within the script into the combined program. Additionally,
this paper presents several enhancements to existing speculative automatic
parallelization techniques to handle the dependence patterns created by the
specialization process. A prototype of the proposed technique, called
Partial Evaluation with Parallelization (PEP), is evaluated against two open-source script
interpreters with 6 input scripts each. The resulting geomean speedup of
5.1x on a 24-core machine shows the potential of the generalized
approach in automatic extraction of parallelism in dynamic scripting
languages.
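The key insight lends itself to a toy example (not the paper's system; the bytecode and loops below are invented for exposition): once a tiny interpreter is specialized against a fixed script, the interpretive dispatch disappears and the script's loop-level parallelism becomes visible to an automatic parallelizer.

```c
/* Toy illustration of the key insight.  Not the paper's system. */

/* Generic interpreter: opaque control flow, hard to parallelize. */
void interp(const int *ops, int nops, double *a, int n) {
    for (int pc = 0; pc < nops; ++pc)
        switch (ops[pc]) {
        case 0: for (int i = 0; i < n; ++i) a[i] *= 2.0; break;  /* DOUBLE */
        case 1: for (int i = 0; i < n; ++i) a[i] += 1.0; break;  /* INC */
        }
}

/* Specialized for the fixed script {DOUBLE, INC}: the dispatch is gone,
 * leaving two independent, trivially DOALL-parallelizable loops. */
void interp_specialized(double *a, int n) {
    for (int i = 0; i < n; ++i) a[i] *= 2.0;
    for (int i = 0; i < n; ++i) a[i] += 1.0;
}
```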
A Collaborative Dependence Analysis Framework [abstract] (ACM DL, PDF)
Nick P. Johnson, Jordan Fix, Taewook Oh, Stephen R. Beard, Thomas B. Jablin, and David I. August
Proceedings of the 2017 International Symposium on Code Generation
and Optimization (CGO), February 2017.
Highest ranked paper in double-blind review process.
TrustGuard: A Containment Architecture with Verified Output [abstract] (PDF)
Soumyadeep Ghosh
Ph.D. Thesis, Department of Computer Science,
Princeton University, 2017.
Computers today are so complex and opaque that a user cannot know
everything occurring within the system. Most efforts toward computer
security have focused on securing software. However, software
security techniques implicitly assume correct execution by the
underlying system, including the hardware. Securing these systems has
been challenging due to their complexity and the proportionate attack
surface they present during their design, manufacturing, deployment,
and operation. Ultimately, the user's trust in the system depends on
claims made by each party supplying the system's components.
This dissertation presents the Containment Architecture with Verified Output
(CAVO) model in recognition of the reality that existing tools and techniques
are insufficient to secure complex hardware components in modern computing
systems. Rather than attempt to secure each complex hardware component
individually, CAVO establishes trust in hardware using a single, simple,
separately manufactured component, called the Sentry. The Sentry bridges a
physical gap between the untrusted system and its external interfaces and
contains the effects of malicious behavior by untrusted system components before
the external manifestation of any such effects. Thus, only the Sentry and the
physical gap must be secured in order to assure users of the containment of
malicious behavior. The simplicity and pluggability of CAVO's Sentry enable
suppliers and consumers to take additional measures to secure it, including
formal verification, supervised manufacture, and supply chain diversification.
This dissertation also presents TrustGuard---the first prototype CAVO
design---to demonstrate the feasibility of the CAVO model. TrustGuard achieves
containment by only allowing the communication of correctly executed results of
signed software. The Sentry in TrustGuard leverages execution information
obtained from the untrusted processor to enable efficient checking of the
untrusted system's work, even when the Sentry itself is simpler and much slower
than the untrusted processor. Simulations show that TrustGuard can guarantee
containment of malicious hardware components with a geomean of 8.5% decline in
the processor's performance, even when the Sentry operates at half the clock
frequency of the complex, untrusted processor.
Speculatively Exploiting Cross-Invocation Parallelism [abstract] (PDF)
Jialu Huang, Prakash Prabhu, Thomas B. Jablin, Soumyadeep Ghosh, Sotiris Apostolakis, Jae W. Lee, and David I. August
Proceedings of the 25th International Conference on Parallel
Architectures and Compilation Techniques (PACT), September 2016.
Static Dependence Analysis in an Infrastructure for Automatic Parallelization [abstract] (PDF)
Nick P. Johnson
Ph.D. Thesis, Department of Computer Science,
Princeton University, September 2015.
Now that parallel architectures are common, software must exploit multiple cores to fully utilize hardware resources and achieve efficient execution. Restructuring applications for explicit parallelism requires developers to reason about low-level details to avoid concurrency bugs and achieve parallel performance. Automatic thread extraction relieves developers of this arduous task.
This dissertation presents a compiler middle-ware for automatic thread extraction: analyses and transformations that allow the compiler to deliver parallel performance for sequentially-specified input programs. This middle-ware reconsiders the compilation infrastructure to present alternative technologies better suited to the needs of automatic thread extraction.
Specifically:

Collaborative Dependence Analysis: Since no analysis algorithm can precisely decide all analysis facts, this dissertation combines simple analysis algorithms into a collaborative ensemble algorithm: the ensemble can disprove dependence queries which no member disproves alone. Collaboration enables factored development, which prescribes the development of small, orthogonal analysis algorithms tailored to the needs of analysis clients. Results demonstrate independently-developed analysis algorithms collaborating to solve complex, multiple-logic queries. Further, results demonstrate a large impact on performance: for some benchmarks, analysis strength is the difference between an 11x slowdown and a 28x speedup on 50 cores.
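The collaborative-ensemble idea can be sketched in a few lines of C. The interfaces below are assumed for illustration and are not the dissertation's actual API: each module may either disprove a dependence directly or call back into the ensemble with simpler premise queries that other modules can answer.

```c
typedef struct Instruction Instruction;   /* opaque IR instruction */
typedef struct Ensemble Ensemble;

typedef enum { DEP_MAY, DEP_NO } DepResult;

/* An analysis module receives the ensemble so it can pose sub-queries. */
typedef DepResult (*AnalysisModule)(Ensemble *e,
                                    const Instruction *src,
                                    const Instruction *dst);

struct Ensemble {
    AnalysisModule *modules;
    int nmodules;
};

DepResult ensemble_query(Ensemble *e,
                         const Instruction *src, const Instruction *dst) {
    /* Any single disproof suffices; modules may recurse into ensemble_query()
     * for their premises (a real system bounds this recursion). */
    for (int i = 0; i < e->nmodules; ++i)
        if (e->modules[i](e, src, dst) == DEP_NO)
            return DEP_NO;
    return DEP_MAY;   /* conservatively assume the dependence exists */
}
```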
Scaling Analysis to Larger Scopes: The infrastructure builds around Parallel-Stage Decoupled Software Pipelining (PS-DSWP) thread extraction. PS-DSWP targets large, hot program scopes to overcome the non-recurring overheads of parallel execution. However, as scopes grow, the burden of analysis becomes prohibitive. This dissertation contributes a faster algorithm to compute the dependence graph that drives PS-DSWP. The algorithm identifies dependence edges which cannot affect PS-DSWP and skips dependence analysis queries pertaining to those unimportant edges, reducing analysis time (or allowing more expensive analysis algorithms) for the remaining, important queries. Evaluation demonstrates that the algorithm computes the DAG_SCC twice as fast using half as many dependence analysis queries, without sacrificing analysis precision.
Incorporating Speculation into the Compilation Pipeline: A parallelization system may speculate various properties to enable thread extraction. This dissertation presents design patterns to simplify the development of novel speculation types.

The dissertation integrates these contributions into a robust and flexible middle-ware upon which many works are built.
Automatic Exploitation of Input Parallelism [abstract] (PDF)
Taewook Oh
Ph.D. Thesis, Department of Computer Science,
Princeton University, September 2015.
Parallelism may reside in the input of a program rather than the program
itself. A script interpreter, for example, is hard to parallelize
because its dynamic behavior is unpredictable until an input script is given.
Once the interpreter is combined with the script, the resulting program becomes
predictable, and even parallelizable if the input script contains parallelism.
Despite recent progress in automatic parallelization research, however, existing
techniques cannot take advantage of the parallelism within program inputs, even
when the inputs remain fixed across multiple executions of the program.
This dissertation shows that the automatic exploitation of parallelism within
fixed program inputs can be achieved by coupling program specialization
with automatic parallelization techniques. Program specialization
marries a program with the values that remain
invariant across the program execution, including fixed inputs, and creates a
program that is highly optimized for the invariants. The proposed technique exploits
program specialization as an enabling transformation for automatic
parallelization; through specialization, the parallelism within the fixed
program inputs can be materialized within the specialized program.
First, this dissertation presents Invariant-induced Pattern-based Loop
Specialization (IPLS).
IPLS folds the parallelism within the program invariants into the specialized
program, thereby creating a more complete and predictable program that is easier
to parallelize.
Second, this dissertation applies automatic speculative
parallelization techniques to specialized programs to exploit parallelism in
inputs. As existing techniques fail to extract parallelism from complex programs
such as IPLS specialized programs, context-sensitive speculation and
optimized design of the speculation run-time system are proposed to improve the
applicability and minimize the execution overhead of the parallelized program.
A prototype of the proposed technique is evaluated against two widely-used
open-source script interpreters. Experimental results demonstrate the
effectiveness of the proposed techniques.
DynaSpAM: Dynamic Spatial Architecture Mapping using Out of Order Instruction Schedules [abstract] (PDF)
Feng Liu, Heejin Ahn, Stephen R. Beard, Taewook Oh, and David I. August
Proceedings of the 42nd International Symposium on Computer Architecture (ISCA), June 2015.
CGPA: Coarse-Grained Pipelined Accelerators [abstract] (PDF)
Feng Liu, Soumyadeep Ghosh, Nick P. Johnson, and David I. August
The Design Automation Conference (DAC), June 2014.
Automatically Exploiting Cross-Invocation Parallelism Using Runtime Information [abstract] (PDF)
Jialu Huang
Ph.D. Thesis, Department of Computer Science,
Princeton University, September 2013.
Automatic parallelization is a promising approach to producing
scalable multi-threaded programs for multi-core architectures. Most
existing techniques parallelize independent loops and insert global
synchronizations at the end of each loop invocation. For programs with
few loop invocations, these global synchronizations do not limit
parallel execution performance. However, for programs with many loop
invocations, those synchronizations can easily become the performance
bottleneck since they frequently force all threads to wait, losing
potential parallelization opportunities. To address this problem,
some automatic parallelization techniques apply static analyses to
enable cross-invocation parallelization. Instead of waiting, threads
can execute iterations from follow-up invocations if they do not cause
any conflict. However, static analysis must be conservative and cannot
handle irregular dependence patterns manifested by particular program
inputs at runtime.
In order to enable more parallelization across loop invocations, this thesis
presents two novel automatic parallelization techniques: DOMORE and SpecCross.
Unlike existing techniques relying on static analyses, these two techniques
take advantage of runtime information to achieve much more aggressive
parallelization. DOMORE constructs a custom runtime engine which
non-speculatively observes dependences at runtime and synchronizes iterations
only when necessary; while SpecCross applies software speculative barriers to
permit some of the threads to execute past the invocation boundaries. The two
techniques are complementary in the sense that they can parallelize programs
with potentially very different characteristics. SpecCross, with less runtime
overhead, works best when programs' cross-invocation dependences seldom cause
any runtime conflict. DOMORE, on the other hand, has its advantage in handling
dependences which cause frequent conflicts. Evaluating implementations of
DOMORE and SpecCross demonstrates that both techniques can achieve much better
scalability compared to existing automatic parallelization techniques.
Semantic Language Extensions for Implicit Parallel Programming [abstract] (PDF)
Prakash Prabhu
Ph.D. Thesis, Department of Computer Science,
Princeton University, September 2013.
Several emerging fields of science and engineering are increasingly
characterized by computationally intensive programs. Without parallelization,
such programs do not benefit from the increasing core counts available in today's
chip multiprocessors. However, writing correct and well-performing parallel
programs is widely perceived to be an extremely hard problem. In order to
understand the challenges faced by scientific programmers in effectively
leveraging parallel computation, this dissertation first presents an in-depth
field study of the practice of computational science.
Based on the results of the field study, this dissertation proposes two new
implicit parallel programming (IPP) solutions. With IPP, artificial constraints
imposed by sequential models for automatic parallelization are overcome by use
of semantic programming extensions. These preserve the ease of sequential
programming and enable multiple parallelism forms without additional parallelism
constructs, achieving the best of both automatic and explicit parallelization.
The first IPP solution, Commutative Set, generalizes existing notions of
semantic commutativity. It allows a programmer to relax execution orders
prohibited under a sequential programming model with a high degree of
expressiveness. The second IPP solution, WeakC, provides language extensions to
relax strict consistency requirements of sequential data structures, and
dynamically optimizes a parallel configuration of these data structures via a
combined compiler-runtime system.
This dissertation evaluates both Commutative Set and WeakC on real-world
applications running on real hardware, including some that are actively used by
scientists in their day-to-day research. The detailed experimental
evaluation results demonstrate the effectiveness of the proposed techniques.
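As a toy illustration of the semantic-commutativity idea, consider the sketch below. The annotation syntax is invented for exposition and is not Commutative Set's actual notation: the programmer asserts that insertions into a set commute, letting an implicitly parallel compiler run the loop's iterations concurrently despite the apparent dependence.

```c
#include <stddef.h>

typedef struct Set Set;
void set_insert(Set *s, long key);   /* assumed to exist elsewhere */

/* @commutative(set_insert, set_insert) -- hypothetical annotation stating
 * that any interleaving of insertions yields the same final set, so an
 * implicitly parallel compiler may run iterations of this loop concurrently
 * even though every call writes the shared set 'seen'. */
void collect(Set *seen, const long *keys, size_t n) {
    for (size_t i = 0; i < n; ++i)
        set_insert(seen, keys[i]);
}
```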
ASAP: Automatic Speculative Acyclic Parallelization for Clusters [abstract] (PDF)
Hanjun Kim
Ph.D. Thesis, Department of Computer Science,
Princeton University, September 2013.
While clusters of commodity servers and switches are the most
popular form of large-scale parallel computers, many programs are not
easily parallelized for clusters due to high inter-node communication
cost and lack of globally shared memory. Speculative Decoupled
Software Pipelining (Spec-DSWP) is a promising automatic
parallelization technique for clusters that speculatively partitions a
loop into multiple threads that communicate in a pipelined manner.
Speculation can complement conservative static analysis, making
automatic parallelization more robust and applicable. Pipelining
allows Spec-DSWP to speculate only rarely occurring dependences while
respecting the other dependences through communication among threads.
Acyclic communication patterns in pipelining make the parallelized
programs tolerant of high communication latency of clusters. However,
since Spec-DSWP partitions a loop iteration (a transaction) into
multiple sub-transactions across multiple threads according to the
pipeline stages, a special runtime system is required that supports
multi-threaded transactions (MTXs).
This dissertation proposes the Automatic Speculative Acyclic
Parallelization (ASAP) system that enables Spec-DSWP for clusters
without any hardware modification. The ASAP system supports
various speculation techniques that require different validation and
communication costs, and automatically parallelizes sequential loops
using the Spec-DSWP transformation with the optimal application of the
speculation techniques. The ASAP system efficiently supports MTXs
to correctly execute the speculatively transformed programs on
clusters. With synergistic combination of speculation, acyclic
communication, and runtime system support, this approach achieves or
demonstrates a path to achieve scalable performance speedup up to
109x for a wide range of applications on clusters without any
hardware modification.
Fast Condensation of the Program Dependence Graph [abstract] (ACM DL, PDF)
Nick P. Johnson, Taewook Oh, Ayal Zaks, and David I. August
Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2013.
Automatic Parallelization for GPUs [abstract] (PDF)
Thomas B. Jablin
Ph.D. Thesis, Department of Computer Science,
Princeton University, April 2013.
GPUs are flexible parallel processors capable of accelerating real
applications. To exploit them, programmers rewrite programs in new languages
using intimate knowledge of the underlying hardware. This is a step backwards in
abstraction and ease of use from sequential programming. When implementing
sequential applications, programmers focus on high-level algorithmic concerns,
allowing the compiler to target the peculiarities of specific
hardware. Automatic parallelization can return ease of use and hardware
abstraction to programmers. This dissertation presents techniques for
automatically parallelizing ordinary sequential C codes for GPUs using DOALL and
pipelined parallelization techniques. The key contributions include: the first
automatic data management and communication optimization framework for GPUs and
the first automatic pipeline parallelization system for GPUs. Combining these
two contributions with an automatic DOALL parallelization yields the first fully
automatic parallelizing compiler for GPUs.
Practical Automatic Loop Specialization [abstract] (PDF)
Taewook Oh, Hanjun Kim, Nick P. Johnson, Jae W. Lee, and David I. August
Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2013.
Automatically Exploiting Cross-Invocation Parallelism Using Runtime Information [abstract] (PDF)
Jialu Huang, Thomas B. Jablin, Stephen R. Beard, Nick P. Johnson, and David I. August
Proceedings of the 2013 International Symposium on Code Generation
and Optimization (CGO), February 2013.
Automatic parallelization is a promising approach to producing scalable
multi-threaded programs for multicore architectures. Many existing automatic
techniques only parallelize iterations within a loop invocation and synchronize
threads at the end of each loop invocation. When parallel code contains many
loop invocations, synchronization can easily become a performance bottleneck.
Some automatic techniques address this problem by exploiting cross-invocation
parallelism. These techniques use static analysis to partition iterations among
threads to avoid cross-thread dependences. However, this partitioning is not
always achievable at compile time, because program input determines dependence
patterns at run time. By contrast, this paper proposes DOMORE, the first
automatic parallelization technique that uses runtime information to exploit
additional cross-invocation parallelism. Instead of partitioning iterations
statically, DOMORE dynamically detects cross-thread dependences and synchronizes
only when necessary. DOMORE consists of a compiler and a runtime library. At
compile time, DOMORE automatically parallelizes loops and inserts a custom
runtime engine into programs. At runtime, the engine observes dependences and
synchronizes iterations only when necessary. For six programs, DOMORE achieves
a geomean loop speedup of 2.1x over parallel execution without cross-invocation
parallelization and of 3.2x over sequential execution on eight cores.
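A rough sketch of the shape of such generated code follows. The runtime calls are invented names for illustration, not DOMORE's real API: per-iteration checks replace the global barrier between loop invocations, so threads block only on actual conflicts.

```c
/* Hypothetical helpers standing in for a DOMORE-style compiler-inserted
 * runtime; all four functions are assumed, not real API names. */
void *address_touched_by(long iter);                  /* assumed */
void  domore_wait_if_conflict(void *addr, long iter); /* assumed */
void  domore_mark_done(void *addr, long iter);        /* assumed */
void  do_iteration(long iter);                        /* original loop body */

void worker(long first, long last, long stride) {
    for (long i = first; i < last; i += stride) {
        void *addr = address_touched_by(i);      /* what this iteration touches */
        domore_wait_if_conflict(addr, i);        /* sync only on real conflicts */
        do_iteration(i);
        domore_mark_done(addr, i);               /* publish completion */
    }
}
```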
From Sequential Programming to Flexible Parallel Execution [abstract] (PDF)
Arun Raman, Jae W. Lee, and David I. August
Proceedings of the International Conference on Compilers and Synthesis for Embedded Systems (CASES), October 2012.
Invited.
PASSERT: A Tool for Debugging Parallel Programs [abstract] (SpringerLink, PDF)
Daniel Schwartz-Narbonne, Feng Liu, David I. August, and Sharad Malik
Proceedings of the 24th International Conference on Computer Aided Verification (CAV), June 2012.
Parcae: A System for Flexible Parallel Execution [abstract] (PDF)
Arun Raman, Ayal Zaks, Jae W. Lee, and David I. August
Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2012.
Workload, platform, and available resources constitute a parallel program's
execution environment. Most parallelization efforts statically target an
anticipated range of environments, but performance generally degrades outside
that range. Existing approaches address this problem with dynamic tuning but do
not optimize a multiprogrammed system holistically. Further, they either
require manual programming effort or are limited to array-based data-parallel
programs.
This paper presents Parcae, a generally applicable automatic system
for platform-wide dynamic tuning. Parcae includes (i) the Nona compiler, which
creates flexible parallel programs whose tasks can be efficiently reconfigured
during execution; (ii) the Decima monitor, which measures resource availability
and system performance to detect change in the environment; and (iii) the Morta
executor, which cuts short the life of executing tasks, replacing them with
other functionally equivalent tasks better suited to the current environment.
Parallel programs made flexible by Parcae outperform original parallel
implementations in many interesting scenarios.
Speculative Separation for Privatization and Reductions [abstract] (ACM DL, PDF)
Nick P. Johnson, Hanjun Kim, Prakash Prabhu, Ayal Zaks, and David I. August
Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2012.
Runtime Asynchronous Fault Tolerance via Speculation [abstract] (PDF)
Yun Zhang, Soumyadeep Ghosh, Jialu Huang, Jae W. Lee, Scott A. Mahlke, and David I. August
Proceedings of the 2012 International Symposium on Code Generation
and Optimization (CGO), April 2012.
Transient faults are emerging as a critical reliability concern
in modern microprocessors. Redundant hardware solutions are commonly deployed
to detect transient faults, but they are less flexible and cost-effective than
software solutions. However, software solutions are rendered impractical
because of high performance overheads. To address this problem, this paper
presents Runtime Asynchronous Fault Tolerance via Speculation (RAFT), the
fastest transient fault detection technique known to date. Serving as a virtual
layer between the application and the underlying platform, RAFT automatically
generates two symmetric program instances from a program binary. It detects
transient faults in a noninvasive way, and exploits high-confidence value
speculation to achieve low runtime overhead. Evaluation on a commodity
multicore system demonstrates that RAFT delivers a geomean performance overhead
of 2.03% on a set of 23 SPEC CPU benchmarks. Compared with existing transient
fault detection techniques, RAFT exhibits the best performance and fault
coverage, without requiring any change to the hardware or the software
applications.
Dynamically Managed Data for CPU-GPU Architectures [abstract] (PDF)
Thomas B. Jablin, James A. Jablin, Prakash Prabhu, Feng Liu, and David I. August
Proceedings of the 2012 International Symposium on Code Generation
and Optimization (CGO), March 2012.
GPUs are flexible parallel processors capable of accelerating real
applications. To exploit them, programmers must ensure a consistent program
state between the CPU and GPU memories by managing data. Manually managing data
is tedious and error-prone. In prior work on automatic CPU-GPU data management,
alias analysis quality limits performance, and type-inference quality limits
applicability. This paper presents Dynamically Managed Data (DyManD), the first
automatic system to manage complex and recursive data-structures without static
analyses. By replacing static analyses with a dynamic run-time system, DyManD
overcomes the performance limitations of alias analysis and enables management
for complex and recursive data-structures. DyManD-enabled GPU parallelization
matches the performance of prior work equipped with perfectly precise alias
analysis for 27 programs and demonstrates improved applicability on programs not
previously managed automatically.
Automatic Speculative DOALL for Clusters [abstract] (PDF)
Hanjun Kim, Nick P. Johnson, Jae W. Lee, Scott A. Mahlke, and David I. August
Proceedings of the 2012 International Symposium on Code Generation
and Optimization (CGO), March 2012.
Automatic parallelization for clusters is a promising alternative to
time-consuming, error-prone manual parallelization. However, automatic
parallelization is frequently limited by the imprecision of static analysis.
Moreover, due to the inherent fragility of static analysis, small changes to
the source code can significantly undermine performance. By replacing static
analysis with speculation and profiling, automatic parallelization becomes more
robust and applicable. A naive automatic speculative parallelization does not
scale for distributed memory clusters, due to the high bandwidth required to
validate speculation. This work is the first automatic speculative DOALL
(Spec-DOALL) parallelization system for clusters. We have implemented a
prototype automatic parallelization system, called Cluster Spec-DOALL, which
consists of a Spec-DOALL parallelizing compiler and a speculative runtime for
clusters. Since the compiler optimizes communication patterns, and the runtime
is optimized for the cases in which speculation succeeds, Cluster Spec-DOALL
minimizes the communication and validation overheads of the speculative
runtime. Across 8 benchmarks, Cluster Spec-DOALL achieves a geomean speedup of
43.8x on a 120-core cluster, whereas DOALL without speculation achieves only
4.5x speedup. This demonstrates that speculation makes scalable fully-automatic
parallelization for clusters possible.
DAFT: Decoupled Acyclic Fault Tolerance [abstract] (SpringerLink, PDF)
Yun Zhang, Jae W. Lee, Nick P. Johnson, and David I. August
The International Journal of Parallel Programming (IJPP), February 2012.
Invited.
Special issue composed of "top papers" selected by the
Program Committee of the 19th International Conference on Parallel
Architectures and Compilation Techniques.
Higher transistor counts, lower voltage levels, and reduced noise
margin increase the susceptibility of multicore processors to
transient faults. Redundant hardware modules can detect such errors,
but software transient fault detection techniques are more appealing
for their low cost and flexibility. Recent software proposals double
register pressure or memory usage, or are too slow in the absence of
hardware extensions, preventing widespread acceptance. This paper
presents DAFT, a fast, safe, and memory efficient transient fault
detection framework for commodity multicore systems. DAFT replicates
computation across multiple cores and schedules fault detection off
the critical path. Where possible, values are speculated to be correct
and only communicated to the redundant thread at essential program
points. DAFT is implemented in the LLVM compiler framework and
evaluated using SPEC CPU2000 and SPEC CPU2006 benchmarks on a
commodity multicore system. Results demonstrate DAFT's high
performance and broad fault coverage. Speculation allows DAFT to
reduce the performance overhead of software redundant multithreading
from an average of 200% to 38% with no degradation of fault coverage.
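A highly simplified sketch of the idea appears below (bare-bones C, not the LLVM-based implementation; the volatile spin loops stand in for the proper atomics and memory fences a real system needs): the leading thread enqueues values only at essential program points, while a trailing thread redundantly recomputes and compares them.

```c
#include <stdio.h>
#include <stdlib.h>

enum { QCAP = 1024 };
static long queue[QCAP];
static volatile int head, tail;          /* single-producer/single-consumer */

static void produce(long v) {            /* leading thread: speculate & go on */
    while ((head + 1) % QCAP == tail) ;  /* spin if full */
    queue[head] = v;
    head = (head + 1) % QCAP;
}

static void check(long recomputed) {     /* trailing thread: verify */
    while (tail == head) ;               /* spin if empty */
    if (queue[tail] != recomputed) {
        fprintf(stderr, "transient fault detected\n");
        exit(1);                         /* a real system would recover/re-execute */
    }
    tail = (tail + 1) % QCAP;
}
```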
A General Approach for Efficiently Accelerating Software-based Dynamic Data Flow Tracking on Commodity Hardware [abstract] (PDF)
Kangkook Jee, Georgios Portokalidis, Vasileios P. Kemerlis, Soumyadeep Ghosh, David I. August, and Angelos D. Keromytis
Proceedings of the 19th Internet Society (ISOC) Symposium on Network
and Distributed Systems Security (NDSS), February 2012.
A System for Flexible Parallel Execution [abstract] (PDF)
Arun Raman
Ph.D. Thesis, Department of Electrical Engineering,
Princeton University, December 2011.
Exponential growth in transistor density combined
with diminishing returns from uniprocessor
improvements has compelled the industry to
transition to multicore architectures.
To realize the
performance potential of multicore architectures,
programs must be parallelized effectively. The
efficiency of parallel program execution depends
on the execution environment comprised of
workload, platform, and performance goal. In
writing parallel programs, most programmers and
compilers expose parallelism and optimize it to
meet a particular performance goal on a single
platform under an assumed set of workload
characteristics. In the field, changing workload
characteristics, new parallel platforms, and
deployments with different performance goals make
the programmer's or compiler's development-time or
compile-time choices suboptimal.
This dissertation presents Parcae, a generally applicable holistic system for
platform-wide dynamic parallelism tuning. Parcae includes:
1. the Nona compiler, which applies a variety of auto-parallelization techniques
to create flexible parallel programs whose tasks can be efficiently paused,
reconfigured, and resumed during execution;
2. the Decima monitor, which measures resource availability and system
performance to detect change in the environment; and
3. the Morta executor, which cuts short the life of executing tasks, replacing
them with other functionally equivalent tasks better suited to the current
environment.
Parallel programs made flexible by Parcae outperform original parallel
implementations in a variety of interesting scenarios.
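A rough sketch of how such flexibility can be exposed (hypothetical C with pthreads; Parcae's actual flexible tasks are generated by Nona and managed by Decima and Morta): a worker loop consults a degree-of-parallelism target at iteration boundaries, pausing and resuming without losing work.

    /* parcae_sketch.c: illustrative "flexible task" that can be paused and resumed
     * at iteration boundaries. Hypothetical; not Parcae's generated code.
     * Build: cc -O2 -pthread parcae_sketch.c -o parcae_sketch */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <unistd.h>

    static atomic_int dop = 2;               /* degree of parallelism chosen by a monitor */
    static atomic_long next_iter = 0;        /* shared counter handing out iterations */
    enum { TOTAL_ITERS = 1000000 };

    static void *worker(void *arg) {
        int id = (int)(long)arg;
        long executed = 0;
        for (;;) {
            if (atomic_load(&next_iter) >= TOTAL_ITERS) break;       /* all work claimed */
            if (id >= atomic_load(&dop)) { usleep(1000); continue; } /* "paused" worker  */
            long i = atomic_fetch_add(&next_iter, 1);
            if (i >= TOTAL_ITERS) break;
            /* body of the parallelized loop iteration would run here */
            executed++;
        }
        printf("worker %d executed %ld iterations\n", id, executed);
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (long i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, (void *)i);
        usleep(10000);
        atomic_store(&dop, 4);   /* monitor sees idle cores: grow the parallel region  */
        usleep(10000);
        atomic_store(&dop, 1);   /* monitor sees contention: shrink it back down       */
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        return 0;
    }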
Automatic Extraction of Parallelism from Sequential Code
David I. August, Jialu Huang, Thomas B. Jablin, Hanjun Kim, Thomas R. Mason, Prakash Prabhu, Arun Raman, and Yun Zhang
Fundamentals of Multicore Software Development (ISBN: 978-1439812730)
Edited by Ali-Reza Adl-Tabatabai, Victor Pankratius, and Walter Tichy.
Chapman & Hall / CRC Press, December 2011.
EPIC Processors
Arun Raman and David I. August
Encyclopedia of Parallel Computing (ISBN: 978-0387098449)
Edited by David Padua.
Springer, November 2011.
A Survey of the Practice of Computational Science [abstract] (ACM DL, PDF)
Prakash Prabhu, Thomas B. Jablin, Arun Raman, Yun Zhang, Jialu Huang, Hanjun Kim, Nick P. Johnson, Feng Liu, Soumyadeep Ghosh, Stephen Beard, Taewook Oh, Matthew Zoufaly, David Walker, and David I. August
Proceedings of the 24th ACM/IEEE Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2011.
The SPARCHS Project: Hardware Support for Software Security [abstract] (PDF)
Simha Sethumadhavan, Salvatore J. Stolfo, Angelos D. Keromytis, Junfeng Yang, and David I. August
Proceedings of the First SysSec Workshop (SYSSEC), July 2011.
This paper describes the research philosophy and
motivations of the SPARCHS project. The goal of the SPARCHS project is
to improve security by implementing security mechanisms in
hardware.
Parallel Assertions for Debugging Parallel Programs [abstract] (IEEE Xplore, PDF)
Daniel Schwartz-Narbonne, Feng Liu, Tarun Pondicherry, David I. August, and Sharad Malik
2011 9th IEEE/ACM International Conference on Formal Methods and Models for Codesign (MEMOCODE), July 2011.
Commutative Set: A Language Extension for Implicit Parallel Programming [abstract] (ACM DL, PDF)
Prakash Prabhu, Soumyadeep Ghosh, Yun Zhang, Nick P. Johnson, and David I. August
Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2011.
Automatic CPU-GPU Communication Management and Optimization [abstract] (ACM DL, PDF)
Thomas B. Jablin, Prakash Prabhu, James A. Jablin, Nick P. Johnson, Stephen R. Beard, and David I. August
Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2011.
Parallelism Orchestration using DoPE: the Degree of
Parallelism Executive [abstract] (ACM DL, PDF)
Arun Raman, Hanjun Kim, Taewook Oh, Jae W. Lee, and David I. August
Proceedings of the 32nd ACM SIGPLAN Conference on
Programming Language Design and Implementation (PLDI), June 2011.
Scalable Speculative Parallelization on Commodity Clusters [abstract] (ACM DL, PDF)
Hanjun Kim, Arun Raman, Feng Liu, Jae W. Lee, and David I. August
Proceedings of the 43rd IEEE/ACM International Symposium on
Microarchitecture (MICRO), December 2010.
Highest-ranked paper in the double-blind review process.
While clusters of commodity servers and switches are the most
popular form of large-scale parallel computers, many programs are
not easily parallelized for execution upon them. In particular,
high inter-node communication cost and lack of globally shared
memory appear to make clusters suitable only for server applications
with abundant task-level parallelism and scientific applications
with regular and independent units of work. Clever use of pipeline
parallelism (DSWP), thread-level speculation (TLS), and speculative
pipeline parallelism (Spec-DSWP) can mitigate the costs of
inter-thread communication on shared memory multicore machines.
This paper presents Distributed Software Multi-threaded
Transactional memory (DSMTX), a runtime system which makes these
techniques applicable to non-shared memory clusters, allowing them
to efficiently address inter-node communication costs. Initial
results suggest that DSMTX enables efficient cluster execution of a
wider set of application types. For 11 sequential C programs
parallelized for a 4-core 32-node (128 total core) cluster without
shared memory, DSMTX achieves a geomean speedup of 49x. This
compares favorably to the 15x speedup achieved by our
implementation of TLS-only support for clusters.
DAFT: Decoupled Acyclic Fault Tolerance [abstract] (ACM DL, PDF)
Yun Zhang, Jae W. Lee, Nick P. Johnson, and David I. August
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2010.
Selected by the Program Committee as a "top paper" for inclusion
in a special issue of the International Journal of Parallel
Programming (IJPP).
Higher transistor counts, lower voltage levels, and reduced noise
margin increase the susceptibility of multicore processors to
transient faults. Redundant hardware modules can detect such errors,
but software transient fault detection techniques are more appealing
for their low cost and flexibility. Recent software proposals double
register pressure or memory usage, or are too slow in the absence of
hardware extensions, preventing widespread acceptance. This paper
presents DAFT, a fast, safe, and memory efficient transient fault
detection framework for commodity multicore systems. DAFT replicates
computation across multiple cores and schedules fault detection off
the critical path. Where possible, values are speculated to be correct
and only communicated to the redundant thread at essential program
points. DAFT is implemented in the LLVM compiler framework and
evaluated using SPEC CPU2000 and SPEC CPU2006 benchmarks on a
commodity multicore system. Results demonstrate DAFT's high
performance and broad fault coverage. Speculation allows DAFT to
reduce the performance overhead of software redundant multithreading
from an average of 200% to 38% with no degradation of fault
coverage.
Structural Simulation for Architecture Exploration
David I. August, Veerle Desmet, Silvain Girbal, Daniel Gracia Pérez, and Olivier Temam
Processor and System-on-Chip Simulation (ISBN: 978-1441961747)
Edited by Rainer Leupers and Olivier Temam.
Springer, September 2010.
Programming Multicores: Do Applications Programmers Need to Write Explicitly Parallel Programs? [abstract] (IEEE Xplore, PDF)
Arvind, David I. August, Keshav Pingali, Derek Chiou, Resit Sendag, and Joshua J. Yi
IEEE Micro, Volume 30, Number 3, May 2010.
In this panel discussion from the 2009 Workshop On Computer
Architecture Research Directions, David August and Keshav Pingali
debate whether explicitly parallel programming is a necessary evil for
applications programmers, assess the current state of parallel
programming models, and discuss possible routes toward finding the
programming model for the multicore era.
Decoupled Software Pipelining Creates Parallelization Opportunities [abstract] (ACM DL, PDF)
Jialu Huang, Arun Raman, Yun Zhang, Thomas B. Jablin, Tzu-Han Hung, and David I. August
Proceedings of the 2010 International Symposium on Code Generation and Optimization (CGO), April 2010.
Decoupled Software Pipelining (DSWP) is one approach to
automatically extract threads from loops. It partitions loops into long-running
threads that communicate in a pipelined manner via inter-core queues. This work
recognizes that DSWP can also be an enabling transformation for other loop
parallelization techniques. This use of DSWP, called DSWP+, splits a loop into
new loops with dependence patterns amenable to parallelization using techniques
that were originally either inapplicable or poorly-performing. By parallelizing
each stage of the DSWP+ pipeline using (potentially) different techniques, not
only is the benefit of DSWP increased, but the applicability and performance of
other parallelization techniques are enhanced. This paper evaluates DSWP+ as an
enabling framework for other transformations by applying it in conjunction with
DOALL, LOCALWRITE, and SpecDOALL to individual stages of the pipeline. This paper
demonstrates significant performance gains on a commodity 8-core multicore machine
running a variety of codes transformed with DSWP+.
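The enabling transformation can be pictured with a small sketch (hypothetical C with pthreads; the paper's compiler performs this automatically on the IR): the loop below carries its recurrence only through the list traversal, so a first stage walks the list and feeds node pointers into a queue, while the second stage, now free of the recurrence, is replicated DOALL-style across several consumer threads.

    /* dswp_plus_sketch.c: a loop split into a sequential traversal stage and a
     * replicated (DOALL-style) work stage. Illustrative only.
     * Build: cc -O2 -pthread dswp_plus_sketch.c -o dswp_plus_sketch */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct node { long value; struct node *next; };

    #define NCONSUMERS 3
    #define QCAP 256
    static struct node *q[QCAP];
    static int qhead, qtail, qcount, done;
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;

    static void put(struct node *n) {            /* stage 1 -> stage 2 queue */
        pthread_mutex_lock(&m);
        while (qcount == QCAP) pthread_cond_wait(&not_full, &m);
        q[qtail] = n; qtail = (qtail + 1) % QCAP; qcount++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&m);
    }

    static struct node *get(void) {              /* returns NULL when the pipeline drains */
        pthread_mutex_lock(&m);
        while (qcount == 0 && !done) pthread_cond_wait(&not_empty, &m);
        struct node *n = NULL;
        if (qcount > 0) { n = q[qhead]; qhead = (qhead + 1) % QCAP; qcount--; }
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&m);
        return n;
    }

    static void *stage2(void *arg) {             /* replicated "parallel stage" */
        long local = 0;
        struct node *n;
        while ((n = get()) != NULL)
            local += n->value * n->value;        /* independent work per node */
        *(long *)arg = local;
        return NULL;
    }

    int main(void) {
        struct node *list = NULL;                /* build a small list */
        for (long i = 0; i < 10000; i++) {
            struct node *n = malloc(sizeof *n);
            n->value = i; n->next = list; list = n;
        }
        pthread_t t[NCONSUMERS]; long partial[NCONSUMERS];
        for (int i = 0; i < NCONSUMERS; i++)
            pthread_create(&t[i], NULL, stage2, &partial[i]);

        /* stage 1: the pointer-chasing recurrence stays sequential */
        for (struct node *n = list; n != NULL; n = n->next)
            put(n);
        pthread_mutex_lock(&m); done = 1;
        pthread_cond_broadcast(&not_empty); pthread_mutex_unlock(&m);

        long sum = 0;
        for (int i = 0; i < NCONSUMERS; i++) { pthread_join(t[i], NULL); sum += partial[i]; }
        printf("sum of squares = %ld\n", sum);
        return 0;
    }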
Liberty Queues for EPIC Architectures [abstract] (PDF)
Thomas B. Jablin, Yun Zhang, James A. Jablin, Jialu Huang, Hanjun Kim, and David I. August
Proceedings of the Eighth Workshop on Explicitly Parallel
Instruction Computer Architectures and Compiler Technology (EPIC), April 2010.
Core-to-core communication bandwidth is critical for
high-performance pipeline-parallel programs. Hardware communication queues are
unlikely to be implemented and are perhaps unnecessary. This paper presents
Liberty Queues, a high-performance lock-free software-only ring buffer, and
describes the porting effort from the original x86-64 implementation to
IA-64. Liberty Queues achieve a bandwidth of 500 MB/s between unrelated
processors on a first-generation Itanium 2, compared with the 281 MB/s on modern
Opterons and 430 MB/s on modern Xeons reported by related work. Bandwidth
results are presented for seven different multicore and multiprocessor systems,
as well as a sensitivity analysis.
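The core data structure is in the spirit of the single-producer/single-consumer ring buffer below (an illustrative C11 sketch; the actual Liberty Queues implementation adds further batching of index updates and careful data placement to minimize cache-line traffic between cores):

    /* spsc_queue.c: lock-free single-producer/single-consumer ring buffer,
     * illustrating the software-only queue idea. Not the tuned Liberty Queues code. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define CAP 1024                            /* power of two */

    struct spsc {
        _Alignas(64) atomic_size_t head;        /* written by the producer only */
        _Alignas(64) atomic_size_t tail;        /* written by the consumer only */
        _Alignas(64) size_t cached_tail;        /* producer's stale copy of tail */
        _Alignas(64) size_t cached_head;        /* consumer's stale copy of head */
        _Alignas(64) uint64_t buf[CAP];
    };

    bool spsc_push(struct spsc *q, uint64_t v) {
        size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
        if (h - q->cached_tail == CAP) {        /* re-read the real tail only when it looks full */
            q->cached_tail = atomic_load_explicit(&q->tail, memory_order_acquire);
            if (h - q->cached_tail == CAP) return false;   /* truly full */
        }
        q->buf[h & (CAP - 1)] = v;
        atomic_store_explicit(&q->head, h + 1, memory_order_release);
        return true;
    }

    bool spsc_pop(struct spsc *q, uint64_t *out) {
        size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
        if (t == q->cached_head) {              /* re-read the real head only when it looks empty */
            q->cached_head = atomic_load_explicit(&q->head, memory_order_acquire);
            if (t == q->cached_head) return false;         /* truly empty */
        }
        *out = q->buf[t & (CAP - 1)];
        atomic_store_explicit(&q->tail, t + 1, memory_order_release);
        return true;
    }

A producer thread calls spsc_push and a consumer thread calls spsc_pop on a zero-initialized queue; the cached index copies and cache-line alignment keep the common case from bouncing the other core's cache line, which is where most of the bandwidth of a software-only queue comes from.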
Speculative Parallelization Using Software Multi-threaded Transactions [abstract] (ACM DL, PDF)
Arun Raman, Hanjun Kim, Thomas R. Mason, Thomas B. Jablin, and David I. August
Proceedings of the Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2010.
Compilation Strategies and Challenges for Multicore Signal Processing [abstract] (IEEE Xplore, PDF)
Mojtaba Mehrara, Thomas B. Jablin, Dan Upton, David I. August, Kim Hazelwood, and Scott Mahlke
IEEE Signal Processing Magazine, November 2009.
To overcome challenges stemming from power density and hot spots on the chip,
multicore computing platforms have emerged as the ubiquitous computing platform
for servers down through embedded systems. Unfortunately, providing multiple
cores does not directly translate into increased performance for most
applications. The burden is placed on software developers and tools to find and
exploit coarse-grain parallelism in order to effectively make use of the
abundance of computing resources provided by these systems. Concurrent
applications are much more complex to develop than their single-threaded
ancestors, thus software development tools will be critical to help programmers
create both high-performance and correct software. This article provides an
overview of multicore parallelism and compiler technology for the signal processing
community.
LAMPVIEW: A Loop-Aware Toolset for
Facilitating Parallelization [abstract] (PDF)
Thomas Rorie Mason
Master's Thesis, Department of Electrical Engineering,
Princeton University, August 2009.
A continual growth of the number of transistors per unit area
coupled with diminishing returns from traditional microarchitectural
and clock frequency improvements has led processor manufacturers to
place multiple cores on a single chip. However, only multi-threaded
code can fully take advantage of the new multicore processors;
legacy single-threaded code does not benefit. Many approaches to
parallelization have been explored, including both manual and
automatic techniques.
Unfortunately, research in this area is impeded by the innate
difficulty of exploring code by hand for new possible
parallelization schemes. Regardless of whether it is a researcher
attempting to discover possible automatic techniques or a programmer
trying to make manual parallelization, the benefits of good
dependence information are substantial. This thesis provides a
profiling and analysis toolset aimed at easing a programmer's or
researcher's effort in finding parallelism. The toolset, the
Loop-Aware Memory Profile Viewing System (LAMPView), is developed in
three parts.
The first part is a multi-frontend, multi-target compiler pass
written to instrument the code with calls to the Loop-Aware Memory
Profiling (LAMP) library. The compile-time instrumentation was
partially developed previously and has been augmented here with
additional features. The second part is a post-runtime processing
pass that translates the output of the profiling run from a
machine-level view to a source-level view. As it translates, it also
processes and sorts dependence information. The third and final part
is a pair of stand-alone utilities that takes the translated
information and provides the user with human-readable output that is
searchable by various parameters. Discussions with potential users
indicate that the toolset eases the analysis of parallelism
opportunities.
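Conceptually, the first part rewrites the program so that memory operations and loop boundaries report to the profiling runtime, roughly as below. The LAMP_* names are illustrative placeholders, not the toolset's real API, and are stubbed out here only so the sketch is self-contained.

    /* lamp_sketch.c: conceptual view of loop-aware memory-profiling instrumentation. */
    #include <stddef.h>
    #include <stdio.h>

    static unsigned long n_loads, n_iters;
    static void LAMP_loop_enter(int loop_id)                     { (void)loop_id; }
    static void LAMP_loop_iter(int loop_id)                      { (void)loop_id; n_iters++; }
    static void LAMP_loop_exit(int loop_id)                      { (void)loop_id; }
    static void LAMP_load(int instr_id, const void *a, size_t s) { (void)instr_id; (void)a; (void)s; n_loads++; }

    /* the compile-time pass rewrites this loop so every memory access and every
     * loop boundary reports its static id; the post-runtime pass later maps those
     * ids back to source lines and sorts the observed dependences */
    static long sum(const long *a, int n) {
        long s = 0;
        LAMP_loop_enter(1);
        for (int i = 0; i < n; i++) {
            LAMP_loop_iter(1);
            LAMP_load(10, &a[i], sizeof a[i]);
            s += a[i];
        }
        LAMP_loop_exit(1);
        return s;
    }

    int main(void) {
        long a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("sum=%ld loads=%lu iters=%lu\n", sum(a, 8), n_loads, n_iters);
        return 0;
    }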
Parallelization Techniques with Improved Dependence Handling [abstract] (PDF)
Easwaran Raman
Ph.D. Thesis, Department of Computer Science,
Princeton University, June 2009.
Continuing exponential growth in transistor density and diminishing
returns from the increasing transistor count have forced processor
manufacturers to pack multiple processor cores onto a single chip.
These processors, known as multi-core processors, generally do not
improve the performance of single-threaded applications. Automatic
parallelization has a key role to play in improving the performance of
legacy and newly written single-threaded applications in this new
multi-threaded era. Automatic parallelization transforms single-threaded code into
semantically equivalent multi-threaded code by preserving the
dependences of the original code. This dissertation proposes two new
automatic parallelization techniques that differ from related existing
techniques in their handling of dependences. This difference in
dependence handling enables the proposed techniques to outperform
related techniques. The first technique is known as parallel-stage decoupled software
pipelining (PS-DSWP). PS-DSWP extends pipelined parallelization
techniques like DSWP by allowing certain pipelined stages to be
executed by multiple threads. Such a parallel execution of pipelined
stages requires distinguishing inter-iteration dependences
of the loop being parallelized from the rest of the dependences.
The applicability and effectiveness of PS-DSWP is
further enhanced by applying speculation to remove some dependences. The second technique, known as speculative iteration chunk execution
(Spice), uses value speculation to ignore inter-iteration dependences,
enabling speculative execution of chunks of iterations in parallel.
Unlike other value-speculation based parallelization techniques, Spice
speculates only a few dynamic instances of those inter-iteration
dependences. Both these techniques are implemented in the VELOCITY compiler and are
evaluated using a multi-core Itanium 2 simulator. PS-DSWP results in
a geometric mean loop speedup of 2.13 over single-threaded
performance with five threads on a set of loops from five benchmarks.
The use of speculation improves the performance of PS-DSWP resulting
in a geometric mean loop speedup of 3.67 over single-threaded
performance on a set of loops from six benchmarks. Spice shows a
geometric mean loop speedup of 2.01 on a set of loops from four
benchmarks. Based on the above experimental results and qualitative
and quantitative comparisons with related techniques, this
dissertation demonstrates the effectiveness of the proposed
techniques.
Speculation
Neil Vachharajani and David I. August
Encyclopedia of Computer Science and Engineering (ISBN: 978-0471383932)
Edited by Benjamin W. Wah.
John Wiley & Sons, Inc., January 2009.
Intelligent Speculation for Pipelined Multithreading [abstract] (PDF)
Neil Amar Vachharajani
Ph.D. Thesis, Department of Computer Science,
Princeton University, November 2008.
In recent years, microprocessor manufacturers have shifted their focus
from single-core to multi-core processors. Since many of today's
applications are single-threaded and since it is likely that many of
tomorrow's applications will have far fewer threads than there
will be processor cores, automatic thread extraction is an essential
tool for effectively leveraging today's multi-core and
tomorrow's many-core processors. A recently proposed technique,
Decoupled Software Pipelining (DSWP), has demonstrated promise by
partitioning loops into long-running threads organized into a
pipeline. Using a pipeline organization and execution decoupled by
inter-core communication queues, DSWP offers increased execution
efficiency that is largely independent of inter-core communication
latency and variability in intra-thread performance. This dissertation extends the pipelined parallelism paradigm with
speculation. Using speculation, dependences that manifest infrequently
or are easily predictable can be safely ignored by the compiler
allowing it to carve more, and better balanced, thread-based pipeline
stages from a single thread of execution. Prior speculative threading
proposals were obligated to speculate most, if not all, loop-carried
dependences to squeeze the code segment under consideration into the
mold required by the parallelization paradigm. Unlike those
techniques, this dissertation demonstrates that speculation need only
break the longest few dependence cycles to enhance the applicability
and scalability of the pipelined multi-threading paradigm. By
speculatively breaking these cycles, instructions that were formerly
restricted to a single thread to ensure decoupling are now free to
span multiple threads. To demonstrate the effectiveness of speculative
pipelined multi-threading, this dissertation presents the design and
experimental evaluation of our fully automatic compiler
transformation, Speculative Decoupled Software Pipelining, a
speculative extension to DSWP. This dissertation additionally introduces multi-threaded transactional
memories to support speculative pipelined multi-threading. Similar to
past speculative parallelization approaches, speculative pipelined
multi-threading relies on runtime-system support to buffer speculative
modifications to memory. However, this dissertation demonstrates that
existing proposals to buffer speculative memory state, transactional
memories, are insufficient for speculative pipelined multi-threading
because the speculative buffers are restricted to a single
thread. Further, this dissertation demonstrates that this limitation
leads to modularity and composability problems even for transactional
programming, thus limiting the potential of that approach also. To
surmount these limitations, this thesis introduces multi-threaded
transactional memories and presents an initial hardware
implementation.
The VELOCITY Compiler: Extracting Efficient Multicore Execution from Legacy Sequential Programs [abstract] (PDF)
Matthew John Bridges
Ph.D. Thesis, Department of Computer Science,
Princeton University, November 2008.
Multiprocessor systems, particularly chip multiprocessors, have
emerged as the predominant organization for future microprocessors.
Systems with 4, 8, and 16 cores are already shipping and a future with
32 or more cores is easily conceivable. Unfortunately, multiple cores
do not always directly improve application performance, particularly
for a single legacy application. Consequently, parallelizing
applications to execute on multiple cores is essential.
Parallel programming models and languages could be used to create
multi-threaded applications. However, moving to a parallel programming
model only increases the complexity and cost involved in software
development. Many automatic thread extraction techniques have been
explored to address these costs.
Unfortunately, the amount of parallelism that has been automatically
extracted using these techniques is generally insufficient to keep
many cores busy. Though there are many reasons for this, the main
problem is that extensions are needed to take full advantage of these
techniques. For example, many important loops are not parallelized
because the compiler lacks the necessary scope to apply the
optimization. Additionally, the sequential programming model forces
programmers to define a single legal application outcome, rather than
allowing for a range of legal outcomes, leading to conservative
dependences that prevent parallelization.
This dissertation integrates the necessary parallelization
techniques, extending them where necessary, to enable automatic
thread extraction. In particular, this includes an expanded
optimization scope, which facilitates the optimization of large loops,
leading to parallelism at higher levels in the
application. Additionally, this dissertation shows that many unnecessary
dependences can be broken with the help of the programmer using
natural, simple extensions to the sequential programming model.
Through a case study of several applications, including several C
benchmarks in the SPEC CINT 2000 suite, this dissertation shows how
scalable parallelism can be extracted. After changing only 38 source
code lines out of 100,000, several of the applications became
parallelizable by automatic thread extraction techniques, yielding a
speedup of 3.64x on 32 cores.
Global Instruction Scheduling for Multi-Threaded Architectures [abstract] (PDF)
Guilherme de Lima Ottoni
Ph.D. Thesis, Department of Computer Science,
Princeton University, September 2008.
Recently, the microprocessor industry has moved toward multi-core or
chip multiprocessor (CMP) designs as a means of utilizing the
increasing transistor counts in the face of physical and
micro-architectural limitations. Despite this move, CMPs do not
directly improve the performance of single-threaded codes, a
characteristic of most applications. In effect, the move to CMPs has
shifted even more the task of improving performance from the hardware
to the software.
Since developing parallel applications has long been recognized as
significantly harder than developing sequential ones, it is very
desirable to have automatic tools to extract thread-level parallelism
(TLP) from sequential applications. Unfortunately, automatic
parallelization has only been successful in the restricted domains of
scientific and data-parallel applications, which usually have regular
array-based memory accesses and little control flow. In order to
support parallelization of general-purpose applications, computer
architects have proposed CMPs with light-weight, fine-grained (scalar)
communication mechanisms. Despite such support, most existing
multi-threading compilation techniques have generally demonstrated
little effectiveness in extracting parallelism from non-scientific
applications. The main reason for this is that such techniques are
mostly restricted to extracting parallelism within local,
straight-line regions of code. As observed in several limit studies,
local regions of code contain limited parallelism, and control
dependence analysis and multiple control units are necessary to
extract most of the parallelism opportunities.
This thesis describes a general compiler framework for Global
Multi-Threaded (GMT) instruction scheduling, i.e. to simultaneously
schedule instructions from a global region of code to extract TLP for
multi-threaded architectures. Our compiler framework is based on a
Program Dependence Graph (PDG) representation, efficient graph
partitioning algorithms, and novel multi-threaded code generation
algorithms. Central to this framework are our multi-threaded code
generation algorithms, which produce efficient code for arbitrary
partitions of the PDG into threads. Based on this framework, three
thread-partitioning strategies for GMT instruction scheduling are
proposed. The first one, called GREMIO, extends list-scheduling to
target multi-threaded architectures and to operate on global code
regions. The second technique, called Decoupled Software Pipelining
(DSWP), extracts pipelined TLP from arbitrary loop nests. We also
propose Parallel-Stage DSWP, an extension of DSWP that allows multiple
threads to concurrently execute the same stage of the pipeline. These
techniques are implemented in the VELOCITY compiler and evaluated on
an accurate CMP simulator built on top of validated Itanium 2 core
models. The experiments show that our techniques balance
applicability and scalability differently, with each technique
resulting in the best speedup in different scenarios. Overall, the
results demonstrate the effectiveness of the proposed compilation
techniques, with significant speedups on a number of real benchmark
applications written in C.
Performance Scalability of Decoupled Software Pipelining [abstract] (ACM DL, PDF)
Ram Rangan, Neil Vachharajani, Guilherme Ottoni, and David I. August
ACM Transactions on Architecture and Code Optimization (TACO), Volume 5, Number 2, August 2008.
Any successful solution to using multi-core processors to scale
general-purpose program performance will have to contend with rising
inter-core communication costs while exposing coarse-grained
parallelism. Recently proposed pipelined multithreading (PMT)
techniques have been demonstrated to have general-purpose
applicability and are also able to effectively tolerate inter-core
latencies through pipelined inter-thread communication. These
desirable properties make PMT techniques strong candidates for program
parallelization on current and future multi-core processors and
understanding their performance characteristics is critical to their
deployment. To that end, this paper evaluates the performance scalability of a
general-purpose PMT technique called decoupled software pipelining
(DSWP) and presents a thorough analysis of the communication
bottlenecks that must be overcome for optimal DSWP scalability.
Shape Analysis with Inductive Recursion Synthesis [abstract] (PDF)
Bolei Guo
Ph.D. Thesis, Department of Computer Science,
Princeton University, June 2008.
For program optimization and verification purposes, shape analysis
can be used to statically determine structural properties of the
runtime heap. One promising formalism for describing the heap is
separation logic, with recursively defined predicates that allow
for concise yet precise summarization of linked data structures.
A major challenge in this approach is the derivation of the
predicates directly from the program being analyzed. As a result,
current uses of separation logic rely heavily on predefined
predicates, limiting the class of programs analyzable to those
that manipulate only predefined data types. This thesis
addresses this problem by proposing a general algorithm based on
inductive program synthesis that automatically infers
recursive predicates in a canonical form. This technique yields a
separation logic based shape analysis that can be effective on
a much wider range of programs.
A key strength of separation logic is that it facilitates, via
explicit expression of structural separation, local reasoning
about the heap, where the effects of altering one part of a data structure
are analyzed in isolation from the rest. The interaction between
local reasoning and the global invariants given by recursive
predicates is a difficult area, especially in the presence of complex
internal sharing in the data structures. Existing approaches, using
logic rules specifically designed for the list predicate to unfold
and fold linked lists, again require a priori knowledge about the
data structures and do not easily generalize to more complex data
structures. We introduce a notion of "truncation points" in a
recursive predicate, which gives rise to generic algorithms for
unfolding and folding arbitrary data structures.
We present a fully implemented interprocedural analysis algorithm
that handles recursive procedures. A combination of pointer
analysis and program slicing is used to deal with the scalability
issue typically faced by shape analyses.
Finally, we present a data dependence test for recursive data
structures that takes advantage of the results of our shape analysis.
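For concreteness, the kind of recursive predicate such a synthesis is expected to infer for a singly linked list can be written in separation logic as follows (standard textbook form, which may differ from the thesis's canonical form); here x ↦ (v, t) describes a single heap cell holding a value and a next pointer, and * is the separating conjunction:

    \mathit{list}(x) \;\triangleq\;
        \bigl(x = \mathrm{nil} \,\wedge\, \mathrm{emp}\bigr)
        \;\vee\;
        \bigl(\exists v,\, t.\; x \mapsto (v, t) \,*\, \mathit{list}(t)\bigr)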
Software Modulated Fault Tolerance [abstract] (PDF)
George A. Reis
Ph.D. Thesis, Department of Electrical Engineering,
Princeton University, June 2008.
In recent decades, microprocessor performance has been increasing
exponentially, due in large part to smaller and faster transistors
enabled by improved fabrication technology. While such transistors
yield performance enhancements, their smaller size and sheer number
make chips more susceptible to transient faults. Designers frequently
introduce redundant hardware or software to detect
these faults because process and material advances are often
insufficient to mitigate their effect.
Regardless of the methods used for adding reliability, these
techniques incur significant performance penalties because they
uniformly protect the entire application. They do not consider
the varying resilience to transient faults of different
program regions. This uniform protection leads to wasted resources that
reduce performance and/or increase cost.
To maximize fault coverage while minimizing the performance impact,
this dissertation takes advantage of the key insights that not all faults in
an unprotected application will cause an incorrect answer and not all
parts of the program respond the same way to reliability techniques.
First, this dissertation demonstrates the varying vulnerability and
performance responses of an application and identifies regions
which are most susceptible to faults as well as those which are
inexpensive to protect. Second, this dissertation advocates the use of
software and hybrid approaches to fault tolerance to enable the
synergy of high-level information with specific redundancy techniques.
Third, this dissertation demonstrates how to exploit this non-uniformity via
Software Modulated Fault Tolerance. Software Modulated Fault Tolerance leverages reliability
and performance information at a high level and directs the
reliability choices at fine granularities to provide the most
efficient use of processor resources for an application.
Spice: Speculative Parallel Iteration Chunk Execution [abstract] (ACM DL, PDF)
Easwaran Raman, Neil Vachharajani, Ram Rangan, and David I. August
Proceedings of the 2008 International Symposium on Code Generation and Optimization (CGO), April 2008.
Parallel-Stage Decoupled Software Pipelining [abstract] (ACM DL, PDF)
Easwaran Raman, Guilherme Ottoni, Arun Raman, Matthew Bridges, and David I. August
Proceedings of the 2008 International Symposium on Code Generation and Optimization (CGO), April 2008.
Communication Optimizations for Global Multi-Threaded Instruction Scheduling [abstract] (ACM DL, PDF)
Guilherme Ottoni and David I. August
Proceedings of the 13th ACM International Conference on
Architectural Support for Programming Languages and Operating
Systems (ASPLOS), March 2008.
The recent shift in the industry towards chip
multiprocessor (CMP) designs has brought the need for multi-threaded
applications to mainstream computing. Given the difficulty in finding
coarse-grained parallelism in general-purpose, sequential
applications, computer architects have proposed CMPs with
light-weight, fine-grained (scalar) communication
mechanisms.
As observed in several limit studies, most of the parallelization
opportunities require looking for parallelism beyond local regions of
code. To exploit these opportunities, especially for sequential
applications, researchers have recently proposed global
multi-threaded instruction scheduling techniques, including Decoupled
Software Pipelining (DSWP) and
GREMIO. These techniques simultaneously
schedule instructions from large regions of code, such as arbitrary
loop nests or whole procedures, and have been shown effective at
extracting threads for many applications. A key enabler of these
global instruction scheduling techniques is the Multi-Threaded
Code Generation (MTCG) algorithm proposed in [Ottoni et al, '05],
which generates multi-threaded code for any partition of the
instructions into threads. This algorithm inserts communication and
synchronization instructions in order to satisfy all inter-thread
dependences.
In this paper, we present a general compiler framework, COCO, to
optimize the communication and synchronization instructions inserted
by the MTCG algorithm. This framework, based on thread-aware data-flow
analyses and graph min-cut algorithms, appropriately models and
optimizes all kinds of inter-thread dependences, including register,
memory, and control dependences. Our experiments, using a fully
automatic compiler implementation of these techniques, demonstrate
significant reductions (about 30% on average) in the number of
dynamic communication instructions in code parallelized with DSWP and
GREMIO. This reduction in communication translates to performance
gains of up to 40%.
Revisiting the Sequential Programming Model for the Multicore Era [abstract] (Original Full Paper, IEEE Xplore, PDF)
Matthew J. Bridges, Neil Vachharajani, Yun Zhang, Thomas B. Jablin, and David I. August
IEEE Micro, Volume 28, Number 1, January 2008.
IEEE Micro's "Top Picks" special issue for papers "most relevant to industry and significant in contribution to the field of computer architecture" in 2007.
Single-threaded programming is already considered a complicated task.
The move to multi-threaded programming only increases the complexity
and cost involved in software development due to rewriting legacy
code, training of the programmer, increased debugging of the program,
and efforts to avoid race conditions, deadlocks, and other problems
associated with parallel programming. To address these costs, other
approaches, such as automatic thread extraction, have been explored.
Unfortunately, the amount of parallelism that has been automatically
extracted is generally insufficient to keep many cores busy. This paper argues that this lack of parallelism is not an intrinsic
limitation of the sequential programming model, but rather occurs for
two reasons. First, there exists no framework for automatic thread
extraction that brings together key existing state-of-the-art compiler
and hardware techniques. This paper shows that such a framework can
yield scalable parallelization on several SPEC CINT2000 benchmarks.
Second, existing sequential programming languages force programmers to
define a single legal program outcome, rather than allowing for a
range of legal outcomes. This paper shows that natural, simple
extensions to the sequential programming model enable parallelization
for the remainder of the SPEC CINT2000 suite. Our experience
demonstrates that, by changing only 60 source code lines, all of the C
benchmarks in the SPEC CINT2000 suite were parallelizable by automatic
thread extraction. This process, constrained by the limits of modern
optimizing compilers, yielded a speedup of 454% on these
applications.
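The flavor of annotation the paper argues for can be sketched as follows (the annotation syntax and names here are hypothetical, not the paper's exact notation): by declaring that calls to a stateful function such as a random-number generator are interchangeable, the programmer widens the set of legal program outcomes and removes the loop-carried dependence that would otherwise serialize the loop.

    /* commutative_sketch.c: illustrative programmer annotation relaxing ordering. */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t rng_state = 88172645463325252ULL;

    /* [Commutative] -- hypothetical annotation: any order of calls is legal */
    static uint64_t rng_next(void) {        /* xorshift PRNG with internal state */
        rng_state ^= rng_state << 13;
        rng_state ^= rng_state >> 7;
        rng_state ^= rng_state << 17;
        return rng_state;
    }

    int main(void) {
        long hits = 0, samples = 1000000;
        for (long i = 0; i < samples; i++) {
            /* without the annotation, the dependence through rng_state serializes
             * this loop; with it, iterations may run in parallel even though
             * different interleavings yield different (equally legal) estimates */
            double x = (double)(rng_next() % 1000000) / 1000000.0;
            double y = (double)(rng_next() % 1000000) / 1000000.0;
            if (x * x + y * y <= 1.0) hits++;
        }
        printf("pi ~= %f\n", 4.0 * (double)hits / (double)samples);
        return 0;
    }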
Optimizations for the Memory Hierarchy
Easwaran Raman and David I. August
Compiler Design Handbook (ISBN: 978-1420043822)
Edited by Y. N. Srikant and Priti Shankar.
CRC Press, December 2007.
Revisiting the Sequential Programming Model for Multi-Core [abstract] (IEEE Xplore, PDF, Top Picks Version)
Matthew J. Bridges, Neil Vachharajani, Yun Zhang, Thomas B. Jablin, and David I. August
Proceedings of the 40th IEEE/ACM International Symposium on
Microarchitecture (MICRO), December 2007.
Selected for IEEE Micro's "Top Picks" special issue for papers "most relevant to industry and significant in contribution to the field of computer architecture" in 2007.
Single-threaded programming is already considered a complicated task.
The move to multi-threaded programming only increases the complexity
and cost involved in software development due to rewriting legacy
code, training of the programmer, increased debugging of the program,
and efforts to avoid race conditions, deadlocks, and other problems
associated with parallel programming. To address these costs, other
approaches, such as automatic thread extraction, have been explored.
Unfortunately, the amount of parallelism that has been automatically
extracted is generally insufficient to keep many cores busy. This paper argues that this lack of parallelism is not an intrinsic
limitation of the sequential programming model, but rather occurs for
two reasons. First, there exists no framework for automatic thread
extraction that brings together key existing state-of-the-art compiler
and hardware techniques. This paper shows that such a framework can
yield scalable parallelization on several SPEC CINT2000 benchmarks.
Second, existing sequential programming languages force programmers to
define a single legal program outcome, rather than allowing for a
range of legal outcomes. This paper shows that natural, simple
extensions to the sequential programming model enable parallelization
for the remainder of the SPEC CINT2000 suite. Our experience
demonstrates that, by changing only 60 source code lines, all of the C
benchmarks in the SPEC CINT2000 suite were parallelizable by automatic
thread extraction. This process, constrained by the limits of modern
optimizing compilers, yielded a speedup of 454% on these
applications.
Global Multi-Threaded Instruction Scheduling [abstract] (IEEE Xplore, PDF)
Guilherme Ottoni and David I. August
Proceedings of the 40th IEEE/ACM International Symposium on
Microarchitecture (MICRO), December 2007.
Recently, the microprocessor industry has moved toward chip
multiprocessor (CMP) designs as a means of utilizing the increasing
transistor counts in the face of physical and micro-architectural
limitations. Despite this move, CMPs do not directly improve the
performance of single-threaded codes, a characteristic of most
applications. In order to support parallelization of general-purpose
applications, computer architects have proposed CMPs with light-weight
scalar communication mechanisms. Despite such support, most existing
compiler multi-threading techniques have generally demonstrated little
effectiveness in extracting parallelism from non-scientific
applications. The main reason for this is that such techniques are
mostly restricted to extracting parallelism within straight-line
regions of code.
In this paper, we first propose a framework that enables global
multi-threaded instruction scheduling in general. We then describe
GREMIO, a scheduler built using this framework. GREMIO operates at a
global scope, at the procedure level, and uses control dependence
analysis to extract non-speculative thread-level parallelism from
sequential codes. Using a fully automatic compiler implementation of
GREMIO and a validated processor model, this paper demonstrates gains
for a dual-core CMP model running a variety of codes. Our experiments
demonstrate the advantage of exploiting global scheduling for
multi-threaded architectures, and present gains in a detailed
comparison with the Decoupled Software Pipelining (DSWP)
multi-threading technique. Furthermore, our experiments show that
adding GREMIO to a compiler with DSWP improves the average speedup
from 16.5% to 32.8% for important benchmark functions when utilizing
two cores, indicating the importance of this technique in making
compilers extract threads effectively.
UNISIM: An Open Simulation Environment and Library for Complex Architecture Design and Collaborative Development [abstract] (IEEE Xplore, PDF)
David I. August, Jonathan Chang, Sylvain Girbal, Daniel Gracia-Perez, Gilles Mouchard, David Penry, Olivier Temam, and Neil Vachharajani
IEEE Computer Architecture Letters (CAL), September 2007.
Simulator development is already a huge burden for many academic and
industry research groups; future complex or heterogeneous multi-cores,
as well as the multiplicity of performance metrics and required
functionality, will make matters worse. We present a new simulation
environment, called UNISIM, which is designed to rationalize simulator
development by making it possible and efficient to distribute the
overall effort over multiple research groups, even without direct
cooperation. UNISIM achieves this goal with a combination of modular
software development, distributed communication protocols, multi-level
abstract modeling, interoperability capabilities, a set of simulator
services APIs, and an open library/repository for providing a
consistent set of simulator modules.
Speculative Decoupled Software Pipelining [abstract] (IEEE CS, PDF)
Neil Vachharajani, Ram Rangan, Easwaran Raman, Matthew J. Bridges, Guilherme Ottoni, and David I. August
Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2007.
In recent years, microprocessor manufacturers have
shifted their focus from single-core to multicore processors. To
avoid burdening programmers with the responsibility of parallelizing
their applications, some researchers have advocated automatic thread
extraction. A recently proposed technique, Decoupled Software
Pipelining (DSWP), has demonstrated promise by partitioning loops into
long-running, fine-grained threads organized into a pipeline. Using a
pipeline organization and execution decoupled by inter-core
communication queues, DSWP offers increased execution efficiency that
is largely independent of inter-core communication latency. This paper proposes adding speculation to DSWP and evaluates an
automatic approach for its implementation. By speculating past
infrequent dependences, the benefit of DSWP is increased by making it
applicable to more loops, facilitating better balanced threads, and
enabling parallelized loops to be run on more cores. Unlike prior
speculative threading proposals, speculative DSWP focuses on breaking
dependence recurrences. By speculatively breaking these recurrences,
instructions that were formerly restricted to a single thread to
ensure decoupling are now free to span multiple threads. Using an
initial automatic compiler implementation and a validated processor
model, this paper demonstrates significant gains using speculation for
4-core chip multiprocessor models running a variety of codes.
Fault-tolerant Typed Assembly Language [abstract] (ACM DL, PDF)
Frances Perry, Lester Mackey, George A. Reis, Jay Ligatti, David I. August, and David Walker
Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2007.
Winner Best Paper Award.
A transient hardware fault occurs when an energetic particle strikes
a transistor, causing it to change state. Although transient faults do
not permanently damage the hardware, they may corrupt computations
by altering stored values and signal transfers. In this paper, we
propose a new scheme for provably safe and reliable computing in
the presence of transient hardware faults. In our scheme, software
computations are replicated to provide redundancy while special
instructions compare the results of replicas to detect errors before
writing critical data. In stark contrast to any previous efforts in this
area, we have analyzed our fault tolerance scheme from a formal,
theoretical perspective. First, we provide an operational semantics
for our assembly language, which includes a precise formal definition
of our fault model. Formulating such a model is a crucial step
forward for the science of hardware fault tolerance as it pins down
the assumptions being made about when and where faults may occur
in a completely precise and transparent manner. Second, we
develop an assembly-level type system designed to detect reliability
problems in compiled code. On the one hand, this type system
may be viewed as a theoretical tool for proving that code is reliable.
On the other hand, it may be used as a practical tool for compiler
debugging. In the latter case, the type system is strictly superior
to conventional testing as it guarantees full coverage and perfect
fault tolerance relative to the fault model. Third, we provide a formal
specification for program fault tolerance under the given fault
model and prove that all well-typed programs are indeed fault tolerant.
In addition to the formal analysis, we evaluate the execution
time of our detection scheme to quantify the performance penalty
for sound fault tolerance.
Pipelined Multithreading Transformations and Support Mechanisms [abstract] (PDF)
Ram Rangan
Ph.D. Thesis, Department of Computer Science,
Princeton University, June 2007.
Even though chip multiprocessors have emerged as the
predominant organization for future microprocessors, the multiple
on-chip cores do not directly result in improved application
performance (especially for legacy applications, which are
predominantly sequential C/C++ codes). Consequently, parallelizing
applications to execute on multiple cores is essential to their
success. Independent multithreading techniques, like DOALL extraction,
create partially or fully independent threads, which communicate
rarely, if at all. While such strategies keep high inter-thread
communication costs from impacting program performance, they cannot be
applied to parallelize general-purpose applications which are
characterized by difficult-to-break recurrences. Even though cyclic
multithreading techniques, such as DOACROSS, are more applicable, the
cyclic inter-thread dependences created by these techniques cause them
to have very low tolerance to rising inter-core latencies. To address these problems, this work introduces a pipelined
multithreading (PMT) transformation called Decoupled Software
Pipelining (DSWP). DSWP, in particular, and PMT techniques, in
general, are able to tolerate inter-core latencies, while still
handling codes with complex recurrences. They achieve this by
enforcing an acyclic communication discipline amongst threads, which
allows threads to use inter-thread queues to communicate in a pipelined
fashion. This dissertation demonstrates that DSWPed codes not only
tolerate inter-core communication costs, but also effectively tolerate
variable latency stalls in applications better than single-threaded
execution on both in-order and out-of-order issue processors with
comparable resources. It then performs a thorough analysis of the
performance scalability of automatically generated DSWPed codes and
identifies the conditions necessary to achieve peak PMT
performance. Next, the dissertation shows that even though PMT applications
tolerate inter-core latencies well, the high frequency of inter-thread
communication (once every 5 to 20 dynamic instructions) in these
codes makes them very sensitive to the intra-thread overhead imposed
by communication operations. In order to understand the issues
surrounding inter-thread communication for PMT applications, this
dissertation undertakes a methodical exploration of the design space
of communication support options for PMT. Three new communication
mechanisms with varying cost-performance tradeoffs are introduced and
are shown to perform 38% to 200% better than the state of the
art.
Shape Analysis with Inductive Recursion Synthesis [abstract] (ACM DL, PDF)
Bolei Guo, Neil Vachharajani, and David I. August
Proceedings of the ACM SIGPLAN 2007 Conference on
Programming Language Design and Implementation (PLDI), June 2007.
Separation logic with recursively defined predicates allows for
concise yet precise description of the shapes of data structures.
However, most uses of separation logic for program analysis rely on
pre-defined recursive predicates, limiting the class of programs
analyzable to those that manipulate only a priori data structures.
This paper describes a general algorithm based on inductive
program synthesis that automatically infers recursive shape
invariants, yielding a shape analysis based on separation logic that
can be applied to any program. A key strength of separation logic is that it facilitates, via
explicit expression of structural separation, local reasoning
about the heap, where the effects of altering one part of a data structure
are analyzed in isolation from the rest. The interaction between
local reasoning and the global invariants given by recursive
predicates is a difficult area, especially in the presence of complex
internal sharing in the data structures. Existing approaches, using
logic rules specifically designed for the list predicate to unfold
and fold linked-lists, again require a priori knowledge about the
shapes of the data structures and do not easily generalize to more
complex data structures. We introduce a notion of "truncation points"
in a recursive predicate, which gives rise to generic algorithms for
unfolding and folding arbitrary data structures.
Global Multi-Threaded Instruction Scheduling: Technique and
Initial Results [abstract] (CiteSeerX, PDF)
Guilherme Ottoni and David I. August
Proceedings of the Sixth Workshop on Explicitly Parallel
Instruction Computer Architectures and Compiler Technology (EPIC), March 2007.
Recently, the microprocessor industry has reached hard
physical and micro-architectural limits that have prevented the
continuous clock-rate increase, which had been the major source of
performance gains for decades. These impediments, in conjunction with
the still increasing transistor counts per chip, have driven all major
microprocessor manufacturers toward Chip Multiprocessor (CMP) designs.
Although CMPs are able to concurrently pursue multiple threads of
execution, they do not directly improve the performance of most
applications, which are written in sequential languages. In effect,
the move to CMPs has shifted even more the task of improving
performance from the hardware to the software. In order to support
more effective parallelization of sequential applications, computer
architects have proposed CMPs with light-weight communication
mechanisms. Despite such support, proposed multi-threaded scheduling
techniques have generally demonstrated little effectiveness in
extracting parallelism from general-purpose, sequential applications.
We call these techniques local multi-threaded scheduling,
because they basically exploit parallelism within straight-line
regions of code. A key observation of this paper is that local
multi-threaded techniques do not exploit the main feature of CMPs: the
ability to concurrently execute instructions from different
control-flow regions. In order to benefit from this powerful CMP
characteristic, it is necessary to perform global multi-threaded
scheduling, which simultaneously schedules instructions from
different basic blocks to enable their concurrent execution. This
paper presents algorithms to perform global scheduling for
communication-exposed multi-threaded architectures. By global
we mean that our technique simultaneously schedules instructions from
an arbitrary code region. Very encouraging preliminary results,
targeting a dual-core Itanium 2 model, are presented for selected
MediaBench and SPEC-CPU benchmark applications.
Automatic Instruction-Level Software-Only Recovery Methods [abstract] (IEEE Xplore, Original Full Paper, PDF)
George A. Reis, Jonathan Chang, and David I. August
IEEE Micro, Volume 27, Number 1, January 2007.
IEEE Micro's "Top Picks" special issue for papers "most
relevant to industry and significant in contribution to the field of
computer architecture" in 2006.
As chip densities and clock rates increase, processors are becoming
more susceptible to transient faults that can affect program
correctness. Computer architects have typically addressed reliability
issues by adding redundant hardware, but these techniques are often
too expensive to be used widely. Software-only reliability techniques
have shown promise in their ability to protect against soft-errors
without any hardware overhead. However, existing low-level
software-only fault tolerance techniques have only addressed the
problem of detecting faults, leaving recovery largely unaddressed. In
this paper, we present the concept, implementation, and evaluation of
automatic, instruction-level, software-only recovery techniques, as
well as various specific techniques representing different trade-offs
between reliability and performance. Our evaluation shows that these
techniques fulfill the promises of instruction-level, software-only
fault tolerance by offering a wide range of flexible recovery options.
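One point in the reliability/performance trade-off space described above is instruction-level triplication with majority voting before values reach memory. A source-level caricature of that idea follows (the paper's compiler applies such transformations automatically at a much lower level and must also keep the redundant copies from being merged by later optimization, which this sketch approximates with volatile temporaries):

    /* recovery_sketch.c: triplicated computation with a majority vote before the
     * value is committed to memory. Illustrative only. */
    #include <stdio.h>

    static long majority3(long a, long b, long c) {
        if (a == b || a == c) return a;     /* at most one copy was corrupted: recover */
        if (b == c) return b;
        fprintf(stderr, "unrecoverable: all three copies disagree\n");
        return a;
    }

    static volatile long r1, r2, r3;        /* volatile keeps the copies distinct here */

    void scale_and_store(long *dst, long x, long k) {
        r1 = x * k + 1;                     /* three redundant versions of the computation */
        r2 = x * k + 1;
        r3 = x * k + 1;
        *dst = majority3(r1, r2, r3);       /* vote where an error would become visible */
    }

    int main(void) {
        long out;
        scale_and_store(&out, 6, 7);
        printf("%ld\n", out);               /* prints 43 */
        return 0;
    }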
Non-Uniform Fault Tolerance [abstract] (PDF)
Jonathan Chang, George A. Reis, and David I. August
Proceedings of the 2nd Workshop on Architectural Reliability (WAR), December 2006.
As devices become more susceptible to transient faults that can affect
program correctness, processor designers will increasingly compensate
by adding hardware or software redundancy. Proposed redundancy
techniques and those currently in use are generally applied uniformly
to a structure despite non-uniformity in the way errors within the
structure manifest themselves in programs. This uniform protection
leads to inefficiency in terms of performance, power, and area. Using
case studies involving the register file, this paper motivates an
alternative Non-Uniform Fault Tolerance approach which improves
reliability over uniform approaches by spending the redundancy budget
on those areas most susceptible.
Support for High-Frequency Streaming in CMPs [abstract] (ACM DL, PDF)
Ram Rangan, Neil Vachharajani, Adam Stoler, Guilherme Ottoni, David I. August, and George Z. N. Cai
Proceedings of the 39th IEEE/ACM International Symposium on
Microarchitecture (MICRO), December 2006.
As the industry moves toward larger-scale chip multiprocessors, the
need to parallelize applications grows. High inter-thread
communication delays, exacerbated by over-stressed high-latency memory
subsystems and ever-increasing wire delays, require parallelization
techniques to create partially or fully independent threads to improve
performance. Unfortunately, developers and compilers alike often fail
to find sufficient independent work of this kind. Recently proposed pipelined streaming techniques have shown
significant promise for both manual and automatic parallelization.
These techniques have wide-scale applicability because they embrace
inter-thread dependences (albeit acyclic dependences) and tolerate
long-latency communication of these dependences. This paper addresses
the lack of architectural support for this type of concurrency, which
has blocked its adoption and hindered related language and compiler
research. We observe that both manual and automatic techniques create
high-frequency streaming threads, with communication occurring
every 5 to 20 instructions. Even while easily tolerating
inter-thread transit delays, this high-frequency communication
makes thread performance very sensitive to intra-thread delays
from the repeated execution of the communication operations. Using
this observation, we define the design-space and evaluate several
mechanisms to find a better trade-off between performance and
operating system, hardware, and design costs. From this, we find a
light-weight streaming-aware enhancement to conventional memory
subsystems that doubles the speed of these codes and is within 2% of
the best-performing, but heavy-weight, hardware solution.
Configurable Transient Fault Detection via Dynamic Binary Translation [abstract] (PDF)
George A. Reis, Jonathan Chang, David I. August, Robert Cohn, and Shubhendu S. Mukherjee
Proceedings of the 2nd Workshop on Architectural Reliability (WAR), December 2006.
Smaller feature sizes, lower voltage levels, and reduced noise margins
have helped improve the performance and lower the power consumption
of modern microprocessors. These same advances have
made processors more susceptible to transient faults that can corrupt
data and make systems unavailable. Designers often compensate for
transient faults by adding hardware redundancy and making circuit- and
process-level adjustments. However, applications have different
data integrity and availability demands, which make hardware
approaches such as these too costly for many markets. Software techniques can provide fault tolerance at a lower cost and
with greater flexibility since they can be selectively deployed in the
field even after the hardware has been manufactured. Most existing
software-only techniques use recompilation, requiring access to program
source code. Regardless of the code transformation method,
previous techniques also incur unnecessary significant performance
penalties by uniformly protecting the entire program without taking
into account the varying vulnerability of different program regions
and state elements to transient faults. This paper presents Spot, a software-only fault-detection technique
which uses dynamic binary translation to provide software-modulated
fault tolerance with fine-grained control of redundancy.
By using dynamic binary translation, users can improve the reliability
of their applications without any assistance from hardware or
software vendors. By using software-modulated fault tolerance, Spot
can vary the level of protection independently for each register and
region of code to provide users with more, and often superior, fault-detection
options. This feature of Spot increases the mean work to
failure from 1.90x to 17.79x.
The Acceleration of Structural Microarchitectural Simulation via Scheduling [abstract] (PDF)
David A. Penry
Ph.D. Thesis, Department of Computer Science,
Princeton University, November 2006.
Microarchitects rely upon simulation to evaluate design alternatives,
yet constructing an accurate simulator by hand is a difficult and
time-consuming process because simulators are usually written in
sequential languages while the system being modeled is concurrent.
Structural modeling can mitigate this difficulty by allowing the
microarchitect to specify the simulation model in a concurrent,
structural form; a simulator compiler then generates a simulator from
the model. However, the resulting simulators are generally slower
than those produced by hand. The thesis of this dissertation is that
simulation speed improvements can be obtained by careful scheduling of
the work to be performed by the simulator onto single or multiple
processors.
For scheduling onto single processors, this dissertation presents an
evaluation of previously proposed scheduling mechanisms in the context
of a structural microarchitectural simulation framework which uses a
particular model of computation, the Heterogeneous Synchronous
Reactive (HSR) model, and improvements to these mechanisms which make
them more effective or more feasible for microarchitectural models. A
static scheduling technique known as partitioned scheduling is shown
to offer the most performance improvement: a speedup of up to 2.08. This
work furthermore proves that the Discrete Event model of
computation can be statically scheduled using partitioned scheduling
when restricted in ways that are commonly assumed in
microarchitectural simulation.
For scheduling onto multiple processors, this dissertation presents
the first automatic parallelization of simulators using the HSR model
of computation. It shows that effective parallelization requires
techniques to avoid waiting due to locks and to improve cache
locality. Two novel heuristics for lock mitigation and two for cache
locality improvement are introduced and evaluated on three different
parallel systems. The combination of lock mitigation and locality
improvement is shown to allow superlinear speedup for some models: up
to 7.56 for four processors.
Static Typing for a Faulty Lambda Calculus [abstract] (ACM DL, PDF)
David Walker, Lester Mackey, Jay Ligatti, George A. Reis, and David I. August
Proceedings of the 11th ACM SIGPLAN International Conference on Functional Programming (ICFP), September 2006.
A transient hardware fault occurs when an energetic particle strikes a
transistor, causing it to change state. These faults do not cause
permanent damage, but may result in incorrect program execution by
altering signal transfers or stored values. While the likelihood that
such transient faults will cause any significant damage may seem
remote, over the last several years transient faults have caused
costly failures in high-end machines at America Online, eBay, and the
Los Alamos Neutron Science Center, among others. Because
susceptibility to transient faults is proportional to the size and
density of transistors, the problem of transient faults will become
increasingly important in the coming decades. This paper defines the first formal, type-theoretic framework for
studying reliable computation in the presence of transient faults.
More specifically, it defines λzap, a lambda calculus that exhibits
intermittent data faults. In order to detect and recover from these
faults, λzap programs replicate intermediate computations and use
majority voting, thereby modeling software-based fault tolerance
techniques studied extensively, but informally. To ensure that programs maintain the proper invariants and use λzap
primitives correctly, the paper defines a type system for the
language. This type system guarantees that well-typed programs can
tolerate any single data fault. To demonstrate that λzap can serve as
an idealized typed intermediate language, we define a type-preserving
translation from a standard simply-typed lambda calculus into λzap.
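To make the replicate-and-vote idea concrete, the short C sketch below shows the pattern that λzap models at the language level; it is an illustration only, not the calculus or code from the paper, and all names are ours.

    /* Illustrative sketch only: the replicate-and-vote pattern that
     * lambda-zap models, shown for a single integer computation in C.
     * A transient fault may corrupt one copy; majority voting recovers
     * the correct value as long as at most one copy is faulty. */
    #include <stdio.h>

    static int vote3(int a, int b, int c) {
        if (a == b || a == c) return a;   /* a agrees with a majority */
        return b;                         /* otherwise b and c must agree */
    }

    static int square(int x) { return x * x; }

    int main(void) {
        int x = 7;
        /* Replicate the intermediate computation three times. */
        int r1 = square(x), r2 = square(x), r3 = square(x);
        printf("%d\n", vote3(r1, r2, r3));  /* tolerates one corrupted copy */
        return 0;
    }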
Eliminating Scope and Selection Restrictions in Compiler Optimizations [abstract] (PDF)
Spyridon Triantafyllis
Ph.D. Thesis, Department of Computer Science,
Princeton University, September 2006.
To meet the challenges presented by the performance requirements of modern architectures,
compilers have been augmented with a rich set of aggressive optimizing transformations.
However, the overall compilation model within which these transformations operate has
remained fundamentally unchanged. This model imposes restrictions on these transformations'
application, limiting their effectiveness.
First, procedure-based compilation limits code transformations within a single procedure's
boundaries, which may not present an ideal optimization scope. Although aggressive
inlining and interprocedural optimization can alleviate this problem, code growth and
compile time considerations limit their applicability.
Second, by applying a uniform optimization process on all codes, compilers cannot
meet the particular optimization needs of each code segment. Although the optimization
process is tailored by heuristics that attempt to a priori judge the effect of each transformation
on final code quality, the unpredictability of modern optimization routines and the
complexity of the target architectures severely limit the accuracy of such predictions.
This thesis focuses on removing these restrictions through two novel compilation framework
modifications, Procedure Boundary Elimination (PBE) and Optimization-Space Exploration
(OSE).
PBE forms compilation units independent of the original procedures. This is achieved
by unifying the entire application into a whole-program control-flow graph, allowing the
compiler to repartition this graph into free-form regions, making analysis and optimization
routines able to operate on these generalized compilation units. Targeted code duplication
techniques can then recover the performance benefits of inlining while limiting code
growth. Thus PBE offers a superset of the benefits of inlining and interprocedural optimization,
while avoiding both excessive code growth and overly long compile times.
OSE, on the other hand, explores many optimization options for each code segment and
selects the best one a posteriori. OSE trims the space of options explored through limited
use of heuristics, further limits the search space during compiler tuning, and exploits feedback
to prune the remaining optimization configurations at compile time. The resulting
optimization outcomes are compared through a fast and effective static performance estimator.
As a result, OSE is the first iterative compilation technique fast enough and general
enough for general-purpose compilation.
Amortizing Software Queue Overhead for Pipelined Inter-Thread Communication [abstract] (PDF)
Ram Rangan and David I. August
Proceedings of the Workshop on Programming Models for Ubiquitous Parallelism (PMUP), September 2006.
Future chip multiprocessors are expected to contain multiple on-die
processing cores. Increased memory system contention and wire delays
will result in high inter-core latencies in these processors. Thus,
parallelizing applications to efficiently execute on multiple contexts
is key to achieving continued performance improvements. Recently
proposed pipelined multithreading (PMT) techniques have shown
significant promise for both manual and automatic
parallelization. They tolerate increasing inter-thread communication
delays by enforcing acyclic dependences amongst communicating threads
and pipelining communication. However, lack of efficient communication support for such programs
hinders related language and compiler research. While researchers have
proposed dedicated interconnects and storage for inter-core
communication, such mechanisms are not cost-effective, consume extra
power, demand chip redesign effort, and necessitate complex operating
system modifications. Software implementations of shared memory
queues avoid these problems, but they tend to have heavy overhead per
communication operation, causing them to negate parallelization
benefits and, worse still, to perform slower than the original
single-threaded codes. In this paper, we present a simple compiler
analysis to coalesce synchronization and queue pointer updates for
select communication operations, to minimize the intra-thread
overhead of software queue implementations. A preliminary comparison
of static schedule heights shows a considerable performance
improvement over existing software queue
implementations.
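The flavor of the optimization can be sketched in C as below; this is a hypothetical illustration of coalescing queue-pointer updates over a block of enqueues, not the paper's analysis or implementation, and all names are ours.

    /* Hypothetical sketch: a shared-memory software queue in which the
     * producer publishes the shared tail index once per BLOCK enqueues
     * rather than once per element, amortizing synchronization and
     * pointer-update overhead. Full/empty checks are omitted for brevity. */
    #include <stdatomic.h>

    #define QSIZE 1024
    #define BLOCK 8

    struct queue {
        long buf[QSIZE];
        atomic_long head;    /* consumer's published read index  */
        atomic_long tail;    /* producer's published write index */
    };

    static long local_tail;  /* producer-private copy of the tail */

    void produce(struct queue *q, long v) {
        q->buf[local_tail % QSIZE] = v;
        local_tail++;
        /* Coalesced update: the shared tail (and the implied "data ready"
         * synchronization) is published only once per BLOCK items. */
        if (local_tail % BLOCK == 0)
            atomic_store(&q->tail, local_tail);
    }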
The Liberty Simulation Environment: A Deliberate Approach to
High-Level System Modeling [abstract] (ACM DL, PDF)
Manish Vachharajani, Neil Vachharajani, David A. Penry, Jason A. Blome, Sharad Malik, and David I. August
ACM Transactions on Computer Systems (TOCS), Volume 24, Number 3, August 2006.
Automatic Instruction Scheduler Retargeting by Reverse-Engineering [abstract] (ACM DL, PDF)
Matthew J. Bridges, Neil Vachharajani, Guilherme Ottoni, and David I. August
Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2006.
In order to generate high-quality code for modern processors, a
compiler must aggressively schedule instructions, maximizing resource
utilization for execution efficiency. For a compiler to produce such
code, it must avoid structural hazards by being aware of the
processor's available resources and of how these resources are
utilized by each instruction. Unfortunately, the most prevalent
approach to constructing such a scheduler, manually discovering and
specifying this information, is both tedious and error-prone.
This paper presents a new approach which, when given a processor or
processor model, automatically determines this information. After
establishing that the problem of perfectly determining a processor's
structural hazards through probing is not solvable, this paper
proposes a heuristic algorithm that discovers most of this information
in practice. This can be used either to alleviate the problems
associated with manual creation or to verify an existing
specification. Scheduling with these automatically derived structural
hazards yields almost all of the performance gain achieved using
perfect hazard information.
A Framework for Unrestricted Whole-Program Optimization [abstract] (ACM DL, PDF)
Spyridon Triantafyllis, Matthew J. Bridges, Easwaran Raman, Guilherme Ottoni, and David I. August
Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2006.
Procedures have long been the basic units of compilation in
conventional optimization frameworks. However, procedures are
typically formed to serve software engineering rather than
optimization goals, arbitrarily constraining code transformations.
Techniques, such as aggressive inlining and interprocedural
optimization, have been developed to alleviate this problem, but, due
to code growth and compile time issues, these can be applied only
sparingly.
This paper introduces the Procedure Boundary Elimination (PBE)
compilation framework, which allows unrestricted whole-program
optimization. PBE allows all intra-procedural optimizations and
analyses to operate on arbitrary subgraphs of the program, regardless
of the original procedure boundaries and without resorting to
inlining. In order to control compilation time, PBE also introduces
novel extensions of region formation and encapsulation.
PBE enables targeted code specialization, which recovers the
specialization benefits of inlining while keeping code growth in
check. This paper shows that PBE attains better performance than
inlining with half the code growth.
Automatic Instruction-Level Software-Only Recovery [abstract] (IEEE Xplore, PDF, Top Picks Version)
Jonathan Chang, George A. Reis, and David I. August
Proceedings of the International Conference on Dependable Systems and Networks (DSN), June 2006.
Winner of the William C. Carter Award.
Selected for IEEE Micro's "Top Picks" special issue for papers "most
relevant to industry and significant in contribution to the field of
computer architecture" in 2006.
As chip densities and clock rates increase, processors are becoming
more susceptible to transient faults that can affect program
correctness. Computer architects have typically addressed reliability
issues by adding redundant hardware, but these techniques are often
too expensive to be used widely. Software-only reliability techniques
have shown promise in their ability to protect against soft-errors
without any hardware overhead. However, existing low-level
software-only fault tolerance techniques have only addressed the
problem of detecting faults, leaving recovery largely unaddressed. In
this paper, we present the concept, implementation, and evaluation of
automatic, instruction-level, software-only recovery techniques, as
well as various specific techniques representing different trade-offs
between reliability and performance. Our evaluation shows that these
techniques fulfill the promises of instruction-level, software-only
fault tolerance by offering a wide range of flexible recovery options.
Selective Runtime Memory Disambiguation in a Dynamic Binary
Translator [abstract] (PDF, ACM DL)
Bolei Guo, Youfeng Wu, Cheng Wang, Matthew J. Bridges, Guilherme Ottoni, Neil Vachharajani, Jonathan Chang, and David I. August
Proceedings of the 15th International Conference on Compiler
Construction (CC), March 2006.
Performing runtime alias analysis in a dynamic binary translator (DBT)
is difficult due to incomplete control-flow information and the high
complexity of accurate alias analysis. Whole-program profiling,
however, shows that the vast majority of memory references do not
alias. The current technique used in DBTs to disambiguate memory
references, instruction inspection, is too simple and can only
disambiguate one-third of potential aliases. To address the problem
of effective memory disambiguation while keeping a tight bound on the
analysis overhead, we propose an efficient heuristic algorithm that
strategically selects key memory dependences to disambiguate with
runtime checks. These checks have little runtime overhead and, in the
common case where aliasing does not occur, enable aggressive
optimizations, particularly scheduling, to be performed. We demonstrate
that a small number of checks, inserted with a low overhead analysis,
can approach optimal scheduling, where all false memory dependences
are removed.
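The shape of such a runtime check can be sketched as follows; this is an illustration under our own assumptions, not the algorithm or code from the paper.

    /* Illustrative sketch: a runtime disambiguation check guarding an
     * optimized schedule. If the store through p cannot alias the load
     * through q, the load may be scheduled early; otherwise the original
     * ordering is preserved. */
    extern void use(int value);

    void kernel(int *p, int *q, int n) {
        if (p != q) {
            /* Fast path: no alias, so the load has been hoisted above the store. */
            int t = *q;
            *p = n;
            use(t);
        } else {
            /* Slow path: aliasing occurs, keep the original order. */
            *p = n;
            use(*q);
        }
    }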
Exploiting Parallelism and Structure to Accelerate the
Simulation of Chip Multi-processors [abstract] (IEEE Xplore, PDF)
David A. Penry, Daniel Fay, David Hodgdon, Ryan Wells, Graham Schelle, David I. August, and Daniel A. Connors
Proceedings of the Twelfth International Symposium on
High-Performance Computer Architecture (HPCA), February 2006.
Simulation is an important means of evaluating new microarchitectures.
Current trends toward chip multi-processors (CMPs) try the ability of
designers to develop efficient simulators. CMP simulation speed can
be improved by exploiting parallelism in the CMP simulation model.
This may be done by either running the simulation on multiple
processors or by integrating multiple processors into the simulation to
replace simulated processors. Doing so usually requires tedious
manual parallelization or re-design to encapsulate processors.
Both problems can be avoided by generating the simulator from a
concurrent, structural model of the CMP. Such a model not only
resembles hardware, making it easy to understand and use, but also
provides sufficient information to automatically parallelize the
simulator without requiring manual model changes. Furthermore,
individual components of the model such as processors may be replaced
with equivalent hardware without requiring repartitioning.
This paper presents techniques to perform automated simulator
parallelization and hardware integration for CMP structural models.
We show that automated parallelization can achieve a 7.60 speedup for
a 16-processor CMP model on a conventional 4-processor shared-memory
multiprocessor. We demonstrate the power of hardware integration by
integrating eight hardware PowerPC cores into a CMP model, achieving a
speedup of up to 5.82.
Software Fault Detection Using Dynamic Instrumentation [abstract] (CiteSeerX, PDF)
George A. Reis, David I. August, Robert Cohn, and Shubhendu S. Mukherjee
Proceedings of the Fourth Annual Boston Area Architecture Workshop (BARC), February 2006.
Software-only approaches to increase hardware reliability have
been proposed and evaluated as alternatives to hardware
modification. These techniques have shown that they can significantly
improve reliability with reasonable performance
overhead. Software-only techniques do not require any hardware support
and thus are far cheaper and easier to deploy. These techniques can
be used for systems that have already been manufactured and now
require higher reliability than the hardware can offer. All previous proposals have been static compilation techniques
that rely on source code transformations or alterations to the
compilation process. Our proposal is the first application of
software fault detection for transient errors that increases
reliability dynamically. The application of our technique is trivial
since the only requirement is the program binary, which makes it
applicable for legacy programs that no longer have readily available
or easily re-compilable source code. Our dynamic reliability
technique can seamlessly handle variable-length instructions, mixed
code and data, statically unknown indirect jump targets, dynamically
generated code, and dynamically loaded libraries. Our technique is
also able to attach to an already running application to increase its
reliability, and detach when appropriate, thus returning to faster
(although unreliable) execution.
Software-Controlled Fault Tolerance [abstract] (ACM DL, PDF)
George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August, and Shubhendu S. Mukherjee
ACM Transactions on Architecture and Code Optimization (TACO), December 2005.
Traditional fault tolerance techniques typically utilize
resources ineffectively because they cannot adapt to the changing
reliability and performance demands of a system. This paper proposes
software-controlled fault tolerance, a concept allowing designers and
users to tailor their performance and reliability for each
situation. Several software-controllable fault detection techniques
are then presented: SWIFT, a software-only technique, and CRAFT, a
suite of hybrid hardware/software techniques. Finally, the paper
introduces PROFiT, a technique which adjusts the level of protection
and performance at fine granularities through software control. When
coupled with software-controllable techniques like SWIFT and CRAFT,
PROFiT offers attractive and novel reliability options.
Chip Multi-Processor Scalability for Single-Threaded Applications [abstract] (ACM DL, PDF)
Neil Vachharajani, Matthew Iyer, Chinmay Ashok, Manish Vachharajani, David I. August, and Daniel A. Connors
Proceedings of the 2005 Workshop on Design, Architecture and Simulation of Chip Multi-Processors (dasCMP), November 2005.
The exponential increase in uniprocessor performance has
begun to slow. Designers have been unable to scale performance while
managing thermal, power, and electrical effects. Furthermore, design
complexity limits the size of monolithic processors that can be
designed while keeping costs reasonable. Industry has responded by
moving toward chip multi-processor architectures (CMP). These
architectures are composed from replicated processors utilizing the
die area afforded by newer design processes. While this approach
mitigates the issues with design complexity, power, and electrical
effects, it does nothing to directly improve the performance of
contemporary or future single-threaded applications. This paper examines the scalability potential for exploiting the
parallelism in single-threaded applications on these CMP
platforms. The paper explores the total available parallelism in
unmodified sequential applications and then examines the
viability of exploiting this parallelism on CMP machines. Using the
results from this analysis, the paper forecasts that CMPs, using the
"intrinsic" parallelism in a program, can sustain the performance
improvement users have come to expect from new processors for only 6-8
years, provided many successful parallelization efforts emerge. Given
this outlook, the paper advocates exploring methodologies which
achieve parallelism beyond this "intrinsic" limit of programs.
Automatic Thread Extraction with Decoupled Software Pipelining [abstract] (IEEE Xplore, PDF)
Guilherme Ottoni, Ram Rangan, Adam Stoler, and David I. August
Proceedings of the 38th IEEE/ACM International Symposium on
Microarchitecture (MICRO), November 2005.
One of five papers nominated for the Best Paper Award by the Program Committee.
Until recently, a steadily rising clock rate and other uniprocessor
microarchitectural improvements could be relied upon to consistently
deliver increasing performance for a wide range of applications.
Current difficulties in maintaining this trend have led
microprocessor companies to add value by incorporating multiple
processors on a chip. Unfortunately, since decades of compiler
research have not succeeded in delivering automatic threading for
prevalent code properties, this approach demonstrates no improvement
for a large class of existing codes. To find useful work for chip multiprocessors, we propose an automatic
approach to thread extraction, called Decoupled Software Pipelining
(DSWP). DSWP exploits the fine-grained pipeline parallelism lurking
in most applications to extract long-running, concurrently executing
threads. Use of the non-speculative and truly decoupled threads
produced by DSWP can increase execution efficiency and provide
significant latency tolerance, mitigating design complexity by
reducing inter-core communication and per-core resource requirements.
Using our initial fully automatic compiler implementation and a
validated processor model, we prove the concept by demonstrating
significant gains for dual-core chip multiprocessor models running a
variety of codes. Then, we explore simple opportunities missed by our
initial compiler implementation which suggest a promising future for
this approach.
A New Approach to Thread Extraction for General-Purpose
Programs [abstract] (PDF)
Guilherme Ottoni, Ram Rangan, Adam Stoler, and David I. August
Proceedings of the 2nd Watson Conference on
Interaction between Architecture, Circuits, and Compilers (PAC2), September 2005.
Until recently, a steadily rising clock rate and other uniprocessor
microarchitectural improvements could be relied upon to consistently
deliver increasing performance for a wide range of applications.
Current difficulties in maintaining this trend have led
microprocessor companies to add value by incorporating multiple
processors on a chip. Unfortunately, since decades of compiler
research have not succeeded in delivering automatic threading for
prevalent code properties, this approach demonstrates no improvement
for a large class of existing codes. To find useful work for chip multiprocessors, we propose an automatic
approach to thread extraction, called Decoupled Software Pipelining
(DSWP). DSWP exploits the fine-grained pipeline parallelism lurking
in most applications to extract long-running, concurrently executing
threads. Use of the non-speculative and truly decoupled threads
produced by DSWP can increase execution efficiency and provide
significant latency tolerance, mitigating design complexity by
reducing inter-core communication and per-core resource requirements.
Using our initial fully automatic compiler implementation and a
validated processor model, we prove the concept by demonstrating
significant gains for dual-core chip multiprocessor models running a
variety of codes. Then, we explore simple opportunities missed by our
initial compiler implementation which suggest a promising future for
this approach.
Recursive Data Structure Profiling [abstract] (ACM DL, PDF)
Easwaran Raman and David I. August
Proceedings of the Third Annual ACM SIGPLAN Workshop on
Memory Systems Performance (MSP), June 2005.
As the processor-memory performance gap increases, so does the need
for aggressive data structure optimizations to reduce memory access
latencies. Such optimizations require a better understanding of the
memory behavior of programs. We propose a profiling technique called
Recursive Data Structure Profiling to help better understand the
memory access behavior of programs that use recursive data structures
(RDS) such as lists, trees, etc. An RDS profile captures the runtime
behavior of the individual instances of recursive data structures.
RDS profiling differs from other memory profiling techniques in its
ability to aggregate information pertaining to an entire data
structure instance, rather than merely capturing the behavior of
individual loads and stores, thereby giving a more global view of a
program's memory accesses. This paper describes a method for collecting RDS profiles without
requiring any high-level program representation or type
information. RDS profiling achieves this with manageable space and
time overhead on a mixture of pointer-intensive benchmarks from the SPEC, Olden, and other
benchmark suites. To illustrate the potential of the RDS profile in
providing a better understanding of memory accesses, we introduce a
metric to quantify the notion of
stability of an RDS instance. A stable RDS instance is one that
undergoes very few changes to its structure between its initial
creation and final destruction, making it an attractive candidate for certain
data structure optimizations.
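A minimal sketch of per-instance aggregation, under our own assumptions about how nodes are tagged, is shown below; it is not the profiler described in the paper.

    /* Hypothetical sketch: attributing memory accesses to an entire
     * recursive-data-structure instance rather than to individual loads
     * and stores. Each node carries the ID of the RDS instance it
     * belongs to, assigned at allocation time by the profiler. */
    #include <stddef.h>

    #define MAX_INSTANCES 1024

    struct node { int instance_id; struct node *next; int payload; };

    static long accesses[MAX_INSTANCES];  /* per-instance aggregate counters */

    long walk(struct node *head) {
        long sum = 0;
        for (struct node *n = head; n != NULL; n = n->next) {
            accesses[n->instance_id]++;   /* charge the whole RDS instance */
            sum += n->payload;
        }
        return sum;
    }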
Achieving Structural and Composable Modeling of Complex Systems [abstract] (SpringerLink, PDF)
David I. August, Sharad Malik, Li-Shiuan Peh, Vijay Pai, Manish Vachharajani, and Paul Willmann
The International Journal of Parallel Programming (IJPP), Volume 33, June 2005.
Invited.
This paper describes a recently-released, structural and composable
modeling system called the Liberty Simulation Environment (LSE). LSE
automatically constructs simulators from system descriptions that
closely resemble the structure of hardware at the chosen level of
abstraction. Component-based reuse features allow an extremely
diverse range of complex models to be built easily from a core set of
component libraries. This paper also describes the makeup and initial
experience with a set of such libraries currently undergoing
refinement. With LSE and these soon-to-be-released component
libraries, students will be able to learn about systems in a more
intuitive fashion, researchers will be able to collaborate with each
other more easily, and developers will be able to rapidly and
meaningfully explore novel design candidates.
From Sequential Programs to Concurrent Threads [abstract] (IEEE Xplore, PDF)
Guilherme Ottoni, Ram Rangan, Adam Stoler, Matthew J. Bridges, and David I. August
IEEE Computer Architecture Letters (CAL), June 2005.
Chip multiprocessors are of increasing importance due to recent
difficulties in achieving higher clock frequencies in uniprocessors,
but their success depends on finding useful work for the processor
cores. This paper addresses this challenge by presenting a simple
compiler approach that extracts non-speculative thread-level
parallelism from sequential codes. We present initial results from
this technique targeting a validated dual-core processor model,
achieving speedups ranging from 9-48% with an average of 25% for
important benchmark loops over their single-threaded versions. We
also identify important next steps found during our pursuit of higher
degrees of automatic threading.
Rapid Development of Flexible Validated Processor Models [abstract] (PDF)
David A. Penry, Manish Vachharajani, and David I. August
Proceedings of the Workshop on Modeling, Benchmarking, and Simulation (MoBS), June 2005.
Given the central role of simulation in processor design and research,
an accurate, validated, and easily modified simulation model is
extremely desirable. Prior work proposed a modeling methodology with
the claim that it allows rapid construction of flexible validated
models. In this paper, we present our experience using this
methodology to construct a flexible validated model of Intel's Itanium
2 processor, lending support to their claims. Our initial model was
constructed by a single researcher in only 11 weeks and predicts
processor cycles-per-instruction (CPI) to within 7.9% on average for
the entire SPEC CINT2000 benchmark suite. We find that aggregate
accuracy for a metric like CPI is not sufficient; aggregate measures
like CPI may conceal remaining internal "offsetting errors" which
can adversely affect conclusions drawn from the model. We then
modified the model to reduce error in specific performance
constituents. In 2 1/2 person-weeks, overall constituent
error was reduced from 3.1% to 2.1%, while simultaneously reducing
average aggregate CPI error to 5.4%, demonstrating that model
flexibility allows rapid improvements to accuracy. Flexibility is
further shown by making significant changes to the model in under
eight person-weeks to explore two novel microarchitectural techniques.
Design and Evaluation of Hybrid Fault-Detection Systems [abstract] (IEEE Xplore, PDF)
George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August, and Shubhendu S. Mukherjee
Proceedings of the 32nd International Symposium on
Computer Architecture (ISCA), June 2005.
To improve performance and reduce power consumption, processor
designers employ advances that shrink feature sizes, lower voltage
levels, reduce noise margins, and increase clock rates. However, these
advances also make processors more susceptible to transient faults
that can affect program correctness. Up to now, system designers have
primarily considered hardware-only and software-only fault-detection
mechanisms to identify and mitigate the deleterious effects of
transient faults. These two fault-detection systems, however, are
extremes in the design space, representing sharp trade-offs between
hardware cost, reliability, and performance. In this paper, we identify hybrid hardware/software fault-detection
mechanisms as promising alternatives to hardware-only and
software-only systems. These hybrid systems offer designers more
options to fit their reliability needs within their hardware and
performance budgets. We propose CRAFT, a suite of three such hybrid
techniques, to illustrate the potential of the hybrid approach. We
evaluate CRAFT in relation to existing hardware and software
reliability techniques. For fair, quantitative comparisons among
hardware, software, and hybrid systems, we introduce a new metric,
mean work to failure, which is able to compare systems for which
machine instructions do not represent a constant unit of
work. Additionally, we present a new simulation framework which
rapidly assesses reliability and does not depend on manual
identification of failure modes. Our evaluation illustrates that
CRAFT, and hybrid techniques in general, offer attractive options in
the fault-detection design space.
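As we read it, mean work to failure normalizes failures by useful work completed rather than by elapsed time; a common formulation (our paraphrase, not a quotation from the paper) is:

    \mathrm{MWTF} = \frac{\text{amount of work completed}}{\text{number of errors encountered}}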
Branch Predication (ACM DL)
David I. August
Speculative Execution in High Performance Computer
Architectures (ISBN: 978-1584884477)
Edited by D. Kaeli and P.-C. Yew.
CRC Press, May 2005.
SWIFT: Software Implemented Fault Tolerance [abstract] (ACM DL, PDF)
George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, and David I. August
Proceedings of the Third International Symposium on
Code Generation and Optimization (CGO), March 2005.
Winner Best Paper Award.
Winner of the 2015 International Symposium on
Code Generation and Optimization Test of Time Award.
To improve performance and reduce power consumption, processor
designers employ advances that shrink feature sizes, lower voltage
levels, reduce noise margins, and increase clock rates. These
advances, however, also make processors more susceptible to transient
faults that can affect program correctness. To mitigate this
increasing problem, designers build redundancy into systems to the
degree that the soft-error budget will allow. While reliable systems typically employ hardware techniques to address
soft-errors, software techniques can provide a lower cost and more
flexible alternative. To make this alternative more attractive, this
paper presents a new software fault tolerance technique, called SWIFT,
for detecting transient errors. Like other single-threaded software
fault tolerance techniques, SWIFT efficiently manages redundancy by
reclaiming unused instruction-level resources present during the
execution of most programs. SWIFT, however, eliminates the need to
double the memory requirement by acknowledging the use of ECC in
caches and memory. SWIFT also provides a higher level of protection
with enhanced checking of the program counter (PC) at no performance
cost. In addition, this enhanced PC checking makes most code inserted
to detect faults in prior methods unnecessary, significantly enhancing
performance. While SWIFT can be implemented on any architecture and
can protect individual code segments to varying degrees, we evaluate a
fully-redundant implementation running on Itanium 2. In these
experiments, SWIFT demonstrates exceptional fault-coverage with a
reasonable performance cost. Compared to the best known
single-threaded approach utilizing an ECC memory system, SWIFT
demonstrates a 51% average speedup.
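The following C fragment illustrates, at the source level and under our own simplifications, the duplicate-and-check flavor of such single-threaded techniques; SWIFT itself operates on instructions and registers inside the compiler, so this is not the paper's transformation.

    /* Conceptual sketch only: computation is performed twice in
     * independent variables and the copies are compared before the value
     * escapes to memory, which is assumed to be protected by ECC. */
    extern void fault_detected(void);   /* hypothetical fault handler */

    void store_sum(int *out, int a, int b) {
        int sum1 = a + b;     /* original computation   */
        int sum2 = a + b;     /* redundant copy         */
        if (sum1 != sum2)     /* check before the store */
            fault_detected();
        *out = sum1;          /* value leaves the sphere of replication */
    }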
Practical and Accurate Low-Level Pointer Analysis [abstract] (IEEE Xplore, PDF)
Bolei Guo, Matthew J. Bridges, Spyridon Triantafyllis, Guilherme Ottoni, Easwaran Raman, and David I. August
Proceedings of the Third International Symposium on
Code Generation and Optimization (CGO), March 2005.
Pointer analysis is traditionally performed once, early in the
compilation process, upon an intermediate representation (IR) with
source-code semantics. However, performing pointer analysis only once
at this level imposes a phase-ordering constraint, causing alias
information to become stale after subsequent code transformations.
Moreover, high-level pointer analysis cannot be used at link time or
run time, where the source code is unavailable. This paper advocates performing pointer analysis on a low-level
intermediate representation. We present the first context-sensitive
and partially flow-sensitive points-to analysis designed to operate at
the assembly level. As we will demonstrate, low-level pointer analysis
can be as accurate as high-level analysis. Additionally, our low-level
pointer analysis also enables a quantitative comparison of propagating
high-level pointer analysis results through subsequent code
transformations, versus recomputing them at the low level. We show
that the former practice is considerably less accurate than the latter.
Decoupled Software Pipelining: A Promising Technique to Exploit Thread-Level Parallelism [abstract]
Guilherme Ottoni, Ram Rangan, Neil Vachharajani, and David I. August
Proceedings of the Fourth Workshop on Explicitly Parallel
Instruction Computer Architectures and Compiler Technology (EPIC), March 2005.
Processor manufacturers are moving to multi-core,
multi-threaded designs because of several factors such as cost, ease
of design and scalability. As most processors will be multi-threaded
in the future, exposing thread-level parallelism (TLP) is a problem of
increasing importance. Because the adequate granularity of the threads
is dependent on the target architecture, and writing sequential
applications is usually more natural, the compiler plays an important
role in performing the mapping from applications to the appropriate
multi-threaded code. In spite of this, few general-purpose compilation
techniques have been proposed to assist in this task. In this paper,
we propose Decoupled Software Pipelining (DSWP) to extract
thread-level parallelism. DSWP can convert most application loops into
a pipeline of loop threads. This brings pipeline parallelism to most
application loops including those not targeted by traditional software
pipelining. DSWP does not rely on complex hardware speculation support
since it is a non-speculative transformation. This paper describes the
DSWP technique, discusses its implementation in a compiler, and
presents experimental results demonstrating that it is a promising
technique to extract TLP.
Compiler Optimization-Space Exploration [abstract] (PDF)
Spyridon Triantafyllis, Manish Vachharajani, and David I. August
The Journal of Instruction-level Parallelism (JILP), February 2005.
To meet the performance demands of modern architectures, compilers
incorporate an ever-increasing number of aggressive code
transformations. Since most of these transformations are not
universally beneficial, compilers traditionally control their
application through predictive heuristics, which attempt to
judge an optimization's effect on final code quality a priori.
However, complex target architectures and unpredictable optimization
interactions severely limit the accuracy of these judgments, leading
to performance degradation because of poor optimization decisions.
This performance loss can be avoided through the iterative
compilation approach, which advocates exploring many optimization
options and selecting the best one a posteriori. However,
existing iterative compilation systems suffer from excessive compile
times and narrow application domains. By overcoming these
limitations, Optimization-Space Exploration (OSE) becomes the first
iterative compilation technique suitable for general-purpose
production compilers. OSE narrows down the space of optimization
options explored through limited use of heuristics. A compiler tuning
phase further limits the exploration space. At compile time, OSE
prunes the remaining optimization configurations in the search space
by exploiting feedback from earlier configurations tried. Finally,
rather than measuring actual runtimes, OSE compares optimization
outcomes through static performance estimation, further enhancing
compilation speed. An OSE-enhanced version of Intel's reference
compiler for the Itanium architecture yields a performance improvement
of more than 20% for some SPEC benchmarks.
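The a-posteriori selection loop at the heart of OSE can be sketched as below; compile_with() and estimate_cycles() are placeholders of our own invention, not compiler entry points from the paper.

    /* Hypothetical sketch: try a small set of promising optimization
     * configurations for a hot code segment and keep the version with
     * the best static performance estimate. */
    struct code;    /* compiled form of a code segment (opaque here) */
    struct config;  /* one optimization configuration (opaque here)  */

    extern struct code *compile_with(const struct config *cfg);
    extern double estimate_cycles(const struct code *c);  /* static estimator */

    struct code *ose_select(const struct config *const *cfgs, int ncfgs) {
        struct code *best = 0;
        double best_cost = 1e300;
        for (int i = 0; i < ncfgs; i++) {
            struct code *c = compile_with(cfgs[i]);
            double cost = estimate_cycles(c);  /* no runtime measurement */
            if (cost < best_cost) { best_cost = cost; best = c; }
        }
        return best;  /* the best alternative is selected a posteriori */
    }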
RIFLE: An Architectural Framework for User-Centric Information-Flow Security [abstract] (ACM DL, PDF)
Neil Vachharajani, Matthew J. Bridges, Jonathan Chang, Ram Rangan, Guilherme Ottoni, Jason A. Blome, George A. Reis, Manish Vachharajani, and David I. August
Proceedings of the 37th International Symposium on
Microarchitecture (MICRO), December 2004.
Even as modern computing systems allow the manipulation and
distribution of massive amounts of information, users of these systems
are unable to manage the confidentiality of their data in a practical
fashion. Conventional access control security mechanisms cannot
prevent the illegitimate use of privileged data once access is
granted. For example, information provided by a user during an online
purchase may be covertly delivered to malicious third parties by an
untrustworthy web browser. Existing information flow security
mechanisms do provide this assurance, but only for
programmer-specified policies enforced during program development as a
static analysis on special-purpose type-safe languages. Not only are
these techniques not applicable to many commonly used programs, but
they leave the user with no defense against malicious programmers or
altered binaries. In this paper, we propose RIFLE, a runtime information flow security
system designed from the user's perspective. By addressing
information flow security using architectural support, RIFLE gives
users a practical way to enforce their own information flow security
policy on all programs. We prove that, contrary to statements in the
literature, runtime systems like RIFLE are no less secure than
existing language-based techniques. Using a model of the
architectural framework and a binary translator, we demonstrate
RIFLE's correctness and illustrate that the performance cost is
reasonable.
Rapid Development of Flexible Validated Processor Models [abstract] (PDF)
David A. Penry, Manish Vachharajani, and David I. August
Liberty Research Group Technical Report 04-03, November 2004.
For a variety of reasons, most architectural evaluations use
simulation models. An accurate baseline model validated against
existing hardware provides confidence in the results of these
evaluations. Meanwhile, a meaningful exploration of the design space
requires a wide range of quickly-obtainable variations of the
baseline. Unfortunately, these two goals are generally considered to
be at odds; the set of validated models is considered exclusive of the
set of easily malleable models. Vachharajani et al. challenge this
belief and propose a modeling methodology they claim allows rapid
construction of flexible validated models. Unfortunately, they only
present anecdotal and secondary evidence to support their claims. In this paper, we present our experience using this methodology to
construct a validated flexible model of Intel's Itanium 2 processor.
Our practical experience lends support to the above claims. Our
initial model was constructed by a single researcher in only 11 weeks
and predicts processor cycles-per-instruction (CPI) to within 7.9% on
average for the entire SPEC CINT2000 benchmark suite. Our experience
with this model showed us that aggregate accuracy for a metric like
CPI is not sufficient. Aggregate measures like CPI may conceal
remaining internal "offsetting errors" which can adversely affect
conclusions drawn from the model. Using this as our motivation, we
explore the flexibility of the model by modifying it to target
specific error constituents, such as front-end stall errors. In
2 1/2 person-weeks, average CPI error was reduced to 5.4%.
The targeted error constituents were reduced more dramatically;
front-end stall errors were reduced from 5.6% to 1.6%. The swift
implementation of significant new architectural features on this model
further demonstrated its flexibility.
Microarchitecture Modeling for Design-space Exploration [abstract] (PDF)
Manish Vachharajani
Ph.D. Thesis, Department of Electrical Engineering,
Princeton University, November 2004.
To identify the best processor designs, designers explore a vast
design space. To assess the quality of candidate designs, designers
construct and use simulators. Unfortunately, simulator construction
is a bottleneck in this design-space exploration because existing
simulator construction methodologies lead to long simulator
development times. This bottleneck limits exploration to a small set
of designs, potentially diminishing quality of the final design.
This dissertation describes a method to rapidly construct high-quality
simulators. In particular, it examines structural modeling as a means
to reduce construction time because it eliminates redundant effort
required to manage design complexity in many modeling approaches,
including that of programming a simulator in a sequential language.
The dissertation also describes how to overcome common limitations in
structural modeling that increase construction time by precluding
amortization of component specification effort across models via
component-based reuse. First, some design portions that are
time-consuming to specify, chief among them the logic for managing
stall signals (i.e., timing-control), do not benefit from reuse.
Second, components flexible enough to enjoy reuse are often not
reused in practice because of the number of details that must be
understood in order to instantiate them. This dissertation presents techniques that address both issues.
The techniques presented in this dissertation are embodied in the Liberty Simulation Environment (LSE). LSE users have seen over an order of magnitude reduction in model development time. Their models were of high-quality; they were accurate, had adequate simulation speeds, and were compatible with static techniques for visualization and optimization.
Short model construction times allow more ideas to be explored in less time. This leads to shorter product time-to-market and more thorough design-space exploration. This also allows researchers to evaluate new design ideas faster, to efficiently evaluate ideas in the context of many designs, and, perhaps, develop a more fundamental understanding of microarchitecture due to these broader evaluations.
Decoupled Software Pipelining with the Synchronization Array [abstract] (IEEE Xplore, PDF)
Ram Rangan, Neil Vachharajani, Manish Vachharajani, and David I. August
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2004.
Highest ranked paper in double-blind review process.
Despite the success of instruction-level parallelism
(ILP) optimizations in increasing the performance of microprocessors,
certain codes remain elusive. In particular, codes containing
recursive data structure (RDS) traversal loops have been largely
immune to ILP optimizations, due to the fundamental serialization and
variable latency of the loop-carried dependence through a
pointer-chasing load. To address these and other situations, we
introduce decoupled software pipelining (DSWP), a technique that
statically splits a single-threaded sequential loop into multiple
non-speculative threads, each of which performs useful computation
essential for overall program correctness. The resulting threads
execute on thread-parallel architectures such as simultaneous
multithreaded (SMT) cores or chip multiprocessors (CMP), expose
additional instruction level parallelism, and tolerate latency better
than the original single-threaded RDS loop. To reduce overhead, these
threads communicate using a synchronization array, a
dedicated hardware structure for pipelined inter-thread
communication. DSWP used in conjunction with the synchronization array
achieves an 11% to 76% speedup in the optimized functions on both
statically and dynamically scheduled processors.
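Conceptually, DSWP turns a pointer-chasing loop into a two-stage pipeline; the sketch below illustrates the decoupling with hypothetical produce()/consume() calls standing in for the synchronization array, and is not code from the paper.

    /* Conceptual sketch: the original loop
     *     for (n = head; n; n = n->next) total += work(n->value);
     * is split into a traversal thread and a work thread that
     * communicate through a queue. */
    #include <stddef.h>

    struct node { struct node *next; int value; };

    extern void produce(int v);   /* hypothetical: send to consumer thread */
    extern int  consume(void);    /* hypothetical: receive from producer   */
    extern int  work(int v);
    #define DONE (-1)

    void traversal_thread(struct node *head) {   /* stage 1: chases pointers */
        for (struct node *n = head; n != NULL; n = n->next)
            produce(n->value);
        produce(DONE);
    }

    long work_thread(void) {                     /* stage 2: does the work */
        long total = 0;
        for (int v = consume(); v != DONE; v = consume())
            total += work(v);
        return total;
    }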
Facilitating Reuse in Hardware Models with Enhanced Type Inference [abstract] (ACM DL, PDF)
Manish Vachharajani, Neil Vachharajani, Sharad Malik, and David I. August
The IEEE/ACM/IFIP Second International Conference on Hardware/Software Codesign and System Synthesis (ISSS), September 2004.
High-level hardware modeling is an essential, yet time-consuming, part
of system design. However, effective component-based reuse in
hardware modeling languages can reduce model construction time and
enable the exploration of more design alternatives, leading to better
designs. While component overloading and parametric polymorphism are
critical for effective component-based reuse, no existing modeling
language supports both. The lack of these features creates overhead
for designers that discourages reuse, negating its benefits.
This paper first presents a type system which supports both component
overloading and parametric polymorphism. Then, it proves that
performing type inference for any such system is NP-complete and
presents a heuristic that works efficiently in practice. The result
is a type system and type inference algorithm that, when deployed, can
encourage reuse, reduce design specification time, and lead to better
designs.
The Liberty Structural Specification Language: A High-Level
Modeling Language for Component Reuse [abstract] (ACM DL, PDF)
Manish Vachharajani, Neil Vachharajani, and David I. August
Proceedings of the 2004 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2004.
Rapid exploration of the design space with simulation models is
essential for quality hardware systems research and development.
Despite striking commonalities across hardware systems, designers
routinely fail to achieve high levels of reuse across models
constructed in existing general-purpose and domain-specific languages.
This lack of reuse adversely impacts hardware system design by slowing
the rate at which ideas are evaluated. This paper presents an
examination of existing languages to reveal their fundamental
limitations regarding reuse in hardware modeling. With this
understanding, a solution is described in the context of the design
and implementation of the Liberty Structural Specification Language
(LSS), the input language for a publicly available high-level
digital-hardware modeling tool called the Liberty Simulation
Environment. LSS is the first language to enable low-overhead reuse
by simultaneously supporting static inference based on hardware
structure and flexibility via parameterizable structure.
Through LSS, this paper also introduces a new type inference algorithm
and a new programming language technique, called use-based
specialization, which, in a manner analogous to type inference,
customizes reusable components by statically inferring structural
properties that otherwise would have had to have been specified
manually.
Exposing Memory Access Regularities Using Object-Relative Memory Profiling [abstract] (ACM DL, PDF)
Qiang Wu, Artem Pyatakov, Alexey N. Spiridonov, Easwaran Raman, Douglas W. Clark, and David I. August
Proceedings of the Second International Symposium on Code
Generation and Optimization (CGO), March 2004.
Memory profiling is the process of characterizing a program's memory
behavior by observing and recording its response to specific input
sets. Relevant aspects of the program's memory behavior may then be
used to guide memory optimizations in an aggressively optimizing
compiler. In general, memory access behavior has eluded meaningful
characterization because of confounding artifacts from memory
allocators, linker data layout, and OS memory management. Since these
artifacts may change from run to run, memory access patterns may
appear different in each run even for the same input set. Worse,
regular memory access behavior such as linked list traversals appears
to have no structure. In this paper we present object-relative translation and
decomposition techniques to eliminate these artifacts and to expose
previously obscured memory access patterns. To demonstrate the
potential of these ideas, we implement two different memory profilers
targeted at different sets of applications. These profilers
outperform the existing ones in terms of profile size and useful
information per byte of data. The first profiler is a lossless
profiler, called WHOMP, which uses object-relativity to achieve a 22%
better compression than the previously best known scheme. The second
profiler, called LEAP, uses lossy compression to get highly compact
profiles while providing useful information to the targeted
applications. LEAP correctly characterizes the memory alias rates for
56% more instruction pairs than the previously best known scheme with
a practical running time.
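The core translation step can be sketched as below; find_object() and the record layout are our own placeholders for the profiler's allocation map, not the WHOMP or LEAP implementation.

    /* Hypothetical sketch: object-relative translation. Instead of raw
     * addresses, which shift between runs, the profiler records an
     * (object id, offset) pair, exposing stable access patterns such as
     * repeated traversal of the same field. */
    #include <stddef.h>

    struct obj_record { unsigned long id, base, size; };

    /* Placeholder for the profiler's lookup in its allocation map. */
    extern const struct obj_record *find_object(unsigned long addr);

    int object_relative(unsigned long addr, unsigned long *id, unsigned long *off) {
        const struct obj_record *o = find_object(addr);
        if (o == NULL) return 0;     /* address not in a tracked object */
        *id  = o->id;
        *off = addr - o->base;       /* offset is stable across runs */
        return 1;
    }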
The Liberty Simulation Environment, Version 1.0 [abstract] (ACM DL, PDF)
Manish Vachharajani, Neil Vachharajani, David A. Penry, Jason Blome, and David I. August
Performance Evaluation Review: Special Issue on Tools for Architecture Research (PER), Volume 31, Number 4, March 2004.
Invited.
High-level hardware modeling via simulation is an essential step in
hardware systems design and research. Despite the importance of
simulation, current model creation methods are error prone and are
unnecessarily time consuming. To address these problems, we have
publicly released the Liberty Simulation Environment (LSE), Version
1.0, consisting of a simulator builder and automatic visualizer based
on a shared hardware description language. LSE's design was motivated
by a careful analysis of the strengths and weaknesses of existing
systems. This has resulted in a system in which models are easier to
understand, faster to develop, and have performance on par with other
systems. LSE is capable of modeling any synchronous hardware
system. To date, LSE has been used to simulate and convey ideas about
a diverse set of complex systems including a chip multiprocessor
out-of-order IA-64 machine and a multiprocessor system with detailed
device models.
The Liberty Simulation Environment: A Deliberate
Approach to High-Level System Modeling [abstract] (PDF)
Manish Vachharajani, Neil Vachharajani, David A. Penry, Jason A. Blome, Sharad Malik, and David I. August
Liberty Research Group Technical Report 04-02, March 2004.
In digital hardware system design, the quality of the product is
directly related to the number of meaningful design alternatives
properly considered. Unfortunately, existing modeling methodologies
and tools have properties which make them less than ideal for rapid
and accurate design-space exploration. This article identifies and
evaluates the shortcomings of existing methods to motivate the Liberty
Simulation Environment (LSE). LSE is a high-level modeling tool
engineered to address these limitations, allowing for the rapid
construction of accurate high-level simulation models. LSE simplifies
model specification with low-overhead component-based reuse techniques
and an abstraction for timing control. As part of a detailed
description of LSE, this article presents these features, their impact
on model specification effort, their implementation, and optimizations
created to mitigate their otherwise deleterious impact on simulator
execution performance.
A Study of the Clarity of Functionally and Structurally
Composed High-level Simulation Models (PDF)
Manish Vachharajani and David I. August
Liberty Research Group Technical Report 04-01, January 2004.
Liberty Simulation Environment, Version 1.0
Manish Vachharajani, David A. Penry, Neil Vachharajani, Jason A. Blome, and David I. August
Available at http://liberty.princeton.edu/Software/LSE, December 2003.
A Comparison of Reuse in Object-oriented Programming and Structural Modeling Systems [abstract] (PDF)
Manish Vachharajani, Neil Vachharajani, and David I. August
Liberty Research Group Technical Report 03-01, October 2003.
Modeling systems in which users construct models by specifying the
interconnect between concurrently executing components are a natural
fit for hardware modeling. These concurrent-structural modeling
systems allow the specification of the model to parallel that of the
hardware making specifications clear, easy to understand, and
straight-forward to build and modify.
Unfortunately, many concurrent-structural modeling systems provide few
mechanisms to create flexible, reusable components. Others have built
these modeling systems on object-oriented programming (OOP) languages
in the hope that the OOP features will allow the creation of useful
flexible reusable components for structural modeling. Unfortunately,
simply adopting OOP features is insufficient for this task. By
drawing an analogy between OOP programs and concurrent-structural
models, it is possible to understand why. Furthermore, this analogy
highlights what would be needed in a concurrent-structural system to
gain the same benefits seen in OOP systems. This report compares the
features necessary to obtain reuse in concurrent-structural modeling
systems by drawing this analogy between object-oriented programming
constructs and concurrent-structural modeling constructs.
The Liberty Simulation Environment as a Pedagogical Tool [abstract] (PDF)
Jason Blome, Manish Vachharajani, Neil Vachharajani, and David I. August
Proceedings of the Workshop on Computer Architecture
Education (WCAE), June 2003.
This paper describes how the Liberty Simulation
Environment (LSE) and its graphical visualizer can be used in a
computer architecture course. LSE allows for the rapid construction
of simulators from models that resemble the structure of hardware. By
using and modifying LSE models, students can develop a solid
understanding of and learn to reason about computer architecture.
Since LSE models are also relatively easy to modify, the tool can be
used as the basis of meaningful assignments, allowing students to
explore a variety of microarchitectural concepts on their own. In
lectures where block diagrams are typically displayed, LSE's
visualizer can be used instead to not only show block diagrams, but to
demonstrate the machine in action. As a result, LSE can ease the
burden of conveying complex microarchitectural design concepts,
greatly improving the depth of understanding a computer architecture
course provides.
Optimizations for a Simulator Construction System Supporting
Reusable Components [abstract] (ACM DL, PDF)
David A. Penry and David I. August
Proceedings of the 40th Design Automation
Conference (DAC), June 2003.
Exploring a large portion of the microprocessor design
space requires the rapid development of efficient simulators. While some
systems support rapid model development through the structural
composition of reusable concurrent components, the Liberty Simulation
Environment (LSE) provides additional reuse-enhancing features. This
paper evaluates the cost of these features and presents optimizations
to reduce their impact. With these optimizations, an LSE model using
reusable components outperforms a SystemC model using custom
components by 6%.
Compiler Optimization-Space Exploration [abstract] (IEEE Xplore, PDF)
Spyridon Triantafyllis, Manish Vachharajani, Neil Vachharajani, and David I. August
Proceedings of the 2003 International Symposium on Code
Generation and Optimization (CGO), March 2003.
Winner Best Paper Award.
To meet the demands of modern architectures, optimizing
compilers must incorporate an ever larger number of increasingly
complex transformation algorithms. Since code transformations may
often degrade performance or interfere with subsequent
transformations, compilers employ predictive heuristics to guide
optimizations by predicting their effects a
priori. Unfortunately, the unpredictability of optimization
interaction and the irregularity of today's wide-issue machines
severely limit the accuracy of these heuristics. As a result, compiler
writers may temper high-variance optimizations with overly
conservative heuristics or may exclude these optimizations
entirely. While this process results in a compiler capable of
generating good average code quality across the target benchmark set,
it comes at the cost of missed optimization opportunities in individual
code segments. To replace predictive heuristics, researchers have
proposed compilers which explore many optimization options, selecting
the best one a posteriori. Unfortunately, these existing
iterative compilation techniques are not practical for reasons of
compile time and applicability.
In this paper, we present the Optimization-Space Exploration (OSE)
compiler organization, the first practical iterative compilation
strategy applicable to optimizations in general-purpose
compilers. Instead of replacing predictive heuristics, OSE uses the
compiler writer's knowledge encoded in the heuristics to select a
small number of promising optimization alternatives for a given code
segment. Compile time is limited by evaluating only these alternatives
for hot code segments using a general compile-time performance
estimator. An OSE-enhanced version of Intel's highly-tuned,
aggressively optimizing production compiler for IA-64 yields a
significant performance improvement, more than 20% in some cases, on
Itanium for SPEC codes.
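The following sketch uses hypothetical names and a made-up cost model,
not the paper's implementation; it only shows the shape of an
OSE-style compilation loop as the abstract describes it: predictive
heuristics prune the optimization space to a few promising
configurations, only hot code segments are explored, and a
compile-time estimator chooses among the alternatives a posteriori.

    from dataclasses import dataclass
    from itertools import product

    @dataclass
    class Segment:
        name: str
        profile_weight: float   # fraction of execution time spent here
        trip_count: int         # a toy property the heuristics can inspect

    def candidate_configs():
        # A tiny optimization space: unroll factor x if-conversion on/off.
        return [{"unroll": u, "if_convert": ic} for u, ic in product((1, 2, 4), (False, True))]

    def heuristic_score(seg, cfg):
        # Predictive heuristic (a priori): guess that hot, high-trip-count code likes unrolling.
        return cfg["unroll"] * min(seg.trip_count, 8) + (2 if cfg["if_convert"] else 0)

    def estimate_cycles(seg, cfg):
        # Stand-in for the compile-time performance estimator: a made-up static model,
        # evaluated a posteriori on each compiled alternative, with no execution needed.
        base = 1000 * seg.profile_weight
        return base / (1 + 0.1 * cfg["unroll"]) + (25 if cfg["if_convert"] else 40)

    def ose_compile(segments, hot_threshold=0.05, limit=3):
        chosen = {}
        for seg in segments:
            if seg.profile_weight < hot_threshold:
                chosen[seg.name] = candidate_configs()[0]   # cold code: default settings
                continue
            # Hot code: let the heuristics pick a few promising alternatives...
            promising = sorted(candidate_configs(),
                               key=lambda c: heuristic_score(seg, c),
                               reverse=True)[:limit]
            # ...then keep the one the estimator likes best.
            chosen[seg.name] = min(promising, key=lambda c: estimate_cycles(seg, c))
        return chosen

    if __name__ == "__main__":
        segs = [Segment("hot_loop", 0.60, trip_count=128),
                Segment("init", 0.01, trip_count=4)]
        print(ose_compile(segs))

The key structural point is that the heuristics are kept, but demoted
from deciding the outcome to bounding the search, which is how compile
time stays practical.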
Microarchitectural Exploration with Liberty [abstract] (IEEE Xplore, PDF)
Manish Vachharajani, Neil Vachharajani, David A. Penry, Jason A. Blome, and David I. August
Proceedings of the 35th International Symposium on
Microarchitecture (MICRO), November 2002.
Winner Best Student Paper Award.
To find the best designs, architects must rapidly
simulate many design alternatives and have confidence in the
results. Unfortunately, the most prevalent simulator construction
methodology, hand-writing monolithic simulators in sequential
programming languages, yields simulators that are hard to retarget,
limiting the number of designs explored, and hard to understand,
instilling little confidence in the model. Simulator construction
tools have been developed to address these problems, but analysis
reveals that they do not address the root cause, the error-prone
mapping between the concurrent, structural hardware domain and the
sequential, functional software domain. This paper presents an
analysis of these problems and their solution, the Liberty Simulation
Environment (LSE). LSE automatically constructs a simulator from a
machine description that closely resembles the hardware, ensuring
fidelity in the model. Furthermore, through a strict but general
component communication contract, LSE enables the creation of highly
reusable component libraries, easing the task of rapidly exploring
ever more exotic designs.
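As a toy example of why a uniform communication contract aids reuse
(the Channel name and handshake below are hypothetical, in the spirit
of such a contract rather than LSE's actual interface): when every
connection obeys the same small handshake, any producer component can
be wired to any consumer without case-by-case glue code.

    class Channel:
        """One uniform handshake for every connection: the producer offers
        (data, enable); a transfer happens only if the consumer is ready."""
        def __init__(self):
            self.data = None
            self.enable = False
            self.ack = False

    def transfer(ch, consumer_ready):
        # The same contract is evaluated identically for every connection in a model.
        ch.ack = ch.enable and consumer_ready
        return ch.data if ch.ack else None

    # Two unrelated library components can be wired together without custom glue.
    alu_to_cache = Channel()
    alu_to_cache.data, alu_to_cache.enable = 42, True
    print(transfer(alu_to_cache, consumer_ready=True))    # -> 42
    print(transfer(alu_to_cache, consumer_ready=False))   # -> None (no transfer this cycle)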
A Disciplined Approach to the Development of Platform
Architectures [abstract] (CiteSeerX, PDF)
David I. August, Kurt Keutzer, Sharad Malik, and A. Richard Newton
Microelectronics Journal, Volume 33, Number 11, November 2002.
Invited.
Silicon capability has enabled the embedding of an entire
system on a single silicon die. These devices are known as
systems-on-a-chip. Currently, the design of these devices is
undisciplined, expensive, and risky. One way of amortizing the cost
and ameliorating this design risk is to make a single integrated
circuit serve multiple applications, and the natural way of enabling
this is through end-user programmability. The aim of the MESCAL
project, which is the subject of this paper, is to introduce a
disciplined approach to producing reusable architectural platforms
that can be easily programmed to meet a variety of
applications. (MESCAL stands for Modern Embedded Systems, Compilers,
Architectures, and Languages.)
Procedure Boundary Elimination for EPIC Compilers [abstract] (CiteSeerX, PDF)
Spyridon Triantafyllis, Manish Vachharajani, and David I. August
Proceedings of the Second Workshop on Explicitly Parallel
Instruction Computer Architectures and Compiler Technology (EPIC), November 2002.
Procedures are the basic units of compilation in
traditional optimization frameworks. This presents problems to
compilers targeting EPIC architectures, since the limited scope of a
single procedure is usually insufficient for extracting ILP and
identifying enough optimization opportunities. Although inlining can
expand the scope of optimization routines, it is not applicable to all
call sites and can cause excessive code growth, which can in turn
adversely affect cache performance and compile-time resource usage. In
this paper we propose a novel compilation strategy called Procedure
Boundary Elimination (PBE). PBE unifies the whole program into a
single compilation unit, which is then restructured into units better
suited to optimization than the original procedures. A targeted
specialization phase exposes further optimization opportunities while
limiting code growth only to the cases where it is beneficial. Unlike
inlining, PBE can eliminate all procedure calls while avoiding the
cost of excessive code growth.
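As a rough illustration of the three steps the abstract names, the
sketch below (toy data structures and hypothetical names, not the
paper's compiler) unifies blocks from two procedures into one unit,
re-partitions them into regions by hotness rather than by their
original procedures, and duplicates a callee only at hot call sites.

    from collections import defaultdict

    # A toy whole-program view: block -> (owning procedure, profile weight),
    # plus call edges between blocks.
    blocks = {
        "main.entry": ("main", 100), "main.loop": ("main", 900),
        "util.fast":  ("util", 850), "util.slow": ("util", 50),
    }
    calls = [("main.loop", "util.fast"), ("main.entry", "util.slow")]

    def unify(blocks):
        # Step 1: procedure boundaries disappear; every block lives in one unit.
        return set(blocks)

    def repartition(unit, blocks, calls, hot=500):
        # Step 2: group blocks into regions by hotness and call affinity,
        # not by the procedures they came from.
        region = {b: b for b in unit}
        for caller, callee in calls:
            if blocks[caller][1] >= hot and blocks[callee][1] >= hot:
                region[callee] = region[caller]     # hot caller/callee share a region
        groups = defaultdict(list)
        for b, r in region.items():
            groups[r].append(b)
        return list(groups.values())

    def specialize(calls, blocks, hot=500):
        # Step 3: duplicate a callee only for hot call sites, so code growth is
        # limited to cases where it is likely to be profitable.
        return [f"{callee}@{caller}" for caller, callee in calls
                if blocks[caller][1] >= hot]

    unit = unify(blocks)
    print("regions:", repartition(unit, blocks, calls))
    print("specialized copies:", specialize(calls, blocks))

Unlike inlining, nothing here forces the hot callee's body to be
copied into every caller; duplication is a separate, targeted decision
made after the procedure boundaries are already gone.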
Design Tools for Application Specific Embedded Processors [abstract] (SpringerLink, PDF)
Wei Qin, Subramanian Rajagopalan, Manish Vachharajani, Hangsheng Wang, Xinping Zhu, David I. August, Kurt Keutzer, Sharad Malik, and Li-Shiuan Peh
Proceedings of the Second International Workshop on
Embedded Software, Lecture Notes in Computer Science (EMSOFT), Volume 2491, October 2002.
Invited.
A variety of factors make it increasingly difficult and
expensive to design and manufacture traditional Application Specific
Integrated Circuits (ASICs). Consequently, programmable alternatives
are more attractive than ever. The flexibility provided by
programmability comes with a performance and power overhead. This can
be significantly mitigated by using application specific platforms,
also referred to as Application Specific Embedded Processors, or
Application Specific Instruction Set Processors (ASIPs). ASIPs, and the
embedded software applications running on them, require specialized
design tools, both during architectural evaluation, to provide
feedback on the suitability of the architecture for the application,
and during system implementation, to ensure efficient mapping
and validation of design constraints. These functions result in
requirements different from those of traditional software development
environments. The first requirement is retargetability, especially
during the early architectural evaluation stage where a rapid
examination of design alternatives is essential. The second
requirement is for additional metrics such as power consumption,
real-time constraints and code size. This paper describes a set of
design tools and associated methodology designed to meet the
challenges posed by architectural evaluation and software
synthesis. This work is part of the MESCAL (Modern Embedded Systems,
Compilers, Architectures, and Languages) project.
A Disciplined Approach to the Development of Platform
Architectures [abstract] (PDF, Elsevier)
David I. August, Kurt Keutzer, Sharad Malik, and A. Richard Newton
Proceedings of the Tenth Workshop on Synthesis and System
Integration of Mixed Technologies (SASIMI), April 2001.
Invited.
Silicon capability has enabled the embedding of an entire
system on a single silicon die. These devices are known as
systems-on-a-chip (SOC). Currently, the design of these devices is
undisciplined, expensive, and risky. One way of amortizing the cost
and ameliorating this design risk is to make a single integrated
circuit serve multiple applications, and the natural way of enabling
this is through end-user programmability. The aim of the MESCAL
project, which is the subject of this paper, is to introduce a
disciplined approach to producing reusable architectural platforms
that can be easily programmed to meet a variety of
applications. (MESCAL stands for Modern Embedded Systems, Compilers,
Architectures, and Languages.)