Publications
Revisiting Computation for Research: Practices and Trends [abstract]
Jeremiah Giordani*, Ziyang Xu*, Ella Colby, August Ning, Bhargav Reddy Godala, Ishita Chaturvedi, Shaowei Zhu, Yebin Chon, Greg Chan, Zujun Tan, Galen Collier, Jonathan D. Halverson, Enrico Armenio Deiana, Jasper Liang, Federico Sossai, Yian Su, Atmn Patel, Bangyen Pham, Nathan Greiner, Simone Campanoni, and David I. August
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), November 2024.
(*Co-first authors)
PROMPT: A Fast and Extensible Memory Profiling Framework [abstract] (arXiv)
Ziyang Xu, Yebin Chon, Yian Su, Zujun Tan, Sotiris Apostolakis, Simone Campanoni, and David I. August
Proceedings of the ACM on Programming Languages, Volume 8, Issue OOPSLA (OOPSLA), October 2024.
GhOST: a GPU Out-of-Order Scheduling Technique for Stall Reduction [abstract] (PDF)
Ishita Chaturvedi, Bhargav Reddy Godala, Yucan Wu, Ziyang Xu, Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis, Sotirios Xydis, Dimitrios Soudris, Tyler Sorensen, Simone Campanoni, Tor M. Aamodt, and David I. August
Proceedings of the 51st International Symposium on Computer Architecture (ISCA), June 2024.
Awarded all top ACM Reproducibility Badges offered by the Artifact Evaluation Committee.
PDIP: Priority Directed Instruction Prefetching [abstract] (PDF)
Bhargav Reddy Godala, Sankara Prasad Ramesh, Gilles A. Pokam, Jared Stark, Andre Seznec, Dean Tullsen, and David I. August
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 2024.
Awarded Best Paper (one of six) out of 922 submissions.
Awarded all top ACM Reproducibility Badges offered by the Artifact Evaluation Committee.
PROMPT: A Fast and Extensible Memory Profiling Framework
Ziyang Xu, Yebin Chon, Yian Su, Zujun Tan, Sotiris Apostolakis, Simone Campanoni, and David I. August
Available at https://github.com/PrincetonUniversity/PROMPT/, January 2024.
Transpilation Utilizing Language-Agnostic IR and Interactivity for Parallelization [abstract] (PDF)
Zujun Tan
Ph.D. Thesis, Department of Computer Science,
Princeton University, 2024.
Migrating codes between architectures is difficult because different execution models require different types of parallelism for optimal performance. Previous approaches, like libraries or source-level tools, generate correct and natural-looking syntax for the new parallel model with limited optimization and largely leave performance engineering to the programmer. Recent approaches, such as transpilation at the compiler intermediate representation (IR) level, can automate performance engineering, but profitability can be limited by not having facts known only to the programmer. Decompiling the optimized program could leverage the strength of existing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with arbitrary compilers of the source language.
This thesis provides a method for migrating code such that the compiler and programmer work together to generate code with optimal performance. To achieve this,
it introduces Tulip, which retargets parallelism via IR-level transpilation. Transpiling at the IR level enables Tulip to generalize the transformations applied to retarget parallelism. Furthermore, Tulip integrates a state-of-the-art automatic parallelization framework to explore additional parallelism expressible only in the target parallel programming model. It then generates natural source code through a novel decompiler, SPLENDID, in a high-level parallel programming language (OpenMP), which can be interactively optimized and tuned with programmer intervention. For 19 Polybench benchmarks, Tulip-generated OpenMP offloading programs perform 14% faster than the original CUDA sources on NVIDIA GPUs. Moreover, transpilation to the CPU leads to a 2.9x speedup over the best state-of-the-art transpiler. Tulip-generated portable parallel code is also more natural than what existing decompilers produce, resulting in a 39x higher average BLEU score.
This thesis includes contributions from Yebin Chon, Ziyang Xu, Sophia Zhang, and David I. August from Princeton Liberty Research Group, Brian Homerding, Yian Su, and Simone Campanoni from Northwestern Arcana Lab, Michael Kruse (AMD), Johannes Doerfert (LLNL), William S. Moses (UIUC), and Ivan R. Ivanov (Tokyo Tech).
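To make the target concrete, here is a hand-written sketch of the kind of portable OpenMP offloading code a Tulip/SPLENDID-style transpiler aims to emit for a simple vector-add kernel. This is illustrative only, not actual tool output, and the function name and pragma choices are assumptions.

```c
/* Illustrative sketch only: the kind of natural OpenMP target-offloading
 * code a Tulip/SPLENDID-style transpiler aims to emit for a simple
 * CUDA-like vector-add kernel.  Not actual tool output. */
void vadd(const float *a, const float *b, float *c, int n) {
    /* Offload the loop to an accelerator and map the arrays explicitly. */
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```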
Criticality-Aware Front-end [abstract] (PDF)
Bhargav Reddy Godala
Ph.D. Thesis, Department of Computer Science,
Princeton University, 2024.
Code footprints continue to grow faster than instruction caches, putting additional pressure on existing front-end structures. Even with aggressive front-ends using fetch-directed instruction prefetching (FDIP), modern processors experience significant front-end stalls. With the end of Moore's Law, increasing cache sizes raises critical-path latency, yielding modest returns for scaling instruction cache sizes. This dissertation addresses front-end bottlenecks based on two key observations: in FDIP-enabled processors, cache misses have unequal costs, and a small fraction of critical instruction cache lines contributes to most of the front-end stalls.
EMISSARY, the first cost-aware replacement policy tailored to the L1 instruction cache (L1I), defies conventional wisdom: unlike traditional replacement policies, it improves performance even while increasing instruction cache misses. However, EMISSARY proves less effective on datacenter workloads with large code footprints, because such workloads have more critical lines than the L1I can hold. This dissertation first presents EMISSARY-L2, the first family of criticality-aware cache replacement policies designed specifically for datacenter workloads. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY-L2 resists evicting from L2 those cache lines whose misses cause costly decode starvations. In the context of a modern FDIP-enabled processor, EMISSARY-L2 delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) on datacenter workloads. This speedup is 21.6% of that obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses.
This dissertation then proposes Priority Directed Instruction Prefetching (PDIP), a novel cost-aware instruction prefetching technique that complements FDIP by issuing prefetches for targets along the resteer path where FDIP stalls occur. PDIP identifies these targets and associates them with a trigger for future prefetch. When paired with EMISSARY-L2, PDIP achieves a geomean IPC speedup of 3.7% across a set of datacenter workloads using a budget of only 43.5KB. PDIP achieves 62% of the ideal prefetching performance.
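As a rough illustration of the criticality-aware replacement idea, the following is a minimal software sketch with assumed data structures (not the EMISSARY-L2 hardware design): victim selection prefers lines whose misses have not caused decode starvation, falling back to LRU only when every resident line is critical.

```c
#include <stdint.h>

typedef struct {
    uint8_t  valid;
    uint8_t  critical;  /* set when a past miss on this line starved decode */
    uint32_t age;       /* larger = least recently used */
} Way;

/* Pick a victim way: free slots first, then the oldest non-critical line,
 * falling back to plain LRU only when every resident line is critical. */
int pick_victim(const Way *set, int ways) {
    int victim = -1;
    uint32_t oldest = 0;
    for (int w = 0; w < ways; ++w) {
        if (!set[w].valid) return w;                 /* free slot */
        if (!set[w].critical && set[w].age >= oldest) {
            oldest = set[w].age;
            victim = w;
        }
    }
    if (victim >= 0) return victim;                  /* cheap-to-miss line */
    for (int w = 0; w < ways; ++w)                   /* all critical: LRU */
        if (set[w].age >= oldest) { oldest = set[w].age; victim = w; }
    return victim;
}
```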
QPoints: QEMU to gem5 ARM Full System Checkpointing [abstract]
Bhargav Reddy Godala, Ishita Chaturvedi, Yucan Wu, Simone Campanoni, and David I. August
ISCA 2023: The gem5 Workshop, June 2023.
Modern Front-end Support in gem5 [abstract]
Bhargav Reddy Godala, Nayana Prasad Nagendra, Ishita Chaturvedi, Simone Campanoni, and David I. August
ISCA 2023: The gem5 Workshop, June 2023.
EMISSARY: Enhanced Miss Awareness Replacement Policy For L2 Instruction Caching [abstract] (PDF)
Nayana Prasad Nagendra*, Bhargav Reddy Godala*, Ishita Chaturvedi, Atmn Patel, Svilen Kanev, Tipp Moseley, Jared Stark, Gilles A. Pokam, Simone Campanoni, and David I. August
Proceedings of the 50th International Symposium on Computer Architecture (ISCA), June 2023.
(*Co-first authors)
SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development [abstract] (PDF)
Zujun Tan, Yebin Chon, Michael Kruse, Johannes Doerfert, Ziyang Xu, Brian Homerding, Simone Campanoni, and David I. August
Proceedings of the Twenty-Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2023.
Awarded all top ACM Reproducibility Badges offered by the Artifact Evaluation Committee.
LAMP: A Practical Loop-Aware Memory Dependence Profiler [abstract]
Yebin Chon, Ziyang Xu, Zujun Tan, and David I. August
The Fourth Young Architect Workshop (YArch), March 2022.
NOELLE Offers Empowering LLVM Extensions [abstract] (PDF)
Angelo Matni, Enrico Armenio Deiana, Yian Su, Lukas Gross, Souradip Ghosh, Sotiris Apostolakis, Ziyang Xu, Zujun Tan, Ishita Chaturvedi, Brian Homerding, Tommy McMichen, David I. August, and Simone Campanoni
Proceedings of the 2022 International Symposium on Code Generation
and Optimization (CGO), February 2022.
Awarded ACM Artifact Available, Artifact Functional, and Results Reproduced Badges.
Safer at Any Speed: Automatic Context-Aware Safety Enhancement for Rust [abstract] (PDF)
Natalie Popescu*, Ziyang Xu*, Sotiris Apostolakis, David I. August, and Amit Levy
Proceedings of the ACM on Programming Languages, Volume 5, Issue OOPSLA (OOPSLA), October 2021.
(*Co-first authors) Awarded all top ACM Reproducibility Badges offered by the Artifact Evaluation Committee.
Improving Instruction Cache Performance For Modern Processors With Growing Workloads [abstract] (PDF)
Nayana Prasad Nagendra
Ph.D. Thesis, Department of Computer Science,
Princeton University, September 2021.
Cloud computing is the backbone of today's digital world. Billions of devices, ranging from internet-of-things devices to mobile phones, connect to services running on massive data centers, dubbed Warehouse-Scale Computers (WSCs). Because of the scale at which WSCs operate, their performance has a huge cost impact: even the slightest percentage improvement in performance generates significant profits for the companies that operate them. These data centers also have a hugely adverse effect on the environment because of their massive carbon footprint.
Surprisingly, our work with Google shows significant inefficiencies in WSC processor utilization. More than two-thirds of CPU cycles are wasted in many forms. Specifically, nearly 20% of the wasted cycles come from the processor front-end, which is responsible for keeping the pipeline fed with useful instructions. Since a substantial fraction of front-end cycle wastage occurs due to instruction cache misses, this dissertation takes a full-stack approach to studying and solving those misses.
Data center workloads and their complexity have grown vastly. Consequently, programs' working set sizes are much larger than the instruction caches of modern processors. With the nearing end of Moore's Law, innovations are necessary to increase the efficiency of instruction cache usage for these growing workloads.
In this dissertation, we first present AsmDB, a fleet-wide study of Google workloads to understand processor front-end bottlenecks. To our knowledge, this is the first work to study the in-depth behavior of instruction caches at such a large scale. Among its many findings, AsmDB highlights inefficient usage of instruction caches caused by significant fragmentation, even in highly optimized code.
The second part of this dissertation describes Emissary, a new approach that alleviates this problem by using the available cache capacity more efficiently. Observing that not all cache misses cause stalls, Emissary employs a novel cache replacement algorithm that reserves cache capacity for lines whose misses cause the most significant performance impact. Emissary outperforms all known cache replacement algorithms in performance and energy consumption. In some cases, it produces speedups that exceed Belady's OPT, the perfect-knowledge, minimal-miss ideal cache replacement algorithm.
NOELLE Offers Empowering LLVM Extensions [abstract] (arXiv, PDF)
Angelo Matni, Enrico Armenio Deiana, Yian Su, Lukas Gross, Souradip Ghosh, Sotiris Apostolakis, Ziyang Xu, Zujun Tan, Ishita Chaturvedi, David I. August, and Simone Campanoni
arXiv:2102.05081 [cs.PL], February 2021.
A Sensible Approach to Speculative Automatic Parallelization [abstract] (PDF)
Sotiris Apostolakis
Ph.D. Thesis, Department of Computer Science,
Princeton University, January 2021.
The promise of automatic parallelization, freeing programmers from the error-prone and time-consuming process of making efficient use of parallel processing resources, remains unrealized. For decades, the imprecision of memory analysis limited the applicability of automatic parallelization. The introduction of speculation to automatic parallelization overcame these applicability limitations but caused profitability problems due to high communication and bookkeeping costs for speculation validation and commit.
This dissertation shifts the focus from applicability to profitability by making speculative automatic parallelization more efficient. Unlike current approaches that perform analysis and transformations independently and in sequence, the proposed system integrates, without loss of modularity, memory analysis, speculative techniques, and enabling transformations.
Specifically, this dissertation involves three main contributions. First, it introduces a novel speculation-aware collaborative analysis framework (SCAF) that minimizes the need for high-overhead speculation. Second, it proposes a new parallelizing-compiler design that enables careful planning. Third, it presents new efficient speculative privatization transformations that avoid privatization overheads of prior work.
SCAF learns of available speculative information via profiling, computes its full impact on memory dependence analysis, and makes this resulting information available for all code optimizations. SCAF is modular (adding new analysis modules is easy) and collaborative (modules cooperate to produce a result more precise than the confluence of all individual results). Relative to the best prior speculation-aware dependence analysis technique, SCAF dramatically reduces the need for expensive-to-validate memory speculation in the hot loops of all 16 evaluated C/C++ SPEC benchmarks.
The introduction of planning in the compiler enables the selection of parallelization-enabling transformations based on their cost and overall impacts. The benefit is twofold. First, the compiler only applies a minimal-cost set of enabling transformations. Second, this decision-making process makes the addition of new efficient speculative enablers possible, including the proposed privatization transformations.
This dissertation integrates these contributions into a fully-automatic speculative-DOALL parallelization framework for commodity hardware. By reducing speculative parallelization overheads in ways not possible with prior parallelization systems, this work obtains higher overall program speedup (23.0x for 12 general-purpose C/C++ programs running on a 28-core shared-memory machine) than Privateer (11.5x), the prior automatic speculative-DOALL system with the highest applicability.
SCAF: A Speculation-Aware Collaborative Dependence Analysis Framework (ACM DL)
Sotiris Apostolakis, Ziyang Xu, Zujun Tan, Greg Chan, Simone Campanoni, and David I. August
Available at https://liberty.princeton.edu/Projects/AutoPar/SCAF/, December 2020.
NOELLE Offers Empowering LLVM Extensions
Sotiris Apostolakis, Ziyang Xu, Zujun Tan, David I. August, and Simone Campanoni
Available at https://liberty.princeton.edu/Projects/NOELLE/, December 2020.
Perspective: A Sensible Approach to Speculative Automatic Parallelization
Sotiris Apostolakis, Ziyang Xu, Zujun Tan, Greg Chan, Simone Campanoni, and David I. August
Available at https://liberty.princeton.edu/Projects/AutoPar/Perspective/, December 2020.
SCAF: A Speculation-Aware Collaborative Dependence Analysis Framework [abstract] (ACM DL, PDF)
Sotiris Apostolakis, Ziyang Xu, Zujun Tan, Greg Chan, Simone Campanoni, and David I. August
Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2020.
Awarded all top ACM Reproducibility Badges offered by the Artifact Evaluation Committee.
Automatic and Speculative Parallel-Stage Decoupled Software Pipelining [abstract]
Zujun Tan, Greg Chan, Ziyang Xu, Sotiris Apostolakis, and David I. August
The Second Young Architect Workshop (YArch), March 2020.
Perspective: A Sensible Approach to Speculative Automatic Parallelization [abstract] (ACM DL, PDF)
Sotiris Apostolakis, Ziyang Xu, Greg Chan, Simone Campanoni, and David I. August
Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2020.
Awarded all top ACM Reproducibility Badges offered by the Artifact Evaluation Committee.
Hardware MultiThreaded Transactions: Enabling Speculative MultiThreaded Pipeline Parallelization For Complex Programs [abstract] (PDF)
Jordan Fix
Ph.D. Thesis, Department of Computer Science,
Princeton University, 2020.
Speculation with transactional memory systems helps programmers and compilers produce profitable thread-level parallel programs. Prior work shows that supporting transactions that can span multiple threads, rather than requiring transactions be contained within
a single thread, enables new types of speculative parallelization techniques for both programmers and parallelizing compilers.
Unfortunately, software support for multi-threaded transactions (MTXs) comes with
significant additional inter-thread communication overhead for speculation validation. This
overhead can make otherwise good parallelization unprofitable for programs with sizeable
read and write sets.
Some programs using these prior software MTXs overcame this problem through significant efforts by expert programmers to minimize these sets and optimize communication, capabilities which compiler technology has been unable to achieve to date. Instead,
this dissertation makes speculative parallelization less laborious and more feasible through
low-overhead speculation validation, presenting the first complete design, implementation,
and evaluation of hardware MTXs.
Even with maximal speculation validation of every load and store inside transactions of
tens to hundreds of millions of instructions, profitable parallelization of complex programs
can be achieved. Across 8 benchmarks, this system achieves a geomean speedup of 104%
over sequential execution on a multicore machine with 4 cores, with modest increases in
power and energy consumption. This work could be used as a building block to enable
more realistic automatic parallelization of complex programs, by providing low-overhead,
long-running, resilient transactions that support a diverse set of parallel paradigm options.
AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers [abstract] (PDF, Top Picks Version)
Grant Ayers, Nayana P. Nagendra, David I. August, Hyoun K. Cho, Svilen Kanev, Christos Kozyrakis, Trivikram Krishnamurthy, Heiner Litz, Tipp Moseley, and Parthasarathy Ranganathan
Proceedings of the 46th International Symposium on Computer Architecture (ISCA), June 2019.
Selected for IEEE Micro's "Top Picks" special issue "based on novelty and potential for long-term impact in the field of computer architecture" in 2019.
Architectural Support for Containment-based Security [abstract] (PDF)
Hansen Zhang, Soumyadeep Ghosh, Jordan Fix, Sotiris Apostolakis, Stephen R. Beard, Nayana P. Nagendra, Taewook Oh, and David I. August
Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 2019.
Collaborative Parallelization Framework [abstract]
Ziyang Xu, Greg Chan, Sotiris Apostolakis, and David I. August
The First Young Architect Workshop (YArch), February 2019.
Automatic parallelization frees programmers from the error-prone and
time-consuming process of making applications run efficiently on multi-core
systems. Each existing parallelizing compiler is the best option for a
different set of benchmarks; none is universally performant. While new
multicore-optimized parallelization transformations, stronger analyses, and
more efficient speculation techniques are powerful enough to extract
parallelism reliably from even irregular and complex applications, they often
fail to do so because they lack proper coordination and information. This work
introduces the Collaborative Parallelization Framework (CPF), a means to
address these deficiencies. To address the lack of coordination, CPF
orchestrates collaborations among parallelization techniques and enabling
transformations. In cases where this is insufficient, CPF seeks additional
information from the programmer. Orchestration of all these components allows
for a modular design that facilitates extensibility.
TrustGuard: A Model for Practical Trust in Real Systems [abstract] (PDF)
Stephen R. Beard
Ph.D. Thesis, Department of Computer Science,
Princeton University, 2019.
All layers of today's computing systems, from hardware to software, are
vulnerable to attack. Market and economic pressure push companies to focus
on performance and features, leaving security and correctness as secondary
concerns. As a consequence, systems must frequently be patched after
deployment to fix security vulnerabilities. While this non-virtuous
exploit-patch-exploit cycle is insecure, it is practical enough for
companies to use.
Formal methods in both software and hardware can guarantee the security they
provide. Ideally, modern systems would be composed of formally verified and
secure components. Unfortunately, such methods have not seen widespread
adoption for a number of reasons, such as difficulty in scaling, lack of
tools, and high skill requirements. Additionally, the economics involved in
securing and replacing every component in all systems, both new and
currently deployed, result in clean slate solutions being impractical. A
practical solution should rely on a few, simple components and should be
adoptable incrementally.
TrustGuard, the first implementation of the Containment Architecture with Verified
Output (CAVO) model developed at Princeton, showed how a simple,
trusted Sentry could protect against malicious or buggy hardware components
to ensure integrity of external communications. This was accomplished by
ensuring the correct execution of signed software, with support from a
modified CPU and system architecture. However, TrustGuard's practicality was
limited due to its reliance on modified host hardware and its requirement to
trust entire application stacks, including the operating system.
The work presented in this dissertation seeks to make the CAVO model a
practical solution for ensuring the integrity of data communicated
externally in an untrusted commodity system. This work extends CAVO in two
ways. First, it makes the Sentry compatible with a wide range of devices
without requiring hardware modifications to the host system, thus increasing
its flexibility and ease of integration into existing environments. Second,
it gives developers the option to use trusted code to verify the execution
of untrusted code, thus reducing the size of the trusted code base. This is
analogous to the small, trusted Sentry ensuring correctness
of execution of a large amount of complex, untrusted hardware.
MemoDyn: Exploiting Weakly Consistent Data Structures for Dynamic Parallel Memoization [abstract] (ACM DL, PDF)
Prakash Prabhu, Stephen R. Beard, Sotiris Apostolakis, Ayal Zaks, and David I. August
Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT), November 2018.
Hardware MultiThreaded Transactions [abstract] (ACM DL, PDF)
Jordan Fix, Nayana P. Nagendra, Sotiris Apostolakis, Hansen Zhang, Sophie Qiu, and David I. August
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2018.
Static and Dynamic Instruction Mapping for Spatial Architectures [abstract] (PDF)
Feng Liu
Ph.D. Thesis, Department of Electrical Engineering,
Princeton University, 2018.
In response to the technology scaling trends, spatial architectures have emerged
as a new style of processors for executing programs more efficiently. Unlike
traditional out-of-order (OoO) processors, which time-share a small set of
functional units, a spatial computer is composed of hundreds or even thousands
of simple and replicated functional units. Spatial architectures avoid the
overheads of time-sharing and of generating schedules repeatedly, by mapping
instruction sequences onto the functional units explicitly and reusing the mapping across multiple invocations.
Currently, spatial architectures mainly use static methods to map and schedule
instructions onto the arrays of functional units. The existing methods have
several limitations: First, for programs with irregular memory accesses and
control flows, they yield poor performance because the functional units need
to be invoked sequentially to respect data and control dependences. Second,
static methods cannot fully exploit speculation techniques, which are the
dominant performance sources in OoO processors. Finally, static methods cannot
adapt to changing workloads and are not compatible across hardware generations.
To address these issues and improve the applicability of spatial architectures,
this dissertation proposes two techniques. The first, Coarse-Grained Pipelined
Accelerators (CGPA), is a static compiling framework that exploits the hidden
parallelism within irregular C/C++ loops and translates them into spatial
hardware modules. The proposed technique has been implemented as a compiler pass
and the experiment shows 3.3x speedup over the performance achieved by an
open-source tool baseline.
The second technique, Dynamic Spatial Architecture Mapping (DYNASPAM), reuses
the speculation system in the OoO processors to dynamically produce high
performance scheduling and execution on a dedicated spatial fabric. The proposed
technique is modeled by a cycle accurate simulator and the experiment shows the
new technique can achieve 1.4x geomean performance improvement and 23.9%
energy consumption reduction, compared to an aggressive OoO processor baseline.
A Generalized Framework for Automatic Scripting Language Parallelization [abstract] (PDF)
Taewook Oh, Stephen R. Beard, Nick P. Johnson, Sergiy Popovych, and David I. August
Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2017.
Computational scientists are typically not expert programmers, and thus work
in easy-to-use dynamic languages. However, they have very high performance
requirements, due to their large datasets and experimental setups. Thus,
the performance required for computational science must be extracted from
dynamic languages in a manner that is transparent to the programmer.
Current approaches to optimize and parallelize dynamic languages, such as
just-in-time compilation and highly optimized interpreters, require a huge
amount of implementation effort and are typically only effective for a
single language. However, scientists in different fields use different
languages, depending upon their needs.
This paper presents techniques to enable automatic extraction of parallelism
within scripts that are universally applicable across multiple different
dynamic scripting languages. The key insight is that combining a script
with its interpreter, through program specialization techniques, will embed
any parallelism within the script into the combined program. Additionally,
this paper presents several enhancements to existing speculative automatic
parallelization techniques to handle the dependence patterns created by the
specialization process. A prototype of the proposed technique, called
Partial Evaluation with Parallelization (PEP), is evaluated against two open-source script
interpreters with 6 input scripts each. The resulting geomean speedup of
5.1x on a 24-core machine shows the potential of the generalized
approach in automatic extraction of parallelism in dynamic scripting
languages.
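The key insight lends itself to a toy example (not the paper's system; the bytecode and loops below are invented for exposition): once a tiny interpreter is specialized against a fixed script, the interpretive dispatch disappears and the script's loop-level parallelism becomes visible to an automatic parallelizer.

```c
/* Toy illustration of the key insight.  Not the paper's system. */

/* Generic interpreter: opaque control flow, hard to parallelize. */
void interp(const int *ops, int nops, double *a, int n) {
    for (int pc = 0; pc < nops; ++pc)
        switch (ops[pc]) {
        case 0: for (int i = 0; i < n; ++i) a[i] *= 2.0; break;  /* DOUBLE */
        case 1: for (int i = 0; i < n; ++i) a[i] += 1.0; break;  /* INC */
        }
}

/* Specialized for the fixed script {DOUBLE, INC}: the dispatch is gone,
 * leaving two independent, trivially DOALL-parallelizable loops. */
void interp_specialized(double *a, int n) {
    for (int i = 0; i < n; ++i) a[i] *= 2.0;
    for (int i = 0; i < n; ++i) a[i] += 1.0;
}
```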
A Collaborative Dependence Analysis Framework [abstract] (ACM DL, PDF)
Nick P. Johnson, Jordan Fix, Taewook Oh, Stephen R. Beard, Thomas B. Jablin, and David I. August
Proceedings of the 2017 International Symposium on Code Generation
and Optimization (CGO), February 2017.
Highest ranked paper in double-blind review process.
TrustGuard: A Containment Architecture with Verified Output [abstract] (PDF)
Soumyadeep Ghosh
Ph.D. Thesis, Department of Computer Science,
Princeton University, 2017.
Computers today are so complex and opaque that a user cannot know
everything occurring within the system. Most efforts toward computer
security have focused on securing software. However, software
security techniques implicitly assume correct execution by the
underlying system, including the hardware. Securing these systems has
been challenging due to their complexity and the proportionate attack
surface they present during their design, manufacturing, deployment,
and operation. Ultimately, the user's trust in the system depends on
claims made by each party supplying the system's components.
This dissertation presents the Containment Architecture with Verified Output
(CAVO) model in recognition of the reality that existing tools and techniques
are insufficient to secure complex hardware components in modern computing
systems. Rather than attempt to secure each complex hardware component
individually, CAVO establishes trust in hardware using a single, simple,
separately manufactured component, called the Sentry. The Sentry bridges a
physical gap between the untrusted system and its external interfaces and
contains the effects of malicious behavior by untrusted system components before
the external manifestation of any such effects. Thus, only the Sentry and the
physical gap must be secured in order to assure users of the containment of
malicious behavior. The simplicity and pluggability of CAVO's Sentry enable
suppliers and consumers to take additional measures to secure it, including
formal verification, supervised manufacture, and supply chain diversification.
This dissertation also presents TrustGuard---the first prototype CAVO
design---to demonstrate the feasibility of the CAVO model. TrustGuard achieves
containment by only allowing the communication of correctly executed results of
signed software. The Sentry in TrustGuard leverages execution information
obtained from the untrusted processor to enable efficient checking of the
untrusted system's work, even when the Sentry itself is simpler and much slower
than the untrusted processor. Simulations show that TrustGuard can guarantee
containment of malicious hardware components with a geomean of 8.5% decline in
the processor's performance, even when the Sentry operates at half the clock
frequency of the complex, untrusted processor.
Speculatively Exploiting Cross-Invocation Parallelism [abstract] (PDF)
Jialu Huang, Prakash Prabhu, Thomas B. Jablin, Soumyadeep Ghosh, Sotiris Apostolakis, Jae W. Lee, and David I. August
Proceedings of the 25th International Conference on Parallel
Architectures and Compilation Techniques (PACT), September 2016.
Static Dependence Analysis in an Infrastructure for Automatic Parallelization [abstract] (PDF)
Nick P. Johnson
Ph.D. Thesis, Department of Computer Science,
Princeton University, September 2015.
Now that parallel architectures are common, software must exploit multiple cores to fully utilize hardware resources and achieve efficient execution. Restructuring applications for explicit parallelism requires developers to reason about low-level details to avoid concurrency bugs and achieve parallel performance. Automatic thread extraction relieves developers of this arduous task.
This dissertation presents a compiler middle-ware for automatic thread extraction: analyses and transformations that allow the compiler to deliver parallel performance for sequentially-specified input programs. This middle-ware reconsiders the compilation infrastructure to present alternative technologies better suited to the needs of automatic thread extraction.
Specifically:

Collaborative Dependence Analysis: Since no analysis algorithm can precisely decide all analysis facts, this dissertation combines simple analysis algorithms into a collaborative ensemble algorithm: the ensemble can disprove dependence queries which no member disproves alone. Collaboration enables factored development, which prescribes the development of small, orthogonal analysis algorithms tailored to the needs of analysis clients. Results demonstrate independently-developed analysis algorithms collaborating to solve complex, multiple-logic queries. Further, results demonstrate a large impact on performance: for some benchmarks, analysis strength is the difference between an 11x slowdown and a 28x speedup on 50 cores.
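The collaborative-ensemble idea can be sketched in a few lines of C. The interfaces below are assumed for illustration and are not the dissertation's actual API: each module may either disprove a dependence directly or call back into the ensemble with simpler premise queries that other modules can answer.

```c
typedef struct Instruction Instruction;   /* opaque IR instruction */
typedef struct Ensemble Ensemble;

typedef enum { DEP_MAY, DEP_NO } DepResult;

/* An analysis module receives the ensemble so it can pose sub-queries. */
typedef DepResult (*AnalysisModule)(Ensemble *e,
                                    const Instruction *src,
                                    const Instruction *dst);

struct Ensemble {
    AnalysisModule *modules;
    int nmodules;
};

DepResult ensemble_query(Ensemble *e,
                         const Instruction *src, const Instruction *dst) {
    /* Any single disproof suffices; modules may recurse into ensemble_query()
     * for their premises (a real system bounds this recursion). */
    for (int i = 0; i < e->nmodules; ++i)
        if (e->modules[i](e, src, dst) == DEP_NO)
            return DEP_NO;
    return DEP_MAY;   /* conservatively assume the dependence exists */
}
```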
Scaling Analysis to Larger Scopes: The infrastructure builds around Parallel-Stage Decoupled Software Pipelining (PS-DSWP) thread extraction. PS-DSWP targets large, hot program scopes to overcome the non-recurring overheads of parallel execution. However, as scopes grow, the burden of analysis becomes prohibitive. This dissertation contributes a faster algorithm to compute the dependence graph that drives PS-DSWP. The algorithm identifies dependence edges which cannot affect PS-DSWP and skips dependence analysis queries pertaining to those unimportant edges, reducing analysis time (or allowing more expensive analysis algorithms) for the remaining, important queries. Evaluation demonstrates that the algorithm computes the DAG_SCC twice as fast using half as many dependence analysis queries, without sacrificing analysis precision.
Incorporating Speculation into the Compilation Pipeline: A parallelization system may speculate various properties to enable thread extraction. This dissertation presents design patterns to simplify the development of novel speculation types.

The dissertation integrates these contributions into a robust and flexible middle-ware upon which many works are built.
Automatic Exploitation of Input Parallelism [abstract] (PDF)
Taewook Oh
Ph.D. Thesis, Department of Computer Science,
Princeton University, September 2015.
Parallelism may reside in the input of a program rather than the program
itself. A script interpreter, for example, is hard to parallelize
because its dynamic behavior is unpredictable until an input script is given.
Once the interpreter is combined with the script, the resulting program becomes
predictable, and even parallelizable if the input script contains parallelism.
Despite recent progress in automatic parallelization research, however, existing
techniques cannot take advantage of the parallelism within program inputs, even
when the inputs remain fixed across multiple executions of the program.
This dissertation shows that the automatic exploitation of parallelism within
fixed program inputs can be achieved by coupling program specialization
with automatic parallelization techniques. Program specialization
marries a program with the values that remain
invariant across the program execution, including fixed inputs, and creates a
program that is highly optimized for the invariants. The proposed technique exploits
program specialization as an enabling transformation for automatic
parallelization; through specialization, the parallelism within the fixed
program inputs can be materialized within the specialized program.
First, this dissertation presents Invariant-induced Pattern-based Loop
Specialization (IPLS).
IPLS folds the parallelism within the program invariants into the specialized
program, thereby creating a more complete and predictable program that is easier
to parallelize.
Second, this dissertation applies automatic speculative
parallelization techniques to specialized programs to exploit parallelism in
inputs. As existing techniques fail to extract parallelism from complex programs
such as IPLS specialized programs, context-sensitive speculation and
optimized design of the speculation run-time system are proposed to improve the
applicability and minimize the execution overhead of the parallelized program.
A prototype of the proposed technique is evaluated against two widely-used
open-source script interpreters. Experimental results demonstrate the
effectiveness of the proposed techniques.
DynaSpAM: Dynamic Spatial Architecture Mapping using Out of Order Instruction Schedules [abstract] (PDF)
Feng Liu, Heejin Ahn, Stephen R. Beard, Taewook Oh, and David I. August
Proceedings of the 42nd International Symposium on Computer Architecture (ISCA), June 2015.
CGPA: Coarse-Grained Pipelined Accelerators [abstract] (PDF)
Feng Liu, Soumyadeep Ghosh, Nick P. Johnson, and David I. August
The Design Automation Conference (DAC), June 2014.
Automatically Exploiting Cross-Invocation Parallelism Using Runtime Information [abstract] (PDF)
Jialu Huang
Ph.D. Thesis, Department of Computer Science,
Princeton University, September 2013.
Automatic parallelization is a promising approach to producing
scalable multi-threaded programs for multi-core architectures. Most
existing techniques parallelize independent loops and insert global
synchronizations at the end of each loop invocation. For programs with
few loop invocations, these global synchronizations do not limit
parallel execution performance. However, for programs with many loop
invocations, those synchronizations can easily become the performance
bottleneck since they frequently force all threads to wait, losing
potential parallelization opportunities. To address this problem,
some automatic parallelization techniques apply static analyses to
enable cross-invocation parallelization. Instead of waiting, threads
can execute iterations from follow-up invocations if they do not cause
any conflict. However, static analysis must be conservative and cannot
handle irregular dependence patterns manifested by particular program
inputs at runtime.
In order to enable more parallelization across loop invocations, this thesis
presents two novel automatic parallelization techniques: DOMORE and SpecCross.
Unlike existing techniques relying on static analyses, these two techniques
take advantage of runtime information to achieve much more aggressive
parallelization. DOMORE constructs a custom runtime engine which
non-speculatively observes dependences at runtime and synchronizes iterations
only when necessary; while SpecCross applies software speculative barriers to
permit some of the threads to execute past the invocation boundaries. The two
techniques are complementary in the sense that they can parallelize programs
with potentially very different characteristics. SpecCross, with less runtime
overhead, works best when programs' cross-invocation dependences seldom cause
any runtime conflict. DOMORE, on the other hand, has its advantage in handling
dependences which cause frequent conflicts. Evaluating implementations of
DOMORE and SpecCross demonstrates that both techniques can achieve much better
scalability compared to existing automatic parallelization techniques.
Semantic Language Extensions for Implicit Parallel Programming [abstract] (PDF)
Prakash Prabhu
Ph.D. Thesis, Department of Computer Science,
Princeton University, September 2013.
Several emerging fields of science and engineering are increasingly
characterized by computationally intensive programs. Without parallelization,
such programs do not benefit from the increasing core counts available in today's
chip multiprocessors. However, writing correct and well-performing parallel
programs is widely perceived to be an extremely hard problem. In order to
understand the challenges faced by scientific programmers in effectively
leveraging parallel computation, this dissertation first presents an in-depth
field study of the practice of computational science.
Based on the results of the field study, this dissertation proposes two new
implicit parallel programming (IPP) solutions. With IPP, artificial constraints
imposed by sequential models for automatic parallelization are overcome by use
of semantic programming extensions. These preserve the ease of sequential
programming and enable multiple parallelism forms without additional parallelism
constructs, achieving the best of both automatic and explicit parallelization.
The first IPP solution, Commutative Set, generalizes existing notions of
semantic commutativity. It allows a programmer to relax execution orders
prohibited under a sequential programming model with a high degree of
expressiveness. The second IPP solution, WeakC, provides language extensions to
relax strict consistency requirements of sequential data structures, and
dynamically optimizes a parallel configuration of these data structures via a
combined compiler-runtime system.
This dissertation evaluates both Commutative Set and WeakC on real-world
applications running on real hardware, including some that are actively used by
scientists in their day-to-day research. The detailed experimental
evaluation results demonstrate the effectiveness of the proposed techniques.
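As a toy illustration of the semantic-commutativity idea, consider the sketch below. The annotation syntax is invented for exposition and is not Commutative Set's actual notation: the programmer asserts that insertions into a set commute, letting an implicitly parallel compiler run the loop's iterations concurrently despite the apparent dependence.

```c
#include <stddef.h>

typedef struct Set Set;
void set_insert(Set *s, long key);   /* assumed to exist elsewhere */

/* @commutative(set_insert, set_insert) -- hypothetical annotation stating
 * that any interleaving of insertions yields the same final set, so an
 * implicitly parallel compiler may run iterations of this loop concurrently
 * even though every call writes the shared set 'seen'. */
void collect(Set *seen, const long *keys, size_t n) {
    for (size_t i = 0; i < n; ++i)
        set_insert(seen, keys[i]);
}
```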
ASAP: Automatic Speculative Acyclic Parallelization for Clusters [abstract] (PDF)
Hanjun Kim
Ph.D. Thesis, Department of Computer Science,
Princeton University, September 2013.
While clusters of commodity servers and switches are the most
popular form of large-scale parallel computers, many programs are not
easily parallelized for clusters due to high inter-node communication
cost and lack of globally shared memory. Speculative Decoupled
Software Pipelining (Spec-DSWP) is a promising automatic
parallelization technique for clusters that speculatively partitions a
loop into multiple threads that communicate in a pipelined manner.
Speculation can complement conservative static analysis, making
automatic parallelization more robust and applicable. Pipelining
allows Spec-DSWP to speculate only rarely occurring dependences while
respecting the other dependences through communication among threads.
Acyclic communication patterns in pipelining make the parallelized
programs tolerant of high communication latency of clusters. However,
since Spec-DSWP partitions a loop iteration (a transaction) into
multiple sub-transactions across multiple threads according to the
pipeline stages, a special runtime system is required that supports
multi-threaded transactions (MTXs).
This dissertation proposes the Automatic Speculative Acyclic
Parallelization (ASAP) system that enables Spec-DSWP for clusters
without any hardware modification. The ASAP system supports
various speculation techniques that require different validation and
communication costs, and automatically parallelizes sequential loops
using the Spec-DSWP transformation with the optimal application of the
speculation techniques. The ASAP system efficiently supports MTXs
to correctly execute the speculatively transformed programs on
clusters. With synergistic combination of speculation, acyclic
communication, and runtime system support, this approach achieves or
demonstrates a path to achieve scalable performance speedup up to
109x for a wide range of applications on clusters without any
hardware modification.
Fast Condensation of the Program Dependence Graph [abstract] (ACM DL, PDF)
Nick P. Johnson, Taewook Oh, Ayal Zaks, and David I. August
Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2013.
Automatic Parallelization for GPUs [abstract] (PDF)
Thomas B. Jablin
Ph.D. Thesis, Department of Computer Science,
Princeton University, April 2013.
GPUs are flexible parallel processors capable of accelerating real
applications. To exploit them, programmers rewrite programs in new languages
using intimate knowledge of the underlying hardware. This is a step backwards in
abstraction and ease of use from sequential programming. When implementing
sequential applications, programmers focus on high-level algorithmic concerns,
allowing the compiler to target the peculiarities of specific
hardware. Automatic parallelization can return ease of use and hardware
abstraction to programmers. This dissertation presents techniques for
automatically parallelizing ordinary sequential C codes for GPUs using DOALL and
pipelined parallelization techniques. The key contributions include: the first
automatic data management and communication optimization framework for GPUs and
the first automatic pipeline parallelization system for GPUs. Combining these
two contributions with an automatic DOALL parallelization yields the first fully
automatic parallelizing compiler for GPUs.
Practical Automatic Loop Specialization [abstract] (PDF)
Taewook Oh, Hanjun Kim, Nick P. Johnson, Jae W. Lee, and David I. August
Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2013.
Automatically Exploiting Cross-Invocation Parallelism Using Runtime Information [abstract] (PDF)
Jialu Huang, Thomas B. Jablin, Stephen R. Beard, Nick P. Johnson, and David I. August
Proceedings of the 2013 International Symposium on Code Generation
and Optimization (CGO), February 2013.
Automatic parallelization is a promising approach to producing scalable
multi-threaded programs for multicore architectures. Many existing automatic
techniques only parallelize iterations within a loop invocation and synchronize
threads at the end of each loop invocation. When parallel code contains many
loop invocations, synchronization can easily become a performance bottleneck.
Some automatic techniques address this problem by exploiting cross-invocation
parallelism. These techniques use static analysis to partition iterations among
threads to avoid cross-thread dependences. However, this partitioning is not
always achievable at compile time, because program input determines dependence
patterns at run time. By contrast, this paper proposes DOMORE, the first
automatic parallelization technique that uses runtime information to exploit
additional cross-invocation parallelism. Instead of partitioning iterations
statically, DOMORE dynamically detects cross-thread dependences and synchronizes
only when necessary. DOMORE consists of a compiler and a runtime library. At
compile time, DOMORE automatically parallelizes loops and inserts a custom
runtime engine into programs. At runtime, the engine observes dependences and
synchronizes iterations only when necessary. For six programs, DOMORE achieves
a geomean loop speedup of 2.1x over parallel execution without cross-invocation
parallelization and of 3.2x over sequential execution on eight cores.
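A rough sketch of the shape of such generated code follows. The runtime calls are invented names for illustration, not DOMORE's real API: per-iteration checks replace the global barrier between loop invocations, so threads block only on actual conflicts.

```c
/* Hypothetical helpers standing in for a DOMORE-style compiler-inserted
 * runtime; all four functions are assumed, not real API names. */
void *address_touched_by(long iter);                  /* assumed */
void  domore_wait_if_conflict(void *addr, long iter); /* assumed */
void  domore_mark_done(void *addr, long iter);        /* assumed */
void  do_iteration(long iter);                        /* original loop body */

void worker(long first, long last, long stride) {
    for (long i = first; i < last; i += stride) {
        void *addr = address_touched_by(i);      /* what this iteration touches */
        domore_wait_if_conflict(addr, i);        /* sync only on real conflicts */
        do_iteration(i);
        domore_mark_done(addr, i);               /* publish completion */
    }
}
```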
From Sequential Programming to Flexible Parallel Execution [abstract] (PDF)
Arun Raman, Jae W. Lee, and David I. August
Proceedings of the International Conference on Compilers and Synthesis for Embedded Systems (CASES), October 2012.
Invited.
PASSERT: A Tool for Debugging Parallel Programs [abstract] (SpringerLink, PDF)
Daniel Schwartz-Narbonne, Feng Liu, David I. August, and Sharad Malik
Proceedings of the 24th International Conference on Computer Aided Verification (CAV), June 2012.
Parcae: A System for Flexible Parallel Execution [abstract] (PDF)
Arun Raman, Ayal Zaks, Jae W. Lee, and David I. August
Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2012.
Workload, platform, and available resources constitute a parallel program's
execution environment. Most parallelization efforts statically target an
anticipated range of environments, but performance generally degrades outside
that range. Existing approaches address this problem with dynamic tuning but do
not optimize a multiprogrammed system holistically. Further, they either
require manual programming effort or are limited to array-based data-parallel
programs.
This paper presents Parcae, a generally applicable automatic system
for platform-wide dynamic tuning. Parcae includes (i) the Nona compiler, which
creates flexible parallel programs whose tasks can be efficiently reconfigured
during execution; (ii) the Decima monitor, which measures resource availability
and system performance to detect change in the environment; and (iii) the Morta
executor, which cuts short the life of executing tasks, replacing them with
other functionally equivalent tasks better suited to the current environment.
Parallel programs made flexible by Parcae outperform original parallel
implementations in many interesting scenarios.
Speculative Separation for Privatization and Reductions [abstract] (ACM DL, PDF)
Nick P. Johnson, Hanjun Kim, Prakash Prabhu, Ayal Zaks, and David I. August
Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2012.
Runtime Asynchronous Fault Tolerance via Speculation [abstract] (PDF)
Yun Zhang, Soumyadeep Ghosh, Jialu Huang, Jae W. Lee, Scott A. Mahlke, and David I. August
Proceedings of the 2012 International Symposium on Code Generation
and Optimization (CGO), April 2012.
Transient faults are emerging as a critical reliability concern
in modern microprocessors. Redundant hardware solutions are commonly deployed
to detect transient faults, but they are less flexible and cost-effective than
software solutions. However, software solutions are rendered impractical
because of high performance overheads. To address this problem, this paper
presents Runtime Asynchronous Fault Tolerance via Speculation (RAFT), the
fastest transient fault detection technique known to date. Serving as a virtual
layer between the application and the underlying platform, RAFT automatically
generates two symmetric program instances from a program binary. It detects
transient faults in a noninvasive way, and exploits high-confidence value
speculation to achieve low runtime overhead. Evaluation on a commodity
multicore system demonstrates that RAFT delivers a geomean performance overhead
of 2.03% on a set of 23 SPEC CPU benchmarks. Compared with existing transient
fault detection techniques, RAFT exhibits the best performance and fault
coverage, without requiring any change to the hardware or the software
applications.
Dynamically Managed Data for CPU-GPU Architectures [abstract] (PDF)
Thomas B. Jablin, James A. Jablin, Prakash Prabhu, Feng Liu, and David I. August
Proceedings of the 2012 International Symposium on Code Generation
and Optimization (CGO), March 2012.
GPUs are flexible parallel processors capable of accelerating real
applications. To exploit them, programmers must ensure a consistent program
state between the CPU and GPU memories by managing data. Manually managing data
is tedious and error-prone. In prior work on automatic CPU-GPU data management,
alias analysis quality limits performance, and type-inference quality limits
applicability. This paper presents Dynamically Managed Data (DyManD), the first
automatic system to manage complex and recursive data-structures without static
analyses. By replacing static analyses with a dynamic run-time system, DyManD
overcomes the performance limitations of alias analysis and enables management
for complex and recursive data-structures. DyManD-enabled GPU parallelization
matches the performance of prior work equipped with perfectly precise alias
analysis for 27 programs and demonstrates improved applicability on programs not
previously managed automatically.
Automatic Speculative DOALL for Clusters [abstract] (PDF)
Hanjun Kim, Nick P. Johnson, Jae W. Lee, Scott A. Mahlke, and David I. August
Proceedings of the 2012 International Symposium on Code Generation
and Optimization (CGO), March 2012.
Automatic parallelization for clusters is a promising alternative to
time-consuming, error-prone manual parallelization. However, automatic
parallelization is frequently limited by the imprecision of static analysis.
Moreover, due to the inherent fragility of static analysis, small changes to
the source code can significantly undermine performance. By replacing static
analysis with speculation and profiling, automatic parallelization becomes more
robust and applicable. A naive automatic speculative parallelization does not
scale for distributed memory clusters, due to the high bandwidth required to
validate speculation. This work is the first automatic speculative DOALL
(Spec-DOALL) parallelization system for clusters. We have implemented a
prototype automatic parallelization system, called Cluster Spec-DOALL, which
consists of a Spec-DOALL parallelizing compiler and a speculative runtime for
clusters. Since the compiler optimizes communication patterns, and the runtime
is optimized for the cases in which speculation succeeds, Cluster Spec-DOALL
minimizes the communication and validation overheads of the speculative
runtime. Across 8 benchmarks, Cluster Spec-DOALL achieves a geomean speedup of
43.8x on a 120-core cluster, whereas DOALL without speculation achieves only
4.5x speedup. This demonstrates that speculation makes scalable fully-automatic
parallelization for clusters possible.
DAFT: Decoupled Acyclic Fault Tolerance [abstract] (SpringerLink, PDF)
Yun Zhang, Jae W. Lee, Nick P. Johnson, and David I. August
The International Journal of Parallel Programming (IJPP), February 2012.
Invited.
Special issue composed of "top papers" selected by the
Program Committee of the 19th International Conference on Parallel
Architectures and Compilation Techniques.
Higher transistor counts, lower voltage levels, and reduced noise
margin increase the susceptibility of multicore processors to
transient faults. Redundant hardware modules can detect such errors,
but software transient fault detection techniques are more appealing
for their low cost and flexibility. Recent software proposals double
register pressure or memory usage, or are too slow in the absence of
hardware extensions, preventing widespread acceptance. This paper
presents DAFT, a fast, safe, and memory efficient transient fault
detection framework for commodity multicore systems. DAFT replicates
computation across multiple cores and schedules fault detection off
the critical path. Where possible, values are speculated to be correct
and only communicated to the redundant thread at essential program
points. DAFT is implemented in the LLVM compiler framework and
evaluated using SPEC CPU2000 and SPEC CPU2006 benchmarks on a
commodity multicore system. Results demonstrate DAFT's high
performance and broad fault coverage. Speculation allows DAFT to
reduce the performance overhead of software redundant multithreading
from an average of 200% to 38% with no degradation of fault coverage.
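A highly simplified sketch of the idea appears below (bare-bones C, not the LLVM-based implementation; the volatile spin loops stand in for the proper atomics and memory fences a real system needs): the leading thread enqueues values only at essential program points, while a trailing thread redundantly recomputes and compares them.

```c
#include <stdio.h>
#include <stdlib.h>

enum { QCAP = 1024 };
static long queue[QCAP];
static volatile int head, tail;          /* single-producer/single-consumer */

static void produce(long v) {            /* leading thread: speculate & go on */
    while ((head + 1) % QCAP == tail) ;  /* spin if full */
    queue[head] = v;
    head = (head + 1) % QCAP;
}

static void check(long recomputed) {     /* trailing thread: verify */
    while (tail == head) ;               /* spin if empty */
    if (queue[tail] != recomputed) {
        fprintf(stderr, "transient fault detected\n");
        exit(1);                         /* a real system would recover/re-execute */
    }
    tail = (tail + 1) % QCAP;
}
```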
A General Approach for Efficiently Accelerating Software-based Dynamic Data Flow Tracking on Commodity Hardware [abstract] (PDF)
Kangkook Jee, Georgios Portokalidis, Vasileios P. Kemerlis, Soumyadeep Ghosh, David I. August, and Angelos D. Keromytis
Proceedings of the 19th Internet Society (ISOC) Symposium on Network
and Distributed Systems Security (NDSS), February 2012.
A System for Flexible Parallel Execution [abstract] (PDF)
Arun Raman
Ph.D. Thesis, Department of Electrical Engineering,
Princeton University, December 2011.
Exponential growth in transistor density combined
with diminishing returns from uniprocessor
improvements has compelled the industry to
transition to multicore architectures.
To realize the
performance potential of multicore architectures,
programs must be parallelized effectively. The
efficiency of parallel program execution depends
on the execution environment comprised of
workload, platform, and performance goal. In
writing parallel programs, most programmers and
compilers expose parallelism and optimize it to
meet a particular performance goal on a single
platform under an assumed set of workload
characteristics. In the field, changing workload
characteristics, new parallel platforms, and
deployments with different performance goals make
the programmer's or compiler's development-time or
compile-time choices suboptimal.
This dissertation presents Parcae, a generally applicable holistic system for
platform-wide dynamic parallelism tuning. Parcae includes:
1. the Nona compiler, which applies a variety of auto-parallelization techniques
to create flexible parallel programs whose tasks can be efficiently paused,
reconfigured, and resumed during execution;
2. the Decima monitor, which measures resource availability and system
performance to detect change in the environment; and
3. the Morta executor, which cuts short the life of executing tasks, replacing
them with other functionally equivalent tasks better suited to the current
environment.
Parallel programs made flexible by Parcae outperform original parallel
implementations in a variety of interesting scenarios.
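A rough sketch of how such flexibility can be exposed (hypothetical C with pthreads; Parcae's actual flexible tasks are generated by Nona and managed by Decima and Morta): a worker loop consults a degree-of-parallelism target at iteration boundaries, pausing and resuming without losing work.

    /* parcae_sketch.c: illustrative "flexible task" that can be paused and resumed
     * at iteration boundaries. Hypothetical; not Parcae's generated code.
     * Build: cc -O2 -pthread parcae_sketch.c -o parcae_sketch */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <unistd.h>

    static atomic_int dop = 2;               /* degree of parallelism chosen by a monitor */
    static atomic_long next_iter = 0;        /* shared counter handing out iterations */
    enum { TOTAL_ITERS = 1000000 };

    static void *worker(void *arg) {
        int id = (int)(long)arg;
        long executed = 0;
        for (;;) {
            if (atomic_load(&next_iter) >= TOTAL_ITERS) break;       /* all work claimed */
            if (id >= atomic_load(&dop)) { usleep(1000); continue; } /* "paused" worker  */
            long i = atomic_fetch_add(&next_iter, 1);
            if (i >= TOTAL_ITERS) break;
            /* body of the parallelized loop iteration would run here */
            executed++;
        }
        printf("worker %d executed %ld iterations\n", id, executed);
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (long i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, (void *)i);
        usleep(10000);
        atomic_store(&dop, 4);   /* monitor sees idle cores: grow the parallel region  */
        usleep(10000);
        atomic_store(&dop, 1);   /* monitor sees contention: shrink it back down       */
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        return 0;
    }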
Automatic Extraction of Parallelism from Sequential Code
David I. August, Jialu Huang, Thomas B. Jablin, Hanjun Kim, Thomas R. Mason, Prakash Prabhu, Arun Raman, and Yun Zhang
Fundamentals of Multicore Software Development (ISBN: 978-1439812730)
Edited by Ali-Reza Adl-Tabatabai, Victor Pankratius, and Walter Tichy.
Chapman & Hall / CRC Press, December 2011.
EPIC Processors
Arun Raman and David I. August
Encyclopedia of Parallel Computing (ISBN: 978-0387098449)
Edited by David Padua.
Springer, November 2011.
A Survey of the Practice of Computational Science [abstract] (ACM DL, PDF)
Prakash Prabhu, Thomas B. Jablin, Arun Raman, Yun Zhang, Jialu Huang, Hanjun Kim, Nick P. Johnson, Feng Liu, Soumyadeep Ghosh, Stephen Beard, Taewook Oh, Matthew Zoufaly, David Walker, and David I. August
Proceedings of the 24th ACM/IEEE Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2011.
The SPARCHS Project: Hardware Support for Software Security [abstract] (PDF)
Simha Sethumadhavan, Salvatore J. Stolfo, Angelos D. Keromytis, Junfeng Yang, and David I. August
Proceedings of the First SysSec Workshop (SYSSEC), July 2011.
This paper describes the research philosophy and
motivations of the SPARCHS project. The goal of the SPARCHS project is
to improve security by implementing security mechanisms in
hardware.
Parallel Assertions for Debugging Parallel Programs [abstract] (IEEE Xplore, PDF)
Daniel Schwartz-Narbonne, Feng Liu, Tarun Pondicherry, David I. August, and Sharad Malik
2011 9th IEEE/ACM International Conference on Formal Methods and Models for Codesign (MEMOCODE), July 2011.
Commutative Set: A Language Extension for Implicit Parallel Programming [abstract] (ACM DL, PDF)
Prakash Prabhu, Soumyadeep Ghosh, Yun Zhang, Nick P. Johnson, and David I. August
Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2011.
Automatic CPU-GPU Communication Management and Optimization [abstract] (ACM DL, PDF)
Thomas B. Jablin, Prakash Prabhu, James A. Jablin, Nick P. Johnson, Stephen R. Beard, and David I. August
Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2011.
Parallelism Orchestration using DoPE: the Degree of
Parallelism Executive [abstract] (ACM DL, PDF)
Arun Raman, Hanjun Kim, Taewook Oh, Jae W. Lee, and David I. August
Proceedings of the 32nd ACM SIGPLAN Conference on
Programming Language Design and Implementation (PLDI), June 2011.
Scalable Speculative Parallelization on Commodity Clusters [abstract] (ACM DL, PDF)
Hanjun Kim, Arun Raman, Feng Liu, Jae W. Lee, and David I. August
Proceedings of the 43rd IEEE/ACM International Symposium on
Microarchitecture (MICRO), December 2010.
Highest-ranked paper in the double-blind review process.
While clusters of commodity servers and switches are the most
popular form of large-scale parallel computers, many programs are
not easily parallelized for execution upon them. In particular,
high inter-node communication cost and lack of globally shared
memory appear to make clusters suitable only for server applications
with abundant task-level parallelism and scientific applications
with regular and independent units of work. Clever use of pipeline
parallelism (DSWP), thread-level speculation (TLS), and speculative
pipeline parallelism (Spec-DSWP) can mitigate the costs of
inter-thread communication on shared memory multicore machines.
This paper presents Distributed Software Multi-threaded
Transactional memory (DSMTX), a runtime system which makes these
techniques applicable to non-shared memory clusters, allowing them
to efficiently address inter-node communication costs. Initial
results suggest that DSMTX enables efficient cluster execution of a
wider set of application types. For 11 sequential C programs
parallelized for a 4-core 32-node (128 total core) cluster without
shared memory, DSMTX achieves a geomean speedup of 49x. This
compares favorably to the 15x speedup achieved by our
implementation of TLS-only support for clusters.
DAFT: Decoupled Acyclic Fault Tolerance [abstract] (ACM DL, PDF)
Yun Zhang, Jae W. Lee, Nick P. Johnson, and David I. August
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2010.
Selected by the Program Committee as a "top paper" for inclusion
in a special issue of the International Journal of Parallel
Programming (IJPP).
Higher transistor counts, lower voltage levels, and reduced noise
margin increase the susceptibility of multicore processors to
transient faults. Redundant hardware modules can detect such errors,
but software transient fault detection techniques are more appealing
for their low cost and flexibility. Recent software proposals double
register pressure or memory usage, or are too slow in the absence of
hardware extensions, preventing widespread acceptance. This paper
presents DAFT, a fast, safe, and memory efficient transient fault
detection framework for commodity multicore systems. DAFT replicates
computation across multiple cores and schedules fault detection off
the critical path. Where possible, values are speculated to be correct
and only communicated to the redundant thread at essential program
points. DAFT is implemented in the LLVM compiler framework and
evaluated using SPEC CPU2000 and SPEC CPU2006 benchmarks on a
commodity multicore system. Results demonstrate DAFT's high
performance and broad fault coverage. Speculation allows DAFT to
reduce the performance overhead of software redundant multithreading
from an average of 200% to 38% with no degradation of fault
coverage.
Structural Simulation for Architecture Exploration
David I. August, Veerle Desmet, Silvain Girbal, Daniel Gracia Pérez, and Olivier Temam
Processor and System-on-Chip Simulation (ISBN: 978-1441961747)
Edited by Rainer Leupers and Olivier Temam.
Springer, September 2010.
Programming Multicores: Do Applications Programmers Need to Write Explicitly Parallel Programs? [abstract] (IEEE Xplore, PDF)
Arvind, David I. August, Keshav Pingali, Derek Chiou, Resit Sendag, and Joshua J. Yi
IEEE Micro, Volume 30, Number 3, May 2010.
In this panel discussion from the 2009 Workshop On Computer
Architecture Research Directions, David August and Keshav Pingali
debate whether explicitly parallel programming is a necessary evil for
applications programmers, assess the current state of parallel
programming models, and discuss possible routes toward finding the
programming model for the multicore era.
Decoupled Software Pipelining Creates Parallelization Opportunities [abstract] (ACM DL, PDF)
Jialu Huang, Arun Raman, Yun Zhang, Thomas B. Jablin, Tzu-Han Hung, and David I. August
Proceedings of the 2010 International Symposium on Code Generation and Optimization (CGO), April 2010.
Decoupled Software Pipelining (DSWP) is one approach to
automatically extract threads from loops. It partitions loops into long-running
threads that communicate in a pipelined manner via inter-core queues. This work
recognizes that DSWP can also be an enabling transformation for other loop
parallelization techniques. This use of DSWP, called DSWP+, splits a loop into
new loops with dependence patterns amenable to parallelization using techniques
that were originally either inapplicable or poorly-performing. By parallelizing
each stage of the DSWP+ pipeline using (potentially) different techniques, not
only is the benefit of DSWP increased, but the applicability and performance of
other parallelization techniques are enhanced. This paper evaluates DSWP+ as an
enabling framework for other transformations by applying it in conjunction with
DOALL, LOCALWRITE, and SpecDOALL to individual stages of the pipeline. This paper
demonstrates significant performance gains on a commodity 8-core multicore machine
running a variety of codes transformed with DSWP+.
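The enabling transformation can be pictured with a small sketch (hypothetical C with pthreads; the paper's compiler performs this automatically on the IR): the loop below carries its recurrence only through the list traversal, so a first stage walks the list and feeds node pointers into a queue, while the second stage, now free of the recurrence, is replicated DOALL-style across several consumer threads.

    /* dswp_plus_sketch.c: a loop split into a sequential traversal stage and a
     * replicated (DOALL-style) work stage. Illustrative only.
     * Build: cc -O2 -pthread dswp_plus_sketch.c -o dswp_plus_sketch */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct node { long value; struct node *next; };

    #define NCONSUMERS 3
    #define QCAP 256
    static struct node *q[QCAP];
    static int qhead, qtail, qcount, done;
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;

    static void put(struct node *n) {            /* stage 1 -> stage 2 queue */
        pthread_mutex_lock(&m);
        while (qcount == QCAP) pthread_cond_wait(&not_full, &m);
        q[qtail] = n; qtail = (qtail + 1) % QCAP; qcount++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&m);
    }

    static struct node *get(void) {              /* returns NULL when the pipeline drains */
        pthread_mutex_lock(&m);
        while (qcount == 0 && !done) pthread_cond_wait(&not_empty, &m);
        struct node *n = NULL;
        if (qcount > 0) { n = q[qhead]; qhead = (qhead + 1) % QCAP; qcount--; }
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&m);
        return n;
    }

    static void *stage2(void *arg) {             /* replicated "parallel stage" */
        long local = 0;
        struct node *n;
        while ((n = get()) != NULL)
            local += n->value * n->value;        /* independent work per node */
        *(long *)arg = local;
        return NULL;
    }

    int main(void) {
        struct node *list = NULL;                /* build a small list */
        for (long i = 0; i < 10000; i++) {
            struct node *n = malloc(sizeof *n);
            n->value = i; n->next = list; list = n;
        }
        pthread_t t[NCONSUMERS]; long partial[NCONSUMERS];
        for (int i = 0; i < NCONSUMERS; i++)
            pthread_create(&t[i], NULL, stage2, &partial[i]);

        /* stage 1: the pointer-chasing recurrence stays sequential */
        for (struct node *n = list; n != NULL; n = n->next)
            put(n);
        pthread_mutex_lock(&m); done = 1;
        pthread_cond_broadcast(&not_empty); pthread_mutex_unlock(&m);

        long sum = 0;
        for (int i = 0; i < NCONSUMERS; i++) { pthread_join(t[i], NULL); sum += partial[i]; }
        printf("sum of squares = %ld\n", sum);
        return 0;
    }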
Liberty Queues for EPIC Architectures [abstract] (PDF)
Thomas B. Jablin, Yun Zhang, James A. Jablin, Jialu Huang, Hanjun Kim, and David I. August
Proceedings of the Eighth Workshop on Explicitly Parallel
Instruction Computer Architectures and Compiler Technology (EPIC), April 2010.
Core-to-core communication bandwidth is critical for
high-performance pipeline-parallel programs. Hardware communication queues are
unlikely to be implemented and are perhaps unnecessary. This paper presents
Liberty Queues, a high-performance lock-free software-only ring buffer, and
describes the porting effort from the original x86-64 implementation to
IA-64. Liberty Queues achieve a bandwidth of 500 MB/s between unrelated
processors on a first-generation Itanium 2, compared with the 281 MB/s on modern
Opterons and 430 MB/s on modern Xeons reported by related work. Bandwidth
results are presented for seven different multicore and multiprocessor systems,
as well as a sensitivity analysis.
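The core data structure is in the spirit of the single-producer/single-consumer ring buffer below (an illustrative C11 sketch; the actual Liberty Queues implementation adds further batching of index updates and careful data placement to minimize cache-line traffic between cores):

    /* spsc_queue.c: lock-free single-producer/single-consumer ring buffer,
     * illustrating the software-only queue idea. Not the tuned Liberty Queues code. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define CAP 1024                            /* power of two */

    struct spsc {
        _Alignas(64) atomic_size_t head;        /* written by the producer only */
        _Alignas(64) atomic_size_t tail;        /* written by the consumer only */
        _Alignas(64) size_t cached_tail;        /* producer's stale copy of tail */
        _Alignas(64) size_t cached_head;        /* consumer's stale copy of head */
        _Alignas(64) uint64_t buf[CAP];
    };

    bool spsc_push(struct spsc *q, uint64_t v) {
        size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
        if (h - q->cached_tail == CAP) {        /* re-read the real tail only when it looks full */
            q->cached_tail = atomic_load_explicit(&q->tail, memory_order_acquire);
            if (h - q->cached_tail == CAP) return false;   /* truly full */
        }
        q->buf[h & (CAP - 1)] = v;
        atomic_store_explicit(&q->head, h + 1, memory_order_release);
        return true;
    }

    bool spsc_pop(struct spsc *q, uint64_t *out) {
        size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
        if (t == q->cached_head) {              /* re-read the real head only when it looks empty */
            q->cached_head = atomic_load_explicit(&q->head, memory_order_acquire);
            if (t == q->cached_head) return false;         /* truly empty */
        }
        *out = q->buf[t & (CAP - 1)];
        atomic_store_explicit(&q->tail, t + 1, memory_order_release);
        return true;
    }

A producer thread calls spsc_push and a consumer thread calls spsc_pop on a zero-initialized queue; the cached index copies and cache-line alignment keep the common case from bouncing the other core's cache line, which is where most of the bandwidth of a software-only queue comes from.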
Speculative Parallelization Using Software Multi-threaded Transactions [abstract] (ACM DL, PDF)
Arun Raman, Hanjun Kim, Thomas R. Mason, Thomas B. Jablin, and David I. August
Proceedings of the Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2010.
Compilation Strategies and Challenges for Multicore Signal Processing [abstract] (IEEE Xplore, PDF)
Mojtaba Mehrara, Thomas B. Jablin, Dan Upton, David I. August, Kim Hazelwood, and Scott Mahlke
IEEE Signal Processing Magazine, November 2009.
To overcome challenges stemming from power density and hot spots on the chip,
multicore computing platforms have emerged as the ubiquitous computing platform
for servers down through embedded systems. Unfortunately, providing multiple
cores does not directly translate into increased performance for most
applications. The burden is placed on software developers and tools to find and
exploit coarse-grain parallelism in order to effectively make use of the
abundance of computing resources provided by these systems. Concurrent
applications are much more complex to develop than their single-threaded
ancestors, thus software development tools will be critical to help programmers
create both high-performance and correct software. This article provides an
overview of multicore parallelism and compiler technology for the signal processing
community.
LAMPVIEW: A Loop-Aware Toolset for
Facilitating Parallelization [abstract] (PDF)
Thomas Rorie Mason
Master's Thesis, Department of Electrical Engineering,
Princeton University, August 2009.
A continual growth of the number of transistors per unit area
coupled with diminishing returns from traditional microarchitectural
and clock frequency improvements has led processor manufacturers to
place multiple cores on a single chip. However, only multi-threaded
code can fully take advantage of the new multicore processors;
legacy single-threaded code does not benefit. Many approaches to
parallelization have been explored, including both manual and
automatic techniques.
Unfortunately, research in this area is impeded by the innate
difficulty of exploring code by hand for new possible
parallelization schemes. Regardless of whether it is a researcher
attempting to discover possible automatic techniques or a programmer
trying to make manual parallelization, the benefits of good
dependence information are substantial. This thesis provides a
profiling and analysis toolset aimed at easing a programmer's or
researcher's effort in finding parallelism. The toolset, the
Loop-Aware Memory Profile Viewing System (LAMPView), is developed in
three parts.
The first part is a multi-frontend, multi-target compiler pass
written to instrument the code with calls to the Loop-Aware Memory
Profiling (LAMP) library. The compile-time instrumentation was
partially developed previously and has been augmented here with
additional features. The second part is a post-runtime processing
pass that translates the output of the profiling run from a
machine-level view to a source-level view. As it translates, it also
processes and sorts dependence information. The third and final part
is a pair of stand-alone utilities that takes the translated
information and provides the user with human-readable output that is
searchable by various parameters. Discussions with potential users
indicate that the toolset eases the analysis of parallelism
opportunities.
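Conceptually, the first part rewrites the program so that memory operations and loop boundaries report to the profiling runtime, roughly as below. The LAMP_* names are illustrative placeholders, not the toolset's real API, and are stubbed out here only so the sketch is self-contained.

    /* lamp_sketch.c: conceptual view of loop-aware memory-profiling instrumentation. */
    #include <stddef.h>
    #include <stdio.h>

    static unsigned long n_loads, n_iters;
    static void LAMP_loop_enter(int loop_id)                     { (void)loop_id; }
    static void LAMP_loop_iter(int loop_id)                      { (void)loop_id; n_iters++; }
    static void LAMP_loop_exit(int loop_id)                      { (void)loop_id; }
    static void LAMP_load(int instr_id, const void *a, size_t s) { (void)instr_id; (void)a; (void)s; n_loads++; }

    /* the compile-time pass rewrites this loop so every memory access and every
     * loop boundary reports its static id; the post-runtime pass later maps those
     * ids back to source lines and sorts the observed dependences */
    static long sum(const long *a, int n) {
        long s = 0;
        LAMP_loop_enter(1);
        for (int i = 0; i < n; i++) {
            LAMP_loop_iter(1);
            LAMP_load(10, &a[i], sizeof a[i]);
            s += a[i];
        }
        LAMP_loop_exit(1);
        return s;
    }

    int main(void) {
        long a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("sum=%ld loads=%lu iters=%lu\n", sum(a, 8), n_loads, n_iters);
        return 0;
    }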
Parallelization Techniques with Improved Dependence Handling [abstract] (PDF)
Easwaran Raman
Ph.D. Thesis, Department of Computer Science,
Princeton University, June 2009.
Continuing exponential growth in transistor density and diminishing
returns from the increasing transistor count have forced processor
manufacturers to pack multiple processor cores onto a single chip.
These processors, known as multi-core processors, generally do not
improve the performance of single-threaded applications. Automatic
parallelization has a key role to play in improving the performance of
legacy and newly written single-threaded applications in this new
multi-threaded era. Automatic parallelization transforms single-threaded code into
semantically equivalent multi-threaded code by preserving the
dependences of the original code. This dissertation proposes two new
automatic parallelization techniques that differ from related existing
techniques in their handling of dependences. This difference in
dependence handling enables the proposed techniques to outperform
related techniques. The first technique is known as parallel-stage decoupled software
pipelining (PS-DSWP). PS-DSWP extends pipelined parallelization
techniques like DSWP by allowing certain pipelined stages to be
executed by multiple threads. Such a parallel execution of pipelined
stages requires distinguishing inter-iteration dependences
of the loop being parallelized from the rest of the dependences.
The applicability and effectiveness of PS-DSWP is
further enhanced by applying speculation to remove some dependences. The second technique, known as speculative iteration chunk execution
(Spice), uses value speculation to ignore inter-iteration dependences,
enabling speculative execution of chunks of iterations in parallel.
Unlike other value-speculation based parallelization techniques, Spice
speculates only a few dynamic instances of those inter-iteration
dependences. Both these techniques are implemented in the VELOCITY compiler and are
evaluated using a multi-core Itanium 2 simulator. PS-DSWP results in
a geometric mean loop speedup of 2.13 over single-threaded
performance with five threads on a set of loops from five benchmarks.
The use of speculation improves the performance of PS-DSWP resulting
in a geometric mean loop speedup of 3.67 over single-threaded
performance on a set of loops from six benchmarks. Spice shows a
geometric mean loop speedup of 2.01 on a set of loops from four
benchmarks. Based on the above experimental results and qualitative
and quantitative comparisons with related techniques, this
dissertation demonstrates the effectiveness of the proposed
techniques.
Speculation
Neil Vachharajani and David I. August
Encyclopedia of Computer Science and Engineering (ISBN: 978-0471383932)
Edited by Benjamin W. Wah.
John Wiley & Sons, Inc., January 2009.
Intelligent Speculation for Pipelined Multithreading [abstract] (PDF)
Neil Amar Vachharajani
Ph.D. Thesis, Department of Computer Science,
Princeton University, November 2008.
In recent years, microprocessor manufacturers have shifted their focus
from single-core to multi-core processors. Since many of today's
applications are single-threaded and since it is likely that many of
tomorrow's applications will have far fewer threads than there
will be processor cores, automatic thread extraction is an essential
tool for effectively leveraging today's multi-core and
tomorrow's many-core processors. A recently proposed technique,
Decoupled Software Pipelining (DSWP), has demonstrated promise by
partitioning loops into long-running threads organized into a
pipeline. Using a pipeline organization and execution decoupled by
inter-core communication queues, DSWP offers increased execution
efficiency that is largely independent of inter-core communication
latency and variability in intra-thread performance. This dissertation extends the pipelined parallelism paradigm with
speculation. Using speculation, dependences that manifest infrequently
or are easily predictable can be safely ignored by the compiler
allowing it to carve more, and better balanced, thread-based pipeline
stages from a single thread of execution. Prior speculative threading
proposals were obligated to speculate most, if not all, loop-carried
dependences to squeeze the code segment under consideration into the
mold required by the parallelization paradigm. Unlike those
techniques, this dissertation demonstrates that speculation need only
break the longest few dependence cycles to enhance the applicability
and scalability of the pipelined multi-threading paradigm. By
speculatively breaking these cycles, instructions that were formerly
restricted to a single thread to ensure decoupling are now free to
span multiple threads. To demonstrate the effectiveness of speculative
pipelined multi-threading, this dissertation presents the design and
experimental evaluation of our fully automatic compiler
transformation, Speculative Decoupled Software Pipelining, a
speculative extension to DSWP. This dissertation additionally introduces multi-threaded transactional
memories to support speculative pipelined multi-threading. Similar to
past speculative parallelization approaches, speculative pipelined
multi-threading relies on runtime-system support to buffer speculative
modifications to memory. However, this dissertation demonstrates that
existing proposals to buffer speculative memory state, transactional
memories, are insufficient for speculative pipelined multi-threading
because the speculative buffers are restricted to a single
thread. Further, this dissertation demonstrates that this limitation
leads to modularity and composability problems even for transactional
programming, thus limiting the potential of that approach also. To
surmount these limitations, this thesis introduces multi-threaded
transactional memories and presents an initial hardware
implementation.
The VELOCITY Compiler: Extracting Efficient Multicore Execution from Legacy Sequential Programs [abstract] (PDF)
Matthew John Bridges
Ph.D. Thesis, Department of Computer Science,
Princeton University, November 2008.
Multiprocessor systems, particularly chip multiprocessors, have
emerged as the predominant organization for future microprocessors.
Systems with 4, 8, and 16 cores are already shipping and a future with
32 or more cores is easily conceivable. Unfortunately, multiple cores
do not always directly improve application performance, particularly
for a single legacy application. Consequently, parallelizing
applications to execute on multiple cores is essential.
Parallel programming models and languages could be used to create
multi-threaded applications. However, moving to a parallel programming
model only increases the complexity and cost involved in software
development. Many automatic thread extraction techniques have been
explored to address these costs.
Unfortunately, the amount of parallelism that has been automatically
extracted using these techniques is generally insufficient to keep
many cores busy. Though there are many reasons for this, the main
problem is that extensions are needed to take full advantage of these
techniques. For example, many important loops are not parallelized
because the compiler lacks the necessary scope to apply the
optimization. Additionally, the sequential programming model forces
programmers to define a single legal application outcome, rather than
allowing for a range of legal outcomes, leading to conservative
dependences that prevent parallelization.
This dissertation integrates the necessary parallelization
techniques, extending them where necessary, to enable automatic
thread extraction. In particular, this includes an expanded
optimization scope, which facilitates the optimization of large loops,
leading to parallelism at higher levels in the
application. Additionally, this dissertation shows that many unnecessary
dependences can be broken with the help of the programmer using
natural, simple extensions to the sequential programming model.
Through a case study of several applications, including several C
benchmarks in the SPEC CINT 2000 suite, this dissertation shows how
scalable parallelism can be extracted. After changing only 38 source
code lines out of 100,000, several of the applications became
parallelizable by automatic thread extraction techniques, yielding a
speedup of 3.64x on 32 cores.
Global Instruction Scheduling for Multi-Threaded Architectures [abstract] (PDF)
Guilherme de Lima Ottoni
Ph.D. Thesis, Department of Computer Science,
Princeton University, September 2008.
Recently, the microprocessor industry has moved toward multi-core or
chip multiprocessor (CMP) designs as a means of utilizing the
increasing transistor counts in the face of physical and
micro-architectural limitations. Despite this move, CMPs do not
directly improve the performance of single-threaded codes, a
characteristic of most applications. In effect, the move to CMPs has
shifted even more the task of improving performance from the hardware
to the software.
Since developing parallel applications has long been recognized as
significantly harder than developing sequential ones, it is very
desirable to have automatic tools to extract thread-level parallelism
(TLP) from sequential applications. Unfortunately, automatic
parallelization has only been successful in the restricted domains of
scientific and data-parallel applications, which usually have regular
array-based memory accesses and little control flow. In order to
support parallelization of general-purpose applications, computer
architects have proposed CMPs with light-weight, fine-grained (scalar)
communication mechanisms. Despite such support, most existing
multi-threading compilation techniques have generally demonstrated
little effectiveness in extracting parallelism from non-scientific
applications. The main reason for this is that such techniques are
mostly restricted to extracting parallelism within local,
straight-line regions of code. As observed in several limit studies,
local regions of code contain limited parallelism, and control
dependence analysis and multiple control units are necessary to
extract most of the parallelism opportunities.
This thesis describes a general compiler framework for Global
Multi-Threaded (GMT) instruction scheduling, i.e. to simultaneously
schedule instructions from a global region of code to extract TLP for
multi-threaded architectures. Our compiler framework is based on a
Program Dependence Graph (PDG) representation, efficient graph
partitioning algorithms, and novel multi-threaded code generation
algorithms. Central to this framework are our multi-threaded code
generation algorithms, which produce efficient code for arbitrary
partitions of the PDG into threads. Based on this framework, three
thread-partitioning strategies for GMT instruction scheduling are
proposed. The first one, called GREMIO, extends list-scheduling to
target multi-threaded architectures and to operate on global code
regions. The second technique, called Decoupled Software Pipelining
(DSWP), extracts pipelined TLP from arbitrary loop nests. We also
propose Parallel-Stage DSWP, an extension of DSWP that allows multiple
threads to concurrently execute the same stage of the pipeline. These
techniques are implemented in the VELOCITY compiler and evaluated on
an accurate CMP simulator built on top of validated Itanium 2 core
models. The experiments show that our techniques balance
applicability and scalability differently, with each technique
resulting in the best speedup in different scenarios. Overall, the
results demonstrate the effectiveness of the proposed compilation
techniques, with significant speedups on a number of real benchmark
applications written in C.
Performance Scalability of Decoupled Software Pipelining [abstract] (ACM DL, PDF)
Ram Rangan, Neil Vachharajani, Guilherme Ottoni, and David I. August
ACM Transactions on Architecture and Code Optimization (TACO), Volume 5, Number 2, August 2008.
Any successful solution to using multi-core processors to scale
general-purpose program performance will have to contend with rising
inter-core communication costs while exposing coarse-grained
parallelism. Recently proposed pipelined multithreading (PMT)
techniques have been demonstrated to have general-purpose
applicability and are also able to effectively tolerate inter-core
latencies through pipelined inter-thread communication. These
desirable properties make PMT techniques strong candidates for program
parallelization on current and future multi-core processors and
understanding their performance characteristics is critical to their
deployment. To that end, this paper evaluates the performance scalability of a
general-purpose PMT technique called decoupled software pipelining
(DSWP) and presents a thorough analysis of the communication
bottlenecks that must be overcome for optimal DSWP scalability.
Shape Analysis with Inductive Recursion Synthesis [abstract] (PDF)
Bolei Guo
Ph.D. Thesis, Department of Computer Science,
Princeton University, June 2008.
For program optimization and verification purposes, shape analysis
can be used to statically determine structural properties of the
runtime heap. One promising formalism for describing the heap is
separation logic, with recursively defined predicates that allow
for concise yet precise summarization of linked data structures.
A major challenge in this approach is the derivation of the
predicates directly from the program being analyzed. As a result,
current uses of separation logic rely heavily on predefined
predicates, limiting the class of programs analyzable to those
that manipulate only predefined data types. This thesis
addresses this problem by proposing a general algorithm based on
inductive program synthesis that automatically infers
recursive predicates in a canonical form. This technique yields a
separation logic based shape analysis that can be effective on
a much wider range of programs.
A key strength of separation logic is that it facilitates, via
explicit expression of structural separation, local reasoning
about the heap, where the effects of altering one part of a data structure
are analyzed in isolation from the rest. The interaction between
local reasoning and the global invariants given by recursive
predicates is a difficult area, especially in the presence of complex
internal sharing in the data structures. Existing approaches, using
logic rules specifically designed for the list predicate to unfold
and fold linked lists, again require a priori knowledge about the
data structures and do not easily generalize to more complex data
structures. We introduce a notion of "truncation points" in a
recursive predicate, which gives rise to generic algorithms for
unfolding and folding arbitrary data structures.
We present a fully implemented interprocedural analysis algorithm
that handles recursive procedures. A combination of pointer
analysis and program slicing is used to deal with the scalability
issue typically faced by shape analyses.
Finally, we present a data dependence test for recursive data
structures that takes advantage of the results of our shape analysis.
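For concreteness, the kind of recursive predicate such a synthesis is expected to infer for a singly linked list can be written in separation logic as follows (standard textbook form, which may differ from the thesis's canonical form); here x ↦ (v, t) describes a single heap cell holding a value and a next pointer, and * is the separating conjunction:

    \mathit{list}(x) \;\triangleq\;
        \bigl(x = \mathrm{nil} \,\wedge\, \mathrm{emp}\bigr)
        \;\vee\;
        \bigl(\exists v,\, t.\; x \mapsto (v, t) \,*\, \mathit{list}(t)\bigr)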
Software Modulated Fault Tolerance [abstract] (PDF)
George A. Reis
Ph.D. Thesis, Department of Electrical Engineering,
Princeton University, June 2008.
In recent decades, microprocessor performance has been increasing
exponentially, due in large part to smaller and faster transistors
enabled by improved fabrication technology. While such transistors
yield performance enhancements, their smaller size and sheer number
make chips more susceptible to transient faults. Designers frequently
introduce redundant hardware or software to detect
these faults because process and material advances are often
insufficient to mitigate their effect.
Regardless of the methods used for adding reliability, these
techniques incur significant performance penalties because they
uniformly protect the entire application. They do not consider
the varying resilience to transient faults of different
program regions. This uniform protection leads to wasted resources that
reduce performance and/or increase cost.
To maximize fault coverage while minimizing the performance impact,
this dissertation takes advantage of the key insights that not all faults in
an unprotected application will cause an incorrect answer and not all
parts of the program respond the same way to reliability techniques.
First, this dissertation demonstrates the varying vulnerability and
performance responses of an application and identifies regions
which are most susceptible to faults as well as those which are
inexpensive to protect. Second, this dissertation advocates the use of
software and hybrid approaches to fault tolerance to enable the
synergy of high-level information with specific redundancy techniques.
Third, this dissertation demonstrates how to exploit this non-uniformity via
Software Modulated Fault Tolerance. Software Modulated Fault Tolerance leverages reliability
and performance information at a high level and directs the
reliability choices at fine granularities to provide the most
efficient use of processor resources for an application.
Spice: Speculative Parallel Iteration Chunk Execution [abstract] (ACM DL, PDF)
Easwaran Raman, Neil Vachharajani, Ram Rangan, and David I. August
Proceedings of the 2008 International Symposium on Code Generation and Optimization (CGO), April 2008.
Parallel-Stage Decoupled Software Pipelining [abstract] (ACM DL, PDF)
Easwaran Raman, Guilherme Ottoni, Arun Raman, Matthew Bridges, and David I. August
Proceedings of the 2008 International Symposium on Code Generation and Optimization (CGO), April 2008.
Communication Optimizations for Global Multi-Threaded Instruction Scheduling [abstract] (ACM DL, PDF)
Guilherme Ottoni and David I. August
Proceedings of the 13th ACM International Conference on
Architectural Support for Programming Languages and Operating
Systems (ASPLOS), March 2008.
The recent shift in the industry towards chip
multiprocessor (CMP) designs has brought the need for multi-threaded
applications to mainstream computing. Given the difficulty in finding
coarse-grained parallelism in general-purpose, sequential
applications, computer architects have proposed CMPs with
light-weight, fine-grained (scalar) communication
mechanisms.
As observed in several limit studies, most of the parallelization
opportunities require looking for parallelism beyond local regions of
code. To exploit these opportunities, especially for sequential
applications, researchers have recently proposed global
multi-threaded instruction scheduling techniques, including Decoupled
Software Pipelining (DSWP) and
GREMIO. These techniques simultaneously
schedule instructions from large regions of code, such as arbitrary
loop nests or whole procedures, and have been shown effective at
extracting threads for many applications. A key enabler of these
global instruction scheduling techniques is the Multi-Threaded
Code Generation (MTCG) algorithm proposed in [Ottoni et al, '05],
which generates multi-threaded code for any partition of the
instructions into threads. This algorithm inserts communication and
synchronization instructions in order to satisfy all inter-thread
dependences.
In this paper, we present a general compiler framework, COCO, to
optimize the communication and synchronization instructions inserted
by the MTCG algorithm. This framework, based on thread-aware data-flow
analyses and graph min-cut algorithms, appropriately models and
optimizes all kinds of inter-thread dependences, including register,
memory, and control dependences. Our experiments, using a fully
automatic compiler implementation of these techniques, demonstrate
significant reductions (about 30% on average) in the number of
dynamic communication instructions in code parallelized with DSWP and
GREMIO. This reduction in communication translates to performance
gains of up to 40%.
Revisiting the Sequential Programming Model for the Multicore Era [abstract] (Original Full Paper, IEEE Xplore, PDF)
Matthew J. Bridges, Neil Vachharajani, Yun Zhang, Thomas B. Jablin, and David I. August
IEEE Micro, Volume 28, Number 1, January 2008.
IEEE Micro's "Top Picks" special issue for papers "most relevant to industry and significant in contribution to the field of computer architecture" in 2007.
Single-threaded programming is already considered a complicated task.
The move to multi-threaded programming only increases the complexity
and cost involved in software development due to rewriting legacy
code, training of the programmer, increased debugging of the program,
and efforts to avoid race conditions, deadlocks, and other problems
associated with parallel programming. To address these costs, other
approaches, such as automatic thread extraction, have been explored.
Unfortunately, the amount of parallelism that has been automatically
extracted is generally insufficient to keep many cores busy. This paper argues that this lack of parallelism is not an intrinsic
limitation of the sequential programming model, but rather occurs for
two reasons. First, there exists no framework for automatic thread
extraction that brings together key existing state-of-the-art compiler
and hardware techniques. This paper shows that such a framework can
yield scalable parallelization on several SPEC CINT2000 benchmarks.
Second, existing sequential programming languages force programmers to
define a single legal program outcome, rather than allowing for a
range of legal outcomes. This paper shows that natural, simple
extensions to the sequential programming model enable parallelization
for the remainder of the SPEC CINT2000 suite. Our experience
demonstrates that, by changing only 60 source code lines, all of the C
benchmarks in the SPEC CINT2000 suite were parallelizable by automatic
thread extraction. This process, constrained by the limits of modern
optimizing compilers, yielded a speedup of 454% on these
applications.
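The flavor of annotation the paper argues for can be sketched as follows (the annotation syntax and names here are hypothetical, not the paper's exact notation): by declaring that calls to a stateful function such as a random-number generator are interchangeable, the programmer widens the set of legal program outcomes and removes the loop-carried dependence that would otherwise serialize the loop.

    /* commutative_sketch.c: illustrative programmer annotation relaxing ordering. */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t rng_state = 88172645463325252ULL;

    /* [Commutative] -- hypothetical annotation: any order of calls is legal */
    static uint64_t rng_next(void) {        /* xorshift PRNG with internal state */
        rng_state ^= rng_state << 13;
        rng_state ^= rng_state >> 7;
        rng_state ^= rng_state << 17;
        return rng_state;
    }

    int main(void) {
        long hits = 0, samples = 1000000;
        for (long i = 0; i < samples; i++) {
            /* without the annotation, the dependence through rng_state serializes
             * this loop; with it, iterations may run in parallel even though
             * different interleavings yield different (equally legal) estimates */
            double x = (double)(rng_next() % 1000000) / 1000000.0;
            double y = (double)(rng_next() % 1000000) / 1000000.0;
            if (x * x + y * y <= 1.0) hits++;
        }
        printf("pi ~= %f\n", 4.0 * (double)hits / (double)samples);
        return 0;
    }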
Optimizations for the Memory Hierarchy
Easwaran Raman and David I. August
Compiler Design Handbook (ISBN: 978-1420043822)
Edited by Y. N. Srikant and Priti Shankar.
CRC Press, December 2007.
Revisiting the Sequential Programming Model for Multi-Core [abstract] (IEEE Xplore, PDF, Top Picks Version)
Matthew J. Bridges, Neil Vachharajani, Yun Zhang, Thomas B. Jablin, and David I. August
Proceedings of the 40th IEEE/ACM International Symposium on
Microarchitecture (MICRO), December 2007.
Selected for IEEE Micro's "Top Picks" special issue for papers "most relevant to industry and significant in contribution to the field of computer architecture" in 2007.
Single-threaded programming is already considered a complicated task.
The move to multi-threaded programming only increases the complexity
and cost involved in software development due to rewriting legacy
code, training of the programmer, increased debugging of the program,
and efforts to avoid race conditions, deadlocks, and other problems
associated with parallel programming. To address these costs, other
approaches, such as automatic thread extraction, have been explored.
Unfortunately, the amount of parallelism that has been automatically
extracted is generally insufficient to keep many cores busy. This paper argues that this lack of parallelism is not an intrinsic
limitation of the sequential programming model, but rather occurs for
two reasons. First, there exists no framework for automatic thread
extraction that brings together key existing state-of-the-art compiler
and hardware techniques. This paper shows that such a framework can
yield scalable parallelization on several SPEC CINT2000 benchmarks.
Second, existing sequential programming languages force programmers to
define a single legal program outcome, rather than allowing for a
range of legal outcomes. This paper shows that natural, simple
extensions to the sequential programming model enable parallelization
for the remainder of the SPEC CINT2000 suite. Our experience
demonstrates that, by changing only 60 source code lines, all of the C
benchmarks in the SPEC CINT2000 suite were parallelizable by automatic
thread extraction. This process, constrained by the limits of modern
optimizing compilers, yielded a speedup of 454% on these
applications.
Global Multi-Threaded Instruction Scheduling [abstract] (IEEE Xplore, PDF)
Guilherme Ottoni and David I. August
Proceedings of the 40th IEEE/ACM International Symposium on
Microarchitecture (MICRO), December 2007.
Recently, the microprocessor industry has moved toward chip
multiprocessor (CMP) designs as a means of utilizing the increasing
transistor counts in the face of physical and micro-architectural
limitations. Despite this move, CMPs do not directly improve the
performance of single-threaded codes, a characteristic of most
applications. In order to support parallelization of general-purpose
applications, computer architects have proposed CMPs with light-weight
scalar communication mechanisms. Despite such support, most existing
compiler multi-threading techniques have generally demonstrated little
effectiveness in extracting parallelism from non-scientific
applications. The main reason for this is that such techniques are
mostly restricted to extracting parallelism within straight-line
regions of code.
In this paper, we first propose a framework that enables global
multi-threaded instruction scheduling in general. We then describe
GREMIO, a scheduler built using this framework. GREMIO operates at a
global scope, at the procedure level, and uses control dependence
analysis to extract non-speculative thread-level parallelism from
sequential codes. Using a fully automatic compiler implementation of
GREMIO and a validated processor model, this paper demonstrates gains
for a dual-core CMP model running a variety of codes. Our experiments
demonstrate the advantage of exploiting global scheduling for
multi-threaded architectures, and present gains in a detailed
comparison with the Decoupled Software Pipelining (DSWP)
multi-threading technique. Furthermore, our experiments show that
adding GREMIO to a compiler with DSWP improves the average speedup
from 16.5% to 32.8% for important benchmark functions when utilizing
two cores, indicating the importance of this technique in making
compilers extract threads effectively.
UNISIM: An Open Simulation Environment and Library for Complex Architecture Design and Collaborative Development [abstract] (IEEE Xplore, PDF)
David I. August, Jonathan Chang, Sylvain Girbal, Daniel Gracia-Perez, Gilles Mouchard, David Penry, Olivier Temam, and Neil Vachharajani
IEEE Computer Architecture Letters (CAL), September 2007.
Simulator development is already a huge burden for many academic and
industry research groups; future complex or heterogeneous multi-cores,
as well as the multiplicity of performance metrics and required
functionality, will make matters worse. We present a new simulation
environment, called UNISIM, which is designed to rationalize simulator
development by making it possible and efficient to distribute the
overall effort over multiple research groups, even without direct
cooperation. UNISIM achieves this goal with a combination of modular
software development, distributed communication protocols, multi-level
abstract modeling, interoperability capabilities, a set of simulator
services APIs, and an open library/repository for providing a
consistent set of simulator modules.
Speculative Decoupled Software Pipelining [abstract] (IEEE CS, PDF)
Neil Vachharajani, Ram Rangan, Easwaran Raman, Matthew J. Bridges, Guilherme Ottoni, and David I. August
Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2007.
In recent years, microprocessor manufacturers have
shifted their focus from single-core to multicore processors. To
avoid burdening programmers with the responsibility of parallelizing
their applications, some researchers have advocated automatic thread
extraction. A recently proposed technique, Decoupled Software
Pipelining (DSWP), has demonstrated promise by partitioning loops into
long-running, fine-grained threads organized into a pipeline. Using a
pipeline organization and execution decoupled by inter-core
communication queues, DSWP offers increased execution efficiency that
is largely independent of inter-core communication latency. This paper proposes adding speculation to DSWP and evaluates an
automatic approach for its implementation. By speculating past
infrequent dependences, the benefit of DSWP is increased by making it
applicable to more loops, facilitating better balanced threads, and
enabling parallelized loops to be run on more cores. Unlike prior
speculative threading proposals, speculative DSWP focuses on breaking
dependence recurrences. By speculatively breaking these recurrences,
instructions that were formerly restricted to a single thread to
ensure decoupling are now free to span multiple threads. Using an
initial automatic compiler implementation and a validated processor
model, this paper demonstrates significant gains using speculation for
4-core chip multiprocessor models running a variety of codes.
Fault-tolerant Typed Assembly Language [abstract] (ACM DL, PDF)
Frances Perry, Lester Mackey, George A. Reis, Jay Ligatti, David I. August, and David Walker
Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2007.
Winner Best Paper Award.
A transient hardware fault occurs when an energetic particle strikes
a transistor, causing it to change state. Although transient faults do
not permanently damage the hardware, they may corrupt computations
by altering stored values and signal transfers. In this paper, we
propose a new scheme for provably safe and reliable computing in
the presence of transient hardware faults. In our scheme, software
computations are replicated to provide redundancy while special
instructions compare the results of replicas to detect errors before
writing critical data. In stark contrast to any previous efforts in this
area, we have analyzed our fault tolerance scheme from a formal,
theoretical perspective. First, we provide an operational semantics
for our assembly language, which includes a precise formal definition
of our fault model. Formulating such a model is a crucial step
forward for the science of hardware fault tolerance as it pins down
the assumptions being made about when and where faults may occur
in a completely precise and transparent manner. Second, we
develop an assembly-level type system designed to detect reliability
problems in compiled code. On the one hand, this type system
may be viewed as a theoretical tool for proving that code is reliable.
On the other hand, it may be used as a practical tool for compiler
debugging. In the latter case, the type system is strictly superior
to conventional testing as it guarantees full coverage and perfect
fault tolerance relative to the fault model. Third, we provide a formal
specification for program fault tolerance under the given fault
model and prove that all well-typed programs are indeed fault tolerant.
In addition to the formal analysis, we evaluate the execution
time of our detection scheme to quantify the performance penalty
for sound fault tolerance.
Pipelined Multithreading Transformations and Support Mechanisms [abstract] (PDF)
Ram Rangan
Ph.D. Thesis, Department of Computer Science,
Princeton University, June 2007.
Even though chip multiprocessors have emerged as the
predominant organization for future microprocessors, the multiple
on-chip cores do not directly result in improved application
performance (especially for legacy applications, which are
predominantly sequential C/C++ codes). Consequently, parallelizing
applications to execute on multiple cores is essential to their
success. Independent multithreading techniques, like DOALL extraction,
create partially or fully independent threads, which communicate
rarely, if at all. While such strategies keep high inter-thread
communication costs from impacting program performance, they cannot be
applied to parallelize general-purpose applications which are
characterized by difficult-to-break recurrences. Even though cyclic
multithreading techniques, such as DOACROSS, are more applicable, the
cyclic inter-thread dependences created by these techniques cause them
to have very low tolerance to rising inter-core latencies. To address these problems, this work introduces a pipelined
multithreading (PMT) transformation called Decoupled Software
Pipelining (DSWP). DSWP, in particular, and PMT techniques, in
general, are able to tolerate inter-core latencies, while still
handling codes with complex recurrences. They achieve this by
enforcing an acyclic communication discipline amongst threads, which
allows threads to use inter-thread queues to communicate in a pipelined
fashion. This dissertation demonstrates that DSWPed codes not only
tolerate inter-core communication costs, but also effectively tolerate
variable latency stalls in applications better than single-threaded
execution on both in-order and out-of-order issue processors with
comparable resources. It then performs a thorough analysis of the
performance scalability of automatically generated DSWPed codes and
identifies the conditions necessary to achieve peak PMT
performance. Next, the dissertation shows that even though PMT applications
tolerate inter-core latencies well, the high frequency of inter-thread
communication (once every 5 to 20 dynamic instructions) in these
codes makes them very sensitive to the intra-thread overhead imposed
by communication operations. In order to understand the issues
surrounding inter-thread communication for PMT applications, this
dissertation undertakes a methodical exploration of the design space
of communication support options for PMT. Three new communication
mechanisms with varying cost-performance tradeoffs are introduced and
are shown to perform 38% to 200% better than the state of the
art.
Shape Analysis with Inductive Recursion Synthesis [abstract] (ACM DL, PDF)
Bolei Guo, Neil Vachharajani, and David I. August
Proceedings of the ACM SIGPLAN 2007 Conference on
Programming Language Design and Implementation (PLDI), June 2007.
Separation logic with recursively defined predicates allows for
concise yet precise description of the shapes of data structures.
However, most uses of separation logic for program analysis rely on
pre-defined recursive predicates, limiting the class of programs
analyzable to those that manipulate only a priori data structures.
This paper describes a general algorithm based on inductive
program synthesis that automatically infers recursive shape
invariants, yielding a shape analysis based on separation logic that
can be applied to any program. A key strength of separation logic is that it facilitates, via
explicit expression of structural separation, local reasoning
about the heap, where the effects of altering one part of a data structure
are analyzed in isolation from the rest. The interaction between
local reasoning and the global invariants given by recursive
predicates is a difficult area, especially in the presence of complex
internal sharing in the data structures. Existing approaches, using
logic rules specifically designed for the list predicate to unfold
and fold linked-lists, again require a priori knowledge about the
shapes of the data structures and do not easily generalize to more
complex data structures. We introduce a notion of "truncation points"
in a recursive predicate, which gives rise to generic algorithms for
unfolding and folding arbitrary data structures.
Global Multi-Threaded Instruction Scheduling: Technique and
Initial Results [abstract] (CiteSeerX, PDF)
Guilherme Ottoni and David I. August
Proceedings of the Sixth Workshop on Explicitly Parallel
Instruction Computer Architectures and Compiler Technology (EPIC), March 2007.
Recently, the microprocessor industry has reached hard
physical and micro-architectural limits that have prevented the
continuous clock-rate increase, which had been the major source of
performance gains for decades. These impediments, in conjunction with
the still increasing transistor counts per chip, have driven all major
microprocessor manufacturers toward Chip Multiprocessor (CMP) designs.
Although CMPs are able to concurrently pursue multiple threads of
execution, they do not directly improve the performance of most
applications, which are written in sequential languages. In effect,
the move to CMPs has shifted even more the task of improving
performance from the hardware to the software. In order to support
more effective parallelization of sequential applications, computer
architects have proposed CMPs with light-weight communication
mechanisms. Despite such support, proposed multi-threaded scheduling
techniques have generally demonstrated little effectiveness in
extracting parallelism from general-purpose, sequential applications.
We call these techniques local multi-threaded scheduling,
because they basically exploit parallelism within straight-line
regions of code. A key observation of this paper is that local
multi-threaded techniques do not exploit the main feature of CMPs: the
ability to concurrently execute instructions from different
control-flow regions. In order to benefit from this powerful CMP
characteristic, it is necessary to perform global multi-threaded
scheduling, which simultaneously schedules instructions from
different basic blocks to enable their concurrent execution. This
paper presents algorithms to perform global scheduling for
communication-exposed multi-threaded architectures. By global
we mean that our technique simultaneously schedules instructions from
an arbitrary code region. Very encouraging preliminary results,
targeting a dual-core Itanium 2 model, are presented for selected
MediaBench and SPEC-CPU benchmark applications.
Automatic Instruction-Level Software-Only Recovery Methods [abstract] (IEEE Xplore, Original Full Paper, PDF)
George A. Reis, Jonathan Chang, and David I. August
IEEE Micro, Volume 27, Number 1, January 2007.
IEEE Micro's "Top Picks" special issue for papers "most
relevant to industry and significant in contribution to the field of
computer architecture" in 2006.
As chip densities and clock rates increase, processors are becoming
more susceptible to transient faults that can affect program
correctness. Computer architects have typically addressed reliability
issues by adding redundant hardware, but these techniques are often
too expensive to be used widely. Software-only reliability techniques
have shown promise in their ability to protect against soft-errors
without any hardware overhead. However, existing low-level
software-only fault tolerance techniques have only addressed the
problem of detecting faults, leaving recovery largely unaddressed. In
this paper, we present the concept, implementation, and evaluation of
automatic, instruction-level, software-only recovery techniques, as
well as various specific techniques representing different trade-offs
between reliability and performance. Our evaluation shows that these
techniques fulfill the promises of instruction-level, software-only
fault tolerance by offering a wide range of flexible recovery options.
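One point in the reliability/performance trade-off space described above is instruction-level triplication with majority voting before values reach memory. A source-level caricature of that idea follows (the paper's compiler applies such transformations automatically at a much lower level and must also keep the redundant copies from being merged by later optimization, which this sketch approximates with volatile temporaries):

    /* recovery_sketch.c: triplicated computation with a majority vote before the
     * value is committed to memory. Illustrative only. */
    #include <stdio.h>

    static long majority3(long a, long b, long c) {
        if (a == b || a == c) return a;     /* at most one copy was corrupted: recover */
        if (b == c) return b;
        fprintf(stderr, "unrecoverable: all three copies disagree\n");
        return a;
    }

    static volatile long r1, r2, r3;        /* volatile keeps the copies distinct here */

    void scale_and_store(long *dst, long x, long k) {
        r1 = x * k + 1;                     /* three redundant versions of the computation */
        r2 = x * k + 1;
        r3 = x * k + 1;
        *dst = majority3(r1, r2, r3);       /* vote where an error would become visible */
    }

    int main(void) {
        long out;
        scale_and_store(&out, 6, 7);
        printf("%ld\n", out);               /* prints 43 */
        return 0;
    }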
Non-Uniform Fault Tolerance [abstract] (PDF)
Jonathan Chang, George A. Reis, and David I. August
Proceedings of the 2nd Workshop on Architectural Reliability (WAR), December 2006.
As devices become more susceptible to transient faults that can affect
program correctness, processor designers will increasingly compensate
by adding hardware or software redundancy. Proposed redundancy
techniques and those currently in use are generally applied uniformly
to a structure despite non-uniformity in the way errors within the
structure manifest themselves in programs. This uniform protection
leads to inefficiency in terms of performance, power, and area. Using
case studies involving the register file, this paper motivates an
alternative Non-Uniform Fault Tolerance approach which improves
reliability over uniform approaches by spending the redundancy budget
on those areas most susceptible.
Support for High-Frequency Streaming in CMPs [abstract] (ACM DL, PDF)
Ram Rangan, Neil Vachharajani, Adam Stoler, Guilherme Ottoni, David I. August, and George Z. N. Cai
Proceedings of the 39th IEEE/ACM International Symposium on
Microarchitecture (MICRO), December 2006.
As the industry moves toward larger-scale chip multiprocessors, the
need to parallelize applications grows. High inter-thread
communication delays, exacerbated by over-stressed high-latency memory
subsystems and ever-increasing wire delays, require parallelization
techniques to create partially or fully independent threads to improve
performance. Unfortunately, developers and compilers alike often fail
to find sufficient independent work of this kind. Recently proposed pipelined streaming techniques have shown
significant promise for both manual and automatic parallelization.
These techniques have wide-scale applicability because they embrace
inter-thread dependences (albeit acyclic dependences) and tolerate
long-latency communication of these dependences. This paper addresses
the lack of architectural support for this type of concurrency, which
has blocked its adoption and hindered related language and compiler
research. We observe that both manual and automatic techniques create
high-frequency streaming threads, with communication occurring
every 5 to 20 instructions. Even while easily tolerating
inter-thread transit delays, this high-frequency communication
makes thread performance very sensitive to intra-thread delays
from the repeated execution of the communication operations. Using
this observation, we define the design-space and evaluate several
mechanisms to find a better trade-off between performance and
operating system, hardware, and design costs. From this, we find a
light-weight streaming-aware enhancement to conventional memory
subsystems that doubles the speed of these codes and is within 2% of
the best-performing, but heavy-weight, hardware solution.
Configurable Transient Fault Detection via Dynamic Binary Translation [abstract] (PDF)
George A. Reis, Jonathan Chang, David I. August, Robert Cohn, and Shubhendu S. Mukherjee
Proceedings of the 2nd Workshop on Architectural Reliability (WAR), December 2006.
Smaller feature sizes, lower voltage levels, and reduced noise margins
have helped improve the performance and lower the power consumption
of modern microprocessors. These same advances have
made processors more susceptible to transient faults that can corrupt
data and make systems unavailable. Designers often compensate for
transient faults by adding hardware redundancy and making circuit- and
process-level adjustments. However, applications have different
data integrity and availability demands, which make hardware
approaches such as these too costly for many markets. Software techniques can provide fault tolerance at a lower cost and
with greater flexibility since they can be selectively deployed in the
field even after the hardware has been manufactured. Most existing
software-only techniques use recompilation, requiring access to program
source code. Regardless of the code transformation method,
previous techniques also incur unnecessary significant performance
penalties by uniformly protecting the entire program without taking
into account the varying vulnerability of different program regions
and state elements to transient faults. This paper presents Spot, a software-only fault-detection technique
which uses dynamic binary translation to provide software-modulated
fault tolerance with fine-grained control of redundancy.
By using dynamic binary translation, users can improve the reliability
of their applications without any assistance from hardware or
software vendors. By using software-modulated fault tolerance, Spot
can vary the level of protection independently for each register and
region of code to provide users with more, and often superior, fault-detection
options. This feature of Spot increases the mean work to
failure from 1.90x to 17.79x.
The Acceleration of Structural Microarchitectural Simulation via Scheduling [abstract] (PDF)
David A. Penry
Ph.D. Thesis, Department of Computer Science,
Princeton University, November 2006.
Microarchitects rely upon simulation to evaluate design alternatives,
yet constructing an accurate simulator by hand is a difficult and
time-consuming process because simulators are usually written in
sequential languages while the system being modeled is concurrent.
Structural modeling can mitigate this difficulty by allowing the
microarchitect to specify the simulation model in a concurrent,
structural form; a simulator compiler then generates a simulator from
the model. However, the resulting simulators are generally slower
than those produced by hand. The thesis of this dissertation is that
simulation speed improvements can be obtained by careful scheduling of
the work to be performed by the simulator onto single or multiple
processors.
For scheduling onto single processors, this dissertation presents an
evaluation of previously proposed scheduling mechanisms in the context
of a structural microarchitectural simulation framework which uses a
particular model of computation, the Heterogeneous Synchronous
Reactive (HSR) model, and improvements to these mechanisms which make
them more effective or more feasible for microarchitectural models. A
static scheduling technique known as partitioned scheduling is shown
to offer the most performance improvement: a speedup of up to 2.08. This
work furthermore proves that the Discrete Event model of
computation can be statically scheduled using partitioned scheduling
when restricted in ways that are commonly assumed in
microarchitectural simulation.
For scheduling onto multiple processors, this dissertation presents
the first automatic parallelization of simulators using the HSR model
of computation. It shows that effective parallelization requires
techniques to avoid waiting due to locks and to improve cache
locality. Two novel heuristics for lock mitigation and two for cache
locality improvement are introduced and evaluated on three different
parallel systems. The combination of lock mitigation and locality
improvement is shown to allow superlinear speedup for some models: up
to 7.56 for four processors.
Static Typing for a Faulty Lambda Calculus [abstract] (ACM DL, PDF)
David Walker, Lester Mackey, Jay Ligatti, George A. Reis, and David I. August
Proceedings of the 11th ACM SIGPLAN International Conference on Functional Programming (ICFP), September 2006.
A transient hardware fault occurs when an energetic particle strikes a
transistor, causing it to change state. These faults do not cause
permanent damage, but may result in incorrect program execution by
altering signal transfers or stored values. While the likelihood that
such transient faults will cause any significant damage may seem
remote, over the last several years transient faults have caused
costly failures in high-end machines at America Online, eBay, and the
Los Alamos Neutron Science Center, among others. Because
susceptibility to transient faults is proportional to the size and
density of transistors, the problem of transient faults will become
increasingly important in the coming decades. This paper defines the first formal, type-theoretic framework for
studying reliable computation in the presence of transient faults.
More specifically, it defines λzap, a lambda calculus that exhibits
intermittent data faults. In order to detect and recover from these
faults, λzap programs replicate intermediate computations and use
majority voting, thereby modeling software-based fault tolerance
techniques studied extensively, but informally. To ensure that programs maintain the proper invariants and use λzap
primitives correctly, the paper defines a type system for the
language. This type system guarantees that well-typed programs can
tolerate any single data fault. To demonstrate that λzap can serve as
an idealized typed intermediate language, we define a type-preserving
translation from a standard simply-typed lambda calculus into λzap.
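To make the replicate-and-vote idea concrete, the short C sketch below shows the pattern that λzap models at the language level; it is an illustration only, not the calculus or code from the paper, and all names are ours.

    /* Illustrative sketch only: the replicate-and-vote pattern that
     * lambda-zap models, shown for a single integer computation in C.
     * A transient fault may corrupt one copy; majority voting recovers
     * the correct value as long as at most one copy is faulty. */
    #include <stdio.h>

    static int vote3(int a, int b, int c) {
        if (a == b || a == c) return a;   /* a agrees with a majority */
        return b;                         /* otherwise b and c must agree */
    }

    static int square(int x) { return x * x; }

    int main(void) {
        int x = 7;
        /* Replicate the intermediate computation three times. */
        int r1 = square(x), r2 = square(x), r3 = square(x);
        printf("%d\n", vote3(r1, r2, r3));  /* tolerates one corrupted copy */
        return 0;
    }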
Eliminating Scope and Selection Restrictions in Compiler Optimizations [abstract] (PDF)
Spyridon Triantafyllis
Ph.D. Thesis, Department of Computer Science,
Princeton University, September 2006.
To meet the challenges presented by the performance requirements of modern architectures,
compilers have been augmented with a rich set of aggressive optimizing transformations.
However, the overall compilation model within which these transformations operate has
remained fundamentally unchanged. This model imposes restrictions on these transformations'
application, limiting their effectiveness.
First, procedure-based compilation limits code transformations within a single procedure's
boundaries, which may not present an ideal optimization scope. Although aggressive
inlining and interprocedural optimization can alleviate this problem, code growth and
compile time considerations limit their applicability.
Second, by applying a uniform optimization process on all codes, compilers cannot
meet the particular optimization needs of each code segment. Although the optimization
process is tailored by heuristics that attempt to a priori judge the effect of each transformation
on final code quality, the unpredictability of modern optimization routines and the
complexity of the target architectures severely limit the accuracy of such predictions.
This thesis focuses on removing these restrictions through two novel compilation framework
modifications, Procedure Boundary Elimination (PBE) and Optimization-Space Exploration
(OSE).
PBE forms compilation units independent of the original procedures. This is achieved
by unifying the entire application into a whole-program control-flow graph, allowing the
compiler to repartition this graph into free-form regions, making analysis and optimization
routines able to operate on these generalized compilation units. Targeted code duplication
techniques can then recover the performance benefits of inlining while limiting code
growth. Thus PBE offers a superset of the benefits of inlining and interprocedural optimization,
while avoiding both excessive code growth and overly long compile times.
OSE, on the other hand, explores many optimization options for each code segment and
selects the best one a posteriori. OSE trims the space of options explored through limited
use of heuristics, further limits the search space during compiler tuning, and exploits feedback
to prune the remaining optimization configurations at compile time. The resulting
optimization outcomes are compared through a fast and effective static performance estimator.
As a result, OSE is the first iterative compilation technique fast enough and general
enough for general-purpose compilation.
Amortizing Software Queue Overhead for Pipelined Inter-Thread Communication [abstract] (PDF)
Ram Rangan and David I. August
Proceedings of the Workshop on Programming Models for Ubiquitous Parallelism (PMUP), September 2006.
Future chip multiprocessors are expected to contain multiple on-die
processing cores. Increased memory system contention and wire delays
will result in high inter-core latencies in these processors. Thus,
parallelizing applications to efficiently execute on multiple contexts
is key to achieving continued performance improvements. Recently
proposed pipelined multithreading (PMT) techniques have shown
significant promise for both manual and automatic
parallelization. They tolerate increasing inter-thread communication
delays by enforcing acyclic dependences amongst communicating threads
and pipelining communication. However, lack of efficient communication support for such programs
hinders related language and compiler research. While researchers have
proposed dedicated interconnects and storage for inter-core
communication, such mechanisms are not cost-effective, consume extra
power, demand chip redesign effort, and necessitate complex operating
system modifications. Software implementations of shared memory
queues avoid these problems, but they tend to have heavy overhead per
communication operation, causing them to negate parallelization
benefits and, worse still, to perform slower than the original
single-threaded codes. In this paper, we present a simple compiler
analysis to coalesce synchronization and queue pointer updates for
select communication operations, to minimize the intra-thread
overhead of software queue implementations. A preliminary comparison
of static schedule heights shows a considerable performance
improvement over existing software queue
implementations.
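The flavor of the optimization can be sketched in C as below; this is a hypothetical illustration of coalescing queue-pointer updates over a block of enqueues, not the paper's analysis or implementation, and all names are ours.

    /* Hypothetical sketch: a shared-memory software queue in which the
     * producer publishes the shared tail index once per BLOCK enqueues
     * rather than once per element, amortizing synchronization and
     * pointer-update overhead. Full/empty checks are omitted for brevity. */
    #include <stdatomic.h>

    #define QSIZE 1024
    #define BLOCK 8

    struct queue {
        long buf[QSIZE];
        atomic_long head;    /* consumer's published read index  */
        atomic_long tail;    /* producer's published write index */
    };

    static long local_tail;  /* producer-private copy of the tail */

    void produce(struct queue *q, long v) {
        q->buf[local_tail % QSIZE] = v;
        local_tail++;
        /* Coalesced update: the shared tail (and the implied "data ready"
         * synchronization) is published only once per BLOCK items. */
        if (local_tail % BLOCK == 0)
            atomic_store(&q->tail, local_tail);
    }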
The Liberty Simulation Environment: A Deliberate Approach to
High-Level System Modeling [abstract] (ACM DL, PDF)
Manish Vachharajani, Neil Vachharajani, David A. Penry, Jason A. Blome, Sharad Malik, and David I. August
ACM Transactions on Computer Systems (TOCS), Volume 24, Number 3, August 2006.
Automatic Instruction Scheduler Retargeting by Reverse-Engineering [abstract] (ACM DL, PDF)
Matthew J. Bridges, Neil Vachharajani, Guilherme Ottoni, and David I. August
Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2006.
In order to generate high-quality code for modern processors, a
compiler must aggressively schedule instructions, maximizing resource
utilization for execution efficiency. For a compiler to produce such
code, it must avoid structural hazards by being aware of the
processor's available resources and of how these resources are
utilized by each instruction. Unfortunately, the most prevalent
approach to constructing such a scheduler, manually discovering and
specifying this information, is both tedious and error-prone.
This paper presents a new approach which, when given a processor or
processor model, automatically determines this information. After
establishing that the problem of perfectly determining a processor's
structural hazards through probing is not solvable, this paper
proposes a heuristic algorithm that discovers most of this information
in practice. This can be used either to alleviate the problems
associated with manual creation or to verify an existing
specification. Scheduling with these automatically derived structural
hazards yields almost all of the performance gain achieved using
perfect hazard information.
A Framework for Unrestricted Whole-Program Optimization [abstract] (ACM DL, PDF)
Spyridon Triantafyllis, Matthew J. Bridges, Easwaran Raman, Guilherme Ottoni, and David I. August
Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2006.
Procedures have long been the basic units of compilation in
conventional optimization frameworks. However, procedures are
typically formed to serve software engineering rather than
optimization goals, arbitrarily constraining code transformations.
Techniques, such as aggressive inlining and interprocedural
optimization, have been developed to alleviate this problem, but, due
to code growth and compile time issues, these can be applied only
sparingly.
This paper introduces the Procedure Boundary Elimination (PBE)
compilation framework, which allows unrestricted whole-program
optimization. PBE allows all intra-procedural optimizations and
analyses to operate on arbitrary subgraphs of the program, regardless
of the original procedure boundaries and without resorting to
inlining. In order to control compilation time, PBE also introduces
novel extensions of region formation and encapsulation.
PBE enables targeted code specialization, which recovers the
specialization benefits of inlining while keeping code growth in
check. This paper shows that PBE attains better performance than
inlining with half the code growth.
Automatic Instruction-Level Software-Only Recovery [abstract] (IEEE Xplore, PDF, Top Picks Version)
Jonathan Chang, George A. Reis, and David I. August
Proceedings of the International Conference on Dependable Systems and Networks (DSN), June 2006.
Winner of the William C. Carter Award.
Selected for IEEE Micro's "Top Picks" special issue for papers "most
relevant to industry and significant in contribution to the field of
computer architecture" in 2006.
As chip densities and clock rates increase, processors are becoming
more susceptible to transient faults that can affect program
correctness. Computer architects have typically addressed reliability
issues by adding redundant hardware, but these techniques are often
too expensive to be used widely. Software-only reliability techniques
have shown promise in their ability to protect against soft-errors
without any hardware overhead. However, existing low-level
software-only fault tolerance techniques have only addressed the
problem of detecting faults, leaving recovery largely unaddressed. In
this paper, we present the concept, implementation, and evaluation of
automatic, instruction-level, software-only recovery techniques, as
well as various specific techniques representing different trade-offs
between reliability and performance. Our evaluation shows that these
techniques fulfill the promises of instruction-level, software-only
fault tolerance by offering a wide range of flexible recovery options.
Selective Runtime Memory Disambiguation in a Dynamic Binary
Translator [abstract] (PDF, ACM DL)
Bolei Guo, Youfeng Wu, Cheng Wang, Matthew J. Bridges, Guilherme Ottoni, Neil Vachharajani, Jonathan Chang, and David I. August
Proceedings of the 15th International Conference on Compiler
Construction (CC), March 2006.
Performing runtime alias analysis in a dynamic binary translator (DBT)
is difficult due to incomplete control-flow information and the high
complexity of accurate alias analysis. Whole-program profiling,
however, shows that the vast majority of memory references do not
alias. The current technique used in DBTs to disambiguate memory
references, instruction inspection, is too simple and can only
disambiguate one-third of potential aliases. To address the problem
of effective memory disambiguation while keeping a tight bound on the
analysis overhead, we propose an efficient heuristic algorithm that
strategically selects key memory dependences to disambiguate with
runtime checks. These checks have little runtime overhead and, in the
common case where aliasing does not occur, enable aggressive
optimizations, particularly scheduling, to be performed. We demonstrate
that a small number of checks, inserted with a low overhead analysis,
can approach optimal scheduling, where all false memory dependences
are removed.
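The shape of such a runtime check can be sketched as follows; this is an illustration under our own assumptions, not the algorithm or code from the paper.

    /* Illustrative sketch: a runtime disambiguation check guarding an
     * optimized schedule. If the store through p cannot alias the load
     * through q, the load may be scheduled early; otherwise the original
     * ordering is preserved. */
    extern void use(int value);

    void kernel(int *p, int *q, int n) {
        if (p != q) {
            /* Fast path: no alias, so the load has been hoisted above the store. */
            int t = *q;
            *p = n;
            use(t);
        } else {
            /* Slow path: aliasing occurs, keep the original order. */
            *p = n;
            use(*q);
        }
    }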
Exploiting Parallelism and Structure to Accelerate the
Simulation of Chip Multi-processors [abstract] (IEEE Xplore, PDF)
David A. Penry, Daniel Fay, David Hodgdon, Ryan Wells, Graham Schelle, David I. August, and Daniel A. Connors
Proceedings of the Twelfth International Symposium on
High-Performance Computer Architecture (HPCA), February 2006.
Simulation is an important means of evaluating new microarchitectures.
Current trends toward chip multi-processors (CMPs) try the ability of
designers to develop efficient simulators. CMP simulation speed can
be improved by exploiting parallelism in the CMP simulation model.
This may be done by either running the simulation on multiple
processors or by integrating multiple processors into the simulation to
replace simulated processors. Doing so usually requires tedious
manual parallelization or re-design to encapsulate processors.
Both problems can be avoided by generating the simulator from a
concurrent, structural model of the CMP. Such a model not only
resembles hardware, making it easy to understand and use, but also
provides sufficient information to automatically parallelize the
simulator without requiring manual model changes. Furthermore,
individual components of the model such as processors may be replaced
with equivalent hardware without requiring repartitioning.
This paper presents techniques to perform automated simulator
parallelization and hardware integration for CMP structural models.
We show that automated parallelization can achieve a 7.60 speedup for
a 16-processor CMP model on a conventional 4-processor shared-memory
multiprocessor. We demonstrate the power of hardware integration by
integrating eight hardware PowerPC cores into a CMP model, achieving a
speedup of up to 5.82.
Software Fault Detection Using Dynamic Instrumentation [abstract] (CiteSeerX, PDF)
George A. Reis, David I. August, Robert Cohn, and Shubhendu S. Mukherjee
Proceedings of the Fourth Annual Boston Area Architecture Workshop (BARC), February 2006.
Software-only approaches to increase hardware reliability have
been proposed and evaluated as alternatives to hardware
modification. These techniques have shown that they can significantly
improve reliability with reasonable performance
overhead. Software-only techniques do not require any hardware support
and thus are far cheaper and easier to deploy. These techniques can
be used for systems that have already been manufactured and now
require higher reliability than the hardware can offer. All previous proposals have been static compilation techniques
that rely on source code transformations or alterations to the
compilation process. Our proposal is the first application of
software fault detection for transient errors that increases
reliability dynamically. The application of our technique is trivial
since the only requirement is the program binary, which makes it
applicable for legacy programs that no longer have readily available
or easily re-compilable source code. Our dynamic reliability
technique can seamlessly handle variable-length instructions, mixed
code and data, statically unknown indirect jump targets, dynamically
generated code, and dynamically loaded libraries. Our technique is
also able to attach to an already running application to increase its
reliability, and detach when appropriate, thus returning to faster
(although unreliable) execution.
Software-Controlled Fault Tolerance [abstract] (ACM DL, PDF)
George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August, and Shubhendu S. Mukherjee
ACM Transactions on Architecture and Code Optimization (TACO), December 2005.
Traditional fault tolerance techniques typically utilize
resources ineffectively because they cannot adapt to the changing
reliability and performance demands of a system. This paper proposes
software-controlled fault tolerance, a concept allowing designers and
users to tailor their performance and reliability for each
situation. Several software-controllable fault detection techniques
are then presented: SWIFT, a software-only technique, and CRAFT, a
suite of hybrid hardware/software techniques. Finally, the paper
introduces PROFiT, a technique which adjusts the level of protection
and performance at fine granularities through software control. When
coupled with software-controllable techniques like SWIFT and CRAFT,
PROFiT offers attractive and novel reliability options.
Chip Multi-Processor Scalability for Single-Threaded Applications [abstract] (ACM DL, PDF)
Neil Vachharajani, Matthew Iyer, Chinmay Ashok, Manish Vachharajani, David I. August, and Daniel A. Connors
Proceedings of the 2005 Workshop on Design, Architecture and Simulation of Chip Multi-Processors (dasCMP), November 2005.
The exponential increase in uniprocessor performance has
begun to slow. Designers have been unable to scale performance while
managing thermal, power, and electrical effects. Furthermore, design
complexity limits the size of monolithic processors that can be
designed while keeping costs reasonable. Industry has responded by
moving toward chip multi-processor architectures (CMP). These
architectures are composed from replicated processors utilizing the
die area afforded by newer design processes. While this approach
mitigates the issues with design complexity, power, and electrical
effects, it does nothing to directly improve the performance of
contemporary or future single-threaded applications. This paper examines the scalability potential for exploiting the
parallelism in single-threaded applications on these CMP
platforms. The paper explores the total available parallelism in
unmodified sequential applications and then examines the
viability of exploiting this parallelism on CMP machines. Using the
results from this analysis, the paper forecasts that CMPs, using the
"intrinsic" parallelism in a program, can sustain the performance
improvement users have come to expect from new processors for only 6-8
years, provided many successful parallelization efforts emerge. Given
this outlook, the paper advocates exploring methodologies which
achieve parallelism beyond this "intrinsic" limit of programs.
Automatic Thread Extraction with Decoupled Software Pipelining [abstract] (IEEE Xplore, PDF)
Guilherme Ottoni, Ram Rangan, Adam Stoler, and David I. August
Proceedings of the 38th IEEE/ACM International Symposium on
Microarchitecture (MICRO), November 2005.
One of five papers nominated for the Best Paper Award by the Program Committee.
Until recently, a steadily rising clock rate and other uniprocessor
microarchitectural improvements could be relied upon to consistently
deliver increasing performance for a wide range of applications.
Current difficulties in maintaining this trend have led
microprocessor companies to add value by incorporating multiple
processors on a chip. Unfortunately, since decades of compiler
research have not succeeded in delivering automatic threading for
prevalent code properties, this approach demonstrates no improvement
for a large class of existing codes. To find useful work for chip multiprocessors, we propose an automatic
approach to thread extraction, called Decoupled Software Pipelining
(DSWP). DSWP exploits the fine-grained pipeline parallelism lurking
in most applications to extract long-running, concurrently executing
threads. Use of the non-speculative and truly decoupled threads
produced by DSWP can increase execution efficiency and provide
significant latency tolerance, mitigating design complexity by
reducing inter-core communication and per-core resource requirements.
Using our initial fully automatic compiler implementation and a
validated processor model, we prove the concept by demonstrating
significant gains for dual-core chip multiprocessor models running a
variety of codes. Then, we explore simple opportunities missed by our
initial compiler implementation which suggest a promising future for
this approach.
A New Approach to Thread Extraction for General-Purpose
Programs [abstract] (PDF)
Guilherme Ottoni, Ram Rangan, Adam Stoler, and David I. August
Proceedings of the 2nd Watson Conference on
Interaction between Architecture, Circuits, and Compilers (PAC2), September 2005.
Until recently, a steadily rising clock rate and other uniprocessor
microarchitectural improvements could be relied upon to consistently
deliver increasing performance for a wide range of applications.
Current difficulties in maintaining this trend have led
microprocessor companies to add value by incorporating multiple
processors on a chip. Unfortunately, since decades of compiler
research have not succeeded in delivering automatic threading for
prevalent code properties, this approach demonstrates no improvement
for a large class of existing codes. To find useful work for chip multiprocessors, we propose an automatic
approach to thread extraction, called Decoupled Software Pipelining
(DSWP). DSWP exploits the fine-grained pipeline parallelism lurking
in most applications to extract long-running, concurrently executing
threads. Use of the non-speculative and truly decoupled threads
produced by DSWP can increase execution efficiency and provide
significant latency tolerance, mitigating design complexity by
reducing inter-core communication and per-core resource requirements.
Using our initial fully automatic compiler implementation and a
validated processor model, we prove the concept by demonstrating
significant gains for dual-core chip multiprocessor models running a
variety of codes. Then, we explore simple opportunities missed by our
initial compiler implementation which suggest a promising future for
this approach.
Recursive Data Structure Profiling [abstract] (ACM DL, PDF)
Easwaran Raman and David I. August
Proceedings of the Third Annual ACM SIGPLAN Workshop on
Memory Systems Performance (MSP), June 2005.
As the processor-memory performance gap increases, so does the need
for aggressive data structure optimizations to reduce memory access
latencies. Such optimizations require a better understanding of the
memory behavior of programs. We propose a profiling technique called
Recursive Data Structure Profiling to help better understand the
memory access behavior of programs that use recursive data structures
(RDS) such as lists, trees, etc. An RDS profile captures the runtime
behavior of the individual instances of recursive data structures.
RDS profiling differs from other memory profiling techniques in its
ability to aggregate information pertaining to an entire data
structure instance, rather than merely capturing the behavior of
individual loads and stores, thereby giving a more global view of a
program's memory accesses. This paper describes a method for collecting RDS profiles without
requiring any high-level program representation or type
information. RDS profiling achieves this with manageable space and
time overhead on a mixture of pointer-intensive benchmarks from the SPEC, Olden, and other
benchmark suites. To illustrate the potential of the RDS profile in
providing a better understanding of memory accesses, we introduce a
metric to quantify the notion of
stability of an RDS instance. A stable RDS instance is one that
undergoes very few changes to its structure between its initial
creation and final destruction, making it an attractive candidate for certain
data structure optimizations.
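A minimal sketch of per-instance aggregation, under our own assumptions about how nodes are tagged, is shown below; it is not the profiler described in the paper.

    /* Hypothetical sketch: attributing memory accesses to an entire
     * recursive-data-structure instance rather than to individual loads
     * and stores. Each node carries the ID of the RDS instance it
     * belongs to, assigned at allocation time by the profiler. */
    #include <stddef.h>

    #define MAX_INSTANCES 1024

    struct node { int instance_id; struct node *next; int payload; };

    static long accesses[MAX_INSTANCES];  /* per-instance aggregate counters */

    long walk(struct node *head) {
        long sum = 0;
        for (struct node *n = head; n != NULL; n = n->next) {
            accesses[n->instance_id]++;   /* charge the whole RDS instance */
            sum += n->payload;
        }
        return sum;
    }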
Achieving Structural and Composable Modeling of Complex Systems [abstract] (SpringerLink, PDF)
David I. August, Sharad Malik, Li-Shiuan Peh, Vijay Pai, Manish Vachharajani, and Paul Willmann
The International Journal of Parallel Programming (IJPP), Volume 33, June 2005.
Invited.
This paper describes a recently-released, structural and composable
modeling system called the Liberty Simulation Environment (LSE). LSE
automatically constructs simulators from system descriptions that
closely resemble the structure of hardware at the chosen level of
abstraction. Component-based reuse features allow an extremely
diverse range of complex models to be built easily from a core set of
component libraries. This paper also describes the makeup and initial
experience with a set of such libraries currently undergoing
refinement. With LSE and these soon-to-be-released component
libraries, students will be able to learn about systems in a more
intuitive fashion, researchers will be able to collaborate with each
other more easily, and developers will be able to rapidly and
meaningfully explore novel design candidates.
From Sequential Programs to Concurrent Threads [abstract] (IEEE Xplore, PDF)
Guilherme Ottoni, Ram Rangan, Adam Stoler, Matthew J. Bridges, and David I. August
IEEE Computer Architecture Letters (CAL), June 2005.
Chip multiprocessors are of increasing importance due to recent
difficulties in achieving higher clock frequencies in uniprocessors,
but their success depends on finding useful work for the processor
cores. This paper addresses this challenge by presenting a simple
compiler approach that extracts non-speculative thread-level
parallelism from sequential codes. We present initial results from
this technique targeting a validated dual-core processor model,
achieving speedups ranging from 9-48% with an average of 25% for
important benchmark loops over their single-threaded versions. We
also identify important next steps found during our pursuit of higher
degrees of automatic threading.
Rapid Development of Flexible Validated Processor Models [abstract] (PDF)
David A. Penry, Manish Vachharajani, and David I. August
Proceedings of the Workshop on Modeling, Benchmarking, and Simulation (MoBS), June 2005.
Given the central role of simulation in processor design and research,
an accurate, validated, and easily modified simulation model is
extremely desirable. Prior work proposed a modeling methodology with
the claim that it allows rapid construction of flexible validated
models. In this paper, we present our experience using this
methodology to construct a flexible validated model of Intel's Itanium
2 processor, lending support to their claims. Our initial model was
constructed by a single researcher in only 11 weeks and predicts
processor cycles-per-instruction (CPI) to within 7.9% on average for
the entire SPEC CINT2000 benchmark suite. We find that aggregate
accuracy for a metric like CPI is not sufficient; aggregate measures
like CPI may conceal remaining internal "offsetting errors" which
can adversely affect conclusions drawn from the model. We then
modified the model to reduce error in specific performance
constituents. In 2 1/2 person-weeks, overall constituent
error was reduced from 3.1% to 2.1%, while simultaneously reducing
average aggregate CPI error to 5.4%, demonstrating that model
flexibility allows rapid improvements to accuracy. Flexibility is
further shown by making significant changes to the model in under
eight person-weeks to explore two novel microarchitectural techniques.
Design and Evaluation of Hybrid Fault-Detection Systems [abstract] (IEEE Xplore, PDF)
George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August, and Shubhendu S. Mukherjee
Proceedings of the 32nd International Symposium on
Computer Architecture (ISCA), June 2005.
To improve performance and reduce power consumption, processor
designers employ advances that shrink feature sizes, lower voltage
levels, reduce noise margins, and increase clock rates. However, these
advances also make processors more susceptible to transient faults
that can affect program correctness. Up to now, system designers have
primarily considered hardware-only and software-only fault-detection
mechanisms to identify and mitigate the deleterious effects of
transient faults. These two fault-detection systems, however, are
extremes in the design space, representing sharp trade-offs between
hardware cost, reliability, and performance. In this paper, we identify hybrid hardware/software fault-detection
mechanisms as promising alternatives to hardware-only and
software-only systems. These hybrid systems offer designers more
options to fit their reliability needs within their hardware and
performance budgets. We propose CRAFT, a suite of three such hybrid
techniques, to illustrate the potential of the hybrid approach. We
evaluate CRAFT in relation to existing hardware and software
reliability techniques. For fair, quantitative comparisons among
hardware, software, and hybrid systems, we introduce a new metric,
mean work to failure, which is able to compare systems for which
machine instructions do not represent a constant unit of
work. Additionally, we present a new simulation framework which
rapidly assesses reliability and does not depend on manual
identification of failure modes. Our evaluation illustrates that
CRAFT, and hybrid techniques in general, offer attractive options in
the fault-detection design space.
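As we read it, mean work to failure normalizes failures by useful work completed rather than by elapsed time; a common formulation (our paraphrase, not a quotation from the paper) is:

    \mathrm{MWTF} = \frac{\text{amount of work completed}}{\text{number of errors encountered}}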
Branch Predication (ACM DL)
David I. August
Speculative Execution in High Performance Computer
Architectures (ISBN: 978-1584884477)
Edited by D. Kaeli and P.-C. Yew.
CRC Press, May 2005.
SWIFT: Software Implemented Fault Tolerance [abstract] (ACM DL, PDF)
George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, and David I. August
Proceedings of the Third International Symposium on
Code Generation and Optimization (CGO), March 2005.
Winner Best Paper Award.
Winner of the 2015 International Symposium on
Code Generation and Optimization Test of Time Award.
To improve performance and reduce power consumption, processor
designers employ advances that shrink feature sizes, lower voltage
levels, reduce noise margins, and increase clock rates. These
advances, however, also make processors more susceptible to transient
faults that can affect program correctness. To mitigate this
increasing problem, designers build redundancy into systems to the
degree that the soft-error budget will allow. While reliable systems typically employ hardware techniques to address
soft-errors, software techniques can provide a lower cost and more
flexible alternative. To make this alternative more attractive, this
paper presents a new software fault tolerance technique, called SWIFT,
for detecting transient errors. Like other single-threaded software
fault tolerance techniques, SWIFT efficiently manages redundancy by
reclaiming unused instruction-level resources present during the
execution of most programs. SWIFT, however, eliminates the need to
double the memory requirement by acknowledging the use of ECC in
caches and memory. SWIFT also provides a higher level of protection
with enhanced checking of the program counter (PC) at no performance
cost. In addition, this enhanced PC checking makes most code inserted
to detect faults in prior methods unnecessary, significantly enhancing
performance. While SWIFT can be implemented on any architecture and
can protect individual code segments to varying degrees, we evaluate a
fully-redundant implementation running on Itanium 2. In these
experiments, SWIFT demonstrates exceptional fault-coverage with a
reasonable performance cost. Compared to the best known
single-threaded approach utilizing an ECC memory system, SWIFT
demonstrates a 51% average speedup.
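The following C fragment illustrates, at the source level and under our own simplifications, the duplicate-and-check flavor of such single-threaded techniques; SWIFT itself operates on instructions and registers inside the compiler, so this is not the paper's transformation.

    /* Conceptual sketch only: computation is performed twice in
     * independent variables and the copies are compared before the value
     * escapes to memory, which is assumed to be protected by ECC. */
    extern void fault_detected(void);   /* hypothetical fault handler */

    void store_sum(int *out, int a, int b) {
        int sum1 = a + b;     /* original computation   */
        int sum2 = a + b;     /* redundant copy         */
        if (sum1 != sum2)     /* check before the store */
            fault_detected();
        *out = sum1;          /* value leaves the sphere of replication */
    }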
Practical and Accurate Low-Level Pointer Analysis [abstract] (IEEE Xplore, PDF)
Bolei Guo, Matthew J. Bridges, Spyridon Triantafyllis, Guilherme Ottoni, Easwaran Raman, and David I. August
Proceedings of the Third International Symposium on
Code Generation and Optimization (CGO), March 2005.
Pointer analysis is traditionally performed once, early in the
compilation process, upon an intermediate representation (IR) with
source-code semantics. However, performing pointer analysis only once
at this level imposes a phase-ordering constraint, causing alias
information to become stale after subsequent code transformations.
Moreover, high-level pointer analysis cannot be used at link time or
run time, where the source code is unavailable. This paper advocates performing pointer analysis on a low-level
intermediate representation. We present the first context-sensitive
and partially flow-sensitive points-to analysis designed to operate at
the assembly level. As we will demonstrate, low-level pointer analysis
can be as accurate as high-level analysis. Additionally, our low-level
pointer analysis also enables a quantitative comparison of propagating
high-level pointer analysis results through subsequent code
transformations, versus recomputing them at the low level. We show
that the former practice is considerably less accurate than the latter.
Decoupled Software Pipelining: A Promising Technique to Exploit Thread-Level Parallelism [abstract]
Guilherme Ottoni, Ram Rangan, Neil Vachharajani, and David I. August
Proceedings of the Fourth Workshop on Explicitly Parallel
Instruction Computer Architectures and Compiler Technology (EPIC), March 2005.
Processor manufacturers are moving to multi-core,
multi-threaded designs because of several factors such as cost, ease
of design and scalability. As most processors will be multi-threaded
in the future, exposing thread-level parallelism (TLP) is a problem of
increasing importance. Because the adequate granularity of the threads
is dependent on the target architecture, and writing sequential
applications is usually more natural, the compiler plays an important
role in performing the mapping from applications to the appropriate
multi-threaded code. In spite of this, few general-purpose compilation
techniques have been proposed to assist in this task. In this paper,
we propose Decoupled Software Pipelining (DSWP) to extract
thread-level parallelism. DSWP can convert most application loops into
a pipeline of loop threads. This brings pipeline parallelism to most
application loops including those not targeted by traditional software
pipelining. DSWP does not rely on complex hardware speculation support
since it is a non-speculative transformation. This paper describes the
DSWP technique, discusses its implementation in a compiler, and
presents experimental results demonstrating that it is a promising
technique to extract TLP.
Compiler Optimization-Space Exploration [abstract] (PDF)
Spyridon Triantafyllis, Manish Vachharajani, and David I. August
The Journal of Instruction-level Parallelism (JILP), February 2005.
To meet the performance demands of modern architectures, compilers
incorporate an ever-increasing number of aggressive code
transformations. Since most of these transformations are not
universally beneficial, compilers traditionally control their
application through predictive heuristics, which attempt to
judge an optimization's effect on final code quality a priori.
However, complex target architectures and unpredictable optimization
interactions severely limit the accuracy of these judgments, leading
to performance degradation because of poor optimization decisions.
This performance loss can be avoided through the iterative
compilation approach, which advocates exploring many optimization
options and selecting the best one a posteriori. However,
existing iterative compilation systems suffer from excessive compile
times and narrow application domains. By overcoming these
limitations, Optimization-Space Exploration (OSE) becomes the first
iterative compilation technique suitable for general-purpose
production compilers. OSE narrows down the space of optimization
options explored through limited use of heuristics. A compiler tuning
phase further limits the exploration space. At compile time, OSE
prunes the remaining optimization configurations in the search space
by exploiting feedback from earlier configurations tried. Finally,
rather than measuring actual runtimes, OSE compares optimization
outcomes through static performance estimation, further enhancing
compilation speed. An OSE-enhanced version of Intel's reference
compiler for the Itanium architecture yields a performance improvement
of more than 20% for some SPEC benchmarks.
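The a-posteriori selection loop at the heart of OSE can be sketched as below; compile_with() and estimate_cycles() are placeholders of our own invention, not compiler entry points from the paper.

    /* Hypothetical sketch: try a small set of promising optimization
     * configurations for a hot code segment and keep the version with
     * the best static performance estimate. */
    struct code;    /* compiled form of a code segment (opaque here) */
    struct config;  /* one optimization configuration (opaque here)  */

    extern struct code *compile_with(const struct config *cfg);
    extern double estimate_cycles(const struct code *c);  /* static estimator */

    struct code *ose_select(const struct config *const *cfgs, int ncfgs) {
        struct code *best = 0;
        double best_cost = 1e300;
        for (int i = 0; i < ncfgs; i++) {
            struct code *c = compile_with(cfgs[i]);
            double cost = estimate_cycles(c);  /* no runtime measurement */
            if (cost < best_cost) { best_cost = cost; best = c; }
        }
        return best;  /* the best alternative is selected a posteriori */
    }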
RIFLE: An Architectural Framework for User-Centric Information-Flow Security [abstract] (ACM DL, PDF)
Neil Vachharajani, Matthew J. Bridges, Jonathan Chang, Ram Rangan, Guilherme Ottoni, Jason A. Blome, George A. Reis, Manish Vachharajani, and David I. August
Proceedings of the 37th International Symposium on
Microarchitecture (MICRO), December 2004.
Even as modern computing systems allow the manipulation and
distribution of massive amounts of information, users of these systems
are unable to manage the confidentiality of their data in a practical
fashion. Conventional access control security mechanisms cannot
prevent the illegitimate use of privileged data once access is
granted. For example, information provided by a user during an online
purchase may be covertly delivered to malicious third parties by an
untrustworthy web browser. Existing information flow security
mechanisms do provide this assurance, but only for
programmer-specified policies enforced during program development as a
static analysis on special-purpose type-safe languages. Not only are
these techniques not applicable to many commonly used programs, but
they leave the user with no defense against malicious programmers or
altered binaries. In this paper, we propose RIFLE, a runtime information flow security
system designed from the user's perspective. By addressing
information flow security using architectural support, RIFLE gives
users a practical way to enforce their own information flow security
policy on all programs. We prove that, contrary to statements in the
literature, runtime systems like RIFLE are no less secure than
existing language-based techniques. Using a model of the
architectural framework and a binary translator, we demonstrate
RIFLE's correctness and illustrate that the performance cost is
reasonable.
Rapid Development of Flexible Validated Processor Models [abstract] (PDF)
David A. Penry, Manish Vachharajani, and David I. August
Liberty Research Group Technical Report 04-03, November 2004.
For a variety of reasons, most architectural evaluations use
simulation models. An accurate baseline model validated against
existing hardware provides confidence in the results of these
evaluations. Meanwhile, a meaningful exploration of the design space
requires a wide range of quickly-obtainable variations of the
baseline. Unfortunately, these two goals are generally considered to
be at odds; the set of validated models is considered exclusive of the
set of easily malleable models. Vachharajani et al. challenge this
belief and propose a modeling methodology they claim allows rapid
construction of flexible validated models. Unfortunately, they only
present anecdotal and secondary evidence to support their claims. In this paper, we present our experience using this methodology to
construct a validated flexible model of Intel's Itanium 2 processor.
Our practical experience lends support to the above claims. Our
initial model was constructed by a single researcher in only 11 weeks
and predicts processor cycles-per-instruction (CPI) to within 7.9% on
average for the entire SPEC CINT2000 benchmark suite. Our experience
with this model showed us that aggregate accuracy for a metric like
CPI is not sufficient. Aggregate measures like CPI may conceal
remaining internal "offsetting errors" which can adversely affect
conclusions drawn from the model. Using this as our motivation, we
explore the flexibility of the model by modifying it to target
specific error constituents, such as front-end stall errors. In
2 1/2 person-weeks, average CPI error was reduced to 5.4%.
The targeted error constituents were reduced more dramatically;
front-end stall errors were reduced from 5.6% to 1.6%. The swift
implementation of significant new architectural features on this model
further demonstrated its flexibility.
Microarchitecture Modeling for Design-space Exploration [abstract] (PDF)
Manish Vachharajani
Ph.D. Thesis, Department of Electrical Engineering,
Princeton University, November 2004.
To identify the best processor designs, designers explore a vast
design space. To assess the quality of candidate designs, designers
construct and use simulators. Unfortunately, simulator construction
is a bottleneck in this design-space exploration because existing
simulator construction methodologies lead to long simulator
development times. This bottleneck limits exploration to a small set
of designs, potentially diminishing quality of the final design.
This dissertation describes a method to rapidly construct high-quality
simulators. In particular, it examines structural modeling as a means
to reduce construction time because it eliminates redundant effort
required to manage design complexity in many modeling approaches,
including that of programming a simulator in a sequential language.
The dissertation also describes how to overcome common limitations in
structural modeling that increase construction time by precluding
amortization of component specification effort across models via
component-based reuse. First, some design portions that are
time-consuming to specify, chief among them the logic for managing
stall signals (i.e., timing-control), do not benefit from reuse.
Second, components flexible enough to enjoy reuse are often not
reused in practice because of the number of details that must be
understood in order to instantiate them. This dissertation presents techniques that address both issues.
The techniques presented in this dissertation are embodied in the Liberty Simulation Environment (LSE). LSE users have seen over an order of magnitude reduction in model development time. Their models were of high-quality; they were accurate, had adequate simulation speeds, and were compatible with static techniques for visualization and optimization.
Short model construction times allow more ideas to be explored in less time. This leads to shorter product time-to-market and more thorough design-space exploration. This also allows researchers to evaluate new design ideas faster, to efficiently evaluate ideas in the context of many designs, and, perhaps, develop a more fundamental understanding of microarchitecture due to these broader evaluations.
Decoupled Software Pipelining with the Synchronization Array [abstract] (IEEE Xplore, PDF)
Ram Rangan, Neil Vachharajani, Manish Vachharajani, and David I. August
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2004.
Highest ranked paper in double-blind review process.
Despite the success of instruction-level parallelism
(ILP) optimizations in increasing the performance of microprocessors,
certain codes remain elusive. In particular, codes containing
recursive data structure (RDS) traversal loops have been largely
immune to ILP optimizations, due to the fundamental serialization and
variable latency of the loop-carried dependence through a
pointer-chasing load. To address these and other situations, we
introduce decoupled software pipelining (DSWP), a technique that
statically splits a single-threaded sequential loop into multiple
non-speculative threads, each of which performs useful computation
essential for overall program correctness. The resulting threads
execute on thread-parallel architectures such as simultaneous
multithreaded (SMT) cores or chip multiprocessors (CMP), expose
additional instruction level parallelism, and tolerate latency better
than the original single-threaded RDS loop. To reduce overhead, these
threads communicate using a synchronization array, a
dedicated hardware structure for pipelined inter-thread
communication. DSWP used in conjunction with the synchronization array
achieves an 11% to 76% speedup in the optimized functions on both
statically and dynamically scheduled processors.
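Conceptually, DSWP turns a pointer-chasing loop into a two-stage pipeline; the sketch below illustrates the decoupling with hypothetical produce()/consume() calls standing in for the synchronization array, and is not code from the paper.

    /* Conceptual sketch: the original loop
     *     for (n = head; n; n = n->next) total += work(n->value);
     * is split into a traversal thread and a work thread that
     * communicate through a queue. */
    #include <stddef.h>

    struct node { struct node *next; int value; };

    extern void produce(int v);   /* hypothetical: send to consumer thread */
    extern int  consume(void);    /* hypothetical: receive from producer   */
    extern int  work(int v);
    #define DONE (-1)

    void traversal_thread(struct node *head) {   /* stage 1: chases pointers */
        for (struct node *n = head; n != NULL; n = n->next)
            produce(n->value);
        produce(DONE);
    }

    long work_thread(void) {                     /* stage 2: does the work */
        long total = 0;
        for (int v = consume(); v != DONE; v = consume())
            total += work(v);
        return total;
    }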
Facilitating Reuse in Hardware Models with Enhanced Type Inference [abstract] (ACM DL, PDF)
Manish Vachharajani, Neil Vachharajani, Sharad Malik, and David I. August
The IEEE/ACM/IFIP Second International Conference on Hardware/Software Codesign and System Synthesis (ISSS), September 2004.
High-level hardware modeling is an essential, yet time-consuming, part
of system design. However, effective component-based reuse in
hardware modeling languages can reduce model construction time and
enable the exploration of more design alternatives, leading to better
designs. While component overloading and parametric polymorphism are
critical for effective component-based reuse, no existing modeling
language supports both. The lack of these features creates overhead
for designers that discourages reuse, negating its benefits.
This paper first presents a type system which supports both component
overloading and parametric polymorphism. Then, it proves that
performing type inference for any such system is NP-complete and
presents a heuristic that works efficiently in practice. The result
is a type system and type inference algorithm that, when deployed, can
encourage reuse, reduce design specification time, and lead to better
designs.
The Liberty Structural Specification Language: A High-Level
Modeling Language for Component Reuse [abstract] (ACM DL, PDF)
Manish Vachharajani, Neil Vachharajani, and David I. August
Proceedings of the 2004 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2004.
Rapid exploration of the design space with simulation models is
essential for quality hardware systems research and development.
Despite striking commonalities across hardware systems, designers
routinely fail to achieve high levels of reuse across models
constructed in existing general-purpose and domain-specific languages.
This lack of reuse adversely impacts hardware system design by slowing
the rate at which ideas are evaluated. This paper presents an
examination of existing languages to reveal their fundamental
limitations regarding reuse in hardware modeling. With this
understanding, a solution is described in the context of the design
and implementation of the Liberty Structural Specification Language
(LSS), the input language for a publicly available high-level
digital-hardware modeling tool called the Liberty Simulation
Environment. LSS is the first language to enable low-overhead reuse
by simultaneously supporting static inference based on hardware
structure and flexibility via parameterizable structure.
Through LSS, this paper also introduces a new type inference algorithm
and a new programming language technique, called use-based
specialization, which, in a manner analogous to type inference,
customizes reusable components by statically inferring structural
properties that otherwise would have had to have been specified
manually.
Exposing Memory Access Regularities Using Object-Relative Memory Profiling [abstract] (ACM DL, PDF)
Qiang Wu, Artem Pyatakov, Alexey N. Spiridonov, Easwaran Raman, Douglas W. Clark, and David I. August
Proceedings of the Second International Symposium on Code
Generation and Optimization (CGO), March 2004.
Memory profiling is the process of characterizing a program's memory
behavior by observing and recording its response to specific input
sets. Relevant aspects of the program's memory behavior may then be
used to guide memory optimizations in an aggressively optimizing
compiler. In general, memory access behavior has eluded meaningful
characterization because of confounding artifacts from memory
allocators, linker data layout, and OS memory management. Since these
artifacts may change from run to run, memory access patterns may
appear different in each run even for the same input set. Worse,
regular memory access behavior such as linked list traversals appears
to have no structure. In this paper we present object-relative translation and
decomposition techniques to eliminate these artifacts and to expose
previously obscured memory access patterns. To demonstrate the
potential of these ideas, we implement two different memory profilers
targeted at different sets of applications. These profilers
outperform the existing ones in terms of profile size and useful
information per byte of data. The first profiler is a lossless
profiler, called WHOMP, which uses object-relativity to achieve a 22%
better compression than the previously best known scheme. The second
profiler, called LEAP, uses lossy compression to get highly compact
profiles while providing useful information to the targeted
applications. LEAP correctly characterizes the memory alias rates for
56% more instruction pairs than the previously best known scheme with
a practical running time.
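The core translation step can be sketched as below; find_object() and the record layout are our own placeholders for the profiler's allocation map, not the WHOMP or LEAP implementation.

    /* Hypothetical sketch: object-relative translation. Instead of raw
     * addresses, which shift between runs, the profiler records an
     * (object id, offset) pair, exposing stable access patterns such as
     * repeated traversal of the same field. */
    #include <stddef.h>

    struct obj_record { unsigned long id, base, size; };

    /* Placeholder for the profiler's lookup in its allocation map. */
    extern const struct obj_record *find_object(unsigned long addr);

    int object_relative(unsigned long addr, unsigned long *id, unsigned long *off) {
        const struct obj_record *o = find_object(addr);
        if (o == NULL) return 0;     /* address not in a tracked object */
        *id  = o->id;
        *off = addr - o->base;       /* offset is stable across runs */
        return 1;
    }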
The Liberty Simulation Environment, Version 1.0 [abstract] (ACM DL, PDF)
Manish Vachharajani, Neil Vachharajani, David A. Penry, Jason Blome, and David I. August
Performance Evaluation Review: Special Issue on Tools for Architecture Research (PER), Volume 31, Number 4, March 2004.
Invited.
High-level hardware modeling via simulation is an essential step in
hardware systems design and research. Despite the importance of
simulation, current model creation methods are error prone and are
unnecessarily time consuming. To address these problems, we have
publicly released the Liberty Simulation Environment (LSE), Version
1.0, consisting of a simulator builder and automatic visualizer based
on a shared hardware description language. LSE's design was motivated
by a careful analysis of the strengths and weaknesses of existing
systems. This has resulted in a system in which models are easier to
understand, faster to develop, and have performance on par with other
systems. LSE is capable of modeling any synchronous hardware
system. To date, LSE has been used to simulate and convey ideas about
a diverse set of complex systems including a chip multiprocessor
out-of-order IA-64 machine and a multiprocessor system with detailed
device models.
The Liberty Simulation Environment: A Deliberate
Approach to High-Level System Modeling [abstract] (PDF)
Manish Vachharajani, Neil Vachharajani, David A. Penry, Jason A. Blome, Sharad Malik, and David I. August
Liberty Research Group Technical Report 04-02, March 2004.
In digital hardware system design, the quality of the product is
directly related to the number of meaningful design alternatives
properly considered. Unfortunately, existing modeling methodologies
and tools have properties which make them less than ideal for rapid
and accurate design-space exploration. This article identifies and
evaluates the shortcomings of existing methods to motivate the Liberty
Simulation Environment (LSE). LSE is a high-level modeling tool
engineered to address these limitations, allowing for the rapid
construction of accurate high-level simulation models. LSE simplifies
model specification with low-overhead component-based reuse techniques
and an abstraction for timing control. As part of a detailed
description of LSE, this article presents these features, their impact
on model specification effort, their implementation, and optimizations
created to mitigate their otherwise deleterious impact on simulator
execution performance.
A Study of the Clarity of Functionally and Structurally
Composed High-level Simulation Models (PDF)
Manish Vachharajani and David I. August
Liberty Research Group Technical Report 04-01, January 2004.
Liberty Simulation Environment, Version 1.0
Manish Vachharajani, David A. Penry, Neil Vachharajani, Jason A. Blome, and David I. August
Available at http://liberty.princeton.edu/Software/LSE, December 2003.
A Comparison of Reuse in Object-oriented Programming and Structural Modeling Systems [abstract] (PDF)
Manish Vachharajani, Neil Vachharajani, and David I. August
Liberty Research Group Technical Report 03-01, October 2003.
Modeling systems in which users construct models by specifying the
interconnect between concurrently executing components are a natural
fit for hardware modeling. These concurrent-structural modeling
systems allow the specification of the model to parallel that of the
hardware making specifications clear, easy to understand, and
straight-forward to build and modify.
Unfortunately, many concurrent-structural modeling systems provide few
mechanisms to create flexible, reusable components. Others have built
these modeling systems on object-oriented programming (OOP) languages
in the hope that the OOP features will allow the creation of useful
flexible reusable components for structural modeling. Unfortunately,
simply adopting OOP features is insufficient for this task. By
drawing an analogy between OOP programs and concurrent-structural
models, it is possible to understand why. Furthermore, this analogy
highlights what would be needed in a concurrent-structural system to
gain the same benefits seen in OOP systems. This report compares the
features necessary to obtain reuse in concurrent-structural modeling
systems by drawing this analogy between object-oriented programming
constructs and concurrent-structural modeling constructs.
The Liberty Simulation Environment as a Pedagogical Tool [abstract] (PDF)
Jason Blome, Manish Vachharajani, Neil Vachharajani, and David I. August
Proceedings of the Workshop on Computer Architecture
Education (WCAE), June 2003.
This paper describes how the Liberty Simulation
Environment (LSE) and its graphical visualizer can be used in a
computer architecture course. LSE allows for the rapid construction
of simulators from models that resemble the structure of hardware. By
using and modifying LSE models, students can develop a solid
understanding of and learn to reason about computer architecture.
Since LSE models are also relatively easy to modify, the tool can be
used as the basis of meaningful assignments, allowing students to
explore a variety of microarchitectural concepts on their own. In
lectures where block diagrams are typically displayed, LSE's
visualizer can be used instead to not only show block diagrams, but to
demonstrate the machine in action. As a result, LSE can ease the
burden of conveying complex microarchitectural design concepts,
greatly improving the depth of understanding a computer architecture
course provides.
Optimizations for a Simulator Construction System Supporting
Reusable Components [abstract] (ACM DL, PDF)
David A. Penry and David I. August
Proceedings of the 40th Design Automation
Conference (DAC), June 2003.
Exploring a large portion of the microprocessor design
space requires the rapid development of efficient simulators. While some
systems support rapid model development through the structural
composition of reusable concurrent components, the Liberty Simulation
Environment (LSE) provides additional reuse-enhancing features. This
paper evaluates the cost of these features and presents optimizations
to reduce their impact. With these optimizations, an LSE model using
reusable components outperforms a SystemC model using custom
components by 6%.
Compiler Optimization-Space Exploration [abstract] (IEEE Xplore, PDF)
Spyridon Triantafyllis, Manish Vachharajani, Neil Vachharajani, and David I. August
Proceedings of the 2003 International Symposium on Code
Generation and Optimization (CGO), March 2003.
Winner Best Paper Award.
To meet the demands of modern architectures, optimizing
compilers must incorporate an ever larger number of increasingly
complex transformation algorithms. Since code transformations may
often degrade performance or interfere with subsequent
transformations, compilers employ predictive heuristics to guide
optimizations by predicting their effects a
priori. Unfortunately, the unpredictability of optimization
interaction and the irregularity of today's wide-issue machines
severely limit the accuracy of these heuristics. As a result, compiler
writers may temper high-variance optimizations with overly
conservative heuristics or may exclude these optimizations
entirely. While this process results in a compiler capable of
generating good average code quality across the target benchmark set,
it comes at the cost of missed optimization opportunities in individual
code segments. To replace predictive heuristics, researchers have
proposed compilers which explore many optimization options, selecting
the best one a posteriori. Unfortunately, these existing
iterative compilation techniques are not practical for reasons of
compile time and applicability.
In this paper, we present the Optimization-Space Exploration (OSE)
compiler organization, the first practical iterative compilation
strategy applicable to optimizations in general-purpose
compilers. Instead of replacing predictive heuristics, OSE uses the
compiler writer's knowledge encoded in the heuristics to select a
small number of promising optimization alternatives for a given code
segment. Compile time is limited by evaluating only these alternatives
for hot code segments using a general compile-time performance
estimator. An OSE-enhanced version of Intel's highly-tuned,
aggressively optimizing production compiler for IA-64 yields a
significant performance improvement, more than 20% in some cases, on
Itanium for SPEC codes.
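The following sketch uses hypothetical names and a made-up cost model,
not the paper's implementation; it only shows the shape of an
OSE-style compilation loop as the abstract describes it: predictive
heuristics prune the optimization space to a few promising
configurations, only hot code segments are explored, and a
compile-time estimator chooses among the alternatives a posteriori.

    from dataclasses import dataclass
    from itertools import product

    @dataclass
    class Segment:
        name: str
        profile_weight: float   # fraction of execution time spent here
        trip_count: int         # a toy property the heuristics can inspect

    def candidate_configs():
        # A tiny optimization space: unroll factor x if-conversion on/off.
        return [{"unroll": u, "if_convert": ic} for u, ic in product((1, 2, 4), (False, True))]

    def heuristic_score(seg, cfg):
        # Predictive heuristic (a priori): guess that hot, high-trip-count code likes unrolling.
        return cfg["unroll"] * min(seg.trip_count, 8) + (2 if cfg["if_convert"] else 0)

    def estimate_cycles(seg, cfg):
        # Stand-in for the compile-time performance estimator: a made-up static model,
        # evaluated a posteriori on each compiled alternative, with no execution needed.
        base = 1000 * seg.profile_weight
        return base / (1 + 0.1 * cfg["unroll"]) + (25 if cfg["if_convert"] else 40)

    def ose_compile(segments, hot_threshold=0.05, limit=3):
        chosen = {}
        for seg in segments:
            if seg.profile_weight < hot_threshold:
                chosen[seg.name] = candidate_configs()[0]   # cold code: default settings
                continue
            # Hot code: let the heuristics pick a few promising alternatives...
            promising = sorted(candidate_configs(),
                               key=lambda c: heuristic_score(seg, c),
                               reverse=True)[:limit]
            # ...then keep the one the estimator likes best.
            chosen[seg.name] = min(promising, key=lambda c: estimate_cycles(seg, c))
        return chosen

    if __name__ == "__main__":
        segs = [Segment("hot_loop", 0.60, trip_count=128),
                Segment("init", 0.01, trip_count=4)]
        print(ose_compile(segs))

The key structural point is that the heuristics are kept, but demoted
from deciding the outcome to bounding the search, which is how compile
time stays practical.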
Microarchitectural Exploration with Liberty [abstract] (IEEE Xplore, PDF)
Manish Vachharajani, Neil Vachharajani, David A. Penry, Jason A. Blome, and David I. August
Proceedings of the 35th International Symposium on
Microarchitecture (MICRO), November 2002.
Winner Best Student Paper Award.
To find the best designs, architects must rapidly
simulate many design alternatives and have confidence in the
results. Unfortunately, the most prevalent simulator construction
methodology, hand-writing monolithic simulators in sequential
programming languages, yields simulators that are hard to retarget,
limiting the number of designs explored, and hard to understand,
instilling little confidence in the model. Simulator construction
tools have been developed to address these problems, but analysis
reveals that they do not address the root cause, the error-prone
mapping between the concurrent, structural hardware domain and the
sequential, functional software domain. This paper presents an
analysis of these problems and their solution, the Liberty Simulation
Environment (LSE). LSE automatically constructs a simulator from a
machine description that closely resembles the hardware, ensuring
fidelity in the model. Furthermore, through a strict but general
component communication contract, LSE enables the creation of highly
reusable component libraries, easing the task of rapidly exploring
ever more exotic designs.
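As a toy example of why a uniform communication contract aids reuse
(the Channel name and handshake below are hypothetical, in the spirit
of such a contract rather than LSE's actual interface): when every
connection obeys the same small handshake, any producer component can
be wired to any consumer without case-by-case glue code.

    class Channel:
        """One uniform handshake for every connection: the producer offers
        (data, enable); a transfer happens only if the consumer is ready."""
        def __init__(self):
            self.data = None
            self.enable = False
            self.ack = False

    def transfer(ch, consumer_ready):
        # The same contract is evaluated identically for every connection in a model.
        ch.ack = ch.enable and consumer_ready
        return ch.data if ch.ack else None

    # Two unrelated library components can be wired together without custom glue.
    alu_to_cache = Channel()
    alu_to_cache.data, alu_to_cache.enable = 42, True
    print(transfer(alu_to_cache, consumer_ready=True))    # -> 42
    print(transfer(alu_to_cache, consumer_ready=False))   # -> None (no transfer this cycle)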
A Disciplined Approach to the Development of Platform
Architectures [abstract] (CiteSeerX, PDF)
David I. August, Kurt Keutzer, Sharad Malik, and A. Richard Newton
Microelectronics Journal, Volume 33, Number 11, November 2002.
Invited.
Silicon capability has enabled the embedding of an entire
system on a single silicon die. These devices are known as
systems-on-a-chip. Currently, the design of these devices is
undisciplined, expensive, and risky. One way of amortizing the cost
and ameliorating this design risk is to make a single integrated
circuit serve multiple applications, and the natural way of enabling
this is through end-user programmability. The aim of the MESCAL
project, which is the subject of this paper, is to introduce a
disciplined approach to producing reusable architectural platforms
that can be easily programmed to meet a variety of
applications. (MESCAL stands for Modern Embedded Systems, Compilers,
Architectures, and Languages.)
Procedure Boundary Elimination for EPIC Compilers [abstract] (CiteSeerX, PDF)
Spyridon Triantafyllis, Manish Vachharajani, and David I. August
Proceedings of the Second Workshop on Explicitly Parallel
Instruction Computer Architectures and Compiler Technology (EPIC), November 2002.
Procedures are the basic units of compilation in
traditional optimization frameworks. This presents problems to
compilers targeting EPIC architectures, since the limited scope of a
single procedure is usually insufficient for extracting ILP and
identifying enough optimization opportunities. Although inlining can
expand the scope of optimization routines, it is not applicable to all
call sites and can cause excessive code growth, which can in turn
adversely affect cache performance and compile-time resource usage. In
this paper we propose a novel compilation strategy called Procedure
Boundary Elimination (PBE). PBE unifies the whole program into a
single compilation unit, which is then restructured into units better
suited to optimization than the original procedures. A targeted
specialization phase exposes further optimization opportunities while
limiting code growth only to the cases where it is beneficial. Unlike
inlining, PBE can eliminate all procedure calls while avoiding the
cost of excessive code growth.
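As a rough illustration of the three steps the abstract names, the
sketch below (toy data structures and hypothetical names, not the
paper's compiler) unifies blocks from two procedures into one unit,
re-partitions them into regions by hotness rather than by their
original procedures, and duplicates a callee only at hot call sites.

    from collections import defaultdict

    # A toy whole-program view: block -> (owning procedure, profile weight),
    # plus call edges between blocks.
    blocks = {
        "main.entry": ("main", 100), "main.loop": ("main", 900),
        "util.fast":  ("util", 850), "util.slow": ("util", 50),
    }
    calls = [("main.loop", "util.fast"), ("main.entry", "util.slow")]

    def unify(blocks):
        # Step 1: procedure boundaries disappear; every block lives in one unit.
        return set(blocks)

    def repartition(unit, blocks, calls, hot=500):
        # Step 2: group blocks into regions by hotness and call affinity,
        # not by the procedures they came from.
        region = {b: b for b in unit}
        for caller, callee in calls:
            if blocks[caller][1] >= hot and blocks[callee][1] >= hot:
                region[callee] = region[caller]     # hot caller/callee share a region
        groups = defaultdict(list)
        for b, r in region.items():
            groups[r].append(b)
        return list(groups.values())

    def specialize(calls, blocks, hot=500):
        # Step 3: duplicate a callee only for hot call sites, so code growth is
        # limited to cases where it is likely to be profitable.
        return [f"{callee}@{caller}" for caller, callee in calls
                if blocks[caller][1] >= hot]

    unit = unify(blocks)
    print("regions:", repartition(unit, blocks, calls))
    print("specialized copies:", specialize(calls, blocks))

Unlike inlining, nothing here forces the hot callee's body to be
copied into every caller; duplication is a separate, targeted decision
made after the procedure boundaries are already gone.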
Design Tools for Application Specific Embedded Processors [abstract] (SpringerLink, PDF)
Wei Qin, Subramanian Rajagopalan, Manish Vachharajani, Hangsheng Wang, Xinping Zhu, David I. August, Kurt Keutzer, Sharad Malik, and Li-Shiuan Peh
Proceedings of the Second International Workshop on
Embedded Software, Lecture Notes in Computer Science (EMSOFT), Volume 2491, October 2002.
Invited.
A variety of factors make it increasingly difficult and
expensive to design and manufacture traditional Application Specific
Integrated Circuits (ASICs). Consequently, programmable alternatives
are more attractive than ever. The flexibility provided by
programmability comes with a performance and power overhead. This can
be significantly mitigated by using application specific platforms,
also referred to as Application Specific Embedded Processors, or
Application Specific Instruction Set Processors (ASIPs). ASIPs, and the
embedded software applications running on them, require specialized
design tools, both during architectural evaluation, to provide
feedback on the suitability of the architecture for the application,
and during system implementation, to ensure efficient mapping
and validation of design constraints. These functions result in
requirements different from those of traditional software development
environments. The first requirement is retargetability, especially
during the early architectural evaluation stage where a rapid
examination of design alternatives is essential. The second
requirement is for additional metrics such as power consumption,
real-time constraints and code size. This paper describes a set of
design tools and associated methodology designed to meet the
challenges posed by architectural evaluation and software
synthesis. This work is part of the MESCAL (Modern Embedded Systems,
Compilers, Architectures, and Languages) project.
A Disciplined Approach to the Development of Platform
Architectures [abstract] (PDF, Elsevier)
David I. August, Kurt Keutzer, Sharad Malik, and A. Richard Newton
Proceedings of the Tenth Workshop on Synthesis and System
Integration of Mixed Technologies (SASIMI), April 2001.
Invited.
Silicon capability has enabled the embedding of an entire
system on a single silicon die. These devices are known as
systems-on-a-chip (SOC). Currently, the design of these devices is
undisciplined, expensive, and risky. One way of amortizing the cost
and ameliorating this design risk is to make a single integrated
circuit serve multiple applications, and the natural way of enabling
this is through end-user programmability. The aim of the MESCAL
project, which is the subject of this paper, is to introduce a
disciplined approach to producing reusable architectural platforms
that can be easily programmed to meet a variety of
applications. (MESCAL stands for Modern Embedded Systems, Compilers,
Architectures, and Languages.)