A Generalized Framework for Automatic Scripting Language Parallelization [abstract] (PDF)
Taewook Oh, Stephen R. Beard, Nick P. Johnson, Sergiy Popovych, and David I. August
Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2017.
Accept Rate: 23% (25/108).
Computational scientists are typically not expert programmers, and thus work
in easy-to-use dynamic languages. However, they have very high performance
requirements, due to their large datasets and experimental setups. Thus,
the performance required for computational science must be extracted from
dynamic languages in a manner that is transparent to the programmer.
Current approaches to optimize and parallelize dynamic languages, such as
just-in-time compilation and highly optimized interpreters, require a huge
amount of implementation effort and are typically only effective for a
single language. However, scientists in different fields use different
languages, depending upon their needs.
This paper presents techniques to enable automatic extraction of parallelism
within scripts that are universally applicable across multiple different
dynamic scripting languages. The key insight is that combining a script
with its interpreter, through program specialization techniques, will embed
any parallelism within the script into the combined program. Additionally,
this paper presents several enhancements to existing speculative automatic
parallelization techniques to handle the dependence patterns created by the
specialization process. A prototype of the proposed technique, called
Partial Evaluation with Parallelization (PEP), is evaluated against two open-source script
interpreters with 6 input scripts each. The resulting geomean speedup of
5.1x on a 24-core machine shows the potential of the generalized
approach in automatic extraction of parallelism in dynamic scripting
languages.
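The key insight above, combining a script with its interpreter through specialization, can be illustrated with a toy example. The sketch below is not PEP's implementation: the tiny expression language and the names `interpret` and `specialize` are hypothetical, invented only for illustration. It shows how specializing an interpreter to one fixed script removes interpretive dispatch and leaves a residual program whose structure mirrors the script.

```python
# Toy illustration of interpreter specialization (not PEP itself).
# A "script" is a nested tuple: ("num", n), ("var", name),
# ("add", a, b), or ("mul", a, b).

def interpret(expr, env):
    """Generic interpreter: re-inspects the script on every call."""
    tag = expr[0]
    if tag == "num":
        return expr[1]
    if tag == "var":
        return env[expr[1]]
    if tag == "add":
        return interpret(expr[1], env) + interpret(expr[2], env)
    if tag == "mul":
        return interpret(expr[1], env) * interpret(expr[2], env)
    raise ValueError(f"unknown node: {tag}")

def specialize(expr):
    """Specialize the interpreter to one script.

    Returns a residual function of the environment only. Dispatch on
    node tags happens once, here, rather than on every evaluation.
    """
    tag = expr[0]
    if tag == "num":
        value = expr[1]
        return lambda env: value
    if tag == "var":
        name = expr[1]
        return lambda env: env[name]
    if tag in ("add", "mul"):
        left, right = specialize(expr[1]), specialize(expr[2])
        if tag == "add":
            return lambda env: left(env) + right(env)
        return lambda env: left(env) * right(env)
    raise ValueError(f"unknown node: {tag}")

# Script: x * x + 3
script = ("add", ("mul", ("var", "x"), ("var", "x")), ("num", 3))
residual = specialize(script)

# Each call touches only its own environment, so the independence of
# the iterations is visible in the residual program.
results = [residual({"x": x}) for x in range(4)]
print(results)
```

In this toy setting the residual function and the generic interpreter agree on every input; the difference is that the residual program no longer contains the interpreter's dispatch loop, which is the kind of structure a parallelizing compiler would otherwise have to see through.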
Runtime Asynchronous Fault Tolerance via Speculation [abstract] (PDF)
Yun Zhang, Soumyadeep Ghosh, Jialu Huang, Jae W. Lee, Scott A. Mahlke, and David I. August
Proceedings of the 2012 International Symposium on Code Generation
and Optimization (CGO), April 2012.
Accept Rate: 28% (26/90).
Transient faults are emerging as a critical reliability concern
in modern microprocessors. Redundant hardware solutions are commonly deployed
to detect transient faults, but they are less flexible and cost-effective than
software solutions. However, software solutions are rendered impractical
because of high performance overheads. To address this problem, this paper
presents Runtime Asynchronous Fault Tolerance via Speculation (RAFT), the
fastest transient fault detection technique known to date. Serving as a virtual
layer between the application and the underlying platform, RAFT automatically
generates two symmetric program instances from a program binary. It detects
transient faults in a noninvasive way, and exploits high-confidence value
speculation to achieve low runtime overhead. Evaluation on a commodity
multicore system demonstrates that RAFT delivers a geomean performance overhead
of 2.03% on a set of 23 SPEC CPU benchmarks. Compared with existing transient
fault detection techniques, RAFT exhibits the best performance and fault
coverage, without requiring any change to the hardware or the software
applications.
DAFT: Decoupled Acyclic Fault Tolerance [abstract] (SpringerLink, PDF)
Yun Zhang, Jae W. Lee, Nick P. Johnson, and David I. August
The International Journal of Parallel Programming (IJPP), February 2012.
Invited.
Special issue composed of "top papers" selected by the
Program Committee of the 19th International Conference on Parallel
Architectures and Compilation Techniques.
Higher transistor counts, lower voltage levels, and reduced noise
margin increase the susceptibility of multicore processors to
transient faults. Redundant hardware modules can detect such errors,
but software transient fault detection techniques are more appealing
for their low cost and flexibility. Recent software proposals double
register pressure or memory usage, or are too slow in the absence of
hardware extensions, preventing widespread acceptance. This paper
presents DAFT, a fast, safe, and memory efficient transient fault
detection framework for commodity multicore systems. DAFT replicates
computation across multiple cores and schedules fault detection off
the critical path. Where possible, values are speculated to be correct
and only communicated to the redundant thread at essential program
points. DAFT is implemented in the LLVM compiler framework and
evaluated using SPEC CPU2000 and SPEC CPU2006 benchmarks on a
commodity multicore system. Results demonstrate DAFT's high
performance and broad fault coverage. Speculation allows DAFT to
reduce the performance overhead of software redundant multithreading
from an average of 200% to 38% with no degradation of fault coverage.
Low-cost, Fine-grained Transient Fault Recovery for Low-end Commodity Systems [abstract]
Shuguang Feng, Shantanu Gupta, Amin Ansari, Scott A. Mahlke, and David I. August
Proceedings of the 44th IEEE/ACM International Symposium on
Microarchitecture (MICRO), December 2011.
Accept Rate: 21% (44/209).
DAFT: Decoupled Acyclic Fault Tolerance [abstract] (ACM DL, PDF)
Yun Zhang, Jae W. Lee, Nick P. Johnson, and David I. August
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2010.
Accept Rate: 17% (46/266).
Selected by the Program Committee as a "top paper" for inclusion
in a special issue of the International Journal of Parallel
Programming (IJPP).
Higher transistor counts, lower voltage levels, and reduced noise
margin increase the susceptibility of multicore processors to
transient faults. Redundant hardware modules can detect such errors,
but software transient fault detection techniques are more appealing
for their low cost and flexibility. Recent software proposals double
register pressure or memory usage, or are too slow in the absence of
hardware extensions, preventing widespread acceptance. This paper
presents DAFT, a fast, safe, and memory efficient transient fault
detection framework for commodity multicore systems. DAFT replicates
computation across multiple cores and schedules fault detection off
the critical path. Where possible, values are speculated to be correct
and only communicated to the redundant thread at essential program
points. DAFT is implemented in the LLVM compiler framework and
evaluated using SPEC CPU2000 and SPEC CPU2006 benchmarks on a
commodity multicore system. Results demonstrate DAFT's high
performance and broad fault coverage. Speculation allows DAFT to
reduce the performance overhead of software redundant multithreading
from an average of 200% to 38% with no degradation of fault
coverage.
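The structure described in this abstract, a leading thread that proceeds speculatively while a redundant thread checks values off the critical path, can be sketched roughly as follows. This is a minimal illustration and not DAFT's implementation, which is a compiler transformation in LLVM. The function names, the `corrupt_at` fault-injection parameter, and the computation `x * x + 1` are all hypothetical. Python threads are used only to show the communication pattern.

```python
import queue
import threading

def leading(inputs, channel, outputs, corrupt_at=None):
    """Leading thread: does the real work and forwards values to be
    checked, without waiting for the checker's verdict."""
    for x in inputs:
        y = x * x + 1              # stand-in for the original computation
        if x == corrupt_at:
            y ^= 1 << 3            # simulate a single bit flip
        outputs.append(y)
        channel.put((x, y))        # checking happens off the critical path
    channel.put(None)              # end-of-stream marker

def trailing(channel, faults):
    """Trailing thread: redundantly recomputes and compares."""
    while True:
        item = channel.get()
        if item is None:
            break
        x, y = item
        if x * x + 1 != y:
            faults.append(x)       # record which input exposed a mismatch

def run(inputs, corrupt_at=None):
    channel, outputs, faults = queue.Queue(), [], []
    checker = threading.Thread(target=trailing, args=(channel, faults))
    checker.start()
    leading(inputs, channel, outputs, corrupt_at)
    checker.join()
    return outputs, faults

clean_outputs, clean_faults = run(range(5))
bad_outputs, bad_faults = run(range(5), corrupt_at=2)
print(clean_faults, bad_faults)
```

Communication flows in one direction only, from the leading thread to the trailing thread, so the leading thread never blocks on the checker in this sketch.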
Software Modulated Fault Tolerance [abstract] (PDF)
George A. Reis
Ph.D. Thesis, Department of Electrical Engineering,
Princeton University, June 2008.
In recent decades, microprocessor performance has been increasing
exponentially, due in large part to smaller and faster transistors
enabled by improved fabrication technology. While such transistors
yield performance enhancements, their smaller size and sheer number
make chips more susceptible to transient faults. Designers frequently
introduce redundant hardware or software to detect
these faults because process and material advances are often
insufficient to mitigate their effect.
Regardless of the methods used for adding reliability, these
techniques incur significant performance penalties because they
uniformly protect the entire application. They do not consider
the varying resilience to transient faults of different
program regions. This uniform protection leads to wasted resources that
reduce performance and/or increase cost.
To maximize fault coverage while minimizing the performance impact,
this dissertation takes advantage of the key insights that not all faults in
an unprotected application will cause an incorrect answer and not all
parts of the program respond the same way to reliability techniques.
First, this dissertation demonstrates the varying vulnerability and
performance responses of an application and identifies regions
which are most susceptible to faults as well as those which are
inexpensive to protect. Second, this dissertation advocates the use of
software and hybrid approaches to fault tolerance to enable the
synergy of high-level information with specific redundancy techniques.
Third, this dissertation demonstrates how to exploit this non-uniformity via
Software Modulated Fault Tolerance. Software Modulated Fault Tolerance leverages reliability
and performance information at a high level and directs the
reliability choices at fine granularities to provide the most
efficient use of processor resources for an application.
Fault-tolerant Typed Assembly Language [abstract] (ACM DL, PDF)
Frances Perry, Lester Mackey, George A. Reis, Jay Ligatti, David I. August, and David Walker
Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2007.
Accept Rate: 25% (45/178).
Winner Best Paper Award.
A transient hardware fault occurs when an energetic particle strikes
a transistor, causing it to change state. Although transient faults do
not permanently damage the hardware, they may corrupt computations
by altering stored values and signal transfers. In this paper, we
propose a new scheme for provably safe and reliable computing in
the presence of transient hardware faults. In our scheme, software
computations are replicated to provide redundancy while special
instructions compare the results of replicas to detect errors before
writing critical data. In stark contrast to any previous efforts in this
area, we have analyzed our fault tolerance scheme from a formal,
theoretical perspective. First, we provide an operational semantics
for our assembly language, which includes a precise formal definition
of our fault model. Formulating such a model is a crucial step
forward for the science of hardware fault tolerance as it pins down
the assumptions being made about when and where faults may occur
in a completely precise and transparent manner. Second, we
develop an assembly-level type system designed to detect reliability
problems in compiled code. On the one hand, this type system
may be viewed as a theoretical tool for proving that code is reliable.
On the other hand, it may be used as a practical tool for compiler
debugging. In the latter case, the type system is strictly superior
to conventional testing as it guarantees full coverage and perfect
fault tolerance relative to the fault model. Third, we provide a formal
specification for program fault tolerance under the given fault
model and prove that all well-typed programs are indeed fault tolerant.
In addition to the formal analysis, we evaluate the execution
time of our detection scheme to quantify the performance penalty
for sound fault tolerance.
Automatic Instruction-Level Software-Only Recovery Methods [abstract] (IEEE Xplore, Original Full Paper, PDF)
George A. Reis, Jonathan Chang, and David I. August
IEEE Micro, Volume 27, Number 1, January 2007.
IEEE Micro's "Top Picks" special issue for papers "most
relevant to industry and significant in contribution to the field of
computer architecture" in 2006.
As chip densities and clock rates increase, processors are becoming
more susceptible to transient faults that can affect program
correctness. Computer architects have typically addressed reliability
issues by adding redundant hardware, but these techniques are often
too expensive to be used widely. Software-only reliability techniques
have shown promise in their ability to protect against soft-errors
without any hardware overhead. However, existing low-level
software-only fault tolerance techniques have only addressed the
problem of detecting faults, leaving recovery largely unaddressed. In
this paper, we present the concept, implementation, and evaluation of
automatic, instruction-level, software-only recovery techniques, as
well as various specific techniques representing different trade-offs
between reliability and performance. Our evaluation shows that these
techniques fulfill the promises of instruction-level, software-only
fault tolerance by offering a wide range of flexible recovery options.
Non-Uniform Fault Tolerance [abstract] (PDF)
Jonathan Chang, George A. Reis, and David I. August
Proceedings of the 2nd Workshop on Architectural Reliability (WAR), December 2006.
As devices become more susceptible to transient faults that can affect
program correctness, processor designers will increasingly compensate
by adding hardware or software redundancy. Proposed redundancy
techniques and those currently in use are generally applied uniformly
to a structure despite non-uniformity in the way errors within the
structure manifest themselves in programs. This uniform protection
leads to inefficiency in terms of performance, power, and area. Using
case studies involving the register file, this paper motivates an
alternative Non-Uniform Fault Tolerance approach which improves
reliability over uniform approaches by spending the redundancy budget
on those areas most susceptible.
Configurable Transient Fault Detection via Dynamic Binary Translation [abstract] (PDF)
George A. Reis, Jonathan Chang, David I. August, Robert Cohn, and Shubhendu S. Mukherjee
Proceedings of the 2nd Workshop on Architectural Reliability (WAR), December 2006.
Smaller feature sizes, lower voltage levels, and reduced noise margins
have helped improve the performance and lower the power consumption
of modern microprocessors. These same advances have
made processors more susceptible to transient faults that can corrupt
data and make systems unavailable. Designers often compensate for
transient faults by adding hardware redundancy and making circuit- and
process-level adjustments. However, applications have different
data integrity and availability demands, which make hardware
approaches such as these too costly for many markets.
Software techniques can provide fault tolerance at a lower cost and
with greater flexibility since they can be selectively deployed in the
field even after the hardware has been manufactured. Most existing
software-only techniques use recompilation, requiring access to program
source code. Regardless of the code transformation method,
previous techniques also incur unnecessary significant performance
penalties by uniformly protecting the entire program without taking
into account the varying vulnerability of different program regions
and state elements to transient faults.
This paper presents Spot, a software-only fault-detection technique
which uses dynamic binary translation to provide software-modulated
fault tolerance with fine-grained control of redundancy.
By using dynamic binary translation, users can improve the reliability
of their applications without any assistance from hardware or
software vendors. By using software-modulated fault tolerance, Spot
can vary the level of protection independently for each register and
region of code to provide users with more, and often superior, fault-detection
options. This feature of Spot increases the mean work to
failure from 1.90x to 17.79x.
Static Typing for a Faulty Lambda Calculus [abstract] (ACM DL, PDF)
David Walker, Lester Mackey, Jay Ligatti, George A. Reis, and David I. August
Proceedings of the 11th ACM SIGPLAN International Conference on Functional Programming (ICFP), September 2006.
Accept Rate: 31% (24/76).
A transient hardware fault occurs when an energetic particle strikes a
transistor, causing it to change state. These faults do not cause
permanent damage, but may result in incorrect program execution by
altering signal transfers or stored values. While the likelihood that
such transient faults will cause any significant damage may seem
remote, over the last several years transient faults have caused
costly failures in high-end machines at America Online, eBay, and the
Los Alamos Neutron Science Center, among others. Because
susceptibility to transient faults is proportional to the size and
density of transistors, the problem of transient faults will become
increasingly important in the coming decades.
This paper defines the first formal, type-theoretic framework for
studying reliable computation in the presence of transient faults.
More specifically, it defines λzap, a lambda calculus that exhibits
intermittent data faults. In order to detect and recover from these
faults, λzap programs replicate intermediate computations and use
majority voting, thereby modeling software-based fault tolerance
techniques studied extensively, but informally.
To ensure that programs maintain the proper invariants and use λzap
primitives correctly, the paper defines a type system for the
language. This type system guarantees that well-typed programs can
tolerate any single data fault. To demonstrate that λzap can serve as
an idealized typed intermediate language, we define a type-preserving
translation from a standard simply-typed lambda calculus into λzap.
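The replicate-and-vote pattern that the calculus models can be sketched dynamically as below. This is only an illustration of majority voting over three replicas: it has none of the static guarantees the paper's type system provides, and `vote`, `triplicate`, and the `fault_in` injection parameter are hypothetical names.

```python
from collections import Counter

def vote(replicas):
    """Return the value held by a strict majority of the replicas."""
    value, count = Counter(replicas).most_common(1)[0]
    if 2 * count <= len(replicas):
        raise RuntimeError("no majority: more faults than the scheme tolerates")
    return value

def triplicate(f, x, fault_in=None):
    """Run computation f three times; optionally corrupt one replica."""
    results = [f(x) for _ in range(3)]
    if fault_in is not None:
        results[fault_in] ^= 0b100   # simulate a single data fault
    return vote(results)

def square(v):
    return v * v

fault_free = triplicate(square, 7)
one_fault = triplicate(square, 7, fault_in=1)
print(fault_free, one_fault)
```

With three replicas, any single corrupted value is outvoted by the two intact copies, which matches the single-data-fault assumption stated in the abstract.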
Automatic Instruction-Level Software-Only Recovery [abstract] (IEEE Xplore, PDF, Top Picks Version)
Jonathan Chang, George A. Reis, and David I. August
Proceedings of the International Conference on Dependable Systems and Networks (DSN), June 2006.
Accept Rate: 18% (34/187).
Winner of the William C. Carter Award.
Selected for IEEE Micro's "Top Picks" special issue for papers "most
relevant to industry and significant in contribution to the field of
computer architecture" in 2006.
As chip densities and clock rates increase, processors are becoming
more susceptible to transient faults that can affect program
correctness. Computer architects have typically addressed reliability
issues by adding redundant hardware, but these techniques are often
too expensive to be used widely. Software-only reliability techniques
have shown promise in their ability to protect against soft-errors
without any hardware overhead. However, existing low-level
software-only fault tolerance techniques have only addressed the
problem of detecting faults, leaving recovery largely unaddressed. In
this paper, we present the concept, implementation, and evaluation of
automatic, instruction-level, software-only recovery techniques, as
well as various specific techniques representing different trade-offs
between reliability and performance. Our evaluation shows that these
techniques fulfill the promises of instruction-level, software-only
fault tolerance by offering a wide range of flexible recovery options
Software Fault Detection Using Dynamic Instrumentation [abstract] (CiteSeerX, PDF)
George A. Reis, David I. August, Robert Cohn, and Shubhendu S. Mukherjee
Proceedings of the Fourth Annual Boston Area Architecture Workshop (BARC), February 2006.
Software-only approaches to increase hardware reliability have
been proposed and evaluated as alternatives to hardware
modification. These techniques have shown that they can significantly
improve reliability with reasonable performance
overhead. Software-only techniques do not require any hardware support
and thus are far cheaper and easier to deploy. These techniques can
be used for systems that have already been manufactured and now
require higher reliability than the hardware can offer.
All previous proposals have been static compilation techniques
that rely on source code transformations or alterations to the
compilation process. Our proposal is the first application of
software fault detection for transient errors that increases
reliability dynamically. The application of our technique is trivial
since the only requirement is the program binary, which makes it
applicable for legacy programs that no longer have readily available
or easily re-compilable source code. Our dynamic reliability
technique can seamlessly handle variable-length instructions, mixed
code and data, statically unknown indirect jump targets, dynamically
generated code, and dynamically loaded libraries. Our technique is
also able to attach to an already running application to increase its
reliability, and detach when appropriate, thus returning to faster
(although unreliable) execution.
Software-Controlled Fault Tolerance [abstract] (ACM DL, PDF)
George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August, and Shubhendu S. Mukherjee
ACM Transactions on Architecture and Code Optimization (TACO), December 2005.
Traditional fault tolerance techniques typically utilize
resources ineffectively because they cannot adapt to the changing
reliability and performance demands of a system. This paper proposes
software-controlled fault tolerance, a concept allowing designers and
users to tailor their performance and reliability for each
situation. Several software-controllable fault detection techniques
are then presented: SWIFT, a software-only technique, and CRAFT, a
suite of hybrid hardware/software techniques. Finally, the paper
introduces PROFiT, a technique which adjusts the level of protection
and performance at fine granularities through software control. When
coupled with software-controllable techniques like SWIFT and CRAFT,
PROFiT offers attractive and novel reliability options.
Design and Evaluation of Hybrid Fault-Detection Systems [abstract] (IEEE Xplore, PDF)
George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August, and Shubhendu S. Mukherjee
Proceedings of the 32nd International Symposium on
Computer Architecture (ISCA), June 2005.
Accept Rate: 23% (45/194).
To improve performance and reduce power consumption, processor
designers employ advances that shrink feature sizes, lower voltage
levels, reduce noise margins, and increase clock rates. However, these
advances also make processors more susceptible to transient faults
that can affect program correctness. Up to now, system designers have
primarily considered hardware-only and software-only fault-detection
mechanisms to identify and mitigate the deleterious effects of
transient faults. These two fault-detection systems, however, are
extremes in the design space, representing sharp trade-offs between
hardware cost, reliability, and performance.
In this paper, we identify hybrid hardware/software fault-detection
mechanisms as promising alternatives to hardware-only and
software-only systems. These hybrid systems offer designers more
options to fit their reliability needs within their hardware and
performance budgets. We propose CRAFT, a suite of three such hybrid
techniques, to illustrate the potential of the hybrid approach. We
evaluate CRAFT in relation to existing hardware and software
reliability techniques. For fair, quantitative comparisons among
hardware, software, and hybrid systems, we introduce a new metric,
mean work to failure, which is able to compare systems for which
machine instructions do not represent a constant unit of
work. Additionally, we present a new simulation framework which
rapidly assesses reliability and does not depend on manual
identification of failure modes. Our evaluation illustrates that
CRAFT, and hybrid techniques in general, offer attractive options in
the fault-detection design space.
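The abstract does not spell out the mean work to failure (MWTF) metric. One common formulation, assumed here rather than quoted from the text above, is MWTF = work completed / errors encountered = 1 / (raw error rate × AVF × execution time per unit of work), where AVF is the architectural vulnerability factor. The sketch below computes the ratio between two configurations; every number in it is invented purely for illustration.

```python
def mwtf(raw_error_rate, avf, exec_time_per_unit_work):
    """Mean work to failure under the assumed formulation:
    1 / (raw error rate * AVF * time to complete one unit of work)."""
    return 1.0 / (raw_error_rate * avf * exec_time_per_unit_work)

# Hypothetical numbers: a protected build lowers AVF but runs slower.
RAW_ERROR_RATE = 1e-9   # faults per second, same hardware in both cases

baseline = mwtf(RAW_ERROR_RATE, avf=0.20, exec_time_per_unit_work=1.0)
protected = mwtf(RAW_ERROR_RATE, avf=0.02, exec_time_per_unit_work=1.4)

# The raw error rate cancels in the ratio, so the comparison depends
# only on the AVF reduction and the slowdown.
improvement = protected / baseline
print(f"MWTF improvement: {improvement:.2f}x")
```

Because the metric is normalized to work rather than to instructions or time, a technique that adds instructions is charged for its slowdown but credited for the reliability it buys.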
SWIFT: Software Implemented Fault Tolerance [abstract] (ACM DL, PDF)
George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, and David I. August
Proceedings of the Third International Symposium on
Code Generation and Optimization (CGO), March 2005.
Accept Rate: 33% (25/75).
Winner Best Paper Award.
Winner of the 2015 International Symposium on
Code Generation and Optimization Test of Time Award.
To improve performance and reduce power consumption, processor
designers employ advances that shrink feature sizes, lower voltage
levels, reduce noise margins, and increase clock rates. These
advances, however, also make processors more susceptible to transient
faults that can affect program correctness. To mitigate this
increasing problem, designers build redundancy into systems to the
degree that the soft-error budget will allow.
While reliable systems typically employ hardware techniques to address
soft-errors, software techniques can provide a lower cost and more
flexible alternative. To make this alternative more attractive, this
paper presents a new software fault tolerance technique, called SWIFT,
for detecting transient errors. Like other single-threaded software
fault tolerance techniques, SWIFT efficiently manages redundancy by
reclaiming unused instruction-level resources present during the
execution of most programs. SWIFT, however, eliminates the need to
double the memory requirement by acknowledging the use of ECC in
caches and memory. SWIFT also provides a higher level of protection
with enhanced checking of the program counter (PC) at no performance
cost. In addition, this enhanced PC checking makes most code inserted
to detect faults in prior methods unnecessary, significantly enhancing
performance. While SWIFT can be implemented on any architecture and
can protect individual code segments to varying degrees, we evaluate a
fully-redundant implementation running on Itanium 2. In these
experiments, SWIFT demonstrates exceptional fault-coverage with a
reasonable performance cost. Compared to the best known
single-threaded approach utilizing an ECC memory system, SWIFT
demonstrates a 51% average speedup.
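The general idea of single-threaded instruction duplication with a check before each store can be sketched as follows. This is a simplified model and not SWIFT's implementation, which is a compiler transformation; the dictionary standing in for ECC-protected memory, the `flip_bit` fault-injection parameter, and all function names are hypothetical. Control-flow (PC) checking is not modeled.

```python
class FaultDetected(Exception):
    """Raised when the original and duplicate copies disagree."""

def checked_store(memory, addr, addr_dup, value, value_dup):
    """Compare original and duplicate operands before the store commits."""
    if addr != addr_dup or value != value_dup:
        raise FaultDetected(f"mismatch before store to address {addr}")
    # Memory is assumed to be ECC-protected, so only one copy is written.
    memory[addr] = value

def protected_sum(memory, src_a, src_b, dst, flip_bit=None):
    # Each value is loaded once and copied into a shadow "register".
    a = memory[src_a]
    a_dup = a
    b = memory[src_b]
    b_dup = b
    # The computation is performed twice, on independent registers.
    r = a + b
    r_dup = a_dup + b_dup
    if flip_bit is not None:
        r ^= 1 << flip_bit         # simulate a transient fault in one register
    checked_store(memory, dst, dst, r, r_dup)

mem = {0: 2, 1: 3, 2: 0}
protected_sum(mem, 0, 1, 2)
print(mem[2])

faulty_mem = {0: 2, 1: 3, 2: 0}
try:
    protected_sum(faulty_mem, 0, 1, 2, flip_bit=4)
    detected = False
except FaultDetected:
    detected = True
print(detected, faulty_mem[2])
```

Placing the comparison immediately before the store means a corrupted register value is caught before it can reach memory, while memory itself needs no second copy under the ECC assumption.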