Fault-tolerant Typed Assembly Language [abstract] (ACM DL, PDF)
Frances Perry, Lester Mackey, George A. Reis, Jay Ligatti, David I. August, and David Walker
Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2007.
Winner Best Paper Award.
A transient hardware fault occurs when an energetic particle strikes
a transistor, causing it to change state. Although transient faults do
not permanently damage the hardware, they may corrupt computations
by altering stored values and signal transfers. In this paper, we
propose a new scheme for provably safe and reliable computing in
the presence of transient hardware faults. In our scheme, software
computations are replicated to provide redundancy while special
instructions compare the results of replicas to detect errors before
writing critical data. In stark contrast to any previous efforts in this
area, we have analyzed our fault tolerance scheme from a formal,
theoretical perspective. First, we provide an operational semantics
for our assembly language, which includes a precise formal denition
of our fault model. Formulating such a model is a crucial step
forward for the science of hardware fault tolerance as it pins down
the assumptions being made about when and where faults may occur
in a completely precise and transparent manner. Second, we
develop an assembly-level type system designed to detect reliability
problems in compiled code. On the one hand, this type system
may be viewed as a theoretical tool for proving that code is reliable.
On the other hand, it may be used as a practical tool for compiler
debugging. In the latter case, the type system is strictly superior
to conventional testing as it guarantees full coverage and perfect
fault tolerance relative to the fault model. Third, we provide a formal
specication for program fault tolerance under the given fault
model and prove that all well-typed programs are indeed fault tolerant.
In addition to the formal analysis, we evaluate the execution
time of our detection scheme to quantify the performance penalty
for sound fault tolerance.