Automatic Instruction-Level Software-Only Recovery
Jonathan Chang, George A. Reis, and David I. August
Proceedings of the International Conference on Dependable Systems and Networks (DSN), June 2006.
Winner of the William C. Carter Award.
Selected for IEEE Micro's "Top Picks" special issue for papers "most
relevant to industry and significant in contribution to the field of
computer architecture" in 2006.
As chip densities and clock rates increase, processors are becoming
more susceptible to transient faults that can affect program
correctness. Computer architects have typically addressed reliability
issues by adding redundant hardware, but these techniques are often
too expensive to be used widely. Software-only reliability techniques
have shown promise in their ability to protect against soft errors
without any hardware overhead. However, existing low-level
software-only fault tolerance techniques have only addressed the
problem of detecting faults, leaving recovery largely unaddressed. In
this paper, we present the concept, implementation, and evaluation of
automatic, instruction-level, software-only recovery techniques, as
well as various specific techniques representing different trade-offs
between reliability and performance. Our evaluation shows that these
techniques fulfill the promises of instruction-level, software-only
fault tolerance by offering a wide range of flexible recovery options.
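
To make the difference between detection and recovery concrete, here is a minimal sketch of one common software-only recovery idea: keep three redundant copies of a computed value and take a majority vote, so that a single corrupted copy is outvoted rather than merely detected. This is an illustration under stated assumptions, not the paper's implementation. The names `majority_vote`, `flip_bit`, and `protected_add` are hypothetical, and the transient fault is simulated by flipping one bit in one copy.

```python
def majority_vote(a, b, c):
    """Return the value held by at least two of the three copies."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    # All three copies disagree: a single-fault assumption was violated.
    raise RuntimeError("no majority among redundant copies")


def flip_bit(value, bit):
    """Simulate a transient fault by flipping one bit of an integer."""
    return value ^ (1 << bit)


def protected_add(x, y, fault_bit=None):
    """Add x and y using three redundant copies and a majority vote.

    If fault_bit is given, one copy is corrupted before voting to
    imitate a soft error striking a single redundant computation.
    """
    r0 = x + y
    r1 = x + y
    r2 = x + y
    if fault_bit is not None:
        r1 = flip_bit(r1, fault_bit)  # hypothetical injected fault
    # Voting recovers the correct value as long as at most one copy is wrong.
    return majority_vote(r0, r1, r2)


if __name__ == "__main__":
    print(protected_add(2, 3))               # fault-free run
    print(protected_add(2, 3, fault_bit=4))  # one copy corrupted, still recovered
```

With only two copies, a mismatch could be detected but there would be no way to tell which copy is correct; the third copy is what turns detection into recovery, at the cost of extra redundant work.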