
The Liberty Research Group

Fault Tolerance Project

[Yun Zhang is responsible for this page]

As chip densities and clock rates increase, processors are becoming more susceptible to transient faults that can affect program correctness. While designers often compensate for transient faults (soft errors) by adding costly hardware redundancy, software techniques can provide a lower-cost and more flexible alternative. In the Liberty Research Group, we have developed several fault detection techniques suitable for software control. We proposed SWIFT [CGO 2005], a software-based, single-threaded approach to redundancy and fault tolerance, which demonstrated a 7x increase in reliability. It performed on par with the hardware multithreading-based redundancy techniques of the time [ISCA 2000] without the additional hardware cost. We proposed DAFT [PACT 2010], a fast, safe, and memory-efficient software-only speculation framework for transient fault detection in commodity multicore systems. With RAFT [CGO 2012], we then showed that transient faults can be detected at near-zero performance cost on commodity multicore systems: RAFT delivers a geomean performance overhead of 2.03% on a set of 23 SPEC CPU benchmarks. Compared with existing transient fault detection techniques, RAFT exhibits the best performance and fault coverage, without requiring any change to the hardware or the software applications. Many subsequent tools were developed based on the idea of low-cost, software-only transient fault detection: [CGO 2009], [TOCS 2009], [CGO 2007], [MICRO 2007], [SOSP 2005]. We have also worked with the Programming Languages group at Princeton, led by Prof. David Walker, to develop an assembly-level type system [PLDI 2007] designed to detect reliability problems in compiled code.

Software Controlled Fault Tolerance

Different applications and different segments of a single application may have different reliability and performance demands. Recognizing that one-size-fits-all approaches may be too costly or inappropriate for many markets, we proposed software-controlled fault tolerance [TACO 2005]. In software-controlled fault tolerance, the compiler or run-time optimizer modulates the performance and reliability of the fault tolerance system to meet specific demands using user, programmer, processor, or profile information. For example, rendering frames during movie playback should be done quickly, while bank transactions should be performed with the utmost care. This observation inspired much subsequent research aimed at low-cost, software-only fault tolerance. For example, Sundaram et al. [WREFT 2008] protect multimedia applications by duplicating only the instructions that are critical to correct execution. Similarly, Shoestring [ASPLOS 2010] applies intelligent analysis to detect and protect code segments that, when subjected to a soft error, are likely to result in user-visible faults.
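The idea of choosing a protection level per code region can be illustrated with a small sketch. This is not the compiler implementation from the papers, which work at the instruction level; it only mimics the policy in Python by running a region once when speed matters and twice with a comparison when reliability matters. The names `Reliability`, `FaultDetected`, `run_region`, and `replay` are hypothetical and exist only for this illustration.

```python
from enum import Enum
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


class Reliability(Enum):
    """Hypothetical per-region reliability levels (illustration only)."""
    NONE = 0    # run once, no checking (e.g. rendering a video frame)
    DETECT = 1  # run twice and compare (e.g. a bank transaction)


class FaultDetected(Exception):
    """Raised when the two redundant executions disagree."""


def run_region(compute: Callable[[], T], level: Reliability) -> T:
    """Run one code region under the requested reliability level."""
    result = compute()
    if level is Reliability.NONE:
        return result
    # Redundant execution: recompute and compare before the value escapes.
    shadow = compute()
    if result != shadow:
        raise FaultDetected("redundant executions disagree")
    return result


def replay(values: Iterable[T]) -> Callable[[], T]:
    """Test helper: return a callable that yields the given values in order,
    so a transient fault can be simulated as two differing results."""
    it = iter(values)
    return lambda: next(it)


if __name__ == "__main__":
    # Fault-free region: both levels return the same answer.
    print(run_region(lambda: 2 + 3, Reliability.DETECT))

    # Simulated fault: the second execution returns a corrupted value.
    try:
        run_region(replay([100, 101]), Reliability.DETECT)
    except FaultDetected as err:
        print("detected:", err)
```

In this sketch the unprotected level pays for one execution and would silently return a corrupted value, while the protected level pays roughly twice the cost and reports the mismatch. That is the performance/reliability trade-off the paragraph above describes, shown here at function granularity purely for readability.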

We have also proposed CRAFT [ISCA 2005], the first hybrid hardware/software fault-detection mechanism, as a promising alternative to hardware-only and software-only systems. These hybrid systems offer designers more options to fit their reliability needs within their hardware and performance budgets. Hybrid fault-detection techniques were also used by subsequent work in [DSN 2009] and [HPCA 2007].

Project Ph.D. Graduates

Selected Project Publications

All Project Publications

Runtime Asynchronous Fault Tolerance via Speculation [abstract] (PDF)
Yun Zhang, Soumyadeep Ghosh, Jialu Huang, Jae W. Lee, Scott A. Mahlke, and David I. August
Proceedings of the 2012 International Symposium on Code Generation and Optimization (CGO), April 2012.
Accept Rate: 28% (26/90).

DAFT: Decoupled Acyclic Fault Tolerance [abstract] (SpringerLink, PDF)
Yun Zhang, Jae W. Lee, Nick P. Johnson, and David I. August
The International Journal of Parallel Programming (IJPP), February 2012. Invited.
Special issue composed of "top papers" selected by the Program Committee of the 19th International Conference on Parallel Architectures and Compilation Techniques.

Low-cost, Fine-grained Transient Fault Recovery for Low-end Commodity Systems [abstract]
Shuguang Feng, Shantanu Gupta, Amin Ansari, Scott A. Mahlke, and David I. August
Proceedings of the 44th IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2011.
Accept Rate: 21% (44/209).

DAFT: Decoupled Acyclic Fault Tolerance [abstract] (ACM DL, PDF)
Yun Zhang, Jae W. Lee, Nick P. Johnson, and David I. August
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2010.
Accept Rate: 17% (46/266).
Selected by the Program Committee as a "top paper" for inclusion in a special issue of the International Journal of Parallel Programming (IJPP).

Software Modulated Fault Tolerance [abstract] (PDF)
George A. Reis
Ph.D. Thesis, Department of Electrical Engineering, Princeton University, June 2008.

Fault-tolerant Typed Assembly Language [abstract] (ACM DL, PDF)
Frances Perry, Lester Mackey, George A. Reis, Jay Ligatti, David I. August, and David Walker
Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2007.
Accept Rate: 25% (45/178).
Winner of the Best Paper Award.

Automatic Instruction-Level Software-Only Recovery Methods [abstract] (IEEE Xplore, Original Full Paper, PDF)
George A. Reis, Jonathan Chang, and David I. August
IEEE Micro, Volume 27, Number 1, January 2007.
IEEE Micro's "Top Picks" special issue for papers "most relevant to industry and significant in contribution to the field of computer architecture" in 2006.

Non-Uniform Fault Tolerance [abstract] (PDF)
Jonathan Chang, George A. Reis, and David I. August
Proceedings of the 2nd Workshop on Architectural Reliability (WAR), December 2006.

Configurable Transient Fault Detection via Dynamic Binary Translation [abstract] (PDF)
George A. Reis, Jonathan Chang, David I. August, Robert Cohn, and Shubhendu S. Mukherjee
Proceedings of the 2nd Workshop on Architectural Reliability (WAR), December 2006.

Static Typing for a Faulty Lambda Calculus [abstract] (ACM DL, PDF)
David Walker, Lester Mackey, Jay Ligatti, George A. Reis, and David I. August
Proceedings of the 11th ACM SIGPLAN International Conference on Functional Programming (ICFP), September 2006.
Accept Rate: 31% (24/76).

Automatic Instruction-Level Software-Only Recovery [abstract] (IEEE Xplore, PDF, Top Picks Version)
Jonathan Chang, George A. Reis, and David I. August
Proceedings of the International Conference on Dependable Systems and Networks (DSN), June 2006.
Accept Rate: 18% (34/187).
Winner of the William C. Carter Award.
Selected for IEEE Micro's "Top Picks" special issue for papers "most relevant to industry and significant in contribution to the field of computer architecture" in 2006.

Software Fault Detection Using Dynamic Instrumentation [abstract] (CiteSeerX, PDF)
George A. Reis, David I. August, Robert Cohn, and Shubhendu S. Mukherjee
Proceedings of the Fourth Annual Boston Area Architecture Workshop (BARC), February 2006.

Software-Controlled Fault Tolerance [abstract] (ACM DL, PDF)
George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August, and Shubhendu S. Mukherjee
ACM Transactions on Architecture and Code Optimization (TACO), December 2005.

Design and Evaluation of Hybrid Fault-Detection Systems [abstract] (IEEE Xplore, PDF)
George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August, and Shubhendu S. Mukherjee
Proceedings of the 32nd International Symposium on Computer Architecture (ISCA), June 2005.
Accept Rate: 23% (45/194).

SWIFT: Software Implemented Fault Tolerance [abstract] (ACM DL, PDF)
George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, and David I. August
Proceedings of the Third International Symposium on Code Generation and Optimization (CGO), March 2005.
Accept Rate: 33% (25/75).
Winner of the Best Paper Award.