Fault Tolerance Project [Yun Zhang is responsible for this page]
As chip densities and clock rates increase, processors are becoming more susceptible to transient faults that can affect program correctness. While designers often compensate for transient faults by adding costly hardware redundancy to address soft errors, software techniques can provide a lower-cost and more flexible alternative. In the Liberty Research Group, we have developed several fault detection techniques suitable for software control. We proposed SWIFT [CGO 2005], a software-based, single-threaded approach to achieve redundancy and fault tolerance, which demonstrated a 7x increase in reliability. It performed on par with the hardware multithreading-based redundancy techniques at the time [ISCA 2000] without the additional hardware cost. We proposed DAFT [PACT 2010], a fast, safe, and memory-efficient software-only speculation framework for transient fault detection in commodity multicore systems. We then showed with RAFT [CGO 2012] that transient faults can be detected at near-zero performance cost on commodity multicore systems: RAFT delivers a geomean performance overhead of 2.03% on a set of 23 SPEC CPU benchmarks. Compared with existing transient fault detection techniques, RAFT exhibits the best performance and fault coverage, without requiring any change to the hardware or the software applications. Many subsequent tools were developed based on the idea of low-cost, software-only transient fault detection: [CGO 2009], [TOCS 2009], [CGO 2007], [MICRO 2007], [SOSP 2005]. We have also worked with the Programming Languages group at Princeton, led by Prof. David Walker, to develop an assembly-level type system [PLDI 2007] designed to detect reliability problems in compiled code.
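The general idea behind single-threaded software redundancy can be illustrated with a toy duplicate-and-compare routine. This is only a sketch of the concept, not the SWIFT implementation, which works on compiled code at the instruction level. The names `checked_sum`, `FaultDetected`, and the `inject_fault_at` testing hook are hypothetical and introduced here purely for illustration.

```python
class FaultDetected(Exception):
    """Raised when the two redundant copies of a value disagree."""


def checked_sum(values, inject_fault_at=None):
    """Sum `values` twice in one thread and compare before returning.

    `inject_fault_at` is a hypothetical testing hook: if set, one bit of
    the primary accumulator is flipped at that iteration to mimic a
    transient fault.
    """
    total = 0         # primary copy
    total_shadow = 0  # redundant copy, updated by duplicated operations
    for i, v in enumerate(values):
        total += v
        total_shadow += v
        if inject_fault_at == i:
            total ^= 1 << 3  # simulated single-bit upset in the primary copy
    # Compare the copies before the result leaves the protected region.
    if total != total_shadow:
        raise FaultDetected(f"mismatch: {total} != {total_shadow}")
    return total
```

Calling `checked_sum([1, 2, 3])` returns the sum as usual, while passing `inject_fault_at=1` makes the two copies diverge and raises `FaultDetected`. The cost of the extra copy and the comparison is the kind of overhead that the techniques above aim to keep low.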
Software Controlled Fault Tolerance
Different applications and different segments of a single application may have different reliability and performance demands. Recognizing that one-size-fits-all approaches may be too costly or inappropriate for many markets, we proposed software-controlled fault tolerance [TACO 2005]. In software-controlled fault tolerance, the compiler or run-time optimizer modulates the performance and reliability of the fault tolerance system to meet specific demands using user, programmer, processor, or profile information. For example, rendering frames during movie playback should be done quickly, while bank transactions should be performed with the utmost care. This observation inspired much subsequent research aimed at low-cost, software-only fault tolerance. For example, Sundaram et al. [WREFT 2008] protect multimedia applications by duplicating only the instructions that are critical to correct execution. Similarly, Shoestring [ASPLOS 2010] applies intelligent analysis to detect and protect code segments that, when subjected to a soft error, are likely to result in user-visible faults.
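The trade-off described above, where some regions run fast and unprotected while others pay for redundancy, can be sketched with a per-region policy. This is a simplified, function-level illustration under stated assumptions, not the mechanism from [TACO 2005], which is applied by the compiler or run-time optimizer. The names `Protection`, `POLICY`, `run_region`, and `MismatchError` are hypothetical.

```python
from enum import Enum


class Protection(Enum):
    NONE = "none"  # execute once, no checking
    FULL = "full"  # execute twice and compare the results


class MismatchError(Exception):
    """Raised when redundant executions of a region disagree."""


# Hypothetical per-region policy; in a real system this choice could come
# from user, programmer, processor, or profile information.
POLICY = {
    "render_frame": Protection.NONE,
    "post_transaction": Protection.FULL,
}


def run_region(name, fn, *args):
    """Run `fn(*args)` under the protection level configured for `name`.

    Assumes `fn` is deterministic and free of side effects, so that running
    it twice is safe. Regions missing from POLICY default to FULL.
    """
    level = POLICY.get(name, Protection.FULL)
    result = fn(*args)
    if level is Protection.FULL:
        check = fn(*args)  # redundant execution
        if check != result:
            raise MismatchError(f"{name}: {result!r} != {check!r}")
    return result
```

With this policy, a call such as `run_region("render_frame", decode, frame)` would run `decode` once, whereas `run_region("post_transaction", apply, txn)` would run `apply` twice and raise `MismatchError` if the results differ (`decode`, `apply`, `frame`, and `txn` are placeholders). Defaulting unknown regions to `FULL` is a conservative design choice made for this sketch.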
We have also proposed CRAFT [ISCA 2005], the first hybrid hardware/software fault-detection mechanism, as a promising alternative to hardware-only and software-only systems. These hybrid systems offer designers more options to fit their reliability needs within their hardware and performance budgets. The hybrid fault-detection techniques were also used in subsequent work [DSN 2009], [HPCA 2007].
Project Ph.D. Graduates
Selected Project Publications
All Project Publications
A Generalized Framework for Automatic Scripting Language Parallelization [abstract] (PDF)
Runtime Asynchronous Fault Tolerance via Speculation [abstract] (PDF)
DAFT: Decoupled Acyclic Fault Tolerance [abstract] (SpringerLink, PDF)
Low-cost, Fine-grained Transient Fault Recovery for Low-end Commodity Systems [abstract]
DAFT: Decoupled Acyclic Fault Tolerance [abstract] (ACM DL, PDF)
Fault-tolerant Typed Assembly Language [abstract] (ACM DL, PDF)
Automatic Instruction-Level Software-Only Recovery Methods [abstract] (IEEE Xplore, Original Full Paper, PDF)
Configurable Transient Fault Detection via Dynamic Binary Translation [abstract] (PDF)
Static Typing for a Faulty Lambda Calculus [abstract] (ACM DL, PDF)
Automatic Instruction-Level Software-Only Recovery [abstract] (IEEE Xplore, PDF, Top Picks Version)
Software Fault Detection Using Dynamic Instrumentation [abstract] (CiteSeerX, PDF)
Software-Controlled Fault Tolerance [abstract] (ACM DL, PDF)
Design and Evaluation of Hybrid Fault-Detection Systems [abstract] (IEEE Xplore, PDF)
SWIFT: Software Implemented Fault Tolerance [abstract] (ACM DL, PDF)