
The Liberty Research Group

Fault Tolerance Project

[Yun Zhang is responsible for this page]

As chip densities and clock rates increase, processors are becoming more susceptible to transient faults that can affect program correctness. While designers often compensate for transient faults (soft errors) by adding costly hardware redundancy, software techniques can provide a lower-cost and more flexible alternative. In the Liberty Research Group, we have developed several fault detection techniques suitable for software control. We proposed SWIFT [CGO 2005], a software-based, single-threaded approach to redundancy and fault tolerance, which demonstrated a 7x increase in reliability. It performed on par with the hardware multithreading-based redundancy techniques of the time [ISCA 2000] without the additional hardware cost. We then proposed DAFT [PACT 2010], a fast, safe, and memory-efficient software-only speculation framework for transient fault detection in commodity multicore systems. With RAFT [CGO 2012], we showed that transient faults can be detected at near-zero performance cost on commodity multicore systems: RAFT delivers a geomean performance overhead of 2.03% on a set of 23 SPEC CPU benchmarks. Compared with existing transient fault detection techniques, RAFT exhibits the best performance and fault coverage, without requiring any change to the hardware or the software applications.

Many subsequent tools build on the idea of low-cost, software-only transient fault detection, including [CGO 2009], [TOCS 2009], [CGO 2007], [MICRO 2007], and [SOSP 2005]. We have also worked with the Programming Languages group at Princeton, led by Prof. David Walker, to develop an assembly-level type system [PLDI 2007] designed to detect reliability problems in compiled code.
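The general idea of software-only redundancy mentioned above can be illustrated with a small duplicate-and-compare sketch. This is not the SWIFT, DAFT, or RAFT transformation itself, which the cited papers describe. It is a minimal Python illustration that assumes a computation is carried out twice and the two copies are compared before the result is used. The names `checked_sum`, `flip_bit`, and `FaultDetected`, and the fault-injection parameters, are hypothetical.

```python
class FaultDetected(Exception):
    """Raised when the two redundant copies of a value disagree."""


def flip_bit(value: int, bit: int) -> int:
    """Simulate a transient fault by flipping one bit of an integer."""
    return value ^ (1 << bit)


def checked_sum(values, fault_at=None, fault_bit=0):
    """Sum `values` in two redundant accumulators and compare them at the end.

    fault_at / fault_bit are a hypothetical fault-injection hook: if
    fault_at is an index into `values`, one bit of the primary accumulator
    is flipped after that element is added, leaving the shadow copy intact.
    """
    primary = 0
    shadow = 0
    for i, v in enumerate(values):
        primary += v   # original computation
        shadow += v    # redundant copy of the same computation
        if i == fault_at:
            primary = flip_bit(primary, fault_bit)
    # Compare the copies before the result leaves the protected region.
    if primary != shadow:
        raise FaultDetected(f"primary={primary} shadow={shadow}")
    return primary
```

Calling `checked_sum([1, 2, 3])` returns the sum as usual, while `checked_sum([1, 2, 3], fault_at=1, fault_bit=4)` raises `FaultDetected` because the injected bit flip makes the two accumulators diverge. A real system would apply this kind of duplication automatically in the compiler rather than by hand.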

Software Controlled Fault Tolerance

Different applications and different segments of a single application may have different reliability and performance demands. Recognizing that one-size-fits-all approaches may be too costly or inappropriate for many markets, we proposed software-controlled fault tolerance [TACO 2005]. In software-controlled fault tolerance, the compiler or run-time optimizer modulates the performance and reliability of the fault tolerance system to meet specific demands using user, programmer, processor, or profile information. For example, rendering frames during movie playback should be done quickly, while bank transactions should be performed with the utmost care. This observation inspired much subsequent research aimed at low-cost, software-only fault tolerance. For example, Sundaram et al. [WREFT 2008] protect multimedia applications by duplicating only the instructions that are critical to correct execution. Similarly, Shoestring [ASPLOS 2010] applies intelligent analysis to detect and protect code segments that, when subjected to a soft error, are likely to result in user-visible faults.
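The movie-playback versus bank-transaction contrast can be sketched as a per-region protection policy. This is only an illustration, assuming protection is chosen per named code region and that the reliable mode runs the region twice and compares the results. In the approach described above, the compiler or run-time optimizer makes this choice. The names `Protection`, `POLICY`, `run_region`, and `MismatchError` are hypothetical.

```python
from enum import Enum


class Protection(Enum):
    NONE = "none"            # run once, favoring speed
    DUPLICATE = "duplicate"  # run twice and compare, favoring reliability


class MismatchError(Exception):
    """Raised when redundant executions of a region disagree."""


# Hypothetical per-region policy. In the text, this choice would be driven
# by user, programmer, processor, or profile information.
POLICY = {
    "render_frame": Protection.NONE,
    "bank_transfer": Protection.DUPLICATE,
}


def run_region(name, func, *args):
    """Execute `func` under the protection level assigned to region `name`."""
    level = POLICY.get(name, Protection.DUPLICATE)  # unknown regions get the safer mode
    result = func(*args)
    if level is Protection.DUPLICATE and func(*args) != result:
        raise MismatchError(name)
    return result


def render_frame(pixels):
    """Stand-in for performance-critical work where a rare glitch is tolerable."""
    return [p // 2 for p in pixels]


def bank_transfer(balance, amount):
    """Stand-in for work where a silent error is unacceptable."""
    return balance - amount
```

With this policy, `run_region("render_frame", render_frame, [2, 4])` executes the function once, while `run_region("bank_transfer", bank_transfer, 100, 30)` executes it twice and raises `MismatchError` if the two results differ. The sketch assumes the protected function is deterministic and free of side effects, so re-execution is safe.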

We have also proposed CRAFT [ISCA 2005], the first hybrid hardware/software fault-detection mechanism, as a promising alternative to hardware-only and software-only systems. These hybrid systems offer designers more options to fit their reliability needs within their hardware and performance budgets. Hybrid fault-detection techniques were also used in subsequent work [DSN 2009] and [HPCA 2007].

Project Ph.D. Graduates

Selected Project Publications

All Project Publications
