The THRIFT Project

In recent decades, microprocessor performance has increased exponentially, due in large part to smaller and faster transistors enabled by improved fabrication technology. While such transistors improve performance, their lower threshold voltages and tighter noise margins make them less reliable, rendering processors that use them more susceptible to transient faults, which are caused by external events such as energetic particles striking the chip. Processor designers constantly make trade-offs to obtain the best performance while still meeting their constraints, and the budget for reliability, like the budget for power, varies drastically across market segments.

Current approaches to reliability are one-size-fits-all and are inefficient as a result. The THRIFT Project advocates adaptive approaches that match the changing reliability and performance demands of a system to improve reliability at lower cost. This project introduces the concept of software-modulated fault tolerance (SMFT) to reduce the cost of reliability. SMFT techniques take advantage of natural non-uniformity in programs to control the degree and nature of protection necessary for each part of the program. The community has long recognized and exploited non-uniformity within programs to improve performance or reduce power. In the same vein, SMFT exploits non-uniformity to improve reliability at a lower cost.

Non-uniformity can manifest itself at the instruction level, at the semantic level, or even at the level of user expectations. At the instruction level, naturally existing protection means that the values of certain bits of state over some periods of time do not affect program correctness. The literature describes several classes of such behavior, but does not describe methods to take advantage of them to improve efficiency. For example, logical masking occurs when certain bits of a data value are dynamically dead because of logical operations such as AND. Although logical masking has been discussed, it has not yet been actively exploited to implement fault tolerance more efficiently.
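
The logical masking described above can be illustrated with a minimal Python sketch. The helper names flip_bit and dead_bits_under_and, the 8-bit width, and the sample values are assumptions made for illustration only; they are not part of any THRIFT tool.

```python
from typing import List


def flip_bit(value: int, bit: int) -> int:
    """Model a single-bit transient fault by inverting one bit of `value`."""
    return value ^ (1 << bit)


def dead_bits_under_and(mask: int, width: int = 8) -> List[int]:
    """Bit positions of the other operand that `x & mask` ignores.

    A position is masked exactly when the corresponding mask bit is 0.
    """
    return [bit for bit in range(width) if not (mask >> bit) & 1]


value = 0b10110110
mask = 0b00001111  # only the low four bits survive the AND

# A fault in bit 6 is masked: the AND result is unchanged.
masked_fault = (flip_bit(value, 6) & mask) == (value & mask)
# A fault in bit 2 is not masked: the AND result changes.
visible_fault = (flip_bit(value, 2) & mask) != (value & mask)
```

In this toy setting, bits 4 through 7 of value are dynamically dead once it feeds the AND, so a protection scheme would not need to guard them for that use.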

At the semantic level, some program parts are naturally resilient against faults [1]. Consider randomized algorithms. A fault in such an algorithm often merely perturbs the number of iterations required to converge by a small amount (favorably or unfavorably), but does not affect the final outcome of the program.
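
As a rough illustration of this kind of resilience, the following Python sketch runs a toy randomized hill climb and optionally flips one bit of its state partway through. The objective function, the starting point, the fault model (a single bit flip in the current position), and all names are hypothetical choices for this example, not taken from the project.

```python
import random


def score(x: int) -> int:
    """Toy objective with a single maximum at x == 42."""
    return -(x - 42) ** 2


def hill_climb(seed: int, fault_at=None, fault_bit: int = 5, start: int = 200):
    """Randomized hill climb; returns (final position, iterations used).

    If `fault_at` is given, one bit of the current position is flipped at
    that iteration to mimic a transient fault in the search state.
    """
    rng = random.Random(seed)
    x = start
    steps = 0
    # Keep going while either neighbor scores better than the current point.
    while score(x - 1) > score(x) or score(x + 1) > score(x):
        steps += 1
        if steps == fault_at:
            x ^= 1 << fault_bit  # injected single-bit fault
        candidate = x + rng.choice((-1, 1))
        if score(candidate) > score(x):
            x = candidate
    return x, steps


clean_result, clean_steps = hill_climb(seed=1)
faulty_result, faulty_steps = hill_climb(seed=1, fault_at=10)
```

Both runs end at the same answer; the injected fault only changes how many iterations the search needs. A fault in state that the algorithm does not re-derive (for example, the loop's termination test) would not be absorbed this way.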

At the level of user requirements, being occasionally wrong is sometimes better than being always right at a cost. Consider a transient fault in a movie player affecting a single pixel in a single frame of a movie during playback. This will likely not be noticed. Weigh this against a method of protection that reduces the frame rate by 30% to get every pixel correct. In this situation, such a fault-tolerance technique is detrimental to the user experience. Compare this situation to one involving a bank transaction, where correctness is much more important than speed.

At each of these levels, differing reliability demands may warrant differing levels of protection at different points in the program. By enabling the system, the programmer, or even the user to decide when and how to apply protection, SMFT can adapt protection to best suit the needs of a constantly varying system, allowing the selection of the optimal trade-off of reliability for performance or power.

In addition to the previously published software-only [2] and hybrid methods [3] amenable to SMFT, the THRIFT Project is exploring multiple approaches to better direct software-modulated fault tolerance. The first avenue of research involves feedback-driven optimization using off-line profiles, seeking new ways to speed up profiling while further improving the accuracy and granularity of profiling and modulation. The second avenue of research uses the VELOCITY compiler for static analysis and heuristics to determine when to apply protection. The third avenue of research exploits dynamic feedback for better identification of reliability opportunities. Dynamic feedback not only provides deeper insight into program behavior, but can also determine the dynamic reliability parameters of the system, enabling the system to adapt to differing conditions. For example, a processor on a mountaintop or on an airplane will encounter many more transient faults than a processor at sea level. The final THRIFT avenue of research is exploring simple ways for programmers to express the reliability requirements of different portions of code, allowing a pixel value in a movie frame to be computed quickly while allowing a bank transaction to be handled accurately.
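
To make the last avenue concrete, here is one way such programmer-expressed requirements might look, sketched in Python. The reliability decorator, its "low" and "high" levels, and the FaultDetected exception are all hypothetical; the source does not describe the project's actual interface. In this sketch, "high" is modeled as simple redundant execution with a comparison, which can detect a mismatch but cannot correct it.

```python
import functools


class FaultDetected(Exception):
    """Raised when two redundant executions disagree."""


def reliability(level: str):
    """Hypothetical annotation: 'low' runs once, 'high' runs twice and compares."""
    if level not in ("low", "high"):
        raise ValueError(f"unknown reliability level: {level!r}")

    def decorate(func):
        if level == "low":
            return func  # no protection, no overhead

        @functools.wraps(func)
        def protected(*args, **kwargs):
            first = func(*args, **kwargs)
            second = func(*args, **kwargs)
            if first != second:
                raise FaultDetected(func.__name__)
            return second

        return protected

    return decorate


@reliability("low")
def shade_pixel(r: int, g: int, b: int) -> int:
    """Grayscale value of a pixel; an occasional wrong value is tolerable."""
    return (r + g + b) // 3


@reliability("high")
def apply_deposit(balance_cents: int, amount_cents: int) -> int:
    """New account balance; must be computed correctly."""
    return balance_cents + amount_cents
```

The pixel computation pays no protection cost, while the deposit is executed twice and rejected on disagreement. Re-execution only suits side-effect-free functions like these; a real system would apply protection at a much finer granularity.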