
Computing Architectural Vulnerability Factors for Address-Based Structures
Arijit Biswas, Paul B. Racunas, Razvan Cheveresan, Joel Emer, Shubhendu S. Mukherjee, and Ram Rangan
Proceedings of the 32nd International Symposium on Computer Architecture, June 2005.

Microprocessor designers require accurate estimates of processor error rates--arising from alpha particle and neutron strikes--to make appropriate cost/reliability trade-offs in their designs. A key aspect of soft error rate estimation involves computing a structure's architectural vulnerability factor (AVF), which is the probability that a fault in the structure will result in a user-visible error. Prior work has shown how to compute the AVF of an instruction queue and execution units using a performance model.

This paper makes four contributions towards computing the AVFs of three critical address-based structures--a write-through data cache, a data translation buffer, and a store buffer--each with distinctive hardware characteristics. First, we describe how to perform a detailed breakdown of lifetime components (e.g., fill-to-read, read-to-evict) of bits in these structures into ACE (required for architecturally correct execution), un-ACE (unnecessary for ACE), and unknown components. Then, we calculate the AVF for the data arrays. Our analysis of a detailed IA64 processor simulator shows a best estimate AVF of 6%, 36%, and 4%, respectively, for the data arrays of the three structures (with realistic sizes). The AVF of the store buffer's data array is particularly low because of low average utilization.
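The lifetime breakdown described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each entry's lifetime has already been split into named intervals, that fill-to-read and read-to-read intervals count as ACE, and that idle and read-to-evict intervals count as un-ACE; anything else is treated as unknown. The names `Interval`, `ACE_KINDS`, `UNACE_KINDS`, and `avf_breakdown` are hypothetical and are not taken from the paper or its simulator.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    kind: str    # lifetime component, e.g. "fill-to-read" or "read-to-evict"
    cycles: int  # length of the interval in cycles


# Assumed classification, for illustration only.
ACE_KINDS = {"fill-to-read", "read-to-read"}
UNACE_KINDS = {"idle", "read-to-evict"}


def avf_breakdown(entries, total_cycles):
    """Return the ACE, un-ACE, and unknown fractions of all entry-cycles.

    entries: one list of Interval objects per entry of the structure.
    total_cycles: length of the measurement window in cycles.
    """
    totals = {"ace": 0, "unace": 0, "unknown": 0}
    for intervals in entries:
        for iv in intervals:
            if iv.kind in ACE_KINDS:
                totals["ace"] += iv.cycles
            elif iv.kind in UNACE_KINDS:
                totals["unace"] += iv.cycles
            else:
                totals["unknown"] += iv.cycles
    denom = len(entries) * total_cycles
    return {name: cycles / denom for name, cycles in totals.items()}


# Two entries observed over a 100-cycle window (made-up numbers).
entries = [
    [Interval("idle", 20), Interval("fill-to-read", 30),
     Interval("read-to-evict", 40), Interval("read-to-end", 10)],
    [Interval("idle", 100)],
]
result = avf_breakdown(entries, total_cycles=100)
print(result)
```

In this toy input the ACE fraction is 30 of 200 entry-cycles. The ACE fraction is the AVF estimate under these assumptions, and the unknown fraction bounds how far that estimate could move if the unclassified intervals turned out to be ACE.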

Second, we extend the lifetime analysis for tag arrays to identify false positive (match when there should have been a mismatch) and false negative (mismatch when there should have been a match) cases. To identify false positive matches, we introduce a novel technique called Hamming-distance-one analysis, which identifies tag entries that differ by one bit (potentially due to a single bit error) from the incoming match bits. Our best estimates for the tag arrays of these structures show surprisingly low AVFs of less than 0.41%, 3%, and 7.7%, respectively. For the data cache and translation buffer, the low AVF arises because a false negative match will force a miss and refetch sequence, but not cause an error. For the store buffer tag, the low AVF arises from low average utilization of the store buffer itself.
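The core check behind Hamming-distance-one analysis can be sketched as follows. This is a minimal illustration that assumes tags are plain integers; the function names are hypothetical, and the sketch covers only the bit comparison, not the lifetime accounting the paper builds around it.

```python
def hamming_distance_one(a, b):
    """True if integers a and b differ in exactly one bit position."""
    diff = a ^ b
    # A value with exactly one bit set is nonzero and loses that bit
    # when ANDed with itself minus one.
    return diff != 0 and (diff & (diff - 1)) == 0


def near_miss_entries(stored_tags, incoming_tag):
    """Indices of stored tags that a single bit flip could turn into a match."""
    return [i for i, tag in enumerate(stored_tags)
            if hamming_distance_one(tag, incoming_tag)]


stored = [0b1010, 0b1011, 0b0010, 0b0101]
candidates = near_miss_entries(stored, incoming_tag=0b1010)
print(candidates)  # [1, 2]
```

Entry 0 is an exact match and entry 3 differs in four bits, so only entries 1 and 2 are candidates for a false positive caused by a single bit error.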

Third, we introduce a new technique called cooldown, complementary to the conventional warmup used in performance models, to reduce the unknown component in the AVF analysis. Unknowns arise from a lack of knowledge of the future lifetime behavior of the constituent bits and can be reduced by continuing to run the simulation after the actual statistics collection ends. Finally, using our lifetime analysis framework, we show how two AVF reduction techniques--periodic flushing and incremental scrubbing--can reduce the AVF by converting ACE lifetime components into un-ACE without affecting performance significantly.
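The idea behind cooldown can be sketched for a single entry whose last interval is still open when statistics collection ends. The sketch assumes a simplified event model for a write-through structure: a later read makes the open interval ACE, while an eviction or refill before any read makes it un-ACE. The event names and the function `classify_open_interval` are hypothetical.

```python
def classify_open_interval(cooldown_events):
    """Classify an interval left open at the end of statistics collection.

    cooldown_events: events seen for this entry during the cooldown
    period, in program order (assumed names: "read", "evict", "fill").
    """
    for event in cooldown_events:
        if event == "read":
            return "ace"      # the value was consumed later
        if event in ("evict", "fill"):
            return "unace"    # the value was discarded before any use
    return "unknown"          # cooldown ended without a deciding event


print(classify_open_interval(["read"]))           # ace
print(classify_open_interval(["evict", "read"]))  # unace
print(classify_open_interval([]))                 # unknown
```

A longer cooldown period leaves fewer intervals in the last case, which is how continuing the simulation shrinks the unknown component.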