Computing Architectural Vulnerability Factors for Address-Based Structures
Arijit Biswas, Paul B. Racunas, Razvan Cheveresan, Joel Emer, Shubhendu S. Mukherjee, and Ram Rangan
Proceedings of the 32nd International Symposium on
Computer Architecture, June 2005.
Microprocessor designers require accurate estimates of processor
error rates--arising from alpha and neutron strikes--to make
appropriate cost/reliability trade-offs in their designs. A key aspect
of the soft error rate estimation involves computing a structure's
architectural vulnerability factor (AVF), which is the probability
that a fault in the structure will result in a user-visible error.
Prior work has shown how to compute the AVF of an instruction queue
and execution units using a performance model.

This paper makes four contributions towards computing the AVFs of
three critical address-based structures--a write-through data cache, a
data translation buffer, and a store buffer--each with distinctive
hardware characteristics. First, we describe how to perform a detailed
breakdown of lifetime components (e.g., fill-to-read, read-to-evict)
of bits in these structures into ACE (required for architecturally
correct execution), un-ACE (unnecessary for ACE), and unknown
components. Then, we calculate the AVF for the data arrays. Our
analysis of a detailed IA64 processor simulator shows a best
estimate AVF of 6%, 36%, and 4%, respectively, for the data arrays of
the three structures (with realistic sizes). The AVF of the store
buffer's data array is particularly low because of low average
utilization.

Second, we extend the lifetime analysis for tag arrays to identify
false positive (match when there should have been a mismatch) and
false negative (mismatch when there should have been a match)
cases. To identify false positive matches, we introduce a novel
technique called Hamming-distance-one analysis, which identifies tag
entries that differ by one bit (potentially due to a single bit error)
from the incoming match bits. Our best estimate for the AVF of the tag
arrays of these structures shows surprisingly low AVFs of less than
0.41%, 3%, and 7.7%, respectively. For the data cache and translation
buffer, the low AVF arises because a false negative match will force a
miss and refetch sequence, but not cause an error. For the store
buffer tag, the low AVF arises from low average utilization of the
store buffer itself.

Third, we introduce cooldown, a new technique
complementary to the conventional warmup used in performance models,
to reduce the unknown component in the AVF analysis. Unknowns arise
from lack of knowledge of future lifetime behavior of the constituent
bits and can be reduced by continuing to run the simulation after the
actual statistics collection ends. Finally, using our lifetime
analysis framework, we show how two AVF reduction techniques--periodic
flushing and incremental scrubbing--can reduce the AVF by converting
ACE lifetime components into un-ACE without affecting performance
significantly.
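
As a rough illustration of two ideas named in the abstract, the Python sketch below shows (a) a Hamming-distance-one check that finds stored tags differing from an incoming tag by exactly one bit, and (b) AVF bounds computed from lifetime components labeled ACE, un-ACE, and unknown. This is not the authors' implementation. The function names, the integer tag representation, the (label, bit-cycles) input format, and the choice to report the unknown component as the gap between a lower and an upper bound are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two tags differ."""
    return bin(a ^ b).count("1")


def distance_one_entries(incoming_tag: int, stored_tags: Iterable[int]) -> List[int]:
    """Indices of stored tags exactly one bit away from the incoming tag.

    Assumption: a single-bit flip in any of these entries could turn a
    correct mismatch into a false positive match.
    """
    return [
        index
        for index, tag in enumerate(stored_tags)
        if hamming_distance(incoming_tag, tag) == 1
    ]


def avf_bounds(
    components: Iterable[Tuple[str, int]], total_bit_cycles: int
) -> Tuple[float, float]:
    """Lower and upper AVF bounds from labeled lifetime components.

    components: hypothetical (label, bit_cycles) pairs, where label is
    "ace", "unace", or "unknown".
    total_bit_cycles: bits in the structure times simulated cycles.
    Lower bound counts only ACE time; upper bound also counts unknown time.
    """
    items = list(components)  # allow generators to be summed twice
    ace = sum(cycles for label, cycles in items if label == "ace")
    unknown = sum(cycles for label, cycles in items if label == "unknown")
    return ace / total_bit_cycles, (ace + unknown) / total_bit_cycles


if __name__ == "__main__":
    tags = [0b1010, 0b1011, 0b0010, 0b0101]
    print(distance_one_entries(0b1010, tags))
    print(avf_bounds([("ace", 30), ("unace", 50), ("unknown", 20)], 100))
```

In this sketch, shrinking the unknown component (for example by running the simulation longer, as cooldown does) narrows the gap between the two bounds.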