Graduation Year

2009

Document Type

Dissertation

Degree

Ph.D.

Degree Granting Department

Computer Science and Engineering

Major Professor

Nagarajan Ranganathan, Ph.D.

Committee Member

Srinivas Katkoori, Ph.D.

Committee Member

Hao Zheng, Ph.D.

Committee Member

Sanjukta Bhanja, Ph.D.

Committee Member

Kandethody M. Ramachandran, Ph.D.

Keywords

Transient faults, Design flow, VLSI CAD, Reliable architecture design, Reliability-aware design automation

Abstract

The occurrence of transient faults like soft errors in computer circuits poses a significant challenge to the reliability of computer systems. Soft error, which occurs when the energetic neutrons coming from space or the alpha particles arising out of packaging materials hit the transistors, may manifest themselves as a bit flip in the memory element or as a transient glitch generated at any internal node of combinational logic, which may subsequently propagate to and be captured in a latch. Although the problem of soft errors was earlier only a concern for space applications, aggressive technology scaling trends have exacerbated the problem to modern VLSI systems even for terrestrial applications.

In this dissertation, we explore techniques at all levels of the design flow to reduce the vulnerability of VLSI systems against soft errors without compromising on other design metrics like delay, area and power. We propose new models for estimating soft errors for storage structures and combinational logic. While soft errors in caches are estimated using the vulnerability metric, soft errors in logic circuits are estimated using two new metrics called the glitch enabling probability (GEP) and the cumulative probability of observability (CPO). These metrics, based on signal probabilities of nets, accurately model soft errors in radiation-aware synthesis algorithms and helps in efficient exploration of the design solution space during optimization. At the physical design level, we leverage the use of larger netlengths to provide larger RC ladders for effectively filtering out the transient glitches. Towards this, a new heuristic has been developed to selectively assign larger wirelengths to certain critical nets. This reduces the delay and area overhead while improving the immunity to soft errors. Based on this, we propose two placement algorithms based on simulated annealing and quadratic programming which significantly reduce the soft error rates of circuits.

At the circuit level, we develop techniques for hardening circuit nodes using a novel radiation jammer technique. The proposed technique is based on the principles of a RC differentiator and is used to isolate the driven cell from the driving cell which is being hit by a radiation strike. Since the blind insertion of radiation blocker cells on all circuit nodes is expensive, candidate nodes are selected for insertion of these cells using a new metric called the probability of radiation blocker circuit insertion (PRI). We investigate a gate sizing algorithm, at the logic level, in which we simultaneously optimize both the soft error rate (SER) and the crosstalk noise besides the power and performance of circuits while considering the effect of process variations. The reliability centric gate sizing technique has been formulated as a mathematical program and is efficiently solved. At the architectural level, we develop solutions for the correction of multi-bit errors in large L2 caches by controlling or mining the redundancy in the memory hierarchy and methods to increase the amount of redundancy in the memory hierarchy by employing a redundancy-based replacement policy, in which the amount of redundancy is controlled using a user defined redundancy threshold.

The novel architectures and the new reliability-centric synthesis algorithms proposed for the various design abstraction levels have been shown to achieve significant reduction of soft error rates in current nanometer circuits. The design techniques, algorithms and architectures can be integrated into existing design flows. A VLSI system implementation can leverage on the architectural solutions for the reliability of the caches while the custom hardware synthesized for the VLSI system can be protected against radiation strikes by utilizing the circuit level, logic level and layout level optimization algorithms that have been developed.

Share

COinS