Predictive Failure Analytics with Optimization for Big Data
Applying data mining and advanced statistical methods to analyze, diagnose and improve manufacturing yield
Sep. 7, 2014 09:00 AM
We present a case study in applying data mining and advanced statistical methods to analyze, diagnose, and improve manufacturing yield, with a focus on predicting rare failure events. Intra-die process variations in nanometer technology nodes pose significant challenges to robust design practices. Geometric variations, along with random dopant fluctuation effects, have a significant impact on memory functionality and yield. Model inaccuracies and process variability are more pronounced, so variability effects in a processor chip must be understood with higher accuracy and fidelity, and with more physical effects taken into account, than ever before. In this case study, we use predictive failure analytics to learn and optimize critical components of the processor, and we handle massive amounts of data using server farms for parallel processing. The technique can handle the large numbers of process and design variables that cause mismatches in transistors, demonstrating high dimensionality, accuracy and efficiency. This increases confidence in the functionality and operability of the system-on-chip as a whole. The underlying algorithms are generic and can be applied to big data analysis. In particular, the techniques and framework are well suited to cloud computing architectures, both for scaling processing power and data handling and for making such analysis available to organizations that would otherwise not have the means.
With the rapid scaling of CMOS technology, die-to-die and intra-die process variation effects are increasing dramatically (Figure 1). Static random access memory (SRAM) is the main type of memory used in microprocessors, networking chips, cell phone chips, etc. To meet the demand for high-density memory, designers use the smallest devices and the most aggressive design topologies and geometries for SRAM cells. The drive for small transistors leads to unavoidable manufacturing variations between neighboring transistors of the memory cell, due to atomic-scale variations in the transistors' physical features and dopants. In particular, threshold voltage mismatch between neighboring devices can lead to a large number of fails in memory designs and can degrade SRAM performance and yield. Combined with other effects such as narrow width effects, soft error rate (SER), temperature and process variations, and parasitic transistor resistance, this makes the scaling of SRAMs increasingly difficult due to reduced margins [1-4]; end-of-life effects can further aggravate the situation. The same applies to logic design. In fact, designing for the worst case is simply not feasible anymore. Statistical timing techniques have been used to achieve full-chip and full-process coverage based on high-level models, and they enable robust design practices. Furthermore, statistical techniques have been shown to improve quality in the context of at-speed test. To enable full-chip analysis, however, these models sacrifice accuracy and deal mainly with 3-sigma estimates.
Figure 1: Classification of variation sources. To ensure chip performance & yield, all of these sources of variation must be considered and controlled.
Accurate modeling and efficient statistical methodologies that can handle large numbers of variables form the crux of predictive analytics. Memories in microprocessors occupy 50-60% of the area and are critical for storage. In the past, statistical analysis for logic (latches, decoders, etc.) and SRAM memory has not been addressed adequately in the context of circuit design, especially when rare event failure estimation is involved; this is true not only from the performance perspective but also from the functional behavior perspective. Hence, there is a need to capture not only average logic delay distributions, but also possible design fails. As the number of elements (e.g. latches) in a design increases, a rare functional fail becomes more likely to occur somewhere on the chip; this matters especially when we want to guarantee the yield for millions of chips. Furthermore, it is necessary to analyze the yield of the memory design in situ with the peripheral logic. This raises the need for simultaneous statistical analysis of the memory/logic unit.
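To make the scale of the problem concrete, the following minimal Python sketch relates a per-cell fail probability to the yield of a whole array. It is an illustration only and rests on simplifying assumptions that are not taken from the case study: cell failures are treated as independent, the failing metric is treated as Gaussian with a one-sided tail, and the function names (`tail_probability`, `array_yield`, `required_sigma`) are hypothetical.

```python
import math
from statistics import NormalDist


def tail_probability(sigma: float) -> float:
    """One-sided Gaussian tail probability P(Z > sigma)."""
    return 0.5 * math.erfc(sigma / math.sqrt(2.0))


def array_yield(p_cell_fail: float, n_cells: int) -> float:
    """Probability that none of n_cells fails (assumes independent cells)."""
    # (1 - p)^n, computed via log1p so that tiny p does not lose precision.
    return math.exp(n_cells * math.log1p(-p_cell_fail))


def required_sigma(target_yield: float, n_cells: int) -> float:
    """Per-cell sigma level needed for an array of n_cells to meet target_yield."""
    # Solve (1 - p)^n = Y for p, then convert p to an equivalent sigma level.
    p_allowed = -math.expm1(math.log(target_yield) / n_cells)
    return -NormalDist().inv_cdf(p_allowed)


if __name__ == "__main__":
    n = 1_000_000  # hypothetical array size
    print("array yield with 3-sigma cells:", array_yield(tail_probability(3.0), n))
    print("sigma needed for 99% array yield:", required_sigma(0.99, n))
```

Under these assumptions, a 3-sigma per-cell margin leaves essentially no yield for a million-cell array, which is why the analysis has to reach far into the tail of the distribution.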
In this case study, we apply fast, Monte Carlo-compatible techniques, originally intended for memory analysis, to such custom logic applications. We first revisit the methodology and its use as an analysis tool for different designs. We then take concrete examples to demonstrate the methodology for custom logic. Specifically, we go over case studies of memory interacting with logic in terms of the local evaluation circuits undergoing Fast-Read-Before-Write. Finally, we conclude with the examples of memory decode logic and hit-logic. Figure 2 provides an overview of the applications under study in terms of the components in commonly used chip design. Recently published work on an IBM POWER8 microprocessor shows close to 4.2B transistors, 12 cores, and L2 and shared L3 SRAM cache memories.
As is typical of state-of-the-art microprocessors, memory units occupy 50-60% of the chip while logic occupies the rest. Hence, predicting yield through variability analysis is of prime importance, targeting memory elements first and then logic.
Figure 2: Partitioning of logic and memory in state-of-the-art chip design.
Predictive Analytics for Memory Yield Design And Beyond
Figure 3: Prior Art: Monte Carlo method and its alternatives can lead to inaccuracies in the yield estimate given the limited number of sample points.
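The limitation noted in the Figure 3 caption can be quantified with the standard error of a plain Monte Carlo estimate of a fail probability. The sketch below is a generic textbook calculation, not the authors' tool; the function names are hypothetical.

```python
import math


def mc_relative_std_error(p_fail: float, n_samples: int) -> float:
    """Relative standard error of a plain Monte Carlo estimate of p_fail."""
    # The estimator is a binomial proportion: variance = p (1 - p) / n.
    return math.sqrt((1.0 - p_fail) / (n_samples * p_fail))


def mc_samples_needed(p_fail: float, rel_error: float) -> int:
    """Samples needed for plain Monte Carlo to reach a given relative error."""
    return math.ceil((1.0 - p_fail) / (p_fail * rel_error ** 2))


if __name__ == "__main__":
    # Roughly a 6-sigma event (p about 1e-9), estimated to within 10%.
    print(mc_samples_needed(1e-9, 0.1))
```

For a fail probability near 1e-9, on the order of 1e11 circuit simulations would be needed for a 10% relative error, which is what motivates sampling schemes that concentrate on the failure region.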
In earlier work, we proposed mixture importance sampling as a comprehensive and computationally efficient method for estimating low fail probabilities of SRAM designs. The method relies on adjusting the (natural) Monte Carlo sampling function to produce more samples in the important region(s) (see Figure 4). It is based on the following identity:
$$E_p[Q(x)] = E_g\left[Q(x)\,\frac{p(x)}{g(x)}\right]$$
where $E_p[Q]$ is the expected value of Q with respect to the sampling function p(x), $E_g$ is the expected value with respect to g(x), g(x) is the distorted sampling function, and p(x) is the natural distribution. The method is theoretically sound, and with the proper choice of g(x), we are able to obtain accurate results with a relatively small number of simulations. We refer the reader to that work for more details.
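The reweighting idea can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it uses a two-component mixture g(x) = lam * p(x) + (1 - lam) * p(x - shift) with independent standard normal variables, and the fail criterion `fails`, the `shift` vector, and the mixing weight `lam` are all hypothetical choices made so that the exact answer is known.

```python
import math
import random


def fails(x, threshold):
    """Hypothetical fail criterion: normalized sum of variations exceeds threshold."""
    return sum(x) / math.sqrt(len(x)) > threshold


def mixture_is_estimate(n_samples, dim, threshold, shift, lam=0.1, seed=0):
    """Estimate P(fail) under p = N(0, I) by sampling from a two-part mixture g."""
    rng = random.Random(seed)
    shift_sq = sum(s * s for s in shift)
    total = 0.0
    total_sq = 0.0
    for _ in range(n_samples):
        # Draw from g: with probability lam use the natural p, else the shifted p.
        center = [0.0] * dim if rng.random() < lam else shift
        x = [rng.gauss(c, 1.0) for c in center]
        if fails(x, threshold):
            # p(x - shift) / p(x) = exp(shift . x - |shift|^2 / 2) for unit normals.
            log_ratio = sum(s * xi for s, xi in zip(shift, x)) - 0.5 * shift_sq
            weight = 1.0 / (lam + (1.0 - lam) * math.exp(log_ratio))  # p(x) / g(x)
            total += weight
            total_sq += weight * weight
    mean = total / n_samples
    variance = max(total_sq / n_samples - mean * mean, 0.0)
    return mean, math.sqrt(variance / n_samples)


if __name__ == "__main__":
    dim, threshold = 6, 5.0
    # Assumed shift: the point on the fail boundary closest to the nominal design.
    shift = [threshold / math.sqrt(dim)] * dim
    estimate, std_err = mixture_is_estimate(20_000, dim, threshold, shift)
    exact = 0.5 * math.erfc(threshold / math.sqrt(2.0))
    print(f"estimate={estimate:.3e} +/- {std_err:.1e}, exact={exact:.3e}")
```

Keeping a fraction `lam` of the natural distribution in the mixture bounds every weight by 1 / lam, which guards against a badly placed shift producing a few enormous weights.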
Figure 4: Importance sampling helps improve the rate of sampling in the important regions as opposed to traditional Monte Carlo.
For SRAM cells, important metrics such as read/write margins, stability and performance are subject to process variation, which can degrade yield. Figure 5 shows a schematic of a 6-transistor SRAM cell; often, to improve yield, the cell and logic supplies are separated. Here, we assign Vcs to the cell supply and Vdd to the bitline logic. We consider two cases: (1) wordline connected to Vdd, and (2) wordline connected to Vcs. We then rely on our methodology to study the yield under different dual-supply topologies and conditions. Figure 6 shows an example of model-to-hardware corroboration for case 1, for combined stability results. Case 2 can be handled similarly. The same technique can be applied to logic.
Figure 5: Schematic of an SRAM cell, and possible dual supply scenarios.
Figure 6: (a) Operating and failure regions through proposed predictive methodology in the Vdd x Vcs space for case 1: wordline connected to Vdd.
Figure 6: (b) Hardware data for the operating/failure region shows close matching.
Predictive Analytics for High Dimensionality
One of the critical aspects of statistical prediction technology is the error control algorithm and the ability to monitor and diagnose the tool's convergence. This is particularly critical and challenging for very high sigma applications such as the one illustrated in Figure 8, where 20,000 samples are required to analyze the tail of the distribution out to 8σ. Note that, on the right side of Figure 8, the framework provides continuously updated convergence diagnostics, with upper and lower confidence intervals plotted against sample number.
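A convergence monitor of this general kind can be sketched as follows. This is only an assumed, generic construction (a running mean with a normal-approximation confidence interval), not the error control algorithm used in the framework; `running_confidence`, `to_sigma`, and the parameters `z` and `report_every` are hypothetical.

```python
import math
from statistics import NormalDist


def running_confidence(values, z=1.96, report_every=1000):
    """Track a running mean with confidence bounds versus sample number.

    values: iterable of per-sample contributions, e.g. weight * fail_indicator
    from an importance-sampling run. Returns (n, estimate, lower, upper) tuples.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's algorithm)
    history = []
    for v in values:
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
        if n % report_every == 0 and n > 1:
            std_err = math.sqrt(m2 / (n - 1) / n)
            lower = max(mean - z * std_err, 0.0)  # a probability cannot be negative
            history.append((n, mean, lower, mean + z * std_err))
    return history


def to_sigma(p_fail):
    """Equivalent one-sided sigma level; requires 0 < p_fail < 1."""
    return -NormalDist().inv_cdf(p_fail)


if __name__ == "__main__":
    # Toy input: alternating pass/fail contributions with unit weights.
    for n, est, lo, hi in running_confidence([1.0, 0.0] * 5000):
        print(n, round(est, 4), round(lo, 4), round(hi, 4))
```

Plotting the lower and upper bounds (or their `to_sigma` equivalents) against n gives a diagnostic of the same shape as the one described for Figure 8: the band narrows roughly as one over the square root of the sample count.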
Figure 7: No correlation between distance from nominal (likelihood of a sample point), and failure probability in high dimension (not to scale). Case data with overlapping pass/fail samples.
Figure 8: >7σ high sigma analysis for 6T SRAM bitcell