**control**

# Assessing Data Independence and Normality for SPC Charts

PCB chemical manufacturing processes can violate data independence and normality.

by Patrick Valentine, Ph.D.

The most critical assumption made concerning statistical process control (SPC) charts is that of data independence from one observation to the next (free from autocorrelation).^{1,2} The second critical assumption is that the individual observations are approximately normally distributed.^{1,2} The tabled constants used to calculate the SPC chart limits are constructed under the assumption of independence and normality.

Many printed circuit board chemical manufacturing processes can violate the assumption of data independence. This is because inertial elements drive reduction-oxidation (redox) chemical processes. When the interval between samples becomes small relative to the inertial elements, the sequential observations of the process will be correlated over time.

Statistical process control charts do not work well if the quality attributes charted exhibit even low levels of correlation over time. Correlated data produce too many false alarms – correlated data underestimate the upper and lower control limits.

Statistical process control charts are designed to capture 99.73% of the data within a ±3σ interval. In other words, there is a 0.9973% probability of not receiving a signal when the process is in control, the data are not autocorrelated, and normality is assumed. The distribution tails are symmetrical, each being 0.135% (100% – 99.73% = 0.27% / 2 = 0.135%) (Figure 1). Non-normal data change the tail probabilities. Departures from normality can significantly alter SPC detection rules.

### Data Independence

There are two primary plots for checking data independence. These plots are the time series plot and the autocorrelation function plot. The time series plot is qualitative, while the autocorrelation function plot is quantitative. Details are given below.

Time series plot. Time series plots are an easy way to summarize a univariate data set graphically. Common assumptions of univariate data sets are that they have a fixed distribution, with a common location (mean) and common variation (standard deviation). With time series plots, shifts in location and variation are easy to see, and outliers can easily be detected, giving the process engineer an excellent feel for the data. Time series plots can answer the following questions:

- Are the data independent?
- Are there any shifts in location?
- Are there any shifts in variation?
- Are there any outliers?

Autocorrelation function plot. The autocorrelation function is used to detect independence in univariate data sets. The autocorrelation is a correlation coefficient. Instead of a correlation between two variables, however, the correlation is between two values of the same variable at times X_{i} and X_{i+k}. Although the time variable, *X,* is not used in the formula for autocorrelation, the assumption is that the observations are equispaced.^{3,4} Usually, only the first few autocorrelations (lags 1, 2 and 3) are of interest in detecting independence. The autocorrelation function can be used to answer the following question:

- Was the sample data set generated from an independent (random) process?

### Data Normality

There are two primary plots for checking data normality. These plots are the histogram and the probability plot. The histogram is qualitative, while the probability plot is quantitative. Details are given below.

Histogram. The most common form of the histogram is obtained by splitting the range of the data into equal-sized bins. Then, the number of data points that fall into each bin is counted. This allows the histogram to summarize the distribution of a univariate data set graphically. These summarizations provide strong indications of the proper distributional model for the data, in this case, the normal distribution for SPC charts.^{4} The histogram can answer the following questions:

- What is the center of the data?
- What is the spread of the data?
- Are there any outliers?
- Are there multiple modes in the data?
- Are the data bell-shaped?

Probability plot. The probability plot is a graphical technique for assessing whether a data set follows a given distribution.^{4} The data are plotted against a theoretical distribution so that the points should form an approximately straight line on the diagonal. Departures from this straight line indicate departures from the specified distribution. The probability plot can answer the following questions:

- Does the normal distribution provide a good fit to the data?
- What are reasonable estimates for the location and variation parameters?

### Examples

The process engineer gathers two data sets from the electroless nickel immersion gold (ENIG) final finish line. Data set one: the cleaner concentration (ml/L). Data set two: the microetch’s sodium persulfate concentration (g/L). The assumptions of data independence and normality are checked. Data set one is shown in Figure 2, and two in Figure 3.

Data set one. The time series plot appears to have a fixed distribution, with a common location (mean) and common variation (standard deviation). There are no shifts in location or variation and no outliers. The autocorrelation function shows no statistically significant autocorrelations at lags 1, 2 or 3 (bars outside the red 95% confidence limit lines). The assumption of data independence has been met.

The histogram indicates that the data follow a normal distribution (a bell-shaped curve) and that there are no outliers. The probability plot points form an approximately straight line on the diagonal, and the p-value is significantly greater than 0.05. The assumption of data normality has been met.

Data set two. The time series plot appears not to have a fixed distribution, and there is not a common location (mean) or common variation (standard deviation). There are shifts in the location and variation; no outliers are present. The autocorrelation function shows statistically significant autocorrelations at lags 1, 2 and 3 (bars outside the red 95% confidence limit lines). The assumption of data independence has been violated.

The histogram indicates that the data follows a bimodal distribution (two distinct peaks). The probability plot points do not form an approximately straight line on the diagonal, and the p-value is significantly less than 0.05. The assumption of data normality has been violated.

### Statistical Process Control Charts

The two data sets were plotted on individual moving range SPC charts for discussion. Data set one is shown in Figure 4, and two in Figure 5, with the Western Electric Co. Rules violation points in red. The four Western Electric Co. Rules are listed below:^{4}

- One point more than three standard deviations from center line
- Two out of three points > two standard deviations from center line (same side)
- Four out of five points > one standard deviations from center line (same side)
- Nine points in a row on the same side of the center line.

All detection rules are subject to Type I false alarms (a point falling beyond the control limits, indicating an out-of-control condition when no assignable cause is present).^{1,4} False alarm rates are reported as average run lengths (ARL). The ARL tells us, for a given situation, how long, on average, we will plot successive control chart points before we detect a rule violation. Using just the first Western Electric Co. Rule provides an ARL of ~371. Using all four Western Electric Co. Rules reduces the ARL to ~92. The process engineer must decide whether this price is worth paying (signal sensitivity versus false alarms). One strategy is to use the four Western Electric Co. Rules but take them “less seriously” regarding the effort put into troubleshooting activities when out-of-control signals occur.

Data set one discussion. The non-correlated data result in accurate upper and lower control limit estimations. The normal data provide the correct symmetrical tail probabilities. There are a few detection rule violations (points in red). The process engineer has reliable evidence that these rule violation points are out-of-control, not false alarms. The capability indices C_{pk} and P_{pk} should be computed. These capability indices will aid the process engineer in determining if the violation points are common causes or assignable causes.

Data set two discussion. The correlated data result in underestimating the upper and lower control limits. The non-normal data change the tail probabilities. There are many rule violations (points in red). The process engineer cannot determine which violation points are out-of-control or false alarms. The process engineer would waste valuable time investigating all the violation points. The capability index C_{pk} would be misleading and is therefore not recommended to be computed. The capability index P_{pk} would be much more informative.

### Conclusions

Data independence and normality are underlying crucial assumptions for statistical process control charts. SPC charts do not work well if the quality attributes charted exhibit even low levels of correlation over time. Correlated data produce too many false alarms. Non-normal data alter the symmetrical tail probabilities. Violating independence and normality can significantly increase statistical process control detection rule false alarms. With a lack of data independence and normality, a process engineer has virtually no way of knowing which violation points are real and which are false alarms. Valuable time is wasted by not confirming the data’s independence and normality before plotting them on a statistical process control chart.

References

1. D. Montgomery, *Introduction to Statistical Quality Control, 6th Ed.,* John Wiley & Sons, 2009.

2. T. Ryan, *Statistical Methods for Quality Improvement, 2nd Ed.,* John Wiley & Sons, 2000.

3. G. Box and G. Jenkins, *Time Series Analysis: Forecasting and Control,* Holden-Day, 1976.

4. National Institute of Standards and Technology, *Handbook of Statistical Methods,* 2003.

Patrick Valentine, Ph.D. is the technical and Lean Six Sigma manager for Uyemura USA (uyemura.com); pvalentine@uyemura.com. He teaches Six Sigma Green Belt and black belt courses as part of his responsibilities. He holds a doctorate in Quality Systems Management from Cambridge College and ASQ certifications as a Six Sigma Black Belt and Reliability Engineer.