Estimation Error of MSA Summary Statistics
Updated: Jun 19, 2020
I hope you are all staying safe out there - or actually, in there... With the current COVID-19 situation, most of us have had to adjust to a new way of living. That of staying inside.
What is for certain is that COVID-19 has completely turned things upside down for virtually all of us; the world, as we know it, will never be the same. With that said, I strongly believe that we, mankind, will come out of this stronger than we were before with lots of learning along the way.
But now that you have gotten this far: you are reading the first Practical MSA blog post!
My first eBook, Practical MSA: Laying The Foundations, briefly discusses the conventional approach to Measurement Systems Analysis. If you haven't read the eBook, visit practicalmsa.com and get your free copy there.
The conventional MSA approach introduces standard MSA summary statistics, such as %Study Variation, %Tolerance (or P/T ratio) and NDC (Number of Distinct Categories). These summary statistics are simple ratios.
The below figure should shed some light on what these ratios mean and how they are calculated.
Conventional guideline acceptance criteria for both %Study Variation and %Tolerance are:
Less than 10%: acceptable
Between 10% and 30%: marginal
Over 30%: unacceptable
Conventional guideline acceptance criteria for the NDC metric:
Less than 5: unacceptable
5 or more: acceptable
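As a rough sketch of how these ratios are usually computed, the snippet below uses the common AIAG-style definitions. Treat the details as assumptions: the multiplier of 6 in %Tolerance (some references use 5.15), the 1.41 factor in NDC, and the function names `msa_summary` and `classify` are illustrative, and the input values are made up.

```python
import math

def msa_summary(sd_grr, sd_part, lsl, usl, k=6.0):
    """Return (%Study Variation, %Tolerance, NDC) from standard deviations.

    Assumed definitions (common AIAG-style formulas):
      %Study Variation = 100 * sd_grr / sd_total
      %Tolerance       = 100 * k * sd_grr / (usl - lsl), with k = 6 here
      NDC              = 1.41 * sd_part / sd_grr, truncated to an integer
    """
    sd_total = math.sqrt(sd_grr**2 + sd_part**2)
    pct_study_var = 100.0 * sd_grr / sd_total
    pct_tolerance = 100.0 * k * sd_grr / (usl - lsl)
    ndc = int(1.41 * sd_part / sd_grr)
    return pct_study_var, pct_tolerance, ndc

def classify(pct):
    """Apply the conventional 10% / 30% guideline to a percentage."""
    if pct < 10.0:
        return "acceptable"
    if pct <= 30.0:
        return "marginal"
    return "unacceptable"

# Made-up example values, for illustration only.
sv, tol, ndc = msa_summary(sd_grr=0.5, sd_part=3.0, lsl=0.0, usl=20.0)
print(f"%Study Variation = {sv:.1f}% ({classify(sv)})")
print(f"%Tolerance       = {tol:.1f}% ({classify(tol)})")
print(f"NDC              = {ndc}")
```

Note that every input to these ratios is itself an estimated standard deviation, which is what the rest of this post is about.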
In the eBook I discuss the drawbacks of relying solely on these summary statistics and the fallacies surrounding them. One of them is that the guideline acceptance criteria were derived arbitrarily: there is no empirical evidence to support them, nor any reasoning behind why less than 10% is "acceptable", between 10% and 30% is "marginal" and over 30% is "unacceptable". Applying the same acceptance criteria to both %Study Variation and %Tolerance doesn't make much sense either.
Summary statistics, like these Gage R&R summary stats, are always estimates of a parameter, that is, the true value of the quantity we are trying to estimate.
The problem here is that the estimation error of these Gage R&R summary stats can be significant. And before I illustrate to you how significant it can be, let's talk about why it can be significant. In short, it is because you are estimating standard deviations, not means. If you look at the above figure, you can see that all three summary statistics involve estimates of standard deviations, those of measurement error (total gage R&R) and product variation.
Let me give you an example: you conduct a gage study and then run the numbers for, say, the %Study Variation summary statistic and get a 15% result. Should you believe that the true ratio of the measurement system error to the total variation is 15%? Absolutely not!
So, here's the scoop: estimates of standard deviation are very sensitive to two things:
Fact: the typical sample sizes for a Gage R&R study recommended by the conventional AIAG Gage R&R approach (5-10 parts measured 2-3 times by 2-3 operators) guarantee a decent, if not significant, amount of estimation error.
As far as the repeatability error estimate goes, the above sample size typically gives a fair estimate of the true repeatability error. That's not the case with the product variation component; since typically 5-10 parts are used to estimate the product variation, what ends up happening is that the product variation estimate becomes highly inaccurate. So if you are trying to estimate, say, the %Study Variation summary statistic, you've got a fair estimate of repeatability error divided by a highly inaccurate estimate of product variation - guess what the end result will be: a meaningless number that is not of much utility. I didn't even mention the reproducibility error estimate, and I had a good reason not to; that statistic is seriously flawed, and I will dedicate a separate blog post to discussing just that.
So, why are estimates of standard deviation so sensitive to sample size? In short, because estimates of standard deviation are not normally distributed. They are best modeled with the chi-square distribution (strictly speaking, it is the sample variance, scaled by its degrees of freedom and the true variance, that follows a chi-square distribution), and this sensitivity comes from the shape of the chi-square distribution, which depends on the sample size. The below graph shows the probability density of the chi-square distribution for different sample sizes (degrees of freedom). Degrees of freedom is the number of values in the calculation of a statistic (in this case the standard deviation) that are free to vary. Generally speaking, the number of degrees of freedom equals the sample size minus one, because one degree of freedom is always spent on estimating the mean.
OK, now back to the chart: what you can also see is that the chi-square distribution, generally speaking, has a long upper tail, the extent of which depends on the sample size. The five different shapes correspond to sample sizes of 2, 3, 4, 6 and 11 (that is, degrees of freedom of 1, 2, 3, 5 and 10).
Let me illustrate the following scenario so you better understand how the chi-square distribution works: you are drawing a sample from a normally-distributed population and trying to estimate the standard deviation from it. When you draw a sample of 3 units, the chances of most of those units being close to the population mean are high (based on the 68-95-99.73 rule), which brings the expected standard deviation estimate down, away from the true standard deviation. At a sample size of 6 units, since we have now drawn more units, the chances of at least some of those units being farther away from the population mean are higher (again, per the 68-95-99.73 rule), which brings the expected standard deviation estimate up a little. This is why small sample sizes tend to underestimate the true standard deviation. As you start to increase the sample size, the expected value (the central value of the distribution) slowly converges toward the true standard deviation value (the parameter), and the distribution loses its long tail. Generally speaking, the estimation error of the standard deviation starts to become acceptable at 20 degrees of freedom and over; 30 is ideal, because it provides an economic trade-off between too much estimation error and too large a sample size.
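If you want to see this effect for yourself, here is a minimal sketch using only the Python standard library. It repeatedly draws samples from a normal population and looks at the ratio of the sample standard deviation to the true one. The population values (mean 10, standard deviation 2), the sample sizes, the seed and the function names are assumptions chosen for illustration.

```python
import math
import random

def sample_sd(values):
    """Sample standard deviation with n - 1 in the denominator."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

def sd_ratio_summary(n, sigma=2.0, n_samples=10000, seed=0):
    """Draw many samples of size n and summarize (sample SD / true sigma).

    Returns (mean ratio, 2.5th percentile, 97.5th percentile).
    """
    rng = random.Random(seed)
    ratios = sorted(
        sample_sd([rng.gauss(10.0, sigma) for _ in range(n)]) / sigma
        for _ in range(n_samples)
    )
    mean_ratio = sum(ratios) / n_samples
    return mean_ratio, ratios[int(0.025 * n_samples)], ratios[int(0.975 * n_samples)]

for n in (3, 6, 21, 31):
    mean_ratio, low, high = sd_ratio_summary(n)
    print(f"n={n:2d} (df={n - 1:2d}): mean s/sigma = {mean_ratio:.3f}, "
          f"95% of estimates between {low:.2f} and {high:.2f} times sigma")
```

A mean ratio below 1 means the estimate runs low on average, and the gap between the two percentiles shows how far a single estimate can land from the true value. Both should improve as the sample size grows.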
Now, why did I go through all this? Because I wanted you to get a better understanding of how the chi-square distribution works and how inaccurate the estimates can be.
Let's take a look at what this looks like for our beloved Gage R&R studies! Again, we are estimating repeatability error, reproducibility error (although that really should not be calculated as a standard deviation, but we will talk about that at another time) and product variation. Here's the table for how the number of degrees of freedom is calculated for each component. As a refresher, the degrees of freedom equals the sample size minus one.
I did a little simulation to try and illustrate the estimation error involved in the %Study Variation and %Tolerance summary statistics.
In a typical Gage R&R study, 5-10 parts are selected and measured 2-3 times by 2-3 operators. To cover a few different sample selection scenarios, I created seven groups of product values in a Monte Carlo-type simulation.
The first five groups consisted of randomly generated normally-distributed samples, given a population mean of 10 units and standard deviation of 2 units, to illustrate product variation. I also generated normally-distributed values around each product value to represent repeatability error of 0.4 units. The only difference between the five groups was the number of product samples; 30, 20, 10, 5 and 3 units made up the first five groups.
Groups 6 and 7 consisted of product samples of 3 and 5 units, respectively. The sample selection method was, however, not random but deliberate. Specifically, the product values were evenly distributed across the expected product range for both the 3-unit and the 5-unit sample. For the 3-unit sample, one "low" sample was drawn from the lower end of the expected product range, one "mid" sample from the middle of the expected values, and one "high" sample from the upper end. The same methodology was applied to the 5 deliberately selected units, with units evenly distributed across the expected range. I included this sample selection method in the simulation because some folks like to use it for Gage R&R studies in an attempt to cover the expected range of product variation.
Specification limits were also defined in the simulation, with a lower spec limit of 3 units and an upper spec limit of 15 units.
Now take a look at the below plot showing the distribution of product values for each group. Note groups 6 and 7 with deliberately selected product samples.
I ran the simulation for 3 trials and 2 operators for each of the seven groups assuming no reproducibility error.
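The setup for the five randomly sampled groups can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code behind the plots in this post: the function names (`simulate_study`, `sv_spread`), the ANOVA-style way of separating product variation from repeatability, the seed and the number of repetitions are all chosen for illustration. Instead of confidence intervals, it simply repeats the whole study many times and reports the range that holds the middle 95% of the %Study Variation estimates.

```python
import math
import random

def sample_variance(values):
    """Sample variance with n - 1 in the denominator."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def simulate_study(n_parts, rng, n_ops=2, n_trials=3,
                   mu=10.0, sd_part=2.0, sd_rep=0.4):
    """Simulate one study (no operator effect); return estimated %Study Variation."""
    cell_variances = []  # one variance per part/operator cell
    part_means = []
    for _ in range(n_parts):
        true_value = rng.gauss(mu, sd_part)  # randomly drawn product value
        readings = []
        for _ in range(n_ops):
            cell = [rng.gauss(true_value, sd_rep) for _ in range(n_trials)]
            cell_variances.append(sample_variance(cell))
            readings.extend(cell)
        part_means.append(sum(readings) / len(readings))
    # Repeatability: pooled within-cell variance (all cells have equal size).
    var_rep = sum(cell_variances) / len(cell_variances)
    # Part averages still carry a share of repeatability; subtract it, floor at 0.
    var_part = max(sample_variance(part_means) - var_rep / (n_ops * n_trials), 0.0)
    return 100.0 * math.sqrt(var_rep / (var_rep + var_part))

def sv_spread(n_parts, n_studies=1000, seed=1):
    """Middle 95% range of %Study Variation over repeated simulated studies."""
    rng = random.Random(seed)
    results = sorted(simulate_study(n_parts, rng) for _ in range(n_studies))
    return results[int(0.025 * n_studies)], results[int(0.975 * n_studies)]

true_sv = 100.0 * 0.4 / math.sqrt(2.0**2 + 0.4**2)
print(f"true %Study Variation = {true_sv:.1f}%")
for n_parts in (30, 20, 10, 5, 3):
    low, high = sv_spread(n_parts)
    print(f"{n_parts:2d} parts: 95% of estimates between {low:.1f}% and {high:.1f}%")
```

Since every simulated study draws from the same population, any difference between the ranges comes purely from the number of parts.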
Curious to see how much estimation error there is in the summary statistics for each group? Get ready to be surprised.
But before we get into the fun part, first things first: let's see what the true value of %Study Variation and %Tolerance came out to be:
Based on the calculations, both summary statistics fall into the "marginal" category per the AIAG guideline acceptance criteria.
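As a cross-check, the true values can be worked out directly from the simulation settings (product standard deviation of 2, repeatability error of 0.4, spec limits of 3 and 15). The worked equations below assume the usual definitions, with no reproducibility error and a multiplier of 6 for %Tolerance; with a multiplier of 5.15 instead, %Tolerance would come out to about 17.2%.

```latex
\sigma_{\text{total}} = \sqrt{\sigma_{\text{product}}^2 + \sigma_{\text{repeatability}}^2}
                      = \sqrt{2^2 + 0.4^2} \approx 2.04

\%\text{Study Variation} = 100 \times \frac{\sigma_{\text{repeatability}}}{\sigma_{\text{total}}}
                         = 100 \times \frac{0.4}{2.04} \approx 19.6\%

\%\text{Tolerance} = 100 \times \frac{6 \, \sigma_{\text{repeatability}}}{\text{USL} - \text{LSL}}
                   = 100 \times \frac{6 \times 0.4}{15 - 3} = 20\%
```

Either way, both values sit between 10% and 30%.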
I calculated both summary statistics, %Study Variation and %Tolerance, for each of the seven sample groups. Remember, these are point estimates of the true values. To get an idea of the estimation error of these summary statistics, I needed to calculate the 95% confidence intervals for each component of variation, that is, for the repeatability error and the product variation components. Finally, using the lower and upper bounds for each component, I was able to plot where the point estimate and the bounds sit in relation to the true %Study Variation and %Tolerance values and the guideline acceptance criteria.
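For reference, here is a minimal sketch of the textbook chi-square confidence interval for a single standard deviation, written with the Python standard library only. The chi-square quantile is found by bisection on a series expansion of the CDF; the function names are illustrative, and the estimate of 2.0 units is a hypothetical value. How the component bounds are then combined into bounds for the ratios is not shown here.

```python
import math

def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized lower incomplete gamma."""
    if x <= 0.0:
        return 0.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= half_x / (a + n)
        total += term
    return total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))

def chi2_quantile(p, df):
    """Invert the CDF by bisection."""
    lo, hi = 0.0, df + 10.0 * math.sqrt(2.0 * df) + 50.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def sd_confidence_interval(s, df, level=0.95):
    """Two-sided confidence interval for sigma, given estimate s with df degrees of freedom."""
    alpha = 1.0 - level
    lower = s * math.sqrt(df / chi2_quantile(1.0 - alpha / 2.0, df))
    upper = s * math.sqrt(df / chi2_quantile(alpha / 2.0, df))
    return lower, upper

# Hypothetical product variation estimate of 2.0 units from n parts (df = n - 1).
for n_parts in (30, 20, 10, 5, 3):
    low, high = sd_confidence_interval(2.0, n_parts - 1)
    print(f"{n_parts:2d} parts: 95% CI for sigma = {low:.2f} to {high:.2f}")
```

With 30 parts the interval runs from roughly 0.8 to 1.3 times the estimate; with 3 parts it runs from roughly half the estimate to more than six times the estimate.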
So, here's what I got for the %Study Variation summary statistic for each group.
Let's try to make sense of all this: the chart on the right shows the calculated %Study Variation statistic (the dot in the middle of each line plot) as well as the 95% upper and lower bounds for each of the seven groups. Each line plot is essentially the entire width of the confidence interval for the statistic. It is shocking how wide some of those confidence intervals are, right?
Note that for groups 1 and 2, the confidence intervals are pretty tight compared to the rest of the groups. That is because those groups involve a fairly large sample size (30 and 20) for the product variation estimate. Now look at groups 3, 4 and 5: the width of the intervals is off the charts! They in fact extend over all three categories: acceptable, marginal and unacceptable. Now which is it? The point estimate renders group 3 marginal, group 4 unacceptable, group 5 marginal again, and groups 6 and 7 acceptable. And these are samples from the same population!
The typical sample sizes recommended by the AIAG approach and used by many professionals fall within groups 3 through 7, the ones where the estimation error of the statistic is just incredible.
What is also interesting is that groups 6 and 7 see an estimate significantly lower than that for any of the other groups. Remember, these are the groups where the samples were deliberately selected to be evenly distributed across the expected product range. The problem with this approach is that although the samples span the expected range, they don't represent the distribution of the product values. That is, the 68-95-99.73 rule is violated here - big time. If you don't remember what the 68-95-99.73 rule is, here's a refresher:
So, what do you think happens when the 68-95-99.73 rule is not met? Think of the group where 3 samples were selected: one in the middle, one at the upper end and another one at the lower end of the expected range. Does that sampling scheme meet the 68-95-99.73 rule? I don't think so. And guess what is going to happen? The product variation standard deviation estimate will be inflated, causing the %Study Variation statistic to go down; that is exactly what you see in the above plot with the confidence intervals.
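To put rough numbers on that inflation, here is a small sketch. It assumes the "expected range" runs from the mean minus three to the mean plus three standard deviations (4 to 16 units for the simulated population). That range and the helper names are assumptions made for illustration, and repeatability error in the part values is ignored to keep things simple.

```python
import math
import statistics

MU, SIGMA_PART, SIGMA_REP = 10.0, 2.0, 0.4  # simulation settings from this post

def evenly_spaced(n, low, high):
    """n values spread evenly from low to high, both ends included."""
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]

def pct_study_variation(sd_rep, sd_part):
    """%Study Variation with repeatability as the only measurement error."""
    return 100.0 * sd_rep / math.sqrt(sd_rep**2 + sd_part**2)

# Assumed expected range: mean plus or minus three standard deviations.
low, high = MU - 3 * SIGMA_PART, MU + 3 * SIGMA_PART

print(f"true: sd = {SIGMA_PART:.2f}, "
      f"%SV = {pct_study_variation(SIGMA_REP, SIGMA_PART):.1f}%")
for n in (3, 5):
    parts = evenly_spaced(n, low, high)
    sd_est = statistics.stdev(parts)
    print(f"{n} deliberately selected parts: sd = {sd_est:.2f}, "
          f"%SV = {pct_study_variation(SIGMA_REP, sd_est):.1f}%")
```

Under these assumptions, the product standard deviation estimate comes out at about 6.0 units for 3 parts and about 4.7 units for 5 parts, against a true value of 2. That pushes %Study Variation below 10%, into the "acceptable" category, even though the true value is close to 20%.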
Let's see what we got for the %Tolerance estimates for each of the groups:
As you can see here, the estimation error, generally speaking, is not as bad here as for the %Study Variation statistic. This is because this formula doesn't have the product variation estimate in it, so product variation has absolutely no influence on it. What you see here is essentially the estimation error coming from the repeatability error component.
The %Study Variation summary statistic can have a significant amount of estimation error associated with it. That, coupled with the fact that the guideline criteria are arbitrary, makes it a pretty useless statistic, unless a sufficiently large sample size is used to adequately estimate product variation and the distribution of the product values is adequately represented.
The %Tolerance statistic inherently carries less estimation error. However, if the tolerance doesn't reflect actual product function, the statistic becomes useless.
The only way to adequately represent the distribution of the product values in your sample is to sample randomly.
As I mention in my eBook, don't solely rely on summary statistics; instead, follow a practical approach: first Apply Common Sense, then Visualize Your Data and finally Calculate Metrics. More on a practical approach in the eBook, which can be downloaded here: practicalmsa.com
You all stay safe, and I will be back with another post soon!