This post continues the statistical theme of the last two posts and tries to close it at the same time.
I observe three different levels of dealing with the same problem: looking at a data set of some metric and telling whether our process is in statistical control, which data points represent normal common-cause (chance-cause) variation of the system’s behaviour, and which ones are outliers that have an external (also called special or assignable) cause of variation. The distinction is important. If a data point reflects common-cause variation, the continuous improvement path goes through improving the system as a whole in a way that would move the average of the metric or reduce its variation. If it is an outlier, the path goes through understanding the root cause of the outlier and eliminating it. We need to categorize the results of our measurements, and we want to avoid the categorization errors known as Shewhart Mistakes 1 and 2: treating common-cause variation as if it had a special cause, and treating special-cause variation as if it were common-cause. This problem falls under the second item, Knowledge of Variation, of W. Edwards Deming’s System of Profound Knowledge.
The simplest way to approach this problem is to calculate the average and the standard deviation (commonly denoted by the Greek letter sigma) of our data set and calculate the control limits as the average plus-minus three sigmas. Everything within these limits is then treated as common-cause variation, and everything outside them as an outlier.
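The rule above can be sketched in a few lines of Python. This is a minimal illustration, not a full control-chart implementation: the function names are made up for this example, and it assumes the limits are computed from a baseline data set and then applied to new measurements.

```python
from statistics import mean, pstdev

def control_limits(average, sigma, k=3.0):
    """Return (LCL, UCL) as the average minus/plus k sigmas."""
    return average - k * sigma, average + k * sigma

def limits_from_baseline(baseline, k=3.0):
    """Estimate the average and sigma from baseline data, then build the limits.
    Uses the population standard deviation; statistics.stdev is the sample version."""
    return control_limits(mean(baseline), pstdev(baseline), k)

def outliers(points, lcl, ucl):
    """Return the points that fall outside the control limits."""
    return [x for x in points if x < lcl or x > ucl]

# Hypothetical numbers: average 50, sigma 3, and a handful of new measurements.
lcl, ucl = control_limits(50, 3)
print(lcl, ucl)
print(outliers([48, 52, 60, 45, 40], lcl, ucl))
```

Computing the limits from a separate baseline matters in practice: an extreme point included in a small data set inflates sigma and can hide itself inside the limits.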
Some time ago I began asking about this rule: why is the numeric constant in it equal to 3? Why is it not e=2.71828 or pi=3.14159 or some other important mathematical constant? Why is it not 1.96 (which would represent the 95% confidence interval) or 2.57583 (99%)? If the constant is 3, a normally distributed data point has one chance in 370 of falling outside the limits; if it were e or pi, it would have one chance in 152 and one in 595, respectively. None of these are “magic” numbers with any universal meaning.
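These odds can be checked with the standard library alone. The sketch below assumes a normal distribution and counts both tails, so the chance of landing outside plus-minus k sigmas is erfc(k/sqrt(2)); the function name is made up for this example.

```python
import math

def one_in_n(k):
    """Return N such that a normally distributed point falls outside
    plus-minus k sigmas about once in N observations."""
    two_sided_tail = math.erfc(k / math.sqrt(2))  # P(|Z| > k)
    return 1.0 / two_sided_tail

for label, k in [("1.96", 1.96), ("2.57583", 2.57583),
                 ("e", math.e), ("3", 3.0), ("pi", math.pi)]:
    print(f"k = {label:8s} one chance in {one_in_n(k):.0f}")
```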
The best explanation I could come up with is simplicity. Calculating percentiles for the normal distribution requires a scientific calculator. But if the average is known to be 50 and the sigma is known to be 3, then the entire staff can be trained at a minimal cost to calculate the control limits as 41 and 59 and flag the outliers. Simplicity democratizes process control and helps ensure the simple best practices are followed. It doesn’t matter if an outlier’s chance of occurring is one in 370 or one in 400. As long as we are in that ballpark, we can strike a good balance between attacking the roots of special-cause variation and working on the system.
Looking at data sets of metrics from real-world software and IT projects, experts started to realize the distributions are not normal. Actually, Shewhart was the first to point out that industrial production processes don’t necessarily lead to the Gaussian distribution. In my last two posts I already referred to the insight that project lead times often have the Weibull distribution. Many service processes are well-described by Markov chains, which leads to the Poisson distribution of arrivals and departures and to the exponential distribution of cycle times. Larry Maccherone, who examined many data sets as a researcher and the director of analytics at Rally Software, spoke at LSSC12 about (among other things) how assuming that a distribution is close enough to a bell curve can significantly and dangerously understate project risk.
Working effectively with such realities requires more complicated, expert-analysis-type approaches. We have to match data sets to distributions, instead of pretending it’s always the normal distribution. We have to develop techniques based on mathematical properties of known distributions for calculating control limits and estimating percentiles. For example, the control limits for the exponential distribution are very easy to calculate: the UCL is six times the average (six-over-lambda is this distribution’s equivalent of plus-minus-three sigma), while the LCL is always zero. This is a simple rule, but it takes some math skills to derive it; also, we have to see if our data fits the distribution, and that’s complicated. My previous post suggested how something similar can be done in a general case for the Weibull distribution. Workers, managers and consultants doing this need better knowledge of statistics as well as some skills in algebra and calculus to manipulate complicated formulas with many Greek letters.
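One plausible way to arrive at the six-over-lambda rule is sketched below. It rests on an assumption that is not spelled out above: since the LCL is zero, the entire one-in-370 false-alarm chance of the three-sigma rule (about 0.27%) is placed in the upper tail. For the exponential distribution, the chance of exceeding x is exp(-lambda*x), so the UCL solves exp(-lambda*UCL) = 0.0027. The function name and default value are made up for this example.

```python
import math

def exponential_ucl(average, false_alarm=0.0027):
    """Upper control limit for exponentially distributed data.
    The average equals 1/lambda, and P(X > x) = exp(-x / average),
    so the UCL is the x at which that tail equals false_alarm."""
    return -average * math.log(false_alarm)

# For an average of 1 the multiplier comes out a little under six.
print(round(exponential_ucl(1.0), 2))

# Hypothetical example: cycle times averaging 4 days.
print(round(exponential_ucl(4.0), 1))
```

The multiplier depends only on the chosen false-alarm rate, not on the data, which is why the rule collapses to "about six times the average" once the distribution is known to be exponential.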
Eventually, we find ourselves in a situation where formulas with Greek letters don’t help. The most common distribution in knowledge work is the Unique distribution. We cannot match it to any familiar distribution, so we cannot use any familiar formulas.
If we went by the same percentiles that are considered outliers on the normal distribution, we would have to wait for the collection of a data set sizable enough to contain one data point beyond the 99.8th percentile. There would be at least two problems with that. First, continuous improvement cannot wait this long. Second, if we wait this long, the system is likely to undergo some change that would make the early-arrived data points meaningless.
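A rough sense of how long that wait is can be had from simple arithmetic. The sketch below assumes independent observations and uses the 99.8th percentile from the paragraph above; the function names are made up for this example.

```python
def points_per_outlier(percentile):
    """Average number of observations per one point above the given percentile."""
    return 1.0 / (1.0 - percentile / 100.0)

def chance_of_at_least_one(n, percentile):
    """Probability that n independent observations contain at least one
    point above the given percentile."""
    p = 1.0 - percentile / 100.0
    return 1.0 - (1.0 - p) ** n

print(round(points_per_outlier(99.8)))               # about 500 observations
print(round(chance_of_at_least_one(30, 99.8), 3))    # chance within 30 observations
```

With a few dozen data points, which is often all a team has before the system changes, the chance of having seen even one such point is only a few percent.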
At the end of the day, we don’t just play with numbers; we have to decide how to proceed with the continuous improvement of our system. We have to take our next step by using safety-to-fail, experimentation and abductive thinking, and by accepting that some things in this world are only weakly and retrospectively coherent.