UPDATE: The contents of this post are still valid, but there is a new, complementary post: How to Match to Weibull Distribution without Excel.
Warning: this is a very technical, hands-on post.
It turns out Weibull distribution is quite common among statistical distributions of lead times in software development and IT projects. This insight belongs to Troy Magennis, who is a leading expert on Monte Carlo simulations of projects and examined many data sets from real-world projects. He also has an explanation how the nature of knowledge work and its tendency to expand and get interrupted leads to this distribution. I recommend that you learn more about his work that won him a Brickell Key Award. The insight about Weibull distribution has been confirmed independently by another Brickell Key winner Richard Hensley who has examined many data sets from the IT organization of a large company.
Do Your Lead Times Match Weibull Distribution?
I had to come up with a quick answer to this question yesterday, using only my understanding of statistics and Excel. Weibull distribution is actually a family of distributions, parametrized by the so-called shape parameter (k). You can see in the chart above how changing this parameter can tweak the shape of the distribution curve. Weibull is identical to the exponential distribution when k=1, to Rayleigh distribution when k=2, and interpolates/extrapolates those distributions for other values of the parameter. So we have two questions to answer here:
- Does a given set of numbers match Weibull distribution?
- If it does, what is its shape parameter?
Here is a simple algorithm you can follow to answer these questions for your data set. I’ll attach the spreadsheet at the end of this post.
Step 1. Copy and paste your numbers in to column A of the spreadsheet. I prefer to reserve Row 1 for column headers, so the numbers will start from the cell A2.
Step 2. Sort the numbers in column A in the ascending order.
Step 3. Divide the interval [0; 1] into as many equal intervals as you have data points in your set and populate column B with centers of those intervals. If you have N=100 data points, the Excel formula for this is =(2*ROW(B2)-3)/200 (200 == 2*N). Type this formula into the cell B2 and copy and paste it to fill all cells in Column B.
Step 4. Populate Column C with natural logarithms of numbers in Column A. Type =LN(A2) into the cell C2, copy and paste the formula to the rest of Column C.
Step 5. Populate Column D with the numbers calculated from Column B as follows. Type =LN(-LN(1-B2)) into the cell D2, copy and paste to the rest of Column D. We’re basically linearizing the cumulative distribution function here so that linear regression can reveal the shape parameter.
Step 6. If the set matches Weibull distribution, then the shape parameter is the slope of the straight line through the set of points with the coordinates given by numbers in Columns C and D. Calculate it using this formula: =SLOPE(D2:D101,C2:C101) (This assumes your set contains N=100 points, adjust the formula accordingly). In the attached spreadsheet, this number is placed into the cell G2.
Step 7. Calculate the intercept: =INTERCEPT(D2:101,C2:C101). (Cell G3.)
Step 8. Calculate the scale parameter from the intercept: =EXP(-G3/G2). (Cell G4.)
Step 9. Test how good the fit is by calculating the R-squared: =RSQ(C2:C101,D2:D101). If the match is perfect, this number will be equal to 1.
Step 10. If we have good fit, we can also do a simple sanity check and calculate the mean from the newly obtained shape and scale parameters and see if it is close enough to the actual mean. The formula for the mean is
The standard Excel doesn’t have built-in Gamma-function, but has a built-in function that returns its logarithm. So, we can calculate the predicted mean by =G4*EXP(GAMMALN(1+1/G2)). Now you can compare the predicted mean to the actual mean, which can be obtained, of course, by =AVERAGE(A2:A101).
There is more than one way to do linear regression. The above steps will minimize the vertical distances (in the y-axis direction) between the best fit curve and the data points. It is also possible to do it by minimizing the horizontal distances. The latter method consistently overestimates the shape parameter, which is undesirable for the practical applications of lead time analytics, and can be inaccurate if the same number occurs in the data set multiple times, especially on the left side of the distribution. For these reasons, I don’t recommend using this method and instead recommend the method I originally described in the steps above (linear regression, minimizing vertical distances).
I’m also updating the Weibull Distribution Fit spreadsheet with “smarter” formulas. You don’t need to adjust the above formulas depending on the number of data points in your lead time set.