Using control charts is a great way to find out whether data collected over time has any statistically significant signals, or whether the variation in the data is merely noise. They were invented at the Western Electric Company by Walter Shewhart in the 1920s in the context of industrial quality control. The recent "six sigma" movement has brought this type of chart into prominent use, as legions of "black belts" use them to measure process behavior in an attempt to reduce variation and thereby improve quality.
The theory is that less variation results in fewer defects. The theory definitely holds water in manufacturing, and, more generally, any time a process should be producing the same thing over and over again. As much as that sounds drab, it's what we expect when we order a burger from a fast food chain or go to start up a new car. "Variation" in those contexts would most likely be unwelcome.
In order to reduce variation, process specialists first need to figure out whether there are any sources of "special-cause" variation - or signals - in the historical data. Let's illustrate the point with some fun data from the world of sports: Peyton Manning's career game-by-game passing statistics:
You may be asking why we're looking at football stats after an introduction about manufacturing. In a sense, Peyton Manning "manufactures" passing yards on offense in every game. As a quarterback, his performance can be seen as a process to generate yards of ball movement on the field. Each game results in a certain number of yards thrown. When it comes to passing yards, is Peyton Manning "in control", or are there signals in the data?
Let's first break down the elements of the control chart, and then we'll consider how to make them step by step.
Anatomy of a Control Chart
A control chart contains the following basic elements:
- The time series data itself - yards thrown (y-axis) per game played (x-axis) as a simple line plot.
- The average line - the arithmetic average of yards thrown per game
- The control limits, computed in various ways from the data itself:
- UCL - the Upper Control Limit
- LCL - the Lower Control Limit
- Outliers - data points either above the UCL or below the LCL
- Trends - 6 or more points either all ascending or all descending
- Shifts - 9 or more points either all above or all below the average line
How-to Create a Control Chart in Tableau
Now that we've got the basics covered, let's see how it's done using two different methods - the "quick method" and the "rigorous method". The difference between the two is how the control limits are calculated. The "quick method" uses what's called a "global measure of dispersion", or the standard deviation of all of the points. The "rigorous method" uses a "local measure of dispersion" called sigma(x), which is derived from the differences between successive data points.
1. The "Quick Method"
Step 1: Create a simple timeline with a discrete (blue pill) version of the game number (Rk) on the Columns shelf, and SUM(Pass Yards) field on the Rows shelf, fit to width
Step 2: Right-click on the y-axis, select "Add Reference Line" and add an average line by filling out the resulting dialog box as shown below, clicking OK when you're done:
Step 3: Right-click on the y-axis, select "Add Reference Line" again, and this time add a distribution of +3 and -3 times the standard deviation, with dotted red lines and no fill, as shown below, clicking OK when you're done:
That's it! We now have a simple control chart showing the number of yards thrown by Peyton Manning in every single game of his career:
Other than a few games, Peyton can be expected to throw between 46 and 492 yards per game. That's not really that informative though, and a more rigorous approach can enlighten us further about whether any statistical outliers are included in the data:
2. The "Rigorous Method",
We'll add a few extra elements to this version, including a "moving range" timeline that shows the absolute value of the change from game to game, control limits calculated from this moving range, and data points colored by their type - in range, outliers, trends and shifts.
Step 1: Start with a new sheet, and repeat step 1 of the "quick method" above to create a basic timeline, adding the discrete "Rk" dimension to Columns and a continuous SUM(Pass Yds) to the Rows shelf
Step 2: To begin creating the "Moving Range" timeline underneath the main "Individuals" timeline, drag a second instance of SUM(Pass Yds) to the Rows shelf to the right of the first instance, and change the second instance to a Quick Table Calculation showing the Difference, as shown below:
Step 3: Edit the Table Calculation for the second instance of SUM(Pass Yds) and click "Customize" to open the edit dialog, adding ABS() around the entire equation, as shown below:
Step 4: Add the average lines for both the Individuals chart (top) and the Moving Range chart (bottom) by right-clicking on each of the y-axes and selecting "Add Reference Line", similar to Step 2 in the "quick method" above. Do this for both charts.
Step 5: Create the following three calculated fields, "MR_UCL", "UCL", and "LCL", which are Shewharts original equations to determine the control limits:
Step 6: Add these three new calculated fields to the level of Detail for All Marks, and use them to create three new reference lines - "UCL" and "LCL" in the top line chart and "MR_UCL" in the bottom line chart.
Step 7: Create two new Calculated Fields for the signals, one called "Signals" (to be used in the top chart) and the other called "MR_Signals" (to be used in the bottom chart) by downloading the workbook above and copying the equations, or using these Word documents - this one for the Individuals chart (top) and this one for the Moving Range chart (bottom).
Step 8: Drag these new calculated fields to the appropriate Colors shelf. Note that if you would rather just color the data points leave the lines gray, you'll need to turn each chart into a dual axis chart with lines and shapes.
Also notice that along with a quick filter for the date, we've added two parameters to the worksheet - one to allow the user to change the statistic shown, and the other to allow the user to group the control chart based on team or season.
What can we learn from Manning's passing statistics?
There are a few new insights we can glean from this more rigorous approach:
- Peyton's passing yardage has improved since he joined the Broncos, and his numbers so far in the 2013 season are the best in his career, on average
- Prior to this past week's game, Peyton was on a statistical tear, with the previous 12 games all above his overall career average both in terms of yards thrown and number of completions.
- Using the more rigorous control limit formulas, the low yardage total of last week's game (150 yards) was a statistical outlier, at least for his time with the Broncos.
In closing, control charts have a broad application to anything that can be thought of as a process. Creating control charts in Tableau is a great way to track that process, be alerted to signals in the data, and save time and effort by filtering out the noise. When creating control charts, users can either opt to employ the "quick method" or the "rigorous method" depending on their level of skill and the degree of statistical rigor warranted by the situation.
I hope you found this helpful! Thanks for visiting,