Fitting a Normal Curve to a Histogram

Robin Kennedy
on November 8, 2013

{This is a guest post written by Robin Kennedy of The Information Lab. Robin is a Business Intelligence consultant, Tableau enthusiast, and Viz of the Day winner. His post is the first in a series comprising #TableauStatsMonth, and originally appeared on The Information Lab's website on Nov 4, 2013}

We were recently asked to help a customer use Tableau to draw a best-fit Gaussian curve from his data of suppliers and their scores. Whilst Tableau doesn’t have this sort of statistical analysis built-in, once you get your head round the normal distribution formula, it’s just a matter of configuring a few calculated fields.

First of all, let’s look at our data in it’s raw format. We’re using a simple example here of customers and profit figures. There is 1 row per customer, along with their total profit for all transactions. If your data has more than 1 row per customer (perhaps 1 row per order) then you may want to do some pre-aggregation before bringing the data into Tableau.

1. Histogram
Connect to your data and verify all the rows are present. The first step is to create a histogram from the data. This is one built-in feature in Tableau that can be extremely easy to do – simply click Profit from the data window, then select the Histogram option from the Show Me tab – boom! BUT, we want to be a bit more flexible and also be able to reference the bins in later calculations so we are going to set the histogram up manually.
Rather than using Tableau’s automatic binning feature, create your own calculated field bins by writing a formula to round the profit amount down to the nearest one thousand

INT ( [Profit] / 1000 ) * 1000

We can also introduce a Parameter here to allow a user to easily change the bin size. If you wish, create a Parameter called “Profit Bin Size” and then change the formula to…

INT ( [Profit] / [Profit Bin Size] ) * [Profit Bin Size]

WARNING: If your data contains negatives this next point is really important!
Because the INT formula rounds down towards zero, any negatives are actually rounded up along the number line (still rounded down in terms of absolute magnitude). This creates the situation where your zero bin, which should represent 0 to +1000 or whatever your bin size parameter is set to, actually represents –1,000 to +1,000. We need to correct that and can do it by shifting the negatives down. Change the formula to…

INT ( [Profit] / [Profit Bin Size] ) * [Profit Bin Size] + IF [Profit] [Profit Bin Size] ELSE 0 END

OK now we have our bin calculated field – it has probably appeared in the Measures section of the data window so go ahead and throw it up into the Dimensions section before adding it to Columns in the view and then adding Number of Records to Rows.

2. Calculating the mean and standard deviation
The normal distribution / Gaussian formula requires the mean and standard deviation of profit of our entire customer population.
We can create a formula to work out the mean by writing…

AVG ( [Profit] )

But this formula, when added to the histogram view, will be partitioned by our binning dimension – i.e. we will get the mean for the bin 0 to 1,000, the mean for the bin 1,000 to 2,000 etc. We want the mean for the entire data set so we have to use a table calculation…

TOTAL ( AVG ( [Profit] ) )

…to get the right result. Similarly, for standard deviation, the formula will make use of the TOTAL table calculation

TOTAL ( STDEV ( [Profit] ) )

3. Create the curve formula
The formula for the normal distribution probability density function is as follows (reference: Wikipedia)

Where sigma (s) is the standard deviation, mu (m) is the mean and x is the profit bin as calculated above. Translating this into Tableau, we get a calculated field as follows

(1 / ([St Dev] * SQRT(2*PI()))) * EXP(-((ATTR([Profit Bin]) – [Mean])^2 / (2*[St Dev]^2)))

If you add this new calculated field to your view you should see a nice bell curve shape appear – we’re getting close but to fit the curve to our histogram, there are a couple of steps left.

First of all, we need to scale the function according to our bin width (smaller bins will have fewer number of records) and we do that by multiplying the whole formula by 1,000 (or your bin size parameter).

Secondl,y we want to convert the density function into a frequency function so that the numbers match up to our histogram – this is done by multiplying by the total number of records.

The final formula then is:

(1 / ([St Dev] * SQRT(2*PI()))) * EXP(-((ATTR([Profit Bin]) – [Mean])^2 / (2*[St Dev]^2))) * [Profit Bin Size] * TOTAL (SUM([Number of Records]))

Add this to your view as a dual axis, synchronize and change the mark type to a line and you should end up with the below


Nice post. Your calculations are spot on, but there is a bug that can cause things to display oddly when there are empty bins. This shows up as a small wiggle when the bin size is set at 500, but things get pretty bad with small bins (say 50-100).

The problem: since the calculated bin centers are discrete, any bin without data is skipped in the display. This can lead to a normal distribution which doesn’t look very bell shaped.

A solution? Make Profit Bin continuous. This leads to a new problem where the column width is quirky when bin size is changed.

A better solution: Create bins on profit (sized by the same profit bin size parameter) and use that on the columns shelf instead of the calculated bins. Then select “show missing values” to display the bins where there is no data. There will, however, be gaps in the normal distribution at these bins because there is no data there for Tableau to hang these calculations on. I solved this, but it probably isn’t for the faint of heart. I expect most who would ignore this warning could figure this solution out on their own, but there are two tricks that are worth mentioning (one of which hung me up for a good half hour wondering what was wrong).

* First, complete profit bin calculation by using a previous_value(0) to grab the calculated value from Profit Bin where it exists and increment by the profit bin size parameter.
* Second, Pi is a calculation, so no data means there is no pi. There is data elsewhere though, so once your mind is done exploding, table calculations can come to the rescue.

Learn About Tableau

Nice approach, Noah! I had been wondering whether the continuous or discrete option was better for the display (both have disadvantages as you have seen) but your solution solves this brilliantly, great work!

Here are some other tweaks to this view:

This has a high resolution for the line and a low resolution for the bar marks. This is done with a single dimension pill one the view. A calculated field to bin to the low resolution, and a Bin data type field to bin to the high resolution. Then data densification, to fill in the marks at the high resolution. I use different formulas than Noah, but the same effective result. Using a continuous axis enables reference lines, and then the parameter on the size shelf sized the bar marks so it looks decent for all selections.

Tableau cant sync axes when the fields are of a different type. In this case, number of records is integer and curve is float. The easiest fix would be to edit the number of records field (which is the simplest calculation ever) and add a decimal point after the 1. In more complex situations you can always multiply by 1.0 for this conversion or explicitly wrap in float().