Visualizing distributions
Often, simply understanding totals, sums, and even the breakdown of part-to-whole gives only a piece of the overall picture. Most of the time, you'll want to understand where individual items fall within a distribution of all similar items.
You might find yourself asking questions such as the following:
- How much does each customer spend at our stores and how does that compare to all other customers?
- How long do most of our patients stay in the hospital? Which patients fall outside the normal range?
- What's the average life expectancy for components in a machine and which last longer than average? Are there any components with extremely long or extremely short lives?
- How far above or below passing were students' test scores?
These questions all have similarities. In each case, you seek an understanding of how individuals (customers, patients, components, students) relate to the group. In each case, you most likely have a relatively high number of individuals. In data terms, you have a dimension (customer, patient, component, and student) representing a relatively large population of individuals and some measure (amount spent, length of stay, life expectancy, test score) you'd like to compare. Using one or more of the following visualizations might be a good way to do this.
Circle charts
Circle charts are one way to visualize a distribution. Consider the following view, which shows how each doctor compares to other doctors within the same type of department in terms of the average days their patients stay in the hospital:
Figure 3.42: A circle chart showing the average length of stay for each doctor within each department type
Here you can see which doctors have patients who stay in the hospital longer or shorter on average. It is also interesting to note that certain types of departments have longer average lengths of stay than others. This makes sense as each type of department has patients with different needs. It's probably not surprising that patients in Intensive Care tend to stay longer. Certain departments may have different goals or requirements. Being able to evaluate doctors within their type of department makes comparisons far more meaningful.
To create the preceding circle chart, place the fields on the shelves as shown and then change the mark type from Automatic (which was a bar mark) to Circle. Department Type defines the rows, and each circle is drawn at the level of Doctor, which is on Detail on the Marks card. Finally, to add the average lines, switch to the Analytics tab of the left pane and drag Average Line to the view, dropping it on the Cell option:
Figure 3.43: You can add reference lines and more by dragging from the Analytics tab to the view
You may also click one of the resulting average lines and select Edit to find fine-tuning options, such as labeling.
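If it helps to see the arithmetic behind the view, the following is a minimal Python sketch, not Tableau's implementation. It assumes hypothetical visit records and assumes the average line is the mean of the circles (the per-doctor averages) in each department row:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical visit records: (department type, doctor, days in hospital).
visits = [
    ("Intensive Care", "Dr. A", 9),
    ("Intensive Care", "Dr. A", 11),
    ("Intensive Care", "Dr. B", 6),
    ("General", "Dr. C", 2),
    ("General", "Dr. C", 4),
    ("General", "Dr. D", 5),
]


def doctor_averages(records):
    """Average stay per (department type, doctor): one circle per key."""
    stays = defaultdict(list)
    for department, doctor, days in records:
        stays[(department, doctor)].append(days)
    return {key: mean(days) for key, days in stays.items()}


def cell_average_lines(circles):
    """Mean of the circles in each department row (assumed line logic)."""
    by_department = defaultdict(list)
    for (department, _doctor), avg_stay in circles.items():
        by_department[department].append(avg_stay)
    return {dept: mean(values) for dept, values in by_department.items()}


circles = doctor_averages(visits)
lines = cell_average_lines(circles)
for department, line in lines.items():
    above = sorted(doc for (dept, doc), avg in circles.items()
                   if dept == department and avg > line)
    print(department, "average line:", line, "above average:", above)
```

Comparing each doctor only against the line for the same department type mirrors the point made earlier: a doctor in Intensive Care is judged against other Intensive Care doctors, not against the whole hospital.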
Jittering
When using views like circle plots or other similar visualization types, you'll often see that marks overlap, which can lead to obscuring part of the true story. Do you know for certain, just by looking, how many doctors there are in Intensive Care who are above average? How many are below? Or could there be two or more circles exactly overlapping? One way of minimizing this is to click the Color shelf and add some transparency and a border to each circle. Another approach is a technique called jittering.
Jittering is a common technique in data visualization that involves adding a bit of intentional noise to a visualization to avoid overlap without harming the integrity of what is communicated. Alan Eldridge and Steve Wexler are among those who pioneered techniques for jittering in Tableau.
Various jittering techniques, such as using the INDEX() or RANDOM() functions, can be found by searching for jittering on the Tableau forums or by searching for Tableau jittering using a search engine.
Here is one approach that uses the INDEX() function, computed along Doctor, as a continuous field on Rows. Since INDEX() is continuous (green), it defines an axis and causes the circles to spread out vertically. Now, you can more clearly see each individual mark and have higher confidence that the overlap is not obscuring the true picture of the data:
Figure 3.44: Here INDEX() has been added as a continuous field on Rows (the table calculation is computed along Doctor)
In the preceding view, the vertical axis that was created by the INDEX() field is hidden. You can hide an axis or header by using the drop-down menu of the field defining the axis or header and unchecking Show Header. Alternatively, you can right-click any axis or header in the view and select the same option.
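The idea behind index-based jittering can be sketched outside Tableau. The following Python example is only an illustration with hypothetical data: it numbers the doctors within each department type (roughly what an index computed along Doctor does) and uses that number as a vertical position, so that two doctors with the same average no longer land on the same point:

```python
from itertools import groupby

# Hypothetical marks: (department type, doctor, average length of stay).
marks = [
    ("Intensive Care", "Dr. A", 7.0),
    ("Intensive Care", "Dr. B", 7.0),  # same value as Dr. A: circles overlap
    ("Intensive Care", "Dr. C", 9.5),
    ("General", "Dr. D", 3.0),
    ("General", "Dr. E", 3.0),         # same value as Dr. D
]


def add_index_jitter(rows):
    """Append a 1-based index that restarts for each department type."""
    ordered = sorted(rows, key=lambda row: (row[0], row[1]))
    jittered = []
    for _dept, group in groupby(ordered, key=lambda row: row[0]):
        for index, (dept, doctor, avg_stay) in enumerate(group, start=1):
            jittered.append((dept, doctor, avg_stay, index))
    return jittered


jittered = add_index_jitter(marks)
for dept, doctor, avg_stay, index in jittered:
    # avg_stay is the meaningful axis; index only separates the marks.
    print(dept, doctor, avg_stay, index)
```

The index carries no analytical meaning, which is why hiding its axis is reasonable: it exists only to pull overlapping marks apart.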
You can use jittering techniques on many kinds of visualizations that involve plotting fixed points that could theoretically overlap, such as dot plots and scatterplots. Next, we will move on to another useful distribution visualization technique: box and whisker plots.
Box and whisker plots
Box and whisker plots (sometimes just called box plots) add additional statistical context to distributions. To understand a box and whisker plot, consider the following diagram:
Figure 3.45: Explanation of box and whisker plot
Here, the box plot has been added to a circle graph. The box is divided by the median, meaning that half of the values are above, and half are below. The box spans from the lower quartile to the upper quartile, so each half of the box contains a quarter of the values. The span of the box makes up what is known as the Interquartile Range (IQR). The whiskers extend up to 1.5 times the IQR beyond each edge of the box, stopping at the most extreme value within that range (alternatively, whiskers can be set to extend to the maximum extent of the data). Any marks beyond the whiskers are outliers.
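To make these definitions concrete, here is a minimal Python sketch that computes the pieces of a box and whisker plot for a hypothetical list of lengths of stay. Quartile conventions differ between tools, so the quartiles from Python's statistics.quantiles may not match the values Tableau draws; treat this as an illustration of the 1.5 times IQR rule rather than a reproduction of Tableau's calculation:

```python
from statistics import median, quantiles


def box_plot_summary(values, whisker_factor=1.5):
    """Return the median, quartiles, IQR, whisker ends, and outliers."""
    data = sorted(values)
    q1, _q2, q3 = quantiles(data, n=4)  # lower quartile, median, upper quartile
    iqr = q3 - q1
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    # Whiskers stop at the most extreme values that are still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "median": median(data),
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical average lengths of stay (days) for one department type.
stays = [2, 3, 4, 5, 6, 7, 8, 30]
summary = box_plot_summary(stays)
print(summary)
```

In this sample, the value 30 falls beyond the upper whisker and would be drawn as an outlier, while the whiskers end at the smallest and largest values that remain within range.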
To add box and whisker plots, use the Analytics tab on the left sidebar and drag Box Plot to the view. Doing this to the circle chart we considered in Figure 3.42 yields the following chart:
Figure 3.46: A box plot applied to the previous circle chart
The box plots help us to see and compare the medians, the ranges of data, the concentration of values, and any outliers. You may edit box plots by clicking or right-clicking the box or whisker and selecting Edit. This will reveal multiple options, including how whiskers should be drawn, whether only outliers should be displayed, and other formatting possibilities.
Histograms
Another possibility for showing distributions is to use a histogram. A histogram looks similar to a bar chart, but the bars show a count of occurrences of a value. For example, standardized test auditors looking for evidence of grade tampering might construct a histogram of student test scores. Typically, a distribution might look like the following example (not included in the workbook):
Figure 3.47: A histogram of test scores
The test scores are shown on the x-axis and the height of each bar shows the number of students who earned that particular score. A typical distribution often has a recognizable bell curve. In this case, some students are doing poorly and some are doing extremely well, but most have scores somewhere in the middle.
What if auditors saw something like this?
Figure 3.48: A histogram that does not have a typical bell curve, raising some questions
Something is clearly wrong. Perhaps graders have bumped up students who were just shy of passing to barely passing. It's also possible this may indicate bias in subjective grading instead of blatant tampering. We shouldn't jump to conclusions, but the pattern is not normal and requires investigation. Histograms are very useful in catching anomalies like this.
Now that we've seen an example of histograms, let's shift our focus back to the hospital data and work through an example. What if you want to visualize the time it takes to begin patient treatment so that you can observe the patterns for different patient populations? You might start with a blank view and follow steps like these:
- Click to select the Minutes to Service field under Measures in the data pane.
- Expand Show Me if necessary and select the histogram.
Upon selecting the histogram, Tableau builds the chart by creating a new dimension, Minutes to Service (bin), which is used in the view, along with a COUNT of Minutes to Service to render the view:
Figure 3.49: A histogram showing the distribution of patients according to minutes to service
Bins are ranges of measure values that can be used as dimensions to slice the data. You can think of bins as buckets. For example, you might look at test scores by 0-5%, 5-10%, and so on, or people's ages by 0-10, 10-20, and so on. You can set the size, or range, of the bin when it is created and edit it at any point. Tableau will also suggest a size for the bin based on an algorithm that looks at the values that are present in the data. Tableau will use uniform bin sizes for all bins.
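The bucket idea can be sketched in a few lines of Python. This is an illustration under assumptions, not Tableau's algorithm: the values are hypothetical, the bin size of 2 is just an example, and each bin is labeled by its lower bound and holds values from that bound up to (but not including) the next one:

```python
import math
from collections import Counter


def bin_lower_bound(value, bin_size):
    """Lower edge of the uniform bin that contains value."""
    return math.floor(value / bin_size) * bin_size


def histogram(values, bin_size):
    """Count the non-null values that fall in each bin, ordered by bin."""
    counts = Counter(
        bin_lower_bound(v, bin_size) for v in values if v is not None
    )
    return dict(sorted(counts.items()))


# Hypothetical minutes-to-service values; None stands for a null.
minutes_to_service = [3.5, 7, 18, 19, 19.5, 21, 40, None]
counts = histogram(minutes_to_service, bin_size=2)
print(counts)
```

Because every bin has the same width, changing bin_size is the only adjustment needed to make the histogram coarser or finer.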
For this view, Tableau automatically set the bin size to 3.47 minutes, which is not very intuitive. Experiment with different values by right-clicking or using the drop-down on the Minutes to Service (bin) field in the data pane and selecting Edit. The resulting window gives some information and allows you to adjust the size of the bins:
Figure 3.50: Options for editing a bin
Here, for example, is the same histogram with each bin sized to 2 minutes:
Figure 3.51: A histogram with a bin size of 2
You can see the curve, which peaks at just under 20 minutes and then tapers off, with a few patients having to wait as long as 40 minutes. You might pursue additional analysis, such as seeing how wait times vary for the majority of patients based on their risk profile, as in this view:
Figure 3.52: Patient risk profile creates two rows of histograms, showing that most high-risk patients receive faster care (as we would hope)
You can create new bins on your own by right-clicking a numeric field and selecting Create | Bins. You may edit the size of bins by selecting the Edit option for the bin field.
You'll also want to consider what you want to count for each bin and place that on Rows. When you used Show Me, Tableau placed the COUNT of Minutes to Service on Rows, which is just a count of every record where the value was not null. In this case, that's equivalent to a count of patient visits because the data set contains one record per visit. However, if you wanted to count the number of unique patients, you might consider replacing the field in the view with COUNTD([Patient ID]).
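The difference between the two counts is easy to see with a small Python sketch. The records and field names here are hypothetical; the two helper functions only mimic the behavior described above (counting non-null values versus counting distinct values):

```python
# Hypothetical visit records, one per visit; None stands for a null value.
visits = [
    {"patient_id": "P1", "minutes_to_service": 12},
    {"patient_id": "P1", "minutes_to_service": 18},
    {"patient_id": "P2", "minutes_to_service": 20},
    {"patient_id": "P2", "minutes_to_service": 25},
    {"patient_id": "P3", "minutes_to_service": None},
]


def count_non_null(records, field):
    """Count records where the field has a value (like a plain count)."""
    return sum(1 for record in records if record[field] is not None)


def count_distinct(records, field):
    """Count unique non-null values of the field (like a distinct count)."""
    return len({record[field] for record in records
                if record[field] is not None})


visit_count = count_non_null(visits, "minutes_to_service")
patient_count = count_distinct(visits, "patient_id")
print("visits with a recorded time:", visit_count)
print("unique patients:", patient_count)
```

Here two patients have two visits each, so the visit count is higher than the patient count, and the visit with a null time is left out of the first count entirely.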
As with dates, when the bin field in the view is discrete, the drop-down menu includes an option for Show Missing Values. If you use a discrete bin field, you may wish to use this option to avoid distorting the visualization and to identify what values don't occur in the data.
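To see why missing bins matter, consider this small Python sketch. It assumes hypothetical bin counts keyed by integer lower bounds and fills in every bin between the first and last observed bin; representing an empty bin as a count of 0 is a choice made for this sketch:

```python
def show_missing_bins(counts, bin_size):
    """Return counts for every bin from the lowest to the highest observed."""
    if not counts:
        return {}
    filled = {}
    current = min(counts)
    last = max(counts)
    while current <= last:
        filled[current] = counts.get(current, 0)  # 0 marks an empty bin
        current += bin_size
    return filled


# Hypothetical observed bins (lower bound -> count) with a bin size of 2.
observed = {2: 1, 6: 1, 10: 3}
filled = show_missing_bins(observed, bin_size=2)
print(filled)
```

Without the filled-in bins, the bars for 2, 6, and 10 would sit side by side as if they were adjacent ranges, hiding the fact that no values occurred in between.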
We've seen how to visualize distributions with circle plots, histograms, and box plots. Let's turn our attention to using multiple axes to compare different measures.