Descriptive Statistics and Graphical Analysis

One of the best methods for quickly assessing trends in data is to view data in a graphical representation. The distribution of data is described statistically using two characteristics, both of which are necessary in evaluating data. Central tendency indicates the middle of the data distribution, whether that refers to the mean, mode, or median of the distribution. The dispersion of the distribution of data describes the amount of spread or variation in the data.

Perhaps the most common measure of central tendency is the mean. The mean of a distribution is the average value of all observations; the population parameter of mean is symbolized by μ and the sample statistic is symbolized by x̄. Imagine a number line as a plank with the data points representing blocks. The mean serves as the fulcrum or balancing point for the distribution. The mean is easy to understand and is a well-established measure of central tendency. Thus, it is tempting to use it for every data set. However, the mean is sensitive to outliers (extreme data points) and is distorted by non-symmetrical distributions of data. For example, there is a town in America with an average income of millions of dollars, but no one is rushing to move to Bentonville, AR, a relatively low-income town with a few billionaires! The mean is meaningless unless it is associated with some measure of dispersion, such as variance or standard deviation, and often the value calculated as the mean of a distribution is not held by any one data point in the distribution. For example, a statistician who is obsessed with means might believe that if one puts one's head in the icebox and feet in the oven, on the average the body should feel comfortable!
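The balancing-point idea and the sensitivity to outliers can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the income figures are invented for illustration and do not describe any real town.

```python
from statistics import mean

# Hypothetical annual incomes in dollars; invented for illustration only.
typical_incomes = [28_000, 31_000, 35_000, 40_000, 46_000]
with_billionaire = typical_incomes + [1_000_000_000]

mean_typical = mean(typical_incomes)    # 36,000: a value no one in the list earns
mean_skewed = mean(with_billionaire)    # dragged far above every typical income

# Balance-point property: deviations from the mean sum to zero.
deviations = [x - mean_typical for x in typical_incomes]

print(f"mean without outlier: {mean_typical:,.0f}")
print(f"mean with outlier:    {mean_skewed:,.0f}")
print(f"sum of deviations:    {sum(deviations)}")
```

A single extreme value moves the mean above every one of the five typical incomes, which is why the mean alone says little about a lopsided distribution.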

The median is the middle observation of a data distribution. It is the value in the distribution at which point half of the data have a greater value and half of the data have a lesser value. The median is largely unaffected by outliers and is applicable in almost any distribution. Unlike the mean, the median is usually a value held by a point in the distribution; the exception is a data set with an even number of observations, where the median is taken as the average of the two middle values.
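As a small illustration, the sketch below computes the median of made-up samples with `statistics.median`. With an even number of observations the conventional median is the average of the two middle values, so it need not appear in the data.

```python
from statistics import median

# Made-up samples for illustration.
odd_sample = [3, 7, 8, 12, 15]           # five values: the middle one is 8
even_sample = [3, 7, 8, 12, 15, 20]      # six values: average of 8 and 12
odd_with_outlier = [3, 7, 8, 12, 1500]   # largest value replaced by an extreme one

print(median(odd_sample))         # 8
print(median(even_sample))        # 10.0
print(median(odd_with_outlier))   # 8, unchanged by the outlier
```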

The mode of a distribution is the most frequently occurring value in the data. The mode is used for categorical data or for data that take a limited number of values. The mode can be visualized as the “hump” in the data distribution. Unfortunately, some data sets have no single mode, and few analyses can be performed using the mode, making it the least useful of the central tendency statistics.
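A brief sketch of finding the mode, assuming Python 3.8 or later for `statistics.multimode`, which returns every value tied for most frequent. The data are hypothetical.

```python
from statistics import multimode

# Hypothetical data for illustration.
shirt_sizes = ["S", "M", "M", "L", "M", "XL", "L"]   # categorical values
all_distinct = [4.1, 5.7, 6.3, 7.9]                  # every value occurs once

print(multimode(shirt_sizes))    # ['M']
print(multimode(all_distinct))   # all four values tie, so no useful mode
```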

A question often arises as to which measure of central tendency to use. There is no single answer. The mode is useful for determining where most cases fall and can easily be ascertained by inspection. The median is always safe and usually useful, especially when we have a rather skewed or lopsided distribution. The mean is the most sophisticated measure of central tendency and is most useful for further computations.

Take a hypothetical example of customer bank accounts. Customers who walk into the bank have account sizes ranging from under $100 to over $100,000, but most accounts hold less than $1,000. Using the mean in this case would produce a result grossly distorted by the few wealthy customers of the bank, as would the midrange (the average of the smallest and largest values). Because the distribution of the data is lopsided, the median would be best for telling us the central monetary value of bank accounts and the mode would tell us what a typical account size is. However, if we were trying to compute the total account size of the 300 lunchtime customers, the mean would be most useful for accuracy in further computations.
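The bank example can be sketched with a handful of invented balances (ten accounts rather than 300, purely to keep the listing short). The midrange is computed here as the average of the smallest and largest values.

```python
from statistics import mean, median, mode

# Hypothetical balances in dollars: most under $1,000, a few very large.
balances = [80, 250, 400, 400, 400, 650, 900, 950, 45_000, 150_000]

avg = mean(balances)
mid = median(balances)
typical = mode(balances)
midrange = (min(balances) + max(balances)) / 2

print(f"mean:     {avg:,.0f}")       # pulled up by the two large accounts
print(f"midrange: {midrange:,.0f}")  # determined entirely by the two extremes
print(f"median:   {mid:,.0f}")
print(f"mode:     {typical:,.0f}")

# The mean is the one statistic that recovers the total exactly.
print(f"mean x n: {avg * len(balances):,.0f}  total: {sum(balances):,.0f}")
```

The median and mode stay within the range of the small accounts, while the mean multiplied by the number of customers reproduces the total deposits.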

Given a measure of central tendency, say the mean, it is important to know whether the distribution of data hovers around the mean or whether the data are broadly distributed. Several measures of dispersion describe how the data are spread in a distribution.

The range is the difference between the highest value and the lowest value in the data set. Although the range is a simple computation, it is not a good tool to use for large data sets. The larger the sample, the less likely the range is to change with each new observation, whereas in a small sample the range is very sensitive to every added value. Picture a sample of private phone bills. If the first two bills sampled were $12.00 and $160.00, the range could increase only if a bill above $160.00 or below $12.00 were sampled. Assume the next two bills in the sample are $5.00 and $240.00. The range has just grown immensely and will probably never increase again, regardless of how many more bills are sampled. Generally, if the data are normally distributed such that the graphical representation of the data is a bell-shaped curve, the range is useful for samples containing ten or fewer elements and only when the largest and smallest values in the data set are known. The range is also very sensitive to extreme outliers.
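The phone bill example can be traced step by step. In this sketch the first four bills are those given above; the remaining values are hypothetical and were added only to show the range settling down.

```python
# Phone bills in the order sampled. The first four come from the text;
# the last four are hypothetical additions for illustration.
bills = [12.00, 160.00, 5.00, 240.00, 48.50, 73.25, 110.00, 31.75]

# Range of the first n bills, for n = 2 up to the full sample.
running_range = []
for n in range(2, len(bills) + 1):
    seen = bills[:n]
    running_range.append(max(seen) - min(seen))

print(running_range)
# [148.0, 155.0, 235.0, 235.0, 235.0, 235.0, 235.0]
```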

Variance is also a measure of the spread of data. The sample statistic of variance (s²) describes the spread of data in a sample, whereas the parametric measure of variance (σ²) describes the spread or variation among all members of a population. Variance describes the average squared distance of data points away from the mean. Therefore, a measure of variance must accompany a measure of mean. Like the mean, variance is sensitive to outliers. Because the deviations are squared in the computation, the units of variance are squared as well. It is awkward to describe the mean height of a group of people in feet alongside a variance of height in feet². So, the square root of variance is computed to allow discussions of the spread of data in units that are more easily understood. The square root of variance is the standard deviation; the sample statistic of standard deviation is symbolized by s and the population parameter is symbolized by σ.
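The computation can be sketched by hand and compared against Python's `statistics` module. The heights are hypothetical, and the sample variance uses the conventional n - 1 divisor, which the text does not spell out.

```python
from math import isclose, sqrt
from statistics import mean, pvariance, stdev, variance

heights_ft = [5.0, 5.5, 6.0, 6.5]   # hypothetical heights in feet

x_bar = mean(heights_ft)
squared_devs = [(x - x_bar) ** 2 for x in heights_ft]   # units: ft^2

s2_by_hand = sum(squared_devs) / (len(heights_ft) - 1)   # sample variance
sigma2_by_hand = sum(squared_devs) / len(heights_ft)     # population variance
s_by_hand = sqrt(s2_by_hand)                             # back in feet

# The library functions should agree with the hand computation.
print(isclose(s2_by_hand, variance(heights_ft)))
print(isclose(sigma2_by_hand, pvariance(heights_ft)))
print(isclose(s_by_hand, stdev(heights_ft)))
```

Taking the square root returns the measure of spread to feet, the same units as the mean.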

A third measure of dispersion is the interquartile range, which is the difference between the upper quartile and the lower quartile. The range is distorted by extreme outliers and does not truly measure spread across all data. The standard deviation does measure spread across all data, but extreme data points easily distort it as well. By essentially ignoring the largest 25% and the smallest 25% of the data, we can measure the spread of the middle 50%, thereby eliminating the extreme outlying data points. The interquartile range is appropriate for almost any distribution of any size. However, because it does not use all data in the distribution, the interquartile range cannot be used in other computations.
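A minimal sketch of the interquartile range, assuming Python 3.8 or later for `statistics.quantiles`. Quartile conventions differ between tools; this uses the function's default "exclusive" method, and the data and the helper name `iqr` are made up for illustration.

```python
from statistics import quantiles

def iqr(data):
    """Interquartile range: upper quartile minus lower quartile."""
    q1, _, q3 = quantiles(data, n=4)   # three cut points: Q1, median, Q3
    return q3 - q1

# Made-up data; the second list replaces the largest value with an outlier.
clean = [2, 4, 4, 5, 6, 7, 8, 9, 10, 11, 12]
with_outlier = clean[:-1] + [120]

for label, data in (("clean", clean), ("with outlier", with_outlier)):
    print(f"{label}: range = {max(data) - min(data)}, IQR = {iqr(data)}")
```

The outlier inflates the range from 10 to 118, while the interquartile range stays at 6.0 in both cases.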

Central tendency without dispersion is meaningless in descriptive statistics. For example, you may have the choice of two kinds of jobs: one with an average salary of $60k per year and one with an average salary of $50k per year. The lower paying job might be better. It could be that the $60k job has some high salaries and some low salaries, while the $50k job pays everyone the same $50k. Now, which is more likely to offer you higher pay?

As an example of how important it is to look beyond central tendency and consider dispersion, suppose someone required a vendor to produce widgets between 99 mm and 101 mm in size. If Company A’s widgets average 100 mm and Company B’s average 98.9 mm, which company should get the contract? Before deciding hastily, look beyond the average. While the average may lead one to believe that Company A is the better supplier, a further look at the dispersion reveals that Company B is in fact the better choice. Company A makes no widgets in the acceptable range, while 90% of Company B’s are acceptable.

	Company A	Company B
Breakdown	50% at 95 mm, 50% at 105 mm	90% at 99 mm, 10% at 98 mm
Average	100 mm	98.9 mm
Range	10 mm	1 mm
Standard Deviation	5.3 mm	0.3 mm
% acceptable	None	90%
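The figures in the table can be reproduced with a short sketch. It assumes a sample of ten widgets per company split according to the stated percentages; the sample size is an assumption, chosen because the sample standard deviation for ten widgets rounds to the values shown.

```python
from statistics import mean, stdev

# Assumed samples of ten widgets each (sizes in mm), matching the breakdown row.
company_a = [95] * 5 + [105] * 5
company_b = [99] * 9 + [98]

def fraction_acceptable(sizes, low=99, high=101):
    """Share of widgets whose size lies within [low, high] mm."""
    return sum(low <= s <= high for s in sizes) / len(sizes)

for name, sizes in (("A", company_a), ("B", company_b)):
    print(
        f"Company {name}: mean = {mean(sizes):.1f} mm, "
        f"range = {max(sizes) - min(sizes)} mm, "
        f"std dev = {stdev(sizes):.1f} mm, "
        f"acceptable = {fraction_acceptable(sizes):.0%}"
    )
```

Company A's mean sits at the center of the specification even though none of its widgets fall inside it.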

Never use a measure of central tendency without a measure of dispersion – one is meaningless without the other. The variance or standard deviation should accompany the mean, and the interquartile range should accompany the median. Though both statistics may be known about a distribution, one should not jump to conclusions on the basis of central tendency or dispersion separately! One should always visualize the data, by considering the central tendency and the shape of the distribution of the data around that tendency, before making any conclusions.

As described earlier, outliers are data points that fall in the extreme tails of a distribution. They stand alone relative to other values in a data set. Outlying data points may be real, but often they result from incorrectly recorded data or from data that were collected from different populations, each having a different distribution. Outliers can distort the mean and severely distort the range and variance or standard deviation in a data set. Always be wary of outliers when collecting data.