
Descriptive Statistics and Graphical Analysis

· Organize data with frequency distributions
· Differentiate tools for displaying data graphically
· Recognize the measures of central tendency
· Contrast and compare measures of dispersion

Introduction

One of the best methods for quickly assessing trends in data is to view the data graphically. The distribution of data is described statistically using two characteristics, central tendency and dispersion, both of which are necessary in evaluating data. Central tendency indicates the middle of the data distribution, whether that refers to the mean, mode, or median of the distribution. The dispersion of the distribution describes the amount of spread or variation in the data.

Measures of Central Tendency

Perhaps the most common measure of central tendency is the mean. The mean of a distribution is the average value of all observations; the population mean (a parameter) is symbolized by μ and the sample mean (a statistic) is symbolized by x̄. Imagine a number line as a plank with the data points representing blocks. The mean serves as the fulcrum or balancing point for the distribution.

The mean is easy to understand and is a well-established measure of central tendency, so it is tempting to use it for every data set. However, the mean is sensitive to outliers (extreme data points) and is distorted by nonsymmetrical distributions of data. For example, there is a town in America with an average income of millions of dollars, but no one is rushing to move to Bentonville, AR, a relatively low-income town with a few billionaires! The mean is meaningless unless it is associated with some measure of dispersion, such as variance or standard deviation, and often the value calculated as the mean of a distribution is not held by any one data point in the distribution.
For example, a statistician who is obsessed with means might believe that if you put your head in the icebox and your feet in the oven, your body should, on average, feel comfortable!

The median is the middle observation of a data distribution. It is the value at which half of the data have a greater value and half of the data have a lesser value. The median is not affected by outliers and is applicable in almost any distribution. Unlike the mean, the median is usually a value actually held by a point in the distribution (with an even number of observations, it is conventionally taken as the average of the two middle values).
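
The contrast between the mean and the median described above can be sketched in a few lines of Python. The income figures below are invented purely for illustration (they are not data for any real town), and the variable names are assumptions of this sketch.

```python
from statistics import mean, median

# Hypothetical annual incomes (dollars) for a small town; values are made up.
incomes = [28_000, 31_000, 33_000, 35_000, 36_000, 38_000, 40_000]

# The same town after one billionaire moves in (an extreme outlier).
incomes_with_outlier = incomes + [1_000_000_000]

mean_before, median_before = mean(incomes), median(incomes)
mean_after, median_after = mean(incomes_with_outlier), median(incomes_with_outlier)

print(f"without outlier: mean={mean_before:,.0f}  median={median_before:,.0f}")
print(f"with outlier:    mean={mean_after:,.0f}  median={median_after:,.0f}")
```

With these made-up figures, a single extreme value pulls the mean from roughly $34,000 to over $125 million, while the median only moves from $35,000 to $35,500.
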
The mode of a distribution is the most frequently occurring value in the data. The mode is used for categorical data or for data that take only a limited number of values. It can be visualized as the “hump” in the data distribution. Unfortunately, some data sets have no mode, and few analyses can be performed using the mode, making it the least useful of the central tendency statistics.

A question often arises as to which measure of central tendency to use. There is no single answer. The mode is useful for determining where most cases fall and can easily be ascertained by inspection. The median is always safe and usually useful, especially when the distribution is skewed or lopsided. The mean is the most sophisticated measure of central tendency and is the most useful for further computations.

Take a hypothetical example of customer bank accounts. Customers who walk into the bank have accounts of widely varying size (from under $100 to over $100,000), but most hold less than $1,000. Using the mean in this case would produce a result grossly distorted by the bank's few wealthy customers, as would the midrange (the average of the highest and lowest values). Because the distribution of the data is lopsided, the median would be best for telling us the central monetary value of bank accounts, and the mode would tell us what a typical account size is. However, if we were trying to compute the total account size of the 300 lunchtime customers, the mean would be the most useful, since the total is simply the mean multiplied by 300.

Measures of Dispersion

Given a measure of central tendency, say the mean, it is important to know whether the data hover around the mean or are broadly distributed. Several measures of dispersion describe how the data are spread in a distribution.

The range is the difference between the highest value and the lowest value in the data set. Although the range is a simple computation, it is not a good tool to use for large data sets.
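
As a minimal sketch of the range as just defined, the Python snippet below recomputes it each time a new observation joins the sample. The bill amounts and the helper name `data_range` are hypothetical, chosen only to show that the range moves only when a new extreme value arrives.

```python
# Hypothetical monthly phone bills (dollars), in the order they were sampled.
bills = [12.00, 160.00, 45.00, 80.00, 5.00, 240.00, 60.00, 95.00]


def data_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)


# Range after each new bill is added to the sample (starting with two bills).
sample_sizes = range(2, len(bills) + 1)
running_ranges = [data_range(bills[:n]) for n in sample_sizes]

for n, r in zip(sample_sizes, running_ranges):
    print(f"first {n} bills: range = {r:.2f}")
```

In this made-up sample the range sits at 148 until the $5 bill arrives, then jumps again with the $240 bill; the bills that fall between the current extremes leave it unchanged.
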
The larger the sample size, the less likely the range will change, but the range of a small sample is very sensitive to change. Picture a sample of private phone bills. If the first two bills sampled were $12.00 and $160.00, the range can increase only if a bill larger than $160.00 or smaller than $12.00 is sampled. Assume the next two bills in the sample are $5.00 and $240.00. The range has just grown immensely and will probably never increase again, no matter how many more bills are sampled. Generally, if the data are normally distributed, such that the graphical representation of the data is a bell-shaped curve, the range is useful for samples containing ten or fewer elements and only when the largest and smallest values in the data set are known. The range is also very sensitive to extreme outliers.

Putting It All Together

Central tendency without dispersion is meaningless in descriptive statistics. For example, you may have the choice of two kinds of jobs: one with an average salary of $60k per year and one with an average salary of $50k per year. The lower-paying job might be better. It could be that the $60k job has some high salaries and some low salaries, while the $50k job pays everyone the same $50k. Now, which will more likely offer higher pay?

As an example of how important it is to look beyond central tendency and consider dispersion, suppose someone required a vendor to produce widgets between 99 mm and 101 mm in size. If Company A’s widgets average 100 mm and Company B’s average 98.9 mm, which company should get the contract? Before deciding hastily, look beyond the average. While the average may lead one to believe that Company A is the better supplier, a further look at the dispersion reveals that Company B is in fact the better choice. In this example, Company A makes no widgets in the acceptable range, while 90% of Company B’s are acceptable.
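
The widget comparison above can be made concrete with a short Python sketch. The individual measurements are hypothetical, chosen only so that the averages and acceptance rates match the ones quoted in the text; the helper `fraction_in_spec` is likewise an assumption of this sketch, not part of any library.

```python
from statistics import mean, pstdev

LOWER, UPPER = 99.0, 101.0  # acceptable widget size in mm (from the example)

# Hypothetical measurements chosen only to match the averages in the text.
company_a = [98.0] * 5 + [102.0] * 5  # averages 100 mm, every widget 2 mm off
company_b = [99.2] * 9 + [96.2]       # averages 98.9 mm, one undersized widget


def fraction_in_spec(sizes, lower=LOWER, upper=UPPER):
    """Share of widgets whose size falls inside [lower, upper]."""
    return sum(lower <= s <= upper for s in sizes) / len(sizes)


for name, sizes in (("A", company_a), ("B", company_b)):
    print(f"Company {name}: mean={mean(sizes):.1f} mm, "
          f"std dev={pstdev(sizes):.2f} mm, "
          f"in spec={fraction_in_spec(sizes):.0%}")
```

Under these assumed measurements, Company A hits the target on average yet ships nothing usable, while Company B's tighter spread puts nine of its ten widgets inside the tolerance.
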
Never use a measure of central tendency without a measure of dispersion; one is meaningless without the other. The variance or standard deviation should accompany the mean, and the interquartile range should accompany the median. Even when both statistics are known for a distribution, one should not jump to conclusions on the basis of central tendency or dispersion separately! Always visualize the data, considering both the central tendency and the shape of the distribution around it, before drawing any conclusions.

Conclusion

As described earlier, outliers are data points that fall in the extreme tails of a distribution. They stand alone relative to the other values in a data set. Outlying data points may be real, but often they result from incorrectly recorded data or from data that were collected from different populations, each having a different distribution. Outliers can distort the mean and severely distort the range and the variance or standard deviation of a data set. Always be wary of outliers when collecting data.
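
To illustrate how a single outlier affects the paired statistics recommended above (mean with standard deviation, median with interquartile range), here is a minimal Python sketch. The data are hypothetical, the `summarize` helper is an assumption of this example, and the quartiles use the default method of `statistics.quantiles`; other quartile conventions give slightly different values.

```python
from statistics import mean, median, quantiles, stdev


def summarize(values):
    """Return mean/std-dev and median/IQR pairs, plus the range, for a sample."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points (default method)
    return {
        "mean": mean(values),
        "stdev": stdev(values),  # sample standard deviation
        "median": median(values),
        "iqr": q3 - q1,  # interquartile range
        "range": max(values) - min(values),
    }


# Hypothetical measurements; the extra value in the second list mimics a
# recording error such as a misplaced decimal point.
clean = [48, 50, 51, 52, 53, 55, 56, 58, 59]
with_outlier = clean + [530]

for label, data in (("clean", clean), ("with outlier", with_outlier)):
    stats = summarize(data)
    print(label, {name: round(value, 1) for name, value in stats.items()})
```

In this made-up sample the mean, standard deviation, and range all change dramatically once the bad value is included, while the median and interquartile range barely move.
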


Whitt's Consulting * Reading * PA
