Descriptive Statistics and Graphical Analysis
· Organize data with frequency distributions
· Differentiate tools for displaying data graphically
· Recognize the measures of central tendency
· Contrast and compare measures of dispersion
Introduction
One of the best methods for
quickly assessing trends in data is to view data in a graphical representation.
The distribution of data is described statistically using two characteristics,
both of which are necessary in evaluating data. Central
tendency indicates the middle of the data distribution, whether that
refers to the mean, mode, or median of the distribution. The dispersion
of the distribution of data describes the amount of spread or variation in the
data.
Measures of Central Tendency
Perhaps the most common measure of central tendency is the mean. The mean
of a distribution is the average value of all observations; the population
parameter is symbolized by μ and the sample statistic is symbolized by x̄.
Imagine a number line as a plank with the data points representing
blocks. The mean serves as the fulcrum or balancing point for the distribution.
The mean is easy to understand and is a well-established measure of central
tendency. Thus, it is tempting to use it for every data set. However, the mean is
sensitive to outliers (extreme data points) and is distorted by
non-symmetrical distributions of data. For example, there is a town in America
with an average income of millions of dollars, but no one is rushing to move to
Bentonville, AR, a relatively low-income town with a few billionaires! The mean
is meaningless unless it is associated with some measure of dispersion, such as
variance or standard deviation, and often the value calculated as the mean of a
distribution is not held by any one data point in the distribution. For example,
a statistician who is obsessed with means might believe that if one puts his/her
head in the icebox and feet in the oven, on the average the body should
feel comfortable!
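The balance-point idea and the sensitivity to outliers can both be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the income figures are hypothetical values chosen only for illustration, not data about any real town.

```python
from statistics import mean

# Hypothetical incomes in dollars; illustrative values only.
incomes = [30_000, 32_000, 35_000, 38_000, 40_000]
town_mean = mean(incomes)

# Balance-point property: deviations from the mean sum to zero.
deviation_sum = sum(x - town_mean for x in incomes)

# Add one extreme earner (again hypothetical) and recompute.
with_outlier = incomes + [1_000_000_000]
skewed_mean = mean(with_outlier)

print(town_mean, deviation_sum, round(skewed_mean))
```

With the extreme value included, the mean exceeds every one of the original incomes, so it no longer describes a typical resident.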
The median is the middle observation of a data distribution. It is the value
at which half of the data have a greater value and half of
the data have a lesser value. The median is largely unaffected by outliers and is
applicable in almost any distribution. Unlike the mean, the median is usually a value
held by an actual point in the distribution; with an even number of observations
it is conventionally taken as the average of the two middle values.
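As a small illustration of the median's resistance to outliers, the sketch below compares it with the mean on made-up numbers. It uses `statistics.median`, which averages the two middle values when the number of observations is even.

```python
from statistics import mean, median

values = [3, 5, 7, 9, 11]
with_outlier = [3, 5, 7, 9, 1100]   # largest value replaced by an extreme one

middle = median(values)                      # 7
middle_with_outlier = median(with_outlier)   # still 7
mean_with_outlier = mean(with_outlier)       # pulled far above most points

# Even number of observations: average of the two middle values (5 and 7).
even_median = median([3, 5, 7, 9])

print(middle, middle_with_outlier, mean_with_outlier, even_median)
```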
The mode of a distribution is the most frequently occurring value in the data. The mode is
used for categorical data or for data that take a limited number of distinct values. It can be
visualized as the “hump” in the data distribution. Unfortunately, some data
sets have no mode (no value repeats) or more than one, and there are few analyses
that can be performed using the mode,
making it the least useful of the central tendency statistics.
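The cases described above (a single mode, several modes, and no meaningful mode) can be sketched with `statistics.multimode`, available in Python 3.8 and later. The data are invented for illustration.

```python
from statistics import multimode

# Categorical data with one clear mode.
colors = ["red", "blue", "red", "green", "red", "blue"]
single = multimode(colors)          # ['red']

# Two values tie for most frequent: the data are bimodal.
tied = multimode([1, 1, 2, 2, 3])   # [1, 2]

# No value repeats, so every value is returned and none stands out.
no_repeats = multimode([1, 2, 3, 4])

print(single, tied, no_repeats)
```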
A question
often arises as to which measure of central tendency to use. There is no single
answer. The mode is useful for determining where most cases fall and can easily
be ascertained by inspection. The median is always safe and usually useful,
especially when we have a rather skewed or lopsided distribution. The mean is
the most sophisticated measure of central tendency and is most useful for
further computations.
Take a hypothetical example of customer bank accounts. Customers who walk into the
bank have accounts of varying size (under $100 to over $100,000), but most are under
$1,000. Using the mean in this case would produce a result grossly distorted by
the few wealthy customers of the bank, as would the midrange (the average of the
smallest and largest values). Because the
distribution of the data is lopsided, the median would be best for telling us
the central monetary value of bank accounts and the mode would tell us what a
typical account size is. However, if we were trying to compute the total account
size of the 300 lunchtime customers, the mean would be most useful, because
multiplying it by the number of customers gives the total.
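The bank example can be sketched with a handful of hypothetical balances (ten invented accounts, not real data). The sketch compares the four statistics mentioned and shows why the mean is the one to use for totals.

```python
from statistics import mean, median, mode

# Hypothetical account balances in dollars; most are under $1,000.
balances = [50, 200, 400, 400, 600, 800, 900, 1_500, 20_000, 150_000]

avg = mean(balances)                            # dragged up by two large accounts
mid = median(balances)                          # central account value
typical = mode(balances)                        # most common balance
midrange = (min(balances) + max(balances)) / 2  # depends only on the extremes

# The mean times the number of accounts recovers the total exactly.
total_from_mean = avg * len(balances)

print(avg, mid, typical, midrange, total_from_mean)
```

Here the mean and midrange land far above what most customers hold, while the median and mode stay under $1,000.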
Measures of Dispersion
Given a measure
of central tendency, say the mean, it is important to know whether the
distribution of data hovers around the mean or whether the data are broadly
distributed. Several measures of dispersion describe how the data are spread in
a distribution.
The range is the difference between the highest value and the lowest value in the data set.
Although the range is a simple computation, it is not a good tool to use for
large data sets. The larger the sample, the less likely a new observation is to
change the range, whereas the range of a small sample is very sensitive to each
new observation. Picture a sample of
private phone bills. If the first two bills sampled were $12.00 and $160.00, the
range could not increase unless a bill below $12.00 or above $160.00 was
sampled. Assume the next two bills in the sample are $5.00 and $240.00.
The range has just grown from $148.00 to $235.00 and will probably never increase again,
regardless of how many more bills are sampled. Generally, if the data are normally distributed such
that the graphical representation of the data is a bell-shaped curve, the range
is useful for samples containing ten or fewer elements and only when the largest
and smallest values in the data set are known. The range is also very sensitive to
extreme outliers.
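The phone-bill example can be traced step by step. The sketch below uses the four bills from the text; the helper name `running_range` and the three extra bills are assumptions introduced here for illustration.

```python
def running_range(values):
    """Return the range (max - min) after each successive observation."""
    lowest = highest = values[0]
    ranges = []
    for value in values:
        lowest = min(lowest, value)
        highest = max(highest, value)
        ranges.append(highest - lowest)
    return ranges

# The four phone bills from the text, in sampling order.
bills = [12.00, 160.00, 5.00, 240.00]
ranges_so_far = running_range(bills)

# Hypothetical further bills that fall between the current extremes.
more_bills = bills + [40.00, 75.00, 110.00]
final_range = running_range(more_bills)[-1]

print(ranges_so_far, final_range)
```

Only a bill below the current minimum or above the current maximum can change the range, which is why it stops moving once the extremes have been sampled.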
Putting It All Together
Central tendency without dispersion is meaningless in descriptive statistics. For
example, you may have the choice of two kinds of jobs: one with an average
salary of $60k per year and one with an average salary of $50k per year. The
lower-paying job might be better. It could be that the $60k job has some high
salaries and some low salaries, while the $50k job pays everyone the same $50k.
Now, which is more likely to offer you higher pay?
As an example of how important it is to look beyond central tendency and consider
dispersion, suppose someone required a vendor to produce widgets
between 99 mm and 101 mm in size. If Company A’s widgets average 100 mm and
Company B’s average 98.9 mm, which company should get the contract? Before
deciding hastily, look beyond the average. While the average may lead one to
believe that Company A is the better supplier, a further look at the
dispersion reveals that Company B is in fact the better choice. Company A makes
no widgets in the acceptable range, while 90% of Company B’s are acceptable.
| | Company A | Company B |
| Breakdown | 50% at 95 mm, 50% at 105 mm | 90% at 99 mm, 10% at 98 mm |
| Average | 100 mm | 98.9 mm |
| Range | 10 mm | 1 mm |
| Standard Deviation | 5.3 mm | 0.3 mm |
| % acceptable | None | 90% |
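The figures in the table can be reproduced from the breakdown row. The sketch below assumes a sample of ten widgets per company (an assumption, since the text gives only percentages) and uses the sample standard deviation (`statistics.stdev`, dividing by n - 1), which is what yields 5.3 mm for Company A; the population formula would give 5.0 mm. The function name `summarize` is introduced here for illustration.

```python
from statistics import mean, stdev

# Assumed samples of ten widgets each, matching the breakdown percentages.
company_a = [95] * 5 + [105] * 5   # 50% at 95 mm, 50% at 105 mm
company_b = [99] * 9 + [98] * 1    # 90% at 99 mm, 10% at 98 mm

def summarize(widths, low=99, high=101):
    """Summarize widget sizes against the 99-101 mm tolerance."""
    in_spec = sum(low <= w <= high for w in widths)
    return {
        "average": mean(widths),
        "range": max(widths) - min(widths),
        "stdev": round(stdev(widths), 1),          # sample standard deviation
        "pct_acceptable": 100 * in_spec / len(widths),
    }

summary_a = summarize(company_a)
summary_b = summarize(company_b)
print(summary_a)
print(summary_b)
```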
Never use a
measure of central tendency without a measure of dispersion – one is
meaningless without the other. The variance or standard deviation should
accompany the mean, and the interquartile range should accompany the median.
Though both statistics may be known about a distribution, one should not jump to
conclusions on the basis of central tendency or dispersion separately! One
should always visualize the data, by considering the central tendency and the
shape of the distribution of the data around that tendency, before making any
conclusions.
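The pairings recommended above (mean with standard deviation, median with interquartile range) might be reported together as in this sketch. The data are invented, and the quartiles come from `statistics.quantiles` (Python 3.8+) with its default method; other quartile conventions give slightly different cut points.

```python
from statistics import mean, median, quantiles, stdev

# Hypothetical right-skewed sample; illustrative values only.
values = [2, 4, 4, 5, 6, 7, 8, 9, 12, 40]

# Mean paired with the sample standard deviation.
avg, sd = mean(values), stdev(values)

# Median paired with the interquartile range (Q3 - Q1).
q1, q2, q3 = quantiles(values, n=4)
iqr = q3 - q1

print(f"mean = {avg:.2f}, standard deviation = {sd:.2f}")
print(f"median = {median(values):.2f}, interquartile range = {iqr:.2f}")
```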
Conclusion
As described earlier, outliers are data points that fall in the extreme
tails of a distribution. They stand alone relative to other values in a data set.
Outlying data points may be real, but often they result from incorrectly
recorded data or data that were collected from different populations, each having a
different distribution. Outliers can distort the mean and severely distort the
range and variance or standard deviation in a data set. Always be wary of
outliers when collecting data.
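To see how much a single incorrectly recorded value can distort these statistics, consider the sketch below. The readings are hypothetical, and the outlier imitates a data-entry slip (10.0 keyed in as 100.0).

```python
from statistics import mean, median, stdev

# Hypothetical measurements clustered around 10.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]
with_typo = readings + [100.0]   # one value recorded incorrectly

def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

for label, data in (("clean", readings), ("with typo", with_typo)):
    print(
        f"{label}: mean={mean(data):.2f}, median={median(data):.2f}, "
        f"range={value_range(data):.2f}, stdev={stdev(data):.2f}"
    )
```

The median is unchanged by the bad value, while the mean, range, and standard deviation all shift dramatically.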