Getting Acquainted with Statistics
·
Define basic statistical terminology
·
Recognize the different types of data and their uses
·
Identify different types of sampling
Introduction
It is a common
notion that statistics can be made to support anything; this is true only when statistics are misused. Most introductory
statistics courses focus on how to use statistics rather than on how to avoid misusing them. Textbooks offer recipes for
statistical analyses but often neglect to warn of the dangers of leaving some important ingredient out of the application
of a statistical test.
In statistics,
“garbage in, garbage out” has a significant and real meaning. Among the more common errors in the application of statistical
tests is the failure to use representative data. Representative data accurately reflect some process, whether that process
involves time or quantity, and they must be free of measurement and sampling bias. Applying the wrong statistical test or
tool, or using the right one incorrectly, can lead to gross errors in the interpretation of the data; accurate interpretation
depends upon valid analysis. Data may also be over-interpreted or under-interpreted. One must know the data well enough to
determine whether the results of an analysis are important beyond the level of statistical significance, or to identify the
practical benefits of a statistically significant result.
Basic Terminology
In any practical application, it is important to understand the basic terminology
used in statistics in order to make appropriate decisions about applying and interpreting statistical tests.
·
Observation – A single data point or datum
·
Population – A collection of all possible observations
sharing some common set of characteristics
·
Census – An investigation of all the individual observations
making up a population
·
Sample – A representative subset of a population; a
sample can be the entire population
·
Sampling – The process of collecting a representative
sample from which conclusions about the population may be drawn
·
Parameter – Computation based on all members of a population
·
Statistic – Computation based on a sample of a population
The Sampling Process
In statistics,
one collects a representative sample of a population and bases an interpretation of the population on that sample (or on
several samples). It is far easier and more cost-effective to manage a representative sample of observations than to manage
and manipulate every member of a population. A population can be evaluated more quickly through sampling, and careful sampling
reduces the risk of damage to the population as a whole. It is therefore imperative that the sampling process permit collection
of a representative sample; otherwise the population is misrepresented, and the results of any analysis become invalid and
uninformative.
A sample
is drawn from a sampling frame: a list of elements from which the sample may be selected. A sampling frame should be a complete
and correct list of population members. The source of the sampling frame should be representative of the population so that
it does not contribute any bias to the results of a statistical analysis.
The stages in sample selection
are as follows (Cooper, 2001):
• Define the target population
• Select a sampling frame
• Choose probability or non-probability sampling
method
• Determine sample size
• Choose a data collection technique
• Select sample
Types of Samples
There are many types of
samples, and the decision about which type to collect should be based on the type of analysis that will be performed.
A probability sample is one in which items are selected on the basis of known probabilities.
The advantages of probability sampling are that biases in the data are minimized and that the results tend to be more generalizable.
However, probability sampling may be more costly and time-consuming. Probability samples include simple random samples, stratified
random samples, systematic random samples, and cluster random samples. A non-probability sample is one in which items are
selected without regard to their probability of occurrence. Non-probability sampling is often more convenient because the
sampling sources are more readily available, as in shopping mall surveys or surveys of students on a college campus. However,
sampling in this way invites bias and subjectivity, and the results of an analysis may not be generally applicable.
Non-probability samples include convenience samples, judgment samples, quota samples, snowball samples, and voluntary
samples.
Probability Samples
·
Simple Random Sample – A sampling procedure that ensures
that each element of the population has an equal chance of being included in the sample. Random drawing techniques or a
random number table help to ensure that the data are collected randomly (see the sketch following this list).
·
Systematic Sample – A sample in which every nth
individual is selected from a population. For example, one can systematically select every 25th name from a list of company employees or every 50th
number in the phone book.
·
Stratified Sample – A subsample drawn from samples collected
from different strata that are essentially equal with respect to some characteristic. For example, randomly selecting 20%
of the automobile dealers from 10 randomly-selected states is a stratified sample where the states are different strata.
·
Proportional Stratified Sample – A sample in which the
size of the sample drawn from each stratum is in proportion to the relative population size of that stratum. In the automobile
dealer example, if State A had 15% of the total number of dealers in the ten strata, then 15% of the total sample would
be selected from the State A sampling frame.
·
Cluster Sample – A sample in which the primary sampling
unit is not an individual element in the population but a large cluster of elements. The most common type of cluster sample
is an “area” sample in which the primary sampling unit is a geographic area. One can also randomly select several
small clusters and choose all elements in the clusters.
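As a minimal sketch of how the first two procedures above might be carried out, the following Python fragment draws a simple random sample and a systematic sample from a hypothetical employee roster (the roster, the seed, and the sample sizes are illustrative assumptions, not from the text):

    import random

    # Hypothetical population: a company roster of 500 names (illustrative).
    employees = ["Employee %d" % i for i in range(1, 501)]

    random.seed(42)  # fixed seed only so the sketch is reproducible

    # Simple random sample: every element has an equal chance of selection.
    # random.sample draws without replacement, much like a random number table.
    simple_random = random.sample(employees, k=20)

    # Systematic sample: after a random start, take every 25th individual,
    # as in the employee-list example above.
    n = 25
    start = random.randrange(n)
    systematic = employees[start::n]

    print(simple_random[:3])
    print(systematic[:3])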
Some important distinctions
exist between stratified and cluster samples (Zikmund, 2000). Stratified samples are collected from a population divided
into a few subgroups, each with many elements, and elements are randomly selected from every subgroup. There is homogeneity
within subgroups and heterogeneity among subgroups. Cluster samples are collected from a population divided into many subgroups,
each with few elements. There is heterogeneity within subgroups but homogeneity among subgroups, and subgroups, rather than
elements within subgroups, are randomly selected.
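This distinction can be made concrete with a short sketch. Below, a hypothetical population of automobile dealers is grouped by state; the proportional stratified sample draws from every state in proportion to its size, while the cluster sample randomly selects whole states and takes all of their dealers (the states, dealer counts, and 10% sampling fraction are illustrative assumptions):

    import random

    random.seed(1)
    # Hypothetical dealers grouped by state; states act as strata or clusters.
    dealers = {state: ["%s-dealer-%d" % (state, i) for i in range(random.randint(20, 80))]
               for state in ["A", "B", "C", "D", "E"]}

    # Proportional stratified sample: draw from EVERY stratum, with each
    # stratum's share proportional to its population size.
    fraction = 0.10
    stratified = [d for members in dealers.values()
                  for d in random.sample(members, k=max(1, round(fraction * len(members))))]

    # Cluster sample: randomly select whole subgroups (states), then take
    # ALL elements within the chosen clusters.
    chosen_states = random.sample(sorted(dealers), k=2)
    cluster = [d for state in chosen_states for d in dealers[state]]

    print(len(stratified), "dealers via stratified sampling (every state represented)")
    print(len(cluster), "dealers from clusters", chosen_states)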
Non-Probability Samples
·
Convenience Sample – A sample of items that are most
readily available.
·
Judgment Sample – A sample selected by an experienced
researcher based upon some appropriate characteristic.
·
Quota Sample – A sample that ensures that certain characteristics
of the population are represented to the exact intended extent. For example, a quota sample may require selecting 100
residents of a specified metro area or may specify that 59% of study participants be male. Selection is initially random,
but once a quota is met, further respondents with that characteristic are not included in the sample (a sketch of this
stopping rule follows this list).
·
Snowball Sample – A sample in which initial respondents
are selected using probability methods, and then additional respondents are obtained from information provided by the initial
respondents, such as surveying 10 shoppers selected at random and asking each of them for the names of five friends.
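As described in the quota sample entry above, a quota sample accepts respondents as they arrive but stops admitting a group once its quota is filled. A minimal sketch of that stopping rule, with the quotas and the simulated respondent stream as illustrative assumptions:

    import random

    random.seed(7)
    quotas = {"male": 59, "female": 41}   # e.g., 59% male in a sample of 100
    counts = {"male": 0, "female": 0}

    # Respondents arrive in an uncontrolled order; once a group's quota is
    # met, further respondents from that group are excluded.
    while sum(counts.values()) < sum(quotas.values()):
        respondent = random.choice(["male", "female"])  # simulated arrival
        if counts[respondent] < quotas[respondent]:
            counts[respondent] += 1

    print(counts)  # {'male': 59, 'female': 41}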
Sample Sizes
A sample
does not have to be large to be useful, as long as it is representative of the characteristics of the population. Several
factors should be considered in choosing the size of a sample. Is it a percentage of the population? Is population size a
factor? Is there a magic minimum? According to Dr. George Gallup, “You do not need a large sampling proportion to do a good job if you first stir the pot well.”
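The section leaves the “magic minimum” question open. For reference, a standard textbook formula (an addition here, not taken from this section) sizes a sample for estimating a proportion from a desired confidence level and margin of error, n = z²p(1 − p)/e²; note that for large populations the answer does not depend on population size, which echoes Gallup’s point:

    # Standard sample-size formula for estimating a proportion
    # (a well-known statistics result; the numbers below are illustrative).
    z = 1.96   # z-score for 95% confidence
    p = 0.5    # assumed population proportion; 0.5 is the most conservative
    e = 0.05   # desired margin of error (plus or minus 5 percentage points)

    n = (z ** 2) * p * (1 - p) / e ** 2
    print(round(n))  # about 384, regardless of how large the population is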
Data Collection Technique
The table below can help a researcher decide how to sample a population. Ideally, one should list all factors that might
contribute to the sampling process.
Pitfalls throughout the Sampling Hierarchy
1) At every step of the process of sampling and statistical analysis there are chances for errors that will adversely affect
the outcome of a study. One can make a sampling frame error, in which the sampling frame includes unwanted units or excludes
desired units of the population. For example, a telephone book used to define the sampling frame for residents of a particular
neighborhood may not contain accurate or complete information. Alf Landon was incorrectly predicted to win the 1936 Presidential
election – why? (See page 18 of the textbook for the answer.) One can commit a random sampling error, in which the difference
between the result of a sample and the result of a census is due solely to the chance selection of observations that do not
represent the population. For example, 75% of a selected sample might be male when only 40% of the population is male (a short
simulation at the end of this discussion illustrates the effect). Error may also be caused by sampling bias, in which one tends
to favor certain data over others. Non-response errors are those that make the sample less than representative of the population,
perhaps because a disproportionately large group of males responds to a questionnaire or because respondents are unavailable or
refuse to cooperate; these are among the most serious limitations of surveys. It is important not to confuse response rate with
sample size.
2)
A statistician must be cautious of voluntary samples, as these
types of samples may do irreparable damage to an analysis. Voluntary samples might include
a) 900 number surveys
b) 800 number surveys
c) Opinion sites at malls
d) News / sports polls on Web
e) Talk shows
f) Websites with voting options
Voluntary surveys
may bring large response totals (not the same as response rate), but the larger sample may not be representative
of the population under study, and the size of the sample will not compensate for this bias.
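Random sampling error of the kind described in item 1 (a sample that is 75% male drawn from a population that is only 40% male) can be made tangible with a quick simulation; the population size, sample size, and number of trials below are illustrative assumptions:

    import random

    random.seed(3)
    # Hypothetical population that is 40% male.
    population = ["male"] * 400 + ["female"] * 600

    # Draw many small samples and record each one's proportion of males;
    # chance alone makes some samples stray far from the true 40%.
    proportions = []
    for _ in range(1000):
        sample = random.sample(population, k=20)
        proportions.append(sample.count("male") / 20)

    print("true proportion of males: 0.40")
    print("range observed across 1000 samples of 20: %.2f to %.2f"
          % (min(proportions), max(proportions)))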
Making Sense of Data – Learning to Think Statistically
Respecting Ockham’s Razor
Ockham’s
Razor is the principle in investigative research that when two hypotheses make the same predictions and the data cannot
distinguish between them, the simpler explanation should be prioritized for further investigation, as it is more likely
to be correct. The razor is the sharp edge that cuts excesses out of a hypothesis, and modern statistics is often in need
of a shave! The simplest procedures that can solve a problem are preferred; deliberately complicating a solution is a misuse
of statistics, as it obscures the analysis.
Designing the Process
In all research
involving data collection and analysis, it is important to approach the problem scientifically. Careful consideration must
be given to sampling and data collection. Data sheets are essential for managing data, and researchers need to identify the
factors that must be evaluated in order to make valid interpretations of the data. Causal relationships are often misevaluated
when one attempts to determine the root cause of a problem; to rigorously test causation, the suspected causal agent must be
removed and new data collected and reevaluated. It is important not to accept the results of a statistical analysis blindly.
Contradictions of historically well-supported hypotheses need to be carefully considered and investigated. Look at data
intelligently: identify the source of the data, and always ask whether the data and the results of an analysis make sense.
Follow some basic rules of statistical thinking: beware of incorrect analyses, do not jump to conclusions, and, perhaps
most importantly, check the math!
References
Cooper, D. R., & Schindler, P. S. (2001). Business research methods (7th ed.). Boston: McGraw-Hill Irwin.
Zikmund, W. G. (2000). Business research methods (6th ed.). Ft. Worth, TX: Dryden.