More Hypothesis Testing
Conduct directional hypothesis testing
Conduct hypothesis testing for small samples
Conduct hypothesis testing for proportions
Hypothesis tests can be done on virtually any population measurement,
but the most common tests are about means and proportions. A hypothesis test is essentially the same as comparing a confidence
interval against a theorized value. Every election poll shows a percentage of people responding a certain way. Knowing that
it takes 50.1% to win the election, if the sample proportion +/- the margin of error is completely above 50%, then one could
be X% confident that person will win (and if it is completely below 50%, then one would be X% confident that person will lose).
Hypothesis tests for proportions work the same way. The null hypothesis
might be that the population proportion = .50, and with the sample proportion computed, you would use the test results to
determine if the sample proportion is significantly different from .50. Significantly different really means that the margin
of error is smaller than the distance to .50. If it is significantly different, you reject the null hypothesis and conclude
the alternate hypothesis is true. If it is not significantly different, you fail to reject the null hypothesis and state that
there is insufficient evidence to prove the alternate is true. If a Bush-Kerry poll showed Bush with 52% of the vote, the
test might still fail to reject the null; this does not mean the candidates are tied, but it does mean the results
are still inconclusive (the media calls this a swing state).
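The proportion test described above can be sketched in a few lines of Python. The poll numbers here (520 of 1,000 respondents) are hypothetical, chosen to mirror the 52% example; the standard error is computed under the null hypothesis that the true proportion is .50.

```python
import math

def proportion_z_test(successes, n, p0=0.50):
    """Two-tailed one-sample z-test for H0: population proportion = p0.

    Returns the z statistic and the p-value."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)      # standard error under H0
    z = (p_hat - p0) / se
    # two-tailed p-value from the standard normal CDF (written via erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical poll: 520 of 1,000 respondents support the candidate (52%)
z, p = proportion_z_test(520, 1000)
```

With these numbers the p-value comes out well above .05, so the test fails to reject the null; 52% in a sample of 1,000 is still consistent with a 50/50 population.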
Z-tests vs. T-tests
Hypothesis tests for means can use one of two approaches –
a z-test or a t-test. To use a z-test, you the population standard deviation must be known (which can be obtained from historical
data or prior analysis). You could also use the z-test when you do not know the population standard deviation provided that
your sample is large enough (at least 30); the rationale here is that with a large sample, the sample standard deviation tends
to be similar to the population standard deviation. If you do not know the population standard deviation and your sample is
not large enough, fear not because you can use a t-test. The t-test accounts for the extra uncertainty that comes from estimating the standard deviation from the sample.
While there are slightly different formulas and different tables in the textbook, those using Excel will find the only difference
between the two tests is as simple as going to the next menu item.
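The mechanical difference between the two tests can be seen by computing both statistics by hand. This is a minimal sketch with a hypothetical sample of test scores; the z version assumes the population standard deviation is somehow known (here an assumed value of 3.0), while the t version substitutes the sample standard deviation.

```python
import math
import statistics

def z_statistic(sample, mu0, sigma):
    """z-test statistic: the population standard deviation sigma is known."""
    return (statistics.mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))

def t_statistic(sample, mu0):
    """t-test statistic: the sample standard deviation stands in for sigma."""
    s = statistics.stdev(sample)           # sample std dev (n - 1 denominator)
    return (statistics.mean(sample) - mu0) / (s / math.sqrt(len(sample)))

# Hypothetical small sample of test scores; H0: mean = 75
scores = [78, 82, 74, 80, 77, 79, 81, 76]
t = t_statistic(scores, 75)          # compare to a t table with 7 degrees of freedom
z = z_statistic(scores, 75, sigma=3.0)   # only valid if sigma = 3.0 were truly known
```

The formulas are identical except for which standard deviation is used; the difference lies in which table (normal vs. t) the statistic is compared against, which is why statistical software puts the two tests side by side.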
Understanding the p-value
What determines if you reject or do not reject the null? If you
did the test using formulas and paper, you would rely on the test statistic and tables, but in this high tech world, you will
find that all statistical software produces a p-value in the output for a hypothesis test. A p-value is merely the probability
of making a Type I error if you decide to reject the null. Going back to the judicial example, it is the probability that
the defendant is truly innocent should you decide to convict him. While the significance level is your tolerance for being
wrong, the p-value is the likelihood you actually are wrong, the risk you would be taking. When the p-value is smaller than
the significance level, the risk of error is less than your tolerance for error and so you reject the null hypothesis;
when the p-value is larger than the significance level, the risk is too great for your tolerance and so you do not reject
the null hypothesis.
Computers are inhuman and do not make decisions. A p-value of
.03 is as far as the output goes. If you are willing to set a significance level at .05, you would reject the null, but a
colleague doing the same study might set the significance level at .01 and fail to reject the null.
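The decision rule is simple enough to state as code. This tiny sketch replays the scenario above: the same p-value of .03 leads two researchers with different significance levels to opposite conclusions.

```python
def decide(p_value, alpha):
    """Reject H0 when the p-value is below the chosen significance level."""
    return "reject the null" if p_value < alpha else "fail to reject the null"

# Same study, same p-value, different tolerances for a Type I error
decision_a = decide(0.03, alpha=0.05)   # reject the null
decision_b = decide(0.03, alpha=0.01)   # fail to reject the null
```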
Notice how the statement is to "not reject the null" or "fail
to reject the null." We NEVER accept the null. It cannot be proven that a person is innocent of a crime or that a coin is
perfectly fair. In fact, a physician cannot even prove a person is healthy (a doctor looks for problems and in the absence
of any, can only state that he failed to find anything wrong with you).
Directional vs. Non-directional Tests
Regardless of which test is conducted, a test direction must be
determined. The hypotheses can be two-tailed (non-directional), upper-tailed or lower-tailed. A two-tailed test is one in
which the null hypothesis has an equals sign and the alternate has a "not equals" sign; you are testing if the sample mean
is different from the hypothesized value. An upper-tailed test is one in which
the alternate hypothesis tests if the sample mean is greater than the hypothesized
value, and a lower-tailed test is one in which the alternate hypothesis tests if the sample mean is less than the hypothesized value. An upper-tailed or lower-tailed test (also called a directional
test) cuts the two-tailed p-value in half, making it easier to reject the null hypothesis. While this sounds great, it is also dangerous
and irresponsible to set up the test as directional unless you have good reason to do so. For example, if you are testing
a diet pill, it is reasonable to test if the average weight has declined. It would also be understandable to test if the mean
SAT score at a top-tier school is above the national average. If you wanted to study the grade performance of students who
drink Pepsi, it would not be reasonable to test if grades are higher or lower than a norm. The best decision is to use a non-directional
approach unless you have justification to pick a direction.
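The halving of the p-value can be made concrete. With a hypothetical test statistic of z = 1.8 (chosen for illustration), the upper-tailed p-value sits just under .05 while the two-tailed p-value sits just above it, so the choice of direction alone flips the conclusion at the .05 level:

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

z = 1.8                        # a hypothetical test statistic in the upper tail
p_upper = 1 - normal_cdf(z)    # upper-tailed p-value, about .036
p_two = 2 * p_upper            # two-tailed p-value, about .072
```

This is exactly why choosing a direction after seeing the data is irresponsible: the direction must be justified before the test is run.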
Getting to a Conclusion
Remember that the aim of a study is to prove the alternate hypothesis
(often called the research hypothesis). Researchers spend a lot of time and money trying to prove the alternate is true, as
there is no value in failing to reject the null and being where they began. Prosecutors do not want to go to trial unless
they can convict, or else it is a waste of time and money. Since the object is to reject the null and conclude that the alternate
is true, researchers do what they can to help their odds. A directional test will cut the p-value and help the chances of
rejection, but it must be appropriate to do so. Raising the level of significance gives a wider berth for rejecting the null,
but is that much error really tolerable? Lastly, increasing the sample size will make a small difference more significant
(it is always acceptable to take an additional sample if your p-value is close to the rejection mark, but it may not be convenient). If Bush had 60% support
in a poll, it may not be statistically significant (e.g., 3 out of 5 surveyed support him), but if he had 51%, it may be statistically
significant (e.g., 51,000 out of 100,000 support him). The percentages can be misleading; the sample size plays a big factor
in determining if the results are truly significant.
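The Bush example can be checked directly with the normal-approximation proportion test from earlier in this reading. The two calls below use the exact counts given above (3 of 5 vs. 51,000 of 100,000):

```python
import math

def two_tailed_p(successes, n, p0=0.50):
    """Two-tailed p-value for H0: p = p0, using the normal approximation."""
    z = (successes / n - p0) / math.sqrt(p0 * (1 - p0) / n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

p_small = two_tailed_p(3, 5)               # 60% of a tiny sample
p_large = two_tailed_p(51_000, 100_000)    # 51% of a huge sample
```

The 60% result is nowhere near significant, while the 51% result is overwhelmingly so; the sample size, not the percentage, drives the conclusion. (With n = 5 the normal approximation is a stretch; an exact binomial test would be preferred there, but the contrast holds either way.)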
Hypothesis testing is truly a shift in thought, but it makes sense. Once the hypotheses are set up properly, things get easier. Of course,
the results are completely meaningless if the data is not valid or reliable. Do not cut corners just to get the wanted results.
Not many people would want to try drugs that were improperly tested by an anxious researcher who just wanted to get published.
Even if a researcher proves that a drug cures what ails you, a small percentage of studies will incur a Type I error,
which is why reputable pharmaceutical companies will re-test; the aim is to be sure. Never let ethics take a backseat to desired
results. Hypothesis testing should open researchers' eyes to how election polling is done, how medical tests are conducted,
how psychological studies are done, etc., and should give the researcher a greater appreciation for the results.