More Hypothesis Testing
Conduct directional hypothesis testing
Conduct hypothesis testing for small samples
Conduct hypothesis testing for proportions
Hypothesis tests can be done on virtually any population measurement,
but the most common tests are about means and proportions. A hypothesis test is essentially the same as comparing a confidence
interval against a theorized value. Every election poll shows a percentage of people responding a certain way. Knowing that
it takes 50.1% to win the election, if the sample proportion +/- the margin of error is completely above 50%, then one could
be X% confident that person will win (and if it is completely below 50%, then one would be X% confident that person will lose).
Hypothesis tests for proportions work the same way. The null hypothesis
might be that the population proportion = .50, and with the sample proportion computed, you would use the test results to
determine if the sample proportion is significantly different from .50. Significantly different really means that the margin
of error is smaller than the distance to .50. If it is significantly different, you reject the null hypothesis and conclude
the alternate hypothesis is true. If it is not significantly different, you fail to reject the null hypothesis and state that
there is insufficient evidence to prove the alternate is true. If a Bush-Kerry poll showed Bush with 52% of the vote, the
test might still fail to reject the null; this does not mean the candidates are tied, but it does mean the results
are still inconclusive (the media calls this a swing state).
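The proportion test described above can be sketched in a few lines of Python. The poll numbers here (520 of 1,000 respondents) are hypothetical, chosen to mirror the 52% example; the standard error is computed under the null hypothesis that the true proportion is .50.

```python
import math

def proportion_z_test(successes, n, p0=0.50):
    """Two-tailed one-sample z-test for H0: population proportion = p0.

    Returns the z statistic and the p-value."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)      # standard error under H0
    z = (p_hat - p0) / se
    # two-tailed p-value from the standard normal CDF (written via erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical poll: 520 of 1,000 respondents support the candidate (52%)
z, p = proportion_z_test(520, 1000)
```

With these numbers the p-value comes out well above .05, so the test fails to reject the null; 52% in a sample of 1,000 is still consistent with a 50/50 population.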
Z-tests vs. T-tests
Hypothesis tests for means can use one of two approaches –
a z-test or a t-test. To use a z-test, you the population standard deviation must be known (which can be obtained from historical
data or prior analysis). You could also use the z-test when you do not know the population standard deviation provided that
your sample is large enough (at least 30); the rationale here is that with a large sample, the sample standard deviation tends
to be similar to the population standard deviation. If you do not know the population standard deviation and your sample is
not large enough, fear not because you can use a t-test. The t-test accounts for the extra uncertainty that comes from estimating the standard deviation from the sample.
While there are slightly different formulas and different tables in the textbook, those using Excel will find the only difference
between the two tests is as simple as going to the next menu item.
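The mechanical difference between the two tests can be seen by computing both statistics by hand. This is a minimal sketch with a hypothetical sample of test scores; the z version assumes the population standard deviation is somehow known (here an assumed value of 3.0), while the t version substitutes the sample standard deviation.

```python
import math
import statistics

def z_statistic(sample, mu0, sigma):
    """z-test statistic: the population standard deviation sigma is known."""
    return (statistics.mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))

def t_statistic(sample, mu0):
    """t-test statistic: the sample standard deviation stands in for sigma."""
    s = statistics.stdev(sample)           # sample std dev (n - 1 denominator)
    return (statistics.mean(sample) - mu0) / (s / math.sqrt(len(sample)))

# Hypothetical small sample of test scores; H0: mean = 75
scores = [78, 82, 74, 80, 77, 79, 81, 76]
t = t_statistic(scores, 75)          # compare to a t table with 7 degrees of freedom
z = z_statistic(scores, 75, sigma=3.0)   # only valid if sigma = 3.0 were truly known
```

The formulas are identical except for which standard deviation is used; the difference lies in which table (normal vs. t) the statistic is compared against, which is why statistical software puts the two tests side by side.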
Understanding the p-value
What determines if you reject or do not reject the null? If you
did the test using formulas and paper, you would rely on the test statistic and tables, but in this high tech world, you will
find that all statistical software produces a p-value in the output for a hypothesis test. A p-value is merely the probability
of making a Type I error if you decide to reject the null. Going back to the judicial example, it is the probability that
the defendant is truly innocent should you decide to convict him. While the significance level is your tolerance for being
wrong, the p-value is the likelihood you actually are wrong, the risk you would be taking. When the p-value is smaller than
the significance level, the risk of error is less than your tolerance for error and so you reject the null hypothesis;
when the p-value is larger than the significance level, the risk is too great for your tolerance and so you do not reject
the null hypothesis.
Computers are inhuman and do not make decisions. A p-value of
.03 is as far as the output goes. If you are willing to set a significance level at .05, you would reject the null, but a
colleague doing the same study might set the significance level at .01 and fail to reject the null.
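The decision rule is simple enough to state as code. This tiny sketch replays the scenario above: the same p-value of .03 leads two researchers with different significance levels to opposite conclusions.

```python
def decide(p_value, alpha):
    """Reject H0 when the p-value is below the chosen significance level."""
    return "reject the null" if p_value < alpha else "fail to reject the null"

# Same study, same p-value, different tolerances for a Type I error
decision_a = decide(0.03, alpha=0.05)   # reject the null
decision_b = decide(0.03, alpha=0.01)   # fail to reject the null
```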
Notice how the statement is to "not reject the null" or "fail
to reject the null." We NEVER accept the null. It cannot be proven that a person is innocent of a crime or that a coin is
perfectly fair. In fact, a physician cannot even prove a person is healthy (a doctor looks for problems and in the absence
of any, can only state that he failed to find anything wrong with you).
Directional vs. Non-directional Tests
Regardless of which test is conducted, a test direction must be
determined. The hypotheses can be two-tailed (non-directional), upper-tailed or lower-tailed. A two-tailed test is one in
which the null hypothesis has an equals sign and the alternate has a "not equals" sign; you are testing if the sample mean
is different from the hypothesized value. An upper-tailed test is one in which
the alternate hypothesis tests if the sample mean is greater than the hypothesized
value, and a lower-tailed test is one in which the alternate hypothesis tests if the sample mean is less than the hypothesized value. An upper-tailed or lower-tailed test (also called a directional
test) cuts the two-tailed p-value in half, making it easier to reject the null hypothesis. While this sounds great, it is also dangerous
and irresponsible to set up the test as directional unless you have good reason to do so. For example, if you are testing
a diet pill, it is reasonable to test if the average weight has declined. It would also be understandable to test if the mean
SAT score at a top-tier school is above the national average. If you wanted to study the grade performance of students who
drink Pepsi, it would not be reasonable to test if grades are higher or lower than a norm. The best decision is to use a non-directional
approach unless you have justification to pick a direction.
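The halving of the p-value can be made concrete. With a hypothetical test statistic of z = 1.8 (chosen for illustration), the upper-tailed p-value sits just under .05 while the two-tailed p-value sits just above it, so the choice of direction alone flips the conclusion at the .05 level:

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

z = 1.8                        # a hypothetical test statistic in the upper tail
p_upper = 1 - normal_cdf(z)    # upper-tailed p-value, about .036
p_two = 2 * p_upper            # two-tailed p-value, about .072
```

This is exactly why choosing a direction after seeing the data is irresponsible: the direction must be justified before the test is run.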
Getting to a Conclusion
Remember that the aim of a study is to prove the alternate hypothesis
(often called the research hypothesis). Researchers spend a lot of time and money trying to prove the alternate is true, as
there is no value in failing to reject the null and being where they began. Prosecutors do not want to go to trial unless
they can convict, or else it is a waste of time and money. Since the object is to reject the null and conclude that the alternate
is true, researchers do what they can to help their odds. A directional test will cut the p-value and help the chances of
rejection, but it must be appropriate to do so. Raising the level of significance gives a wider berth for rejecting the null,
but is that much error really tolerable? Lastly, increasing the sample size will make a small difference more significant
(it is always acceptable to take an additional sample if your p-value is close to the rejection mark, but it may not be convenient). If Bush had 60% support
in a poll, it may not be statistically significant (e.g., 3 out of 5 surveyed support him), but if he had 51%, it may be statistically
significant (e.g., 51,000 out of 100,000 support him). The percentages can be misleading; the sample size plays a big factor
in determining if the results are truly significant.
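The Bush example can be checked directly with the normal-approximation proportion test from earlier in this reading. The two calls below use the exact counts given above (3 of 5 vs. 51,000 of 100,000):

```python
import math

def two_tailed_p(successes, n, p0=0.50):
    """Two-tailed p-value for H0: p = p0, using the normal approximation."""
    z = (successes / n - p0) / math.sqrt(p0 * (1 - p0) / n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

p_small = two_tailed_p(3, 5)               # 60% of a tiny sample
p_large = two_tailed_p(51_000, 100_000)    # 51% of a huge sample
```

The 60% result is nowhere near significant, while the 51% result is overwhelmingly so; the sample size, not the percentage, drives the conclusion. (With n = 5 the normal approximation is a stretch; an exact binomial test would be preferred there, but the contrast holds either way.)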
Hypothesis testing is truly a shift in thought, but it makes sense. Once the hypotheses are set up properly, things get easier. Of course,
the results are completely meaningless if the data is not valid or reliable. Do not cut corners just to get the wanted results.
Not many people would want to try drugs that were improperly tested by an anxious researcher who just wanted to get published.
Even if a researcher proves that a drug cures what ails you, a small percentage of studies will incur a Type I error,
which is why reputable pharmaceutical companies will re-test; the aim is to be sure. Never let ethics take a backseat to desired
results. Hypothesis testing should open researchers' eyes to how election polling is done, how medical tests are conducted,
how psychological studies are done, etc., and should give the researcher a greater appreciation for the results.