The test described here is more fully the null-hypothesis statistical significance test. The null hypothesis is a conjecture made solely to be falsified by the sample. Statistical significance is a possible finding of the test – that the sample is unlikely to have occurred by chance given the truth of the null hypothesis. The name of the test describes its formulation and its possible outcome. One characteristic of the test is its crisp decision: reject or do not reject (which is not the same as accept). A calculated value is compared to a threshold, which is determined from the tolerable risk of error.
This article describes the commonly used frequentist treatment of hypothesis testing. From the Bayesian point of view, it is appropriate to treat hypothesis testing as a special case of normative decision theory (specifically a model selection problem) and it is possible to accumulate evidence in favor of (or against) a hypothesis using concepts such as likelihood ratios known as Bayes factors.
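As a rough illustration of how evidence can be accumulated with a Bayes factor, the sketch below compares two simple (fully specified) hypotheses about a success probability, for which the Bayes factor reduces to a likelihood ratio. The binomial setting, the function names and the numbers are assumptions chosen for illustration; they do not come from this article.

```python
from math import comb


def binom_likelihood(k, n, p):
    """Probability of k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def bayes_factor(k, n, p1, p0):
    """Likelihood ratio of simple hypothesis p1 against simple hypothesis p0."""
    return binom_likelihood(k, n, p1) / binom_likelihood(k, n, p0)


# Hypothetical data: 15 successes in 20 trials, comparing p = 0.75 with p = 0.5.
bf_first = bayes_factor(15, 20, p1=0.75, p0=0.5)

# A second, independent batch with the same counts multiplies the evidence.
bf_combined = bf_first * bayes_factor(15, 20, p1=0.75, p0=0.5)

print(bf_first, bf_combined)
```

A value above 1 favors the first hypothesis and a value below 1 favors the second. Unlike the reject / do not reject outcome, the ratio can express support for either hypothesis.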
There are several preparations we make before we observe the data.
After the data are available, the test statistic is calculated and we determine whether it is inside the critical region.
If the test statistic is inside the critical region, then our conclusion is one of the following:
The researcher has to choose between these logical alternatives. In the example we would say: the observed response to treatment is statistically significant.
If the test statistic is outside the critical region, the only conclusion is that there is not enough evidence to reject the null hypothesis. This is not the same as evidence in favor of the null hypothesis, which these arguments cannot supply: lack of evidence against a hypothesis is not evidence for it. On this basis, statistical research progresses by eliminating error, not by finding the truth.
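The decision procedure can be sketched in a few lines. The following is a minimal illustration, assuming a two-sided one-sample z-test with known σ. The function name `z_test`, the 5% level and the sample figures are hypothetical choices, not values taken from this article.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided one-sample z-test.

    Returns the test statistic, the critical value, and whether the
    statistic falls inside the critical region |z| > z_crit.
    """
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # threshold set by the tolerable risk of error
    return z, z_crit, abs(z) > z_crit


# Hypothetical sample: mean 103 from n = 100 observations, null mean 100, sigma 15.
z, z_crit, reject = z_test(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100)
print(z, z_crit, "reject" if reject else "do not reject")
```

Note that the function only ever reports "reject" or "do not reject"; it has no outcome corresponding to "accept".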
|One-sample z-test||(Normal distribution or n > 30) and σ known.|
(z is the distance from the mean in relation to the standard deviation of the mean). For non-normal distributions it is possible to calculate a minimum proportion of a population that falls within k standard deviations for any k (see: Chebyshev's inequality).
|Two-sample z-test||Normal distribution and independent observations and (both σ1 and σ2 known)|
|One-sample t-test||(Normal population or n > 30) and σ unknown|
|Paired t-test||(Normal population of differences or n > 30) and σ unknown|
|One-proportion z-test||n p > 10 and n (1 − p) > 10 and the sample is a simple random sample (SRS)|
|Two-proportion z-test, equal variances||n1 p1 > 5 and n1(1 − p1) > 5 and n2 p2 > 5 and n2(1 − p2) > 5 and independent observations|
|Two-proportion z-test, unequal variances||n1 p1 > 5 and n1(1 − p1) > 5 and n2 p2 > 5 and n2(1 − p2) > 5 and independent observations|
|Two-sample pooled t-test||(Normal populations or n1 + n2 > 40) and independent observations and σ1 = σ2 and (σ1 and σ2 unknown)|
|Two-sample unpooled t-test||(Normal populations or n1 + n2 > 40) and independent observations and σ1 ≠ σ2 and (σ1 and σ2 unknown)|
|Definition of symbols||n = sample size; x̄ = sample mean; μ = population mean; σ = population standard deviation; t = t statistic; df = degrees of freedom; n1 = sample 1 size; n2 = sample 2 size; s1 = sample 1 std. deviation; s2 = sample 2 std. deviation; d̄ = sample mean of differences; μ_d = population mean difference; s_d = std. deviation of differences; p1 = proportion 1; p2 = proportion 2; μ1 = population 1 mean; μ2 = population 2 mean; min(n1, n2) = minimum of n1 and n2|
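To make two of the tabulated tests concrete, the sketch below computes the one-sample t statistic and the two-sample unpooled t statistic using only the Python standard library. The function names are hypothetical, and the Welch–Satterthwaite degrees of freedom used for the unpooled test is one common choice assumed here rather than something stated in the table.

```python
from math import sqrt
from statistics import mean, variance


def one_sample_t(xs, mu0):
    """t statistic and degrees of freedom for H0: population mean equals mu0."""
    n = len(xs)
    t = (mean(xs) - mu0) / sqrt(variance(xs) / n)
    return t, n - 1


def unpooled_t(xs, ys):
    """Two-sample t statistic without pooling the variances.

    Degrees of freedom use the Welch-Satterthwaite approximation
    (an assumption of this sketch).
    """
    n1, n2 = len(xs), len(ys)
    v1, v2 = variance(xs) / n1, variance(ys) / n2  # squared standard errors
    t = (mean(xs) - mean(ys)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


# Hypothetical measurements, for illustration only.
a = [5.1, 4.9, 5.6, 5.8, 6.0]
b = [4.8, 5.0, 5.2, 4.7, 5.1, 4.9]
print(one_sample_t(a, mu0=5.0))
print(unpooled_t(a, b))
```

The resulting statistic is then compared with a critical value of the t distribution for the computed degrees of freedom, which the standard library does not provide.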
Hypothesis testing is largely the product of Ronald Fisher, Jerzy Neyman, Karl Pearson and his son Egon Pearson. Fisher was an agricultural statistician who emphasized rigorous experimental design and methods to extract a result from few samples assuming Gaussian distributions. Neyman (who teamed with the younger Pearson) emphasized mathematical rigor and methods to obtain more results from many samples and a wider range of distributions. Modern hypothesis testing is an (extended) hybrid of the Fisher and Neyman/Pearson formulations, methods and terminology developed in the early 20th century.
The following example is summarized from Fisher. Fisher thoroughly explained his method in a proposed experiment to test a Lady's claimed ability to determine the means of tea preparation by taste. The article is less than 10 pages in length and is notable for its simplicity and completeness regarding terminology, calculations and design of the experiment. The example is loosely based on an event in Fisher's life. The Lady proved him wrong.
Fisher was willing to reject the null hypothesis if, and only if, the 8 trials produced 8 successes – effectively acknowledging the Lady's ability with > 98% confidence (but without quantifying her ability). Fisher later discussed the benefits of more trials and repeated tests.
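The "> 98%" figure can be checked with a short calculation. The sketch below assumes the classic design of the experiment, which this summary does not spell out: 4 of the 8 cups are prepared each way, and the Lady, knowing this, must pick out the 4 of one kind. Under the null hypothesis every selection of 4 cups is equally likely.

```python
from math import comb

cups, of_one_kind = 8, 4  # assumed design: 4 cups prepared each way

arrangements = comb(cups, of_one_kind)  # equally likely selections under the null hypothesis
p_all_correct = 1 / arrangements        # chance of a perfect score by guessing alone

print(arrangements, round(p_all_correct, 4), round(1 - p_all_correct, 4))
# → 70 0.0143 0.9857
```

Any less-than-perfect result has a probability under the null hypothesis well above 5%, which is consistent with Fisher rejecting only on 8 successes.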
Some statisticians have commented that pure "significance testing" has the rather strange goal of detecting the existence of a "real" difference between two populations. In practice, a difference can almost always be found given a large enough sample; the more relevant goal of science is typically to determine the size of a causal effect. In other words, the amount and nature of the difference is what should be studied. Many researchers also feel that hypothesis testing is something of a misnomer: in practice, a single statistical test in a single study never "proves" anything.
Even when you reject a null hypothesis, effect sizes should be taken into consideration. If the effect is statistically significant but the effect size is very small, then it is a stretch to consider the effect theoretically important.
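The gap between statistical significance and effect size can be shown numerically. The sketch below is an idealized illustration, assuming a one-sample z-test in which the observed mean differs from the null value by exactly 0.01 standard deviations. The function name and the sample sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect_size, n):
    """Two-sided p-value of a z-test when the observed standardized effect is effect_size."""
    z = effect_size * sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The same tiny effect (0.01 standard deviations) at increasing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p(0.01, n))
```

The effect size never changes, yet the p-value falls from far above 0.05 to far below it as the sample grows, so the significant result at the largest n says nothing about the effect being important.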
Little criticism of the technique appears in introductory statistics texts. Criticism is of the application, or of the interpretation, rather than of the method.
Criticism of null-hypothesis significance testing is available in other articles ("Null-hypothesis" and "Statistical significance") and their references. Attacks and defenses of the null-hypothesis significance test are collected in Harlow et al.
The original purpose of Fisher's formulation, as a tool for the experimenter, was to plan the experiment and to easily assess the information content of the small sample. There is little criticism of the formulation in its original context, and what exists is Bayesian in nature.
In other contexts, complaints focus on flawed interpretations of the results and over-dependence/emphasis on one test.
Numerous attacks on the formulation have failed to supplant it as a criterion for publication in scholarly journals. The most persistent attacks originated from the field of psychology. After review, the American Psychological Association did not explicitly deprecate the use of null-hypothesis significance testing, but adopted enhanced publication guidelines, which implicitly reduced the relative importance of such testing. The International Committee of Medical Journal Editors recognizes an obligation to publish negative (not statistically significant) studies under some circumstances. The applicability of null-hypothesis testing to the publication of observational (as contrasted to experimental) studies is doubtful.
Philosophical criticism of hypothesis testing includes consideration of borderline cases.
Any process that produces a crisp decision from uncertainty is subject to claims of unfairness near the decision threshold. (Consider close election results.) The premature death of a laboratory rat during testing can impact doctoral theses and academic tenure decisions. "... surely, God loves the .06 nearly as much as the .05."
The statistical significance required for publication has no mathematical basis, but is based on long tradition.
"It is usual and convenient for experimenters to take 5% as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means, to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results."
Fisher, in the cited article, designed an experiment to achieve a statistically significant result based on sampling 8 cups of tea.
Ambivalence attacks all forms of decision making. A mathematical decision-making process is attractive because it is objective and transparent. It is repulsive because it allows authority to avoid taking personal responsibility for decisions.
Pedagogic criticism of null-hypothesis testing includes the counter-intuitive formulation, the terminology and confusion about the interpretation of results.
"Despite the stranglehold that hypothesis testing has on experimental psychology, I find it difficult to imagine a less insightful means of transiting from data to conclusions."
Students find it difficult to understand the formulation of statistical null-hypothesis testing. In rhetoric, examples often support an argument, but a mathematical proof "is a logical argument, not an empirical one". A single counterexample results in the rejection of a conjecture. Karl Popper defined science by its vulnerability to disproof by data. Null-hypothesis testing shares the mathematical and scientific perspective rather than the more familiar rhetorical one. Students expect hypothesis testing to be a statistical tool for illumination of the research hypothesis by the sample; it is not. The test asks indirectly whether the sample can illuminate the research hypothesis.
Students also find the terminology confusing. While Fisher disagreed with Neyman and Pearson about the theory of testing, their terminologies have been blended. The blend is not seamless or standardized. While this article teaches a pure Fisher formulation, even it mentions Neyman and Pearson terminology (Type II error and the alternative hypothesis). The typical introductory statistics text is less consistent. The Sage Dictionary of Statistics would not agree with the title of this article, which it would call null-hypothesis testing.
"...there is no alternate hypothesis in Fisher's scheme: Indeed, he violently opposed its inclusion by Neyman and Pearson." In discussing test results, "significance" often has two distinct meanings in the same sentence; one is a probability, the other is a subject-matter measurement (such as currency). The significance (meaning) of (statistical) significance is significant (important).
There is widespread and fundamental disagreement on the interpretation of test results.
"A little thought reveals a fact widely understood among statisticians: The null hypothesis, taken literally (and that's the only way you can take it in formal hypothesis testing), is almost always false in the real world.... If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null hypothesis is always false, what's the big deal about rejecting it?" (The above criticism only applies to point hypothesis tests. If one were testing, for example, whether a parameter is greater than zero, it would not apply.)
"How has the virtually barren technique of hypothesis testing come to assume such importance in the process by which we arrive at our conclusions from our data?"
Null-hypothesis testing just answers the question of "how well the findings fit the possibility that chance factors alone might be responsible."
Null-hypothesis significance testing does not determine the truth or falsity of claims. It determines whether confidence in a claim based solely on a sample-based estimate exceeds a threshold. It is a research quality assurance test, widely used as one requirement for publication of experimental research with statistical results. It is uniformly agreed that statistical significance is not the only consideration in assessing the importance of research results. Rejecting the null hypothesis is not a sufficient condition for publication.
"Statistical significance does not necessarily imply practical significance!"
Practical criticism of hypothesis testing includes the sobering observation that published test results are often contradicted. Mathematical models support the conjecture that most published medical research test results are flawed. Null-hypothesis testing has not achieved the goal of a low error probability in medical journals.
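One simple way to see how significant results can often be wrong is to apply Bayes' rule to a population of tested hypotheses. The sketch below is an illustrative toy calculation, not the specific model behind the claim above: the function name, the prior share of true effects, the power and the significance level are all assumed values, and bias and multiple testing are ignored.

```python
def false_share_of_significant(prior_true, power, alpha):
    """Fraction of statistically significant results that are false positives.

    prior_true: assumed share of tested hypotheses with a real effect
    power:      probability of detecting a real effect
    alpha:      significance level (false-positive rate under the null)
    """
    true_positives = prior_true * power
    false_positives = (1 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)


# Hypothetical scenarios: well-motivated hypotheses vs. long-shot hypotheses.
print(false_share_of_significant(prior_true=0.5, power=0.8, alpha=0.05))
print(false_share_of_significant(prior_true=0.05, power=0.5, alpha=0.05))
```

Under the second set of assumptions, more than half of the significant findings are false even though every individual test respects the 5% level.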
Jones and Tukey suggested a modest improvement in the original null-hypothesis formulation to formalize handling of one-tail tests. Fisher ignored the 8-failure case (as improbable as the 8-success case) in the example tea test, which altered the claimed significance by a factor of 2.
Killeen proposed an alternative statistic that estimates the probability of duplicating an experimental result. It "provides all of the information now used in evaluating research, while avoiding many of the pitfalls of traditional statistical inference."