A researcher never knows whether an error has been committed in statistical decision making

A null hypothesis is either true or false. Unfortunately, we do not know which is the case, and we almost never will. It is important to realize that there is no probability that the null hypothesis is true or that it is false, because there is no element of chance. For example, if you are testing whether a potential mine has a greater gold concentration than that of a break-even mine, the null hypothesis that your potential mine has a gold concentration no greater than that of a break-even mine is either true or it is false; you just don’t know which. There is no probability (in a frequentist sense) associated with these two cases because the gold is already in the ground: everything is already set, and there is no element of chance. All we have is our own uncertainty about the null hypothesis.

This lack of knowledge about the null hypothesis is why we need to perform a statistical test: we want to use our data to make an inference about the null hypothesis. Specifically, we need to decide if we are going to act as if the null hypothesis is true or act as if it is false. From our hypothesis test, we therefore choose either to accept or to reject the null hypothesis. If we accept the null hypothesis, we are stating that our data are consistent with the null hypothesis (recognizing that other hypotheses might also be consistent with the data). If we reject the null hypothesis, we are stating that our data are so unexpected that they are inconsistent with the null hypothesis.

Our decision will change our behavior. If we reject the null hypothesis, we will act as if the null hypothesis is false, even though we do not know whether it is in fact false. If we accept the null hypothesis, we will act as if the null hypothesis is true, even though we have not demonstrated that it is in fact true. This is a critical point: regardless of the results of our statistical test, we will never know whether the null hypothesis is true or false. In other words, we do not prove or disprove null hypotheses, and we never will; we never show that a null hypothesis is true or that it is false.

In short, we operate in a world where hypotheses are true or false, but we don’t know which. What we would like to do is perform statistical tests that allow us to make decisions (accept or reject), and we would like these to be correct decisions.

Because we have two possibilities for the null hypothesis (true or false) and two possibilities for our decision (accept or reject), there are four possible scenarios. Two of these are correct decisions: we could accept a true null hypothesis or we could reject a false null hypothesis. The other two cases are errors. If we reject a true null hypothesis, we have committed a type I error. If we accept a false null hypothesis, we have made a type II error.

Each of these four possibilities has some probability of occurring, and those probabilities depend on whether the null hypothesis is true or false.

If the null hypothesis is true, there are only two possibilities: we will reject it, with a probability of alpha (α), or we will accept it, with a probability of 1-α. Rejecting a true null hypothesis is called a false positive, such as when a medical test says you have a disease when you do not.

If the null hypothesis is false, there are also only two possibilities: we will accept it, with a probability of beta (β), or we will reject it, with a probability of 1-β. Accepting a false null hypothesis is called a false negative, such as when a medical test says you do not have a disease when you actually do.

Because these probabilities depend on whether the null hypothesis is true or false, it is the probabilities within a row of the decision table (that is, for a given state of the null hypothesis) that sum to 100%: α and 1-α when the null hypothesis is true, β and 1-β when it is false. The probabilities within a column (for a given decision) do not sum to 100%, except by lucky coincidence.
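
To make this structure concrete, here is a minimal sketch of the decision table in R; the particular values of alpha and beta are arbitrary illustrative numbers, not values from any real test.

# build the decision table: rows are the state of the null hypothesis,
# columns are our decision; alpha and beta are illustrative values
alpha <- 0.05
beta <- 0.20
decisionTable <- matrix(c(1-alpha, alpha, beta, 1-beta), nrow=2, byrow=TRUE,
  dimnames=list(c('Ho is true', 'Ho is false'), c('accept Ho', 'reject Ho')))
decisionTable

rowSums(decisionTable)   # each row sums to 1 (100%)
colSums(decisionTable)   # the columns generally do not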

Because we don’t know whether the null hypothesis is true, we need to guard against both types of error. If the null hypothesis is true, we want to reduce our chance of making a type I error, that is, to keep alpha small. If the null hypothesis is false, we want to reduce our chance of making a type II error, that is, to keep beta small. Because we will never know whether the null hypothesis is true or false, we need a strategy that does both, keeping alpha and beta simultaneously as small as possible.

Significance and confidence

Keeping the probability of a type I error low is straightforward, because we choose our significance level (α). If we are especially concerned about making a type I error, we can set our significance level to be as small as we wish.

If the null hypothesis is true, we have a 1-α probability that we will make the correct decision and accept it. We call that probability (1-α) our confidence level. Confidence and significance sum to one because rejecting and accepting a null hypothesis are the only possible choices when the null hypothesis is true. Therefore, when we decrease significance, we increase our confidence. Although you might think you would always want confidence to be as high as possible, doing so comes at a high cost: we make type II errors more likely.
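
As a quick check on this relationship, the short simulation below repeatedly tests data that genuinely come from the null distribution and counts how often the test rejects. The null mean of 4.5, the sample size of 20, and the significance level are arbitrary illustrative choices, echoing the demonstration code at the end of this handout.

# when the null hypothesis is true, the rejection rate is approximately alpha
# and the acceptance rate is approximately 1-alpha (our confidence)
alpha <- 0.05
nullMean <- 4.5
iterations <- 10000
pValues <- replicate(iterations,
  t.test(rnorm(20, mean=nullMean, sd=1), mu=nullMean, alternative='greater')$p.value)
mean(pValues < alpha)    # close to alpha: type I error rate
mean(pValues >= alpha)   # close to 1-alpha: confidence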

Beta and power

Keeping the probability of a type II error small is more complicated.

When the null hypothesis is false, β is the probability that we will make the wrong choice and accept it (a type II error). Beta is nearly always unknown, because calculating it requires knowing the true value of the parameter, that is, the true hypothesis underlying our data. If we knew that, we wouldn’t need statistics.

If the null hypothesis is false, there is a 1-β probability that we will make the right choice and reject it. The probability that we will make the right choice when the null hypothesis is false is called statistical power. Power reflects our ability to reject false null hypotheses and detect new phenomena in the world. We must try to maximize power. Power is controlled by four factors (see the simulation sketch after this list):

  • Power increases with the size of the effect that we are trying to detect. For example, it is easier to detect a large difference in means than a small one. We cannot control effect size, because it is determined by the problem we are studying. The remaining three factors, however, are entirely under our control.
  • Sample size (n) has a major effect on power. Increasing sample size increases power. We should always strive for as large a sample size as money and time allow, as this is the best way to increase power.
  • Some statistical tests have greater power than other tests. In general, parametric tests (ones that assume a particular distribution, often a normal distribution) have greater power than nonparametric tests (those that do not assume a particular distribution).
  • Our significance level affects β. Increasing alpha (significance) will increase our power, but it also increases the risk of rejecting a true null hypothesis.
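
The simulation below is one way to see several of these factors at work. It follows the approach of the demonstration code at the end of this handout: the null mean of 4.5, the true means, the standard deviation, and the sample sizes are the same illustrative values used there, and power is estimated as the proportion of simulated sample means from the true distribution that land beyond the critical value.

# estimate power by simulation: the probability that a sample mean drawn from
# the true distribution exceeds the critical value set by the null distribution
nullMean <- 4.5
populationSd <- 1.0
iterations <- 10000

estimatePower <- function(n, alpha, trueMean) {
  # critical value of the sample mean under the null hypothesis (one-tailed test)
  nullMeans <- replicate(iterations, mean(rnorm(n, nullMean, populationSd)))
  criticalValue <- quantile(nullMeans, 1-alpha)
  # proportion of sample means from the true distribution that would be rejected
  trueMeans <- replicate(iterations, mean(rnorm(n, trueMean, populationSd)))
  mean(trueMeans > criticalValue)
}

estimatePower(n=20, alpha=0.05, trueMean=5.0)   # baseline
estimatePower(n=60, alpha=0.05, trueMean=5.0)   # larger sample size: more power
estimatePower(n=20, alpha=0.05, trueMean=4.6)   # smaller effect size: less power
estimatePower(n=20, alpha=0.01, trueMean=5.0)   # smaller alpha: less power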

Our strategy for minimizing errors

We keep the probability of committing a type I error small by keeping significance (α) as small as possible.

We keep our probability of committing a type II error small by (1) increasing sample size as much as possible, given the constraints of time and money, (2) choosing tests with high power, and (3) not making significance (α) too small. As a data analyst, you should always be thinking of ways to do all three.

Tradeoffs in α and β

We would like to minimize the probability of making type I and type II errors, but for a given sample size and test there is a tradeoff in α and β: decreasing one necessarily increases the other. We cannot simultaneously minimize both, so we will have to prioritize one or the other. Which one we prioritize will depend on the circumstances.
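
One way to see this tradeoff is with power.t.test() in base R, holding sample size and effect size fixed while shrinking alpha; beta is one minus the reported power. The effect size and standard deviation here are illustrative values, not ones taken from any particular study.

# as alpha is made smaller (with n and effect size fixed), beta grows
alphas <- c(0.10, 0.05, 0.01)
betas <- sapply(alphas, function(a)
  1 - power.t.test(n=20, delta=0.5, sd=1, sig.level=a,
    type='one.sample', alternative='one.sided')$power)
data.frame(alpha=alphas, beta=betas)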

For example, if we were doing quality control at a factory, our null hypothesis would be that the current production run meets some minimum standard of quality. We want to maximize our profits, so we do not want to be overly cautious and discard too many acceptable production runs; we therefore set α to be low, because type I errors cost us money. For this reason, α is known as producer’s risk. If we are a consumer, we do not want defective goods, so if the null hypothesis is false (the goods are defective), we want the manufacturer to catch this and not ship defective products to consumers; in other words, we want β to be low. β is therefore known as consumer’s risk.

Another example comes from court proceedings, where we can contrast the burden of proof in criminal and civil trials.

In a criminal trial, an individual is pitted against the state, and the null hypothesis is that the defendant is innocent. The founders of our government did not want the state to be so powerful that it could take away the rights of innocent people, so the standard of proof in a criminal trial is “proof beyond a reasonable doubt”. In effect, this sets α (the probability of convicting an innocent person) to be very low, accepting the risk that some guilty people will walk free (a type II error).

In civil trials, individuals are pitted against one another, and there is no reason to favor one party over another. The standard of proof is therefore “a preponderance of the evidence”, that is, whoever presents the stronger case wins. In effect, this makes alpha much larger, creating a greater balance between type I and type II errors. Errors will not preferentially favor one party or the other.

Demonstration Code

This first block will demonstrate the relationship between alpha and beta for a small sample (n=20). This and the next block will generate color versions of the figures used in class.

dev.new(height=6, width=6)
par(mfrow=c(2, 1))

# simulate Ho distribution for mean of small sample (n=20)
nullMean <- 4.5
nullStandardDeviation <- 1.0
smallSample <- 20
iterations <- 10000
smallSampleMeans <- replicate(iterations, mean(rnorm(smallSample, nullMean, nullStandardDeviation)))

# plot the distribution of the mean
breaks <- seq(3, 6, 0.05)
range <- c(min(breaks), max(breaks))
hist(smallSampleMeans, breaks=breaks, col='gray', main='Hypothesized Mean n=20', xlab='sample mean', xlim=range)

# show the critical value
alpha <- 0.05
criticalValue <- quantile(smallSampleMeans, 1-alpha)
abline(v=criticalValue, col='blue', lwd=3)
text(criticalValue, 800, 'critical value', pos=4, col='blue')
criticalValue

# show alpha and 1-alpha
text(5.5, 0, expression(alpha), cex=2, pos=3, col='blue')
text(3.5, 0, expression(1-alpha), cex=2, pos=3, col='blue')

# add an observed mean
observedMean <- 5.1
arrows(observedMean, 600, 5.1, 50, angle=20, length=0.15, lwd=3, col='red')
text(observedMean, 620, 'observed mean', pos=4, col='red')

# simulate true distribution for mean of small sample (n=20)
# remember, this is unknowable to us
trueMean <- 5.0
trueSd <- 1.0
trueSampleMeans <- replicate(iterations, mean(rnorm(smallSample, trueMean, trueSd)))

# plot the distribution
hist(trueSampleMeans, breaks=breaks, col='gray', main='True Mean (Unknowable)', xlab='mean', xlim=range)

# copy the observed mean and the critical values
arrows(observedMean, 600, 5.1, 50, angle=20, length=0.15, lwd=3, col='red')
abline(v=criticalValue, col='blue', lwd=3)

# show beta and 1-beta
text(5.5, 0, expression(1-beta), cex=2, pos=3, col='blue')
text(3.5, 0, expression(beta), cex=2, pos=3, col='blue')

# Note that magic numbers were used for coordinates of arrows and text labels
# If you modify the code, you will likely need to change these numbers

This second block will do the same, but for a large sample (n=60).

dev.new(height=6, width=6)
par(mfrow=c(2, 1))

# simulate Ho distribution for mean of large sample (n=60)
nullMean <- 4.5
nullStandardDeviation <- 1.0
largeSample <- 60
iterations <- 10000
largeSampleMeans <- replicate(iterations, mean(rnorm(largeSample, nullMean, nullStandardDeviation)))

# plot the distribution of the mean
breaks <- seq(3, 6, 0.05)
range <- c(min(breaks), max(breaks))
hist(largeSampleMeans, breaks=breaks, col='gray', main='Hypothesized Mean n=60', xlab='sample mean', xlim=range)

# show the critical value
alpha <- 0.05
criticalValue <- quantile(largeSampleMeans, 1-alpha)
abline(v=criticalValue, col='blue', lwd=3)
text(criticalValue, 1400, 'critical value', pos=4, col='blue')
criticalValue

# show alpha and 1-alpha
text(5.5, 0, expression(alpha), cex=2, pos=3, col='blue')
text(3.5, 0, expression(1-alpha), cex=2, pos=3, col='blue')

# add an observed mean
observedMean <- 5.1
arrows(observedMean, 1000, 5.1, 50, angle=20, length=0.15, lwd=3, col='red')
text(observedMean, 1020, 'observed mean', pos=4, col='red')

# simulate true distribution for mean of large sample (n=60)
trueMean <- 5.0
trueSd <- 1.0
trueLargeSampleMeans <- replicate(iterations, mean(rnorm(largeSample, trueMean, trueSd)))

# plot the distribution
hist(trueLargeSampleMeans, breaks=breaks, col='gray', main='True Mean (Unknowable)', xlab='mean', xlim=range)

# copy the observed mean and the critical values
arrows(observedMean, 1000, 5.1, 50, angle=20, length=0.15, lwd=3, col='red')
abline(v=criticalValue, col='blue', lwd=3)

# show beta and 1-beta
text(5.5, 0, expression(1-beta), cex=2, pos=3, col='blue')
text(3.5, 0, expression(beta), cex=2, pos=3, col='blue')

This third block is like the first (n=20), but shows a smaller effect size (that is, the actual mean is closer to the hypothesized mean). This was shown in class, but it was not included on the handout.

dev.new(height=6, width=6)
par(mfrow=c(2, 1))

# simulate Ho distribution for mean of small sample (n=20)
nullMean <- 4.5
nullStandardDeviation <- 1.0
smallSample <- 20
iterations <- 10000
smallSampleMeans <- replicate(iterations, mean(rnorm(smallSample, nullMean, nullStandardDeviation)))

# plot the distribution of the mean
breaks <- seq(3, 6, 0.05)
range <- c(min(breaks), max(breaks))
hist(smallSampleMeans, breaks=breaks, col='gray', main='Hypothesized Mean n=20', xlab='sample mean', xlim=range)

# show the critical value
alpha <- 0.05
criticalValue <- quantile(smallSampleMeans, 1-alpha)
abline(v=criticalValue, col='blue', lwd=3)
text(criticalValue, 800, 'critical value', pos=4, col='blue')
criticalValue

# show alpha and 1-alpha
text(5.5, 0, expression(alpha), cex=2, pos=3, col='blue')
text(3.5, 0, expression(1-alpha), cex=2, pos=3, col='blue')

# add an observed mean
observedMean <- 5.1
arrows(observedMean, 600, 5.1, 50, angle=20, length=0.15, lwd=3, col='red')
text(observedMean, 620, 'observed mean', pos=4, col='red')

# simulate true distribution for mean of small sample (n=20)
# remember, this is unknowable to us
trueMean <- 4.6
trueSd <- 1.0
trueSampleMeans <- replicate(iterations, mean(rnorm(smallSample, trueMean, trueSd)))

# plot the distribution
hist(trueSampleMeans, breaks=breaks, col='gray', main='True Mean (Unknowable)', xlab='mean', xlim=range)

# copy the observed mean and the critical values
arrows(observedMean, 600, 5.1, 50, angle=20, length=0.15, lwd=3, col='red')
abline(v=criticalValue, col='blue', lwd=3)

# show beta and 1-beta
text(5.5, 0, expression(1-beta), cex=2, pos=3, col='blue')
text(3.5, 0, expression(beta), cex=2, pos=3, col='blue')