Bonferroni's word to the wise

A cautionary tale

General idea: If you run enough tests against a dataset, you'll eventually confirm a hypothesis with statistical significance α. That doesn't mean it's true!
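To put a rough number on this, here is a short worked equation. It assumes the m tests are independent and that each one has a false-positive rate of exactly α when no real effect exists; both are idealizations added here for illustration. Under those assumptions, the chance of at least one spurious "significant" result is:

$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^m$$

With α = 0.05 and m = 20, this is 1 - 0.95^20 ≈ 0.64, so a "discovery" is more likely than not even when nothing real is going on.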

Solution: When running m tests against a dataset at significance level α, hold each individual hypothesis to the stricter significance criterion of α/m.
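The correction itself is a one-line division. The sketch below shows it under stated assumptions: the function names `bonferroni_threshold` and `bonferroni_reject` and the five p-values are hypothetical, chosen only for illustration.

```python
def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test significance threshold when m tests share an overall level alpha."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return alpha / m


def bonferroni_reject(p_values, alpha=0.05):
    """Return one boolean per p-value: True if it survives the correction."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


# Hypothetical p-values from five tests run against the same dataset.
p_values = [0.001, 0.012, 0.03, 0.04, 0.2]

# Four of the five fall below 0.05, but only the first falls below 0.05 / 5.
print(bonferroni_reject(p_values))
```

Without the correction, four of these five results would be reported as significant at α = 0.05; with it, only the first one is.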

Background: The expected value of a discrete random variable is generally calculated as:

$$E(x) = \sum_{i=1}^{n} x_i p_i$$

For a binomial outcome (counting how many times an event occurs in n independent trials), this simplifies to:

$$E(x) = np(x) $$

Intuitively, the number of trials n and the probability of an event p(x) both contribute to the likelihood that you will see an interesting event. The Bonferroni Principle warns us to tune both the definition of an interesting event (p(x)) and the number of trials (n) such that the likelihood of flagging true positives is not overpowered by the sheer random probability of our criteria being satisfied.

Am I in danger of being fooled? Assuming the data selection method is random, use your sample size to calculate the expected number of interesting events. If this number is larger than the number of genuine events you expect to find, your definition of an 'interesting event' may not be stringent enough, or you may be testing too many times. You'll have to cull the set of interesting events further to avoid being flooded with false positives.
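This check can be sketched in a few lines. The function name `expected_chance_events` and every number below (10,000 trials, a 5% chance rate, 10 hoped-for events) are hypothetical values picked for illustration, not figures from any real study.

```python
def expected_chance_events(n_trials: int, p_event: float) -> float:
    """Expected number of 'interesting' events produced by chance alone: n * p."""
    return n_trials * p_event


# Hypothetical setup: 10,000 trials, each with a 5% chance of
# satisfying the 'interesting event' criteria purely at random.
n_trials = 10_000
p_event = 0.05

# Hypothetical number of genuine events we hope to find.
hoped_for = 10

expected = expected_chance_events(n_trials, p_event)
in_danger = expected > hoped_for

print(f"Expected chance events: {expected:.0f}")
print("Criteria too loose: chance alone swamps the signal" if in_danger
      else "Chance events are fewer than the events we hope to find")
```

Here chance alone is expected to produce about 500 events, far more than the 10 hoped for, so the criteria would need tightening.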

What can I do? You can either use a more hypothesis-driven approach to define the interesting event (i.e. make your criteria more stringent), or attempt to correct your statistical significance estimations using the Bonferroni Correction.

An example illustrating Bonferroni's Principle

from Mining of Massive Datasets, by Leskovec, Rajaraman, Ullman

You are observing hotel visits to try to identify terrorist organization meetings. There are 1 billion people under observation, and on average each person visits a hotel on 1 day in every 100. Each hotel holds 100 people, so we can assume there are 100,000 hotels, enough to hold the 1% of the population that visits a hotel on any given day. You're examining hotel records over 1000 days. You define a suspicious duo as two people who visited the same hotel on the same day on two separate occasions.

  • Sample size: $n_{people} = 10^9$
  • P(hotel visit), given $H_0$: $p_{random} = 0.01$
  • Number of hotels: $n_{hotels} = 10^5$
  • Number of days observed: $t = 1000$

Joint probability of two people visiting a hotel on a particular day:

$$ p_{random}^2 = 0.0001 $$

Probability that both people visit the same hotel on that day (with every hotel equally likely to be chosen):

$$\frac{p_{random}^2}{n_{hotels}} = \frac{0.0001}{10^5} = 10^{-9} $$

For the event to be considered interesting, this pair has to visit the same hotel on the same day twice:

$$P(event) = (10^{-9})^2 = 10^{-18}$$

This is the probability of the interesting event occurring for one particular pair of people on one particular pair of days, given random chance. Let's use the formula for expected value and our sample size n to estimate the number of interesting events we should see by random chance.

The number of possible pairs of people is $n_{people}$ choose 2:

$$\frac{n_{people}!}{(n_{people}-2)!~2!} = \frac{(10^9)!}{(10^9-2)!~2!} \approx 5 \times 10^{17}$$

And the number of possible pairs of days for them to visit the same hotel:

$$\frac{t!}{(t-2)!~2!} = \frac{1000!}{(1000-2)!~2!} \approx 5 \times 10^5$$

Therefore, the expected value given random chance can be calculated by multiplying the number of possible people-pairs by the number of possible day-pairs and again by the probability of a people-and-day-pair:

$$(5 \times 10^{17}) \times (5 \times 10^5) \times (10^{-18}) = 250,000$$

Based on the event criteria and the number of trials (1000 days and 1B people), we can expect that random chance alone would qualify a quarter of a million pairs of people to be flagged as terrorists! Clearly, the experiment needs to be redesigned in order to be effective.
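The arithmetic in this example can be checked with a short script. This is a minimal sketch that takes the example's figures as given (1 billion people, a 1% daily visit rate, 100,000 hotels, 1000 days); the variable names are chosen here for illustration.

```python
from math import comb

# Figures taken from the hotel example.
n_people = 10**9   # people under observation
p_random = 0.01    # chance a given person visits a hotel on a given day
n_hotels = 10**5   # number of hotels
t_days = 1000      # days of records examined

# Both people visit a hotel on a given day, and it is the same hotel.
p_same_hotel_one_day = p_random**2 / n_hotels

# The coincidence has to happen on two separate days.
p_event = p_same_hotel_one_day**2

# Exact counts of candidate people-pairs and day-pairs.
people_pairs = comb(n_people, 2)
day_pairs = comb(t_days, 2)

# Expected number of suspicious pairs produced by chance alone.
expected_pairs = people_pairs * day_pairs * p_event

print(f"P(event) = {p_event:.1e}")
print(f"Expected suspicious pairs by chance: {expected_pairs:,.0f}")
```

Using exact binomial coefficients instead of the rounded values 5 × 10^17 and 5 × 10^5 gives a figure slightly below 250,000 (roughly 249,750), which does not change the conclusion.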


I've built a sandbox for Bonferroni calculations and visualizations at