Bonferroni's word to the wise

A cautionary tale

General idea: If you run enough tests against a dataset, you'll eventually find a result that clears significance level α by chance alone. That doesn't mean it's true!

Solution: When running m tests against a dataset at significance level α, instead hold each individual hypothesis to the more stringent significance criterion of α/m.
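To make the correction concrete, here is a minimal sketch. The function names are illustrative, not taken from any library, and the family-wise calculation assumes the m tests are independent and every null hypothesis is true.

```python
def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test significance level under the Bonferroni correction."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return alpha / m


def familywise_error_rate(per_test_alpha: float, m: int) -> float:
    """Chance of at least one false positive across m independent tests
    when every null hypothesis is true (an assumption of this sketch)."""
    return 1.0 - (1.0 - per_test_alpha) ** m


alpha, m = 0.05, 20
uncorrected = familywise_error_rate(alpha, m)
corrected = familywise_error_rate(bonferroni_threshold(alpha, m), m)
print(f"uncorrected: {uncorrected:.3f}, corrected: {corrected:.3f}")
```

With 20 tests at α = 0.05, the chance of at least one spurious "discovery" is roughly 64% without the correction and just under 5% with it.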

Background: The expected value of a discrete random variable is generally calculated as:

$$E(x) = \sum_{i=1}^{n} x_i p_i$$

Which for binomial outcomes simplifies to:

$$E(x) = np(x) $$

Intuitively, the number of trials n and the probability of an event p(x) both contribute to the likelihood that you will see an interesting event. The Bonferroni Principle warns us to tune both the definition of an interesting event (p(x)) and the number of trials (n) such that the likelihood of flagging true positives is not overpowered by the sheer random probability of our criteria being satisfied.

Am I in danger of being fooled? Assuming the data selection method is random, use your sample size to calculate the expected number of interesting events. If this number is larger than the number of genuine events you hope to find, your definition of an 'interesting event' may not be stringent enough, or you may be testing too many times. You'll have to cull the set of interesting events further to avoid being flooded with false positives.
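The check described above can be sketched in a few lines. The helper names and the example numbers are hypothetical, chosen only to illustrate the comparison between chance events and the real events you hope to find.

```python
def expected_chance_events(n_trials: int, p_event: float) -> float:
    """Expected number of 'interesting' events from random chance alone: n * p."""
    return n_trials * p_event


def in_danger_of_being_fooled(n_trials: int, p_event: float,
                              plausible_true_events: float) -> bool:
    """True when chance alone is expected to produce more hits than the
    number of real events you hope to find."""
    return expected_chance_events(n_trials, p_event) > plausible_true_events


# Hypothetical numbers: one million trials, a 1-in-10,000 chance criterion,
# and a hope of finding about 10 real events.
n_trials, p_event, hoped_for = 1_000_000, 1e-4, 10
print(f"expected by chance: {expected_chance_events(n_trials, p_event):.0f}")
print("flooded with false positives?",
      in_danger_of_being_fooled(n_trials, p_event, hoped_for))
```

In this toy case chance alone is expected to produce about 100 hits, ten times the number of real events, so the criterion needs tightening.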

What can I do? You can either use a more hypothesis-driven approach to define the interesting event (i.e. make your criteria more stringent), or attempt to correct your statistical significance estimations using the Bonferroni Correction.

An example illustrating Bonferroni's Principle

from Mining of Massive Datasets, by Leskovec, Rajaraman, Ullman

You are observing hotel visits in a population of 1 billion people to try to identify terrorist organization meetings. On average, each person visits a hotel on 1 in every 100 days. Each hotel holds 100 people, so we can assume there are 100,000 hotels: enough to hold the 1% of the population that visits on any given day. You're examining hotel records over 1000 days. You define a suspicious duo as two people who visited the same hotel on the same day on two different occasions.

  • Sample size: $n_{people} = 10^9$
  • P(hotel visit), given $H_0$: $p_{random} = 0.01$
  • Number of hotels: $n_{hotels} = 10^5$
  • Number of days observed: $t = 1000$

Joint probability of two people visiting a hotel on a particular day:

$$ p_{random}^2 = 0.0001 $$

Probability that both people choose the same hotel:

$$\frac{p_{random}^2}{n_{hotels}} = \frac{0.0001}{10^5} = 10^{-9} $$

For the event to be considered interesting, this pair has to visit the same hotel on the same day twice:

$$P(event) = (10^{-9})^2 = 10^{-18}$$

This is the probability of the interesting event occurring, given random chance. Let's use the formula for expected value and our sample size n to estimate the number of interesting events we should see by random chance.

The number of possible pairs of people is $n_{people}$ choose 2:

$$\frac{n_{people}!}{(n_{people}-2)!~2!} = \frac{(10^9)!}{(10^9-2)!~2!} \approx 5 \times 10^{17}$$

And the number of possible pairs of days for them to visit the same hotel:

$$\frac{t!}{(t-2)!~2!} = \frac{1000!}{(1000-2)!~2!} \approx 5 \times 10^5$$

Therefore, the expected value given random chance can be calculated by multiplying the number of possible people-pairs by the number of possible day-pairs and again by the probability of a people-and-day-pair:

$$(5 \times 10^{17}) \times (5 \times 10^5) \times (10^{-18}) = 250,000$$

Based on the event criteria and the number of trials (1000 days and 1B people), we can expect that random chance alone would qualify a quarter of a million pairs of people to be flagged as terrorists! Clearly, the experiment needs to be redesigned in order to be effective.
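The arithmetic above can be reproduced with a short script. This is a sketch using the figures stated in the example; the variable names are my own, and exact binomial coefficients replace the rounded values.

```python
from math import comb

# Figures from the hotel example
n_people = 10**9   # population under observation
p_visit = 0.01     # chance a given person is in a hotel on a given day
n_hotels = 10**5   # number of hotels
n_days = 1000      # days of records examined

# Two given people are both in a hotel, and in the same one, on a given day
p_same_hotel_one_day = p_visit**2 / n_hotels
# ...and the same thing happens on a second given day
p_event = p_same_hotel_one_day**2

people_pairs = comb(n_people, 2)  # exact count of pairs of people
day_pairs = comb(n_days, 2)       # exact count of pairs of days

expected_suspicious_pairs = people_pairs * day_pairs * p_event
print(f"expected suspicious pairs by chance: {expected_suspicious_pairs:,.0f}")
```

Using exact pair counts gives a figure just under the rounded 250,000, so the approximation in the text is safe.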

TEST IT OUT:

I've built a sandbox for Bonferroni calculations and visualizations at www.github.com/NickiRom

Visualizing the World Series through Mobile Geolocation

Republished from RadiumOne blog

By Nicole Romano, Data Scientist

Bay Area natives are well aware of the riots (and tortilla throwing) that erupted in San Francisco’s Mission neighborhood after the Giants won the 2014 World Series. And the 2012 World Series. Oh, and the 2010 World Series.

But how did the rest of the Bay Area celebrate?

RadiumOne’s integrated stack provides us with a rich source of non-personally identifiable mobile geo-location data, allowing us to visualize how the Bay Area celebrated the Giants’ World Series win. In order to make all of this amazing data come to life, we evaluated three components:

  • The most popular neighborhoods for celebrating a World Series win
  • The migration of SF’s bar hoppers post-game
  • The World Series parade patterns

Let us break this down for you both in numbers and visuals.

    World Series Game 7: The Numbers

Zip codes represented in SF: 2,024; Bay Area: 344; International: 76

Population breakdown in SF:
SF locals: 52%; Bay Area: 30%; US: 25%; International: 1%

    Instead of relying on 3rd party or self-reported data to determine where fans live, we let our mobile geo-location data do the talking. Home zip codes were determined by observing the most frequent zip code visited by each device over a typical week.
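As a rough sketch of that home zip code rule: the record format, device IDs, and function name below are hypothetical, not RadiumOne's actual pipeline. The idea is simply to count zip codes per device and take the most frequent one.

```python
from collections import Counter, defaultdict


def infer_home_zips(observations):
    """Map each device to its most frequently observed zip code.

    `observations` is an iterable of (device_id, zip_code) pairs, a
    hypothetical record format used only for this illustration.
    Ties go to the zip code seen first.
    """
    counts = defaultdict(Counter)
    for device_id, zip_code in observations:
        counts[device_id][zip_code] += 1
    return {device: zips.most_common(1)[0][0] for device, zips in counts.items()}


# Made-up observations for two devices over a week
sample = [
    ("device_a", "94110"),
    ("device_a", "94103"),
    ("device_a", "94110"),
    ("device_b", "94563"),
]
print(infer_home_zips(sample))
```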

    Next, we took a look at the game-watching habits of San Francisco locals.

    1 in 4 San Franciscans Celebrated the Game 7 Win at a Bar

In order to determine who watched the game in a bar, we collected geo-locations for 1,385 bars and nightlife establishments within the city limits. Using our proprietary mobile data, we identified San Francisco locals who spent at least 2 hours in a nightlife establishment on the last night of the World Series. Between the hours of 5 PM and 1 AM that night, an incredible 26% of the smartphone-wielding population of San Francisco was observed to celebrate with other locals in lieu of watching at home.

Figure 1: Most popular neighborhoods to watch the game. Attendance was normalized by the resident population of each neighborhood, according to the 2010 US Census. 

The most popular neighborhoods for game watching among locals were: SoMa (20% of the bar-hopping contingent), the Castro (15%), and Civic Center (13%). Of course, these numbers reflect people celebrating in nightlife establishments only. If we include public spaces, Civic Center is the clear winner.

Locals Left their Neighborhoods to Party in SoMa, the Castro, and the Mission

In order to incite some friendly neighborhood rivalry, we drilled down to see which locals left their own neighborhoods to party elsewhere.

 

Figure 2: Interactive visualization of locals celebrating the Game 7 win, colored by their home neighborhoods.

In Figure 2, we used mobile geolocation to understand where residents of SF’s most popular neighborhoods chose to watch Game 7. Each point on the map represents a group of SF residents, with colors corresponding to their resident neighborhoods. So which neighborhood was the most popular? To answer this question, we further processed the data to generate the co-occurrence matrix in Figure 3.

Many stereotypes proved true—the mutual disdain between North Beach and the Mission, for one, and the homogeneous intermingling between North Beach and the Marina, for another. But we were surprised to see so many Marina residents leave their home neighborhood in favor of Nob Hill (26% of bar hoppers from the Marina), as opposed to celebrating at the neighborhood bar (only 14% stayed in the Marina). We attribute this simultaneously to the lure of Nob Hill’s Polk Gulch and the Marina’s penchant for rooftop parties, which were not included in this analysis of nightlife establishments. How did your Game 7 plans compare to those of your neighbors?

Figure 3: Where locals live and where they chose to celebrate the World Series win. Green color denotes % of locals from a particular neighborhood—darker indicates higher attendance. Neighborhoods have been ranked from left to right in order of popularity. 
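A matrix like the one in Figure 3 can be built by counting (home neighborhood, celebration neighborhood) pairs and converting each row to percentages. The sketch below assumes one record per local and uses invented toy data; it does not reproduce the original analysis or its numbers.

```python
from collections import Counter, defaultdict


def celebration_shares(visits):
    """For each home neighborhood, the percentage of its bar-going locals
    observed in each celebration neighborhood.

    `visits` is an iterable of (home_neighborhood, venue_neighborhood)
    pairs, a hypothetical input format with one pair per local.
    """
    counts = defaultdict(Counter)
    for home, venue in visits:
        counts[home][venue] += 1

    shares = {}
    for home, venue_counts in counts.items():
        total = sum(venue_counts.values())
        shares[home] = {venue: 100.0 * n / total for venue, n in venue_counts.items()}
    return shares


# Invented toy data, for illustration only
toy_visits = [
    ("Marina", "Nob Hill"),
    ("Marina", "Nob Hill"),
    ("Marina", "Marina"),
    ("Marina", "SoMa"),
    ("Mission", "Mission"),
]
for home, row in celebration_shares(toy_visits).items():
    print(home, row)
```

Normalizing by row keeps large and small neighborhoods comparable, since each row reads as "of the locals from here, what share went where".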

Orinda and Pittsburg wore Orange and Black with Pride

1 in 4 locals is a pretty good showing, but locals made up only 52% of people in San Francisco that night. We challenged our geo-location data to tell us which of the Bay Area cities demonstrated superior fandom in San Francisco during the final game of the World Series. On a per-capita basis, residents of Orinda and Pittsburg traveled to SF in droves for Game 7.

Figure 4: Per capita attendance of Bay Area natives watching Game 7 in San Francisco

The Giants Parade

Finally, we visualized the Giants Parade on Friday, October 31. Although the parade was scheduled to begin at 12 PM, Market Street was packed by 8 AM. The figure below shows the increase in mobile devices detected, as compared to an average Friday in San Francisco.

Figure 5: Visualization of the World Series Parade