    Bonferroni vs. Benjamini-Hochberg: Choosing Your P-Value Correction

    By Awais · December 25, 2025 · 11 Mins Read

    P-values can be a sensitive topic, perhaps best avoided on a first encounter with a statistician. That unease has led to a tacit agreement that α = 0.05 is the gold standard. In truth, it is a ‘convenient convention’, a rule of thumb set by Ronald Fisher himself.

    Who? Don’t know him? Don’t worry.

    He was the first to introduce Maximum Likelihood Estimation (MLE), ANOVA, and Fisher Information (the last one, you may have guessed, bears his name). Fisher was more than a relevant figure in the community; he was the father of modern statistics. He had a deep interest in Mendelian genetics and evolutionary biology, to which he made several key contributions. Unfortunately, Fisher also had a thorny past. He was involved with the Eugenics Society and its policy of voluntary sterilization for the “feeble-minded.”

    Yes, there is no such thing as a famous Statistician.

    But a rule of thumb set by the father of statistics can sometimes be mistaken for a law, and a law it is not.

    A portrait of the young Ronald Aylmer Fisher (1890–1962). Source: Wikimedia Commons (file page). Public Domain.

    There is one key situation in which you not only may but must change this alpha level, and it all comes down to multiple hypothesis testing.

    Running multiple tests without the Bonferroni correction or the Benjamini-Hochberg procedure is more than problematic. Without these corrections, we could ‘prove’ any hypothesis:

    H₁: The sun is blue

    by simply re-running our experiment until luck strikes. But how do these corrections work? And which one should you use? They are not interchangeable!

    P-values and a problem

    To understand why, we need to look at what exactly our p-value is telling us, beyond ‘small is good and big is bad’. To do this, we will need an experiment, and nothing is as exciting, or as contested, as discovering superheavy elements.

    These elements are incredibly unstable and are created in particle accelerators, one atom at a time. Pound for pound, they are the most expensive thing ever produced. In nature, they would only arise in cosmic events like supernovae, and they last only thousandths or millionths of a second.

    But their instability becomes an advantage for detection: a new superheavy element exhibits a distinctive radioactive decay. The decay sequence captured by sensors in the detector can tell us whether a new element is present.

    Scientist working inside the ORMAK fusion device at ORNL. Credit: Oak Ridge National Laboratory, licensed under CC BY 2.0

    As our null hypothesis, we state:

    H₀: The sequence is background noise decay. (No new element)

    Now we need to gather evidence that H₀ is not true if we want to prove we have created a new element. This is done through our test statistic T(X). In general terms, this captures the difference between what the sensors observe and what is expected from background radiation. All test statistics are a measure of ‘surprise’ between what we expect to observe if H₀ is true and what our sample data actually says. The larger T(X), the more evidence we have that H₀ is false.

    This is precisely what the Schmidt test statistic does on the sequence of radioactive decay times.

    \[
    \sigma_{obs} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (\ln t_i - \overline{\ln t})^2}
    \]

    The Schmidt test statistic was used in the discovery of:

    • Hassium (108) and Meitnerium (109), 1984
    • Darmstadtium (110), Roentgenium (111), and Copernicium (112), 1994 to 1996
    • Moscovium (115) and Tennessine (117), 2003 to 2016
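In code, the statistic is just the sample standard deviation of the log decay times. A minimal sketch (the function name `schmidt_sigma` is mine, not from any physics package):

```python
import numpy as np

def schmidt_sigma(times):
    """Sample standard deviation of the log decay times (Schmidt test statistic)."""
    log_t = np.log(np.asarray(times, dtype=float))
    return np.sqrt(np.sum((log_t - log_t.mean()) ** 2) / (log_t.size - 1))

# For i.i.d. exponential lifetimes, log t has variance pi^2/6, so sigma_obs
# should hover near pi/sqrt(6) ~ 1.28; values far from that are 'surprising'.
rng = np.random.default_rng(42)
sigma = schmidt_sigma(rng.exponential(scale=1.0, size=50))
```

Whether σ_obs sits too far from π/√6 to be plain background is exactly the question the p-value answers.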

    It is essential to specify a distribution for H₀ so that we can calculate the probability that a test statistic is as extreme as the test statistic of the observed data.

    We assume noise decays follow an exponential distribution. There are a million reasons why this is a good assumption, but let’s not get bogged down here. If we don’t have a distribution for H₀, computing our probability value would be impossible!

    \[
    H_0^{\text{(Schmidt)}}: t_1, \dots, t_n \ \text{i.i.d.} \sim \text{Exp}(\lambda)
    \]

    The p-value is then the probability, under the null model, of obtaining a test statistic at least as extreme as the one computed from the sample data. The less likely our observed statistic is under H₀, the stronger the evidence that H₀ is false.

    \[
    p \;=\; \Pr_{H_0}\!\big( T(X) \ge T(x_{\mathrm{obs}}) \big).
    \]

    Edvard Munch, At the Roulette Table in Monte Carlo (1892). Source: Wikimedia Commons (file page). Public Domain (CC PDM 1.0)

    Of course, this brings up an interesting issue. What if we observe a rare background decay rate, one that simply resembles that of an undiscovered decaying particle? What if our sensors detect an unlikely, though possible, decay sequence that yields a large test statistic? Each time we run the test there is a small chance of getting an outlier purely by chance. That outlier will give a large test statistic, as it will be quite different from what we expect to see when H₀ is true. The large T(x) will sit in the tails of the distribution implied by H₀ and will produce a small p-value: a small probability of observing anything more extreme. But no new element exists! We just hit 31 red by playing roulette a million times.

    It seems unlikely, but when you keep in mind that protons are being beamed at target particles for months at a time, the possibility stands. So how do we account for it?

    There are two ways: a conservative and a less conservative method. Your choice depends on the experiment. We can use the:

    • Family Wise Error Rate (FWER) and the Bonferroni correction
    • False Discovery Rate (FDR) and the Benjamini-Hochberg procedure

    These are not interchangeable! You need to carefully consider your study and pick the right one.

    If you’re interested in the physics of it:

    New elements are created by accelerating lighter ions to 10% of the speed of light. These ion beams bombard heavier target atoms. The incredible speeds and kinetic energy are required to overcome the Coulomb barrier (the immense repulsive force between two positively charged nuclei).

    New Element         Beam (Protons)      Target (Protons)
    Nihonium (113)      Zinc-70 (30)        Bismuth-209 (83)
    Moscovium (115)     Calcium-48 (20)     Americium-243 (95)
    Tennessine (117)    Calcium-48 (20)     Berkelium-249 (97)
    Oganesson (118)     Calcium-48 (20)     Californium-249 (98)
    Computer simulation showing the collision and fusion of two atomic nuclei to form a superheavy element. (Credit: Lawrence Berkeley National Laboratory) Public Domain

    Family Wise Error Rate (Bonferroni)

    This is our conservative approach, and what should be used if we cannot admit any false positives. This approach keeps the probability of admitting at least one Type I error below our alpha level.

    \[
    \Pr(\text{at least one Type I error in the family}) \leq \alpha
    \]

    This is also the simpler correction: divide the alpha level by the number of tests m. Then, for each test, reject the null hypothesis if and only if:

    \[
    p_i \leq \frac{\alpha}{m}
    \]

    Equivalently, you can adjust your p-values. If you run m tests, take:

    \[
    p_i^{\text{adj}} = \min(1, m p_i)
    \]

    And reject the null hypothesis if:

    \[
    p_i^{\text{adj}} \le \alpha
    \]

    All we did here was multiply both sides of the inequality by m.
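The whole correction fits in a few lines. A sketch (function name mine), returning both the reject decisions and the adjusted p-values:

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H0_i iff p_i <= alpha/m; also return p_adj_i = min(1, m * p_i)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    return p <= alpha / m, np.minimum(1.0, m * p)

# Three tests: the per-test threshold becomes 0.05 / 3 ~ 0.0167,
# so only the first p-value survives the correction.
reject, p_adj = bonferroni([0.001, 0.02, 0.04])
```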

    The proof is also a slim one-liner. Let Aᵢ be the event that test i produces a false positive. Then the probability of at least one false positive is the probability of the union of these events:

    \[
    \text{Pr}(\text{at least one false positive}) = \text{Pr}\left(\bigcup_{i=1}^{m} A_i\right) \le \sum_{i=1}^{m} \text{Pr}(A_i) \le m \cdot \frac{\alpha}{m} = \alpha
    \]

    Here we make use of the union bound, a fundamental result in probability: the probability that A₁, or A₂, …, or Aₖ happens is at most the sum of the probabilities of the individual events.

    \[
    \text{Pr}(A_1 \cup A_2 \cup \cdots \cup A_k) \le \sum_{i=1}^{k} \text{Pr}(A_i)
    \]

    False Discovery Rate (Benjamini-Hochberg)

    The Benjamini-Hochberg procedure isn’t too complicated either. Simply:

    • Sort your p-values: p₁ ≤ … ≤ pₘ.
    • Find the largest k where pₖ ≤ (k/m)·q, and reject the null hypotheses for the k smallest p-values.

    In this approach, the goal is to control the false discovery rate (FDR).

    \[
    \text{FDR} = E\left[ \frac{V}{\max(R, 1)} \right]
    \]

    Where R is the total number of rejections of the null hypothesis, and V is the number of rejections that are (unfortunately) false positives (Type I errors). The goal is to keep this expectation below a chosen threshold q, commonly q = 0.05.

    The BH thresholds are:

    \[
    \frac{1}{m}q, \frac{2}{m}q, \dots, \frac{m}{m}q = q
    \]

    And we reject all hypotheses up to the largest k where:

    \[
    P_{(k)} \leq \frac{k}{m}q
    \]
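The step-up rule above translates directly into code. A sketch (function name mine):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean rejection mask for the BH step-up procedure at FDR level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices of sorted p-values
    below = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()         # largest k with p_(k) <= (k/m) q
        reject[order[: k + 1]] = True          # reject the k smallest p-values
    return reject

# Thresholds for m=5, q=0.05 are 0.01, 0.02, 0.03, 0.04, 0.05:
# 0.001 and 0.008 clear their thresholds; the rest do not.
mask = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6])
```

Note the step-up subtlety: a p-value below its own threshold is rejected even if some smaller threshold was missed along the way, because only the largest qualifying k matters.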

    Use this when you are okay with some false positives and your primary concern is minimizing the Type II error rate; that is, you want fewer false negatives, fewer instances where we fail to reject H₀ when H₀ is in fact false.

    Think of this as a genomics study where you aim to identify everyone who has a specific gene that makes them more susceptible to a particular cancer. It would be less harmful if we treated some people who did not have the gene than risk letting someone who did have it walk away with no treatment.

    Quick side-by-side

    Bonferroni:

    • Controls the family-wise error rate (FWER).
    • Guarantees the probability of at least one false positive is ≤ α.
    • Higher rate of false negatives ⇒ lower statistical power.
    • Zero risk tolerance.

    Benjamini-Hochberg

    • Controls the false discovery rate (FDR).
    • Guarantees the expected fraction of false positives among all discoveries is ≤ q.
    • Fewer false negatives ⇒ higher statistical power.
    • Some risk tolerance.
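The power gap shows up immediately on a toy batch of p-values (the numbers are made up for illustration):

```python
import numpy as np

pvals = np.array([0.0001, 0.0005, 0.004, 0.011, 0.019, 0.029, 0.04, 0.5])
m = pvals.size
alpha = q = 0.05

# Bonferroni: fixed per-test threshold alpha/m = 0.00625
bonf_rejects = int(np.sum(pvals <= alpha / m))

# Benjamini-Hochberg: step-up thresholds (k/m) * q for k = 1..m
sorted_p = np.sort(pvals)
below = np.nonzero(sorted_p <= q * np.arange(1, m + 1) / m)[0]
bh_rejects = int(below.max() + 1) if below.size else 0

# Bonferroni keeps 3 discoveries; BH keeps 7 at the same nominal level.
```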

    A super-tiny p for a super-heavy atom

    We can’t have any nonexistent elements in the periodic table, so when it comes to finding a new element, the Bonferroni correction is the right approach. But when it comes to decay chain data collected by position-sensitive silicon detectors, picking an m isn’t so simple.

    Physicists tend to use the expected number of random chains produced by the entire search over the entire dataset:

    \[
    \Pr(\ge 1 \text{ random chain}) \approx 1 - e^{-n_b}
    \]

    \[
    1 - e^{-n_b} \leq \alpha_{\text{family}} \Rightarrow n_b \approx \alpha_{\text{family}} \quad (\text{approximately, for rare events})
    \]
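The "approximately, for rare events" step is just the first-order expansion of 1 - exp(-x); a quick numerical check:

```python
import math

# If accidental chains arrive as a Poisson process, the count over the whole
# search is Poisson(n_b), so Pr(>= 1 random chain) = 1 - exp(-n_b).
n_b = 1e-3
exact = 1.0 - math.exp(-n_b)

# For small n_b, 1 - exp(-n_b) ~ n_b (error ~ n_b^2 / 2), so keeping n_b
# below alpha_family keeps the family-wise false-positive probability below it too.
approx_error = abs(exact - n_b)
```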

    The number of random chains is estimated from background data recorded when no experiment is taking place. From this data we can build the null distribution of H₀ through Monte Carlo simulation.

    We estimate the number of random chains by modelling the background event rates and resampling the observed background events. Under H₀ (no heavy element decay chain), we use Monte Carlo to simulate many null realizations and compute how often the search algorithm produces a chain as extreme as the observed chain.

    More precisely:

    H₀: background events arrive as a Poisson process with rate λ ⇒ inter-arrival times are Exponential(λ).

    Then an accidental chain is k consecutive hits within a time window τ. We scan the data with our test statistic to determine whether an extreme cluster exists.

    import numpy as np

    lambda_rate = 0.2   # background events per second
    T_total = 2_000.0   # seconds of data-taking (mean number of events ~ 400)
    k = 4               # chain length
    tau_obs = 0.20      # "observed extreme": 4 events within 0.20 sec
    
    Nmc = 20_000
    rng = np.random.default_rng(0)
    
    def dmin_and_count(times, k, tau):
        # Smallest span of any k consecutive events, and how many spans are <= tau
        if times.size < k:
            return np.inf, 0
        spans = times[k-1:] - times[:-(k-1)]
        return float(np.min(spans)), int(np.sum(spans <= tau))
    
    ...
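The elided part of the script presumably draws many background realizations and counts how often the scan finds a chain at least as tight as the observed one. A self-contained sketch of one way to do that under the same parameters, which may well differ from the linked notebook:

```python
import numpy as np

lambda_rate, T_total, k, tau_obs = 0.2, 2_000.0, 4, 0.20
Nmc = 20_000
rng = np.random.default_rng(0)

hits = 0
for _ in range(Nmc):
    n = rng.poisson(lambda_rate * T_total)           # background event count
    times = np.sort(rng.uniform(0.0, T_total, n))    # Poisson arrivals ~ sorted uniforms
    if n >= k:
        spans = times[k - 1:] - times[: -(k - 1)]    # width of each k-event window
        if spans.min() <= tau_obs:
            hits += 1                                # a chain as tight as the observed one

p_value = hits / Nmc  # Pr under H0 that the search produces such a chain anywhere
```

This is the whole Look-Elsewhere correction in miniature: the scan over every window is baked into the null simulation, so the resulting p-value already accounts for the multiplicity of the search.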

    Monte-Carlo Simulation on GitHub

    Illustration by Author

    If you’re interested in the numbers: in the discovery of element 117, Tennessine (Ts), a p-value of 5×10⁻¹⁶ was used. I imagine that if no corrections were ever used, our periodic table would have long outgrown its poster, and chemistry would be in shambles.

    Conclusion

    This whole concept of searching for something in lots of places, then treating an especially significant blip as if it came from a single observation, is commonly referred to as the Look-Elsewhere Effect, and there are two primary ways to adjust for it:

    • Bonferroni Correction
    • Benjamini-Hochberg Procedure

    Our choice entirely depends on how conservative we want to be.

    But even with a p-value of 5×10⁻¹⁶, you might wonder when a p-value as small as 10⁻⁹⁹ should still be discarded. And that all comes down to Victor Ninov, a physicist at Lawrence Berkeley National Laboratory who was, for a brief moment, the man who discovered element 118.

    However, an internal investigation found that he had fabricated the alpha-decay chain. When the data itself is falsified, even a p-value of 10⁻⁹⁹ does not justify rejecting the null hypothesis.

    Yuri Oganessian, the leader of the team at the Joint Institute for Nuclear Research in Dubna and Lawrence Livermore National Laboratory, who discovered Element 118. Wikimedia Commons, CC BY 4.0.

    References

    Bodmer, W., Bailey, R. A., Charlesworth, B., Eyre-Walker, A., Farewell, V., Mead, A., & Senn, S. (2021). The outstanding scientist, RA Fisher: his views on eugenics and race. Heredity, 126(4), 565-576.

    Khuyagbaatar, J., Yakushev, A., Düllmann, C. E., Ackermann, D., Andersson, L. L., Asai, M., … & Yakusheva, V. (2014). Ca 48+ Bk 249 fusion reaction leading to element Z= 117: Long-lived α-decaying Db 270 and discovery of Lr 266. Physical review letters, 112(17), 172501.

    Multiple Comparisons: Bonferroni Corrections and False Discovery Rates.
