P-Value

It seems we must delve into the rather tedious intricacies of the p-value. Fine. Let's make this as painless as possible, for me at least.

Not to be confused with the P-factor, which, mercifully, is an entirely different kind of statistical entanglement.

In the grand theatre of null-hypothesis significance testing, the p-value, often stylized with a lowercase 'p' and occasionally hyphenated (as if its identity isn't already ambiguous enough, note 1), represents a singular, albeit frequently misunderstood, concept. It is, quite precisely, the probability of observing test results that are at least as extreme as the result actually observed, provided that the underlying null hypothesis is, in fact, correct. (note 2)

To put it more plainly, it's a measure of how incompatible your data is with a specified statistical model. A minuscule p-value indicates that such an extreme outcome would be profoundly improbable if the null hypothesis were truly valid. It's like finding a unicorn at a dog show and still insisting you're at a dog show. The evidence, in this case, would strongly suggest otherwise.

Despite its ubiquity and the solemn reporting of p-values in academic publications across a myriad of quantitative disciplines, the landscape is littered with misinterpretation and misuse of p-values. This persistent statistical debacle has become a recurring lament within the hallowed halls of mathematics and metascience, a testament to humanity's enduring capacity to misunderstand even its own carefully constructed tools.

The American Statistical Association (ASA), in a rather belated but necessary intervention in 2016, issued a formal statement attempting to clarify matters. They unequivocally declared that "p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone." Furthermore, they stressed that "a p-value, or statistical significance, does not measure the size of an effect or the importance of a result" nor does it constitute "evidence regarding a model or hypothesis." One might imagine the exasperated sighs accompanying such a pronouncement, given how widely these very misconceptions had permeated scientific discourse.

However, the ASA wasn't quite done. In 2019, a subsequent task force released a statement on statistical significance and replicability, offering a more balanced perspective. While acknowledging the previous warnings, they concluded with a crucial nuance: "p-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data." This suggests that the tool itself isn't inherently flawed; rather, it's the clumsy hands wielding it that often lead to disaster. A sharp knife can carve a masterpiece or sever a finger; the blame lies not with the blade.

Basic Concepts

In the realm of statistics, any proposition or conjecture concerning the unknown probability distribution that governs a collection of random variables, which represent the observed data, denoted as (X), within a particular study, is termed a statistical hypothesis. When the objective of a statistical test is solely to ascertain the plausibility of a single hypothesis, without venturing into the investigation of alternative, specific hypotheses, such a procedure is aptly named a null hypothesis test.

The null hypothesis itself typically serves as the default assumption, postulating the absence of the property or effect under investigation within the distribution. It commonly posits that a certain parameter of interest – be it a correlation coefficient, a difference between means, or another quantifiable characteristic within the populations being studied – is precisely zero. This hypothesis might, in some cases, meticulously define the probability distribution of (X), or it might more broadly assign it to a particular class of distributions. Often, for the sake of analytical tractability, the raw data is condensed into a singular numerical statistic, for instance, (T). The marginal probability distribution of this chosen statistic, (T), is then intimately linked to the primary question driving the study.

The p-value, in this context, is deployed within null hypothesis testing as a quantitative metric to gauge the statistical significance of an observed outcome. This outcome is nothing more than the specific value obtained for the chosen statistic, (T). (note 2) Fundamentally, a lower p-value signifies a reduced likelihood of obtaining the observed result if the null hypothesis were, in actuality, true. When a result yields a p-value sufficiently low to meet a predefined threshold, it is deemed "statistically significant," thereby providing grounds to reject the null hypothesis. All other factors being held constant, smaller p-values are conventionally regarded as furnishing more compelling evidence against the null hypothesis.

In common parlance, the rejection of the null hypothesis is often interpreted as an indication that there is sufficient evidence to contradict it. However, and this is where many stumble, "sufficient evidence against it" is not synonymous with "proof that it is false," nor does it automatically imply the practical importance or magnitude of any observed effect. The universe, it seems, rarely offers such straightforward assurances.

Consider a concrete example: if a null hypothesis postulates that a certain summary statistic (T) conforms precisely to the standard normal distribution, denoted as (\mathcal{N}(0,1)), then the rejection of this null hypothesis could be interpreted in several ways. It might suggest that (i) the mean of (T) is not 0, or (ii) the variance of (T) is not 1, or perhaps (iii) (T) does not follow a normal distribution at all. Different statistical tests, designed to probe the same null hypothesis, will exhibit varying degrees of sensitivity to these distinct alternative explanations. Crucially, even if one manages to reject the null hypothesis across all three potential alternatives – and even if one knows that the distribution is indeed normal and the variance is precisely 1 – the null hypothesis test still doesn't specify which non-zero values of the mean are now most plausible. It merely tells you that zero is unlikely. The more independent observations one gathers from the same underlying probability distribution, the greater the accuracy and power of the test. This increased precision allows for a more refined determination of the mean value and a clearer demonstration that it deviates from zero. However, this also amplifies the critical importance of evaluating the real-world or scientific relevance of any such detected deviation, a distinction often conveniently overlooked. A statistically significant effect can still be utterly trivial.
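To make that last point concrete, here is a minimal simulation sketch (assuming Python with NumPy and SciPy, a known variance of 1, and a hypothetical true mean of 0.02) in which a practically trivial deviation from zero becomes "statistically significant" once the sample is large enough:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean = 0.02   # hypothetical effect: real, but practically trivial

for n in (100, 10_000, 1_000_000):
    x = rng.normal(loc=true_mean, scale=1.0, size=n)
    z = x.mean() * np.sqrt(n)          # one-sample z-test of H0: mean = 0, variance known
    p = 2 * stats.norm.sf(abs(z))      # two-sided p-value
    print(f"n={n:>9,}  z={z:7.2f}  p={p:.3g}")
# As n grows, p collapses toward zero even though the effect (0.02) never changes.
```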

Definition and Interpretation

Definition

The p-value is formally defined as the probability, under the assumption that the null hypothesis is true, of observing a real-valued test statistic that is at least as extreme as the value actually obtained from the sample data. Suppose we observe a specific value, (t), of a test statistic (T) whose true distribution is unknown. Then the p-value, denoted as (p), is the probability of observing a test-statistic value at least as "extreme" as (t) if the null hypothesis, (H_0), were, in fact, true. This can be expressed mathematically depending on the nature of the test:

  • For a one-sided right-tail test focusing on values greater than or equal to (t): [ p = \Pr(T \geq t \mid H_0) ]
  • For a one-sided left-tail test focusing on values less than or equal to (t): [ p = \Pr(T \leq t \mid H_0) ]
  • For a two-sided test that considers deviations in both directions from what the null hypothesis predicts, the p-value is calculated by taking twice the minimum of the probabilities of observing a result as extreme as (t) in either tail: [ p = 2\min\{\Pr(T \geq t \mid H_0), \Pr(T \leq t \mid H_0)\} ] If the distribution of (T) is perfectly symmetric around zero, this simplifies to: [ p = \Pr(|T| \geq |t| \mid H_0) ] Essentially, it's the likelihood of your data, or something even more surprising, if nothing interesting is actually happening. A low p-value suggests the "nothing interesting" scenario is, well, unlikely.
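As a minimal illustration of these three cases – a sketch, not a canonical implementation – suppose the test statistic (T) is standard normal under (H_0) and the observed value is a hypothetical (t = 2.1):

```python
from scipy import stats

t = 2.1  # hypothetical observed value of a test statistic T that is N(0, 1) under H0

p_right = stats.norm.sf(t)           # Pr(T >= t | H0): one-sided right-tail p-value
p_left = stats.norm.cdf(t)           # Pr(T <= t | H0): one-sided left-tail p-value
p_two = 2 * min(p_right, p_left)     # two-sided p-value

# For a distribution symmetric about zero this equals Pr(|T| >= |t| | H0)
assert abs(p_two - 2 * stats.norm.sf(abs(t))) < 1e-12
print(p_right, p_left, p_two)
```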

Interpretations

The error that a practising statistician would consider the more important to avoid (which is a subjective judgment) is called the error of the first kind. The first demand of the mathematical theory is to deduce such test criteria as would ensure that the probability of committing an error of the first kind would equal (or approximately equal, or not exceed) a preassigned number α, such as α = 0.05 or 0.01, etc. This number is called the level of significance.

— Jerzy Neyman, "The Emergence of Mathematical Statistics"

In the practice of a significance test, the null hypothesis, (H_0), is deemed worthy of rejection if the calculated p-value falls below a predefined threshold value, typically denoted as (\alpha). This (\alpha) is widely known as the alpha level or, more formally, the significance level. It's crucial to understand that (\alpha) is not a magical number revealed by the data itself; rather, it is arbitrarily set by the researcher before any data analysis commences. This pre-specification is intended to prevent researchers from retrospectively adjusting their rejection criteria to fit observed results, a practice that, regrettably, is not uncommon.

Historically, (\alpha) is most frequently set at 0.05, though more stringent levels, such as 0.01 or even lower, are occasionally employed depending on the field and the consequences of a false positive. The conventional choice of 0.05, representing a 1 in 20 chance of observing such extreme data if the null hypothesis were true, was originally championed by the highly influential statistician Ronald Fisher in his seminal 1925 work, "Statistical Methods for Research Workers." Fisher's endorsement cemented this threshold into statistical practice, almost as if by divine decree, leading to its widespread adoption, often without critical reflection on its specific appropriateness for a given context.

It is also worth noting that different p-values, derived from independent sets of data addressing the same underlying question, can be formally combined to yield a single, overarching measure of evidence. This is commonly achieved through methods such as Fisher's combined probability test, which aggregates the evidence from multiple studies, providing a more robust conclusion than any single study might offer alone.
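As a sketch of how such a combination works, Fisher's method transforms (k) independent p-values into the statistic (-2\sum_i \ln p_i), which follows a chi-squared distribution with (2k) degrees of freedom when every null hypothesis is true (the p-values below are made up for illustration):

```python
import numpy as np
from scipy import stats

p_values = [0.08, 0.12, 0.20, 0.03]   # hypothetical p-values from independent studies

# Fisher's combined probability test
statistic = -2 * np.sum(np.log(p_values))
combined_p = stats.chi2.sf(statistic, df=2 * len(p_values))
print(statistic, combined_p)

# Recent SciPy versions also offer the same computation directly:
# stats.combine_pvalues(p_values, method="fisher")
```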

Distribution

The p-value, being a direct mathematical consequence of the chosen test statistic (T), is, by its very nature, a random variable. This means that its value will fluctuate from sample to sample, even if the underlying conditions (including the truth or falsity of the null hypothesis) remain constant. A rather inconvenient truth, wouldn't you say?

If the null hypothesis precisely dictates the probability distribution of (T) – for instance, in a scenario where (H_0: \theta = \theta_0), and (\theta) is the solitary parameter defining the distribution – and if that distribution happens to be continuous, then a fascinating property emerges: when the null hypothesis is genuinely true, the p-value is uniformly distributed between 0 and 1. This means that, under the null, any p-value between 0 and 1 is equally likely, a fact often overlooked in the rush to find "significant" results. Regardless of the actual truth or falsity of (H_0), the p-value itself is not a fixed constant; if the same statistical test is replicated independently with fresh, new data, one can almost certainly expect to obtain a different p-value in each iteration. This inherent variability underscores why a single p-value should never be treated as the ultimate arbiter of truth.
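A quick simulation sketch (NumPy/SciPy, hypothetical z-tests on samples for which the null is true by construction) makes this uniformity visible:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 30, 10_000

# H0 is true by construction: every sample really does come from N(0, 1)
samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))
z = samples.mean(axis=1) * np.sqrt(n)     # z-statistics (variance known to be 1)
p = 2 * stats.norm.sf(np.abs(z))          # two-sided p-values

hist, _ = np.histogram(p, bins=10, range=(0, 1))
print(hist)                   # each bin holds roughly reps/10 = 1,000 p-values
print((p <= 0.05).mean())     # about 5% of the tests come out "significant" by chance
```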

Typically, in the course of a single study, only one p-value corresponding to a specific hypothesis is observed. Consequently, this p-value is then interpreted within the framework of a significance test, and there's usually no immediate attempt to estimate the broader distribution from which it was drawn. However, when a collection of p-values becomes available – for example, when aggregating findings from numerous studies exploring the same scientific question – the distribution of these statistically significant p-values can be graphically represented and analyzed as a "p-curve."

A p-curve serves as a diagnostic tool, offering insights into the overall reliability and integrity of scientific literature. It can be particularly effective in detecting tell-tale signs of publication bias, where studies with "significant" (i.e., low) p-values are disproportionately published, or even the more insidious practice of p-hacking, where researchers manipulate data or analyses until a desired p-value is achieved. It’s a sad commentary on human nature that we need statistical methods to detect when other humans are, shall we say, optimizing their results.

Distribution for Composite Hypothesis

In the specialized domain of parametric hypothesis testing, a "simple" or "point" hypothesis refers to a scenario where the parameter's value is assumed to be a single, specific numerical quantity. Conversely, a "composite" hypothesis is one where the parameter's value is not a single point but rather defined by a set or range of numbers. This distinction is not merely academic; it has practical implications for p-value interpretation.

When the null hypothesis is composite (or when the distribution of the test statistic is discrete, rather than continuous), a crucial property still holds: if the null hypothesis is true, the probability of obtaining a p-value less than or equal to any given number between 0 and 1 remains less than or equal to that number. In essence, very small p-values continue to be relatively improbable under a true null hypothesis, even a composite one. This ensures that a significance test conducted at a specified level (\alpha) retains its integrity; one rejects the null hypothesis if the p-value is less than or equal to (\alpha).

For example, consider testing the null hypothesis that a distribution is normal with a mean less than or equal to zero, against the alternative that the mean is greater than zero ((H_0: \mu \leq 0), with known variance). In this instance, the null hypothesis does not specify a single, exact probability distribution for the appropriate test statistic (which, in this case, would be the Z-statistic associated with a one-sided one-sample Z-test). Instead, for each possible theoretical mean value within the range (\mu \leq 0), the Z-test statistic possesses a distinct probability distribution. Under these circumstances, the p-value is defined by considering the "least favorable" scenario within the composite null hypothesis – typically the case precisely on the boundary between the null and the alternative hypotheses (e.g., (\mu = 0)). This careful definition guarantees the essential complementarity of p-values and alpha-levels: setting (\alpha = 0.05) means that one will only reject the null hypothesis if the p-value is (\leq 0.05), and crucially, this hypothesis test will indeed maintain a maximum Type I error rate of 0.05. It's a rather clever way to ensure that your risk of crying wolf is capped, even when the wolf's exact size and shape are a bit fuzzy under the null.
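A small simulation sketch (hypothetical sample size and means, one-sided Z-test with known variance 1) illustrates why computing the p-value at the boundary (\mu = 0) caps the Type I error across the whole composite null:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, reps, alpha = 25, 20_000, 0.05

for mu in (-0.5, -0.2, 0.0):                # every value here satisfies H0: mu <= 0
    x = rng.normal(loc=mu, scale=1.0, size=(reps, n))
    z = x.mean(axis=1) * np.sqrt(n)
    p = stats.norm.sf(z)                    # p-value computed at the boundary mu = 0
    print(mu, (p <= alpha).mean())          # rejection rate; largest (about 0.05) at mu = 0
```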

Usage

The p-value finds its most pervasive application within the framework of statistical hypothesis testing, particularly within the dominant paradigm of null hypothesis significance testing. In this methodology, the researcher must, prior to embarking on the study itself, first meticulously select a statistical model (representing the null hypothesis) and then designate an alpha level, (\alpha), which, as previously noted, is most commonly set at the somewhat arbitrary value of 0.05.

Following the rigorous analysis of the collected data, if the computed p-value is found to be less than or equal to this predetermined (\alpha), this outcome is traditionally interpreted as sufficient evidence to deem the observed data "sufficiently inconsistent" with the null hypothesis to warrant its rejection. However, and this is a point of frequent and egregious misunderstanding, the act of rejecting the null hypothesis does not constitute proof that the null hypothesis is inherently false. Nor does it magically confirm the truth of an alternative hypothesis. The p-value, in its purest form, does not, in and of itself, establish the probabilities of various hypotheses being true. Rather, it functions as a specific, conditional probability: the probability of observing data as extreme or more extreme if the null hypothesis were true. It is, therefore, merely a tool – a somewhat blunt and often misused one – for making a binary decision about whether to reject the null hypothesis, nothing more. Expecting it to do more is like expecting a hammer to write a symphony.

Misuse

According to the aforementioned statement from the ASA, there is a pervasive and unsettling consensus that p-values are not only frequently misused but also profoundly misinterpreted. This isn't a minor quibble; it's a fundamental flaw in how scientific conclusions are often drawn. One practice that has drawn particularly sharp criticism is the mechanical acceptance of an alternative hypothesis simply because a p-value nominally falls below the 0.05 threshold, without the crucial backing of other corroborating evidence. It's a classic case of mistaking a symptom for the disease.

While p-values certainly offer utility in gauging the incompatibility between observed data and a specified statistical model, it is imperative that contextual factors are given their due weight. These include, but are not limited to, "the design of a study, the quality of the measurements, the external evidence for the phenomenon under study, and the validity of assumptions that underlie the data analysis." To ignore these elements and rely solely on a p-value is to engage in a form of intellectual tunnel vision, leading to conclusions that are brittle at best, and outright fallacious at worst. Another deeply ingrained misconception, and perhaps the most pernicious, is the frequent misunderstanding of the p-value as representing the actual probability that the null hypothesis is true. This is a profound logical leap, akin to assuming that because it's raining, your car keys must be lost, simply because both are probabilistic events. P-values and significance tests are also notably silent on the broader implications of drawing conclusions from a sample to an entire population, often leading to overgeneralized and unwarranted claims.

In light of these widespread issues, a vocal contingent of statisticians has advocated for the abandonment of p-values altogether, urging a shift in focus toward other, arguably more informative, inferential statistics. These proposed alternatives include, but are not limited to, confidence intervals, which provide a range of plausible values for a parameter; likelihood ratios, which quantify the relative support for different hypotheses; or Bayes factors, which offer a Bayesian framework for hypothesis comparison. However, this proposition has ignited a rather heated and ongoing debate regarding the practical feasibility and universal applicability of these alternatives. It seems humans are rather fond of their simple, albeit flawed, rules.

Other statisticians have suggested a less radical approach, advocating for the removal of rigid, fixed significance thresholds. Instead, they propose that p-values should be interpreted as continuous indices reflecting the strength of evidence against the null hypothesis, rather than a binary "significant/not significant" declaration. Yet another suggestion involves reporting, alongside p-values, the prior probability of a real effect that would be necessary to achieve a false positive risk (i.e., the probability that there is no genuine effect despite a "significant" p-value) below a predetermined threshold, such as 5%. This introduces a Bayesian flavor to frequentist reporting, attempting to bridge the two statistical philosophies.

Despite the fervent criticisms, the 2019 ASA task force, convened specifically to address the use of statistical methods, particularly hypothesis tests and p-values, and their relationship to scientific replicability, offered a more nuanced and ultimately supportive view. Their statement explicitly asserts that "Different measures of uncertainty can complement one another; no single measure serves all purposes," thereby positioning the p-value as one legitimate measure among many. They further underscore that p-values can yield valuable insights both when considered as specific numerical values and when compared against an appropriate threshold. The overarching message, and one that is often lost in the clamor of debate, is that "p-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data." The emphasis, as always, is on "properly applied and interpreted." A tall order, it seems.

Calculation

Typically, (T) refers to a test statistic. A test statistic is the singular numerical output of a scalar function that processes all the observations collected in a study. This statistic distills the raw data into a single, summary number, such as a t-statistic or an F-statistic. Consequently, the test statistic's sampling distribution is entirely determined by the specific mathematical function employed to define it and, crucially, by the underlying probability distribution of the input observational data.

For the particularly critical and frequently encountered scenario where the data are hypothesized to originate from a random sample drawn from a normal distribution, a variety of specialized null hypothesis tests have been meticulously developed. The choice among these tests hinges upon the precise nature of the test statistic and the specific hypotheses about its distribution that are of interest. For instance, the z-test is employed for hypotheses concerning the mean of a normal distribution when its variance is already known. When the variance is unknown, one turns to the t-test, which is based on Student's t-distribution of a suitable statistic. For hypotheses pertaining to the variance itself, the F-test, relying on the F-distribution of yet another specialized statistic, is the appropriate choice. For data of a different character, such as categorical (or discrete) data, test statistics can still be constructed. Their null hypothesis distributions are often based on normal approximations to suitable statistics, invoked by appealing to the central limit theorem for sufficiently large samples, as exemplified by Pearson's chi-squared test.

Thus, the act of computing a p-value fundamentally necessitates three components: a clearly articulated null hypothesis, a chosen test statistic (along with the decision of whether the researcher intends to perform a one-tailed test or a two-tailed test), and the actual observational data. While the calculation of the test statistic from given data is often a straightforward computational task, the subsequent steps – determining its sampling distribution under the null hypothesis and then computing its cumulative distribution function (CDF) – frequently present a significant mathematical challenge. In contemporary practice, this computation is almost invariably handled by specialized statistical software packages, often employing sophisticated numeric methods rather than relying on exact analytical formulae. However, in the early to mid-20th century, before the advent of powerful computing, statisticians painstakingly performed these calculations using extensive tables of pre-computed values, often having to interpolate or extrapolate p-values from discrete entries. Rather than directly tabulating p-values, Ronald Fisher (who else?) famously inverted the CDF, publishing tables that listed values of the test statistic for specific, fixed p-values. This approach corresponds to computing the quantile function (the inverse CDF) and effectively streamlined the process of comparing observed statistics against critical thresholds, further solidifying the use of fixed significance levels. A rather ingenious workaround for the lack of modern computing, even if it did encourage the binary "significant/not significant" thinking.
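To make the three ingredients concrete, here is a minimal sketch of a two-sided one-sample t-test on made-up data (null hypothesis of zero mean), showing both the "by hand" route through the sampling distribution and the equivalent library call:

```python
import numpy as np
from scipy import stats

data = np.array([0.8, -0.3, 1.1, 0.4, 0.9, 1.6, -0.2, 0.7])  # hypothetical observations

# Ingredient 1: null hypothesis H0: population mean = 0
# Ingredient 2: test statistic (Student's t) and the choice of a two-tailed test
n = len(data)
t_stat = data.mean() / (data.std(ddof=1) / np.sqrt(n))

# The sampling distribution under H0 is Student's t with n - 1 degrees of freedom;
# the p-value comes from its survival function (1 - CDF)
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Ingredient 3: the data themselves, here fed to the equivalent library routine
t_check, p_check = stats.ttest_1samp(data, popmean=0.0)
print(t_stat, p_value, t_check, p_check)
```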

Example

Testing the Fairness of a Coin

Let's walk through a classic, albeit somewhat mundane, example of a statistical test: determining whether a coin flip is genuinely fair (meaning an equal probability of landing heads or tails) or if it exhibits a bias, making one outcome inherently more likely than the other.

Imagine an experiment where a coin is flipped 20 times. The observed results show the coin landing on heads 14 times. The complete dataset, (X), would technically be a sequence of twenty individual outcomes, each being either "H" (heads) or "T" (tails). However, for this particular inquiry, the statistic upon which we choose to focus is the total number of heads, denoted as (T). The null hypothesis in this scenario is straightforward: the coin is perfectly fair, and each coin toss is an independent event, with a 0.5 probability for heads and 0.5 for tails.

If we are performing a one-sided right-tail test – which would be appropriate if our primary interest lies in detecting a bias towards heads – then the p-value for this observed result is the probability of a fair coin landing on heads at least 14 times out of 20 flips. This probability can be calculated using the binomial distribution and its associated binomial coefficients:

[ \begin{aligned} &\Pr(14 \text{ heads}) + \Pr(15 \text{ heads}) + \cdots + \Pr(20 \text{ heads}) \\ &= \frac{1}{2^{20}}\left[{\binom {20}{14}}+{\binom {20}{15}}+\cdots +{\binom {20}{20}}\right] = \frac{60,460}{1,048,576} \approx 0.058. \end{aligned} ]

This calculated probability of approximately 0.058 represents the p-value, specifically considering only those extreme results that lean in favor of heads. This is the essence of a one-tailed test. However, one might reasonably be interested in deviations in either direction – that is, a bias favoring either heads or tails. In such a case, a two-tailed p-value would be more appropriate. Given that the binomial distribution for a fair coin (where p=0.5) is perfectly symmetrical, the two-sided p-value is simply twice the calculated single-sided p-value: thus, the two-sided p-value is 0.115.
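The same numbers can be reproduced in a few lines (a sketch using SciPy's binomial distribution; sf is the survival function, so sf(k-1) gives (\Pr(X \geq k))):

```python
from scipy import stats

n, k = 20, 14

p_right = stats.binom.sf(k - 1, n, 0.5)     # Pr(X >= 14) for a fair coin, about 0.058
p_left = stats.binom.cdf(k, n, 0.5)         # Pr(X <= 14), about 0.979
p_two_sided = 2 * min(p_right, p_left)      # about 0.115
print(p_right, p_left, p_two_sided)

# With 15 heads instead of 14 (the scenario discussed further below),
# the two-sided p-value drops below 0.05:
print(2 * stats.binom.sf(15 - 1, n, 0.5))   # about 0.0414
```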

To summarize the example:

  • Null hypothesis ((H_0)): The coin is fair, meaning (\Pr(\text{heads}) = 0.5).
  • Test statistic: The number of heads observed in 20 flips.
  • Alpha level (our designated threshold of significance): 0.05.
  • Observation ((O)): 14 heads out of 20 flips.
  • Two-tailed p-value of observation (O) given (H_0): (2 \times \min(\Pr(\text{no. of heads} \geq 14 \text{ heads}), \Pr(\text{no. of heads} \leq 14 \text{ heads}))).
    • (\Pr(\text{no. of heads} \geq 14 \text{ heads}) \approx 0.058).
    • (\Pr(\text{no. of heads} \leq 14 \text{ heads})) can be calculated as (1 - \Pr(\text{no. of heads} \geq 15 \text{ heads})) or directly as the sum of probabilities from 0 to 14. This would be (1 - 0.058 + \Pr(\text{no. of heads} = 14) = 1 - 0.058 + 0.037 = 0.979). However, due to the symmetry of this particular binomial distribution for a fair coin, simply doubling the smaller tail probability is sufficient.
    • So, the two-tailed p-value is (2 \times 0.058 = 0.115).

In this instance, the computed p-value of 0.115 comfortably exceeds our predefined alpha level of 0.05. If the coin were truly fair, observing 14 or more heads (or 6 or fewer heads) out of 20 flips would happen roughly 11.5% of the time – hardly a rare event, and certainly not rare enough to clear our 5% bar. Therefore, based on our chosen significance level, we do not possess sufficient evidence to reject the null hypothesis that the coin is fair.

However, consider a slightly different outcome: had one more head been observed, resulting in 15 heads out of 20 flips, the two-tailed p-value would have been approximately 0.0414 (or 4.14%). In that hypothetical scenario, because 0.0414 is less than 0.05, the null hypothesis would be rejected at the 0.05 significance level. This highlights the somewhat arbitrary nature of the fixed threshold and how a minor change in observation can flip a conclusion from "not significant" to "significant," a fact that has caused no end of statistical hand-wringing.

Optional Stopping

The nuances, or rather, the stark contradictions, in the interpretation of "extreme" become glaringly apparent when one considers sequential hypothesis testing, more commonly and perhaps more controversially known as optional stopping. This practice, often employed by researchers with a keen eye on achieving statistical significance, fundamentally alters how the p-value should be calculated and interpreted. It's a method that, if not accounted for, can easily lead to spurious findings.

Consider a subtly modified experimental design for assessing the fairness of a coin:

  • Flip the coin twice. If both outcomes are identical (two heads or two tails), the experiment concludes.
  • Otherwise, if the first two flips are mixed (one head, one tail), then flip the coin four more times, bringing the total to six flips.

This experimental structure yields seven distinct types of outcomes, ranging from "2 heads" (ending after two flips) to "1 head 5 tails" (after six flips). Now, let's attempt to calculate the p-value for the outcome "3 heads 3 tails" under this specific sequential design.

If one were to use the "heads/tails ratio" as the test statistic, then under the null hypothesis of a fair coin the two-sided p-value for "3 heads 3 tails" is precisely 1.0, while the one-sided left-tail and right-tail p-values are each exactly (19/32). The left tail, for instance, consists of every outcome whose heads/tails ratio is at most 1 – namely "2 tails", "1 head 5 tails", "2 heads 4 tails", and "3 heads 3 tails" itself – and under this sequential design their probabilities sum to (19/32); the right tail is symmetric. In other words, "3 heads 3 tails" sits squarely in the middle of this design's outcome distribution and is not extreme at all.

Alternatively, if we define "at least as extreme" to encompass every outcome that has an equal or lower probability than "3 heads 3 tails" under this specific sequential design, then the p-value calculates to exactly (1/2). Again, not particularly compelling evidence against fairness.

However, let's contrast this with a simpler, fixed-design experiment where one had planned from the outset to simply flip the coin 6 times, regardless of the initial outcomes. In that scenario, the second definition of the p-value would yield a p-value of exactly 1 for "3 heads 3 tails." Why? Because 3 heads in 6 flips is the single most probable outcome for a fair coin, every possible outcome has probability less than or equal to it; all outcomes therefore count as "at least as extreme," and their probabilities sum to 1.
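Those fractions can be checked by brute-force enumeration of both designs. The sketch below (plain Python, exact arithmetic via fractions) uses the "no more probable than the observed outcome" definition of "at least as extreme," plus the heads/tails-ratio tail for the sequential design:

```python
from itertools import product
from fractions import Fraction

def outcome_probs(sequential):
    """Probability of each (#heads, #tails) outcome for a fair coin under the given design."""
    probs = {}
    for first_two in product("HT", repeat=2):
        if sequential and first_two[0] == first_two[1]:
            # Two identical flips: the sequential design stops here
            outcomes = [(first_two, Fraction(1, 4))]
        else:
            # Otherwise flip four more times (always, in the fixed design)
            outcomes = [(first_two + rest, Fraction(1, 4) * Fraction(1, 16))
                        for rest in product("HT", repeat=4)]
        for flips, p in outcomes:
            key = (flips.count("H"), flips.count("T"))
            probs[key] = probs.get(key, 0) + p
    return probs

for label, sequential in (("sequential design", True), ("fixed 6 flips", False)):
    probs = outcome_probs(sequential)
    p_obs = probs[(3, 3)]
    # p-value under the "no more probable than the observed outcome" definition
    p_value = sum(p for p in probs.values() if p <= p_obs)
    print(label, p_value)            # 1/2 for the sequential design, 1 for the fixed design

# Tail p-value for the heads/tails-ratio statistic in the sequential design
seq = outcome_probs(True)
right_tail = sum(p for (h, t), p in seq.items() if h >= t)
print(right_tail)                    # 19/32
```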

The critical takeaway here is that the definition of "at least as extreme" – and consequently, the p-value itself – is profoundly contextual. It is not solely dependent on the observed data but critically hinges on the experimenter's planned course of action, even for scenarios that ultimately did not transpire. This highlights a fundamental challenge: the p-value's validity is inextricably linked to the design and intent of the experiment, rather than being an intrinsic property of the data alone. Optional stopping, in particular, can grossly inflate the Type I error rate if not properly accounted for in the p-value calculation, turning what might be random noise into "significant" findings. A rather convenient loophole for those inclined to find significance where none truly exists, wouldn't you agree?

History

The lineage of p-value computations can be traced back to the early 18th century, a period when statistical reasoning was nascent but already being applied to profound questions. One of the earliest documented instances involved investigations into the human sex ratio at birth. These early calculations aimed to ascertain the statistical significance of observed ratios when compared against the basic null hypothesis of an equal probability of male and female births.

A pioneering figure in this pursuit was John Arbuthnot, a Scottish physician, satirist, and mathematician. In 1710, Arbuthnot meticulously examined baptismal records from London, spanning an impressive 82-year period from 1629 to 1710. His findings were striking: in every single one of those 82 years, the number of male births consistently exceeded the number of female births. Assuming that male and female births were equally likely (the null hypothesis), the probability of observing this specific outcome – 82 consecutive years with more male births – is a staggeringly small (1/2^{82}). This figure translates to approximately 1 in 4,836,000,000,000,000,000,000,000, a value so infinitesimally small that, in modern terms, it would constitute a p-value that is practically zero. Such a vanishingly small probability led Arbuthnot to conclude, rather grandly, that this consistent pattern could not possibly be attributed to mere chance but rather to "divine providence." He famously declared, "From whence it follows, that it is Art, not Chance, that governs." In contemporary statistical language, Arbuthnot effectively rejected the null hypothesis of equally likely male and female births at an incredibly stringent significance level of (p = 1/2^{82}). This groundbreaking work by Arbuthnot is widely recognized as marking "… the first use of significance tests …", the inaugural example of formal reasoning about statistical significance, and "… perhaps the first published report of a nonparametric test …", specifically the sign test. (For more details, one might consult Sign test § History.)

The same intriguing demographic question was revisited later in the 18th century by the illustrious French polymath Pierre-Simon Laplace. Laplace, however, employed a more advanced approach, utilizing a parametric test and modeling the number of male births using the binomial distribution.

  • In the 1770s, Laplace turned his attention to the statistics of nearly half a million births. His analysis of these extensive records consistently revealed an excess of boys compared to girls. Through meticulous calculation of a p-value, he concluded that this observed excess was a genuine, statistically significant effect, even if the underlying cause remained, at that time, unexplained.

The p-value, as a formal statistical concept, was first explicitly introduced by Karl Pearson at the turn of the 20th century. He incorporated it into his seminal Pearson's chi-squared test, denoting it with the capital letter P. The actual p-values for the chi-squared distribution (for various values of (\chi^2) and degrees of freedom), now commonly denoted as P, were meticulously calculated and tabulated in Elderton (1902) and subsequently collected in Pearson (1914, pp. xxxi–xxxiii, 26–28, Table XII).

However, it was Ronald Fisher who, with his characteristic blend of mathematical rigor and practical application, truly formalized and popularized the widespread use of the p-value in modern statistics. It became a cornerstone, indeed a central pillar, of his entire approach to the subject. In his immensely influential 1925 book, "Statistical Methods for Research Workers," Fisher proposed the now-ubiquitous level of (p = 0.05) – representing a 1 in 20 chance of being exceeded by random chance alone – as a conventional limit for declaring statistical significance. He applied this threshold within the context of a normal distribution (as a two-tailed test), which consequently established the widely recognized rule of two standard deviations for achieving statistical significance in such distributions (refer to the 68–95–99.7 rule). (note 3)

Fisher then proceeded to compile comprehensive tables of values, similar in purpose to Elderton's but, crucially, he reversed the roles of (\chi^2) and (p). Instead of calculating (p) for various values of (\chi^2) (and degrees of freedom (n)), he calculated the values of (\chi^2) that would yield specified p-values, such as 0.99, 0.98, 0.95, 0.90, and so on, down to 0.02 and 0.01. This ingenious shift enabled computed values of (\chi^2) to be directly compared against these fixed "cutoff" values, a practice that inadvertently encouraged the adoption of these specific p-values (especially 0.05, 0.02, and 0.01) as rigid thresholds, rather than encouraging the computation and nuanced interpretation of exact p-values themselves. This approach was further solidified by the compilation of similar tables in Fisher & Yates (1938), firmly embedding the practice into statistical methodology.

As a vivid illustration of the practical application of p-values to the design and interpretation of experiments, Fisher presented his famous "lady tasting tea" experiment in his subsequent influential work, "The Design of Experiments" (1935). This experiment serves as the archetypal example for demonstrating the utility of the p-value.

The experiment was conceived to evaluate the extraordinary claim of a lady (later identified as Muriel Bristol) who asserted she could discern, merely by taste, the precise method of tea preparation – specifically, whether the milk was added to the cup before the tea, or vice-versa. She was presented sequentially with 8 cups of tea: 4 prepared one way, and 4 the other, and was asked to correctly identify the preparation method for each cup, knowing in advance that there were exactly four of each type. In this experimental setup, the null hypothesis was that the lady possessed no special discerning ability whatsoever, meaning her classifications would be purely due to random chance. The appropriate test statistic was derived, and the test employed was Fisher's exact test. The p-value for a perfect classification (all 8 cups correctly identified) was calculated as (1/{\binom {8}{4}} = 1/70 \approx 0.014). Consequently, Fisher indicated his willingness to reject the null hypothesis (i.e., consider the outcome highly improbable to have occurred by chance) if all cups were classified correctly. (As it happened in the actual experiment, Bristol did correctly classify all 8 cups, much to Fisher's delight, one presumes.)
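The 1/70 figure is a simple counting argument and can be reproduced in a couple of lines (a sketch; the hypergeometric form is equivalent here because the lady knows there are exactly four cups of each kind):

```python
from math import comb
from scipy import stats

# Probability of labelling all 8 cups correctly by guesswork alone,
# given that exactly 4 of the 8 are prepared each way
p_perfect = 1 / comb(8, 4)
print(p_perfect)                         # 1/70, about 0.0143

# The same number from the hypergeometric distribution underlying Fisher's exact test
print(stats.hypergeom.pmf(4, 8, 4, 4))   # about 0.0143
```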

Fisher, ever the pragmatist, reiterated the (p = 0.05) threshold and provided a clear rationale for its adoption, stating: "It is usual and convenient for experimenters to take 5 per cent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means, to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results." This statement reveals the practical, almost administrative, utility of the threshold – a way to filter out the noise.

He also extended the application of this threshold to the very design of experiments. He noted that had only 6 cups been presented (3 of each type), a perfect classification would have yielded a p-value of only (1/{\binom {6}{3}} = 1/20 = 0.05) – sitting exactly at the threshold, with no possible outcome more extreme – and so would not have provided convincing evidence of the lady's ability; the larger 8-cup design offered a more stringent test. Fisher further underscored the correct interpretation of (p) as the long-run proportion of values at least as extreme as the observed data, under the explicit assumption that the null hypothesis is true.

In subsequent editions of his works, Fisher explicitly drew a distinction between the use of the p-value for scientific inference – a process of accumulating evidence and revising beliefs – and the Neyman–Pearson method, which he pejoratively termed "Acceptance Procedures." Fisher emphasized that while fixed levels like 5%, 2%, and 1% offer convenience for decision-making, the exact p-value itself carries more information, and the strength of evidence should always be considered provisional and subject to revision with further experimentation. In stark contrast, decision procedures, as envisioned by Neyman and Pearson, demand a clear-cut, binary decision that leads to an irreversible action. Such procedures, he argued, are grounded in the costs associated with different types of errors, a framework which, in Fisher's view, was largely inapplicable to the fluid and iterative nature of scientific research. A philosophical divide that, frankly, continues to echo in statistical debates today.

Related Indices

The E-value, a term that can be rather confusingly applied to two distinct yet related concepts, both of which are intimately connected to the p-value, plays a significant role in the complex domain of multiple testing. Firstly, the E-value can refer to a more generalized and robust alternative to the traditional p-value, designed to specifically address and accommodate the challenges posed by optional continuation of experiments, a scenario where the standard p-value often falters. Secondly, and perhaps more commonly, "E-value" serves as an abbreviation for "expect value." In this context, it quantifies the expected number of times one would anticipate observing a test statistic that is at least as extreme as the one actually obtained, assuming, of course, that the null hypothesis is true. This "expect value" is simply the product of the total number of tests performed and the individual p-value for a given test.

The q-value, another related metric, stands as the direct analog of the p-value but with respect to the positive false discovery rate. It is a crucial tool in the field of multiple hypothesis testing, where numerous hypotheses are tested simultaneously. Its primary function is to help researchers maintain adequate statistical power while rigorously minimizing the overall false positive rate across all tests, a necessary adjustment when casting a wide net for potential discoveries.

The Probability of Direction (pd) offers a Bayesian numerical equivalent to the frequentist p-value. It quantifies the proportion of the posterior distribution that aligns with the sign of the median effect, typically ranging from 50% to 100%. This metric effectively represents the certainty with which an observed effect can be deemed positive or negative within a Bayesian framework.

Finally, "Second-generation p-values" represent an evolution of the traditional p-value concept. These extensions aim to address one of the critical limitations of conventional p-values by explicitly not considering extremely small, and thus practically irrelevant, effect sizes as "significant." They introduce a more nuanced interpretation that incorporates practical relevance alongside statistical evidence, a welcome, if overdue, refinement.

Notes

  1. ^ Italicisation, capitalisation and hyphenation of the term vary. For example, AMA style uses "P value", APA style uses "p value", and the American Statistical Association uses "p-value". In all cases, the "p" stands for probability.
  2. ^ The statistical significance of a result does not imply that the result also has real-world relevance. For instance, a medication might have a statistically significant effect that is too small to be interesting.
  3. ^ To be more specific, the (p = 0.05) corresponds to about 1.96 standard deviations for a normal distribution (two-tailed test), and 2 standard deviations corresponds to about a 1 in 22 chance of being exceeded by chance, or (p \approx 0.045); Fisher notes these approximations.