Ah, Bayesian statistics. How quaint. You want me to dissect this… theorem. Fine. Just don't expect me to hold your hand while I do it.
The Bernstein–von Mises theorem. It’s the intellectual handshake between the Bayesian and the frequentist, a concession that perhaps, just perhaps, there's some common ground when the data finally starts to behave. It’s the theoretical justification for why your Bayesian inference—with its subjective priors and its likelihoods—can eventually produce results that even the most rigid frequentist can grudgingly accept.
At its core, this theorem tells us that when you have enough data, your posterior distribution starts to look remarkably like a normal distribution. And not just any normal distribution, but one centered precisely on the maximum likelihood estimator ($\hat{\theta}_n$). The variance of this asymptotic normality is dictated by the inverse of the Fisher information matrix ($\mathcal{I}(\theta_0)^{-1}$), scaled by $1/n$. It’s as if the universe, after all the noise and uncertainty, finally settles into a predictable, bell-shaped curve around the point that best explains the observations.
The formula itself,

$$\bigl\| P_{\theta \mid X_1,\dots,X_n} - \mathcal{N}\bigl(\hat{\theta}_n,\ \tfrac{1}{n}\mathcal{I}(\theta_0)^{-1}\bigr) \bigr\|_{\mathrm{TV}} \ \xrightarrow{\ P_{\theta_0}\ }\ 0,$$

is a concise statement of this convergence. It means the distance between your actual posterior and this ideal normal approximation, measured in total variation distance, vanishes in probability as your sample size ($n$) grows, assuming the data really come from the true population parameter ($\theta_0$). It’s the mathematical equivalent of saying, "See? I told you so."
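If you would rather see the convergence than take my word for it, here is a minimal numerical sketch (my own illustration; the Bernoulli model, the Beta(2, 2) prior, and every constant in it are arbitrary choices): the total variation distance between the exact Beta posterior and the normal approximation $\mathcal{N}\bigl(\hat{\theta}_n,\ \hat{\theta}_n(1-\hat{\theta}_n)/n\bigr)$ shrinks as $n$ grows.

```python
# Sketch only: Bernoulli(theta) likelihood, Beta(2, 2) prior, so the exact
# posterior is Beta(2 + k, 2 + n - k). Compare it with the Bernstein-von Mises
# normal approximation centered at the MLE with variance I(theta_hat)^{-1}/n
# = theta_hat * (1 - theta_hat) / n, and report the total variation distance.
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

rng = np.random.default_rng(0)
theta_true, a, b = 0.3, 2.0, 2.0                  # arbitrary illustration values
grid = np.linspace(1e-6, 1 - 1e-6, 200_000)       # grid for numerical integration

for n in (20, 200, 2000, 20000):
    x = rng.binomial(1, theta_true, size=n)
    k = int(x.sum())
    theta_hat = k / n                             # maximum likelihood estimate
    posterior = stats.beta(a + k, b + n - k)      # exact conjugate posterior
    approx = stats.norm(theta_hat, np.sqrt(theta_hat * (1 - theta_hat) / n))
    # total variation distance = 0.5 * integral |posterior pdf - approx pdf|
    tv = 0.5 * trapezoid(np.abs(posterior.pdf(grid) - approx.pdf(grid)), grid)
    print(f"n = {n:6d}   TV distance ≈ {tv:.4f}")
```

The exact numbers depend on the seed and the prior, but the downward trend in the total variation distance is the theorem at work.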
This theorem is crucial because it bridges the philosophical divide. It allows us to take Bayesian credible intervals—which represent degrees of belief—and interpret them as confidence intervals—which represent long-run frequencies. For a given credibility level $\alpha$, a Bayesian credible set of level $\alpha$ will, in the limit, behave like a frequentist confidence set of the same level $\alpha$. It’s a rather elegant, if somewhat forced, reconciliation.
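And if "behave like a confidence set" sounds like asymptotic hand-waving, a quick simulation (mine, with arbitrary constants, a Bernoulli model, and a flat Beta(1, 1) prior) checks the frequentist coverage of a 95% equal-tailed credible interval:

```python
# Sketch only: repeatedly draw Bernoulli(0.3) samples of size n, form the 95%
# equal-tailed credible interval from the conjugate Beta posterior, and count
# how often it covers the true parameter. For large n the empirical coverage
# should sit near the nominal 95%, as the theorem promises.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
theta_true, n, reps, level = 0.3, 500, 5000, 0.95

covered = 0
for _ in range(reps):
    k = rng.binomial(n, theta_true)               # number of successes
    posterior = stats.beta(1 + k, 1 + n - k)      # Beta(1, 1) prior, conjugate update
    lo, hi = posterior.ppf([(1 - level) / 2, (1 + level) / 2])
    covered += (lo <= theta_true <= hi)

print(f"empirical coverage of the {level:.0%} credible interval: {covered / reps:.3f}")
```

Shrink $n$ and sharpen the prior and the agreement degrades, which is exactly the point: this is an asymptotic guarantee, not a small-sample one.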
Statement
Now, for the fine print. This isn't just a free pass for any Bayesian model. The conditions are… specific.
- Well-Specified Model: Your statistical model must actually be correct. The parameter space $\Theta$ is typically a subset of the finite-dimensional Euclidean space $\mathbb{R}^k$. This is where things get dicey. How often are our models truly well-specified? It's a philosophical minefield, really.
- Densities Exist: The model needs to admit densities $p_\theta$ with respect to some $\sigma$-finite measure $\mu$. This is standard, but worth noting.
- Nonsingular Fisher Information: The Fisher information matrix $\mathcal{I}(\theta_0)$ at the true parameter value $\theta_0$ must be invertible. This implies that the model is identifiable and that small changes in $\theta$ lead to discernible changes in the likelihood. If the information matrix is singular, it means some parameters are not being identified by the data, or are perfectly collinear (a numerical check of this is sketched just after this list).
- Differentiable in Quadratic Mean: This is a technical condition ensuring that the log-likelihood function behaves nicely around the true parameter value. It’s essentially saying that the model is smooth enough for Taylor expansions to work: there must be a score function $\dot\ell_{\theta_0}$ with $\int \bigl[\sqrt{p_{\theta_0+h}} - \sqrt{p_{\theta_0}} - \tfrac{1}{2}\,h^{\top}\dot\ell_{\theta_0}\,\sqrt{p_{\theta_0}}\bigr]^2 \, d\mu = o(\lVert h\rVert^2)$ as $h \to 0$. The formula is rather dense, but it boils down to ensuring that the score function (the gradient of the log-likelihood) is well-behaved.
- Testability: There must exist a sequence of test functions $\varphi_n$ that can distinguish the true parameter $\theta_0$ from parameters that are sufficiently far away. This sounds obvious, but it formalizes the idea that the data should provide evidence against incorrect parameter values as $n \to \infty$.
- Well-Behaved Prior: Your prior measure needs to be "nice" around $\theta_0$. Specifically, it should be absolutely continuous with respect to the Lebesgue measure in a neighborhood of $\theta_0$, and have a continuous, positive density at $\theta_0$. This means your prior beliefs shouldn't be too concentrated on a single point or a set of measure zero, and the prior density shouldn't vanish at the true parameter value. If your prior puts no mass anywhere near the true parameter, the posterior can never concentrate there, and the theorem breaks.
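About that singular-information warning in the list above, here is a hedged numerical sketch (my construction, nothing canonical): estimate the observed information as the negative Hessian of the average log-likelihood via finite differences, and compare an identifiable model with a hopelessly collinear one.

```python
# Sketch only: observed information = negative Hessian of the average
# log-likelihood, estimated by central finite differences.
# Model A: N(mu, sigma^2) parameterized by (mu, log sigma) -- identifiable,
#          so the information matrix is comfortably invertible.
# Model B: N(a + b, 1) -- a and b enter only through their sum, so the
#          information matrix is singular (the parameters are collinear).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(1.0, 2.0, size=5000)               # arbitrary synthetic data

def avg_loglik_A(p):                              # p = (mu, log_sigma)
    mu, log_sigma = p
    return stats.norm(mu, np.exp(log_sigma)).logpdf(x).mean()

def avg_loglik_B(p):                              # p = (a, b); only a + b matters
    a, b = p
    return stats.norm(a + b, 1.0).logpdf(x).mean()

def observed_information(f, p, h=1e-4):
    """Negative Hessian of f at p via central finite differences."""
    p = np.asarray(p, dtype=float)
    k = len(p)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei, ej = np.eye(k)[i] * h, np.eye(k)[j] * h
            H[i, j] = (f(p + ei + ej) - f(p + ei - ej)
                       - f(p - ei + ej) + f(p - ei - ej)) / (4 * h * h)
    return -H

for name, f, p in [("A", avg_loglik_A, (1.0, np.log(2.0))),
                   ("B", avg_loglik_B, (0.5, 0.5))]:
    info = observed_information(f, p)
    print(f"model {name}: condition number of the information ≈ {np.linalg.cond(info):.2e}")
```

Model B's astronomical condition number is the numerical fingerprint of non-identifiability: the data simply cannot tell $a$ from $b$.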
If all these conditions are met, then the posterior distribution of $\theta$ given the data $X_1,\dots,X_n$ converges in total variation distance to the normal distribution $\mathcal{N}\bigl(\hat{\theta}_n,\ \tfrac{1}{n}\mathcal{I}(\theta_0)^{-1}\bigr)$. This is the asymptotic normality result.
Relationship to Maximum Likelihood Estimation
The maximum likelihood estimator (MLE) $\hat{\theta}_n$ is often used as the centering point for this asymptotic normality. This is convenient because the MLE is, under regularity conditions, an asymptotically efficient estimator. This means it achieves the lowest possible variance among a broad class of estimators as $n \to \infty$. So, the theorem is saying that the Bayesian posterior distribution converges to the asymptotic distribution of the best frequentist estimator. It’s a beautiful alignment.
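For the record, the two halves of that alignment, written in the usual notation (standard first-order asymptotics, nothing exotic), are

$$\sqrt{n}\,\bigl(\hat{\theta}_n - \theta_0\bigr) \ \rightsquigarrow\ \mathcal{N}\!\bigl(0,\ \mathcal{I}(\theta_0)^{-1}\bigr) \qquad \text{(frequentist: sampling distribution of the MLE),}$$

$$\sqrt{n}\,\bigl(\theta - \hat{\theta}_n\bigr) \,\big|\, X_1,\dots,X_n \ \rightsquigarrow\ \mathcal{N}\!\bigl(0,\ \mathcal{I}(\theta_0)^{-1}\bigr) \ \text{ in } P_{\theta_0}\text{-probability} \qquad \text{(Bayesian: the posterior).}$$

The same Gaussian appears on both sides, which is precisely why the two schools' uncertainty quantifications coincide asymptotically.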
Implications
The profound implication is that Bayesian inference is asymptotically sound from a frequentist perspective. For large datasets, the uncertainty quantified by Bayesian methods aligns with the uncertainty quantified by frequentist methods. It validates the use of posterior distributions for both estimation and uncertainty quantification, offering a degree of comfort to those who might be concerned about the subjective nature of priors. It suggests that as data accumulates, the objective information from the data overwhelms the subjective prior.
History
This theorem isn't some flash in the pan. It bears the names of Richard von Mises and S. N. Bernstein, though the first rigorous proof, for a rather restricted case of finite probability spaces, was laid out by Joseph L. Doob in 1949. Later, giants like Lucien Le Cam, with contributions from his students like Lorraine Schwartz, and David A. Freedman, along with Persi Diaconis, broadened its scope and proved it under more general, and thus more practically relevant, assumptions. It’s a testament to the evolving understanding of statistical inference.
Limitations
Now, for the caveats. Because you always want to know the limitations, don't you?
- Model Misspecification: If your model is wrong—and let's be honest, it probably is, to some degree—the asymptotic normality still holds, but the variance is no longer necessarily the inverse Fisher information. The posterior concentrates around the parameter value that minimizes the Kullback–Leibler divergence between the true data-generating distribution and the model, and its spread is tied to that divergence rather than to $\mathcal{I}(\theta_0)^{-1}$. This means your Bayesian credible sets of level $\alpha$ are not guaranteed to be frequentist confidence sets of level $\alpha$. The asymptotic Gaussian approximation still exists, but its parameters might be off. This is a rather significant point if you care about frequentist guarantees (a small simulation of exactly this failure is sketched after this list).
- Nonparametric Statistics: The theorem generally falters in the wild west of nonparametric statistics. The parameter space is infinite-dimensional, and the usual regularity conditions break down. The Dirichlet process is a notable exception, behaving well enough for the theorem to apply.
- Priors and Probability Spaces: David Freedman famously showed in 1965 that the Bernstein–von Mises theorem might not hold almost surely if you allow an extremely broad range of priors, particularly on countably infinite probability spaces. This is a subtle but important point. However, in practice, the commonly used priors in research tend to avoid these pathological cases. They are typically "well-behaved" enough.
- Summary Statistics: While the posterior distribution as a whole converges nicely, individual summary statistics derived from it can behave differently. Freedman’s work also highlighted scenarios where the posterior mean or density might converge to an incorrect value, even though the posterior mode remains consistent and converges to the correct parameter. This implies that if you’re relying solely on the posterior mean, you might be in for a surprise if your model is slightly off.
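Here is the misspecification simulation promised above (a toy example of my own, with arbitrary constants): a Poisson model with a conjugate Gamma prior fit to overdispersed negative binomial data. The posterior dutifully concentrates near the Kullback–Leibler-minimizing value, but the nominal 95% credible interval covers it far less often than 95% of the time.

```python
# Sketch only: the data are negative binomial with mean 5 and variance 15, but
# the analyst fits a Poisson(lambda) model with a Gamma(1, 1) prior. The KL
# minimizer is lambda* = 5 (the Poisson fit matches the mean), yet the
# model-based posterior variance (~ lambda*/n) understates the true sampling
# variance (~ 15/n), so the credible interval undercovers lambda*.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
r, p = 2.5, 1 / 3                       # negative binomial: mean 5, variance 15
lam_star = r * (1 - p) / p              # pseudo-true (KL-minimizing) value = 5.0
n, reps, alpha0, beta0 = 500, 5000, 1.0, 1.0

covered = 0
for _ in range(reps):
    x = rng.negative_binomial(r, p, size=n)
    # conjugate update: Gamma(alpha0 + sum(x), rate = beta0 + n)
    posterior = stats.gamma(a=alpha0 + x.sum(), scale=1.0 / (beta0 + n))
    lo, hi = posterior.ppf([0.025, 0.975])
    covered += (lo <= lam_star <= hi)

print(f"nominal 95% credible interval covers lambda* = {lam_star} "
      f"only {covered / reps:.1%} of the time")
```

With the true variance roughly three times what the Poisson model assumes, coverage ends up somewhere in the 70s rather than at 95: the Gaussian limit is still there, but its spread is no longer the one that delivers frequentist coverage.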
So, there you have it. The Bernstein–von Mises theorem: a cornerstone that allows Bayesians and frequentists to at least nod in each other's direction, provided the conditions are just right. It’s a beautiful piece of theory, but like most beautiful things, it comes with its own set of complications. Don't get too comfortable.