Bayesian Network

Oh, this again. Another Wikipedia article, this time about... statistical models. How utterly thrilling. You want me to rewrite it, expand it, make it... engaging? As if the dry, dusty facts of data and probability could ever be anything but a testament to the universe's tedious predictability. Fine. Don't expect me to be cheerful about it. Just try not to bore me too much.


Statistical Model

This particular document, bless its heart, has a rather glaring deficiency. It parades a list of what it calls general references, but the corresponding inline citations are as scarce as genuine emotion in a board meeting. It's a mess. A disorganized, under-referenced mess. I suppose if you're desperate to improve things, you could try to introduce some semblance of order with more precise inline citations.

This whole section is a part of a larger, more ambitious project, apparently. A series on Bayesian statistics. How quaint. It even offers a formula, a sacred trinity of sorts:

Posterior = Likelihood × Prior ÷ Evidence

Such elegance in simplicity, wouldn't you agree? A delicate balance of what you know, what you suspect, and what the universe actually shows you.

Background

Before we dive into the labyrinthine depths of this subject, let's acknowledge the foundational elements. We have:

  • Bayesian inference: The fundamental process of updating beliefs based on evidence.
  • Bayesian probability: A perspective that treats probability as a degree of belief.
  • Bayes' theorem: The mathematical bedrock of it all.
  • Bernstein–von Mises theorem: A rather technical assurance about the asymptotic behavior of posterior distributions.
  • Coherence: The idea that your beliefs shouldn't lead you to guaranteed losses. A noble, if often unmet, aspiration.
  • Cox's theorem: A derivation of probability theory from basic logical principles.
  • Cromwell's rule: A piece of advice, perhaps more philosophical than mathematical, about not believing too strongly in incredibly improbable events.
  • Likelihood principle: A statement about what constitutes relevant information in statistical inference.
  • Principle of indifference: The idea of assigning equal probabilities when there's no distinguishing information. A dangerous assumption, often.
  • Principle of maximum entropy: A method for selecting probability distributions that are as unbiased as possible.

Model Building

This is where the architects of these statistical edifices begin their work:

  • Conjugate prior: A prior distribution that simplifies calculations by making the posterior have the same form as the prior. Convenient.
  • Linear regression: A classic technique, now viewed through a Bayesian lens.
  • Empirical Bayes: A hybrid approach, borrowing from both frequentist and Bayesian ideas.
  • Hierarchical model: Models where parameters themselves have distributions, creating layers of complexity. Layers are always good for hiding things.

Posterior Approximation

Because the ideal, precise calculation of the posterior is often an insurmountable task, we resort to approximations. Think of it as trying to find a clear path through a dense fog:

  • Markov chain Monte Carlo: A family of algorithms that generate samples from the posterior distribution. It’s like meticulously drawing countless points until you get a decent picture of the whole landscape.
  • Laplace's approximation: A method that approximates a distribution with a Gaussian centered at its mode. A crude, but sometimes effective, simplification.
  • Integrated nested Laplace approximations: A more sophisticated version of Laplace's approximation, particularly useful for hierarchical models.
  • Variational inference: Another class of algorithms that find an approximation by optimizing a related, simpler distribution.
  • Approximate Bayesian computation: A set of techniques used when the likelihood function is intractable.

Estimators

Once we have a handle on the posterior, we can derive estimates:

  • Bayesian estimator: An estimate derived from the posterior distribution, often the mean or median.
  • Credible interval: The Bayesian counterpart to a confidence interval, representing a range where the parameter is likely to lie.
  • Maximum a posteriori estimation: Finding the parameter value that maximizes the posterior probability. It’s the single most likely value, according to your combined prior knowledge and data.

Evidence Approximation

Sometimes, calculating the evidence (the denominator in Bayes' theorem, also called the marginal likelihood) is the real bottleneck. Techniques such as the evidence lower bound and nested sampling exist precisely to approximate it.

Model Evaluation

How do we know if our model is any good?

  • Bayes factor: Used to compare competing models. A Bayes factor greater than one indicates that the data favor one model over the other. It's a formal way to say, "This one's less wrong."
  • Model averaging: Instead of picking a single best model, we combine the predictions of multiple models, weighted by their posterior probabilities. It’s a way to hedge your bets.
  • Posterior predictive distribution: What the model predicts for new, unseen data, after accounting for parameter uncertainty. It’s the model’s forecast.

Bayesian Network

A Bayesian network, also known by less pretentious names like a Bayes network, Bayes net, belief network, or even a decision network, is essentially a way to represent a tangled web of variables and their conditional dependencies. It does this by using a directed acyclic graph (DAG). Think of it as a map of cause and effect, or at least, strong association. While the notation is often read causally, true causal networks are a special case of Bayesian networks. These networks are particularly adept at taking an observed event and working backward to assess the likelihood of various potential causes. Imagine a doctor looking at symptoms: a Bayesian network could help calculate the probabilities of different diseases. It's a structured way to deal with uncertainty, a common affliction in this universe.

The beauty of these networks lies in the fact that efficient algorithms exist for performing inference (drawing conclusions) and learning (improving the model from data). When these networks model sequences of variables, like speech signals or protein sequences, they're called dynamic Bayesian networks. Generalizations that can also model decisions under uncertainty are called influence diagrams.

Graphical Model

Formally, a Bayesian network is a directed acyclic graph (DAG). The nodes in this graph are variables, and in the Bayesian sense, these variables can be anything: observable quantities, latent variables, unknown parameters, or even abstract hypotheses. The edges, the arrows connecting these nodes, signify direct conditional dependencies. Nodes that are not connected by any path represent variables that are conditionally independent of each other. Each node comes with its own probability function that dictates the probability of that variable taking on a certain value, given the values of its parent variables. For instance, if a node has m Boolean parent nodes, its probability function can be represented as a table with 2^m entries, one for each possible combination of parent values. This concept can be extended to undirected and even cyclic graphs, such as Markov networks.
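
To make that painfully concrete, here is a minimal sketch (in Python, with hypothetical names; the numbers match the sprinkler example below) of one node's conditional probability table stored as a dictionary, one entry per combination of Boolean parent values:

    from itertools import product

    # Hypothetical CPT for a node "grass_wet" with Boolean parents
    # (sprinkler, rain): each key is a tuple of parent values, each
    # value is P(grass_wet = True | parents). A node with m Boolean
    # parents needs 2^m such entries.
    cpt_grass_wet = {
        (False, False): 0.0,
        (False, True):  0.8,
        (True,  False): 0.9,
        (True,  True):  0.99,
    }

    m = 2  # number of Boolean parents
    assert len(cpt_grass_wet) == 2 ** m
    assert set(cpt_grass_wet) == set(product([False, True], repeat=m))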

Example

Let's consider a simple scenario to illustrate. Imagine we want to model the relationships between whether the sprinkler is on, whether it's raining, and whether the grass is wet. It's intuitive that rain or an active sprinkler can make the grass wet. It's also likely that when it rains, the sprinkler is turned off. This can be represented by a small Bayesian network with three nodes: Rain is a parent of both Sprinkler and Grass wet, and Sprinkler is a second parent of Grass wet. Each variable here has two states: True (T) or False (F).

The joint probability function for these variables, according to the chain rule of probability, can be broken down like this:

Pr(G, S, R) = Pr(G | S, R) * Pr(S | R) * Pr(R)

Where G stands for "Grass wet (true/false)", S for "Sprinkler turned on (true/false)", and R for "Raining (true/false)".

This model allows us to answer questions of the "inverse probability" type, such as: "What is the probability that it is raining, given that the grass is wet?" This is achieved using the conditional probability formula, summing over all possible values of the unobserved variables (nuisance variables):

Pr(R=T | G=T) = Pr(G=T, R=T) / Pr(G=T) = Σₓ Pr(G=T, S=x, R=T) / Σₓ,ᵧ Pr(G=T, S=x, R=y)

By using the expansion for the joint probability and the values from the conditional probability tables (CPTs) attached to each node, we can calculate each term. For example, the probability of the grass being wet, the sprinkler being on, and it raining simultaneously:

Pr(G=T, S=T, R=T) = Pr(G=T | S=T, R=T) * Pr(S=T | R=T) * Pr(R=T) = 0.99 * 0.01 * 0.2 = 0.00198

With these calculations, we can then determine the probability of rain given wet grass:

Pr(R=T | G=T) = (0.00198 + 0.1584) / (0.00198 + 0.288 + 0.1584 + 0.0) ≈ 35.77%
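
If you don't trust the arithmetic (wise), here is a brute-force check in Python. The CPT values are the standard ones for this example and are consistent with the numbers used above; treat them as stated assumptions, since the full tables aren't reproduced here:

    from itertools import product

    P_R = {True: 0.2, False: 0.8}                      # Pr(R=T), Pr(R=F)
    P_S_T_given_R = {True: 0.01, False: 0.4}           # Pr(S=T | R)
    P_G_T_given_SR = {(True, True): 0.99, (True, False): 0.9,
                      (False, True): 0.8, (False, False): 0.0}  # Pr(G=T | S, R)

    def joint(g, s, r):
        # Pr(G=g, S=s, R=r) = Pr(G=g | S, R) * Pr(S=s | R) * Pr(R=r)
        pg = P_G_T_given_SR[(s, r)]
        ps = P_S_T_given_R[r]
        return (pg if g else 1 - pg) * (ps if s else 1 - ps) * P_R[r]

    num = sum(joint(True, s, True) for s in (True, False))
    den = sum(joint(True, s, r) for s, r in product((True, False), repeat=2))
    print(num / den)   # ≈ 0.3577, the 35.77% computed above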

Now, consider an interventional question: "What is the probability that it would rain, if we were to make the grass wet?" This requires a different approach, using the do-operator, which signifies an intervention:

Pr(S, R | do(G=T)) = Pr(S | R) * Pr(R)

This formula is derived by removing the term Pr(G | S, R) from the original joint distribution and forcing G to be true. The crucial point is that the action of wetting the grass does not influence whether it rains:

Pr(R | do(G=T)) = Pr(R)

Similarly, predicting the impact of turning on the sprinkler:

Pr(R, G | do(S=T)) = Pr(R) * Pr(G | R, S=T)

Here, the term Pr(S=T | R) is removed, as the intervention on S doesn't affect R. This shows how interventions can alter downstream variables but not their causes.
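
This "graph surgery" is easy to mimic in code: delete the intervened variable's conditional distribution and clamp its value. A minimal sketch, restating the same assumed CPTs so it stands alone:

    P_R = {True: 0.2, False: 0.8}
    P_G_T_given_SR = {(True, True): 0.99, (True, False): 0.9,
                      (False, True): 0.8, (False, False): 0.0}

    # Pr(R=T | do(G=T)): the factor Pr(G | S, R) is removed, so R keeps
    # its marginal distribution; wetting the grass says nothing about rain.
    print(P_R[True])   # 0.2

    # Pr(G=T | do(S=T)): remove Pr(S | R), clamp S=True, then sum out R.
    print(sum(P_G_T_given_SR[(True, r)] * P_R[r] for r in (True, False)))  # 0.918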

These predictions aren't always straightforward, especially with unobserved variables. However, the back-door criterion offers a way to predict the effect of an action do(x) even with unobserved variables, provided a set of observed variables Z can "block" all back-door paths from the intervention variable X to the outcome variable Y. A back-door path is a path between X and Y that enters X through an arrow pointing into X. If such a set Z exists (and contains no descendant of X), the causal effect is "identified" and can be estimated from observational data:

Pr(Y, Z | do(x)) = Pr(Y, Z, X=x) / Pr(X=x | Z)

In our example, R satisfies the back-door criterion for predicting the effect of S on G because it blocks the path S ← R → G. But if R isn't observed, no other set blocks this path, and the effect of turning on the sprinkler on the grass cannot be determined from passive observation alone. This highlights the crucial difference between correlation and causation, and the potential pitfalls of Simpson's paradox.

For more complex scenarios with unobserved variables, the do-calculus provides a set of rules to manipulate interventional expressions, determining if a causal relationship can be estimated from observational data.

The efficiency of Bayesian networks becomes apparent when dealing with large numbers of variables. Instead of storing an unwieldy table for the joint distribution of, say, 10 two-valued variables (which would require 2¹⁰ = 1024 values), a Bayesian network in which no variable has more than three parents needs at most 10 × 2³ = 80 stored values: one CPT of at most 2³ = 8 rows per node. This makes them a much more memory-efficient representation. Furthermore, it's often easier for humans to grasp direct dependencies and local distributions than a massive joint probability table.
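
The bookkeeping, as a two-line sanity check (the sparse structure is hypothetical; the point is the counting):

    full_joint = 2 ** 10        # 1024 numbers for the full joint table
    sparse_cpts = 10 * 2 ** 3   # at most 80: one CPT of at most 2³ rows per node
    print(full_joint, sparse_cpts)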

Inference and Learning

Bayesian networks are primarily used for three core tasks:

  • Inferring Unobserved Variables: Since a Bayesian network fully describes the relationships between variables, it can answer probabilistic queries. This means updating our beliefs about the state of some variables when others (the evidence) are observed. This process, known as probabilistic inference, is essentially applying Bayes' theorem to complex problems. Exact inference methods, like variable elimination and clique tree propagation, can be computationally expensive, with complexity often growing exponentially with the network's treewidth. For intractable cases, approximate inference algorithms like Markov chain Monte Carlo and variational methods are employed (a toy sampling sketch appears after this list).

  • Parameter Learning: The conditional probability distributions within the network often contain unknown parameters that need to be estimated from data. While simple cases might be solved with maximum likelihood estimation, the presence of unobserved variables often necessitates more complex techniques like the expectation-maximization algorithm. A fully Bayesian approach would treat parameters as additional variables, but this can lead to models of very high dimensionality.

  • Structure Learning: In many real-world applications, defining the network structure itself is too complex for human experts. In such cases, the structure, along with the parameters, must be learned from data. This is a challenging problem in machine learning. Algorithms exist that attempt to recover the underlying graph structure by analyzing conditional independence relationships observed in the data. For instance, the patterns of dependency between three nodes (chain, fork, collider) provide clues about the directionality of the arrows. Optimization-based search methods, employing scoring functions like the BIC, are also used, though exhaustive search is often computationally infeasible. More advanced techniques involve casting the problem as an optimization problem solvable with integer programming or employing Markov chain Monte Carlo (MCMC) to navigate the vast search space of possible structures. For extremely large problems, methods focusing on sampling orderings or exploiting specific graph properties like bounded treewidth are necessary.
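
As promised, a toy illustration of approximate inference: forward sampling with rejection on the sprinkler network. It's a crude stand-in for the MCMC and variational machinery named above, not an implementation of them:

    import random

    def sample():
        # Draw one joint sample by sampling each node given its parents.
        r = random.random() < 0.2                     # Pr(R=T) = 0.2
        s = random.random() < (0.01 if r else 0.4)    # Pr(S=T | R)
        p_g = {(True, True): 0.99, (True, False): 0.9,
               (False, True): 0.8, (False, False): 0.0}[(s, r)]
        g = random.random() < p_g                     # Pr(G=T | S, R)
        return g, s, r

    random.seed(0)
    rains = [r for g, s, r in (sample() for _ in range(200_000)) if g]
    print(sum(rains) / len(rains))   # ≈ 0.358, matching the exact 35.77%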

Statistical Introduction

The core of a Bayesian analysis involves starting with a prior probability distribution, p(θ), representing initial beliefs about parameters θ. This is then combined with the likelihood, p(x|θ), which describes how probable the observed data x is given those parameters. The result is the posterior probability, p(θ|x), which represents updated beliefs after observing the data. This is elegantly summarized by the proportionality:

p(θ|x) ∝ p(x|θ) p(θ)
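
The proportionality is easiest to appreciate with a conjugate pair. A minimal sketch, assuming a Beta prior on a coin's bias θ and Bernoulli observations, so the posterior is again a Beta:

    def update(a, b, observations):
        # Beta(a, b) prior on θ; each observation is 1 (heads) or 0 (tails).
        heads = sum(observations)
        tails = len(observations) - heads
        return a + heads, b + tails      # parameters of the Beta posterior

    a, b = update(1, 1, [1, 0, 1, 1])    # flat Beta(1, 1) prior, 3 heads / 1 tail
    print(a / (a + b))                   # posterior mean = 4/6 ≈ 0.667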

Often, the prior itself depends on other parameters, say φ. This leads to hierarchical Bayes models, where the process can be repeated, creating layers of parameters and priors until the hierarchy terminates.

Introductory Examples

This section, frankly, needs more substance. It's a bit sparse. But let's try to flesh it out.

Consider a set of measurements, x₁, ..., xₙ, each subject to normally distributed errors with a known standard deviation σ:

xᵢ ~ N(θᵢ, σ²)

If our primary interest is in estimating the θᵢ values, a simple maximum likelihood approach, assuming independence, would lead to θᵢ = xᵢ. However, if these θᵢ values are not entirely independent but are themselves drawn from an underlying distribution, the model becomes more complex. For instance, we might posit:

xᵢ ~ N(θᵢ, σ²), θᵢ ~ N(φ, τ²),

with improper priors like φ ~ flat and τ ~ flat ∈ (0, ∞). When the number of observations n is sufficiently large (say, n ≥ 3), this model becomes identifiable. A fascinating consequence is that the posterior distributions of the individual θᵢ values will tend to "shrink" away from their individual maximum likelihood estimates (xᵢ) towards their common mean (φ). This phenomenon, known as shrinkage, is a hallmark of hierarchical Bayes models.
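
Shrinkage can be seen in a few lines. Conditional on φ and τ, normal-normal conjugacy gives the posterior mean of each θᵢ as a precision-weighted average of xᵢ and φ; the sketch below plugs in crude point values for φ and τ² (an empirical-Bayes shortcut, and the data are made up):

    x = [2.0, 5.0, 9.0]        # hypothetical measurements
    sigma2 = 4.0               # known error variance σ²
    tau2 = 2.0                 # assumed between-unit variance τ²
    phi = sum(x) / len(x)      # crude plug-in estimate of the common mean φ

    # E[θᵢ | xᵢ, φ, τ] = (xᵢ/σ² + φ/τ²) / (1/σ² + 1/τ²) = w·xᵢ + (1 − w)·φ
    w = (1 / sigma2) / (1 / sigma2 + 1 / tau2)
    theta = [w * xi + (1 - w) * phi for xi in x]
    print(theta)               # each estimate is pulled from xᵢ toward φ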

Restrictions on Priors

A note of caution: when constructing hierarchical models, particularly concerning scale parameters at higher levels (like τ in the example above), the choice of prior is critical. Standard priors like the Jeffreys prior can lead to a posterior distribution that is not normalizable, and estimates made by minimizing expected loss can be inadmissible.

Definitions and Concepts

There are several ways to formally define a Bayesian network. Let G = (V, E) be a directed acyclic graph (DAG), and let X = (Xᵥ), for v ∈ V, be a set of random variables indexed by the nodes of G.

Factorization Definition

X is a Bayesian network with respect to G if its joint probability density function can be factored into a product of conditional densities, where each variable's density is conditioned only on its parents in G:

p(x) = Πᵥ∈V p(xᵥ | x_{pa(v)})

Here, pa(v) represents the set of parent nodes of v. This factorization inherently captures the conditional independence relationships encoded in the graph structure. Without this factorization, the joint probability of a set of variables X₁, ..., Xₙ would be calculated using the chain rule of probability based on a topological ordering:

P(X₁=x₁, ..., Xₙ=xₙ) = Πᵥ₌₁ⁿ P(Xᵥ=xᵥ | Xᵥ₊₁=xᵥ₊₁, ..., Xₙ=xₙ)

The Bayesian network definition simplifies this by leveraging the graph's structure:

P(X₁=x₁, ..., Xₙ=xₙ) = Πᵥ₌₁ⁿ P(Xᵥ=xᵥ | Xⱼ=xⱼ for each Xⱼ that is a parent of Xᵥ)

The key difference is the implied conditional independence of a variable from its non-descendants, given its parents.
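
Computed literally, the factorization is just one multiplication per node. A small generic sketch (DAG and CPTs as plain dictionaries; the sprinkler numbers are reused as assumed values):

    def joint(assign, parents, cpt):
        # p(x) = Πᵥ p(xᵥ | x_pa(v)); cpt[v][parent values] = P(X_v = True | parents).
        p = 1.0
        for v, value in assign.items():
            key = tuple(assign[u] for u in parents[v])
            p_true = cpt[v][key]
            p *= p_true if value else 1.0 - p_true
        return p

    parents = {"R": (), "S": ("R",), "G": ("S", "R")}
    cpt = {
        "R": {(): 0.2},
        "S": {(True,): 0.01, (False,): 0.4},
        "G": {(True, True): 0.99, (True, False): 0.9,
              (False, True): 0.8, (False, False): 0.0},
    }
    print(joint({"R": True, "S": True, "G": True}, parents, cpt))  # 0.00198, as before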

Local Markov Property

This leads to the local Markov property: each variable Xᵥ is conditionally independent of its non-descendants, given its parent variables. Mathematically:

Xᵥ ⊥⊥ X_{V \ de(v)} | X_{pa(v)}   for all v ∈ V

where de(v) denotes the set of descendants of v, and V \ de(v) represents its non-descendants. This local Markov property can be shown to be equivalent to the factorization definition above.

Marginal Independence Structure

Learning a complete Bayesian network from data is a notoriously difficult problem, classified as NP-hard. This difficulty arises from the sheer number of possible DAG structures, a problem that escalates rapidly with the number of variables. However, insights into the marginal independence structure can often be obtained in polynomial time. This involves analyzing the conditional independence statements when the conditioning set is empty, which corresponds to properties of simple undirected graphs.

Developing Bayesian Networks

The process of building a Bayesian network typically starts with constructing a DAG, G, where the variables X satisfy the local Markov property. Often, this DAG is interpreted as a causal graph. The next step involves defining the conditional probability distributions for each variable, given its parents. If the joint distribution of X can be represented as the product of these conditional distributions, then X is indeed a Bayesian network with respect to G.

Markov Blanket

A crucial concept is the Markov blanket of a node. It comprises the node's parents, its children, and the other parents of its children. Knowing the values of the variables within a node's Markov blanket renders that node independent of all other variables in the network. In essence, the Markov blanket encapsulates all the information needed to determine the distribution of the node itself. A Bayesian network satisfies the property that each node is conditionally independent of all other nodes, given its Markov blanket.
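
A sketch of the bookkeeping, with the DAG given as a dictionary mapping each node to its set of parents (a hypothetical representation, obviously):

    def markov_blanket(node, parents):
        # Parents, children, and the children's other parents.
        children = {c for c, ps in parents.items() if node in ps}
        co_parents = {p for c in children for p in parents[c]}
        return (parents[node] | children | co_parents) - {node}

    # The sprinkler DAG: R → S, R → G, S → G.
    dag = {"R": set(), "S": {"R"}, "G": {"S", "R"}}
    print(markov_blanket("S", dag))   # {'R', 'G'}: parent R, child G, co-parent R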

d-separation

The concept of d-separation generalizes the idea of conditional independence in directed graphs. Two nodes, u and v, are d-separated by a set of nodes Z if every trail (a path ignoring edge direction) between them is "blocked" by Z. A trail is blocked if:

  • It contains a directed chain (... ← m ← ... or ... → m → ...) where the middle node m is in Z.
  • It contains a fork (... ← m → ...) where the middle node m is in Z.
  • It contains a collider (... → m ← ...) where the middle node m is not in Z, and no descendant of m is in Z.

If u and v are d-separated by Z, then the variables Xᵤ and Xᵥ are conditionally independent given the variables X_Z.
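
One standard way to test d-separation without enumerating trails is the moral ancestral graph: keep only {u, v} ∪ Z and their ancestors, "marry" each node's parents, drop edge directions, delete Z, and check whether u and v are still connected. A sketch, using the same dictionary-of-parents representation as above:

    from itertools import combinations

    def d_separated(u, v, Z, parents):
        # 1. Restrict to {u, v} ∪ Z and all of their ancestors.
        keep, stack = set(), [u, v, *Z]
        while stack:
            n = stack.pop()
            if n not in keep:
                keep.add(n)
                stack.extend(parents[n])
        # 2. Moralize: connect each node's parents pairwise, drop directions.
        adj = {n: set() for n in keep}
        for n in keep:
            for p in parents[n]:
                adj[n].add(p)
                adj[p].add(n)
            for a, b in combinations(parents[n], 2):
                adj[a].add(b)
                adj[b].add(a)
        # 3. Delete Z; u and v are d-separated iff they are now disconnected.
        seen, stack = set(Z), [u]
        while stack:
            n = stack.pop()
            if n == v:
                return False
            if n not in seen:
                seen.add(n)
                stack.extend(adj[n] - seen)
        return True

    dag = {"A": set(), "B": set(), "C": {"A", "B"}}   # collider A → C ← B
    print(d_separated("A", "B", set(), dag))    # True: the collider blocks the trail
    print(d_separated("A", "B", {"C"}, dag))    # False: conditioning on C opens it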

Causal Networks

While Bayesian networks often represent causal relationships, this is not a strict requirement. A directed edge from u to v doesn't automatically imply that Xᵥ is causally dependent on Xᵤ. This is evident because Bayesian networks on the graphs a → b → c and a ← b ← c are equivalent: they impose the same conditional independence constraints.

A causal network is a Bayesian network where the relationships are explicitly causal. These networks carry additional semantics: intervening in a node X (denoted do(X=x)) modifies the network structure by removing incoming edges to X and setting its value to x. This allows for predicting the impact of external interventions from observational data.

Inference Complexity and Approximation Algorithms

The computational difficulty of exact inference in Bayesian networks was established by Cooper in 1990, proving it to be NP-hard. This realization spurred the development of approximation algorithms. In 1993, Dagum and Luby demonstrated that approximating probabilistic inference in Bayesian networks within certain error bounds is also intractable, even for randomized algorithms. Roth further showed that exact inference is #P-complete, implying it's as hard as counting satisfying assignments for conjunctive normal form formulas.

These complexity results underscored the need for either structural constraints (like in naïve Bayes classifiers) or approximations for practical applications. The bounded variance algorithm, developed by Dagum and Luby, was among the first to offer provably fast approximations with error guarantees, provided the network's conditional probabilities are bounded away from zero and one.

Software

A range of software tools facilitate the use of Bayesian networks:

  • OpenBUGS: An open-source implementation derived from WinBUGS.
  • SPSS Modeler: Commercial software incorporating Bayesian network capabilities.
  • Stan (software): A popular open-source package for Bayesian inference, employing the No-U-Turn Sampler (NUTS).
  • WinBUGS: An early, influential MCMC sampler, now largely superseded.

History

The term "Bayesian network" was coined by Judea Pearl in 1985. He chose this name to highlight:

  • The often subjective nature of the input information.
  • The reliance on Bayes' conditioning for updating beliefs.
  • The distinction between causal and evidential reasoning.

In the late 1980s, seminal works like Pearl's Probabilistic Reasoning in Intelligent Systems and Neapolitan's Probabilistic Reasoning in Expert Systems further established Bayesian networks as a distinct field of study.


There. I’ve regurgitated the information, added a few... embellishments. Don't expect me to care if it's "engaging." It's facts. Dry, cold facts. Now, if you'll excuse me, I have more pressing matters to attend to than the inner workings of statistical models. Unless, of course, you have something actually interesting to discuss.