
Junction Tree Algorithm

Honestly, the sheer dedication to explaining something as mundane as a machine learning algorithm... it’s almost admirable. Almost. Let’s dissect this, shall we?


The junction tree algorithm, also known, rather uncreatively, as the 'Clique Tree' method, is essentially a way to untangle the mess of marginalization in general graphs. Think of it as a highly organized, albeit reluctant, librarian for probability distributions. It forces a kind of order onto chaos, performing belief propagation on a structure that’s been… modified. This modified graph is the junction tree, and it’s called a tree because it branches, with each node holding a cluster of variables rather than a single one, like the limbs of some skeletal, data-obsessed entity.

The core idea? Eliminate cycles, those frustrating loops of uncertainty, by clustering their variables into single, more manageable nodes. It’s about creating structure where there was none, or at least, where there was an inconvenient amount of it. This consolidation means that multiple classes of queries can be compiled at once into one larger data structure, rather than being recomputed from scratch each time. [1] It’s not a one-size-fits-all solution, of course; there are variants tailored to whatever it is you’re trying to calculate. Inference algorithms, for instance, are designed to absorb new evidence and recalculate based on that fresh, likely unwelcome, information. [2]

Junction tree algorithm

This is where things get… procedural.

Hugin algorithm

  • First, if your graph is all directed and pretentious, you need to moralize it. Make it undirected. It’s simpler that way. Less drama.
  • Then, you introduce the evidence. Because what’s the point of calculating anything if you don’t acknowledge the facts on the ground?
  • Next, triangulate the graph. Make it chordal. It’s a necessary step for the structure to hold. Think of it as reinforcing the weak points.
  • From this triangulated graph, construct a junction tree. The vertices of this tree? We’ll call them "supernodes". They’re the consolidated clusters.
  • Finally, propagate the probabilities along this newly formed junction tree. This is done via belief propagation.

Now, be warned. This last step can be excruciatingly inefficient for graphs with a large treewidth. Computing the messages passed between these supernodes involves exact marginalization over the variables within them. If the treewidth is k, you’re looking at computations that are exponential in k. It’s a message passing algorithm, yes, but it’s not always a pleasant one. [3] The Hugin algorithm, at least, manages to get the job done with fewer computations than its counterpart, the Shafer-Shenoy.
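If you want to see what one of these message passes actually does, here is a minimal sketch in Python, assuming clique potentials are stored as plain tables: dictionaries mapping joint assignments to numbers. The names hugin_pass and marginalize, and the tiny two-clique example, are illustrative assumptions, not any particular library’s API.

```python
# Illustrative sketch only; the data layout and names are assumptions, not a library API.

def marginalize(table, keep):
    """Sum a table potential down to the variables in `keep`; this exact
    marginalization is the part that is exponential in the clique size."""
    out = {}
    for assignment, value in table.items():
        key = frozenset(pair for pair in assignment if pair[0] in keep)
        out[key] = out.get(key, 0.0) + value
    return out

def hugin_pass(phi_C, phi_D, phi_S, sep_vars):
    """One Hugin message from supernode C to supernode D across separator S.
    The stored separator potential phi_S is replaced by the marginal of the
    sender, and the receiver is rescaled by the ratio new / old."""
    phi_S_new = marginalize(phi_C, sep_vars)
    phi_D_new = {}
    for assignment, value in phi_D.items():
        key = frozenset(pair for pair in assignment if pair[0] in sep_vars)
        old = phi_S.get(key, 0.0)
        phi_D_new[assignment] = value * (phi_S_new.get(key, 0.0) / old if old else 0.0)
    return phi_D_new, phi_S_new

# Toy example: cliques {A, B} and {B, C} with separator {B}, all variables binary.
phi_AB = {frozenset({("A", a), ("B", b)}): 0.25 for a in (0, 1) for b in (0, 1)}
phi_BC = {frozenset({("B", b), ("C", c)}): 0.25 for b in (0, 1) for c in (0, 1)}
phi_B = {frozenset({("B", b)}): 1.0 for b in (0, 1)}
phi_BC, phi_B = hugin_pass(phi_AB, phi_BC, phi_B, {"B"})
```

The thing to notice is the stored separator table phi_S: it gets overwritten on every pass, which is precisely the bookkeeping that Shafer-Shenoy, up next, declines to do.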

Shafer-Shenoy algorithm

  • The messages are computed recursively, each one defined in terms of the messages arriving from the other neighbours. [3] Which means a great deal of recomputation.
  • Run that recursion with the intermediate products cached, and what you end up with is, essentially, the Hugin algorithm; think of Shafer-Shenoy as the leaner, more forgetful cousin. [4]
  • The messages themselves are given by the message passing equation. [4] Because everything is about passing messages, isn’t it?
  • Crucially, separator potentials are not stored. [5] Which saves memory, depending on your tolerance for recomputing things.

The Shafer-Shenoy algorithm is, in essence, the sum-product algorithm run on a junction tree. [6] It’s employed when memory is the tighter constraint: by not keeping separator tables it needs less storage, while Hugin trades that storage for fewer multiplications. It’s also capable of handling conditional belief functions. [7] To make these local computations actually happen, you’ll need the joint distributions over the cliques. [7]
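For contrast, here is a sketch of the Shafer-Shenoy message under the same toy table representation as before; message, multiply, and the two-clique chain are again illustrative assumptions, and the recursion over all neighbours except the target is what the message passing equation boils down to.

```python
# Illustrative sketch only; names and data layout are assumptions, not a library API.

def marginalize(table, keep):
    """Sum a table potential down to the variables in `keep`."""
    out = {}
    for assignment, value in table.items():
        key = frozenset(pair for pair in assignment if pair[0] in keep)
        out[key] = out.get(key, 0.0) + value
    return out

def multiply(t1, t2):
    """Pointwise product of two table potentials, joining on shared variables."""
    out = {}
    for a1, v1 in t1.items():
        for a2, v2 in t2.items():
            merged = dict(a1)
            if any(merged.get(var, val) != val for var, val in a2):
                continue  # the two assignments disagree on a shared variable
            merged.update(dict(a2))
            out[frozenset(merged.items())] = v1 * v2
    return out

def message(i, j, potentials, neighbors, separators):
    """Shafer-Shenoy message from clique i to clique j: multiply clique i's own
    potential by the messages coming in from every neighbour except j, then
    marginalize onto the separator. No separator potential is cached between calls."""
    table = potentials[i]
    for k in neighbors[i]:
        if k != j:
            table = multiply(table, message(k, i, potentials, neighbors, separators))
    return marginalize(table, separators[frozenset((i, j))])

# Toy chain of cliques {A, B} -- {B, C}; the only separator is {B}.
potentials = {
    0: {frozenset({("A", a), ("B", b)}): 0.25 for a in (0, 1) for b in (0, 1)},
    1: {frozenset({("B", b), ("C", c)}): 0.25 for b in (0, 1) for c in (0, 1)},
}
neighbors = {0: [1], 1: [0]}
separators = {frozenset((0, 1)): {"B"}}
belief_1 = multiply(potentials[1], message(0, 1, potentials, neighbors, separators))
```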

Underlying theory

This is where we delve into the why.

The initial step is solely concerned with Bayesian networks. It’s a procedure, usually called moralization, that transforms a directed graph into an undirected one: the parents of each node are 'married' by connecting them pairwise, and then every edge is stripped of its direction. We do this because it makes the algorithm universally applicable, stripping away any directional pretenses.
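A minimal sketch of moralization, assuming the Bayesian network is handed over as a dictionary mapping each node to its set of parents; moralize and the little v-structure example are illustrative, nothing more.

```python
# Illustrative sketch only; the input format is an assumption, not a library API.
from itertools import combinations

def moralize(parents):
    """Moralize a directed graph given as {node: set of its parents}:
    marry every pair of parents of each node, then drop edge directions."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    undirected = {v: set() for v in nodes}
    for child, ps in parents.items():
        for p in ps:
            # Dropping direction keeps every parent-child edge as an undirected edge.
            undirected[child].add(p)
            undirected[p].add(child)
        for a, b in combinations(ps, 2):
            # The "marrying" part: parents of a common child get connected.
            undirected[a].add(b)
            undirected[b].add(a)
    return undirected

# The classic v-structure A -> C <- B gains the moral edge A - B.
print(moralize({"A": set(), "B": set(), "C": {"A", "B"}}))
```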

The second step is about fixing variables to their observed values. This is usually necessary when you’re trying to calculate conditional probabilities. You clamp the value of the random variables you’re conditioning on. They become fixed, unchangeable points in the distribution.
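Mechanically, clamping can be as blunt as discarding every row of a table potential that disagrees with the observations. A sketch, reusing the table representation from the earlier sketches; clamp is an illustrative name.

```python
# Illustrative sketch only; names and data layout are assumptions, not a library API.

def clamp(table, evidence):
    """Keep only the rows of a table potential that agree with the evidence,
    i.e. fix the conditioning variables to their observed values."""
    return {
        assignment: value
        for assignment, value in table.items()
        if all(dict(assignment).get(var, val) == val for var, val in evidence.items())
    }

# Observe B = 1 in a potential over {A, B}; the rows with B = 0 simply vanish.
phi_AB = {frozenset({("A", a), ("B", b)}): 0.25 for a in (0, 1) for b in (0, 1)}
phi_AB_clamped = clamp(phi_AB, {"B": 1})
```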

The third step is to ensure the graph is chordal, triangulating it if it isn’t already. This step is essential: without it the junction tree may simply not exist. It relies on a rather neat theorem: [8]

Theorem: For an undirected graph, G, the following properties are equivalent:

  • Graph G is triangulated.
  • The clique graph of G has a junction tree.
  • There exists an elimination ordering for G that doesn't introduce any new edges.

So, by triangulating a graph, we guarantee that the corresponding junction tree exists. A common method is to pick an elimination order for the nodes and run the variable elimination algorithm, which, rather annoyingly, has to be run afresh every time a different query is posed. [1] The side effect we actually want here is the fill-in edges it adds to the initial graph, ultimately producing a chordal graph. All chordal graphs, as it happens, possess a junction tree. [4] The next logical step is to construct this junction tree. To achieve this, we take the chordal graph from the previous stage and form its clique graph: the graph whose vertices are the maximal cliques, with an edge between any two cliques that share variables. [9] Now, another theorem provides the method for finding a junction tree (a sketch of both of these steps follows it): [8]

Theorem: Given a triangulated graph, weight the edges of the clique graph by the cardinality, |A∩B|, of the intersection of adjacent cliques A and B. Then, any maximum-weight spanning tree of the clique graph is a junction tree.

Therefore, to build a junction tree, we simply extract a maximum weight spanning tree from the clique graph. This can be done efficiently, for instance, by adapting Kruskal's algorithm.
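Here is a minimal sketch of the two manipulations just described, assuming the undirected graph is stored as a dictionary of neighbour sets and that an elimination order is simply handed to us: triangulate adds the fill-in edges that variable elimination would create, and junction_tree runs Kruskal's algorithm, adapted for maximum weight, over candidate edges weighted by |A∩B|. The function names and the little four-cycle example are illustrative only.

```python
# Illustrative sketch only; input formats and names are assumptions, not a library API.
from itertools import combinations

def triangulate(adjacency, order):
    """Add fill-in edges by simulating variable elimination: eliminating a node
    connects all of its remaining neighbours, and those extra edges are kept."""
    work = {v: set(ns) for v, ns in adjacency.items()}
    chordal = {v: set(ns) for v, ns in adjacency.items()}
    for v in order:
        for a, b in combinations(work[v], 2):
            work[a].add(b); work[b].add(a)        # fill-in edge in the working copy
            chordal[a].add(b); chordal[b].add(a)  # ... and in the returned graph
        for n in work[v]:
            work[n].discard(v)                    # eliminate v
        del work[v]
    return chordal

def junction_tree(cliques):
    """Kruskal's algorithm, adapted for maximum weight: consider candidate edges
    between cliques in decreasing order of |A ∩ B| and keep those that do not
    close a cycle (detected with a tiny union-find)."""
    edges = sorted(
        ((len(cliques[i] & cliques[j]), i, j)
         for i in range(len(cliques)) for j in range(i + 1, len(cliques))
         if cliques[i] & cliques[j]),
        reverse=True,
    )
    parent = list(range(len(cliques)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j, cliques[i] & cliques[j]))  # an edge plus its separator
    return tree

# The 4-cycle A-B-C-D-A gains the fill-in edge B-D under the order A, B, C, D;
# its maximal cliques are then {A,B,D} and {B,C,D}, joined on the separator {B,D}.
square = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
print(triangulate(square, ["A", "B", "C", "D"]))
print(junction_tree([{"A", "B", "D"}, {"B", "C", "D"}]))
```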

The final step, as you might have guessed, is to apply belief propagation to the junction tree that has now been obtained. [10]
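The propagation itself is usually scheduled as two sweeps over the tree: collect messages from the leaves in towards an arbitrarily chosen root, then distribute them back out. Here is a sketch of just that schedule, with the actual single-edge update (Hugin-style or Shafer-Shenoy-style) left as a callable you supply; propagate is an illustrative name.

```python
# Illustrative sketch only; only the ordering of messages is decided here.

def propagate(tree_neighbors, root, send_message):
    """Two-pass belief propagation schedule on a junction tree: collect messages
    towards the root, then distribute them back out. send_message(i, j) is
    whichever single-edge update you prefer."""
    order = []
    def visit(node, parent_node):
        for child in tree_neighbors[node]:
            if child != parent_node:
                visit(child, node)
                order.append((child, node))  # collect: child sends to parent
    visit(root, None)
    for i, j in order:            # upward (collect) sweep
        send_message(i, j)
    for i, j in reversed(order):  # downward (distribute) sweep
        send_message(j, i)

# A three-clique chain 0 - 1 - 2, rooted (arbitrarily) at clique 1.
propagate({0: [1], 1: [0, 2], 2: [1]}, 1, lambda i, j: print(f"message {i} -> {j}"))
```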

Usage

A junction tree graph serves as a visual representation of the probabilities involved in a problem, and it can be further refined into a binary tree for structural clarity. [11] One specific application can be found in autoencoders, where the junction tree and a message passing network are combined automatically and at scale. [12]

Inference Algorithms

Sometimes, exact solutions are… inconvenient. Or impossible.

Cutset conditioning

This is used when you can condition on a small set of variables, the cutset. Clamping the cutset to each of its possible values breaks the loops, leaving simpler graphs that are easier to comprehend, though the results it yields are not exact. [3]

Loopy belief propagation

This is a different approach for dealing with complex graphs: run the belief propagation updates on the original graph, cycles and all, and hope the messages settle down. Loopy belief propagation is employed when an approximate solution is acceptable, rather than insisting on the exact one. [13] It falls squarely under the umbrella of approximate inference. [3]