
Restricted Boltzmann Machine

Fine. You want an article. Here's your article. Don't say I never do anything for you.


A restricted Boltzmann machine (RBM), also known by the unnecessarily elaborate names of a restricted Sherrington–Kirkpatrick model with external field or a restricted stochastic Ising–Lenz–Little model, is a type of generative stochastic artificial neural network. In simpler terms, it's a network that learns a probability distribution over the set of inputs you feed it. It tries to understand the very fabric of your data, not just label it.[1]

The concept first crawled into existence in 1986 under the name Harmonium, courtesy of Paul Smolensky.[2] It then languished in relative obscurity until the mid-2000s when Geoffrey Hinton and his colleagues resurrected it with faster learning algorithms, giving it the prominence it now endures. Since then, RBMs have been applied to a predictable range of tasks: dimensionality reduction (making big data smaller),[3] classification (putting things in boxes),[4] collaborative filtering (predicting your questionable taste in movies),[5] feature learning (finding interesting patterns so you don't have to),[6] and topic modelling (figuring out what a block of text is actually about).[7] They've even been dragged into fields like immunology[8] and the esoteric realm of many-body quantum mechanics,[9][10][11] because why not?

Depending on the problem you're trying to solve, they can be trained using either supervised or unsupervised methods. The choice, as always, depends on whether you have the answers beforehand or if you're just hoping the machine finds some for you.[citation needed]

The key to understanding an RBM is in its name: "restricted." This isn't a free-for-all. RBMs are a variant of Boltzmann machines, but with one critical rule: their neurons must form a bipartite graph. This means the network is split into two distinct groups of units, typically called "visible" units (where you input your data) and "hidden" units (where the "learning" happens).

  • A connection can exist between a visible unit and a hidden unit. These connections are symmetric.
  • Crucially, there are absolutely no connections between units within the same group. Visible units don't talk to other visible units, and hidden units keep to themselves. Think of it as a party with two cliques that only interact with each other, never internally.

By contrast, an "unrestricted" Boltzmann machine is a chaotic mess where hidden units can have connections among themselves. This restriction is not arbitrary; it's a pragmatic choice that makes the math tractable and allows for more efficient training algorithms. Specifically, it enables the use of the gradient-based contrastive divergence algorithm, which is far more manageable than what's required for the general class of Boltzmann machines.[12]

Restricted Boltzmann machines also serve as fundamental building blocks in deep learning architectures. You can form deep belief networks by "stacking" multiple RBMs on top of each other, training them one layer at a time. Afterward, the entire deep network can be fine-tuned using the familiar methods of gradient descent and backpropagation.[13]

Structure

The standard RBM model operates with binary-valued (Boolean) units for both the hidden and visible layers. Its architecture is defined by a matrix of weights, denoted W, of size m × n, where m is the number of visible units and n is the number of hidden units. Every weight element w_{i,j} in this matrix represents the strength of the connection between a visible unit v_i and a hidden unit h_j.

Beyond the primary weight matrix, the model includes bias weights, which act as offsets: a bias a_i for each visible unit v_i and a bias b_j for each hidden unit h_j. With these components in place, the "energy" of a given configuration, meaning a specific pair of Boolean vectors (v, h), is calculated as:

E(v,h) = -∑_i a_i v_i - ∑_j b_j h_j - ∑_i ∑_j v_i w_{i,j} h_j

Or, if you prefer the tidiness of matrix notation:

E(v,h) = -a^T v - b^T h - v^T W h

This energy function should look familiar if you've ever dealt with a Hopfield network. Lower energy indicates a more "desirable" or "compatible" state for the network. As with any Boltzmann machine, the joint probability distribution for the visible and hidden vectors is derived from this energy function:[14]

P(v,h) = (1/Z) e^{-E(v,h)}

Here, Z is the partition function, defined as the sum of e^{-E(v,h)} over every single possible configuration of v and h. Its sole purpose is to be a normalizing constant that ensures the total probability sums to one. The marginal probability of a visible vector v, the probability of observing that input, is found by summing P(v,h) over all possible configurations of the hidden layer:[14]

P(v) = (1/Z) ∑_{h} e^{-E(v,h)}

The same logic applies in reverse for the marginal probability of h.
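
If you'd rather see numbers than symbols, here is a minimal NumPy sketch of the energy function and the joint probability on a toy model small enough to brute-force the partition function. The sizes, seed, and variable names are purely illustrative.

    import numpy as np
    from itertools import product

    def energy(v, h, W, a, b):
        """E(v, h) = -a^T v - b^T h - v^T W h for binary vectors v and h."""
        return -a @ v - b @ h - v @ W @ h

    # Toy configuration: m = 4 visible units, n = 3 hidden units.
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(4, 3))   # connection weights w_{i,j}
    a = np.zeros(4)                          # visible biases a_i
    b = np.zeros(3)                          # hidden biases b_j

    v = np.array([1.0, 0.0, 1.0, 1.0])
    h = np.array([0.0, 1.0, 0.0])
    print(energy(v, h, W, a, b))             # lower energy = more "compatible" pair

    # For a model this small, Z can be brute-forced by summing exp(-E) over
    # all 2^(4+3) joint configurations, exactly as its definition says.
    Z = sum(np.exp(-energy(np.array(vv, float), np.array(hh, float), W, a, b))
            for vv in product([0, 1], repeat=4)
            for hh in product([0, 1], repeat=3))
    print(np.exp(-energy(v, h, W, a, b)) / Z)  # joint probability P(v, h)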

The bipartite structure of the RBM is what makes it computationally convenient. Because there are no connections within a layer, the hidden units are mutually independent given the visible units' activations. Likewise, the visible units are mutually independent given the hidden units' activations.[12] This means for m visible units and n hidden units, the conditional probability of a visible configuration v, given a hidden configuration h, simplifies to:

P(v|h) = ∏_{i=1}^m P(v_i|h)

And conversely, the conditional probability of h given v is:

P(h|v) = ∏_{j=1}^n P(h_j|v)

The individual activation probabilities are determined by the logistic sigmoid function σ:

P(h_j=1|v) = σ(b_j + ∑_{i=1}^m w_{i,j} v_i)

and

P(v_i=1|h) = σ(a_i + ∑_{j=1}^n w_{i,j} h_j)
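
In code, each conditional is one matrix-vector product pushed through a sigmoid. A minimal sketch, assuming weights and biases shaped as in the toy snippet above; the helper names are ours, not anyone's official API:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hidden_probs(v, W, b):
        """P(h_j = 1 | v) = σ(b_j + ∑_i w_{i,j} v_i), computed for all j at once."""
        return sigmoid(b + v @ W)

    def visible_probs(h, W, a):
        """P(v_i = 1 | h) = σ(a_i + ∑_j w_{i,j} h_j), computed for all i at once."""
        return sigmoid(a + W @ h)

    def sample(p, rng):
        """Draw a binary vector whose components are Bernoulli(p)."""
        return (rng.random(p.shape) < p).astype(float)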

While hidden units are typically Bernoulli, the visible units of an RBM can be multinomial.[clarification needed] In such cases, the logistic function for the visible units is replaced by the softmax function:

P(v_i^k=1|h) = (exp(a_i^k + ∑_j W_{ij}^k h_j)) / (∑_{k'=1}^K exp(a_i^{k'} + ∑_j W_{ij}^{k'} h_j))

Here, K represents the number of discrete values a visible unit can take. This variant is particularly useful in applications like topic modeling[7] and recommender systems.[5]
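
A rough sketch of this softmax variant, assuming one particular layout: a weight tensor W of shape (m, K, n) so that W[i, k, j] plays the role of W_{ij}^k, and biases a of shape (m, K). That layout is a choice made for this example, not a fixed convention.

    import numpy as np

    def softmax_visible_probs(h, W, a):
        """P(v_i^k = 1 | h): a softmax over the K possible values of each visible unit.

        W: shape (m, K, n), with W[i, k, j] standing in for W_{ij}^k (assumed layout)
        a: shape (m, K),    with a[i, k] standing in for a_i^k
        h: binary hidden vector of length n
        """
        logits = a + W @ h                           # (m, K): a_i^k + ∑_j W_{ij}^k h_j
        logits -= logits.max(axis=1, keepdims=True)  # subtract per-unit max for stability
        e = np.exp(logits)
        return e / e.sum(axis=1, keepdims=True)      # normalize over k for each unit i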

Relation to other models

Restricted Boltzmann machines are not an island. They are a specialized case of Boltzmann machines and, more broadly, Markov random fields.[15][16] Their graphical model also happens to directly correspond to that of factor analysis,[17] a fact that might be interesting if you're into that sort of thing.

Training algorithm

The goal of training a restricted Boltzmann machine is to adjust its parameters, the weight matrix W and the biases a and b, to maximize the product of probabilities it assigns to a training set V. Each row of V is a visible vector v. The objective is:

argmax_W ∏_{v∈V} P(v)

This is equivalent to maximizing the expected log probability of a training sample v drawn randomly from V:[15][16]

argmax_W E[log P(v)]
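
For binary hidden units the sum over h factorizes, so log P(v) can be written as -F(v) - log Z with the "free energy" F(v) = -∑_i a_i v_i - ∑_j log(1 + exp(b_j + ∑_i w_{i,j} v_i)). The sketch below uses that identity to evaluate the objective exactly on a model small enough to brute-force Z; the helper names continue the hypothetical snippets above.

    import numpy as np
    from itertools import product

    def free_energy(v, W, a, b):
        """F(v) with the hidden units summed out: P(v) is proportional to exp(-F(v))."""
        return -a @ v - np.sum(np.log1p(np.exp(b + v @ W)))

    def log_prob(v, W, a, b):
        """Exact log P(v) = -F(v) - log Z; only feasible when m is tiny."""
        m = W.shape[0]
        log_Z = np.log(sum(np.exp(-free_energy(np.array(c, float), W, a, b))
                           for c in product([0, 1], repeat=m)))
        return -free_energy(v, W, a, b) - log_Z

    # Training pushes the average of log_prob(v, ...) over the training set upward.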

The most common algorithm for this optimization task is contrastive divergence (CD), developed by Hinton initially for training product of experts (PoE) models.[18][19] The algorithm uses Gibbs sampling within a gradient descent procedure to approximate the gradient and update the weights. It's a clever shortcut, because calculating the true gradient is computationally ruinous.

The single-step contrastive divergence (CD-1) procedure for one training sample can be broken down as follows (a code sketch follows the list):

  1. Positive Phase: Take a training sample, v. Compute the activation probabilities for the hidden units and sample a hidden activation vector, h, from this distribution.
  2. Compute the outer product of v and h. This is your "positive gradient," representing the associations you want the model to learn from the data.
  3. Negative Phase: From the hidden vector h, sample a reconstruction v' of the visible units. Then, resample the hidden activations h' from v'. This single full step of Gibbs sampling creates a "fantasy" particle.
  4. Compute the outer product of v' and h'. This is your "negative gradient," representing the associations the model currently produces on its own.
  5. Update Weights: The update for the weight matrix ## W ## is the positive gradient minus the negative gradient, scaled by a learning rate, ε:

    ΔW = ε(vh^T - v'h'^T)

  6. Update Biases: The biases a and b are updated in a similar fashion:

    Δa = ε(v - v')

    Δb = ε(h - h')
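
Strung together, one CD-1 update for a single training vector might look like the sketch below. It reuses the hypothetical hidden_probs, visible_probs, and sample helpers from the earlier snippets and a plain learning rate ε, with no momentum, weight decay, or mini-batching.

    import numpy as np

    def cd1_update(v, W, a, b, lr=0.1, rng=None):
        """One contrastive-divergence (CD-1) step for a single training vector v."""
        rng = rng if rng is not None else np.random.default_rng()

        # Steps 1-2. Positive phase: hidden sample driven by the data, and its outer product.
        h = sample(hidden_probs(v, W, b), rng)
        positive = np.outer(v, h)

        # Steps 3-4. Negative phase: one full Gibbs step yields the "fantasy" pair (v', h').
        v_prime = sample(visible_probs(h, W, a), rng)
        h_prime = sample(hidden_probs(v_prime, W, b), rng)
        negative = np.outer(v_prime, h_prime)

        # Steps 5-6. Updates: data-driven associations minus model-driven associations.
        W = W + lr * (positive - negative)
        a = a + lr * (v - v_prime)
        b = b + lr * (h - h_prime)
        return W, a, b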

For those who need their hand held, Hinton has published a document titled "A Practical Guide to Training RBMs," which can be found on his university homepage.[14]

Stacked Restricted Boltzmann Machine

This section may be too technical for most readers to understand. Please help improve it to make it understandable to non-experts, without removing the technical details. (August 2023)

This section needs additional citations for verification. Please help improve this article by adding citations to reliable sources in this section. Unsourced material may be challenged and removed. (August 2023)

See also: Deep belief network

Let's address the disclaimers first. Yes, this section is technical, and yes, it's poorly sourced. It reads like someone's notes. Let's try to make sense of it.

  • The core difference between a standard RBM and a "Stacked RBM" (which is essentially the foundation of a Deep Belief Network) lies in the architecture and training methodology. A single RBM is a shallow network with its signature restriction against lateral connections. A Stacked RBM, on the other hand, is a hierarchical structure composed of multiple RBMs layered on top of one another. It's often a combination of an unsupervised multi-layer network with symmetric weights for feature extraction, capped with a supervised layer for a final task like classification.

  • These stacked architectures are employed for more complex tasks like natural language understanding, document retrieval, and sophisticated image generation and classification. The training is typically done greedily, one layer at a time: the first RBM is trained on the raw data, then its hidden activations are used as the input data to train the second RBM, and so on. This is the "unsupervised pre-training" (see the code sketch after this list). After this, the entire stack can be "fine-tuned" with a supervised algorithm like backpropagation.

  • While the text mentions a "three-layer network with asymmetric weights," this is somewhat confusing. The pre-training phase involves standard RBMs with symmetric weights. The fine-tuning phase treats the entire network as a feedforward network, which inherently has asymmetric connections (in the sense that information flows in one direction).

  • Stacked RBMs share the same fundamental neuron type as a single RBM: a stochastic binary Hopfield neuron. The energy function for each individual RBM within the stack is given by the Gibbs probability measure:

    E = -(1/2) ∑_{i,j} w_{ij} s_i s_j + ∑_i θ_i s_i

    The training process for each layer is fundamentally the same as for a single RBM. It uses contrastive divergence with Gibbs sampling to update the weights (Δw_{ij} = ε(p_{ij} - p'_{ij})), training one layer before moving to the next and using a 3-segment pass to approximate the equilibrium state, rather than performing full backpropagation during this phase.

  • The strength of this stacked approach is its ability to learn a hierarchical representation of features, with each layer capturing progressively more abstract and complex patterns. It performs a non-linear transformation that is easily expandable. The weakness, however, is that calculations can become complicated, especially with integer or real-valued neurons. Furthermore, because the contrastive divergence algorithm does not strictly follow the gradient of any function, its approximation to maximum likelihood is an improvisation—a useful hack, but a hack nonetheless.[14]
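
Here is a rough sketch of the greedy layer-wise pre-training described above. It leans on the hypothetical cd1_update and hidden_probs helpers from the earlier snippets; the layer sizes, epoch count, and learning rate are placeholders, not recommendations, and the supervised fine-tuning step is omitted entirely.

    import numpy as np

    def pretrain_stack(data, layer_sizes, epochs=10, lr=0.1, seed=0):
        """Greedy layer-wise pre-training: train one RBM, feed its hidden
        activations upward as the training data for the next RBM, and repeat."""
        rng = np.random.default_rng(seed)
        stack, x = [], data                          # x: current "visible" data, shape (N, m)
        for n_hidden in layer_sizes:
            m = x.shape[1]
            W = rng.normal(scale=0.01, size=(m, n_hidden))
            a, b = np.zeros(m), np.zeros(n_hidden)
            for _ in range(epochs):
                for v in x:                          # plain per-sample CD-1
                    W, a, b = cd1_update(v, W, a, b, lr=lr, rng=rng)
            stack.append((W, a, b))
            # Hidden probabilities become the input data for the next RBM in the stack.
            x = np.array([hidden_probs(v, W, b) for v in x])
        return stack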

Literature

  • Fischer, Asja; Igel, Christian (2012), "An Introduction to Restricted Boltzmann Machines", Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, Lecture Notes in Computer Science, vol. 7441, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 14–36, doi:10.1007/978-3-642-33275-3_2, ISBN 978-3-642-33274-6

See also