
Restricted Boltzmann Machine

Fine. You want an article. Here's your article. Don't say I never do anything for you.


A restricted Boltzmann machine (RBM), also known by the unnecessarily elaborate names of a restricted Sherrington–Kirkpatrick model with external field or a restricted stochastic Ising–Lenz–Little model, is a type of generative stochastic artificial neural network. In simpler terms, it's a network that learns a probability distribution over the set of inputs you feed it. It tries to understand the very fabric of your data, not just label it.[1]

The concept first crawled into existence in 1986 under the name Harmonium, courtesy of Paul Smolensky.[2] It then languished in relative obscurity until the mid-2000s when Geoffrey Hinton and his colleagues resurrected it with faster learning algorithms, giving it the prominence it now endures. Since then, RBMs have been applied to a predictable range of tasks: dimensionality reduction (making big data smaller),[3] classification (putting things in boxes),[4] collaborative filtering (predicting your questionable taste in movies),[5] feature learning (finding interesting patterns so you don't have to),[6] and topic modelling (figuring out what a block of text is actually about).[7] They've even been dragged into fields like immunology[8] and the esoteric realm of many-body quantum mechanics,[9][10][11] because why not?

Depending on the problem you're trying to solve, they can be trained using either supervised or unsupervised methods. The choice, as always, depends on whether you have the answers beforehand or if you're just hoping the machine finds some for you.[citation needed]

The key to understanding an RBM is in its name: "restricted." This isn't a free-for-all. RBMs are a variant of Boltzmann machines, but with one critical rule: their neurons must form a bipartite graph. This means the network is split into two distinct groups of units, typically called "visible" units (where you input your data) and "hidden" units (where the "learning" happens).

  • A connection can exist between a visible unit and a hidden unit. These connections are symmetric.
  • Crucially, there are absolutely no connections between units within the same group. Visible units don't talk to other visible units, and hidden units keep to themselves. Think of it as a party with two cliques that only interact with each other, never internally.

By contrast, an "unrestricted" Boltzmann machine is a chaotic mess where hidden units can have connections among themselves. This restriction is not arbitrary; it's a pragmatic choice that makes the math tractable and allows for more efficient training algorithms. Specifically, it enables the use of the gradient-based contrastive divergence algorithm, which is far more manageable than what's required for the general class of Boltzmann machines.[12]

Restricted Boltzmann machines also serve as fundamental building blocks in deep learning architectures. You can form deep belief networks by "stacking" multiple RBMs on top of each other, training them one layer at a time. Afterward, the entire deep network can be fine-tuned using the familiar methods of gradient descent and backpropagation.[13]

Structure

The standard RBM model operates with binary-valued (Boolean) units for both the hidden and visible layers. Its architecture is defined by a matrix of weights, denoted W, of size m × n, where m is the number of visible units and n is the number of hidden units. Every weight element w_{i,j} in this matrix represents the strength of the connection between a visible unit v_i and a hidden unit h_j.

Beyond the primary weight matrix, the model includes bias weights, which act as offsets: a bias a_i for each visible unit v_i and a bias b_j for each hidden unit h_j. With these components in place, the "energy" of a given configuration, meaning a specific pair of Boolean vectors (v, h), is calculated as:

E(v,h) = -∑_i a_i v_i - ∑_j b_j h_j - ∑_i ∑_j v_i w_{i,j} h_j

Or, if you prefer the tidiness of matrix notation:

E(v,h) = -a^T v - b^T h - v^T W h

This energy function should look familiar if you've ever dealt with a Hopfield network. Lower energy indicates a more "desirable" or "compatible" state for the network. As with any Boltzmann machine, the joint probability distribution for the visible and hidden vectors is derived from this energy function:[14]

P(v,h) = (1/Z) e^{-E(v,h)}

Here, Z is the partition function, defined as the sum of e^{-E(v,h)} over every single possible configuration of v and h. Its sole purpose is to be a normalizing constant that ensures the total probability sums to one. The marginal probability of a visible vector v, the probability of observing that input, is found by summing P(v,h) over all possible configurations of the hidden layer:[14]

P(v) = (1/Z) ∑_{h} e^{-E(v,h)}

The same logic applies in reverse for the marginal probability of h.
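
If you'd rather see numbers than symbols, here is a minimal NumPy sketch of the energy function and the joint probability on a toy model small enough to brute-force the partition function. The sizes, seed, and variable names are purely illustrative.

    import numpy as np
    from itertools import product

    def energy(v, h, W, a, b):
        """E(v, h) = -a^T v - b^T h - v^T W h for binary vectors v and h."""
        return -a @ v - b @ h - v @ W @ h

    # Toy configuration: m = 4 visible units, n = 3 hidden units.
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(4, 3))   # connection weights w_{i,j}
    a = np.zeros(4)                          # visible biases a_i
    b = np.zeros(3)                          # hidden biases b_j

    v = np.array([1.0, 0.0, 1.0, 1.0])
    h = np.array([0.0, 1.0, 0.0])
    print(energy(v, h, W, a, b))             # lower energy = more "compatible" pair

    # For a model this small, Z can be brute-forced by summing exp(-E) over
    # all 2^(4+3) joint configurations, exactly as its definition says.
    Z = sum(np.exp(-energy(np.array(vv, float), np.array(hh, float), W, a, b))
            for vv in product([0, 1], repeat=4)
            for hh in product([0, 1], repeat=3))
    print(np.exp(-energy(v, h, W, a, b)) / Z)  # joint probability P(v, h)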

The bipartite structure of the RBM is what makes it computationally convenient. Because there are no connections within a layer, the hidden units are mutually independent given the visible units' activations. Likewise, the visible units are mutually independent given the hidden units' activations.[12] This means for m visible units and n hidden units, the conditional probability of a visible configuration v, given a hidden configuration h, simplifies to:

P(v|h) = ∏_{i=1}^m P(v_i|h)

And conversely, the conditional probability of h given v is:

P(h|v) = ∏_{j=1}^n P(h_j|v)

The individual activation probabilities are determined by the logistic sigmoid function σ:

P(h_j=1|v) = σ(b_j + ∑_{i=1}^m w_{i,j} v_i)

and

P(v_i=1|h) = σ(a_i + ∑_{j=1}^n w_{i,j} h_j)
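
In code, each conditional is one matrix-vector product pushed through a sigmoid. A minimal sketch, assuming weights and biases shaped as in the toy snippet above; the helper names are ours, not anyone's official API:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hidden_probs(v, W, b):
        """P(h_j = 1 | v) = σ(b_j + ∑_i w_{i,j} v_i), computed for all j at once."""
        return sigmoid(b + v @ W)

    def visible_probs(h, W, a):
        """P(v_i = 1 | h) = σ(a_i + ∑_j w_{i,j} h_j), computed for all i at once."""
        return sigmoid(a + W @ h)

    def sample(p, rng):
        """Draw a binary vector whose components are Bernoulli(p)."""
        return (rng.random(p.shape) < p).astype(float)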

While hidden units are typically Bernoulli, the visible units of an RBM can be multinomial.[clarification needed] In such cases, the logistic function for the visible units is replaced by the softmax function:

P(v_i^k=1|h) = (exp(a_i^k + ∑_j W_{ij}^k h_j)) / (∑_{k'=1}^K exp(a_i^{k'} + ∑_j W_{ij}^{k'} h_j))

Here, K represents the number of discrete values a visible unit can take. This variant is particularly useful in applications like topic modeling[7] and recommender systems.[5]
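
A rough sketch of this softmax variant, assuming one particular layout: a weight tensor W of shape (m, K, n) so that W[i, k, j] plays the role of W_{ij}^k, and biases a of shape (m, K). That layout is a choice made for this example, not a fixed convention.

    import numpy as np

    def softmax_visible_probs(h, W, a):
        """P(v_i^k = 1 | h): a softmax over the K possible values of each visible unit.

        W: shape (m, K, n), with W[i, k, j] standing in for W_{ij}^k (assumed layout)
        a: shape (m, K),    with a[i, k] standing in for a_i^k
        h: binary hidden vector of length n
        """
        logits = a + W @ h                           # (m, K): a_i^k + ∑_j W_{ij}^k h_j
        logits -= logits.max(axis=1, keepdims=True)  # subtract per-unit max for stability
        e = np.exp(logits)
        return e / e.sum(axis=1, keepdims=True)      # normalize over k for each unit i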

Relation to other models

Restricted Boltzmann machines are not an island. They are a specialized case of Boltzmann machines and, more broadly, Markov random fields.[15][16] Their graphical model also happens to directly correspond to that of factor analysis,[17] a fact that might be interesting if you're into that sort of thing.

Training algorithm

The goal of training a restricted Boltzmann machine is to adjust its parameters, the weight matrix W and the biases a and b, to maximize the product of probabilities it assigns to a training set V. Each row of V is a visible vector v. The objective is:

argmax_W ∏_{v∈V} P(v)

This is equivalent to maximizing the expected log probability of a training sample v drawn randomly from V:[15][16]

argmax_W E[log P(v)]
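
For binary hidden units the sum over h factorizes, so log P(v) can be written as -F(v) - log Z with the "free energy" F(v) = -∑_i a_i v_i - ∑_j log(1 + exp(b_j + ∑_i w_{i,j} v_i)). The sketch below uses that identity to evaluate the objective exactly on a model small enough to brute-force Z; the helper names continue the hypothetical snippets above.

    import numpy as np
    from itertools import product

    def free_energy(v, W, a, b):
        """F(v) with the hidden units summed out: P(v) is proportional to exp(-F(v))."""
        return -a @ v - np.sum(np.log1p(np.exp(b + v @ W)))

    def log_prob(v, W, a, b):
        """Exact log P(v) = -F(v) - log Z; only feasible when m is tiny."""
        m = W.shape[0]
        log_Z = np.log(sum(np.exp(-free_energy(np.array(c, float), W, a, b))
                           for c in product([0, 1], repeat=m)))
        return -free_energy(v, W, a, b) - log_Z

    # Training pushes the average of log_prob(v, ...) over the training set upward.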

The most common algorithm for this optimization task is contrastive divergence (CD), developed by Hinton initially for training product of experts (PoE) models.[18][19] The algorithm uses Gibbs sampling within a gradient descent procedure to approximate the gradient and update the weights. It's a clever shortcut, because calculating the true gradient is computationally ruinous.

The single-step contrastive divergence (CD-1) procedure for one training sample can be broken down as follows (a code sketch follows the list):

  1. Positive Phase: Take a training sample, v. Compute the activation probabilities for the hidden units and sample a hidden activation vector, h, from this distribution.
  2. Compute the outer product of v and h. This is your "positive gradient," representing the associations you want the model to learn from the data.
  3. Negative Phase: From the hidden vector h, sample a reconstruction v' of the visible units. Then, resample the hidden activations h' from v'. This single full step of Gibbs sampling creates a "fantasy" particle.
  4. Compute the outer product of v' and h'. This is your "negative gradient," representing the associations the model currently produces on its own.
  5. Update Weights: The update for the weight matrix ## W ## is the positive gradient minus the negative gradient, scaled by a learning rate, ε:

    ΔW = ε(vh^T - v'h'^T)

  6. Update Biases: The biases a and b are updated in a similar fashion:

    Δa = ε(v - v')

    Δb = ε(h - h')
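
Strung together, one CD-1 update for a single training vector might look like the sketch below. It reuses the hypothetical hidden_probs, visible_probs, and sample helpers from the earlier snippets and a plain learning rate ε, with no momentum, weight decay, or mini-batching.

    import numpy as np

    def cd1_update(v, W, a, b, lr=0.1, rng=None):
        """One contrastive-divergence (CD-1) step for a single training vector v."""
        rng = rng if rng is not None else np.random.default_rng()

        # Steps 1-2. Positive phase: hidden sample driven by the data, and its outer product.
        h = sample(hidden_probs(v, W, b), rng)
        positive = np.outer(v, h)

        # Steps 3-4. Negative phase: one full Gibbs step yields the "fantasy" pair (v', h').
        v_prime = sample(visible_probs(h, W, a), rng)
        h_prime = sample(hidden_probs(v_prime, W, b), rng)
        negative = np.outer(v_prime, h_prime)

        # Steps 5-6. Updates: data-driven associations minus model-driven associations.
        W = W + lr * (positive - negative)
        a = a + lr * (v - v_prime)
        b = b + lr * (h - h_prime)
        return W, a, b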

For those who need their hand held, Hinton has published a document titled "A Practical Guide to Training RBMs," which can be found on his university homepage.[14]

Stacked Restricted Boltzmann Machine

This section may be too technical for most readers to understand. Please help improve it to make it understandable to non-experts, without removing the technical details. (August 2023)

This section needs additional citations for verification. Please help improve this article by adding citations to reliable sources in this section. Unsourced material may be challenged and removed. (August 2023)

See also: Deep belief network

Let's address the disclaimers first. Yes, this section is technical, and yes, it's poorly sourced. It reads like someone's notes. Let's try to make sense of it.

  • The core difference between a standard RBM and a "Stacked RBM" (which is essentially the foundation of a Deep Belief Network) lies in the architecture and training methodology. A single RBM is a shallow network with its signature restriction against lateral connections. A Stacked RBM, on the other hand, is a hierarchical structure composed of multiple RBMs layered on top of one another. It's often a combination of an unsupervised multi-layer network with symmetric weights for feature extraction, capped with a supervised layer for a final task like classification.

  • These stacked architectures are employed for more complex tasks like natural language understanding, document retrieval, and sophisticated image generation and classification. The training is typically done greedily, one layer at a time: the first RBM is trained on the raw data, then its hidden activations are used as the input data to train the second RBM, and so on. This is the "unsupervised pre-training" (see the code sketch after this list). After this, the entire stack can be "fine-tuned" with a supervised algorithm like backpropagation.

  • While the text mentions a "three-layer network with asymmetric weights," this is somewhat confusing. The pre-training phase involves standard RBMs with symmetric weights. The fine-tuning phase treats the entire network as a feedforward network, which inherently has asymmetric connections (in the sense that information flows in one direction).

  • Stacked RBMs share the same fundamental neuron type as a single RBM: a stochastic binary Hopfield neuron. The energy function for each individual RBM within the stack is given by the Gibbs probability measure:

    E = -(1/2) ∑_{i,j} w_{ij} s_i s_j + ∑_i θ_i s_i

    The training process for each layer is fundamentally the same as for a single RBM. It uses contrastive divergence with Gibbs sampling to update the weights (Δw_{ij} = ε(p_{ij} - p'_{ij})), training one layer before moving to the next and using a 3-segment pass to approximate the equilibrium state, rather than performing full backpropagation during this phase.

  • The strength of this stacked approach is its ability to learn a hierarchical representation of features, with each layer capturing progressively more abstract and complex patterns. It performs a non-linear transformation that is easily expandable. The weakness, however, is that calculations can become complicated, especially with integer or real-valued neurons. Furthermore, because the contrastive divergence algorithm does not strictly follow the gradient of any function, its approximation to maximum likelihood is an improvisation—a useful hack, but a hack nonetheless.[14]
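
Here is a rough sketch of the greedy layer-wise pre-training described above. It leans on the hypothetical cd1_update and hidden_probs helpers from the earlier snippets; the layer sizes, epoch count, and learning rate are placeholders, not recommendations, and the supervised fine-tuning step is omitted entirely.

    import numpy as np

    def pretrain_stack(data, layer_sizes, epochs=10, lr=0.1, seed=0):
        """Greedy layer-wise pre-training: train one RBM, feed its hidden
        activations upward as the training data for the next RBM, and repeat."""
        rng = np.random.default_rng(seed)
        stack, x = [], data                          # x: current "visible" data, shape (N, m)
        for n_hidden in layer_sizes:
            m = x.shape[1]
            W = rng.normal(scale=0.01, size=(m, n_hidden))
            a, b = np.zeros(m), np.zeros(n_hidden)
            for _ in range(epochs):
                for v in x:                          # plain per-sample CD-1
                    W, a, b = cd1_update(v, W, a, b, lr=lr, rng=rng)
            stack.append((W, a, b))
            # Hidden probabilities become the input data for the next RBM in the stack.
            x = np.array([hidden_probs(v, W, b) for v in x])
        return stack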

Literature

  • Fischer, Asja; Igel, Christian (2012), "An Introduction to Restricted Boltzmann Machines", Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, Lecture Notes in Computer Science, vol. 7441, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 14–36, doi:10.1007/978-3-642-33275-3_2, ISBN 978-3-642-33274-6

See also