QUICK FACTS
Created Jan 0001
Status Verified Sarcastic
Type Existential Dread
energy-based models, restricted boltzmann machines, gradient, machine learning, unsupervised learning, representations, artificial neural networks, generative models, probability distribution

Contrastive Divergence

“Contrastive Divergence (CD) is, for lack of a more dramatic description, an approximation algorithm. Its primary utility lies in training energy-based models,...”

Contents
  • 1. The Quagmire of Restricted Boltzmann Machines
  • 2. The Intractability Problem: A Parting of Ways with Exactness
  • 3. Contrastive Divergence: The Art of Satisficing
  • 4. Variations and the Relentless Pursuit of “Better”
  • 5. Applications: The Era of Deep Learning’s Dawn
  • 6. Limitations and the Inevitable Trade-offs

Contrastive Divergence (CD) is, for lack of a more dramatic description, an approximation algorithm. Its primary utility lies in training energy-based models, most notably Restricted Boltzmann Machines (RBMs). If you’ve ever wrestled with the seemingly insurmountable task of calculating an exact gradient in certain machine learning contexts, CD swoops in, not as a hero, but as a pragmatic, slightly weary compromise, offering a computationally tractable method where exact solutions are simply… inconveniently impossible. It emerged from the primordial soup of unsupervised learning as a means to make complex models learn useful representations without explicitly labeled data, a task often requiring more patience than most humans possess.

The Quagmire of Restricted Boltzmann Machines

To understand the necessity of Contrastive Divergence, one must first endure the structural nuances of the Restricted Boltzmann Machine (RBM). These are two-layer artificial neural networks characterized by a visible layer (where data enters, or, more accurately, where it’s observed) and a hidden layer (where the model attempts to discern underlying features, or perhaps just hide its own complexities). The “restricted” part is crucial: there are no connections within the visible layer, nor within the hidden layer. Connections only exist between visible and hidden units. This architectural constraint simplifies certain aspects of their operation, but don’t mistake “simplified” for “easy.”
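The bipartite restriction has one pleasant consequence: with the visible layer fixed, the hidden units are conditionally independent of each other (and vice versa), so both conditionals factorize into per-unit sigmoids. A minimal sketch, with binary units and illustrative sizes and parameter names (nothing here is prescribed by any particular library):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative RBM parameters: 4 visible units, 3 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 3))  # visible-to-hidden weights
b_h = np.zeros(3)                        # hidden biases
b_v = np.zeros(4)                        # visible biases

v = np.array([1.0, 0.0, 1.0, 1.0])       # one visible configuration

# No hidden-hidden connections, so p(h | v) factorizes: each hidden
# unit's activation probability depends only on v.
p_h_given_v = sigmoid(b_h + v @ W)       # shape (3,)

# Symmetrically, p(v | h) factorizes over the visible units.
h = (p_h_given_v > 0.5).astype(float)
p_v_given_h = sigmoid(b_v + W @ h)       # shape (4,)
```

This factorization is exactly what makes the Gibbs sampling steps described later cheap: each half-step is a single matrix product and a sigmoid.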

RBMs are generative models. Their goal is to learn a probability distribution over their inputs, effectively modeling how data is generated. This is achieved by defining an “energy function” that assigns a scalar value to each possible configuration of visible and hidden units. Lower energy implies higher probability. The learning process, as with many models, involves adjusting the weights and biases to minimize this energy for training data and maximize it for other, less probable configurations. The ideal scenario involves maximizing the log-likelihood of the training data under the model’s learned distribution. And this, dear reader, is where the trouble begins.
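For the standard binary RBM, that energy function is E(v, h) = −bᵥ·v − bₕ·h − v·W·h, and a configuration’s unnormalized probability is exp(−E). A toy sketch (the parameter values are purely illustrative):

```python
import numpy as np

def rbm_energy(v, h, W, b_v, b_h):
    """Standard binary-RBM energy: E(v, h) = -b_v.v - b_h.h - v.W.h.
    Lower energy corresponds to higher (unnormalized) probability."""
    return -(b_v @ v) - (b_h @ h) - (v @ W @ h)

# A configuration aligned with positive weights gets lower energy
# than one that ignores them.
W = np.array([[1.0], [1.0]])   # 2 visible units, 1 hidden unit
b_v = np.zeros(2)
b_h = np.zeros(1)
aligned = rbm_energy(np.array([1.0, 1.0]), np.array([1.0]), W, b_v, b_h)
misaligned = rbm_energy(np.array([1.0, 1.0]), np.array([0.0]), W, b_v, b_h)
# aligned is -2.0, misaligned is 0.0: the aligned pair is more probable
```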

The Intractability Problem: A Parting of Ways with Exactness

Maximizing the log-likelihood for an RBM requires computing its gradient with respect to the model parameters. This gradient involves two terms: a “positive phase” that encourages the model to assign high probability to training data, and a “negative phase” that discourages it from assigning high probability to configurations not in the training data. The positive phase is relatively straightforward to compute. It’s the negative phase that becomes a computational nightmare.
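Written out, the two phases are the two expectations in the standard RBM log-likelihood gradient for a weight w_ij connecting visible unit i and hidden unit j:

```latex
\frac{\partial \log p(v)}{\partial w_{ij}}
  = \underbrace{\langle v_i h_j \rangle_{\text{data}}}_{\text{positive phase}}
  \;-\; \underbrace{\langle v_i h_j \rangle_{\text{model}}}_{\text{negative phase}}
```

The first expectation is taken with the visible units clamped to a training example, which is cheap; the second is taken under the model’s own distribution, which is where everything goes wrong.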

The negative phase requires averaging over all possible configurations of visible and hidden units. For any non-trivial RBM, the number of possible configurations is exponentially large – a number so vast it makes the universe seem quaint. This necessitates computing the infamous partition function, a normalization constant that is, for all practical intents and purposes, intractable to calculate directly. Attempting to do so would lead to computational complexity that renders the entire endeavor futile, akin to counting grains of sand on every beach on every planet. This is precisely the kind of problem that makes researchers sigh dramatically and invent approximations.
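To make the exponential blow-up concrete, here is a brute-force partition function for a toy RBM. The sum has 2^(n_v + n_h) terms, so this is only feasible at sizes where the model is useless for anything but illustration:

```python
import itertools
import numpy as np

def partition_function(W, b_v, b_h):
    """Brute-force Z by summing exp(-E) over every binary configuration.
    Cost is 2**(n_v + n_h) terms -- hopeless beyond toy sizes."""
    n_v, n_h = W.shape
    Z = 0.0
    count = 0
    for v_bits in itertools.product([0.0, 1.0], repeat=n_v):
        for h_bits in itertools.product([0.0, 1.0], repeat=n_h):
            v, h = np.array(v_bits), np.array(h_bits)
            energy = -(b_v @ v) - (b_h @ h) - (v @ W @ h)
            Z += np.exp(-energy)
            count += 1
    return Z, count

# With all-zero parameters every configuration has energy 0,
# so Z equals the number of configurations: 2**(3+2) = 32.
W = np.zeros((3, 2))
Z, count = partition_function(W, np.zeros(3), np.zeros(2))
```

At 3 visible and 2 hidden units this is 32 terms; at a modest 784 visible and 500 hidden units it is 2^1284 terms, which is the point at which one invents Contrastive Divergence.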

Contrastive Divergence: The Art of Satisficing

Enter Contrastive Divergence, developed by Geoffrey Hinton in the early 2000s. CD doesn’t aim for perfection; it aims for “good enough” within a reasonable timeframe. Instead of computing the exact negative phase, which demands an average over all configurations, CD approximates it by averaging over a limited number of samples. This is where the “divergence” aspect comes into play, referring to the difference between two Kullback–Leibler divergences.

The algorithm, in its most basic form (CD-1), proceeds as follows:

  1. Positive Phase (Data-driven):

    • Start with a training example v0 (a vector of visible unit states).
    • Compute the probabilities of the hidden units being active, given v0.
    • Sample a hidden vector h0 from these probabilities.
    • This (v0, h0) pair contributes to the positive gradient term.
  2. Negative Phase (Model-driven, Approximated):

    • Reconstruct a visible vector v1 from h0 (i.e., sample visible units given h0).
    • Compute the probabilities of the hidden units being active, given v1.
    • Sample a hidden vector h1 from these probabilities.
    • This (v1, h1) pair contributes to the negative gradient term.

The crucial approximation is that instead of running a full Markov chain Monte Carlo (MCMC) simulation to reach the equilibrium distribution for the negative phase, CD-1 performs only one step of Gibbs sampling (from v0 to h0 to v1). The difference between the statistics of (v0, h0) and (v1, h1) then serves as an approximation of the gradient of the log-likelihood. This difference is then used to update the model’s weights and biases, typically via stochastic gradient descent. It’s a bit like trying to estimate the average temperature of the ocean by dipping your toe in once. Surprisingly, it often works well enough.
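The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of a single CD-1 weight update for a binary RBM (sizes, learning rate, and the omission of bias updates are all simplifications for brevity):

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_v, b_h, lr=0.1):
    """One CD-1 weight update for a binary RBM (illustrative sketch).
    Bias updates are omitted to keep the core idea visible."""
    # Positive phase: hidden probabilities and a sample, given the data
    p_h0 = sigmoid(b_h + v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one Gibbs step -- reconstruct v1 from h0,
    # then compute hidden probabilities given v1
    p_v1 = sigmoid(b_v + W @ h0)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(b_h + v1 @ W)
    # Gradient approximation: <v h>_data - <v h>_reconstruction
    dW = np.outer(v0, p_h0) - np.outer(v1, p_h1)
    return W + lr * dW

W = rng.normal(scale=0.01, size=(6, 4))   # 6 visible, 4 hidden units
v0 = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
W_new = cd1_update(v0, W, np.zeros(6), np.zeros(4))
```

Note that the negative statistics use v1 and its hidden probabilities rather than a sample from the model’s equilibrium distribution — that substitution is the entire trick, and the entire source of bias.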

Variations and the Relentless Pursuit of “Better”

While CD-1 is the foundational variant, its approximations, while effective, introduce a bias. The single Gibbs sampling step doesn’t fully capture the model’s equilibrium distribution. To mitigate this, more sophisticated variants have been proposed, proving that even “good enough” can always be improved upon, or at least fussed over.

  • CD-k: This extends CD-1 by performing k steps of Gibbs sampling in the negative phase (e.g., v0 -> h0 -> v1 -> h1 -> ... -> vk -> hk). While k can be greater than 1, common practice often sticks to small values like k=1 or k=3, as the computational cost increases with k, and diminishing returns quickly set in.
  • Persistent Contrastive Divergence (PCD): Instead of starting the Gibbs sampling chain from the data v0 for each update, PCD maintains a separate set of “persistent” chains. These chains are initialized randomly once and then updated incrementally after each gradient step. This allows the chains to explore the model’s distribution more thoroughly over time, ideally providing a better approximation of the negative phase. It’s like having a few perpetually wandering explorers instead of sending a new one out every time.
  • Fast Persistent Contrastive Divergence (FPCD): An optimization of PCD, where the persistent chains are updated less frequently, or with specific learning rates, to speed up computation while retaining the benefits of persistence.

These variations all aim to refine the approximation of the negative phase, balancing computational efficiency with the desire for a less biased gradient estimate.
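The difference between plain CD and PCD is a one-line change in where the negative-phase chain starts. A sketch under the same toy assumptions as before (illustrative sizes, bias updates omitted):

```python
import numpy as np

rng = np.random.default_rng(7)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pcd_update(v0, chain_v, W, b_v, b_h, lr=0.1):
    """One PCD update (sketch): the negative phase advances a persistent
    chain instead of restarting from the data, and the advanced chain
    state is returned so it survives to the next update."""
    # Positive phase from the data, exactly as in plain CD
    p_h_data = sigmoid(b_h + v0 @ W)
    # Negative phase: one Gibbs step starting from the persistent chain
    p_h_chain = sigmoid(b_h + chain_v @ W)
    h = (rng.random(p_h_chain.shape) < p_h_chain).astype(float)
    p_v = sigmoid(b_v + W @ h)
    chain_v = (rng.random(p_v.shape) < p_v).astype(float)
    p_h_new = sigmoid(b_h + chain_v @ W)
    dW = np.outer(v0, p_h_data) - np.outer(chain_v, p_h_new)
    return W + lr * dW, chain_v  # persist the chain state

W = rng.normal(scale=0.01, size=(5, 3))
chain = rng.integers(0, 2, size=5).astype(float)  # random init, once
for v0 in [np.array([1.0, 0, 1, 0, 1]), np.array([0.0, 1, 0, 1, 0])]:
    W, chain = pcd_update(v0, chain, W, np.zeros(5), np.zeros(3))
```

Because the chain is never reset, it can drift far from the training data and explore low-probability regions the data-initialized chain of CD-1 would never see.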

Applications: The Era of Deep Learning’s Dawn

Contrastive Divergence played a surprisingly significant role in the early days of deep learning, particularly in the context of pre-training deep belief networks (DBNs). DBNs are stacks of RBMs, where the hidden layer of one RBM becomes the visible layer for the next. Training these deep architectures layer-by-layer using CD was a breakthrough, allowing for the effective initialization of weights that could then be fine-tuned with supervised methods like backpropagation. This approach helped overcome the vanishing gradient problem that plagued earlier attempts to train deep neural networks.
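The greedy layer-wise recipe reduces to: train one RBM, push the data through its hidden layer, and treat those activations as the “visible” data for the next RBM. A bare sketch of the data flow (the actual CD training of each layer is elided; random weights stand in for trained ones):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Greedy layer-wise stacking (data flow only): each layer's hidden
# activations become the next layer's input. In a real DBN each W
# would be learned with CD before moving up a layer.
data = rng.integers(0, 2, size=(10, 8)).astype(float)  # 10 samples, 8 features
W1 = rng.normal(scale=0.1, size=(8, 5))                # RBM 1: 8 -> 5
W2 = rng.normal(scale=0.1, size=(5, 3))                # RBM 2: 5 -> 3

layer1 = sigmoid(data @ W1)    # hidden activations of RBM 1...
layer2 = sigmoid(layer1 @ W2)  # ...serve as visible input for RBM 2
```

Each layer thus sees a progressively more abstract re-encoding of the data, which is what made the stacked representations useful as an initialization for supervised fine-tuning.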

CD’s ability to perform feature learning in an unsupervised manner was invaluable. RBMs trained with CD could learn hierarchical representations of data, extracting increasingly abstract features at each layer. This made them useful for tasks like dimensionality reduction, image recognition, and even collaborative filtering. While the advent of more powerful convolutional neural networks (CNNs) and recurrent neural networks (RNNs), coupled with improved optimization algorithms and massive datasets, has somewhat diminished RBMs’ prominence, CD remains a critical historical and conceptual stepping stone in the deep learning narrative. It showed that complex, intractable problems could be tamed with clever approximations, paving the way for the computational behemoths we see today.

Limitations and the Inevitable Trade-offs

Despite its historical significance and practical utility, Contrastive Divergence is not without its caveats. The primary limitation, as alluded to, is the inherent bias introduced by its approximation. While often “good enough,” especially for pre-training, it does not guarantee convergence to the true log-likelihood maximum. The approximated gradient can sometimes lead the optimization astray, particularly if the Gibbs sampling chain is too short (k is too small).

Furthermore, RBMs themselves, even when trained with CD, have certain architectural limitations. They are inherently shallow compared to many modern deep learning models, and their undirected nature can sometimes make them less flexible for certain tasks compared to directed generative models. The training process, even with CD, can still be sensitive to hyperparameters and prone to local optima. While CD offered a lifeline to RBMs, it couldn’t magically erase all their inherent challenges. It was a tool, a very useful one, but never meant to be the only tool. And frankly, expecting one algorithm to solve all your problems is just unrealistic, isn’t it?