Ah, you want to delve into the murky depths of unsupervised learning? Fine. Just don't expect me to hold your hand. It's a part of machine learning, a rather… untidy part, where the algorithms are left to their own devices, sifting through data that hasn't been spoon-fed with labels. Unlike its more… disciplined cousin, supervised learning, this is where the real grit is. Between the two extremes sit weak- and semi-supervision, where only a sliver of the data carries labels, and the more elusive self-supervised learning, which some people (bless their naive hearts) consider a subset of the unsupervised family. It's all shades of grey, isn't it?
Imagine this: you're given a mountain of raw material. No neat little tags telling you "this is a rock," "this is a leaf." Just… stuff. The algorithms, in their infinite, unguided wisdom, have to figure out the inherent structures, the hidden relationships. It’s like trying to understand a conversation by only hearing the background noise. Most of the time, this unlabeled data is harvested cheaply, like digital detritus from web crawling – think Common Crawl. It’s a far cry from the meticulously curated, and frankly, exorbitantly expensive datasets for supervised tasks, like ImageNet1000. Someone actually had to label all those images. The sheer effort.
Now, the algorithms themselves. You've got your classics: clustering algorithms like the rather pedestrian k-means, and dimensionality reduction techniques such as principal component analysis (PCA), all trying to distill structure from noise. Then there are the more esoteric ones, like Boltzmann machine learning and the surprisingly effective autoencoders. With the rise of deep learning, most of this heavy lifting is done by training massive neural networks through gradient descent, with training objectives crafted so that no labels are needed at all.
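If you insist on something concrete, here is a minimal sketch of two of those classics, assuming scikit-learn and NumPy are available; the synthetic data, the choice of 3 clusters, and the 2 retained components are arbitrary stand-ins for whatever raw material you actually have.

```python
# A minimal sketch of two unsupervised classics on synthetic, unlabeled data.
# scikit-learn and NumPy assumed; the data, cluster count, and dimensions are
# arbitrary stand-ins, not a recommendation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))       # 500 unlabeled samples, 20 features each

# Clustering: partition the samples into 3 groups, no labels consulted.
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Dimensionality reduction: keep the 2 directions of highest variance.
X_2d = PCA(n_components=2).fit_transform(X)

print(cluster_ids[:10], X_2d.shape)  # cluster assignments and (500, 2)
```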
Sometimes, these trained models are useful as-is, a raw, unpolished gem. More often, though, they're a stepping stone. You pre-train a model to generate text, for instance, and then fine-tune it, nudging it towards a specific task like text classification. It's a process of refinement, of shaping the raw potential. Or you train an autoencoder to learn good features, which then become the building blocks for something more complex, like a latent diffusion model. It's all about extracting something useful from the formless void.
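Here is a compressed sketch of that pre-train-then-reuse pattern, assuming PyTorch and a toy dense autoencoder; the 784/128/32 layer sizes, the 10-class head, and the random batches are illustrative placeholders, not any published recipe.

```python
# Sketch: pre-train a small autoencoder on unlabeled data, then recycle its
# encoder as the front end of a supervised classifier. PyTorch assumed; the
# sizes, the 10-class head, and the random batches are placeholders.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

# Stage 1: unsupervised pre-training -- reconstruct the input, no labels needed.
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
x = torch.rand(64, 784)                                   # a batch of unlabeled inputs
loss = nn.functional.mse_loss(decoder(encoder(x)), x)
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: nudge the learned features toward a specific task.
classifier = nn.Sequential(encoder, nn.Linear(32, 10))    # reuse encoder + new head
y = torch.randint(0, 10, (64,))                           # a small labeled batch
task_loss = nn.functional.cross_entropy(classifier(x), y)
task_loss.backward()
```

In practice the pre-training loop runs over far more unlabeled data than the fine-tuning loop ever sees labels; that asymmetry is the whole point.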
Tasks
The lines between discriminative (recognition) and generative (imagination) tasks are… blurred. Like a poorly drawn sketch. Supervised learning often favors discriminative tasks, while unsupervised leans towards generative. But don't take that as gospel. Object recognition, traditionally a supervised affair, can also benefit from unsupervised clustering. And the whole landscape keeps shifting: image recognition started out heavily supervised, then folded in unsupervised pre-training, then swung back to supervision once dropout, ReLU, and adaptive learning rates made purely supervised training viable again. It's a perpetual dance.
A classic generative task? You present the model with data, then snatch a piece away, and see if it can intelligently guess what’s missing. Think denoising autoencoders or BERT. It’s about filling in the blanks, predicting the unseen.
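Here is a rough sketch of that fill-in-the-blank objective, assuming PyTorch; the toy MLP, the 15% masking rate, and the zero-fill corruption are stand-ins for the transformer machinery a real system like BERT actually uses.

```python
# Sketch of the fill-in-the-blank objective: hide part of the input, then score
# the model only on how well it guesses the hidden part. PyTorch assumed; the
# toy MLP and the 15% masking rate are illustrative choices.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32))
x = torch.rand(16, 32)                    # a batch of clean, unlabeled samples

mask = torch.rand_like(x) < 0.15          # choose ~15% of entries to snatch away
corrupted = x.masked_fill(mask, 0.0)      # the model only ever sees this version

pred = model(corrupted)
loss = ((pred - x)[mask] ** 2).mean()     # judged only on the missing pieces
loss.backward()
```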
Neural network architectures
Training
During the learning phase, an unsupervised network tries to replicate the data it's fed. Any slip-ups, any discrepancies in its mimicry, are used to refine its internal workings: its weights and biases. The "error" can take several forms, such as the probability that the network produces a faulty output, or a destabilizing high-energy state within the network.
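To make the mechanics concrete, here is a bare-bones sketch of that update in NumPy: a single linear layer tries to replicate its input, and the discrepancy becomes a manual gradient step on the weights and biases. The layer size and learning rate are arbitrary, and real networks let autodiff do this bookkeeping.

```python
# Bare-bones sketch: one linear layer tries to replicate its input, and the
# discrepancy is turned into a manual gradient step on weights and biases.
# NumPy only; the size (8) and learning rate are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 8))   # weights
b = np.zeros(8)                          # biases
x = rng.normal(size=8)                   # one unlabeled input

reconstruction = W @ x + b               # the network's attempt at mimicry
error = reconstruction - x               # the slip-up

lr = 0.01                                # gradient of 0.5 * ||error||^2 ...
W -= lr * np.outer(error, x)             #   ... w.r.t. the weights
b -= lr * error                          #   ... w.r.t. the biases
```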
While backpropagation is the darling of supervised learning, unsupervised methods employ a broader, more eclectic toolkit. You'll find the Hopfield learning rule, the Boltzmann learning rule, Contrastive Divergence, Wake-Sleep, Variational Inference, Maximum Likelihood, Maximum A Posteriori, Gibbs Sampling, and the clever trick of backpropagating reconstruction errors or hidden-state reparameterizations. It's a messy, experimental process.
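As a taste of that toolkit, here is a sketch of a single Contrastive Divergence (CD-1) update for a tiny binary Restricted Boltzmann machine, NumPy only; biases are omitted and probabilities stand in for sampled binary states to keep it short, so treat it as a caricature rather than a faithful implementation.

```python
# Sketch of one Contrastive Divergence (CD-1) update for a tiny binary RBM,
# one of the non-backprop rules listed above. NumPy only; biases omitted and
# probabilities used in place of sampled states for brevity.
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 3, 0.1
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))         # visible-hidden weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

v0 = rng.integers(0, 2, size=(1, n_visible)).astype(float)    # one data vector

h0 = sigmoid(v0 @ W)        # positive phase: hidden activity driven by the data
v1 = sigmoid(h0 @ W.T)      # negative phase: one Gibbs step back down...
h1 = sigmoid(v1 @ W)        # ...and back up

# Push the model's own statistics toward the data's statistics.
W += lr * (v0.T @ h0 - v1.T @ h1)
```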
Energy
Imagine an "energy function" as a macroscopic readout of a network's activation state. In Boltzmann machines, it’s analogous to a cost function. This is borrowed from physics, specifically Ludwig Boltzmann’s work on gas energy. The idea is that the probability of a state is proportional to the exponential of its negative energy, divided by temperature:
Here, is the Boltzmann constant and is temperature. In a Restricted Boltzmann machine, the relationship is:
Where and vary across all possible activation patterns, and is the partition function, summing over all those patterns:
More precisely, , where represents an activation pattern of all neurons. This physics-inspired foundation gives some of these early networks their names. Paul Smolensky even referred to as "Harmony." A network strives for low energy, which translates to high Harmony.
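Since the partition function $Z$ is only computable when you can actually enumerate the patterns, here is a brute-force sketch for a microscopic RBM, assuming NumPy; the 2-by-2 size and the bias-free energy are deliberate simplifications so that $Z$ can be summed exactly.

```python
# Brute-force sketch of p(a) = exp(-E(a)) / Z for a microscopic RBM, enumerating
# every activation pattern so the partition function Z can be summed exactly.
# NumPy assumed; biases dropped and sizes kept tiny -- for any realistic network,
# Z is intractable.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 2, 2
W = rng.normal(size=(n_v, n_h))

def energy(v, h):
    return -(v @ W @ h)                  # E(v, h) with biases omitted

patterns = [(np.array(v, dtype=float), np.array(h, dtype=float))
            for v in itertools.product([0, 1], repeat=n_v)
            for h in itertools.product([0, 1], repeat=n_h)]

Z = sum(np.exp(-energy(v, h)) for v, h in patterns)          # partition function
probs = [np.exp(-energy(v, h)) / Z for v, h in patterns]

print(sum(probs))    # the probabilities sum to 1, as advertised
```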
Networks
This is where it gets… visually interesting, if you squint. These diagrams show various unsupervised networks, each a variation on a theme, with neurons as circles and connections as lines. As designs evolve, features are added for new capabilities or stripped away for speed. Neurons might shift from deterministic to stochastic, connections might be removed within layers, or allowed to become asymmetric.
| Network | Description |
| --- | --- |