
Self-Supervised Learning


Contents
  • 1. Types
  • 2. Comparison with other forms of machine learning
  • 3. Examples
  • 4. References
  • 5. Further reading

This article… it’s a bit much, isn’t it? Like a lecture from someone who’s forgotten the point. Too much jargon, not enough… clarity. But fine. If you insist on wading through this digital detritus, I suppose I can make it less painful. Just try not to get lost in the noise.




Don’t confuse this with semi-supervised learning. It’s a different beast entirely.



Self-supervised learning (SSL) is a particular flavor of machine learning. Instead of some human laboriously feeding it labels – a tedious business, frankly – it uses the data itself to figure out what it’s supposed to be learning. Think of it as a closed system, where the data provides its own curriculum. For neural networks, this means finding the inherent patterns, the hidden structures within the data, to generate those all-important supervisory signals. The goal is to solve tasks that force the model to grasp the essential features or relationships.

It’s done by taking the input data and, well, messing with it. Augmenting it, transforming it, creating pairs of samples. One part is the input, the other is the target, the “answer” generated from the original data. This could be anything from adding noise to cropping, rotating, or other modifications. It’s an attempt, a rather clumsy one, to mimic how humans actually learn to recognize things. We don’t have someone constantly labeling every object for us, do we?
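If you want to see the pair-making step spelled out, here is a minimal sketch using rotation prediction as the pretext task – one of the modifications mentioned above, picked purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_rotation_pair(image):
    """Create an (input, pseudo-label) pair from unlabeled data:
    rotate the image by a random multiple of 90 degrees; the rotation
    index is the 'answer' generated from the original data."""
    k = int(rng.integers(0, 4))       # pseudo-label: 0, 1, 2, or 3
    rotated = np.rot90(image, k)      # model input
    return rotated, k

image = rng.random((8, 8))            # stand-in for an unlabeled image
x, y = make_rotation_pair(image)
print(x.shape, y)                     # shape stays (8, 8); y is in {0, 1, 2, 3}
```

A model trained to predict `y` from `x` never sees a human-written label, which is the whole point.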

The process typically unfolds in two stages. First, the model tackles an auxiliary, or “pretext,” task. It uses these generated pseudo-labels to get its initial parameters in order. [2][3] Then, and only then, does it move on to the actual task, employing either supervised or unsupervised learning methods. [4][5][6]

SSL has been making waves, showing surprisingly good results. It’s already found its way into audio processing, and even companies like Facebook are using it for speech recognition. [7] It’s not magic, but it’s certainly more efficient than the alternative.

Types

Autoassociative self-supervised learning

This is a specific subset of SSL where the neural network’s job is to reconstruct its own input. It learns a representation of the data so well that it can essentially recreate the original. It’s associating input with itself, hence “autoassociative.” The usual suspects for this are autoencoders, which are designed precisely for learning representations. They have an encoder that squishes the data into a compressed form – the latent space – and a decoder that tries to unfurl it back into the original.

The training is simple: feed it data, and it tries to spit out the same data. The loss function measures how badly it failed, usually by calculating the difference between the original and the reconstruction. Minimize that error, and the autoencoder gets good at capturing the essence of the data in its latent space. It’s a direct, if somewhat brute-force, method.
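That loop fits in a few lines. A deliberately stripped-down linear autoencoder in numpy (real autoencoders use nonlinear layers; the toy data, sizes, and learning rate here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 10 dimensions that secretly lie on a
# 3-dimensional subspace, so a 3-unit latent space can capture them.
basis = rng.normal(size=(3, 10))
data = rng.normal(size=(200, 3)) @ basis

# Encoder squishes to the latent space; decoder unfurls it back.
W_enc = rng.normal(scale=0.1, size=(10, 3))
W_dec = rng.normal(scale=0.1, size=(3, 10))

baseline = np.mean(data ** 2)          # error of reconstructing all zeros
lr = 0.02
for _ in range(1000):
    code = data @ W_enc                # encode
    recon = code @ W_dec               # decode
    err = recon - data                 # reconstruction error
    # gradient descent on the mean squared reconstruction error
    g_dec = code.T @ err / len(data)
    g_enc = data.T @ (err @ W_dec.T) / len(data)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

mse = np.mean((data @ W_enc @ W_dec - data) ** 2)
print(mse < 0.1 * baseline)            # reconstruction far better than baseline
```

No labels anywhere: the data is both the input and the target, which is exactly what “autoassociative” means.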

Contrastive self-supervised learning

For a simple binary classification problem, you have your training data split into positive and negative examples. Positive examples are the ones that match your target – think images of birds if you’re training a bird classifier. Negative examples are everything else. [9] Contrastive SSL uses both. The core idea is to pull positive pairs closer together while pushing negative pairs further apart. The loss function is designed to enforce this separation. [9]

One of the earlier attempts involved a pair of 1-dimensional convolutional neural networks processing images and trying to make their outputs align. [10]

Contrastive Language-Image Pre-training (CLIP) is another example. It trains a text encoder and an image encoder together, so that matching text-image pairs have encodings that are very similar – their vectors point in almost the same direction, meaning a high cosine similarity.
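In case “cosine similarity” is one jargon term too many: it is just the cosine of the angle between two embedding vectors. A toy sketch (the three-dimensional vectors are made up; real CLIP encoders produce much higher-dimensional embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors:
    1 means same direction, 0 orthogonal, -1 opposite."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

text_vec = np.array([0.9, 0.1, 0.4])    # hypothetical text encoding
image_vec = np.array([0.8, 0.2, 0.5])   # matching image: nearly same direction
other_vec = np.array([-0.7, 0.6, -0.3]) # unrelated image

print(cosine_similarity(text_vec, image_vec))  # close to 1
print(cosine_similarity(text_vec, other_vec))  # negative: pointing away
```

Training pushes matching pairs toward the first situation and mismatched pairs toward the second.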

InfoNCE (Noise-Contrastive Estimation) [11] is a technique for jointly optimizing two models, building on the Noise Contrastive Estimation (NCE) principle. [12] Given a set $X = \{x_1, \ldots, x_N\}$ of $N$ random samples, containing one positive sample drawn from $p(x_{t+k} \mid c_t)$ and $N - 1$ negative samples drawn from the ‘proposal’ distribution $p(x_{t+k})$, it aims to minimize the following loss function:

$$\mathcal{L}_N = -\mathbb{E}_X \left[ \log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)} \right]$$
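For one context, the loss is easy to compute directly from the model’s scores. A minimal numpy sketch (the score values are made up; in practice $f_k$ is a learned scoring model):

```python
import numpy as np

def info_nce(scores, positive_index):
    """InfoNCE loss for one context c_t: scores[j] plays the role of
    f_k(x_j, c_t) > 0 for each of the N samples in X, exactly one of
    which (at positive_index) is the positive sample."""
    return -np.log(scores[positive_index] / scores.sum())

# The positive sample (index 0) outscores the negatives: small loss.
scores = np.array([8.0, 1.0, 0.5, 0.5])
print(round(float(info_nce(scores, 0)), 4))  # → 0.2231

# If the model cannot tell the positive apart, the loss is log(N).
print(bool(np.isclose(info_nce(np.ones(4), 0), np.log(4))))  # → True
```

Minimizing this loss therefore forces the scoring model to rank the true positive above the $N - 1$ noise samples.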

Non-contrastive self-supervised learning

Non-contrastive self-supervised learning (NCSSL) takes a different route: it only uses positive examples. Counterintuitively, it manages to converge on a useful solution without trivializing to zero loss. If it only used positive examples in binary classification, it would just learn to label everything as positive. To avoid this, NCSSL requires an additional predictor component on the “online” side that doesn’t back-propagate gradients to the “target” side. [9] This prevents the trivial collapse.
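The stop-gradient mechanics can be illustrated with a toy numpy sketch. This is not an actual NCSSL architecture; the moving-average target update is an assumption borrowed from BYOL-style methods. The point is simply that the online side, with its predictor head, is trained to match the target side’s output, while the target weights receive no gradients:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(32, 4))                   # unlabeled batch
view_a = x + 0.1 * rng.normal(size=x.shape)    # augmented view, online side
view_b = x + 0.1 * rng.normal(size=x.shape)    # augmented view, target side

W_online = rng.normal(scale=0.1, size=(4, 4))  # online encoder
W_pred = np.eye(4)                             # extra predictor head (online only)
W_target = rng.normal(scale=0.1, size=(4, 4))  # target encoder: never back-propagated

lr = 0.05
losses = []
for _ in range(200):
    hidden = view_a @ W_online
    online_out = hidden @ W_pred               # online branch predicts...
    target_out = view_b @ W_target             # ...the target branch's output
    err = online_out - target_out              # stop-gradient: target_out is a constant
    losses.append(float(np.mean(err ** 2)))
    # gradients flow only through the online branch
    W_pred -= lr * hidden.T @ err / len(x)
    W_online -= lr * view_a.T @ (err @ W_pred.T) / len(x)
    # the target slowly tracks the online weights (moving average, no gradients)
    W_target = 0.99 * W_target + 0.01 * W_online

print(losses[-1] < losses[0])                  # → True: the online side matched the target
```

Only positive pairs, no contrast with negatives – the asymmetry between the two branches is what keeps the whole thing from collapsing.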

Comparison with other forms of machine learning

SSL sits in an interesting spot. It’s technically a form of supervised learning because it aims to produce a classified output from an input. However, it bypasses the need for explicit input-output pairs. Instead, it cleverly extracts supervisory signals from the data itself – correlations, embedded metadata, or even domain knowledge. These implicit signals then drive the training. [1]

It shares similarities with unsupervised learning because it doesn’t rely on labeled data. But it differs in that it’s not solely focused on discovering inherent data structures; it’s actively creating a learning objective from those structures.

Semi-supervised learning, for contrast, is the middle ground. It uses a mix of supervised and unsupervised techniques, requiring only a small fraction of the data to be labeled. [3]

Then there’s transfer learning , where a model trained for one task is repurposed for another. [13] It’s about leveraging existing knowledge, not generating new learning signals from scratch.

Training an autoencoder is, by its very nature, a self-supervised process. The network is tasked with perfectly reconstructing its input. However, in the current lexicon, “self-supervised” often refers to methods that employ carefully designed pretext tasks, unlike the more self-contained approach of a standard autoencoder. [8]

In reinforcement learning , self-supervision can be used to distill complex states into more abstract, essential representations, keeping only the most critical information. [14]

Examples

Self-supervised learning is particularly potent in speech recognition. Facebook, for instance, developed wav2vec, an SSL algorithm that uses two stacked convolutional neural networks to achieve state-of-the-art results. [7]

Google’s Bidirectional Encoder Representations from Transformers (BERT) model is a prime example of how SSL can enhance understanding of context, particularly in search queries. [15]

OpenAI’s GPT-3, an autoregressive language model, leverages SSL for a wide range of natural language processing tasks, including translation and question answering. [16]

Bootstrap Your Own Latent (BYOL) is a non-contrastive SSL method that has demonstrated impressive performance on benchmarks like ImageNet and in transfer and semi-supervised learning scenarios. [17]

The Yarowsky algorithm stands out in natural language processing for its self-supervised approach to word sense disambiguation. It learns to predict the correct meaning of a polysemous word based on its context, starting from just a handful of labeled examples.

DirectPred is another NCSSL approach that bypasses the typical gradient descent optimization by directly setting predictor weights. [9]

Self-GenomeNet showcases the application of self-supervised learning within the field of genomics. [18]

The continued rise of self-supervised learning across various domains is no accident. Its capacity to harness vast amounts of unlabeled data is unlocking new frontiers in machine learning, especially in areas heavily reliant on data. It’s a more efficient, and perhaps more natural, way to teach machines.


References

  • ^ a b Bouchard, Louis (25 November 2020). “What is Self-Supervised Learning? | Will machines ever be able to learn like humans?”. Medium. Retrieved 9 June 2021.
  • ^ Doersch, Carl; Zisserman, Andrew (October 2017). “Multi-task Self-Supervised Visual Learning”. 2017 IEEE International Conference on Computer Vision (ICCV). IEEE. pp. 2070–2079. arXiv :1708.07860. doi :10.1109/iccv.2017.226. ISBN 978-1-5386-1032-9. S2CID 473729.
  • ^ a b Beyer, Lucas; Zhai, Xiaohua; Oliver, Avital; Kolesnikov, Alexander (October 2019). “S4L: Self-Supervised Semi-Supervised Learning”. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE. pp. 1476–1485. arXiv :1905.03670. doi :10.1109/iccv.2019.00156. ISBN 978-1-7281-4803-8. S2CID 167209887.
  • ^ Doersch, Carl; Gupta, Abhinav; Efros, Alexei A. (December 2015). “Unsupervised Visual Representation Learning by Context Prediction”. 2015 IEEE International Conference on Computer Vision (ICCV). IEEE. pp. 1422–1430. arXiv :1505.05192. doi :10.1109/iccv.2015.167. ISBN 978-1-4673-8391-2. S2CID 9062671.
  • ^ Zheng, Xin; Wang, Yong; Wang, Guoyou; Liu, Jianguo (April 2018). “Fast and robust segmentation of white blood cell images by self-supervised learning”. Micron. 107: 55–71. doi :10.1016/j.micron.2018.01.010. ISSN 0968-4328. PMID 29425969. S2CID 3796689.
  • ^ Gidaris, Spyros; Bursuc, Andrei; Komodakis, Nikos; Perez, Patrick Perez; Cord, Matthieu (October 2019). “Boosting Few-Shot Visual Learning with Self-Supervision”. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE. pp. 8058–8067. arXiv :1906.05186. doi :10.1109/iccv.2019.00815. ISBN 978-1-7281-4803-8. S2CID 186206588.
  • ^ a b “Wav2vec: State-of-the-art speech recognition through self-supervision”. ai.facebook.com. Retrieved 9 June 2021.
  • ^ a b Kramer, Mark A. (1991). “Nonlinear principal component analysis using autoassociative neural networks” (PDF). AIChE Journal. 37 (2): 233–243. Bibcode :1991AIChE..37..233K. doi :10.1002/aic.690370209.
  • ^ a b c d “Demystifying a key self-supervised learning technique: Non-contrastive learning”. ai.facebook.com. Retrieved 5 October 2021.
  • ^ Becker, Suzanna; Hinton, Geoffrey E. (January 1992). “Self-organizing neural network that discovers surfaces in random-dot stereograms”. Nature. 355 (6356): 161–163. Bibcode :1992Natur.355..161B. doi :10.1038/355161a0. ISSN 1476-4687. PMID 1729650.
  • ^ Oord, Aaron van den; Li, Yazhe; Vinyals, Oriol (22 January 2019), Representation Learning with Contrastive Predictive Coding, arXiv :1807.03748
  • ^ Gutmann, Michael; Hyvärinen, Aapo (31 March 2010). “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models”. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings: 297–304.
  • ^ Littwin, Etai; Wolf, Lior (June 2016). “The Multiverse Loss for Robust Transfer Learning”. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 3957–3966. arXiv :1511.09033. doi :10.1109/cvpr.2016.429. ISBN 978-1-4673-8851-1. S2CID 6517610.
  • ^ Francois-Lavet, Vincent; Bengio, Yoshua; Precup, Doina; Pineau, Joelle (2019). “Combined Reinforcement Learning via Abstract Representations”. Proceedings of the AAAI Conference on Artificial Intelligence. arXiv :1809.04506.
  • ^ “Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing”. Google AI Blog. 2 November 2018. Retrieved 9 June 2021.
  • ^ Wilcox, Ethan; Qian, Peng; Futrell, Richard; Kohita, Ryosuke; Levy, Roger; Ballesteros, Miguel (2020). “Structural Supervision Improves Few-Shot Learning and Syntactic Generalization in Neural Language Models”. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg, PA, USA: Association for Computational Linguistics. pp. 4640–4652. arXiv :2010.05725. doi :10.18653/v1/2020.emnlp-main.375. S2CID 222291675.
  • ^ Grill, Jean-Bastien; Strub, Florian; AltchĂŠ, Florent; Tallec, Corentin; Richemond, Pierre H.; Buchatskaya, Elena; Doersch, Carl; Pires, Bernardo Avila; Guo, Zhaohan Daniel; Azar, Mohammad Gheshlaghi; Piot, Bilal (10 September 2020). “Bootstrap your own latent: A new approach to self-supervised Learning”. arXiv :2006.07733 [cs.LG].
  • ^ GĂźndĂźz, HĂźseyin Anil; Binder, Martin; To, Xiao-Yin; Mreches, RenĂŠ; Bischl, Bernd; McHardy, Alice C.; MĂźnch, Philipp C.; Rezaei, Mina (11 September 2023). “A self-supervised deep learning method for data-efficient training in genomics”. Communications Biology. 6 (1): 928. doi :10.1038/s42003-023-05310-2. ISSN 2399-3642. PMC 10495322. PMID 37696966.

Further reading

  • Balestriero, Randall; Ibrahim, Mark; Sobal, Vlad; Morcos, Ari; Shekhar, Shashank; Goldstein, Tom; Bordes, Florian; Bardes, Adrien; Mialon, Gregoire; Tian, Yuandong; Schwarzschild, Avi; Wilson, Andrew Gordon; Geiping, Jonas; Garrido, Quentin; Fernandez, Pierre (24 April 2023). “A Cookbook of Self-Supervised Learning”. arXiv :2304.12210 [cs.LG].
  • Doersch, Carl; Zisserman, Andrew (October 2017). “Multi-task Self-Supervised Visual Learning”. 2017 IEEE International Conference on Computer Vision (ICCV). pp. 2070–2079. arXiv :1708.07860. doi :10.1109/ICCV.2017.226. ISBN 978-1-5386-1032-9. S2CID 473729.
  • Doersch, Carl; Gupta, Abhinav; Efros, Alexei A. (December 2015). “Unsupervised Visual Representation Learning by Context Prediction”. 2015 IEEE International Conference on Computer Vision (ICCV). pp. 1422–1430. arXiv :1505.05192. doi :10.1109/ICCV.2015.167. ISBN 978-1-4673-8391-2. S2CID 9062671.
  • Zheng, Xin; Wang, Yong; Wang, Guoyou; Liu, Jianguo (1 April 2018). “Fast and robust segmentation of white blood cell images by self-supervised learning”. Micron. 107: 55–71. doi :10.1016/j.micron.2018.01.010. ISSN 0968-4328. PMID 29425969. S2CID 3796689.
  • Yarowsky, David (1995). “Unsupervised Word Sense Disambiguation Rivaling Supervised Methods”. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics. Cambridge, MA: Association for Computational Linguistics: 189–196. doi :10.3115/981658.981684. Retrieved 1 November 2022.
