Not to be confused with Semi-supervised learning.
Self-supervised learning
Self-supervised learning (SSL) is a paradigm in machine learning in which a model is trained on a task using the data itself to generate supervisory signals, rather than relying on externally provided labels. In the context of neural networks, self-supervised learning exploits inherent structures or relationships within the input data to create meaningful training signals. SSL tasks are designed so that solving them requires the model to capture essential features or relationships in the data.
The input data is typically augmented or transformed to create pairs of related samples: one sample serves as the input, and the other forms the target, an “answer” derived from the original data. The augmentation can consist of introducing noise, cropping, rotation, or other transformations. In this way, self-supervised learning more closely imitates how humans learn to recognize objects, which rarely involves an explicit label for every observation.
The training typically unfolds in two stages. First, the model solves an auxiliary or “pretext” task, using the generated pseudo-labels to initialize its parameters. [2][3] Second, the actual task of interest is performed with supervised or unsupervised learning. [4][5][6]
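The sketch below illustrates this two-stage pipeline in PyTorch, using rotation prediction as the pretext task (one common choice). The architecture, layer sizes, and class counts are illustrative assumptions, not a reference implementation of any particular published method.

```python
# Stage 1 (pretext): predict which of 4 rotations was applied to an image.
# The pseudo-label comes from the data itself, so no human labeling is needed.
import torch
import torch.nn as nn

backbone = nn.Sequential(                      # shared feature extractor
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
pretext_head = nn.Linear(16, 4)                # 4 rotation classes
optimizer = torch.optim.Adam(
    list(backbone.parameters()) + list(pretext_head.parameters()))
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 32, 32)             # stand-in unlabeled batch
k = torch.randint(0, 4, (images.size(0),))     # pseudo-labels: rotation index
rotated = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                       for img, r in zip(images, k)])

logits = pretext_head(backbone(rotated))
loss = loss_fn(logits, k)                      # pretext loss on pseudo-labels
loss.backward()
optimizer.step()

# Stage 2 (downstream): reuse the pretrained backbone and train a new head
# on whatever (possibly small) labeled set is available.
downstream_head = nn.Linear(16, 10)            # e.g. 10 real classes
```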
Self-supervised learning has produced promising results in recent years and has found practical application in audio processing; it is used by Facebook and others for speech recognition. [7]
Types
Autoassociative self-supervised learning
Autoassociative self-supervised learning is a specific category of self-supervised learning in which a neural network is trained to reconstruct or reproduce its own input: the model learns a representation good enough that the original data can be recovered from it. The term “autoassociative” refers to this association of the input with itself. The approach is typically realized with autoencoders, a class of neural networks designed for representation learning. An encoder maps the input into a compressed, lower-dimensional representation, the latent space, and a decoder attempts to reconstruct the original input from that representation.
During training, the network is presented with input data and must reproduce it as closely as possible. The loss function quantifies the reconstruction error, typically the difference between the original input and the reconstructed output. Minimizing this error drives the autoencoder to capture the essential structure of the data in its latent space.
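A minimal PyTorch sketch of this training loop follows; the layer sizes are arbitrary illustrative choices, and the mean-squared error between input and reconstruction serves as the loss.

```python
# Autoassociative SSL: the network is trained to reproduce its own input,
# and the reconstruction error is the loss.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))  # to latent space
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))  # back to input space
model = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()              # reconstruction error

x = torch.rand(64, 784)             # stand-in batch, e.g. flattened 28x28 images
for _ in range(10):
    optimizer.zero_grad()
    reconstruction = model(x)
    loss = loss_fn(reconstruction, x)   # the input is its own target
    loss.backward()
    optimizer.step()
```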
Contrastive self-supervised learning
For a simple binary classification problem, the training data can be divided into positive examples and negative examples. Positive examples are those that match the target (for instance, images of birds when training a bird classifier); negative examples are everything else. [9] Contrastive self-supervised learning uses both kinds. The core idea is to pull the representations of positive pairs closer together while pushing those of negative pairs further apart, and the loss function is designed to enforce this separation. [9]
One of the earlier attempts involved a pair of 1-dimensional convolutional neural networks processing images and trying to make their outputs align. [10]
Contrastive Language-Image Pre-training (CLIP) is another example. It trains a text encoder and an image encoder jointly, so that matching text-image pairs have encodings with high cosine similarity: their vectors point in nearly the same direction.
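The sketch below shows the core of a CLIP-style objective, assuming two hypothetical encoders that already produce image and text embeddings of the same dimension. A symmetric cross-entropy over the cosine-similarity matrix pulls matching pairs together and pushes mismatched pairs apart.

```python
# CLIP-style contrastive loss (sketch). Row/column i of the similarity
# matrix corresponds to the i-th image/text; the diagonal holds the
# matching pairs, which the cross-entropy treats as the correct "class".
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    image_emb = F.normalize(image_emb, dim=-1)       # unit vectors, so that
    text_emb = F.normalize(text_emb, dim=-1)         # dot product = cosine similarity
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0))           # image i matches text i
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```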
InfoNCE (Noise-Contrastive Estimation) [11] is a technique for jointly optimizing two models, building on the Noise Contrastive Estimation (NCE) principle. [12] Given a set $X = \{x_1, \ldots, x_N\}$ of $N$ random samples containing one positive sample drawn from $p(x_{t+k} \mid c_t)$ and $N-1$ negative samples drawn from the “proposal” distribution $p(x_{t+k})$, it aims to minimize the following loss function:

$$\mathcal{L}_N = -\,\mathbb{E}_X \left[ \log \frac{f_k(x_{t+k},\, c_t)}{\sum_{x_j \in X} f_k(x_j,\, c_t)} \right]$$
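This loss transcribes almost directly into code. The sketch below assumes the score function $f_k$ is the exponential of a dot product between the context encoding and each sample encoding (the original work uses a log-bilinear model); under that assumption the loss reduces to a standard cross-entropy over the sample set.

```python
# InfoNCE loss (sketch), assuming f_k(x, c) = exp(x . c).
# cross_entropy computes -log( exp(logit_pos) / sum_j exp(logit_j) ),
# which matches the InfoNCE objective above under this choice of f_k.
import torch
import torch.nn.functional as F

def info_nce(context: torch.Tensor, samples: torch.Tensor,
             positive_idx: torch.Tensor) -> torch.Tensor:
    """context: (B, D); samples: (B, N, D); positive_idx: (B,) index of x_{t+k}."""
    logits = torch.einsum('bd,bnd->bn', context, samples)  # log f_k, up to f_k's form
    return F.cross_entropy(logits, positive_idx)
```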
Non-contrastive self-supervised learning
Non-contrastive self-supervised learning (NCSSL) takes a different route: it uses only positive examples. Counterintuitively, it converges on a useful solution rather than collapsing to a trivial one with zero loss (in binary classification, such a collapse would amount to labeling every example as positive). To avoid this, NCSSL requires an additional predictor component on the “online” side that does not back-propagate gradients to the “target” side, which prevents the trivial collapse. [9]
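A schematic of this online/target arrangement, with BYOL-style naming and placeholder architectures (all layer choices here are assumptions for illustration), is sketched below; the stop-gradient on the target branch and the extra predictor on the online branch are what prevent the collapse.

```python
# Non-contrastive SSL (BYOL-style sketch): only positive pairs, no negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F

online_encoder = nn.Linear(128, 64)
target_encoder = nn.Linear(128, 64)   # in BYOL, a slow-moving EMA copy of the online weights
predictor = nn.Linear(64, 64)         # exists only on the online side

view1, view2 = torch.randn(32, 128), torch.randn(32, 128)  # two augmented views

p = F.normalize(predictor(online_encoder(view1)), dim=-1)
with torch.no_grad():                 # stop-gradient: nothing flows to the target side
    z = F.normalize(target_encoder(view2), dim=-1)

loss = (2 - 2 * (p * z).sum(dim=-1)).mean()  # equals ||p - z||^2 for unit vectors
loss.backward()                       # updates online encoder and predictor only
```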
Comparison with other forms of machine learning
SSL belongs with supervised learning methods insofar as the goal is to produce a classified output from an input. However, it does not require explicit labeled input-output pairs: the supervisory signals are extracted from correlations, metadata embedded in the data, or domain knowledge present in the input. These implicitly generated signals then drive the training. [1]
SSL is similar to unsupervised learning in that it does not rely on labeled data, but it differs in aim: rather than only discovering inherent structure in the data, it actively constructs a learning objective from that structure.
Semi-supervised learning, by contrast, occupies a middle ground, combining supervised and unsupervised techniques: only a small fraction of the training data needs to be labeled. [3]
In transfer learning, a model trained for one task is repurposed for another. [13] The emphasis there is on leveraging existing knowledge rather than on generating new learning signals from the data itself.
Training an autoencoder is intrinsically a self-supervised process, since the network is tasked with reconstructing its own input. In current usage, however, the term “self-supervised” often refers to methods built on carefully designed pretext tasks, as opposed to the self-contained training of a standard autoencoder. [8]
In reinforcement learning , self-supervision can be used to distill complex states into more abstract, essential representations, keeping only the most critical information. [14]
Examples
Self-supervised learning is particularly potent in speech recognition . Facebook , for instance, developed wav2vec, an SSL algorithm that uses two stacked convolutional neural networks to achieve state-of-the-art results. [7]
Google ’s Bidirectional Encoder Representations from Transformers (BERT) model is a prime example of how SSL can enhance understanding of context, particularly in search queries. [15]
OpenAI ’s GPT-3 , an autoregressive language model , leverages SSL for a wide range of natural language processing tasks, including translation and question answering. [16]
Bootstrap Your Own Latent (BYOL) is a non-contrastive SSL method that has demonstrated impressive performance on benchmarks like ImageNet and in transfer and semi-supervised learning scenarios. [17]
The Yarowsky algorithm stands out in natural language processing for its self-supervised approach to word sense disambiguation . It learns to predict the correct meaning of a polysemous word based on its context, starting from just a handful of labeled examples.
DirectPred is another NCSSL approach that bypasses the typical gradient descent optimization by directly setting predictor weights. [9]
Self-GenomeNet showcases the application of self-supervised learning within the field of genomics. [18]
The continued adoption of self-supervised learning across domains is driven by its capacity to harness vast amounts of unlabeled data, which makes it especially attractive in data-intensive areas of machine learning. It is a more efficient, and arguably more natural, way to train models than exhaustive manual labeling.
References
- ^ a b Bouchard, Louis (25 November 2020). “What is Self-Supervised Learning? | Will machines ever be able to learn like humans?”. Medium. Retrieved 9 June 2021.
- ^ Doersch, Carl; Zisserman, Andrew (October 2017). “Multi-task Self-Supervised Visual Learning”. 2017 IEEE International Conference on Computer Vision (ICCV). IEEE. pp. 2070–2079. arXiv :1708.07860. doi :10.1109/iccv.2017.226. ISBN 978-1-5386-1032-9. S2CID 473729.
- ^ a b Beyer, Lucas; Zhai, Xiaohua; Oliver, Avital; Kolesnikov, Alexander (October 2019). “S4L: Self-Supervised Semi-Supervised Learning”. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE. pp. 1476–1485. arXiv :1905.03670. doi :10.1109/iccv.2019.00156. ISBN 978-1-7281-4803-8. S2CID 167209887.
- ^ Doersch, Carl; Gupta, Abhinav; Efros, Alexei A. (December 2015). “Unsupervised Visual Representation Learning by Context Prediction”. 2015 IEEE International Conference on Computer Vision (ICCV). IEEE. pp. 1422–1430. arXiv :1505.05192. doi :10.1109/iccv.2015.167. ISBN 978-1-4673-8391-2. S2CID 9062671.
- ^ Zheng, Xin; Wang, Yong; Wang, Guoyou; Liu, Jianguo (April 2018). “Fast and robust segmentation of white blood cell images by self-supervised learning”. Micron. 107: 55–71. doi :10.1016/j.micron.2018.01.010. ISSN 0968-4328. PMID 29425969. S2CID 3796689.
- ^ Gidaris, Spyros; Bursuc, Andrei; Komodakis, Nikos; Pérez, Patrick; Cord, Matthieu (October 2019). “Boosting Few-Shot Visual Learning with Self-Supervision”. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE. pp. 8058–8067. arXiv :1906.05186. doi :10.1109/iccv.2019.00815. ISBN 978-1-7281-4803-8. S2CID 186206588.
- ^ a b “Wav2vec: State-of-the-art speech recognition through self-supervision”. ai.facebook.com. Retrieved 9 June 2021.
- ^ a b Kramer, Mark A. (1991). “Nonlinear principal component analysis using autoassociative neural networks” (PDF). AIChE Journal. 37 (2): 233–243. Bibcode :1991AIChE..37..233K. doi :10.1002/aic.690370209.
- ^ a b c d “Demystifying a key self-supervised learning technique: Non-contrastive learning”. ai.facebook.com. Retrieved 5 October 2021.
- ^ Becker, Suzanna; Hinton, Geoffrey E. (January 1992). “Self-organizing neural network that discovers surfaces in random-dot stereograms”. Nature. 355 (6356): 161–163. Bibcode :1992Natur.355..161B. doi :10.1038/355161a0. ISSN 1476-4687. PMID 1729650.
- ^ Oord, Aaron van den; Li, Yazhe; Vinyals, Oriol (22 January 2019), Representation Learning with Contrastive Predictive Coding, arXiv :1807.03748
- ^ Gutmann, Michael; Hyvärinen, Aapo (31 March 2010). “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models”. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings: 297–304.
- ^ Littwin, Etai; Wolf, Lior (June 2016). “The Multiverse Loss for Robust Transfer Learning”. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 3957–3966. arXiv :1511.09033. doi :10.1109/cvpr.2016.429. ISBN 978-1-4673-8851-1. S2CID 6517610.
- ^ Francois-Lavet, Vincent; Bengio, Yoshua; Precup, Doina; Pineau, Joelle (2019). “Combined Reinforcement Learning via Abstract Representations”. Proceedings of the AAAI Conference on Artificial Intelligence. arXiv :1809.04506.
- ^ “Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing”. Google AI Blog. 2 November 2018. Retrieved 9 June 2021.
- ^ Wilcox, Ethan; Qian, Peng; Futrell, Richard; Kohita, Ryosuke; Levy, Roger; Ballesteros, Miguel (2020). “Structural Supervision Improves Few-Shot Learning and Syntactic Generalization in Neural Language Models”. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg, PA, USA: Association for Computational Linguistics. pp. 4640–4652. arXiv :2010.05725. doi :10.18653/v1/2020.emnlp-main.375. S2CID 222291675.
- ^ Grill, Jean-Bastien; Strub, Florian; Altché, Florent; Tallec, Corentin; Richemond, Pierre H.; Buchatskaya, Elena; Doersch, Carl; Pires, Bernardo Avila; Guo, Zhaohan Daniel; Azar, Mohammad Gheshlaghi; Piot, Bilal (10 September 2020). “Bootstrap your own latent: A new approach to self-supervised Learning”. arXiv :2006.07733 [cs.LG].
- ^ Gündüz, Hüseyin Anil; Binder, Martin; To, Xiao-Yin; Mreches, René; Bischl, Bernd; McHardy, Alice C.; Münch, Philipp C.; Rezaei, Mina (11 September 2023). “A self-supervised deep learning method for data-efficient training in genomics”. Communications Biology. 6 (1): 928. doi :10.1038/s42003-023-05310-2. ISSN 2399-3642. PMC 10495322. PMID 37696966.
Further reading
- Balestriero, Randall; Ibrahim, Mark; Sobal, Vlad; Morcos, Ari; Shekhar, Shashank; Goldstein, Tom; Bordes, Florian; Bardes, Adrien; Mialon, Gregoire; Tian, Yuandong; Schwarzschild, Avi; Wilson, Andrew Gordon; Geiping, Jonas; Garrido, Quentin; Fernandez, Pierre (24 April 2023). “A Cookbook of Self-Supervised Learning”. arXiv :2304.12210 [cs.LG].
External links
- Yarowsky, David (1995). “Unsupervised Word Sense Disambiguation Rivaling Supervised Methods”. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics. Cambridge, MA: Association for Computational Linguistics: 189–196. doi :10.3115/981658.981684. Retrieved 1 November 2022.