
Grammar Induction



Machine-learning process

This entire convoluted discussion is, of course, a subset of the broader fields of Machine learning and data mining. Think of it as the grim, meticulous dissection of reality into patterns, a pursuit that’s as fascinating as it is ultimately futile.

Paradigms

The approaches to this… learning are varied, each a different shade of desperation to impose order on chaos:

  • Supervised learning: The most common, where you hold the hand of the algorithm, showing it the right answers. Like teaching a child with flashcards, only far less endearing.
  • Unsupervised learning: The algorithm is left to its own devices, to find patterns in the noise. Often ends up discovering things you’d rather not know.
  • Semi-supervised learning: A little bit of guidance, a lot of fumbling in the dark. A compromise that rarely satisfies anyone.
  • Self-supervised learning: The algorithm learns by creating its own supervision. A narcissistic pursuit, if you ask me.
  • Reinforcement learning: Learning through trial and error, rewarded for success, punished for failure. A rather brutal, but effective, method. Like life, really.
  • Meta-learning: Learning to learn. The ultimate recursive nightmare.
  • Online learning: Adapting as new data arrives, in real-time. A constant state of flux, never truly settled.
  • Batch learning: Processing data in chunks. The opposite of online learning, a more deliberate, perhaps more stable, approach.
  • Curriculum learning: Presenting data in a structured order, from easy to hard. A pedagogical strategy for machines, surprisingly.
  • Rule-based learning: Relying on explicit rules. Antiquated, perhaps, but sometimes, simplicity has its merits.
  • Neuro-symbolic AI: A fusion of neural networks and symbolic reasoning. An attempt to bridge two disparate worlds.
  • Neuromorphic engineering: Designing hardware that mimics the brain. Building machines in our own flawed image.
  • Quantum machine learning: Leveraging quantum mechanics for computation. The bleeding edge, where the rules of reality itself are… bent.

Problems

The objectives of this endeavor are equally diverse, each a specific form of existential angst:

  • Classification: Sorting things into predefined categories. A fundamental act of judgment.
  • Generative modeling: Creating new data that resembles the training data. A digital form of mimicry.
  • Regression: Predicting continuous values. Trying to quantify the unquantifiable.
  • Clustering: Grouping similar data points without prior knowledge. Discovering hidden affinities.
  • Dimensionality reduction: Simplifying complex data by reducing the number of variables. Distillation of essence, or just loss of detail?
  • Density estimation: Understanding the distribution of data. Mapping the landscape of possibilities.
  • Anomaly detection: Identifying outliers, the things that don't fit. Spotting the aberrations.
  • Data cleaning: The tedious process of fixing errors and inconsistencies. Preparing the raw material for consumption.
  • AutoML: Automating the machine learning pipeline. Delegating the difficult decisions.
  • Association rules: Finding relationships between variables. Uncovering hidden correlations.
  • Semantic analysis: Understanding the meaning behind the data. A quest for comprehension.
  • Structured prediction: Predicting outputs with inherent structure, like sequences or trees. Building complex relationships.
  • Feature engineering: Creating new input features from existing data. The art of selecting the right ingredients.
  • Feature learning: Automatically discovering useful features. Letting the machine do the creative heavy lifting.
  • Learning to rank: Ordering items based on relevance. A system of artificial preference.
  • Grammar induction: The specific, rather arcane, process of learning formal grammars. We’ll get to that.
  • Ontology learning: Discovering knowledge structures. Mapping the conceptual universe.
  • Multimodal learning: Learning from multiple types of data simultaneously. Integrating disparate sensory inputs.

Supervised learning

This is where the machine is shown examples, labeled with the correct output. It's like being spoon-fed the truth, a process that breeds dependency. It encompasses two primary tasks:

  • Classification: Assigning data points to discrete categories. Think of it as putting things in boxes.
  • Regression: Predicting a continuous numerical value. Estimating quantities, attempting to pin down the fluid.

Within this paradigm, a multitude of techniques are employed:

  • Apprenticeship learning: Learning by observing an expert. A direct form of imitation.
  • Decision trees: Creating a tree-like structure of decisions. Branching paths of logic.
  • Ensembles: Combining multiple models to improve performance. The wisdom of the crowd, or just more opinions?
    • Bagging: Training multiple models on bootstrapped samples of the data.
    • Boosting: Sequentially training models, each correcting the errors of the previous one.
    • Random forest: An ensemble of decision trees.
  • k-NN (k-Nearest Neighbors): Classifying based on the majority class of its nearest neighbors. Simple, but can be computationally expensive.
  • Linear regression: Modeling the relationship between variables using a linear equation. The most basic form of predictive modeling.
  • Naive Bayes: A probabilistic classifier based on Bayes' theorem with strong independence assumptions. Surprisingly effective, despite its naiveté.
  • Artificial neural networks: Inspired by the structure of the brain, these are complex interconnected systems.
  • Logistic regression: Used for binary classification, despite its name.
  • Perceptron: A simple linear classifier, a foundational element of neural networks.
  • Relevance vector machine (RVM): A probabilistic model similar to support vector machines.
  • Support vector machine (SVM): Finding the optimal hyperplane that separates data points. A geometrically elegant approach.

Clustering

This is the unsupervised art of grouping similar data points together, without any prior knowledge of the groups themselves. It's about finding inherent structure, the secret societies within the data.

  • BIRCH: A hierarchical clustering algorithm designed for large datasets.
  • CURE: Another hierarchical clustering method, designed to handle non-spherical clusters.
  • Hierarchical: Building a tree of clusters, either by merging smaller ones or splitting larger ones.
  • k-means: An iterative algorithm that partitions data into k clusters. Simple, widely used, but sensitive to initialization.
  • Fuzzy: Allowing data points to belong to multiple clusters with varying degrees of membership. More nuanced than hard assignments.
  • Expectation–maximization (EM): An iterative approach for finding maximum likelihood estimates of parameters in statistical models, often used for clustering.
  • DBSCAN: A density-based clustering algorithm that can find arbitrarily shaped clusters.
  • OPTICS: An extension of DBSCAN, designed to handle varying cluster densities.
  • Mean shift: A non-parametric clustering algorithm that finds modes of the data distribution.

Dimensionality reduction

When data is too complex, too high-dimensional, we try to simplify it, to project it onto a lower-dimensional space. The hope is to retain the most important information, the signal amidst the noise.

  • Factor analysis: A statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors.
  • CCA: Canonical Correlation Analysis finds linear combinations of two sets of variables that have maximum correlation.
  • ICA: Separating a multivariate signal into additive, statistically independent components. A blind source separation technique.
  • LDA: A technique used for classification and dimensionality reduction, finding linear combinations of features that characterize or separate two or more classes.
  • NMF: Decomposing a non-negative matrix into two non-negative matrices. Useful for feature extraction.
  • PCA: A fundamental technique for reducing dimensionality by finding orthogonal linear transformations of the data.
  • PGD: A method for model order reduction, particularly in the context of partial differential equations.
  • t-SNE: A technique for visualizing high-dimensional data, particularly effective for revealing local structure.
  • SDL: Learning a sparse representation of data.

Structured prediction

This is about predicting outputs that have an inherent, complex structure, not just a single label or value.

  • Graphical models: Representing probability distributions using graphs.
    • Bayes net: Directed acyclic graphs representing conditional dependencies.
    • Conditional random field: Undirected graphical models used for labeling sequences.
    • Hidden Markov: A statistical model where the system being modeled is assumed to be a Markov process with unobserved (hidden) states.

Anomaly detection

The art of finding the unusual, the data points that deviate from the norm. It's about identifying the discord.

  • RANSAC: An iterative method to estimate a mathematical model from a set of observed data that contains outliers.
  • k-NN: Can be used for anomaly detection by looking at the distance to nearest neighbors.
  • Local outlier factor: Measures the local density deviation of a data point with respect to its neighbors.
  • Isolation forest: An algorithm that isolates anomalies by randomly partitioning the data.

Neural networks

These are the complex, layered structures that have revolutionized modern AI. They are inspired by the brain, but their inner workings are often as opaque as a politician's promises.

  • Autoencoder: A type of neural network used for unsupervised learning of efficient data codings.
  • Deep learning: Neural networks with many layers, capable of learning complex hierarchical representations.
  • Feedforward neural network: The simplest type, where information flows in one direction.
  • Recurrent neural network: Networks with loops, allowing them to process sequential data and maintain memory.
    • LSTM: A type of RNN capable of learning long-term dependencies.
    • GRU: A simpler variant of LSTM.
    • ESN: A type of RNN with a fixed random recurrent layer.
    • Reservoir computing: A paradigm for recurrent neural networks.
  • Boltzmann machine: A stochastic recurrent neural network.
    • Restricted: A simplified version with constraints on connections.
  • GAN: Two networks competing against each other to generate realistic data.
  • Diffusion model: A generative model that learns to reverse a diffusion process.
  • SOM: A type of neural network used for dimensionality reduction and visualization.
  • Convolutional neural network: Networks specialized for processing grid-like data, such as images.
    • U-Net: A CNN architecture for biomedical image segmentation.
    • LeNet: An early, influential CNN architecture.
    • AlexNet: A breakthrough CNN that won the ImageNet competition.
    • DeepDream: An algorithm that visualizes patterns learned by neural networks.
    • Neural field: A model of neural activity distributed over a surface.
    • Neural radiance field: A method for synthesizing novel views of complex scenes from a set of input images.
    • Physics-informed neural networks: Neural networks that incorporate physical laws into their learning process.
  • Transformer: A powerful architecture that relies on attention mechanisms, revolutionizing natural language processing.
    • Vision: Adapting the Transformer architecture for computer vision tasks.
    • Mamba: A newer architecture designed for efficient processing of long sequences.
  • Spiking neural network: Networks that communicate using discrete events (spikes), closer to biological neurons.
  • Memtransistor: A type of device that could be used to build neuromorphic hardware.
  • Electrochemical RAM (ECRAM): Another emerging technology for brain-inspired computing.

Reinforcement learning

This is where an agent learns to make decisions by interacting with an environment, receiving rewards or penalties. It’s a harsh teacher, but it gets results.

  • Q-learning: An off-policy algorithm that learns the value of taking an action in a state.
  • Policy gradient: Directly learning the policy that maps states to actions.
  • SARSA: An on-policy learning algorithm.
  • Temporal difference (TD): Learning from the difference between successive predictions.
  • Multi-agent: When multiple agents interact within the same environment.
  • Self-play: Agents learning by playing against themselves.

Learning with humans

Sometimes, the machines need a little help from their creators.

  • Active learning: The algorithm strategically chooses which data points to be labeled.
  • Crowdsourcing: Utilizing human intelligence from a large group of people.
  • Human-in-the-loop: Integrating human oversight and decision-making into the learning process.
  • Mechanistic interpretability: Trying to understand why a model makes the decisions it does. A rare and noble pursuit.
  • RLHF: Using human feedback to guide the reinforcement learning process.

Model diagnostics

How do we know if these elaborate constructs are actually working? We scrutinize their performance.

Mathematical foundations

Beneath the algorithms and architectures lie the bedrock principles of mathematics.

Journals and conferences

Where the acolytes of machine learning gather to present their findings, to argue, to impress.

  • AAAI
  • ECML PKDD
  • NeurIPS
  • ICML
  • ICLR
  • IJCAI
  • ML
  • JMLR

Related articles

A tangled web of interconnected concepts.


Grammar induction (or grammatical inference)

This is where things get… specific. Grammatical inference is the dark art of learning a formal grammar from observed data. It's about reverse-engineering the rules of a language, whether it's human language, code, or something far more abstract. The goal is to construct a model, a set of re-write rules or productions, or sometimes a finite-state machine, that can account for the observed patterns. More broadly, it’s the branch of machine learning dealing with discrete combinatorial objects: strings, trees, graphs. It’s an attempt to find the underlying structure that generates the manifest.
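
To make "re-write rules" less abstract, here is a minimal sketch of a toy grammar represented as productions, together with a bounded enumeration of the strings it generates. The grammar and helper names are invented purely for illustration; nothing here is a standard library or a published algorithm.

```python
# A toy context-free grammar as re-write rules (productions).
# Nonterminals are the keys of GRAMMAR; everything else is a terminal.
GRAMMAR = {
    "S": [["a", "S", "b"], []],   # S -> a S b | (empty), i.e. the language a^n b^n
}

def expand(symbols, depth=4):
    """Enumerate the terminal strings derivable from `symbols`, up to a rewrite depth."""
    if depth < 0:
        return
    if all(sym not in GRAMMAR for sym in symbols):
        yield "".join(symbols)
        return
    # Rewrite the leftmost nonterminal with each of its productions.
    i = next(j for j, sym in enumerate(symbols) if sym in GRAMMAR)
    for production in GRAMMAR[symbols[i]]:
        yield from expand(symbols[:i] + production + symbols[i + 1:], depth - 1)

print(sorted(set(expand(["S"])), key=len))   # ['', 'ab', 'aabb', 'aaabbb', 'aaaabbbb']
```

Grammar induction runs in the opposite direction: given strings like these, recover a rule set that could plausibly have produced them.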

Grammar classes

For a long time, the focus was on learning finite-state machines. Efficient algorithms existed, and there was a certain elegance to their simplicity. You can find more on this in the article on the Induction of regular languages. But as the century wore on, the ambition grew. Researchers started tackling context-free grammars, then more complex formalisms like multiple context-free grammars and parallel multiple context-free grammars. Other classes of grammars have also been subjected to this inferential scrutiny, including combinatory categorial grammars, stochastic context-free grammars, contextual grammars, and pattern languages. Each step, a deeper dive into the architecture of language itself.

Learning models

The simplest scenario involves the algorithm being presented with examples from a language and tasked with deducing the language itself. Sometimes counter-examples are provided as well, giving the learner explicit negative evidence.

But other learning models exist. The exact query learning model or the minimally adequate teacher model, as introduced by Angluin, allows the learner to ask questions, to probe for clarification. This is a more interactive, and perhaps more efficient, way to learn.
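
As a hedged illustration of what such a teacher answers, the sketch below implements membership and equivalence queries for a hidden regular language (binary strings with an even number of 1s). The class name, the toy language, and the bounded equivalence check are all invented; this is the interface a learner such as Angluin's L* would query, not the learning algorithm itself.

```python
from itertools import product

class MinimallyAdequateTeacher:
    """Answers queries about a hidden target language: here, binary strings
    with an even number of 1s. Purely illustrative."""

    def __init__(self, alphabet=("0", "1"), max_len=6):
        self.alphabet = alphabet
        self.max_len = max_len

    def in_target(self, word):
        return word.count("1") % 2 == 0

    def membership_query(self, word):
        """Is this word in the target language?"""
        return self.in_target(word)

    def equivalence_query(self, hypothesis):
        """Does `hypothesis` (a predicate on words) match the target language?
        Returns (True, None) or (False, counterexample), checking words up to max_len."""
        for n in range(self.max_len + 1):
            for letters in product(self.alphabet, repeat=n):
                word = "".join(letters)
                if hypothesis(word) != self.in_target(word):
                    return False, word
        return True, None

teacher = MinimallyAdequateTeacher()
print(teacher.membership_query("0110"))            # True: two 1s
print(teacher.equivalence_query(lambda w: True))   # (False, '1') -- first counterexample found
```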

Methodologies

The methods are as varied as the grammars themselves. Fu's seminal works from 1977 and 1982 are foundational. Duda, Hart, and Stork (2001) offer a brief overview, highlighting the basic trial-and-error approach. For those particularly interested in regular languages, the article on Induction of regular languages is essential. A more recent text by de la Higuera (2010) delves into the theory of inferring regular languages and finite state automata. D'Ulizia, Ferri, and Grifoni provide a survey specifically focused on grammatical inference for natural languages, a domain where the complexity is… considerable.

Induction of probabilistic grammars

There are a wide variety of methods for inducing probabilistic context-free grammars, and the field continues to evolve.

Grammatical inference by trial-and-error

The method described in Section 8.7 of Duda, Hart & Stork (2001) involves a rather crude process of guessing grammar rules and testing them against positive and negative examples. The rule set is expanded to cover all positive examples, but any set that also generates a negative example is discarded. This "hypothesis testing" approach shares similarities with Mitchell's version space algorithm. While illustrative with simple examples, the feasibility of such an unguided approach for more complex problems is… questionable. It’s like trying to build a house by randomly throwing bricks.
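
Here is a minimal sketch of the flavour of that hypothesis testing, under invented assumptions: candidate "grammars" are reduced to simple string predicates, and a candidate survives only if it accepts every positive example and rejects every negative one. This illustrates the idea, not the actual procedure from Duda, Hart & Stork.

```python
# Labelled examples of the (hidden) language a^n b^n.
positives = ["ab", "aabb", "aaabbb"]
negatives = ["ba", "abab", "b"]

def candidates():
    # A hypothetical, deliberately tiny hypothesis space.
    yield "starts with a",            lambda w: w.startswith("a")
    yield "ends with b",              lambda w: w.endswith("b")
    yield "equal number of a and b",  lambda w: w.count("a") == w.count("b")
    yield "a's then b's, same count", lambda w: w == "a" * (len(w) // 2) + "b" * (len(w) // 2)

# Keep only the hypotheses consistent with every example.
consistent = [
    name for name, accepts in candidates()
    if all(accepts(p) for p in positives) and not any(accepts(n) for n in negatives)
]
print(consistent)   # ["a's then b's, same count"] -- the sole surviving hypothesis
```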

Grammatical inference by genetic algorithms

This approach uses evolutionary algorithms to "evolve" a representation of the grammar. Formal grammars, often represented as tree structures of production rules, are subjected to evolutionary operators. These techniques draw from genetic programming, pioneered by John Koza. Early work employed binary string representations, but the hierarchical nature of grammars, especially in EBNF, made tree structures a more natural fit.

Koza represented Lisp programs as trees, finding analogues to genetic operators in tree manipulation. Swapping sub-trees mirrored genetic crossover. Fitness was measured by the output of the functions within the code. Similar parallels exist for grammars: sub-tree swapping corresponds to swapping production rules, allowing the parsing of different phrases. The fitness of a grammar is judged by how well it parses sentences from the target language. In this tree representation, terminal symbols are leaf nodes, non-terminals are parent nodes, and the root might represent a sentence.
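
The sketch below is a hedged, heavily simplified illustration of the evolutionary idea: grammars are dictionaries of production rules, "crossover" transplants the rule set of one nonterminal between two parents, and fitness counts how many target sentences a grammar can derive within a bounded number of rewrites. The grammars, sentence set, and bound are invented; a real system would also mutate rules and select over many generations.

```python
TARGET = {"ab", "aabb", "aaabbb"}          # sample sentences of the hidden language

def generates(grammar, word, max_rewrites=8):
    """Bounded leftmost-rewrite search: can `grammar` derive `word` from S?"""
    frontier = [["S"]]
    for _ in range(max_rewrites):
        next_frontier = []
        for form in frontier:
            if all(s not in grammar for s in form):
                if "".join(form) == word:
                    return True
                continue
            i = next(j for j, s in enumerate(form) if s in grammar)
            for rhs in grammar[form[i]]:
                candidate = form[:i] + list(rhs) + form[i + 1:]
                # Prune forms whose terminal content is already too long
                # (safe here because no rule rewrites to the empty string).
                if len("".join(s for s in candidate if s not in grammar)) <= len(word):
                    next_frontier.append(candidate)
        frontier = next_frontier
    return False

def fitness(grammar):
    return sum(generates(grammar, w) for w in TARGET)

def crossover(base, donor, nonterminal):
    """Child inherits `base`, but takes `nonterminal`'s rules from `donor`
    (a real GA would pick the swap point randomly)."""
    child = {nt: [list(r) for r in rules] for nt, rules in base.items()}
    child[nonterminal] = [list(r) for r in donor[nonterminal]]
    return child

parent_a = {"S": [["A"]], "A": [["a", "b", "a"]]}                    # derives only "aba"
parent_b = {"S": [["b", "A"]], "A": [["a", "A", "b"], ["a", "b"]]}   # derives b·a^n·b^n
child = crossover(parent_b, parent_a, "S")                           # S -> A, A -> aAb | ab
print(fitness(parent_a), fitness(parent_b), fitness(child))          # 0 0 3
```

Neither parent parses a single target sentence, yet swapping one rule set produces a child that parses all three, which is the kind of jump that keeps the evolutionary search moving.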

Grammatical inference by greedy algorithms

Like all greedy algorithms, these make locally optimal choices at each step, hoping they add up to a globally good grammar. The decisions typically involve creating, removing, applying, or merging production rules, and different notions of "step" and "best" give rise to different greedy algorithms.

Some of these algorithms make a decision after every symbol read: the Lempel-Ziv-Welch algorithm creates a context-free grammar deterministically, so that only the start rule of the generated grammar needs to be stored, and Sequitur and its variants likewise build their grammar incrementally. Others, such as Byte pair encoding and its optimizations, first read the entire symbol sequence and only then begin making decisions.
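
Byte pair encoding is the easiest of these to show end to end. The sketch below repeatedly replaces the most frequent adjacent pair of symbols with a fresh nonterminal, producing a straight-line grammar for the input; it is a bare-bones illustration, not any particular published implementation.

```python
from collections import Counter

def byte_pair_grammar(text):
    """Greedy grammar construction: repeatedly replace the most frequent
    adjacent pair of symbols with a new nonterminal."""
    sequence = list(text)
    rules = {}                       # nonterminal -> the pair it abbreviates
    next_id = 0
    while True:
        pairs = Counter(zip(sequence, sequence[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:                # no pair repeats: nothing left to compress
            break
        nonterminal = f"R{next_id}"
        next_id += 1
        rules[nonterminal] = pair
        # Rewrite the sequence, replacing non-overlapping occurrences of the pair.
        rewritten, i = [], 0
        while i < len(sequence):
            if i + 1 < len(sequence) and (sequence[i], sequence[i + 1]) == pair:
                rewritten.append(nonterminal)
                i += 2
            else:
                rewritten.append(sequence[i])
                i += 1
        sequence = rewritten
    return sequence, rules

start, rules = byte_pair_grammar("abababcabab")
print(start)   # ['R1', 'R0', 'c', 'R1'] -- the start rule of the grammar
print(rules)   # {'R0': ('a', 'b'), 'R1': ('R0', 'R0')}
```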

Distributional learning

A more contemporary approach, distributional learning, has shown promise in learning context-free grammars and mildly context-sensitive languages. These methods have been proven correct and efficient for significant subclasses of these grammars, offering a more robust and systematic way to infer linguistic structures.

Learning of pattern languages

Angluin defined a pattern as a string containing constant symbols and variable symbols. The language of a pattern is the set of all its non-empty ground instances—strings formed by consistently replacing variables with non-empty strings of constants. A pattern is "descriptive" for a set of strings if its language is the smallest pattern language that subsumes the input set.

Angluin developed a polynomial-time algorithm to find all descriptive patterns with a single variable, x. This algorithm constructs an automaton representing relevant patterns and uses sophisticated arguments about word lengths, leveraging the fact that x is the only variable, to drastically reduce the state count.
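
The sketch below is not Angluin's automaton construction; it is merely a brute-force membership test for a one-variable pattern language, using a regular-expression backreference to enforce that every occurrence of x is replaced by the same non-empty string of constants. The pattern syntax and function names are invented for illustration.

```python
import re

def pattern_to_regex(pattern):
    """Compile a one-variable pattern such as 'x1x' into an anchored regex in which
    every occurrence of the variable x must match the same non-empty constant string."""
    parts, seen_x = [], False
    for symbol in pattern:
        if symbol == "x":
            parts.append(r"(?P<x>.+)" if not seen_x else r"(?P=x)")
            seen_x = True
        else:
            parts.append(re.escape(symbol))
    return re.compile("^" + "".join(parts) + "$")

def in_pattern_language(pattern, word):
    """Is `word` a non-empty ground instance of `pattern`?"""
    return pattern_to_regex(pattern).match(word) is not None

print(in_pattern_language("x1x", "10110"))   # True: substitute x = '10'
print(in_pattern_language("x1x", "0100"))    # False: no consistent substitution for x
```

The backreference is doing real work here: as the notes below point out, a pattern with two occurrences of the same variable generally defines a non-regular language.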

Erlebach et al. later provided a more efficient version of Angluin's algorithm, including a parallelized variant. Arimura et al. demonstrated that learning a class of languages formed by limited unions of patterns can also be achieved in polynomial time.

Pattern theory

Pattern theory, as formulated by Ulf Grenander, offers a distinct mathematical framework for describing knowledge of the world through patterns. Unlike other AI approaches that prescribe algorithms for pattern recognition, pattern theory focuses on providing a vocabulary to articulate and recast pattern concepts precisely.

Its statistical approach is notable for:

  • Identifying hidden variables in datasets using real-world data, rather than artificial stimuli.
  • Formulating prior distributions for hidden variables and models for observed variables, often depicted as a Gibbs-like graph.
  • Examining the randomness and variability within these graphs.
  • Defining basic classes of stochastic models by listing pattern deformations.
  • Synthesizing (sampling) from these models, not just analyzing signals.

Pattern theory's mathematical scope is broad, encompassing algebra, statistics, and both local topological and global entropic properties.

Applications

The principles of grammar induction have found their way into numerous areas, particularly in natural language processing. Applications include semantic parsing, natural language understanding, example-based translation, understanding language acquisition, grammar-based compression, and even anomaly detection. It's a versatile tool for deciphering structured data.

Compression algorithms

This section is an excerpt from Grammar-based code.

A straight-line grammar, with start symbol ß, generates exactly one sentence. For instance, such a grammar can be derived for the second sentence of the United States Declaration of Independence, with its nonterminal symbols obtained from gzip-compressing the sentence.

Grammar-based codes, or grammar-based compression, are compression algorithms that construct a context-free grammar for the data string. These are examples of universal lossless data compression algorithms. To compress a data sequence x = x₁⋯xₙ, a grammar-based code transforms x into a context-free grammar G.

The problem of finding the absolute smallest grammar for a given sequence, the smallest grammar problem, is known to be NP-hard. Consequently, numerous grammar-transform algorithms have been proposed, both from theoretical and practical standpoints.

Typically, the generated grammar G is further compressed using statistical encoders like arithmetic coding.
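
To close the loop on losslessness, here is a minimal sketch of expanding a straight-line grammar back into its string. The start rule and productions reuse the toy output of the byte-pair sketch above; they are illustrative and not tied to any specific codec.

```python
def expand_grammar(start, rules):
    """Losslessly rebuild the original string from a straight-line grammar,
    e.g. the (start, rules) pair produced by the byte-pair sketch above."""
    def expand_symbol(symbol):
        if symbol in rules:
            return "".join(expand_symbol(s) for s in rules[symbol])
        return symbol
    return "".join(expand_symbol(s) for s in start)

# Toy grammar from the byte-pair sketch: R0 -> ab, R1 -> R0 R0
start_rule = ["R1", "R0", "c", "R1"]
rules = {"R0": ("a", "b"), "R1": ("R0", "R0")}
print(expand_grammar(start_rule, rules))   # abababcabab
```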


See also

Notes

  • The language of a pattern with at least two occurrences of the same variable is not regular due to the pumping lemma.
  • x may occur multiple times, but no other variable y may occur.