Quantitative Linguistics

Quantitative linguistics (QL) is a sub-discipline of general linguistics and, more precisely, a branch of mathematical linguistics. Its practitioners study language learning, language change, and both the use and the structure of natural languages, all through the dispassionate lens of numerical analysis. Essentially, it is where language, that messy, human thing, gets subjected to the cold, hard logic of numbers.

QL investigates languages using statistical methods, turning the ephemeral business of communication into quantifiable data. Its most demanding objective is the formulation of universal language laws and, ultimately, the construction of a general theory of language, understood as a system of interrelated, mathematically formulated laws. The point is not just to describe what happens but to explain and predict it, much as the laws of the physical world do. Synergetic linguistics was designed with exactly this goal in mind from its very inception.

The empirical foundation of QL rests heavily on the findings of language statistics, a field that can be understood either as the statistics of languages or as the statistics of any linguistic object. Language statistics supplies the raw data but does not necessarily carry substantial theoretical ambitions of its own. Related disciplines contribute empirical evidence as well: corpus linguistics, which collects and analyzes large datasets of natural language, and computational linguistics, which applies computational techniques to language processing. Their results can support or challenge the hypotheses generated within QL. It's all about the data, after all.

History

The earliest QL approaches, demonstrating humanity's enduring need to quantify everything, can be traced back to the ancient Indian world. One historical source is the application of combinatorics to linguistic matters. Another is a body of elementary statistical studies usually cataloged under the headings of colometry and stichometry, ancient methods for measuring the lines and parts of texts that reveal an early, if rudimentary, interest in textual quantification.

Quantitative Laws

In QL, the concept of a "law" is defined more strictly than in everyday conversation. A law is a law hypothesis of a particular kind: it has been logically deduced from theoretical assumptions, it is formulated mathematically, and it is interrelated with other laws in the field. Crucially, it has also been tested sufficiently and successfully against empirical data, meaning that it has resisted refutation despite considerable efforts to disprove it.

As Reinhard Köhler, a prominent figure in the field, observed regarding QL laws:

“Moreover, it can be shown that these properties of linguistic elements and of the relations among them abide by universal laws which can be formulated strictly mathematically in the same way as common in the natural sciences. One has to bear in mind in this context that these laws are of stochastic nature; they are not observed in every single case (this would be neither necessary nor possible); they rather determine the probabilities of the events or proportions under study. It is easy to find counterexamples to each of the above-mentioned examples; nevertheless, these cases do not violate the corresponding laws as variations around the statistical mean are not only admissible but even essential; they are themselves quantitatively exactly determined by the corresponding laws. This situation does not differ from that in the natural sciences, which have since long abandoned the old deterministic and causal views of the world and replaced them by statistical/probabilistic models.”

This rather lengthy explanation boils down to a key point: these linguistic laws are inherently stochastic. They do not dictate what must happen in every single instance, which would be an absurd expectation for something as fluid as language. Instead, they determine the probabilities of events or the expected proportions of the phenomena under investigation. A single deviating instance does not invalidate a law, because variation around the statistical mean is part of what the law itself specifies, and its size is quantitatively determined. It is a bit like forecasting the weather: you can model the probability of rain, but you cannot guarantee that a specific hour will be wet. This probabilistic perspective aligns QL with the modern natural sciences, which have largely replaced deterministic views with statistical and probabilistic models.
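The stochastic reading above can be made concrete with a small numerical sketch. It assumes, purely for illustration, that a law predicts a fixed probability `p` for some event and that each token is an independent trial (a binomial model, which real texts only approximate). The function names and the figures in the demo are hypothetical and do not come from the QL literature.

```python
import math


def binomial_mean_sd(n_tokens, p):
    """Expected count and standard deviation of an event with probability p
    over n_tokens independent trials (binomial model, an assumption here)."""
    mean = n_tokens * p
    sd = math.sqrt(n_tokens * p * (1 - p))
    return mean, sd


def z_score(observed, n_tokens, p):
    """Distance of an observed count from the expected count,
    measured in standard deviations."""
    mean, sd = binomial_mean_sd(n_tokens, p)
    return (observed - mean) / sd


if __name__ == "__main__":
    # Hypothetical figures: a law predicts that 5% of tokens fall in some class.
    n_tokens, p = 10_000, 0.05
    mean, sd = binomial_mean_sd(n_tokens, p)
    print(f"expected {mean:.0f} +/- {sd:.1f} occurrences")
    for observed in (480, 520, 650):
        print(observed, round(z_score(observed, n_tokens, p), 2))
```

Under these assumptions, counts such as 480 or 520 lie within about one standard deviation of the expectation and are exactly the kind of variation the law allows for, whereas a count of 650 lies several standard deviations away and would count as evidence against the hypothesized value of `p`.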

Linguistic Laws

Within quantitative linguistics, linguistic laws are statistical regularities that appear across different linguistic scales, from small units such as phonemes and syllables to larger ones such as words and sentences. These regularities are formulated mathematically and derived from theoretical assumptions, and they count as laws only once they have withstood testing against empirical data. Among the principal linguistic laws proposed and documented by various researchers, the following are particularly noteworthy:

Zipf’s law: This well-known principle states that the frequency of a word in a given corpus is approximately inversely proportional to its rank in the frequency list. In the idealized case, the most common word appears roughly twice as often as the second most common, three times as often as the third, and so on. The pattern is not limited to words; similar relations between rank and frequency have been observed for sounds, phonemes, and letters.
Heaps’ law: This law describes vocabulary growth within a text. It relates the number of distinct words (or lexemes) in a document, or a collection of documents, to the length of that document measured in total words. As a text grows longer, new words keep appearing, but at a steadily slowing rate. The law is useful for estimating vocabulary size and textual diversity.
Brevity law or Zipf’s law of abbreviation: This law, attributed to Zipf, states qualitatively that the more frequently a word is used, the shorter it tends to be. It is a tendency observed across lexicons rather than a rule for every single word. It suggests an underlying principle of communicative efficiency in which high-frequency items are streamlined for quicker processing and articulation. It is almost as if language users subconsciously conspire to conserve effort.
Menzerath’s law (also known as the Menzerath-Altmann law): This somewhat counter-intuitive law states that the constituents of a linguistic construction tend to become smaller as the construction becomes larger. The longer a sentence is (measured in clauses), the shorter its clauses are on average (measured in words). Similarly, the longer a word is (measured in syllables or morphemes), the shorter its syllables or morphs are (measured in sounds or phonemes).
Law of diversification: When linguistic categories such as parts of speech or inflectional endings appear in a variety of forms, the frequencies with which those forms occur in texts are not random but follow lawful distributions. Even the apparent chaos of linguistic variation adheres to underlying patterns of distribution and usage.
Martin’s law: This law concerns lexical chains, which are built by looking up the definition of a word in a dictionary, then the definition of that definition, and so on. The definitions obtained in this way form a hierarchy of increasingly general meanings, in which the number of definitions decreases as generality increases. Martin’s law states that lawful relations hold between the levels of this hierarchy.
Piotrowski’s law of language change: This law holds that growth processes in language, such as the expansion of vocabulary, the spread of foreign words or loanwords, or changes in an inflectional system, follow growth models known from other scientific disciplines. Specifically, it applies the logistic function, a common S-shaped growth curve, to these phenomena. The same model has also been shown to cover language acquisition processes.
Text block law: Linguistic units such as words, letters, syntactic functions, or grammatical constructions show a characteristic frequency distribution when a text is divided into blocks of equal size and the occurrences of the unit are counted per block.
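Several of the laws listed above are usually written as simple functions. The following is a minimal sketch, not a reference implementation: the two counting helpers show which quantities Zipf's and Heaps' laws are about, and the remaining functions are commonly cited closed forms. The parameter names (`exponent`, `k`, `beta`, `a`, `b`, `c`) are illustrative assumptions, other parameterizations exist (notably for the Menzerath-Altmann function), and in practice the parameters are estimated from data.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first.
    Ties are broken alphabetically so the result is deterministic."""
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def zipf_expected(rank, top_count, exponent=1.0):
    """Idealized Zipf frequency at a given rank: top_count / rank**exponent."""
    return top_count / rank ** exponent


def vocabulary_growth(tokens):
    """Number of distinct words seen after each token.
    This is the empirical curve that Heaps' law models."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


def heaps_expected(n_tokens, k, beta):
    """Heaps' law as a power law: V(N) = k * N**beta, with 0 < beta < 1."""
    return k * n_tokens ** beta


def menzerath_altmann(x, a, b, c):
    """One common form of the Menzerath-Altmann function:
    y = a * x**b * exp(-c * x), where x is the size of the construction
    and y the mean size of its constituents (b is typically negative)."""
    return a * x ** b * math.exp(-c * x)


def piotrowski_logistic(t, a, b, c):
    """Logistic curve used in Piotrowski's law:
    p(t) = c / (1 + a * exp(-b * t)), with c as the upper limit."""
    return c / (1 + a * math.exp(-b * t))


if __name__ == "__main__":
    # Toy sentence, far too short to exhibit any law; it only shows the calls.
    tokens = "the cat saw the dog and the dog saw the bird".split()
    table = rank_frequency(tokens)
    top_count = table[0][2]
    for rank, word, count in table:
        print(rank, word, count, round(zipf_expected(rank, top_count), 2))
    print(vocabulary_growth(tokens))
```

Testing a law on a real corpus would mean fitting these functions to the observed counts and assessing the goodness of fit, which this sketch leaves out.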

Stylistics

The styles of both poetic and non-poetic writing can be studied with statistical methods, a rather dry way to examine something as subjective as art. It is also possible to compare texts of different styles by looking at the specific forms, or parameter values, that established language laws take in each of them. In these applications, QL supports research in stylistics by grounding evidence for stylistic phenomena in quantifiable terms, namely the regular patterns described by language laws. A central assumption is that some laws, such as the distribution of word lengths, require different models, or at least different parameter values of those models (whether distributions or functions), depending on the type of text under study. Where poetic texts are concerned, QL methods form a sub-discipline of the Quantitative Study of Literature, often referred to as stylometrics.
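The idea that the same law can take different parameter values in different kinds of text can be sketched as follows. The example measures word length in letters, a crude stand-in chosen only to keep the code short (QL studies often count syllables instead), and it uses the mean length as a placeholder for the parameter of a fitted model. The two sample strings and all function names are hypothetical.

```python
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"


def word_length_distribution(text):
    """Relative frequency of each word length (in letters) in a text."""
    words = [word.strip(PUNCTUATION) for word in text.split()]
    lengths = [len(word) for word in words if word]
    counts = Counter(lengths)
    total = len(lengths)
    return {length: counts[length] / total for length in sorted(counts)}


def mean_length(distribution):
    """Mean word length implied by a relative-frequency distribution."""
    return sum(length * share for length, share in distribution.items())


if __name__ == "__main__":
    # Two invented snippets standing in for texts of different types.
    samples = {
        "verse": "So long as men can breathe or eyes can see",
        "prose": "Quantitative investigations require comparable textual material",
    }
    for label, text in samples.items():
        distribution = word_length_distribution(text)
        print(label, distribution, round(mean_length(distribution), 2))
```

A real comparison would use much larger samples, fit an explicit distribution to each, and test whether the fitted parameters differ between text types.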

Important authors

A collection of individuals who have, against all odds, found language fascinating enough to quantify.

Gabriel Altmann (1931–2020)
Otto Behaghel (1854–1936), associated with Behaghel’s laws
Karl-Heinz Best (born 1943)
Sergej Grigor’evič Čebanov (1897–1966)
William Palin Elderton (1877–1962)
Gertraud Fenk-Oczlon
Ernst Wilhelm Förstemann (1822–1906)
Wilhelm Fucks (1902–1990)
Peter Grzybek (1957–2019)
Gustav Herdan (1897–1968)
Luděk Hřebíček (1934–2015)
Friedrich Wilhelm Kaeding (1843–1928)
Reinhard Köhler (born 1951)
Snježana Kordić (born 1964)
Werner Lehfeldt (born 1943)
Viktor Vasil’evič Levickij (1938–2012)
Haitao Liu
Helmut Meier (1897–1973)
Paul Menzerath (1883–1954), associated with Menzerath’s law
Sizuo Mizutani (1926–2014)
Augustus De Morgan (1806–1871)
Charles Muller, Strasbourg (1909–2015)
Rajmund G. Piotrowski
L. A. Sherman
Albert Thumb (1865–1915)
Juhan Tuldava (1922–2003)
Andrew Wilson, Lancaster
George Kingsley Zipf (1902–1950), associated with Zipf’s law
Eberhard Zwirner (1899–1984), known for his work in phonometry
