Regular Expression

Oh, this again. You want me to… rewrite Wikipedia. As if the internet wasn't already drowning in enough meaningless drivel. Fine. Don't expect me to be cheerful about it. This is what you get when you poke the bear.


Regular Expression

A regular expression, often abbreviated as regex or regexp, is a sequence of characters that defines a search pattern. These patterns are typically employed by string-searching algorithms for operations like "find" or "find and replace" within text data. The underlying principles of regular expressions are rooted in theoretical computer science and the study of formal language theory.

The genesis of the regular expression concept can be traced back to the 1950s, when the American mathematician Stephen Cole Kleene formalized the notion of a regular language. Regular expressions entered widespread use with the advent of Unix text-processing utilities. Since the 1980s, various syntaxes for writing regular expressions have emerged, notably the POSIX standard and the widely adopted Perl syntax.

Regular expressions find application in a multitude of domains, including search engines, the search and replace functionalities of word processors and text editors, text-processing utilities like sed and AWK, and in the process of lexical analysis. Many programming languages incorporate support for regular expressions, often through libraries referred to as "engines". A considerable number of these engines are available for reuse across different projects, with a notable list compiled at Comparison of regular expression engines.

History

The foundational concept of regular expressions was introduced in 1951 by mathematician Stephen Cole Kleene, who described regular languages using his mathematical notation known as "regular events." These concepts emerged from the field of theoretical computer science, specifically within the subfields of automata theory (which deals with models of computation) and the classification of formal languages. Kleene's work was partly motivated by his attempts to formalize early artificial neural networks. He presented this notation as an alternative to the "prehensible" concept proposed by McCulloch & Pitts, though he admitted a desire for a more descriptive term. Other early implementations of pattern matching, such as the SNOBOL language, did not use regular expressions but instead employed their own distinct pattern-matching constructs.

The practical application of regular expressions began to gain traction around 1968, primarily in two areas: as a pattern-matching tool within text editors and in the lexical analysis phase of compilers. One of the earliest programmatic implementations involved Ken Thompson, who integrated Kleene's notation into the QED text editor to facilitate pattern matching in text files. For performance reasons, Thompson implemented this matching process through just-in-time compilation to IBM 7094 code under the Compatible Time-Sharing System, an early and significant demonstration of JIT compilation. He later incorporated this capability into the Unix editor ed, which eventually led to the development of the widely used search utility grep. The name "grep" itself is derived from the command for regular expression searching within the ed editor: g/re/p, signifying "Global search for Regular Expression and Print matching lines." Concurrently with Thompson's work on QED, a research group, including Douglas T. Ross, developed a tool based on regular expressions for lexical analysis in compiler design.

Throughout the 1970s, numerous variations of these initial regular expression forms were integrated into various Unix programs developed at Bell Labs, including lex, sed, AWK, and expr, as well as in other prominent programs like vi and Emacs (though Emacs utilizes its own, often incompatible, syntax and behavior). The widespread adoption of regular expressions led to their standardization within the POSIX.2 standard in 1992.

The 1980s saw the emergence of more complex regular expressions, particularly in Perl. This was largely due to a regex library initially written by Henry Spencer in 1986, who later developed an implementation for Tcl called Advanced Regular Expressions. Spencer's Tcl library, a hybrid NFA/DFA implementation, offered improved performance characteristics. Projects such as PostgreSQL adopted this regular expression implementation. Perl subsequently expanded upon Spencer's original library, introducing numerous new features. A significant part of the design effort for Raku (formerly known as Perl 6) was dedicated to enhancing Perl's regex integration and broadening their scope and capabilities, aiming to enable the definition of parsing expression grammars. The outcome was a specialized mini-language termed Raku rules, which not only define the Raku grammar itself but also serve as a powerful tool for programmers within the language. These rules preserve the existing features of Perl 5.x regexes while also permitting BNF-style definitions of recursive descent parsers through sub-rules.

The application of regexes in structured information standards for document and database modeling dates back to the 1960s and saw expansion in the 1980s with the consolidation of industry standards like ISO SGML (which had ANSI "GCA 101-1983" as a precursor). The core of XML schema standards is built upon regular expressions, evident in the syntax of DTD element groups. Before the widespread use of regular expressions, many search languages relied on simple wildcards, such as "*" to represent any sequence of characters and "?" to represent a single character. Vestiges of this can still be found in the glob syntax for filenames and in the SQL LIKE operator.

Starting in 1997, Philip Hazel developed PCRE, an implementation designed to closely emulate Perl's regex functionality. PCRE is now utilized by a multitude of modern tools, including PHP and the Apache HTTP Server.

Today, regular expressions are extensively supported across programming languages, text processing utilities (particularly lexers), advanced text editors, and various other software applications. Support for regexes is often integrated into the standard library of many programming languages, such as Java and Python, and is even built into the core syntax of others, like Perl and ECMAScript. In the latter half of the 2010s, hardware implementations, particularly on FPGAs and GPUs, emerged for PCRE-compatible regex engines, offering significant speed advantages over their CPU-based counterparts.

Patterns

The term "regular expressions," or "regexes," is frequently used to refer to the specific, standardized textual syntax employed for representing patterns intended for text matching. This stands in contrast to the mathematical notation detailed later. Each character within a regular expression—that is, each character in the string defining its pattern—is either a metacharacter, possessing a special meaning, or a regular character that is interpreted literally. For instance, in the regex b., the character 'b' is literal, matching only 'b', while '.' is a metacharacter that matches any single character except a newline. Consequently, this regex would match strings like 'b%', 'bx', or 'b5'. The combination of metacharacters and literal characters provides a means to identify text conforming to a specific pattern or to process multiple instances of such text. Pattern matches can range from exact equality to a broad similarity, dictated by the metacharacters used. For example, '.' represents a very general pattern, [a-z] (matching any lowercase letter from 'a' to 'z') is more specific, and 'b' is a precise pattern, matching only 'b'. The metacharacter syntax is deliberately crafted to represent prescribed targets in a concise and flexible manner, enabling the automation of text processing for diverse input data using characters readily available on a standard ASCII keyboard.

A straightforward example of a regular expression in this syntax is finding a word spelled in two different ways within a text editor. The regular expression seriali[sz]e would successfully match both "serialise" and "serialize." Wildcard characters can achieve similar results but are generally more limited in their pattern-matching capabilities due to their simpler language base and fewer metacharacters.
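
As a quick illustration, here is how that pattern behaves in Python's standard re module (any regex-capable language would do much the same):

```python
import re

pattern = re.compile(r"seriali[sz]e")
assert pattern.search("We serialise the data.").group() == "serialise"
assert pattern.search("We serialize the data.").group() == "serialize"
```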

The typical context for wildcard characters is in globbing—matching filenames in a list—whereas regexes are more commonly employed in applications that perform general text string pattern matching. For example, the regex ^[ \t]+|[ \t]+$ is designed to match excess whitespace at the beginning or end of a line. A more complex regular expression that matches any numeral is [+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?.
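
Both patterns can be exercised directly. The sketch below uses Python's re module, with re.sub playing the "find and replace" role described earlier:

```python
import re

# Trim excess whitespace at the start or end of a line
line = " \t  some text\t  "
assert re.sub(r"^[ \t]+|[ \t]+$", "", line) == "some text"

# Match any numeral: optional sign, integer/decimal part, optional exponent
numeral = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")
for s in ("42", "-3.14", ".5", "+6.02e23"):
    assert numeral.fullmatch(s)
```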

The Kleene star operation, denoted by *, signifies "zero or more occurrences of the preceding element."

A regex processor translates a regular expression written in the aforementioned syntax into an internal representation. This representation can then be executed to match against a string representing the text being searched. One established method involves using Thompson's construction algorithm to build a nondeterministic finite automaton (NFA). This NFA is subsequently converted into a deterministic finite automaton (DFA) through a powerset construction process, and this DFA is then run on the target text string to identify substrings that match the original regular expression. In this scheme, the NFA N(s*) for a regular expression s* is built around the NFA N(s) of the simpler subexpression s, which has already been recursively translated.
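
To make the "set of active states" idea behind the powerset construction concrete, the following sketch simulates a small hand-built NFA for the pattern (a|b)*abb. The state numbering and the nfa_match helper are illustrative choices, not the output of any particular engine:

```python
# A hand-built NFA for (a|b)*abb, simulated by tracking the set of
# currently active states (the sets a powerset construction would
# turn into DFA states).
NFA = {
    0: {"a": {0, 1}, "b": {0}},  # loop on (a|b)*, or begin matching "abb"
    1: {"b": {2}},
    2: {"b": {3}},
    3: {},                       # accepting state
}
START, ACCEPT = {0}, {3}

def nfa_match(text: str) -> bool:
    states = set(START)
    for ch in text:
        states = {t for s in states for t in NFA[s].get(ch, ())}
    return bool(states & ACCEPT)

assert nfa_match("aababb")
assert not nfa_match("abab")
```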

Basic Concepts

A regular expression, often referred to as a pattern, defines a set of strings required for a specific purpose. While a finite set of strings can be explicitly listed, more concise methods are often preferred. For instance, the set comprising the strings "Handel", "Händel", and "Haendel" can be represented by the pattern H(ä|ae?)ndel. It is said that this pattern matches each of these three strings. However, multiple regular expressions can often describe the same set of strings; for example, (Hän|Han|Haen)del also specifies the same set of three strings in this particular case.
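
The claim that distinct expressions can denote the same set of strings is easy to check mechanically; this sketch uses Python's re module:

```python
import re

# Two different patterns describing the same three-string set
p1 = re.compile(r"H(ä|ae?)ndel")
p2 = re.compile(r"(Hän|Han|Haen)del")

names = ["Handel", "Händel", "Haendel"]
assert all(p1.fullmatch(n) for n in names)
assert all(p2.fullmatch(n) for n in names)
```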

Most formalisms provide the following operations for constructing regular expressions:

  • Boolean "or": A vertical bar (|) separates alternative patterns. For example, gray|grey can match either "gray" or "grey".

  • Grouping: Parentheses (()) are used to define the scope and precedence of operators, among other functions. For instance, gray|grey and gr(a|e)y are equivalent patterns that both represent the set of strings "gray" or "grey".

  • Quantification: A quantifier placed after an element (such as a token, character, or group) specifies how many times the preceding element is permitted to repeat. The most commonly used quantifiers are the question mark (?), the asterisk (*) (derived from the Kleene star), and the plus sign (+) (referring to the Kleene plus).

    • ?: The question mark signifies zero or one occurrence of the preceding element. For example, colou?r matches both "color" and "colour".

    • *: The asterisk signifies zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.

    • +: The plus sign signifies one or more occurrences of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".

    • {n}: The preceding item is matched exactly n times.

    • {min,}: The preceding item is matched min or more times.

    • {,max}: The preceding item is matched up to max times.

    • {min,max}: The preceding item is matched at least min times, but not more than max times.

  • Wildcard: The wildcard character . matches any single character. For example:

    • a.b matches any string that contains an "a", followed by any single character, and then a "b".
    • a.*b matches any string that contains an "a", followed by any sequence of characters (including none), and then a "b" at some later point.

These basic constructions can be combined to form expressions of arbitrary complexity, akin to how arithmetical expressions are built from numbers and operations like +, −, ×, and ÷.
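
The following snippet, using Python's re module, exercises each of the constructions above (re.fullmatch anchors the pattern to the whole string):

```python
import re

assert re.fullmatch(r"colou?r", "color") and re.fullmatch(r"colou?r", "colour")
assert re.fullmatch(r"ab*c", "ac") and re.fullmatch(r"ab*c", "abbbc")
assert re.fullmatch(r"ab+c", "abc") and not re.fullmatch(r"ab+c", "ac")
assert re.fullmatch(r"a{2,3}", "aa") and not re.fullmatch(r"a{2,3}", "aaaa")
assert re.fullmatch(r"a.b", "axb") and re.fullmatch(r"a.*b", "a and then b")
```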

The precise syntax for regular expressions can vary between different tools and contexts; further details are provided in the section § Syntax.

Formal Language Theory

In the realm of formal language theory, regular expressions are used to describe regular languages. They possess the same expressive power as regular grammars. However, the language of regular expressions itself is classified as a context-free language.

Formal Definition

Regular expressions are constructed from constants, which represent sets of strings, and operator symbols, which define operations on these sets. The following definition is a standard one, commonly found in textbooks on formal language theory. Given a finite alphabet, denoted by Σ, the following constants are defined as regular expressions:

  • The empty set, denoted by ∅, which represents an empty set of strings.
  • The empty string, denoted by ε, which represents the set containing only the "empty" string, a string with no characters.
  • A literal character, denoted by a where a is in Σ, which represents the set containing only the character a.

Given two regular expressions R and S, the following operations are defined to produce new regular expressions:

  • Concatenation: The expression (RS) denotes the set of strings formed by concatenating a string accepted by R with a string accepted by S (in that specific order). For example, if R represents the set {"ab", "c"} and S represents {"d", "ef"}, then (RS) represents {"abd", "abef", "cd", "cef"}.

  • Alternation: The expression (R|S) denotes the set union of the sets described by R and S. For example, if R describes {"ab", "c"} and S describes {"ab", "d", "ef"}, then (R|S) describes {"ab", "c", "d", "ef"}.

  • Kleene Star: The expression (R*) denotes the smallest superset of the set described by R that includes ε and is closed under string concatenation. This represents the set of all strings that can be formed by concatenating any finite number (including zero) of strings from the set described by R. For instance, if R denotes {"0", "1"}, then (R*) denotes the set of all finite binary strings, including the empty string. If R denotes {"ab", "c"}, then (R*) denotes {ε, "ab", "c", "abab", "abc", "cab", "cc", "ababab", "abcab", ...}.
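
Since these operations act on sets of strings, they can be modeled directly on finite sets. The helper names below (concat, union, star) are illustrative, and star must be truncated at some maximum length because R* is infinite in general:

```python
def concat(R, S):
    """All strings formed by a string from R followed by one from S."""
    return {r + s for r in R for s in S}

def union(R, S):
    return R | S

def star(R, max_len):
    """All concatenations of strings from R, truncated at max_len."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = {w for w in concat(frontier, R) if len(w) <= max_len} - result
        result |= frontier
    return result

R, S = {"ab", "c"}, {"d", "ef"}
assert concat(R, S) == {"abd", "abef", "cd", "cef"}
assert union(R, {"ab", "d", "ef"}) == {"ab", "c", "d", "ef"}
print(sorted(star({"ab", "c"}, 4)))  # '', 'ab', 'abab', 'abc', 'abcc', 'c', ...
```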

To simplify notation and avoid excessive parentheses, it is generally assumed that the Kleene star (*) has the highest precedence, followed by concatenation, and then alternation (|). When there is no ambiguity, parentheses may be omitted. For example, (ab)c can be written as abc, and a|(b(c*)) can be written as a|bc*. Many textbooks use the symbols ∪, +, or ∨ for alternation instead of the vertical bar.

Here are some illustrative examples:

  • a|b* denotes the set {ε, "a", "b", "bb", "bbb", ...}.
  • (a|b)* denotes the set of all strings composed solely of "a" and "b", including the empty string: {ε, "a", "b", "aa", "ab", "ba", "bb", "aaa", ...}.
  • ab*(c|ε) denotes the set of strings starting with "a", followed by zero or more "b"s, and optionally ending with a "c": {"a", "ac", "ab", "abc", "abb", "abbc", ...}.
  • (0|(1(01*0)*1))* denotes the set of binary numbers that are multiples of 3: { ε, "0", "00", "11", "000", "011", "110", "0000", "0011", "0110", "1001", "1100", "1111", "00000", ...}.
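
The last of these examples can be spot-checked programmatically; the sketch below assumes Python's re module and tests the pattern against the binary representations of small integers:

```python
import re

mult3 = re.compile(r"(0|(1(01*0)*1))*")
for n in range(200):
    s = format(n, "b")  # binary representation without leading zeros
    assert bool(mult3.fullmatch(s)) == (n % 3 == 0), n
```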

The derivative of a regular expression can be formally defined using the Brzozowski derivative.
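
For the curious, a Brzozowski-derivative matcher is small enough to sketch in full. The tuple-based AST encoding and function names below are illustrative choices, and no simplification of intermediate expressions is performed:

```python
# Expressions: ("empty",) = ∅, ("eps",) = ε, ("chr", c),
# ("cat", r, s), ("alt", r, s), ("star", r).

def nullable(r):
    """Does r match the empty string?"""
    tag = r[0]
    if tag == "eps" or tag == "star":
        return True
    if tag == "cat":
        return nullable(r[1]) and nullable(r[2])
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return False  # "empty" and "chr"

def deriv(r, c):
    """Brzozowski derivative of r with respect to character c."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return ("empty",)
    if tag == "chr":
        return ("eps",) if r[1] == c else ("empty",)
    if tag == "cat":
        head = ("cat", deriv(r[1], c), r[2])
        return ("alt", head, deriv(r[2], c)) if nullable(r[1]) else head
    if tag == "alt":
        return ("alt", deriv(r[1], c), deriv(r[2], c))
    if tag == "star":
        return ("cat", deriv(r[1], c), r)

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)

# ab*c: matches "ac", "abc", "abbc", ... but not "ab"
ab_star_c = ("cat", ("chr", "a"), ("cat", ("star", ("chr", "b")), ("chr", "c")))
assert matches(ab_star_c, "abbc") and not matches(ab_star_c, "ab")
```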

Expressive Power and Compactness

The formal definition of regular expressions is intentionally minimal and excludes operators like ? and +. These can be expressed using the existing operators: a+ is equivalent to aa*, and a? is equivalent to (a|ε). Sometimes, the complement operator is included, resulting in a "generalized regular expression"; here, R^c matches all strings over Σ* that do not match R. While the complement operator does not increase the expressive power of regular expressions, it can significantly enhance conciseness: eliminating a single complement operator can cause a double exponential blow-up in the length of the expression.

Regular expressions, in this formal sense, are capable of describing regular languages, which are precisely the class of languages recognized by deterministic finite automata (DFAs). However, there's a notable difference in compactness. Certain classes of regular languages can only be represented by DFAs whose state count grows exponentially with respect to the size of the shortest equivalent regular expression. A classic example is the language L_k, which comprises all strings over the alphabet {a, b} where the k-th character from the end is 'a'. A regular expression for L_4, for instance, is given by:

(a \mid b)^{*} \, a \, (a \mid b)(a \mid b)(a \mid b)

Generalizing this to L_k yields the expression:

(a \mid b)^{*} \, a \, \underbrace{(a \mid b)(a \mid b)\cdots(a \mid b)}_{k-1 \text{ times}}

Conversely, it's established that any DFA accepting the language L_k requires at least 2^k states. Fortunately, there exists a straightforward mapping from regular expressions to the more general nondeterministic finite automata (NFAs) that does not result in such a size explosion. For this reason, NFAs are frequently used as alternative representations for regular languages. NFAs are a simplified variation of the type-3 grammars within the Chomsky hierarchy.
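
A backtracking or NFA-based engine such as Python's re handles L_k without ever constructing the exponential DFA. The helper below (an illustrative name) builds the pattern for any k:

```python
import re

def kth_from_end_is_a(k):
    # Builds (a|b)* a (a|b)^(k-1) as a Python regex string
    return re.compile(r"(a|b)*a" + r"(a|b)" * (k - 1))

p4 = kth_from_end_is_a(4)
assert p4.fullmatch("abbb")      # 4th character from the end is 'a'
assert not p4.fullmatch("babb")  # 4th character from the end is 'b'
```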

On the other hand, many languages that are easily described by a DFA are not readily represented by a regular expression. For example, validating an ISBN involves a modulus calculation in base 11, which can be efficiently implemented with an 11-state DFA. However, converting this logic to a regular expression yields an unwieldy expression roughly 2.14 megabytes in size.
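
The 11-state idea can be sketched as follows: the DFA's state is simply the running value of the digit string modulo 11. This is a minimal illustration of mod-11 tracking, not a complete ISBN validator:

```python
def divisible_by_11(digits: str, base: int = 10) -> bool:
    # State = running value of the digit string modulo 11: exactly 11 states.
    state = 0
    for ch in digits:
        state = (state * base + int(ch)) % 11
    return state == 0

assert divisible_by_11("121")       # 121 = 11 * 11
assert not divisible_by_11("122")
```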

Given a regular expression, Thompson's construction algorithm can be used to derive an equivalent NFA. The inverse conversion, transforming an NFA into a regular expression, can be achieved using Kleene's algorithm.

It is important to note that many real-world "regular expression" engines implement features that go beyond the capabilities of formal regular expressions. Patterns that use such features are more accurately called "regexes," a term sometimes used to distinguish them from the purely formal definition. More on this distinction is discussed below.

Deciding Equivalence of Regular Expressions

As demonstrated by various examples, multiple ways exist to construct a regular expression that achieves the same matching outcome.

An algorithm can be devised to determine whether two given regular expressions describe identical languages. This is typically done by reducing each expression to its corresponding minimal deterministic finite state machine and then checking if these machines are isomorphic, meaning they are equivalent.

Algebraic laws applicable to regular expressions can be derived using Gischer's method. This method is best illustrated with an example: To ascertain whether (X + Y)* and (X*Y*)* denote the same regular language for any regular expressions X and Y, it suffices to check if the specific regular expressions (a | b)* and (a*b*)* describe the same language over the alphabet Σ = {a, b}. More generally, an equation E = F involving regular expression terms with variables holds if and only if its instantiation with different variables replaced by distinct symbol constants also holds.
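
Checking a specific instantiation like this lends itself to a bounded brute-force test: enumerate all strings over {a, b} up to some length and compare match results. This is a finite approximation for illustration, not the minimal-DFA decision procedure described above:

```python
import re
from itertools import product

def strings_up_to(alphabet, n):
    for k in range(n + 1):
        for tup in product(alphabet, repeat=k):
            yield "".join(tup)

def agree(p1, p2, alphabet="ab", n=8):
    r1, r2 = re.compile(p1), re.compile(p2)
    return all(bool(r1.fullmatch(w)) == bool(r2.fullmatch(w))
               for w in strings_up_to(alphabet, n))

print(agree(r"(a|b)*", r"(a*b*)*"))  # -> True
```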

Although every regular expression can be written solely in terms of the Kleene star and set unions over finite words, reasoning about regular expressions algebraically is surprisingly difficult. Despite their apparent simplicity, there is no systematic method for rewriting them into a standard normal form; this historical gap led to the formulation of the star height problem. In 1991, Dexter Kozen axiomatized regular expressions as a Kleene algebra, employing both equational and Horn clause axioms. It was proven as early as 1964 by Redko that no finite set of purely equational axioms can fully characterize the algebra of regular languages.

Syntax

A regex pattern is used to match a target string. The pattern itself is constructed from a sequence of "atoms." An atom represents a specific point within the regex pattern that the engine attempts to match against the target string. The simplest form of an atom is a literal character. However, to group parts of the pattern to match as a single atom, parentheses () are used as metacharacters.

Metacharacters serve various functions: defining atoms; specifying quantifiers that dictate how many times an atom should repeat (and whether this repetition is greedy or reluctant); providing a logical OR operator for a set of alternatives; a logical NOT operator for negating an atom's presence; and enabling backreferences to previously matched patterns. A match occurs not when all atoms in the string are matched, but when all the pattern atoms within the regex have successfully matched. The fundamental concept is to represent a large number of potential strings using a concise pattern of characters, rather than compiling an exhaustive list of all literal possibilities.

Depending on the specific regex processor, there are approximately fourteen metacharacters. These are characters that may either retain their literal meaning or acquire a special meaning based on their context or whether they are "escaped" (i.e., preceded by a backslash \). Modern and POSIX extended regexes tend to use metacharacters in their special sense more often than in their literal interpretation. To mitigate the resulting "backslash-osis," or leaning toothpick syndrome, they employ an escape mechanism to revert a metacharacter to its literal form. Early implementations, however, treated the four bracketing metacharacters () and {} primarily as literals, requiring an escape sequence to give them their special metacharacter meaning. Common standards typically support both conventions. The usual metacharacters include {}[]()^$.|*+? and \. Characters that acquire a special meaning only when escaped include d, s, w, D, S, W, and N (as in \d for a digit, \s for a whitespace character, and \w for a word character, with the uppercase forms negating them).
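
Escaping is easy to demonstrate in any modern engine; the sketch below uses Python's re module, whose re.escape helper neutralizes the metacharacters in a string:

```python
import re

assert re.fullmatch(r"b.", "b%")        # '.' is a metacharacter: any character
assert not re.fullmatch(r"b\.", "b%")   # '\.' matches only a literal dot
assert re.fullmatch(r"b\.", "b.")

# re.escape() backslash-escapes metacharacters in a string
# (exact output can vary slightly between Python versions)
print(re.escape("1+1=2?"))  # -> 1\+1=2\?
```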

Delimiters

When incorporating a regex into a programming language, it may be represented as a standard string literal, usually enclosed in quotes (e.g., "re" in C, Java, and Python). Alternatively, they are often delimited by slashes, as in /re/. This convention originates from the ed text editor, where / was the command for searching. An expression like /re/ could specify a range of lines matching the pattern, which could then be combined with other commands. The most famous example is g/re/p (used in grep), meaning "global regex print," a command found in most Unix-based operating systems, including Linux distributions. A similar convention is employed in sed, where search and replace operations are specified as s/re/replacement/. Patterns can be chained with a comma to define line ranges, such as /re1/,/re2/. This notation is particularly recognized due to its use in Perl, where it forms part of the language's syntax, distinct from regular string literals. In certain contexts, such as with sed and Perl, alternative delimiters can be used to avoid conflicts with characters within the pattern itself, thus reducing the need for escaping. For instance, in sed, the command s,/,X, would replace a forward slash (/) with an X, using commas as delimiters.

IEEE POSIX Standard

The IEEE POSIX standard defines three levels of regex compliance: BRE (Basic Regular Expressions),[34] ERE (Extended Regular Expressions), and SRE (Simple Regular Expressions). SRE is deprecated in favor of BRE, as both offer backward compatibility. The subsection on character classes below applies to both BRE and ERE.

BRE and ERE function in tandem. ERE introduces the metacharacters ?, +, and |, and it eliminates the requirement to escape the metacharacters () and {} which are mandatory in BRE. Furthermore, as long as the POSIX standard syntax for regexes is followed, additional syntax may be incorporated to serve specific, yet POSIX-compliant, applications. Although POSIX.2 leaves certain implementation specifics undefined, BRE and ERE provide a "standard" that has since been adopted as the default syntax in many tools, where the choice between BRE or ERE modes is typically an available option. For example, GNU grep offers the following options: grep -E for ERE, grep -G for BRE (the default), and grep -P for Perl regexes.

Perl regexes have become a de facto standard due to their rich and powerful set of atomic expressions. Perl does not have "basic" or "extended" levels. Similar to POSIX EREs, () and {} are treated as metacharacters unless escaped; other metacharacters are identified as literal or symbolic based solely on context. Additional features include lazy matching, backreferences, named capture groups, and recursive patterns.
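
Several of these Perl-derived features are available in other engines as well. The sketch below demonstrates lazy matching and named capture groups using Python's re module (recursive patterns, by contrast, require a third-party library there):

```python
import re

# Lazy matching: *? stops at the first possible end point, not the last
html = "<b>bold</b> and <i>italic</i>"
assert re.match(r"<.*?>", html).group() == "<b>"
assert re.match(r"<.*>", html).group() == html  # greedy: runs to the last '>'

# A named capture group plus a backreference to it
m = re.fullmatch(r"(?P<word>\w+) (?P=word)", "hello hello")
assert m and m.group("word") == "hello"
```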

POSIX Basic and Extended

In the POSIX standard, Basic Regular Syntax (BRE) requires that the metacharacters () and {} be escaped with a backslash, appearing as \( \) and \{ \} respectively. Extended Regular Syntax (ERE) does not impose this requirement.

| Metacharacter | Description