A Formalization of the Unpredictable: The Random Variable
A random variable (which you might also encounter under the slightly more descriptive, though less common, monikers of random quantity, aleatory variable, or stochastic variable) represents the mathematical attempt to formalize a quantity or an object whose value is inextricably tied to random events. It's an interesting semantic choice, considering that the term 'random variable' itself, in its precise mathematical definition, refers neither to inherent randomness nor to intrinsic variability. Instead, it serves as a sophisticated mathematical function designed to bridge the gap between abstract possibilities and quantifiable outcomes.
To be precise, this function operates with a specific structure:
- Its domain encompasses the entire set of potential outcomes within a defined sample space. For instance, if one were to consider the simple act of flipping a coin, the sample space would be the rather concise set {H, T}, representing the possible upper sides of the flipped coin: heads (H) or tails (T). This is the raw material, the unfiltered reality of the experiment.
- Its range, on the other hand, is a measurable space. Continuing the coin toss example, if we decided to map heads (H) to -1 and tails (T) to 1, the range of our random variable would then be the set {−1, 1}. This transformation is crucial; it converts qualitative or abstract outcomes into a numerical format that can be mathematically manipulated and analyzed. Typically, and rather conveniently for most applications, the range of a random variable is a subset of the real numbers.
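As a minimal sketch of this mapping in Python (the outcome labels and payoff values are exactly those of the coin example above; nothing else is assumed):

```python
# Sketch of the coin-toss random variable: the sample space is {"H", "T"}
# and X maps H -> -1, T -> 1, as in the example above.
import random

sample_space = ["H", "T"]

def X(outcome: str) -> int:
    """Random variable: a plain, deterministic function from outcomes to reals."""
    return -1 if outcome == "H" else 1

# The randomness lives in the draw of the outcome, not in X itself.
omega = random.choice(sample_space)
print(omega, "->", X(omega))
```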
This conceptual mapping, often visualized as a graph showing the transition from abstract outcomes to concrete real values, also serves as the foundational stepping stone for defining probability mass functions when dealing with discrete scenarios.
Informally, the notion of randomness usually conjures images of pure chance, like the unpredictable tumble of a die. It can also represent irreducible uncertainty, such as the inherent measurement error in any observation. However, the exact interpretation of probability is a philosophically thorny issue, one that even in specific contexts remains stubbornly complex and far from straightforward. The beauty, or perhaps the cold efficiency, of the purely mathematical analysis of random variables is that it manages to bypass these interpretational quagmires entirely, establishing its edifice upon a rigorous axiomatic framework that demands no philosophical commitments.
Within the formal, somewhat intimidating, language of measure theory, a random variable is precisely defined as a measurable function. This function acts as a bridge, mapping from a probability measure space (which we so quaintly call the sample space) to another measurable space. This rigorous definition then allows for the consideration of the pushforward measure, which is given the rather grand title of the distribution of the random variable. This distribution, in essence, becomes a probability measure residing on the set of all possible numerical values that the random variable can assume. It's worth noting, with a touch of cosmic irony, that two random variables can possess identical distributions yet diverge significantly in other crucial aspects; for example, they might be entirely independent of one another.
Common practice often leads us to consider specific, simplified instances: discrete random variables and absolutely continuous random variables. These categories correspond to whether a random variable takes values from a countable subset or from an unbroken interval of real numbers, respectively. Yet, the universe of possibilities extends beyond these. In the realm of stochastic processes, for instance, it becomes entirely natural and necessary to consider more intricate constructs such as random sequences or even random functions. Occasionally, the term random variable is implicitly understood to refer exclusively to real-valued entities, with more generalized random quantities being designated as random elements to avoid confusion.
Credit where it's due: George Mackey observed that Pafnuty Chebyshev was arguably the first to approach the concept of random variables with systematic thought, truly paving the way for their modern understanding.
Definition
A random variable, typically denoted by a capital Roman letter such as X, Y, Z, or T, is formally a measurable function $X \colon \Omega \to E$ that maps from a sample space $\Omega$ (which represents the set of all possible outcomes) to a measurable space $E$. For the concept of measurability of X to hold any actual meaning, the sample space must itself be part of a probability triple $(\Omega, \mathcal{F}, \operatorname{P})$, a foundational structure in measure theory.
The probability that this random variable X assumes a value within a particular measurable set $S \subseteq E$ is expressed with elegant mathematical conciseness as:

$$\operatorname{P}(X \in S) = \operatorname{P}(\{\omega \in \Omega \mid X(\omega) \in S\})$$

This notation effectively states that the probability of X landing in set S is equivalent to the probability of the original outcomes (from the sample space $\Omega$) that, when transformed by X, fall into S.
Standard Case
In the vast majority of practical applications and theoretical discussions, X is considered real-valued; that is, $E = \mathbb{R}$. In more generalized contexts, where the random variable might not conform to this real-valued structure, the term random element is often employed to distinguish it (see the "Extensions" section below for a deeper dive into these more exotic entities).
When the image (or more simply, the range) of X is either finite or countably infinite, the random variable is designated as a discrete random variable. Its corresponding distribution is a discrete probability distribution, which can be fully characterized by a probability mass function (PMF) that assigns a specific probability to each individual value within the image of X.
Conversely, if the image of X is uncountably infinite (typically forming an unbroken interval on the real number line), then X is termed a continuous random variable. In the particular scenario where it is absolutely continuous, its distribution can be elegantly described by a probability density function (PDF). This PDF assigns probabilities not to individual points (which, for an absolutely continuous random variable, must necessarily have zero probability), but rather to intervals. It's an important nuance that not all continuous random variables are absolutely continuous; some can be rather more... singular.
Regardless of its type, any random variable can be comprehensively described by its cumulative distribution function (CDF), which quantifies the probability that the random variable will assume a value less than or equal to a given threshold. It's the unifying framework, if you will, for all these disparate manifestations of chance.
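To make the contrast concrete, here is a short sketch using SciPy's binomial (discrete) and normal (absolutely continuous) families, chosen purely for illustration:

```python
# PMF, PDF, and CDF side by side on two standard distributions.
from scipy import stats

discrete = stats.binom(n=10, p=0.5)      # PMF assigns mass to integers
continuous = stats.norm(loc=0, scale=1)  # PDF assigns density to intervals

print(discrete.pmf(3))        # P(X = 3): positive for a discrete variable
print(continuous.pdf(0.0))    # density at 0; P(X = 0) itself is zero
print(discrete.cdf(3))        # P(X <= 3): the CDF works in both cases
print(continuous.cdf(0.0))    # P(X <= 0) = 0.5 by symmetry
```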
Extensions
The term "random variable" within the field of statistics has, for a long time, been rather narrowly confined to the real-valued case, where . This convention is largely due to the inherent structure of the real numbers themselves, which conveniently allows for the definition of crucial quantities like the expected value and variance of a random variable, its cumulative distribution function, and the various moments of its distribution.
However, the foundational definition provided earlier is robust enough to accommodate any measurable space for the values. This expansive view allows us to consider random elements that exist within other sets , moving beyond mere numbers to encompass entities such as random Boolean values, categorical values, complex numbers (though these are often treated as random vectors of two real numbers), vectors, matrices, sequences, trees, sets, shapes, manifolds, and even functions. In these more generalized contexts, one might specifically refer to a random variable of type , or more simply, an -valued random variable.
This broader concept of a random element proves particularly invaluable in diverse disciplines like graph theory, machine learning, natural language processing, and various other fields within discrete mathematics and computer science. Here, the focus often shifts to modeling the unpredictable variations of non-numerical data structures. Even in these cases, it's frequently convenient to represent each element of by using one or more real numbers. When this mapping occurs, a random element can optionally be represented as a vector of real-valued random variables, all of which are defined on the same underlying probability space . This shared foundation is critical, as it allows for the analysis of how these different random variables might covary with each other.
To illustrate, consider these specific examples:
- A random word might be conceptualized as a random integer that functions as an index into a predefined vocabulary of all possible words. Alternatively, it could be represented as a random indicator vector, whose length matches the size of the vocabulary. In this vector, the only values with a positive probability are those with a single '1' at a specific position, indicating the chosen word, such as $(1, 0, 0, 0, \ldots)$, $(0, 1, 0, 0, \ldots)$, or $(0, 0, 1, 0, \ldots)$. The position of the '1' is the key (see the sketch after this list).
- A random sentence of a predetermined length N could then be represented as a vector of N random words, each word being a random element as described above.
- A random graph constructed on N given vertices might be represented as an $N \times N$ matrix of random variables, where the values within this matrix directly specify the adjacency matrix of the random graph.
- A random function F could be represented as a collection of random variables $F(x)$, which provide the function's values at various points $x$ within its domain. These are essentially ordinary real-valued random variables, provided that the function itself is real-valued. For instance, a stochastic process is inherently a random function of time, a random vector is a random function over some defined index set (e.g., $1, 2, \ldots, n$), and a random field generalizes this further as a random function defined on any set (which typically includes time, space, or a discrete collection).
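As a minimal sketch of the first representation above (a random word drawn from a small, hypothetical vocabulary with invented probabilities), shown both as a random index and as the equivalent indicator vector:

```python
# Random word as an index into a hypothetical vocabulary, then re-expressed
# as a one-hot indicator vector with a single '1' marking the chosen word.
import numpy as np

rng = np.random.default_rng(0)
vocabulary = ["cat", "dog", "fish"]   # hypothetical vocabulary
probs = [0.5, 0.3, 0.2]               # hypothetical word probabilities

index = rng.choice(len(vocabulary), p=probs)   # random word as an integer
one_hot = np.zeros(len(vocabulary), dtype=int)
one_hot[index] = 1                             # the indicator-vector form

print(vocabulary[index], one_hot)
```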
Distribution Functions
Given a real-valued random variable $X$ defined on a probability space $(\Omega, \mathcal{F}, \operatorname{P})$, one inevitably finds oneself asking questions like, "What is the likelihood that the value of $X$ will precisely equal 2?" This seemingly simple query translates directly into the probability of the event $\{\omega : X(\omega) = 2\}$, which is often abbreviated and more commonly written as $\operatorname{P}(X = 2)$ or, even more compactly, $p_X(2)$.
The act of systematically recording all these probabilities for every conceivable output of a random variable X culminates in what we call the probability distribution of X. This probability distribution possesses a rather convenient characteristic: it "forgets" the specifics of the particular probability space that was initially used to define X. Instead, it retains only the essential information: the probabilities associated with the various output values of X. When X is real-valued, such a probability distribution can invariably be captured by its cumulative distribution function (CDF), which is defined as:

$$F_X(x) = \operatorname{P}(X \le x)$$

And, in many cases, it can also be characterized by a probability density function, denoted as $f_X$. From a measure-theoretic perspective, we employ the random variable X to "push-forward" the original measure $\operatorname{P}$ on $\Omega$ to a new measure, $p_X$, residing on $\mathbb{R}$. This transformed measure, $p_X$, is precisely what we refer to as the "(probability) distribution of X" or the "law of X". The density $f_X = dp_X/d\mu$ is, in fact, the Radon–Nikodym derivative of $p_X$ with respect to some carefully chosen reference measure $\mu$ on $\mathbb{R}$. This reference measure is typically the Lebesgue measure for continuous random variables or the counting measure for discrete random variables.
The underlying probability space $\Omega$ often feels like a phantom limb in practical applications: it's a technical construct. Its primary purpose is to guarantee the very existence of random variables, to facilitate their construction, and crucially, to define concepts such as correlation and dependence or independence when dealing with a joint distribution of two or more random variables operating within the same probability space. In day-to-day practice, one frequently dispenses with the explicit consideration of $\Omega$ altogether, opting instead to simply place a measure directly onto $\mathbb{R}$ that assigns a total measure of 1 to the entire real line. In essence, this means one often works directly with probability distributions rather than with the random variables themselves. For a more comprehensive exploration of this simplification, consult the article on quantile functions.
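To make this "forgetting" concrete, here is a minimal sketch (with an arbitrarily chosen probability space) reaching the same distribution both through an explicit $\Omega$ and by placing the measure directly on the real line:

```python
# The same Bernoulli(1/2) law reached two ways.
import numpy as np

rng = np.random.default_rng(1)

# Route 1: explicit Omega = {1,...,6} (a fair die) pushed forward through
# X(omega) = 1 if omega is even, else 0.
omega = rng.integers(1, 7, size=100_000)
x_via_omega = (omega % 2 == 0).astype(int)

# Route 2: ignore Omega and sample from the distribution of X directly.
x_direct = rng.binomial(1, 0.5, size=100_000)

print(x_via_omega.mean(), x_direct.mean())  # both approximate 0.5
```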
Examples
Discrete Random Variable
Let's consider a simple, relatable experiment: a person is chosen entirely at random from a population. An exemplary random variable in this scenario might be the person's height. Mathematically, this random variable is interpreted as a function that maps the chosen individual directly to their measured height. Intricately linked with this random variable is a probability distribution that empowers us to compute the likelihood that the height falls within any conceivable subset of possible values. For instance, we could determine the probability that the person's height lies between 180 and 190 cm, or perhaps the probability that their height is either less than 150 cm or, conversely, more than 200 cm.
Now, consider a different random variable for the same randomly chosen person: their number of children. This is a quintessential discrete random variable, as it can only take on non-negative integer values. This type of random variable allows for the calculation of probabilities for specific, individual integer values (this is the domain of the probability mass function, or PMF) or for broader sets of values, which can even include infinite sets. For example, an event of particular interest might be "an even number of children." For both finite and infinite collections of such events, their probabilities can be ascertained by simply summing the PMFs of the constituent elements. Thus, the probability of encountering an even number of children would be the infinite sum:

$$\operatorname{P}(X = 0) + \operatorname{P}(X = 2) + \operatorname{P}(X = 4) + \cdots$$
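Numerically, a truncated version of this sum is easy to evaluate; the Poisson law below is a purely hypothetical stand-in for the child-count PMF:

```python
# P(even number of children) by summing the PMF over even values,
# truncating the infinite sum where the tail is negligible.
from scipy import stats

pmf = stats.poisson(mu=1.8).pmf   # hypothetical child-count distribution

p_even = sum(pmf(2 * k) for k in range(200))  # P(X=0) + P(X=2) + P(X=4) + ...
print(p_even)
```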
In examples such as these, the underlying sample space is frequently left implicit or "suppressed," primarily because its explicit mathematical description can be notoriously complex. Consequently, the possible values that the random variables can assume are often treated as if they constitute the sample space. However, when two or more random variables are being measured on the same fundamental sample space of outcomes—for instance, both the height and the number of children being recorded for the very same randomly selected individuals—it becomes significantly easier to trace and understand their interrelationships if one acknowledges that both height and number of children originate from the same random person. This acknowledgment is vital for posing and answering questions regarding whether such random variables are correlated or entirely independent.
More formally, if $a_n$ and $b_n$ represent countable sets of real numbers, with the condition that $b_n > 0$ for all $n$ and $\sum_n b_n = 1$, then the function $F(x) = \sum_n b_n \delta_{a_n}(x)$ describes a discrete distribution function. Here, $\delta_t(x)$ is a step function, taking the value 0 for $x < t$ and 1 for $x \ge t$. If, for example, we were to enumerate all rational numbers as $\{a_n\}$, we would arrive at a discrete function that, surprisingly, is not necessarily a simple step function (i.e., piecewise constant).
Coin Toss
The potential outcomes for a single, unembellished coin toss can be comprehensively described by the sample space $\Omega = \{\text{heads}, \text{tails}\}$. We can then introduce a real-valued random variable Y to model, for instance, a $1 payoff for a successful bet on heads. This is formalized as follows:

$$Y(\omega) = \begin{cases} 1, & \text{if } \omega = \text{heads}, \\ 0, & \text{if } \omega = \text{tails}. \end{cases}$$

Assuming the coin in question is a fair coin, the random variable Y will possess a probability mass function (PMF), denoted $f_Y$, which is given by:

$$f_Y(y) = \begin{cases} \tfrac{1}{2}, & \text{if } y = 1, \\ \tfrac{1}{2}, & \text{if } y = 0. \end{cases}$$
Each outcome has an equal, non-zero probability, as one would expect from a truly fair coin.
Dice Roll
If our sample space consists of the myriad possible numbers that can be rolled on two standard dice, and the random variable of particular interest is S, representing the sum of the numbers appearing on the two dice, then S is undeniably a discrete random variable. Its distribution is eloquently characterized by the probability mass function (PMF), which, if visualized, would appear as a series of columns with varying heights, much like the histogram you might see associated with such an event.
A random variable can, in fact, be skillfully employed to describe the entire process of rolling dice, including all its potential outcomes. The most straightforward representation for the case involving two dice is to define the sample space as the set of all ordered pairs $(n_1, n_2)$, where $n_1$ and $n_2$ are individual numbers drawn from {1, 2, 3, 4, 5, 6} (each representing the outcome of one die). The total number rolled (that is, the sum of the numbers in each pair) then becomes our random variable X, given by the function that maps each pair to its sum:

$$X((n_1, n_2)) = n_1 + n_2$$

And, assuming both dice are fair, it possesses a probability mass function defined by:

$$f_X(S) = \frac{\min(S - 1,\ 13 - S)}{36}, \qquad \text{for } S \in \{2, 3, \ldots, 12\}$$
This formula neatly captures the increasing and then decreasing probabilities as the sum S moves from the extremes (2 or 12) towards the most likely outcome (7).
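A brute-force check of this PMF, enumerating the 36 equally likely ordered pairs:

```python
# PMF of the dice sum by enumeration, compared against min(S-1, 13-S)/36.
from collections import Counter
from fractions import Fraction

counts = Counter(n1 + n2 for n1 in range(1, 7) for n2 in range(1, 7))
pmf = {s: Fraction(c, 36) for s, c in sorted(counts.items())}

for s, p in pmf.items():
    assert p == Fraction(min(s - 1, 13 - s), 36)   # matches the formula above
    print(s, p)
```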
Continuous Random Variable
Formally speaking, a continuous random variable is characterized by a cumulative distribution function (CDF) that is continuous across its entire domain. This implies an absence of any discernible "gaps" in the CDF, which would otherwise correspond to specific numerical values having a finite probability of occurring. Instead, continuous random variables will almost never assume an exact, precisely prescribed value c. Formally, this is expressed as $\operatorname{P}(X = c) = 0$. However, there is always a positive probability that its value will fall within particular intervals, even if these intervals can be made arbitrarily small. Most continuous random variables conveniently allow for probability density functions (PDFs), which serve to characterize both their CDF and their underlying probability measures. Such distributions are also frequently referred to as absolutely continuous. It's a subtle but important distinction that not all continuous distributions are absolutely continuous; some are classified as singular, or even as intricate mixtures of an absolutely continuous part and a singular part.
Consider, as an example, a continuous random variable derived from a spinner that can land on any horizontal direction. The values taken by this random variable are, conceptually, directions. While we could label these as North, West, East, South, Southeast, and so on, it is almost always more convenient to map this conceptual sample space onto a random variable that takes values as real numbers. This can be achieved, for instance, by mapping each direction to a bearing measured in degrees clockwise from North. The random variable would then assume values that are real numbers within the interval [0, 360), with the understanding that all parts of this range are "equally likely." In this scenario, X simply represents the angle spun. Any single, specific real number within this range has a probability of zero of being selected, yet any range of values, no matter how small, can be assigned a positive probability. For example, the probability of the spinner landing on an angle within [0, 180] degrees is exactly 1/2. Rather than discussing a probability mass function, we declare that the probability density of X is 1/360. The probability of any given subset within the interval [0, 360) can then be calculated by simply multiplying the Lebesgue measure of that set by 1/360. More generally, for any given continuous random variable, the probability of a set is found by integrating the probability density function over that specific set.
To be more formal about it, given any interval $I = [a, b] = \{x \in \mathbb{R} : a \le x \le b\}$, a random variable $X_I \sim \mathrm{U}(I) = \mathrm{U}[a, b]$ is known as a "continuous uniform random variable" (CURV). This designation implies that the probability of it taking a value within any subinterval depends solely on the length of that subinterval. Consequently, if $a \le c \le d \le b$, the probability of $X_I$ falling within the subinterval $[c, d]$ is directly proportional to the length of that subinterval. That is:

$$\operatorname{P}(X_I \in [c, d]) = \frac{d - c}{b - a}\,\operatorname{P}(X_I \in I) = \frac{d - c}{b - a}$$

The final equality here is a direct consequence of the unitarity axiom of probability, which states that the total probability of all possible outcomes must sum to 1. The probability density function (PDF) of a CURV is given by the indicator function of its interval of support, appropriately normalized by the length of that interval:

$$f_{X_I}(x) = \frac{\mathbf{1}_{[a, b]}(x)}{b - a}$$

Of particular theoretical and practical interest is the uniform distribution defined over the unit interval $[0, 1]$. Samples conforming to any desired probability distribution can be generated by calculating the quantile function of that distribution applied to a randomly-generated number that is itself distributed uniformly over the unit interval. This clever technique leverages the powerful properties of cumulative distribution functions, which provide a unifying framework for understanding and manipulating all types of random variables.
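A minimal sketch of this inverse-transform technique, assuming an exponential target distribution, whose quantile function has the closed form $-\log(1 - u)/\lambda$ (the rate here is an arbitrary choice):

```python
# Inverse transform sampling: push Uniform[0, 1) draws through the
# quantile function (inverse CDF) of the target distribution.
import numpy as np

rng = np.random.default_rng(2)
u = rng.random(100_000)              # uniform on the unit interval

rate = 1.5                           # hypothetical rate parameter
samples = -np.log(1.0 - u) / rate    # quantile function of Exp(rate)

print(samples.mean())                # should approach 1/rate ~ 0.667
```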
Mixed Type
A mixed random variable occupies a fascinating, if sometimes inconvenient, middle ground. Its cumulative distribution function (CDF) is neither exclusively discrete nor everywhere-continuous. Such a variable can be conceptualized as a blend, a deliberate mixture of a discrete random variable and a continuous random variable. In such cases, the resulting CDF will manifest as a weighted average of the CDFs of its component variables.
Consider an example of a random variable of mixed type: an experiment where a coin is flipped. The spinner (from our previous example) is only brought into play if the coin toss results in heads. If the coin lands on tails, we definitively set $X = -1$. Conversely, if it's heads, $X$ takes on the value indicated by the spinner, as detailed in the preceding example. In this scenario, there is a clear, finite probability of 1/2 that this random variable will assume the precise value of $-1$. Any other range of values will, consequently, have half the probabilities observed in the purely continuous spinner example.
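A simulation sketch of this coin-gated spinner, using the numbers of the construction above:

```python
# Mixed-type variable: a fair coin gates a [0, 360) spinner; tails pins
# X at -1, an atom carrying probability mass 1/2.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

heads = rng.random(n) < 0.5
x = np.where(heads, rng.uniform(0.0, 360.0, size=n), -1.0)

print((x == -1.0).mean())                  # ~0.5: positive mass at one point
print(((0.0 <= x) & (x <= 180.0)).mean())  # ~0.25: half of the spinner's 1/2
```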
Most generally, it's a fundamental tenet of measure theory that every probability distribution on the real line can be decomposed into three distinct parts: a discrete part, a singular part, and an absolutely continuous part. This profound insight is formalized by Lebesgue's decomposition theorem. It's important to note that while the discrete part is concentrated on a countable set, this set is not necessarily sparse; it can, in fact, be dense, much like the set of all rational numbers. The universe, it seems, enjoys its complexities.
Measure-Theoretic Definition
The most rigorous, axiomatic definition of a random variable delves into the rather abstract, yet powerful, realm of measure theory. Continuous random variables, for instance, are defined not just in terms of sets of numbers, but in conjunction with functions that map these sets to probabilities. Due to inherent mathematical quandaries (such as the infamous Banach–Tarski paradox), which inevitably arise if such sets are left insufficiently constrained, it becomes absolutely necessary to introduce a construct known as a sigma-algebra. This sigma-algebra serves to precisely constrain the permissible sets over which probabilities can be meaningfully defined. Typically, a specific and widely adopted sigma-algebra is employed: the Borel σ-algebra. This particular sigma-algebra allows for probabilities to be defined over any sets that can be constructed either directly from continuous intervals of numbers or through a finite or countably infinite number of unions and/or intersections of such intervals. It's the mathematical equivalent of setting down the rules for what constitutes a valid "event."
The measure-theoretic definition is as follows, for those who appreciate precision:
Let $(\Omega, \mathcal{F}, \operatorname{P})$ be a probability space and $(E, \mathcal{E})$ be a measurable space. Then an $(E, \mathcal{E})$-valued random variable is a measurable function $X \colon \Omega \to E$. This "measurability" specifically means that for every subset $B \in \mathcal{E}$, its preimage under $X$ is $\mathcal{F}$-measurable. In other words, $X^{-1}(B) \in \mathcal{F}$, where $X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\}$. This definition is crucial because it ensures that we can indeed assign a probability to any subset $B$ in the target space, by simply examining its preimage, which, by fundamental assumption, is measurable within the original probability space.
To translate this slightly more intuitively: a member of $\Omega$ represents a single, possible outcome of an experiment. A member of $\mathcal{F}$ is a measurable subset of these possible outcomes, and the function $\operatorname{P}$ is what assigns the probability to each such measurable subset. $E$ itself represents the comprehensive set of values that the random variable can potentially take (for instance, the entire set of real numbers). A member of $\mathcal{E}$ is a "well-behaved" (i.e., measurable) subset of $E$ – precisely those subsets for which a probability can be unambiguously determined. The random variable $X$ then acts as a function that maps any given outcome $\omega$ to a specific quantity $X(\omega)$ in $E$. The critical condition is that the collection of outcomes in $\Omega$ that lead to any useful subset of quantities for the random variable (i.e., any $B \in \mathcal{E}$) must have a well-defined probability within the original probability space.
When $E$ is a topological space, the most commonly adopted choice for the σ-algebra $\mathcal{E}$ is the Borel σ-algebra $\mathcal{B}(E)$. This is the σ-algebra that is generated by the collection of all open sets within $E$. In such a case, an $(E, \mathcal{B}(E))$-valued random variable is more simply referred to as an $E$-valued random variable. Furthermore, when the space $E$ happens to be the real line $\mathbb{R}$, such a real-valued random variable is simply called a random variable without further ado. It's important to note that we are not endowing $\mathbb{R}$ with its usual Lebesgue σ-algebra, which is a completion of the Borel σ-algebra. This specific choice allows for a broader class of measurable functions and simplifies the process of verifying that a function is indeed measurable, as one only needs to confirm that the preimages of open sets are measurable.
Real-valued Random Variables
In this ubiquitous case, the observation space (the set of values our random variable can possibly take) is the set of real numbers $\mathbb{R}$. As a reminder, $(\Omega, \mathcal{F}, \operatorname{P})$ remains our foundational probability space. For a real-valued observation space, the function $X \colon \Omega \to \mathbb{R}$ is classified as a real-valued random variable if, and only if, the following condition holds:

$$\{\omega : X(\omega) \le r\} \in \mathcal{F} \qquad \text{for all } r \in \mathbb{R}$$

This condition dictates that for any real number r, the set of all outcomes in the sample space for which $X(\omega)$ is less than or equal to r must be a measurable set within $\mathcal{F}$. This definition is, in fact, a specific instance of the more general measure-theoretic definition provided above. This is because the collection of all intervals of the form $(-\infty, r]$ for $r \in \mathbb{R}$ serves to generate the Borel σ-algebra on the set of real numbers. Therefore, it is sufficient to verify measurability only on any such generating set. Here, we can readily demonstrate measurability on this generating set by leveraging the fundamental identity that $\{\omega : X(\omega) \le r\} = X^{-1}((-\infty, r])$.
Moments
The probability distribution of a random variable is quite frequently, and rather efficiently, characterized by a select few parameters. These parameters not only offer a compact summary but often carry a significant practical interpretation. For instance, it's often deemed sufficient to grasp what its "average value" is. This intuitive concept is rigorously captured by the mathematical notion of the expected value of a random variable, conventionally denoted as $\operatorname{E}[X]$. This is also known as the first moment of the distribution. It is a critical point to remember that, in general, $\operatorname{E}[f(X)]$ is not equivalent to $f(\operatorname{E}[X])$. Once this "average value" is established, one might naturally inquire about the typical deviation of X's values from this average. This question finds its answer in the concepts of the variance and standard deviation of a random variable. The expected value $\operatorname{E}[X]$ can be intuitively understood as an average value that would be obtained from an infinitely large population, where each member of this hypothetical population represents a particular evaluation or realization of X.
Mathematically, this pursuit falls under the umbrella of the (generalized) "problem of moments." For a given class of random variables X, the challenge lies in identifying a collection of functions $\{g_i\}$ such that their expectation values $\operatorname{E}[g_i(X)]$ provide a complete and unambiguous characterization of the distribution of the random variable X.
It's important to note that moments can only be defined for real-valued functions of random variables (or complex-valued, etc.). If the random variable itself is real-valued, then one can directly compute the moments of the variable, which is equivalent to taking the moments of the identity function $f(X) = X$ of the random variable. However, even for random variables that are not real-valued, moments can still be extracted by considering their real-valued functions. For instance, imagine a categorical random variable X that can assume the nominal values "red," "blue," or "green." We can construct a real-valued function such as $Y = [X = \text{green}]$. This utilizes the Iverson bracket notation, yielding a value of 1 if X is "green" and 0 otherwise. Subsequently, the expected value and other moments of this specific function can be determined, providing quantifiable insights even from non-numerical data.
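A small sketch of this device, with the category probabilities invented for illustration:

```python
# The Iverson bracket turns a categorical variable into a real-valued one
# whose expectation is just P(X = "green").
import numpy as np

rng = np.random.default_rng(4)
colors = ["red", "blue", "green"]
probs = [0.5, 0.3, 0.2]               # hypothetical category probabilities

x = rng.choice(colors, size=100_000, p=probs)
y = (x == "green").astype(float)      # Y = [X = "green"]

print(y.mean())                       # E[Y] ~ 0.2 = P(X = "green")
print(y.var())                        # Var[Y] ~ 0.2 * 0.8 = 0.16
```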
Functions of Random Variables
A fascinating aspect of random variables is their transformability. A brand new random variable, let's call it Y, can be meticulously defined by applying a real Borel measurable function $g \colon \mathbb{R} \to \mathbb{R}$ to the outcomes of an existing real-valued random variable X. In essence, we create $Y = g(X)$. The cumulative distribution function (CDF) of this newly formed random variable Y is then determined as:

$$F_Y(y) = \operatorname{P}(g(X) \le y)$$

Should the function g be invertible (meaning its inverse function, $g^{-1}$, exists) and furthermore be either strictly increasing or decreasing, then this fundamental relationship can be elegantly extended to yield:

$$F_Y(y) = \begin{cases} F_X\!\left(g^{-1}(y)\right), & \text{if } g \text{ is increasing}, \\ 1 - F_X\!\left(g^{-1}(y)\right), & \text{if } g \text{ is decreasing}. \end{cases}$$

Maintaining the same hypotheses of g's invertibility and additionally assuming differentiability, the crucial relationship between the probability density functions (PDFs) can be uncovered by differentiating both sides of the preceding expression with respect to y, leading to:

$$f_Y(y) = f_X\!\left(g^{-1}(y)\right) \left| \frac{d\,g^{-1}(y)}{dy} \right|$$

Now, if the function g is not invertible in a simple one-to-one manner, but each specific value of y corresponds to at most a countable number of roots (meaning a finite, or countably infinite, number of points $x_i$ such that $y = g(x_i)$), then the relationship between the probability density functions can be generalized as a summation:

$$f_Y(y) = \sum_i f_X\!\left(g_i^{-1}(y)\right) \left| \frac{d\,g_i^{-1}(y)}{dy} \right|$$

where $x_i = g_i^{-1}(y)$, a consequence of the inverse function theorem. It's worth noting that these formulas for densities do not impose the constraint that g must be an increasing function.
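As a sanity check of the summation formula on a case whose two branches genuinely differ, here is a sketch with $Y = X^2$ for $X \sim \mathrm{U}(-1, 2)$, a deliberately asymmetric choice:

```python
# Verify the change-of-variables summation formula against a histogram.
import numpy as np

rng = np.random.default_rng(5)

def f_X(x):
    # Density of Uniform(-1, 2).
    return np.where((x >= -1.0) & (x <= 2.0), 1.0 / 3.0, 0.0)

def f_Y(y):
    # Roots of g(x) = x^2 = y are +sqrt(y) and -sqrt(y); each contributes
    # f_X(root) * |d g_i^{-1}(y)/dy| = f_X(root) / (2 sqrt(y)).
    r = np.sqrt(y)
    return (f_X(r) + f_X(-r)) / (2.0 * r)

samples = rng.uniform(-1.0, 2.0, size=1_000_000) ** 2
hist, edges = np.histogram(samples, bins=80, range=(0.0, 4.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

for y in (0.25, 0.81, 2.25):   # one point per regime of the support
    empirical = hist[np.argmin(np.abs(centers - y))]
    print(round(float(f_Y(y)), 3), round(float(empirical), 3))  # close match
```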
Within the more stringent measure-theoretic, axiomatic approach to probability, if we have a random variable X defined on $(\Omega, \mathcal{F}, \operatorname{P})$ and a Borel measurable function $g \colon \mathbb{R} \to \mathbb{R}$, then $Y = g(X)$ is also, by definition, a random variable on $(\Omega, \mathcal{F}, \operatorname{P})$. This is a direct result of the fact that the composition of measurable functions is itself measurable. (However, this is not necessarily true if g is merely Lebesgue measurable, a subtle detail that often escapes casual observation.) The same systematic procedure that allowed us to transition from a probability space $(\Omega, \operatorname{P})$ to $(\mathbb{R}, dF_X)$ can be precisely employed to derive the distribution of Y.
Example 1
Let X be a real-valued, continuous random variable, and let us define a new random variable $Y = X^2$.
The cumulative distribution function (CDF) of Y is given by:

$$F_Y(y) = \operatorname{P}(X^2 \le y)$$

If $y < 0$, then it's impossible for $X^2$ to be less than or equal to $y$ (since $X^2$ must be non-negative). Consequently, $\operatorname{P}(X^2 \le y) = 0$, leading to:

$$F_Y(y) = 0, \qquad \text{if } y < 0$$

However, if $y \ge 0$, the situation changes. The condition $X^2 \le y$ is equivalent to $|X| \le \sqrt{y}$, which further expands to $-\sqrt{y} \le X \le \sqrt{y}$. Therefore, for $y \ge 0$:

$$\operatorname{P}(X^2 \le y) = \operatorname{P}(-\sqrt{y} \le X \le \sqrt{y}) = F_X(\sqrt{y}) - F_X(-\sqrt{y})$$

And thus, the CDF of Y becomes:

$$F_Y(y) = \begin{cases} 0, & \text{if } y < 0, \\ F_X(\sqrt{y}) - F_X(-\sqrt{y}), & \text{if } y \ge 0. \end{cases}$$
This illustrates how the transformation affects the cumulative probabilities.
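Since the example leaves $F_X$ generic, here is a quick numerical check of the identity using a standard normal as a stand-in, an assumption made purely for illustration:

```python
# Empirical CDF of Y = X^2 versus F_X(sqrt(y)) - F_X(-sqrt(y)).
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
y_samples = rng.standard_normal(1_000_000) ** 2

for y in (0.5, 1.0, 2.0):
    empirical = (y_samples <= y).mean()
    formula = stats.norm.cdf(np.sqrt(y)) - stats.norm.cdf(-np.sqrt(y))
    print(round(float(empirical), 4), round(float(formula), 4))
```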
Example 2
Suppose X is a random variable characterized by a cumulative distribution function (CDF) defined as:

$$F_X(x) = \operatorname{P}(X \le x) = \frac{1}{(1 + e^{-x})^{\theta}}$$

where $\theta > 0$ is a fixed, positive parameter. Now, let's consider the random variable $Y = \log(1 + e^{-X})$. We aim to find the CDF of Y.
The CDF of Y is given by:

$$F_Y(y) = \operatorname{P}(Y \le y) = \operatorname{P}\!\left(\log(1 + e^{-X}) \le y\right)$$

To simplify the inequality, we exponentiate both sides (since the logarithm is a monotonic function) and rearrange: $1 + e^{-X} \le e^{y}$, so $e^{-X} \le e^{y} - 1$. For this to be meaningful, $e^{y} - 1$ must be positive, implying $y > 0$. Assuming this, we take the natural logarithm again and multiply by $-1$ (reversing the inequality sign): $X \ge -\log(e^{y} - 1)$.
So, the expression for $F_Y(y)$ becomes:

$$F_Y(y) = \operatorname{P}\!\left(X \ge -\log(e^{y} - 1)\right)$$

The last expression can now be conveniently calculated in terms of the cumulative distribution function of X, using the property $\operatorname{P}(X \ge x) = 1 - \operatorname{P}(X < x)$. Assuming $F_X$ is continuous, so that $\operatorname{P}(X < x) = \operatorname{P}(X \le x) = F_X(x)$, we get:

$$F_Y(y) = 1 - F_X\!\left(-\log(e^{y} - 1)\right)$$

Now substitute the given form of $F_X$. Since $e^{-(-\log(e^{y} - 1))} = e^{y} - 1$:

$$F_Y(y) = 1 - \frac{1}{\left(1 + e^{y} - 1\right)^{\theta}} = 1 - \frac{1}{(e^{y})^{\theta}} = 1 - e^{-y\theta}$$

This result is precisely the cumulative distribution function (CDF) of an exponential distribution with rate parameter $\theta$. A rather neat transformation, if I do say so myself.
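For the skeptical, a sketch verifying this end to end by simulation: draw X by inverting $F_X$ (the inverse-transform technique from earlier), apply the transformation, and compare against $1 - e^{-\theta y}$; the value $\theta = 2$ is an arbitrary choice:

```python
# Simulate X via the inverse of F_X(x) = (1 + e^{-x})^(-theta), form
# Y = log(1 + e^{-X}), and check it behaves like Exp(theta).
import numpy as np

rng = np.random.default_rng(7)
theta = 2.0
u = rng.random(1_000_000)

x = -np.log(u ** (-1.0 / theta) - 1.0)   # inverse of F_X applied to uniforms
y = np.log1p(np.exp(-x))                 # the transformation of the example

print(y.mean())                                      # ~ 1/theta = 0.5
print((y <= 0.5).mean(), 1 - np.exp(-theta * 0.5))   # CDF check at y = 0.5
```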
Example 3
Assume X is a random variable that adheres to a standard normal distribution, whose probability density function (PDF) is given by:

$$f_X(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$

Now, let's consider the new random variable $Y = X^2$. Our goal is to derive the density of Y using the general formula for a change of variables:

$$f_Y(y) = \sum_i f_X\!\left(g_i^{-1}(y)\right) \left| \frac{d\,g_i^{-1}(y)}{dy} \right|$$

In this particular case, the transformation $g(x) = x^2$ is not monotonic, which means that for a given value of Y, there are two corresponding values of X (one positive and one negative, for $y > 0$). However, due to the inherent symmetry of the standard normal distribution around zero, both the positive and negative halves of the transformation contribute identically to the density of Y. Thus, we can simplify the summation:

$$f_Y(y) = 2 f_X\!\left(g^{-1}(y)\right) \left| \frac{d\,g^{-1}(y)}{dy} \right|$$

The inverse transformation for $y = x^2$ is $x = g^{-1}(y) = \sqrt{y}$ (considering the positive root for this simplified calculation due to symmetry). Its derivative with respect to y is:

$$\frac{d\,g^{-1}(y)}{dy} = \frac{1}{2\sqrt{y}}$$

Substituting these back into our modified formula, we get:

$$f_Y(y) = 2 \cdot \frac{1}{\sqrt{2\pi}}\, e^{-y/2} \cdot \frac{1}{2\sqrt{y}} = \frac{1}{\sqrt{2\pi y}}\, e^{-y/2}, \qquad y > 0$$
This derived probability density function is precisely that of a chi-squared distribution with one degree of freedom. A classic result that demonstrates the power of these transformations.
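A short check of this identity against SciPy's chi-squared implementation:

```python
# Derived density versus scipy.stats.chi2 with one degree of freedom.
import numpy as np
from scipy import stats

y = np.array([0.5, 1.0, 2.0])
derived = np.exp(-y / 2.0) / np.sqrt(2.0 * np.pi * y)   # formula above
print(derived)
print(stats.chi2.pdf(y, df=1))                          # identical values
```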
Example 4
Consider X as a random variable following a general normal distribution, with its probability density function (PDF) given by:

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

We are again interested in the random variable $Y = X^2$. We will employ the same change of variables formula for the density:

$$f_Y(y) = \sum_i f_X\!\left(g_i^{-1}(y)\right) \left| \frac{d\,g_i^{-1}(y)}{dy} \right|$$

Similar to the previous example, the transformation is not monotonic. For any given value of $y$ (where $y > 0$), there exist two corresponding values of $x$: $x_1 = -\sqrt{y}$ and $x_2 = \sqrt{y}$. However, in contrast to the standard normal distribution where $\mu = 0$, here there is no inherent symmetry around zero unless $\mu = 0$. Therefore, we must explicitly compute both distinct terms in the summation:

$$f_Y(y) = f_X(-\sqrt{y}) \left| \frac{d(-\sqrt{y})}{dy} \right| + f_X(\sqrt{y}) \left| \frac{d\,\sqrt{y}}{dy} \right|$$

The inverse transformations for $y = x^2$ are $x = \pm\sqrt{y}$. The derivative of these inverse functions with respect to y is, in absolute value:

$$\left| \frac{d(\pm\sqrt{y})}{dy} \right| = \frac{1}{2\sqrt{y}}$$

Substituting these into the formula for $f_Y(y)$:

$$f_Y(y) = \frac{1}{2\sqrt{y}} \cdot \frac{1}{\sqrt{2\pi\sigma^2}} \left( e^{-\frac{(\sqrt{y} - \mu)^2}{2\sigma^2}} + e^{-\frac{(-\sqrt{y} - \mu)^2}{2\sigma^2}} \right), \qquad y > 0$$

This resulting probability density function describes a noncentral chi-squared distribution with one degree of freedom. The presence of the mean $\mu$ introduces the "noncentrality" characteristic, distinguishing it from the simpler chi-squared distribution obtained when $\mu = 0$.
Some Properties
The universe of random variables might seem chaotic, but it adheres to certain fundamental properties that bring a semblance of order:
- Convolution of Distributions: A rather elegant property states that the probability distribution of the sum of two independent random variables is precisely the convolution of their individual distributions. This provides a powerful tool for analyzing combined effects.
- Convex Combination of Distributions: While probability distributions do not form a vector space—they are not closed under arbitrary linear combinations because such operations would generally fail to preserve the crucial properties of non-negativity or a total integral of 1—they are closed under convex combination. This means that a weighted average of two valid probability distributions, where the weights are non-negative and sum to 1, will always result in another valid probability distribution. Consequently, the set of all probability distributions forms a convex subset within the broader space of functions (or measures).
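Both closure properties above can be seen directly on small discrete laws; the following sketch, with an arbitrarily invented "loaded die" PMF, is illustrative only:

```python
# Convolution and convex combination of discrete probability distributions.
import numpy as np

die = np.full(6, 1.0 / 6.0)             # PMF on values 1..6

# Convolution: distribution of the sum of two independent fair dice.
sum_pmf = np.convolve(die, die)         # PMF on values 2..12
print(sum_pmf.round(4), sum_pmf.sum())  # peaks at 7, still sums to 1

# Convex combination: a 30/70 mixture of a fair die and a loaded die.
loaded = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])
mixture = 0.3 * die + 0.7 * loaded
print(mixture.round(4), mixture.sum())  # non-negative, sums to 1
```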
Equivalence of Random Variables
The notion of "equivalence" for random variables is, perhaps unsurprisingly, not a monolithic concept. There are several distinct senses in which two random variables can be considered equivalent, each implying a different level of correspondence. They can be equal, equal almost surely, or merely equal in distribution.
These notions of equivalence are presented below in increasing order of their mathematical strength and specificity.
Equality in Distribution
Two random variables X and Y are said to be equal in distribution (denoted $X \stackrel{d}{=} Y$) if they share the exact same distribution functions. This means:

$$\operatorname{P}(X \le x) = \operatorname{P}(Y \le x) \qquad \text{for all } x$$
For this type of equivalence, it is not a prerequisite that the random variables X and Y be defined on the same probability space. They can originate from entirely different experimental setups, yet still exhibit identical probabilistic behavior. A practical and often useful criterion is that two random variables possessing equal moment generating functions are guaranteed to have the same distribution. This provides a convenient method, for example, to verify the equality of certain functions of independent, identically distributed (IID) random variables. However, this method is not universally applicable, as the moment generating function only exists for those distributions that have a well-defined Laplace transform.
Almost Sure Equality
Two random variables X and Y are considered equal almost surely (denoted $X \stackrel{\text{a.s.}}{=} Y$) if, and only if, the probability that they assume different values is precisely zero:

$$\operatorname{P}(X \ne Y) = 0$$

For all pragmatic purposes within probability theory, this particular notion of equivalence is as robust as actual, point-wise equality. It implies that any difference between X and Y can only occur on a set of outcomes that is so vanishingly small as to have zero probability. This concept is intimately associated with the following distance metric:

$$d_\infty(X, Y) = \operatorname{ess\,sup}_{\omega} |X(\omega) - Y(\omega)|$$
where "ess sup" refers to the essential supremum in the precise language of measure theory. This distance effectively measures the largest difference between X and Y that occurs on a set of positive probability.
Equality
Finally, the most stringent definition: two random variables X and Y are considered truly equal if they are identical as functions on their shared measurable space. This means:

$$X(\omega) = Y(\omega) \qquad \text{for all } \omega \in \Omega$$
This notion, while seemingly the most straightforward, is paradoxically the least useful in the practical application of probability theory. The reason lies in the fact that, both in practice and in theory, the underlying measure space of the experiment is rarely, if ever, explicitly characterized or even characterizable in its entirety. It's an ideal, rather than a practical, state of affairs.
Practical Difference Between Notions of Equivalence
Since the underlying probability space of a random variable is so rarely explicitly constructed, the distinctions between these various notions of equivalence can be rather subtle, almost an academic exercise, until you trip over them. Essentially, if you consider two random variables in perfect isolation, they are "practically equivalent" if they are merely equal in distribution. Their individual probabilistic behavior is indistinguishable. However, the moment you introduce other random variables defined on the same probability space and start exploring their relationships, then only if they are equal almost surely do they remain "practically equivalent."
For instance, let's consider four real random variables, A, B, C, and D, all operating within the same probability space. Suppose A and B are equal almost surely ($A \stackrel{\text{a.s.}}{=} B$), but A and C are only equal in distribution ($A \stackrel{d}{=} C$). In this scenario, it logically follows that $A + D \stackrel{\text{a.s.}}{=} B + D$. The "almost sure" equivalence carries through the addition. However, in general, $A + D \ne C + D$, and this inequality holds even if we consider their distributions. Similarly, we find that their expected values are equal, $\operatorname{E}[A] = \operatorname{E}[C]$, but generally, $\operatorname{Cov}(A, D) \ne \operatorname{Cov}(C, D)$. This crucial distinction means that two random variables that are equal in distribution (but not equal almost surely) can exhibit entirely different covariances with a third random variable. It's a testament to the fact that context, even in the realm of randomness, is everything.
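A concrete sketch of this last point: take A standard normal, $C = -A$ (equal in distribution by the symmetry of the normal law, but not almost surely equal), and $D = A$ as the third variable:

```python
# Equal in distribution, yet opposite covariances with a third variable.
import numpy as np

rng = np.random.default_rng(8)
a = rng.standard_normal(1_000_000)
c = -a            # same distribution as A, but not almost surely equal
d = a             # a third variable on the same probability space

print(np.cov(a, d)[0, 1])   # ~ +1
print(np.cov(c, d)[0, 1])   # ~ -1, despite A and C being equal in law
```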
Convergence
Main article: Convergence of random variables
A pervasive and significant theme in the grand narrative of mathematical statistics involves the pursuit and derivation of convergence results for specific sequences of random variables. These results are not merely academic curiosities; they form the bedrock of much of statistical inference. Prime examples include the fundamental law of large numbers, which assures us that sample averages tend towards the true population mean, and the ubiquitous central limit theorem, a cornerstone that explains why averages of many independent random variables tend towards a normal distribution, regardless of the original distribution.
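Both theorems can be watched in action with a short simulation; the exponential base distribution and the sample sizes below are arbitrary choices made for illustration:

```python
# LLN: sample means approach the true mean. CLT: standardized sample means
# of a skewed law (exponential) look approximately standard normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, reps = 1_000, 5_000
samples = rng.exponential(scale=1.0, size=(reps, n))

means = samples.mean(axis=1)
print(means.mean())                       # ~ 1.0, the true mean (LLN)

z = (means - 1.0) / (1.0 / np.sqrt(n))    # standardized sample means
print(stats.kstest(z, "norm").statistic)  # small: close to N(0, 1) (CLT)
```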
It's not a simple, singular concept, however. There are various nuanced senses in which a sequence of random variables can be said to converge to a limiting random variable X. These distinct modes of convergence—each with its own implications and applications—are meticulously elaborated upon in the dedicated article on convergence of random variables. One might almost say that even in their approach to a defined limit, random variables insist on a certain level of complexity.
See also
- Mathematics portal
- Aleatoricism
- Algebra of random variables
- Event (probability theory)
- Multivariate random variable
- Pairwise independent random variables
- Observable variable
- Random compact set
- Random element
- Random function
- Random measure
- Random number generator
- Random variate
- Random vector
- Randomness
- Stochastic process
- Relationships among probability distributions