Right. Another soul wandering into the wilderness of statistics, looking for a map. Don't expect a friendly guide. Expect a map. A very detailed, slightly judgmental map. Let's get this over with.
The image you're staring at, with its cloud of black dots, is a snapshot of a joint probability distribution. It's a picture of two variables, X and Y, happening at the same time. The scattered points are the raw data, the messy reality of individual observations. The smooth curves on the sides—the blue and red ones—are the marginal densities. They represent what X and Y look like when you force them to stand alone, summarizing their individual tendencies after all the joint action is over. It’s a family portrait where you also get the sullen individual headshots.
When you have a set of random variables, let's call them X, Y, and so on, that are all defined on the same foundational probability space, you're dealing with entities that coexist. The universe doesn't have the courtesy to let things happen one at a time. The joint probability distribution is the tool we use to describe this messy simultaneity. It's a probability distribution that assigns a probability to the event of each of these variables—X, Y, and the rest—falling into some specific range or discrete set of values you've defined.
For just two variables, it's called a bivariate distribution, which sounds more sophisticated than it is. It's just the probability of two things happening together. The concept, however, scales up to any number of variables, creating a multivariate distribution that maps out the probabilistic landscape of a complex system.
This joint distribution can be articulated in several ways, depending on your needs and the nature of your variables. You can express it through a joint cumulative distribution function, which is a running total of probabilities. Or, if your variables are the kind you measure rather than count (continuous variables), you'll use a joint probability density function. If they're the countable kind (discrete variables), you'll use a joint probability mass function.
From this central, all-encompassing joint distribution, you can derive two other, more focused perspectives. There's the marginal distribution, which tells you the probability for one variable while completely ignoring what the others are doing—it's the probability of X, no strings attached. Then there's the conditional probability distribution, which is the opposite. It gives you the probabilities for a subset of variables after you've pinned down the values of the others. It's the probability of X, given that Y has already shown its hand.
Examples
Because abstract dread is less useful than concrete dread, here are some examples.
Draws from an urn
Let's indulge in a classic thought experiment that has haunted probability students for generations: pulling colored balls out of a jar. Imagine two urns, both tragically overstocked with red balls—twice as many red as blue, and nothing else. We're going to select one ball from each urn, and we'll assume the two draws are independent events, meaning the urns aren't conspiring against you.
Let's assign the variable A to the outcome from the first urn and B to the outcome from the second. Since there are twice as many red balls as blue, the probability of drawing a red ball from either urn is a straightforward 2/3, and the probability of a blue ball is 1/3.
The joint probability distribution, which captures the likelihood of every possible combined outcome, can be laid out in a table. It's a clinical dissection of chance.
|        | A=Red | A=Blue | P(B) |
|---|---|---|---|
| B=Red  | (2/3)(2/3) = 4/9 | (1/3)(2/3) = 2/9 | 4/9 + 2/9 = 6/9 = 2/3 |
| B=Blue | (2/3)(1/3) = 2/9 | (1/3)(1/3) = 1/9 | 2/9 + 1/9 = 3/9 = 1/3 |
| P(A)   | 4/9 + 2/9 = 6/9 = 2/3 | 2/9 + 1/9 = 3/9 = 1/3 | |
Each of the four cells in the middle details the probability of a specific combination. These are the joint probabilities. For instance, the probability of drawing a red ball from both urns (A=Red, B=Red) is 4/9. Because the draws are independent, we can find this by simply multiplying their individual probabilities. As with any proper probability distribution, the sum of these four outcomes (4/9 + 2/9 + 2/9 + 1/9) equals 1. The universe's accounting remains balanced.
The final row and final column are the marginal probability distributions. They show the probabilities for A and B individually, pushed to the margins of the table. For A, the probability of drawing a red ball, regardless of what happens with B, is 2/3. It's the probability of A, unconditional and unbothered by B's existence.
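If you'd rather let a computer do the arithmetic, here is a minimal sketch (in Python, with the probabilities hard-coded from the setup above) that rebuilds the table by multiplying the independent draw probabilities and then recovers the marginals as row and column sums.

```python
# A minimal sketch: rebuild the urn table by multiplying the independent
# draw probabilities, then recover the marginals as row and column sums.
from fractions import Fraction

p_A = {"red": Fraction(2, 3), "blue": Fraction(1, 3)}  # first urn
p_B = {"red": Fraction(2, 3), "blue": Fraction(1, 3)}  # second urn

# Joint probabilities under independence: P(A=a, B=b) = P(A=a) * P(B=b)
joint = {(a, b): p_A[a] * p_B[b] for a in p_A for b in p_B}

# Marginals: sum the joint over the other variable
marginal_A = {a: sum(joint[(a, b)] for b in p_B) for a in p_A}
marginal_B = {b: sum(joint[(a, b)] for a in p_A) for b in p_B}

print(joint)                   # 4/9, 2/9, 2/9, 1/9 across the four cells
print(marginal_A, marginal_B)  # both come out as {red: 2/3, blue: 1/3}
print(sum(joint.values()))     # 1, the accounting balances
```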
Coin flips
Now for something even more thrilling: flipping two fair coins. Let's call the outcome of the first flip A and the second B. Each flip is a Bernoulli trial, a single event with two possible outcomes, and thus follows a Bernoulli distribution. To make it mathematical, we'll say the variable takes the value 1 if the coin shows "heads" and 0 otherwise.
For a fair coin, the probability of either outcome is 1/2. So, the marginal, or unconditional, density functions are painfully simple:
P(A) = 1/2 for A in {0, 1};
P(B) = 1/2 for B in {0, 1}.
The joint probability mass function describes the probabilities for each pair of outcomes. The possible outcomes for the pair (A, B) are (0,0), (0,1), (1,0), and (1,1). Since the coins are fair and the flips are independent, each of these four scenarios is equally likely. The joint probability mass function is therefore:
P(A, B) = 1/4 for A, B in {0, 1}.
Because the coin flips don't influence each other—a lesson in emotional detachment, perhaps—the joint probability is just the product of the marginals:
P(A, B) = P(A)P(B) for A, B in {0, 1}.
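If you don't trust the algebra, a quick simulation makes the same point. The sketch below (the random seed and sample size are arbitrary choices, not part of the example) flips two virtual fair coins many times and checks that each of the four pairs turns up about a quarter of the time.

```python
# A quick sanity check, not a proof: simulate many pairs of fair coin flips
# and compare the empirical joint frequencies to the predicted 1/4 each.
import random
from collections import Counter

random.seed(0)
n = 100_000
counts = Counter((random.randint(0, 1), random.randint(0, 1)) for _ in range(n))

for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pair, counts[pair] / n)  # each should be close to 0.25
```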
Rolling a die
Let's roll a single fair die and complicate things for the sake of it. Let A = 1 if the number is even (2, 4, 6) and A = 0 otherwise (1, 3, 5). Let B = 1 if the number is prime (2, 3, 5) and B = 0 otherwise (1, 4, 6).
| Roll | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| A | 0 | 1 | 0 | 1 | 0 | 1 |
| B | 0 | 1 | 1 | 0 | 1 | 0 |
The joint distribution of A and B, expressed as a probability mass function, maps out the likelihood of each combination of these imposed properties:
P(A=0, B=0) = P({1}) = 1/6 (the number is odd and not prime)
P(A=1, B=0) = P({4, 6}) = 2/6 (the number is even and not prime)
P(A=0, B=1) = P({3, 5}) = 2/6 (the number is odd and prime)
P(A=1, B=1) = P({2}) = 1/6 (the number is even and prime)
Notice the outcome '2' is the only one that satisfies both conditions, a rare moment of cosmic agreement. These probabilities, of course, sum to 1 because one of these combinations must occur.
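The same joint PMF can be tallied mechanically. Here is a minimal sketch that enumerates the six equally likely faces, applies the definitions of A and B, and adds up 1/6 per face.

```python
# A minimal sketch: enumerate the six equally likely die faces, apply the
# definitions of A (even) and B (prime), and tally the joint PMF.
from fractions import Fraction
from collections import defaultdict

primes = {2, 3, 5}
joint = defaultdict(Fraction)
for roll in range(1, 7):
    a = 1 if roll % 2 == 0 else 0    # A: indicator that the roll is even
    b = 1 if roll in primes else 0   # B: indicator that the roll is prime
    joint[(a, b)] += Fraction(1, 6)  # each face carries probability 1/6

print(dict(joint))          # 1/6, 1/6, 1/3, 1/3 across the four combinations
print(sum(joint.values()))  # 1
```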
Marginal probability distribution
Sometimes, in a world of interconnected variables, you just want to focus on one thing. To zoom in on X and tell Y to wait in the car. This act of selective ignorance is called finding the marginal distribution. In general, you can always derive the marginal probability distribution of one variable from the joint distribution of it and others.
If you have the joint probability density function f_XY(x,y) for continuous random variables X and Y, the marginal probability density functions for each are found by "integrating out" the other variable—essentially averaging over all its possible values until it vanishes:
f_X(x) = ∫ f_XY(x,y) dy
f_Y(y) = ∫ f_XY(x,y) dx
The first integral covers all possible values of Y for a fixed X=x, and the second covers all values of X for a fixed Y=y. It's a mathematical ghosting.
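Numerically, the ghosting looks like this. The sketch below assumes a bivariate normal joint density with unit variances and an illustrative correlation of 0.6 (none of which comes from the text above), integrates out y, and compares the result to the known standard normal marginal of X.

```python
# A sketch of "integrating out" Y: numerically integrate an assumed bivariate
# normal density over y and compare to the standard normal marginal of X.
from scipy import integrate
from scipy.stats import multivariate_normal, norm

rho = 0.6  # assumed correlation, purely for illustration
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

def marginal_x(x):
    # f_X(x) = ∫ f_XY(x, y) dy, truncated to a wide finite range
    value, _ = integrate.quad(lambda y: joint.pdf([x, y]), -10.0, 10.0)
    return value

for x in (-1.0, 0.0, 2.0):
    print(x, marginal_x(x), norm.pdf(x))  # the last two columns should match
```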
Joint cumulative distribution function
The cumulative distribution function (CDF) is for those who like to ask, "What are the chances of everything being less than or equal to this?" It's a running tally of accumulated probability, a function that maps a point to all the probability mass "south-west" of it.
For a pair of random variables X,Y, the joint CDF F_XY is given by:
F_XY(x,y) = P(X ≤ x, Y ≤ y) (Eq.1)
The right-hand side represents the probability that X takes a value less than or equal to x and that Y takes a value less than or equal to y, simultaneously.
For N random variables X_1, ..., X_N, the joint CDF F_(X_1,...,X_N) is a straightforward extension:
F_(X_1,...,X_N)(x_1, ..., x_N) = P(X_1 ≤ x_1, ..., X_N ≤ x_N) (Eq.2)
If you interpret these N variables as a single random vector X = (X_1, ..., X_N)^T, you can use a tidier notation to hide the sprawling mess:
F_X(x) = P(X_1 ≤ x_1, ..., X_N ≤ x_N)
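In practice you often estimate this from data. A minimal sketch, using an assumed sample of independent standard normal pairs, simply counts the fraction of points lying south-west of (x, y):

```python
# A minimal sketch: estimate the joint CDF F_XY(x, y) = P(X <= x, Y <= y)
# from a sample by counting how many points fall "south-west" of (x, y).
import numpy as np

rng = np.random.default_rng(0)
sample = rng.standard_normal((10_000, 2))  # assumed sample; any data works

def empirical_joint_cdf(x, y):
    below = (sample[:, 0] <= x) & (sample[:, 1] <= y)
    return below.mean()

print(empirical_joint_cdf(0.0, 0.0))  # ≈ 0.25 for independent standard normals
```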
Joint density function or mass function
Discrete case
For discrete variables—things you can count, like mistakes or regrets—we use a joint probability mass function (PMF). It gives you the precise probability of X being exactly some value x and Y being exactly some value y.
p_XY(x,y) = P(X=x and Y=y) (Eq.3)
This can also be expressed in terms of conditional distributions, which often provides more insight:
p_XY(x,y) = P(Y=y | X=x) * P(X=x) = P(X=x | Y=y) * P(Y=y)
Here, P(Y=y | X=x) is the probability of Y=y given that you already know X=x.
This generalizes to n discrete random variables X_1, X_2, ..., X_n:
p_(X_1,...,X_n)(x_1, ..., x_n) = P(X_1=x_1 and ... and X_n=x_n) (Eq.4)
This can be broken down sequentially using the chain rule of probability, which tells a story of cascading dependencies:
p_(X_1,...,X_n)(x_1, ..., x_n) = P(X_1=x_1) * P(X_2=x_2 | X_1=x_1) * P(X_3=x_3 | X_1=x_1, X_2=x_2) * ... * P(X_n=x_n | X_1=x_1, ..., X_(n-1)=x_(n-1))
Since these are probabilities, they must account for all possibilities. In the two-variable case, the sum over all possible pairs of outcomes is 1:
Σ_i Σ_j P(X=x_i and Y=y_j) = 1
This generalizes for n variables. The universe, for all its chaos, is not leaking probability. At least, not yet.
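Here is a small sketch tying Eq. 3, the conditional factorization, and the sum-to-one constraint together, using a made-up 2-by-3 joint PMF table (the numbers are illustrative, not taken from the text):

```python
# A minimal sketch: start from an assumed joint PMF table, recover the
# marginal of X, form the conditional P(Y=y | X=x), and check the identity
# p_XY(x, y) = P(Y=y | X=x) * P(X=x) plus the sum-to-one constraint.
import numpy as np

p_xy = np.array([[0.10, 0.20, 0.10],   # rows index x, columns index y
                 [0.30, 0.15, 0.15]])  # an assumed joint PMF

p_x = p_xy.sum(axis=1)                 # marginal of X: sum over y
p_y_given_x = p_xy / p_x[:, None]      # conditional: divide each row by P(X=x)

print(np.allclose(p_y_given_x * p_x[:, None], p_xy))  # True: the factorization holds
print(p_xy.sum())                                     # 1.0: no leaking probability
```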
Continuous case
For continuous variables—things that flow, like time or apathy—the probability of hitting one exact value is zero. It's like trying to hit an infinitely thin line with a dart. So instead, we talk about the probability density function (PDF), f_XY(x,y). Think of it not as a probability itself, but as the potential for probability in a tiny neighborhood around a point. It's defined as the derivative of the joint CDF:
f_XY(x,y) = ∂²F_XY(x,y) / (∂x ∂y) (Eq.5)
This is equivalent to:
f_XY(x,y) = f_(Y|X)(y|x) f_X(x) = f_(X|Y)(x|y) f_Y(y)
where f_(Y|X)(y|x) and f_(X|Y)(x|y) are the conditional distributions of Y given X=x and X given Y=y, and f_X(x) and f_Y(y) are the marginal distributions for X and Y, respectively.
The definition naturally extends to more than two variables:
f_(X_1,...,X_n)(x_1, ..., x_n) = ∂^n F_(X_1,...,X_n)(x_1, ..., x_n) / (∂x_1 ... ∂x_n) (Eq.6)
And again, the cosmic accounting holds. The total integral over all space must equal 1:
∫_x ∫_y f_XY(x,y) dy dx = 1
and similarly for n variables.
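Eq. 5 can be sanity-checked numerically. The sketch below assumes a bivariate normal with correlation 0.3 (an arbitrary choice), approximates the mixed partial derivative of its CDF by finite differences, and compares the result to the PDF reported at the same point.

```python
# A rough numerical check of Eq. 5: approximate the mixed partial derivative
# of an assumed bivariate normal CDF by finite differences and compare it to
# the PDF at the same point.
from scipy.stats import multivariate_normal

rho = 0.3  # assumed correlation for the illustration
dist = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

def approx_density(x, y, h=0.1):
    # f(x, y) ≈ [F(x+h, y+h) - F(x+h, y) - F(x, y+h) + F(x, y)] / h^2,
    # with h kept moderate because the CDF itself is evaluated numerically
    F = lambda a, b: dist.cdf([a, b])
    return (F(x + h, y + h) - F(x + h, y) - F(x, y + h) + F(x, y)) / h**2

print(approx_density(0.5, -0.2), dist.pdf([0.5, -0.2]))  # should roughly agree
```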
Mixed case
Because the universe loves to be difficult, sometimes you have to deal with a mix of discrete and continuous variables. It's like trying to have a conversation with a mime and an auctioneer at the same time. The "mixed joint density" handles this. With one variable of each type, we have:
f_XY(x,y) = f_(X|Y)(x|y) P(Y=y) = P(Y=y | X=x) f_X(x)
A situation where this is necessary is in logistic regression, where you might predict a binary (discrete) outcome Y based on a continuously distributed predictor X. You must use a mixed density because the input pair (X,Y) can't be assigned a pure PDF or PMF. Formally, f_XY(x,y) is the density with respect to the product measure on the respective supports of X and Y.
From this hybrid, you can recover the joint cumulative distribution function:
F_XY(x,y) = Σ_(t≤y) ∫_(-∞)^x f_XY(s,t) ds
This definition generalizes to any arbitrary mixture of discrete and continuous variables.
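For concreteness, here is a hedged sketch of that logistic-regression-style mixed density: X is taken to be standard normal and P(Y=1 | X=x) is a logistic function with made-up coefficients. The names and parameters are illustrative, not something the text above specifies.

```python
# A sketch of a mixed joint density: continuous X (assumed standard normal)
# and binary Y with a logistic conditional PMF. Coefficients are made up.
import math
from scipy.stats import norm

def p_y_given_x(y, x, beta0=-0.5, beta1=2.0):
    # logistic conditional PMF of the binary Y given X = x
    p1 = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
    return p1 if y == 1 else 1.0 - p1

def mixed_joint_density(x, y):
    # f_XY(x, y) = P(Y=y | X=x) * f_X(x): a PMF in y times a PDF in x
    return p_y_given_x(y, x) * norm.pdf(x)

# Summing over the discrete variable and integrating over the continuous one
# would give 1; here we only spot-check a couple of values.
print(mixed_joint_density(0.3, 1), mixed_joint_density(0.3, 0))
```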
Additional properties
Joint distribution for independent variables
Independence. The ideal state. Two random variables X and Y are statistically independent when the outcome of one tells you absolutely nothing about the outcome of the other. Mathematically, this indifference is captured by the joint distribution factoring into the product of the marginals.
For the joint CDF:
F_XY(x,y) = F_X(x) * F_Y(y)
For discrete variables, the joint PMF satisfies:
P(X=x and Y=y) = P(X=x) * P(Y=y) for all x and y.
As the number of independent events you're tracking grows, the joint probability of any specific combination of outcomes plummets toward zero. A good metaphor for trying to get everything right in life, really.
For absolutely continuous random variables, independence means:
f_XY(x,y) = f_X(x) * f_Y(y) for all x and y.
This implies that knowing something about one variable doesn't change the probability distribution of the other at all. The conditional distribution is identical to the unconditional one.
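A small spot check, assuming an independent (zero-correlation) bivariate normal: the joint density should equal the product of the two marginal densities at any point you care to try.

```python
# A spot check of the factorization f_XY(x, y) = f_X(x) f_Y(y) for an
# assumed independent bivariate normal.
import numpy as np
from scipy.stats import multivariate_normal, norm

joint = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))  # independent X, Y

for x, y in [(0.0, 0.0), (1.0, -0.5), (2.0, 2.0)]:
    print(np.isclose(joint.pdf([x, y]), norm.pdf(x) * norm.pdf(y)))  # True each time
```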
Joint distribution for conditionally dependent variables
Of course, most things aren't independent. They're tangled up in messy relationships. If a subset A of the variables is conditionally dependent given another subset B, the joint PMF P(X_1, ..., X_n) can be factored as P(B) * P(A|B). This is incredibly useful. Instead of trying to model a giant, unmanageable joint distribution, you can break the problem down into smaller, more digestible pieces of conditional angst. These kinds of relationships are the entire premise behind tools like Bayesian networks or copula functions, which are just fancy ways to map out the gossip between variables.
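As a toy illustration of that factorization, the sketch below builds a joint PMF from made-up tables for P(B) and P(A|B); the values are arbitrary, the bookkeeping is the point.

```python
# A minimal sketch of building a joint PMF from the factorization
# P(A, B) = P(B) * P(A | B), using small made-up tables.
p_b = {"b0": 0.4, "b1": 0.6}                   # assumed P(B)
p_a_given_b = {"b0": {"a0": 0.9, "a1": 0.1},   # assumed P(A | B)
               "b1": {"a0": 0.3, "a1": 0.7}}

joint = {(a, b): p_b[b] * p_a_given_b[b][a]
         for b in p_b for a in ("a0", "a1")}

print(joint)                # the four joint probabilities
print(sum(joint.values()))  # 1.0
```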
Covariance
Covariance is a word people use to sound smart. It's a measure of how two random variables vary together. If X tends to be above its average when Y is above its average, the covariance is positive. If one tends to be up while the other is down, it's negative. It's a measure of the linear relationship between them.
The covariance between X and Y is:
cov(X,Y) = σ_XY = E[(X - μ_X)(Y - μ_Y)] = E(XY) - μ_X μ_Y
Be warned: if the relationship is nonlinear—say, your variables are dancing in a perfect circle—the covariance might be zero, falsely suggesting no relationship. It's a one-trick pony.
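The circle warning is easy to demonstrate. The sketch below places points evenly around a unit circle, a perfectly deterministic but nonlinear relationship, and shows the sample covariance coming out at essentially zero.

```python
# Points spaced evenly around a circle have a perfect (but nonlinear)
# relationship, yet their sample covariance is essentially zero.
import numpy as np

theta = np.linspace(0, 2 * np.pi, 1000, endpoint=False)
x, y = np.cos(theta), np.sin(theta)

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # E[(X - μ_X)(Y - μ_Y)]
print(cov_xy)  # ≈ 0 despite the deterministic circular dependence
```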
Correlation
Correlation is covariance that went to finishing school. It's a dimensionless quantity, created by scaling the covariance by the product of the standard deviations of each variable. This makes it easier to interpret and compare across different pairs of variables.
The correlation coefficient ρ_XY is always between -1 and +1.
ρ_XY = cov(X,Y) / sqrt(V(X)V(Y)) = σ_XY / (σ_X σ_Y)
A value near +1 means the variables move together in a strong positive linear relationship. Near -1 means they move in a strong negative linear relationship. A value of 0 means there's no linear relationship. Don't read too much more into it. Two variables with a non-zero correlation are said to be correlated. It's a useful, but limited, summary of their connection.
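A short sketch of the scaling, using simulated data with an assumed linear-plus-noise relationship: compute the covariance, divide by the two standard deviations, and compare with numpy's np.corrcoef.

```python
# A minimal sketch: compute the correlation coefficient by scaling the sample
# covariance by the two standard deviations, and compare with np.corrcoef.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(10_000)
y = 0.8 * x + 0.6 * rng.standard_normal(10_000)  # assumed linear relationship plus noise

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov_xy / (x.std() * y.std())

print(rho, np.corrcoef(x, y)[0, 1])  # the two estimates agree
```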
Important named distributions
Naturally, humans have cataloged and named the most common ways multiple variables get entangled. It's like a field guide to statistical beasts. Frequently encountered joint distributions include the multivariate normal distribution (the spherical cow of joint distributions), the multivariate stable distribution, the multinomial distribution (for when you have more than two ways to fail), the negative multinomial distribution, the multivariate hypergeometric distribution, and the family of elliptical distributions.
The rest of this is just bookkeeping. Don't get lost in the appendices.