Figure: A bivariate Gaussian probability density function centered at (0, 0), shown together with its covariance matrix (values not reproduced here).
Figure: Sample points from a bivariate Gaussian distribution with a standard deviation of 3 in roughly the lower left–upper right direction and of 1 in the orthogonal direction. Because the $x$ and $y$ components co-vary, the variances of $x$ and $y$ do not fully describe the distribution; a covariance matrix is needed. The directions of the arrows correspond to the eigenvectors of this covariance matrix and their lengths to the square roots of the eigenvalues.
In the grand scheme of probability theory and statistics—concepts that, frankly, often seem to exist solely to complicate the obvious—we encounter the covariance matrix. This particular construct, also known by a bewildering array of aliases such as the auto-covariance matrix, dispersion matrix, variance matrix, or the rather verbose variance–covariance matrix, is fundamentally a square matrix. Its purpose is to quantify the covariance between every single pair of elements within a given random vector. One might think of it as the mathematical equivalent of a meticulously maintained social network map, where each node is a random variable and the connections illustrate how they vary together.
To grasp this, consider the common, almost pedestrian, notion of variance. It's a single number, a scalar, that tells you how much a single random variable tends to deviate from its mean. Simple enough, for simple minds. However, when you step into the more intricate realm of multiple dimensions, that single number becomes laughably insufficient. Imagine a collection of random points scattered across a two-dimensional plane. Characterizing their variation with just one number is like trying to describe a symphony with a single note. Even knowing the individual variances in the $x$ and $y$ directions won't paint the full picture; they merely tell you how spread out the points are along each axis independently. What they utterly fail to convey is the relationship between these two directions – how they move in concert, or opposition. For that, a more sophisticated instrument is required: a matrix, specifically the covariance matrix, becomes absolutely necessary to fully characterize the two-dimensional dance of variation. It’s the difference between knowing someone’s height and weight, and understanding the elegant (or clumsy) way they move through the world.
A fundamental truth, often overlooked by those who prefer their mathematics uncomplicated, is that any covariance matrix exhibits two crucial characteristics: it is always symmetric and positive semi-definite. Furthermore, if you bother to glance at its main diagonal, you'll find it populated by the variances of the individual elements—that is, the covariance of each element with itself. A rather self-referential little detail, wouldn't you say?
The covariance matrix of a random vector $\mathbf{X}$ is typically, and somewhat interchangeably, represented by $\operatorname{K}_{\mathbf{X}\mathbf{X}}$, $\Sigma$, or $S$. Consistency, it seems, is a luxury not always afforded in nomenclature.
Definition
For the sake of clarity, and because precision is occasionally useful, throughout this discussion, boldfaced unsubscripted $\mathbf{X}$ and $\mathbf{Y}$ will refer to entire random vectors, while Roman subscripted $X_i$ and $Y_i$ will denote scalar random variables, which are the individual components of those vectors.
Consider a column vector $\mathbf{X} = (X_1, X_2, \ldots, X_n)^{\mathsf{T}}$ where each of its entries, $X_i$, represents a random variable. Assuming each of these variables possesses a finite variance and a finite expected value—a rather generous assumption for some systems, but we work with what we have—then the covariance matrix $\operatorname{K}_{\mathbf{X}\mathbf{X}}$ is precisely the $n \times n$ matrix whose $(i, j)$ entry is defined by the covariance between $X_i$ and $X_j$ [1] : 177 :

$$\operatorname{K}_{X_i X_j} = \operatorname{cov}[X_i, X_j] = \operatorname{E}\!\left[(X_i - \operatorname{E}[X_i])(X_j - \operatorname{E}[X_j])\right]$$
Here, the operator $\operatorname{E}$ simply denotes the expected value (or mean) of its argument. In essence, the covariance measures how much $X_i$ and $X_j$ deviate from their respective means together. If they tend to rise and fall in tandem, the covariance will be positive. If one rises as the other falls, it will be negative. If they couldn't care less about each other's movements, it approaches zero.
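To make the definition concrete, here is a minimal sketch in NumPy; the sample data and variable names are illustrative, not taken from the text above. It computes the entry-wise definition directly from centered samples and checks the result against `numpy.cov`.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 3))   # 1000 observations of a 3-dimensional random vector X

mean = samples.mean(axis=0)            # E[X], estimated by the sample mean
centered = samples - mean              # X - E[X]

# K[i, j] = E[(X_i - E[X_i]) (X_j - E[X_j])], estimated by averaging over the samples
K = centered.T @ centered / (len(samples) - 1)

# np.cov with rowvar=False treats columns as variables and applies the same n-1 divisor
assert np.allclose(K, np.cov(samples, rowvar=False))
```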
Conflicting nomenclatures and notations
Ah, the glorious inconsistency of human endeavor. Nomenclatures, as one might expect, differ. Some statisticians, perhaps following the venerable probabilist William Feller and his exhaustive two-volume work, "An Introduction to Probability Theory and Its Applications" [2], prefer to call the matrix $\operatorname{K}_{\mathbf{X}\mathbf{X}}$ the variance of the random vector $\mathbf{X}$. Their reasoning? It's the most natural, multi-dimensional generalization of the 1-dimensional variance. A logical leap, if you're into that sort of thing.
Others, however, insist on calling it the covariance matrix, because, well, it's quite literally a matrix composed of covariances between the scalar components of the vector $\mathbf{X}$. One can see their point; it’s less poetic, perhaps, but certainly more direct. In this interpretation, the variance of the vector itself would be expressed as:

$$\operatorname{var}(\mathbf{X}) = \operatorname{cov}(\mathbf{X}, \mathbf{X}) = \operatorname{E}\!\left[(\mathbf{X} - \operatorname{E}[\mathbf{X}])(\mathbf{X} - \operatorname{E}[\mathbf{X}])^{\mathsf{T}}\right]$$
Both forms, rather annoyingly for anyone seeking definitive answers, are quite standard, and usually, there isn't much ambiguity once you're immersed in the context. However, for those just starting, it's a delightful little trap. The matrix is also frequently dubbed the variance-covariance matrix, which, one might argue, is merely an attempt to appease both factions by including both terms. After all, the diagonal elements are indeed variances, so it’s not entirely inaccurate.
For comparison, and to further illustrate the nuanced distinctions that keep statisticians employed, the notation for the cross-covariance matrix between two distinct vectors, $\mathbf{X}$ and $\mathbf{Y}$, is typically given as:

$$\operatorname{cov}(\mathbf{X}, \mathbf{Y}) = \operatorname{K}_{\mathbf{X}\mathbf{Y}} = \operatorname{E}\!\left[(\mathbf{X} - \operatorname{E}[\mathbf{X}])(\mathbf{Y} - \operatorname{E}[\mathbf{Y}])^{\mathsf{T}}\right]$$
This measures how the components of $\mathbf{X}$ co-vary with the components of $\mathbf{Y}$, a distinctly different, though related, concept. It's almost as if they want you to distinguish between self-relationship and inter-relationship. Fascinating.
Properties
Now, let's delve into the inherent characteristics that define these matrices. These aren't mere suggestions; they are fundamental truths that dictate how covariance matrices behave.
Relation to the autocorrelation matrix
The auto-covariance matrix, $\operatorname{K}_{\mathbf{X}\mathbf{X}}$, which describes the internal relationships within a single random vector, is intimately connected to the autocorrelation matrix, $\operatorname{R}_{\mathbf{X}\mathbf{X}}$. The relationship is rather straightforward, once you strip away the layers of abstraction:

$$\operatorname{K}_{\mathbf{X}\mathbf{X}} = \operatorname{E}\!\left[(\mathbf{X} - \operatorname{E}[\mathbf{X}])(\mathbf{X} - \operatorname{E}[\mathbf{X}])^{\mathsf{T}}\right] = \operatorname{R}_{\mathbf{X}\mathbf{X}} - \operatorname{E}[\mathbf{X}]\operatorname{E}[\mathbf{X}]^{\mathsf{T}}$$
Here, the autocorrelation matrix itself is defined as the expected value of the outer product of the vector with itself:

$$\operatorname{R}_{\mathbf{X}\mathbf{X}} = \operatorname{E}\!\left[\mathbf{X}\mathbf{X}^{\mathsf{T}}\right]$$
Essentially, the auto-covariance matrix can be viewed as the autocorrelation matrix minus the outer product of the mean vector with itself. This subtraction effectively centers the data, removing the influence of the mean from the measure of variability. It's like observing how people interact after you've accounted for their average disposition.
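As a quick numerical illustration of this relationship, the following sketch (with simulated data; the mixing matrix is an arbitrary choice) checks that the auto-covariance matrix equals the autocorrelation matrix minus the outer product of the mean with itself.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])  # correlated 2-D samples

mu = X.mean(axis=0)            # estimate of E[X]
R = X.T @ X / len(X)           # autocorrelation matrix E[X X^T] (sample estimate)
K = R - np.outer(mu, mu)       # auto-covariance via the stated identity

# Same result as the direct sample covariance with a 1/n divisor
assert np.allclose(K, np.cov(X, rowvar=False, bias=True))
```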
Relation to the correlation matrix
Further information: Correlation matrix
Another entity, closely related and often confused with the covariance matrix, is the matrix of Pearson product-moment correlation coefficients. This matrix quantifies the linear dependence between each pair of random variables within the random vector $\mathbf{X}$, but in a standardized, dimensionless way. It can be expressed elegantly, or perhaps tortuously, as:

$$\operatorname{corr}(\mathbf{X}) = \big(\operatorname{diag}(\operatorname{K}_{\mathbf{X}\mathbf{X}})\big)^{-\frac{1}{2}} \, \operatorname{K}_{\mathbf{X}\mathbf{X}} \, \big(\operatorname{diag}(\operatorname{K}_{\mathbf{X}\mathbf{X}})\big)^{-\frac{1}{2}}$$
where $\operatorname{diag}(\operatorname{K}_{\mathbf{X}\mathbf{X}})$ is a diagonal matrix constructed solely from the diagonal elements of $\operatorname{K}_{\mathbf{X}\mathbf{X}}$. These diagonal elements, as we've established, are the variances of the individual $X_i$ for $i = 1, \ldots, n$. This operation effectively scales the covariance matrix by the inverse of the standard deviations of its components, stripping away the units and magnitude of variance to reveal pure linear association.
Alternatively, and perhaps more intuitively for some, the correlation matrix can be conceptualized as the covariance matrix of the standardized random variables. That is, if you take each $X_i$ and divide it by its own standard deviation, $\sigma(X_i)$, then compute the covariance matrix of these newly "standardized" variables, you would arrive at the correlation matrix:

$$\operatorname{corr}(\mathbf{X}) = \operatorname{cov}\!\left(\left(\tfrac{X_1}{\sigma(X_1)}, \ldots, \tfrac{X_n}{\sigma(X_n)}\right)^{\mathsf{T}}\right)$$
Notice the elegant simplicity: each element on the principal diagonal of a correlation matrix is precisely the correlation of a random variable with itself, which, in a truly unsurprising turn of events, always equals 1. Furthermore, every off-diagonal element, representing the correlation between two different variables, will always fall within the range of −1 and +1, inclusive. This standardization makes correlations easily comparable across different scales, a small mercy in a world of complex metrics.
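A small worked example, assuming an arbitrary 3×3 covariance matrix, showing the rescaling that turns a covariance matrix into a correlation matrix:

```python
import numpy as np

K = np.array([[4.0, 2.0, 0.5],
              [2.0, 9.0, -1.2],
              [0.5, -1.2, 1.0]])          # example covariance matrix (assumed)

d = np.sqrt(np.diag(K))                   # standard deviations sigma(X_i)
corr = K / np.outer(d, d)                 # diag(K)^(-1/2) K diag(K)^(-1/2)

print(np.diag(corr))                      # all ones
print(corr.min(), corr.max())             # extremes of the matrix; all entries lie in [-1, 1]
```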
Inverse of the covariance matrix
The inverse of this matrix, $\operatorname{K}_{\mathbf{X}\mathbf{X}}^{-1}$, assuming it deigns to exist (a non-trivial assumption for many real-world datasets), is known as the inverse covariance matrix or, most commonly, the precision matrix (also called the concentration matrix) [3]. This matrix isn't just a mathematical curiosity; it holds profound implications, particularly in the realm of Gaussian graphical models, where its non-zero elements directly indicate conditional dependencies between variables.
Just as the covariance matrix can be understood as the rescaling of a correlation matrix by the marginal standard deviations, demonstrating how individual spreads influence joint variability:

$$\operatorname{K}_{\mathbf{X}\mathbf{X}} = \operatorname{diag}\!\big(\sigma(X_1), \ldots, \sigma(X_n)\big)\, \operatorname{corr}(\mathbf{X})\, \operatorname{diag}\!\big(\sigma(X_1), \ldots, \sigma(X_n)\big)$$
So too, using the rather elegant concepts of partial correlation and partial variance, the inverse covariance matrix can be expressed analogously, with partial standard deviations taking the place of marginal ones and (negated) partial correlations taking the place of correlations. In particular, the off-diagonal entries of the precision matrix $P = \operatorname{K}_{\mathbf{X}\mathbf{X}}^{-1}$ encode the partial correlations via $\rho_{X_i X_j \mid \text{rest}} = -P_{ij}/\sqrt{P_{ii} P_{jj}}$. This reveals how variables correlate after accounting for the influence of other variables, a much cleaner, though more complex, view of their interdependencies.
This rather profound duality between marginalizing (ignoring other variables) and conditioning (accounting for other variables) for Gaussian random variables motivates a host of other dualities in multivariate analysis, suggesting a deeper, more elegant structure beneath the surface. It's almost... beautiful. Almost.
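A brief sketch of the conditioning side of this duality, under the standard Gaussian graphical-model convention that rescaled off-diagonal entries of the precision matrix give negated partial correlations (the example covariance matrix is arbitrary):

```python
import numpy as np

K = np.array([[4.0, 2.0, 0.5],
              [2.0, 9.0, -1.2],
              [0.5, -1.2, 1.0]])           # example covariance matrix (assumed)

P = np.linalg.inv(K)                       # precision (concentration) matrix
d = np.sqrt(np.diag(P))
partial_corr = -P / np.outer(d, d)         # rho_{ij | rest} = -P_ij / sqrt(P_ii P_jj)
np.fill_diagonal(partial_corr, 1.0)        # the diagonal is 1 by convention

print(partial_corr)
```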
Basic properties
Let's distill the fundamental truths about the covariance matrix. For $\operatorname{K}_{\mathbf{X}\mathbf{X}} = \operatorname{var}(\mathbf{X})$ and $\boldsymbol{\mu}_{\mathbf{X}} = \operatorname{E}[\mathbf{X}]$, where $\mathbf{X} = (X_1, \ldots, X_n)^{\mathsf{T}}$ is an $n$-dimensional random variable, the following properties are not merely suggestions, but mathematical dictates [4]:
- $\operatorname{K}_{\mathbf{X}\mathbf{X}} = \operatorname{E}[\mathbf{X}\mathbf{X}^{\mathsf{T}}] - \boldsymbol{\mu}_{\mathbf{X}}\boldsymbol{\mu}_{\mathbf{X}}^{\mathsf{T}}$
This property is a convenient computational identity, allowing one to calculate the covariance matrix using the expected value of the outer product of the vector itself and subtracting the outer product of its mean. It’s a common shortcut, assuming you actually want to compute these things.
- $\operatorname{K}_{\mathbf{X}\mathbf{X}}$ is positive-semidefinite, meaning that for any real vector $\mathbf{a} \in \mathbb{R}^{n}$: $\mathbf{a}^{\mathsf{T}} \operatorname{K}_{\mathbf{X}\mathbf{X}}\, \mathbf{a} \geq 0$
Proof: This isn't just an abstract mathematical flourish; it has concrete implications. Consider a linear transformation of the random variable $\mathbf{X}$ with covariance matrix $\Sigma$ by a linear operator $A$, such that $\mathbf{Y} = A\mathbf{X}$. The covariance matrix of the transformed variable is then itself transformed as:

$$\operatorname{cov}(A\mathbf{X}) = A\, \Sigma\, A^{\mathsf{T}}$$
Now, because property 3 (which we'll get to in a moment, patience) states that the matrix $\Sigma$ is symmetric, it can be diagonalized by an orthogonal linear transformation. This means there exists an orthogonal matrix $Q$ (where $Q^{\mathsf{T}} Q = I$), such that:

$$Q^{\mathsf{T}} \Sigma\, Q = \Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$$
where $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of $\Sigma$. This diagonalized matrix, $\Lambda$, is itself a covariance matrix, specifically for the random variable $\mathbf{Y} = Q^{\mathsf{T}}\mathbf{X}$. The elements along its main diagonal are the variances of the components of the vector $\mathbf{Y}$. Since variance, by its very definition, must always be non-negative (you can't have negative spread, can you?), we can confidently conclude that $\lambda_i \geq 0$ for any $i$. This, in turn, directly implies that the original matrix $\Sigma$ is positive-semidefinite. It's a neat little chain of logic, isn't it?
- $\operatorname{K}_{\mathbf{X}\mathbf{X}}$ is symmetric, meaning its transpose is itself: $\operatorname{K}_{\mathbf{X}\mathbf{X}}^{\mathsf{T}} = \operatorname{K}_{\mathbf{X}\mathbf{X}}$
This property arises directly from the definition of covariance, where $\operatorname{cov}[X_i, X_j] = \operatorname{cov}[X_j, X_i]$. The order doesn't matter, just like with most truly fundamental relationships.
- For any constant (i.e., non-random) $m \times n$ matrix $A$ and constant $m \times 1$ vector $\mathbf{a}$, the covariance matrix transforms linearly: $\operatorname{var}(A\mathbf{X} + \mathbf{a}) = A\, \operatorname{var}(\mathbf{X})\, A^{\mathsf{T}}$
This property is incredibly useful. It shows how the variability of a linearly transformed random vector relates to the original. The constant vector $\mathbf{a}$ effectively shifts the mean but doesn't alter the spread or shape of the distribution, hence it vanishes from the variance calculation. The matrix $A$, however, warps and rotates the distribution, and its effect is precisely captured by this formula (a numerical sketch of this property appears after this list).
- If $\mathbf{Y}$ is another random vector with the same dimension as $\mathbf{X}$, then the variance of their sum is: $\operatorname{var}(\mathbf{X} + \mathbf{Y}) = \operatorname{var}(\mathbf{X}) + \operatorname{cov}(\mathbf{X}, \mathbf{Y}) + \operatorname{cov}(\mathbf{Y}, \mathbf{X}) + \operatorname{var}(\mathbf{Y})$
where $\operatorname{cov}(\mathbf{X}, \mathbf{Y})$ is the cross-covariance matrix of $\mathbf{X}$ and $\mathbf{Y}$. This simply generalizes the scalar variance sum rule ($\operatorname{var}(X + Y) = \operatorname{var}(X) + 2\operatorname{cov}(X, Y) + \operatorname{var}(Y)$) to multiple dimensions. If $\mathbf{X}$ and $\mathbf{Y}$ are uncorrelated, the cross-covariance terms vanish, simplifying the expression significantly. But then, life is rarely that simple, is it?
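The numerical sketch promised above checks the affine-transformation property on simulated data; the matrix A, the shift a, and the underlying covariance are all arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=[0.0, 0.0, 0.0],
                            cov=[[2.0, 0.5, 0.0],
                                 [0.5, 1.0, 0.3],
                                 [0.0, 0.3, 1.5]],
                            size=10_000)

A = np.array([[1.0, -2.0, 0.5],
              [0.0,  1.0, 3.0]])   # constant 2x3 matrix (arbitrary)
a = np.array([10.0, -7.0])         # constant shift; drops out of the covariance

Y = X @ A.T + a                    # samples of A X + a
K_X = np.cov(X, rowvar=False)
K_Y = np.cov(Y, rowvar=False)

# var(A X + a) = A var(X) A^T; the identity also holds exactly for sample covariances
print(np.allclose(K_Y, A @ K_X @ A.T))
```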
Block matrices
When dealing with two distinct random vectors, say $\mathbf{X}$ and $\mathbf{Y}$, it's often convenient to consider their combined behavior. The joint mean $\boldsymbol{\mu}$ and joint covariance matrix $\boldsymbol{\Sigma}$ of $\mathbf{X}$ and $\mathbf{Y}$ can be elegantly organized into a block form, a structure that reveals their interdependencies at a glance:

$$\boldsymbol{\mu} = \begin{bmatrix} \boldsymbol{\mu}_{\mathbf{X}} \\ \boldsymbol{\mu}_{\mathbf{Y}} \end{bmatrix}, \qquad \boldsymbol{\Sigma} = \begin{bmatrix} \operatorname{K}_{\mathbf{X}\mathbf{X}} & \operatorname{K}_{\mathbf{X}\mathbf{Y}} \\ \operatorname{K}_{\mathbf{Y}\mathbf{X}} & \operatorname{K}_{\mathbf{Y}\mathbf{Y}} \end{bmatrix}$$
In this arrangement:
- $\operatorname{K}_{\mathbf{X}\mathbf{X}}$ represents the covariance matrix for $\mathbf{X}$ itself.
- $\operatorname{K}_{\mathbf{Y}\mathbf{Y}}$ similarly represents the covariance matrix for $\mathbf{Y}$.
- $\operatorname{K}_{\mathbf{X}\mathbf{Y}}$ is the cross-covariance matrix between $\mathbf{X}$ and $\mathbf{Y}$. Note the symmetry: the covariance of $\mathbf{Y}$ with $\mathbf{X}$ is the transpose of the covariance of $\mathbf{X}$ with $\mathbf{Y}$, i.e. $\operatorname{K}_{\mathbf{Y}\mathbf{X}} = \operatorname{K}_{\mathbf{X}\mathbf{Y}}^{\mathsf{T}}$. This makes sense, as the underlying scalar covariances are symmetric.
The diagonal blocks, $\operatorname{K}_{\mathbf{X}\mathbf{X}}$ and $\operatorname{K}_{\mathbf{Y}\mathbf{Y}}$, can be identified as the variance matrices of the marginal distributions for $\mathbf{X}$ and $\mathbf{Y}$ respectively. They tell you about the variability within each vector independently. The off-diagonal blocks, $\operatorname{K}_{\mathbf{X}\mathbf{Y}}$ and $\operatorname{K}_{\mathbf{Y}\mathbf{X}}$, are where the real story often lies, detailing how the components of $\mathbf{X}$ interact with the components of $\mathbf{Y}$.
If these vectors, $\mathbf{X}$ and $\mathbf{Y}$, are jointly normally distributed, which is a common and often convenient assumption in many statistical models (though reality rarely obliges with such neatness), then the conditional distribution for $\mathbf{Y}$ given $\mathbf{X}$ adheres to a remarkably precise form [5]:

$$\mathbf{Y} \mid \mathbf{X} = \mathbf{x} \;\sim\; \mathcal{N}\!\left(\boldsymbol{\mu}_{\mathbf{Y}\mid\mathbf{X}},\; \operatorname{K}_{\mathbf{Y}\mid\mathbf{X}}\right)$$
This conditional distribution is defined by a conditional mean:

$$\boldsymbol{\mu}_{\mathbf{Y}\mid\mathbf{X}} = \boldsymbol{\mu}_{\mathbf{Y}} + \operatorname{K}_{\mathbf{Y}\mathbf{X}} \operatorname{K}_{\mathbf{X}\mathbf{X}}^{-1} \left(\mathbf{x} - \boldsymbol{\mu}_{\mathbf{X}}\right)$$
And a conditional variance:

$$\operatorname{K}_{\mathbf{Y}\mid\mathbf{X}} = \operatorname{K}_{\mathbf{Y}\mathbf{Y}} - \operatorname{K}_{\mathbf{Y}\mathbf{X}} \operatorname{K}_{\mathbf{X}\mathbf{X}}^{-1} \operatorname{K}_{\mathbf{X}\mathbf{Y}}$$
The matrix $\operatorname{K}_{\mathbf{Y}\mathbf{X}} \operatorname{K}_{\mathbf{X}\mathbf{X}}^{-1}$ is rather important; it's known as the matrix of regression coefficients. In the realm of linear algebra, the conditional variance is recognized as the Schur complement of $\operatorname{K}_{\mathbf{X}\mathbf{X}}$ in the larger joint covariance matrix $\boldsymbol{\Sigma}$. This connection highlights the deep algebraic roots of these statistical concepts.
It's worth noting that the matrix of regression coefficients is sometimes presented in its transpose form, $\operatorname{K}_{\mathbf{X}\mathbf{X}}^{-1} \operatorname{K}_{\mathbf{X}\mathbf{Y}}$. This alternative is particularly suitable for post-multiplying a row vector of explanatory variables $\mathbf{x}^{\mathsf{T}}$, rather than pre-multiplying a column vector $\mathbf{x}$. In this specific configuration, these coefficients directly correspond to those derived by inverting the matrix of the normal equations that arise in ordinary least squares (OLS) regression. It's a subtle distinction, but one that can save you considerable grief if you're actually implementing these things.
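A compact sketch of these conditioning formulas, with assumed block values for a 2-dimensional X and a 1-dimensional Y; it computes the matrix of regression coefficients, the conditional mean, and the Schur-complement conditional covariance.

```python
import numpy as np

K_XX = np.array([[2.0, 0.3],
                 [0.3, 1.0]])
K_YY = np.array([[1.5]])
K_XY = np.array([[0.8],
                 [0.2]])                   # cov(X, Y); K_YX is its transpose
K_YX = K_XY.T

mu_X = np.array([1.0, -1.0])
mu_Y = np.array([0.5])
x_obs = np.array([2.0, 0.0])               # an observed value of X

B = K_YX @ np.linalg.inv(K_XX)             # matrix of regression coefficients
mu_Y_given_X = mu_Y + B @ (x_obs - mu_X)   # conditional mean
K_Y_given_X = K_YY - B @ K_XY              # conditional covariance (Schur complement of K_XX)

print(mu_Y_given_X, K_Y_given_X)
```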
Partial covariance matrix
A covariance matrix where all elements are non-zero is, unfortunately, a common occurrence. It tells us that all the individual random variables within a vector are, to some extent, interrelated. This doesn't just mean direct correlation; it implies that variables might also be indirectly correlated through the influence of other variables. Often, such indirect, or common-mode, correlations are utterly trivial and, frankly, uninteresting. They're statistical noise, distracting from the true, meaningful relationships.
Fortunately, these extraneous correlations can be suppressed by calculating the partial covariance matrix. This matrix surgically extracts only the interesting part of the correlations, effectively holding the influence of the "uninteresting" variables constant.
If two vectors of random variables, $\mathbf{X}$ and $\mathbf{Y}$, are correlated via the confounding influence of another vector $\mathbf{I}$ (representing these "uninteresting" variables), the correlations mediated by $\mathbf{I}$ are mathematically removed to yield the partial covariance matrix [6]:

$$\operatorname{pcov}(\mathbf{X}, \mathbf{Y} \mid \mathbf{I}) = \operatorname{cov}(\mathbf{X}, \mathbf{Y}) - \operatorname{cov}(\mathbf{X}, \mathbf{I})\, \operatorname{cov}(\mathbf{I}, \mathbf{I})^{-1}\, \operatorname{cov}(\mathbf{I}, \mathbf{Y})$$
The partial covariance matrix is, in essence, the simple covariance matrix as if the uninteresting random variables in $\mathbf{I}$ were held perfectly constant. It offers a cleaner, more focused view of the relationship between $\mathbf{X}$ and $\mathbf{Y}$, free from the statistical echoes of other factors. It's like filtering out the background chatter to hear the actual conversation.
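A simulated example of the partial covariance formula: here X and Y are, by construction, linked only through the confounding vector I, so removing the I-mediated part leaves a matrix that is close to zero (all data and mixing matrices are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
I = rng.normal(size=(n, 2))                          # "uninteresting" common-mode variables
X = I @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(n, 3))
Y = I @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(n, 4))

def cov(a, b):
    """Sample cross-covariance matrix of two data sets (columns are variables)."""
    return (a - a.mean(0)).T @ (b - b.mean(0)) / (len(a) - 1)

pcov_XY = cov(X, Y) - cov(X, I) @ np.linalg.solve(cov(I, I), cov(I, Y))
print(np.abs(pcov_XY).max())    # close to zero: X and Y are only linked through I
```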
Standard deviation matrix
Main article: Standard deviation § Standard deviation matrix
The standard deviation, that familiar measure of spread for a single variable, also has a multi-dimensional counterpart. The standard deviation matrix extends this concept to multiple dimensions. It is, quite literally, the symmetric square root of the covariance matrix $\operatorname{K}_{\mathbf{X}\mathbf{X}}$. While the covariance matrix provides a comprehensive picture of joint variability, the standard deviation matrix offers a more direct, though less commonly used, representation of the individual spreads and their scaled relationships. It's a different lens through which to view the same underlying structure.
Covariance matrix as a parameter of a distribution
The covariance matrix isn't merely a descriptive statistic; it's a fundamental parameter that shapes entire probability distributions. If a column vector $\mathbf{X}$ of $n$ possibly correlated random variables is jointly normally distributed, or, more broadly, elliptically distributed, then its probability density function can be expressed in terms of the covariance matrix $\boldsymbol{\Sigma}$ in a rather elegant, if somewhat intimidating, form (written here for the jointly normal case) [6]:

$$f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^{n} \left|\boldsymbol{\Sigma}\right|}} \exp\!\left(-\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
Here, $\boldsymbol{\mu}$ is the mean vector of the random variables, and $\left|\boldsymbol{\Sigma}\right|$ is the determinant of $\boldsymbol{\Sigma}$. This determinant is often referred to as the generalized variance, as it provides a scalar measure of the overall variability of the multivariate distribution, encompassing the spread in all directions and their interdependencies. The term $(\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$ is essentially a squared Mahalanobis distance, measuring how far a given observation is from the mean in a way that accounts for the shape and orientation of the distribution defined by $\boldsymbol{\Sigma}$. This formula underscores the covariance matrix's central role in defining the very geometry of multivariate Gaussian and elliptical distributions. Without it, these distributions would simply be shapeless clouds.
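For the jointly normal case, the density formula and the squared Mahalanobis distance can be evaluated directly and, assuming SciPy is available, checked against `scipy.stats.multivariate_normal`; the mean vector and covariance matrix below are illustrative values only.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([0.5, -1.0])

diff = x - mu
maha_sq = diff @ np.linalg.solve(Sigma, diff)    # squared Mahalanobis distance
pdf = np.exp(-0.5 * maha_sq) / np.sqrt((2 * np.pi) ** len(mu) * np.linalg.det(Sigma))

assert np.isclose(pdf, multivariate_normal(mean=mu, cov=Sigma).pdf(x))
```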
Covariance matrix as a linear operator
Main article: Covariance operator
Beyond its role as a mere parameter, the covariance matrix can also be viewed as a powerful linear operator. When applied to a single vector $\mathbf{c}$, the covariance matrix possesses the remarkable ability to map a linear combination $\mathbf{c}^{\mathsf{T}}\mathbf{X}$ of the random variables onto a vector of covariances with those same variables:

$$\operatorname{K}_{\mathbf{X}\mathbf{X}}\,\mathbf{c} = \operatorname{cov}(\mathbf{X}, \mathbf{c}^{\mathsf{T}}\mathbf{X})$$
This means it tells you how a new variable, formed by a weighted sum of the original variables, relates to each of the original variables.
Treated as a bilinear form, its utility expands further, yielding the covariance between two distinct linear combinations:

$$\mathbf{d}^{\mathsf{T}} \operatorname{K}_{\mathbf{X}\mathbf{X}}\,\mathbf{c} = \operatorname{cov}(\mathbf{d}^{\mathsf{T}}\mathbf{X}, \mathbf{c}^{\mathsf{T}}\mathbf{X})$$
This is a more general statement, allowing you to understand the relationship between any two new variables created from the original set. The ultimate, rather self-referential, consequence of this is that the variance of a single linear combination $\mathbf{c}^{\mathsf{T}}\mathbf{X}$ is simply $\mathbf{c}^{\mathsf{T}} \operatorname{K}_{\mathbf{X}\mathbf{X}}\,\mathbf{c}$, which is its covariance with itself.
Similarly, the (pseudo-)inverse covariance matrix, if it exists, provides an inner product $\langle \mathbf{c}, \mathbf{d} \rangle = \mathbf{c}^{\mathsf{T}} \operatorname{K}_{\mathbf{X}\mathbf{X}}^{-1} \mathbf{d}$. This inner product, in turn, induces the Mahalanobis distance, a critical metric that quantifies the "unlikelihood" of a given vector relative to a distribution with mean $\boldsymbol{\mu}$ and covariance $\operatorname{K}_{\mathbf{X}\mathbf{X}}$. It’s a measure of distance that respects the underlying statistical structure, rather than just raw Euclidean separation. It's the difference between saying "that point is far away" and "that point is far away for this distribution."
Admissibility
The question of "admissibility" for a matrix boils down to whether it could actually be a covariance matrix for some random vector. This isn't a philosophical debate; it's a mathematical constraint. From basic property 4. discussed earlier, let $\mathbf{b}$ be a real-valued $n \times 1$ vector. Then, the variance of the linear combination $\mathbf{b}^{\mathsf{T}}\mathbf{X}$ is given by:

$$\operatorname{var}(\mathbf{b}^{\mathsf{T}}\mathbf{X}) = \mathbf{b}^{\mathsf{T}} \operatorname{var}(\mathbf{X})\, \mathbf{b}$$
Since $\mathbf{b}^{\mathsf{T}}\mathbf{X}$ is a real-valued random variable, its variance must, by definition, always be non-negative. This directly implies that a covariance matrix is, without exception, a positive-semidefinite matrix. If you encounter a matrix that isn't positive semi-definite, you can immediately dismiss it as a potential covariance matrix. It simply can't exist in that capacity.
The argument can be expanded for the skeptical:

$$\mathbf{b}^{\mathsf{T}} \operatorname{var}(\mathbf{X})\, \mathbf{b} = \operatorname{E}\!\left[\mathbf{b}^{\mathsf{T}} (\mathbf{X} - \operatorname{E}[\mathbf{X}]) (\mathbf{X} - \operatorname{E}[\mathbf{X}])^{\mathsf{T}} \mathbf{b}\right] = \operatorname{E}\!\left[\left(\mathbf{b}^{\mathsf{T}} (\mathbf{X} - \operatorname{E}[\mathbf{X}])\right)^{2}\right] \geq 0$$
The last inequality is a direct consequence of the fact that $\mathbf{b}^{\mathsf{T}} (\mathbf{X} - \operatorname{E}[\mathbf{X}])$ is a scalar quantity. Squaring any real scalar always yields a non-negative result, and the expected value of a non-negative quantity must also be non-negative. Thus, the argument stands.
Conversely, it's also true that every symmetric positive semi-definite matrix can be a covariance matrix. To illustrate this, suppose $M$ is a $p \times p$ symmetric positive-semidefinite matrix. Thanks to the finite-dimensional case of the spectral theorem, we know that $M$ possesses a nonnegative symmetric square root, which we can denote as $M^{1/2}$. Now, let $\mathbf{X}$ be any $p \times 1$ column vector-valued random variable whose covariance matrix is the $p \times p$ identity matrix (e.g., a vector of independent, standard normal variables). Then, consider the transformation $M^{1/2}\mathbf{X}$. Its variance is:

$$\operatorname{var}(M^{1/2}\mathbf{X}) = M^{1/2}\, \operatorname{var}(\mathbf{X})\, M^{1/2} = M^{1/2}\, I\, M^{1/2} = M$$
This demonstrates that any symmetric positive semi-definite matrix can indeed be realized as a covariance matrix. It’s a comforting thought, for those who seek comfort in mathematical completeness.
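The construction above translates almost line-for-line into code: take an (assumed) symmetric positive semi-definite matrix M, form its symmetric square root via the spectral theorem, and transform white noise with it; the sample covariance of the result approximates M.

```python
import numpy as np

M = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 2.0]])            # assumed symmetric PSD matrix

# Symmetric square root via the spectral theorem: M = Q diag(lam) Q^T
lam, Q = np.linalg.eigh(M)
M_half = Q @ np.diag(np.sqrt(lam)) @ Q.T

rng = np.random.default_rng(4)
Z = rng.normal(size=(200_000, 3))          # cov(Z) = identity
X = Z @ M_half.T                           # cov(X) = M^(1/2) I M^(1/2) = M

print(np.round(np.cov(X, rowvar=False), 2))   # approximately M
```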
Complex random vectors
Further information: Complex random vector § Covariance matrix and pseudo-covariance matrix
When we venture into the realm of complex scalar-valued random variables, the definitions, naturally, become a touch more intricate, though no less logical. The variance of a complex scalar-valued random variable $Z$ with an expected value $\mu_Z$ is conventionally defined using complex conjugation:

$$\operatorname{var}(Z) = \operatorname{E}\!\left[(Z - \mu_Z)\overline{(Z - \mu_Z)}\right] = \operatorname{E}\!\left[\left|Z - \mu_Z\right|^{2}\right]$$
where the complex conjugate of a complex number $z$ is denoted by $\overline{z}$. The crucial detail here is that the variance of a complex random variable, despite its complex components, remains a real number. This preserves its interpretation as a measure of spread.
If $\mathbf{Z} = (Z_1, \ldots, Z_n)^{\mathsf{T}}$ is a column vector composed of complex-valued random variables, then the conjugate transpose $\mathbf{Z}^{\mathsf{H}}$ is formed by both transposing and conjugating its elements. In this context, the product of a centered vector with its conjugate transpose, when subjected to the expectation operator, results in a square matrix known as the covariance matrix [7] : 293 :

$$\operatorname{K}_{\mathbf{Z}\mathbf{Z}} = \operatorname{cov}[\mathbf{Z}, \mathbf{Z}] = \operatorname{E}\!\left[(\mathbf{Z} - \boldsymbol{\mu}_{\mathbf{Z}})(\mathbf{Z} - \boldsymbol{\mu}_{\mathbf{Z}})^{\mathsf{H}}\right]$$
The matrix thus obtained will always be a Hermitian positive-semidefinite matrix [8]. This means its diagonal elements will be real numbers (the variances of the individual complex variables), while its off-diagonal elements will generally be complex numbers. This is a natural extension, maintaining the fundamental properties of covariance while accommodating the algebraic structure of complex numbers.
Properties
- The covariance matrix of complex random vectors is a Hermitian matrix, meaning $\operatorname{K}_{\mathbf{Z}\mathbf{Z}}^{\mathsf{H}} = \operatorname{K}_{\mathbf{Z}\mathbf{Z}}$. [1] : 179 This is the complex analogue of symmetry for real matrices.
- The diagonal elements of the covariance matrix are real. [1] : 179 As mentioned, these are the variances of the complex random variables, which must be real.
Pseudo-covariance matrix
For complex random vectors, there exists yet another flavor of second central moment, rather charmingly named the pseudo-covariance matrix (sometimes also called the relation matrix). It is defined slightly differently:

$$\operatorname{J}_{\mathbf{Z}\mathbf{Z}} = \operatorname{E}\!\left[(\mathbf{Z} - \boldsymbol{\mu}_{\mathbf{Z}})(\mathbf{Z} - \boldsymbol{\mu}_{\mathbf{Z}})^{\mathsf{T}}\right]$$
The key distinction here, in contrast to the standard covariance matrix defined just prior, is that the Hermitian transposition ($\mathsf{H}$) is replaced by a simple transposition ($\mathsf{T}$). This seemingly minor alteration has significant consequences. Its diagonal elements, unlike those of the standard covariance matrix, may be complex-valued. Furthermore, this matrix is a complex symmetric matrix, not necessarily Hermitian. It captures a different aspect of the complex variable's variability, particularly relevant when the complex random vector is not "proper" (i.e., its pseudo-covariance matrix is not zero). This matrix is often overlooked, but it's vital for a complete understanding of complex random processes.
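A short simulation, with a deliberately improper complex vector (an assumption made purely so the pseudo-covariance is nonzero), contrasting the Hermitian covariance matrix with the symmetric pseudo-covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
re = rng.normal(size=(n, 2))
im = 0.5 * re + 0.2 * rng.normal(size=(n, 2))   # correlated real/imaginary parts -> improper
Z = re + 1j * im
Zc = Z - Z.mean(axis=0)

K = Zc.T @ Zc.conj() / (n - 1)       # covariance matrix: Hermitian, real diagonal
J = Zc.T @ Zc / (n - 1)              # pseudo-covariance: complex symmetric, complex diagonal

print(np.allclose(K, K.conj().T))    # True (Hermitian)
print(np.allclose(J, J.T))           # True (symmetric)
print(np.diag(J))                    # generally complex, unlike np.diag(K)
```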
Estimation
In the messy real world, we rarely have the luxury of knowing the true, underlying covariance matrix. Instead, we must estimate it from observed data. If $\mathbf{M}_{\mathbf{X}}$ and $\mathbf{M}_{\mathbf{Y}}$ are centered data matrices—meaning their row means have already been subtracted—of dimensions $p \times n$ and $q \times n$ respectively (where $n$ is the number of observations and $p$, $q$ are the numbers of variables), then we can construct sample covariance matrices.
If the row means were themselves estimated from the data (the more common scenario), the sample covariance matrices $\mathbf{Q}_{\mathbf{X}\mathbf{X}}$ and $\mathbf{Q}_{\mathbf{X}\mathbf{Y}}$ are defined as:

$$\mathbf{Q}_{\mathbf{X}\mathbf{X}} = \frac{1}{n-1} \mathbf{M}_{\mathbf{X}} \mathbf{M}_{\mathbf{X}}^{\mathsf{T}}, \qquad \mathbf{Q}_{\mathbf{X}\mathbf{Y}} = \frac{1}{n-1} \mathbf{M}_{\mathbf{X}} \mathbf{M}_{\mathbf{Y}}^{\mathsf{T}}$$
The division by $n - 1$ rather than $n$ is known as Bessel's correction, a small but important detail to ensure an unbiased estimator of the population covariance matrix.
However, if the row means were known a priori (a rare but occasionally useful simplification), then the division by $n$ is appropriate:

$$\mathbf{Q}_{\mathbf{X}\mathbf{X}} = \frac{1}{n} \mathbf{M}_{\mathbf{X}} \mathbf{M}_{\mathbf{X}}^{\mathsf{T}}, \qquad \mathbf{Q}_{\mathbf{X}\mathbf{Y}} = \frac{1}{n} \mathbf{M}_{\mathbf{X}} \mathbf{M}_{\mathbf{Y}}^{\mathsf{T}}$$
These empirical sample covariance matrices are the most straightforward and, predictably, the most frequently used estimators. However, they are not without their flaws, particularly in high-dimensional settings where $p$ or $q$ approaches $n$. In such cases, these "naive" estimators can be unstable or even singular. Consequently, other, more sophisticated estimators exist, including regularized or shrinkage estimators, which often possess superior statistical properties by trading a small amount of bias for a significant reduction in variance. It's a pragmatic compromise for a less-than-ideal world.
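As a sketch of that trade-off, assuming scikit-learn is available, the following compares the empirical estimator with a Ledoit–Wolf shrinkage estimator in a regime where the number of variables is close to the number of observations (the dimensions and the true covariance are invented for illustration).

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(6)
p, n = 40, 50                               # nearly as many variables as observations
true_cov = np.diag(np.linspace(1.0, 3.0, p))
X = rng.multivariate_normal(np.zeros(p), true_cov, size=n)

emp = np.cov(X, rowvar=False)               # empirical (Bessel-corrected) estimator
lw = LedoitWolf().fit(X).covariance_        # shrinkage estimator: biased but better conditioned

print(np.linalg.cond(emp), np.linalg.cond(lw))   # shrinkage gives a much smaller condition number
```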
Applications
The covariance matrix, for all its mathematical austerity, is an astonishingly versatile tool, finding its way into a multitude of diverse fields. It's almost as if some people need it.
From this matrix, a crucial transformation matrix can be derived, known as a whitening transformation. This transformation serves two primary, and often interconnected, purposes: it allows one to completely decorrelate the data [9], essentially making each variable statistically independent of the others. Alternatively, from a different perspective, it enables the identification of an optimal basis for representing the data in an exceedingly compact and efficient manner. (For a deeper dive into this, and for additional properties of covariance matrices, one might consult the Rayleigh quotient.) This entire process is famously known as principal component analysis (PCA) and the Karhunen–Loève transform (KL-transform). These techniques are foundational in data compression, noise reduction, and pattern recognition, effectively stripping away redundant information to reveal the underlying structure.
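A minimal sketch of a whitening transformation derived from the covariance matrix (ZCA-style, one of several equally valid choices; the data are simulated): after applying $K^{-1/2}$ the sample covariance is approximately the identity, and keeping only the leading eigenvectors of $K$ would correspond to PCA.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 1.0]], size=100_000)

Xc = X - X.mean(axis=0)
K = np.cov(Xc, rowvar=False)

lam, Q = np.linalg.eigh(K)                  # eigendecomposition of the covariance matrix
W = Q @ np.diag(1.0 / np.sqrt(lam)) @ Q.T   # ZCA-style whitening matrix K^(-1/2)
X_white = Xc @ W.T

print(np.round(np.cov(X_white, rowvar=False), 3))   # approximately the identity matrix
```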
The covariance matrix also plays an utterly central role in financial economics, particularly within the hallowed halls of portfolio theory. Its implications extend to the mutual fund separation theorem and, perhaps most famously, the capital asset pricing model (CAPM). Here, the matrix of covariances among the returns of various financial assets is indispensable. Under specific, often idealized, assumptions, this matrix is used to determine the relative proportions of different assets that investors should (in a normative analysis) or are predicted to (in a positive analysis) choose to hold. This is all in the noble pursuit of diversification, of course, aiming to mitigate risk by not putting all one's eggs in a single, volatile basket. Without the covariance matrix, modern portfolio management would be little more than guesswork.
Use in optimization
Even in the realm of optimization, the covariance matrix asserts its undeniable presence. The evolution strategy, a particular and rather clever family of Randomized Search Heuristics, fundamentally relies on an evolving covariance matrix within its core mechanism. The characteristic mutation operator, which dictates how potential solutions are perturbed, draws its update step from a multivariate normal distribution parameterized by this adapting covariance matrix. There's even a formal proof demonstrating that the evolution strategy's covariance matrix effectively adapts to the inverse of the Hessian matrix of the search landscape, give or take a scalar factor and some small random fluctuations. This has been proven for single-parent strategies and static models, particularly as the population size grows, relying on a quadratic approximation of the landscape [10].
Intuitively, this result makes a certain amount of sense, if you're inclined towards such things. The optimal covariance distribution is one that can offer mutation steps whose equidensity probability contours align perfectly with the level sets of the optimization landscape. By doing so, it maximizes the rate of progress, essentially guiding the search efficiently along the contours of the objective function. It's a sophisticated way to avoid blindly stumbling through the search space.
Covariance mapping
Covariance mapping is a technique that takes the values of the $\operatorname{cov}(\mathbf{X}, \mathbf{Y})$ or $\operatorname{pcov}(\mathbf{X}, \mathbf{Y} \mid \mathbf{I})$ matrix and plots them as a 2-dimensional map. When the vectors $\mathbf{X}$ and $\mathbf{Y}$ represent discrete random functions, this map visually reveals the statistical relationships between different regions or components of these functions. Statistically independent regions of the functions conveniently manifest on the map as a rather dull, zero-level flatland. Conversely, positive or negative correlations spring forth as distinct hills or valleys, respectively, providing a topographic view of statistical dependencies.
In practical application, the column vectors $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{I}$ are typically acquired experimentally as $n$ samples. For instance, a collection of time-series measurements might look something like this:

$$\left[\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n\right] = \begin{bmatrix} X_1(t_1) & X_2(t_1) & \cdots & X_n(t_1) \\ X_1(t_2) & X_2(t_2) & \cdots & X_n(t_2) \\ \vdots & \vdots & \ddots & \vdots \\ X_1(t_m) & X_2(t_m) & \cdots & X_n(t_m) \end{bmatrix}$$
where $X_j(t_i)$ denotes the $i$-th discrete value in the $j$-th sample of the random function $X(t)$. The expected values, so crucial for the covariance formula, are then estimated using the sample mean, a pragmatic approximation for the true, elusive mean:

$$\langle \mathbf{X} \rangle = \frac{1}{n} \sum_{j=1}^{n} \mathbf{X}_j$$
And the covariance matrix itself is estimated by the sample covariance matrix, which is a fairly direct, if sometimes imperfect, proxy:

$$\operatorname{cov}(\mathbf{X}, \mathbf{Y}) \approx \left\langle \mathbf{X}\mathbf{Y}^{\mathsf{T}} \right\rangle - \langle \mathbf{X} \rangle \left\langle \mathbf{Y}^{\mathsf{T}} \right\rangle$$
It's important to remember that the angular brackets here denote sample averaging. A small but critical detail: Bessel's correction should be applied to avoid bias in these estimates, a common pitfall for the unwary. With this estimation in hand, the partial covariance matrix can then be computed:

$$\operatorname{pcov}(\mathbf{X}, \mathbf{Y} \mid \mathbf{I}) = \operatorname{cov}(\mathbf{X}, \mathbf{Y}) - \operatorname{cov}(\mathbf{X}, \mathbf{I}) \left(\operatorname{cov}(\mathbf{I}, \mathbf{I}) \backslash \operatorname{cov}(\mathbf{I}, \mathbf{Y})\right)$$
Here, the backslash symbol ( \ ) represents the left matrix division operator. This operator is particularly useful as it bypasses the explicit requirement to invert a matrix, a numerically sensitive operation, and is readily available in many computational packages, such as Matlab [11]. It's a small convenience that prevents a great deal of computational headache.
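In NumPy the same computation can be written with `numpy.linalg.solve` standing in for the left-division operator, so that no explicit inverse is formed; the arrays below are simulated stand-ins for the experimental spectra and the monitored intensity.

```python
import numpy as np

rng = np.random.default_rng(8)
n_shots, m = 2_000, 64
I = rng.normal(size=(n_shots, 1))                         # e.g. monitored laser intensity
X = I * rng.normal(size=(1, m)) + rng.normal(size=(n_shots, m))
Y = X.copy()                                              # same spectra, as in the example

def scov(a, b):
    """Sample covariance <a b^T> - <a><b^T>, averaged over shots (Bessel-corrected)."""
    return (a - a.mean(0)).T @ (b - b.mean(0)) / (len(a) - 1)

pcov_map = scov(X, Y) - scov(X, I) @ np.linalg.solve(scov(I, I), scov(I, Y))
print(pcov_map.shape)                                      # the 2-D map to be plotted
```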
Figure 1: Construction of a partial covariance map of N molecules undergoing Coulomb explosion induced by a free-electron laser. [12] Panels a and b map the two terms of the covariance matrix, which is shown in panel c. Panel d maps common-mode correlations via intensity fluctuations of the laser. Panel e maps the partial covariance matrix that is corrected for the intensity fluctuations. Panel f shows that 10% overcorrection improves the map and makes ion-ion correlations clearly visible. Owing to momentum conservation these correlations appear as lines approximately perpendicular to the autocorrelation line (and to the periodic modulations which are caused by detector ringing).
Fig. 1 offers a concrete illustration of how a partial covariance map is constructed, drawing from an experiment conducted at the FLASH free-electron laser in Hamburg [12]. In this particular scenario, the random function $X(t)$ represents the time-of-flight spectrum of ions resulting from a Coulomb explosion of nitrogen molecules, which were multiply ionized by a laser pulse. Given that only a few hundred molecules are ionized during each laser pulse, the single-shot spectra, $\mathbf{X}_j$, are inherently highly fluctuating – a chaotic mess, if you will. However, by diligently collecting a typically large number of such spectra and then averaging them over $j$, one can produce a smooth, legible average spectrum $\langle \mathbf{X} \rangle$, which is helpfully depicted in red at the bottom of Fig. 1. This average spectrum reveals the presence of several nitrogen ions, appearing as peaks broadened by their kinetic energy. However, merely seeing peaks isn't enough; to truly unravel the intricate correlations between the ionization stages and the ion momenta, the full power of a covariance map is required.
In this example, the spectra $\mathbf{X}_j$ and $\mathbf{Y}_j$ are essentially the same, with the only difference being the range of the time-of-flight being considered. Panel a displays $\langle \mathbf{X}\mathbf{Y}^{\mathsf{T}} \rangle$, while panel b shows $\langle \mathbf{X} \rangle \langle \mathbf{Y}^{\mathsf{T}} \rangle$. Panel c, crucially, shows their difference, which is precisely $\operatorname{cov}(\mathbf{X}, \mathbf{Y})$ (note the necessary adjustment in the color scale to appreciate the detail). Regrettably, this raw covariance map is often entirely overwhelmed by uninteresting, common-mode correlations. These are typically induced by the laser intensity fluctuating unpredictably from shot to shot, obscuring any meaningful signal. To suppress such trivial correlations, the laser intensity is meticulously recorded at every shot, aggregated into the vector $\mathbf{I}$, and then $\operatorname{pcov}(\mathbf{X}, \mathbf{Y} \mid \mathbf{I})$ is calculated, as demonstrated in panels d and e. While this partial covariance correction does indeed suppress the uninteresting correlations, the suppression is, alas, often imperfect. This is because other sources of common-mode fluctuations inevitably exist beyond just the laser intensity, and in principle, all of these should ideally be monitored and included in the vector $\mathbf{I}$. However, in practice, it is frequently sufficient to overcompensate the partial covariance correction. As panel f vividly illustrates, a 10% overcorrection dramatically improves the map, rendering the genuinely interesting correlations of ion momenta clearly visible as distinct straight lines centered on the ionization stages of atomic nitrogen. These lines, owing to momentum conservation, appear approximately perpendicular to the autocorrelation line, and also to the periodic modulations caused by detector ringing. It's a rather elegant way to cut through the noise and reveal the underlying physical phenomena.
Two-dimensional infrared spectroscopy
Two-dimensional infrared spectroscopy, a technique employed to probe the dynamics of the condensed phase, makes extensive use of correlation analysis to generate 2D spectra. There exist two primary versions of this analysis: synchronous and asynchronous. Mathematically, the synchronous version is directly expressed in terms of the sample covariance matrix, making the technique entirely equivalent to the covariance mapping approach discussed above [13]. This highlights how fundamental the concept of covariance is, finding application in highly specialized fields of physical chemistry.