Indicator of how well data points fit a line or curve
Not to be confused with Coefficient of variation.
In the realm of statistics , where order is often sought amidst chaos, the coefficient of determination, typically denoted as R2 or r2 and pronounced “R squared,” emerges as a fundamental metric. At its core, it quantifies the proportion of the total variance observed in a dependent variable that can be explained or predicted by the independent variable or variables within a given statistical model . It’s a measure, if you must know, of how well the chosen model manages to capture the underlying patterns in your data, rather than just observing random fluctuations.
This particular statistic finds its primary application in contexts where the overarching objective is either the prediction of future outcomes – peering into the murky crystal ball of data – or the rigorous testing of hypotheses , all based on other related, presumably relevant, information. Essentially, R2 offers a concise, albeit sometimes misleading, indication of how closely the observed outcomes are replicated by the predictions generated from the model. This assessment is fundamentally rooted in the proportion of the total variation of these outcomes that the model successfully “explains,” or accounts for, rather than leaving to the whims of unexplained noise. [1] [2] [3]
It’s worth noting that the term “R2” isn’t a monolithic entity; several definitions exist, and they are, rather inconveniently, only sometimes equivalent. In the simplest scenario, that of simple linear regression – a model that includes both an intercept and a single predictor – r2 is straightforwardly the square of the sample correlation coefficient (r). This correlation is computed directly between the observed outcomes and their corresponding predicted values from the model. [4] Should one decide to complicate matters by including additional regressors (more independent variables), R2 then becomes the square of the coefficient of multiple correlation . In these typical, well-behaved scenarios, the coefficient of determination conventionally resides within the range of 0 to 1, inclusive, offering a seemingly intuitive scale of model performance.
However, the statistical universe, much like reality, is rarely so accommodating. There are, indeed, peculiar instances where R2 can inexplicably yield negative values. This rather unsettling outcome typically arises when the predictions being evaluated against the observed outcomes have not been derived from a model-fitting procedure that actually used those specific data points. Even if a formal model-fitting procedure was employed, R2 can still plunge into negativity. This might occur, for example, when a linear regression is performed without the inclusion of an intercept term [5] – a questionable decision, if you ask me – or when a non-linear function is used to fit the data in a manner that is fundamentally ill-suited. [6] In such unfortunate cases where negative values manifest, it’s a stark indicator that simply predicting the mean of the observed data would provide a better fit to the outcomes than the sophisticated, yet clearly inadequate, fitted function values, according to this particular criterion. A truly embarrassing revelation for any model.
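A short NumPy sketch (with made-up data; all variable names are illustrative, not any library's API) shows how omitting the intercept can drive R2 below zero:

```python
import numpy as np

# Hypothetical data: y has a large offset that a no-intercept line cannot capture.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 5.0 + 0.1 * x + rng.normal(0, 0.05, size=x.size)

# Least-squares slope for the through-origin model y ≈ b*x.
b = (x @ y) / (x @ x)
pred = b * x

ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(r2)  # strongly negative: the mean alone beats the through-origin fit
```

The through-origin line is forced to pass far below the data near x = 0, so its residuals dwarf the (tiny) total variation about the mean, and R2 collapses below zero.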
When one is evaluating the efficacy of a regression analysis , the coefficient of determination often proves more intuitively informative than its brethren like Mean Absolute Error (MAE) , Mean Absolute Percentage Error (MAPE) , Mean Squared Error (MSE) , and Root Mean Squared Error (RMSE) . This is largely because R2 can be conveniently expressed as a percentage – a concept even the statistically uninitiated can grasp – whereas those other measures operate within arbitrary, often uninterpretable, numerical ranges. Furthermore, R2 has demonstrated a surprising resilience, proving more robust in cases of poor fits compared to Symmetric Mean Absolute Percentage Error (SMAPE) across certain test datasets. [7] A minor victory, perhaps, but a victory nonetheless.
A critical nuance, often overlooked by the eager and the naive, arises when evaluating the goodness-of-fit of simulated values (Ypred) against their corresponding measured values (Yobs). It is fundamentally inappropriate, and frankly, a statistical blunder, to base this assessment solely on the R2 value derived from a simple linear regression where Yobs = m · Ypred + b. [citation needed] While R2 does indeed quantify the degree of any linear correlation between Yobs and Ypred, a proper goodness-of-fit evaluation demands consideration of only one specific linear correlation: the ideal 1:1 line, where Yobs = 1 · Ypred + 0. [8] [9] Anything less, or more, is simply missing the point.
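The distinction can be made concrete with a toy example (the arrays below are invented for illustration): predictions perfectly correlated with the observations, but biased, earn a squared correlation of 1 while faring terribly against the 1:1 line:

```python
import numpy as np

y_obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = 2.0 * y_obs  # perfect linear relation, poor 1:1 agreement

# Squared correlation: blind to the factor-of-two bias.
r = np.corrcoef(y_obs, y_pred)[0, 1]
print(round(r ** 2, 6))  # 1.0: the fitted-line view sees no problem

# R2 against the 1:1 line (y_obs = 1 * y_pred + 0) exposes the bias.
ss_res = np.sum((y_obs - y_pred) ** 2)
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
print(1 - ss_res / ss_tot)  # -4.5: far worse than predicting the mean
```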
Definitions
The core definition of R2, the one that most adequately encapsulates its purpose, hinges on the comparison of residuals.
$$R^{2}=1-{\frac {\color {blue}{SS_{\text{res}}}}{\color {red}{SS_{\text{tot}}}}}$$
In the usual illustration, the closer the linear regression (on the right) aligns with the data, particularly when contrasted with a simple average (on the left), the nearer the value of R2 will approach 1. The areas of the blue squares visually represent the squared residuals relative to the linear regression model, indicating the unexplained variance. Conversely, the areas of the red squares illustrate the squared residuals with respect to the simple average value, representing the total variance. A model that perfectly fits the data would have blue squares of zero area, making R2 equal to 1.
Consider a data set comprising n individual observations, denoted as y1, …, yn (or simply yi, or as a vector y = [y1, …, yn]T). Each of these observed values is paired with a corresponding fitted (or modeled, or predicted) value, f1, …, fn (referred to as fi, or sometimes ŷi, as a vector f).
The crucial element in quantifying model error is the residual , defined as the difference between the observed value and the fitted value: ei = yi − fi (collectively forming the vector e).
Now, if we consider $\bar{y}$ as the mean of the observed data:
$$\bar{y}={\frac {1}{n}}\sum _{i=1}^{n}y_{i}$$
Then the overall variability inherent in the data set can be meticulously measured using two fundamental sums of squares formulas:
The sum of squares of residuals (SSres), also known as the residual sum of squares : This quantifies the collective discrepancy between the observed data points and the values predicted by the model. It’s the sum of the squared errors, a direct measure of how much variation the model failed to explain.
$$SS_{\text{res}}=\sum _{i}(y_{i}-f_{i})^{2}=\sum _{i}e_{i}^{2},$$
The total sum of squares (SStot): This represents the total variability present in the dependent variable. It measures the sum of the squared differences between each observed data point and the overall mean of the observed data. It is, in essence, proportional to the variance of the data, providing a baseline for total variation that needs to be accounted for.
$$SS_{\text{tot}}=\sum _{i}(y_{i}-{\bar {y}})^{2}$$
With these components defined, the most generalized and widely accepted definition of the coefficient of determination is elegantly expressed as:
$$R^{2}=1-{SS_{\rm {res}} \over SS_{\rm {tot}}}$$
In the utopian scenario, where the modeled values perfectly align with the observed values, the residuals would all be zero, leading to $SS_{\text{res}}=0$. Consequently, R2 would attain its maximum value of 1, indicating a flawless fit. Conversely, a baseline model that consistently predicts the mean of the dependent variable (i.e., fi = $\bar{y}$) would result in SSres = SStot, and thus R2 = 0. This signifies that the model explains precisely zero proportion of the total variance, performing no better than a simple average. Any model achieving less than zero is, frankly, an active detriment.
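As a sketch of this definition, here is a minimal NumPy implementation (the helper name `r_squared` is mine, not a standard API), checked against the two boundary cases just described:

```python
import numpy as np

def r_squared(y_obs, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - np.mean(y_obs)) ** 2)
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y, y))          # 1.0: perfect fit, all residuals zero
print(r_squared(y, [2.5] * 4))  # 0.0: predicting the mean explains nothing
```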
Relation to unexplained variance
In a more general perspective, R2 can be understood as intrinsically linked to the fraction of variance unexplained (FVU). The ratio SSres/SStot in the R2 formula is exactly this fraction: it compares the unexplained variance (the variance of the model’s errors) against the total variance (the inherent variability within the data itself):
$$R^{2}=1-{\text{FVU}}$$
This equation succinctly highlights that R2 is merely the complement of the FVU. If the model explains 70% of the variance, then 30% remains unexplained. Simple, yet surprisingly often misunderstood.
As explained variance
A higher value of R2 is generally interpreted as a testament to a more successful regression model, suggesting it captures a greater proportion of the underlying data patterns. [4] :463 For instance, if one were to obtain an R2 of 0.49, this implies that 49% of the variability observed in the dependent variable within the dataset has been adequately accounted for by the model’s predictors. The remaining 51% of the variability, however, persists as unexplained noise or the influence of variables not included in the model.
For certain regression models, specifically those where the sum of squares can be partitioned, the regression sum of squares, also known as the explained sum of squares , is defined as:
$$SS_{\text{reg}}=\sum _{i}(f_{i}-{\bar {y}})^{2}$$
This term represents the portion of the total variability in the dependent variable that is successfully captured by the model’s predictions. In specific, well-behaved cases, such as simple linear regression or ordinary least squares (OLS) regression with an intercept, the total sum of squares conveniently equals the sum of the two other sums of squares we’ve discussed:
$$SS_{\text{res}}+SS_{\text{reg}}=SS_{\text{tot}}$$
For a detailed derivation of this result in a scenario where this relation holds, one might consult the section on Partitioning in the general OLS model . When this fundamental relation holds true, the earlier definition of R2 (using 1 - SSres/SStot) becomes elegantly equivalent to:
$$R^{2}={\frac {SS_{\text{reg}}}{SS_{\text{tot}}}}={\frac {SS_{\text{reg}}/n}{SS_{\text{tot}}/n}}$$
Here, n denotes the number of observations (or cases) across the variables. In this particular formulation, R2 is explicitly presented as the ratio of the explained variance – which is the variance of the model’s predictions, calculated as SSreg / n – to the total variance – which is the sample variance of the dependent variable, represented by SStot / n. This neatly illustrates R2 as the proportion of total variance accounted for by the model.
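The equivalence of the two formulations under OLS with an intercept can be checked numerically; the following NumPy sketch uses simulated data (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)

# OLS with an intercept via least squares on the design matrix [1, x].
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
f = X @ beta

ss_res = np.sum((y - f) ** 2)
ss_reg = np.sum((f - y.mean()) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)

print(np.isclose(ss_res + ss_reg, ss_tot))              # the partition holds
print(np.isclose(1 - ss_res / ss_tot, ss_reg / ss_tot)) # both R2 forms agree
```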
This convenient partitioning of the sum of squares is particularly valid when the model values fi have been derived through linear regression . A slightly milder, yet still sufficient condition , for this relationship to hold is as follows: the model must assume the form:
$$f_{i}={\widehat {\alpha }}+{\widehat {\beta }}q_{i}$$
where the qi values are arbitrary and may or may not depend on i or other free parameters (the common choice of qi = xi being merely a special case), and the coefficient estimates $\widehat{\alpha}$ and $\widehat{\beta}$ are obtained by the process of minimizing the residual sum of squares. This specific set of conditions is quite significant, as it leads to a number of predictable properties concerning the fitted residuals and the modeled values. Most notably, under these conditions:
$$\bar{f}=\bar{y}$$
This means the mean of the fitted values will exactly equal the mean of the observed values, a minor detail that, like most foundational principles, is often overlooked until something breaks.
As squared correlation coefficient
In the context of linear least squares multiple regression , specifically when the model includes both a fitted intercept and slope, R2 holds a special equivalence: it is precisely equal to $\rho ^{2}(y,f)$, which is the square of the Pearson correlation coefficient between the observed $y$ data values and the modeled (predicted) $f$ data values of the dependent variable. This relationship underscores its nature as a measure of linear association.
Furthermore, in a linear least squares regression with a single explanator – again, with both a fitted intercept and slope – R2 simplifies even further. In this specific scenario, it is also equal to $\rho ^{2}(y,x)$, which is the squared Pearson correlation coefficient between the dependent variable $y$ and the sole explanatory variable $x$.
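Both equivalences are straightforward to verify numerically. A NumPy sketch on simulated data (illustrative names throughout):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 1.0 + 0.5 * x + rng.normal(scale=0.5, size=200)

# Fit a line with intercept; np.polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)
f = intercept + slope * x

r2 = 1 - np.sum((y - f) ** 2) / np.sum((y - y.mean()) ** 2)
r_yx = np.corrcoef(y, x)[0, 1]  # correlation with the explanator
r_yf = np.corrcoef(y, f)[0, 1]  # correlation with the fitted values

print(np.isclose(r2, r_yx ** 2))  # True
print(np.isclose(r2, r_yf ** 2))  # True
```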
It is absolutely crucial not to confuse this with the correlation coefficient between two explanatory variables, which is defined as:
$$\rho _{{\widehat {\alpha }},{\widehat {\beta }}}={\operatorname {cov} \left({\widehat {\alpha }},{\widehat {\beta }}\right) \over \sigma _{\widehat {\alpha }}\sigma _{\widehat {\beta }}},$$
where the covariance between two coefficient estimates, along with their respective standard deviations , are extracted from the covariance matrix of those coefficient estimates, typically represented as $(X^{T}X)^{-1}$. This distinction is not merely academic; confusing these concepts leads to fundamental misinterpretations of your model.
Under more generalized modeling conditions, particularly when the predicted values might originate from a model that deviates from the standard linear least squares regression framework, an R2 value can still be computed. In such cases, it is typically calculated as the square of the correlation coefficient between the original $y$ observations and their corresponding modeled $f$ values. However, and this is a critical caveat, in this generalized context, the resulting R2 value does not directly serve as a measure of the absolute quality of the modeled values themselves. Instead, it functions more as an indicator of how effectively a revised predictor could be constructed from these modeled values (by creating a new predictor of the form α + βƒi). [citation needed] According to Everitt, [10] this specific usage aligns precisely with the definition of the term “coefficient of determination”: the square of the correlation between any two (general) variables. A useful distinction, if one bothers to remember it.
Interpretation
R2, in its essence, serves as a quantitative measure of the goodness of fit of a statistical model. [11] Within the realm of regression, the R2 coefficient of determination functions as a statistical gauge, indicating how accurately the model’s predictions approximate the actual, observed data points. An R2 value of 1 is the theoretical zenith, signifying that the regression predictions perfectly coincide with the data – a state rarely achieved outside of textbook examples or perfectly deterministic systems.
As previously mentioned, values of R2 existing outside the conventional range of 0 to 1 are not merely anomalies but rather statistical red flags. They typically occur when the model in question performs so poorly that its fit to the data is worse than the most rudimentary, “worst possible” least-squares predictor. This baseline, for context, is equivalent to a horizontal hyperplane positioned at a height equal to the mean of the observed data. Such a disastrous outcome usually points to a fundamental flaw: either an entirely inappropriate model was selected for the data, or, more embarrassingly, nonsensical constraints were inadvertently applied during the modeling process. Specifically, if Equation 1 from Kvålseth [12] (the most frequently used definition) is employed, R2 can fall below zero. Conversely, if Equation 2 from Kvålseth is utilized, R2 can, in rare circumstances, even exceed one. These are not signs of brilliance but rather indicators of a model that has wandered far from statistical sanity.
In virtually all scenarios where R2 is commonly employed, the predictors are derived through ordinary least-squares regression ; that is, by systematically minimizing the sum of squares of residuals (SSres). A crucial property of this method is that R2 inherently increases, or at least never decreases, as the number of variables incorporated into the model grows. R2 is monotone increasing with the number of variables included – it will never decrease. This inherent characteristic highlights a significant drawback to one common, yet misguided, application of R2: the temptation to relentlessly add variables in a “kitchen sink” approach (kitchen sink regression ) solely to inflate the R2 value. For example, if one endeavors to predict the sales of a car model based on factors like its gas mileage, price, and engine power, one might be tempted to include utterly irrelevant factors such as the first letter of the model’s name or the height of the lead engineer responsible for its design. The R2 will never decrease with these additions, and it will almost certainly experience a spurious increase due to sheer chance, giving a false sense of explanatory power.
This inherent tendency for R2 to inflate with added variables leads directly to the necessity of alternative approaches, most notably the adjusted R-squared .
In a multiple linear model
Consider a linear model that incorporates more than a single explanatory variable , taking the general form:
$$Y_{i}=\beta _{0}+\sum _{j=1}^{p}\beta _{j}X_{i,j}+\varepsilon _{i},$$
where, for any given ith observation: $Y_{i}$ represents the response variable; $X_{i,1},\dots ,X_{i,p}$ denote the p regressors (or explanatory variables); and $\varepsilon _{i}$ signifies a mean-zero error term, capturing the unexplained randomness. The quantities $\beta _{0},\dots ,\beta _{p}$ are the unknown coefficients, whose values are typically estimated through the method of least squares. In this multivariate context, the coefficient of determination R2 serves as a comprehensive measure of the overall goodness of fit of the entire model. More specifically, R2 is constrained within the interval [0, 1] and quantifies the proportion of variability observed in $Y_{i}$ that can be legitimately attributed to some linear combination of the regressors (the explanatory variables) contained within X. [13]
R2 is frequently, though sometimes loosely, interpreted as the proportion of the response variable’s variation that is “explained” by the regressors included in the model. Thus, an R2 = 1 would indicate that the fitted model perfectly explains all variability in $y$, leaving no room for unexplained error – a statistical fantasy. Conversely, an R2 = 0 suggests the complete absence of any ’linear’ relationship between the response variable and the regressors. In the specific case of straight-line regression, this implies that the best-fit model is simply a constant line (with a slope = 0 and an intercept = $\bar{y}$) – meaning your predictors are as useful as a screen door on a submarine. An intermediate value, such as R2 = 0.7, might be interpreted as follows: “Seventy percent of the variance in the response variable can be accounted for by the explanatory variables included in the model. The remaining thirty percent, regrettably, must be attributed to unknown factors, lurking variables , or simply inherent, irreducible variability within the system.”
A perennial caution, one that applies to R2 just as it does to all other statistical descriptions of correlation and association, is the enduring truth that “correlation does not imply causation .” While correlations can occasionally provide valuable clues in the arduous process of uncovering genuine causal relationships among variables, a non-zero estimated correlation between two variables is, by itself, insufficient evidence to claim that altering the value of one variable would directly result in changes in the values of the other. For instance, the practice of carrying matches (or a lighter) is statistically correlated with the incidence of lung cancer, but carrying matches does not, in the standard sense of “cause,” lead to cancer. It’s a common antecedent to smoking, which does cause cancer. Context, as always, is everything.
In the specific instance of a single regressor, when fitted by least squares , R2 is numerically equivalent to the square of the Pearson product-moment correlation coefficient relating that regressor and the response variable. More broadly, R2 represents the square of the correlation between the constructed predictor and the response variable. When a model incorporates more than one regressor, this R2 is often more precisely referred to as the coefficient of multiple determination .
Inflation of R2
As previously alluded to, in least squares regression, particularly when applied to typical datasets, the R2 value exhibits a disconcerting tendency to increase, or at least remain constant, with every additional regressor introduced into the model. This means that R2, taken in isolation, cannot serve as a reliable metric for comparing models that possess vastly different numbers of independent variables. To facilitate a more meaningful comparison between two competing models, one might consider performing an F-test on the residual sum of squares [citation needed] – a technique similar to the F-tests employed in Granger causality – though this approach is not universally appropriate [further explanation needed]. As a subtle reminder of this inherent inflationary bias, some authors prefer to denote R2 as Rq2, where q explicitly represents the number of columns in X (i.e., the number of explanators, including the constant term).
To rigorously demonstrate this property, recall that the fundamental objective of least squares linear regression is to minimize the sum of squares of residuals:
$$\min _{b}SS_{\text{res}}(b)\Rightarrow \min _{b}\sum _{i}(y_{i}-X_{i}b)^{2},$$
where $X_i$ is a row vector containing the values of the explanatory variables for the ith case, and $b$ is a column vector comprising the coefficients corresponding to the respective elements of $X_i$. The optimal value of this objective function will inherently be either smaller or, at worst, equal when more explanatory variables are introduced. This is because adding additional columns to $X$ (the explanatory data matrix whose ith row is $X_i$) effectively relaxes the minimization problem, allowing for a broader search space and thus a potentially better fit. A less constrained minimization problem will, by definition, always yield an optimal cost that is weakly smaller (or at least not larger) than a more constrained one. Given this conclusion, and noting that $SS_{\text{tot}}$ depends exclusively on the observed values of $y$ and is unaffected by the model’s complexity, the non-decreasing property of R2 follows directly from its definition.
The intuitive explanation for why an additional explanatory variable cannot lower the R2 is quite straightforward: the process of minimizing $SS_{\text{res}}$ is mathematically equivalent to maximizing R2. When an extra variable is introduced into the model, the optimization algorithm always retains the option of assigning an estimated coefficient of zero to this new variable. Should this occur, the predicted values and, consequently, the R2 value, would remain entirely unchanged. The only circumstance under which the optimization problem will yield a non-zero coefficient for the new variable is if doing so genuinely improves the R2. Thus, R2 can only increase or stay the same, never decrease.
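This never-decreasing behavior is easy to demonstrate. The sketch below (simulated data, illustrative names) adds a purely random, irrelevant regressor and confirms that R2 does not drop:

```python
import numpy as np

def ols_r2(X, y):
    """R^2 for an OLS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(3)
n = 60
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)
junk = rng.normal(size=n)  # an utterly irrelevant regressor

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, junk])
print(ols_r2(X_big, y) >= ols_r2(X_small, y))  # True: R2 can only rise or hold
```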
The preceding provides an analytical explanation for the inflation of R2. To further solidify this understanding, let’s consider a geometric perspective based on ordinary least squares regression. [14] This geometric view clearly illustrates how residuals change as models are constructed in spaces of increasing dimensionality.
Let’s begin with a simple case:
$$Y=\beta _{0}+\beta _{1}\cdot X_{1}+\varepsilon ,$$
This equation describes an ordinary least squares regression model with a single regressor. Geometrically, the prediction, often visualized as a red vector, represents the projection of the observed response vector onto the one-dimensional model subspace spanned by the regressor (ignoring the intercept for simplicity of visualization). The residual is then depicted as the red line, the orthogonal distance from the observed vector to that subspace.
Now, consider a more complex model:
$$Y=\beta _{0}+\beta _{1}\cdot X_{1}+\beta _{2}\cdot X_{2}+\varepsilon ,$$
This equation corresponds to an ordinary least squares regression model incorporating two regressors. The prediction is now represented by a blue vector, which is the projection of the observed response vector onto the larger, two-dimensional model subspace spanned by $X_{1}$ and $X_{2}$ (again, ignoring the intercept for geometric clarity). It’s important to note that the estimated values of $\beta_0$ and $\beta_1$ will generally not be identical to those in the single-regressor model, unless $X_{2}$ is orthogonal to $X_{1}$ (and to the intercept column). Therefore, these two equations are expected to yield distinct predictions (meaning the blue vector will differ from the red vector). The least squares regression criterion inherently guarantees that the residual is minimized. In the geometric representation, the blue line, which signifies the residual, is orthogonal to the two-dimensional model subspace, thereby representing the shortest possible distance from the observed vector to that subspace.
Crucially, the smaller model space (for the single regressor) is a subspace of the larger one (for two regressors). Consequently, the residual associated with the smaller model is mathematically guaranteed to be larger than or equal to the residual of the larger model. Visually comparing the red and blue lines in the figure, the blue line, being orthogonal to its space, represents the minimal distance, and any other line (like the red one, constrained to a smaller space) would necessarily be longer. Given the calculation for R2, where $SS_{tot}$ remains constant, a smaller value of $SS_{res}$ (as achieved by the more complex model) will inevitably lead to a larger value of R2. This geometric intuition directly confirms that adding regressors will, by the very nature of least squares optimization, result in an inflation of R2.
Caveats
R2, for all its widespread use, is far from a panacea and comes with a significant list of limitations that are often conveniently ignored. It does not indicate whether:
- The independent variables are, in fact, the cause of the observed changes in the dependent variable . Correlation is not causation, a lesson humanity seems destined to relearn perpetually.
- Omitted-variable bias exists, meaning crucial predictors might have been left out of your model, rendering the included variables’ apparent effects misleading.
- The correct regression model was chosen in the first place. You can fit a line to anything, but that doesn’t mean a line is the right shape.
- The most appropriate set of independent variables has been selected. You might have ten variables, but only two are truly relevant, or you might be missing the single most important one.
- There is collinearity present in the data on the explanatory variables, where your independent variables are so intertwined that the model struggles to distinguish their individual effects.
- The model could be significantly improved by employing transformed versions of the existing set of independent variables (e.g., using a logarithm instead of the raw value).
- There are sufficient data points to draw any solid, statistically sound conclusions. Too few data points can lead to R2 values that are wildly misleading.
- The presence of a few extreme outliers in an otherwise well-behaved sample might be distorting the fit, making a good model appear poor or vice-versa.
A standard illustration of this point compares the Theil–Sen estimator (black line) and simple linear regression (blue line) on a dataset riddled with outliers. Due to the overwhelming influence of these outliers, neither regression line manages to fit the data particularly well. This poor fit is directly reflected in the fact that neither method yields a very high R2. The R2 in this scenario correctly flags the inadequacy of these linear models, but it doesn’t tell you why they’re inadequate, nor does it suggest a better approach. It’s a symptom, not a diagnosis.
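The distorting effect of even a single outlier can be sketched as follows (fabricated data, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.arange(20, dtype=float)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=20)  # a genuinely linear relationship

def fit_r2(x, y):
    """R^2 of a least-squares line with intercept."""
    slope, intercept = np.polyfit(x, y, 1)
    f = intercept + slope * x
    return 1 - np.sum((y - f) ** 2) / np.sum((y - y.mean()) ** 2)

clean = fit_r2(x, y)       # near 1 for this well-behaved sample
y_out = y.copy()
y_out[10] = 200.0          # a single extreme outlier
print(clean > fit_r2(x, y_out))  # the outlier drags R^2 well down
```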
Extensions
Given the inherent limitations and occasional interpretive ambiguities of the basic R2, statisticians, in their endless pursuit of clarity, have developed several extensions. These variations aim to refine the measure, making it more robust or applicable to a broader range of modeling scenarios.
Adjusted R2
See also: Omega-squared (ω2)
The advent of the adjusted R2 (often denoted as $\bar{R}^{2}$, pronounced “R bar squared,” or sometimes $R_{\text{a}}^{2}$ or $R_{\text{adj}}^{2}$) represents a direct attempt to mitigate the aforementioned phenomenon of R2 automatically inflating with the inclusion of additional explanatory variables into a model. There exist numerous methodologies for this adjustment [15], but by far the most ubiquitous, to the extent that it is simply referred to as “adjusted R2,” is the correction originally proposed by Mordecai Ezekiel . [15] [16] [17]
The adjusted R2 is formally defined as:
$${\bar {R}}^{2}=1-{\frac {SS_{\text{res}}/{\text{df}}_{\text{res}}}{SS_{\text{tot}}/{\text{df}}_{\text{tot}}}}$$
Here, $df_{\text{res}}$ represents the degrees of freedom associated with the estimate of the population variance around the model, while $df_{\text{tot}}$ corresponds to the degrees of freedom for the estimate of the population variance around the mean. Specifically, $df_{\text{res}}$ is given by n − p − 1, where n is the sample size and p is the number of variables in the model (excluding the intercept). Similarly, $df_{\text{tot}}$ is n − 1, as p would be zero for a model predicting only the mean.
By substituting these degrees of freedom and leveraging the initial definition of R2, the adjusted R2 can be elegantly rewritten as:
$${\bar {R}}^{2}=1-(1-R^{2}){n-1 \over n-p-1}$$
where p signifies the total count of explanatory variables within the model (excluding the intercept term), and n denotes the sample size.
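A direct transcription of this formula, with a worked example (R2 = 0.8, n = 30, p = 5; the helper name is mine, not a standard API):

```python
def adjusted_r2(r2, n, p):
    """Ezekiel's correction: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# (1 - 0.8) * 29 / 24 = 0.24166..., so adjusted R^2 = 0.75833...
print(round(adjusted_r2(0.8, 30, 5), 4))  # 0.7583
```

Note the penalty: five regressors on thirty observations shave the headline 0.8 down to roughly 0.76.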
A notable characteristic of the adjusted R2 is that it can, unlike its unadjusted counterpart, yield negative values. Furthermore, its value will invariably be less than or equal to that of the standard R2. The crucial distinction lies in its behavior: unlike R2, the adjusted R2 will only increase if the gain in R2 (attributable to the inclusion of a new explanatory variable) is greater than what one would statistically anticipate seeing purely by chance. This makes it a more discerning metric. If a hierarchy of importance is assigned to a set of explanatory variables, and they are introduced into a regression model one at a time, calculating the adjusted R2 at each step, the point at which the adjusted R2 reaches its maximum before subsequently decreasing would indicate the regression model with the optimal balance – achieving the best fit without the burden of excessive or unnecessary terms. It’s a crude but effective way to combat the temptation of overfitting.
The schematic illustrating the bias and variance contribution to total error offers a conceptual framework for understanding the adjusted R2. The adjusted R2 can be interpreted as an embodiment of the bias-variance tradeoff , a fundamental concept in model evaluation. When assessing a model’s performance, a lower total error is, naturally, indicative of superior performance. As a model increases in complexity (e.g., by adding more parameters), its variance tends to increase, while the square of its bias typically decreases. These two components, variance and squared bias, sum to form the total error. The bias-variance tradeoff describes this relationship between model performance and complexity as a characteristic U-shaped curve, with an optimal point where total error is minimized.
Specifically for the adjusted R2, the model’s complexity (i.e., the number of parameters) influences both the R2 term and the adjustment factor (the $\frac{n-1}{n-p-1}$ fraction), thereby capturing their combined impact on the model’s overall efficacy.
R2 itself can be broadly interpreted as reflecting the model’s variance. A high R2 generally implies a lower bias error, as the model is better equipped to explain the fluctuations in $Y$ using its predictors. This suggests fewer erroneous assumptions, leading to reduced bias. However, to accommodate these fewer assumptions, the model often becomes more complex. Following the bias-variance tradeoff, increased complexity initially leads to a decrease in bias and improved performance (represented by the left side of the U-curve, before the optimal line). In the R2 formula, a high R2 means the term $(1 - R^2)$ is lower, which would, in isolation, lead to a higher adjusted R2, consistent with better performance.
Conversely, the adjustment factor (the fraction term) is inversely affected by model complexity. This term will increase as more regressors are added (i.e., as model complexity increases), which, when applied to the $(1-R^2)$ term, will lead to a decrease in adjusted R2 and thus indicate worse performance. This aligns with the right side of the bias-variance tradeoff curve, where excessive model complexity (beyond the optimal line) leads to increasing errors and diminished performance due to high variance.
Considering the full calculation of adjusted R2, while more parameters inherently increase the raw R2, they simultaneously increase the adjustment factor, which then decreases the adjusted R2. These two opposing trends create a reverse U-shaped relationship between model complexity and adjusted R2, which aligns perfectly with the U-shaped trend of total error versus model complexity. Unlike the raw R2, which will always increase (or stay the same) as model complexity increases, the adjusted R2 will only increase if the reduction in bias achieved by adding a new regressor is substantial enough to outweigh the increase in variance simultaneously introduced. This makes the adjusted R2 a valuable tool for preventing overfitting .
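The opposing trends described above can be seen numerically. In the sketch below (illustrative simulated data; all names are hypothetical), a pure-noise regressor is added to an ordinary least squares fit: the raw R2 cannot decrease, while the adjusted R2 typically does:

```python
import numpy as np

# Simulated data: y depends only on x; `noise` is an irrelevant regressor.
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
noise = rng.normal(size=n)

def r2_of(design, y):
    """Raw R^2 of an OLS fit with the given design matrix."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

ones = np.ones(n)
r2_red = r2_of(np.column_stack([ones, x]), y)           # p = 1
r2_full = r2_of(np.column_stack([ones, x, noise]), y)   # p = 2

def adj(r2, p):
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Raw R^2 never falls when a regressor is added; the adjusted R^2
# usually falls when that regressor is pure noise.
print(r2_full >= r2_red)
print(adj(r2_full, 2), adj(r2_red, 1))
```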
Following this same logical thread, adjusted R2 can be considered a less biased estimator of the population R2, whereas the observed sample R2 is known to be a positively biased estimate of the true population value. [18] Consequently, adjusted R2 is a more appropriate metric when the goal is to evaluate the inherent model fit (i.e., the proportion of variance in the dependent variable explained by the independent variables) and, critically, when comparing alternative models during the crucial feature selection stage of model building. [18]
The underlying principle guiding the adjusted R2 statistic becomes clearer when one rewrites the ordinary R2 as:
$$R^{2}={1-{{\text{VAR}}_{\text{res}} \over {\text{VAR}}_{\text{tot}}}}$$
where ${\text{VAR}}_{\text{res}}=SS_{\text{res}}/n$ and ${\text{VAR}}_{\text{tot}}=SS_{\text{tot}}/n$ represent the sample variances of the estimated residuals and the dependent variable, respectively. These sample variances are, unfortunately, biased estimates of the true population variances of the errors and of the dependent variable. To correct for this bias, these estimates are replaced by statistically unbiased versions:
$${\text{VAR}}_{\text{res}}=SS_{\text{res}}/(n-p-1)$$ and $${\text{VAR}}_{\text{tot}}=SS_{\text{tot}}/(n-1)$$
Despite the laudable effort to utilize unbiased estimators for the population variances of the error and the dependent variable, it’s important to acknowledge that the adjusted R2 itself is not an unbiased estimator of the true population R2. [18] The true population R2 would result from using the actual population variances of the errors and the dependent variable, rather than merely estimating them. Ingram Olkin and John W. Pratt famously derived the minimum-variance unbiased estimator for the population R2 [19], now known as the Olkin–Pratt estimator. Comparative studies assessing various approaches for adjusting R2 have generally concluded that, in most practical situations, either an approximate version of the Olkin–Pratt estimator [18] or the exact Olkin–Pratt estimator [20] should be preferred over the more commonly used (Ezekiel) adjusted R2. Progress, however slow, marches on.
Coefficient of partial determination
See also: Partial correlation
The coefficient of partial determination offers a more granular perspective on explanatory power. It is defined as the proportion of variation that, while not accounted for in a more constrained, “reduced” model, can be explained by the inclusion of specific additional predictors within a more comprehensive, “full” model. [21] [22] [23] This coefficient serves as a valuable diagnostic tool, providing insight into whether one or more supplementary predictors might genuinely enhance the explanatory capabilities of a more fully specified regression model. It’s about isolating the unique contribution of new variables.
The calculation for the partial R2 is surprisingly straightforward once two models have been estimated – a reduced model and a full model – and their respective ANOVA tables generated. The formula for the partial R2 is:
$${\frac {SS_{\text{res, reduced}}-SS_{\text{res, full}}}{SS_{\text{res, reduced}}}},$$
This formulation bears a striking resemblance, or analogy, to the standard coefficient of determination, which is typically expressed as:
$${\frac {SS_{\text{tot}}-SS_{\text{res}}}{SS_{\text{tot}}}}.$$
The key difference lies in the denominator: instead of the total sum of squares ($SS_{\text{tot}}$), the partial R2 uses the residual sum of squares from the reduced model ($SS_{\text{res, reduced}}$) as its baseline. This effectively measures the reduction in unexplained variance achieved by adding the new predictors, relative to the variance that was already unexplained by the simpler model.
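A short sketch makes the calculation concrete. The toy dataset below (an illustrative assumption, not from any source) is constructed so the full model fits exactly, in which case the partial R2 equals 1:

```python
import numpy as np

# Toy data: y is an exact linear function of x1 and x2.
x1 = np.array([0.0, 1.0, 2.0, 3.0])
x2 = np.array([0.0, 1.0, 0.0, 1.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2

ones = np.ones_like(x1)

def ss_res(design, y):
    """Residual sum of squares of an OLS fit."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

ss_reduced = ss_res(np.column_stack([ones, x1]), y)   # reduced model omits x2
ss_full = ss_res(np.column_stack([ones, x1, x2]), y)  # full model includes x2

# Partial R^2: share of the reduced model's unexplained variation
# that adding x2 accounts for.
partial_r2 = (ss_reduced - ss_full) / ss_reduced
print(round(ss_reduced, 6), round(partial_r2, 6))  # 7.2 1.0
```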
Generalizing and decomposing R2
As discussed, traditional model selection heuristics, such as the adjusted R2 criterion and the venerable F-test , are typically employed to ascertain whether the total R2 increases sufficiently to justify the inclusion of a new regressor into the model. However, a significant problem arises when a regressor is added that happens to be highly correlated with other regressors already present in the model. In such scenarios, the total R2 will exhibit only a negligible increase, even if the new regressor possesses genuine relevance. Consequently, the aforementioned heuristics might erroneously disregard genuinely relevant regressors when strong cross-correlations exist among the predictors. [24] This is a subtle trap for the unwary.
The geometric representation of r2 visually reinforces how the projection of data onto a model space explains variance.
An alternative, more nuanced approach involves decomposing a generalized version of R2. This allows for a precise quantification of the relevance of deviating from a specific hypothesis. [24] As Hoornweg (2018) demonstrates, several shrinkage estimators – including Bayesian linear regression , ridge regression , and the (adaptive) lasso – implicitly leverage this decomposition of R2 as they gradually shrink estimated parameters from the unrestricted OLS solutions towards hypothesized values.
Let’s first define the linear regression model in its standard form:
$$y=X\beta +\varepsilon .$$
For clarity, it is assumed that the matrix X has been standardized using Z-scores, and the column vector $y$ has been centered to possess a mean of zero. Let the column vector $\beta _{0}$ denote the hypothesized regression parameters, and let the column vector $b$ represent the estimated parameters. We can then define a generalized R2 as:
$$R^{2}=1-{\frac {(y-Xb)'(y-Xb)}{(y-X\beta _{0})'(y-X\beta _{0})}}.$$
An R2 value of 75% in this context would imply that the in-sample accuracy of the model improves by 75% if the data-optimized $b$ solutions are utilized instead of the hypothesized $\beta _{0}$ values. In the specific and common scenario where $\beta _{0}$ is a vector of zeros (representing a null hypothesis of no effect), this generalized R2 conveniently reverts to the traditional R2.
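This reduction to the traditional R2 is easy to verify numerically. The sketch below (illustrative data; a single standardized regressor) sets $\beta_{0}$ to zero and confirms that the generalized and traditional definitions coincide:

```python
import numpy as np

# Illustrative data; X is z-scored and y is centered, as assumed above.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.9, 3.7, 5.8, 8.0, 9.6])

X = ((x - x.mean()) / x.std()).reshape(-1, 1)  # standardized regressor
yc = y - y.mean()                              # centered response

b, *_ = np.linalg.lstsq(X, yc, rcond=None)     # data-optimized estimates
beta0 = np.zeros_like(b)                       # hypothesized parameters: zero

res_b = yc - X @ b
res_0 = yc - X @ beta0                         # equals yc when beta0 = 0
gen_r2 = 1.0 - (res_b @ res_b) / (res_0 @ res_0)

trad_r2 = 1.0 - (res_b @ res_b) / (yc @ yc)    # traditional definition
print(np.isclose(gen_r2, trad_r2))  # True
```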
To delve deeper into the individual impact on R2 stemming from deviations from a hypothesis, one can compute $R^{\otimes}$ (‘R-outer’). This $p \times p$ matrix is given by:
$$R^{\otimes }=(X'{\tilde {y}}_{0})(X'{\tilde {y}}_{0})'(X'X)^{-1}({\tilde {y}}_{0}'{\tilde {y}}_{0})^{-1},$$
where ${\tilde {y}}_{0}=y-X\beta _{0}$. The sum of the diagonal elements of $R^{\otimes}$ precisely equals the generalized R2. If the regressors are uncorrelated and $\beta _{0}$ is a vector of zeros, then the jth diagonal element of $R^{\otimes}$ simply corresponds to the r2 value (squared Pearson correlation) between $x_j$ and $y$. However, when regressors $x_i$ and $x_j$ are correlated, $R_{ii}^{\otimes}$ might increase at the expense of a decrease in $R_{jj}^{\otimes}$. Consequently, the diagonal elements of $R^{\otimes}$ can, in some instances, be smaller than 0 and, in more exceptional cases, even larger than 1 – a testament to the complexities introduced by multicollinearity. To navigate such uncertainties, various shrinkage estimators implicitly employ a weighted average of the diagonal elements of $R^{\otimes}$ to quantify the relevance of deviating from a hypothesized value. [24] For a practical example, one might consult the article on the lasso .
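The trace identity can be checked directly. In the sketch below (simulated, illustrative data; $\beta_{0}$ taken as zero and $b$ as the least-squares estimate), the diagonal of $R^{\otimes}$ sums to the generalized R2:

```python
import numpy as np

# Simulated data with standardized regressors and a centered response.
rng = np.random.default_rng(1)
n, p = 40, 3
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=n)
y = y - y.mean()

beta0 = np.zeros(p)                  # hypothesized parameter values
y0 = y - X @ beta0                   # here simply y
b = np.linalg.lstsq(X, y, rcond=None)[0]

# R-outer: (X'y0)(X'y0)' (X'X)^{-1} (y0'y0)^{-1}
v = (X.T @ y0).reshape(-1, 1)
R_outer = (v @ v.T) @ np.linalg.inv(X.T @ X) / (y0 @ y0)

res = y - X @ b
gen_r2 = 1.0 - (res @ res) / (y0 @ y0)
print(np.isclose(np.trace(R_outer), gen_r2))  # True
```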
R2 in logistic regression
In the specialized domain of logistic regression , which is typically fitted using maximum likelihood estimation rather than least squares, the concept of R2 requires adaptation. Since the traditional R2 relies on sums of squares, which are not directly applicable to likelihood-based models, several alternative “pseudo-R2” metrics have been proposed.
One such prominent pseudo-R2 is the generalized R2, initially put forth by Cox & Snell [25] and independently by Magee [26]:
$$R^{2}=1-\left({{\mathcal {L}}(0) \over {\mathcal {L}}({\widehat {\theta }})}\right)^{2/n}$$
Here, ${\mathcal {L}}(0)$ represents the likelihood of the null model, which includes only the intercept term (i.e., a model with no explanatory variables). In contrast, ${\mathcal {L}}({\widehat {\theta }})$ denotes the likelihood of the estimated model, incorporating the full set of parameter estimates. The variable n signifies the sample size. This formula can be conveniently rewritten in terms of the likelihood ratio test statistic, D:
$$R^{2}=1-e^{{\frac {2}{n}}(\ln({\mathcal {L}}(0))-\ln({\mathcal {L}}({\widehat {\theta }})))}=1-e^{-D/n}$$
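Given the two log-likelihoods and the sample size, the computation is direct. The sketch below (the function name and the example numbers are illustrative assumptions, not from any fitted model) implements the formula above:

```python
import math

def cox_snell_r2(loglik_null: float, loglik_model: float, n: int) -> float:
    """Cox & Snell generalized R^2:
    1 - exp((2/n) * (ln L(0) - ln L(theta_hat))) = 1 - exp(-D/n)."""
    return 1.0 - math.exp((2.0 / n) * (loglik_null - loglik_model))

# Illustrative values: null log-likelihood -100, fitted -80, n = 100.
# Then D = 2 * (-80 - (-100)) = 40, so R^2 = 1 - exp(-0.4).
print(round(cox_snell_r2(-100.0, -80.0, 100), 4))  # 0.3297
```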
Nico Nagelkerke meticulously outlined several desirable properties for such a pseudo-R2 [27] [22]:
- It should be consistent with the classical coefficient of determination when both are applicable and computable.
- Its value should be maximized by the maximum likelihood estimation of a model, reflecting the optimal fit achieved by this method.
- It should be asymptotically independent of the sample size, preventing its value from being unduly influenced by the number of observations in large datasets.
- Its interpretation should remain intuitive: the proportion of the variation in the dependent variable explained by the model.
- The values should be constrained between 0 and 1, where 0 indicates that the model explains no variation whatsoever, and 1 signifies a perfect explanation of the observed variation.
- It should be unitless, a pure proportion.
However, Nagelkerke also pointed out a specific limitation of the Cox & Snell R2 in the context of logistic models: since ${\mathcal {L}}({\widehat {\theta }})$ cannot exceed 1, the maximum possible value for this R2 is limited to $R_{\max }^{2}=1-({\mathcal {L}}(0))^{2/n}$. Because this maximum is often less than 1, it can be misleading. To address this, Nagelkerke [22] suggested the possibility of defining a scaled R2 as R2 / R2max, which normalizes the measure to always range from 0 to 1, thus providing a more readily interpretable proportion of explained variance relative to the maximum possible explanation.
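Nagelkerke's rescaling is a simple division by that maximum. The sketch below (illustrative numbers, not from any fitted model) computes both the Cox & Snell value and its scaled counterpart:

```python
import math

def nagelkerke_r2(loglik_null: float, loglik_model: float, n: int) -> float:
    """Cox & Snell R^2 divided by its maximum attainable value,
    R^2_max = 1 - L(0)^(2/n), so the result ranges from 0 to 1."""
    cox_snell = 1.0 - math.exp((2.0 / n) * (loglik_null - loglik_model))
    r2_max = 1.0 - math.exp((2.0 / n) * loglik_null)
    return cox_snell / r2_max

# With null log-likelihood -100, fitted -80, and n = 100, the Cox & Snell
# value is about 0.330, and the scaled version is noticeably larger:
print(round(nagelkerke_r2(-100.0, -80.0, 100), 3))  # 0.381
```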
Comparison with residual statistics
Occasionally, other residual-based statistics are employed as indicators of goodness of fit . The norm of residuals, for instance, is calculated as the square-root of the sum of squares of residuals (SSR):
$${\text{norm of residuals}}={\sqrt {SS_{\text{res}}}}=|e|.$$
Similarly, the reduced chi-square statistic is derived by dividing the SSR by the model’s degrees of freedom.
Both R2 and the norm of residuals possess their own distinct merits and interpretational quirks. For least squares analysis, R2 operates within the familiar range of 0 to 1, with values closer to 1 indicating a superior fit and a perfect 1 representing, well, perfection. The norm of residuals, conversely, spans a range from 0 to infinity, where smaller values denote better fits, and a zero value signifies a perfect alignment between model and data.
One notable advantage, and simultaneously a disadvantage, of R2 is the normalizing effect of the $SS_{\text{tot}}$ term in its denominator. If all the $y_i$ values in a dataset are multiplied by a constant factor (e.g., changing units from meters to millimeters), the norm of residuals will scale proportionally by that same constant. However, the R2 value will remain entirely unchanged, as both $SS_{\text{res}}$ and $SS_{\text{tot}}$ would scale by the square of that constant, canceling out in the ratio.
Consider a basic example for a linear least squares fit to the following dataset:
| x | y |
|---|---|
| 1 | 1.9 |
| 2 | 3.7 |
| 3 | 5.8 |
| 4 | 8.0 |
| 5 | 9.6 |
For this dataset, R2 = 0.998, indicating an excellent fit, and the norm of residuals = 0.302. Now, if all values of y are multiplied by 1000 (perhaps representing a change in SI prefix from base units to milli-units, for example), the R2 value remains precisely 0.998. However, the norm of residuals dramatically changes to 302. This demonstrates R2’s invariance to scale changes in the dependent variable, while residual norms are scale-dependent. This is not inherently good or bad, but a characteristic to be aware of.
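The figures above can be reproduced with a short least-squares fit (a sketch; the helper function is hypothetical):

```python
import numpy as np

# The five-point dataset from the table above.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.9, 3.7, 5.8, 8.0, 9.6])

def fit_stats(x, y):
    """Fit y = a + b*x by least squares; return (R^2, norm of residuals)."""
    A = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = resid @ resid
    ss_tot = (y - y.mean()) @ (y - y.mean())
    return 1.0 - ss_res / ss_tot, np.sqrt(ss_res)

r2, norm = fit_stats(x, y)
r2_milli, norm_milli = fit_stats(x, 1000.0 * y)  # rescale the response

print(round(r2, 3), round(norm, 3))              # 0.998 0.302
print(round(r2_milli, 3), round(norm_milli, 0))  # 0.998 302.0
```

Rescaling y leaves R2 untouched while the residual norm scales by the same factor of 1000, exactly as described above.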
Another common single-parameter indicator of fit is the Root Mean Square Error (RMSE) of the residuals, or equivalently, the standard deviation of the residuals. For the example above, assuming a linear fit with an unforced intercept, this would yield a value of 0.135. [28] Each of these metrics offers a slightly different lens through which to evaluate model performance; choosing the most appropriate one depends on the specific context and the questions one wishes to answer.
History
The development and formalization of the coefficient of determination are generally attributed to the pioneering geneticist Sewall Wright . His foundational work on this concept was first published in 1921 [29], marking a significant milestone in the quantitative analysis of statistical relationships.
See also
- Anscombe’s quartet
- Fraction of variance unexplained
- Goodness of fit
- Nash–Sutcliffe model efficiency coefficient (hydrological applications )
- Pearson product-moment correlation coefficient
- Proportional reduction in loss
- Regression model validation
- Root mean square deviation
- Stepwise regression
Notes
- ^ Steel, R. G. D.; Torrie, J. H. (1960). Principles and Procedures of Statistics with Special Reference to the Biological Sciences. McGraw Hill . ISBN 007060925X. {{cite book }}: ISBN / Date incompatibility (help )
- ^ Glantz, Stanton A.; Slinker, B. K. (1990). Primer of Applied Regression and Analysis of Variance. McGraw-Hill. ISBN 978-0-07-023407-9.
- ^ Draper, N. R.; Smith, H. (1998). Applied Regression Analysis. Wiley-Interscience. ISBN 978-0-471-17082-2.
- ^ a b Devore, Jay L. (2011). Probability and Statistics for Engineering and the Sciences (8th ed.). Boston, MA: Cengage Learning. pp. 508–510. ISBN 978-0-538-73352-6.
- ^ Barten, Anton P. (1987). “The Coefficient of Determination for Regression without a Constant Term”. In Heijmans, Risto; Neudecker, Heinz (eds.). The Practice of Econometrics. Dordrecht: Kluwer. pp. 181–189. ISBN 90-247-3502-5.
- ^ Colin Cameron, A.; Windmeijer, Frank A.G. (1997). “An R-squared measure of goodness of fit for some common nonlinear regression models”. Journal of Econometrics. 77 (2): 1790–2. doi :10.1016/S0304-4076(96)01818-0.
- ^ Chicco, Davide; Warrens, Matthijs J.; Jurman, Giuseppe (2021). “The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation”. PeerJ Computer Science. 7 (e623): e623. doi :10.7717/peerj-cs.623. PMC 8279135. PMID 34307865.
- ^ Legates, D.R.; McCabe, G.J. (1999). “Evaluating the use of “goodness-of-fit” measures in hydrologic and hydroclimatic model validation”. Water Resour. Res. 35 (1): 233–241. Bibcode :1999WRR....35..233L. doi :10.1029/1998WR900018. S2CID 128417849.
- ^ Ritter, A.; Muñoz-Carpena, R. (2013). “Performance evaluation of hydrological models: statistical significance for reducing subjectivity in goodness-of-fit assessments”. Journal of Hydrology. 480 (1): 33–45. Bibcode :2013JHyd..480...33R. doi :10.1016/j.jhydrol.2012.12.004.
- ^ Everitt, B. S. (2002). Cambridge Dictionary of Statistics (2nd ed.). CUP. p. 78. ISBN 978-0-521-81099-9.
- ^ Casella, Georges (2002). Statistical inference (Second ed.). Pacific Grove, Calif.: Duxbury/Thomson Learning. p. 556. ISBN 9788131503942.
- ^ Kvalseth, Tarald O. (1985). “Cautionary Note about R2”. The American Statistician. 39 (4): 279–285. doi :10.2307/2683704. JSTOR 2683704.
- ^ “Linear Regression – MATLAB & Simulink”. www.mathworks.com .
- ^ Faraway, Julian James (2005). Linear models with R (PDF). Chapman & Hall/CRC. ISBN 9781584884255.
- ^ a b Raju, Nambury S.; Bilgic, Reyhan; Edwards, Jack E.; Fleer, Paul F. (1997). “Methodology review: Estimation of population validity and cross-validity, and the use of equal weights in prediction”. Applied Psychological Measurement. 21 (4): 291–305. doi :10.1177/01466216970214001. ISSN 0146-6216. S2CID 122308344.
- ^ Mordecai Ezekiel (1930), Methods Of Correlation Analysis, Wiley , Wikidata Q120123877, pp. 208–211.
- ^ Yin, Ping; Fan, Xitao (January 2001). “Estimating R 2 Shrinkage in Multiple Regression: A Comparison of Different Analytical Methods” (PDF). The Journal of Experimental Education. 69 (2): 203–224. doi :10.1080/00220970109600656. ISSN 0022-0973. S2CID 121614674.
- ^ a b c d Shieh, Gwowen (2008-04-01). “Improved shrinkage estimation of squared multiple correlation coefficient and squared cross-validity coefficient”. Organizational Research Methods. 11 (2): 387–407. doi :10.1177/1094428106292901. ISSN 1094-4281. S2CID 55098407.
- ^ Olkin, Ingram; Pratt, John W. (March 1958). “Unbiased estimation of certain correlation coefficients”. The Annals of Mathematical Statistics. 29 (1): 201–211. doi :10.1214/aoms/1177706717. ISSN 0003-4851.
- ^ Karch, Julian (2020-09-29). “Improving on Adjusted R-Squared”. Collabra: Psychology. 6 (45). doi :10.1525/collabra.343. hdl :1887/3161248. ISSN 2474-7394.
- ^ Richard Anderson-Sprecher, “Model Comparisons and R 2”, The American Statistician , Volume 48, Issue 2, 1994, pp. 113–117.
- ^ a b c Nagelkerke, N. J. D. (September 1991). “A Note on a General Definition of the Coefficient of Determination” (PDF). Biometrika. 78 (3): 691–692. doi :10.1093/biomet/78.3.691. JSTOR 2337038.
- ^ “regression – R implementation of coefficient of partial determination”. Cross Validated.
- ^ a b c d Hoornweg, Victor (2018). “Part II: On Keeping Parameters Fixed”. Science: Under Submission. Hoornweg Press. ISBN 978-90-829188-0-9.
- ^ Cox, D. R.; Snell, E. J. (1989). The Analysis of Binary Data (2nd ed.). Chapman and Hall.
- ^ Magee, L. (1990). “R 2 measures based on Wald and likelihood ratio joint significance tests”. The American Statistician. 44 (3): 250–3. doi :10.1080/00031305.1990.10475731.
- ^ Nagelkerke, Nico J. D. (1992). Maximum Likelihood Estimation of Functional Relationships, Pays-Bas. Lecture Notes in Statistics. Vol. 69. ISBN 978-0-387-97721-8.
- ^ OriginLab webpage, http://www.originlab.com/doc/Origin-Help/LR-Algorithm . Retrieved February 9, 2016.
- ^ Wright, Sewall (January 1921). “Correlation and causation”. Journal of Agricultural Research. 20: 557–585.
Further reading
- Gujarati, Damodar N. ; Porter, Dawn C. (2009). Basic Econometrics (Fifth ed.). New York: McGraw-Hill/Irwin. pp. 73–78. ISBN 978-0-07-337577-9.
- Hughes, Ann; Grawoig, Dennis (1971). Statistics: A Foundation for Analysis. Reading: Addison-Wesley. pp. 344–348. ISBN 0-201-03021-7.
- Kmenta, Jan (1986). Elements of Econometrics (Second ed.). New York: Macmillan. pp. 240–243. ISBN 978-0-02-365070-3.
- Lewis-Beck, Michael S. ; Skalaban, Andrew (1990). “The R -Squared: Some Straight Talk”. Political Analysis . 2: 153–171. doi :10.1093/pan/2.1.153. JSTOR 23317769.
- Chicco, Davide; Warrens, Matthijs J.; Jurman, Giuseppe (2021). “The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation”. PeerJ Computer Science. 7 (e623): e623. doi :10.7717/peerj-cs.623. PMC 8279135. PMID 34307865.