
Linear Regression



Statistical Modeling: The Illusion of Linearity

Part of a series on Regression Analysis that masquerades as understanding.

Models: The Cast of Characters

These are the archetypes we force data into, hoping for a coherent narrative.

  • Linear Regression: The classic. Assumes a straight-line relationship. Simple, predictable, and often utterly wrong.
    • Simple Regression: One variable trying to explain another. Like a single, desperate plea.
    • Polynomial Regression: When a straight line just won't do, we bend it. Still linear in the parameters, though. A subtle deception.
    • General Linear Model: The more ambitious cousin, allowing for multiple responses. More complex, more potential for error.
    • Generalized Linear Model: For when your response variable is being difficult – counts, categories, you name it. It’s a framework, not a magic wand.
      • Vector Generalized Linear Model: Even more general. Because why wouldn't you want more layers of abstraction?
      • Discrete Choice: When the outcome is a decision, not a measurement. Think of it as predicting regret.
      • Binomial Regression: For counting successes out of a fixed number of yes-or-no trials. Or, you know, surviving or not.
      • Binary Regression: The single-trial special case of binomial regression. One shot, pass or fail.
      • Logistic Regression: The go-to for binary outcomes. Predicts probabilities, which are just educated guesses about the future.
      • Probit Model: Another player in the binary outcome game. Similar to logistic, but built on the normal CDF instead of the logistic curve.
      • Ordered Logit: For outcomes that have a natural order, like ratings. A ranked series of disappointments.
      • Ordered Probit: The probit counterpart for ordered outcomes.
      • Poisson Regression: For count data. How many times will this happen before it breaks?
    • Multilevel Model: When your data has layers, like Russian dolls. Students in classes, classes in schools. Because everything is nested.
    • Nonlinear Regression: When the model is nonlinear in the parameters themselves. The real world, in all its inconvenient glory.
    • Nonparametric Regression: Makes fewer assumptions about the form of the relationship. It’s more flexible, but often less interpretable. Like trying to understand a whisper in a hurricane.
    • Semiparametric Regression: A compromise between parametric and nonparametric. A bit of structure, a bit of freedom.
    • Robust Regression: Designed to shrug off outliers. Because the real world is full of data points that just don't care about your assumptions.
    • Quantile Regression: Focuses on specific points in the distribution of the response, not just the mean. Understanding the edges, not just the center.
    • Isotonic Regression: For cases where the relationship is monotonic. Things generally go up, or generally go down. No sudden reversals.
    • Principal Component Regression: A way to handle too many correlated predictors. Reduces the noise, hopefully without losing the signal.
    • Least Angle Regression: An algorithm that's good with high-dimensional data. Tries to be efficient.
    • Local Regression: Fits models to small subsets of the data. Assumes relationships are local. Like understanding a city by looking at one street at a time.
    • Segmented Regression: Fits different linear models to different segments of the data. Change points. Where does the story shift?
    • Errors-in-Variables Models: Acknowledges that all variables might be measured with error. A rare moment of honesty.

Estimation: The Art of Guessing Coefficients

How we figure out the numbers that define the relationship.

Background: The Unseen Foundations

The assumptions we make, often without fully realizing it.

  • Regression Validation: How do we know if our model is any good? A crucial, often overlooked, step.
  • Mean and Predicted Response: What we're actually estimating. The average outcome, or a specific prediction.
  • Errors and Residuals: The difference between what we predict and what we observe. The gap between theory and reality.
  • Goodness of Fit: Measures of how well the model captures the data. Metrics that try to quantify success.
  • Studentized Residual: A standardized residual, useful for outlier detection. A more critical look at individual errors.
  • Gauss-Markov Theorem: The theoretical underpinning for why OLS is the "best" linear unbiased estimator, provided the errors are well behaved (zero mean, constant variance, uncorrelated). A foundational piece, rarely questioned.

In statistics, linear regression is a model. It's a lens, really, through which we attempt to understand the relationship between a single, measurable outcome – the dependent variable – and one or more factors that might influence it, the explanatory variables, or regressors. If you're dealing with just one of these influencing factors, it's a simple linear regression. More than one, and it graduates to a multiple linear regression. Don't confuse this with multivariate linear regression, though; that's a different beast, predicting multiple dependent variables simultaneously, variables that are likely tangled together.

The core of linear regression lies in assuming these relationships can be described by linear predictor functions. We estimate the unknown parameters – the slopes and intercepts, the coefficients that dictate the strength and direction of these relationships – from the data. Most of the time, we're trying to model the conditional mean of the response. We assume it behaves like an affine function of the predictors. Sometimes, less commonly, we might look at the median or other quantiles. Like all forms of regression analysis, it focuses on how the response variable behaves given the predictors, not the whole messy joint distribution of everything. That’s the domain of multivariate analysis, a whole other can of worms.

And yes, linear regression is also a machine learning algorithm. Specifically, a supervised one. It learns from labeled data, finding the linear function that best fits the training data and then using it to make predictions on new, unseen data. It's a fundamental building block, simple enough to grasp, yet powerful enough to be widely misused.

It’s worth noting that linear regression was one of the first regression techniques to be rigorously studied and widely applied. Why? Because models that are linear in their parameters are easier to fit and their statistical properties are more tractable. The math doesn't fight back as much.

Linear regression has its uses, usually falling into two broad categories:

  • Prediction and Forecasting: If your goal is to reduce error, to make better predictions or forecasts, you fit a model to observed data. Then, when you get new values for your explanatory variables but not the response, you use the model to guess the response. It’s an educated guess, but a guess nonetheless.
  • Explanation and Quantification: If you want to understand why a response variable changes, to quantify the strength of relationships, linear regression can help. It can tell you if certain explanatory variables actually have any meaningful linear connection to the response, or if they're just redundant noise.

We often fit these models using least squares. But other methods exist, like minimizing other norms (think least absolute deviations) or using penalty functions to keep coefficients in check, as in ridge regression or lasso. The Mean Squared Error can be a problem with outliers; it inflates their importance. If your data is messy, you might need more robust cost functions. And remember, "least squares" doesn't always mean a "linear model." The terms are linked, but not interchangeable.
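
If you want to see what swapping the cost function actually buys you, here is a minimal numpy/scipy sketch with entirely invented data and one deliberately obnoxious outlier. It compares ordinary least squares with a least-absolute-deviations fit; the numbers, noise level, and optimizer choice are illustrative assumptions, not gospel.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy data (invented): a clean line plus noise, with one deliberately obnoxious outlier.
x = np.linspace(0, 10, 40)
y = 2.0 + 0.5 * x + rng.normal(scale=0.5, size=x.size)
y[-1] += 20.0  # one observation that doesn't care about your assumptions

X = np.column_stack([np.ones_like(x), x])  # design matrix with an intercept column

# Ordinary least squares: minimize the sum of squared residuals.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Least absolute deviations: minimize the sum of |residuals| instead.
# The objective is non-smooth, so use Nelder-Mead rather than a gradient method.
lad = minimize(lambda b: np.abs(y - X @ b).sum(), x0=beta_ols, method="Nelder-Mead")

print("OLS intercept/slope:", beta_ols)  # dragged toward the outlier
print("LAD intercept/slope:", lad.x)     # much closer to the true (2.0, 0.5)
```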

Formulation: The Mathematical Skeleton

Imagine you have a dataset, a collection of observations. For each observation, you have a response variable, let's call it $y_i$, and a set of $p$ factors that might influence it: $x_{i1}, x_{i2}, \ldots, x_{ip}$.

A linear regression model assumes that the relationship between $y_i$ and these $x$'s isn't just random. It's mostly linear, with a bit of unpredictable noise thrown in. This noise is represented by an error term, $\varepsilon_i$, an unobserved random variable that adds a little chaos to the otherwise orderly linear equation.

So, for each observation $i$, the model looks like this:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i$$

This can be condensed using vector notation. Let $\mathbf{x}_i$ be the vector of predictors for observation $i$ (including a 1 for the intercept), and $\boldsymbol{\beta}$ be the vector of unknown parameters. Then:

$$y_i = \mathbf{x}_i^{\mathsf{T}} \boldsymbol{\beta} + \varepsilon_i$$

Now, if we stack all $n$ observations into matrices and vectors, we get the familiar form:

$$\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

Where:

  • $\mathbf{y}$ is the vector of observed responses: $\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$
  • $\mathbf{X}$ is the design matrix, where each row is the vector of predictors for an observation: $\begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix}$
  • $\boldsymbol{\beta}$ is the vector of unknown parameters (coefficients): $\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}$
  • $\boldsymbol{\varepsilon}$ is the vector of error terms: $\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$
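
To make the stacking concrete, here is a tiny numpy sketch with three invented observations and two made-up predictors. Nothing about the numbers matters; the point is the shape of $\mathbf{y}$ and the column of ones in $\mathbf{X}$ that carries the intercept.

```python
import numpy as np

# Three invented observations with p = 2 predictors each, just to show the stacking.
y = np.array([3.1, 4.0, 5.2])          # responses y_1, y_2, y_3
raw = np.array([[1.0, 2.0],            # x_11, x_12
                [1.5, 1.0],            # x_21, x_22
                [2.0, 3.5]])           # x_31, x_32

# Design matrix X: prepend a column of ones so that beta_0 acts as the intercept.
X = np.column_stack([np.ones(len(y)), raw])

print(X)
# [[1.  1.  2. ]
#  [1.  1.5 1. ]
#  [1.  2.  3.5]]
```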

Notation and Terminology: The Language of the Game

  • $\mathbf{y}$: This is your observed data, the thing you're trying to explain. Sometimes called the regressand, endogenous variable, response variable, target variable, measured variable, or dependent variable. Just don't call it the "predicted variable" – that's for the estimated values, $\hat{y}$. We choose what's dependent based on what we think influences what, or sometimes just for practical reasons.
  • $\mathbf{X}$: This matrix holds your predictors. They're called regressors, exogenous variables, explanatory variables, covariates, input variables, or independent variables. Crucially, they are not necessarily statistically independent. The $\mathbf{X}$ matrix is often referred to as the design matrix.
  • Intercept: Usually, the first column of $\mathbf{X}$ is all ones ($x_{i0}=1$). This allows for a constant term, $\beta_0$, the intercept. It's the expected value of $y$ when all predictors are zero. Even if theoretically it shouldn't exist, we often include it because statistical procedures demand it.
  • Non-linear Functions: Sometimes, a predictor isn't a raw variable but a transformation of one, like $x_{ij}^2$ in polynomial regression. The model is still "linear" because it's linear in the parameters ($\boldsymbol{\beta}$).
  • Fixed vs. Random Predictors: We can treat the $x_{ij}$ values as fixed numbers we've set, or as values drawn from random variables. The estimation methods are often the same, but how we analyze their behavior in the long run (asymptotics) differs.
  • $\boldsymbol{\beta}$: This is the heart of the model – the vector of parameters we want to estimate. $\beta_0$ is the intercept, and $\beta_1, \ldots, \beta_p$ are the coefficients for each predictor. They tell you how much $y$ changes for a one-unit change in a predictor, holding all other predictors constant. This "holding constant" part is key, and sometimes problematic.
  • $\boldsymbol{\varepsilon}$: The error term. It's everything else that influences $y$ that we haven't accounted for with our predictors. It's the noise, the unexplained variance. What we assume about this noise (its distribution, its relationship with $\mathbf{X}$) is critical for choosing the right estimation method.

Fitting a model usually means finding the $\boldsymbol{\beta}$ values that make the error term ($\boldsymbol{\varepsilon} = \mathbf{y} - \mathbf{X}\boldsymbol{\beta}$) as small as possible, typically by minimizing the sum of its squared values ($\|\boldsymbol{\varepsilon}\|_{2}^{2}$).
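
For the curious, a minimal numpy sketch of that minimization on simulated data with made-up coefficients: the closed-form normal-equations solution and numpy's built-in least-squares routine land on the same estimate, as they should.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (all numbers invented): n observations, two predictors, known coefficients.
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Minimizing ||y - X beta||^2 has the closed-form normal-equations solution
# beta_hat = (X^T X)^{-1} X^T y; solve() is used rather than an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq minimizes the same quantity and should agree to numerical precision.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))  # True
```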

Example: The Ball in the Air

Imagine tossing a ball. Physics tells us its height ($h_i$) at time ($t_i$) can be modeled as:

$$h_i = \beta_1 t_i + \beta_2 t_i^2 + \varepsilon_i$$

Here, $\beta_1$ relates to initial velocity, and $\beta_2$ to gravity. $\varepsilon_i$ is measurement error. This looks non-linear because of $t_i^2$, but it's linear in the parameters $\beta_1$ and $\beta_2$. If we define our predictors as $x_{i1} = t_i$ and $x_{i2} = t_i^2$, it fits the standard form: $h_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$.
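
A quick sketch of that fit on simulated toss data (the initial velocity, noise level, and time grid are all invented): regressing height on $t$ and $t^2$ recovers something close to the physics.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated toss (invented numbers): h = v0 * t - 0.5 * g * t^2, plus measurement noise.
v0, g = 15.0, 9.81
t = np.linspace(0.1, 2.5, 30)
h = v0 * t - 0.5 * g * t**2 + rng.normal(scale=0.05, size=t.size)

# Nonlinear in t, but linear in (beta_1, beta_2): the predictors are simply t and t^2.
# No intercept column, matching the model above.
X = np.column_stack([t, t**2])
beta_1, beta_2 = np.linalg.lstsq(X, h, rcond=None)[0]

print("beta_1 (~ initial velocity):", beta_1)  # close to 15.0
print("beta_2 (~ -g/2):            ", beta_2)  # close to -4.905
```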

Assumptions: The Pillars of the Structure (and where they often crumble)

Standard linear regression, especially with ordinary least squares, relies on a few critical assumptions. Relaxing them leads to more complex models, but sometimes, that’s what reality demands.

  • Weak Exogeneity: Predictor variables ($\mathbf{X}$) are treated as fixed, not random. This implies they are measured without error. A noble thought, rarely true. Ignoring measurement error in predictors leads to errors-in-variables models, which are considerably more complicated.
  • Linearity: The mean of the response is a linear combination of the parameters and predictors. This sounds restrictive, but it's not entirely. Predictors can be transformed ($x^2$, $\log(x)$), and multiple transformations can be included. This is how polynomial regression works – it's still linear regression. The danger here is overfitting; models become too tailored to the specific data, losing generalizability. Regularization techniques like ridge regression and lasso regression (which can be seen as Bayesian models with specific prior distributions) are often employed to combat this.
  • Constant Variance (Homoscedasticity): The variance of the errors ($\varepsilon_i$) is the same for all values of the predictors. In simpler terms, the spread of the data points around the regression line is consistent. This is often violated. A person predicted to earn $100,000 has a much wider range of plausible actual incomes than someone predicted to earn $10,000. The variance tends to increase with the mean. This violation is called heteroscedasticity. Plots of residuals versus predicted values can reveal this "fanning out" (a crude numerical version of that check is sketched after this list). Ignoring it leads to biased standard errors, making your significance tests and confidence intervals unreliable. Solutions include weighted least squares or using heteroscedasticity-consistent standard errors. Transformations of the response variable (like taking the logarithm) can sometimes stabilize variance.
  • Independence of Errors: The errors ($\varepsilon_i$) are uncorrelated with each other. This means knowing the error for one observation doesn't tell you anything about the error for another. This is often false in time-series data or clustered data. Methods like generalized least squares can handle this, but often require more data or regularization. Bayesian linear regression is also adaptable.
  • Lack of Perfect Multicollinearity: Predictor variables shouldn't be perfectly linearly related to each other. If one predictor can be perfectly predicted from others, the model breaks. It's like trying to assign blame when two people always do the exact same thing. This makes parameter estimates impossible or unstable. Highly correlated predictors (near multicollinearity) reduce the precision of estimates, leading to large variance inflation factors. Solutions exist, like partial least squares regression, or methods that assume effect sparsity (many coefficients are zero). Iterative algorithms for generalized linear models tend to be more robust to this.
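
Here is the crude check promised above, as a minimal numpy sketch on invented data whose noise grows with the mean, plus a hand-rolled weighted least squares fit as one possible remedy. The variance structure is known here only because we made it up; real data is never that polite.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented data where the noise grows with the mean -- textbook heteroscedasticity.
x = rng.uniform(1, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x)   # error spread proportional to x
X = np.column_stack([np.ones_like(x), x])

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_ols
resid = y - fitted

# Crude diagnostic: residual spread in the lower vs. upper half of the fitted values.
# A large ratio is the numerical face of the "fanning out" seen in residual plots.
low, high = fitted < np.median(fitted), fitted >= np.median(fitted)
print("residual std, low fitted :", resid[low].std())
print("residual std, high fitted:", resid[high].std())

# One remedy: weighted least squares, weighting each observation by 1 / variance.
# The true variance structure is known here only because we invented it.
w = 1.0 / (0.5 * x) ** 2
beta_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
print("WLS coefficients:", beta_wls)
```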

Violating these assumptions can lead to biased estimates, unreliable standard errors, and ultimately, misleading conclusions.

Interpretation: What Does It All Mean?

The coefficients ($\beta_j$) in a multiple regression model represent the expected change in the response variable ($y$) for a one-unit increase in the predictor variable ($x_j$), holding all other predictor variables constant. This "holding constant" is crucial. It's the model's attempt to isolate the unique effect of each variable.

Consider Anscombe's Quartet – a set of four datasets with nearly identical summary statistics (mean, variance, correlation, regression line) but vastly different graphical representations. This starkly illustrates the danger of relying solely on numbers without visualizing the data.

The interpretation of "holding constant" can be tricky. If your predictors are highly correlated, it might be practically impossible for one to change while others remain fixed. This is where the interpretation of individual coefficients becomes problematic.

Sometimes, a variable's unique effect ($\beta_j$) might be small even if its marginal effect (its relationship with $y$ alone) is large. This suggests another variable in the model captures all its predictive power. Conversely, a variable's unique effect might be large while its marginal effect is small, meaning other variables explain most of $y$'s variation, but in a way that complements $x_j$'s contribution.

The notion of a "unique effect" is appealing for understanding complex systems, potentially even identifying causal effects. However, critics argue that when predictors are correlated and not experimentally manipulated, multiple regression often obscures rather than clarifies relationships.
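
To see the obscuring in action, a small numpy sketch with two invented, nearly collinear predictors: refit on bootstrap resamples and watch the individual coefficients flail while their sum stays calm. Every number here is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two nearly collinear predictors (invented): x2 is x1 plus a whisper of noise.
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
y = 1.0 + 1.0 * x1 + 1.0 * x2 + rng.normal(scale=1.0, size=n)
X = np.column_stack([np.ones(n), x1, x2])

# Refit on bootstrap resamples: the individual coefficients swing wildly,
# while their sum (the combined effect of the pair) stays put.
coefs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)
    b, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    coefs.append(b)
coefs = np.array(coefs)

print("std of beta_1 across resamples:", coefs[:, 1].std())                  # large
print("std of beta_2 across resamples:", coefs[:, 2].std())                  # large
print("std of beta_1 + beta_2:        ", (coefs[:, 1] + coefs[:, 2]).std())  # small
```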

Extensions: When the Basic Model Isn't Enough

  • Simple and Multiple Linear Regression: The basic building blocks. One predictor vs. many.
  • General Linear Models: Extending to a vector-valued response – several responses modeled at once. The multivariate counterpart.
  • Heteroscedastic Models: Specifically designed for situations where error variances differ. Weighted least squares is a prime example.
  • Generalized Linear Models (GLMs): A powerful framework for response variables that don't follow a normal distribution or have bounded ranges. Think counts (Poisson regression), binary outcomes (logistic regression, probit regression), or ordered categories (ordered logit). They use a link function ($g$) to connect the mean of the response to the linear predictor: $E(Y) = g^{-1}(\mathbf{X}\boldsymbol{\beta})$. A bare-bones logistic sketch follows this list.
  • Hierarchical Linear Models (Multilevel Models): For data with nested structures (students within schools, etc.). Allows for variation at different levels.
  • Errors-in-Variables Models: Acknowledges that predictors might also be measured with error, leading to biased estimates in standard models.
  • Group Effects: When predictors are highly correlated, their individual effects are hard to interpret. Group effects look at the collective impact of a set of related predictors. This involves defining weights and estimating combined effects, especially useful when individual coefficients are unstable due to multicollinearity. The idea is that while individual components might be unpredictable, their combined action might be well-defined and estimable. For instance, the "average group effect" can be meaningful even if individual effects are not.
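
As promised, a minimal sketch of a GLM fit: logistic regression on invented binary data, estimated with a hand-rolled Newton (iteratively reweighted least squares) loop. The coefficients, sample size, and iteration count are arbitrary choices for illustration, not a production implementation.

```python
import numpy as np

rng = np.random.default_rng(5)

# Invented binary-outcome data: P(Y = 1) = g^{-1}(X beta), with g the logit link.
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 1.5, -1.0])
p = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.binomial(1, p)

# A bare-bones Newton / iteratively reweighted least squares loop for logistic regression.
beta = np.zeros(X.shape[1])
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-X @ beta))        # g^{-1}(X beta): modeled mean of Y
    W = mu * (1.0 - mu)                         # Bernoulli variance at mean mu
    grad = X.T @ (y - mu)                       # score (gradient of the log-likelihood)
    hess = X.T @ (W[:, None] * X)               # Fisher information
    beta = beta + np.linalg.solve(hess, grad)   # Newton step

print("estimated coefficients:", beta)          # near (-0.5, 1.5, -1.0)
```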

Estimation Methods: The Toolkit for Finding Coefficients

  • Least-Squares Estimation: The cornerstone. Ordinary least squares picks the coefficients that minimize the sum of squared residuals.
  • Maximum Likelihood Estimation (MLE): Assumes a specific distribution for the error terms (often normal). When errors are normal, MLE yields the same results as OLS. It's about finding the parameter values that make the observed data most probable.
  • Regularized Regression:
    • Ridge Regression: Adds a penalty for large coefficients (L2 norm) to shrink them towards zero, reducing variance at the cost of some bias. Useful for multicollinearity and overfitting. A closed-form sketch follows this list.
    • Lasso Regression: Adds an L1 norm penalty, which can force some coefficients to be exactly zero, performing variable selection.
  • Least Absolute Deviation (LAD): A robust estimation method that minimizes the sum of absolute errors. Less sensitive to outliers than OLS. Equivalent to MLE under a Laplace distribution for errors.
  • Adaptive Estimation: More advanced techniques that can estimate the error distribution non-parametrically first.
  • Other Techniques: Bayesian linear regression, mixed (multilevel) models, principal component regression, and a long tail of specialized estimators. The toolbox is deeper than anyone's patience.
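
The promised ridge sketch, in plain numpy, on invented data with more noise than signal; the penalty strength is an arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Invented data with more noise than signal, to give the penalty something to do.
n, p = 60, 10
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.concatenate([[1.0, 2.0, -1.5], np.zeros(p - 2)])
y = X @ beta_true + rng.normal(scale=2.0, size=n)

# Ridge regression: minimize ||y - X beta||^2 + lam * ||beta||^2.
# Closed form: beta = (X^T X + lam * I)^{-1} X^T y. The intercept is penalized
# here only to keep the sketch short; real implementations usually exempt it.
lam = 5.0  # arbitrary penalty strength
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print("OLS   coefficient norm:", np.linalg.norm(beta_ols))
print("Ridge coefficient norm:", np.linalg.norm(beta_ridge))  # shrunk toward zero
```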

Applications: Where This Stuff Actually Gets Used (or Abused)

Linear regression is ubiquitous.

  • Trend Lines: Visualizing long-term movements in data over time. Simple, intuitive, but prone to oversimplification.
  • Epidemiology: Trying to link factors like smoking to health outcomes. Researchers add covariates to control for confounding, but can never account for everything. Randomized controlled trials are often preferred for establishing causality.
  • Finance: The capital asset pricing model uses beta coefficients from linear regression to quantify systematic risk.
  • Economics: A primary tool for empirical analysis. Predicting everything from consumption spending to labor supply. Econometrics is built on these foundations.
  • Environmental Science: Modeling land use, disease spread (COVID-19 pandemic), and air pollution.
  • Building Science: Estimating occupant comfort based on temperature and other factors. Debates exist about the direction of regression.
  • Machine Learning: A fundamental supervised learning algorithm. Simple, interpretable, and a good baseline.

History: A Long and Winding Road

