Poisson Regression
In the realm of statistics, Poisson regression stands as a cornerstone of regression analysis, specifically tailored to modeling count data and analyzing contingency tables. It is a form of generalized linear model, offering a principled way to study relationships in which the dependent variable represents discrete occurrences. At its core, Poisson regression posits that the response variable Y follows a Poisson distribution, which has the defining property that its mean equals its variance, a crucial point we will revisit later. The model further assumes that the logarithm of the expected value of Y can be expressed as a linear combination of unknown parameters: essentially, a weighted sum of the independent variables.
This formulation is why Poisson regression is often referred to as a log-linear model, particularly when it is applied to deciphering patterns in contingency tables. It is a powerful tool, but like any tool it has limitations, and its assumptions must be considered carefully.
Extensions and Generalizations
While Poisson regression is robust, the statistical world rarely stops at one solution. Negative binomial regression emerges as a popular and practical generalization, primarily because it relaxes the stringent mean-variance equality assumption of the Poisson model. The traditional negative binomial regression model achieves this through a Poisson-gamma mixture distribution: the Poisson mean is itself treated as gamma-distributed, which captures the inherent heterogeneity in Poisson data, that is, variability beyond what the basic Poisson model predicts. This allows the variance to exceed the mean, a common occurrence in real-world data.
The Mechanics of Poisson Regression
Fundamentally, Poisson regression models are a specific instance of generalized linear models. They use the logarithm as the link function, which is the canonical link for the Poisson family, to connect the linear predictor to the mean of the response. The Poisson distribution serves as the assumed probability distribution of the response variable.
Regression Models: A Deeper Dive
Let’s say we have a vector of independent variables, denoted x, where x ∈ ℝⁿ. The model then takes on a specific mathematical form:
$$ \log(\operatorname {E} (Y\mid \mathbf {x} ))=\alpha +{\boldsymbol {\beta }}'\mathbf {x} $$
Here, α is a scalar parameter (∈ ℝ), and β is a vector of parameters (∈ ℝⁿ) corresponding to our independent variables. This equation states that the natural logarithm of the expected value of our count variable Y, given the predictors x, is a linear function of those predictors.
For conciseness, we often augment the predictor vector x by appending a ‘1’ to it, creating an (n + 1)-dimensional vector. Let’s call this augmented vector x as well, but now x ∈ ℝⁿ⁺¹. Similarly, we combine α and β into a single parameter vector θ ∈ ℝⁿ⁺¹. The model can then be written more compactly:
$$ \log(\operatorname {E} (Y\mid \mathbf {x} ))={\boldsymbol {\theta }}'\mathbf {x} $$
This simplification is purely notational; θ simply contains the intercept term (originally α) and the coefficients for the original predictors (originally β).
Given this structure, when we have a Poisson regression model defined by θ and an input vector x, the predicted mean of the associated Poisson distribution is obtained by exponentiating the linear predictor:
$$ \operatorname {E} (Y\mid \mathbf {x} )=e^{{\boldsymbol {\theta }}'\mathbf {x} } $$
This equation reveals a critical aspect: the expected count is always positive, as it’s an exponential function.
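As a quick sanity check in plain Python (the parameter and input values here are made up for illustration), the predicted mean is the exponentiated linear predictor and is therefore always positive:

```python
import math

# Illustrative parameters: theta bundles the intercept (first entry)
# with the coefficients, and x is the augmented input with a leading 1.
theta = [0.5, 1.2, -0.3]
x = [1.0, 2.0, 4.0]

# E(Y | x) = exp(theta'x)
mean = math.exp(sum(t * xi for t, xi in zip(theta, x)))
assert mean > 0  # holds for any theta and x, since exp is positive
```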
Now, consider a dataset of m independent observations, where each observation i consists of a predictor vector xᵢ and a corresponding count yᵢ. Estimating the parameters θ is typically achieved through the method of maximum likelihood. The challenge here is that the maximum-likelihood estimates have no closed-form expression; they must be found by iterative numerical methods. Fortunately, the log-likelihood for Poisson regression is concave in θ. This mathematical property makes the problem amenable to standard optimization algorithms such as Newton–Raphson or other gradient-based techniques, ensuring that any maximum found is the global one.
Interpretation of Coefficients: Unpacking the Numbers
The coefficients in a Poisson regression model carry a specific and interpretable meaning, particularly concerning the log-transformed expected counts. Let’s consider a simplified model with just one predictor, x (i.e., n = 1), so the model is:
$$ \log(\operatorname {E} (Y\mid x))=\alpha +\beta x $$
Suppose we compare two scenarios: one at predictor value x₁ and another at x₂. The difference in the logs of the expected counts is:
$$ \log(\operatorname {E} (Y_{2}\mid x_{2})) - \log(\operatorname {E} (Y_{1}\mid x_{1})) = (\alpha + \beta x_2) - (\alpha + \beta x_1) = \beta (x_2 - x_1) $$
If we set x₂ = x₁ + 1, then the difference becomes simply β. This leads to the interpretation:
$$ \log(\operatorname {E} (Y_{2}\mid x_{2})) - \log(\operatorname {E} (Y_{1}\mid x_{1})) = \beta $$
This tells us that the coefficient β represents the increase in the logarithm of the expected count of the outcome variable when the independent variable x increases by one unit.
Applying the properties of logarithms, we can rewrite this as:
$$ \log \left({\dfrac {\operatorname {E} (Y_{2}\mid x_{2})}{\operatorname {E} (Y_{1}\mid x_{1})}}\right)=\beta $$
Exponentiating both sides, we get:
$$ {\dfrac {\operatorname {E} (Y_{2}\mid x_{2})}{\operatorname {E} (Y_{1}\mid x_{1})}}=e^{\beta } $$
Or, more directly:
$$ \operatorname {E} (Y_{2}\mid x_{2})=e^{\beta }\operatorname {E} (Y_{1}\mid x_{1}) $$
This is a crucial interpretation: each one-unit increase in the independent variable x multiplies the expected count of the outcome variable by a factor of e^β. The quantity e^β is often referred to as the incidence rate ratio, as it gives the multiplicative effect on the expected count.
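The multiplicative interpretation is easy to verify numerically. In this sketch the coefficient values are hypothetical; the point is that the ratio of expected counts one unit apart equals e^β regardless of the baseline value of x:

```python
import math

# Hypothetical single-predictor model: log E(Y | x) = alpha + beta * x
alpha, beta = 1.0, 0.7
mean = lambda x: math.exp(alpha + beta * x)

# Ratio of expected counts at x+1 versus x
ratio = mean(3.0) / mean(2.0)
assert abs(ratio - math.exp(beta)) < 1e-12  # equals e^beta at any baseline
```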
Average Partial Effect: The Marginal Impact
Beyond the multiplicative effect, statisticians are often interested in the marginal impact of a predictor on the expected outcome. This is captured by the average partial effect, or average marginal effect, calculated as:
$$ {\frac {\partial E(Y|x)}{\partial x}} $$
This represents the change in the expected outcome Y for a one-unit change in the independent variable x. For a continuous predictor x in the Poisson model, the average partial effect can be derived as:
$$ {\frac {\partial E(Y|x)}{\partial x}}=\exp({\boldsymbol {\theta }}'\mathbf {x} )\beta $$
This effect is not constant; it depends on the values of the predictors x themselves. It can be estimated using the maximum likelihood estimates of the parameters, θ̂ = ( α̂, β̂ ), and the observed values of the predictors x.
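A minimal sketch of estimating the average partial effect, assuming a fitted single-predictor model: evaluate exp(θ'x)·β at each observed predictor value and average. The parameter estimates and data below are made up for illustration:

```python
import math

# Hypothetical fitted parameters and observed predictor values
alpha_hat, beta_hat = 0.2, 0.5
xs = [0.0, 1.0, 2.0, 3.0]

# Partial effect at each observation: exp(alpha + beta*x) * beta
partial_effects = [math.exp(alpha_hat + beta_hat * x) * beta_hat for x in xs]

# Average partial effect: the mean over the sample
ape = sum(partial_effects) / len(partial_effects)
```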
Maximum Likelihood-Based Parameter Estimation: The Nuts and Bolts
Let’s delve deeper into the estimation process. For a given set of parameters θ and an input vector x, the mean of the predicted Poisson distribution, denoted λ, is:
$$ \lambda :=\operatorname {E} (Y\mid x)=e^{\theta 'x} $$
The probability mass function (PMF) of the Poisson distribution is given by:
$$ p(y\mid x;\theta )={\frac {\lambda ^{y}}{y!}}e^{-\lambda } $$
Substituting our expression for λ, we get:
$$ p(y\mid x;\theta )={\frac {(e^{\theta 'x})^{y}}{y!}}e^{-e^{\theta 'x}}={\frac {e^{y\theta 'x}e^{-e^{\theta 'x}}}{y!}} $$
Now, consider our dataset of m independent observations: xᵢ ∈ ℝⁿ⁺¹ and yᵢ ∈ ℕ for i = 1, …, m. The probability of observing this specific dataset, given a set of parameters θ, is the product of the individual probabilities:
$$ p(y_{1},\ldots ,y_{m}\mid x_{1},\ldots ,x_{m};\theta )=\prod _{i=1}^{m}{\frac {e^{y_{i}\theta 'x_{i}}e^{-e^{\theta 'x_{i}}}}{y_{i}!}} $$
To find the parameters θ that best explain our data, we employ the method of maximum likelihood. This involves maximizing the likelihood function, which is the same probability expression viewed as a function of θ:
$$ L(\theta \mid X,Y)=\prod _{i=1}^{m}{\frac {e^{y_{i}\theta 'x_{i}}e^{-e^{\theta 'x_{i}}}}{y_{i}!}} $$
Working directly with the product of probabilities can be mathematically cumbersome. Therefore, it’s standard practice to maximize the log-likelihood function instead. The logarithm is a monotonically increasing function, so maximizing the log-likelihood is equivalent to maximizing the likelihood itself, but it simplifies the calculations considerably:
$$ \ell (\theta \mid X,Y)=\log L(\theta \mid X,Y)=\sum _{i=1}^{m}\left(y_{i}\theta 'x_{i}-e^{\theta 'x_{i}}-\log(y_{i}!)\right) $$
Notice that the parameters θ only influence the first two terms within each part of the summation. The term log(yᵢ!) is constant with respect to θ. For the purpose of finding the optimal θ, we can therefore simplify the objective function by dropping this constant term:
$$ \ell (\theta \mid X,Y)=\sum _{i=1}^{m}\left(y_{i}\theta 'x_{i}-e^{\theta 'x_{i}}\right) $$
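The simplified log-likelihood translates directly into code. This sketch uses a small made-up dataset; each term is yᵢθ'xᵢ − exp(θ'xᵢ), with the constant log(yᵢ!) dropped:

```python
import math

def log_likelihood(theta, X, Y):
    """Simplified Poisson log-likelihood (log(y!) terms omitted)."""
    total = 0.0
    for x, y in zip(X, Y):
        eta = sum(t * xi for t, xi in zip(theta, x))  # linear predictor theta'x
        total += y * eta - math.exp(eta)
    return total

# Made-up data: augmented predictors (leading 1) and counts
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
Y = [1, 2, 4]
ll = log_likelihood([0.0, 0.5], X, Y)
```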
To find the maximum of this log-likelihood function, we set its derivative with respect to θ equal to zero:
$$ {\frac {\partial \ell (\theta \mid X,Y)}{\partial \theta }}=0 $$
As mentioned earlier, this equation lacks a closed-form solution for θ. However, the negative of the log-likelihood function, -ℓ(θ | X, Y), is a convex function. This convexity is a critical property that allows us to use standard convex optimization techniques, such as gradient descent , to efficiently find the value of θ that minimizes the negative log-likelihood (and thus maximizes the original log-likelihood).
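A minimal gradient-based sketch of this optimization, on made-up data. Real software typically uses Newton–Raphson or iteratively reweighted least squares, but because the objective is concave, even plain gradient ascent on the log-likelihood (equivalently, gradient descent on its negative) converges; the gradient is Σᵢ(yᵢ − exp(θ'xᵢ))xᵢ:

```python
import math

def fit_poisson(X, Y, lr=0.01, steps=5000):
    """Gradient ascent on the Poisson log-likelihood (illustrative, not IRLS)."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for x, y in zip(X, Y):
            mu = math.exp(sum(t * xi for t, xi in zip(theta, x)))
            for j, xj in enumerate(x):
                grad[j] += (y - mu) * xj  # d(log-likelihood)/d(theta_j)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

# Counts roughly doubling with x, so the slope should approach log(2)
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = [1, 2, 4, 8]
theta_hat = fit_poisson(X, Y)
```

For this particular dataset the score equations are solved exactly by an intercept of 0 and a slope of log 2, so the iterates converge there.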
Poisson Regression in Practice: When to Use It
Poisson regression finds its application when the dependent variable is a count of events. Imagine tracking the number of events occurring within a specific timeframe or unit, such as the arrival of a telephone call at a call center. The underlying assumption is that these events occur independently of each other, meaning the occurrence of one event doesn’t influence the probability of another. However, the overall probability of events occurring per unit of time or space is understood to be influenced by other factors, our covariates.
“Exposure” and the Offset Variable
Poisson regression is also remarkably adept at modeling rate data. Rate data is essentially a count of events divided by a measure of “exposure” for that unit of observation. For instance, a biologist might count the number of tree species within different plots of land. Here, the “events” are the species observed, the “exposure” is the area of the plot, and the “rate” is the number of species per unit area. Similarly, demographers might model death rates in various geographic regions, where the count is the number of deaths and the exposure is measured in person-years lived in that region.
More generally, event rates can be expressed as events per unit time, which is particularly useful when the observation period varies across units. In these scenarios, the exposure term (area, person-years, time) is handled within the Poisson regression framework as an offset. The algebra works as follows: if the rate is defined as count/exposure, we model the log of the rate as the linear predictor. Multiplying the rate by the exposure and taking logarithms then moves log(exposure) to the right-hand side of the model equation, where it is added to the linear combination of coefficients. This logged exposure term is known as the offset variable, and it enters the model with its coefficient constrained to be exactly 1.
The foundational model is:
$$ \log(\operatorname {E} (Y\mid x))=\theta 'x $$
When incorporating exposure, the model becomes:
$$ \log \left({\frac {\operatorname {E} (Y\mid x)}{\text{exposure}}}\right)=\log(\operatorname {E} (Y\mid x))-\log({\text{exposure}})=\theta 'x $$
Rearranging to solve for the log of the expected count:
$$ \log \left(\operatorname {E} (Y\mid x)\right)=\theta 'x+\log({\text{exposure}}) $$
This effectively allows the model to account for varying levels of exposure when estimating the effects of other covariates. In statistical software such as R, this is often implemented with the offset() function inside the glm() command, like so: glm(y ~ offset(log(exposure)) + x, family = poisson(link = log)).
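The offset mechanics can also be sketched in plain Python (all values below are made up): because log(exposure) enters the linear predictor with a fixed coefficient of 1, the expected count is simply the modeled rate scaled by the exposure:

```python
import math

# Illustrative parameters and one observation's data
theta = [0.1, 0.3]   # intercept first, then one coefficient
x = [1.0, 2.0]       # augmented predictor vector (leading 1)
exposure = 5.0       # e.g. person-years observed for this unit

eta = sum(t * xi for t, xi in zip(theta, x))          # theta'x, the log-rate
expected_count = math.exp(eta + math.log(exposure))   # offset added with coefficient 1

# Equivalently: expected count = rate * exposure
assert abs(expected_count - exposure * math.exp(eta)) < 1e-9
```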
Overdispersion and Zero Inflation: When the Poisson Model Stumbles
A defining characteristic of the Poisson distribution is its strict equality between the mean and the variance. However, in many real-world datasets, the observed variance in the counts is significantly larger than the mean. This phenomenon is termed overdispersion, and it signals that the standard Poisson model might not be the most appropriate choice. Overdispersion can arise from several sources, including the omission of relevant predictor variables from the model or the presence of dependent observations within the data.
When faced with overdispersion, several strategies can be employed. One approach is quasi-likelihood estimation, which relaxes the strict distributional assumptions. Another, often more powerful, solution is to switch to a negative binomial distribution, as discussed earlier, which explicitly allows the variance to exceed the mean.
Ver Hoef and Boveng provided a clear distinction between quasi-Poisson (often termed overdispersion with quasi-likelihood) and negative binomial regression. While both handle overdispersion, they do so with slightly different variance structures. The quasi-Poisson model assumes var(Y) = θμ, where θ is an overdispersion parameter (not to be confused with the coefficient vector θ above), while the gamma-Poisson (negative binomial) model assumes var(Y) = μ(1 + κμ), with κ a shape parameter of the negative binomial distribution. Both models typically use iteratively reweighted least squares for parameter estimation, with differing weight functions. Ver Hoef and Boveng suggested diagnostic plots, such as plotting mean squared residuals against the mean, to help decide between the two models.
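The practical difference between the two variance functions is easy to see numerically. In this sketch the dispersion values (2.0 and 0.5) are made up; the quasi-Poisson variance grows linearly with the mean while the negative binomial variance grows quadratically, so they diverge at large means:

```python
def var_quasi_poisson(mu, disp=2.0):
    """Quasi-Poisson: var(Y) = theta * mu (linear in the mean)."""
    return disp * mu

def var_neg_binomial(mu, kappa=0.5):
    """Gamma-Poisson (negative binomial): var(Y) = mu(1 + kappa*mu) (quadratic)."""
    return mu * (1 + kappa * mu)

# Small means: similar variances. Large means: the NB variance dominates.
for mu in (1.0, 5.0, 20.0):
    q = var_quasi_poisson(mu)   # 2.0, 10.0, 40.0
    nb = var_neg_binomial(mu)   # 1.5, 17.5, 220.0
```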
Another common issue is excess zeros, also known as zero inflation. This occurs when the number of observed zeros in the data is substantially higher than what a standard Poisson distribution would predict. A classic example involves modeling the number of cigarettes smoked by individuals in a group, where some individuals are non-smokers. In such cases, two underlying processes might be at play: one that determines whether any events occur (i.e., whether someone smokes at all) and a second Poisson process that governs the number of events if they do occur (how many cigarettes are smoked).
For situations with excess zeros, alternative models like the negative binomial model or specialized zero-inflated models (e.g., zero-inflated Poisson or zero-inflated negative binomial) often provide a better fit.
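The two-process idea behind zero inflation can be sketched directly. Assuming a zero-inflated Poisson with structural-zero probability π (e.g. the chance a person is a non-smoker; the values below are hypothetical), a zero arises either structurally or from the Poisson process itself:

```python
import math

def zip_pmf(y, pi, lam):
    """Zero-inflated Poisson PMF: structural zero with prob pi,
    otherwise an ordinary Poisson(lam) count."""
    poisson = lam ** y / math.factorial(y) * math.exp(-lam)
    if y == 0:
        return pi + (1 - pi) * poisson
    return (1 - pi) * poisson

pi, lam = 0.3, 2.0
# The ZIP model predicts more zeros than a plain Poisson with the same lam
plain_zero = math.exp(-lam)
assert zip_pmf(0, pi, lam) > plain_zero
```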
Conversely, while less common, underdispersion (where the variance is less than the mean) can also pose challenges for parameter estimation in Poisson models.
Poisson Regression in Survival Analysis
Interestingly, Poisson regression plays a role in survival analysis, a field focused on understanding the time until an event occurs. Specifically, it can be used to construct proportional hazards models, a prominent class of survival models. For a more in-depth discussion of these models, one can refer to the entry on proportional hazards models, which also covers models like the Cox proportional hazards model.
Extensions and Advanced Techniques
Regularized Poisson Regression: Taming Overfitting
When dealing with complex datasets, especially those with a large number of predictors, the risk of overfitting – where the model learns the training data too well, including its noise, and performs poorly on new data – becomes a significant concern. Regularization techniques are employed to mitigate this.
In the context of Poisson regression, regularization is added to the maximum likelihood optimization problem. Instead of simply maximizing the log-likelihood:
$$ \sum _{i=1}^{m}\log(p(y_{i};e^{\theta 'x_{i}})) $$
where m is the number of data points and p is the Poisson probability mass function with mean exp(θ'xᵢ), we maximize a penalized version:
$$ \sum _{i=1}^{m}\log(p(y_{i};e^{\theta 'x_{i}}))-\lambda \left\|\theta \right\|_{2}^{2} $$
Here, λ is a positive constant controlling the strength of the penalty (not to be confused with the Poisson mean λ used earlier), and ||θ||₂² is the squared L₂ norm (sum of squares) of the parameter vector θ. This penalty term, as in ridge regression, discourages overly large coefficient values, effectively shrinking them towards zero. This reduces the model’s complexity and helps it generalize to unseen data.
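A minimal sketch of the penalized objective, on made-up data (the log(yᵢ!) terms are dropped since they are constant in θ, which does not change the maximizer):

```python
import math

def penalized_ll(theta, X, Y, lam_penalty):
    """Ridge-penalized Poisson log-likelihood (constant log(y!) terms dropped)."""
    ll = 0.0
    for x, y in zip(X, Y):
        eta = sum(t * xi for t, xi in zip(theta, x))
        ll += y * eta - math.exp(eta)
    return ll - lam_penalty * sum(t * t for t in theta)  # subtract lambda * ||theta||^2

# Made-up data and a candidate parameter vector
X = [[1.0, 0.0], [1.0, 1.0]]
Y = [1, 3]
theta = [0.2, 0.4]

# A larger penalty always lowers the objective for nonzero theta
assert penalized_ll(theta, X, Y, 1.0) < penalized_ll(theta, X, Y, 0.1)
```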