
Multinomial Logistic Regression

Ah, you want me to… explain something. How quaint. Fine. Let’s get this over with. Just try to keep up.

Regression for more than two discrete outcomes

So, we’re talking about when your dependent variable, the thing you’re trying to predict, isn't just a simple yes or no, or a one or a zero. It’s got… options. More than two, anyway. This is where multinomial regression comes in, a rather elaborate way of saying we're extending the usual logistic regression for situations with more than just two possible outcomes. If you're thinking about multinomial probit, that's a related beast, but we’re focusing on the logit side of things here.

Introduction

In the cold, indifferent world of statistics, multinomial logistic regression is a technique for classification. It's what you turn to when logistic regression, with its binary, two-choice world, just won't cut it anymore. When you have more than two discrete outcomes, more than two possibilities to consider, that’s when this method steps in. It’s a model designed to predict the probabilities of each of those distinct outcomes, based on a set of independent variables. These predictors, by the way, can be anything: real numbers, binary values, categories – the usual suspects.

It goes by many names, this thing. Multinomial logit (mlogit), softmax regression (because of the function it employs), polytomous LR, multiclass LR, the maximum entropy (MaxEnt) classifier, or even the conditional maximum entropy model. All referring to the same, dare I say, rather desperate attempt to impose order on chaos.

Background

So, why would you use multinomial logistic regression? When your dependent variable is nominal – meaning its categories have no inherent order, no hierarchy. Think of it as a collection of distinct items, not a ladder to be climbed. And, crucially, when there are more than two of these categories.

Consider these scenarios:

  • A college student’s choice of major, predicted from their grades, their professed interests, and other… influences.
  • A person’s blood type, deduced from a battery of diagnostic tests.
  • In a voice-activated system, identifying which name was spoken, based on the nuances of the speech signal.
  • Predicting a voter's choice based on their demographic profile.
  • A firm deciding where to plant its flag, based on its own characteristics and those of potential locations.

These are all statistical classification problems. The common thread is a dependent variable that lives in a set of unordered categories, and a collection of independent variables – features, explanators, whatever you want to call them – used to make that prediction. Multinomial logistic regression offers a way to estimate the probability of each outcome by combining these features linearly, adjusted by problem-specific parameters. The best parameters? Those are usually learned from training data – from observing people with known blood types and test results, or known words spoken with their corresponding audio signatures.

Assumptions

This model operates under certain assumptions, as all elegant but ultimately flawed constructs do. It assumes that each independent variable has a single, distinct value for each case. Unlike some other models, it doesn't demand that your independent variables be statistically independent of each other. However, if they're too closely related – a condition known as collinearity – it becomes difficult to discern the unique impact of each. It's like trying to distinguish individual voices in a shouting match.

Now, if you’re using this model to predict choices, it often relies on the assumption of independence of irrelevant alternatives (IIA). This is a rather stringent condition: the odds of choosing one option over another should not change, regardless of what other, supposedly irrelevant, options are on the menu. Imagine the choice between a car and a bus to work. IIA says that adding a bicycle to the mix won't alter the car-vs-bus odds. This is what allows a K-choice problem to be broken down into K − 1 independent binary choices, each comparing one option against a single "pivot" option. Theoretically neat, but actual behaviour often violates it. Consider the car-bus scenario again, now with a blue bus and a red bus that travellers regard as interchangeable. In reality, adding the red bus splits the bus riders and so changes the odds of car versus blue bus, even though the red bus was supposedly "irrelevant"; the logit model, by construction, insists those odds stay fixed. The assumption breaks down precisely when alternatives are near-perfect substitutes.
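
To see exactly what the model commits you to, here is a minimal sketch in Python (the utility numbers are invented purely for illustration): under a multinomial logit, adding a red bus with the same systematic utility as the blue bus leaves the car-to-blue-bus odds exactly where they were, even as the blue bus's own share drops.

```python
import numpy as np

def logit_shares(utilities):
    """Multinomial-logit choice probabilities from systematic utilities."""
    exp_u = np.exp(np.asarray(utilities, dtype=float))
    return exp_u / exp_u.sum()

# Hypothetical systematic utilities (illustrative values only).
u_car, u_blue_bus = 1.0, 0.5

p_before = logit_shares([u_car, u_blue_bus])             # {car, blue bus}
p_after = logit_shares([u_car, u_blue_bus, u_blue_bus])  # add an identical red bus

print(p_before[0] / p_before[1])   # car vs. blue bus odds, ~1.6487
print(p_after[0] / p_after[1])     # unchanged, ~1.6487: that is IIA
print(p_before[1], p_after[1])     # yet the blue bus share itself drops
```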

This IIA assumption can be overly restrictive, especially when you want to predict how choices might shift if one option disappears. If a candidate withdraws from an election, for instance. In such cases, models like nested logit or multinomial probit might be more appropriate, as they relax the IIA.

Model

There are… several ways to articulate the mathematical underpinnings of multinomial logistic regression. It can be a bit of a labyrinth, making comparisons across texts a challenge. The principles laid out for simple logistic regression have their echoes here, in the multinomial realm.

At its core, the idea is to build a linear predictor function. This function takes your input variables, multiplies them by a set of weights – learned parameters – and produces a "score."

\operatorname{score}(\mathbf{X}_i, k) = \boldsymbol{\beta}_k \cdot \mathbf{X}_i

Here, X_i is the vector of explanatory variables for observation i, β_k is the vector of weights (regression coefficients) associated with outcome k, and score(X_i, k) is the score assigned to observation i for category k. In the context of discrete choice theory, this score is often interpreted as the "utility" derived by individual i from choosing outcome k. The predicted outcome is simply the one with the highest score.

What distinguishes the multinomial logit model from other methods like the perceptron, support vector machines, or linear discriminant analysis is how these weights are determined and how the score is ultimately used. Crucially, in multinomial logit, the score can be directly translated into a probability – the probability that observation i will fall into category k, given its characteristics. This probabilistic output is vital. It allows for a more nuanced integration into larger predictive systems, mitigating the cascading effects of error propagation that plague models relying solely on single, definitive predictions. When models are chained together, even small errors can multiply disastrously. Probabilities, however, offer a buffer.

Setup

The foundational setup mirrors that of logistic regression; the key difference is the expansion from binary to multiple categorical outcomes. Let's assume we have N data points. Each point i (from 1 to N) consists of M explanatory variables, x_i = (x_{1,i}, …, x_{M,i}), and an associated categorical outcome Y_i, which can take one of K possible values. These values are distinct categories, like different political parties or blood types, typically numbered from 1 to K. The goal is to model the relationship between the explanatory variables and the outcome, enabling predictions for new data points.

Consider these examples:

  • Predicting the type of hepatitis a patient has, based on factors like sex, age, and blood pressure.
  • Forecasting election results by analyzing voter demographics.

Linear predictor

We employ a linear predictor function, f(k, i), to estimate the probability of observation i yielding outcome k:

f(k,i) = \beta_{0,k} + \beta_{1,k} x_{1,i} + \beta_{2,k} x_{2,i} + \cdots + \beta_{M,k} x_{M,i}

Here, β_{m,k} are the regression coefficients linking the m-th explanatory variable to the k-th outcome. In vector form:

f(k,i) = \boldsymbol{\beta}_k \cdot \mathbf{x}_i

where β_k is the vector of coefficients for outcome k, and x_i includes a leading 1 for the intercept.
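
To make that convention concrete, here is a small sketch (dimensions, numbers, and variable names are arbitrary) that prepends the 1 to x_i and evaluates f(k, i) for every outcome at once by stacking the coefficient vectors into a matrix.

```python
import numpy as np

# Illustrative dimensions: K outcomes, M explanatory variables.
K, M = 3, 4
rng = np.random.default_rng(0)

B = rng.normal(size=(K, M + 1))     # row k holds (beta_{0,k}, beta_{1,k}, ..., beta_{M,k})
x = rng.normal(size=M)              # explanatory variables for one observation i

x_aug = np.concatenate(([1.0], x))  # the leading 1 picks up the intercept beta_{0,k}
scores = B @ x_aug                  # scores[k] = f(k, i) = beta_k . x_i
print(scores)
```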

As a set of independent binary regressions

One way to conceptualize this is by setting up K − 1 independent binary logistic regressions. One outcome is designated the "pivot," and the others are regressed against it:

\ln \frac{\Pr(Y_i = k)}{\Pr(Y_i = K)} = \boldsymbol{\beta}_k \cdot \mathbf{X}_i, \quad 1 \leq k < K

This is akin to the additive log ratio transform used in compositional data analysis; the ratio of probabilities itself is sometimes referred to as the relative risk. Exponentiating and solving for the probabilities yields:

\Pr(Y_i = k) = \Pr(Y_i = K)\, e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i}, \quad 1 \leq k < K

Ensuring the probabilities sum to one leads to:

\Pr(Y_i = K) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol{\beta}_j \cdot \mathbf{X}_i}}

And subsequently:

\Pr(Y_i = k) = \frac{e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol{\beta}_j \cdot \mathbf{X}_i}}, \quad 1 \leq k < K

This formulation highlights the reliance on the independence of irrelevant alternatives assumption.
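
The two formulas above translate directly into code. A minimal sketch, with category K as the pivot (the function name and numbers are illustrative, not taken from any library):

```python
import numpy as np

def pivot_probabilities(B_rel, x):
    """All K class probabilities when category K is the pivot.

    B_rel has shape (K-1, M): row k holds beta_k, the coefficients of
    ln[ Pr(Y=k) / Pr(Y=K) ] for the K-1 non-pivot categories.
    """
    scores = B_rel @ x                   # beta_k . x for k = 1, ..., K-1
    denom = 1.0 + np.exp(scores).sum()
    p_pivot = 1.0 / denom                # Pr(Y = K)
    p_rest = np.exp(scores) / denom      # Pr(Y = k) for k < K
    return np.concatenate([p_rest, [p_pivot]])

# Illustrative numbers only: K = 3 categories, M = 2 explanatory variables.
B_rel = np.array([[0.2, -0.5],
                  [1.0,  0.3]])
x = np.array([1.5, -0.7])
p = pivot_probabilities(B_rel, x)
print(p, p.sum())                        # the probabilities sum to 1
```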

Estimating the coefficients

The parameters (β\beta coefficients) are typically estimated using maximum a posteriori (MAP) estimation, which often involves regularization to prevent overfitting. Iterative methods like generalized iterative scaling, iteratively reweighted least squares (IRLS), or gradient-based optimization algorithms such as L-BFGS are commonly employed. Specialized coordinate descent algorithms also exist.
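
In practice one rarely hand-codes these optimizers. A sketch using scikit-learn, assuming it is installed and feeding it synthetic data purely for illustration: its L-BFGS solver with the default L2 penalty fits a regularized softmax model, which amounts to MAP estimation under a Gaussian prior on the coefficients.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Purely synthetic, illustrative data: 200 observations, 4 features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
true_B = rng.normal(size=(3, 4))
y = np.argmax(X @ true_B.T + rng.gumbel(size=(200, 3)), axis=1)

# Default L2 penalty with the L-BFGS solver: a regularized multinomial fit.
clf = LogisticRegression(solver="lbfgs", max_iter=1000)
clf.fit(X, y)

print(clf.coef_.shape)            # one coefficient vector per class: (3, 4)
print(clf.predict_proba(X[:2]))   # per-class probabilities for the first two rows
```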

As a log-linear model

Extending the log-linear model formulation for binary cases, we model the logarithm of the probability:

\ln \Pr(Y_i = k) = \boldsymbol{\beta}_k \cdot \mathbf{X}_i - \ln Z, \quad 1 \leq k \leq K

where Z is a normalization factor, the partition function, ensuring the probabilities sum to one. Exponentiating gives the Gibbs measure:

\Pr(Y_i = k) = \frac{1}{Z} e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i}, \quad 1 \leq k \leq K

The partition function Z is calculated as:

Z = \sum_{k=1}^{K} e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i}

The function:

\operatorname{softmax}(k, s_1, \ldots, s_K) = \frac{e^{s_k}}{\sum_{j=1}^{K} e^{s_j}}

is known as the softmax function. It acts as a smoothed approximation of the indicator function that picks out the maximum score, while turning arbitrary real-valued scores into a proper probability distribution. Thus, the probabilities can be written as:

\Pr(Y_i = k) = \operatorname{softmax}(k, \boldsymbol{\beta}_1 \cdot \mathbf{X}_i, \ldots, \boldsymbol{\beta}_K \cdot \mathbf{X}_i)

A key point is identifiability. Since the probabilities must sum to one, only K − 1 of the coefficient vectors are separately identifiable. Adding a constant vector C to every β_k leaves the probabilities unchanged:

\frac{e^{(\boldsymbol{\beta}_k + \mathbf{C}) \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{(\boldsymbol{\beta}_j + \mathbf{C}) \cdot \mathbf{X}_i}} = \frac{e^{\boldsymbol{\beta}_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol{\beta}_j \cdot \mathbf{X}_i}}

To resolve this, one vector is often set to zero, effectively "pivoting" around one outcome.
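
A quick numerical check of this invariance (the numbers are arbitrary): shifting every β_k by the same vector C leaves the softmax probabilities untouched, which is precisely why one coefficient vector can be pinned to zero without losing anything.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a vector of class scores."""
    s = np.asarray(scores, dtype=float)
    e = np.exp(s - s.max())     # subtracting a constant changes nothing
    return e / e.sum()

rng = np.random.default_rng(1)
K, M = 4, 3
B = rng.normal(size=(K, M))     # one coefficient vector per outcome
C = rng.normal(size=M)          # an arbitrary constant vector
x = rng.normal(size=M)

p_original = softmax(B @ x)
p_shifted = softmax((B + C) @ x)           # every beta_k shifted by the same C
print(np.allclose(p_original, p_shifted))  # True: the probabilities are identical

# Fix the indeterminacy by forcing beta_K = 0 (subtract the last row from all rows).
B_pivot = B - B[-1]
print(np.allclose(softmax(B_pivot @ x), p_original))  # True again
```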

As a latent-variable model

This formulation, common in discrete choice theory, views multinomial logit as a latent variable model. For each observation i and outcome k, there is a continuous latent variable Y*_{i,k}:

Y_{i,k}^{\ast} = \boldsymbol{\beta}_k \cdot \mathbf{X}_i + \varepsilon_k, \quad k \leq K

where the ε_k are independent and identically distributed draws from a standard type-1 extreme value (Gumbel) distribution, with location 0 and scale 1. The observed outcome Y_i is the category with the highest latent utility:

\Pr(Y_i = k) = \Pr\bigl(\max(Y_{i,1}^{\ast}, Y_{i,2}^{\ast}, \ldots, Y_{i,K}^{\ast}) = Y_{i,k}^{\ast}\bigr), \quad k \leq K

This leads to probabilities of the logistic (softmax) form, since the difference of two independent type-1 extreme value variables follows a logistic distribution. The common scale of the error terms cannot be identified separately from the magnitude of the coefficients, so it is conventionally fixed at 1; without such a constraint the model is not identifiable.
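
The equivalence can be verified by simulation, a sketch under the stated Gumbel assumption (the systematic utilities below are invented): draw the extreme value noise, take the argmax of the latent utilities, and compare the empirical choice frequencies with the closed-form softmax probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Systematic utilities beta_k . X_i for one observation (invented values).
v = np.array([0.8, 0.1, -0.4])
K = v.size

# Latent-variable model: Y*_k = v_k + Gumbel(0, 1) noise; observe the argmax.
n_draws = 200_000
eps = rng.gumbel(loc=0.0, scale=1.0, size=(n_draws, K))
choices = np.argmax(v + eps, axis=1)
empirical = np.bincount(choices, minlength=K) / n_draws

# Closed-form multinomial-logit (softmax) probabilities.
closed_form = np.exp(v) / np.exp(v).sum()

print(empirical)     # close to closed_form, up to Monte Carlo error
print(closed_form)
```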

Estimation of intercept

When applying multinomial logistic regression, a reference category is chosen, and the coefficients for every other category are interpreted relative to it. For a given predictor, the exponentiated coefficient e^β gives the factor by which the odds of a specific category, relative to the reference category, change for a one-unit increase in that predictor, holding the other predictors fixed.
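
A tiny worked example of that interpretation, with a purely hypothetical coefficient value:

```python
import numpy as np

# Hypothetical fitted coefficient for one predictor, category k vs. the reference.
beta = 0.4
odds_multiplier = np.exp(beta)   # about 1.49

# A one-unit increase in this predictor multiplies the odds of category k,
# relative to the reference category, by roughly 1.49 (other predictors held fixed).
print(odds_multiplier)
```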

Likelihood function

Assuming independent observations y_i drawn from categorically distributed variables Y_i, the likelihood function is:

L = \prod_{i=1}^{n} P(Y_i = y_i) = \prod_{i=1}^{n} \prod_{j=1}^{K} P(Y_i = j)^{\delta_{j, y_i}}

where δ_{j,y_i} is the Kronecker delta. The negative log-likelihood, often termed cross-entropy, is:

-\log L = -\sum_{i=1}^{n} \sum_{j=1}^{K} \delta_{j, y_i} \log P(Y_i = j)
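
A sketch of evaluating this quantity for a handful of observations, with the Kronecker delta realized by indexing each row of predicted probabilities at its observed category (all numbers invented):

```python
import numpy as np

def negative_log_likelihood(probs, y):
    """Cross-entropy: -sum_i log P(Y_i = y_i).

    probs has shape (n, K), each row a predicted distribution over the K
    categories; y holds the observed categories as integers 0, ..., K-1.
    """
    n = probs.shape[0]
    return -np.sum(np.log(probs[np.arange(n), y]))

# Illustrative predicted probabilities and observed outcomes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])
y = np.array([0, 1, 2])

print(negative_log_likelihood(probs, y))   # -(ln 0.7 + ln 0.6 + ln 0.6) ≈ 1.38
```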

Application in natural language processing

In natural language processing, multinomial logistic regression classifiers are often favored over naive Bayes classifiers because they don't assume statistical independence among predictor variables (features). However, training these models is slower, especially with a large number of classes. Unlike naive Bayes, which involves simple counting, multinomial logistic regression requires iterative optimization, often using maximum a posteriori (MAP) estimation, to determine the weights.
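
As a sketch of what this looks like for text, assuming scikit-learn is available and using a handful of toy documents (everything below is illustrative): bag-of-words counts feed a multinomial logistic, i.e. maximum entropy, classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy documents and topic labels, purely illustrative.
docs = ["the match ended in a draw", "parliament passed the budget",
        "the striker scored twice", "the senate debated the bill",
        "new vaccine trial results", "hospital reports flu outbreak"]
labels = ["sport", "politics", "sport", "politics", "health", "health"]

# Bag-of-words counts feeding a multinomial logistic (maximum entropy) classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(docs, labels)

test = ["the goalkeeper saved the match"]
print(model.predict(test))         # most likely "sport"
print(model.predict_proba(test))   # probabilities over all three topics
```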
