Variable Capable of Taking on a Limited Number of Possible Values
In statistics, a categorical variable (also called a qualitative variable) is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property. [1] In computer science and some branches of mathematics, categorical variables are referred to as enumerations or enumerated types. Commonly, each of the possible values of a categorical variable is referred to as a level. The probability distribution associated with a random categorical variable is called a categorical distribution.
Categorical data is the statistical data type consisting of categorical variables, or of data that has been converted into that form, for example as grouped data. More specifically, categorical data may derive from observations made of qualitative data that are summarized as counts or cross tabulations, or from observations of quantitative data grouped within given intervals. Purely categorical data is often summarized in the form of a contingency table. However, particularly when considering data analysis, the term "categorical data" is commonly applied to datasets that, while containing some categorical variables, also contain non-categorical variables. An important distinction is between ordinal variables, which have a meaningful order, and nominal variables, which are labels with no inherent ranking.
A categorical variable that can take on exactly two values is termed a binary variable or a dichotomous variable; an important special case is the Bernoulli variable. Categorical variables with more than two possible values are called polytomous variables; categorical variables are often assumed to be polytomous unless otherwise specified. Discretization is treating continuous data as if it were categorical, and dichotomization is treating continuous data or polytomous variables as if they were binary. In regression analysis, category membership is often represented with one or more quantitative dummy variables.
Examples of Categorical Variables
Examples of values that might be represented by a categorical variable include:
- Demographic Information: The gender of an individual, their disease status, or their nationality.
- Blood Type: A person’s blood type, whether it’s A, B, AB, or O, is a classic example of a nominal category. There’s no inherent ranking here, just distinct classifications.
- Political Affiliation: The political party a voter might align with, say, Green Party, Christian Democrat, or Social Democrat, represents distinct choices, not a spectrum of preference.
- Rock Type: The classification of a rock as igneous , sedimentary , or metamorphic is a fundamental categorical distinction in geology.
- Linguistic Elements: In the intricate world of language models , the identity of a particular word can be treated as a categorical variable. If your vocabulary has V distinct words, then each word is one of V possible choices.
Notation
For ease of statistical processing, categorical variables are often assigned numerical indices; for instance, a categorical variable with K possible values might be represented by the numbers 1 through K. In general, however, these numbers are arbitrary and serve merely as convenient labels. In essence, the values of a categorical variable exist on a nominal scale: each value represents a distinct concept, the values cannot necessarily be meaningfully ordered, and they cannot be manipulated as actual numbers. The valid operations are equivalence, set membership, and other set-related operations.
Consequently, the measure of central tendency for a set of categorical variables is invariably its mode . Neither the mean nor the median can be meaningfully calculated. Imagine trying to find the average of a list of last names. You can determine if two names are the same (equivalence), if a name exists in a specific list (set membership), count how many times a name appears, or identify the most frequent name (the mode). But asking for the “sum” of “Smith” and “Johnson,” or whether “Smith” is “less than” or “greater than” “Johnson,” is nonsensical. This is why the “average name” or the “middle-most name” are concepts that simply don’t apply.
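The valid operations described above, and the mode as the only applicable measure of central tendency, can be sketched in a few lines of Python (the list of names is, of course, made up for illustration):

```python
from collections import Counter

# Hypothetical list of last names: a purely nominal variable.
names = ["Smith", "Johnson", "Smith", "Lee", "Smith", "Johnson"]

# Valid operations: equivalence, set membership, counting, and the mode.
assert names[0] == names[2]         # equivalence: two entries are the same name
assert "Lee" in set(names)          # set membership
counts = Counter(names)             # frequency of each level
mode = counts.most_common(1)[0][0]  # the most frequent name is the mode
print(mode)  # Smith
```

Note that `sum(names)` or `names[0] < names[1]` would either fail or rely on an arbitrary alphabetical ordering, mirroring the point that "Smith" + "Johnson" is statistically meaningless.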
Note that this ignores the concept of alphabetical order, which is a property of our chosen labeling system rather than an inherent characteristic of the names themselves. For example, if we write those same names in Cyrillic and apply the Cyrillic alphabet's ordering, the comparison "Smith < Johnson" might yield a different result than in the standard Latin alphabet; and if we represent them using Chinese characters, we cannot meaningfully evaluate the ordering at all, because no consistent, universally agreed-upon order exists for such characters. However, if we do consider the names as written in, say, the Latin alphabet, and define an order based on standard alphabetical rules, we have effectively converted them into ordinal variables defined on an ordinal scale.
Number of Possible Values
Categorical random variables are typically described statistically using a categorical distribution . This framework allows for an arbitrary K-way categorical variable to be represented by specifying separate probabilities for each of its K possible outcomes. When dealing with categorical variables that have multiple categories, the multinomial distribution is often employed to count the frequency of each possible combination of occurrences across the various categories. For regression analysis involving categorical outcomes, techniques like multinomial logistic regression , multinomial probit , or related discrete choice models are used.
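A K-way categorical variable, together with the multinomial counts it produces, can be simulated with the standard library alone. This is a minimal sketch; the blood-type probabilities below are illustrative values, not population figures:

```python
import random
from collections import Counter

random.seed(0)

levels = ["A", "B", "AB", "O"]       # K = 4 categories (blood types)
probs = [0.34, 0.09, 0.03, 0.54]     # hypothetical category probabilities, summing to 1

# Draw 10,000 samples from the categorical distribution ...
draws = random.choices(levels, weights=probs, k=10_000)

# ... and tally them: the vector of counts follows a multinomial distribution.
freq = Counter(draws)
```

Each individual draw is one categorical outcome; the `freq` vector of counts over the K levels is the multinomial summary the paragraph describes.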
Categorical variables that are limited to exactly two possible outcomes, such as "yes" versus "no" or "success" versus "failure", are known as binary variables (or Bernoulli variables). Due to their fundamental importance, these variables are often treated as a distinct category, possessing their own distribution (the Bernoulli distribution) and specialized regression models (logistic regression, probit regression, and so forth). As a consequence, the term "categorical variable" is frequently reserved for situations involving three or more outcomes, sometimes referred to as a multi-way variable in contrast to a binary variable.
There are also scenarios where the number of possible categories for a categorical variable isn’t fixed in advance. Consider a categorical variable representing a specific word in a language . You might not know the entire vocabulary size beforehand and need to accommodate the possibility of encountering words you haven’t seen before. Standard statistical models, including those relying on the categorical distribution and multinomial logistic regression , operate under the assumption that the number of categories is known upfront. Dynamically altering the number of categories during analysis is a rather tricky affair. In such complex situations, more advanced techniques become necessary. The Dirichlet process , for instance, falls within the domain of nonparametric statistics . This approach logically assumes an infinite number of potential categories, but at any given moment, most of them (in fact, all but a finite number) remain unobserved. All calculations are framed in terms of the categories actually encountered so far, rather than the total, infinite set of potential categories. These methods are designed for incremental updates of statistical distributions, crucially including the capacity to incorporate “new” categories as they emerge.
Categorical Variables and Regression
Categorical variables represent a qualitative method of categorizing data; they essentially denote categories or group memberships. While they can be incorporated as independent variables in a regression analysis or serve as dependent variables in logistic regression or probit regression , they must first be converted into quantitative data to enable analysis. This transformation is achieved through various coding systems. A common practice in these analyses is to code only g - 1 variables, where g represents the number of groups. This approach minimizes redundancy while ensuring the complete dataset is represented, as coding all g groups would provide no additional information. For example, when coding gender (where g = 2: male and female), if you only code for females, everyone else must necessarily be male. Generally, the group omitted from coding is the one of least interest to the researcher. [2]
There are three primary coding systems typically employed when analyzing categorical variables within regression: dummy coding, effects coding, and contrast coding. The general form of a regression equation is Y = bX + a, where b is the slope, representing the empirically assigned weight to an explanatory variable (X), and a is the Y-intercept. The specific meaning of these values shifts depending on the coding system utilized. Importantly, the choice of coding system does not alter the F or R² statistics. However, researchers select a coding system based on the specific comparisons they wish to make, as the interpretation of the b values will vary accordingly. [2]
Dummy Coding
Dummy coding is the preferred method when a control or comparison group is central to the research question. The analysis then focuses on examining the data of one group in relation to this established comparison group. In this system, a represents the mean of the control group, and b signifies the difference between the mean of the experimental group and the mean of the control group. For a control group to be considered suitable, it is generally recommended that it meets three criteria: it should be a well-established group (avoiding vague categories like "other"), there should be a logical justification for selecting it as the benchmark (e.g., it is expected to perform highest on the dependent variable), and its sample size should be substantial, not disproportionately small compared to other groups. [3]
In dummy coding, the designated reference group is assigned a value of 0 for each code variable. The group of particular interest for comparison against the reference group is assigned a value of 1 for its specific code variable, while all other groups receive a 0 for that particular code variable. [2]
The b values in dummy coding are interpreted as the difference between the experimental group and the control group. Consequently, a negative b value would indicate that the experimental group scored lower on the dependent variable than the control group. To illustrate: suppose we are assessing optimism levels across various nationalities, with French individuals serving as the control group. If we compare them against Italians and observe a negative b value, this suggests that Italians, on average, exhibit lower optimism scores.
The following table provides an example of dummy coding, with French as the control group and C1, C2, and C3 representing the codes for Italian, German, and “Other” (not French, Italian, or German), respectively:
| Nationality | C1 | C2 | C3 |
|---|---|---|---|
| French | 0 | 0 | 0 |
| Italian | 1 | 0 | 0 |
| German | 0 | 1 | 0 |
| Other | 0 | 0 | 1 |
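The table above can be generated mechanically: with French as the reference, each of the remaining g − 1 groups gets a 1 in its own code variable and 0 elsewhere. A minimal sketch in plain Python:

```python
# g = 4 nationality groups; French is the reference (all-zero) group.
levels = ["French", "Italian", "German", "Other"]
reference = "French"

# One code variable per non-reference group, in table order (C1, C2, C3).
code_groups = [g for g in levels if g != reference]

# Each group's row: 1 in its own column, 0 elsewhere; the reference is all zeros.
coded = {lvl: [int(lvl == g) for g in code_groups] for lvl in levels}
# coded["French"]  -> [0, 0, 0]
# coded["Italian"] -> [1, 0, 0]
```

With g = 4 groups, only g − 1 = 3 code variables are needed: knowing a row is not Italian, German, or Other pins it down as French.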
Effects Coding
Within the effects coding system, data analysis involves comparing each group against the mean of all other groups combined. Unlike dummy coding, there isn't a single, designated control group. Instead, the comparison is made against the grand mean, the average across all groups. Therefore, the focus shifts from comparing one group to another to understanding how each group deviates from the overall average. [2]
Effects coding can be either weighted or unweighted. Weighted effects coding calculates a weighted grand mean, incorporating the sample size of each group. This approach is most suitable when the sample is considered representative of the broader population. Unweighted effects coding, on the other hand, is more appropriate when sample size variations are due to incidental factors. The interpretation of b differs: in unweighted effects coding, b represents the difference between the mean of the experimental group and the grand mean; in the weighted scenario, it represents the mean of the experimental group minus the weighted grand mean. [2]
In effects coding, the group of interest is coded as 1, similar to dummy coding. The key distinction is that the group of least interest is coded as -1. Following the g - 1 coding scheme, it is this -1 coded group that will not generate its own data, thus signifying it as the least interesting. All other groups are assigned a code of 0.
The b values in effects coding are interpreted as the difference between the mean of the experimental group and the grand mean (or weighted grand mean, in the case of weighted effects coding). Consequently, a negative b value would indicate that the coded group scored below the overall mean on the dependent variable. Using our optimism score example, if Italians are the group of interest, a negative b value would suggest they have lower optimism scores on average compared to the combined group.
The following table illustrates effects coding, with “Other” designated as the group of least interest:
| Nationality | C1 | C2 | C3 |
|---|---|---|---|
| French | 0 | 0 | 1 |
| Italian | 1 | 0 | 0 |
| German | 0 | 1 | 0 |
| Other | -1 | -1 | -1 |
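The effects-coding table follows the same g − 1 pattern as dummy coding, except the group of least interest is coded −1 on every code variable. A small sketch reproducing the table above (column order C1 = Italian, C2 = German, C3 = French):

```python
code_groups = ["Italian", "German", "French"]  # one code variable per group of interest
least_interest = "Other"                       # coded -1 on every code variable

def effects_code(group):
    # Groups of interest get a 1 in their own column and 0 elsewhere;
    # the least-interesting group gets -1 everywhere.
    if group == least_interest:
        return [-1] * len(code_groups)
    return [int(group == g) for g in code_groups]

codes = {g: effects_code(g) for g in code_groups + [least_interest]}
```

The −1 row is what makes each b a deviation from the grand mean rather than a difference from a reference group.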
Contrast Coding
The contrast coding system empowers researchers to pose highly specific questions. Instead of the coding system dictating the comparisons (as in dummy coding’s comparison to a control group, or effects coding’s comparison to all groups), contrast coding allows for the design of unique comparisons tailored to particular research hypotheses. These hypotheses are typically grounded in prior theory and/or research. The common structure of these proposed hypotheses involves a central hypothesis postulating a significant difference between two sets of groups, followed by a secondary hypothesis suggesting that within each set, the differences among the groups are minimal. Through its focus on a priori hypotheses, contrast coding can potentially enhance the power of the statistical test compared to the less directed approaches of dummy and effects coding. [2]
There are subtle but important differences when comparing a priori coefficients used in ANOVA versus regression. In ANOVA, researchers have the discretion to choose coefficient values that are either orthogonal or non-orthogonal. However, in regression, it is imperative that the coefficient values assigned in contrast coding are orthogonal. Furthermore, in regression, these coefficient values must be expressed in either fractional or decimal form; they cannot be interval values.
The construction of contrast codes is governed by three fundamental rules:
- The sum of the contrast coefficients for each code variable must equal zero.
- The difference between the sum of the positive coefficients and the sum of the negative coefficients must equal 1.
- The coded variables must be orthogonal. [2]
Violating the second rule will still yield accurate R² and F values, meaning the conclusions drawn about the significance of differences will remain the same. However, the interpretation of the b values as representing mean differences becomes unreliable.
To demonstrate the construction of contrast codes, consider the following table. The coefficients were chosen to align with our a priori hypotheses: Hypothesis 1 posits that French and Italian individuals will score higher on optimism than Germans (French = +0.33, Italian = +0.33, German = -0.66). This is reflected by assigning the same coefficient to the French and Italian categories and a markedly different, negative one to Germans. The signs indicate the hypothesized direction of the relationship, hence the negative sign for Germans aligns with their expected lower optimism scores. Hypothesis 2 predicts that French and Italians are expected to differ in their optimism scores (French = +0.50, Italian = -0.50, German = 0). Here, assigning a zero value to Germans signifies their exclusion from the analysis pertaining to this specific hypothesis. Again, the assigned signs reflect the proposed directional relationship.
| Nationality | C1 | C2 |
|---|---|---|
| French | +0.33 | +0.50 |
| Italian | +0.33 | -0.50 |
| German | -0.66 | 0 |
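Two of the three construction rules, zero-sum coefficients and orthogonality, can be checked numerically for the table above. This sketch uses the rounded coefficients exactly as tabled (0.33 and -0.66 stand in for the exact fractions 1/3 and -2/3):

```python
import math

# Contrast codes from the table (Hypothesis 1 and Hypothesis 2).
C1 = {"French": 0.33, "Italian": 0.33, "German": -0.66}
C2 = {"French": 0.50, "Italian": -0.50, "German": 0.0}

# Rule 1: the coefficients of each code variable sum to zero.
sum_c1 = sum(C1.values())
sum_c2 = sum(C2.values())

# Rule 3: orthogonality -- the dot product of the two code variables is zero.
dot = sum(C1[g] * C2[g] for g in C1)

assert math.isclose(sum_c1, 0.0, abs_tol=1e-9)
assert math.isclose(sum_c2, 0.0, abs_tol=1e-9)
assert math.isclose(dot, 0.0, abs_tol=1e-9)
```

Rule 2 (positive sum minus negative sum equals 1) can be checked the same way; as the text notes, violating it leaves R² and F intact but makes the b values unreliable as mean differences.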
Nonsense Coding
Nonsense coding occurs when arbitrary values are used in place of the standard “0,” “1,” and “-1” designations found in the aforementioned coding systems. While this method might produce correct mean values for the variables, it is strongly discouraged because it inevitably leads to uninterpretable statistical results. [2]
Embeddings
Embeddings represent a sophisticated method of coding categorical values into low-dimensional vector spaces, typically composed of real numbers (and occasionally complex numbers ). The goal is usually to ensure that ‘similar’ categories are assigned ‘similar’ vectors, or to create vectors that are useful for a specific application based on some defined criterion. A common and particularly relevant special case is word embeddings , where the categorical values are the words within a language , and words with analogous meanings are mapped to similar vector representations.
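The idea that similar categories receive similar vectors can be illustrated with hand-picked toy vectors and cosine similarity. The three words and their 3-dimensional vectors below are invented for illustration; real embeddings are learned from data and have hundreds of dimensions:

```python
import math

# Toy, hand-picked 3-dimensional "embeddings" (illustrative values only).
embedding = {
    "king":  [0.9, 0.7, 0.1],
    "queen": [0.8, 0.8, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Semantically related categories are closer in the vector space:
sim_related = cosine(embedding["king"], embedding["queen"])
sim_unrelated = cosine(embedding["king"], embedding["apple"])
assert sim_related > sim_unrelated
```

The categorical values (words) are thereby replaced by points in a continuous space, where geometric closeness stands in for semantic similarity.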
Interactions
An interaction can emerge when examining the relationships among three or more variables. It describes a situation where the combined effect of two variables on a third variable is not simply additive. Interactions involving categorical variables can manifest in two primary ways: interactions between two categorical variables, or interactions between a categorical variable and a continuous variable.
Categorical by Categorical Variable Interactions
This type of interaction arises when you have two categorical variables under consideration. To investigate such an interaction, you would employ the coding system that best aligns with the researcher’s specific hypothesis. The product of the codes generated from these variables then represents the interaction term. Subsequently, you can calculate the b value for this interaction term and determine its statistical significance. [2]
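The "product of the codes" construction can be shown with two dummy-coded predictors. The variables and values here are hypothetical:

```python
# Two dummy-coded categorical predictors (hypothetical data):
# sex: 1 = female, 0 = male; treated: 1 = treatment, 0 = control.
rows = [(1, 1), (1, 0), (0, 1), (0, 0)]

# The interaction term is the element-wise product of the two code variables;
# the design matrix gains one extra column per pair of codes multiplied.
design = [(sex, treated, sex * treated) for sex, treated in rows]
```

Only observations coded 1 on both variables get an interaction code of 1, so the interaction term's b captures the extra, non-additive effect of being in both categories at once.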
Categorical by Continuous Variable Interactions
Simple slopes analysis is a widely used post hoc test in regression, functioning analogously to the simple effects analysis in ANOVA for examining interactions. In this technique, we scrutinize the slopes of one independent variable at specific values of another independent variable. This analytical approach is not confined solely to continuous variables; it can also be applied when one of the independent variables is categorical. The challenge here is that you cannot simply select arbitrary values to probe the interaction, as you might with continuous variables. In the continuous case, one could analyze the data at high, moderate, and low levels by using values at one standard deviation above the mean, at the mean, and at one standard deviation below the mean, respectively. However, in the categorical context, you would need to employ a separate simple regression equation for each distinct group to investigate the simple slopes. It is standard practice to standardize or center variables to enhance the interpretability of data in simple slopes analysis. However, categorical variables should never be standardized or centered. This test can be effectively utilized with all coding systems. [2]
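Fitting "a separate simple regression equation for each distinct group" amounts to computing an ordinary least-squares slope within each group. A minimal sketch with made-up optimism and income data for two nationality groups:

```python
from statistics import mean

# Hypothetical data: (income, optimism) pairs, split by nationality group.
data = {
    "French":  [(1, 2.0), (2, 2.9), (3, 4.1)],
    "Italian": [(1, 3.0), (2, 3.1), (3, 2.9)],
}

def ols_slope(pairs):
    # Simple-regression slope: cov(x, y) / var(x), computed from raw sums.
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x in xs)

# One simple slope per group, rather than probing at +/- 1 SD of a continuous moderator.
simple_slopes = {group: ols_slope(pairs) for group, pairs in data.items()}
```

Comparing the per-group slopes (here, a clearly positive slope for one group and a near-zero slope for the other) is exactly the categorical analogue of probing an interaction at chosen values of a continuous moderator.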