Form of causal modeling that fits networks of constructs to data
This article is about the general topic of structural modeling. For the application of structural models in econometrics, see Structural estimation. For the academic journal, see Structural Equation Modeling (journal).
Figure 1. An example structural equation model after estimation. Latent variables are sometimes indicated with ovals, while observed variables are shown in rectangles. Residuals and variances are sometimes drawn as double-headed arrows (as here) or as single arrows and a circle (as in Figure 2). The variance of the latent IQ variable is fixed at 1 to establish a consistent scale for the model. Figure 1 depicts measurement errors influencing each indicator of latent intelligence and each indicator of latent achievement. Neither the indicators nor their measurement errors are modeled as influencing the latent variables. 1
Figure 2. An example structural equation model before estimation. Similar to Figure 1, but shown before calculation of standardized values and with fewer items. Because intelligence and academic performance are theory-postulated variables, their precise scale values are unknown. The model nevertheless specifies that each latent variable's values must align with some point along the observable scale of one of its indicators. The 1.0 effect fixed between a latent variable and its indicator stipulates that each real unit increase or decrease in the latent variable's value produces a corresponding unit change in the indicator's value. Although strong indicators should be chosen for each latent variable, these 1.0 values do not imply perfect measurement: the model also posits other, unspecified entities causally influencing the observed indicator measurements, thereby introducing measurement error. Specifically, the model posits distinct measurement errors affecting each of the two indicators of latent intelligence and each indicator of latent achievement. The unlabeled arrow pointing to academic performance acknowledges that factors other than intelligence can also influence academic performance.
Structural equation modeling (SEM) encompasses a diverse set of methods used by researchers in both observational and experimental research designs. Although SEM finds its primary and most prolific use within the social and behavioral sciences, its utility extends well beyond those boundaries: it is increasingly adopted in fields such as epidemiology 2 , business 3 , and other domains where understanding intricate relationships is paramount. SEM has been defined as "a class of methodologies that seeks to represent hypotheses about the means, variances, and covariances of observed data in terms of a smaller number of 'structural' parameters defined by a hypothesized underlying conceptual or theoretical model" 4 .
Fundamentally, SEM rests on a conceptual model of how various facets of a phenomenon are presumed to be causally interconnected. These models frequently include hypothesized causal links among latent variables, constructs such as intelligence or attitude that are believed to exist but cannot be observed directly. Additional causal connections link these latent variables to observed variables whose values appear in data sets. The causal relationships are formally expressed in equations, though their structure is often conveyed more intuitively through diagrams employing arrows, as in Figures 1 and 2. The postulated causal structures imply that specific patterns should appear among the values of the observed variables. This is what allows researchers to use the relationships among observed variable values to estimate the magnitudes of the hypothesized effects, and to test statistically whether the data align with the requirements of the proposed causal structures. 5
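To make the connection between a postulated causal structure and the implied data patterns concrete, here is a minimal sketch in plain Python/numpy. The three-variable chain model and all coefficient values are hypothetical; it computes the covariance matrix the model implies, which estimation would then compare against the observed covariances.

```python
import numpy as np

# Hypothetical path model: x -> y -> z, with direct effects b1 and b2.
b1, b2 = 0.6, 0.5
B = np.array([[0.0, 0.0, 0.0],    # x is exogenous (receives no effects)
              [b1,  0.0, 0.0],    # y <- x
              [0.0, b2,  0.0]])   # z <- y

# Variances of the exogenous variable and of the two residuals,
# chosen here so every variable has unit variance.
Psi = np.diag([1.0, 1.0 - b1**2, 1.0 - b2**2])

# Reduced form: v = (I - B)^(-1) e, so Cov(v) = (I - B)^(-1) Psi (I - B)^(-T).
I = np.eye(3)
inv_IB = np.linalg.inv(I - B)
Sigma = inv_IB @ Psi @ inv_IB.T
print(np.round(Sigma, 3))
# The model implies corr(x, z) = b1 * b2 = 0.30 -- exactly the kind of
# testable pattern that estimation matches against observed data.
```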
The precise demarcation of what constitutes a "structural equation model" is admittedly fuzzy, like many academic classifications. Generally speaking, however, SE models feature hypothesized causal connections among a set of latent variables, theoretical constructs that are believed to exist but cannot be directly measured, such as a person's attitude, intelligence, or mental illness. Complementing these are causal connections linking the postulated latent variables to observed variables whose values are available in a data set. Variation in how the latent causal connections are conceptualized, in how observed variables are used to measure the latents, and in the available statistical estimation strategies produces SEM's expansive methodological toolkit. This toolkit includes, but is not limited to, confirmatory factor analysis (CFA), confirmatory composite analysis , path analysis , multi-group modeling, longitudinal modeling, partial least squares path modeling , latent growth modeling , and hierarchical or multilevel modeling . 6 7 8 9 10
Researchers employing SEM rely on specialized computer programs to estimate the strength and sign of the coefficients corresponding to the structural connections specified in their models, for instance the numerical values attached to the arrows in Figure 1. A crucial caveat applies: a hypothesized model such as the one in Figure 1 may not correspond to the worldly forces actually controlling the observed data measurements. The same programs therefore furnish model tests and diagnostic indicators that can hint at which indicators, or which components of the model, are introducing inconsistencies between the model and the data. Despite its sophistication, SEM has faced criticisms: for disregard of available model tests, for problems in model specification, for a propensity among some practitioners to accept models without considering external validity, and for potential philosophical biases that can subtly skew research outcomes. 11
One of the most compelling advantages of SEM, a feature that often justifies its intricacy, is its capacity to perform all these measurements and tests simultaneously within a single statistical estimation procedure, so that every model coefficient is calculated using information from all the observed variables. The implication of this holistic approach is that the estimates are more precise and accurate than if a researcher were to calculate each component of the model in isolation. 12
History
Structural equation modeling (SEM) began to diverge from correlation and regression analysis when Sewall Wright provided explicit causal interpretations for a set of regression-style equations. His work rested on a nuanced understanding of the physical and physiological mechanisms that generated both direct and indirect effects among his observed variables. 13 14 15 The equations were estimated like ordinary regression equations , but the substantive context and deep theoretical grounding of the measured variables permitted a clear causal understanding, transcending mere predictive association.
The introduction of SEM to the social sciences is largely credited to O. D. Duncan, whose seminal 1975 book 16 opened the field. SEM subsequently "blossomed" through the late 1970s and 1980s, a period of rapid growth facilitated by the increasing availability and power of computing, which finally made practical estimation of these complex models feasible; before widespread computing, such calculations were more a theoretical exercise than a practical tool. In 1987, Hayduk 7 provided the first book-length introduction to structural equation modeling with latent variables , swiftly followed by Bollen's widely adopted text in 1989 17 , further solidifying SEM's place in the statistical landscape.
Distinct yet mathematically interconnected modeling approaches evolved simultaneously within psychology , sociology , and economics . Early work by the Cowles Commission on simultaneous equations estimation built on Koopmans and Hood's (1953) algorithms, which originated in the practical challenges of transport economics and optimal routing. These methods relied on maximum likelihood estimation and closed-form algebraic calculations, since iterative solution search techniques were severely limited in the pre-computer era.
The convergence of two of these developmental streams, factor analysis from psychology and path analysis from sociology (itself a direct descendant of Wright and Duncan's work), produced the current core framework of SEM. Among the programs developed by Karl Jöreskog at Educational Testing Service, LISREL 18 19 20 proved particularly influential. LISREL integrated latent variables (which psychologists recognized as the latent factors from factor analysis) into the path-analysis-style equations that sociologists had inherited. The factor-structured portion of the model explicitly incorporated measurement errors , thereby permitting adjustment for those errors and making possible, though not guaranteeing, error-free estimation of the effects connecting different postulated latent variables.
The historical convergence of the factor analytic and path analytic traditions persists in the distinction between the measurement and structural portions of models, and in continuing debates over model testing strategies and whether measurement should precede or accompany structural estimates. 21 22 Viewing factor analysis primarily as a data-reduction technique, for instance, deemphasizes rigorous model testing, in stark contrast to the path analytic tradition's appreciation for testing hypothesized causal connections, where a test's outcome might signal a model's misspecification. The friction between these two foundational perspectives continues to surface in academic discourse.
Wright's pioneering work in path analysis profoundly influenced Hermann Wold, who in turn mentored Karl Jöreskog, and Jöreskog's student Claes Fornell. Despite this intellectual lineage, SEM never achieved widespread adoption among U.S. econometricians , perhaps owing to fundamental differences in modeling objectives and the typical structures of economic data. The prolonged, somewhat isolated development of SEM's economic branch led to distinct procedural and terminological conventions, though deep mathematical and statistical connections persist. 23 24 These disciplinary disparities are evident in SEMNET discussions concerning endogeneity and in debates over causality as represented through directed acyclic graphs (DAGs). 5 Detailed comparisons and contrasts of various SEM approaches are available 25 26 , often highlighting the specific data structures and concerns that motivate economic models.
More recently, Judea Pearl 5 significantly expanded the scope of SEM, extending it from strictly linear to more flexible nonparametric models , and proposed rigorous causal and counterfactual interpretations of the underlying equations. Nonparametric SEMs permit estimating total, direct, and indirect effects without committing to linearity of effects or to particular distributions for the error terms , a considerable gain in flexibility and robustness. 26
SEM analyses are popular within the social sciences because these techniques provide powerful tools for dissecting complex concepts and illuminating intricate causal processes. However, the complexity that makes the models so appealing also introduces substantial variability into the results: the presence or absence of conventional control variables, the size and characteristics of the sample, and the specific variables of interest can all profoundly influence outcomes. 27 The strategic use of experimental designs can, to some extent, mitigate these uncertainties and strengthen causal claims. 28
SEM today also forms a foundational component of both machine learning and the increasingly important field of interpretable neural networks . The classical statistical methods of exploratory and confirmatory factor analyses find direct parallels in the unsupervised and supervised machine learning paradigms, respectively, a compelling cross-pollination of ideas across disciplines.
General steps and considerations
The following considerations apply to both the construction and the assessment of virtually any structural equation model; they are the details that separate meaningful insight from statistical noise.
Model specification
Building or specifying a model is an intricate process that demands deliberate, informed choices about several interconnected elements. This crucial stage requires the researcher to consider:
- The precise set of variables to be incorporated into the analysis.
- The existing body of knowledge and established understanding pertaining to these variables.
- The theoretical propositions or specific hypotheses that posit causal connections and, equally important, disconnections among these variables.
- The specific insights the researcher aims to extract from the modeling exercise.
- The presence of missing values within the data set and, consequently, the need for appropriate imputation strategies to handle them.
Structural equation models attempt to mirror the worldly forces operative within causally homogeneous cases , that is, cases genuinely embedded within the same underlying causal structures but exhibiting differing values on the causal variables and, as a consequence, differing values on the outcome variables. Causal homogeneity can be facilitated through judicious case selection, or by segregating cases within a more complex multi-group model . A model's specification, however, remains incomplete until the researcher explicitly defines the following (a hypothetical specification sketch follows this list):
- Which effects and/or correlations /covariances are to be included in the model and estimated. These are the active pathways.
- Which effects and other coefficients are forbidden from the model, or presumed unnecessary, representing theoretical null connections. These are the paths claimed not to exist.
- Which coefficients will be assigned fixed, unchanging values, for example the 1.0 values used in Figure 2 to establish measurement scales for latent variables . These fixed values anchor the model.
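As a concrete illustration of these three kinds of choices, the sketch below uses the Python semopy package with its lavaan-style model syntax (one SEM tool among several; the variable names, latent constructs, and data file are all hypothetical) to declare free effects, a path fixed at zero by omission, and loadings fixed at 1.0 for scaling.

```python
# A minimal sketch, assuming the semopy package (pip install semopy)
# and its lavaan-style model syntax; all names here are hypothetical.
import pandas as pd
from semopy import Model

# "=~" defines each latent variable's indicators; by lavaan-style
# convention the first loading is fixed at 1.0, giving the latent a scale.
# "achievement ~ intelligence" declares one free structural effect.
# Omitting any line for an effect of achievement on intelligence fixes
# that path at zero -- a theoretical null connection.
spec = """
intelligence =~ test1 + test2 + test3
achievement =~ grade1 + grade2 + grade3
achievement ~ intelligence
"""

model = Model(spec)
# data = pd.read_csv("scores.csv")   # hypothetical data set
# model.fit(data)                    # estimate the free coefficients
# print(model.inspect())             # estimates, standard errors, p-values
```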
Within the latent level of a model, variables are categorized as endogenous or exogenous . The endogenous latent variables are the true-score variables postulated to receive causal effects from at least one other modeled variable; each is treated as the dependent variable in a regression-style equation . The exogenous latent variables are background variables postulated as causing one or more of the endogenous variables, and they are modeled like the predictor variables in regression equations. Causal connections among the exogenous variables are generally not explicitly modeled, but their interrelationships are acknowledged by allowing them to correlate freely with one another. The model may also include intervening variables , which receive effects from some variables while transmitting effects to others. As in regression analysis , each endogenous variable is assigned a residual or error variable encapsulating the cumulative effects of unmeasured, unavailable, and often unknown causes. Each latent variable , whether exogenous or endogenous , is conceptualized as embodying the cases' true scores on that variable, and these true scores causally contribute valid variations into one or more of the observed or reported indicator variables . 29
The LISREL program famously assigned Greek names to the elements in a set of matrices to keep track of the various components of structural equation models. These Greek designations became relatively standard notation across the field, though the notation has been extended and modified to accommodate a variety of statistical considerations. 20 7 17 30 Modern texts and software packages often "simplify" model specification through intuitive diagrams or user-chosen variable names, but these simplifications merely re-convert the user's model into standard matrix-algebraic form in the background, implicitly introducing default program "assumptions" about model features with which users supposedly need not concern themselves. Unfortunately, these default assumptions tend to obscure critical model components, allowing unrecognized issues to lurk within the model's fundamental structure and its underlying matrices.
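In the LISREL-style notation just described, a full model with latent variables is commonly written as three matrix equations. The following standard form is a notational sketch, not tied to any particular program's defaults: η and ξ are the endogenous and exogenous latent variables, y and x their indicators, and ζ, ε, δ the structural and measurement residuals.

$$\begin{aligned} \boldsymbol{\eta} &= \mathbf{B}\boldsymbol{\eta} + \boldsymbol{\Gamma}\boldsymbol{\xi} + \boldsymbol{\zeta} &&\text{(structural model)}\\ \mathbf{y} &= \boldsymbol{\Lambda}_{y}\boldsymbol{\eta} + \boldsymbol{\varepsilon} &&\text{(measurement of the endogenous latents)}\\ \mathbf{x} &= \boldsymbol{\Lambda}_{x}\boldsymbol{\xi} + \boldsymbol{\delta} &&\text{(measurement of the exogenous latents)} \end{aligned}$$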
Within SEM, two primary components of models are distinguished: the structural model , which delineates potential causal dependencies between endogenous and exogenous latent variables , and the measurement model , which illustrates the causal connections between the latent variables and their observed indicators . Exploratory and confirmatory factor analysis models, for example, concentrate on the causal measurement connections, seeking to understand how observed items tap into latent constructs, while path models align more directly with the latent structural connections inherent in SEMs.
Modelers specify each coefficient in a model as either "free" (available to be estimated from the data) or "fixed" at a predetermined value. The free coefficients typically represent hypothesized effects the researcher wishes to test, background correlations among the exogenous variables , or the variances of the residual or error variables that account for additional unexplained variations in the endogenous latent variables . The fixed coefficients serve other crucial roles: values such as the 1.0 in Figure 2 establish a consistent scale for the latent variables, while fixing a coefficient at 0.0 explicitly asserts a causal disconnection, for example the assertion of "no-direct-effects" (the absence of an arrow) pointing from Academic Achievement to any of the four scales in Figure 1. SEM programs provide estimates and statistical tests for the free coefficients, while the fixed coefficients contribute significantly to testing the hypothesized model structure. Various types of constraints between coefficients can also be imposed, adding further theoretical rigor. 30 7 17 The entire specification process draws heavily on the existing literature, the researcher's practical experience with the modeled indicator variables, and the theoretical features the model aims to investigate.
There is a fundamental limit to the number of coefficients that can be reliably estimated in any given model. If the number of available data points falls short of the number of coefficients requiring estimation, the model is deemed "unidentified," and no stable or unique coefficient estimates can be obtained. This is a critical structural flaw. The presence of reciprocal effects or other causal loops can also impede estimation, leading to similar identification problems. 31 32 30
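A quick necessary check on this limit is the standard counting rule (the "t-rule"): the number of free coefficients t cannot exceed the number of distinct variances and covariances that the observed variables supply,

$$t \;\le\; \frac{p(p+1)}{2}$$

where p is the number of observed variables. With p = 5 indicators, for example, the data supply 5(6)/2 = 15 distinct variances and covariances, so at most 15 coefficients can be freed; satisfying the rule is necessary but not sufficient for identification.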
Estimation of free model coefficients
Coefficients fixed at 0.0, 1.0, or any other predetermined value do not, by definition, require estimation; their values are already specified. The estimated values for the free model coefficients are obtained through optimization: either by maximizing the model's fit to the observed data or by minimizing the discrepancy between the model's implications and the data, relative to what the data's characteristics would be if the free coefficients took their estimated values.
The model's implications for the expected characteristics of the data, given a specific set of coefficient values, depend on several critical factors:

- the exact placement of the coefficients within the model's structure (e.g., which variables are connected and, crucially, which are disconnected);
- the nature of the connections posited between the variables (whether covariances or direct effects , with effects often assumed, for simplicity, to be linear);
- the characteristics of the error or residual variables (frequently assumed to be independent of, or causally disconnected from, many other variables in the model);
- the measurement scales appropriate for the variables involved ( interval level measurement is a common, though not always justified, assumption).
A stronger hypothesized effect linking two latent variables implies that the indicators of those latents should be more strongly correlated . A reasonable estimate of a latent variable's effect is therefore the value that best aligns with the observed correlations between the indicators of the corresponding latent variables, that is, the estimate-value maximizing the match with the empirical data or minimizing the discrepancies from it, as the path-tracing identity below illustrates.
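For the simplest case, two standardized latent variables connected by an effect β, each measured by one indicator with standardized loadings λ₁ and λ₂ and no other paths connecting the indicators, the standard path-tracing rules give the model-implied indicator correlation as the product along the path:

$$\operatorname{corr}(x_{1},x_{2}) \;=\; \lambda_{1}\,\beta\,\lambda_{2}$$

so a stronger latent-level effect, or better indicators, must show up as a stronger observed correlation.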
In maximum likelihood estimation , the numerical values of all the free model coefficients are adjusted individually and iteratively (progressively increased or decreased from their starting values) until they jointly maximize the likelihood of observing the sample data at hand, whether that optimization is based on the variables' covariances /correlations or on the case-level values of the indicator variables . Ordinary least squares estimates, by contrast, are the coefficient values that minimize the squared differences between the observed data and what the data would look like if the model were correctly specified, that is, if all the model's estimated features corresponded to real-world phenomena.
The statistical criterion to maximize or minimize during estimation is chosen according to the variables' levels of measurement (estimation is generally more straightforward with interval level measurements than with nominal or ordinal measures ) and according to a variable's position within the model (e.g., endogenous dichotomous variables present more estimation challenges than exogenous dichotomous variables ). Most SEM software packages offer several options for what is to be maximized or minimized, commonly including maximum likelihood estimation (MLE), full information maximum likelihood (FIML), ordinary least squares (OLS), weighted least squares (WLS), diagonally weighted least squares (DWLS), and two-stage least squares . 30 Each has its context, strengths, and weaknesses.
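To make the "maximize or minimize" step concrete, the sketch below (numpy/scipy; the one-factor model, sample covariances, and starting values are all hypothetical) minimizes the standard maximum-likelihood discrepancy function F_ML = ln|Σ(θ)| + tr(SΣ(θ)⁻¹) − ln|S| − p over the free coefficients of a tiny one-factor model.

```python
import numpy as np
from scipy.optimize import minimize

p = 3                                 # three observed indicators of one factor
S = np.array([[1.00, 0.48, 0.42],     # hypothetical sample covariance matrix
              [0.48, 1.00, 0.56],
              [0.42, 0.56, 1.00]])

def implied_cov(theta):
    """Model-implied covariance: Sigma = lam lam' + diag(psi),
    with the factor variance fixed at 1 to set the latent scale."""
    lam, psi = theta[:p], theta[p:]   # free loadings and error variances
    return np.outer(lam, lam) + np.diag(psi)

def f_ml(theta):
    """Maximum-likelihood discrepancy between S and Sigma(theta)."""
    Sigma = implied_cov(theta)
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:                     # keep the search in admissible values
        return np.inf
    return logdet + np.trace(S @ np.linalg.inv(Sigma)) - np.linalg.slogdet(S)[1] - p

start = np.full(2 * p, 0.5)           # starting values for the iterations
fit = minimize(f_ml, start, method="Nelder-Mead")
print(np.round(fit.x, 3))             # estimated loadings and error variances
# With three indicators this one-factor model is just-identified, so the
# minimized discrepancy approaches zero; more indicators would leave
# testable constraints for the model test discussed below.
```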
A common and persistent challenge arises when a coefficient’s estimated value is “underidentified.” This occurs because the coefficient is insufficiently constrained by the combination of the model’s structure and the available data. In such a scenario, no unique “best-estimate” can be obtained unless the model and data together impose sufficient constraints or restrictions on that coefficient’s value. For example, the magnitude of a single observed data correlation between two variables is inherently insufficient to provide distinct estimates for a reciprocal pair of modeled effects between those same variables. The observed correlation could, hypothetically, be accounted for by one of the reciprocal effects being stronger than the other, or vice-versa, or even by both effects being of equal magnitude. The data alone cannot differentiate.
Underidentified effect estimates can, in principle, be rendered identifiable by introducing additional model and/or data constraints. For instance, reciprocal effects might be made identifiable by constraining one effect estimate to be double, triple, or equal to the other effect estimate 32 , but the resulting estimates will only be trustworthy if the additional constraint corresponds to the actual structure of the world; an arbitrary constraint yields an equally arbitrary identification. Data from a third variable that directly causes only one of a pair of reciprocally causally connected variables can also greatly assist identification. 31 Constraining the third variable to not directly cause one of the reciprocally-causal variables breaks the symmetry that otherwise plagues reciprocal effect estimates, because the third variable must then be more strongly correlated with the variable it directly causes than with the variable at the "other end" of the reciprocal connection, which it impacts only indirectly. 31 Note, once again, that this strategy presumes the correctness of the model's causal specification, namely that there truly is a direct effect from the third variable to one of the reciprocally-linked variables and no direct effect on the other. Theoretical demands for null/zero effects provide helpful constraints that aid estimation, though theories often fail to clearly articulate which effects are supposedly nonexistent; the burden of precision falls on the researcher.
Model assessment
Model assessment is not a simple checklist; it’s a multi-faceted judgment call, deeply intertwined with the underlying theory, the empirical data, the specific model constructed, and the chosen estimation strategy. Therefore, a comprehensive model assessment must rigorously consider:
- Data Quality and Appropriateness: Whether the data contain reasonable measurements of variables genuinely appropriate for the research question at hand.
- Causal Homogeneity of Cases: Whether the cases included in the analysis are causally homogeneous . It is fundamentally illogical to estimate a single model if the data cases reflect two or more distinct underlying causal networks; this is a common, and often ignored, problem.
- Theoretical Representation: Whether the model accurately and appropriately represents the theoretical framework or the specific features of interest it purports to investigate. Models that omit crucial features required by a theory, or that contain coefficients inconsistent with that theory, are inherently unpersuasive.
- Statistical Justification of Estimates: Whether the obtained estimates are statistically defensible. Substantive assessments can be utterly devastated by violations of underlying assumptions, by the use of an inappropriate estimator for the data, or by the dreaded non-convergence of iterative estimation procedures.
- Substantive Reasonableness of Estimates: The practical, real-world plausibility of the estimates. Statistically impossible outcomes, such as negative variances or correlations exceeding 1.0 or falling below -1.0, are immediate red flags. Even statistically possible estimates that fundamentally contradict established theory or our understanding of the world should provoke a serious re-evaluation of both the theory and the model.
- Consistency (or Inconsistency) Between Model and Data: The degree of remaining consistency, or more often, inconsistency, between the model’s implications and the observed data. While the estimation process strives to minimize these differences, important and highly informative discrepancies mayâand often doâpersist.
Research that purports to test or "investigate" a theory must rigorously attend to any model-data inconsistency that extends beyond mere chance. The estimation process, by its nature, adjusts the model's free coefficients to achieve the best possible fit to the observed data. The output generated by SEM programs typically includes a matrix detailing the relationships among the observed variables that would be observed if the estimated model effects were, in fact, the true underlying forces controlling those variables' values. The "fit" of a model is then a report of the correspondence, or lack thereof, between these model-implied relationships (often covariances ) and the corresponding relationships actually observed in the data. Large and statistically significant differences between the data and the model's implications are unequivocal signals of fundamental problems.
The probability associated with a χ² (chi-squared) test is a direct measure of the likelihood that the observed data could have arisen solely through random sampling variations, if the estimated model accurately represented the true underlying population forces. A small χ² probability therefore indicates that it would be highly improbable for the current data to have occurred if the modeled structure truly constituted the real population's causal forces, with any remaining differences attributed solely to random sampling fluctuations. This is a powerful, if often inconvenient, test.
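To show how this test is computed in practice, here is a short sketch (scipy; every numeric input is hypothetical) that converts a minimized ML fit-function value into the model χ² statistic and its probability, for an over-identified model with positive degrees of freedom.

```python
from scipy.stats import chi2

N = 300           # hypothetical sample size
p = 4             # observed variables -> p(p+1)/2 = 10 data moments
q = 8             # free coefficients (e.g., 4 loadings + 4 error variances)
df = p * (p + 1) // 2 - q          # degrees of freedom = 2

F_ml = 0.021      # hypothetical minimized ML fit-function value
chi_sq = (N - 1) * F_ml            # the model chi-squared statistic
p_value = chi2.sf(chi_sq, df)      # probability of this much misfit by chance
print(round(chi_sq, 2), df, round(p_value, 3))
# A small p-value signals beyond-chance model-data inconsistency.
```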
If a model remains inconsistent with the data even after the selection of optimal coefficient estimates, an intellectually honest research response is to transparently report and diligently address this evidence (which often manifests as a statistically significant model χ² test ). 33 Any model-data inconsistency beyond what can reasonably be attributed to chance challenges both the individual coefficient estimates and the model's overall capacity to adjudicate its own structure, regardless of whether the inconsistency originates from problematic data, inappropriate statistical estimation, or incorrect model specification.
Coefficient estimates derived from data-inconsistent ("failing") models are still interpretable: they serve as reports of how the world would appear to someone who genuinely believed a model that conflicts with the available empirical data. The estimates in data-inconsistent models do not necessarily become "obviously wrong" by exhibiting statistical strangeness or being wrongly signed according to theory; indeed, they might align quite closely with a theory's requirements. However, the remaining data inconsistency renders this apparent match between estimates and theory unable to provide any meaningful support. Failing models, while interpretable, are interpretable only as accounts that fundamentally conflict with the available evidence.
The notion that replication alone will reliably detect misspecified models that have been inappropriately fitted to data is overly optimistic. If the replicate data fall within the expected range of random variations around the original data, the same incorrect coefficient placements that inappropriately fit the original data will, in all likelihood, also inappropriately fit the replicate data. Replication plays a vital role in identifying issues like data entry mistakes (especially when different research groups are involved), but it is particularly weak at uncovering misspecifications arising from exploratory model modification, for example when confirmatory factor analysis is applied to a randomly selected second half of a dataset following an exploratory factor analysis (EFA) of the first half. This sequential approach often just propagates the initial errors.
A modification index is essentially an estimate of how much a model’s fit to the data might “improve” (though not necessarily how much the model’s underlying structure would improve) if a specific, currently fixed model coefficient were instead freed for estimation. Researchers facing models that are inconsistent with their data are often tempted to free coefficients that modification indices suggest are likely to produce substantial improvements in fit. This practice, while seemingly pragmatic, simultaneously introduces a significant and perilous risk: the transition from a causally-wrong-and-failing model to a causally-wrong-but-fitting model. Improved data-fit, it must be stressed, provides no assurance whatsoever that the newly freed coefficients are substantively reasonable or accurately reflect real-world causal structures. The original model might contain fundamental causal misspecifications, such as incorrectly directed effects or flawed assumptions about unmeasured variables, and such deep-seated problems simply cannot be rectified by merely adding more coefficients to the existing model. Consequently, such models remain fundamentally misspecified, despite the superficially closer fit achieved by the inclusion of additional coefficients. These fitting-yet-worldly-inconsistent models are particularly prone to emerge if a researcher, perhaps overly committed to a specific model (e.g., a factor model with a predetermined number of factors), manages to force an initially failing model to fit by inserting “measurement error covariances” that are conveniently “suggested” by modification indices. MacCallum (1986) astutely demonstrated that “even under favorable conditions, models arising from specification searches must be viewed with caution.” 34 While model misspecification can sometimes be corrected by judiciously inserting coefficients suggested by modification indices, a far broader array of corrective possibilities arises from employing a few carefully chosen indicators of similar, yet importantly distinct, latent variables . 35
"Accepting" failing models as merely "close enough" is not a reasonable alternative either. A stark cautionary tale was provided by Browne, MacCallum, Kim, Anderson, and Glaser, who explored the mathematical underpinnings of why the χ² test can (though not always does) possess considerable power to detect model misspecification. 36 The probability accompanying a χ² test signifies the likelihood that the observed data could have arisen by random sampling variations, if the current model, with its optimally estimated coefficients, truly constituted the real underlying population forces. A small χ² probability therefore strongly indicates that it would be unlikely for the current data to have emerged if the current model structure genuinely represented the real population's causal forces, with any remaining differences attributed solely to random sampling variations. Browne et al. presented a factor model they deemed "acceptable" despite the model being significantly inconsistent with their data according to the χ² test . The fallacy of their claim that "close-fit" should be treated as "good enough" was subsequently and decisively demonstrated by Hayduk, Pazderka-Robinson, Cummings, Levers, and Beres 37 , who achieved a fitting model for Browne et al.'s own data by incorporating an experimental feature that Browne et al. had overlooked. The fault was not with the mathematics of the indices or with an alleged "over-sensitivity" of χ² testing . It lay with forgetting, neglecting, or overlooking the fundamental truth that the amount of ill fit cannot be reliably trusted to correspond to the nature, location, or seriousness of the problems in a model's specification. 38
Many researchers have attempted to justify a preference for "switching to fit-indices," rather than rigorously testing their models, by asserting that χ² values tend to increase (and χ² probability consequently decreases) with increasing sample size (N). There are two critical mistakes in discounting χ² on this basis. First, for properly specified models, χ² does not inherently increase with increasing N 33 ; if χ² does increase with N, that very observation is itself a signal that something is detectably problematic within the model. Second, for models that are genuinely detectably misspecified, an increase in χ² with increasing N provides the good news of increasing statistical power to detect model misspecification, namely increased power to avoid a Type II error . While some kinds of important misspecifications cannot be detected by χ² 38 , any degree of ill fit beyond what could reasonably be produced by random variations warrants transparent reporting and careful consideration. 39 33 The χ² model test , possibly with appropriate adjustments 40 , remains the strongest and most robust structural equation model test available; to ignore it is to ignore the evidence.
Numerous "fit indices" attempt to quantify how closely a model aligns with the data, but all fit indices suffer from a fundamental logical flaw: the size or magnitude of the ill fit is not reliably coordinated with the severity or nature of the underlying issues producing the data inconsistency. 38 Models possessing entirely different causal structures yet fitting the data identically well have been termed "equivalent models ." 30 Such models may be equivalent in data-fit, but they are not causally equivalent, which implies that at least one of the so-called equivalent models must be inconsistent with the actual structure of the world. If there is a perfect 1.0 correlation between variables X and Y and we model this as "X causes Y," we will achieve perfect fit and zero residual error, yet this model may not match reality: Y may actually cause X, or both X and Y may be responding to a common underlying cause Z, or the world may contain a mixture of these effects (e.g., a common cause plus an effect of Y on X), or other causal structures entirely. The perfect fit provides no guarantee that the model's structure corresponds to the world's structure, which in turn implies that getting "closer to perfect fit" does not necessarily equate to getting "closer to the world's true structure"; it might, or it might not. It is therefore incorrect for a researcher to claim that even perfect model fit implies the model is correctly causally specified. For models of even moderate complexity, precisely equivalently-fitting models are a rarity, but models that almost fit the data, according to any given index, invariably introduce additional, potentially significant, yet unknown model misspecifications, and thereby pose an even greater impediment to genuine research advancement.
This logical weakness renders all fit indices fundamentally "unhelpful" whenever a structural equation model is significantly inconsistent with the data 39 , yet several forces continue to perpetuate their widespread use. Dag Sorbom once recounted that when someone directly asked Karl Jöreskog, the developer of the first structural equation modeling program, "Why have you then added GFI?" to your LISREL program, Jöreskog candidly replied, "Well, users threaten us saying they would stop using LISREL if it always produces such large chi-squares. So we had to invent something to make people happy. GFI serves that purpose." 41 The χ² evidence of model-data inconsistency was too statistically solid to be dislodged or discarded, but people could at least be provided with a convenient distraction from the disturbing evidence. Career advancement and academic "profits" can still be accrued by developing ever more indices, by reporting investigations into the behavior of these indices, and by publishing models that intentionally bury evidence of model-data inconsistency beneath an MDI (a "mound of distracting indices"). There appears to be no general, justifiable reason why a researcher should ever "accept" a causally flawed model rather than attempt to correct its detected misspecifications. And some segments of the literature seem to have missed that "accepting a model" (based on "satisfying" an arbitrary index value) suffers from an intensified version of the criticism typically leveled against "acceptance" of a null-hypothesis : introductory statistics texts routinely recommend replacing "accept" with "failed to reject the null hypothesis" to acknowledge the possibility of a Type II error . Here, we face a potential Type III error (or worse), arising from "accepting" a model hypothesis even when the available data are demonstrably sufficient to reject that very model.
The fundamental concern here boils down to whether researchers are genuinely committed to uncovering the world’s true structure, or merely to producing “publishable” results. Displacing compelling test evidence of model-data inconsistency by obscuring it behind dubious index claims of “acceptable-fit” imposes a significant, discipline-wide cost. It actively diverts intellectual attention away from what the discipline could have been doing to attain a structurally improved understanding of its own subject matter. The discipline ultimately pays a very real price for this index-based displacement of evidence of model misspecification. The frictions generated by disagreements over the absolute necessity of correcting model misspecifications are only likely to intensify with the increasing adoption of non-factor-structured models and with the use of fewer, yet more precisely defined, indicators of similar but importantly distinct latent variables . 35
When considering the use of fit indices (a practice that, despite its flaws, remains prevalent), several critical checks are essential:
- Data Concerns Addressed: Have all potential data issues been thoroughly addressed? This ensures that any observed model-data inconsistency is not merely a reflection of underlying data mistakes.
- Criterion Values Investigated for Model Structure: Have the criterion values for the chosen index been investigated for models structured like the researcher's model? Index criteria developed specifically for factor-structured models are only appropriate if the researcher's model is, in fact, factor-structured; applying criteria developed for one model structure to an entirely different structure is irresponsible.
- Correspondence of Misspecifications: Do the types of potential misspecifications present in the current model align with the kinds of misspecifications upon which the index criteria were originally based? For example, criteria derived from simulations of omitted factor loadings may be entirely inappropriate for misspecifications resulting from a failure to include relevant control variables .
- Conscious Disregard of Evidence: Is the researcher knowingly agreeing to disregard evidence that points to the kinds of misspecifications upon which the index criteria were based? If an index criterion is, for instance, based on simulating one or two missing factor loadings, then using that criterion inherently acknowledges the researcher’s willingness to accept a model that is indeed missing one or two factor loadings.
- Up-to-Date Index Criteria: Is the latest, most current version of the index criteria being used? Criteria for some indices have, quite rightly, become more stringent over time.
- Paired-Index Criteria: Are criterion values on pairs of indices being simultaneously required? Hu and Bentler (1999) 42 reported that some common indices simply do not function appropriately unless they are assessed in conjunction with others.
- Model Test Availability: Is a primary model test, such as the χ² test , available and reported? A χ² value , its associated degrees of freedom , and its probability will typically be available for any model reporting indices derived from χ² .
- Consideration of Error Types: Has the researcher carefully considered both alpha (Type I) and beta (Type II) errors when making their index-based decisions? For example, if a model is significantly data-inconsistent, the “tolerable” amount of inconsistency is likely to vary considerably depending on whether the context is medical research, business analysis, social science, or psychological inquiry.
Some of the more commonly cited and, for better or worse, utilized fit statistics include:
Chi-squared test
- A foundational test of fit that serves as the basis for calculating many other fit measures. It quantifies the discrepancy between the observed covariance matrix and the model-implied covariance matrix. Crucially, chi-square increases with sample size only if the model is detectably misspecified. 33
Akaike information criterion (AIC)
- An index primarily used for relative model fit comparisons. The preferred model among a set of candidates is typically the one exhibiting the lowest AIC value.
$$\mathit{AIC} = 2k - 2\ln(L),$$
- where k represents the number of parameters estimated within the statistical model , and L denotes the maximized value of the model's likelihood .
Root Mean Square Error of Approximation (RMSEA)
- An absolute fit index estimating the model's misfit per degree of freedom in the population (see the formula in the table below). Hu and Bentler (1999) 42 proposed values of roughly .06 or smaller as consistent with good fit.
Standardized Root Mean Squared Residual (SRMR)
- The SRMR is a widely adopted absolute fit indicator. Hu and Bentler (1999) 45 famously suggested a value of .08 or smaller as a general guideline for what might be considered a “good fit.”
Comparative Fit Index (CFI)
- When evaluating baseline comparisons, the CFI’s value is heavily dependent on the average magnitude of the correlations present in the data. If the average correlation between variables is not particularly high, then the CFI itself will not achieve a very high value. A CFI value of .95 or higher is generally considered desirable. 45
The following table provides references documenting these and other features for several common indices: the RMSEA (Root Mean Square Error of Approximation ), the SRMR (Standardized Root Mean Squared Residual ), the CFI (Comparative Fit Index ), and the TLI (Tucker-Lewis Index ). Additional indices, such as the AIC (Akaike Information Criterion ), are covered in most introductory SEM texts. 30 For each measure of fit, the decision regarding what constitutes a "good-enough fit" between model and data ultimately reflects the researcher's modeling objective (perhaps challenging an existing model, or striving to improve measurement precision); whether the model is to be legitimately claimed as having been "tested"; and the researcher's comfort with "disregarding" the evidence of the index-documented degree of ill fit. 33
Features of Fit Indices
| | RMSEA | SRMR | CFI |
|---|---|---|---|
| Index Name | Root Mean Square Error of Approximation | Standardized Root Mean Squared Residual | Comparative Fit Index |
| Formula | RMSEA = √((χ² − d)/(d(N − 1))), where d = degrees of freedom | | |
| Basic References | 46 47 48 | | |
| Factor Model proposed wording for critical values | .06 42 | | |
| NON-Factor Model proposed wording for critical values | | | |
| References proposing revised/changed, disagreements over critical values | 42 | 42 | 42 |
| References indicating two-index or paired-index criteria are required | 42 | 42 | 42 |
| Index based on χ² | Yes | No | Yes |
| References recommending against use of this index | 39 | 39 | 39 |
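Putting the table's formulas into practice, the sketch below (plain numpy; all numeric inputs, including the toy 2×2 covariance matrices, are hypothetical) computes RMSEA from the model χ², CFI from a comparison with the baseline (independence) model, and SRMR from the standardized residual (co)variances.

```python
import numpy as np

# Hypothetical quantities from a fitted model and its baseline.
N = 300
chi_sq, df = 6.28, 2            # model chi-squared and degrees of freedom
chi_b, df_b = 410.0, 6          # baseline (independence) model values

# RMSEA = sqrt((chi^2 - df) / (df * (N - 1))), floored at zero.
rmsea = np.sqrt(max(chi_sq - df, 0.0) / (df * (N - 1)))

# CFI compares the model's noncentrality with the baseline's.
cfi = 1 - max(chi_sq - df, 0.0) / max(chi_b - df_b, chi_sq - df, 0.0)

# SRMR: root mean square of the standardized residual (co)variances.
S = np.array([[1.0, 0.45], [0.45, 1.0]])       # observed covariances (toy)
Sigma = np.array([[1.0, 0.40], [0.40, 1.0]])   # model-implied covariances
scale = np.sqrt(np.outer(np.diag(S), np.diag(S)))
resid = (S - Sigma) / scale                    # standardized residuals
mask = np.tril_indices_from(resid)             # unique elements only
srmr = np.sqrt(np.mean(resid[mask] ** 2))

print(round(rmsea, 3), round(cfi, 3), round(srmr, 3))
```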
Sample size, power, and estimation
There is a general, though often unspoken, consensus that samples should be large enough to ensure both stable coefficient estimates and adequate statistical power for testing hypotheses. The agreement dissolves, however, when it comes to specific required sample sizes , or even a clear methodology for determining appropriate sample sizes for a given study. Recommendations have traditionally been based on the number of coefficients to be estimated, the number of modeled variables, and Monte Carlo simulations addressing specific model coefficients. 30
Sample size recommendations based on the ratio of the number of indicators to latent variables are inherently factor-oriented: they do not apply to models employing single indicators with fixed non-zero measurement error variances . 35 Broadly speaking, for moderate-sized models without statistically difficult-to-estimate coefficients, the required sample sizes (N's) appear roughly comparable to the N's required for a regression analysis incorporating all the indicators; this is a common rule of thumb, but one to apply with caution.
The larger the sample size, the greater the statistical likelihood of inadvertently including cases that are not causally homogeneous . This presents a rather thorny dilemma: increasing N to enhance the probability of reporting a desired coefficient as statistically significant simultaneously increases the risk of model misspecification and, paradoxically, increases the power to detect that very misspecification. Researchers who are genuinely committed to learning from their modeling efforts (which includes being open to the possibility that their model requires significant adjustment or even complete replacement) will invariably strive for the largest possible sample size, constrained only by available funding and their rigorous assessment of the likely population-based causal heterogeneity or homogeneity. If an exceptionally large N is available, a strategic approach might involve modeling subsets of cases, which can help control for variables that might otherwise disrupt causal homogeneity.
Researchers who fear having to report their model's deficiencies face a peculiar, self-defeating bind: they desire a larger N to provide sufficient power to detect structural coefficients of interest, yet simultaneously wish to avoid the very power capable of signaling model-data inconsistency . Given the immense variation in model structures and data characteristics, a pragmatic approach to determining adequate sample sizes is to consider the experiences (both successful and unsuccessful) of other researchers who have worked with models of comparable size and complexity, estimated with similar data.
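In the absence of a universal rule, a small Monte Carlo check of estimate stability, in the spirit of the simulation-based recommendations mentioned above, is straightforward to run. The sketch below (numpy; the generating model, effect size, and sample sizes are all hypothetical) simulates data from a known single effect at several N's and inspects the spread of the estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.40                        # hypothetical true effect of x on y

def simulate_estimate(n):
    """Generate one sample of size n and return the estimated effect."""
    x = rng.normal(size=n)
    y = beta * x + rng.normal(scale=np.sqrt(1 - beta**2), size=n)
    c = np.cov(x, y)               # sample covariance matrix of (x, y)
    return c[0, 1] / c[0, 0]       # least-squares slope estimate

for n in (50, 200, 800):
    estimates = [simulate_estimate(n) for _ in range(2000)]
    print(n, round(np.mean(estimates), 3), round(np.std(estimates), 3))
# The standard deviation of the estimates shrinks roughly with sqrt(n),
# giving a direct view of how large n must be for stable coefficients.
```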
Interpretation
Causal interpretations of SE models are, without a doubt, the clearest and most readily understandable. However, these interpretations will be fundamentally fallacious and simply wrong if the model’s structure does not accurately correspond to the world’s actual causal structure. Consequently, any rigorous interpretation must address the overall status and underlying structure of the model, rather than merely focusing on the individual estimated coefficients in isolation. Whether a model genuinely fits the data, and crucially, how that model came to fit the data, are paramount considerations for any valid interpretation.
Data fit achieved through exploratory analyses, or by slavishly following successive modification indices, does not inherently guarantee that the model is wrong. However, such approaches raise profound doubts because they are highly susceptible to incorrectly modeling data features. For instance, embarking on an exploration to determine the “required” number of factors often preempts the possibility of discovering that the data are, in fact, not factor-structured at all, especially if the factor model has been “persuaded” to fit through the convenient inclusion of measurement error covariances. The data’s inherent ability to speak against a postulated model is progressively eroded with each unwarranted inclusion of a “modification index suggested” effect or error covariance. It becomes exceedingly difficult, if not impossible, to recover a truly proper model if the initial or base model is burdened with several fundamental misspecifications. 49 It’s a slippery slope toward statistical fabrication.
Direct-effect estimates within SEM are interpreted in a manner parallel to the interpretation of coefficients in standard regression equations, but with a much stronger, explicit causal commitment. Each unit increase in a causal variable’s value is understood to produce a change of the estimated magnitude in the dependent variable’s value, given explicit control or adjustment for all other operative and modeled causal mechanisms. Indirect effects are interpreted similarly, with the magnitude of a specific indirect effect equaling the product of the series of direct effects that constitute that indirect pathway. The units involved are the actual, real-world scales of the observed variables’ values, and the assigned scale values for the latent variables. A specified or fixed 1.0 effect of a latent variable on a particular indicator serves to coordinate that indicator’s scale with the latent variable’s scale. The presumption that the remainder of the model remains constant or unchanging during interpretation may necessitate discounting indirect effects that might, in the complexities of the real world, be simultaneously prompted by a real unit increase in the causal variable. Furthermore, the very “unit increase” itself might be inconsistent with what is genuinely possible in the real world, as there may be no known or practical way to actually change the causal variable’s value. If a model explicitly adjusts for measurement errors, this adjustment permits interpreting latent-level effects as referring to variations in true scores, a much more theoretically satisfying outcome. 29
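To make the product and sum rules concrete, here is a tiny worked sketch in Python; all three coefficient values are hypothetical, standardized numbers chosen purely for illustration.

```python
# Hypothetical standardized direct effects along the pathway X -> M -> Y.
effect_x_on_m = 0.5   # direct effect of X on M
effect_m_on_y = 0.4   # direct effect of M on Y
direct_x_on_y = 0.1   # direct effect of X on Y, if the model includes one

# An indirect effect is the product of the direct effects along its path.
indirect_x_on_y = effect_x_on_m * effect_m_on_y   # 0.5 * 0.4 = 0.20

# A total effect sums the direct effect and all indirect effects.
total_x_on_y = direct_x_on_y + indirect_x_on_y    # 0.1 + 0.20 = 0.30
print(indirect_x_on_y, total_x_on_y)
```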
SEM interpretations depart most radically from traditional regression interpretations when a complex network of causal coefficients connects the latent variables, primarily because standard regressions do not inherently provide estimates of indirect effects. SEM interpretations should meticulously convey the full consequences of the intricate patterns of indirect effects that transmit influences from background variables, through intervening variables, to the downstream dependent variables. This approach actively encourages a deeper understanding of how multiple worldly causal pathways can operate in coordination, or independently, or even counteract one another. Direct effects might be counteracted (or reinforced) by indirect effects, or have their correlational implications counteracted (or reinforced) by the effects of common causes. 16 The precise meaning and interpretation of specific estimates should always be contextualized within the entirety of the full model, not just pulled out in isolation.
A robust SE model interpretation should also thoughtfully connect specific model causal segments to their implications for variance and covariance. A single direct effect indicates that the variance in the independent variable contributes a specific amount of variation to the dependent variable’s values. However, the fine-grained causal details of precisely how this occurs remain unspecified, as a single effect coefficient does not inherently contain sub-components available for integration into a structured narrative of that effect’s genesis. A more fine-grained SE model, one that incorporates additional variables intervening between the cause and effect, would be required to furnish the specific features that constitute a coherent story about how any one effect truly functions. Until such a model emerges, each estimated direct effect retains a certain tinge of the unknown, thereby invoking the very essence of a theory. A parallel, fundamental unknownness would accompany each estimated coefficient in even the most fine-grained model, ensuring that the sense of fundamental mystery is never fully eradicated from SE models. And perhaps, that’s precisely the point.
Even if each modeled effect is only known by the identity of the variables involved and the estimated magnitude of the effect, the intricate structures linking multiple modeled effects provide rich opportunities to articulate how things function to coordinate the observed variables, thereby yielding exceptionally useful interpretation possibilities. For example, a common cause inherently contributes to the covariance or correlation between the two affected variables because, if the value of the common cause increases, the values of both effects should also increase (assuming positive effects), even if we do not possess the complete narrative underlying each specific cause. 16 (A correlation, for clarity, is simply the covariance between two variables that have both been standardized to possess a variance of 1.0). Another valuable interpretive contribution might involve explaining how two causal variables can both account for variance in a dependent variable, and also how the covariance between those two causes can either increase or decrease the overall explained variance in the dependent variable. That is, interpretation can involve elucidating how a specific pattern of effects and covariances can collectively contribute to diminishing a dependent variable’s variance. 50 Understanding these causal implications implicitly connects to the broader understanding of “controlling” for variables, and potentially explaining why some variables, but not others, should be controlled within an analysis. 5 51 As models grow in complexity, these fundamental components can combine in strikingly non-intuitive ways. For instance, it’s possible to explain how there can be absolutely no correlation (zero covariance) between two variables, despite those variables being connected by a direct, non-zero causal effect. 16 17 7 32
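That final claim, zero covariance despite a non-zero direct effect, can be checked numerically. In the sketch below the effect sizes are hypothetical and were chosen so that the direct effect of 0.3 is exactly cancelled by an opposing indirect path of 0.6 × (−0.5) = −0.3.

```python
# Numerical check: a non-zero direct effect whose correlational implication
# is exactly cancelled by an opposing indirect path. Values are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(size=n)                        # cause, variance 1
m = 0.6 * x + rng.normal(size=n)              # intervening variable
y = 0.3 * x - 0.5 * m + rng.normal(size=n)    # direct 0.3, indirect -0.3

# Model-implied covariance of x and y: 0.3 + 0.6 * (-0.5) = 0.0
print(np.cov(x, y)[0, 1])  # approximately zero despite the direct effect
```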
The caution articulated in the “Model Assessment” section warrants emphatic repetition. Interpretation should remain possible whether a model is, or is not, consistent with the data. The estimates, in their essence, report how the world would appear to someone who believes the model, even if that belief is ultimately unfounded because the model happens to be fundamentally incorrect. Interpretation must explicitly acknowledge that the model coefficients may or may not genuinely correspond to “parameters” in the real world, precisely because the model’s coefficients may not possess corresponding worldly structural features. This distinction is not academic nitpicking; it’s a matter of intellectual honesty.
The strategic addition of new latent variables that enter or exit the original model at a few clearly defined causal locations or variables can significantly contribute to detecting model misspecifications. Such misspecifications, if left unaddressed, could otherwise irrevocably ruin the interpretations of other coefficients. The correlations between the new latent variable’s indicators and all the original indicators contribute directly to testing the original model’s structure. This is because the few new and focused effect coefficients must operate in coordinated harmony with the model’s original direct and indirect effects to properly coordinate the new indicators with the original ones. If the original model’s structure was, in fact, problematic, these sparse new causal connections will prove insufficient to adequately coordinate the new indicators with the original ones. This failure will, in turn, signal the inappropriateness of the original model’s coefficients through the undeniable evidence of model-data inconsistency. 32 The correlational constraints grounded in null/zero effect coefficients, and coefficients that are assigned fixed nonzero values, contribute critically to both model testing and coefficient estimation. Therefore, they unequivocally deserve acknowledgment as the essential scaffolding that supports both the estimates and their subsequent interpretation. 32
Interpretations become progressively more complex for models that incorporate interactions, nonlinearities, multiple groups, multiple levels, and categorical variables. 30 Similarly, effects that interact with causal loops, reciprocal effects, or correlated residuals also necessitate slightly revised and more nuanced interpretations. 7 32 The model’s complexity demands a corresponding increase in interpretive sophistication.
Careful and conscientious interpretation of both failing and fitting models can, surprisingly, provide significant research advancement. To be truly dependable, a model must investigate academically informative causal structures, fit applicable data with estimates that are both understandable and substantively plausible, and rigorously avoid the inclusion of vacuous or theoretically empty coefficients. 53 Dependable fitting models are, admittedly, far rarer than failing models or models that have been inappropriately bludgeoned into fitting. Yet, appropriately-fitting models are possible, and their identification is a mark of true methodological rigor. 37 54 55 56
The multiple, often conflicting, ways of conceptualizing Partial Least Squares (PLS) models 57 significantly complicate their interpretation. Many of the interpretive comments above remain applicable if a PLS modeler adopts a realist perspective, striving to ensure that their modeled indicators combine in a manner that accurately reflects some existing, albeit unobservable, latent variable. However, for non-causal PLS models, such as those focusing primarily on R² or out-of-sample predictive power, the interpretation criteria fundamentally shift. These approaches diminish the concern for whether or not the model’s coefficients actually possess real-world counterparts. The fundamental features that differentiate the five distinct PLS modeling perspectives discussed by Rigdon, Sarstedt, and Ringle 57 directly point to variations in PLS modelers’ objectives, and, consequently, to corresponding differences in the specific model features that warrant careful interpretation.
A strong word of caution is always warranted when making claims of causality, even when rigorous experiments or meticulously time-ordered investigations have been conducted. The term “causal model” must be understood to mean “a model that conveys causal assumptions,” not necessarily a model that definitively produces validated causal conclusions. It might; it also might not. While collecting data at multiple time points and employing an experimental or quasi-experimental design can certainly help to rule out certain rival hypotheses, even a perfectly randomized experiment cannot entirely eliminate all potential threats to causal claims. No research design, however sophisticated, can ever fully guarantee the absolute certainty of causal structures. 5 The universe, it seems, enjoys its ambiguities.
Controversies and movements
Structural equation modeling, despite its widespread adoption, is a field perpetually fraught with controversies. It’s less a placid lake and more a turbulent sea of debate. Researchers stemming from the traditional factor analytic tradition often attempt to reduce sets of multiple indicators into a smaller, more manageable number of scales or factor-scores, intending to use these simplified constructs later in path-structured models. This constitutes a stepwise process, where an initial measurement step yields scales or factor-scores that are then subsequently fed into a path model. While this stepwise approach might appear intuitively obvious, it, in fact, grapples with severe underlying deficiencies.
The segmentation of the process into distinct steps fundamentally interferes with a thorough and holistic checking of whether the derived scales or factor-scores genuinely and validly represent their underlying indicators, and/or whether they accurately report on latent-level effects. A truly comprehensive structural equation model that simultaneously incorporates both the measurement and the latent-level structures not only verifies whether the latent factors appropriately coordinate their respective indicators, but also rigorously checks whether each latent simultaneously and appropriately coordinates its own indicators with the indicators of its theorized causes and/or consequences. 32 If a latent variable proves incapable of performing both these styles of coordination, its very validity is called into question, and, by extension, any scale or factor-score purporting to measure that latent becomes suspect. The disagreements that have swirled around this issue fundamentally concern the respect for, or disregard of, empirical evidence that challenges the validity of postulated latent factors. These simmering, and at times outright boiling, discussions culminated in a special issue of the journal Structural Equation Modeling, centered on a provocative target article by Hayduk and Glaser 21. This was followed by several insightful comments and a rejoinder 22, all of which were, commendably, made freely available thanks to the efforts of George Marcoulides.
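By way of illustration, the sketch below specifies a full model of the kind just described, with the measurement structure (the =~ lines) and the latent-level effect (the ~ line) estimated and tested together rather than in separate steps. The latents and indicators (ability, performance, a1 through p3) are invented, and the lavaan-style syntax shown is the flavor semopy documents; check your own package’s conventions.

```python
# A single model description carrying both the measurement part (=~)
# and the latent-level structural part (~), so both are estimated and
# tested simultaneously. All variable names here are invented.
import semopy

DESC = """
# measurement model
ability =~ a1 + a2 + a3
performance =~ p1 + p2 + p3
# latent-level structural model
performance ~ ability
"""

model = semopy.Model(DESC)
# model.fit(df)  # df: a pandas DataFrame containing a1..a3 and p1..p3
```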
These intense discussions further ignited debate over the fundamental question of whether structural equation models should even be tested for consistency with the data. This question of model testing then became the next focal point of academic contention. Scholars with strong backgrounds in path modeling tended to staunchly defend the necessity of careful and rigorous model testing, whereas those primarily rooted in factor analysis traditions often advocated for the use of “fit-indexing” as a substitute for direct fit-testing. These protracted discussions eventually led to a powerful target article in Personality and Individual Differences by Paul Barrett 39, who unequivocally stated: “In fact, I would now recommend banning ALL such indices from ever appearing in any paper as indicative of model ‘acceptability’ or ‘degree of misfit’.” 39 (page 821). Barrett’s article, too, was accompanied by extensive commentary reflecting both sides of the argument. 53 58 It seems some truths are simply too inconvenient to be easily accepted.
The controversy surrounding model testing has, thankfully, begun to subside as clear and transparent reporting of significant model-data inconsistency becomes increasingly mandatory. Scientists, after all, are not afforded the luxury of ignoring, or failing to report, evidence simply because they dislike what that evidence reveals. 33 This fundamental requirement, attending to evidence that points toward model misspecification, underpins a more recent and growing concern for rigorously addressing “endogeneity.” Endogeneity represents a specific type of model misspecification that fundamentally interferes with accurate estimation due to a lack of independence among error or residual variables. More generally, the broader controversy over the inherently causal nature of structural equation models, including factor models, has also been in decline. Even Stan Mulaik, a long-standing stalwart of factor analysis, has openly acknowledged the causal basis of factor models. 59 The insightful comments by Bollen and Pearl regarding common myths about causality within the context of SEM 26 have further reinforced the undeniable centrality of causal thinking in this analytical framework.
A briefer, though equally pointed, controversy centered on the practice of comparing competing models. While comparing alternative models can be incredibly insightful, there are fundamental issues that simply cannot be resolved by merely creating two models and then retaining the one that happens to fit the data “better.” The statistical sophistication often present in presentations, such as that by Levy and Hancock (2007) 60, can, for example, make it easy to overlook the rather inconvenient truth that a researcher might begin with one truly terrible model and one utterly atrocious model, only to conclude by retaining the structurally terrible model because some arbitrary index reports it as “better fitting” than the atrocious one. It is, frankly, unfortunate that even otherwise robust SEM texts, such as Kline (2016) 30, remain disturbingly weak in their presentation of rigorous model testing. 61 Overall, the genuine contributions that can be made by structural equation modeling are entirely dependent on meticulous and detailed model assessment, even if the “best available” model happens to be a demonstrably failing one.
An additional controversy, one that has thus far only touched the fringes of these prior debates, awaits its full ignition. [ citation needed ] Factor models and theory-embedded factor structures that rely on multiple indicators frequently tend to fail, and the common practice of dropping “weak” indicators often appears to reduce the observable model-data inconsistency. This reduction in the number of indicators inevitably leads to a renewed concern for, and controversy over, the minimum number of indicators genuinely required to adequately support a latent variable within a structural equation model. Researchers deeply rooted in the factor tradition can sometimes be persuaded to reduce the number of indicators to three per latent variable, but even three, or indeed just two, indicators may still be fundamentally inconsistent with a proposed underlying factor acting as a common cause. Hayduk and Littvay (2012) 35 provided a compelling discussion on how to conceptualize, defend, and appropriately adjust for measurement error when utilizing only a single indicator for each modeled latent variable (a sketch of that fixed-error-variance strategy appears below). Single indicators have, in fact, been used effectively in SE models for a considerable period 54, yet controversy remains only as far away as a reviewer who has considered measurement exclusively from the narrow, factor analytic perspective.
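As a rough sketch of that single-indicator strategy: given an assumed reliability and an observed indicator variance, the error variance to fix is (1 − reliability) × observed variance. The numbers below (reliability .80 and indicator variance 2.5, hence a fixed error variance of 0.5) and the variable names are invented; the fixed-value syntax follows lavaan conventions, which semopy largely shares, so verify it against your package’s documentation.

```python
# Single-indicator latent with a fixed, non-zero measurement error variance.
# Assumed reliability .80 and indicator variance 2.5 give (1 - .80) * 2.5 = 0.5.
import semopy

DESC = """
stress =~ 1.0 * stress_item
stress_item ~~ 0.5 * stress_item
outcome ~ stress
"""

model = semopy.Model(DESC)
# model.fit(df)  # df must contain the stress_item and outcome columns
```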
While these controversies are, for the most part, declining, traces of them are still scattered throughout the SEM literature. One can quite easily reignite a vigorous disagreement by merely posing questions such as: What, precisely, should be done with models that are found to be significantly inconsistent with the data? Or, does the allure of model simplicity truly override the fundamental respect for evidence of data inconsistency? What weight, if any, should be given to indices that purport to show a “close” or “not-so-close” data fit for some models? Should we be particularly lenient toward, and “reward,” parsimonious models that are nonetheless inconsistent with the data? Or, given that the RMSEA explicitly condones disregarding a certain amount of real ill fit for each model degree of freedom (as the formula sketched below makes explicit), doesn’t that inherently imply that researchers testing models with null-hypotheses of non-zero RMSEA are, by definition, engaged in deficient model testing? Addressing such questions cogently requires a considerable degree of statistical sophistication, though, ultimately, the responses will likely converge on the non-technical matter of whether or not researchers are truly required to report and respect empirical evidence, regardless of how inconvenient it may be.
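To make the RMSEA point concrete, the sketch below implements the standard formula (Steiger’s version; some programs use N rather than N − 1), which spreads any chi-square ill fit beyond the model’s degrees of freedom over df × (N − 1). The sample chi-square, df, and N values are invented.

```python
# Standard RMSEA: ill fit beyond df chi-square units is divided by
# df * (n - 1), which is how the index "forgives" some misfit per
# degree of freedom. The example values below are invented.
from math import sqrt

def rmsea(chi2, df, n):
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

print(rmsea(chi2=85.0, df=40, n=400))  # about 0.053, under the common .06 cutoff
```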
Extensions, modeling alternatives, and statistical kin
The field of structural equation modeling is, predictably, not static. It is constantly evolving, spawning a myriad of extensions, alternative modeling approaches, and statistical relatives that aim to address its inherent complexities and expand its applicability. It seems the quest for a perfect model, or at least a less imperfect one, is unending. These include:
- Categorical dependent variables [ citation needed ]
- Categorical intervening variables [ citation needed ]
- Copulas [ citation needed ]
- Deep Path Modelling 62
- Exploratory Structural Equation Modeling 63
- Fusion validity models 64
- Item response theory models [ citation needed ]
- Latent class models [ citation needed ]
- Latent growth modeling [ citation needed ]
- Link functions [ citation needed ]
- Longitudinal models 65
- Measurement invariance models 66
- Meta-analytic Structural Equation Modeling (MASEM) and Individual Participant Data Meta-analytic Structural Equation Modeling (IPD MASEM)
- Mixture model [ citation needed ]
- Multilevel models, including hierarchical models (e.g., individuals nested within groups) 67
- Multiple group modeling, with or without imposed constraints between groups (e.g., comparisons across genders, cultures, test forms, languages, etc.) [ citation needed ]
- Multi-method multi-trait models [ citation needed ]
- Random intercepts models [ citation needed ]
- Structural Equation Model Trees [ citation needed ]
- Structural Equation Multidimensional scaling 68
Software
The array of software packages available for structural equation modeling is vast and, frankly, can be overwhelming. These programs differ significantly in their capabilities, the level of user expertise they demand, and the underlying statistical philosophies they embody. 69 Below is a table detailing some of the currently available options, for those brave enough to dive in.
| Name | License | Platform | Standalone or add-on | Link | Covariance-Based | Variance-Based |
|---|---|---|---|---|---|---|
| Mplus | Commercial | Windows, Mac, Linux | Standalone | statmodel.com | ✓ | |
| AMOS | Commercial | Windows | Standalone | ibm.com | ✓ | |
| lavaan | Open Source | Windows, Mac, Linux | Add-on for R (programming language) | lavaan.org | ✓ | |
| lavaangui | Open Source | Windows, Mac, Linux | Add-on for R (programming language) and Standalone | lavaangui.org | ✓ (uses lavaan) | |
| LISREL | Commercial | Windows | Standalone | ssicentral.com | ✓ | |
| EQS | Commercial | Windows, Mac, Linux | Standalone | mvsoft.com | ✓ | |
| Stata | Commercial | Windows, Mac, Linux | Standalone | stata.com | ✓ | |
| SAS (software) | Commercial | Windows, Mac, Linux | Standalone | sas.com | ✓ | |
| semopy | Open Source | Windows, Mac, Linux | Add-on for Python (programming language) | semopy.com | ✓ | |
| sem | Open Source | Windows, Mac, Linux | Add-on for R (programming language) | cran.r-project.org | ✓ | |
| OpenMx | Open Source | Windows, Mac, Linux | Add-on for R (programming language) | openmx.ssri.psu.edu | ✓ | |
| Ωnyx | Open Source | Windows, Mac, Linux | Standalone | onyx.brandmaier.de | ✓ | |
| SmartPLS 4 | Commercial | Windows, Mac | Standalone | smartpls.com | ✓ | ✓ |
| PLSGraph | Commercial | Windows | Standalone | plsgraph.com | | ✓ |
| WarpPLS | Commercial | Windows | Standalone | warppls.com | | ✓ |
| ADANCO | Commercial | Windows, Mac | Standalone | composite-modeling.com | | ✓ |
| LVPLS | Freeware | MS-DOS | Standalone | www2.kuas.edu.tw | | ✓ |
| matrixpls | Open Source | Windows, Mac, Linux | Add-on for R (programming language) | cran.r-project.org | | ✓ |
| SEMinR | Open Source | Windows, Mac, Linux | Add-on for R (programming language) | https://github.com/sem-in-r/seminr | ✓ (uses lavaan) | ✓ |
See also
For those who haven’t had enough, or simply enjoy a good rabbit hole:
- Causal model – The very conceptual foundation in the philosophy of science.
- Graphical model – Another way to visualize probabilistic relationships.
- Judea Pearl – A key figure in extending causal inference.
- Multivariate statistics – For when you’re observing and analyzing more than one outcome variable simultaneously.
- Partial least squares path modeling – A specific, and often debated, method for structural equation modeling.
- Partial least squares regression – A related statistical method.
- Simultaneous equations model – A type of statistical model where variables influence each other concurrently.
- Causal map – A visual representation of cause-and-effect relationships.
- Bayesian network – A probabilistic graphical model that represents a set of variables and their conditional dependencies.