Overidentified model

An overidentified model is a statistical or econometric model in which the observed data supply more pieces of information, or "identifying restrictions," than there are parameters to be estimated. Because of this surplus, the model's parameters can be determined in more than one way, and checking whether those ways agree provides a basis for hypothesis testing and for evaluating the model's validity. In essence, an overidentified model has a positive number of degrees of freedom, whereas a just-identified model has zero.²⁵

History and Origin

The concept of identification in statistical and econometric models, including overidentification, gained prominence with the development of large-scale structural equation models. A key period for this development was the mid-20th century, particularly the work of the Cowles Commission for Research in Economics. Researchers associated with the Cowles Commission, such as Tjalling Koopmans, recognized the identification problem when attempting to estimate the parameters of simultaneous equation systems in economics.²⁴,²³ They established the conditions under which the parameters of an economic model could be uniquely determined from observable data. Exploring these conditions led naturally to the classification of models as underidentified, just-identified, or overidentified, depending on the number of independent pieces of information relative to the number of unknown parameters. Formalizing these concepts was crucial for the rigorous application of statistical methods to economic theory.

Key Takeaways

An overidentified model has more observed information or restrictions than unknown parameters, leading to positive degrees of freedom.
The surplus of information in an overidentified model allows for statistical tests of the model's goodness-of-fit and validity.
Overidentification provides a mechanism to check if the model's assumptions are consistent with the observed data.
While offering benefits for testing, an overidentified model may not perfectly fit the observed data, unlike a just-identified model.

Formula and Calculation

The identification status of a model, including whether it is overidentified, is often assessed using the order condition and the rank condition.

The order condition for identification of a single equation in a system of structural equations is:

K - k \ge g - 1

Where:

( K ) = Total number of predetermined (exogenous and lagged endogenous) variables in the entire system.
( k ) = Number of predetermined variables included in the specific equation being examined.
( g ) = Number of endogenous variables included in the specific equation being examined.
( G ) = Total number of endogenous variables (equivalently, equations) in the entire system; it appears in the rank condition.

The order condition is necessary but not sufficient. For an equation that satisfies it:

If ( K - k = g - 1 ), the equation is just-identified.
If ( K - k > g - 1 ), the equation is overidentified.

If ( K - k < g - 1 ), the equation is underidentified.
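The counting rule above can be sketched as a small helper. This is a minimal illustration, not a full identification check: the function name `order_condition_status` is hypothetical, `g` is assumed to mean the number of endogenous variables appearing in the equation being examined, and the order condition is only necessary, so a result other than "underidentified" still has to be confirmed by the rank condition.

```python
def order_condition_status(K: int, k: int, g: int) -> str:
    """Classify one structural equation using the order condition only.

    K: predetermined variables in the whole system
    k: predetermined variables included in this equation
    g: endogenous variables included in this equation (assumed meaning)
    """
    excluded = K - k   # predetermined variables left out of this equation
    needed = g - 1     # right-hand-side endogenous variables to instrument
    if excluded < needed:
        return "underidentified"
    if excluded == needed:
        return "just-identified"
    return "overidentified"


# Two predetermined variables in the system, one of them in the equation,
# and two endogenous variables in the equation: one excluded vs one needed.
print(order_condition_status(K=2, k=1, g=2))
# A third predetermined variable elsewhere in the system adds a surplus.
print(order_condition_status(K=3, k=1, g=2))
```

Keeping the check to plain integer counts makes it easy to apply equation by equation before any estimation is attempted.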

The rank condition is more stringent: it is both necessary and sufficient for identification. It states that a specific equation is identified if and only if at least one non-zero determinant of order ( (G - 1) \times (G - 1) ) can be formed from the coefficients of the variables excluded from that equation but included in other equations of the system. This condition ensures that the excluded variables carry enough independent variation to uniquely determine the parameters of interest.²²

The degrees of freedom of an overidentified model equal the number of known data points (or moment conditions) minus the number of unknown parameters to be estimated. For instance, in structural equation modeling, if the number of unique elements in the covariance matrix of observed variables is greater than the number of free parameters to be estimated, the model is overidentified.²¹
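For the structural equation modeling case, the count can be sketched as follows. The helper name `sem_degrees_of_freedom` is hypothetical, and the sketch assumes that only the covariance structure is modeled (no mean structure), so that p observed variables supply p(p + 1)/2 unique variances and covariances.

```python
def sem_degrees_of_freedom(n_observed: int, n_free_params: int) -> int:
    """Unique covariance-matrix elements minus free parameters.

    Assumes a covariance-only model: p observed variables give
    p * (p + 1) / 2 unique variances and covariances.
    """
    unique_moments = n_observed * (n_observed + 1) // 2
    return unique_moments - n_free_params


# Illustrative one-factor model with 4 indicators and one loading fixed to 1:
# 3 free loadings + 4 error variances + 1 factor variance = 8 free parameters.
print(sem_degrees_of_freedom(n_observed=4, n_free_params=8))  # 10 - 8 = 2

# With 3 indicators the same setup has 6 moments and 6 free parameters.
print(sem_degrees_of_freedom(n_observed=3, n_free_params=6))  # 6 - 6 = 0
```

A positive result indicates overidentification, zero indicates a just-identified model, and a negative result indicates that the model is underidentified by this count.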

Interpreting the Overidentified Model

An overidentified model is generally preferred in econometrics and structural equation modeling because it makes the model's underlying assumptions testable.²⁰ When a model is overidentified, the surplus information (excess moments or identifying restrictions) can be used to perform tests of overidentifying restrictions, such as the Sargan-Hansen J-test.

A statistically significant result from such a test indicates that the model's restrictions are inconsistent with the data, suggesting misspecification or invalid instrumental variables. Conversely, a non-significant result implies that the data do not reject the overidentifying restrictions, lending support to the model's validity. This capacity for internal validation is a crucial advantage.¹⁹ Unlike a just-identified model, which will always perfectly fit the observed data by definition and thus offers no room for such empirical tests, an overidentified model's ability to be "wrong" makes it more scientifically useful for validating theoretical constructs.¹⁸
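As a rough guide to what such a test computes, the Sargan form of the statistic for a linear instrumental-variables model can be written as below. The notation is an assumption introduced here, not taken from the text above: \hat{u} is the vector of two-stage least squares residuals, \mathbf{V} is the n-by-m matrix of all instruments (including the exogenous regressors), and p is the number of estimated coefficients. The chi-squared approximation relies on homoskedastic errors; Hansen's J statistic is the heteroskedasticity-robust GMM counterpart.

```latex
S \;=\; n \cdot
\frac{\hat{u}^{\top} \mathbf{V} \left(\mathbf{V}^{\top}\mathbf{V}\right)^{-1} \mathbf{V}^{\top} \hat{u}}
     {\hat{u}^{\top}\hat{u}}
\;\overset{a}{\sim}\; \chi^{2}_{\,m-p}
\quad \text{under the null that all instruments are valid.}
```

The degrees of freedom m - p equal the number of overidentifying restrictions, which is why the statistic is unavailable when the model is just-identified (m = p).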

Hypothetical Example

Consider a simple economic model aiming to explain the demand for a certain good (Q) based on its price (P) and consumer income (Y).
Assume the true structural equations are:

Demand: ( Q = \beta_0 + \beta_1 P + \beta_2 Y + \epsilon_D )
Supply: ( Q = \gamma_0 + \gamma_1 P + \gamma_2 W + \epsilon_S )

Where:

( Q ) and ( P ) are endogenous variables (determined within the system).
( Y ) (consumer income) and ( W ) (weather/input costs) are exogenous variables (determined outside the system).
( \epsilon_D ) and ( \epsilon_S ) are error terms.

Our goal is to estimate the parameters of the demand equation (( \beta_0, \beta_1, \beta_2 )).
To identify the demand equation, the number of exogenous variables that appear in the supply equation but are excluded from the demand equation must be at least the number of endogenous variables in the demand equation minus one.

In the demand equation:

Endogenous variables: ( Q, P ) (2 variables)
Parameters to estimate: ( \beta_0, \beta_1, \beta_2 ) (3 parameters)

The relevant excluded exogenous variable for the demand equation (from the supply equation) is ( W ).
Applying the order condition for the demand equation:

Number of endogenous variables in the equation (( G_D )): 2 (Q, P)
Number of predetermined variables in the entire system excluded from the demand equation (( K_{excl} )): 1 (W)

Looking at the demand equation alone, we compare ( K_{excl} ) with ( G_D - 1 ).
So, ( 1 ) vs ( (2 - 1) = 1 ).
Since ( 1 = 1 ), the order condition holds with equality: the demand equation is identified, and more precisely just-identified.

Now, let's say we have an additional exogenous variable ( Z ) (e.g., population size) that affects only supply and is not included in the demand equation. Our supply equation becomes:
Supply: ( Q = \gamma_0 + \gamma_1 P + \gamma_2 W + \gamma_3 Z + \epsilon_S )

Now, for the demand equation, the excluded exogenous variables are ( W ) and ( Z ).

( K_{excl} ): 2 (W, Z)
( G_D - 1 ): 1 (as before)

Since ( 2 > 1 ), the demand equation is overidentified. This means we have more identifying information (two excluded exogenous variables, ( W ) and ( Z )) than strictly necessary to determine the coefficients of the endogenous variables in the demand equation. This surplus allows for tests of the validity of ( W ) and ( Z ) as instruments.
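The example can be simulated to see what the surplus instrument buys. The sketch below is illustrative only: the parameter values, sample size, random seed, and the helper names `two_sls` and `sargan_statistic` are all assumptions made for this demonstration. It estimates the demand equation by two-stage least squares twice, once with ( W ) alone as the excluded instrument (just-identified) and once with ( W ) and ( Z ) (overidentified), and computes a Sargan-type statistic, which assumes homoskedastic errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical "true" parameters, chosen only for this illustration.
b0, b1, b2 = 10.0, -1.0, 0.5           # demand: Q = b0 + b1*P + b2*Y + eD
g0, g1, g2, g3 = 2.0, 1.0, 0.8, 0.6    # supply: Q = g0 + g1*P + g2*W + g3*Z + eS

Y, W, Z = rng.normal(size=(3, n))      # exogenous variables
eD, eS = rng.normal(size=(2, n))       # structural error terms

# Market equilibrium: set demand equal to supply, solve for P, then Q.
P = (g0 - b0 + g2 * W + g3 * Z - b2 * Y + eS - eD) / (b1 - g1)
Q = b0 + b1 * P + b2 * Y + eD


def two_sls(y, X, instruments):
    """2SLS: project X on the instruments, then regress y on the projection."""
    X_hat = instruments @ np.linalg.lstsq(instruments, X, rcond=None)[0]
    beta = np.linalg.lstsq(X_hat, y, rcond=None)[0]
    return beta, y - X @ beta          # residuals use the original X


def sargan_statistic(resid, instruments):
    """n times the share of residual variation explained by the instruments."""
    fitted = instruments @ np.linalg.lstsq(instruments, resid, rcond=None)[0]
    return len(resid) * (fitted @ fitted) / (resid @ resid)


const = np.ones(n)
X = np.column_stack([const, P, Y])              # demand-equation regressors
inst_just = np.column_stack([const, Y, W])      # one excluded instrument
inst_over = np.column_stack([const, Y, W, Z])   # two excluded instruments

beta_just, resid_just = two_sls(Q, X, inst_just)
beta_over, resid_over = two_sls(Q, X, inst_over)

J_just = sargan_statistic(resid_just, inst_just)
J_over = sargan_statistic(resid_over, inst_over)
df_over = inst_over.shape[1] - X.shape[1]       # overidentifying restrictions

print("2SLS demand estimates (W, Z):", beta_over)
print("Sargan statistic, just-identified:", J_just)
print("Sargan statistic, overidentified:", J_over, "with df =", df_over)
```

In the just-identified case the residuals are orthogonal to every instrument by construction, so the statistic is zero up to rounding and nothing can be tested. In the overidentified case the statistic is generally positive, and with one overidentifying restriction it would be compared against a chi-squared distribution with one degree of freedom (5% critical value of about 3.84).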

Practical Applications

Overidentified models are used extensively in quantitative analysis, particularly in econometrics and causal inference.

Econometrics: In economic research, structural equations often involve simultaneous relationships (e.g., supply and demand, consumption and income). Overidentification allows researchers to use instrumental variables (IV) or generalized method of moments (GMM) techniques to estimate parameters and then test the validity of the chosen instruments.¹⁷,¹⁶ For instance, when analyzing the impact of education on earnings, where education might be an endogenous variable due to unobserved abilities, researchers might use an instrument like proximity to a college. If multiple such valid instruments are available, the model becomes overidentified, enabling tests of whether these additional instruments are truly exogenous.¹⁵
Structural Equation Modeling (SEM): In fields like psychology, sociology, and marketing, SEM is used to analyze complex relationships between observed and latent variables. Overidentified SEMs are preferred because they allow for goodness-of-fit testing, assessing how well the theoretical model fits the observed data structure.¹⁴
Program Evaluation: When evaluating the impact of policy interventions or programs, overidentified models can help determine the causal effect by using multiple sources of exogenous variation (instruments) and then validating these sources.
Financial Modeling: In some advanced financial modeling applications, particularly those involving asset pricing or risk management where structural relationships are posited, overidentification can occur and provide a basis for model validation.

A crucial practical tool is the Sargan-Hansen J-test, which is designed to test the validity of the overidentifying restrictions. This test essentially checks whether the additional instruments are uncorrelated with the error term. If the test rejects its null hypothesis, this indicates potential model specification issues or invalid instruments, prompting further investigation.¹³

Limitations and Criticisms

While overidentification offers significant advantages for model validation, it is not without limitations or criticisms:

Sensitivity to Instrument Validity: The primary strength of overidentification, the ability to test instruments, is also its main vulnerability. The Sargan-Hansen test, for example, tests a joint null hypothesis that all instruments are valid. If the test rejects the null, it doesn't specify which instrument is invalid. Identifying the problematic instrument can be challenging. Moreover, if the model is misspecified in other ways (e.g., omitted variables), the overidentification test may still reject, even if the instruments themselves are valid.¹²
Weak Instruments: If the instrumental variables used in an overidentified model are "weak" (i.e., only weakly correlated with the endogenous variables they are meant to instrument), then the properties of the parameter estimates (such as bias and consistency) can be severely affected, and the overidentification tests themselves may perform poorly, having low power to detect invalid instruments.¹¹
Model Misspecification: An overidentified model implies that the data provide more information than needed to uniquely estimate the parameters. This often means the model imposes certain restrictions on the data. If these restrictions are incorrect due to underlying model specification errors (e.g., incorrect functional form, omitted variables), the model may still pass the overidentification test, leading to a false sense of security, or fail the test for reasons unrelated to instrument validity.
Interpretation of Rejection: A rejection of the overidentification test suggests that the model is misspecified or the instruments are invalid. However, pinpointing the exact source of the problem requires careful diagnostic work, often involving examining subsets of instruments or considering alternative model formulations. This process can be complex and may not always yield a clear answer.¹⁰
Theoretical vs. Empirical Identification: A model can be theoretically overidentified but empirically underidentified if the chosen instruments lack sufficient correlation with the endogenous variables in the actual data. This highlights that formal conditions alone do not guarantee successful parameter estimation in practice.⁹,⁸

Overidentified Model vs. Just-identified Model

The distinction between an overidentified model and a just-identified model lies in the number of available pieces of information relative to the number of parameters that need to be estimated.

A just-identified model has exactly enough information (identifying restrictions or moment conditions) to estimate its parameters uniquely. The number of degrees of freedom for a just-identified model is zero. This means there is only one unique solution for the parameters, and the model will always perfectly reproduce the observed data's covariance matrix or match the specified moments. While this ensures unique parameter estimates, it also means there is no remaining information to test the model's assumptions or its goodness-of-fit against the data. It cannot be "wrong" in terms of fit to the observed moments.⁷,⁶,⁵

In contrast, an overidentified model has more information than necessary to estimate its parameters. This surplus results in positive degrees of freedom. The key advantage of overidentification is that it allows for statistical tests of the model's validity. Since there is redundant information, it can be checked for internal consistency. If the model is correctly specified and the identifying restrictions are valid, the extra information should not contradict the estimated parameters. This ability to test for consistency makes overidentified models generally more informative and preferred for empirical research, as they provide a means to assess whether the theoretical model specification aligns with the observed data.

FAQs

What does "identified" mean in the context of a statistical model?

A model is "identified" if there is a unique set of numerical values for all its unknown parameters that can be determined from the observed data. If a model is not identified, it means multiple (or infinitely many) sets of parameter values could explain the same observed data, making it impossible to draw unique conclusions.⁴

Why is an overidentified model generally preferred over a just-identified model?

An overidentified model is generally preferred because the surplus of information it contains allows for formal hypothesis testing of the model's underlying assumptions and goodness-of-fit. Unlike a just-identified model, which offers no such testing capability, an overidentified model allows researchers to assess whether their theoretical specifications are consistent with the empirical data.³

What happens if an overidentified model's test of overidentifying restrictions fails?

If a test of overidentifying restrictions, such as the Sargan-Hansen J-test, fails (i.e., yields a statistically significant p-value), it suggests that the model is likely misspecified or that at least one of the instrumental variables used is invalid (correlated with the error term). This indicates that the additional information in the model contradicts the estimated parameters, casting doubt on the validity of the model's conclusions. Further investigation into the model specification and instrument selection is then necessary.²

Does overidentification guarantee a "good" model?

No, overidentification itself does not guarantee a "good" or correctly specified model. While it allows for tests of internal consistency and instrument validity, these tests only check specific restrictions. An overidentified model could still be flawed due to other forms of model specification error (e.g., incorrect functional form, omitted relevant variables not covered by the identification tests) or issues like weak instruments that reduce the power of the tests themselves.¹