First stage regression

What Is First Stage Regression?

First stage regression is the initial step in a multi-stage estimation technique, most commonly instrumental variables (IV) estimation by Two-Stage Least Squares (2SLS). In this step, an endogenous explanatory variable from the primary regression (the "structural equation," estimated in the "second stage") is regressed on a set of instruments and all other exogenous variables. The purpose of the first stage regression is to isolate the variation in the endogenous explanatory variable that is uncorrelated with the error term of the main equation, thereby addressing endogeneity.

History and Origin

The concept underlying instrumental variables, and by extension, the first stage regression, emerged to address challenges in establishing causal inference from observational data. Early applications sought to disentangle complex economic relationships where variables might simultaneously influence each other or where unobserved factors could bias standard regression estimates. While various researchers contributed to the development of these techniques, early foundational work on instrumental variable regression is often attributed to Philip G. Wright in 1928 for his work in agricultural economics, and Sewall Wright in 1925, though the complete historical attribution is debated among economists.⁴

The instrumental variables framework gained significant traction and refinement over the decades, becoming a cornerstone of empirical economics. Its impact on the field was recognized with the 2021 Nobel Memorial Prize in Economic Sciences, awarded to David Card for his empirical contributions to labor economics, and to Joshua Angrist and Guido Imbens for their methodological contributions to the analysis of causal relationships. Their work helped clarify when and how instrumental variables could be used to draw valid causal conclusions from real-world data, particularly in "natural experiments" where researchers exploit real-world events that mimic randomized controlled trials.³

Key Takeaways

First stage regression is the initial step in Two-Stage Least Squares (2SLS) and other instrumental variable (IV) estimation methods.
Its main goal is to generate a predicted value for an endogenous explanatory variable that is free from correlation with the primary equation's error term.
This process addresses endogeneity arising from sources such as measurement error, omitted variables, and simultaneity, yielding consistent parameter estimates where ordinary least squares (OLS) would not.
The first stage regression uses valid instrumental variables and all other exogenous variables from the model.
The strength of the relationship between the instrumental variables and the endogenous explanatory variable is crucial for the validity and efficiency of the overall estimation.

Formula and Calculation

The first stage regression specifically models the endogenous explanatory variable as a function of the chosen instrumental variables and all exogenous variables present in the structural (second-stage) equation.

Consider a structural equation where (Y) is the dependent variable, (X) is the endogenous explanatory variable, and (Z) is an exogenous explanatory variable:

Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + u_i

Here, (X_i) is correlated with (u_i) (the error term), causing endogeneity. To address this, we need an instrumental variable, (W_i), that is correlated with (X_i) but uncorrelated with (u_i).

The first stage regression then takes the form:

X_i = \gamma_0 + \gamma_1 W_i + \gamma_2 Z_i + v_i

In this equation:

(X_i): The endogenous explanatory variable from the structural equation, now serving as the dependent variable in the first stage.
(W_i): The instrumental variable(s).
(Z_i): All exogenous explanatory variables from the original structural equation. They must appear in the first stage as well; otherwise the first-stage residual, which enters the second-stage error, would generally be correlated with (Z_i), making the second-stage estimates inconsistent.
(v_i): The error term for the first stage regression.

After running this first stage regression, the predicted values of (X_i), denoted as (\hat{X}_i), are obtained. These predicted values are then used in the second stage regression.
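The first stage described above can be sketched in a few lines of Python. This is a minimal illustration on simulated data, not a definitive implementation: the sample size, the coefficient values, and the variable names are all assumptions chosen for the example, and NumPy's least-squares solver stands in for whatever regression routine an analyst would normally use.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Simulated data; every number below is an illustrative assumption.
Z = rng.normal(size=n)                 # exogenous control from the structural equation
W = rng.normal(size=n)                 # instrument
v = rng.normal(size=n)                 # first-stage error
X = 1.0 + 0.8 * W + 0.5 * Z + v        # assumed true first stage: gamma = (1.0, 0.8, 0.5)

# First stage: regress X on a constant, the instrument W, and the control Z.
design = np.column_stack([np.ones(n), W, Z])
gamma_hat, *_ = np.linalg.lstsq(design, X, rcond=None)

X_hat = design @ gamma_hat             # fitted values, passed on to the second stage
v_hat = X - X_hat                      # first-stage residuals

print("estimated (gamma_0, gamma_1, gamma_2):", gamma_hat.round(3))
```

The fitted values `X_hat` are a linear combination of the instrument and the control only, which is why they carry none of the variation in (X_i) that comes from the first-stage error.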

Interpreting the First Stage Regression

The first stage regression is evaluated primarily on the strength of the relationship between the instrumental variable(s) and the endogenous explanatory variable. A strong first stage is critical for the reliability of the overall instrumental variable estimation. Specifically, analysts look at the first-stage F-statistic, which tests the joint significance of the instrumental variables (the excluded instruments, not the exogenous controls). A commonly cited rule of thumb for a single endogenous regressor is that an F-statistic greater than 10 indicates sufficiently strong instruments, mitigating the problem of "weak instruments." With weak instruments, second-stage estimates are biased toward the ordinary least squares (OLS) estimate in finite samples and conventional standard errors become unreliable, so the IV estimates can perform worse than OLS.

Beyond the F-statistic, the coefficients on the instrumental variables in the first stage regression show how the instruments influence the endogenous variable. The sign and magnitude of these coefficients should align with theoretical expectations. The exogenous control variables matter as well: because they are held fixed in the first stage, instrument strength is judged by the variation the instruments explain beyond what the controls already account for.
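The F-statistic check can be sketched as a comparison of two first-stage fits, one with and one without the instruments. The helper names `ssr` and `first_stage_f` are hypothetical, the data are simulated, and the formula is the classical homoskedastic F-test; applied work often prefers heteroskedasticity-robust versions, which this sketch does not cover.

```python
import numpy as np

def ssr(design, y):
    """Sum of squared OLS residuals from regressing y on design."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def first_stage_f(X, instruments, controls):
    """Classical F-statistic for the joint significance of the instruments."""
    n = X.shape[0]
    instruments = np.asarray(instruments, dtype=float).reshape(n, -1)
    controls = np.asarray(controls, dtype=float).reshape(n, -1)
    const = np.ones((n, 1))
    restricted = np.column_stack([const, controls])                 # instruments left out
    unrestricted = np.column_stack([const, controls, instruments])  # full first stage
    q = instruments.shape[1]      # number of instruments being tested
    k = unrestricted.shape[1]     # parameters in the full first stage
    ssr_r, ssr_u = ssr(restricted, X), ssr(unrestricted, X)
    return ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))

# Simulated comparison; the coefficients 0.8 and 0.02 are illustrative assumptions.
rng = np.random.default_rng(1)
n = 1_000
W, Z, v = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
X_strong = 0.8 * W + 0.5 * Z + v      # instrument clearly moves X
X_weak = 0.02 * W + 0.5 * Z + v       # instrument barely moves X

f_strong = first_stage_f(X_strong, W, Z)
f_weak = first_stage_f(X_weak, W, Z)
print(f"strong instrument F = {f_strong:.1f}, weak instrument F = {f_weak:.1f}")
```

Only the instruments are dropped in the restricted model; the controls stay in both fits, so the statistic measures what the instruments add beyond the controls.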

Hypothetical Example

Imagine a financial researcher wants to determine the causal effect of a company's research and development (R&D) expenditure on its stock returns. A direct regression analysis of stock returns on R&D expenditure might suffer from endogeneity because highly profitable companies might invest more in R&D, and their profitability might also directly affect their stock returns, creating a correlation between R&D and the error term.

To address this, the researcher identifies an instrumental variable: the presence of local government tax incentives for R&D in the company's operating region. This variable is assumed to influence a company's R&D expenditure but not directly affect its stock returns, except through its impact on R&D.

First Stage Regression:

The researcher runs a first stage regression where the endogenous variable (R&D Expenditure) is regressed on the instrument (Local Government Tax Incentives) and any other exogenous control variables (e.g., industry growth, company size).

\text{R\&D Expenditure}_i = \gamma_0 + \gamma_1 (\text{Tax Incentives})_i + \gamma_2 (\text{Industry Growth})_i + \gamma_3 (\text{Company Size})_i + v_i

From this regression, the researcher obtains the predicted R&D expenditure for each company, (\widehat{\text{R\&D Expenditure}}_i). This predicted value captures the variation in R&D expenditure that is driven solely by the tax incentives and other exogenous factors, thus isolating the "clean" variation in R&D. These predicted values are then used in the second stage to estimate the causal effect on stock returns.
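To make the example concrete, the following sketch simulates data that loosely match the story and compares naive OLS with a manual two-step estimate. Everything numeric here is an assumption made for illustration: the true effect of 0.5, the strength of the tax incentive, and an unobserved profitability variable that drives both R&D and returns. None of it comes from real data.

```python
import numpy as np

def ols(design, y):
    """OLS coefficients from regressing y on design."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

rng = np.random.default_rng(42)
n = 20_000

# Hypothetical data-generating process (all values are assumptions).
tax_incentive = rng.binomial(1, 0.5, size=n).astype(float)   # instrument
industry_growth = rng.normal(size=n)                         # exogenous control
company_size = rng.normal(size=n)                            # exogenous control
profitability = rng.normal(size=n)                           # unobserved confounder

rd = (1.0 + 1.0 * tax_incentive + 0.3 * industry_growth
      + 0.2 * company_size + 0.8 * profitability + rng.normal(size=n))
true_effect = 0.5
returns = (0.1 + true_effect * rd + 0.4 * industry_growth
           + 0.1 * company_size + 1.0 * profitability + rng.normal(size=n))

const = np.ones(n)
controls = np.column_stack([industry_growth, company_size])

# Naive OLS of returns on R&D: contaminated by unobserved profitability.
beta_ols = ols(np.column_stack([const, rd, controls]), returns)[1]

# First stage: R&D on the instrument and the exogenous controls.
first_design = np.column_stack([const, tax_incentive, controls])
rd_hat = first_design @ ols(first_design, rd)

# Second stage: returns on predicted R&D and the same controls.
beta_2sls = ols(np.column_stack([const, rd_hat, controls]), returns)[1]

print(f"true effect = {true_effect}, OLS = {beta_ols:.3f}, 2SLS = {beta_2sls:.3f}")
```

Running the two stages by hand like this recovers the point estimate, but the standard errors reported by the second-stage regression are not valid because they ignore that the predicted values were themselves estimated. A dedicated IV routine is the usual choice for inference.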

Practical Applications

First stage regression, as a component of instrumental variable estimation, is widely applied in quantitative finance, econometrics, and various social sciences to address the fundamental challenge of establishing causality from observational data.

Financial Markets: Researchers might use instrumental variables to estimate the causal impact of corporate governance practices on firm performance, or the effect of monetary policy changes on asset prices, where simple ordinary least squares (OLS) would yield biased results due to endogeneity. For instance, in analyzing the effect of credit scores on loan default rates, instrumental variables can help estimate the true causal effect, accounting for other confounding factors.²
Economic Policy Analysis: Governments and researchers employ these techniques to evaluate the effectiveness of policy interventions, such as the impact of education policies on earnings, minimum wage laws on employment, or healthcare reforms on health outcomes.
Behavioral Finance: To isolate the causal effect of certain psychological biases on investment decisions, instrumental variables can be employed when direct observation of the bias is confounded by other factors.
Risk Management and Valuation: While not directly used in day-to-day calculations, the robust causal insights derived from studies employing first stage regression can inform better risk management strategies and more accurate valuation models by identifying true causal drivers rather than mere correlations.

Limitations and Criticisms

While first stage regression is a crucial step in addressing endogeneity, the overall instrumental variable (IV) approach has significant limitations and criticisms:

Weak Instruments: If the instrumental variables are only weakly correlated with the endogenous explanatory variable in the first stage, the IV estimator can be severely biased and have large standard errors, often performing worse than ordinary least squares (OLS).¹ This "weak instrument" problem can lead to unreliable conclusions even in large samples.
Validity of the Exclusion Restriction: The most critical assumption of IV estimation is the "exclusion restriction," which states that the instrumental variable affects the dependent variable only through its effect on the endogenous explanatory variable, and not directly. If this assumption is violated, the IV estimates will be inconsistent and biased. It is often challenging, if not impossible, to definitively test this assumption empirically, making it a point of frequent debate in academic research.
Overidentification and Underidentification: The number of instruments relative to the number of endogenous variables affects the identification of the model. If there are fewer instruments than endogenous variables, the model is "underidentified," and no unique causal effect can be estimated. If there are more instruments than endogenous variables ("overidentified"), tests of the overidentifying restrictions (such as the Sargan or Hansen test) can be performed, but these tests only assess whether the instruments are jointly valid; if the test rejects the null hypothesis, it does not show which specific instrument is problematic.
Interpretation with Heterogeneous Effects: When the causal effect of the endogenous variable varies across individuals, IV estimates identify a "local average treatment effect" (LATE) rather than a universal average effect. The LATE applies only to those individuals whose treatment status is influenced by the instrument, which may not be the population average. This complicates generalizability.
Data Requirements and Practicality: Finding truly valid and strong instrumental variables in real-world financial or economic data is exceptionally difficult. Often, researchers must rely on "natural experiments" or "quasi-experimental" settings, which are rare. The absence of suitable instruments can prevent the application of this powerful method. Concerns about multicollinearity can also arise if instruments are highly correlated with each other or with other exogenous variables.
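
The first two limitations can be seen together in a short derivation. It is a sketch under simplifying assumptions that are not part of the discussion above: a single instrument (W_i), and the control (Z_i) left out (or already partialled out) to keep the notation light. In that case the IV estimator of (\beta_1) is a ratio of two sample covariances:

\hat{\beta}_1^{IV} = \frac{\widehat{\operatorname{Cov}}(W_i, Y_i)}{\widehat{\operatorname{Cov}}(W_i, X_i)}

Substituting the structural equation for (Y_i) and letting the sample size grow gives:

\operatorname{plim} \hat{\beta}_1^{IV} = \beta_1 + \frac{\operatorname{Cov}(W_i, u_i)}{\operatorname{Cov}(W_i, X_i)}

The numerator of the second term is zero only when the instrument is uncorrelated with the structural error, which is what the exclusion restriction is meant to guarantee. The denominator is the first-stage relationship. When it is close to zero (a weak instrument), even a small correlation between (W_i) and (u_i) is magnified into a large inconsistency, and in finite samples the ratio becomes highly variable.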

First Stage Regression vs. Two-Stage Least Squares

First stage regression is a component of Two-Stage Least Squares (2SLS), not a distinct method that can be used independently to achieve the same objectives.

Feature	First Stage Regression	Two-Stage Least Squares (2SLS)
Purpose	Predicts the endogenous explanatory variable using instruments and exogenous covariates. It addresses correlation between the endogenous explanatory variable and the error term.	Estimates the causal effect of the endogenous variable on the dependent variable, providing consistent (though not unbiased in finite samples) estimates when OLS would fail.
Inputs	Endogenous explanatory variable (as dependent variable), instrumental variables, all exogenous variables.	Predicted values from the first stage regression (for the endogenous variable), and all original exogenous variables.
Output	Predicted values of the endogenous explanatory variable.	Consistent estimates of the structural parameters, representing causal effects.
Role in Analysis	An intermediate, necessary step.	The complete estimation method for addressing endogeneity with instrumental variables.
Relation to Error Term	Focuses on isolating variation in the endogenous variable uncorrelated with the original structural error term.	Utilizes the predicted variable from the first stage, which is uncorrelated with the structural error term provided the instruments are valid, to produce consistent estimates.

The first stage regression creates the essential input for the second stage, where the actual coefficients of interest are estimated. Without a properly executed first stage, the Two-Stage Least Squares procedure cannot correct for endogeneity.

FAQs

What problem does first stage regression help solve?

The first stage regression helps solve the problem of endogeneity in econometric models. Endogeneity occurs when an explanatory variable in a model is correlated with the error term, leading to biased and inconsistent estimates if ordinary least squares (OLS) is used. The first stage uses instrumental variables to create a version of the endogenous variable that is free from this problematic correlation.

Can first stage regression be used on its own?

No, first stage regression is not typically used on its own for final inference, although its F-statistic and coefficients are routinely inspected to judge instrument strength. It is an integral preliminary step within a larger estimation framework, most notably Two-Stage Least Squares. Its primary output, the predicted values of the endogenous variable, is fed into the second stage to obtain consistent estimates of the parameters of interest.

What makes an instrumental variable "good" for the first stage?

A "good" instrumental variable for the first stage regression must satisfy two key conditions: it must be strongly correlated with the endogenous explanatory variable (the relevance condition), and it must be uncorrelated with the error term of the main equation (the exogeneity condition, which includes the exclusion restriction). If relevance holds only weakly, the instruments are "weak instruments," leading to unreliable estimates. If the exogeneity condition is violated, the estimates will be inconsistent, undermining the entire causal inference effort.

Is first stage regression applicable in all statistical analyses?

No, first stage regression is specifically applicable in situations where there is concern about endogeneity in a regression model and suitable instrumental variables can be identified. It is not necessary if all explanatory variables are exogenous and the other assumptions of ordinary least squares (OLS) hold. Many analyses, particularly in fields where randomized experiments are common, do not require this advanced technique.

How does the first stage relate to identifying causality?

By isolating the portion of the endogenous variable's variation that is truly exogenous (i.e., driven by the instruments and uncorrelated with the error term of the main equation), the first stage regression enables the second stage to estimate a more accurate causal effect. This is because the predicted values generated by the first stage act as a proxy for the endogenous variable, eliminating the bias that would otherwise arise from the correlation between the endogenous variable and unobserved factors. This process is fundamental to strengthening causal inference in observational studies.