Inverse Mills Ratio

What Is the Inverse Mills Ratio?

The Inverse Mills Ratio (IMR) is a statistical quantity used in econometrics to correct for selection bias in regression analysis. It applies when the outcome variable is observed only for units that pass a prior selection process, so that the available sample is non-random. Including the IMR as a regressor brings the selection process into the outcome model and, when the model's assumptions hold, yields consistent parameter estimates. This tool is fundamental within the broader field of statistical inference when dealing with non-randomly selected data.

History and Origin

The Inverse Mills Ratio gained prominence through the pioneering work of Nobel laureate James Heckman in the mid-1970s. Heckman, an American economist, developed a two-stage estimation procedure, now widely known as the Heckman correction or Heckit method, to address issues of self-selection bias in econometric models. His contributions to analyzing selective samples and evaluating social programs earned him the Nobel Memorial Prize in Economic Sciences in 2000, which he shared with Daniel McFadden. The methodology developed by Heckman, utilizing the Inverse Mills Ratio, became a standard tool for economists and other social scientists grappling with data where observations are not randomly selected. Prior to Heckman's work, researchers often encountered biased estimates when analyzing censored or truncated data, as the selection into the observed sample was systematically related to the unobserved error terms.

Key Takeaways

The Inverse Mills Ratio (IMR) is an econometric tool used to correct for selection bias.
It is typically applied in a two-stage estimation process, most notably the Heckman correction.
The IMR is derived from the probit model of the selection equation.
Its inclusion as an additional regressor in the outcome equation yields consistent parameter estimates when the model's assumptions hold.
A statistically significant coefficient on the IMR is evidence of selection bias in the model.

Formula and Calculation

The Inverse Mills Ratio, denoted as (\lambda), is calculated as the ratio of the probability density function (PDF) to the cumulative distribution function (CDF) of the standard normal distribution, evaluated at the fitted linear index from the first-stage selection model (the index itself, not the predicted probability).

For a standard normal distribution, where (\phi(\cdot)) is the PDF and (\Phi(\cdot)) is the CDF:

\lambda(Z\gamma) = \frac{\phi(Z\gamma)}{\Phi(Z\gamma)}

Here:

(\phi(Z\gamma)) represents the value of the standard normal PDF at (Z\gamma).
(\Phi(Z\gamma)) represents the value of the standard normal CDF at (Z\gamma).
(Z) is a vector of explanatory variables influencing the selection decision in the first stage.
(\gamma) is a vector of estimated coefficients from the probit model of the selection equation.

The index (Z\gamma) comes from the first stage of the Heckman correction, which typically involves a probit regression modeling the probability that an observation is selected into the sample. The Inverse Mills Ratio is then computed for each observation from its fitted probit index.
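The formula can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy and SciPy are available; the function name `inverse_mills_ratio` and the sample index values are hypothetical and stand in for fitted values of (Z\gamma) from a first-stage probit.

```python
import numpy as np
from scipy.stats import norm


def inverse_mills_ratio(index):
    """Return phi(index) / Phi(index) for a scalar or array of probit indices."""
    index = np.asarray(index, dtype=float)
    # Work in logs so that very negative indices do not produce 0 / 0.
    return np.exp(norm.logpdf(index) - norm.logcdf(index))


# Hypothetical fitted indices Z*gamma for five observations.
z_gamma = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(inverse_mills_ratio(z_gamma))
```

The ratio falls as the index rises: observations that are very likely to be selected receive an adjustment close to zero, while observations that were unlikely to be selected receive a large one.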

Interpreting the Inverse Mills Ratio

When the Inverse Mills Ratio is included as an additional explanatory variable in the main regression equation (the second stage of the Heckman correction), its coefficient provides insights into the presence and nature of endogeneity caused by selection bias. A statistically significant coefficient for the Inverse Mills Ratio suggests that selection bias is indeed present and that simply running an ordinary least squares (OLS) regression on the observed sample would yield biased estimates.

The sign and magnitude of the IMR's coefficient indicate the direction and strength of the selection effect. For instance, in a study analyzing the wages of working individuals, a positive and significant coefficient on the Inverse Mills Ratio suggests that unobserved factors that make individuals more likely to work also tend to increase their wages. Conversely, a negative sign implies that unobserved factors that make individuals more likely to work tend to lower their wages, so those less likely to be observed in the sample (e.g., non-workers) would, given the same observed characteristics, be expected to earn more if they were to participate. While the coefficient's significance is key for detecting bias, its direct interpretation can be complex and often requires careful consideration of the underlying theoretical model and the context of the data analysis.
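The link between the coefficient's sign and the direction of selection can be written out under the usual bivariate-normal setup. The following is a sketch, not a full derivation; the symbols (X), (\beta), (\varepsilon), (u), (\rho) and (\sigma_\varepsilon) are notation assumed here for illustration.

```latex
\begin{aligned}
y &= X\beta + \varepsilon, \qquad \text{observed only if } Z\gamma + u > 0, \\
(u, \varepsilon) &\sim \mathcal{N}\!\left(0,\
  \begin{pmatrix} 1 & \rho\,\sigma_\varepsilon \\ \rho\,\sigma_\varepsilon & \sigma_\varepsilon^{2} \end{pmatrix}\right), \\
E[\,y \mid X, Z,\ \text{selected}\,] &= X\beta + \rho\,\sigma_\varepsilon\,\lambda(Z\gamma).
\end{aligned}
```

Under these assumptions the coefficient on the Inverse Mills Ratio is (\rho\,\sigma_\varepsilon). Because (\sigma_\varepsilon) and (\lambda) are both positive, its sign is the sign of (\rho), the correlation between the unobserved drivers of selection and of the outcome. When (\rho = 0) the extra term vanishes and OLS on the selected sample is not affected by selection.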

Hypothetical Example

Imagine a researcher wants to study the factors affecting the investment returns of venture capital firms. They have data on many firms, but only observe the actual returns for firms that successfully secured funding and completed an exit (e.g., IPO or acquisition). Firms that failed to secure funding or did not exit are not in the "observed returns" dataset. This creates a potential selection bias: the observed returns might be systematically higher because only successful firms are included.

To address this, the researcher can use the Inverse Mills Ratio in a two-stage approach:

Stage 1 (Selection Model): A probit model is run to predict the probability of a venture capital firm successfully securing funding and completing an exit. The independent variables in this stage might include factors like the firm's age, the industry sector of its investments, the experience of its partners, and market conditions.
Stage 2 (Outcome Model): The Inverse Mills Ratio is calculated for each firm based on the results of the first-stage probit model. This calculated Inverse Mills Ratio is then included as an additional independent variable in an OLS regression where the dependent variable is the observed investment return. Other independent variables in this second stage might include the amount of capital raised, the number of portfolio companies, or the investment horizon.

If the coefficient on the Inverse Mills Ratio in the second stage is statistically significant, this is evidence of selection bias: factors influencing a firm's ability to successfully exit (and thus have observed returns) are also correlated with its actual investment performance. By including the IMR, the researcher obtains estimates of the impact of factors like capital raised or portfolio size on investment returns that are consistent under the model's assumptions, accounting for the fact that only successful firms are observed.
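The two stages can be sketched on simulated data. This is an illustration under stated assumptions rather than an analysis of real venture capital returns: the variables `x` and `z_excl`, the coefficient values, and the helper functions are all hypothetical, the probit is fitted by a hand-written maximum-likelihood routine using SciPy, and `z_excl` plays the role of a variable that affects selection but not the outcome.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def fit_probit(Z, selected):
    """Maximum-likelihood probit coefficients for a 0/1 selection indicator."""
    sign = 2.0 * selected - 1.0  # +1 if selected, -1 otherwise

    def neg_loglik(gamma):
        # Mean negative log-likelihood: log Phi(index) or log Phi(-index).
        return -norm.logcdf(sign * (Z @ gamma)).mean()

    return minimize(neg_loglik, np.zeros(Z.shape[1]), method="BFGS").x


def inverse_mills(index):
    return np.exp(norm.logpdf(index) - norm.logcdf(index))


def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]


# Simulated data (all values hypothetical).
rng = np.random.default_rng(0)
n, rho = 20_000, 0.7
x = rng.normal(size=n)        # regressor in both equations
z_excl = rng.normal(size=n)   # affects selection only
u = rng.normal(size=n)        # selection error
e = rho * u + np.sqrt(1 - rho**2) * rng.normal(size=n)  # outcome error, corr(u, e) = rho

selected = (0.5 + x + z_excl + u > 0).astype(float)
y = 1.0 + 2.0 * x + e         # true slope on x is 2.0; y is used only where selected

# Stage 1: probit for selection, then the IMR at each fitted index.
Z = np.column_stack([np.ones(n), x, z_excl])
gamma_hat = fit_probit(Z, selected)
lam = inverse_mills(Z @ gamma_hat)

# Stage 2: OLS on the selected sample, without and with the IMR.
obs = selected == 1
X_naive = np.column_stack([np.ones(obs.sum()), x[obs]])
X_heckit = np.column_stack([X_naive, lam[obs]])
naive = ols(X_naive, y[obs])
heckit = ols(X_heckit, y[obs])

print("naive slope on x:    ", naive[1])
print("corrected slope on x:", heckit[1])
print("coefficient on IMR:  ", heckit[2])
```

In this setup the slope from the uncorrected regression should drift away from the true value of 2.0, while the regression that includes the IMR should recover it more closely. The ordinary OLS standard errors from the second stage are not valid as they stand, because the IMR is itself estimated; the sketch reports point estimates only.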

Practical Applications

The Inverse Mills Ratio is a crucial tool in various fields where censored data or selection bias are prevalent. In labor economics, it is widely used to study the determinants of wages, labor force participation, and employment duration, accounting for the fact that wages are only observed for those who choose to work. For example, researchers might use it to estimate the true impact of education on earnings, correcting for the bias that individuals with higher unobserved abilities might be more likely to pursue higher education and also earn more.

In finance, the Inverse Mills Ratio can be applied in areas such as credit risk modeling, where analysts might only observe default outcomes for loans that were granted, or in analyzing investment performance, where returns are only observed for assets that were actually traded. It is also relevant in policy evaluation, allowing researchers to assess the effectiveness of programs when participation is voluntary and thus self-selected. For instance, a study evaluating the impact of a government training program on employment outcomes would use the IMR to correct for the fact that individuals who choose to participate in the program may inherently differ from those who do not. The Federal Reserve Bank of San Francisco, for example, publishes research that employs advanced econometric techniques, which could implicitly or explicitly involve corrections for sample selection, particularly in studies related to financial behavior and monetary policy.

Limitations and Criticisms

While powerful, the Inverse Mills Ratio and the Heckman correction method are not without limitations. A primary concern is the strong distributional assumption that the error terms in both the selection and outcome equations follow a normal distribution. If this assumption is violated, the estimates obtained using the IMR can still be biased or inconsistent.

Another challenge lies in the choice of instruments for the selection equation. For the Heckman correction to work effectively, there must be at least one variable in the selection equation that influences the probability of selection but does not directly affect the outcome variable in the main regression. Finding such a valid and theoretically justifiable exclusion restriction can be difficult in practice, and a weak or invalid instrument can lead to unreliable results. Furthermore, multicollinearity can arise if the Inverse Mills Ratio is highly correlated with other independent variables in the outcome equation, making it difficult to precisely estimate the coefficients and potentially leading to inflated standard errors in hypothesis testing. Researchers must carefully consider these econometric issues when applying the Inverse Mills Ratio in their financial modeling and analysis.
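The multicollinearity concern can be illustrated numerically. Over a moderate range of the probit index, the Inverse Mills Ratio is close to a linear function of that index, so if the selection equation contains only variables that also appear in the outcome equation, the IMR adds little independent variation. The sketch below assumes a hypothetical index range of -1 to 2 and simply measures the correlation.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical range of first-stage probit indices Z*gamma.
index = np.linspace(-1.0, 2.0, 301)
imr = norm.pdf(index) / norm.cdf(index)

# Pearson correlation between the index and its Inverse Mills Ratio.
corr = np.corrcoef(index, imr)[0, 1]
print(f"correlation(index, IMR) = {corr:.3f}")
```

A correlation close to -1 means that the IMR behaves almost like a linear combination of the regressors that make up the index, which is why a credible exclusion restriction matters for precise second-stage estimates.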

Inverse Mills Ratio vs. Hazard Rate

The Inverse Mills Ratio and the hazard rate are related but distinct concepts, primarily differing in their application and the type of event they model.

Feature	Inverse Mills Ratio (IMR)	Hazard Rate
Primary Use	Correcting for selection bias in regression models.	Modeling the instantaneous rate of an event occurring at a given time, given that it has not occurred previously.
Context	Used in models where the outcome variable is observed only if a prior selection event occurs (e.g., Heckman correction).	Used in survival analysis or duration models (e.g., time until default, duration of unemployment).
Mathematical Form	Ratio of PDF to CDF of the standard normal distribution: (\phi(Z\gamma) / \Phi(Z\gamma)) (for observed outcomes).	Conditional density: (f(t) / (1 - F(t))), where (f(t)) is the PDF and (F(t)) is the CDF of the event time.
Interpretation	Proportional to the expected value of the error term in the outcome equation, conditional on selection.	Represents the risk of an event occurring per unit of time, given survival up to that time.

The confusion often arises because the Inverse Mills Ratio is sometimes referred to as a "non-selection hazard" in certain contexts, particularly when the selection process is viewed as a "survival" until the outcome is observed. However, the core application of the Inverse Mills Ratio remains focused on addressing sample selection bias in a regression framework, whereas the hazard rate is fundamental to understanding the dynamics of events over time in stochastic processes.
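The formal connection between the two can be stated in one line for the standard normal distribution. In the sketch below, (h(\cdot)) is notation assumed here for the hazard function of a standard normal variable; the identity relies only on the symmetry of the normal density.

```latex
\begin{aligned}
h(z) &= \frac{\phi(z)}{1 - \Phi(z)}, \\
\lambda(z) &= \frac{\phi(z)}{\Phi(z)}
            = \frac{\phi(-z)}{1 - \Phi(-z)}
            = h(-z).
\end{aligned}
```

The second line uses (\phi(z) = \phi(-z)) and (\Phi(z) = 1 - \Phi(-z)). The Inverse Mills Ratio used for selected observations is therefore the standard normal hazard evaluated at the negated index, which explains the shared terminology even though the two quantities are used for different purposes.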

FAQs

What is selection bias and how does the Inverse Mills Ratio help?

Selection bias occurs when the sample of observations available for analysis is not random, meaning that the factors influencing selection into the sample are also related to the outcome being studied. The Inverse Mills Ratio helps by providing a statistical adjustment term that accounts for this non-randomness, allowing researchers to obtain more accurate estimates of the relationships between variables. It does not model the selection probability itself; it is computed from a first-stage model of the probability that an observation is selected.

Can the Inverse Mills Ratio be used with any regression model?

While the Inverse Mills Ratio is commonly associated with Ordinary Least Squares (OLS) regression in the second stage of the Heckman correction, its derivation requires the first stage to be a probit model. This is because the probit model assumes a normally distributed error term, which is necessary for the mathematical properties of the Inverse Mills Ratio. Other models for the first stage, like a logit, would not yield a directly interpretable Inverse Mills Ratio in the same way.

What does a significant coefficient on the Inverse Mills Ratio imply?

A statistically significant coefficient on the Inverse Mills Ratio in your regression model indicates that sample selection bias is present. This means that the decision to be included in the observed sample is not random with respect to the outcome variable. If the coefficient is significant, then simply running an OLS regression on the selected sample would lead to biased and inconsistent estimates of the true effects of your independent variables.