Missing not at random

What Is Missing Not At Random (MNAR)?

Missing Not At Random (MNAR) describes a situation in statistical analysis and econometrics where the probability that a data point is missing depends on the unobserved value of that data point itself, even after accounting for the data that were observed. This falls under the broader category of data quality and missing data mechanisms in quantitative research. Unlike other types of missing data, MNAR is particularly challenging because the missingness introduces a systematic bias into the analysis, making it difficult to draw accurate conclusions about the underlying population.

When data are MNAR, the absence of a value provides information about what that value would have been. For instance, if individuals with very low incomes are less likely to report their income, then the missing income data is MNAR. Analyzing only the reported incomes would lead to an overestimation of the average income. Identifying and addressing missing not at random data is crucial for ensuring the validity of statistical inferences and the reliability of research findings.
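The income example can be illustrated with a small simulation. This is only a sketch under stated assumptions: the log-normal income distribution, the 25,000 cutoff, and the two response probabilities are hypothetical values chosen for illustration, not figures from any survey.

```python
import random
import statistics

random.seed(0)

# Hypothetical population: log-normally distributed annual incomes.
incomes = [random.lognormvariate(10.5, 0.6) for _ in range(100_000)]


def is_reported(income, cutoff=25_000, p_low=0.4, p_high=0.9):
    """Assumed MNAR response rule: incomes below `cutoff` are reported
    with probability p_low, all other incomes with probability p_high."""
    p = p_low if income < cutoff else p_high
    return random.random() < p


# Keep only the incomes that respondents chose to report.
reported = [x for x in incomes if is_reported(x)]

true_mean = statistics.fmean(incomes)
observed_mean = statistics.fmean(reported)

print(f"true mean:     {true_mean:,.0f}")
print(f"observed mean: {observed_mean:,.0f}")
```

Because low incomes are under-represented among the reported values, the observed mean in this sketch sits above the true mean, which is the overestimation described above.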

History and Origin

The classification of missing data mechanisms, including missing not at random, was notably formalized by Donald Rubin in his 1976 work, which laid the foundation for modern approaches to handling incomplete data. This framework categorized missing data into three types: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). Rubin's work provided a theoretical basis for understanding the implications of different missingness patterns on statistical inference.

A significant advancement in addressing MNAR data came with the work of James Heckman, who was awarded the Nobel Memorial Prize in Economic Sciences in 2000. Heckman's research focused on developing theory and methods for analyzing selective samples, particularly addressing the problem of selection bias that arises when observations are missing not at random. His "Heckman correction" or "Heckit method" provided an econometric approach to correct for this bias, particularly in labor economics studies where self-selection into the workforce often leads to MNAR data¹⁶, ¹⁷. This methodological contribution has had profound implications for applied research in economics and other social sciences¹⁵.

Key Takeaways

- Definition: Missing Not At Random (MNAR) occurs when the reason data is missing is directly related to the unobserved value of the missing data itself.
- Bias: MNAR introduces systematic bias, making accurate statistical inference challenging.
- Heckman Correction: James Heckman's work, including the "Heckman correction," provides methods to address selection bias in MNAR scenarios.
- Complexity: MNAR is the most complex type of missing data to handle, requiring advanced statistical techniques.
- Impact: Failure to account for MNAR can lead to flawed conclusions in research and policy analysis.

Formula and Calculation

There is no single universal "formula" for missing not at random data. Under Missing Completely at Random (MCAR) or Missing at Random (MAR), the missingness mechanism can often be left unmodeled; addressing MNAR, by contrast, typically involves modeling that mechanism explicitly, which requires more complex statistical models and assumptions that cannot be verified from the observed data alone.

One of the most well-known approaches for handling MNAR data in econometric analysis is the Heckman selection model. This model attempts to correct for selection bias by explicitly modeling the process that leads to missing observations. The general idea involves a two-equation system:

Selection Equation: This equation models the probability of an observation being observed (i.e., not missing).
$P(Y \text{ is observed} \mid Z) = P(Z\alpha + u > 0) = \Phi(Z\alpha)$
Where:
- $Y$ is the dependent variable.
- $Z$ is a vector of variables that influence the probability of observation. It usually contains the variables in $X$ plus at least one variable (an exclusion restriction, playing a role similar to an instrumental variable) that affects selection but does not directly influence the outcome variable $Y$.
- $\alpha$ is a vector of coefficients.
- $u$ is an error term, assumed to be standard normal.
- $\Phi$ is the standard normal cumulative distribution function, so the selection equation is a probit model.
Outcome Equation: This equation models the relationship between the independent variables and the outcome variable, conditional on the outcome being observed, and includes a term to correct for selection bias.
$E[Y \mid X, Z, Y \text{ is observed}] = X\beta + \rho\sigma_\epsilon\lambda(Z\alpha)$
Where:
- $Y$ is the outcome variable of interest.
- $X$ is a vector of independent variables.
- $\beta$ is a vector of coefficients.
- $\lambda(Z\alpha) = \phi(Z\alpha)/\Phi(Z\alpha)$ (lambda) is the Inverse Mills Ratio, derived from the selection equation, which accounts for the selection bias; $\phi$ is the standard normal density.
- $\sigma_\epsilon$ is the standard deviation of the error term from the outcome equation (the selection-equation error $u$ is normalized to unit variance).
- $\rho$ (rho) is the correlation between the error terms of the selection and outcome equations.
- $\epsilon$ is the error term of the underlying outcome equation $Y = X\beta + \epsilon$.

The Inverse Mills Ratio acts as a correction term, adjusting the outcome equation for the non-random selection of the observed sample. The successful application of this model relies heavily on the correct specification of the selection equation and the existence of valid instrumental variables.
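The role of the Inverse Mills Ratio as a correction term can be checked numerically. The sketch below assumes jointly normal errors and the usual normalization in which the selection error has unit variance; the values of `rho`, `sigma`, and `index` are hypothetical. Under those assumptions, the average outcome error among selected observations should be close to the error correlation times the outcome-error standard deviation times the Inverse Mills Ratio.

```python
import math
import random

random.seed(1)


def inverse_mills_ratio(c):
    """lambda(c) = phi(c) / Phi(c) for the standard normal distribution."""
    pdf = math.exp(-0.5 * c * c) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))
    return pdf / cdf


rho, sigma = 0.6, 2.0   # assumed error correlation and outcome-error SD
index = 0.3             # assumed value of the selection index Z*alpha

selected_errors = []
for _ in range(200_000):
    u = random.gauss(0.0, 1.0)   # selection-equation error
    w = random.gauss(0.0, 1.0)   # independent noise
    # Outcome error built so that corr(u, eps) = rho and sd(eps) = sigma.
    eps = sigma * (rho * u + math.sqrt(1.0 - rho**2) * w)
    if index + u > 0:            # the observation is selected (observed)
        selected_errors.append(eps)

simulated = sum(selected_errors) / len(selected_errors)
theoretical = rho * sigma * inverse_mills_ratio(index)

print(f"simulated mean error among selected: {simulated:.3f}")
print(f"rho * sigma * lambda(index):          {theoretical:.3f}")
```

The two printed numbers should agree up to simulation noise. A nonzero value means that a regression on the selected sample alone omits a term that varies with the selection index, which is what the correction term restores.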

Interpreting Missing Not At Random

Interpreting missing not at random data means understanding that the absence of a data point carries inherent information about its true value. This is a critical distinction from missing completely at random (MCAR), where missingness is purely by chance, or missing at random (MAR), where missingness can be explained by other observed data. When dealing with MNAR, simply ignoring the missing data or using simple imputation methods like mean imputation will lead to biased results and incorrect conclusions.

For example, if a survey on personal wealth has many missing responses for individuals at the very high or very low ends of the wealth spectrum (perhaps due to privacy concerns or lack of engagement), then the missing data is MNAR. Analyzing only the available data would likely misrepresent the true distribution of wealth in the population. Proper interpretation requires acknowledging that the observed data is not a random subset of the full data. Analysts must consider the underlying reasons for missingness and employ advanced statistical models that explicitly account for the selection process. This often involves making untestable assumptions about the missingness mechanism, which can introduce uncertainty into the results.

Hypothetical Example

Consider a hedge fund that is conducting a performance analysis of its quantitative trading strategies. The fund has data on the daily returns of 100 strategies over the past year. However, for some strategies, daily return data is missing on days when the strategy performed exceptionally poorly (e.g., a loss exceeding 5%). The fund's data collection system may automatically filter out or fail to record extreme negative returns to avoid flagging internal alerts unnecessarily, or perhaps the data feed itself is disrupted during periods of high volatility that disproportionately affect underperforming strategies.

In this scenario, the missing data is Missing Not At Random (MNAR). The reason for the missingness (poor performance) is directly related to the unobserved value (the actual large negative return). If an analyst were to simply ignore these missing data points or impute them with the average return, the average daily returns for these strategies would be overstated and risk metrics such as volatility would be understated, making the strategies appear less risky and more profitable than they truly are.
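A short simulation makes the direction of these distortions concrete. It is a sketch with assumed numbers: the return distribution (mean 0.05%, volatility 3% per day) and the number of simulated days are hypothetical; only the 5% loss threshold comes from the example.

```python
import random
import statistics

random.seed(2)

# Hypothetical daily returns for one strategy (assumed distribution).
returns = [random.gauss(0.0005, 0.03) for _ in range(50_000)]

# MNAR rule from the example: days with a loss exceeding 5% go unrecorded.
recorded = [r for r in returns if r >= -0.05]

# Mean imputation: fill every gap with the mean of the recorded days.
fill = statistics.fmean(recorded)
imputed = [r if r >= -0.05 else fill for r in returns]

for label, series in [
    ("full data", returns),
    ("recorded only", recorded),
    ("mean-imputed", imputed),
]:
    mean = statistics.fmean(series)
    vol = statistics.pstdev(series)
    print(f"{label:14s} mean={mean:+.5f}  volatility={vol:.5f}")
```

In this sketch both deletion and mean imputation report a higher average return and a lower volatility than the full data, and mean imputation shrinks volatility further because every filled-in day sits exactly at the average.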

To address this, the fund would need to employ more sophisticated techniques than simple deletion methods. They might try to model the probability of a return being missing based on other market indicators on that day, or use advanced imputation methods that account for the non-random nature of the missingness. Without properly addressing this MNAR, the hedge fund's assessment of its strategies' true performance and risk exposure would be flawed, potentially leading to incorrect decisions regarding capital allocation or strategy adjustments.

Practical Applications

Missing not at random (MNAR) data is a prevalent challenge across various financial and economic applications, requiring careful consideration and sophisticated handling techniques.

Economic Surveys and Official Statistics: In economic surveys, MNAR can arise when individuals with extremely high or low incomes are less likely to report their precise earnings, leading to skewed perceptions of income distribution or wealth inequality. Government agencies and international bodies like the International Monetary Fund (IMF) and the Financial Stability Board (FSB) are actively involved in initiatives like the G20 Data Gaps Initiative to improve the quality and availability of financial statistics, in part to mitigate data challenges of this kind that can affect financial stability and policy decisions¹², ¹³, ¹⁴. The Federal Reserve Bank of Cleveland, for instance, frequently hosts conferences addressing issues related to financial stability and data quality in their research⁹, ¹⁰, ¹¹.
Credit Risk Modeling: When assessing credit risk, data on loan defaults might be MNAR if borrowers who are on the verge of defaulting are less likely to respond to communication attempts or provide updated financial information. This missingness, directly tied to their deteriorating financial health, could lead models to underestimate actual default rates.
Market Microstructure Analysis: In high-frequency trading or market microstructure analysis, order book data might be MNAR if certain types of orders are more likely to be canceled or unrecorded during periods of extreme market volatility or when specific trading strategies are active. This could bias the perceived liquidity or price discovery mechanisms.
Environmental, Social, and Governance (ESG) Data: ESG data collection often faces MNAR issues. Companies with poor environmental performance, for instance, might be less transparent or fail to report certain emissions data, making the absence of data indicative of worse performance. This poses a challenge for sustainable investing and ESG integration.
Behavioral Finance Research: Studies in behavioral finance that rely on self-reported data (e.g., risk tolerance surveys) can encounter MNAR if individuals with extreme biases or specific psychological traits are less likely to complete certain survey sections.

In all these applications, acknowledging the presence of missing not at random data is the first step. Subsequent steps involve employing advanced statistical or econometric methods to model the missingness mechanism or to conduct sensitivity analyses to understand the potential impact of MNAR on conclusions.

Limitations and Criticisms

Despite advancements in statistical methods, dealing with missing not at random (MNAR) data presents significant limitations and criticisms. The primary challenge stems from the fact that the mechanism causing the missingness is unobservable and dependent on the missing values themselves⁷, ⁸. This means that, unlike Missing At Random (MAR) where missingness can be modeled based on observed variables, handling MNAR often requires making strong, untestable assumptions about the relationship between the missingness and the actual values that are absent⁶.

One major criticism is that the choice of model for the MNAR mechanism can heavily influence the results, and there is no definitive way to verify if the chosen model accurately reflects reality. Different assumptions about the missingness process can lead to substantially different conclusions, undermining the robustness of the analysis. For example, if a researcher assumes that low-income individuals are less likely to report their income and models this relationship, the results will depend on the specific mathematical form of that assumed relationship. If the true relationship is different, the correction will be inaccurate, perpetuating the bias.

Furthermore, implementing sophisticated MNAR models, such as selection models or pattern-mixture models, can be computationally intensive and complex, requiring specialized statistical expertise. The reliance on instrumental variables in some approaches, like the Heckman correction, also poses a challenge: finding truly valid instrumental variables that influence the probability of missingness but not the outcome directly is often difficult in real-world scenarios. If the chosen instruments are weak or invalid, the correction for MNAR can actually introduce new biases rather than eliminate existing ones. This underscores the need for careful consideration and sensitivity analyses when dealing with missing not at random data.
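One simple form of sensitivity analysis is a delta adjustment, a pattern-mixture-style device: assume the missing values differ from the observed mean by a shift `delta`, then report how the estimate moves as `delta` varies. The sketch below is illustrative only; the function name, the survey figures, and the grid of shifts are all hypothetical.

```python
import statistics


def delta_adjusted_mean(observed, n_missing, delta):
    """Full-sample mean if every missing value equals the observed mean
    shifted by `delta`. A delta of 0 reproduces plain mean imputation."""
    obs_mean = statistics.fmean(observed)
    n_obs = len(observed)
    total = n_obs * obs_mean + n_missing * (obs_mean + delta)
    return total / (n_obs + n_missing)


# Hypothetical survey: 8 reported incomes (in thousands), 4 non-responses.
reported = [42.0, 55.0, 38.0, 61.0, 47.0, 52.0, 70.0, 35.0]
n_missing = 4

# The shifts are assumptions about the non-respondents, not estimates.
for delta in (0.0, -5.0, -10.0, -20.0):
    estimate = delta_adjusted_mean(reported, n_missing, delta)
    print(f"delta={delta:6.1f}  estimated mean={estimate:.2f}")
```

With these numbers the estimate falls from 50.00 at a shift of 0 to 43.33 at a shift of -20. The data cannot say which shift is right, so the spread across plausible values of `delta` is the quantity worth reporting.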

Missing Not At Random vs. Missing At Random

The distinction between Missing Not At Random (MNAR) and Missing At Random (MAR) is crucial in statistical analysis, particularly within quantitative analysis and data science. Both describe scenarios where data is incomplete, but the implications for analysis and the methods required to address them differ significantly.

Feature	Missing Not At Random (MNAR)	Missing At Random (MAR)
Definition	The probability of data being missing depends on the unobserved value of the missing data itself.	The probability of data being missing depends on observed data, but not on the unobserved (missing) data.⁴, ⁵
Source of Bias	Introduces systematic bias because the missingness provides information about the true (unobserved) value.	Does not introduce systematic bias if accounted for using appropriate methods, as the missingness can be predicted from observed data.³
Example	People with very low or very high salaries are less likely to report their income.	A survey respondent might skip questions about income if they are male and unemployed, but not because of the income value itself.²
Handling Approach	Requires modeling the missingness mechanism, often involves untestable assumptions. More complex methods like selection models (e.g., Heckman correction) or sensitivity analysis are necessary.	Can be handled with sophisticated imputation methods (e.g., multiple imputation, expectation-maximization) that leverage observed data.¹
Complexity	Most challenging type of missing data to address effectively.	More manageable than MNAR, but still requires careful statistical consideration beyond simple deletion.

In essence, with MAR, if you know the values of other observed variables, you can predict why a data point might be missing. For example, older participants might be less likely to answer certain health questions, but their age is observed. With MNAR, the reason for missingness is tied directly to what you don't see. Because of this fundamental difference, MNAR presents a much greater challenge for statistical inference and can lead to more significant biases if not properly addressed.
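The practical difference can be sketched with the health-question example. The simulation below is hypothetical throughout (the score distributions, age split, and missingness probabilities are assumptions). It compares a complete-case mean with an age-stratified mean, a simple adjustment that uses only the observed age group.

```python
import random
import statistics

random.seed(3)


def simulate(mechanism, n=100_000):
    """Return (true mean, complete-case mean, age-stratified mean)."""
    rows = []
    for _ in range(n):
        older = random.random() < 0.5                        # observed covariate
        score = random.gauss(60.0 if older else 75.0, 10.0)  # variable with gaps
        if mechanism == "MAR":
            p_missing = 0.6 if older else 0.1       # depends on observed age only
        else:
            p_missing = 0.6 if score < 65 else 0.1  # depends on the score itself
        rows.append((older, score, random.random() >= p_missing))

    true_mean = statistics.fmean(s for _, s, _ in rows)
    complete_case = statistics.fmean(s for _, s, seen in rows if seen)

    # Average observed scores within each age group, then weight each
    # group by its share of the full sample.
    stratified = 0.0
    for group in (True, False):
        share = sum(1 for o, _, _ in rows if o == group) / n
        group_mean = statistics.fmean(
            s for o, s, seen in rows if o == group and seen
        )
        stratified += share * group_mean
    return true_mean, complete_case, stratified


results = {m: simulate(m) for m in ("MAR", "MNAR")}
for mechanism, (truth, cc, strat) in results.items():
    print(f"{mechanism:5s} true={truth:.2f}  complete-case={cc:.2f}  stratified={strat:.2f}")
```

Under the MAR rule the stratified mean lands close to the truth even though the complete-case mean does not. Under the MNAR rule both remain biased, because no observed variable captures why the scores are missing.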

FAQs

What are the three types of missing data?

The three main types of missing data are Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). MCAR means the missingness is entirely random, like data lost due to a system glitch. MAR means the missingness can be explained by other observed data, such as older people being less likely to complete a certain survey section. MNAR, the most complex, means the missingness depends on the value of the missing data itself, like individuals with very low incomes declining to report their income.

Why is Missing Not At Random (MNAR) difficult to handle?

Missing Not At Random (MNAR) is particularly difficult because the reason data is missing is directly tied to its unobserved value. This means the missingness itself provides information about the data, introducing a systematic bias. Unlike other types, you cannot simply ignore MNAR or use basic imputation without distorting your analysis and drawing incorrect conclusions. It often requires making strong, untestable assumptions about why the data is missing.

Can MNAR data be ignored or simply deleted?

No, MNAR data generally cannot be ignored or simply deleted without severe consequences for the validity of the analysis. Listwise deletion (removing entire cases with any missing data) or pairwise deletion (using, for each calculation, every case that has the variables needed for that calculation) when data is MNAR will lead to biased results, as the remaining observed data will not be representative of the full population. Addressing MNAR requires more advanced statistical techniques that model the missingness process.

What is the "Heckman correction"?

The "Heckman correction" is a statistical method developed by Nobel laureate James Heckman to address selection bias, which often arises when data are Missing Not At Random (MNAR). It involves a two-step econometric procedure. First, a selection equation models the probability of an observation being included in the sample. Second, the outcome equation, which is the primary relationship of interest, includes a correction term (the Inverse Mills Ratio) derived from the first step to account for the non-random selection of the observed data. This yields consistent estimates when MNAR is present, provided the selection model is correctly specified and its distributional assumptions hold.
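The two steps can be sketched on simulated data. This is a minimal illustration, not a production estimator: it assumes NumPy is available, generates data that satisfy the model's assumptions (jointly normal errors and a valid exclusion restriction `z`), fits the probit with a hand-written Fisher scoring loop, and uses hypothetical coefficient values throughout.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
rho, sigma = 0.8, 1.0            # assumed error correlation and outcome-error SD

x = rng.normal(size=n)           # regressor in both equations
z = rng.normal(size=n)           # exclusion restriction: affects selection only
u = rng.normal(size=n)           # selection-equation error
eps = sigma * (rho * u + math.sqrt(1 - rho**2) * rng.normal(size=n))

selected = (0.5 + x + z + u) > 0   # selection equation
y = 1.0 + 2.0 * x + eps            # outcome equation, true slope = 2

erf = np.vectorize(math.erf)


def norm_cdf(v):
    return 0.5 * (1.0 + erf(v / math.sqrt(2.0)))


def norm_pdf(v):
    return np.exp(-0.5 * v**2) / math.sqrt(2.0 * math.pi)


# Step 1: probit of the selection indicator on (1, x, z) by Fisher scoring.
Z = np.column_stack([np.ones(n), x, z])
s = selected.astype(float)
alpha = np.zeros(Z.shape[1])
for _ in range(25):
    idx = Z @ alpha
    cdf = np.clip(norm_cdf(idx), 1e-10, 1 - 1e-10)
    pdf = norm_pdf(idx)
    weight = pdf / (cdf * (1 - cdf))
    gradient = Z.T @ (weight * (s - cdf))
    information = (Z * (weight * pdf)[:, None]).T @ Z
    alpha = alpha + np.linalg.solve(information, gradient)

# Step 2: OLS on the selected sample with the Inverse Mills Ratio added.
idx = Z[selected] @ alpha
mills = norm_pdf(idx) / norm_cdf(idx)
X_naive = np.column_stack([np.ones(selected.sum()), x[selected]])
X_heckman = np.column_stack([X_naive, mills])
beta_naive = np.linalg.lstsq(X_naive, y[selected], rcond=None)[0]
beta_heckman = np.linalg.lstsq(X_heckman, y[selected], rcond=None)[0]

print(f"naive slope:     {beta_naive[1]:.3f}")
print(f"corrected slope: {beta_heckman[1]:.3f}  (true value 2.0)")
```

In this sketch the corrected slope should land closer to the true value than the naive one. The second-step standard errors from plain least squares are not valid without further adjustment, so they are deliberately not reported here.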

How does MNAR impact financial analysis?

In financial analysis, MNAR can significantly distort conclusions. For instance, if companies with severe financial distress are less likely to report certain key financial ratios, then analyzing only the available data would paint an overly optimistic picture of overall financial health or market risk. Similarly, in portfolio management, if data on failed investments is systematically suppressed or harder to obtain, performance metrics could be inflated. Properly identifying and addressing MNAR is crucial for accurate risk assessment, valuation, and policy decisions.