Missing completely at random

What Is Missing Completely at Random (MCAR)?

Missing Completely at Random (MCAR) describes a situation in data analysis where the probability of a data point being missing is entirely unrelated to any observed or unobserved values in the dataset.⁴⁷, ⁴⁸ In simpler terms, if a value is MCAR, its absence is purely by chance, as if the data were removed through a truly random process.⁴⁶ This makes MCAR the most favorable type of missing data mechanism for statistical analysis because, ideally, the subset of complete cases remains a representative sample of the original population.⁴⁵
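The verbal definition can also be written compactly. The notation below is an assumption borrowed from common usage in the missing-data literature and does not appear in the text above: $R$ is the matrix of missingness indicators, $Y_{\text{obs}}$ and $Y_{\text{mis}}$ are the observed and missing parts of the data, and $\phi$ denotes the parameters of the missingness process.

$$P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, \phi) = P(R \mid \phi)$$

Read this as: knowing any data value, whether observed or not, tells you nothing about which entries are missing.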

Despite being the "most benign" form, MCAR can still impact the quality and reliability of analyses. It can lead to a reduction in effective sample size, which in turn diminishes statistical power and broadens confidence intervals.⁴⁴ When data are MCAR, analyses performed on the observed data are generally unbiased, meaning they do not systematically overestimate or underestimate the true population parameters. However, the MCAR assumption itself is often considered strong and frequently unrealistic in real-world scenarios.⁴³

History and Origin

The concept of Missing Completely at Random, along with Missing at Random (MAR) and Missing Not at Random (MNAR), forms a foundational typology for understanding missing data mechanisms. This classification was introduced by statistician Donald Rubin in 1976 and further elaborated upon by Roderick J.A. Little in his seminal 1988 paper, "A Test of Missing Completely at Random for Multivariate Data with Missing Values."⁴² Little's work provided a formal statistical test to assess whether the MCAR assumption holds true for a given dataset, a tool widely known today as Little's MCAR Test. This development was crucial for advancing the rigor with which researchers approached incomplete financial data and other types of datasets, moving beyond simplistic deletion methods that often introduced bias.

Key Takeaways

Definition: MCAR means the probability of data being missing is independent of both observed and unobserved data values.
Ideal Scenario: When data are truly MCAR, analyses conducted on the remaining complete data are generally unbiased.
Impact: Even with MCAR, the effective sample size is reduced, potentially lowering statistical power.
Rarity: MCAR is a strong assumption and is rarely met in complex real-world datasets, particularly in finance.
Testing: Little's MCAR Test is a statistical method used to evaluate the plausibility of the MCAR assumption.

Formula and Calculation

While MCAR itself is a theoretical assumption about the mechanism of missingness rather than a direct calculation, its presence can be statistically assessed using tests such as Little's MCAR Test. This test evaluates the null hypothesis that data are MCAR. The test statistic approximately follows a chi-squared distribution.⁴⁰, ⁴¹

The core idea of Little's MCAR Test involves comparing the observed means of variables across different patterns of missing data with the expected population means, typically estimated using the Expectation-Maximization (EM) algorithm.³⁹ If these mean differences are statistically significant, it suggests the data are not MCAR.³⁸

The test statistic, denoted as $M$, is based on a likelihood ratio approach:

$$M = -2 \log\left(\frac{L_0}{L_1}\right)$$

Where:

$L_0$: Likelihood under the null hypothesis (data are MCAR).
$L_1$: Likelihood under the alternative hypothesis (data are not MCAR).

A significant p-value (typically p < 0.05) from Little's MCAR Test indicates that the null hypothesis of MCAR should be rejected, suggesting that the missingness is not completely random.³⁷
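For readers who want to see how the comparison of pattern means enters the calculation, the statistic is commonly presented in the following explicit form. This is a sketch of the usual textbook presentation; the symbol names are assumptions chosen here for illustration. Suppose the rows fall into $J$ distinct missing-data patterns, with $n_j$ rows in pattern $j$:

$$d^2 = \sum_{j=1}^{J} n_j \left(\bar{y}_{\text{obs},j} - \hat{\mu}_{\text{obs},j}\right)^{\top} \hat{\Sigma}_{\text{obs},j}^{-1} \left(\bar{y}_{\text{obs},j} - \hat{\mu}_{\text{obs},j}\right)$$

Here $\bar{y}_{\text{obs},j}$ is the sample mean of the variables observed in pattern $j$, while $\hat{\mu}_{\text{obs},j}$ and $\hat{\Sigma}_{\text{obs},j}$ are the corresponding parts of the EM-estimated mean vector and covariance matrix. Under MCAR, $d^2$ is approximately chi-squared with $\sum_{j=1}^{J} p_j - p$ degrees of freedom, where $p_j$ is the number of variables observed in pattern $j$ and $p$ is the total number of variables.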

Interpreting Missing Completely at Random (MCAR)

Interpreting MCAR involves understanding its implications for data integrity and the validity of subsequent analyses. If Little's MCAR Test returns a p-value that is not statistically significant, the data provide no evidence against MCAR, which is consistent with (though not proof of) missingness that does not depend on the values themselves.³⁶ In such cases, simpler methods for handling missing data, such as listwise deletion (also known as complete case analysis), may be considered, although they still reduce the sample size and can reduce statistical power.³⁵

However, if the test yields a statistically significant p-value, the MCAR assumption is violated. This means the missingness is related to either observed or unobserved variables, and applying methods that assume MCAR could introduce bias into the results.³³, ³⁴ In finance, for example, a company's financial report might be missing certain financial ratios not by random chance, but because the company is performing poorly, making the data not MCAR. Therefore, understanding whether MCAR holds is a critical first step in selecting appropriate strategies for dealing with incomplete datasets.
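A simple informal check in the spirit of this paragraph is to ask whether a fully observed variable differs between rows where another variable is missing and rows where it is present. The sketch below is not Little's MCAR Test. It uses simulated data, and the variable names (`past_return`, `missing_dep`), the logistic missingness rule, and all numeric settings are hypothetical choices made for illustration.

```python
import numpy as np

def welch_t(a, b):
    """Welch t-statistic for the difference in means of two samples."""
    var_a = a.var(ddof=1) / a.size
    var_b = b.var(ddof=1) / b.size
    return (a.mean() - b.mean()) / np.sqrt(var_a + var_b)

rng = np.random.default_rng(42)
n = 5000

# Hypothetical, fully observed variable: trailing one-year stock return (%).
past_return = rng.normal(loc=5.0, scale=20.0, size=n)

# Mechanism A (MCAR): a financial ratio is unreported with a flat 20% chance.
missing_mcar = rng.random(n) < 0.20

# Mechanism B (not MCAR): weaker performers omit the ratio more often.
p_missing = 0.4 / (1.0 + np.exp((past_return - 5.0) / 10.0))
missing_dep = rng.random(n) < p_missing

# Compare the observed variable between "ratio missing" and "ratio present".
t_mcar = welch_t(past_return[missing_mcar], past_return[~missing_mcar])
t_dep = welch_t(past_return[missing_dep], past_return[~missing_dep])

print(f"t-statistic under MCAR mechanism:      {t_mcar:.2f}")
print(f"t-statistic under dependent mechanism: {t_dep:.2f}")
```

A large absolute t-statistic signals that missingness is related to an observed variable, which contradicts MCAR. A small one does not prove MCAR, because this kind of check cannot detect dependence on values that were never observed.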

Hypothetical Example

Consider a hedge fund analyst collecting quarterly financial data for 100 stocks to build a quantitative model. One of the variables collected is "quarterly revenue growth."

Scenario: During one quarter, an external data provider experiences a temporary technical glitch, causing 5% of the "quarterly revenue growth" data points for randomly selected companies to be corrupted and unrecordable. The glitch affects companies irrespective of their size, industry, or actual revenue growth performance.

In this scenario, the missing data would be considered MCAR. The reason for the missingness (a random technical glitch) has no relationship to the underlying revenue growth figures or any other observable characteristics of the companies. If the analyst were to perform a regression analysis on the remaining 95% of the complete data, the estimates of the relationships between revenue growth and other variables (e.g., stock price movements) would likely remain unbiased, even though the sample size is slightly reduced.
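The scenario can be mimicked with a small simulation. This is a minimal sketch under stated assumptions: the growth figures are drawn from a normal distribution with made-up parameters, and the quantity checked is the sample mean rather than a regression coefficient, simply because it keeps the code short.

```python
import numpy as np

def mcar_complete_case_error(n_stocks=100, n_missing=5, n_trials=5000, seed=0):
    """Error of the complete-case mean relative to the full-data mean
    when values are dropped completely at random."""
    rng = np.random.default_rng(seed)
    errors = np.empty(n_trials)
    for t in range(n_trials):
        # Hypothetical quarterly revenue growth (%), one value per stock.
        growth = rng.normal(loc=3.0, scale=8.0, size=n_stocks)
        # The "glitch": pick stocks to drop without looking at any of
        # their characteristics, which is what makes this MCAR.
        dropped = rng.choice(n_stocks, size=n_missing, replace=False)
        observed = np.delete(growth, dropped)
        errors[t] = observed.mean() - growth.mean()
    return errors

errors = mcar_complete_case_error()
print(f"average error of complete-case mean: {errors.mean():+.4f}")
print(f"spread of that error (std dev):      {errors.std():.4f}")
```

Across many repetitions the average error sits close to zero, which is what "unbiased" means here. Any single quarter can still be off by a small amount, reflecting the lost observations rather than a systematic distortion.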

Practical Applications

In quantitative finance, MCAR is a theoretical ideal that rarely holds perfectly true for real-world financial data. However, understanding MCAR is crucial for evaluating the validity of different missing data handling techniques used in financial models and risk assessment.

For instance, when predicting stock returns or analyzing portfolio performance, missing data for firm characteristics (e.g., balance sheet items, trading volume) is a common challenge.³² If data were truly MCAR, simply removing observations with missing values (complete case analysis) would yield unbiased results, albeit with reduced statistical power.³¹ However, studies often find that missing financial data is not random and can even be related to fundamental performance or returns.³⁰ This implies that ignoring missingness or applying methods assuming MCAR when it doesn't hold can lead to flawed insights and potentially poor investment decisions.

Researchers continually explore improved methods for handling missing data in financial contexts. For example, a study highlighted by the University of Chicago Booth School of Business discusses advanced techniques beyond simple deletion or mean imputation to maintain data integrity when predicting stock returns from incomplete datasets.²⁹ Similarly, the firm Sakura Sky emphasizes the importance of robust techniques like multiple imputation for ensuring the integrity and completeness of datasets in the financial industry.²⁸

Limitations and Criticisms

Despite being the most "benign" form of missing data, the MCAR assumption is often criticized for being overly restrictive and unrealistic in practical applications, particularly in complex fields like finance and economics.²⁶, ²⁷

Key limitations and criticisms include:

Rarely Met in Reality: Real-world data, especially financial data, is rarely missing completely at random. Missing values often occur for systematic reasons, such as non-response to sensitive questions, technical glitches affecting specific data types, or companies failing to report certain metrics due to poor performance. For example, a company might cease reporting certain financial ratios if its performance deteriorates, so the data are missing for reasons related to the unobserved value itself.²⁵
Loss of Statistical Power: Even if the MCAR assumption holds, simply discarding cases with missing values (known as listwise deletion) leads to a reduction in the effective sample size. This can diminish the statistical power of analyses, making it harder to detect true relationships or significant effects.²³, ²⁴
Difficulty in Verification: While Little's MCAR Test exists to formally assess the assumption, a non-significant result only suggests that there is insufficient evidence to reject MCAR, not that MCAR definitively holds. The test also has limitations, such as assuming multivariate normality and not being suitable for categorical variables.²¹, ²² Furthermore, it does not identify which specific variables might be violating the MCAR assumption.²⁰
Alternative Methods Often Superior: Because true MCAR is uncommon, more sophisticated methods like multiple imputation or maximum likelihood estimation are generally preferred for handling missing data. These methods make less restrictive assumptions (e.g., Missing at Random, MAR) and can provide more accurate and efficient estimates when MCAR is violated.¹⁹

Missing Completely at Random (MCAR) vs. Missing at Random (MAR)

Missing Completely at Random (MCAR) and Missing at Random (MAR) are two fundamental classifications of missing data mechanisms, often confused due to their similar-sounding names. The key distinction lies in the relationship between the missingness and the observed data.

Feature	Missing Completely at Random (MCAR)	Missing at Random (MAR)
Definition	The probability of a data point being missing is independent of both observed and unobserved data values. Missingness is truly random.¹⁸	The probability of a data point being missing depends only on observed data, but not on the missing value itself.¹⁷
Example	A corrupted hard drive randomly deletes 5% of financial records, regardless of the values in those records.¹⁶	In a survey, men are less likely to report their income, but this missingness is explainable by their gender (an observed variable), not by their actual income level.¹⁵
Bias if Ignored	No statistical bias is introduced in analyses of the complete cases, though statistical power is reduced.¹⁴	Can introduce bias if not properly addressed, as the observed data might not be representative of the full dataset. Sophisticated methods (e.g., multiple imputation) are often needed.¹³
Real-world Occurrence	Rarely observed in real-world financial data or social science datasets; often considered a strong and unrealistic assumption.¹¹, ¹²	More common than MCAR. Many statistical methods (like multiple imputation and full information maximum likelihood) are designed to handle MAR data effectively.¹⁰
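The survey example in the table can be simulated to show why MAR matters in practice. The sketch below rests on assumptions that are not in the table: incomes are normally distributed, men earn more on average in this toy population, and the reporting rates are 50% for men and 90% for women. Within each gender, reporting does not depend on income, so the mechanism is MAR rather than MCAR.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

# Observed for everyone: gender. Hypothetical incomes in thousands.
is_male = rng.random(n) < 0.5
income = np.where(is_male, rng.normal(60, 10, n), rng.normal(50, 10, n))

# MAR: the chance of reporting depends on gender only, not on income.
p_report = np.where(is_male, 0.5, 0.9)
reported = rng.random(n) < p_report

true_mean = income.mean()
complete_case_mean = income[reported].mean()

# Use the observed variable to adjust: average the within-gender means
# of reported incomes, weighted by each gender's share of the sample.
adjusted_mean = sum(
    (is_male == g).mean() * income[reported & (is_male == g)].mean()
    for g in (True, False)
)

print(f"true mean income:        {true_mean:.2f}")
print(f"complete-case mean:      {complete_case_mean:.2f}")
print(f"gender-adjusted mean:    {adjusted_mean:.2f}")
```

The complete-case mean is pulled toward the group that reports more often, while the adjusted figure recovers the true mean because it conditions on the observed variable that drives the missingness. Multiple imputation and maximum likelihood methods generalize this idea to many variables.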

FAQs

What causes data to be Missing Completely at Random (MCAR)?

MCAR usually results from truly random events or technical issues that affect data collection or storage without any systematic pattern. Examples include random equipment failure, accidental deletion of files, or a survey page unintentionally being skipped by respondents due to a software glitch unrelated to their answers.⁹

Why is MCAR important for financial analysis?

Although rare, understanding MCAR is crucial in financial analysis because it dictates the appropriate methods for handling incomplete datasets. If data is genuinely MCAR, simpler techniques like listwise deletion might be acceptable because they do not bias the results. If the data is not MCAR, ignoring the missingness can lead to inaccurate financial models and flawed conclusions.⁷, ⁸

How can one determine if data is MCAR?

The primary method for statistically assessing whether data is MCAR is Little's MCAR Test. This hypothesis testing procedure evaluates if the patterns of missingness are random. A non-significant p-value suggests the MCAR assumption might hold, while a significant p-value indicates that the data are not MCAR.⁵, ⁶

If data is MCAR, can I simply delete the rows with missing values?

While deleting rows with missing values (known as listwise deletion or complete case analysis) will produce unbiased results if the data is truly MCAR, it comes at the cost of reduced sample size and decreased statistical power.³, ⁴ For this reason, even with MCAR data, more sophisticated imputation methods might be considered to maximize the use of available information.
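The cost of listwise deletion under MCAR can be made concrete with a short calculation. As a sketch, assume (an illustrative assumption, not something stated above) that each of $k$ variables is missing independently with probability $\pi$, and that the standard error (SE) of an estimate scales as $1/\sqrt{n}$:

$$\text{fraction of complete rows} = (1-\pi)^k, \qquad \frac{\text{SE}_{\text{complete cases}}}{\text{SE}_{\text{full data}}} = \frac{1}{\sqrt{(1-\pi)^k}}$$

With $\pi = 0.05$ and $k = 1$, about 95% of rows survive and standard errors grow by roughly 2.6%. With $\pi = 0.05$ and $k = 10$, only about 60% of rows survive ($0.95^{10} \approx 0.599$) and standard errors grow by roughly 29%, even though no bias is introduced.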

What are the alternatives if data is not MCAR?

If data is not MCAR (i.e., it is Missing at Random (MAR) or Missing Not at Random (MNAR)), simpler deletion methods can introduce significant bias. In such cases, advanced missing data techniques like multiple imputation, maximum likelihood estimation, or various model-based approaches are generally recommended to produce more reliable and accurate results.¹, ²