What Is Log likelihood?
Log likelihood is a fundamental concept in statistical inference used primarily in parameter estimation for statistical models. It represents the natural logarithm of the likelihood function, which quantifies how well a statistical model's parameters explain observed data. Instead of directly maximizing the likelihood function, analysts often work with the log likelihood because the logarithmic transformation simplifies calculations, particularly when dealing with product terms in the likelihood function, by converting them into sums. This transformation does not change the location of the maximum, making the maximization of the log likelihood equivalent to maximizing the likelihood itself. The log likelihood is crucial for various forms of data analysis and statistical modeling.
History and Origin
While early ideas related to seeking the most probable distribution for observations appeared in the mid-18th century with mathematicians like Thomas Bayes and Johann Heinrich Lambert, the modern conceptualization and formalization of likelihood and, by extension, log likelihood, are largely attributed to Sir Ronald Aylmer Fisher. Fisher began his seminal work on maximum likelihood from 1912 to 1922, culminating in influential publications such as his 1922 paper "On the Mathematical Foundations of Theoretical Statistics"7. His contributions transformed the method into a cornerstone of statistical methodology. The "simple idea" of maximum likelihood, which log likelihood optimizes, has a complex mathematical history, explored by many prominent figures including Joseph Louis Lagrange and Carl Friedrich Gauss6. The convenience of maximizing the logarithm of the likelihood function was recognized early on, simplifying the optimization process for parameter estimation5.
Key Takeaways
- Log likelihood is the natural logarithm of the likelihood function, used in statistical modeling.
- It simplifies the calculation of maximum likelihood estimates by converting products into sums.
- Maximizing the log likelihood is mathematically equivalent to maximizing the likelihood function.
- It is a core component in model selection criteria like AIC and BIC.
- Log likelihood values are used to compare the fit of different statistical models to observed data.
Formula and Calculation
The likelihood function, often denoted (L(\theta | x)), represents the joint probability of observing a given dataset (x) for a particular set of model parameters (\theta). For independent and identically distributed (i.i.d.) observations, the likelihood function is the product of the probability density (or mass) functions for each observation.
The log likelihood function, denoted ( \ell(\theta | x) ), is then:
( \ell(\theta | x) = \ln L(\theta | x) = \ln \prod_{i=1}^{n} f(x_i | \theta) )
Due to the properties of logarithms, this product can be transformed into a sum:
( \ell(\theta | x) = \sum_{i=1}^{n} \ln f(x_i | \theta) )
Where:
- ( \ell(\theta | x) ) is the log likelihood.
- ( L(\theta | x) ) is the likelihood function.
- ( \ln ) denotes the natural logarithm.
- ( x = (x_1, x_2, \ldots, x_n) ) represents the observed data points.
- ( \theta ) represents the unknown parameters of the statistical model.
- ( f(x_i | \theta) ) is the probability density function (or probability mass function for discrete data) for a single observation ( x_i ) given the parameters ( \theta ).
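As an illustration of the sum form above, the log likelihood can be evaluated directly in a few lines of code. The following is a minimal sketch, assuming a normal model; the data array and candidate parameter values are hypothetical, and NumPy and SciPy are used for the density calculations.

```python
import numpy as np
from scipy import stats

# Hypothetical observed data points x_1, ..., x_n
x = np.array([0.2, -0.1, 0.4, 0.0, 0.3])

# Hypothetical candidate parameters theta = (mu, sigma) for a normal model
mu, sigma = 0.1, 0.25

# Log likelihood as a sum of log densities: ell(theta | x) = sum_i ln f(x_i | theta)
log_likelihood = np.sum(stats.norm.logpdf(x, loc=mu, scale=sigma))
print(log_likelihood)
```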
To find the parameters (\theta) that best fit the data, one typically finds the values of (\theta) that maximize (\ell(\theta | x)). This often involves taking the derivative of the log likelihood with respect to each parameter, setting the derivatives to zero, and solving the resulting equations. This process is central to maximum likelihood estimation.
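In applied work, this maximization is usually carried out numerically rather than by solving the derivative equations by hand. Below is a minimal sketch, assuming a normal model and SciPy's general-purpose optimizer; the simulated data and starting values are hypothetical.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
data = rng.normal(loc=0.5, scale=2.0, size=200)  # hypothetical observations

def negative_log_likelihood(params, x):
    mu, log_sigma = params              # optimize ln(sigma) so that sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(stats.norm.logpdf(x, loc=mu, scale=sigma))

# Minimizing the negative log likelihood is the same as maximizing the log likelihood
result = optimize.minimize(negative_log_likelihood, x0=[0.0, 0.0], args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat, -result.fun)   # estimates and the maximized log likelihood
```

Optimizing over ln(sigma) rather than sigma itself is a common reparameterization that keeps the standard deviation positive during unconstrained optimization.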
Interpreting the Log likelihood
Interpreting the log likelihood value itself is not as straightforward as interpreting a probability. A higher log likelihood value indicates that the chosen parameters provide a better fit to the observed data, meaning the observed data is more probable under those parameters. However, the absolute value of the log likelihood is not directly interpretable and depends on the scale and number of data points. For instance, on the same dataset a log likelihood of -100 implies a better fit than a log likelihood of -1000, but neither value has an inherent meaning on its own.
Instead, log likelihood values are primarily used for comparison, especially in model selection. When comparing two or more models, the model with the higher log likelihood (i.e., less negative) is generally preferred as it indicates a better fit to the observed data. This comparison is often formalized through information criteria like the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC), both of which incorporate the log likelihood value along with a penalty for model complexity. For example, in a regression analysis, a higher log likelihood for a given set of coefficients suggests those coefficients are more likely to have generated the observed outcomes.
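Because AIC and BIC are simple functions of the log likelihood, this comparison can be written out directly. The following is a minimal sketch using the standard formulas; the values of `log_lik`, `k`, and `n` are hypothetical placeholders.

```python
import math

log_lik = -150.0   # hypothetical maximized log likelihood of a fitted model
k = 2              # hypothetical number of estimated parameters
n = 100            # hypothetical number of observations

aic = 2 * k - 2 * log_lik            # Akaike Information Criterion
bic = k * math.log(n) - 2 * log_lik  # Bayesian Information Criterion
print(aic, bic)    # lower values are preferred once complexity is penalized
```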
Hypothetical Example
Consider an investment firm wanting to model the daily returns of a particular stock. They hypothesize that the stock returns follow a normal distribution with an unknown mean ((\mu)) and standard deviation ((\sigma)). They collect 100 days of historical stock return data.
To estimate (\mu) and (\sigma) using maximum likelihood, they would construct the log likelihood function for the normal distribution given their 100 observations.
For a single observation (x_i) from a normal distribution with mean (\mu) and standard deviation (\sigma), the probability density function is:
( f(x_i | \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right) )
The log likelihood for (n) independent observations is:
( \ell(\mu, \sigma | x) = -\frac{n}{2} \ln(2\pi) - n \ln \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 )
The firm would then use optimization techniques to find the values of (\mu) and (\sigma) that maximize this log likelihood function. If, after calculations, they find that a model with (\mu = 0.0005) and (\sigma = 0.01) yields a log likelihood of -150, and another set of parameters (\mu = 0.001) and (\sigma = 0.015) yields a log likelihood of -160, the first set of parameters ((\mu = 0.0005), (\sigma = 0.01)) would be considered a better fit for the observed daily returns. This method is a core component of quantitative finance.
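A comparison of this kind is straightforward to carry out in code. The sketch below is illustrative only: it substitutes simulated returns for the firm's historical data, so the resulting log-likelihood values will differ from the hypothetical figures above, but the parameter set with the higher value is preferred in the same way.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
returns = rng.normal(loc=0.0005, scale=0.01, size=100)  # stand-in for 100 days of returns

def normal_log_likelihood(x, mu, sigma):
    return np.sum(stats.norm.logpdf(x, loc=mu, scale=sigma))

ll_first = normal_log_likelihood(returns, mu=0.0005, sigma=0.01)
ll_second = normal_log_likelihood(returns, mu=0.001, sigma=0.015)
print(ll_first, ll_second)  # the parameter set with the higher log likelihood fits better
```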
Practical Applications
Log likelihood plays a critical role across various domains, particularly in finance and econometrics:
- Financial Modeling and Forecasting: In financial forecasting, log likelihood is used to estimate parameters for complex models, such as those for asset pricing or volatility. This includes GARCH models for time-varying volatility in time series analysis.
- Risk Management: For risk management, log likelihood is essential in fitting distributions to historical loss data, helping to estimate parameters for Value at Risk (VaR) or Expected Shortfall calculations.
- Credit Scoring: Financial institutions use log likelihood to estimate parameters in logistic regression models that predict the probability of loan default, informing credit scoring decisions (a brief sketch of this log likelihood appears after this list).
- Econometric Analysis: In econometrics, researchers widely use log likelihood for parameter estimation in various models, including linear regression (when errors are assumed to be normally distributed, minimizing the sum of squared errors is equivalent to maximum likelihood estimation)4, probit, and logit models.
- Model Validation and Comparison: Log likelihood is central to information criteria such as AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion), which are used to compare the goodness-of-fit of different models while penalizing complexity, thus aiding in model selection. The inverse of the Fisher information matrix, which is derived from the log likelihood function, is commonly used to approximate the covariance matrix of maximum likelihood estimators, providing insights into the precision of these estimates3.
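To make the credit-scoring case concrete, the log likelihood of a logistic regression is a sum of Bernoulli log probabilities. The following is a minimal sketch with hypothetical borrower data and hypothetical coefficients; an actual fitted model would choose the coefficients that maximize exactly this quantity.

```python
import numpy as np

# Hypothetical data: one borrower characteristic and observed default indicators (1 = default)
x = np.array([0.5, 1.2, -0.3, 2.0, 0.1])
y = np.array([0, 1, 0, 1, 0])

# Hypothetical logistic regression coefficients (intercept and slope)
beta0, beta1 = -1.0, 1.5

# Predicted default probabilities from the logistic function
p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

# Bernoulli log likelihood: sum over borrowers of y*ln(p) + (1 - y)*ln(1 - p)
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(log_likelihood)
```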
Limitations and Criticisms
While powerful, the use of log likelihood and maximum likelihood estimation (MLE) has several limitations. One key criticism is that MLE relies on certain assumptions about the underlying distribution of the data. If these assumptions are incorrect, the resulting parameter estimates and model inferences may be inaccurate or inefficient. For instance, if data are not truly independent or identically distributed, the maximum likelihood estimates derived from the log likelihood might be biased or inconsistent.
Furthermore, maximizing the log likelihood function can be computationally intensive, especially for complex models with many parameters or large datasets. Numerical optimization techniques are often required, and these methods can sometimes converge to local maxima rather than the true global maximum, leading to suboptimal parameter estimates. Some theoretical cases exist where the maximum likelihood estimator might not behave as expected or even be "wrong" under certain pathological conditions2. Technical challenges can also arise with multidimensional data and complex multiparameter models1. While log likelihood is a cornerstone of statistical inference, practitioners must be aware of its underlying assumptions and potential computational difficulties in data analysis.
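A common, if partial, safeguard against the local-maximum problem is to run the optimizer from several starting points and keep the best solution. The following is a minimal sketch of that pattern, assuming a normal model and hypothetical simulated data (for this particular model the log likelihood is well behaved, so the multi-start step is purely illustrative).

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(7)
data = rng.normal(loc=1.0, scale=0.5, size=300)  # hypothetical observations

def neg_log_likelihood(params):
    mu, log_sigma = params
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

# Run the optimizer from several random starting points and keep the best result,
# reducing the risk of reporting a local rather than global maximum.
starts = rng.normal(size=(5, 2))
results = [optimize.minimize(neg_log_likelihood, x0=start) for start in starts]
best = min(results, key=lambda r: r.fun)
print(best.x[0], np.exp(best.x[1]), -best.fun)  # mu, sigma, maximized log likelihood
```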
Log likelihood vs. Likelihood
Log likelihood and likelihood are intimately related but serve different practical purposes in statistical modeling.
The likelihood function ((L(\theta | x))) represents the probability of observing the given data (x) for specific parameter values (\theta). It is a direct measure of how plausible a given set of parameters is, given the observed data. For discrete data its values lie between zero and one; for continuous data they are non-negative and not bounded above by one, though in practice they are often extremely small numbers because many terms less than one are multiplied together.
Log likelihood ((\ell(\theta | x))) is the natural logarithm of the likelihood function. Its primary advantage is mathematical convenience. When the likelihood function involves a product of many probability density functions (as is common with independent observations), taking the logarithm transforms this product into a sum. This conversion simplifies differentiation, which is necessary to find the parameters that maximize the function (i.e., the maximum likelihood estimates). Maximizing the log likelihood is equivalent to maximizing the likelihood because the natural logarithm is a monotonically increasing function, meaning it preserves the order of values. If (L_1 > L_2), then (\ln(L_1) > \ln(L_2)). Thus, the parameters that maximize (L(\theta | x)) will also maximize (\ell(\theta | x)).
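This order-preserving property is easy to verify numerically: evaluating a likelihood and its logarithm over the same grid of candidate parameters yields the same maximizer. A minimal sketch, assuming a normal model with a hypothetical dataset and the standard deviation fixed at 1:

```python
import numpy as np
from scipy import stats

x = np.array([1.1, 0.9, 1.3, 1.0, 0.7])   # hypothetical observations
mu_grid = np.linspace(0.0, 2.0, 201)      # candidate values for the mean

likelihood = np.array([np.prod(stats.norm.pdf(x, loc=m, scale=1.0)) for m in mu_grid])
log_likelihood = np.array([np.sum(stats.norm.logpdf(x, loc=m, scale=1.0)) for m in mu_grid])

# Both curves peak at the same candidate mean
print(mu_grid[np.argmax(likelihood)], mu_grid[np.argmax(log_likelihood)])
```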
In practice, statisticians and econometrics professionals almost always work with the log likelihood due to its computational benefits for parameter estimation.
FAQs
Why do we use log likelihood instead of just likelihood?
Using log likelihood simplifies the mathematical calculations involved in finding the maximum likelihood estimates. The likelihood function often involves multiplying many small probabilities, which can lead to computational underflow. Taking the logarithm transforms these products into sums, making computations more stable and easier to differentiate for optimization.
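The underflow issue is easy to demonstrate. In the minimal sketch below, with a hypothetical large sample, the raw product of densities underflows to zero in double precision, while the sum of log densities remains a finite, usable number.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)           # hypothetical large sample

densities = stats.norm.pdf(x)         # individual density values, mostly below 1
print(np.prod(densities))             # underflows to 0.0
print(np.sum(stats.norm.logpdf(x)))   # large negative but finite log likelihood
```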
Can log likelihood be positive?
It can be, although in practice it is usually negative. Each term (\ln f(x_i | \theta)) is negative whenever the density or probability at (x_i) is below 1, which is the typical case, so the sum over many observations is usually a negative number. For discrete data, probability mass values can never exceed 1, so a discrete log likelihood is never positive. For continuous data, however, density values can exceed 1 (for example, under a normal distribution with a very small standard deviation), so a continuous log likelihood can be positive.
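A quick check of the continuous case, using a hypothetical normal density with a very small standard deviation:

```python
from scipy import stats

# The density of N(0, 0.01) at its mean is about 39.9, so its logarithm is positive
print(stats.norm.logpdf(0.0, loc=0.0, scale=0.01))  # roughly 3.69
```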
How does log likelihood relate to model fit?
A higher log likelihood value (meaning closer to zero, or less negative) generally indicates a better fit of the statistical model to the observed data. It implies that the chosen parameters make the observed data more probable. However, when comparing models, it's important to consider model complexity alongside log likelihood, often done using criteria like AIC or BIC for model selection.
Is log likelihood used in Bayesian inference?
Yes, log likelihood is also a crucial component in Bayesian inference. In Bayesian statistics, the likelihood function (and thus its logarithm) is used to update prior beliefs about parameters to form a posterior distribution. The posterior distribution is proportional to the likelihood function multiplied by the prior distribution.
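On the log scale, this Bayesian update is additive: the unnormalized log posterior is the log likelihood plus the log prior. The following is a minimal sketch, assuming a normal likelihood with a known standard deviation for hypothetical data and a normal prior on the mean.

```python
import numpy as np
from scipy import stats

x = np.array([0.3, 0.1, 0.4, 0.2])  # hypothetical observations, sigma assumed known (0.2)

def log_posterior_unnormalized(mu):
    log_likelihood = np.sum(stats.norm.logpdf(x, loc=mu, scale=0.2))
    log_prior = stats.norm.logpdf(mu, loc=0.0, scale=1.0)  # hypothetical N(0, 1) prior on mu
    return log_likelihood + log_prior  # equals the log posterior up to an additive constant

print(log_posterior_unnormalized(0.25))
```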