F1 score

What Is the F1 Score?

The F1 score is a widely used evaluation metric in machine learning, particularly for assessing the performance of a classification model. It provides a single score that balances both precision and recall, making it especially valuable when dealing with imbalanced datasets. In such scenarios, where one class significantly outnumbers another, metrics like accuracy can be misleading. The F1 score, belonging to the broader category of Model Evaluation Metrics, offers a more robust measure of a model's ability to correctly identify positive instances while minimizing both false positives and false negatives.

History and Origin

The F1 score has its roots in information retrieval (IR), a field concerned with finding relevant documents or information from large collections. It emerged from early efforts to evaluate the effectiveness of IR systems. C.J. van Rijsbergen is credited with proposing the F-measure (of which F1 is a specific instance) in the 1970s as a way to combine precision and recall into a single metric. His work in the mathematical theory of measurement for information retrieval laid the foundation for its adoption beyond IR into machine learning and other domains.

Key Takeaways

  • The F1 score is a metric that combines precision and recall into a single value, offering a balanced view of a classification model's performance.
  • It is particularly useful for evaluating models trained on imbalanced datasets, where standard accuracy can be deceptive.
  • Calculated as the harmonic mean of precision and recall, the F1 score ranges from 0 to 1, with 1 indicating perfect performance.
  • A high F1 score signifies that a model has both high precision (few false positives) and high recall (few false negatives).
  • It is widely applied in various fields, including fraud detection, medical diagnosis, and spam detection.

Formula and Calculation

The F1 score is the harmonic mean of precision and recall. To understand its calculation, it's essential to first define the components derived from a confusion matrix:

  • True Positives (TP): Cases correctly identified as positive.
  • False Positives (FP): Cases incorrectly identified as positive.
  • False Negatives (FN): Cases incorrectly identified as negative (missed positive cases).
  • True Negatives (TN): Cases correctly identified as negative.

Precision and Recall are then calculated as follows:

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \qquad \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

Finally, the F1 score is computed using the formula:

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Alternatively, in terms of true positives, false positives, and false negatives:

$$F1 = \frac{2 \times \text{TP}}{2 \times \text{TP} + \text{FP} + \text{FN}}$$

The f1_score function in scikit-learn, a popular Python library for machine learning, implements this calculation.
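To make the arithmetic concrete, here is a minimal sketch assuming scikit-learn is installed. The TP/FP/FN counts are arbitrary illustrations, and the label vectors are constructed only to reproduce those counts so the manual result can be compared against scikit-learn's:

```python
from sklearn.metrics import f1_score

# Arbitrary illustrative counts (not real data).
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)  # 0.8
recall = tp / (tp + fn)     # ~0.667
f1_manual = 2 * precision * recall / (precision + recall)

# Equivalent single-step form: 2*TP / (2*TP + FP + FN).
f1_direct = 2 * tp / (2 * tp + fp + fn)

# Label vectors built to reproduce the same confusion-matrix counts.
# True negatives are omitted because they do not affect the F1 score.
y_true = [1] * tp + [0] * fp + [1] * fn
y_pred = [1] * tp + [1] * fp + [0] * fn

print(f1_manual, f1_direct, f1_score(y_true, y_pred))  # all ~0.727
```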

Interpreting the F1 score

The F1 score ranges from 0 to 1, where 1 indicates the best possible performance (perfect precision and recall), and 0 indicates the worst. A higher F1 score suggests a better balance between the model's ability to identify all relevant instances (recall) and its ability to avoid misclassifying irrelevant instances as relevant (precision).

For instance, an F1 score of 0.9 suggests excellent performance, meaning the model is making accurate predictions with a good balance of precision and recall. An F1 score between 0.7 and 0.9 is generally considered very good, while scores below 0.5 often indicate poor performance, potentially due to a significant imbalance between precision and recall or overall weak predictive power. The F1 score is particularly insightful for binary classification problems where both types of errors (false positives and false negatives) carry significant consequences.

Hypothetical Example

Consider a hypothetical credit scoring model designed to predict whether a loan applicant will default. Out of 1,000 applicants, 50 actually default (the positive class) and 950 repay (the negative class). This is an imbalanced dataset.

Let's say the model makes the following predictions:

  • True Positives (TP): 40 (correctly predicted defaults)
  • False Positives (FP): 20 (incorrectly predicted defaults – non-defaulters flagged as defaulters)
  • False Negatives (FN): 10 (missed defaults – defaulters flagged as non-defaulters)
  • True Negatives (TN): 930 (correctly predicted non-defaults)

Using these values, we can calculate precision and recall:

$$\text{Precision} = \frac{40}{40 + 20} = \frac{40}{60} \approx 0.667 \qquad \text{Recall} = \frac{40}{40 + 10} = \frac{40}{50} = 0.80$$

Now, we calculate the F1 score:

$$F1 = 2 \times \frac{0.667 \times 0.80}{0.667 + 0.80} = 2 \times \frac{0.5336}{1.467} \approx 2 \times 0.3637 \approx 0.727$$

An F1 score of approximately 0.727 indicates a reasonably good balance between the model's ability to identify defaulting applicants and its tendency to avoid false alarms.
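For readers who want to verify the arithmetic, here is a short scikit-learn sketch. The label vectors are synthetic, constructed purely to match the confusion-matrix counts above, with 1 marking a default:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Confusion-matrix counts from the hypothetical credit-default example.
tp, fp, fn, tn = 40, 20, 10, 930

# Rebuild label vectors that produce exactly these counts.
y_true = [1] * tp + [0] * fp + [1] * fn + [0] * tn
y_pred = [1] * tp + [1] * fp + [0] * fn + [0] * tn

print(round(precision_score(y_true, y_pred), 3))  # 0.667
print(round(recall_score(y_true, y_pred), 3))     # 0.8
print(round(f1_score(y_true, y_pred), 3))         # 0.727
```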

Practical Applications

The F1 score is a critical evaluation metric in various practical applications, especially where the cost of false positives and false negatives differs significantly or when dealing with imbalanced datasets.

In finance, for example, the F1 score is extensively used in fraud detection systems. Identifying fraudulent transactions is crucial (high recall), but so is minimizing the incorrect flagging of legitimate transactions as fraudulent (high precision), which can inconvenience customers and erode trust. The F1 score provides a balanced assessment for such critical applications. Similarly, in credit scoring, machine learning models predict default risk, and the F1 score helps ensure that models effectively identify potential defaulters without unduly rejecting creditworthy applicants.

Beyond finance, the F1 score is vital in medical diagnosis, where missing a disease (false negative) can have severe consequences, but false alarms (false positives) can lead to unnecessary treatments or patient anxiety. It's also applied in spam detection, natural language processing for sentiment analysis, and various other classification model tasks within machine learning.

Limitations and Criticisms

While the F1 score is a valuable evaluation metric, it is not without limitations. One key criticism is that it equally weights precision and recall. In some real-world scenarios, one of these metrics might be more critical than the other. For instance, in a medical screening for a rare, life-threatening disease, maximizing recall (ensuring no true cases are missed, even at the cost of some false positives) might be prioritized over precision. The F1 score, by enforcing an equal balance, may not reflect the true business objective in such cases.

Another limitation is its focus solely on the positive class. When evaluating a binary classification model, the F1 score only considers true positives, false positives, and false negatives, effectively ignoring true negatives. This can be problematic if the performance on the negative class is also important, or if the negative class is the minority. In such instances, the F1 score might present an overly optimistic view of model performance. For multi-class scenarios, macro, micro, and weighted F1 scores exist, each offering a different averaging approach to address class imbalance, as sketched below.
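The following sketch shows the multi-class averaging options mentioned above via scikit-learn's `average` parameter. The three-class labels are made up for illustration:

```python
from sklearn.metrics import f1_score

# Made-up, imbalanced three-class labels (class 0 dominates).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 2, 2, 2]

# macro: unweighted mean of per-class F1 — treats rare classes equally.
print(f1_score(y_true, y_pred, average="macro"))
# micro: pools global TP/FP/FN counts — dominated by the majority class.
print(f1_score(y_true, y_pred, average="micro"))
# weighted: per-class F1 weighted by each class's support.
print(f1_score(y_true, y_pred, average="weighted"))
```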

F1 score vs. Accuracy

The F1 score and accuracy are both common evaluation metrics for classification models, but they are best suited for different situations.

| Feature | F1 Score | Accuracy |
| --- | --- | --- |
| Definition | Harmonic mean of precision and recall. | Proportion of correctly classified instances out of total instances. |
| Formula | $\frac{2 \times \text{TP}}{2 \times \text{TP} + \text{FP} + \text{FN}}$ | $\frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}}$ |
| Best Use | Imbalanced datasets, when false positives and false negatives are equally important. | Balanced datasets, when all misclassifications have similar costs. |
| Sensitivity | Sensitive to class distribution imbalances. | Can be misleadingly high for highly imbalanced datasets. |

The main point of confusion arises with imbalanced datasets. For example, if a model predicts a rare event (e.g., fraud) that only occurs 1% of the time, a model that simply predicts "no fraud" for every transaction would achieve 99% accuracy. While seemingly high, this model is useless because it fails to identify any actual fraud. The F1 score, by considering both precision and recall, would be very low in this scenario, effectively highlighting the model's poor performance on the positive class. Therefore, for most real-world problems where class distribution is uneven or the costs of different errors vary, the F1 score provides a more reliable assessment than simple accuracy.
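This accuracy trap is easy to demonstrate. The sketch below uses synthetic data with a 1% fraud rate and a degenerate model that never flags fraud; accuracy looks excellent while F1 exposes the failure (zero_division=0 silences scikit-learn's warning when no positives are predicted):

```python
from sklearn.metrics import accuracy_score, f1_score

# Synthetic data: fraud (1) occurs in 1% of 1,000 transactions.
y_true = [1] * 10 + [0] * 990
# A useless model that predicts "no fraud" for every transaction.
y_pred = [0] * 1000

print(accuracy_score(y_true, y_pred))             # 0.99 — looks great
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0  — reveals the failure
```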

FAQs

What is the F1 score used for?

The F1 score is used to evaluate the performance of classification models, particularly when the dataset has an uneven distribution between classes (imbalanced datasets). It helps to understand how effective a model is at correctly identifying positive instances while keeping false positives and false negatives low.

Why use F1 score instead of just precision or recall?

The F1 score provides a single, balanced metric that considers both precision and recall. If you only look at precision, a model might have high precision by being very cautious and only predicting positive when it's extremely sure, but it would miss many actual positive cases (low recall). Conversely, a model with high recall might correctly identify almost all positive cases but at the cost of many false alarms (low precision). The F1 score ensures a healthy trade-off between these two critical aspects of performance.
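One way to see this trade-off is to sweep a decision threshold over model scores. The ground-truth labels and probability scores below are hypothetical; raising the threshold pushes precision up and recall down, with F1 balancing the two:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical ground truth and made-up model scores (probabilities).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.60, 0.45, 0.55, 0.50, 0.40, 0.30, 0.20]

for threshold in (0.3, 0.5, 0.7, 0.9):
    # Predict positive whenever the score clears the threshold.
    y_pred = [1 if s >= threshold else 0 for s in scores]
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    f = f1_score(y_true, y_pred, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}  F1={f:.2f}")
```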

What does a high F1 score mean?

A high F1 score, close to 1, indicates that a classification model has achieved a good balance of both high precision and high recall. This means the model is effective at correctly identifying positive instances (high recall) and is also accurate in its positive predictions, generating few false alarms (high precision).