Juri Opitz


Researcher, Ph.D.


Evaluation Quirks, Metric Pitfalls and Some Recommendations

I’ve been observing system evaluation practice for close to 10 years, and I thought I’d share a few funny and intriguing things that I’ve noticed along the way.

Overview of this post:

Six Funny Evaluation Quirks

To keep the overview reasonably brief, detailed references and code examples for each point will come at the bottom of this post.

Doppelganger metrics:

1) Doppelganger metrics by name: Folks have been using two different metrics that are both called “macro F1” for multi-class evaluation. They can differ by up to 50 points!

2) Doppelganger metrics by implementation: For multi-class evaluation, “micro F1” is the same as “Accuracy”.

Implementation bugs:

3) Optimistic result because of double-counting: By improperly evaluating retrieved instances in an IR setting, the F1 score can rise up to 200 points, kind of exploding a scale that is supposed to end at 100.

4) Optimistic result because of tie-breaking: Evaluation scores can turn out overly optimistic, e.g., when ties in an ensemble classifier are resolved using the gold label.

Quirky metric properties:

5) Wrong prediction, better score: For Matthews Correlation Coefficient (MCC) and Kappa, there are situations where a wrong prediction increases the score.

Ambiguous Metric Goals:

6) “Balance”: Researchers often wish for a “balance” when evaluating a system. This “balance” is then said to be achieved by using metrics such as MCC or macro F1. But it’s actually not clear what this balance is, nor how these metrics would achieve it.

Evaluation Tips

My basic take-aways from such evaluation quirks are:

For deeper reading, and to help practitioners and researchers with this, I’ve written a paper that explores how to select the right metrics and make more sense of their behavior:

A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice

Links: journal, arXiv

To give a quick idea of the work: The paper analyzes how metrics behave depending on factors like how often a class appears (prevalence) and the model’s tendency to predict certain classes (classifier bias). Metrics analyzed: Accuracy, macro Recall and Precision, F1, weighted F1, macro F1, Kappa, Matthews Correlation Coefficient (MCC). A finding, for instance, is that in a strict sense, only macro Recall is “balanced”.

Detailed references

Point 1) “macro F1 doppelgangers”

See Section 4.4 in this metric overview. For a deeper survey of the relationship between the two F1 formulas, see “Macro F1 and Macro F1”.
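To see the two doppelgangers disagree, here’s a minimal sketch with scikit-learn on toy data of my own: one variant averages the per-class F1 scores, the other takes the harmonic mean of macro Precision and macro Recall.

    from sklearn.metrics import f1_score, precision_score, recall_score

    # Toy imbalanced data: 10 "pos" gold instances (only 1 of them found), 90 "neg".
    y_true = ["pos"] * 10 + ["neg"] * 90
    y_pred = ["pos"] * 1 + ["neg"] * 9 + ["neg"] * 90

    # Variant 1: arithmetic mean of the per-class F1 scores (what scikit-learn computes).
    f1_averaged = f1_score(y_true, y_pred, average="macro")

    # Variant 2: harmonic mean of macro Precision and macro Recall.
    p = precision_score(y_true, y_pred, average="macro")
    r = recall_score(y_true, y_pred, average="macro")
    f1_of_averages = 2 * p * r / (p + r)

    print(round(f1_averaged, 3), round(f1_of_averages, 3))   # 0.567 vs. 0.698

Even on this tiny example, the two “macro F1” scores differ by about 13 points.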

Point 2) “Micro F1 = Accuracy”

I think that’s already known to some folks, but probably not to all. If you need to see the simple derivation, look at, e.g., Appendix A.
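A quick sanity check with scikit-learn (any single-label multi-class prediction will do):

    from sklearn.metrics import accuracy_score, f1_score

    y_true = ["a", "a", "b", "b", "c", "c", "c"]
    y_pred = ["a", "b", "b", "c", "c", "c", "a"]

    print(accuracy_score(y_true, y_pred))             # 0.571...
    print(f1_score(y_true, y_pred, average="micro"))  # 0.571..., identical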

Point 3) “200% F1 score”

Essentially, what happened is that a function like this was used for calculating recall:

    def misleading_recall(cand, ref):
        count = 0
        for pred in cand:
            if pred in ref:
                count += 1
        # Note the mismatch: the numerator counts over the candidate list,
        # while the denominator is the length of the reference list.
        return count / len(ref)

Now if you feed this function a candidate list where the same element accidentally occurs multiple times (which can happen in generative AI), go figure! With a recall that can grow without bound, the F1 score approaches 200.
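For instance (a toy call, just to make the numbers concrete):

    # 10 duplicated copies of a correct prediction, 2 reference items:
    print(misleading_recall(["x"] * 10, ["x", "y"]))   # 5.0, i.e. a "recall" of 500%

With a precision of 1.0, plugging that recall into F1 = 2PR/(P+R) already gives about 1.67, well past the ceiling of 1 (i.e., 100%).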

I observed this in the evaluation of semantic parsing, a popular NLP task that has even been targeted by NeurIPS papers. Potentially, there are other applications that also suffer from this bug.

I prepared a GitHub repository so you can easily reproduce this quirk!

Point 4) “Optimistic evaluation”

There’s been a very popular ACL paper on low-resource text classification with a super cool and simple method: GZIP distance and kNN! It shows promising results, but they are not quite as strong as reported in the paper (the method doesn’t actually outperform BERT), due to a quirk in the evaluation. See my writeup here, and Ken Schutte’s blogpost here.
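To give an idea of how such tie-breaking can inflate a score, here is a schematic sketch (the function, the data, and the strict back-off strategy are mine, not the original code), assuming a k=2 nearest-neighbor classifier: counting a tie as correct whenever either neighbor matches the gold label is far more generous than committing to a prediction without looking at the gold label.

    def knn2_accuracy(neighbor_labels, gold_labels, optimistic=True):
        # neighbor_labels: per test instance, the labels of its two nearest neighbors.
        correct = 0
        for (nn1, nn2), gold in zip(neighbor_labels, gold_labels):
            if nn1 == nn2:
                correct += (nn1 == gold)
            elif optimistic:
                correct += (gold in (nn1, nn2))   # tie resolved by peeking at the gold label
            else:
                correct += (nn1 == gold)          # e.g., back off to the single nearest neighbor
        return correct / len(gold_labels)

    # Three test instances; the first two have ties between their two nearest neighbors.
    neighbor_labels = [("pos", "neg"), ("neg", "pos"), ("pos", "pos")]
    gold_labels = ["neg", "pos", "pos"]
    print(knn2_accuracy(neighbor_labels, gold_labels, optimistic=True))    # 1.0
    print(knn2_accuracy(neighbor_labels, gold_labels, optimistic=False))   # 0.33...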

Point 5) “Wrong prediction can increase score”

See Section 4.7 in the metric overview.
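To make this concrete, here is a small toy construction of my own (not the example from the paper): a near-chance classifier on an imbalanced three-class problem, where appending one additional, clearly wrong prediction nudges MCC upward.

    from sklearn.metrics import matthews_corrcoef

    def labels_from_confusion(confusion):
        # Expand a {gold: {pred: count}} confusion matrix into label lists.
        y_true, y_pred = [], []
        for gold, row in confusion.items():
            for pred, count in row.items():
                y_true += [gold] * count
                y_pred += [pred] * count
        return y_true, y_pred

    # Near-chance classifier: class "c" is never predicted, class "b" is very rare.
    confusion = {
        "a": {"a": 52, "b": 48},
        "b": {"a": 1, "b": 1},
        "c": {"a": 25, "b": 25},
    }
    y_true, y_pred = labels_from_confusion(confusion)

    print(matthews_corrcoef(y_true, y_pred))                  # ~0.010
    # Append one extra instance with a WRONG prediction (gold "c", predicted "b"):
    print(matthews_corrcoef(y_true + ["c"], y_pred + ["b"]))  # ~0.014, the score went up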

Point 6) “Concept of balance”

For example, one notion of such a “balance” could be understood as a wish for prevalence-invariance: the metric should yield the same score when the label prevalences differ (e.g., 95/5 positive/negative class vs. 5/95 positive/negative class). This is accomplished by a very simple metric: macro Recall. See property V and Section 4.2 in the metric overview.
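To make prevalence-invariance concrete, here’s a small sketch with toy labels of my own: the classifier’s per-class recalls are held fixed (80% on “pos”, 60% on “neg”), while the class prevalences are flipped from 95/5 to 5/95. Accuracy moves a lot; macro Recall doesn’t.

    from sklearn.metrics import accuracy_score, recall_score

    def scores(y_true, y_pred):
        return (accuracy_score(y_true, y_pred),
                recall_score(y_true, y_pred, average="macro"))

    # Scenario A: 95 "pos" / 5 "neg"; recalls: 76/95 = 0.8 on "pos", 3/5 = 0.6 on "neg".
    a_true = ["pos"] * 95 + ["neg"] * 5
    a_pred = ["pos"] * 76 + ["neg"] * 19 + ["neg"] * 3 + ["pos"] * 2

    # Scenario B: 5 "pos" / 95 "neg"; recalls: 4/5 = 0.8 on "pos", 57/95 = 0.6 on "neg".
    b_true = ["pos"] * 5 + ["neg"] * 95
    b_pred = ["pos"] * 4 + ["neg"] * 1 + ["neg"] * 57 + ["pos"] * 38

    print(scores(a_true, a_pred))   # accuracy 0.79, macro Recall 0.70
    print(scores(b_true, b_pred))   # accuracy 0.61, macro Recall 0.70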