Published, Version of Record (VoR)Open Access via Drexel Libraries Read and Publish Program 2025CC BY V4.0, Open
Abstract
Information systems Recommender systems Retrieval effectiveness
Although originally developed to evaluate sets of items, recall is often used to evaluate rankings of items, including those produced by recommender, retrieval, and other machine learning systems. The application of recall without a formal evaluative motivation has led to criticism of recall as a vague or inappropriate measure. In light of this debate, we reflect on the measurement of recall in rankings from a formal perspective. Our analysis is composed of three tenets: recall, robustness, and lexicographic evaluation. First, we formally define ‘recall-orientation’ as the sensitivity of a metric to a user interested in finding every relevant item. Second, we analyze recall-orientation from the perspective of robustness with respect to possible content consumers and providers, connecting recall to recent conversations about fair ranking. Finally, we extend this conceptual and theoretical treatment of recall by developing a practical preference-based evaluation method based on lexicographic comparison. Through extensive empirical analysis across multiple recommendation and retrieval tasks, we establish that our new evaluation method, lexirecall, has convergent validity (i.e., it is correlated with existing recall metrics) and exhibits substantially higher sensitivity in terms of discriminative power and stability in the presence of missing labels. Our conceptual, theoretical, and empirical analysis substantially deepens our understanding of recall and motivates its adoption through connections to robustness and fairness.
Metrics
18 Record Views
Details
Title
Recall, Robustness, and Lexicographic Evaluation
Creators
Fernando Diaz - Carnegie Mellon University
Michael D. Ekstrand (Corresponding Author) - Drexel University
Bhaskar Mitra - Microsoft (Canada)
Publication Details
ACM transactions on recommender systems, v 4(1), pp 1-50