Democratizing and Personalizing Evaluation with ∞-Benchmarks: Sample-Level Heterogeneous Testing Over Arbitrary Capabilities

Abstract

Traditional fixed benchmark datasets fall short in quantifying the open-ended potential of foundation models. We propose ∞-benchmarks, a new testing paradigm that combines arbitrary evaluation datasets into one unified, ever-expanding pool from which personalized evaluations can be flexibly generated. The ∞-benchmark pool allows anyone to dynamically select their preferred collection of sample-level tests, tailoring the assessment to their specific capabilities of interest. By aggregating and reusing samples across various test sets, ∞-benchmarks allow for the assessment of diverse capabilities beyond those covered by the original test sets, while mitigating overfitting and dataset bias through real-world diversity. Most importantly, it reframes model evaluation as a collective process of aggregating and selecting sample-level tests. The shift from multi-task benchmarks to ∞-benchmarks introduces two key challenges: (1) heterogeneity and (2) incompleteness. Heterogeneity involves aggregating diverse metrics, including binary, numeric, and ordinal annotations, while incompleteness concerns comparing models evaluated on different, unequal subsets of test data. To address these challenges, we explore algorithms inspired by social choice theory that aggregate sparse, unequal measurements into reliable model scores. Our aggregation algorithms ensure identifiability (asymptotically recovering ground-truth scores) and rapid convergence, enabling accurate model comparisons with relatively little data. We introduce ∞-LLMBench for language models and ∞-LMMBench for multimodal models, unifying evaluations across leaderboards and arenas in these domains, and showcasing targeted querying over a wide range of capabilities. Our aggregation algorithms recover ground-truth rankings with over 0.9 Kendall Tau correlation when compared with standard aggregation measures on homogeneous metrics, even with up to 95% of data samples missing. This approach reduces evaluation cost by up to 20x with little to no compromise in performance. Overall, we present the first large-scale ∞-benchmarks for lifelong, efficient evaluation of language and multimodal models, which can aggregate open-ended, heterogeneous sample-level tests and evolve alongside the rapid development of foundation models.
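To make the setting concrete, here is a minimal sketch (not the paper's released code or its actual algorithm) of the aggregation problem the abstract describes: a sparse sample-by-model score matrix with mixed metric types is collapsed into model rankings via a simple Borda-style rank aggregation, a basic social-choice method used here as a stand-in, and the recovered ranking is compared to the ground-truth ordering with Kendall's tau. All names, sizes, and noise levels below are illustrative assumptions.

```python
# Minimal sketch of heterogeneous, incomplete sample-level aggregation.
# Assumptions (hypothetical): latent model "abilities" generate per-sample
# scores; half the samples are binarized (accuracy-style), the rest numeric;
# 95% of entries are missing. Aggregation: per-sample Borda ranks, averaged.
import numpy as np
from scipy.stats import kendalltau, rankdata

rng = np.random.default_rng(0)
n_samples, n_models = 20_000, 10

# Latent abilities define the ground-truth model ordering.
ability = rng.normal(size=n_models)

# Dense per-sample measurements correlated with ability, plus noise.
scores = ability[None, :] + rng.normal(scale=1.0, size=(n_samples, n_models))

# Heterogeneity: binarize half the samples so raw values are not comparable.
scores[: n_samples // 2] = (scores[: n_samples // 2] > 0).astype(float)

# Incompleteness: keep only ~5% of entries (95% missing).
observed = rng.random((n_samples, n_models)) < 0.05
sparse = np.where(observed, scores, np.nan)

def borda_aggregate(score_matrix):
    """Rank observed models within each sample, normalize ranks to [0, 1],
    and average per model over the samples where it was evaluated."""
    total = np.zeros(score_matrix.shape[1])
    counts = np.zeros(score_matrix.shape[1])
    for row in score_matrix:
        obs = ~np.isnan(row)
        k = obs.sum()
        if k < 2:
            continue  # a lone observation carries no comparative signal
        ranks = (rankdata(row[obs]) - 1) / (k - 1)  # ties get averaged ranks
        total[obs] += ranks
        counts[obs] += 1
    return total / np.maximum(counts, 1)

recovered = borda_aggregate(sparse)
tau, _ = kendalltau(ability, recovered)
print(f"Kendall tau vs. ground-truth ordering (95% missing): {tau:.2f}")
```

Rank-normalizing within each sample is one simple way to make binary and numeric annotations commensurable before averaging; the paper's algorithms are more principled, but the sketch shows why sparse, per-sample comparisons can still pin down a global model ordering.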

Sebastian Dziadzio
PhD candidate
Vishaal Udandarao
PhD candidate

My research interests include multi-modal (vision-language) learning, self-supervised representation learning and continual learning.

Ameya Prabhu
Postdoc

My research interests include Data-Centric ML, Continual Learning on Foundation Models, and Automated Theorem Proving.

Matthias Bethge
Professor for Computational Neuroscience and Machine Learning & Director of the Tübingen AI Center

Matthias Bethge is Professor for Computational Neuroscience and Machine Learning at the University of Tübingen and Director of the Tübingen AI Center, a joint center of the University of Tübingen and the MPI for Intelligent Systems that is part of the German AI strategy.