ML Evaluation Standards

Orals

A Siren Song of Open Source Reproducibility

Edward Raff & Andrew L Farris

[PDF] [Video] [Poster]

As reproducibility becomes a greater concern, conferences have largely converged to a strategy of asking reviewers to indicate whether code was attached to a submission. This is part of a larger trend of taking action based on assumed ideals, without studying if those actions will yield the desired outcome. Our argument is that this focus on code for replication is misguided if we want to improve the state of reproducible research. This focus can be harmful --- we should not force code to be submitted. Due to the lack of evidence for actions taken today, we argue that it is clear that conferences must do more to encourage and reward the study of reproducibility itself, so that we can learn what actions should be taken.

Experimental Standards for Deep Learning Research: A Natural Language Processing Perspective

Dennis Ulmer, Elisa Bassignana, Max Müller-Eberstein, Daniel Varab, Mike Zhang, Christian Hardmeier & Barbara Plank

[PDF] [Video] [Poster]

The field of Deep Learning (DL) has undergone explosive growth during the last decade, with a substantial impact on Natural Language Processing (NLP) as well. Yet, as with other fields employing DL techniques, there has been a lack of common experimental standards compared to more established disciplines. Starting from fundamental scientific principles, we distill ongoing discussions on experimental standards in DL into a single, widely-applicable methodology. Following these best practices is crucial to strengthening experimental evidence, improve reproducibility and enable scientific progress. These standards are further collected in a public repository to help them transparently adapt to future needs.

A Case for Better Evaluation Standards in NLG

Sebastian Gehrmann, Elizabeth Clark & Thibault Sellam

[PDF]

Evaluating natural language generation (NLG) models has become a popular and active field of study, which has led to the release of novel datasets, automatic metrics, and human evaluation methods. Yet, newly established best practices are often not adopted. Moreover, the research process is often hindered by the scarcity of released resources like model outputs, and a lack of documentation of evaluation parameters often complicates judging new NLG methods. We analyze 66 papers published in 2021 across 29 different dimensions to quantify this effect, and identify promising ways for the research community to improve reporting and reviewing experimental results.

Integrating Rankings into Quantized Scores in Peer Review

Yusha Liu, Yichong Xu, Nihar B. Shah & Aarti Singh

[PDF]

In peer review, reviewers are usually asked to provide scores for the papers. The scores are then used by Area Chairs or Program Chairs in various ways in the decision-making process. The scores are usually elicited in a quantized form to accommodate the limited cognitive ability of humans to describe their opinions in numerical values. It has been found that the quantized scores suffer from a large number of ties, thereby leading to a significant loss of information. To mitigate this issue, conferences have started to ask reviewers to additionally provide a ranking of the papers they have reviewed. There are however two key challenges. First, there is no standard procedure for using this ranking information and Area Chairs may use it in different ways (including simply ignoring them), thereby leading to arbitrariness in the peer-review process. Second, there are no suitable interfaces for judicious use of this data nor methods to incorporate it in existing workflows, thereby leading to inefficiencies. We take a principled approach to integrate the ranking information into the scores. The output of our method is an updated score pertaining to each review that also incorporates the rankings. Our approach addresses the two aforementioned challenges by: (i) ensuring that rankings are incorporated into the updates scores in the same manner for all papers, thereby mitigating arbitrariness, and (ii) allowing to seamlessly use existing interfaces and workflows designed for scores. We empirically evaluate our method on synthetic datasets as well as on peer reviews from the ICLR 2017 conference, and find that it reduces the error by approximately 30\% as compared to the best performing baseline on the ICLR 2017 data.

Tradeoffs in Preventing Manipulation in Paper Bidding for Reviewer Assignment

Steven Jecmen, Nihar B. Shah, Fei Fang & Vincent Conitzer

[PDF] [Video]

Many conferences rely on paper bidding as a key component of their reviewer assignment procedure. These bids are then taken into account when assigning reviewers to help ensure that each reviewer is assigned to suitable papers. However, despite the benefits of using bids, reliance on paper bidding can allow malicious reviewers to manipulate the paper assignment for unethical purposes (e.g., getting assigned to a friend's paper). Several different approaches to preventing this manipulation have been proposed and deployed. In this paper, we enumerate certain desirable properties that algorithms for addressing bid manipulation should satisfy. We then offer a preliminary analysis of various approaches along with directions for future investigation.

Posters

Does the Market of Citations Reward Reproducible Work?

Edward Raff

[PDF] [Video] [Poster]

The field of bibliometrics, studying citations and behavior, is critical to the the discussion of reproducibility as citations are one of the primary incentive and reward systems for academic work. Yet to the best of our knowledge, only one work has attempted to look at this combined space, concluding that non-reproducible work is more highly cited. We show that answering this question is more challenging than first proposed, and subtle issues can inhibit a robust conclusion. To make inferences with more robust behavior, we propose a hierarchical Bayesian model that incorporates the citation rate over time, rather than the total number of citations after a fixed amount of time. In doing so we show that, under current evidence the the answer is more likely that certain fields of study such as Medicine and Machine Learning (ML) do reward reproducible works with more citations, but other fields appear to have no relationship. Further, we find that making code available and thoroughly referencing prior works appear to also positively correlate with increased citations.

What is Your Metric Telling You? Evaluating Classifier Calibration under Context-Specific Definitions of Reliability

John Kirchenbauer, Jacob R Oaks & Eric Heim

[PDF] [Video] [Poster]

Classifier calibration has received recent attention from the machine learning community due both to its practical utility in facilitating decision making, as well as the observation that modern neural network classifiers are poorly calibrated. Much of this focus has been towards the goal of learning classifiers such that their output with largest magnitude (the “predicted class”) is calibrated. However, this narrow interpretation of classifier outputs does not adequately capture the variety of practical use cases in which classifiers can aid in decision making. In this work, we argue that more expressive metrics must be developed that accurately measure calibration error for the specific context in which a classifier will be deployed. To this end, we derive a number of different metrics using a generalization of Expected Calibration Error (ECE) that measure calibration error under different definitions of reliability. We then provide an extensive empirical evaluation of commonly used neural network architectures and calibration techniques with respect to these metrics. We find that: 1) definitions of ECE that focus solely on the predicted class fail to accurately measure calibration error under a selection of practically useful definitions of reliability and 2) many common calibration techniques fail to improve calibration performance uniformly across ECE metrics derived from these diverse definitions of reliability.

System Analysis for Responsible Design of Modern AI/ML Systems

Virginia Goodwin & Rajmonda S. Caceres

[PDF] [Poster]

The irresponsible use of ML algorithms in practical settings has received a lot of deserved attention in the recent years. We posit that the traditional system analysis perspective is needed when designing and implementing ML algorithms and systems. Such perspective can provide a formal way for evaluating and enabling responsible ML practices. In this paper, we review components of the System Analysis methodology and highlight how they connect and enable responsible practices of ML design.

A Brief Guide to Designing and Evaluating Human-Centered Interactive Machine Learning

Kory W. Mathewson & Patrick Pilarski

[PDF] [Video]

Interactive machine learning (IML) is a field of research that explores how to leverage both human and computational abilities in decision making systems. IML represents a collaboration between multiple complimentary human and machine intelligent systems working as a team, each with their own unique abilities and limitations. This teamwork might mean that both systems take actions at the same time, or in sequence. Two major open research questions in the field of IML are: ``How should we design systems that can learn to make better decisions over time with human interaction?'' and ``How should we evaluate the design and deployment of such systems?'' A lack of appropriate consideration for the humans involved can lead to problematic system behaviour, and issues of fairness, accountability, and transparency. Thus, our goal with this work is to present a human-centred guide to designing and evaluating IML systems while mitigating risks. This guide is intended to be used by machine learning practitioners who are responsible for the health, safety, and well-being of interacting humans. An obligation of responsibility for public interaction means acting with integrity, honesty, fairness, and abiding by applicable legal statutes. With these values and principles in mind, we as a machine learning research community can better achieve goals of augmenting human skills and abilities. This practical guide therefore aims to support many of the responsible decisions necessary throughout the iterative design, development, and dissemination of IML systems.

deep-significance - Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks

Dennis Ulmer, Christian Hardmeier & Jes Frellsen

[PDF] [Video] [Poster]

A lot of Machine Learning (ML) and Deep Learning (DL) research is of an empirical nature. Nevertheless, statistical significance testing (SST) is still not widely used. This endangers true progress, as seeming improvements over a baseline might be statistical flukes, leading follow-up research astray while wasting human and computational resources. Here, we provide an easy-to-use package containing different significance tests and utility functions specifically tailored towards research needs and usability.

Reproducible Subjective Evaluation

Max Morrison, Brian Tang, Gefei Tan & Bryan Pardo

[PDF] [Video] [Poster]

Human perceptual studies are the gold standard for the evaluation of many research tasks in machine learning, linguistics, and psychology. However, these studies require significant time and cost to perform. As a result, many researchers use objective measures that can correlate poorly with human evaluation. When subjective evaluations are performed, they are often not reported with sufficient detail to ensure reproducibility. We propose Reproducible Subjective Evaluation (ReSEval), an open-source framework for quickly deploying crowdsourced subjective evaluations directly from Python. ReSEval lets researchers launch A/B, ABX, Mean Opinion Score (MOS) and MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA-style) tests on audio, image, text, or video data from a command-line interface or using one line of Python, making it as easy to run as objective evaluation. With ReSEval, researchers can reproduce each other's subjective evaluations by sharing a configuration file and the audio, image, text, or video files.

Increasing Confidence in Adversarial Robustness Evaluations

Roland S. Zimmermann, Wieland Brendel, Florian Tramer & Nicholas Carlini

[PDF] [Video] [Poster]

Hundreds of defenses have been proposed in the past years to make deep neural networks robust against minimal (adversarial) input perturbations. However, only a handful of these could hold up their claims because correctly evaluating robustness is extremely challenging: Weak attacks often fail to find adversarial examples even if they unknowingly exist, thereby making a vulnerable network look robust. In this paper, we propose a test to identify weak attacks. Our test introduces a small and simple modification into a neural network that guarantees the existence of an adversarial example for every sample. Consequentially, any correct attack must succeed in attacking this modified network. For eleven out of thirteen previously-published defenses, the original evaluation of the defense fails our test while stronger attacks that break these defenses pass it. We hope that attack unit tests such as ours will be a major component in future robustness evaluations and increase confidence in an empirical field that today is riddled with skepticism and disbelief.

Machine Learning State-of-the-Art with Uncertainties

Peter Steinbach, Felicita Gernhardt, Mahnoor Tanveer, Steve Schmerler & Sebastian Starke

[PDF] [Video] [Poster]

With the availability of data, hardware, software ecosystem and relevant skill sets, the machine learning community is undergoing a rapid development with new architectures and approaches appearing at high frequency every year. In this article, we conduct an exemplary image classification study in order to demonstrate how confidence intervals around accuracy measurements can greatly enhance the communication of research results as well as impact the reviewing process. In addition, we explore the hallmarks and limitations of this approximation. We discuss the relevance of this approach reflecting on a spotlight publication of ICLR22. A reproducible workflow is made available as open-source adjoint to this publication. Based on our discussion, we make suggestions for improving the authoring and reviewing process of machine learning articles.

Towards Clear Expectations for Uncertainty Estimation

Victor Bouvier, Simona Maggio, Alexandre Abraham & Léo Dreyfus-Schmidt

[PDF] [Video] [Poster]

If Uncertainty Quantification (UQ) is crucial to achieve trustworthy ML, most UQ methods suffer from disparate and inconsistent evaluation protocols. We claim this inconsistency results from the uncleared requirements the community expects from UQ. This opinion paper offers a new perspective by specifying those requirements through five downstream tasks where we expect uncertainty scores to have substantial predictive power. On an example benchmark of 7 classification datasets, we did not observe statistical superiority of state-of-the-art intrinsic UQ methods against simple baselines. We believe that our findings question the very rationale of why we quantify uncertainty and call for a standardized protocol for UQ evaluation based on metrics proven to be relevant for the ML practitioner.

Rethinking Machine Learning Model Evaluation in Pathology

Syed Ashar Javed, Dinkar Juyal, Shreya Chakraborty, Zahil Shanis, Harsha Pokkalla & Aaditya Prakash

[PDF] [Video] [Poster]

Machine Learning has been applied to pathology images in research and clinical practice with promising outcomes. However, standard ML models often lack the high quality or rigorous evaluation required for clinical decisions. Most of these models trained on natural images are also ill-equipped to deal with pathology images that are significantly large and noisy, require expensive labeling, are hard to interpret, and are susceptible to spurious correlations. We propose a set of highly relevant and practical guidelines for ML evaluations in pathology that address the above concerns. The paper includes measures for setting up the evaluation framework, efficiently dealing with variability in labels, and a recommended suite of tests to address issues related to domain shift, robustness, and confounding variables. We hope that the proposed framework will bridge the gap between ML researchers and domain experts, leading to wider adoption of ML techniques in pathology and improving patient outcomes.

Strengthening Subcommunities: Towards Sustainable Growth in AI Research

Andi Peng, Jessica Zosa Forde, Yonadav Shavit & Jonathan Frankle

[PDF]

AI’s rapid growth has been felt acutely by scholarly venues, leading to growing pains within the peer review process. These problems largely center on the inability of specific sub-areas to identify and evaluate work that is appropriate according to criteria relevant to each subcommunity as determined by stakeholders of that sub-area. We set forth a proposal that re-focuses efforts within these subcommunities through a decentralization of the reviewing and publication process. Through this re-centering effort, we hope to encourage each sub-area to confront the issues specific to their process of academic publication and incentivization. This model has been successful for several subcommunities in AI, and we highlight those instances as examples for how the broader field can continue to evolve despite its continually growing size.

A meta analysis of data-driven newsvendor approaches

Simone Buttler, Andreas Philippi, Nikolai Stein & Richard Pibernik

[PDF] [Video] [Poster]

Recently, a number of publications in leading operations management and operations research journals proposed new models that combine machine learning and mathematical optimization techniques to predict inventory decisions directly from historical demand and additional feature information. This paper presents the results of a meta analysis of recent machine learning based approaches for solving the most prominent problem in operations management, the newsvendor problem. We find that the reproducibility of existing studies is typically low, because authors evaluate their new approaches based on small and proprietary datasets, do not share data and code, and use different benchmarks. We develop a reproducible, unified evaluation procedure and apply various models to a large and heterogeneous dataset. Our results do not support the findings and claims of most of the recent papers and, in several cases, we even obtain contradicting results. Overall, the robustness of the newly proposed models appears to be low. To support both researchers and practitioners in the development and evaluation of new models, we provide extensive benchmark data and a python library that contains open source implementations of most existing models.

CheckDST: Measuring Real-World Generalization of Dialogue State Tracking Performance

Hyundong Cho, Chinnadhurai Sankar, Christopher Lin, Kaushik Ram Sadagopan, Shahin Shayandeh, Asli Celikyilmaz, Jonathan May & Ahmad Beirami

[PDF] [Video] [Poster]

Neural models that extend the pretrain-then-finetune paradigm continue to achieve new state-of-the-art results in dialogue state tracking (DST) benchmarks on joint goal accuracy (JGA). However, motivated by CheckList (Ribeiro et al. 2020), we argue for a holistic assessment of DST models since JGA is unable to capture robustness to the inevitable test-time distribution shifts. To this end, we build on recent work on robustness testing in task-oriented dialogue and introduce CheckDST, an instantiation of CheckList for DST that quantifies robustness with test set augmentations and new metrics that measure consistency. Using CheckDST, we are able to extensively compare state-of-the-art DST models, finding that, although span-based classification models achieve slightly better JGA on the original test set than generation models, they are significantly less robust to distribution shift. Secondly, we observe that while stopping training early, e.g. at the first epoch, hurts JGA, the resulting models are significantly more robust to distribution shift. Lastly, guided by the weaknesses exposed by CheckDST, we explore training DST models that simultaneously boost JGA and CheckDST metrics and report preliminary success with PrefineDST, a simple generation model pretrained with non-target datasets to internalize reasoning skills relevant to dialogue state tracking.

A Survey On Uncertainty Toolkits For Deep Learning

[PDF] [Poster]

The success of deep learning (DL) fostered the creation of unifying frameworks such as tensorflow or pytorch as much as it was driven by their creation in return. Having common building blocks facilitates the exchange of, e.g., models or concepts and makes developments easier replicable. Nonetheless, robust and reliable evaluation and assessment of DL models has often proven challenging. This is at odds with their increasing safety relevance, which recently culminated in the field of “trustworthy ML”. We believe that, among others, further unification of evaluation and safeguarding methodologies in terms of toolkits, i.e. small and specialized framework derivatives, might positively impact problems of trustworthiness as well as reproducibility. To this end, we present the first survey on toolkits for uncertainty estimation (UE) in DL, as UE forms a cornerstone in assessing model reliability. We investigate 11 toolkits with respect to modeling and evaluation capabilities, providing an in-depth comparison for the 3 most promising ones, namely Pyro, Tensorflow Probability, and Uncertainty Quantification 360. While the first two provide a large degree of flexibility and seamless integration into their respective framework, the last one has the larger methodological scope.

A Revealing Large-Scale Evaluation of Unsupervised Anomaly Detection Algorithms

Maxime Alvarez, Jean-Charles Verdier, D'Jeff Kanda Nkashama, Marc Frappier, Pierre Martin Tardif & Froduald Kabanza

[PDF] [Video] [Poster]

Anomaly detection has many applications ranging from bank-fraud detection and cyber-threat detection to equipment maintenance and health monitoring. However, choosing a suitable algorithm for a given application remains a challenging design decision, often informed by the literature of anomaly detection algorithms. We extensively reviewed twelve of the most popular unsupervised anomaly detection methods. We observed that, so far, they have been compared using inconsistent protocols – the choice of the class of interest or the positive class, the split of training and test data, and choice of hyperparameters – leading to ambiguous evaluations. This observation led us to define a coherent evaluation protocol which we then used to produce an updated and more precise picture of the relative performance of the twelve methods on five widely used tabular datasets. While our evaluation cannot pinpoint a method that outperforms all the others on all datasets, it identifies those that stand out and revise misconceived knowledge about their relative performances.

Rethinking Streaming Machine Learning Evaluation

Shreya Shankar, Bernease R Herman & Aditya Parameswaran

[PDF]

While most work on evaluating machine learning (ML) models focuses on batches of data, computing the same metrics in a streaming setting (i.e., unbounded, timestamp-ordered datasets) fails to accurately identify when models are performing unexpectedly. In this position paper, we discuss how sliding windows--that ML metrics are evaluated over--can be negatively affected by real-world phenomena (e.g., delayed arrival of labels) and propose additional metrics to assess streaming ML performance.

Towards Yet Another Checklist for New Datasets

Stefan Larson

[PDF] [Video] [Poster]

The goal of this opinion paper is to start (or more accurately, continue) a conversation on a checklist-style artifact for researchers to use when introducing a new dataset. While other checklists exist and are widely used, this paper proposes the inclusion of checklist-style questions to encourage dataset developers (as well as consumers) to consider data quality, diversity, and evaluation, especially as it relates to machine learning model robustness and generalizability

Are Ground Truth Labels Reproducible? An Empirical Study

Ka Wong, Praveen Paritosh & Kurt Bollacker

[PDF] [Poster]

Standard evaluation techniques in Machine Learning (ML) and the corresponding statistical inference do not explicitly consider ground truth labels as a source of uncertainty, i.e., they assume that the benchmarks are reproducible. We investigate the reliability of ground truth labels in nine highly cited evaluation datasets. Via replication, we find the majority votes in three datasets to have zero reliability. They cannot be reproduced. The cause of irreproducibility is excessive rater disagreement, as evidenced in the zero inter-rater reliability. Contrary to popular belief, majority voting fails to have a material impact in this case. We conduct a smaller pilot using raters with high qualifications and find significant improvement in reliability across the board. This suggests high quality data collection is still paramount, and cannot be replaced by aggregation. We urge researchers, reviewers, and the publication processes (such as reproducibility checklists) to encourage the measurement and reporting of reproducibility in the ground truth data used in ML evaluation. Towards this end, we publish and release all the replications and associated data to aid assessment of reproducibility of this work.

Incentivizing Empirical Science in Machine Learning: Problems and Proposals

Preetum Nakkiran & Mikhail Belkin

[PDF]

We introduce a proposal to help address a structural problem in ML publishing: the lack of community support and perceived lack of legitimacy for experimental scientific work that neither proves a mathematical theorem, nor improves a practical application. Such experimental work is the bedrock of many fields of science, yet is not well appreciated by top ML publication venues (e.g. NeurIPS, ICML, ICLR). The problem is twofold: reviewers are often unaware of the value of such work, and thus authors are disincentivized from producing and submitting such work. The result is a suffocation of a scientific methodology that has a long history of success in the natural sciences, and has recently been fruitful in machine learning. To address this, we propose introducing a Best Paper Award specifically for foundational experimental work in machine learning. The award targets scientific work that is missed by existing communities: we exclude primarily theoretical work and application-motivated work, both of which are well supported by existing venues (e.g. COLT, CVPR). We propose that ML venues include a subject-area for ``scientific aspects of machine learning'', and consider papers in this subject for the award. More ambitiously, it can be implemented as an endowed yearly award considering all papers in the prior year. We expect that establishing an award will help legitimize this research area, establish standards for such scientific work, and encourage authors to conduct this work with the support of the community. In this proposal, we first discuss the structural problems in ML publication which we hope to address. We then present a call-for-papers for the ``science of ML'' subject area, describing the type of work we want to encourage. We argue that it is not only a scientifically legitimate type of work, but perhaps even a necessary type of work. Finally, we discuss guidelines for how such papers may be evaluated by reviewers.

Why External Validity Matters for Machine Learning Evaluation: Motivation and Open Problems

Thomas I. Liao, Rohan Taori & Ludwig Schmidt

[PDF] [Video] [Poster]

New machine learning methods often fail to perform as expected on datasets similar to benchmarks reported on in their respective papers. These performance gaps pose a challenge for evaluation: both researchers and practitioners expect (or hope) that machine learning models which perform well on a dataset designed for a task perform well on other datasets matched to that task. We argue that external validity, the relationships between tasks and the learning problems which instantiate them, is understudied. We highlight the ways in which algorithm developers and benchmark creators fail to address this concern of external validity, suggest some remedies, and identify open questions in external validity which would help the community better build benchmarks and understand model performance.

A Quality-Diversity-based Evaluation Strategy for Symbolic Music Generation

Berker Banar & Simon Colton

[PDF] [Video] [Poster]

Since human (audience) evaluation methods are challenging due to their logistic inconvenience, symbolic music generation systems are typically evaluated using loss-based measures or some statistical analyses with pre-defined musical metrics. Even though these loss-based and statistical methods could be informative to some extent, they often cannot guarantee any success for the generative model in terms of higher-level musical qualities, such as style / genre. Also, as another aspect of evaluation, diversity of the generated material is not considered for symbolic music generators. In this study, we argue that Quality-Diversity-based evaluation approach is more appropriate to value symbolic music generators. We give a few examples of where loss-based and statistical methods fail and suggest some techniques for quality-based and diversity-based evaluation, jointly forming a Quality-Diversity-based evaluation strategy.