Talks

Invited speakers

Thomas Wolf

The challenges of open evaluation at Hugging Face: from open-source evaluation libraries to evaluating +30k hosted models and models of 176B parameters

12h15, 29th April 2022 (UTC)
In this talk I'll cover some recent work on evaluation projects at Hugging Face, whose distinctive feature is the development of open-source, community-based solutions to some of today's challenges in evaluation and reproducibility: from new libraries for metrics, to the evaluation of 30k+ models on the Hub, to the evaluation of BigScience's 176B-parameter model.

Thomas Wolf is co-founder and Chief Science Officer of HuggingFace. The tools created by Thomas Wolf and the HuggingFace team are used across more than 5000 research organisations including Facebook Artificial Intelligence Research, Google Research, DeepMind, Amazon Research, Apple, the Allen Institute for Artificial Intelligence as well as most university departments.

Thomas Wolf is the initiator and senior chair of “BigScience”, the largest research collaboration that has ever existed in Artificial Intelligence, and the creator of a set of widely used libraries and tools (Hugging Face).

Thomas Wolf is also a prolific educator and a thought leader in the fields of Artificial Intelligence and Natural Language Processing, and a regular invited speaker at conferences around the world.

Frank Schneider

Improving Optimizer Evaluation in Deep Learning

13h00, 29th April 2022 (UTC)

Although hundreds of optimization algorithms have been proposed for deep learning, there is no widely agreed-upon protocol for evaluating their efficiency, performance, and usability. Instead, the crucial choice of optimizer is too often based on personal anecdote rather than grounded empirical evidence.

In this talk, we present strategies for comparing deep learning optimizers that take into account the unique challenges of deep learning, such as the inherent stochasticity and the crucial distinction between learning and pure optimization. These strategies are formalized and automated in the Python package DeepOBS, which allows fairer, faster, and more convincing empirical comparisons of deep learning optimizers.
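
For concreteness, here is a minimal sketch of how a single baseline run might be launched with DeepOBS's PyTorch StandardRunner, loosely following the package documentation; the hyperparameter schema, the `testproblem` keyword, and the chosen test problem are assumptions that may differ across DeepOBS versions.

```python
# Hypothetical sketch of a DeepOBS benchmark run; argument names follow the
# documented PyTorch runner interface but may vary between versions.
from torch.optim import SGD
from deepobs import pytorch as pt

# Declare which hyperparameters the optimizer exposes to the runner.
hyperparams = {
    "lr": {"type": float},
    "momentum": {"type": float, "default": 0.9},
}

runner = pt.runners.StandardRunner(SGD, hyperparams)

# Run SGD on one of the predefined test problems with a fixed budget;
# repeating this over several seeds and test problems yields the kind of
# standardized comparison described in the talk.
runner.run(testproblem="quadratic_deep", hyperparams={"lr": 0.01}, num_epochs=10)
```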

Following this protocol, we report insights from our independent, third-party evaluation of the field's current state. A thorough comparison of fifteen popular deep learning optimizers, using roughly 50,000 individual runs, reveals that the comparably traditional Adam optimizer remains a strong but not dominating contender and that newer methods fail to consistently outperform it.

As an adjacent research direction to benchmarks, new debugging tools, such as Cockpit, allow for a more detailed evaluation of the training process of neural networks beyond just the loss or the model's performance. These tools could disentangle the many factors contributing to (un)successful neural network training, helping us understand whether training improvements are the result of better models, better algorithms, or better hyperparameters.

Frank Schneider is a Ph.D. student in the Methods of Machine Learning group, supervised by Prof. Dr. Philipp Hennig, at the University of Tübingen in Germany. His research focuses on making deep learning more user-friendly. He has previously published work on new debugging tools for neural network training and on improving the evaluation process of optimization algorithms for deep learning. He is currently a co-chair of the MLCommons™ Algorithms Working Group. He holds Bachelor's and Master's degrees in Simulation Technology from the University of Stuttgart as well as a Master's degree in Industrial and Applied Mathematics from the Eindhoven University of Technology.

Rotem Dror

A Statistical Analysis of Automatic Evaluation Metrics for Summarization

13h45, 29th April 2022 (UTC)

The quality of a summarization evaluation metric is quantified by calculating the correlation between its scores and human annotations across a large number of summaries. Currently, it is not clear how precise these correlation estimates are, nor whether differences between two metrics’ correlations reflect a true difference or are due to random chance. In this talk, I will address these two problems by proposing methods for calculating confidence intervals and running hypothesis tests for correlations. After evaluating which of the proposed methods is most appropriate for summarization through two simulation experiments, I will analyze the results of applying these methods to several different automatic evaluation metrics across three sets of human annotations. In this research, we find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are. Further, although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do show such improvements in some evaluation settings. This work was published at TACL 2021.
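
As a rough illustration of the kind of analysis being discussed (not the exact procedure from the TACL 2021 paper, which differs in which units are resampled and which tests are used), a percentile-bootstrap confidence interval for a metric–human correlation and a paired bootstrap test for the difference between two metrics might look like this; the function names are hypothetical.

```python
import numpy as np
from scipy.stats import kendalltau

def bootstrap_correlation_ci(metric_scores, human_scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the correlation between an
    automatic metric and human judgments (one score per summary)."""
    metric_scores = np.asarray(metric_scores, dtype=float)
    human_scores = np.asarray(human_scores, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(human_scores)
    taus = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample summaries with replacement
        tau, _ = kendalltau(metric_scores[idx], human_scores[idx])
        taus.append(tau)
    lo, hi = np.percentile(taus, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

def paired_bootstrap_test(metric_a, metric_b, human_scores, n_boot=1000, seed=0):
    """Rough one-sided test of H1: corr(metric_a, human) > corr(metric_b, human)."""
    metric_a = np.asarray(metric_a, dtype=float)
    metric_b = np.asarray(metric_b, dtype=float)
    human_scores = np.asarray(human_scores, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(human_scores)
    deltas = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        tau_a, _ = kendalltau(metric_a[idx], human_scores[idx])
        tau_b, _ = kendalltau(metric_b[idx], human_scores[idx])
        deltas.append(tau_a - tau_b)
    # Approximate p-value: fraction of bootstrap differences that do not favor metric_a.
    return float(np.mean(np.array(deltas) <= 0.0))
```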

In the second part of this talk, I will present an ongoing study that identifies two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate summarization systems in practice, and proposes changes to rectify this disconnect. The results from these analyses point to the need for future research to focus on developing more consistent and reliable human evaluations of summaries.
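
For reference, the standard definition of the system-level correlation that this part of the talk revisits is usually computed as in the sketch below; this is the baseline definition being critiqued, not the proposed replacement, and the function name is hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

def system_level_correlation(metric_scores, human_scores):
    """metric_scores, human_scores: arrays of shape (num_systems, num_documents),
    one score per (system, document) pair."""
    # Average each system's scores over documents, then correlate the two
    # resulting vectors of per-system averages.
    metric_per_system = np.asarray(metric_scores, dtype=float).mean(axis=1)
    human_per_system = np.asarray(human_scores, dtype=float).mean(axis=1)
    r, _ = pearsonr(metric_per_system, human_per_system)
    return r
```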

This research was done in collaboration with Daniel Deutsch, a Ph.D. student from the Cognitive Computation Group at the Department of Computer and Information Science, University of Pennsylvania.

Rotem Dror is a Postdoctoral Researcher in the Cognitive Computation Group at the Department of Computer and Information Science, University of Pennsylvania, where she works with Prof. Dan Roth. She completed her Ph.D. in the Natural Language Processing Group, supervised by Prof. Roi Reichart, at the Faculty of Industrial Engineering and Management at the Technion - Israel Institute of Technology. In her Ph.D. thesis, she discussed Algorithms and Statistically Sound Evaluation of Structured Solutions in Natural Language Processing. For more information: rtmdrr.github.io.

James Evans

AI Advance and the Paradox of Sustained Innovation

16h35, 29th April 2022 (UTC)
I briefly survey advances in AI and Deep Learning, considering them in the light of the paradox of sustainable innovation. Growth-oriented economists from Marx to Schumpeter highlighted the process of creative destruction whereby technical successes (like deep learning) displace and render competitors obsolete, but the multi-disciplinary study of innovation reveals a companion process of destructive creation whereby innovations emerge from contexts of discord and disorder, breaches in the structure of prior success. Surveying thousands of fields, millions of teams, and hundreds of millions of scientists, I demonstrate that at all levels, early success comes to be associated with rigidity in education and research, a shift toward the exploitation of prior knowledge, and a demanded performance of continued, incremental success. This represents a shift away from the costly exploration, failures and abductive surprise that anticipate punctuated advance at the field, team and individual levels. I simulate processes of disruptive search and show that systematically violating established field boundaries is associated with sizeable success in predicting future scientific advances, suggesting diminishing marginal returns to disciplines. I use this as background to consider movements toward open data, the common task framework and evaluation standards to ensure ML reproducibility, showing how they broaden the field, but encourage it to collectively learn more about less. I conclude with a discussion of prospects for sustained innovation in ML, which involve rethinking the institution of education as a sustained experiment, the cultivation of diversity, the speciation of research tasks and approaches, and the pursuit of productive failure, surfing the interface between order and chaos.
James Evans is the Max Palevsky Professor of History and Civilization in Sociology, Director of Knowledge Lab, and Founding Faculty Director of Computational Social Science at the University of Chicago and the Santa Fe Institute. Evans' research uses large-scale data, machine learning and generative models to understand how collectives think and what they know. This involves inquiry into the emergence of ideas, shared patterns of reasoning, and processes of attention, communication, agreement, and certainty. Thinking and knowing collectives like science, Wikipedia or the Web involve complex networks of diverse human and machine intelligences, collaborating and competing to achieve overlapping aims. Evans' work connects the interaction of these agents with the knowledge they produce and its value for themselves and the system. Evans designs observatories for understanding that fuse data from text, images and other sensors with results from interactive crowd sourcing and online experiments. Much of Evans' work has investigated modern science and technology to identify collective biases, generate new leads taking these into account, and imagine alternative discovery regimes. He has identified R&D institutions that generate more and less novelty, precision, density and robustness. Evans also explores thinking and knowing in other domains ranging from political ideology to popular culture. His work has been published in Nature, Science, PNAS, American Sociological Review, American Journal of Sociology and many other outlets.

Melanie Mitchell

Beyond Accuracy: How to Evaluate Understanding on Conceptual Abstraction Benchmarks

18h50, 29th April 2022 (UTC)
The ability to recognize abstract concepts (e.g. “same” vs. “different”) and to make analogies is central to human intelligence, and has received increasing attention in the AI/ML community, with challenge domains such as Raven’s Progressive Matrices, Bongard Problems, and the Abstraction and Reasoning Corpus (ARC). However, the methods typically used to evaluate ML systems on these domains have failed to assess true abstraction and generalization abilities, and have allowed for “shortcut learning” that succeeds on a particular benchmark, but for the wrong reasons. In this talk I will propose a different approach to evaluation, one that attempts to test for degrees of “understanding” of abstract concepts, beyond simple accuracy measures.
Melanie Mitchell is the Davis Professor of Complexity at the Santa Fe Institute. Her current research focuses on conceptual abstraction, analogy-making, and visual recognition in artificial intelligence systems. Melanie is the author or editor of six books and numerous scholarly papers in the fields of artificial intelligence, cognitive science, and complex systems. Her book Complexity: A Guided Tour (Oxford University Press) won the 2010 Phi Beta Kappa Science Book Award and was named by Amazon.com as one of the ten best science books of 2009. Her latest book is Artificial Intelligence: A Guide for Thinking Humans (Farrar, Straus, and Giroux).

Katherine Heller

A framework for improved ML evaluations

19h35, 29th April 2022 (UTC)
I start by discussing ML evaluation goals versus current standards, followed by the presentation of a framework for better addressing these evaluation goals. This framework comprises three parts: 1) Qualitative Evaluation, 2) Demographic Slicing, and 3) Distribution Shift and Causal Evaluations. I present "Healthsheets", a transparency artefact for health datasets in the spirit of "Datasheets", as work towards improving qualitative evaluation. I will also discuss work on identifying underspecified models, where our model may not encode the causal structure we believe it does, in situations where distribution shift is a factor. We also look at fairness properties under distribution shift and discuss the transfer implications of models when broken down by demographic group. Lastly, I will make a case for the importance of uncertainty and its implications for demographic fairness, discussing work on Electronic Health Record data, Bayesian neural networks, and differences in the distributions of mortality predictions for various demographic groups.
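
To make the "Demographic Slicing" part of the framework concrete, here is a minimal, hypothetical sketch (not code from the talk) of computing a metric such as AUC overall and per demographic group; the function name, the choice of AUC, and the group encoding are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def sliced_auc(y_true, y_score, groups):
    """Compute AUC on the full evaluation set and on each demographic slice.
    y_true: binary labels; y_score: model scores; groups: one group id per example."""
    y_true, y_score, groups = map(np.asarray, (y_true, y_score, groups))
    results = {"overall": roc_auc_score(y_true, y_score)}
    for g in np.unique(groups):
        mask = groups == g
        # AUC is only defined if the slice contains both classes.
        if len(np.unique(y_true[mask])) == 2:
            results[str(g)] = roc_auc_score(y_true[mask], y_score[mask])
    return results
```
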
Katherine Heller is a Research Scientist in the Responsible AI organization in Google Research. She leads the Context in AI Research (CAIR) group, which focuses on understanding the *context* in which AI systems are being developed and deployed. Prior to Google, she was faculty in the Statistical Science department at Duke University, where she collaborated across many disciplines to release a sepsis detection system (SepsisWatch) to the Emergency Departments of Duke University hospitals. SepsisWatch has now been run on over 200k patients. She also engaged in many other projects there, through an NSF CAREER award and other awards. Katherine received her PhD from the Gatsby Computational Neuroscience Unit at UCL, and was a postdoctoral fellow at the University of Cambridge and MIT.

Corinna Cortes

Inconsistency in Conference Peer Review: Revisiting the 2014 NeurIPS Experiment

20h20, 29th April 2022 (UTC)

In this talk we revisit the 2014 NeurIPS experiment that examined inconsistency in conference peer review. We determine that 50% of the variation in reviewer quality scores was subjective in origin. Further, with more than seven years having passed since the experiment, we find that for accepted papers there is no correlation between quality scores and the impact of the paper as measured by citation count. We trace the fate of rejected papers, recovering where these papers were eventually published; for these papers we do find a correlation between quality scores and impact. We conclude that the reviewing process for the 2014 conference was good at identifying poor papers, but poor at identifying good papers.

This 2014 NeurIPS experiment was repeated by the 2021 NeurIPS Program Chairs and we compare the findings. We also discuss other experiments carried out by ICML Program Chairs.

We hope the findings and some of the ideas from the experiments will help to design a better peer-review pipeline in future conferences.

Corinna Cortes is a VP of Google Research, NY, where she works on a broad range of theoretical and applied large-scale machine learning problems. Prior to Google, Corinna spent more than ten years at AT&T Labs - Research, formerly AT&T Bell Labs, where she held a distinguished research position. Corinna's research is well known in particular for her contributions to the theoretical foundations of support vector machines (SVMs), for which she and Vladimir Vapnik jointly received the 2008 Paris Kanellakis Theory and Practice Award, and for her work on data mining in very large data sets, for which she was awarded the AT&T Science and Technology Medal in 2000. Corinna received her MS degree in Physics from the University of Copenhagen and joined AT&T Bell Labs as a researcher in 1989. She received her Ph.D. in computer science from the University of Rochester in 1993. Corinna is also a competitive runner, and a mother of two.