Call for Papers

The field of machine learning (ML) is undergoing massive growth, and it is becoming apparent that the field needs self-reflection to ensure that its efforts translate into real progress [1-2]. Recently, a growing number of papers at top conferences have examined ML evaluation, presenting evidence of unreliable findings and unsupported empirical claims in several subfields, including computer vision [3-4], recommender systems [5], reinforcement learning [6-8], natural language processing [9-14], hyperparameter optimization [15-17], and more [18-24]. Such papers highlight the need for more scientific rigor and careful evaluation, both by researchers themselves and by reviewers. It is clear that researchers are interested in having these discussions, even though the best path forward remains unclear. We can accelerate this self-reflection through thoughtful, careful discussion and by gathering a diversity of opinions.

We invite two types of papers – opinion papers (up to 4 pages) stating positions on topics such as those listed below, and methodology papers (up to 8 pages, excluding references) about evaluation in ML. Topics may include:

  • Establishing benchmarking standards for ML research
    • The role of benchmarks in facilitating research progress
    • New evaluation techniques or metrics (general or e.g. NLP-specific)
  • Reliable tools/protocols for benchmarking and evaluation
    • Hyperparameter tuning
    • Reporting statistical uncertainty
    • Statistical testing and metrics (e.g. multi-objective, Pareto frontiers)
  • Understanding and defining reproducibility for machine learning
    • Replicability vs Reproducibility
    • Does reproducibility in ML need to be different from other fields?
    • Role of determinism
    • Understanding sources of variation in results in deep learning
    • How much rigor is needed for ML?
  • Meta-analyses thoroughly evaluating existing claims across papers
    • Evaluating robustness of previous claims
    • Reporting statistical or methodological issues
    • Analysis of evaluation protocols on existing benchmarks
  • Incentives for doing better evaluation and reporting results
    • Action items for conferences / Guidelines / training reviewers
    • Incentivizing researchers to care about deployment (trustability, reliability, generalizability)
    • Cultural changes and software tools for reliable evaluation
    • Understanding differences across subfields in ML evaluation

Schmidt Futures Grants

Excitingly, the ML Evaluation Standards workshop is collaborating with Schmidt Futures to grant $15,000 USD in awards for workshop submissions and reviewers!

As part of this collaboration, we are announcing a new track “Improvement to the Conference and Publication Process”. We will consider awards for workshop submissions that propose ideas in the following areas:

  1. Ideas that incentivize better peer-reviewing processes (e.g. a new workflow for assigning paper reviewers)
  2. Ideas that address structural problems in ML publishing (e.g. proposing a new track for engineering-heavy work that would not be rewarded in traditional platforms)
  3. Ideas that emphasize novel approaches to reproducibility and/or standardization of evaluation methods (e.g. a novel reproducibility adoption mechanism for conferences)

If your submission falls into one of these areas, please submit your proposal to the relevant track.

We will also be granting awards for the best reviews. Reviewing can be a thankless task; in this workshop, we want to acknowledge reviewers’ hard work, to reward their service to the community, to encourage quality reviews, and to set an example for future conferences/workshops.

Submission Website

Important Dates

  • Submission deadline: 8th March, 11:59 pm AoE (extended from 4th March)
  • Accept/Reject Notification Date: 25th March
  • Camera Ready: 18th April
  • Workshop: 29th April

Submission instructions

We will accept technical papers of up to 8 pages and opinion papers of up to 4 pages, excluding references and appendices. To gather a wide variety of views, we will minimally constrain the format and topic areas for both types of papers. We will not accept work published at prior ML conferences (including ICLR 2022), and we encourage the submission of working papers. Submissions will be handled via CMT. Papers may use any style file (e.g. ICML, NeurIPS, ICLR) and should be anonymized for double-blind review.

The proceedings of the workshop will be non-archival. We will require authors of accepted papers to provide a recorded video and a slide deck describing their work. We plan to divide the accepted papers into two presentation types – contributed talks (10 min) and posters (two poster sessions) – based on novelty, technical merit, and alignment with the workshop’s goals.


[1] Sculley, David, et al. “Winner’s curse? On pace, progress, and empirical rigor.” (2018).

[2] Ali Rahimi and Benjamin Recht. NIPS Test of Time Award presentation. Accessed: 2021-10-23.


[3] Bouthillier, X., Laurent, C., & Vincent, P. (2019, May). Unreproducible research is reproducible. In International Conference on Machine Learning (pp. 725-734). PMLR.

[4] Wightman, R., Touvron, H., & Jégou, H. (2021). ResNet strikes back: An improved training procedure in timm. arXiv preprint arXiv:2110.00476.


[5] Dacrema, M.F., Cremonesi, P. and Jannach, D., 2019, September. Are we really making much progress? A worrying analysis of recent neural recommendation approaches. In Proceedings of the 13th ACM Conference on Recommender Systems (pp. 101-109)


[6] Agarwal, R., Schwarzer, M., Castro, P. S., Courville, A. C., & Bellemare, M. (2021). Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing Systems, 34.

[7] Chan, S. C., Fishman, S., Korattikara, A., Canny, J., & Guadarrama, S. (2019, September). Measuring the Reliability of Reinforcement Learning Algorithms. In International Conference on Learning Representations.

[8] Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., & Meger, D. (2017). Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560.


[9] Card, D., Henderson, P., Khandelwal, U., Jia, R., Mahowald, K., & Jurafsky, D. (2020). With little power comes great responsibility. arXiv preprint arXiv:2010.06595.

[10] Gorman, K., & Bedrick, S. (2019, July). We need to talk about standard splits. In Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 2786-2791).

[11] Hoyle, Alexander, et al. “Is Automated Topic Model Evaluation Broken?: The Incoherence of Coherence.” arXiv preprint arXiv:2107.02173 (2021).

[12] Mathur, Nitika, Johnny Wei, Markus Freitag, Qingsong Ma, and Ondřej Bojar. “Results of the WMT20 metrics shared task.” In Proceedings of the Fifth Conference on Machine Translation, pp. 688-725. 2020.

[13] Marie, Benjamin, Atsushi Fujita, and Raphael Rubino. “Scientific credibility of machine translation research: A meta-evaluation of 769 papers.” arXiv preprint arXiv:2106.15195 (2021).

[14] Narang, S., Chung, H. W., Tay, Y., Fedus, W., Fevry, T., Matena, M., … & Raffel, C. (2021). Do Transformer Modifications Transfer Across Implementations and Applications?. arXiv preprint arXiv:2102.11972.


[15] Li, L., & Talwalkar, A. (2019). Random Search and Reproducibility for Neural Architecture Search. arXiv preprint arXiv:1902.07638.

[16] Probst, P., Boulesteix, A. L., & Bischl, B. (2019). Tunability: importance of hyperparameters of machine learning algorithms. The Journal of Machine Learning Research, 20(1), 1934-1965.

[17] Sivaprasad, P. T., Mai, F., Vogels, T., Jaggi, M., & Fleuret, F. (2020, November). Optimizer benchmarking needs to account for hyperparameter tuning. In International Conference on Machine Learning (pp. 9036-9045). PMLR.


[18] Bouthillier, X., Delaunay, P., Bronzi, M., Trofimov, A., Nichyporuk, B., Szeto, J., … & Vincent, P. (2021). Accounting for variance in machine learning benchmarks. Proceedings of Machine Learning and Systems, 3.

[19] D’Amour, A., Heller, K., Moldovan, D., Adlam, B., Alipanahi, B., Beutel, A., … & Sculley, D. (2020). Underspecification presents challenges for credibility in modern machine learning. arXiv preprint arXiv:2011.03395.

[20] Dodge, J., Gururangan, S., Card, D., Schwartz, R. and Smith, N.A., 2019, January. Show Your Work: Improved Reporting of Experimental Results. In EMNLP/IJCNLP.

[21] Kadlec, R., Bajgar, O., & Kleindienst, J. (2017). Knowledge base completion: Baselines strike back. arXiv preprint arXiv:1705.10744.

[22] Musgrave, K., Belongie, S., & Lim, S. N. (2020, August). A metric learning reality check. In European Conference on Computer Vision (pp. 681-699). Springer, Cham.

[23] Raff, Edward. “A step toward quantifying independently reproducible machine learning research.” Advances in Neural Information Processing Systems 32 (2019): 5485-5495.

[24] Schmidt, R. M., Schneider, F., & Hennig, P. (2021, July). Descending through a crowded valley: benchmarking deep learning optimizers. In International Conference on Machine Learning (pp. 9367-9376). PMLR.