We invite two types of papers: opinion papers (up to 4 pages) stating positions on topics such as those listed below, and methodology papers (up to 8 pages, excluding references) about evaluation in ML. Topics may include:
- Establishing benchmarking standards for ML research
- The role of benchmarks in facilitating research progress
- New evaluation techniques or metrics (general or domain-specific, e.g. for NLP)
- Reliable tools/protocols for benchmarking and evaluation
- Hyperparameter tuning
- Reporting statistical uncertainty
- Statistical testing and metrics (e.g. multi-objective evaluation, Pareto frontiers)
- Understanding and defining reproducibility for machine learning
- Replicability vs Reproducibility
- Does reproducibility in ML need to be different from other fields?
- Role of determinism
- Understanding sources of variation in results in deep learning
- How much rigor is needed for ML?
- Meta analyses thoroughly evaluating existing claims across papers
- Evaluating robustness of previous claims
- Reporting statistical or methodological issues
- Analysis of evaluation protocols on existing benchmarks
- Incentives for doing better evaluation and reporting results
- Action items for conferences: guidelines and reviewer training
- Incentivizing researchers to care about deployment (trustability, reliability, generalizability)
- Cultural changes and software tools for reliable evaluation
- Understanding differences across subfields in ML evaluation
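Several of the topics above, such as reporting statistical uncertainty, lend themselves to concrete tooling. As one hedged illustration (the scores, function name, and interval choices below are hypothetical examples, not part of this call), a percentile-bootstrap confidence interval over per-seed results can be computed with nothing beyond the Python standard library:

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)  # fixed seed so the interval itself is reproducible
    means = []
    for _ in range(n_resamples):
        # Resample the per-seed scores with replacement and record the mean.
        sample = rng.choices(scores, k=len(scores))
        means.append(statistics.fmean(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), (lo, hi)

# Hypothetical accuracies from five training runs with different seeds.
scores = [0.712, 0.698, 0.731, 0.705, 0.721]
mean, (lo, hi) = bootstrap_ci(scores)
print(f"accuracy = {mean:.3f} (95% CI [{lo:.3f}, {hi:.3f}])")
```

Reporting the interval alongside the point estimate, rather than a single-seed number, is exactly the kind of practice the workshop aims to discuss.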
Schmidt Futures Grants
Excitingly, the ML Evaluation Standards workshop is collaborating with Schmidt Futures to grant $15,000 USD in awards for workshop submissions and reviewers!
As part of this collaboration, we are announcing a new track “Improvement to the Conference and Publication Process”. We will consider awards for workshop submissions that propose ideas in the following areas:
- Ideas that incentivize better peer-reviewing processes (e.g. a new workflow for assigning paper reviewers)
- Ideas that address structural problems in ML publishing (e.g. proposing a new track for engineering-heavy work that would not be rewarded in traditional platforms)
- Ideas that emphasize novel approaches to reproducibility and/or standardization of evaluation methods (e.g. a novel reproducibility adoption mechanism for conferences)
If your submission falls into one of these areas, please submit your proposal to the relevant track.
We will also grant awards for the best reviews. Reviewing can be a thankless task; in this workshop, we want to acknowledge reviewers’ hard work, reward their service to the community, encourage quality reviews, and set an example for future conferences and workshops.
- Submission deadline: 8th March, 11:59 pm AoE (extended from 4th March)
- Accept/Reject Notification Date: 25th March
- Camera Ready: 18th April
- Workshop: 29th April
We will accept methodology papers of up to 8 pages and opinion papers of up to 4 pages, excluding references and appendices. In order to gather a wide variety of views, we will minimally constrain the format and topic areas for both types of papers. We will not accept work published at prior ML conferences, including ICLR 2022, and we encourage the submission of working papers. Submissions will be handled via CMT. Papers may use any style file (e.g., ICML, NeurIPS, ICLR) and should be anonymized for double-blind review.
The proceedings of the workshop will be non-archival. We will require authors of accepted papers to provide a recorded video and a slide deck describing their work. We plan to divide the accepted papers into two presentation types, contributed talks (10 min) and posters (two poster sessions), based on novelty, technical merit, and alignment with the workshop’s goals.
Sculley, D., et al. (2018). Winner’s curse? On pace, progress, and empirical rigor.
Rahimi, A., & Recht, B. NIPS Test of Time Award presentation. https://www.youtube.com/watch?v=Qi1Yry33TQE. Accessed: 2021-10-23.
Bouthillier, X., Laurent, C., & Vincent, P. (2019). Unreproducible research is reproducible. In International Conference on Machine Learning (pp. 725-734). PMLR.
Wightman, R., Touvron, H., & Jégou, H. (2021). ResNet strikes back: An improved training procedure in timm. arXiv preprint arXiv:2110.00476.
Dacrema, M. F., Cremonesi, P., & Jannach, D. (2019). Are we really making much progress? A worrying analysis of recent neural recommendation approaches. In Proceedings of the 13th ACM Conference on Recommender Systems (pp. 101-109).
Agarwal, R., Schwarzer, M., Castro, P. S., Courville, A. C., & Bellemare, M. (2021). Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing Systems, 34.
Chan, S. C., Fishman, S., Korattikara, A., Canny, J., & Guadarrama, S. (2019). Measuring the reliability of reinforcement learning algorithms. In International Conference on Learning Representations.
Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., & Meger, D. (2017). Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560.
Card, D., Henderson, P., Khandelwal, U., Jia, R., Mahowald, K., & Jurafsky, D. (2020). With little power comes great responsibility. arXiv preprint arXiv:2010.06595.
Gorman, K., & Bedrick, S. (2019). We need to talk about standard splits. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 2786-2791).
Hoyle, A., et al. (2021). Is automated topic model evaluation broken? The incoherence of coherence. arXiv preprint arXiv:2107.02173.
Mathur, N., Wei, J., Freitag, M., Ma, Q., & Bojar, O. (2020). Results of the WMT20 metrics shared task. In Proceedings of the Fifth Conference on Machine Translation (pp. 688-725).
Marie, B., Fujita, A., & Rubino, R. (2021). Scientific credibility of machine translation research: A meta-evaluation of 769 papers. arXiv preprint arXiv:2106.15195.
Narang, S., Chung, H. W., Tay, Y., Fedus, W., Fevry, T., Matena, M., … & Raffel, C. (2021). Do transformer modifications transfer across implementations and applications? arXiv preprint arXiv:2102.11972.
HPO & NAS
Li, L., & Talwalkar, A. (2019). Random search and reproducibility for neural architecture search. arXiv preprint arXiv:1902.07638.
Probst, P., Boulesteix, A. L., & Bischl, B. (2019). Tunability: Importance of hyperparameters of machine learning algorithms. The Journal of Machine Learning Research, 20(1), 1934-1965.
Sivaprasad, P. T., Mai, F., Vogels, T., Jaggi, M., & Fleuret, F. (2020). Optimizer benchmarking needs to account for hyperparameter tuning. In International Conference on Machine Learning (pp. 9036-9045). PMLR.
Bouthillier, X., Delaunay, P., Bronzi, M., Trofimov, A., Nichyporuk, B., Szeto, J., … & Vincent, P. (2021). Accounting for variance in machine learning benchmarks. Proceedings of Machine Learning and Systems, 3.
D’Amour, A., Heller, K., Moldovan, D., Adlam, B., Alipanahi, B., Beutel, A., … & Sculley, D. (2020). Underspecification presents challenges for credibility in modern machine learning. arXiv preprint arXiv:2011.03395.
Dodge, J., Gururangan, S., Card, D., Schwartz, R., & Smith, N. A. (2019). Show your work: Improved reporting of experimental results. In EMNLP-IJCNLP.
Kadlec, R., Bajgar, O., & Kleindienst, J. (2017). Knowledge base completion: Baselines strike back. arXiv preprint arXiv:1705.10744.
Musgrave, K., Belongie, S., & Lim, S. N. (2020). A metric learning reality check. In European Conference on Computer Vision (pp. 681-699). Springer.
Raff, E. (2019). A step toward quantifying independently reproducible machine learning research. Advances in Neural Information Processing Systems, 32, 5485-5495.
Schmidt, R. M., Schneider, F., & Hennig, P. (2021). Descending through a crowded valley: Benchmarking deep learning optimizers. In International Conference on Machine Learning (pp. 9367-9376). PMLR.