Video Recording | Downloadable Slides
July 22, 2024, 3:30-5:30pm GMT+2

Overview

NLP and Machine Learning rely on benchmarks and evaluation to accurately track progress in the field and assess the efficacy of new models and methodologies. For this reason, good evaluation practices and accurate reporting are crucial.

However, language models (LMs) not only inherit the challenges previously faced in benchmarking, but also introduce a slew of novel considerations which can make proper comparison across models difficult, misleading, or near-impossible.

In this tutorial, we aim to bring attendees up to speed on the state of language model evaluation and to highlight current challenges in evaluating language model performance by discussing the methods of evaluation, tasks, and benchmarks commonly used to measure progress in language model research. We will then discuss how these common pitfalls can be addressed and what considerations should inform future work.


Presenters

Lintang Sutawika
EleutherAI
Hailey Schoelkopf
EleutherAI

Schedule

The tutorial will be held from 3:30 to 5:30pm (GMT+2) on July 22nd. A preliminary outline of the schedule is as follows:

LM Evaluation Fundamentals: We will review the fundamentals of evaluating autoregressive LMs and the tools available to practitioners (a minimal sketch of one such primitive, loglikelihood-based multiple-choice scoring, follows this outline).

LM-Centric Challenges: We will then cover a number of challenges arising from the language-model side of LM evaluation, concerns that are either novel to LMs or exacerbated by them.

Benchmark-Centric Challenges: Next, we will discuss broader concerns and challenges faced in evaluation and benchmarking as a whole, which apply to LM evaluation as well.

Addressing Pitfalls, and Where To Go From Here: We will close with suggestions to mitigate the impact of the challenges we have discussed and provide notes on directions for future study.
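
As a concrete illustration of the kind of evaluation primitive covered in the fundamentals segment, below is a minimal sketch (not taken from the tutorial materials) of loglikelihood-based multiple-choice scoring: each answer choice is scored by the total log-probability the model assigns to it given the question, and the highest-scoring choice is taken as the prediction. The model name, prompt, and helper function are illustrative, and the sketch ignores tokenizer boundary effects that a careful implementation would handle.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any HuggingFace causal LM works the same way.
model_name = "EleutherAI/pythia-160m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def continuation_loglikelihood(context: str, continuation: str) -> float:
    # Sum of log-probabilities the model assigns to `continuation` given `context`.
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                       # (1, seq_len, vocab_size)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)      # position i predicts token i+1
    targets = full_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_continuation = full_ids.shape[1] - ctx_len              # assumes clean token boundary
    return token_logprobs[0, -n_continuation:].sum().item()

question = "Q: What is the capital of France?\nA:"
choices = [" Paris", " London", " Berlin"]
scores = [continuation_loglikelihood(question, c) for c in choices]
print(choices[scores.index(max(scores))])                     # highest-likelihood choice wins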

Materials

Slides can be found at this link! For those seeking a larger list of references, every slide that uses figures from, or is centrally based on the conclusions of, a given work links to that work at the bottom of the slide.

For ICML in-person and virtual attendees, the video recording may be watched on the ICML 2024 website.

Relevant Papers

The material in this presentation is loosely based on Lessons from the Trenches on Reproducible Evaluation of Language Models (Biderman, Schoelkopf, Sutawika, et al., 2024). We recommend this paper to readers interested in learning more about this area; it provides a number of additional references and resources.

Code

The Language Model Evaluation Harness is a library for prompted zero- and few-shot language model evaluation, built to incorporate and mitigate a number of the challenges we discuss in this tutorial.
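
As a hedged illustration of how the harness is typically invoked (the argument names below follow our reading of the v0.4.x simple_evaluate API; the model and tasks are illustrative, so check the library's README for the authoritative interface):

import lm_eval

# Evaluate a HuggingFace causal LM on two common benchmarks, 5-shot.
results = lm_eval.simple_evaluate(
    model="hf",                                     # HuggingFace transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])                           # per-task metrics

In v0.4.x the same options are also exposed as flags on the lm_eval command-line entry point.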

Citation

If you find this tutorial useful, please consider citing the following works:

@misc{sutawika2024challenges,
  author = {Sutawika, Lintang and Schoelkopf, Hailey},
  title = {{ICML} Tutorial on Challenges in Language Model Evaluations},
  year = {2024},
  howpublished = {\url{https://lm-evaluation-challenges.github.io}},
}
@misc{biderman2024lessons,
  title = {Lessons from the Trenches on Reproducible Evaluation of Language Models},
  author = {Stella Biderman and Hailey Schoelkopf and Lintang Sutawika and Leo Gao and Jonathan Tow and Baber Abbasi and Alham Fikri Aji and Pawan Sasanka Ammanamanchi and Sidney Black and Jordan Clive and Anthony DiPofi and Julen Etxaniz and Benjamin Fattori and Jessica Zosa Forde and Charles Foster and Mimansa Jaiswal and Wilson Y. Lee and Haonan Li and Charles Lovering and Niklas Muennighoff and Ellie Pavlick and Jason Phang and Aviya Skowron and Samson Tan and Xiangru Tang and Kevin A. Wang and Genta Indra Winata and François Yvon and Andy Zou},
  year = {2024},
  eprint = {2405.14782},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}
@misc{eval-harness,
  author       = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
  title        = {A framework for few-shot language model evaluation},
  month        = 12,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {v0.4.0},
  doi          = {10.5281/zenodo.10256836},
  url          = {https://zenodo.org/records/10256836}
}

Contact

{hailey, lintang}@eleuther.ai

Acknowledgements

This site is modeled after https://machine-learning-for-theorem-proving.github.io/.