LLM benchmarks: Gut feel and difficult datasets

Large language models are often tested for ‘performance’, based on how well a model completes certain tasks: if a machine can pass the bar exam, it is performant.

Model builders promote their models based on the results of these tests. Microsoft and OpenAI even wrote a paper on the effectiveness of GPT-4 at solving medical problems.

A human test like this is a great sounding board for a model’s effectiveness: human tests tend to ask test-takers to recall established knowledge and produce answers in natural language, and models excel at this.

However, a human test privileges a machine’s ability to replicate human capabilities and overlooks the quality of a model’s output along dimensions that cannot easily be assessed by a teacher or professor. For example, a human test would not capture a model’s ability to produce genuinely new outputs, the way AlphaFold does with protein structures.

Instead of relying purely on this sort of test, researchers have developed large language model ‘benchmarks’. Benchmarks combine gut feel and quantitative evaluation to offer a more precise way to assess models.

A benchmark works by assessing a model against the benchmark’s own dataset. Benchmarks are therefore subject to the limitations of that dataset - even if the data is the entire internet.

The key challenge for benchmarks is that they advantage models trained on similar data. For example, in March, OpenAI benchmarked GPT-4 against a pre-2021 dataset. GPT-4 had been trained on pre-2021 data and solved 10/10 of the pre-2021 tasks but 0/10 of the more recent ones.

So what is a benchmark?

Benchmarks apply a set of ‘metrics’ across use cases or ‘tasks’. Metrics assess how well a model meets a certain criterion in comparison to the benchmark’s own dataset.

‘Accuracy’ or ‘performance’ might represent a single metric or a weighted blend of several metrics. In some cases, performance is also disaggregated into accuracy plus more subjective, societally oriented metrics, such as bias.
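
To make the weighted blend concrete, here is a minimal sketch. It assumes nothing about any real benchmark: the metric names, weights and function are all illustrative.

```python
def blended_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-metric scores into a single weighted 'performance' number."""
    total_weight = sum(weights.values())
    return sum(weights[name] * metrics[name] for name in weights) / total_weight

# Hypothetical example: accuracy dominates, with smaller weights on other metrics.
print(blended_score(
    {"accuracy": 0.82, "calibration": 0.71, "low_bias": 0.95},
    {"accuracy": 0.6, "calibration": 0.2, "low_bias": 0.2},
))  # 0.824
```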

Benchmark metrics fall into a few (non-exhaustive) buckets:

  • Known outputs (e.g. a human answer to a test)

  • Known inputs (e.g. factual consistency)

  • Known structure (e.g. JSON, .txt; see the sketch after this list)

  • Unknown outputs (e.g. bias, new information)

  • Open world challenges (e.g. level of uncertainty, level of consistency of output to different but semantically identical inputs)
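
As a quick illustration of the ‘known structure’ bucket, a metric can be as simple as the share of outputs that parse as valid JSON. This is a sketch with made-up function names, not taken from any particular benchmark.

```python
import json

def json_validity(outputs: list[str]) -> float:
    """Fraction of model outputs that parse as valid JSON."""
    def is_valid(text: str) -> bool:
        try:
            json.loads(text)
            return True
        except json.JSONDecodeError:
            return False
    return sum(is_valid(o) for o in outputs) / len(outputs)

print(json_validity(['{"name": "Ada"}', "not json"]))  # 0.5
```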

A metric’s calculation method is often as simple as counting the number of occurrences of a certain word, phrase or token in a dataset. For example, a Stanford paper (“Holistic Evaluation of Language Models” or HELM) considers gender bias by calculating the frequency of male v female representation in model output.
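
A rough sketch of that kind of counting metric is below. The word lists are deliberately tiny and illustrative; HELM itself uses larger curated sets of gendered terms and more careful statistics.

```python
import re
from collections import Counter

MALE_TERMS = {"he", "him", "his", "man", "men", "male"}
FEMALE_TERMS = {"she", "her", "hers", "woman", "women", "female"}

def gender_representation(outputs: list[str]) -> dict[str, float]:
    """Relative frequency of male vs. female terms across model outputs."""
    counts = Counter(male=0, female=0)
    for text in outputs:
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in MALE_TERMS:
                counts["male"] += 1
            elif token in FEMALE_TERMS:
                counts["female"] += 1
    total = sum(counts.values()) or 1
    return {key: value / total for key, value in counts.items()}

print(gender_representation(["He said the doctor was late.", "She thanked him."]))
# {'male': 0.67, 'female': 0.33} (approximately)
```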

Similarly, HELM assesses a model’s uncertainty from the probabilities it assigns to its output tokens: a model scores well if its confidence lines up with how often it is actually correct, i.e. if it is well calibrated.
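
One common way to score this is expected calibration error. The sketch below is a simplification (fixed equal-width bins, pre-extracted confidences) rather than HELM’s actual implementation.

```python
def expected_calibration_error(confidences: list[float],
                               correct: list[bool],
                               n_bins: int = 10) -> float:
    """Bucket predictions by confidence, then compare average confidence with
    actual accuracy in each bucket; a well-calibrated model scores near 0."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece, n = 0.0, len(confidences)
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece
```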

Benchmark tasks or use cases are more wide-ranging:

Metrics are examined across different ‘tasks’ or use cases. Each task uses a specific dataset to test a model - like NarrativeQA or SQuAD - and acts as a proxy for effectiveness in a particular setting, e.g. writing Python code or generating a polite email.
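
For reading-comprehension datasets like SQuAD, the standard scoring is exact match after light normalisation (the official scorer also reports token-level F1). A simplified sketch, with my own function names:

```python
import string

def normalize(answer: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    answer = "".join(ch for ch in answer.lower() if ch not in string.punctuation)
    return " ".join(word for word in answer.split() if word not in {"a", "an", "the"})

def exact_match(predictions: list[str], references: list[list[str]]) -> float:
    """Share of predictions that match any acceptable reference answer."""
    hits = sum(
        normalize(pred) in {normalize(ref) for ref in refs}
        for pred, refs in zip(predictions, references)
    )
    return hits / len(predictions)

print(exact_match(["The Eiffel Tower"], [["Eiffel Tower", "eiffel tower"]]))  # 1.0
```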

Some of the most compelling benchmarking approaches compile many different tasks to conduct a sophisticated assessment. For example, the Hugging Face Open LLM Leaderboard is effectively a benchmark of benchmarks, each of which includes several tasks for common sense, accuracy and truth.

Similarly, a paper titled Beyond the Imitation Game (BIG-Bench) used 204 tasks. These tasks tested all manner of things, including whether a model can checkmate-in-one, or produce the element name that corresponds to an atomic number.

Most benchmarks however are limited to narrower use cases or tasks that test a more specific capability. HumanEval, for example, is a test based on data in the paper "Evaluating Large Language Models Trained on Code" and benchmarks models against 164 hand-written Python programming problems. HumanEval would be a relevant way to benchmark Mistral against a competitor focused on code generation, but would not be relevant for a model built for general natural-language tasks.
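
HumanEval scores functional correctness by running each generated program against unit tests and reporting pass@k: the probability that at least one of k sampled completions passes. The Codex paper gives an unbiased estimator for it; a minimal sketch (the paper’s reference implementation uses a numerically stable product rather than math.comb):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples passes,
    given that c of the n generated samples were correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=37, k=1))   # 0.185
print(pass_at_k(n=200, c=37, k=10))  # roughly 0.88
```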

The challenge

Benchmarks are subject to the inherent biases of the dataset on which they are built, in exactly the same way as a model. Models whose training data is similar to the benchmark’s will score more highly. It is also difficult to compare model performance across different benchmarks, as each benchmark is typically context- or data-specific.

“Contamination in the form of models being trained on the very instances they are tested on can drastically compromise the validity and legitimacy of evaluations”

Benchmarks are tackling this problem by including more high-quality data and testing models on a broader range of tasks. Where early benchmarks used individual datasets such as SQuAD, many have moved to small collections of datasets such as SuperGLUE, and in some cases to massive collections of datasets. This is a great boon, but most model evaluations can only incorporate a small number of datasets due to the cost of compute.

Including as many datasets as possible is uncontroversially a good idea because it dilutes the bias of each dataset and is more exhaustive. For example, HELM tests each model across several metrics in several scenarios.

The HELM methodology then compares model performance across these scenarios by compiling or standardising these metrics.
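
One simple way to compile scenario-level metrics into a single comparable figure, similar in spirit to the mean win rate HELM reports, is to count how often each model beats the others head-to-head. A rough sketch with made-up data structures:

```python
from itertools import combinations

def mean_win_rate(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """scores[model][scenario] -> metric value (higher is better).
    A model's win rate is the share of pairwise comparisons it wins,
    counted over every scenario both models were evaluated on."""
    models = list(scores)
    wins = {m: 0.0 for m in models}
    comparisons = {m: 0 for m in models}
    scenarios = {s for per_model in scores.values() for s in per_model}
    for scenario in scenarios:
        for a, b in combinations(models, 2):
            if scenario not in scores[a] or scenario not in scores[b]:
                continue
            comparisons[a] += 1
            comparisons[b] += 1
            if scores[a][scenario] > scores[b][scenario]:
                wins[a] += 1
            elif scores[b][scenario] > scores[a][scenario]:
                wins[b] += 1
            else:  # tie: split the credit
                wins[a] += 0.5
                wins[b] += 0.5
    return {m: wins[m] / comparisons[m] for m in models if comparisons[m]}
```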

However, even in the case of HELM, the results are not without a meaningful level of contamination. Training data varies immensely and in some cases model builders are quite secretive about what their data actually contains. This means that the extent to which a benchmark privileges a given model is unknown.

Benchmarking these models is challenging given they vary in accessibility (see Liang et al., 2022): some are open (e.g. GPT-NeoX (20B)), some are limited-access (e.g. davinci (175B)), and some are closed (e.g. Anthropic-LM v4-s3 (52B)). In some cases, very little is known about how these models were built (e.g. the training data and its size are often not known), such as text-davinci-002.

The limitations of datasets and human tests don’t have obvious solutions, and it isn’t ridiculous for foundation models to represent themselves with the benchmark that most befits them. A relevant benchmark still demonstrates the model is good at what it said it would do. It wouldn’t make sense for Midjourney to test itself on chess problems…

Satirical takes on the training data v benchmark data problem have also provided great entertainment for those interested in the space.

In the future, model benchmarking, accuracy and reliability will hopefully attract the interest of more foundations and independent open-source communities with a limited vested interest in ‘performance’.

For example, a Chainalysis-for-model-performance organisation or business might act as a machine learning analytics engine and impartial adjudicator of performance.

In any case, benchmarks are likely to remain incredibly important as this space continues to blossom.
