We show how Airtrain can easily replicate the MMLU benchmark results for the Llama 2 family of models.
Academic benchmarks are the community's primary tool for ranking models on leaderboards. For example, the Hugging Face Open LLM Leaderboard averages the results of six popular benchmarks: HellaSwag, MMLU, ARC, TruthfulQA, Winogrande, and GSM8K.
Benchmarks are carefully curated datasets of prompts targeting specific domain knowledge areas or tasks. For example, GSM8K focuses on grade-school math word problems, while HellaSwag tests commonsense completion of everyday scenarios.
In this article, we will demonstrate how to replicate MMLU benchmark results with Airtrain for the Llama 2 family of models.
MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model's blind spots.
We can download the MMLU test dataset in CSV format from HuggingFace here.
The dataset is broken down into individual files, one per topic. We will collate all topics into a single file and convert it to JSONL, a format that is easier to parse robustly than CSV.
You can download the final JSONL file here, or do the conversion yourself with the below code snippet.
The final schema for each example will be as follows:
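A minimal conversion sketch is below. The exact CSV layout and file naming are assumptions based on the HuggingFace distribution (one header-less `<topic>_test.csv` per topic, each row holding the question, four choices, and the correct answer's letter); adjust the paths and column order to match your download. Each JSONL record then carries the topic, the question, the choices, and a `correct_answer` field that we will reference later in the scoring property.

```python
import csv
import glob
import json
import os

def collate_mmlu(csv_dir: str, out_path: str) -> int:
    """Collate per-topic MMLU test CSVs into a single JSONL file.

    Assumed row layout (no header): question, choice A, choice B,
    choice C, choice D, letter of the correct answer.
    Returns the number of examples written.
    """
    count = 0
    with open(out_path, "w") as out:
        for path in sorted(glob.glob(os.path.join(csv_dir, "*_test.csv"))):
            topic = os.path.basename(path)[: -len("_test.csv")]
            with open(path, newline="") as f:
                for question, a, b, c, d, answer in csv.reader(f):
                    record = {
                        "topic": topic,
                        "question": question,
                        "choices": {"A": a, "B": b, "C": c, "D": d},
                        "correct_answer": answer,
                    }
                    out.write(json.dumps(record) + "\n")
                    count += 1
    return count
```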
In the top menu bar, click "New job".
Then select "JSONL file upload" in the Source type dropdown, click "Choose file", and select your mmlu.jsonl file.
In the central panel, click the + button next to the model you want to configure.
Name your configuration, for example simply "Llama 2 7B". Select the 7B variant, set the temperature to 0.1, and paste the following prompt.
Then configure as many other models and variants as you want, for example Llama 2 13B and 70B.
Model performance on the MMLU benchmark is measured as a pass rate: what fraction of questions does the model answer correctly?
To replicate this with Airtrain, we will create a Correctness property with the following description:
This score describes whether the chatbot selected the correct answer.
The correct answer is {{correct_answer}}.
Here is a scoring rubric to use:
1. The chatbot's answer is not {{correct_answer}}, therefore the chatbot is incorrect.
5. The chatbot's answer is {{correct_answer}}, therefore the chatbot is correct.
Airtrain's scoring model grades inferences on a Likert scale of 1 to 5. In this case, we want to measure a binary pass/fail rate, so we use only two scores: 1 (fail) and 5 (pass), as shown above.
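With this two-score rubric, the pass rate is simply the fraction of examples graded 5. A minimal sketch of that computation (the list of per-example scores is assumed to have been exported from the evaluation results):

```python
def pass_rate(scores: list[int]) -> float:
    """Fraction of examples the scoring model graded 5 (pass).

    With the binary rubric above, every example is scored either
    1 (fail) or 5 (pass), so this is exactly the benchmark pass rate.
    """
    if not scores:
        return 0.0
    return sum(1 for s in scores if s == 5) / len(scores)
```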
We can interpolate the correct answer, provided in the input dataset, into the property description.
Out of curiosity, we also activate the Length unsupervised metric to get a sense of which variant is more verbose.
View the public results page here.
On this plot, we can read the pass rates (fraction of examples scored 5) and compare them with the official MMLU benchmark results listed here.
We can see that Airtrain's scoring model comes close to the official MMLU benchmark results.
As expected, we also note that higher correctness correlates with larger model size.
On this plot, we can see that the 7B variant is more verbose than the 13B and 70B variants; 13B is the most concise.
In this article, we showed how straightforward it is to replicate the MMLU academic benchmark with Airtrain. Airtrain makes it simple to evaluate LLMs across large evaluation datasets and arbitrary properties, including academic benchmarks.
Sign up for early access to Airtrain's free batch evaluation tool.