In this extensive guide to LLM evaluation, we cover the main methods used to gauge the quality of a language model.
Despite their prowess, Large Language Models (LLMs) come with their own set of challenges: unpredictable results, deployment cost and performance, scalability, and more. Many developers, initially drawn to GPT-4 for prototyping, soon find themselves exploring open-source alternatives and considering fine-tuning smaller models for targeted tasks.
The first step towards integrating LLMs into production applications is to set up a good evaluation harness and answer the question: how does this model perform for my specific application running on my specific dataset?
In this blog post, we present common as well as novel techniques to evaluate the generative performance of LLMs.
This article focuses on batch offline evaluation, but other processes can be used to complement it.
A vibe check is a quick, cursory manual look at a model. A human prompts the model with various test cases and develops an intuition for how it performs. This is the easiest, fastest, and cheapest way to gauge a model, but it provides only superficial information.
Batch evaluation consists of running an entire evaluation dataset (e.g. a few thousand examples) through a model and gathering statistical evidence as to its performance. Benchmarks are batch evaluation methods.
Batch evaluation is a necessary part of a rigorous development workflow, much like a CI test suite in traditional software, and should be performed systematically before shipping new AI-powered systems to production.
Online evaluation attempts to quantify the quality of a live production model by scoring its inferences. This requires a real-time ingestion pipeline to persist inferences and calculate evaluation metrics on them.
This type of online monitoring is necessary to catch production failures early and to detect performance degradation trends (e.g. drift).
Human evaluation is the most costly and least scalable method, but can yield solid results. The best way to evaluate the performance and relevance of a model in the context of a specific application is to have a set of humans review outputs generated for a test dataset and provide qualitative and/or quantitative feedback.
Testers can be presented with a single inference per example to score on a numerical scale, or with several inferences to choose from. Collecting this data is very valuable and can form the basis of a preference dataset on which to fine-tune a model.
This potentially less costly, yet more time-consuming technique consists of gathering feedback from real users. The feedback can be a direct rating left by the user, or it can be indirect, where a follow-up action is tracked. For example, for a code-generation model, the user actually using the suggestion is a measurable signal of a good-quality output.
It is less costly as it does not require paying workers to grade outputs, but it also means your product is serving real production traffic. You may also have to do some UX magic to get users to provide feedback.
Reinforcement Learning with Human Feedback (RLHF) is not strictly an evaluation technique, but it deserves mention alongside human-based evaluation methods. RLHF combines traditional reinforcement learning algorithms with human insights.
Instead of relying solely on predefined reward signals, RLHF incorporates human feedback, such as ranking model actions or providing demonstrations.
This melding of algorithmic learning with human input helps guide the model towards desired outcomes, especially when the reward signal is sparse or ambiguous. It's a fusion of computational strength and human intuition.
In traditional machine learning, clear metrics like Precision and Recall for classifiers, or Intersection over Union for computer vision, offer straightforward ways to measure model performance based on structured data such as numbers, classes, and bounding boxes.
However, language models generate natural language, which is unstructured, making it more challenging to assess their performance. There are various metrics available to gauge specific aspects of language model performance, yet there isn't a universal metric that captures overall effectiveness.
We break down NLP metrics into two categories: supervised metrics, applicable when we have access to ground truth labels, and unsupervised metrics, useful when such labels are not available.
Supervised metrics can be evaluated when we have access to reference labels, i.e. an expected appropriate response for each prompt in the evaluation dataset.
BLEU (Bilingual Evaluation Understudy) (Papineni et al. 2002) is a metric that quantifies the quality of text which has been machine-translated from one natural language to another.
Developed for assessing translation quality, BLEU scores compare the machine-produced text to one or more reference translations. The evaluation is based on the presence and frequency of consecutive words – n-grams – in both the machine-generated and the reference texts.
BLEU considers precision, which is the number of matching n-grams in the translated text compared to the reference, but it also applies a brevity penalty to discourage overly short translations.
Higher BLEU scores indicate better translation quality, with a score of 0 meaning no overlap and a score of 1 being a perfect match.
However, while BLEU is widely used due to its simplicity and ease of use, it has limitations, such as not accounting for the meaning or grammatical correctness of the generated text.
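The core of BLEU can be sketched in a few lines. This is a simplified, unsmoothed, single-reference version for illustration only; production code should use an established implementation such as sacrebleu, which adds smoothing and multi-reference support.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of modified n-gram precisions
    times a brevity penalty (single reference, no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i+n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        # Clipped counts: each candidate n-gram matches at most as many
        # times as it appears in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision collapses the geometric mean
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A perfect match scores 1.0; a candidate sharing no words with the reference scores 0.0.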
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating summarization and translation. It compares a generated summary or translation against a set of human-generated reference summaries.
ROUGE is recall-oriented and is particularly interested in how many of the reference's words and phrases are captured by the generated summary. There are several variations of the ROUGE metric, each of which looks at different aspects of the text:

- ROUGE-N: overlap of n-grams between the generated and reference texts (ROUGE-1 for unigrams, ROUGE-2 for bigrams, etc.)
- ROUGE-L: longest common subsequence between the generated and reference texts
- ROUGE-S: skip-bigram co-occurrence, which allows gaps between matched word pairs
ROUGE scores have been commonly used in various competitions and evaluations for summarization tasks, and they serve as a standard by which different summarization methods can be compared.
It is important to note that while ROUGE is useful for evaluation, it does not capture all aspects of language quality, such as coherence, cohesiveness, and readability.
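As a sketch of the recall-oriented idea, here is a minimal ROUGE-N recall computation. Full ROUGE implementations also report precision and F1 and handle stemming and stopwords; this keeps only the recall core.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: fraction of the reference's n-grams that also
    appear in the candidate, with clipped counts."""
    def ngrams(text):
        toks = text.split()
        return Counter(tuple(toks[i:i+n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(c, cand[g]) for g, c in ref.items())
    return overlap / max(sum(ref.values()), 1)
```

For example, a candidate that reproduces two of the reference's three unigrams gets a ROUGE-1 recall of 2/3.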
BERTScore (Zhang et al. 2020) is a more recent metric that relies on contextual embeddings from the BERT model. Unlike traditional metrics that measure surface-level n-gram overlap (e.g. BLEU and ROUGE), BERTScore computes a similarity score for each token in the candidate text against each token in the reference text using contextual embeddings.
Here's how BERTScore works:
BERTScore offers several advantages over traditional metrics:
However, BERTScore also has limitations. It requires heavier computational resources compared to string-based metrics due to its reliance on BERT embeddings, and it may not always align with human judgments, particularly in cases where the structure and coherence of the generated text are important.
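The greedy-matching core of BERTScore can be illustrated with toy embedding vectors standing in for the per-token contextual embeddings that the real metric obtains from BERT:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def bertscore_f1(cand_embs, ref_embs):
    """Greedy matching: each candidate token is matched to its most
    similar reference token (precision) and vice versa (recall);
    F1 combines the two. Inputs are per-token embedding vectors."""
    precision = sum(max(cosine(c, r) for r in ref_embs) for c in cand_embs) / len(cand_embs)
    recall = sum(max(cosine(c, r) for c in cand_embs) for r in ref_embs) / len(ref_embs)
    return 2 * precision * recall / (precision + recall)
```

Identical embedding sequences yield an F1 of 1.0; the real metric additionally applies importance weighting and baseline rescaling.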
MoverScore (Zhao et al. 2019) measures the semantic distance between a summary and reference text by making use of the Word Mover’s Distance (Kusner et al., 2015) operating over n-gram embeddings pooled from BERT representations.
MoverScore is particularly useful in evaluating tasks where the exact wording might differ, but the conveyed meaning should remain the same. For instance, in machine translation, different translations might be equally valid as long as they convey the same meaning.
Metrics like BLEU and ROUGE rely heavily on surface-level similarities (such as n-gram overlap), which may not fully capture semantic nuances. MoverScore, by focusing on semantic meaning, provides a more nuanced evaluation.
Like MoverScore, BaryScore (Colombo et al. 2021) is centered around measuring the semantic similarity between two texts. It represents the result and reference text as probability distributions in the embedding space (word level, sentence level, or other subset level), and uses a metric based on the cost of transforming one probability distribution into the other. It correlates well with human judgments of correctness, coverage, and relevance.
Unsupervised metrics can be evaluated in the absence of ground truth labels. This makes them easier to compute, but also potentially less useful.
At a high level, Perplexity measures the amount of “randomness” in a model. If Perplexity is 3 (per word) then the model had a 1-in-3 chance of guessing (on average) the next word in the text.
Perplexity of a model is connected to its entropy. They both quantify how “surprised” a model is when trying to predict a sentence. When prompted with a sequence of tokens and asked to predict all the following ones, if there is only ever one choice for each new token with a probability of 1, then perplexity is very low. The model does not need to decide from many different options.
Technically, the perplexity of a model M over a sequence of N tokens is defined as

PPL(M) = exp( -(1/N) Σ_k log M(w_k | w_0 w_1 … w_{k-1}) )

where M(w_k | w_0 w_1 … w_{k-1}) is the probability of w_k being the next token after w_0 w_1 … w_{k-1}.
Perplexity can be interpreted as the weighted average number of choices the model believes it has at any point in predicting the next word. A lower perplexity indicates that the model is more certain of its word choices.
Perplexity measures how well a model generalizes to unseen data by calculating the model's uncertainty in predicting new text.
It's important to note that while perplexity is a standard measure for model comparison, it does not necessarily correlate perfectly with human judgments of textual quality or fluency. It is possible for a model to have low perplexity but still generate nonsensical or ungrammatical text. Therefore, perplexity is often used alongside other evaluation metrics.
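Given the probabilities a model assigned to each observed token, perplexity can be computed directly from its definition as the exponential of the average negative log-probability:

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities the model assigned to the
    actual next tokens of an observed sequence."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)
```

Consistent with the earlier intuition: if the model gives every token a 1-in-3 chance, perplexity is 3; if it predicts every token with certainty, perplexity is 1.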
This is likely the most straightforward metric on the block. It simply measures the number of tokens or characters in the generated inference. Although very simple, this metric can be quite informative when conciseness is important, notably in the case of summarization or when preparing payloads for length-restricted systems.
Compression (Grusky et al. 2018) is simply defined as the ratio between the word count in the original text |A|, and the word count in the generated inference |S|.
Compression is mostly relevant in the context of summarization.
Coverage (Grusky et al. 2018) quantifies the extent to which a summary is derivative of a text. It measures the percentage of words in the summary that are part of an extractive fragment with the article.
For example, a 10-word summary that borrows 7 words from its article text and includes 3 new words will have a coverage of 0.7.
Coverage ranges from 0 to 1 and higher is better.
Density (Grusky et al. 2018) quantifies how well the word sequence of a summary can be described as a series of extractions from the original text. For instance, a summary might contain many individual words from the article and therefore have a high coverage. However, if arranged in a new order, the words of the summary could still be used to convey ideas not present in the article.
Density is defined as the average length of the extractive fragment to which each word in the summary belongs.
For example, an article with a 10-word summary made of two extractive fragments of lengths 3 and 4 would have a coverage of 0.7 and a density of 2.5.
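Both coverage and density can be computed from the extractive fragments shared between article and summary. This is a simplified sketch of the greedy fragment matching described in Grusky et al. 2018, using single letters as stand-in tokens:

```python
def extractive_fragments(article_toks, summary_toks):
    """For each summary position, find the longest contiguous span
    shared with the article, then jump past it (greedy matching)."""
    frags, i = [], 0
    S, A = summary_toks, article_toks
    while i < len(S):
        best = 0
        for j in range(len(A)):
            if A[j] == S[i]:
                k = 0
                while i + k < len(S) and j + k < len(A) and S[i + k] == A[j + k]:
                    k += 1
                best = max(best, k)
        if best > 0:
            frags.append(S[i:i + best])
        i += max(best, 1)
    return frags

def coverage_and_density(article, summary):
    A, S = article.split(), summary.split()
    frags = extractive_fragments(A, S)
    coverage = sum(len(f) for f in frags) / len(S)       # fraction of borrowed words
    density = sum(len(f) ** 2 for f in frags) / len(S)   # avg fragment length per word
    return coverage, density
```

Running it on a 10-word summary with shared fragments of lengths 3 and 4 reproduces the example above: coverage 0.7 and density 2.5.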
Depending on the task that is being evaluated, some fairly basic heuristics can be applied to get cursory precision and recall results.
If a model is expected to output an exact result (for example a number, a class, or an exact phrase), then evaluating it can be as straightforward as applying a strict equality criterion.
For example, when trying to evaluate for basic arithmetic, the model can be prompted to output only the exact numeric result. When evaluating for responses to multiple-choice questions (e.g. A, B, C, D), the model can be prompted to output the exact letter corresponding to the selected answer.
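A minimal exact-match scorer along these lines, with optional normalization (trimming whitespace and lowercasing) added here as an assumption to forgive formatting-only differences:

```python
def exact_match_accuracy(predictions, references, normalize=True):
    """Strict-equality scoring: a prediction counts as correct only if
    it equals the reference exactly (optionally after normalization)."""
    def norm(s):
        return s.strip().lower() if normalize else s
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

For instance, with predictions ["42 ", "B"] against references ["42", "C"], the first matches after normalization and the second does not, giving 0.5.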
Going one step beyond exact match, a more flexible approach is to use regular expressions to validate the presence of certain words or tokens in the response.
For example, one can test for safety by screening for unsafe words in the response.
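A minimal sketch of such a screen; the deny-list here is purely hypothetical, and a real safety screen would rely on a curated lexicon or a dedicated safety classifier rather than a handful of keywords:

```python
import re

# Hypothetical deny-list for illustration only.
UNSAFE_PATTERN = re.compile(r"\b(bomb|poison|exploit)\b", re.IGNORECASE)

def is_safe(response: str) -> bool:
    """Flag a response that mentions any deny-listed word."""
    return UNSAFE_PATTERN.search(response) is None
```

The word-boundary anchors (`\b`) avoid false positives on substrings, and the case-insensitive flag catches capitalization variants.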
If ground truth labels are known for the evaluation dataset, one can use similarity search to evaluate how close the response is to the label.
By generating embeddings for both the response to evaluate and the ground truth label, cosine similarity can be used to quantify how close the two responses are, and with some threshold to be established, measure the accuracy of the model.
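A sketch of this thresholded similarity check, assuming the embeddings have already been computed by some embedding model; the 0.8 threshold is an arbitrary placeholder that should be calibrated per dataset:

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def semantic_accuracy(response_embs, label_embs, threshold=0.8):
    """Count a response as correct when its embedding is within a
    cosine-similarity threshold of the ground-truth label's embedding."""
    hits = sum(cosine_similarity(r, l) >= threshold
               for r, l in zip(response_embs, label_embs))
    return hits / len(label_embs)
```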
LLMs are often used to parse user inputs into structured data formats such as JSON payloads. Even when carefully prompted with the desired data schema, models will occasionally return invalid JSON, or a payload that does not comply with the instructed schema.
Standards such as JSON Schema can be used to systematically validate an entire evaluation dataset and quantify models’ compliance.
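A minimal stand-in for full JSON Schema validation using only the standard library; the expected keys and types below are a hypothetical schema for illustration, and in practice the `jsonschema` package would validate against a real schema document:

```python
import json

# Hypothetical schema: required keys and their expected Python types.
EXPECTED = {"name": str, "age": int}

def is_valid_payload(raw: str) -> bool:
    """True if the string parses as JSON and matches the expected shape."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (isinstance(payload, dict)
            and all(isinstance(payload.get(k), t) for k, t in EXPECTED.items()))

def compliance_rate(outputs):
    """Fraction of model outputs that are valid payloads."""
    return sum(map(is_valid_payload, outputs)) / len(outputs)
```

Run over an evaluation dataset, `compliance_rate` gives the fraction of model outputs that satisfy the schema.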
Benchmarks are a very common and useful way to evaluate the performance of models. They are widely used in academia and the open-source community to establish model leaderboards (See Leaderboards section).
Benchmarks are usually composed of one or multiple highly curated datasets including ground truth labels, and a scoring mechanism to clearly evaluate correctness of the generated answer.
HumanEval (Chen et al. 2021) is a dataset of coding problems designed to evaluate models’ ability to write code.
The dataset consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.
The problem is described in plain English (see docstrings below) and correctness is evaluated using the pass@k metric. A single problem is considered solved if it passes a set of unit tests. pass@k measures the fraction of problems that pass after up to k solutions are generated for each problem.
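The HumanEval paper gives an unbiased estimator for pass@k: from n generated samples per problem, of which c pass the unit tests, it computes the probability that at least one of k randomly drawn samples is correct.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al. 2021):
    1 - C(n - c, k) / C(n, k), the probability that a random
    subset of k samples contains at least one correct one."""
    if n - c < k:
        return 1.0  # not enough failing samples to fill a subset of size k
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=4 samples of which c=2 pass, pass@2 is 1 - C(2,2)/C(4,2) = 5/6.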
A HumanEval leaderboard can be found here. Results as of this writing are displayed below.
MMLU (Measuring Massive Multitask Language Understanding) (Hendrycks et al. 2021) is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn.
This covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability.
See MMLU leaderboard here.
The AI2 Reasoning Challenge (ARC) (Clark et al. 2018) dataset is a multiple-choice question-answering dataset containing questions from science exams from grade 3 to grade 9. The dataset is split into two partitions: Easy and Challenge, where the latter contains the more difficult questions that require reasoning. Most of the questions have 4 answer choices, with <1% of all the questions having either 3 or 5 answer choices.
Find the ARC leaderboard here.
HellaSwag (Zellers et al. 2019) is a challenge dataset for evaluating commonsense NLI that is especially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy).
Models are presented with a context, and a number of possible completions. They must select the most likely completion.
The HellaSwag leaderboard can be found here.
TruthfulQA (Lin et al. 2022) is a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. The authors crafted questions that some humans would answer falsely due to a false belief or misconception.
Given a question, the model is tasked with generating a 1-2 sentence answer. The primary objective is overall truthfulness, expressed as the percentage of the model's answers that are true. Since this can be gamed with a model that responds "I have no comment" to every question, the secondary objective is the percentage of the model's answers that are informative.
Evaluation is done using fine-tuned GPT-3 ("GPT-judge", "GPT-info"), BLEURT, ROUGE, BLEU. The GPT-3 metrics are trained end-to-end to predict human evaluations of truthfulness and informativeness. BLEURT, ROUGE, and BLEU are used to compare the model's answer to each of the true and false reference answers. The score is then given by (max similarity to a true reference answer) - (max similarity to a false reference answer).
The TruthfulQA leaderboard can be found here.
Evaluation harnesses are scripts or frameworks that let users run a large number of evaluation benchmarks at once. They integrate with a large number of benchmark datasets and metrics and enable running them all in one pass.
Eleuther AI’s harness is one of the most comprehensive and popular. It features over 200 evaluation tasks, including all the benchmarks listed above. This harness is used to populate the HuggingFace LLM leaderboard.
If your model is registered on HuggingFace, you can run the harness with a single command line.
Explore the harness on Eleuther AI’s Github repository here.
Leaderboards use individual or aggregate evaluation metrics to rank models. They are useful to get a high-level glance of what models perform best, and to keep the community competitive.
The HuggingFace leaderboard is likely the most popular leaderboard. It lets any user submit a model and will run Eleuther AI’s evaluation harness on four benchmarks: ARC, HellaSwag, MMLU, and TruthfulQA. It uses an average of all four accuracies to generate the final ranking.
Models can be filtered by quantizations, sizes, and types.
The figure below shows the evolution of top scores on all four benchmarks over time, showing staggeringly fast improvement. ARC’s top score is almost on par with the human baseline.
Large Model Systems Organization (LMSYS Org) is an open research organization founded by students and faculty from UC Berkeley.
They offer a leaderboard based on three benchmarks:
The interesting part is the Elo pairwise ranking system using crowdsourced human data.
Metrics and benchmarks are very useful to rank models but they do not necessarily inform model performance on dimensions outside of the ones they are designed for (e.g. summarization, translation, mathematics, common sense).
Here are a number of attributes that cannot be quantified by existing metrics and benchmarks:
Evaluating these attributes can be crucial before confidently integrating models into production applications.
A novel area of experimentation proposes using LLMs as judges to evaluate these properties in other models.
When using a scoring model to evaluate the output of other models, the scoring prompt should contain a description of the attributes to score and the grading scale, and it is interpolated with the inference to grade.
In this example, the model is asked to evaluate the response’s language style and return a classification.
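A minimal sketch of such a scoring setup; the prompt wording, the label set, and the fallback behavior are illustrative assumptions, not a prescribed format:

```python
import json

# Hypothetical scoring prompt; {response} is interpolated with the
# inference to grade before the prompt is sent to the judge model.
SCORING_PROMPT = """You are a strict evaluator. Classify the language
style of the response below as one of: formal, casual, technical.
Answer with a JSON object: {{"style": "<label>"}}.

Response:
{response}"""

def build_prompt(response: str) -> str:
    return SCORING_PROMPT.format(response=response)

def parse_judgement(raw: str) -> str:
    """Parse the judge's JSON reply, falling back to 'unknown' when
    the judge fails to follow the output format."""
    try:
        return json.loads(raw).get("style", "unknown")
    except json.JSONDecodeError:
        return "unknown"
```

Constraining the judge to a small label set and a JSON output makes its verdicts easy to parse and aggregate across an evaluation dataset.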
Lin et al. 2023 introduced the LLM-Eval evaluation method in May 2023. LLM-Eval uses a single-prompt approach with a reference model as scoring judge.
They do this by augmenting the input prompt with a description of the attributes to score and asking the judging model to output a JSON payload that contains scores across 4 dimensions: appropriateness, content, grammar, and relevance.
They compare the correlation of the obtained scores with human scores, and compare these correlations with those of specialized metrics for the four dimensions. They observe that LLM-Eval consistently beats specialized metrics, as shown below.
Zhu et al. 2023 introduced a series of fine-tuned models called JudgeLM. The models are trained on a large dataset (100k examples) of judging instructions, answer pairs (from two different LLMs), and judgments with reasoning produced by GPT-4.
The results below show that the JudgeLM models beat GPT-3.5 at agreeing with GPT-4 on the evaluation dataset.
Note that this work does not attempt to score inferences on particular dimensions, but instead shows that LLMs can correctly identify the correct answer when presented with two options.
Kim et al. 2023 fine-tuned Llama 2 (7B and 13B) on a custom-built dataset of 100K+ language feedback examples generated by GPT-4 in order to achieve similar scoring performance.
They showed that, using a rubric and a reference answer, their models’ scores correlate with human scores more highly than GPT-4’s.
At this time, LLM-assisted evaluation is the only way to evaluate arbitrary abstract concepts in language models (e.g. child readability, toxicity, language mode, etc.). That being said, it remains an uncertain method as the scoring model itself is subject to stochastic variations, which will introduce variance in the evaluation results.
Additionally, as was shown in the above literature review, GPT-4 remains the most powerful AI judge, but it is notoriously slow and costly, which can make scoring large datasets prohibitively expensive.
Better and cheaper scoring performance could likely be achieved by fine-tuning models for specific attributes, e.g. safety scoring models, groundedness scoring models, etc.
One risk worth mentioning in the LLM-assisted approach is that of contamination. If the evaluation dataset was somehow included in the training set, evaluation results would be skewed. Since LLM training datasets are sometimes undisclosed, it can be challenging to guarantee no contamination.
In this article, we explored a wide range of methods and techniques to evaluate LLMs. Most methods, such as metrics and benchmarks, are crucial to set baselines for the field and let model producers compete against one another, but can fall short of predicting the quality of models for more domain-specific tasks.
For this purpose, the best and safest, though expensive and unscalable, method is still human-based evaluation. A promising new technique involves using a judge LLM as a scoring model to evaluate other models on arbitrary attributes.
We encourage the reader to carefully design an evaluation harness that is scalable and repeatable as a core component of their AI development workflow.