In this video, we review the academic literature indicating that using LLMs to evaluate other LLMs is a viable approach. The focus is on a novel method for evaluating and grading large language models (LLMs): we discuss how traditional benchmarks are useful but insufficient for assessing performance on specific tasks or datasets, and we introduce the concept of using other LLMs to score an LLM, likening it to a student-teacher dynamic.
We then describe model evaluation, a crucial part of the model development lifecycle. The process involves holding out a portion of the training data (typically 10-20%) for evaluation; this evaluation (eval) dataset is used to compare the trained model's inferences against ground-truth labels. We also highlight the difficulty of measuring performance in natural language processing (NLP), where outputs are unstructured text rather than discrete labels.
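As a minimal sketch of that workflow (the 80/20 split ratio, the exact-match metric, and the `model` callable are illustrative assumptions, not details from the video):

```python
import random

def train_eval_split(examples, eval_fraction=0.2, seed=42):
    """Hold out a fraction of the labeled data (typically 10-20%) for evaluation."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - eval_fraction))
    return shuffled[:cut], shuffled[cut:]

def exact_match_accuracy(model, eval_set):
    """Compare the model's inferences against ground-truth labels.

    Exact match works for classification-style tasks; for free-form NLP
    outputs there is usually no single correct string, which is exactly
    the measurement challenge described above.
    """
    correct = sum(1 for ex in eval_set if model(ex["input"]) == ex["label"])
    return correct / len(eval_set)
```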
We introduce a new method of using scoring models to evaluate specific attributes of LLMs. This method moves beyond traditional task-specific evaluations like summarization or translation, enabling assessment of attributes such as creativity or politeness.
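As a rough illustration of attribute scoring (not taken from any specific paper), a judge model can be prompted with a rubric and asked to return a numeric score; the prompt wording, the example attribute, and the `call_llm` helper below are all hypothetical:

```python
JUDGE_PROMPT = """You are grading a model response for {attribute}.
Rubric: 1 = very poor, 3 = acceptable, 5 = excellent.

Question: {question}
Response: {response}

Reply with a single integer from 1 to 5."""

def score_attribute(call_llm, question, response, attribute="politeness"):
    """Ask a judge LLM to rate one attribute of a response on a 1-5 scale.

    `call_llm` is a placeholder for whatever client queries the judge model
    (for example, a best-in-class model such as GPT-4 or Claude).
    """
    prompt = JUDGE_PROMPT.format(
        attribute=attribute, question=question, response=response
    )
    return int(call_llm(prompt).strip())
```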
The video reviews three academic papers that explore using LLMs to evaluate other LLMs. Each paper presents a unique approach and findings on this topic.
We highlight three potential issues with AI-based scoring methods: evaluator bias, variance in the judge model's scores, and dataset contamination.
We conclude that using LLMs to score other LLMs is a viable method, offering a wide range of scoring dimensions. Best-in-class models such as Claude and GPT-4 are effective evaluators, and smaller models can reach similar performance with fine-tuning. Providing reference answers and scoring rubrics significantly improves the correlation with human scores. However, potential biases and issues such as model variance and dataset contamination remain.
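To make the "correlation with human scores" point concrete, here is a small sketch assuming you have paired judge and human scores for the same set of responses; the Spearman rank correlation is one common choice for this comparison, not necessarily the statistic used in the papers reviewed:

```python
from scipy.stats import spearmanr

def judge_human_agreement(judge_scores, human_scores):
    """Measure how well LLM-judge scores track human scores.

    Higher rank correlation suggests the judge (especially when given a
    reference answer and a scoring rubric) is a usable stand-in for
    human evaluation.
    """
    correlation, p_value = spearmanr(judge_scores, human_scores)
    return correlation, p_value

# Illustrative usage with made-up scores:
# judge = [4, 2, 5, 3, 1]
# human = [5, 2, 4, 3, 1]
# print(judge_human_agreement(judge, human))  # strong positive correlation
```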