How 15 top LLMs perform on classification: accuracy vs. cost breakdown

Idriss Chebak·10/15/2024

Historically, classification tasks were non-trivial endeavors, both in terms of labor and cost. They required a specialized model trained on a large labeled dataset, a consequential project involving gathering the data, labeling it, training and evaluating the model, monitoring and retraining, and so on.

Like many other things, this has been widely disrupted by the advent of Large Language Models (LLMs). As next-token predictors trained on vast amounts of data, LLMs can classify text data with great performance, even in zero-shot contexts, i.e. with no labeled data provided.

So how do they perform on typical classification tasks, which ones perform best, and what is the performance-cost tradeoff? In this post, we present our analysis of the classification performance and cost of 15 state-of-the-art LLMs.

Why use LLMs for classification tasks?

  • Domain flexibility – traditional classification models can only perform well within the bounds defined by their training dataset. If the production data drifts, the model will need to be retrained with updated data. As zero-shot classifiers, LLMs can natively adapt to changes in the distribution of the data. You may also be able to use the same LLM for various classification tasks instead of dedicated models.
  • Cost – Even though LLM inference is more expensive than that of a traditional classification model, there is no upfront cost to get a working classifier: no labeling or engineering workforce is needed to build a labeled dataset or train a model.
  • Operational simplicity – Depending on your privacy requirements, you may be able to use off-the-shelf inference APIs. In this case, you don’t need to worry about deploying or hosting the model.

Methodology

Dataset

We used the mteb/mtop_domain dataset, containing 4,386 rows of commands classified across 11 categories (e.g., news, weather, music), suitable for domain-specific classification evaluation. See example rows below.

| text | label_text |
| --- | --- |
| Cancel my reminder about my dentist appointment | reminder |
| What time did I call Mum yesterday? | calling |
| Find music by Adele | music |
| can i get a maritime forecast | weather |
| i want you to start recording a video message for Amelia | messaging |
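
For reference, the dataset can be loaded directly from the Hugging Face Hub. Below is a minimal loading sketch; it assumes the datasets library and the English test split, which matches the 4,386-row count mentioned above.

```python
# Minimal sketch: load the evaluation data with the Hugging Face `datasets` library.
# Assumes the English configuration and its test split (4,386 rows).
from datasets import load_dataset

ds = load_dataset("mteb/mtop_domain", "en", split="test")
print(len(ds))                                   # 4386
print(ds[0]["text"], "->", ds[0]["label_text"])  # a command and its domain label
```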

Models

We selected 15 state-of-the-art LLMs (as of this writing) to compare their classification performance. See Appendix [1] for the full list.

Task

All models classified the dataset using the same prompt to ensure consistency. Individually tuned prompts could likely have yielded better results for a given model, but for the sake of comparability we used a single shared prompt across all models.
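
The exact prompt is not reproduced here; the snippet below is a hypothetical sketch of the kind of zero-shot classification call this involves, using the OpenAI Python client as an example and the 11 mtop_domain labels as candidate classes.

```python
# Hypothetical zero-shot classification sketch (not the study's exact prompt).
# Assumes the `openai` Python client and an OPENAI_API_KEY in the environment.
from openai import OpenAI

CATEGORIES = [
    "alarm", "calling", "event", "messaging", "music", "news",
    "people", "recipes", "reminder", "timer", "weather",
]

client = OpenAI()

def classify(text: str, model: str = "gpt-4o-mini") -> str:
    prompt = (
        "Classify the following command into exactly one of these categories: "
        + ", ".join(CATEGORIES)
        + ". Respond with the category name only.\n\nCommand: " + text
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

print(classify("can i get a maritime forecast"))  # expected: "weather"
```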

A further study could tune the prompt for each model individually to obtain its best possible performance and compare that ceiling across models.

Unlike Airtrain AI’s classification product, this study does not use any proprietary techniques and leverages only LLMs to classify the entire dataset.

Cost

Costs were standardized by calculating the expense per 1,000 rows classified.
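
As an illustration, the per-1,000-row cost can be derived from average token counts and a provider's per-token pricing. The numbers in the sketch below are placeholders, not the actual figures from this benchmark.

```python
# Illustrative sketch: standardize cost as USD per 1,000 classified rows.
# Token counts and prices below are placeholders, not the benchmark's numbers.
def cost_per_1000_rows(avg_input_tokens: float, avg_output_tokens: float,
                       usd_per_1m_input: float, usd_per_1m_output: float) -> float:
    per_row = (avg_input_tokens * usd_per_1m_input
               + avg_output_tokens * usd_per_1m_output) / 1_000_000
    return per_row * 1_000

# e.g. ~150 prompt tokens and ~5 output tokens at $0.15 / $0.60 per 1M tokens
print(round(cost_per_1000_rows(150, 5, 0.15, 0.60), 4))  # 0.0255 USD per 1,000 rows
```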

Evaluation metrics

All evaluation metrics are computed on rows for which an intelligible output was returned by the LLM. Bad rows (no discernible predicted class) were very rare and excluded from the analysis [3].
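
Concretely, the metrics can be computed with scikit-learn after filtering out the unclassified rows. Macro averaging is assumed in this sketch; the averaging mode is not stated in the post.

```python
# Metric computation sketch, assuming scikit-learn and macro averaging.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def score(y_true, y_pred_raw):
    # Keep only rows where a discernible predicted class was extracted.
    pairs = [(y, p) for y, p in zip(y_true, y_pred_raw) if p is not None]
    y = [a for a, _ in pairs]
    p = [b for _, b in pairs]
    precision, recall, f1, _ = precision_recall_fscore_support(
        y, p, average="macro", zero_division=0
    )
    return {"accuracy": accuracy_score(y, p),
            "precision": precision, "recall": recall, "f1": f1}
```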

Pareto Optimal Frontier

The Pareto frontier was used to identify models offering the best trade-off between accuracy and cost, where improvements in one could not be made without increasing the other.
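
Concretely, a model sits on the frontier if no other model is both cheaper (or equally priced) and at least as accurate. Below is a small sketch using only cost and accuracy figures quoted later in this post:

```python
# Sketch: identify Pareto-optimal models over (cost, accuracy).
def pareto_frontier(models):
    frontier = []
    for name, cost, acc in models:
        # A model is dominated if another model costs no more, is at least as
        # accurate, and differs on at least one of the two axes.
        dominated = any(
            c <= cost and a >= acc and (c, a) != (cost, acc)
            for _, c, a in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Subset of models whose costs (USD per 1,000 rows) are quoted in this post.
print(pareto_frontier([
    ("GPT-4o Mini",       0.12,  0.9421),
    ("Mistral NeMo 2407", 0.12,  0.9161),
    ("Llama 3.1 70B",     0.60,  0.9478),
    ("Claude 3 Opus",     14.67, 0.9494),
]))
# Within this subset: ['GPT-4o Mini', 'Llama 3.1 70B', 'Claude 3 Opus']
```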

Which model is best at classification?

Below we report accuracy, precision, recall, and F1 score for all 15 models.

[Figure: Accuracy by model (accuracy.png)]

[Figure: Precision by model (precision.png)]

[Figure: Recall by model (recall.png)]

[Figure: F1 score by model (f1score.png)]

Top Performers:

  • Claude 3.5 Sonnet led the benchmark with the highest accuracy at 95.44%, making it the most accurate model.
  • Llama 3.1 405B followed closely with 95.19%, showing a minimal drop in accuracy while remaining efficient in deployment.
  • Claude 3 Opus ranked third, achieving 94.94% accuracy.
  • Llama 3.1 70B and Mistral Large 2 delivered solid performances with accuracies of 94.78% and 94.73%, respectively, placing them among the top performers.

Insights on task-specific performance

Accuracy is critical in applications such as content moderation, recommendation systems, and dynamic responses. High-accuracy models like Claude 3.5 Sonnet excel in cases where precision is non-negotiable and minor misclassifications could have significant consequences.

What is the trade-off between cost and performance?

After evaluating classification performance, we now compare the cost of running these classification tasks. The Pareto frontier, shown in red in the plot below, highlights the models offering the best trade-off between cost and performance.

[Figure: Accuracy vs. cost per 1,000 rows, with the Pareto frontier in red (accuracy_cost_pareto.png)]

See Appendix [4] for precision, recall, and F1 score vs. cost.

Cost-Efficient Options

  • GPT-4o Mini stood out as a highly efficient model, offering a competitive 94.21% accuracy at just $0.12 per 1,000 rows. This makes it an attractive option for cost-conscious deployments.
  • Mistral NeMo 2407 matches that low $0.12 per 1,000 rows price point, making it another very cheap option, though at a lower accuracy of 91.61%.

Balanced Models

For users seeking a balance between cost and accuracy, Llama 3.1 70B is an excellent option. With 94.78% accuracy at $0.60 per 1,000 rows, it provides a practical solution for projects that demand solid performance without incurring too much cost.

More Expensive ≠ Better

Higher costs don’t always guarantee superior performance. Claude 3 Opus at $14.67 per 1,000 rows, for example, performed worse (94.94%) than the cheaper Claude 3.5 Sonnet. Meanwhile, GPT-4o Mini achieved 94.21% accuracy at a fraction of the cost, reinforcing the point that budget models can offer high-quality results in certain scenarios.

Conclusion

LLMs offer a unique opportunity to replace the lengthy and costly process of labeling datasets and training dedicated classifiers with a faster and cheaper alternative of comparable performance.

In this article, we outlined our protocol to compare performance and cost across 15 state-of-the-art models on a standard classification task and explored the cost/performance tradeoff.

For accuracy-focused applications, Claude 3.5 Sonnet excels with 95.44% accuracy at a reasonable price, followed by Llama 3.1 405B (95.19%), Llama 3.1 70B (94.78%), and Mistral Large 2 (94.73%). On the budget side, models like GPT-4o Mini (94.21%) and Mistral NeMo 2407 (91.61%) offer solid alternatives with slight reductions in accuracy but significant cost savings; GPT-4o Mini in particular combines near-top performance with one of the lowest costs.

Our analysis shows that more expensive models don't always perform better. Ultimately, choosing the right model depends on your specific problem as well as your requirements around cost and performance. In our case, optimizing for both cost and performance led to a more affordable model. It's worth noting that many factors affect a model's cost-effectiveness, from its architecture to complex economic factors like strategic pricing and market-share considerations.

Appendix

[1] API Model Slugs and Distributors

To maintain transparency in our benchmarking process, here is a table summarizing the models used, their distributors, and the slugs:

| Model | Distributor | Slug |
| --- | --- | --- |
| GPT-4o | OpenAI | gpt-4o-2024-08-06 |
| GPT-4o Mini | OpenAI | gpt-4o-mini |
| Mistral Large 2 | Mistral AI | mistral-large-2407 |
| Mistral NeMo 24.07 | Mistral AI | open-mistral-nemo-2407 |
| Mistral Small 24.09 | Mistral AI | mistral-small-2409 |
| Claude 3 Opus | Anthropic | claude-3-opus-20240229 |
| Claude 3.5 Sonnet | Anthropic | claude-3-5-sonnet-20240620 |
| Claude 3 Haiku | Anthropic | claude-3-haiku-20240307 |
| Llama 3.2 90B Vision Instruct Turbo | Together AI | meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo |
| Llama 3.2 11B Vision Instruct Turbo | Together AI | meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo |
| Llama 3.1 405B Instruct Turbo | Together AI | meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo |
| Llama 3.1 70B Instruct Turbo | Together AI | meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo |
| Llama 3.1 8B Instruct Turbo | Together AI | meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo |
| Gemma 2 27B IT | Together AI | google/gemma-2-27b-it |
| WizardLM-2 8x22B | Together AI | microsoft/WizardLM-2-8x22B |

[3] Bad Rows Exclusion

When asking the models for a classification, it was not always possible to extract a definite answer from the model's output. For each model, we tracked this fraction of "unclassified" rows and excluded those rows from the analysis. Below we list the percentage of unclassified rows for each included model. With additional prompt engineering and more sophisticated result parsing, it is possible the number of such rows could be further reduced. However, the rate was small enough to not impact the results significantly.

| Model | Unclassified rows |
| --- | --- |
| Claude 3 Opus | 0% |
| GPT-4o | 0% |
| Claude 3.5 Sonnet | 0% |
| Mistral Large 2 | 0% |
| Llama 3.2 90B Vision | 0% |
| WizardLM-2 8x22B | 0% |
| Llama 3.1 70B | 0% |
| Llama 3.2 11B Vision | 0% |
| GPT-4o Mini | 0% |
| Gemma 2 27B IT | 0.02% |
| Llama 3.1 405B | 0.06% |
| Mistral Small 24.09 | 0.07% |
| Claude 3 Haiku | 0.07% |
| Llama 3.1 8B | 0.18% |
| Mistral NeMo 24.07 | 0.32% |
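
For illustration, the snippet below sketches the kind of lenient label extraction that keeps this unclassified rate low. It is a reconstruction under assumptions, not the exact parser used in the study.

```python
# Illustrative label extraction: exact match first, then a substring fallback
# for verbose answers such as "The category is: weather."
CATEGORIES = [
    "alarm", "calling", "event", "messaging", "music", "news",
    "people", "recipes", "reminder", "timer", "weather",
]

def extract_label(raw_output: str) -> str | None:
    text = raw_output.strip().lower()
    if text in CATEGORIES:
        return text
    matches = [c for c in CATEGORIES if c in text]
    # Zero or multiple matches count as an "unclassified" row.
    return matches[0] if len(matches) == 1 else None

print(extract_label("Weather"))                    # "weather"
print(extract_label("The category is: weather."))  # "weather"
print(extract_label("not sure"))                   # None
```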

[4] Pareto plots

[Figure: Precision vs. cost per 1,000 rows, with the Pareto frontier (precision_cost_pareto.png)]

[Figure: Recall vs. cost per 1,000 rows, with the Pareto frontier (recall_cost_pareto.png)]

[Figure: F1 score vs. cost per 1,000 rows, with the Pareto frontier (f1score_cost_pareto.png)]
