How 15 top LLMs perform on classification: accuracy vs. cost breakdown
Historically, text classification was a non-trivial endeavor in terms of both human effort and cost. It required a specialized model trained on a large labeled dataset: a substantial project involving gathering data, labeling it, training and evaluating the model, monitoring it in production, retraining, and so on.
Like many other things, this has been widely disrupted by the advent of Large Language Models (LLMs). As next-token predictors trained on vast amounts of data, LLMs can classify text with strong performance, even in zero-shot settings, i.e., with no labeled data provided.
So how do they perform on typical classification tasks, which ones perform best, and what is the performance-cost tradeoff? In this post, we present our analysis of the classification performance and cost of 15 state-of-the-art LLMs.
Why use LLMs for classification tasks?
- Domain flexibility – Traditional classification models only perform well within the bounds defined by their training dataset; if the production data drifts, the model must be retrained on updated data. As zero-shot classifiers, LLMs can natively adapt to changes in the data distribution, and the same LLM can often serve several classification tasks instead of one dedicated model per task.
- Cost – While LLM inference is more expensive than inference with a traditional classification model, there is no upfront development cost: no workforce is needed to label a dataset and no engineering effort to train a model.
- Operational simplicity – Depending on your privacy requirements, you may be able to use off-the-shelf inference APIs. In this case, you don’t need to worry about deploying or hosting the model.
Methodology
Dataset
We used the mteb/mtop_domain dataset, which contains 4,386 rows of commands classified across 11 categories (e.g., news, weather, music), making it suitable for domain-specific classification evaluation. Example rows are shown below, followed by a snippet for loading the dataset.
| text | label_text |
|---|---|
| Cancel my reminder about my dentist appointment | reminder |
| What time did I call Mum yesterday? | calling |
| Find music by Adele | music |
| can i get a maritime forecast | weather |
| i want you to start recording a video message for Amelia | messaging |
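The dataset is available on the Hugging Face Hub and can be loaded with the datasets library. The snippet below is a minimal sketch; the "en" config name and "test" split are assumptions about how the English subset is exposed, so adjust them if the hub layout differs.

```python
# Load the evaluation data from the Hugging Face Hub.
# The "en" config and "test" split are assumptions about how the English
# subset of mteb/mtop_domain is exposed.
from datasets import load_dataset

dataset = load_dataset("mteb/mtop_domain", "en", split="test")

print(len(dataset))                                        # number of rows in the split
print(dataset[0]["text"], "->", dataset[0]["label_text"])  # e.g., a command and its domain
```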
Models
We selected 15 state-of-the-art LLMs (as of this writing) to compare their classification performance. See Appendix [1] for the full list.
Task
All models classified the dataset using the same prompt to ensure a fair comparison. Better results could likely have been achieved by tuning the prompt for each model individually, but we kept a single prompt so that results remain directly comparable.
A further study could tune prompts per model to find each model's performance ceiling and compare those ceilings.
Unlike Airtrain AI’s classification product, this study does not use any proprietary techniques and leverages only LLMs to classify the entire dataset.
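For illustration, here is a minimal sketch of the kind of zero-shot classification call this setup implies, shown with the OpenAI Python client and GPT-4o Mini. The prompt text is illustrative rather than the exact prompt used in the benchmark, and the label list is the dataset's 11 MTOP domain names; the other providers (Anthropic, Mistral AI, Together AI) expose similar chat-completion APIs.

```python
# Illustrative zero-shot classification call; the exact benchmark prompt may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["alarm", "calling", "event", "messaging", "music", "news",
          "people", "recipes", "reminder", "timer", "weather"]

PROMPT = (
    "Classify the following command into exactly one of these categories: "
    + ", ".join(LABELS)
    + ". Answer with the category name only.\n\nCommand: {text}"
)

def classify(text: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

print(classify("can i get a maritime forecast"))  # expected: "weather"
```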
Cost
Costs were standardized by calculating the expense per 1,000 rows classified.
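As a concrete sketch, the helper below converts aggregate token usage and per-token pricing into a cost per 1,000 rows. The token counts and prices in the example are placeholders, not the rates used in this benchmark.

```python
# Normalize cost to dollars per 1,000 classified rows from total token usage.
# Prices are placeholders in USD per million tokens; substitute each provider's rates.
def cost_per_1000_rows(input_tokens: int, output_tokens: int, n_rows: int,
                       price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    total_cost = (input_tokens * price_in_per_mtok +
                  output_tokens * price_out_per_mtok) / 1_000_000
    return total_cost / n_rows * 1_000

# Hypothetical usage: 4,386 rows; aggregate token counts and prices are made up.
print(round(cost_per_1000_rows(900_000, 22_000, 4_386, 1.0, 3.0), 4))
```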
Evaluation metrics
All evaluation metrics are computed on rows for which the LLM returned an intelligible output. Rows with no discernible predicted class were very rare and were excluded from the analysis (see Appendix [3]).
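The sketch below shows how these metrics can be computed with scikit-learn while excluding unclassified rows. Macro averaging for precision, recall, and F1 is assumed here and may differ from the benchmark's exact choice.

```python
# Compute accuracy, precision, recall, and F1 on rows with a usable prediction.
# Macro averaging is an assumption; the benchmark may use a different average.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true: list[str], y_pred: list[str | None]) -> dict[str, float]:
    # Drop rows where no label could be extracted from the model output.
    kept = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
    true, pred = zip(*kept)
    precision, recall, f1, _ = precision_recall_fscore_support(
        true, pred, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(true, pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "unclassified_pct": 100.0 * (1 - len(kept) / len(y_true)),
    }
```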
Pareto Optimal Frontier
The Pareto frontier identifies the models offering the best trade-off between accuracy and cost: those for which no other model is both cheaper and more accurate.
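A minimal sketch of that selection is shown below, given each model's (cost, accuracy) pair. The example subset uses figures quoted in the results section; within the full benchmark the frontier of course differs.

```python
# Keep the models for which no other model is both cheaper (or equal in cost)
# and at least as accurate -- i.e., the non-dominated points.
def pareto_frontier(models: dict[str, tuple[float, float]]) -> list[str]:
    """models maps name -> (cost_per_1000_rows, accuracy)."""
    frontier = []
    for name, (cost, acc) in models.items():
        dominated = any(
            o_cost <= cost and o_acc >= acc and (o_cost, o_acc) != (cost, acc)
            for o_name, (o_cost, o_acc) in models.items()
            if o_name != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# A small subset of the results discussed below (cost per 1,000 rows, accuracy).
subset = {
    "GPT-4o Mini": (0.12, 0.9421),
    "Mistral NeMo 2407": (0.12, 0.9161),
    "Llama 3.1 70B": (0.60, 0.9478),
    "Claude 3 Opus": (14.67, 0.9494),
}
# Within this subset, only Mistral NeMo 2407 is dominated (by GPT-4o Mini).
print(pareto_frontier(subset))
```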
Which model is best at classification?
Below we report Accuracy, Precision, Recall, and F1 score for all 15 models.
Top Performers:
- Claude 3.5 Sonnet led the benchmark with the highest accuracy, 95.44%.
- Llama 3.1 405B followed closely with 95.19%, showing a minimal drop in accuracy while remaining efficient in deployment.
- Claude 3 Opus ranked third with 94.94% accuracy.
- Llama 3.1 70B and Mistral Large 2 delivered solid performances with accuracies of 94.78% and 94.73%, respectively, placing them among the top performers.
Insights on task-specific performance
Accuracy is critical in applications such as content moderation, recommendation systems, and dynamic responses. High-accuracy models like Claude 3.5 Sonnet excel in cases where precision is non-negotiable and minor misclassifications can have significant consequences.
What is the trade-off between cost and performance?
Having evaluated classification performance, we now compare the cost of running the same task across models. The Pareto frontier, shown in red in the plot below, highlights the models that offer the best combination of cost and performance.
See Appendix [4] for precision, recall, and F1 score vs. cost.
Cost-Efficient Options
- GPT-4o Mini stood out as a highly efficient model, offering a competitive 94.21% accuracy at just $0.12 per 1,000 rows. This makes it an attractive option for cost-conscious deployments.
- Mistral NeMo 2407, though less accurate at 91.61%, shares the same low price point of $0.12 per 1,000 rows.
Balanced Models
For users seeking a balance between cost and accuracy, Llama 3.1 70B is an excellent option. With 94.78% accuracy at $0.60 per 1,000 rows, it provides a practical solution for projects that demand solid performance without incurring too much cost.
More Expensive ≠ Better
Higher costs don’t always guarantee superior performance. Claude 3 Opus at $14.67 per 1,000 rows, for example, performed worse (94.94%) than the cheaper Claude 3.5 Sonnet. Meanwhile, GPT-4o Mini achieved 94.21% accuracy at a fraction of the cost, reinforcing the point that budget models can offer high-quality results in certain scenarios.
Conclusion
LLMs offer a unique opportunity to replace the lengthy and costly process of labeling data and training a dedicated classifier with a faster, cheaper alternative of comparable performance.
In this article, we outlined our protocol to compare performance and cost across 15 state-of-the-art models on a standard classification task and explored the cost/performance tradeoff.
For accuracy-focused applications, Claude 3.5 Sonnet excels with 95.44% accuracy at a reasonable price, followed by Llama 3.1 405B (95.19%), Llama 3.1 70B (94.78%), and Mistral Large 2 (94.73%). For cost-conscious deployments, GPT-4o Mini (94.21%) and Mistral NeMo 2407 (91.61%) offer solid alternatives with slight reductions in accuracy but significant cost savings; GPT-4o Mini in particular combines near-top performance with one of the lowest costs.
Our analysis also shows that more expensive models don't always perform better. Ultimately, choosing the right model depends on your specific problem as well as your requirements around cost and performance; in our case, optimizing for both led us to a more affordable model. It's also worth noting that many factors affect a model's cost-effectiveness, from its architecture to economic factors such as strategic pricing and market-share considerations.
Appendix
[1] API Model Slugs and Distributors
To maintain transparency in our benchmarking process, here is a table summarizing the models used, their distributors, and the slugs:
| Model | Distributor | Slug |
|---|---|---|
| GPT-4o | OpenAI | gpt-4o-2024-08-06 |
| GPT-4o Mini | OpenAI | gpt-4o-mini |
| Mistral Large 2 | Mistral AI | mistral-large-2407 |
| Mistral NeMo 24.07 | Mistral AI | open-mistral-nemo-2407 |
| Mistral Small 24.09 | Mistral AI | mistral-small-2409 |
| Claude 3 Opus | Anthropic | claude-3-opus-20240229 |
| Claude 3.5 Sonnet | Anthropic | claude-3-5-sonnet-20240620 |
| Claude 3 Haiku | Anthropic | claude-3-haiku-20240307 |
| Llama 3.2 90B Vision Instruct Turbo | Together AI | meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo |
| Llama 3.2 11B Vision Instruct Turbo | Together AI | meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo |
| Llama 3.1 405B Instruct Turbo | Together AI | meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo |
| Llama 3.1 70B Instruct Turbo | Together AI | meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo |
| Llama 3.1 8B Instruct Turbo | Together AI | meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo |
| Gemma 2 27B IT | Together AI | google/gemma-2-27b-it |
| WizardLM-2 8x22B | Together AI | microsoft/WizardLM-2-8x22B |
[3] Bad Rows Exclusion
When asking the models for a classification, it was not always possible to extract a definite answer from the output. For each model, we tracked this fraction of "unclassified" rows and excluded them from the analysis; the percentages are listed below. With additional prompt engineering and more sophisticated result parsing (a sketch of such parsing follows the table), this number could likely be reduced further. In any case, the rate was small enough not to affect the results significantly.
| Model | Unclassified rows |
|---|---|
| Claude 3 Opus | 0% |
| GPT-4o | 0% |
| Claude 3.5 Sonnet | 0% |
| Mistral Large 2 | 0% |
| Llama 3.2 90B Vision | 0% |
| WizardLM-2 8x22B | 0% |
| Llama 3.1 70B | 0% |
| Llama 3.2 11B Vision | 0% |
| GPT-4o Mini | 0% |
| Gemma 2 27B IT | 0.02% |
| Llama 3.1 405B | 0.06% |
| Mistral Small 24.09 | 0.07% |
| Claude 3 Haiku | 0.07% |
| Llama 3.1 8B | 0.18% |
| Mistral NeMo 24.07 | 0.32% |
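As an illustration of the parsing involved, the sketch below maps a raw model response to one of the dataset's 11 domain labels and returns None when no single label can be recovered. It is a minimal sketch; the parsing logic actually used in the benchmark may differ.

```python
# Map a raw model response to one of the known labels, or None if no single
# label can be recovered (such rows are counted as unclassified).
import re

LABELS = ["alarm", "calling", "event", "messaging", "music", "news",
          "people", "recipes", "reminder", "timer", "weather"]

def extract_label(raw_output: str) -> str | None:
    cleaned = raw_output.strip().lower().rstrip(".")
    if cleaned in LABELS:  # exact match on the whole response
        return cleaned
    # Otherwise accept a response that mentions exactly one known label.
    mentioned = [label for label in LABELS if re.search(rf"\b{label}\b", cleaned)]
    return mentioned[0] if len(mentioned) == 1 else None

print(extract_label("Weather"))                    # "weather"
print(extract_label("The category is: music."))    # "music"
print(extract_label("Could be news or weather."))  # None -> unclassified
```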
[4] Pareto plots
Plots of precision, recall, and F1 score against cost per 1,000 rows for each model, referenced from the cost/performance section above.