Introducing Airtrain
In November 2022, the game changed: building AI-powered products became accessible to every developer. It suddenly became possible to enrich product features with inferences not just from an AI model, but from the most advanced AI model in history, just an API request away.
Although very large models exhibit impressive generative performance, they are also notoriously slow, expensive, and difficult to deploy and run, especially in the midst of the current GPU shortage. Meanwhile, it has been shown repeatedly that smaller models can perform on par with very large models on specific tasks when fine-tuned on high-quality datasets.
At Airtrain, we think the future is made of smaller fine-tuned models dedicated to specific tasks. We also believe that the first step towards training and fine-tuning models is evaluation.
Evaluation is All You Need
Well, not really, but it is indispensable. As we learned building ML infrastructure for Cruise, safe and accurate machine learning is a loop: data is mined and processed into feature datasets, then a model is trained, evaluated, run through simulations, and, if all checks out, deployed to production. Rinse and repeat.
Although training is the core of this loop, evaluation is the gate: does this new model perform well enough for the business case it is intended for? If no, back to the drawing board. If yes, off to production.
Traditional supervised ML has well-known metrics to evaluate the quality of models. However, evaluation for LLMs is still an unsolved problem. Metrics and benchmarks exist, and they are useful to compare models and build leaderboards, but they do not tell you how a model performs on a specific task. For example, common metrics such as BLEU and ROUGE mostly evaluate translation and summarization, and benchmarks such as GLUE and MMLU mostly evaluate general reasoning and knowledge.
So how can you evaluate LLMs on your own dataset, and with metrics specific to your own application?
Introducing Airtrain
Airtrain is a no-code compute platform for batch evaluation workloads.
With Airtrain, you can upload your evaluation dataset with up to 10,000 examples, select models to evaluate or bring your own inferences, and design a set of benchmark metrics specific to your application. We start by offering three main evaluation methods:
- LLM-assisted evaluation – You describe in plain English the attributes and properties of interest for your application (e.g. creativity, groundedness, politeness), and we use a scoring model to grade the output of the evaluated models on a scale of 1 to 10 (see the first sketch after this list).
- JSON schema validation – Many applications expect a JSON payload from LLMs. When a model fails to construct a response matching the required schema, the application breaks and gives users a bad experience. Airtrain lets you describe the required schema and compare which prompts and models give the highest levels of compliance (see the second sketch after this list).
- Unsupervised metrics – Standalone metrics such as length, compression, and density can be a good rule of thumb to compare models, depending on the target use case. After your job has completed, you can visualize the distribution of each metric across your entire dataset, and browse individual examples to dive deeper and discover issues (see the third sketch after this list).
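To make the first method concrete, here is a minimal sketch of what LLM-assisted scoring can look like under the hood, assuming an OpenAI chat model is used as the scorer. The prompt wording and the `score_output` helper are illustrative assumptions, not Airtrain's actual implementation.

```python
# Sketch of LLM-assisted evaluation: a scoring model grades an output on a
# 1-10 scale for a property described in plain English.
# The prompt and helper are illustrative, not Airtrain's implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SCORING_PROMPT = """You are grading the output of a language model.
Property to grade: {property_description}

Model output:
{model_output}

Reply with a single integer from 1 (worst) to 10 (best)."""


def score_output(model_output: str, property_description: str) -> int:
    """Ask the scoring model to grade `model_output` for the given property."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": SCORING_PROMPT.format(
                property_description=property_description,
                model_output=model_output,
            ),
        }],
        temperature=0,
    )
    # A production scorer would parse more defensively than int() on raw text.
    return int(response.choices[0].message.content.strip())


print(score_output(
    "The sky is blue because of Rayleigh scattering.",
    "groundedness: the answer sticks to well-established facts",
))
```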
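The JSON schema check can be reproduced with off-the-shelf tooling. The sketch below uses the open-source `jsonschema` Python package and a made-up sentiment schema; Airtrain does not necessarily score compliance this way, but the idea is the same: parse the raw model output, validate it, and report the compliance rate across the dataset.

```python
# Sketch of JSON schema compliance checking for LLM outputs.
import json
from jsonschema import Draft7Validator

# Example schema an application might expect from the model (illustrative).
SCHEMA = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["sentiment", "confidence"],
}

validator = Draft7Validator(SCHEMA)


def is_compliant(llm_output: str) -> bool:
    """True if the raw model text parses as JSON and matches the schema."""
    try:
        payload = json.loads(llm_output)
    except json.JSONDecodeError:
        return False
    return not list(validator.iter_errors(payload))


outputs = [
    '{"sentiment": "positive", "confidence": 0.92}',
    '{"sentiment": "great!", "confidence": "high"}',    # wrong enum and type
    'Sure! Here is the JSON you asked for: {...}',      # not valid JSON at all
]
compliance_rate = sum(is_compliant(o) for o in outputs) / len(outputs)
print(f"compliance: {compliance_rate:.0%}")
```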
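Finally, a rough sketch of reference-free metrics. "Length" and "compression" below follow common summarization usage; "density" is simplified here to the share of output words copied from the source, which is only a proxy for the extractive fragment density used in the literature. Airtrain's exact definitions may differ.

```python
# Sketch of simple unsupervised (reference-free) metrics.
def words(text: str) -> list[str]:
    return text.lower().split()


def length(output: str) -> int:
    """Word count of the model output."""
    return len(words(output))


def compression(source: str, output: str) -> float:
    """How much shorter the output is than its source (higher = more compressed)."""
    return len(words(source)) / max(len(words(output)), 1)


def density(source: str, output: str) -> float:
    """Simplified proxy: share of output words that appear verbatim in the source."""
    source_vocab = set(words(source))
    out = words(output)
    return sum(w in source_vocab for w in out) / max(len(out), 1)


source = ("The city council voted on Tuesday to approve a new bike lane "
          "network covering twelve miles of downtown streets.")
summary = "The council approved a twelve-mile downtown bike lane network."
print(length(summary),
      round(compression(source, summary), 2),
      round(density(source, summary), 2))
```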
Compare LLMs on your own dataset and with your own metrics.
Get started now!
At Airtrain, we are always eager to receive feedback from our users. Sign up now to get early access, and join our community on Slack to give us feedback, report issues, and get product updates.
A comprehensive AI platform
Dataset Curation
Generate high-quality datasets.
LLM Fine-Tuning
Customize LLMs to your specific use case.
LLM Playground
Vibe-check 30+ SOTA LLMs at once.
LLM Evaluation
Compare LLMs on your entire eval set.