Introducing AI-powered classification and labeling
The first feature we launched in the Airtrain Data Explorer was automatic semantic clustering. For semantic clustering, we generate an embedding for each row in an ingested dataset, group the rows into clusters organized in a two-tier hierarchy, and have an LLM label each cluster.
Our users found this capability incredibly useful for discovering the makeup of a dataset by revealing unknown niches, overrepresented topics, etc. To address the needs of high-performance use cases, users requested the ability to design and name the clusters ahead of time, allowing Airtrain to place each dataset row in the appropriate cluster.
As a result of user feedback, today we are excited to launch AI-powered classification and labeling in the Airtrain Data Explorer.
When you ingest a dataset into Airtrain, you can now specify as many classes or labels as you like, each with a name and a comprehensive natural-language description. Airtrain’s optimized labeling AI then automatically assigns each row in your dataset to the appropriate class.
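To make this concrete, here is a minimal sketch of what a set of class definitions could look like and how they might be turned into a prompt for an LLM. It is an illustration only, not Airtrain's actual API or implementation: `LabelSpec`, `build_prompt`, `parse_label`, and the two example classes are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LabelSpec:
    """One user-defined class: a short name plus a plain-English description.

    Hypothetical structure for illustration, not an Airtrain type.
    """
    name: str
    description: str


def build_prompt(labels: List[LabelSpec], row_text: str) -> str:
    """Render the class definitions and one dataset row into a single prompt."""
    lines = ["Assign the text to exactly one of the following classes.", ""]
    for label in labels:
        lines.append(f"- {label.name}: {label.description}")
    lines += ["", f"Text: {row_text}", "Answer with the class name only."]
    return "\n".join(lines)


def parse_label(reply: str, labels: List[LabelSpec]) -> Optional[str]:
    """Map a free-text model reply back to a known class name, or None."""
    cleaned = reply.strip().strip(".\"'").lower()
    for label in labels:
        if cleaned == label.name.lower():
            return label.name
    return None


# Example classes (made up for this sketch).
labels = [
    LabelSpec("billing", "Questions about invoices, charges, refunds, or payment methods."),
    LabelSpec("technical_support", "Reports of bugs, errors, or problems using the product."),
]
prompt = build_prompt(labels, "I was charged twice this month.")
```

The point of the sketch is that the description does most of the work: the more precisely each class is described, the less ambiguity is left when a row sits near the boundary between two classes.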
Data labeling, a perennial requirement for ML and AI
Data labeling sits at the core of a supervised Machine Learning task. Without ground-truth labels, there is no “supervision” to teach a model to predict.
Traditionally, labeling a dataset (i.e. assigning a class to each entry) has been the purview of human workforces. This process is often painstaking. The workforce must first be trained to recognize and identify each class in the data, and must then perform the labeling work itself. Even so, error rates are often non-negligible and lead to poor model performance.
Additionally, human labeling is costly in both time and money:
Dataset size | Typical labeling platform | Airtrain Data Platform |
---|---|---|
10,000 rows | Weeks and $1,000 | Minutes and $0 |
100,000 rows | Weeks and $10,000 | Minutes and $200 |
Classification without a purpose-trained model
Before the advent of Large Language Models, a classification task would require a purpose-trained model. That means curating a training dataset specific to the classification task, including ground-truth labels, then running training jobs, tuning hyperparameters, and evaluating the model. This painstaking process would require at least one full-time engineer and may have to be repeated periodically to maintain model performance.
Airtrain’s classification AI needs no special training for your data. Simply specify your classes and carefully describe them in plain English, and Airtrain will assign each row in your dataset to the appropriate class. This replaces thousands of human hours and produces very competitive performance on typical classification benchmarks.
Dataset | Task | # classes | Accuracy | Precision | Recall | F1 score |
---|---|---|---|---|---|---|
mteb/mtop_domain | Domain classification | 11 | 94.8% | 94.1% | 95.1% | 0.95 |
mteb/imdb | Sentiment analysis | 2 | 90.4% | 90.5% | 90.4% | 0.90 |
wangrongsheng/ag_news | News articles topic classification | 4 | 86.1% | 87.0% | 86.3% | 0.86 |
Confusion matrix for a classification task on the mteb/mtop_domain dataset.
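For readers who want to run the same kind of evaluation on their own labeled sample, here is a minimal sketch of how a confusion matrix and the four metrics in the table can be computed. Two assumptions to flag: the post does not say how precision, recall, and F1 are averaged across classes, so this sketch uses macro averaging (the unweighted mean of per-class scores), and the labels below are toy values, not data from the benchmarks above.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]


def macro_metrics(matrix):
    """Accuracy plus macro-averaged precision, recall, and F1 (an assumption)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = matrix[i][i]
        predicted = sum(matrix[r][i] for r in range(n))  # column sum
        actual = sum(matrix[i])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return {
        "accuracy": correct / total if total else 0.0,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels for illustration only.
classes = ["news", "sports"]
y_true = ["news", "news", "sports", "sports"]
y_pred = ["news", "sports", "sports", "sports"]

matrix = confusion_matrix(y_true, y_pred, classes)
metrics = macro_metrics(matrix)
```

Off-diagonal cells in the matrix show which pairs of classes get confused with each other, which is often a hint that the two class descriptions overlap and should be tightened.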
Get started for free now!
With Airtrain, you can get your data labeled and classified within minutes. Get started today by visiting https://app.airtrain.com and uploading your first free dataset of up to 10,000 rows.