Introducing AI-powered classification and labeling

Emmanuel Turlay·9/13/2024

The first feature we launched in the Airtrain Data Explorer was automatic semantic clustering. To build the clusters, we generate an embedding for each row of an ingested dataset, group the rows into clusters organized in a two-tier hierarchy, and have an LLM label each cluster.
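To make the two-tier idea concrete, here is a minimal sketch in plain Python. It is not Airtrain's implementation: the post does not say which clustering algorithm is used, so k-means is an assumption, the 2D vectors stand in for real embeddings, and the function names (`kmeans`, `two_tier_clusters`) are hypothetical. The LLM labeling step is left out.

```python
import random
from collections import defaultdict


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors, k, iterations=20, seed=0):
    """Return a cluster index (0..k-1) for each vector. Assumed algorithm."""
    rng = random.Random(seed)
    k = min(k, len(vectors))
    centroids = rng.sample(vectors, k)
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignments


def two_tier_clusters(embeddings, top_k=2, sub_k=2):
    """Map each row index to a (top-level cluster, sub-cluster) pair."""
    top = kmeans(embeddings, top_k)
    rows_by_top = defaultdict(list)
    for row, cluster in enumerate(top):
        rows_by_top[cluster].append(row)
    hierarchy = {}
    for cluster, rows in rows_by_top.items():
        # Second tier: cluster again inside each top-level cluster.
        sub = kmeans([embeddings[r] for r in rows], sub_k)
        for row, sub_cluster in zip(rows, sub):
            hierarchy[row] = (cluster, sub_cluster)
    return hierarchy


# Toy stand-ins for row embeddings.
embeddings = [
    [0.0, 0.1], [0.1, 0.0], [0.0, 1.0], [0.1, 1.1],
    [5.0, 5.0], [5.1, 5.2], [5.0, 6.0], [5.1, 6.1],
]
hierarchy = two_tier_clusters(embeddings)
for row, (top_cluster, sub_cluster) in sorted(hierarchy.items()):
    print(row, top_cluster, sub_cluster)
```

In a real pipeline, each cluster at both tiers would then be summarized and named by an LLM, as described above.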

Our users found this capability incredibly useful for discovering the makeup of a dataset by revealing unknown niches, overrepresented topics, etc. To address the needs of high-performance use cases, users requested the ability to design and name the clusters ahead of time, allowing Airtrain to place each dataset row in the appropriate cluster.

As a result of user feedback, today we are excited to launch AI-powered classification and labeling in the Airtrain Data Explorer.

Now, when you ingest a dataset into Airtrain, you can specify as many classes or labels as you like, each with a name and a comprehensive natural language description. Airtrain’s optimized labeling AI will automatically assign each row in your dataset to the appropriate class.

Data labeling, a perennial requirement for ML and AI

Data labeling sits at the core of a supervised Machine Learning task. Without ground-truth labels, there is no “supervision” to teach a model to predict.

Traditionally, labeling a dataset (i.e. assigning a class to each entry) has been the purview of human workforces. This process is often painstaking. The workforce first needs to be trained to recognize each class in the data, and then has to perform the labeling work itself. Error rates are often non-negligible, which leads to poor model performance.

Additionally, human labeling is prohibitively costly in both time and money:

| Dataset size | Typical labeling platform | Airtrain Data Platform |
| --- | --- | --- |
| 10,000 rows | Weeks and $1,000 | Minutes and $0 |
| 100,000 rows | Weeks and $10,000 | Minutes and $200 |
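For a per-row view of the same figures, the short snippet below divides each total by the number of rows. The dollar amounts are copied from the table; the helper name `cost_per_row` is just for illustration.

```python
# (rows, typical platform cost in USD, Airtrain cost in USD), copied from the table.
pricing = [(10_000, 1_000, 0), (100_000, 10_000, 200)]


def cost_per_row(total_cost, rows):
    return total_cost / rows


for rows, typical, airtrain in pricing:
    print(
        f"{rows:>7,} rows: typical ${cost_per_row(typical, rows):.3f}/row, "
        f"Airtrain ${cost_per_row(airtrain, rows):.3f}/row"
    )
```

With these numbers, the typical platform works out to $0.10 per row at both sizes, while the 100,000-row Airtrain figure works out to $0.002 per row.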

Classification without a purpose-trained model

Before the advent of Large Language Models, a classification task would require a purpose-trained model. That means curating a training dataset specific to the classification task including ground-truth labels, running training jobs, tuning hyper-parameters, and evaluating the model. This painstaking process would require at least one full-time engineer and may have to be repeated periodically to maintain model performance.

Airtrain’s classification AI needs no special training for your data. Simply specify your classes and carefully describe them in plain English, and Airtrain will assign each row in your dataset to the appropriate class. This replaces thousands of human hours and produces very competitive performance on typical classification benchmarks.
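The general pattern of classifying from class descriptions alone can be sketched as follows. This is an illustration of description-driven classification in general, not Airtrain's API or internals: `LabelSpec`, `build_prompt`, `classify_row`, and the prompt wording are all hypothetical, and `keyword_stub_llm` is a toy stand-in for a real model call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class LabelSpec:
    """A class defined only by a name and a plain-English description."""
    name: str
    description: str


def build_prompt(labels: Sequence[LabelSpec], row_text: str) -> str:
    lines = ["Assign the text to exactly one of the following classes.", ""]
    for label in labels:
        lines.append(f"- {label.name}: {label.description}")
    lines += ["", f"Text: {row_text}", "Answer with the class name only."]
    return "\n".join(lines)


def classify_row(
    labels: Sequence[LabelSpec],
    row_text: str,
    llm: Callable[[str], str],
    fallback: str = "unknown",
) -> str:
    """Ask the model for a class name and accept it only if it is a known class."""
    answer = llm(build_prompt(labels, row_text)).strip().lower()
    names = {label.name.lower(): label.name for label in labels}
    return names.get(answer, fallback)


def keyword_stub_llm(prompt: str) -> str:
    """Toy stand-in for a real LLM call (assumption, for the demo only)."""
    text = prompt.split("Text: ", 1)[1].lower()
    return "positive" if "loved" in text else "negative"


labels = [
    LabelSpec("positive", "The reviewer liked the movie."),
    LabelSpec("negative", "The reviewer disliked the movie."),
]
for row in ["I loved this film", "Dull and far too long"]:
    print(row, "->", classify_row(labels, row, keyword_stub_llm))
```

Validating the answer against the known class names keeps a malformed model response from silently creating a new class.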

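| Dataset | Task | # classes | Accuracy | Precision | Recall | F1 score |
| --- | --- | --- | --- | --- | --- | --- |
| mteb/mtop_domain | Domain classification | 11 | 94.8% | 94.1% | 95.1% | 0.95 |
| mteb/imdb | Sentiment analysis | 2 | 90.4% | 90.5% | 90.4% | 0.90 |
| wangrongsheng/ag_news | News articles topic classification | 4 | 86.1% | 87.0% | 86.3% | 0.86 |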
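For readers who want to compute the same kind of metrics on their own labeled data, here is a minimal sketch. It assumes macro averaging (the unweighted mean of per-class scores); the post does not state which averaging scheme was used for the numbers above, and the function name `classification_metrics` and the toy labels are illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy ground truth and predictions, not taken from the benchmarks.
y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]
metrics = classification_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
```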

Confusion matrix for a classification task on the mteb/mtop_domain dataset.
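A confusion matrix like this one counts, for every pair of true and predicted class, how many rows fall into that pair. Here is a minimal sketch of how one can be built; the three class names and the toy predictions are illustrative and are not the benchmark's data.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


labels = ["alarm", "music", "weather"]
y_true = ["alarm", "alarm", "music", "weather", "weather"]
y_pred = ["alarm", "music", "music", "weather", "alarm"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>8}", row)
```

Correct predictions sit on the diagonal, so off-diagonal cells show which classes get mistaken for which.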

Get started for free now!

With Airtrain, you can get your data labeled and classified within minutes. Get started today by visiting https://app.airtrain.com and uploading your first free dataset of up to 10,000 rows.
