Data-Centric AI vs Model-Centric AI #
- MIT Lecture : https://dcai.csail.mit.edu/
- GitHub for Labs: https://github.com/dcai-course/dcai-lab
Q1: What is the traditional model-centric approach to machine learning? #
Traditionally, machine learning has focused on model-centric AI, where the dataset is fixed and the goal is to tune the model:
- Learn different model architectures
- Modify hyperparameters and training losses
- Focus on improving model performance given clean data
- This is how ML is often taught in courses (e.g., MIT 6.036)
Q2: What challenges arise in real-world ML settings? #
Real-world data is:
- Often messy and error-prone
- Not fixed or clean
- Prone to label errors, noise, and distribution shifts
Therefore, improving data quality can be more impactful than model tinkering.
Q3: What is Data-Centric AI (DCAI)? #
Data-Centric AI focuses on improving data quality to enhance model performance:
- Given any model, systematically improve the dataset.
- Examples:
  - Curriculum Learning (train on easy examples before hard ones)
  - Confident Learning (identify and remove label errors; see the sketch after this list)
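As a concrete illustration of Confident Learning, here is a minimal sketch using the open-source cleanlab library; the synthetic dataset and the logistic-regression model are assumptions for demonstration only, not part of the lecture.

```python
# Minimal sketch: flag likely label errors with Confident Learning (cleanlab).
# Assumes scikit-learn and cleanlab are installed; the data here is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

# Synthetic data with a few labels flipped to simulate annotation errors.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
y_noisy = y.copy()
flip_idx = np.random.RandomState(0).choice(len(y), size=30, replace=False)
y_noisy[flip_idx] = 1 - y_noisy[flip_idx]

# Out-of-sample predicted probabilities, as confident learning requires.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y_noisy, cv=5, method="predict_proba"
)

# Rank the examples most likely to be mislabeled.
issue_idx = find_label_issues(
    labels=y_noisy, pred_probs=pred_probs, return_indices_ranked_by="self_confidence"
)
print(f"Flagged {len(issue_idx)} potential label errors; top 10 indices: {issue_idx[:10]}")
```

The flagged indices would then be reviewed and relabeled or dropped before retraining the model.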
Q4: How is Data-Centric AI different from Model-Centric AI? #
| Feature | Model-Centric AI | Data-Centric AI |
|---|---|---|
| Goal | Improve the model | Improve the dataset |
| Fixed vs. variable | Dataset is fixed | Dataset is changeable |
| Techniques | Hyperparameter tuning, loss functions, architectures | Error correction, augmentation, etc. |
| Real-world use | Less applicable in messy domains | Highly applicable |
Q5: What techniques fall under Data-Centric AI? #
- Outlier detection and removal (see the sketch after this list)
- Error correction
- Data augmentation
- Feature selection
- Active learning
- Establishing consensus labels
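To make the first of these techniques concrete, the sketch below flags likely outliers with scikit-learn's IsolationForest before any model training; the synthetic data and the contamination rate are illustrative assumptions.

```python
# Minimal sketch: flag likely outliers in training data before model fitting.
# IsolationForest assigns -1 to predicted outliers and 1 to inliers.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 5))    # typical points
X_outliers = rng.uniform(low=-8, high=8, size=(10, 5))      # injected anomalies
X_all = np.vstack([X_normal, X_outliers])

detector = IsolationForest(contamination=0.02, random_state=0)
labels = detector.fit_predict(X_all)

outlier_idx = np.where(labels == -1)[0]
print(f"Flagged {len(outlier_idx)} points for review before training.")
```

The flagged points would typically be inspected by hand rather than deleted automatically, since some "outliers" are rare but valid examples.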
Q6: Why is DCAI gaining attention now? #
- Large foundation models (e.g., DALL-E, GPT-3) are limited by data quality issues
- Examples:
  - OpenAI has identified data and label quality problems as major performance bottlenecks
  - ChatGPT was fine-tuned on curated data with human preference rankings (RLHF)
  - Tesla's Data Engine feeds model outputs back into dataset improvements
Q7: What’s a real example of a label error found in a famous dataset? #
Geoffrey Hinton has pointed out a label error in the MNIST dataset (e.g., a mislabeled digit '3'), illustrating that even the most widely used benchmark datasets contain mistakes that data-centric techniques can surface.
Q8: What is PU Learning and why is it called “the perceptron of Data-Centric AI”? #
PU Learning stands for Positive-Unlabeled Learning. It deals with training models using:
- A small set of positive examples
- A large set of unlabeled data
- No explicitly labeled negatives

It's called "the perceptron of DCAI" because:
- It is a foundational concept that demonstrates the importance of improving data rather than models.
- Just as the perceptron is a gateway to model-centric ML, PU Learning introduces core data-centric thinking.

PU Learning is used when:
- Full labeling is impractical or costly (e.g., medical records, fraud detection)
- You need to build a classifier from partial label information

Example:
- If you only know a few patients with a disease (positives) and the rest of your records are unlabeled, PU Learning helps build a classifier without assuming all unlabeled cases are negative, as shown in the sketch below.
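Below is a minimal sketch of one classic PU Learning recipe (the Elkan-Noto rescaling trick): train a classifier to separate labeled positives from unlabeled points, estimate how often true positives actually get labeled, and rescale the predicted scores accordingly. The synthetic data, labeling rate, and model choice are assumptions for illustration only.

```python
# Minimal PU Learning sketch (Elkan & Noto, 2008):
# 1) train a classifier for P(labeled | x) on positives vs. unlabeled,
# 2) estimate the label frequency c = P(labeled | positive) on held-out positives,
# 3) rescale scores by 1/c to approximate P(positive | x).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X, y_true = make_classification(n_samples=2000, n_features=10, random_state=0)

# Simulate the PU setting: only 20% of true positives are labeled; the rest is unlabeled.
labeled = (y_true == 1) & (rng.rand(len(y_true)) < 0.2)
s = labeled.astype(int)  # s = 1 means "labeled positive", s = 0 means "unlabeled"

# Step 1: classifier for P(s = 1 | x), trained on labeled positives vs. unlabeled.
X_train, X_hold, s_train, s_hold = train_test_split(
    X, s, test_size=0.3, random_state=0, stratify=s
)
clf = LogisticRegression(max_iter=1000).fit(X_train, s_train)

# Step 2: estimate c from held-out labeled positives only.
held_out_positives = X_hold[s_hold == 1]
c = clf.predict_proba(held_out_positives)[:, 1].mean()

# Step 3: convert P(s = 1 | x) into an estimate of P(y = 1 | x).
p_positive = np.clip(clf.predict_proba(X)[:, 1] / c, 0.0, 1.0)
print(f"Estimated label frequency c = {c:.2f}")
print(f"Predicted positive fraction: {(p_positive > 0.5).mean():.2f} "
      f"(true positive fraction: {(y_true == 1).mean():.2f})")
```

The key design choice is never treating unlabeled examples as confirmed negatives; the rescaling step corrects for the fact that only a fraction of true positives ever receive a label.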