Data-Centric AI vs Model-Centric AI #
- MIT Lecture : https://dcai.csail.mit.edu/
- GitHub for Labs: https://github.com/dcai-course/dcai-lab
Q1: What is the traditional model-centric approach to machine learning? #
Traditionally, machine learning has focused on model-centric AI, where the dataset is fixed and the goal is to tune the model:
- Learn different model architectures
- Modify hyperparameters and training losses
- Focus on improving model performance given clean data
- This is how ML is often taught in courses (e.g., MIT 6.036)
Q2: What challenges arise in real-world ML settings? #
Real-world data is:
- Often messy and error-prone
- Not fixed or clean
- Prone to label errors, noise, and distribution shifts
Therefore, improving data quality can be more impactful than model tinkering.
Q3: What is Data-Centric AI (DCAI)? #
Data-Centric AI focuses on improving data quality to enhance model performance:
- Given any model, systematically improve the dataset.
- Examples:
  - Curriculum Learning (train on easy examples before hard ones)
  - Confident Learning (identify and remove label errors; see the sketch after this list)
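As a concrete illustration of Confident Learning, here is a minimal sketch using the open-source cleanlab library; the synthetic dataset and the logistic-regression model are assumptions for demonstration only, not part of the lecture.

```python
# Minimal sketch: flag likely label errors with Confident Learning (cleanlab).
# Assumes scikit-learn and cleanlab are installed; the data here is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

# Synthetic data with a few labels flipped to simulate annotation errors.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
y_noisy = y.copy()
flip_idx = np.random.RandomState(0).choice(len(y), size=30, replace=False)
y_noisy[flip_idx] = 1 - y_noisy[flip_idx]

# Out-of-sample predicted probabilities, as confident learning requires.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y_noisy, cv=5, method="predict_proba"
)

# Rank the examples most likely to be mislabeled.
issue_idx = find_label_issues(
    labels=y_noisy, pred_probs=pred_probs, return_indices_ranked_by="self_confidence"
)
print(f"Flagged {len(issue_idx)} potential label errors; top 10 indices: {issue_idx[:10]}")
```

The flagged indices would then be reviewed and relabeled or dropped before retraining the model.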
Q4: How is Data-Centric AI different from Model-Centric AI? #
| Feature | Model-Centric AI | Data-Centric AI |
|---|---|---|
| Goal | Improve the model | Improve the dataset |
| Fixed vs. variable | Dataset is fixed | Dataset is changeable |
| Techniques | Hyperparameter tuning, loss functions, architectures | Error correction, augmentation, etc. |
| Real-world use | Less applicable in messy domains | Highly applicable |
Q5: What techniques fall under Data-Centric AI? #
- Outlier detection and removal (see the sketch after this list)
- Error correction
- Data augmentation
- Feature selection
- Active learning
- Establishing consensus labels
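To make the first of these techniques concrete, the sketch below flags likely outliers with scikit-learn's IsolationForest before any model training; the synthetic data and the contamination rate are illustrative assumptions.

```python
# Minimal sketch: flag likely outliers in training data before model fitting.
# IsolationForest assigns -1 to predicted outliers and 1 to inliers.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 5))    # typical points
X_outliers = rng.uniform(low=-8, high=8, size=(10, 5))      # injected anomalies
X_all = np.vstack([X_normal, X_outliers])

detector = IsolationForest(contamination=0.02, random_state=0)
labels = detector.fit_predict(X_all)

outlier_idx = np.where(labels == -1)[0]
print(f"Flagged {len(outlier_idx)} points for review before training.")
```

The flagged points would typically be inspected by hand rather than deleted automatically, since some "outliers" are rare but valid examples.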
Q6: Why is DCAI gaining attention now? #
- Large foundation models (e.g., DALL-E, GPT-3) are limited by data quality issues
- Examples:
  - OpenAI has identified data and label quality problems as major performance bottlenecks
  - ChatGPT was fine-tuned on curated data with human preference rankings (RLHF)
  - Tesla's Data Engine feeds model outputs back into dataset improvements
Q7: What’s a real example of a label error found in a famous dataset? #
Geoffrey Hinton has pointed out a label error in the MNIST dataset (e.g., a mislabeled digit '3'), illustrating that even the most widely used benchmark datasets contain mistakes that data-centric techniques can surface.
Q8: What is PU Learning and why is it called “the perceptron of Data-Centric AI”? #
PU Learning stands for Positive-Unlabeled Learning. It deals with training models using:
- A small set of positive examples
- A large set of unlabeled data
- No explicitly labeled negatives

It's called "the perceptron of DCAI" because:
- It is a foundational concept that demonstrates the importance of improving data rather than models.
- Just as the perceptron is a gateway to model-centric ML, PU Learning introduces core data-centric thinking.

PU Learning is used when:
- Full labeling is impractical or costly (e.g., medical records, fraud detection)
- You need to build a classifier from partial label information

Example:
- If you only know a few patients with a disease (positives) and the rest of your records are unlabeled, PU Learning helps build a classifier without assuming all unlabeled cases are negative, as shown in the sketch below.
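Below is a minimal sketch of one classic PU Learning recipe (the Elkan-Noto rescaling trick): train a classifier to separate labeled positives from unlabeled points, estimate how often true positives actually get labeled, and rescale the predicted scores accordingly. The synthetic data, labeling rate, and model choice are assumptions for illustration only.

```python
# Minimal PU Learning sketch (Elkan & Noto, 2008):
# 1) train a classifier for P(labeled | x) on positives vs. unlabeled,
# 2) estimate the label frequency c = P(labeled | positive) on held-out positives,
# 3) rescale scores by 1/c to approximate P(positive | x).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X, y_true = make_classification(n_samples=2000, n_features=10, random_state=0)

# Simulate the PU setting: only 20% of true positives are labeled; the rest is unlabeled.
labeled = (y_true == 1) & (rng.rand(len(y_true)) < 0.2)
s = labeled.astype(int)  # s = 1 means "labeled positive", s = 0 means "unlabeled"

# Step 1: classifier for P(s = 1 | x), trained on labeled positives vs. unlabeled.
X_train, X_hold, s_train, s_hold = train_test_split(
    X, s, test_size=0.3, random_state=0, stratify=s
)
clf = LogisticRegression(max_iter=1000).fit(X_train, s_train)

# Step 2: estimate c from held-out labeled positives only.
held_out_positives = X_hold[s_hold == 1]
c = clf.predict_proba(held_out_positives)[:, 1].mean()

# Step 3: convert P(s = 1 | x) into an estimate of P(y = 1 | x).
p_positive = np.clip(clf.predict_proba(X)[:, 1] / c, 0.0, 1.0)
print(f"Estimated label frequency c = {c:.2f}")
print(f"Predicted positive fraction: {(p_positive > 0.5).mean():.2f} "
      f"(true positive fraction: {(y_true == 1).mean():.2f})")
```

The key design choice is never treating unlabeled examples as confirmed negatives; the rescaling step corrects for the fact that only a fraction of true positives ever receive a label.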