Data-Centric AI vs Model-Centric AI #

Q1: What is the traditional model-centric approach to machine learning? #

Traditionally, machine learning has focused on model-centric AI, where the dataset is fixed and the goal is to tune the model:

  • Experiment with different model architectures
  • Tune hyperparameters and training losses
  • Improve model performance under the assumption of clean, fixed data
  • This is how ML is typically taught in courses (e.g., MIT 6.036); a minimal sketch of the loop appears below
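
For concreteness, here is a minimal Python sketch of that model-centric loop; the dataset, model, and parameter grid are illustrative choices, not prescribed by any course. The data stays fixed while the model and its hyperparameters are swept:

```python
# Model-centric loop: the data is fixed; only the model and its
# hyperparameters change. Dataset and parameter grid are illustrative.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # fixed dataset, assumed clean
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tune the model, not the data: sweep kernels and regularization.
search = GridSearchCV(
    SVC(),
    param_grid={"kernel": ["rbf", "poly"], "C": [0.1, 1, 10]},
    cv=5,
)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_test, y_test))
```

Note that everything the model-centric workflow varies lives inside param_grid; the dataset X, y is never touched.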

Q2: What challenges arise in real-world ML settings? #

Real-world data is:

  • Messy and error-prone
  • Neither fixed nor clean
  • Full of label errors, noise, and distribution shifts

Therefore, improving data quality can be more impactful than model tinkering.


Q3: What is Data-Centric AI (DCAI)? #

Data-Centric AI focuses on improving data quality to enhance model performance:

  • Given any model, systematically improve the dataset.
  • Examples:
    • Curriculum Learning (order training from easy to hard examples)
    • Confident Learning (identify and remove label errors; see the sketch below)

Q4: How is Data-Centric AI different from Model-Centric AI? #

Feature          | Model-Centric AI                      | Data-Centric AI
Goal             | Improve the model                     | Improve the dataset
Fixed / variable | Dataset is fixed                      | Dataset is changeable
Techniques       | Tuning, loss functions, architectures | Error correction, augmentation, etc.
Real-world use   | Less applicable in messy domains      | Highly applicable

Q5: What techniques fall under Data-Centric AI? #

  • Outlier detection and removal
  • Error correction
  • Data augmentation
  • Feature selection
  • Active learning (see the sketch after this list)
  • Establishing consensus labels
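
To make one of these concrete, here is a from-scratch sketch of uncertainty-sampling active learning (the dataset, batch size, and round count are illustrative): the model repeatedly requests labels for the examples it is least confident about.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, random_state=0)
labeled = np.zeros(len(X), dtype=bool)
labeled[:20] = True                      # small seed set of labels

model = LogisticRegression(max_iter=1000)
for _ in range(5):                       # five rounds of querying
    model.fit(X[labeled], y[labeled])
    probs = model.predict_proba(X[~labeled])
    uncertainty = 1 - probs.max(axis=1)  # least-confident sampling
    # Map positions within the unlabeled pool back to dataset indices,
    # then "ask the annotator" for the 10 most uncertain examples.
    query = np.flatnonzero(~labeled)[np.argsort(uncertainty)[-10:]]
    labeled[query] = True

print(f"labeled {labeled.sum()} of {len(X)} examples")
```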

Q6: Why is DCAI gaining attention now? #

  • Large foundation models (e.g., DALL-E, GPT-3) suffer from data issues
  • Examples:
    • OpenAI has identified data and label errors as major bottlenecks for model performance
    • ChatGPT was fine-tuned on curated data, using human rankings of model responses
    • Tesla’s Data Engine feeds model failures back into dataset improvements

Q7: What’s a real example of a label error found in a famous dataset? #

Hinton pointed out a label error in the MNIST dataset (e.g., an image mislabeled as the digit ‘3’), the kind of error that data-centric techniques such as confident learning are designed to surface systematically.


Q8: What is PU Learning and why is it called “the perceptron of Data-Centric AI”? #

PU Learning stands for Positive-Unlabeled Learning: training a classifier from

  • A small set of labeled positive examples
  • A large set of unlabeled data
  • No explicitly labeled negatives

It is called “the perceptron of DCAI” because:

  • It is a foundational technique that shows how much can be gained by improving the data rather than the model.
  • Just as the perceptron is the gateway to model-centric ML, PU Learning introduces core data-centric thinking.

PU Learning is used when:

  • Full labeling is impractical or costly (e.g., medical records, fraud detection)
  • You need to build a classifier from partial label information

Example: if you only know a few patients with a disease (the positives) and the rest of your records are unlabeled, PU Learning builds a classifier without assuming all unlabeled cases are negative. A minimal sketch of one classical PU method follows.
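
Below is a minimal sketch of one classical PU approach, the Elkan–Noto (2008) calibration method; it illustrates the idea and is not necessarily the exact method a given course uses. It trains a “labeled vs. unlabeled” classifier g(x) ≈ P(s=1 | x), estimates c = P(s=1 | y=1) from held-out labeled positives, and rescales: P(y=1 | x) = g(x) / c.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y_true = make_classification(n_samples=5000, weights=[0.8], random_state=0)

# s = 1 iff an example carries a positive label; only a fifth of the
# true positives are labeled, everything else is unlabeled.
s = np.zeros(len(X), dtype=int)
pos = np.flatnonzero(y_true == 1)
s[rng.choice(pos, size=len(pos) // 5, replace=False)] = 1

X_tr, X_hold, s_tr, s_hold = train_test_split(X, s, random_state=0)

# Step 1: "non-traditional" classifier g(x) ~ P(s=1 | x),
# trained on labeled-positive vs. unlabeled.
g = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)

# Step 2: c = P(s=1 | y=1), estimated as the mean score of g on
# held-out *labeled* positives (Elkan & Noto's key identity).
c = g.predict_proba(X_hold[s_hold == 1])[:, 1].mean()

# Step 3: calibrated positive probability, P(y=1 | x) = g(x) / c.
p_pos = np.clip(g.predict_proba(X)[:, 1] / c, 0.0, 1.0)
print("estimated positive rate:", (p_pos > 0.5).mean())
print("true positive rate:     ", y_true.mean())
```

The key design choice is that the model never sees negative labels; the single constant c corrects for how incompletely the positives were labeled.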