Data-Centric AI (DCAI)

Data-Centric AI #

Topic Q&A Summary
1. Data-Centric AI vs. Model-Centric AI
2. Label Errors and Confident Learning
3. Advanced Confident Learning, LLM and GenAI applications
4. Class Imbalance, Outliers, and Distribution Shift
5. Dataset Creation and Curation
6. Data-centric Evaluation of ML Models
7. Data Curation for LLMs
8. Growing or Compressing Datasets
9. Interpretability in Data-Centric ML
10. Encoding Human Priors: Data Augmentation and Prompt Engineering
11. Data Privacy and Security

Q1: How does Data-Centric AI differ from Model-Centric AI? #

  • Model-Centric AI improves models assuming fixed data.
  • Data-Centric AI improves data quality, coverage, and structure, recognizing data as the bottleneck for model success.

Q2: Why are Label Errors and Confident Learning crucial? #

  • Label errors silently degrade model performance.
  • Confident Learning identifies mislabeled data points using model predictions and corrects them systematically.
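
A minimal, self-contained sketch of the core idea (per-class confidence thresholds estimated from out-of-sample predicted probabilities). Libraries such as cleanlab implement the full method; the function name and toy data below are illustrative only.

```python
import numpy as np

def flag_label_issues(labels, pred_probs):
    """Flag likely label errors: an example is suspect when some *other* class's
    predicted probability clears that class's self-confidence threshold and
    also exceeds the probability assigned to the given label."""
    n_classes = pred_probs.shape[1]
    # Per-class threshold: mean predicted probability among examples given that label.
    thresholds = np.array([pred_probs[labels == k, k].mean() for k in range(n_classes)])
    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        confident = np.where(p >= thresholds)[0]   # classes the model is confident about
        confident = confident[confident != y]      # ignore the given label itself
        if len(confident) and p[confident].max() > p[y]:
            issues.append(i)
    return np.array(issues)

# pred_probs should come from cross-validated predict_proba to avoid overfitting.
labels = np.array([0, 1, 1, 0])
pred_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.85, 0.15], [0.8, 0.2]])
print(flag_label_issues(labels, pred_probs))  # -> [2]: likely mislabeled
```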

Q3: What advances exist in Confident Learning and LLM/GenAI applications? #

  • Advanced Confident Learning improves robustness when model predictions and labels are noisy.
  • It is applied to foundation models such as LLMs by cleaning both real training data and synthetic (model-generated) data.

Q4: How are Class Imbalance, Outliers, and Distribution Shift handled? #

  • Class imbalance is managed via over/under-sampling (e.g., SMOTE) and class weighting.
  • Outliers are detected with isolation-based methods such as Isolation Forest or with autoencoder reconstruction error; both concerns are sketched in the example below.
  • Distribution shifts are diagnosed and corrected with careful monitoring and adaptation.
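
A minimal sketch of two of these steps, assuming scikit-learn: class weighting for imbalance and Isolation Forest for outliers. SMOTE-style resampling would come from a separate library such as imbalanced-learn.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.05).astype(int)   # ~5% positives: heavily imbalanced

# Class imbalance: reweight the loss instead of (or in addition to) resampling.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Outliers: isolation-based detection; predict() returns -1 for suspected outliers.
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
outlier_mask = iso.predict(X) == -1
print(int(outlier_mask.sum()), "suspected outliers flagged for review")
```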

Q5: What is essential about Dataset Creation and Curation? #

  • Good datasets start with thoughtful design, balanced sampling, and robust label validation.
  • Crowdsourced labels should be aggregated with consensus models such as Dawid-Skene or CROWDLAB rather than taken at face value.
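
A toy consensus loop in the spirit of Dawid-Skene; this simplified version only re-weights annotators by agreement with the current consensus, whereas the real algorithm estimates full per-annotator confusion matrices via EM.

```python
import numpy as np

def consensus_labels(votes, n_classes, n_iters=10):
    """Toy consensus: alternate between estimating each annotator's accuracy
    against the current consensus and re-weighting their votes by that accuracy.
    `votes` is (n_items, n_annotators), with -1 for missing votes."""
    n_items, n_annotators = votes.shape
    weights = np.ones(n_annotators)
    for _ in range(n_iters):
        tally = np.zeros((n_items, n_classes))
        for a in range(n_annotators):
            seen = votes[:, a] >= 0
            tally[seen, votes[seen, a]] += weights[a]   # weighted vote tally per item
        consensus = tally.argmax(axis=1)                # current best-guess labels
        for a in range(n_annotators):
            seen = votes[:, a] >= 0
            weights[a] = (votes[seen, a] == consensus[seen]).mean() if seen.any() else 0.5
    return consensus, weights

votes = np.array([[0, 0, 1], [1, 1, 1], [0, 1, 0], [1, 0, 1]])   # 4 items, 3 annotators
print(consensus_labels(votes, n_classes=2))
```

CROWDLAB additionally folds a trained model's predictions into the consensus alongside the annotator votes.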

Q6: How does Data-Centric Evaluation of ML Models change standard practices? #

  • Global metrics alone are not enough: slice-based evaluation, error analysis, and influence functions are needed (see the example below).
  • Subpopulations and rare cases must be properly assessed.
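
A minimal slice-based evaluation sketch using pandas; the column names and slices are hypothetical stand-ins for per-example test results.

```python
import pandas as pd

# Hypothetical evaluation frame: one row per test example.
df = pd.DataFrame({
    "slice":   ["mobile", "mobile", "desktop", "desktop", "desktop"],
    "correct": [1, 0, 1, 1, 1],
    "loss":    [0.2, 2.3, 0.1, 0.3, 0.2],
})

# Per-slice accuracy exposes subgroups that the global metric hides.
print(df.groupby("slice")["correct"].mean())

# Highest-loss examples are the first candidates for manual error analysis.
print(df.sort_values("loss", ascending=False).head(3))
```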

Q7: How is Data Curation for LLMs unique? #

  • LLMs memorize their training data, so quality problems in that data propagate directly into generated outputs.
  • Curating high-quality fine-tuning data, filtering synthetic data, and evaluating outputs with uncertainty quantification are critical (a filtering sketch follows).
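
A sketch of an uncertainty-based filtering pass over candidate fine-tuning examples. The `judge_score` field is a hypothetical quality signal; in practice it could be a model's mean token log-probability or an LLM-judge rating.

```python
# Hypothetical filtering pass over candidate fine-tuning examples.
def score_example(example: dict) -> float:
    return example["judge_score"]  # placeholder quality/confidence signal in [0, 1]

candidates = [
    {"prompt": "Summarize...", "response": "...", "judge_score": 0.92},
    {"prompt": "Translate...", "response": "...", "judge_score": 0.35},
]

THRESHOLD = 0.8  # tune on a small, manually audited held-out set
curated = [ex for ex in candidates if score_example(ex) >= THRESHOLD]
print(len(curated), "of", len(candidates), "examples kept for fine-tuning")
```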

Q8: What role does Growing or Compressing Datasets play? #

  • Active learning grows datasets smartly by labeling only informative samples.
  • Core-set selection compresses datasets while preserving model performance, making training efficient.
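
Minimal sketches of both ideas, assuming NumPy arrays of predicted probabilities and feature vectors: uncertainty sampling for active learning, and greedy k-center selection as a simple core-set strategy.

```python
import numpy as np

def uncertainty_sample(pred_probs, k):
    """Active learning: pick the k unlabeled points the model is least sure about
    (lowest top-class probability) and send them to annotators."""
    confidence = pred_probs.max(axis=1)
    return np.argsort(confidence)[:k]

def greedy_coreset(X, k):
    """Core-set selection (greedy k-center): repeatedly add the point farthest
    from the current selection so the subset covers the data geometry."""
    selected = [0]
    dists = np.linalg.norm(X - X[0], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)

X = np.random.default_rng(0).normal(size=(500, 8))
pred_probs = np.random.default_rng(1).dirichlet(np.ones(3), size=500)
print(uncertainty_sample(pred_probs, k=5), greedy_coreset(X, k=5))
```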

Q9: How does Interpretability relate to Data-Centric AI? #

  • Models are only as interpretable as their features.
  • Human-in-the-loop feature engineering ensures features are understandable, relevant, and actionable.
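
A small illustration of the point, with hypothetical raw columns turned into features a domain reviewer can read and act on.

```python
import pandas as pd

# Hypothetical raw table; the engineered columns are human-readable by design.
df = pd.DataFrame({"last_login": pd.to_datetime(["2024-01-02", "2023-06-15"]),
                   "n_purchases": [14, 1]})

now = pd.Timestamp("2024-03-01")
df["days_since_login"] = (now - df["last_login"]).dt.days   # recency, in plain days
df["is_repeat_buyer"] = df["n_purchases"] >= 2               # simple yes/no flag
print(df[["days_since_login", "is_repeat_buyer"]])
```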

Q10: How do we Encode Human Priors into Models? #

  • Via Data Augmentation: enriching datasets to encode invariances (e.g., rotation, Mixup); a Mixup sketch follows this list.
  • Via Prompt Engineering: guiding LLMs at inference time with careful input manipulation.
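
A minimal Mixup sketch in NumPy; labels must be one-hot (or otherwise continuous) so they can be mixed along with the inputs.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=np.random.default_rng(0)):
    """Mixup: train on convex combinations of example pairs, encoding the prior
    that predictions should vary smoothly between training points."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x_mix, y_mix = mixup(np.ones(4), np.array([1.0, 0.0]),
                     np.zeros(4), np.array([0.0, 1.0]))
print(x_mix, y_mix)  # a blended input with a correspondingly blended label
```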

Q11: How do we secure Data Privacy and Security? #

  • ML models risk leaking sensitive information.
  • Defenses include membership inference mitigation, differential privacy, model regularization, and careful threat modeling.
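
A DP-SGD-style sketch of the clip-and-noise step in NumPy; a production system would use a dedicated library (e.g., Opacus or TensorFlow Privacy) with a proper privacy accountant rather than this toy routine.

```python
import numpy as np

def privatize_gradients(per_example_grads, clip_norm=1.0, noise_multiplier=1.1,
                        rng=np.random.default_rng(0)):
    """Clip each example's gradient so no single record dominates, then add
    Gaussian noise calibrated to the clip norm before averaging."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / norms)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped.shape[1])
    return clipped.mean(axis=0) + noise / len(per_example_grads)

grads = np.random.default_rng(1).normal(size=(32, 10))   # 32 examples, 10 parameters
print(privatize_gradients(grads).shape)                  # (10,) noisy, clipped update
```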

Q12: How does the full picture of Data-Centric AI flow? #

  1. Frame the problem with a Data-Centric mindset.
  2. Curate a well-constructed, balanced, and interpretable dataset.
  3. Detect and fix label errors, outliers, and bias early.
  4. Train models, but re-evaluate data quality whenever errors surface.
  5. Focus evaluations on slices and high-loss examples.
  6. Grow datasets when needed (active learning) or compress intelligently (core-sets).
  7. Secure models against privacy attacks.
  8. Continuously refine, because data evolves in deployment.

Q13: Final Takeaway #

  • In Data-Centric AI, data is the model.
  • Every improvement — in accuracy, fairness, robustness, trust, and security — roots back to the data.