Data-Centric AI (DCAI)

Data-Centric AI #

Topic Q&A Summary
1. Data-Centric AI vs. Model-Centric AI
2. Label Errors and Confident Learning
3. Advanced Confident Learning, LLM and GenAI applications
4. Class Imbalance, Outliers, and Distribution Shift
5. Dataset Creation and Curation
6. Data-centric Evaluation of ML Models
7. Data Curation for LLMs
8. Growing or Compressing Datasets
9. Interpretability in Data-Centric ML
10. Encoding Human Priors: Data Augmentation and Prompt Engineering
11. Data Privacy and Security

Q1: How does Data-Centric AI differ from Model-Centric AI? #

  • Model-Centric AI improves models assuming fixed data.
  • Data-Centric AI improves data quality, coverage, and structure, recognizing data as the bottleneck for model success.

Q2: Why are Label Errors and Confident Learning crucial? #

  • Label errors silently degrade model performance.
  • Confident Learning identifies mislabeled data points using model predictions and corrects them systematically.
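
A minimal, self-contained sketch of the core idea (per-class confidence thresholds estimated from out-of-sample predicted probabilities). Libraries such as cleanlab implement the full method; the function name and toy data below are illustrative only.

```python
import numpy as np

def flag_label_issues(labels, pred_probs):
    """Flag likely label errors: an example is suspect when some *other* class's
    predicted probability clears that class's self-confidence threshold and
    also exceeds the probability assigned to the given label."""
    n_classes = pred_probs.shape[1]
    # Per-class threshold: mean predicted probability among examples given that label.
    thresholds = np.array([pred_probs[labels == k, k].mean() for k in range(n_classes)])
    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        confident = np.where(p >= thresholds)[0]   # classes the model is confident about
        confident = confident[confident != y]      # ignore the given label itself
        if len(confident) and p[confident].max() > p[y]:
            issues.append(i)
    return np.array(issues)

# pred_probs should come from cross-validated predict_proba to avoid overfitting.
labels = np.array([0, 1, 1, 0])
pred_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.85, 0.15], [0.8, 0.2]])
print(flag_label_issues(labels, pred_probs))  # -> [2]: likely mislabeled
```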

Q3: What advances exist in Confident Learning and LLM/GenAI applications? #

  • Advanced Confident Learning improves robustness when model predictions and labels are noisy.
  • It is applied to foundation models such as LLMs by cleaning both real training data and synthetic (model-generated) data.

Q4: How are Class Imbalance, Outliers, and Distribution Shift handled? #

  • Class imbalance is managed via over/under-sampling (e.g., SMOTE) and class weighting.
  • Outliers are detected with isolation-based methods such as Isolation Forest or with autoencoder reconstruction error; both concerns are sketched in the example below.
  • Distribution shifts are diagnosed and corrected with careful monitoring and adaptation.
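
A minimal sketch of two of these steps, assuming scikit-learn: class weighting for imbalance and Isolation Forest for outliers. SMOTE-style resampling would come from a separate library such as imbalanced-learn.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.05).astype(int)   # ~5% positives: heavily imbalanced

# Class imbalance: reweight the loss instead of (or in addition to) resampling.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Outliers: isolation-based detection; predict() returns -1 for suspected outliers.
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
outlier_mask = iso.predict(X) == -1
print(int(outlier_mask.sum()), "suspected outliers flagged for review")
```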

Q5: What is essential about Dataset Creation and Curation? #

  • Good datasets start with thoughtful design, balanced sampling, and robust label validation.
  • Crowdsourced labels should be aggregated with consensus models such as Dawid-Skene or CROWDLAB rather than taken at face value.
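
A toy consensus loop in the spirit of Dawid-Skene; this simplified version only re-weights annotators by agreement with the current consensus, whereas the real algorithm estimates full per-annotator confusion matrices via EM.

```python
import numpy as np

def consensus_labels(votes, n_classes, n_iters=10):
    """Toy consensus: alternate between estimating each annotator's accuracy
    against the current consensus and re-weighting their votes by that accuracy.
    `votes` is (n_items, n_annotators), with -1 for missing votes."""
    n_items, n_annotators = votes.shape
    weights = np.ones(n_annotators)
    for _ in range(n_iters):
        tally = np.zeros((n_items, n_classes))
        for a in range(n_annotators):
            seen = votes[:, a] >= 0
            tally[seen, votes[seen, a]] += weights[a]   # weighted vote tally per item
        consensus = tally.argmax(axis=1)                # current best-guess labels
        for a in range(n_annotators):
            seen = votes[:, a] >= 0
            weights[a] = (votes[seen, a] == consensus[seen]).mean() if seen.any() else 0.5
    return consensus, weights

votes = np.array([[0, 0, 1], [1, 1, 1], [0, 1, 0], [1, 0, 1]])   # 4 items, 3 annotators
print(consensus_labels(votes, n_classes=2))
```

CROWDLAB additionally folds a trained model's predictions into the consensus alongside the annotator votes.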

Q6: How does Data-Centric Evaluation of ML Models change standard practices? #

  • Global metrics alone are not enough: slice-based evaluation, error analysis, and influence functions are needed (see the example below).
  • Subpopulations and rare cases must be properly assessed.
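
A minimal slice-based evaluation sketch using pandas; the column names and slices are hypothetical stand-ins for per-example test results.

```python
import pandas as pd

# Hypothetical evaluation frame: one row per test example.
df = pd.DataFrame({
    "slice":   ["mobile", "mobile", "desktop", "desktop", "desktop"],
    "correct": [1, 0, 1, 1, 1],
    "loss":    [0.2, 2.3, 0.1, 0.3, 0.2],
})

# Per-slice accuracy exposes subgroups that the global metric hides.
print(df.groupby("slice")["correct"].mean())

# Highest-loss examples are the first candidates for manual error analysis.
print(df.sort_values("loss", ascending=False).head(3))
```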

Q7: How is Data Curation for LLMs unique? #

  • LLMs memorize their training data, so quality problems in that data propagate directly into generated outputs.
  • Curating high-quality fine-tuning data, filtering synthetic data, and evaluating outputs with uncertainty quantification are critical (a filtering sketch follows).
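
A sketch of an uncertainty-based filtering pass over candidate fine-tuning examples. The `judge_score` field is a hypothetical quality signal; in practice it could be a model's mean token log-probability or an LLM-judge rating.

```python
# Hypothetical filtering pass over candidate fine-tuning examples.
def score_example(example: dict) -> float:
    return example["judge_score"]  # placeholder quality/confidence signal in [0, 1]

candidates = [
    {"prompt": "Summarize...", "response": "...", "judge_score": 0.92},
    {"prompt": "Translate...", "response": "...", "judge_score": 0.35},
]

THRESHOLD = 0.8  # tune on a small, manually audited held-out set
curated = [ex for ex in candidates if score_example(ex) >= THRESHOLD]
print(len(curated), "of", len(candidates), "examples kept for fine-tuning")
```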

Q8: What role does Growing or Compressing Datasets play? #

  • Active learning grows datasets smartly by labeling only informative samples.
  • Core-set selection compresses datasets while preserving model performance, making training efficient.
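
Minimal sketches of both ideas, assuming NumPy arrays of predicted probabilities and feature vectors: uncertainty sampling for active learning, and greedy k-center selection as a simple core-set strategy.

```python
import numpy as np

def uncertainty_sample(pred_probs, k):
    """Active learning: pick the k unlabeled points the model is least sure about
    (lowest top-class probability) and send them to annotators."""
    confidence = pred_probs.max(axis=1)
    return np.argsort(confidence)[:k]

def greedy_coreset(X, k):
    """Core-set selection (greedy k-center): repeatedly add the point farthest
    from the current selection so the subset covers the data geometry."""
    selected = [0]
    dists = np.linalg.norm(X - X[0], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)

X = np.random.default_rng(0).normal(size=(500, 8))
pred_probs = np.random.default_rng(1).dirichlet(np.ones(3), size=500)
print(uncertainty_sample(pred_probs, k=5), greedy_coreset(X, k=5))
```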

Q9: How does Interpretability relate to Data-Centric AI? #

  • Models are only as interpretable as their features.
  • Human-in-the-loop feature engineering ensures features are understandable, relevant, and actionable.
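
A small illustration of the point, with hypothetical raw columns turned into features a domain reviewer can read and act on.

```python
import pandas as pd

# Hypothetical raw table; the engineered columns are human-readable by design.
df = pd.DataFrame({"last_login": pd.to_datetime(["2024-01-02", "2023-06-15"]),
                   "n_purchases": [14, 1]})

now = pd.Timestamp("2024-03-01")
df["days_since_login"] = (now - df["last_login"]).dt.days   # recency, in plain days
df["is_repeat_buyer"] = df["n_purchases"] >= 2               # simple yes/no flag
print(df[["days_since_login", "is_repeat_buyer"]])
```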

Q10: How do we Encode Human Priors into Models? #

  • Via Data Augmentation: enriching datasets to encode invariances (e.g., rotation, Mixup); a Mixup sketch follows this list.
  • Via Prompt Engineering: guiding LLMs at inference time with careful input manipulation.
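
A minimal Mixup sketch in NumPy; labels must be one-hot (or otherwise continuous) so they can be mixed along with the inputs.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=np.random.default_rng(0)):
    """Mixup: train on convex combinations of example pairs, encoding the prior
    that predictions should vary smoothly between training points."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x_mix, y_mix = mixup(np.ones(4), np.array([1.0, 0.0]),
                     np.zeros(4), np.array([0.0, 1.0]))
print(x_mix, y_mix)  # a blended input with a correspondingly blended label
```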

Q11: How do we secure Data Privacy and Security? #

  • ML models risk leaking sensitive information.
  • Defenses include membership inference mitigation, differential privacy, model regularization, and careful threat modeling.
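
A DP-SGD-style sketch of the clip-and-noise step in NumPy; a production system would use a dedicated library (e.g., Opacus or TensorFlow Privacy) with a proper privacy accountant rather than this toy routine.

```python
import numpy as np

def privatize_gradients(per_example_grads, clip_norm=1.0, noise_multiplier=1.1,
                        rng=np.random.default_rng(0)):
    """Clip each example's gradient so no single record dominates, then add
    Gaussian noise calibrated to the clip norm before averaging."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / norms)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped.shape[1])
    return clipped.mean(axis=0) + noise / len(per_example_grads)

grads = np.random.default_rng(1).normal(size=(32, 10))   # 32 examples, 10 parameters
print(privatize_gradients(grads).shape)                  # (10,) noisy, clipped update
```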

Q12: How does the full picture of Data-Centric AI flow? #

  1. Frame the problem with a Data-Centric mindset.
  2. Curate a well-constructed, balanced, and interpretable dataset.
  3. Detect and fix label errors, outliers, and bias early.
  4. Train models, but re-evaluate data quality whenever errors surface.
  5. Focus evaluations on slices and high-loss examples.
  6. Grow datasets when needed (active learning) or compress intelligently (core-sets).
  7. Secure models against privacy attacks.
  8. Continuously refine, because data evolves in deployment.

Q13: Final Takeaway #

  • In Data-Centric AI, data is the model.
  • Every improvement — in accuracy, fairness, robustness, trust, and security — roots back to the data.