Data-Centric AI #
- Reference: MIT Data-Centric AI course
Q1: How does Data-Centric AI differ from Model-Centric AI? #
- Model-Centric AI improves models assuming fixed data.
- Data-Centric AI improves data quality, coverage, and structure, recognizing data as the bottleneck for model success.
Q2: Why are Label Errors and Confident Learning crucial? #
- Label errors silently degrade model performance.
- Confident Learning compares model-predicted probabilities against the given labels to identify likely mislabeled examples, so they can be reviewed or corrected systematically.
Q3: What advances exist in Confident Learning and LLM/GenAI applications? #
- Advanced variants of Confident Learning improve robustness when the model's own predictions are noisy.
- The same ideas apply to foundation models: cleaning LLM training data and filtering synthetic data before fine-tuning.
Q4: How are Class Imbalance, Outliers, and Distribution Shift handled? #
- Class imbalance is managed via over/under-sampling, SMOTE, and weighting.
- Outliers are detected with isolation-based methods (e.g., Isolation Forest) or autoencoder reconstruction error.
- Distribution shifts are diagnosed and corrected with careful monitoring and adaptation.
Q5: What is essential about Dataset Creation and Curation? #
- Good datasets start with thoughtful design, balanced sampling, and robust label validation.
- Crowdsourcing must be augmented with consensus models like Dawid-Skene or CROWDLAB.
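A one-round sketch in the spirit of Dawid-Skene (not the full EM algorithm, and with hypothetical annotations) shows why accuracy-weighted consensus beats plain majority vote: reliable annotators get more say.

```python
# One round of Dawid-Skene-style consensus: majority vote, estimate each
# annotator's accuracy against it, then re-vote weighted by accuracy.

from collections import Counter, defaultdict

def weighted_consensus(annotations):
    """annotations: {item: {annotator: label}} -> {item: consensus label}"""
    # Step 1: plain majority vote per item.
    majority = {i: Counter(votes.values()).most_common(1)[0][0]
                for i, votes in annotations.items()}
    # Step 2: estimate annotator accuracy vs. the majority vote.
    agree, total = defaultdict(int), defaultdict(int)
    for i, votes in annotations.items():
        for a, label in votes.items():
            total[a] += 1
            agree[a] += (label == majority[i])
    accuracy = {a: agree[a] / total[a] for a in total}
    # Step 3: accuracy-weighted re-vote.
    consensus = {}
    for i, votes in annotations.items():
        scores = defaultdict(float)
        for a, label in votes.items():
            scores[label] += accuracy[a]
        consensus[i] = max(scores, key=scores.get)
    return consensus

annotations = {
    "x1": {"ann1": "cat", "ann2": "cat", "ann3": "dog"},
    "x2": {"ann1": "dog", "ann2": "dog", "ann3": "dog"},
    "x3": {"ann1": "cat", "ann2": "dog", "ann3": "dog"},
}
print(weighted_consensus(annotations))
```

The full Dawid-Skene algorithm iterates steps 2 and 3 (with per-class confusion matrices rather than a single accuracy) until convergence; CROWDLAB further blends in a trained classifier's predictions.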
Q6: How does Data-Centric Evaluation of ML Models change standard practices? #
- Global metrics alone are not enough: slice-based evaluation, error analysis, and influence functions are needed.
- Subpopulations and rare cases must be properly assessed.
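A tiny example (with hypothetical predictions and groups) shows how a respectable global number can hide a slice that fails completely:

```python
# Slice-based evaluation: compute accuracy overall and per subgroup,
# because aggregate metrics can mask a failing subpopulation.

from collections import defaultdict

def slice_accuracy(y_true, y_pred, slices):
    """Return (overall accuracy, {slice: accuracy})."""
    overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    correct, total = defaultdict(int), defaultdict(int)
    for t, p, s in zip(y_true, y_pred, slices):
        total[s] += 1
        correct[s] += (t == p)
    return overall, {s: correct[s] / total[s] for s in total}

y_true = [1, 0, 1, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
groups = ["A", "A", "A", "A", "A", "B", "B", "B"]
overall, per_slice = slice_accuracy(y_true, y_pred, groups)
print(overall, per_slice)  # 0.625 overall, but group B scores 0.0
```

In practice slices come from metadata (demographics, data source, input length) or are discovered automatically by clustering high-loss examples.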
Q7: How is Data Curation for LLMs unique? #
- LLMs memorize training data deeply.
- Curating high-quality fine-tuning datasets, filtering synthetic data, and evaluating outputs with uncertainty quantification are critical.
Q8: What role does Growing or Compressing Datasets play? #
- Active learning grows datasets smartly by labeling only informative samples.
- Core-set selection compresses datasets while preserving model performance, making training efficient.
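Both directions admit minimal sketches on hypothetical toy data: uncertainty sampling picks the unlabeled points the model is least sure about, and greedy k-center picks a small core-set that covers the data.

```python
# Active learning (uncertainty sampling) and core-set selection
# (greedy k-center) in their simplest forms.

def uncertainty_sample(pred_probs, k):
    """Indices of the k examples with lowest max-class confidence."""
    margins = [(max(p), i) for i, p in enumerate(pred_probs)]
    return [i for _, i in sorted(margins)[:k]]

def greedy_k_center(points, k):
    """Greedy core-set: repeatedly add the point farthest from the chosen set."""
    chosen = [0]  # seed with an arbitrary first point
    while len(chosen) < k:
        def dist_to_chosen(i):
            return min(sum((a - b) ** 2 for a, b in zip(points[i], points[c]))
                       for c in chosen)
        chosen.append(max(range(len(points)), key=dist_to_chosen))
    return chosen

probs = [[0.95, 0.05], [0.55, 0.45], [0.7, 0.3], [0.51, 0.49]]
print(uncertainty_sample(probs, 2))  # → [3, 1]: least confident first

points = [(0, 0), (0, 1), (10, 10), (10, 9)]
print(greedy_k_center(points, 2))  # → [0, 2]: one point per cluster
```

Production systems refine both ideas (batch diversity for active learning, approximate distances for large core-sets), but the selection criteria are exactly these.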
Q9: How does Interpretability relate to Data-Centric AI? #
- Models are only as interpretable as their features.
- Human-in-the-loop feature engineering ensures features are understandable, relevant, and actionable.
Q10: How do we Encode Human Priors into Models? #
- Via Data Augmentation: enriching datasets to encode invariances (e.g., rotation, Mixup).
- Via Prompt Engineering: guiding LLMs at inference time with careful input manipulation.
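Mixup is compact enough to sketch on a single pair of hypothetical toy vectors: blend two examples and their one-hot labels with the same coefficient, encoding the prior that predictions should vary smoothly between classes.

```python
# Mixup on one pair: a convex combination of two (features, label) pairs,
# with the mixing coefficient drawn from Beta(alpha, alpha) as in the paper.

import random

def mixup(x1, y1, x2, y2, alpha=0.4):
    """Return the lam-blend of two (features, one-hot label) pairs."""
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

random.seed(0)
x, y = mixup([1.0, 0.0], [1, 0], [0.0, 1.0], [0, 1])
print(x, y)  # here x and y coincide because the toy features equal the labels
```

The mixed label always sums to 1, so the augmented pair remains a valid soft target for cross-entropy training.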
Q11: How do we secure Data Privacy and Security? #
- ML models risk leaking sensitive information.
- Defenses include membership inference mitigation, differential privacy, model regularization, and careful threat modeling.
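One classical differential-privacy tool, the Laplace mechanism, fits in a few lines. The query below is a hypothetical count; the key point is that the noise scale depends only on sensitivity and the privacy budget epsilon, not on the data.

```python
# Laplace mechanism: release a count plus Laplace(sensitivity/epsilon)
# noise, so any single record's presence or absence is obscured.

import math
import random

def laplace_mechanism(true_count, epsilon, sensitivity=1.0):
    """Add Laplace noise (sampled by inverse CDF) to a counting query."""
    u = random.random() - 0.5
    scale = sensitivity / epsilon
    noise = -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

random.seed(42)
records = [r for r in range(100) if r % 3 == 0]  # 34 matching records
noisy = laplace_mechanism(len(records), epsilon=0.5)
print(round(noisy, 2))  # close to 34, but membership of any one record is hidden
```

Smaller epsilon means more noise and stronger privacy; a counting query has sensitivity 1 because adding or removing one record changes the count by at most 1.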
Q12: How does the full picture of Data-Centric AI flow? #
- Frame the problem with a Data-Centric mindset.
- Curate a well-constructed, balanced, and interpretable dataset.
- Detect and fix label errors, outliers, and bias early.
- Train models, but re-evaluate data quality whenever errors surface.
- Focus evaluations on slices and high-loss examples.
- Grow datasets when needed (active learning) or compress intelligently (core-sets).
- Secure models against privacy attacks.
- Continuously refine, because data evolves in deployment.
Q13: Final Takeaway #
- In Data-Centric AI, data is the model.
- Every improvement in accuracy, fairness, robustness, trust, and security traces back to the data.