Data-Centric Evaluation of ML Models #
- Instead of only asking how accurate the model is, we ask why and where it fails, and whether the data itself is the cause.
| Aspect | Model-Centric Evaluation | Data-Centric Evaluation |
|---|---|---|
| Focus | Overall model score | Specific weaknesses tied to data |
| Metrics | Accuracy, AUROC, etc. | Slice-specific accuracy, error analysis |
| Blind spots | Hides rare failures on specific data | Surfaces hidden failures tied to the data |
| Error tracing | Hard | Errors traced directly to dirty, outlier, or biased data |
| Example | “95% accurate model!” | “Fails badly on young users with rare diseases” |
| Mindset | Improve model tuning | Fix the dataset (quality, balance, coverage) |
Q1: What is the typical ML workflow before deployment? #
- Collect data and define the ML task.
- Explore and preprocess the data.
- Train a straightforward model.
- Investigate shortcomings in the model and dataset.
- Improve dataset and model iteratively.
- Deploy the model and monitor for new issues.
Q2: Why is model evaluation critical? #
- Evaluation affects practical outcomes in real-world applications.
- Poor evaluation choices can lead to misleading or harmful models.
Q3: What are examples of evaluation metrics for classification? #
- Accuracy, balanced accuracy, precision, recall, log loss, AUROC, calibration error.
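A minimal sketch of computing several of these metrics with scikit-learn; the `y_true` and `y_prob` arrays below are made-up placeholders:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             precision_score, recall_score, log_loss,
                             roc_auc_score)

# Hypothetical binary labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.6, 0.9, 0.3, 0.2, 0.05])
y_pred = (y_prob >= 0.5).astype(int)  # threshold probabilities at 0.5

print("accuracy          ", accuracy_score(y_true, y_pred))
print("balanced accuracy ", balanced_accuracy_score(y_true, y_pred))
print("precision         ", precision_score(y_true, y_pred))
print("recall            ", recall_score(y_true, y_pred))
print("log loss          ", log_loss(y_true, y_prob))
print("AUROC             ", roc_auc_score(y_true, y_prob))
# Calibration error (e.g. ECE) would require binning y_prob and is omitted here.
```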
Q4: What are some pitfalls in model evaluation? #
- Data leakage from evaluating on data that was not truly held out.
- Misspecified metrics hiding failures in subpopulations.
- Validation data not representing deployment settings.
- Label errors in the validation data.
Q5: How is text generation model evaluation different? #
- Human evaluations (👍👎 or Likert scales).
- LLM-as-a-judge evaluations against multiple criteria.
- Automated metrics such as ROUGE, BLEU, and perplexity.
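Of the automated metrics, perplexity is the simplest to sketch: it is the exponential of the average per-token negative log-likelihood. A minimal example with made-up token probabilities:

```python
import math

# Hypothetical per-token probabilities the model assigned to the reference text.
token_probs = [0.25, 0.10, 0.60, 0.05, 0.30]

# Perplexity = exp(mean negative log-likelihood per token).
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(nll)
print(f"perplexity: {perplexity:.2f}")
```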
Q6: What is a data slice? #
- A subset of the dataset sharing a common characteristic, e.g., different sensor types, demographics.
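A minimal sketch of slice-specific accuracy with pandas, using a hypothetical validation frame with `age_group`, `label`, and `pred` columns:

```python
import pandas as pd

# Hypothetical validation results: one row per example.
val = pd.DataFrame({
    "age_group": ["young", "young", "old", "old", "old", "young"],
    "label":     [1, 0, 1, 1, 0, 1],
    "pred":      [0, 0, 1, 1, 0, 0],
})

# Overall accuracy hides the fact that the "young" slice does much worse.
print("overall:", (val["label"] == val["pred"]).mean())
print(val.assign(correct=val["label"] == val["pred"])
         .groupby("age_group")["correct"].mean())
```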
Q7: Why is it insufficient to delete sensitive features to address slice fairness? #
- Slice membership can often be reconstructed from other, correlated features (proxies), so dropping the sensitive column does not remove the information; see the toy example below.
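A toy illustration, assuming a hypothetical `zip_code` feature that acts as a proxy for the sensitive `age_group` column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age_group = rng.choice(["young", "old"], size=1000)

# A correlated proxy feature: zip code almost determines the age group.
zip_code = np.where(age_group == "young", "90210", "10001")
flip = rng.random(1000) < 0.1  # 10% noise so the proxy is imperfect
zip_code = np.where(flip, np.where(age_group == "young", "10001", "90210"), zip_code)

df = pd.DataFrame({"age_group": age_group, "zip_code": zip_code})

# Even with age_group dropped, it can be recovered from zip_code ~90% of the time,
# so a model can still behave differently across the slices.
recovered = df["zip_code"].map({"90210": "young", "10001": "old"})
print((recovered == df["age_group"]).mean())
```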
Q8: How can we improve model performance for underperforming slices? #
- Use a more flexible model.
- Over-sample the minority subgroup (a sketch follows this list).
- Collect more data from the subgroup.
- Engineer new features that better capture subgroup specifics.
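A minimal sketch of the over-sampling option using `sklearn.utils.resample`, with a hypothetical `group` column marking the minority slice:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical training data with an under-represented subgroup.
train = pd.DataFrame({
    "feature": range(10),
    "group":   ["majority"] * 8 + ["minority"] * 2,
    "label":   [0, 1, 0, 1, 0, 1, 0, 1, 1, 0],
})

minority = train[train["group"] == "minority"]
majority = train[train["group"] == "majority"]

# Sample the minority slice with replacement until it matches the majority size.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=0)
print(balanced["group"].value_counts())
```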
Q9: How to discover underperforming subpopulations? #
- Sort validation examples by loss.
- Cluster high-loss examples to find commonalities.
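A minimal sketch of both steps, assuming hypothetical per-example losses and a validation feature matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_val = rng.normal(size=(500, 8))   # hypothetical validation features
losses = rng.exponential(size=500)  # hypothetical per-example losses

# 1. Sort validation examples by loss and keep the worst ones.
worst_idx = np.argsort(losses)[::-1][:50]

# 2. Cluster the high-loss examples to look for shared characteristics.
clusters = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X_val[worst_idx])
for c in range(3):
    print(f"cluster {c}: {np.sum(clusters == c)} high-loss examples")
```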
Q10: What are typical causes of wrong predictions? #
- Incorrect labels.
- Examples that do not belong to any class.
- Outlier examples.
- Model type limitations.
- Conflicting or noisy dataset labels.
Q11: What actions can address wrong predictions? #
- Correct labels.
- Remove fundamentally unpredictable examples.
- Augment or normalize outlier examples.
- Fit better model architectures or do feature engineering.
- Enrich the dataset to distinguish overlapping classes.
Q12: What is the concept of leave-one-out influence? #
- Measure how the model’s validation performance changes when it is retrained with that datapoint omitted (a brute-force sketch follows).
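A brute-force sketch on hypothetical data, where retraining once per datapoint is still affordable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)
X_val, y_val = rng.normal(size=(100, 5)), rng.integers(0, 2, 100)

def val_accuracy(X, y):
    model = LogisticRegression().fit(X, y)
    return accuracy_score(y_val, model.predict(X_val))

base = val_accuracy(X_train, y_train)

# Leave-one-out influence of datapoint i = performance with it minus performance without it.
influences = []
for i in range(len(X_train)):
    keep = np.arange(len(X_train)) != i
    influences.append(base - val_accuracy(X_train[keep], y_train[keep]))

# Large positive influence: the point helps; large negative: it hurts (e.g. mislabeled).
print(np.argsort(influences)[:5], "look like the most harmful points")
```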
Q13: What is Data Shapley? #
- A method that averages a datapoint’s marginal contribution over all possible subsets of the remaining data (its Shapley value), giving a fairer measure of its importance than a single leave-one-out check.
Q14: How can we approximate influence? #
- Monte Carlo sampling methods (see the sketch after this list).
- Closed-form approximations for simple models like linear regression and k-NN.
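A minimal Monte Carlo sketch of Data Shapley on hypothetical data: sample random permutations of the training set, add points one at a time, and credit each point with its marginal change in validation accuracy:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(50, 5)), rng.integers(0, 2, 50)
X_val, y_val = rng.normal(size=(100, 5)), rng.integers(0, 2, 100)
n = len(X_train)

def val_accuracy(idx):
    """Validation accuracy of a model trained on the subset `idx` (0.5 if only one class)."""
    if len(np.unique(y_train[idx])) < 2:
        return 0.5
    model = LogisticRegression().fit(X_train[idx], y_train[idx])
    return accuracy_score(y_val, model.predict(X_val))

# Monte Carlo estimate: average marginal contributions over random permutations.
shapley = np.zeros(n)
n_permutations = 20
for _ in range(n_permutations):
    perm = rng.permutation(n)
    prev_score = 0.5  # score of the empty set (random guessing on a binary task)
    for j, i in enumerate(perm):
        score = val_accuracy(perm[: j + 1])
        shapley[i] += score - prev_score
        prev_score = score
shapley /= n_permutations

print("lowest-value points:", np.argsort(shapley)[:5])
```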
Q15: Why review influential samples? #
- Correcting highly influential mislabeled examples can lead to significant accuracy improvements.