Data-Centric Evaluation of ML Models #

  • Instead of asking only how accurate the model is, data-centric evaluation asks why and where the model fails, and whether the data itself is the cause.
| Aspect | Model-Centric Evaluation | Data-Centric Evaluation |
| --- | --- | --- |
| Focus | Overall model score | Specific weaknesses tied to data |
| Metrics | Accuracy, ROC AUC, etc. | Slice-specific accuracy, error analysis |
| Blind spots | Hides rare data failures | Detects hidden failures in the data |
| Error tracing | Hard | Errors traced directly to dirty, outlier, or biased data |
| Example | “95% accurate model!” | “Fails badly on young users with rare diseases” |
| Mindset | Improve model tuning | Fix the dataset (quality, balance, coverage) |

Q1: What is the typical ML workflow before deployment? #

  • Collect data and define the ML task.
  • Explore and preprocess the data.
  • Train a straightforward model.
  • Investigate shortcomings in the model and dataset.
  • Improve dataset and model iteratively.
  • Deploy the model and monitor for new issues.

Q2: Why is model evaluation critical? #

  • Evaluation affects practical outcomes in real-world applications.
  • Poor evaluation choices can lead to misleading or harmful models.

Q3: What are examples of evaluation metrics for classification? #

  • Accuracy, balanced accuracy, precision, recall, log loss, AUROC, calibration error.
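
A minimal sketch of computing most of these with scikit-learn on made-up labels and predicted probabilities (calibration error is omitted; it is typically estimated by binning predictions, e.g. via sklearn.calibration.calibration_curve):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             precision_score, recall_score, log_loss,
                             roc_auc_score)

# Hypothetical validation labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
p_pos  = np.array([0.2, 0.4, 0.9, 0.6, 0.1, 0.7, 0.55, 0.3])
y_pred = (p_pos >= 0.5).astype(int)   # threshold probabilities to hard labels

print("accuracy         ", accuracy_score(y_true, y_pred))
print("balanced accuracy", balanced_accuracy_score(y_true, y_pred))
print("precision        ", precision_score(y_true, y_pred))
print("recall           ", recall_score(y_true, y_pred))
print("log loss         ", log_loss(y_true, p_pos))
print("AUROC            ", roc_auc_score(y_true, p_pos))
```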

Q4: What are some pitfalls in model evaluation? #

  • Data leakage from evaluating on data that was not truly held out (see the sketch after this list).
  • Misspecified metrics hiding failures in subpopulations.
  • Validation data not representing deployment settings.
  • Label errors.
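
To make the leakage pitfall concrete, a hedged sketch: fitting preprocessing on all data before splitting leaks validation statistics into training, whereas fitting it inside a pipeline on the training fold only avoids this. The dataset and model here are synthetic placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Leaky pattern (do not do this): the scaler would see validation statistics.
# X_scaled = StandardScaler().fit_transform(X)   # fit on ALL data before splitting

# Leak-free: every preprocessing step is fit on the training fold only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))
```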

Q5: How is text generation model evaluation different? #

  • Human evaluations (👍👎 or Likert scales).
  • LLM evaluations with multiple criteria.
  • Automated metrics like ROUGE, BLEU, and Perplexity.
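
As one hedged example of an automated metric, BLEU can be computed with NLTK; the toy sentences and the choice of library are illustrative, not tied to any particular course or paper.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the cat sat on the mat".split()]   # list of tokenized reference sentences
candidate = "the cat is on the mat".split()      # tokenized model output

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
bleu = sentence_bleu(reference, candidate,
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}")
```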

Q6: What is a data slice? #

  • A subset of the dataset whose examples share a common characteristic, e.g., data from one sensor type or one demographic group.
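
A minimal sketch of slice-specific accuracy with pandas; the column names (sensor_type, label, pred) are hypothetical.

```python
import pandas as pd

# Hypothetical validation results: one row per example.
df = pd.DataFrame({
    "sensor_type": ["A", "A", "B", "B", "B", "C"],
    "label":       [1,   0,   1,   1,   0,   1],
    "pred":        [1,   0,   0,   1,   1,   1],
})

df["correct"] = df["label"] == df["pred"]
slice_accuracy = df.groupby("sensor_type")["correct"].mean()
print(slice_accuracy.sort_values())   # worst-performing slices first
```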

Q7: Why is it insufficient to delete sensitive features to address slice fairness? #

  • Slice membership can often be inferred from correlated proxy features, so the model may still treat slices differently even after the sensitive feature is removed.
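
One way to check this is a proxy test: train a model to predict the dropped sensitive attribute from the remaining features; success well above chance means those features still encode slice membership. A hedged sketch on synthetic data (variable names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Pretend X holds the non-sensitive features and `sensitive` is the dropped attribute.
X, sensitive = make_classification(n_samples=1000, n_informative=5, random_state=0)

proxy_auc = cross_val_score(LogisticRegression(max_iter=1000), X, sensitive,
                            scoring="roc_auc", cv=5).mean()
# An AUC well above 0.5 means the remaining features act as proxies.
print("proxy AUC:", round(proxy_auc, 3))
```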

Q8: How can we improve model performance for underperforming slices? #

  • Use a more flexible model.
  • Over-sample the minority subgroup (sketched after this list).
  • Collect more data from the subgroup.
  • Engineer new features that better capture subgroup specifics.
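
A hedged sketch of the over-sampling option from this list, using sklearn.utils.resample on a made-up table with a hypothetical group column:

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "group":   ["young"] * 3 + ["adult"] * 9,
    "feature": range(12),
    "label":   [0, 1, 1] + [0, 1] * 4 + [0],
})

minority = df[df["group"] == "young"]
majority = df[df["group"] == "adult"]

# Sample the minority slice with replacement until the groups are balanced.
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_upsampled])
print(balanced["group"].value_counts())
```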

Q9: How to discover underperforming subpopulations? #

  • Sort validation examples by loss.
  • Cluster high-loss examples to find commonalities.
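
A minimal sketch of this recipe on synthetic data: compute a per-example log loss on the validation set, take the highest-loss examples, and cluster them to look for shared characteristics.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_val = model.predict_proba(X_val)

# Per-example log loss on the validation set.
per_example_loss = -np.log(p_val[np.arange(len(y_val)), y_val] + 1e-12)
worst = np.argsort(per_example_loss)[::-1][:50]   # 50 highest-loss examples

# Cluster the high-loss examples; inspect each cluster for common traits.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_val[worst])
for c in range(3):
    print("cluster", c, "size", int((clusters == c).sum()))
```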

Q10: What are typical causes of wrong predictions? #

  • Incorrect labels (a sketch for flagging candidates follows this list).
  • Examples that do not belong to any class.
  • Outlier examples.
  • Model type limitations.
  • Conflicting or noisy dataset labels.
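
For the first cause, a simple hedged way to surface candidate label errors is to flag examples where the model confidently disagrees with the given label; dedicated tools (e.g. cleanlab) do this more carefully, and the 0.9 threshold below is arbitrary.

```python
import numpy as np

def flag_possible_label_errors(pred_probs, labels, threshold=0.9):
    """Return indices where the model assigns >= `threshold` probability
    to a class other than the given label."""
    pred_class = pred_probs.argmax(axis=1)
    pred_conf = pred_probs.max(axis=1)
    suspicious = (pred_class != labels) & (pred_conf >= threshold)
    return np.where(suspicious)[0]

# Toy example: the third label looks wrong given the model's confidence.
probs = np.array([[0.95, 0.05], [0.2, 0.8], [0.97, 0.03]])
labels = np.array([0, 1, 1])
print(flag_possible_label_errors(probs, labels))   # -> [2]
```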

Q11: What actions can address wrong predictions? #

  • Correct labels.
  • Remove fundamentally unpredictable examples.
  • Augment or normalize outlier examples.
  • Fit better model architectures or do feature engineering.
  • Enrich the dataset to distinguish overlapping classes.

Q12: What is the concept of leave-one-out influence? #

  • Measure the impact of omitting a datapoint on the model’s validation performance.
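
A brute-force sketch of leave-one-out influence, retraining once per omitted training point (only feasible for small datasets and cheap models; arrays are assumed to be NumPy, and the logistic regression is a placeholder):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def loo_influence(X_tr, y_tr, X_val, y_val):
    """Influence of each training point = full-data validation accuracy
    minus validation accuracy when that point is left out."""
    base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    base_acc = base.score(X_val, y_val)
    influences = np.zeros(len(X_tr))
    for i in range(len(X_tr)):
        mask = np.arange(len(X_tr)) != i
        model_i = LogisticRegression(max_iter=1000).fit(X_tr[mask], y_tr[mask])
        influences[i] = base_acc - model_i.score(X_val, y_val)
    return influences   # positive = helpful point, negative = harmful point
```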

Q13: What is Data Shapley? #

  • A method that averages a datapoint’s leave-one-out influence over all data subsets containing it, yielding a fairer (Shapley-value) measure of its importance than a single leave-one-out computation on the full dataset.

Q14: How can we approximate influence? #

  • Monte Carlo sampling methods (sketched after this list).
  • Closed-form approximations for simple models like linear regression and k-NN.
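
A hedged sketch of the Monte Carlo approach: approximate each point’s Data Shapley value by averaging its marginal contribution to validation accuracy over random permutations of the training set (degenerate one-class subsets are scored as 0 here for simplicity, and the model is a placeholder):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def monte_carlo_data_shapley(X_tr, y_tr, X_val, y_val, n_permutations=50):
    """Approximate Data Shapley values by averaging each point's marginal
    contribution to validation accuracy over random permutations."""
    n = len(X_tr)
    values = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(n_permutations):
        perm = rng.permutation(n)
        prev_score = 0.0
        for k in range(1, n + 1):
            subset = perm[:k]
            if len(np.unique(y_tr[subset])) < 2:
                continue            # cannot fit a classifier yet; contribution stays 0
            model = LogisticRegression(max_iter=1000).fit(X_tr[subset], y_tr[subset])
            score = model.score(X_val, y_val)
            values[perm[k - 1]] += score - prev_score   # marginal contribution of point k
            prev_score = score
    return values / n_permutations
```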

Q15: Why review influential samples? #

  • Correcting highly influential mislabeled examples can lead to significant accuracy improvements.
