Data-Centric Evaluation of ML Models #
- Instead of only asking how accurate the model is, we ask why and where it fails, and whether the data itself is the cause.
| Aspect | Model-Centric Evaluation | Data-Centric Evaluation |
|---|---|---|
| Focus | Overall model score | Specific weaknesses tied to data |
| Metrics | Accuracy, AUROC, etc. | Slice-specific accuracy, error analysis |
| Blind spots | Hides rare failures on specific data | Surfaces hidden failures tied to the data |
| Error tracing | Hard | Errors traced directly to dirty, outlier, or biased data |
| Example | “95% accurate model!” | “Fails badly on young users with rare diseases” |
| Mindset | Improve model tuning | Fix the dataset (quality, balance, coverage) |
Q1: What is the typical ML workflow before deployment? #
- Collect data and define the ML task.
- Explore and preprocess the data.
- Train a straightforward model.
- Investigate shortcomings in the model and dataset.
- Improve dataset and model iteratively.
- Deploy the model and monitor for new issues.
Q2: Why is model evaluation critical? #
- Evaluation affects practical outcomes in real-world applications.
- Poor evaluation choices can lead to misleading or harmful models.
Q3: What are examples of evaluation metrics for classification? #
- Accuracy, balanced accuracy, precision, recall, log loss, AUROC, calibration error.
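A minimal sketch of computing several of these metrics with scikit-learn; the `y_true` and `y_prob` arrays below are made-up placeholders:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             precision_score, recall_score, log_loss,
                             roc_auc_score)

# Hypothetical binary labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.6, 0.9, 0.3, 0.2, 0.05])
y_pred = (y_prob >= 0.5).astype(int)  # threshold probabilities at 0.5

print("accuracy          ", accuracy_score(y_true, y_pred))
print("balanced accuracy ", balanced_accuracy_score(y_true, y_pred))
print("precision         ", precision_score(y_true, y_pred))
print("recall            ", recall_score(y_true, y_pred))
print("log loss          ", log_loss(y_true, y_prob))
print("AUROC             ", roc_auc_score(y_true, y_prob))
# Calibration error (e.g. ECE) would require binning y_prob and is omitted here.
```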
Q4: What are some pitfalls in model evaluation? #
- Data leakage from evaluating on data that was not truly held out.
- Misspecified metrics hiding failures in subpopulations.
- Validation data not representing deployment settings.
- Label errors in the validation data.
Q5: How is text generation model evaluation different? #
- Human evaluations (👍👎 or Likert scales).
- LLM-as-a-judge evaluations against multiple criteria.
- Automated metrics such as ROUGE, BLEU, and perplexity.
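Of the automated metrics, perplexity is the simplest to sketch: it is the exponential of the average per-token negative log-likelihood. A minimal example with made-up token probabilities:

```python
import math

# Hypothetical per-token probabilities the model assigned to the reference text.
token_probs = [0.25, 0.10, 0.60, 0.05, 0.30]

# Perplexity = exp(mean negative log-likelihood per token).
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(nll)
print(f"perplexity: {perplexity:.2f}")
```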
Q6: What is a data slice? #
- A subset of the dataset sharing a common characteristic, e.g., different sensor types, demographics.
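A minimal sketch of slice-specific accuracy with pandas, using a hypothetical validation frame with `age_group`, `label`, and `pred` columns:

```python
import pandas as pd

# Hypothetical validation results: one row per example.
val = pd.DataFrame({
    "age_group": ["young", "young", "old", "old", "old", "young"],
    "label":     [1, 0, 1, 1, 0, 1],
    "pred":      [0, 0, 1, 1, 0, 0],
})

# Overall accuracy hides the fact that the "young" slice does much worse.
print("overall:", (val["label"] == val["pred"]).mean())
print(val.assign(correct=val["label"] == val["pred"])
         .groupby("age_group")["correct"].mean())
```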
Q7: Why is it insufficient to delete sensitive features to address slice fairness? #
- Slice membership can often be reconstructed from other, correlated features (proxies), so dropping the sensitive column does not remove the information; see the toy example below.
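A toy illustration, assuming a hypothetical `zip_code` feature that acts as a proxy for the sensitive `age_group` column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age_group = rng.choice(["young", "old"], size=1000)

# A correlated proxy feature: zip code almost determines the age group.
zip_code = np.where(age_group == "young", "90210", "10001")
flip = rng.random(1000) < 0.1  # 10% noise so the proxy is imperfect
zip_code = np.where(flip, np.where(age_group == "young", "10001", "90210"), zip_code)

df = pd.DataFrame({"age_group": age_group, "zip_code": zip_code})

# Even with age_group dropped, it can be recovered from zip_code ~90% of the time,
# so a model can still behave differently across the slices.
recovered = df["zip_code"].map({"90210": "young", "10001": "old"})
print((recovered == df["age_group"]).mean())
```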
Q8: How can we improve model performance for underperforming slices? #
- Use a more flexible model.
- Over-sample the minority subgroup (a sketch follows this list).
- Collect more data from the subgroup.
- Engineer new features that better capture subgroup specifics.
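A minimal sketch of the over-sampling option using `sklearn.utils.resample`, with a hypothetical `group` column marking the minority slice:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical training data with an under-represented subgroup.
train = pd.DataFrame({
    "feature": range(10),
    "group":   ["majority"] * 8 + ["minority"] * 2,
    "label":   [0, 1, 0, 1, 0, 1, 0, 1, 1, 0],
})

minority = train[train["group"] == "minority"]
majority = train[train["group"] == "majority"]

# Sample the minority slice with replacement until it matches the majority size.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=0)
print(balanced["group"].value_counts())
```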
Q9: How to discover underperforming subpopulations? #
- Sort validation examples by loss.
- Cluster high-loss examples to find commonalities.
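A minimal sketch of both steps, assuming hypothetical per-example losses and a validation feature matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_val = rng.normal(size=(500, 8))   # hypothetical validation features
losses = rng.exponential(size=500)  # hypothetical per-example losses

# 1. Sort validation examples by loss and keep the worst ones.
worst_idx = np.argsort(losses)[::-1][:50]

# 2. Cluster the high-loss examples to look for shared characteristics.
clusters = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X_val[worst_idx])
for c in range(3):
    print(f"cluster {c}: {np.sum(clusters == c)} high-loss examples")
```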
Q10: What are typical causes of wrong predictions? #
- Incorrect labels.
- Examples that do not belong to any class.
- Outlier examples.
- Model type limitations.
- Conflicting or noisy dataset labels.
Q11: What actions can address wrong predictions? #
- Correct labels.
- Remove fundamentally unpredictable examples.
- Augment or normalize outlier examples.
- Fit better model architectures or do feature engineering.
- Enrich the dataset to distinguish overlapping classes.
Q12: What is the concept of leave-one-out influence? #
- Measure how the model’s validation performance changes when it is retrained with that datapoint omitted (a brute-force sketch follows).
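A brute-force sketch on hypothetical data, where retraining once per datapoint is still affordable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)
X_val, y_val = rng.normal(size=(100, 5)), rng.integers(0, 2, 100)

def val_accuracy(X, y):
    model = LogisticRegression().fit(X, y)
    return accuracy_score(y_val, model.predict(X_val))

base = val_accuracy(X_train, y_train)

# Leave-one-out influence of datapoint i = performance with it minus performance without it.
influences = []
for i in range(len(X_train)):
    keep = np.arange(len(X_train)) != i
    influences.append(base - val_accuracy(X_train[keep], y_train[keep]))

# Large positive influence: the point helps; large negative: it hurts (e.g. mislabeled).
print(np.argsort(influences)[:5], "look like the most harmful points")
```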
Q13: What is Data Shapley? #
- A method that averages a datapoint’s marginal contribution over all possible subsets of the remaining data (its Shapley value), giving a fairer measure of its importance than a single leave-one-out check.
Q14: How can we approximate influence? #
- Monte Carlo sampling methods (see the sketch after this list).
- Closed-form approximations for simple models like linear regression and k-NN.
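A minimal Monte Carlo sketch of Data Shapley on hypothetical data: sample random permutations of the training set, add points one at a time, and credit each point with its marginal change in validation accuracy:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(50, 5)), rng.integers(0, 2, 50)
X_val, y_val = rng.normal(size=(100, 5)), rng.integers(0, 2, 100)
n = len(X_train)

def val_accuracy(idx):
    """Validation accuracy of a model trained on the subset `idx` (0.5 if only one class)."""
    if len(np.unique(y_train[idx])) < 2:
        return 0.5
    model = LogisticRegression().fit(X_train[idx], y_train[idx])
    return accuracy_score(y_val, model.predict(X_val))

# Monte Carlo estimate: average marginal contributions over random permutations.
shapley = np.zeros(n)
n_permutations = 20
for _ in range(n_permutations):
    perm = rng.permutation(n)
    prev_score = 0.5  # score of the empty set (random guessing on a binary task)
    for j, i in enumerate(perm):
        score = val_accuracy(perm[: j + 1])
        shapley[i] += score - prev_score
        prev_score = score
shapley /= n_permutations

print("lowest-value points:", np.argsort(shapley)[:5])
```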
Q15: Why review influential samples? #
- Correcting highly influential mislabeled examples can lead to significant accuracy improvements.