Module 4: Evaluation and Metrics for ML in Healthcare #

1 Introduction to Model Performance Evaluation #


Q1: Why is model evaluation critical in healthcare machine learning? #

In healthcare, decisions informed by ML models can have life-altering consequences:

  • It’s not enough for a model to perform well on training data.
  • We need to ensure that the model performs well on unseen patients and real-world conditions.
  • Rigorous evaluation is essential for building trust and validating clinical usefulness.

➡️ What does it mean for a model to generalize?

Q2: What is generalization and how is it assessed? #

Generalization refers to the model’s ability to perform well on new, unseen data:

  • Indicates how well the model has learned the underlying patterns.
  • We measure this using validation and test sets.
  • High training accuracy but poor test accuracy implies overfitting.

➡️ What techniques help ensure fair and reliable evaluation?

Q3: What is data splitting and why is it important? #

Data splitting partitions the available data into separate subsets so that performance is always estimated on samples the model has never seen. Common splits:

  • Training set: For model learning.
  • Validation set: For tuning hyperparameters and model selection.
  • Test set: For final performance estimation.

These splits help avoid data leakage and optimism bias in evaluation.
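
As a rough illustration, the sketch below performs a three-way split with scikit-learn's train_test_split; the 60/20/20 ratio, the random seed, and the function name are illustrative choices rather than part of the module.

```python
# A minimal sketch of a train/validation/test split using scikit-learn.
# X and y stand for any feature matrix and label vector; the 60/20/20
# ratio is illustrative, not prescriptive.
from sklearn.model_selection import train_test_split

def three_way_split(X, y, seed=42):
    # First carve out a held-out test set (20% of the data).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=seed, stratify=y)
    # Then split the remainder into training (75%) and validation (25%),
    # i.e. 60% / 20% of the original data.
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=0.25, random_state=seed, stratify=y_train)
    return X_train, X_val, X_test, y_train, y_val, y_test
```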

➡️ Are there methods more robust than simple train/test splits?

Q4: What is cross-validation and when is it useful? #

Cross-validation is a strategy to make evaluation more robust:

  • Data is divided into k folds.
  • Model trains on k-1 folds and validates on the remaining one.
  • Repeated multiple times to average out variability.

It is particularly useful for small datasets, which are common in healthcare studies.
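
A minimal k-fold sketch with scikit-learn, using an illustrative logistic regression, a synthetic dataset, and k = 5:

```python
# A minimal k-fold cross-validation sketch with scikit-learn.
# The logistic-regression model and k = 5 are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Stratified folds keep the class balance similar in every fold,
# which matters when positive cases (e.g., a disease) are rare.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"AUC per fold: {scores.round(3)}  mean: {scores.mean():.3f}")
```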

2 Overfitting and Underfitting #


Q1: What are overfitting and underfitting in machine learning? #

These are two common problems that reduce model effectiveness:

  • Overfitting: The model learns noise and specifics of the training set, performing poorly on new data.
  • Underfitting: The model fails to capture underlying trends, resulting in poor performance on both training and test sets.

➡️ What are the visual signs of overfitting and underfitting during training?

Q2: How can we detect overfitting and underfitting? #

By plotting training and validation accuracy/loss:

  • Overfitting: Training accuracy increases while validation accuracy drops or plateaus.
  • Underfitting: Both training and validation accuracies remain low.
  • Regular monitoring helps identify these trends early.
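
As an illustration, the sketch below plots such learning curves, assuming per-epoch training and validation accuracies have already been recorded in a dictionary; the key names are assumptions, not a fixed API.

```python
# A sketch for spotting over-/underfitting from learning curves.
# `history` is assumed to be a dict of per-epoch metrics recorded
# during training; the key names are illustrative.
import matplotlib.pyplot as plt

def plot_learning_curves(history):
    epochs = range(1, len(history["train_acc"]) + 1)
    plt.plot(epochs, history["train_acc"], label="training accuracy")
    plt.plot(epochs, history["val_acc"], label="validation accuracy")
    # Overfitting: the curves diverge (training keeps rising while
    # validation stalls or drops). Underfitting: both stay low.
    plt.xlabel("epoch")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()
```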

➡️ What causes models to overfit?

Q3: What factors contribute to overfitting in ML models? #

  • High model complexity (deep networks, too many parameters).
  • Small training dataset or lack of representative diversity.
  • Too many epochs without early stopping.

Overfitting is especially risky in healthcare due to variability in real-world patient populations.

➡️ Conversely, what might cause underfitting?

Q4: Why might a model underfit the data? #

  • The model is too simple (e.g., linear models for nonlinear problems).
  • Insufficient training time or suboptimal hyperparameters.
  • Poor feature engineering or missing important data signals.

Underfitting leads to missed patterns, which is dangerous in diagnostic or predictive tools.

3 Strategies to Address Overfitting and Underfitting, and an Introduction to Regularization #


Q1: How can we address overfitting in ML models? #

Strategies include:

  • Reducing model complexity: Use fewer layers or parameters.
  • Early stopping: Halt training when validation performance stops improving.
  • Data augmentation: Increase data diversity artificially (especially for images).
  • Regularization: Penalize model complexity.

➡️ What types of regularization techniques are commonly used?

Q2: What are L1 and L2 regularization? #

These techniques add penalty terms to the loss function:

  • L1 (Lasso): Adds the sum of the absolute values of the weights to the loss → encourages sparsity (some weights become exactly zero).
  • L2 (Ridge): Adds the sum of the squared weights to the loss → discourages large weights without zeroing them out.

They help control overfitting by shrinking parameter magnitudes.
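
A minimal sketch contrasting the two penalties with scikit-learn's logistic regression; the synthetic dataset and the penalty strength C = 0.1 are illustrative.

```python
# A minimal sketch of L1 vs. L2 penalties with scikit-learn's logistic
# regression; C (the inverse of the penalty strength) is illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2_model = LogisticRegression(penalty="l2", C=0.1).fit(X, y)

# L1 drives many coefficients to exactly zero (sparsity);
# L2 shrinks them toward zero without eliminating them.
print("non-zero weights with L1:", np.count_nonzero(l1_model.coef_))
print("non-zero weights with L2:", np.count_nonzero(l2_model.coef_))
```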

➡️ Besides weight penalties, are there other techniques to improve generalization?

Q3: What is dropout and how does it help prevent overfitting? #

Dropout randomly disables neurons during training:

  • Prevents co-adaptation of features.
  • Encourages the network to learn redundant, distributed representations.
  • Typically used in deep networks.
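
A minimal PyTorch sketch of a dropout layer in a small feed-forward network; the layer sizes and the dropout rate of 0.5 are illustrative.

```python
# A minimal PyTorch sketch of dropout in a small feed-forward network;
# layer sizes and the dropout rate are illustrative choices.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each hidden unit is zeroed with probability 0.5 during training
    nn.Linear(64, 1),
)

model.train()            # dropout is active in training mode
x = torch.randn(8, 32)
print(model(x).shape)

model.eval()             # dropout is disabled at inference time
print(model(x).shape)
```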

➡️ Can underfitting also be addressed through specific strategies?

Q4: How can we fix underfitting in a model? #

Underfitting can be resolved by:

  • Increasing model capacity (deeper or more complex models).
  • Training longer or using better optimization.
  • Improving data quality or features.
  • Adjusting learning rate and other hyperparameters.

4 Statistical Approaches to Model Evaluation #


Q1: Why are statistical methods important in evaluating ML models? #

Statistical tools help us quantify confidence and variability in model performance:

  • Avoid over-interpreting single-point metrics.
  • Make decisions grounded in significance and uncertainty.
  • Essential when models may be deployed in clinical practice.

➡️ What’s a simple way to estimate uncertainty in metrics?

Q2: What is bootstrapping and how is it used in model evaluation? #

Bootstrapping is a resampling technique:

  • Repeatedly sample with replacement from the test set.
  • Evaluate the model on each sample to get a distribution of performance metrics.
  • Helps compute confidence intervals for metrics like accuracy or AUC.
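
A minimal bootstrap sketch for a 95% confidence interval around test-set AUC; the function name, the number of resamples, and the choice of AUC as the metric are illustrative.

```python
# A sketch of bootstrapping a 95% confidence interval for test-set AUC.
# y_true and y_score stand for held-out labels and model scores.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        # Resample test indices with replacement.
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:
            continue  # skip resamples with only one class present
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lo, hi)
```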

➡️ Are there other methods for statistical comparison of models?

Q3: How do permutation tests assess model significance? #

Permutation testing involves:

  • Randomly shuffling labels on test data.
  • Evaluating model performance on this randomized data.
  • Repeating multiple times to build a null distribution.

If the model's actual performance lies far in the tail of this null distribution, the result is unlikely to be due to chance.
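
A minimal permutation-test sketch for a fixed set of predictions, using accuracy as an illustrative metric:

```python
# A sketch of a label-permutation test for a fixed set of predictions.
# y_true and y_pred are the held-out labels and model predictions.
import numpy as np
from sklearn.metrics import accuracy_score

def permutation_p_value(y_true, y_pred, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    observed = accuracy_score(y_true, y_pred)
    # Build the null distribution by shuffling the labels.
    null = np.array([
        accuracy_score(rng.permutation(y_true), y_pred)
        for _ in range(n_perm)
    ])
    # p-value: how often chance performs at least as well as the model.
    p = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, p
```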

➡️ How do we assess reliability when comparing two models?

Q4: What is the paired t-test and when should it be used? #

The paired t-test compares two models evaluated on the same test samples or folds (paired measurements):

  • Measures whether the difference in performance is statistically significant.
  • Assumes approximately normal distribution of differences.

It can be useful for judging whether model A performs significantly better than model B.
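
As an illustration, scipy.stats.ttest_rel can be applied to per-fold scores from the two models; the numbers below are made up.

```python
# A sketch of a paired t-test on per-fold scores of two models.
# The score values are illustrative; in practice they come from
# evaluating both models on the same cross-validation folds.
import numpy as np
from scipy.stats import ttest_rel

scores_a = np.array([0.81, 0.84, 0.79, 0.83, 0.82])  # model A, per fold
scores_b = np.array([0.78, 0.80, 0.77, 0.81, 0.79])  # model B, per fold

t_stat, p_value = ttest_rel(scores_a, scores_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p-value suggests the difference in performance is unlikely
# to be due to chance alone (assuming roughly normal differences).
```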

5 Receiver Operating Characteristic and Precision-Recall Curves as Evaluation Metrics #


Q1: Why do we need different evaluation metrics beyond accuracy? #

Accuracy alone can be misleading, especially in imbalanced datasets:

  • In healthcare, positive cases (e.g., disease presence) may be rare.
  • A model that predicts only the majority class can appear highly accurate.
  • Metrics like precision, recall, and AUC provide better insight into real performance.

➡️ What is the ROC curve and how is it interpreted?

Q2: What is the ROC curve and AUC? #

  • ROC (Receiver Operating Characteristic) curve plots True Positive Rate (TPR) vs. False Positive Rate (FPR).
  • AUC (Area Under the Curve) summarizes this curve into a single number (0.5 = random, 1.0 = perfect).
  • AUC reflects the model’s ranking ability—how well it separates classes.
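
A minimal sketch of computing the ROC curve and AUC with scikit-learn on a toy set of labels and scores:

```python
# A sketch of computing the ROC curve and AUC with scikit-learn.
# y_true holds binary labels, y_score the model's predicted probabilities.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.2f}")   # 0.5 = random ranking, 1.0 = perfect separation
```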

➡️ Are ROC curves always the best choice?

Q3: When should we prefer Precision-Recall (PR) curves over ROC? #

  • PR curves focus on positive class performance, plotting Precision vs. Recall.
  • More informative than ROC when dealing with imbalanced data.
  • Helpful for screening tools or rare disease prediction.
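
A minimal precision-recall sketch on an imbalanced toy example (two positives among ten cases), again with scikit-learn:

```python
# A sketch of a precision-recall curve and average precision on an
# imbalanced toy example (few positives, many negatives).
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true  = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.05, 0.1, 0.2, 0.15, 0.3, 0.25, 0.4, 0.35, 0.8, 0.6])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)
print(f"Average precision = {ap:.2f}")
```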

➡️ How are these metrics computed and interpreted?

Q4: What are precision, recall, and F1 score? #

  • Precision: TP / (TP + FP) → Of predicted positives, how many were correct?
  • Recall: TP / (TP + FN) → Of actual positives, how many did we find?
  • F1 Score: Harmonic mean of precision and recall.

These metrics give a fuller picture of model trade-offs, especially in clinical use cases.
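
A short worked example with made-up counts (TP = 3, FP = 1, FN = 1), checking the formulas against scikit-learn:

```python
# A worked example of precision, recall, and F1; the labels and
# predictions are made up for illustration.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # TP = 3, FN = 1, FP = 1

print(f"precision = {precision_score(y_true, y_pred):.2f}")  # 3 / (3 + 1) = 0.75
print(f"recall    = {recall_score(y_true, y_pred):.2f}")     # 3 / (3 + 1) = 0.75
print(f"F1        = {f1_score(y_true, y_pred):.2f}")         # harmonic mean = 0.75
```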