Module 4: Evaluation and Metrics for ML in Healthcare #
1 Introduction to Model Performance Evaluation #
Q1: Why is model evaluation critical in healthcare machine learning? #
In healthcare, decisions informed by ML models can have life-altering consequences:
- It’s not enough for a model to perform well on training data.
- We need to ensure that the model performs well on unseen patients and real-world conditions.
- Rigorous evaluation is essential to trust and validate clinical usefulness.
➡️ What does it mean for a model to generalize?
Q2: What is generalization and how is it assessed? #
Generalization refers to the model’s ability to perform well on new, unseen data:
- Indicates how well the model has learned the underlying patterns.
- We measure this using validation and test sets.
- High training accuracy but poor test accuracy implies overfitting.
➡️ What techniques help ensure fair and reliable evaluation?
Q3: What is data splitting and why is it important? #
Data splitting partitions the available data into disjoint subsets so that performance is always estimated on data the model did not learn from. Common splits:
- Training set: For model learning.
- Validation set: For tuning hyperparameters and model selection.
- Test set: For final performance estimation.
These splits help avoid data leakage and optimism bias in evaluation.
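As a minimal sketch, a three-way split with scikit-learn (the synthetic dataset and split proportions below are illustrative assumptions, not prescribed values):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a clinical dataset (purely illustrative).
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0)

# Hold out a final test set (20%), stratified so the rare positive class is preserved.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
# Split the remainder into training (60% of the total) and validation (20% of the total) sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)
# Hyperparameters are tuned on (X_val, y_val); (X_test, y_test) is used only once,
# for the final performance estimate.
```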
➡️ Are there methods more robust than simple train/test splits?
Q4: What is cross-validation and when is it useful? #
Cross-validation is a strategy to make evaluation more robust:
- Data is divided into k folds.
- Model trains on k-1 folds and validates on the remaining one.
- Repeated k times so that each fold serves once as the validation set; results are averaged to reduce variability.
It’s particularly useful in small datasets, common in healthcare studies.
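A minimal stratified k-fold sketch with scikit-learn (the model choice, fold count, and synthetic data are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Small synthetic dataset standing in for a typical healthcare-sized study.
X, y = make_classification(n_samples=300, n_features=15, weights=[0.85, 0.15], random_state=0)

# 5-fold stratified cross-validation: each fold keeps the class ratio; the model is
# trained on 4 folds and evaluated on the held-out fold, 5 times in total.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc")

print("AUC per fold:", scores.round(3))
print(f"Mean AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")
```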
2 Overfitting and Underfitting #
Q1: What are overfitting and underfitting in machine learning? #
These are two common problems that reduce model effectiveness:
- Overfitting: The model learns noise and specifics of the training set, performing poorly on new data.
- Underfitting: The model fails to capture underlying trends, resulting in poor performance on both training and test sets.
➡️ What are the visual signs of overfitting and underfitting during training?
Q2: How can we detect overfitting and underfitting? #
By plotting training and validation accuracy/loss:
- Overfitting: Training accuracy increases while validation accuracy drops or plateaus.
- Underfitting: Both training and validation accuracies remain low.
- Regular monitoring helps identify these trends early.
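A sketch of the kind of learning-curve plot described above, using dummy per-epoch values in place of a real training history (e.g. the `history.history` dict returned by a Keras `model.fit` call):

```python
import matplotlib.pyplot as plt

# Dummy per-epoch losses; in practice these come from the training framework's history.
history = {
    "loss":     [0.70, 0.55, 0.42, 0.33, 0.26, 0.21],
    "val_loss": [0.72, 0.60, 0.52, 0.50, 0.53, 0.58],  # rises after epoch 4 -> overfitting
}

epochs = range(1, len(history["loss"]) + 1)
plt.plot(epochs, history["loss"], label="training loss")
plt.plot(epochs, history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.title("Diverging curves suggest overfitting; both staying high suggests underfitting")
plt.show()
```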
➡️ What causes models to overfit?
Q3: What factors contribute to overfitting in ML models? #
- High model complexity (deep networks, too many parameters).
- Small training dataset or lack of representative diversity.
- Too many epochs without early stopping.
Overfitting is especially risky in healthcare due to variability in real-world patient populations.
➡️ Conversely, what might cause underfitting?
Q4: Why might a model underfit the data? #
- The model is too simple (e.g., linear models for nonlinear problems).
- Insufficient training time or suboptimal hyperparameters.
- Poor feature engineering or missing important data signals.
Underfitting leads to missed patterns—dangerous in diagnostic or predictive tools.
3 Strategies to Address Overfitting and Underfitting, and an Introduction to Regularization #
Q1: How can we address overfitting in ML models? #
Strategies include:
- Reducing model complexity: Use fewer layers or parameters.
- Early stopping: Halt training when validation performance stops improving.
- Data augmentation: Increase data diversity artificially (especially for images).
- Regularization: Penalize model complexity.
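As one example of these strategies, a minimal early-stopping sketch in Keras (the architecture, synthetic data, and patience value are illustrative assumptions, not prescribed settings):

```python
import numpy as np
import tensorflow as tf

# Synthetic binary-classification data standing in for a clinical dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20)).astype("float32")
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop once validation loss has not improved for 5 epochs and keep the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[early_stop], verbose=0)
```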
➡️ What types of regularization techniques are commonly used?
Q2: What is L1 and L2 regularization? #
These techniques add penalty terms to the loss function:
- L1 (Lasso): Adds the sum of absolute weight values to the loss → encourages sparsity (some weights become exactly zero).
- L2 (Ridge): Adds the sum of squared weights to the loss → discourages large weights.
They help control overfitting by shrinking parameter magnitudes.
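A minimal scikit-learn sketch contrasting the two penalties on synthetic data (the regularization strength C=0.1 is an illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)

# L1 (Lasso-style) penalty: many coefficients are driven exactly to zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
# L2 (Ridge-style) penalty: coefficients are shrunk but usually stay nonzero.
l2_model = LogisticRegression(penalty="l2", C=0.1).fit(X, y)

print("Nonzero weights with L1:", int(np.sum(l1_model.coef_ != 0)))
print("Nonzero weights with L2:", int(np.sum(l2_model.coef_ != 0)))
```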
➡️ Besides weight penalties, are there other techniques to improve generalization?
Q3: What is dropout and how does it help prevent overfitting? #
Dropout randomly disables neurons during training:
- Prevents co-adaptation of features.
- Encourages the network to learn redundant, distributed representations.
- Typically used in deep networks.
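A minimal Keras sketch of dropout between dense layers (the layer sizes and the 0.5 rate are common illustrative defaults, not values from the text):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # randomly zeroes 50% of activations at each training step
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# Dropout is active only during training; at inference time all neurons contribute.
```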
➡️ Can underfitting also be addressed through specific strategies?
Q4: How can we fix underfitting in a model? #
Underfitting can be resolved by:
- Increasing model capacity (deeper or more complex models).
- Training longer or using better optimization.
- Improving data quality or features.
- Adjusting learning rate and other hyperparameters.
4 Statistical Approaches to Model Evaluation #
Q1: Why are statistical methods important in evaluating ML models? #
Statistical tools help us quantify confidence and variability in model performance:
- Avoid over-interpreting single-point metrics.
- Make decisions grounded in significance and uncertainty.
- Essential when models may be deployed in clinical practice.
➡️ What’s a simple way to estimate uncertainty in metrics?
Q2: What is bootstrapping and how is it used in model evaluation? #
Bootstrapping is a resampling technique:
- Repeatedly sample with replacement from the test set.
- Evaluate the model on each sample to get a distribution of performance metrics.
- Helps compute confidence intervals for metrics like accuracy or AUC.
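A minimal percentile-bootstrap sketch for an AUC confidence interval (the helper name `bootstrap_auc_ci` and the dummy predictions are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for AUC on a fixed test set."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        if len(np.unique(y_true[idx])) < 2:              # AUC needs both classes present
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lo, hi)

# Dummy test labels and model scores; in practice y_score comes from the trained model.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.3, 0.7, 0.8, 0.4, 0.9, 0.2, 0.6, 0.55, 0.35])
print(bootstrap_auc_ci(y_true, y_score))
```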
➡️ Are there other methods for statistical comparison of models?
Q3: How do permutation tests assess model significance? #
Permutation testing involves:
- Randomly shuffling labels on test data.
- Evaluating model performance on this randomized data.
- Repeating multiple times to build a null distribution.
If the model's real performance lies far in the tail of this null distribution, the result is statistically meaningful.
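A minimal label-permutation sketch using AUC as the metric (the helper name and dummy data are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def permutation_p_value(y_true, y_score, n_perm=5000, seed=0):
    """One-sided p-value: is the observed AUC better than chance?"""
    rng = np.random.default_rng(seed)
    observed = roc_auc_score(y_true, y_score)
    null_aucs = np.array([
        roc_auc_score(rng.permutation(y_true), y_score) for _ in range(n_perm)
    ])
    # Fraction of label-shuffled AUCs at least as large as the observed one.
    p_value = (np.sum(null_aucs >= observed) + 1) / (n_perm + 1)
    return observed, p_value

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.3, 0.7, 0.8, 0.4, 0.9, 0.2, 0.6, 0.55, 0.35])
print(permutation_p_value(y_true, y_score))
```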
➡️ How do we assess reliability when comparing two models?
Q4: What is the paired t-test and when should it be used? #
The paired t-test compares two models' performance on the same test samples:
- Measures whether the difference in performance is statistically significant.
- Assumes approximately normal distribution of differences.
Useful for assessing whether model A genuinely outperforms model B or the observed difference is due to chance.
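A minimal SciPy sketch, assuming each model has been scored on the same cross-validation folds (the per-fold AUCs below are dummy values):

```python
import numpy as np
from scipy.stats import ttest_rel

# Per-fold AUCs for two models evaluated on the SAME folds (dummy values).
auc_model_a = np.array([0.86, 0.84, 0.88, 0.85, 0.87])
auc_model_b = np.array([0.82, 0.83, 0.84, 0.81, 0.85])

t_stat, p_value = ttest_rel(auc_model_a, auc_model_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p-value suggests the paired difference is unlikely to be due to chance,
# assuming the differences are roughly normally distributed.
```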
5 Receiver Operating Characteristic and Precision-Recall Curves as Evaluation Metrics #
Q1: Why do we need different evaluation metrics beyond accuracy? #
Accuracy alone can be misleading, especially in imbalanced datasets:
- In healthcare, positive cases (e.g., disease presence) may be rare.
- A model that predicts only the majority class can appear highly accurate.
- Metrics like precision, recall, and AUC provide better insight into real performance.
➡️ What is the ROC curve and how is it interpreted?
Q2: What is the ROC curve and AUC? #
- ROC (Receiver Operating Characteristic) curve plots True Positive Rate (TPR) vs. False Positive Rate (FPR).
- AUC (Area Under the Curve) summarizes this curve into a single number (0.5 = random, 1.0 = perfect).
- AUC reflects the model’s ranking ability—how well it separates classes.
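A minimal scikit-learn sketch computing the ROC curve and AUC on synthetic, imbalanced data (the model choice and class weights are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_score = model.predict_proba(X_test)[:, 1]   # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, y_score)        # points for plotting the curve
print(f"AUC = {roc_auc_score(y_test, y_score):.3f}")     # 0.5 = random, 1.0 = perfect ranking
```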
➡️ Are ROC curves always the best choice?
Q3: When should we prefer Precision-Recall (PR) curves over ROC? #
- PR curves focus on positive class performance, plotting Precision vs. Recall.
- More informative than ROC when dealing with imbalanced data.
- Helpful for screening tools or rare disease prediction.
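A minimal PR-curve sketch on heavily imbalanced synthetic data (the 5% prevalence is an illustrative assumption):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

y_score = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, y_score)

# Average precision summarizes the PR curve; its chance-level baseline is the
# positive-class prevalence (about 0.05 here), unlike ROC AUC's fixed 0.5.
print(f"Average precision = {average_precision_score(y_test, y_score):.3f}")
```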
➡️ How are these metrics computed and interpreted?
Q4: What are precision, recall, and F1 score? #
- Precision: TP / (TP + FP) → Of predicted positives, how many were correct?
- Recall: TP / (TP + FN) → Of actual positives, how many did we find?
- F1 Score: Harmonic mean of precision and recall.
These metrics give a fuller picture of model trade-offs, especially in clinical use cases.
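A minimal sketch computing these metrics from a confusion matrix, using dummy thresholded predictions (1 = disease present):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Dummy ground truth and thresholded predictions.
y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 1, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)                          # of predicted positives, how many were correct
recall = tp / (tp + fn)                             # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
print(classification_report(y_true, y_pred, digits=2))
```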