Module 4: Evaluation and Metrics for ML in Healthcare #
1 Introduction to Model Performance Evaluation #
Q1: Why is model evaluation critical in healthcare machine learning? #
In healthcare, decisions informed by ML models can have life-altering consequences:
- It’s not enough for a model to perform well on training data.
- We need to ensure that the model performs well on unseen patients and real-world conditions.
- Rigorous evaluation is essential to trust and validate clinical usefulness.
➡️ What does it mean for a model to generalize?
Q2: What is generalization and how is it assessed? #
Generalization refers to the model’s ability to perform well on new, unseen data:
- Indicates how well the model has learned the underlying patterns.
- We measure this using validation and test sets.
- High training accuracy but poor test accuracy implies overfitting.
➡️ What techniques help ensure fair and reliable evaluation?
Q3: What is data splitting and why is it important? #
Data splitting partitions the available data into disjoint subsets so that performance is always estimated on data the model did not learn from. Common splits:
- Training set: For model learning.
- Validation set: For tuning hyperparameters and model selection.
- Test set: For final performance estimation.
These splits help avoid data leakage and optimism bias in evaluation.
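As a minimal sketch, a three-way split with scikit-learn (the synthetic dataset and split proportions below are illustrative assumptions, not prescribed values):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a clinical dataset (purely illustrative).
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0)

# Hold out a final test set (20%), stratified so the rare positive class is preserved.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
# Split the remainder into training (60% of the total) and validation (20% of the total) sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)
# Hyperparameters are tuned on (X_val, y_val); (X_test, y_test) is used only once,
# for the final performance estimate.
```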
➡️ Are there methods more robust than simple train/test splits?
Q4: What is cross-validation and when is it useful? #
Cross-validation is a strategy to make evaluation more robust:
- Data is divided into k folds.
- Model trains on k-1 folds and validates on the remaining one.
- Repeated k times so that each fold serves once as the validation set; results are averaged to reduce variability.
It’s particularly useful in small datasets, common in healthcare studies.
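A minimal stratified k-fold sketch with scikit-learn (the model choice, fold count, and synthetic data are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Small synthetic dataset standing in for a typical healthcare-sized study.
X, y = make_classification(n_samples=300, n_features=15, weights=[0.85, 0.15], random_state=0)

# 5-fold stratified cross-validation: each fold keeps the class ratio; the model is
# trained on 4 folds and evaluated on the held-out fold, 5 times in total.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc")

print("AUC per fold:", scores.round(3))
print(f"Mean AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")
```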
2 Overfitting and Underfitting #
Q1: What are overfitting and underfitting in machine learning? #
These are two common problems that reduce model effectiveness:
- Overfitting: The model learns noise and specifics of the training set, performing poorly on new data.
- Underfitting: The model fails to capture underlying trends, resulting in poor performance on both training and test sets.
➡️ What are the visual signs of overfitting and underfitting during training?
Q2: How can we detect overfitting and underfitting? #
By plotting training and validation accuracy/loss:
- Overfitting: Training accuracy increases while validation accuracy drops or plateaus.
- Underfitting: Both training and validation accuracies remain low.
- Regular monitoring helps identify these trends early.
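A sketch of the kind of learning-curve plot described above, using dummy per-epoch values in place of a real training history (e.g. the `history.history` dict returned by a Keras `model.fit` call):

```python
import matplotlib.pyplot as plt

# Dummy per-epoch losses; in practice these come from the training framework's history.
history = {
    "loss":     [0.70, 0.55, 0.42, 0.33, 0.26, 0.21],
    "val_loss": [0.72, 0.60, 0.52, 0.50, 0.53, 0.58],  # rises after epoch 4 -> overfitting
}

epochs = range(1, len(history["loss"]) + 1)
plt.plot(epochs, history["loss"], label="training loss")
plt.plot(epochs, history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.title("Diverging curves suggest overfitting; both staying high suggests underfitting")
plt.show()
```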
➡️ What causes models to overfit?
Q3: What factors contribute to overfitting in ML models? #
- High model complexity (deep networks, too many parameters).
- Small training dataset or lack of representative diversity.
- Too many epochs without early stopping.
Overfitting is especially risky in healthcare due to variability in real-world patient populations.
➡️ Conversely, what might cause underfitting?
Q4: Why might a model underfit the data? #
- The model is too simple (e.g., linear models for nonlinear problems).
- Insufficient training time or suboptimal hyperparameters.
- Poor feature engineering or missing important data signals.
Underfitting leads to missed patterns—dangerous in diagnostic or predictive tools.
3 Strategies to Address Overfitting and Underfitting, and an Introduction to Regularization #
Q1: How can we address overfitting in ML models? #
Strategies include:
- Reducing model complexity: Use fewer layers or parameters.
- Early stopping: Halt training when validation performance stops improving.
- Data augmentation: Increase data diversity artificially (especially for images).
- Regularization: Penalize model complexity.
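As one example of these strategies, a minimal early-stopping sketch in Keras (the architecture, synthetic data, and patience value are illustrative assumptions, not prescribed settings):

```python
import numpy as np
import tensorflow as tf

# Synthetic binary-classification data standing in for a clinical dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20)).astype("float32")
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop once validation loss has not improved for 5 epochs and keep the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[early_stop], verbose=0)
```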
➡️ What types of regularization techniques are commonly used?
Q2: What is L1 and L2 regularization? #
These techniques add penalty terms to the loss function:
- L1 (Lasso): Adds the sum of absolute weight values to the loss → encourages sparsity (some weights become exactly zero).
- L2 (Ridge): Adds the sum of squared weights to the loss → discourages large weights.
They help control overfitting by shrinking parameter magnitudes.
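A minimal scikit-learn sketch contrasting the two penalties on synthetic data (the regularization strength C=0.1 is an illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)

# L1 (Lasso-style) penalty: many coefficients are driven exactly to zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
# L2 (Ridge-style) penalty: coefficients are shrunk but usually stay nonzero.
l2_model = LogisticRegression(penalty="l2", C=0.1).fit(X, y)

print("Nonzero weights with L1:", int(np.sum(l1_model.coef_ != 0)))
print("Nonzero weights with L2:", int(np.sum(l2_model.coef_ != 0)))
```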
➡️ Besides weight penalties, are there other techniques to improve generalization?
Q3: What is dropout and how does it help prevent overfitting? #
Dropout randomly disables neurons during training:
- Prevents co-adaptation of features.
- Encourages the network to learn redundant, distributed representations.
- Typically used in deep networks.
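A minimal Keras sketch of dropout between dense layers (the layer sizes and the 0.5 rate are common illustrative defaults, not values from the text):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # randomly zeroes 50% of activations at each training step
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# Dropout is active only during training; at inference time all neurons contribute.
```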
➡️ Can underfitting also be addressed through specific strategies?
Q4: How can we fix underfitting in a model? #
Underfitting can be resolved by:
- Increasing model capacity (deeper or more complex models).
- Training longer or using better optimization.
- Improving data quality or features.
- Adjusting learning rate and other hyperparameters.
4 Statistical Approaches to Model Evaluation #
Q1: Why are statistical methods important in evaluating ML models? #
Statistical tools help us quantify confidence and variability in model performance:
- Avoid over-interpreting single-point metrics.
- Make decisions grounded in significance and uncertainty.
- Essential when models may be deployed in clinical practice.
➡️ What’s a simple way to estimate uncertainty in metrics?
Q2: What is bootstrapping and how is it used in model evaluation? #
Bootstrapping is a resampling technique:
- Repeatedly sample with replacement from the test set.
- Evaluate the model on each sample to get a distribution of performance metrics.
- Helps compute confidence intervals for metrics like accuracy or AUC.
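A minimal percentile-bootstrap sketch for an AUC confidence interval (the helper name `bootstrap_auc_ci` and the dummy predictions are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for AUC on a fixed test set."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        if len(np.unique(y_true[idx])) < 2:              # AUC needs both classes present
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lo, hi)

# Dummy test labels and model scores; in practice y_score comes from the trained model.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.3, 0.7, 0.8, 0.4, 0.9, 0.2, 0.6, 0.55, 0.35])
print(bootstrap_auc_ci(y_true, y_score))
```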
➡️ Are there other methods for statistical comparison of models?
Q3: How do permutation tests assess model significance? #
Permutation testing involves:
- Randomly shuffling labels on test data.
- Evaluating model performance on this randomized data.
- Repeating multiple times to build a null distribution.
If the model's real performance lies far in the tail of this null distribution, the result is statistically meaningful.
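A minimal label-permutation sketch using AUC as the metric (the helper name and dummy data are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def permutation_p_value(y_true, y_score, n_perm=5000, seed=0):
    """One-sided p-value: is the observed AUC better than chance?"""
    rng = np.random.default_rng(seed)
    observed = roc_auc_score(y_true, y_score)
    null_aucs = np.array([
        roc_auc_score(rng.permutation(y_true), y_score) for _ in range(n_perm)
    ])
    # Fraction of label-shuffled AUCs at least as large as the observed one.
    p_value = (np.sum(null_aucs >= observed) + 1) / (n_perm + 1)
    return observed, p_value

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.3, 0.7, 0.8, 0.4, 0.9, 0.2, 0.6, 0.55, 0.35])
print(permutation_p_value(y_true, y_score))
```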
➡️ How do we assess reliability when comparing two models?
Q4: What is the paired t-test and when should it be used? #
The paired t-test compares two models' performance on the same test samples:
- Measures whether the difference in performance is statistically significant.
- Assumes approximately normal distribution of differences.
Useful for assessing whether model A genuinely outperforms model B or the observed difference is due to chance.
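A minimal SciPy sketch, assuming each model has been scored on the same cross-validation folds (the per-fold AUCs below are dummy values):

```python
import numpy as np
from scipy.stats import ttest_rel

# Per-fold AUCs for two models evaluated on the SAME folds (dummy values).
auc_model_a = np.array([0.86, 0.84, 0.88, 0.85, 0.87])
auc_model_b = np.array([0.82, 0.83, 0.84, 0.81, 0.85])

t_stat, p_value = ttest_rel(auc_model_a, auc_model_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p-value suggests the paired difference is unlikely to be due to chance,
# assuming the differences are roughly normally distributed.
```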
5 Receiver Operating Characteristic and Precision-Recall Curves as Evaluation Metrics #
Q1: Why do we need different evaluation metrics beyond accuracy? #
Accuracy alone can be misleading, especially in imbalanced datasets:
- In healthcare, positive cases (e.g., disease presence) may be rare.
- A model that predicts only the majority class can appear highly accurate.
- Metrics like precision, recall, and AUC provide better insight into real performance.
➡️ What is the ROC curve and how is it interpreted?
Q2: What is the ROC curve and AUC? #
- ROC (Receiver Operating Characteristic) curve plots True Positive Rate (TPR) vs. False Positive Rate (FPR).
- AUC (Area Under the Curve) summarizes this curve into a single number (0.5 = random, 1.0 = perfect).
- AUC reflects the model’s ranking ability—how well it separates classes.
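A minimal scikit-learn sketch computing the ROC curve and AUC on synthetic, imbalanced data (the model choice and class weights are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_score = model.predict_proba(X_test)[:, 1]   # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, y_score)        # points for plotting the curve
print(f"AUC = {roc_auc_score(y_test, y_score):.3f}")     # 0.5 = random, 1.0 = perfect ranking
```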
➡️ Are ROC curves always the best choice?
Q3: When should we prefer Precision-Recall (PR) curves over ROC? #
- PR curves focus on positive class performance, plotting Precision vs. Recall.
- More informative than ROC when dealing with imbalanced data.
- Helpful for screening tools or rare disease prediction.
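A minimal PR-curve sketch on heavily imbalanced synthetic data (the 5% prevalence is an illustrative assumption):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

y_score = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, y_score)

# Average precision summarizes the PR curve; its chance-level baseline is the
# positive-class prevalence (about 0.05 here), unlike ROC AUC's fixed 0.5.
print(f"Average precision = {average_precision_score(y_test, y_score):.3f}")
```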
➡️ How are these metrics computed and interpreted?
Q4: What are precision, recall, and F1 score? #
- Precision: TP / (TP + FP) → Of predicted positives, how many were correct?
- Recall: TP / (TP + FN) → Of actual positives, how many did we find?
- F1 Score: Harmonic mean of precision and recall.
These metrics give a fuller picture of model trade-offs, especially in clinical use cases.
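A minimal sketch computing these metrics from a confusion matrix, using dummy thresholded predictions (1 = disease present):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Dummy ground truth and thresholded predictions.
y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 1, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)                          # of predicted positives, how many were correct
recall = tp / (tp + fn)                             # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
print(classification_report(y_true, y_pred, digits=2))
```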