Class Imbalance, Outliers, and Distribution Shift #
Q1: What are the main problems discussed in this lecture? #
- Class imbalance
- Outliers
- Distribution shift
Q2: What is class imbalance and why is it a problem? #
- Definition: Some classes occur much less frequently than others.
- Examples: COVID detection, fraud detection, manufacturing defects, self-driving cars.
- Impact: Naive models can have misleadingly high accuracy while failing on rare classes.
Q3: How do we address class imbalance? #
- Sampling techniques:
  - Sample weights (less stable for mini-batch training)
  - Over-sampling (replicating minority-class examples)
  - Under-sampling (dropping majority-class examples)
  - SMOTE (synthetic minority over-sampling)
  - Balanced mini-batch training (a better class distribution in each batch)
- Choose appropriate evaluation metrics: precision, recall, F-beta score.
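The over-sampling idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library API: the helper `oversample_minority` is a hypothetical name, and it simply replicates rows of each class (with replacement) until all classes match the majority-class count.

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Randomly replicate rows until every class matches the majority count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c in classes:
        c_idx = np.flatnonzero(y == c)
        # Sample with replacement up to the majority-class count.
        idx.append(rng.choice(c_idx, size=n_max, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

# Example: 90 negatives and 10 positives become 90/90 after over-sampling.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)
X_bal, y_bal = oversample_minority(X, y)
```

SMOTE differs in that it interpolates between minority-class neighbors to create new synthetic points rather than duplicating existing ones.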
Q4: What are outliers and why are they problematic? #
- Definition: Datapoints that differ significantly from the norm.
- Causes: Measurement error, bad data collection, adversarial inputs, rare events.
- Impact: Outliers can distort model training (e.g., by skewing loss gradients) and destabilize inference.
Q5: How do we detect outliers? #
- Simple methods: Tukey’s fences, Z-score analysis.
- More advanced:
  - Isolation forest (tree-based)
  - KNN distance (neighbor proximity)
  - Autoencoders (reconstruction loss)
- Evaluation: ROC curve and AUROC score.
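The two simple detectors mentioned above can be written directly in NumPy. This is a sketch assuming a 1-D numeric feature; the function names are illustrative, and the usual defaults are shown (`k = 1.5` for Tukey's fences, a z-score threshold of 3).

```python
import numpy as np

def tukey_outliers(x, k=1.5):
    """Flag points outside Tukey's fences: [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def zscore_outliers(x, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

# Inject two extreme points into otherwise well-behaved data.
x = np.append(np.random.default_rng(0).normal(0.0, 1.0, 200), [8.0, -9.0])
flags = tukey_outliers(x)  # the injected extremes are flagged
```

Note that the z-score method uses the mean and standard deviation, which are themselves inflated by the outliers; Tukey's fences use quartiles and are more robust to this.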
Q6: What is distribution shift? #
- Definition: Training and test distributions differ.
- Almost all real-world ML deployments experience it.
Q7: What are the types of distribution shift? #
| Type | Meaning | Example |
|---|---|---|
| Covariate shift | p(x) changes, p(y \| x) stays the same | Camera quality differs between training data and deployment |
| Concept shift | p(y \| x) changes, p(x) stays the same | User preferences for the same items drift over time |
| Prior probability shift | p(y) changes, p(x \| y) stays the same | Fraud becomes more prevalent after the training data was collected |
Q8: How do we detect and handle distribution shift? #
- Detection: Monitor metrics and statistical properties of data.
- Handling:
- Retrain with better data.
- Use sample reweighting if unlabeled test data is available.
- Concept shift is the hardest to fix, since it cannot be corrected without labeled data from the new distribution.
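One simple way to monitor for covariate shift, assuming SciPy is available and a single numeric feature: compare the training distribution of the feature against recent production data with a two-sample Kolmogorov-Smirnov test. The data here is synthetic, with the production feature deliberately shifted.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 1000)  # feature at training time
live_feature = rng.normal(0.5, 1.0, 1000)   # shifted production data

# Two-sample KS test: a small p-value suggests p(x) has changed.
stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"possible covariate shift (KS statistic = {stat:.3f})")
```

In practice this check is run per feature and per monitoring window; it detects changes in p(x) only, so concept shift (a change in p(y | x)) still requires fresh labels to catch.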
Q9: Final Takeaways #
- Handling class imbalance, outliers, and distribution shift is critical for building robust, real-world ML systems.
- Evaluation metric choice, proper data preprocessing, and continuous monitoring are key strategies.