Class Imbalance, Outliers, and Distribution Shift

Q1: What are the main problems discussed in this lecture? #

  • Class imbalance
  • Outliers
  • Distribution shift

Q2: What is class imbalance and why is it a problem? #

  • Definition: Some classes occur much less frequently than others.
  • Examples: COVID detection, fraud detection, manufacturing defects, self-driving cars.
  • Impact: Naive models can have misleadingly high accuracy while failing on rare classes.
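
The accuracy trap can be shown with a small sketch (the labels and the 1% positive rate here are made up for illustration): a model that always predicts the majority class scores ~99% accuracy while catching none of the rare class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: roughly 1% positives (e.g., fraud cases)
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)

# A "naive" model that always predicts the majority class (0)
y_pred = np.zeros_like(y_true)

print(f"accuracy: {accuracy_score(y_true, y_pred):.3f}")  # ~0.99
print(f"recall:   {recall_score(y_true, y_pred):.3f}")    # 0.0 — misses every positive
```

High accuracy here is purely an artifact of the class ratio, which is why the metrics in Q3 matter.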

Q3: How do we address class imbalance? #

  • Sampling techniques:
    • Sample weights (weight each example's loss, e.g. by inverse class frequency; can be unstable with small mini-batches)
    • Over-sampling (replicating minority-class examples)
    • Under-sampling (dropping majority-class examples)
    • SMOTE (generating synthetic minority-class examples by interpolating between neighbors)
    • Balanced mini-batch training (enforcing a better class distribution in each batch)
  • Choose appropriate evaluation metrics: precision, recall, and the F-beta score rather than raw accuracy.
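
Random over-sampling of the minority class can be sketched in a few lines of NumPy (this is a simple illustration, not the lecture's exact recipe; SMOTE would additionally synthesize new points rather than replicate existing ones):

```python
import numpy as np

def oversample_minority(X, y, rng=None):
    """Replicate minority-class rows (with replacement) until all classes
    match the majority-class count. A minimal sketch of over-sampling."""
    rng = rng or np.random.default_rng(0)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c in classes:
        cls_idx = np.flatnonzero(y == c)
        # sample with replacement up to the majority-class count
        idx.append(rng.choice(cls_idx, size=n_max, replace=True))
    idx = np.concatenate(idx)
    rng.shuffle(idx)
    return X[idx], y[idx]

# Toy data: 8 majority examples, 2 minority examples
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
Xb, yb = oversample_minority(X, y)
print(np.bincount(yb))  # [8 8] — balanced after over-sampling
```

Over-sampling keeps all the majority-class information (unlike under-sampling) but can encourage overfitting to the replicated minority points.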

Q4: What are outliers and why are they problematic? #

  • Definition: Datapoints that differ significantly from the norm.
  • Causes: Measurement error, bad data collection, adversarial inputs, rare events.
  • Impact: Outliers can harm training and inference stability.

Q5: How do we detect outliers? #

  • Simple methods: Tukey’s fences, Z-score analysis.
  • More advanced:
    • Isolation forest (tree-based)
    • KNN distance (neighbor proximity)
    • Autoencoders (reconstruction loss)
  • Evaluation: ROC curve and AUROC score.
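
The two simple detectors can be written directly from their definitions (a minimal sketch; the toy data is invented for illustration):

```python
import numpy as np

def tukey_outliers(x, k=1.5):
    """Flag points outside Tukey's fences: [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def zscore_outliers(x, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

x = np.array([10.0, 11, 9, 10, 12, 10, 11, 100])  # 100 is an obvious outlier
print(np.flatnonzero(tukey_outliers(x)))  # [7]
```

Note that the z-score method uses the mean and standard deviation, which are themselves inflated by the outlier; on small samples like this one, an extreme point can partially mask itself, while Tukey's fences (being quantile-based) are more robust.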

Q6: What is distribution shift? #

  • Definition: Training and test distributions differ.
  • Almost all real-world ML deployments experience it.

Q7: What are the types of distribution shift? #

  • Covariate shift: p(x) changes while p(y|x) stays the same (e.g., a model trained on daytime images is deployed on nighttime images).
  • Concept shift: p(y|x) changes while p(x) stays the same (e.g., what users consider "spam" evolves over time).
  • Prior probability shift: p(y) changes while p(x|y) stays the same (e.g., the base rate of fraud rises during the holiday season).

Q8: How do we detect and handle distribution shift? #

  • Detection: Monitor metrics and statistical properties of data.
  • Handling:
    • Retrain with better data.
    • Use sample reweighting if unlabeled test data is available.
    • Concept shift remains hardest to fix without labeled test data.
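
Monitoring the statistical properties of incoming data can be as simple as a two-sample test per feature. A minimal sketch using the Kolmogorov–Smirnov test (the synthetic "training" and "live" features, the shift size, and the 0.01 significance threshold are all assumptions for illustration):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=2000)  # reference data
live_feature = rng.normal(loc=0.5, scale=1.0, size=2000)   # mean has drifted

# Two-sample KS test: small p-value => the two samples likely
# come from different distributions (covariate shift on this feature)
stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"shift detected (KS statistic={stat:.3f})")
```

Per-feature tests like this catch covariate shift but say nothing about concept shift, which changes p(y|x) and therefore requires labeled test data to detect.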

Q9: Final Takeaways #

  • Handling class imbalance, outliers, and distribution shift is critical for building robust, real-world ML systems.
  • Evaluation metric choice, proper data preprocessing, and continuous monitoring are key strategies.
