Growing or Compressing Datasets #

Q1: What is the main focus of this lecture? #

  • Techniques for carefully selecting which examples to label, reducing the annotation burden in ML systems.
  • Growing datasets via active learning and compressing datasets via core-set selection.

Q2: What is active learning? #

  • A method to intelligently select the most informative examples to label next, maximizing model improvement with fewer labeled samples.

Q3: How does pool-based active learning work? #

  • Start with a pool of unlabeled examples.
  • At each round, score examples using an acquisition function (e.g., entropy of predicted probabilities).
  • Select and label top examples.
  • Retrain the model on the newly labeled data and repeat (a minimal sketch of one round follows).
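
A minimal sketch of one round, assuming a scikit-learn-style classifier with fit/predict_proba (the function and parameter names here are illustrative, not from the lecture):

```python
import numpy as np

def entropy(probs):
    # Shannon entropy of each row of predicted class probabilities
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def active_learning_round(model, X_labeled, y_labeled, X_pool, budget=100):
    """One round of pool-based active learning with entropy acquisition."""
    model.fit(X_labeled, y_labeled)                # retrain on current labels
    scores = entropy(model.predict_proba(X_pool))  # score every pool example
    return np.argsort(scores)[-budget:]            # indices to send to annotators
```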

Q4: What is a common acquisition function used in active learning? #

  • Entropy of the predicted class probabilities, which steers labeling toward the examples the model is least sure about (formula below).
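
Concretely, for a k-class model, the entropy score of an unlabeled example x is computed from the predicted class distribution:

```latex
H(x) = -\sum_{c=1}^{k} p(y = c \mid x)\,\log p(y = c \mid x)
```

The score is maximal when the prediction is uniform over the classes and zero when the model is fully confident.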

Q5: How does active learning compare to passive learning? #

  • In favorable settings, active learning reaches a target accuracy with exponentially fewer labels than random (passive) sampling; the classic threshold example below shows why.
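
The standard theoretical illustration, assuming the realizable 1-D threshold-learning setting: passive learning needs on the order of 1/ε random labels to reach error ε, whereas an active learner can binary-search for the threshold with only logarithmically many label queries:

```latex
m_{\text{passive}} = O\!\left(\frac{1}{\epsilon}\right)
\qquad
m_{\text{active}} = O\!\left(\log\frac{1}{\epsilon}\right)
```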

Q6: What practical challenges does active learning face? #

  • High computational cost: each round retrains the model and re-scores the entire unlabeled pool, which becomes expensive with large models and datasets.

Q7: How can active learning be made more practical? #

  • Batch active learning with diversity selection, so each round labels many examples at once (one common recipe is sketched below).
  • Efficient candidate selection with methods like SEALS, which shrinks the pool that must be scored each round.
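
One common recipe for diversity-aware batch selection, sketched under simple assumptions (k-means over the uncertain shortlist; not necessarily the lecture's exact method):

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_batch(X_pool, scores, batch_size=10, shortlist_mult=10):
    """Cluster the most uncertain candidates, then take the top-scoring
    example from each cluster so the batch is uncertain and diverse."""
    top = np.argsort(scores)[-batch_size * shortlist_mult:]  # uncertain shortlist
    cluster = KMeans(n_clusters=batch_size, n_init=10).fit_predict(X_pool[top])
    batch = [top[cluster == c][np.argmax(scores[top[cluster == c]])]
             for c in range(batch_size)]                     # best per cluster
    return np.array(batch)
```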

Q8: What is SEALS? #

  • Similarity Search for Efficient Active Learning and Search of Rare Concepts.
  • Uses nearest-neighbor search in an embedding space to limit the active-learning candidate pool to points near already-labeled examples (sketch below).
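
A sketch of the core idea, assuming precomputed embeddings (seals_candidates and k are illustrative names): restrict the candidate pool to the nearest unlabeled neighbors of the labeled set, then run the acquisition function only on those candidates.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def seals_candidates(emb_labeled, emb_pool, k=100):
    """Candidate pool = union of the k nearest unlabeled neighbors of each
    labeled example, instead of the entire unlabeled pool."""
    nn = NearestNeighbors(n_neighbors=k).fit(emb_pool)
    _, idx = nn.kneighbors(emb_labeled)   # neighbors of every labeled point
    return np.unique(idx.ravel())         # de-duplicated candidate indices
```

Scoring then costs time proportional to the candidate set rather than to the full pool, which is what makes each round cheap.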

Q9: What is core-set selection? #

  • Choosing a small representative subset of a large labeled dataset that preserves model performance.

Q10: Why is core-set selection important? #

  • When we have massive datasets, it reduces computational, time, and energy costs without sacrificing accuracy.

Q11: What methods help with core-set selection? #

  • Greedy k-centers approach (sketched after this list).
  • Selection via Proxy: using smaller proxy models to guide subset selection.
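
A minimal sketch of the greedy k-centers heuristic (the standard 2-approximation; variable names are illustrative): repeatedly add the point farthest from the centers chosen so far.

```python
import numpy as np

def greedy_k_centers(X, k, seed=0):
    """Pick k centers by farthest-point traversal: each new center is the
    point with the largest distance to its nearest existing center."""
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]              # random first center
    dists = np.linalg.norm(X - X[centers[0]], axis=1)  # distance to nearest center
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))                    # farthest remaining point
        centers.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(centers)
```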

Q12: How does Selection via Proxy work? #

  • Train a lightweight model (proxy) on the full data.
  • Use it to score and select a subset on which the larger target model is trained, cutting end-to-end training cost significantly (sketch below).
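
A sketch under simple assumptions (entropy as the proxy's uncertainty signal; the original work also considers other selection metrics):

```python
import numpy as np

def select_via_proxy(proxy, target_model, X, y, keep_frac=0.5):
    """Use a cheap proxy model's uncertainty to pick the subset that the
    expensive target model is trained on."""
    proxy.fit(X, y)                                          # cheap pass over all data
    probs = proxy.predict_proba(X)
    scores = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # entropy per example
    keep = np.argsort(scores)[-int(keep_frac * len(X)):]     # keep hardest examples
    target_model.fit(X[keep], y[keep])                       # big model, smaller set
    return target_model
```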

Q13: What are key takeaways about dataset growth and compression? #

  • Active learning enables data-efficient labeling for growing datasets.
  • Core-set selection enables training efficiency for already large datasets.

Active Learning vs. Confident Learning #

| Category | Active Learning | Confident Learning |
| --- | --- | --- |
| Main Goal | Select the most informative examples to label next | Find mislabeled examples in existing labeled data |
| When Used | During dataset growth (annotation phase) | After labels exist (cleaning phase) |
| Data State | Partially labeled data pool | Fully labeled (but noisy) dataset |
| Use of Model Uncertainty | Samples highest-uncertainty examples for human labeling | Detects label inconsistency via confidence estimation |
| Human Role | Label new examples | Review and correct suspicious labels |
| Output | New labels added to the dataset | List of potential label errors to fix |
| Typical Workflow | Train model → Select uncertain points → Human annotates → Expand dataset | Train model → Identify inconsistent labels → Human verifies/corrects → Clean dataset |
| Common Technique | Uncertainty sampling | Confidence-based error detection |
| Libraries/Tools | modAL, ALiPy | cleanlab |
| Philosophy | Proactively grow data wisely | Reactively audit and clean existing data |
| Typical Question | "What should I label next?" | "Which labels are probably wrong?" |
| End Goal | Smarter, faster data acquisition | Higher-quality existing labels |
| Example Scenario | Medical image AI needing efficient expert labeling | Noisy crowd-sourced labeled text needing cleaning |
  • Key Distinction

    • Active Learning: “Help me label better data.”
    • Confident Learning: “Help me fix wrong data.”
  • Visual Process Summary

(Active Learning)
Small Labeled Set → Train Model → Find Most Uncertain → Human Labels → Grow Dataset

(Confident Learning)
Labeled (Noisy) Set → Train Model → Find Inconsistent Labels → Human Corrects → Clean Dataset
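
For the confident-learning side, a runnable toy example with cleanlab (assuming cleanlab 2.x; the synthetic data and injected noise are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))          # toy features
labels = (X[:, 0] > 0).astype(int)         # clean labels
labels[:10] = 1 - labels[:10]              # inject some label noise

# Out-of-sample predicted probabilities via cross-validation
pred_probs = cross_val_predict(
    LogisticRegression(), X, labels, cv=5, method="predict_proba"
)
issue_idx = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",  # most suspicious first
)
print(f"{len(issue_idx)} candidate label errors to review")
```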
