Growing or Compressing Datasets #

Q1: What is the main focus of this lecture? #

  • Techniques for carefully selecting which examples to label, reducing the annotation burden in ML systems.
  • Growing datasets via active learning and compressing datasets via core-set selection.

Q2: What is active learning? #

  • A method to intelligently select the most informative examples to label next, maximizing model improvement with fewer labeled samples.

Q3: How does pool-based active learning work? #

  • Start with a pool of unlabeled examples.
  • At each round, score examples using an acquisition function (e.g., entropy of predicted probabilities).
  • Select and label top examples.
  • Retrain the model on the newly labeled data and repeat (a minimal sketch of one round follows).
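
A minimal sketch of one round, assuming a scikit-learn-style classifier with fit/predict_proba (the function and parameter names here are illustrative, not from the lecture):

```python
import numpy as np

def entropy(probs):
    # Shannon entropy of each row of predicted class probabilities
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def active_learning_round(model, X_labeled, y_labeled, X_pool, budget=100):
    """One round of pool-based active learning with entropy acquisition."""
    model.fit(X_labeled, y_labeled)                # retrain on current labels
    scores = entropy(model.predict_proba(X_pool))  # score every pool example
    return np.argsort(scores)[-budget:]            # indices to send to annotators
```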

Q4: What is a common acquisition function used in active learning? #

  • Entropy of the predicted class probabilities, which steers labeling toward the examples the model is least sure about (formula below).
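
Concretely, for a k-class model, the entropy score of an unlabeled example x is computed from the predicted class distribution:

```latex
H(x) = -\sum_{c=1}^{k} p(y = c \mid x)\,\log p(y = c \mid x)
```

The score is maximal when the prediction is uniform over the classes and zero when the model is fully confident.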

Q5: How does active learning compare to passive learning? #

  • In favorable settings, active learning reaches a target accuracy with exponentially fewer labels than random (passive) sampling; the classic threshold example below shows why.
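
The standard theoretical illustration, assuming the realizable 1-D threshold-learning setting: passive learning needs on the order of 1/ε random labels to reach error ε, whereas an active learner can binary-search for the threshold with only logarithmically many label queries:

```latex
m_{\text{passive}} = O\!\left(\frac{1}{\epsilon}\right)
\qquad
m_{\text{active}} = O\!\left(\log\frac{1}{\epsilon}\right)
```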

Q6: What practical challenges does active learning face? #

  • High computational cost: each round retrains the model and re-scores the entire unlabeled pool, which becomes expensive with large models and datasets.

Q7: How can active learning be made more practical? #

  • Batch active learning with diversity selection, so each round labels many examples at once (one common recipe is sketched below).
  • Efficient candidate selection with methods like SEALS, which shrinks the pool that must be scored each round.
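
One common recipe for diversity-aware batch selection, sketched under simple assumptions (k-means over the uncertain shortlist; not necessarily the lecture's exact method):

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_batch(X_pool, scores, batch_size=10, shortlist_mult=10):
    """Cluster the most uncertain candidates, then take the top-scoring
    example from each cluster so the batch is uncertain and diverse."""
    top = np.argsort(scores)[-batch_size * shortlist_mult:]  # uncertain shortlist
    cluster = KMeans(n_clusters=batch_size, n_init=10).fit_predict(X_pool[top])
    batch = [top[cluster == c][np.argmax(scores[top[cluster == c]])]
             for c in range(batch_size)]                     # best per cluster
    return np.array(batch)
```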

Q8: What is SEALS? #

  • Similarity Search for Efficient Active Learning and Search of Rare Concepts.
  • Uses nearest-neighbor search in an embedding space to limit the active-learning candidate pool to points near already-labeled examples (sketch below).
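
A sketch of the core idea, assuming precomputed embeddings (seals_candidates and k are illustrative names): restrict the candidate pool to the nearest unlabeled neighbors of the labeled set, then run the acquisition function only on those candidates.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def seals_candidates(emb_labeled, emb_pool, k=100):
    """Candidate pool = union of the k nearest unlabeled neighbors of each
    labeled example, instead of the entire unlabeled pool."""
    nn = NearestNeighbors(n_neighbors=k).fit(emb_pool)
    _, idx = nn.kneighbors(emb_labeled)   # neighbors of every labeled point
    return np.unique(idx.ravel())         # de-duplicated candidate indices
```

Scoring then costs time proportional to the candidate set rather than to the full pool, which is what makes each round cheap.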

Q9: What is core-set selection? #

  • Choosing a small representative subset of a large labeled dataset that preserves model performance.

Q10: Why is core-set selection important? #

  • When we have massive datasets, it reduces computational, time, and energy costs without sacrificing accuracy.

Q11: What methods help with core-set selection? #

  • Greedy k-centers approach (sketched after this list).
  • Selection via Proxy: using smaller proxy models to guide subset selection.
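
A minimal sketch of the greedy k-centers heuristic (the standard 2-approximation; variable names are illustrative): repeatedly add the point farthest from the centers chosen so far.

```python
import numpy as np

def greedy_k_centers(X, k, seed=0):
    """Pick k centers by farthest-point traversal: each new center is the
    point with the largest distance to its nearest existing center."""
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]              # random first center
    dists = np.linalg.norm(X - X[centers[0]], axis=1)  # distance to nearest center
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))                    # farthest remaining point
        centers.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(centers)
```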

Q12: How does Selection via Proxy work? #

  • Train a lightweight model (proxy) on the full data.
  • Use it to score and select a subset on which the larger target model is trained, cutting end-to-end training cost significantly (sketch below).
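
A sketch under simple assumptions (entropy as the proxy's uncertainty signal; the original work also considers other selection metrics):

```python
import numpy as np

def select_via_proxy(proxy, target_model, X, y, keep_frac=0.5):
    """Use a cheap proxy model's uncertainty to pick the subset that the
    expensive target model is trained on."""
    proxy.fit(X, y)                                          # cheap pass over all data
    probs = proxy.predict_proba(X)
    scores = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # entropy per example
    keep = np.argsort(scores)[-int(keep_frac * len(X)):]     # keep hardest examples
    target_model.fit(X[keep], y[keep])                       # big model, smaller set
    return target_model
```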

Q13: What are key takeaways about dataset growth and compression? #

  • Active learning enables data-efficient labeling for growing datasets.
  • Core-set selection enables training efficiency for already large datasets.

Active Learning vs. Confident Learning #

| Category | Active Learning | Confident Learning |
| --- | --- | --- |
| Main Goal | Select the most informative examples to label next | Find mislabeled examples in existing labeled data |
| When Used | During dataset growth (annotation phase) | After labels exist (cleaning phase) |
| Data State | Partially labeled data pool | Fully labeled (but noisy) dataset |
| Use of Model Uncertainty | Samples highest-uncertainty examples for human labeling | Detects label inconsistency via confidence estimation |
| Human Role | Label new examples | Review and correct suspicious labels |
| Output | New labels added to the dataset | List of potential label errors to fix |
| Typical Workflow | Train model → Select uncertain points → Human annotates → Expand dataset | Train model → Identify inconsistent labels → Human verifies/corrects → Clean dataset |
| Common Technique | Uncertainty sampling | Confidence-based error detection |
| Libraries/Tools | modAL, ALiPy | cleanlab |
| Philosophy | Proactively grow data wisely | Reactively audit and clean existing data |
| Typical Question | "What should I label next?" | "Which labels are probably wrong?" |
| End Goal | Smarter, faster data acquisition | Higher-quality existing labels |
| Example Scenario | Medical image AI needing efficient expert labeling | Noisy crowd-sourced labeled text needing cleaning |
  • Key Distinction

    • Active Learning: “Help me label better data.”
    • Confident Learning: “Help me fix wrong data.”
  • Visual Process Summary

(Active Learning)
Small Labeled Set → Train Model → Find Most Uncertain → Human Labels → Grow Dataset

(Confident Learning)
Labeled (Noisy) Set → Train Model → Find Inconsistent Labels → Human Corrects → Clean Dataset
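
For the confident-learning side, a runnable toy example with cleanlab (assuming cleanlab 2.x; the synthetic data and injected noise are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))          # toy features
labels = (X[:, 0] > 0).astype(int)         # clean labels
labels[:10] = 1 - labels[:10]              # inject some label noise

# Out-of-sample predicted probabilities via cross-validation
pred_probs = cross_val_predict(
    LogisticRegression(), X, labels, cv=5, method="predict_proba"
)
issue_idx = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",  # most suspicious first
)
print(f"{len(issue_idx)} candidate label errors to review")
```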
