Growing or Compressing Datasets #
Q1: What is the main focus of this lecture? #
- Techniques for carefully selecting which examples to label or keep, reducing the labeling and training burden in ML systems.
- Growing datasets via active learning and compressing datasets via core-set selection.
Q2: What is active learning? #
- A method to intelligently select the most informative examples to label next, maximizing model improvement with fewer labeled samples.
Q3: How does pool-based active learning work? #
- Start with a pool of unlabeled examples.
- At each round, score examples using an acquisition function (e.g., entropy of predicted probabilities).
- Select and label top examples.
- Retrain the model on the expanded labeled set, then repeat (see the loop sketch below).
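A minimal sketch of one such round, assuming a scikit-learn-style `model` with `predict_proba` and a hypothetical `label_fn` standing in for the human annotator:

```python
import numpy as np

def active_learning_round(model, X_lab, y_lab, X_pool, label_fn, k=10):
    """One pool-based round: retrain, score the pool, label the top k."""
    model.fit(X_lab, y_lab)                             # retrain on current labels
    probs = model.predict_proba(X_pool)                 # score the unlabeled pool
    scores = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # entropy acquisition
    top = np.argsort(scores)[-k:]                       # k most uncertain examples
    X_lab = np.vstack([X_lab, X_pool[top]])             # add newly labeled examples
    y_lab = np.concatenate([y_lab, label_fn(X_pool[top])])
    X_pool = np.delete(X_pool, top, axis=0)             # shrink the pool
    return model, X_lab, y_lab, X_pool
```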
Q4: What is a common acquisition function used in active learning? #
- Entropy of the predicted class probabilities, encouraging labeling of uncertain examples.
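As a sanity check, a small numpy sketch: a uniform prediction scores highest, a confident one near zero.

```python
import numpy as np

def entropy_scores(probs):
    """Entropy of each row of predicted class probabilities."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

probs = np.array([[0.50, 0.50],    # maximally uncertain
                  [0.99, 0.01]])   # confidently class 0
print(entropy_scores(probs))       # ~[0.693, 0.056] -> label the first example
```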
Q5: How does active learning compare to passive learning? #
- In favorable settings, active learning can reduce the number of labels needed exponentially compared to random (passive) sampling.
Q6: What practical challenges does active learning face? #
- High computational cost: each round retrains the model and re-scores the entire unlabeled pool, which becomes expensive with large models and datasets.
Q7: How can active learning be made more practical? #
- Batch active learning with diversity selection.
- Efficient candidate selection using methods like SEALS to reduce search space.
Q8: What is SEALS? #
- Similarity Search for Efficient Active Learning and Search of Rare Concepts.
- Uses nearest neighbor search in embedding space to limit active learning candidate pool.
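A rough sketch of the SEALS idea, assuming precomputed embeddings; scikit-learn's `NearestNeighbors` stands in here for the approximate-nearest-neighbor index (e.g., Faiss) used at billion scale:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def seals_candidates(embeddings, labeled_idx, k=100):
    """Restrict the active-learning pool to the k nearest neighbors
    of each labeled example in embedding space (SEALS-style)."""
    index = NearestNeighbors(n_neighbors=k).fit(embeddings)
    _, nbrs = index.kneighbors(embeddings[labeled_idx])  # (n_labeled, k)
    return np.setdiff1d(np.unique(nbrs), labeled_idx)    # score only these
```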
Q9: What is core-set selection? #
- Choosing a small representative subset of a large labeled dataset that preserves model performance.
Q10: Why is core-set selection important? #
- For massive datasets, it cuts compute, time, and energy costs with little or no loss in accuracy.
Q11: What methods help with core-set selection? #
- Greedy k-centers approach (see the sketch after this list).
- Selection via Proxy: using smaller proxy models to guide subset selection.
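A minimal sketch of the greedy k-centers (farthest-point) heuristic from Sener & Savarese, assuming a feature matrix `X`:

```python
import numpy as np

def greedy_k_centers(X, k, seed=0):
    """Repeatedly add the point farthest from the current centers
    (the classic 2-approximation for the k-center objective)."""
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]                # arbitrary first center
    dists = np.linalg.norm(X - X[centers[0]], axis=1)    # distance to nearest center
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))                      # farthest remaining point
        centers.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(centers)                             # indices of the core-set
```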
Q12: How does Selection via Proxy work? #
- Train a lightweight proxy model on the full dataset.
- Use the proxy's uncertainty (or other difficulty) scores to select a subset for training the larger target model, speeding up end-to-end training significantly (sketch below).
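A hedged sketch of the idea, with `LogisticRegression` standing in for the paper's small proxy networks and entropy as the difficulty score:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_via_proxy(X, y, frac=0.2):
    """Train a cheap proxy on all data, keep the examples it finds
    hardest, and train the expensive model on that subset only."""
    proxy = LogisticRegression(max_iter=1000).fit(X, y)      # lightweight proxy
    probs = proxy.predict_proba(X)
    scores = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # proxy uncertainty
    return np.argsort(scores)[-int(frac * len(X)):]          # hardest fraction
```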
Q13: What are key takeaways about dataset growth and compression? #
- Active learning enables data-efficient labeling for growing datasets.
- Core-set selection enables training efficiency for already large datasets.
Active Learning vs. Confident Learning #
| Category | Active Learning | Confident Learning |
| --- | --- | --- |
| Main Goal | Select the most informative examples to label next | Find mislabeled examples in existing labeled data |
| When Used | During dataset growth (annotation phase) | After labels exist (cleaning phase) |
| Data State | Partially labeled data pool | Fully labeled (but noisy) dataset |
| Model Uncertainty Usage | Samples highest-uncertainty examples for human labeling | Detects label inconsistency via confidence estimation |
| Human Role | Label new examples | Review and correct suspicious labels |
| Output | New labels added to dataset | List of potential label errors to fix |
| Typical Workflow | Train model → Select uncertain points → Human annotates → Expand dataset | Train model → Identify inconsistent labels → Human verifies/corrects → Clean dataset |
| Common Technique | Uncertainty sampling | Confidence-based error detection |
| Libraries/Tools | modAL, ALiPy | cleanlab |
| Philosophy | Proactively grow data wisely | Reactively audit and clean existing data |
| Typical Question | “What should I label next?” | “Which labels are probably wrong?” |
| End Goal | Smarter, faster data acquisition | Higher quality existing labels |
| Example Scenario | Medical image AI needing efficient expert labeling | Noisy crowd-sourced labeled text needing cleaning |
Key Distillation #
- Active Learning: “Help me label better data.”
- Confident Learning: “Help me fix wrong data.”
Visual Process Summary #
(Active Learning)
Small Labeled Set → Train Model → Find Most Uncertain → Human Labels → Grow Dataset
(Confident Learning)
Labeled (Noisy) Set → Train Model → Find Inconsistent Labels → Human Corrects → Clean Dataset
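For the confident-learning side, a toy sketch with cleanlab's `find_label_issues`, assuming `pred_probs` are out-of-sample predictions (e.g., from cross-validation):

```python
import numpy as np
from cleanlab.filter import find_label_issues

labels = np.array([0, 1, 0, 1])          # given (possibly noisy) labels
pred_probs = np.array([[0.9, 0.1],
                       [0.2, 0.8],
                       [0.1, 0.9],       # confident disagreement with label 0
                       [0.3, 0.7]])

issue_idx = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",  # most suspicious first
)
print(issue_idx)  # indices whose labels likely need human review
```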
References #
- Active Learning for Convolutional Neural Networks: A Core-Set Approach (Sener & Savarese, 2018)
- Similarity Search for Efficient Active Learning and Search of Rare Concepts (SEALS; Coleman et al., 2022)
- Similarity Estimation Techniques from Rounding Algorithms (Charikar, 2002)
- Billion-Scale Similarity Search with GPUs (Johnson et al., 2019)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2019)
- A Simple Framework for Contrastive Learning of Visual Representations (SimCLR; Chen et al., 2020)
- Emerging Properties in Self-Supervised Vision Transformers (DINO; Caron et al., 2021)
- Selection via Proxy: Efficient Data Selection for Deep Learning (Coleman et al., 2020)
- Lab assignment on growing datasets