Label Errors
Q1: What are label errors and why do they matter?
- Label errors: Incorrect labels in training/testing datasets.
- They degrade model performance, distort benchmark results, and create risks when models are deployed.
Q2: What are the types of label noise?
- Uniform/Symmetric noise: Labels are flipped uniformly at random.
- Systematic/Asymmetric noise: Certain classes are more likely to be mislabeled as specific other classes (both models are sketched below).
- Instance-dependent noise: Noise depends on input features (out of scope here).
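As an illustration, the two in-scope noise models can be written as label-transition matrices. The sketch below uses a hypothetical 3-class problem with made-up rates (not taken from any dataset):
```python
# Illustrative noise transition matrices T[true_label, noisy_label]
# for a hypothetical 3-class problem. Each row sums to 1.
import numpy as np

num_classes = 3
flip_rate = 0.2  # hypothetical overall corruption rate

# Uniform/symmetric noise: a true label flips to every other class
# with equal probability.
symmetric = np.full((num_classes, num_classes), flip_rate / (num_classes - 1))
np.fill_diagonal(symmetric, 1.0 - flip_rate)

# Systematic/asymmetric noise: mislabeling follows a structure,
# e.g. class 0 is often confused with class 1.
asymmetric = np.eye(num_classes)
asymmetric[0, 0], asymmetric[0, 1] = 0.7, 0.3
asymmetric[2, 2], asymmetric[2, 0] = 0.9, 0.1

print(symmetric)   # uniform off-diagonal mass
print(asymmetric)  # structured off-diagonal mass
```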
Q3: What is Confident Learning (CL)?
- A framework to:
  - Find label errors
  - Rank examples by label issue likelihood
  - Learn with noisy labels
  - Characterize noise structure
- Model-agnostic: uses model-predicted probabilities.
Q4: How does CL detect label errors?
- Use out-of-sample predicted probabilities together with the given (noisy) labels.
- Estimate the joint distribution of noisy vs. true labels.
- Off-diagonal entries of that joint distribution correspond to label errors.
- Key techniques: Prune, Count, Rank (a simplified sketch follows below).
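The counting step can be sketched roughly as follows. This is a simplified illustration with my own variable names, not cleanlab's actual implementation: compute a per-class confidence threshold, assign each example a suspected true label only when the model is confident, and count it into a joint matrix.
```python
# Simplified sketch of the Confident Learning counting step.
import numpy as np

def confident_joint(labels, pred_probs):
    """Count examples into C[given_label, suspected_true_label].

    labels: (n,) given (noisy) labels; assumes every class appears at least once.
    pred_probs: (n, K) out-of-sample predicted probabilities.
    """
    n, K = pred_probs.shape
    # Per-class threshold: average self-confidence of examples labeled class j.
    thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(K)])
    C = np.zeros((K, K), dtype=int)
    for i in range(n):
        # Classes this example is confidently predicted to belong to.
        confident = np.where(pred_probs[i] >= thresholds)[0]
        if confident.size == 0:
            continue  # "prune": skip low-confidence examples
        j = confident[np.argmax(pred_probs[i, confident])]  # suspected true label
        C[labels[i], j] += 1  # "count"
    return C

# Off-diagonal entries of C are the candidate label errors; summing them
# gives an estimate of how many label errors the dataset contains.
```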
Q5: Why is a noise process assumption needed?
- To separate model uncertainty (epistemic) from label noise (aleatoric).
- CL assumes class-conditional noise (stated formally below).
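Writing the observed (noisy) label as \tilde{y} and the latent true label as y*, the class-conditional assumption is commonly stated as:
```latex
% Class-conditional noise: given the true label, the probability of the
% observed (noisy) label does not depend on the input x.
p(\tilde{y} = i \mid y^{*} = j,\, x) \;=\; p(\tilde{y} = i \mid y^{*} = j)
```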
Q6: Why not just sort by loss?
- Sorting by loss doesn’t tell you:
  - Where to cut off the ranking
  - How many label errors exist
  - How to automate error finding without human review
Q7: How does CL achieve robustness to imperfect predictions?
- Prune examples with low-confidence predictions.
- Count robustly across many examples, so individual prediction mistakes average out.
- Rank examples by predicted probability (the ranking step is sketched below).
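Continuing the sketch from Q4, the ranking step might look roughly like this (illustrative names, not cleanlab's API): flagged examples are ordered by the model's predicted probability for the given label, lowest first.
```python
# Illustrative ranking step: order flagged examples by self-confidence
# (the predicted probability of the *given* label), most suspicious first.
import numpy as np

def rank_label_issues(issue_indices, labels, pred_probs):
    # issue_indices and labels are integer NumPy arrays.
    self_confidence = pred_probs[issue_indices, labels[issue_indices]]
    return issue_indices[np.argsort(self_confidence)]
```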
Q8: How does label noise affect real-world ML?
- Label noise in real-world datasets is not uniformly random; it tends to be systematic.
- Claims that deep learning is robust to label noise often rest on unrealistic uniform-random noise assumptions.
Q9: What happens when test sets have label errors?
- Benchmark rankings of models can change once test-set label errors are corrected.
- A model that looks “better” on the noisy test set may actually underperform in deployment.
- Quantifying label errors in test sets is therefore critical.
Q10: How can practitioners fix this?
- Use corrected test sets.
- Benchmark models on the cleaned labels.
- Tools like cleanlab can automate finding label issues (a usage sketch follows below).
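A rough usage sketch, assuming cleanlab 2.x (check the current docs for exact arguments). The inputs below are random stand-ins; in practice `pred_probs` should be out-of-sample predicted probabilities, e.g. from cross-validation:
```python
import numpy as np
from cleanlab.filter import find_label_issues

# Hypothetical stand-in inputs: given (possibly noisy) labels and
# out-of-sample predicted probabilities from whatever model you trained.
rng = np.random.default_rng(0)
pred_probs = rng.dirichlet(np.ones(3), size=100)   # shape (n, num_classes)
labels = rng.integers(0, 3, size=100)              # shape (n,)

issue_indices = find_label_issues(
    labels,
    pred_probs,
    return_indices_ranked_by="self_confidence",    # most suspicious first
)
print(f"Flagged {len(issue_indices)} examples as likely label errors")
```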
Q11: Key Takeaways
- Confident learning enables data-centric model improvements.
- Even small label error rates (~3-6%) can destabilize ML benchmarks.
- ML needs to quantify label noise to ensure real-world reliability.
References