Label Errors
Q1: What are label errors and why do they matter?
- Label errors: examples whose given label does not match the true class, in training or test data.
- They degrade model performance, distort benchmark conclusions, and create deployment risks.
Q2: What are the types of label noise?
- Uniform/symmetric noise: labels are flipped uniformly at random, independently of class.
- Systematic/asymmetric noise: certain classes are more likely to be flipped, often into specific other classes (both of these are simulated in the sketch below).
- Instance-dependent noise: the flip probability depends on the input features (out of scope here).
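A minimal sketch (class count and flip rates are made up) of how the first two noise types can be simulated with a noise transition matrix `T`, where `T[i, j]` is the probability that a true label `i` is observed as `j`:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, noise_rate = 3, 0.2

# Symmetric noise: every class flips to each other class with equal probability.
T_sym = np.full((n_classes, n_classes), noise_rate / (n_classes - 1))
np.fill_diagonal(T_sym, 1.0 - noise_rate)

# Asymmetric noise: class 0 is often mislabeled as class 1; other classes stay clean.
T_asym = np.eye(n_classes)
T_asym[0] = [0.7, 0.3, 0.0]

def corrupt(true_labels, T):
    """Draw a noisy label for each example from the row T[true_label]."""
    return np.array([rng.choice(len(T), p=T[y]) for y in true_labels])

true_labels = rng.integers(0, n_classes, size=1000)
noisy_sym = corrupt(true_labels, T_sym)
noisy_asym = corrupt(true_labels, T_asym)
```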
Q3: What is Confident Learning (CL)?
- A framework to:
  - Find label errors
  - Rank examples by their likelihood of having a label issue
  - Learn with noisy labels
  - Characterize the structure of the label noise
- Model-agnostic: it only needs the given (noisy) labels plus out-of-sample model-predicted probabilities (see the sketch below for obtaining these).
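A minimal sketch of obtaining the out-of-sample predicted probabilities CL expects, assuming scikit-learn; the dataset here is a synthetic stand-in and any classifier with `predict_proba` would do:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data; in practice X and labels come from your dataset.
X, labels = make_classification(
    n_samples=500, n_classes=3, n_informative=5, random_state=0
)

# Out-of-sample probabilities: each example is scored by a model that never saw
# it during training, so its own (possibly wrong) label cannot leak in.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels, cv=5, method="predict_proba"
)
print(pred_probs.shape)  # (500, 3): one probability per class for each example
```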
Q4: How does CL detect label errors?
- Use out-of-sample predicted probabilities together with the given noisy labels.
- Estimate the joint distribution of noisy (given) labels and latent true labels.
- Off-diagonal entries of that joint correspond to label errors.
- Key techniques: prune, count, rank (the counting step is sketched below).
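A minimal from-scratch sketch of the counting step in the spirit of the confident joint (variable names are mine; the real cleanlab implementation adds calibration and other refinements):

```python
import numpy as np

def confident_joint(labels, pred_probs):
    """Count examples into C[given_label, likely_true_label] using per-class
    confidence thresholds (mean self-confidence of each class)."""
    n_classes = pred_probs.shape[1]
    # Threshold t_j: average predicted probability of class j over the
    # examples that are *labeled* j.
    thresholds = np.array(
        [pred_probs[labels == j, j].mean() for j in range(n_classes)]
    )
    C = np.zeros((n_classes, n_classes), dtype=int)
    for given, probs in zip(labels, pred_probs):
        # Candidate true classes: those whose probability clears their threshold.
        above = np.where(probs >= thresholds)[0]
        if len(above) > 0:
            likely_true = above[np.argmax(probs[above])]
            C[given, likely_true] += 1
    return C

# Tiny demo: the third example is labeled 0 but the model is confident it is 1.
labels = np.array([0, 0, 0, 1, 1, 1])
pred_probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
                       [0.1, 0.9], [0.15, 0.85], [0.25, 0.75]])
print(confident_joint(labels, pred_probs))  # off-diagonal count flags that example
```

Normalizing `C` gives an estimate of the joint distribution of given and true labels; its off-diagonal mass is the estimated fraction of label errors.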
Q5: Why is a noise process assumption needed?
- To separate model uncertainty (epistemic) from label noise (aleatoric): without an assumed noise process, a low predicted probability could mean either a bad label or a bad model.
- CL assumes class-conditional noise (formalized below).
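In the usual notation, with ỹ the observed (noisy) label and y* the latent true label, class-conditional noise means the flip probability depends only on the true class, not on the input x:

```latex
p(\tilde{y} = i \mid y^{*} = j, \, x) \;=\; p(\tilde{y} = i \mid y^{*} = j)
```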
Q6: Why not just sort by loss?
- Sorting by loss only ranks examples (see the sketch after this list); it doesn't tell you:
  - Where to set the cutoff
  - How many label errors the dataset contains
  - How to automate error finding without human review
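A minimal sketch of how counting supplies the missing cutoff (assuming `est_joint` is a normalized joint like the one estimated above and `loss` is a per-example loss; this is a simplification of cleanlab's actual pruning options):

```python
import numpy as np

def num_label_errors(est_joint, n):
    """Estimated number of errors: off-diagonal mass of the joint times n."""
    return int(round(n * (est_joint.sum() - np.trace(est_joint))))

def flag_by_loss_with_cutoff(loss, est_joint):
    """Rank examples by loss (highest first), then keep only as many as the
    estimated joint says are mislabeled -- the cutoff loss alone cannot give."""
    k = num_label_errors(est_joint, len(loss))
    return np.argsort(loss)[::-1][:k]

# Tiny demo with made-up numbers: ~10% off-diagonal mass over 10 examples
# means roughly one example gets flagged -- the one with the largest loss.
est_joint = np.array([[0.45, 0.05], [0.05, 0.45]])
loss = np.array([0.2, 0.1, 2.3, 0.4, 0.3, 0.1, 0.2, 0.5, 0.3, 0.2])
print(flag_by_loss_with_cutoff(loss, est_joint))  # -> [2]
```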
Q7: How does CL achieve robustness to imperfect predictions?
- Prune: remove likely-mislabeled examples rather than trying to reweight or relabel them.
- Count: use per-class confidence thresholds so that miscalibrated or imperfect probabilities don't skew the estimates.
- Rank: order examples by label-quality scores derived from the predicted probabilities (sketched below).
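For the ranking step, a minimal sketch using cleanlab's label-quality scores (assuming cleanlab 2.x; lower score means a more suspect label):

```python
import numpy as np
from cleanlab.rank import get_label_quality_scores

labels = np.array([0, 0, 0, 1, 1, 1])  # given (possibly noisy) labels
pred_probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],   # third looks mislabeled
                       [0.1, 0.9], [0.15, 0.85], [0.25, 0.75]])

# "self_confidence" scores each example by the predicted probability of its given label.
scores = get_label_quality_scores(labels, pred_probs, method="self_confidence")
ranked = np.argsort(scores)  # most suspect labels first
print(ranked[:3])            # the mislabeled-looking example should rank at the top
```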
Q8: How does label noise affect real-world ML?
- Real-world label noise is not uniformly random; it is typically systematic (certain class pairs are confused far more often than others).
- Claims that deep learning is robust to label noise often assume unrealistic uniform random noise, so they overstate robustness in practice.
Q9: What happens when test sets have label errors?
- Benchmark rankings of models can change once test labels are corrected.
- A model that looks "better" on a noisy test set may actually underperform in deployment.
- Quantifying label errors in test sets is therefore critical (a small comparison is sketched below).
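A minimal sketch of how rankings can flip (all numbers are made up): two hypothetical models are compared on a test set with roughly 6% wrong labels and again against the corrected labels:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
true_labels = rng.integers(0, 2, size=n)

# Noisy test set: roughly 6% of the labels are wrong.
noisy_labels = true_labels.copy()
flip = rng.random(n) < 0.06
noisy_labels[flip] ^= 1

# Model A tends to reproduce the noisy labels (it learned the same mistakes);
# model B tracks the true labels better.
preds_a = np.where(rng.random(n) < 0.93, noisy_labels, 1 - noisy_labels)
preds_b = np.where(rng.random(n) < 0.95, true_labels, 1 - true_labels)

# On the noisy test set A typically edges out B; on the corrected labels B wins.
for name, preds in [("A", preds_a), ("B", preds_b)]:
    print(name,
          "noisy-test acc:", round((preds == noisy_labels).mean(), 3),
          "corrected acc:", round((preds == true_labels).mean(), 3))
```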
Q10: How can practitioners fix this?
- Use corrected test sets.
- Benchmark models against the cleaned labels.
- Tools like cleanlab can automate finding label issues (usage sketched below).
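A minimal end-to-end sketch with cleanlab (assuming cleanlab 2.x and scikit-learn; the dataset is a synthetic stand-in with ~5% of labels deliberately flipped):

```python
import numpy as np
from cleanlab.filter import find_label_issues
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for a real dataset, with ~5% of the labels flipped.
X, labels = make_classification(n_samples=1000, n_classes=2, random_state=0)
rng = np.random.default_rng(0)
flip = rng.random(len(labels)) < 0.05
labels[flip] = 1 - labels[flip]

# Out-of-sample predicted probabilities via cross-validation.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels, cv=5, method="predict_proba"
)

# Boolean mask of examples flagged as likely label issues.
issues = find_label_issues(labels=labels, pred_probs=pred_probs)
print("flagged:", issues.sum(), "of", len(labels))
```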
Q11: Key Takeaways
- Confident Learning enables data-centric model improvements.
- Even small label error rates (~3-6%) can destabilize ML benchmarks.
- ML needs to quantify label noise to ensure real-world reliability.