Dataset Creation and Curation
Q1: What are the main themes of dataset creation and curation?
- Framing the ML task correctly.
- Addressing data sourcing concerns like selection bias.
- Handling label sourcing and quality control.
Q2: Why is careful sourcing of data important?
- ML models exploit spurious correlations.
- If training data does not match real-world deployment conditions, models can fail badly.
Q3: What is selection bias and what are its common causes?
- Selection bias: Systematic mismatch between training data and deployment data.
- Causes: time/location bias, demographic bias, response bias, availability bias, and long-tail bias.
Q4: How can we deal with selection bias during data collection?
- Hold out validation sets that mimic deployment conditions, such as the latest data, new locations, or oversampled rare events (a time-based split is sketched below).
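A minimal sketch of a deployment-aware holdout, assuming a pandas DataFrame with a hypothetical timestamp column: the most recent rows are reserved for validation so that evaluation mimics predicting on future data.

```python
# Minimal sketch: hold out the most recent data as the validation set so that
# validation mimics deployment on future data. Column name is hypothetical.
import pandas as pd

def time_based_split(df: pd.DataFrame, time_col: str = "timestamp", holdout_frac: float = 0.2):
    """Return (train, val), where val contains the most recent rows."""
    df_sorted = df.sort_values(time_col)
    cutoff = int(len(df_sorted) * (1 - holdout_frac))
    return df_sorted.iloc[:cutoff], df_sorted.iloc[cutoff:]

# Usage (hypothetical data): train_df, val_df = time_based_split(df, time_col="collected_at")
```

The same idea applies to other axes of selection bias: hold out entire locations, or oversample rare events in the validation split.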
Q5: How can we estimate how much data we need?
- Measure learning curves: train on sub-samples of increasing size, then fit a simple log-log model to extrapolate how error scales with dataset size (sketched below).
- \( \log(\text{error}) = -a \cdot \log(n) + b \)
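A minimal sketch of the sub-sampling procedure, where \( n \) is the training-set size: the (size, error) pairs below are hypothetical measurements from training the same model on random subsamples, and NumPy's polyfit does the log-log fit.

```python
# Minimal sketch: fit log(error) = -a * log(n) + b to sub-sampled training runs,
# then extrapolate the error expected from a larger dataset.
import numpy as np

# Hypothetical measurements: validation error of the same model trained on
# random subsamples of increasing size n.
sizes = np.array([500, 1000, 2000, 4000, 8000])
errors = np.array([0.30, 0.24, 0.20, 0.17, 0.145])

# Fit the log-log model; the slope corresponds to -a and the intercept to b.
slope, intercept = np.polyfit(np.log(sizes), np.log(errors), deg=1)

def predicted_error(n: int) -> float:
    """Extrapolated validation error for a training set of size n."""
    return float(np.exp(slope * np.log(n) + intercept))

print(predicted_error(32_000))  # rough estimate of error with 4x more data than measured
```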
Q7: What are concerns when labeling data with crowdsourced workers?
- Variability in annotator accuracy.
- Possibility of annotator collusion.
Q8: How can we maintain label quality during crowdsourcing?
- Insert “quality control” examples with known ground truth to monitor annotator performance (a simple check is sketched below).
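A minimal sketch of this check, using hypothetical data structures (a dict of gold labels and a list of (annotator, example, label) annotations):

```python
# Minimal sketch: score each annotator on hidden quality-control examples
# whose ground-truth labels are known. All data here is hypothetical.
from collections import defaultdict

gold = {"ex1": "cat", "ex2": "dog"}                      # example_id -> known true label
annotations = [("alice", "ex1", "cat"), ("alice", "ex2", "dog"),
               ("bob", "ex1", "dog")]                    # (annotator, example_id, label)

correct, total = defaultdict(int), defaultdict(int)
for annotator, example_id, label in annotations:
    if example_id in gold:                               # only score the gold examples
        total[annotator] += 1
        correct[annotator] += int(label == gold[example_id])

accuracy = {a: correct[a] / total[a] for a in total}
print(accuracy)   # flag annotators whose accuracy falls below a chosen threshold
```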
Q9: What methods are used to curate labels from multiple annotators?
- Majority Vote and Inter-Annotator Agreement.
- Dawid-Skene model.
- CROWDLAB.
Q10: How does Majority Vote work?
- Assigns the label chosen by the majority of annotators.
- Confidence is based on inter-annotator agreement (see the sketch below).
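A minimal sketch of majority-vote consensus with agreement-based confidence (label values are hypothetical):

```python
# Minimal sketch: consensus label by majority vote; confidence is the fraction
# of annotators who chose the winning label.
from collections import Counter

def majority_vote(labels: list[str]) -> tuple[str, float]:
    """Return (consensus_label, agreement) for one example's annotations."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]   # note: ties are broken arbitrarily
    return label, count / len(labels)

print(majority_vote(["cat", "cat", "dog"]))   # ('cat', 0.666...)
```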
Q11: What are downsides of Majority Vote?
- Ties are ambiguous.
- Bad annotators have the same influence as good ones.
Q12: What is the Dawid-Skene model?
- Models each annotator with a confusion matrix.
- Uses Bayesian inference (often approximated with EM) to estimate consensus labels and annotator quality (an EM sketch follows).
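A minimal sketch of Dawid-Skene fit with EM, assuming a dense (n_examples, n_annotators) array of class indices with -1 marking missing annotations and every example labeled by at least one annotator:

```python
# Minimal sketch of the Dawid-Skene model fit with EM (NumPy only).
import numpy as np

def dawid_skene(labels: np.ndarray, n_classes: int, n_iter: int = 50):
    """labels: (n_examples, n_annotators) class indices, -1 = missing."""
    n, m = labels.shape
    K = n_classes

    # Initialize label posteriors T[i, k] from each example's empirical vote distribution.
    T = np.zeros((n, K))
    for i in range(n):
        for l in labels[i][labels[i] >= 0]:
            T[i, l] += 1
    T /= T.sum(axis=1, keepdims=True)          # assumes >= 1 annotation per example

    for _ in range(n_iter):
        # M-step: class prior and per-annotator confusion matrices C[j, true, observed].
        prior = T.mean(axis=0) + 1e-12
        C = np.full((m, K, K), 1e-6)           # smoothing avoids log(0)
        for j in range(m):
            for l in range(K):
                C[j, :, l] += T[labels[:, j] == l].sum(axis=0)
        C /= C.sum(axis=2, keepdims=True)

        # E-step: posterior over each example's true label given prior and confusions.
        logT = np.tile(np.log(prior), (n, 1))
        for j in range(m):
            seen = labels[:, j] >= 0
            logT[seen] += np.log(C[j][:, labels[seen, j]]).T
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)

    return T.argmax(axis=1), T, C              # consensus labels, posteriors, confusions
```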
Q13: What are limitations of Dawid-Skene?
- Relies on strong assumptions (e.g., each annotator's errors depend only on the true class and are independent across examples).
- Performs poorly if examples are labeled by few annotators.
Q14: What is CROWDLAB?
- Combines classifier predictions and annotator labels for better consensus.
- Weights depend on model confidence and inter-annotator agreement.
Q15: How does CROWDLAB work?
- For examples labeled by few annotators, rely more on the classifier.
- For examples labeled by many annotators, rely more on label agreement.
Q16: How are weights estimated in CROWDLAB?
- Based on annotator agreement rates and classifier accuracy, each normalized against a majority-class baseline (a simplified sketch follows).
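A simplified sketch of this weighting idea, not the exact CROWDLAB estimator (a full implementation ships with the cleanlab library): both the classifier and each annotator are weighted by how much better they agree with a majority-vote reference than a majority-class baseline would, and the consensus is their weighted average.

```python
# Simplified sketch of CROWDLAB-style consensus: weighted average of classifier
# predicted probabilities and one-hot annotator labels. Not the published estimator.
import numpy as np

def crowdlab_like_consensus(pred_probs: np.ndarray, labels: np.ndarray, eps: float = 1e-6):
    """pred_probs: (n_examples, n_classes) classifier probabilities.
    labels: (n_examples, n_annotators) class indices, -1 = missing."""
    n, K = pred_probs.shape
    m = labels.shape[1]

    # One-hot annotator labels and a majority-vote reference consensus.
    onehot = np.zeros((n, m, K))
    for j in range(m):
        seen = labels[:, j] >= 0
        onehot[seen, j, labels[seen, j]] = 1.0
    majority = onehot.sum(axis=1).argmax(axis=1)

    # Baseline: accuracy of always predicting the most common consensus class.
    baseline = np.bincount(majority, minlength=K).max() / n

    # Classifier weight: agreement with majority vote, normalized against the baseline.
    model_agree = (pred_probs.argmax(axis=1) == majority).mean()
    w_model = max(model_agree - baseline, eps) / (1 - baseline + eps)

    # Annotator weights: each annotator's agreement with majority vote, same normalization.
    w_annot = np.full(m, eps)
    for j in range(m):
        seen = labels[:, j] >= 0
        if seen.any():
            agree = (labels[seen, j] == majority[seen]).mean()
            w_annot[j] = max(agree - baseline, eps) / (1 - baseline + eps)

    # Weighted average: examples with few annotations lean on the classifier,
    # examples with many annotations lean on the (weighted) annotator labels.
    consensus = np.zeros((n, K))
    for i in range(n):
        seen = labels[i] >= 0
        total = w_model + w_annot[seen].sum()
        consensus[i] = (w_model * pred_probs[i]
                        + (w_annot[seen, None] * onehot[i, seen]).sum(axis=0)) / total
    return consensus.argmax(axis=1), consensus
```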
Q17: What is the hands-on lab assignment for this lecture?
- Analyze a multi-annotator dataset and implement methods for estimating consensus labels and annotator quality.