Dataset Creation and Curation
Q1: What are the main themes of dataset creation and curation? #
- Framing the ML task correctly.
- Addressing data sourcing concerns like selection bias.
- Handling label sourcing and quality control.
Q2: Why is careful sourcing of data important? #
- ML models exploit spurious correlations.
- If training data does not match real-world deployment conditions, models can fail badly.
Q3: What is selection bias and what are its common causes? #
- Selection bias: Systematic mismatch between training data and deployment data.
- Causes: Time/location bias, demographic bias, response bias, availability bias, long tail bias.
Q4: How can we deal with selection bias during data collection? #
- Hold out validation sets that mimic deployment conditions, such as latest data, new locations, or oversampled rare events.
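A minimal sketch of one such holdout, assuming a pandas DataFrame `df` with a `timestamp` column (the column names are illustrative): the most recent rows are reserved for validation so that evaluation mimics deployment on future data rather than a random shuffle.

```python
# Sketch: time-based holdout instead of a random train/validation split.
# Assumes `df` has a "timestamp" column; names are illustrative.
import pandas as pd

def temporal_split(df: pd.DataFrame, holdout_frac: float = 0.2):
    """Split into train/validation by time rather than at random."""
    df_sorted = df.sort_values("timestamp")
    cutoff = int(len(df_sorted) * (1 - holdout_frac))
    train_df = df_sorted.iloc[:cutoff]
    val_df = df_sorted.iloc[cutoff:]  # latest data simulates deployment conditions
    return train_df, val_df
```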
Q5: How can we estimate how much data we need? #
- Measure learning curves by training on random sub-samples of the data, then fit a simple log-log model to extrapolate how error scales with dataset size.
Q6: What is the formula used to predict model error with more data? #
- \( \log(\text{error}) = -a \cdot \log(n) + b \), where \( n \) is the number of training examples and \( a, b \) are fitted constants.
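A minimal sketch of fitting this model with NumPy; the subsample sizes and validation errors below are placeholder values standing in for measurements you would obtain by retraining your model on random subsets.

```python
# Sketch: estimate how error scales with dataset size by measuring validation
# error on sub-samples and fitting log(error) = -a*log(n) + b.
import numpy as np

subsample_sizes = np.array([500, 1000, 2000, 4000, 8000])
val_errors = np.array([0.30, 0.26, 0.22, 0.19, 0.165])  # placeholder measurements

# Linear fit in log-log space: log(error) = slope*log(n) + intercept.
slope, intercept = np.polyfit(np.log(subsample_sizes), np.log(val_errors), deg=1)
a, b = -slope, intercept

def predicted_error(n: int) -> float:
    """Extrapolate validation error for a dataset of size n."""
    return float(np.exp(-a * np.log(n) + b))

print(predicted_error(50_000))  # rough estimate; extrapolation can be unreliable
```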
Q7: What are concerns when labeling data with crowdsourced workers? #
- Variability in annotator accuracy.
- Possibility of annotator collusion.
Q8: How can we maintain label quality during crowdsourcing? #
- Insert “quality control” examples with known ground-truth to monitor annotator performance.
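A minimal sketch of tracking per-annotator accuracy on seeded quality-control examples; the data structures (`annotations` as (annotator, example, label) triples and a `gold_labels` dict) are illustrative assumptions.

```python
# Sketch: monitor annotator accuracy on seeded quality-control examples
# whose ground-truth labels are known in advance.
from collections import defaultdict

def qc_accuracy(annotations, gold_labels):
    """annotations: iterable of (annotator_id, example_id, label) triples.
    gold_labels: dict mapping example_id -> true label (QC examples only)."""
    correct, total = defaultdict(int), defaultdict(int)
    for annotator, example, label in annotations:
        if example in gold_labels:
            total[annotator] += 1
            correct[annotator] += int(label == gold_labels[example])
    return {a: correct[a] / total[a] for a in total}
```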
Q9: What methods are used to curate labels from multiple annotators? #
- Majority Vote and Inter-Annotator Agreement.
- Dawid-Skene model.
- CROWDLAB.
Q10: How does Majority Vote work? #
- Assigns the label chosen by the majority of annotators.
- Confidence is based on inter-annotator agreement.
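A minimal sketch of majority vote with agreement-based confidence for a single example's annotations:

```python
# Sketch: majority-vote consensus; confidence is the fraction of annotators
# who agree with the consensus label.
from collections import Counter

def majority_vote(labels):
    """Return (consensus_label, confidence) for one example's annotations."""
    counts = Counter(labels)
    consensus, votes = counts.most_common(1)[0]  # ties broken arbitrarily
    confidence = votes / len(labels)
    return consensus, confidence

print(majority_vote(["cat", "cat", "dog"]))  # ('cat', 0.666...)
```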
Q11: What are downsides of Majority Vote? #
- Ties are ambiguous.
- Bad annotators have equal influence as good ones.
Q12: What is the Dawid-Skene model? #
- Models each annotator with a confusion matrix.
- Uses Bayesian inference (often approximated with EM) to estimate consensus labels and annotator quality.
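A simplified EM sketch of the Dawid-Skene idea (not a full or optimized implementation); it assumes labels are given as an integer matrix with -1 marking missing annotations and that every example has at least one annotation.

```python
# Simplified Dawid-Skene sketch: EM alternates between estimating soft
# consensus labels and per-annotator confusion matrices.
# labels[i, j] = annotator j's label for example i, or -1 if unlabeled.
import numpy as np

def dawid_skene(labels: np.ndarray, K: int, n_iter: int = 50):
    n, m = labels.shape
    # Initialize soft consensus labels from per-example vote fractions.
    T = np.zeros((n, K))
    for i in range(n):
        for j in range(m):
            if labels[i, j] >= 0:
                T[i, labels[i, j]] += 1
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and annotator confusion matrices (soft counts).
        prior = T.mean(axis=0)
        conf = np.full((m, K, K), 1e-6)  # small value avoids log(0)
        for j in range(m):
            for i in range(n):
                if labels[i, j] >= 0:
                    conf[j, :, labels[i, j]] += T[i]
            conf[j] /= conf[j].sum(axis=1, keepdims=True)

        # E-step: update soft consensus labels given estimated annotator quality.
        logT = np.tile(np.log(prior), (n, 1))
        for i in range(n):
            for j in range(m):
                if labels[i, j] >= 0:
                    logT[i] += np.log(conf[j, :, labels[i, j]])
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)

    return T.argmax(axis=1), conf  # consensus labels, annotator confusion matrices
```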
Q13: What are limitations of Dawid-Skene? #
- Relies on strong assumptions (e.g., an annotator's error rates depend only on the true class, not on the individual example).
- Performs poorly if examples are labeled by few annotators.
Q14: What is CROWDLAB? #
- Combines classifier predictions and annotator labels for better consensus.
- Weights depend on model confidence and inter-annotator agreement.
Q15: How does CROWDLAB work? #
- For examples labeled by few annotators, rely more on the classifier.
- For examples labeled by many annotators, rely more on label agreement.
Q16: How are weights estimated in CROWDLAB? #
- Based on annotator agreement rates and classifier accuracy normalized against a majority-class baseline.
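A conceptual sketch of this weighted combination, not the exact published CROWDLAB estimator: the classifier's predicted distribution and each annotator's one-hot label are averaged with quality weights, so examples with few annotators lean on the classifier and examples with many annotators lean on agreement. The weighting scheme and variable names here are simplifying assumptions.

```python
# Conceptual sketch of the CROWDLAB idea: weighted combination of classifier
# predictions and annotator labels for one example.
import numpy as np

def weighted_consensus(pred_probs, annotator_labels, annotator_weights, clf_weight, K):
    """pred_probs: (K,) classifier class distribution for this example.
    annotator_labels: dict annotator_id -> label for this example.
    annotator_weights: dict annotator_id -> quality weight.
    clf_weight: weight on the classifier's prediction."""
    combined = clf_weight * np.asarray(pred_probs, dtype=float)
    for a, label in annotator_labels.items():
        one_hot = np.zeros(K)
        one_hot[label] = 1.0
        combined += annotator_weights[a] * one_hot
    combined /= combined.sum()
    return combined.argmax(), combined  # consensus label and its soft distribution
```

In practice the annotator and classifier weights would be estimated from agreement rates and classifier accuracy (normalized against a majority-class baseline), not supplied by hand.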
Q17: What is the hands-on lab assignment for this lecture? #
- Analyze a multi-annotator dataset and implement methods for estimating consensus labels and annotator quality.