Dataset Creation and Curation #


Q1: What are the main themes of dataset creation and curation? #

  • Framing the ML task correctly.
  • Addressing data sourcing concerns like selection bias.
  • Handling label sourcing and quality control.

Q2: Why is careful sourcing of data important? #

  • ML models exploit spurious correlations.
  • If training data does not match real-world deployment conditions, models can fail badly.

Q3: What is selection bias and what are its common causes? #

  • Selection bias: Systematic mismatch between training data and deployment data.
  • Causes: Time/location bias, demographic bias, response bias, availability bias, long tail bias.

Q4: How can we deal with selection bias during data collection? #

  • Hold out validation sets that mimic deployment conditions, such as the most recent data, new locations, or oversampled rare events (a time-based split is sketched below).
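
A minimal sketch of one such holdout, assuming a pandas DataFrame with a `timestamp` column (the column names and values here are illustrative): instead of a random split, the most recent data is reserved for validation to mimic deployment on future data.

```python
import pandas as pd

# Hypothetical dataset with a timestamp column; adapt to your own schema.
df = pd.DataFrame({
    "feature": [0.2, 0.5, 0.1, 0.9, 0.4, 0.7],
    "label":   [0, 1, 0, 1, 0, 1],
    "timestamp": pd.to_datetime([
        "2023-01-05", "2023-02-10", "2023-03-15",
        "2023-04-20", "2023-05-25", "2023-06-30",
    ]),
})

# Temporal split: train on older data, validate on the most recent 20%
# to approximate deployment conditions (instead of a random split).
df = df.sort_values("timestamp")
cutoff = int(len(df) * 0.8)
train_df, val_df = df.iloc[:cutoff], df.iloc[cutoff:]

print(f"train up to {train_df['timestamp'].max().date()}, "
      f"validate from {val_df['timestamp'].min().date()}")
```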

Q5: How can we estimate how much data we need? #

  • Measure learning curves by training on sub-samples of the data and fitting a simple log-log model to predict how performance scales with dataset size (sketched after Q6).

Q6: What is the formula used to predict model error with more data? #

  • \( \log(\text{error}) = -a \cdot \log(n) + b \), where \(n\) is the number of training examples (see the sketch below).
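
A minimal sketch of this procedure, using synthetic data and scikit-learn's `LogisticRegression` purely as a stand-in for the real model: train on increasing sub-samples, record validation error, fit the log-log line by least squares, and extrapolate to a larger dataset size.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data as a placeholder for your dataset.
X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Measure validation error at increasing training-set sizes.
sizes = [250, 500, 1000, 2000, 4000, 8000]
errors = []
for n in sizes:
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    errors.append(1.0 - model.score(X_val, y_val))

# Fit log(error) = -a * log(n) + b via least squares.
slope, b = np.polyfit(np.log(sizes), np.log(errors), deg=1)
a = -slope

# Extrapolate: predicted error if we collected 50,000 training examples.
n_target = 50_000
predicted_error = np.exp(-a * np.log(n_target) + b)
print(f"a={a:.3f}, b={b:.3f}, predicted error at n={n_target}: {predicted_error:.3f}")
```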

Q7: What are concerns when labeling data with crowdsourced workers? #

  • Variability in annotator accuracy.
  • Possibility of annotator collusion.

Q8: How can we maintain label quality during crowdsourcing? #

  • Insert “quality control” examples with known ground-truth to monitor annotator performance.
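
A minimal sketch of the idea (the item ids, labels, and data structures are hypothetical, not a specific platform's API): mix gold-labeled items into each annotator's batches and track their accuracy on those items.

```python
# Hypothetical quality-control items with known ground-truth labels.
gold_labels = {"item_7": "cat", "item_19": "dog", "item_42": "cat"}

# Labels each annotator submitted, keyed by item id.
annotations = {
    "annotator_A": {"item_7": "cat", "item_19": "dog", "item_42": "dog", "item_3": "cat"},
    "annotator_B": {"item_7": "dog", "item_19": "dog", "item_42": "dog", "item_3": "dog"},
}

# Score each annotator only on the gold items they labeled.
for annotator, labels in annotations.items():
    scored = [(item, lab) for item, lab in labels.items() if item in gold_labels]
    correct = sum(lab == gold_labels[item] for item, lab in scored)
    accuracy = correct / len(scored) if scored else float("nan")
    print(f"{annotator}: {accuracy:.0%} correct on {len(scored)} gold items")
```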

Q9: What methods are used to curate labels from multiple annotators? #

  • Majority Vote and Inter-Annotator Agreement.
  • Dawid-Skene model.
  • CROWDLAB.

Q10: How does Majority Vote work? #

  • Assigns the label chosen by the majority of annotators.
  • Confidence is based on inter-annotator agreement.
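
A minimal sketch, assuming annotations are stored as a mapping from each example to the list of labels it received (a hypothetical structure):

```python
from collections import Counter

# Hypothetical multi-annotator labels: example id -> labels from different annotators.
multi_labels = {
    "ex_1": ["cat", "cat", "dog"],
    "ex_2": ["dog", "dog", "dog"],
    "ex_3": ["cat", "dog"],          # tie: majority vote is ambiguous here
}

for example, labels in multi_labels.items():
    counts = Counter(labels)
    # Ties are broken arbitrarily (by insertion order), illustrating the downside in Q11.
    consensus, votes = counts.most_common(1)[0]
    agreement = votes / len(labels)   # fraction of annotators agreeing = crude confidence
    print(f"{example}: consensus={consensus}, agreement={agreement:.2f}")
```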

Q11: What are downsides of Majority Vote? #

  • Ties are ambiguous.
  • Bad annotators carry the same weight as good ones.

Q12: What is the Dawid-Skene model? #

  • Models each annotator with a confusion matrix.
  • Uses Bayesian inference (often approximated with EM) to estimate consensus labels and annotator quality.
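
A compact EM sketch of this model (a class prior plus one confusion matrix per annotator), written from scratch for illustration; production implementations add more initialization and numerical safeguards.

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50, smoothing=0.01):
    """Estimate consensus labels and annotator confusion matrices via EM.

    labels: (n_examples, n_annotators) int array; -1 marks a missing annotation.
    Returns (posterior over true labels, per-annotator confusion matrices).
    """
    n_examples, n_annotators = labels.shape

    # Initialize the posterior with each example's empirical label frequencies
    # (a soft majority vote).
    posterior = np.full((n_examples, n_classes), 1.0 / n_classes)
    for i in range(n_examples):
        observed = labels[i][labels[i] >= 0]
        if observed.size:
            posterior[i] = np.bincount(observed, minlength=n_classes) / observed.size

    for _ in range(n_iter):
        # M-step: class prior and one confusion matrix per annotator.
        class_prior = (posterior.sum(axis=0) + smoothing) / (n_examples + n_classes * smoothing)
        confusion = np.full((n_annotators, n_classes, n_classes), smoothing)
        for j in range(n_annotators):
            for i in np.where(labels[:, j] >= 0)[0]:
                confusion[j, :, labels[i, j]] += posterior[i]
        confusion /= confusion.sum(axis=2, keepdims=True)

        # E-step: update the posterior over each example's true label.
        log_post = np.tile(np.log(class_prior), (n_examples, 1))
        for j in range(n_annotators):
            mask = labels[:, j] >= 0
            log_post[mask] += np.log(confusion[j][:, labels[mask, j]]).T
        log_post -= log_post.max(axis=1, keepdims=True)
        posterior = np.exp(log_post)
        posterior /= posterior.sum(axis=1, keepdims=True)

    return posterior, confusion

# Toy example: 4 examples, 3 annotators, 2 classes; -1 = not labeled.
toy_labels = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
    [1, -1, 1],
])
posterior, confusion = dawid_skene(toy_labels, n_classes=2)
print("consensus labels:", posterior.argmax(axis=1))
```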

Q13: What are limitations of Dawid-Skene? #

  • Requires strong assumptions.
  • Performs poorly when each example is labeled by only a few annotators.

Q14: What is CROWDLAB? #

  • Combines classifier predictions and annotator labels for better consensus.
  • Weights depend on model confidence and inter-annotator agreement.

Q15: How does CROWDLAB work? #

  • For examples labeled by few annotators, rely more on the classifier.
  • For examples labeled by many annotators, rely more on label agreement.

Q16: How are weights estimated in CROWDLAB? #

  • Based on annotator agreement rates and classifier accuracy normalized against a majority-class baseline.
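
The following is a simplified illustration of this weighted-combination idea, not the exact CROWDLAB algorithm: blend the classifier's predicted probabilities with the annotators' empirical label distribution, with weights standing in for the quality estimates described above. All values and names are hypothetical.

```python
import numpy as np

n_classes = 3

# Classifier predicted probabilities for two examples (hypothetical values).
pred_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.4, 0.2],
])

# Annotator labels per example; the second example has only one annotation.
annotator_labels = [
    [0, 0, 1],
    [2],
]

# Fixed quality weights for illustration; in CROWDLAB these are estimated from
# annotator agreement and classifier accuracy relative to a majority-class baseline.
w_classifier = 0.6
w_annotators = 1.0

for probs, labels in zip(pred_probs, annotator_labels):
    # Empirical distribution of the annotators' labels for this example.
    label_dist = np.bincount(labels, minlength=n_classes) / len(labels)
    # Fewer annotations -> lean more on the classifier; more -> lean on agreement.
    weights = np.array([w_classifier, w_annotators * len(labels)])
    weights = weights / weights.sum()
    consensus_probs = weights[0] * probs + weights[1] * label_dist
    print("consensus:", consensus_probs.argmax(), "probs:", np.round(consensus_probs, 2))
```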

Q17: What is the hands-on lab assignment for this lecture? #

  • Analyze a multi-annotator dataset and implement methods for estimating consensus labels and annotator quality.
