Dataset Creation and Curation #


Q1: What are the main themes of dataset creation and curation? #

  • Framing the ML task correctly.
  • Addressing data sourcing concerns like selection bias.
  • Handling label sourcing and quality control.

Q2: Why is careful sourcing of data important? #

  • ML models exploit spurious correlations.
  • If training data does not match real-world deployment conditions, models can fail badly.

Q3: What is selection bias and what are its common causes? #

  • Selection bias: Systematic mismatch between training data and deployment data.
  • Causes: Time/location bias, demographic bias, response bias, availability bias, long tail bias.

Q4: How can we deal with selection bias during data collection? #

  • Hold out validation sets that mimic deployment conditions, such as the most recent data, new locations, or oversampled rare events (a time-based split is sketched below).
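
A minimal sketch of one such holdout, assuming a pandas DataFrame with a `timestamp` column (the column names and values here are illustrative): instead of a random split, the most recent data is reserved for validation to mimic deployment on future data.

```python
import pandas as pd

# Hypothetical dataset with a timestamp column; adapt to your own schema.
df = pd.DataFrame({
    "feature": [0.2, 0.5, 0.1, 0.9, 0.4, 0.7],
    "label":   [0, 1, 0, 1, 0, 1],
    "timestamp": pd.to_datetime([
        "2023-01-05", "2023-02-10", "2023-03-15",
        "2023-04-20", "2023-05-25", "2023-06-30",
    ]),
})

# Temporal split: train on older data, validate on the most recent 20%
# to approximate deployment conditions (instead of a random split).
df = df.sort_values("timestamp")
cutoff = int(len(df) * 0.8)
train_df, val_df = df.iloc[:cutoff], df.iloc[cutoff:]

print(f"train up to {train_df['timestamp'].max().date()}, "
      f"validate from {val_df['timestamp'].min().date()}")
```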

Q5: How can we estimate how much data we need? #

  • Measure learning curves by training on sub-samples of the data and fitting a simple log-log model to predict how performance scales with dataset size (sketched after Q6).

Q6: What is the formula used to predict model error with more data? #

  • \( \log(\text{error}) = -a \cdot \log(n) + b \), where \(n\) is the number of training examples (see the sketch below).
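
A minimal sketch of this procedure, using synthetic data and scikit-learn's `LogisticRegression` purely as a stand-in for the real model: train on increasing sub-samples, record validation error, fit the log-log line by least squares, and extrapolate to a larger dataset size.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data as a placeholder for your dataset.
X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Measure validation error at increasing training-set sizes.
sizes = [250, 500, 1000, 2000, 4000, 8000]
errors = []
for n in sizes:
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    errors.append(1.0 - model.score(X_val, y_val))

# Fit log(error) = -a * log(n) + b via least squares.
slope, b = np.polyfit(np.log(sizes), np.log(errors), deg=1)
a = -slope

# Extrapolate: predicted error if we collected 50,000 training examples.
n_target = 50_000
predicted_error = np.exp(-a * np.log(n_target) + b)
print(f"a={a:.3f}, b={b:.3f}, predicted error at n={n_target}: {predicted_error:.3f}")
```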

Q7: What are concerns when labeling data with crowdsourced workers? #

  • Variability in annotator accuracy.
  • Possibility of annotator collusion.

Q8: How can we maintain label quality during crowdsourcing? #

  • Insert “quality control” examples with known ground-truth to monitor annotator performance.
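
A minimal sketch of the idea (the item ids, labels, and data structures are hypothetical, not a specific platform's API): mix gold-labeled items into each annotator's batches and track their accuracy on those items.

```python
# Hypothetical quality-control items with known ground-truth labels.
gold_labels = {"item_7": "cat", "item_19": "dog", "item_42": "cat"}

# Labels each annotator submitted, keyed by item id.
annotations = {
    "annotator_A": {"item_7": "cat", "item_19": "dog", "item_42": "dog", "item_3": "cat"},
    "annotator_B": {"item_7": "dog", "item_19": "dog", "item_42": "dog", "item_3": "dog"},
}

# Score each annotator only on the gold items they labeled.
for annotator, labels in annotations.items():
    scored = [(item, lab) for item, lab in labels.items() if item in gold_labels]
    correct = sum(lab == gold_labels[item] for item, lab in scored)
    accuracy = correct / len(scored) if scored else float("nan")
    print(f"{annotator}: {accuracy:.0%} correct on {len(scored)} gold items")
```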

Q9: What methods are used to curate labels from multiple annotators? #

  • Majority Vote and Inter-Annotator Agreement.
  • Dawid-Skene model.
  • CROWDLAB.

Q10: How does Majority Vote work? #

  • Assigns the label chosen by the majority of annotators.
  • Confidence is based on inter-annotator agreement.
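
A minimal sketch, assuming annotations are stored as a mapping from each example to the list of labels it received (a hypothetical structure):

```python
from collections import Counter

# Hypothetical multi-annotator labels: example id -> labels from different annotators.
multi_labels = {
    "ex_1": ["cat", "cat", "dog"],
    "ex_2": ["dog", "dog", "dog"],
    "ex_3": ["cat", "dog"],          # tie: majority vote is ambiguous here
}

for example, labels in multi_labels.items():
    counts = Counter(labels)
    # Ties are broken arbitrarily (by insertion order), illustrating the downside in Q11.
    consensus, votes = counts.most_common(1)[0]
    agreement = votes / len(labels)   # fraction of annotators agreeing = crude confidence
    print(f"{example}: consensus={consensus}, agreement={agreement:.2f}")
```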

Q11: What are downsides of Majority Vote? #

  • Ties are ambiguous.
  • Bad annotators carry the same weight as good ones.

Q12: What is the Dawid-Skene model? #

  • Models each annotator with a confusion matrix.
  • Uses Bayesian inference (often approximated with EM) to estimate consensus labels and annotator quality.
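
A compact EM sketch of this model (a class prior plus one confusion matrix per annotator), written from scratch for illustration; production implementations add more initialization and numerical safeguards.

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50, smoothing=0.01):
    """Estimate consensus labels and annotator confusion matrices via EM.

    labels: (n_examples, n_annotators) int array; -1 marks a missing annotation.
    Returns (posterior over true labels, per-annotator confusion matrices).
    """
    n_examples, n_annotators = labels.shape

    # Initialize the posterior with each example's empirical label frequencies
    # (a soft majority vote).
    posterior = np.full((n_examples, n_classes), 1.0 / n_classes)
    for i in range(n_examples):
        observed = labels[i][labels[i] >= 0]
        if observed.size:
            posterior[i] = np.bincount(observed, minlength=n_classes) / observed.size

    for _ in range(n_iter):
        # M-step: class prior and one confusion matrix per annotator.
        class_prior = (posterior.sum(axis=0) + smoothing) / (n_examples + n_classes * smoothing)
        confusion = np.full((n_annotators, n_classes, n_classes), smoothing)
        for j in range(n_annotators):
            for i in np.where(labels[:, j] >= 0)[0]:
                confusion[j, :, labels[i, j]] += posterior[i]
        confusion /= confusion.sum(axis=2, keepdims=True)

        # E-step: update the posterior over each example's true label.
        log_post = np.tile(np.log(class_prior), (n_examples, 1))
        for j in range(n_annotators):
            mask = labels[:, j] >= 0
            log_post[mask] += np.log(confusion[j][:, labels[mask, j]]).T
        log_post -= log_post.max(axis=1, keepdims=True)
        posterior = np.exp(log_post)
        posterior /= posterior.sum(axis=1, keepdims=True)

    return posterior, confusion

# Toy example: 4 examples, 3 annotators, 2 classes; -1 = not labeled.
toy_labels = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
    [1, -1, 1],
])
posterior, confusion = dawid_skene(toy_labels, n_classes=2)
print("consensus labels:", posterior.argmax(axis=1))
```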

Q13: What are limitations of Dawid-Skene? #

  • Requires strong assumptions.
  • Performs poorly when each example is labeled by only a few annotators.

Q14: What is CROWDLAB? #

  • Combines classifier predictions and annotator labels for better consensus.
  • Weights depend on model confidence and inter-annotator agreement.

Q15: How does CROWDLAB work? #

  • For examples labeled by few annotators, rely more on the classifier.
  • For examples labeled by many annotators, rely more on label agreement.

Q16: How are weights estimated in CROWDLAB? #

  • Based on annotator agreement rates and classifier accuracy normalized against a majority-class baseline.
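
The following is a simplified illustration of this weighted-combination idea, not the exact CROWDLAB algorithm: blend the classifier's predicted probabilities with the annotators' empirical label distribution, with weights standing in for the quality estimates described above. All values and names are hypothetical.

```python
import numpy as np

n_classes = 3

# Classifier predicted probabilities for two examples (hypothetical values).
pred_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.4, 0.2],
])

# Annotator labels per example; the second example has only one annotation.
annotator_labels = [
    [0, 0, 1],
    [2],
]

# Fixed quality weights for illustration; in CROWDLAB these are estimated from
# annotator agreement and classifier accuracy relative to a majority-class baseline.
w_classifier = 0.6
w_annotators = 1.0

for probs, labels in zip(pred_probs, annotator_labels):
    # Empirical distribution of the annotators' labels for this example.
    label_dist = np.bincount(labels, minlength=n_classes) / len(labels)
    # Fewer annotations -> lean more on the classifier; more -> lean on agreement.
    weights = np.array([w_classifier, w_annotators * len(labels)])
    weights = weights / weights.sum()
    consensus_probs = weights[0] * probs + weights[1] * label_dist
    print("consensus:", consensus_probs.argmax(), "probs:", np.round(consensus_probs, 2))
```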

Q17: What is the hands-on lab assignment for this lecture? #

  • Analyze a multi-annotator dataset and implement methods for estimating consensus labels and annotator quality.
