Data Quality, Labeling, and Weak Supervision in Clinical ML #


Q1: What does “Garbage In, Garbage Out” mean in machine learning? #

It means that no model, no matter how advanced, can compensate for poor-quality data. If your input data is noisy, biased, irrelevant, or mislabeled, your model will reflect those flaws.

✅ The choice of data and problem matters more than the algorithm itself.


Q2: Can large, rich datasets still be garbage? #

Yes — if the data is fundamentally flawed or based on faulty assumptions (like phrenology), more volume just means more noise.

📌 Data quality ≠ data quantity.


Q3: What makes clinical data especially tricky to label accurately? #

  • Lack of standardized label definitions
  • Evolving medical criteria (e.g., changing diabetes thresholds)
  • Some labels (e.g., mortality) are easier to pin down
  • Others (e.g., pneumonia, hypertension) require complex confirmation (labs, imaging, notes)

🧠 Medical label creation is hard, expensive, and often subjective.


Q4: How do we deal with label noise in practice? #

Label noise is inevitable, but manageable.

Strategies:

  • Have domain experts label a subset for benchmarking
  • Use multiple reviewers to estimate the disagreement rate (see the sketch below)
  • Triangulate labels (e.g., combine ICD codes + meds + clinician notes)

📉 This reduces label noise but often shrinks dataset size.
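
One way to make the multiple-reviewer strategy concrete: have two reviewers independently label the same benchmark subset and measure how often they differ. That disagreement rate gives a rough sense of how noisy the labels are overall. A minimal sketch, assuming hypothetical `reviewer_1` / `reviewer_2` columns and toy data:

```python
# Minimal sketch: estimate label noise from inter-reviewer disagreement.
# Column names and toy data are illustrative placeholders.
import pandas as pd

def disagreement_rate(df: pd.DataFrame, col_a: str, col_b: str) -> float:
    """Fraction of cases where two reviewers assign different labels."""
    return float((df[col_a] != df[col_b]).mean())

# Small benchmark subset: 1 = condition present, 0 = absent
benchmark = pd.DataFrame({
    "reviewer_1": [1, 0, 1, 1, 0, 1],
    "reviewer_2": [1, 0, 0, 1, 0, 1],
})

rate = disagreement_rate(benchmark, "reviewer_1", "reviewer_2")
print(f"Reviewer disagreement: {rate:.1%}")  # rough proxy for label noise
```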


Q5: Is it ever okay to use noisy labels? #

Surprisingly, yes — if you have enough data.

📈 A Stanford study found that noisy labels can be offset with more data:

  • 90% label accuracy ≈ baseline performance, given ~50% more data
  • 85% label accuracy ≈ baseline performance, given ~100% more data

Rule of thumb:

  • 10% noise → 1.5× more data
  • 15% noise → 2× more data
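
The rule of thumb is easy to turn into a back-of-envelope calculator. The linear interpolation below is an assumption for illustration; the two anchor points are the only numbers taken from the rule above.

```python
def extra_data_multiplier(noise_rate: float) -> float:
    """Rough multiplier on dataset size needed to offset label noise.

    Interpolates the rule of thumb above (10% noise -> 1.5x, 15% noise -> 2x);
    clamped below at 1.0 and only meaningful for modest noise levels.
    """
    multiplier = 0.5 + 10.0 * noise_rate  # line through (0.10, 1.5) and (0.15, 2.0)
    return max(1.0, multiplier)

for noise in (0.05, 0.10, 0.15):
    print(f"{noise:.0%} label noise -> ~{extra_data_multiplier(noise):.1f}x data")
```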

Q6: What is weak supervision? #

Weak supervision refers to learning from labels that are:

  • Noisy
  • Incomplete
  • Imperfectly defined

This is common in healthcare due to:

  • The cost of expert labeling
  • The complexity of clinical truth

👨‍⚕️ That’s why expert labeling time is a key bottleneck in clinical ML, and why scalable labeling strategies matter so much.
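
In practice, weak supervision often means writing simple labeling heuristics over whatever signals are available (diagnosis codes, medication lists, note text) and combining their votes into a single programmatic label. A minimal sketch in plain Python, assuming a hypothetical diabetes phenotype and illustrative field names; frameworks such as Snorkel generalize this idea by learning how much to trust each heuristic.

```python
# Minimal weak-supervision sketch: combine noisy labeling heuristics by vote.
# Field names, codes, and medication lists are hypothetical placeholders.
from dataclasses import dataclass, field

POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

@dataclass
class PatientRecord:
    icd_codes: set = field(default_factory=set)
    medications: set = field(default_factory=set)
    note_text: str = ""

def lf_icd(rec: PatientRecord) -> int:
    """Weak signal: a type 2 diabetes ICD-10 code (E11.x) was recorded."""
    return POSITIVE if any(c.startswith("E11") for c in rec.icd_codes) else ABSTAIN

def lf_meds(rec: PatientRecord) -> int:
    """Weak signal: a diabetes medication appears on the med list."""
    return POSITIVE if {"metformin", "insulin"} & rec.medications else ABSTAIN

def lf_notes(rec: PatientRecord) -> int:
    """Weak signal: the clinical note mentions (or rules out) diabetes."""
    text = rec.note_text.lower()
    if "no history of diabetes" in text:
        return NEGATIVE
    return POSITIVE if "diabetes" in text else ABSTAIN

def weak_label(rec: PatientRecord) -> int:
    """Majority vote over non-abstaining heuristics; ties break to NEGATIVE."""
    votes = [lf(rec) for lf in (lf_icd, lf_meds, lf_notes)]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return POSITIVE if sum(v == POSITIVE for v in votes) > len(votes) / 2 else NEGATIVE

record = PatientRecord(icd_codes={"E11.9"}, medications={"metformin"},
                       note_text="Follow-up visit for type 2 diabetes.")
print(weak_label(record))  # -> 1 (labeled positive by triangulated weak signals)
```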


Q7: If training data is noisy, what about the test set? #

The test set must be as clean as possible.
Why?

  • If test labels are noisy, your evaluation metrics will be inaccurate
  • Noise can make correct predictions look wrong, underestimating model performance and leading to incorrect conclusions

📌 Training set: can handle some noise.
📌 Test set: must approximate gold-standard ground truth.
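
This is easy to demonstrate with a toy simulation: corrupt a fraction of the test labels and watch the reported accuracy fall below the model's true accuracy. All numbers below (90% model accuracy, 10% label noise) are illustrative assumptions.

```python
# Toy simulation: how noisy test labels distort the reported accuracy.
import random

random.seed(0)
n = 10_000
true_labels = [random.randint(0, 1) for _ in range(n)]

# Pretend the model is 90% accurate against the true labels.
preds = [y if random.random() < 0.90 else 1 - y for y in true_labels]

# Corrupt 10% of the *test* labels to simulate a noisy evaluation set.
noisy_labels = [1 - y if random.random() < 0.10 else y for y in true_labels]

true_acc = sum(p == y for p, y in zip(preds, true_labels)) / n
measured_acc = sum(p == y for p, y in zip(preds, noisy_labels)) / n

print(f"Accuracy vs. clean test labels: {true_acc:.3f}")    # ~0.90
print(f"Accuracy vs. noisy test labels: {measured_acc:.3f}")  # ~0.82, understated
```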


🔑 Final Takeaways #

  • Garbage In, Garbage Out: bad input = bad model, no matter the algorithm
  • Labels ≠ Truth: always validate how close your labels are to clinical reality
  • More Data ≠ Better Data: large, outdated, or noisy datasets can harm performance
  • Weak Supervision Works (With Scale): noisy labels can be offset by higher volume
  • Test Set Must Be Clean: final evaluation must reflect gold-standard ground truth