Data Quality, Labeling, and Weak Supervision in Clinical ML #
Q1: What does “Garbage In, Garbage Out” mean in machine learning? #
It means that no model, no matter how advanced, can compensate for poor-quality data. If your input data is noisy, biased, irrelevant, or mislabeled, your model will reflect those flaws.
✅ The choice of data and problem matters more than the algorithm itself.
Q2: Can large, rich datasets still be garbage? #
Yes — if the data is fundamentally flawed or based on faulty assumptions (like phrenology), more volume just means more noise.
📌 Data quality ≠ data quantity.
Q3: What makes clinical data especially tricky to label accurately? #
- Lack of standardized label definitions
- Evolving medical criteria (e.g., changing diagnostic thresholds for diabetes)
- Some labels (e.g., mortality) are easier to pin down
- Others (e.g., pneumonia, hypertension) require complex confirmation (labs, imaging, notes)
🧠 Medical label creation is hard, expensive, and often subjective.
Q4: How do we deal with label noise in practice? #
Label noise is inevitable, but manageable.
Strategies:
- Have domain experts label a subset for benchmarking
- Use multiple reviewers to estimate the disagreement rate (see the sketch below)
- Triangulate labels (e.g., combine ICD codes + meds + clinician notes)
📉 This reduces label noise but often shrinks dataset size.
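As a concrete illustration of the "multiple reviewers" strategy, here is a minimal sketch that estimates the disagreement rate on a doubly annotated subset and computes Cohen's kappa with scikit-learn. The reviewer labels are toy values invented for illustration, not real clinical data.

```python
# Minimal sketch: estimate label noise from a doubly annotated subset.
# The reviewer labels are toy values (1 = condition present, 0 = absent).
from sklearn.metrics import cohen_kappa_score

reviewer_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
reviewer_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Raw disagreement rate: a rough proxy for per-label noise
disagreement = sum(a != b for a, b in zip(reviewer_a, reviewer_b)) / len(reviewer_a)

# Cohen's kappa corrects the raw agreement rate for chance agreement
kappa = cohen_kappa_score(reviewer_a, reviewer_b)

print(f"Disagreement rate: {disagreement:.0%}")  # 20% on this toy data
print(f"Cohen's kappa:     {kappa:.2f}")
```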
Q5: Is it ever okay to use noisy labels? #
Surprisingly, yes — if you have enough data.
📈 A Stanford study found that models trained on noisier labels can match a clean-label baseline, given enough extra data:
- 90% label accuracy ≈ baseline performance with ~50% more data
- 85% label accuracy ≈ baseline performance with ~100% more data
✅ Rule of thumb (see the small helper sketched below):
- 10% noise → 1.5× more data
- 15% noise → 2× more data
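The rule of thumb is easy to encode. The helper below is a hypothetical sketch: the two anchor points come from the text, but the linear interpolation between them and the clamp at 1× are added assumptions, not results from the study.

```python
def extra_data_multiplier(label_noise: float) -> float:
    """Rough data-volume multiplier needed to offset label noise.

    Encodes only the rule of thumb above (10% noise -> 1.5x data,
    15% noise -> 2x data). Values in between are linearly interpolated,
    which is an added assumption, not a result from the study.
    """
    if not 0.0 <= label_noise <= 0.15:
        raise ValueError("rule of thumb only covers noise rates up to 15%")
    # Line through (0.10, 1.5) and (0.15, 2.0); clamp at 1x for low noise
    return max(1.0, 1.5 + (label_noise - 0.10) * 10)


print(extra_data_multiplier(0.10))  # 1.5
print(extra_data_multiplier(0.15))  # 2.0
```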
Q6: What is weak supervision? #
Weak supervision refers to learning from labels that are:
- Noisy
- Incomplete
- Imperfectly defined
This is common in healthcare due to:
- The cost of expert labeling
- The complexity of clinical truth
👨‍⚕️ That’s why labeling, which needs both domain experts and scalable strategies, is a key bottleneck in clinical ML (a labeling-function sketch follows below).
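One common way to put weak supervision into practice is to write several noisy heuristics ("labeling functions", in the style popularized by Snorkel) and combine their votes. The sketch below is hypothetical: the record fields, the antibiotic rule, and the note-negation rule are invented for illustration, not taken from a real labeling pipeline.

```python
# Hypothetical weak-supervision sketch: combine noisy heuristic "labeling
# functions" by majority vote into a single training label for pneumonia.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_icd_code(record):
    # ICD-10 codes starting with J18 (pneumonia, unspecified organism)
    return POSITIVE if any(c.startswith("J18") for c in record["icd_codes"]) else ABSTAIN

def lf_antibiotics(record):
    # An antibiotic order is weak positive evidence
    return POSITIVE if record["on_antibiotics"] else ABSTAIN

def lf_note_negation(record):
    # Explicit negation in the clinical note is negative evidence
    return NEGATIVE if "no evidence of pneumonia" in record["note"].lower() else ABSTAIN

def weak_label(record, lfs=(lf_icd_code, lf_antibiotics, lf_note_negation)):
    votes = [v for v in (lf(record) for lf in lfs) if v != ABSTAIN]
    pos, neg = votes.count(POSITIVE), votes.count(NEGATIVE)
    if pos > neg:
        return POSITIVE
    if neg > pos:
        return NEGATIVE
    return ABSTAIN  # no evidence, or a tie

record = {
    "icd_codes": ["J18.9", "I10"],
    "on_antibiotics": True,
    "note": "CXR consistent with right lower lobe pneumonia.",
}
print(weak_label(record))  # 1 (POSITIVE)
```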
Q7: If training data is noisy, what about the test set? #
The test set must be as clean as possible.
Why?
- If test labels are noisy, your evaluation metrics will be inaccurate
- Noisy test labels typically understate a good model’s true performance, leading to incorrect conclusions (see the toy simulation below)
📌 Training set: can handle some noise.
📌 Test set: must approximate gold-standard ground truth.
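To see why noisy test labels distort evaluation, here is a toy simulation: a model that is right 90% of the time is scored against gold-standard labels and against a copy of those labels with 15% random flips. All numbers are synthetic and only illustrate the direction of the bias.

```python
# Toy simulation: the same predictions score differently against gold-standard
# and noisy test labels. All numbers are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

y_true = rng.integers(0, 2, size=n)          # gold-standard test labels
is_correct = rng.random(n) < 0.90            # model is right 90% of the time
y_pred = np.where(is_correct, y_true, 1 - y_true)

flipped = rng.random(n) < 0.15               # 15% of test labels are wrong
y_noisy = np.where(flipped, 1 - y_true, y_true)

print("Accuracy vs gold labels: ", (y_pred == y_true).mean())   # ~0.90
print("Accuracy vs noisy labels:", (y_pred == y_noisy).mean())  # ~0.78
```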
🔑 Final Takeaways #
| Principle | Why It Matters |
|---|---|
| Garbage In, Garbage Out | Bad input = bad model, no matter the algorithm |
| Labels ≠ Truth | Always validate how close your labels are to clinical reality |
| More Data ≠ Better Data | Large, outdated, or noisy datasets can harm performance |
| Weak Supervision Works (With Scale) | Noisy labels can be offset by higher volume |
| Test Set Must Be Clean | Final evaluation must reflect ground truth |