Missing Data Scenarios in Healthcare Modeling

1. Should Be Measured But Wasn’t #

  • Description: The value is expected but is missing due to random or procedural issues (e.g., lab error, missed test).
  • Technical Term:
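    • MCAR: Missing Completely At Random (missingness is unrelated to any patient data, observed or unobserved)
    • MAR: Missing At Random (missingness depends only on variables that were observed, not on the missing value itself)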
  • Example: A routine blood test wasn’t recorded because the sample was lost.
  • Strategy:
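    • Impute: mean or median is a reasonable baseline under MCAR; under MAR, prefer model-based imputation that conditions on the observed variables.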
    • Add a missingness indicator variable (e.g., var_missing = 1).
  • Rationale: The missingness is unrelated to the unobserved value itself (under MAR, once the observed variables are accounted for), so imputation is relatively safe.
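
The strategy for this case can be sketched with pandas. This is a minimal illustration, not a prescribed pipeline: the `labs` table, the `sodium` column, and its values are hypothetical, and median imputation stands in for whichever imputer suits the data.

```python
import numpy as np
import pandas as pd

# Hypothetical lab values; NaN marks a result that was expected but never recorded.
labs = pd.DataFrame({"sodium": [138.0, np.nan, 141.0, 135.0, np.nan, 144.0]})

# Record the missingness before imputing, otherwise that information is lost.
labs["sodium_missing"] = labs["sodium"].isna().astype(int)

# Median imputation: a simple baseline that is reasonable under MCAR.
sodium_median = labs["sodium"].median()
labs["sodium_imputed"] = labs["sodium"].fillna(sodium_median)

print(labs)
```

Under MAR, the median step would typically be swapped for a model-based imputer fitted on the observed covariates; the indicator column stays the same either way.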

2. Mostly Zero Due to Rare Occurrence #

  • Description: Not truly missing — the value is zero or absent for most patients because the condition/event is rare.
  • Technical Term:
    • Not Missing (No abbreviation needed)
  • Example: HIV diagnosis column is 0 for most patients.
  • Strategy:
    • Do not impute — the 0s are meaningful and reflect true absence.
  • Rationale: These are real values, and zeros carry clinical meaning.
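
A quick check along these lines can confirm that a sparse column belongs to this case rather than hiding true gaps. The `patients` table and the `hiv_dx` column are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical diagnosis flags: 1 = diagnosis present, 0 = confirmed absent.
patients = pd.DataFrame({"hiv_dx": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]})

n_missing = int(patients["hiv_dx"].isna().sum())  # true gaps (NaN), if any
prevalence = float(patients["hiv_dx"].mean())     # share of patients with the diagnosis

# Nothing to impute: every row holds a real value, and most of them are 0.
print(f"missing={n_missing}, prevalence={prevalence:.2f}")
```

If `n_missing` were nonzero, those rows would be genuine gaps and would fall under one of the other cases; the zeros themselves are left untouched.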

3. Deliberately Not Recorded #

  • Description: A clinician or system chooses not to record a value based on context (e.g., the patient is clearly stable or too ill).
  • Technical Term:
    • MNAR: Missing Not At Random
  • Example: Sodium level not tested because the patient was clearly stable.
  • Strategy:
    • Avoid imputation if possible — it may introduce bias.
    • Use models that handle missingness natively (e.g., tree-based learners such as XGBoost or LightGBM, which accept missing values directly instead of requiring a filled-in number).
    • Consider adding a missingness indicator.
  • Rationale: The missingness depends on the unobserved value and may carry predictive signal.
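
One way to see whether the missingness carries signal is to keep the gaps as they are, add an indicator, and compare outcomes between tested and untested patients. The sketch below assumes a hypothetical `cohort` table with made-up `sodium` and `adverse_event` values.

```python
import numpy as np
import pandas as pd

# Hypothetical cohort: sodium is left untested (NaN) for some patients.
cohort = pd.DataFrame({
    "sodium": [np.nan, 129.0, np.nan, 148.0, np.nan, 133.0, np.nan, 151.0],
    "adverse_event": [0, 1, 0, 1, 0, 0, 0, 1],
})

# Keep the NaNs untouched and add an explicit indicator alongside them.
cohort["sodium_missing"] = cohort["sodium"].isna().astype(int)

# Compare outcome rates between tested (0) and untested (1) patients.
event_rate = cohort.groupby("sodium_missing")["adverse_event"].mean()
print(event_rate)
```

A clear gap between the two rates suggests the decision not to test is itself informative. Both `sodium` (with its NaNs) and `sodium_missing` can then be passed to a learner that accepts missing values, with no imputation step in between.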

Summary Table #

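| Case | Description                    | Abbreviation | Impute?    | Extra Notes                              |
|------|--------------------------------|--------------|------------|------------------------------------------|
| 1    | Should be measured but wasn’t  | MCAR / MAR   | ✅ Yes     | Add indicator if signal is likely        |
| 2    | Mostly zero (rare condition)   | Not Missing  | 🚫 No      | Keep as is; zeros are informative        |
| 3    | Deliberately not recorded      | MNAR         | ⚠️ Caution | Use native handling + possible indicator |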