Missing Data Scenarios in Healthcare Modeling
1. Should Be Measured But Wasn’t
- Description: The value is expected but is missing due to random or procedural issues (e.g., lab error, missed test).
- Technical Term:
  - MCAR: Missing Completely At Random
  - MAR: Missing At Random
- Example: A routine blood test wasn’t recorded because the sample was lost.
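- Strategy:
  - Impute (mean, median, or model-based).
  - Add a missingness indicator variable (e.g., `var_missing = 1`).
- Rationale: Under MCAR the missingness is unrelated to any value; under MAR it depends only on observed variables. In both cases it does not depend on the missing value itself, so imputation is relatively safe.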
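
The strategy above (impute, then flag) can be sketched as follows. This is a minimal illustration in plain Python with no third-party dependencies; in practice a dataframe library would likely be used. The function name `impute_with_indicator`, the use of `None` as the missing marker, and the potassium readings are all hypothetical choices made for the example, not part of the original notes.

```python
from statistics import median


def impute_with_indicator(values):
    """Median-impute a list in which None marks a missing value.

    Returns (imputed_values, missing_flags). A flag is 1 where the
    original entry was missing and 0 otherwise.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return imputed, flags


# Hypothetical potassium readings (mmol/L); None means the sample was lost.
potassium = [4.1, None, 3.9, 4.4, None, 4.0]
potassium_imputed, potassium_missing = impute_with_indicator(potassium)
```

The flags are computed from the original values, before anything is filled in, so `potassium_missing` plays the role of `var_missing` and keeps the fact of missingness available to a downstream model.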
2. Mostly Zero Due to Rare Occurrence
- Description: Not truly missing — the value is zero or absent for most patients because the condition/event is rare.
- Technical Term:
  - Not Missing (no abbreviation needed)
- Example: HIV diagnosis column is `0` for most patients.
- Strategy:
  - Do not impute: the 0s are meaningful and reflect true absence.
- Rationale: These are real values, and zeros carry clinical meaning.
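
Before deciding that a sparse column falls into this case, it can help to confirm that the zeros are recorded values rather than gaps. The sketch below assumes a hypothetical binary column `hiv_dx` in which `None` would mark a genuinely missing entry; the helper name `summarize_binary_flag` is likewise illustrative.

```python
def summarize_binary_flag(values):
    """Count positives, true zeros, and genuinely missing entries (None)."""
    positives = sum(1 for v in values if v == 1)
    zeros = sum(1 for v in values if v == 0)
    missing = sum(1 for v in values if v is None)
    return {"positives": positives, "zeros": zeros, "missing": missing}


# Hypothetical diagnosis flags for ten patients.
hiv_dx = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
summary = summarize_binary_flag(hiv_dx)

# The column is left untouched; prevalence is read directly from it.
prevalence = summary["positives"] / len(hiv_dx)
```

If `summary["missing"]` is zero, the column is simply sparse and there is nothing to impute.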
3. Deliberately Not Recorded
- Description: Clinician or system chooses not to record a value based on context (e.g., patient clearly stable or too ill).
- Technical Term:
  - MNAR: Missing Not At Random
- Example: Sodium level not tested because the patient was clearly stable.
- Strategy:
  - Avoid naive imputation where possible, since it may introduce bias.
  - Use models that handle missing values natively (e.g., tree-based models such as XGBoost or LightGBM).
  - Consider adding a missingness indicator.
- Rationale: The missingness depends on the unobserved value and may carry predictive signal.
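
A small simulation can make the rationale above concrete. The sketch below is an assumption-laden toy, not a clinical model: the Gaussian sodium distribution, the 135-145 mmol/L cut-offs, and the testing probabilities (95% for abnormal values, 20% for normal ones) are all invented for illustration. It shows that the recorded values over-represent abnormal results, and that the untested group is almost entirely normal.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the simulation is repeatable

# Simulated "true" sodium values (mmol/L) for 5000 hypothetical patients.
true_sodium = [random.gauss(140, 5) for _ in range(5000)]


def is_abnormal(value):
    # Assumed cut-offs for the example.
    return value < 135 or value > 145


def is_tested(value):
    # Assumed MNAR mechanism: abnormal values are almost always tested,
    # while stable-looking patients are tested only 20% of the time.
    return random.random() < (0.95 if is_abnormal(value) else 0.20)


recorded = [v if is_tested(v) else None for v in true_sodium]


def abnormal_rate(values):
    return mean(1.0 if is_abnormal(v) else 0.0 for v in values)


observed = [v for v in recorded if v is not None]
untested = [t for t, r in zip(true_sodium, recorded) if r is None]

true_rate = abnormal_rate(true_sodium)    # what we would like to know
observed_rate = abnormal_rate(observed)   # what the recorded data suggest
untested_rate = abnormal_rate(untested)   # hidden truth among the untested

# Indicator that a model can use directly: 1 = sodium was not tested.
sodium_missing = [1 if r is None else 0 for r in recorded]
```

Under these assumptions `observed_rate` comes out well above `true_rate`, so filling gaps from the observed distribution would overstate how abnormal the untested patients are. `untested_rate` is close to zero, which is why `sodium_missing` by itself is informative.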
Summary Table
| Case | Description | Abbreviation | Impute? | Extra Notes |
|---|---|---|---|---|
| 1 | Should be measured but wasn’t | MCAR / MAR | ✅ Yes | Add indicator if signal is likely |
| 2 | Mostly zero (rare condition) | Not Missing | 🚫 No | Keep as is; zeros are informative |
| 3 | Deliberately not recorded | MNAR | ⚠️ Caution | Use native handling + possible indicator |