𧬠Clinical Text Feature Extraction Using Dictionary-Based Filtering #
This guide demonstrates a simplified approach for processing clinical text without removing PHI directly. Instead, it extracts only medical terms from a predefined dictionary (simulated knowledge graph), which passively excludes PHI and enables downstream analyses.
β Objective #
- Extract present, positive mentions of clinical concepts (e.g., diseases, symptoms, drugs).
- Avoid mentions that are negated or refer to historical/family context.
- Demonstrate the principle: “Keep only medical terms” as an alternative to direct PHI removal.
π§Ύ Input Example #
Patient complains of chest pain. No signs of pneumonia. History of diabetes mellitus.
Prescribed metformin. Mother had breast cancer.
π§ Procedure Overview #
- Define a medical term dictionary (simulating a knowledge graph).
- Split the clinical note into sentences.
- Ignore sentences with negation or irrelevant context.
- Match and extract terms from the dictionary.
- Output structured features for downstream use.
π§ͺ Code Implementation (Python) #
import re
# 1. Simulated clinical note
clinical_note = '''
Patient complains of chest pain. No signs of pneumonia. History of diabetes mellitus.
Prescribed metformin. Mother had breast cancer.
'''
# 2. Simulated knowledge graph (medical term dictionary)
medical_terms = {
"chest pain": "symptom",
"pneumonia": "disease",
"diabetes mellitus": "disease",
"metformin": "drug",
"breast cancer": "disease"
}
# 3. Split into sentences
sentences = re.split(r'\.\s*', clinical_note.strip())
features = []
# 4. Process each sentence
for sentence in sentences:
sentence_lower = sentence.lower()
# 5. Skip negated or historical context
if "no " in sentence_lower or "history of" in sentence_lower or "mother had" in sentence_lower:
continue
# 6. Match medical terms
for term in medical_terms:
if term in sentence_lower:
features.append({
"term": term,
"type": medical_terms[term],
"sentence": sentence.strip()
})
# 7. Output extracted features
for feature in features:
print(f"Found {feature['type']} β '{feature['term']}' in: \"{feature['sentence']}\"")
π€ Sample Output #
Found symptom β 'chest pain' in: "Patient complains of chest pain"
Found drug β 'metformin' in: "Prescribed metformin"
π Summary #
This method:
- Avoids direct PHI detection
- Extracts useful clinical concepts only
- Can be adapted to larger vocabularies and real NLP tools (e.g., spaCy, scispaCy, NegEx)
Perfect for research scenarios where structured clinical features are needed but full de-identification is too complex.