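🧬 Clinical Text Feature Extraction Using Dictionary-Based Filtering

This guide demonstrates a simplified approach to processing clinical text without detecting or removing protected health information (PHI) directly. Instead, it extracts only the medical terms listed in a predefined dictionary (a simulated knowledge graph). Anything not in the dictionary, including names, dates, and identifiers, never reaches the output, so PHI is excluded as a side effect and the extracted terms can feed downstream analyses.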

✅ Objective

Extract present, positive mentions of clinical concepts (e.g., diseases, symptoms, drugs).
Skip mentions that are negated or that refer to past medical history or family history.
Demonstrate the principle: “Keep only medical terms” as an alternative to direct PHI removal.
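
The "keep only medical terms" principle can be sketched in a few lines. This is a minimal illustration, not a de-identification tool: the note, the names in it, and the `keep_only_terms` helper are all invented for the example. The point is that the name and date in the input never appear in the output, because only dictionary entries are emitted.

```python
import re

# Hypothetical dictionary and note, invented for this illustration.
MEDICAL_TERMS = {"chest pain": "symptom", "metformin": "drug"}
note = "John Smith (DOB 01/02/1960) reports chest pain. Dr. Lee prescribed metformin."

def keep_only_terms(text, terms):
    """Return the dictionary terms found in text, ordered by first appearance."""
    lowered = text.lower()
    hits = []
    for term in terms:
        # Word boundaries avoid matching a term inside a longer word.
        match = re.search(rf"\b{re.escape(term)}\b", lowered)
        if match:
            hits.append((match.start(), term))
    return [term for _, term in sorted(hits)]

extracted = keep_only_terms(note, MEDICAL_TERMS)
print(extracted)  # ['chest pain', 'metformin']
```

The exclusion holds only for what is emitted. Any raw text carried alongside the terms, such as the source sentence, can still contain PHI.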

🧾 Input Example

Patient complains of chest pain. No signs of pneumonia. History of diabetes mellitus.
Prescribed metformin. Mother had breast cancer.

🧠 Procedure Overview

Define a medical term dictionary (simulating a knowledge graph).
Split the clinical note into sentences.
Skip sentences that contain negation, past-history, or family-history cues.
Match and extract terms from the dictionary.
Output structured features for downstream use.
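
The matching step deserves a closer look once the dictionary contains overlapping entries, such as "diabetes" and "diabetes mellitus". One possible approach, sketched below under that assumption, compiles all terms into a single regular expression with the longest terms first, so the more specific term wins. The `TERMS` dictionary and the `build_matcher` and `match_terms` helpers are hypothetical names introduced for this example.

```python
import re

# Hypothetical dictionary with overlapping entries, invented for illustration.
TERMS = {
    "diabetes": "disease",
    "diabetes mellitus": "disease",
    "chest pain": "symptom",
}

def build_matcher(terms):
    """Compile one alternation, longest term first, so longer terms win."""
    ordered = sorted(terms, key=len, reverse=True)
    pattern = "|".join(re.escape(t) for t in ordered)
    return re.compile(rf"\b(?:{pattern})\b", re.IGNORECASE)

def match_terms(sentence, terms):
    """Return (term, type) pairs in the order they appear in the sentence."""
    matcher = build_matcher(terms)
    return [
        (m.group(0).lower(), terms[m.group(0).lower()])
        for m in matcher.finditer(sentence)
    ]

print(match_terms("Diabetes mellitus with chest pain", TERMS))
# [('diabetes mellitus', 'disease'), ('chest pain', 'symptom')]
```

Because the regex engine tries alternatives left to right, ordering by length means "diabetes" is reported only when "mellitus" does not follow it.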

🧪 Code Implementation (Python)

import re

# 1. Simulated clinical note
clinical_note = '''
Patient complains of chest pain. No signs of pneumonia. History of diabetes mellitus.
Prescribed metformin. Mother had breast cancer.
'''

# 2. Simulated knowledge graph (medical term dictionary)
medical_terms = {
    "chest pain": "symptom",
    "pneumonia": "disease",
    "diabetes mellitus": "disease",
    "metformin": "drug",
    "breast cancer": "disease"
}

# Cues marking negated, historical, or family context. The \b word boundaries
# stop "no" from matching the tail of another word (e.g. "amino acids").
exclusion_cues = re.compile(r"\b(?:no|history of|mother had)\b")

# 3. Split into sentences on a period followed by whitespace or end of text,
#    so decimals such as "2.5" stay intact; drop empty pieces
sentences = [s.strip() for s in re.split(r'\.(?:\s+|$)', clinical_note.strip()) if s.strip()]
features = []

# 4. Process each sentence
for sentence in sentences:
    sentence_lower = sentence.lower()

    # 5. Skip negated or historical context
    if exclusion_cues.search(sentence_lower):
        continue

    # 6. Match medical terms as whole words or phrases
    for term, term_type in medical_terms.items():
        if re.search(rf"\b{re.escape(term)}\b", sentence_lower):
            features.append({
                "term": term,
                "type": term_type,
                "sentence": sentence  # raw text: may still contain PHI
            })

# 7. Output extracted features
for feature in features:
    print(f"Found {feature['type']} → '{feature['term']}' in: \"{feature['sentence']}\"")

📤 Sample Output

Found symptom → 'chest pain' in: "Patient complains of chest pain"
Found drug → 'metformin' in: "Prescribed metformin"
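
For downstream use, the extracted features usually need a fixed shape. The sketch below shows one way to turn them into a per-note presence row plus counts by type. The `features` list is copied from the sample output above; the `vocabulary` ordering and the `to_feature_row` helper are assumptions made for this example.

```python
from collections import Counter

# Same shape as the `features` list built by the script; values taken from
# the sample output.
features = [
    {"term": "chest pain", "type": "symptom", "sentence": "Patient complains of chest pain"},
    {"term": "metformin", "type": "drug", "sentence": "Prescribed metformin"},
]

# Hypothetical fixed column order, here simply the dictionary's terms.
vocabulary = ["chest pain", "pneumonia", "diabetes mellitus", "metformin", "breast cancer"]

def to_feature_row(features, vocabulary):
    """Map each vocabulary term to 1 if it was extracted from the note, else 0."""
    present = {f["term"] for f in features}
    return {term: int(term in present) for term in vocabulary}

row = to_feature_row(features, vocabulary)
type_counts = Counter(f["type"] for f in features)

print(row)
# {'chest pain': 1, 'pneumonia': 0, 'diabetes mellitus': 0, 'metformin': 1, 'breast cancer': 0}
print(dict(type_counts))
# {'symptom': 1, 'drug': 1}
```

Neither output carries the `sentence` field, so no free text from the note travels with the features.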

📌 Summary

This method:

Avoids direct PHI detection
Extracts useful clinical concepts only
Can be adapted to larger vocabularies and real NLP tools (e.g., spaCy, scispaCy, NegEx)
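
As one example of such an adaptation, skipping a whole sentence whenever it contains a negation cue also discards any positive findings in that sentence. The sketch below limits negation to a scope that runs from the cue to the next terminating word. It is loosely inspired by the idea behind NegEx but is not the NegEx algorithm and does not use its trigger lists; the cue lists and the `is_negated` helper are hypothetical and deliberately tiny.

```python
import re

# Hypothetical cue lists, far smaller than real negation trigger sets.
NEGATION_CUES = ["no", "denies", "without"]
SCOPE_TERMINATORS = ["but", "however"]

CUE_RE = re.compile(r"\b(?:" + "|".join(NEGATION_CUES) + r")\b")
END_RE = re.compile(r"\b(?:" + "|".join(SCOPE_TERMINATORS) + r")\b")

def is_negated(sentence, term):
    """True if the term's first occurrence lies between a cue and the next terminator."""
    lowered = sentence.lower()
    term_match = re.search(rf"\b{re.escape(term)}\b", lowered)
    if term_match is None:
        return False
    for cue in CUE_RE.finditer(lowered):
        if cue.end() > term_match.start():
            continue  # this cue comes after the term, so it cannot negate it
        terminator = END_RE.search(lowered, cue.end())
        scope_end = terminator.start() if terminator else len(lowered)
        if term_match.start() < scope_end:
            return True
    return False

sentence = "No signs of pneumonia but reports chest pain"
print(is_negated(sentence, "pneumonia"))   # True
print(is_negated(sentence, "chest pain"))  # False
```

With scoped negation, "chest pain" in this sentence is kept, whereas a sentence-level skip would drop it along with "pneumonia".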

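This makes it a reasonable fit for research scenarios where structured clinical features are needed and full de-identification is impractical. The exclusion covers only the emitted terms: the sentence field stored by the script holds raw note text, so drop it before the features leave a protected environment.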