Clinical Text Mining Pipeline (Steps 1–5)
This document outlines a high-level clinical text mining pipeline using knowledge graphs, NLP, and structured indexing. The goal is to extract, enrich, and analyze clinical concepts from raw EMR text.
Step 1: Preprocessing Clinical Documents
Goal: Prepare and normalize clinical notes for processing.
Tools: Text cleaning, sentence segmentation, tokenizer.
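# Example: clean the note and split it into sentences
import re
clinical_note = "Pt c/o chest pain. No signs of pneumonia. History of stroke. Prescribed metformin."
# Lowercase, split on periods, and drop the empty string left after the final period
sentences = [s.strip() for s in re.split(r'\.\s*', clinical_note.lower()) if s.strip()]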
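The goal above also mentions normalization, which the one-line split does not cover. The following is a minimal sketch of that step, assuming a small hand-written abbreviation map (`ABBREVIATIONS`, `normalize_note`, and `split_sentences` are hypothetical names, not part of any library); a real pipeline would likely rely on a curated clinical lexicon and a trained sentence segmenter.

```python
import re

# Hypothetical abbreviation map; a real pipeline would use a curated clinical lexicon.
ABBREVIATIONS = {
    "pt": "patient",
    "c/o": "complains of",
    "hx": "history",
}

def normalize_note(text):
    """Lowercase the note, collapse whitespace, and expand known abbreviations."""
    text = re.sub(r"\s+", " ", text.lower()).strip()
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split(" ")]
    return " ".join(tokens)

def split_sentences(text):
    """Split on sentence-ending punctuation followed by whitespace or end of text."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text)
    return [p.strip() for p in parts if p.strip()]

note = "Pt c/o chest pain.  No signs of pneumonia. Hx of stroke."
normalized_sentences = split_sentences(normalize_note(note))
print(normalized_sentences)
```

Requiring whitespace (or end of text) after the punctuation keeps decimal values such as dosages from being split mid-number, which a bare split on periods would do.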
Step 2: Extract Terms Using Knowledge Graph + NLP
Goal: Identify medical terms using a knowledge graph, then drop mentions that are ambiguous, negated, or not about the patient's current state (for example, past history).
Tools: Knowledge Graph (e.g., UMLS), NegEx, ConText
# Simulated medical term dictionary (knowledge graph-based)
medical_terms = {"chest pain": "symptom", "pneumonia": "disease", "stroke": "disease", "metformin": "drug"}
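# Skip sentences with a negation or history cue (a crude stand-in for NegEx/ConText)
filtered_mentions = []
for s in sentences:
    # Word-boundary match so that words merely ending in "no" are not treated as negation
    if re.search(r"\b(no|history of)\b", s):
        continue
    for term in medical_terms:
        if term in s:
            filtered_mentions.append(term)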
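Dropping whole sentences loses affirmed terms that share a sentence with a negated one. The sketch below shows a NegEx-inspired alternative that labels each term by whether it falls inside a trigger's forward window. It is a simplified illustration, not the published NegEx or ConText algorithm: the trigger lists, the `WINDOW` size, and the function names are all assumptions made for this example.

```python
import re

# Illustrative trigger lists; the published NegEx and ConText lexicons are far larger.
NEGATION_TRIGGERS = ["no", "denies", "without"]
HISTORY_TRIGGERS = ["history of", "h/o"]
WINDOW = 5  # assumed scope: number of tokens after a trigger that it affects

def trigger_scopes(tokens, triggers):
    """Return the token positions covered by any trigger's forward window."""
    covered = set()
    for trig in triggers:
        trig_tokens = trig.split()
        n = len(trig_tokens)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == trig_tokens:
                covered.update(range(i + n, min(len(tokens), i + n + WINDOW)))
    return covered

def classify_mentions(sentence, terms):
    """Label each matched term as 'affirmed', 'negated', or 'historical'."""
    tokens = re.findall(r"[a-z0-9/]+", sentence.lower())
    negated = trigger_scopes(tokens, NEGATION_TRIGGERS)
    historical = trigger_scopes(tokens, HISTORY_TRIGGERS)
    results = {}
    for term in terms:
        term_tokens = term.split()
        n = len(term_tokens)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == term_tokens:
                if i in negated:
                    results[term] = "negated"
                elif i in historical:
                    results[term] = "historical"
                else:
                    results[term] = "affirmed"
    return results

labels = classify_mentions(
    "Pt c/o chest pain but no fever or cough", ["chest pain", "fever", "cough"]
)
print(labels)
```

Only terms labeled `affirmed` would then be passed on to indexing; the sentence-level filter would have discarded "chest pain" here along with the negated symptoms.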
Step 3: Index Positive, Present Mentions
Goal: Store structured, filtered term mentions for later search.
Tools: JSON/DB-based indexing, storing patient-term mappings.
indexed_mentions = [
    {"patient_id": 1, "term": "chest pain"},
    {"patient_id": 1, "term": "metformin"},
]
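For search, a flat list of mentions is usually turned into a term-to-patients lookup. Here is a minimal sketch of such an inverted index with a JSON round trip, assuming the same record shape as above; `build_inverted_index` is a hypothetical helper and patient 2 is invented for illustration.

```python
import json
from collections import defaultdict

# Hypothetical mention records, in the same shape as the example above.
mentions = [
    {"patient_id": 1, "term": "chest pain"},
    {"patient_id": 1, "term": "metformin"},
    {"patient_id": 2, "term": "chest pain"},
]

def build_inverted_index(records):
    """Map each term to the sorted list of patient IDs that mention it."""
    index = defaultdict(set)
    for rec in records:
        index[rec["term"]].add(rec["patient_id"])
    # Convert sets to sorted lists so the index is JSON-serializable and stable
    return {term: sorted(pids) for term, pids in index.items()}

term_index = build_inverted_index(mentions)
serialized = json.dumps(term_index, sort_keys=True)
restored = json.loads(serialized)
print(restored["chest pain"])
```

Keying the index by term keeps query-time lookups to a single dictionary access per expanded term; a database table with an index on the term column would play the same role at larger scale.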
Step 4: Query-Time Semantic Expansion
Goal: Expand the user's query using KG (synonyms, variants, etc.) and disambiguate based on context.
Tools: Knowledge Graph (UMLS), synonym/semantic type lookup, optional filters
query = "stroke"
expanded_terms = ["stroke", "cva", "cerebrovascular accident"]
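# Disambiguate (simplified): a toy context rule that rejects "heatstroke"
# for patients under 18 seen in summer and accepts everything else
def is_valid(term, patient_age, season):
    return not (term == "heatstroke" and patient_age < 18 and season == "summer")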
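The hard-coded `expanded_terms` list stands in for a knowledge-graph lookup. One way that lookup might be sketched is shown below, assuming a small in-memory concept table; the concept IDs, semantic types, and the `expand_query` function are made up for illustration and are not UMLS identifiers or a UMLS API.

```python
# Hypothetical concept table standing in for a knowledge-graph lookup.
# The concept IDs below are invented and are not UMLS CUIs.
CONCEPTS = {
    "CONCEPT_STROKE": {
        "semantic_type": "disease",
        "synonyms": ["stroke", "cva", "cerebrovascular accident"],
    },
    "CONCEPT_HEATSTROKE": {
        "semantic_type": "injury",
        "synonyms": ["heatstroke", "heat stroke"],
    },
}

def expand_query(query, allowed_types=None):
    """Return all synonyms of concepts matching the query, optionally filtered by type."""
    q = query.lower().strip()
    expanded = []
    for concept in CONCEPTS.values():
        if q not in concept["synonyms"]:
            continue
        if allowed_types is not None and concept["semantic_type"] not in allowed_types:
            continue
        for syn in concept["synonyms"]:
            if syn not in expanded:
                expanded.append(syn)
    # Fall back to the literal query if no concept survives the lookup and filter
    return expanded if expanded else [q]

print(expand_query("CVA"))
```

Matching on whole synonyms rather than substrings is what keeps a query for "stroke" from pulling in "heatstroke" in this sketch; the semantic-type filter is a second, optional guard.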
Step 5: Build Patient-Feature Matrix for Analysis
Goal: Aggregate term mentions per patient for cohort selection and modeling.
Tools: Pandas, matrix construction, temporal tagging
from collections import defaultdict
# One row per patient; each feature counts qualifying mentions
feature_matrix = defaultdict(lambda: {"stroke_mention": 0})
patient_metadata = {1: {"age": 65, "season": "spring"}}
for mention in indexed_mentions:
    pid = mention["patient_id"]
    term = mention["term"]
    meta = patient_metadata[pid]
    if term in expanded_terms and is_valid(term, meta["age"], meta["season"]):
        feature_matrix[pid]["stroke_mention"] += 1
# Patients with no qualifying mention never get a row, so the sample data above prints {}
print(dict(feature_matrix))
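For modeling, it is often more convenient to have a dense matrix with one row for every patient and one column per feature, including zero counts. The following standard-library sketch builds one under assumed inputs: the `features` mapping, the `build_feature_matrix` helper, and patient 2 are hypothetical.

```python
from collections import Counter

# Hypothetical indexed mentions; patient 2 is invented for illustration.
mentions = [
    {"patient_id": 1, "term": "chest pain"},
    {"patient_id": 1, "term": "metformin"},
    {"patient_id": 2, "term": "cva"},
    {"patient_id": 2, "term": "stroke"},
]

# Feature name -> set of surface terms that count toward it.
features = {
    "stroke_mention": {"stroke", "cva", "cerebrovascular accident"},
    "chest_pain_mention": {"chest pain"},
}

def build_feature_matrix(records, feature_terms):
    """Return (column names, {patient_id: [count per column]}) with zeros filled in."""
    columns = sorted(feature_terms)
    counts = Counter()
    for rec in records:
        for name in columns:
            if rec["term"] in feature_terms[name]:
                counts[(rec["patient_id"], name)] += 1
    patient_ids = sorted({rec["patient_id"] for rec in records})
    # Counter returns 0 for missing keys, so every patient gets a full row
    rows = {pid: [counts[(pid, name)] for name in columns] for pid in patient_ids}
    return columns, rows

columns, rows = build_feature_matrix(mentions, features)
print(columns)
print(rows)
```

If pandas is available, the same result could be handed to `pandas.DataFrame.from_dict(rows, orient="index", columns=columns)` to get a labeled frame for cohort selection.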
Summary
Step | Goal | Tools |
---|---|---|
1 | Clean & tokenize notes | Regex, NLP |
2 | Extract clean medical terms | KG, NegEx, filtering |
3 | Store structured mentions | JSON, DB |
4 | Expand/interpret queries | KG, synonyms, disambiguation |
5 | Analyze for research | Patient-feature matrix, Pandas |
This modular pipeline separates data preparation (Steps 1–3) from query-time expansion and analysis (Steps 4–5), so the indexed mentions can be reused across different queries without reprocessing the notes.