Ch4 Deep Dive – Electronic Health Records (EHR) #
Q1: What is the central focus of Chapter 4? #
Chapter 4 focuses on working with electronic health record (EHR) data using the MIMIC-III dataset, and explores medication harmonization using SQL, Neo4j (property graph), and TypeDB (typed hypergraph).
Q2: What makes working with EHR data complex? #
- EHRs are highly structured but vary between implementations.
- Data is often redundant, inconsistent, or missing.
- Clinical context and domain knowledge are crucial for correct interpretation.
Q3: Why was medication harmonization chosen as the use case? #
Because medications are objective and widely used in EHRs, but the same drug can appear under multiple names or codes (e.g. NDCs). Harmonization is necessary to:
- Normalize names or codes.
- Filter out forms like “heparin flush” vs “therapeutic heparin”.
Q4: How is this implemented using SQL (SQLite)? #
- CSVs are loaded into SQLite tables.
- Free-text string search is done across drug name columns (e.g.,
drug_name_generic
LIKE “%heparin%”). - NDC-based harmonization is done by:
- Extracting distinct NDCs.
- Filtering to include only valid/therapeutic ones.
- Rewriting queries using explicit
WHERE ndc IN (...)
clauses.
Q5: What are the challenges of the SQL approach? #
- Hard to maintain mappings (e.g., repeated NDC lists).
- Poor separation of concerns (clinical logic leaks into queries).
- Not reusable across projects or datasets.
Q6: How does Neo4j (property graph) improve the workflow? #
- Drug instances are modeled as nodes, and a new “heparin (non-flush)” concept is created.
- Relationships connect drug nodes to patients and prescriptions.
- Reusability improves because mappings are stored inside the graph.
- Queries become more semantic, e.g., follow a concept node rather than duplicating NDCs in every query.
Q7: How are concepts created in Neo4j? #
- A node like
Drug:Knowledge { drug: "Heparin (non-flush)", ... }
is created. - Drug nodes are linked to it via
[:for_drug {derived: true}]
relationships. - This enables querying “all patients on this drug concept.”
Q8: How does TypeDB enhance this model further? #
- TypeDB uses strong typing and roles to model relations:
patient plays prescription:patient
druginstance plays prescription:prescribed_drug
- Separate
druginstance
entity andprescription
relation. - Can add inference rules to dynamically associate drugs with higher-level concepts.
Q9: What are the two approaches to harmonization in TypeDB? #
- Persisted:
- Insert actual hierarchy facts:
(parent: heparin, child: drug) isa hierarchy
.
- Insert actual hierarchy facts:
- Rule-based:
- Use rules like:
rule heparin-rule: when { $d has ndc "xxxx"; $c has purl "..."; } then { (parent: $c, child: $d) isa hierarchy; }
- Use rules like:
Q10: What are the tradeoffs between approaches? #
Model | Pros | Cons |
---|---|---|
SQL | Ubiquitous, accessible | Poor semantics, duplication of logic |
Neo4j | Intuitive graph model, reusable concepts | Fragile schemas, performance concerns |
TypeDB | Semantic precision, rule engine | Newer ecosystem, complexity, fewer tools |
RDF Graph | Web standard, portable | Very steep learning curve |
Q11: What are the broader lessons? #
- Separate clinical knowledge from code.
- Choose modeling strategies based on:
- Project duration
- Tooling maturity
- Frequency of schema changes
- Collaboration needs
- Use graphs or TypeDB to encode reusable logic and keep queries clean.
🧠 Curriculum Task-Based Summary (Chapter 4) #
🔹 1. Understanding EHR Data Models #
- Compare OMOP, FHIR, i2b2, PCORnet, ADaM, SDTM.
- Explore role of implementation guides and FHIR profiles.
🔹 2. Setup and Load EHR Data (MIMIC-III) #
- Use SQLite to ingest
.csv
files (Example 4-1). - Use Neo4j or TypeDB containers (Docker).
- Load and explore data with basic queries.
🔹 3. Medication Harmonization Use Case #
- Focus on heparin, and identify pitfalls with NDC codes.
- Extract and deduplicate NDCs from
prescriptions
. - Build queries that target therapeutic use only.
🔹 4. Query and Harmonize in Three Paradigms #
Task | SQL | Neo4j | TypeDB |
---|---|---|---|
Load data | pandas + sqlite3 | pandas + NeoInterface | pandas + TypeDB client |
Query for “heparin” | LIKE on drug names |
toLower(drug_name) CONTAINS |
drug has name contains |
Harmonization | Filter with ndc IN (...) |
Create concept node + edges | Insert facts or define rules |
🔹 5. Linkage and Reasoning #
- Create custom drug concepts.
- Track prescriptions per patient.
- Link concepts using:
- SQL joins
- Neo4j
(:Patient)-[:has_prescription]->(:Drug)
- TypeDB
roles
+rules
.
🔹 6. Evaluate Tradeoffs and Performance #
- Review table of pros/cons (Table 4-2).
- Balance:
- Query simplicity vs data model reusability.
- Rule-driven inference vs static mapping.
- Ecosystem maturity.
✅ End of Chapter Outcome #
You should now be able to:
- Choose the right data model (SQL, graph, hypergraph) for your RWD task.
- Implement and harmonize medication concepts.
- Balance engineering choices with clinical accuracy and long-term maintainability.
- Begin thinking about integrating terminologies (UMLS, SNOMED CT) into your models.