Ch6 Machine Learning & Graph-Based Analytics
Part 1: Q&A Summary
1. What is the difference between cleaning, harmonization, and feature engineering?
- Cleaning: Removing errors or inconsistencies in the raw data.
- Harmonization: Mapping and aligning data semantically across datasets (e.g., converting NDC to RxNorm).
- Feature Engineering: Transforming data to fit the needs of specific algorithms or analysis (e.g., PCA, one-hot encoding).
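The three stages can be told apart with a toy example. This is a minimal sketch, not a production pipeline: the records, the placeholder NDC values, the `NDC_TO_RXNORM` lookup, and the `RX_A`/`RX_B` codes are all hypothetical and stand in for real terminology data.

```python
# Hypothetical raw medication records; codes are placeholders, not real NDCs.
raw_records = [
    {"patient_id": "p1", "ndc": "00000-0000-01", "dose_mg": 500},
    {"patient_id": "p2", "ndc": "00000-0000-02", "dose_mg": -10},  # impossible dose
    {"patient_id": "p3", "ndc": "00000-0000-03", "dose_mg": 250},
]

# Hypothetical NDC -> RxNorm lookup; a real one would come from a terminology source.
NDC_TO_RXNORM = {
    "00000-0000-01": "RX_A",
    "00000-0000-03": "RX_B",
}


def clean(records):
    """Cleaning: drop rows whose values cannot be correct."""
    return [r for r in records if r["dose_mg"] > 0]


def harmonize(records, mapping):
    """Harmonization: re-express source codes in a shared vocabulary."""
    out = []
    for r in records:
        code = mapping.get(r["ndc"])
        if code is not None:  # unmapped codes are skipped in this sketch
            out.append(
                {"patient_id": r["patient_id"], "rxnorm": code, "dose_mg": r["dose_mg"]}
            )
    return out


def one_hot(records, field):
    """Feature engineering: one-hot encode a categorical field for a model."""
    categories = sorted({r[field] for r in records})
    rows = [[1 if r[field] == c else 0 for c in categories] for r in records]
    return categories, rows


cleaned = clean(raw_records)
harmonized = harmonize(cleaned, NDC_TO_RXNORM)
categories, features = one_hot(harmonized, "rxnorm")
```

Only the last step depends on the downstream model; the first two produce data that any project could reuse.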
2. Why are graphs more useful for harmonization than feature engineering?
- Graphs help link concepts across vocabularies, terminologies, or systems.
- Feature engineering is usually tied to a specific model or task, so its outputs are harder to reuse across projects.
3. What are the downsides of repeating cleaning/harmonization for each project?
- Redundancy: Same steps are repeated across projects.
- Inefficiency: Each team member duplicates similar work.
- Inconsistency: No central source of truth for processed data.
4. What is a feature store and how does it help?
- A feature store centralizes reusable, preprocessed features.
- It reduces redundant work and promotes consistency across projects.
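To make the idea concrete, here is a minimal in-memory sketch of a feature store. The `FeatureStore` class, its method names, and the feature names are hypothetical; real feature stores add versioning, storage backends, and access control that are not shown here.

```python
class FeatureStore:
    """Toy in-memory feature store keyed by (entity_id, feature_name)."""

    def __init__(self):
        self._definitions = {}  # feature name -> human-readable description
        self._values = {}       # (entity_id, feature name) -> value

    def register(self, name, description):
        """Declare a feature once so every project shares one definition."""
        self._definitions[name] = description

    def put(self, entity_id, name, value):
        """Store a precomputed value; unregistered features are rejected."""
        if name not in self._definitions:
            raise KeyError(f"feature {name!r} is not registered")
        self._values[(entity_id, name)] = value

    def get_vector(self, entity_id, names):
        """Return values in the requested order; missing values become None."""
        return [self._values.get((entity_id, n)) for n in names]


store = FeatureStore()
store.register("age_years", "Age at index date, in years")
store.register("on_statin", "1 if an active statin prescription exists, else 0")
store.put("p1", "age_years", 64)
store.put("p1", "on_statin", 1)

# Two different projects read the same precomputed features.
readmission_features = store.get_vector("p1", ["age_years", "on_statin"])
adherence_features = store.get_vector("p1", ["on_statin"])
```

Because both projects read `on_statin` from the same place, they cannot silently disagree on how it was computed.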
5. How do knowledge graphs improve the pipeline?
- Data is cleaned and harmonized once at the graph level.
- All downstream users can reuse the harmonized view via queries or APIs.
6. What assumptions are made when using a knowledge graph?
- Patient-level data and terminology concepts are stored in the same graph.
- Nodes/edges are tagged with metadata (e.g., timestamps, source).
- The graph acts as a supergraph from which task-specific subgraphs can be extracted.
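These assumptions can be illustrated with a small sketch in which patient nodes and terminology concepts share one edge list, each edge carries metadata, and a subgraph is pulled out by filtering on that metadata. All node identifiers, relation names, and metadata keys below are hypothetical, and a plain Python list stands in for a real graph database.

```python
from datetime import date

# Each edge: (source node, relation, target node, metadata).
edges = [
    ("patient:p1", "HAS_CONDITION", "concept:C1",
     {"source": "ehr", "recorded": date(2021, 3, 1)}),
    ("patient:p1", "TOOK_DRUG", "concept:D1",
     {"source": "claims", "recorded": date(2022, 6, 15)}),
    ("patient:p2", "HAS_CONDITION", "concept:C1",
     {"source": "claims", "recorded": date(2020, 1, 10)}),
    ("concept:C1", "IS_A", "concept:C0",
     {"source": "terminology", "recorded": date(2019, 1, 1)}),
]


def extract_subgraph(edges, sources=None, since=None):
    """Return edges whose metadata match the source set and the date cutoff."""
    selected = []
    for src, rel, dst, meta in edges:
        if sources is not None and meta["source"] not in sources:
            continue
        if since is not None and meta["recorded"] < since:
            continue
        selected.append((src, rel, dst))
    return selected


# Subgraph limited to EHR facts plus the terminology hierarchy.
ehr_and_terms = extract_subgraph(edges, sources={"ehr", "terminology"})
# Subgraph limited to edges recorded on or after 2021-01-01.
recent = extract_subgraph(edges, since=date(2021, 1, 1))
```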
7. What are graph embeddings and why are they useful?
- They convert graph structures into vectors usable in ML models.
- Enable pattern detection, similarity analysis, and deep learning.
8. What is node2vec?
- Random walk-based graph embedding technique.
- Uses the return parameter (p) and the in-out parameter (q) to bias the random walk.
- Captures homophily and structural equivalence.
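The role of p and q is easiest to see in the walk itself. The sketch below implements only the biased second-order random walk on an unweighted, undirected toy graph; the function name `node2vec_walk` and the graph are illustrative assumptions. In the full method the resulting walks are then passed to a skip-gram model to learn the vectors, which is not shown here.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Generate one biased walk of up to `length` nodes on an unweighted graph.

    adj maps each node to the set of its neighbors.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(adj[cur])  # sorted for reproducible sampling
        if not neighbors:
            break
        if len(walk) == 1:
            # First step has no previous node, so sample uniformly.
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)  # step back to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)      # stay at distance 1 from the previous node
            else:
                weights.append(1.0 / q)  # move outward, away from the previous node
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Small undirected toy graph as an adjacency mapping.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b", "e"},
    "e": {"d"},
}

walks = [
    node2vec_walk(adj, node, length=6, p=0.5, q=2.0, rng=random.Random(i))
    for i, node in enumerate(sorted(adj))
]
```

With q greater than 1 the walk tends to stay near its previous node; with q less than 1 it tends to move outward. A small p makes immediate backtracking more likely.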
9. What is cui2vec?
- Embeds UMLS concept unique identifiers (CUIs) based on their co-occurrence in several real-world data (RWD) sources.
- Draws context from multiple source types (claims, clinical notes, publications).
- Useful for understanding concept similarity.
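As a rough illustration of how co-occurrence can signal concept similarity, the sketch below counts how often placeholder CUIs appear together and scores pairs with positive pointwise mutual information (PPMI). This is an assumption-laden toy, not the cui2vec pipeline: the documents, the `CUI_A`-style identifiers, and the choice of PPMI as the score are all illustrative.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical "documents" (one claim or note each) as sets of placeholder CUIs.
documents = [
    {"CUI_A", "CUI_B"},
    {"CUI_A", "CUI_B", "CUI_C"},
    {"CUI_C", "CUI_D"},
    {"CUI_A", "CUI_B"},
]

concept_counts = Counter()  # number of documents containing each concept
pair_counts = Counter()     # number of documents containing each unordered pair
for doc in documents:
    concept_counts.update(doc)
    pair_counts.update(frozenset(pair) for pair in combinations(sorted(doc), 2))

n_docs = len(documents)


def ppmi(a, b):
    """Positive PMI of two concepts, using document-level probabilities."""
    joint = pair_counts[frozenset((a, b))] / n_docs
    if joint == 0:
        return 0.0
    p_a = concept_counts[a] / n_docs
    p_b = concept_counts[b] / n_docs
    return max(0.0, math.log(joint / (p_a * p_b)))


score_ab = ppmi("CUI_A", "CUI_B")  # pair that co-occurs in three documents
score_ad = ppmi("CUI_A", "CUI_D")  # pair that never co-occurs
```

A matrix of such scores is one possible input to a dimensionality-reduction step that yields dense concept vectors.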
10. What is med2vec?
- Uses the temporal sequence of medical events to learn visit-level embeddings.
- Retains longitudinal context.
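One way to picture how visit order enters the training data is to pair each visit with its temporal neighbors for the same patient. The sketch below does only that; the med2vec network and loss are not reproduced. The function name `visit_context_pairs`, the window size, and the `dx:`/`rx:` codes are hypothetical.

```python
from datetime import date

# Hypothetical visits: (patient id, visit date, set of codes recorded at the visit).
visits = [
    ("p1", date(2021, 3, 1), {"dx:X2", "rx:Y1"}),
    ("p1", date(2021, 1, 5), {"dx:X1"}),
    ("p1", date(2021, 6, 9), {"rx:Y1"}),
    ("p2", date(2020, 2, 2), {"dx:X1"}),
]


def visit_context_pairs(visits, window=1):
    """Pair each visit with the visits up to `window` steps before or after it."""
    by_patient = {}
    for patient_id, when, codes in visits:
        by_patient.setdefault(patient_id, []).append((when, codes))

    pairs = []
    for patient_id in sorted(by_patient):
        # Order each patient's visits chronologically before pairing.
        ordered = sorted(by_patient[patient_id], key=lambda v: v[0])
        sequence = [sorted(codes) for _, codes in ordered]
        for i, target in enumerate(sequence):
            lo = max(0, i - window)
            hi = min(len(sequence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((target, sequence[j]))
    return pairs


pairs = visit_context_pairs(visits, window=1)
```

Patients with a single visit contribute no pairs, which is one reason longitudinal coverage matters for this style of embedding.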
11. What is snomed2vec?
- Embeds SNOMED CT concepts using hierarchical and network-based methods.
- Includes alternatives like metapath2vec and Poincaré embeddings.
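Poincaré embeddings place concepts inside the unit ball and measure similarity with hyperbolic distance, which suits hierarchies because distances grow quickly near the boundary. The sketch below computes that distance for two points; the function name and the example coordinates are illustrative, and no trained SNOMED CT vectors are involved.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


origin = (0.0, 0.0)
mid_point = (0.5, 0.0)
near_edge = (0.9, 0.0)

d_mid = poincare_distance(origin, mid_point)
d_edge = poincare_distance(origin, near_edge)
```

In this geometry, general concepts tend to sit near the origin and specific ones near the boundary, so a parent and child can be close while two siblings' descendants end up far apart.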
12. What are some challenges with pretrained embeddings?
- Risk of overfitting to the domain of the training data (e.g., CMS claims).
- May not generalize well to other populations or use cases.
- Introduces an extra model layer to maintain and tune.
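A simple first check on generalizability is to measure how much of the local data a pretrained vocabulary covers at all. The sketch below is one such check under stated assumptions: the function name `vocabulary_coverage`, the placeholder codes, and the counts are hypothetical, and good coverage alone does not show that the vectors are meaningful for a new population.

```python
def vocabulary_coverage(pretrained_vocab, local_code_counts):
    """Return (share of local code occurrences covered, sorted missing codes)."""
    total = sum(local_code_counts.values())
    if total == 0:
        return 0.0, []
    covered = sum(
        count for code, count in local_code_counts.items() if code in pretrained_vocab
    )
    missing = sorted(code for code in local_code_counts if code not in pretrained_vocab)
    return covered / total, missing


# Hypothetical pretrained vocabulary and local code frequencies.
pretrained_vocab = {"CODE_A", "CODE_B"}
local_code_counts = {"CODE_A": 6, "CODE_B": 2, "CODE_Z": 2}

coverage, missing_codes = vocabulary_coverage(pretrained_vocab, local_code_counts)
```

Codes that appear in `missing_codes` have no vector at all, so any model built on the embeddings silently ignores them unless a fallback is added.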
Part 2: Curriculum-Style Breakdown with “Why”
🧭 Phase 1: Understand the Motivation
- Task: Read and distinguish between cleaning, harmonization, and feature engineering.
- Why: Clarifies each pipeline component and prevents misuse of graphs for tasks like feature engineering.
🧱 Phase 2: Explore Pipeline Challenges
- Task: Analyze Figures 6-6 to 6-9 on pipeline repetition and inefficiency.
- Why: Understand how lack of standardization leads to duplicated efforts.
🧠 Phase 3: Learn about Feature Stores
- Task: Study how feature stores centralize and reuse engineered features.
- Why: Saves time, increases reproducibility, and reduces technical debt.
🌐 Phase 4: Integrate Knowledge Graphs
- Task: Understand what goes into a knowledge graph (patient data + ontologies).
- Why: Enables one-time harmonization per data source, allowing scalable reuse.
🧩 Phase 5: Explore Graph Embedding Techniques
- Task: Implement node2vec on a small graph.
- Why: Learn the difference between homophily and structural equivalence, both central to reasoning over biomedical graphs.
🧬 Phase 6: Biomedical Concept Embeddings
- Task: Compare and contrast cui2vec, med2vec, and snomed2vec.
- Why: Appreciate how embeddings differ by data type (temporal, co-occurrence, hierarchical).
⚠️ Phase 7: Real-World Concerns with Embeddings
- Task: Evaluate pretrained embeddings and consider limitations (overfitting, generalizability).
- Why: Embeddings that perform well on their original data can still fail on new populations or domains.
🔁 Phase 8: Apply to Your Use Case
- Task: Pick a small real-world use case and simulate a pipeline using a knowledge graph and embedding.
- Why: Reinforces learning and identifies operational gaps in pipeline design.