Ch6 Machine Learning & Graph-Based Analytics #


Part 1: Q&A Summary #

1. What is the difference between cleaning, harmonization, and feature engineering? #

  • Cleaning: Removing errors or inconsistencies in the raw data.
  • Harmonization: Mapping and aligning data semantically across datasets (e.g., converting NDC to RxNorm).
  • Feature Engineering: Transforming data to fit the needs of specific algorithms or analyses (e.g., PCA, one-hot encoding); see the sketch below.
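
For instance, one-hot encoding is a typical feature-engineering step that happens after cleaning and harmonization. The minimal pandas sketch below uses made-up column names (`patient_id`, `drug_class`) purely for illustration.

```python
# Minimal feature-engineering sketch: one-hot encode a categorical column so it
# can feed a linear model. Column names and values are illustrative only.
import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "drug_class": ["statin", "beta_blocker", "statin"],
})

# Expand drug_class into indicator columns; numeric columns pass through.
features = pd.get_dummies(df, columns=["drug_class"], prefix="rx")
print(features)
```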

2. Why are graphs more useful for harmonization than feature engineering? #

  • Graphs help link concepts across vocabularies, terminologies, or systems.
  • Feature engineering tends to be model-specific and harder to generalize.

3. What are the downsides of repeating cleaning/harmonization for each project? #

  • Redundancy: Same steps are repeated across projects.
  • Inefficiency: Each team member duplicates similar work.
  • Inconsistency: No central source of truth for processed data.

4. What is a feature store and how does it help? #

  • A feature store centralizes reusable, preprocessed features.
  • Reduces redundancy and promotes consistency across projects (see the sketch below).
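
As a rough illustration of the idea (not any particular product such as Feast), the sketch below writes engineered features once, keyed by patient ID, so every project reads the same copy. The path, file format, and column names are assumptions.

```python
# Toy feature-store sketch: publish engineered features once, read them
# everywhere. Path, file format, and column names are assumptions.
import pandas as pd

FEATURE_PATH = "features/patient_features.parquet"  # hypothetical location

def publish_features(df: pd.DataFrame) -> None:
    """Persist a preprocessed feature table keyed by patient_id."""
    df.to_parquet(FEATURE_PATH, index=False)

def get_features(patient_ids: list[int]) -> pd.DataFrame:
    """Every downstream project reads the same harmonized feature table."""
    stored = pd.read_parquet(FEATURE_PATH)
    return stored[stored["patient_id"].isin(patient_ids)]
```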

5. How do knowledge graphs improve the pipeline? #

  • Data is cleaned and harmonized once at the graph level.
  • All downstream users can reuse the harmonized view via queries or APIs (see the query sketch below).
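
If the harmonized graph lives in a graph database such as Neo4j, downstream reuse can look like the hedged sketch below: each project runs a query against the same cleaned view instead of re-deriving it. The node labels, relationship type, and property names are assumptions about the schema, not something fixed by the chapter.

```python
# Hedged sketch of downstream reuse via a Cypher query. Schema elements
# (:Patient, :Drug, PRESCRIBED, rxnorm_code) are assumed for illustration.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CYPHER = """
MATCH (p:Patient)-[:PRESCRIBED]->(d:Drug)
WHERE d.rxnorm_code = $code
RETURN p.id AS patient_id
"""

with driver.session() as session:
    result = session.run(CYPHER, code="12345")   # placeholder RxNorm code
    patient_ids = [record["patient_id"] for record in result]

driver.close()
```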

6. What assumptions are made when using a knowledge graph? #

  • Patient-level data and terminology concepts are stored in the same graph.
  • Nodes/edges are tagged with metadata (e.g., timestamps, source).
  • The graph is a supergraph from which project-specific subgraphs can be extracted (see the sketch below).
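
A small networkx sketch of that supergraph assumption: patient nodes and terminology concepts sit in one graph, tagged with metadata, and a project extracts only the subgraph it needs. The identifiers and the `domain` tag are illustrative.

```python
# Supergraph sketch: patient-level data and terminology concepts in one graph,
# with metadata on nodes/edges, followed by a project-specific subgraph pull.
import networkx as nx

G = nx.MultiDiGraph()
G.add_node("patient:001", kind="Patient", source="EHR")
G.add_node("snomed:22298006", kind="Concept", source="SNOMED CT",
           domain="cardiology")    # myocardial infarction
G.add_node("snomed:73211009", kind="Concept", source="SNOMED CT",
           domain="endocrinology") # diabetes mellitus
G.add_edge("patient:001", "snomed:22298006",
           relation="HAS_DIAGNOSIS", recorded="2021-03-02")

# Subgraph extraction: keep patients plus cardiology concepts only.
keep = [n for n, d in G.nodes(data=True)
        if d["kind"] == "Patient" or d.get("domain") == "cardiology"]
cardio_view = G.subgraph(keep)
print(cardio_view.number_of_nodes(), cardio_view.number_of_edges())
```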

7. What are graph embeddings and why are they useful? #

  • They convert graph structures into vectors usable in ML models.
  • Enable pattern detection, similarity analysis, and deep learning (see the sketch below).
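
Once nodes are embedded, the vectors are ordinary tabular features. The sketch below stands in random vectors for real embeddings (an assumption) just to show the downstream interface with scikit-learn.

```python
# Embeddings as plain feature vectors for a downstream model. The embedding
# matrix here is random stand-in data, not output from a real graph model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))     # 200 nodes, 64-dimensional embeddings
y = rng.integers(0, 2, size=200)   # e.g., a per-patient outcome label

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:5]))
```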

8. What is node2vec? #

  • Random walk-based graph embedding technique.
  • Uses return (p) and in-out (q) parameters to bias the random walks.
  • Can emphasize either homophily or structural equivalence, depending on how p and q are set (see the sketch below).
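
A minimal sketch using the community `node2vec` package (pip install node2vec) on a toy graph; dimensions, walk counts, and p/q values are illustrative, not tuned.

```python
# node2vec sketch: biased random walks + skip-gram embeddings on a toy graph.
import networkx as nx
from node2vec import Node2Vec

G = nx.karate_club_graph()

# p (return) controls the chance of immediately revisiting a node; q (in-out)
# biases the walk: q < 1 explores outward (DFS-like, community/homophily),
# q > 1 stays local (BFS-like, structural roles).
n2v = Node2Vec(G, dimensions=32, walk_length=10, num_walks=50,
               p=1.0, q=0.5, workers=1)
model = n2v.fit(window=5, min_count=1)   # returns a gensim Word2Vec model

# Nearest neighbors of node 0 in embedding space (node IDs are stringified).
print(model.wv.most_similar("0", topn=5))
```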

9. What is cui2vec? #

  • Embeds UMLS CUIs based on co-occurrence across various real-world data (RWD) sources.
  • Context-aware (claims, notes, publications).
  • Useful for measuring concept similarity (see the sketch below).
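
A hedged sketch of consuming pretrained concept vectors such as cui2vec: assume a CSV whose first column is the UMLS CUI and whose remaining columns are the vector. The file name and layout are assumptions about the downloaded artifact.

```python
# Cosine similarity between two UMLS concepts using pretrained vectors.
# File name and column layout are assumptions.
import numpy as np
import pandas as pd

emb = pd.read_csv("cui2vec_pretrained.csv", index_col=0)

def similarity(cui_a: str, cui_b: str) -> float:
    """Cosine similarity between two concept vectors."""
    a, b = emb.loc[cui_a].to_numpy(), emb.loc[cui_b].to_numpy()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g., similarity("C0011849", "C0020538")  # diabetes mellitus vs. hypertension
```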

10. What is med2vec? #

  • Uses the temporal sequence of medical events to create visit-based embeddings.
  • Retains longitudinal context (a simplified sketch follows below).
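
med2vec itself is a specific neural architecture (Choi et al.) trained over ordered visits; as a plainly simplified stand-in for intuition only, the sketch below runs skip-gram over codes grouped by visit. This is not the med2vec model, and the codes are placeholders.

```python
# Simplified visit-sequence embedding (NOT med2vec itself): skip-gram over
# codes grouped by visit, visits in chronological order. Codes are placeholders.
from gensim.models import Word2Vec

visits = [
    ["ICD10:E11.9", "RXNORM:860975"],   # hypothetical visit 1
    ["ICD10:E11.9", "ICD10:I10"],       # hypothetical visit 2
    ["ICD10:I10", "RXNORM:197361"],     # hypothetical visit 3
]

model = Word2Vec(sentences=visits, vector_size=16, window=5, min_count=1, sg=1)
print(model.wv.most_similar("ICD10:E11.9", topn=2))
```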

11. What is snomed2vec? #

  • Embeds SNOMED CT concepts using hierarchical and network-based methods.
  • Related alternatives include metapath2vec and Poincaré embeddings (see the Poincaré sketch below).
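
A minimal Poincaré-embedding sketch with gensim on a tiny SNOMED-style is-a hierarchy; the relation pairs are illustrative, not actual SNOMED CT content.

```python
# Hierarchical (hyperbolic) embedding of an is-a tree with gensim's PoincareModel.
from gensim.models.poincare import PoincareModel

# (child, parent) pairs, i.e., "X is-a Y". Illustrative only.
relations = [
    ("myocardial infarction", "heart disease"),
    ("heart disease", "cardiovascular disease"),
    ("hypertension", "cardiovascular disease"),
    ("type 2 diabetes", "diabetes mellitus"),
]

model = PoincareModel(relations, size=10, negative=2)
model.train(epochs=50)
print(model.kv.most_similar("heart disease", topn=3))
```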

12. What are some challenges with pretrained embeddings? #

  • Risk of overfitting to training data domain (e.g., CMS claims).
  • May not generalize well to other populations or use cases.
  • Introduces extra model layer to maintain and tune.

Part 2: Curriculum-Style Breakdown with “Why” #

🧭 Phase 1: Understand the Motivation #

  • Task: Read and distinguish between cleaning, harmonization, and feature engineering.
    • Why: Clarifies each pipeline component and prevents misuse of graphs for tasks like feature engineering.

🧱 Phase 2: Explore Pipeline Challenges #

  • Task: Analyze Figures 6-6 to 6-9 on pipeline repetition and inefficiency.
    • Why: Understand how lack of standardization leads to duplicated efforts.

🧠 Phase 3: Learn about Feature Stores #

  • Task: Study how feature stores centralize and reuse engineered features.
    • Why: Saves time, increases reproducibility, and reduces tech debt.

🌐 Phase 4: Integrate Knowledge Graphs #

  • Task: Understand what goes into a knowledge graph (patient data + ontologies).
    • Why: Enables one-time harmonization per data source, allowing scalable reuse.

🧩 Phase 5: Explore Graph Embedding Techniques #

  • Task: Implement node2vec on a small graph (see the sketch below).
    • Why: Learn the homophily vs. structural equivalence trade-off, which is key for biomedical graph reasoning.
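
One way to run that exercise, again with the community `node2vec` package: fit the same toy graph twice with different q values and compare which neighbors each setting pulls together. Values are illustrative.

```python
# Compare DFS-like (low q) vs. BFS-like (high q) walks on the same graph.
import networkx as nx
from node2vec import Node2Vec

G = nx.barbell_graph(5, 2)   # two dense cliques joined by a short path

def top_neighbors(q: float):
    model = Node2Vec(G, dimensions=16, walk_length=10, num_walks=100,
                     p=1.0, q=q, workers=1).fit(window=5, min_count=1)
    return model.wv.most_similar("0", topn=3)

print("q=0.25 (community/homophily emphasis):", top_neighbors(0.25))
print("q=4.0  (structural-role emphasis):    ", top_neighbors(4.0))
```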

🧬 Phase 6: Biomedical Concept Embeddings #

  • Task: Compare and contrast cui2vec, med2vec, and snomed2vec.
    • Why: Appreciate how embeddings differ by data type (temporal, co-occurrence, hierarchical).

⚠️ Phase 7: Real-World Concerns with Embeddings #

  • Task: Evaluate pretrained embeddings and consider their limitations (overfitting, generalizability); a coverage-check sketch follows below.
    • Why: Embeddings may look good on paper but can fail in new domains.
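
One concrete check before adopting a pretrained embedding: how much of your own concept vocabulary does it actually cover? The sketch below assumes both the pretrained vectors and the local concept list are CSVs; the file names and column names are hypothetical.

```python
# Coverage check: fraction of local concepts present in a pretrained vocabulary.
import pandas as pd

pretrained_vocab = set(pd.read_csv("cui2vec_pretrained.csv", index_col=0).index)
local_concepts = set(pd.read_csv("local_concepts.csv")["cui"])   # hypothetical file

coverage = len(local_concepts & pretrained_vocab) / len(local_concepts)
print(f"Pretrained embedding covers {coverage:.1%} of local concepts")
```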

🔁 Phase 8: Apply to Your Use Case #

  • Task: Pick a small real-world use case and simulate a pipeline using a knowledge graph and embedding.
    • Why: Reinforces learning and identifies operational gaps in pipeline design.