Ch6 Machine Learning & Graph-Based Analytics #


Part 1: Q&A Summary #

1. What is the difference between cleaning, harmonization, and feature engineering? #

  • Cleaning: Removing errors or inconsistencies in the raw data.
  • Harmonization: Mapping and aligning data semantically across datasets (e.g., converting NDC to RxNorm).
  • Feature Engineering: Transforming data to fit the needs of specific algorithms or analyses (e.g., PCA, one-hot encoding); see the sketch below.
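
For instance, one-hot encoding is a typical feature-engineering step that happens after cleaning and harmonization. The minimal pandas sketch below uses made-up column names (`patient_id`, `drug_class`) purely for illustration.

```python
# Minimal feature-engineering sketch: one-hot encode a categorical column so it
# can feed a linear model. Column names and values are illustrative only.
import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "drug_class": ["statin", "beta_blocker", "statin"],
})

# Expand drug_class into indicator columns; numeric columns pass through.
features = pd.get_dummies(df, columns=["drug_class"], prefix="rx")
print(features)
```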

2. Why are graphs more useful for harmonization than feature engineering? #

  • Graphs help link concepts across vocabularies, terminologies, or systems.
  • Feature engineering tends to be model-specific and harder to generalize.

3. What are the downsides of repeating cleaning/harmonization for each project? #

  • Redundancy: Same steps are repeated across projects.
  • Inefficiency: Each team member duplicates similar work.
  • Inconsistency: No central source of truth for processed data.

4. What is a feature store and how does it help? #

  • A feature store centralizes reusable, preprocessed features.
  • Reduces redundancy and promotes consistency across projects (see the sketch below).
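
As a rough illustration of the idea (not any particular product such as Feast), the sketch below writes engineered features once, keyed by patient ID, so every project reads the same copy. The path, file format, and column names are assumptions.

```python
# Toy feature-store sketch: publish engineered features once, read them
# everywhere. Path, file format, and column names are assumptions.
import pandas as pd

FEATURE_PATH = "features/patient_features.parquet"  # hypothetical location

def publish_features(df: pd.DataFrame) -> None:
    """Persist a preprocessed feature table keyed by patient_id."""
    df.to_parquet(FEATURE_PATH, index=False)

def get_features(patient_ids: list[int]) -> pd.DataFrame:
    """Every downstream project reads the same harmonized feature table."""
    stored = pd.read_parquet(FEATURE_PATH)
    return stored[stored["patient_id"].isin(patient_ids)]
```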

5. How do knowledge graphs improve the pipeline? #

  • Data is cleaned and harmonized once at the graph level.
  • All downstream users can reuse the harmonized view via queries or APIs (see the query sketch below).
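
If the harmonized graph lives in a graph database such as Neo4j, downstream reuse can look like the hedged sketch below: each project runs a query against the same cleaned view instead of re-deriving it. The node labels, relationship type, and property names are assumptions about the schema, not something fixed by the chapter.

```python
# Hedged sketch of downstream reuse via a Cypher query. Schema elements
# (:Patient, :Drug, PRESCRIBED, rxnorm_code) are assumed for illustration.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CYPHER = """
MATCH (p:Patient)-[:PRESCRIBED]->(d:Drug)
WHERE d.rxnorm_code = $code
RETURN p.id AS patient_id
"""

with driver.session() as session:
    result = session.run(CYPHER, code="12345")   # placeholder RxNorm code
    patient_ids = [record["patient_id"] for record in result]

driver.close()
```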

6. What assumptions are made when using a knowledge graph? #

  • Patient-level data and terminology concepts are stored in the same graph.
  • Nodes/edges are tagged with metadata (e.g., timestamps, source).
  • The graph is a supergraph from which project-specific subgraphs can be extracted (see the sketch below).
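
A small networkx sketch of that supergraph assumption: patient nodes and terminology concepts sit in one graph, tagged with metadata, and a project extracts only the subgraph it needs. The identifiers and the `domain` tag are illustrative.

```python
# Supergraph sketch: patient-level data and terminology concepts in one graph,
# with metadata on nodes/edges, followed by a project-specific subgraph pull.
import networkx as nx

G = nx.MultiDiGraph()
G.add_node("patient:001", kind="Patient", source="EHR")
G.add_node("snomed:22298006", kind="Concept", source="SNOMED CT",
           domain="cardiology")    # myocardial infarction
G.add_node("snomed:73211009", kind="Concept", source="SNOMED CT",
           domain="endocrinology") # diabetes mellitus
G.add_edge("patient:001", "snomed:22298006",
           relation="HAS_DIAGNOSIS", recorded="2021-03-02")

# Subgraph extraction: keep patients plus cardiology concepts only.
keep = [n for n, d in G.nodes(data=True)
        if d["kind"] == "Patient" or d.get("domain") == "cardiology"]
cardio_view = G.subgraph(keep)
print(cardio_view.number_of_nodes(), cardio_view.number_of_edges())
```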

7. What are graph embeddings and why are they useful? #

  • They convert graph structures into vectors usable in ML models.
  • Enable pattern detection, similarity analysis, and deep learning (see the sketch below).
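
Once nodes are embedded, the vectors are ordinary tabular features. The sketch below stands in random vectors for real embeddings (an assumption) just to show the downstream interface with scikit-learn.

```python
# Embeddings as plain feature vectors for a downstream model. The embedding
# matrix here is random stand-in data, not output from a real graph model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))     # 200 nodes, 64-dimensional embeddings
y = rng.integers(0, 2, size=200)   # e.g., a per-patient outcome label

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:5]))
```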

8. What is node2vec? #

  • Random walk-based graph embedding technique.
  • Uses return (p) and in-out (q) parameters to bias the random walks.
  • Can emphasize either homophily or structural equivalence, depending on how p and q are set (see the sketch below).
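
A minimal sketch using the community `node2vec` package (pip install node2vec) on a toy graph; dimensions, walk counts, and p/q values are illustrative, not tuned.

```python
# node2vec sketch: biased random walks + skip-gram embeddings on a toy graph.
import networkx as nx
from node2vec import Node2Vec

G = nx.karate_club_graph()

# p (return) controls the chance of immediately revisiting a node; q (in-out)
# biases the walk: q < 1 explores outward (DFS-like, community/homophily),
# q > 1 stays local (BFS-like, structural roles).
n2v = Node2Vec(G, dimensions=32, walk_length=10, num_walks=50,
               p=1.0, q=0.5, workers=1)
model = n2v.fit(window=5, min_count=1)   # returns a gensim Word2Vec model

# Nearest neighbors of node 0 in embedding space (node IDs are stringified).
print(model.wv.most_similar("0", topn=5))
```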

9. What is cui2vec? #

  • Embeds UMLS CUIs based on co-occurrence across various real-world data (RWD) sources.
  • Context-aware (claims, notes, publications).
  • Useful for measuring concept similarity (see the sketch below).
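
A hedged sketch of consuming pretrained concept vectors such as cui2vec: assume a CSV whose first column is the UMLS CUI and whose remaining columns are the vector. The file name and layout are assumptions about the downloaded artifact.

```python
# Cosine similarity between two UMLS concepts using pretrained vectors.
# File name and column layout are assumptions.
import numpy as np
import pandas as pd

emb = pd.read_csv("cui2vec_pretrained.csv", index_col=0)

def similarity(cui_a: str, cui_b: str) -> float:
    """Cosine similarity between two concept vectors."""
    a, b = emb.loc[cui_a].to_numpy(), emb.loc[cui_b].to_numpy()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g., similarity("C0011849", "C0020538")  # diabetes mellitus vs. hypertension
```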

10. What is med2vec? #

  • Uses the temporal sequence of medical events to create visit-based embeddings.
  • Retains longitudinal context (a simplified sketch follows below).
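
med2vec itself is a specific neural architecture (Choi et al.) trained over ordered visits; as a plainly simplified stand-in for intuition only, the sketch below runs skip-gram over codes grouped by visit. This is not the med2vec model, and the codes are placeholders.

```python
# Simplified visit-sequence embedding (NOT med2vec itself): skip-gram over
# codes grouped by visit, visits in chronological order. Codes are placeholders.
from gensim.models import Word2Vec

visits = [
    ["ICD10:E11.9", "RXNORM:860975"],   # hypothetical visit 1
    ["ICD10:E11.9", "ICD10:I10"],       # hypothetical visit 2
    ["ICD10:I10", "RXNORM:197361"],     # hypothetical visit 3
]

model = Word2Vec(sentences=visits, vector_size=16, window=5, min_count=1, sg=1)
print(model.wv.most_similar("ICD10:E11.9", topn=2))
```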

11. What is snomed2vec? #

  • Embeds SNOMED CT concepts using hierarchical and network-based methods.
  • Related alternatives include metapath2vec and Poincaré embeddings (see the Poincaré sketch below).
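
A minimal Poincaré-embedding sketch with gensim on a tiny SNOMED-style is-a hierarchy; the relation pairs are illustrative, not actual SNOMED CT content.

```python
# Hierarchical (hyperbolic) embedding of an is-a tree with gensim's PoincareModel.
from gensim.models.poincare import PoincareModel

# (child, parent) pairs, i.e., "X is-a Y". Illustrative only.
relations = [
    ("myocardial infarction", "heart disease"),
    ("heart disease", "cardiovascular disease"),
    ("hypertension", "cardiovascular disease"),
    ("type 2 diabetes", "diabetes mellitus"),
]

model = PoincareModel(relations, size=10, negative=2)
model.train(epochs=50)
print(model.kv.most_similar("heart disease", topn=3))
```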

12. What are some challenges with pretrained embeddings? #

  • Risk of overfitting to training data domain (e.g., CMS claims).
  • May not generalize well to other populations or use cases.
  • Introduces extra model layer to maintain and tune.

Part 2: Curriculum-Style Breakdown with “Why” #

🧭 Phase 1: Understand the Motivation #

  • Task: Read and distinguish between cleaning, harmonization, and feature engineering.
    • Why: Clarifies each pipeline component and prevents misuse of graphs for tasks like feature engineering.

🧱 Phase 2: Explore Pipeline Challenges #

  • Task: Analyze Figures 6-6 to 6-9 on pipeline repetition and inefficiency.
    • Why: Understand how lack of standardization leads to duplicated efforts.

🧠 Phase 3: Learn about Feature Stores #

  • Task: Study how feature stores centralize and reuse engineered features.
    • Why: Saves time, increases reproducibility, and reduces tech debt.

🌐 Phase 4: Integrate Knowledge Graphs #

  • Task: Understand what goes into a knowledge graph (patient data + ontologies).
    • Why: Enables one-time harmonization per data source, allowing scalable reuse.

🧩 Phase 5: Explore Graph Embedding Techniques #

  • Task: Implement node2vec on a small graph (see the sketch below).
    • Why: Learn the homophily vs. structural equivalence trade-off, which is key for biomedical graph reasoning.
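
One way to run that exercise, again with the community `node2vec` package: fit the same toy graph twice with different q values and compare which neighbors each setting pulls together. Values are illustrative.

```python
# Compare DFS-like (low q) vs. BFS-like (high q) walks on the same graph.
import networkx as nx
from node2vec import Node2Vec

G = nx.barbell_graph(5, 2)   # two dense cliques joined by a short path

def top_neighbors(q: float):
    model = Node2Vec(G, dimensions=16, walk_length=10, num_walks=100,
                     p=1.0, q=q, workers=1).fit(window=5, min_count=1)
    return model.wv.most_similar("0", topn=3)

print("q=0.25 (community/homophily emphasis):", top_neighbors(0.25))
print("q=4.0  (structural-role emphasis):    ", top_neighbors(4.0))
```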

🧬 Phase 6: Biomedical Concept Embeddings #

  • Task: Compare and contrast cui2vec, med2vec, and snomed2vec.
    • Why: Appreciate how embeddings differ by data type (temporal, co-occurrence, hierarchical).

⚠️ Phase 7: Real-World Concerns with Embeddings #

  • Task: Evaluate pretrained embeddings and consider their limitations (overfitting, generalizability); a coverage-check sketch follows below.
    • Why: Embeddings may look good on paper but can fail in new domains.
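
One concrete check before adopting a pretrained embedding: how much of your own concept vocabulary does it actually cover? The sketch below assumes both the pretrained vectors and the local concept list are CSVs; the file names and column names are hypothetical.

```python
# Coverage check: fraction of local concepts present in a pretrained vocabulary.
import pandas as pd

pretrained_vocab = set(pd.read_csv("cui2vec_pretrained.csv", index_col=0).index)
local_concepts = set(pd.read_csv("local_concepts.csv")["cui"])   # hypothetical file

coverage = len(local_concepts & pretrained_vocab) / len(local_concepts)
print(f"Pretrained embedding covers {coverage:.1%} of local concepts")
```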

🔁 Phase 8: Apply to Your Use Case #

  • Task: Pick a small real-world use case and simulate a pipeline using a knowledge graph and embedding.
    • Why: Reinforces learning and identifies operational gaps in pipeline design.