[Summary] Module 4: Creating Analysis-Ready Datasets from Patient Timelines

Module 4: Creating Analysis-Ready Datasets from Patient Timelines #

1 Turning clinical data into something you can analyze #

Q: What is the main objective of this module? #

A: To explain how to convert raw clinical data into a patient-feature matrix suitable for analysis and answering research questions.

Q: What is a patient-feature matrix? #

A: A structured table where each row represents a patient and each column represents a clinical feature or measurement.
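
As a minimal sketch of this idea, the snippet below pivots long-format records into one row per patient and one column per feature. The record layout, patient IDs, and feature names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical long-format records: (patient_id, feature_name, value).
records = [
    ("p1", "age", 64),
    ("p1", "hba1c", 7.2),
    ("p2", "age", 51),
    ("p2", "systolic_bp", 138),
]

def build_matrix(records):
    """Pivot long-format records into one row per patient, one column per feature."""
    features = sorted({name for _, name, _ in records})
    rows = defaultdict(dict)
    for patient_id, name, value in records:
        rows[patient_id][name] = value
    # None marks a feature that was never recorded for that patient.
    matrix = {
        pid: [values.get(name) for name in features]
        for pid, values in sorted(rows.items())
    }
    return features, matrix

features, matrix = build_matrix(records)
print(features)
for pid, row in matrix.items():
    print(pid, row)
```

Note that the pivot itself creates empty cells: a patient with no recorded systolic blood pressure still gets a column for it.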

Q: What topics will this module cover? #

Choosing and extracting features
Managing too many or missing features
Understanding how data transformations affect analysis outcomes

➡️ How do we define the unit of analysis in clinical datasets?

2 Defining the unit of analysis #

Q: What is the typical unit of analysis in clinical data studies? #

A: The patient — with each row in the data table representing one patient and each column a feature about them.

Q: Are there other valid units of analysis? #

A: Yes. Depending on the question, units can be drug-disease pairs, visits, procedures, or other clinical events.

Q: Can you give an example of an alternative unit? #

A: For studying off-label drug use, the unit might be a drug-disease pair, with features describing co-mentions, sequencing, and usage frequency.

➡️ How do we use features and track their presence in patient data?

3 Using features and the presence of features #

Q: Should we use all possible features extracted from clinical data? #

A: Generally yes — start with a full feature set, unless constrained by computational limits or privacy considerations.

Q: What are some reasons to remove features? #

Resource limitations
Low predictive value
Sensitivity concerns (e.g., HIV status)

Q: How can presence or absence of data itself become a feature? #

A: Metadata such as the act of ordering a test (regardless of result) can indicate clinical concern and may be predictive on its own.

Q: What role does metadata play in feature design? #

A: It can be highly informative, capturing clinical behavior that indirectly reflects patient status.
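
One way to capture this kind of metadata is to derive features from the act of ordering alone, ignoring the result. The sketch below assumes a hypothetical list of lab orders; the field layout and test names are illustrative, not taken from any particular EMR.

```python
# Hypothetical lab orders: (patient_id, test_name, result or None if no result was filed).
lab_orders = [
    ("p1", "troponin", 0.02),
    ("p1", "troponin", None),
    ("p2", "hba1c", 6.1),
]
patients = ["p1", "p2", "p3"]

def order_features(lab_orders, patients, test_name):
    """Return per-patient metadata features for one test: ordered flag and order count."""
    counts = {pid: 0 for pid in patients}
    for pid, name, _result in lab_orders:
        # The result is deliberately ignored: only the act of ordering is counted.
        if name == test_name and pid in counts:
            counts[pid] += 1
    return {
        pid: {f"{test_name}_ordered": int(n > 0), f"{test_name}_n_orders": n}
        for pid, n in counts.items()
    }

troponin = order_features(lab_orders, patients, "troponin")
print(troponin)
```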

➡️ How do we create features from structured clinical sources?

4 How to create features from structured sources #

Q: What are structured vs. unstructured healthcare data? #

Structured: Organized in tables with rows/columns (e.g., lab results, diagnosis codes)
Unstructured: Free text, images, or signals

Q: Why are structured sources important? #

A: They are the most readily available and easiest to transform into features for analysis.

Q: What are the main steps in using structured data? #

Accessing the database
Querying with SQL
Joining tables using patient IDs
Standardizing and reshaping features
Handling missing or excessive features
Optionally constructing new ones
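
The first few steps above can be sketched with Python's built-in `sqlite3` module. The table names, columns, and values here are hypothetical stand-ins for a real clinical database; the point is the pattern of joining on the patient ID and reshaping to one row per patient.

```python
import sqlite3

# In-memory database with two hypothetical tables keyed by patient_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (patient_id TEXT PRIMARY KEY, birth_year INTEGER);
    CREATE TABLE labs (patient_id TEXT, test TEXT, value REAL);
    INSERT INTO patients VALUES ('p1', 1960), ('p2', 1975);
    INSERT INTO labs VALUES ('p1', 'hba1c', 7.2), ('p1', 'hba1c', 6.8), ('p2', 'ldl', 130.0);
""")

# LEFT JOIN keeps patients with no labs; CASE + AVG reshapes tests into columns.
query = """
    SELECT p.patient_id,
           p.birth_year,
           AVG(CASE WHEN l.test = 'hba1c' THEN l.value END) AS mean_hba1c,
           AVG(CASE WHEN l.test = 'ldl'   THEN l.value END) AS mean_ldl
    FROM patients AS p
    LEFT JOIN labs AS l ON l.patient_id = p.patient_id
    GROUP BY p.patient_id, p.birth_year
    ORDER BY p.patient_id
"""
rows = conn.execute(query).fetchall()
conn.close()

for row in rows:
    print(row)
```

A test that was never measured comes back as `None` (SQL `NULL`), which is where the missing-value handling discussed later starts.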

➡️ Why is standardizing features important and how is it done?

5 Standardizing features #

Q: What does it mean to standardize features? #

A: Transforming feature values onto a common numerical scale so that features measured in different units become comparable. This is often called normalization.

Q: Why is standardization important? #

A: It prevents features with large ranges from dominating distance-based or scale-sensitive algorithms in downstream analysis.

Q: What are common standardization techniques? #

Min-max scaling: Rescales values to a range (e.g., 0 to 1)
Z-score normalization: Transforms values to have mean 0 and standard deviation 1
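
Both techniques fit in a few lines. The sketch below uses only the standard library and a hypothetical column of systolic blood pressure readings; it uses the population standard deviation, which is one of two common conventions.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

systolic_bp = [110, 120, 130, 140, 150]  # hypothetical readings in mmHg
scaled = min_max_scale(systolic_bp)
standardized = z_score(systolic_bp)
print(scaled)
print(standardized)
```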

➡️ How do we deal with having too many features?

6 Dealing with too many features #

Q: Why might you avoid using all available features in clinical data? #

A: While comprehensive data is ideal, practical concerns can necessitate reducing features.

Q: What are reasons to reduce the number of features? #

Irrelevance: Some features offer no useful signal (e.g., record access frequency)
Missingness: Some are missing in most patients
Sparsity: Too many features missing in a given patient
Redundancy: Highly correlated features interfere with some analyses
Computational cost: More features = slower analysis
Privacy: More features increase re-identification risk
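
Two of these criteria, missingness and redundancy, are easy to screen for automatically. The following is a minimal sketch under stated assumptions: the column names, values, and both thresholds (`max_missing`, `max_corr`) are hypothetical, and the redundancy check only compares columns with no missing values.

```python
from math import sqrt

# Hypothetical matrix: feature name -> column of values (None = missing).
columns = {
    "weight_kg": [70.0, 82.0, 95.0, 60.0],
    "weight_lb": [154.3, 180.8, 209.4, 132.3],
    "rare_test": [None, None, None, 1.4],
}

def missing_fraction(col):
    return sum(v is None for v in col) / len(col)

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def select_features(columns, max_missing=0.5, max_corr=0.95):
    # Step 1: drop columns that are missing in too many patients.
    kept = [n for n, c in columns.items() if missing_fraction(c) <= max_missing]
    # Step 2: drop the later column of any highly correlated, fully observed pair.
    selected = []
    for name in kept:
        col = columns[name]
        redundant = any(
            None not in col
            and None not in columns[other]
            and abs(pearson(col, columns[other])) > max_corr
            for other in selected
        )
        if not redundant:
            selected.append(name)
    return selected

selected = select_features(columns)
print(selected)
```

Here `weight_lb` is dropped as a near-duplicate of `weight_kg`, and `rare_test` is dropped because it is missing for three of four patients. Irrelevance and privacy risk still need human judgment.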

Q: How can this affect patient privacy? #

A: A rich set of features can unintentionally reveal a patient’s identity.

➡️ Where do missing values in clinical data originate?

7 The origins of missing values #

Q: Why do missing values occur in clinical datasets? #

A: Missing values often result from converting complex patient timelines into a simplified matrix format where not all features are applicable or expected.

Q: How is missingness different in routine care vs. prospective studies? #

In prospective studies, missing data usually implies an error or oversight.
In routine care, absence might mean the data was never meant to be recorded.

Q: What are three possible reasons a value is missing? #

It should have existed, but wasn’t recorded.
It’s absent due to matrix transformation — not clinically expected.
Its absence has meaning: for example, the lack of a diagnosis code implies the condition was not found.

➡️ How do we handle missing values in practice?

8 Dealing with missing values #

Q: What is imputation in clinical data analysis? #

A: Imputation is a technique used to fill in missing data using predictions based on other available information.

Q: What is column-mean imputation? #

A: A simple method where missing values are replaced with the average of non-missing values in the same column.

Q: Why might column-mean imputation be unsuitable in medicine? #

A: Because one patient’s lab value isn’t necessarily informative for another — clinical data often lack this kind of inter-patient correlation.

Q: What is a better alternative in clinical contexts? #

A: Use within-patient imputation: infer a missing value from other known values of that same patient using correlated features.
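
To make the contrast concrete, the sketch below fills a missing hemoglobin value two ways: with the column mean, and from the same patient's hematocrit using a least-squares line fitted on complete rows. The data are hypothetical, and the regression approach is only one possible way to use a correlated feature; it is not necessarily the method the module has in mind.

```python
from statistics import mean

# Hypothetical rows: (hematocrit_pct, hemoglobin_g_dl); None = missing.
rows = [(42.0, 14.0), (36.0, 12.0), (45.0, 15.0), (30.0, None)]

def column_mean_impute(rows, col):
    """Replace missing entries in one column with the mean of the observed ones."""
    observed = [r[col] for r in rows if r[col] is not None]
    fill = mean(observed)
    return [r[col] if r[col] is not None else fill for r in rows]

def regression_impute(rows, target, predictor):
    """Fill the target from the same patient's predictor via a least-squares line."""
    complete = [(r[predictor], r[target]) for r in rows
                if r[predictor] is not None and r[target] is not None]
    xs = [x for x, _ in complete]
    ys = [y for _, y in complete]
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in complete)
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(r[target])
        elif r[predictor] is not None:
            filled.append(intercept + slope * r[predictor])
        else:
            filled.append(None)  # nothing from this patient to infer from
    return filled

by_mean = column_mean_impute(rows, col=1)
by_patient = regression_impute(rows, target=1, predictor=0)
print(by_mean[3], by_patient[3])
```

For the fourth patient, the column mean gives roughly 13.7, while the estimate based on that patient's own low hematocrit gives about 10.0, which is far more plausible for that individual.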

➡️ What are the recommended practices for handling missing values?

9 Summary recommendations for missing values #

Q: Should you always impute missing values in clinical datasets? #

A: Not necessarily — it depends on the extent and distribution of missingness. Expert guidance is recommended.

Q: When is imputation appropriate? #

A: If most values are present and only a few are missing, imputation is usually a good choice.

Q: When should you drop a variable? #

A: If most patients are missing that variable, it’s better to exclude it rather than impute unreliable values.

Q: What about cases in between? #

A: There is no universal rule. Some propose using indicator variables to flag imputed values, but this is debated.
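
These recommendations can be written down as a simple per-column rule. In the sketch below, the two thresholds are illustrative assumptions, not values from the module, and column-mean filling is used only to keep the example short.

```python
from statistics import mean

def handle_missing(column, impute_below=0.1, drop_above=0.6):
    """Return (values, imputed_flags, action) for one feature column.

    Thresholds are illustrative assumptions, not recommendations from the module.
    """
    frac = sum(v is None for v in column) / len(column)
    if frac > drop_above:
        # Mostly missing: exclude the variable rather than impute it.
        return None, None, "drop"
    fill = mean(v for v in column if v is not None)
    values = [fill if v is None else v for v in column]
    # Indicator variable: 1 where the value was imputed, 0 where it was observed.
    flags = [int(v is None) for v in column]
    action = "impute" if frac <= impute_below else "impute_with_indicator"
    return values, flags, action

values, flags, action = handle_missing([1.0, None, 3.0, 5.0])
print(values, flags, action)
```

Whether to keep the indicator column in the final matrix is exactly the debated middle case, so treat the `"impute_with_indicator"` branch as one option rather than a rule.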

➡️ How are new features constructed from existing data?

10 Constructing new features #

Q: What is feature engineering in clinical data mining? #

A: The process of creating new features from existing ones, often by transformation or combination (e.g., computing BMI from height and weight).

Q: Why is feature engineering valuable? #

A: Well-designed features can significantly improve the performance of models — even more than using advanced algorithms with raw data.

Q: What are some simple examples of constructed features? #

Binary indicators (e.g., converting a count to 1 if > 0)
Aggregations or ratios
Clinical scores derived from combinations of measurements
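
A short sketch of the first two kinds of constructed feature, plus BMI as a simple derived score. The patient record and its field names are hypothetical.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2

def to_indicator(count):
    """Collapse a count into a binary flag: 1 if anything was recorded, else 0."""
    return 1 if count > 0 else 0

# Hypothetical raw features for one patient.
patient = {
    "weight_kg": 81.0,
    "height_m": 1.80,
    "n_er_visits": 3,
    "total_cholesterol": 200.0,
    "hdl": 50.0,
}

engineered = {
    "bmi": round(bmi(patient["weight_kg"], patient["height_m"]), 1),
    "any_er_visit": to_indicator(patient["n_er_visits"]),
    "chol_hdl_ratio": patient["total_cholesterol"] / patient["hdl"],
}
print(engineered)
```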

➡️ What are some practical examples of feature engineering in healthcare?

11 Examples of engineered features #

Q: What are clinical scoring systems in feature engineering? #

A: Formulas that combine multiple clinical values to estimate disease severity or health status — commonly used in risk adjustment.

Q: What is an example of a simple scoring system? #

A: Body Mass Index (BMI) — calculated from height and weight to assess obesity.

Q: What are examples of comorbidity scoring systems? #

Charlson Comorbidity Index
Elixhauser Comorbidity Index
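
Comorbidity indices of this kind are, at their core, weighted sums over condition categories. The sketch below shows that structure only. The weights follow the commonly cited original Charlson weighting for a small subset of categories; the category names are hypothetical labels, and a real implementation needs the full category list and validated diagnosis-code mappings.

```python
# Illustrative subset only; category names are hypothetical labels.
CHARLSON_SUBSET_WEIGHTS = {
    "myocardial_infarction": 1,
    "congestive_heart_failure": 1,
    "diabetes_uncomplicated": 1,
    "renal_disease_moderate_severe": 2,
    "metastatic_solid_tumor": 6,
}

def comorbidity_score(conditions, weights=CHARLSON_SUBSET_WEIGHTS):
    """Sum the weights of the distinct conditions a patient has; unknown ones count 0."""
    return sum(weights.get(c, 0) for c in set(conditions))

patient_conditions = [
    "myocardial_infarction",
    "diabetes_uncomplicated",
    "diabetes_uncomplicated",  # repeated codes should not be double counted
    "metastatic_solid_tumor",
]
score = comorbidity_score(patient_conditions)
print(score)
```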

Q: Can non-clinical features also be engineered? #

A: Yes. Proxy features can be derived, such as zip code (as a proxy for socioeconomic status) and record frequency (as a proxy for healthcare utilization).

➡️ When should you consider engineering new features?

12 When to consider engineered features #

Q: When should you engineer new features in clinical datasets? #

A: When important concepts are not directly captured in the raw data — consider proxies or derived variables to fill the gap.

Q: What strategies can guide feature creation? #

Use clinical intuition
Include counts, changes over time, or ratios
Repurpose validated scoring systems
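
As one example of the second strategy, repeated measurements on a patient timeline can be summarized as a count, a net change, and a rate of change. The sketch below uses a simple first-to-last slope rather than a fitted trend line, and the creatinine values are hypothetical.

```python
def timeline_features(measurements):
    """Summarize (day, value) pairs for one patient and one lab test."""
    ordered = sorted(measurements)
    n = len(ordered)
    if n == 0:
        return {"count": 0, "change": None, "slope_per_day": None}
    (first_day, first_val), (last_day, last_val) = ordered[0], ordered[-1]
    change = last_val - first_val
    span = last_day - first_day
    # A slope is only defined when the measurements span more than one day.
    slope = change / span if span > 0 else None
    return {"count": n, "change": change, "slope_per_day": slope}

creatinine = [(0, 1.0), (30, 1.5), (90, 2.5)]  # hypothetical (day, mg/dL)
feats = timeline_features(creatinine)
print(feats)
```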

Q: What trade-offs should be considered? #

A: Balance the benefit of a new feature against the effort required to create and validate it.

Q: Can models ever learn features automatically? #

A: Yes — deep learning methods can learn features directly from raw data, reducing the need for manual feature engineering.

➡️ What are the main takeaways about creating analysis-ready datasets?

13 Main points about creating analysis ready datasets #

Q: What is an analysis-ready dataset in clinical research? #

A: A clean, structured patient-feature matrix derived from raw clinical data and suitable for analysis.

Q: What tools are commonly used to construct it? #

A: Standard programming tools and database queries (e.g., SQL + Python).

Q: How can the number of features be reduced? #

Domain knowledge to combine or drop variables
Mathematical methods like Principal Component Analysis (PCA)
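
To show what PCA does without relying on any library, the sketch below handles the smallest possible case: two correlated features collapsed into one component, using the closed-form eigendecomposition of a 2x2 covariance matrix. The data are hypothetical; real datasets have many features and would normally use a numerical library, with features standardized first.

```python
from math import sqrt

def pca_first_component_2d(xs, ys):
    """Project two features onto their first principal component."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cx = [x - mx for x in xs]
    cy = [y - my for y in ys]
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(v * v for v in cx) / (n - 1)
    c = sum(v * v for v in cy) / (n - 1)
    b = sum(u * v for u, v in zip(cx, cy)) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    lam = (a + c) / 2 + sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Matching eigenvector; fall back to an axis when the features are uncorrelated.
    if b != 0:
        vx, vy = b, lam - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = sqrt(vx ** 2 + vy ** 2)
    vx, vy = vx / norm, vy / norm
    # One score per patient replaces the two original columns.
    scores = [u * vx + v * vy for u, v in zip(cx, cy)]
    explained = lam / (a + c)  # fraction of total variance retained
    return (vx, vy), scores, explained

direction, scores, explained = pca_first_component_2d(
    [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
)
print(direction, explained)
```

Because the second feature here is exactly twice the first, a single component retains all of the variance.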

Q: How are missing values handled? #

A: Either removed or imputed, using varying levels of complexity depending on the case.

Q: What boosts success in dataset creation? #

A: Learning the clinical context of the question — deeper medical understanding improves data transformations and feature design.

➡️ What are structured knowledge graphs, and how do they relate to datasets?

14 Structured knowledge graphs #

Q: What are the two key topics in this part of the module? #

Constructing a patient feature matrix
Using curated biomedical knowledge (via knowledge graphs)

Q: What is a knowledge graph in healthcare? #

A: A structured digital representation of biomedical entities (e.g., diseases, drugs, lab tests) and their relationships — also known as an ontology.

Q: Why are knowledge graphs useful? #

A: They encode expert knowledge in a machine-readable way, enabling smarter feature construction, search, and data linkage.

Q: What are examples of implicit prior knowledge use? #

A: Counting glucose-related test orders as a proxy for diabetes. Knowing which tests are relevant comes from domain knowledge, not from the data itself.

➡️ What exactly is contained in a biomedical knowledge graph?

15 So what exactly is in a knowledge graph #

Q: What are the core components of a biomedical knowledge graph? #

Entities: e.g., symptoms, diseases, drugs, body parts
Synonyms: mappings of equivalent terms (e.g., “heart attack” = “acute myocardial infarction”)
Relationships: logical links between entities (e.g., “is a kind of”)

Q: What is the most important type of relationship in medical knowledge graphs? #

A: The “is a kind of” relationship — it defines hierarchies (e.g., Lipitor is a kind of lipid-lowering drug).

Q: What benefit does this hierarchical structure provide? #

A: Entities inherit properties from broader categories, enabling powerful reasoning and search (e.g., querying all “lipid-lowering drugs”).
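
The three components and the hierarchy-based query can be sketched with plain dictionaries. This miniature graph is hypothetical and assumes each entity has a single parent, which real biomedical ontologies do not guarantee.

```python
# Hypothetical miniature graph: child -> parent ("is a kind of").
IS_A = {
    "atorvastatin": "statin",
    "simvastatin": "statin",
    "statin": "lipid-lowering drug",
    "ezetimibe": "lipid-lowering drug",
    "lipid-lowering drug": "drug",
}
# Synonyms map alternative names (here, brand names) to the preferred entity.
SYNONYMS = {"lipitor": "atorvastatin", "zocor": "simvastatin"}

def normalize(term):
    term = term.lower()
    return SYNONYMS.get(term, term)

def ancestors(term):
    """Walk up the 'is a kind of' links and collect every broader category."""
    term = normalize(term)
    found = []
    while term in IS_A:
        term = IS_A[term]
        found.append(term)
    return found

def descendants(category):
    """All entities that are, directly or indirectly, a kind of the category."""
    category = normalize(category)
    return sorted(t for t in IS_A if category in ancestors(t))

def is_a(term, category):
    return normalize(category) in ancestors(term)

print(descendants("lipid-lowering drug"))
print(is_a("Lipitor", "lipid-lowering drug"))
```

A single query on the broad category now finds patients on any specific drug beneath it, without listing each drug by hand.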

➡️ What are the most important knowledge graphs in biomedicine?

16 What are important knowledge graphs #

Q: Where can you find biomedical knowledge graphs? #

A: One central repository is BioPortal from the National Center for Biomedical Ontology at Stanford.

Q: What are some widely used biomedical ontologies? #

ICD (International Classification of Diseases): Maintained by WHO; used globally for diagnosis coding (ICD-9, ICD-10).
CPT (Current Procedural Terminology): Created by the AMA to categorize procedures and services, mainly for billing.

Q: Why are ICD-9 and ICD-10 both relevant today? #

A: Many clinical datasets, especially those containing older patient records, still use ICD-9 codes even though ICD-10 has since replaced it for routine coding.

➡️ How do we choose which knowledge graph to use in practice?

17 How to choose which knowledge graph to use #

Q: What should you consider when selecting a knowledge graph? #

Entity types and how they are classified
Relationship meanings between entities
Terminology: presence of synonyms and spelling variants
Interoperability: how well the graph maps to other knowledge graphs

Q: What practical method can help assess utility? #

A: Count how many terms from the knowledge graph appear in EMR text or data — this indicates how relevant and compatible it is with your dataset.
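
A minimal sketch of that coverage check, assuming the knowledge graph's terms are available as a list of strings and the EMR text as a list of notes. The terms and notes below are hypothetical, and matching is a simple case-insensitive whole-phrase search.

```python
import re

def term_coverage(terms, notes):
    """Return (fraction of terms found at least once, per-term mention counts)."""
    text = " ".join(notes).lower()
    counts = {}
    for term in terms:
        # Word boundaries avoid matching a term inside a longer word.
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        counts[term] = len(re.findall(pattern, text))
    found = sum(1 for n in counts.values() if n > 0)
    return found / len(terms), counts

graph_terms = ["myocardial infarction", "heart attack", "atorvastatin", "ezetimibe"]
notes = [
    "Pt admitted with heart attack. Started atorvastatin 40 mg.",
    "History of heart attack in 2015; continues atorvastatin.",
]
coverage, counts = term_coverage(graph_terms, notes)
print(coverage, counts)
```

In this toy case the notes use "heart attack" but never "myocardial infarction", which is the kind of mismatch that synonym coverage in a knowledge graph is meant to absorb.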