Module 4: Creating Analysis-Ready Datasets from Patient Timelines #
1 Turning clinical data into something you can analyze #
Q: What is the main objective of this module? #
A: To explain how to convert raw clinical data into a patient-feature matrix suitable for analysis and answering research questions.
Q: What is a patient-feature matrix? #
A: A structured table where each row represents a patient and each column represents a clinical feature or measurement.
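As a rough illustration of that definition, the sketch below pivots a handful of timeline events into one row per patient. The event tuples, feature names, and the "keep the last value seen" rule are all assumptions made for the example, not something the module prescribes.

```python
from collections import defaultdict

# Hypothetical timeline events: (patient_id, feature_name, value)
events = [
    ("p1", "hba1c", 7.2),
    ("p1", "systolic_bp", 140),
    ("p2", "hba1c", 5.6),
    ("p1", "hba1c", 6.8),  # a later measurement for the same patient
]

def build_matrix(events):
    """Pivot long-format events into one row per patient, keeping the last value seen."""
    features = sorted({name for _, name, _ in events})
    latest = defaultdict(dict)
    for patient_id, name, value in events:
        latest[patient_id][name] = value  # later events overwrite earlier ones
    # None marks a feature that was never recorded for that patient
    matrix = {pid: [vals.get(f) for f in features] for pid, vals in latest.items()}
    return features, matrix

features, matrix = build_matrix(events)
```

Here `features` gives the column order and `matrix` maps each patient to one row; the `None` in the second patient's row is exactly the kind of missing value discussed later in the module.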
Q: What topics will this module cover? #
A:
- Choosing and extracting features
- Managing too many or missing features
- Understanding how data transformations affect analysis outcomes
➡️ How do we define the unit of analysis in clinical datasets?
2 Defining the unit of analysis #
Q: What is the typical unit of analysis in clinical data studies? #
A: The patient — with each row in the data table representing one patient and each column a feature about them.
Q: Are there other valid units of analysis? #
A: Yes. Depending on the question, units can be drug-disease pairs, visits, procedures, or other clinical events.
Q: Can you give an example of an alternative unit? #
A: For studying off-label drug use, the unit might be a drug-disease pair, with features describing co-mentions, sequencing, and usage frequency.
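A minimal sketch of that alternative unit, assuming each visit has already been reduced to the set of drugs and the set of diseases mentioned in it. The visit data and the function name `co_mention_counts` are hypothetical; the output has one entry per drug-disease pair rather than per patient.

```python
from collections import Counter

# Hypothetical per-visit mentions: (drugs mentioned, diseases mentioned)
visits = [
    ({"gabapentin"}, {"migraine"}),
    ({"gabapentin", "metformin"}, {"type 2 diabetes", "migraine"}),
    ({"metformin"}, {"type 2 diabetes"}),
]

def co_mention_counts(visits):
    """Count, for every (drug, disease) pair, the visits in which both are mentioned."""
    counts = Counter()
    for drugs, diseases in visits:
        for drug in drugs:
            for disease in diseases:
                counts[(drug, disease)] += 1
    return counts

pair_counts = co_mention_counts(visits)
```

Each key of `pair_counts` is one row of a pair-level table; sequencing and usage-frequency features would be added as further columns.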
➡️ How do we use features and track their presence in patient data?
3 Using features and the presence of features #
Q: Should we use all possible features extracted from clinical data? #
A: Generally yes — start with a full feature set, unless constrained by computational limits or privacy considerations.
Q: What are some reasons to remove features? #
A:
- Resource limitations
- Low predictive value
- Sensitivity concerns (e.g., HIV status)
Q: How can presence or absence of data itself become a feature? #
A: Metadata such as the act of ordering a test (regardless of result) can indicate clinical concern and may be predictive on its own.
Q: What role does metadata play in feature design? #
A: It can be highly informative, capturing clinical behavior that indirectly reflects patient status.
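One way to turn the act of ordering a test into features is sketched below. The order records and the helper `order_features` are illustrative assumptions; the point is that the result column is deliberately ignored.

```python
# Hypothetical lab orders: (patient_id, test_name, result or None if no result is stored)
lab_orders = [
    ("p1", "troponin", 0.02),
    ("p1", "troponin", None),
    ("p2", "glucose", 95.0),
]

def order_features(lab_orders, patient_ids, test_name):
    """Return {patient_id: (was_ordered, order_count)} for one test, ignoring results."""
    counts = {pid: 0 for pid in patient_ids}
    for pid, name, _result in lab_orders:
        if name == test_name and pid in counts:
            counts[pid] += 1
    return {pid: (int(n > 0), n) for pid, n in counts.items()}

troponin = order_features(lab_orders, ["p1", "p2"], "troponin")
```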
➡️ How do we create features from structured clinical sources?
4 How to create features from structured sources #
Q: What are structured vs. unstructured healthcare data? #
A:
- Structured: Organized in tables with rows/columns (e.g., lab results, diagnosis codes)
- Unstructured: Free text, images, or signals
Q: Why are structured sources important? #
A: They are the most readily available and easiest to transform into features for analysis.
Q: What are the main steps in using structured data? #
A:
- Accessing the database
- Querying with SQL
- Joining tables using patient IDs
- Standardizing and reshaping features
- Handling missing or excessive features
- Optionally constructing new ones
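The first few steps above can be sketched with Python's built-in `sqlite3` module. The table names, column names, and values are invented for the example; a real clinical database will have its own schema and access rules.

```python
import sqlite3

# In-memory database with two hypothetical tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (patient_id TEXT PRIMARY KEY, birth_year INTEGER);
    CREATE TABLE labs (patient_id TEXT, test TEXT, value REAL);
    INSERT INTO patients VALUES ('p1', 1950), ('p2', 1980);
    INSERT INTO labs VALUES ('p1', 'hba1c', 7.0), ('p1', 'hba1c', 8.0), ('p2', 'ldl', 130.0);
""")

# Join on patient_id and reshape to one row per patient, one column per lab test.
# LEFT JOIN keeps patients who have no lab rows at all.
query = """
    SELECT p.patient_id,
           p.birth_year,
           AVG(CASE WHEN l.test = 'hba1c' THEN l.value END) AS mean_hba1c,
           AVG(CASE WHEN l.test = 'ldl'   THEN l.value END) AS mean_ldl
    FROM patients AS p
    LEFT JOIN labs AS l ON l.patient_id = p.patient_id
    GROUP BY p.patient_id, p.birth_year
    ORDER BY p.patient_id
"""
rows = conn.execute(query).fetchall()
conn.close()
```

Tests a patient never had come back as `NULL` (`None` in Python), so the missing-feature handling listed above still has to follow.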
➡️ Why is standardizing features important and how is it done?
5 Standardizing features #
Q: What does it mean to standardize features? #
A: Transforming feature values into a common numerical scale, often called normalization.
Q: Why is standardization important? #
A: It prevents features with large ranges from dominating distance-based or scale-sensitive algorithms in downstream analysis.
Q: What are common standardization techniques? #
A:
- Min-max scaling: Rescales values to a range (e.g., 0 to 1)
- Z-score normalization: Transforms values to have mean 0 and standard deviation 1
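Both techniques are short enough to write out directly. The sketch below uses only the standard library and a made-up list of creatinine values; the choice of the population standard deviation (`pstdev`) is an assumption, and some tools use the sample version instead.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Shift and scale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

creatinine = [0.8, 1.0, 1.2, 3.0]  # hypothetical values in mg/dL
scaled = min_max_scale(creatinine)
standardized = z_score(creatinine)
```

Note that the single large value (3.0) compresses the other min-max scaled values toward 0, which is one reason z-scores are sometimes preferred when outliers are present.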
➡️ How do we deal with having too many features?
6 Dealing with too many features #
Q: Why might you avoid using all available features in clinical data? #
A: While comprehensive data is ideal, practical concerns can necessitate reducing features.
Q: What are reasons to reduce the number of features? #
A:
- Irrelevance: Some features offer no useful signal (e.g., record access frequency)
- Missingness: Some are missing in most patients
- Sparsity: Too many features missing in a given patient
- Redundancy: Highly correlated features interfere with some analyses
- Computational cost: More features = slower analysis
- Privacy: More features increase re-identification risk
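Two of these reasons, missingness and redundancy, lend themselves to simple automated filters. The sketch below is one possible approach; the thresholds (`max_missing=0.5`, `max_corr=0.95`) and the example columns are arbitrary assumptions, not recommended values.

```python
def missing_fraction(column):
    """Share of patients for whom this feature is missing (None)."""
    return sum(v is None for v in column) / len(column)

def pearson(xs, ys):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def select_features(columns, max_missing=0.5, max_corr=0.95):
    """Keep features that are not mostly missing and not near-duplicates of a kept one."""
    kept = []
    for name, col in columns.items():
        if missing_fraction(col) > max_missing:
            continue  # missing in most patients
        if any(abs(pearson(col, columns[k])) > max_corr for k in kept):
            continue  # redundant with a feature already kept
        kept.append(name)
    return kept

columns = {  # hypothetical feature columns, one value per patient
    "weight_kg": [60.0, 75.0, 90.0, 82.0],
    "weight_lb": [132.3, 165.3, 198.4, 180.8],  # same information in other units
    "rare_assay": [None, None, None, 1.2],      # missing in 3 of 4 patients
    "age": [34, 71, 50, 45],
}
kept = select_features(columns)
```

Which of two correlated features survives depends on column order here; in practice that choice is better made with domain knowledge.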
Q: How can this affect patient privacy? #
A: A rich set of features can unintentionally reveal a patient’s identity.
➡️ Where do missing values in clinical data originate?
7 The origins of missing values #
Q: Why do missing values occur in clinical datasets? #
A: Missing values often result from converting complex patient timelines into a simplified matrix format where not all features are applicable or expected.
Q: How is missingness different in routine care vs. prospective studies? #
A:
- In prospective studies, missing data usually implies an error or oversight.
- In routine care, absence might mean the data was never meant to be recorded.
Q: What are three possible reasons a value is missing? #
A:
- It should have existed, but wasn’t recorded.
- It’s absent due to matrix transformation — not clinically expected.
- Its absence has meaning: for example, the absence of a diagnosis code can imply a negative finding.
➡️ How do we handle missing values in practice?
8 Dealing with missing values #
Q: What is imputation in clinical data analysis? #
A: Imputation is a technique used to fill in missing data using predictions based on other available information.
Q: What is column-mean imputation? #
A: A simple method where missing values are replaced with the average of non-missing values in the same column.
Q: Why might column-mean imputation be unsuitable in medicine? #
A: Because one patient's lab value is not necessarily informative about another's: clinical data often lack the inter-patient correlation that column-mean imputation implicitly assumes.
Q: What is a better alternative in clinical contexts? #
A: Use within-patient imputation: infer a missing value from other known values of that same patient using correlated features.
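The contrast between the two approaches can be sketched as follows. This is only one possible reading of within-patient imputation: a straight line is fitted on complete cases and then applied to the patient's own value of a correlated feature. The hemoglobin and hematocrit numbers are invented and chosen to lie exactly on a line.

```python
def column_mean_impute(column):
    """Replace each missing value with the mean of the observed values in the column."""
    observed = [v for v in column if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in column]

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def impute_from_correlated(target, predictor):
    """Fill gaps in `target` from the same patient's value of a correlated `predictor`."""
    complete = [(x, y) for x, y in zip(predictor, target) if x is not None and y is not None]
    slope, intercept = fit_line([x for x, _ in complete], [y for _, y in complete])
    return [
        slope * x + intercept if (y is None and x is not None) else y
        for x, y in zip(predictor, target)
    ]

hemoglobin = [10.0, 12.0, 14.0, 16.0]   # hypothetical, g/dL
hematocrit = [30.0, 36.0, 42.0, None]   # hypothetical, %; last patient is missing

by_column_mean = column_mean_impute(hematocrit)
by_own_values = impute_from_correlated(hematocrit, hemoglobin)
```

For the last patient, the column mean gives 36.0, while the estimate based on that patient's own hemoglobin is about 48.0, which is far more consistent with their other measurement.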
➡️ What are the recommended practices for handling missing values?
9 Summary recommendations for missing values #
Q: Should you always impute missing values in clinical datasets? #
A: Not necessarily — it depends on the extent and distribution of missingness. Expert guidance is recommended.
Q: When is imputation appropriate? #
A: If most values are present and only a few are missing, imputation is usually a good choice.
Q: When should you drop a variable? #
A: If most patients are missing that variable, it’s better to exclude it rather than impute unreliable values.
Q: What about cases in between? #
A: There is no universal rule. Some propose using indicator variables to flag imputed values, but this is debated.
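These recommendations can be condensed into a small helper, sketched below. The 50% cut-off and the use of the median as the fill value are placeholder assumptions; the indicator flags implement the debated "flag imputed values" idea so that a downstream model can see which entries were filled in.

```python
from statistics import median

def handle_missing(column, drop_above=0.5):
    """Return None to drop the column, else (filled_values, imputed_flags)."""
    flags = [int(v is None) for v in column]
    if sum(flags) / len(column) > drop_above:
        return None  # missing for most patients: exclude rather than impute
    fill = median(v for v in column if v is not None)
    filled = [fill if v is None else v for v in column]
    return filled, flags

sodium = [138, None, 141, 140]          # hypothetical, mostly present
rare_assay = [None, None, None, 3.1]    # hypothetical, mostly missing

sodium_result = handle_missing(sodium)
rare_result = handle_missing(rare_assay)
```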
➡️ How are new features constructed from existing data?
10 Constructing new features #
Q: What is feature engineering in clinical data mining? #
A: The process of creating new features from existing ones, often by transformation or combination (e.g., computing BMI from height and weight).
Q: Why is feature engineering valuable? #
A: Well-designed features can significantly improve the performance of models — even more than using advanced algorithms with raw data.
Q: What are some simple examples of constructed features? #
A:
- Binary indicators (e.g., converting a count to 1 if > 0)
- Aggregations or ratios
- Clinical scores derived from combinations of measurements
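A short sketch of the first two kinds of constructed feature, plus the BMI example mentioned earlier. The patient record and its field names are hypothetical; the BMI formula is the standard weight in kilograms divided by height in metres squared.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2

def to_indicator(count):
    """Collapse a count into a binary feature: 1 if anything was recorded, else 0."""
    return 1 if count > 0 else 0

# Hypothetical raw features for one patient
patient = {"weight_kg": 81.0, "height_m": 1.80, "er_visits": 3,
           "total_chol": 200.0, "hdl": 50.0}

derived = {
    "bmi": round(bmi(patient["weight_kg"], patient["height_m"]), 1),
    "any_er_visit": to_indicator(patient["er_visits"]),
    "chol_hdl_ratio": patient["total_chol"] / patient["hdl"],
}
```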
➡️ What are some practical examples of feature engineering in healthcare?
11 Examples of engineered features #
Q: What are clinical scoring systems in feature engineering? #
A: Formulas that combine multiple clinical values to estimate disease severity or health status — commonly used in risk adjustment.
Q: What is an example of a simple scoring system? #
A: Body Mass Index (BMI) — calculated from height and weight to assess obesity.
Q: What are examples of comorbidity scoring systems? #
A:
- Charlson Comorbidity Index
- Elixhauser Comorbidity Index
Q: Can non-clinical features also be engineered? #
A: Yes. Proxy features can be derived, such as zip code as a proxy for socioeconomic status and record frequency as a proxy for healthcare utilization.
➡️ When should you consider engineering new features?
12 When to consider engineered features #
Q: When should you engineer new features in clinical datasets? #
A: When important concepts are not directly captured in the raw data — consider proxies or derived variables to fill the gap.
Q: What strategies can guide feature creation? #
A:
- Use clinical intuition
- Include counts, changes over time, or ratios
- Repurpose validated scoring systems
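Counts and changes over time can be derived from a single repeated measurement, as in the sketch below. The eGFR series and the helper `trend_features` are illustrative assumptions; a first-to-last difference is the simplest possible notion of change, and a fitted slope would be a reasonable alternative.

```python
from datetime import date

def trend_features(measurements):
    """From (date, value) pairs, derive a count, first-to-last change, and change per year."""
    ordered = sorted(measurements)  # sort by date; timelines are not always in order
    count = len(ordered)
    if count < 2:
        return {"count": count, "delta": None, "delta_per_year": None}
    (first_day, first_value), (last_day, last_value) = ordered[0], ordered[-1]
    delta = last_value - first_value
    years = (last_day - first_day).days / 365.25
    return {
        "count": count,
        "delta": delta,
        "delta_per_year": delta / years if years > 0 else None,
    }

# Hypothetical eGFR measurements for one patient, deliberately out of order
egfr = [(date(2020, 1, 1), 90.0), (date(2022, 1, 1), 70.0), (date(2021, 1, 1), 80.0)]
egfr_trend = trend_features(egfr)
```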
Q: What trade-offs should be considered? #
A: Balance the benefit of a new feature against the effort required to create and validate it.
Q: Can models ever learn features automatically? #
A: Yes — deep learning methods can learn features directly from raw data, reducing the need for manual feature engineering.
➡️ What are the main takeaways about creating analysis-ready datasets?
13 Main points about creating analysis-ready datasets #
Q: What is an analysis-ready dataset in clinical research? #
A: A clean, structured patient-feature matrix derived from raw clinical data and suitable for analysis.
Q: What tools are commonly used to construct it? #
A: Standard programming tools and database queries (e.g., SQL + Python).
Q: How can the number of features be reduced? #
A:
- Domain knowledge to combine or drop variables
- Mathematical methods like Principal Component Analysis (PCA)
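To make the PCA option concrete, the sketch below finds only the first principal component, using power iteration on the covariance matrix in plain Python. This is a teaching simplification: a numerical library would normally be used, features would usually be standardized first, and the toy data (two perfectly correlated columns) is invented.

```python
def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component of `rows`."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d)
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication converges to the top eigenvector,
    # provided the starting vector is not orthogonal to it.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sum(v * v for v in nxt) ** 0.5
        if norm == 0:
            break
        vec = [v / norm for v in nxt]
    # Project each patient onto the component: one new feature replaces d old ones
    scores = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
    return vec, scores

# Hypothetical matrix: the second column is exactly twice the first
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(rows)
```

Because the two columns carry the same information, a single score per patient captures all of the variation, which is the sense in which PCA reduces the number of features.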
Q: How are missing values handled? #
A: Either removed or imputed, using varying levels of complexity depending on the case.
Q: What boosts success in dataset creation? #
A: Learning the clinical context of the question — deeper medical understanding improves data transformations and feature design.
➡️ What are structured knowledge graphs, and how do they relate to datasets?
14 Structured knowledge graphs #
Q: What are the two key topics in this part of the module? #
A:
- Constructing a patient feature matrix
- Using curated biomedical knowledge (via knowledge graphs)
Q: What is a knowledge graph in healthcare? #
A: A structured digital representation of biomedical entities (e.g., diseases, drugs, lab tests) and their relationships, often also called an ontology.
Q: Why are knowledge graphs useful? #
A: They encode expert knowledge in a machine-readable way, enabling smarter feature construction, search, and data linkage.
Q: What are examples of implicit prior knowledge use? #
A: Using counts of glucose-related test orders as a proxy for diabetes: choosing that proxy already relies on domain knowledge.
➡️ What exactly is contained in a biomedical knowledge graph?
15 So what exactly is in a knowledge graph #
Q: What are the core components of a biomedical knowledge graph? #
A:
- Entities: e.g., symptoms, diseases, drugs, body parts
- Synonyms: mappings of equivalent terms (e.g., “heart attack” = “acute myocardial infarction”)
- Relationships: logical links between entities (e.g., “is a kind of”)
Q: What is the most important type of relationship in medical knowledge graphs? #
A: The “is a kind of” relationship — it defines hierarchies (e.g., Lipitor is a kind of lipid-lowering drug).
Q: What benefit does this hierarchical structure provide? #
A: Entities inherit properties from broader categories, enabling powerful reasoning and search (e.g., querying all “lipid-lowering drugs”).
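The hierarchy query described above can be sketched with a plain dictionary of "is a kind of" edges. The tiny graph below is hand-written for illustration and assumes each entity has a single parent; real ontologies are far larger and often allow several parents per entity.

```python
# Hypothetical "is a kind of" edges: child -> parent
IS_A = {
    "atorvastatin": "statin",
    "simvastatin": "statin",
    "statin": "lipid-lowering drug",
    "ezetimibe": "lipid-lowering drug",
    "lipid-lowering drug": "drug",
    "metformin": "antidiabetic drug",
    "antidiabetic drug": "drug",
}

def ancestors(term):
    """Follow 'is a kind of' links upward and return every broader category."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain

def descendants(category):
    """Return every entity that is, directly or indirectly, a kind of `category`."""
    return sorted(t for t in IS_A if category in ancestors(t))

lipid_lowering = descendants("lipid-lowering drug")
```

A query for "lipid-lowering drug" then picks up individual statins even though no patient record mentions the category itself.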
➡️ What are the most important knowledge graphs in biomedicine?
16 What are important knowledge graphs #
Q: Where can you find biomedical knowledge graphs? #
A: One central repository is BioPortal from the National Center for Biomedical Ontology at Stanford.
Q: What are some widely used biomedical ontologies? #
A:
- ICD (International Classification of Diseases): Maintained by WHO; used globally for diagnosis coding (ICD-9, ICD-10).
- CPT (Current Procedural Terminology): Created by the AMA to categorize procedures and services, mainly for billing.
Q: Why are ICD-9 and ICD-10 both relevant today? #
A: Many clinical datasets still use ICD-9 codes even though ICD-10 is current, especially in older patient records.
➡️ How do we choose which knowledge graph to use in practice?
17 How to choose which knowledge graph to use #
Q: What should you consider when selecting a knowledge graph? #
A:
- Entity types and how they are classified
- Relationship meanings between entities
- Terminology: presence of synonyms and spelling variants
- Interoperability: how well the graph maps to other knowledge graphs
Q: What practical method can help assess utility? #
A: Count how many terms from the knowledge graph appear in EMR text or data — this indicates how relevant and compatible it is with your dataset.
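A minimal version of that check is sketched below, assuming the knowledge graph's terms (including synonyms) are available as a list of strings and the EMR text as a list of notes. Both lists are invented, and whole-phrase, case-insensitive matching is a simplification that ignores abbreviations and misspellings.

```python
import re

def term_coverage(terms, notes):
    """Return (fraction of terms found at least once in the notes, list of found terms)."""
    if not terms:
        return 0.0, []
    text = " ".join(notes).lower()
    found = [
        t for t in terms
        if re.search(r"\b" + re.escape(t.lower()) + r"\b", text)
    ]
    return len(found) / len(terms), found

# Hypothetical knowledge-graph terms and clinical notes
graph_terms = ["acute myocardial infarction", "heart attack",
               "hypertension", "lipid-lowering drug"]
notes = ["Pt with hx of heart attack in 2015.", "Hypertension, well controlled."]

coverage, found_terms = term_coverage(graph_terms, notes)
```

Here the formal term "acute myocardial infarction" is never used but its synonym "heart attack" is, which illustrates why synonym coverage matters when comparing candidate knowledge graphs.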