[Summary] Module 1: Asking and Answering Questions via Clinical Data Mining

Module 1: Asking and Answering Questions via Clinical Data Mining #

1 Introduction to the data mining workflow #

Q: What is the main goal of this course? #

A: To explain how clinical data can be used to answer research questions that improve patient and population health.

Q: What is the structure of the course? #

A: It begins with choosing meaningful research questions, followed by understanding the healthcare system, exploring data types, reviewing processing and analysis methods, and addressing bias and error.

Q: What is the data mining workflow in this context? #

A: It consists of four key steps:

Pose a research question
Identify suitable data sources
Extract and transform data
Conduct analysis using appropriate methods

Q: Why is posing the research question considered the most important step? #

A: Because all subsequent steps rely on having a clear and meaningful question to guide data selection and analysis methods.

➡️ What makes a research question useful to answer?

2 Real-life example #

Q: What clinical scenario is used to illustrate the data mining process? #

A: A teenager named Laura with systemic lupus erythematosus (SLE) develops proteinuria, pancreatitis, and antiphospholipid antibodies — raising the question of whether she should receive anticoagulant medication.

Q: What makes this case clinically significant? #

A: The complexity and severity of symptoms, especially in a teenager, highlight a rare and serious condition that lacks straightforward treatment guidelines.

Q: What is the key research question raised? #

A: Should teenagers with SLE, proteinuria, and antiphospholipid antibodies be treated with anticoagulants?

Q: What is the traditional method for answering such questions? #

A: Reviewing medical literature or relying on clinical experience.

Q: Why is data mining proposed instead? #

A: Because traditional evidence might be limited or absent for such specific populations, making retrospective data mining valuable for generating insights.

➡️ How can we find similar patients to inform treatment decisions?

3 Example: Finding similar patients #

Q: What is the next step in answering the clinical question? #

A: Extract and transform data from the electronic medical record (EMR) to find patients who match the research criteria.

Q: What challenge does EMR data present? #

A: EMR data are not organized for easy searching and often require clinical expertise to map to useful criteria, like diagnosis codes or lab values.

Q: How would one identify relevant pediatric patients? #

A: By querying based on age to find those under 18, and then filtering for those with a diagnosis code indicating systemic lupus erythematosus (SLE).
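
The two-stage filter described above can be sketched in a few lines of Python. This is only an illustration, not how any particular EMR is queried: the record layout (`patient_id`, `age`, `dx_codes`) and the sample patients are hypothetical, and the ICD-10 prefix `M32` is assumed here as the code family for SLE, which a real study would confirm with clinical experts.

```python
# Hypothetical, simplified patient records; a real EMR extract is far messier.
patients = [
    {"patient_id": "p1", "age": 15, "dx_codes": ["M32.10"]},
    {"patient_id": "p2", "age": 42, "dx_codes": ["M32.9"]},
    {"patient_id": "p3", "age": 16, "dx_codes": ["J45.909"]},
    {"patient_id": "p4", "age": 17, "dx_codes": ["E11.9", "M32.14"]},
]

SLE_CODE_PREFIX = "M32"  # assumed ICD-10 family for SLE; verify before real use


def is_pediatric(patient, max_age=18):
    """True if the patient is under max_age years old."""
    return patient["age"] < max_age


def has_sle(patient, prefix=SLE_CODE_PREFIX):
    """True if any diagnosis code starts with the SLE prefix."""
    return any(code.startswith(prefix) for code in patient["dx_codes"])


def pediatric_sle_cohort(records):
    """Filter by age first, then by diagnosis code, mirroring the text."""
    pediatric = [p for p in records if is_pediatric(p)]
    return [p["patient_id"] for p in pediatric if has_sle(p)]


cohort_ids = pediatric_sle_cohort(patients)
print(cohort_ids)  # ['p1', 'p4']
```

Matching on a code prefix rather than a single exact code is a design choice made for this sketch; which codes actually count as SLE is exactly the kind of mapping that needs clinical expertise.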

Q: How is proteinuria identified? #

A: Through lab values in urine tests.

Q: How are antiphospholipid antibodies identified? #

A: These are ideally found via numeric lab data, but often reside in unstructured text, requiring natural language processing (NLP) or manual review.
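
As a rough sketch of the unstructured-text case, the snippet below flags notes that mention an antiphospholipid-related term so that a person can review them. It is a plain keyword search, not full NLP: the search terms, the note layout, and the sample notes are all hypothetical, and negated mentions are deliberately not handled.

```python
import re

# Hypothetical search terms; a clinician would need to curate the real list.
APL_PATTERN = re.compile(
    r"antiphospholipid|anticardiolipin|lupus anticoagulant",
    re.IGNORECASE,
)

# Made-up clinician notes for illustration only.
notes = [
    {"patient_id": "p1", "text": "Labs notable for positive lupus anticoagulant."},
    {"patient_id": "p2", "text": "No rash. Urinalysis pending."},
    {"patient_id": "p4", "text": "Antiphospholipid antibodies were not detected."},
]


def flag_for_review(note_records):
    """Return sorted IDs of patients whose notes mention any search term."""
    flagged = {
        note["patient_id"]
        for note in note_records
        if APL_PATTERN.search(note["text"])
    }
    return sorted(flagged)


candidates = flag_for_review(notes)
print(candidates)  # ['p1', 'p4']
```

Patient `p4` is flagged even though the note says the antibodies were not detected, which shows why a keyword hit is only a candidate and still needs manual review or a more capable NLP step.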

➡️ How do we estimate risk in patients like Laura using this data?

4 Example: Estimating risk #

Q: What is the objective of the analysis in this case? #

A: To estimate the risk of clotting in teenagers with SLE, proteinuria, and antiphospholipid antibodies compared to the baseline risk in teenagers with just SLE.

Q: How is the patient cohort defined? #

A: By selecting patients who meet the specified criteria and identifying which of them had a blood clot event.

Q: What is a major practical challenge in this step? #

A: Extracting usable patient data from the EMR, which is often the bottleneck due to unstructured or fragmented data.

Q: What assumption is made for the purposes of this example? #

A: That the data extraction step is bypassed, allowing focus directly on the risk analysis.
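
With extraction assumed done, the comparison reduces to two proportions and their ratio: the risk of clotting in the subgroup divided by the risk in the comparison group of teenagers with SLE. The sketch below shows that arithmetic. The counts are invented purely for illustration and say nothing about real clotting risk.

```python
def risk(events, total):
    """Proportion of patients in a group who had the event."""
    if total <= 0:
        raise ValueError("group must contain at least one patient")
    if not 0 <= events <= total:
        raise ValueError("events must be between 0 and total")
    return events / total


def relative_risk(sub_events, sub_total, base_events, base_total):
    """Risk in the subgroup divided by risk in the baseline group."""
    baseline = risk(base_events, base_total)
    if baseline == 0:
        raise ValueError("baseline risk is zero; the ratio is undefined")
    return risk(sub_events, sub_total) / baseline


# Invented counts: clots among teens with SLE, proteinuria, and
# antiphospholipid antibodies, versus clots among the baseline SLE group.
subgroup_risk = risk(8, 32)
baseline_risk = risk(16, 256)
rr = relative_risk(8, 32, 16, 256)

print(f"subgroup risk: {subgroup_risk:.4f}")  # 0.2500
print(f"baseline risk: {baseline_risk:.4f}")  # 0.0625
print(f"relative risk: {rr:.1f}")             # 4.0
```

A ratio well above 1 would suggest the subgroup clots more often than the baseline group, but with cohorts this small the estimate is unstable and should be read with caution.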

➡️ How can patient data be visualized over time using timelines?

5 Putting patient data on a timeline #

Q: What is the purpose of using a patient timeline? #

A: To visually arrange and analyze a patient’s clinical data chronologically to understand event sequences and compute outcomes like risk.

Q: What types of data are placed on the timeline? #

A: Diagnosis codes, lab results, medication orders, and clinician notes from the EMR.

Q: How are relevant patients identified and flagged? #

A: Pediatric patients are filtered first, followed by those with SLE, and then those developing proteinuria and antiphospholipid antibodies.

Q: What does the timeline help determine in this context? #

A: Whether a patient developed a blood clot after the onset of each condition, enabling computation of relative risk based on timing.
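
One way to picture the timeline is as a list of dated events that is sorted and then checked for a clot occurring after each condition's onset. The sketch below assumes a hypothetical event layout of `(date, event type)` pairs with made-up event names and dates; it is not a representation used by any specific EMR.

```python
from datetime import date

# Hypothetical timeline for one patient: (date, event type) pairs.
timeline = [
    (date(2015, 9, 1), "clot"),
    (date(2014, 3, 10), "sle_diagnosis"),
    (date(2015, 1, 20), "proteinuria"),
    (date(2015, 4, 5), "antiphospholipid_positive"),
]

CONDITIONS = ("sle_diagnosis", "proteinuria", "antiphospholipid_positive")


def first_occurrence(events, event_type):
    """Earliest date of an event type, or None if it never occurs."""
    dates = [when for when, kind in events if kind == event_type]
    return min(dates) if dates else None


def clot_after(events, condition):
    """True if a clot is recorded strictly after the condition's onset."""
    onset = first_occurrence(events, condition)
    if onset is None:
        return False
    return any(kind == "clot" and when > onset for when, kind in events)


# Sorting the tuples puts the events in chronological order.
ordered = sorted(timeline)
for when, kind in ordered:
    print(when.isoformat(), kind)

flags = {condition: clot_after(timeline, condition) for condition in CONDITIONS}
print(flags)
```

Counting, across all patients, how many have a clot after a given condition supplies the numerators for the risk comparison described in the previous section.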

➡️ How do we revisit and refine the steps of the data mining workflow?

6 Revisit the data mining workflow steps #

Q: What was the clinical question in the example revisited here? #

A: Whether the risk of clotting in teenagers with SLE, proteinuria, and antiphospholipid antibodies justifies treatment with anticoagulants.

Q: What was the data source? #

A: The Electronic Medical Record (EMR).

Q: What steps were involved in extracting and transforming the data? #

A: Identifying teenagers with SLE, forming subgroups based on clinical criteria, using diagnosis codes, crafting search terms, and occasionally applying proxy terms with follow-up confirmation.

Q: How was analysis conducted? #

A: By comparing clotting risk in subgroups to the general risk in teenagers with SLE, guiding the decision to treat.

Q: What is the broader implication of this example? #

A: It introduces the core elements of the data mining workflow, which will be elaborated in the course—especially regarding accurate execution and bias mitigation.

➡️ What types of research questions can be asked using clinical data?

7 Types of research questions #

Q: Why is it important to understand types of research questions? #

A: It helps in formulating one’s own questions and critically evaluating the validity of others’ findings using clinical data.

Q: What is the simplest type of research question? #

A: Descriptive questions — they summarize data using counts or proportions, e.g., “What proportion of the population has familial hypercholesterolemia?”

Q: What comes after descriptive questions? #

A: Exploratory questions — they aim to find patterns in data, such as identifying subtypes of a disease like autism.

Q: What types of methods are used in exploratory questions? #

A: Techniques like clustering or statistical modeling to identify patterns without predefined hypotheses.
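
To make "clustering" concrete, here is a minimal k-means sketch in plain Python. It is one common clustering technique, chosen here for illustration only; the notes do not say which method the course uses. The two numeric features per patient, the data points, and the starting centroids are all hypothetical.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm with caller-supplied starting centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster
            else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Made-up patients described by two hypothetical feature scores.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (9.0, 8.5), (8.5, 9.0),
]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])

for index, cluster in enumerate(clusters):
    print(f"cluster {index}: {cluster}")
```

No labels or hypotheses are supplied; the grouping emerges from the data alone, which is what makes the question exploratory. Whether such groups correspond to meaningful disease subtypes still requires clinical interpretation.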

➡️ Which research questions are best suited for clinical data?

8 Research questions suited for clinical data #

Q: What types of questions are clinical data best suited to answer? #

A: Descriptive, exploratory, inferential, and predictive questions.

Q: What types of questions are harder to answer with clinical data? #

A: Causal and mechanistic questions, which often require carefully designed experiments and new data collection.

Q: What are the two primary goals of asking research questions in medicine? #

A: There are two:

Risk stratification: determining whether to treat a patient.
Data-driven treatment selection: deciding how best to treat a patient.

Q: Why is it important to match question type with purpose? #

A: To ensure both the research question and analysis methods are appropriate and meaningful for clinical decision-making.

➡️ How does data mining help in making treatment decisions?

9 Example: making the decision to treat #

Q: What distinction is made between the question asked and the question answered? #

A: The original question was whether to treat Laura, a specific patient. The analysis, however, answered a descriptive question — what proportion of similar patients developed clots.

Q: What kind of analysis was conducted? #

A: A descriptive risk stratification, grouping patients based on risk of clotting using historical data.

Q: What assumption underlies the application of this analysis to Laura? #

A: That Laura’s outcome is likely to mirror those of similar past patients.

Q: How does this analysis support decision-making? #

A: It helps determine whether Laura belongs in a high-risk group, supporting a treatment recommendation such as anticoagulation.

Q: What additional considerations are necessary in real life? #

A: We must also consider the risks of adverse events from the proposed treatment before making a final clinical decision.

10 Properties that make answering a research question useful #

Q: What determines whether answering a question is useful in clinical data mining? #

A: Usefulness is assessed via a checklist of aspects, not a strict formula. Key considerations include impact, actionability, and downstream effects.

Q: What is the first major factor? #

A: The number of lives affected — including the disease burden and scope of patient populations influenced by the answer.

Q: What is the second factor? #

A: The probability that the results will lead to beneficial changes for clinicians or patients, improving clinical care or outcomes.

Q: What is the third key aspect? #

A: The real-world consequences: does the answer help reduce mortality/morbidity, lower healthcare costs, increase care access, or guide medical decisions?

Q: What is an important implication of this approach? #

A: Slight rephrasing of the question can often make it significantly more useful or relevant.

➡️ How is clinical data mining used in a real-world example?