Module 1: Asking and Answering Questions via Clinical Data Mining #
1 Introduction to the data mining workflow #
Q: What is the main goal of this course? #
A: To explain how clinical data can be used to answer research questions that improve patient and population health.
Q: What is the structure of the course? #
A: It begins with choosing meaningful research questions, followed by understanding the healthcare system, exploring data types, reviewing processing and analysis methods, and addressing bias and error.
Q: What is the data mining workflow in this context? #
A: It consists of four key steps:
- Pose a research question
- Identify suitable data sources
- Extract and transform data
- Conduct analysis using appropriate methods
Q: Why is posing the research question considered the most important step? #
A: Because all subsequent steps rely on having a clear and meaningful question to guide data selection and analysis methods.
➡️ What makes a research question useful to answer?
2 Real-life example #
Q: What clinical scenario is used to illustrate the data mining process? #
A: A teenager named Laura with systemic lupus erythematosus (SLE) develops proteinuria, pancreatitis, and antiphospholipid antibodies, which raises the question of whether she should receive anticoagulant medication.
Q: What makes this case clinically significant? #
A: The complexity and severity of her symptoms, especially in a teenager, make this a rare and serious presentation for which no straightforward treatment guidelines exist.
Q: What is the key research question raised? #
A: Should teenagers with SLE, proteinuria, and antiphospholipid antibodies be treated with anticoagulants?
Q: What is the traditional method for answering such questions? #
A: Reviewing medical literature or relying on clinical experience.
Q: Why is data mining proposed instead? #
A: Because traditional evidence might be limited or absent for such specific populations, making retrospective data mining valuable for generating insights.
➡️ How can we find similar patients to inform treatment decisions?
3 Example: Finding similar patients #
Q: What is the next step in answering the clinical question? #
A: Extract and transform data from the electronic medical record (EMR) to find patients who match the research criteria.
Q: What challenge does EMR data present? #
A: EMR data are not organized for easy searching and often require clinical expertise to map to useful criteria, like diagnosis codes or lab values.
Q: How would one identify relevant pediatric patients? #
A: By querying based on age to find those under 18, and then filtering for those with a diagnosis code indicating systemic lupus erythematosus (SLE).
Q: How is proteinuria identified? #
A: Through lab values from urine tests that measure protein.
Q: How about antiphospholipid antibodies? #
A: These are ideally found via numeric lab data, but often reside in unstructured text, requiring natural language processing (NLP) or manual review.
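The filtering steps in this section can be sketched in Python. This is only an illustration under stated assumptions: the `Patient` record, its field names, the `M32` diagnosis-code prefix (the ICD-10 family for SLE; a given EMR may code it differently), the proteinuria cutoff, and the note search terms are all hypothetical and do not come from the course.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumptions for illustration only (none of these come from the course):
SLE_CODE_PREFIX = "M32"       # ICD-10 family for SLE; local coding may differ
PROTEINURIA_THRESHOLD = 0.5   # hypothetical urine protein/creatinine cutoff
APL_PATTERN = re.compile(
    r"antiphospholipid|anticardiolipin|lupus anticoagulant", re.IGNORECASE
)


@dataclass
class Patient:
    """Hypothetical, heavily simplified view of one patient's EMR data."""
    patient_id: str
    age: int
    diagnosis_codes: List[str] = field(default_factory=list)
    urine_protein_creatinine: Optional[float] = None
    notes: List[str] = field(default_factory=list)


def is_pediatric(p: Patient) -> bool:
    return p.age < 18


def has_sle(p: Patient) -> bool:
    return any(code.startswith(SLE_CODE_PREFIX) for code in p.diagnosis_codes)


def has_proteinuria(p: Patient) -> bool:
    value = p.urine_protein_creatinine
    return value is not None and value >= PROTEINURIA_THRESHOLD


def mentions_apl(p: Patient) -> bool:
    # Naive keyword search over free text. It also matches negated mentions
    # ("negative for antiphospholipid antibodies"), so hits need manual review.
    return any(APL_PATTERN.search(note) for note in p.notes)


def find_cohort(patients: List[Patient]) -> List[Patient]:
    """Apply the filters in the order described: age, SLE, proteinuria, antibodies."""
    return [
        p for p in patients
        if is_pediatric(p) and has_sle(p) and has_proteinuria(p) and mentions_apl(p)
    ]


# Made-up records, purely to exercise the filters.
patients = [
    Patient("p1", 15, ["M32.10"], 1.2, ["Positive for antiphospholipid antibodies."]),
    Patient("p2", 16, ["M32.9"], 0.1, ["Lupus anticoagulant detected."]),
    Patient("p3", 34, ["M32.9"], 2.0, ["Anticardiolipin IgG elevated."]),
    Patient("p4", 14, ["J45.909"], 0.9, ["Antiphospholipid panel ordered."]),
]
cohort_ids = [p.patient_id for p in find_cohort(patients)]
print(cohort_ids)
```

The keyword search stands in for the NLP step only loosely; as the answer above notes, unstructured text usually needs real NLP or manual review before a patient can be counted as antibody-positive.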
➡️ How do we estimate risk in patients like Laura using this data?
4 Example: Estimating risk #
Q: What is the objective of the analysis in this case? #
A: To estimate the risk of clotting in teenagers with SLE, proteinuria, and antiphospholipid antibodies compared to the baseline risk in teenagers with just SLE.
Q: How is the patient cohort defined? #
A: By selecting patients who meet the specified criteria and identifying which of them had a blood clot event.
Q: What is a major practical challenge in this step? #
A: Extracting usable patient data from the EMR, which is often the bottleneck due to unstructured or fragmented data.
Q: What assumption is made for the purposes of this example? #
A: That the data extraction step is bypassed, allowing focus directly on the risk analysis.
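Once the cohort exists, the comparison described in this section reduces to simple arithmetic: the risk in each group is the number of patients with a clot divided by the number of patients in the group, and the relative risk is the ratio of the two risks. The sketch below assumes that definition; the function names and all counts are invented for illustration and are not results from the course.

```python
def risk(events: int, total: int) -> float:
    """Proportion of patients in a group who had the event (here, a clot)."""
    if total <= 0:
        raise ValueError("group must contain at least one patient")
    if not 0 <= events <= total:
        raise ValueError("events must be between 0 and total")
    return events / total


def relative_risk(
    subgroup_events: int,
    subgroup_total: int,
    baseline_events: int,
    baseline_total: int,
) -> float:
    """Risk in the subgroup divided by risk in the baseline group."""
    baseline = risk(baseline_events, baseline_total)
    if baseline == 0:
        raise ValueError("relative risk is undefined when baseline risk is zero")
    return risk(subgroup_events, subgroup_total) / baseline


# Made-up counts: teenagers with SLE, proteinuria, and antiphospholipid
# antibodies (subgroup) versus teenagers with just SLE (baseline).
subgroup_risk = risk(events=4, total=8)
baseline_risk = risk(events=10, total=80)
rr = relative_risk(4, 8, 10, 80)
print(subgroup_risk, baseline_risk, rr)
```

With counts this small, a real analysis would also report uncertainty (for example a confidence interval) rather than a point estimate alone.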
➡️ How can patient data be visualized over time using timelines?
5 Putting patient data on a timeline #
Q: What is the purpose of using a patient timeline? #
A: To visually arrange and analyze a patient’s clinical data chronologically to understand event sequences and compute outcomes like risk.
Q: What types of data are placed on the timeline? #
A: Diagnosis codes, lab results, medication orders, and clinician notes from the EMR.
Q: How are relevant patients identified and flagged? #
A: Pediatric patients are filtered first, followed by those with SLE, and then those developing proteinuria and antiphospholipid antibodies.
Q: What does the timeline help determine in this context? #
A: Whether a patient developed a blood clot after the onset of each condition, enabling computation of relative risk based on timing.
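The timeline idea can be sketched as a list of dated events sorted chronologically, followed by a check of whether any clot falls after the first recorded occurrence of each condition. The `Event` type, the event labels, and the dates below are hypothetical; treating "onset" as the first recorded occurrence is also an assumption of this sketch.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    when: date
    kind: str  # hypothetical labels: "sle", "proteinuria", "apl_antibodies", "clot"


def build_timeline(events: List[Event]) -> List[Event]:
    """Arrange a patient's events in chronological order."""
    return sorted(events, key=lambda e: e.when)


def first_occurrence(timeline: List[Event], kind: str) -> Optional[date]:
    """Date of the earliest event of the given kind, or None if absent."""
    for event in timeline:
        if event.kind == kind:
            return event.when
    return None


def clot_after_each_onset(events: List[Event], conditions: List[str]) -> Dict[str, bool]:
    """For each condition, report whether a clot was recorded after its onset."""
    timeline = build_timeline(events)
    clot_dates = [e.when for e in timeline if e.kind == "clot"]
    result = {}
    for condition in conditions:
        onset = first_occurrence(timeline, condition)
        result[condition] = onset is not None and any(d > onset for d in clot_dates)
    return result


# Made-up events for one patient, deliberately out of order.
events = [
    Event(date(2010, 9, 15), "clot"),
    Event(date(2010, 1, 5), "sle"),
    Event(date(2011, 2, 1), "apl_antibodies"),
    Event(date(2010, 6, 1), "proteinuria"),
]
flags = clot_after_each_onset(events, ["sle", "proteinuria", "apl_antibodies"])
print(flags)
```

In this made-up patient the clot precedes the first antibody result, so it would not count as an outcome for the antibody-positive subgroup; ordering events on a timeline is what makes that distinction visible.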
➡️ How do we revisit and refine the steps of the data mining workflow?
6 Revisit the data mining workflow steps #
Q: What was the clinical question in the example revisited here? #
A: Whether the risk of clotting in teenagers with SLE, proteinuria, and antiphospholipid antibodies justifies treatment with anticoagulants.
Q: What was the data source? #
A: The Electronic Medical Record (EMR).
Q: What steps were involved in extracting and transforming the data? #
A: Identifying teenagers with SLE, forming subgroups based on clinical criteria, using diagnosis codes, crafting search terms, and occasionally applying proxy terms with follow-up confirmation.
Q: How was analysis conducted? #
A: By comparing clotting risk in subgroups to the general risk in teenagers with SLE, guiding the decision to treat.
Q: What is the broader implication of this example? #
A: It introduces the core elements of the data mining workflow, which the rest of the course elaborates on, especially how to execute each step accurately and mitigate bias.
➡️ What types of research questions can be asked using clinical data?
7 Types of research questions #
Q: Why is it important to understand types of research questions? #
A: It helps in formulating one’s own questions and critically evaluating the validity of others’ findings using clinical data.
Q: What is the simplest type of research question? #
A: Descriptive questions, which summarize data using counts or proportions, e.g., “What proportion of the population has familial hypercholesterolemia?”
Q: What comes after descriptive questions? #
A: Exploratory questions, which aim to find patterns in data, such as identifying subtypes of a disease like autism.
Q: What types of methods are used in exploratory questions? #
A: Techniques like clustering or statistical modeling to identify patterns without predefined hypotheses.
➡️ Which research questions are best suited for clinical data?
8 Research questions suited for clinical data #
Q: What types of questions are clinical data best suited to answer? #
A: Descriptive, exploratory, inferential, and predictive questions.
Q: What types of questions are harder to answer with clinical data? #
A: Causal and mechanistic questions, which often require carefully designed experiments and new data collection.
Q: What are the two primary goals of asking research questions in medicine? #
A:
- Risk stratification: determining whether to treat a patient.
- Data-driven treatment selection: deciding how best to treat a patient.
Q: Why is it important to match question type with purpose? #
A: To ensure both the research question and analysis methods are appropriate and meaningful for clinical decision-making.
➡️ How does data mining help in making treatment decisions?
9 Example: Making the decision to treat #
Q: What distinction is made between the question asked and the question answered? #
A: The original question was whether to treat Laura, a specific patient. The analysis, however, answered a descriptive question: what proportion of similar patients developed clots?
Q: What kind of analysis was conducted? #
A: A descriptive risk stratification, grouping patients based on risk of clotting using historical data.
Q: What assumption underlies the application of this analysis to Laura? #
A: That Laura’s outcome is likely to mirror those of similar past patients.
Q: How does this analysis support decision-making? #
A: It helps determine whether Laura belongs in a high-risk group, supporting a treatment recommendation such as anticoagulation.
Q: What additional considerations are necessary in real life? #
A: Before making a final clinical decision, we must also weigh the risk of adverse events from the proposed treatment.
10 Properties that make answering a research question useful #
Q: What determines whether answering a question is useful in clinical data mining? #
A: Usefulness is assessed via a checklist of aspects, not a strict formula. Key considerations include impact, actionability, and downstream effects.
Q: What is the first major factor? #
A: The number of lives affected — including the disease burden and scope of patient populations influenced by the answer.
Q: What is the second factor? #
A: The probability that the results will lead to beneficial changes for clinicians or patients, improving clinical care or outcomes.
Q: What is the third key aspect? #
A: The real-world consequences: does the answer help reduce mortality/morbidity, lower healthcare costs, increase care access, or guide medical decisions?
Q: What is an important implication of this approach? #
A: Slight rephrasing of the question can often make it significantly more useful or relevant.
➡️ How is clinical data mining used in a real-world example?