Module 2: Data Available From Healthcare Systems #
1 Review of the healthcare system #
Q: What is the focus of this module’s introduction? #
A: To demonstrate how clinical data from the healthcare system can be used to ask and answer meaningful research questions.
Q: What key topics are introduced in this session? #
A:
- Common data sources in healthcare
- Types of data generated
- Systematic inaccuracies in the data
- Strategies for working with imperfect data
Q: How does this connect to the earlier module? #
A: It builds on the discussion of research question formulation and data mining workflows by diving into real-world data availability and limitations.
➡️ What entities exist in the healthcare system and what data do they collect?
2 Review of key entities and the data they collect #
Q: Who are the key entities in the healthcare system that generate data? #
A:
- Patients: Generate little data when healthy, but more when they seek treatment.
- Search Engines: Capture data from patients researching health online.
- Healthcare Providers: Generate EMR data including diagnoses, test orders, and prescriptions.
- Pharmacies: Record dispensing data.
- Pharmaceutical Companies: Design and manufacture drugs.
- Insurers/Payers: Capture claims and billing data.
Q: What kind of data is produced at each step of care? #
A: Data are produced at nearly every step: search behavior before treatment, EMR entries during care, dispensing records from pharmacies, and payment data from insurers.
➡️ Who are the actors in the healthcare system and what interests do they have?
3 Actors with different interests #
Q: Who are the main actors in the healthcare system? #
A:
- Patients and their families
- Healthcare professionals
- Payers (e.g. insurance companies)
- Government and regulatory agencies
Q: What are the interests of these different actors? #
A:
- Patients want to stay healthy and recover quickly.
- Clinicians seek best practices to maximize benefit and minimize harm.
- Payers and regulators aim for societal cost-effectiveness, fairness, and justice.
Q: How do differing interests create conflict? #
A: A patient may want to pursue a high-risk treatment because it offers hope, a clinician may focus on safety, and an insurer may prioritize cost-effectiveness. These competing priorities create tension in care decisions.
➡️ What are the common data types used in healthcare?
4 Common data types in Healthcare #
Q: What are the major types of healthcare data? #
A: Structured and unstructured data.
Q: What is structured data? #
A: Data organized in consistent formats like tables with rows (patients) and columns (attributes such as age, DOB). Missing values are often marked with placeholders like “NA”.
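As a minimal sketch of such a table, the Python snippet below parses a small, made-up patient table in which missing values are written as "NA". The column names and values are illustrative assumptions, not taken from any real dataset.

```python
import csv
import io

# Hypothetical structured table: one row per patient, one column per attribute.
RAW = """patient_id,age,sex,systolic_bp
P001,54,F,132
P002,NA,M,118
P003,71,F,NA
"""

def parse_value(text):
    """Turn the 'NA' placeholder into None; convert digit strings to int."""
    if text == "NA":
        return None
    return int(text) if text.isdigit() else text

rows = [
    {column: parse_value(value) for column, value in record.items()}
    for record in csv.DictReader(io.StringIO(RAW))
]

# Average age over the patients whose age is actually recorded.
known_ages = [row["age"] for row in rows if row["age"] is not None]
mean_age = sum(known_ages) / len(known_ages)

print(rows[1])
print(mean_age)
```

Converting the placeholder to `None` before any arithmetic keeps a missing value from being treated as a number or silently dropped without notice.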
Q: What is unstructured data? #
A: Data without a uniform format, including clinical notes, images, biomedical signals, and free text.
Q: Why is this distinction important? #
A: Structured data is easier to analyze computationally, while unstructured data requires specialized methods (e.g. NLP, image processing) to extract insights.
➡️ What are the strengths and weaknesses of observational healthcare data?
5 Strengths and weaknesses of observational data #
Q: What are observational data in healthcare? #
A: Data collected during routine clinical care, primarily to deliver care rather than to support research. Analysis is a secondary use.
Q: What are the main strengths of observational data? #
A:
- Large scale: Can cover millions to billions of records, enough to capture rare events.
- Real-world relevance: Reflects actual clinical practices and patient behaviors.
- Efficiency: Already collected, so no additional data gathering is needed.
Q: What are the weaknesses of observational data? #
A:
- Bias and incompleteness: The data were not collected with research goals in mind.
- Lack of standardization: Content and format vary across sites and systems.
- Potential confounding: Causal inference is difficult without randomization.
➡️ What are the biases and errors introduced by the healthcare system itself?
6 Bias and error from the healthcare system perspective #
Q: How can data from the healthcare system be biased or inaccurate? #
A: Each entity in the healthcare system contributes potential sources of bias, especially due to who seeks care and how care is recorded.
Q: What is selection bias in this context? #
A: It occurs when only certain patients (e.g., those who seek care) are recorded, leaving out healthy individuals or those managing illness at home or outside the system.
Q: What patient-level factors influence bias? #
A: Health literacy, financial situation, insurance coverage, and cultural beliefs can all affect whether and how patients engage with the healthcare system.
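The effect of selection bias on a simple estimate can be illustrated with a short calculation. Every number below (population size, true prevalence, care-seeking rates) is an assumption invented for this sketch; the point is only that records drawn from people who seek care can overstate how common a condition is.

```python
# Hypothetical population; every number here is an assumption for illustration.
population = 10_000
sick = 1_000                  # true number of people with the condition
healthy = population - sick

# Assumed care-seeking rates: sick people visit a clinic far more often.
p_visit_if_sick = 0.60
p_visit_if_healthy = 0.10

recorded_sick = sick * p_visit_if_sick            # about 600 appear in the records
recorded_healthy = healthy * p_visit_if_healthy   # about 900 appear in the records

true_prevalence = sick / population
observed_prevalence = recorded_sick / (recorded_sick + recorded_healthy)

print(f"true prevalence:     {true_prevalence:.0%}")
print(f"observed prevalence: {observed_prevalence:.0%}")
```

Under these assumed rates the recorded data suggest a prevalence of about 40%, four times the true 10%, because healthy people who never visit leave no record.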
➡️ How do biases and errors affect recorded exposures and outcomes?
7 Bias and error of exposures and outcomes #
Q: What are exposures and outcomes in clinical data analysis? #
A:
- Exposures are events or conditions that happen to a patient (e.g., diseases, procedures, medications).
- Outcomes are events or conditions assessed after the exposure (e.g., complications, lab results, costs).
Q: Why is distinguishing exposures and outcomes important? #
A: It provides a framework for analyzing how prior events influence later results and helps in identifying where bias or error may occur.
Q: What type of biases can arise in this context? #
A: Misclassification, timing errors, or missing data related to either exposures or outcomes can distort analysis and conclusions.
➡️ How can patient exposure be misclassified in clinical data?
8 How a patient exposure might be misclassified #
Q: How can exposure to medication be misclassified in clinical data? #
A: Misclassification occurs when prescription records don’t align with actual medication use, for example because of delays in filling, use of free samples, or interruptions in adherence.
Q: What example illustrates this misclassification? #
A: A doctor gives a 15-day free sample before a prescription is filled. The patient begins treatment immediately, but data may only show the pharmacy fill date, not the true start date.
Q: Why does this matter? #
A: Misclassification of the timing or existence of exposure can distort analyses linking exposures to outcomes, especially for time-sensitive effects.
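The free-sample scenario can be sketched with a few dates. The dates and the helper `is_exposed` are hypothetical, chosen only to show how using the pharmacy fill date as the start of exposure can flip a patient's exposure status for an early event.

```python
from datetime import date, timedelta

# Hypothetical dates for the free-sample scenario described above.
sample_given = date(2024, 3, 1)       # patient starts the 15-day free sample
sample_days = 15
first_fill = sample_given + timedelta(days=sample_days)  # first pharmacy record
event_date = date(2024, 3, 10)        # an outcome occurs during the sample period

def is_exposed(start, event):
    """A patient counts as exposed if the event falls on or after the start date."""
    return event >= start

# Using the pharmacy fill date, the event appears to precede treatment.
exposed_per_records = is_exposed(first_fill, event_date)
# Using the true start date, the event occurred while on treatment.
exposed_in_reality = is_exposed(sample_given, event_date)

print(first_fill, exposed_per_records, exposed_in_reality)
```

In this sketch the records classify the patient as unexposed at the time of the event, even though treatment had already begun nine days earlier.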
➡️ How might a patient outcome be misclassified in clinical records?
9 How a patient outcome could be misclassified #
Q: How can a patient’s outcome be misclassified in the medical record? #
A: Sometimes a diagnosis code (e.g., for diabetes) is added on suspicion before the condition is confirmed, and it may remain in the record even if the diagnosis is later ruled out.
Q: What strategies help reduce outcome misclassification? #
A:
- Require multiple instances of a diagnosis code.
- Pair diagnosis codes with procedure codes specific to the condition.
- Look for treatment or intervention evidence supporting the diagnosis.
Q: Why are procedure codes more reliable? #
A: Procedures are typically documented only if they were actually performed, adding confirmatory weight to a diagnosis.
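The three strategies above can be combined into a simple rule, sketched below. This is one possible rule under stated assumptions, not a validated phenotype definition: E11.9 and E11.65 are ICD-10-CM type 2 diabetes codes used as examples, while the procedure code, the drug list, the record layout, and the function name `likely_has_condition` are hypothetical.

```python
from datetime import date

# Illustrative code sets. E11.9 and E11.65 are ICD-10-CM type 2 diabetes codes;
# the procedure code and the drug list are placeholders chosen for this sketch.
DIAGNOSIS_CODES = {"E11.9", "E11.65"}
CONFIRMING_PROCEDURES = {"PROC_DM_SPECIFIC"}   # hypothetical procedure code
CONFIRMING_TREATMENTS = {"metformin", "insulin glargine"}

def likely_has_condition(record, min_dx_dates=2):
    """Accept repeated diagnosis codes, or one code backed by a procedure or treatment."""
    dx_dates = {day for code, day in record["diagnoses"] if code in DIAGNOSIS_CODES}
    if len(dx_dates) >= min_dx_dates:
        return True
    has_procedure = any(p in CONFIRMING_PROCEDURES for p in record["procedures"])
    has_treatment = any(m in CONFIRMING_TREATMENTS for m in record["medications"])
    return bool(dx_dates) and (has_procedure or has_treatment)

# Hypothetical patient records.
patients = {
    "suspected_only": {
        "diagnoses": [("E11.9", date(2024, 1, 5))],
        "procedures": [],
        "medications": [],
    },
    "repeated_codes": {
        "diagnoses": [("E11.9", date(2024, 1, 5)), ("E11.65", date(2024, 4, 2))],
        "procedures": [],
        "medications": [],
    },
    "code_plus_treatment": {
        "diagnoses": [("E11.9", date(2024, 2, 1))],
        "procedures": [],
        "medications": ["metformin"],
    },
}

results = {name: likely_has_condition(rec) for name, rec in patients.items()}
print(results)
```

A single unconfirmed code is not enough under this rule, which guards against the "suspected but later ruled out" case at the cost of missing some true cases with sparse records.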
➡️ What are the key sources of electronic medical record data?
10 Electronic medical record data #
Q: What are electronic medical records (EMRs)? #
A: EMRs are digital versions of traditional paper patient charts. They store detailed information collected in clinical settings.
Q: What types of data are typically stored in EMRs? #
A:
- Patient demographics
- Diagnosis and procedure codes
- Clinical notes
- Medication records
- Imaging and lab test results
- (Increasingly) genetic test results and wearable device data
Q: How are EMRs generated? #
A: As a byproduct of routine clinical care and documentation processes in hospitals and clinics.
Q: Are EMRs and EHRs the same? #
A: The terms are often used interchangeably, though technically EHR may imply a broader, longitudinal view across providers.
➡️ What can we learn from claims data in healthcare systems?
11 Claims data #
Q: What are claims data in healthcare? #
A: Claims data are records generated when healthcare providers submit bills to insurers for services rendered.
Q: What information do claims typically include? #
A:
- Patient identifiers
- Insurance status
- Diagnosis and procedure codes
- Requested charges and actual payments
Q: How is coding related to billing? #
A: Specific codes are used to categorize services for billing. These may differ between clinician entries, hospital submissions, and what insurers reimburse.
Q: Why are claims data valuable? #
A: They include detailed information on costs, utilization, and provider-payer interactions — data often missing from EMRs.
➡️ What data do pharmacies provide, and how is it used?
12 Pharmacy #
Q: What kind of data do pharmacies provide? #
A: They document when prescriptions are written, filled, and paid for — offering insight into medication access and fulfillment.
Q: Why is pharmacy data valuable? #
A: It goes beyond prescription intent (EMR) to show that a patient actually obtained the medication, which is a step closer to actual use.
Q: What are limitations of pharmacy data? #
A: It doesn’t guarantee the patient took the medication — only that it was picked up.
Q: How can pharmacy records be fragmented? #
A: Patients may use multiple sources: retail chains (e.g., CVS), mail-order services, and online pharmacies, dispersing data across datasets.
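To illustrate why fragmentation matters, the sketch below counts the days on which a patient had medication on hand, first using only one pharmacy source and then using all sources. The fill records, source labels, observation window, and the helper `covered_days` are assumptions made for this example, and a days-covered count is just one simple way to summarize fills.

```python
from datetime import date, timedelta

# Hypothetical fill records: (source, fill_date, days_supply).
fills = [
    ("retail", date(2024, 1, 1), 30),
    ("mail_order", date(2024, 1, 31), 30),
    ("retail", date(2024, 3, 1), 30),
]

def covered_days(records, start, end):
    """Count distinct days in [start, end] on which the patient had supply."""
    days = set()
    for _source, fill_date, supply in records:
        for offset in range(supply):
            day = fill_date + timedelta(days=offset)
            if start <= day <= end:
                days.add(day)
    return len(days)

window_start, window_end = date(2024, 1, 1), date(2024, 3, 30)
window_length = (window_end - window_start).days + 1

# A dataset that only sees the retail chain misses the mail-order fill.
retail_only = [f for f in fills if f[0] == "retail"]
coverage_retail = covered_days(retail_only, window_start, window_end) / window_length
coverage_all = covered_days(fills, window_start, window_end) / window_length

print(f"retail only: {coverage_retail:.0%}, all sources: {coverage_all:.0%}")
```

With only the retail data, the patient appears to have a month-long gap in supply; combining sources shows continuous coverage over the window.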
➡️ What are surveillance datasets and registries, and how are they used?
13 Surveillance datasets and Registries #
Q: What are surveillance datasets and why are they important? #
A: They monitor adverse events and side effects of drugs or devices after approval (post-marketing surveillance) to catch safety issues early.
Q: What are examples of surveillance systems in the U.S.? #
A:
- FAERS: FDA Adverse Event Reporting System
- MAUDE: Manufacturer and User Facility Device Experience database
Q: Who uses these datasets and for what purpose? #
A: Government agencies like the FDA and CDC use them to track disease outbreaks (e.g. flu, Ebola) and monitor product safety. Local and state agencies often assist.
Q: What are registries? #
A: Registries are organized systems maintained by agencies or societies to collect consistent clinical data on specific conditions, devices, or populations.
➡️ What are population health datasets and how are they used?
14 Population health data sets #
Q: How do population health datasets differ from patient-centric data? #
A: They focus on aggregated health trends, costs, and resource use across populations — not on individual patient records.
Q: What are key examples of U.S. population health datasets? #
A:
- National Inpatient Sample (NIS): Tracks hospital resource utilization, costs, and outcomes.
- Medical Expenditure Panel Survey (MEPS): Surveys patients, providers, and employers on healthcare usage and spending.
- NHANES (National Health and Nutrition Examination Survey): Measures demographic, nutritional, and health variables through national interviews and physical examinations conducted by the CDC.
Q: Why are these datasets useful? #
A: They offer broad insights into national health patterns, disparities, and costs — helping guide public policy and resource planning.
➡️ How do we assess if a healthcare data source is useful?
15 A framework to assess if a data source is useful #
Q: What questions should you ask to assess a healthcare dataset’s usefulness? #
A:
- Is there a well-documented data model?
  - Poor or missing documentation makes the data hard to use.
- What is the data provenance?
  - Understand how and where the data were collected.
- Are the data accessible and in what form?
  - Consider legal restrictions and costs.
- What known errors or missingness exist?
  - Evaluate data quality and be prepared to address gaps.
- Are data standards used (e.g., vocabularies or formats)?
  - Standardization affects interoperability and analysis readiness.
Q: Why is this framework important? #
A: It helps researchers avoid costly or infeasible data efforts and ensures they can trust and interpret results from the dataset effectively.