[Summary] Module2: Data Available From Healthcare Systems


1 Review of the healthcare system #

Q: What is the focus of this module’s introduction? #

A: To demonstrate how clinical data from the healthcare system can be used to ask and answer meaningful research questions.

Q: What key topics are introduced in this session? #

Common data sources in healthcare
Types of data generated
Systematic inaccuracies in the data
Strategies for working with imperfect data

Q: How does this connect to the earlier module? #

A: It builds on the discussion of research question formulation and data mining workflows by diving into real-world data availability and limitations.

➡️ What entities exist in the healthcare system and what data do they collect?

2 Review of key entities and the data they collect #

Q: Who are the key entities in the healthcare system that generate data? #

Patients : Generate little data when healthy, but more when they seek treatment.
Search Engines : Capture data from patients researching health online.
Healthcare Providers : Generate EMR data including diagnoses, test orders, and prescriptions.
Pharmacies : Record dispensing data.
Pharmaceutical Companies : Design and manufacture drugs.
Insurers/Payers : Capture claims and billing data.

Q: What kind of data is produced at each step of care? #

A: Data are produced at nearly every step — from search behavior before treatment, to EMR entries during care, to dispensing records from pharmacies, and payment data from insurers.

➡️ Who are the actors in the healthcare system and what interests do they have?

3 Actors with different interests #

Q: Who are the main actors in the healthcare system? #

Patients and their families
Healthcare professionals
Payers (e.g. insurance companies)
Government and regulatory agencies

Q: What are the interests of these different actors? #

Patients want to stay healthy and recover quickly.
Clinicians seek best practices to maximize benefit and minimize harm.
Payers and regulators aim for societal cost-effectiveness, fairness, and justice.

Q: How do differing interests create conflict? #

A: A patient may want to pursue a high-risk treatment for hope, while a clinician may focus on safety, and insurers may prioritize cost-effectiveness — leading to tension in care decisions.

➡️ What are the common data types used in healthcare?

4 Common data types in Healthcare #

Q: What are the major types of healthcare data? #

A: Structured and unstructured data.

Q: What is structured data? #

A: Data organized in consistent formats like tables with rows (patients) and columns (attributes such as age, DOB). Missing values are often marked with placeholders like “NA”.
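As a minimal sketch of what such a table looks like in code, the snippet below reads a small made-up patient table with Python's standard `csv` module and converts the "NA" placeholder to `None`. The column names, values, and the `load_patients` helper are hypothetical and only for illustration.

```python
import csv
import io

# Hypothetical structured table: one row per patient, one column per attribute.
RAW_TABLE = """patient_id,age,dob
1,54,1970-03-12
2,NA,1988-11-02
3,41,NA
"""

def load_patients(text, missing_marker="NA"):
    """Parse CSV text into a list of dicts, mapping the missing marker to None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({key: (None if value == missing_marker else value)
                     for key, value in row.items()})
    return rows

patients = load_patients(RAW_TABLE)

# IDs of patients whose age is missing.
missing_age = [p["patient_id"] for p in patients if p["age"] is None]
print(missing_age)
```

Converting placeholders to an explicit missing value early makes it harder to treat "NA" as a real category or number by accident later in the analysis.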

Q: What is unstructured data? #

A: Data without a uniform format, including clinical notes, images, biomedical signals, and free text.

Q: Why is this distinction important? #

A: Structured data is easier to analyze computationally, while unstructured data requires specialized methods (e.g. NLP, image processing) to extract insights.
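To make the contrast concrete, here is a rough sketch of pulling one structured value out of free text with a regular expression. The note text, the pattern, and the `extract_blood_pressure` helper are all made up for illustration. Real clinical NLP is considerably more involved than a single pattern.

```python
import re

# Hypothetical free-text note; real notes are far messier.
NOTE = "Pt reports fatigue. BP 142/91 today. Started lisinopril 10 mg daily."

# Simple pattern for a reading written as "BP <systolic>/<diastolic>".
BP_PATTERN = re.compile(r"\bBP\s+(\d{2,3})/(\d{2,3})\b")

def extract_blood_pressure(note):
    """Return (systolic, diastolic) as integers, or None if no reading is found."""
    match = BP_PATTERN.search(note)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

reading = extract_blood_pressure(NOTE)
print(reading)
```

A pattern like this silently misses any note that phrases the reading differently, which is one reason unstructured data needs more specialized methods.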

➡️ What are the strengths and weaknesses of observational healthcare data?

5 Strengths and weaknesses of observational data #

Q: What are observational data in healthcare? #

A: Data collected during routine clinical care for the primary purpose of delivering care — not for research. They’re used secondarily for analysis.

Q: What are the main strengths of observational data? #

Large scale : Can cover millions to billions of records, enough to capture rare events.
Real-world relevance : Reflects actual clinical practices and patient behaviors.
Efficiency : Already collected, so no additional data gathering is needed.

Q: What are the weaknesses of observational data? #

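Bias and incompleteness : The data were not collected with research goals in mind, so variables a study needs may be missing or inconsistently recorded.
Lack of standardization : Coding practices and formats vary across sites and systems.
Potential confounding : Causal inference is difficult without randomization, because treated and untreated patients may differ in ways that also affect outcomes.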

➡️ What are the biases and errors introduced by the healthcare system itself?

6 Bias and error from the healthcare system perspective #

Q: How can data from the healthcare system be biased or inaccurate? #

A: Each entity in the healthcare system contributes potential sources of bias, especially due to who seeks care and how care is recorded.

Q: What is selection bias in this context? #

A: It occurs when only certain patients (e.g., those who seek care) are recorded, leaving out healthy individuals or those managing illness at home or outside the system.
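A small, fully made-up example can show how this distorts an estimate. The population, the flags, and the `prevalence` helper below are hypothetical. The point is only that prevalence computed from people who appear in the records can differ sharply from prevalence in the whole population.

```python
def prevalence(people):
    """Fraction of people in the list who have the condition."""
    return sum(p["has_condition"] for p in people) / len(people)

# Hypothetical population of 10 people. Only those who sought care are recorded.
POPULATION = (
    [{"has_condition": True, "sought_care": True}] * 2
    + [{"has_condition": True, "sought_care": False}] * 1
    + [{"has_condition": False, "sought_care": True}] * 1
    + [{"has_condition": False, "sought_care": False}] * 6
)

# What the healthcare system actually sees.
recorded = [p for p in POPULATION if p["sought_care"]]

true_prevalence = prevalence(POPULATION)    # 3 of 10 people
recorded_prevalence = prevalence(recorded)  # 2 of 3 recorded people
print(true_prevalence, recorded_prevalence)
```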

Q: What patient-level factors influence bias? #

A: Health literacy, financial situation, insurance coverage, and cultural beliefs can all affect whether and how patients engage with the healthcare system.

➡️ How do biases and errors affect recorded exposures and outcomes?

7 Bias and error of exposures and outcomes #

Q: What are exposures and outcomes in clinical data analysis? #

Exposures are events or conditions that a patient experiences (e.g., diseases, procedures, medications).
Outcomes are events or conditions assessed after the exposure (e.g., complications, lab results, costs).

Q: Why is distinguishing exposures and outcomes important? #

A: It provides a framework for analyzing how prior events influence later results and helps in identifying where bias or error may occur.

Q: What type of biases can arise in this context? #

A: Misclassification, timing errors, or missing data related to either exposures or outcomes can distort analysis and conclusions.

➡️ How can patient exposure be misclassified in clinical data?

8 How a patient exposure might be misclassified #

Q: How can exposure to medication be misclassified in clinical data? #

A: Exposure is misclassified when prescription records don’t align with actual medication use, for example because of delays in filling, use of free samples, or interruptions in adherence.

Q: What example illustrates this misclassification? #

A: A doctor gives a 15-day free sample before a prescription is filled. The patient begins treatment immediately, but data may only show the pharmacy fill date, not the true start date.

Q: Why does this matter? #

A: Misclassification of the timing or existence of exposure can distort analyses linking exposures to outcomes, especially for time-sensitive effects.
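The free-sample example can be sketched with simple date arithmetic. The dates and the `true_exposure_start` helper are hypothetical. The sketch assumes the patient started the sample immediately, took it daily, and filled the prescription the day the sample ran out.

```python
from datetime import date, timedelta

def true_exposure_start(fill_date, sample_days):
    """Estimate the actual start date when a free sample preceded the first fill.

    Assumes (for illustration only) that the sample was started immediately
    and that the prescription was filled on the day the sample ran out.
    """
    return fill_date - timedelta(days=sample_days)

# Hypothetical dates: the pharmacy record only shows the fill date.
recorded_fill = date(2024, 3, 16)
estimated_start = true_exposure_start(recorded_fill, sample_days=15)
outcome_date = date(2024, 3, 10)

# Judged by the fill date, the outcome appears to precede exposure.
looks_unexposed = outcome_date < recorded_fill
# Judged by the estimated start date, the outcome follows exposure.
actually_exposed = estimated_start <= outcome_date
print(estimated_start, looks_unexposed, actually_exposed)
```

In this made-up case, an analysis keyed to the fill date would treat the event as occurring before the drug was started, even though the patient had already been taking it.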

➡️ How might a patient outcome be misclassified in clinical records?

9 How a patient outcome could be misclassified #

Q: How can a patient’s outcome be misclassified in the medical record? #

A: Sometimes a diagnosis code (e.g., for diabetes) is added based on suspicion before the condition is confirmed, and it may remain in the record even if the diagnosis is later ruled out.

Q: What strategies help reduce outcome misclassification? #

Require multiple instances of a diagnosis code.
Pair diagnosis codes with procedure codes specific to the condition.
Look for treatment or intervention evidence supporting the diagnosis.

Q: Why are procedure codes more reliable? #

A: Procedures are typically documented only if they were actually performed, adding confirmatory weight to a diagnosis.
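The first two strategies above can be sketched as a simple rule over coded records. The record layout, the placeholder code labels, and the `confirmed_patients` helper are assumptions made for illustration, not real billing codes or an established phenotype definition.

```python
from collections import Counter

# Hypothetical coded records: (patient_id, code_type, code).
# "DX_DIABETES" and "PX_DIABETES_SPECIFIC" are placeholder labels.
RECORDS = [
    ("A", "diagnosis", "DX_DIABETES"),
    ("A", "diagnosis", "DX_DIABETES"),
    ("B", "diagnosis", "DX_DIABETES"),  # single code, possibly a rule-out
    ("C", "diagnosis", "DX_DIABETES"),
    ("C", "procedure", "PX_DIABETES_SPECIFIC"),
]

def confirmed_patients(records, dx_code, px_code, min_dx_count=2):
    """Return patients with repeated diagnosis codes, or one code plus a supporting procedure."""
    dx_counts = Counter(
        pid for pid, kind, code in records
        if kind == "diagnosis" and code == dx_code
    )
    has_procedure = {
        pid for pid, kind, code in records
        if kind == "procedure" and code == px_code
    }
    return sorted(
        pid for pid, count in dx_counts.items()
        if count >= min_dx_count or pid in has_procedure
    )

cases = confirmed_patients(RECORDS, "DX_DIABETES", "PX_DIABETES_SPECIFIC")
print(cases)
```

Patient B, who has only a single diagnosis code and no supporting procedure, is left out. A stricter rule trades fewer false positives for more missed true cases.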

➡️ What are the key sources of electronic medical record data?

10 Electronic medical record data #

Q: What are electronic medical records (EMRs)? #

A: EMRs are digital versions of the traditional paper patient charts. They store detailed information collected in clinical settings.

Q: What types of data are typically stored in EMRs? #

Patient demographics
Diagnosis and procedure codes
Clinical notes
Medication records
Imaging and lab test results
(Increasingly) genetic test results and wearable device data

Q: How are EMRs generated? #

A: As a byproduct of routine clinical care and documentation processes in hospitals and clinics.

Q: Are EMRs and EHRs the same? #

A: The terms are often used interchangeably, though technically EHR may imply a broader, longitudinal view across providers.

➡️ What can we learn from claims data in healthcare systems?

11 Claims data #

Q: What are claims data in healthcare? #

A: Claims data are records generated when healthcare providers submit bills to insurers for services rendered.

Q: What information do claims typically include? #

Patient identifiers
Insurance status
Diagnosis and procedure codes
Requested charges and actual payments

A: Specific codes are used to categorize services for billing. The codes may differ between what clinicians enter, what hospitals submit, and what insurers reimburse.

Q: Why are claims data valuable? #

A: They include detailed information on costs, utilization, and provider-payer interactions — data often missing from EMRs.

➡️ What data do pharmacies provide, and how is it used?

12 Pharmacy #

Q: What kind of data do pharmacies provide? #

A: They document when prescriptions are written, filled, and paid for — offering insight into medication access and fulfillment.

Q: Why is pharmacy data valuable? #

A: It goes beyond prescription intent (EMR) to show that a patient actually obtained the medication, which is a step closer to actual use.

Q: What are limitations of pharmacy data? #

A: It doesn’t guarantee the patient took the medication — only that it was picked up.

Q: How can pharmacy records be fragmented? #

A: Patients may use multiple sources: retail chains (e.g., CVS), mail-order services, and online pharmacies, dispersing data across datasets.
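As a rough sketch, rebuilding one patient's fill history means combining records from each source and ordering them by date. The source names, records, and the `combine_fills` helper are hypothetical. The sketch also assumes the sources share a patient identifier, which often does not hold in practice.

```python
from datetime import date

# Hypothetical fill records for the same patient from three separate sources.
RETAIL = [{"patient_id": "P1", "drug": "metformin", "fill_date": date(2024, 1, 5)}]
MAIL_ORDER = [{"patient_id": "P1", "drug": "metformin", "fill_date": date(2024, 2, 4)}]
ONLINE = [{"patient_id": "P1", "drug": "metformin", "fill_date": date(2024, 3, 6)}]

def combine_fills(sources):
    """Merge fills from a {source_name: records} mapping into one list sorted by fill date."""
    combined = []
    for source_name, records in sources.items():
        for record in records:
            # Copy the record and tag it with where it came from.
            combined.append({**record, "source": source_name})
    return sorted(combined, key=lambda r: r["fill_date"])

history = combine_fills({"retail": RETAIL, "mail_order": MAIL_ORDER, "online": ONLINE})
for fill in history:
    print(fill["fill_date"], fill["source"])
```

With only one of the three sources, this patient would appear to have filled a single prescription and then stopped.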

➡️ What are surveillance datasets and registries, and how are they used?

13 Surveillance datasets and Registries #

Q: What are surveillance datasets and why are they important? #

A: They collect reports of adverse events and side effects of drugs or devices after approval (post-marketing surveillance) so that safety issues can be caught early.

Q: What are examples of surveillance systems in the U.S.? #

FAERS : FDA Adverse Event Reporting System
MAUDE : Manufacturer and User Facility Device Experience database

Q: Who uses these datasets and for what purpose? #

A: Government agencies like the FDA and CDC use them to track disease outbreaks (e.g. flu, Ebola) and monitor product safety. Local and state agencies often assist.

Q: What are registries? #

A: Registries are organized systems maintained by agencies or societies to collect consistent clinical data on specific conditions, devices, or populations.

➡️ What are population health datasets and how are they used?

14 Population health data sets #

Q: How do population health datasets differ from patient-centric data? #

A: They focus on aggregated health trends, costs, and resource use across populations — not on individual patient records.

Q: What are key examples of U.S. population health datasets? #

National Inpatient Sample (NIS) : Tracks hospital resource utilization, costs, and outcomes.
Medical Expenditure Panel Survey (MEPS) : Surveys patients, providers, and employers on healthcare usage and spending.
NHANES (National Health and Nutrition Examination Survey) : Measures demographic, nutritional, and health variables through national interviews and physical examinations conducted by the CDC.

Q: Why are these datasets useful? #

A: They offer broad insights into national health patterns, disparities, and costs — helping guide public policy and resource planning.

➡️ How do we assess if a healthcare data source is useful?

15 A framework to assess if a data source is useful #

Q: What questions should you ask to assess a healthcare dataset’s usefulness? #

Is there a well-documented data model?

Poor or missing documentation makes the data hard to use.

What is the data provenance?

Understand how and where the data were collected.

Are the data accessible and in what form?

Consider legal restrictions and costs.

What known errors or missingness exist?

Evaluate data quality and be prepared to address gaps.

Are data standards used (e.g., vocabularies or formats)?

Standardization affects interoperability and analysis readiness.
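The question about known errors and missingness can be given a first, rough answer by measuring how often each column is empty. The rows, the column names, and the `missing_fraction` helper below are hypothetical. Which markers count as "missing" is an assumption that depends on the dataset's documentation.

```python
def missing_fraction(rows, missing_markers=(None, "", "NA")):
    """Return {column: fraction of rows where the value is missing}."""
    if not rows:
        return {}
    columns = rows[0].keys()
    return {
        col: sum(1 for row in rows if row.get(col) in missing_markers) / len(rows)
        for col in columns
    }

# Hypothetical extract with several ways of marking a missing value.
ROWS = [
    {"patient_id": "1", "age": "54", "hba1c": "NA"},
    {"patient_id": "2", "age": "NA", "hba1c": "NA"},
    {"patient_id": "3", "age": "41", "hba1c": "6.1"},
    {"patient_id": "4", "age": "67", "hba1c": ""},
]

report = missing_fraction(ROWS)
print(report)
```

A column that is mostly missing, like the made-up lab value here, may not be missing at random, since tests are usually ordered only when a clinician suspects a problem.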

Q: Why is this framework important? #

A: It helps researchers avoid costly or infeasible data efforts and ensures they can trust and interpret results from the dataset effectively.