[Summary] Module 2: Data Available From Healthcare Systems

Module 2: Data Available From Healthcare Systems #

1 Review of the healthcare system #


Q: What is the focus of this module’s introduction? #

A: To demonstrate how clinical data from the healthcare system can be used to ask and answer meaningful research questions.

Q: What key topics are introduced in this session? #

A:

  • Common data sources in healthcare
  • Types of data generated
  • Systematic inaccuracies in the data
  • Strategies for working with imperfect data

Q: How does this connect to the earlier module? #

A: It builds on the discussion of research question formulation and data mining workflows by diving into real-world data availability and limitations.

➡️ What entities exist in the healthcare system and what data do they collect?

2 Review of key entities and the data they collect #


Q: Who are the key entities in the healthcare system that generate data? #

A:

  • Patients: Generate little data when healthy, but more when they seek treatment.
  • Search Engines: Capture data from patients researching health online.
  • Healthcare Providers: Generate EMR data including diagnoses, test orders, and prescriptions.
  • Pharmacies: Record dispensing data.
  • Pharmaceutical Companies: Design and manufacture drugs.
  • Insurers/Payers: Capture claims and billing data.

Q: What kind of data is produced at each step of care? #

A: Data are produced at nearly every step — from search behavior before treatment, to EMR entries during care, to dispensing records from pharmacies, and payment data from insurers.

➡️ Who are the actors in the healthcare system and what interests do they have?

3 Actors with different interests #


Q: Who are the main actors in the healthcare system? #

A:

  • Patients and their families
  • Healthcare professionals
  • Payers (e.g. insurance companies)
  • Government and regulatory agencies

Q: What are the interests of these different actors? #

A:

  • Patients want to stay healthy and recover quickly.
  • Clinicians seek best practices to maximize benefit and minimize harm.
  • Payers and regulators aim for societal cost-effectiveness, fairness, and justice.

Q: How do differing interests create conflict? #

A: The actors can weigh the same decision differently. A patient may want a high-risk treatment because it offers hope, the clinician may put safety first, and the insurer may prioritize cost-effectiveness. The result is tension in care decisions.

➡️ What are the common data types used in healthcare?

4 Common data types in Healthcare #


Q: What are the major types of healthcare data? #

A: Structured and unstructured data.

Q: What is structured data? #

A: Data organized in consistent formats like tables with rows (patients) and columns (attributes such as age, DOB). Missing values are often marked with placeholders like “NA”.
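As a minimal sketch of what this looks like in practice, the Python snippet below reads a small patient table in which missing values are marked "NA". The table, its column names, and its values are all hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical structured table: one row per patient, one column per attribute.
RAW = """patient_id,age,dob
1,54,1970-03-12
2,NA,1985-11-02
3,67,NA
"""

def load_patients(text, missing="NA"):
    """Parse CSV text into dicts, turning the missing-value placeholder into None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({k: (None if v == missing else v) for k, v in row.items()})
    return rows

patients = load_patients(RAW)

# Compute the mean age over patients whose age is actually recorded.
known_ages = [int(p["age"]) for p in patients if p["age"] is not None]
mean_age = sum(known_ages) / len(known_ages)
print(mean_age)  # 60.5
```

Converting the placeholder to `None` up front keeps "NA" from being mistaken for a real value later in the analysis.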

Q: What is unstructured data? #

A: Data without a uniform format, including clinical notes, images, biomedical signals, and free text.

Q: Why is this distinction important? #

A: Structured data is easier to analyze computationally, while unstructured data requires specialized methods (e.g. NLP, image processing) to extract insights.
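To give a feel for the extra work unstructured data requires, here is a deliberately simple sketch that pulls blood pressure readings out of a free-text note with a regular expression. It is a rule-based stand-in, not a real NLP pipeline. The note text, the pattern, and the function name are assumptions made for illustration, and the pattern would also match other slash-separated numbers such as some dates.

```python
import re

# Hypothetical free-text note; real clinical notes are far messier.
NOTE = "Pt seen for follow-up. BP 142/91 today, was 130/85 last visit. Denies chest pain."

# Matches readings such as "142/91": two or three digits on each side of a slash.
BP_PATTERN = re.compile(r"\b(\d{2,3})\s*/\s*(\d{2,3})\b")

def extract_blood_pressures(note):
    """Return (systolic, diastolic) integer pairs found in a free-text note."""
    return [(int(s), int(d)) for s, d in BP_PATTERN.findall(note)]

readings = extract_blood_pressures(NOTE)
print(readings)  # [(142, 91), (130, 85)]
```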

➡️ What are the strengths and weaknesses of observational healthcare data?

5 Strengths and weaknesses of observational data #


Q: What are observational data in healthcare? #

A: Data collected during routine clinical care for the primary purpose of delivering care — not for research. They’re used secondarily for analysis.

Q: What are the main strengths of observational data? #

A:

  • Large scale: Covers millions to billions of records, capturing rare events.
  • Real-world relevance: Reflects actual clinical practices and patient behaviors.
  • Efficiency: Already collected, so no additional data gathering is needed.

Q: What are the weaknesses of observational data? #

A:

  • Bias and incompleteness: Not collected with research goals in mind.
  • Lack of standardization: Varies across sites and systems.
  • Potential confounding: Causal inference is difficult without randomization.

➡️ What are the biases and errors introduced by the healthcare system itself?

6 Bias and error from the healthcare system perspective #


Q: How can data from the healthcare system be biased or inaccurate? #

A: Each entity in the healthcare system contributes potential sources of bias, especially due to who seeks care and how care is recorded.

Q: What is selection bias in this context? #

A: It occurs when only certain patients (e.g., those who seek care) are recorded, leaving out healthy individuals or those managing illness at home or outside the system.
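A small worked example can show how this distorts estimates. The sketch below uses a made-up population and assumed care-seeking rates (none of these numbers come from the module) to compare the true prevalence of a condition with the prevalence seen among recorded patients.

```python
# Hypothetical population, for illustration only.
population = 10_000
with_condition = 1_000  # true prevalence: 10%

# Assumed care-seeking rates: sick people visit more often than healthy people.
seek_care_if_sick = 0.8
seek_care_if_healthy = 0.2

# Only people who seek care end up in the records.
sick_in_records = with_condition * seek_care_if_sick
healthy_in_records = (population - with_condition) * seek_care_if_healthy

true_prevalence = with_condition / population
recorded_prevalence = sick_in_records / (sick_in_records + healthy_in_records)

print(round(true_prevalence, 3), round(recorded_prevalence, 3))  # 0.1 0.308
```

Under these assumptions, the records suggest a prevalence about three times the true value, purely because of who shows up in the data.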

Q: What patient-level factors influence bias? #

A: Health literacy, financial situation, insurance coverage, and cultural beliefs can all affect whether and how patients engage with the healthcare system.

➡️ How do biases and errors affect recorded exposures and outcomes?

7 Bias and error of exposures and outcomes #


Q: What are exposures and outcomes in clinical data analysis? #

A:

  • Exposures are events or conditions that a patient experiences (e.g., diseases, procedures, medications).
  • Outcomes are events or conditions assessed after the exposure (e.g., complications, lab results, costs).

Q: Why is distinguishing exposures and outcomes important? #

A: It provides a framework for analyzing how prior events influence later results and helps in identifying where bias or error may occur.

Q: What type of biases can arise in this context? #

A: Misclassification, timing errors, or missing data related to either exposures or outcomes can distort analysis and conclusions.

➡️ How can patient exposure be misclassified in clinical data?

8 How a patient exposure might be misclassified #


Q: How can exposure to medication be misclassified in clinical data? #

A: Exposure is misclassified when prescription records do not match actual medication use, for example because of delays in filling, use of free samples, or gaps in adherence.

Q: What example illustrates this misclassification? #

A: A doctor gives a 15-day free sample before a prescription is filled. The patient begins treatment immediately, but data may only show the pharmacy fill date, not the true start date.

Q: Why does this matter? #

A: Misclassification of the timing or existence of exposure can distort analyses linking exposures to outcomes, especially for time-sensitive effects.
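The free-sample example can be sketched in a few lines of Python. All dates are hypothetical, and the sketch assumes the patient fills the prescription on the day the 15-day sample runs out. It shows an event that falls inside the true exposure window but looks unexposed when the pharmacy fill date is used as the start date.

```python
from datetime import date, timedelta

SAMPLE_DAYS = 15  # length of the free sample in the example

# Hypothetical dates, for illustration only.
sample_given = date(2024, 1, 1)                              # patient actually starts the drug
pharmacy_fill = sample_given + timedelta(days=SAMPLE_DAYS)   # first fill visible in the data
adverse_event = date(2024, 1, 10)                            # event recorded during care

def exposed_at(event_date, start_date):
    """True if the event happened on or after the assumed exposure start."""
    return event_date >= start_date

exposed_by_records = exposed_at(adverse_event, pharmacy_fill)  # uses the fill date
exposed_in_reality = exposed_at(adverse_event, sample_given)   # uses the true start date

print(exposed_by_records, exposed_in_reality)  # False True
```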

➡️ How might a patient outcome be misclassified in clinical records?

9 How a patient outcome could be misclassified #


Q: How can a patient’s outcome be misclassified in the medical record? #

A: Sometimes a diagnosis code is added based on suspicion (e.g., diabetes) before a condition is confirmed, and it may remain even if the diagnosis is later ruled out.

Q: What strategies help reduce outcome misclassification? #

A:

  • Require multiple instances of a diagnosis code.
  • Pair diagnosis codes with procedure codes specific to the condition.
  • Look for treatment or intervention evidence supporting the diagnosis.

Q: Why are procedure codes more reliable? #

A: Procedures are typically documented only if they were actually performed, adding confirmatory weight to a diagnosis.
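The first two strategies can be combined into a simple case-definition rule. The sketch below is one possible reading, not a validated algorithm: a patient counts as a confirmed case with diagnosis codes on at least two distinct dates, or with one diagnosis code plus a confirming procedure. `E11.9` is the ICD-10-CM code for type 2 diabetes without complications. `PROC_X` is a placeholder, not a real procedure code, and the function name and threshold are assumptions.

```python
from datetime import date

# Illustrative code lists; a real study would use a validated code set.
DIAGNOSIS_CODES = {"E11.9"}          # ICD-10-CM: type 2 diabetes without complications
CONFIRMING_PROCEDURES = {"PROC_X"}   # placeholder, not a real procedure code

def is_confirmed_case(events, min_dx_dates=2):
    """Require repeated diagnoses, or one diagnosis plus a confirming procedure."""
    dx_dates = {e["date"] for e in events if e["code"] in DIAGNOSIS_CODES}
    has_procedure = any(e["code"] in CONFIRMING_PROCEDURES for e in events)
    return len(dx_dates) >= min_dx_dates or (len(dx_dates) >= 1 and has_procedure)

# Three hypothetical patients.
suspected_only = [{"code": "E11.9", "date": date(2024, 2, 1)}]
repeated_dx = [
    {"code": "E11.9", "date": date(2024, 2, 1)},
    {"code": "E11.9", "date": date(2024, 5, 1)},
]
dx_plus_procedure = [
    {"code": "E11.9", "date": date(2024, 2, 1)},
    {"code": "PROC_X", "date": date(2024, 2, 8)},
]

print(is_confirmed_case(suspected_only))     # False
print(is_confirmed_case(repeated_dx))        # True
print(is_confirmed_case(dx_plus_procedure))  # True
```

A single code entered on suspicion no longer counts as a case, which is the point of the stricter rule. The trade-off is that some true cases with sparse records will be missed.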

➡️ What are the key sources of electronic medical record data?

10 Electronic medical record data #


Q: What are electronic medical records (EMRs)? #

A: EMRs are digital versions of the traditional paper patient charts. They store detailed information collected in clinical settings.

Q: What types of data are typically stored in EMRs? #

A:

  • Patient demographics
  • Diagnosis and procedure codes
  • Clinical notes
  • Medication records
  • Imaging and lab test results
  • (Increasingly) genetic test results and wearable device data

Q: How are EMRs generated? #

A: As a byproduct of routine clinical care and documentation processes in hospitals and clinics.

Q: Are EMRs and EHRs the same? #

A: The terms are often used interchangeably, though technically EHR may imply a broader, longitudinal view across providers.

➡️ What can we learn from claims data in healthcare systems?

11 Claims data #


Q: What are claims data in healthcare? #

A: Claims data are records generated when healthcare providers submit bills to insurers for services rendered.

Q: What information do claims typically include? #

A:

  • Patient identifiers
  • Insurance status
  • Diagnosis and procedure codes
  • Requested charges and actual payments

Services are categorized for billing with specific codes. These codes may differ between what the clinician enters, what the hospital submits, and what the insurer reimburses.

Q: Why are claims data valuable? #

A: They include detailed information on costs, utilization, and provider-payer interactions — data often missing from EMRs.

➡️ What data do pharmacies provide, and how is it used?

12 Pharmacy #


Q: What kind of data do pharmacies provide? #

A: They document when prescriptions are written, filled, and paid for — offering insight into medication access and fulfillment.

Q: Why is pharmacy data valuable? #

A: It goes beyond prescription intent (EMR) to show that a patient actually obtained the medication, which is a step closer to actual use.

Q: What are limitations of pharmacy data? #

A: It doesn’t guarantee the patient took the medication — only that it was picked up.

Q: How can pharmacy records be fragmented? #

A: Patients may use multiple sources: retail chains (e.g., CVS), mail-order services, and online pharmacies, dispersing data across datasets.
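A short sketch shows why fragmentation matters. The fill records below are hypothetical (one patient, three unlinked sources), as are the field and function names. Counting days of supply from a single source understates how much medication the patient actually obtained.

```python
from datetime import date

# Hypothetical fill records for one patient from three unlinked sources.
retail_fills = [{"drug": "metformin", "fill_date": date(2024, 1, 5), "days_supply": 30}]
mail_order_fills = [{"drug": "metformin", "fill_date": date(2024, 2, 4), "days_supply": 90}]
online_fills = []

def total_days_supplied(*sources):
    """Sum days of supply across any number of fill-record lists."""
    return sum(fill["days_supply"] for source in sources for fill in source)

retail_only = total_days_supplied(retail_fills)
all_sources = total_days_supplied(retail_fills, mail_order_fills, online_fills)

print(retail_only, all_sources)  # 30 120
```

An analyst with access only to the retail dataset would see one 30-day fill and might wrongly conclude that the patient stopped treatment.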

➡️ What are surveillance datasets and registries, and how are they used?

13 Surveillance datasets and Registries #


Q: What are surveillance datasets and why are they important? #

A: They monitor adverse events and side effects of drugs or devices after approval (post-marketing surveillance) to catch safety issues early.

Q: What are examples of surveillance systems in the U.S.? #

A:

  • FAERS: FDA Adverse Event Reporting System
  • MAUDE: Manufacturer and User Facility Device Experience database

Q: Who uses these datasets and for what purpose? #

A: Government agencies like the FDA and CDC use them to track disease outbreaks (e.g. flu, Ebola) and monitor product safety. Local and state agencies often assist.

Q: What are registries? #

A: Registries are organized systems maintained by agencies or societies to collect consistent clinical data on specific conditions, devices, or populations.

➡️ What are population health datasets and how are they used?

14 Population health data sets #


Q: How do population health datasets differ from patient-centric data? #

A: They focus on aggregated health trends, costs, and resource use across populations — not on individual patient records.

Q: What are key examples of U.S. population health datasets? #

A:

  • National Inpatient Sample (NIS): Tracks hospital resource utilization, costs, and outcomes.
  • Medical Expenditure Panel Survey (MEPS): Surveys patients, providers, and employers on healthcare usage and spending.
  • National Health and Nutrition Examination Survey (NHANES): Measures demographic, nutritional, and health variables through national surveys conducted by the CDC.

Q: Why are these datasets useful? #

A: They offer broad insights into national health patterns, disparities, and costs — helping guide public policy and resource planning.

➡️ How do we assess if a healthcare data source is useful?

15 A framework to assess if a data source is useful #


Q: What questions should you ask to assess a healthcare dataset’s usefulness? #

A:

  1. Is there a well-documented data model?
     • Poor or missing documentation makes the data hard to use.
  2. What is the data provenance?
     • Understand how and where the data were collected.
  3. Are the data accessible and in what form?
     • Consider legal restrictions and costs.
  4. What known errors or missingness exist?
     • Evaluate data quality and be prepared to address gaps.
  5. Are data standards used (e.g., vocabularies or formats)?
     • Standardization affects interoperability and analysis readiness.
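The question about missingness is one of the easiest to check directly. As a minimal sketch, the snippet below computes the share of missing values per column for a small extract. The records, column names, and function name are hypothetical.

```python
# Hypothetical extract; None marks a missing value.
records = [
    {"patient_id": 1, "age": 54, "hba1c": 7.1},
    {"patient_id": 2, "age": None, "hba1c": None},
    {"patient_id": 3, "age": 67, "hba1c": None},
    {"patient_id": 4, "age": 45, "hba1c": 6.4},
]

def missing_fraction(rows):
    """Return the share of missing (None) values for each column."""
    columns = rows[0].keys()
    return {c: sum(r[c] is None for r in rows) / len(rows) for c in columns}

report = missing_fraction(records)
print(report)  # {'patient_id': 0.0, 'age': 0.25, 'hba1c': 0.5}
```

A report like this is only a starting point. It shows how much is missing, not why, and the reason for missingness often matters more for the analysis.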

Q: Why is this framework important? #

A: It helps researchers avoid costly or infeasible data efforts and ensures they can trust and interpret results from the dataset effectively.