C5 Capstone Projects

📘 Course 5: Capstone Projects – COVID-19 AI #


📷 Project 1: CXR-Based COVID-19 Detector #

Phase 1: Data Collection #

  • Objective: Build a deep learning model to predict COVID-19 status using chest x-ray (CXR) images.
  • Input: 3000x3000 px uncompressed DICOM images from 30,000 exams (10% COVID-positive).
  • Concern: Class imbalance (90:10) and high-resolution image processing needs.

Phase 2: Model Training (Part 1) #

  • Used ResNet-50 on resized 224x224 images.
  • Data split randomly (not by patient).
  • Augmentation: 50% zoom-in on random region.
  • Issue: Training loss did not improve → possible underfitting or flawed preprocessing.
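A random exam-level split like the one used here can put two exams from the same patient on both sides of the train/validation boundary, leaking patient-specific features. A minimal sketch of a patient-level split, assuming a hypothetical record format with a `"patient_id"` key (real DICOM metadata will differ):

```python
import random

def patient_level_split(records, val_frac=0.2, seed=0):
    """Split exam records by patient so no patient spans both sets.

    `records` is a hypothetical list of dicts with a "patient_id" key.
    """
    patients = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_val = max(1, int(len(patients) * val_frac))
    val_patients = set(patients[:n_val])
    train = [r for r in records if r["patient_id"] not in val_patients]
    val = [r for r in records if r["patient_id"] in val_patients]
    return train, val

# A patient with two exams must land entirely in one split.
exams = [{"patient_id": p, "exam": i} for p in range(10) for i in range(2)]
train, val = patient_level_split(exams, val_frac=0.2)
```

Splitting the patient list (rather than the exam list) is what prevents leakage; the exam counts per split then vary slightly with how many exams each held-out patient has.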

Phase 3: Model Training (Part 2) #

  • Improvements made:
    • Patient-level data split to prevent leakage.
    • Image size increased to 512x512 px; model adjusted accordingly.
    • Simplified augmentation: horizontal flip + light zoom.
    • COVID-positive oversampling added.
  • New issue: Overfitting (training loss much lower than validation loss).
  • Metric discrepancy: High accuracy but relatively low AUROC on validation set.
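The accuracy/AUROC discrepancy is exactly what a 90:10 class imbalance produces: a model can score high accuracy by favoring the majority class while ranking positives poorly. A small illustration with a rank-based AUROC (the data here is synthetic, not from the project):

```python
def auroc(labels, scores):
    """Rank-based AUROC: probability a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# With 90:10 imbalance, a constant "always negative" score gets
# 90% accuracy but an uninformative AUROC of 0.5.
labels = [1] * 10 + [0] * 90
scores = [0.0] * 100
acc = sum((s >= 0.5) == y for y, s in zip(labels, scores)) / len(labels)
print(acc)                     # 0.9
print(auroc(labels, scores))   # 0.5
```

This is why AUROC, not accuracy, is the metric worth tracking on this validation set.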

Phase 4: Model Evaluation #

  • Applied dropout (p=0.5) and random rotation augmentation.
  • Two early-stopped models:
    • Model A: Best validation AUROC.
    • Model B: Best validation loss.
  • Deployment consideration: Choose model based on worklist prioritization use case.
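The two early-stopped checkpoints correspond to two selection rules over the same training run. A toy sketch of that selection logic, assuming a hypothetical per-epoch history of `(epoch, val_loss, val_auroc)` tuples:

```python
def select_checkpoints(history):
    """Return the epochs an early-stopping loop would checkpoint:
    best validation AUROC (Model A) and best validation loss (Model B).

    `history` is a hypothetical list of (epoch, val_loss, val_auroc) tuples.
    """
    best_auroc_epoch = max(history, key=lambda h: h[2])[0]
    best_loss_epoch = min(history, key=lambda h: h[1])[0]
    return best_auroc_epoch, best_loss_epoch

history = [
    (1, 0.60, 0.78),
    (2, 0.48, 0.84),   # best loss
    (3, 0.52, 0.88),   # best AUROC
    (4, 0.57, 0.85),   # validation degrading -> stop
]
model_a, model_b = select_checkpoints(history)
print(model_a, model_b)  # 3 2
```

For worklist prioritization, where the ranking of cases matters more than calibrated probabilities, the best-AUROC checkpoint (Model A) is the natural fit.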

📈 Project 2: EHR-Based Intubation Predictor #

Phase 1: Data Collection #

  • Objective: Predict likelihood of intubation from electronic health records (EHR).
  • Input: COVID dataset with 3,000 EHRs (300 positive) + additional 40,000-exam COVID-like dataset.
  • Issue: The assumed data volume was wrong; only 3,000 usable EHRs exist in the COVID dataset.
  • Challenge: Sparse features, unusual lab-value distributions (e.g., D-dimer), many missing values.

Phase 2: Model Training (Part 1) #

  • Attempted logistic regression but faced data issues (sparsity, outliers, NaNs).
  • Required strategies to deal with missingness and outliers before modeling.
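One common pre-modeling recipe for a messy lab column is median imputation for missing values plus clipping to Tukey fences for outliers. A minimal sketch for a single feature (illustrative values only; in a real pipeline the median and quartiles must be fit on training data and reused on validation/test):

```python
import math
import statistics

def clean_feature(values, iqr_k=1.5):
    """Median-impute NaNs, then clip outliers to Tukey fences
    (Q1 - k*IQR, Q3 + k*IQR). Sketch for one lab column, e.g. D-dimer.
    """
    observed = [v for v in values if not math.isnan(v)]
    med = statistics.median(observed)
    q1, _, q3 = statistics.quantiles(observed, n=4)
    lo, hi = q1 - iqr_k * (q3 - q1), q3 + iqr_k * (q3 - q1)
    filled = [med if math.isnan(v) else v for v in values]
    return [min(max(v, lo), hi) for v in filled]

raw = [0.3, 0.4, 0.4, 0.5, 0.5, 0.6, 0.6, 0.7, float("nan"), 9.9]
cleaned = clean_feature(raw)
# NaN is filled with the median; the implausible 9.9 is pulled down
# to the upper fence.
```

Whether to clip or drop outliers depends on whether extreme lab values are data-entry noise or real physiology, which needs a clinical judgment call per feature.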

Phase 3: Model Training (Part 2) #

  • New strategy: Train on 40,000 “COVID-like” exams; test on COVID dataset (3,000 exams).
  • Split: 70% train, 30% validation (COVID-like dataset).
  • 10-fold cross-validation used for hyperparameter tuning.
  • Models trained: Logistic regression + Random Forests.
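The 10-fold cross-validation above amounts to scoring each hyperparameter candidate by its mean validation metric over ten disjoint folds of the training set. A minimal sketch of the fold bookkeeping (model fitting omitted):

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.
    Each index lands in exactly one validation fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

folds = list(kfold_indices(100, k=10))
# For each hyperparameter candidate: fit on `train`, score on `val`,
# average the 10 scores, and keep the best-scoring candidate.
```

If the COVID-like dataset also has repeated patients, the same patient-level grouping used in Project 1 should be applied when forming folds.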

Phase 4: Model Evaluation #

  • Performance improved using COVID-like training data.
  • Now selecting operating threshold using precision-recall curve.
  • Deployment consideration: Choose threshold optimized for triage decision-making.
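For a triage use case, one reasonable (assumed) policy is to fix a recall floor, since missing a patient who will need intubation is costly, and then pick the threshold with the best precision above that floor. A sketch over the precision-recall trade-off, with synthetic scores:

```python
def pick_threshold(labels, scores, min_recall=0.8):
    """Scan candidate thresholds; return (threshold, precision, recall)
    with the best precision among thresholds meeting the recall floor.

    The recall-floor policy is an assumption for illustration, not the
    course's prescribed rule.
    """
    best = None
    for t in sorted(set(scores)):
        pred = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(pred, labels))
        fp = sum(p and not y for p, y in zip(pred, labels))
        fn = sum((not p) and y for p, y in zip(pred, labels))
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        if recall >= min_recall and (best is None or precision > best[1]):
            best = (t, precision, recall)
    return best

labels = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
best = pick_threshold(labels, scores, min_recall=0.8)
print(best)  # (0.5, 0.8, 0.8)
```

Raising `min_recall` toward 1.0 trades precision for sensitivity; where to set it is a clinical decision, not a modeling one.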

🔗 Cross-Project Learnings #

  • Both projects improved significantly from Phase 2 to 4 through better data practices:
    • Patient-level splits
    • Cross-validation
    • Oversampling
    • Threshold tuning
  • Key divergence:
    • Project 1 is image-based, focuses on COVID diagnosis.
    • Project 2 is EHR-based, focuses on intervention prediction (intubation).
  • Both must align their evaluation strategy with real-world clinical use (triage vs. prioritization).