C5 Capstone Projects

📘 Course 5: Capstone Projects – COVID-19 AI #


📷 Project 1: CXR-Based COVID-19 Detector #

Phase 1: Data Collection #

  • Objective: Build a deep learning model to predict COVID-19 status using chest x-ray (CXR) images.
  • Input: 3000x3000 px uncompressed DICOM images from 30,000 exams (10% COVID-positive).
  • Concern: Class imbalance (90:10) and high-resolution image processing needs.

Phase 2: Model Training (Part 1) #

  • Used ResNet-50 on resized 224x224 images.
  • Data split randomly (not by patient).
  • Augmentation: 50% zoom-in on random region.
  • Issue: Training loss did not improve → possible underfitting or flawed preprocessing.
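A random exam-level split like the one used here can put two exams from the same patient on both sides of the train/validation boundary, leaking patient-specific features. A minimal sketch of a patient-level split, assuming a hypothetical record format with a `"patient_id"` key (real DICOM metadata will differ):

```python
import random

def patient_level_split(records, val_frac=0.2, seed=0):
    """Split exam records by patient so no patient spans both sets.

    `records` is a hypothetical list of dicts with a "patient_id" key.
    """
    patients = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_val = max(1, int(len(patients) * val_frac))
    val_patients = set(patients[:n_val])
    train = [r for r in records if r["patient_id"] not in val_patients]
    val = [r for r in records if r["patient_id"] in val_patients]
    return train, val

# A patient with two exams must land entirely in one split.
exams = [{"patient_id": p, "exam": i} for p in range(10) for i in range(2)]
train, val = patient_level_split(exams, val_frac=0.2)
```

Splitting the patient list (rather than the exam list) is what prevents leakage; the exam counts per split then vary slightly with how many exams each held-out patient has.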

Phase 3: Model Training (Part 2) #

  • Improvements made:
    • Patient-level data split to prevent leakage.
    • Image size increased to 512x512 px; model adjusted accordingly.
    • Simplified augmentation: horizontal flip + light zoom.
    • COVID-positive oversampling added.
  • New issue: Overfitting (training loss much lower than validation loss).
  • Metric discrepancy: High accuracy but relatively low AUROC on validation set.
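The accuracy/AUROC discrepancy is exactly what a 90:10 class imbalance produces: a model can score high accuracy by favoring the majority class while ranking positives poorly. A small illustration with a rank-based AUROC (the data here is synthetic, not from the project):

```python
def auroc(labels, scores):
    """Rank-based AUROC: probability a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# With 90:10 imbalance, a constant "always negative" score gets
# 90% accuracy but an uninformative AUROC of 0.5.
labels = [1] * 10 + [0] * 90
scores = [0.0] * 100
acc = sum((s >= 0.5) == y for y, s in zip(labels, scores)) / len(labels)
print(acc)                     # 0.9
print(auroc(labels, scores))   # 0.5
```

This is why AUROC, not accuracy, is the metric worth tracking on this validation set.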

Phase 4: Model Evaluation #

  • Applied dropout (p=0.5) and random rotation augmentation.
  • Two early-stopped models:
    • Model A: Best validation AUROC.
    • Model B: Best validation loss.
  • Deployment consideration: Choose model based on worklist prioritization use case.
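The two early-stopped checkpoints correspond to two selection rules over the same training run. A toy sketch of that selection logic, assuming a hypothetical per-epoch history of `(epoch, val_loss, val_auroc)` tuples:

```python
def select_checkpoints(history):
    """Return the epochs an early-stopping loop would checkpoint:
    best validation AUROC (Model A) and best validation loss (Model B).

    `history` is a hypothetical list of (epoch, val_loss, val_auroc) tuples.
    """
    best_auroc_epoch = max(history, key=lambda h: h[2])[0]
    best_loss_epoch = min(history, key=lambda h: h[1])[0]
    return best_auroc_epoch, best_loss_epoch

history = [
    (1, 0.60, 0.78),
    (2, 0.48, 0.84),   # best loss
    (3, 0.52, 0.88),   # best AUROC
    (4, 0.57, 0.85),   # validation degrading -> stop
]
model_a, model_b = select_checkpoints(history)
print(model_a, model_b)  # 3 2
```

For worklist prioritization, where the ranking of cases matters more than calibrated probabilities, the best-AUROC checkpoint (Model A) is the natural fit.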

📈 Project 2: EHR-Based Intubation Predictor #

Phase 1: Data Collection #

  • Objective: Predict likelihood of intubation from electronic health records (EHR).
  • Input: COVID dataset with 3,000 EHRs (300 positive) + additional 40,000-exam COVID-like dataset.
  • Issue: The assumed data volume was wrong; only 3,000 usable EHRs exist in the COVID dataset.
  • Challenge: Sparse features, unusual lab-value distributions (e.g., D-dimer), many missing values.

Phase 2: Model Training (Part 1) #

  • Attempted logistic regression but faced data issues (sparsity, outliers, NaNs).
  • Required strategies to deal with missingness and outliers before modeling.
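One common pre-modeling recipe for a messy lab column is median imputation for missing values plus clipping to Tukey fences for outliers. A minimal sketch for a single feature (illustrative values only; in a real pipeline the median and quartiles must be fit on training data and reused on validation/test):

```python
import math
import statistics

def clean_feature(values, iqr_k=1.5):
    """Median-impute NaNs, then clip outliers to Tukey fences
    (Q1 - k*IQR, Q3 + k*IQR). Sketch for one lab column, e.g. D-dimer.
    """
    observed = [v for v in values if not math.isnan(v)]
    med = statistics.median(observed)
    q1, _, q3 = statistics.quantiles(observed, n=4)
    lo, hi = q1 - iqr_k * (q3 - q1), q3 + iqr_k * (q3 - q1)
    filled = [med if math.isnan(v) else v for v in values]
    return [min(max(v, lo), hi) for v in filled]

raw = [0.3, 0.4, 0.4, 0.5, 0.5, 0.6, 0.6, 0.7, float("nan"), 9.9]
cleaned = clean_feature(raw)
# NaN is filled with the median; the implausible 9.9 is pulled down
# to the upper fence.
```

Whether to clip or drop outliers depends on whether extreme lab values are data-entry noise or real physiology, which needs a clinical judgment call per feature.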

Phase 3: Model Training (Part 2) #

  • New strategy: Train on 40,000 “COVID-like” exams; test on COVID dataset (3,000 exams).
  • Split: 70% train, 30% validation (COVID-like dataset).
  • 10-fold cross-validation used for hyperparameter tuning.
  • Models trained: Logistic regression + Random Forests.
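The 10-fold cross-validation above amounts to scoring each hyperparameter candidate by its mean validation metric over ten disjoint folds of the training set. A minimal sketch of the fold bookkeeping (model fitting omitted):

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.
    Each index lands in exactly one validation fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

folds = list(kfold_indices(100, k=10))
# For each hyperparameter candidate: fit on `train`, score on `val`,
# average the 10 scores, and keep the best-scoring candidate.
```

If the COVID-like dataset also has repeated patients, the same patient-level grouping used in Project 1 should be applied when forming folds.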

Phase 4: Model Evaluation #

  • Performance improved using COVID-like training data.
  • Now selecting operating threshold using precision-recall curve.
  • Deployment consideration: Choose threshold optimized for triage decision-making.
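For a triage use case, one reasonable (assumed) policy is to fix a recall floor, since missing a patient who will need intubation is costly, and then pick the threshold with the best precision above that floor. A sketch over the precision-recall trade-off, with synthetic scores:

```python
def pick_threshold(labels, scores, min_recall=0.8):
    """Scan candidate thresholds; return (threshold, precision, recall)
    with the best precision among thresholds meeting the recall floor.

    The recall-floor policy is an assumption for illustration, not the
    course's prescribed rule.
    """
    best = None
    for t in sorted(set(scores)):
        pred = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(pred, labels))
        fp = sum(p and not y for p, y in zip(pred, labels))
        fn = sum((not p) and y for p, y in zip(pred, labels))
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        if recall >= min_recall and (best is None or precision > best[1]):
            best = (t, precision, recall)
    return best

labels = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
best = pick_threshold(labels, scores, min_recall=0.8)
print(best)  # (0.5, 0.8, 0.8)
```

Raising `min_recall` toward 1.0 trades precision for sensitivity; where to set it is a clinical decision, not a modeling one.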

🔗 Cross-Project Learnings #

  • Both projects improved significantly from Phase 2 to 4 through better data practices:
    • Patient-level splits
    • Cross-validation
    • Oversampling
    • Threshold tuning
  • Key divergence:
    • Project 1 is image-based, focuses on COVID diagnosis.
    • Project 2 is EHR-based, focuses on intervention prediction (intubation).
  • Both must align their evaluation strategy with real-world clinical use (triage vs. prioritization).