Data and Computational Challenges in Integrative Biomedical Informatics


Presentation at the Georgia Tech Center for Data Analytics

Published in: Health & Medicine, Technology

    1. Data and Computational Challenges in Integrative Biomedical Informatics
       Joel Saltz, MD, PhD
       Chair, Department of Biomedical Informatics; Director, Center for Comprehensive Informatics, Emory University
       Adjunct Professor, CSE and CS, College of Computing, Georgia Tech
    2. Integrative Data Analytics
    3. Integrative Biomedical Informatics Analytics
       • Anatomic/functional characterization at fine level (Pathology) and gross level (Radiology)
       • High-throughput multi-scale image segmentation, feature extraction, and analysis of features
       • Integration of anatomic/functional characterization with multiple types of "omic" information
       (Diagram: Radiology Imaging, Pathologic Features, and "Omic" Data feeding into Patient Outcome)
    4. Integrative Spatio-Temporal Molecular Analytics (aka Big Data)
    5. Quantitative Feature Analysis in Pathology: Emory In Silico Center for Brain Tumor Research (PI: Dan Brat, PD: Joel Saltz)
    6. Using TCGA Data to Study Glioblastoma
       • Diagnostic improvement
       • Molecular classification
       • Predictors of progression
    7. Millions of Nuclei Defined by n Features
       • Top-down analysis: use the features with existing diagnostic constructs
       • Bottom-up analysis: let features define and drive the analysis
    8. TCGA Whole Slide Images, Step 1: Nuclei Segmentation
       • Identify individual nuclei and their boundaries (Jun Kong)
    9. Nuclear Analysis Workflow: Step 1 (Nuclei Segmentation) → Step 2 (Feature Extraction)
       • Describe individual nuclei in terms of size, shape, and texture
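The feature-extraction step above (size and shape per nucleus) can be sketched with a few classical shape descriptors. This is only an illustrative stand-in, not the pipeline's actual code; the function name and the shoelace/circularity formulas are my own choices:

```python
import math

def shape_features(polygon):
    """Compute area, perimeter, and circularity for a nucleus boundary.

    polygon: list of (x, y) vertices tracing the nucleus outline.
    Circularity = 4*pi*area / perimeter**2 (1.0 for a perfect circle).
    """
    n = len(polygon)
    # Shoelace formula for the area of a simple polygon
    area = abs(sum(polygon[i][0] * polygon[(i + 1) % n][1]
                   - polygon[(i + 1) % n][0] * polygon[i][1]
                   for i in range(n))) / 2.0
    perimeter = sum(math.dist(polygon[i], polygon[(i + 1) % n]) for i in range(n))
    circularity = 4 * math.pi * area / perimeter ** 2
    return {"area": area, "perimeter": perimeter, "circularity": circularity}

# A square "nucleus" outline: area 4.0, perimeter 8.0, circularity pi/4 ≈ 0.785
print(shape_features([(0, 0), (2, 0), (2, 2), (0, 2)]))
```

A real pipeline would add texture features (e.g., intensity statistics inside the boundary) alongside these geometric ones.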
    10. Step 3: Nuclei Classification
        • Nuclear qualities rated on a 1-10 scale, from Oligodendroglioma to Astrocytoma
    11. Comparison of Machine-based Classification to Human-based Classification
        (Panels: separation of GBM, Oligo1, and Oligo2 as designated by neuropathologists vs. as designated by machine)
    12. Survival Analysis
        (Panels: human vs. machine classification)
    13. Gene Expression Correlates of High Oligo-Astro Ratio on Machine-based Classification
        • Oligo-related genes: Myelin Basic Protein, Proteolipoprotein, HoxD1
        • Nuclear features most associated with oligo signature genes: circularity (high), eccentricity (low)
    14. Millions of Nuclei Defined by n Features
        • Top-down analysis: analyze features in context of existing diagnostic constructs
        • Bottom-up analysis: let nuclear features define and drive the analysis
    15. Direct Study of Relationship Between ... vs. ... (Lee Cooper, Carlos Moreno)
    16. Clustering Identifies Three Morphological Groups
        • Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)
        • Named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB)
        • Prognostically significant (logrank p = 4.5e-4)
        (Figure: heatmap of feature indices for the CC/CM/PB clusters, and survival curves over 0-3000 days)
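The grouping described on this slide, assigning each tumor's nuclear-feature profile to one of three morphological clusters, can be illustrated with k-means. The slide does not name the algorithm, so treat this pure-Python sketch on toy 2-D data as an illustration only:

```python
def kmeans(points, k, iters=20):
    """Minimal k-means with naive initialization (first k points as centers)."""
    centers = points[:k]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        # Assignment step: each point joins its nearest center's group
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[nearest].append(p)
        # Update step: each center moves to its group's mean
        centers = [tuple(sum(v) / len(g) for v in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

# Three synthetic "morphological" blobs in a 2-D feature space;
# the first three points are deliberately one per blob so the naive init works.
pts = [(0.1, 0.2), (9.8, 0.1), (0.2, 10.0), (0.2, 0.1), (10.1, 0.3), (0.1, 9.7)]
centers, groups = kmeans(pts, k=3)
print(sorted(len(g) for g in groups))  # → [2, 2, 2]
```

At the study's scale (200 million nuclei), features would first be aggregated per slide or per tumor before clustering; the prognostic claim on the slide comes from a separate logrank test across the resulting groups.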
    17. Associations
    18. Healthcare Data Analytics
    19. Clinical Phenotype Characterization and the Emory Analytic Information Warehouse
        • Example project: find hot spots in readmissions within 30 days
          – What fraction of patients with a given principal diagnosis will be readmitted within 30 days?
          – What fraction of patients with a given set of diseases will be readmitted within 30 days?
          – How do the severity and time course of co-morbidities affect readmissions?
          – Geographic analyses
        • Compare and contrast with the UHC Clinical Data Base
          – Repeat analyses across all UHC hospitals
          – Are we performing the same?
          – How are UHC-curated groupings of patients (e.g., product lines) useful?
        • Need a repeatable process that we can apply identically to both local and UHC data
    20. Overall System
        (Architecture diagram: metadata repository, i2b2 web server and database, query processing and specification, database mapper, study-specific database, query tools, and source data feeds; roles shown include investigator, metadata manager, data modeler, and data analyst)
    21. 5-year Datasets from Emory and University Healthcare Consortium
        • EUH, EUHM, and WW (inpatient encounters)
        • Removed encounter pairs with chemotherapy and radiation therapy readmit encounters (CDW data)
        • Encounter location (down to unit for Emory)
        • Providers (Emory only)
        • Discharge disposition
        • Primary and secondary ICD-9 codes
        • Procedure codes
        • DRGs
        • Medication orders (Emory only)
        • Labs (Emory only)
        • Vitals (Emory only)
        • Geographic information (CDW only, plus US Census and American Community Survey)
    22. Using Emory & UHC Data to Find Associations with 30-day Readmits
        • Problem: "raw" clinical and administrative variables are difficult to use for associative data mining
          – Too many diagnosis codes and procedure codes
          – Continuous variables (e.g., labs) require interpretation
          – Temporal relationships between variables are implicit
        • Solution: transform the data into a much smaller set of variables using heuristic knowledge
          – Categorize diagnosis and procedure codes using code hierarchies
          – Classify continuous variables using standard interpretations (e.g., high, normal, low)
          – Identify temporal patterns (e.g., frequency, duration, sequence)
          – Apply standard data mining techniques
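Two of the transformations named above, classifying continuous labs into high/normal/low and deriving temporal flags across encounters, might look roughly like this. The reference range and both function names are hypothetical, not Emory's actual definitions:

```python
from datetime import date

def categorize_lab(value, low, high):
    """Map a continuous lab result to a categorical variable."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"

def readmit_within(days, discharge, next_admit):
    """Temporal pattern: was the next admission within `days` of discharge?"""
    delta = (next_admit - discharge).days
    return 0 < delta <= days

# Hypothetical creatinine reference range (mg/dL)
print(categorize_lab(2.4, low=0.6, high=1.3))                   # → high
print(readmit_within(30, date(2012, 3, 1), date(2012, 3, 20)))  # → True
```

Discretizations like these are what make standard association mining tractable on otherwise continuous, implicitly temporal data.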
    23. Derived Variables
        • 30-day readmit
        • The 9 Emory Enhanced Risk Assessment Tool diagnosis categories
        • UHC product lines
        • Variables derived from a combination of codes and/or laboratory test results
          – Obesity
          – Diabetes/uncontrolled diabetes
          – End-stage renal disease (ESRD)
          – Pressure ulcer
          – Sickle cell disease/sickle cell crisis
        • Temporal variables derived over multiple encounters
          – Multiple MI
          – Multiple 30-day readmissions
          – Chemotherapy within 180 (or 365) days before surgery
          – Previous encounter within the last 90 (or 180) days
    24. 30-Day Readmission Rates for Derived Variables (Emory Healthcare)
    25. Geographic Analyses: UHC Medicine General Product Line (#15)
    26. Predictive Modeling for Readmission
        • Random forests (ensemble of decision trees)
          – Create a decision tree using a random subset of the variables in the dataset
          – Generate a large number of such trees
          – All trees vote to classify each test example in a training dataset
          – Generate a patient-specific readmission risk for each encounter
        • Rank the encounters by risk for a subsequent 30-day readmission
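As a toy illustration of the voting idea on this slide, here is a pure-Python "forest" of one-feature decision stumps. The actual work used proper random forests over the derived variables, so treat this only as a sketch of how per-tree votes become a patient-specific risk score used to rank encounters:

```python
import random
import statistics

def forest_risk(train_X, train_y, x, n_trees=25, seed=0):
    """Toy ensemble: each 'tree' is a one-feature median-split stump.

    Returns the fraction of trees voting that encounter x is a readmit.
    """
    rng = random.Random(seed)
    votes = 0
    for _ in range(n_trees):
        f = rng.randrange(len(x))                 # random feature subset of size 1
        thresh = statistics.median(row[f] for row in train_X)
        # Training examples falling on the same side of the split as x
        side = [y for row, y in zip(train_X, train_y)
                if (row[f] > thresh) == (x[f] > thresh)]
        # The stump votes the majority label of x's side (ties count as readmit)
        if side and 2 * sum(side) >= len(side):
            votes += 1
    return votes / n_trees

# Feature 0 is informative (higher values were readmitted), feature 1 is noise
train_X = [[1, 5], [2, 7], [3, 6], [8, 5], [9, 7], [10, 6]]
train_y = [0, 0, 0, 1, 1, 1]                      # 1 = 30-day readmit
high = forest_risk(train_X, train_y, [9, 6])
low = forest_risk(train_X, train_y, [2, 6])
print(high > low)  # the risk score ranks the high-risk encounter above the low-risk one
```

Real random forests also subsample training rows (bagging) and grow full trees; the ranking step on the slide is then just sorting encounters by this risk.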
    27. Emory Readmission Rates for High- and Low-Risk Groups Generated with Random Forest
    28. Predictive Modeling Applied to 180 UHC Hospitals
        (Figure: readmission fraction of the top 10% highest-risk patients at each hospital, comparing an all-hospital model against individual hospital models)
    29. Status of Healthcare Data Analytics
        • Integrative dataset analysis can leverage patient information gathered over many encounters
        • Temporal analyses can generate derived variables that appear to correlate with readmissions
        • Predictive modeling has promise of providing decision support
        • Data Analytics arm of the Emory New Care Model Initiative led by Greg Esper
        • Ongoing analyses involve characterization of clinical phenotype in GWAS, biomarker, and quality improvement efforts
        • Co-lead (with Bill Hersh) of CTSA CER Informatics taskforce dedicated to this issue
    30. High End Computing and Large Data
    31. Supercomputing – Collaboration with ORNL: Titan, peak speed 30,000,000,000,000,000 floating point operations per second!
    32. Core Transformations for Multi-scale Pipelines
        • Data cleaning and low-level transformations
        • Data subsetting, filtering, subsampling
        • Spatio-temporal mapping and registration
        • Object segmentation
        • Feature extraction, object classification
        • Spatio-temporal aggregation
        • Change detection, comparison, and quantification
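The transformations listed above compose naturally as a pipeline, with each stage's output feeding the next. A minimal sketch, where the stage names and toy stages are mine, standing in for the real segmentation/feature/aggregation operators:

```python
def pipeline(*stages):
    """Chain transformation stages: each stage's output feeds the next."""
    def run(data):
        for stage in stages:
            data = stage(data)
        return data
    return run

# Toy stand-ins for three of the transformations listed above
clean = lambda xs: [x for x in xs if x is not None]   # data cleaning
subsample = lambda xs: xs[::2]                        # subsetting/subsampling
aggregate = lambda xs: sum(xs) / len(xs)              # spatio-temporal aggregation

process = pipeline(clean, subsample, aggregate)
print(process([4, None, 2, 6, None, 8]))  # → 5.0
```

In the two-level execution model discussed next, stages like these are the units that get scheduled across nodes and, within a node, across CPU/GPU workers.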
    33. Extreme DataCutter – Two-Level Model
    34. Node-Level Work Scheduling
    35. Change Detection, Comparison, and Quantification (VLDB 2012)
    36. 36. Thanks!