Integrative Multi-Scale AnalysesJoel Saltz MD, PhDEmory UniversityFebruary 2013
a.k.a “Big Data”Center for Comprehensive Informatics                                       • Integrative Spatio-Temporal A...
Application TargetsCenter for Comprehensive Informatics                                       • Multi-dimensional spatial-...
Core TransformationsCenter for Comprehensive Informatics                                       • Data Cleaning and Low Lev...
Emory In Silico Center for Brain TumorResearch (PI = Dan Brat, PD= Joel Saltz)
National Science Foundation Grand Challenge           in Land Cover Dynamics                                         • Rem...
Analysis of Computational Data; Uncertainty                                       Quantification, Comparisons with Experim...
a.k.a “Big Data”Center for Comprehensive Informatics                                       • Integrative Spatio-Temporal A...
Center for Comprehensive Informatics                                       Whole Slide Imaging: Scale
Pathology Computer Assisted DiagnosisCenter for Comprehensive Informatics                                               Sh...
Computerized Classification System     for Grading Neuroblastoma                    Initialization                        ...
Using TCGA Data to Study GlioblastomaDiagnostic ImprovementMolecular ClassificationPredictors of Progression
TCGA Network               Digital Pathology               Neuroimaging
Morphological Tissue ClassificationCenter for Comprehensive Informatics                                           Whole Sl...
Can we use image analysis of TCGA GBMs TO INFORM diagnostic criteria based on molecular or clinical endpoints?            ...
Millions of Nuclei Defined by n Features• Bottom-up analysis: let features define  and drive the analysis• Top-down analys...
TCGA Whole Slide Images    Step 1:    Nuclei                 • Identify individual nuclei Segmentation                   a...
Nuclear Analysis Workflow    Step 1:             Step 2:    Nuclei              Feature Segmentation          Extraction• ...
Step 3:    Nuclei                  Nuclear Qualities Classification 1                                         10Oligodendr...
Representative NucleiOligo               Astro
Comparison of Machine-based Classification        to Human Based ClassificationSeparation of GBM, Oligo1, Oligo2   Separat...
Survival AnalysisHuman            Machine
Gene Expression Correlates of High Oligo-Astro    Ratio on Machine-based Classification                               Olig...
Millions of Nuclei Defined by n Features• Bottom-up analysis: let nuclear features  define and drive the analysis• Top-dow...
Direct Study of Relationship Between                                                 vsCenter for Comprehensive Informatic...
Nuclear Features Used to Classify GBMsCenter for Comprehensive Informatics                                                ...
Clustering identifies three morphological groupsCenter for Comprehensive Informatics                                      ...
Center for Comprehensive Informatics                                       Associations
Molecular Correlates of MR Features Using TCGA Data MRIs of TCGA GBMs reviewed by 3-6 neuroradiologists using VASARI featu...
Capturing structured annotations and markups/ AIM Data Service
VASARI      Feature SetScott HwangChad HolderAdam Flanders
Prognostic Significance of Vasari Features Tests Between Groups:   0-33% vs. 34-95% Proportion enhancing            Test  ...
a.k.a “Big Data”Center for Comprehensive Informatics                                       • Integrative Spatio-Temporal A...
Titan – Peak Speed 30,000,000,000,000,000                                       floating point operations per second!Cente...
Center for Comprehensive Informatics
Extreme DataCutter PrototypeCenter for Comprehensive Informatics                                       DataCutter         ...
Extreme DataCutter – Two Level ModelCenter for Comprehensive Informatics
Center for Comprehensive Informatics                                       Node Level Work Scheduling
Brain Tumor Pipeline Scaling on Keeneland                                       (100 Nodes)Center for Comprehensive Inform...
Challenge: Structured/Unstructured Grid                                       Calculations with Unpredictable RuntimeCente...
“Speedup” relative to single CPU coreCenter for Comprehensive Informatics
Large Scale Data ManagementCenter for Comprehensive Informatics                                        Represented by a c...
Spatial Centric – Pathology Imaging “GIS”Point query: human marked point      Window query: return markupsinside a nucleus...
Algorithm Validation: Intersectionbetween Two Result Sets (Spatial Join)            PAIS: Example Queries    .   .
VLDB 2012Center for Comprehensive Informatics                                       Change Detection, Comparison, and Quan...
Approach to Integrated Sensor Data Analysis                                       FrameworkCenter for Comprehensive Inform...
a.k.a “Big Data”Center for Comprehensive Informatics                                       • Integrative Spatio-Temporal A...
Clinical Phenotype Characterization and the Emory                                       Analytic Information WarehouseCent...
Overall SystemCenter for Comprehensive Informatics                                                                        ...
5-year Datasets from Emory and                                       University Healthcare ConsortiumCenter for Comprehens...
Using Emory & UHC Data to Find                                       Associations With 30-day ReadmitsCenter for Comprehen...
Derived VariablesCenter for Comprehensive Informatics                                 •     30-day readmit                ...
30-Day Readmission Rates for Derived                                       VariablesCenter for Comprehensive Informatics  ...
Geographic Analyses                                       UHC Medicine General Product Line (#15)Center for Comprehensive ...
Predictive Modeling for ReadmissionCenter for Comprehensive Informatics                                       • Random for...
Emory Readmission Rates for High and                                       Low Risk Groups Generated withCenter for Compre...
Status of Clinical Phenotype                                       CharacterizationCenter for Comprehensive Informatics   ...
a.k.a “Big Data”Center for Comprehensive Informatics                                       • Integrative Spatio-Temporal A...
Thanks to:                                       •   In silico center team: Dan Brat (Science PI), Tahsin Kurc, AshishCent...
Thanks toCenter for Comprehensive Informatics                                       • National Cancer Institute           ...
Thanks!
Integrative Multi-Scale Analyses
Integrative Multi-Scale Analyses
Integrative Multi-Scale Analyses
Upcoming SlideShare
Loading in …5
×

Integrative Multi-Scale Analyses

343 views

Published on

Integrative
analyses of large scale spatio-temporal datasets play increasingly important roles in many areas of science and engineering. Our recent work in this area is motivated by application scenarios involving complementary digital microscopy, Radiology and "omic"
analyses in cancer research. In these scenarios, our objective is to use a coordinated set of image analysis, feature extraction and machine learning methods to predict disease progression and to aid in targeting new therapies.
We describe methods
we have developed for extraction, management and analysis of features along with the systems software methods for optimizing execution on high end CPU/GPU platforms. We will also describe biomedical results obtained from these studies and extensions of the
computational methods to broader application areas.

1 Comment
0 Likes
Statistics
Notes
  • Be the first to like this

No Downloads
Views
Total views
343
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
13
Comments
1
Likes
0
Embeds 0
No embeds

No notes for slide
  • Combine with next slide.Graphical representation
  • Integrative Multi-Scale Analyses

    1. 1. Integrative Multi-Scale AnalysesJoel Saltz MD, PhDEmory UniversityFebruary 2013
    2. 2. a.k.a “Big Data”Center for Comprehensive Informatics • Integrative Spatio-Temporal Analytics • Deep Integrative Biomedical Research • High End Computing/”Big Data” Computers, Systems Software • Analysis of Patient Populations
    3. 3. Application TargetsCenter for Comprehensive Informatics • Multi-dimensional spatial-temporal datasets – Radiology and Microscopy Image Analyses – Oil Reservoir Simulation/Carbon Sequestration/Groundwater Pollution Remediation – Biomass monitoring and disaster surveillance using multiple types of satellite imagery – Weather prediction using satellite and ground sensor data – Analysis of Results from Large Scale Simulations • Correlative and cooperative analysis of data from multiple sensor modalities and sources • What-if scenarios and multiple design choices or initial conditions
    4. 4. Core TransformationsCenter for Comprehensive Informatics • Data Cleaning and Low Level Transformations • Data Subsetting, Filtering, Subsampling • Spatio-temporal Mapping and Registration • Object Segmentation • Feature Extraction, Object Classification • Spatio-temporal Aggregation • Change Detection, Comparison, and Quantification
    5. 5. Emory In Silico Center for Brain TumorResearch (PI = Dan Brat, PD= Joel Saltz)
    6. 6. National Science Foundation Grand Challenge in Land Cover Dynamics • Remote sensing analysis of high resolution satellite images. • Databases of land cover dynamics are essential for global carbon models, biogeochemical cycling, hydrological modeling and ecosystem response modeling • Maps of the worlds tropical rain forest during the past three decades. Larry Davis , Rama Chellappa , Joel Saltz , Alan Sussman , John Townshend
    7. 7. Analysis of Computational Data; Uncertainty Quantification, Comparisons with Experimental ResultsCenter for Comprehensive Informatics Dimitri Mavriplis, Raja Das, Joel Saltz
    8. 8. a.k.a “Big Data”Center for Comprehensive Informatics • Integrative Spatio-Temporal Analytics • Deep Integrative Biomedical Research • High End Computing/”Big Data” Computers, Systems Software • Analysis of Patient Populations
    9. 9. Center for Comprehensive Informatics Whole Slide Imaging: Scale
    10. 10. Pathology Computer Assisted DiagnosisCenter for Comprehensive Informatics Shimada, Gurcan, Kong, Saltz
    11. 11. Computerized Classification System for Grading Neuroblastoma Initialization YesImage Tile Background? Label I=L • Background Identification No Create Image I(L) • Image Decomposition (Multi- Training Tiles resolution levels) Segmentation I = I -1 • Image Segmentation Down-sampling (EMLDA) Segmentation Feature Construction • Feature Construction (2nd Yes No order statistics, Tonal Feature Extraction I > 1?Feature Construction Features) Feature Extraction Classification • Feature Extraction (LDA) + Classification (Bayesian) Classifier Training No • Multi-resolution Layer Within Confidence Region ? Controller (Confidence Yes TRAINING Region) TESTING
    12. 12. Using TCGA Data to Study GlioblastomaDiagnostic ImprovementMolecular ClassificationPredictors of Progression
    13. 13. TCGA Network Digital Pathology Neuroimaging
    14. 14. Morphological Tissue ClassificationCenter for Comprehensive Informatics Whole Slide Imaging Cellular Features Nuclei Segmentation Lee Cooper, Jun Kong
    15. 15. Can we use image analysis of TCGA GBMs TO INFORM diagnostic criteria based on molecular or clinical endpoints? Nuclear QualitiesOligodendroglioma Astrocytoma Application: Oligodendroglioma Component in GBM
    16. 16. Millions of Nuclei Defined by n Features• Bottom-up analysis: let features define and drive the analysis• Top-down analysis: use the features with existing diagnostic constructs
    17. 17. TCGA Whole Slide Images Step 1: Nuclei • Identify individual nuclei Segmentation and their boundaries Jun Kong
    18. 18. Nuclear Analysis Workflow Step 1: Step 2: Nuclei Feature Segmentation Extraction• Describe individual nuclei in terms of size, shape, and texture
    19. 19. Step 3: Nuclei Nuclear Qualities Classification 1 10Oligodendroglioma Astrocytoma
    20. 20. Representative NucleiOligo Astro
    21. 21. Comparison of Machine-based Classification to Human Based ClassificationSeparation of GBM, Oligo1, Oligo2 Separation of GBM, Oligo1 andas Designated by Oligo2 as Designated by MachineNeuropathologists
    22. 22. Survival AnalysisHuman Machine
    23. 23. Gene Expression Correlates of High Oligo-Astro Ratio on Machine-based Classification Oligo Related Genes Myelin Basic Protein Proteolipoprotein HoxD1 Nuclear features most Associated with Oligo Signature Genes: Circularity (high) Eccentricity (low)
    24. 24. Millions of Nuclei Defined by n Features• Bottom-up analysis: let nuclear features define and drive the analysis• Top-down analysis: analyze features in context of existing diagnostic constructs
    25. 25. Direct Study of Relationship Between vsCenter for Comprehensive Informatics Lee Cooper, Carlos Moreno
    26. 26. Nuclear Features Used to Classify GBMsCenter for Comprehensive Informatics 50 3 2 1 20 1 45 40 Silhouette Area 40 60 Cluster 80 2 35 100 120 30 140 3 25 160 2 3 4 5 6 7 20 40 60 80 100 120 140 160 # Clusters 0 0.5 1 Silhouette Value Consensus clustering of morphological signatures Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients Each possibility evaluated using 2000 iterations of K- means to quantify co-clustering
    27. 27. Clustering identifies three morphological groupsCenter for Comprehensive Informatics • Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides) • Named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB) • Prognostically-significant (logrank p=4.5e-4) CC CM PB 1 CC 10 0.8 CM PB 20 Feature Indices 0.6 Survival 30 0.4 40 0.2 50 0 0 500 1000 1500 2000 2500 3000 Days
    28. 28. Center for Comprehensive Informatics Associations
    29. 29. Molecular Correlates of MR Features Using TCGA Data MRIs of TCGA GBMs reviewed by 3-6 neuroradiologists using VASARI feature set and In Vivo Imaging tools MR Features compared to TCGA Transcriptional Classes and Genetic Alterations David Gutman
    30. 30. Capturing structured annotations and markups/ AIM Data Service
    31. 31. VASARI Feature SetScott HwangChad HolderAdam Flanders
    32. 32. Prognostic Significance of Vasari Features Tests Between Groups: 0-33% vs. 34-95% Proportion enhancing Test ChiSquare DF P-Value Log-Rank 12.4775 3 0.0059* Wilcoxon 10.0802 3 0.0179*
    33. 33. a.k.a “Big Data”Center for Comprehensive Informatics • Integrative Spatio-Temporal Analytics • Deep Integrative Biomedical Research • High End Computing/”Big Data” Computers, Systems Software • Analysis of Patient Populations
    34. 34. Titan – Peak Speed 30,000,000,000,000,000 floating point operations per second!Center for Comprehensive Informatics
    35. 35. Center for Comprehensive Informatics
    36. 36. Extreme DataCutter PrototypeCenter for Comprehensive Informatics DataCutter Pipeline of filters connected though logical streams In transit processing Flow control between filters and streams Developed 1990s-2000s; led to IBM System S Extreme DataCutter Two level hierarchical pipeline framework In transit processing Coarse grained components coordinated by Manager that coordinates work on pipeline stages between nodes Fine grained pipeline operations managed at the node level Both levels employ filter/stream paradigm Bottom line – everything ends up as DAGS
    37. 37. Extreme DataCutter – Two Level ModelCenter for Comprehensive Informatics
    38. 38. Center for Comprehensive Informatics Node Level Work Scheduling
    39. 39. Brain Tumor Pipeline Scaling on Keeneland (100 Nodes)Center for Comprehensive Informatics
    40. 40. Challenge: Structured/Unstructured Grid Calculations with Unpredictable RuntimeCenter for Comprehensive Informatics Dependencies Key Kernel in Distance Transform, Morphological Reconstruction, Delaney Triagulation
    41. 41. “Speedup” relative to single CPU coreCenter for Comprehensive Informatics
    42. 42. Large Scale Data ManagementCenter for Comprehensive Informatics  Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc.  Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships  Highly optimized spatial query and analyses  Implemented in a variety of ways including optimized CPU/GPU, Hadoop/HDFS and IBM DB2
    43. 43. Spatial Centric – Pathology Imaging “GIS”Point query: human marked point Window query: return markupsinside a nucleus contained in a rectangle .Containment query: nuclear feature Spatial join query: algorithmaggregation in tumor regions validation/comparison
    44. 44. Algorithm Validation: Intersectionbetween Two Result Sets (Spatial Join) PAIS: Example Queries . .
    45. 45. VLDB 2012Center for Comprehensive Informatics Change Detection, Comparison, and Quantification
    46. 46. Approach to Integrated Sensor Data Analysis FrameworkCenter for Comprehensive Informatics • Abstract templates specify dataset geometry • Templates describe collections of space-time regions • Mapping to memory hierarchies provided by user defined mapping functions • Leverages Parashar’s DataSpaces
    47. 47. a.k.a “Big Data”Center for Comprehensive Informatics • Integrative Spatio-Temporal Analytics • Deep Integrative Biomedical Research • High End Computing/”Big Data” Computers, Systems Software • Analysis of Patient Populations
    48. 48. Clinical Phenotype Characterization and the Emory Analytic Information WarehouseCenter for Comprehensive Informatics • Example Project: Find hot spots in readmissions within 30 days – What fraction of patients with a given principal diagnosis will be readmitted within 30 days? – What fraction of patients with a given set of diseases will be readmitted within 30 days? – How does severity and time course of co-morbidities affect readmissions? – Geographic analyses • Compare and contrast with UHC Clinical Data Base – Repeat analyses across all UHC hospitals – Are we performing the same? – How are UHC-curated groupings of patients (e.g., product lines) useful? • Need a repeatable process that we can apply identically to both local and UHC data Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod
    49. 49. Overall SystemCenter for Comprehensive Informatics Metadata Repository I2b2 Web I2b2 Server Database Investigator Metadata Manager Data Modeler Data Query Processing Specification Data Analyst Investigator Database Mapper Data Analyst Study- Query tools specific Database Source Source Source Investigator data data data
    50. 50. 5-year Datasets from Emory and University Healthcare ConsortiumCenter for Comprehensive Informatics • EUH, EUHM and WW (inpatient encounters) • Removed encounter pairs with chemotherapy and radiation therapy readmit encounters (CDW data) • Encounter location (down to unit for Emory) • Providers (Emory only) • Discharge disposition • Primary and secondary ICD9 codes • Procedure codes • DRGs • Medication orders (Emory only) • Labs (Emory only) • Vitals (Emory only) • Geographic information (CDW only + US Census and American Community Survey) Analytic Information
    51. 51. Using Emory & UHC Data to Find Associations With 30-day ReadmitsCenter for Comprehensive Informatics • Problem: “Raw” clinical and administrative variables are difficult to use for associative data mining – Too many diagnosis codes, procedure codes – Continuous variables (e.g., labs) require interpretation – Temporal relationships between variables are implicit • Solution: Transform the data into a much smaller set of variables using heuristic knowledge – Categorize diagnosis and procedure codes using code hierarchies – Classify continuous variables using standard interpretations (e.g., high, normal, low) – Identify temporal patterns (e.g., frequency, duration, sequence) – Apply standard data mining techniques Analytic Information
    52. 52. Derived VariablesCenter for Comprehensive Informatics • 30-day readmit • The 9 Emory Enhanced Risk Assessment Tool diagnosis categories • UHC product lines • Variables derived from a combination of codes and/or laboratory test results – Obesity – Diabetes/uncontrolled diabetes – End-stage renal disease (ESRD) – Pressure ulcer – Sickle cell disease/sickle cell crisis • Temporal variables derived over multiple encounters – Multiple MI – Multiple 30-day readmissions – Chemotherapy within 180 (or 365) days before surgery – Previous encounter within the last 90 (or 180) days
    53. 53. 30-Day Readmission Rates for Derived VariablesCenter for Comprehensive Informatics Emory Health Care
    54. 54. Geographic Analyses UHC Medicine General Product Line (#15)Center for Comprehensive Informatics Analytic Information Warehouse
    55. 55. Predictive Modeling for ReadmissionCenter for Comprehensive Informatics • Random forests (ensemble of decision trees) – Create a decision tree using a random subset of the variables in the dataset – Generate a large number of such trees – All trees vote to classify each test example in a training dataset – Generate a patient-specific readmission risk for each encounter • Rank the encounters by risk for a subsequent 30- day readmission Sharath Cholleti
    56. 56. Emory Readmission Rates for High and Low Risk Groups Generated withCenter for Comprehensive Informatics Random Forest
    57. 57. Status of Clinical Phenotype CharacterizationCenter for Comprehensive Informatics • Integrative dataset analysis can leverage patient information gathered over many encounters • Temporal analyses can generate derived variables that appear to correlate with readmissions • Predictive modeling has promise of providing decision support • Data Analytics arm of the Emory New Care Model Initiative led by Greg Esper • Ongoing analyses involve characterization of clinical phenotype in GWAS, biomarker and quality improvement efforts • Co-lead (with Bill Hersh) of CTSA CER Informatics taskforce dedicated to this issue
    58. 58. a.k.a “Big Data”Center for Comprehensive Informatics • Integrative Spatio-Temporal Analytics • Deep Integrative Biomedical Research • High End Computing/”Big Data” Computers, Systems Software • Analysis of Patient Populations
    59. 59. Thanks to: • In silico center team: Dan Brat (Science PI), Tahsin Kurc, AshishCenter for Comprehensive Informatics Sharma, Tony Pan, David Gutman, Jun Kong, Sharath Cholleti, Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel Rubin, Tom Mikkelsen, Adam Flanders, Joel Saltz (Director) • Digital Pathology R01 (s): Foran and Saltz; Jun Kong, Sharath Cholleti, Fusheng Wang, Tony Pan, Tahsin Kurc, Ashish Sharma, David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun Hu, Lin Yang, David J. Foran (Rutgers) • Analytic Warehouse team: Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod • In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz • NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich Huang, Dima Hammoud, Manal Jilwan, Prashant Raghavan, Max Wintermark, David Gutman, Carlos Moreno, Lee Cooper, John Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani, Carl Jaffe • ACTSI Biomedical Informatics Program: Marc Overcash, Tim Morris, Tahsin Kurc, Alexander Quarshie, Circe Tsui, Adam Davis, Sharon Mason, Andrew Post, Alfredo Tirado-Ramos • ORNL HPC collaboration: Scott Klasky, David Pugmire ORNL
    60. 60. Thanks toCenter for Comprehensive Informatics • National Cancer Institute • National Library of Medicine • National Science Foundation • Cardiovascular Research Grid (NHLBI) • Minority Health Grid (ARRA) • Emory Health Care • Kaiser Health Care • Winship Cancer Institute • Oak Ridge National Laboratory • Woodruff Health Sciences
    61. 61. Thanks!

    ×