Extreme Spatio-Temporal Data Analysis


Talk delivered at XLDB Asia, June 2012

  • 1) Metadata about images
    2) Metadata about image targets and how images are derived (patient, specimen, anatomicEntity, etc.)
    3) Metadata about analyses (the purpose of the analysis, who performed the analysis, etc.)
    4) Image markups: a markup delineates a spatial region (e.g., as points, lines, polygons, multi-polygons) in images
    5) Annotation, image features: a type of annotation calculated or derived from the markups
    6) Annotation, observation: an annotation associates semantic meaning to markup entities through coded or free text terms that provide explanatory or descriptive information
    7) Provenance information, i.e., the derivation history of a markup or annotation, including algorithm information, parameters, and inputs
  • Native XML database based approach
      – Small sized PAIS documents, e.g., organ, tissue, or region level annotations
      – No mapping needed; supports standard XML queries
  • Relational and spatial database approach
      – For large scale PAIS documents, e.g., analysis results at cellular or subcellular level
      – Data mapped into relational tables and spatial objects
      – Highly efficient on storage and queries
  • Extreme Spatio-Temporal Data Analysis

    1. Extreme Spatio-Temporal Data Analysis in Biomedical Informatics. Joel Saltz, MD, PhD, Director, Center for Comprehensive Informatics
    2. Contributions
        • Computer Science: Methods and middleware for analysis and classification of very large datasets from low-dimensional spatio-temporal sensors; methods to carry out comparisons and change detection between sensor datasets
        • Biomedical: Mine whole slide image datasets to better predict outcome and response to treatments, generate basic insights into pathophysiology, and identify new treatment targets
    3. Outline of Talk
        • Pathology: Analysis of Digitized Tissue for Research and Practice
        • Feature Clustering: Morphologic Tumor Subtypes in GBM Brain Tumors and Relationship to “omic” classifications
        • Whole Slide Image Analysis in Clinical Practice: Neuroblastoma
        • Tissue Flow: Multiplex Quantum Dot
        • HPC/BIGDATA Feature Pipeline
        • Pathology data analytic tools and techniques
    4. Whole Slide Imaging: Scale
    5. Pathology Computer Assisted Diagnosis (Shimada, Gurcan, Kong, Saltz)
    6. Computerized Classification System for Grading Neuroblastoma
        • Background Identification
        • Image Decomposition (multi-resolution levels)
        • Image Segmentation (EMLDA)
        • Feature Construction (2nd order statistics, tonal features)
        • Feature Extraction (LDA) + Classification (Bayesian)
        • Multi-resolution Layer Controller (Confidence Region)
        [Flowchart with TRAINING and TESTING paths. Node labels: Initialization, Image Tile, Background?, Label, I = L, Create Image I(L), Training Tiles, Down-sampling, Segmentation, Feature Construction, Feature Extraction, Classifier Training, Classification, Within Confidence Region?, I = I - 1, I > 1?]
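
The multi-resolution layer controller on slide 6 can be read as a loop that accepts a classification as soon as it falls inside the confidence region and otherwise moves to the next level. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `classify` callback, the `min_confidence` threshold, and the convention that level L is the coarsest resolution and level 1 the finest are all hypothetical, chosen to match the "I = L", "I = I - 1", and "I > 1?" labels in the flowchart.

```python
from typing import Callable, Tuple

# Hypothetical classifier signature: (tile, level) -> (label, confidence in [0, 1]).
Classifier = Callable[[object, int], Tuple[str, float]]


def classify_multiresolution(tile, num_levels: int, classify: Classifier,
                             min_confidence: float = 0.9) -> Tuple[str, float, int]:
    """Classify a tile, moving to the next level only while confidence is too low.

    Level `num_levels` is assumed to be the coarsest resolution and level 1 the
    finest (an assumption based on the flowchart labels, not stated on the slide).
    """
    if num_levels < 1:
        raise ValueError("num_levels must be at least 1")
    level = num_levels
    while True:
        label, confidence = classify(tile, level)
        # Accept once the decision is confident, or when no further level is left.
        if confidence >= min_confidence or level == 1:
            return label, confidence, level
        level -= 1


# Toy stand-in classifier: confidence grows as the level index decreases.
def toy_classifier(tile, level):
    return ("class-A", {3: 0.6, 2: 0.95, 1: 0.99}[level])


result = classify_multiresolution("tile-0", 3, toy_classifier)
print(result)
```

The point of the design is that most tiles are decided at a cheap, coarse level, and only ambiguous tiles pay for higher-resolution processing.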
    7. [Figure slide; no text]
    8. Direct Study of Relationship Between [image] vs [image]
    9. In Silico Brain Tumor Center: Anaplastic Astrocytoma (WHO grade III), Glioblastoma (WHO grade IV)
    10. Morphological Tissue Classification: Whole Slide Imaging, Nuclei Segmentation, Cellular Features (Lee Cooper, Jun Kong)
    11. Nuclear Features Used to Classify GBMs
        • Consensus clustering of morphological signatures
        • Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients
        • Each possibility evaluated using 2000 iterations of K-means to quantify co-clustering
        [Figure: axes labeled Silhouette Area, # Clusters, Cluster, Silhouette Value]
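
Slide 11 describes quantifying co-clustering over repeated K-means runs. A minimal sketch of that idea follows, with several assumptions that are not in the slides: features are single scalars (the real signatures are multi-dimensional), each run clusters a random 80% subsample, and the function names `kmeans_1d` and `consensus_matrix` are hypothetical. Entry (i, j) of the result is the fraction of runs, among those where both items were sampled, in which they landed in the same cluster.

```python
import random
from typing import List


def kmeans_1d(values: List[float], k: int, rng: random.Random, iters: int = 20) -> List[int]:
    """Plain Lloyd's k-means on scalar features; returns a cluster index per value."""
    centers = rng.sample(values, k)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = sum(members) / len(members)
    return labels


def consensus_matrix(values: List[float], k: int, runs: int = 200,
                     fraction: float = 0.8, seed: int = 0) -> List[List[float]]:
    """Fraction of runs in which each pair of items was placed in the same cluster."""
    rng = random.Random(seed)
    n = len(values)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    size = max(k, int(round(fraction * n)))
    for _ in range(runs):
        idx = rng.sample(range(n), size)  # subsample the items for this run
        labels = kmeans_1d([values[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]


# Toy data: two well-separated groups of three items each.
scores = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
consensus = consensus_matrix(scores, k=2)
```

Repeating this for each candidate number of clusters and comparing how cleanly the matrix separates into blocks is one common way to choose the cluster count; the slides report silhouette values for that purpose.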
    12. Clustering identifies three morphological groups
        • Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)
        • Named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB)
        • Prognostically significant (log-rank p=4.5e-4)
        [Figure: Feature Indices by cluster (CC, CM, PB); Survival vs. Days for CC, CM, PB]
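
Survival-versus-days curves like the one on slide 12 are commonly drawn with the Kaplan-Meier estimator. The slides only report the log-rank p-value, so treating the plot as Kaplan-Meier is an assumption, and the function name and toy data below are illustrative only.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) steps of the Kaplan-Meier estimator.

    events[i] is 1 if the event (death) was observed at times[i], 0 if censored.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            # Multiply by the conditional probability of surviving past time t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored subjects both leave the risk set
    return curve


# Toy cohort: deaths at day 1 and day 3, one subject censored at day 2.
curve = kaplan_meier([1, 2, 3], [1, 0, 1])
```

Computing one such curve per morphology cluster gives the step functions that a log-rank test then compares.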
    13. Gene Expression Class Associations
        • Cox proportional hazards
            – Gene expression class not significant (p=0.58)
            – Morphology clustering p=5.0e-3
        [Figure: Subtype Percentage (%) by Cluster (CC, CM, PB) for the Classical, Mesenchymal, Neural, and Proneural subtypes]
    14. Clustering Validation
        • Separate set of 84 GBMs from Henry Ford Hospital
        • ClusterRepro: CC p=7.2e-3, CM p=1.3e-2
        [Figure: Feature Indices by cluster (CC, Mixed, CM); Survival vs. Months for CC, Mixed, CM]
    15. Associations
    16. Novel Pathology Modalities
        • Genomics: Excellent Molecular Resolution, Limited Spatial Resolution, 1000’s of genes
        • Imaging: Excellent Spatial Resolution, Limited Molecular Resolution
    17. Quantum Dots (Professor Robin Bostick)
    18. Imaging Pipeline – Feature Extraction
    19. Example Application: Cancer Stem Cell Niche
        • Cancer stem cells
            – Rare(?), proliferative, regenerative cells
            – Do they prefer to live near blood vessels, or necrosis?
    20. Extreme Spatio-Temporal Sensor Data Analytics
        • Leverage exascale data and computer resources to squeeze the most out of image, sensor or simulation data
        • Run lots of different algorithms to derive the same features
        • Run lots of algorithms to derive complementary features
        • Data models and data management infrastructure to manage data products, feature sets and results from classification and machine learning algorithms
    21. Application Targets
        • Multi-dimensional spatial-temporal datasets
            – Microscopy image analyses
            – Biomass monitoring using satellite imagery
            – Weather prediction using satellite and ground sensor data
            – Large scale simulations
        • Can we analyze 100,000+ microscopy images per hour?
        • Correlative and cooperative analysis of data from multiple sensor modalities and sources
        • What-if scenarios and multiple design choices or initial conditions
    22. Biomass Monitoring (joint with ORNL)
        • Investigate changes in vegetation and land use
        • Combine hierarchical, multi-resolution coarse- and fine-grained analytics into a unified framework
        • Changes identified using high temporal/low spatial resolution MODIS data
        • Segmentation and classification methods used to characterize changes using higher resolution data (e.g., multitemporal AWiFS data)
        • Segmentation and classification to identify man-made structures
    23. [Figure slide; no text]
    24. Core Transformations
        • Data Cleaning and Low Level Transformations
        • Data Subsetting, Filtering, Subsampling
        • Spatio-temporal Mapping and Registration
        • Object Segmentation
        • Feature Extraction, Object Classification
        • Spatio-temporal Aggregation
        • Change Detection, Comparison, and Quantification
    25. Extreme DataCutter
        • DataCutter
            – Pipeline of filters connected through logical streams
            – In transit processing
            – Flow control between filters and streams
            – Developed 1990s-2000s; led to IBM System S
        • Extreme DataCutter
            – Two level hierarchical pipeline framework
            – In transit processing
            – Coarse grained components coordinated by a Manager that coordinates work on pipeline stages between nodes
            – Fine grained pipeline operations managed at the node level
            – Both levels employ the filter/stream paradigm
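
The filter/stream paradigm on slide 25 can be illustrated with a small single-node sketch. This is not DataCutter's API: `run_pipeline`, the sentinel object, and the use of Python threads with bounded queues as "streams" are all assumptions made for illustration. A full queue blocks the upstream filter, which stands in for flow control between filters.

```python
import queue
import threading
from typing import Callable, Iterable, List

_END = object()  # sentinel marking the end of a stream


def run_pipeline(source: Iterable, filters: List[Callable], buffer_size: int = 4) -> list:
    """Run `filters` as threads connected by bounded queues (the "streams").

    A full queue blocks the upstream filter, a simple form of flow control.
    Error handling is omitted: a filter that raises would stall the pipeline.
    """
    streams = [queue.Queue(maxsize=buffer_size) for _ in range(len(filters) + 1)]

    def feed():
        for item in source:
            streams[0].put(item)
        streams[0].put(_END)

    def stage(fn, inbox, outbox):
        while True:
            item = inbox.get()
            if item is _END:
                outbox.put(_END)  # pass the end-of-stream marker downstream
                return
            outbox.put(fn(item))

    threads = [threading.Thread(target=feed)]
    threads += [threading.Thread(target=stage, args=(fn, streams[i], streams[i + 1]))
                for i, fn in enumerate(filters)]
    for t in threads:
        t.start()

    results = []
    while True:  # the caller drains the last stream
        item = streams[-1].get()
        if item is _END:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Two toy filters standing in for, e.g., a cleaning step and a feature step.
output = run_pipeline(range(10), [lambda x: x + 1, lambda x: x * 2])
```

Because each stage here is a single thread reading from a FIFO queue, output order matches input order; a real deployment would run many copies of each filter across nodes.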
    26. Extreme DataCutter – Two Level Model
    27. Node Level Work Scheduling
        • Features of Node Level Architectures
            – Nodes contain CPUs, GPUs
            – Each CPU contains multiple cores
            – GPU has complex internal architecture
            – Data locality within node
            – Data paths between CPUs and GPUs
        [Figure: Keeneland Node]
    28. Node Level Work Scheduling
        • Attempt to minimize data movement
        • Identify and assign operations that perform well on GPU
        • Balance load between CPUs and GPUs
        • Prefetch data
        • Identify and use high bandwidth CPU/GPU data paths
        • Schedule exclusive GPU access for components (e.g., morphological reconstruction) requiring fine grained parallelism
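
Two of the goals on slide 28, assigning GPU-friendly operations to the GPU and balancing load between CPUs and GPUs, can be sketched with a greedy list scheduler. This is a stand-in technique, not the scheduler described in the talk: the `Op` record, the per-device time estimates, and the rule "largest estimated GPU speedup first, then earliest finish time" are assumptions, and data movement and prefetching are ignored.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Op:
    name: str
    cpu_seconds: float  # estimated run time on one CPU core (hypothetical)
    gpu_seconds: float  # estimated run time on the GPU, must be > 0 (hypothetical)


def schedule(ops: List[Op], cpu_cores: int = 2, gpus: int = 1) -> Tuple[Dict[str, str], float]:
    """Greedy list scheduling: each op goes to the device that would finish it first."""
    free_at = {f"cpu{i}": 0.0 for i in range(cpu_cores)}
    free_at.update({f"gpu{i}": 0.0 for i in range(gpus)})
    placement = {}
    # Consider ops with the largest GPU speedup first so they get first pick of GPUs.
    for op in sorted(ops, key=lambda o: o.cpu_seconds / o.gpu_seconds, reverse=True):
        def finish(device: str) -> float:
            cost = op.gpu_seconds if device.startswith("gpu") else op.cpu_seconds
            return free_at[device] + cost

        best = min(free_at, key=finish)
        free_at[best] = finish(best)
        placement[op.name] = best
    return placement, max(free_at.values())


# Toy estimates; the numbers are made up for illustration.
ops = [
    Op("morphological_reconstruction", cpu_seconds=12.0, gpu_seconds=1.0),
    Op("segmentation", cpu_seconds=4.0, gpu_seconds=2.0),
    Op("feature_extraction", cpu_seconds=3.0, gpu_seconds=3.0),
]
placement, makespan = schedule(ops)
```

With these toy numbers the two operations that benefit from the GPU queue up on it, while the operation with no speedup runs on a CPU core in parallel.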
    29. Node Level Work Scheduling
    30. Brain Tumor Pipeline Scaling on Keeneland (100 Nodes)
    31. Control Structures for Handling Fine Grained/Runtime Dependent Parallelism in GPUs
        • Morphological Reconstruction: 8-15 fold speedup vs. one CPU core (Intel i7 2.66 GHz) on NVIDIA C2070 and GTX580 GPUs
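
For readers unfamiliar with the operation named on slide 31, here is a sequential reference sketch of grayscale morphological reconstruction by dilation. It is the simple "dilate, clip under the mask, repeat until stable" formulation with 4-connectivity, not the GPU implementation from the talk, and the function name is hypothetical. How far a value propagates depends on the image content, which is the kind of runtime-dependent work the slide title refers to.

```python
from typing import List

Image = List[List[int]]


def reconstruct_by_dilation(marker: Image, mask: Image) -> Image:
    """Grayscale morphological reconstruction of `mask` from `marker` (4-connectivity).

    Repeats "dilate the marker, then clip it under the mask" until nothing changes.
    """
    rows, cols = len(mask), len(mask[0])
    # Reconstruction is defined for marker <= mask, so clip the marker first.
    current = [[min(marker[r][c], mask[r][c]) for c in range(cols)] for r in range(rows)]
    changed = True
    while changed:
        changed = False
        nxt = [row[:] for row in current]
        for r in range(rows):
            for c in range(cols):
                best = current[r][c]
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        best = max(best, current[rr][cc])
                value = min(best, mask[r][c])  # never exceed the mask
                if value != current[r][c]:
                    nxt[r][c] = value
                    changed = True
        current = nxt
    return current


# The marker seeds the left plateau only; the zero-valued gap stops propagation.
mask = [[5, 5, 0, 7, 7]]
marker = [[5, 0, 0, 0, 0]]
reconstructed = reconstruct_by_dilation(marker, mask)
```

In the toy row, the left plateau is recovered from its seed while the right plateau stays at zero because no seed can reach it through the gap.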
    32. Large Scale Data Management
        • Implemented with IBM DB2 for large scale pathology image metadata (~million markups per slide)
        • Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc.
        • Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships
        • Highly optimized spatial query and analyses
    33. Spatial Centric – Pathology Imaging “GIS”
        • Point query: human marked point inside a nucleus
        • Window query: return markups contained in a rectangle
        • Containment query: nuclear feature aggregation in tumor regions
        • Spatial join query: algorithm validation/comparison
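
The point and window queries on slide 33 can be sketched in a few lines. The talk runs such queries inside a spatial database; the in-memory version below, with its dictionary of named polygons and the function names `point_query` and `window_query`, is only an illustrative assumption. It uses the standard ray-casting test for point-in-polygon and a bounding-box test for containment in a rectangle.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]


def point_in_polygon(p: Point, poly: Polygon) -> bool:
    """Ray-casting test: count polygon edges crossed by a ray going right from p."""
    x, y = p
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through p
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def bounding_box(poly: Polygon) -> Tuple[float, float, float, float]:
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    return min(xs), min(ys), max(xs), max(ys)


def point_query(p: Point, markups: Dict[str, Polygon]) -> List[str]:
    """Names of markups that contain the point."""
    return [name for name, poly in markups.items() if point_in_polygon(p, poly)]


def window_query(window: Tuple[float, float, float, float],
                 markups: Dict[str, Polygon]) -> List[str]:
    """Names of markups fully contained in the rectangle (x_min, y_min, x_max, y_max)."""
    wx1, wy1, wx2, wy2 = window
    hits = []
    for name, poly in markups.items():
        x1, y1, x2, y2 = bounding_box(poly)
        # A polygon lies inside a rectangle exactly when its bounding box does.
        if wx1 <= x1 and wy1 <= y1 and x2 <= wx2 and y2 <= wy2:
            hits.append(name)
    return hits


# Hypothetical markups: one square and one triangle.
markups = {
    "nucleus_a": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "nucleus_b": [(10, 10), (12, 10), (11, 13)],
}
point_hits = point_query((2, 2), markups)
window_hits = window_query((9, 9, 13, 14), markups)
```

At the scale mentioned in the talk (on the order of a million markups per slide) a linear scan like this would be replaced by a spatial index.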
    34. PAIS (Pathology Analytical Imaging Standards)
        • Supported by caBIG, R01 and ACTSI
        • PAIS Logical Model
            – 62 UML classes
            – markups, annotations, imageReferences, provenance
        • PAIS Data Representation
            – XML (compressed) or HDF5
        • PAIS Databases
            – loading, managing, querying and sharing data
            – Native XML DBMS or RDBMS + SDBMS
        [UML class diagram "Domain Mo...". Classes shown: PAIS, Project, Group, User, Collection, Patient, Subject, Specimen, Region, AnatomicEntity, Equipment, ImageReference, WholeSlideImageReference, TMAImageReference, MicroscopyImageReference, DICOMImageReference, AnnotationReference, Markup, GeometricShape, Surface, Field, Annotation, Observation, Calculation, Inference, Provenance]
    35. PAIS: Example Queries
        • Example Query for Integrative Studies: find mean nuclear feature vector and covariance on tumor regions for each patient, grouped by tumor subtype
            SELECT c.pais_uid, pc.subtype,
                   AVG(area), AVG(perimeter), AVG(eccentricity),
                   COVARIANCE(area, perimeter), COVARIANCE(area, eccentricity)
            FROM pais.calculation_flat c, TCGA.PATIENT_CHARACTERISTIC pc, pais.patient p
            WHERE p.patientid = pc.patient_id AND p.pais_uid = c.pais_uid
            GROUP BY c.pais_uid, pc.subtype;
        [Figure: Feature Indices by Cluster (Clusters 1-4); Survival vs. Days for Clusters 1-4; Silhouette Value by Cluster]
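
To make the aggregation in the slide 35 query concrete, the sketch below computes the same per-group statistics in plain Python over already-joined rows. The row layout, the `summarize` function, and the sample values are hypothetical. The covariance here divides by n (population covariance), which is my understanding of what DB2's COVARIANCE aggregate returns; treat that as an assumption to verify against the DB2 documentation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)


def covariance(xs: List[float], ys: List[float]) -> float:
    """Population covariance (divides by n, not n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def summarize(rows: List[Tuple[str, str, float, float, float]]) -> Dict[Tuple[str, str], dict]:
    """Group rows of (pais_uid, subtype, area, perimeter, eccentricity) and aggregate."""
    groups = defaultdict(list)
    for uid, subtype, area, perimeter, eccentricity in rows:
        groups[(uid, subtype)].append((area, perimeter, eccentricity))
    summary = {}
    for key, feats in groups.items():
        area, perimeter, eccentricity = (list(col) for col in zip(*feats))
        summary[key] = {
            "avg_area": mean(area),
            "avg_perimeter": mean(perimeter),
            "avg_eccentricity": mean(eccentricity),
            "cov_area_perimeter": covariance(area, perimeter),
            "cov_area_eccentricity": covariance(area, eccentricity),
        }
    return summary


# Made-up rows standing in for the joined calculation and patient tables.
rows = [
    ("doc1", "Proneural", 10.0, 12.0, 0.5),
    ("doc1", "Proneural", 20.0, 16.0, 0.7),
    ("doc2", "Classical", 30.0, 20.0, 0.9),
]
summary = summarize(rows)
```

Each group's means and covariances form the kind of per-patient feature summary that the slide's clustering and survival analyses start from.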
    36. PAIS: Example Queries. Algorithm Validation: Intersection between Two Result Sets (Spatial Join)
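
The spatial join on slide 36 pairs up markups from two algorithm runs and measures how much they intersect. A minimal sketch follows, with assumptions that are not in the slides: regions are represented as sets of pixel coordinates rather than polygons, candidate pairs are pre-filtered by bounding box, and overlap is scored as intersection over union (the Jaccard index). The function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]
Box = Tuple[int, int, int, int]


def bbox(region: Set[Pixel]) -> Box:
    xs = [x for x, _ in region]
    ys = [y for _, y in region]
    return min(xs), min(ys), max(xs), max(ys)


def boxes_overlap(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def spatial_join(result_a: Dict[str, Set[Pixel]],
                 result_b: Dict[str, Set[Pixel]]) -> List[Tuple[str, str, float]]:
    """Return (name_a, name_b, intersection-over-union) for every intersecting pair."""
    boxes_b = {name: bbox(region) for name, region in result_b.items()}
    pairs = []
    for name_a, region_a in result_a.items():
        box_a = bbox(region_a)
        for name_b, region_b in result_b.items():
            if not boxes_overlap(box_a, boxes_b[name_b]):
                continue  # cheap filter before the exact pixel comparison
            shared = len(region_a & region_b)
            if shared:
                pairs.append((name_a, name_b, shared / len(region_a | region_b)))
    return pairs


def square(x0: int, y0: int, side: int) -> Set[Pixel]:
    """Helper for toy data: a filled square of pixels."""
    return {(x, y) for x in range(x0, x0 + side) for y in range(y0, y0 + side)}


# Two hypothetical segmentation results over the same image.
algo_a = {"a1": square(0, 0, 4), "a2": square(10, 10, 2)}
algo_b = {"b1": square(2, 0, 4), "b2": square(20, 20, 2)}
pairs = spatial_join(algo_a, algo_b)
```

The nested loop is quadratic in the number of markups; a spatial database avoids that with an index, which is why the talk pushes this comparison into the database.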
    37. VLDB 2012: Change Detection, Comparison, and Quantification
    38. Summary and Perspective
        • Large scale integrative data analytic methods and tools to integrate clinical, molecular, Pathology, Radiology data
        • Characterize new cancer subtypes and biomarkers, predict outcome, treatment response
        • Algorithms to quantify Pathology classification
        • HPC/BIGDATA analysis pipelines
    39. Importance:
        • Computer Science: general approaches to analysis and classification of very large datasets from low-dimensional spatio-temporal sensors
        • Biomedical: generate basic insights into pathophysiology, clues to new treatments, better ways of evaluating existing treatments and core infrastructure needed for comparative effectiveness research studies
    40. Thanks to:
        • In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish Sharma, Tony Pan, David Gutman, Jun Kong, Sharath Cholleti, Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel Rubin, Tom Mikkelsen, Adam Flanders, Joel Saltz (Director)
        • caGrid Knowledge Center: Joel Saltz, Mike Caliguiri, Steve Langella (co-Directors); Tahsin Kurc, Himanshu Rathod (Emory leads)
        • caBIG In vivo imaging team: Eliot Siegel, Paul Mulhern, Adam Flanders, David Channon, Daniel Rubin, Fred Prior, Larry Tarbox and many others
        • In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz
        • Emory ATC Supplement team: Tim Fox, Ashish Sharma, Tony Pan, Edi Schreibmann, Paul Pantalone
        • Digital Pathology R01: Foran and Saltz; Jun Kong, Sharath Cholleti, Fusheng Wang, Tony Pan, Tahsin Kurc, Ashish Sharma, David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun Hu, Lin Yang, David J. Foran (Rutgers)
        • NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich Huang, Dima Hammoud, Manal Jilwan, Prashant Raghavan, Max Wintermark, David Gutman, Carlos Moreno, Lee Cooper, John Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani, Carl Jaffe
        • ACTSI Biomedical Informatics Program: Marc Overcash, Tim Morris, Tahsin Kurc, Alexander Quarshie, Circe Tsui, Adam Davis, Sharon Mason, Andrew Post, Alfredo Tirado-Ramos
        • NSF Scientific Workflow Collaboration: Vijay Kumar, Yolanda Gil, Mary Hall, Ewa Deelman, Tahsin Kurc, P. Sadayappan, Gaurang Mehta, Karan Vahi
    41. Thanks!