Data Science, Big Data and You



Presentation at George Mason University, April 2013

  • Combine with next slide. Graphical representation

    1. Joel Saltz MD, PhD, Emory University, February 2013. Data Science, Big Data and You
    2. Big Data (Center for Comprehensive Informatics)
        • Social media: analysis of tweets and Facebook to observe trends in real time
        • Local Walgreens stock their shelves according to local tweets about cold symptoms
        • Credit card fraud: lots of transactions, yet you get a flag that you shopped in a store that does not fit your profile, and within minutes your card is blocked
    3. Big Data in Commerce - Fraud Detection
        • Seek unexpected data: outliers
        • Lots of data: all Amex, Visa or Mastercard transactions
        • Look for individual outliers, e.g. a credit transaction involving a large amount of money purchasing an unusual product
        • Look for sequence data with a temporal or spatial relationship and find unusual sequences, e.g. intrusion detection and cyber security
    4. • Define the "typical" regions in a data set (may be difficult)
        • "Typical" behavior may change with time: what is typical today may be considered anomalous in the future, and vice versa
        • (Smart) crooks will try to "keep under the radar" to stay undetected
    5. Approaches
        • Sometimes build a model from the training data and apply the model to detect outliers
        • Sometimes use the existing data directly to detect outliers
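As a rough illustration of the two approaches above, the sketch below first fits a simple model (median and median absolute deviation) to training amounts and scores new transactions against it, then flags outliers directly within an existing batch. The function names, the 3.5 cutoff, and the sample amounts are assumptions made for illustration; the slides do not say which detection method was used.

```python
from statistics import median

def fit_robust_model(amounts):
    """Learn a (median, MAD) summary from training data."""
    med = median(amounts)
    mad = median(abs(a - med) for a in amounts)
    return med, mad

def robust_score(x, med, mad):
    """Modified z-score; 0.6745 rescales the MAD to a standard deviation for normal data."""
    return 0.6745 * abs(x - med) / mad if mad else 0.0

def flag_with_model(model, new_amounts, cutoff=3.5):
    """Approach 1: apply a model built from training data to new data."""
    med, mad = model
    return [x for x in new_amounts if robust_score(x, med, mad) > cutoff]

def flag_directly(amounts, cutoff=3.5):
    """Approach 2: use the existing data itself as the reference."""
    return flag_with_model(fit_robust_model(amounts), amounts, cutoff)

# Hypothetical transaction amounts, not data from the talk.
training = [12.5, 40.0, 23.1, 18.9, 35.0, 27.4, 31.2, 22.0]
model = fit_robust_model(training)
suspicious = flag_with_model(model, [25.0, 4800.0, 19.5])
batch_outliers = flag_directly(training + [9999.0])
```

A median-based score is used here because a single extreme value barely moves the median, whereas it can inflate a mean and standard deviation enough to hide itself.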
    6. Big Data Ecosystem. Credit:
    7. Science and Engineering Applications: Sloan Sky Survey
    8. Early Big Data, 1922: Lewis Richardson, Weather Forecasting
    9. • Integrative Spatio-Temporal Analytics
        • Deep Integrative Biomedical Research
        • High End Computing/"Big Data" Computers, Systems Software
        • Analysis of Patient Populations
    11. Scientific Big Data Targets
        • Multi-dimensional spatial-temporal datasets
            – Biomedicine
            – Oil Reservoir Simulation/Carbon Sequestration/Groundwater Pollution Remediation
            – Biomass monitoring and disaster surveillance
            – Weather prediction
            – Analysis of Results from Large Scale Simulations
        • Correlative and cooperative analysis of data from multiple sensor modalities and sources
        • What-if scenarios and multiple design choices or initial conditions
    12. Emory In Silico Center for Brain Tumor Research (PI = Dan Brat, PD = Joel Saltz)
    13. Integrative Cancer Research with Digital Pathology. Figure labels: histology, neuroimaging, clinical pathology, Integrated Analysis, molecular, high-resolution whole-slide microscopy, multiplex IHC
    14. Integrative Analysis: OSU BISTI NBIB Center Big Data (2005)
        • Associate genotype with phenotype
        • Big science experiments on cancer, heart disease, pathogen host response
        • Tissue specimen: 1 cm^3 at 0.3 μ resolution, roughly 10^13 bytes
        • Molecular data (spatial location) can add an additional significant factor, e.g. 10^2
        • Multispectral imaging, laser captured microdissection, Imaging Mass Spec, Multiplex QD
        • Multiple tissue specimens: another factor of 10^3
        • Total: 10^18 bytes, an exabyte per big science experiment
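The estimate above can be checked with a few lines of arithmetic. This is only a sketch: it assumes a cubic specimen sampled isotropically at one byte per voxel, which the slide does not state.

```python
# Voxel count for a 1 cm cube sampled every 0.3 micron (assumed isotropic).
side_m = 1e-2           # 1 cm specimen edge, in metres
resolution_m = 0.3e-6   # 0.3 micron sampling interval, in metres
voxels = (side_m / resolution_m) ** 3   # about 3.7e13, i.e. order 10^13

# Scaling factors quoted on the slide, using the rounded per-specimen figure.
per_specimen_bytes = 10**13
molecular_factor = 10**2
specimen_factor = 10**3
total_bytes = per_specimen_bytes * molecular_factor * specimen_factor
```

The product of the three rounded figures is 10^18 bytes, which matches the exabyte total quoted on the slide.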
    15. A Data Intense Challenge: The Instrumented Oil Field of the Future
    16. The Tyranny of Scale (Tinsley Oden - U Texas). Figure labels: process scale, field scale, km, cm, simulation scale, mm, pore scale
    17. Why Applications Get Big
        • Physical world or simulation results
        • Detailed description of two, three (or more) dimensional space
        • High resolution in each dimension, lots of timesteps
        • e.g. oil reservoir code: simulate a 100 km by 100 km region to 1 km depth at a resolution of 10 cm
            – 10^6*10^6*10^4 mesh points, 10^2 bytes per mesh point, 10^6 timesteps: 10^24 bytes (a yottabyte) of data
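The size estimate above can be reproduced directly. The mesh-point counts 10^6 × 10^6 × 10^4 correspond to a grid spacing of 10 cm over a 100 km by 100 km by 1 km volume, so that spacing is what this sketch assumes; the variable names are illustrative.

```python
# All lengths in centimetres so the arithmetic stays in exact integers.
region_cm = 100 * 1000 * 100    # 100 km
depth_cm = 1 * 1000 * 100       # 1 km
spacing_cm = 10                 # spacing implied by the 10^6 x 10^6 x 10^4 mesh

nx = region_cm // spacing_cm    # 10^6 points along each horizontal axis
nz = depth_cm // spacing_cm     # 10^4 points in depth
mesh_points = nx * nx * nz      # 10^16

bytes_per_point = 10**2
timesteps = 10**6
total_bytes = mesh_points * bytes_per_point * timesteps   # 10^24
```

Halving the grid spacing multiplies the mesh by eight, which is why resolution dominates the data volume.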
    18. Oil Field Management (Joint ITR with Mary Wheeler, Paul Stoffa)
        • Detect and track changes in data during production
        • Invert data for reservoir properties
        • Detect and track reservoir changes
        • Assimilate data & reservoir properties into the evolving reservoir model
        • Use simulation and optimization to guide future production
    19. Coupled Ground Water and Surface Water Simulations
        • Multiple codes, e.g. fluid code, contaminant transport code
        • Different space and time scales
        • Data from a given fluid code run is used in different contaminant transport code scenarios
    20. Bioremediation Simulation
        • Microbe colonies (magenta)
        • Dissolved NAPL (blue)
        • Mineral oxidation products (green)
        • Abiotic reactions compete with microbes, reduce extent of biodegradation
    21. National Science Foundation Grand Challenge in Land Cover Dynamics - 1994
        • Remote sensing analysis of high resolution satellite images
        • Databases of land cover dynamics are essential for global carbon models, biogeochemical cycling, hydrological modeling and ecosystem response modeling
        • Maps of the world's tropical rain forest during the past three decades
        • Larry Davis, Rama Chellappa, Joel Saltz, Alan Sussman, John Townshend
    22. Analysis of Computational Data; Uncertainty Quantification, Comparisons with Experimental Results. Dimitri Mavriplis, Raja Das, Joel Saltz (1990s)
    23. • Integrative Spatio-Temporal Analytics
        • Deep Integrative Biomedical Research
        • High End Computing/"Big Data" Computers, Systems Software
        • Analysis of Patient Populations
    24. Whole Slide Imaging: Scale
    26. Using TCGA Data to Study Glioblastoma
        • Diagnostic Improvement
        • Molecular Classification
        • Predictors of Progression
    27. Digital Pathology, Neuroimaging, TCGA Network
    28. Morphological Tissue Classification, Nuclei Segmentation, Cellular Features, Whole Slide Imaging. Lee Cooper, Jun Kong
    29. Application: Oligodendroglioma Component in GBM
        • Nuclear Qualities: Oligodendroglioma, Astrocytoma
        • Can we use image analysis of TCGA GBMs to inform diagnostic criteria based on molecular or clinical endpoints?
    30. Millions of Nuclei Defined by n Features
        • Top-down analysis: use the features with existing diagnostic constructs
        • Bottom-up analysis: let features define and drive the analysis
    31. TCGA Whole Slide Images (Jun Kong)
        • Step 1: Nuclei Segmentation. Identify individual nuclei and their boundaries
    32. Nuclear Analysis Workflow
        • Step 1: Nuclei Segmentation
        • Step 2: Feature Extraction. Describe individual nuclei in terms of size, shape, and texture
    33. Step 3: Nuclei Classification. Nuclear Qualities: Oligodendroglioma, Astrocytoma (scale 1 to 10)
    34. Survival Analysis: Human, Machine
    35. Gene Expression Correlates of High Oligo-Astro Ratio on Machine-based Classification
        • Oligo Related Genes: Myelin Basic Protein, Proteolipoprotein, HoxD1
        • Nuclear features most associated with Oligo signature genes: Circularity (high), Eccentricity (low)
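To make the two shape features named above concrete, here is a minimal sketch using common textbook definitions: circularity as 4π·area/perimeter² of the nucleus outline, and eccentricity from the second moments of the outline points. The slides do not say which definitions the pipeline used, so the formulas, function names, and synthetic outlines are all assumptions.

```python
import math

def ellipse_outline(a, b, n=64):
    """Sample n boundary points of an axis-aligned ellipse with semi-axes a and b."""
    return [(a * math.cos(2 * math.pi * i / n), b * math.sin(2 * math.pi * i / n))
            for i in range(n)]

def circularity(outline):
    """4*pi*area / perimeter**2 of a closed polygon; approaches 1.0 for a circle."""
    twice_area = 0.0
    perimeter = 0.0
    n = len(outline)
    for i in range(n):
        x0, y0 = outline[i]
        x1, y1 = outline[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0          # shoelace formula
        perimeter += math.hypot(x1 - x0, y1 - y0)
    return 4 * math.pi * (abs(twice_area) / 2) / perimeter ** 2

def eccentricity(points):
    """Eccentricity of the ellipse with the same second moments as the points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam_max, lam_min = half_trace + root, half_trace - root
    return math.sqrt(max(0.0, 1 - lam_min / lam_max))

# Synthetic outlines standing in for segmented nuclei.
round_nucleus = ellipse_outline(5.0, 5.0)
elongated_nucleus = ellipse_outline(10.0, 5.0)
```

Under these definitions a round outline scores high circularity and near-zero eccentricity, the combination the slide associates with the oligodendroglioma signature genes.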
    36. Millions of Nuclei Defined by n Features
        • Top-down analysis: analyze features in context of existing diagnostic constructs
        • Bottom-up analysis: let nuclear features define and drive the analysis
    37. Direct Study of Relationship Between [image] vs [image]. Lee Cooper, Carlos Moreno
    38. Consensus clustering of morphological signatures
        • Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients
        • Each possibility evaluated using 2000 iterations of K-means to quantify co-clustering
        • Nuclear Features Used to Classify GBMs
        • Figure axes: Silhouette Area vs # Clusters; Silhouette Value by Cluster
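One plausible reading of "iterations of K-means to quantify co-clustering" is sketched below: K-means is run many times from random starting centers, and for every pair of samples the fraction of runs in which they share a cluster is recorded. This is a toy pure-Python version with made-up 2-D points and 200 runs instead of 2000; the function names are hypothetical and the study's actual procedure may differ.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_labels(points, k, rng, n_iter=20):
    """Plain Lloyd's algorithm from k randomly chosen starting points."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

def co_clustering(points, k, runs=200, seed=0):
    """Fraction of runs in which each pair of points lands in the same cluster."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for _ in range(runs):
        labels = kmeans_labels(points, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / runs for c in row] for row in counts]

# Two well-separated toy groups standing in for morphological signatures.
signatures = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
consensus = co_clustering(signatures, k=2)
```

Entries of the consensus matrix near 1 indicate pairs that cluster together regardless of initialization. Repeating this for several values of k and comparing how clean the matrix is gives one way to choose the number of clusters.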
    39. Clustering identifies three morphological groups
        • Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)
        • Named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB)
        • Prognostically significant (logrank p=4.5e-4)
        • Figure labels: Feature Indices; CC, CM, PB
    40. Molecular Correlates of MR Features Using TCGA Data (David Gutman)
        • MRIs of TCGA GBMs reviewed by 3-6 neuroradiologists using VASARI feature set and InVivo Imaging tools
        • MR Features compared to TCGA Transcriptional Classes and Genetic Alterations
    41. VASARI Feature Set
    43. Emory CTD2 Center: High throughput protein-protein interaction interrogation in cancer
        • Principal Investigator and Director: Haian Fu
        • Co-Directors: Fadlo R. Khuri, Joel Saltz
        • Project Manager: Margaret Johns
        • Aim 1 Leader: Yuhong Du
        • Aim 2 Leader: Carlos Moreno
        • Cancer genomics-based HT PPI network discovery & validation
        • Genomics informatics and data integration
        • Winship Cancer Institute; Center for Comprehensive Informatics; Emory Chemical Biology Discovery Center; Emory Molecular Interaction Center for Functional Genomics (MicFG)
    44. a.k.a. "Big Data"
        • Integrative Spatio-Temporal Analytics
        • Deep Integrative Biomedical Research
        • High End Computing/"Big Data" Computers, Systems Software
        • Analysis of Patient Populations
    45. Titan: Peak Speed 30,000,000,000,000,000 floating point operations per second!
    47. HPC Segmentation and Feature Extraction Pipeline. Tony Pan and George Teodoro
    48. Large Scale Data Management
        • Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc.
        • Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships
        • Highly optimized spatial query and analyses
        • Implemented in a variety of ways including optimized CPU/GPU, Hadoop/HDFS and IBM DB2
    49. Spatial Centric: Pathology Imaging "GIS"
        • Point query: human marked point inside a nucleus
        • Window query: return markups contained in a rectangle
        • Spatial join query: algorithm validation/comparison
        • Containment query: nuclear feature aggregation in tumor regions
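Three of the query types above can be sketched in a few lines, assuming each markup is stored as a polygon given by a list of (x, y) vertices. The function names and the toy coordinates are hypothetical; a production system would use a spatial index rather than scanning every markup, and the spatial join is omitted here.

```python
def point_in_polygon(pt, polygon):
    """Ray-casting test: count edge crossings of a ray running right from pt."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        (x0, y0), (x1, y1) = polygon[i], polygon[(i + 1) % n]
        if (y0 > y) != (y1 > y):  # edge straddles the horizontal line through pt
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside

def point_query(pt, markups):
    """Markups (e.g. nuclei) that contain a human-marked point."""
    return [name for name, poly in markups.items() if point_in_polygon(pt, poly)]

def window_query(window, markups):
    """Markups lying entirely inside the rectangle (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = window
    return [name for name, poly in markups.items()
            if all(xmin <= x <= xmax and ymin <= y <= ymax for x, y in poly)]

def containment_query(region, markups):
    """Markups whose vertex centroid falls inside a region polygon (e.g. tumor)."""
    hits = []
    for name, poly in markups.items():
        cx = sum(x for x, _ in poly) / len(poly)
        cy = sum(y for _, y in poly) / len(poly)
        if point_in_polygon((cx, cy), region):
            hits.append(name)
    return hits

# Toy nuclei outlines and a toy tumor region.
nuclei = {
    "n1": [(1, 1), (2, 1), (2, 2), (1, 2)],
    "n2": [(5, 5), (6, 5), (6, 6), (5, 6)],
    "n3": [(9, 1), (10, 1), (10, 2), (9, 2)],
}
tumor_region = [(0, 0), (7, 0), (7, 7), (0, 7)]

marked = point_query((1.5, 1.5), nuclei)
in_window = window_query((0, 0, 7, 7), nuclei)
in_tumor = containment_query(tumor_region, nuclei)
```

Feature aggregation in a tumor region then reduces to computing statistics over the markups the containment query returns.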
    50. a.k.a. "Big Data"
        • Integrative Spatio-Temporal Analytics
        • Deep Integrative Biomedical Research
        • High End Computing/"Big Data" Computers, Systems Software
        • Analysis of Patient Populations
    51. Clinical Phenotype Characterization and the Emory Analytic Information Warehouse
        • Example Project: Find hot spots in readmissions within 30 days
            – What fraction of patients with a given principal diagnosis will be readmitted within 30 days?
            – What fraction of patients with a given set of diseases will be readmitted within 30 days?
            – How do severity and time course of co-morbidities affect readmissions?
            – Geographic analyses
        • Compare and contrast with UHC Clinical Data Base
            – Repeat analyses across all UHC hospitals
            – Are we performing the same?
            – How are UHC-curated groupings of patients (e.g., product lines) useful?
        • Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod
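The first question above (what fraction of encounters with a given principal diagnosis is followed by a readmission within 30 days) can be sketched as follows. The record layout, field names, and sample encounters are invented for illustration. Every encounter is counted as a potential index admission, which is a simplification of how readmission measures are usually defined.

```python
from collections import defaultdict
from datetime import date

def readmission_rates(encounters, window_days=30):
    """Per-diagnosis fraction of encounters followed by a readmission in the window."""
    by_patient = defaultdict(list)
    for enc in encounters:
        by_patient[enc["patient"]].append(enc)

    index_counts = defaultdict(int)
    readmit_counts = defaultdict(int)
    for visits in by_patient.values():
        visits.sort(key=lambda e: e["admit"])
        for current, following in zip(visits, visits[1:] + [None]):
            dx = current["principal_dx"]
            index_counts[dx] += 1
            if following is not None:
                gap = (following["admit"] - current["discharge"]).days
                if 0 <= gap <= window_days:
                    readmit_counts[dx] += 1
    return {dx: readmit_counts[dx] / n for dx, n in index_counts.items()}

# Hypothetical encounters, not data from the warehouse.
encounters = [
    {"patient": "p1", "admit": date(2012, 1, 1), "discharge": date(2012, 1, 5),
     "principal_dx": "heart failure"},
    {"patient": "p1", "admit": date(2012, 1, 20), "discharge": date(2012, 1, 25),
     "principal_dx": "heart failure"},
    {"patient": "p2", "admit": date(2012, 2, 1), "discharge": date(2012, 2, 3),
     "principal_dx": "heart failure"},
    {"patient": "p3", "admit": date(2012, 3, 1), "discharge": date(2012, 3, 4),
     "principal_dx": "pneumonia"},
    {"patient": "p3", "admit": date(2012, 6, 1), "discharge": date(2012, 6, 3),
     "principal_dx": "pneumonia"},
]
rates = readmission_rates(encounters)
```

In this toy data one of three heart failure encounters is followed by a readmission within 30 days, and neither pneumonia encounter is.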
    52. Overall System. Diagram labels: i2b2 Web Server, i2b2 Database, Source data, Database Mapper, Data Processing, Metadata Manager, Metadata Repository, Query Specification, Query tools, Study-specific Database; roles: Investigator, Data Analyst, Data Modeler
    53. 5-year Datasets from Emory and University Healthcare Consortium
        • EUH, EUHM and WW (inpatient encounters)
        • Removed encounter pairs with chemotherapy and radiation therapy readmit encounters (CDW data)
        • Encounter location (down to unit for Emory)
        • Providers (Emory only)
        • Discharge disposition
        • Primary and secondary ICD9 codes
        • Procedure codes
        • DRGs
        • Medication orders (Emory only)
        • Labs (Emory only)
        • Vitals (Emory only)
        • Geographic information (CDW only + US Census and American Community Survey)
        • Analytic Information
    54. Using Emory & UHC Data to Find Associations With 30-day Readmits (Analytic Information Warehouse)
        • Problem: "Raw" clinical and administrative variables are difficult to use for associative data mining
            – Too many diagnosis codes, procedure codes
            – Continuous variables (e.g., labs) require interpretation
            – Temporal relationships between variables are implicit
        • Solution: Transform the data into a much smaller set of variables using heuristic knowledge
            – Categorize diagnosis and procedure codes using code hierarchies
            – Classify continuous variables using standard interpretations (e.g., high, normal, low)
            – Identify temporal patterns (e.g., frequency, duration, sequence)
            – Apply standard data mining techniques
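The transformations listed above might look something like the sketch below. The reference ranges, function names, and sample values are illustrative assumptions, not the warehouse's actual rules. The code hierarchy step is reduced to taking the three-character ICD-9 category before the decimal point.

```python
# Illustrative reference ranges (mmol/L); a real system would load these from metadata.
REFERENCE_RANGES = {"potassium": (3.5, 5.0), "sodium": (135.0, 145.0)}

def classify_lab(name, value):
    """Map a continuous lab value to 'low', 'normal', or 'high'."""
    low, high = REFERENCE_RANGES[name]
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"

def diagnosis_category(icd9_code):
    """Roll a detailed ICD-9 code up to its category, e.g. '428.0' -> '428'."""
    return icd9_code.split(".")[0]

def longest_run(labels, target):
    """Longest stretch of consecutive observations carrying the target label."""
    best = current = 0
    for label in labels:
        current = current + 1 if label == target else 0
        best = max(best, current)
    return best

# Hypothetical time-ordered potassium results for one encounter.
potassium_series = [4.1, 5.3, 5.6, 4.8, 3.2]
labels = [classify_lab("potassium", v) for v in potassium_series]
derived = {
    "dx_category": diagnosis_category("428.0"),
    "potassium_high_count": labels.count("high"),              # frequency
    "potassium_longest_high_run": longest_run(labels, "high"),  # duration
}
```

Each encounter is thereby reduced to a handful of categorical variables that standard association mining tools can consume.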
    55. 30-Day Readmission Rates for Derived Variables (Emory Health Care)
    56. Geographic Analyses: UHC Medicine General Product Line (#15) (Analytic Information Warehouse)
    57. Predictive Modeling for Readmission (Sharath Cholleti)
        • Random forests (ensemble of decision trees)
            – Create a decision tree using a random subset of the variables in the dataset
            – Generate a large number of such trees
            – All trees vote to classify each example in a test dataset
            – Generate a patient-specific readmission risk for each encounter
        • Rank the encounters by risk for a subsequent 30-day readmission
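The voting scheme described above can be illustrated with a deliberately tiny ensemble. As a loud simplification, each "tree" here is a depth-1 stump trained on a bootstrap sample using one randomly chosen variable, whereas real random forests grow deeper trees and sample variable subsets at every split. The feature meanings, data, and function names are all invented for the sketch.

```python
import random

def fit_stump(rows, labels, feature):
    """Best single-threshold split on one feature (a depth-1 decision tree)."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
        if not left or not right:
            continue
        left_vote = int(sum(left) * 2 >= len(left))     # majority class on each side
        right_vote = int(sum(right) * 2 >= len(right))
        errors = (sum(y != left_vote for y in left)
                  + sum(y != right_vote for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_vote, right_vote)
    if best is None:  # feature is constant in this sample: always predict the majority
        majority = int(sum(labels) * 2 >= len(labels))
        return (feature, float("inf"), majority, majority)
    return (feature, best[1], best[2], best[3])

def fit_forest(rows, labels, n_trees=101, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # bootstrap sample
        feature = rng.randrange(len(rows[0]))            # random variable choice
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest

def readmission_risk(forest, row):
    """Fraction of trees voting 'readmitted' for this encounter."""
    votes = sum(lv if row[f] <= t else rv for f, t, lv, rv in forest)
    return votes / len(forest)

# Toy encounters: (prior admissions, number of comorbidities); label 1 = readmitted.
rows = [(0, 1), (1, 0), (0, 2), (1, 1), (4, 5), (5, 4), (3, 6), (6, 5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
high_risk = readmission_risk(forest, (5, 6))
low_risk = readmission_risk(forest, (0, 0))
```

The vote fraction serves as the patient-specific risk, so encounters can be ranked by it directly.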
    58. Emory Readmission Rates for High and Low Risk Groups Generated with Random Forest
    59. Predictive Modeling for 180 UHC Hospitals, 35 Million Patients
        • Identify High Risk Patients!
        • Readmission fraction of top 10% high risk patients
        • Figure legend: Hospital Model, Individual Hospital Model
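The metric named above, the readmission fraction among the 10% of patients ranked highest by predicted risk, can be computed as in this sketch. The function name and the toy risk scores are assumptions for illustration.

```python
def top_fraction_readmission(risks, readmitted, fraction=0.10):
    """Observed readmission fraction among the highest-risk share of encounters."""
    ranked = sorted(zip(risks, readmitted), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    top = ranked[:n_top]
    return sum(outcome for _, outcome in top) / n_top

# Twenty hypothetical encounters with increasing predicted risk.
risks = [i / 20 for i in range(20)]
readmitted = [0] * 17 + [1, 0, 1]
top_decile_rate = top_fraction_readmission(risks, readmitted)
overall_rate = sum(readmitted) / len(readmitted)
```

Comparing the top-decile rate with the overall rate shows how much the model concentrates readmissions among the patients it flags.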
    60. Quasi-real-time display and analysis of physiologic data from Emory University Hospital SICU
    61. • Burst of tachycardia, no desaturation
        • Two episodes of desaturation, no change in heart rate
        • Traces: HR, SpO2
        • This slide is for orientation. Red data are the newest, green intermediate, blue oldest. Frequency every 2 seconds.
    62. We have started to construct alerts around desaturation behaviors (this image courtesy IBM)
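A simple form of desaturation alert could look like the sketch below: flag any stretch where SpO2 stays below a threshold for a minimum duration. The 2-second sampling period comes from the slides. The 90% threshold, the 10-second minimum, the function name, and the sample trace are assumptions and not clinical guidance or the team's actual alert logic.

```python
SAMPLE_PERIOD_S = 2  # one sample every 2 seconds, as stated on the slides

def desaturation_episodes(spo2, threshold=90, min_duration_s=10):
    """Return (start_index, end_index) pairs where SpO2 stays below the threshold
    for at least min_duration_s seconds."""
    min_samples = -(-min_duration_s // SAMPLE_PERIOD_S)  # ceiling division
    episodes = []
    start = None
    for i, value in enumerate(list(spo2) + [threshold]):  # sentinel closes a trailing run
        if value < threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_samples:
                episodes.append((start, i - 1))
            start = None
    return episodes

# Hypothetical SpO2 trace (percent), one value per 2 seconds.
trace = [97, 96, 89, 88, 87, 88, 89, 95, 96, 89, 97, 85, 84, 83, 82, 81]
alerts = desaturation_episodes(trace)
```

Requiring a minimum duration suppresses single-sample dips, such as the isolated reading of 89 in the middle of the trace.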
    63. • Integrative Spatio-Temporal Analytics
        • Deep Integrative Biomedical Research
        • High End Computing/"Big Data" Computers, Systems Software
        • Analysis of Patient Populations
    64. Thanks to:
        • In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish Sharma, Tony Pan, David Gutman, Jun Kong, Sharath Cholleti, Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel Rubin, Tom Mikkelsen, Adam Flanders, Joel Saltz (Director)
        • Digital Pathology R01(s): Foran and Saltz; Jun Kong, Sharath Cholleti, Fusheng Wang, Tony Pan, Tahsin Kurc, Ashish Sharma, David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun Hu, Lin Yang, David J. Foran (Rutgers)
        • Analytic Warehouse team: Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod
        • In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz
        • NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich Huang, Dima Hammoud, Manal Jilwan, Prashant Raghavan, Max Wintermark, David Gutman, Carlos Moreno, Lee Cooper, John Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani, Carl Jaffe
        • ACTSI Biomedical Informatics Program: Marc Overcash, Tim Morris, Tahsin Kurc, Alexander Quarshie, Circe Tsui, Adam Davis, Sharon Mason, Andrew Post, Alfredo Tirado-Ramos
        • ORNL HPC collaboration: Scott Klasky, David Pugmire (ORNL)
    65. Thanks to
        • National Cancer Institute
        • National Library of Medicine
        • National Science Foundation
        • Cardiovascular Research Grid (NHLBI)
        • Minority Health Grid (ARRA)
        • Emory Health Care
        • Kaiser Health Care
        • Winship Cancer Institute
        • Oak Ridge National Laboratory
        • Woodruff Health Sciences
    66. Thanks!