Data Science, Big Data and You
Presentation at George Mason University, April 2013

  • Transcript

    • 1. Joel Saltz MD, PhD, Emory University, February 2013. Data Science, Big Data and You
    • 2. Center for Comprehensive Informatics: Big Data
        • Social media: analysis of tweets and Facebook to observe trends in real time
        • Local Walgreens stock their shelves according to local tweets about cold symptoms
        • Credit card fraud: lots of transactions, and yet you get a flag when you shop in a store that does not fit your profile, and within minutes your card is blocked
    • 3. Big Data in Commerce - Fraud Detection
        • Seek unexpected data: outliers
        • Lots of data: all Amex, Visa or Mastercard transactions
        • Look for individual outliers, e.g. a credit transaction involving a large amount of money purchasing an unusual product
        • Look for sequence data with a temporal or spatial relationship and find unusual sequences, e.g. intrusion detection and cyber security
    • 4.
        • Define the "typical" regions in a data set; this may be difficult
        • "Typical" behavior may change with time. What is typical today may be considered anomalous in future and vice versa.
        • (Smart) crooks will try to "keep under the radar" to stay undetected
    • 5. Approaches
        • Sometimes build a model from the training data and apply the model to detect outliers
        • Sometimes use the existing data directly to detect outliers
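The two approaches named on the slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any card network: the model-based variant fits a mean and standard deviation on training data and flags new points by z-score, while the direct variant scores the data against its own median and median absolute deviation. Function names, thresholds and purchase amounts are all hypothetical.

```python
import statistics

def fit_model(training_amounts):
    """Model-based approach: summarise 'typical' behaviour from training data."""
    return statistics.mean(training_amounts), statistics.stdev(training_amounts)

def model_outliers(model, amounts, z_threshold=3.0):
    """Flag amounts whose z-score under the fitted model exceeds the threshold."""
    mean, std = model
    return [a for a in amounts if abs(a - mean) / std > z_threshold]

def direct_outliers(amounts, threshold=3.5):
    """Direct approach: robust score computed from the data itself (median and MAD)."""
    med = statistics.median(amounts)
    mad = statistics.median(abs(a - med) for a in amounts)
    if mad == 0:
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score for normal data
    return [a for a in amounts if 0.6745 * abs(a - med) / mad > threshold]

history = [20, 35, 18, 42, 27, 31, 25, 38, 22, 30]  # made-up past purchases
new = [29, 33, 950]                                  # made-up new purchases

flagged_by_model = model_outliers(fit_model(history), new)
flagged_directly = direct_outliers(history + new)
```

Both variants flag only the 950 purchase here. The slide's caveat still applies: "typical" drifts over time, so a fitted model would need periodic refitting.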
    • 6. Big Data Ecosystem. Credit: http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/
    • 7. Science and Engineering Applications: Sloan Sky Survey
    • 8. Early Big Data 1922 - Lewis Richardson, Weather Forecasting
    • 9.
        • Integrative Spatio-Temporal Analytics
        • Deep Integrative Biomedical Research
        • High End Computing/"Big Data" Computers, Systems Software
        • Analysis of Patient Populations
    • 10. Center for Comprehensive Informatics
    • 11. Scientific Big Data Targets
        • Multi-dimensional spatial-temporal datasets
            – Biomedicine
            – Oil Reservoir Simulation/Carbon Sequestration/Groundwater Pollution Remediation
            – Biomass monitoring and disaster surveillance
            – Weather prediction
            – Analysis of Results from Large Scale Simulations
        • Correlative and cooperative analysis of data from multiple sensor modalities and sources
        • What-if scenarios and multiple design choices or initial conditions
    • 12. Emory In Silico Center for Brain Tumor Research (PI = Dan Brat, PD = Joel Saltz)
    • 13. Integrative Cancer Research with Digital Pathology. Diagram labels: histology, neuroimaging, clinical pathology, molecular, Integrated Analysis, high-resolution whole-slide microscopy, multiplex IHC
    • 14. Integrative Analysis: OSU BISTI NBIB Center Big Data (2005)
        • Associate genotype with phenotype
        • Big science experiments on cancer, heart disease, pathogen host response
        • Tissue specimen: 1 cm^3 at 0.3 μ resolution, roughly 10^13 bytes
        • Molecular data (spatial location) can add an additional significant factor, e.g. 10^2
        • Multispectral imaging, laser captured microdissection, Imaging Mass Spec, Multiplex QD
        • Multiple tissue specimens: another factor of 10^3
        • Total: 10^18 bytes, an exabyte per big science experiment
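Written out, the slide's estimate is a product of three factors. This restates the slide's own figures and is not an independent measurement; the first factor follows from sampling a 1 cm cube at 0.3 μm spacing, assuming roughly one byte per voxel.

```latex
\[
\left(\frac{10^{4}\,\mu\mathrm{m}}{0.3\,\mu\mathrm{m}}\right)^{3} \approx 3.7\times10^{13} \sim 10^{13}\ \text{bytes per specimen},
\qquad
10^{13}\times10^{2}\times10^{3} = 10^{18}\ \text{bytes} = 1\ \text{exabyte}.
\]
```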
    • 15. A Data Intense Challenge: The Instrumented Oil Field of the Future
    • 16. The Tyranny of Scale (Tinsley Oden - U Texas). Figure labels: process scale, field scale, simulation scale, pore scale; length units km, cm, mm
    • 17. Why Applications Get Big
        • Physical world or simulation results
        • Detailed description of two, three (or more) dimensional space
        • High resolution in each dimension, lots of timesteps
        • e.g. oil reservoir code: simulate a 100 km by 100 km region to 1 km depth at a resolution of 10 cm:
            – 10^6*10^6*10^4 mesh points, 10^2 bytes per mesh point, 10^6 timesteps: 10^24 bytes (yottabyte) of data!
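The yottabyte figure can be checked with a few lines of arithmetic. The sketch below assumes a uniform grid spacing of 10 cm, which is the spacing that reproduces the slide's 10^6 × 10^6 × 10^4 mesh for a 100 km × 100 km × 1 km region. The function name is hypothetical.

```python
def mesh_points(length_cm, width_cm, depth_cm, spacing_cm):
    """Number of grid points for a box sampled at a uniform spacing (integer units)."""
    return (length_cm // spacing_cm) * (width_cm // spacing_cm) * (depth_cm // spacing_cm)

KM = 100_000  # centimetres per kilometre

# 100 km x 100 km x 1 km region; the 10 cm spacing is an assumption (see lead-in).
points = mesh_points(100 * KM, 100 * KM, 1 * KM, 10)
bytes_per_point = 10**2
timesteps = 10**6
total_bytes = points * bytes_per_point * timesteps
```

With these inputs `points` is 10^16 and `total_bytes` is 10^24, one yottabyte.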
    • 18. Oil Field Management – Joint ITR with Mary Wheeler, Paul Stoffa
        • Detect and track changes in data during production
        • Invert data for reservoir properties
        • Detect and track reservoir changes
        • Assimilate data & reservoir properties into the evolving reservoir model
        • Use simulation and optimization to guide future production
    • 19. Coupled Ground Water and Surface Water Simulations
        • Multiple codes, e.g. fluid code, contaminant transport code
        • Different space and time scales
        • Data from a given fluid code run is used in different contaminant transport code scenarios
    • 20. Bioremediation Simulation
        • Microbe colonies (magenta)
        • Dissolved NAPL (blue)
        • Mineral oxidation products (green)
        • Abiotic reactions compete with microbes, reduce extent of biodegradation
    • 21. National Science Foundation Grand Challenge in Land Cover Dynamics - 1994
        • Remote sensing analysis of high resolution satellite images.
        • Databases of land cover dynamics are essential for global carbon models, biogeochemical cycling, hydrological modeling and ecosystem response modeling
        • Maps of the world's tropical rain forest during the past three decades.
        • Larry Davis, Rama Chellappa, Joel Saltz, Alan Sussman, John Townshend
    • 22. Analysis of Computational Data; Uncertainty Quantification, Comparisons with Experimental Results. Dimitri Mavriplis, Raja Das, Joel Saltz, 1990s
    • 23.
        • Integrative Spatio-Temporal Analytics
        • Deep Integrative Biomedical Research
        • High End Computing/"Big Data" Computers, Systems Software
        • Analysis of Patient Populations
    • 24. Whole Slide Imaging: Scale
    • 25. Center for Comprehensive Informatics
    • 26. Using TCGA Data to Study Glioblastoma
        • Diagnostic Improvement
        • Molecular Classification
        • Predictors of Progression
    • 27. Digital Pathology, Neuroimaging, TCGA Network
    • 28. Morphological Tissue Classification. Nuclei Segmentation; Cellular Features; Whole Slide Imaging. Lee Cooper, Jun Kong
    • 29. Application: Oligodendroglioma Component in GBM
        • Nuclear Qualities: Oligodendroglioma, Astrocytoma
        • Can we use image analysis of TCGA GBMs to inform diagnostic criteria based on molecular or clinical endpoints?
    • 30. Millions of Nuclei Defined by n Features
        • Top-down analysis: use the features with existing diagnostic constructs
        • Bottom-up analysis: let features define and drive the analysis
    • 31. TCGA Whole Slide Images (Jun Kong)
        • Step 1: Nuclei Segmentation. Identify individual nuclei and their boundaries
    • 32. Nuclear Analysis Workflow
        • Step 1: Nuclei Segmentation
        • Step 2: Feature Extraction. Describe individual nuclei in terms of size, shape, and texture
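To make the feature extraction step concrete, here is a minimal sketch of one size descriptor and two shape descriptors computed from a segmented nucleus boundary. It assumes the boundary is available as a list of (x, y) vertices and uses the common definitions circularity = 4π·area/perimeter² and ellipse eccentricity = sqrt(1 − (minor/major)²). The talk does not say which definitions its pipeline uses, so treat the function names and formulas as illustrative.

```python
import math

def polygon_area(boundary):
    """Size: shoelace formula for a closed polygon given as (x, y) vertices."""
    n = len(boundary)
    twice_area = sum(
        boundary[i][0] * boundary[(i + 1) % n][1] - boundary[(i + 1) % n][0] * boundary[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0

def polygon_perimeter(boundary):
    """Sum of edge lengths around the closed boundary."""
    n = len(boundary)
    return sum(math.dist(boundary[i], boundary[(i + 1) % n]) for i in range(n))

def circularity(boundary):
    """Shape: 4*pi*area / perimeter^2, which is 1.0 for a circle and smaller otherwise."""
    return 4.0 * math.pi * polygon_area(boundary) / polygon_perimeter(boundary) ** 2

def eccentricity(major_axis, minor_axis):
    """Shape: eccentricity of an ellipse with the given axis lengths (0 for a circle)."""
    return math.sqrt(1.0 - (minor_axis / major_axis) ** 2)

# Made-up boundaries: a near-circular nucleus and a square one.
round_nucleus = [(math.cos(2 * math.pi * k / 64), math.sin(2 * math.pi * k / 64)) for k in range(64)]
square_nucleus = [(0, 0), (2, 0), (2, 2), (0, 2)]
```

The near-circular boundary scores close to 1.0 and the square scores π/4. Texture features need pixel intensities and are left out of this sketch.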
    • 33. Step 3: Nuclei Classification. Nuclear Qualities scale, 1 to 10: Oligodendroglioma, Astrocytoma
    • 34. Survival Analysis: Human, Machine
    • 35. Gene Expression Correlates of High Oligo-Astro Ratio on Machine-based Classification
        • Oligo Related Genes: Myelin Basic Protein, Proteolipoprotein, HoxD1
        • Nuclear features most associated with Oligo Signature Genes: Circularity (high), Eccentricity (low)
    • 36. Millions of Nuclei Defined by n Features
        • Top-down analysis: analyze features in context of existing diagnostic constructs
        • Bottom-up analysis: let nuclear features define and drive the analysis
    • 37. Direct Study of Relationship Between [figure] vs [figure]. Lee Cooper, Carlos Moreno
    • 38. Consensus clustering of morphological signatures
        • Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients
        • Each possibility evaluated using 2000 iterations of K-means to quantify co-clustering
        • [Figure: Nuclear Features Used to Classify GBMs; silhouette area vs. number of clusters; silhouette value by cluster]
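The idea of using repeated K-means runs to quantify co-clustering can be sketched as follows. This is a toy version under stated assumptions: it works on 1-D values, restarts K-means from random initial centers, and records how often each pair of points lands in the same cluster. The study's actual procedure (features, resampling scheme, 2000 iterations per candidate cluster count) is not described in the transcript, and all names here are hypothetical.

```python
import random

def kmeans_labels(points, k, rng, max_iter=50):
    """Plain Lloyd's k-means on 1-D points; returns a cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: abs(p - centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = sum(members) / len(members)
        if new_labels == labels:
            break
        labels = new_labels
    return labels

def co_clustering(points, k, runs=200, seed=0):
    """Fraction of runs in which each pair of points lands in the same cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    for _ in range(runs):
        labels = kmeans_labels(points, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [[count / runs for count in row] for row in together]

feature = [0.0, 1.0, 2.0, 100.0, 101.0, 102.0]  # made-up 1-D feature values
co = co_clustering(feature, k=2)
```

Entries of `co` near 1 mark pairs that cluster together consistently and entries near 0 mark pairs that never do. Repeating this for several values of `k` and comparing the stability of the resulting matrices is one way to choose the number of clusters.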
    • 39. Clustering identifies three morphological groups
        • Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)
        • Named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB)
        • Prognostically significant (logrank p=4.5e-4)
        • [Figure: feature indices by group (CC, CM, PB); survival vs. days for CC, CM, PB]
    • 40. Molecular Correlates of MR Features Using TCGA Data (David Gutman)
        • MRIs of TCGA GBMs reviewed by 3-6 neuroradiologists using VASARI feature set and InVivo Imaging tools
        • MR Features compared to TCGA Transcriptional Classes and Genetic Alterations
    • 41. VASARI Feature Set
    • 42. Center for Comprehensive Informatics
    • 43. Emory CTD2 Center: High throughput protein-protein interaction interrogation in cancer
        • Principal Investigator and Director: Haian Fu
        • Co-Directors: Fadlo R. Khuri, Joel Saltz
        • Project Manager: Margaret Johns
        • Aim 1 Leader: Yuhong Du
        • Aim 2 Leader: Carlos Moreno
        • Cancer genomics-based HT PPI network discovery & validation
        • Genomics informatics and data integration
        • Winship Cancer Institute; Center for Comprehensive Informatics; Emory Chemical Biology Discovery Center
        • Emory Molecular Interaction Center for Functional Genomics (MicFG)
    • 44. a.k.a. "Big Data"
        • Integrative Spatio-Temporal Analytics
        • Deep Integrative Biomedical Research
        • High End Computing/"Big Data" Computers, Systems Software
        • Analysis of Patient Populations
    • 45. Titan – Peak Speed 30,000,000,000,000,000 floating point operations per second!
    • 46. Center for Comprehensive Informatics
    • 47. HPC Segmentation and Feature Extraction Pipeline. Tony Pan and George Teodoro
    • 48. Large Scale Data Management
        • Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc.
        • Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships
        • Highly optimized spatial query and analyses
        • Implemented in a variety of ways including optimized CPU/GPU, Hadoop/HDFS and IBM DB2
    • 49. Spatial Centric – Pathology Imaging "GIS"
        • Point query: human marked point inside a nucleus
        • Window query: return markups contained in a rectangle
        • Spatial join query: algorithm validation/comparison
        • Containment query: nuclear feature aggregation in tumor regions
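Three of the query types on this slide can be illustrated with plain geometry. The sketch assumes markups are stored as polygons (lists of (x, y) vertices) in a dictionary keyed by a hypothetical id, and it scans every markup linearly. The systems named in the talk would use spatial indexes and database engines instead, so this shows only what each query returns.

```python
def point_in_polygon(point, polygon):
    """Point query via ray casting (points exactly on an edge are not handled specially)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def window_query(markups, xmin, ymin, xmax, ymax):
    """Window query: ids of markups whose every vertex lies inside the rectangle."""
    return [
        mid for mid, poly in markups.items()
        if all(xmin <= x <= xmax and ymin <= y <= ymax for x, y in poly)
    ]

def containment_query(markups, region):
    """Containment query: ids of markups whose vertices all fall inside a region polygon."""
    return [mid for mid, poly in markups.items() if all(point_in_polygon(p, region) for p in poly)]

# Made-up markups: two square "nuclei" and one square "tumor region".
nuclei = {
    "n1": [(1, 1), (2, 1), (2, 2), (1, 2)],
    "n2": [(8, 8), (9, 8), (9, 9), (8, 9)],
}
tumor_region = [(0, 0), (5, 0), (5, 5), (0, 5)]

hit = point_in_polygon((1.5, 1.5), nuclei["n1"])
in_window = window_query(nuclei, 0, 0, 5, 5)
in_tumor = containment_query(nuclei, tumor_region)
```

A spatial join for algorithm comparison would pair markups from two result sets that overlap. It is omitted here because polygon intersection needs more code than fits a short sketch.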
    • 50. a.k.a. "Big Data"
        • Integrative Spatio-Temporal Analytics
        • Deep Integrative Biomedical Research
        • High End Computing/"Big Data" Computers, Systems Software
        • Analysis of Patient Populations
    • 51. Clinical Phenotype Characterization and the Emory Analytic Information Warehouse
        • Example Project: Find hot spots in readmissions within 30 days
            – What fraction of patients with a given principal diagnosis will be readmitted within 30 days?
            – What fraction of patients with a given set of diseases will be readmitted within 30 days?
            – How do severity and time course of co-morbidities affect readmissions?
            – Geographic analyses
        • Compare and contrast with UHC Clinical Data Base
            – Repeat analyses across all UHC hospitals
            – Are we performing the same?
            – How are UHC-curated groupings of patients (e.g., product lines) useful?
        • Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod
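The first question on the slide, the fraction of patients with a given principal diagnosis who are readmitted within 30 days, can be sketched over a flat list of encounters. The record layout (`patient`, `dx`, `admit`, `discharge`) and the sample data are hypothetical, and every stay is treated as an index stay. The warehouse's real definitions, for example its exclusions, are not given in the transcript.

```python
from collections import defaultdict
from datetime import date

def readmission_fraction(encounters, window_days=30):
    """Per principal diagnosis: fraction of stays followed by another admission
    of the same patient within window_days of discharge."""
    by_patient = defaultdict(list)
    for enc in encounters:
        by_patient[enc["patient"]].append(enc)
    stays = defaultdict(int)
    readmits = defaultdict(int)
    for visits in by_patient.values():
        visits.sort(key=lambda e: e["admit"])
        for current, following in zip(visits, visits[1:] + [None]):
            stays[current["dx"]] += 1
            if following and (following["admit"] - current["discharge"]).days <= window_days:
                readmits[current["dx"]] += 1
    return {dx: readmits[dx] / stays[dx] for dx in stays}

# Made-up encounters for two patients.
encounters = [
    {"patient": "A", "dx": "heart failure", "admit": date(2012, 1, 1), "discharge": date(2012, 1, 5)},
    {"patient": "A", "dx": "heart failure", "admit": date(2012, 1, 20), "discharge": date(2012, 1, 25)},
    {"patient": "B", "dx": "heart failure", "admit": date(2012, 3, 1), "discharge": date(2012, 3, 4)},
    {"patient": "B", "dx": "pneumonia", "admit": date(2012, 6, 1), "discharge": date(2012, 6, 3)},
]
rates = readmission_fraction(encounters)
```

In this toy data one of the three heart failure stays is followed by a readmission within 30 days, so its rate is 1/3, and the pneumonia rate is 0.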
    • 52. Overall System. Diagram components: I2b2 Web Server, I2b2 Database, Source data (three sources), Database Mapper, Data Processing, Metadata Manager, Metadata Repository, Query Specification, Query tools, Study-specific Database. Roles: Investigator, Data Analyst, Data Modeler
    • 53. 5-year Datasets from Emory and University Healthcare Consortium
        • EUH, EUHM and WW (inpatient encounters)
        • Removed encounter pairs with chemotherapy and radiation therapy readmit encounters (CDW data)
        • Encounter location (down to unit for Emory)
        • Providers (Emory only)
        • Discharge disposition
        • Primary and secondary ICD9 codes
        • Procedure codes
        • DRGs
        • Medication orders (Emory only)
        • Labs (Emory only)
        • Vitals (Emory only)
        • Geographic information (CDW only + US Census and American Community Survey)
    • 54. Using Emory & UHC Data to Find Associations With 30-day Readmits
        • Problem: "Raw" clinical and administrative variables are difficult to use for associative data mining
            – Too many diagnosis codes, procedure codes
            – Continuous variables (e.g., labs) require interpretation
            – Temporal relationships between variables are implicit
        • Solution: Transform the data into a much smaller set of variables using heuristic knowledge
            – Categorize diagnosis and procedure codes using code hierarchies
            – Classify continuous variables using standard interpretations (e.g., high, normal, low)
            – Identify temporal patterns (e.g., frequency, duration, sequence)
            – Apply standard data mining techniques
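The transformation steps in the "Solution" bullet can be sketched as three small functions. The sketch assumes the simplest possible code hierarchy (rolling an ICD-9 code up to the part before the dot), takes reference ranges as a parameter rather than asserting any clinical values, and summarises timing as a count and a span. The lab name, ranges and encounter are made up, and the warehouse's actual heuristics are not described in the transcript.

```python
def icd9_category(code):
    """Roll an ICD-9 diagnosis code up to its category (the part before the dot)."""
    return code.split(".")[0]

def classify_lab(value, low, high):
    """Interpret a continuous lab value against a supplied reference range."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"

def temporal_pattern(times_in_days):
    """Summarise event times as frequency (count) and duration (span in days)."""
    ordered = sorted(times_in_days)
    return {"frequency": len(ordered), "duration": ordered[-1] - ordered[0] if ordered else 0}

def derive_variables(encounter, reference_ranges):
    """Turn one raw encounter into a small set of categorical variables."""
    derived = {"dx_categories": sorted({icd9_category(c) for c in encounter["icd9"]})}
    for lab, value in encounter["labs"].items():
        low, high = reference_ranges[lab]
        derived[lab] = classify_lab(value, low, high)
    derived["visits"] = temporal_pattern(encounter["visit_days"])
    return derived

# Hypothetical reference range and encounter, for illustration only.
ranges = {"lab_x": (10.0, 20.0)}
raw = {"icd9": ["428.0", "428.22", "250.00"], "labs": {"lab_x": 25.0}, "visit_days": [0, 12, 40]}
derived = derive_variables(raw, ranges)
```

Three diagnosis codes collapse to two categories and the numeric lab becomes a single label. That reduction is what makes associative mining tractable.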
    • 55. 30-Day Readmission Rates for Derived Variables. Emory Health Care
    • 56. Geographic Analyses. UHC Medicine General Product Line (#15)
    • 57. Predictive Modeling for Readmission (Sharath Cholleti)
        • Random forests (ensemble of decision trees)
            – Create a decision tree using a random subset of the variables in the dataset
            – Generate a large number of such trees
            – All trees vote to classify each test example in a training dataset
            – Generate a patient-specific readmission risk for each encounter
        • Rank the encounters by risk for a subsequent 30-day readmission
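The steps listed on the slide can be sketched in pure Python with one deliberate simplification: each "tree" here is a single split (a decision stump) rather than a full decision tree, so this is a toy stand-in and not the model used in the talk. Each stump sees a bootstrap sample and a random subset of the variables, and the risk score is the fraction of stumps voting "readmitted". Feature meanings and data are made up.

```python
import random

def best_stump(rows, labels, features):
    """Pick the (feature, threshold) split with the fewest misclassifications."""
    best = None
    for f in features:
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            left_vote = int(sum(left) * 2 > len(left)) if left else 0
            right_vote = int(sum(right) * 2 > len(right)) if right else left_vote
            errors = sum(y != left_vote for y in left) + sum(y != right_vote for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_vote, right_vote)
    return best[1:]

def train_forest(rows, labels, n_trees=101, n_features=1, seed=0):
    """Each stump sees a bootstrap sample and a random subset of the variables."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        feats = rng.sample(range(len(rows[0])), n_features)
        forest.append(best_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def readmission_risk(forest, row):
    """Risk score: fraction of stumps voting 'readmitted' (label 1)."""
    votes = [lv if row[f] <= t else rv for f, t, lv, rv in forest]
    return sum(votes) / len(votes)

# Made-up encounters: (prior admissions, length of stay in days); 1 = readmitted.
rows = [(0, 2), (1, 3), (0, 4), (1, 2), (5, 9), (6, 8), (4, 10), (7, 9)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(rows, labels)
high = readmission_risk(forest, (6, 9))
low = readmission_risk(forest, (0, 2))
```

Sorting encounters by this score gives the ranking described in the last bullet. In practice one would use a maintained random forest implementation with full trees and evaluate it on held-out encounters.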
    • 58. Emory Readmission Rates for High and Low Risk Groups Generated with Random Forest
    • 59. Predictive Modeling for 180 UHC Hospitals, 35 Million Patients. Identify High Risk Patients!
        • [Figure: readmission fraction of top 10% high risk patients; series: All Hospital Model, Individual Hospital Model]
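The quantity plotted on this slide, the readmission fraction among the top 10% highest-risk patients, can be computed from any list of risk scores and observed outcomes. The sketch below is one straightforward reading of that metric, with made-up scores and outcomes and a hypothetical function name.

```python
def top_fraction_readmission(risks, readmitted, top=0.10):
    """Readmission fraction among the `top` share of encounters ranked by risk."""
    ranked = sorted(zip(risks, readmitted), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * top))
    return sum(outcome for _, outcome in ranked[:n_top]) / n_top

# Made-up risk scores and outcomes (1 = readmitted within 30 days) for 20 encounters.
risks = [0.9, 0.8, 0.1, 0.2, 0.3, 0.1, 0.2, 0.4, 0.3, 0.1,
         0.7, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.1, 0.3]
readmitted = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
              1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

top_decile = top_fraction_readmission(risks, readmitted)
overall = sum(readmitted) / len(readmitted)
```

Here the top decile holds two encounters, one of which was readmitted, giving 0.5 against an overall rate of 0.15. The gap between the two numbers is what signals a useful risk model.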
    • 60. Quasi-real-time display and analysis of physiologic data from Emory University Hospital SICU
    • 61. Figure annotations (axes: HR, SpO2)
        • Burst of tachycardia, no desaturation
        • Two episodes of desaturation, no change in heart rate
        • This slide is for orientation. Red data are the newest, green intermediate, blue oldest. Frequency every 2 seconds.
    • 62. We have started to construct alerts around desaturation behaviors (this image courtesy IBM)
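A simple form of desaturation alert can be sketched as a run detector over the SpO2 stream. The 2-second sampling interval comes from the previous slide. The threshold of 90 and the minimum run length are placeholder assumptions, not clinical guidance and not the rules used in the Emory/IBM system.

```python
def desaturation_episodes(spo2, threshold=90, min_samples=5, sample_seconds=2):
    """Find runs where SpO2 stays below threshold for at least min_samples readings.
    Returns (start_second, end_second) pairs."""
    episodes, start = [], None
    for i, value in enumerate(spo2 + [threshold]):  # sentinel closes a trailing run
        if value < threshold and start is None:
            start = i
        elif value >= threshold and start is not None:
            if i - start >= min_samples:
                episodes.append((start * sample_seconds, i * sample_seconds))
            start = None
    return episodes

trace = [97, 96, 89, 97, 96, 88, 87, 86, 85, 88, 95, 97]  # made-up readings, one per 2 s
alerts = desaturation_episodes(trace)
```

The single low reading at index 2 is ignored as too short, while the five-sample run starting at index 5 is reported as one episode from second 10 to second 20. Requiring a minimum run length is one way to avoid alerting on isolated sensor dropouts.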
    • 63.
        • Integrative Spatio-Temporal Analytics
        • Deep Integrative Biomedical Research
        • High End Computing/"Big Data" Computers, Systems Software
        • Analysis of Patient Populations
    • 64. Thanks to:
        • In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish Sharma, Tony Pan, David Gutman, Jun Kong, Sharath Cholleti, Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel Rubin, Tom Mikkelsen, Adam Flanders, Joel Saltz (Director)
        • Digital Pathology R01(s): Foran and Saltz; Jun Kong, Sharath Cholleti, Fusheng Wang, Tony Pan, Tahsin Kurc, Ashish Sharma, David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun Hu, Lin Yang, David J. Foran (Rutgers)
        • Analytic Warehouse team: Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod
        • In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz
        • NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich Huang, Dima Hammoud, Manal Jilwan, Prashant Raghavan, Max Wintermark, David Gutman, Carlos Moreno, Lee Cooper, John Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani, Carl Jaffe
        • ACTSI Biomedical Informatics Program: Marc Overcash, Tim Morris, Tahsin Kurc, Alexander Quarshie, Circe Tsui, Adam Davis, Sharon Mason, Andrew Post, Alfredo Tirado-Ramos
        • ORNL HPC collaboration: Scott Klasky, David Pugmire (ORNL)
    • 65. Thanks to
        • National Cancer Institute
        • National Library of Medicine
        • National Science Foundation
        • Cardiovascular Research Grid (NHLBI)
        • Minority Health Grid (ARRA)
        • Emory Health Care
        • Kaiser Health Care
        • Winship Cancer Institute
        • Oak Ridge National Laboratory
        • Woodruff Health Sciences
    • 66. Thanks!
