1. Joel Saltz MD, PhD
Emory University
February 2013
Data Science, Big Data and You
2. Center for Comprehensive Informatics
Big Data
• Social media: analysis of tweets and Facebook to observe trends in real time
• Local Walgreens stock their shelves according to local tweets about cold symptoms
• Credit card fraud: lots of transactions, and yet you get a flag that you shopped in a store that does not fit your profile, and within minutes your card is blocked.
3. Big Data in Commerce - Fraud Detection
• Seek unexpected data – outliers
• Lots of data – all Amex, Visa or Mastercard
transactions
• Look for individual outliers – e.g. credit
transaction involving large amount of money
purchasing unusual product
• Look for sequence data with temporal or
spatial relationship -- find unusual
sequence e.g., intrusion detection and cyber
security
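The "individual outliers" bullet can be illustrated with a short sketch. This is only one possible approach, assuming a robust modified z-score (median and median absolute deviation); the slide does not say which method card networks actually use, and the transaction amounts and the 3.5 cutoff are hypothetical.

```python
from statistics import median

def mad_outliers(amounts, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold."""
    med = median(amounts)
    mad = median([abs(a - med) for a in amounts])
    if mad == 0:
        return []  # no spread at all, nothing can be called an outlier
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [i for i, a in enumerate(amounts)
            if 0.6745 * abs(a - med) / mad > threshold]

# Hypothetical purchase amounts (dollars) for one cardholder.
transactions = [23.5, 41.0, 18.2, 35.9, 27.4, 4999.0, 30.1, 22.8]
flagged = mad_outliers(transactions)
print(flagged)
```

The median-based score is used here because a single huge purchase would inflate an ordinary mean and standard deviation and could hide itself.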
4.
• Define the "typical" regions in a data set; this may be difficult
• "Typical" behavior may change with time. What is typical today may be considered anomalous in the future, and vice versa.
• (Smart) crooks will "keep under the radar" to try to stay undetected
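The point that "typical" drifts over time can be sketched with a sliding baseline. This is a minimal illustration, assuming a fixed-size window of recent values and a k-sigma rule; the window size, k, and the spending series are all hypothetical.

```python
from collections import deque
from statistics import mean, pstdev

def rolling_flags(values, window=5, k=3.0):
    """Flag values more than k standard deviations from the mean of the previous `window` values."""
    recent = deque(maxlen=window)
    flags = []
    for v in values:
        if len(recent) == window:
            mu, sigma = mean(recent), pstdev(recent)
            flags.append(sigma > 0 and abs(v - mu) > k * sigma)
        else:
            flags.append(False)  # not enough history yet
        recent.append(v)
    return flags

# Hypothetical daily spend: the jump to about 100 is anomalous once,
# then becomes the new "typical" level as the window catches up.
spend = [20, 22, 19, 21, 20, 100, 98, 102, 101, 99, 100]
flags = rolling_flags(spend)
print(flags)
```

Only the first jump is flagged; the same mechanism is what lets a patient adversary shift the baseline gradually and stay under the radar.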
12. Scientific Big Data Targets
• Multi-dimensional spatial-temporal datasets
– Biomedicine
– Oil Reservoir Simulation/Carbon
Sequestration/Groundwater Pollution Remediation
– Biomass monitoring and disaster surveillance
– Weather prediction
– Analysis of Results from Large Scale Simulations
• Correlative and cooperative analysis of data from
multiple sensor modalities and sources
• What-if scenarios and multiple design choices or
initial conditions
13. Emory In Silico Center for Brain Tumor
Research (PI = Dan Brat, PD = Joel Saltz)
15. Integrative Analysis: OSU BISTI NBIB Center
Big Data (2005)
Associate genotype with
phenotype
Big science experiments on
cancer, heart disease,
pathogen host response
Tissue specimen -- 1 cm^3
0.3 μm resolution – roughly 10^13
bytes
Molecular data (spatial location)
can add additional significant
factor; e.g. 10^2
Multispectral imaging, laser
captured microdissection,
Imaging Mass Spec, Multiplex
QD
Multiple tissue specimens; another
factor of 10^3
Total: 10^18 bytes – exabyte per big
science experiment
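The arithmetic behind the exabyte estimate can be checked in a few lines. This is a sketch that uses the slide's rounded powers of ten; the assumption that storage is on the order of one byte per voxel is mine, made only to relate the voxel count to the byte count.

```python
# Voxel count for a 1 cm cube sampled at 0.3 micrometer spacing.
specimen_edge_um = 10_000          # 1 cm expressed in micrometers
voxel_edge_um = 0.3
voxels = (specimen_edge_um / voxel_edge_um) ** 3

# Rounded factors as stated on the slide.
bytes_per_specimen = 10 ** 13      # assumes roughly one byte per voxel
molecular_factor = 10 ** 2         # added molecular / multispectral channels
specimen_factor = 10 ** 3          # multiple tissue specimens

total_bytes = bytes_per_specimen * molecular_factor * specimen_factor
print(f"{voxels:.1e} voxels per specimen")
print(f"{total_bytes:.0e} bytes in total")
```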
16. A Data Intense Challenge:
The Instrumented Oil Field of the Future
17.
18. The Tyranny of Scale
(Tinsley Oden - U Texas)
process scale
field scale
km
cm
simulation scale
mm
pore scale
19. Why Applications Get Big
• Physical world or simulation results
• Detailed description of two, three (or more)
dimensional space
• High resolution in each dimension, lots of
timesteps
• e.g. oil reservoir code -- simulate 100 km by
100 km region to 1 km depth at resolution of
10 cm:
– 10^6*10^6*10^4 mesh points, 10^2 bytes per
mesh point, 10^6 timesteps --- 10^24 bytes
(Yottabyte) of data!!!
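The yottabyte figure follows from simple multiplication. A quick sketch, assuming a uniform grid: the mesh counts quoted on the slide (10^6 by 10^6 by 10^4) correspond to a grid spacing of 10 cm over the stated region, which is the spacing used below.

```python
def mesh_points(extent_cm, spacing_cm):
    """Number of grid points along one axis for a uniform grid."""
    return extent_cm // spacing_cm

spacing_cm = 10                               # spacing implied by the mesh counts
nx = mesh_points(100 * 100_000, spacing_cm)   # 100 km in cm
ny = mesh_points(100 * 100_000, spacing_cm)   # 100 km in cm
nz = mesh_points(1 * 100_000, spacing_cm)     # 1 km in cm

bytes_per_point = 10 ** 2
timesteps = 10 ** 6
total_bytes = nx * ny * nz * bytes_per_point * timesteps
print(nx, ny, nz, total_bytes)
```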
20. Detect and track changes in data during production
Invert data for reservoir properties
Detect and track reservoir changes
Assimilate data & reservoir properties into
the evolving reservoir model
Use simulation and optimization to guide future production
Oil Field Management – Joint ITR with Mary Wheeler,
Paul Stoffa
21. Coupled Ground Water and Surface Water Simulations
Multiple codes -- e.g. fluid code, contaminant
transport code
Different space and time scales
Data from a given fluid code run is used in different
contaminant transport code scenarios
23. National Science Foundation Grand Challenge
in Land Cover Dynamics - 1994
• Remote sensing analysis of
high resolution satellite
images.
• Databases of land cover
dynamics are essential for
global carbon models,
biogeochemical cycling,
hydrological modeling and
ecosystem response
modeling
• Maps of the world's tropical
rain forest during the past
three decades.
Larry Davis , Rama Chellappa , Joel Saltz , Alan Sussman , John
Townshend
32. Oligodendroglioma Astrocytoma
Nuclear Qualities
Can we use image analysis of TCGA GBMs to inform
diagnostic criteria based on molecular or clinical
endpoints?
Application: Oligodendroglioma Component in GBM
33. Millions of Nuclei Defined by n Features
• Top-down analysis: use the features
with existing diagnostic constructs
• Bottom-up analysis: let features define
and drive the analysis
34. TCGA Whole Slide Images
Jun Kong
Step 1: Nuclei Segmentation
• Identify individual nuclei
and their boundaries
35. Nuclear Analysis Workflow
Step 1: Nuclei Segmentation
Step 2: Feature Extraction
• Describe individual nuclei in terms of size, shape, and texture
38. Gene Expression Correlates of High Oligo-Astro
Ratio on Machine-based Classification
Oligo Related Genes
Myelin Basic Protein
Proteolipoprotein
HoxD1
Nuclear features most
Associated with Oligo
Signature Genes:
Circularity (high)
Eccentricity (low)
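The two shape features named above have standard geometric definitions: circularity is 4πA/P² (1 for a perfect circle) and eccentricity of an ellipse is sqrt(1 - (b/a)²) (0 for a circle). The sketch below applies them to synthetic elliptical boundaries; the helper names and the synthetic nuclei are illustrative and are not taken from the analysis pipeline described in the talk.

```python
import math

def ellipse_boundary(a, b, n=360):
    """Synthetic nucleus outline: n points on an ellipse with semi-axes a, b."""
    return [(a * math.cos(2 * math.pi * i / n),
             b * math.sin(2 * math.pi * i / n)) for i in range(n)]

def circularity(points):
    """4*pi*area / perimeter**2 for a closed polygon (shoelace area)."""
    twice_area = 0.0
    perimeter = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        twice_area += x0 * y1 - x1 * y0
        perimeter += math.hypot(x1 - x0, y1 - y0)
    return 4 * math.pi * (abs(twice_area) / 2) / perimeter ** 2

def eccentricity(a, b):
    """Eccentricity of an ellipse given its two semi-axes."""
    major, minor = max(a, b), min(a, b)
    return math.sqrt(1 - (minor / major) ** 2)

round_nucleus = ellipse_boundary(5, 5)
elongated_nucleus = ellipse_boundary(10, 3)
print(circularity(round_nucleus), eccentricity(5, 5))
print(circularity(elongated_nucleus), eccentricity(10, 3))
```

A round, oligodendroglioma-like nucleus scores high circularity and low eccentricity; an elongated one scores the opposite.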
39. Millions of Nuclei Defined by n Features
• Top-down analysis: analyze features in
context of existing diagnostic constructs
• Bottom-up analysis: let nuclear features
define and drive the analysis
41. Consensus clustering of morphological signatures
Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients
Each possibility evaluated using 2000 iterations of K-means to quantify co-clustering
[Figure: Nuclear Features Used to Classify GBMs. Panels: clustered matrix labeled with clusters 1, 2, 3; Silhouette Area vs. # Clusters; Silhouette Value per Cluster]
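The idea of repeating K-means many times to quantify co-clustering can be sketched as follows. This is a toy illustration, assuming random subsampling on each run and a plain Lloyd's K-means written inline; the actual study's resampling scheme, feature set, and software are not described on the slide, and the ten 2-D points are hypothetical.

```python
import random

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(pt, centers[c])))
                  for pt in points]
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels

def consensus_matrix(points, k, runs=50, frac=0.8, seed=0):
    """Fraction of runs in which each pair of points lands in the same cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(runs):
        idx = rng.sample(range(n), int(frac * n))
        labels = kmeans([points[i] for i in idx], k, rng)
        for a, la in zip(idx, labels):
            for b, lb in zip(idx, labels):
                sampled[a][b] += 1
                together[a][b] += la == lb
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]

# Hypothetical feature vectors forming two well-separated groups.
points = [(0, 0), (1, 1), (2, 0), (3, 1), (4, 0),
          (100, 1), (101, 0), (102, 1), (103, 0), (104, 1)]
cm = consensus_matrix(points, k=2)
print(cm[0][1], cm[0][5])
```

Entries near 1 mean two items almost always cluster together; entries near 0 mean they almost never do. A crisp block structure in this matrix is the usual evidence for a stable number of clusters.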
42. Clustering identifies three morphological groups
• Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)
• Named for functions of associated genes:
Cell Cycle (CC), Chromatin Modification (CM),
Protein Biosynthesis (PB)
• Prognostically significant (logrank p=4.5e-4)
[Figure: heat map of Feature Indices for the CC, CM, and PB clusters; survival curves (Survival vs. Days, 0 to 3000) for CC, CM, and PB]
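Survival curves like the ones on this slide are commonly drawn with the Kaplan-Meier estimator. The sketch below is a minimal version of that estimator, assuming that is how the curves were produced (the slide does not say); the follow-up times and event flags are hypothetical, not TCGA data.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time with at least one death.

    durations: follow-up time per patient (days)
    events:    1 if death was observed at that time, 0 if censored
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
    return curve

# Hypothetical patients in one morphological cluster.
days = [100, 200, 200, 400, 500]
observed = [1, 1, 0, 1, 0]
curve = kaplan_meier(days, observed)
print(curve)
```

A logrank test then compares such curves across groups (here CC, CM, and PB) to ask whether the differences are larger than chance would explain.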
43. Molecular Correlates of MR Features Using TCGA Data
MRIs of TCGA GBMs reviewed by 3-6 neuroradiologists using VASARI feature set and In
Vivo Imaging tools
MR Features compared to TCGA Transcriptional Classes and Genetic Alterations
David Gutman
46. Emory CTD2 Center:
High throughput protein-protein interaction interrogation in cancer
Principal Investigator and Director: Haian Fu
Co-Directors: Fadlo R. Khuri, Joel Saltz
Project Manager: Margaret Johns
Aim 1 Leader: Yuhong Du
Aim 2 Leader: Carlos Moreno
Cancer genomics-based HT PPI network discovery & validation
Genomics informatics and data integration
Winship Cancer Institute
Center for Comprehensive Informatics
Emory Chemical Biology Discovery Center
Emory Molecular Interaction Center for Functional Genomics (MicFG)
47. Center for Comprehensive Informatics
a.k.a. "Big Data"
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/"Big Data" Computers, Systems Software
• Analysis of Patient Populations
51. Large Scale Data Management
Represented by a complex data model capturing
multi-faceted information including markups,
annotations, algorithm provenance, specimen, etc.
Support for complex relationships and spatial
query: multi-level granularities, relationships
between markups and annotations, spatial and
nested relationships
Highly optimized spatial query and analyses
Implemented in a variety of ways including
optimized CPU/GPU, Hadoop/HDFS and IBM DB2
52. Spatial Centric – Pathology Imaging “GIS”
Point query: human marked point inside a nucleus
Window query: return markups
contained in a rectangle
Spatial join query: algorithm
validation/comparison
Containment query: nuclear feature
aggregation in tumor regions
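Two of the query types above, the point query and the window query, can be sketched in plain Python. This is an illustration only: it uses a brute-force scan and a ray-casting point-in-polygon test, whereas the systems described in the talk rely on optimized spatial engines; the markup names and coordinates are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the closed polygon?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        if (y0 > y) != (y1 > y):  # edge crosses the horizontal line through y
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside

def point_query(markups, x, y):
    """Names of nuclei whose boundary contains the marked point."""
    return [name for name, poly in markups.items()
            if point_in_polygon(x, y, poly)]

def window_query(markups, xmin, ymin, xmax, ymax):
    """Names of markups entirely contained in the rectangle."""
    return [name for name, poly in markups.items()
            if all(xmin <= px <= xmax and ymin <= py <= ymax
                   for px, py in poly)]

# Hypothetical segmented nuclei as polygons in slide coordinates.
markups = {
    "nucleus_a": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "nucleus_b": [(10, 10), (14, 10), (14, 14), (10, 14)],
}
print(point_query(markups, 2, 2))
print(window_query(markups, 8, 8, 20, 20))
```

A spatial join for algorithm comparison would apply similar geometric tests pairwise between two sets of markups, which is why indexing and parallel execution matter at the scale of millions of nuclei.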
53. Center for Comprehensive Informatics
a.k.a. "Big Data"
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/"Big Data" Computers, Systems Software
• Analysis of Patient Populations
54. Clinical Phenotype Characterization and the Emory Analytic Information Warehouse
Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod
• Example Project: Find hot spots in readmissions within 30 days
– What fraction of patients with a given principal diagnosis will be readmitted within 30 days?
– What fraction of patients with a given set of diseases will be readmitted within 30 days?
– How do severity and time course of co-morbidities affect readmissions?
– Geographic analyses
• Compare and contrast with UHC Clinical Data Base
– Repeat analyses across all UHC hospitals
– Are we performing the same?
– How are UHC-curated groupings of patients (e.g., product lines) useful?
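The first question, the fraction of encounters with a given principal diagnosis that are followed by a readmission within 30 days, can be sketched as below. The record layout, the diagnosis labels, and the rule that every encounter counts as an index encounter are assumptions made for illustration; they are not the warehouse's actual schema or definitions.

```python
from collections import defaultdict
from datetime import date

def readmission_rates(encounters, window_days=30):
    """encounters: (patient_id, admit_date, discharge_date, principal_dx) tuples.

    Returns {principal_dx: fraction of encounters followed by another
    admission of the same patient within window_days of discharge}.
    """
    by_patient = defaultdict(list)
    for enc in encounters:
        by_patient[enc[0]].append(enc)
    index_count = defaultdict(int)
    readmit_count = defaultdict(int)
    for visits in by_patient.values():
        visits.sort(key=lambda e: e[1])  # order by admit date
        for cur, nxt in zip(visits, visits[1:] + [None]):
            dx = cur[3]
            index_count[dx] += 1
            if nxt is not None and 0 <= (nxt[1] - cur[2]).days <= window_days:
                readmit_count[dx] += 1
    return {dx: readmit_count[dx] / n for dx, n in index_count.items()}

# Hypothetical encounters.
encounters = [
    (1, date(2012, 1, 1), date(2012, 1, 5), "CHF"),
    (1, date(2012, 1, 20), date(2012, 1, 25), "CHF"),
    (2, date(2012, 2, 1), date(2012, 2, 3), "CHF"),
    (2, date(2012, 4, 1), date(2012, 4, 4), "pneumonia"),
    (3, date(2012, 3, 1), date(2012, 3, 2), "pneumonia"),
]
rates = readmission_rates(encounters)
print(rates)
```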
56. 5-year Datasets from Emory and University HealthSystem Consortium
• EUH, EUHM and WW (inpatient encounters)
• Removed encounter pairs with chemotherapy and radiation
therapy readmit encounters (CDW data)
• Encounter location (down to unit for Emory)
• Providers (Emory only)
• Discharge disposition
• Primary and secondary ICD9 codes
• Procedure codes
• DRGs
• Medication orders (Emory only)
• Labs (Emory only)
• Vitals (Emory only)
• Geographic information (CDW only + US Census and American
Community Survey)
57. Using Emory & UHC Data to Find Associations With 30-day Readmits
• Problem: "Raw" clinical and administrative variables are difficult to use for associative data mining
– Too many diagnosis codes, procedure codes
– Continuous variables (e.g., labs) require interpretation
– Temporal relationships between variables are implicit
• Solution: Transform the data into a much smaller set
of variables using heuristic knowledge
– Categorize diagnosis and procedure codes using code
hierarchies
– Classify continuous variables using standard
interpretations (e.g., high, normal, low)
– Identify temporal patterns (e.g., frequency, duration,
sequence)
– Apply standard data mining techniques
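The three transformation steps in the solution can be sketched as small functions. This is an illustration under stated assumptions: the ICD-9 grouping uses only the three-character category before the decimal point, the lab reference range is illustrative rather than a clinical standard, and the temporal pattern shown (longest run of consecutive abnormal results) is just one example of a duration feature.

```python
def icd9_category(code):
    """Collapse a detailed ICD-9 code to its category, e.g. '428.0' -> '428'."""
    return code.split(".")[0]

def classify_lab(value, low, high):
    """Map a continuous lab value to 'low', 'normal', or 'high'."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"

def longest_run(labels, target):
    """Length of the longest run of consecutive `target` labels."""
    best = run = 0
    for label in labels:
        run = run + 1 if label == target else 0
        best = max(best, run)
    return best

# Hypothetical inputs: diagnosis codes and a potassium time series (mmol/L).
codes = ["428.0", "428.21", "250.00"]
categories = sorted({icd9_category(c) for c in codes})

potassium = [4.1, 5.6, 5.9, 4.8, 5.3]
labels = [classify_lab(v, low=3.5, high=5.0) for v in potassium]  # illustrative range
print(categories, labels, longest_run(labels, "high"))
```

Three raw codes collapse to two categories and five numeric results collapse to a handful of symbols, which is the reduction that makes association mining tractable.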
60. Predictive Modeling for Readmission
• Random forests (ensemble of decision trees)
– Create a decision tree using a random subset of the
variables in the dataset
– Generate a large number of such trees
– All trees vote to classify each test example
– Generate a patient-specific readmission risk for each
encounter
• Rank the encounters by risk for a subsequent 30-
day readmission
Sharath Cholleti
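The voting-and-ranking idea can be sketched with a deliberately simplified forest. This is not the model used in the project: each "tree" here is a depth-one decision stump fit on a bootstrap sample with a random subset of variables, and the three features (prior admissions, length of stay in days, age) and all values are hypothetical.

```python
import random

def gini(ys):
    """Gini impurity of a list of 0/1 labels."""
    if not ys:
        return 0.0
    p = sum(ys) / len(ys)
    return 2 * p * (1 - p)

def fit_stump(rows, labels, features):
    """Best single-threshold split over the given feature indices."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                p_left = sum(left) / len(left)
                p_right = sum(right) / len(right) if right else p_left
                best = (score, f, t, p_left, p_right)
    return best[1:]

def fit_forest(rows, labels, n_trees=200, n_features=2, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]            # bootstrap sample
        feats = rng.sample(range(len(rows[0])), n_features)   # random variables
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest

def risk(forest, row):
    """Fraction of trees voting 'readmitted' for this encounter."""
    votes = sum(1 for f, t, p_left, p_right in forest
                if (p_left if row[f] <= t else p_right) >= 0.5)
    return votes / len(forest)

# Hypothetical training encounters; label 1 = readmitted within 30 days.
rows = [(0, 2, 40), (1, 3, 45), (0, 1, 50), (1, 2, 55),
        (4, 8, 70), (5, 9, 75), (3, 7, 80), (6, 10, 65)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)

new_encounters = {"enc_low": (0, 1, 40), "enc_high": (6, 10, 80)}
ranking = sorted(new_encounters,
                 key=lambda k: risk(forest, new_encounters[k]), reverse=True)
print(ranking)
```

The vote fraction serves as the patient-specific risk score, and sorting by it gives the ranked list of encounters described in the last bullet.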
64. Burst of tachycardia,
no desaturation
Two episodes of
desaturation, no change
in heart rate
HR
SpO2
This slide is for orientation. Red data are the newest, green
intermediate, blue oldest. Samples are taken every 2 seconds.
65. We have started to construct alerts around
desaturation behaviors
(this image courtesy IBM)
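A desaturation alert of the kind mentioned here could be sketched as a run-length rule over the SpO2 stream. The threshold of 90 percent and the minimum duration of 5 samples (10 seconds at the 2-second sampling interval shown earlier) are hypothetical parameters chosen for illustration, not the clinical criteria used in the project.

```python
def desaturation_episodes(spo2, threshold=90, min_samples=5):
    """Return (start_index, end_index) for each run of at least
    min_samples consecutive readings below threshold."""
    episodes = []
    start = None
    # A sentinel reading at the threshold closes any run still open at the end.
    for i, value in enumerate(list(spo2) + [threshold]):
        if value < threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_samples:
                episodes.append((start, i - 1))
            start = None
    return episodes

# Hypothetical SpO2 readings, one every 2 seconds.
spo2 = [97, 96, 88, 87, 86, 85, 89, 95, 96, 89, 97, 84, 83, 82, 85, 86, 98]
episodes = desaturation_episodes(spo2)
print(episodes)
```

The single low reading at index 9 is ignored because it does not persist, which is the kind of filtering needed to keep bedside alerts from firing on momentary artifacts.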
67. Thanks to:
• In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish
Sharma, Tony Pan, David Gutman, Jun Kong, Sharath Cholleti,
Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel Rubin, Tom
Mikkelsen, Adam Flanders, Joel Saltz (Director)
• Digital Pathology R01(s): Foran and Saltz; Jun Kong, Sharath
Cholleti, Fusheng Wang, Tony Pan, Tahsin Kurc, Ashish Sharma,
David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun Hu, Lin Yang,
David J. Foran (Rutgers)
• Analytic Warehouse team: Andrew Post, Sharath Cholleti, Doris
Gao, Michel Monsour, Himanshu Rathod
• In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz
• NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich
Huang, Dima Hammoud, Manal Jilwan, Prashant Raghavan, Max
Wintermark, David Gutman, Carlos Moreno, Lee Cooper, John
Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani, Carl
Jaffe
• ACTSI Biomedical Informatics Program: Marc Overcash, Tim
Morris, Tahsin Kurc, Alexander Quarshie, Circe Tsui, Adam Davis,
Sharon Mason, Andrew Post, Alfredo Tirado-Ramos
• ORNL HPC collaboration: Scott Klasky, David Pugmire (ORNL)
68. Thanks to
• National Cancer Institute
• National Library of Medicine
• National Science Foundation
• Cardiovascular Research Grid (NHLBI)
• Minority Health Grid (ARRA)
• Emory Health Care
• Kaiser Health Care
• Winship Cancer Institute
• Oak Ridge National Laboratory
• Woodruff Health Sciences