What Can Happen when Genome
Sciences Meets Data Sciences?
Philip E. Bourne PhD, FACMI
Stephenson Chair of Data Science
Director, Data Science Institute
Professor of Biomedical Engineering
peb6a@virginia.edu
https://www.slideshare.net/pebourne
02/14/18 UVA Genome Sciences 1
I am more interested in having a
discussion than giving a lecture …
This is not about my research
specifically but what is happening
more broadly
Agenda
• Some context
– My definition of data science
– What drives my thinking
– What is the NIH thinking?
• Relevant examples
• The DSI and what is happening at UVA
• Together, where do we go from here?
What Do I Mean by Big Data/Data
Science?
• Use of the ever-increasing amount of open,
complex, diverse digital data
• Finding ways to ask and then answer relevant
questions by combining such diverse data sets
• Arriving at statistically significant conclusions
not otherwise obtainable
• Sharing such findings in a useful way
• Translating such findings into actions that
improve the human condition
What Drives my Thinking?
Disruption: Digitization, Deception, Disruption, Demonetization,
Dematerialization, Democratization
Figure labels: Time; Volume, Velocity, Variety
Example - Photography
• Digital camera invented by Kodak but shelved
• Megapixels & quality improve slowly; Kodak slow to react
• Film market collapses; Kodak goes bankrupt
• Phones replace cameras
• Instagram, Flickr become the value proposition
• Digital media becomes bona fide form of communication
From a presentation to the Advisory Board to the NIH Director
A Few Random Data {Science} Facts
• There are ~2.7 Zettabytes (2.7 × 10^6 PB) of digital
data currently
– = US population tweeting 3x/min for 26,976 years
• Big data currently estimated as a $50bn business
– could save $3.1tn – private sector research
• 40% growth in data/yr; 5% growth in IT
expenditure – undervalued
• US: 140,000-190,000 unfilled deep data analytics
jobs – competition for skilled researchers high
• DSI has 600 applicants this year for 50 spots;
MSDS/MBA highly sought – large human capital
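The unit conversion and the gap between the two growth rates on this slide can be checked with a short calculation. This is an illustrative sketch only: it assumes decimal (SI) units, and the 5-year horizon with constant annual rates is a hypothetical assumption that the slide itself does not make.

```python
# Illustrative check of the slide's figures (assumes SI decimal units).
BYTES_PER_PB = 10**15
BYTES_PER_ZB = 10**21


def zb_to_pb(zettabytes: float) -> float:
    """Convert zettabytes to petabytes."""
    return zettabytes * BYTES_PER_ZB / BYTES_PER_PB


def compound(start: float, annual_rate: float, years: int) -> float:
    """Value after compounding `annual_rate` once per year for `years` years."""
    return start * (1 + annual_rate) ** years


total_pb = zb_to_pb(2.7)  # 2.7 ZB expressed in PB

# Hypothetical assumption: both rates stay constant for 5 years.
data_growth = compound(1.0, 0.40, 5)  # relative data volume
it_growth = compound(1.0, 0.05, 5)    # relative IT expenditure

print(f"2.7 ZB = {total_pb:.1e} PB")
print(f"After 5 years: data x{data_growth:.2f}, IT spend x{it_growth:.2f}")
```

Under these assumptions the data volume grows several times faster than the spending meant to manage it, which is the point behind the "undervalued" annotation.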
How Much Biomedical Data?
• Big Data
– Total data from NIH-funded research in 2016
estimated at 650 PB*
– 20 PB of that is in NCBI/NLM (3%) and it is
expected to grow by 10 PB in 2016
• Dark Data
– Only 12% of data described in published papers is
in recognized archives – 88% is dark data^
• Cost
– 2007-2014: NIH spent ~$1.2Bn extramurally on
maintaining data archives
* In 2012 Library of Congress was 3 PB
^ http://www.ncbi.nlm.nih.gov/pubmed/26207759
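The percentages on this slide follow directly from the quoted volumes. The sketch below simply re-derives them; every input is a figure from the slide, and the variable names are illustrative.

```python
# Figures as quoted on the slide; nothing here is new data.
total_nih_pb = 650          # estimated data from NIH-funded research (2016)
ncbi_pb = 20                # portion held at NCBI/NLM
ncbi_growth_pb = 10         # expected NCBI/NLM growth during 2016
archived_share = 0.12       # published data found in recognized archives
library_of_congress_pb = 3  # footnote: Library of Congress in 2012

ncbi_share = ncbi_pb / total_nih_pb                     # fraction held at NCBI/NLM
ncbi_growth_rate = ncbi_growth_pb / ncbi_pb             # one-year relative growth
dark_share = 1 - archived_share                         # data outside archives
loc_multiples = total_nih_pb / library_of_congress_pb   # size vs. Library of Congress

print(f"NCBI/NLM share of total: {ncbi_share:.1%}")
print(f"NCBI/NLM expected growth in one year: {ncbi_growth_rate:.0%}")
print(f"Dark data share: {dark_share:.0%}")
print(f"Total is about {loc_multiples:.0f}x the 2012 Library of Congress")
```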
Consider Some Current High Profile
NIH Examples Where Data Science is
Being Applied
• Moonshot – Bringing together 5 petabytes of homogenized data within the
Genomic Data Commons (GDC) to explore genotype-phenotype
relationships
• MODs – Multiple high-value, high-cost genomic resources
• Human Microbiome Project – microbe characterization and analysis
• TOPMed – Genomic, proteomic, metabolomic, image and EHR data
• All-of-Us Precision Medicine – Building a platform to support data on >1M
individuals with extensive and constantly updated health profiles
• ECHO – Effects of Environmental Exposures on Child Health and
Development – Integration of child health and environmental data
• BRAIN – Temporal and spatial analysis of neural circuits
How is Data Science Being Applied?
• Moonshot – new ways to analyze genotype-phenotype associations
• MODs – new curation and integration tools
• Human Microbiome Project – new cloud based tools
• TOPMed – large scale storage and analysis; data harmonization
• All-of-Us Precision Medicine – security; analysis of sensor data; EHR
integration
• ECHO – metadata descriptions of health and environmental data;
application of geospatial methods
• BRAIN – methods for network analysis, visualization
All:
Analytics, the Commons, FAIR, sustainability, workforce
Wilkinson et al The FAIR Guiding Principles for
scientific data management and stewardship. Sci
Data. 2016 Mar 15;3:160018
https://datascience.nih.gov/TheCommons
Some underlying concerns at NIH…
Reproducibility…
Conformance to data sharing policies
& governance more generally
Why a More Open Process?
Use case:
Diffuse Intrinsic Pontine Gliomas (DIPG)
• Occur in 1:100,000
individuals
• Peak incidence 6-8 years
of age
• Median survival 9-12
months
• Surgery is not an option
• Chemotherapy ineffective
and radiotherapy only
transitory
From Adam Resnick
Timeline of genomic studies in DIPG
• Landmark studies identify
histone mutations as
recurrent driver mutations in
DIPG ~2012
• Almost 3 years later, in
largely the same datasets,
but partially expanded, the
same two groups and 2
others identify ACVR1
mutations as a secondary, co-
occurring mutation
From Adam Resnick
What do we need to do differently to
reveal ACVR1?
• ACVR1 is a targetable kinase
• Inhibition of ACVR1 inhibited tumor
progression in vitro
• ~300 DIPG patients a year
• ~60 are predicted to have ACVR1 mutations
• If only large-scale data sets had been
integrated with TCGA and/or rare
disease data in 2012, ACVR1 mutations
would have been identified
• 60 patients/year × 3 years = 180
children (who likely succumbed to
the disease during that time) whose lives
could have been impacted if only data were FAIR
From Adam Resnick
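The estimate on this slide is a simple product of the quoted figures. The sketch below restates that arithmetic; the implied mutation fraction is derived here from the two patient counts and is not stated on the slide.

```python
# Figures as quoted on the slide (approximate, per year).
dipg_patients_per_year = 300   # ~300 DIPG patients a year
acvr1_patients_per_year = 60   # ~60 predicted to have ACVR1 mutations
delay_years = 3                # gap between the ~2012 studies and the ACVR1 finding

# Derived, not stated on the slide: share of patients with ACVR1 mutations.
implied_fraction = acvr1_patients_per_year / dipg_patients_per_year

# Patients diagnosed during the delay who might have been affected.
patients_during_delay = acvr1_patients_per_year * delay_years

print(f"Implied ACVR1 fraction: {implied_fraction:.0%}")
print(f"Patients over {delay_years} years: {patients_during_delay}")
```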
Both funders and some institutions
see the need to move from pipes to
platforms to accelerate research…
https://blog.lexicata.com/wp-content/uploads/2015/03/platform-model-750x410.png
If platforms are the answer we could
ask the question…
Will biomedical research become more
like Airbnb?
Vivien Bonazzi
Should biomedical research be Like Airbnb?
doi: 10.1371/journal.pbio.2001818
I am not crazy, hear me out
• Airbnb is a platform that supports a trusted relationship
between consumer (renter) and supplier (host)
• The platform focuses on maximizing the exchange of services
between supplier and consumer and maximizing the amount
of trust associated with a given stakeholder
• It seems to be working:
– 60 million users searching 2 million listings in 192 countries
– Average of 500,000 stays per night.
– Valuation of US $25bn
Should biomedical research be Like Airbnb?
doi: 10.1371/journal.pbio.2001818
Platforms will ultimately digitally
integrate the scholarly workflow for
human and machine analysis
Should biomedical research be Like Airbnb?
doi: 10.1371/journal.pbio.2001818
Why a comparison to Airbnb is not fair
• Airbnb was born digital
• The exchange of services on Airbnb is
simple compared to what is required of a
platform to support biomedical research
Nevertheless there is much to be
learnt
Impediments to a biomedical platform
• Current work practices by all stakeholders
• Entrenched business models
• Size of the undertaking aka resources
needed
• Trust
• Incentives to use the platform
http://www.forbes.com/sites/johnhall/2013/04/29/10-barriers-to-employee-innovation/#8bdbaa811133
In summary there is not currently a
widely adopted single platform for
the exchange of services in
biomedical research. Either there is a
platform per service or no platform
at all….
Funders and the institutions they
fund need to work more closely to
implement platforms
Example: NSF and NIH Approaches
How is the DSI responding to these
various needs?
Working across the grounds
to break down traditional silos
Hallmarks
• Currently sustainable
• Planning for where the academical village meets Google – an
ecosystem in which students, faculty, staff, visitors, private sector
reps, entrepreneurs live and work
• Open UVA and open data
• Not owning anything; only working through collaboration e.g.
– Dual degrees
– Research projects across disciplines
• MS DS focusing on practical training
• Dual degrees
• Soon PhD and undergraduate major
• Wikimedian in residence (March, 2018)
Emergent DSI Organization
• Data Integration & Engineering
• Machine Learning & Analytics
• Visualization
• Data Acquisition & Dissemination
• Ethics, Law, Policy, Social Implications
• Biomedical Data Sciences
Data Acquisition & Dissemination
Pilot Open Data Lab Underway
Supplier / Consumer
• Paper Author / Paper Reader
• Data Provider / Data Consumer
• Employer / Employee
• Reagent Provider / Reagent Consumer
• Software Provider / Software Consumer
• Grant Writer / Grant Reviewer
• Educator / Student
Platform
• MS Project
• Google Drive
• Coursera
• Researchgate
• Academia.edu
• Open Science Framework
• Synapse
• F1000
• Rio
Data Integration and
Engineering
• Ontologies
• Object identifiers
• Indexing schemes
• Common data models
Machine Learning &
Analytics
• Neural nets
• Deep learning
• NLP
• Gene expression &
neurological disease (Kipnis)
• Predicting opioid overdose
(VA Health)
• Predicting escalating care
and mortality risk of
cirrhosis patients (UVA HS)
• Human microbiome &
mental health in maternal
health (Psychology &
Nursing)
Visualization
• VR
• Networks
• Sonics
• Visualizing microbial
stability (Biology &
Systems)
Ethics, Law,
Policy & Social
Implications
• Data sharing
• Privacy
• Normativity
Wendy Novicoff, Ph.D
Points of Interaction
• Dual degrees with an MSDS
• Specific projects for:
– Presidential fellows (due March 19, 2018)
– Capstones (due June 29, 2018)
• Thoughts on biomedical data science cluster hires
• Data Science Internship program with NIH, Inova, GMU, VT,
GWU, UMD…
• Join the DSI faculty
• Join the mailing list
– Lunch and learn
– Distinguished lectures
– Special events
References
• Dunn and Bourne. Building the Biomedical Data Science
Workforce. PLoS Biol. 2017 Jul 17;15(7):e2003082.
• Bonazzi and Bourne. Should Biomedical Research be like
Airbnb? PLoS Biol. 2017 Apr 7;15(4):e2001818.
• McKiernan et al. How Open Science Helps Researchers
Succeed. eLife. 2016 Jul 7;5:e16800.
• Wilkinson et al. The FAIR Guiding Principles for scientific
data management and stewardship. Sci Data. 2016
Mar 15;3:160018.
• https://datascience.nih.gov/TheCommons
Acknowledgements
The BD2K Team at NIH
My New Colleagues at UVA
The 150 folks who have passed through my laboratory
https://docs.google.com/spreadsheets/d/1QZ48UaKcwDl_iFCvBmJsT03FK-bMchdfuIHe9Oxc-rw/edit#gid=0
Scott and Beth Stephenson
Anonymous donors for the DSI endowment
Thank You
peb6a@virginia.edu

Editor's Notes

  • #9 $1.25bn per year to capture all data. After a significant effort at reduction, intramurally data is spread across > 60 data centers; imagine the extramural situation.