One View of Data Science
Philip E. Bourne
peb6a@virginia.edu
https://www.slideshare.net/pebourne
April 5, 2022
Punchline – in 40+ Years in Academia I Have
Never Seen Anything Like It
• It is a response to the digital transformation of
society
• It is touching every discipline (aka vertical)
• We can’t keep the students out of our classes
• Cause – large amounts of digital data
• Effect – interdisciplinarity, openness, translation,
search for responsibility and more
In summary, it is disruptive and higher ed had better pay attention
My Perspective/Biases
• Practical Science: Long-standing computational biomedical researcher
• Open Access: Co-Founder and Founding Editor-in-Chief, PLOS Computational Biology
• Open Knowledge: First President of FORCE11
• Data are Value: Involved in FAIR
• Translation: First Associate Vice Chancellor for Innovation and Industrial Alliances
• Funders as Lever: First Associate Director for Data Science, NIH (preprints, data sharing, BD2K, etc.)
• Change Higher Ed: Founding Dean, School of Data Science
There is a Precedent Which Points to What is
Coming
http://www.ornl.gov/hgmis
• High throughput DNA digital data changed how
we think about biomedicine
• Spawned a new field – bioinformatics /
computational biology/ systems biology /
biomedical data science
• Spawned a multi-billion dollar industry
Is Bioinformatics Dead? PLOS Biology 2021
1991-1995
1993-1998
1998-2003
2003-2010
2011-Present
More on the Data Driven Genomic
Revolution
[Adapted from Eric Green, Director NHGRI]
Life Sciences – The Digital Effect
1980s; 1990s; 2000s; 2010s; 2015; 2022
Discipline: Unknown; Expt. Driven; Emergent; Over-sold; A Service; A Partner; A Driver
The Raw Material: Non-existent; Sequence; Genomes; Omics; Patient; Multi-scale
The People: No name; Technicians; Industry; Bioinformaticians; Systems Biologists; Data Scientists
From a Presentation to the Advisory Committee to the NIH Director
Given this history what do we need to do
differently to accelerate the process and not make
the same mistakes?
Let’s break down one success story to see what
happened and why
https://medium.com/proteinqure/welcome-into-the-fold-bbd3f3b19fdd
Google DeepMind’s AlphaFold2 makes a gigantic leap in solving
protein structures
AlphaFold2
Numerical optimization – differentiable programming
Overall gradient descent trained to win CASP
Jumper et al., 2021. Nature, 596(7873),
pp. 583-589
Transformer models using attention
Geometry invariant to
translation/rotation
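To make "geometry invariant to translation/rotation" concrete, here is a minimal sketch using only the Python standard library. It is not a description of AlphaFold2's architecture. It only shows one simple quantity with this property: the distances between pairs of points do not change when the whole structure is rotated and shifted. The coordinates, the rotation angle and the shift are made-up values, and the helper names are hypothetical.

```python
import math

def pairwise_distances(points):
    """Return the matrix of Euclidean distances between all pairs of 3D points."""
    return [[math.dist(p, q) for q in points] for p in points]

def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def translate(point, shift):
    """Shift a 3D point by the vector `shift`."""
    return tuple(a + b for a, b in zip(point, shift))

# Toy coordinates standing in for residue positions (made-up values).
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.5), (0.0, 2.0, 1.0)]

# Apply the same rigid motion (rotation, then translation) to every point.
moved = [translate(rotate_z(p, 0.7), (10.0, -3.0, 2.5)) for p in coords]

before = pairwise_distances(coords)
after = pairwise_distances(moved)

# The largest change across all pairs should be at floating-point noise level.
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(before, after)
    for a, b in zip(row_a, row_b)
)
print(f"largest change in any pairwise distance: {max_diff:.2e}")
```

A model that works from quantities like these does not need to learn that a rotated or shifted copy of a protein is the same protein.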
Logistics Behind the Win
● Nothing fundamentally new from an AI perspective
● Data Integration
● Collaboration not competition
● Engineering challenge beyond most labs
● Compute power beyond most labs
● Team size beyond most labs
● Worked with protein structure specialists
Downstream Implications
• Cooperation rather than competition
• Public-private partnership
• Translational possibilities are endless
• Made possible by curated open data
• Appreciate engineering
Given these precedents how should we think
about data science in an academic context?
Big data and data science are like the Internet…
If I asked you to define them you would all say
something different, yet you use them every day…
http://vadlo.com/cartoons.php?id=357
The right culture starts with all being on the
same page as to how we define data science
One Representation of Data Science –
The 4+1 Model
• Value – assuring societal
benefit
• Design – communication
of the value of data
• Systems – the means to
communicate and
convey benefit
• Analytics – models and
methods
• Practice – where
everything happens
[From Raf Alvarado]
The Data Science Interplay
• Value + Design = Openness,
responsibility
• Value + Analytics = Human
centered AI, algorithmic bias
• Value + Systems =
sustainability, access,
environmental impact
• Design + Analytics = literate
programming, visualization
• Design + Systems =
dashboards, engineering
design
• Analytics + Systems = ML
engineering
[From Raf Alvarado]
Thinking of data as a science unto itself is novel and controversial
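As a quick check on the interplay slide, the sketch below enumerates every pairing of the four areas and confirms that the slide covers all of them (4 choose 2 = 6). The themes are copied from the slide; the variable names are hypothetical, and Practice (the "+1") is left out because the slide does not pair it with anything.

```python
from itertools import combinations

# The four areas named on the 4+1 slide (Practice is the "+1" and is left out here).
areas = ["Value", "Design", "Analytics", "Systems"]

# Themes copied from the interplay slide, keyed by the pair of areas.
slide_themes = {
    ("Value", "Design"): "openness, responsibility",
    ("Value", "Analytics"): "human-centered AI, algorithmic bias",
    ("Value", "Systems"): "sustainability, access, environmental impact",
    ("Design", "Analytics"): "literate programming, visualization",
    ("Design", "Systems"): "dashboards, engineering design",
    ("Analytics", "Systems"): "ML engineering",
}

# Every unordered pair of areas, in the order the list above implies.
pairs = list(combinations(areas, 2))
missing = [p for p in pairs if p not in slide_themes]

for a, b in pairs:
    print(f"{a} + {b} = {slide_themes[(a, b)]}")
print(f"{len(pairs)} pairings, {len(missing)} missing from the slide")
```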
Let’s dig into a couple of these quadrants…
Databases organize data around a project.
Data warehouses organize the data for an organization.
Data commons organize the data for a scientific discipline or field.
Data
Warehouse
Data Ecosystems
See Forthcoming Science Policy Forum
Challenges
Fixed level of funding
Opportunities
data commons
Data commons co-locate data
with cloud computing
infrastructure and commonly
used software services, tools &
apps for managing, analyzing and
sharing data to create an
interoperable resource for the
research community.*
*Robert L. Grossman, Allison Heath, Mark Murphy, Maria Patterson and Walt Wells, A Case for Data Commons: Toward Data Science as a Service, IEEE
Computing in Science & Engineering, 2016. Source of image: The CDIS, GDC, & OCC data commons infrastructure at a University of Chicago data center.
Bonazzi VR, Bourne PE (2017) Should biomedical research be like Airbnb? PLoS Biol 15(4): e2001818.
Systems
[Adapted from Bob Grossman]
Research ethics
committees (RECs) review
the ethical acceptability
of research involving
human participants.
Historically, the principal
emphases of RECs have
been to protect
participants from physical
harms and to provide
assurance as to
participants’ interests and
welfare.*
[The Framework] is
guided by Article 27
of the 1948 Universal
Declaration of Human
Rights. Article 27
guarantees the rights
of every individual in
the world "to share in
scientific
advancement and its
benefits" (including to
freely engage in
responsible scientific
inquiry)…*
Protect human
subject data
The right of human
subjects to benefit
from research.
*GA4GH Framework for Responsible Sharing of Genomic and Health-Related Data, see goo.gl/CTavQR
Data sharing with protections provides the evidence
so patients can benefit from advances in research.
Balance protecting human subject data
with open research that benefits
patients
[Adapted from Bob Grossman]
Value
A Data Integration Poster Child
Researcher and Assistant Professor of
Medicine Dr. Thomas Hartka, also a
current online Masters in Data Science
student, is combining two disparate
data sets—electronic health records
and DMV crash data—to save lives
after motor vehicle crashes.
“I enrolled in the MSDS program to
expand my research on automotive
safety. I have already used
techniques from classes in my work.
I hope to expand my research to
real-time analytics to improve
emergency room care.”
— Dr. Thomas Hartka, UVA School
of Medicine
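The kind of data integration described on this slide can be sketched as a simple record linkage. This is an illustration only, not Dr. Hartka's method: all records are made up, every field name (`patient_id`, `visit_date`, `crash_date`, `injury_severity`, `seat_belt`) is a hypothetical assumption, and real health and crash data sets rarely share a clean identifier, so real linkage is usually harder and must respect privacy protections.

```python
from datetime import date

# Hypothetical, made-up records; all field names are assumptions for illustration.
ehr_visits = [
    {"patient_id": "P1", "visit_date": date(2021, 3, 4), "injury_severity": 3},
    {"patient_id": "P2", "visit_date": date(2021, 3, 9), "injury_severity": 1},
]
crash_reports = [
    {"patient_id": "P1", "crash_date": date(2021, 3, 4), "seat_belt": False},
    {"patient_id": "P3", "crash_date": date(2021, 3, 5), "seat_belt": True},
]

def link_records(visits, crashes, max_days=1):
    """Pair each visit with crash reports for the same person within max_days."""
    crashes_by_person = {}
    for crash in crashes:
        crashes_by_person.setdefault(crash["patient_id"], []).append(crash)

    linked = []
    for visit in visits:
        for crash in crashes_by_person.get(visit["patient_id"], []):
            # Keep only visits on the crash date or up to max_days afterwards.
            gap = (visit["visit_date"] - crash["crash_date"]).days
            if 0 <= gap <= max_days:
                linked.append({**visit, **crash})
    return linked

linked = link_records(ehr_visits, crash_reports)
for row in linked:
    print(row["patient_id"], row["injury_severity"], row["seat_belt"])
```

Once records are linked, crash characteristics and clinical outcomes sit in the same row, which is what makes questions about injury patterns answerable at all.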
We Are Not Alone … But We Are Unusual
Furthering Discovery to Build a Better World
RESEARCH
Cybersecurity
Detecting broad-spectrum cyber
threats almost immediately after
they are launched through a $7.6
million Defense Advanced
Research Projects Agency
(DARPA) grant.
Environment
Using NASA data collected aboard the
International Space Station to examine
climate change in the Shenandoah National
Forest and beyond, and find solutions
Health & Medicine
Securing high-performance computing
equipment and personnel to allow
collaboration across the university on brain
science research like Autism, Alzheimer’s,
mental health disorders, traumatic brain
injuries and more.
Business
Discovering what makes a job
interview successful for the
candidate and the recruiter, and
how to mitigate bias in the
recruiting process
Democracy
Investigating how terrorist groups recruit
women through propaganda and examining
risk and threat assessment for extremist
violence perpetrated by women.
Education
Helping economically disadvantaged,
underrepresented populations pursue tailored
educational workforce pathways that have a
higher probability of leading them to success.
SDS Current Research Portfolio
12
7
4
3
2
3
3
Research Areas
Healthcare/Life Sciences
Technology/Software
Defense/Cybersecurity
Finance/Fintech
Energy/Environment
Education & Digital
Humanities
SDS strives to be a connector – a place where interdisciplinary
research driven by common data, methods and expertise
comes together
With So Much Opportunity – What To Do?
Leverage what our institution is already good at…
For us that is leadership, policy, law
Why Responsible Data Science?
• A defining feature
• A partnership between STEM, social
sciences and the humanities
• Where UVA has strength
Challenges
• Deciding what not to do
• Competition for the best team members (faculty and staff)
• Establishing a diverse team
• Lack of a comprehensive enterprise-wide data infrastructure
• It’s easier to conform
Growing the School
M.S. IN DATA SCIENCE
Residential & Online
2020
2020-2023
UNDERGRADUATE
MINOR
2022
PH.D. PROGRAM
2023
UNDERGRADUATE
MAJOR
Building occupied
Team Size (FTEs)
5
40
60
80
120
Research
$5M
$10M
$20M
$30M
https://en.wikipedia.org/wiki/Jim_Gray_(computer_scientist)
https://www.microsoft.com/en-us/research/wp-content/uploads/2009/10/Fourth_Paradigm.pdf
https://twitter.com/aip_publishing/status/856825353645559808
Of course this was all predicted
by smart people…
Model
Transportability
Horizontal
Integration
Multi-scale
Integration
human
mouse
zebrafish
DNA
Gene/Protein
Network
Cell
Tissue
Organ
Body
Population
CNV SNP methylation
3D structure Gene
expression Proteomics
Metabolomics
Metabolic
Signaling
transduction
Gene
regulation
Hepatic Myoepithelial Erythrocyte
Epithelial Muscle Nervous
Liver Kidney Pancreas Heart
Physiologically based
pharmacokinetics
GWAS
Population
dynamics
Microbiota
From Harnessing Big Data for Systems Pharmacology 2017
https://doi.org/10.1146/annurev-pharmtox-010716-104659
Current roadblocks are more cultural than technical
The Fifth Paradigm: Integration Across Scales?
Gohlke et al. 2022
https://onlinelibrary.wiley.com/doi/10.1002/ctm2.726
Real World Evidence for Preventive Effects of Statins on
Cancer Incidence: A Transatlantic Analysis
EHR
Animal Models
Pathways
Questions I Leave You With ….
• Have I overstated the case for data science?
• Are we currently doing the best by our students?
• Are the models we propose the right ones?
• What should we be doing differently?


Editor's Notes

  • #6 History Culture NHGRI role >Defined eras
  • #24 I will introduce the concept of data science with a story that illustrates citizen engagement, the merging of unexpected data, and societal benefit