Data Science and AI in Biomedicine: The World has Changed
Philip E. Bourne PhD
peb6a@virginia.edu
https://www.slideshare.net/pebourne
April 3, 2024 Tulane University
Disclaimer
I am privileged to be helping build a new kind of school within a
traditional institution. I have drunk my own Kool-Aid.
https://en.wikipedia.org/wiki/Jim_Gray_(computer_scientist)
https://www.microsoft.com/en-us/research/wp-content/uploads/2009/10/Fourth_Paradigm.pdf
https://twitter.com/aip_publishing/status/856825353645559808
Science Drivers Over the
Millennia
The Human Genome was the Tipping Point
and Led the Way
http://www.ornl.gov/hgmis
• High throughput DNA digital data changed how
we think about biomedicine
• Spawned a new field – bioinformatics /
computational biology/ systems biology /
biomedical data science
• Spawned a multi billion-dollar industry
Bourne’s Timeline
1980s 1990s 2000s 2010s 2020s
The Discipline (Whatever it is Called)
Unknown, Expt. Driven, Emergent, Over-sold, A Service, A Partner, The Driver
Digital Data
4 Pillars of Data Science
Systems: HPC, Cloud, GPUs
Analytics: HMMs, SVMs, NNs, CNNs, LLMs
Value: HIPAA, Privacy, Security, HITECH
Design: Mol Graphics, Web 2.0, Dashboards
Basic Premise …..
We are at a new tipping point
Basic Premise …
“We need to be more aware than
ever of developments that may be
far outside our discipline that fall
under the broad topic of data
science. In short, we need to
become biomedical data
scientists.”
Stated another way, the
leadership role in data/informatics
afforded by the human genome
project no longer applies.
Data Science –
In 45+ Years in Academia I Have Never Seen Anything Like It
• It is a response to the digital transformation of
society
• It is touching every discipline (aka vertical)
• We can’t keep the students out of our classes
• Cause – large amounts of digital data
• Effect – interdisciplinarity, openness, translation,
search for responsibility and more
In summary, it is disruptive to current modes of biomedical research
A Data Integration Poster Child
Researcher and Assistant Professor of
Medicine Dr. Thomas Hartka, also a
current online Master's in Data Science
student, is combining two disparate
data sets—electronic health records
and DMV crash data—to save lives
after motor vehicle crashes.
“I enrolled in the MSDS program to
expand my research on automotive
safety. I have already used
techniques from classes in my work.
I hope to expand my research to
real-time analytics to improve
emergency room care.”
— Dr. Thomas Hartka, UVA School
of Medicine
Data Science
As a Driver It's Just the Beginning….
https://zenodo.org/records/7768414
Data scientist jobs are predicted to experience 36
percent growth between 2021 and 2031, according
to the US Bureau of Labor Statistics.
The global data science platform market size was
valued at USD 64.14 billion in 2021 and is projected
to grow from USD 81.47 billion in 2022 to USD
484.17 billion by 2029, exhibiting a CAGR of 29.0%
during the forecast period.
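The quoted market projection can be sanity-checked with a short calculation. This is a minimal sketch, assuming the growth compounds annually over the seven years from 2022 to 2029; the variable names are illustrative and not from the source.

```python
# Check the compound annual growth rate (CAGR) implied by the quoted figures.
start_value = 81.47   # USD billion in 2022 (from the slide)
end_value = 484.17    # USD billion in 2029 (from the slide)
years = 2029 - 2022   # assumption: seven annual compounding periods

# CAGR = (end / start) ** (1 / years) - 1
cagr = (end_value / start_value) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.1%}")
```

Under that assumption the implied rate agrees with the 29.0% stated on the slide.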
Data science is the fastest emerging field around the
globe.
57 Member
Institutions
Given these precedents about data and data
science we should start with a definition/framework
Big data and data science are like the Internet…
If I asked you to define them you would all say
something different, yet you use them every day…
http://vadlo.com/cartoons.php?id=357
One Definition of Data Science –
The 4+1 Model (aka domains)
• Value – assuring societal
benefit
• Design – communication
of the value of data
• Systems – the means to
communicate and
convey benefit
• Analytics – models and
methods
• Practice – where
everything happens
[Raf Alvarado & Phil Bourne https://doi.org/10.1142/9789811265679_0004]
The Data Science Interplay
• Value + Design = Openness,
responsibility
• Value + Analytics = Human
centered AI, algorithmic bias
• Value + Systems =
sustainability, access,
environmental impact
• Design + Analytics = literate
programming, visualization
• Design + Systems =
dashboards, engineering
design
• Analytics + Systems = ML
engineering
Thinking of data as a science unto itself is novel and controversial
[Raf Alvarado & Phil Bourne https://doi.org/10.1142/9789811265679_0004]
With this definition let’s explore the
implications for biomedical research …
The 4+1 Model - Systems
• Value – assuring societal
benefit
• Design – communication
of the value of data
• Systems – the means to
communicate and
convey benefit
• Analytics – models and
methods
• Practice – where
everything happens
[Raf Alvarado & Phil Bourne https://doi.org/10.1142/9789811265679_0004]
Systems….
Science, 377 (6603).
DOI: 10.1126/science.abo5947
Systems….
• Need something akin to the electricity grid or banking system
• Need to consider data and methods as first-class data objects
• Examples: European Open Science Cloud (EOSC), the CS3MESH4EOSC Science
Mesh, the China Science and Technology (CST) Cloud, the African Open Science
Platform, the South African National Integrated Cyber Infrastructure System, the
Malaysia Open Science Platform, the Global Open Science Cloud (GOSC), the
Australian Research Data Commons (ARDC) Nectar Research Cloud, the Digital
Research Alliance of Canada (formerly known as the New Digital Research
Infrastructure Organization), and the Arab States Research and Education
Network.
• Problems span funding agencies; solutions do not
• There is a lack of public-private partnership
Analytics ….
AlphaGo – Take Home Messages
https://www.alphagomovie.com/
1. Even the programmers were
disquieted by creating
something better than any
human
2. AlphaGo made a move that
neither human Go experts nor its
programmers anticipated
3. It takes a lot of resources to
defeat the world champion
Go has more possible board positions than there are atoms in the universe
Proteins have ~20**300 possible sequence combinations, also more than the
number of atoms in the universe
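The comparison above can be made concrete with a few lines of arithmetic. This is a rough sketch under stated assumptions: 20 standard amino acids, a chain of 300 residues (the length implied by "20**300"), and the commonly quoted order-of-magnitude estimate of about 10**80 atoms in the observable universe, which is not a figure from the slide.

```python
import math

amino_acids = 20      # size of the standard amino acid alphabet
chain_length = 300    # residues; the length implied by "20**300"
atoms_exponent = 80   # assumption: roughly 10**80 atoms in the observable universe

# Python integers are arbitrary precision, so the count can be computed exactly.
sequences = amino_acids ** chain_length

# Express the count as a power of ten for comparison.
sequence_exponent = chain_length * math.log10(amino_acids)
print(f"20**300 is about 10**{sequence_exponent:.0f}")
print("Exceeds the atom estimate:", sequences > 10 ** atoms_exponent)
```

The gap is hundreds of orders of magnitude, which is why exhaustive enumeration of sequence space is not an option and learned models are attractive.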
Science Games….
https://medium.com/proteinqure/welcome-into-the-fold-bbd3f3b19fdd
AlphaFold2 Makes Significant Leap
AlphaFold2
Numerical optimization – differentiable programming
Overall gradient descent trained to win CASP
Jumper et al., 2021. Nature, 596 (7873),
pp. 583-589
Transformer models using attention
Geometry invariant to
translation/rotation
Logistics Behind the Win
● Nothing fundamentally new from an AI perspective
● Data Integration
● Collaboration not competition
● Engineering challenge beyond most labs
● Compute power beyond most labs
● Team size beyond most labs
● Worked with protein structure specialists
Downstream Implications
• Cooperation rather than competition
• Public-private partnership
• Translational possibilities are endless
• Made possible by curated open data
• Appreciate engineering
Scientific Implications
Exploration of Latent Space
Rethink fold space? Rethink classification schemes?
AI Analytics Across the Scientific Discovery
Process
From Yolanda Gil 2023 AI for Science Eds. Choudhary, Fox & Hey p699
The 4+1 Model - Design
• Value – assuring societal
benefit
• Design – communication
of the value of data
• Systems – the means to
communicate and
convey benefit
• Analytics – models and
methods
• Practice – where
everything happens
[Raf Alvarado & Phil Bourne https://doi.org/10.1142/9789811265679_0004]
Beyond data science the academic landscape
is changing….
https://doi.org/10.1038/sdata.2016.18
https://www.heliosopen.org/
The One Lever Left in Open
Scholarship is Academia Itself
Openness/FAIR
Data Science would not exist if it were not for open
data and methods. It would be wrong for us to take
and not give back
https://sparcopen.org/
https://datascience.virginia.edu/policies
Questions I Leave You With ….
• Are we indeed at a change point?
• Will biomedicine continue to lead data science?
• Do we need new models for doing science?
• Are we placing the right emphasis on our research
products, notably data and methods vs. papers?
Questions?
How we think about our infrastructure is important
• Databases organize data around a project.
• Data warehouses organize the data for an organization.
• Data commons organize the data for a scientific discipline or field.
[Figure labels: Data Warehouse, Data Ecosystems]
Challenges
Fixed level of funding
Opportunities
data commons
Data commons co-locate data
with cloud computing
infrastructure and commonly
used software services, tools &
apps for managing, analyzing and
sharing data to create an
interoperable resource for the
research community.*
*Robert L. Grossman, Allison Heath, Mark Murphy, Maria Patterson and Walt Wells, A Case for Data Commons: Toward Data Science as a Service, IEEE
Computing in Science & Engineering, 2016. Source of image: The CDIS, GDC, & OCC data commons infrastructure at a University of Chicago data center.
Bonazzi VR, Bourne PE (2017) Should biomedical research be like Airbnb? PLoS Biol 15(4): e2001818.
Systems
[Adapted from Bob Grossman]
But wait, the picture is more complicated….
Data Science versus Data Engineering – How
Much Emphasis Where?
Coming back to the question…
So we have a definition of data science and we
have a set of guiding principles, where does this
take us?
Stated another way, what do we want to be
recognized for in 10 years?
https://pebourne.wordpress.com/
Research ethics
committees (RECs) review
the ethical acceptability
of research involving
human participants.
Historically, the principal
emphases of RECs have
been to protect
participants from physical
harms and to provide
assurance as to
participants’ interests and
welfare.*
[The Framework] is
guided by Article 27
of the 1948 Universal
Declaration of Human
Rights. Article 27
guarantees the rights
of every individual in
the world "to share in
scientific
advancement and its
benefits" (including to
freely engage in
responsible scientific
inquiry)…*
Protect human
subject data
The right of human
subjects to benefit
from research.
*GA4GH Framework for Responsible Sharing of Genomic and Health-Related Data, see goo.gl/CTavQR
Data sharing with protections provides the evidence
so patients can benefit from advances in research.
Balance protecting human subject data
with open research that benefits
patients
[Adapted from Bob Grossman]
Value
Why Responsible Data Science?
• A defining feature
• A partnership between STEM, social
sciences and the humanities
• Where UVA has strength
[Figure: Model transportability, horizontal integration, and multi-scale integration across human, mouse, and zebrafish.
Scales: DNA, Gene/Protein, Network, Cell, Tissue, Organ, Body, Population.
Labels by scale: CNV, SNP, methylation; 3D structure, gene expression, proteomics, metabolomics; metabolic, signaling transduction, gene regulation; hepatic, myoepithelial, erythrocyte; epithelial, muscle, nervous; liver, kidney, pancreas, heart; physiologically based pharmacokinetics; GWAS, population dynamics, microbiota.]
From Harnessing Big Data for Systems Pharmacology 2017
https://doi.org/10.1146/annurev-pharmtox-010716-104659
Current roadblocks are more cultural than technical
The Fifth Paradigm: Integration Across Scales?
Gohlke et al. 2022
https://onlinelibrary.wiley.com/doi/10.1002/ctm2.726
Real World Evidence for Preventive Effects of Statins on
Cancer Incidence: A Transatlantic Analysis
EHR
Animal Models
Pathways
Daily Challenges
• Deciding what not to do
• Competition for the best team members (faculty and staff)
• Establishing a diverse team
• Lack of a comprehensive enterprise-wide data infrastructure
• It's easier to conform


Editor's Notes

  • #11 I will introduce the concept of data science with a story that illustrates citizen engagement, merging of unexpected data, and societal benefit