The Levinthal Lecture
Philip E. Bourne Ph.D., FACMI
Associate Director for Data Science
National Institutes of Health
philip.bourne@nih.gov
http://www.slideshare.net/pebourne
Open Eye Meeting, Santa Fe, March 8, 2016
What follows are my personal views
and not necessarily those of my
employer, the US federal government.
There is No Intelligent Life Down
Here
With Apologies to Cy
Phil Bourne
Open Eye Meeting, Santa Fe, March 8, 2016
My Interactions with Cy
……
And pray that there's intelligent life somewhere up in
space
'Cause there's bugger all down here on Earth
Evidence #1
http://www.iucr.org/resources/commissions/crystallographic-computing/schools/school96/banquet-humour
Evidence #2
We throttle some but not all scholarly
communication
Consider Cy’s own words from
around 1970 concerning data sharing
“At that time, it was difficult to obtain
crystallographic coordinates although the
results of the structural analysis had
been published”
Local: Cooperative Community Action
• Individual letters to editors of journals
• Committees
• IUCr commission on Biological Macromolecules
• ACA/USNCCr
• Richards committee
• Funding agencies
• Articles in journals
Marvin Cassman Fred Richards Richard Dickerson
Courtesy of Helen Berman
PDB Growth
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100
A Broad Culture of Sharing
1999: Research Tools Policy
2003: NIH Data Sharing Policy
2004: Model Organism Policy
2007: Genome-wide Association (GWAS) Policy
2008: NIH Public Access Policy (Publications)
2012: Big Data to Knowledge (BD2K) Initiative
2014: Genomic Data Sharing (GDS) Policy
Modernization of NIH Clinical Trials
White House Initiative (2013 “Holdren Memo”)
Data Sharing: An Essential Component
Modernizing NIH Clinical Trials Activities
• NIH-funded trials published within 100 months of completion
Less than 50% published within 30 months of completion
BMJ 2012;344:d7292
Modernizing NIH Clinical Trials
Activities:
Call to Action
Increasing Clinical Trial Transparency
Proposed November 2014; Final Spring 2016 (est.)
• Notice of Proposed Rulemaking: Clinical Trials Registration and Results Submission (FDAAA, Section 801)
– Further implements statutory requirements on private and public sponsors to register; report results on phase 2, 3, and 4 trials
– Includes drugs, biologics, and devices (except small feasibility)
• Draft NIH Policy on Clinical Trial Information Dissemination
– Extends Section 801 requirements to all NIH-funded clinical trials
– Includes phase 1 trials and trials of non-FDA regulated interventions such as behavioral trials
Evidence #3
Research does not follow a free market economy – you
can get rewarded regardless of what you produce
True Free Market - Photography
Axes: Time; Volume, Velocity, Variety
Digitization: Digital camera invented by Kodak but shelved
Deception: Megapixels & quality improve slowly; Kodak slow to react
Disruption: Film market collapses; Kodak goes bankrupt
Demonetization: Phones replace cameras
Dematerialization: Instagram, Flickr become the value proposition
Democratization: Digital media becomes bona fide form of communication
False Market - Biomedical Research?
Digitization of Basic &
Clinical Research & EHRs
Deception
We Are Here
Disruption
Demonetization
Dematerialization
Democratization
Open science
Patient centered health care
Sustaining the System is a Problem
Source: Michael Bell, http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=830
Reproducibility
Changing Value of Scholarship
“And that’s why we’re here today. Because something
called precision medicine … gives us one of the greatest
opportunities for new medical breakthroughs that we
have ever seen.”
President Barack Obama
January 30, 2015
New Science
Let's get a bit closer to home for this
audience …
Evidence #4
Molecular graphics has not
advanced as it should
http://upload.wikimedia.org/wikipedia/commons/2/2e/Molecular-Graphics-GRIP-75-Console.jpg
What Did Cy Say?
• 1990 – “…although we may not have "chemical insight" there are more and more 3-D structures determined experimentally to aid in understanding which conformational results are reasonable and which are not; as long as we can look at them.”
Good News/Bad News of Molecular Graphics Today
• Good News:
– It is hard to think of a more powerful way to comprehend complex data
– It has excited generations to the promise of science
– It has adapted to changing technologies
• Bad News:
– It is not an adaptive/extensible environment
– It is not a collaborative environment
– It is not an integrative environment
– State is not transferable
BMC Bioinformatics 2005, 6:21
The Knowledge and Data Cycle
0. Full text of PLoS papers stored in a database
1. A link brings up figures from the paper
2. Clicking the paper figure retrieves data from the PDB which is analyzed
3. A composite view of journal and database content results
4. The composite view has links to pertinent blocks of literature text and back to the PDB
Is a database really different than a biological journal? PLoS Comp Biol 2005 1(3) e34
Evidence #5
Image credit: By Pbroks13 (talk) - File:Views on Evolution.jpg; New Scientist Magazine, 19 April 2008, Vol. 198, No. 2652, page 31: "Evolution myths: It doesn't matter if people don't grasp evolution"; New Scientist Magazine, 19 August 2006, Vol. 191, No. 2565, page 11: "Why doesn't America believe in evolution?"; Public Domain, https://commons.wikimedia.org/w/index.php?curid=4403503
Nature’s Reductionism
There are ~20^300
possible proteins
>>>> all the atoms in the Universe
~58M protein sequences from
58K organisms (source RefSeq)
116,539 protein structures
yield 1393 domain folds (SCOP)
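The first number on this slide reads most naturally as 20^300: twenty amino acids at each of 300 positions. The short sketch below checks the comparison with the atoms in the Universe. The chain length of 300 and the figure of roughly 10^80 atoms are assumptions made for illustration; neither is stated on the slide.

```python
import math

AMINO_ACIDS = 20        # standard amino acid alphabet
SEQUENCE_LENGTH = 300   # assumed chain length behind the 20^300 figure
LOG10_ATOMS = 80        # assumed order of magnitude for atoms in the observable universe

# Work in log10 so the two quantities can be compared at a glance.
log10_sequences = SEQUENCE_LENGTH * math.log10(AMINO_ACIDS)
orders_of_magnitude_gap = log10_sequences - LOG10_ATOMS

# Python integers are arbitrary precision, so the exact comparison also works.
sequences_exceed_atoms = AMINO_ACIDS ** SEQUENCE_LENGTH > 10 ** LOG10_ATOMS
```

Under these assumptions the sequence space is about 10^390, some 310 orders of magnitude beyond the atom count, which is the point of the ">>>>" on the slide.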
Is structure a useful
discriminator of species?
Yang, Doolittle & Bourne (2005) PNAS 102(2) 373-8
Method – Distance Determination

Presence/Absence Data Matrix (rows: SCOP SUPERFAMILY (FSF); columns: organisms)

FSF       C. intestinalis  C. briggsae  F. rubripes
a.1.1     1                1            1
a.1.2     1                1            1
a.10.1    0                0            1
a.100.1   1                1            1
a.101.1   0                0            0
a.102.1   0                1            1
a.102.2   1                1            1

Distance Matrix

                 C. intestinalis  C. briggsae  F. rubripes
C. intestinalis  0                101          109
C. briggsae                       0            144
F. rubripes                                    0
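A minimal sketch of how a presence/absence matrix can be turned into pairwise distances. It assumes the distance is simply the number of superfamilies present in one proteome but not the other (a Hamming distance); the slide does not state the measure, so this is an illustration and may not match the published method. Only the seven rows shown on the slide are used, so the results do not reproduce the 101/109/144 values, which come from the full set of superfamilies.

```python
from itertools import combinations

ORGANISMS = ("C. intestinalis", "C. briggsae", "F. rubripes")

# Presence (1) / absence (0) per organism, copied from the seven rows on the slide.
PRESENCE = {
    "a.1.1":   (1, 1, 1),
    "a.1.2":   (1, 1, 1),
    "a.10.1":  (0, 0, 1),
    "a.100.1": (1, 1, 1),
    "a.101.1": (0, 0, 0),
    "a.102.1": (0, 1, 1),
    "a.102.2": (1, 1, 1),
}


def hamming_distances(presence, organisms):
    """Count, for each pair of organisms, the superfamilies on which they differ."""
    distances = {}
    for i, j in combinations(range(len(organisms)), 2):
        differing = sum(row[i] != row[j] for row in presence.values())
        distances[(organisms[i], organisms[j])] = differing
    return distances


distances = hamming_distances(PRESENCE, ORGANISMS)
```

A distance matrix of this kind is the usual input to a distance-based tree-building method such as neighbor joining.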
The Answer Would Appear to be Yes
• It is possible to generate a reasonable tree of life from merely the presence or absence of superfamilies (FSFs) within a given proteome
Environmental Influence
Chris Dupont
Scripps Institution of Oceanography
UCSD
Dupont, Yang, Palenik, Bourne. 2006 PNAS 103(47) 17822-17827
Evolution of the Earth
• 4.5 billion years of change
• 300 ± 50 K
• 1-5 atmospheres
• Constant photoenergy
• Chemical and geological changes
• Life has evolved in this time
• The ocean was the “cradle” for 90% of evolution
Theoretical Levels of Trace Metals and Oxygen in the Deep Ocean Through Earth’s History
[Figure: concentration (O2 in arbitrary units; Zn and Fe in moles L^-1) versus billions of years before present, with curves for oxygen, zinc, iron, cobalt and manganese, and phylogenetic tree symbols for Bacteria, Archaea and Eukarya]
• Whether the deep ocean became oxic or euxinic following the rise in atmospheric oxygen (~2.3 Gya) is debated, therefore both are shown (oxic ocean: solid lines; euxinic ocean: dashed lines).
• The phylogenetic tree symbols at the top of the figure show one idea as to the theoretical periods of diversification for each Superkingdom.
Replotted from Saito et al., 2003, Inorganica Chimica Acta 356: 308-318
Evidence #6
Data resources including the PDB
don’t fully serve the needs of the
user at this point?
Good News/Bad News for the PDB in this Changing Landscape
• Bad News:
– Interface complex and uni-data oriented
– Data accessible; methods accessible (sort of); but not together
– Significant redundancy in services offered
– Sustainability
• Good News:
– Annotation!
– Demand is increasing
– Integrated with other data types
– RESTful services
General Problem Statement:
How to ensure a high quality
annotated data source that provides
the optimal environment for
accessibility, integration and analysis
by a broad community of diverse
users?
Enter the Commons
The Commons
Components
• Computing environment
– cloud or HPC (High Performance Computing)
– supports access, utilization, sharing and storage of digital objects.
• Methods for Interoperability
– enable connectivity, shareability and interoperability between digital objects.
• Digital object compliance model
– describes the properties of digital objects that enable them to be discoverable and shareable.
The Commons
Components
[Diagram: BD2K Centers (six), DDICC, Software, Standards, Labs (four); Infrastructure - The Commons]
Commons - Pilots
• The Cloud Credits - business model
• BD2K Centers
• MODs (Model Organism Databases)
• HMP Data and tools available in the cloud
• NCI Cloud Pilots & Genomic Data Commons
The PDB in the Commons
• Components:
– Annotated collection of data files
– APIs to access these data files
– Example methods using these APIs
• Potential outcomes
– Nothing happens?
– A new breed of developer starts to use PDB data in new ways?
– The casual user has a broader set of services than previously?
– Quality declines/increases?
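As a rough illustration of "example methods using these APIs", the sketch below fetches one PDB-format entry and counts ATOM records per chain. The download URL pattern is an assumption based on the RCSB file server as commonly used, not something given on the slide, and the helper names are hypothetical. The parser relies on the fixed-column PDB format, where the chain identifier sits in column 22.

```python
from urllib.request import urlopen


def pdb_file_url(pdb_id):
    """Build a download URL for a PDB-format entry (URL pattern is an assumption)."""
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"


def fetch_pdb_text(pdb_id):
    """Download the entry as text; requires network access."""
    with urlopen(pdb_file_url(pdb_id)) as response:
        return response.read().decode("utf-8")


def count_atoms_by_chain(pdb_text):
    """Count ATOM records per chain; the chain ID is column 22 (index 21)."""
    counts = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and len(line) > 21:
            chain = line[21]
            counts[chain] = counts.get(chain, 0) + 1
    return counts


# Usage (network required, not run here):
#   counts = count_atoms_by_chain(fetch_pdb_text("1tim"))
```

Keeping the download and the parsing in separate functions means the analysis step can be run against local copies of the files as well as against a remote service.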
Delineation of polypharmacology
across the human structural kinome
using a functional site interaction
fingerprint approach
Zhao et al. J. Med. Chem., 2016,
DOI: 10.1021/acs.jmedchem.5b02041
Evidence #7
The difficulty of translating academic
ideas into products
Functional Site interaction Fingerprint (Fs-IFP) Approach
Step 1. Extract the Structural Kinome
208 kinases, 2,383 ligand-bound structures
Step 2. All-against-all binding-site comparison
Step 3. Encoding Fs-IFP
Step 4. Statistical analysis and machine learning
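To make Steps 3 and 4 more concrete, here is a toy sketch of an interaction fingerprint: each aligned binding-site position gets one bit per interaction type, and two fingerprints are compared with the Tanimoto coefficient. The interaction types, the bit layout and the choice of Tanimoto similarity are all assumptions for illustration; the slide does not describe the actual encoding used by Zhao et al.

```python
# Hypothetical interaction types; the real Fs-IFP encoding may differ.
INTERACTION_TYPES = ("hydrophobic", "hbond_donor", "hbond_acceptor", "aromatic")


def encode_fingerprint(contacts, n_positions):
    """Encode (aligned_position, interaction_type) contacts as a flat bit list."""
    width = len(INTERACTION_TYPES)
    bits = [0] * (n_positions * width)
    for position, kind in contacts:
        bits[position * width + INTERACTION_TYPES.index(kind)] = 1
    return bits


def tanimoto(fp_a, fp_b):
    """Shared on-bits divided by on-bits in either fingerprint."""
    shared = sum(a & b for a, b in zip(fp_a, fp_b))
    either = sum(a | b for a, b in zip(fp_a, fp_b))
    return shared / either if either else 1.0


# Two made-up ligand-bound sites, aligned to three common positions.
fp_a = encode_fingerprint({(0, "hydrophobic"), (2, "hbond_donor")}, n_positions=3)
fp_b = encode_fingerprint({(0, "hydrophobic"), (1, "aromatic")}, n_positions=3)
similarity = tanimoto(fp_a, fp_b)
```

Because every structure is mapped onto the same aligned positions, the fingerprints have equal length and can be clustered or passed to a classifier directly.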
Binding Mode Characterization
of Kinase Inhibitors
Clustering of Fs-IFP across
the structural kinome
Spatial locations for
the binding regions
for the eight clusters
Kinase Binding Profile Prediction Using
Fs-IFP
ROC curves of the
trained support
vector machine model
Performance of the
predicted binding
profiles of 51 type-I
inhibitors against 344
kinases
Summary
There is more intelligence than we
think.
While we study complex systems, they
are also why we do not make faster
progress
Acknowledgements
The 133 Folks who have passed
through my lab over the years
Cy Levinthal for giving me this
opportunity
https://docs.google.com/spreadsheets/d/1QZ48UaKcwDl_iFCvBmJsT03FK-bMchdfuIHe9Oxc-rw/edit#gid=0
NIH…
Turning Discovery Into Health
philip.bourne@nih.gov

Editor's Notes

  • #13 Added (10/1/15): TCGA, dbGaP, GTR
  • #14 Figure 2. Cumulative percentage of studies published in a peer reviewed biomedical journal indexed by Medline during 100 months after trial completion among all NIH funded clinical trials registered within ClinicalTrials.gov.
    Public benefits to clinical trials data-sharing (OSP):
    – Inform future research and research funding decisions
    – Mitigate bias (e.g., non publication of results, especially negative results)
    – Prevent duplication of unsafe trials
    – Meet ethical obligation to human subjects (i.e., that results inform science)
    – Increase access to data about marketed products
    All contribute to public trust in clinical research.
    Source: Ross JS, Tse T, Zarin DA, Xu H, Zhou L, Krumholz HM. Publication of NIH funded trials registered in ClinicalTrials.gov: cross-sectional analysis. BMJ 2012;344:d7292.
  • #16 Text updated by Sarah Carr [10/7/2015] – also changed order to feature NPRM before Draft NIH Policy. Nearly 900 comments received on NPRM, many simply stating broad support. Final Rule expected Spring 2016. Section 801 of the Food and Drug Administration Amendments Act (FDAAA).
  • #22 Photos: FC tweet; RK screen grab
  • #40 Digital object = data or analytics software
  • #46 Sequence reference with a variety of electrostatic properties encoded.