Overview of the Neuroscience Information Framework and how it brings together data, in the form of distributed databases, and knowledge, in the form of ontologies, to map the dataspace and expose places where data and knowledge do not match.
Neuroscience research increasingly relies on large, heterogeneous datasets from various sources. Integrating these diverse data types and making them accessible presents challenges. The NIF (Neuroscience Information Framework) addresses this by creating a federated search engine and unified interface to access multiple neuroscience databases. NIF aims to make neuroscience data more discoverable, accessible, and usable through techniques like unique identifiers, metadata standards, and semantic integration. This will help researchers more effectively find and use relevant neuroscience information.
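The federated-search idea described above can be sketched in a few lines: one query fans out to several source databases and the hits come back merged under a single interface, each tagged with its provenance. The source names and records below are illustrative stand-ins, not NIF's actual sources or API.

```python
# Minimal sketch of federated search: the same query is sent to every
# registered source and the results are merged, tagged with provenance.
# Source names and records are invented for illustration.
SOURCES = {
    "ModelDB": [
        {"id": "modeldb:87284", "title": "Purkinje cell model"},
        {"id": "modeldb:2488", "title": "Hippocampal CA1 pyramidal neuron"},
    ],
    "NeuroMorpho": [
        {"id": "neuromorpho:NMO_00001", "title": "Purkinje cell reconstruction"},
    ],
}

def federated_search(query: str) -> list:
    """Query every registered source and merge the hits."""
    hits = []
    for source, records in SOURCES.items():
        for record in records:
            if query.lower() in record["title"].lower():
                hits.append({**record, "source": source})
    return hits

for r in federated_search("purkinje"):
    print(r["source"], r["id"])
```

In a real federation the inner loop would be a network call to each database, but the merge-and-tag pattern is the same.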
The document discusses the Neuroscience Information Framework (NIF), which aims to provide a consistent framework and portal for discovering and utilizing web-based neuroscience resources. It summarizes the goals of NIF in indexing over 2000 databases and making their content searchable through an expansive neuroscience ontology. The document outlines the history and development of NIF, describes its search capabilities and use of ontologies, and provides examples of tools and resources that integrate NIF services like the Whole Brain Catalog.
A Deep Survey of the Digital Resource Landscape: Perspectives from the Neuros... (Maryann Martone)
The NIF Registry provides insight into the state of digital neuroscience resources on the web. It has cataloged over 6,000 resources, including more than 2,200 databases. While some resources disappear over time, many more grow stale as they are not updated regularly. Maintaining an up-to-date registry requires frequent updates. The NIF data federation can search over 200 databases containing over 1 billion records. This collection continues to grow as new databases are added. The NIF utilizes ontologies and semantic frameworks to integrate data across diverse sources and provide insights into the neuroscience landscape.
How do we know what we don't know? Exploring the data and knowledge space th... (Maryann Martone)
The document discusses the Neuroscience Information Framework (NIF), an initiative that aims to catalog and integrate neuroscience resources and data. NIF surveys the neuroscience resource landscape, currently cataloging over 3000 databases and datasets. It provides semantic integration of these resources through the use of ontologies and allows deep search of aggregated data. However, significant amounts of neuroscience data and resources remain inaccessible in publications, databases, and file drawers. Barriers to data sharing include lack of incentives, standards, and resources. NIF and related efforts aim to develop solutions to make more neuroscience data FAIR - findable, accessible, interoperable, and reusable.
Data Landscapes: The Neuroscience Information Framework (Maryann Martone)
Overview of how to use the Neuroscience Information Framework for data discovery, presented at the Genetics of Addiction Workshop held at the Jackson Laboratory, Aug 28 - Sept 1, 2014.
The document discusses the Neuroscience Information Framework (NIF), which aims to provide a portal for finding and utilizing web-based neuroscience resources. NIF provides a consistent framework for describing various resources like databases, literature, and images. It allows simultaneous searches across these different data types and is supported by neuroscience ontologies. NIF currently catalogs over 5,000 resources and is working to integrate these diverse data sources to help answer questions and discover gaps in our knowledge about the brain.
How do we know what we don’t know: Using the Neuroscience Information Framew... (Maryann Martone)
The document discusses using the Neuroscience Information Framework (NIF) to reveal knowledge gaps in neuroscience. It summarizes that NIF aims to maximize awareness, access, and utility of neuroscience research resources by uniting information from over 200 databases containing over 400 million records. However, it notes that certain domains may still be underrepresented due to biases in available data driven by factors like funding priorities. The framework uses ontologies to help integrate diverse data types and link them with defined concepts, but notes that neuroanatomical structures in particular pose challenges due to inconsistent naming conventions across studies.
The document discusses navigating the neuroscience data landscape. It notes that a grand challenge in neuroscience is to understand brain function across multiple scales of organization. Central to this effort is understanding "neural choreography" - the integrated functioning of neurons into brain circuits. The Neuroscience Information Framework (NIF) aims to facilitate discovery and utilization of web-based neuroscience resources. However, the neuroscience community has not fully exploited currently available data or prepared for forthcoming data.
Neurosciences Information Framework (NIF): An example of community Cyberi... (Maryann Martone)
The document discusses the challenges of managing and utilizing the large amount of neuroscience data being generated. It notes that currently about half of researchers store data only in their own labs, and many lack funding for proper archiving. The Neuroscience Information Framework (NIF) is working to address these issues by creating a catalog and federation of neuroscience resources to facilitate discovery, access, analysis, and integration of data. NIF has assembled the largest searchable collection of neuroscience data on the web, using an ontology and technologies that can search the "hidden web" of resources.
Big data from small data: A survey of the neuroscience landscape through the... (Maryann Martone)
The document discusses the Neuroscience Information Framework (NIF), an initiative by the NIH Blueprint to provide a single access point for searching across multiple neuroscience databases and data types. NIF aims to maximize access to and utility of worldwide neuroscience resources by creating a consistent framework for describing resources and enabling simultaneous searches. It notes that neuroscience data exists in many forms, from raw data to processed data to claims, across multiple scales and data types. NIF is designed to rapidly integrate these diverse resources through a tiered system that has a low barrier for data providers to participate.
The document discusses methodologies for sharing long-tail data and what has been learned. It notes that unique identifiers (PIDs) are important for identifying entities across contexts. Standards like MINI and common data elements (CDEs) help ensure data is findable, accessible, and reusable. The Neuroscience Information Framework (NIF) aggregates ontologies and searches over 200 data sources to organize information. What we have learned is that data should be in repositories, not personal servers; people are key to these efforts; and resources should be comprehensive and support each other to advance open data sharing.
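The unique identifiers (PIDs) mentioned above work because a registry maps each identifier to exactly one resource, so the same entity can be cited consistently across papers and databases. A toy resolver, with RRID-style identifiers whose entries are invented for illustration, might look like:

```python
# Toy PID resolver: one registry maps an identifier to the resource it
# names. The RRID-style entries below are illustrative, not real records.
registry = {
    "RRID:AB_0000001": {"type": "antibody", "name": "Anti-GFAP (example)"},
    "RRID:SCR_0000002": {"type": "tool", "name": "Example Morphology Archive"},
}

def resolve(pid: str) -> dict:
    """Look up a PID; fail loudly if it was never registered."""
    try:
        return registry[pid]
    except KeyError:
        raise KeyError(f"unregistered identifier: {pid}") from None

print(resolve("RRID:SCR_0000002")["name"])
```

The design point is that resolution fails loudly for unregistered identifiers, rather than silently matching on an ambiguous free-text name.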
How Portable Are the Metadata Standards for Scientific Data? (Jian Qin)
The one-covers-all approach in current metadata standards for scientific data has serious limitations in keeping up with ever-growing data. This paper reports findings from a survey of metadata standards in the scientific data domain and argues for a metadata infrastructure. The survey collected 4,400+ unique elements from 16 standards and categorized them into 9 categories. The highest counts of elements occurred in the descriptive category, and many of these overlapped with Dublin Core elements. The same pattern was repeated in the elements that co-occurred across standards: a small number of semantically general elements appeared across the largest number of standards, while the remaining co-occurrences formed a long tail with a wide range of specific semantics. The paper discusses the implications of these findings for metadata portability and infrastructure, pointing out that large, complex standards and widely varied naming practices are the major hurdles to building a metadata infrastructure.
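The co-occurrence analysis in that survey amounts to counting, for each element name, how many standards carry it. A minimal sketch with toy element sets (the standards are real, but the element selections here are illustrative, not the paper's survey data):

```python
from collections import Counter

# Toy element sets for a few metadata standards; real standards carry
# far more elements. Element choices are illustrative only.
standards = {
    "DublinCore": {"title", "creator", "subject", "date", "format"},
    "DataCite":   {"title", "creator", "publisher", "date", "rights"},
    "EML":        {"title", "creator", "geographicCoverage", "methods"},
}

# Count in how many standards each element name appears.
occurrence = Counter(e for elements in standards.values() for e in elements)

# A few general elements span all standards; the rest form a long tail.
shared = sorted(e for e, n in occurrence.items() if n == len(standards))
print(shared)  # -> ['creator', 'title']
```

Even this toy run shows the paper's shape: the head of the distribution is generic Dublin Core-like elements, while domain-specific ones like `geographicCoverage` appear in only one standard.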
The Neuroscience Information Framework has indexed over 100 big-data databases, allowing us to ask questions about the big-data landscape. Anita Bandrowski presents an overview of the NIF system and offers insights into the addiction data landscape for the Jackson Laboratory (JAX).
EiTESAL eHealth Conference, 14 & 15 May 2017 (EITESANGO)
This document discusses bioinformatics and some of its key concepts and tools. It begins with definitions of bioinformatics as the intersection of biology, computer science, and information technology. It then discusses some of the data formats, tools, and skills used in bioinformatics, including working with nucleotide sequence data, translating sequences into amino acids, and analyzing large datasets. It also summarizes how ontologies are used to represent concepts and how various data types are organized and stored in databases for analysis.
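One of the bioinformatics tasks named above, translating a nucleotide sequence into amino acids, is small enough to sketch directly. The codon table below covers only the codons used in the example; a real translation would use the full standard genetic code.

```python
# Sketch of DNA-to-protein translation. Only the codons needed for the
# example are included; the standard genetic code has 64 entries.
CODON_TABLE = {
    "ATG": "M", "GCC": "A", "AAA": "K", "TAA": "*",  # "*" marks a stop codon
}

def translate(dna: str) -> str:
    """Translate an in-frame DNA sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):       # walk the sequence codon by codon
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ATGGCCAAATAA"))  # -> MAK
```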
Semantics for Bioinformatics: What, Why and How of Search, Integration and An... (Amit Sheth)
Amit Sheth's Keynote at Semantic Web Technologies for Science and Engineering Workshop (held in conjunction with ISWC2003), Sanibel Island, FL, October 20, 2003.
This document discusses biological networks and how to analyze genome-scale data using networks. It defines different types of biological networks including DNA-protein, RNA-RNA, RNA-protein, and protein-protein networks. It also describes popular network visualization and analysis tools like Cytoscape and different databases for retrieving protein-protein and pathway interaction networks. The document emphasizes that networks can help validate findings, explore and discover new insights from large genomic and omics datasets.
Next-Generation Search Engines for Information Retrieval (Waqas Tariq)
In recent years there have been significant advances in scientific data management and retrieval, particularly in standards and protocols for archiving data and metadata. Scientific data is generally rich, hard to interpret, and spread across many locations. To integrate these pieces, a data archive and associated metadata should be generated and stored in a form that is locatable, retrievable, and understandable, and, just as importantly, one that will remain accessible as technology changes, such as XML. New search technologies built around these protocols make searching easy, fast, and robust. One such system is Mercury, a metadata harvesting, data discovery, and access system built for researchers to search for, share, and obtain spatiotemporal data across a range of climate and ecological sciences.
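The point about XML as a durable archive format can be made concrete with a minimal metadata record. The field names below follow Dublin Core conventions, but the record itself is invented for illustration and is not Mercury's actual schema.

```python
import xml.etree.ElementTree as ET

# A minimal XML metadata record of the kind a harvester could index.
# Field names follow Dublin Core; the values are invented for illustration.
record = ET.Element("record")
ET.SubElement(record, "title").text = "Soil respiration measurements, 2001-2005"
ET.SubElement(record, "creator").text = "Example, A."
ET.SubElement(record, "coverage").text = "35.96N -84.28W"
ET.SubElement(record, "date").text = "2005-12-31"

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

# Because the format is plain XML, it stays parseable as tools change:
parsed = ET.fromstring(xml_text)
print(parsed.findtext("title"))
```

A harvester like Mercury would crawl many such records and build a searchable index over the parsed fields.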
The document provides an overview of a presentation on open science and open data for librarians. It includes:
- An introduction to open science/open data concepts and the library's role in research data services.
- Examples of activities working with research data, including data collection, visualization, cleaning, analysis and preservation.
- A discussion of the benefits of open data, challenges researchers face in opening their data, and the role of data repositories and standards.
- An overview of the African Open Science Platform project which aims to promote open science on the continent.
Semantic Web for 360-degree Health: State-of-the-Art & Vision for Better Inte... (Amit Sheth)
Ora Lassila and Amit Sheth, "Semantic Web for 360-degree Health: State-of-the-Art & Vision for Better Interoperability", Invited Talk at ONC-HHS Invitational Workshop on Next Generation Interoperability for Health, Washington DC, January 19-20, 2011.
Keynote for Theory and Practice of Digital Libraries 2017
The theory and practice of digital libraries provides a long history of thought on how to manage knowledge, ranging from collection development to cataloging and resource description. These tools were all designed to make knowledge findable and accessible to people. Even technical progress in information retrieval and question answering is targeted at answering a human's information need.
However, demand is increasingly for data: data needed not for human consumption but to drive machines. As one sign of this demand, there has been explosive growth in job openings for Data Engineers, professionals who prepare data for machine consumption. In this talk, I overview the information needs of machine intelligence and ask: are our knowledge management techniques applicable for serving this new consumer?
Data Science and What It Means to Library and Information Science (Jian Qin)
Data science involves collecting, analyzing, and preserving large datasets to extract knowledge and make predictions. It differs from traditional disciplines by dealing with heterogeneous, unstructured data from complex networks. A data scientist requires math, computing, communication skills, and the ability to ask the right questions. Libraries are well-positioned to offer various data services including data discovery, consulting, mining, integration, and curation to support research and decision-making. Practicing data science in libraries requires vision, risk-taking, data science knowledge, careful planning, and collaboration.
IJCER (www.ijceronline.com) International Journal of Computational Engineerin... (ijceronline)
This document summarizes text mining techniques for information retrieval, extraction, and indexing. It discusses common information retrieval techniques like inverted indices and signature files. It also covers stemming, domain dictionaries, exclusion lists, and research directions in text mining like finding better representations for extracted information, enabling multilingual analysis, and integrating domain knowledge. The key techniques discussed are text indexing, query processing, and information extraction from text.
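The inverted index mentioned above is the core structure behind these retrieval techniques: it maps each term to the set of documents containing it, so a multi-term query becomes a set intersection. A minimal sketch over toy documents:

```python
from collections import defaultdict

# Toy document collection; in practice terms would also be stemmed and
# stop words removed, as discussed above.
docs = {
    0: "text mining for information retrieval",
    1: "information extraction from text",
    2: "query processing and indexing",
}

# Build the inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(*terms: str) -> set:
    """Return ids of documents containing ALL query terms (AND query)."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(sorted(search("text", "information")))  # -> [0, 1]
```

Signature files, also discussed in the paper, trade this exact structure for a compact probabilistic filter, at the cost of false matches that must be checked against the documents.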
RDAP14: Maryann Martone, Keynote, The Neuroscience Information Framework (ASIS&T)
The Neuroscience Information Framework (NIF) is an initiative of the NIH Blueprint to maximize access to and utility of worldwide neuroscience research resources. NIF catalogs over 10,000 resources including databases, literature, and materials. It provides search capabilities across these resources and develops ontologies and semantic frameworks to integrate diverse data types and scales. NIF aims to make dispersed neuroscience information more findable, accessible, interoperable, and reusable to enable new insights.
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear... (dkNET)
dkNET provides a single portal for discovering over 3,500 biomedical research resources and datasets. It aims to make these resources findable, accessible, interoperable, and reusable in accordance with the FAIR principles. The portal contains three main sections for browsing community resources, additional resources, and literature. It utilizes faceted searching and provides analytics and notifications to help users track changes to resources over time.
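Faceted search of the kind the dkNET portal offers comes down to counting facet values over a result set so the interface can show filters, then narrowing the set when a value is picked. The resource records below are invented for illustration and are not dkNET's actual schema.

```python
from collections import Counter

# Illustrative resource records with two facets, "type" and "access";
# this is not dkNET's real data model.
resources = [
    {"name": "Metabolic Phenotyping Center", "type": "service", "access": "open"},
    {"name": "Diabetes dataset A", "type": "dataset", "access": "open"},
    {"name": "Kidney atlas", "type": "dataset", "access": "restricted"},
]

def facet_counts(records, facet):
    """Count how many records carry each value of the given facet."""
    return Counter(r[facet] for r in records)

print(facet_counts(resources, "type"))    # counts shown next to each filter

# Selecting a facet value narrows the result set:
datasets = [r for r in resources if r["type"] == "dataset"]
print(len(datasets))
```

Recomputing the counts on the narrowed set is what lets the remaining filters update as the user drills down.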
Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ... (Amit Sheth)
Talk presented in Spain (WiMS 2013/UAM-Madrid, UMA-Malaga), June 2013.
Replaces earlier version at: http://www.slideshare.net/apsheth/semantic-technology-empowering-real-world-outcomes-in-biomedical-research-and-clinical-practices
Biomedical and translational research, as well as clinical practice, are increasingly data driven. Activities routinely involve large numbers of devices, data, and people, bringing the challenges of volume, velocity (change), variety (heterogeneity), and veracity (provenance, quality). Equally important is the challenge of serving broader ecosystems of people and organizations, extending beyond traditional stakeholders such as drug makers, clinicians, and policy makers to increasingly technology-savvy and information-empowered patients. We believe that semantics is becoming the centerpiece of informatics solutions that convert data into meaningful, contextually relevant information and insights, leading to optimal decisions for translational research and 360-degree health, fitness, and well-being.
In this talk, I will provide a series of snapshots of efforts in which semantic approach and technology is the key enabler. I will emphasize real-world and in-use projects, technologies and systems, involving significant collaborations between my team and biomedical researchers or practicing clinicians. Examples include:
• Active Semantic Electronic Medical Record
• Semantics and Services enabled Problem Solving Environment for T.cruzi (SPSE)
• Data Mining of Cardiology data
• Semantic Search, Browsing and Literature Based Discovery
• PREscription Drug abuse Online Surveillance and Epidemiology (PREDOSE)
• kHealth: development of knowledge-enhanced sensing and mobile computing applications (using low-cost sensors and a smartphone), along with the ability to convert low-level observations into clinically relevant abstractions
Further details are at http://knoesis.org/amit/hcls
How do we know what we don’t know: Using the Neuroscience Information Framew...Maryann Martone
The document discusses using the Neuroscience Information Framework (NIF) to reveal knowledge gaps in neuroscience. It summarizes that NIF aims to maximize awareness, access, and utility of neuroscience research resources by uniting information from over 200 databases containing over 400 million records. However, it notes that certain domains may still be underrepresented due to biases in available data driven by factors like funding priorities. The framework uses ontologies to help integrate diverse data types and link them with defined concepts, but notes that neuroanatomical structures in particular pose challenges due to inconsistent naming conventions across studies.
The document discusses navigating the neuroscience data landscape. It notes that a grand challenge in neuroscience is to understand brain function across multiple scales of organization. Central to this effort is understanding "neural choreography" - the integrated functioning of neurons into brain circuits. The Neuroscience Information Framework (NIF) aims to facilitate discovery and utilization of web-based neuroscience resources. However, the neuroscience community has not fully exploited currently available data or prepared for forthcoming data.
EcsiNeurosciences Information Framework (NIF): An example of community Cyberi...Maryann Martone
The document discusses the challenges of managing and utilizing the large amount of neuroscience data being generated. It notes that currently, about half of researchers only store data in their own labs and many lack funding for proper archiving. The National Information Framework (NIF) is working to address these issues by creating a catalog and federation of neuroscience resources to facilitate discovery, access, analysis and integration of data. NIF has assembled the largest searchable collection of neuroscience data on the web using an ontology and technologies that can search the "hidden web" of resources.
Big data from small data: A survey of the neuroscience landscape through the...Maryann Martone
The document discusses the Neuroscience Information Framework (NIF), an initiative by the NIH Blueprint to provide a single access point for searching across multiple neuroscience databases and data types. NIF aims to maximize access to and utility of worldwide neuroscience resources by creating a consistent framework for describing resources and enabling simultaneous searches. It notes that neuroscience data exists in many forms, from raw data to processed data to claims, across multiple scales and data types. NIF is designed to rapidly integrate these diverse resources through a tiered system that has a low barrier for data providers to participate.
The document discusses methodologies for sharing long-tail data and what has been learned. It notes that unique identifiers (PIDs) are important for identifying entities across contexts. Standards like MINI and common data elements (CDEs) help ensure data is findable, accessible, and reusable. The Neuroscience Information Framework (NIF) aggregates ontologies and searches over 200 data sources to organize information. What we have learned is that data should be in repositories, not personal servers; people are key to these efforts; and resources should be comprehensive and support each other to advance open data sharing.
How Portable Are the Metadata Standards for Scientific Data?Jian Qin
The one-covers-all approach in current metadata standards for scientific data has serious limitations in keeping up with the ever-growing data. This paper reports the findings from a survey to metadata standards in the scientific data domain and argues for the need for a metadata infrastructure. The survey collected 4400+ unique elements from 16 standards and categorized these elements into 9 categories. Findings from the data included that the highest counts of element occurred in the descriptive category and many of them overlapped with DC elements. This pattern also repeated in the elements co-occurred in different standards. A small number of semantically general elements appeared across the largest numbers of standards while the rest of the element co-occurrences formed a long tail with a wide range of specific semantics. The paper discussed implications of the findings in the context of metadata portability and infrastructure and pointed out that large, complex standards and widely varied naming practices are the major hurdles for building a metadata infrastructure.
the Neuroscience Information Framework has over 100 big data databases indexed, allowing us to ask big data landscape questions. Anita Bandrowski presents an overview of the NIF system and provides insights into the addiction data landscape to JAX laboratories.
EiTESAL eHealth Conference 14&15 May 2017 EITESANGO
This document discusses bioinformatics and some of its key concepts and tools. It begins with definitions of bioinformatics as the intersection of biology, computer science, and information technology. It then discusses some of the data formats, tools, and skills used in bioinformatics, including working with nucleotide sequence data, translating sequences into amino acids, and analyzing large datasets. It also summarizes how ontologies are used to represent concepts and how various data types are organized and stored in databases for analysis.
Semantics for Bioinformatics: What, Why and How of Search, Integration and An...Amit Sheth
Amit Sheth's Keynote at Semantic Web Technologies for Science and Engineering Workshop (held in conjunction with ISWC2003), Sanibel Island, FL, October 20, 2003.
This document discusses biological networks and how to analyze genome-scale data using networks. It defines different types of biological networks including DNA-protein, RNA-RNA, RNA-protein, and protein-protein networks. It also describes popular network visualization and analysis tools like Cytoscape and different databases for retrieving protein-protein and pathway interaction networks. The document emphasizes that networks can help validate findings, explore and discover new insights from large genomic and omics datasets.
Next-Generation Search Engines for Information RetrievalWaqas Tariq
In the recent years, there have been significant advancements in the areas of scientific data management and retrieval techniques, particularly in terms of standards and protocols for archiving data and metadata. Scientific data is generally rich, not easy to understand, and spread across different places. In order to integrate these pieces together, a data archive and associated metadata should be generated. This data should be stored in a format that can be locatable, retrievable and understandable, more importantly it should be in a form that will continue to be accessible as technology changes, such as XML. New search technologies are being implemented around these protocols, which makes searching easy, fast and yet robust. One such system, Mercury, a metadata harvesting, data discovery, and access system, built for researchers to search to, share and obtain spatiotemporal data used across a range of climate and ecological sciences.
The document provides an overview of a presentation on open science and open data for librarians. It includes:
- An introduction to open science/open data concepts and the library's role in research data services.
- Examples of activities working with research data, including data collection, visualization, cleaning, analysis and preservation.
- A discussion of the benefits of open data, challenges researchers face in opening their data, and the role of data repositories and standards.
- An overview of the African Open Science Platform project which aims to promote open science on the continent.
Semantic Web for 360-degree Health: State-of-the-Art & Vision for Better Inte...Amit Sheth
Ora Lassila and Amit Sheth, "Semantic Web for 360-degree Health: State-of-the-Art & Vision for Better Interoperability", Invited Talk at ONC-HHS Invitational Workshop on Next Generation Interoperability for Health, Washington DC, January 19-20, 2011.
Keynote for Theory and Practice of Digital Libraries 2017
The theory and practice of digital libraries provides a long history of thought around how to manage knowledge ranging from collection development, to cataloging and resource description. These tools were all designed to make knowledge findable and accessible to people. Even technical progress in information retrieval and question answering are all targeted to helping answer a human’s information need.
However, increasingly demand is for data. Data that is needed not for people’s consumption but to drive machines. As an example of this demand, there has been explosive growth in job openings for Data Engineers – professionals who prepare data for machine consumption. In this talk, I overview the information needs of machine intelligence and ask the question: Are our knowledge management techniques applicable for serving this new consumer?
Data Science and What It Means to Library and Information ScienceJian Qin
Data science involves collecting, analyzing, and preserving large datasets to extract knowledge and make predictions. It differs from traditional disciplines by dealing with heterogeneous, unstructured data from complex networks. A data scientist requires math, computing, communication skills, and the ability to ask the right questions. Libraries are well-positioned to offer various data services including data discovery, consulting, mining, integration, and curation to support research and decision-making. Practicing data science in libraries requires vision, risk-taking, data science knowledge, careful planning, and collaboration.
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
This document summarizes text mining techniques for information retrieval, extraction, and indexing. It discusses common information retrieval techniques like inverted indices and signature files. It also covers stemming, domain dictionaries, exclusion lists, and research directions in text mining like finding better representations for extracted information, enabling multilingual analysis, and integrating domain knowledge. The key techniques discussed are text indexing, query processing, and information extraction from text.
RDAP14: Maryann Martone, Keynote, The Neuroscience Information FrameworkASIS&T
The Neuroscience Information Framework (NIF) is an initiative of the NIH Blueprint to maximize access to and utility of worldwide neuroscience research resources. NIF catalogs over 10,000 resources including databases, literature, and materials. It provides search capabilities across these resources and develops ontologies and semantic frameworks to integrate diverse data types and scales. NIF aims to make dispersed neuroscience information more findable, accessible, interoperable, and reusable to enable new insights.
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...dkNET
dkNET provides a single portal for discovering over 3,500 biomedical research resources and datasets. It aims to make these resources findable, accessible, interoperable, and reusable in accordance with the FAIR principles. The portal contains three main sections for browsing community resources, additional resources, and literature. It utilizes faceted searching and provides analytics and notifications to help users track changes to resources over time.
Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ... (Amit Sheth)
Talk presented in Spain (WiMS 2013/UAM-Madrid, UMA-Malaga), June 2013.
Replaces earlier version at: http://www.slideshare.net/apsheth/semantic-technology-empowering-real-world-outcomes-in-biomedical-research-and-clinical-practices
Biomedical and translational research, as well as clinical practice, are increasingly data driven. Activities routinely involve large numbers of devices, data, and people, resulting in the challenges associated with volume, velocity (change), variety (heterogeneity), and veracity (provenance, quality). Equally important is the challenge of serving the needs of broader ecosystems of people and organizations, extending beyond traditional stakeholders like drug makers, clinicians, and policy makers to increasingly technology-savvy and information-empowered patients. We believe that semantics is becoming the centerpiece of informatics solutions that convert data into meaningful, contextually relevant information and insights that lead to optimal decisions for translational research and 360-degree health, fitness, and well-being.
In this talk, I will provide a series of snapshots of efforts in which semantic approach and technology is the key enabler. I will emphasize real-world and in-use projects, technologies and systems, involving significant collaborations between my team and biomedical researchers or practicing clinicians. Examples include:
• Active Semantic Electronic Medical Record
• Semantics and Services enabled Problem Solving Environment for T.cruzi (SPSE)
• Data Mining of Cardiology data
• Semantic Search, Browsing and Literature Based Discovery
• PREscription Drug abuse Online Surveillance and Epidemiology (PREDOSE)
• kHealth: development of a knowledge-enhanced sensing and mobile computing applications (using low cost sensors and smartphone), along with ability to convert low level observations into clinically relevant abstractions
Further details are at http://knoesis.org/amit/hcls
The document discusses the challenges of managing and analyzing the large amounts of neuroscience data being generated. It notes that currently about half of researchers store their data only locally in their labs rather than in shared databases or archives, which prevents other researchers from accessing and using the data. The Neuroscience Information Framework (NIF) is working to address these issues by creating a registry of neuroscience resources and developing technologies that allow researchers to discover, share, analyze, and integrate data from various sources. NIF's registry currently catalogs over 6,000 resources, including 2,200 databases. The goal is for NIF to help the neuroscience community better exploit existing data and prepare for future increases in data.
The real world of ontologies and phenotype representation: perspectives from... (Maryann Martone)
The document discusses the Neuroscience Information Framework (NIF) and its role in facilitating discovery and use of neuroscience resources through a consistent semantic framework. NIF provides a portal for searching various types of neuroscience data and information organized by categories. It utilizes ontologies and advanced technologies to allow simultaneous searching of multiple sources. Challenges include the large number of databases and other resources, differing data types, and inconsistent naming of brain structures across sources.
The hippocampus receives input from the entorhinal cortex and sends projections to multiple targets in the brain. Its main outputs are to the subiculum, which projects to regions like the nucleus accumbens, amygdala, and medial prefrontal cortex. The hippocampus plays an important role in memory formation and spatial navigation.
The fourth paradigm: data intensive scientific discovery - Jisc Digifest 2016 (Jisc)
There is broad recognition within the scientific community that the emerging data deluge will fundamentally alter disciplines in areas throughout academic research. A wide variety of researchers - from scientists and engineers to social scientists and humanities researchers - will require tools, technologies, and platforms that seamlessly integrate into standard scientific methodologies and processes.
'The fourth paradigm' refers to the data management techniques and the computational systems needed to manipulate, visualize, and manage large amounts of research data. This talk will illustrate the challenges researchers will face, the opportunities these changes will afford, and the resulting implications for data-intensive researchers.
In addition, the talk will review the global movement towards open access, research repositories and open science and the importance of curation of digital data. The talk concludes with some comments on the research requirements for campus e-infrastructure and the end-to-end performance of the network.
A description of software as infrastructure at NSF, and how Apache projects may be similar. What lessons can be shared from one organization to the other? How does science software compare with more general software?
This document provides an introduction to big data, including:
- Big data is characterized by its volume, velocity, and variety, which makes it difficult to process using traditional databases and requires new technologies.
- Technologies like Hadoop, MongoDB, and cloud platforms from Google and Amazon can provide scalable storage and processing of big data.
- Examples of how big data is used include analyzing social media and search data to gain insights, enabling personalized experiences and targeted advertising.
- As data volumes continue growing exponentially from sources like sensors, simulations, and digital media, new tools and approaches are needed to effectively analyze and make sense of "big data".
Biological databases store and organize large amounts of biological data for research use. There are many types of biological databases that classify data by type, such as nucleotide sequences, protein sequences, genomes, protein structures, gene expression, and metabolic pathways. Databases can also be classified by their data source as primary databases containing experimental results or secondary databases that analyze primary database results. Database availability varies, with some publicly open and others proprietary. Common biological databases discussed include GenBank, UniProt, PDB, KEGG, and FlyBase.
This document discusses leveraging graph data structures to analyze variant data and related annotations from large genomic datasets in a scalable way. An in-memory graph database was used to model variants, annotations, and their relationships. Simple queries on the graph performed as well or better than a relational database. More complex queries and analysis, like spectral clustering of populations, were also possible with the graph model and helped identify patterns not feasible with relational approaches. The results indicate graph databases are a powerful tool for precision medicine research by enabling both known and novel analysis of large genomic datasets.
Meeting Federal Research Requirements for Data Management Plans, Public Acces... (ICPSR)
These slides cover evolving federal research requirements for sharing scientific data. Provided are updates on federal agency responses to the 2013 OSTP memo, guidance on data management plans, resources for data management and curation training for staff/researchers, and tips for evaluating public data-sharing services. ICPSR's public data-sharing service, openICPSR, is also presented. Recording of this presentation is here: https://www.youtube.com/watch?v=2_erMkASSv4&feature=youtu.be
Data and Donuts: How to write a data management plan (C. Tobin Magle)
This presentation describes best practices for how to write a data management plan for your research data. Additionally, it provides information about finding funder requirements, metadata standards, and repositories.
Data and donuts: how to write a data management plan (C. Tobin Magle)
This document provides guidance on how to write a data management plan (DMP). It discusses what a DMP is, why researchers should care about data management, and where data management fits into the research cycle. It also covers the key components of a successful DMP, including a data inventory, a strategy for describing the data, a plan for long-term data preservation, and methods for making the data accessible. The document provides examples and exercises to help researchers develop the sections of a DMP for their own research projects.
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove... (Spark Summit)
This document describes a project at Novartis to use Apache Spark for high-dimensional data analysis from drug screening. Large datasets from various screening technologies were analyzed using Spark pipelines for quality control, normalization, and classification. Visualizations were built using WebGL. The goals were to speed up multi-day batch jobs, create a unified analysis workflow, and build an application for scientists. Future work includes elastic infrastructure, supervised learning of cell phenotypes, and contributing methods to open source.
Reproducibility in human cognitive neuroimaging: a community-driven data sha... (Nolan Nichols)
The document summarizes Nolan Nichols' dissertation defense on a community-driven data sharing framework for integrating and interoperating neuroimaging provenance information. His research aimed to enhance the reusability of neuroimaging data and workflows by advancing data exchange standards that incorporate provenance. Through two phases involving multiple collaborations, he extended existing standards and developed neuroimaging data models and web services to compute and discover provenance from brain imaging workflows in order to improve reproducibility in cognitive neuroimaging research.
This document discusses leveraging graph data structures to analyze variant data and related annotations from large genomic datasets. In phase I, simple queries on a graph database had performance speeds better than or equal to a relational database. Complex queries exploring patterns and clusters were also possible. In phase II, spectral clustering of 1000 genomes data identified three main clusters supporting known population genetics patterns, demonstrating the potential of graph databases for mining complex genomic correlations. The results indicate a graph database provides an effective approach for precision cancer research by enabling both known and novel queries on large genomic datasets.
2. • NIF is an initiative of the NIH Blueprint consortium of institutes
– What types of resources (data, tools, materials, services) are available to the neuroscience community?
– How many are there?
– What domains do they cover? What domains do they not cover?
– Where are they?
• Web sites
• Databases
• Literature
• Supplementary material
• PDF files
• Desk drawers
– Who uses them?
– Who creates them?
– How can we find them?
– How can we make them better in the future?
http://neuinfo.org
NIF has been surveying, cataloging, and tracking the neuroscience resource landscape since before 2008.
3. BD2K: Big Data to Knowledge
• BD2K is a trans-NIH initiative established to enable biomedical research as a digital research enterprise, to facilitate discovery and support new knowledge, and to maximize community engagement.
• BD2K aims to develop the new approaches, standards, methods, tools, software, and competencies that will enhance the use of biomedical Big Data by:
– Facilitating broad use of biomedical digital assets by making them discoverable, accessible, and citable
– Conducting research and developing the methods, software, and tools needed to analyze biomedical Big Data
– Enhancing training in the development and use of methods and tools necessary for biomedical Big Data science
– Supporting a data ecosystem that accelerates discovery as part of a digital enterprise
http://bd2k.nih.gov/
5. How do resources get added to the NIF?
NIF Registry:
• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines
• Requires no special skills; manual and semi-automated updates
NIF Data Federation:
• DISCO interop
• Requires some programming skill
• Open Source Brain: < 2 hr
• Automated update via NIF DISCO dashboard
Low barrier to entry; incremental refinement (Marenco et al., 2010; 2014)
7. What resources are available for GRM1?
With the thousands of databases and other information sources available, simple descriptive metadata will not suffice.
8. THE STATE OF RESEARCH RESOURCES: RESOURCE REGISTRY
9. Population, Coverage and Linkage of Resource Registry
[Chart by Anita Bandrowski and Burak Ozyurt: registry growth by year, broken down by resource type: Database, Software Application, Data Analysis Service, Topical Portal, Core Facility, Ontology, Software Resource.]
10. • Automated text mining is used to look for "web page last updated" or copyright dates
– Identified for 570 resources
– 373 were not updated within the last 2 years (65%)
• Manual review of ~200 resources
– 38 not updated within the past 2 years (~20%)
– 8 migrated to new addresses or institutions
– 7 are no longer in service (~3%)
– 3 were deemed no longer appropriate
What happens to these resources? The Registry provides a persistent identifier and metadata record for what once existed but no longer does.
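The date-mining heuristic described above can be sketched as follows. This is an illustrative reconstruction, not NIF's actual pipeline; the regex patterns, function names, and the two-year threshold are assumptions drawn from the slide.

```python
import re
from datetime import datetime

# Hypothetical sketch: scan page text for "last updated ..." or copyright
# dates and flag a resource as stale when the newest year found is more
# than max_age_years old.
DATE_PATTERNS = [
    re.compile(r"last\s+updated[:\s]+.*?(\d{4})", re.IGNORECASE),
    re.compile(r"(?:copyright|\(c\)|©)\s*(?:\d{4}\s*[-–]\s*)?(\d{4})", re.IGNORECASE),
]

def newest_year(page_text):
    """Return the most recent 4-digit year matched by any pattern, or None."""
    years = []
    for pattern in DATE_PATTERNS:
        years += [int(y) for y in pattern.findall(page_text)]
    return max(years) if years else None

def is_stale(page_text, now=None, max_age_years=2):
    """A page with no detectable date is reported as unknown (None)."""
    year = newest_year(page_text)
    if year is None:
        return None
    current = (now or datetime.now()).year
    return (current - year) > max_age_years

print(is_stale("Page last updated: June 2012", now=datetime(2015, 1, 1)))  # True
```

A real pipeline would fetch each registry URL and combine these cues with HTTP headers and sitemap data; the point here is only the pattern-based date extraction.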
11. Keeping content up to date
• New tags come into existence, e.g., Connectome, Tractography, Epigenetics
• New resource types come into existence, e.g., mobile apps
• Resources add new types of content
• Resources change name
• Resources change scope
• > 7,000 updates to the registry last year
It's a challenge to keep the registry up to date; sitemaps, curation, ontologies, community review.
13. NIF data federation
NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources, providing deep query of the contents and unified views.
250 sources; > 800 M records
14. What do you mean by data?
Databases come in many shapes and sizes:
• Primary data: data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data: data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas), brain connectivity statements (BAMS)
• Tertiary data: claims and assertions about the meaning of data, e.g., gene upregulation/downregulation, brain activation as a function of task
• Registries: metadata and pointers to data sets or materials stored elsewhere
• Data aggregators: aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source: data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies.
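The taxonomy above lends itself to a simple data model. A minimal sketch; `DataLevel`, `SourceKind`, and `Resource` are invented names for illustration, not NIF's actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum

class DataLevel(Enum):
    PRIMARY = "primary"      # raw data available for reanalysis (e.g., GEO arrays)
    SECONDARY = "secondary"  # extracted features (e.g., brain structure volumes)
    TERTIARY = "tertiary"    # claims/assertions about what the data mean

class SourceKind(Enum):
    REGISTRY = "registry"        # metadata + pointers to data held elsewhere
    AGGREGATOR = "aggregator"    # same data type pooled from many sources
    SINGLE_SOURCE = "single"     # data acquired within one context

@dataclass
class Resource:
    name: str
    kind: SourceKind
    levels: set = field(default_factory=set)

# A single-source resource can still expose more than one data level.
geo = Resource("GEO", SourceKind.SINGLE_SOURCE, {DataLevel.PRIMARY})
print(DataLevel.PRIMARY in geo.levels)  # True
```

Tagging each federated source this way is one route to the "what do you mean by data" question: queries can then be scoped to, say, primary data only.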
16. NIF Information Framework: Query and alignment
• NIFSTD: an aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
• Available as services through NIF and BioPortal
[Diagram: NIFSTD modules: Organism, Nervous System Function, Molecule (Macromolecule, Gene, Molecule Descriptors), Investigation (Techniques, Reagent, Protocols, Resource, Instrument), Subcellular Structure, Cell, Anatomical Structure, Dysfunction, Quality.]
NIF uses ontologies to enhance search and discovery but is not constrained by them.
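One way an ontology can enhance search without constraining it is query expansion: the user's term is broadened with synonyms and subclasses, while plain keyword matching still works for anything the ontology misses. A toy sketch with an invented two-entry vocabulary, not NIFSTD itself:

```python
# Illustrative vocabulary only; a real system would pull these relations
# from ontology services such as NIFSTD via NIF or BioPortal.
SYNONYMS = {"cerebellum": {"cerebellar cortex"}}
SUBCLASSES = {"neuron": {"purkinje cell", "granule cell"}}

def expand(term):
    """Broaden a query term with its synonyms and subclasses."""
    terms = {term}
    terms |= SYNONYMS.get(term, set())
    terms |= SUBCLASSES.get(term, set())
    return terms

def search(records, term):
    wanted = expand(term)
    # A record matches if it mentions the term itself OR any expansion of it.
    return [r for r in records if any(w in r.lower() for w in wanted)]

records = ["Purkinje cell firing rates", "Cortical thickness maps"]
print(search(records, "neuron"))  # ['Purkinje cell firing rates']
```

The record about Purkinje cells is found even though it never says "neuron", yet records using vocabulary outside the ontology remain reachable by ordinary keyword search.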
18. Current challenge: With so much available, how do I find what I need?
• "What genes are upregulated by chronic morphine?" It depends.
• Most often, use cases require connecting a researcher to relevant data sets and appropriate tools; depending upon the data and tools, the answers may differ.
• Many databases have tool bases and workflows that they support; much value has been added to individual data sets.
19. Facets and filters: Progressive refinement of search
[Diagram: a query for "Addiction" is narrowed step by step using facets and filters: Source (Registry, Data, Literature), Category (Gene, Expression), and Index (Gemma, GEO, Integrated), with further filters such as Gene, Organism, and Expression level.]
More effective to start with a general query and use the navigation to refine search.
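Progressive refinement of this kind can be sketched as conjunctive filtering over faceted records. The records and facet names below are hypothetical, not NIF's API:

```python
# Toy faceted index: each record carries facet values plus free text.
records = [
    {"source": "Gemma", "category": "Gene", "organism": "mouse", "text": "addiction study"},
    {"source": "GEO", "category": "Expression", "organism": "rat", "text": "addiction dataset"},
    {"source": "Registry", "category": "Tool", "organism": None, "text": "imaging pipeline"},
]

def query(records, keyword):
    """Broad entry point: plain keyword match over free text."""
    return [r for r in records if keyword in r["text"]]

def refine(results, **facets):
    """Each keyword argument is a facet; apply them conjunctively."""
    for key, value in facets.items():
        results = [r for r in results if r.get(key) == value]
    return results

hits = query(records, "addiction")               # broad query: 2 hits
mouse_genes = refine(hits, category="Gene", organism="mouse")
print([r["source"] for r in mouse_genes])        # ['Gemma']
```

Starting broad and narrowing interactively, as the slide recommends, maps directly onto this query-then-refine shape.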
20. Concept Mapper: Alignment and weighting
Find: gene cerebellum = find all sources with a column mapped to "gene" that also contain the keyword "cerebellum"; Find: gene Anatomy:cerebellum
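One plausible reading of that query semantics, sketched with invented sources rather than NIF's actual Concept Mapper implementation:

```python
# Toy federation: each source declares which concept each column is mapped
# to, plus some rows of data. Invented contents for illustration.
sources = {
    "Gemma": {"columns": {"symbol": "gene", "level": "expression"},
              "rows": [{"symbol": "GRM1", "level": "cerebellum high"}]},
    "BAMS": {"columns": {"region": "anatomy"},
             "rows": [{"region": "cerebellum"}]},
}

def find(concept, keyword):
    """Sources with a column mapped to `concept` that also contain `keyword`."""
    hits = []
    for name, src in sources.items():
        has_concept = concept in src["columns"].values()
        has_keyword = any(keyword in str(v).lower()
                          for row in src["rows"] for v in row.values())
        if has_concept and has_keyword:
            hits.append(name)
    return hits

print(find("gene", "cerebellum"))  # ['Gemma']
```

BAMS mentions "cerebellum" but has no gene-mapped column, so it is excluded; this is the alignment step that column-to-concept mappings make possible.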
22. Query across Registry and Federation
• Registry and Federation were treated separately, even though the Federation comprises views of Registry entries
• Experimenting with a new combined index
23. SciCrunch: A "social network" for resources
• NIF is a general search engine across all of neuroscience
– Very powerful for discovery and general browsing
– Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
– Specialized for their domain
– Restrict the particular sources
– Organize the data according to their needs
– Use their own branding
• How do we create a system that satisfies community needs without creating another silo?
29. Making use of community
[Diagram: the same facet-and-filter refinement as the earlier search example, scoped to a community: Source (Community resources, SciCrunch data (all), Literature), Category (Gene, Expression), Index (Gemma, GEO, Integrated), with filters such as Gene, Organism, and Expression level.]
Brings the expertise of the community to understanding how to work with data.
33. Adult mouse brain connectivity matrix: revenge of the midbrain
SW Oh et al. Nature 000, 1-8 (2014) doi:10.1038/nature13186
34. The tale of the tail
"Human neuroimaging typically is performed on a whole brain basis. However, for several reasons tail of the caudate activity can easily be missed.
• One reason is limitations in the normalization algorithms, that typically are optimized to maximize accuracy for cortical rather than subcortical structures. ...
• A second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body, and completely exclude the tail...
• A final reason is that the tail of the caudate is close to the hippocampus, and could be misidentified as such especially in tasks involving learning and memory.
Therefore, the tail of the caudate may be recruited in additional cognitive tasks, but yet not have been properly identified and reported in the neuroimaging literature."
Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.
35. Importance of comprehensive indices: For how many proteins are there antibodies?
[Chart by Trish Whetzel and Anita Bandrowski: human protein-coding genes (Entrez Gene) binned by number of search results from antibodyregistry.org: 0, 1-10, 11-100, 101-1000, 1001+.]
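The binning behind a chart like this is straightforward; the gene symbols and counts below are made up for illustration, not real antibodyregistry.org results:

```python
from collections import Counter

# Bin edges matching the chart's categories: 0, 1-10, 11-100, 101-1000, 1001+.
BINS = [(0, 0, "0"), (1, 10, "1-10"), (11, 100, "11-100"),
        (101, 1000, "101-1000"), (1001, float("inf"), "1001+")]

def bin_label(n):
    """Map an antibody search-result count to its chart bin."""
    for lo, hi, label in BINS:
        if lo <= n <= hi:
            return label
    raise ValueError(n)

# Invented counts: gene symbol -> number of antibody search results.
antibody_counts = {"GRM1": 250, "OPRM1": 40, "NOVELGENE1": 0}
histogram = Counter(bin_label(n) for n in antibody_counts.values())
print(dict(histogram))  # {'101-1000': 1, '11-100': 1, '0': 1}
```

Run over the full Entrez Gene list against a comprehensive antibody index, this is exactly the tally the slide's chart displays, including the genes with zero antibodies that only a comprehensive index can reveal.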
38. The scourge of neuroanatomical nomenclature
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
– Brain Architecture Management System (rodent)
– Temporal lobe.com (rodent)
– Connectome Wiki (human)
– Brain Maps (various)
– CoCoMac (primate cortex)
– UCLA Multimodal database (human fMRI)
– Avian Brain Connectivity Database (bird)
• Total: 1,800 unique brain terms (excluding avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st-order partonomy matches: 385
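Tallies like the exact-term and synonym counts above come from comparing source vocabularies pairwise. A toy sketch with invented two-source vocabularies, not the seven real databases:

```python
from itertools import combinations

# Illustrative vocabularies and synonym table only.
vocab = {
    "BAMS": {"prefrontal cortex", "caudoputamen"},
    "BrainMaps": {"prefrontal cortex", "caudate-putamen"},
}
synonyms = {"caudoputamen": {"caudate-putamen"}}

def exact_matches(vocab):
    """Terms spelled identically in at least two sources."""
    shared = set()
    for a, b in combinations(vocab.values(), 2):
        shared |= a & b
    return shared

def synonym_matches(vocab, synonyms):
    """Terms recovered only via the synonym table."""
    found = set()
    for a, b in combinations(vocab.values(), 2):
        for term in a:
            if synonyms.get(term, set()) & b:
                found.add(term)
    return found

print(exact_matches(vocab))              # {'prefrontal cortex'}
print(synonym_matches(vocab, synonyms))  # {'caudoputamen'}
```

The gap between 42 exact matches and 1,800 terms on the slide shows how little aligns without synonym and partonomy reasoning, which is what the ontology layer supplies.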
39. 6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-
40. How many neuron types are there?
NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain
"The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia) but it is currently unknown exactly how many classes exist."
41. Transition Zones: Neurons and their properties
• Location of cell soma
• Location of dendrites
• Location of local axon arbor
42. Analysis of Red Links in the Neuron Registry
• INCF Project: Neuron Registry
– Neurolex.org (Semantic MediaWiki)
– > 30 experts worldwide fill out neuron pages in the Neurolex Wiki
[Chart: number of red links (total, easy fixes, hard fixes) for soma location, dendrite location, and axon location; counts range from 0 to 300.]
Social networks and community sites let us learn from the collective behavior of contributors; they show the limits of our knowledge and of our knowledge representations.
43. SciCrunch: Creating a Data and Resource Discovery Environment
[Diagram: Search and Discovery spans three layers: Domain Knowledge (ontologies, atlases/maps, annotation, claims and assertions, registries), Derived Data (models and simulations, analyses), and Data (databases, data sets, literature).]
Cannot try to shoe-horn everything into a single representation or system; instead, figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.
44. BD2K: Creating a Data Discovery Index
• bioCADDIE: the Biomedical and Health Care Data Discovery and Indexing Engine center
– Dr. Lucila Ohno-Machado, PI
– FORCE11: community engagement piece
• What should a data discovery index do?
– Task forces
– Pilot projects
• How should it be built?
http://biocaddie.org
45. NIF team (past and present)
Jeff Grethe, UCSD, Co Investigator, Interim PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer
(retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11
46. BD2K-K2BD: Data Discovery Index
• Accounting of what is available
– Comprehensive resource registry
– UPCs for research resources
• Information framework
– Major concepts contained in data, but also an accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
– Share expertise
– Metrics of trust
– Shared curation and upkeep
• Two-way validation of knowledge to data
47. Registry vs. Federation: metadata about a resource vs. metadata/data in the database
With thousands of databases and other information sources available, simple descriptive metadata will not suffice.
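The registry/federation distinction can be made concrete: a registry entry describes a resource from the outside, while a federated source exposes the records inside it. The resource name, fields, and records below are illustrative, not NIF's actual schema.

```python
# Registry: descriptive metadata *about* a resource (all values invented).
registry_entry = {
    "name": "ExampleNeuronDB",
    "url": "https://example.org/neurondb",
    "description": "Curated database of neuron properties",
}

# Federation: record-level data that a search can actually match against.
federated_records = [
    {"neuron": "Purkinje cell", "property": "firing rate", "value": 40, "unit": "Hz"},
    {"neuron": "Basket cell", "property": "firing rate", "value": 10, "unit": "Hz"},
]

def search(term):
    """Registry search matches only the description; federated search matches data."""
    in_registry = term.lower() in registry_entry["description"].lower()
    in_federation = [r for r in federated_records
                     if any(term.lower() in str(v).lower() for v in r.values())]
    return in_registry, in_federation

hit_reg, hit_fed = search("Purkinje")
```

A query for "Purkinje" misses the registry entry entirely but finds the federated record, which is why descriptive metadata alone does not suffice.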
48. What have we learned: Grabbing the long tail of small data
• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data "flows" from one resource to the next
  – Data is reinterpreted, reanalyzed, or added to
• It is currently very difficult to track data as it moves across the landscape
  – This makes it difficult to learn from combined efforts
49.
50. Working with and extending ontologies: Neurolex.org
http://neurolex.org (Larson et al., Frontiers in Neuroinformatics, in press)
• Semantic MediaWiki
• Provides a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based:
  • Anyone can contribute their terms, concepts, things
  • Anyone can edit
  • Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
Demo D03
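The "sets of triples" model behind Neurolex can be sketched in a few lines: a knowledge base is just a set of (subject, predicate, object) statements plus pattern matching. The neuron names and properties below are illustrative examples, not authoritative Neurolex entries.

```python
# Lightweight semantics: the knowledge base as a set of
# (subject, predicate, object) triples. Entries are illustrative.
triples = {
    ("PurkinjeCell", "is_a", "Neuron"),
    ("PurkinjeCell", "has_soma_location", "CerebellarCortex"),
    ("PurkinjeCell", "has_neurotransmitter", "GABA"),
    ("BasketCell", "is_a", "Neuron"),
    ("BasketCell", "has_soma_location", "CerebellarCortex"),
}

def query(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(t for t in triples
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))

# Which cells have their soma in cerebellar cortex?
cells = [s for s, _, _ in query(p="has_soma_location", o="CerebellarCortex")]
```

Because every statement has the same three-part shape, anyone can add a term, edit a property, or link two concepts without touching a fixed schema — which is what makes the wiki model work for community curation.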
51. Neuron Lexicon: Gauging the state of knowledge in neuroscience
• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with the Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro
54. Same data: different analysis
• Gemma: Gene ID + gene symbol
• DRG: gene name + probe ID
• Gemma presented results relative to a chronic morphine baseline; DRG relative to saline, so the direction of change is opposite in the two databases
Chronic vs. acute morphine in striatum
• Analysis:
  • 1370 statements from Gemma regarding gene expression as a function of chronic morphine
  • 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
  • Results for 1 gene were opposite in DRG and Gemma
  • 45 did not have enough information provided in the paper to make a judgment
NIF is working to make it easier to find where data has gone and what has been done with it
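The consistency check described above hinges on one subtlety: because the two databases report against opposite baselines, a statement's direction must be flipped before comparison. A minimal sketch, with invented gene symbols and directions standing in for the real Gemma/DRG statements:

```python
# Hypothetical per-gene direction-of-change statements.
# Gemma reports relative to a chronic-morphine baseline; DRG reports
# relative to saline, so a consistent pair has *opposite* directions.
gemma = {"Oprm1": "up", "Fos": "down", "Arc": "up"}
drg   = {"Oprm1": "down", "Fos": "up", "Arc": "up"}

def flip(direction):
    """Re-express a direction against the opposite baseline."""
    return {"up": "down", "down": "up"}[direction]

# A Gemma statement is confirmed by DRG when DRG reports the
# baseline-flipped direction for the same gene.
consistent = [g for g in gemma if g in drg and drg[g] == flip(gemma[g])]
```

Here `Arc` reads "up" in both databases, which after baseline normalization is actually a disagreement — the kind of mismatch that naive record matching would count as a confirmation.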
55. How many do we use?
These resources themselves need to be citable
56. Resource Identification Initiative: Linking resources to literature
• Have authors supply appropriate identifiers for key resources used within a study such that they are:
  – Machine processable (i.e., a unique identifier that resolves to a single resource)
  – Outside of the paywall
  – Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for
  – Software/databases
  – Antibodies
  – Genetically modified organisms
Launched February 2014: > 30 journals participating
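"Machine processable" and "uniform across journals" mean a single pattern can pull every resource identifier out of a methods section. A sketch of that extraction; the sentence and the specific identifiers below are illustrative, though the `RRID:` prefix style follows the published convention:

```python
import re

# Invented methods-section sentence with illustrative RRIDs
# (AB_ prefix for antibodies, SCR_ for software, per the RRID convention).
methods = ("Sections were stained with anti-GFAP (RRID:AB_306827) and "
           "analyzed in ImageJ (RRID:SCR_003070).")

# Because the identifier scheme is uniform, one regular expression
# recovers every resource citation regardless of journal or publisher.
rrids = re.findall(r"RRID:[A-Z]+_[A-Za-z0-9_:-]+", methods)
```

The same pattern works outside the paywall on abstracts and full text alike, which is what lets downstream tools count which studies used a given resource.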
57. What studies have used...?
• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to the community
• > 650 RRIDs
  • ~10% disappeared after copyediting
  • 5% were in error
Database available at: https://www.force11.org/node/5635
58. Neurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base
NIF Cell Graph
This is your brain on
computers