How do we know what we don't know? Exploring the data and knowledge space through the Neuroscience Information Framework

Maryann E. Martone, Ph. D.
University of California, San Diego
Building Analytics for Integrated Neuroscience Data
Ontario Brain Institute May 28-29, 2014

We say this to each other all the
time, but we set up systems for
scholarly advancement and
communication that are the
antithesis of integration
Whole brain data
(20 µm
microscopic MRI)
Mosaic LM
images (1 GB+)
Conventional LM
images
Individual cell
morphologies
EM volumes &
reconstructions
Solved molecular
structures
No single technology serves
these all equally well.
Multiple data types;
multiple scales; multiple
databases
A data integration problem

• NIF is an initiative of the NIH Blueprint consortium of institutes
– What types of resources (data, tools, materials, services) are available to the
neuroscience community?
– How many are there?
– What domains do they cover? What domains do they not cover?
– Where are they?
• Web sites
• Databases
• Literature
• Supplementary material
• PDF files
• Desk drawers
– Who uses them?
– Who creates them?
– How can we find them?
– How can we make them better in the future?
http://neuinfo.org
NIF has been surveying, cataloging and tracking the neuroscience resource landscape since < 2008

Old Model: Single type of content; single
mode of distribution
Scholar
Library
Scholar
Publisher
Systems for cataloging, metadata standards, and citation in
place

Scholar
Consumer
Libraries
Data Repositories
Code Repositories
Community
databases/platforms
OA
Curators
Social
Networks
Peer Reviewers
Narrative
Workflows
Data
Models
Multimedia
Nanopublications
Code

The duality of modern scholarship
Observation: Those who build information systems from the
machine side don’t understand the requirements of
humans very well.
Those who build information systems from the human side
don’t understand the requirements of machines very well.
Scholarship requires the ability to cite and track usage of
scholarly artifacts. In our current mode of working, there is no
way to track artifacts as they move through the ecosystem; no
way to incrementally add human expertise

NIF: A New Type of Entity for New Modes of
Scientific Dissemination
• NIF’s mission is to maximize the awareness of, access to
and utility of research resources produced worldwide to
enable better science and promote efficient use
– NIF unites neuroscience information without respect to domain,
funding agency, institute or community
– NIF is like a “PubMed” for all biomedical resources and a
“PubMed Central” for databases
– Makes them searchable from a single interface
– Practical and cost-effective; tries to be sensible
– Learned a lot about effective data sharing
The Neuroscience Information Framework provides a rich data
source for understanding the current resource landscape

But we have Google!
• Current web is
designed to share
documents
– Documents are
unstructured data
• Much of the content
of digital resources is
part of the “hidden
web”
• Wikipedia: The Deep Web
(also called Deepnet, the
invisible Web, DarkNet,
Undernet or the hidden
Web) refers to World
Wide Web content that is
not part of the Surface
Web, which is indexed by
standard search engines.

Surveying the resource
landscape
~3000 databases
and datasets

Populate broadly and quickly with minimum
overhead to resource providers
•NIF curators
•Nomination by the community
•Semi-automated text mining
pipelines
NIF Registry
Requires no special skills
Site map available for
local hosting
•NIF Data Federation
•DISCO interop (Yale)
•Requires some
programming skill
•But designed for quick
ingestion
Bandrowski et al., Database, 2012

Data Federation: Deep search
http://neuinfo.org
With the thousands of databases and other information sources
available, simple descriptive metadata will not suffice
Subthalamus

Data about the subthalamus
http://neuinfo.org

NIF unifies look, feel and access

What do you mean by data?
Databases come in many shapes and sizes
• Primary data:
– Data available for reanalysis, e.g.,
microarray data sets from GEO;
brain images from XNAT;
microscopic images (CCDB/CIL)
• Secondary data
– Data features extracted through
data processing and sometimes
normalization, e.g., brain structure
volumes (IBVD), gene expression
levels (Allen Brain Atlas); brain
connectivity statements (BAMS)
• Tertiary data
– Claims and assertions about the
meaning of data
• E.g., gene
upregulation/downregulation,
brain activation as a function of
task
• Registries:
– Metadata
– Pointers to data sets or
materials stored elsewhere
• Data aggregators
– Aggregate data of the same
type from multiple sources,
e.g., Cell Image Library,
SUMSdb, Brede
• Single source
– Data acquired within a single
context, e.g., Allen Brain Atlas
Researchers are producing a variety of
information artifacts using a multitude of
technologies; many duplicate effort and
content

Data Federation Growth
[Figure: number of federated databases (axis 0 to 250) and number of federated records in millions (log axis 0.01 to 1000), Jun-08 to May-13]
NIF searches the largest collation of
neuroscience-relevant data on the web
DISCO

Purkinje
Cell
Axon
Terminal
Axon
Dendritic
Tree
Dendritic
Spine
Dendrite
Cell body
Cerebellar
cortex
Bringing knowledge to data: Ontologies as framework
There is little obvious connection between
data sets taken at different scales using
different microscopies without an explicit
representation of the biological objects that
the data represent

NIF Semantic Framework: NIFSTD ontology
• NIF uses ontologies to help navigate across and unify neuroscience
resources
• Ontologies are built from community ontologies → cross integration with
other domains
NIFSTD
Organism
NS Function
Molecule
Investigation
Subcellular
structure
Macromolecule Gene
Molecule Descriptors
Techniques
Reagent Protocols
Cell
Resource Instrument
Dysfunction Quality
Anatomical
Structure
NIF Ontologies provide standards for integration of diverse data;
available through NIF vocabulary services

NIF links neuroscience to other domains via
community ontologies
• NIF Subcellular = Gene Ontology Cell Component
• NIF Anatomy = UBERON cross-species ontology
(Includes FMA and Neuronames)
• NIF Disease = Disease Ontology
• NIF Organism = NCBI Taxonomy
• NIF Molecule = Chemical Entities of Biological Interest
(ChEBI); Protein Ontology
• NIF Cell/Investigation/Function = Developed largely
by neuroscience community
Use of ontology identifiers within data sources creates linkage across databases and
across domains; the more they are used, the better they become
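
The linkage described above can be sketched in a few lines of Python. This is only an illustration, assuming two hypothetical data sources whose records carry a shared ontology identifier; the record layouts, the placeholder identifiers (ONT:0001 and so on), the values, and the function name link_by_identifier are all invented for the example and do not describe NIF's actual implementation.

```python
from collections import defaultdict

# Hypothetical records from two sources. The "term_id" values are made-up
# placeholders standing in for shared ontology identifiers; the other
# fields are illustrative only.
source_a = [
    {"term_id": "ONT:0001", "label": "Hippocampus", "volume_mm3": 3500},
    {"term_id": "ONT:0002", "label": "Striatum", "volume_mm3": 9800},
]
source_b = [
    {"term_id": "ONT:0001", "name": "hippocampal formation", "gene": "GeneX"},
    {"term_id": "ONT:0003", "name": "olfactory bulb", "gene": "GeneY"},
]


def link_by_identifier(*sources):
    """Group records from all sources under the identifier they share."""
    linked = defaultdict(list)
    for source in sources:
        for record in source:
            linked[record["term_id"]].append(record)
    return dict(linked)


linked = link_by_identifier(source_a, source_b)
# Identifiers that appear in more than one record link the two sources,
# even though the human-readable labels differ.
shared = sorted(tid for tid, recs in linked.items() if len(recs) > 1)
print(shared)
```

The two sources use different labels for the same structure, so a string comparison would miss the link; the identifier is what joins them.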

Neurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base
NIF Cell Graph
This is your brain on computers

Concept-based search: Query by meaning
NIF provides formal definitions of many neuroscience terms
= brain region without a blood brain
barrier

Ontologies as a data integration framework
•NIF Connectivity: 7 databases containing connectivity primary data or claims
from literature on connectivity between brain regions
•Brain Architecture Management System (rodent)
•Temporal-Lobe.com (rodent)
•Connectome Wiki (human)
•Brain Maps (various)
•CoCoMac (primate cortex)
•UCLA Multimodal database (Human fMRI)
•Avian Brain Connectivity Database (Bird)
•Total: 1800 unique brain terms (excluding Avian)
•Number of exact terms used in > 1 database: 42
•Number of synonym matches: 99
•Number of 1st order partonomy matches: 385
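
The three kinds of match counted above (exact term, synonym, first-order partonomy) can be illustrated with a small sketch. It assumes a toy ontology held in two dictionaries; the terms, the SYNONYMS and PART_OF tables, and the function names are hypothetical and are not taken from NIFSTD.

```python
# Hypothetical mini-ontology: preferred label -> synonyms, and
# child -> direct (first-order) part-of parent.
SYNONYMS = {
    "caudate nucleus": {"caudate"},
    "hippocampal formation": {"hippocampus"},
}
PART_OF = {
    "dentate gyrus": "hippocampal formation",
    "caudate nucleus": "striatum",
}

# Reverse lookup from any synonym to its preferred label.
_PREFERRED = {syn: pref for pref, syns in SYNONYMS.items() for syn in syns}


def normalize(term):
    return term.strip().lower()


def canonical(term):
    """Map a normalized term to its preferred label, if it is a known synonym."""
    return _PREFERRED.get(term, term)


def match_type(term_a, term_b):
    """Classify how two brain-region terms relate, or return None."""
    a, b = normalize(term_a), normalize(term_b)
    if a == b:
        return "exact"
    ca, cb = canonical(a), canonical(b)
    if ca == cb:
        return "synonym"
    if PART_OF.get(ca) == cb or PART_OF.get(cb) == ca:
        return "partonomy"
    return None


print(match_type("Caudate", "caudate nucleus"))
print(match_type("Dentate gyrus", "Hippocampal formation"))
```

Each step relaxes the previous one, which is why the counts on the slide grow from exact matches to synonym matches to partonomy matches.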

Building a knowledge space for
neuroscience: Neurolex.org
http://neurolex.org
•Semantic MediaWiki
•Provide a simple interface
for defining the concepts
required
•Light weight semantics
•Community based:
•Anyone can contribute their
terms, concepts, things
•Anyone can edit
•Anyone can link
•Accessible: searched by Google
•Growing into a significant
knowledge base for
neuroscience
•33,000 concepts
•200,000 edits
•150 contributors
Larson and Martone, Frontiers in Neuroinformatics, 2013

“When I use a word...it means what I choose it
to mean”
Formalization lets us develop
metrics for the precision of the
terms we use

Mapping the known unknowns
Comprehensive ontologies provide an accounting of what we
think we know
Where are the data relative to what we think we know?
Striatum
Hypothalamus
Olfactory bulb
Cerebral cortex
Brain
Brain region
Data source

Open World-Closed World: Mapping the knowledge - data space
[Heat map: brain regions vs. data sources; legend bins: 0, 1-10, 11-100, >101]
NIF lets us ask: where isn’t there data? What isn’t studied? Why?
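
The question "where isn't there data?" only becomes answerable when the list of terms comes from the ontology rather than from the data. A minimal sketch of that idea follows, assuming hypothetical record counts per data source; the source names, the counts, and the function names are invented, and the top bin is written "101+" as an approximation of the legend.

```python
def bin_label(count):
    """Place a record count into a legend-style bin."""
    if count == 0:
        return "0"
    if count <= 10:
        return "1-10"
    if count <= 100:
        return "11-100"
    return "101+"


# Terms come from the ontology (what we think we know), not from the data.
ontology_terms = ["Striatum", "Hypothalamus", "Olfactory bulb", "Cerebral cortex"]

# Hypothetical record counts per data source and term.
records = {
    "source_A": {"Striatum": 250, "Cerebral cortex": 40},
    "source_B": {"Striatum": 3, "Hypothalamus": 12},
}


def coverage(terms, records):
    """Build a term-by-source table of binned counts, zeros included."""
    return {
        term: {src: bin_label(counts.get(term, 0)) for src, counts in records.items()}
        for term in terms
    }


def unstudied(terms, records):
    """Return ontology terms with no records in any source."""
    return [t for t in terms if all(c.get(t, 0) == 0 for c in records.values())]


table = coverage(ontology_terms, records)
gaps = unstudied(ontology_terms, records)
print(gaps)
```

A term such as "Olfactory bulb" never appears in the hypothetical data, so it would be invisible in a data-driven listing; enumerating the ontology is what exposes the gap.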

Open World-Closed World: Mapping the knowledge - data space
[Heat map: Forebrain, Midbrain, and Hindbrain regions vs. data sources; legend bins: 0, 1-10, 11-100, >101]
Junk brain regions?

SW Oh et al. Nature 000, 1-8 (2014) doi:10.1038/nature13186
Adult mouse brain connectivity matrix: revenge of the
midbrain

The tale of the tail
“Human neuroimaging typically is performed on a whole brain basis.
However, for several reasons tail of the caudate activity can easily be missed.
•One reason is limitations in the normalization algorithms, that typically are
optimized to maximize accuracy for cortical rather than subcortical
structures. ...
•A second reason is that standard neuroimaging atlases such as the Harvard-
Oxford structural atlas used with neuroimaging analysis programs such as
FreeSurfer truncate the caudate at the body, and completely exclude the
tail...
•A final reason is that the tail of the caudate is close to the hippocampus, and
could be misidentified as such especially in tasks involving learning and
memory.
Therefore, the tail of the caudate may be recruited in additional cognitive
tasks, but yet not have been properly identified and reported in the
neuroimaging literature”
Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front
Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.

fMRI Cerebellum
When results contradict a current theory, they may be ignored

“The Data Homunculus”
Funding drives representation in the data space

NIF Reports: Male
vs Female circa 2012
Gender bias
When data is not
made available, the
data space is an
incomplete record
of what is available

How much information makes it into
the data space?
∞
What is easily machine
processable and accessible
What is potentially knowable
What is known:
Literature, images, human
knowledge
Unstructured; Natural
language processing,
entity recognition,
image processing and
analysis; paywalls; file
drawers
Abstracts vs full
text vs tables etc
Estimates that > 50% of scientific output is not recovered
Chan et al. Lancet, 383, 2014

Data sharing in the long tail of neurosciences

A place for my data
NIF lists over 350 data repositories, i.e., resources that accept data
contributions from the community

“Empty Archives”
Repository | Type of data | Date started | Host | Public data | Comments
CARMEN | neuroscience / electrophysiology | 2008 | Newcastle University; United Kingdom | 100 | Requires account
INCF Dataspace | various | 2012 | International Neuroinformatics Coordinating Facility | ? |
Open Source Brain | models | 2014 | University College London | 47 | Cells and Networks; 23 (Technology -showcases)
XNAT Central | Neuroimaging | 2010 | Washington University School of Medicine in St. Louis; Missouri; USA | 34 | States 370 projects, 3804 subjects, and 5172 imaging sessions. 123 were visible but do not all appear to be public. 34 public data were listed under “Recent”
Open Connectome | Serial electron Microscopy and Magnetic Resonance (graphs) | 2011 | Johns Hopkins University; Maryland; USA | 9 | 9, 7 - image projects; 19 - graphs
UCSF DataShare | biomedical including neuroimaging, MRI, cognitive impairment, dementia, aging | 2011 | University of California at San Francisco; California; USA | 15 |
BrainLiner | various functional data | 2011 | ATR; Kyoto; Japan | 10 |
ModelDB | neuron models | 1996 | Yale University; Connecticut; USA | 875 |
NeuroMorpho | digitally reconstructed neurons | 2006 | George Mason University; Virginia; USA | 10004 |
Cell Image Library/Cell Centered Database | images, videos, and animations of cell | 2002 CCDB; 2010 CIL | American Society for Cell Biology / University of California at San Diego; California; USA | 10,360 | The CCDB had 450 data sets when it merged with CIL. CIL also contains large imaging data sets that are not counted as separate images
CRCNS | computational neuroscience datasets | 2008 | University of California at Berkeley; California; USA | 38 |
OpenfMRI | fMRI | 2012 | University of Texas at Austin; Texas; USA | 22 |
NeuroMorpho.org = 10,000 neuronal reconstructions from ~200 labs
Cell Image Library = 10,000 image sets from 1500 individuals
“I finally gave NeuroMorpho my data so they would stop

Attitudes towards data sharing
“Pry it from my cold, dead
fingers”
“Done”
“You can have it if you really
want”
•Lack of time and resources
•Lack of incentives
•Fear of being scooped
•Fear of being criticized
•Fear that data will be misused
•Data sharing is a waste of time
Always
Never
Reasons for not making data available
Tenopir, C. et al. Data sharing by scientists: practices and perceptions. PLoS One 6,
e21101, doi:10.1371/journal.pone.0021101 (2011)
Many make data
available via web sites
or via supplementary
material

Multivariate analysis of the SCI syndrome using data from two research sites.
Ferguson AR, Irvine K-A, Gensel JC, Nielson JL, et al. (2013) Derivation of Multivariate Syndromic Outcome Metrics for Consistent Testing across Multiple
Models of Cervical Spinal Cord Injury in Rats. PLoS ONE 8(3): e59712. doi:10.1371/journal.pone.0059712
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0059712

Incentives: New solutions
• New journals
for data,
where focus
is on data not
results
• Data must be
deposited in a
recognized
repository
– Persistent
identifier
assigned
• Standards for
metadata and
data types
Nature Scientific Data

Incentives: Data citations
• Many groups are
developing
guidelines for
creating a system
of citation for data
used in a study
• First step for
providing an
incentive system
for data sharing
• Currently, very
difficult to track
use of data in
articles
http://www.force11.org/datacitation
“Sound, reproducible scholarship rests upon
a foundation of robust, accessible data. Data
should be considered legitimate, citable
products of research. Data citation, like the
citation of other evidence and sources, is
good research practice.”
-Joint Declaration of Data Citation
Principles
Future of Research Communications and e-Scholarship; FORCE11
1. Importance
2. Credit and attribution
3. Evidence
4. Unique Identification
5. Access
6. Persistence
7. Specificity and verifiability
8. Interoperability and
flexibility

Unique IDs for all! Resource Identification
Initiative
• It is currently impossible to
query the biomedical
literature to find out what
research resources have
been used to produce the
results of a study
-authors don’t provide enough
information to
unambiguously identify
key research resources
• Impossible to find all
studies that used a
resource
• Critical for reproducibility
and data mining
• Critical for trouble-
shooting
http://www.force11.org/resource_identification_initiative
Faulty Antibodies Continue to Enter US and
European Markets, Warns Top Clinical
Chemistry Researcher-Genome Web Daily,
October 11, 2013

Resource Identification Initiative
• Have authors supply
appropriate identifiers for
key resources used within
a study such that they
are:
– Machine processible (i.e.,
unique identifier that
resolves to a single
resource)
– Outside of the paywall
– Uniform across journals
and publishers
Launched February 2014: > 30 journals
participating
Anita Bandrowski, Nicole Vasilevsky,
Matthew Brush, Melissa Haendel and
the RINL group

Pilot Project
• Have authors identify 3 different
types of research resources:
– Software tools and databases
– Antibodies
– Genetically modified animals
• Include RRID in methods section
• RRID=RRID:Accession number
– Just a string at this point
• Voluntary for authors
• Journals did not have to modify
their submission system
• Journals have flexibility in
implementation. Send request to
author at:
– Submission
– During review
– After acceptance
Sources: NIF Registry, NIF Antibody Registry, Model Organism Databases
Resource Identification Portal: Aggregates
accession numbers from >10 different
databases that are the authorities for
registering research resources
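
Because an RRID is "just a string" of the form RRID:accession number, it can be recovered from a methods section with a simple pattern. The sketch below is a rough illustration under that assumption; the regular expression, the function name extract_rrids, the sample methods text, and the antibody accession number are hypothetical (only RRID:nif-0000-30467 appears elsewhere in this deck), and real accession formats may need a broader pattern.

```python
import re

# Assumed pattern: the literal prefix "RRID:" followed by an accession
# made of letters, digits, underscores, and hyphens.
RRID_PATTERN = re.compile(r"RRID:([A-Za-z0-9_-]+)")


def extract_rrids(text):
    """Return the accession part of every RRID found in the text, in order."""
    return RRID_PATTERN.findall(text)


# Hypothetical methods text; the second accession number is a placeholder.
methods = (
    "Images were analyzed with a registered software tool "
    "(RRID:nif-0000-30467). Sections were stained with an antibody "
    "(RRID:AB_0000001)."
)

found = extract_rrids(methods)
print(found)
```

A uniform prefix is what makes the question "what studies used X?" a plain text search rather than a curation task.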

First results are in the literature
Google Scholar: Search RRID; select since 2014

What studies used X?
To date:
•30 articles have appeared
•2 articles have disappeared, i.e.,
the RRIDs were removed at
copyediting
•195 RRIDs were reported
•14 were in error = 7.2%
•> 200 antibodies were added
•> 75 software tools/databases
were added
•A resolver service has been
created
•3rd party tools are being created
to provide linkage between
resources and papers
RRID:nif-0000-30467
Authors did not deliberately leave out identifying information; they
just hadn’t thought about it

What have we learned?
Utopia plug-in: Steve Pettifer
•Authors are willing to
adopt new types of
citations and citation
styles; you just have to
ask
•RRID = usage of
research resource
•Ideal: resolved by
search engines without
requiring specialized
citation services
•Citation drives
registration
•Clear role for
repositories as
authorities

Digital objects are a new beast
RRID: Provides foundation for establishing an
alerting service for research resources
Trust: Not just
who produced it
but what
produced it

Community
database:
beginning
Community
database:
End
Register your resource to NIF!
“How do I share my
data/tool?”
“There is no database
for my data”
Institutional
repositories
Cloud
INCF: Global
infrastructure
Government
Education
Industry
NIF provides the “glue” for a functioning ecosystem of data and tools
Tool repositories
Standards
Brokering
Archiving

An ecosystem for research objects: the social network of
research resources
[Diagram: Article, Code, Blogs, Workflows, Data, Portals, and Search engines, linked through Persistent Identifiers]
Unique and persistent identifiers and a system for
referencing them allow an ecosystem to function

Musings from the NIF
• Analytics let us take a global view of data
– By bringing in a knowledge framework, we can look at positive and negative space
• Well-populated data resources are critical to moving analytics forward
– Comprehensive, i.e. they have most of the data that are available
– Much can be learned even from messy data, but reasonable standards help
– Active outreach is required
• Technological barriers to widespread data sharing are diminishing
– Best practices are emerging
– General and focused repositories are available, although sustainability of these is a problem
• There is a lot of neuroscience data available, but a culture of routine data sharing
does not yet exist in neuroscience
– But encouraging signs that it is largely due to lack of time and means, not lack of desire
– It is up to us to change the incentive system to support the best science possible
• Most scientists are not adept at managing or curating their own data
– Role for repositories and data curators
• Pieces of a functioning ecosystem are in place
– Think about how you fit into the ecosystem

NIF team (past and present)
Jeff Grethe, UCSD, Co Investigator, Co-PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Yueling Li, UCSD
Trish Whetzel, UCSD
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Svetlana Sulima
Burak Ozyrt
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer
(retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11
Melissa Haendel, OHSU**
Nicole Vasilevsky
Matthew Brush
**Monarch and
Resource
Identification
Initiative

Creating an on-line knowledge space for
neuroscience

Pages are related through properties
Red Links: Information is missing (or misspelled)

Neurolex Neuron
• Led by Dr. Gordon
Shepherd
• > 30 world wide
experts
• Simple set of
properties
• Consistent naming
scheme
• Integrated with
Structural Lexicon
• Used for annotation in
other resources, e.g.,
NeuroElectro

Location of Cell Soma
Location of dendrites
Location of local axon
arbor

Analysis of Red Links in the Neuron Registry
• INCF Project
– Neuron Registry
– > 30 experts
worldwide
– Fill out neuron
pages in Neurolex
Wiki
– Led by Dr. Gordon
Shepherd
[Bar chart: number of total red links, easy fixes, and hard fixes for soma location, dendrite location, and axon location; y-axis 0 to 300]
Social networks and community sites let us learn things from the
collective behavior of contributors → INCF/HBP Knowledge Space

Structural Lexicon in Neurolex
Brain Region
•Trans-species
•“Stateless”, i.e. no universal defining criteria
•General structures and partonomies based on Neuroanatomy 101
e.g., Hippocampus, Dentate gyrus
Partially overlaps
Brain Parcel
•Species specific
•Specific reference
•Defining criteria
•Sometimes partonomy; sometimes not
e.g., Hippocampus of ABA2009

Is there a framework for neuroscience?
• Of the ~ 4000 columns
that NIF queries,
~1300 map to one of
our core categories:
– Organism
– Anatomical structure
– Cell
– Molecule
– Function
– Dysfunction
– Technique
• 30-50% of NIF’s
queries autocomplete
• When NIF combines
multiple sources, a set
of common fields
emerges
→ Basic information
models/semantic
models exist for
certain types of
entities
Biomedical science does have a conceptual framework

What would a 21st century platform for
scholarship look like?
D
K
Macroinformatics
NIF: Sensors and monitors for the resource ecosystem

Exposing knowledge to the web
Because wiki pages have static URLs, they are searchable by
Google

NIF provides a rich source of information on
digital resources
• Analytics let us take a global view of data
– By bringing in a knowledge framework, we can look at positive and negative space
• Well-populated data resources are critical to moving analytics forward
– Comprehensive, i.e. they have most of the data that are available
– Much can be learned even from messy data, but reasonable standards help
– Active outreach is required
• Technological barriers to widespread data sharing are diminishing
– Best practices are emerging
– General and focused repositories are available, although sustainability of these is a
problem
• There is a lot of neuroscience data available, but a culture of routine data sharing
does not yet exist in neuroscience
– But encouraging signs that it is largely due to lack of time and means, not lack of
agreement
• Most scientists are not adept at managing or curating their own data
– Role for repositories and data curators
• Pieces of a functioning ecosystem are in place; think globally
Not just science, but data policy should be data driven

Same data: different analysis
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic
morphine; DRG with respect to saline, so direction of change is
opposite in the 2 databases
Chronic vs acute morphine in striatum
• Analysis:
•1370 statements from Gemma regarding gene expression as a function of chronic
morphine
•617 were consistent with DRG → over half of the claims of the paper were not
confirmed in this analysis
•Results for 1 gene were opposite in DRG and Gemma
•45 did not have enough information provided in the paper to make a judgment
Relatively simple standards would make it easier to
perform comparisons across the ecosystem
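
One such simple standard is recording the baseline alongside every direction-of-change statement. The sketch below shows why that matters, assuming each statement is a two-condition comparison (so swapping the baseline flips the sign) and that gene identifiers have already been mapped to a common key; the gene names, the directions, and the function names are hypothetical and are not the data from Gemma or DRG.

```python
def to_common_baseline(direction, baseline, target_baseline="saline"):
    """Express a direction of change (+1 up, -1 down) relative to one baseline.

    Assumes a two-condition comparison, so changing the baseline
    simply inverts the sign.
    """
    return direction if baseline == target_baseline else -direction


# Hypothetical statements: gene -> (direction, baseline used by the source).
gemma_like = {
    "GeneA": (+1, "chronic morphine"),
    "GeneB": (-1, "chronic morphine"),
    "GeneC": (+1, "chronic morphine"),
}
drg_like = {
    "GeneA": (-1, "saline"),
    "GeneB": (-1, "saline"),
}


def compare(statements_a, statements_b):
    """Sort genes into consistent, opposite, or missing from the second source."""
    result = {"consistent": [], "opposite": [], "missing": []}
    for gene, (direction, baseline) in sorted(statements_a.items()):
        if gene not in statements_b:
            result["missing"].append(gene)
            continue
        other_direction, other_baseline = statements_b[gene]
        same = to_common_baseline(direction, baseline) == to_common_baseline(
            other_direction, other_baseline
        )
        result["consistent" if same else "opposite"].append(gene)
    return result


result = compare(gemma_like, drg_like)
print(result)
```

Without the baseline field, GeneA would look contradictory across the two sources and GeneB would look consistent, which is the reverse of the actual situation in this toy example.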

Musings from the NIF
• Every resource is resource limited: few have enough time, money, staff or
expertise required to do everything they would like
– If the market can support 11 MRI databases, fine
– Some consolidation, coordination is warranted
– How can industry help support the data space? How can they take them even further?
– Don’t let the data space become fractured
• Big, broad and messy beats small, narrow and neat
– Without trying to integrate a lot of data, we will not know what needs to be done
– Progressive refinement; addition of complexity through layers
• Be flexible and opportunistic: assume all will change
– A single optimal technology/container for all types of scientific data and information does not
exist; technology is changing
• Think globally; act locally:
– No source, not even NIF, is THE source; we are all a source
– System and culture to be able to learn from everything
– Cooperative model for biomedicine
