The real world of ontologies and phenotype representation: perspectives from the Neuroscience Information Framework

Maryann Martone, Ph.D.
University of California, San Diego

“Neural Choreography”
“A grand challenge in neuroscience is to elucidate brain function in relation
to its multiple layers of organization that operate at different spatial and
temporal scales. Central to this effort is tackling “neural choreography” --
the integrated functioning of neurons into brain circuits-- Neural
choreography cannot be understood via a purely reductionist approach.
Rather, it entails the convergent use of analytical and synthetic tools to
gather, analyze and mine information from each level of analysis, and
capture the emergence of new layers of function (or dysfunction) as we
move from studying genes and proteins, to cells, circuits, thought, and
behavior....
However, the neuroscience community is not yet fully engaged in exploiting the
rich array of data currently available, nor is it adequately poised to capitalize
on the forthcoming data explosion.”
Akil et al., Science, Feb 11, 2011

“Data choreography”
 In that same issue of Science
 Science polled its peer reviewers from the previous year about the
availability and use of data
 About half of those polled store their data only in their
laboratories—not an ideal long-term solution.
 Many bemoaned the lack of common metadata and archives as a
main impediment to using and storing data, and most of the
respondents have no funding to support archiving
 And even where accessible, much data in many fields is too poorly
organized to enable it to be efficiently used.
 “...it is a growing challenge to ensure that data produced during the
course of reported research are appropriately described, standardized,
archived, and available to all.” Lead Science editorial (Science 11
February 2011:Vol. 331 no. 6018 p. 649 )

 NIF is an initiative of the NIH Blueprint consortium of institutes
 What types of resources (data, tools, materials, services) are
available to the neuroscience community?
 How many are there?
 What domains do they cover? What domains do they not cover?
 Where are they?
 Web sites
 Databases
 Literature
 Supplementary material
 Who uses them?
 Who creates them?
 How can we find them?
 How can we make them better in the future? http://neuinfo.org
• PDF files
• Desk drawers

In an ideal world...
We’d like to be able to find:
 What is known:
 What is the average diameter of a Purkinje neuron?
 Is GRM1 expressed in cerebral cortex?
 What are the projections of hippocampus?
 What genes have been found to be upregulated in
chronic drug abuse in adults?
 Is alpha synuclein in the striatum?
 What studies used my polyclonal antibody against
GABA in humans?
 What rat strains have been used most extensively in
research during the last 20 years?
 What is not known:
 Connections among data
 Gaps in knowledge
Without some sort of framework, very difficult to
Required components:
– Query interface
– Search strategies
– Data sources
– Infrastructure
– Results display
– Why did I get this
result?
– Analysis tools

The Neuroscience Information Framework: Discovery and
utilization of web-based resources for neuroscience
 A portal for finding and
using neuroscience
resources
 A consistent framework for
describing resources
 Provides simultaneous
search of multiple types of
information, organized by
category
 Supported by an expansive
ontology for neuroscience
 Utilizes advanced
technologies to search the
“hidden web”
http://neuinfo.org
UCSD, Yale, CalTech, George Mason, Washington Univ.
Supported by NIH Blueprint
Literature
Database
Federation
Registry

We need more databases !?
•NIF Registry: A
catalog of
neuroscience-relevant
resources
•> 5000 currently
listed
•> 2000 databases
•And we are finding
more every day

NIF must work with ecosystem as
it is today
 NIF was one of the first projects to attempt data integration in
the neurosciences on a large scale
 NIF is supported by a contract that specified the number of
resources to be added per year
 Designed to be populated rapidly; set up process for progressive refinement
 No budget was allocated to retrofit existing resources; had to work with
them in their current state
 We designed a system that required little to no cooperation or work from
providers
 NIF was required to assemble (not create) ontologies very fast and to provide a
platform through which the community could view, comment and add
 NIF is enriched by ontologies but does not depend on them
 Took advantage of community ontologies
 But needed to take a very pragmatic and aggressive approach to incorporating and using them
 Neurolex semantic wiki

What are the connections of the
hippocampus?
Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”
Query expansion: synonyms and related concepts
Boolean queries
Data sources
categorized by
“data type” and
level of nervous
system
Common views
across multiple
sources
Tutorials for using
full resource when
getting there from
NIF
Link back to
record in
original
source
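The synonym-based query expansion on this slide can be sketched roughly as follows. This is a minimal illustration, not NIF's implementation: the `SYNONYMS` table and the `expand_query` function are hypothetical names, and NIF draws its synonyms from its ontology rather than from a hard-coded dictionary.

```python
# Hypothetical synonym table for illustration only; the entries come from the slide.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def quote(term: str) -> str:
    """Wrap multi-word terms in quotes so a search engine treats them as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(term: str, synonyms=SYNONYMS) -> str:
    """Build a Boolean OR query from a term and any synonyms known for it."""
    alternatives = [term] + synonyms.get(term.lower(), [])
    return " OR ".join(quote(t) for t in alternatives)


print(expand_query("Hippocampus"))
```

A term with no known synonyms passes through unchanged, so the expansion never narrows a search.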

Imminent: NIF 5.0
 NIF 5.0 about
to be released
 New design
 New query
features
 New analytics

What do you mean by data?
Databases come in many shapes and sizes
 Primary data:
 Data available for
reanalysis, e.g., microarray data
sets from GEO; brain images from
XNAT; microscopic images
(CCDB/CIL)
 Secondary data
 Data features extracted through
data processing and sometimes
normalization, e.g., brain structure
volumes (IBVD), gene expression
levels (Allen Brain Atlas); brain
connectivity statements (BAMS)
 Tertiary data
 Claims and assertions about the
meaning of data
 E.g., gene
upregulation/downregulation,
 Registries:
 Metadata
 Pointers to data sets or
materials stored elsewhere
 Data aggregators
 Aggregate data of the same
type from multiple sources,
e.g., Cell Image Library,
SUMSdb, Brede
 Single source
 Data acquired within a single
context , e.g., Allen Brain Atlas
Researchers are producing a variety of
information resources using a multitude of
technologies

Exploration: Where is alpha synuclein?
•Spatially:
•Gene
•Protein
•Subcellular
•Cellular
•Regional
•Organism
•Semantically:
•Gene regulation networks
•Protein pathways
•Cellular local connectivity
•Regional connectivity
•Who is studying it?
•Who is funding its study?
Networks exist across scales; all important in the nervous system

 Set of modular ontologies
 86, 000 + distinct concepts +
synonyms
 Bridge files between modules
 Expressed in OWL-DL language
 Currently supports OWL 2
 Tries to follow OBO community
best practices
 Standardized to the same
upper level ontologies
 e.g., Basic Formal Ontology
(BFO), OBO Relations
Ontology (OBO-RO),
 Imports existing community
ontologies
 e.g., CHEBI, GO, PRO,
DOID, OBI etc.
 Retains identifiers in
most recent additions
but reflects history
Covers major domains of neuroscience:
Organisms, Brain Regions, Cells,
Molecules, Subcellular parts, Diseases,
Nervous system functions, Techniques
NIFSTD Ontologies
Fahim Imam, William Bug

“Search computing”: Query by concept
What genes are upregulated by drugs of abuse in the
adult mouse? (show me the data!)
Morphine
Increased
expression
Adult Mouse
Reasonable standards make it easy to search for and compare results

Diseases of nervous system
New: Data analytics
NIF is in a unique position to answer questions about the neuroscience
ecosystem using new analytics tools
Neurodegenerative
Seizure disorders
Neoplastic disease of nervous system
NIH RePORTER
NIF data federated sources

Results are organized within a common
framework
Connects to
Synapsed with
Synapsed by
Input region
innervates
Axon innervates
Projects to
Cellular contact
Subcellular contact
Source site
Target site
Each resource implements a different, though related model;
systems are complex and difficult to learn, in many cases

The scourge of neuroanatomical nomenclature:
Importance of NIF semantic framework
•NIF Connectivity: 7 databases containing connectivity primary data or claims
from literature on connectivity between brain regions
•Brain Architecture Management System (rodent)
•Temporal-lobe.com (rodent)
•ConnectomeWiki (human)
•Brain Maps (various)
•CoCoMac (primate cortex)
•UCLA Multimodal database (Human fMRI)
•Avian Brain Connectivity Database (Bird)
•Total: 1800 unique brain terms (excluding Avian)
•Number of exact terms used in > 1 database: 42
•Number of synonym matches: 99
•Number of 1st order partonomy matches: 385

Why so many names?
 The brain is perhaps unique among major organ systems in the
multiplicity of naming schemes for its major and minor regions.
 The brain has been divided based on topology of major
features, cyto- and myelo-architecture, developmental
boundaries, supposed evolutionary origins, histochemistry, gene
expression and functional criteria.
 The gross anatomy of the brain reflects the underlying networks
only superficially, and thus any parcellation reflects a somewhat
arbitrary division based on one or more of these criteria.
The “activation map” images that commonly accompany brain imaging papers can be
misleading to inexperienced readers, by seeming to suggest that the boundaries between
“activated” and “unactivated” patches of cortex are unambiguous and sharp. Instead, as
most researchers are aware, the apparent sharp boundaries are subject to the choice of
threshold applied to the statistical tests that generate the image. What, then, justifies
dividing the cortex into regions with boundaries based on this fuzzy, mutable measure of
functional profile?
(Saxe et al., 2010, p. 39).
Brainmaps.org

Program on Ontologies for Neural
Structures
 International Neuroinformatics Coordinating Facility (INCF)
 Structural Lexicon Task Force
 Defining brain structures
 Translate among terminologies
 Neuronal Registry Task Force
 Consistent naming scheme for neurons
 Knowledge base of neuron properties
 Representation and Deployment Task Force
 Formal representation
 Also interacts with Digital Atlasing Task Force
http://incf.org

NeuroLex Wiki
http://neurolex.org Stephen Larson
•Provide a simple framework
for defining the concepts
required
•Light weight semantics
•Good teaching tool for
learning about
semantic integration
and the benefits of a
consistent semantic
framework
•Community based:
•Anyone can contribute
their terms, concepts,
things
•Anyone can edit
•Anyone can link
•Accessible: searched by
Google
•Building an extensive cross-
disciplinary knowledge base
for neuroscience
Demo D03

Defining nervous system structures
Parcellation scheme: Set of parcels
occupying part or all of an anatomical
entity that has been delineated using a
common approach or set of criteria,
often in a single study. A parcellation
scheme for any given individual entity
may include gaps, transitional zones, or
regions of uncertainty. A parcellation
scheme derived from a set of individuals
registered to a common target (atlas)
may be probabilistic and include overlap
of parcels in regions that reflect
individual variability or imperfections in
alignment.
14 parcellation schemes currently represented in Neurolex
Documentation available
INCF task force on
ontologies

Basic model: do not conflate conceptual
structures with parcels
Regional part of
nervous system
Functional part of
nervous system
Parcel
overlaps
overlaps overlaps
Parcel Parcel
Neuroscientists have a lot of different parcellation schemes because they have a lot of different
ways of classifying brain structures and techniques to match them are imperfect

Linking semantics to space: INCF Atlasing
www.neurolex.org
Link to spatial
representation in
scalable brain
atlas
Waxholm space
Seth Ruffins, Alan Ruttenberg, Rembrandt Bakker

Neurons in Neurolex
 International
Neuroinformatics
Coordinating Facility (INCF)
building a knowledge base of
neurons and their properties
via the NeuroLex Wiki
 Led by Dr. Gordon Shepherd
 Consistent and parseable
naming scheme
 Knowledge is readily
accessible, editable and
computable
 While structure is imposed,
don’t worry too much about
the upper level classes of the
ontology
Stephen Larson

A KNOWLEDGE BASE OF NEURONAL PROPERTIES
Additional semantics added in NIFSTD by ontology engineer

Concept-based search: search by meaning
 Search Google: GABAergic neuron
 Search NIF: GABAergic neuron
 NIF automatically searches for types of
GABAergic neurons
Types of GABAergic
neurons
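Concept-based search of this kind amounts to walking the subclass hierarchy before querying. The sketch below assumes a toy hierarchy; `SUBCLASSES` and `expand_concept` are hypothetical names, and the class list is illustrative rather than a copy of the NIFSTD cell ontology.

```python
# Toy subclass table for illustration; NIF would read this from its ontology.
SUBCLASSES = {
    "Neuron": ["GABAergic neuron"],
    "GABAergic neuron": ["Purkinje cell", "Medium spiny neuron"],
}


def expand_concept(term: str, subclasses=SUBCLASSES) -> list:
    """Return the term plus all of its direct and indirect subclasses."""
    found = []
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            # Queue the children so indirect subclasses are reached as well.
            stack.extend(subclasses.get(current, []))
    return found


print(expand_concept("GABAergic neuron"))
```

Each returned label can then be searched alongside the original term, which is how a query for a class also retrieves records annotated only with its subtypes.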

Challenges of multiscale neurodegenerative
disease phenotypes
•Neurodegenerative diseases target very specific cell
populations
•Model systems only replicate a subset of features of the
disease
•Related phenotypes occur across anatomical scales
•Different vocabularies are used by different communities
not
not
Midbrain degenerated
Substantia nigra decreased
in volume
Substantia nigra pars
compacta atrophied
Loss of SNpc dopaminergic
neurons
Degeneration of nigrostriatal
terminals
Tyrosine-hydroxylase containing
neurons degenerate

Approach: Use ontologies to provide necessary
knowledge for matching related phenotypes
Sarah Maynard, Chris Mungall,
Suzie Lewis, Fahim Imam
Midbrain
Substantia nigra
compacta
compacta dopamine
cell
Dopamine
Neuron cell
soma
Neuron (CL)
Part of neuron
(GO)
Small molecule
(ChEBI)
Atrophied
Decreased
volume
Fewer in
number
Degenerate
Decreased in magnitude
relative to some normal
Has part
Has part
Is part
of
Has part
Has part
Is a
Is a Is a
Is a
Entities
Qualities
NIFSTD/PKB
OBO ontology

Alzheimer’s
disease
Human
(birnlex_516)
Neocortex pyramidal
neuron
Increased
number of
Lipofuscin
has part
inheres in inheres in
towards
EQ Representation of Phenotypes in Neurodegenerative
Disease: PATO and NIFSTD
Instance: Human with
Alzheimer’s disease 050
Phenotype
birnlex_2087_56
inheres in
about
Chris Mungall, Suzanna Lewis
Structured annotation
model implemented in WIB
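One way to picture the Entity-Quality (EQ) statement in this diagram is as a small record. The sketch below is an assumption about how the boxes relate (the quality inheres in the neuron and is directed towards lipofuscin); `EQPhenotype` and its field names are hypothetical and do not reflect the PKB schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EQPhenotype:
    """One Entity-Quality phenotype statement (illustrative structure only)."""
    bearer: str                    # organism or instance showing the phenotype
    entity: str                    # structure the quality inheres in
    quality: str                   # PATO-style quality label
    towards: Optional[str] = None  # second entity for relational qualities

    def describe(self) -> str:
        """Render the statement as a readable phrase."""
        text = f"{self.bearer}: {self.entity} has {self.quality}"
        if self.towards:
            text += f" {self.towards}"
        return text


phenotype = EQPhenotype(
    bearer="Human with Alzheimer's disease",
    entity="Neocortex pyramidal neuron",
    quality="increased number of",
    towards="Lipofuscin",
)
print(phenotype.describe())
```

Keeping entity and quality as separate fields is what lets each be drawn from a different ontology (NIFSTD for the entity, PATO for the quality).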

OBD: Ontology based database
 Provides a user
interface for matching
organisms based on
similarity of
phenotypes
 Based on EQ model
 Uses knowledge in the
ontology to compute
similarity scores and
other statistical
measures like
information content
http://www.berkeleybop.org/pkb/
Chris Mungall, Suzanna Lewis, Lawrence Berkeley
Labs

Thalamus
Cellular
inclusion
Midline nuclear
group
Lewy Body
Paracentral
nucleus
Cellular
inclusion
Computes common subsumers and information
content among phenotypes
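The idea of common subsumers and information content can be sketched as below. This is not the OBD algorithm: the `PARENTS` table mixes is-a and part-of links for brevity, the annotation counts are invented for illustration, and all function names are hypothetical.

```python
import math

# Toy parent links (is-a and part-of merged) for illustration only.
PARENTS = {
    "Midline nuclear group": ["Thalamus"],
    "Paracentral nucleus": ["Thalamus"],
    "Thalamus": ["Brain"],
}
# Invented annotation counts: how many records mention each term or its parts.
COUNTS = {"Brain": 100, "Thalamus": 10, "Midline nuclear group": 2, "Paracentral nucleus": 1}
TOTAL = 100


def ancestors(term: str) -> set:
    """Return the term together with everything that subsumes it."""
    result = {term}
    for parent in PARENTS.get(term, []):
        result |= ancestors(parent)
    return result


def information_content(term: str) -> float:
    """Rarer terms carry more information: IC = -log2(frequency)."""
    return -math.log2(COUNTS[term] / TOTAL)


def best_common_subsumer(a: str, b: str) -> str:
    """Pick the shared ancestor with the highest information content."""
    shared = ancestors(a) & ancestors(b)
    return max(shared, key=information_content)


print(best_common_subsumer("Midline nuclear group", "Paracentral nucleus"))
```

Two phenotypes that share only a very general ancestor score low, while a shared specific ancestor such as a single nucleus group scores high.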

*B6CBA-TgN (HDexon1)62) that express exon1 of the human mutant HD gene- Li et al., J
Neurosci, 21(21):8473-8481
PhenoSim: What organism is most similar to a human
with Huntington’s disease?
Putamen atrophied
Globus pallidus neuropil
degenerate
Part of basal ganglia
decreased in
magnitude
Fewer neostriatum
medium spiny neurons in
putamen
Neurons in striatum
degenerate
Neuron in striatum
decreased in
magnitude
Increased number of
astrocytes in caudate
nucleus
Neurons in striatum
degenerate
Nervous system cell
change in number in
striatum

Progressive enrichment
Understanding and comparing phenotypes will be enriched through community
knowledge bases like Neurolex
Looking forward to continuing this as part of the Monarch project with Melissa
Haendel, Chris Mungall and Suzie Lewis

Top Down vs Bottom up
Top-down ontology construction
• A select few authors have write privileges
• Maximizes consistency of terms with each other (automated consistency
checking)
• Making changes requires approval and re-publishing
•Works best when domain to be organized has: small corpus, formal categories,
stable entities, restricted entities, clear edges.
•Works best with participants who are: expert catalogers, coordinated users, expert
users, people with authoritative source of judgment
Bottom-up ontology construction
• Multiple participants can edit the ontology instantly (many eyes to correct errors)
• Semantics are limited to what is convenient for the domain
• Not a replacement for top-down construction; sometimes necessary to increase flexibility
• Necessary when domain has: large corpus, no formal categories, no clear edges
•Necessary when participants are: uncoordinated users, amateur users, naïve catalogers
• Neuroscience is a domain that is less formal and neuroscientists are more uncoordinated
NIFSTD
NEUROLEX
Important for Ontologists to define community contribution model

It’s a messy ecosystem (and that’s OK)
NIF favors a hybrid, tiered,
federated system
 Domain knowledge
 Ontologies
 Claims about results
 Virtuoso RDF triples
 Data
 Data federation
 Workflows
 Narrative
 Full text access
Neuron Brain part Disease
Organism Gene
Caudate projects to SNpc
Grm1 is upregulated in chronic cocaine
Betz cells degenerate in ALS

Musings from the NIF
 No one can be stopped from doing what they need to do
 Every resource is resource limited: few have enough time,
money, staff or expertise required to do everything they would
like
 If the market can support 11 MRI databases, fine
 Some consolidation, coordination is warranted though
 Big, broad and messy beats small, narrow and neat
 Without trying to integrate a lot of data, we will not know what needs to be done
 A lot can be done with messy data; neatness helps though
 Progressive refinement; addition of complexity through layers
 Be flexible and opportunistic
 A single optimal technology/container for all types of scientific data and
information does not exist; technology is changing
 Think globally; act locally:
 No source, not even NIF, is THE source; we are all a source

Grabbing the long tail of small
data
 Analysis of NIF shows
multiple databases with
similar scope and content
 Many contain partially
overlapping data
 Data “flows” from one
resource to the next
 Data is reinterpreted,
reanalyzed or added to
 Is duplication good or bad?

Same data: different analysis
Chronic vs acute
morphine in striatum
 Drug Related Gene database:
extracted statements from
figures, tables and supplementary
data from published article
 Gemma: Reanalyzed microarray
results from GEO using different
algorithms
 Both provide results of increased
or decreased expression as a
function of experimental
paradigm
 4 strains of mice
 3 conditions: chronic morphine,
acute morphine, saline
Mined NIF for all references to GEO IDs: found a small number where the
same dataset was represented in two or more databases
http://www.chibi.ubc.ca/Gemma/home.html

How easy was it to compare?
 Gemma: Gene ID + Gene Symbol
 DRG: Gene name + Probe ID
 Gemma: Increased expression/decreased expression
 DRG: Increased expression/decreased expression
 But...Gemma presented results relative to baseline chronic morphine; DRG with
respect to saline, so direction of change is opposite in the 2 databases
 Analysis:
 1370 statements from Gemma regarding gene expression as a function of
chronic morphine
 617 were consistent with DRG; over half of the claims of the paper were not
confirmed in this analysis
 Results for 1 gene were opposite in DRG and Gemma
 45 did not have enough information provided in the paper to make a judgment
NIF annotation
standard

Beware of False Dichotomies
 Top-down vs bottom up
 Light weight vs heavy weight
 “Chaotic Nihilists and Semantic Idealists”
 Text mining vs annotation
 Curators vs scientists
 Human vs machine
 DOIs vs URIs
http://www.datanami.com/datanami/2013-02-05/chaotic_nihilists_and_semantic_idealists.html

NIF team (past and present)
Jeff Grethe, UCSD, Co Investigator, Interim PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, CalTech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam, NIF Ontology Engineer
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Lee Hornbrook
Binh Ngo
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer
