The possibility and probability
of establishing a global
neuroscience information
framework:
lessons learned from practical experiences
in data integration for neuroscience
Maryann Martone, Ph. D.
University of California, San Diego
“Neural Choreography”
“A grand challenge in neuroscience is to elucidate brain function in relation
to its multiple layers of organization that operate at different spatial and
temporal scales. Central to this effort is tackling “neural choreography” --
the integrated functioning of neurons into brain circuits--their spatial
organization, local and long-distance connections, their temporal
orchestration, and their dynamic features. Neural choreography cannot
be understood via a purely reductionist approach. Rather, it entails the
convergent use of analytical and synthetic tools to gather, analyze and
mine information from each level of analysis, and capture the emergence
of new layers of function (or dysfunction) as we move from studying
genes and proteins, to cells, circuits, thought, and behavior....
However, the neuroscience community is not yet fully engaged in exploiting the
rich array of data currently available, nor is it adequately poised to capitalize
on the forthcoming data explosion.”
Akil et al., Science, Feb 11, 2011
On the other hand...
 In that same issue of Science
 Asked peer reviewers from last year about the availability and use of
data
 About half of those polled store their data only in their
laboratories—not an ideal long-term solution.
 Many bemoaned the lack of common metadata and archives as a
main impediment to using and storing data, and most of the
respondents have no funding to support archiving
 And even where accessible, much data in many fields is too poorly
organized to enable it to be efficiently used.
 “...it is a growing challenge to ensure that data produced during the
course of reported research are appropriately
described, standardized, archived, and available to all.” Lead Science
editorial (Science 11 February 2011:Vol. 331 no. 6018 p. 649 )
We speak piously of taking
measurements and making small
studies that will add another brick
to the temple of science. Most
such bricks just lie around the
brickyard.
Platt, J.R. (1964) Strong Inference.
Science. 146: 347-353.
"We now have unprecedented
ability to collect data about
nature…but there is now a crisis
developing in biology, in that
completely unstructured information
does not enhance understanding”
-Sidney Brenner
The
Encyclopedia
of Life
A…
Access to data has
changed over the
years
Tim Berners-Lee: Web of data
Wikipedia defines Linked Data as "a term used
to describe a recommended best practice for
exposing, sharing, and connecting pieces of
data, information, and knowledge on the
Semantic Web using URIs and RDF."
http://linkeddata.org/
GenBank
PDB
Are we there yet?
We’d like to be able to find:
 What is known****:
 What is the average diameter of a Purkinje
neuron?
 Is GRM1 expressed in cerebral cortex?
 What are the projections of hippocampus?
 What genes have been found to be
upregulated in chronic drug abuse in adults?
 What studies used my monoclonal mouse
antibody against GAD in humans?
 Find all instances of spines that
contain membrane-bound organelles
 ****by combining data from different
sources and different groups
 What is not known:
 Connections among data
 Gaps in knowledge
We’d like it to be really simple to
implement and use:
– Query interface
– Search strategies
– Data sources
– Infrastructure
– Results display
– Trust
– Context
– Analysis tools
– Tools for translating existing
content into linkable form
– Tools for creating new data ready
to be linked
 NIF is an initiative of the NIH Blueprint consortium of institutes
 What types of resources (data, tools, materials, services) are
available to the neuroscience community?
 How many are there?
 What domains do they cover? What domains do they not cover?
 Where are they?
 Web sites
 Databases
 Literature
 Supplementary material
 Who uses them?
 Who creates them?
 How can we find them?
 How can we make them better in the future? http://neuinfo.org
A look into the brickyard
• PDF files
• Desk drawers
How many resources are there?
•NIF Registry: A
catalog of
neuroscience-relevant
resources
•> 3500 currently
described
•> 1700 databases
•Another 3000
awaiting curation
•And we are finding
more every day
But we have Google!
 Current web is designed
to share documents
 Documents are
unstructured data
 Much of the content of
digital resources is part of
the “hidden web”
 Wikipedia: The Deep Web
(also called Deepnet, the
invisible Web, DarkNet, Undernet
or the hidden Web) refers
to World Wide Web content
that is not part of the
Surface Web, which is
indexed by standard
search engines.
A tip of the “resourceome”
Microarray: 9,535,440
Model organisms: 246,639
Connectivity: 26,443
Antibodies: 890,571
Pathways: 43,013
Brain Activation Foci: 56,591
65 databases
But we have PubMed!
 Bulk of neuroscience data
is published as part of
papers
 > 20,000,000
 Structured vs
unstructured information
“...it is a growing challenge to ensure that
data produced during the course of reported
research are appropriately
described, standardized, archived, and
available to all.” Lead Science editorial
(Science 11 February 2011: Vol. 331 no. 6018 p.
649 )
Author, year, journal, keywords
Content
The Neuroscience Information Framework: Discovery and
utilization of web-based resources for neuroscience
 A portal for finding and
using neuroscience
resources
 A consistent framework for
describing resources
 Provides simultaneous
search of multiple types of
information, organized by
category
 Supported by an expansive
ontology for neuroscience
 Utilizes advanced
technologies to search the
“hidden web”
http://neuinfo.org
UCSD, Yale, CalTech, George Mason, Washington Univ
Supported by NIH Blueprint
Literature
Database
Federation
Registry
Neuroscience is unlikely to be
served by a few large databases
like the genomics and proteomics
community
Whole brain data
(20 um
microscopic MRI)
Mosaic LM
images (1 GB+)
Conventional LM
images
Individual cell
morphologies
EM volumes &
reconstructions
Solved molecular
structures
No single technology serves these all
equally well.
Multiple data types; multiple
scales; multiple databases
A data federation problem
NIF Data Federation
 Too many databases to visit
 Registry not adequate for finding and using them
 Capturing content in a few keywords is difficult if not impossible
 Access to deep content; currently searches over 30 million records from > 65
different databases
 Flexible tools for resource providers to make their content available as easily and
meaningfully as possible
 Organized according to level of nervous system and data type, e.g., brain
activation foci
 Link to host resource: these databases are independent!
 Provides simplified and unified views to help users navigate very different
resources
 Common vocabularies
 Common data models for basic neuroscience data
 Laying the foundations for data integration for neuroscience
What are the connections of the
hippocampus?
Hippocampus OR “Cornu Ammonis” OR
“Ammon’s horn”
Query expansion: Synonyms
and related concepts
Boolean queries
Data sources
categorized by
“data type” and
level of nervous
system
Simplified views of
complex data
sources
Tutorials for using
full resource when
getting there from
NIF
Link back to
record in
original
source
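The query expansion shown on this slide can be sketched roughly as follows. This is a minimal illustration, not NIF's implementation: the `SYNONYMS` table and the `expand_query` function are hypothetical names, and a real system would presumably draw its synonyms from an ontology rather than a hard-coded dictionary.

```python
# Hypothetical synonym table; the entries mirror the slide's hippocampus example.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def expand_query(term: str) -> str:
    """Return a Boolean OR query of the term and its known synonyms."""
    variants = [term] + SYNONYMS.get(term.lower(), [])
    # Quote multi-word variants so they are searched as phrases.
    quoted = [f'"{v}"' if " " in v else v for v in variants]
    return " OR ".join(quoted)


print(expand_query("Hippocampus"))
```

A term with no known synonyms is returned unchanged, so the expansion step can sit in front of any keyword search without special cases.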
What are the connections of the
hippocampus?
Connects to
Synapsed with
Synapsed by
Input region
innervates
Axon innervates
Projects to
Cellular contact
Subcellular contact
Source site
Target site
Each resource implements a different, though related model;
systems are complex and difficult to learn, in many cases
NIF: Minimum requirements to use shared
data
 You (and the machine) have to be able to find it
 Accessible through the web
 Structured or semi-structured
 Annotations
 You (and the machine) have to be able to use it
 Data type specified and in a usable form
 You (and the machine) have to know what the data
mean
 Semantics
 Context: Experimental metadata
Reporting neuroscience data within a consistent framework helps enormously
Is GRM1 in cerebral cortex?
 NIF system allows easy search over multiple sources of information
 But, we have difficulty finding data
 Well known difficulties in search
 Inconsistent and sparse annotation of scientific data
 Many different names for the same thing
 The same name means many things
 “Hidden semantics”: 1 = male; 1 = present; 1=mouse
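One way to picture the "hidden semantics" problem is as a missing codebook. The sketch below is illustrative only: the source names, column names, and the codes other than 1 are assumptions, chosen to show how the same raw value decodes differently once each source's coding is made explicit.

```python
# Hypothetical per-source codebooks: the raw value 1 means something
# different in each source until the coding is written down.
CODEBOOKS = {
    "source_a": {"column": "sex", "codes": {1: "male", 2: "female"}},
    "source_b": {"column": "expression", "codes": {1: "present", 0: "absent"}},
    "source_c": {"column": "species", "codes": {1: "mouse", 2: "rat"}},
}


def decode(source: str, value: int) -> tuple[str, str]:
    """Translate a raw coded value into (attribute, meaning) for one source."""
    book = CODEBOOKS[source]
    return book["column"], book["codes"][value]


for name in CODEBOOKS:
    print(name, decode(name, 1))
```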
Allen Brain Atlas
MGD
Gensat
Cerebral Cortex
Atlas | Children | Parent
Genepaint | Neocortex, Olfactory cortex (Olfactory bulb; piriform cortex), hippocampus | Telencephalon
Allen Brain Atlas | Cortical plate, Olfactory areas, Hippocampal Formation | Cerebrum
MBAT (cortex) | Hippocampus, Olfactory, Frontal, Perirhinal cortex, entorhinal cortex | Forebrain
GENSAT | Not defined | Telencephalon
BrainInfo | frontal lobe, insula, temporal lobe, limbic lobe, occipital lobe | Telencephalon
Brainmaps | Entorhinal, insular, 6, 8, 4, A SII 17, Prp, SI | Telencephalon
What is an ontology?
Brain has a Cerebellum
Cerebellum has a Purkinje Cell Layer
Purkinje Cell Layer has a Purkinje cell
Purkinje cell is a neuron
 Ontology: an explicit, formal
representation of concepts and the
relationships among them
within a particular domain that
expresses human knowledge in a
machine readable form
 Branch of philosophy: a theory
of what is
 e.g., Gene ontologies
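A small sketch of what "machine readable" can mean here: the concepts and relationships on this slide written as subject-predicate-object triples. Reading the diagram labels as a single chain from Brain down to neuron is an assumption, and the function names are hypothetical; this is not how any particular ontology tool stores its data.

```python
# Triples built from the slide's labels (chain order is an assumption).
TRIPLES = [
    ("Brain", "has_a", "Cerebellum"),
    ("Cerebellum", "has_a", "Purkinje Cell Layer"),
    ("Purkinje Cell Layer", "has_a", "Purkinje cell"),
    ("Purkinje cell", "is_a", "neuron"),
]


def objects(subject, predicate):
    """Direct objects of (subject, predicate, ?)."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def all_parts(entity):
    """Follow has_a links transitively, collecting every nested part."""
    found = []
    stack = [entity]
    while stack:
        current = stack.pop()
        for part in objects(current, "has_a"):
            if part not in found:
                found.append(part)
                stack.append(part)
    return found


print(all_parts("Brain"))
print(objects("Purkinje cell", "is_a"))
```

Because the relationships are explicit, a program can answer "what does the brain contain?" without the word "Purkinje" ever appearing in the query.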
What can ontology do for us?
 Express neuroscience concepts in a way that is machine readable
 Synonyms, lexical variants
 Definitions
 Provide means of disambiguation of strings
 Nucleus part of cell; nucleus part of brain; nucleus part of atom
 Rules by which a class is defined, e.g., a GABAergic neuron is a neuron that releases
GABA as a neurotransmitter
 Properties
 Provide universals for navigating across different data sources
 Semantic “index”
 Perform reasoning
 Link data through relationships not just one-to-one mappings
 Provide the basis for concept-based queries to probe and mine data
 As a branch of philosophy, make us think about the nature of the
things we are trying to describe, e.g., synapse is a site
Linking datatypes to semantics: What is
the average diameter of a Purkinje
neuron dendrite?
 Branch structure not a
tree, not a set of blood
vessels, not a road map but a
DENDRITE
 Because anyone who uses
Neurolucida uses the same
concepts: axon, dendrite, cell
body, dendritic
spine, information systems
can combine the data
together in meaningful ways
 Neurolucida
doesn’t, however, tell you that
dendrite belongs to a neuron
of a particular type or whether
this dendrite is a neural
dendrite at all
( (Color Yellow) ; [10,1]
(Dendrite)
( 5.04 -44.40 -89.00 1.32) ; Root
( 3.39 -44.40 -89.00 1.32) ; R, 1
(
( 2.81 -45.10 -90.00 0.91) ; R-1, 1
( 2.81 -45.18 -90.00 0.91) ; R-1, 2
( 1.90 -46.01 -90.00 0.91) ; R-1, 3
( 1.82 -46.09 -90.00 0.91) ; R-1, 4
( 0.91 -46.59 -90.00 0.91) ; R-1, 5
( 0.41 -46.83 -92.50 0.91) ; R-1, 6
(
( -0.66 -46.92 -88.50 0.74) ; R-1-1, 1
( -0.74 -46.92 -88.50 0.74) ; R-1-1, 2
( -2.15 -47.25 -88.00 0.74) ; R-1-1, 3
( -2.15 -47.33 -88.00 0.74) ; R-1-1, 4
( -3.06 -47.00 -87.00 0.74) ; R-1-1, 5
( -4.05 -46.92 -86.00 0.74) ; R-1-1, 6
Output of Neurolucida neuron trace
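Because the trace uses a shared structure, a few lines of code can pull numbers out of it. The sketch below assumes each point is written as (x y z diameter), which is the usual reading of this format but is not stated on the slide; it uses four points copied from the trace above and computes a simple unweighted mean of the sampled diameters, not a length-weighted one.

```python
import re
from statistics import mean

# Four sample points copied from the trace on the slide.
TRACE = """
( 5.04 -44.40 -89.00 1.32) ; Root
( 3.39 -44.40 -89.00 1.32) ; R, 1
( 2.81 -45.10 -90.00 0.91) ; R-1, 1
( -0.66 -46.92 -88.50 0.74) ; R-1-1, 1
"""

# Assumed point layout: ( x y z diameter )
NUMBER = r"(-?\d+\.?\d*)"
POINT = re.compile(r"\(\s*" + r"\s+".join([NUMBER] * 4) + r"\s*\)")


def diameters(text):
    """Return the fourth value of every point in the trace."""
    return [float(m.group(4)) for m in POINT.finditer(text)]


print(diameters(TRACE))
print(mean(diameters(TRACE)))
```

Note what the code cannot do: nothing in the file says which cell type the dendrite belongs to, so "Purkinje neuron" has to come from annotation outside the trace.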
“A rose by any other name...”:
 Identity:
 Entities are uniquely identifiable
 Name is a meaningless numerical identifier (URI: Uniform resource identifier)
 Any number of human readable labels can be assigned to it
 Definition:
 Genera: is a type of (cell, anatomical structure, cell part)
 Differentia: “has a” A set of properties that distinguish among members of that
class
 Can include necessary and sufficient conditions
 Implementation: How is this definition expressed
 Depending on the nature of the concept or entity and the needs of the
information system, we can say more or fewer things
 Different languages; can express different things about the concept that can be
computed upon
 OWL (W3C standard), RDF
Comprehensive Ontology
 NIF covers multiple structural scales and domains of relevance to neuroscience
 Aggregate of community ontologies with some extensions for
neuroscience, e.g., Gene Ontology, Chebi, Protein Ontology
 Simple, basic “is a” hierarchies that can be used “as is” or to form the building blocks
for more complex representations
NIFSTD
Organism, NS Function, Molecule, Investigation, Subcellular structure,
Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols,
Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure
Query across resources: Snca
and striatum
NIF uses the NIFSTD ontologies to query across sources that use very
different terminologies, symbolic notations and levels of granularity
Entity mapping
BIRNLex_435 Brodmann.3
Explicit mapping of database content helps disambiguate non-unique and
custom terminology
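Explicit entity mapping can be sketched as a lookup from a source's local term to a shared identifier. The pairing of "Brodmann.3" with BIRNLex_435 is taken from the slide; the source name `example_db` and the function name are hypothetical, and returning `None` for unmapped terms is a design choice made for this illustration.

```python
# (source, local term) -> shared ontology identifier.
# "example_db" is a hypothetical source name; the pairing is from the slide.
ENTITY_MAP = {
    ("example_db", "Brodmann.3"): "BIRNLex_435",
}


def map_term(source, local_term):
    """Return the shared identifier for a local term, or None if unmapped."""
    return ENTITY_MAP.get((source, local_term))


print(map_term("example_db", "Brodmann.3"))
print(map_term("example_db", "area 3"))
```

Keying on the source as well as the term matters because the same custom label can mean different things in different databases.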
Concept-based search: search by meaning
 Search Google: GABAergic neuron
 Search NIF: GABAergic neuron
 NIF automatically searches for types of
GABAergic neurons
Types of GABAergic
neurons
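Concept-based search can be sketched as expanding a term to its subclasses before matching. The hierarchy and the records below are small illustrative assumptions, not NIFSTD content, and the matching is plain substring search chosen to keep the example short.

```python
# Hypothetical fragment of a class hierarchy (parent -> direct subclasses).
SUBCLASSES = {
    "GABAergic neuron": ["Purkinje cell", "Cerebellar basket cell"],
}

# Hypothetical record titles standing in for database content.
RECORDS = [
    "Purkinje cell spine density in mouse",
    "Pyramidal cell firing rates",
    "GABAergic neuron counts in cortex",
]


def expand(term):
    """Return the term plus all of its descendants in the hierarchy."""
    terms = [term]
    for child in SUBCLASSES.get(term, []):
        terms.extend(expand(child))
    return terms


def search(term, records=RECORDS):
    """Match records that mention the term or any of its subclasses."""
    terms = [t.lower() for t in expand(term)]
    return [r for r in records if any(t in r.lower() for t in terms)]


print(search("GABAergic neuron"))
```

A plain keyword search for "GABAergic neuron" would miss the Purkinje cell record; the expanded search finds it because the hierarchy says a Purkinje cell is a kind of GABAergic neuron.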
Data mining through
interrogation
 What genes are upregulated by drugs of abuse in the adult
mouse?
Morphine
Increased
expression
Adult Mouse
Integration of knowledge based on
relationships
Looking for commonalities and distinctions among animal
models and human conditions based on phenotypes
Sarah Maynard, Chris Mungall, Suzie Lewis NINDS
Thalamus
Cellular inclusion
Midline nuclear
group
Lewy Body
Paracentral nucleus
Cellular inclusion
And now, the literature
 The scientific article remains the currency of science
 Vast majority of neuroscience data is published in
the literature
 Computational biologists like to consume data
 Neuroscientists like to produce it
 Two NIF projects:
 1) Resource identification from the literature
 Identifying antibodies used in scientific studies from
text
 2) Extracting data from tables and supplementary
material
 Neuroscience is fundamentally reliant on antibodies
 Neuroscientists spend a lot of time searching for antibodies
that will work in their system for the target of interest and
troubleshooting experiments that didn’t work
 The scientific literature is a major source of information on
antibodies
 Proposal
 Use text mining strategies to identify antibodies, protocol
type and subject organism from materials and methods
section of J. Neuroscience
Problem: antibodies
 Midfrontal cortex tissue samples from neurologically unimpaired subjects (n = 9)
and from subjects with AD (n = 11) were obtained from the Rapid Autopsy
Program
 Immunoblot analysis and antibodies
 The following antibodies were used for immunoblotting: -actin mAb (1:10,000
dilution, Sigma-Aldrich); -tubulin mAb (1:10,000, Abcam); T46 mAb (specific to tau
404–441, 1:1000, Invitrogen); Tau-5 mAb (human tau 218–225, 1:1000, BD Biosciences)
(Porzig et al., 2007); AT8 mAb (phospho-tau Ser199, Ser202, and Thr205, 1:500,
Innogenetics); PHF-1 mAb (phospho-tau Ser396 and Ser404, 1:250, gift from P. Davies);
12E8 mAb (phospho-tau Ser262 and Ser356, 1:1000, gift from P. Seubert); NMDA receptors
2A, 2B and 2D goat pAbs (C terminus, 1:1000, Santa Cruz Biotechnology)…
Semantic annotation: Entity mapping by
human
Sato et al., J. Neurosci. 2008
Subject is Human
Antibody #7
"12E8" is a: Monoclonal antibody (birnlex_2027)
Antibody reagent has target: human PHF tau (Waiting for Neurolex ID)
Protein product of:
Antibody reagent has provider: Peter Seubert
Antibody reagent has catalog #:
Antibody reagent has source organism: Mouse (birnlex_167; NCBI Taxonomic ID: 10090)
Antibody reagent has id: "12E8"
Provider has location: Elan Pharmaceuticals, South San Francisco, CA
Provider has url:
Try this Watson!
• 95 antibodies were identified in 8 articles
• 52 did not contain enough information to determine the
antibody used
• Some provided details in another paper
• And another paper, and another...
• Failed to give species, clonality, vendor, or catalog number
• But, many provided the location of the vendor because
the instructions to authors said to do so
• no antibodies had lot numbers associated
We never got to test the algorithms!
 NIF, along with several other large informatics
projects, recommends that all authors provide
vendor and catalog # for all reagents used
 But...vendors merge and sell each other’s
antibodies, making it difficult to track down exactly
which reagent was used in some cases
 Catalog numbers get replaced; many variants on the
same product, e.g., HRP-conjugated, 200 ul vs 500 ul
 Clone names are not unique
 Universal antibody ID
Publishing for the 21st Century
NIF Antibody Registry
• We have created an antibody
registry database
• Assigns a persistent identifier
to each antibody, both
commercial and
non-commercial
• ID will persist even if company
goes out of business or the
antibody is sold by multiple
vendors
• The data model is being formalized
into a rigorous ontology in
collaboration with others:
• We negotiated with antibody
aggregators to pull data for over
800,000 commercial antibodies,
200 vendors
• Can be used to register homegrown
antibodies as well
• http://antibodyregistry.org
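The idea of a persistent identifier that survives vendor changes can be sketched with a small in-memory registry. Everything here is hypothetical: the class, its methods, the vendor names, and the `EX_` identifier format are illustrations only and do not describe the actual Antibody Registry database or its identifiers.

```python
from itertools import count


class AntibodyRegistry:
    """Toy registry: many vendor listings can point at one persistent ID."""

    def __init__(self):
        self._by_listing = {}      # (vendor, catalog number) -> persistent ID
        self._next = count(1)

    def register(self, vendor, catalog_number, same_as=None):
        """Register a listing; pass same_as to reuse an existing ID."""
        key = (vendor, catalog_number)
        if key not in self._by_listing:
            # Hypothetical ID format, not the registry's real scheme.
            self._by_listing[key] = same_as or f"EX_{next(self._next):06d}"
        return self._by_listing[key]

    def listings(self, antibody_id):
        """All vendor listings known for one persistent ID."""
        return sorted(k for k, v in self._by_listing.items() if v == antibody_id)


registry = AntibodyRegistry()
first = registry.register("Vendor A", "cat-100")
registry.register("Vendor B", "B-7", same_as=first)  # same antibody, resold
print(first, registry.listings(first))
```

A paper that cites the persistent ID stays resolvable even if one vendor disappears, because the ID is not derived from any vendor's catalog number.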
“Find studies that used a rabbit polyclonal antibody
against GFAP that recognizes human in
immunocytochemistry”
Paz et al,
J Neurosci, 2010
(AB_310775)
Demo 2: Extracting data from
tables and supplementary
material
 Challenge: Extract data on gene expression in brain from
studies relevant to drug abuse
 Workflow:
Find articles
Extract results
from tables
Standardize
results
Load into NIF
Current DB: 140 tables from 54 articles
Andrea Arnaud-Stagg, Anita Bandrowski
Gene for tyrosine
hydroxylase has
increased
expression in locus
coeruleus of mouse
compared to control
when given chronic
morphine
Translations:
Upregulated, p < 0.05 =
increased expression
LC = locus coeruleus
Probe ID = gene name
Extract data and meaning of data
from tables
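The translations listed on this slide can be sketched as a small standardization step. This is an illustration under stated assumptions: the probe ID, the row field names, and the "decreased expression" rule are hypothetical additions, and the gene symbol `Th` is used for the tyrosine hydroxylase example. A missing p-value is kept as "not reported" so it is not confused with "no difference".

```python
PROBE_TO_GENE = {"probe_0001": "Th"}     # hypothetical probe ID
REGIONS = {"LC": "locus coeruleus"}      # abbreviation from the slide


def standardize(row):
    """Translate one extracted table row into standardized terms."""
    p = row.get("p_value")
    change = (row.get("change") or "").lower()
    if p is None:
        result = "not reported"
    elif p < 0.05 and change == "upregulated":
        result = "increased expression"
    elif p < 0.05 and change == "downregulated":
        result = "decreased expression"   # assumed mirror of the slide's rule
    else:
        result = "no difference"
    return {
        "gene": PROBE_TO_GENE.get(row["probe_id"], row["probe_id"]),
        "region": REGIONS.get(row["region"], row["region"]),
        "result": result,
    }


row = {"probe_id": "probe_0001", "region": "LC", "change": "Upregulated", "p_value": 0.01}
print(standardize(row))
```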
Challenges working with tables and
supplemental data
 Difficult data arrangements
 PDF, JPG, TXT, CSV, XLS
 Difficult styles: colors, symbols, data arrangements (results
combined into one column, multiple comparisons in one table,
legends defining values), unclearly described data (e.g., unclear
significance)
 Not clear what tables/values represent
 Nothing in the paper about the supplementary data file, and the table has no heading
 Probe IDs are given but not gene identifiers
 No link from supplemental material back to article; lose
provenance
 Results are presented but values of significance unclear
 Neither curator (nor machine) could distinguish between no difference
and not reported
What affects SMN1 expression?
Researchers often report results in a way that prevents curators from
extracting the full information from a study
Common theme
•We are not publishing data in a
form that is easy to integrate
•What we mean isn’t clear to a
search engine (or even to a
human)
•We use many different data
structures to say the same
thing
•We don’t provide crucial
information
•Searching and navigating across
individual resources takes an
inordinate amount of human effort
Tempus Pecunia Est: painting by Richard
Harpum
When I talk to neuroscientists (and journal editors)...
Collaboration, competition,
coordination, cooperation
 The diversity and dynamism of neuroscience will always make data
integration challenging
 Neural space is vast: No one group or individual can do
everything
 We don’t have to solve everything to make it better
 Global partnership with room for everyone:
 Neuroscientists
 Curators
 Resource developers
 Funders
 Computational biologists
 Text miners
 Computer scientists
 Watson
Hopeful signs...
•Means for sharing data on the web
becoming more routine
•With availability, growing recognition for a role
of standards and curation
•For neuroscience, we now have
organizations that can help
coordinate
•NIF, NITRC (http://nitrc.org)
•Neuroimaging Tools and Resource
Clearinghouse
•International Neuroinformatics
Coordinating Facility
•Educate neuroscientists on what is
necessary
•Bring together stakeholders to
define what is necessary for
interoperation
•Implement structures and
procedures for developing
neuroscience resources within a
framework
http://incf.org
We don’t know everything but we
do know some things
1. Register your resource
with NIF!!!!
3: Be mindful
 Resource providers: Mindfulness that your
resource is contributing data to a global
federation
 Link to shared ontology identifiers where
possible
 Stable and unique identifiers for data
 Explicit semantics
 Database, model, atlas
 Researchers: Mindfulness when publishing
data that it is to be consumed by machines
and not just your colleagues
 Accession numbers for genes and species
 Catalog numbers for reagents
 Provide supplemental data in a form where it
is easy to re-use
2. Become involved with NIF
and INCF
Learn about neuroinformatics
Many thanks to...
Amarnath Gupta, UCSD, Co Investigator
Jeff Grethe, UCSD, Co Investigator
Anita Bandrowski, NIF Curator
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
David Van Essen, Washington University
Erin Reid
Paul Sternberg, CalTech
Arun Rangarajan
Hans Michael Muller
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam, NIF Ontology Engineer
Karen Skinner, NIH, Program Officer
Mark Ellisman
Lee Hornbrook
Kara Lu
Vadim Astakhov
Xufei Qian
Chris Condit
Stephen Larson
Sarah Maynard
Bill Bug
Register your resource to NIF!
How old is an adult squirrel?
 Definitions can be
quantitative
 Arbitrary but defensible
 Qualitative categories
for quantitative
attributes
 Best practice to
provide ages of
subjects, but for
query, need to
translate into
qualitative concepts
Jonathan Cachat, Anita Bandrowski
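Translating a reported age into a qualitative category for query can be sketched as a range lookup. The species name, the cutoffs, and the category labels below are placeholders invented for illustration; real ranges would have to be set per species by curators, which is the "arbitrary but defensible" step the slide describes.

```python
# Hypothetical cutoffs in days: (low inclusive, high exclusive, label).
AGE_RANGES = {
    "example_species": [
        (0, 30, "juvenile"),
        (30, 10**6, "adult"),
    ],
}


def age_category(species, age_days):
    """Map a quantitative age to a qualitative category, or None if unknown."""
    for low, high, label in AGE_RANGES.get(species, []):
        if low <= age_days < high:
            return label
    return None


print(age_category("example_species", 12))
print(age_category("example_species", 200))
```

Storing the reported age alongside the derived category keeps the original measurement available if the cutoffs are later revised.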
But there are no databases for
siRNA
NIF Registry is probably the most complete accounting we have of what is out
there

The possibility and probability of a global Neuroscience Information Framework

  • 1.
    The possibility andprobability of establishing a global neuroscience information framework: lessons learned from practical experiences in data integration for neuroscience Maryann Martone, Ph. D. University of California, San Diego
  • 2.
    “Neural Choreography” “A grandchallenge in neuroscience is to elucidate brain function in relation to its multiple layers of organization that operate at different spatial and temporal scales. Central to this effort is tackling “neural choreography” -- the integrated functioning of neurons into brain circuits--their spatial organization, local and long-distance connections, their temporal orchestration, and their dynamic features. Neural choreography cannot be understood via a purely reductionist approach. Rather, it entails the convergent use of analytical and synthetic tools to gather, analyze and mine information from each level of analysis, and capture the emergence of new layers of function (or dysfunction) as we move from studying genes and proteins, to cells, circuits, thought, and behavior.... However, the neuroscience community is not yet fully engaged in exploiting the rich array of data currently available, nor is it adequately poised to capitalize on the forthcoming data explosion. “ Akil et al., Science, Feb 11, 2011
  • 3.
    On the otherhand...  In that same issue of Science  Asked peer reviewers from last year about the availability and use of data  About half of those polled store their data only in their laboratories—not an ideal long-term solution.  Many bemoaned the lack of common metadata and archives as a main impediment to using and storing data, and most of the respondents have no funding to support archiving  And even where accessible, much data in many fields is too poorly organized to enable it to be efficiently used.  “...it is a growing challenge to ensure that data produced during the course of reported research are appropriately described, standardized, archived, and available to all.” Lead Science editorial (Science 11 February 2011:Vol. 331 no. 6018 p. 649 )
  • 4.
    We speak piouslyof taking measurements and making small studies that will add another brick to the temple of science. Most such bricks just lie around the brickyard. Platt,J.R. (1964)Strong Inference. Science. 146: 347-353. "We now have unprecedented ability to collect data about nature…but there is now a crisis developing in biology, in that completely unstructured information does not enhance understanding” -Sidney Brenner
  • 5.
    The Encyclopedia of Life A… Access todata has changed over the years Tim Berner-s Lee: Web of data Wikipedia defines Linked Data as "a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the SemanticWeb using URIs and RDF.” http://linkeddata.org/ Genban k PDB
  • 6.
    Are we thereyet? We’d like to be able to find:  What is known****:  What is the average diameter of a Purkinje neuron  Is GRM1 expressed In cerebral cortex?  What are the projections of hippocampus  What genes have been found to be upregulated in chronic drug abuse in adults  What studies used my monoclonal mouse antibody against GAD in humans?  Find all instances of spines that contain membrane-bound organelles  ****by combining data from different sources and different groups  What is not known:  Connections among data  Gaps in knowledge We’d like it to be really simple to implement and use: – Query interface – Search strategies – Data sources – Infrastructure – Results display – Trust – Context – Analysis tools – Tools for translating existing content into linkable form – Tools for creating new data ready to be linked
  • 7.
     NIF isan initiative of the NIH Blueprint consortium of institutes  What types of resources (data, tools, materials, services) are available to the neuroscience community?  How many are there?  What domains do they cover? What domains do they not cover?  Where are they?  Web sites  Databases  Literature  Supplementary material  Who uses them?  Who creates them?  How can we find them?  How can we make them better in the future? http://neuinfo.org A look into the brickyard • PDF files • Desk drawers
  • 8.
    How many resourcesare there? •NIF Registry: A catalog of neuroscience-relevant resources •> 3500 currently described •> 1700 databases •Another 3000 awaiting curation •And we are finding more every day
  • 9.
    But we haveGoogle!  Current web is designed to share documents  Documents are unstructured data  Much of the content of digital resources is part of the “hidden web”  Wikipedia: The DeepWeb (also called Deepnet, the invisible Web, DarkNet, Undernet or the hiddenWeb) refers toWorldWideWeb content that is not part of the SurfaceWeb, which is indexed by standard search engines.
  • 10.
    A tip ofthe “resourceome” Microarray 9, 535, 440 Model organisms 246, 639 Connectivity 26, 443 Antibodies 890, 571 Pathways 43, 013 Brain Activation Foci 56, 591 65 databases
  • 11.
    But we havePub Med!  Bulk of neuroscience data is published as part of papers  > 20,000,000  Structured vs unstructured information “...it is a growing challenge to ensure that data produced during the course of reported research are appropriately described, standardized, archived, and available to all.” Lead Science editorial (Science 11 February 2011: Vol. 331 no. 6018 p. 649 ) Author, year, journal, keyw ords Content
  • 12.
    The Neuroscience InformationFramework: Discovery and utilization of web-based resources for neuroscience  A portal for finding and using neuroscience resources  A consistent framework for describing resources  Provides simultaneous search of multiple types of information, organized by category  Supported by an expansive ontology for neuroscience  Utilizes advanced technologies to search the “hidden web” http://neuinfo.org UCSD,Yale, CalTech, George Mason, Washington Univ Supported by NIH Blueprint Literature Database Federation Registry
  • 13.
    Neuroscience is unlikelyto be served by a few large databases like the genomics and proteomics community Whole brain data (20 um microscopic MRI) Mosiac LM images (1 GB+) Conventional LM images Individual cell morphologies EM volumes & reconstructions Solved molecular structures No single technology serves these all equally well. Multiple data types; multiple scales; multiple databases A data federation problem
  • 14.
    NIF Data Federation Too many databases to visit  Registry not adequate for finding and using them  Capturing content in a few keywords is difficult if not impossible  Access to deep content; currently searches over 30 million records from > 65 different databases  Flexible tools for resource providers to make their content available as easily and meaningfully as possible  Organized according to level of nervous system and data type, e.g., brain activation foci  Link to host resource: these databases are independent!  Provides simplified and unified views to help users navigate very different resources  Common vocabularies  Common data models for basic neuroscience data  Laying the foundations for data integration for neuroscience
  • 15.
    What are theconnections of the hippocampus?
  • 16.
    HippocampusOR “CornuAmmonis” OR “Ammon’shorn” Query expansion: Synonyms and related concepts Boolean queries Data sources categorized by “data type” and level of nervous system Simplified views of complex data sources Tutorials for using full resource when getting there from NIF Link back to record in original source
  • 17.
    What are theconnections of the hippocampus? Connects to Synapsed with Synapsed by Input region innervates Axon innervates Projects toCellular contact Subcellular contact Source site Target site Each resource implements a different, though related model; systems are complex and difficult to learn, in many cases
  • 18.
    NIF: Minimum requirementsto use shared data  You (and the machine) have to be able to find it  Accessible through the web  Structured or semi-structured  Annotations  You (and the machine) have to be able to use it  Data type specified and in a usable form  You (and the machine) have to know what the data mean  Semantics  Context: Experimental metadata Reporting neuroscience data within a consistent framework helps enormously
  • 19.
    Is GRM1 incerebral cortex?  NIF system allows easy search over multiple sources of information  But, we have difficulty finding data  Well known difficulties in search  Inconsistent and sparse annotation of scientific data  Many different names for the same thing  The same name means many things  “Hidden semantics”: 1 = male; 1 = present; 1=mouse Allen Brain Atlas MGD Gensat
  • 20.
    Cerebral Cortex Atlas ChildrenParent Genepaint Neocortex, Olfactory cortex (Olfactory bulb; piriform cortex), hippocampus Telencephalon Allen Brain Atlas Cortical plate, Olfactory areas, Hippocampal Formation Cerebrum MBAT (cortex) Hippocampus, Olfactory, Frontal, Perirhinal cortex, entorhinal cortex Forebrain GENSAT Not defined Telencephalon BrainInfo frontal lobe, insula, temporal lobe, limbic lobe, occipital lobe Telencephalon Brainmaps Entorhinal, insular, 6, 8, 4, A SII 17, Prp, SI Telencephalon
  • 21.
    What is anontology? Brain Cerebellum Purkinje Cell Layer Purkinje cell neuron has a has a has a is a  Ontology: an explicit, formal representation of concepts relationships among them within a particular domain that expresses human knowledge in a machine readable form  Branch of philosophy: a theory of what is  e.g., Gene ontologies
  • 22.
    What can ontologydo for us?  Express neuroscience concepts in a way that is machine readable  Synonyms, lexical variants  Definitions  Provide means of disambiguation of strings  Nucleus part of cell; nucleus part of brain; nucleus part of atom  Rules by which a class is defined, e.g., a GABAergic neuron is neuron that releases GABA as a neurotransmitter  Properties  Provide universals for navigating across different data sources  Semantic “index”  Perform reasoning  Link data through relationships not just one-to-one mappings  Provide the basis for concept-based queries to probe and mine data  As a branch of philosophy, make us think about the nature of the things we are trying to describe, e.g., synapse is a site
  • 23.
    Linking datatypes to semantics: What is the average diameter of a Purkinje neuron dendrite?  Branch structure not a tree, not a set of blood vessels, not a road map but a DENDRITE  Because anyone who uses Neurolucida uses the same concepts (axon, dendrite, cell body, dendritic spine), information systems can combine the data together in meaningful ways  Neurolucida doesn’t, however, tell you that the dendrite belongs to a neuron of a particular type or whether this dendrite is a neural dendrite at all
    Output of Neurolucida neuron trace:
    ( (Color Yellow) ; [10,1]
    (Dendrite)
    ( 5.04 -44.40 -89.00 1.32) ; Root
    ( 3.39 -44.40 -89.00 1.32) ; R, 1
    (
    ( 2.81 -45.10 -90.00 0.91) ; R-1, 1
    ( 2.81 -45.18 -90.00 0.91) ; R-1, 2
    ( 1.90 -46.01 -90.00 0.91) ; R-1, 3
    ( 1.82 -46.09 -90.00 0.91) ; R-1, 4
    ( 0.91 -46.59 -90.00 0.91) ; R-1, 5
    ( 0.41 -46.83 -92.50 0.91) ; R-1, 6
    (
    ( -0.66 -46.92 -88.50 0.74) ; R-1-1, 1
    ( -0.74 -46.92 -88.50 0.74) ; R-1-1, 2
    ( -2.15 -47.25 -88.00 0.74) ; R-1-1, 3
    ( -2.15 -47.33 -88.00 0.74) ; R-1-1, 4
    ( -3.06 -47.00 -87.00 0.74) ; R-1-1, 5
    ( -4.05 -46.92 -86.00 0.74) ; R-1-1, 6
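To illustrate the question on this slide, here is a rough Python sketch that reads the trace excerpt and averages the dendrite diameters. It assumes that each parenthesized group of four numbers is `x y z diameter`, which is how this excerpt reads; `TRACE`, `POINT`, `parse_points`, and `mean_diameter` are names made up for this example and are not part of Neurolucida.

```python
import re

# The excerpt from the slide, one sample point per line.
TRACE = """
( 5.04 -44.40 -89.00 1.32) ; Root
( 3.39 -44.40 -89.00 1.32) ; R, 1
( 2.81 -45.10 -90.00 0.91) ; R-1, 1
( 2.81 -45.18 -90.00 0.91) ; R-1, 2
( 1.90 -46.01 -90.00 0.91) ; R-1, 3
( 1.82 -46.09 -90.00 0.91) ; R-1, 4
( 0.91 -46.59 -90.00 0.91) ; R-1, 5
( 0.41 -46.83 -92.50 0.91) ; R-1, 6
( -0.66 -46.92 -88.50 0.74) ; R-1-1, 1
( -0.74 -46.92 -88.50 0.74) ; R-1-1, 2
( -2.15 -47.25 -88.00 0.74) ; R-1-1, 3
( -2.15 -47.33 -88.00 0.74) ; R-1-1, 4
( -3.06 -47.00 -87.00 0.74) ; R-1-1, 5
( -4.05 -46.92 -86.00 0.74) ; R-1-1, 6
"""

# Four decimal numbers inside one pair of parentheses.
NUMBER = r"(-?\d+\.\d+)"
POINT = re.compile(r"\(\s*" + r"\s+".join([NUMBER] * 4) + r"\s*\)")


def parse_points(text):
    """Return (x, y, z, diameter) tuples for every sample point in the text."""
    return [tuple(float(value) for value in match.groups())
            for match in POINT.finditer(text)]


def mean_diameter(text):
    """Unweighted mean of the assumed diameter column over all sample points."""
    points = parse_points(text)
    if not points:
        raise ValueError("no sample points found")
    return sum(point[3] for point in points) / len(points)
```

The mean here is taken over sample points and is not weighted by segment length. More importantly, nothing in the file says that these points belong to a Purkinje neuron, so answering the slide's question still depends on annotation that lives outside the trace.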
  • 24.
    “A rose by any other name...”:  Identity:  Entities are uniquely identifiable  Name is a meaningless numerical identifier (URI: Uniform Resource Identifier)  Any number of human-readable labels can be assigned to it  Definition:  Genera: is a type of (cell, anatomical structure, cell part)  Differentia: “has a”: a set of properties that distinguish among members of that class  Can include necessary and sufficient conditions  Implementation: how is this definition expressed?  Depending on the nature of the concept or entity and the needs of the information system, we can say more or fewer things  Different languages can express different things about the concept that can be computed upon  OWL (W3C standard), RDF
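As a loose illustration of identity versus label, the sketch below models an entity as an opaque identifier plus any number of labels, a genus, and differentia. It is written in plain Python rather than OWL or RDF, and every identifier in it (the `example:` prefix, `Entity`, `ENTITIES`, `lookup`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    identifier: str          # opaque and meaningless; never shown as the name
    labels: tuple            # any number of human-readable labels
    genus: str               # "is a type of"
    differentia: tuple = ()  # (property, value) pairs that set the class apart


# Hypothetical identifiers; real systems would use URIs from a shared ontology.
ENTITIES = [
    Entity("example:0001", ("nucleus", "cell nucleus"), "cell part"),
    Entity("example:0002", ("nucleus", "brain nucleus"), "anatomical structure"),
    Entity("example:0003", ("GABAergic neuron",), "cell",
           (("releases neurotransmitter", "GABA"),)),
]


def lookup(label):
    """Return every entity that carries the given label, ignoring case."""
    wanted = label.lower()
    return [entity for entity in ENTITIES
            if wanted in (name.lower() for name in entity.labels)]
```

Looking up the string "nucleus" returns two distinct entities, each with its own identifier and genus. A system that stores the identifier instead of the string never has to guess which nucleus was meant.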
  • 25.
    Comprehensive Ontology  NIF covers multiple structural scales and domains of relevance to neuroscience  Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology  Simple, basic “is a” hierarchies that can be used “as is” or to form the building blocks for more complex representations  NIFSTD modules shown: Organism NS Function Molecule Investigation Subcellular structure Macromolecule Gene Molecule Descriptors Techniques Reagent Protocols Cell Resource Instrument Dysfunction Quality Anatomical Structure
  • 26.
    Query across resources: Snca and striatum  NIF uses the NIFSTD ontologies to query across sources that use very different terminologies, symbolic notations and levels of granularity
  • 27.
    Entity mapping  BIRNLex_435  Brodmann.3  Explicit mapping of database content helps disambiguate non-unique and custom terminology
  • 28.
    Concept-based search: search by meaning  Search Google: GABAergic neuron  Search NIF: GABAergic neuron  NIF automatically searches for types of GABAergic neurons  Types of GABAergic neurons
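The idea of searching for types of GABAergic neurons can be sketched as query expansion: the search term is widened to its subclasses and their synonyms before matching. This is a toy Python version under stated assumptions, not NIF's implementation. The two subclasses and their synonyms are a small illustrative sample, and `SUBCLASSES`, `SYNONYMS`, `expand`, and `search` are hypothetical names.

```python
# Illustrative fragment of a class hierarchy; a real ontology lists many more.
SUBCLASSES = {
    "GABAergic neuron": ["Purkinje cell", "Medium spiny neuron"],
    "Purkinje cell": [],
    "Medium spiny neuron": [],
}

SYNONYMS = {
    "Purkinje cell": ["Purkinje neuron"],
    "Medium spiny neuron": ["Spiny projection neuron"],
}


def expand(term):
    """Return the term plus all of its subclasses and their synonyms."""
    terms = []
    queue = [term]
    while queue:
        current = queue.pop(0)
        if current in terms:
            continue
        terms.append(current)
        terms.extend(s for s in SYNONYMS.get(current, []) if s not in terms)
        queue.extend(SUBCLASSES.get(current, []))
    return terms


def search(term, records):
    """Return records that mention the term or any term it expands to."""
    wanted = [t.lower() for t in expand(term)]
    return [r for r in records if any(t in r.lower() for t in wanted)]
```

A plain string match for "GABAergic neuron" would miss a record titled "Purkinje neuron dendrite reconstruction". The expanded query finds it because the hierarchy states that a Purkinje cell is a kind of GABAergic neuron.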
  • 29.
    Data mining through interrogation What genes are upregulated by drugs of abuse in the adult mouse? Morphine Increased expression Adult Mouse
  • 30.
    Integration of knowledge based on relationships  Looking for commonalities and distinctions among animal models and human conditions based on phenotypes  Sarah Maynard, Chris Mungall, Suzie Lewis  NINDS  Thalamus  Cellular inclusion  Midline nuclear group  Lewy Body  Paracentral nucleus  Cellular inclusion
  • 31.
    And now, the literature  The scientific article remains the currency of science  Vast majority of neuroscience data is published in the literature  Computational biologists like to consume data  Neuroscientists like to produce it  Two NIF projects:  1) Resource identification from the literature  Identifying antibodies used in scientific studies from text  2) Extracting data from tables and supplementary material
  • 32.
    Problem: antibodies  Neuroscience is fundamentally reliant on antibodies  Neuroscientists spend a lot of time searching for antibodies that will work in their system for the target of interest and troubleshooting experiments that didn’t work  The scientific literature is a major source of information on antibodies  Proposal  Use text mining strategies to identify antibodies, protocol type and subject organism from the materials and methods sections of J. Neuroscience
  • 33.
    Midfrontal cortex tissue samples from neurologically unimpaired subjects (n = 9) and from subjects with AD (n = 11) were obtained from the Rapid Autopsy Program  Immunoblot analysis and antibodies  The following antibodies were used for immunoblotting: -actin mAb (1:10,000 dilution, Sigma-Aldrich); -tubulin mAb (1:10,000, Abcam); T46 mAb (specific to tau 404–441, 1:1000, Invitrogen); Tau-5 mAb (human tau 218–225, 1:1000, BD Biosciences) (Porzig et al., 2007); AT8 mAb (phospho-tau Ser199, Ser202, and Thr205, 1:500, Innogenetics); PHF-1 mAb (phospho-tau Ser396 and Ser404, 1:250, gift from P. Davies); 12E8 mAb (phospho-tau Ser262 and Ser356, 1:1000, gift from P. Seubert); NMDA receptors 2A, 2B and 2D goat pAbs (C terminus, 1:1000, Santa Cruz Biotechnology)…  Sato et al., J. Neurosci. 2008
    Semantic annotation: Entity mapping by human
    Subject is Human
    Antibody #7 "12E8" is a Monoclonal antibody birnlex_2027
    Antibody reagent has target human PHF tau (Waiting for Neurolex ID) Protein product of
    Antibody reagent has provider Peter Seubert
    Antibody reagent has catalog #
    Antibody reagent has source organism Mouse birnlex_167 NCBI Taxonomic ID: 10090
    Antibody reagent has id "12E8"
    Provider has location Elan Pharmaceuticals, South San Francisco, CA
    Provider has url
  • 34.
    Try this Watson! • 95 antibodies were identified in 8 articles • 52 did not contain enough information to determine the antibody used • Some provided details in another paper • And another paper, and another... • Failed to give species, clonality, vendor, or catalog number • But, many provided the location of the vendor because the instructions to authors said to do so • No antibodies had lot numbers associated  We never got to test the algorithms!
  • 35.
    Publishing for the 21st Century  NIF, along with several other large informatics projects, recommends that all authors provide vendor and catalog # for all reagents used  But... vendors merge and sell each other’s antibodies, making it difficult to track down exactly which reagent was used in some cases  Catalog numbers get replaced; many variants on the same product, e.g., HRP-conjugated, 200 µl vs 500 µl  Clone names are not unique  Universal antibody ID
  • 36.
    NIF Antibody Registry • We have created an antibody registry database • Assigns a persistent identifier to each antibody, commercial or non-commercial • ID will persist even if the company goes out of business or the antibody is sold by multiple vendors • The data model is being formalized into a rigorous ontology in collaboration with others • We negotiated with antibody aggregators to pull data for over 800,000 commercial antibodies from 200 vendors • Can be used to register homegrown antibodies as well • http://antibodyregistry.org
  • 37.
    “Find studies that used a rabbit polyclonal antibody against GFAP that recognizes human in immunocytochemistry”  Paz et al., J Neurosci, 2010 (AB_310775)
  • 38.
    Demo 2: Extracting data from tables and supplementary material  Challenge: Extract data on gene expression in brain from studies relevant to drug abuse  Workflow: Find articles → Extract results from tables → Standardize results → Load into NIF  Current DB: 140 tables from 54 articles  Andrea Arnaud-Stagg, Anita Bandrowski
  • 39.
    Extract data and meaning of data from tables  Gene for tyrosine hydroxylase has increased expression in locus coeruleus of mouse compared to control when given chronic morphine  Translations:  Upregulated, p < 0.05 = increased expression  LC = locus coeruleus  Probe ID = gene name
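The three translations listed on this slide amount to a small set of lookup rules applied to each table row. The Python sketch below shows one way that might look. It is not the curators' actual pipeline: the row layout, the probe ID `probe_0001`, and the names `standardize`, `REGIONS`, `DIRECTIONS`, and `PROBE_TO_GENE` are assumptions made for illustration.

```python
# Lookup tables standing in for curated vocabularies (illustrative only).
REGIONS = {"LC": "locus coeruleus"}
DIRECTIONS = {"upregulated": "increased expression"}
PROBE_TO_GENE = {"probe_0001": "Th"}  # hypothetical probe ID; Th encodes tyrosine hydroxylase


def standardize(row, alpha=0.05):
    """Translate one raw table row into shared terms.

    `row` is assumed to be a dict with keys probe_id, region, change and
    an optional p_value. A missing p-value is reported as such instead of
    being treated as "no difference".
    """
    p_value = row.get("p_value")
    direction = DIRECTIONS.get(row["change"].strip().lower())
    if p_value is None:
        result = "significance not reported"
    elif p_value >= alpha:
        result = "no significant difference"
    elif direction is None:
        result = "unrecognized change term"
    else:
        result = direction
    return {
        "gene": PROBE_TO_GENE.get(row["probe_id"], "unmapped probe"),
        "region": REGIONS.get(row["region"], row["region"]),
        "result": result,
    }
```

Keeping "significance not reported" separate from "no significant difference" is a deliberate choice in this sketch, since a standardized record can only be as explicit as the table it came from.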
  • 40.
    Challenges working with tables and supplemental data  Difficult data arrangements  PDF, JPG, TXT, CSV, XLS  Difficult styles: colors, symbols, data arrangements (results combined into one column, multiple comparisons in one table, legends defining values), unclearly described data (e.g., unclear significance)  Not clear what tables/values represent  Nothing in the paper about the supplementary data file, and the table has no heading  Probe IDs are given but not gene identifiers  No link from supplemental material back to article; lose provenance  Results are presented but values of significance unclear  Neither curator nor machine could distinguish between no difference and not reported
  • 41.
    What affects SMN1 expression?  Researchers often report results in a way that prevents curators from extracting the full information from a study
  • 42.
    Common theme • We are not publishing data in a form that is easy to integrate • What we mean isn’t clear to a search engine (or even to a human) • We use many different data structures to say the same thing • We don’t provide crucial information • Searching and navigating across individual resources takes an inordinate amount of human effort  Tempus Pecunia Est  Painting by Richard Harpum
  • 43.
    When I talk to neuroscientists (and journal editors)...
  • 44.
    Collaboration, competition, coordination, cooperation  The diversity and dynamism of neuroscience will always make data integration challenging  Neural space is vast: No one group or individual can do everything  We don’t have to solve everything to make it better  Global partnership with room for everyone:  Neuroscientists  Curators  Resource developers  Funders  Computational biologists  Text miners  Computer scientists  Watson
  • 45.
    Hopeful signs... • Means for sharing data on the web becoming more routine • With availability, growing recognition for a role of standards and curation • For neuroscience, we now have organizations that can help coordinate • NIF, NITRC (http://nitrc.org) • Neuroimaging Tools and Resource Clearinghouse • International Neuroinformatics Coordinating Facility (http://incf.org) • Educate neuroscientists on what is necessary • Bring together stakeholders to define what is necessary for interoperation • Implement structures and procedures for developing neuroscience resources within a framework
  • 46.
    We don’t know everything but we do know some things  1. Register your resource with NIF!!!!  2. Become involved with NIF and INCF  3. Be mindful  Resource providers: Mindfulness that your resource is contributing data to a global federation  Link to shared ontology identifiers where possible  Stable and unique identifiers for data  Explicit semantics  Database, model, atlas  Researchers: Mindfulness when publishing data that it is to be consumed by machines and not just your colleagues  Accession numbers for genes and species  Catalog numbers for reagents  Provide supplemental data in a form that is easy to re-use
  • 47.
  • 48.
    Many thanks to...  Amarnath Gupta, UCSD, Co Investigator  Jeff Grethe, UCSD, Co Investigator  Anita Bandrowski, NIF Curator  Gordon Shepherd, Yale University  Perry Miller  Luis Marenco  David Van Essen, Washington University  Erin Reid  Paul Sternberg, CalTech  Arun Rangarajan  Hans Michael Muller  Giorgio Ascoli, George Mason University  Sridevi Polavarum  Fahim Imam, NIF Ontology Engineer  Karen Skinner, NIH, Program Officer  Mark Ellisman  Lee Hornbrook  Kara Lu  Vadim Astakhov  Xufei Qian  Chris Condit  Stephen Larson  Sarah Maynard  Bill Bug
  • 49.
  • 50.
    How old is an adult squirrel?  Definitions can be quantitative  Arbitrary but defensible  Qualitative categories for quantitative attributes  Best practice to provide ages of subjects, but for query, need to translate into qualitative concepts  Jonathan Cachat, Anita Bandrowski
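Translating a reported age into a qualitative stage can be sketched as a per-species table of cut-offs. The Python below is only an illustration of that idea: the species key and the day thresholds are invented placeholders, not values used by NIF, and `STAGE_RULES` and `stage` are hypothetical names.

```python
# Upper bounds in days (exclusive) for each stage, per species.
# The numbers are placeholders: "arbitrary but defensible" values would
# have to be chosen and documented by curators for each real species.
STAGE_RULES = {
    "hypothetical_rodent": [(21, "juvenile"), (60, "adolescent")],
}


def stage(species, age_days):
    """Map a numeric age to a qualitative stage for the given species."""
    if age_days < 0:
        raise ValueError("age cannot be negative")
    for upper_bound, label in STAGE_RULES[species]:
        if age_days < upper_bound:
            return label
    return "adult"
```

Storing the reported age alongside the derived stage keeps the original measurement available, while a query for "adult" can still match subjects whose papers only gave an age in days.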
  • 51.
    But there are no databases for siRNA  NIF Registry is probably the most complete accounting we have of what is out there
