Databases and Ontologies: Where do we go from here?
1. Maryann E. Martone, Ph.D.
University of California, San Diego
INCF Neuroinformatics Short Course, Stockholm, August 2013
2. • Introduction
• Introduction to the Neuroscience Information Framework
• Structured information: data, databases
• Federating neuroscience-relevant databases
• Information frameworks
• Ontologies
• What can we do with information in the NIF?
• Conclusions
3. FORCE11.org: Future of research communications and e-scholarship
[Diagram labels: Scholar, Library, Scholar, Publisher]
4. [Diagram labels: Scholar, Consumer, Libraries, Data Repositories, Code Repositories, Community databases/platforms, OA, Curators, Social Networks, Peer Reviewers, Narrative, Workflows, Data, Models, Multimedia, Nanopublications, Code]
6. • NIF’s mission is to maximize the awareness of, access to and utility of research resources produced worldwide to enable better science and promote efficient use
– NIF unites neuroscience information without respect to domain, funding agency, institute or community
– NIF is like a “PubMed” for all biomedical resources and a “PubMed Central” for databases
– Makes them searchable from a single interface
– Practical and cost-effective; tries to be sensible
– Learned a lot about effective data sharing
The Neuroscience Information Framework is an initiative of the NIH Blueprint consortium of institutes
http://neuinfo.org
7. We’d like to be able to find:
• What is known:
– What are the projections of hippocampus?
– Is GRM1 expressed in cerebral cortex?
– What genes have been found to be upregulated in chronic drug abuse in adults?
– What animal models have similar phenotypes to Parkinson’s disease?
– What studies used my polyclonal antibody against GABA in humans?
• What is not known:
– Connections among data
– Gaps in knowledge
A framework makes it easier to address these questions
9. Neuroscience is unlikely to be served by a few large databases, as the genomics and proteomics communities are
• Whole brain data (20 um microscopic MRI)
• Mosaic LM images (1 GB+)
• Conventional LM images
• Individual cell morphologies
• EM volumes & reconstructions
• Solved molecular structures
No single technology serves these all equally well.
Multiple data types; multiple scales; multiple databases
10. • Data warehouse: May contain data from diverse sources; schemas are integrated. Data are “cleaned” to fit a unified data model. One database to rule them all...
• Data federation: a virtual database that stores data definitions and not the data itself. The virtual database will have information about the location of the data. When a single call is made to a virtual database, the technology ensures multiple calls to underlying databases and is also responsible for meaningfully aggregating the returned result sets.
From Wikipedia and http://www.infosysblogs.com/oracle/2010/01/data_federation_a_potent_subst_1.html
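The federation idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two in-memory SQLite databases stand in for remote sources, and the source names, table layout and record contents are invented, not taken from NIF.

```python
import sqlite3


def make_source(rows):
    """Create an in-memory database standing in for one remote source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (region TEXT, finding TEXT)")
    conn.executemany("INSERT INTO records VALUES (?, ?)", rows)
    return conn


# Hypothetical sources; names and contents are placeholders.
SOURCES = {
    "source_a": make_source([("hippocampus", "connectivity record 12"),
                             ("cerebellum", "connectivity record 40")]),
    "source_b": make_source([("hippocampus", "expression record 7")]),
}


def federated_search(region):
    """One call fans out to every source; results are tagged and merged."""
    merged = []
    for name, conn in SOURCES.items():
        cursor = conn.execute(
            "SELECT finding FROM records WHERE region = ?", (region,))
        merged.extend((name, finding) for (finding,) in cursor)
    return merged


results = federated_search("hippocampus")
for source, finding in results:
    print(source, "->", finding)
```

The data stay in their own databases; only the query and the aggregation step live in the "virtual" layer.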
11. Relational database: data model; data types; formal query language
Subject 473
• Species: mouse (string)
• Age: 50 days (integer)
• Age category: adult
• Protocol: 2
Free text: entity recognition; natural language processing
“Mice (aged 50 days) were perfused with 4% paraformaldehyde and brains were sectioned at a thickness of 50 um. Sections were labeled using antibodies against calbindin and imaged on a Zeiss confocal microscope.”
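To make the contrast concrete, here is a small Python sketch that asks the same question (how old was the subject?) of a typed record and of a methods sentence. The table and column names are assumptions made for the example, and the regular expression is a deliberately naive stand-in for real entity recognition.

```python
import re
import sqlite3

# Structured: the schema declares names and types, so the query is exact.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subject (id INTEGER, species TEXT, age_days INTEGER, "
    "age_category TEXT, protocol INTEGER)")
conn.execute("INSERT INTO subject VALUES (473, 'mouse', 50, 'adult', 2)")
structured_age = conn.execute(
    "SELECT age_days FROM subject WHERE id = 473").fetchone()[0]

# Free text: the same fact has to be recognized inside a sentence.
# Note that "50 um" is also a number followed by a unit, so a looser
# pattern could easily pick up the section thickness instead of the age.
methods = ("Mice (aged 50 days) were perfused with 4% paraformaldehyde and "
           "brains were sectioned at a thickness of 50 um.")
match = re.search(r"aged (\d+) days", methods)
text_age = int(match.group(1)) if match else None

print(structured_age, text_age)
```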
12. ∞
What is easily machine processable and accessible
What is potentially knowable
What is known: Literature, images, human knowledge
Unstructured; natural language processing, entity recognition, image processing and analysis; paywalls; communication
Abstracts vs full text vs tables etc.
13. http://neuinfo.org
June 10, 2013 dkCOIN Investigator's Retreat
• A portal for finding and using neuroscience resources
• A consistent framework for describing resources
• Provides simultaneous search of multiple types of information, organized by category
• Supported by an expansive ontology for neuroscience
• Utilizes advanced technologies to search the “hidden web”
UCSD, Yale, Cal Tech, George Mason, Washington Univ
Literature; Database Federation; Registry
15. With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
16. NIF Registry: requires no special skills; site map available for local hosting
• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines
• NIF Data Federation
• DISCO interop
• Requires some programming skill
• Open Source Brain < 2 hr
Low barrier to entry; incremental refinement
17. NIF was designed to be populated rapidly with progressive refinement
18. Databases come in many shapes and sizes
• Primary data:
– Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data
– Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data
– Claims and assertions about the meaning of data
• E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries:
– Metadata
– Pointers to data sets or materials stored elsewhere
• Data aggregators
– Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
– Data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
19. • Data: values of qualitative or quantitative variables, belonging to a set of items... often the results of measurements (Wikipedia)
• Metadata: “Data about data”
• Structural metadata:
• the design and specification of data structures, more properly called “data about the containers of data” (Wikipedia)
• e.g., image size, bit depth, integer vs string
• Descriptive metadata:
• individual instances of application data, the data content: “data about data content”
• e.g., creator, subject
• Data type: the form of the data for the purposes of data operations
• Data integration: combining data residing in different sources and providing users with a unified view of these data
“Metadata are data” - Wikipedia
20. NIF searches the largest collation of neuroscience-relevant data on the web
[Chart: number of federated databases (0 to 250) and number of federated records in millions (0.01 to 1000) over time, x-axis labels 6-12 through 5-17; DISCO]
21. • Long tail data: large numbers of small data sets
http://en.wikipedia.org/wiki/Long_tail
22. Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”
• Query expansion: synonyms and related concepts
• Boolean queries
• Data sources categorized by “data type” and level of nervous system
• Common views across multiple sources
• Tutorials for using full resource when getting there from NIF
• Link back to record in original source
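Query expansion of the kind shown in the search string above can be sketched as follows. The synonym table here is a hypothetical stand-in; a real system would read synonyms from an ontology rather than a hard-coded dictionary, and this is not NIF's actual implementation.

```python
# Hypothetical synonym table; a real system would read this from an ontology.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def quote(term):
    """Quote multi-word terms so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(term):
    """Return a Boolean OR query covering the term and its synonyms."""
    terms = [term] + SYNONYMS.get(term.lower(), [])
    return " OR ".join(quote(t) for t in terms)


expanded = expand_query("Hippocampus")
print(expanded)
```

A term with no known synonyms passes through unchanged, so the expansion never narrows what the user asked for.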
23. [Relation names used by different resources: Connects to; Synapsed with; Synapsed by; Input region; innervates; Axon innervates; Projects to; Cellular contact; Subcellular contact; Source site; Target site]
Each resource implements a different, though related model; systems are complex and difficult to learn, in many cases
25. • Current web is designed to share documents
– Documents are unstructured data
• Much of the content of digital resources is part of the “hidden web”
• Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.
27. Knowledge in space and spatial relationships (the “where”)
Knowledge in words, terminologies and logical relationships (the “what”)
28. [Image labels: Purkinje Cell, Axon Terminal, Axon, Dendritic Tree, Dendritic Spine, Dendrite, Cell body, Cerebellar cortex]
There is little obvious connection between data sets taken at different scales using different microscopies without an explicit representation of the biological objects that the data represent
29. • NIF covers multiple structural scales and domains of relevance to neuroscience
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
NIFSTD: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure
30. [Diagram: Brain, Cerebellum, Purkinje Cell Layer, Purkinje cell, neuron, linked by three “has a” relations and one “is a” relation]
• Ontology: an explicit, formal representation of concepts and the relationships among them within a particular domain that expresses human knowledge in a machine readable form
• Branch of philosophy: a theory of what is
• e.g., Gene ontologies
31. • Express neuroscience concepts in a way that is machine readable
– Synonyms, lexical variants
– Definitions
• Provide means of disambiguation of strings
– Nucleus part of cell; nucleus part of brain; nucleus part of atom
• Rules by which a class is defined, e.g., a GABAergic neuron is a neuron that releases GABA as a neurotransmitter
• Properties
– Support reasoning
• Provide universals for navigating across different data sources
– Semantic “index”
– Link data through relationships not just one-to-one mappings
• Provide the basis for concept-based queries to probe and mine data
• Establish a semantic framework for landscape analysis
Mathematics, computer code or Esperanto
32. Aligns sources to the NIF semantic framework
34. birnlex_1741; Brodmann.10
Explicit mapping of database content helps disambiguate non-unique and custom terminology
36. • Search Google: GABAergic neuron
• Search NIF: GABAergic neuron
– NIF automatically searches for types of GABAergic neurons
Types of GABAergic neurons
Neuroscience Information Framework – http://neuinfo.org
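A concept-based search that also covers "types of" a class can be sketched as a walk over is-a links. The small hierarchy below is illustrative only and is not drawn from NIFSTD; the function names are hypothetical.

```python
# Hypothetical is-a links (child -> parent); class names are illustrative.
IS_A = {
    "Purkinje cell": "GABAergic neuron",
    "basket cell": "GABAergic neuron",
    "GABAergic neuron": "neuron",
    "granule cell": "neuron",
}


def descendants(cls):
    """Collect every class that reaches `cls` by following is-a links."""
    found = set()
    frontier = [cls]
    while frontier:
        parent = frontier.pop()
        for child, child_parent in IS_A.items():
            if child_parent == parent and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def concept_search(cls):
    """Search terms: the class itself plus all of its subclasses."""
    return sorted({cls} | descendants(cls))


print(concept_search("GABAergic neuron"))
```

A plain string search for "GABAergic neuron" would miss records that only mention a specific subtype; the expanded term list is what lets those records be found.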
37. Equivalence classes; restrictions
Arbitrary but defensible
• Neurons classified by
• Circuit role: principal neuron vs interneuron
• Molecular constituent: Parvalbumin-neurons, calbindin-neurons
• Brain region: Cerebellar neuron
• Morphology: Spiny neuron
• Molecule roles: Drug of abuse, anterograde tracer, retrograde tracer
• Brain parts: Circumventricular organ
• Organisms: Non-human primate, non-human vertebrate
• Qualities: Expression level
• Techniques: Neuroimaging
38. What genes are upregulated by drugs of abuse in the adult mouse? (show me the data!)
Morphine; Increased expression; Adult Mouse
39. • NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
• Brain Architecture Management System (rodent)
• Temporal lobe.com (rodent)
• Connectome Wiki (human)
• Brain Maps (various)
• CoCoMac (primate cortex)
• UCLA Multimodal database (Human fMRI)
• Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
40. • Realism vs conceptualism
• Controlled vocabularies vs taxonomies vs ontology?
• How do I name classes?
• Shared vs custom ontologies
• Single vs multiple inheritance
• RDF vs OWL?
• Top down vs bottom up: heavy weight vs light weight ontologies
• Should I encode everything in my ontology?
Many schools of thought about ontologies: their construction and use
41. • Controlled vocabularies: prescribed list of terms or headings, each one having an assigned meaning
• Lexicon/Thesaurus: Vocabularies + their lexical properties, e.g., synonyms, lexical variants
• Taxonomy: monohierarchical classification of concepts, as used, for example, in the classification of biological organisms, built on the “is a” relationship
• Ontology: specification of the concepts of a domain and their relationships, structured to allow computer processing and reasoning
http://www.willpowerinfo.co.uk/glossary.htm
Mike Bergman
42. • Identity:
– Entities are uniquely identifiable
– Name is a meaningless numerical identifier (URI: Uniform Resource Identifier)
– Any number of human readable labels can be assigned to it
• Definition:
– Genera: is a type of (cell, anatomical structure, cell part)
– Differentia: “has a”: a set of properties that distinguish among members of that class
– Can include necessary and sufficient conditions
• Implementation: How is this definition expressed?
– Depending on the nature of the concept or entity and the needs of the information system, we can say more or fewer things
– Different languages can express different things about the concept that can be computed upon
• OWL (W3C standard), RDF
birnlex_1362: CA2
CHEBI_29108: CA2
NIF follows OBO Foundry best practices for naming and defining classes
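The point about identity can be shown with the two identifiers on this slide, which carry the same human-readable label. The sketch below is a minimal illustration; the dictionary layout and the function name are assumptions, and only the identifiers and the label come from the slide.

```python
# Identifiers and label from the slide; the table layout is an assumption.
LABELS = {
    "birnlex_1362": ["CA2"],
    "CHEBI_29108": ["CA2"],
}


def identifiers_for(label):
    """Return every identifier that carries the given label."""
    return sorted(
        identifier for identifier, labels in LABELS.items()
        if label in labels
    )


matches = identifiers_for("CA2")
print(matches)
```

Because the label alone resolves to more than one identifier, annotations need to store the identifier and treat the label as display text.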
43. • XML: Extensible Markup Language: markup language for data. XML itself is not very much concerned with meaning. XML nodes don't need to be associated with particular concepts, and the XML standard doesn't indicate how to derive a fact from a document.
• RDF: Resource Description Framework: a general method to decompose knowledge into small pieces, with some rules about the semantics, or meaning, of those pieces. What sets RDF apart from XML is that RDF is designed to represent knowledge in a distributed world. That RDF is designed for knowledge, and not data, means RDF is particularly concerned with meaning.
– Small pieces are called “triples”: subject, predicate, object
– Purkinje neuron (S) has neurotransmitter (P) GABA (O)
• RDFS - a method of specifying metadata about properties/characteristics of things and classes of things such that inference can be carried out (conceptualized in RDF)
• OWL (Web Ontology Language) - a more complex (/powerful) extension of RDFS
• SPARQL - a query language designed for RDF (similar to how SQL was designed for relational databases)
http://answers.semanticweb.com/questions/15215/whats-the-difference-between-using-rdfsowl-versus-xml
http://www.rdfabout.com/intro/#Introducing%20RDF
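The triple idea can be sketched without any RDF library. This is a plain-Python illustration only: names are ordinary strings where real RDF would use URIs, the second and third statements are added for the example, and the `match` helper is a hypothetical stand-in for what a SPARQL engine does with a basic pattern.

```python
# Each statement is a (subject, predicate, object) triple.
# Names are plain strings here; real RDF would use URIs.
TRIPLES = [
    ("Purkinje neuron", "has neurotransmitter", "GABA"),
    ("Purkinje neuron", "is a", "neuron"),
    ("GABA", "is a", "neurotransmitter"),
]


def match(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o) for (s, p, o) in TRIPLES
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Roughly the question "which subjects have which neurotransmitter?"
for s, p, o in match(predicate="has neurotransmitter"):
    print(s, p, o)
```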
44. Relational model
• Mouse has age 50 days
• Protocol uses instrument confocal microscope
• A confocal imaging protocol is a protocol that uses instrument confocal microscope
RDF: The computer doesn't need to know what “has” actually means in English for this to be useful. It is left up to the application writer to choose appropriate names for things (confocal microscope) and to use the right predicates (uses, has). RDF tools are ignorant of what these names mean, but they can still usefully process the information. - http://www.rdfabout.com/intro/#Introducing%20RDF
May link to other information, e.g., mouse is a rodent
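The third statement above is a rule that defines a class, and a program can apply it without knowing what the words mean. A minimal sketch, assuming statements are stored as string triples and using "protocol 2" as a hypothetical instance name:

```python
# Statements as (subject, predicate, object); "protocol 2" is hypothetical.
TRIPLES = {
    ("protocol 2", "is a", "protocol"),
    ("protocol 2", "uses instrument", "confocal microscope"),
    ("mouse", "is a", "rodent"),
}


def is_confocal_imaging_protocol(entity, triples):
    """Apply the rule: a protocol that uses instrument confocal microscope."""
    return ((entity, "is a", "protocol") in triples
            and (entity, "uses instrument", "confocal microscope") in triples)


# Derive new statements for every subject that satisfies the rule.
inferred = {
    (s, "is a", "confocal imaging protocol")
    for (s, _, _) in TRIPLES
    if is_confocal_imaging_protocol(s, TRIPLES)
}
print(inferred)
```

The code only compares strings; the usefulness comes from the author having chosen consistent names and predicates.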
45. The thalamus projects to the cortex in mammals
• Universal: allValuesFrom: If a mammal has a cortex and a thalamus, then the thalamus must project to the cortex
• Existential: someValuesFrom: The thalamus projects to the cortex in at least one member of the class mammal
• Disjointness: owl:disjointWith: a member of one class cannot simultaneously be an instance of a specified other class: Reptiles are disjoint from mammals
W3C OWL guide: www.w3.org/TR/2004/REC-owl-guide-20040210/
Restrictions placed on classes allow us to reason over the ontology and check for consistency
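As a rough illustration of consistency checking, the sketch below tests class assertions against a disjointness axiom. It is a toy stand-in for what an OWL reasoner does, not an OWL implementation; the individual names are invented and one is made inconsistent on purpose.

```python
# Disjointness axiom from the slide; individuals are invented.
DISJOINT = {frozenset({"Reptile", "Mammal"})}

TYPES = {
    "animal_1": {"Mammal"},
    "animal_2": {"Reptile", "Mammal"},  # deliberately inconsistent
}


def inconsistent_individuals(types, disjoint):
    """Return individuals asserted to belong to two disjoint classes."""
    return sorted(
        name for name, classes in types.items()
        if any(pair <= classes for pair in disjoint)
    )


print(inconsistent_individuals(TYPES, DISJOINT))
```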
47. 1. Look brain region up in NeuroLex
2. Look up cells contained in the brain region
3. Find those cells that are known to project out of that brain region
4. Look up the neurotransmitters for those cells
5. Determine whether those neurotransmitters are known to be excitatory or inhibitory
6. Report the projection as excitatory or inhibitory, and report the entire chain of logic with links back to the wiki pages where they were made
7. Make sure user can get back to each statement in the logic chain to edit it if they think it is wrong
Stephen Larson
CHEBI:18243
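Steps 1 to 6 can be sketched as a chain of lookups that keeps the statements it used. The dictionaries below are a hypothetical stand-in for NeuroLex content, with illustrative entries rather than curated facts; step 7 (editing a statement at its source page) is outside what this sketch covers.

```python
# Hypothetical knowledge base; entries are illustrative, not curated.
CELLS_IN_REGION = {"cerebellar cortex": ["Purkinje cell", "granule cell"]}
PROJECTS_OUT = {"Purkinje cell": True, "granule cell": False}
NEUROTRANSMITTER = {"Purkinje cell": "GABA", "granule cell": "glutamate"}
EFFECT = {"GABA": "inhibitory", "glutamate": "excitatory"}


def classify_projections(region):
    """Follow steps 1-6 and keep the chain of statements used."""
    reports = []
    for cell in CELLS_IN_REGION.get(region, []):       # steps 1-2
        if not PROJECTS_OUT.get(cell, False):          # step 3
            continue
        transmitter = NEUROTRANSMITTER[cell]           # step 4
        effect = EFFECT[transmitter]                   # step 5
        chain = [                                      # step 6
            f"{region} contains {cell}",
            f"{cell} projects out of {region}",
            f"{cell} releases {transmitter}",
            f"{transmitter} is {effect}",
        ]
        reports.append({"cell": cell, "effect": effect, "chain": chain})
    return reports


reports = classify_projections("cerebellar cortex")
for report in reports:
    print(report["cell"], "->", report["effect"])
    for statement in report["chain"]:
        print("  because:", statement)
```

Returning the chain alongside the answer is what makes the conclusion auditable: each line corresponds to one statement a user could inspect or correct.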
48. [Diagram labels: Brain, Cerebellum, Cortex, Cerebellar Purkinje cell, Purkinje neuron, Purkinje cell soma, Purkinje cell layer, Cerebellar cortex, IP3, Cerebellum]
Gross anatomy ontology; Cell centered anatomy ontology
• To create the linkages requires mapping
• Mapping is usually incomplete and not always possible
• Can’t take advantage of others’ work
Reuse identifiers rather than recreate them
49. • “The trouble is that if I make up all of my own URIs, my RDF document has no meaning to anyone else unless I explain what each URI is intended to denote or mean. Two RDF documents with no URIs in common have no information that can be interrelated.”
• NIF favors reuse of identifiers rather than mapping
• Creating ontologies to be used as common building blocks: modularity and low semantic overhead are important
http://www.rdfabout.com/intro/#Introducing%20RDF
50. [Diagram labels: Cerebellum Purkinje cell soma, Cerebellum Purkinje cell dendrite, Cerebellum Purkinje cell axon (Cell part ontology); Cerebellum granule cell layer (Anatomy ontology), Cerebellum Purkinje cell layer, Cerebellum molecular layer; Calbindin; IP3 (CHEBI:16595); Cerebellum Purkinje neuron (Cell Ontology); Cerebellar cortex; linked by “Has part” and “Is part of” relations]
51. • Neuroscience Information Framework
– NIFSTD available for download
– Ontoquest web services
– NIF annotation services and mapping tools available
– Neurolex available via SPARQL endpoint
• Bioportal: Collection of > 300 ontologies covering many domains
– Automated mapping between ontologies
– Annotation services
– Web services for access
• OBO Foundry: http://www.obofoundry.org/
– Collection of community ontologies designed according to OBO Foundry principles
• Protégé Ontology editor: Editing tool for constructing ontologies. Excellent short course available for Protégé/OWL.
• Program on Ontologies of Neural Structures (INCF): CUMBO, Neurolex Wiki, Scalable Brain Atlas
You can enhance your tools and annotation with community ontologies
52. http://neurolex.org
Larson et al, Frontiers in Neuroinformatics, in press
• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Light weight semantics
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based:
• Anyone can contribute their terms, concepts, things
• Anyone can edit
• Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
Demo D03
200,000 edits; 150 contributors
53. Red Links: Information is missing (or misspelled)
54. • Neurolex provides an on-line computable index for expressing models in semantic terms, and linking to other knowledge and data
• INCF task forces are contributing knowledge
• Neuroscience knowledge in the web
Builds a knowledge base by cross-modular relations and links to data
55. Once terms have been proposed and vetted by the neuroscience community, NIF feeds them back to general ontologies to enrich coverage of neuroscience
57. • INCF Project
– Neuron Registry
– > 30 experts worldwide
– Fill out neuron pages in Neurolex Wiki
– Led by Dr. Gordon Shepherd
[Chart: number (0 to 300) of total redlinks, easy fixes and hard fixes for soma location, dendrite location and axon location]
Social networks and community sites let us learn things from the collective behavior of contributors
INCF Knowledge Space
58. • Of the ~4000 columns that NIF queries, ~1300 map to one of our core categories:
– Organism
– Anatomical structure
– Cell
– Molecule
– Function
– Dysfunction
– Technique
• 30-50% of NIF’s queries autocomplete
• When NIF combines multiple sources, a set of common fields emerges
– Basic information models/semantic models exist for certain types of entities
Biomedical science does have a conceptual framework, but we don’t place undue importance on it; it must tie to data
60. • NIF can be used to survey the data landscape
• Analysis of NIF shows multiple databases with similar scope and content
• Many contain partially overlapping data
• Data “flows” from one resource to the next
– Data is reinterpreted, reanalyzed or added to
• Is duplication good or bad?
NIF is trying to make it easier to work with diverse data
61. NIF is in a unique position to answer questions about the neuroscience landscape
Where are the data?
[Chart: brain region (Striatum, Hypothalamus, Olfactory bulb, Cerebral cortex, Brain) by data source]
62. ∞
What is easily machine processable and accessible
What is potentially knowable
What is known: Literature, images, human knowledge
Unstructured; natural language processing, entity recognition, image processing and analysis; communication
“Known unknowns vs unknown unknowns”
Open world meets closed world
63. Comprehensive and unbiased?
We know a lot about some things and less about others; some of NIF’s sources are comprehensive; others are highly biased
But...NIF has > 2M antibodies, 338,000 model organisms, and 3 million microarray records
64. [Figure labels: Neocortex, Olfactory bulb, Neostriatum, Cochlear nucleus]
All neurons with cell bodies in the same brain region are grouped together
Properties in Neurolex
65. NIF is in a unique position to answer questions about the neuroscience landscape
Where are the data?
[Chart: brain region (Striatum, Hypothalamus, Olfactory bulb, Cerebral cortex, Brain) by data source; Funding]
66. • Requires account in MyNIF
• Still a work in progress, i.e., it breaks a lot
• If you are interested, contact us!
Vadim Astakhov, Kepler Workflow Engine
67. • Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
• Analysis:
• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment
Relatively simple standards would make life easier
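The baseline problem described above can be sketched as a normalization step before comparison. This is an illustration only: the gene symbols and directions are invented, the two dictionaries merely stand in for the two databases, and matching genes across different identifier schemes (Gene ID, gene name, probe ID) is assumed to have been done already.

```python
# Hypothetical records; gene symbols and directions are invented.
# Source A reports change relative to a chronic-morphine baseline,
# source B reports change relative to saline.
source_a = {"gene1": "up", "gene2": "down", "gene3": "up"}
source_b = {"gene1": "down", "gene2": "up", "gene3": "up"}

FLIP = {"up": "down", "down": "up"}


def compare(a, b):
    """Flip source A to the saline baseline, then compare shared genes."""
    outcome = {}
    for gene in sorted(a.keys() & b.keys()):
        normalized = FLIP[a[gene]]
        outcome[gene] = "consistent" if normalized == b[gene] else "opposite"
    return outcome


comparison = compare(source_a, source_b)
print(comparison)
```

If each source recorded its baseline in a standard field, the flip could be decided from the data instead of from reading the paper.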
69. 47/50 major preclinical published cancer studies could not be replicated
• “The scientific community assumes that the claims in a preclinical study can be taken at face value - that although there might be some errors in detail, the main message of the paper can be relied on and the data will, for the most part, stand the test of time. Unfortunately, this is not always the case.”
• Getting data out sooner in a form where they can be exposed to many eyes and many analyses may allow us to expose errors and develop better metrics to evaluate the validity of data
Begley and Ellis, Nature, vol. 483, p. 531, 29 March 2012
70. NIF favors a hybrid, tiered, federated system
• Domain knowledge
– Ontologies
• Claims, models and observations
– Virtuoso RDF triples
– Model repositories
• Data
– Data federation
– Spatial data
– Workflows
• Narrative
– Full text access
[Diagram labels: Neuron, Brain part, Disease, Organism, Gene, Technique, People]
Caudate projects to Snpc; Grm1 is upregulated in chronic cocaine; Betz cells degenerate in ALS
NIF provides the tentacles that connect the pieces: a new type of entity for 21st century science
71. • Several powerful trends should change the way we think about our data: One → Many
– Many data
• Generation of data is getting easier → shared data
• Data space is getting richer: more –omes every day
• But...compared to the biological space, still sparse
– Many eyes
• Wisdom of crowds
• More than one way to interpret data
– Many algorithms
• Not a single way to analyze data
– Many analytics
• “Signatures” in data may not be directly related to the question for which they were acquired but tell us something really interesting
Are you exposing or burying your work?
72. • You (and the machine) have to be able to find it
– Accessible through the web
– Structured or semi-structured
– Annotations
• You (and the machine) have to be able to use it
– Data type specified and in an actionable form
• You (and the machine) have to know what the data mean
• Semantics
• Context: Experimental metadata
• Provenance: where did they come from
Reporting neuroscience data within a consistent framework helps enormously, but the frameworks need not be onerous
75. Jeff Grethe, UCSD, Co Investigator, Interim PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11