This presentation discusses the challenges of managing and using the large volume of neuroscience data being generated. It notes that about half of the researchers polled store data only in their own laboratories, and that most have no funding to support archiving. The Neuroscience Information Framework (NIF) addresses these issues by building a catalog and federation of neuroscience resources to support discovery, access, analysis and integration of data. NIF has assembled the largest searchable collection of neuroscience data on the web, using an ontology and technologies that can search the "hidden web" of resources.
Neuroscience Data Discovery and Integration Challenges
1. Maryann E. Martone, Ph.D., University of California, San Diego
2. “A grand challenge in neuroscience is to elucidate brain function in relation to its multiple layers of organization that operate at different spatial and temporal scales. Central to this effort is tackling “neural choreography” -- the integrated functioning of neurons into brain circuits -- Neural choreography cannot be understood via a purely reductionist approach. Rather, it entails the convergent use of analytical and synthetic tools to gather, analyze and mine information from each level of analysis, and capture the emergence of new layers of function (or dysfunction) as we move from studying genes and proteins, to cells, circuits, thought, and behavior.... However, the neuroscience community is not yet fully engaged in exploiting the rich array of data currently available, nor is it adequately poised to capitalize on the forthcoming data explosion.”
Akil et al., Science, Feb 11, 2011
3. • In that same issue of Science
  – Asked peer reviewers from last year about the availability and use of data
• About half of those polled store their data only in their laboratories, not an ideal long-term solution.
• Many bemoaned the lack of common metadata and archives as a main impediment to using and storing data, and most of the respondents have no funding to support archiving
• And even where accessible, much data in many fields is too poorly organized to enable it to be efficiently used.
“...it is a growing challenge to ensure that data produced during the course of reported research are appropriately described, standardized, archived, and available to all.” Lead Science editorial, 2011
4. Neuroscience is unlikely to be served by a few large databases like the genomics and proteomics community
Figure labels: Whole brain data (20 um microscopic MRI); Mosaic LM images (1 GB+); Conventional LM images; Individual cell morphologies; EM volumes & reconstructions; Solved molecular structures
No single technology serves these all equally well.
Multiple data types; multiple scales; multiple databases
6. • Current web is designed to share documents
  – Documents are unstructured data
• Much of the content of digital resources is part of the “hidden web”
• Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.
7. • NIF has developed a production technology platform for researchers to:
  – Discover
  – Share
  – Analyze
  – Integrate
  neuroscience-relevant information
• Since 2008, NIF has assembled the largest searchable catalog of neuroscience data and resources on the web
• Cost-effective and innovative strategy for managing data assets
“This unique data depository serves as a model for other Web sites to provide research data.” - Choice Reviews Online
NIF is poised to capitalize on the new tools and emphasis on big data and open science
8. http://neuinfo.org
June 10, 2013, dkCOIN Investigator's Retreat
• A portal for finding and using neuroscience resources
A consistent framework for describing resources
Provides simultaneous search of multiple types of information, organized by category
Supported by an expansive ontology for neuroscience
Utilizes advanced technologies to search the “hidden web”
UCSD, Yale, Cal Tech, George Mason, Washington Univ
Figure labels: Literature; Database Federation; Registry
9. • NIF Registry: A catalog of neuroscience-relevant resources
• > 6000 currently listed
• > 2200 databases
• And we are finding more every day
“Of relevance to neuroscience” is very broad
10. • NIF curators
• Nomination by the community
• Semi-automated text mining pipelines
NIF Registry: Requires no special skills. Site map available for local hosting
• NIF Data Federation
• DISCO interop
• Requires some programming skill
Low barrier to entry
11. Simple metadata model: Name, description, type, URL, other names, keywords, unique identifier
• Extended over time
  – Parent resource
  – Supporting agency
  – Grant numbers
  – Accessibility
  – Related to
  – Organism
  – Disease or condition
  – Last updated
First catalog: SFN Neuroscience Database Gateway; NIF 0.5; NIF 1.0+ (timeline marks: ~2003, 2006, 2008)
12. • NIF Registry is hosted on Semantic Media Wiki platform Neurolex
  – Community can add, review, edit without special privileges
  – Searchable by Google
  – Integrated with NIF ontologies
  – Graph structure
Semantic wiki: A wiki with semantics; pages are linked through relationships
14. – NIF employs an automated link checker
– Last analysis: 478/6100 invalid URLs (~8%)
– 199 can’t locate at another university or location; out of service (~3%)
– Bigger issue: Many resources are no longer updated or maintained
[Chart: resources added and last updated, by year, 1996 to 2014]
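The slide does not describe how NIF's link checker is built. As a minimal sketch of the general idea, assuming a plain HTTP HEAD request is an acceptable liveness test, the following Python checks a list of registry URLs and reports the invalid fraction. The function names (`http_status`, `find_invalid`, `invalid_fraction`) are hypothetical and not part of any NIF software.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was obtained."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return 0


def find_invalid(urls: Iterable[str],
                 status_of: Callable[[str], int] = http_status) -> List[str]:
    """Return the URLs whose status is outside the 2xx/3xx range."""
    return [url for url in urls if not 200 <= status_of(url) < 400]


def invalid_fraction(n_invalid: int, n_total: int) -> float:
    """Fraction of checked URLs that were invalid."""
    return n_invalid / n_total


# The figures reported on the slide: 478 invalid URLs out of 6100 checked.
print(f"{100 * invalid_fraction(478, 6100):.1f}% invalid")
```

Passing `status_of` as a parameter keeps the classification logic testable without network access. A check like this only detects dead links; it says nothing about the larger problem the slide raises, resources that still resolve but are no longer maintained.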
15. Keeping content up to date
Connectome, Tractography, Epigenetics
• New tags come into existence
• New resource types come into existence, e.g., Mobile apps
• Resources add new types of content
• Change name
• Change scope
• > 7000 updates to the registry last year
It’s a challenge to keep the registry up to date; sitemaps, curation, ontologies, community review
16. • The NIF Registry has created a linked data graph of web-accessible resources
• Maintained on a community wiki platform
• Provides data on the fluidity of the resource landscape
  – New resources continue to be created and found
  – Relatively few disappear altogether
  – Many more grow stale, although their value may still be significant
  – Maintaining up to date curation requires frequent updating
NIF Registry provides insight into the state of digital resources on the web
17. • The NIF data federation performs deep search over the content of over 200 databases
• New databases are added at a rate of 25-40 per year
• Latest update: Open Source Brain; ingest completed in 2 hours
• Databases chosen on a variety of criteria:
  • Early: testing different types of resources
  • Thematic areas
  • Volunteers
NIF provides access to the largest aggregation of neuroscience-relevant information on the web
18. • NIF was one of the first projects to attempt data integration in the neurosciences on a large scale
• NIF is supported by a contract that specified the number of resources to be added per year
  – Designed to be populated rapidly; set up process for progressive refinement
  – No budget was allocated to retrofit existing resources; had to work with them in their current state
  – We designed a system that required little to no cooperation or work from providers
  – Supports many formats: relational, XML, RDF
19. DISCO Dashboard Functions (Current, Planned)
• Ingest Script Manager
• Public Script Repository
• Data & Event Tracker
• Versioning System
• Curator Tool
• Data Transformer Manager
Luis Marenco, Rixin Wang, Perry Miller, Gordon Shepherd, Yale University
20. NIF searches the largest collation of neuroscience-relevant data on the web
[Chart (DISCO): Number of Federated Databases and Number of Federated Records (Millions), 6-12 through 5-17]
22. Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”
Query expansion: Synonyms and related concepts
Boolean queries
Data sources categorized by “data type” and level of nervous system
Common views across multiple sources
Tutorials for using full resource when getting there from NIF
Link back to record in original source
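As an illustration of synonym-based query expansion, the sketch below turns a single search term into a Boolean OR query like the one on the slide. The synonym table is a toy stand-in for an ontology lookup, and `expand_query` is a hypothetical helper, not NIF's actual search code.

```python
from typing import Dict, List

# Toy synonym table; a real system would look these up in an ontology.
SYNONYMS: Dict[str, List[str]] = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def quote_if_phrase(term: str) -> str:
    """Wrap multi-word terms in double quotes so they match as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(term: str, synonyms: Dict[str, List[str]]) -> str:
    """Join a term and its known synonyms with OR; unknown terms pass through."""
    alternatives = [term] + synonyms.get(term.lower(), [])
    return " OR ".join(quote_if_phrase(t) for t in alternatives)


print(expand_query("Hippocampus", SYNONYMS))
# prints: Hippocampus OR "Cornu Ammonis" OR "Ammon's horn"
```

The slide also mentions expansion to related concepts; that would need relationships beyond synonymy and is not covered here.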
23. Connectivity terms used by different resources: Connects to; Synapsed with; Synapsed by; Input region; innervates; Axon innervates; Projects to; Cellular contact; Subcellular contact; Source site; Target site
Each resource implements a different, though related model; systems are complex and difficult to learn, in many cases
24. • NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  • Brain Architecture Management System (rodent)
  • Temporal lobe.com (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (Human fMRI)
  • Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
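The slide does not say how these overlap counts were computed. One plausible way to count exact and synonym matches across databases is sketched below. The database names, term lists and synonym table are invented for illustration, and `shared_terms` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List, Optional, Set


def normalize(term: str) -> str:
    """Lower-case a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def shared_terms(db_terms: Dict[str, List[str]],
                 synonyms: Optional[Dict[str, str]] = None) -> Set[str]:
    """Return the labels used by more than one database.

    With synonyms=None only exact (normalized) matches count. Otherwise
    each term is first mapped to its preferred label, so synonym matches
    are counted as well.
    """
    synonyms = synonyms or {}
    counts: Counter = Counter()
    for terms in db_terms.values():
        labels = {synonyms.get(normalize(t), normalize(t)) for t in terms}
        counts.update(labels)  # each database contributes a label at most once
    return {label for label, n in counts.items() if n > 1}


# Invented example data, not taken from the databases named on the slide.
toy = {
    "db_a": ["Hippocampus", "CA1"],
    "db_b": ["hippocampus ", "Striatum"],
    "db_c": ["Ammon's horn", "Cerebellum"],
}
print(shared_terms(toy))
print(shared_terms(toy, synonyms={"ammon's horn": "hippocampus"}))
```

Partonomy matches, the third count on the slide, would additionally need part-of relationships from an ontology.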
25. – You (and the machine) have to be able to find it
  • Accessible through the web
  • Annotations
– You have to be able to access and use it
  • Data type specified and in a usable form
– You have to know what the data mean
  • Some semantics: “1”
  • Context: Experimental metadata
  • Provenance: Where did the data come from?
Reporting neuroscience data within a consistent framework helps enormously
26. Knowledge in space and spatial relationships (the “where”)
Knowledge in words, terminologies and logical relationships (the “what”)
27. • NIF covers multiple structural scales and domains of relevance to neuroscience
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
NIFSTD modules: Organism; NS Function; Molecule; Investigation; Subcellular structure; Macromolecule; Gene; Molecule Descriptors; Techniques; Reagent; Protocols; Cell; Resource; Instrument; Dysfunction; Quality; Anatomical Structure
NIF capitalizes on the growing set of community ontologies available in biomedical science
28. Figure labels: Purkinje Cell; Axon Terminal; Axon; Dendritic Tree; Dendritic Spine; Dendrite; Cell body; Cerebellar cortex
There is little obvious connection between data sets taken at different scales using different microscopies without an explicit representation of the biological objects that the data represent
29. Brain (has a) Cerebellum (has a) Purkinje Cell Layer (has a) Purkinje cell (is a) neuron
• Ontology: an explicit, formal representation of concepts and the relationships among them within a particular domain that expresses human knowledge in a machine readable form
  – Branch of philosophy: a theory of what is
  – e.g., Gene ontologies
• Provide universals for navigating across different data sources
  – Semantic “index”
• Provide the basis for concept-based queries to probe and mine data
  – Perform reasoning
  – Link data through relationships, not just one-to-one mappings
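To illustrate linking through relationships rather than one-to-one mappings, the sketch below stores the slide's example as subject-relation-object triples and follows one relation transitively. The relation names `has_part` and `is_a` and the function `reachable` are illustrative choices, not NIFSTD identifiers.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# The chain drawn on the slide, written as (subject, relation, object).
TRIPLES: List[Triple] = [
    ("Brain", "has_part", "Cerebellum"),
    ("Cerebellum", "has_part", "Purkinje cell layer"),
    ("Purkinje cell layer", "has_part", "Purkinje cell"),
    ("Purkinje cell", "is_a", "Neuron"),
]


def reachable(start: str, relation: str, triples: List[Triple]) -> Set[str]:
    """Return every node reachable from start by following relation repeatedly."""
    found: Set[str] = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for subject, rel, obj in triples:
            if subject == node and rel == relation and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


# A query for data about the brain can also return data annotated with its parts.
print(sorted(reachable("Brain", "has_part", TRIPLES)))
# prints: ['Cerebellum', 'Purkinje cell', 'Purkinje cell layer']
```

A concept-based query would then match records annotated with any of the returned terms, which is what lets data sets recorded at different scales be retrieved together.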
30. “Search computing”
What genes are upregulated by drugs of abuse in the adult mouse?
Matched concepts: Morphine; Increased expression; Adult Mouse
Some concepts, e.g., age category, are quantitative but still must be interpreted in a global query system
33. http://neurolex.org
Stephen Larson
• Provide a simple interface for defining the concepts required
• Light weight semantics
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based:
  • Anyone can contribute their terms, concepts, things
  • Anyone can edit
  • Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
Demo D03
200,000 edits; 150 contributors
34. • NIF can be used to survey the data landscape
• Analysis of NIF shows multiple databases with similar scope and content
• Many contain partially overlapping data
• Data “flows” from one resource to the next
  – Data is reinterpreted, reanalyzed or added to
• Is duplication good or bad?
35. Databases come in many shapes and sizes
• Primary data:
  – Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data
  – Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data
  – Claims and assertions about the meaning of data
    • E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries:
  – Metadata
  – Pointers to data sets or materials stored elsewhere
• Data aggregators
  – Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
  – Data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
36. NIF Analytics: The Neuroscience Landscape
NIF is in a unique position to answer questions about the neuroscience landscape
Where are the data?
[Figure: brain region by data source; regions shown: Striatum, Hypothalamus, Olfactory bulb, Cerebral cortex, Brain]
Vadim Astakhov, Kepler Workflow Engine
37. Adding more semantics
The combination of ontologies, diverse data and analytics lets us look at the current landscape in interesting ways
[Figure: Diseases of nervous system (Neurodegenerative; Seizure disorders; Neoplastic disease of nervous system) in NIH Reporter and NIF data federated sources]
38. • Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
• Analysis:
  • 1370 statements from Gemma regarding gene expression as a function of chronic morphine
  • 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
  • Results for 1 gene were opposite in DRG and Gemma
  • 45 did not have enough information provided in the paper to make a judgment
Relatively simple standards would make life easier
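The baseline mismatch described above can be handled by converting every statement to a common reference before comparing. The sketch below assumes each statement is reduced to a direction ("up" or "down") plus the baseline it was measured against. This record shape and the function names are hypothetical, not the schema of Gemma or DRG.

```python
from typing import Tuple

Statement = Tuple[str, str]  # (direction, baseline)

FLIP = {"up": "down", "down": "up"}


def relative_to_saline(direction: str, baseline: str) -> str:
    """Express a direction of change relative to a saline baseline."""
    if baseline == "saline":
        return direction
    if baseline == "chronic morphine":
        # The same change seen from the opposite baseline has the opposite sign.
        return FLIP[direction]
    raise ValueError(f"unknown baseline: {baseline!r}")


def consistent(a: Statement, b: Statement) -> bool:
    """True if two statements agree once both use the same baseline."""
    return relative_to_saline(*a) == relative_to_saline(*b)


# "up" against chronic morphine and "down" against saline describe the same change.
print(consistent(("up", "chronic morphine"), ("down", "saline")))  # prints: True

# Share of Gemma statements that were consistent with DRG, from the slide's counts.
print(f"{617 / 1370:.0%}")  # prints: 45%
```

If each database recorded its baseline explicitly, this conversion would be mechanical, which is the kind of simple standard the slide argues for.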
39. NIF favors a hybrid, tiered, federated system
• Domain knowledge
  – Ontologies
• Claims, models and observations
  – Virtuoso RDF triples
  – Model repositories
• Data
  – Data federation
  – Spatial data
  – Workflows
• Narrative
  – Full text access
Figure labels: Neuron; Brain part; Disease; Organism; Gene; Technique; People
Example statements: Caudate projects to SNpc; Grm1 is upregulated in chronic cocaine; Betz cells degenerate in ALS
NIF provides the tentacles that connect the pieces: a new type of entity for 21st century science
40. • 2006-2008: A survey of what was out there
• 2008-2009: Strategy for resource discovery
  – NIF Registry vs NIF data federation
  – Ingestion of data contained within different technology platforms, e.g., XML vs relational vs RDF
  – Effective search across semantically diverse sources
    • NIFSTD ontologies
• 2009-2011: Strategy for data integration
  – Unified views across common sources
  – Mapping of content to NIF vocabularies
• 2011-present: Data analytics
  – Uniform external data references
• 2012-present: SciCrunch: unified biomedical resource services
NIF provides a strategy and set of tools applicable to all domains grappling with multiple sources of diverse data (i.e., pretty much everything)
41. • Search semantics
• Ranking
• Resources supported by NIH Blueprint Institutes are more thoroughly covered
• Data types, e.g., Brain activation foci
42. SciCrunch (not just a data catalog)
Figure labels: NIF; MONARCH; Community Services; dkCOIN; Shared Resources; Undiagnosed Disease Program; Phenotype RCN; 3D Virtual Cell; National Institute on Aging; One Mind for Research; BIRN; International Neuroinformatics Coordinating Facility; Model Organism Databases; Community Outreach; DELSA
43. SciCrunch
• 3dVC: Focus on models and simulation
• Gene Ontology: Focus on bioinformatics tools
• National Institute on Aging: Aging-related data sets
• Monarch: Phenotype-Genotype; deep semantic data integration
• One Mind for Research: Biospecimen repositories
• NeuroGateway: Computational resources
• FORCE11: Tools for next-gen publishing and e-scholarship
SciCrunch is actively supporting multiple communities; multiple communities are enriching and improving SciCrunch
44. “How do I share my data/tool?” “There is no database for my data”
Figure labels (steps 1 to 4): Community database: beginning; Community database: End; Institutional repositories; Cloud; INCF: Global infrastructure; Government; Education; Industry; Tool repositories
NIF is designed to leverage existing investments in resources and infrastructure
45. • No one can be stopped from doing what they need to do
• Every resource is resource limited: few have enough time, money, staff or expertise required to do everything they would like
  – If the market can support 11 MRI databases, fine
  – Some consolidation, coordination is warranted though
• Big, broad and messy beats small, narrow and neat
  – Without trying to integrate a lot of data, we will not know what needs to be done
  – A lot can be done with messy data; neatness helps though
  – Progressive refinement; addition of complexity through layers
• Be flexible and opportunistic
  – A single optimal technology/container for all types of scientific data and information does not exist; technology is changing
• Think globally; act locally:
  – No source, not even NIF, is THE source; we are all a source
46. • Several powerful trends should change the way we think about our data: One → Many
  – Many data
    • Generation of data is getting easier; shared data
    • Data space is getting richer: more –omes everyday
    • But...compared to the biological space, still sparse
  – Many eyes
    • Wisdom of crowds
    • More than one way to interpret data
  – Many algorithms
    • Not a single way to analyze data
  – Many analytics
    • “Signatures” in data may not be directly related to the question for which they were acquired but tell us something really interesting
Are you exposing or burying your work?
47. Jeff Grethe, UCSD, Co Investigator, Interim PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11