Neurosciences Information Framework (NIF): An example of community Cyberinfrastructure for the Neurosciences

Maryann E. Martone, Ph.D.
University of California, San Diego

“A grand challenge in neuroscience is to elucidate brain function in relation to its multiple layers of organization that operate at different spatial and temporal scales. Central to this effort is tackling “neural choreography” -- the integrated functioning of neurons into brain circuits -- Neural choreography cannot be understood via a purely reductionist approach. Rather, it entails the convergent use of analytical and synthetic tools to gather, analyze and mine information from each level of analysis, and capture the emergence of new layers of function (or dysfunction) as we move from studying genes and proteins, to cells, circuits, thought, and behavior.... However, the neuroscience community is not yet fully engaged in exploiting the rich array of data currently available, nor is it adequately poised to capitalize on the forthcoming data explosion.”

Akil et al., Science, Feb 11, 2011

•  In that same issue of Science
–  Asked peer reviewers from last year about the availability and use of data
•  About half of those polled store their data only in their laboratories, not an ideal long-term solution.
•  Many bemoaned the lack of common metadata and archives as a main impediment to using and storing data, and most of the respondents have no funding to support archiving
•  And even where accessible, much data in many fields is too poorly organized to enable it to be efficiently used.

“...it is a growing challenge to ensure that data produced during the course of reported research are appropriately described, standardized, archived, and available to all.”

Lead Science editorial, 2011

Neuroscience is unlikely to be served by a few large databases like the genomics and proteomics community

•  Whole brain data (20 um microscopic MRI)
•  Mosaic LM images (1 GB+)
•  Conventional LM images
•  Individual cell morphologies
•  EM volumes & reconstructions
•  Solved molecular structures

No single technology serves these all equally well.

Multiple data types; multiple scales; multiple databases

•  Current web is designed to share documents
–  Documents are unstructured data
•  Much of the content of digital resources is part of the “hidden web”
•  Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.

•  NIF has developed a production technology platform for researchers to:
–  Discover
–  Share
–  Analyze
–  Integrate
neuroscience-relevant information
•  Since 2008, NIF has assembled the largest searchable catalog of neuroscience data and resources on the web
•  Cost-effective and innovative strategy for managing data assets

“This unique data depository serves as a model for other Web sites to provide research data.” - Choice Reviews Online

NIF is poised to capitalize on the new tools and emphasis on big data and open science

http://neuinfo.org

June 10, 2013, dkCOIN Investigator's Retreat

•  A portal for finding and using neuroscience resources
–  A consistent framework for describing resources
–  Provides simultaneous search of multiple types of information, organized by category
–  Supported by an expansive ontology for neuroscience
–  Utilizes advanced technologies to search the “hidden web”

UCSD, Yale, Cal Tech, George Mason, Washington Univ

Literature
Database Federation
Registry

• NIF Registry: A catalog of neuroscience-relevant resources
• > 6000 currently listed
• > 2200 databases
• And we are finding more every day

“Of relevance to neuroscience” is very broad


• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines

• NIF Registry
–  Requires no special skills
–  Site map available for local hosting
• NIF Data Federation
–  DISCO interop
–  Requires some programming skill

Low barrier to entry

•  Extended over time
–  Parent resource
–  Supporting agency
–  Grant numbers
–  Accessibility
–  Related to
–  Organism
–  Disease or condition
–  Last updated

Timeline: First catalog: SFN Neuroscience Database Gateway (~2003); NIF 0.5 (2006); NIF 1.0+ (2008)

Simple metadata model: Name, description, type, URL, other names, keywords, unique identifier
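The metadata model on this slide can be pictured as a small record type. The sketch below is only an illustration: the class name `RegistryEntry`, the field names, and the example values are assumptions chosen to mirror the slide's list, not the actual NIF Registry schema.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class RegistryEntry:
    """Hypothetical record mirroring the slide's metadata fields."""
    # Core fields of the simple metadata model
    name: str
    description: str
    resource_type: str
    url: str
    unique_identifier: str
    other_names: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    # Fields listed under "Extended over time"
    parent_resource: Optional[str] = None
    supporting_agency: Optional[str] = None
    grant_numbers: List[str] = field(default_factory=list)
    accessibility: Optional[str] = None
    related_to: List[str] = field(default_factory=list)
    organism: Optional[str] = None
    disease_or_condition: Optional[str] = None
    last_updated: Optional[str] = None


# Placeholder values, invented purely for illustration.
entry = RegistryEntry(
    name="Example Resource",
    description="A placeholder entry used only for illustration.",
    resource_type="database",
    url="http://example.org",
    unique_identifier="example-0001",
)
record = asdict(entry)  # plain dict, e.g. for export
```

Keeping the extended fields optional reflects the idea that a resource can be registered with only the simple fields and enriched later.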

•  NIF Registry is hosted on the Semantic MediaWiki platform, Neurolex
–  Community can add, review, edit without special privileges
–  Searchable by Google
–  Integrated with NIF ontologies
–  Graph structure

Semantic wiki: A wiki with semantics; pages are linked through relationships

NIF is creating the linked data graph of resources

–  NIF employs an automated link checker
–  Last analysis: 478/6100 invalid URLs (~8%)
–  199 could not be located at another university or location: out of service (~3%)
–  Bigger issue: Many resources are no longer updated or maintained
[Chart: resources added and resources last updated, by year, 1996-2014]

Keeping content up to date

Connectome, Tractography, Epigenetics

• New tags come into existence
• New resource types come into existence, e.g., mobile apps
• Resources add new types of content
• Change name
• Change scope
• > 7000 updates to the registry last year

It's a challenge to keep the registry up to date; sitemaps, curation, ontologies, community review

• The NIF Registry has created a linked data graph of web-accessible resources
• Maintained on a community wiki platform
• Provides data on the fluidity of the resource landscape
–  New resources continue to be created and found
–  Relatively few disappear altogether
–  Many more grow stale, although their value may still be significant
–  Maintaining up-to-date curation requires frequent updating

NIF Registry provides insight into the state of digital resources on the web

• The NIF data federation performs deep search over the content of over 200 databases
• New databases are added at a rate of 25-40 per year
• Latest update: Open Source Brain; ingest completed in 2 hours
• Databases chosen on a variety of criteria:
–  Early: testing different types of resources
–  Thematic areas
–  Volunteers

NIF provides access to the largest aggregation of information on the web

•  NIF was one of the first projects to attempt data integration in the neurosciences on a large scale
•  NIF is supported by a contract that specified the number of resources to be added per year
–  Designed to be populated rapidly; set up process for progressive refinement
–  No budget was allocated to retrofit existing resources; had to work with them in their current state
–  We designed a system that required little to no cooperation or work from providers
–  Supports many formats: relational, XML, RDF

Current
Planned

DISCO Dashboard Functions
•  Ingest Script Manager
•  Public Script Repository
•  Data & Event Tracker
•  Versioning System
•  Curator Tool
•  Data Transformer Manager

Luis Marenco, Rixin Wang, Perry Miller, Gordon Shepherd
Yale University

[Chart: number of federated databases and number of federated records (millions) over time, x-axis 6-12 through 5-17]

NIF searches the largest collation of data on the web

DISCO


Results categorized by data type and level of nervous system

Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”

•  Query expansion: synonyms and related concepts
•  Boolean queries
•  Data sources categorized by “data type” and level of nervous system
•  Common views across multiple sources
•  Tutorials for using full resource when getting there from NIF
•  Link back to record in original source
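The query expansion shown on this slide (a term rewritten as a Boolean OR over its synonyms) can be sketched in a few lines. The synonym table and the function names below are hypothetical stand-ins; in NIF the synonyms come from its ontologies, and the real expansion logic is not given in the slides.

```python
from typing import Dict, List

# Hypothetical synonym table holding only the slide's example.
SYNONYMS: Dict[str, List[str]] = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def quote(term: str) -> str:
    """Quote multi-word terms so a search engine treats them as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(term: str, synonyms: Dict[str, List[str]] = SYNONYMS) -> str:
    """Rewrite a single term as an OR over the term and its known synonyms."""
    alternatives = [term] + synonyms.get(term.lower(), [])
    return " OR ".join(quote(t) for t in alternatives)


expanded = expand_query("Hippocampus")
```

A term with no entry in the table is passed through unchanged, so the expansion never narrows a search.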

Connects to; Synapsed with; Synapsed by; Input region; innervates; Axon innervates; Projects to; Cellular contact; Subcellular contact; Source site; Target site

Each resource implements a different, though related model; systems are complex and difficult to learn, in many cases

• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
–  Brain Architecture Management System (rodent)
–  Temporal lobe.com (rodent)
–  Connectome Wiki (human)
–  Brain Maps (various)
–  CoCoMac (primate cortex)
–  UCLA Multimodal database (Human fMRI)
–  Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
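The three kinds of term match counted above (exact, synonym, first-order partonomy) can be illustrated with a small classifier. Everything here is an assumption for illustration: the function name, the lookup tables, and the toy entries are not taken from NIF, which would draw synonyms and part-of relations from its ontologies.

```python
from typing import Dict, Set


def classify_match(a: str, b: str,
                   synonyms: Dict[str, Set[str]],
                   part_of: Dict[str, str]) -> str:
    """Classify how two brain-region terms from different databases relate."""
    a_n, b_n = a.lower(), b.lower()
    if a_n == b_n:
        return "exact"
    if b_n in synonyms.get(a_n, set()) or a_n in synonyms.get(b_n, set()):
        return "synonym"
    # First-order partonomy: one term is the direct parent of the other.
    if part_of.get(a_n) == b_n or part_of.get(b_n) == a_n:
        return "partonomy"
    return "none"


# Toy lookup tables, invented for illustration only.
toy_synonyms = {"hippocampus": {"cornu ammonis", "ammon's horn"}}
toy_part_of = {"ca1": "hippocampus"}
```

Checking the cheapest relation first means each pair of terms is counted in only one category.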

– You (and the machine) have to be able to find it
•  Accessible through the web
•  Annotations
– You have to be able to access and use it
•  Data type specified and in a usable form
– You have to know what the data mean
•  Some semantics: “1”
•  Context: Experimental metadata
•  Provenance: Where did the data come from?

Reporting neuroscience data within a consistent framework helps enormously

Knowledge in space and spatial relationships (the “where”)
Knowledge in words, terminologies and logical relationships (the “what”)

•  NIF covers multiple structural scales and domains of relevance to neuroscience
•  Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology

NIFSTD: Organism; NS Function; Molecule; Investigation; Subcellular structure; Macromolecule; Gene; Molecule Descriptors; Techniques; Reagent; Protocols; Cell; Resource; Instrument; Dysfunction; Quality; Anatomical Structure

NIF capitalizes on the growing set of community ontologies available in biomedical science

Purkinje Cell; Axon Terminal; Axon; Dendritic Tree; Dendritic Spine; Dendrite; Cell body; Cerebellar cortex

There is little obvious connection between data sets taken at different scales using different microscopies without an explicit representation of the biological objects that the data represent

[Diagram: Brain (has a) Cerebellum (has a) Purkinje Cell Layer (has a) Purkinje cell (is a) neuron]

•  Ontology: an explicit, formal representation of concepts and the relationships among them within a particular domain that expresses human knowledge in a machine-readable form
–  Branch of philosophy: a theory of what is
–  e.g., Gene ontologies
•  Provide universals for navigating across different data sources
–  Semantic “index”
•  Provide the basis for concept-based queries to probe and mine data
–  Perform reasoning
–  Link data through relationships, not just one-to-one mappings
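A minimal sketch of how explicit relationships support simple reasoning, using the Brain / Cerebellum / Purkinje cell example from the preceding slide. The triple layout, the relation names `has_a` and `is_a`, and the function `parts_of` are assumptions for illustration, not NIFSTD's actual representation.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# (subject, relation, object) statements read off the example chain.
TRIPLES: List[Triple] = [
    ("Brain", "has_a", "Cerebellum"),
    ("Cerebellum", "has_a", "Purkinje Cell Layer"),
    ("Purkinje Cell Layer", "has_a", "Purkinje cell"),
    ("Purkinje cell", "is_a", "neuron"),
]


def parts_of(whole: str, triples: List[Triple] = TRIPLES) -> Set[str]:
    """Return every structure reachable from `whole` by following has_a links."""
    found: Set[str] = set()
    frontier = [whole]
    while frontier:
        current = frontier.pop()
        for subject, relation, obj in triples:
            if subject == current and relation == "has_a" and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


brain_parts = parts_of("Brain")
```

Because the traversal is transitive, a query about the brain can reach data annotated only with "Purkinje cell", which is the kind of link a one-to-one term mapping would miss.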

“Search computing”

What genes are upregulated by drugs of abuse in the adult mouse?

Morphine; Increased expression; Adult Mouse

Some concepts, e.g., age category, are quantitative but still must be interpreted in a global query system


http://neurolex.org
Stephen Larson

• Provide a simple interface for defining the concepts required
• Lightweight semantics
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based:
–  Anyone can contribute their terms, concepts, things
–  Anyone can edit
–  Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience

Demo D03

• 200,000 edits
• 150 contributors

•  NIF can be used to survey the data landscape
•  Analysis of NIF shows multiple databases with similar scope and content
•  Many contain partially overlapping data
•  Data “flows” from one resource to the next
–  Data is reinterpreted, reanalyzed or added to
•  Is duplication good or bad?

Databases come in many shapes and sizes

•  Primary data:
–  Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
•  Secondary data
–  Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
•  Tertiary data
–  Claims and assertions about the meaning of data
•  E.g., gene upregulation/downregulation, brain activation as a function of task
•  Registries:
–  Metadata
–  Pointers to data sets or materials stored elsewhere
•  Data aggregators
–  Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
•  Single source
–  Data acquired within a single context, e.g., Allen Brain Atlas

Researchers are producing a variety of information artifacts using a multitude of technologies

NIF Analytics: The Neuroscience Landscape

NIF is in a unique position to answer questions about the neuroscience landscape

Where are the data?

[Figure: brain region (Striatum, Hypothalamus, Olfactory bulb, Cerebral cortex, Brain) by data source]

Vadim Astakhov, Kepler Workflow Engine

Adding more semantics

The combination of ontologies, diverse data and analytics lets us look at the current landscape in interesting ways

[Figure labels: Diseases of nervous system; Neurodegenerative; Seizure disorders; Neoplastic disease of nervous system; NIH Reporter; NIF data federated sources]

•  Gemma: Gene ID + Gene Symbol
•  DRG: Gene name + Probe ID
•  Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases

Analysis:
• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment

Relatively simple standards would make life easier
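The baseline mismatch described above (one source reporting change relative to chronic morphine, the other relative to saline) has to be normalized before statements can be compared. The sketch below shows one way to do that; the function names, the "up"/"down" encoding, and the gene labels are hypothetical and do not reflect either database's real schema.

```python
from typing import Dict

FLIP = {"up": "down", "down": "up"}


def to_reference(direction: str, baseline: str, reference: str = "saline") -> str:
    """Express a change relative to `reference`, flipping it if the source used the other baseline."""
    return direction if baseline == reference else FLIP[direction]


def compare(gemma: Dict[str, str], drg: Dict[str, str]) -> Dict[str, str]:
    """Compare per-gene directions after putting both sources on the saline baseline."""
    result: Dict[str, str] = {}
    for gene, gemma_direction in gemma.items():
        if gene not in drg:
            result[gene] = "not found"
            continue
        normalized = to_reference(gemma_direction, baseline="chronic morphine")
        result[gene] = "consistent" if normalized == drg[gene] else "opposite"
    return result


# Invented gene labels and directions, for illustration only.
outcome = compare(
    {"geneA": "down", "geneB": "up", "geneC": "up"},
    {"geneA": "up", "geneB": "up"},
)

# The slide's own counts: 617 of 1370 statements were consistent.
confirmed_fraction = 617 / 1370
```

With the slide's numbers, `confirmed_fraction` is about 0.45, which is why more than half of the claims count as unconfirmed.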

NIF favors a hybrid, tiered, federated system

•  Domain knowledge
–  Ontologies
•  Claims, models and observations
–  Virtuoso RDF triples
–  Model repositories
•  Data
–  Data federation
–  Spatial data
–  Workflows
•  Narrative
–  Full text access

Neuron; Brain part; Disease; Organism; Gene; Technique; People

Caudate projects to Snpc
Grm1 is upregulated in chronic cocaine
Betz cells degenerate in ALS

NIF provides the tentacles that connect the pieces: a new type of entity for 21st century science
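The three example claims on this slide have a subject, predicate, object shape, which is what a triple store holds. The sketch below models them as plain Python tuples with a wildcard lookup; it is an illustration only, does not use Virtuoso or any RDF library, and the names `CLAIMS` and `match` are assumptions.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]

# The slide's example claims, written as (subject, predicate, object).
CLAIMS: List[Triple] = [
    ("Caudate", "projects to", "Snpc"),
    ("Grm1", "is upregulated in", "chronic cocaine"),
    ("Betz cells", "degenerate in", "ALS"),
]


def match(pattern: Pattern, triples: List[Triple] = CLAIMS) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        triple for triple in triples
        if all(want is None or want == have for want, have in zip(pattern, triple))
    ]


projections = match((None, "projects to", None))
```

Leaving any position as None turns the same function into "what does X do?", "what does this to Y?", or "which pairs are linked by this relation?".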

•  2006-2008: A survey of what was out there
•  2008-2009: Strategy for resource discovery
–  NIF Registry vs NIF data federation
–  Ingestion of data contained within different technology platforms, e.g., XML vs relational vs RDF
–  Effective search across semantically diverse sources
•  NIFSTD ontologies
•  2009-2011: Strategy for data integration
–  Unified views across common sources
–  Mapping of content to NIF vocabularies
•  2011-present: Data analytics
–  Uniform external data references
•  2012-present: SciCrunch: unified biomedical resource services

NIF provides a strategy and set of tools applicable to all domains grappling with multiple sources of diverse data (i.e., pretty much everything)

•  Search semantics
•  Ranking
•  Resources supported by NIH Blueprint Institutes are more thoroughly covered
•  Data types, e.g., Brain activation foci


SciCrunch (not just a data catalog)

[Diagram labels: NIF; MONARCH; Community Services; dkCOIN; Shared Resources; Undiagnosed Disease Program; Phenotype RCN; 3D Virtual Cell; National Institute on Aging; One Mind for Research; BIRN; International Neuroinformatics Coordinating Facility; Model Organism Databases; Community Outreach; DELSA]

• 3dVC: Focus on models and simulation
• Gene Ontology: Focus on bioinformatics tools
• National Institute on Aging: Aging-related data sets
• Monarch: Phenotype-Genotype; deep semantic data integration
• One Mind for Research: Biospecimen repositories
• NeuroGateway: Computational resources
• FORCE11: Tools for next-gen publishing and e-scholarship

SciCrunch

SciCrunch is actively supporting multiple communities; multiple communities are enriching and improving SciCrunch

[Diagram, numbered 1-4; labels: Community database: beginning; Community database: end; “How do I share my data/tool?”; “There is no database for my data”; Institutional repositories; Cloud; INCF: Global infrastructure; Government; Education; Industry; Tool repositories]

NIF is designed to leverage existing investments in resources and infrastructure

•  No one can be stopped from doing what they need to do
•  Every resource is resource limited: few have enough time, money, staff or expertise required to do everything they would like
–  If the market can support 11 MRI databases, fine
–  Some consolidation, coordination is warranted though
•  Big, broad and messy beats small, narrow and neat
–  Without trying to integrate a lot of data, we will not know what needs to be done
–  A lot can be done with messy data; neatness helps though
–  Progressive refinement; addition of complexity through layers
•  Be flexible and opportunistic
–  A single optimal technology/container for all types of scientific data and information does not exist; technology is changing
•  Think globally; act locally:
–  No source, not even NIF, is THE source; we are all a source

•  Several powerful trends should change the way we think about our data: One → Many
–  Many data
•  Generation of data is getting easier → shared data
•  Data space is getting richer: more -omes every day
•  But...compared to the biological space, still sparse
–  Many eyes
•  Wisdom of crowds
•  More than one way to interpret data
–  Many algorithms
•  Not a single way to analyze data
–  Many analytics
•  “Signatures” in data may not be directly related to the question for which they were acquired but tell us something really interesting

Are you exposing or burying your work?

Jeff Grethe, UCSD, Co Investigator, Interim PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer

And my colleagues in Monarch, dkNet, 3DVC, Force 11
