Neuroscience research increasingly relies on large, heterogeneous datasets from various sources. Integrating these diverse data types and making them accessible presents challenges. The NIF (Neuroscience Information Framework) addresses this by creating a federated search engine and unified interface to access multiple neuroscience databases. NIF aims to make neuroscience data more discoverable, accessible, and usable through techniques like unique identifiers, metadata standards, and semantic integration. This will help researchers more effectively find and use relevant neuroscience information.
Data-knowledge transition zones within the biomedical research ecosystem (Maryann Martone)
Overview of the Neuroscience Information Framework and how it brings together data, in the form of distributed databases, and knowledge, in the form of ontologies, to map the data space and show where there are mismatches between data and knowledge.
How Portable Are the Metadata Standards for Scientific Data? (Jian Qin)
The one-covers-all approach in current metadata standards for scientific data has serious limitations in keeping up with ever-growing data. This paper reports the findings from a survey of metadata standards in the scientific data domain and argues for the need for a metadata infrastructure. The survey collected 4,400+ unique elements from 16 standards and categorized these elements into 9 categories. The highest counts of elements occurred in the descriptive category, and many of them overlapped with Dublin Core (DC) elements. The same pattern appeared in the elements that co-occurred across different standards: a small number of semantically general elements appeared across the largest number of standards, while the rest of the element co-occurrences formed a long tail with a wide range of specific semantics. The paper discusses the implications of these findings for metadata portability and infrastructure, and points out that large, complex standards and widely varied naming practices are the major hurdles to building a metadata infrastructure.
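The survey's core tally (how many standards each element appears in, and the long tail of specific elements) can be sketched in a few lines. The standards and element names below are a toy inventory invented for illustration, not the paper's actual data:

```python
from collections import Counter

# Hypothetical toy inventory: the metadata elements declared by each standard.
standards = {
    "DataCite": {"title", "creator", "subject", "date", "identifier"},
    "DDI":      {"title", "creator", "abstract", "methodology"},
    "EML":      {"title", "creator", "geographicCoverage", "taxonomicCoverage"},
    "FGDC":     {"title", "abstract", "spatialReference"},
}

# Count how many standards each element appears in.
spread = Counter(e for elems in standards.values() for e in elems)

# A few general elements (title, creator) span most standards, while the
# domain-specific ones form the long tail described in the paper.
for element, n in spread.most_common():
    print(f"{element}: appears in {n} of {len(standards)} standards")
```

Run against a real element inventory, the same tally reproduces the paper's finding: a handful of general, DC-like elements at the head and a long tail of one-off semantics.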
The Neuroscience Information Framework has indexed over 100 big-data databases, allowing us to ask questions about the big-data landscape. Anita Bandrowski presents an overview of the NIF system and provides insights into the addiction data landscape for JAX laboratories.
Amit Sheth with TK Prasad, "Semantic Technologies for Big Science and Astrophysics", Invited Plenary Presentation, at Earthcube Solar-Terrestrial End-User Workshop, NJIT, Newark, NJ, August 13, 2014.
Like many other fields of Big Science, astrophysics and solar physics deal with the challenges of Big Data, including Volume, Variety, Velocity, and Veracity. There is already significant work on handling volume-related challenges, including the use of high-performance computing. In this talk, we will mainly focus on the other challenges from the perspective of collaborative sharing and reuse of the broad variety of data created by multiple stakeholders, large and small, along with tools that offer semantic variants of search, browsing, integration, and discovery capabilities. We will borrow examples of tools and capabilities from state-of-the-art work supporting physicists (including astrophysicists) [1], the life sciences [2], and materials science [3], and describe the role of semantics and semantic technologies that make these capabilities possible or easier to realize. This applied and practice-oriented talk will complement more vision-oriented counterparts [4].
[1] Science Web-based Interactive Semantic Environment: http://sciencewise.info/
[2] NCBO Bioportal: http://bioportal.bioontology.org/ , Kno.e.sis’s work on Semantic Web for Healthcare and Life Sciences: http://knoesis.org/amit/hcls
[3] MaterialWays (a Materials Genome Initiative related project): http://wiki.knoesis.org/index.php/MaterialWays
[4] From Big Data to Smart Data: http://wiki.knoesis.org/index.php/Smart_Data
Data Landscapes: The Neuroscience Information Framework (Maryann Martone)
Overview of how to use the Neuroscience Information Framework for data discovery, presented at the Genetics of Addiction Workshop held at The Jackson Laboratory, Aug 28-Sept 1, 2014.
What is data discovery and how do people find out about data?
Metadata: What information helps potential users decide whether that data might be useful?
How and why do machines exchange information about research data?
Data without metadata and connections is useless:
Linked data
How Scholix is helping publishers and others to link data with publications and more
Metadata, controlled vocabularies, linked data and crosswalks
Things #11, #12, #13 of 23 Things
How do we make data FAIR: Findable, Accessible, Interoperable, Reusable?
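One concrete way the metadata, controlled-vocabulary, and crosswalk topics above fit together is a schema crosswalk: a mapping from one metadata standard's elements onto another's. The sketch below uses a hypothetical, heavily simplified DataCite-style-to-Dublin-Core mapping (field names are illustrative, not the full official schemas) to show why crosswalks lose information:

```python
# A minimal crosswalk sketch: translate a simplified DataCite-style record
# into Dublin Core element names, tracking fields that have no equivalent.
DATACITE_TO_DC = {
    "creators":            "dc:creator",
    "titles":              "dc:title",
    "publisher":           "dc:publisher",
    "publicationYear":     "dc:date",
    "resourceTypeGeneral": "dc:type",
    "identifier":          "dc:identifier",
}

def crosswalk(record: dict) -> dict:
    """Translate the fields we know; report what could not be mapped."""
    mapped, unmapped = {}, []
    for field, value in record.items():
        if field in DATACITE_TO_DC:
            mapped[DATACITE_TO_DC[field]] = value
        else:
            unmapped.append(field)   # information lost in translation
    return {"mapped": mapped, "unmapped": unmapped}

record = {"titles": "Whole brain imaging dataset",
          "publicationYear": 2014,
          "fundingReference": "NIH Blueprint"}
print(crosswalk(record))
```

The `unmapped` list is the point: semantically specific elements (here the made-up `fundingReference`) fall through, which is exactly the long-tail portability problem the survey above describes.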
Applying machine learning techniques to big data in the scholarly domain (Angelo Salatino)
Slides of the lecture at the 5th International School on Applied Probability Theory, Communications Technologies & Data Science (APTCT-2020)
12 Nov 2020
DataCite and Campus Data Services
Paul Bracke, Associate Dean for Digital Programs and Information Services, Purdue University
Research libraries are increasingly interested in developing data services for their campuses. There are many perspectives, however, on how to develop services that are responsive to the many needs of scientists; sensitive to the concerns of scientists who are not always accustomed to sharing their data; and that are attractive to campus administrators. This presentation will discuss the development of campus-based data services programs, the centrality of data citation to these efforts, and the ways in which engagement with DataCite can enhance local programs.
RDAP14: Maryann Martone, Keynote, The Neuroscience Information Framework (ASIS&T)
Research Data Access and Preservation Summit, 2014
San Diego, CA
March 26-28, 2014
Maryann Martone, Principal Investigator, Neuroscience Information Framework, University of California, San Diego
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear... (dkNET)
The NIDDK Information Network (dkNET; http://dknet.org) is an open community resource for basic and clinical investigators in metabolic, digestive and kidney disease. dkNET's portal facilitates access to a collection of diverse research resources (i.e., the multitude of data, software tools, materials, services, projects and organizations available to researchers in the public domain) that advance the mission of the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). This webinar was presented by dkNET principal investigator Dr. Jeffrey Grethe.
Applied semantic technology and linked data (William Smith)
Mapping a human brain generates petabytes of gene listings and the corresponding locations of these genes throughout the human brain. Because of the size of this dataset, a prototype Semantic Web application was created with the unique ability to link new datasets from similar fields of research and present these new models to an online community. The resulting application presents a large set of gene-to-location mappings and provides new information about diseases, drugs, and side effects in relation to genes and areas of the human brain.
In this presentation we will discuss the normalization processes and tools for adding new datasets, the user experience throughout the publishing process, the underlying technologies behind the application, and demonstrate the preliminary use cases of the project.
This talk presents areas of investigation underway at the Rensselaer Institute for Data Exploration and Applications. First presented at Flipkart, Bangalore India, 3/2015.
Enabling knowledge management in the Agronomic Domain (Pierre Larmande)
This talk focuses mainly on ongoing projects at the Institute of Computational Biology:
Agronomic Linked Data (AgroLD): a Semantic Web knowledge base designed to integrate data from various publicly available plant-centric data sources.
GIGwA: a tool developed to manage large genomic, transcriptomic and genotyping datasets resulting from NGS analyses.
Anita Bandrowski explains how the uniform resource layer of the Neuroscience Information Framework allows several interesting questions about the state of scientific research to be answered.
Maryann Martone
Making Sense of Biological Systems: Using Knowledge Mining to Improve and Validate Models of Living Systems; NIH COBRE Center for the Analysis of Cellular Mechanisms and Systems Biology, Montana State University, Bozeman, MT
August 24, 2012
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ... (Sérgio Sacani)
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters spanning 0.4-0.9 µm) and novel JWST images with 14 filters spanning 0.8-5 µm, including 7 medium-band filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data at >2.3 µm to construct an ultradeep image, reaching as deep as ≈31.4 AB mag in the stack and 30.3-31.0 AB mag (5σ, r = 0.1" circular aperture) in individual filters. We measure photometric redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts z = 11.5-15. These objects show compact half-light radii of R_1/2 ∼ 50-200 pc, stellar masses of M⋆ ∼ 10^7-10^8 M⊙, and star-formation rates of SFR ∼ 0.1-1 M⊙ yr^-1. Our search finds no candidates at 15 < z < 20, placing upper limits at these redshifts. We develop a forward-modeling approach to infer the properties of the evolving luminosity function, without binning in redshift or luminosity, that marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results, and that the luminosity function normalization and UV luminosity density decline by a factor of ∼2.5 from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical models for evolution of the dark matter halo mass function.
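The forward-modeling idea in this abstract (fitting the luminosity function while marginalizing over each candidate's photometric-redshift uncertainty) can be illustrated schematically. Everything below, the Schechter parameters, the Gaussian photo-z PDF, and the flux-to-luminosity mapping, is an invented toy for illustration, not the authors' actual pipeline or values:

```python
import numpy as np

# Toy Schechter luminosity function: phi(L) = phi* (L/L*)^alpha exp(-L/L*).
def schechter(L, phi_star=1e-4, L_star=1.0, alpha=-2.0):
    x = L / L_star
    return phi_star * x**alpha * np.exp(-x)

zgrid = np.linspace(11.0, 16.0, 201)             # redshift grid
dz = zgrid[1] - zgrid[0]
pz = np.exp(-0.5 * ((zgrid - 13.0) / 0.5) ** 2)  # toy photo-z PDF peaked at z = 13
pz /= pz.sum() * dz                              # normalize to integrate to 1

# Toy mapping from redshift to intrinsic luminosity at fixed observed flux.
L_of_z = 0.5 * (zgrid / 13.0) ** 2

# Likelihood for one candidate, marginalized over z:
# integral of phi(L(z)) * p(z) dz, instead of assigning a single best-fit z.
like = np.sum(schechter(L_of_z) * pz) * dz
print(f"marginalized likelihood ~ {like:.3e}")
```

The point of the marginalization is that a candidate with an uncertain redshift contributes fractionally across the whole redshift range rather than being binned at its point estimate.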
Slide 1: Title Slide
Extrachromosomal Inheritance
Slide 2: Introduction to Extrachromosomal Inheritance
Definition: Extrachromosomal inheritance refers to the transmission of genetic material that is not found within the nucleus.
Key Components: Involves genes located in mitochondria, chloroplasts, and plasmids.
Slide 3: Mitochondrial Inheritance
Mitochondria: Organelles responsible for energy production.
Mitochondrial DNA (mtDNA): Circular DNA molecule found in mitochondria.
Inheritance Pattern: Maternally inherited, meaning it is passed from mothers to all their offspring.
Diseases: Examples include Leber’s hereditary optic neuropathy (LHON) and mitochondrial myopathy.
Slide 4: Chloroplast Inheritance
Chloroplasts: Organelles responsible for photosynthesis in plants.
Chloroplast DNA (cpDNA): Circular DNA molecule found in chloroplasts.
Inheritance Pattern: Often maternally inherited in most plants, but can vary in some species.
Examples: Variegation in plants, where leaf color patterns are determined by chloroplast DNA.
Slide 5: Plasmid Inheritance
Plasmids: Small, circular DNA molecules found in bacteria and some eukaryotes.
Features: Can carry antibiotic resistance genes and can be transferred between cells through processes like conjugation.
Significance: Important in biotechnology for gene cloning and genetic engineering.
Slide 6: Mechanisms of Extrachromosomal Inheritance
Non-Mendelian Patterns: Do not follow Mendel’s laws of inheritance.
Cytoplasmic Segregation: During cell division, organelles like mitochondria and chloroplasts are randomly distributed to daughter cells.
Heteroplasmy: Presence of more than one type of organellar genome within a cell, leading to variation in expression.
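Cytoplasmic segregation and heteroplasmy are easy to see in a toy simulation: start a cell with a 50/50 mix of two organellar genotypes and let random partitioning at each division drift the daughters toward homoplasmy. The copy number (100 organelles) and generation count below are illustrative assumptions, not measured values:

```python
import random

# Toy simulation of cytoplasmic segregation: organelles are partitioned
# at random during division, so a heteroplasmic cell's descendants drift
# toward homoplasmy (all one genotype) over generations.
random.seed(42)

def divide(cell):
    """Randomly split organelles into one daughter, then regrow to full size."""
    daughter = [g for g in cell if random.random() < 0.5]
    # Regrow to the original copy number by duplicating random survivors.
    while len(daughter) < len(cell):
        daughter.append(random.choice(daughter))
    return daughter

cell = ["wild-type"] * 50 + ["mutant"] * 50   # 50/50 heteroplasmy
for generation in range(10):
    cell = divide(cell)
    frac = cell.count("mutant") / len(cell)
    print(f"gen {generation + 1}: mutant fraction = {frac:.2f}")
```

Because the split is random, the mutant fraction performs a random walk between 0 and 1; this is the same mechanism that makes mitochondrial disease severity vary among offspring of the same heteroplasmic mother.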
Slide 7: Examples of Extrachromosomal Inheritance
Four O’clock Plant (Mirabilis jalapa): Shows variegated leaves due to different cpDNA in leaf cells.
Petite Mutants in Yeast: Result from mutations in mitochondrial DNA affecting respiration.
Slide 8: Importance of Extrachromosomal Inheritance
Evolution: Provides insight into the evolution of eukaryotic cells.
Medicine: Understanding mitochondrial inheritance helps in diagnosing and treating mitochondrial diseases.
Agriculture: Chloroplast inheritance can be used in plant breeding and genetic modification.
Slide 9: Recent Research and Advances
Gene Editing: Techniques like CRISPR-Cas9 are being used to edit mitochondrial and chloroplast DNA.
Therapies: Development of mitochondrial replacement therapy (MRT) for preventing mitochondrial diseases.
Slide 10: Conclusion
Summary: Extrachromosomal inheritance involves the transmission of genetic material outside the nucleus and plays a crucial role in genetics, medicine, and biotechnology.
Future Directions: Continued research and technological advancements hold promise for new treatments and applications.
Slide 11: Questions and Discussion
Invite Audience: Open the floor for any questions or further discussion on the topic.
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep... (University of Maribor)
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
What are greenhouse gases, and how many affect the Earth? (moosaasad1975)
What are greenhouse gases, how do they affect the Earth and its environment, and what does the future hold for the environment and the Earth as weather and climate change?
(May 29th, 2024) Advancements in Intravital Microscopy - Insights for Preclini... (Scintica Instrumentation)
Intravital microscopy (IVM) is a powerful tool used to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been gained using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed-tissue imaging, IVM allows ultra-fast, high-resolution imaging of cellular processes over time and space in their natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provides insights into the progression of disease, response to treatments, or developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM Technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system's unique features and user-friendly software enable researchers to probe fast, dynamic biological processes such as immune cell tracking, cell-cell interaction, and vascularization and tumor metastasis in exceptional detail. This webinar also gives an overview of IVM in drug development, offering a view into the intricate interactions between drugs/nanoparticles and tissues in vivo and allowing the evaluation of therapeutic intervention in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancement of novel therapeutic strategies.
Cancer cell metabolism: Special Reference to the Lactate Pathway (AADYARAJPANDEY1)
Normal Cell Metabolism:
Cellular respiration describes the series of steps that cells use to break down sugar and other chemicals to get the energy they need to function.
Energy is stored in the bonds of glucose, and when glucose is broken down, much of that energy is released.
Cells utilize energy in the form of ATP.
The first step of respiration is called glycolysis. In a series of steps, glycolysis breaks glucose into two smaller molecules - a chemical called pyruvate. A small amount of ATP is formed during this process.
Most healthy cells continue the breakdown in a second process, called the Krebs cycle. The Krebs cycle allows cells to "burn" the pyruvates made in glycolysis to get more ATP.
The last step in the breakdown of glucose is called oxidative phosphorylation (Ox-Phos).
It takes place in specialized cell structures called mitochondria. This process produces a large amount of ATP. Importantly, cells need oxygen to complete oxidative phosphorylation.
If a cell completes only glycolysis, only 2 molecules of ATP are made per glucose. However, if the cell completes the entire respiration process (glycolysis - Kreb's - oxidative phosphorylation), about 36 molecules of ATP are created, giving it much more energy to use.
IN CANCER CELL:
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
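The arithmetic behind the "a lot more sugar" claim, made explicit using the approximate textbook yields quoted above:

```python
# Approximate ATP yields per glucose molecule, as stated in the text.
ATP_GLYCOLYSIS_ONLY = 2    # cancer cell relying mostly on glycolysis
ATP_FULL_RESPIRATION = 36  # glycolysis + Krebs cycle + oxidative phosphorylation

# Glucose molecules a glycolysis-only cell must consume to match the ATP
# a fully respiring cell extracts from a single glucose:
glucose_needed = ATP_FULL_RESPIRATION / ATP_GLYCOLYSIS_ONLY
print(f"~{glucose_needed:.0f}x more glucose")  # → ~18x more glucose
```

That roughly 18-fold glucose demand is the basis of cancer cells' "glucose addiction" discussed in the Warburg section below (and of FDG-PET imaging in the clinic).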
Introduction to the Warburg phenomenon:
Warburg effect: Usually, cancer cells are highly glycolytic (glucose addiction) and take up more glucose from outside than normal cells do.
Otto Heinrich Warburg (8 October 1883 – 1 August 1970) was awarded the Nobel Prize in Physiology or Medicine in 1931 for his "discovery of the nature and mode of action of the respiratory enzyme."
Warburg effect: the tendency of cancer cells under aerobic (well-oxygenated) conditions to metabolize glucose to lactate (aerobic glycolysis) is known as the Warburg effect. Warburg observed that tumor slices consume glucose and secrete lactate at a higher rate than normal tissues.
Nutraceutical market, scope and growth: Herbal drug technology (Lokesh Patil)
As consumer awareness of health and wellness rises, the nutraceutical market, which includes goods like functional foods, beverages, and dietary supplements that provide health benefits beyond basic nutrition, is growing significantly. As healthcare expenses rise, the population ages, and people increasingly seek natural and preventive health solutions, this industry is expanding quickly. Innovations in product formulation and the use of cutting-edge technology for personalized nutrition are driving further market expansion. With its worldwide reach, the nutraceutical industry is expected to keep growing and to provide significant opportunities for research and investment across a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a... (Ana Luísa Pinho)
Functional Magnetic Resonance Imaging (fMRI) provides a means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects of interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich in features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and their capacity to elicit complex behavior composed of discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization.
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
Introduction:
RNA interference (RNAi) or Post-Transcriptional Gene Silencing (PTGS) is an important biological process for modulating eukaryotic gene expression.
It is a highly conserved process of post-transcriptional gene silencing in which double-stranded RNA (dsRNA) causes sequence-specific degradation of mRNA sequences.
dsRNA-induced gene silencing (RNAi) has been reported in a wide range of eukaryotes, including worms, insects, mammals and plants.
This process mediates resistance to both endogenous parasitic and exogenous pathogenic nucleic acids, and regulates the expression of protein-coding genes.
What are small ncRNAs?
micro RNA (miRNA)
short interfering RNA (siRNA)
Properties of small non-coding RNA:
Involved in silencing mRNA transcripts.
Called “small” because they are usually only about 21-24 nucleotides long.
Synthesized by first cutting up longer precursor sequences (like the 61nt one that Lee discovered).
Silence an mRNA by base pairing with some sequence on the mRNA.
Discovery of siRNA?
The first small RNA:
In 1993 Rosalind Lee (Victor Ambros lab) was studying a non-coding gene in C. elegans, lin-4, that was involved in silencing of another gene, lin-14, at the appropriate time in the
development of the worm C. elegans.
Two small transcripts of lin-4 (22nt and 61nt) were found to be complementary to a sequence in the 3' UTR of lin-14.
Because lin-4 encoded no protein, she deduced that it must be these transcripts that are causing the silencing by RNA-RNA interactions.
Types of RNAi (non-coding RNA):
miRNA: 23-25 nt long; trans-acting; binds its target mRNA with mismatches; inhibits translation.
siRNA: 21 nt long; cis-acting; binds its target mRNA at a perfectly complementary sequence.
piRNA (Piwi-interacting RNA): 25-36 nt long; expressed in germ cells; regulates transposon activity.
MECHANISM OF RNAI:
First the double-stranded RNA teams up with a protein complex named Dicer, which cuts the long RNA into short pieces.
Then another protein complex called RISC (RNA-induced silencing complex) discards one of the two RNA strands.
The RISC-docked, single-stranded RNA then pairs with the homologous mRNA and destroys it.
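The three steps above can be sketched as a toy program: Dicer chops the long dsRNA into siRNA-sized pieces, and RISC silences any mRNA that contains a perfectly complementary stretch. The sequences and the fragment logic are illustrative (the dsRNA is represented by its sense strand, so a matching fragment appears literally in the mRNA); this is not real bioinformatics software:

```python
# Toy sketch of the RNAi mechanism: Dicer fragments the dsRNA, RISC keeps
# a guide strand, and a target mRNA is degraded on a perfect match.

def dicer(sense_strand: str, size: int = 21) -> list[str]:
    """Cut the dsRNA (represented by its sense strand) into siRNA fragments."""
    return [sense_strand[i:i + size]
            for i in range(0, len(sense_strand) - size + 1, size)]

def risc_silences(sirnas: list[str], mrna: str) -> bool:
    """An mRNA is silenced if any siRNA matches it; with the sense-strand
    representation, a perfectly complementary guide means the sense
    fragment appears verbatim in the mRNA."""
    return any(fragment in mrna for fragment in sirnas)

mrna = "AUGGCUACGAUCGGAUCCUAGCAUGCGAUAAGCUAGCUAA"
dsrna = mrna[5:35]   # dsRNA derived from part of this transcript
print(risc_silences(dicer(dsrna), mrna))   # True: transcript is silenced
```

The sequence-specificity in the real pathway comes from base pairing between the guide strand and the mRNA; the substring test here is just the string-level analogue of that perfect complementarity.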
THE RISC COMPLEX:
RISC is a large (>500 kDa) multi-protein RNA-binding complex that triggers mRNA degradation.
Unwinding of the double-stranded siRNA by an ATP-independent helicase.
The active component of RISC is the Ago protein (an endonuclease), which cleaves the target mRNA.
Dicer: endonuclease (RNase III family)
Argonaute: Central Component of the RNA-Induced Silencing Complex (RISC)
One strand of the dsRNA produced by Dicer is retained in the RISC complex in association with Argonaute
ARGONAUTE PROTEIN:
1. PAZ (Piwi/Argonaute/Zwille): recognition of the target mRNA.
2. PIWI (P-element induced wimpy testis): breaks the phosphodiester bond of the mRNA (RNase H activity).
miRNA:
Double-stranded RNAs are naturally produced in eukaryotic cells during development, and they have a key role in regulating gene expression.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN (Sérgio Sacani)
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
2. We say this to each other all the time, but we set up systems for scholarly advancement and communication that are the antithesis of integration.
[Figure: data types spanning scales - whole brain data (20 µm microscopic MRI); mosaic LM images (1 GB+); conventional LM images; individual cell morphologies; EM volumes & reconstructions; solved molecular structures.]
No single technology serves these all equally well. Multiple data types; multiple scales; multiple databases: a data integration problem.
3. Solving the large problems of science?
• Observation
• Experimentation
• Modeling
• Cooperative data-intensive science
"An unaided human's ability to process large data sets is comparable to a dog's ability to do arithmetic, and not much more valuable." – Michael Nielsen, Reinventing Discovery, 2012.
4. Old Model: Single type of content; single mode of distribution
[Diagram: Scholar – Library – Scholar – Publisher]
FORCE11.org: Future of research communications and e-scholarship
6. The duality of modern scholarship
Observation: Those who build information systems from the machine side don't understand the requirements of the human very well; those who build information systems from the human side don't understand the requirements of machines very well.
Production of "reusable scholarly artifacts" = usable by humans and machines: findable, accessible, citable.
7. • NIF is an initiative of the NIH Blueprint consortium of institutes
– What types of resources (data, tools, materials, services) are available to the neuroscience community?
– How many are there?
– What domains do they cover? What domains do they not cover?
– Where are they?
• Web sites
• Databases
• Literature
• Supplementary material
• PDF files
• Desk drawers
– Who uses them?
– Who creates them?
– How can we find them?
– How can we make them better in the future?
http://neuinfo.org
NIF has been surveying, cataloging and tracking the neuroscience resource landscape since before 2008.
9. Population, Coverage and Linkage of the Resource Registry
[Chart: registry growth over the years by resource type: Database, Software Application, Data Analysis Service, Topical Portal, Core Facility, Ontology, Software Resource.]
Anita Bandrowski and Burak Ozyurt
10. • Automated text mining is used to look for “web page last updated” or copyright dates
– Identified for 570 resources
– 373 were not updated within the last 2 years (65%)
• Manual review of ~200 resources
– 38 not updated within the past 2 years (~20%)
– 8 migrated to new addresses or institutions
– 7 are no longer in service (~3%)
– 3 were deemed no longer appropriate
What happens to these resources? The Registry provides a persistent identifier and metadata record for what once existed but no longer does.
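The kind of text mining described on this slide can be sketched in a few lines. The regular expressions and the two-year staleness threshold below are illustrative assumptions, not NIF's actual pipeline:

```python
import re
from datetime import date

# Illustrative patterns for "last updated" and copyright years in page text.
DATE_PATTERNS = [
    re.compile(r"last\s+updated[:\s]+.*?(\d{4})", re.IGNORECASE),
    re.compile(r"copyright\s+(?:\(c\)\s*)?(\d{4})", re.IGNORECASE),
    re.compile(r"©\s*(\d{4})"),
]

def last_known_year(page_text):
    """Return the most recent four-digit year found near an update/copyright marker."""
    years = []
    for pattern in DATE_PATTERNS:
        years += [int(y) for y in pattern.findall(page_text)]
    return max(years) if years else None

def is_stale(page_text, threshold_years=2, today=None):
    """A resource counts as stale if its newest detected year is older than the threshold."""
    today = today or date.today()
    year = last_known_year(page_text)
    return year is not None and (today.year - year) > threshold_years

print(last_known_year("Page last updated: March 2009. Copyright 2011."))  # → 2011
```

In practice a survey like NIF's also has to handle pages with no machine-readable date at all, which is why manual review of a sample remains necessary.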
11. BD2K: Big Data to Knowledge
• BD2K is a trans-NIH initiative established to enable biomedical research as a digital research enterprise, to facilitate discovery and support new knowledge, and to maximize community engagement.
• BD2K aims to develop the new approaches, standards, methods, tools, software, and competencies that will enhance the use of biomedical Big Data by:
– Facilitating broad use of biomedical digital assets by making them discoverable, accessible, and citable
– Conducting research and developing the methods, software, and tools needed to analyze biomedical Big Data
– Enhancing training in the development and use of methods and tools necessary for biomedical Big Data science
– Supporting a data ecosystem that accelerates discovery as part of a digital enterprise
http://bd2k.nih.gov/
13. What resources are available for GRM1?
With the thousands of databases and other information sources
available, simple descriptive metadata will not suffice
14. NIF data federation
NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources, providing deep query of the contents and unified views.
250 sources; > 800 M records.
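The core idea of the federation, one query projected across many sources that each keep their own schema, can be sketched as follows. The source names, field names, and records are invented for illustration; the real NIF federation maps hundreds of sources this way:

```python
# Toy federated search: each source keeps its own schema, and a per-source
# mapping projects its records into one unified view at query time.
SOURCES = {
    "atlas_db": {
        "records": [{"gene_symbol": "GRM1", "region": "cerebellum", "level": "high"}],
        "mapping": {"gene": "gene_symbol", "structure": "region", "value": "level"},
    },
    "expr_db": {
        "records": [{"symbol": "GRM1", "area": "cortex", "expression": "low"}],
        "mapping": {"gene": "symbol", "structure": "area", "value": "expression"},
    },
}

def federated_query(gene):
    """Query every source and return matching records in the unified schema."""
    hits = []
    for name, source in SOURCES.items():
        m = source["mapping"]
        for rec in source["records"]:
            if rec[m["gene"]] == gene:
                hits.append({"source": name, "gene": gene,
                             "structure": rec[m["structure"]],
                             "value": rec[m["value"]]})
    return hits

for hit in federated_query("GRM1"):
    print(hit)
```

The design point is that integration happens in the mappings, not in the sources: no database has to change its schema to join the federation.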
15. What do you mean by data? Databases come in many shapes and sizes.
• Primary data:
– Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data:
– Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas), brain connectivity statements (BAMS)
• Tertiary data:
– Claims and assertions about the meaning of data, e.g., gene upregulation/downregulation, brain activation as a function of task
• Registries:
– Metadata
– Pointers to data sets or materials stored elsewhere
• Data aggregators:
– Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source:
– Data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies.
18. Making it easier to access and understand distributed databases
Each resource implements a different, though related, model; in many cases the systems are complex and difficult to learn.
19. Current challenge: with so much available, how do I find what I need?
• “What genes are upregulated by chronic morphine?”
– It depends
• Most use cases require connecting a researcher to relevant data sets and appropriate tools
– Depending upon the data and tools, the answers may differ
• Many databases have tool bases and workflows that they support
• Much value can be added to individual data sets if we can connect to them
22. SciCrunch: a “social network” for resources
• NIF is a general search engine across all of neuroscience (biomedicine)
– Very powerful for discovery and general browsing
– Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
– Specialized for their domain
– Restricted to particular sources
– Organized according to their needs
– Under their own branding
• How do we create a system that satisfies community needs without creating another silo?
28. What is an effective information framework for neuroscience?
Knowledge in space and spatial relationships (the “where”); knowledge in words, terminologies and logical relationships (the “what”).
30. What can ontology do for us?
• Express neuroscience concepts in a way that is machine readable
– Unique identifier
– Synonyms, lexical variants
– Definitions
• Provide a means of disambiguating strings
– Nucleus part of cell; nucleus part of brain; nucleus part of atom
– Each of these concepts has a unique identifier that distinguishes it
• Properties
– Support reasoning
• Provide universals for navigating across different data sources
– Semantic “index”
– Link data through relationships, not just one-to-one mappings
• Provide the basis for concept-based queries to probe and mine data
• Establish a semantic framework for landscape analysis
• Deep data integration for some types of knowledge
Mathematics, computer code or Esperanto?
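The disambiguation point above can be made concrete: the same string "nucleus" resolves to different identifiers depending on context. GO:0005634 really is the Gene Ontology identifier for the cell nucleus; the other identifiers, synonym table, and `resolve` function are illustrative assumptions:

```python
# Sketch: one string, several concepts, each with a unique identifier.
# GO:0005634 is the real GO ID for the cell nucleus; "EX:" IDs are invented,
# and the UBERON ID shown is illustrative.
CONCEPTS = {
    ("nucleus", "cell"): {"id": "GO:0005634",
                          "definition": "membrane-bounded organelle containing chromosomes"},
    ("nucleus", "brain"): {"id": "UBERON:0002308",
                           "definition": "aggregate of neuron cell bodies in the brain"},
    ("nucleus", "atom"): {"id": "EX:atomic_nucleus",
                          "definition": "central core of an atom"},
}

# Synonyms and lexical variants map alternate strings onto the same concept.
SYNONYMS = {"cell nucleus": ("nucleus", "cell"),
            "brain nucleus": ("nucleus", "brain")}

def resolve(term, context=None):
    """Disambiguate a string to a concept identifier using a synonym table or context."""
    key = SYNONYMS.get(term, (term, context))
    concept = CONCEPTS.get(key)
    return concept["id"] if concept else None

print(resolve("nucleus", context="cell"))   # → GO:0005634
print(resolve("brain nucleus"))
```

Once strings are replaced by identifiers like these, concept-based queries and cross-source navigation become set operations over IDs rather than fuzzy string matching.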
31. The scourge of neuroanatomical nomenclature
• NIF Connectivity: 7 databases containing primary connectivity data or claims from the literature on connectivity between brain regions:
– Brain Architecture Management System (rodent)
– Temporal-lobe.com (rodent)
– Connectome Wiki (human)
– Brain Maps (various)
– CoCoMac (primate cortex)
– UCLA Multimodal Database (human fMRI)
– Avian Brain Connectivity Database (bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number that map to the same identifier, i.e., synonyms: 99
• Number of 1st-order partonomy matches: 385
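The overlap counts above reduce to simple set operations once each database's vocabulary is in hand. A tiny sketch with invented vocabularies (the real analysis ran over ~1800 terms across 7 databases):

```python
from collections import Counter

# Toy vocabularies standing in for the connectivity databases' term lists.
db_terms = {
    "BAMS": {"caudate nucleus", "substantia nigra", "CA1"},
    "CoCoMac": {"area 17", "CA1", "striate cortex"},
    "BrainMaps": {"caudate nucleus", "area 17", "V1"},
}

# Synonym mapping: many strings, one identifier (IDs are illustrative).
synonyms = {"striate cortex": "UBERON:V1", "V1": "UBERON:V1", "area 17": "UBERON:V1"}

# Exact term strings used in more than one database.
counts = Counter(t for terms in db_terms.values() for t in terms)
exact_shared = {t for t, n in counts.items() if n > 1}

# Identifiers shared across databases once synonyms are applied.
ids_per_db = [{synonyms.get(t, t) for t in terms} for terms in db_terms.values()]
id_counts = Counter(i for ids in ids_per_db for i in ids)
shared_ids = {i for i, n in id_counts.items() if n > 1}

print(sorted(exact_shared))   # exact strings used in > 1 database
print(sorted(shared_ids))     # identifiers shared after synonym mapping
```

Note how the synonym step grows the overlap: "area 17", "V1" and "striate cortex" only match once they map to one identifier, which is exactly why the slide's synonym count (99) exceeds its exact-match count (42).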
32. This is your brain on computers
• NeuroLex: > 1 million triples
• Dr. Yi Zeng: Chinese neural knowledge base
• NIF Cell Graph
33. Looking across the ecosystem: where are the data?
[Chart: data sources.]
Bringing knowledge to data: gap analysis
35. How much information makes it into the data space?
[Diagram: nested scopes, from the infinite (∞), to what is potentially knowable, to what is known (literature, images, human knowledge), to what is easily machine processable and accessible. Barriers between layers: unstructured content requiring natural language processing, entity recognition, and image processing and analysis; paywalls; file drawers; abstracts vs. full text vs. tables.]
Estimates suggest that > 50% of scientific output is not recoverable. Chan et al., Lancet, 383, 2014.
36. The tale of the tail
“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons tail of the caudate activity can easily be missed.
• One reason is limitations in the normalization algorithms, that typically are optimized to maximize accuracy for cortical rather than subcortical structures. ...
• A second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body, and completely exclude the tail...
• A final reason is that the tail of the caudate is close to the hippocampus, and could be misidentified as such especially in tasks involving learning and memory.
Therefore, the tail of the caudate may be recruited in additional cognitive tasks, but yet not have been properly identified and reported in the neuroimaging literature.”
Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.
38. Data-knowledge mismatch
Dutkowski et al., 2013, Nature Biotechnology.
A major impediment for researchers using ontology identifiers is the perception that ontologies require a consensus on the definition of terms.
By matching assertions about biological entities to data, we can test both our knowledge and our data.
39. The Monarch Initiative
• Genotype-phenotype comparison engine
• Integrates large amounts of genotype-phenotype data
• Semantic similarity analytics
• Human disease ↔ animal model
monarchinitiative.org
Melissa Haendel, OHSU; Chris Mungall, LBL
42. SO ALL I AM IS A NUMBER?
The power of unique and persistent identifiers
43.
44. What studies used my monoclonal mouse antibody against actin in humans?
“The following antibodies were used for immunoblotting: β-actin mAb (1:10,000 dilution, Sigma-Aldrich)…”
Papers are currently poor at identifying the simplest part of the paper: the materials used.
45. Pilot Project
• Authors to identify 3 types of research resources:
– Software/databases
– Antibodies
– Model organisms
• Include a unique identifier (RRID) in the methods section
• Voluntary for authors
• Journals did not have to modify their submission systems
Launched February 2014: 3 month commitment and more…
Two simple questions: Could authors do it? Would authors do it?
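Part of what makes the pilot work is that an RRID in a methods section is trivially machine-findable. A sketch of the extraction step; the regular expression covers the common RRID shapes (AB_ for antibodies, SCR_ for software) but is an illustrative assumption rather than the official validation pattern, and the example accession numbers are invented:

```python
import re

# Illustrative pattern for RRIDs cited in methods text, e.g. RRID:AB_123456
# or RRID:SCR_123456. Not the official validation expression.
RRID_RE = re.compile(r"RRID:\s*([A-Z]+_[A-Za-z0-9_:-]+)")

def extract_rrids(methods_text):
    """Return every RRID citation found in a block of methods text."""
    return ["RRID:" + m for m in RRID_RE.findall(methods_text)]

methods = ("Actin was detected with a beta-actin mAb "
           "(Sigma-Aldrich, RRID:AB_476744); analysis used "
           "ImageJ (RRID:SCR_003070).")
print(extract_rrids(methods))
```

Contrast this with the unidentifiable citation on the previous slide: with the RRID present, linking the paper to the exact antibody or tool is a one-line pattern match instead of a manual curation task.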
46. Resource IDs from NIF-aggregated databases
• A single portal for authors
• >10 authoritative databases
• One search interface
• Simple directions
• Prominent “Cite This” button
• Help desk
RII Portal: http://scicrunch.org/resources
The initiative was possible because of the massive registries available and the aggregation services of NIF/SciCrunch.
47. RRIDs in the wild!
• >300 articles have appeared to date
• 47 journals
• 800+ RRIDs
• 96% correct!
Database available at: https://www.force11.org/node/5635
Authors can and will adopt new citation styles for research resources.
48. Increased identifiability of resources after the Resource Identification Initiative pilot
Update of Vasilevsky et al., PeerJ, 2013.
49. What can we do with an RRID?
• A resolver service has been created
• 3rd-party tools are being created to provide linkage between resources and papers
http://scicrunch.com/resolver/RRID:AB_90755
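Following the resolver URL pattern shown above, turning an RRID into an actionable link is pure string construction. A sketch (no network call is made; the base URL is taken from the slide):

```python
# Sketch: an RRID becomes actionable by prepending the resolver base URL,
# following the pattern shown above. Purely local; no request is issued.
RESOLVER_BASE = "http://scicrunch.com/resolver/"

def resolver_url(rrid):
    """Build a resolver URL for an identifier like 'RRID:AB_90755'."""
    if not rrid.startswith("RRID:"):
        raise ValueError("expected an identifier of the form 'RRID:<prefix>_<accession>'")
    return RESOLVER_BASE + rrid

print(resolver_url("RRID:AB_90755"))
# → http://scicrunch.com/resolver/RRID:AB_90755
```

Because the scheme is uniform, any third-party tool that finds an RRID in a paper can link back to the resource record without consulting each source database separately.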
50. “Alerting” service
• Teaming with Hypothes.is and ORCID to develop annotation tools for RRIDs, including “alerts” on reagents and tools
51. Hypothes.is is a tool for creating and sharing annotations on web pages
http://hypothes.is
52. An ecosystem for research objects: the social network of research resources
[Diagram: articles, code, blogs, workflows, data, and portals, each carrying IDs and connected through search engines.]
Unique and persistent identifiers, and a system for referencing them, allow an ecosystem to function.
53. WHAT CAN WE DO NOW?
Lessons learned from my career
54. Share your data and share it effectively
• Discoverability
– Data can be found
• Accessibility
– Data can be accessed and access rights are clear
– Links to data are stable
• Assessability
– The reliability of the data can be determined
• Understandability
– The data can be understood
• Usability
– The data are in a usable form
Publishing data on your website or as supplemental material is not the best way to make it available.
55. What about my data?
• Best practice: put it in a repository
• What repository?
– A community repository for your data type, e.g., NITRC, GEO
– A general repository: Dryad, FigShare, NIH Data Commons
– An institutional repository: research libraries are setting up repositories to manage their “digital assets”
NIF can help you find a place for your data.
56. Make sure you and your scholarly outputs can be linked
A distributed system like the biomedical data ecosystem runs on the ability to uniquely identify relevant entities.
• ORCID iD: unique researcher identifier
• Editors, authors: participate in the Resource Identification Initiative
“Sound, reproducible scholarship rests upon a foundation of robust, accessible data. Data should be considered legitimate, citable products of research. Data citation, like the citation of other evidence and sources, is good research practice.”
–Joint Declaration of Data Citation Principles, http://www.force11.org/datacitation
Coming soon: formal standards for citing data sets.
57. Future of Research Communications and e-Scholarship (FORCE11.org)
http://force11.org. Join FORCE11!
58. NIF team (past and present)
Jeff Grethe, UCSD, Co-PI
Amarnath Gupta, UCSD,
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Caltech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer
(retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11
59.
60. The Encyclopedia of Life
A…
Access to data has changed over the years.
Tim Berners-Lee: web of data. Wikipedia defines Linked Data as “a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF.”
http://linkeddata.org/
Genbank; PDB
“Whichever technology wins broad adoption will become, by default, the data web. That’s why we don’t need to know which technological vision of the data web will win to conclude that the data web is inevitable.” –Michael Nielsen
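The Linked Data idea on this slide reduces to facts expressed as subject-predicate-object triples whose terms are URIs, so statements published independently can be joined. A minimal sketch using plain tuples instead of an RDF library; the "ex:" URIs are invented stand-ins for real identifiers like Genbank or PDB accessions:

```python
# Minimal sketch of Linked Data: facts as (subject, predicate, object)
# triples whose terms are identifiers, so data from different publishers
# can be joined by following links. All "ex:" URIs are illustrative.
triples = {
    ("ex:GRM1", "ex:encodes", "ex:mGluR1"),
    ("ex:mGluR1", "ex:expressed_in", "ex:cerebellum"),
    ("ex:cerebellum", "ex:part_of", "ex:brain"),
}

def objects(subject, predicate):
    """All objects linked from a subject by a given predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}

# Follow links across statements that could come from different sources:
protein = objects("ex:GRM1", "ex:encodes").pop()
print(objects(protein, "ex:expressed_in"))   # → {'ex:cerebellum'}
```

A real deployment would use RDF serializations and SPARQL rather than Python sets, but the traversal shown here, joining facts purely by shared identifiers, is the mechanism the quote is betting on.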
61. “Empty Archives”
Repository | Type of data | Date started | Host | Public data | Comments
CARMEN | neuroscience / electrophysiology | 2008 | Newcastle University, United Kingdom | 100 | Requires account
INCF Dataspace | various | 2012 | International Neuroinformatics Coordinating Facility | ? |
Open Source Brain | models | 2014 | University College London | 47 | Cells and networks; 23 technology showcases
XNAT Central | neuroimaging | 2010 | Washington University School of Medicine in St. Louis, Missouri, USA | 34 | Site states 370 projects, 3804 subjects, and 5172 imaging sessions; 123 were visible but do not all appear to be public; 34 public data sets were listed under “Recent”
Open Connectome | serial electron microscopy and magnetic resonance | 2011 | Johns Hopkins University, Maryland, USA | 9 (graphs) | 7 image projects; 19 graphs
UCSF DataShare | biomedical, including neuroimaging, MRI, cognitive impairment, dementia, aging | 2011 | University of California at San Francisco, California, USA | 15 |
BrainLiner | various functional data | 2011 | ATR, Kyoto, Japan | 10 |
ModelDB | neuron models | 1996 | Yale University, Connecticut, USA | 875 |
NeuroMorpho | digitally reconstructed neurons | 2006 | George Mason University, Virginia, USA | 10004 |
Cell Image Library / Cell Centered Database | images, videos, and animations of cells | 2002 (CCDB), 2010 (CIL) | American Society for Cell Biology / University of California at San Diego, California, USA | 10,360 | The CCDB had 450 data sets when it merged with CIL; CIL also contains large imaging data sets that are not counted as separate images
CRCNS | computational neuroscience datasets | 2008 | University of California at Berkeley, California, USA | 38 |
OpenfMRI | fMRI | 2012 | University of Texas at Austin, Texas, USA | 22 |
“I finally gave NeuroMorpho my data so they would stop
63. Make your data machine-actionable
Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-
64. Use RRIDs in your papers, databases and journals!
• Antibody and model organism databases are adopting them
65. NIF Information Framework: query and alignment
• NIFSTD: an aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
• Available as services through NIF and BioPortal
[Diagram: NIFSTD modules: Organism; NS Function; Molecule (Macromolecule, Gene, Molecule Descriptors); Investigation (Techniques, Reagent, Protocols, Resource, Instrument); Subcellular Structure; Cell; Anatomical Structure; Dysfunction; Quality.]
NIF uses ontologies to enhance search and discovery but is not constrained by them.