Big data from small data: A deep survey of the neuroscience landscape data via the Neuroscience Information Framework

Maryann Martone, Ph.D.
University of California, San Diego

“Neural Choreography”
“A grand challenge in neuroscience is to elucidate brain function in relation
to its multiple layers of organization that operate at different spatial and
temporal scales. Central to this effort is tackling “neural choreography” --
the integrated functioning of neurons into brain circuits ... Neural
choreography cannot be understood via a purely reductionist approach.
Rather, it entails the convergent use of analytical and synthetic tools to
gather, analyze and mine information from each level of analysis, and
capture the emergence of new layers of function (or dysfunction) as we
move from studying genes and proteins, to cells, circuits, thought, and
behavior....

However, the neuroscience community is not yet fully engaged in exploiting the
rich array of data currently available, nor is it adequately poised to capitalize
on the forthcoming data explosion.”
Akil et al., Science, Feb 11, 2011

“Data choreography”
 In that same issue, Science asked its peer reviewers from the previous year about the
availability and use of data
 About half of those polled store their data only in their
laboratories—not an ideal long-term solution.
 Many bemoaned the lack of common metadata and archives as a
main impediment to using and storing data, and most of the
respondents have no funding to support archiving
 And even where accessible, much data in many fields is too poorly
organized to enable it to be efficiently used.

 “...it is a growing challenge to ensure that data produced during the
course of reported research are appropriately
described, standardized, archived, and available to all.” Lead Science
editorial (Science 11 February 2011: Vol. 331 no. 6018 p. 649 )

A data federation problem

Multiple data types; multiple scales; multiple databases
No single technology serves these all equally well.
Neuroscience is unlikely to be served by a few large databases like the genomics and proteomics community
[Figure: data types across scales: whole brain data (20 um microscopic MRI); mosaic LM images (1 GB+); conventional LM images; individual cell morphologies; EM volumes & reconstructions; solved molecular structures]

 NIF is an initiative of the NIH Blueprint consortium of institutes
 What types of resources (data, tools, materials, services) are
available to the neuroscience community?
 How many are there?
 What domains do they cover? What domains do they not cover?
 Where are they?
 Web sites
 Databases
 Literature
 Supplementary material
 PDF files
 Desk drawers
 Who uses them?
 Who creates them?
 How can we find them?
 How can we make them better in the future? http://neuinfo.org

We need more databases (?)

•NIF Registry: A
catalog of
neuroscience-relevant
resources
•> 5000 currently
listed
•> 2000 databases
•And we are finding
more every day

But we have Google!

 Current web is designed to share documents
 Documents are unstructured data
 Much of the content of digital resources is part of the “hidden web”
 Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.

NIF must work with ecosystem as
it is today
 NIF has developed a production technology platform for
researchers to discover, share, access, analyze, and
integrate neuroscience-relevant information
 Semantically-enabled search engine and interface that customizes
results for neuroscience
 System that searches the “hidden web”, i.e., content not well served by
search engines
 Data resources are predominantly relational, xml, text, rdf, owl
 Automated data harvesting technologies that produce dynamic indices
of data content including databases, web pages, text, xml etc.
 Tools to make products and data available
 Designed to be populated rapidly; set up process for progressive
refinement

NIF accomplishments
 Assembled the largest searchable collation of neuroscience data on the web
 Data federation
 Resource registry (materials, data, tools, services)
 PubMed literature
 Full text of open access
 The largest ontology for neuroscience
 NIF search portal: simultaneous search over data, NIF catalog and biomedical literature
 Neurolex Wiki: a community wiki serving neuroscience concepts
UCSD, Yale, Cal Tech, George Mason, Washington Univ

NIF is poised to capitalize on the new tools and emphasis on big data and open science
 A unique technology platform
 A reservoir of cross-disciplinary biomedical data expertise

NIF data federation
> 180 sources; 350 M records: NIF was designed to be populated rapidly, with progressive refinement of data
[Pie charts: percentage of data records per data type, shown for all data and for everything but microarray. Labels: Brain activation foci, Animals, Images, Pathways, Drugs, Connectivity, Antibodies, Microarray, Grants; one slice is labeled 98%]

What do you mean by data?
Databases come in many shapes and sizes
 Primary data:
 Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
 Secondary data
 Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
 Tertiary data
 Claims and assertions about the meaning of data
 E.g., gene upregulation/downregulation
 Registries:
 Metadata
 Pointers to data sets or materials stored elsewhere
 Data aggregators
 Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
 Single source
 Data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies

What types of questions can I ask?
We’d like to be able to find:
 What is known:
 What is the average diameter of a Purkinje neuron?
 Is GRM1 expressed in cerebral cortex?
 What are the projections of hippocampus?
 What genes have been found to be upregulated in
chronic drug abuse in adults?
 Is there a database of fMRI studies?
 What studies used my polyclonal antibody against
GABA in humans?
 What rat strains have been used most
extensively in research during the last 20 years?

 What is not known:
 Connections among data
 Gaps in knowledge
Without some sort of framework, this is very difficult to do

What are the connections of the
hippocampus?
Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”
[Screenshot annotations:]
 Query expansion: Synonyms and related concepts
 Boolean queries
 Data sources categorized by “data type” and level of nervous system
 Tutorials for using full resource when getting there from NIF
 Common views across multiple sources
 Link back to record in original source
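
The query expansion on this slide can be sketched in a few lines of Python. This is only an illustration, not NIF's implementation: the `SYNONYMS` table and the `expand_query` function are hypothetical names, and only the hippocampus synonyms come from the slide.

```python
# Hypothetical synonym table; the hippocampus entries are taken from the slide.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def expand_query(term: str) -> str:
    """Return a Boolean OR query covering the term and its known synonyms."""
    variants = [term] + SYNONYMS.get(term.lower(), [])
    # Quote multi-word variants so they are searched as phrases.
    quoted = [f'"{v}"' if " " in v else v for v in variants]
    return " OR ".join(quoted)


print(expand_query("Hippocampus"))
```

A term with no entry in the table is returned unchanged, so the expansion never narrows a search.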

Results are organized within a common
framework

[Figure: relationship terms used by the different sources: Target site; Synapsed by; innervates; Connects to; Input region; Synapsed with; Cellular contact; Projects to; Axon innervates; Subcellular contact; Source site]
Each resource implements a different, though related model;
systems are complex and difficult to learn, in many cases

The scourge of neuroanatomical nomenclature:
Importance of NIF semantic framework
•NIF Connectivity: 7 databases containing connectivity primary data or claims
from literature on connectivity between brain regions
•Brain Architecture Management System (rodent)
•Temporal lobe.com (rodent)
•Connectome Wiki (human)
•Brain Maps (various)
•CoCoMac (primate cortex)
•UCLA Multimodal database (Human fMRI)
•Avian Brain Connectivity Database (Bird)

•Total: 1800 unique brain terms (excluding Avian)

•Number of exact terms used in > 1 database: 42
•Number of synonym matches: 99
•Number of 1st order partonomy matches: 385
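
Counts of this kind can be obtained by comparing the term sets of each database, first as raw strings and then after mapping synonyms to a preferred label. The sketch below uses made-up term lists and a hypothetical `PREFERRED` synonym map; it does not reproduce the numbers on the slide.

```python
from collections import Counter

# Toy, made-up term lists standing in for three connectivity databases.
DATABASE_TERMS = {
    "db_a": {"hippocampus", "striatum"},
    "db_b": {"striatum", "cerebellum"},
    "db_c": {"ammon's horn", "caudate nucleus"},
}

# Hypothetical synonym map: variant label -> preferred label.
PREFERRED = {"ammon's horn": "hippocampus"}


def shared_terms(term_sets, normalize=lambda t: t):
    """Return the terms that occur in more than one database."""
    counts = Counter()
    for terms in term_sets.values():
        # Normalize inside a set so each database counts a term at most once.
        counts.update({normalize(t) for t in terms})
    return {t for t, n in counts.items() if n > 1}


exact = shared_terms(DATABASE_TERMS)
with_synonyms = shared_terms(DATABASE_TERMS, lambda t: PREFERRED.get(t, t))
print(sorted(exact))
print(sorted(with_synonyms))
```

In this toy data, only one term matches exactly across databases; resolving synonyms recovers a second match that string comparison misses.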

NIF’s minimum requirements for
effective data sharing
 You (and the machine) have to be able to
find it
 Accessible through the web
 Annotations
 You have to be able to use it
 Data type specified and in a usable form
 You have to know what the data mean
 Some semantics
 Context: Experimental metadata
 Provenance: Where did the data come from?

Reporting neuroscience data within a consistent framework helps enormously

What is an ontology?

 Ontology: an explicit, formal representation of concepts and relationships among them within a particular domain that expresses human knowledge in a machine readable form
 Branch of philosophy: a theory of what is
 e.g., Gene ontologies
[Diagram: Brain has a Cerebellum; Cerebellum has a Purkinje Cell Layer; Purkinje Cell Layer has a Purkinje cell; Purkinje cell is a neuron]
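
To make "machine readable" concrete, the relationships in the diagram can be written as subject-relation-object triples and traversed by a program. This is a minimal sketch; the `TRIPLES` list and the `parts_of` helper are illustrative names, not part of NIFSTD.

```python
# Relationships from the slide's diagram, stored as (subject, relation, object).
TRIPLES = [
    ("Brain", "has_a", "Cerebellum"),
    ("Cerebellum", "has_a", "Purkinje Cell Layer"),
    ("Purkinje Cell Layer", "has_a", "Purkinje cell"),
    ("Purkinje cell", "is_a", "Neuron"),
]


def parts_of(entity, triples=TRIPLES):
    """Return everything reachable from entity by following has_a links."""
    found = []
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for subj, rel, obj in triples:
            if subj == current and rel == "has_a" and obj not in found:
                found.append(obj)
                frontier.append(obj)
    return found


print(parts_of("Brain"))
```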

You need to use
ontology
identifiers instead
of strings

Blah, blah,
ontology blah

“Ontology as mathematics, computer science or Esperanto” -
Andrey Rzhetsky and James A. Evans

What can ontology do for us?
“Esperanto!”

 Express neuroscience concepts in a way that is machine readable
 Classes are identified by unique identifiers
 Synonyms, lexical variants
 Definitions
 Provide means of disambiguation of strings
 Nucleus part of cell; nucleus part of brain; nucleus part of atom
 Rules by which a class is defined, e.g., a GABAergic neuron is a neuron that releases
GABA as a neurotransmitter
 Properties
 Provide universals for navigating across different data sources
 Semantic “index”
 Perform reasoning
 Link data through relationships not just one-to-one mappings
 “Concept-based queries”

Power of unique identifiers: Are you the M
Martone who...
The Gene Wiki: community intelligence applied to human gene annotation.
Huss JW 3rd, Lindenbaum P, Martone M, Roberts D, Pizarro A, Valafar F, Hogenesch
JB, Su AI. Nucleic Acids Res. 2010 Jan;38(Database issue):D633-9.

Ontologies for Neuroscience: What are they and What are they Good for? Larson
SD, Martone ME. Front Neurosci. 2009 May;3(1):60-7. Epub 2009 May 1.

Three-dimensional electron microscopy reveals new details of membrane systems for
Ca2+ signaling in the heart. Hayashi T, Martone ME, Yu Z, Thor A, Doi M, Holst
MJ, Ellisman MH, Hoshijima M. J Cell Sci. 2009 Apr 1;122(Pt 7):1005-13.

Some analyses of forgetting of pictorial material in amnesic and demented
patients. Martone M, Butters N, Trauner D. J Clin Exp Neuropsychol. 1986 Jun;8(3):161-78.

Traumatic brain injury and the goals of care. Martone M. Hastings Cent Rep. 2006 Mar-Apr;36(2):3.

Three-dimensional pattern of enkephalin-like immunoreactivity in the caudate nucleus of the
cat. Groves PM, Martone M, Young SJ, Armstrong DM. J Neurosci. 1988 Mar;8(3):892-900.

I am not a number (but I should
be)
 Full URI: Uniform Resource Identifier
 http://orcid.org/1234567
 Label: Maryann Elizabeth Martone
 Synonym: ME Martone, M M Martone, Martone, Maryann
 Abbreviation: MEM
 Is a
 Has a
 Is that entity which has these properties
[Diagram nodes: Dept of Psychiatry, UCSD; Boston VA Hospital; Female; Nelson Butters; Publications]
Text mining algorithms can discover a lot of things about me
ORCID project: Author ID’s
NIF Semantic Framework: NIFSTD ontology
[Diagram: NIFSTD modules: Organism; Anatomical Structure; Cell; Subcellular structure; Molecule; Macromolecule; Gene; Molecule Descriptors; NS Function; Dysfunction; Quality; Investigation; Techniques; Resource; Instrument; Reagent; Protocols]

 NIF covers multiple structural scales and domains of relevance to neuroscience
 Aggregate of community ontologies with some extensions for
neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
 Simple, basic “is a” hierarchies that can be used “as is” or to form the building blocks
for more complex representations

“We studied the behavior of CA2-binding proteins in
Ca2 neurons under high and low Ca2 conditions”

NIF queries across 170+ independent databases
[Screenshot labels: BioGrid; Allen Brain Atlas; Brain Info]

But you don’t have what I need!
•Provide a simple framework for
defining the concepts required
•Cell, Part of
brain, subcellular
structure, molecule

•Community based:
•Communities contribute
their vocabularies
•Reconcile and align
concepts used by different
domains

•Each concept gets its own
unique identifier

•Creating a computable index for
neuroscience data
•INCF Demo D03

http://neurolex.org Stephen Larson/INCF

Concept-based search: search by meaning
 Search Google: GABAergic neuron
 Search NIF: GABAergic neuron
 NIF automatically searches for types of
GABAergic neurons

Types of GABAergic
neurons
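
The difference between a string search and a concept-based search can be sketched as follows: given "is a" links, a query for a class is widened to every subclass. The `IS_A` table below is a small hand-written example, not an excerpt from NIFSTD, and `subclasses` and `concept_query` are hypothetical helper names.

```python
# Hypothetical is_a links (child -> parent); a hand-written toy hierarchy.
IS_A = {
    "Purkinje cell": "GABAergic neuron",
    "Medium spiny neuron": "GABAergic neuron",
    "GABAergic neuron": "Neuron",
    "Pyramidal cell": "Glutamatergic neuron",
    "Glutamatergic neuron": "Neuron",
}


def subclasses(concept, is_a=IS_A):
    """Return all direct and indirect subclasses of concept."""
    result = []
    for child, parent in is_a.items():
        if parent == concept:
            result.append(child)
            result.extend(subclasses(child, is_a))
    return result


def concept_query(concept):
    """Search terms for a concept: the concept itself plus every subclass."""
    return [concept] + subclasses(concept)


print(concept_query("GABAergic neuron"))
```

A record annotated only as "Purkinje cell" is then found by a search for "GABAergic neuron", even though the string never appears in the record.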

Esperanto!

 “The trouble is that if I make up all of my own URIs, my [data]
has no meaning to anyone else unless I explain what each URI is
intended to denote or mean. Two [data sets] with no URIs in
common have no information that can be interrelated.”
 NIF favors reuse of identifiers rather than mapping
 NIF imports many ontologies

 Creating ontologies to be used as common building blocks:
modularity, low semantic overhead, is important
 Many community ontologies available covering multiple domains
 NIFSTD available via web services
 Bioportal (http://bioportal.bioontology.org/)

http://www.rdfabout.com/intro/#Introducing%20RDF

NIF Analytics: The Neuroscience Ecosystem
Where are the data?
[Figure: data source vs. brain region; labels shown: Brain, Striatum, Hypothalamus, Olfactory bulb, Cerebral cortex]
NIF is in a unique position to answer questions about the neuroscience
ecosystem
Vadim Astakhov, Kepler Workflow Engine

Whither neuroscience information?

What is potentially knowable: ∞
What is known: Literature, images, human knowledge
 Unstructured; natural language processing, entity recognition, image processing and analysis; communication
What is easily machine processable and accessible
Open world meets closed world

But...NIF has > 900,000
antibodies, 250,000 model
organisms, and 3 million microarray
records

Query for “reference” brain structures and their parts in NIF Connectivity database

Gender bias

NIF can start to
answer interesting
questions about
neuroscience
research, not just
about neuroscience

NIF Reports:
Male vs Female

What have we learned: Grabbing
the long tail of small data
 Analysis of NIF shows
multiple databases with
similar scope and content

 Many contain partially
overlapping data

 Data “flows” from one
resource to the next
 Data is reinterpreted, reanalyzed or added to

 Is duplication good or bad?

Embracing duplication: Data Mash ups

•NIF queries across 3 of approximately 10 fMRI databases
•~300 PMIDs were common between Brede and SUMSdb
•PMID serves as a unique identifier for an article
•Same information; value added
Same data; different aspects
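
A mash-up of this kind amounts to a join on the shared identifier. The sketch below uses made-up records and field names (`foci`, `surface_map`) for two unnamed sources; it only illustrates how a PMID lets two views of the same article be combined.

```python
# Made-up records: each source keys its entries by PubMed ID (PMID).
source_a = {
    "1000001": {"foci": 12},
    "1000002": {"foci": 7},
}
source_b = {
    "1000002": {"surface_map": "map_a"},
    "1000003": {"surface_map": "map_b"},
}


def mash_up(left, right):
    """Merge the records of two sources for every PMID they share."""
    shared = left.keys() & right.keys()
    return {pmid: {**left[pmid], **right[pmid]} for pmid in shared}


combined = mash_up(source_a, source_b)
print(combined)
```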

Same data: different analysis
Chronic vs acute morphine in striatum
 Gemma: Gene ID + Gene Symbol
 DRG: Gene name + Probe ID

 Gemma presented results relative to baseline chronic
morphine; DRG with respect to saline, so direction of
change is opposite in the 2 databases

 Analysis:
 1370 statements from Gemma regarding gene expression as
a function of chronic morphine
 617 were consistent with DRG; over half of the claims of
the paper were not confirmed in this analysis
 Results for 1 gene were opposite in DRG and Gemma
 45 did not have enough information provided in the paper to
make a judgment
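
Because the two sources report change against different baselines, the direction from one of them has to be flipped before statements are compared. The following sketch shows that normalization step on made-up gene names and directions; `compare` is a hypothetical helper, and none of the values come from Gemma or DRG.

```python
# Made-up example statements; direction is +1 (up) or -1 (down).
# Source A reports change relative to a chronic-morphine baseline,
# source B relative to saline, so A's sign is flipped before comparing.
source_a = {"gene1": +1, "gene2": -1, "gene3": +1}
source_b = {"gene1": -1, "gene2": -1}


def compare(a, b, flip_a=True):
    """Classify each statement in a as consistent, opposite, or unmatched."""
    outcome = {}
    for gene, direction in a.items():
        if gene not in b:
            outcome[gene] = "unmatched"
            continue
        normalized = -direction if flip_a else direction
        outcome[gene] = "consistent" if normalized == b[gene] else "opposite"
    return outcome


print(compare(source_a, source_b))
```

Without the flip, every agreement would be counted as a contradiction, which is why the baseline has to be recorded alongside the result.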

Taking a global view on data:
microculture to ecosystem
 Several powerful trends should change the way we
think about our data: One  Many
 Many data
 Generation of data is getting easier  shared data
 Data space is getting richer: more –omes everyday
 But...compared to the biological space, still sparse
 Many eyes
 Wisdom of crowds
 More than one way to interpret data
 Many algorithms
 Not a single way to analyze data
 Many analytics
 “Signatures” in data may not be directly related to the question for
which they were acquired but tell us something really interesting

Are you exposing or burying your work?

The future of scientific
communication
 We have learned over the years how to write a scientific paper for other humans to read and for other agents to index
 We now have to learn how to write papers for automated agents (and their humans) to mine
 We have learned over the years to report data in papers for humans to read
 We now have to learn how to publish data in a form and on a suitable platform for automated agents (and their humans) to mine
[Images: Printing press; Linked data cloud; Watson]
Reporting neuroscience data within a consistent framework helps enormously

Why does it matter?
47/50 major preclinical published cancer studies could not be replicated
 “The scientific community assumes that the claims in a preclinical study can be taken at face value -- that although there might be some errors in detail, the main message of the paper can be relied on and the data will, for the most part, stand the test of time. Unfortunately, this is not always the case.”
 “There are no guidelines that require all data sets to be reported in a paper; often, original data are removed during the peer review and publication process.”
Begley and Ellis, 29 MARCH 2012 | VOL 483 | NATURE | 531
 Getting data out sooner in a form where they can be exposed to many eyes and many analyses, and easily compared, may allow us to expose errors and develop better metrics to evaluate the validity of data
Data, not just stories about them!

Register your resource to NIF!
“How do I share my data?”
“There is no database for my data”
1 Institutional repositories
2 Cloud
3 Community database: beginning
4 Community database: End
INCF: Global infrastructure
Education; Industry; Government

NIF is designed to leverage existing investments in resources and infrastructure

It’s a messy ecosystem (and that’s OK)
NIF favors a hybrid, tiered, federated system
 Domain knowledge
 Ontologies
 Claims about results
 Virtuoso RDF triples
 Data
 Data federation
 Workflows
 Narrative
[Diagram nodes: Gene; Organism; Neuron; Brain part; Disease]
[Example claims: Caudate projects to Snpc; Grm1 is upregulated in chronic cocaine; Betz cells degenerate in ALS]

Future of Research Communications
and e-Scholarship
 FORCE11: http://force11.org
 Founded by Phil Bourne, Tim
Clark, Ed Hovy, Anita de Waard
and Ivan Herman
 Bring together stakeholders with
an interest in moving scholarly
communication beyond reliance
on papers and traditional impact
metrics
 Beyond the PDF 2: Spring 2013

NIF team (past and present)
Jeff Grethe, UCSD, Co Investigator, Interim PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Karen Skinner, NIH, Program Officer
Fahim Imam, NIF Ontology Engineer
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Lee Hornbrook
Binh Ngo
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese

Why do we create so many
overlapping products?
“That which I cannot build, I cannot understand”
 Don’t trust any data you haven’t generated
 Oh, now I see what you are saying
 Scientists know the domain, not informatics
Science is incremental; we build on the results of others
 It’s ingrained in our culture
 “Build a better mousetrap and the world will beat down our doors”
 Little credit for making someone else’s product better
Yes, we are planning to do that...
 We are all time and resource constrained
 We extend projects in time
There’s more than one way to skin a cat....
 We are still mastering the medium
 Technology is developing fast
When I talk to resource providers, neuroscientists (and journal editors)...
You need to use ontology identifiers instead of strings
Blah, blah, ontology blah
