Big data from small data: A deep survey of the neuroscience landscape data via

Maryann Martone - Keynote at Bernstein Computational Neuroscience conference, Munich, 2012

  • Doesn’t do it well; doesn’t organize the results in a domain specific way; doesn’t search across it. For use as content goal: Dynamic inventory for deep coverage of neuroscience data: Genes -> Systems
  • Should this say collation or collection?

    1. 1. Big data from small data: A deep survey of the neuroscience landscape data via the Neuroscience Information Framework. Maryann Martone, Ph.D., University of California, San Diego
    2. 2. “Neural Choreography”: “A grand challenge in neuroscience is to elucidate brain function in relation to its multiple layers of organization that operate at different spatial and temporal scales. Central to this effort is tackling “neural choreography” -- the integrated functioning of neurons into brain circuits. Neural choreography cannot be understood via a purely reductionist approach. Rather, it entails the convergent use of analytical and synthetic tools to gather, analyze and mine information from each level of analysis, and capture the emergence of new layers of function (or dysfunction) as we move from studying genes and proteins, to cells, circuits, thought, and behavior. ... However, the neuroscience community is not yet fully engaged in exploiting the rich array of data currently available, nor is it adequately poised to capitalize on the forthcoming data explosion.” Akil et al., Science, Feb 11, 2011
    3. 3. “Data choreography”. In that same issue of Science: • Science asked its peer reviewers from last year about the availability and use of data • About half of those polled store their data only in their laboratories, not an ideal long-term solution. • Many bemoaned the lack of common metadata and archives as a main impediment to using and storing data, and most of the respondents have no funding to support archiving • And even where accessible, much data in many fields is too poorly organized to enable it to be efficiently used. • “...it is a growing challenge to ensure that data produced during the course of reported research are appropriately described, standardized, archived, and available to all.” Lead Science editorial (Science, 11 February 2011, Vol. 331, no. 6018, p. 649)
    4. 4. A data federation problem. Multiple data types; multiple scales; multiple databases. No single technology serves these all equally well. Data types shown: whole brain data (20 µm microscopic MRI); mosaic LM images (1 GB+); conventional LM images; individual cell morphologies; EM volumes & reconstructions; solved molecular structures. Neuroscience is unlikely to be served by a few large databases like the genomics and proteomics community
    5. 5. • NIF is an initiative of the NIH Blueprint consortium of institutes • What types of resources (data, tools, materials, services) are available to the neuroscience community? • How many are there? • What domains do they cover? What domains do they not cover? • Where are they? Web sites; databases; literature; supplementary material; PDF files; desk drawers • Who uses them? • Who creates them? • How can we find them? • How can we make them better in the future? http://neuinfo.org
    6. 6. We need more databases (?) • NIF Registry: A catalog of neuroscience-relevant resources • > 5000 currently listed • > 2000 databases • And we are finding more every day
    7. 7. But we have Google! • Current web is designed to share documents • Documents are unstructured data • Much of the content of digital resources is part of the “hidden web” • Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.
    8. 8. NIF must work with the ecosystem as it is today. NIF has developed a production technology platform for researchers to discover, share, access, analyze, and integrate neuroscience-relevant information • Semantically-enabled search engine and interface that customizes results for neuroscience • System that searches the “hidden web”, i.e., content not well served by search engines • Data resources are predominantly relational, XML, text, RDF, OWL • Automated data harvesting technologies that produce dynamic indices of data content including databases, web pages, text, XML etc. • Tools to make products and data available • Designed to be populated rapidly; set up process for progressive refinement
    9. 9. NIF accomplishments • Assembled the largest searchable collation of neuroscience data on the web: data federation; resource registry (materials, data, tools, services); PubMed literature; full text of open access • The largest ontology for neuroscience • NIF search portal: simultaneous search over data, NIF catalog and biomedical literature • Neurolex Wiki: a community wiki serving neuroscience concepts • A unique technology platform • A reservoir of cross-disciplinary biomedical data expertise. NIF is poised to capitalize on the new tools and emphasis on big data and open science. UCSD, Yale, Cal Tech, George Mason, Washington Univ
    10. 10. NIF data federation. > 180 sources; 350 M records: NIF was designed to be populated rapidly, with progressive refinement of data. [Chart: percentage of data records per data type; categories: brain activation foci, animals, images, pathways, drugs, connectivity, antibodies, microarray, grants; label: Microarray 98%.] [Chart: percentage of data records per data type, everything but microarray.]
    11. 11. What do you mean by data? Databases come in many shapes and sizes. • Primary data: Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL) • Secondary data: Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS) • Tertiary data: Claims and assertions about the meaning of data, e.g., gene upregulation/downregulation • Registries: Metadata; pointers to data sets or materials stored elsewhere • Data aggregators: Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede • Single source: Data acquired within a single context, e.g., Allen Brain Atlas. Researchers are producing a variety of information artifacts using a multitude of technologies
    12. 12. What types of questions can I ask? We’d like to be able to find: What is known: • What is the average diameter of a Purkinje neuron? • Is GRM1 expressed in cerebral cortex? • What are the projections of hippocampus? • What genes have been found to be upregulated in chronic drug abuse in adults? • Is there a database of fMRI studies? • What studies used my polyclonal antibody against GABA in humans? • What rat strains have been used most extensively in research during the last 20 years? What is not known: • Connections among data • Gaps in knowledge. Without some sort of framework, this is very difficult to do
    13. 13. What are the connections of the hippocampus? Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn” • Query expansion: Synonyms and related concepts • Boolean queries • Data sources categorized by “data type” and level of nervous system • Tutorials for using full resource when getting there from NIF • Common views across multiple sources • Link back to record in original source
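The query expansion on the slide above can be sketched in a few lines. This is a minimal illustration, not NIF's implementation: the `SYNONYMS` table and the `expand_query` function are hypothetical names, and a real system would draw its synonyms from an ontology rather than a hard-coded dictionary.

```python
# Hypothetical synonym table (assumption for illustration only);
# a production system would look these up in an ontology.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def expand_query(term, synonyms=SYNONYMS):
    """Return a Boolean OR query over a term and its known synonyms."""
    variants = [term] + synonyms.get(term.lower(), [])
    # Quote multi-word variants so a search engine treats them as phrases.
    quoted = [f'"{v}"' if " " in v else v for v in variants]
    return " OR ".join(quoted)


print(expand_query("Hippocampus"))
```

A term with no entry in the table is returned unchanged, so the expansion degrades to an ordinary keyword search.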
    14. 14. Results are organized within a common framework. [Diagram: terms used by different resources: Target site; Synapsed by; innervates; Connects to; Input region; Synapsed with; Cellular contact; Projects to; Axon innervates; Subcellular contact; Source site.] Each resource implements a different, though related, model; systems are complex and difficult to learn, in many cases
    15. 15. The scourge of neuroanatomical nomenclature: Importance of NIF semantic framework • NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions • Brain Architecture Management System (rodent) • Temporal lobe.com (rodent) • Connectome Wiki (human) • Brain Maps (various) • CoCoMac (primate cortex) • UCLA Multimodal database (Human fMRI) • Avian Brain Connectivity Database (Bird) • Total: 1800 unique brain terms (excluding Avian) • Number of exact terms used in > 1 database: 42 • Number of synonym matches: 99 • Number of 1st order partonomy matches: 385
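To make the difference between exact-term matches and synonym matches concrete, here is a small sketch that counts terms appearing in more than one database, with and without a synonym map. The database names, term lists, and synonym map are toy data invented for illustration, and lower-casing is an assumed normalization step; none of this reproduces the analysis behind the numbers on the slide.

```python
from collections import Counter


def normalize(term, synonym_map):
    """Lower-case a term and map it to a preferred label if one is known."""
    key = term.strip().lower()
    return synonym_map.get(key, key)


def terms_in_multiple(databases, synonym_map=None):
    """Return the normalized terms that occur in more than one database."""
    synonym_map = synonym_map or {}
    counts = Counter()
    for terms in databases.values():
        # A set ensures each term is counted once per database.
        counts.update({normalize(t, synonym_map) for t in terms})
    return sorted(t for t, n in counts.items() if n > 1)


# Toy data (hypothetical, for illustration only).
databases = {
    "db_a": ["Hippocampus", "Caudate nucleus"],
    "db_b": ["hippocampus", "Ammon's horn"],
    "db_c": ["Cornu Ammonis", "Striatum"],
}
synonyms = {"ammon's horn": "cornu ammonis"}

print(terms_in_multiple(databases))            # string matches only
print(terms_in_multiple(databases, synonyms))  # string plus synonym matches
```

Adding the synonym map surfaces an overlap between `db_b` and `db_c` that string comparison alone misses, which is the effect the slide's counts point at.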
    16. 16. NIF’s minimum requirements for effective data sharing • You (and the machine) have to be able to find it: accessible through the web; annotations • You have to be able to use it: data type specified and in a usable form • You have to know what the data mean: some semantics; context (experimental metadata); provenance (where did the data come from?). Reporting neuroscience data within a consistent framework helps enormously
    17. 17. What is an ontology? • Ontology: an explicit, formal representation of concepts and relationships among them within a particular domain that expresses human knowledge in a machine readable form, e.g., Gene ontologies • Branch of philosophy: a theory of what is. [Diagram: Brain has a Cerebellum; Cerebellum has a Purkinje Cell Layer; Purkinje Cell Layer has a Purkinje cell; Purkinje cell is a neuron.]
    18. 18. “You need to use ontology identifiers instead of strings.” “Blah, blah, ontology blah.” “Ontology as mathematics, computer science or esperanto” - Andrey Rzhetsky and James A. Evans
    19. 19. What can ontology do for us? “Esperanto!” Express neuroscience concepts in a way that is machine readable • Classes are identified by unique identifiers • Synonyms, lexical variants • Definitions • Provide means of disambiguation of strings: nucleus part of cell; nucleus part of brain; nucleus part of atom • Rules by which a class is defined, e.g., a GABAergic neuron is a neuron that releases GABA as a neurotransmitter • Properties. Provide universals for navigating across different data sources • Semantic “index” • Perform reasoning • Link data through relationships, not just one-to-one mappings • “Concept-based queries”
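The disambiguation point above (one string, several concepts) can be illustrated with a tiny class registry. The identifiers `EX:0001` to `EX:0003` and the `OntologyClass` structure are made up for this sketch; they are not NIFSTD identifiers or any library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OntologyClass:
    identifier: str       # unique identifier (hypothetical values below)
    label: str            # human-readable string, possibly ambiguous
    part_of: str          # what the concept is a part of
    synonyms: tuple = ()  # lexical variants


# Three distinct concepts that share the label "nucleus".
CLASSES = [
    OntologyClass("EX:0001", "nucleus", "cell"),
    OntologyClass("EX:0002", "nucleus", "brain"),
    OntologyClass("EX:0003", "nucleus", "atom"),
]


def find_by_label(label):
    """A string lookup can return several classes."""
    wanted = label.lower()
    return [
        c for c in CLASSES
        if c.label.lower() == wanted or wanted in (s.lower() for s in c.synonyms)
    ]


def find_by_id(identifier):
    """An identifier lookup returns at most one class."""
    for c in CLASSES:
        if c.identifier == identifier:
            return c
    return None


print([c.part_of for c in find_by_label("Nucleus")])
print(find_by_id("EX:0002"))
```

Annotating data with the identifier rather than the label removes the need to guess which "nucleus" was meant.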
    20. 20. Power of unique identifiers: Are you the M Martone who... • The Gene Wiki: community intelligence applied to human gene annotation. Huss JW 3rd, Lindenbaum P, Martone M, Roberts D, Pizarro A, Valafar F, Hogenesch JB, Su AI. Nucleic Acids Res. 2010 Jan;38(Database issue):D633-9. • Ontologies for Neuroscience: What are they and What are they Good for? Larson SD, Martone ME. Front Neurosci. 2009 May;3(1):60-7. Epub 2009 May 1. • Three-dimensional electron microscopy reveals new details of membrane systems for Ca2+ signaling in the heart. Hayashi T, Martone ME, Yu Z, Thor A, Doi M, Holst MJ, Ellisman MH, Hoshijima M. J Cell Sci. 2009 Apr 1;122(Pt 7):1005-13. • Some analyses of forgetting of pictorial material in amnesic and demented patients. Martone M, Butters N, Trauner D. J Clin Exp Neuropsychol. 1986 Jun;8(3):161-78. • Traumatic brain injury and the goals of care. Martone M. Hastings Cent Rep. 2006 Mar-Apr;36(2):3. • Three-dimensional pattern of enkephalin-like immunoreactivity in the caudate nucleus of the cat. Groves PM, Martone M, Young SJ, Armstrong DM. J Neurosci. 1988 Mar;8(3):892-900.
    21. 21. I am not a number (but I should be) • Full URI (Uniform Resource Identifier): http://orcid.org/1234567 • Label: Maryann Elizabeth Martone • Synonym: ME Martone; M Martone; Martone, Maryann • Abbreviation: MEM • Is a • Has a • Is that entity which has these properties. [Diagram labels: Dept of Psychiatry, UCSD; Boston VA Hospital; Female; Nelson Butters; Publications.] Text mining algorithms can discover a lot of things about me. ORCID project: Author IDs
    22. 22. NIF Semantic Framework: NIFSTD ontology. [Diagram: NIFSTD modules: Organism; Anatomical Structure; Cell; Subcellular structure; Molecule; Macromolecule; Gene; Molecule Descriptors; NS Function; Dysfunction; Quality; Investigation; Techniques; Resource; Instrument; Reagent; Protocols.] • NIF covers multiple structural scales and domains of relevance to neuroscience • Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology • Simple, basic “is a” hierarchies that can be used “as is” or to form the building blocks for more complex representations
    23. 23. “We studied the behavior of CA2-binding proteins in Ca2 neurons under high and low Ca2 conditions” NIF queries across 170+ independent databases [shown: BioGrid; Allen Brain Atlas; Brain Info]
    24. 24. But you don’t have what I need! • Provide a simple framework for defining the concepts required: cell, part of brain, subcellular structure, molecule • Community based: communities contribute their vocabularies; reconcile and align concepts used by different domains • Each concept gets its own unique identifier • Creating a computable index for neuroscience data • INCF Demo D03. http://neurolex.org Stephen Larson/INCF
    25. 25. Concept-based search: search by meaning • Search Google: GABAergic neuron • Search NIF: GABAergic neuron • NIF automatically searches for types of GABAergic neurons. [Figure: types of GABAergic neurons]
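One way to picture concept-based search is as a walk down an "is a" hierarchy: a query for a class is expanded to include every subclass. The sketch below assumes a toy hierarchy held in a dictionary (`IS_A`, a hypothetical name); it illustrates the idea and is not a description of how NIF implements it.

```python
# Toy "is a" hierarchy: child -> parent (illustrative subset only).
IS_A = {
    "Purkinje cell": "GABAergic neuron",
    "Medium spiny neuron": "GABAergic neuron",
    "GABAergic neuron": "Neuron",
    "Glutamatergic neuron": "Neuron",
}


def descendants(concept, is_a=IS_A):
    """Collect every class that is a direct or indirect subclass of `concept`."""
    found = set()
    frontier = [concept]
    while frontier:
        parent = frontier.pop()
        for child, child_parent in is_a.items():
            if child_parent == parent and child not in found:
                found.add(child)
                frontier.append(child)
    return sorted(found)


def expand_concept(concept, is_a=IS_A):
    """Return the search terms for a concept: itself plus all its subclasses."""
    return [concept] + descendants(concept, is_a)


print(expand_concept("GABAergic neuron"))
```

A plain string search for "GABAergic neuron" would miss records that mention only "Purkinje cell"; the expanded term list catches them.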
    26. 26. Esperanto! “The trouble is that if I make up all of my own URIs, my [data] has no meaning to anyone else unless I explain what each URI is intended to denote or mean. Two [data sets] with no URIs in common have no information that can be interrelated.” (http://www.rdfabout.com/intro/#Introducing%20RDF) • NIF favors reuse of identifiers rather than mapping • NIF imports many ontologies • Creating ontologies to be used as common building blocks: modularity and low semantic overhead are important • Many community ontologies available covering multiple domains • NIFSTD available via web services • Bioportal (http://bioportal.bioontology.org/)
    27. 27. NIF Analytics: The Neuroscience Ecosystem. Where are the data? [Chart: data source by brain region; labels: Brain; Striatum; Hypothalamus; Olfactory bulb; Cerebral cortex.] NIF is in a unique position to answer questions about the neuroscience ecosystem. Vadim Astakhov, Kepler Workflow Engine
    28. 28. Whither neuroscience information? • What is potentially knowable: ∞ • What is known: literature, images, human knowledge. Unstructured; natural language processing, entity recognition, image processing and analysis; communication • What is easily machine processable and accessible
    29. 29. Open world meets closed world. But... NIF has > 900,000 antibodies, 250,000 model organisms, and 3 million microarray records. [Figure: query for “reference” brain structures and their parts in NIF Connectivity database]
    30. 30. Gender bias. NIF can start to answer interesting questions about neuroscience research, not just about neuroscience. NIF Reports: Male vs Female
    31. 31. What have we learned: Grabbing the long tail of small data • Analysis of NIF shows multiple databases with similar scope and content • Many contain partially overlapping data • Data “flows” from one resource to the next: data is reinterpreted, reanalyzed or added to • Is duplication good or bad?
    32. 32. Embracing duplication: Data Mash ups • NIF queries across 3 of approximately 10 fMRI databases • ~300 PMIDs were common between Brede and SUMSdb • PMID serves as a unique identifier for an article • Same information; value added. Same data; different aspects
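Because the PMID acts as a shared key, a mash-up of the kind described above reduces to grouping records by that key. The following sketch uses invented records and placeholder PMIDs; the source names are borrowed from the slide purely as labels, and the field names are assumptions.

```python
def mash_up(sources):
    """Group records from several sources by PMID.

    `sources` maps a source name to a list of record dicts with a "pmid" key.
    Returns {pmid: {source_name: record}}.
    """
    merged = {}
    for name, records in sources.items():
        for record in records:
            merged.setdefault(record["pmid"], {})[name] = record
    return merged


def shared_pmids(merged, min_sources=2):
    """Return the PMIDs that appear in at least `min_sources` sources."""
    return sorted(p for p, by_source in merged.items() if len(by_source) >= min_sources)


# Placeholder records (hypothetical values, not real database content).
sources = {
    "Brede": [{"pmid": "1001", "foci": 12}, {"pmid": "1002", "foci": 5}],
    "SUMSdb": [{"pmid": "1002", "surface_map": True}, {"pmid": "1003", "surface_map": False}],
}

merged = mash_up(sources)
print(shared_pmids(merged))   # articles covered by both sources
print(merged["1002"])         # the combined view of one article
```

Each source contributes a different aspect of the same article, so the merged entry carries more than either record alone.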
    33. 33. Same data: different analysis. Chronic vs acute morphine in striatum • Gemma: Gene ID + Gene Symbol • DRG: Gene name + Probe ID • Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases • Analysis: 1370 statements from Gemma regarding gene expression as a function of chronic morphine; 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis; results for 1 gene were opposite in DRG and Gemma; 45 did not have enough information provided in the paper to make a judgment
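The baseline issue above matters in practice: before two sources can be compared, their directions of change have to be put in the same frame. The sketch below flips one source's direction and then sorts claims into consistent, opposite, and missing. Gene names and values are placeholders, and the dictionaries are an assumed shape rather than either database's real format. For scale, 617 of 1370 statements is about 45%, which is why more than half are reported as unconfirmed.

```python
def flip(direction):
    """Invert a direction of change ("up" <-> "down")."""
    return {"up": "down", "down": "up"}[direction]


def compare(gemma, drg):
    """Compare per-gene directions from two sources with opposite baselines.

    `gemma`: gene -> direction relative to the chronic-morphine baseline.
    `drg`:   gene -> direction relative to saline.
    Gemma's direction is flipped into DRG's frame before comparing.
    """
    result = {"consistent": [], "opposite": [], "missing": []}
    for gene, direction in gemma.items():
        if gene not in drg:
            result["missing"].append(gene)
        elif flip(direction) == drg[gene]:
            result["consistent"].append(gene)
        else:
            result["opposite"].append(gene)
    return result


# Placeholder genes and directions (hypothetical).
gemma = {"geneA": "down", "geneB": "up", "geneC": "up"}
drg = {"geneA": "up", "geneB": "up"}

print(compare(gemma, drg))
```

Without the flip, `geneA` would look contradictory and `geneB` would look confirmed, the reverse of what the two sources actually say.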
    34. 34. Taking a global view on data: microculture to ecosystem. Several powerful trends should change the way we think about our data: One → Many • Many data: generation of data is getting easier → shared data; data space is getting richer: more –omes every day; but... compared to the biological space, still sparse • Many eyes: wisdom of crowds; more than one way to interpret data • Many algorithms: not a single way to analyze data • Many analytics: “signatures” in data may not be directly related to the question for which they were acquired but tell us something really interesting. Are you exposing or burying your work?
    35. 35. The future of scientific communication • We have learned over the years how to write a scientific paper for other humans to read and for other agents to index • We now have to learn how to write papers for automated agents (and their humans) to mine • We have learned over the years to report data in papers for humans to read • We now have to learn how to publish data in a form and on a suitable platform for automated agents (and their humans) to mine. [Images: Printing press; Linked data cloud; Watson.] Reporting neuroscience data within a consistent framework helps enormously
    36. 36. Why does it matter? 47/50 major preclinical published cancer studies could not be replicated • “The scientific community assumes that the claims in a preclinical study can be taken at face value - that although there might be some errors in detail, the main message of the paper can be relied on and the data will, for the most part, stand the test of time. Unfortunately, this is not always the case.” • “There are no guidelines that require all data sets to be reported in a paper; often, original data are removed during the peer review and publication process.” • Getting data out sooner in a form where they can be exposed to many eyes and many analyses, and easily compared, may allow us to expose errors and develop better metrics to evaluate the validity of data. Begley and Ellis, Nature, Vol. 483, 29 March 2012, p. 531. Data, not just stories about them!
    37. 37. Register your resource to NIF! 1. “How do I share my data?” 2. “There is no database for my data” 3. Community database: beginning 4. Community database: end. [Diagram labels: Institutional repositories; Cloud; INCF: Global infrastructure; Education; Industry; Government.] NIF is designed to leverage existing investments in resources and infrastructure
    38. 38. It’s a messy ecosystem (and that’s OK). NIF favors a hybrid, tiered, federated system. Tiers: domain knowledge (Gene; Organism; Neuron; Brain part; Disease); claims about results (Caudate projects to Snpc; Grm1 is upregulated in chronic cocaine; Betz cells degenerate in ALS); data; narrative. • Ontologies • Virtuoso RDF triples • Data federation • Workflows
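The three example claims on the slide above have a natural subject-predicate-object shape, which is what an RDF triple captures. The sketch below stores them as plain Python tuples and matches them against a pattern with wildcards. The predicate names are invented for illustration, and a real triple store would use URIs and a query language such as SPARQL instead.

```python
# Claims from the slide written as (subject, predicate, object) tuples.
# Predicate spellings are assumptions made for this sketch.
TRIPLES = [
    ("Caudate", "projects_to", "Snpc"),
    ("Grm1", "is_upregulated_in", "chronic cocaine"),
    ("Betz cell", "degenerates_in", "ALS"),
]


def match(pattern, triples=TRIPLES):
    """Return triples matching a (subject, predicate, object) pattern.

    `None` in any position acts as a wildcard.
    """
    return [
        t for t in triples
        if all(p is None or p == value for p, value in zip(pattern, t))
    ]


print(match(("Caudate", None, None)))          # everything asserted about the caudate
print(match((None, "degenerates_in", "ALS")))  # what degenerates in ALS?
```

Because every claim has the same shape, claims from different sources can be pooled and queried together once they share identifiers for subjects and objects.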
    39. 39. Future of Research Communications and e-Scholarship. FORCE11: http://force11.org • Founded by Phil Bourne, Tim Clark, Ed Hovy, Anita de Waard and Ivan Herman • Bring together stakeholders with an interest in moving scholarly communication beyond reliance on papers and traditional impact metrics • Beyond the PDF 2: Spring 2013
    40. 40. NIF team (past and present): Jeff Grethe, UCSD, Co Investigator, Interim PI; Amarnath Gupta, UCSD, Co Investigator; Anita Bandrowski, NIF Project Leader; Gordon Shepherd, Yale University; Perry Miller; Luis Marenco; Rixin Wang; David Van Essen, Washington University; Erin Reid; Paul Sternberg, Cal Tech; Arun Rangarajan; Hans Michael Muller; Yuling Li; Giorgio Ascoli, George Mason University; Sridevi Polavarum; Fahim Imam, NIF Ontology Engineer; Larry Lui; Andrea Arnaud Stagg; Jonathan Cachat; Jennifer Lawrence; Lee Hornbrook; Binh Ngo; Vadim Astakhov; Xufei Qian; Chris Condit; Mark Ellisman; Stephen Larson; Willie Wong; Tim Clark, Harvard University; Paolo Ciccarese; Karen Skinner, NIH, Program Officer
    41. 41. Why do we create so many overlapping products? • “That which I cannot build, I cannot understand”: Don’t trust any data you haven’t generated; Oh, now I see what you are saying; Scientists know the domain, not informatics • Science is incremental; we build on the results of others: It’s ingrained in our culture; “Build a better mousetrap and the world will beat down our doors”; Little credit for making someone else’s product better • Yes, we are planning to do that...: We are all time and resource constrained; We extend projects in time • There’s more than one way to skin a cat...: We are still mastering the medium; Technology is developing fast
    42. 42. “You need to use ontology identifiers instead of strings.” “Blah, blah, ontology blah.” When I talk to resource providers, neuroscientists (and journal editors)...
