The real world of ontologies and phenotype representation: perspectives from the Neuroscience Information Framework

1,729 views


Presentation to the RCN Phenotype Summit meeting

Published in: Health & Medicine


  1. The real world of ontologies and phenotype representation: perspectives from the Neuroscience Information Framework. Maryann Martone, Ph.D., University of California, San Diego
  2. “Neural Choreography”: “A grand challenge in neuroscience is to elucidate brain function in relation to its multiple layers of organization that operate at different spatial and temporal scales. Central to this effort is tackling “neural choreography” -- the integrated functioning of neurons into brain circuits. Neural choreography cannot be understood via a purely reductionist approach. Rather, it entails the convergent use of analytical and synthetic tools to gather, analyze and mine information from each level of analysis, and capture the emergence of new layers of function (or dysfunction) as we move from studying genes and proteins, to cells, circuits, thought, and behavior.... However, the neuroscience community is not yet fully engaged in exploiting the rich array of data currently available, nor is it adequately poised to capitalize on the forthcoming data explosion.” Akil et al., Science, Feb 11, 2011
  3. “Data choreography”: In that same issue of Science
     • Asked peer reviewers from last year about the availability and use of data
     • About half of those polled store their data only in their laboratories, not an ideal long-term solution.
     • Many bemoaned the lack of common metadata and archives as a main impediment to using and storing data, and most of the respondents have no funding to support archiving
     • And even where accessible, much data in many fields is too poorly organized to enable it to be efficiently used.
     • “...it is a growing challenge to ensure that data produced during the course of reported research are appropriately described, standardized, archived, and available to all.” Lead Science editorial (Science 11 February 2011: Vol. 331 no. 6018 p. 649)
  4. NIF is an initiative of the NIH Blueprint consortium of institutes
     • What types of resources (data, tools, materials, services) are available to the neuroscience community?
     • How many are there?
     • What domains do they cover? What domains do they not cover?
     • Where are they?
       – Web sites
       – Databases
       – Literature
       – Supplementary material
       – PDF files
       – Desk drawers
     • Who uses them?
     • Who creates them?
     • How can we find them?
     • How can we make them better in the future?
     http://neuinfo.org
  5. In an ideal world... We’d like to be able to find:
     What is known:
     • What is the average diameter of a Purkinje neuron?
     • Is GRM1 expressed in cerebral cortex?
     • What are the projections of hippocampus?
     • What genes have been found to be upregulated in chronic drug abuse in adults?
     • Is alpha synuclein in the striatum?
     • What studies used my polyclonal antibody against GABA in humans?
     • What rat strains have been used most extensively in research during the last 20 years?
     What is not known:
     • Connections among data
     • Gaps in knowledge
     Required components:
     – Query interface
     – Search strategies
     – Data sources
     – Infrastructure
     – Results display
     – Why did I get this result?
     – Analysis tools
     Without some sort of framework, very difficult to
  6. The Neuroscience Information Framework: Discovery and utilization of web-based resources for neuroscience
     • A portal for finding and using neuroscience resources
     • A consistent framework for describing resources
     • Provides simultaneous search of multiple types of information, organized by category
     • Supported by an expansive ontology for neuroscience
     • Utilizes advanced technologies to search the “hidden web”
     Figure labels: Literature; Database Federation; Registry
     UCSD, Yale, Cal Tech, George Mason, Washington Univ. Supported by NIH Blueprint. http://neuinfo.org
  7. We need more databases!?
     • NIF Registry: a catalog of neuroscience-relevant resources
     • > 5000 currently listed
     • > 2000 databases
     • And we are finding more every day
  8. NIF must work with ecosystem as it is today
     • NIF was one of the first projects to attempt data integration in the neurosciences on a large scale
     • NIF is supported by a contract that specified the number of resources to be added per year
       – Designed to be populated rapidly; set up process for progressive refinement
       – No budget was allocated to retrofit existing resources; had to work with them in their current state
       – We designed a system that required little to no cooperation or work from providers
     • NIF was required to assemble (not create) ontologies very fast and to provide a platform through which the community could view, comment and add
       – NIF is enriched by ontologies but does not depend on them
       – Took advantage of community ontologies
       – But needed to take a very pragmatic and aggressive approach to incorporating and using them
       – Neurolex semantic wiki
  9. What are the connections of the hippocampus? Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”
     • Query expansion: synonyms and related concepts
     • Boolean queries
     • Data sources categorized by “data type” and level of nervous system
     • Tutorials for using full resource when getting there from NIF
     • Common views across multiple sources
     • Link back to record in original source
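The query expansion on this slide can be illustrated with a minimal sketch. This is not NIF's implementation: the `SYNONYMS` table and the `expand_query` function are hypothetical names, and only the three hippocampus terms are taken from the slide.

```python
# Hypothetical synonym table; the three hippocampus terms come from the slide.
SYNONYMS = {
    "hippocampus": ["Hippocampus", "Cornu Ammonis", "Ammon's horn"],
}


def expand_query(term, synonyms=SYNONYMS):
    """Return a Boolean OR query covering a term and its known synonyms."""
    # Terms without an entry in the table are passed through unchanged.
    variants = synonyms.get(term.lower(), [term])
    # Quote multi-word variants so they are searched as phrases.
    quoted = [f'"{v}"' if " " in v else v for v in variants]
    return " OR ".join(quoted)


print(expand_query("hippocampus"))
# prints: Hippocampus OR "Cornu Ammonis" OR "Ammon's horn"
```

A real system would draw the synonyms from an ontology rather than a hard-coded dictionary; the dictionary here only keeps the sketch self-contained.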
  10. Imminent: NIF 5.0
     • NIF 5.0 about to be released
     • New design
     • New query features
     • New analytics
  11. What do you mean by data? Databases come in many shapes and sizes
     • Primary data: data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
     • Secondary data: data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
     • Tertiary data: claims and assertions about the meaning of data, e.g., gene upregulation/downregulation, brain activation as a function
     • Registries: metadata; pointers to data sets or materials stored elsewhere
     • Data aggregators: aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
     • Single source: data acquired within a single context, e.g., Allen Brain Atlas
     Researchers are producing a variety of information resources using a multitude of technologies
  12. Exploration: Where is alpha synuclein?
     • Spatially:
       – Gene
       – Protein
       – Subcellular
       – Cellular
       – Regional
       – Organism
     • Semantically:
       – Gene regulation networks
       – Protein pathways
       – Cellular local connectivity
       – Regional connectivity
       – Who is studying it?
       – Who is funding its study?
     Networks exist across scales; all important in the nervous system
  13. NIFSTD Ontologies
     • Set of modular ontologies
       – 86,000+ distinct concepts + synonyms
       – Bridge files between modules
     • Expressed in OWL-DL language
       – Currently supports OWL 2
     • Tries to follow OBO community best practices
       – Standardized to the same upper level ontologies, e.g., Basic Formal Ontology (BFO), OBO Relations Ontology (OBO-RO)
       – Imports existing community ontologies, e.g., CHEBI, GO, PRO, DOID, OBI etc.
       – Retains identifiers in most recent additions but reflects history
     • Covers major domains of neuroscience: Organisms, Brain Regions, Cells, Molecules, Subcellular parts, Diseases, Nervous system functions, Techniques
     Fahim Imam, William Bug
  14. “Search computing”: Query by concept. What genes are upregulated by drugs of abuse in the adult mouse? (show me the data!) Query concepts shown: Morphine; Increased expression; Adult Mouse. Reasonable standards make it easy to search for and compare results
  15. New: Data analytics. Figure labels: Diseases of nervous system; Neoplastic disease of nervous system; Neurodegenerative; Seizure disorders; NIF data federated sources; NIH Reporter. NIF is in a unique position to answer questions about the neuroscience ecosystem using new analytics tools
  16. Results are organized within a common framework. Figure labels: Target site; Synapsed by; innervates; Connects to; Input region; Synapsed with; Cellular contact; Projects to; Axon innervates; Subcellular contact; Source site. Each resource implements a different, though related model; systems are complex and difficult to learn, in many cases
  17. NIF Concept Mapper
  18. The scourge of neuroanatomical nomenclature: Importance of NIF semantic framework
     • NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
       – Brain Architecture Management System (rodent)
       – Temporal lobe.com (rodent)
       – Connectome Wiki (human)
       – Brain Maps (various)
       – CoCoMac (primate cortex)
       – UCLA Multimodal database (Human fMRI)
       – Avian Brain Connectivity Database (Bird)
     • Total: 1800 unique brain terms (excluding Avian)
     • Number of exact terms used in > 1 database: 42
     • Number of synonym matches: 99
     • Number of 1st order partonomy matches: 385
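The difference between exact-term and synonym matches across databases can be sketched as follows. The database names come from the slide, but the term lists, the `PREFERRED` synonym table and the `shared_terms` function are illustrative assumptions, not the data or method behind the counts above.

```python
from collections import defaultdict

# Toy term lists: database names are from the slide, the terms are illustrative only.
TERMS_BY_DB = {
    "BAMS": {"hippocampus", "caudate nucleus"},
    "CoCoMac": {"hippocampus", "area 17"},
    "Connectome Wiki": {"Ammon's horn", "primary visual cortex"},
}

# Hypothetical synonym table mapping a variant name to one preferred label.
PREFERRED = {
    "ammon's horn": "hippocampus",
    "area 17": "primary visual cortex",
}


def shared_terms(terms_by_db, normalize=lambda t: t.lower()):
    """Return the normalized terms that occur in more than one database."""
    seen = defaultdict(set)
    for db, terms in terms_by_db.items():
        for term in terms:
            seen[normalize(term)].add(db)
    return {term for term, dbs in seen.items() if len(dbs) > 1}


# Exact matching only lower-cases the strings.
exact = shared_terms(TERMS_BY_DB)
# Synonym matching first maps each variant to its preferred label.
by_synonym = shared_terms(TERMS_BY_DB, lambda t: PREFERRED.get(t.lower(), t.lower()))

print(sorted(exact))       # ['hippocampus']
print(sorted(by_synonym))  # ['hippocampus', 'primary visual cortex']
```

Partonomy matching would add a further step that maps a term to the structure it is part of; that is left out to keep the sketch short.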
  19. Why so many names?
     • The brain is perhaps unique among major organ systems in the multiplicity of naming schemes for its major and minor regions.
     • The brain has been divided based on topology of major features, cyto- and myelo-architecture, developmental boundaries, supposed evolutionary origins, histochemistry, gene expression and functional criteria.
     • The gross anatomy of the brain reflects the underlying networks only superficially, and thus any parcellation reflects a somewhat arbitrary division based on one or more of these criteria.
     The “activation map” images that commonly accompany brain imaging papers can be misleading to inexperienced readers, by seeming to suggest that the boundaries between “activated” and “unactivated” patches of cortex are unambiguous and sharp. Instead, as most researchers are aware, the apparent sharp boundaries are subject to the choice of threshold applied to the statistical tests that generate the image. What, then, justifies dividing the cortex into regions with boundaries based on this fuzzy, mutable measure of functional profile? (Saxe et al., 2010, p. 39). Brainmaps.org
  20. Program on Ontologies for Neural Structures. International Neuroinformatics Coordinating Facility
     • Structural Lexicon Task Force
       – Defining brain structures
       – Translate among terminologies
     • Neuronal Registry Task Force
       – Consistent naming scheme for neurons
       – Knowledge base of neuron properties
     • Representation and Deployment Task Force
       – Formal representation
     Also interacts with Digital Atlasing Task Force. http://incf.org
  21. NeuroLex Wiki
     • Provide a simple framework for defining the concepts required
       – Light weight semantics
       – Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
     • Community based:
       – Anyone can contribute their terms, concepts, things
       – Anyone can edit
       – Anyone can link
     • Accessible: searched by Google
     • Building an extensive cross-disciplinary knowledge base for neuroscience
     Demo D03. http://neurolex.org Stephen Larson
  22. Defining nervous system structures. Parcellation scheme: Set of parcels occupying part or all of an anatomical entity that has been delineated using a common approach or set of criteria, often in a single study. A parcellation scheme for any given individual entity may include gaps, transitional zones, or regions of uncertainty. A parcellation scheme derived from a set of individuals registered to a common target (atlas) may be probabilistic and include overlap of parcels in regions that reflect individual variability or imperfections in alignment. 14 parcellation schemes currently represented in Neurolex. Documentation available. INCF task force on ontologies
  23. Basic model: do not conflate conceptual structures with parcels. Diagram labels: Regional part of nervous system; Functional part of nervous system; Parcel (three instances); relation: overlaps. Neuroscientists have a lot of different parcellation schemes because they have a lot of different ways of classifying brain structures and techniques to match them are imperfect
  24. Linking semantics to space: INCF Atlasing. www.neurolex.org; Waxholm space; Link to spatial representation in scalable brain atlas. Seth Ruffins, Alan Ruttenberg, Rembrandt Bakker
  25. Neurons in Neurolex
     • International Neuroinformatics Coordinating Facility (INCF) building a knowledge base of neurons and their properties via the Neurolex Wiki
     • Led by Dr. Gordon Shepherd
     • Consistent and parseable naming scheme
     • Knowledge is readily accessible, editable and computable
     • While structure is imposed, don’t worry too much about the upper level classes of the ontology
     Stephen Larson
  26. A KNOWLEDGE BASE OF NEURONAL PROPERTIES. Additional semantics added in NIFSTD by ontology engineer
  27. Concept-based search: search by meaning
     • Search Google: GABAergic neuron
     • Search NIF: GABAergic neuron
       – NIF automatically searches for types of GABAergic neurons
     Types of GABAergic neurons
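Searching by meaning, as opposed to by string, can be sketched as expanding a concept to all of its subclasses before the search runs. The `SUBCLASSES` hierarchy below is a small illustrative assumption, and `descendants` and `concept_query` are hypothetical helper names rather than NIF functions.

```python
# Toy subclass hierarchy (parent -> direct children); illustrative only.
SUBCLASSES = {
    "GABAergic neuron": ["Purkinje cell", "Medium spiny neuron"],
    "Medium spiny neuron": ["Neostriatum medium spiny neuron"],
}


def descendants(concept, subclasses=SUBCLASSES):
    """Collect every direct and indirect subclass of a concept."""
    found = []
    stack = list(subclasses.get(concept, []))
    while stack:
        child = stack.pop()
        if child not in found:
            found.append(child)
            # Follow the hierarchy downwards from this child as well.
            stack.extend(subclasses.get(child, []))
    return found


def concept_query(concept):
    """Return the concept followed by all of its subclasses, sorted by name."""
    return [concept] + sorted(descendants(concept))


print(concept_query("GABAergic neuron"))
```

A plain string search for "GABAergic neuron" would miss records that only mention "Purkinje cell"; the expanded term list is what lets such records be found.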
  28. Challenges of multiscale neurodegenerative disease phenotypes
     Phenotype descriptions shown (marked “not” equivalent to one another in the figure):
     – Midbrain degenerated
     – Substantia nigra decreased in volume
     – Substantia nigra pars compacta atrophied
     – Loss of Snpc dopaminergic neurons
     – Degeneration of nigrostriatal terminals
     – Tyrosine-hydroxylase containing neurons degenerate
     • Neurodegenerative diseases target very specific cell populations
     • Model systems only replicate a subset of features of the disease
     • Related phenotypes occur across anatomical scales
     • Different vocabularies are used by different communities
  29. Approach: Use ontologies to provide necessary knowledge for matching related phenotypes
     • Entities (diagram labels): Midbrain; Substantia nigra; Substantia nigra pars compacta; Substantia nigra pars compacta dopamine cell; Neuron (CL); Neuron cell soma; Dopamine neuron; Dopamine; Small molecule (Chebi); (GO)
     • Relations (diagram labels): Has part; Is a; Is part of; Part of
     • Qualities: Degenerate; Atrophied; Decreased in magnitude relative to some normal; Decreased volume; Fewer in number
     • Legend: NIFSTD/PKB; OBO ontology
     Sarah Maynard, Chris Mungall, Suzie Lewis, Fahim Imam
  30. EQ Representation of Phenotypes in Neurodegenerative Disease: PATO and NIFSTD. Diagram labels: Human; has part; Neocortex pyramidal neuron; (birnlex_516); Instance: Human with Alzheimer’s disease 050; inheres in; Alzheimer’s disease; Increased number of; Phenotype birnlex_2087_56; towards; Lipofuscin; about. Structured annotation model implemented in WIB. Chris Mungall, Suzanna Lewis
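An entity-quality (EQ) statement can be held in a small record type. The sketch below uses one reading of the slide's diagram (the quality "Increased number of" inheres in "Neocortex pyramidal neuron", towards "Lipofuscin"); the `EQPhenotype` class and its field names are assumptions for illustration, not the structured annotation model itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EQPhenotype:
    """One entity-quality phenotype statement (hypothetical record layout)."""

    entity: str                    # what the quality inheres in
    quality: str                   # the quality term, e.g. from PATO
    towards: Optional[str] = None  # optional related entity for relational qualities

    def describe(self):
        """Render the statement as readable text."""
        text = f"{self.quality} inheres in {self.entity}"
        if self.towards:
            text += f" towards {self.towards}"
        return text


phenotype = EQPhenotype(
    entity="Neocortex pyramidal neuron",
    quality="Increased number of",
    towards="Lipofuscin",
)
print(phenotype.describe())
# prints: Increased number of inheres in Neocortex pyramidal neuron towards Lipofuscin
```

In practice each field would carry an ontology identifier rather than a label; labels are used here because the identifiers on the slide cannot be assigned to terms with confidence.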
  31. OBD: Ontology based database
     • Provides a user interface for matching organisms based on similarity of phenotypes
       – Based on EQ model
     • Uses knowledge in the ontology to compute similarity scores and other statistical measures like information content
     Chris Mungall, Suzanna Lewis, Lawrence Berkeley Labs. http://www.berkeleybop.org/pkb/
  32. Computes common subsumers and information content among phenotypes. Figure labels: Thalamus; Midline nuclear group; Paracentral nucleus; Cellular inclusion (twice); Lewy Body
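Common subsumers and information content can be sketched on a toy hierarchy. This follows the widely used definition IC(t) = -log2 p(t), where p(t) is the share of annotations made to t or to anything t subsumes; it is not taken from OBD's code. The term names echo the slide, while the edges in `PARENTS`, the counts in `DIRECT_COUNTS` and all function names are assumptions.

```python
import math

# Toy subsumption edges (child -> parents); illustrative only.
PARENTS = {
    "Midline nuclear group": ["Thalamus"],
    "Paracentral nucleus": ["Thalamus"],
    "Thalamus": ["Brain region"],
    "Striatum": ["Brain region"],
}

# Hypothetical number of annotations made directly to each term.
DIRECT_COUNTS = {
    "Midline nuclear group": 1,
    "Paracentral nucleus": 1,
    "Thalamus": 2,
    "Striatum": 4,
    "Brain region": 0,
}


def subsumers(term):
    """Return the term together with all of its ancestors."""
    result = {term}
    for parent in PARENTS.get(term, []):
        result |= subsumers(parent)
    return result


def information_content(term):
    """IC = -log2 of the share of annotations at or below the term."""
    total = sum(DIRECT_COUNTS.values())
    covered = sum(n for t, n in DIRECT_COUNTS.items() if term in subsumers(t))
    return -math.log2(covered / total)


def most_informative_common_subsumer(a, b):
    """Pick the shared ancestor with the highest information content."""
    common = subsumers(a) & subsumers(b)
    return max(common, key=information_content)


print(most_informative_common_subsumer("Midline nuclear group", "Paracentral nucleus"))
# prints: Thalamus
```

Rarely annotated terms get a high information content, so two phenotypes that share a specific ancestor score as more similar than two that only share a generic one such as the root.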
  33. PhenoSim: What organism is most similar to a human with Huntington’s disease?
     Phenotype statements shown:
     – Part of basal ganglia decreased in magnitude
     – Globus pallidus neuropil degenerate
     – Putamen atrophied
     – Neuron in striatum decreased in magnitude
     – Fewer neostriatum medium spiny neurons in putamen
     – Neurons in striatum degenerate
     – Nervous system cell change in number in striatum
     – Increased number of astrocytes in caudate nucleus
     – Neurons in striatum degenerate
     *B6CBA-TgN (HDexon1)62) that express exon1 of the human mutant HD gene - Li et al., J Neurosci, 21(21):8473-8481
  34. Progressive enrichment. Understanding and comparing phenotypes will be enriched through community knowledge bases like Neurolex. Looking forward to continuing this as part of the Monarch project with Melissa Haendel, Chris Mungall and Suzie Lewis
  35. Top Down vs Bottom up
     Top-down ontology construction (NIFSTD)
     • A select few authors have write privileges
     • Maximizes consistency of terms with each other (automated consistency checking)
     • Making changes requires approval and re-publishing
     • Works best when domain to be organized has: small corpus, formal categories, stable entities, restricted entities, clear edges.
     • Works best with participants who are: expert catalogers, coordinated users, expert users, people with authoritative source of judgment
     Bottom-up ontology construction (NEUROLEX)
     • Multiple participants can edit the ontology instantly (many eyes to correct errors)
     • Semantics are limited to what is convenient for the domain
     • Not a replacement for top-down construction; sometimes necessary to increase flexibility
     • Necessary when domain has: large corpus, no formal categories, no clear edges
     • Necessary when participants are: uncoordinated users, amateur users, naïve catalogers
     • Neuroscience is a domain that is less formal and neuroscientists are more uncoordinated
     Important for ontologists to define community contribution model
  36. It’s a messy ecosystem (and that’s OK). NIF favors a hybrid, tiered, federated system
     Categories shown: Gene; Organism; Neuron; Brain part; Disease
     • Domain knowledge
       – Ontologies
     • Claims about results (examples: Caudate projects to Snpc; Grm1 is upregulated in chronic cocaine; Betz cells degenerate in ALS)
       – Virtuoso RDF triples
     • Data
       – Data federation
       – Workflows
     • Narrative
       – Full text access
  37. Musings from the NIF
     • No one can be stopped from doing what they need to do
     • Every resource is resource limited: few have enough time, money, staff or expertise required to do everything they would like
       – If the market can support 11 MRI databases, fine
       – Some consolidation, coordination is warranted though
     • Big, broad and messy beats small, narrow and neat
       – Without trying to integrate a lot of data, we will not know what needs to be done
       – A lot can be done with messy data; neatness helps though
       – Progressive refinement; addition of complexity through layers
     • Be flexible and opportunistic
       – A single optimal technology/container for all types of scientific data and information does not exist; technology is changing
     • Think globally; act locally:
       – No source, not even NIF, is THE source; we are all a source
  38. Grabbing the long tail of small data
     • Analysis of NIF shows multiple databases with similar scope and content
     • Many contain partially overlapping data
     • Data “flows” from one resource to the next
       – Data is reinterpreted, reanalyzed or added to
     • Is duplication good or bad?
  39. Same data: different analysis (figure label: Chronic vs acute morphine in striatum)
     • Drug Related Gene database: extracted statements from figures, tables and supplementary data from published article
     • Gemma: reanalyzed microarray results from GEO using different algorithms
     • Both provide results of increased or decreased expression as a function of experimental paradigm
       – 4 strains of mice
       – 3 conditions: chronic morphine, acute morphine, saline
     Mined NIF for all references to GEO IDs: found small number where the same dataset was represented in two or more databases. http://www.chibi.ubc.ca/Gemma/home.html
  40. How easy was it to compare?
     • Gemma: Gene ID + Gene Symbol
     • DRG: Gene name + Probe ID
     • Gemma: Increased expression/decreased expression (NIF annotation standard)
     • DRG: Increased expression/decreased expression
       – But... Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
     • Analysis:
       – 1370 statements from Gemma regarding gene expression as a function of chronic morphine
       – 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
       – Results for 1 gene were opposite in DRG and Gemma
       – 45 did not have enough information provided in the paper to make a judgment
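The baseline problem on this slide can be made concrete with a minimal sketch: before two claims are compared, each direction of change is re-expressed against the same reference condition. The record layout, the `relative_to` and `consistent` functions and the gene name `GeneA` are hypothetical, and the flip is only valid for a two-condition comparison such as chronic morphine versus saline.

```python
# Reversing the comparison swaps the direction of change.
FLIP = {"increased": "decreased", "decreased": "increased"}


def relative_to(direction, baseline, reference="saline"):
    """Express a direction of change relative to the chosen reference condition."""
    if baseline == reference:
        return direction
    # The claim used the other condition as baseline, so reverse it.
    return FLIP[direction]


def consistent(claim_a, claim_b, reference="saline"):
    """Check whether two claims agree once both use the same baseline."""
    a = relative_to(claim_a["direction"], claim_a["baseline"], reference)
    b = relative_to(claim_b["direction"], claim_b["baseline"], reference)
    return a == b


# Hypothetical records for the same gene in the two databases.
gemma_claim = {"gene": "GeneA", "direction": "decreased", "baseline": "chronic morphine"}
drg_claim = {"gene": "GeneA", "direction": "increased", "baseline": "saline"}

print(consistent(gemma_claim, drg_claim))  # True
```

Read literally, the two records look contradictory; after normalization they state the same result, which is why the baseline needs to be recorded alongside every expression claim.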
  41. Beware of False Dichotomies
     • Top-down vs bottom up
     • Light weight vs heavy weight
     • “Chaotic Nihilists and Semantic Idealists”
     • Text mining vs annotation
     • Curators vs scientists
     • Human vs machine
     • DOIs vs URIs
     http://www.datanami.com/datanami/2013-02-05/chaotic_nihilists_and_semantic_idealists.html
  42. NIF team (past and present): Jeff Grethe, UCSD, Co Investigator, Interim PI; Amarnath Gupta, UCSD, Co Investigator; Anita Bandrowski, NIF Project Leader; Gordon Shepherd, Yale University; Perry Miller; Luis Marenco; Rixin Wang; David Van Essen, Washington University; Erin Reid; Paul Sternberg, Cal Tech; Arun Rangarajan; Hans Michael Muller; Yuling Li; Giorgio Ascoli, George Mason University; Sridevi Polavarum; Fahim Imam, NIF Ontology Engineer; Larry Lui; Andrea Arnaud Stagg; Jonathan Cachat; Jennifer Lawrence; Lee Hornbrook; Binh Ngo; Vadim Astakhov; Xufei Qian; Chris Condit; Mark Ellisman; Stephen Larson; Willie Wong; Tim Clark, Harvard University; Paolo Ciccarese; Karen Skinner, NIH, Program Officer
