The real world of ontologies and phenotype representation: perspectives from the Neuroscience Information Framework

Presentation at the 2013 Summit meeting of the Phenotype Research Collaboration Network, February 25-27, 2013

  • Doesn’t do it well; doesn’t organize the results in a domain-specific way; doesn’t search across it. For use as content goal: dynamic inventory for deep coverage of neuroscience data: Genes -> Systems
  • What animal models show
  • NIFSTD and PATO ontologies served as building blocks for a phenotype model. The ontologies provide relationships between neuroscience-related terms, give structure to qualities, and allow related qualities to show relationships.
  • Need an interface to explore and ask questions. Cannot view as a graph. Need to be able to ask a question not in SPARQL and get an answer. Need a better interface to put things in. Discuss Neurolex and PKB. Doesn’t have to be a perfect interface, but has to allow a domain expert to ask and answer questions.
  • Indirect matches that match due to hierarchies. NOTE: should make diagram in the style of previous slides (not screenshot)
  • In validating our results, we see three types of matches. The first are direct matches. NOTE: should make diagram in the style of previous slides (not screenshot)

Transcript

  • 1. The real world of ontologies and phenotype representation: perspectives from the Neuroscience Information Framework. Maryann Martone, Ph.D., University of California, San Diego
  • 2. “Neural Choreography”: “A grand challenge in neuroscience is to elucidate brain function in relation to its multiple layers of organization that operate at different spatial and temporal scales. Central to this effort is tackling “neural choreography” -- the integrated functioning of neurons into brain circuits -- Neural choreography cannot be understood via a purely reductionist approach. Rather, it entails the convergent use of analytical and synthetic tools to gather, analyze and mine information from each level of analysis, and capture the emergence of new layers of function (or dysfunction) as we move from studying genes and proteins, to cells, circuits, thought, and behavior.... However, the neuroscience community is not yet fully engaged in exploiting the rich array of data currently available, nor is it adequately poised to capitalize on the forthcoming data explosion.” Akil et al., Science, Feb 11, 2011
  • 3. “Data choreography”. In that same issue of Science: Asked peer reviewers from last year about the availability and use of data. About half of those polled store their data only in their laboratories, not an ideal long-term solution. Many bemoaned the lack of common metadata and archives as a main impediment to using and storing data, and most of the respondents have no funding to support archiving. And even where accessible, much data in many fields is too poorly organized to enable it to be efficiently used. “...it is a growing challenge to ensure that data produced during the course of reported research are appropriately described, standardized, archived, and available to all.” Lead Science editorial (Science 11 February 2011: Vol. 331 no. 6018 p. 649)
  • 4. NIF is an initiative of the NIH Blueprint consortium of institutes. What types of resources (data, tools, materials, services) are available to the neuroscience community? How many are there? What domains do they cover? What domains do they not cover? Where are they? Web sites; databases; literature; supplementary material. Who uses them? Who creates them? How can we find them? How can we make them better in the future? http://neuinfo.org • PDF files • Desk drawers
  • 5. In an ideal world... We’d like to be able to find: What is known: What is the average diameter of a Purkinje neuron? Is GRM1 expressed in cerebral cortex? What are the projections of hippocampus? What genes have been found to be upregulated in chronic drug abuse in adults? Is alpha synuclein in the striatum? What studies used my polyclonal antibody against GABA in humans? What rat strains have been used most extensively in research during the last 20 years? What is not known: connections among data; gaps in knowledge. Without some sort of framework, very difficult to ... Required components: query interface; search strategies; data sources; infrastructure; results display; “Why did I get this result?”; analysis tools
  • 6. The Neuroscience Information Framework: Discovery and utilization of web-based resources for neuroscience. A portal for finding and using neuroscience resources. A consistent framework for describing resources. Provides simultaneous search of multiple types of information, organized by category. Supported by an expansive ontology for neuroscience. Utilizes advanced technologies to search the “hidden web”. http://neuinfo.org UCSD, Yale, CalTech, George Mason, Washington Univ. Supported by NIH Blueprint. Literature; Database Federation; Registry
  • 7. We need more databases!? • NIF Registry: A catalog of neuroscience-relevant resources • > 5000 currently listed • > 2000 databases • And we are finding more every day
  • 8. NIF must work with ecosystem as it is today. NIF was one of the first projects to attempt data integration in the neurosciences on a large scale. NIF is supported by a contract that specified the number of resources to be added per year. Designed to be populated rapidly; set up process for progressive refinement. No budget was allocated to retrofit existing resources; had to work with them in their current state. We designed a system that required little to no cooperation or work from providers. NIF was required to assemble (not create) ontologies very fast and to provide a platform through which the community could view, comment and add. NIF is enriched by ontologies but does not depend on them. Took advantage of community ontologies. But needed to take a very pragmatic and aggressive approach to incorporating and using them. Neurolex semantic wiki
  • 9. What are the connections of the hippocampus? Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”. Query expansion: synonyms and related concepts. Boolean queries. Data sources categorized by “data type” and level of nervous system. Common views across multiple sources. Tutorials for using full resource when getting there from NIF. Link back to record in original source
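The query expansion described on slide 9 can be illustrated with a minimal sketch. The synonym table and the `expand_query` function are hypothetical names made up for this example; they are not NIF's actual vocabulary service or API, and only the three hippocampus terms come from the slide.

```python
# Hypothetical synonym table for illustration; not NIF's actual vocabulary service.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def expand_query(term, synonyms=SYNONYMS):
    """Build a Boolean OR query from a term and its listed synonyms."""
    variants = [term] + synonyms.get(term.lower(), [])
    # Quote multi-word variants so each is searched as a phrase.
    quoted = [f'"{v}"' if " " in v else v for v in variants]
    return " OR ".join(quoted)


print(expand_query("Hippocampus"))
```

A term with no entry in the table is passed through unchanged, so the expansion step never narrows a search.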
  • 10. Imminent: NIF 5.0. NIF 5.0 about to be released. New design. New query features. New analytics
  • 11. What do you mean by data? Databases come in many shapes and sizes. Primary data: Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL). Secondary data: Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS). Tertiary data: Claims and assertions about the meaning of data, e.g., gene upregulation/downregulation. Registries: Metadata; pointers to data sets or materials stored elsewhere. Data aggregators: Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede. Single source: Data acquired within a single context, e.g., Allen Brain Atlas. Researchers are producing a variety of information resources using a multitude of technologies
  • 12. Exploration: Where is alpha synuclein? • Spatially: • Gene • Protein • Subcellular • Cellular • Regional • Organism • Semantically: • Gene regulation networks • Protein pathways • Cellular local connectivity • Regional connectivity • Who is studying it? • Who is funding its study? Networks exist across scales; all important in the nervous system
  • 13. Set of modular ontologies. 86,000+ distinct concepts + synonyms. Bridge files between modules. Expressed in OWL-DL language. Currently supports OWL 2. Tries to follow OBO community best practices. Standardized to the same upper level ontologies, e.g., Basic Formal Ontology (BFO), OBO Relations Ontology (OBO-RO). Imports existing community ontologies, e.g., CHEBI, GO, PRO, DOID, OBI etc. Retains identifiers in most recent additions but reflects history. Covers major domains of neuroscience: Organisms, Brain Regions, Cells, Molecules, Subcellular parts, Diseases, Nervous system functions, Techniques. NIFSTD Ontologies. Fahim Imam, William Bug
  • 14. “Search computing”: Query by concept. What genes are upregulated by drugs of abuse in the adult mouse? (show me the data!) Morphine; Increased expression; Adult Mouse. Reasonable standards make it easy to search for and compare results
  • 15. Diseases of nervous system. New: Data analytics. NIF is in a unique position to answer questions about the neuroscience ecosystem using new analytics tools. Neurodegenerative; Seizure disorders; Neoplastic disease of nervous system. NIH Reporter; NIF data federated sources
  • 16. Results are organized within a common framework. Connects to; Synapsed with; Synapsed by; Input region; innervates; Axon innervates; Projects to; Cellular contact; Subcellular contact; Source site; Target site. Each resource implements a different, though related model; systems are complex and difficult to learn, in many cases
  • 17. NIF Concept Mapper
  • 18. The scourge of neuroanatomical nomenclature: Importance of NIF semantic framework • NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions • Brain Architecture Management System (rodent) • Temporal lobe.com (rodent) • ConnectomeWiki (human) • Brain Maps (various) • CoCoMac (primate cortex) • UCLA Multimodal database (Human fMRI) • Avian Brain Connectivity Database (Bird) • Total: 1800 unique brain terms (excluding Avian) • Number of exact terms used in > 1 database: 42 • Number of synonym matches: 99 • Number of 1st order partonomy matches: 385
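Slide 18 counts three kinds of term matches across the connectivity databases: exact, synonym, and first-order partonomy. The sketch below shows one way such a classification could work. The synonym sets, the part-of table, and the `match_type` function are toy assumptions made for illustration; the slide does not describe how NIF actually computed its counts.

```python
# Toy data for illustration only; not taken from NIFSTD.
SYNONYM_SETS = [{"hippocampus", "cornu ammonis", "ammon's horn"}]
PART_OF = {"ca1": "hippocampus"}  # child term -> direct parent (first-order partonomy)


def normalize(term):
    """Lowercase and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(term.lower().split())


def match_type(a, b):
    """Classify how two brain-region terms from different databases relate."""
    a, b = normalize(a), normalize(b)
    if a == b:
        return "exact"
    if any(a in s and b in s for s in SYNONYM_SETS):
        return "synonym"
    if PART_OF.get(a) == b or PART_OF.get(b) == a:
        return "partonomy"
    return None


print(match_type("Hippocampus", "hippocampus"))
print(match_type("Ammon's horn", "Hippocampus"))
print(match_type("CA1", "Hippocampus"))
```

Checking the cheapest relation first (string equality) and the ontology-backed relations afterwards mirrors the order in which the slide reports its counts.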
  • 19. Why so many names? The brain is perhaps unique among major organ systems in the multiplicity of naming schemes for its major and minor regions. The brain has been divided based on topology of major features, cyto- and myelo-architecture, developmental boundaries, supposed evolutionary origins, histochemistry, gene expression and functional criteria. The gross anatomy of the brain reflects the underlying networks only superficially, and thus any parcellation reflects a somewhat arbitrary division based on one or more of these criteria. The “activation map” images that commonly accompany brain imaging papers can be misleading to inexperienced readers, by seeming to suggest that the boundaries between “activated” and “unactivated” patches of cortex are unambiguous and sharp. Instead, as most researchers are aware, the apparent sharp boundaries are subject to the choice of threshold applied to the statistical tests that generate the image. What, then, justifies dividing the cortex into regions with boundaries based on this fuzzy, mutable measure of functional profile? (Saxe et al., 2010, p. 39). Brainmaps.org
  • 20. Program on Ontologies for Neural Structures. International Neuroinformatics Coordinating Facility. Structural Lexicon Task Force: defining brain structures; translate among terminologies. Neuronal Registry Task Force: consistent naming scheme for neurons; knowledge base of neuron properties. Representation and Deployment Task Force: formal representation. Also interacts with Digital Atlasing Task Force. http://incf.org
  • 21. NeuroLex Wiki http://neurolex.org Stephen Larson • Provide a simple framework for defining the concepts required • Light weight semantics • Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework • Community based: • Anyone can contribute their terms, concepts, things • Anyone can edit • Anyone can link • Accessible: searched by Google • Building an extensive cross-disciplinary knowledge base for neuroscience. Demo D03
  • 22. Defining nervous system structures. Parcellation scheme: Set of parcels occupying part or all of an anatomical entity that has been delineated using a common approach or set of criteria, often in a single study. A parcellation scheme for any given individual entity may include gaps, transitional zones, or regions of uncertainty. A parcellation scheme derived from a set of individuals registered to a common target (atlas) may be probabilistic and include overlap of parcels in regions that reflect individual variability or imperfections in alignment. 14 parcellation schemes currently represented in Neurolex. Documentation available. INCF task force on ontologies
  • 23. Basic model: do not conflate conceptual structures with parcels. Regional part of nervous system; Functional part of nervous system; Parcel; overlaps; overlaps; overlaps; Parcel; Parcel. Neuroscientists have a lot of different parcellation schemes because they have a lot of different ways of classifying brain structures, and techniques to match them are imperfect
  • 24. Linking semantics to space: INCF Atlasing. www.neurolex.org Link to spatial representation in scalable brain atlas. Waxholm space. Seth Ruffins, Alan Ruttenberg, Rembrandt Bakker
  • 25. Neurons in Neurolex. International Neuroinformatics Coordinating Facility (INCF) building a knowledge base of neurons and their properties via the Neurolex Wiki. Led by Dr. Gordon Shepherd. Consistent and parseable naming scheme. Knowledge is readily accessible, editable and computable. While structure is imposed, don’t worry too much about the upper level classes of the ontology. Stephen Larson
  • 26. A KNOWLEDGE BASE OF NEURONAL PROPERTIES. Additional semantics added in NIFSTD by ontology engineer
  • 27. Concept-based search: search by meaning. Search Google: GABAergic neuron. Search NIF: GABAergic neuron. NIF automatically searches for types of GABAergic neurons. Types of GABAergic neurons
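The idea on slide 27, that a search for a concept also covers its subtypes, can be sketched as a walk over a subclass hierarchy. The hierarchy below is a hand-made toy and `expand_concept` is a hypothetical helper; neither is taken from NIFSTD or from NIF's search implementation.

```python
# Hypothetical toy hierarchy: parent class -> direct subclasses.
SUBCLASSES = {
    "GABAergic neuron": ["Purkinje cell", "Medium spiny neuron"],
    "Medium spiny neuron": ["Neostriatum medium spiny neuron"],
}


def expand_concept(concept, subclasses=SUBCLASSES):
    """Return the concept plus all of its direct and indirect subclasses."""
    found, stack = [], [concept]
    while stack:
        current = stack.pop()
        if current not in found:  # guard against revisiting shared subclasses
            found.append(current)
            stack.extend(subclasses.get(current, []))
    return found


search_terms = expand_concept("GABAergic neuron")
print(search_terms)
```

The resulting list of terms could then be combined into a single Boolean OR query, so that records annotated only with a specific cell type are still returned for the general concept.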
  • 28. Challenges of multiscale neurodegenerative disease phenotypes • Neurodegenerative diseases target very specific cell populations • Model systems only replicate a subset of features of the disease • Related phenotypes occur across anatomical scales • Different vocabularies are used by different communities. not; not; Midbrain degenerated; Substantia nigra decreased in volume; Substantia nigra pars compacta atrophied; Loss of Snpc dopaminergic neurons; Degeneration of nigrostriatal terminals; Tyrosine-hydroxylase containing neurons degenerate
  • 29. Approach: Use ontologies to provide necessary knowledge for matching related phenotypes. Sarah Maynard, Chris Mungall, Suzie Lewis, Fahim Imam. Midbrain; Substantia nigra; Substantia nigra pars compacta; Substantia nigra pars compacta dopamine cell; Dopamine; Neuron cell soma; Neuron (CL); Part of neuron (GO); Small molecule (Chebi); Atrophied; Decreased volume; Fewer in number; Degenerate; Decreased in magnitude relative to some normal; Has part; Has part; Is part of; Has part; Has part; Is a; Is a; Is a; Is a; Entities; Qualities; NIFSTD/PKB; OBO ontology
  • 30. Alzheimer’s disease; Human (birnlex_516); Neocortex pyramidal neuron; Increased number of; Lipofuscin; has part; inheres in; inheres in; towards. EQ Representation of Phenotypes in Neurodegenerative Disease: PATO and NIFSTD. Instance: Human with Alzheimer’s disease 050; Phenotype birnlex_2087_56; inheres in; about. Chris Mungall, Suzanna Lewis. Structured annotation model implemented in WIB
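The entity-quality (EQ) pattern on slide 30 can be written down as a small data structure. This is a sketch only: the class and field names are assumptions chosen for illustration, the labels are copied from the slide, and it does not reproduce the actual annotation model or its identifiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EQStatement:
    """One entity-quality phenotype statement (hypothetical structure)."""
    entity: str                    # E: the structure the quality inheres in
    quality: str                   # Q: the quality term
    towards: Optional[str] = None  # optional second entity for relational qualities

    def describe(self) -> str:
        text = f"'{self.quality}' inheres in '{self.entity}'"
        if self.towards:
            text += f" towards '{self.towards}'"
        return text


@dataclass(frozen=True)
class PhenotypeAnnotation:
    """Links an organism instance to one EQ statement."""
    organism: str
    phenotype: EQStatement


annotation = PhenotypeAnnotation(
    organism="Human with Alzheimer's disease",
    phenotype=EQStatement(
        entity="Neocortex pyramidal neuron",
        quality="Increased number of",
        towards="Lipofuscin",
    ),
)
print(annotation.phenotype.describe())
```

Keeping the entity and quality as separate fields is what lets an ontology reason over each half independently, for example relating "atrophied" to "decreased volume" on the quality side.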
  • 31. OBD: Ontology based database. Provides a user interface for matching organisms based on similarity of phenotypes. Based on EQ model. Uses knowledge in the ontology to compute similarity scores and other statistical measures like information content. http://www.berkeleybop.org/pkb/ Chris Mungall, Suzanna Lewis, Lawrence Berkeley Labs
  • 32. Thalamus; Cellular inclusion; Midline nuclear group; Lewy Body; Paracentral nucleus; Cellular inclusion. Computes common subsumers and information content among phenotypes
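Slide 32 mentions common subsumers and information content. A minimal sketch of both ideas follows, using the standard definition IC(t) = -log2(p(t)), where p(t) is the fraction of annotations at or below term t. The hierarchy mixes is-a and part-of edges for brevity, and the annotation counts are invented; the slide does not give the actual formula or data used by the tool.

```python
import math

# Toy hierarchy (term -> direct parents); mixes is-a and part-of edges for brevity.
PARENTS = {
    "Lewy body": ["cellular inclusion"],
    "midline nuclear group": ["thalamus"],
    "paracentral nucleus": ["thalamus"],
    "thalamus": ["brain"],
}

# Invented counts: annotations at or below each term, out of TOTAL annotations.
COUNTS = {"brain": 100, "thalamus": 10, "midline nuclear group": 2,
          "paracentral nucleus": 1, "cellular inclusion": 20, "Lewy body": 5}
TOTAL = 100


def ancestors(term):
    """Return the term itself plus every term reachable through PARENTS."""
    result = {term}
    for parent in PARENTS.get(term, []):
        result |= ancestors(parent)
    return result


def information_content(term):
    """Rarer terms carry more information: IC = -log2(frequency)."""
    return -math.log2(COUNTS[term] / TOTAL)


def most_informative_common_subsumer(a, b):
    """Pick the shared ancestor with the highest information content, if any."""
    common = ancestors(a) & ancestors(b)
    if not common:
        return None
    return max(common, key=information_content)


print(most_informative_common_subsumer("midline nuclear group", "paracentral nucleus"))
```

With these toy counts, "thalamus" is preferred over "brain" as the shared ancestor because it is the more specific, and therefore more informative, of the two.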
  • 33. *B6CBA-TgN (HDexon1)62) that express exon1 of the human mutant HD gene - Li et al., J Neurosci, 21(21):8473-8481. PhenoSim: What organism is most similar to a human with Huntington’s disease? Putamen atrophied; Globus pallidus neuropil degenerate; Part of basal ganglia decreased in magnitude; Fewer neostriatum medium spiny neurons in putamen; Neurons in striatum degenerate; Neuron in striatum decreased in magnitude; Increased number of astrocytes in caudate nucleus; Neurons in striatum degenerate; Nervous system cell change in number in striatum
  • 34. Progressive enrichment. Understanding and comparing phenotypes will be enriched through community knowledge bases like Neurolex. Looking forward to continuing this as part of the Monarch project with Melissa Haendel, Chris Mungall and Suzie Lewis
  • 35. Top Down vs Bottom up. Top-down ontology construction • A select few authors have write privileges • Maximizes consistency of terms with each other (automated consistency checking) • Making changes requires approval and re-publishing • Works best when domain to be organized has: small corpus, formal categories, stable entities, restricted entities, clear edges • Works best with participants who are: expert catalogers, coordinated users, expert users, people with authoritative source of judgment. Bottom-up ontology construction • Multiple participants can edit the ontology instantly (many eyes to correct errors) • Semantics are limited to what is convenient for the domain • Not a replacement for top-down construction; sometimes necessary to increase flexibility • Necessary when domain has: large corpus, no formal categories, no clear edges • Necessary when participants are: uncoordinated users, amateur users, naïve catalogers • Neuroscience is a domain that is less formal and neuroscientists are more uncoordinated. NIFSTD; NEUROLEX. Important for Ontologists to define community contribution model
  • 36. It’s a messy ecosystem (and that’s OK). NIF favors a hybrid, tiered, federated system. Domain knowledge; Ontologies; Claims about results; Virtuoso RDF triples; Data; Data federation; Workflows; Narrative; Full text access. Neuron; Brain part; Disease; Organism; Gene. Caudate projects to Snpc. Grm1 is upregulated in chronic cocaine. Betz cells degenerate in ALS
  • 37. Musings from the NIF. No one can be stopped from doing what they need to do. Every resource is resource limited: few have enough time, money, staff or expertise required to do everything they would like. If the market can support 11 MRI databases, fine. Some consolidation, coordination is warranted though. Big, broad and messy beats small, narrow and neat. Without trying to integrate a lot of data, we will not know what needs to be done. A lot can be done with messy data; neatness helps though. Progressive refinement; addition of complexity through layers. Be flexible and opportunistic. A single optimal technology/container for all types of scientific data and information does not exist; technology is changing. Think globally; act locally: No source, not even NIF, is THE source; we are all a source
  • 38. Grabbing the long tail of small data. Analysis of NIF shows multiple databases with similar scope and content. Many contain partially overlapping data. Data “flows” from one resource to the next. Data is reinterpreted, reanalyzed or added to. Is duplication good or bad?
  • 39. Same data: different analysis. Chronic vs acute morphine in striatum. Drug Related Gene database: extracted statements from figures, tables and supplementary data from published article. Gemma: Reanalyzed microarray results from GEO using different algorithms. Both provide results of increased or decreased expression as a function of experimental paradigm. 4 strains of mice. 3 conditions: chronic morphine, acute morphine, saline. Mined NIF for all references to GEO IDs: found small number where the same dataset was represented in two or more databases. http://www.chibi.ubc.ca/Gemma/home.html
  • 40. How easy was it to compare? Gemma: Gene ID + Gene Symbol. DRG: Gene name + Probe ID. Gemma: Increased expression/decreased expression. DRG: Increased expression/decreased expression. But... Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases. Analysis: 1370 statements from Gemma regarding gene expression as a function of chronic morphine. 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis. Results for 1 gene were opposite in DRG and Gemma. 45 did not have enough information provided in the paper to make a judgment. NIF annotation standard
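The baseline mismatch described on slide 40 means a direction reported against chronic morphine has to be flipped before it can be compared with one reported against saline. The sketch below illustrates that normalization step; the function names and the example calls are hypothetical and do not reflect the actual Gemma or DRG schemas.

```python
# Flipping table for direction-of-change labels.
FLIP = {"increased": "decreased", "decreased": "increased"}


def to_saline_baseline(direction, baseline):
    """Re-express a direction of change relative to saline."""
    if baseline == "saline":
        return direction
    if baseline == "chronic morphine":
        # Reported relative to the treatment condition, so the sign is reversed.
        return FLIP[direction]
    raise ValueError(f"unknown baseline: {baseline}")


def consistent(gemma_direction, drg_direction):
    """Compare a Gemma statement (chronic-morphine baseline) with a DRG one (saline baseline)."""
    return to_saline_baseline(gemma_direction, "chronic morphine") == drg_direction


# A gene listed as "decreased" in Gemma and "increased" in DRG actually agrees.
print(consistent("decreased", "increased"))
# Fraction of Gemma statements the slide reports as consistent with DRG.
print(round(617 / 1370, 2))
```

By the slide's own counts, 617 of 1370 statements is about 45%, which is what "over half ... were not confirmed" refers to.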
  • 41. Beware of False Dichotomies. Top-down vs bottom up. Light weight vs heavy weight. “Chaotic Nihilists and Semantic Idealists”. Text mining vs annotation. Curators vs scientists. Human vs machine. DOIs vs URIs. http://www.datanami.com/datanami/2013-02-05/chaotic_nihilists_and_semantic_idealists.html
  • 42. NIF team (past and present): Jeff Grethe, UCSD, Co Investigator, Interim PI; Amarnath Gupta, UCSD, Co Investigator; Anita Bandrowski, NIF Project Leader; Gordon Shepherd, Yale University; Perry Miller; Luis Marenco; Rixin Wang; David Van Essen, Washington University; Erin Reid; Paul Sternberg, CalTech; Arun Rangarajan; Hans Michael Muller; Yuling Li; Giorgio Ascoli, George Mason University; Sridevi Polavarum; Fahim Imam, NIF Ontology Engineer; Larry Lui; Andrea Arnaud Stagg; Jonathan Cachat; Jennifer Lawrence; Lee Hornbrook; Binh Ngo; Vadim Astakhov; Xufei Qian; Chris Condit; Mark Ellisman; Stephen Larson; Willie Wong; Tim Clark, Harvard University; Paolo Ciccarese; Karen Skinner, NIH, Program Officer