An Up-to-date Knowledge Base and Focused Exploration System for Human Performance and Cognition
Amit Sheth, LexisNexis Ohio Eminent Scholar; Director, Kno.e.sis Center, Wright State University
Ramakanth Kavuluru, Postdoctoral Research Scientist, Kno.e.sis Center
Thanks to Dr. Victor Chan for support and guidance
HPC-KB team: Christopher Thomas, Wenbo Wang, Alan Smith, Paul Fultz, Delroy Cameron, Priti Parikh
Focused Knowledge Bases
A knowledge base (KB) functions as a
  • standalone reference for a particular domain of interest
  • backend for knowledge-based search, browsing, and exploration of literature
What is a KB?
“A body of knowledge describing a topic or domain of interest”
  • categories or classes – Neurotransmitters, Disease
  • individuals (instances of classes) – Dopamine, Magnesium, Migraine
  • roles (properties/predicates) – inhibits, is a, part of
  • assertions (triples) – Dopamine is a neurotransmitter; Magnesium treats Migraine
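The four ingredients above can be sketched as a small in-memory structure. This is only an illustration of the idea of triples: the `Triple` type and the `objects_of` helper are names invented here, not part of the HPC-KB, and the data is limited to the examples on the slide.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One assertion: subject, role (predicate), object."""
    subject: str
    predicate: str
    obj: str


# Classes and individuals taken from the slide's examples.
CLASSES = {"Neurotransmitter", "Disease"}
INDIVIDUALS = {"Dopamine", "Magnesium", "Migraine"}

# The two assertions given on the slide.
ASSERTIONS = [
    Triple("Dopamine", "is a", "Neurotransmitter"),
    Triple("Magnesium", "treats", "Migraine"),
]


def objects_of(subject, predicate, triples):
    """Return every object asserted for the given subject and predicate."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]


print(objects_of("Magnesium", "treats", ASSERTIONS))
```

A real KB would sit in a triple store rather than a Python list, but the lookup pattern (fix two positions of the triple, ask for the third) is the same.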
Then, What are Ontologies?
  • “Ontology is the basic structure or armature around which a knowledge base can be built” (Swartout and Tate, 1999)
  • “An ontology is an explicit representation of a shared understanding of the important concepts in some domain of interest.” (Kalfoglou, 2001)
  • So, mostly static blocks of well-accepted and consensual knowledge
Ontologies in Life Sciences
  • The National Center for Biomedical Ontology (NCBO) - Open Biomedical Ontologies (OBO)
  • About 200 ontologies and 1.5 million terms
  • Only part_of and is_a relations in the Gene Ontology
      • Histolysis is_a positive regulation of cell size
  • Requests for changes are expert reviewed before modifications
What about Emergent Knowledge, Richer Relationships?
  • New scientific results and insights are published every day, backed by experimentation
  • PubMed: 18+ million articles; 1300 new per day
  • Also, what about other predicates besides is_a and part_of (e.g., the UMLS Semantic Network of 54 predicates)?
  • Need a way of capturing and meaningfully utilizing this emerging knowledge
Enter HPC-KB
  • NLP
  • Patterns
  • SCOONER
Steps in Creating the HPC-KB
  • Carve a focused domain hierarchy out of Wikipedia
  • Extract mentions of entities and relationships in the relevant scientific literature (PubMed abstracts) to support non-hierarchical guidance
  • Map extracted entity mentions to concepts and extracted predicates to relationships to create the knowledge base
Workflow Overview
  • HPC keywords
  • Doozer: Base Hierarchy from Wikipedia
  • Focused Pattern based extraction
  • SenseLab Neuroscience Ontologies
  • Initial KB Creation
  • Meta Knowledgebase
  • PubMed Abstracts
  • Knoesis: Parsing based NLP Triples
  • Enrich Knowledge Base
  • NLM: Rule based BKR Triples
  • Final Knowledge Base
Hierarchy Using Wikipedia Categories and Graph Structure
Triple Extraction
  • Open Extraction
      • No fixed number of predetermined entities and predicates
      • At Knoesis – NLP (parsing and dependency trees)
  • Supervised Extraction
      • Predetermined set of entities and predicates
      • At Knoesis – Pattern based extraction to connect entities in the base hierarchy using statistical techniques
      • At NLM – NLP and rule based approaches
Mapping of Triples to Hierarchy
  • Entities in both subject and object must contain at least one concept from the hierarchy to be mapped to the KB
  • Preliminary synonyms based on anchor labels and page redirects in Wikipedia
      • Prolactostatin redirects to Dopamine
  • Predicates (verbs) and entities are subjected to stemming using WordNet
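The mapping rule above can be sketched as follows. This is a simplified illustration under stated assumptions: concepts are single lowercase tokens, the `HIERARCHY` set is a made-up stand-in for the Wikipedia-derived hierarchy, only the one redirect named on the slide is included, and the WordNet stemming step is omitted. The function names are hypothetical.

```python
import re

# Hypothetical stand-in for concept labels in the base hierarchy.
HIERARCHY = {"dopamine", "magnesium", "migraine"}

# Synonyms from Wikipedia page redirects (the slide's example).
REDIRECTS = {"prolactostatin": "dopamine"}


def concepts_in(mention):
    """Return the hierarchy concepts mentioned in a piece of text."""
    tokens = re.findall(r"[a-z0-9]+", mention.lower())
    resolved = (REDIRECTS.get(token, token) for token in tokens)
    return {token for token in resolved if token in HIERARCHY}


def map_triple(subject, predicate, obj):
    """Map an extracted triple to hierarchy concepts.

    Returns None unless both subject and object contain at least one
    concept from the hierarchy, as the slide requires.
    """
    subject_concepts = concepts_in(subject)
    object_concepts = concepts_in(obj)
    if not subject_concepts or not object_concepts:
        return None
    return (sorted(subject_concepts), predicate, sorted(object_concepts))


print(map_triple("Prolactostatin release", "inhibits", "migraine attacks"))
print(map_triple("aspirin", "treats", "migraine"))
```

The second call returns `None` because "aspirin" is not in the toy hierarchy, so the triple is not admitted to the KB.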
HPC-KB Stats
  • Number of entities: 2 million
  • Number of non-trivial facts: 3 million
  • Examples:
      • NLP based: calcium-binding protein S100B modulates long-term synaptic plasticity
      • Pattern based: Olfactory Bulb has physical part of anatomic structure Mitral cell
  • Extracted from all abstracts until Aug 2008 (Note: Abstracts accessible through Scooner: Oct 2010)
Full Architecture
Scooner Features
  • Knowledge-based browsing: Relations window, inverse relations, creating trails
  • Persistent projects: Work bench, browsing history, comments, filtering
  • Collaboration: comments, dashboard, exporting (sub)projects, importing projects
Scooner Demonstration Video: http://slidesha.re/scooner-video
Comments on Scooner
  • “The ability to browse predications together with documents will likely reduce the cognitive load required for encountering interesting facts, both for novice users and domain experts.” – Thomas Rindflesch, Researcher in Biomedical Informatics, NLM
  • “Being able to keep track of multiple articles is a really nice tool to have, and it cuts down on time between jumping back and forth between articles” – Anonymous comment from an AFRL researcher
Next: Knexpace
  • Automatic Updates
      • Index new abstracts as they arrive
      • Extract relationships as new abstracts arrive
      • Periodically update indices for abstracts and triples
  • Other Maintenance
      • Admin interfaces
      • Application software support
Improving the KB Quality & Filtering
  • Adhere to a stricter schema
      • Having a fixed number of predicates and a fixed range and domain for each predicate
  • Ex: For the immunology and chemical warfare agents
      • Predicate: activates
      • Restricted Domain: Cytokines OR μRNA
      • Restricted Range: Macrophages
  • Also useful to directly launch queries of the form
      • Question: ?x activates Macrophages
      • Answer: TNF-α and IFN-γ
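A minimal sketch of how a fixed domain and range per predicate could be enforced, and how the `?x activates Macrophages` query could then be answered. The `SCHEMA` and `ENTITY_TYPE` tables, the function names, and the rejected Dopamine triple are assumptions made for this illustration; only the `activates` restriction and the two answers come from the slide.

```python
# Fixed predicate set with a restricted domain and range (from the slide).
SCHEMA = {
    "activates": {"domain": {"Cytokine", "μRNA"}, "range": {"Macrophage"}},
}

# Hypothetical typing of entities; a real system would take this
# from the hierarchy or an external taxonomy.
ENTITY_TYPE = {
    "TNF-α": "Cytokine",
    "IFN-γ": "Cytokine",
    "Macrophages": "Macrophage",
    "Dopamine": "Neurotransmitter",
}


def conforms(triple):
    """True if the predicate is known and subject/object types fit it."""
    subject, predicate, obj = triple
    rule = SCHEMA.get(predicate)
    if rule is None:
        return False
    return (ENTITY_TYPE.get(subject) in rule["domain"]
            and ENTITY_TYPE.get(obj) in rule["range"])


def ask_subjects(predicate, obj, triples):
    """Answer a query of the form '?x predicate obj'."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)


candidates = [
    ("TNF-α", "activates", "Macrophages"),
    ("IFN-γ", "activates", "Macrophages"),
    # Made-up triple that violates the domain restriction.
    ("Dopamine", "activates", "Macrophages"),
]
kb = [t for t in candidates if conforms(t)]

print(ask_subjects("activates", "Macrophages", kb))
```

Filtering at load time keeps the query function trivial: anything in `kb` has already passed the schema check.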
Normalize Entities and Predicates
  • How do we find that Bovine spongiform encephalopathy and mad cow disease are the same? (Use UMLS Metathesaurus!)
  • How do we find computationally that long term memory and long lasting memory are the same? (UMLS does not work)
  • More complex similarities: HepG(2) cell line and Human Hepatoma G2 cells
  • Which textual forms map to the fixed set of predicates: do modulates, regulates, stimulates all map to affects? (NLM expert collaboration & AFRL help)
  • NLM tools: MetaMap, SemRep, and other lexical tools
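One simple way to picture normalization is a pair of lookup tables. The sketch below assumes, purely for illustration, that the answer to the slide's question is "yes" and that the three verbs do collapse to affects; in practice the tables would be curated with experts or derived from tools such as MetaMap and SemRep, whose APIs are not shown here. All names in the code are hypothetical.

```python
# Assumed mapping from surface verbs to the fixed predicate set.
PREDICATE_MAP = {
    "modulates": "affects",
    "regulates": "affects",
    "stimulates": "affects",
}

# Entity synonyms; the slide suggests the UMLS Metathesaurus as the source.
ENTITY_SYNONYMS = {
    "mad cow disease": "bovine spongiform encephalopathy",
}


def normalize_predicate(surface):
    """Map a surface verb to its canonical predicate, if one is known."""
    key = surface.strip().lower()
    return PREDICATE_MAP.get(key, key)


def normalize_entity(surface):
    """Map an entity mention to its preferred name, if one is known."""
    key = " ".join(surface.lower().split())
    return ENTITY_SYNONYMS.get(key, key)


def normalize_triple(subject, predicate, obj):
    return (normalize_entity(subject),
            normalize_predicate(predicate),
            normalize_entity(obj))


print(normalize_predicate("Regulates"))
print(normalize_entity("Mad  Cow Disease"))
```

Exact-match tables cannot handle the harder cases on the slide (long term memory versus long lasting memory, or HepG(2) cell line versus Human Hepatoma G2 cells); those need fuzzier matching than this sketch attempts.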
Provenance and Other Metadata
  • Original abstract PMIDs will be captured for each triple.
  • Other data: authors, journal names
  • How about other metadata for filtering:
      • In vitro / in vivo
      • If in vivo, which organism? If mice, which type (e.g., Ames Dwarf)?
      • Which techniques are used (e.g., flow cytometry)?
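The provenance fields listed above could be attached to each triple roughly as follows. This is a sketch, not the HPC-KB schema: the class and field names are invented here, and the PMIDs and annotation values in the example are placeholders, not real citations or findings.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Provenance:
    pmid: str                        # source abstract
    setting: Optional[str] = None    # "in vitro" or "in vivo"
    organism: Optional[str] = None   # only meaningful for in vivo work
    techniques: Tuple[str, ...] = ()


@dataclass
class AnnotatedTriple:
    subject: str
    predicate: str
    obj: str
    sources: List[Provenance] = field(default_factory=list)


def supported_by(triple, setting=None, organism=None):
    """Return the sources of a triple that match the given filters."""
    return [p for p in triple.sources
            if (setting is None or p.setting == setting)
            and (organism is None or p.organism == organism)]


# Placeholder provenance, invented for illustration only.
example = AnnotatedTriple(
    "Magnesium", "treats", "Migraine",
    sources=[
        Provenance("PMID-PLACEHOLDER-1", "in vivo", "mouse", ("flow cytometry",)),
        Provenance("PMID-PLACEHOLDER-2", "in vitro"),
    ],
)

print([p.pmid for p in supported_by(example, setting="in vivo")])
```

Keeping provenance as a list per triple means a fact extracted from several abstracts is stored once, and filtering narrows the evidence rather than the fact itself.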
New Knowledge Example
  • VIP Peptide – increases – Catecholamine Biosynthesis
  • Catecholamines – induce – β-adrenergic receptor activity
  • β-adrenergic receptors – are involved – fear conditioning
  • VIP Peptide – affects – fear conditioning
  • Organism labels on the slide: In Cattle, In Rats, In Humans
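The inferred link from VIP Peptide to fear conditioning amounts to finding a path through the three extracted triples. The sketch below does this with a breadth-first search. It assumes entity normalization has already happened: the `CANONICAL` table, which treats "Catecholamine Biosynthesis" as "Catecholamines" and "β-adrenergic receptor activity" as "β-adrenergic receptors", is a simplification introduced here so that the chain connects, and the function names are hypothetical.

```python
from collections import deque

# The three extracted triples from the slide.
TRIPLES = [
    ("VIP Peptide", "increases", "Catecholamine Biosynthesis"),
    ("Catecholamines", "induce", "β-adrenergic receptor activity"),
    ("β-adrenergic receptors", "are involved", "fear conditioning"),
]

# Assumed normalization that lets adjacent triples share an entity.
CANONICAL = {
    "Catecholamine Biosynthesis": "Catecholamines",
    "β-adrenergic receptor activity": "β-adrenergic receptors",
}


def canon(entity):
    return CANONICAL.get(entity, entity)


def find_chain(start, goal, triples):
    """Breadth-first search for a chain of triples from start to goal."""
    graph = {}
    for s, p, o in triples:
        graph.setdefault(canon(s), []).append((p, canon(o)))

    start, goal = canon(start), canon(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for predicate, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, predicate, nxt)]))
    return None


for step in find_chain("VIP Peptide", "fear conditioning", TRIPLES):
    print(step)
```

A chain like this is a candidate for a new "affects" assertion, not a proof of one; each link may come from a different organism or experimental setting, which is why the slide pairs the example with provenance labels.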
Domain Specific Provenance
  • For Immunology and Warfare Agent effects
      • Which pathogen: Francisella tularensis
      • Which strains: U112, LVS, …
      • What is measured: Cells (dendritic, macrophages, monocytes), proteins, cytokines, chemokines
  • Need to find preexisting taxonomies for organisms, techniques, pathogens; or need to build some and integrate
  • The more specific the provenance, the higher the KB quality
Better Ranking of Abstracts and Triples
  • Use search phrases to fine-tune ranking of abstracts and triples
      • Just because the user clicked on neurogenesis does not mean she wants to know everything about it.
  • Predict which triples or abstracts the user might be more interested in during the current search session.
  • Show top-k entities in the result set to facilitate filtering based on top concepts
  • Visualize networks of a set of browsed triples
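The "top-k entities in the result set" idea can be sketched with a frequency count over the triples returned for a search. The result set below uses placeholder entity names invented for the example, and raw frequency is just one possible scoring choice; the slides do not say how the ranking would actually be computed.

```python
from collections import Counter


def top_k_entities(triples, k):
    """Return the k most frequent entities across subjects and objects."""
    counts = Counter()
    for subject, _predicate, obj in triples:
        counts[subject] += 1
        counts[obj] += 1
    return counts.most_common(k)


# Placeholder result set for a search on neurogenesis.
results = [
    ("entity A", "affects", "neurogenesis"),
    ("entity B", "affects", "neurogenesis"),
    ("entity A", "inhibits", "entity C"),
    ("entity D", "part of", "neurogenesis"),
]

print(top_k_entities(results, 2))
```

The top entities can then be offered as filters, so that a user who clicked on neurogenesis narrows the session to the concepts that dominate the current results.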
Semantic Integration
  • Famous OBO Ontologies (total 7 foundry ontologies)
      • Gene Ontology
      • PRotein Ontology
  • Other domain-specific OBO candidate ontologies
      • NCBI organismal classification
      • Infectious Disease
      • Human Disease (Tularemia found here, and also in SNOMED, and others too; what to choose for mapping?)
      • Pathogen transmission
Semantic Integration, contd…
  • Linked Open Drug Data
      • DrugBank: drugs and drug targets (pathways, structures)
      • Diseasome: disorders and disease-gene associations
      • LinkedCT: Clinical trials
  • Mendelian Inheritance
      • Online Mendelian Inheritance in Animals (OMIA)
      • Online Mendelian Inheritance in Man (OMIM)
  • Different formats: owl, obo, rrf

Editor's Notes

  • #3 Just say that knowledge bases are useful for two specific purposes. In the next slide, talk about what exactly a KB is and what it looks like.
  • #7 Emphasize that predicates are the primary indicators of new knowledge. People are aware of concepts, but what is interesting is which predicates (if any) connect those concepts.
  • #11 The edges here show part_of and is_a relations: neuroscience is a subclass of science, while brain is part of the general area of cognitive science. So we do not model a strict ontology here.
  • #26 NCBI – National Center for Biotechnology Information; OBO – Open Biomedical Ontologies; NCBO – National Center for Biomedical Ontology; SNOMED – Systematized Nomenclature of Medicine
  • #27 RRF – Rich Release Format