Knowledge Extraction Semantic Web


Transcript

  • 1. Knowledge Extraction/Semantic Web. Language Technology I, 2005/06. Paul Buitelaar, German Research Center for Artificial Intelligence (DFKI)
  • 2. Overview
    • Semantic Web
      • Introduction
      • Semantic Web Representation and Query Languages
      • Semantic Web Tools
    • Ontologies and Knowledge Markup
      • Ontologies and other Knowledge Organization Systems
      • Knowledge Markup for Ontology Population
      • Ontology Life-Cycle
    • Knowledge Extraction
      • Ontology Population
      • Ontology Learning
  • 3.
    • Semantic Web
  • 4. Web: Docs, Data
  • 5. Web > Semantic Web: Docs, Data; Knowledge Markup
  • 6. Web > Semantic Web: Docs, Data; Knowledge Markup; Ontologies
  • 7. Web > Semantic Web: Knowledge Markup; Ontologies
  • 8. Accessing the Semantic Web - Machines: Knowledge Markup; Ontologies; Semantic Web Services
  • 9. Accessing the Semantic Web - Humans: Knowledge Markup; Ontologies; Semantic Web Services; Intelligent Man-Machine Interface
  • 10. Semantic Web Layer cake
    • Introduced by Tim Berners-Lee in 2001
    • Built upon existing WWW standards
  • 11. Resource Description Framework (RDF)
    • RDF is an extensible language for expressing graph-structures
    • Serializes to XML
    Graph: node1 --name--> "DFKI GmbH"; node1 --location--> "Kaiserslautern"; node1 --www--> http://www.dfki.de
    <?xml version="1.0"?>
    <rdf:RDF xmlns:rdf="… rdf-syntax-ns#" xmlns:rdfs="… rdf-schema#" xmlns="http://example.org">
      <rdf:Description rdf:nodeID="node1">
        <name>DFKI GmbH</name>
        <location>Kaiserslautern</location>
        <www rdf:resource="http://www.dfki.de"/>
      </rdf:Description>
    </rdf:RDF>
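The graph on this slide can also be held in memory as a list of subject-predicate-object triples. The following is a minimal sketch, not part of the original slides; the `ex:` prefix and the helper names `objects` and `describe` are assumptions made for illustration.

```python
# Triples mirroring the slide's graph: one blank node with three properties.
# The "ex:" prefix stands in for the slide's default namespace (assumption).
triples = [
    ("_:node1", "ex:name", "DFKI GmbH"),
    ("_:node1", "ex:location", "Kaiserslautern"),
    ("_:node1", "ex:www", "http://www.dfki.de"),
]


def objects(graph, subject, predicate):
    """Return all objects stored for a given subject and predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def describe(graph, subject):
    """Collect every property of a subject into a dict of value lists."""
    description = {}
    for s, p, o in graph:
        if s == subject:
            description.setdefault(p, []).append(o)
    return description


print(objects(triples, "_:node1", "ex:location"))
print(describe(triples, "_:node1"))
```

Each `rdf:Description` child element in the XML serialization corresponds to one tuple in the list.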
  • 12. RDF Schema (RDFS)
    • Adds a vocabulary for representing classes and properties to RDF
    (Figure: RDFS class graph. Classes: Person, Teacher, Student, Course. Properties: name (value: rdf:Literal), teaches, enrolledIn. Two is-a links.)
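What the class vocabulary buys in practice is inheritance: an instance of a subclass is also an instance of its superclasses. The sketch below illustrates this under the assumption that the two is-a links in the figure run from Teacher and Student to Person; the instance name `anna` and the function names are hypothetical.

```python
# is-a links, assumed to run from Teacher and Student to Person.
SUBCLASS_OF = {"Teacher": "Person", "Student": "Person"}


def superclasses(cls):
    """Follow is-a links upward and return all ancestors of cls."""
    ancestors = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        ancestors.append(cls)
    return ancestors


def types_of(instance, asserted_types):
    """Return the asserted class plus every inherited superclass."""
    cls = asserted_types[instance]
    return [cls] + superclasses(cls)


asserted = {"anna": "Teacher"}  # hypothetical instance
print(types_of("anna", asserted))
```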
  • 13. Web Ontology Language (OWL)
    • OWL - Based on Description Logics
    • Adds further modelling vocabulary on top of RDFS
    (Figure: language stack from syntax to semantics)
    • XML, RDF: Syntax
    • XML Schema: Data Types
    • Namespaces: Interpretation Context
    • RDF Schema: Formalization of Classes (Inheritance), Properties
    • OWL: Formalization of Classes, Class Definitions, Properties, Property Types (e.g. Transitivity)
  • 14. Semantic Web Query Languages - SPARQL
    • SPARQL - query language developed by W3C
    • Syntactically based on SQL
    • Results available as XML Documents
    • Example query:
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?foafName
    WHERE {
      ?x foaf:name ?foafName .
      OPTIONAL { ?x foaf:mbox ?mbox } .
    }
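To make the role of OPTIONAL concrete, here is a rough sketch that emulates the query over a small in-memory triple list instead of a real SPARQL engine. The data (`Alice`, `Bob`, the mailbox address) and the function name `run_query` are invented for illustration; only the FOAF namespace comes from the slide.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical data: Bob has a name but no mailbox.
triples = [
    ("_:a", FOAF + "name", "Alice"),
    ("_:a", FOAF + "mbox", "mailto:alice@example.org"),
    ("_:b", FOAF + "name", "Bob"),
]


def run_query(graph):
    """Emulate: SELECT ?foafName WHERE { ?x foaf:name ?foafName .
    OPTIONAL { ?x foaf:mbox ?mbox } }"""
    rows = []
    for x, predicate, name in graph:
        if predicate != FOAF + "name":
            continue
        mboxes = [o for s, p, o in graph if s == x and p == FOAF + "mbox"]
        # OPTIONAL acts like a left join: keep the row even without a match.
        for mbox in mboxes or [None]:
            rows.append({"x": x, "foafName": name, "mbox": mbox})
    # Project onto the single selected variable.
    return [row["foafName"] for row in rows]


print(run_query(triples))
```

Without the OPTIONAL clause the mailbox pattern would be mandatory, and the row for Bob would be dropped.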
  • 15. Semantic Web Tools
    • Programming APIs
      • Jena - Java
      • Redland – Python, …
      • RAP - PHP
    • Editors
      • Protégé
      • OntoStudio
      • Triple20 - Prolog
    • Storage
      • Sesame
      • OntoBroker
  • 16.
    • Ontologies and Knowledge Markup
  • 17. Ontologies in Philosophy
    • Ontology is a branch of philosophy that deals with the nature and the organization of reality
    • Science of Being (Aristotle, Metaphysics)
      • What characterizes being?
      • Ultimately, what is being?
  • 18. Ontologies in Computer Science
    • Ontology refers to an engineering artifact
      • a specific vocabulary used to describe a certain reality
      • a set of explicit assumptions regarding the intended meaning of the vocabulary
    • An Ontology is
      • an explicit specification of a conceptualization [Gruber 93]
      • a shared understanding of a domain of interest [Uschold/Gruninger 96]
  • 19. Why Develop an Ontology?
    • Make domain assumptions explicit
      • Easier to change domain assumptions
      • Easier to understand and update legacy data
    • Separate domain knowledge from operational knowledge
      • Re-use domain and operational knowledge separately
    • A community reference for applications
    • Shared understanding of what information means
  • 20. Types of Ontologies [Guarino, 98]
    • Top-level ontologies describe very general concepts like space, time, event, which are independent of a particular problem or domain. It seems reasonable to have unified top-level ontologies for large communities of users.
    • Domain ontologies describe the vocabulary related to a generic domain by specializing the concepts introduced in the top-level ontology.
    • Task ontologies describe the vocabulary related to a generic task or activity by specializing the top-level ontologies.
    • Application ontologies are the most specific ontologies. Concepts in application ontologies often correspond to roles played by domain entities while performing a certain activity.
  • 21. Ontologies and Their Relatives
    • Catalog / ID
    • Terms / Glossary
    • Thesauri
    • Informal Is-a
    • Formal Is-a
    • Formal Instance
    • Frames
    • Value Restrictions
    • General logical constraints
    • Axioms: Disjoint, Inverse Relations, ...
  • 22. Knowledge Organization Systems
    • Semantic Lexicons – e.g. WordNet
      • … group together words according to lexical semantic relations like synonymy, hyponymy, meronymy, antonymy, etc.
    • Thesauri
      • … group together domain terms according to a set of taxonomic relations, including broader term, narrower term, sibling, etc.
    • Semantic Networks and Ontologies
      • … group together classes of objects according to a set of relations that originate in the nature of the domain of application.
      • Ontologies are defined by a formal semantics, but semantic networks may be informally defined. Therefore all ontologies are semantic networks, but not all semantic networks are ontologies.
  • 23. Thesauri - Examples
    • MeSH (Medical Subject Headings) is organized by terms (currently over 250,000) that correspond to a specific medical subject. For each such term a list of syntactic, morphological or semantic variants is given.
      • MeSH Heading: Databases, Genetic
      • Entry Terms: Genetic Databases; Genetic Sequence Databases; OMIM; Online Mendelian Inheritance in Man; Genetic Data Banks; Genetic Data Bases; Genetic Databanks; Genetic Information Databases
      • See Also: Genetic Screening
    • EuroVoc covers terminology in all of the official EU languages for all fields that concern the EU institutions, e.g., politics, trade, law, science, energy, agriculture, 27 such fields in total.
      • MT: 3606 natural and applied sciences
      • UF: gene pool; genetic resource; genetic stock; genotype; heredity
      • BT1: biology
      • BT2: life sciences
      • NT1: DNA
      • NT1: eugenics
      • RT: genetic engineering (6411)
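One common use of such entry terms is to normalize variant spellings to the preferred heading. The following sketch shows the idea with the MeSH entry from this slide; the function names and the case-insensitive matching are assumptions, not a description of how MeSH tooling works.

```python
# Heading and entry terms transcribed from the MeSH example above.
HEADING = "Databases, Genetic"
ENTRY_TERMS = [
    "Genetic Databases",
    "Genetic Sequence Databases",
    "OMIM",
    "Online Mendelian Inheritance in Man",
    "Genetic Data Banks",
    "Genetic Data Bases",
    "Genetic Databanks",
    "Genetic Information Databases",
]


def build_index(heading, entry_terms):
    """Map the lower-cased heading and every variant to the heading."""
    index = {heading.lower(): heading}
    for term in entry_terms:
        index[term.lower()] = heading
    return index


INDEX = build_index(HEADING, ENTRY_TERMS)


def normalize(term):
    """Return the preferred heading for a term, or None if unknown."""
    return INDEX.get(term.strip().lower())


print(normalize("omim"))
```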
  • 24. Semantic Networks - Examples
    • UMLS (Unified Medical Language System) integrates linguistic, terminological and semantic information. The Semantic Network consists of 134 semantic types and 54 relations between types.
      • Pharmacologic Substance affects Pathologic Function
      • Pharmacologic Substance causes Pathologic Function
      • Pharmacologic Substance complicates Pathologic Function
      • Pharmacologic Substance diagnoses Pathologic Function
      • Pharmacologic Substance prevents Pathologic Function
      • Pharmacologic Substance treats Pathologic Function
    • GO (Gene Ontology) allows for “consistent descriptions of gene products in different databases, including several of the world’s major repositories for plant, animal and microbial genomes…” Organizing principles are molecular function, biological process and cellular component.
      • Accession: GO:0009292
      • Ontology: biological process
      • Synonyms: broad: genetic exchange
      • Definition: In the absence of a sexual life cycle, the processes involved in the introduction of genetic information to create a genetically different individual.
      • Term Lineage: all : all (164142) > GO:0008150 : biological process (115947) > GO:0007275 : development (11892) > GO:0009292 : genetic transfer (69)
  • 25. Example Ontology Consider an Example Ontology for the Newspaper Domain
  • 26. Knowledge Markup
    • Ontologies are used to semantically organize and retrieve data (structured, textual, multimedia) through knowledge markup
    • Knowledge Markup from Text is based on Named-Entity Recognition, Semantic Tagging (Term to Class Mapping) and Relation Extraction
    • Consider the following example:
    <news:story xmlns:jobs="http://www.jobs.org/owl-jobs#"
                xmlns:com="http://www.companies.org/owl-companies#"
                xmlns:it="http://www.it.net/owl-it#">
      "We were surprised by several of the results, particularly the order of finish," said <jobs:SystemsAnalyst>Dan Olds</jobs:SystemsAnalyst>.
      <com:Company>IBM</com:Company> finished first with very strong results, and <com:Company>HP</com:Company> scored a solid number two; we expected to see <com:Company>Sun Microsystems</com:Company> challenging for first place or at least a strong second place.
      As the largest <it:operatingsystem>UNIX</it:operatingsystem> vendor in terms of number of installed systems, a third place finish should put their management on notice that their installed base may be vulnerable.
    </news:story>
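Once text carries such markup, the annotated entities and their ontology classes can be read back with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened version of the story. The `news` namespace URI is a hypothetical placeholder, since the slide does not declare one, and `extract_annotations` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Shortened story; the news namespace URI is a hypothetical placeholder.
STORY = """<news:story xmlns:news="http://example.org/news#"
    xmlns:jobs="http://www.jobs.org/owl-jobs#"
    xmlns:com="http://www.companies.org/owl-companies#"
    xmlns:it="http://www.it.net/owl-it#">
said <jobs:SystemsAnalyst>Dan Olds</jobs:SystemsAnalyst>.
<com:Company>IBM</com:Company> finished first, and
<com:Company>HP</com:Company> scored a solid number two.
As the largest <it:operatingsystem>UNIX</it:operatingsystem> vendor.
</news:story>"""


def extract_annotations(xml_text):
    """Return (class IRI, annotated text) pairs for each marked-up span."""
    root = ET.fromstring(xml_text)
    pairs = []
    for element in root:
        # ElementTree spells qualified tags as "{namespace}localname".
        namespace, local = element.tag[1:].split("}")
        pairs.append((namespace + local, element.text.strip()))
    return pairs


annotations = extract_annotations(STORY)
for class_iri, text in annotations:
    print(class_iri, "->", text)
```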
  • 27. Knowledge Markup - Images Semantic Annotation of Medical Images (miAKT Project - UK)
  • 28. Knowledge Markup - Images Semantic Annotation of Video (SmartMedia – DFKI KM)
  • 29. Ontology Life-Cycle
    • Create/Select: Development and/or Selection
    • Populate: Knowledge Base Generation
    • Validate: Consistency Checks
    • Evolve: Extension, Modification
    • Maintain: Usability Tests
    • Deploy: Knowledge Retrieval
  • 30.
    • Knowledge Extraction
    • Ontology Population & Ontology Learning
  • 31. Ontology Life-Cycle – Ontology Population
    • Create/Select: Development and/or Selection
    • Populate: Knowledge Base Generation
    • Validate: Consistency Checks
    • Evolve: Extension, Modification
    • Maintain: Usability Tests
    • Deploy: Knowledge Retrieval
  • 32. Ontology Population with SOBA
    • SOBA: SmartWeb Ontology-based Annotation
    • Application Context
      • SmartWeb (http://www.smartweb-projekt.de/) – German Project around World-Cup 2006
      • Integrates
        • Multimodal Dialog Processing
        • IR-based Question Answering
        • Ontology-Based Information Extraction
        • Semantic Web Services
    • Ontology-Based Information Extraction …
      • Combines:
        • Semantic Wrapping of Semi-Structured Data
        • Semantic and Linguistic Annotation of Free Text
        • Inference Rules for Instantiation and Integration of Annotated Entities and Events
    • … and Display
      • Ontology-driven Hyperlink Generation for Display of Extracted Information
  • 33. SOBA – Processing and Data Flow
    • Inputs: Documents, Ontologies
    • Processing: Wrapping of Semi-Structured Data; Linguistic Annotation; Named Entity Recognition & Semantic Tagging; Image Extraction; PDF Analysis
    • Inference Rules for Instantiation & Integration
    • Output: Knowledge Base
  • 34. SWIntO: SmartWeb Integrated Ontology
    • Example classes: SmartDOLCE:Entity, SmartSUMO:Attribute, SmartSUMO:SocialRole, SmartSUMO:Proposition, SportEvent:FootballPlayer, SportEvent:Goalkeeper, SportEvent:FootballOrganizationPerson, SportEvent:FootballClubPresident, …
    • SWIntO (by AIFB, DFKI KM/IUI, EML) covers
      • Foundational (DOLCE) and General (SUMO) Knowledge
      • Domain- and Task-Specific Knowledge
        • Football / Sport Events
        • Navigation, Discourse, Multimedia
        • other
  • 35. SmartWeb Integrated Ontology (by AIFB, DFKI KM/IUI, EML)
  • 36.  
  • 37. SmartWeb Corpus
    • (Growing) Web Corpus, collected by monitoring
      • http://fifaworldcup.yahoo.com/
      • http://www.uefa.com/competitions/worldcup
    • Semi-Structured Data
      • Tabular: Match Reports, Teams, etc.
    • Free Text
      • Match Reports
      • Image Captions
  • 38. Semi-Structured Data - HTML
  • 39. Semi-Structured Data - XML
  • 40. Semi-Structured Data – F-Logic
  • 41. Information Extraction from Free Text: MatchEvent [Score, Team1, Team2]; FootballPlayer
  • 42. Information Extraction from Image Captions: FoulEvent [FootballPlayer]; FootballPlayer
  • 43. Linguistic and Semantic Annotation
    • Input: Mark Crossley saved twice with his legs from Huckerby.
    • Named Entity Recognition & Semantic Tagging: [Mark Crossley GOALKEEPER] [saved GOALKEEPER_ACTION] twice with his legs from [Huckerby PLAYER].
    • Linguistic Annotation: [Mark Crossley GOALKEEPER : SUBJ] [saved PRED : GOALKEEPER_ACTION] twice [with his legs PP_OBJ] [from [Huckerby PLAYER] PP_ADJUNCT].
    • Result: [GOALKEEPER_ACTION = 'save', GOALKEEPER = 'Mark Crossley', PLAYER = 'Huckerby', MANNER = 'legs']
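The step from tagged spans to a filled template can be sketched with a tiny gazetteer lookup. This is only an illustration of the idea, not the SOBA implementation: the lexicon, the lemma table and the function names are hypothetical, and the MANNER slot is left out because it needs the syntactic analysis shown on the slide.

```python
# Hypothetical gazetteer: surface form -> semantic class.
LEXICON = {
    "Mark Crossley": "GOALKEEPER",
    "saved": "GOALKEEPER_ACTION",
    "Huckerby": "PLAYER",
}
# Hypothetical lemma table; a real system would use a morphological analyzer.
LEMMAS = {"saved": "save"}


def tag(sentence):
    """Return (surface form, class) pairs in order of appearance."""
    found = []
    for surface, semantic_class in LEXICON.items():
        position = sentence.find(surface)
        if position >= 0:
            found.append((position, surface, semantic_class))
    return [(surface, cls) for _, surface, cls in sorted(found)]


def fill_template(tags):
    """Turn tagged spans into slot/value pairs, using lemmas where known."""
    return {cls: LEMMAS.get(surface, surface) for surface, cls in tags}


sentence = "Mark Crossley saved twice with his legs from Huckerby."
template = fill_template(tag(sentence))
print(template)
```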
  • 44. Annotation/Extraction Example
    • Example Sentence from Match Report
      • Allerdings ist Petrow fuer die Partie gegen Schweden gesperrt und kann erst gegen Ungarn eingesetzt werden.
      • “However, Petrow is suspended for the match against Sweden and cannot be fielded again until the match against Hungary.”
    • Annotated/Extracted Information (with SProUT IE Tool - DFKI-LT)
    • player_action & [GAME_EVENT "Ban",
      • AGENT player & [SURNAME "PETROW"],
      • IN_MATCH game & [TEAM2 "SWE", TOURNAMENT "Match"]]
      • team & [NAME "HUN"]
  • 45. Knowledge Base Generation
    • Transformation of SProUT Output to F-Logic via Declarative Mappings, e.g.:
    <type orig="player" target="dolce#natural-person-denomination">
      <link type="dolce#natural-person" method="dolce#HAS-DENOMINATION" id=""/>
      <map>
        <simple-mapping>
          <input>
            <arg orig="GIVEN_NAME" target="VAR1"/>
          </input>
          <output method="dolce#FIRSTNAME" value="VAR1"/>
        </simple-mapping>
        <simple-mapping>
          <input>
            <arg orig="SURNAME" target="VAR1"/>
          </input>
          <output method="dolce#LASTNAME" value="VAR1"/>
        </simple-mapping>
      </map>
    </type>
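The effect of the two simple mappings can be pictured as a rename of source attributes to ontology methods. The sketch below is an assumption about what such a mapping does, written as a plain dictionary lookup; it is not the SOBA mapping engine, and the `TEAM` attribute is invented to show that unmapped attributes are skipped.

```python
# Mapping table read off the slide: source attribute -> target method.
PLAYER_MAPPING = {
    "GIVEN_NAME": "dolce#FIRSTNAME",
    "SURNAME": "dolce#LASTNAME",
}


def apply_mapping(features, mapping):
    """Rename the attributes that have a mapping and skip all others."""
    return {mapping[key]: value
            for key, value in features.items() if key in mapping}


# TEAM is a hypothetical extra attribute without a mapping.
player = {"SURNAME": "Buchwald", "GIVEN_NAME": "Guido", "TEAM": "GER"}
denomination = apply_mapping(player, PLAYER_MAPPING)
print(denomination)
```

Keeping the mapping as data rather than code is what makes it declarative: new attribute pairs can be added without touching the transformation logic.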
  • 46. SProUt to F-Logic
    • SProUT output:
    <FS type="player_action">
      <F name="GAME_EVENT"><FS type="world champion"/></F>
      <F name="ACTION_TIME"><FS type="1990"/></F>
      <F name="ACTION_LOCATION"><FS type="Italy"/></F>
      <F name="AGENT">
        <FS type="player">
          <F name="SURNAME"><FS type="Buchwald"/></F>
          <F name="GIVEN_NAME"><FS type="Guido"/></F>
        </FS>
      </F>
    </FS>
    • F-Logic:
    soba#player124:sportevent#FootballPlayer [sportevent#impersonatedBy -> soba#Guido_BUCHWALD].
    soba#Guido_BUCHWALD:dolce#"natural-person" [dolce#"HAS-DENOMINATION" -> soba#Guido_BUCHWALD_Denomination].
    soba#Guido_BUCHWALD_Denomination:dolce#"natural-person-denomination" [dolce#LASTNAME -> "Buchwald"; dolce#FIRSTNAME -> "Guido"].
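As a rough illustration of the last step, the person and denomination facts can be produced by filling string templates with the extracted names. The function `to_flogic` and its identifier scheme (given name, underscore, upper-cased surname) are assumptions inferred from the single example on this slide, not a documented SOBA convention.

```python
def to_flogic(given_name, surname):
    """Build the person and denomination facts for one extracted player."""
    # Identifier scheme inferred from the slide's example (assumption).
    person = f"soba#{given_name}_{surname.upper()}"
    denomination = f"{person}_Denomination"
    return [
        f'{person}:dolce#"natural-person" '
        f'[dolce#"HAS-DENOMINATION" -> {denomination}].',
        f'{denomination}:dolce#"natural-person-denomination" '
        f'[dolce#LASTNAME -> "{surname}"; dolce#FIRSTNAME -> "{given_name}"].',
    ]


facts = to_flogic("Guido", "Buchwald")
for fact in facts:
    print(fact)
```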
  • 47. A Complex Example
    semistruct#"Bolivien_vs_Brasilien_09_Oct_05_16_00_Luis_CRISTALDO":sportevent#FieldMatchFootballPlayer [
      externalRepresentation@(de) ->> "Luis CRISTALDO (7)";
      sportevent#number -> 7;
      sportevent#impersonatedBy -> semistruct#"Luis_CRISTALDO" ].
    semistruct#"Bolivien_vs_Brasilien_09_Oct_05_16_00" [
      sportevent#matchEvents -> soba#ID25 ].
    soba#ID25:sportevent#Foul [
      sportevent#commitedBy -> semistruct#"Bolivien_vs_Brasilien_09_Oct_05_16_00_Luis_CRISTALDO" ].
    mediainst#ID67:media#Picture [
      media#URL -> "http://fifaworldcup.yahoo.com/06/de/photos/index.html?aid=124155&d=1";
      media#shows -> ID25 ].
  • 48. Display of Extracted Information
  • 49. Ontology Life-Cycle – Ontology Learning
    • Create/Select: Development and/or Selection
    • Populate: Knowledge Base Generation
    • Validate: Consistency Checks
    • Evolve: Extension, Modification
    • Maintain: Usability Tests
    • Deploy: Knowledge Retrieval
  • 50. Ontology Learning Layer Cake
    • Terms: disease, doctor, hospital
    • (Multilingual) Synonyms: {disease, illness, Krankheit}
    • Concepts: DISEASE:=<Int, Ext, Lex>
    • Taxonomy: is_a(DOCTOR, PERSON)
    • Relations: cure(dom:DOCTOR, range:DISEASE)
    • Rules & Axioms
    • Introduced in: Philipp Cimiano, PhD Thesis, University of Karlsruhe, forthcoming
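The layers can be read as increasingly structured data. The sketch below models the slide's examples with small Python types, reading <Int, Ext, Lex> as intension, extension and lexical realizations, which is an interpretation rather than something the slide spells out. The intension text and all class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str
    intension: str = ""                           # Int: informal definition
    extension: set = field(default_factory=set)   # Ext: known instances
    lexicon: set = field(default_factory=set)     # Lex: terms and synonyms


@dataclass(frozen=True)
class Relation:
    name: str
    domain: str
    range: str


# Concept layer: the intension text is a made-up placeholder.
disease = Concept("DISEASE", intension="an abnormal condition",
                  lexicon={"disease", "illness", "Krankheit"})
IS_A = {"DOCTOR": "PERSON"}                    # taxonomy layer
cure = Relation("cure", "DOCTOR", "DISEASE")   # relation layer


def is_a(sub, sup):
    """True if sup is reachable from sub via is_a links."""
    while sub in IS_A:
        sub = IS_A[sub]
        if sub == sup:
            return True
    return False


def accepts(relation, subject_class, object_class):
    """Check arguments against domain and range, allowing subclasses."""
    def fits(cls, target):
        return cls == target or is_a(cls, target)
    return fits(subject_class, relation.domain) and fits(object_class, relation.range)


print(is_a("DOCTOR", "PERSON"), accepts(cure, "DOCTOR", "DISEASE"))
```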
  • 51. Some Current Work on Ontology Learning from Text
    • Term Extraction
        • Statistical Analysis
        • Patterns
        • (Shallow) Linguistic Parsing
        • Term Disambiguation & Compositional Interpretation
        • Combinations
    • Taxonomy Extraction
        • Statistical Analysis & Clustering (e.g. FCA)
        • Patterns
        • (Shallow) Linguistic Parsing
        • WordNet
        • Combinations
    • Relation Extraction
        • Anonymous Relations (e.g. with Association Rules)
        • Named Relations (Linguistic Parsing)
        • (Linguistic) Compound Analysis
        • Web Mining, Social Network Analysis
        • Combinations
    • Definition Extraction
        • (Linguistic) Compound Analysis (incl. WordNet)
    Overview of Current Work: Paul Buitelaar, Philipp Cimiano, Bernardo Magnini, Ontology Learning from Text: Methods, Evaluation and Applications. Frontiers in Artificial Intelligence and Applications Series, Vol. 123, IOS Press, July 2005.
  • 52. RelExt - Relation Extraction for Ontology Learning
    (Figure: Ontology Learning Layer Cake, repeated: Terms, (Multilingual) Synonyms, Concepts, Taxonomy, Relations, Rules & Axioms)
  • 53. RelExt - Motivation
    • Extend Ontology with Relations
        • Currently ~ 60 Relations in the Sport Events Ontology
          • Mostly Properties, e.g. hasName, atMinute, …
        • Representation of (Verbal) Relations Enables Better Modeling of Events for Information Extraction Purposes
    • Example
        • “Ballack shoots the ball in the net.”
        • Relation: Shoot (Domain: FootballPlayer, Range: BallObject)
  • 54. RelExt – System Architecture
    • Linguistic Annotation: Corpus > Named-Entity Rec. & Semantic Tagging, Shallow Parsing > Annotated Corpus
    • Statistical Processing
      • Relevance Measure (Frequencies in BNC, NZZ) > Relevance Scores (Heads, Preds)
      • Co-occurrence Measure > Co-occurrence Scores (Heads <> Preds)
    • Relation Extraction and Evaluation: Triple Generation > Triples (Head : Pred : Head) > Evaluation
  • 55. Linguistic Annotation
    • Named-Entity Recognition
      • “Michael Ballack”: FootballPlayer
    • Semantic Tagging
      • “Ball” (ball), “Leder” (leather): BallObject
    • Shallow Parsing
      • Part-of-Speech Tagging
        • Fussballspieler (soccer player): Noun
      • Morphological Analysis
        • Fussballspieler: Fussball – Spieler
      • Dependency Structure Analysis
        • “The team won the second match.”
        • SUBJECT (The team), PREDICATE (won), DIRECT_OBJECT (the second match)
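The morphological analysis step (Fussballspieler: Fussball – Spieler) can be approximated by a dictionary-based compound splitter. The sketch below is a simplification for illustration: the vocabulary is hypothetical, only two-part splits are attempted, and linking elements of German compounds are ignored.

```python
# Hypothetical vocabulary of known word parts (lower-cased).
VOCABULARY = {"fussball", "spieler", "fuss", "ball"}


def split_compound(word, vocabulary=VOCABULARY):
    """Split a word into two known parts, preferring the longest first part."""
    lower = word.lower()
    for cut in range(len(lower) - 1, 0, -1):
        head, tail = lower[:cut], lower[cut:]
        if head in vocabulary and tail in vocabulary:
            return [head.capitalize(), tail.capitalize()]
    # No split found: return the word unchanged.
    return [word]


print(split_compound("Fussballspieler"))
```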
  • 56. Relevance Ranking
    (Tables: Top-10 Head-Nouns before and after mapping to Ontology Classes; Top-10 Predicates)
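A relevance ranking of this kind compares how often a word occurs in the domain corpus with how often it occurs in a general reference corpus. The slides do not state which measure RelExt uses, so the sketch below swaps in a plain relative-frequency ratio with add-one smoothing; the counts are invented toy data.

```python
def relevance(term, domain_counts, reference_counts):
    """Ratio of relative frequency in the domain corpus to the smoothed
    relative frequency in the reference corpus (a stand-in measure)."""
    domain_total = sum(domain_counts.values())
    reference_total = sum(reference_counts.values())
    domain_rel = domain_counts.get(term, 0) / domain_total
    # Add-one smoothing avoids division by zero for unseen terms.
    reference_rel = (reference_counts.get(term, 0) + 1) / (reference_total + 1)
    return domain_rel / reference_rel


def rank(domain_counts, reference_counts, top_n=10):
    """Return the top_n domain terms ordered by descending relevance."""
    return sorted(
        domain_counts,
        key=lambda term: relevance(term, domain_counts, reference_counts),
        reverse=True,
    )[:top_n]


# Invented toy counts: "Jahr" (year) is frequent everywhere.
domain = {"Spieler": 50, "Ball": 30, "Jahr": 20}
reference = {"Spieler": 5, "Ball": 5, "Jahr": 90}
print(rank(domain, reference))
```

Words that are frequent in both corpora score low, which is why a general word like "Jahr" drops to the bottom of the toy ranking.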
  • 57. Co-Occurrence Analysis
    • flanken (to cross)
      • SUBJ: FOOTBALLPLAYER “Klasnic”
      • DOBJ: FOOTBALLPLAYER “Klose”
      • flanken_in PP_ADJ: “Zuschauer” (audience)
    • beschimpfen (to insult)
      • SUBJ: FOOTBALLPLAYER “Klasnic”
    • ...
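Co-occurrence scores express how strongly a predicate is associated with the class of its argument. The measure used by RelExt is not given on the slides, so the sketch below uses pointwise mutual information as a stand-in; the observation list is invented toy data built from the labels on this slide.

```python
import math
from collections import Counter

# Invented (argument class, predicate) observations.
observations = [
    ("FOOTBALLPLAYER", "flanken"),
    ("FOOTBALLPLAYER", "flanken"),
    ("FOOTBALLPLAYER", "beschimpfen"),
    ("Zuschauer", "beschimpfen"),
]


def pmi(head, pred, pairs):
    """Pointwise mutual information (base 2) of a head/predicate pair."""
    total = len(pairs)
    pair_counts = Counter(pairs)
    head_counts = Counter(h for h, _ in pairs)
    pred_counts = Counter(p for _, p in pairs)
    joint = pair_counts[(head, pred)] / total
    if joint == 0:
        return float("-inf")  # the pair was never observed together
    expected = (head_counts[head] / total) * (pred_counts[pred] / total)
    return math.log2(joint / expected)


print(pmi("FOOTBALLPLAYER", "flanken", observations))
print(pmi("FOOTBALLPLAYER", "beschimpfen", observations))
```

Pairs that score high are candidates for a named relation, with the argument class as domain or range.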
  • 58. Integration into Ontology Development
  • 59. OntoLT – Protégé Plug-In for Ontology Extraction from Text
    (Figure: Ontology Learning Layer Cake, repeated: Terms, (Multilingual) Synonyms, Concepts, Taxonomy, Relations, Rules & Axioms)
  • 60. OntoLT – Basic Idea
    • Middleware Solution in Ontology Development
      • Supports the Ontology Engineer through Semi-Automatic Extraction of Ontology Fragments from Domain-Relevant Document Collections
      • Download http://olp.dfki.de/OntoLT/OntoLT.htm
    • Based on
      • Automatic Linguistic Annotation
      • Manual Definition of Mapping Rules
      • Statistical Preprocessing (Option)
      • Interactive Validation of Candidates
      • Generation in Protégé of Ontology Fragments
  • 61. OntoLT – System Architecture
  • 62. Corpus Example – KMI News
  • 63. Mapping Rules
  • 64. Statistical Relevance
  • 65. Extract Candidates
  • 66. Generate Ontology Fragments
  • 67. Exercises
    • Knowledge Extraction
      • Ontology Modeling (from Text)
      • Ontology Population
      • Ontology Learning (Extension)
      • Ontology Mapping