SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Reasoning

The creation of new knowledge in the Semantic Web increasingly depends on automatic knowledge enrichment processes such as semi-structured Information Extraction (IE), for example the creation of DBpedia from Wikipedia. To further improve knowledge coverage, IE must also consider unstructured plain natural-language text. Here SOFIE offers a novel approach to IE that can consistently enrich semantic models from text sources by combining pattern matching, entity disambiguation and reasoning in a propositional logic approach using MAX SAT.

Published in: Education, Technology

Transcript

  • 1. SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Reasoning
    Tobias Wunner
    Unit for Natural Language Processing (UNLP)
    firstname.lastname@deri.org
    Wednesday, 22nd June 2011
    DERI, Reading Group
  • 2. Based On:
    “SOFIE: A Self-Organizing Framework for Information Extraction”
    Authors: Fabian Suchanek, Mauro Sozio,
    Gerhard Weikum
    Published: World Wide Web Conference (WWW)
    Madrid, 2009
  • 3. Overview
    Introduction
    SOFIE Model + Rules
    Excursion: Satisfiability
    SOFIE Approach
    Evaluation experiments
    Conclusion
  • 4. Motivation
    Classical IE on text
    pattern-based → 80%
    Semi-structured approach
    Wikipedia infoboxes → 95%
    Idea of paper: combine
    use text (hypotheses) + ontology (trusted facts)
  • 5. Example
    Document1
    Einstein attended secondary school in Germany.
    YAGO ontology
    familyName(AlbertEinstein, Einstein)
    bornIn(AlbertEinstein, Germany)
    New Knowledge
    attendedSchoolIn(AlbertEinstein, Germany)
  • 6. General Idea
    Express extraction patterns as facts
    patternOcc(“X went to school in Y”,Einstein, Switzerland)
    Rules to understand usage of terms
    patternOcc(Pattern,X,Y) and R(X,Y) ⇒ express(Pattern,R)
    Add restrictions
  • 7. Contribution
    Unified approach to
    Pattern matching
    Word Sense Disambiguation
    Reasoning
    Large Scale
    On Unstructured Data
  • 8. Pattern extraction with WICs
    Extract patterns based on ‘interesting’ entities
    Documents
    Einstein was born at Ulm in Württemberg, Germany, on March 18, 1879. When Albert was around four, his father gave him a magnetic compass.
    When Albert became older, he went to a school in Switzerland. After he graduated, he got a job in the patent office there…
    Knowledge Base
    patternOcc(“Einstein was born in Ulm”,Einstein@D1, Ulm@D1) [1]
    patternOcc(“Ulm is in Württemberg, Germany”,Ulm@D1, Germany@D1) [1]
    patternOcc(“Albert .. Switzerland”,Albert@D1, Switzerland@D1) [1]
    WICs (Word in Context)
  • 9. Grounding
    Test Rules
    How?
    find an instance which satisfies the formulae
    bornIn(Einstein,Ulm) ⇒ ¬bornIn(Einstein,Timbuktu)
    studiedIn(Einstein,Ulm)
    bornIn(X,Ulm) ⇒ ¬bornIn(X,Timbuktu)
    studiedIn(X,Ulm)
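Grounding as described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not SOFIE's implementation: the helper `ground`, the string templates and the second constant `Kepler` are assumptions made for the example.

```python
from itertools import product


def ground(template, variables, constants):
    """Return one propositional instance of `template` per variable binding."""
    instances = []
    for values in product(constants, repeat=len(variables)):
        binding = dict(zip(variables, values))
        instances.append(template.format(**binding))
    return instances


# Rule with a free variable X, written as a format template (hypothetical encoding).
rule = "bornIn({X},Ulm) => not bornIn({X},Timbuktu)"

# "Kepler" is a made-up second constant, used only to show several instances.
grounded = ground(rule, ["X"], ["Einstein", "Kepler"])
for instance in grounded:
    print(instance)
```

Each grounded instance contains no variables, so every atom in it can be treated as a propositional variable by a SAT or MAX SAT procedure.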
  • 10. Rules (Hypotheses)
    Disambiguation
    disambiguatesAs(Albert@D,AlbertEinstein)[?]
    Expresses a new fact
    expresses(P, livedIn(Einstein,Switzerland) )[?]
    New facts
    CityIn(Ulm,Germany)[?]
  • 11. New fact rule
    ...with disambiguation
    “Pattern P expresses relation R when the analysed WICs are disambiguated”
    patternOcc( P, WX, WY ) and
    disambiguatesAs(WX, X) and
    disambiguatesAs(WY, Y) and
    R(X,Y)
    ⇒ express( P, R )
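The rule on this slide can be read as a simple join over three fact tables. The following Python sketch assumes facts are stored as plain tuples and dictionaries; the function name `infer_expresses` and the trusted fact `livedIn(AlbertEinstein, Switzerland)` are illustrative assumptions, and the sketch applies the rule once rather than solving the full weighted problem.

```python
def infer_expresses(pattern_occ, disambiguates_as, facts):
    """Apply: patternOcc(P, WX, WY) and disambiguatesAs(WX, X) and
    disambiguatesAs(WY, Y) and R(X, Y) => express(P, R)."""
    expresses = set()
    for pattern, wx, wy in pattern_occ:
        x = disambiguates_as.get(wx)
        y = disambiguates_as.get(wy)
        if x is None or y is None:
            continue  # at least one WIC has no disambiguation yet
        for relation, fx, fy in facts:
            if (fx, fy) == (x, y):
                expresses.add((pattern, relation))
    return expresses


# Toy inputs; the ontology fact below is assumed for the example.
pattern_occ = [("Albert .. Switzerland", "Albert@D1", "Switzerland@D1")]
disambiguates_as = {
    "Albert@D1": "AlbertEinstein",
    "Switzerland@D1": "Switzerland",
}
facts = {("livedIn", "AlbertEinstein", "Switzerland")}

result = infer_expresses(pattern_occ, disambiguates_as, facts)
print(result)
```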
  • 12. Restrictions
    Disambiguation
    disambiguation prior should influence choice of disambiguation
    disambPrior( W, X, N )
    ⇒ disambiguatedAs( W, X )
    N - any disambiguation function, e.g.
    | words(D1) ∩ rel(AlbertEinstein) | / | words(D1) |
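The overlap ratio on this slide is easy to compute directly. The sketch below assumes that `words(D1)` is the set of distinct words in the document and that `rel(X)` is a set of words related to entity X in the ontology; both word sets for the two Einsteins are invented for the example.

```python
def disamb_prior(doc_words, entity_context_words):
    """Share of the document's distinct words that also occur in the entity's context."""
    doc = set(doc_words)
    if not doc:
        return 0.0
    return len(doc & set(entity_context_words)) / len(doc)


doc_d1 = "when albert became older he went to a school in switzerland".split()

# Hypothetical context words; a real system would derive them from the ontology.
rel_albert_einstein = {"switzerland", "ulm", "physics", "school"}
rel_hermann_einstein = {"ulm", "munich", "merchant"}

prior_albert = disamb_prior(doc_d1, rel_albert_einstein)
prior_hermann = disamb_prior(doc_d1, rel_hermann_einstein)
print(prior_albert, prior_hermann)
```

With these toy sets the prior favours AlbertEinstein, which is the kind of signal the rule `disambPrior(W, X, N) ⇒ disambiguatedAs(W, X)` is meant to carry into the reasoning step.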
  • 13. Restrictions
    Functional restrictions
    R(X,Y) and
    type(R, function) and
    different(Y,Z)
    ⇒ ¬R(X,Z)
    “Albert@D1 born in?”
    Albert@D1 ≠ Albert@D2
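A small Python sketch of the functional restriction, assuming facts are `(relation, subject, object)` tuples. The function name `functional_conflicts` and the sample facts are illustrative; the sketch only detects violations, whereas SOFIE encodes the restriction as a rule for the solver.

```python
def functional_conflicts(facts, functional_relations):
    """Find fact pairs that break:
    R(X,Y) and type(R, function) and different(Y,Z) => not R(X,Z)."""
    seen = {}
    conflicts = []
    for relation, x, y in facts:
        if relation not in functional_relations:
            continue
        key = (relation, x)
        if key in seen and seen[key] != y:
            conflicts.append(((relation, x, seen[key]), (relation, x, y)))
        else:
            seen.setdefault(key, y)
    return conflicts


# Toy hypotheses; only bornIn is declared functional here.
facts = [
    ("bornIn", "AlbertEinstein", "Ulm"),
    ("bornIn", "AlbertEinstein", "Timbuktu"),
    ("livedIn", "AlbertEinstein", "Ulm"),
    ("livedIn", "AlbertEinstein", "Switzerland"),
]
conflicts = functional_conflicts(facts, {"bornIn"})
print(conflicts)
```

The two `livedIn` facts are not reported because `livedIn` is not marked as a function in this example.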
  • 14. SOFIE Rules
    Framework to test the hypotheses
    Question
    “How to satisfy them all?”
    rules + trusted facts
    disambPrior(Albert@D1, AlbertEinstein, 10)
    ⇒ disambiguatesAs(Albert@D1, AlbertEinstein)
    patternOcc( P, X, Y ) and
    R(X,Y)
    ⇒ express( P, R )
    disambPrior(Albert@D1, HermannEinstein, 3)
    ⇒ disambiguatesAs(Albert@D1, HermannEinstein)
    Country(Germany)
    livedIn(AlbertEinstein,Ulm)

  • 15. SAT / MAX SAT
    SAT (Satisfiability)
    prove that a formula can be TRUE
    Complexity Classes
    P → good, e.g. N^k
    NP → bad, e.g. c^N
    e.g. naive algorithm for 100 variables
    → 2^100 x 10^-10 s per row = 4 x 10^12 y
    Not always: 3SAT in (4/3)^N
    SAT Solver
    F = (X or Y or Z) and (¬X or Y or Z)
    and (¬X or ¬Y or ¬Z)
    G = (X or Y) and (¬X or ¬Y) and (X)
    truth table has 2^3 rows
    Details: Schöning 2010
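The naive truth-table approach mentioned on this slide can be sketched in Python for the two example formulas F and G. The clause encoding (lists of `(variable, polarity)` pairs) is an assumption of this sketch. The running-time estimate assumes 10^-10 seconds per truth-table row, which is the unit under which the slide's figure of about 4 x 10^12 years comes out.

```python
from itertools import product

# A clause is a list of (variable, polarity) literals; polarity False means negated.
F = [
    [("X", True), ("Y", True), ("Z", True)],
    [("X", False), ("Y", True), ("Z", True)],
    [("X", False), ("Y", False), ("Z", False)],
]
G = [
    [("X", True), ("Y", True)],
    [("X", False), ("Y", False)],
    [("X", True)],
]


def satisfying_assignments(clauses):
    """Enumerate the whole truth table and keep the rows that satisfy every clause."""
    variables = sorted({var for clause in clauses for var, _ in clause})
    models = []
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(any(assignment[v] == pol for v, pol in clause) for clause in clauses):
            models.append(assignment)
    return models


models_f = satisfying_assignments(F)
models_g = satisfying_assignments(G)
print(len(models_f), "of 8 rows satisfy F")
print("models of G:", models_g)

# Cost of the same enumeration for 100 variables, assuming 1e-10 seconds per row.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
years = 2**100 * 1e-10 / SECONDS_PER_YEAR
print(f"about {years:.1e} years")
```

The enumeration is fine for three variables but doubles with every added variable, which is why practical SAT solvers avoid building the full table.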
  • 16. SAT / MAX SAT
    MAX SAT
  • 17. Weighted MAX SAT in SOFIE
    ...back to SOFIE
    this is MAX SAT but with weights
    rules + trusted facts
    Country(Germany)
    livedIn(AlbertEinstein,Ulm)
    disambPrior(Albert@D1, AlbertEinstein, 10)
    ⇒ disambiguatesAs(Albert@D1, AlbertEinstein)
    patternOcc( P, X, Y ) and
    R(X,Y)
    ⇒ express( P, R )
    disambPrior(Albert@D1, HermannEinstein, 3)
    ⇒ disambiguatesAs(Albert@D1, HermannEinstein)
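To make "MAX SAT but with weights" concrete, here is a brute-force Python sketch on a tiny instance built from the two competing disambiguations. Several things are assumptions of the sketch: the variable names `dA` and `dH`, the use of the prior values 10 and 3 as clause weights, and the extra mutual-exclusion clause with a made-up weight of 100. Exhaustive search is only feasible for toy inputs.

```python
from itertools import product


def weighted_max_sat(variables, clauses):
    """Try every assignment and return the one with the largest satisfied weight."""
    best_assignment, best_weight = None, -1
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        weight = sum(
            w for literals, w in clauses
            if any(assignment[v] == pol for v, pol in literals)
        )
        if weight > best_weight:
            best_assignment, best_weight = assignment, weight
    return best_assignment, best_weight


# dA: disambiguatesAs(Albert@D1, AlbertEinstein)
# dH: disambiguatesAs(Albert@D1, HermannEinstein)
variables = ["dA", "dH"]
clauses = [
    ([("dA", True)], 10),                  # prior for AlbertEinstein
    ([("dH", True)], 3),                   # prior for HermannEinstein
    ([("dA", False), ("dH", False)], 100), # assumed: a word refers to one entity only
]

best_assignment, best_weight = weighted_max_sat(variables, clauses)
print(best_assignment, best_weight)
```

Not every clause can be satisfied at once here, so the solver gives up the lightest one, the prior of weight 3.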
  • 18. Weighted MAX SAT in SOFIE
    Weighted MAX SAT is NP-hard
    only approximation algorithms
    → impractical to find the optimal solution
    SAT Solver
    Johnson’s algorithm: 2/3 (approximation guarantee)
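The following Python sketch shows a greedy procedure in the spirit of Johnson's algorithm, written as the method of conditional expectations: each variable in turn gets the value that keeps the expected satisfied weight highest if all remaining variables were set by fair coin flips. This formulation, the clause encoding and the toy instance (including the weight-100 exclusion clause) are assumptions of the sketch, and it presumes no variable occurs twice in one clause.

```python
def expected_weight(clauses, assignment):
    """Expected satisfied weight when unassigned variables are set uniformly at random."""
    total = 0.0
    for literals, weight in clauses:
        unassigned = 0
        satisfied = False
        for var, polarity in literals:
            if var not in assignment:
                unassigned += 1
            elif assignment[var] == polarity:
                satisfied = True
                break
        if satisfied:
            total += weight
        else:
            # The clause fails only if every unassigned literal comes out false.
            total += weight * (1 - 0.5 ** unassigned)
    return total


def greedy_assignment(variables, clauses):
    """Fix variables one by one, never lowering the conditional expectation."""
    assignment = {}
    for var in variables:
        if_true = expected_weight(clauses, {**assignment, var: True})
        if_false = expected_weight(clauses, {**assignment, var: False})
        assignment[var] = if_true >= if_false
    return assignment


def satisfied_weight(clauses, assignment):
    return sum(
        w for literals, w in clauses
        if any(assignment[v] == pol for v, pol in literals)
    )


variables = ["dA", "dH"]
clauses = [
    ([("dA", True)], 10),
    ([("dH", True)], 3),
    ([("dA", False), ("dH", False)], 100),  # assumed exclusion clause
]
assignment = greedy_assignment(variables, clauses)
print(assignment, satisfied_weight(clauses, assignment))
```

On this toy input the greedy choice reaches weight 103 while the optimum is 110, which illustrates that only an approximation is guaranteed.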
  • 19. Weighted MAX SAT in SOFIE
    Functional MAX SAT
    Specialized reasoning (support for functional properties)
    Approximation guarantee 1/2
    Propagates dominating unit clauses
    Considers only unit clauses
    A v B [w1]
    A v B [w2]
    B v C [w3]
    C [w4]
    A v B [10]
    A [10]
    A [30]
    A = true
    30 > 10+10
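The "30 > 10+10" example can be sketched in Python under two assumptions that the slide text itself does not confirm: that a unit clause counts as dominating when its weight exceeds the total weight of all clauses containing the opposite literal, and that the two weight-10 clauses in the example contain ¬A. The function name `dominating_units` is likewise invented for the sketch.

```python
def dominating_units(clauses):
    """Return {variable: value} for unit clauses that outweigh all opposing clauses."""
    forced = {}
    for literals, weight in clauses:
        if len(literals) != 1:
            continue
        var, polarity = literals[0]
        # Total weight of the clauses that would prefer the opposite value.
        opposing = sum(w for lits, w in clauses if (var, not polarity) in lits)
        if weight > opposing:
            forced[var] = polarity
    return forced


# Clauses as (literals, weight); polarity False means negated.
clauses = [
    ([("A", False), ("B", True)], 10),  # assumed: not A or B
    ([("A", False)], 10),               # assumed: not A
    ([("A", True)], 30),                # A
]
forced = dominating_units(clauses)
print(forced)
```

Setting A to true can cost at most 10 + 10 in opposing clauses while it gains 30, so fixing A early cannot hurt the total in this example.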
  • 20. Controlled experiment
    Corpus from Wikipedia infoboxes
    100 articles
    Semantics is known!
  • 21. Controlled experiment
    Large-scale: Corpus from Wikipedia articles
    2000 articles
    13 frequent relations from YAGO
    Parsing = 87 min, Reasoning = 77 min
  • 22. Unstructured text sources
    150 newspaper articles
    relation under test: headquarterOf (functional relation)
    YAGO (modified with relation seeds)
    Parsing 87 min, WeightedMaxSat 77 min
    disambiguated entries (provenance) could be manually assessed
  • 23. Unstructured text sources
    Large-scale:
    10 biographies for each of 400 US senators
    5 relationships
    Disambiguation was not ideal for YAGO (13 James Watson)
    Parsing 7 h, W-MAX-SAT 9 h
    Results
    4 good
    1 bad (misleading patterns)
  • 24. MAX SAT can’t do OWL per se (Open World Assumption)
    Reformulate OWL in propositional logic
    OWL → FOL → Skolem Normal Form → Propositional Logic
    Might find OWL-inconsistent ontologies due to the Open World Assumption
    define a student as a subclass “attends some course”
    ⇒ ∀ x, ∃ y: attends(x,y), Course(y) -> Student(x)
    ⇒ ∀ x: attends(x,k), Course(k) -> Student(x); ∃ k
    ⇒ ¬attends(xi, ki) or ¬Course(ki) or Student(xi); k = x1 .. xn
    Inferred Ontology
    { Student(alex), Student(bob),
    Student subClassOf attends some Course,
    attends(alex, SemanticWeb) }
    Details JMC 2010
  • 25. Conclusions
    Ontology-based IE (OBIE) reformulated as weighted MAX SAT problem
    Approximation algorithm with guarantee 1/2
    Works and scales (large corpus + YAGO)
  • 26. Limitations
    Specialized approximation algorithm
    Accounts for SOFIE rules NOT OWL
    MAX SAT Restrictions
    ∈ Propositional Logic
    ∉ First-Order Logic
    Ontology population approach (can’t infer new relations)
  • 27. References
    F. Suchanek et al., SOFIE: A Self-Organizing Framework for Information Extraction, Proceedings of the 18th International Conference on World Wide Web (WWW '09), 2009
    John McCrae, Automatic Extraction of Logically Consistent Ontologies from Text, PhD thesis, National Institute of Informatics, Japan, 2009
    Uwe Schöning, Das SAT-Problem, Informatik Spektrum 33(5): 479-483, 2010
    F. Suchanek, Automated Construction and Growth of a Large Ontology, PhD thesis, Saarland University, Saarbrücken, Germany, 2009