Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Lodifier: Generating Linked Data from Unstructured Text

963 views

Published on

Slides of my ESWC 2012 talk presenting our paper "Lodifier: Generating Linked Data from Unstructured Text"

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Lodifier: Generating Linked Data from Unstructured Text

  1. 1. LODifier: Generating Linked Data from Unstructured Text Isabelle Augenstein1 Sebastian Pad´o1 Sebastian Rudolph2 1Department of Computational Linguistics, Universit¨at Heidelberg, Germany 2Institute AIFB, Karlsruhe Institute of Technology, Germany {augenste,pado}@cl.uni-heidelberg.de, rudolph@kit.edu Extended Semantic Web Conference, 2012 Augenstein, Pad´o, Rudolph LODifier ESWC 2012 1 / 28
  2. 2. Outline 1 Motivation 2 The LODifier Architecture and Workflow 3 Evaluation in a Document Similarity Task 4 Conclusion Augenstein, Pad´o, Rudolph LODifier ESWC 2012 2 / 28
  3. 3. Introduction Introduction What? Represent information from NL text as Linked Data Creation of graph-based representation for unstructured plain text Why? Task is easy for structured text, but quite challenging for unstructured text Most other approaches are very selective, e.g., extract only pre-defined relations Our approach translates the text in its entirety Possible use cases: Domain-independent scenarios in which no pre-defined schema is available Augenstein, Pad´o, Rudolph LODifier ESWC 2012 3 / 28
  4. 4. Introduction Introduction How? Use noise-tolerant NLP pipeline to analyze text Translate result into RDF representation Embed output in the LOD cloud by using vocabulary from DBPedia and Wordnet Augenstein, Pad´o, Rudolph LODifier ESWC 2012 4 / 28
  5. 5. System Overview Pipeline http: //ww w.w3 .org/ RDF /icon s/rdf _flye r.svg 1 of 1 08.12 .11 23:48 lemmatized text NE-marked text parsed text word sense URIs parsed text DRS tokenized text DBpedia URIs LOD-linked RDF unstructured text Named Entity Recognition (Wikifier) Parsing (C&C) Deep Semantic Analysis (Boxer) Lemmatization Word Sense Disambiguation (UKB) Tokenization RDF Generation and Integration Augenstein, Pad´o, Rudolph LODifier ESWC 2012 5 / 28
  6. 6. System NLP analysis steps NLP analysis steps (NER) Named Entity Recognition (Wikifier 1): Named entities are recognized using Wikifier Wikifier finds Wikipedia links for named entities Wikipedia URLs are converted to DBPedia URIs The New York Times reported that John McCarthy died. He invented the programming language LISP. Figure: Test sentences [[The New York Times]] reported that [[John McCarthy (computer scientist)|John McCarthy]] died. He invented the [[Programming language|programming language]] [[Lisp (programming language)| Lisp]]. Figure: Wikifier output for the test sentences. Augenstein, Pad´o, Rudolph LODifier ESWC 2012 6 / 28
  7. 7. System NLP analysis steps NLP analysis steps (WSD) Word Sense Disambiguation (UKB 2)): Words are disambiguated using UKB, an unsupervised graph-based WSD tool UKB finds WordNet links for word senses UKB output is converted to RDF WordNet URIs ctx_0 w1 00965687-v !! report ctx_0 w4 00358431-v !! die ctx_1 w7 01634424-v !! invent ctx_1 w9 06898352-n !! programming_language ctx_1 w10 06901936-n !! lisp Figure: UKB output for the test sentences. 2 http://ixa2.si.ehu.es/ukb/ Augenstein, Pad´o, Rudolph LODifier ESWC 2012 7 / 28
  8. 8. System NLP analysis steps NLP analysis steps (Parsing and Semantic Analysis) Parsing (C&C 3): Sentences are parsed with the C&C parser The C&C parser is a statistical parser that uses the combined categorial grammar (CCG) CCG is a semantically motivated grammar, making it suitable for further semantic analysis Deep Semantic Analysis (Boxer 4): Boxer builds on the output of C&C and produces discourse representation structures (DRSs) DRSs model the meaning of texts in terms of the relevant entities (discourse referents) and relations between them (conditions) 3 http://svn.ask.it.usyd.edu.au/trac/candc/ 4 http://svn.ask.it.usyd.edu.au/trac/candc/wiki/boxer Augenstein, Pad´o, Rudolph LODifier ESWC 2012 8 / 28
  9. 9. System NLP analysis steps ______________________________ _______________________ | x0 x1 x2 x3 | | x4 x5 x6 | |..............................| |.......................| (| male(x0) |+| event(x4) |) | named(x0,john_mccarthy,per) | | invent(x4) | | programming_language(x1) | | agent(x4,x0) | | nn(x1,x2) | | patient(x4,x2) | | named(x2,lisp,nam) | | event(x5) | | named(x3,new_york_times,org) | | report(x5) | |______________________________| | agent(x5,x3) | | theme(x5,x6) | | proposition(x6) | | ______________ | | | x7 | | | x6:|..............| | | | event(x7) | | | | die(x7) | | | | agent(x7,x0) | | | |______________| | |_______________________| Figure: Boxer output for the example sentences Augenstein, Pad´o, Rudolph LODifier ESWC 2012 9 / 28
  10. 10. System NLP analysis steps _:var0x0 drsclass:named ne:john_mccarthy ; rdf:type drsclass:male , foaf:Person ; owl:sameAs dbpedia:John_McCarthy_(computer_scientist) . _:var0x1 rdf:type class:programming_language ; owl:sameAs dbpedia:Programming_language . _:var0x2 drsrel:nn _:var0x1 . _:var0x2 drsclass:named ne:lisp ; owl:sameAs dbpedia:Lisp_(programming_language) . _:var0x3 drsclass:named ne:the_new_york_times ; owl:sameAs dbpedia:The_New_York_Times . _:var0x4 rdf:type drsclass:event , wn30:wordsense-invent-verb-2 . drsrel:agent _:var0x0 ; drsrel:patient _:var0x2 . _:var0x5 rdf:type drsclass:event , wn30:wordsense-report-verb-3 ; drsrel:agent _:var0x3 ; drsrel:theme _:var0x6 . _:var0x6 rdf:type drsclass:proposition , reify:proposition , reify:conjunction ; reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate rdf:type ; rdf:object drsclass:event . ] reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate rdf:type ; rdf:object wn30:wordsense-die-verb-1 . ] reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate drsrel:agent ; rdf:object _:var0x0 . ] Figure: LODifier output for the test sentences. Augenstein, Pad´o, Rudolph LODifier ESWC 2012 10 / 28
  11. 11. System NLP analysis steps _:var0x0 drsclass:named ne:john_mccarthy ; rdf:type drsclass:male , foaf:Person ; owl:sameAs dbpedia:John_McCarthy_(computer_scientist) . _:var0x1 rdf:type class:programming_language ; owl:sameAs dbpedia:Programming_language . _:var0x2 drsrel:nn _:var0x1 . _:var0x2 drsclass:named ne:lisp ; owl:sameAs dbpedia:Lisp_(programming_language) . _:var0x3 drsclass:named ne:the_new_york_times ; owl:sameAs dbpedia:The_New_York_Times . _:var0x4 rdf:type drsclass:event , wn30:wordsense-invent-verb-2 . drsrel:agent _:var0x0 ; drsrel:patient _:var0x2 . _:var0x5 rdf:type drsclass:event , wn30:wordsense-report-verb-3 ; drsrel:agent _:var0x3 ; drsrel:theme _:var0x6 . _:var0x6 rdf:type drsclass:proposition , reify:proposition , reify:conjunction ; reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate rdf:type ; rdf:object drsclass:event . ] reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate rdf:type ; rdf:object wn30:wordsense-die-verb-1 . ] reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate drsrel:agent ; rdf:object _:var0x0 . ] Figure: LODifier output for the test sentences. Augenstein, Pad´o, Rudolph LODifier ESWC 2012 11 / 28
  12. 12. System NLP analysis steps _:var0x0 drsclass:named ne:john_mccarthy ; rdf:type drsclass:male , foaf:Person ; owl:sameAs dbpedia:John_McCarthy_(computer_scientist) . _:var0x1 rdf:type class:programming_language ; owl:sameAs dbpedia:Programming_language . _:var0x2 drsrel:nn _:var0x1 . _:var0x2 drsclass:named ne:lisp ; owl:sameAs dbpedia:Lisp_(programming_language) . _:var0x3 drsclass:named ne:the_new_york_times ; owl:sameAs dbpedia:The_New_York_Times . _:var0x4 rdf:type drsclass:event , wn30:wordsense-invent-verb-2 . drsrel:agent _:var0x0 ; drsrel:patient _:var0x2 . _:var0x5 rdf:type drsclass:event , wn30:wordsense-report-verb-3 ; drsrel:agent _:var0x3 ; drsrel:theme _:var0x6 . _:var0x6 rdf:type drsclass:proposition , reify:proposition , reify:conjunction ; reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate rdf:type ; rdf:object drsclass:event . ] reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate rdf:type ; rdf:object wn30:wordsense-die-verb-1 . ] reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate drsrel:agent ; rdf:object _:var0x0 . ] Figure: LODifier output for the test sentences. Augenstein, Pad´o, Rudolph LODifier ESWC 2012 12 / 28
  13. 13. System NLP analysis steps _:var0x0 drsclass:named ne:john_mccarthy ; rdf:type drsclass:male , foaf:Person ; owl:sameAs dbpedia:John_McCarthy_(computer_scientist) . _:var0x1 rdf:type class:programming_language ; owl:sameAs dbpedia:Programming_language . _:var0x2 drsrel:nn _:var0x1 . _:var0x2 drsclass:named ne:lisp ; owl:sameAs dbpedia:Lisp_(programming_language) . _:var0x3 drsclass:named ne:the_new_york_times ; owl:sameAs dbpedia:The_New_York_Times . _:var0x4 rdf:type drsclass:event , wn30:wordsense-invent-verb-2 . drsrel:agent _:var0x0 ; drsrel:patient _:var0x2 . _:var0x5 rdf:type drsclass:event , wn30:wordsense-report-verb-3 ; drsrel:agent _:var0x3 ; drsrel:theme _:var0x6 . _:var0x6 rdf:type drsclass:proposition , reify:proposition , reify:conjunction ; reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate rdf:type ; rdf:object drsclass:event . ] reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate rdf:type ; rdf:object wn30:wordsense-die-verb-1 . ] reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate drsrel:agent ; rdf:object _:var0x0 . ] Figure: LODifier output for the test sentences. Augenstein, Pad´o, Rudolph LODifier ESWC 2012 13 / 28
  14. 14. System NLP analysis steps _:var0x0 drsclass:named ne:john_mccarthy ; rdf:type drsclass:male , foaf:Person ; owl:sameAs dbpedia:John_McCarthy_(computer_scientist) . _:var0x1 rdf:type class:programming_language ; owl:sameAs dbpedia:Programming_language . _:var0x2 drsrel:nn _:var0x1 . _:var0x2 drsclass:named ne:lisp ; owl:sameAs dbpedia:Lisp_(programming_language) . _:var0x3 drsclass:named ne:the_new_york_times ; owl:sameAs dbpedia:The_New_York_Times . _:var0x4 rdf:type drsclass:event , wn30:wordsense-invent-verb-2 . drsrel:agent _:var0x0 ; drsrel:patient _:var0x2 . _:var0x5 rdf:type drsclass:event , wn30:wordsense-report-verb-3 ; drsrel:agent _:var0x3 ; drsrel:theme _:var0x6 . _:var0x6 rdf:type drsclass:proposition , reify:proposition , reify:conjunction ; reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate rdf:type ; rdf:object drsclass:event . ] reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate rdf:type ; rdf:object wn30:wordsense-die-verb-1 . ] reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate drsrel:agent ; rdf:object _:var0x0 . ] Figure: LODifier output for the test sentences. Augenstein, Pad´o, Rudolph LODifier ESWC 2012 14 / 28
  15. 15. System NLP analysis steps _:var0x0 drsclass:named ne:john_mccarthy ; rdf:type drsclass:male , foaf:Person ; owl:sameAs dbpedia:John_McCarthy_(computer_scientist) . _:var0x1 rdf:type class:programming_language ; owl:sameAs dbpedia:Programming_language . _:var0x2 drsrel:nn _:var0x1 . _:var0x2 drsclass:named ne:lisp ; owl:sameAs dbpedia:Lisp_(programming_language) . _:var0x3 drsclass:named ne:the_new_york_times ; owl:sameAs dbpedia:The_New_York_Times . _:var0x4 rdf:type drsclass:event , wn30:wordsense-invent-verb-2 . drsrel:agent _:var0x0 ; drsrel:patient _:var0x2 . _:var0x5 rdf:type drsclass:event , wn30:wordsense-report-verb-3 ; drsrel:agent _:var0x3 ; drsrel:theme _:var0x6 . _:var0x6 rdf:type drsclass:proposition , reify:proposition , reify:conjunction ; reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate rdf:type ; rdf:object drsclass:event . ] reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate rdf:type ; rdf:object wn30:wordsense-die-verb-1 . ] reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate drsrel:agent ; rdf:object _:var0x0 . ] Figure: LODifier output for the test sentences. Augenstein, Pad´o, Rudolph LODifier ESWC 2012 15 / 28
  16. 16. System NLP analysis steps _:var0x0 drsclass:named ne:john_mccarthy ; rdf:type drsclass:male , foaf:Person ; owl:sameAs dbpedia:John_McCarthy_(computer_scientist) . _:var0x1 rdf:type class:programming_language ; owl:sameAs dbpedia:Programming_language . _:var0x2 drsrel:nn _:var0x1 . _:var0x2 drsclass:named ne:lisp ; owl:sameAs dbpedia:Lisp_(programming_language) . _:var0x3 drsclass:named ne:the_new_york_times ; owl:sameAs dbpedia:The_New_York_Times . _:var0x4 rdf:type drsclass:event , wn30:wordsense-invent-verb-2 . drsrel:agent _:var0x0 ; drsrel:patient _:var0x2 . _:var0x5 rdf:type drsclass:event , wn30:wordsense-report-verb-3 ; drsrel:agent _:var0x3 ; drsrel:theme _:var0x6 . _:var0x6 rdf:type drsclass:proposition , reify:proposition , reify:conjunction ; reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate rdf:type ; rdf:object drsclass:event . ] reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate rdf:type ; rdf:object wn30:wordsense-die-verb-1 . ] reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate drsrel:agent ; rdf:object _:var0x0 . ] Figure: LODifier output for the test sentences. Augenstein, Pad´o, Rudolph LODifier ESWC 2012 16 / 28
  17. 17. System NLP analysis steps _:var0x0 drsclass:named ne:john_mccarthy ; rdf:type drsclass:male , foaf:Person ; owl:sameAs dbpedia:John_McCarthy_(computer_scientist) . _:var0x1 rdf:type class:programming_language ; owl:sameAs dbpedia:Programming_language . _:var0x2 drsrel:nn _:var0x1 . _:var0x2 drsclass:named ne:lisp ; owl:sameAs dbpedia:Lisp_(programming_language) . _:var0x3 drsclass:named ne:the_new_york_times ; owl:sameAs dbpedia:The_New_York_Times . _:var0x4 rdf:type drsclass:event , wn30:wordsense-invent-verb-2 . drsrel:agent _:var0x0 ; drsrel:patient _:var0x2 . _:var0x5 rdf:type drsclass:event , wn30:wordsense-report-verb-3 ; drsrel:agent _:var0x3 ; drsrel:theme _:var0x6 . _:var0x6 rdf:type drsclass:proposition , reify:proposition , reify:conjunction ; reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate rdf:type ; rdf:object drsclass:event . ] reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate rdf:type ; rdf:object wn30:wordsense-die-verb-1 . ] reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate drsrel:agent ; rdf:object _:var0x0 . ] Figure: LODifier output for the test sentences. Augenstein, Pad´o, Rudolph LODifier ESWC 2012 17 / 28
  18. 18. System NLP analysis steps _:var0x0 drsclass:named ne:john_mccarthy ; rdf:type drsclass:male , foaf:Person ; owl:sameAs dbpedia:John_McCarthy_(computer_scientist) . _:var0x1 rdf:type class:programming_language ; owl:sameAs dbpedia:Programming_language . _:var0x2 drsrel:nn _:var0x1 . _:var0x2 drsclass:named ne:lisp ; owl:sameAs dbpedia:Lisp_(programming_language) . _:var0x3 drsclass:named ne:the_new_york_times ; owl:sameAs dbpedia:The_New_York_Times . _:var0x4 rdf:type drsclass:event , wn30:wordsense-invent-verb-2 . drsrel:agent _:var0x0 ; drsrel:patient _:var0x2 . _:var0x5 rdf:type drsclass:event , wn30:wordsense-report-verb-3 ; drsrel:agent _:var0x3 ; drsrel:theme _:var0x6 . _:var0x6 rdf:type drsclass:proposition , reify:proposition , reify:conjunction ; reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate rdf:type ; rdf:object drsclass:event . ] reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate rdf:type ; rdf:object wn30:wordsense-die-verb-1 . ] reify:conjunct [ rdf:subject _:var0x7 ; rdf:predicate drsrel:agent ; rdf:object _:var0x0 . ] Figure: LODifier output for the test sentences. Augenstein, Pad´o, Rudolph LODifier ESWC 2012 18 / 28
  19. 19. Evaluation Evaluation overview Evaluation overview Task: Assess document similarity Story link detection task, part of topic detection and tracking (TDT) family of tasks Data: TDT-2 benchmark dataset 5 Consists of newspaper, TV and radio news in English, Arabic and Mandarin Subset: English language, only newspaper articles 183 positive and negative document pairs 5 http://projects.ldc.upenn.edu/TDT2/ Augenstein, Pad´o, Rudolph LODifier ESWC 2012 19 / 28
  20. 20. Evaluation Evaluation overview Evaluation overview Method: Define various document similarity measures Measures without structural information Measures with structural information Documents are predicted to have the same topic if they have a similarity of θ or more θ is determined with supervised learning Augenstein, Pad´o, Rudolph LODifier ESWC 2012 20 / 28
  21. 21. Evaluation Evaluation method Evaluation method Similarity measures without structure: Random baseline Bag-of-words baseline: Word overlap between the two documents in a document pair Bag-of-URI baseline: URI overlap between two RDF documents Augenstein, Pad´o, Rudolph LODifier ESWC 2012 21 / 28
  22. 22. Evaluation Evaluation method Evaluation method URI class variants: Three variants for URI baseline: Take the relative importance of various URI classes into account Variant 1: All NEs identified by Wikifier, all words successfully disambiguated by UKB (namespaces dbpedia:, wn30:) Variant 2: Adds all NEs recognized by Boxer (namespace ne:) Variant 3: Further adds the URIs for all words that were not recognized by Wikifier, UKB or Boxer (namespace class:) Augenstein, Pad´o, Rudolph LODifier ESWC 2012 22 / 28
  23. 23. Evaluation Evaluation method Evaluation method Extended setting: Aims at drawing more information from LOD cloud into the similarity computation The generated RDF graph is enriched by: DBpedia categories (namespace dbpedia:) WordNet synsets and WordNet senses related to the URIs in the generated graph (namespace wn30:) Augenstein, Pad´o, Rudolph LODifier ESWC 2012 23 / 28
  24. 24. Evaluation Evaluation method Evaluation method Similarity measures with structure: Full-fledged graph similarity measures, e.g. based on isomorphic subgraphs, are infeasible due to the size of the RDF graphs Our similarity measures are based on the shortest paths between relevant nodes Our intuition is that short paths between URI pairs denote relevant semantic information and we expect they are related by a short path in other documents as well Augenstein, Pad´o, Rudolph LODifier ESWC 2012 24 / 28
  25. 25. Evaluation Evaluation method Evaluation method Similarity measures with structure: proSimk,Rel,f (G1, G2) = a,b∈Rel(G1) a,b ∈Ck (G1)∩Ck (G2) f ( (a, b)) a,b∈Rel(G1) a,b ∈Ck (G1) f ( (a, b)) k: path length, Rel: semantic relations, f : similarity function proSimcnt: Uses f ( ) = 1, i.e., just counts the number of paths irrespective of their length proSimlen: Uses f ( ) = 1/ , giving less weight to longer paths proSimsqlen: Uses f ( ) = 1/ √ , discounting long paths less aggressively than proSimlen Augenstein, Pad´o, Rudolph LODifier ESWC 2012 25 / 28
  26. 26. Evaluation Results Results Model normal extended Similarity measures without structure Random Baseline 50.0 – Bag of Words 63.0 – Bag of URIs (Variant 3) 73.4 76.4 Similarity measures with structure proSimcnt (k=6, Variant 1) 77.7 77.6 proSimcnt (k=6, Variant 2) 79.2 79.0 proSimcnt (k=8, Variant 3) 82.1 81.9 proSimlen (k=6, Variant 3) 81.5 81.4 proSimsqlen (k=8, Variant 3) 81.1 80.9 Augenstein, Pad´o, Rudolph LODifier ESWC 2012 26 / 28
  27. 27. Summary Summary Main contributions: Service for the LOD community that can be used in different scenarios that can profit from structural representation of text Combination of well-established concepts and systems in a deep semantic pipeline Linking existing NLP technologies to the LOD cloud Evaluation of LODifier in a topic link detection task gives clear evidence of its potential Possible future work directions: Instead of a pipeline approach, simultaneously consider information from the NLP components to allow interaction between them Test LODifier in different scenarios Augenstein, Pad´o, Rudolph LODifier ESWC 2012 27 / 28
  28. 28. Summary Thank you! Augenstein, Pad´o, Rudolph LODifier ESWC 2012 28 / 28

×