Extracting Multilingual Natural-Language Patterns for RDF Predicates

Most knowledge sources on the Data Web were extracted from structured or semi-structured data. Thus, they cover only a small fraction of the information available on the document-oriented Web. In this paper, we present BOA, a bootstrapping strategy for extracting RDF from text. The idea behind BOA is to use background knowledge from the Data Web to extract, from unstructured data, natural-language patterns that represent predicates found on the Data Web. These patterns are then used to extract instance knowledge from natural-language text. This knowledge is finally fed back into the Data Web, thereby closing the loop. The approach followed by BOA is largely independent of the language in which the corpus is written. We demonstrate our approach by applying it to four different corpora and two different languages. We evaluate BOA on these data sets using DBpedia as background knowledge. Our results show that we can extract several thousand new facts in one iteration with very high accuracy.

Extracting Multilingual Natural-Language Patterns for RDF Predicates

  1. Daniel Gerber, Axel-Cyrille Ngonga Ngomo (AKSW, Universität Leipzig). BOA: Extracting Multilingual Natural-Language Patterns for RDF Predicates
  2. Bootstrapping the Data Web: Motivation
     ๏ Most knowledge bases are extracted from (semi-)structured data
     ๏ Only 15-20 % of information in structured data
     ๏ Semantic Web ⬌ Document Web
     ๏ How can we extract data from the document-oriented web?
     EKAW - 10.10.2012 - http://boa.aksw.org
  3. Idea I
     dbpedia:Barack_Obama dbpedia-owl:birthPlace dbpedia:Honolulu,_Hawaii
     dbpedia:Barack_Obama dbpedia-owl:spouse dbpedia:Michelle_Obama
     dbpedia:Barack_Obama dbpedia-owl:party dbpedia:Democratic_Party
  4. Idea II
     Barack Obama was born in Honolulu, Hawaii.
     Barack Hussein Obama is a politician of the Democratic Party. (highlighted: "is a politician of the")
     Obama met married Michelle Robinson in 1992.
  5. Idea III
     married: Jackie Bouvier Kennedy Onassis who married John F. Kennedy was tied to the Auchinclosses via her sister's marriage into the Auchincloss family.
     is a politician of the: Joseph Martin "Joschka" Fischer (born 1948-04-12) is a politician of the German Green Party.
     was born in: Dietrich's only child, Maria Elisabeth Sieber, was born in Berlin on 13 December 1924.
  6. The BOA approach
     [Architecture diagram with eight numbered steps. Labeled components: Data Web, Web, Corpus Extraction Module (Crawler, Indexer, Cleaner), Corpora, SPARQL, Surface forms, Search & Filter, Patterns, Feature Extraction, Neural Network, Filter, Generation.]
  7. Pattern Search
     (1) Take the set of entity pairs s and o connected through a predicate p
     (2) Find all sentences which contain the labels of both s and o
     (3) Replace the labels with variables (?D?, ?R?)
     BOA pattern: “?D? with his wife ?R?”
     BOA pattern mapping: dbpedia-owl:spouse ↣ “?D? with his wife ?R?”, “?D? and her husband ?R?”, “?D? and his wife ?R?”
  8. Feature Extraction - Language Independent
     subsidiary ↣ “?Company was acquired by ?Company”
     ๏ Support: pattern should be used across several triples
       Google - DoubleClick: 2; General Motors - Opel: 1; Cablevision - Rainbow Media: 4
     ๏ Specificity: pattern should not be used by many pattern mappings
       subsidiary: “?R? is a part of ?D?”; foundationOrg: “?R? is a part of ?D?”
     ๏ Typicity: pattern should be used to connect entities of correct type
       Hypercom_ORG was_O acquired_O by_O Verifone_ORG ._O
  9. Feature Extraction - Language Dependent
     Intrinsic Information Content Metric
       dbpedia:subsidiary rdfs:label “subsidiary”@en
       WordNet
       ?D? was acquired by ?R?
     ReVerb
       ๏ Open Information Extraction
       ๏ Patterns need to abide by a POS tag sequence
       ๏ Logistic Regression Classifier
  10. BOA Neural Network
      ๏ 200 patterns are manually classified as good (1) or bad (0)
      ๏ Up to 18 ReVerb features, depending on language
      [Network diagram: input layer, hidden layer, output layer; value ranges labeled [0,1]; input features shown include Specificity, IICM, Typicity.]
  11. RDF Generation
      Pattern: ?D? with his wife ?R?
      Sentence: Pacheco arrived with his wife Leyla Rodriguez Stahl and several...
      Tagged: Pacheco_PER arrived_O with_O his_O wife_O Leyla_PER Rodriguez_PER Stahl_PER and_O
      Generated graph (the slide marks four of its elements as NEW):
        dbpedia:Abel_Pacheco dbpedia-owl:spouse boa:Leyla_Rodriguez_Stahl
        dbpedia:Abel_Pacheco rdf:type dbpedia-owl:Person
        dbpedia:Abel_Pacheco rdfs:label “Abel Pacheco”@en
        boa:Leyla_Rodriguez_Stahl rdf:type dbpedia-owl:Person
        boa:Leyla_Rodriguez_Stahl rdfs:label “Leyla Rodriguez Stahl”@en
  12. Evaluation I
                                 en-wiki            en-news   de-wiki            de-news
      Language                   English            English   German             German
      Topic                      general knowledge  news      general knowledge  news
      # of sentences             58M                214.2M    24.6M              112.8M
      # of tokens per sentence   21.4               22.1      17.4               18.3
  13. Evaluation II
                              en-wiki   en-news   de-wiki   de-news
      # of pattern mappings   125       44        66        19
      # of patterns           9551      586       7366      109
      # of new triples        78944     22883     10138     883
      # of known triples      1829      798       655       42
      # of found triples      80773     3081      10793     925
      Precision Top-100       92 %      70 %      91 %      74 %
  14. Conclusion
      ๏ No manually created seed patterns needed
      ๏ > 90% top-100 precision on the German and English Wikipedia corpora
      ๏ High recall through surface forms
      ๏ Output easily integrable into the LOD Cloud
      ๏ Library of natural-language representations of formal relations, Demo
  15. BOA Graphical User Interface
      http://boa.aksw.org
  16. Thank you! Questions?
      Daniel Gerber, Augustusplatz 10, Room P616, 04109 Leipzig, Germany
      SIMBA@AKSW
      http://bis.informatik.uni-leipzig.de/DanielGerber
      http://boa.aksw.org
      http://code.google.com/p/boa
      LOD2 Presentation . 02.09.2010 . http://lod2.eu
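
The Pattern Search and RDF Generation slides can be illustrated with a minimal Python sketch. It is not the BOA implementation: the function names, the sample sentence about Barack Obama, and the rule "take the nearest tagged entity on each side of the pattern words" are assumptions made for illustration. The tagged Pacheco sentence and the patterns themselves come from the slides.

```python
from collections import Counter


def search_patterns(sentences, pairs):
    """Steps (2) and (3) of Pattern Search: find sentences containing both
    labels and replace them with the ?D? (subject) and ?R? (object) variables."""
    patterns = Counter()
    for sentence in sentences:
        for s_label, o_label in pairs:
            s_pos, o_pos = sentence.find(s_label), sentence.find(o_label)
            if s_pos == -1 or o_pos == -1:
                continue
            # Order the two mentions by their position in the sentence.
            first, second = sorted([(s_pos, s_label, "?D?"), (o_pos, o_label, "?R?")])
            start = first[0] + len(first[1])
            between = sentence[start:second[0]]
            if start > second[0] or not between.strip():
                continue  # overlapping or adjacent labels: no usable pattern
            patterns[first[2] + between + second[2]] += 1
    return patterns


def parse_tagged(tagged):
    """Turn 'Pacheco_PER arrived_O ...' into a list of (token, tag) pairs."""
    return [tuple(item.rsplit("_", 1)) for item in tagged.split()]


def entities(tokens):
    """Group consecutive tokens sharing a non-O tag into (start, end, text) spans."""
    spans, i = [], 0
    while i < len(tokens):
        tag, j = tokens[i][1], i
        while j < len(tokens) and tokens[j][1] == tag:
            j += 1
        if tag != "O":
            spans.append((i, j, " ".join(tok for tok, _ in tokens[i:j])))
        i = j
    return spans


def apply_pattern(pattern, tagged):
    """Locate the pattern's words and return the nearest entity on each side.
    Assumption: the pattern has the form '?D? ... ?R?' (subject on the left)."""
    tokens = parse_tagged(tagged)
    words = [tok for tok, _ in tokens]
    middle = pattern.replace("?D?", "").replace("?R?", "").split()
    spans = entities(tokens)
    for i in range(len(words) - len(middle) + 1):
        if words[i:i + len(middle)] != middle:
            continue
        left = [text for _, end, text in spans if end <= i]
        right = [text for start, _, text in spans if start >= i + len(middle)]
        if left and right:
            return left[-1], right[0]
    return None


# Hypothetical corpus sentence and background-knowledge pair for dbpedia-owl:spouse.
corpus = ["Barack Obama and his wife Michelle Obama visited Berlin."]
spouse_pairs = [("Barack Obama", "Michelle Obama")]
print(dict(search_patterns(corpus, spouse_pairs)))

# Tagged sentence from the RDF Generation slide.
tagged = ("Pacheco_PER arrived_O with_O his_O wife_O "
          "Leyla_PER Rodriguez_PER Stahl_PER and_O")
print(apply_pattern("?D? with his wife ?R?", tagged))
# ('Pacheco', 'Leyla Rodriguez Stahl')
```

The extracted pair still has to be mapped to resources such as dbpedia:Abel_Pacheco before triples can be written, which the slides attribute to surface forms; that step is left out here.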

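The figures on the Evaluation II slide can be cross-checked with a few lines of Python. The check assumes that the number of found triples is the sum of new and known triples; the slide does not state this relation, so it is an assumption, and the function name is made up for this sketch.

```python
# Figures as printed on the "Evaluation II" slide.
RESULTS = {
    "en-wiki": {"new": 78944, "known": 1829, "found": 80773},
    "en-news": {"new": 22883, "known": 798, "found": 3081},
    "de-wiki": {"new": 10138, "known": 655, "found": 10793},
    "de-news": {"new": 883, "known": 42, "found": 925},
}


def inconsistent_corpora(results):
    """Return the corpora for which new + known differs from found
    (assumed relation, not stated on the slide)."""
    return [
        name
        for name, row in results.items()
        if row["new"] + row["known"] != row["found"]
    ]


print(inconsistent_corpora(RESULTS))  # ['en-news']
```

Under that assumption, three of the four columns add up and the en-news column does not, which suggests a transcription error in one of its figures. The transcript alone does not show which figure is affected.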