BOA - Bootstrapping Linked Data

Most knowledge sources on the Data Web were extracted from structured or semi-structured data. Thus, they encompass only a small fraction of the information available on the document-oriented Web. In this paper, we present BOA, an iterative bootstrapping strategy for extracting RDF from unstructured data. The idea behind BOA is to use the Data Web as background knowledge for the extraction of natural language patterns that represent predicates found on the Data Web. These patterns are used to extract instance knowledge from natural language text. This knowledge is finally fed back into the Data Web, thereby closing the loop. We evaluate our approach on two data sets using DBpedia as background knowledge. Our results show that we can extract several thousand new facts in one iteration with very high accuracy. Moreover, we provide the first repository of natural language representations of predicates found on the Data Web.
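
The loop described above can be illustrated with a minimal Python sketch. It is not the BOA implementation: the toy triples, the toy corpus, the function names (find_patterns, apply_pattern, bootstrap_once) and the capitalized-word regex standing in for entity recognition are all assumptions made for illustration. The ?D? and ?R? placeholders mark the subject and object positions of a pattern.

```python
import re

# Hypothetical background knowledge: (subject label, predicate, object label).
KNOWN = [("Barack Obama", "dbpedia-owl:birthPlace", "Honolulu")]

# Hypothetical toy corpus.
CORPUS = [
    "Barack Obama was born in Honolulu.",
    "Maria Elisabeth Sieber was born in Berlin.",
]

# Crude stand-in for named entity recognition: runs of capitalized words.
ENTITY = r"[A-Z]\w*(?: [A-Z]\w*)*"


def find_patterns(triples, corpus):
    """Find sentences containing both labels and replace them with ?D? / ?R?."""
    patterns = set()
    for s, p, o in triples:
        for sentence in corpus:
            i, j = sentence.find(s), sentence.find(o)
            if i == -1 or j == -1:
                continue  # sentence does not mention both entities
            if i < j:
                patterns.add((p, "?D?" + sentence[i + len(s):j] + "?R?"))
            else:
                patterns.add((p, "?R?" + sentence[j + len(o):i] + "?D?"))
    return patterns


def apply_pattern(predicate, pattern, corpus):
    """Match a pattern against the corpus and return candidate triples."""
    regex = re.escape(pattern)
    regex = regex.replace(re.escape("?D?"), f"(?P<d>{ENTITY})")
    regex = regex.replace(re.escape("?R?"), f"(?P<r>{ENTITY})")
    facts = []
    for sentence in corpus:
        for m in re.finditer(regex, sentence):
            facts.append((m.group("d"), predicate, m.group("r")))
    return facts


def bootstrap_once(triples, corpus):
    """One iteration: learn patterns from known triples, keep only new facts."""
    known = set(triples)
    new_facts = []
    for predicate, pattern in sorted(find_patterns(triples, corpus)):
        for fact in apply_pattern(predicate, pattern, corpus):
            if fact not in known and fact not in new_facts:
                new_facts.append(fact)
    return new_facts


if __name__ == "__main__":
    print(find_patterns(KNOWN, CORPUS))
    print(bootstrap_once(KNOWN, CORPUS))
```

In a further iteration, the returned facts would be appended to the known triples before the next call, which is the "closing the loop" step. The sketch omits pattern scoring and filtering entirely.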

BOA - Bootstrapping Linked Data

  1. Daniel Gerber, Axel-Cyrille Ngonga Ngomo. AKSW, Universität Leipzig
  2. Bootstrapping the Data Web: Motivation
     ๏ Most knowledge bases are extracted from (semi-)structured data
     ๏ Only 15-20% of information is in structured data
     ๏ Semantic Web ⬌ Document Web
     ๏ How can we extract data from the document-oriented Web?
     WeKEx@ISWC - 17.01.2012 - http://boa.aksw.org
  3. Idea I
     dbpedia:Barack_Obama dbpedia-owl:birthPlace dbpedia:Honolulu,_Hawaii
     dbpedia:Barack_Obama dbpedia-owl:spouse dbpedia:Michelle_Obama
     dbpedia:Barack_Obama dbpedia-owl:party dbpedia:Democratic_Party
  4. Idea II
     ๏ Barack Obama was born in Honolulu, Hawaii.
     ๏ Barack Hussein Obama is a politician of the Democratic Party.
     ๏ Obama married Michelle Robinson in 1992.
  5. Idea III
     ๏ “married”: Jackie Bouvier Kennedy Onassis who married John F. Kennedy was tied to the Auchinclosses via her sister's marriage into the Auchincloss family.
     ๏ “is a politician of the”: Joseph Martin "Joschka" Fischer (born 1948-04-12) is a politician of the German Green Party.
     ๏ “was born in”: Dietrich's only child, Maria Elisabeth Sieber, was born in Berlin on 13 December 1924.
  6. Related Work
     ๏ ReadTheWeb Project: N(ever) E(nding) L(anguage) L(earner)
     ๏ PROSPERA: Scalable Knowledge Harvesting with High Precision and High Recall
  7. The BOA approach
     [Figure: the BOA pipeline. Components: Data Web (queried via SPARQL), background knowledge, Web, crawler, cleaner, indexer, corpora, patterns. Steps shown: corpus extraction, knowledge acquisition, pattern search, pattern scoring, filtering, RDF generation. The output is used in the next iteration.]
  8. Knowledge acquisition
     Query template ([a|b] marks alternatives):
       SELECT ?x ?xLabel ?prop ?y ?yLabel ?domain ?range
       WHERE {
         ?x rdf:type dbpedia-owl:[Organisation|Person|Place] .
         ?x rdfs:label ?xLabel .
         ?y rdfs:label ?yLabel .
         [?y ?prop ?x | ?x ?prop ?y] .
         FILTER ( lang(?xLabel) = 'en' && lang(?yLabel) = 'en' ) .
         ?prop rdfs:range ?range .
         ?prop rdfs:domain ?domain .
       }
     Example result:
       ?x      = http://dbpedia.org/resource/Google
       ?xLabel = “Google”
       ?prop   = http://dbpedia.org/ontology/subsidiary
       ?y      = http://dbpedia.org/resource/YouTube
       ?yLabel = “Youtube”
       ?domain = http://dbpedia.org/ontology/Company
       ?range  = http://dbpedia.org/ontology/Company
  9. Pattern Search
     (1) Set of entities s and o connected through p
     (2) Find all sentences which contain s and o
     (3) Replace labels with variables (?D?, ?R?)
     BOA pattern:
       dbpedia-owl:spouse “?D? with his wife ?R?”
     BOA pattern mapping:
       dbpedia-owl:spouse “?D? with his wife ?R?”
       dbpedia-owl:spouse “?D? and her husband ?R?”
       dbpedia-owl:spouse “?D? and his wife ?R?”
  10. Pattern Scoring - Support
      Support: a pattern should be used across several triples in the background knowledge
      subsidiary ↣ “?R? was acquired by ?D?”
      ๏ [Google, DoubleClick] ↣ 2
      ๏ [General Motors, Opel] ↣ 1
      ๏ [Cablevision, Rainbow Media] ↣ 4
  11. Pattern Scoring - Specificity
      Specificity: a pattern should not be used by many pattern mappings
      ๏ subsidiary: “?D? agreed to buy ?R?”
      ๏ subsidiary: “?R? is a part of ?D?”
      ๏ foundationOrganisation: “?R? is a part of ?D?”
  12. Pattern Scoring - Typicity
      Typicity: a pattern should be used to connect entities of the correct type
      ๏ Hypercom was acquired by Verifone.
      ๏ Hypercom_ORG was_O acquired_O by_O Verifone_ORG ._O
      ๏ Maktoob was acquired by Yahoo!
      ๏ Maktoob_PER was_O acquired_O by_O Yahoo_ORG ._O
  13. RDF Generation
      Pattern: ?D? with his wife ?R?
      Sentence: Pacheco arrived with his wife Leyla Rodriguez Stahl and several...
      Tagged: Pacheco_PER arrived_O with_O his_O wife_O Leyla_PER Rodriguez_PER Stahl_PER and_O
      Resulting graph (the slide flags four of its elements as NEW):
        dbpedia:Abel_Pacheco dbpedia-owl:spouse boa:Leyla_Rodriguez_Stahl
        dbpedia:Abel_Pacheco rdf:type dbpedia-owl:Person
        dbpedia:Abel_Pacheco rdfs:label "Abel Pacheco"@en
        boa:Leyla_Rodriguez_Stahl rdf:type dbpedia-owl:Person
        boa:Leyla_Rodriguez_Stahl rdfs:label "Leyla Rodriguez Stahl"@en
  14. Evaluation I
      Corpora:
                     en-wiki             en-news
        Language     english             english
        Topic        general knowledge   news
        # of lines   44.7M               256.1M
        # of words   1,032.1M            5,068.7M
      [Chart: # of triples per property (riverMouth, musicalArtist, musicalBand, award, writer, almaMater, occupation, formerTeam, deathPlace, birthPlace), split into "is subject" and "is object", grouped by Place, Person and Organisation. Values shown: 158697, 551693, 327430, 137990, 72820, 64239.]
  15. Evaluation II
                                 en-wiki                     en-news
                                 LOC      PER      ORG       LOC      PER      ORG
        Triples extracted        1465     8817     2567      488      903      916
        Triples in DBpedia       138      183      48        52       44       7
        Evaluated triples        100 (8)  100 (1)  100 (1)   100 (1)  100 (7)  100 (0)
        Precision (%)            90.5     97       99        61.5     73.5     91
        New true statements*     1200     8375     2494      268      631      827
        Found pattern mappings   62       72       59        49       70       55
        Found patterns           123k     136k     38k       569k     465k     92k
        Scored patterns          1045     612      241       3832     7294     1077
      * Number of extracted statements not found in DBpedia multiplied by the precision of our approach
  16. Future Work
      ๏ Iteration 1+
      ๏ Human feedback
      ๏ Pattern generalization
      ๏ Datatype properties
      ๏ Languages/Corpora
      ๏ Web services
  17. Conclusion
      ๏ No manually created seed patterns needed
      ๏ 95.5% precision on DBpedia/Wikipedia
      ๏ Output easily integrable into the LOD Cloud
      ๏ Library of natural-language representations of formal relations, Demo
      ๏ Quasi language-independent (German/Korean)
  18. Thank you! Questions?
      Daniel Gerber
      Johannisgasse 26, Room 5-21
      04103 Leipzig, Germany
      SIMBA@AKSW
      http://bis.informatik.uni-leipzig.de/DanielGerber
      http://boa.aksw.org
      http://code.google.com/p/boa
      LOD2 Presentation . 02.09.2010 . http://lod2.eu
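
The support and specificity criteria from slides 10 and 11 can be sketched in Python as follows. The formulas are simplified stand-ins chosen for illustration (pair and occurrence counts for support, an inverse-frequency logarithm for specificity) and are not claimed to be the scoring functions BOA uses. The first three observations come from slide 10; the last two entity pairs and the function names are made up.

```python
import math

# (predicate, pattern, (subject, object), occurrences in the corpus)
OBSERVATIONS = [
    ("subsidiary", "?R? was acquired by ?D?", ("Google", "DoubleClick"), 2),
    ("subsidiary", "?R? was acquired by ?D?", ("General Motors", "Opel"), 1),
    ("subsidiary", "?R? was acquired by ?D?", ("Cablevision", "Rainbow Media"), 4),
    # Hypothetical pairs, only there to place one pattern in two mappings.
    ("subsidiary", "?R? is a part of ?D?", ("Parent A", "Unit A"), 1),
    ("foundationOrganisation", "?R? is a part of ?D?", ("Parent B", "Unit B"), 1),
]


def support(observations, predicate, pattern):
    """Return (distinct entity pairs, total occurrences) for one pattern."""
    rows = [(pair, n) for p, pat, pair, n in observations
            if p == predicate and pat == pattern]
    return len({pair for pair, _ in rows}), sum(n for _, n in rows)


def specificity(observations, pattern):
    """Inverse-frequency style score: 0.0 when every mapping uses the pattern."""
    all_mappings = {p for p, _, _, _ in observations}
    using = {p for p, pat, _, _ in observations if pat == pattern}
    if not using:
        raise ValueError("pattern was never observed")
    return math.log(len(all_mappings) / len(using))


if __name__ == "__main__":
    print(support(OBSERVATIONS, "subsidiary", "?R? was acquired by ?D?"))
    print(specificity(OBSERVATIONS, "?R? was acquired by ?D?"))
    print(specificity(OBSERVATIONS, "?R? is a part of ?D?"))
```

A pattern seen with several distinct pairs gets higher support, while a pattern shared by both mappings, such as “?R? is a part of ?D?”, gets the lowest specificity.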

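
The footnote on slide 15 defines "New true statements" as the extracted statements not found in DBpedia, multiplied by the precision. The following Python sketch recomputes that row from the figures on the slide; the function and variable names are illustrative. The recomputed values agree with the slide to within one statement, and the unweighted mean of the three en-wiki precisions equals 95.5, which appears to be the figure quoted on the Conclusion slide.

```python
# Figures copied from slide 15: (corpus, type) -> (extracted, in DBpedia, precision in %).
RESULTS = {
    ("en-wiki", "LOC"): (1465, 138, 90.5),
    ("en-wiki", "PER"): (8817, 183, 97.0),
    ("en-wiki", "ORG"): (2567, 48, 99.0),
    ("en-news", "LOC"): (488, 52, 61.5),
    ("en-news", "PER"): (903, 44, 73.5),
    ("en-news", "ORG"): (916, 7, 91.0),
}

# "New true statements" row as reported on the slide.
REPORTED = {
    ("en-wiki", "LOC"): 1200,
    ("en-wiki", "PER"): 8375,
    ("en-wiki", "ORG"): 2494,
    ("en-news", "LOC"): 268,
    ("en-news", "PER"): 631,
    ("en-news", "ORG"): 827,
}


def new_true_statements(extracted, in_dbpedia, precision_percent):
    """Statements not already in DBpedia, scaled by the measured precision."""
    return (extracted - in_dbpedia) * precision_percent / 100.0


def mean_precision(corpus):
    """Unweighted mean precision (in %) over the entity types of one corpus."""
    values = [p for (c, _), (_, _, p) in RESULTS.items() if c == corpus]
    return sum(values) / len(values)


if __name__ == "__main__":
    for key, row in RESULTS.items():
        print(key, round(new_true_statements(*row), 2), "reported:", REPORTED[key])
    print("en-wiki mean precision:", mean_precision("en-wiki"))
```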