BOA - How To Integrate Your Language

BOA tries to extract knowledge (binary relations) from unstructured data such as free text. This tutorial uses Korean as an example to show how to adapt the BOA approach to your language.

Published in: Education, Technology

  1. BOA: How To Integrate Your Language. Daniel Gerber, Axel-Cyrille Ngonga Ngomo (AKSW, Universität Leipzig)
  2. Bootstrapping the Data Web. General Overview: Corpus, Indexing, Background Knowledge, Surface forms, Korean features, Search & Scoring, RDF extraction, Evaluation. AKSW@KAIST - 17.01.2012 - Page 2 http://boa.aksw.org
  3. Step 1: Create a corpus in your language
     ๏ At least 25M sentences
     ๏ Chunked into one sentence per line
     ๏ No HTML
     ๏ UTF-8?
     ๏ For later coreference resolution, the resource URL needs to be available
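The corpus requirements above can be checked mechanically before indexing. The following is a minimal sketch, not part of BOA itself: the function name `check_corpus_line` and the tag regex are assumptions made for illustration, and the check only covers the encoding, empty-line, and HTML requirements.

```python
import re
from typing import List

# Hypothetical, deliberately simple HTML detector: anything that looks like <...>.
HTML_TAG = re.compile(r"<[^>]+>")

def check_corpus_line(raw: bytes) -> List[str]:
    """Return the problems found in one raw corpus line (empty list if none)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Without a valid decoding the other checks are meaningless.
        return ["not valid UTF-8"]
    problems = []
    if not text.strip():
        problems.append("empty line")
    if HTML_TAG.search(text):
        problems.append("contains HTML markup")
    return problems

sample = [
    "버락 오바마는 호놀룰루에서 태어났습니다.".encode("utf-8"),
    b"<p>Seoul is the capital of South Korea.</p>",
    b"\xff\xfe broken bytes",
]
for line in sample:
    print(check_corpus_line(line))
```

Sentence splitting itself is language specific and is not attempted here.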
  4. Step 2: Corpus indexing
     ๏ Apache Lucene 3.4.0
     ๏ Set of >20 UTF-8 RegEx filters
     ๏ Whitespace Analyzer
       ➡ No stemming
       ➡ Tokenization on every token
       ➡ Stop-words included in index
       ➡ Lowercase version
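To make the analyzer settings concrete, here is a small sketch of the same idea in plain Python. It does not use the Lucene API; `whitespace_tokens` and `build_index` are hypothetical helpers that only imitate the listed behaviour (split on whitespace, lowercase, keep stop-words, no stemming).

```python
from collections import defaultdict

def whitespace_tokens(sentence):
    """Split on whitespace only and lowercase; no stemming, no stop-word removal."""
    return sentence.lower().split()

def build_index(sentences):
    """Map each token to the set of sentence ids that contain it."""
    index = defaultdict(set)
    for sentence_id, sentence in enumerate(sentences):
        for token in whitespace_tokens(sentence):
            index[token].add(sentence_id)
    return index

corpus = [
    "Barack Obama was born in Honolulu.",
    "Seoul is the capital of South Korea.",
]
index = build_index(corpus)
print(sorted(index["in"]))  # stop-words such as "in" stay searchable
```

Note that punctuation stays attached to tokens ("honolulu."), which is one reason a real setup adds regex filters before indexing.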
  5. Step 3: Background knowledge I
     Object Properties vs Datatype Properties
  6. Step 3: Background knowledge II
     Line #1
       URI1: http://dbpedia.org/resource/South_Korea
       Label1: 대한민국
       Property: http://dbpedia.org/ontology/capital
       URI2: http://dbpedia.org/resource/Seoul
       Label2: 서울
       Domain: http://dbpedia.org/ontology/PopulatedPlace
       Range: http://dbpedia.org/ontology/PopulatedPlace
     Line #2
       URI1: http://dbpedia.org/resource/KAIST
       Label1: 한국 과학 기술원
       Property: http://dbpedia.org/ontology/country
       URI2: http://dbpedia.org/resource/South_Korea
       Label2: 대한민국
       Domain: ⎯
       Range: http://dbpedia.org/ontology/Country
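A background knowledge line with these seven fields could be loaded as follows. This is a sketch under assumptions: the slide does not state the on-disk format, so the tab separator, the class name `BackgroundKnowledge`, and the treatment of an empty field or "⎯" as "no value" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BackgroundKnowledge:
    uri1: str
    label1: str
    prop: str
    uri2: str
    label2: str
    rdfs_domain: Optional[str]  # may be missing, as in Line #2 above
    rdfs_range: Optional[str]

def parse_line(line, sep="\t"):
    """Parse one record; the separator is an assumption, not BOA's documented format."""
    fields = [field.strip() for field in line.rstrip("\n").split(sep)]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    # Treat an empty field or the dash placeholder as a missing value.
    optional = [None if f in ("", "⎯") else f for f in fields[5:]]
    return BackgroundKnowledge(*fields[:5], *optional)

record = parse_line("\t".join([
    "http://dbpedia.org/resource/KAIST", "한국 과학 기술원",
    "http://dbpedia.org/ontology/country",
    "http://dbpedia.org/resource/South_Korea", "대한민국",
    "⎯", "http://dbpedia.org/ontology/Country",
]))
print(record.label1, record.rdfs_domain)
```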
  7. Step 4: Surface form generation
     ๏ DBpedia Spotlight
     ๏ Labels
     ๏ Redirects
     ๏ Disambiguation
     ๏ Datatype Properties
     ๏ Person XY is born on 1st of October in 1972.
     ๏ Person XY is born on 1 October in 1972.
     ๏ Person XY is born on a Thursday in 1972
     ๏ Find and create those surface forms
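For datatype properties such as dates, surface forms have to be generated rather than looked up. The sketch below shows one way to do that for a single date value; the function names and the selection of forms are assumptions, and the last form (year, month, day with 년/월/일) is added here as a Korean example that does not appear on the slide.

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

def date_surface_forms(d):
    """Generate a few textual variants under which a date may appear in a corpus."""
    month = MONTHS[d.month - 1]
    return [
        f"{ordinal(d.day)} of {month} in {d.year}",
        f"{d.day} {month} in {d.year}",
        f"{d.day} {month} {d.year}",
        f"{d.year}년 {d.month}월 {d.day}일",
    ]

for form in date_surface_forms(date(1972, 10, 1)):
    print(form)
```

Each generated string can then be searched in the index like any other label.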
  8. Step 5: Korean feature extraction
     Columns: Language Dependent, Language Independent
     Entries: ReVerb, # of stopwords, Wordnet, # of words, Distance, ?, # of occurrences, ?, ?, ?
  9. Step 6: Pattern search and scoring
     English: "Barack Obama was born in Honolulu." Pattern: "was born in"
     Korean: "버락 오바마는 호놀룰루에서 태어났습니다." Predicate? Subject? Object?
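The question marks on this slide can be illustrated with a small sketch that looks at the text around two known labels. The helper `extract_context` is hypothetical and is not BOA's search code; it only shows that in the English sentence the predicate sits between the labels, while in the Korean sentence it follows the second label.

```python
def extract_context(sentence, subject_label, object_label):
    """Return the text between and after the two labels, or None if not applicable."""
    s = sentence.find(subject_label)
    o = sentence.find(object_label)
    if s == -1 or o == -1:
        return None
    # Order the two labels by their position in the sentence.
    (first_pos, first_label), (second_pos, second_label) = sorted(
        [(s, subject_label), (o, object_label)]
    )
    between_start = first_pos + len(first_label)
    if between_start > second_pos:
        return None  # labels overlap
    return {
        "subject_first": s < o,
        "between": sentence[between_start:second_pos].strip(),
        "after": sentence[second_pos + len(second_label):].strip(),
    }

print(extract_context("Barack Obama was born in Honolulu.", "Barack Obama", "Honolulu"))
print(extract_context("버락 오바마는 호놀룰루에서 태어났습니다.", "버락 오바마", "호놀룰루"))
```

For Korean, the "between" part holds only the particle attached to the subject, so a pattern search for a new language has to decide which span carries the predicate.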
  10. Step 7: RDF extraction
      Named Entity Disambiguation!
      English: "Barack Obama was born in Honolulu." The pattern "was born in" maps to dbpedia-owl:birthPlace, giving: Barack Obama dbpedia-owl:birthPlace Honolulu
      Korean: "버락 오바마는 호놀룰루에서 태어났습니다." The pattern "에서 태어났습니다." maps to dbpedia-owl:birthPlace
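Once a pattern is tied to a property and both labels are disambiguated, writing the triple is straightforward. The following sketch uses two hand-written lookup tables as stand-ins for pattern scoring and named entity disambiguation; both tables and the function `to_ntriple` are assumptions for illustration only.

```python
BIRTH_PLACE = "http://dbpedia.org/ontology/birthPlace"

# Hypothetical stand-in for the learned pattern-to-property mapping.
PATTERN_TO_PROPERTY = {
    "was born in": BIRTH_PLACE,
    "에서 태어났습니다.": BIRTH_PLACE,
}

# Hypothetical stand-in for named entity disambiguation.
LABEL_TO_URI = {
    "Barack Obama": "http://dbpedia.org/resource/Barack_Obama",
    "버락 오바마": "http://dbpedia.org/resource/Barack_Obama",
    "Honolulu": "http://dbpedia.org/resource/Honolulu",
    "호놀룰루": "http://dbpedia.org/resource/Honolulu",
}

def to_ntriple(subject_label, pattern, object_label):
    """Return one N-Triples line, or None if any part cannot be resolved."""
    prop = PATTERN_TO_PROPERTY.get(pattern)
    subject = LABEL_TO_URI.get(subject_label)
    obj = LABEL_TO_URI.get(object_label)
    if None in (prop, subject, obj):
        return None
    return f"<{subject}> <{prop}> <{obj}> ."

print(to_ntriple("Barack Obama", "was born in", "Honolulu"))
print(to_ntriple("버락 오바마", "에서 태어났습니다.", "호놀룰루"))
```

Both sentences yield the same triple, which is the point of mapping language-specific patterns to a shared property.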
  11. Step 8: Evaluation
      1. Select properties P to evaluate (T100)
      2. Query DBpedia for triples (and labels) with p ∈ P
      3. Find sentence with labels
      4. Assess if triple can be found in sentence
         ➡ Gold Standard with 1000 annotated sentence/triples
      5. Run one BOA iteration on Gold Standard
      6. Measure Precision/Recall/F-Measure
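The final measurement step can be sketched as a comparison of two sets of triples. The function name and the toy triples are assumptions; the formulas are the standard definitions of precision, recall, and balanced F-measure.

```python
def precision_recall_f1(extracted, gold):
    """Compare extracted triples against gold-standard triples."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy data: (subject, property, object) tuples standing in for annotated triples.
gold = {
    ("Barack_Obama", "birthPlace", "Honolulu"),
    ("South_Korea", "capital", "Seoul"),
    ("KAIST", "country", "South_Korea"),
    ("Seoul", "country", "South_Korea"),
}
extracted = {
    ("Barack_Obama", "birthPlace", "Honolulu"),
    ("Barack_Obama", "birthPlace", "Seoul"),
}
print(precision_recall_f1(extracted, gold))
```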
  12. Necessary resources for new language
      ๏ 50M sentences (best general knowledge)
      ๏ Sentence Boundary Disambiguation
      ๏ Part of speech tagger helpful
      ๏ Named Entity Recognition
      ๏ Named Entity Disambiguation
      ๏ Labels for resources
      ๏ SPARQL endpoint
  13. Thank you! Questions?
      Daniel Gerber
      Johannisgasse 26, Room 5-21
      04103 Leipzig, Germany
      SIMBA@AKSW
      http://bis.informatik.uni-leipzig.de/DanielGerber
      http://boa.aksw.org
      http://code.google.com/p/boa
      LOD2 Presentation . 02.09.2010 . Page http://lod2.eu
