BOA - How To Integrate Your Language


BOA tries to extract knowledge (binary relations) from unstructured data such as free text. This tutorial uses Korean as a running example to show how to adapt the BOA approach to your language.

    Presentation Transcript

    • BOA: How To Integrate Your Language. Daniel Gerber, Axel-Cyrille Ngonga Ngomo (AKSW, Universität Leipzig)
    • Bootstrapping the Data Web (AKSW@KAIST, 17.01.2012, http://boa.aksw.org)
    • General Overview
        ๏ Corpus
        ๏ Indexing
        ๏ Background Knowledge
        ๏ Surface forms
        ๏ Korean features
        ๏ Search & Scoring
        ๏ RDF extraction
        ๏ Evaluation
    • 1. Create a corpus in your language
        ๏ At least 25M sentences
        ๏ Chunked into one sentence per line
        ๏ No HTML
        ๏ UTF-8?
        ๏ For later coreference resolution, the resource URL needs to be available
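
The corpus requirements above (no HTML, one sentence per line, UTF-8) could be met with a small preprocessing step. The following is a minimal sketch, not BOA's own tooling: the function names, the tag-stripping regex, and the punctuation-based sentence splitter are assumptions for illustration. A real pipeline would use a sentence boundary detector built for the target language.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")             # crude HTML tag matcher (assumption)
SENT_END_RE = re.compile(r"(?<=[.!?])\s+")  # naive sentence boundary (assumption)


def to_sentence_lines(raw_text):
    """Strip HTML markup and return a list with one sentence per entry."""
    # Replacing tags with a space keeps block elements apart, but it may also
    # split a word that is wrapped in inline tags.
    text = TAG_RE.sub(" ", raw_text)
    text = html.unescape(text)                # decode entities such as &amp;
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return [s for s in SENT_END_RE.split(text) if s]


def write_corpus(sentences, path):
    """Write one sentence per line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as out:
        for sentence in sentences:
            out.write(sentence + "\n")
```
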
    • 2. Corpus indexing
        ๏ Apache Lucene 3.4.0
        ๏ Set of >20 UTF-8 RegEx filters
        ๏ Whitespace Analyzer
            ➡ No stemming
            ➡ Tokenization on every token
            ➡ Stop-words included in index
            ➡ Lowercase version
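
The analysis behaviour listed on this slide (split on whitespace, no stemming, stop-words kept, lowercased) can be illustrated without Lucene. The sketch below is plain Python and does not use or mirror the Lucene API; `analyze`, `build_index`, and `search_all` are hypothetical names chosen for this example.

```python
from collections import defaultdict


def analyze(sentence):
    """Split on whitespace only and lowercase; no stemming, no stop-word removal."""
    return [token.lower() for token in sentence.split()]


def build_index(sentences):
    """Map each token to the set of ids of the sentences that contain it."""
    index = defaultdict(set)
    for sent_id, sentence in enumerate(sentences):
        for token in analyze(sentence):
            index[token].add(sent_id)
    return index


def search_all(index, query):
    """Return the ids of sentences that contain every token of the query."""
    postings = [index.get(token, set()) for token in analyze(query)]
    return set.intersection(*postings) if postings else set()
```

Because tokens are split on whitespace only, punctuation stays attached to the neighbouring word (`honolulu.`), and stop-words such as "was" and "in" remain searchable. Keeping stop-words matters because patterns like "was born in" consist largely of them.
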
    • 3. Background knowledge I
        ๏ Object Properties vs Datatype Properties
    • 3. Background knowledge II
        ๏ Line #1
            ➡ URI1: http://dbpedia.org/resource/South_Korea
            ➡ Label1: 대한민국
            ➡ Property: http://dbpedia.org/ontology/capital
            ➡ URI2: http://dbpedia.org/resource/Seoul
            ➡ Label2: 서울
            ➡ Domain: http://dbpedia.org/ontology/PopulatedPlace
            ➡ Range: http://dbpedia.org/ontology/PopulatedPlace
        ๏ Line #2
            ➡ URI1: http://dbpedia.org/resource/KAIST
            ➡ Label1: 한국 과학 기술원
            ➡ Property: http://dbpedia.org/ontology/country
            ➡ URI2: http://dbpedia.org/resource/South_Korea
            ➡ Label2: 대한민국
            ➡ Domain: (none)
            ➡ Range: http://dbpedia.org/ontology/Country
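
One way to hold such a background knowledge line in memory is a small record type. This is only a sketch: the class name, the field names, and the use of `None` for a missing domain are assumptions, and the slide does not specify how the lines are serialized on disk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackgroundKnowledge:
    """One subject/property/object line with labels (hypothetical layout)."""
    uri1: str
    label1: str
    property_uri: str
    uri2: str
    label2: str
    domain: Optional[str]   # may be missing, as in Line #2 on the slide
    range_: Optional[str]   # trailing underscore avoids shadowing the built-in

    def label_pair(self):
        """The two labels that have to co-occur in a corpus sentence."""
        return (self.label1, self.label2)


line1 = BackgroundKnowledge(
    uri1="http://dbpedia.org/resource/South_Korea",
    label1="대한민국",
    property_uri="http://dbpedia.org/ontology/capital",
    uri2="http://dbpedia.org/resource/Seoul",
    label2="서울",
    domain="http://dbpedia.org/ontology/PopulatedPlace",
    range_="http://dbpedia.org/ontology/PopulatedPlace",
)
```
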
    • 4. Surface form generation
        ๏ DBpedia Spotlight
        ๏ Labels
        ๏ Redirects
        ๏ Disambiguation
        ๏ Datatype Properties
        ๏ Person XY was born on 1st of October in 1972.
        ๏ Person XY was born on 1 October in 1972.
        ๏ Person XY was born on a Thursday in 1972.
        ๏ Find and create those surface forms
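
For datatype properties such as dates, the surface forms have to be generated, because a single value can be written in several ways. The sketch below produces the three English variants shown on the slide for an arbitrary date. The function names and templates are assumptions for illustration; a Korean version would need its own templates.

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
            "Saturday", "Sunday"]  # date.weekday() returns 0 for Monday


def ordinal(n):
    """Return an English ordinal such as 1st, 2nd, 3rd, 11th."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def date_surface_forms(d):
    """Generate textual variants of a date, following the slide's examples."""
    month = MONTHS[d.month - 1]
    return [
        f"{ordinal(d.day)} of {month} in {d.year}",
        f"{d.day} {month} in {d.year}",
        f"a {WEEKDAYS[d.weekday()]} in {d.year}",
    ]
```
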
    • 5. Korean feature extraction
        ๏ Language dependent: ReVerb, WordNet distance, ?
        ๏ Language independent: # of stopwords, # of words, # of occurrences, ?
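
The three language-independent features named on the slide can be computed from a pattern string alone, given a stop-word list for the target language. This is a minimal sketch: the function name, the dictionary keys, and the substring-based occurrence count are assumptions, and the stop-word set has to be supplied by the caller.

```python
def language_independent_features(pattern, sentences, stopwords):
    """Count words, stop-words, and corpus occurrences for one pattern."""
    normalized = pattern.lower()
    tokens = normalized.split()
    return {
        "num_words": len(tokens),
        "num_stopwords": sum(1 for token in tokens if token in stopwords),
        # Simple substring counting; a real system would query the index.
        "num_occurrences": sum(s.lower().count(normalized) for s in sentences),
    }
```
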
    • 6. Pattern search and scoring
        ๏ English: Barack Obama was born in Honolulu. ➡ was born in
        ๏ Korean: 버락 오바마는 호놀룰루에서 태어났습니다. ➡ Subject? Predicate? Object?
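
The question marks on this slide point at a word-order problem: in the English sentence the pattern sits between subject and object, while in the Korean sentence the verb follows the object. One way to handle both, sketched below, is to replace the two labels with placeholders and keep the whole remaining string as the pattern. The placeholders `?D?` and `?R?` and the function name are assumptions made for this example.

```python
def extract_pattern(sentence, subject_label, object_label):
    """Replace the first occurrence of each label with a placeholder.

    Returns None if either label is missing from the sentence.
    """
    if subject_label not in sentence:
        return None
    marked = sentence.replace(subject_label, "?D?", 1)
    if object_label not in marked:
        return None
    return marked.replace(object_label, "?R?", 1)
```

With this representation the Korean example yields `?D?는 ?R?에서 태어났습니다.`, which keeps both the particle after the subject and the verb after the object.
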
    • 7. RDF extraction
        ๏ Barack Obama was born in Honolulu. ➡ was born in ➡ dbpedia-owl:birthPlace
        ๏ 버락 오바마는 호놀룰루에서 태어났습니다. ➡ 에서 태어났습니다. ➡ dbpedia-owl:birthPlace
        ๏ Barack Obama, Honolulu: Named Entity Disambiguation!
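
Once a pattern is associated with a property, it can be matched against new sentences to produce triples. The sketch below turns a placeholder pattern into a regular expression and emits an N-Triples line. It is an illustration under assumptions: the `?D?`/`?R?` placeholders and all function names are hypothetical, and a plain label-to-URI dictionary stands in for real named entity disambiguation.

```python
import re

PLACEHOLDER_RE = re.compile(r"(\?D\?|\?R\?)")
GROUPS = {"?D?": "(?P<subject>.+?)", "?R?": "(?P<object>.+?)"}


def pattern_to_regex(pattern):
    """Compile a placeholder pattern into a regex with named groups."""
    parts = PLACEHOLDER_RE.split(pattern)
    return re.compile("".join(GROUPS.get(p, re.escape(p)) for p in parts))


def extract_triple(sentence, pattern, property_uri, label_to_uri):
    """Return an N-Triples line, or None if nothing can be extracted."""
    match = pattern_to_regex(pattern).fullmatch(sentence)
    if match is None:
        return None
    # Dictionary lookup as a stand-in for named entity disambiguation.
    subject_uri = label_to_uri.get(match.group("subject"))
    object_uri = label_to_uri.get(match.group("object"))
    if subject_uri is None or object_uri is None:
        return None
    return f"<{subject_uri}> <{property_uri}> <{object_uri}> ."
```
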
    • 8. Evaluation
        1. Select properties P to evaluate (T100)
        2. Query DBpedia for triples (and labels) with p ∈ P
        3. Find sentences containing the labels
        4. Assess whether the triple can be found in the sentence
            ➡ Gold standard with 1000 annotated sentence/triple pairs
        5. Run one BOA iteration on the gold standard
        6. Measure precision, recall, and F-measure
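
The final measurement step compares the triples extracted from the gold standard sentences with the annotated triples. A minimal sketch of the standard set-based definitions follows; the function name and the representation of a triple as a tuple are assumptions for this example.

```python
def precision_recall_f1(extracted, gold):
    """Compute precision, recall, and F1 for two collections of triples."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, if three triples are extracted, two of them are correct, and the gold standard holds four triples, precision is 2/3, recall is 1/2, and F1 is 4/7.
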
    • Necessary resources for a new language
        ๏ 50M sentences (preferably general knowledge)
        ๏ Sentence boundary disambiguation
        ๏ Part-of-speech tagger (helpful)
        ๏ Named entity recognition
        ๏ Named entity disambiguation
        ๏ Labels for resources
        ๏ SPARQL endpoint
    • Thank you! Questions?
        ๏ Daniel Gerber, SIMBA@AKSW
        ๏ Johannisgasse 26, Room 5-21, 04103 Leipzig, Germany
        ๏ http://bis.informatik.uni-leipzig.de/DanielGerber
        ๏ http://boa.aksw.org
        ๏ http://code.google.com/p/boa
        ๏ LOD2 Presentation . 02.09.2010 . http://lod2.eu