NLP2RDF Wortschatz and Linguistic LOD draft

Published in: Technology, Education
Transcript of "NLP2RDF Wortschatz and Linguistic LOD draft"

  1. NLP2RDF: Integration of Data, Tools and Applications with RDF/OWL in the Areas of Text Mining and Linguistics
     PhD Thesis, Sebastian Hellmann
  2. Extensive Topic – What is the core?
     Features for Machine Learning
     Which features do I need for a certain text mining task?
     An introductory example:
     Resources:
       • Face recognition tool that detects the color of the eyes (brown, green, blue) and the type of haircut (Vo-ku-hi-la, mullet, GI Joe)
       • Database with age and occupation
     Goal: predict the income of persons
       • Young students earn less than old CEOs.
     => Color of eyes and haircut are probably irrelevant!
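The introductory example can be made concrete with a small sketch. It is not taken from the thesis: the records below are invented toy data, and "relevance" is approximated by a deliberately simple measure, namely how well the mean income per feature value predicts each person's income. Under these assumptions, age and occupation give a small error while eye color gives a large one.

```python
from collections import defaultdict
from statistics import mean

# Toy records, invented purely for illustration (not data from the thesis).
people = [
    {"age": "young", "occupation": "student", "eyes": "brown", "income": 8},
    {"age": "young", "occupation": "student", "eyes": "blue", "income": 10},
    {"age": "young", "occupation": "student", "eyes": "green", "income": 9},
    {"age": "old", "occupation": "CEO", "eyes": "brown", "income": 200},
    {"age": "old", "occupation": "CEO", "eyes": "blue", "income": 180},
    {"age": "old", "occupation": "CEO", "eyes": "green", "income": 190},
]


def error_when_predicting_by(feature, records):
    """Mean absolute error when income is predicted as the group mean for `feature`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[feature]].append(record["income"])
    group_mean = {value: mean(incomes) for value, incomes in groups.items()}
    return mean(abs(r["income"] - group_mean[r[feature]]) for r in records)


# Lower error means the feature carries more information about income.
errors = {f: error_when_predicting_by(f, people) for f in ("age", "occupation", "eyes")}
for feature, err in sorted(errors.items(), key=lambda item: item[1]):
    print(f"{feature}: mean absolute error {err:.1f}")
```

A real benchmarking framework would use a proper learner and held-out test data; the group-mean predictor is only the smallest thing that shows the contrast.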
  3. Basic idea: a benchmarking framework
     Input:
       • Task specification
       • Text
       • Training/test data
     Output:
       • Tools and data required to solve the task
     Do I need POS tags to classify tourism documents?
     Prerequisites:
       • Tools and applications need a standardized interface
       • Data needs a standardized format
  4. Basic idea: a benchmarking framework
     NLP2RDF stack
  5. Basic idea: a benchmarking framework
     A Google Code project was created:
       • Stanford Parser was integrated
       • Ontologies were found and integrated
       • Pipeline implemented
       • Plugin system implemented
       • Some results were achieved
     But…
       • Architecture not flexible enough (pipeline)
       • Integration bound to Java
       • Data sources were not sufficient
       • Wikipedia/DBpedia too coarse-grained
       • Speed of integration too slow
  6. Prerequisites
     One step back:
     Creation of datasets in RDF
     Data integration and linking of datasets
     Licences
     Standardized format for tool integration
     Acquisition of additional knowledge
  7. Why RDF and OWL?
     1. RDF makes data integration easy: URIref, Linked Data
     2. OWL is based on Description Logics (Guarded Fragment)
     3. Availability of open datasets (access and licence)
     4. Diverse serializations for annotations: XML, Turtle, RDFa+XHTML
     5. Scalable tool support (databases, reasoning)
     6. If the only tool you have is a hammer, everything looks like a nail.
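The claim that RDF makes data integration easy can be illustrated with a minimal sketch. It assumes two independent sources that already use the same URI for the same person; all URIs and values are made-up placeholders under http://example.org/, and triples are modelled as plain Python tuples rather than with an RDF library.

```python
# Two independent datasets as sets of (subject, predicate, object) triples.
# All URIs and values below are made-up placeholders.
EX = "http://example.org/"

dataset_a = {
    (EX + "person/alice", EX + "prop/occupation", "student"),
    (EX + "person/bob", EX + "prop/occupation", "CEO"),
}
dataset_b = {
    (EX + "person/alice", EX + "prop/age", 22),
    (EX + "person/bob", EX + "prop/age", 58),
}

# Integration is a plain set union: no schema mapping or key translation is
# needed, because both sources identify each person by the same URI.
merged = dataset_a | dataset_b


def describe(subject, triples):
    """Collect all predicate/value pairs known about one subject."""
    return {p: o for s, p, o in triples if s == subject}


alice = describe(EX + "person/alice", merged)
print(alice)
```

When the sources use different URIs for the same entity, an extra linking step is needed first, which is what the slides call linking of datasets.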
  8. LOD Cloud - over 26 billion facts
     DBpedia is central:
       • Cross-domain
       • Crystallization point (early bird)
     Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
  9. Simplified:
       • Circles are database tables
       • Links are HTTP foreign keys
  10. Linked Data
      http://www4.wiwiss.fu-berlin.de/rdf_browser/?browse_uri=http%3A%2F%2Fdata.nytimes.com%2FN12930380387917339601
      Resembles a database table
      Key-value pairs
      Values can be:
        • Datatypes (strings, integers)
        • URIs pointing to subjects in the same table
        • URIs pointing to subjects in any other table
  11. SPARQL – optimizations for table joins
      All soccer players who played as goalkeeper for a club that has a stadium with more than 40,000 seats and who were born in a country with more than 10 million inhabitants
      http://tinyurl.com/2uhuow9
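The soccer-player question is a chain of joins over triples. The sketch below is not the query behind the short link, which is not reproduced here; it evaluates the same question naively over an invented toy triple store. Every identifier, predicate name and number is hypothetical.

```python
# Toy triple store; every name and number is invented for illustration.
triples = [
    ("player1", "position", "goalkeeper"),
    ("player1", "team", "clubA"),
    ("player1", "birthCountry", "countryX"),
    ("player2", "position", "goalkeeper"),
    ("player2", "team", "clubB"),
    ("player2", "birthCountry", "countryX"),
    ("player3", "position", "striker"),
    ("player3", "team", "clubA"),
    ("player3", "birthCountry", "countryX"),
    ("player4", "position", "goalkeeper"),
    ("player4", "team", "clubA"),
    ("player4", "birthCountry", "countryY"),
    ("clubA", "stadiumCapacity", 55000),
    ("clubB", "stadiumCapacity", 12000),
    ("countryX", "population", 80000000),
    ("countryY", "population", 4000000),
]


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the store."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the store."""
    return [s for s, p, o in triples if p == predicate and o == obj]


# Each nested `for` corresponds to one triple pattern, each `if` to a filter.
result = sorted(
    player
    for player in subjects("position", "goalkeeper")
    for club in objects(player, "team")
    for capacity in objects(club, "stadiumCapacity")
    if capacity > 40000
    for country in objects(player, "birthCountry")
    for population in objects(country, "population")
    if population > 10000000
)
print(result)
```

Each lookup here scans the whole list, so the cost grows quickly with the number of patterns. Avoiding that with indexes and join ordering is the kind of optimization a triple store performs.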
  12. SPARQL – optimizations for table joins
  13. Creation of datasets: Wiktionary2RDF
  14. Creation of datasets: Wiktionary2RDF
      http://en.wiktionary.org/wiki/house
        • Covers 170 languages
        • Total of 10 million pages
        • 900,000 users
        • RDF dump will increase the number of editors
        • Same properties as Wikipedia (stable identifiers)
        • Hundreds of Wiktionary parsers (especially for English)
        • Information is trapped in the wiki
        • Structure changes make software obsolete
      Why try it again?
        • DBpedia Extraction Framework is very mature (5 years, 15 developers)
        • Configuration over code; templates will allow Wiktionarians to update parsers
        • Early contact with the community
  15. Creation of datasets: Wortschatz
      Converted in 2009:
      Matthias Quasthoff, Sebastian Hellmann and Konrad Höffner:
      Standardized Multilingual Language Resources for the Web of Data:
      http://corpora.uni-leipzig.de/rdf
      3rd prize at the LOD Triplification Challenge, Graz, 2009
      What was missing?
        • Research questions
        • Use cases
        • Other datasets to link to!
        • Wikipedia not suited as a linking partner
        • No servers
  16. Wiktionary, Wortschatz and OLiA can become the crystallization point for a Linguistic Linked Data Web
      Four major types:
        • Lexical-semantic resources
        • Dictionaries
        • Corpora
        • Schemas/ontologies
  17. Interlinking Wortschatz: Research and Use Case
      Iterated co-occurrences can be done with SPARQL
      Wiktionary and Wortschatz can be loaded into the same database
      Interesting questions:
        • What is the overlap and coverage?
        • Which Wiktionary relation can be linked to which statistical relation?
        • Can we build tools that help Wiktionary editors (suggestions)?
        • Wiktionary links words across languages. Are there any similar patterns?
        • Can we validate the Wiktionary RDF dump with Wortschatz?
  18. Open Licences – Focus of LOD2 and OKFN
      http://ckan.net/
      CKAN is an open registry of data and content packages. Harnessing the CKAN software, this site makes it easy to find, share and reuse content and data, especially in ways that are machine automatable.
      Working Group on Open Data in Linguistics
      http://wiki.okfn.org/wg/linguistics
        • Founded in Nov 2010
        • 6-7 members
        • Membership open, please join
  19. Standardized Formats: Part 1 – Corpora
      http://www.sfb632.uni-potsdam.de/~d1/paula/doc/
      PAULA XML is the Potsdamer Austauschformat für linguistische Annotation ("Potsdam Interchange Format for Linguistic Annotation"). It is an XML-based standoff representation format, which has been designed to represent data with heterogeneous annotation layers produced by different tools. For visualization and querying of PAULA XML data, the database ANNIS can be used.
      Christian Chiarcos at work:
      PAULA will become POWLA and will be used for the representation of corpus annotations.
  20. Standardized Formats: Part 2 – the Web
      The bottom layer of the NLP2RDF stack can be reused:
      An ontology to represent strings (formerly the SSO).
      In his latest book, Wikinomics, Don Tapscott explains deep changes in technology, demographics and business.
        • URIs to represent strings, e.g. http://nlp2rdf.org/example/Don_Tapscott
        • Relations between strings: previous, next, sub, super
        • http://nlp2rdf.org/example/Don is a subString of the above
  21. Standardized Formats: Part 2 – the Web
        • RDFa allows for flexible in-line annotations
        • Multiple services can be integrated ad hoc
        • Multiple layers of annotation can be used
        • Full compatibility with POWLA
        • Trade-off between flexibility and speed
  22. Knowledge Acquisition
      Tiger Corpus Navigator
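Returning to the string ontology from "Standardized Formats: Part 2": the idea of giving strings URIs and relating them by previous, next, sub and super can be sketched as follows. Only the URI prefix and the two example URIs come from the slide. The underscore naming scheme is inferred from the Don_Tapscott example, and the relation names `subString`, `superString`, `next` and `previous` are illustrative stand-ins, not the ontology's actual property identifiers.

```python
BASE = "http://nlp2rdf.org/example/"  # prefix taken from the slide's example URIs


def string_uri(text):
    """Mint a URI for a string by replacing spaces with underscores (assumed scheme)."""
    return BASE + text.replace(" ", "_")


def string_triples(phrase):
    """Emit sub/super and previous/next triples for the words of a phrase."""
    words = phrase.split()
    whole = string_uri(phrase)
    triples = []
    for word in words:
        # Each word is a substring of the phrase; the phrase is its superstring.
        triples.append((string_uri(word), "subString", whole))
        triples.append((whole, "superString", string_uri(word)))
    for left, right in zip(words, words[1:]):
        # Adjacent words are linked in both directions.
        triples.append((string_uri(left), "next", string_uri(right)))
        triples.append((string_uri(right), "previous", string_uri(left)))
    return triples


string_relations = string_triples("Don Tapscott")
for triple in string_relations:
    print(triple)
```

A scheme based only on the string's text gives two occurrences of the same word the same URI. Whether that is intended depends on the ontology's design, which the slides do not specify.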
  23. Ontology Learning
      Johanna Völker – Learning Expressive Ontologies (LExO)
      # Example:
      # A fish is any aquatic vertebrate animal that is covered with scales,
      # and equipped with two sets of paired fins and several unpaired fins.
      #
      # [fish] subClassOf [any aquatic vertebrate animal that is covered…]
      #Construct {?sub rdfs:subClassOf ?super} {
      Construct {?sub owl:equivalentClass ?super} {
        ?is a penn:BePresentTense .
        ?is nlp:superToken ?is_any_aquatic_ .
        ?is_any_aquatic_ a olia:VerbPhrase .
        ?is_any_aquatic_ nlp:syntacticSubToken [ nlp:normUri ?super ] .
        ?animal nlp:cop ?is .
        ?animal nlp:nsubj ?fish .
        ?fish nlp:superToken [ nlp:normUri ?sub ] .
      }
  24. Standing on the shoulders of giants
      Christian Chiarcos, SFB632 - Uni Potsdam
      Johanna Völker, Uni Mannheim
      Markus Strohmaier, TU Graz
      Jens Lehmann, Uni Leipzig
      Sören Auer, Uni Leipzig
      Thank you for your attention