
NLP2RDF Wortschatz and Linguistic LOD draft


Published in: Technology, Education

  1. NLP2RDF: Integration of Data, Tools and Applications with RDF/OWL in the Areas of Text Mining and Linguistics
     PhD Thesis, Sebastian Hellmann
  2. Extensive Topic – What is the core?
     Features for Machine Learning
     Which features do I need for a certain text mining task?
     An introductory example:
     Resources:
     • Face recognition tool that detects the color of the eyes (brown, green, blue) and the type of haircut (Vo-ku-hi-la, Mullet, GI Joe)
     • Database with age and occupation
     Goal: predict the income of persons
     • Young students earn less than old CEOs.
     => Color of eyes and haircut are probably irrelevant!
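The intuition in the introductory example can be made concrete with a small sketch. The records below are invented toy data, and the scoring function (the share of income variance explained by grouping on one feature) is only one of many possible relevance measures; neither comes from the slides.

```python
from collections import defaultdict

# Hypothetical toy records, invented purely for illustration.
PEOPLE = [
    {"occupation": "student", "eye_color": "brown", "haircut": "mullet", "income": 10},
    {"occupation": "student", "eye_color": "blue", "haircut": "gi_joe", "income": 12},
    {"occupation": "student", "eye_color": "green", "haircut": "mullet", "income": 11},
    {"occupation": "ceo", "eye_color": "brown", "haircut": "gi_joe", "income": 200},
    {"occupation": "ceo", "eye_color": "blue", "haircut": "mullet", "income": 210},
    {"occupation": "ceo", "eye_color": "green", "haircut": "gi_joe", "income": 190},
]


def mean(values):
    return sum(values) / len(values)


def explained_variance(records, feature, target="income"):
    """Share of the target's variance explained by grouping on one feature."""
    values = [r[target] for r in records]
    grand_mean = mean(values)
    total = sum((v - grand_mean) ** 2 for v in values)
    groups = defaultdict(list)
    for r in records:
        groups[r[feature]].append(r[target])
    # Variance between the group means, weighted by group size.
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    return between / total


def rank_features(records, features):
    """Return (feature, score) pairs, most relevant first."""
    scores = {f: explained_variance(records, f) for f in features}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


for feature, score in rank_features(PEOPLE, ["occupation", "eye_color", "haircut"]):
    print(f"{feature}: {score:.3f}")
```

On this toy data, occupation explains almost all of the income variance while eye color and haircut explain very little, which mirrors the point of the slide.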
  3. Basic idea: a benchmarking framework
     Input:
     • Task specification
     • Text
     • Training/test data
     Output:
     • Tools and data required to solve the task
     Do I need POS tags to classify tourism documents?
     Prerequisites:
     • Tools and applications need a standardized interface
     • Data needs a standardized format
  4. Basic idea: a benchmarking framework
     NLP2RDF stack
  5. Basic idea: a benchmarking framework
     Google Code project was created
     • Stanford parser was integrated
     • Ontologies were found and integrated
     • Pipeline implemented
     • Plugin system implemented
     • Some results were achieved
     But…
     • Architecture not flexible enough (pipeline)
     • Integration bound to Java
     • Data sources were not sufficient
     • Wikipedia/DBpedia too coarse-grained
     • Speed of integration too slow
  6. Prerequisites
     One step back:
     Creation of datasets in RDF
     Data integration and linking of datasets
     Licences
     Standardized format for tool integration
     Acquisition of additional knowledge
  7. Why RDF and OWL?
     1. RDF makes data integration easy: URIref, Linked Data
     2. OWL is based on Description Logics (Guarded Fragment)
     3. Availability of open datasets (access and licence)
     4. Diverse serializations for annotations: XML, Turtle, RDFa+XHTML
     5. Scalable tool support (databases, reasoning)
     6. If the only tool you have is a hammer, everything looks like a nail.
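A minimal sketch of why shared URI references make integration easy: if two datasets use the same URI for the same thing, merging them is just a set union of their triples. The URIs, the vocabulary and the two small datasets below are hypothetical and only serve as illustration.

```python
# Hypothetical namespace and resources; nothing here is taken from a real dataset.
VOCAB = "http://example.org/vocab/"
FISH = "http://example.org/word/fish"    # URI shared by both datasets
WATER = "http://example.org/word/water"

# Dataset A: a dictionary-like source.
dictionary = {
    (FISH, VOCAB + "partOfSpeech", "noun"),
    (FISH, VOCAB + "translation", "Fisch"),
}

# Dataset B: a corpus-statistics source.
corpus_stats = {
    (FISH, VOCAB + "cooccursWith", WATER),
}


def merge(*graphs):
    """Merge any number of triple sets; shared URIs line up automatically."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def describe(graph, subject):
    """Collect all predicate -> objects pairs known about one subject."""
    description = {}
    for s, p, o in graph:
        if s == subject:
            description.setdefault(p, set()).add(o)
    return description


merged = merge(dictionary, corpus_stats)
for predicate, objects in sorted(describe(merged, FISH).items()):
    print(predicate, sorted(objects))
```

No schema mapping or key reconciliation is needed in this sketch because both sources already agree on the identifier; in practice, agreeing on or linking those identifiers is the hard part.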
  8. LOD Cloud – over 26 billion facts
     DBpedia is central:
     • Cross-domain
     • Crystallization point (early bird)
     Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.
  9. Simplified:
     • Circles are database tables
     • Links are HTTP foreign keys
  10. Linked Data
      Resembles a database table
      Key-value pairs
      Values can be:
      • Datatypes (strings, integers)
      • URIs pointing to subjects in the same table
      • URIs pointing to subjects in any other table
  11. SPARQL – optimizations for table joins
      All soccer players who played as goalkeeper for a club that has a stadium with more than 40,000 seats and who were born in a country with more than 10 million inhabitants
  12. SPARQL – optimizations for table joins
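The "table join" view of the goalkeeper question can be sketched in a few lines of Python. This is not a SPARQL engine, only a naive triple-pattern matcher; the `ex:` identifiers, the property names and all facts are hypothetical and do not reflect the DBpedia vocabulary.

```python
# Hypothetical facts as (subject, predicate, object) triples.
TRIPLES = [
    ("ex:player1", "ex:position", "ex:Goalkeeper"),
    ("ex:player1", "ex:club", "ex:clubA"),
    ("ex:player1", "ex:birthPlace", "ex:countryX"),
    ("ex:player2", "ex:position", "ex:Goalkeeper"),
    ("ex:player2", "ex:club", "ex:clubB"),
    ("ex:player2", "ex:birthPlace", "ex:countryX"),
    ("ex:player3", "ex:position", "ex:Striker"),
    ("ex:player3", "ex:club", "ex:clubA"),
    ("ex:player3", "ex:birthPlace", "ex:countryY"),
    ("ex:clubA", "ex:stadium", "ex:stadiumA"),
    ("ex:clubB", "ex:stadium", "ex:stadiumB"),
    ("ex:stadiumA", "ex:seats", 50000),
    ("ex:stadiumB", "ex:seats", 20000),
    ("ex:countryX", "ex:population", 80000000),
    ("ex:countryY", "ex:population", 5000000),
]


def is_variable(term):
    return isinstance(term, str) and term.startswith("?")


def match(pattern, triple, binding):
    """Extend a variable binding so that the pattern equals the triple, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if is_variable(term):
            if extended.setdefault(term, value) != value:
                return None  # variable already bound to something else
        elif term != value:
            return None  # constant does not match
    return extended


def query(triples, patterns, filters=()):
    """Join the triple patterns one after another, then apply the filters."""
    bindings = [{}]
    for pattern in patterns:
        joined = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    joined.append(extended)
        bindings = joined
    return [b for b in bindings if all(f(b) for f in filters)]


PATTERNS = [
    ("?player", "ex:position", "ex:Goalkeeper"),
    ("?player", "ex:club", "?club"),
    ("?club", "ex:stadium", "?stadium"),
    ("?stadium", "ex:seats", "?seats"),
    ("?player", "ex:birthPlace", "?country"),
    ("?country", "ex:population", "?population"),
]
FILTERS = [
    lambda b: b["?seats"] > 40000,
    lambda b: b["?population"] > 10000000,
]

results = query(TRIPLES, PATTERNS, FILTERS)
print([b["?player"] for b in results])
```

Each pattern behaves like a join against one more "table", with URIs acting as foreign keys. A real triple store would reorder and index these joins instead of scanning every triple, which is where the optimizations mentioned on the slide come in.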
  13. Creation of datasets: Wiktionary2RDF
  14. Creation of datasets: Wiktionary2RDF
      • Covers 170 languages
      • Total of 10 million pages
      • 900,000 users
      • RDF dump will increase the number of editors
      • Same properties as Wikipedia (stable identifiers)
      • Hundreds of Wiktionary parsers (especially for English)
      • Information is trapped in the wiki
      • Structure changes make software obsolete
      Why try it again?
      • DBpedia Extraction Framework is very mature (5 years, 15 developers)
      • Configuration over code; templates will allow Wiktionarians to update parsers
      • Early contact with the community
  15. Creation of datasets: Wortschatz
      Converted in 2009:
      Matthias Quasthoff, Sebastian Hellmann and Konrad Höffner:
      Standardized Multilingual Language Resources for the Web of Data
      3rd prize at the LOD Triplification Challenge, Graz, 2009
      What was missing?
      • Research questions
      • Use cases
      • Other datasets to link to!
      • Wikipedia not suited as a linking partner
      • No servers
  16. Wiktionary, Wortschatz and OLiA can become the crystallization point for a Linguistic Linked Data Web
      Four major types:
      • Lexical-semantic resources
      • Dictionaries
      • Corpora
      • Schemas/ontologies
  17. Interlinking Wortschatz: Research and Use Case
      Iterated co-occurrences can be done with SPARQL
      Wiktionary and Wortschatz can be loaded into the same database
      Interesting questions:
      • What is the overlap and coverage?
      • Which Wiktionary relation can be linked to which statistical relation?
      • Can we build tools that help Wiktionary editors (suggestions)?
      • Wiktionary links words across languages. Are there any similar patterns?
      • Can we validate the Wiktionary RDF dump with Wortschatz?
  18. Open Licences – Focus of LOD2 and OKFN
      CKAN is an open registry of data and content packages. Harnessing the CKAN software, this site makes it easy to find, share and reuse content and data, especially in ways that are machine automatable.
      Working Group on Open Data in Linguistics
      • Founded in Nov 2010
      • 6-7 members
      • Membership open, please join
  19. Standardized Formats: Part 1 – Corpora
      PAULA XML is the Potsdamer Austauschformat für linguistische Annotation ("Potsdam Interchange Format for Linguistic Annotation"). It is an XML-based standoff representation format, which has been designed to represent data with heterogeneous annotation layers produced by different tools. For visualization and querying of PAULA XML data, the database ANNIS can be used.
      Christian Chiarcos at work:
      PAULA will become POWLA and will be used for the representation of corpus annotations.
  20. Standardized Formats: Part 2 – the Web
      The bottom layer of the NLP2RDF stack can be reused:
      An ontology to represent Strings (formerly the SSO).
      In his latest book, Wikinomics, Don Tapscott explains deep changes in technology, demographics and business.
      • URIs to represent Strings e.g.
      • Relations between Strings: previous, next, sub, super
      • is a subString of the above
  21. Standardized Formats: Part 2 – the Web
      • RDFa allows for flexible in-line annotations
      • Multiple services can be integrated ad hoc
      • Multiple layers of annotation can be used
      • Full compatibility with POWLA
      • Trade-off between flexibility and speed
  22. Knowledge Acquisition
      Tiger Corpus Navigator
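The string ontology from the "Standardized Formats: Part 2" slides (URIs for Strings, plus the relations previous, next, sub and super) can be illustrated with a rough sketch. The offset-based URI scheme and the `http://example.org/doc1#` base used here are assumptions made for illustration; the slides do not show the actual URI pattern.

```python
import re
from dataclasses import dataclass

BASE = "http://example.org/doc1#"  # hypothetical document URI
TEXT = ("In his latest book, Wikinomics, Don Tapscott explains deep changes "
        "in technology, demographics and business.")


@dataclass(frozen=True)
class StringRef:
    """A substring of a document, identified by its character offsets."""
    text: str
    begin: int
    end: int

    @property
    def uri(self):
        # Assumed scheme: base URI plus begin/end offsets.
        return f"{BASE}offset_{self.begin}_{self.end}"

    @property
    def value(self):
        return self.text[self.begin:self.end]


def is_sub_string(inner, outer):
    """True if inner lies within outer and is not the same span."""
    contained = outer.begin <= inner.begin and inner.end <= outer.end
    return contained and (inner.begin, inner.end) != (outer.begin, outer.end)


def tokenize(text):
    """Whitespace tokenization that keeps character offsets."""
    return [StringRef(text, m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def sequence_triples(refs):
    """Emit next/previous relations between neighbouring strings."""
    triples = []
    for left, right in zip(refs, refs[1:]):
        triples.append((left.uri, "next", right.uri))
        triples.append((right.uri, "previous", left.uri))
    return triples


tokens = tokenize(TEXT)
sentence = StringRef(TEXT, 0, len(TEXT))
start = TEXT.index("Don Tapscott")
name = StringRef(TEXT, start, start + len("Don Tapscott"))

print(name.uri, "sub", sentence.uri)
for triple in sequence_triples(tokens[:3]):
    print(triple)
```

Because every annotation tool can compute the same offsets for the same text, independent tools would arrive at the same URIs in this scheme, which is what allows several annotation layers to be merged ad hoc.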
  23. Ontology Learning
      Johanna Völker – Learning Expressive Ontologies (LExO)
      # Example:
      # A fish is any aquatic vertebrate animal that is covered with scales,
      # and equipped with two sets of paired fins and several unpaired fins.
      #
      # [fish] subClassOf [any aquatic vertebrate animal that is covered…]
      # Construct { ?sub rdfs:subClassOf ?super } {
      Construct { ?sub owl:equivalentClass ?super } {
        ?is a penn:BePresentTense .
        ?is nlp:superToken ?is_any_aquatic_ .
        ?is_any_aquatic_ a olia:VerbPhrase .
        ?is_any_aquatic_ nlp:syntacticSubToken [ nlp:normUri ?super ] .
        ?animal nlp:cop ?is .
        ?animal nlp:nsubj ?fish .
        ?fish nlp:superToken [ nlp:normUri ?sub ] .
      }
  24. Standing on the shoulders of giants
      Christian Chiarcos, SFB632 - Uni Potsdam
      Johanna Völker, Uni Mannheim
      Markus Strohmaier, TU Graz
      Jens Lehmann, Uni Leipzig
      Sören Auer, Uni Leipzig
      Thank you for your attention