NLP2RDF Wortschatz and Linguistic LOD draft
Published in: Technology, Education

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
849
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
9
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • NLP2RDF: Integration of Data, Tools and Applications with RDF/OWL in the Areas of Text Mining and Linguistics
    PhD Thesis, Sebastian Hellmann
  • Extensive Topic – What is the core?
    Features for Machine Learning
    Which features do I need for a certain text mining task?
    An introductory example:
    Resources:
    • Face Recognition Tool that detects color of the eyes (brown, green, blue) and type of haircut (Vo-ku-hi-la, Mullet, GI Joe)
    • Database with Age and Occupation
    Goal: predict income of persons
    • Young students earn less than old CEOs.
    => Color of eyes and haircut probably irrelevant!
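The income example above can be made concrete with a small relevance check. The following is a minimal sketch, assuming information gain as the relevance measure (the slides do not name one) and using entirely made-up toy records; the field names are hypothetical.

```python
from collections import Counter
from math import log2

# Hypothetical toy records (eye color, haircut, age, occupation, income).
# All values are invented for illustration only.
FIELDS = ("eye_color", "haircut", "age", "occupation", "income")
RAW = [
    ("brown", "mullet", "young", "student", "low"),
    ("brown", "gi_joe", "young", "student", "low"),
    ("blue", "mullet", "young", "student", "low"),
    ("green", "gi_joe", "young", "student", "low"),
    ("brown", "mullet", "old", "ceo", "high"),
    ("brown", "gi_joe", "old", "ceo", "high"),
    ("blue", "mullet", "old", "ceo", "high"),
    ("green", "gi_joe", "old", "ceo", "high"),
]
PEOPLE = [dict(zip(FIELDS, row)) for row in RAW]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target="income"):
    """Reduction in target entropy obtained by splitting the rows on one feature."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Score every candidate feature against the prediction target.
gains = {f: information_gain(PEOPLE, f) for f in FIELDS if f != "income"}
```

On this toy data the face-recognition features carry no information about income, while age and occupation separate the two income classes completely, which is the intuition the slide states.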
  • Basic idea: a benchmarking framework
    Input:
    • Task specification
    • Text
    • Training/test data
    Output:
    • Tools and data required to solve the task
    Do I need POS tags to classify tourism documents?
    Prerequisites:
    • Tools and applications need a standardized interface
    • Data needs a standardized format
  • Basic idea: a benchmarking framework
    NLP2RDF stack
  • Basic idea: a benchmarking framework
    Google Code project was created
    • Stanford parser was integrated
    • Ontologies were found and integrated
    • Pipeline implemented
    • Plugin system implemented
    • Some results were achieved
    But…
    • Architecture not flexible enough (Pipeline)
    • Integration bound to Java
    • Data sources were not sufficient
    • Wikipedia/DBpedia too coarse-grained
    • Speed of integration too slow
  • Prerequisites
    One step back:
    Creation of datasets in RDF
    Data integration and linking of datasets
    Licences
    Standardized format for tool integration
    Acquisition of additional knowledge
  • Why RDF and OWL?
    RDF makes data integration easy: URIref, Linked Data
    OWL is based on Description Logics (Guarded Fragment)
    Availability of open datasets (access and licence)
    Diverse serializations for annotations: XML, Turtle, RDFa+XHTML
    Scalable tool support (Databases, Reasoning)
    If the only tool you have is a hammer, everything looks like a nail.
  • LOD Cloud - over 26 Billion Facts
    DBpedia is central:
    • Cross-domain
    • Crystallization point (early bird)
    Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
  • Simplified:
    • Circles are Database Tables
    • Links are HTTP Foreign Keys
  • Linked Data
    http://www4.wiwiss.fu-berlin.de/rdf_browser/?browse_uri=http%3A%2F%2Fdata.nytimes.com%2FN12930380387917339601
    Resembles database table
    Key-Value pairs
    Values can be:
    • Datatypes (Strings, Integers)
    • URIs pointing to subjects in the same table
    • URIs pointing to subjects in any other table
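The table analogy above can be sketched in a few lines. This is a minimal illustration, assuming each dataset is modelled as a dictionary of rows keyed by subject URI; every URI under example.org and every property name is a hypothetical placeholder, not taken from a real dataset.

```python
# Two "tables" (datasets). Each row is a subject URI with key-value pairs.
# All URIs and property names here are hypothetical.
TABLES = {
    "people": {
        "http://example.org/people/alice": {
            "name": "Alice",                                 # datatype value (string)
            "age": 30,                                       # datatype value (integer)
            "knows": "http://example.org/people/bob",        # URI in the same table
            "bornIn": "http://example.org/places/leipzig",   # URI in another table
        },
        "http://example.org/people/bob": {"name": "Bob", "age": 41},
    },
    "places": {
        "http://example.org/places/leipzig": {"label": "Leipzig"},
    },
}


def find_table(uri):
    """Return the name of the table that holds a row for this subject URI, if any."""
    for table, rows in TABLES.items():
        if uri in rows:
            return table
    return None


def classify_value(home_table, value):
    """Classify a value as a datatype, a same-table link, or a cross-table link."""
    if not (isinstance(value, str) and value.startswith("http://")):
        return "datatype"
    target = find_table(value)
    if target is None:
        return "unresolved link"
    return "same table" if target == home_table else "other table"


alice = TABLES["people"]["http://example.org/people/alice"]
kinds = {key: classify_value("people", value) for key, value in alice.items()}
```

The cross-table case is what the earlier slide calls an HTTP foreign key: the value is a URI that can be looked up in a dataset other than the one holding the row.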
  • SPARQL – optimizations for table joins
    All soccer players who played as goalkeeper for a club that has a stadium with more than 40,000 seats and who were born in a country with more than 10 million inhabitants
    http://tinyurl.com/2uhuow9
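The goalkeeper question above amounts to six triple patterns joined on shared variables plus two numeric filters. The sketch below evaluates such a join in plain Python with a naive nested-loop strategy; it is not how an optimized SPARQL engine works, and the triples, identifiers and property names are all invented for illustration.

```python
# Hypothetical toy triples (subject, predicate, object).
TRIPLES = [
    ("player:A", "position", "goalkeeper"),
    ("player:A", "team", "club:X"),
    ("player:A", "birthPlace", "country:Big"),
    ("player:B", "position", "goalkeeper"),
    ("player:B", "team", "club:Y"),
    ("player:B", "birthPlace", "country:Big"),
    ("player:C", "position", "striker"),
    ("player:C", "team", "club:X"),
    ("player:C", "birthPlace", "country:Big"),
    ("player:D", "position", "goalkeeper"),
    ("player:D", "team", "club:X"),
    ("player:D", "birthPlace", "country:Small"),
    ("club:X", "ground", "stadium:Large"),
    ("club:Y", "ground", "stadium:Tiny"),
    ("stadium:Large", "capacity", 60000),
    ("stadium:Tiny", "capacity", 5000),
    ("country:Big", "population", 80000000),
    ("country:Small", "population", 500000),
]

# Terms starting with "?" are variables; shared variables express the joins.
PATTERNS = [
    ("?player", "position", "goalkeeper"),
    ("?player", "team", "?club"),
    ("?club", "ground", "?stadium"),
    ("?stadium", "capacity", "?seats"),
    ("?player", "birthPlace", "?country"),
    ("?country", "population", "?inhabitants"),
]


def match(triples, pattern, binding):
    """Yield every extension of `binding` that makes `pattern` equal to some triple."""
    for triple in triples:
        new = dict(binding)
        for term, value in zip(pattern, triple):
            if isinstance(term, str) and term.startswith("?"):
                # Bind the variable, or reject if it is already bound differently.
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield new


def query(triples, patterns):
    """Join all patterns left to right (nested-loop join, no optimization)."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(triples, pattern, b)]
    return bindings


rows = query(TRIPLES, PATTERNS)
# The two FILTER conditions from the question.
goalkeepers = sorted({
    row["?player"]
    for row in rows
    if row["?seats"] > 40_000 and row["?inhabitants"] > 10_000_000
})
```

Each pattern plays the role of one table in a relational join, so the order in which patterns are evaluated is exactly the kind of decision a query optimizer would make.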
  • Creation of datasets: Wiktionary2RDF
    http://en.wiktionary.org/wiki/house
    • Covers 170 languages
    • Total of 10 million pages
    • 900,000 users
    • RDF dump will increase number of editors
    • Same properties as Wikipedia (stable identifiers)
    • Hundreds of Wiktionary parsers (especially for English)
    • Information is trapped in the Wiki
    • Structure changes make software obsolete
    Why try it again?
    • DBpedia Extraction Framework is very mature (5 years, 15 developers)
    • Configuration over code; templates will allow Wiktionarians to update parsers
    • Early contact with the community
  • Creation of datasets: Wortschatz
    Converted in 2009:
    Matthias Quasthoff, Sebastian Hellmann and Konrad Höffner:
    Standardized Multilingual Language Resources for the Web of Data:
    http://corpora.uni-leipzig.de/rdf
    3rd prize at the LOD Triplification Challenge, Graz, 2009
    What was missing?
    • Research questions
    • Use cases
    • Other datasets to link to!
    • Wikipedia not suited as a linking partner
    • No servers
  • Wiktionary, Wortschatz, OLiA can become the crystallization point for a Linguistic Linked Data Web
    Four major types:
    • Lexical Semantic Resources
    • Dictionaries
    • Corpora
    • Schemas/Ontologies
  • Interlinking Wortschatz: Research and Use Case
    Iterated co-occurrences can be done with SPARQL
    Wiktionary and Wortschatz can be loaded in the same database
    Interesting questions:
    • What is the overlap and coverage?
    • Which Wiktionary relation can be linked to which statistical relation?
    • Can we build tools that help Wiktionary editors (suggestions)?
    • Wiktionary links words across languages. Are there any similar patterns?
    • Can we validate the Wiktionary RDF dump with Wortschatz?
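To illustrate what iterating co-occurrences means, here is a minimal Python sketch of the logic (in SPARQL this would be a self-join over a co-occurrence relation). It assumes the common definition in which the co-occurrence sets of one order are treated as the "sentences" of the next order, and it uses plain set membership rather than a statistical significance measure. The sentences are invented.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrences(contexts):
    """Map each word to the set of words that share at least one context with it."""
    neighbours = defaultdict(set)
    for context in contexts:
        for a, b in combinations(sorted(set(context)), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return dict(neighbours)


def iterate(contexts, order):
    """Co-occurrence sets of the given order (order 1 = same-sentence co-occurrence)."""
    result = cooccurrences(contexts)
    for _ in range(order - 1):
        # The neighbour sets of the previous order become the contexts of the next.
        result = cooccurrences(result.values())
    return result


# Hypothetical toy corpus.
SENTENCES = [
    ["doctor", "treats", "patient"],
    ["nurse", "treats", "patient"],
]

first = iterate(SENTENCES, 1)
second = iterate(SENTENCES, 2)
```

In this toy corpus "doctor" and "nurse" never share a sentence, but they appear together in the first-order sets of "treats" and "patient", so the second iteration relates them. Relations of this kind are candidates for comparison with Wiktionary relations such as synonymy.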
  • Open Licences – Focus of LOD2 and OKFN
    http://ckan.net/
    CKAN is an open registry of data and content packages. Harnessing the CKAN software, this site makes it easy to find, share and reuse content and data, especially in ways that are machine automatable.
    Working Group on Open Data in Linguistics
    http://wiki.okfn.org/wg/linguistics
    • Founded in Nov 2010
    • 6-7 Members
    • Membership open, please join
  • Standardized Formats: Part 1 – Corpora
    http://www.sfb632.uni-potsdam.de/~d1/paula/doc/
    PAULA XML is the Potsdamer Austauschformat für linguistische Annotation ("Potsdam Interchange Format for Linguistic Annotation"). It is an XML-based standoff representation format, which has been designed to represent data with heterogeneous annotation layers produced by different tools. For visualization and querying of PAULA XML data, the database ANNIS can be used.
    Christian Chiarcos at work:
    PAULA will become POWLA and will be used for representation of corpus annotations.
  • Standardized Formats: Part 2 – the Web
    Bottom layer of the NLP2RDF stack can be reused:
    An ontology to represent Strings (formerly the SSO).
    In his latest book, Wikinomics, Don Tapscott explains deep changes in technology, demographics and business.
    • URIs to represent Strings, e.g. http://nlp2rdf.org/example/Don_Tapscott
    • Relation between Strings: previous, next, sub, super
    • http://nlp2rdf.org/example/Don is a subString of the above
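The string URIs and relations listed above can be sketched as follows. The namespace and the Don_Tapscott example come from the slide; the minting rule (spaces replaced by underscores) and the property names `subStringOf`, `superStringOf`, `previous` and `next` are assumptions made for this illustration, since the slide does not spell out the actual ontology terms.

```python
from urllib.parse import quote

PREFIX = "http://nlp2rdf.org/example/"  # namespace shown on the slide


def string_uri(text):
    """Mint a URI for a string (assumed rule: spaces become underscores)."""
    return PREFIX + quote(text.replace(" ", "_"))


def describe(phrase):
    """Return triples relating a phrase to its whitespace-separated tokens."""
    tokens = phrase.split()
    whole = string_uri(phrase)
    triples = []
    for i, token in enumerate(tokens):
        uri = string_uri(token)
        # Hypothetical property names for the sub/super relation.
        triples.append((uri, "subStringOf", whole))
        triples.append((whole, "superStringOf", uri))
        # Hypothetical property names for the previous/next relation.
        if i > 0:
            triples.append((uri, "previous", string_uri(tokens[i - 1])))
        if i + 1 < len(tokens):
            triples.append((uri, "next", string_uri(tokens[i + 1])))
    return triples


triples = describe("Don Tapscott")
```

A content-based URI like this cannot distinguish two occurrences of the same string in one text, so a practical scheme would probably also need position information; that detail is outside what the slide states.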
  • Standardized Formats: Part 2 – the Web
    • RDFa allows for flexible in-line annotations
    • Multiple services can be integrated ad hoc
    • Multiple layers of annotation can be used
    • Full compatibility with POWLA
    • Trade-off between flexibility and speed
  • Knowledge Acquisition
    Tiger Corpus Navigator
  • Ontology Learning
    Johanna Völker – Learning Expressive Ontologies (LExO)
    # Example:
    # A fish is any aquatic vertebrate animal that is covered with scales,
    # and equipped with two sets of paired fins and several unpaired fins.
    #
    # [fish] subClassOf [any aquatic vertebrate animal that is covered…]
    # (prefix declarations for rdfs, owl, penn, olia and nlp are not shown on the slide)
    # Construct {?sub rdfs:subClassOf ?super} {
    Construct {?sub owl:equivalentClass ?super} {
    ?is a penn:BePresentTense .
    ?is nlp:superToken ?is_any_aquatic_ .
    ?is_any_aquatic_ a olia:VerbPhrase .
    ?is_any_aquatic_ nlp:syntacticSubToken [ nlp:normUri ?super ] .
    ?animal nlp:cop ?is .
    ?animal nlp:nsubj ?fish .
    ?fish nlp:superToken [ nlp:normUri ?sub ] .
    }
  • Standing on the shoulders of giants
    Christian Chiarcos, SFB632 - Uni Potsdam
    Johanna Völker, Uni Mannheim
    Markus Strohmaier, TU Graz
    Jens Lehmann, Uni Leipzig
    Sören Auer, Uni Leipzig
    Thank you for your attention