NIF 2.0 Tutorial: Content Analysis and the Semantic Web

This tutorial is held by Sebastian Hellmann from the NLP2RDF Group at AKSW:

The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations. NIF consists of specifications, ontologies and software, which are combined under the version identifier “NIF 2.0”. Links:

http://nlp2rdf.org
http://persistence.uni-leipzig.org/nlp2rdf/



Transcript

  • 1. NIF Tutorial – 2013/09/24 – Page 1 http://lod2.eu NIF 2.0 Tutorial: Content Analysis and the Semantic Web. Sebastian Hellmann, AKSW, Universität Leipzig. LOD2: Creating Knowledge out of Interlinked Data. http://nlp2rdf.org http://lod2.eu http://slideshare.net/kurzum
  • 2. Introduction
      • Sebastian Hellmann: researcher working on the LOD2 EU project
      • AKSW: Agile Knowledge and the Semantic Web research group in Leipzig - http://aksw.org
      • InfAI: Institute for Applied Informatics - http://infai.org
      • All demos are available at: http://nlp2rdf.org/leipzig-24-9-2013
  • 3. Introduction
  • 4. Barriers to NLP. End users have tasks for NLP, but each new tool is a challenge:
      • How do I download and start it?
      • What kind of annotations does it use?
      • How well does it perform (on my domain)?
      • If it performs badly, are there any alternatives? How can I find them?
      • Is it open source?
      • Lots of know-how is needed to exploit NLP.
      • Lots of data is needed to exploit NLP.
  • 5. The Semantic Gap
  • 7. From a walled garden to an interoperable infrastructure
      • Part 1: Exploiting free, open and interoperable (FOI) language resources
      • Part 2: Connecting text to these resources
      • Part 3: Tools, demos, infrastructure
  • 8. From a walled garden to an interoperable infrastructure
      • Part 1: Exploiting free, open and interoperable (FOI) language resources
  • 9. http://lod-cloud.net Linguistic/NLP data is currently filed under “cross-domain”.
  • 10. Linked Open Data (http://lod-cloud.net)
      • All datasets provide open access to individual records via HTTP.
      • Many are free (no payment required, as in royalty-free).
      • Some are openly licensed, e.g. CC-0 or CC-BY-SA.
      • Open access also applies to published HTML on the WWW, but in LOD the data itself is published unrendered via RDF.
  • 11. From a walled garden to an interoperable infrastructure. Question:
      • Who knows how to add a new bubble to the LOD cloud?
  • 12. From a walled garden to an interoperable infrastructure. Who knows how to add a new bubble to the LOD cloud?
      • http://datahub.io/group/linguistics
      • https://github.com/jmccrae/llod-cloud.py
      • http://validator.lod-cloud.net/validate.php
  • 15. FOI data. Question:
      • What are the most important data sets and ontologies for NLP?
      • Who has used what?
  • 16. FOI data 1: Wikipedia / DBpedia. Analysis of mentions of Wikipedia / DBpedia at LREC 2012:
      • https://www.google.com/webhp?q=site:http%3A%2F%2Fwww.lrec-conf.org%2 → 163 papers
      • https://www.google.com/webhp?q=site:http%3A%2F%2Fwww.lrec-conf.org%2 → 24 papers
  • 17. FOI data 1: Wikipedia / DBpedia (http://wiki.dbpedia.org/Datasets/NLP)
      • Training data for NLP, e.g. URI, surrounding text, surface form (sf)
      • Probabilities:
          • P(URI|sf): probability that “apple” in text refers to wikipedia:Apple_Inc.
          • P(sf|URI): probability that wikipedia:Apple_Inc. is written as “apple” in text
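The two probabilities on this slide can be estimated from link counts. The sketch below is only an illustration: the (surface form, URI) pairs are made up, and it reads P(A|B) in the usual way, as the probability of A given B. It is not the procedure used to build the DBpedia NLP datasets.

```python
from collections import Counter

# Hypothetical (surface form, URI) pairs, as might be harvested from wiki links.
pairs = [
    ("apple", "wikipedia:Apple_Inc."),
    ("apple", "wikipedia:Apple_Inc."),
    ("apple", "wikipedia:Apple"),
    ("Apple Inc.", "wikipedia:Apple_Inc."),
]

pair_counts = Counter(pairs)
sf_counts = Counter(sf for sf, _ in pairs)
uri_counts = Counter(uri for _, uri in pairs)


def p_uri_given_sf(uri, sf):
    """Relative frequency with which the surface form links to this URI."""
    return pair_counts[(sf, uri)] / sf_counts[sf]


def p_sf_given_uri(sf, uri):
    """Relative frequency with which this URI is written as the surface form."""
    return pair_counts[(sf, uri)] / uri_counts[uri]


print(p_uri_given_sf("wikipedia:Apple_Inc.", "apple"))
print(p_sf_given_uri("apple", "wikipedia:Apple_Inc."))
```

With real data the counts would come from the dataset rather than a hand-written list, and some smoothing would probably be needed for rare surface forms.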
  • 18. FOI data: Wikipedia / DBpedia. Available data for “Sodium”:
      • http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString=sodium
      • http://dbpedia.org/snorql
          select ?labels where { <http://dbpedia.org/resource/Sodium> rdfs:label ?labels . } LIMIT 100
          select ?altlabel where { ?redirect dbpedia-owl:wikiPageRedirects <http://dbpedia.org/resource/Sodium> . ?redirect rdfs:label ?altlabel . } LIMIT 100
      • http://lcl.uniroma1.it/babelnet/explore.jsp?word=sodium&lang=EN
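The two queries on this slide rely on prefixes that the SNORQL page predefines. A minimal sketch of how they might be assembled for any resource and turned into a SPARQL protocol GET URL follows. The endpoint http://dbpedia.org/sparql and the helper names are assumptions for illustration, and no request is actually sent.

```python
from urllib.parse import urlencode

# Prefix declarations the slide's queries depend on.
PREFIXES = (
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>\n"
)


def labels_query(resource):
    """Labels attached directly to the resource."""
    return (
        PREFIXES
        + "select ?labels where { <%s> rdfs:label ?labels . } LIMIT 100" % resource
    )


def redirect_labels_query(resource):
    """Labels of pages that redirect to the resource (alternative names)."""
    return (
        PREFIXES
        + "select ?altlabel where { "
        + "?redirect dbpedia-owl:wikiPageRedirects <%s> . " % resource
        + "?redirect rdfs:label ?altlabel . } LIMIT 100"
    )


def request_url(endpoint, query):
    """SPARQL protocol GET request: the query travels in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


sodium = "http://dbpedia.org/resource/Sodium"
print(request_url("http://dbpedia.org/sparql", labels_query(sodium)))
print(request_url("http://dbpedia.org/sparql", redirect_labels_query(sodium)))
```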
  • 19. Wiktionary2RDF: Mediator Wrapper. http://dbpedia.org/Wiktionary
  • 20. http://dbpedia.org/Wiktionary
  • 21. http://dbpedia.org/Wiktionary
  • 22. Wiktionary2RDF: Mediator Wrapper. http://dbpedia.org/Wiktionary Mediator Lemon
  • 23. Wiktionary2RDF: Mediator Wrapper
      • http://lcl.uniroma1.it/babelnet/explore.jsp?word=sodium&lang=EN
      • https://en.wiktionary.org/wiki/sodium#English
      • http://wiktionary.dbpedia.org/resource/sodium
  • 24. Lemon Ontology - http://lemon-model.net
  • 25. Lemon Ontology - http://lemon-model.net
      • IntersectiveDataPropertyAdjective("extinct", dbpedia:conservationStatus, "EX")
      • IntersectiveDataPropertyAdjective("endangered", dbpedia:conservationStatus, "EN")
      • https://github.com/cunger/lemon.dbpedia
      • Christina Unger, John McCrae, Sebastian Walter, Sara Winter and Philipp Cimiano (2013): A lemon lexicon for DBpedia. NLP & DBpedia Workshop
  • 26. From a walled garden to an interoperable infrastructure
      • Part 2: Connecting text to these resources
  • 27. From a walled garden to an interoperable infrastructure. https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki
  • 28. From a walled garden to an interoperable infrastructure. Overview of existing tools:
      • http://en.wikipedia.org/wiki/Knowledge_extraction#Tools
  • 29. From a walled garden to an interoperable infrastructure. A developer's nightmare:
      • All tools belong to a similar class of NLP tools → Wikifier or Named Entity Linking, SOA principle
    But they all have:
      • Heterogeneous output formats (JSON, XML)
      • Heterogeneous API parameters
      • Heterogeneous ways of annotating text:
          • Some remove HTML internally, so the offsets are not usable
          • Some use byte offsets instead of character offsets
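The byte-versus-character offset mismatch is easy to reproduce. The sketch below uses a made-up sentence and assumes UTF-8 as the byte encoding. It shows that the two offsets diverge as soon as a non-ASCII character precedes the annotated span.

```python
text = "Universität Leipzig hosts AKSW."   # hypothetical input sentence
target = "Leipzig"

# Offset counted in characters (code points), as Python's str.index does.
char_offset = text.index(target)

# Offset counted in bytes of the UTF-8 encoding; "ä" takes two bytes.
byte_offset = text.encode("utf-8").index(target.encode("utf-8"))

print(char_offset, byte_offset)

# Using the byte offset as if it were a character offset selects the wrong span.
wrong_span = text[byte_offset:byte_offset + len(target)]
right_span = text[char_offset:char_offset + len(target)]
print(repr(wrong_span), repr(right_span))
```

A client that consumes several tools therefore has to know which unit each tool reports before it can merge their annotations.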
  • 30. From a walled garden to an interoperable infrastructure. Demo:
      • http://rdface.aksw.org/new/tinymce/examples/rdface.html
  • 31. ITS 2.0 - http://www.w3.org/TR/its20/ The Internationalization Tag Set (ITS) 2.0 enhances the foundation to integrate automated processing of human language into core Web technologies.
      • Currently in Last Call
      • Driven by the localization industry
      • Embeds translation aids into HTML and XML
      • Robust way to encode NLP information in HTML
      • ITS 2.0 describes 20 data categories → ontology
  • 32. NIF overview. Summary:
      • Motivated the Walled Garden problem
      • Overview of the emerging Web of Language resources
      • Motivated the NLP tool heterogeneity problem
      • Introduction of ITS 2.0 (use case for NIF)
      • Now: NIF 2.0
  • 33. NIF Overview. The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations.
      • Reuse of existing standards such as RDF, OWL 2, the PROV Ontology, LAF (ISO 24612), Unicode and RFC 5147
      • Standardize access parameters, annotations (e.g. tokenization), validation and log messages.
      • A NIF workflow, however, cannot provide better performance (F-measure, speed) than a properly configured UIMA or GATE pipeline with the same components.
      • Lower entry barrier, easy data integration, reusability of tools and conceptualisation, off-the-shelf solutions for common tasks.
  • 34. Relation of NIF to UIMA and GATE
      • A Formal Framework for Linguistic Annotation (2000) by Steven Bird and Mark Liberman
      • Take-home message: generic annotation formats should be based on graphs
      • Ontologies in NIF (e.g. OLiA, lemon) can be hard-compiled for internal use (as is done in Stanbol)
      • WP3 Task 3.2, Community work: NLP2RDF. Not primarily aimed at increasing features or performance (F-measure)
  • 35. WP3 Task 3.2: NIF overview
  • 36. NIF Overview
      • NIF turns out to have a unique selling proposition regarding NLP and RDF.
      • NIF will be the recommended RDF conversion of the Internationalization Tag Set 2.0 of the W3C (ITS 2.0) - http://www.w3.org/TR/its20/
      • No alternative RDF vocabulary was available for this conversion.
  • 37. WP3 Task 3.2, Community work: NLP2RDF. RDFa parsers lose all provenance information: <http://examples.com/books/wikinomics> dc:title "Wikinomics" . https://en.wikipedia.org/wiki/RDFa
  • 38. NIF Overview. Available resources: http://persistence.uni-leipzig.org/nlp2rdf/ Disclaimer: migration to the online presence is still ongoing, but there are 15 scientific publications, e.g.:
      • Integrating NLP using Linked Data. Sebastian Hellmann, Jens Lehmann, Sören Auer, and Martin Brümmer. 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia (2013) - http://svn.aksw.org/papers/2013/ISWC_NIF/public.pdf
  • 39. NIF Basics. Question:
      • What is a String?
  • 40. NIF Basics: Unicode. Counting strings is more difficult than it seems:
      • Three ways to count Unicode:
          • Code units
          • Code points
          • Graphemes
      • Encoding:
          • UTF-8, UTF-16, UTF-32
  • 41. Unicode
      • Code Unit. The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form.
      • Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF (hexadecimal). Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a character, in any coded character set.
      • Unicode Normal Form C: http://unicode.org/reports/tr15/#Norm_Forms
  • 42. Unicode Normal Form C
      • Recommendation for RDF literals
      • http://unicode.org/reports/tr15/#Norm_Forms
  • 43. Unicode
      • NIF uses Unicode Normal Form C
      • NIF counts in code points
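The difference between code units and code points, and the effect of Normal Form C, can be checked with a few lines of Python. This is a sketch with hand-picked sample strings, not part of the NIF tooling. Python's standard library has no grapheme counter, so graphemes are left out.

```python
import unicodedata


def count_units(s):
    """Count one string in code points and in the code units of each encoding form."""
    return {
        "code_points": len(s),                                # Python str is indexed by code point
        "utf8_code_units": len(s.encode("utf-8")),            # 8-bit units
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,  # 16-bit units
        "utf32_code_units": len(s.encode("utf-32-le")) // 4,  # 32-bit units
    }


decomposed = "e\u0301"                               # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9
clef = "\U0001D11E"                                  # MUSICAL SYMBOL G CLEF, outside the BMP

for label, s in [("decomposed", decomposed), ("NFC", composed), ("clef", clef)]:
    print(label, count_units(s))
```

The decomposed and the NFC string render identically but differ in code point count, which is why fixing the normal form matters before offsets are compared. The clef shows that even after normalization a UTF-16 code unit count can exceed the code point count.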
  • 44. Demo. Sadly, there are still implementation problems:
      • Java length() vs. PHP strlen() function
      • curl --data-urlencode i="대" -d f=text "http://nlp2rdf.lod2.eu/nif-ws.php"
      • The Korean character is URL-encoded (%EB%8C%80) and counted as 3 characters (not NFC in PHP)
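The count of 3 matches the number of UTF-8 bytes in the URL-encoded character. The sketch below reproduces the numbers in Python as an illustration. It does not call the demo web service, and the remarks about Java and PHP in the comments refer to those functions' general behaviour, not to the service's code.

```python
from urllib.parse import quote

char = "\ub300"  # the Korean syllable from the curl example

print(quote(char))                         # %EB%8C%80
print(len(char))                           # 1 code point, the unit NIF counts in
print(len(char.encode("utf-8")))           # 3 bytes, what a byte-based strlen() reports
print(len(char.encode("utf-16-le")) // 2)  # 1 UTF-16 code unit, what Java's length() reports
```

For characters outside the Basic Multilingual Plane, Java's length() would also differ from the code point count, so neither function can be used for NIF offsets without care.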
  • 45. Context. Now some RDF (finally):
      • Note that in NIF the document is not the same as the content of the document.
      • Two different documents can have the same content, so they must not share the same URI.
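The RDF shown on this slide is not part of the transcript. As a rough sketch of the idea, the code below prints Turtle for two context resources that share the same string but get different URIs because they belong to different documents. The example.org URLs and the helper names are hypothetical. The nif: terms (nif:Context, nif:isString, nif:beginIndex, nif:endIndex, nif:sourceUrl) and the #char=begin,end fragment are used here as they appear in the NIF 2.0 core ontology, and should be checked against the specification before reuse.

```python
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
XSD = "http://www.w3.org/2001/XMLSchema#"


def context_uri(doc_prefix, text):
    # RFC 5147-style fragment; len() on a Python str counts code points.
    return f"{doc_prefix}#char=0,{len(text)}"


def context_turtle(doc_prefix, text, source_url):
    # Minimal escaping only; assumes the text contains no line breaks.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    end = len(text)
    return (
        f"@prefix nif: <{NIF}> .\n"
        f"@prefix xsd: <{XSD}> .\n\n"
        f"<{context_uri(doc_prefix, text)}>\n"
        f"    a nif:Context ;\n"
        f'    nif:isString "{escaped}" ;\n'
        f'    nif:beginIndex "0"^^xsd:nonNegativeInteger ;\n'
        f'    nif:endIndex "{end}"^^xsd:nonNegativeInteger ;\n'
        f"    nif:sourceUrl <{source_url}> .\n"
    )


content = "My favourite actress is Natalie Portman."

# Same content, two hypothetical documents: the context URIs must differ.
print(context_turtle("http://example.org/doc-a", content, "http://example.org/a.html"))
print(context_turtle("http://example.org/doc-b", content, "http://example.org/b.html"))
```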
  • 46. Annotations
  • 47. Tokenization. Christian Chiarcos, Julia Ritz, Manfred Stede: By all these lovely tokens... Merging conflicting tokenizations. Language Resources and Evaluation 46(1): 53-74 (2012)
  • 48. NIF Demo: http://nlp2rdf.lod2.eu/demo.php
  • 49. Validation over specification
      • SPARQL queries produce (find) errors
      • http://persistence.uni-leipzig.org/nlp2rdf/ontologies/testcase/lib/nif-2.0-suite.t
      • RLOG: An RDF Logging Ontology
      • ./validate.jar -i nif-erroneous-model.ttl -t file
      • Demo → character count
      • Demo → all errors
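The actual validator runs SPARQL test cases and reports through RLOG. Neither is reproduced here. The sketch below only illustrates, in plain Python, the kind of character count consistency check such a test might express. The function name and the error messages are invented for the example.

```python
def check_annotation(context, begin, end, anchor_of):
    """Return a list of error messages for one annotated substring.

    context   -- the full string of the context resource
    begin/end -- offsets counted in code points
    anchor_of -- the substring the annotation claims to cover
    """
    errors = []
    if not (0 <= begin <= end <= len(context)):
        errors.append("indices are outside the context string")
    elif context[begin:end] != anchor_of:
        errors.append("anchored string does not match the context substring")
    return errors


context = "My favourite actress is Natalie Portman."

print(check_annotation(context, 24, 39, "Natalie Portman"))  # consistent annotation
print(check_annotation(context, 25, 39, "Natalie Portman"))  # off-by-one begin index
print(check_annotation(context, 0, 100, "x"))                # end index past the text
```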
  • 50. NIF Demo: http://nlp2rdf.lod2.eu/demo.php
  • 51. NIF
  • 52. NIF
      • http://www.w3.org/TR/its20/#conversion-to-nif
      • http://www.w3.org/TR/its20/#nif-backconversion
  • 53. Reasoning
      • Demo
      • Load Terminological model or Inference Model
  • 54. Thanks for your attention. Open Community: all feedback is welcome! http://slideshare.net/kurzum Websites: http://dbpedia.org http://nlp2rdf.org http://lod2.eu All demos are available at: http://nlp2rdf.org/leipzig-24-9-2013