NIF 2.0 PhD thesis intermediate report

Public part of a presentation held at our internal research group meeting

Published in: Technology, Education

  1. BIS – 2013/04/15 – http://lod2.eu
     Creating Knowledge out of Interlinked Data
     NLP Interchange Format (NIF) 2.0: PhD thesis intermediate report
     Sebastian Hellmann, AKSW, Universität Leipzig
     http://nlp2rdf.org, http://lod2.eu, http://slideshare.net/kurzum
     DISCLAIMER: this presentation is work in progress; the example RDF is outdated.
  2. NLP Interchange Format 2.0
  3. NLP Interchange Format 2.0
     The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations.
     • NIF 2.0 will be published in 6-8 weeks
     • Highly likely to become the de facto standard for modelling RDF tool output in the NLP domain
  4. Introduction
     • Components have pre- and postconditions
     • Automatic configuration is theoretically possible, but in practice it takes a lot of manual work
  5. Introduction
     • Components have pre- and postconditions
     • Automatic configuration is theoretically possible, but in practice it takes a lot of manual work
     • Huge potential to save time and money at the interfaces
  6. Problem analysis
     Core problems:
     1. Too much heterogeneity
     2. Almost no standards available
     3. No open collaboration
     4. Difficult and large domain
  7. Problem analysis
     Technical heterogeneity
     • Technologies: XML, relational databases, CSV, DOC, PDF
       • Similar to other domains
     • Formats: Negra, CoNLL, GrAF, Paula, CAS (UIMA), Penn
       • Virtually every tool has implemented readers for these 5-6 formats plus its own serialization
     • Programming languages: Java, Python, ...
       • Java predominates
  8. Problem analysis
     Domain heterogeneity
     • Multilingualism
       • Over 100 part-of-speech tags (several for each language)
       • No open mappings exist
     • About 20 different tasks listed at: http://en.wikipedia.org/wiki/Natural_language_processing#Major_tasks_in_NLP
     • Natural language is a difficult topic:
       • The roulette dealer siad: “Rien ne va plus!”
         – 8 words, 4 French, 4 English, one spelling mistake; impossible to decide the language of the whole.
       • Ban on Nude Dancing on Governor's Desk
  9. Problem analysis
  10. Problem analysis
      Open collaboration
      • LAF/GrAF is a recently released ISO standard
        • But it is not open (60 euros to view the document)
        • Not in RDF (the main requirement for any Semantic Web tool)
      • Large frameworks tend to be only “inward” compatible
        • UIMA advocates say: “Why don't you just use UIMA?”
        • GATE advocates: “Integrate it into GATE!”
        • Generally, a large time investment and lock-in
  11. Problem analysis
      Summary: hardly any reusability
      • Free software (as in free beer), but no open licenses
      • No standards and no mappings
      • Integration is hard-wired (you have to write software)
  12. NIF Overview
      • Definition for text normalization + URI schemes (give URIs to strings)
      • NIF Core Ontology: default vocabulary for the most frequently used annotations
      • Predefined modules for most use cases
      • Infrastructure
        • for open collaboration / discussion
        • persistent hosting
        • validation and demo services
      • Reference implementation
      • Data conversion
  13. Text Normalization + URI Schemes
  14. Text Normalization + URI Schemes
      NIF 1.0:
      http://www.w3.org/DesignIssues/LinkedData.html#offset_717_729
      NIF 2.0 uses RFC 5147 as base form:
      http://www.w3.org/DesignIssues/LinkedData.html#char=717,729
      User extensions possible:
      http://www.w3.org/DesignIssues/LinkedData.html#your_own_scheme
      (but you have to link to documentation on how it was created)
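The NIF 1.0 and NIF 2.0 forms on this slide differ only in how the two character offsets are written into the fragment. A minimal Python sketch of that difference follows. The function names are hypothetical and not part of any NIF library, and the offsets are simply passed through as given on the slide:

```python
def nif1_offset_uri(document_uri: str, begin: int, end: int) -> str:
    """Build a NIF 1.0 style string URI, e.g. ...#offset_717_729."""
    return f"{document_uri}#offset_{begin}_{end}"


def rfc5147_char_uri(document_uri: str, begin: int, end: int) -> str:
    """Build an RFC 5147 style string URI, e.g. ...#char=717,729."""
    return f"{document_uri}#char={begin},{end}"


def parse_char_fragment(uri: str) -> tuple[str, int, int]:
    """Split a '#char=begin,end' URI back into document URI and offsets."""
    base, _, fragment = uri.partition("#")
    if not fragment.startswith("char="):
        raise ValueError(f"not a char= fragment: {uri!r}")
    begin, end = fragment[len("char="):].split(",")
    return base, int(begin), int(end)


DOC = "http://www.w3.org/DesignIssues/LinkedData.html"
old_style = nif1_offset_uri(DOC, 717, 729)
new_style = rfc5147_char_uri(DOC, 717, 729)
```

Only the plain `char=begin,end` range is handled here; RFC 5147 also defines line-based fragments and integrity checks, which this sketch ignores.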
  15. As a Web Service
      curl \
        --data-urlencode prefix="http://prefix.given.by/theClient#" \
        --data-urlencode input="[...]" \
        http://nlp2rdf.lod2.eu/demo/NIFStanfordCore
      Optional extra parameter: --data-urlencode source="http://www.w3.org/DesignIssues/LinkedData.html"
      The new namespace is http://persistence.uni-leipzig.org/nlp2rdf/nif-core#
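The same call can be prepared from Python with the standard library alone. This is a sketch under assumptions: the parameter names `prefix`, `input` and `source` and the endpoint URL are taken from the slide, the endpoint may no longer be online, and `build_nif_request` and the example sentence are made up for illustration. The request is only built here, not sent:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint as given on the slide; it may no longer be reachable.
SERVICE_URL = "http://nlp2rdf.lod2.eu/demo/NIFStanfordCore"


def build_nif_request(text, prefix, source=None):
    """Prepare a form-encoded POST request, like curl --data-urlencode."""
    params = {"prefix": prefix, "input": text}
    if source is not None:
        # Optional parameter, shown in parentheses on the slide.
        params["source"] = source
    body = urlencode(params).encode("utf-8")
    return Request(SERVICE_URL, data=body, method="POST")


request = build_nif_request(
    "This is an example sentence.",  # hypothetical input text
    "http://prefix.given.by/theClient#",
)
# To actually send it: urllib.request.urlopen(request).read()
```

Building the request separately from sending it keeps the example usable offline and makes the encoded body easy to inspect via `request.data`.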
  16. Overview of Ontologies
      Ontologies:
      • NIF Core Ontology (URI Scheme, String, Context, but also Token, Sentence, lemma, stem, etc.) for frequently used annotations
      • Simple Error Ontology to describe errors (fatal, message, timestamp)
      • Vocabulary Modules for each purpose, ontology or project
  17. Logical Modularity
      Each ontology consists of three sets of axioms:
      - Terminology model (definitions)
      - Inference model (especially transitivity)
      - Validation model (consistency)
      1) nif-core.ttl
      2) nif-core-inf.ttl imports 1
      3) nif-core-val.ttl imports 1 and 2
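The import chain on this slide determines which files a client has to load for a given level of reasoning. A small Python sketch of resolving it, assuming only the three file names and import relations listed above; `IMPORTS` and `files_to_load` are illustrative names, not part of the NIF tooling:

```python
# Import relations as listed on the slide.
IMPORTS = {
    "nif-core.ttl": [],
    "nif-core-inf.ttl": ["nif-core.ttl"],
    "nif-core-val.ttl": ["nif-core.ttl", "nif-core-inf.ttl"],
}


def files_to_load(entry):
    """Return the entry file plus everything it imports, dependencies first."""
    ordered = []

    def visit(name):
        if name in ordered:
            return  # already scheduled
        for dependency in IMPORTS[name]:
            visit(dependency)
        ordered.append(name)

    visit(entry)
    return ordered


for_validation = files_to_load("nif-core-val.ttl")
for_inference = files_to_load("nif-core-inf.ttl")
```

A client that only needs the terminology loads `nif-core.ttl` alone, while validation pulls in all three models.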
  18. Granularity Modularity
      NIF simple:
      • Only one truth
      • Easy to understand and to query
      • Least amount of triples
      NIF + Stanbol (Apache project):
      • Several ranked alternatives
      • Provenance of annotations
      • In collaboration with Apache Stanbol
      Open Annotation (W3C group):
      • Rich model
      • Not only text, but everything (images)
      Well-defined conversions between the different levels:
      • Towards the richer models: more triples, more complexity, worse usability, lossless conversion
      • Towards the simpler models: easier queries, higher performance, lossy conversion
  19. Interoperability
      Structural interoperability:
      - URI schemes provide normalization
      - RDF provides the graph data model
      - OWL provides the logical model
      Conceptual interoperability:
      - NIF Core Ontology and mappings to the most frequently used annotations, e.g. lemma, stem
      - Vocabulary Modules to include other terminologies and ontologies
  20. NIF 2.0 tries to be compatible with (Vocabulary Module)
      • ITS 2.0
      • FISE, used in Apache Stanbol (IKS EU project)
      • LAF/GrAF XML: ISO standard, recently published
      • Fragment identifiers by IETF and W3C
      • Lemon ontology from the Monnet EU project
      • NERD ontology from EURECOM and the LinkedTV EU project
      • XPointer/XPath URI scheme
      • Open Annotation
      • ISOcat
  21. Vocabulary Module: OLiA
      • Tibeto-Burman languages: http://purl.org/olia/tibet.owl#VNst
      • Russian TreeTagger: http://purl.org/olia/russ.owl#partizip_prt_sg_neut_passiv_gen_langform
      • German STTS: http://purl.org/olia/stts.owl#VAPP
      • English Penn: http://purl.org/olia/penn.owl#VBG
      → all map to http://purl.org/olia/olia.owl#NonFiniteVerb
      The Ontologies of Linguistic Annotation (OLiA) contain mappings for over 50 tagsets (free and open, CC-BY)
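The point of the mapping is that tool-specific tags from different languages become comparable once they resolve to a shared OLiA category. A toy Python sketch using only the four tag URIs from this slide follows. In OLiA itself the links are expressed as ontology axioms, so the hand-written dictionary and the `same_category` helper are simplifying assumptions:

```python
OLIA_NON_FINITE_VERB = "http://purl.org/olia/olia.owl#NonFiniteVerb"

# Hand-copied from the slide; a real client would read this from the OLiA ontologies.
TAG_TO_OLIA = {
    "http://purl.org/olia/tibet.owl#VNst": OLIA_NON_FINITE_VERB,
    "http://purl.org/olia/russ.owl#partizip_prt_sg_neut_passiv_gen_langform": OLIA_NON_FINITE_VERB,
    "http://purl.org/olia/stts.owl#VAPP": OLIA_NON_FINITE_VERB,
    "http://purl.org/olia/penn.owl#VBG": OLIA_NON_FINITE_VERB,
}


def same_category(tag_a, tag_b):
    """True if both tags are known and map to the same OLiA category."""
    category_a = TAG_TO_OLIA.get(tag_a)
    category_b = TAG_TO_OLIA.get(tag_b)
    return category_a is not None and category_a == category_b


german_vs_english = same_category(
    "http://purl.org/olia/stts.owl#VAPP",
    "http://purl.org/olia/penn.owl#VBG",
)
```

A query for non-finite verbs can then be written once against the OLiA category instead of once per tagset.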
  22. Conceptual Interoperability
      NIF can be extended by Vocabulary Modules
      OLiA: http://purl.org/olia
  23. NIF 2.0 Infrastructure for adoption
      • Java/Maven implementation
      • PHP implementation
      • Reference implementations: DBpedia Spotlight, Stanford Parser, Korean POS tagger, Keyword Search
      • Wiki: http://wiki.nlp2rdf.org
      • Validators
      • Code generators (convert vocabulary modules to code stubs)
      • NIF is free and open (CC-0 / CC-BY / Apache)
      • All ontologies will be hosted persistently by Universität Leipzig
        • http://persistence.uni-leipzig.org/nlp2rdf/
  24. Evaluation 1
      • Huge collection of use cases
        • e.g. Ali wants to exchange different NLP services for RDFace
        • LOD2 from Wolters Kluwer
      • A selection will be implemented. Assumption:
        • NIF is good if it fulfills many use cases
  25. Evaluation 2
      • There are about 10 to 20 third-party implementations
  26. Evaluation 3
      Analysis of existing frameworks and formats. Criteria:
      • Convertibility (adequacy)
        • Do the graph models match?
      • Coverage
        • Quantitative analysis of used annotations
        • Does NIF Core provide terms for the most common annotations? Are there any gaps?
  27. Data Conversion
  28. Data Conversion
  29. Data Conversion
      Data is available as free, open, interoperable (FOI) language resources at http://linguistics.okfn.org/resources/llod/ (work in progress)
  30. Critical judgement
      The project has a very good impact:
      • Many adopters
      • Industrial uptake
      • Inclusion in a W3C standard, ITS 2.0: http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html
      • Several projects involved as stakeholders (LOD2, Monnet, ...)
      • Several motivated open-source developers
      • Funding is coming in
  31. Critical judgement
      Scientific merit?
      • Provides scientific infrastructure
        • Easier to write and combine software
        • Free, open, interoperable (FOI) language resources
        • Free, open NLP test benchmarks (future work)
      • What part is scientific and what part is community work and negotiation?
      • No progress in the state of the art in NLP methods, yet
      • Difficult to judge where to put the emphasis. Lots of “soft evaluation” topics, no key performance indicators (KPIs).
  32. Conference + Workshops + Proceedings
      • 2011: Open Knowledge Conference
      • 2012: Workshop and book “Linked Data in Linguistics”
      • 2012: Linked Data Cup @ I-Semantics
      • 2012: Web of Linked Entities @ ISWC
      • 2012: MLODE @ Sabre
      • 2013: Semantic Web Journal: Special Issue on Multilingual Linked Open Data (MLOD)
      • Future work: DBpedia & NLP @ ISWC 2013
  33. Thanks for your attention