BIS – 2013/04/15 – Page 1 http://lod2.eu
Creating Knowledge out of Interlinked Data
AKSW, Universität Leipzig
Sebastian Hellmann
PhD thesis intermediate report
NLP Interchange Format (NIF) 2.0
http://nlp2rdf.org
http://lod2.eu
http://slideshare.net/kurzum
DISCLAIMER:
this presentation is work in progress, example RDF is outdated
NLP Interchange Format 2.0
The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to
achieve interoperability between Natural Language Processing (NLP) tools,
language resources and annotations.
• NIF 2.0 will be published in 6-8 weeks
• Very likely to become the de facto standard for modelling RDF tool
output in the NLP domain
Introduction
Components have pre- and postconditions
Automatic configuration is theoretically possible, but in practice it takes a lot of manual work
Huge potential to save time and money at the interfaces
Problem analysis
Core problems:
1. Too much heterogeneity
2. Almost no standards available
3. No open collaboration
4. Difficult and large domain
Problem analysis
Technical heterogeneity
• Technologies: XML, Relational Databases, CSV, DOC, PDF
  • Similar to other domains
• Formats: Negra, CoNLL, GrAF, Paula, CAS (UIMA), Penn
  • Virtually every tool implements readers for 5-6 of these formats plus its own serialization
• Programming languages: Java, Python, ...
  • Java predominates
Problem analysis
Domain heterogeneity
• Multilingualism
• Over 100 part-of-speech tags (several for each language)
• No open mappings exist
• About 20 different tasks listed on:
  http://en.wikipedia.org/wiki/Natural_language_processing#Major_tasks_in_NLP
• Natural language is a difficult topic:
  • The roulette dealer siad: “Rien ne va plus!”
    – 8 words, 4 French, 4 English, one spelling mistake, impossible to decide the language of the whole.
  • Ban on Nude Dancing on Governor's Desk
Problem analysis
Open collaboration
• LAF/GrAF is a recently released ISO standard
  • But it is not open (60 euros to view the document)
  • Not in RDF (a main requirement for any Semantic Web tool)
• Large frameworks tend to be only “inward” compatible
  • UIMA advocates say: “Why don't you just use UIMA?”
  • GATE advocates: “Integrate it into GATE!”
  • Generally, a large time investment and lock-in
Problem analysis
Summary:
Hardly any reusability
• Free software (as in free beer), but no open licenses
• No standards and no mappings
• Integration is hard-wired (you have to write software)
NIF Overview
• Definition for text normalization + URI Schemes (give URIs to Strings)
• NIF Core Ontology: default vocabulary for most often used annotations
• Predefined modules for most use cases
• Infrastructure
  • for open collaboration / discussion
  • persistent hosting
  • validation and demo services
• Reference implementation
• Data conversion
Text Normalization + URI Schemes
NIF 1.0: http://www.w3.org/DesignIssues/LinkedData.html#offset_717_729
NIF 2.0 uses RFC 5147 as base form:
http://www.w3.org/DesignIssues/LinkedData.html#char=717,729
User extensions possible:
http://www.w3.org/DesignIssues/LinkedData.html#your_own_scheme
(but you have to link to documentation on how it was created)
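The two URI forms above can be sketched as a small helper. This is only an illustration: the function names nif_uri and parse_char_fragment are hypothetical and not part of any NIF library, and the sketch assumes the offsets are plain character positions in the normalized text.

```python
def nif_uri(document_uri: str, begin: int, end: int, version: str = "2.0") -> str:
    """Mint a URI for the character range begin..end of a document.

    Hypothetical helper: the fragment syntax follows the two example URIs
    on the slide, nothing more.
    """
    if begin < 0 or end < begin:
        raise ValueError("expected 0 <= begin <= end")
    if version == "2.0":
        fragment = f"char={begin},{end}"    # RFC 5147 base form used by NIF 2.0
    elif version == "1.0":
        fragment = f"offset_{begin}_{end}"  # older NIF 1.0 form
    else:
        raise ValueError(f"unknown NIF version: {version}")
    return f"{document_uri}#{fragment}"


def parse_char_fragment(uri: str) -> tuple[str, int, int]:
    """Split a '#char=begin,end' URI back into (document URI, begin, end)."""
    base, _, fragment = uri.partition("#")
    if not fragment.startswith("char="):
        raise ValueError("not a char= fragment")
    begin, end = fragment[len("char="):].split(",")
    return base, int(begin), int(end)


doc = "http://www.w3.org/DesignIssues/LinkedData.html"
print(nif_uri(doc, 717, 729))
print(nif_uri(doc, 717, 729, version="1.0"))
```

Keeping minting and parsing as a pair makes it easy to check that a URI round-trips to the same offsets.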
As a Web Service
curl
  --data-urlencode prefix="http://prefix.given.by/theClient#"
  --data-urlencode input="[...]"
  (--data-urlencode source="http://www.w3.org/DesignIssues/LinkedData.html")
  http://nlp2rdf.lod2.eu/demo/NIFStanfordCore
The new namespace is http://persistence.uni-leipzig.org/nlp2rdf/nif-core#
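A minimal sketch of how a client might assemble the same form-encoded request body in Python, without sending it. The parameter names prefix, input and source and the service URL come from the curl call above. The function name build_nif_request is hypothetical, and reading the parenthesised source argument as optional is an assumption.

```python
from typing import Optional
from urllib.parse import urlencode

SERVICE = "http://nlp2rdf.lod2.eu/demo/NIFStanfordCore"  # demo endpoint from the slide


def build_nif_request(text: str, prefix: str, source: Optional[str] = None) -> tuple[str, bytes]:
    """Return (service URL, form-encoded body) for a NIF web service call.

    Hypothetical helper; 'source' is treated as optional, which is an assumption.
    """
    params = {"prefix": prefix, "input": text}
    if source is not None:
        params["source"] = source
    return SERVICE, urlencode(params).encode("utf-8")


url, body = build_nif_request(
    "Rien ne va plus!",
    prefix="http://prefix.given.by/theClient#",
)
print(url)
print(body.decode("utf-8"))
```

The body can then be posted with any HTTP client, for example urllib.request.urlopen(url, data=body).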
Overview of Ontologies
Ontologies:
• NIF Core Ontology (URI Scheme, String, Context, but also Token, Sentence,
lemma, stem, etc.) for often used annotations.
• Simple Error Ontology to describe errors (fatal, message, timestamp)
• Vocabulary Modules for each purpose or ontology or project
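To make the core vocabulary more concrete, here is a rough sketch that builds a few triples for one token as plain Python tuples. The namespace and the class names Context and Token come from the slides. The property names referenceContext and anchorOf are not on the slides and are assumed here, as is the function name token_triples. Since the deck itself warns that its example RDF is outdated, treat this as illustrative only.

```python
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/nif-core#"  # namespace from the slides
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def token_triples(prefix: str, text: str, begin: int, end: int) -> list[tuple[str, str, str]]:
    """Describe text[begin:end] as a token inside its context string.

    Hypothetical helper. 'referenceContext' and 'anchorOf' are assumed
    property names, not taken from the slides.
    """
    context = f"{prefix}char=0,{len(text)}"  # URI of the whole text
    token = f"{prefix}char={begin},{end}"    # URI of the substring
    return [
        (context, RDF_TYPE, NIF + "Context"),
        (token, RDF_TYPE, NIF + "Token"),
        (token, NIF + "referenceContext", context),
        (token, NIF + "anchorOf", text[begin:end]),
    ]


for triple in token_triples("http://prefix.given.by/theClient#", "Rien ne va plus!", 0, 4):
    print(triple)
```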
Logical Modularity
Each ontology consists of three sets of axioms:
- Terminology model (definitions)
- Inference model (especially transitivity)
- Validation model (consistency)
1) nif-core.ttl
2) nif-core-inf.ttl imports 1
3) nif-core-val.ttl imports 1 and 2
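The import chain between the three files can be sketched as a small dependency resolver. The file names and the import relations are taken from the list above. The IMPORTS table and the load_order function are hypothetical illustrations of how a client could decide which files to load for a given level.

```python
# Import relations as stated on the slide: 2 imports 1, 3 imports 1 and 2.
IMPORTS = {
    "nif-core.ttl": [],
    "nif-core-inf.ttl": ["nif-core.ttl"],
    "nif-core-val.ttl": ["nif-core.ttl", "nif-core-inf.ttl"],
}


def load_order(target: str, imports: dict[str, list[str]] = IMPORTS) -> list[str]:
    """Return the files needed for 'target', dependencies first (depth-first)."""
    ordered: list[str] = []

    def visit(name: str) -> None:
        if name in ordered:
            return
        for dependency in imports[name]:
            visit(dependency)
        ordered.append(name)

    visit(target)
    return ordered


print(load_order("nif-core-inf.ttl"))
print(load_order("nif-core-val.ttl"))
```

A client that only needs the terminology loads one file, while a validator ends up loading all three.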
Granularity Modularity
NIF simple:
• Only one truth
• Easy to understand and to query
• Least amount of triples
NIF + Stanbol (Apache Project)
• Several ranked alternatives
• Provenance of annotations
• In collaboration with Apache Stanbol
Open Annotation (W3C group)
• Rich model
• Not only text, but everything (images)
Well-defined conversions between the different levels:
- Towards the richer models: more triples, more complexity, worse usability, lossless conversion
- Towards the simpler models: easier queries, higher performance, lossy conversion
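One way to picture the lossy direction is a conversion from several ranked alternatives down to a single value. The sketch below is a hypothetical illustration, not the conversion NIF or Stanbol actually defines: the Candidate class, its fields, and the rule "keep the highest-confidence label" are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One ranked alternative with provenance (hypothetical structure)."""
    label: str
    confidence: float
    source_tool: str


def to_simple(candidates: Sequence[Candidate]) -> Optional[str]:
    """Collapse ranked alternatives into 'only one truth'.

    Confidence values and provenance are dropped, which is why the
    conversion in this direction is lossy.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.confidence).label


alternatives = [
    Candidate("VBG", 0.7, "tool-a"),
    Candidate("NN", 0.3, "tool-b"),
]
print(to_simple(alternatives))
```

The reverse direction can wrap a single label in a one-element list without losing anything, which matches the lossless direction named above.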
Interoperability
Structural Interoperability:
- URI schemes provide normalization
- RDF provides the graph data model
- OWL provides the logical model
Conceptual Interoperability:
- NIF Core Ontology and mappings for the most often used annotations, e.g. lemma,
stems
- Vocabulary Modules to include other terminologies and ontologies
NIF 2.0 aims to be compatible with the following (via Vocabulary Modules):
• ITS 2.0
• FISE used in Apache Stanbol (IKS-EU Project)
• LAF/GrAF XML – ISO standard, recently published
• Fragment Identifiers by IETF and W3C
• Lemon ontology from Monnet EU Project
• NERD ontology from EURECOM and LinkedTV EU Project
• XPointer/XPath URI scheme
• Open Annotation
• ISOCat
Vocabulary Module: OLiA
• Tibeto-Burman languages: http://purl.org/olia/tibet.owl#VNst
• Russian TreeTagger: http://purl.org/olia/russ.owl#partizip_prt_sg_neut_passiv_gen_langform
• German STTS: http://purl.org/olia/stts.owl#VAPP
• English Penn: http://purl.org/olia/penn.owl#VBG
→ all map to http://purl.org/olia/olia.owl#NonFiniteVerb
The Ontologies of Linguistic Annotation (OLiA) contain mappings for over 50 tagsets (free
and open, CC-BY)
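The effect of such a mapping can be sketched with a plain lookup table. The four tag URIs and the target class are exactly the ones listed above. The table TAG_TO_OLIA and the function same_category are hypothetical; in practice the mappings live in the OLiA ontologies themselves, not in a hand-written dictionary.

```python
OLIA_NON_FINITE_VERB = "http://purl.org/olia/olia.owl#NonFiniteVerb"

# Tagset-specific URIs from the slide, each mapped to the shared OLiA class.
TAG_TO_OLIA = {
    "http://purl.org/olia/tibet.owl#VNst": OLIA_NON_FINITE_VERB,
    "http://purl.org/olia/russ.owl#partizip_prt_sg_neut_passiv_gen_langform": OLIA_NON_FINITE_VERB,
    "http://purl.org/olia/stts.owl#VAPP": OLIA_NON_FINITE_VERB,
    "http://purl.org/olia/penn.owl#VBG": OLIA_NON_FINITE_VERB,
}


def same_category(tag_a: str, tag_b: str) -> bool:
    """True if both tags are known and map to the same OLiA class."""
    category_a = TAG_TO_OLIA.get(tag_a)
    return category_a is not None and category_a == TAG_TO_OLIA.get(tag_b)


print(same_category("http://purl.org/olia/stts.owl#VAPP", "http://purl.org/olia/penn.owl#VBG"))
```

A query for non-finite verbs can then be written once against the OLiA class and still match output from German, English, Russian and Tibeto-Burman taggers.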
Conceptual Interoperability
NIF can be extended by Vocabulary Modules
OLiA
http://purl.org/olia
NIF 2.0 Infrastructure for adoption
• Java-Maven implementation
• PHP implementation
• Reference implementations: DBpedia Spotlight, Stanford Parser, Korean POS
tagger, Keyword Search
• Wiki: http://wiki.nlp2rdf.org
• Validators
• Code generators (convert vocabulary modules to code stubs)
• NIF is free and open (CC-0 / CC-BY / Apache)
• All ontologies will be hosted persistently by Universität Leipzig
  • http://persistence.uni-leipzig.org/nlp2rdf/
Evaluation 1
• Huge collection of use cases
  • e.g. Ali wants to swap between different NLP services for RDFace
  • LOD2 from Wolters Kluwer
• A selection will be implemented. Assumption:
  • NIF is good if it fulfills many use cases
Evaluation 2
• There are about 10 to 20 third-party implementations
Evaluation 3
Analysis of existing frameworks and formats. Criteria:
• Convertibility (Adequacy)
  • Do the graph models match?
• Coverage
  • Quantitative analysis of used annotations
  • Does NIF Core provide terms for the most common annotations, or are
  there any gaps?
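The coverage criterion could be computed along the following lines. This is a sketch under stated assumptions: the function name coverage is hypothetical, the core term set only lists the terms named earlier in the slides (Token, Sentence, lemma, stem), and the annotation counts are made-up toy input, not measurements from any corpus.

```python
from collections import Counter


def coverage(annotation_counts: Counter, core_terms: set[str]) -> tuple[float, list[str]]:
    """Return (share of annotation instances covered by core terms, uncovered types)."""
    total = sum(annotation_counts.values())
    if total == 0:
        return 0.0, []
    covered = sum(n for term, n in annotation_counts.items() if term in core_terms)
    gaps = sorted(term for term in annotation_counts if term not in core_terms)
    return covered / total, gaps


# Toy input for illustration only; these counts are invented.
toy_counts = Counter({"Token": 6, "Sentence": 2, "lemma": 1, "dependency": 1})
core = {"Token", "Sentence", "lemma", "stem"}

share, gaps = coverage(toy_counts, core)
print(share, gaps)
```

Weighting by instance count rather than by type keeps rare annotation types from dominating the result, while the list of gaps still shows them.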
Data Conversion
Data is available as
free, open, interoperable (FOI) language resources at
http://linguistics.okfn.org/resources/llod/
(work in progress)
Critical judgement
Project has a very good impact:
• Many adopters
• Industrial uptake
• Inclusion in a W3C standard for ITS 2.0:
  http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html
• Several projects involved as stakeholders (LOD2, Monnet, ...)
• Several motivated open-source developers
• Funding is coming in
Critical judgement
Scientific merit?
• Provides scientific infrastructure
• Easier to write and combine software
• Free, open, interoperable (FOI) language resources
• Free, open NLP test benchmarks (future work)
• What part is scientific and what part is community work and negotiation?
• No progress in the state of the art in NLP methods yet
• Difficult to judge where to put the emphasis. Many “soft evaluation”
topics, no key performance indicators (KPIs).
Conference + Workshops + Proceedings
• 2011: Open Knowledge Conference
• 2012: Workshop and book “Linked Data in Linguistics”
• 2012: Linked Data Cup @ I-Semantics
• 2012: Web of Linked Entities @ ISWC
• 2012: MLODE @ SABRE
• 2013: Semantic Web Journal: Special Issue on Multilingual Linked Open Data (MLOD)
• Future work: DBpedia & NLP @ ISWC 2013
Thanks for your attention
