Sasaki datathon-madrid-2015

Sasaki – LLD Datathon – Cercedilla, Spain, May 2015
Roundtripping of NIF based
Linguistic Linked Data with non
linked data sources
Felix Sasaki
DFKI / W3C Fellow
Slides:
http://de.slideshare.net/atcfsenzoku/sasaki-datathonmadrid2015

What is NIF?
• Natural Language Processing Interchange
Format
– See http://nlp2rdf.org/
• LLD format to store annotations & to organize
NLP pipelines
• API specification to create NIF workflows
• More details: after the coffee break 
• Following slides: main roles for NIF

Example (Partial; JSON-LD Syntax)
{ "@graph" : [ {
"@id" : "p:char=0,18",
"@type" : [ "nif:Context", "nif:Sentence", "nif:RFC5147String" ],
"anchorOf" : "Welcome to Prague.",
"beginIndex" : "0",
"endIndex" : "18",
"isString" : "Welcome to Prague.",
"referenceContext" : "p:char=0,18"
}, {
"@id" : "p:char=11,17",
"@type" : [ "nif:RFC5147String", "nif:Word" ], …
"referenceContext" : "p:char=0,18",
"taIdentRef" : "http://dbpedia.org/resource/Prague" }, …] }
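A minimal Python sketch of how such a graph could be assembled from a text and a list of recognised entities. It is not the NIF reference implementation: the function names `nif_id` and `build_graph` are hypothetical, the property names are simply copied from the example above, and offsets are located with `str.find`, which is only adequate for this toy input.

```python
import json

def nif_id(prefix, begin, end):
    """Build a character-offset identifier in the style of the example."""
    return f"{prefix}char={begin},{end}"

def build_graph(text, entities, prefix="p:"):
    """Return a JSON-LD-like dict: one context node plus one node per entity.

    `entities` is a list of (surface_form, identifier_uri) pairs.
    """
    context_id = nif_id(prefix, 0, len(text))
    graph = [{
        "@id": context_id,
        "@type": ["nif:Context", "nif:RFC5147String"],
        "beginIndex": "0",
        "endIndex": str(len(text)),
        "isString": text,
    }]
    for surface, uri in entities:
        begin = text.find(surface)
        if begin < 0:
            continue  # surface form not present in the text
        end = begin + len(surface)
        graph.append({
            "@id": nif_id(prefix, begin, end),
            "@type": ["nif:RFC5147String", "nif:Word"],
            "anchorOf": surface,
            "beginIndex": str(begin),
            "endIndex": str(end),
            "referenceContext": context_id,  # links the annotation to its context
            "taIdentRef": uri,               # named entity identifier
        })
    return {"@graph": graph}

doc = build_graph("Welcome to Prague.",
                  [("Prague", "http://dbpedia.org/resource/Prague")])
print(json.dumps(doc, indent=2))
```

The offsets double as identifiers, so each annotation node can be traced back to an exact span of the context string.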

• Identifying and typing annotations
• Identifying annotation offsets
• Adding additional knowledge, e.g. named entity identifier
• Interrelating annotations

A NIF workflow
[Workflow diagram: Existing content; Content analytics, e.g. named entity recognition; Conversion to NIF; Deploying knowledge from the LLD cloud]

Potential scenario: roundtripping
[Workflow diagram: Existing content; Content analytics, e.g. named entity recognition; Conversion to NIF; Storing annotations in original content; Deploying knowledge from the LLD cloud]

Roundtripping
• Roundtripping: Storing the outcome of
content processing (analytics) tasks in the
original content
• Not always needed, but sometimes it is.
Examples:
– Enriching Web content with named entity
information; generating Schema.org markup via
NIF pipelines. Format: HTML
– Enriching localisation content, to add value
beyond translation. Format: XLIFF

Example: HTML
Example roundtripping workflow
… Welcome to Prague!…
…Welcome to Prague!…
1) Conversion to NIF 2) NER processing
3) Back conversion to HTML

Example: XLIFF
Example roundtripping workflow
… <xlf:source>Welcome to Prague!</xlf:source> …
… <xlf:source>Welcome to <mrk …
its:taClassRef="http://schema.org/Place">Prague
</mrk>!</xlf:source> …
1) Conversion to NIF 2) NER processing
3) Back conversion to XLIFF
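The back conversion step can be sketched in Python as follows, assuming the segment is plain text and the annotation spans do not overlap. The gazetteer stands in for a real NER service, and the names `GAZETTEER`, `find_entities` and `annotate_segment` are hypothetical. The generated `mrk` element carries only the `its:taClassRef` attribute shown on the slide; the attributes elided there are not reconstructed.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical lookup table standing in for a real NER component.
GAZETTEER = {"Prague": "http://schema.org/Place"}

def find_entities(text):
    """Toy NER: return sorted (begin, end, class_uri) tuples for gazetteer hits."""
    hits = []
    for surface, class_uri in GAZETTEER.items():
        begin = text.find(surface)
        if begin >= 0:
            hits.append((begin, begin + len(surface), class_uri))
    return sorted(hits)

def annotate_segment(text, annotations, tag="mrk", attr="its:taClassRef"):
    """Wrap each non-overlapping (begin, end, class_uri) span in inline markup."""
    out, pos = [], 0
    for begin, end, class_uri in annotations:
        out.append(escape(text[pos:begin]))          # text before the span
        out.append(f"<{tag} {attr}={quoteattr(class_uri)}>"
                   f"{escape(text[begin:end])}</{tag}>")
        pos = end
    out.append(escape(text[pos:]))                   # remaining text
    return "".join(out)

segment = "Welcome to Prague!"
result = annotate_segment(segment, find_entities(segment))
print(result)
```

Passing a different `tag` and `attr` would give the HTML variant of the same step.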

Example usage scenario:
FREME project
• See http://www.freme-project.eu/
• Developing interfaces for multilingual and semantic
enrichment of digital content
• Relies on NIF based enrichment workflows
– See FREME API version 0.1
http://api.freme-project.eu/doc/0.1/
• Deploys aspects of the LIDER reference architecture for LLD
processing
– See D3.1.1 at http://lider-project.eu/?q=doc/deliverables
• Focuses on four business cases
– Localization BC requires XLIFF roundtripping
– Web content personalisation BC requires HTML roundtripping

Challenges for roundtripping
• Source format
– How to store enrichment information
(annotations)
– How to handle existing information
• Annotation model
– NIF = a general graph-based annotation model
– Source format and annotation motivation may
require restriction of the model

How to store annotations in various
source formats
• Solvable for markup languages like HTML or
XLIFF
• Challenge to preserve existing markup
“Welcome to Prague!”
• General issue with complex and proprietary
formats:
– “My own” storage mechanism = no tool support
– Using existing storage mechanisms may mean:
overloading semantics
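To illustrate the challenge of preserving existing markup, here is a Python sketch that maps character offsets in the extracted text back to positions in the marked-up source. The `<b>` element and the `<span class="place">` wrapper are hypothetical examples, and the function names are invented for this sketch. It assumes the annotated span does not cross an existing tag boundary and that the content contains no character entities; a real converter has to handle both.

```python
import re

TAG = re.compile(r"<[^>]+>")

def text_with_offset_map(markup):
    """Strip tags; return the plain text and, per text character,
    its index in the original markup string."""
    chars, positions, pos = [], [], 0
    for match in TAG.finditer(markup):
        for i in range(pos, match.start()):
            chars.append(markup[i])
            positions.append(i)
        pos = match.end()                 # skip over the tag itself
    for i in range(pos, len(markup)):
        chars.append(markup[i])
        positions.append(i)
    return "".join(chars), positions

def insert_annotation(markup, begin, end, open_tag, close_tag):
    """Wrap the text span [begin, end) in new tags, keeping existing markup."""
    _, positions = text_with_offset_map(markup)
    start = positions[begin]
    stop = positions[end - 1] + 1
    return markup[:start] + open_tag + markup[start:stop] + close_tag + markup[stop:]

source = "Welcome to <b>Prague</b>!"
text, _ = text_with_offset_map(source)
enriched = insert_annotation(source, 11, 17, '<span class="place">', "</span>")
print(text)
print(enriched)
```

The NLP tools only ever see `text`; the offset map is what lets the annotation find its way back into the source without disturbing the existing `<b>` element.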

Source format example: Word
Content storage (Word file unzipped):
… <w:t>Welcome to Prague!</w:t> …
Enrichment process; storing enrichment as comments
Content storage (change of original content: creation of anchor):
… <w:commentRangeStart w:id="0"/><w:t>Prague</w:t>
<w:commentRangeEnd w:id="0"/>
<w:r w:rsidR="00987079"> …
Comment storage (comment stored separately; refers to anchor: “standoff approach”):
<w:p w:rsidRPr="00987079">… Enrichment: type "http://schema.org/Place"…</w:p>
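The standoff idea can be sketched independently of Word's actual XML. In this simplified Python model, which is an assumption and not the OOXML structure, the content stream holds only anchors with an id, and the enrichment lives in a separate store keyed by that id. The names `content`, `comments` and `resolve_standoff` are hypothetical.

```python
# Content stream with anchor markers (simplified model, not Word's real XML).
content = [
    ("text", "Welcome to "),
    ("anchor_start", "0"),
    ("text", "Prague"),
    ("anchor_end", "0"),
    ("text", "!"),
]

# Separate comment store; the key refers to the anchor id in the content.
comments = {"0": 'Enrichment: type "http://schema.org/Place"'}

def resolve_standoff(content, comments):
    """Pair each anchored text range with its separately stored comment."""
    open_ranges, resolved = {}, []
    for kind, value in content:
        if kind == "anchor_start":
            open_ranges[value] = []
        elif kind == "anchor_end":
            anchored = "".join(open_ranges.pop(value))
            resolved.append((anchored, comments.get(value)))
        else:
            # Plain text belongs to every range that is currently open.
            for parts in open_ranges.values():
                parts.append(value)
    return resolved

pairs = resolve_standoff(content, comments)
print(pairs)
```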

Annotation models
• NIF: like RDF = general graph model
– Consisting of nodes and arcs
[Graph: node p:char=11,17; arc taIdentRef; node dbp:Prague]
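As a rough illustration of the graph model, the annotations can be held as (node, arc, node) triples in plain Python. This is a sketch without an RDF library; the helper `objects` is hypothetical, and the second triple is taken from the JSON-LD example earlier in the slides.

```python
# Each triple is (subject node, arc label, object node).
triples = {
    ("p:char=11,17", "taIdentRef", "dbp:Prague"),
    ("p:char=11,17", "referenceContext", "p:char=0,18"),
}

def objects(triples, subject, predicate):
    """Follow one arc type from a node and return the nodes it points to."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)

entity = objects(triples, "p:char=11,17", "taIdentRef")
print(entity)
```

Nothing in this structure forces a tree shape, which is why roundtripping into a hierarchical format may require restricting the model.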

Restricting graphs: Tree structured annotations
on several layers
• Tree structures for syntactic annotations
• Several annotation layers for the same text
• Concurrent hierarchies
• Roundtripping with XML can represent only one of these hierarchies
Example taken from TEI http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html

Representing overlapping hierarchies
with markup (1/2)
Solutions advertised by the TEI
• Multiple encoding of the same information
– One XML document per annotation
• Boundary marking with empty “milestone”
elements
– Also used by XLIFF
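Boundary marking with empty elements can be sketched in Python as below. The element names `start` and `end` are hypothetical placeholders, not the TEI or XLIFF vocabulary, and the text is assumed to contain no characters that need XML escaping.

```python
def add_milestones(text, spans):
    """Mark span boundaries with empty elements so overlapping spans stay well-formed.

    `spans` is a list of (span_id, begin, end) character offsets.
    """
    events = []
    for span_id, begin, end in spans:
        events.append((begin, 1, f'<start id="{span_id}"/>'))
        events.append((end, 0, f'<end id="{span_id}"/>'))
    # At the same offset, emit end markers before start markers.
    events.sort(key=lambda e: (e[0], e[1]))
    out, pos = [], 0
    for offset, _, marker in events:
        out.append(text[pos:offset])
        out.append(marker)
        pos = offset
    out.append(text[pos:])
    return "".join(out)

# Two overlapping spans: "Welcome to" (0-10) and "to Prague" (8-17).
marked = add_milestones("Welcome to Prague!", [("s1", 0, 10), ("s2", 8, 17)])
print(marked)
```

Because every marker is an empty element, the two spans can overlap without producing ill-formed XML; the price is that the spans are no longer elements that standard tree tools can select directly.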

Representing overlapping hierarchies
with markup (2/2)
Solutions advertised by the TEI
• Fragmentation and reconstitution of virtual
elements
– One hierarchy explicit, others with interrelated
marked-up spans
• Stand-off markup
– Separation of text and annotations, interlinked via
anchor and reference
– Cf. Word example

Representing overlapping hierarchies
in RDF
POWLA (cf. Chiarcos, 2012)
• RDF representation for corpus annotation,
based on PAULA XML Standoff format
• Allows representing hierarchical, multi-layer
corpora in RDF and querying them with SPARQL
• Not relevant for roundtripping, but for
linguistic annotation representation and
processing in RDF

Lessons learned
• Choose the overlap solution that fits your
roundtripping modelling and processing needs
• Consider off-the-shelf tooling
– For 100% hierarchical data: XPath / CSS selectors, DOM, …
• Consider libraries
– For extraction only: Tika http://tika.apache.org/
– For roundtripping: Okapi http://okapi.opentag.com/ - in
FREME currently being adapted for roundtripping in
selected formats
• Make sure the annotation survives in the original
format – cf. Word example
– Soon to be made easier by using Okapi
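For fully hierarchical data, off-the-shelf tooling is often enough. The Python sketch below uses the standard library's ElementTree and its XPath subset to read annotations back out of a simplified fragment. The fragment is an assumption modelled on the XLIFF example: it drops the `xlf:` prefix and the elided `mrk` attributes, and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"

# Simplified fragment modelled on the XLIFF example (not a complete XLIFF file).
fragment = (
    f'<source xmlns:its="{ITS_NS}">Welcome to '
    '<mrk its:taClassRef="http://schema.org/Place">Prague</mrk>!</source>'
)

root = ET.fromstring(fragment)

# ".//mrk" selects all mrk descendants; namespaced attributes are read
# with the {namespace}name notation that ElementTree uses.
found = [
    (mrk.text, mrk.get(f"{{{ITS_NS}}}taClassRef"))
    for mrk in root.findall(".//mrk")
]
print(found)
```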