Web Annotations – A Game Changer for Language Technology?

Georg Rehm, Felix Sasaki, Aljoscha Burchardt
DFKI GmbH – Language Technology Lab, Berlin

Language Technology
•  Language Technology is a heterogeneous and evolving
set of applications that involve the
–  (semi-)automatic processing (analysis) or
–  (semi-)automatic production
of human language (written or spoken).
•  Driven by NLP, CL, Linguistics, CompSci, CogSci, AI.
•  Methods operate on language data (often web-scale)
•  Rule-based tools, statistics (machine learning)
•  Need for human experts to analyse and annotate data
sets with highly specialised linguistic analysis information
Web Annotations and Language Technology – I Annotate 2016

Selected LT Applications
Spell checking, grammar checking
Search engines (IR)
Interactive personal assistants (Cortana, Siri etc.)
Machine Translation
Recommender systems
Social media (analytics, streams)
Knowledge-based systems

Web Annotation Architecture
http://www.w3.org/annotation
What is the relationship between Web Annotations and Language Technology?

Content could be created by Language
Technology fully automatically or in a
semi-automatic way (text generation).

Content could be analysed by
Language Technology (semantic
analysis, input for ML algorithms etc.)

Especially in Social Media Analytics we
are very interested in UGC, i.e., in
comments, feedback – “what do users
think of a certain product?” etc.

•  Today, analysing UGC is difficult
and costly (many heterogeneous
sources, many different formats).
•  A few established and widely used
Web Annotation services would
simplify SMA dramatically!

We can also use Language Technology
methods to create (or help create)
annotations, for example, in a smart
authoring scenario.

LT and Web Annotations
•  Analysis of web annotations and making use of web
annotations through Language Technology:
–  Arbitrary web annotations (i.e., unstructured text)
•  No more crawling, aggregating, mapping!
–  Dedicated LT-specific web annotations
•  Annotating language data without any specialised
stand-alone tools or data repositories!
•  Generation of web annotations through Language
Technology (e.g., to provide background information on
important content – see, e.g., the Pundit use cases).

Example Scenarios
•  Two example scenarios to demonstrate how Language
Technology and Web Annotations go together.
•  Scenario 1 – Digital Curation Technologies: 
Semantification of content for curators of digital information
•  Scenario 2 – Machine Translation: 
Web Annotations for High-Quality Machine Translation

Digital Curation Technologies
Figure labels: language and knowledge technologies; curation technologies; sector-specific technologies; platform technologies; sector-specific solutions.
•  Support curation processes through sophisticated
language and knowledge technologies.
•  Goal: transfer of these technologies into industry
through a platform for digital curation technologies.

Figure: pieces of information under the labels Input, Processes, Software and Output.
•  Investigative journalist
•  Curator of an exhibition
•  TV editor
•  Author
•  Scholar
•  Knowledge worker
•  Curator of digital information

Sectors
•  Input: tweet, newspaper article, wire copy, facebook status update, search result, email, text message, concept, text file, video, map, stockphoto, in-house database, calendar entry, spreadsheet, archive, etc.
•  Processes: analyse, select, focus, revise, read up on, write, create, research, assess, evaluate, arrange, sort, structure, summarise, shorten, translate, catch up on, combine, abstract, integrate, visualise, generate, annotate, reference, etc.
•  Software: text processor, presentation, spreadsheet, email, browser, groupware, sector-specific application, CMS, ECMS, CRM, enterprise software, graphics/layouting software, IP telephony, etc.
•  Output: newspaper article, multimedia website, tv report, exhibition catalogue, mobile application, mashup (e.g., map), text piece, concept, timeline, study, presentation, fact collection, description of an exhibit, analysis, etc.

Curation Processes
•  Structure visualisation
•  Multilingual multimedia sources
•  Crossmedia recommendations
•  Multilingual summarisation
•  Event timelining
•  Semantification of content
•  Multilingual sentiment analysis
•  Semantic story-telling
•  Ontology-based knowledge structures

Figure: platform for digital curation technologies. Components shown: broker REST API; curation service 1 and curation service 2 (language or knowledge technology); external service 1 and external service 2; several clients using the API; pipelined curation workflow.
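The idea of a pipelined curation workflow behind a broker can be sketched in a few lines. This is only an in-process illustration under stated assumptions: the `Broker` class, the two toy services and the document layout are hypothetical and are not the platform's actual REST API. The point it shows is that a later service can build on the annotations produced by an earlier one.

```python
import re
from typing import Any, Callable, Dict, List

# A document travelling through the pipeline: raw text plus accumulated annotations.
Document = Dict[str, Any]
Service = Callable[[Document], Document]


class Broker:
    """Hypothetical in-process stand-in for the broker (not the real REST API)."""

    def __init__(self) -> None:
        self._services: Dict[str, Service] = {}

    def register(self, name: str, service: Service) -> None:
        self._services[name] = service

    def run_pipeline(self, names: List[str], text: str) -> Document:
        # Each service receives the output of the previous one.
        doc: Document = {"text": text, "annotations": []}
        for name in names:
            doc = self._services[name](doc)
        return doc


def tokenise(doc: Document) -> Document:
    """Toy curation service 1: add one annotation per word, with character offsets."""
    tokens = [
        {"type": "token", "start": m.start(), "end": m.end()}
        for m in re.finditer(r"\w+", doc["text"])
    ]
    return {**doc, "annotations": doc["annotations"] + tokens}


def flag_capitalised(doc: Document) -> Document:
    """Toy curation service 2: reuse the token annotations from service 1."""
    text = doc["text"]
    flagged = [
        {"type": "capitalised", "start": a["start"], "end": a["end"]}
        for a in doc["annotations"]
        if a["type"] == "token" and text[a["start"]].isupper()
    ]
    return {**doc, "annotations": doc["annotations"] + flagged}


broker = Broker()
broker.register("tokenise", tokenise)
broker.register("flag_capitalised", flag_capitalised)
result = broker.run_pipeline(["tokenise", "flag_capitalised"], "Lisbon hosts the forum")
```

In a real deployment each service would presumably sit behind its own HTTP endpoint; the function-call version above only illustrates the chaining.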

Example
•  Annotation of time expressions – needed for visualisation of time-lining
•  Input: text content – output: list of time expressions and mean dates
•  Storage using the Web Annotation model
•  http://dkt-projekt.github.io/webAnnotation/webannotation-dkt.html

Figures: example input; example output with mean dates and intervals; JSON-LD representation.
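To make the time-expression example concrete, here is a minimal sketch of how detected dates could be stored as Web Annotations and how a mean date could be derived from them. It rests on several assumptions: only ISO dates (YYYY-MM-DD) are recognised, the function names and the example source URI are hypothetical, and the body layout (a `TextualBody` holding the normalised date) is one plausible choice rather than the representation used by the project linked above.

```python
import json
import re
from datetime import date

ANNO_CONTEXT = "http://www.w3.org/ns/anno.jsonld"
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def annotate_dates(text, source_uri):
    """Return one Web Annotation (as a dict) per ISO date found in text."""
    annotations = []
    for match in ISO_DATE.finditer(text):
        try:
            value = date(*map(int, match.groups()))
        except ValueError:
            continue  # skip impossible dates such as 2016-13-40
        annotations.append({
            "@context": ANNO_CONTEXT,
            "type": "Annotation",
            # Assumed body layout: the normalised date as a textual tag.
            "body": {
                "type": "TextualBody",
                "value": value.isoformat(),
                "purpose": "tagging",
            },
            "target": {
                "source": source_uri,
                "selector": {
                    "type": "TextPositionSelector",
                    "start": match.start(),
                    "end": match.end(),
                },
            },
        })
    return annotations


def mean_date(annotations):
    """Average the annotated dates via their day ordinals; None if there are none."""
    ordinals = [date.fromisoformat(a["body"]["value"]).toordinal() for a in annotations]
    if not ordinals:
        return None
    return date.fromordinal(round(sum(ordinals) / len(ordinals)))


text = "Submissions close on 2016-05-29; the event starts on 2016-07-04."
annos = annotate_dates(text, "http://example.org/doc/1")  # hypothetical URI
print(json.dumps(annos[0], indent=2))
print(mean_date(annos))
```

A real time-expression tagger would also handle relative and underspecified expressions ("next Monday", "summer 2016"), which is where intervals rather than single dates become necessary.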

Web Annotations for HQMT
•  Current MT research workflows use several specialised and
incompatible tools and distributed repositories.
•  Ideal scenario: one coherent, interoperable and integrated
ecosystem of tools.
•  Centrally stored web annotations would be a massive step
in the right direction!
http://www.cracking-the-language-barrier.eu/mt-eval-workshop-2016/
Figure labels:
•  Human Evaluation
–  Ranking
–  Post-Editing
–  Error Annotation (MQM)
–  Task-based Evaluation
•  Translation Production Workflows
–  Sampling
–  Filtering
–  Translation Memory Inclusion
–  Terminology Checking
•  Linguistic Analysis
–  Tokenisation
–  POS tagging
–  Parsing
–  Entity recognition
–  WSD
•  Machine Translation
–  Services
–  Development
•  Automatic Evaluation
–  BLEU
–  Quality Estimation
–  PE-Distance
–  Test-Suites
•  Further labels: REPOSITORY, COCKPIT, BACKEND, DATA SETS, META-SHARE, WMT, JRC, CLARIN

Multidimensional Quality Metrics
MQM for MT diagnostics
•  Customisable framework for translation quality metrics
•  Early version standardised in W3C’s ITS 2.0
•  Annotations in current workflows are typically
proprietary, tool-, format- and workflow-based.
•  Web annotations could enable the creation of a
collaborative corpus of translation data for the
whole community.
•  Feedback into MT engines through annotated
web-scale corpora could lead to a boost in
performance and quality.
•  Next slide: conversion of proprietary tool format
to Web Annotations.

From MQM to Web Annotations
Figure labels: proprietary and tool-specific CSV; MQM issue type; Web Annotation (intermediate XML syntax).
https://github.com/dkt-projekt/webAnnotation/tree/gh-pages/mqm-webannotation
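The conversion idea can be illustrated with a short sketch. Everything specific here is an assumption: the CSV column names (`segment_id`, `start`, `end`, `issue_type`), the sample rows and the base URI are made up for illustration, and the output is JSON-LD-style dictionaries rather than the intermediate XML syntax mentioned on the slide. The repository linked above documents the actual conversion.

```python
import csv
import io

ANNO_CONTEXT = "http://www.w3.org/ns/anno.jsonld"


def mqm_rows_to_annotations(csv_text, base_uri):
    """Turn rows of a (hypothetical) MQM error CSV into Web Annotation dicts."""
    annotations = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        annotations.append({
            "@context": ANNO_CONTEXT,
            "type": "Annotation",
            "motivation": "assessing",
            # The MQM issue type becomes the annotation body.
            "body": {
                "type": "TextualBody",
                "value": row["issue_type"],
                "purpose": "classifying",
            },
            # The annotated span inside one translated segment.
            "target": {
                "source": f"{base_uri}/{row['segment_id']}",
                "selector": {
                    "type": "TextPositionSelector",
                    "start": int(row["start"]),
                    "end": int(row["end"]),
                },
            },
        })
    return annotations


# Made-up sample data in the assumed column layout.
sample_csv = (
    "segment_id,start,end,issue_type\n"
    "s1,0,5,Mistranslation\n"
    "s2,10,14,Omission\n"
)
annotations = mqm_rows_to_annotations(sample_csv, "http://example.org/mt-output")
```

Once error annotations are in this shape, they no longer depend on the tool that produced them, which is the precondition for the shared corpus of translation data described earlier.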

Web Annotation Infrastructure
•  Web annotations themselves work on language.
•  Language Technology could help build better services.
•  Anchoring annotations to changing content in a
robust way is apparently tricky.
•  Semantic methods for identifying the new position of the
original anchors that have changed since the annotation
was put there.
•  Annotating all copies of the document that is
currently being annotated – application of methods for
duplicate detection or near duplicate detection.
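As a rough illustration of the anchoring problem, the following sketch tries to find the new position of a quoted passage after the page text has changed. Note the substitution: the slide speaks of semantic methods, whereas this sketch uses plain character-level fuzzy matching from Python's standard `difflib` module as a much simpler stand-in. The function name `reanchor` and the 0.8 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def reanchor(quote, new_text, min_ratio=0.8):
    """Return (start, end) of the best match for quote in new_text, or None.

    An exact match is tried first; otherwise every window of len(quote)
    characters is scored and the most similar one wins, provided its
    similarity reaches min_ratio.
    """
    exact = new_text.find(quote)
    if exact != -1:
        return exact, exact + len(quote)

    best_span, best_ratio = None, 0.0
    window = len(quote)
    for start in range(0, max(1, len(new_text) - window + 1)):
        candidate = new_text[start:start + window]
        ratio = SequenceMatcher(None, quote, candidate).ratio()
        if ratio > best_ratio:
            best_span, best_ratio = (start, start + len(candidate)), ratio
    return best_span if best_ratio >= min_ratio else None


# The annotated passage was "the quick brown fox"; the page now contains a typo.
changed_page = "Yesterday the quick browm fox jumped."
span = reanchor("the quick brown fox", changed_page)
if span is not None:
    print(changed_page[span[0]:span[1]])
```

The sliding window is quadratic in the worst case and only copes with small edits of roughly equal length; larger rewrites are exactly where semantic similarity, as proposed on the slide, would be needed.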

Vision 2020
•  Next generation personal assistant.
•  Highly personalised, assisted browsing experience.
•  Semantic language technologies in the background.
•  Detection of the user’s tasks, intentions, preferences.
•  Annotation of relevant, surprising, new facts in current
and future content through web annotations.
•  Anticipation of the user’s next steps.
•  Suggestion of related content based on  
user modelling and semantic story telling.
Georg Rehm and Hans Uszkoreit (eds.). The META-NET Strategic Research Agenda for Multilingual
Europe 2020. Springer, 2013; see Priority Research Theme “Socially-Aware Interactive Assistant”.

So, are Web Annotations a game changer
for Language Technology?
Yes, most certainly – if the UX and
browser support are done right.
Maybe Language Technology can also be
a game changer for Web Annotations.

Thank you!
Beyond Multilingual Europe
04/05 July, 2016 – Lisbon, Portugal
http://www.meta-forum.eu
Deadline for submissions: 29 May 2016
