Observations on Annotations – From Computational Linguistics and the World Wide Web to Artificial Intelligence and back again

Georg Rehm
German Research Center for Artificial Intelligence (DFKI) GmbH
Annotation in scholarly editions and research
Bergische Universität Wuppertal – 21 February 2019

Observations on Annotations – Wuppertal, Germany, 21 February 2019
Annotation: Personal Background
• Computational Linguistics and AI (since 1992)
• SGML and TEI (since 1995)
• XML (since 1998), XSLT, XPath, several others ...
• Corpus annotation formats
• Hypertext and Textlinguistics
• Web Technologies, W3C, Markup Languages
• W3C Office Germany/Austria (since 2013)
• AI and Language Technology Development (since 2009)
• Infrastructures and Platforms
• Service Deployment
• Research Data
• Language Resources
• Metadata
• Data Formats
• Open Science

Introduction
• Annotations have been playing an important role in
Computational Linguistics and related fields
(especially Digital Humanities) for decades.
• This talk: Recent examples, lessons learned and
some general observations on annotations.
• My own research in this area (since approx. 1996):
– from basic and applied research to
– innovation and technology development

Outline
• Annotations – brief definition
• World Wide Web
• Annotations and AI
• Annotations and Computational Linguistics
• Annotations and Language Technology
• Annotations for a Credible Web
• Annotations and Open Science
• Annotations and Markup
• Dimensions of Annotations
• Summary and Conclusions

Annotations:
a brief definition

Annotations
• Definition/“Definition”:
Secondary data added to a piece of primary data –
in science, this is, often, research data.
• Wikipedia:
An annotation is a metadatum (e.g., a post, explanation,
markup) attached to [a?] location or other data.
http://www.merriam-webster.com

• Literature and education:
– Textual scholarship: Textual scholarship is a discipline that often
uses the technique of annotation to describe or add additional
historical context to texts and physical documents.
– Learning and instruction: As part of guided noticing [annotation]
involves highlighting, naming or labelling and commenting
aspects of visual representations to help focus learners' attention
on specific visual aspects. In other words, it means the assignment
of typological representations (culturally meaningful categories),
to topological representations (e.g. images).
• Software engineering:
– Text documents: Markup languages like XML and HTML annotate
text in a way that is syntactically distinguishable from that text.
They can be used to add information about the desired visual
presentation, or machine-readable semantic information, as in
the semantic web.
• Linguistics:
– In linguistics, annotations include comments and metadata; these
non-transcriptional annotations are also non-linguistic.

World Wide Web

“Vague but exciting”
Information Management: A Proposal
Tim Berners-Lee, CERN, March 1989, May 1990
“Private links
One must be able to
add one's own private
links to and from public
information. One must
also be able to annotate
links, as well as nodes,
privately.”

World Wide Web Consortium
• W3C is an international non-profit member-financed
standards developing organisation
• Founded in 1994 by Sir Tim Berners-Lee
• Currently 451 members – 23 in Germany/Austria
• Approx. 60 staff (ERCIM, MIT, Keio University, Beihang University)
• Approx. 20 offices in important regions
• The W3C Office Germany/Austria is run by
• Open Web Platform, HTML5, CSS, Credible Web, Digital
Publishing, Linked Data etc.
http://w3.org, http://w3c.de
Interested in joining? Talk to me!

Relevant W3C Standards
• XML – Extensible Markup Language
– Extremely influential
– Widely adopted
– TEI and many other languages
• Semantic Web
– RDF, OWL, SPARQL, SKOS etc.
• Digital Publishing
– New versions of EPUB
• Web Annotation Data Model and Vocabulary
https://www.w3.org/2001/10/03-sww-1/slide7-0.html

Web Annotation

Web Annotations
• Web Annotation – Three W3C Recommendations
• Most popular and relevant implementation: Hypothes.is
– Mission-driven, non-profit Open Source company
– Main focus on scholarly publishing
(“Annotating All Knowledge Coalition”)
– Very active and vibrant community
• Hypothes.is: main driving force
behind the I Annotate conference series
– Open proceedings, very interesting programme, diverse
speakers from several disciplines – consider attending!
– Videos of almost all previous events available online

Web Annotation Standard
• Web Annotation Data Model
– Describes the underlying Annotation Abstract Data Model as well as a JSON-LD serialization
• Web Annotation Vocabulary
– The Vocabulary which underpins the Web Annotation Data Model
• Web Annotation Protocol
– The HTTP API for publishing, syndicating, and distributing Web Annotations
• Published on 23 February 2017
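To make the JSON-LD serialization of the data model more concrete, here is a minimal sketch of an annotation with one textual body and one target. The identifier, the target URL and the helper name make_annotation are made up for illustration; only the context URL and the type names are taken from the Web Annotation Data Model.

```python
import json

# JSON-LD context published by W3C for Web Annotations.
ANNO_CONTEXT = "http://www.w3.org/ns/anno.jsonld"


def make_annotation(anno_id, comment, page_url, quoted_text):
    """Build a minimal Web Annotation: a textual body related to a target."""
    return {
        "@context": ANNO_CONTEXT,
        "id": anno_id,
        "type": "Annotation",
        "body": {
            "type": "TextualBody",
            "value": comment,
            "format": "text/plain",
        },
        "target": {
            "source": page_url,
            # The selector identifies the annotated segment of the target.
            "selector": {
                "type": "TextQuoteSelector",
                "exact": quoted_text,
            },
        },
    }


annotation = make_annotation(
    "http://example.org/anno/1",          # hypothetical annotation IRI
    "Berlin is the capital of Germany.",  # the comment, i.e. the body
    "http://example.org/page.html",       # hypothetical annotated page
    "Berlin",                             # the quoted text in the page
)
print(json.dumps(annotation, indent=2))
```

A record like this is what an annotation server would store and exchange via the Web Annotation Protocol; the sketch itself does not send anything over HTTP.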

Web Annotation Standard
• What does this mean for end users?
– Annotation: a set of connected resources, typically incl. a
body and target – the body is related to the target.
– No more comment widgets and silos!
– Annotation capability can be built natively into the browser
– Conversations can take place anywhere on the web and in
a standards-based way
• Why is this different?
– Annotations can live separately from documents and are
reunited and re-anchored in real-time
– Annotations are under the control of the user
– Users can form communities (across HTML, PDF etc.)
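How an annotation that lives separately from its document might be re-anchored can be sketched as follows. This is a simplified illustration, not the algorithm of any particular tool: it assumes a TextQuoteSelector with the exact quote plus optional prefix and suffix, and the function name reanchor is hypothetical.

```python
def reanchor(document_text, selector):
    """Return (start, end) of the quoted text in the document, or None."""
    exact = selector["exact"]
    prefix = selector.get("prefix", "")
    suffix = selector.get("suffix", "")

    # First try to find the quote together with its surrounding context.
    pos = document_text.find(prefix + exact + suffix)
    if pos >= 0:
        start = pos + len(prefix)
        return start, start + len(exact)

    # Fall back to the quote alone if the surrounding context has changed.
    pos = document_text.find(exact)
    if pos >= 0:
        return pos, pos + len(exact)
    return None


selector = {
    "type": "TextQuoteSelector",
    "exact": "Berlin",
    "prefix": "Welcome to ",
    "suffix": " in",
}
old_text = "Welcome to Berlin in 2019."
new_text = "Hello! Welcome to Berlin in 2019."  # the document was edited

print(reanchor(old_text, selector))
print(reanchor(new_text, selector))
```

Because the selector stores the quoted text rather than a fixed position, the annotation can still be attached after the document has changed, as long as the quote itself survives.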

Hypothes.is Statistics
December 2018: 4.4 Million Annotations and Counting
[Chart: annotation counts over time, January 2015 to December 2018, y-axis from 20K to 260K; series: public, private, in groups (shared), in groups (private)]

The Hypothes.is Tool
• Private notes
• Public annotations
• Collaboration groups
• Linked Data connections
• Cross format:
– HTML
– PDF
– EPUB
– Data
• Community driven
• Open Source

Open Groups

Errata and Corrections

ADA: American Diabetes Association
• Wanted a way to update content and add information links
• Needed to restrict use to ADA staff

Peer Review

Automated Annotation
Automated systems can tag elements such as RRIDs (Research Resource Identifiers) and other scholarly identifiers or entities, allowing navigation to background information and powerful search queries through other papers mentioning the same entity.
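A very simple version of such automated tagging can be sketched with a regular expression. The pattern is an assumption about what RRIDs typically look like ("RRID:" followed by an authority prefix, an underscore and an identifier), and the two identifiers in the sample sentence are invented placeholders, not real resources.

```python
import re

# Assumed shape of an RRID; real-world identifiers may need a richer pattern.
RRID_PATTERN = re.compile(r"RRID:\s?([A-Za-z]+_[A-Za-z0-9_:-]*[A-Za-z0-9])")


def tag_rrids(text):
    """Return one record per RRID mention, with character offsets."""
    return [
        {
            "exact": match.group(0),   # the mention as written in the text
            "rrid": match.group(1),    # the identifier without the "RRID:" part
            "start": match.start(),
            "end": match.end(),
        }
        for match in RRID_PATTERN.finditer(text)
    ]


# Placeholder identifiers, made up for this example.
sample = (
    "Cells were stained with an antibody (RRID:AB_0000001) "
    "and analysed with an imaging tool (RRID:SCR_0000002)."
)
for tag in tag_rrids(sample):
    print(tag)
```

Each record carries the offsets needed to turn the mention into the target of an annotation that links to background information on the resource.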

User Profiles

Use anywhere on the web

Annotations and AI

Data Intelligence
Current breakthroughs based on Machine Learning (“Deep Learning”)
Also still in use: symbolic, rule-based methods and expert systems
Artificial Intelligence
Huge data sets + powerful learning algorithms + very fast hardware

Annotations and AI
• Modern AI is data-driven – supervised learning relies
on annotated data sets.
• However, certain AI algorithms can learn structure and
patterns without any annotations whatsoever.
• The relevance of annotations has increased dramatically.
• This is especially true for very large annotated data sets.
• Many consist of primary data and secondary annotations.
• Companies have emerged that produce annotated data
sets using crowd-workers (e.g., Figure Eight, Crowdee)
• Key question: how detailed, relevant, correct, meaningful
and reliable are these annotations really?

Annotations and Events
• Likes and Favs (user-driven annotation, action)
• Five-star ratings (user-driven annotation, action)
• Online comments (user-driven annotation, action)
• Online reviews (user-driven annotation, action)
• Clicking an article headline/link (user-initiated event, action)
• Reading an ebook (user-initiated event, action)
– Page turns in ebooks are measured – when slow: “boredom”, “disinterest”
– Next time in the ebook store you’re getting adjusted recommendations
• No longer reading an ebook (user-initiated event, non-action)
– Boring chapters where people throw in the towel can be easily identified
– (Brave new) future: use automatic paraphrasing to re-write the chapter
– Or maybe NLG and A/B tests – then it’s the original author vs. the machine

Annotations in
Computational
Linguistics

Annotations in CL
• Diverse and specialised tool landscape
http://annotation.exmaralda.org/index.php?title=Linguistic_Annotation
• Diverse and specialised format landscape:
TEI, NIF, NAF, LAF, TIGER, STTS, FoLiA
and many, many others
• From trivial to extremely complex annotation schemes
• From low inter-annotator agreement scores to high ones
• From flexible tools to highly specialised tools
• From very high quality annotations to very low ones
• A brief look at a few tools …

Exmaralda

Praat

ELAN

brat

WebAnno

Annis

Annotations in
Language Technology

Language Technology
• Language Technology transfers theoretical results from
language-oriented research into technologies and
applications that are ready for production use.
• Uses results from, e.g.:
– Artificial Intelligence
– Computer Science
– Computational Linguistics
– Natural Language Processing
– Psychology, Psycholinguistics
– Cognitive Science
Example Applications
• Spell checkers
• Dictation systems
• Translation systems
• Search engines
• Report generation
• Expert systems
• Dialogue systems
• Text summarisers

Web Annotation Architecture
The relationship between
Web Annotations
and Language Technology
on a rather general level.

Content could be created by Language
Technology fully automatically or in a
semi-automatic way (text generation)

Content could be analysed by
Language Technology (semantic
analysis, input for ML algorithms etc.)

Especially in Social Media Analytics
we are interested in UGC, i.e., in
comments, feedback – “what do
users think of a certain product?”.

• Analysing UGC is difficult and
costly (many heterogeneous
sources, many different formats)
• A few established and widely used
Web Annotation services would
simplify SMA dramatically!

We can also use LT methods to
create or help create annotations,
e.g., in smart authoring scenarios.

LT and Web Annotations
• Analysis of web annotations and exploiting web
annotations through Language Technology:
– Arbitrary web annotations (i.e., unstructured text)
• No more crawling, aggregating, mapping!
– Dedicated LT-specific web annotations
• Annotating language data without any specialised
stand-alone tools or data repositories!
• Generation of web annotations through Language
Technology (e.g., to provide background information on
important content). Example: Content semantification.

Platform for Digital Curation Technologies
[Architecture diagram. Components: Broker REST API; Curation Service 1 and 2; External Service 1 and 2; clients that use the API; a curation workflow with input and output.]
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos/> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
# The dfkinif: prefix is used below; its declaration does not appear on the slide.
# Resource 1: the document context (the whole string)
<http://link.omitted/documents/document1#char=0,26>
    a nif:RFC5147String , nif:String , nif:Context ;
    nif:beginIndex "0"^^xsd:nonNegativeInteger ;
    nif:endIndex "26"^^xsd:nonNegativeInteger ;
    nif:isString "Welcome to Berlin in 2016. "^^xsd:string ;
    dfkinif:averageLatitude "52.516666666666666"^^xsd:double ;
    dfkinif:averageLongitude "13.383333333333333"^^xsd:double ;
    dfkinif:stdDevLatitude "0.0"^^xsd:double ;
    dfkinif:stdDevLongitude "0.0"^^xsd:double ;
    nif:meanDateRange "20160101010000_20170101010000"^^xsd:string .
# Resource 2: the temporal expression "2016" (its subject IRI does not appear on the slide)
    a nif:RFC5147String , nif:String ;
    itsrdf:taIdentRef <http://link.omitted/ontologies/nif#date=20160101000000_20170101000000> ;
    nif:anchorOf "2016"^^xsd:string ;
    nif:entity <http://link.omitted/ontologies/nif#date> .
# Resource 3: the location "Berlin", linked to DBpedia
<http://link.omitted/documents/#char=11,17>
    nif:anchorOf "Berlin"^^xsd:string ;
    itsrdf:taClassRef <http://dbpedia.org/ontology/Location> ;
    nif:referenceContext <http://link.omitted/documents/#char=0,26> ;
    geo:lat "52.516666666666666"^^xsd:double ;
    geo:long "13.383333333333333"^^xsd:double ;
    itsrdf:taIdentRef <http://dbpedia.org/resource/Berlin> .
NLP Interchange
Format (NIF)
“Welcome to Berlin in 2016.”
• RDF/OWL-based format for NLP applications
• Enables interoperability
• Pure RDF and, hence, “natural” integration of Linked Data
• Developed by Universität Leipzig
• Besides NIF, the platform also supports Web Annotations
Digital Curation Technologies:
Prototypically implemented Platform and Services
Peter Bourgonje, Julian Moreno-Schneider, Jan Nehring, Georg Rehm, Felix Sasaki, and Ankit Srivastava.
“Towards a Platform for Curation Technologies: Enriching Text Collections with a Semantic-Web Layer.” In
Harald Sack, Giuseppe Rizzo, Nadine Steinmetz, Dunja Mladenić, Sören Auer, and Christoph Lange,
editors, The Semantic Web, number 9989 in LNCS, pages 65-68. Springer, June 2016. ESWC 2016
Satellite Events. Heraklion, Crete, Greece, May 29 - June 2, 2016 Revised Selected Papers.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos/> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
# The dfkinif: prefix is used below; its declaration does not appear on the slide.
# Resource 1: the document context (its subject IRI does not appear on the slide)
    a nif:RFC5147String , nif:String , nif:Context ;
    nif:isString "Welcome to Berlin in 2019. "^^xsd:string ;
    dfkinif:averageLatitude "52.516666666666666"^^xsd:double ;
    dfkinif:averageLongitude "13.383333333333333"^^xsd:double ;
    dfkinif:stdDevLatitude "0.0"^^xsd:double ;
    dfkinif:stdDevLongitude "0.0"^^xsd:double ;
    nif:meanDateRange "20190101010000_20200101010000"^^xsd:string .
# Resource 2: the temporal expression "2019" (its subject IRI does not appear on the slide)
    itsrdf:taIdentRef <http://link.omitted/ontologies/nif#date=20190101000000_20200101000000> ;
    nif:anchorOf "2019"^^xsd:string ;
    nif:entity <http://link.omitted/ontologies/nif#date> .
# Resource 3: the location "Berlin", linked to DBpedia
<http://link.omitted/documents/#char=11,17>
    nif:anchorOf "Berlin"^^xsd:string ;
    itsrdf:taClassRef <http://dbpedia.org/ontology/Location> ;
    nif:referenceContext <http://link.omitted/documents/#char=0,26> ;
    geo:lat "52.516666666666666"^^xsd:double ;
    geo:long "13.383333333333333"^^xsd:double ;
    itsrdf:taIdentRef <http://dbpedia.org/resource/Berlin> .
NLP Interchange
Format (NIF)
“Welcome to Berlin in 2019.”
• RDF/OWL-based format for NLP
applications
• Enables interoperability
• Pure RDF and, hence, natural
integration of Linked Data data
• Developed by Universität Leipzig
• Our platform also supports Web
Annotation data model
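The "#char=begin,end" fragments in the NIF example follow a simple offset scheme. The sketch below shows how such an identifier could be derived for a mention in the example sentence. The document URI and the function name nif_string_uri are assumptions for illustration; the code is not part of the platform.

```python
def nif_string_uri(document_uri, text, mention, start_at=0):
    """Return (uri, begin, end) for the first occurrence of mention in text.

    Offsets are zero-based character positions; end is exclusive.
    Raises ValueError if the mention does not occur in the text.
    """
    begin = text.index(mention, start_at)
    end = begin + len(mention)
    return f"{document_uri}#char={begin},{end}", begin, end


text = "Welcome to Berlin in 2019. "
doc = "http://example.org/documents/doc1"  # hypothetical document URI

for mention in ("Berlin", "2019"):
    uri, begin, end = nif_string_uri(doc, text, mention)
    print(uri, repr(text[begin:end]))
```

For "Berlin" this yields the same offsets as in the slide (char=11,17), so annotations produced by different services can refer to exactly the same piece of text.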

Sector: Journalism
Julian Moreno-Schneider, Ankit Srivastava, Peter Bourgonje, David Wabnitz, and Georg Rehm. “Semantic Storytelling, Cross-lingual Event Detection and other Semantic Services for a Newsroom Content Curation Dashboard.” In Octavian Popescu and Carlo Strapparava, editors, Proceedings of Natural Language Processing meets Journalism - EMNLP 2017 Workshop (NLPMJ 2017), Copenhagen, Denmark, September 2017. 7. September.

Sector: TV, Web-TV, Media
Georg Rehm, Julián Moreno Schneider, Peter Bourgonje, Ankit Srivastava, Rolf Fricke, Jan Thomsen, Jing He, Joachim Quantz, Armin Berger, Luca König, Sören
Räuchle, Jens Gerth, and David Wabnitz. “Different Types of Automated and Semi-Automated Semantic Storytelling: Curation Technologies for Different Sectors”.
In Georg Rehm and Thierry Declerck, editors, Language Technologies for the Challenges of the Digital Age: 27th International Conference, GSCL 2017, Berlin,
Germany, September 13-14, 2017, Proceedings, number 10713 in Lecture Notes in Artificial Intelligence (LNAI), pages 232-247, Cham, Switzerland, January 2018.
Gesellschaft für Sprachtechnologie und Computerlinguistik e.V., Springer. 13/14 September 2017.

Annotations for a
Credible Web

Viral Content and Filter Bubbles
• Content is often published without checking its validity,
discovered through social media and, if it appears
relevant, shared immediately.
• Content is often shared without reading it.
• Goal: virality ➟ reach ➟ clicks ➟ ad revenue
• Not all “journalistic” content (or publishing outlets) is really
committed to reporting the facts.
• Nowadays the burden of fact-checking is with the readers.
• “Fake news”: label for several classes of online content.
• Can we balance out filter bubble and network effects?
Georg Rehm. “An Infrastructure for Empowering Internet Users to handle Fake News and other Online Media Phenomena”. In Georg
Rehm and Thierry Declerck, editors, Language Technologies for the Challenges of the Digital Age: Proceedings of the GSCL
Conference 2017, Berlin, September 2017. Gesellschaft für Sprachtechnologie und Computerlinguistik e.V. 13.-15. September 2017.

Seven classes of false news
• Satire or parody
• Wrong connection or relation: when title and photos don't support the content
• Misleading content: use of information to put someone or something in a bad light
• Wrong context: when genuine content is presented in the wrong context
• Deceiving content: imitation of real sources
• Bad content with a clear purpose to deceive
• Fabricated content: completely untrue, produced to deceive
Characteristics
Clickbait X X ? ? ?
Disinformation X X X X
Political bias ? X ? ? X
Bad journalism X X X
Publisher's intention
Parody X ? ?
Provocation X X X
Profit ? X X X
Deception X X X X X X
Influence politics X X X X
Influence politics X X X X X
Different classes of false news and their individual characteristics and intentions
(based on Wardle, 2017; Walbrühl, 2017; Rubin et al., 2015; Holan, 2016; Weedon et al., 2017)

[Diagram: infrastructure for annotating a website with content]
• Tool1, Tool2, Tool3: detection of hate speech, classifying content for its political spectrum, fact checker
• Decentral filters process content automatically and send results to the browser (important: multilingualism)
• Web Annotations DB1, DB2, DB3, DB4: decentral repositories store all annotations
• Browser has native support for the infrastructure and aggregates the different scores, messages and values into messages or warnings regarding the content
• Other users: other users' annotations
• Types of annotations:
– UGA: User-generated annotations (free text)
– UGM: User-generated metadata (standardised)
– MGM: Machine-generated metadata (standardised)
• Example: user rates the content quality regarding a standardised schema
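The aggregation step in the browser could look roughly like the sketch below. It is only an illustration of the idea: the Signal record, the score convention (0.0 = unproblematic, 1.0 = problematic), the averaging and the threshold are all assumptions, not part of the proposed infrastructure.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Signal:
    kind: str     # "MGM" (machine-generated) or "UGM" (user-generated) metadata
    source: str   # which tool or user produced the value
    label: str    # what was assessed, e.g. "hate speech"
    score: float  # assumed convention: 0.0 unproblematic .. 1.0 problematic


def aggregate(signals, threshold=0.5):
    """Average the scores per label and return warnings above the threshold."""
    by_label = {}
    for signal in signals:
        by_label.setdefault(signal.label, []).append(signal.score)

    warnings = []
    for label, scores in sorted(by_label.items()):
        average = mean(scores)
        if average >= threshold:
            warnings.append(f"{label}: {average:.2f} (from {len(scores)} source(s))")
    return warnings


signals = [
    Signal("MGM", "Tool1", "hate speech", 0.9),
    Signal("UGM", "user42", "hate speech", 0.7),   # hypothetical user rating
    Signal("MGM", "Tool3", "failed fact check", 0.2),
]
print(aggregate(signals))
```

A plain average treats every source alike; weighting by the trustworthiness of tools and users would be an obvious refinement.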

• Infrastructure as a native part of the web
• Necessary for that: support and buy-in from all
browser vendors, media publishers and standards
• All users need immediate access

Tools analyse
automatically

• Automatic results and free text
annotations are stored as Web
Annotations.
• Users make their annotations
available to one another.

• Automatic analysis of free text
annotations (NLP, IE, RE etc.).
• Extraction of opinions, arguments,
claims, statements etc.

• Standardised metadata schemas for efficient annotations,
e.g. “content is intentionally deceptive.”
• W3C Provenance Ontology, Schema.org (ClaimReview).
• To be used by the human and the machine
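A piece of standardised user-generated metadata could, for instance, be expressed with the Schema.org ClaimReview type. The sketch below builds such a record as JSON-LD. The rating scale (1 to 5), the example claim, the URLs and the helper name claim_review are invented for illustration.

```python
import json


def claim_review(claim, rating_value, rating_label, reviewed_url, reviewer):
    """Build a Schema.org ClaimReview record as a plain dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "ClaimReview",
        "claimReviewed": claim,
        "itemReviewed": {"@type": "CreativeWork", "url": reviewed_url},
        "author": {"@type": "Person", "name": reviewer},
        "reviewRating": {
            "@type": "Rating",
            "ratingValue": rating_value,
            "worstRating": 1,   # assumed scale: 1 = false
            "bestRating": 5,    # assumed scale: 5 = true
            "alternateName": rating_label,
        },
    }


review = claim_review(
    claim="The moon is made of cheese.",             # made-up claim
    rating_value=1,
    rating_label="content is intentionally deceptive",
    reviewed_url="http://example.org/article.html",  # hypothetical article
    reviewer="Example User",
)
print(json.dumps(review, indent=2))
```

Because the values come from a fixed schema rather than free text, the same record is readable for a person and directly usable by a machine.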

Goal: provide technologies to the user, with which
they can consume, assess, analyse, verify and
process digital content and media in a better way and
that indicate which contents may be problematic.

Web Annotation + Fake News
• Crowd-sourced Web Annotation content in combination
with a set of automatic analysis tools has enormous
potential to tackle online misinformation campaigns.
• Big impact if deployed widely and implemented correctly.
• However, there’s a danger to shift the point of attack that
misinformation campaigns exploit (to annotations).
• The Credibility Coalition has developed a similar
approach in parallel, see, e.g.,
https://web.hypothes.is/blog/annotation-powered-questionnaires/

Annotations and
Open Science

Open Science
• Movement to make scientific research, data
and dissemination accessible to all levels of
an inquiring society, amateur or professional.
• Encompasses practices such as
publishing open research, campaigning
for open access, encouraging scientists to
practice open notebook science, and
generally making it easier to publish and
communicate scientific knowledge.
• Connection to: annotations, research data
(corpora, LRs), semantics, knowledge,
linked data, repositories and other topics.
https://en.wikipedia.org/wiki/Open_science

Open Science Taxonomy

Annotations & Open Science
• Open Science will soon become the norm and goal in
data-intensive science
• Important aspects: interoperability, reproducibility, open
documentation of experiments, use of standards etc.
• Trend: open tools, open workflows, open data sets
• Annotations are an important and crucial piece of the
puzzle, especially documented, meaningful annotations
• Relevant initiatives: NFDI, EOSC
• Relevant principle: FAIR

FAIR Principles
• TO BE FINDABLE:
– F1 (meta)data are assigned a globally unique and eternally persistent identifier.
– F2 data are described with rich metadata.
– F3 (meta)data are registered or indexed in a searchable resource.
– F4 metadata specify the data identifier.
• TO BE ACCESSIBLE:
– A1 (meta)data are retrievable by their identifier using a standardized protocol.
– A1.1 the protocol is open, free, and universally implementable.
– A1.2 the protocol allows for an authentication and authorization procedure.
– A2 metadata are accessible, even when the data are no longer available.
• TO BE INTEROPERABLE:
– I1. (meta)data use a formal, accessible, shared, and broadly applicable language for
knowledge representation.
– I2. (meta)data use vocabularies that follow FAIR principles.
– I3. (meta)data include qualified references to other (meta)data.
• TO BE RE-USABLE:
– R1. (meta)data have a plurality of accurate and relevant attributes.
– R1.1 (meta)data are released with a clear and accessible data usage license.
– R1.2 (meta)data are associated with their provenance.
– R1.3 (meta)data meet domain-relevant community standards.

Open Science and … Science
• Open Science approaches recommend the use of standards
• Only standardised data and metadata are truly interoperable
• BUT fundamental research is about inventing NEW things
• This contradicts the use of standards as the consensus that
was reached within a specific community
• However, it does NOT contradict the use of established tools
and best practice approaches
• Neither does it contradict the modification of standards
• At the end of the day, it’s about semantics & documentation
• If an established, standardised approach does not work for a
new piece of research, invent a new approach or get creative!

Annotation of Documents
• Open Science will be transforming research, making it
more sustainable, more visible, more transparent
• Substantially improved digital infrastructures
• This will, soon, include the annotation of documents,
starting with scientific publications (Web Annotation)
• First steps towards Open Peer Review (cf. arxiv.org)
• Trend: micro-publications (esp. for incremental research)
• Will the scientific paper continue to be the atomic unit?
• Important relevant initiative: ORKG

ORKG
• Vision driven forward by Sören Auer (TIB Hannover)
• Exchange of scholarly knowledge is primarily
document-based: researchers produce articles (online
or offline) as coarse-grained text documents.
• Transform this predominant paradigm into knowledge-
based information flows by representing and expressing
knowledge through semantically rich, interlinked graphs.
• Sören Auer et al. (2018): “Towards an Open Research Knowledge Graph”.
https://doi.org/10.5281/zenodo.1157185

Interlinking of Concepts
[Figure from Auer et al. (2018): interlinking of interdisciplinary and subject-specific concepts and artefacts of scientific work in the different domains (here: TIB subject areas). Labels in the figure: Linked Open Data Cloud, Semantic Web Standards, Persistent Identifiers, GND, European Open Science Cloud.]
Annotations and Markup

Annotations and Markup
• Complex topic – we can only scratch the surface
• XML is – unfortunately – considered “done” within W3C; all senior XML specialists have left the organisation.
• https://www.balisage.net/Proceedings/vol21/html/Tovey01/BalisageVol21-Tovey01.html
– Discussion on the trend from declarative to procedural (!)
markup – there’s stagnation in the markup world.
• Relevant and timely: https://markupdeclaration.org
• Markup is not dead – there’s a small but active and
passionate community.

Dimensions of
Annotations

Annotations
• Annotation – Definition:
Secondary data added to a piece of primary data –
in science, this is, often, research data.
• The secondary data is, typically, a property of part of the
primary research data.
• Let’s examine this a bit more closely.

Annotations
[Diagram: a sample text (“Lorem ipsum dolor sit amet, consectetur adipiscing elit, ...”) in which part of the text carries a property. The property has a label, a value and a pointer to an annotation schema; the annotation schema (possibly external) may constrain or restrict the property.]
Label of property
• Examples: lemma, part of speech, instance-of etc.
• What is the conceptual nature of this property? Is it best practice in research or can it be entirely made up?
• How many colleagues in the community agree on it?
• Is the label adequate and self-explanatory?

Annotations
Value of property
• Examples: adjective, JJ, object, “some free text comment” etc.
• The actual annotation payload
• Is the value free text or taken from a shared vocabulary?
• Is the shared vocabulary prescribed by an annotation schema or ontology?
• How many colleagues in the community agree on the value?
• How many colleagues in the community agree on the shared vocabulary?

Annotations
Many annotations
• Is there structure among the different properties?
• Markup languages, markup grammars
• Syntactic structure
– Ex.: “HVBXJ” => “AHXB”, “HKVZ”
• Semantic, i.e., logical structure
– Ex.: “NP” => “DET”, “N”
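The dimensions discussed on these slides (label, value, pointer to a schema that may constrain values and structure) can be sketched as a small data structure. All names and the schema URI are hypothetical; the sketch only shows one possible way of making the constraints explicit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotationSchema:
    uri: str
    allowed_values: dict   # property label -> set of permitted values
    content_model: dict    # parent value -> required sequence of child values


@dataclass(frozen=True)
class Annotation:
    start: int        # offsets into the primary data
    end: int
    label: str        # label of the property, e.g. "part of speech"
    value: str        # value of the property, e.g. "JJ"
    schema_uri: str   # pointer to the annotation schema


def is_valid(annotation, schema):
    """Check that the value is permitted for this label by the schema."""
    if annotation.schema_uri != schema.uri:
        return False
    allowed = schema.allowed_values.get(annotation.label)
    return allowed is not None and annotation.value in allowed


def structure_ok(parent_value, child_values, schema):
    """Check a structural rule such as "NP" => "DET", "N"."""
    return schema.content_model.get(parent_value) == list(child_values)


schema = AnnotationSchema(
    uri="http://example.org/schema/syntax",   # hypothetical schema location
    allowed_values={"part of speech": {"JJ", "DET", "N"}, "phrase": {"NP"}},
    content_model={"NP": ["DET", "N"]},
)
print(is_valid(Annotation(0, 5, "part of speech", "JJ", schema.uri), schema))
print(structure_ok("NP", ["DET", "N"], schema))
```

A free-text value would simply have no entry in allowed_values; whether that counts as invalid or as unconstrained is a design decision that the annotation schema has to document.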

Annotating Annotations
Annotations on annotations (just a few selected points)
• Source (machine vs. single human vs. crowd-sourced)
• Application scenario: annotations for human vs. machine consumption
• Purpose or scope of the annotation (e.g., document structure, layout or
style, semantics, rhetorical structure, linguistic properties etc.)
– Can the structure be made explicit by the annotation format,
maybe via a markup language’s grammar?
– Can structure be made explicit through an ontology
that is put on top of the individual properties?
• Confidence value
• Quality indicator (0..1)
• Time added, time modified (timestamp)
• Style information – how annotations are rendered
• Annotation layers – one or multiple layers, independent or interrelated?
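Several of these points can be recorded as metadata attached to each annotation. The following sketch is one possible shape for such a record; the field names and the permitted values for source are assumptions derived from the list above.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _now():
    """Current time as an ISO 8601 timestamp in UTC."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class AnnotationMetadata:
    source: str        # "machine", "human" or "crowd"
    audience: str      # annotations for "human" or "machine" consumption
    purpose: str       # e.g. "document structure", "linguistic properties"
    layer: str         # name of the annotation layer
    confidence: float  # confidence value, 0..1
    quality: float     # quality indicator, 0..1
    created: str = field(default_factory=_now)
    modified: Optional[str] = None

    def __post_init__(self):
        if self.source not in {"machine", "human", "crowd"}:
            raise ValueError(f"unknown source: {self.source}")
        for name in ("confidence", "quality"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1")


meta = AnnotationMetadata(
    source="crowd",
    audience="machine",
    purpose="linguistic properties",
    layer="part of speech",
    confidence=0.8,
    quality=0.9,
)
print(asdict(meta))
```

Style information and relations between layers are left out here; they would need references to other records rather than simple values.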

Evaluation of Annotations
• Measuring inter-annotator agreement
• Measuring intra-annotator agreement – what if the same
person does the same annotation task again after a
week or a month?
• Test replicability and reproducibility
• Important exercise for:
– Emerging annotation formats
– Complex annotation exercises
– Measuring consensus
– Making sure that terms and labels are meaningful
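One common way of measuring agreement between two annotators is Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The slides do not name a specific measure, so the choice of kappa and the toy labels below are assumptions. The same function can be used for intra-annotator agreement by comparing two passes of the same person.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")

    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:   # both annotators used one and the same label only
        return 1.0
    return (observed - expected) / (1 - expected)


annotator_1 = ["JJ", "N", "N", "DET", "N", "JJ"]
annotator_2 = ["JJ", "N", "JJ", "DET", "N", "N"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Here the annotators agree on four of six items, but because part of that agreement is expected by chance, kappa is clearly lower than the raw agreement rate.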

Complexity of Annotations
• In (Computational) Linguistics we’ve designed some
fairly detailed annotation formats in the last 30 years.
• In contrast, many modern data sets (especially for data-
driven AI approaches in NLP) are quite shallow.
• AI classifiers need enormous amounts of data and just a
few high-level labels.
• It’s not feasible and too expensive to annotate data with
complex and sophisticated annotation formats.
• Is NLP/AI research forgetting annotation principles?
• Are we dumbing down linguistics to the simple
annotation of trivial labels?
• Has annotation research perhaps become obsolete?

Complexity of Annotations
• Example: GermEval 2018 data set
Tweet label, tweet label, tweet label etc.
• There is no structure, no concretisation, no hierarchical information, no additional metadata
• Two observations:
– there’s a trend towards simply more annotations, i.e., increased quantity while ignoring quality, complexity and structure – complex annotations are expensive and difficult to generalise from.
– there’s a trend towards dumb annotations, which are often crowd-sourced – it’s easier to generalise from simple than from structured, hierarchical annotations.

Summary and
Conclusions

Summary
• Annotations: from trivial to very complex
• From experimental to highly (de facto) standardised
• Annotations of annotations
• Multi-layer annotations – independent or interrelated
• Interoperability and reusability through standards
• But: standards vs. flexibility – basic science vs. applied
• Nowadays, annotations usually happen on the web
• Powerful stack of W3C technologies:
Web Annotation, Semantic Web, Linked Data, XML
• Web-scale annotations for scholarly publishing
• Annotations for Open Science

Summary
• Language Technology …
• … to automate the generation of annotations
– Semantification of journalistic/media content
– Semantification of scientific content
• … to automate the analysis of annotations
– Annotations for Open Science
• … to restore credibility and trust in the media
• In AI, annotations in data sets are often trivial
– Trend towards simply more and more annotations
– Trend towards more and more simple annotations

Annotating Annotations
• Different Dimensions of Annotations
• Is it possible to tie all dimensions together in a compact,
machine-readable way to describe and document an
annotation project?
– Complexity
– Semantics
– Source
– Impact
– Standard
– Research Question
– Methodology
– …
• Relevant for Open Science
• Relevant for interoperability
• Relevant for search & retrieval
• Relevant for reproducibility
• Relevant for evaluation
• Relevant for documentation & repos
• Relevant for good scientific practice
• … but maybe this is all too complicated
because a scientific paper already
does the trick in an established way?
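
One way to picture such a compact, machine-readable description is a small record with one field per dimension listed above, serialised to JSON. This is only a sketch under stated assumptions: the class name, the field names and all example values are hypothetical, and no existing metadata standard is implied.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AnnotationProjectDescriptor:
    """One field per dimension from the slide (hypothetical schema)."""
    name: str
    complexity: str
    semantics: str
    source: str
    impact: str
    standard: str
    research_question: str
    methodology: str
    # Open slot for further dimensions (the "…" on the slide).
    extra: dict = field(default_factory=dict)

    def to_json(self):
        """Serialise the descriptor so it can be stored next to the data."""
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


# All values below are invented for illustration.
descriptor = AnnotationProjectDescriptor(
    name="example-project",
    complexity="flat, one label per item",
    semantics="offensive language",
    source="crowd-sourced",
    impact="shared task training data",
    standard="none",
    research_question="Can offensive posts be detected automatically?",
    methodology="two annotators per item, majority vote",
)
print(descriptor.to_json())
```

Such a file could be deposited in a repository alongside the annotated data, complementing the paper rather than replacing it.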

Four Quadrant Diagram
[Figure: four-quadrant diagram. One axis runs from basic research to applications and solutions, the other from humanities research to computer science and ICT research. The positions of the marked points (X) are not recoverable from the extracted text; their labels are listed below.]
• No need for standardisation; no need to use standards
• Clear need to use standards for maximum adoption
• Avantgarde formats, weird phenomena, weird needs, expressibility
• Performance, standards, interoperability
• Markup, formal languages, querying, overlap
• AI
• Digital Humanities
• Number of users: rather small / rather high
This diagram is work in progress.

Thank you!
Dr. Georg Rehm
Principal Researcher and Research Fellow
Speech and Language Technology Lab
DFKI, Berlin, Germany
• georg.rehm@dfki.de
• http://georg-re.hm
• http://de.linkedin.com/in/georgrehm
• https://www.slideshare.net/georgrehm
With many thanks to (in alphabetical order):
• Ivan Herman (W3C, The Netherlands)
• Heather Staines, Jon Udell, Dan Whaley (Hypothes.is, USA)

