
Observations on Annotations – From Computational Linguistics and the World Wide Web to Artificial Intelligence and back again

Georg Rehm. Observations on Annotations – From Computational Linguistics and the World Wide Web to Artificial Intelligence and back again. Annotation in Scholarly Editions and Research: Function – Differentiation – Systematization, University of Wuppertal, Germany. February 20-22, 2019. Invited keynote talk.

  1. Georg Rehm, German Research Center for Artificial Intelligence (DFKI) GmbH. Annotation in scholarly editions and research, Bergische Universität Wuppertal, 21 February 2019. Observations on Annotations: From Computational Linguistics and the World Wide Web to AI and back again
  2. Annotation: Personal Background. Computational Linguistics and AI (since 1992); SGML and TEI (since 1995); XML (since 1998), XSLT, XPath and several others; corpus annotation formats; hypertext and text linguistics; web technologies, W3C, markup languages; W3C Office Germany/Austria (since 2013); AI and Language Technology development (since 2009); infrastructures and platforms; service deployment; research data; language resources; metadata; data formats; Open Science
  3. Introduction • Annotations have played an important role in Computational Linguistics and related fields (especially Digital Humanities) for decades. • This talk: recent examples, lessons learned and some general observations on annotations. • My own research in this area (since approx. 1996) ranges from basic and applied research to innovation and technology development.
  4. Outline • Annotations – brief definition • World Wide Web • Annotations and AI • Annotations and Computational Linguistics • Annotations and Language Technology • Annotations for a Credible Web • Annotations and Open Science • Annotations and Markup • Dimensions of Annotations • Summary and Conclusions
  5. Annotations: a brief definition
  6. Annotations • Definition/“Definition”: Secondary data added to a piece of primary data – in science, this is often research data. • Wikipedia: An annotation is a metadatum (e.g., a post, explanation, markup) attached to [a?] location or other data. (http://www.merriam-webster.com)
  7. • Literature and education: – Textual scholarship: Textual scholarship is a discipline that often uses the technique of annotation to describe or add additional historical context to texts and physical documents. – Learning and instruction: As part of guided noticing, [annotation] involves highlighting, naming or labelling and commenting aspects of visual representations to help focus learners' attention on specific visual aspects. In other words, it means the assignment of typological representations (culturally meaningful categories) to topological representations (e.g. images). • Software engineering: – Text documents: Markup languages like XML and HTML annotate text in a way that is syntactically distinguishable from that text. They can be used to add information about the desired visual presentation, or machine-readable semantic information, as in the Semantic Web. • Linguistics: – In linguistics, annotations include comments and metadata; these non-transcriptional annotations are also non-linguistic.
  8. World Wide Web
  12. “Vague but exciting” – Information Management: A Proposal. Tim Berners-Lee, CERN, March 1989, May 1990: “Private links. One must be able to add one's own private links to and from public information. One must also be able to annotate links, as well as nodes, privately.”
  13. World Wide Web Consortium • W3C is an international non-profit, member-financed standards developing organisation • Founded in 1994 by Sir Tim Berners-Lee • Currently 451 members – 23 in Germany/Austria • Approx. 60 staff (ERCIM, MIT, Keio University, Beihang University) • Approx. 20 offices in important regions • The W3C Office Germany/Austria is run by DFKI • Open Web Platform, HTML5, CSS, Credible Web, Digital Publishing, Linked Data etc. • http://w3.org and http://w3c.de • Interested in joining? Talk to me!
  14. Relevant W3C Standards • XML – Extensible Markup Language: extremely influential, widely adopted, the basis of TEI and many other languages • Semantic Web – RDF, OWL, SPARQL, SKOS etc. • Digital Publishing – new versions of EPUB • Web Annotation Data Model and Vocabulary (https://www.w3.org/2001/10/03-sww-1/slide7-0.html)
  15. Web Annotation
  16. Web Annotations • Web Annotation – three W3C Recommendations • Most popular and relevant implementation: Hypothes.is – a mission-driven, non-profit Open Source company – main focus on scholarly publishing (“Annotating All Knowledge Coalition”) – very active and vibrant community • Hypothes.is is the main driving force behind the I Annotate conference series – open proceedings, very interesting programme, diverse speakers from several disciplines – consider attending! – videos of almost all previous events are available online
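
For a feel for the underlying data, Hypothes.is exposes a public search API. A minimal sketch in Python – the endpoint and the response fields reflect the public API as documented at the time of writing and should be treated as assumptions:

    import requests

    # Query public annotations for a given document URI via the
    # Hypothes.is search API (no API token required for public data).
    resp = requests.get(
        "https://api.hypothes.is/api/search",
        params={"uri": "https://www.w3.org/TR/annotation-model/", "limit": 5},
    )
    resp.raise_for_status()

    for row in resp.json().get("rows", []):
        # Each row is one annotation: who wrote it and the comment body.
        print(row.get("user"), "->", row.get("text", "")[:80])
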
  17. Web Annotation Standard (published on 23 February 2017) • Web Annotation Data Model: describes the underlying annotation abstract data model as well as a JSON-LD serialization • Web Annotation Vocabulary: the vocabulary which underpins the Web Annotation Data Model • Web Annotation Protocol: the HTTP API for publishing, syndicating, and distributing Web Annotations
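
To make the three recommendations concrete, here is a hedged sketch of a single annotation in the JSON-LD serialisation of the Web Annotation Data Model, published via the Web Annotation Protocol; the container URL and the annotated page are placeholders:

    import json
    import requests

    # A minimal Web Annotation following the W3C Web Annotation Data Model:
    # a body (the comment) related to a target (the annotated resource),
    # here narrowed down to a text span with a TextQuoteSelector.
    annotation = {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "body": {
            "type": "TextualBody",
            "value": "This definition is worth discussing.",
            "format": "text/plain",
        },
        "target": {
            "source": "http://example.org/page1.html",
            "selector": {
                "type": "TextQuoteSelector",
                "exact": "annotation",
                "prefix": "the term ",
                "suffix": " is used",
            },
        },
    }

    # The Web Annotation Protocol publishes annotations by POSTing them to an
    # annotation container; "http://example.org/annotations/" is a placeholder.
    requests.post(
        "http://example.org/annotations/",
        data=json.dumps(annotation),
        headers={
            "Content-Type": 'application/ld+json; profile="http://www.w3.org/ns/anno.jsonld"'
        },
    )

The body/target split is the central design choice: comment and annotated resource are separate, linked resources, which is what lets annotations live apart from the documents they describe.
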
  18. Web Annotation Standard • What does this mean for end users? – Annotation: a set of connected resources, typically including a body and a target – the body is related to the target. – No more comment widgets and silos! – Annotation capability can be built natively into the browser. – Conversations can take place anywhere on the web and in a standards-based way. • Why is this different? – Annotations can live separately from documents and are reunited and re-anchored in real time. – Annotations are under the control of the user. – Users can form communities (across HTML, PDF etc.).
  20. Hypothes.is Statistics – December 2018: 4.4 million annotations and counting. (Chart: annotation counts from January 2015 to December 2018, broken down into public, private, “in groups, shared” and “in groups, private” annotations.)
  21. The Hypothes.is Tool • Private notes • Public annotations • Collaboration groups • Linked Data connections • Cross-format: HTML, PDF, EPUB, data • Community-driven • Open Source
  22. Open Groups
  23. Errata and Corrections
  24. ADA: American Diabetes Association • Wanted a way to update content and add information links • Needed to restrict use to ADA staff
  25. Peer Review
  26. Automated Annotation. Automated systems can tag elements such as RRIDs (Research Resource Identifiers) and other scholarly identifiers or entities, allowing navigation to background information and powerful search queries through other papers mentioning the same entity.
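
A sketch of how such automated tagging could work in principle; the regular expression and the offsets-based output are illustrative simplifications, assuming RRIDs of the common form “RRID:AB_2138153”:

    import re

    # Rough pattern for Research Resource Identifiers, e.g. "RRID:AB_2138153"
    # or "RRID:SCR_003070"; real-world matching is more involved.
    RRID_PATTERN = re.compile(r"RRID:\s?[A-Z]+_[A-Za-z0-9-]+")

    def tag_rrids(text):
        """Return (identifier, start, end) triples for every RRID-like match,
        which could then be turned into Web Annotation targets."""
        return [
            (m.group(0), m.start(), m.end())
            for m in RRID_PATTERN.finditer(text)
        ]

    sentence = "Antibody staining was done with anti-GFAP (RRID:AB_2138153)."
    # Prints the matched identifier together with its character offsets.
    print(tag_rrids(sentence))
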
  27. User Profiles
  28. Use anywhere on the web
  29. Annotations and AI
  31. Data + intelligence: Artificial Intelligence = huge data sets + powerful learning algorithms + very fast hardware. Current breakthroughs are based on machine learning (“deep learning”), but symbolic, rule-based methods and expert systems are also still in use.
  32. Annotations and AI • Modern AI is data-driven – supervised learning relies on annotated data sets. • However, certain AI algorithms can learn structure and patterns without any annotations whatsoever. • The relevance of annotations has increased dramatically – this is especially true for very large annotated data sets. • Many consist of primary data and secondary annotations. • Companies have emerged that produce annotated data sets using crowd-workers (e.g., Figure Eight, Crowdee). • Key question: how detailed, relevant, correct, meaningful and reliable are these annotations really?
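
As a toy illustration of why annotated data sets matter: a supervised classifier can only be trained once every piece of primary data carries a label. A minimal sketch with scikit-learn (texts and labels are invented):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Primary data (texts) plus secondary data (one label per text) --
    # exactly the annotated-data-set setting that supervised learning needs.
    texts = [
        "What a wonderful, helpful product!",
        "Great service, I am very happy.",
        "Terrible experience, it broke after a day.",
        "Awful support, never again.",
    ]
    labels = ["positive", "positive", "negative", "negative"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(texts, labels)

    # Without the labels above, none of this training would be possible.
    print(model.predict(["very happy with the helpful service"]))
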
  33. Annotations and Events • Likes and favs (user-driven annotation, action) • Five-star ratings (user-driven annotation, action) • Online comments (user-driven annotation, action) • Online reviews (user-driven annotation, action) • Clicking an article headline/link (user-initiated event, action) • Reading an ebook (user-initiated event, action) – page turns in ebooks are measured – when slow: “boredom”, “disinterest” – next time in the ebook store you get adjusted recommendations • No longer reading an ebook (user-initiated event, non-action) – boring chapters where people throw in the towel can easily be identified – (brave new) future: use automatic paraphrasing to re-write the chapter – or maybe NLG and A/B tests – then it's the original author vs. the machine
  34. Annotations in Computational Linguistics
  35. Annotations in CL • Diverse and specialised tool landscape (http://annotation.exmaralda.org/index.php?title=Linguistic_Annotation) • Diverse and specialised format landscape: TEI, NIF, NAF, LAF, TIGER, STTS, FoLiA and many, many others • From trivial annotation schemes to extremely complex ones • From low inter-annotator agreement scores to high ones • From flexible tools to highly specialised tools • From very high quality annotations to very low ones • A brief look at a few tools …
  36. EXMARaLDA
  37. Praat
  38. ELAN
  39. brat
  40. WebAnno
  41. ANNIS
  42. Annotations in Language Technology
  43. Language Technology • Language Technology transfers theoretical results from language-oriented research into technologies and applications that are ready for production use. • Uses results from, e.g.: Artificial Intelligence, Computer Science, Computational Linguistics, Natural Language Processing, Psychology and Psycholinguistics, Cognitive Science. • Example applications: spell checkers, dictation systems, translation systems, search engines, report generation, expert systems, dialogue systems, text summarisers.
  44. Web Annotation Architecture: the relationship between Web Annotations and Language Technology on a rather general level.
  45. Web Annotation Architecture: content could be created by Language Technology fully automatically or in a semi-automatic way (text generation).
  46. Web Annotation Architecture: content could be analysed by Language Technology (semantic analysis, input for ML algorithms etc.).
  47. Web Annotation Architecture: especially in Social Media Analytics we are interested in UGC, i.e., in comments and feedback – “what do users think of a certain product?”
  48. Web Annotation Architecture: analysing UGC is difficult and costly (many heterogeneous sources, many different formats) – a few established and widely used Web Annotation services would simplify SMA dramatically!
  49. Web Annotation Architecture: we can also use LT methods to create or help create annotations, e.g., in smart authoring scenarios.
  50. LT and Web Annotations • Analysis and exploitation of web annotations through Language Technology: – arbitrary web annotations (i.e., unstructured text) – no more crawling, aggregating, mapping! – dedicated LT-specific web annotations – annotating language data without any specialised stand-alone tools or data repositories! • Generation of web annotations through Language Technology (e.g., to provide background information on important content). Example: content semantification – see the sketch below.
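
A sketch of such a semantification step, assuming spaCy with its small English model is installed; every named entity found in the text becomes a simple stand-off annotation that could then be wrapped into a Web Annotation:

    import spacy

    # Assumes: pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    text = "Welcome to Berlin in 2019."
    doc = nlp(text)

    # Turn every recognised entity into a simple stand-off annotation:
    # character offsets into the primary data plus a label as secondary data.
    annotations = [
        {"anchor": ent.text, "begin": ent.start_char, "end": ent.end_char,
         "label": ent.label_}
        for ent in doc.ents
    ]
    print(annotations)
    # e.g. [{'anchor': 'Berlin', 'begin': 11, 'end': 17, 'label': 'GPE'}, ...]

The character offsets are the same stand-off anchors that the NIF example on slide 52 encodes in RDF.
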
  51. Platform for Digital Curation Technologies: prototypically implemented platform and services. A broker with a REST API connects clients with curation services and external services, which are combined into curation workflows (input and output). Besides NIF (see the example on the next slide), the platform also supports Web Annotations. Peter Bourgonje, Julian Moreno-Schneider, Jan Nehring, Georg Rehm, Felix Sasaki, and Ankit Srivastava. “Towards a Platform for Curation Technologies: Enriching Text Collections with a Semantic-Web Layer.” In Harald Sack, Giuseppe Rizzo, Nadine Steinmetz, Dunja Mladenić, Sören Auer, and Christoph Lange, editors, The Semantic Web, number 9989 in LNCS, pages 65-68. Springer, June 2016. ESWC 2016 Satellite Events, Heraklion, Crete, Greece, May 29 – June 2, 2016, Revised Selected Papers.
  52. NLP Interchange Format (NIF) • RDF/OWL-based format for NLP applications • Enables interoperability • Pure RDF and, hence, natural integration of Linked Data • Developed by Universität Leipzig • Our platform also supports the Web Annotation data model. Example annotation of the sentence “Welcome to Berlin in 2019.”:

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    @prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos/> .
    @prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .

    <http://link.omitted/documents/document1#char=0,26>
        a nif:RFC5147String , nif:String , nif:Context ;
        nif:beginIndex "0"^^xsd:nonNegativeInteger ;
        nif:endIndex "26"^^xsd:nonNegativeInteger ;
        nif:isString "Welcome to Berlin in 2019. "^^xsd:string ;
        dfkinif:averageLatitude "52.516666666666666"^^xsd:double ;
        dfkinif:averageLongitude "13.383333333333333"^^xsd:double ;
        dfkinif:stdDevLatitude "0.0"^^xsd:double ;
        dfkinif:stdDevLongitude "0.0"^^xsd:double ;
        nif:meanDateRange "20190101010000_20200101010000"^^xsd:string .

    <http://link.omitted/documents/document1#char=21,25>
        a nif:RFC5147String , nif:String ;
        itsrdf:taIdentRef <http://link.omitted/ontologies/nif#date=20190101000000_20200101000000> ;
        nif:anchorOf "2019"^^xsd:string ;
        nif:beginIndex "21"^^xsd:nonNegativeInteger ;
        nif:endIndex "25"^^xsd:nonNegativeInteger ;
        nif:entity <http://link.omitted/ontologies/nif#date> .

    <http://link.omitted/documents/#char=11,17>
        a nif:RFC5147String , nif:String ;
        nif:anchorOf "Berlin"^^xsd:string ;
        nif:beginIndex "11"^^xsd:nonNegativeInteger ;
        nif:endIndex "17"^^xsd:nonNegativeInteger ;
        itsrdf:taClassRef <http://dbpedia.org/ontology/Location> ;
        nif:referenceContext <http://link.omitted/documents/#char=0,26> ;
        geo:lat "52.516666666666666"^^xsd:double ;
        geo:long "13.383333333333333"^^xsd:double ;
        itsrdf:taIdentRef <http://dbpedia.org/resource/Berlin> .
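
Because NIF is plain RDF, such annotations can be consumed with any RDF toolkit. A small sketch with Python's rdflib, assuming the Turtle above is stored in nif_example.ttl and that a prefix declaration for dfkinif (which the slide omits) has been added so it parses cleanly:

    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    NIF = Namespace(
        "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
    )

    g = Graph()
    g.parse("nif_example.ttl", format="turtle")

    # List every annotated span: the anchored surface string and its offsets.
    for span in g.subjects(RDF.type, NIF.String):
        anchor = g.value(span, NIF.anchorOf)
        begin = g.value(span, NIF.beginIndex)
        end = g.value(span, NIF.endIndex)
        if anchor is not None:  # the nif:Context node carries no anchorOf
            print(f"{anchor} [{begin},{end}]")
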
  53. Sector: Journalism. Julian Moreno-Schneider, Ankit Srivastava, Peter Bourgonje, David Wabnitz, and Georg Rehm. “Semantic Storytelling, Cross-lingual Event Detection and other Semantic Services for a Newsroom Content Curation Dashboard.” In Octavian Popescu and Carlo Strapparava, editors, Proceedings of Natural Language Processing meets Journalism – EMNLP 2017 Workshop (NLPMJ 2017), Copenhagen, Denmark, 7 September 2017.
  54. Sector: TV, Web-TV, Media. Georg Rehm, Julián Moreno Schneider, Peter Bourgonje, Ankit Srivastava, Rolf Fricke, Jan Thomsen, Jing He, Joachim Quantz, Armin Berger, Luca König, Sören Räuchle, Jens Gerth, and David Wabnitz. “Different Types of Automated and Semi-Automated Semantic Storytelling: Curation Technologies for Different Sectors.” In Georg Rehm and Thierry Declerck, editors, Language Technologies for the Challenges of the Digital Age: 27th International Conference, GSCL 2017, Berlin, Germany, 13-14 September 2017, Proceedings, number 10713 in Lecture Notes in Artificial Intelligence (LNAI), pages 232-247, Cham, Switzerland, January 2018. Gesellschaft für Sprachtechnologie und Computerlinguistik e.V., Springer.
  55. Annotations for a Credible Web
  58. Viral Content and Filter Bubbles • Content is often published without checking its validity, discovered through social media and, if it appears relevant, shared immediately. • Content is often shared without reading it. • Goal: virality ➟ reach ➟ clicks ➟ ad revenue. • Not all “journalistic” content (or publishing outlets) is really committed to reporting the facts. • Nowadays the burden of fact-checking is on the readers. • “Fake news”: a label for several classes of online content. • Can we balance out filter bubble and network effects? (Georg Rehm. “An Infrastructure for Empowering Internet Users to handle Fake News and other Online Media Phenomena.” In Georg Rehm and Thierry Declerck, editors, Language Technologies for the Challenges of the Digital Age: Proceedings of the GSCL Conference 2017, Berlin, 13-15 September 2017. Gesellschaft für Sprachtechnologie und Computerlinguistik e.V.)
  59. Seven classes of false news (based on Wardle, 2017; Walbrühl, 2017; Rubin et al., 2015; Holan, 2016; Weedon et al., 2017): • Satire or parody • Wrong connection or relation: when title and photos don't support the content • Misleading content: use of information to put someone or something in a bad light • Wrong context: when genuine content is presented in the wrong context • Deceiving content: imitation of real sources • Bad content: with a clear purpose to deceive • Fabricated content: completely untrue, produced to deceive. (Table: each class is characterised in terms of characteristics – clickbait, disinformation, political bias, bad journalism – and of the publisher's intention – parody, provocation, profit, deception, influencing politics.)
  60. Proposed infrastructure: a website with content is processed by decentral filter tools (Tool1, Tool2, Tool3 – e.g., detection of hate speech, classifying content for its political spectrum, fact checking) that analyse content automatically and send their results to the browser (important: multilingualism). The browser has native support for the infrastructure and aggregates the different scores, messages and values into messages or warnings regarding the content. Decentral repositories (Web Annotations DB1-DB4) store all annotations: UGA – user-generated annotations (free text); UGM – user-generated metadata (standardised; example: a user rates the content quality regarding a standardised schema); MGM – machine-generated metadata (standardised). Other users' annotations are shared through the same repositories.
  61. • Infrastructure as a native part of the web • Necessary for that: support and buy-in from all browser vendors, media publishers and standards • All users need immediate access
  62. Tools analyse content automatically.
  63. • Automatic results and free text annotations are stored as Web Annotations. • Users make their annotations available to one another.
  64. • Automatic analysis of free text annotations (NLP, IE, RE etc.). • Extraction of opinions, arguments, claims, statements etc.
  65. • Standardised metadata schemas for efficient annotations, e.g., “content is intentionally deceptive” • W3C Provenance Ontology, Schema.org (ClaimReview) • To be used by both humans and machines
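
For illustration, a fact-checking verdict expressed as Schema.org ClaimReview markup – the kind of standardised, machine-readable record the infrastructure would exchange. All concrete values are invented:

    import json

    # A machine-readable fact-checking verdict (MGM in the diagram above),
    # following Schema.org's ClaimReview type; all concrete values are made up.
    claim_review = {
        "@context": "https://schema.org",
        "@type": "ClaimReview",
        "url": "https://factcheck.example.org/reviews/123",
        "claimReviewed": "Article X states that event Y never happened.",
        "itemReviewed": {
            "@type": "CreativeWork",
            "url": "https://news.example.org/article-x",
        },
        "author": {"@type": "Organization", "name": "Example Fact Check"},
        "reviewRating": {
            "@type": "Rating",
            "ratingValue": 1,
            "bestRating": 5,
            "worstRating": 1,
            "alternateName": "False",
        },
    }
    print(json.dumps(claim_review, indent=2))
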
  66. Goal: provide technologies to users with which they can consume, assess, analyse, verify and process digital content and media in a better way and that indicate which contents may be problematic.
  67. Web Annotation + Fake News • Crowd-sourced Web Annotation content in combination with a set of automatic analysis tools has enormous potential to tackle online misinformation campaigns. • Big impact if deployed widely and implemented correctly. • However, there is a danger of shifting the point of attack that misinformation campaigns exploit (to the annotations themselves). • The Credibility Coalition has developed a similar approach in parallel, see, e.g., https://web.hypothes.is/blog/annotation-powered-questionnaires/
  68. Annotations and Open Science
  69. Open Science • Movement to make scientific research, data and dissemination accessible to all levels of an inquiring society, amateur or professional. • Encompasses practices such as publishing open research, campaigning for open access, encouraging scientists to practice open notebook science, and generally making it easier to publish and communicate scientific knowledge. • Connection to: annotations, research data (corpora, language resources), semantics, knowledge, Linked Data, repositories and other topics. (https://en.wikipedia.org/wiki/Open_science)
  70. Open Science Taxonomy (https://en.wikipedia.org/wiki/Open_science)
  72. 72. Annotations & Open Science • Open Science will soon become the norm and goal in data-intensive science • Important aspects: interoperability, reproducibility, open documentation of experiments, use of standards etc. • Trend: open tools, open workflows, open data sets • Annotations are an important and crucial piece of the puzzle, especially documented, meaningful annotations • Relevant initiatives: NFDI, EOSC • Relevant principle: FAIR Observations on Annotations – Wuppertal, Germany, 21 February 2019 72
  73. FAIR Principles • To be findable: F1 – (meta)data are assigned a globally unique and eternally persistent identifier; F2 – data are described with rich metadata; F3 – (meta)data are registered or indexed in a searchable resource; F4 – metadata specify the data identifier. • To be accessible: A1 – (meta)data are retrievable by their identifier using a standardized protocol; A1.1 – the protocol is open, free, and universally implementable; A1.2 – the protocol allows for an authentication and authorization procedure; A2 – metadata are accessible, even when the data are no longer available. • To be interoperable: I1 – (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation; I2 – (meta)data use vocabularies that follow FAIR principles; I3 – (meta)data include qualified references to other (meta)data. • To be reusable: R1 – meta(data) have a plurality of accurate and relevant attributes; R1.1 – (meta)data are released with a clear and accessible data usage license; R1.2 – (meta)data are associated with their provenance; R1.3 – (meta)data meet domain-relevant community standards.
  74. Open Science and … Science • Open Science approaches recommend the use of standards. • Only standardised data and metadata are truly interoperable. • BUT fundamental research is about inventing NEW things. • This contradicts the use of standards as the consensus that was reached within a specific community. • However, it does NOT contradict the use of established tools and best-practice approaches. • Neither does it contradict the modification of standards. • At the end of the day, it's about semantics and documentation. • If an established, standardised approach does not work for a new piece of research, invent a new approach or get creative!
  75. Annotation of Documents • Open Science will transform research, making it more sustainable, more visible, more transparent. • Substantially improved digital infrastructures. • This will soon include the annotation of documents, starting with scientific publications (Web Annotation). • First steps towards Open Peer Review (cf. arxiv.org). • Trend: micro-publications (esp. for incremental research). • Will the scientific paper continue to be the atomic unit? • Important relevant initiative: ORKG.
  76. ORKG • Vision driven forward by Sören Auer (TIB Hannover). • Exchange of scholarly knowledge is primarily document-based: researchers produce articles (online or offline) as coarse-grained text documents. • Transform this predominant paradigm into knowledge-based information flows by representing and expressing knowledge through semantically rich, interlinked graphs. • Sören Auer et al. (2018): “Towards an Open Research Knowledge Graph.” https://doi.org/10.5281/zenodo.1157185
  77. Interlinking of Concepts. Automated procedures alone do not achieve the necessary coverage and accuracy; fully manual annotation is too time-consuming; librarians lack the necessary domain-specific expertise; and scientists lack the necessary expertise in knowledge representation. By combining the four strategies in a meaningful way, they can bring their respective strengths to bear and compensate for the weak points. (Figure: interlinking of interdisciplinary and subject-specific concepts and artefacts of scientific work in the different domains, here: TIB subject areas; related elements: Linked Open Data Cloud, Semantic Web standards, persistent identifiers, GND, European Open Science Cloud.) The Open Research Knowledge Graph (ORKG) provides interlinking, integration, visualization, exploration, and search functions. It enables scientists to gain a much faster overview of new developments in a specific field and identify relevant research problems. It represents the evolution of the scientific discourse in the individual disciplines and enables scientists to make their work more visible to colleagues and potential users in industry through semantic description – a technical ecosystem for knowledge-based science communication. (Auer et al., 2018)
  78. Annotations and Markup
  79. Annotations and Markup • Complex topic – we can only scratch the surface. • XML is – unfortunately – considered “done” within W3C; all senior XML specialists have left the organisation. • https://www.balisage.net/Proceedings/vol21/html/Tovey01/BalisageVol21-Tovey01.html – discussion of the trend from declarative to procedural (!) markup – there is stagnation in the markup world. • Relevant and timely: https://markupdeclaration.org • Markup is not dead – there is a small but active and passionate community.
  80. Dimensions of Annotations
  81. Annotations • Annotation – definition: secondary data added to a piece of primary data – in science, this is often research data. • The secondary data is typically a property of part of the primary research data. • Let's examine this a bit more closely.
  82. Annotations – the property. (Diagram: a span of text – the primary data – carries a property with a label and a value; a pointer to an annotation schema, possibly external, may constrain or restrict the property.) Examples: lemma, part of speech, instance-of etc. • What is the conceptual nature of this property? Is it best practice in research or can it be entirely made up? • How many colleagues in the community agree on it? • Is the label adequate and self-explanatory?
  83. Annotations – the value of the property: the actual annotation payload. Examples: adjective, JJ, object, “some free text comment” etc. • Is the value free text or taken from a shared vocabulary? • Is the shared vocabulary prescribed by an annotation schema or ontology? • How many colleagues in the community agree on the value? • How many colleagues in the community agree on the shared vocabulary?
  84. Annotations – many annotations. • Is there structure among the different properties? • Markup languages, markup grammars. • Syntactic structure – e.g., “HVBXJ” => “AHXB”, “HKVZ”. • Semantic, i.e., logical structure – e.g., “NP” => “DET”, “N”.
  85. Annotating Annotations. Annotations on annotations (just a few selected points): • Source (machine vs. single human vs. crowd-sourced) • Application scenario: annotations for human vs. machine consumption • Purpose or scope of the annotation (e.g., document structure, layout or style, semantics, rhetorical structure, linguistic properties etc.) – Can the structure be made explicit by the annotation format, maybe via a markup language's grammar? – Can structure be made explicit through an ontology that is put on top of the individual properties? • Confidence value • Quality indicator (0..1) • Time added, time modified (timestamp) • Style information – how annotations are rendered • Annotation layers – one or multiple layers, independent or interrelated?
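
These meta-level properties lend themselves to a small, explicit record structure; a sketch whose field selection follows the list above (the names and types are illustrative, not a proposed standard):

    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import Optional

    @dataclass
    class AnnotationMetadata:
        """Annotations on an annotation: provenance and quality information."""
        source: str                        # "machine", "single human", "crowd"
        purpose: str                       # e.g. "linguistic", "layout", "semantics"
        for_machine_consumption: bool      # human- vs. machine-oriented annotation
        confidence: Optional[float] = None # e.g. a classifier's confidence value
        quality: Optional[float] = None    # quality indicator in [0, 1]
        layer: str = "default"             # annotation layer this belongs to
        time_added: datetime = field(default_factory=datetime.utcnow)
        time_modified: Optional[datetime] = None

    meta = AnnotationMetadata(source="machine", purpose="linguistic",
                              for_machine_consumption=True, confidence=0.87)
    print(meta)
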
  86. Evaluation of Annotations • Measuring inter-annotator agreement • Measuring intra-annotator agreement – what if the same person does the same annotation task again after a week or a month? • Test replicability and reproducibility • Important exercise for: emerging annotation formats, complex annotation exercises, measuring consensus, making sure that terms and labels are meaningful
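
A minimal sketch of the first point, measuring inter-annotator agreement with Cohen's kappa via scikit-learn (the two label sequences are invented); intra-annotator agreement can be measured the same way, comparing an annotator's first and second pass:

    from sklearn.metrics import cohen_kappa_score

    # Labels assigned to the same ten items by two annotators.
    annotator_a = ["N", "V", "N", "ADJ", "N", "V", "ADJ", "N", "V", "N"]
    annotator_b = ["N", "V", "N", "ADJ", "V", "V", "ADJ", "N", "N", "N"]

    # Cohen's kappa corrects raw agreement for agreement expected by chance;
    # 1.0 is perfect agreement, 0.0 is chance level.
    print(round(cohen_kappa_score(annotator_a, annotator_b), 3))
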
  87. Complexity of Annotations • In (Computational) Linguistics we have designed some fairly detailed annotation formats over the last 30 years. • In contrast, many modern data sets (especially for data-driven AI approaches in NLP) are quite shallow. • AI classifiers need enormous amounts of data and just a few high-level labels. • It is not feasible and too expensive to annotate such data with complex and sophisticated annotation formats. • Is NLP/AI research forgetting annotation principles? • Are we dumbing down linguistics to the simple annotation of trivial labels? • Has annotation research perhaps become obsolete?
  88. Complexity of Annotations • Example: the GermEval 2018 data set – tweet and label, tweet and label, tweet and label etc. (see the sketch below). • There is no structure, no concretisation, no hierarchical information, no additional metadata. • Two observations: – There is a trend towards simply more annotations, i.e., increased quantity while ignoring quality, complexity and structure – complex annotations are expensive and difficult to generalise from. – There is a trend towards dumb annotations, which are often crowd-sourced – it is easier to generalise from simple than from structured, hierarchical annotations.
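
The shape of such a flat data set, sketched as tab-separated tweet/label records (the tweets and labels here are invented; GermEval 2018 distributed tweets with coarse labels such as OFFENSE and OTHER in a similar TSV layout):

    import csv
    import io

    # A flat annotated data set: one tweet, one label, nothing else --
    # no structure, no hierarchy, no additional metadata.
    flat_data = io.StringIO(
        "This is a perfectly friendly tweet.\tOTHER\n"
        "Some offensive tweet text here.\tOFFENSE\n"
    )

    for tweet, label in csv.reader(flat_data, delimiter="\t"):
        print(label, "<-", tweet[:40])
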
  89. Summary and Conclusions
  90. Summary • Annotations: from trivial to very complex • From experimental to highly (de facto) standardised • Annotations of annotations • Multi-layer annotations – independent or interrelated • Interoperability and reusability through standards • But: standards vs. flexibility – basic science vs. applied • Nowadays, annotations usually happen on the web • Powerful stack of W3C technologies: Web Annotation, Semantic Web, Linked Data, XML • Web-scale annotations for scholarly publishing • Annotations for Open Science
  91. Summary • Language Technology … • … to automate the generation of annotations – semantification of journalistic/media content – semantification of scientific content • … to automate the analysis of annotations – annotations for Open Science • … to restore credibility and trust in the media • In AI, annotations in data sets are often trivial – trend towards simply more and more annotations – trend towards more and more simple annotations
  92. Annotating Annotations • Different dimensions of annotations: complexity, semantics, source, impact, standard, research question, methodology, … • Is it possible to tie all dimensions together in a compact, machine-readable way to describe and document an annotation project? • Relevant for Open Science, interoperability, search & retrieval, reproducibility, evaluation, documentation & repositories, good scientific practice. • … but maybe this is all too complicated because a scientific paper already does the trick in an established way?
  93. Four Quadrant Diagram (work in progress). Axes: basic research vs. applications and solutions, and humanities research vs. Computer Science and ICT research. On the basic-research/humanities side (number of users: rather small): no need for standardisation, no need to use standards; avantgarde formats, weird phenomena, weird needs, expressibility; markup, formal languages, querying, overlap; Digital Humanities. On the applications/Computer Science side (number of users: rather high): a clear need to use standards for maximum adoption; performance, standards, interoperability; AI, XAI.
  94. Thank you! Dr. Georg Rehm, Principal Researcher and Research Fellow, Speech and Language Technology Lab, DFKI, Berlin, Germany • georg.rehm@dfki.de • http://georg-re.hm • http://de.linkedin.com/in/georgrehm • https://www.slideshare.net/georgrehm – With many thanks to (in alphabetical order): Ivan Herman (W3C, The Netherlands); Heather Staines, Jon Udell, Dan Whaley (Hypothes.is, USA)
  95. References:
  • Georg Rehm, Julian Moreno Schneider, and Peter Bourgonje. Automatic and Manual Web Annotations in an Infrastructure to handle Fake News and other Online Media Phenomena. In Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga, editors, Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018), pages 2416-2422, Miyazaki, Japan, May 2018. European Language Resources Association (ELRA).
  • Georg Rehm. An Infrastructure for Empowering Internet Users to handle Fake News and other Online Media Phenomena. In Georg Rehm and Thierry Declerck, editors, Language Technologies for the Challenges of the Digital Age: 27th International Conference, GSCL 2017, Berlin, Germany, 13-14 September 2017, Proceedings, number 10713 in Lecture Notes in Artificial Intelligence (LNAI), pages 216-231, Cham, Switzerland, January 2018. Gesellschaft für Sprachtechnologie und Computerlinguistik e.V., Springer.
  • Georg Rehm. The Language Resource Life Cycle: Towards a Generic Model for Creating, Maintaining, Using and Distributing Language Resources. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the 10th Language Resources and Evaluation Conference (LREC 2016), pages 2450-2454, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA).
  • Georg Rehm. Texttechnologische Grundlagen. In Kai-Uwe Carstensen, Christian Ebert, Cornelia Endriss, Susanne Jekat, Ralf Klabunde, and Hagen Langer, editors, Computerlinguistik und Sprachtechnologie – Eine Einführung, pages 159-168. Spektrum, Heidelberg, 3rd edition, 2010.
  • Georg Rehm, Richard Eckart, Christian Chiarcos, and Johannes Dellert. Ontology-Based XQuery'ing of XML-Encoded Language Resources on Multiple Annotation Layers. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, and Daniel Tapias, editors, Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008), pages 525-532, Marrakesh, Morocco, May 2008.
  • Georg Rehm, Andreas Witt, Erhard Hinrichs, and Marga Reis. Sustainability of Annotated Resources in Linguistics. In Lisa Lena Opas-Hänninen, Mikko Jokelainen, Ilkka Juuso, and Tapio Seppänen, editors, Digital Humanities 2008, pages 21-29, Oulu, Finland, June 2008. ACH, ALLC.
  • Andreas Witt, Georg Rehm, Timm Lehmberg, and Erhard Hinrichs. Mapping Multi-Rooted Trees from a Sustainable Exchange Format to TEI Feature Structures. In TEI@20: 20 Years of Supporting the Digital Humanities. The 20th Anniversary TEI Consortium Members' Meeting, University of Maryland, College Park, October 2007.
  • Andreas Witt, Oliver Schonefeld, Georg Rehm, Jonathan Khoo, and Kilian Evang. On the Lossless Transformation of Single-File, Multi-Layer Annotations into Multi-Rooted Trees. In B. Tommie Usdin, editor, Proceedings of Extreme Markup Languages 2007, Montréal, Canada, August 2007.
  • Kai Wörner, Andreas Witt, Georg Rehm, and Stefanie Dipper. Modelling Linguistic Data Structures. In B. Tommie Usdin, editor, Proceedings of Extreme Markup Languages 2006, Montréal, Canada, August 2006.
