What is the state of natural language processing for Danish in 2018? This presentation reviews language technology in Denmark this year. Presented at a "Puzzle of Danish" workshop.
Word Occurrence Based Extraction of Work Contributors from Statements of Resp... - The European Library
This paper addresses the identification of all contributors to an intellectual work when they are recorded in bibliographic data in unstructured form. National bibliographies are very reliable in representing the first author of a work, but secondary contributors are frequently represented only in the statements of responsibility that the cataloguer transcribes from the book into the bibliographic record. Identifying work contributors mentioned in statements of responsibility is a typical motivation for applying information extraction techniques. This paper presents an approach developed for the specific application scenario of the ARROW rights infrastructure, which is being deployed in several European countries to assist in determining the copyright status of works that may not be in the public domain. Our approach performed reliably across languages and bibliographic datasets of at least one million records, achieving precision and recall above 0.97 on five of the six evaluated datasets. We conclude that the approach can be reliably applied to other national bibliographies and languages.
This document provides an overview of semantic web technologies for publishing data. It introduces the semantic web and describes semantic web languages like RDF, RDF Schema, and OWL. These languages allow modeling data as graphs and defining ontologies to provide unambiguous meaning to information. The document discusses using these languages to publish structured data on the web in ways that enable semantic annotation, integration, and reasoning across interconnected data sources.
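As a hedged illustration of this graph-based modelling, the sketch below (assuming the Python rdflib package and a made-up http://example.org/ namespace, neither of which comes from the original document) builds a tiny RDF graph and serialises it as Turtle.

```python
# Minimal sketch: modelling data as an RDF graph of subject-predicate-object
# triples and serialising it as Turtle. Assumes the rdflib package; the
# http://example.org/ namespace and its properties are made up for illustration.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")  # hypothetical vocabulary

g = Graph()
book = URIRef("http://example.org/book/1")
g.add((book, RDF.type, EX.Book))                                  # type the resource
g.add((book, EX.title, Literal("An Example Title")))              # literal-valued property
g.add((book, EX.author, URIRef("http://example.org/person/42")))  # link to another resource

print(g.serialize(format="turtle"))  # Turtle is one of several RDF syntaxes
```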
JRC-Names: A freely available, highly multilingual named entity resource. This presentation was part of the Diplohack Brussels Data Market on 29-30 April 2016.
SoundSoftware.ac.uk: Sustainable software for audio and music research (DMRN 5+) - SoundSoftware ac.uk
Introductory presentation about the SoundSoftware.ac.uk project: Sustainable software for audio and music research.
Presented at the DMRN+5: Digital Music Research Network One-day Workshop 2010, at Queen Mary, University of London, on 21st Dec 2010.
Research Objects for improved sharing and reproducibility - Oscar Corcho
Presentation about the usage of Research Objects to improve scientific experiment sharing and reproducibility, given at the Dagstuhl Perspective Workshop on the intersection between Computer Sciences and Psychology (July 2015)
GTTS System for the Spoken Web Search Task at MediaEval 2012 - MediaEval2012
The document describes the GTTS system for the Spoken Web Search Task at MediaEval 2012. The GTTS system allows for searching spoken queries against broadcast news audio using automatic speech recognition and indexing. It also enables searching parliamentary sessions by aligning audio and text. For MediaEval 2012, the GTTS system performs phonetic matching between the n-best phone decoding of queries and phone lattices of spoken resources to locate query detections. Initial experiments showed poor performance, so the approach was changed to search for each query's single best detection in each audio document.
This slide deck introduces various basic steganography techniques.
It also lists tools that can be useful for CTF (Capture the Flag) steganography challenges.
Linked Data and cultural heritage data: an overview of the approaches from Eu... - The European Library
Europeana provides access to digital resources from a wide range of cultural heritage institutions all across Europe. In order to support Europeana, a wide network of organizations collaborates in data integration activities. The European Library plays the role of library-domain aggregator for Europeana, and its activities also include serving as a gateway to the collections and data of Europe’s national and research libraries, operating on the principle of open data for re-use.
The Europeana Network addresses its data integration challenges by leveraging Linked Data and the Semantic Web. Its approach to data integration is based on a single data model, the Europeana Data Model, which embraces Semantic Web principles to integrate the various data models and ontologies used in cultural heritage data.
The paradigm of Linked Data brings many new challenges to libraries. The generic nature of data representation used in Linked Data, while allowing any community to manipulate the data, also opens many paths for implementation, with no clear optimal choice for libraries. The European Library leverages its operational infrastructure to make library data available. It maintains The European Library Open Dataset, which is derived from the data aggregated from member libraries and made available under the Creative Commons CC0 1.0 Universal license, in order to promote and facilitate its reuse by any community.
Extensive linking is performed in the preparation of The European Library Open Dataset. It relies on Information Extraction and Data Mining to establish links to external open datasets, covering the most prominent entity types present in library data: persons, corporate bodies, places, concepts, intellectual works and manifestations.
The European Library also applies a linked data approach to intellectual property rights clearance processes, in support of mass digitization projects. This approach is applied within the European ARROW rights infrastructure.
Semantic Technologies and Programmatic Access to Semantic Data Steffen Staab
This is a talk given at the Semantics@Roche Forum on September 8, 2015. It is a short version of the talk I gave in July at Summer School Semantic Web and really a subset of the slides I showed then.
This presentation describes a natural language interface: a system that transforms a user's natural language question into a SPARQL query (a sketch of executing such a generated query is given below).
Find related papers here: https://sites.google.com/site/fadhlinams81/publication
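For context, here is a minimal sketch of executing the kind of SPARQL query such an interface might generate for a question like "Which cities are in Denmark?". It assumes the SPARQLWrapper package and network access to the public DBpedia endpoint, and is not the system from the slides.

```python
# Minimal sketch: running a (hand-written) SPARQL query of the kind an NL
# interface might generate. Assumes SPARQLWrapper and the public DBpedia
# endpoint; not the system described in the presentation.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?city WHERE {
        ?city a dbo:City ;
              dbo:country dbr:Denmark .
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["city"]["value"])  # URI of each matching city
```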
Why do they call it Linked Data when they want to say...? - Oscar Corcho
The four Linked Data publishing principles established in 2006 seem to be quite clear and well understood by people inside and outside the core Linked Data and Semantic Web community. However, not only when discussing the goodness of Linked Data with outsiders, but also when reviewing papers for the COLD workshop series, I find myself on many occasions going back to the principles in order to see whether some approach for Web data publication and consumption is actually Linked Data or not. In this talk we will review some of the current approaches that we have for publishing data on the Web, and we will reflect on why it is sometimes so difficult to reach an agreement on what we understand by Linked Data. Furthermore, we will take the opportunity to describe yet another approach that we have been working on recently at the Center for Open Middleware, a joint technology center between Banco Santander and Universidad Politécnica de Madrid, in order to facilitate Linked Data consumption.
Apertium: a unique free/open-source MT system for related languages [but not ... - Gema Ramirez-Sanchez
The document provides an overview of Apertium, an open-source machine translation system. It describes Apertium's main components, including its engine, data in XML formats, and tools. It also discusses Apertium's ready-to-use products, licensing as free/open-source, active community of hundreds of developers, research uses, over 40 supported language pairs including smaller languages, and some success stories in localization.
This document discusses the Web Ontology Language (OWL). It begins by providing motivation for OWL, noting limitations of RDF and RDF Schema in areas like expressiveness. It then outlines the technical solution of OWL, including its design goals of being shareable, changing over time, ensuring interoperability, and balancing expressiveness with complexity. Finally, it introduces the three dialects of OWL - OWL Lite, OWL DL, and OWL Full - and their different levels of expressiveness and reasoning capabilities.
WhiteLab is a web application that allows users to explore and search the large Dutch text collections SoNaR-500 and CGN. It provides access to the texts, audio, transcriptions, and linguistic annotations. Users can view collection composition and statistics, search by words, parts of speech, or lemmas using the CQP query language, and view concordance results and linked audio/context. OpenSoNaR-CGN was developed by several Dutch institutions to make these annotated resources openly available.
This document summarizes a presentation on the OpenNLP toolkit. OpenNLP is an open-source Java toolkit for natural language processing. It provides common NLP features like tokenization, sentence segmentation, part-of-speech tagging, and named entity extraction. The presentation discusses how these features work using pre-trained models for different languages. An example is also given showing how OpenNLP could be used to extract tags from a website and display them in a tag cloud. The presentation concludes by providing contact information for the presenter.
This document contains slides from a presentation by Pedro Szekely on RDF and related Semantic Web topics. The slides cover Unicode, URLs, URIs, namespaces, XML, XML Schema, RDF graphs, RDF syntaxes including XML and Turtle formats, and comparisons between XML and RDF. Key topics include using URIs to identify resources on the web, representing information as subject-predicate-object triples in RDF graphs, combining vocabularies using namespaces, and leveraging XML tools while making RDF more human-readable.
This document discusses various techniques for question answering and relation extraction in natural language processing. It provides an overview of question answering systems and approaches, including examples like START, Ask Jeeves and Siri. It also discusses using search engines for question answering, relation extraction from questions, and common evaluation metrics for question answering systems like accuracy and mean reciprocal rank.
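As a small worked example of one metric mentioned above, the sketch below computes mean reciprocal rank (MRR) from a hypothetical list of ranks at which the first correct answer appeared for each question.

```python
# Minimal sketch: mean reciprocal rank (MRR), a common QA evaluation metric.
# `first_correct_ranks` is a hypothetical list holding the 1-based rank of the
# first correct answer per question, or None if no correct answer was returned.
def mean_reciprocal_rank(first_correct_ranks):
    reciprocal = [1.0 / r if r is not None else 0.0 for r in first_correct_ranks]
    return sum(reciprocal) / len(reciprocal)

# Three questions: correct answer at rank 1, at rank 3, and never returned.
print(mean_reciprocal_rank([1, 3, None]))  # (1 + 1/3 + 0) / 3 ≈ 0.444
```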
Building NLP solutions for Davidson ML Group - botsplash.com
This document provides an overview of natural language processing (NLP) and discusses various NLP applications and techniques. It covers the scope of NLP including natural language understanding, generation, and speech recognition/synthesis. Example applications mentioned include chatbots, sentiment analysis, text classification, summarization, and more. Popular Python packages for NLP like NLTK, SpaCy, and Gensim are also highlighted. Techniques like word embeddings, neural networks, and deep learning approaches to NLP are briefly outlined.
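For a concrete taste of one of the packages mentioned, here is a minimal sketch (assuming spaCy and its small English model en_core_web_sm are installed) that runs tokenisation, part-of-speech tagging and named entity recognition on a single sentence.

```python
# Minimal sketch: basic NLP with spaCy. Assumes spaCy is installed and the
# small English model has been downloaded (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new office in Copenhagen next year.")

for token in doc:
    print(token.text, token.pos_)   # each token with its part-of-speech tag

for ent in doc.ents:
    print(ent.text, ent.label_)     # named entities, e.g. ORG, GPE, DATE
```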
Agile Offshoring: Using Pair Work to Overcome Nearshoring Difficulties - MobileSolutionsDTAG
This document discusses using distributed pair programming (DPP) to overcome challenges with nearshoring software development. It proposes "Agile Offshoring" which maximizes cross-site communication through DPP instead of separating technical work by location. Key steps include finding pairing volunteers, establishing technical infrastructure like Saros for distributed editing, arranging knowledge transfer tasks between pairs, and refining the process through reflection. Research shows competent pairs can work fluently with DPP and it facilitates knowledge transfer, though awareness limitations must be addressed. The approach was deemed plausible by experts and is being evaluated further.
The document discusses the future of the Digital Curation Centre (DCC) and its role as a center of expertise in data curation and preservation. It outlines the DCC's proposed core services for the next phase, including providing reference resources, training, expertise/consultancy, community building, and tools/toolkits. It also discusses potential additional services and ensuring the DCC complements rather than conflicts with the UK Research Data Service.
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2... - Europeana
Here are a few approaches to address the context demand challenge for machine translation of cultural heritage content:
- Leverage knowledge graphs and ontologies to disambiguate terms based on conceptual relationships
- Train domain-specific models on large cultural heritage corpora to capture nuances of language use in different contexts
- Perform multi-task learning to optimize models for both translation accuracy and conceptual mapping between languages
- Allow users to provide feedback to iteratively improve disambiguation of ambiguous terms over time
- Develop specialized interfaces that surface contextual clues from objects to help machine translation
The goal is to mimic how humans understand intended meaning based on surrounding context clues. Combining linguistic and conceptual techniques can help machines do the same.
The document provides an introduction to a course on natural language processing, outlining the course overview, topics to be covered including introductions to NLP and Watson, machine learning for NLP, and why NLP is difficult. It provides information on the course instructor, teaching assistant, homepages, office hours, goals and topics of the course, organization, recommended textbooks, assignments, grading, class policies, and an outline of course topics.
Apertium is a free/open-source machine translation platform that provides an engine, linguistic data, and tools for rule-based machine translation of related languages. It supports over 40 language pairs and has an active international community of hundreds of developers. Apertium translations have been used successfully in localization projects, content translation on Wikipedia, and crisis translation for under-resourced languages.
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services - Lynx Project
Free Webinar on the Lynx Services Platform LySP: Architecture and basic Services
The main objective of the Lynx research and innovation project is to create an ecosystem of smart cloud services to better manage compliance, based on a Legal Knowledge Graph (LKG) which integrates and links multilingual and heterogeneous compliance data sources, including legislation, case law, standards, regulations and private contracts, among others.
This webinar provides insights into all smart services of the Lynx Services Platform (LySP), including demos of these LySP services, for instance: Named Entity Recognition (NER) by DFKI, Relation Extraction and Question Answering by SWC, Machine Translation by Tilde, and the Lexicala cross-lingual lexical data service by KDictionaries.
The document summarizes efforts to support digital humanities research through collaboration at various institutions. It describes projects at Wheaton College involving students encoding a text using TEI XML under faculty supervision. It also discusses initiatives at the University of Vermont and Brown University to provide infrastructure and expertise for digital scholarship through partnerships between libraries, academic technology groups, and faculty researchers.
Slides of the paper Curation Technologies for a Cultural Heritage Archive: Analysing and transforming a heterogeneous data set into an interactive curation workbench by Georg Rehm, Martin Lee, Julián Moreno Schneider and Peter Bourgonje at the 3rd Edition of the DATeCH2019 International Conference
Natural Language Processing: L01 introduction - ananth
This presentation introduces the course Natural Language Processing (NLP) by enumerating a number of applications, course positioning, challenges presented by Natural Language text and emerging approaches to topics like word representation.
Rudy Marsman's thesis presentation slides: Speech synthesis based on a limite... - Victor de Boer
This document discusses using a limited speech corpus of recordings from Dutch news anchor Philip Bloemendal to develop a text-to-speech (TTS) engine. It evaluates how much of the Dutch language can be synthesized using the corpus and methods to improve it, like finding synonyms and decompounding compounds. It also explores using neural networks to colorize old black-and-white video footage from the archive to make it more engaging for viewers. While the TTS engine works well for common words, full sentences have lower coverage, and colorization introduces artifacts but can increase attention to the archive's collection.
A Lightning Introduction To Clouds & HLT - Human Language Technology Conference - Basis Technology
What’s all this cloud stuff, anyway? What kinds of problems do organizations set out to solve with ‘a cloud,’ or even ‘the cloud’? What are a few of the major government initiatives involving this technology? How does HLT in general, and Search in particular, fit?
This talk will take a tour of the technology behind clouds and the sometimes-foggy ambitions of the projects that use them, and look in particular detail at the challenges of applying cloud technologies to Text Analytics.
View more slides from the Human Language Technology Conference 2012 here: http://info.basistech.com/hlt-2012-slides
Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D... - Machine Learning Prague
TR Discover is an NLP tool that allows users to search Thomson Reuters databases using natural language queries. It uses a context-free grammar and logical semantics to parse queries and translate them into SQL or SPARQL queries. For the query "Drugs developed by Merck", it generates a SPARQL query to retrieve drugs developed by the company Merck. The system provides query autocompletions based on the grammar and relationships in the knowledge graph to guide users. Working as a scientist within Thomson Reuters provides applied research opportunities while requiring technology to support business needs around privacy, customization for different markets, and long-term client relationships.
The Semantic Web - Interacting with the Unknown - Steffen Staab
When developing user interfaces for interacting with data and content one typically assumes that one knows the type of data and one knows how to interact with such type of data. The core idea of the Semantic Web is that data is self-describing, which implies that its semantics is not designed and described at an initial point in time, but it rather emerges by its use. This flexibility is one of the greatest assets of the Semantic Web, but it also severely handicaps intelligent interaction with its data.
In this talk, we will sketch the principal problem as well as first steps to deal with the problem of interacting with the unknown.
Introduction to natural language processing (NLP) - Alia Hamwi
The document provides an introduction to natural language processing (NLP). It defines NLP as a field of artificial intelligence devoted to creating computers that can use natural language as input and output. Some key NLP applications mentioned include data analysis of user-generated content, conversational agents, translation, classification, information retrieval, and summarization. The document also discusses various linguistic levels of analysis like phonology, morphology, syntax, and semantics that involve ambiguity challenges. Common NLP tasks like part-of-speech tagging, named entity recognition, parsing, and information extraction are described. Finally, the document outlines the typical steps in an NLP pipeline including data collection, text cleaning, preprocessing, feature engineering, modeling and evaluation.
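To make the listed pipeline stages concrete, the sketch below (assuming scikit-learn and a toy labelled dataset invented for the example) goes from raw text through TF-IDF feature engineering to modelling and evaluation.

```python
# Minimal sketch of a text-classification pipeline: preprocessing and feature
# engineering (TF-IDF), modelling (logistic regression) and evaluation.
# Assumes scikit-learn; the four labelled examples are invented toy data.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

texts = ["great product", "terrible service", "loved it", "awful experience"]
labels = ["pos", "neg", "pos", "neg"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels)

model = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),  # cleaning + feature engineering
    ("clf", LogisticRegression()),               # modelling
])
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))  # evaluation
```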
Natural language processing for requirements engineering: ICSE 2021 Technical... - alessio_ferrari
These are the slides for the technical briefing given at ICSE 2021 by Alessio Ferrari, Liping Zhao, and Waad Alhoshan.
It covers RE tasks to which NLP is applied, an overview of a recent systematic mapping study on the topic, and a hands-on tutorial on using transfer learning for requirements classification.
Please find the links to the colab notebooks here:
https://colab.research.google.com/drive/158H-lEJE1pc-xHc1ISBAKGDHMt_eg4Gn?usp=sharing
https://colab.research.google.com/drive/1B_5ow3rvS0Qz1y-KyJtlMNnmgmx9w3kJ?usp=sharing
https://colab.research.google.com/drive/1Xrm0gNaa41YwlM5g2CRYYXcRvpbDnTRT?usp=sharing
Similar to State of Tools for NLP in Danish: 2018
The net is rife with rumours that spread through microblogs and social media. Not all the claims in these can be verified. However, recent work has shown that the stances alone that commenters take toward claims can be sufficiently good indicators of claim veracity, using e.g. an HMM that takes conversational stance sequences as the only input. Existing results are monolingual (English) and mono-platform (Twitter). This paper introduces a stance-annotated Reddit dataset for the Danish language, and describes various implementations of stance classification models. Of these, a linear SVM predicts stance best, with 0.76 accuracy / 0.42 macro F1. Stance labels are then used to predict veracity across platforms and also across languages, training on conversations held in one language and using the model on conversations held in another. In our experiments, monolingual scores reach stance-based veracity accuracy of 0.83 (F1 0.68); applying the model across languages predicts veracity of claims with an accuracy of 0.82 (F1 0.67). This demonstrates the surprising and powerful viability of transferring stance-based veracity prediction across languages.
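As a rough, hypothetical illustration of the reported setup (not the paper's actual features or data), the sketch below trains a linear SVM stance classifier with scikit-learn and scores it with accuracy and macro-averaged F1, the two metrics quoted above.

```python
# Minimal sketch: a linear SVM stance classifier evaluated with accuracy and
# macro F1. Assumes scikit-learn; the tiny Danish examples and the TF-IDF
# features are invented for illustration and are not the paper's setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score

train_texts = ["det passer ikke", "enig, det er sandt", "er det rigtigt?", "ok"]
train_stances = ["deny", "support", "query", "comment"]
test_texts = ["helt enig", "det tvivler jeg på"]
test_stances = ["support", "deny"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_texts)
X_test = vec.transform(test_texts)

clf = LinearSVC().fit(X_train, train_stances)
pred = clf.predict(X_test)

print("accuracy:", accuracy_score(test_stances, pred))
print("macro F1:", f1_score(test_stances, pred, average="macro"))
```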
This document describes SemEval-2017 Task 8 on determining rumour veracity and stance. It introduces two subtasks: (A) determining the stance of statements as supporting, denying, querying, or commenting on rumours and (B) determining the veracity of rumours as true, false, or unknown. The document outlines the data provided for training, development and testing, which covers several rumour events. It provides the participant numbers for the two subtasks and discusses the difficulty of the tasks. The document concludes by thanking the participants and SemEval committee.
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource - Leon Derczynski
This presents a new resource for helping to find names of entities in social media. It takes an inclusive approach, meaning we get high variety in named entities - something other corpora have struggled with, leaving them poorly placed to help machine learning approaches generalise beyond the lexical level.
Handling and Mining Linguistic Variation in UGC - Leon Derczynski
This document discusses user-generated content (UGC) found on social media and the linguistic variation present within it. It notes that UGC comes directly from end users without editing and contains nonstandard spelling, grammar, slang, and abbreviations. The document qualitatively and quantitatively analyzes the nature of this variation, including its relationship to social factors. It also discusses challenges this variation poses for natural language processing systems and different approaches that have been explored to better handle UGC, such as distributional semantic models, normalization, and leveraging author metadata.
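As a toy illustration of the normalization idea mentioned among those approaches, the sketch below uses a small, hand-written (hypothetical) lookup table to map common nonstandard spellings to canonical forms.

```python
# Minimal sketch: dictionary-based lexical normalisation of user-generated
# content. The lookup table is hand-written and purely illustrative.
NORMALISATION = {
    "u": "you",
    "gr8": "great",
    "2moro": "tomorrow",
    "pls": "please",
}

def normalise(text):
    # Replace each whitespace-separated token if it has a canonical form.
    return " ".join(NORMALISATION.get(tok.lower(), tok) for tok in text.split())

print(normalise("c u 2moro pls"))  # -> "c you tomorrow please"
```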
Efficient named entity annotation through pre-empting - Leon Derczynski
Linguistic annotation is time-consuming and expensive. One common annotation task is to mark entities – such as names of people, places and organisations – in text. In a document, many segments of text often contain no entities at all. We show that these segments are worth skipping, and demonstrate a technique for reducing the amount of entity-less text examined by annotators, which we call "preempting". This technique is evaluated in a crowdsourcing scenario, where it provides downstream performance improvements for the same size corpus.
A light intro to natural language processing on social media, presented as an invited talk at the University of Sheffield Engineering Symposium 2014 in the AI session. As well as an introduction to the area, this presentation covers powerful real-world applications of social media, and touches on the work we do in the Sheffield NLP group.
Video cast: https://www.youtube.com/watch?v=QUbRmUinhHw&feature=youtu.be
Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines - Leon Derczynski
Annotating data is expensive and often fraught. Crowdsourcing promises a quick, cheap and high-quality solution, but it is critical to understand the process and plan work appropriately in order to get results. This presentation and paper discuss the challenges involved and explain simple ways of getting reliable, quality results when crowdsourcing corpora.
Full paper: https://gate.ac.uk/sale/lrec2014/crowdsourcing/crowdsourcing-NLP-corpora.pdf
Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec... - Leon Derczynski
Presentation with audio: https://www.youtube.com/watch?v=heYj8sCmWCo
Finding the names in tweets is difficult. However, with a few simple modifications to handle the noise and variety in tweets, and an automatic post-editor to fix errors made by the automatic systems, it becomes easier.
Full paper: http://derczynski.com/sheffield/papers/person_tweets.pdf
Natural Language Processing for the Social Media
A PhD course at the University of Szeged, organised by the FuturICT.hu project; 9-13 December 2013.
1. Twitter intro + JSON structure
2. Challenges in analysing social media: why traditional NLP models do not work well
3. GATE for social media
The document discusses several topics related to artificial intelligence including machine learning, evaluating AI, and big data from social media. It notes that machine learning allows computers to write programs themselves so humans can go drinking. Big data is defined using the three Vs: velocity of tweets, volume of active teenagers, and variety of data applications including virus prediction, earthquake detection, and discussions of Bieber.
Recognising and Interpreting Named Temporal Expressions - Leon Derczynski
Paper: http://derczynski.com/sheffield/papers/named_timex.pdf
This paper introduces a new class of temporal expression – named temporal expressions – and methods for recognising and interpreting its members. The commonest temporal expressions typically contain date and time words, like April or hours. Research into recognising and interpreting these typical expressions is mature in many languages. However, there is a class of expressions that are less typical, very varied, and difficult to automatically interpret. These indicate dates and times, but are harder to detect because they often do not contain time words and are not used frequently enough to appear in conventional temporally-annotated corpora – for example Michaelmas or Vasant Panchami.
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text - Leon Derczynski
Code: http://gate.ac.uk/wiki/twitie.html
Paper: https://gate.ac.uk/sale/ranlp2013/twitie/twitie-ranlp2013.pdf
Twitter is the largest source of microblog text, responsible for gigabytes of human discourse every day. Processing microblog text is difficult: the genre is noisy, documents have little context, and utterances are very short. As such, conventional NLP tools fail when faced with tweets and other microblog text. We present TwitIE, an open-source NLP pipeline customised to microblog text at every stage. Additionally, it includes Twitter-specific data import and metadata handling. This paper introduces each stage of the TwitIE pipeline, which is a modification of the GATE ANNIE open-source pipeline for news text. An evaluation against some state-of-the-art systems is also presented.
Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data - Leon Derczynski
Download software: http://gate.ac.uk/wiki/twitter-postagger.html
Original paper: http://derczynski.com/sheffield/papers/twitter_pos.pdf
Part-of-speech information is a pre-requisite in many NLP algorithms. However, Twitter text is difficult to part-of-speech tag: it is noisy, with linguistic errors and idiosyncratic style. We present a detailed error analysis of existing taggers, motivating a series of tagger augmentations which are demonstrated to improve performance. We identify and evaluate techniques for improving English part-of-speech tagging performance in this genre.
Further, we present a novel approach to system combination for the case where available taggers use different tagsets, based on vote-constrained bootstrapping with unlabeled data. Coupled with assigning prior probabilities to some tokens and handling of unknown words and slang, we reach 88.7% tagging accuracy (90.5% on development data). This is a new high in PTB-compatible tweet part-of-speech tagging, reducing token error by 26.8% and sentence error by 12.2%. The model, training data and tools are made available.
Mining Social Media with Linked Open Data, Entity Recognition, and Event Extr... - Leon Derczynski
Presented at the 4th DEOS workshop, http://diadem.cs.ox.ac.uk/deos13/
Social media presents itself as a context-rich source of big data, readily exhibiting volume, velocity and variety. Mining information from microblogs and other social media is a challenging, emerging research area. Unlike carefully authored news text and other longer content, social media text poses a number of new challenges, due to the short, noisy, context-dependent, and dynamic nature.
This talk will discuss firstly how Linked Open Data (LOD) vocabularies (namely DBpedia and YAGO) have been used to help entity recognition and disambiguation in such content. We will introduce LODIE, the LOD-based extension of the widely used ANNIE open-source entity recognition system. LODIE also includes entity disambiguation (covering products, as well as names of persons, locations, and organisations) and has been developed as part of the TrendMiner and uComp projects. Quantitative evaluation results will be shown, including a comparison against other state-of-the-art methods and an analysis of how errors in upstream linguistic pre-processing (i.e. tokenisation and POS tagging) can affect disambiguation performance. Our results demonstrate the importance of adjusting approaches for this genre.
The second half of the talk will focus on fine-grained events in tweets. Awareness of temporal context in social media enables many interesting applications. We identify events using the TimeML schema, focusing on occurrences and actions. Challenges of event annotation will be discussed, as well as the development of a supervised event extractor specifically for social media. We evaluate this against traditional event annotation approaches (e.g. Evita, TIPSem).
Determining the Types of Temporal Relations in Discourse - Leon Derczynski
Working out when events in a text happen is difficult. Many have tried over the past decade but the state of the art has not advanced.
After introducing a few fundamental concepts for dealing with time in language, we work out what makes this task so difficult, and then identify two common causes of temporal ordering difficulty and describe how to overcome them.
Full document: http://derczynski.com/sheffield/papers/derczynski-phdthesis.pdf
Microblog-genre noise and its impact on semantic annotation accuracy - Leon Derczynski
This document discusses challenges in applying natural language processing pipelines to microblog texts like tweets. Key challenges include non-standard language use, brevity, and lack of context. The document evaluates performance of typical NLP tasks on microblogs, like part-of-speech tagging and named entity recognition, and proposes approaches to address noise, such as customizing tools to the microblog genre and applying normalization techniques. It concludes that while performance is lower on microblogs, targeted approaches can provide gains and that leveraging additional context from metadata may further help analyze microblog language.
Empirical Validation of Reichenbach’s Tense Framework - Leon Derczynski
There exist formal accounts of tense and aspect, such as that detailed by Reichenbach (1947). Temporal semantics for corpus annotation are also available, such as TimeML. This paper describes a technique for linking the two, in order to perform a corpus-based empirical validation of Reichenbach's tense framework. It is found, via use of Freksa's semi-interval temporal algebra, that tense appropriately constrains the types of temporal relations that can hold between pairs of events described by verbs. Further, Reichenbach's framework of tense and aspect is supported by corpus evidence, leading to the first validation of the framework. Results suggest that the linking technique proposed here can be used to make advances in the difficult area of automatic temporal relation typing and other current problems regarding reasoning about time in language.
Towards Context-Aware Search and Analysis on Social Media Data - Leon Derczynski
Social media has changed the way we communicate. Social media data capture our social interactions and utterances in machine readable format. Searching and analysing massive and frequently updated social media data brings significant and diverse rewards across many different application domains, from politics and business to social science and epidemiology. A notable proportion of social media data comes with explicit or implicit spatial annotations, and almost all social media data has temporal metadata. We view social media data as a constant stream of data points, each containing text with spatial and temporal contexts. We identify challenges relevant to each context, which we intend to subject to context-aware querying and analysis, specifically including longitudinal analyses on social media archives, spatial keyword search, local intent search, and spatio-temporal intent search. Finally, for each context, emerging applications and further avenues for investigation are discussed.
Determining the Types of Temporal Relations in Discourse - Leon Derczynski
This document discusses determining the types of temporal relations in discourse. It introduces key temporal information extraction concepts like events, temporal expressions, and links between events and times. The document also examines relation extraction challenges, the role of temporal signals and tense in modelling temporal relations, and potential areas of future work such as temporal dataset construction.
TIMEN: An Open Temporal Expression Normalisation Resource - Leon Derczynski
We present TIMEN, a resource for building and sharing knowledge and rules for the TimeML temporal expression normalisation subtask - that is, the generation of a TIMEX3 annotation from a linguistic temporal expression. This provides a strong basis, built from current best approaches, which is independent of the rest of the temporal expression processing subtasks. Therefore, it can easily be integrated as a module in temporal information processing systems.
Since it is open it can be used, improved and extended by the community, in contrast to closed tools, which must be replicated from scratch as the field advances. Furthermore, TIMEN eases the development of normalization knowledge and rules for low-resourced languages since the normalization process is partially shared between languages.
Full-RAG: A modern architecture for hyper-personalization - Zilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
HCL Notes and Domino license cost reduction in the world of DLAU - panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and the licenses under the CCB and CCX model have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefits it brings you. Above all, you certainly want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We will explain how to fix common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some practices that can lead to unnecessary spending, for example using a person document instead of a mail-in for shared mailboxes. We will show you such cases and their solutions. And of course we will explain the new licensing model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It will give you the tools and know-how to keep track of what is going on. You will be able to reduce your costs through an optimized Domino configuration and keep them low going forward.
These topics will be covered
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Real-world examples and best practices you can apply immediately
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... - Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Essentials of Automations: The Art of Triggers and Actions in FME - Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Driving Business Innovation: Latest Generative AI Advancements & Success Story - Safe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
3. Language resources: datasets
Genre | Danish | Faroese | West Greenlandic | East Greenlandic | North Greenlandic | English | Swedish
Fiction | Yes - UD | | | | | Yes - UD | Yes - UD
News | Yes - UD | | | | | Yes - UD | Yes - UD
Nonfiction | Yes - UD | | | | | Yes - UD | Yes - UD
Spoken | Yes - UD | | | | | Yes - UD | Yes - UD
Wikipedia (articles) | 239.715 | Yes - UD; 12.788 | 1.657 | 0 | 0 | Yes - UD; 5.708.356 | Yes - UD; 3.771.701
Reviews | Independent | | | | | Yes - UD |
Spoken, email, blog, social media, academic, legal, essays, fiction, learner, web | | | | | | Yes - UD |
Margaret Hamilton, Apollo project lead coder, with more text than we have for most of these languages
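The Wikipedia figures in the table above are article counts as of 2018. As a minimal sketch (not part of the original slides), those counts can be re-checked at any time through the MediaWiki siteinfo API; the language codes and the `requests` dependency are assumptions, and current numbers will differ from the 2018 snapshot.

```python
# Minimal sketch: fetch current article counts for the Wikipedias in the table
# above via the MediaWiki siteinfo API. Counts will differ from the 2018 figures.
import requests

# Language codes: Danish, Faroese, West Greenlandic (Kalaallisut), English, Swedish.
# East and North Greenlandic have no Wikipedia edition, hence the zeros in the table.
WIKIS = {"da": "Danish", "fo": "Faroese", "kl": "West Greenlandic",
         "en": "English", "sv": "Swedish"}

for code, name in WIKIS.items():
    resp = requests.get(
        f"https://{code}.wikipedia.org/w/api.php",
        params={"action": "query", "meta": "siteinfo",
                "siprop": "statistics", "format": "json"},
        timeout=30,
    )
    stats = resp.json()["query"]["statistics"]
    print(f"{name} ({code}.wikipedia.org): {stats['articles']} articles")
```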
4. Language resources: tools
• DKIE - a plugin for GATE, using Stanford CoreNLP
• tokenization
• PoS tagging
• NER
• daner, dapipe
• Wrapped tools using UD resources (see the sketch after this list)
• Free from ITU (http://nlp.itu.dk)
• VISL
• Grammar, ML for some pairs
• Pretty good resource - but now old (bit rot?)
• Closed source
• http://visl.sdu.dk
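dapipe and daner are distributed by ITU and their exact command-line interfaces are not shown here. As an illustrative sketch of the same kind of UD-resource-backed pipeline, the `ufal.udpipe` Python bindings can run tokenisation, tagging and parsing with a Danish DDT model; the model filename below is a placeholder for whatever model file you have downloaded.

```python
# Minimal sketch of a UD-based Danish pipeline (tokenisation, tagging, parsing)
# using the ufal.udpipe bindings directly; dapipe wraps this kind of resource.
# "danish-ddt.udpipe" is a placeholder name for a locally downloaded Danish DDT model.
from ufal.udpipe import Model, Pipeline, ProcessingError

model = Model.load("danish-ddt.udpipe")
if model is None:
    raise RuntimeError("could not load the UDPipe model file")

pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
error = ProcessingError()

conllu_output = pipeline.process("Der bor mange mennesker i København.", error)
if error.occurred():
    raise RuntimeError(error.message)
print(conllu_output)  # one CoNLL-U block per sentence
```

The CoNLL-U output can then feed any downstream tool that already understands the UD format.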
Fig. 1: Kamelåså Syggelekokle
5. Language resources: standards
• PAROLE (FP3)
• TEI (2001-)
• CLARIN (2012-)
• CST actually have some pretty big resources, handy for domain adaptation
• Conversions to partial UD
• DDT (2004; PAROLE format)
• CDT (2009; Discontinuous Grammar)
6. Exploitable tech
• Sentiment ❌
• IE
• Triples ❌
• NER ❌
• Stance ❌
• Events
• Frames ❌
• Who-did-what-to-whom ❌
• Legal
• Compliance ❌
• Discovery ❌
• Clinical
• IE ❌
• Events ❌
• MH ❌
• Social media ❌
“12,3 milliard er vi ned” (“we are down 12.3 billion”)
7. Danish LT at ITU
• Hub for Danish language technology
• Four faculty
• Zeljko Agic:
• multilinguality, representations
• Leon Derczynski
• NLPL project, clinical, stance, social media
• Barbara Plank
• Fundamental processing tools
• Natalie Schluter
• Decoding algorithms in NLP (esp. parsing and summarisation)
• Learning algorithms for NLP
• UD Treebanking for Danish (hopefully Faroese), with extensions
• …and looking for collaborators!
• Theory of Hacks for the Machine Learning Practitioner
• Related: Deep Learning for NLP
• Resources: dapipe, daner
• nlp.itu.dk
ITU NLP - @NLPatITU
8. Funding situation
• DFF
• FTP: “This is too linguistic-y”
• ..FTP: “We don’t really fund CS anyway”
• FKK: “This is too computer-y”
• FNU: “We’re full”
• EC
• Dropped LT as specific funding category
I’m not funded well enough to caption this yet
9. Solutions: public education
• Inform the population
• People not familiar with NLP
• Ground-up approach: way too formal
• Top-down: introduce tools and their effects
• Political analysis, clinical analysis, business analysis
• Give up on local funding for basic NLP research
• Until the local population is as familiar with the tech as anglophone populations are
• What do we need to do to get there?
10. Solutions: build resources
• Scrape more (see the sketch after this list)
• Use CLARIN more
• Publish your damn data
• (everybody else is)
• (it’s 2018 and the world has changed again)
• Put resources on GitHub / Figshare
• No matter how trivial - e.g. much in NLPL is near-free to build, but not previously available
• Build equivalents of English LRs
• Directory of LRs for Danish languages?
• Maybe just re-use (or start using) LRE Map
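As a minimal sketch of the "scrape more" and "publish your data" points above (not the speaker's own tooling), the MediaWiki extracts API can pull plain-text Danish Wikipedia articles into a file ready to push to GitHub or deposit on Figshare; the article titles below are illustrative placeholders.

```python
# Minimal sketch: build a small plain-text Danish corpus from Wikipedia article
# extracts and save it to a file ready to version on GitHub or deposit on Figshare.
# The article titles below are illustrative placeholders.
import requests

TITLES = ["Danmark", "København", "Dansk (sprog)"]  # placeholder article titles
API = "https://da.wikipedia.org/w/api.php"

with open("da_wiki_sample.txt", "w", encoding="utf-8") as out:
    for title in TITLES:
        resp = requests.get(API, params={
            "action": "query", "prop": "extracts", "explaintext": 1,
            "titles": title, "format": "json",
        }, timeout=30)
        pages = resp.json()["query"]["pages"]
        for page in pages.values():
            text = page.get("extract", "")
            if text:
                out.write(f"# {title}\n{text}\n\n")
```

Wikipedia text is CC BY-SA, so the licence should travel with any redistributed dump.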
Fig. 2: “træls” (Jutlandic Danish for “tiresome”)
11. Solutions: standards
• Consider: task and format
• Meaning representation: AMR; Frames; entity linking
• Parsing: discontinuous grammar; UD; Stanford; ..PTB-like
• Morphology: UD; Danish-specific (see the sketch after this list)
• Translation: Word-aligned; comparable corpora
• Semantics: ISO standards; CLEF; interoperability (ISA workshops); SRL
• Coreference: SemEval; MUC; Stanford deps
• NLG (MT, summ): multiple summaries/target examples
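To make the "Morphology: UD" option concrete, here is a minimal sketch (assuming the third-party `conllu` package is installed) that reads a UD CoNLL-U file and prints each token's form, lemma, UPOS tag and morphological features; the Danish DDT filename is a placeholder for a locally downloaded treebank split.

```python
# Minimal sketch: inspect UD morphology in a CoNLL-U file with the `conllu` package.
# "da_ddt-ud-train.conllu" is a placeholder path for a locally downloaded
# Danish DDT training split.
from conllu import parse_incr

with open("da_ddt-ud-train.conllu", "r", encoding="utf-8") as f:
    for sentence in parse_incr(f):
        for token in sentence:
            feats = token["feats"] or {}
            upos = token.get("upos", token.get("upostag"))  # key name varies by version
            print(token["form"], token["lemma"], upos,
                  "|".join(f"{k}={v}" for k, v in feats.items()))
        break  # just the first sentence, for illustration
```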
Sabine Kirchmeier, director of Dansk Sprognævn (the Danish Language Council)
12. Solutions: funding routes
• Through the back door: application-oriented
• Political analysis: stance
• Business intelligence: sentiment, IE (NER & triples)
• Clinical: semantic extraction, parsing, IE (temporal)
• Architecture and city planning: spatial processing, IE (NER)
• Arctic collaboration: Faroese, Greenlandic