Slides for my tutorial at the ESWC Summer School 2015, giving an introduction to information extraction with Linked Data and an introduction to one of the applications of information extraction, opinion mining.
The document provides an overview of natural language processing (NLP) and its related areas. It discusses the classical view of NLP involving stages of processing like syntax, semantics, pragmatics, etc. It also discusses the statistical/machine learning view of NLP, where NLP tasks are framed as classification problems and cues from language help reduce uncertainty. Finally, it provides examples of lower-level NLP tasks like part-of-speech tagging that can be viewed as sequence labeling problems.
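The "sequence labelling" framing mentioned above can be made concrete with a toy sketch (not from the slides): each token is classified using cues from the word itself and the previously predicted tag, with counts standing in for a learned model.

```python
# Toy illustration of POS tagging as sequence labelling: classify each
# token greedily, using word-identity and previous-tag cues.
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Count word->tag (emission) and prev_tag->tag (transition) evidence."""
    emit = defaultdict(Counter)   # word     -> tag counts
    trans = defaultdict(Counter)  # prev tag -> tag counts
    for sent in tagged_sentences:
        prev = "<s>"
        for word, tag in sent:
            emit[word.lower()][tag] += 1
            trans[prev][tag] += 1
            prev = tag
    return emit, trans

def greedy_tag(words, emit, trans):
    """Greedy left-to-right decoding: pick the tag maximising
    word-cue count * transition-cue count (add-one to avoid zeros)."""
    all_tags = {t for counts in emit.values() for t in counts}
    tags, prev = [], "<s>"
    for w in words:
        best = max(all_tags,
                   key=lambda t: (emit[w.lower()][t] + 1) * (trans[prev][t] + 1))
        tags.append(best)
        prev = best
    return tags

corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
emit, trans = train(corpus)
print(greedy_tag(["the", "cat", "barks"], emit, trans))  # ['DET', 'NOUN', 'VERB']
```

A real tagger would replace the count products with a trained classifier (or a CRF/neural model), but the decomposition into per-token decisions conditioned on language cues is the same.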
Beyond Fact Checking — Modelling Information Change in Scientific Communication – Isabelle Augenstein
The document discusses modelling information change in scientific communication. It begins by noting how science is often communicated through journalists to the public, and how the message can change and become exaggerated or misleading along the way. It then discusses developing models to detect exaggeration by predicting the strength of causal claims, such as distinguishing between correlational and causal language. Pattern exploiting training is explored as a way to leverage large language models for this task in a semi-supervised manner. Finally, it proposes generally modelling information change by comparing original research to how it is communicated elsewhere, such as in news articles and tweets, using semantic matching techniques. Experiments are discussed on newly created datasets to benchmark performance of models on this task.
The document discusses automatically detecting scientific misinformation and exaggeration. It introduces work on cite-worthiness detection to improve scientific document understanding, and on detecting exaggeration in health science press releases. It describes generating scientific claims from citations for zero-shot scientific fact checking. The talk covers claim detection and generation, cite-worthiness detection, scientific claim generation, and exaggeration detection.
The past decade has seen a substantial rise in the amount of mis- and disinformation online, from targeted disinformation campaigns to influence politics, to the unintentional spreading of misinformation about public health. This development has spurred research in the area of automatic fact checking, a knowledge-intensive and complex reasoning task. Most existing fact checking models predict a claim’s veracity with black-box models, which often lack explanations of the reasons behind their predictions and contain hidden vulnerabilities. The lack of transparency in fact checking systems and ML models in general has been exacerbated by increased model size and by “the right…to obtain an explanation of the decision reached” enshrined in European law. This talk presents some first solutions to generating explanations for fact checking models. It then examines how to assess the generated explanations using diagnostic properties, and how further optimising for these diagnostic properties can improve the quality of the generated explanations. Finally, the talk examines how to systematically reveal vulnerabilities of black-box fact checking models.
Most work on scholarly document processing assumes that the information processed is trustworthy and factually correct. However, this is not always the case. There are two core challenges that should be addressed: 1) ensuring that scientific publications are credible -- e.g. that claims are not made without supporting evidence, and that all relevant supporting evidence is provided; and 2) ensuring that scientific findings are not misrepresented, distorted or outright misreported when communicated by journalists or the general public. I will present some first steps towards addressing these problems and outline remaining challenges.
Towards Explainable Fact Checking (DIKU Business Club presentation) – Isabelle Augenstein
Outline:
- Fact checking – what is it and why do we need it?
- False information online
- Content-based automatic fact checking
- Explainability – what is it and why do we need it?
- Making the right predictions for the right reasons
- Model training pipeline
- Explainable fact checking – some first solutions
- Rationale selection
- Generating free-text explanations
- Wrap-up
Tutorial on 'Explainability for NLP' given at the first ALPS (Advanced Language Processing) winter school: http://lig-alps.imag.fr/index.php/schedule/
The talk introduces the concepts of 'model understanding' as well as 'decision understanding' and provides examples of approaches from the areas of fact checking and text classification.
Exercises to go with the tutorial are available here: https://github.com/copenlu/ALPS_2021
Automatic fact checking is one of the more involved NLP tasks currently researched: not only does it require sentence understanding, but also an understanding of how claims relate to evidence documents and world knowledge. Moreover, there is still no common understanding in the automatic fact checking community of how the subtasks of fact checking — claim check-worthiness detection, evidence retrieval, veracity prediction — should be framed. This is partly owing to the complexity of the task, which persists despite efforts to formalise fact checking through the development of benchmark datasets.
The first part of the talk will be on automatically generating textual explanations for fact checking, thereby exposing some of the reasoning processes these models follow. The second part of the talk will be on re-examining how claim check-worthiness is defined and how check-worthy claims can be detected, followed by how claims that are hard to fact-check automatically can be generated.
Talk on 'Tracking False Information Online' at W-NUT workshop at EMNLP 2019.
=========
Digital media enables fast sharing of information and discussions among users. While this comes with many benefits to today’s society, such as broadening information access, the manner in which information is disseminated also has obvious downsides. Since fast access to information is expected by many users and news outlets are often under financial pressure, speedy access often comes at the expense of accuracy, which leads to misinformation. Moreover, digital media can be misused by campaigns to intentionally spread false information, i.e. disinformation, about events, individuals or governments. In this talk, I will present different ways false information is spread online, including misinformation and disinformation. I will then report findings from our recent and ongoing work on automatic fact checking, stance detection and framing attitudes.
What can typological knowledge bases and language representations tell us abo... – Isabelle Augenstein
One of the core challenges in typology is to record properties of languages in a structured way. As a result of manual efforts, typological knowledge bases have emerged, which contain information about languages’ phonological, morphological and syntactic properties, as well as information about language families. Ideally, such typological knowledge bases would provide useful information for multilingual NLP models to learn how to selectively share parameters.
A related area of research suggests a different way of encoding properties of languages, namely to learn language representation vectors directly from text documents.
In this talk, I will analyse and contrast these two ways of encoding linguistic properties, as well as present research on how the two can benefit one another.
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ... – Isabelle Augenstein
Paper presented at NAACL 2018. Link: https://arxiv.org/abs/1802.09913
Abstract:
============
We combine multi-task learning and semi-supervised learning by inducing a joint embedding space between disparate label spaces and learning transfer functions between label embeddings, enabling us to jointly leverage unlabelled data and auxiliary, annotated datasets. We evaluate our approach on a variety of sequence classification tasks with disparate label spaces. We outperform strong single and multi-task baselines and achieve a new state-of-the-art for topic-based sentiment analysis.
Learning with limited labelled data in NLP: multi-task learning and beyond – Isabelle Augenstein
When labelled training data for certain NLP tasks or languages is not readily available, different approaches exist to leverage other resources for the training of machine learning models. Those are commonly either instances from a related task or unlabelled data.
An approach that has been found to work particularly well when only limited training data is available is multi-task learning.
There, a model learns from examples of multiple related tasks at the same time by sharing hidden layers between tasks, and can therefore benefit from a larger overall number of training instances and improve its generalisation performance. In the related paradigm of semi-supervised learning, unlabelled data, as well as labelled data for related tasks, can be easily utilised by transferring labels from labelled instances to unlabelled ones, essentially extending the training dataset.
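The label-transfer idea above can be sketched as simple self-training (a toy illustration, not the method of the papers discussed): a base classifier trained on the labelled data assigns pseudo-labels to unlabelled instances, which are then folded back into the training set before refitting.

```python
# Toy self-training sketch with a nearest-centroid classifier on 1-D features.
def fit_centroids(points, labels):
    """One centroid per label: the mean of that label's points."""
    cents = {}
    for lab in set(labels):
        vals = [x for x, l in zip(points, labels) if l == lab]
        cents[lab] = sum(vals) / len(vals)
    return cents

def predict(cents, x):
    """Assign the label of the nearest centroid."""
    return min(cents, key=lambda lab: abs(cents[lab] - x))

def self_train(points, labels, unlabelled, rounds=2):
    """Transfer labels from labelled to unlabelled instances,
    extending the training set, then refit the classifier."""
    pts, labs = list(points), list(labels)
    for _ in range(rounds):
        cents = fit_centroids(pts, labs)
        for x in unlabelled:
            if x not in pts:          # pseudo-label each point once
                pts.append(x)
                labs.append(predict(cents, x))
    return fit_centroids(pts, labs)

cents = self_train([0.0, 1.0, 9.0, 10.0], ["a", "a", "b", "b"],
                   unlabelled=[1.5, 8.5])
print(predict(cents, 2.0))  # "a"
```

Real systems would add a confidence threshold before transferring a label, but the mechanism — growing the training set with pseudo-labelled instances — is the same.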
In this talk, I will present my recent and ongoing work in the space of learning with limited labelled data in NLP, including our NAACL 2018 papers 'Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate Label Spaces' [1] and 'From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings' [2].
[1] https://t.co/A5jHhFWrdw
[2] https://arxiv.org/abs/1802.09375
==========
Bio from my website http://isabelleaugenstein.github.io/index.html:
I have been a tenure-track assistant professor at the University of Copenhagen, Department of Computer Science, since July 2017. I am affiliated with the CoAStAL NLP group and work in the general areas of Statistical Natural Language Processing and Machine Learning. My main research interests are weakly supervised and low-resource learning with applications including information extraction, machine reading and fact checking.
Before starting a faculty position, I was a postdoctoral research associate in Sebastian Riedel's UCL Machine Reading group, mainly investigating machine reading from scientific articles. Prior to that, I was a Research Associate in the Sheffield NLP group, a PhD Student in the University of Sheffield Computer Science department, a Research Assistant at AIFB, Karlsruhe Institute of Technology and a Computational Linguistics undergraduate student at the Department of Computational Linguistics, Heidelberg University.
The spread of mis- and disinformation is growing and is having a major impact on interpersonal communication, politics and even science.
Traditional methods, e.g. manual fact-checking by reporters, cannot keep up with the growth of information. On the other hand, there has been much progress in natural language processing recently, partly due to the resurgence of neural methods.
How can natural language processing methods fill this gap and help to automatically check facts?
This talk will explore different ways to frame fact checking and detail our ongoing work on learning to encode documents for automated fact checking, as well as describe future challenges.
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc... – Isabelle Augenstein
Shared task summary for SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Scientific Publications
Paper: https://arxiv.org/abs/1704.02853
Abstract:
We describe the SemEval task of extracting keyphrases and relations between them from scientific documents, which is crucial for understanding which publications describe which processes, tasks and materials. Although this was a new task, we had a total of 26 submissions across 3 evaluation scenarios. We expect the task and the findings reported in this paper to be relevant for researchers working on understanding scientific content, as well as the broader knowledge base population and information extraction communities.
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ... – Isabelle Augenstein
The document summarizes the history and goals of the WiNLP workshop, which aims to promote and support women and underrepresented groups in natural language processing. It discusses the growth of WiNLP from its inception in 2016 to the 2017 workshop with over 130 participants. It outlines WiNLP's mission to increase awareness of work by underrepresented groups and build community. It also notes challenges such as underrepresentation, bias, and lack of resources that WiNLP addresses through mentoring, funding, and community building.
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum... – Isabelle Augenstein
The document discusses machine reading using neural machines. It presents goals of fact checking claims and understanding scientific publications. It outlines challenges in tasks like stance detection on tweets and summarizing scientific papers. These include interpreting statements based on the target or headline, handling unseen targets, and the small size of benchmark datasets which makes neural machine reading computationally costly.
Presentation of work that will be published at EMNLP 2016.
Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, Sebastian Riedel. emoji2vec: Learning Emoji Representations from their Description. SocialNLP at EMNLP 2016. https://arxiv.org/abs/1609.08359
Georgios Spithourakis, Isabelle Augenstein, Sebastian Riedel. Numerically Grounded Language Models for Semantic Error Correction. EMNLP 2016. https://arxiv.org/abs/1608.04147
Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos, Kalina Bontcheva. Stance Detection with Bidirectional Conditional Encoding. EMNLP 2016. https://arxiv.org/abs/1606.05464
USFD at SemEval-2016 - Stance Detection on Twitter with Autoencoders – Isabelle Augenstein
This paper describes the University of Sheffield's submission to the SemEval 2016 Twitter Stance Detection weakly supervised task (SemEval 2016 Task 6, Subtask B). In stance detection, the goal is to classify the stance of a tweet towards a target as "favor", "against", or "none". In Subtask B, the targets in the test data are different from the targets in the training data, thus rendering the task more challenging but also more realistic.
To address the lack of target-specific training data, we use a large set of unlabelled tweets containing all targets and train a bag-of-words autoencoder to learn how to produce feature representations of tweets. These feature representations are then used to train a logistic regression classifier on labelled tweets, with additional features such as an indicator of whether the target is contained in the tweet. Our submitted run on the test data achieved an F1 of 0.3270.
Paper: http://isabelleaugenstein.github.io/papers/SemEval2016-Stance.pdf
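The feature construction described above can be sketched as follows (a simplification, not the paper's code: the bag-of-words autoencoder is replaced here by a plain bag-of-words vector, which the autoencoder would compress into a dense representation; the target-indicator feature is appended as described).

```python
# Sketch of stance-detection feature construction: bag-of-words counts
# plus an indicator for whether the target is contained in the tweet.
def featurise(tweet, target, vocab):
    tokens = tweet.lower().split()
    bow = [tokens.count(w) for w in vocab]            # word counts over vocab
    target_in_tweet = 1 if target.lower() in tweet.lower() else 0
    return bow + [target_in_tweet]

vocab = ["climate", "change", "hoax", "real"]
vec = featurise("Climate change is real", "climate change", vocab)
print(vec)  # [1, 1, 0, 1, 1]
```

These vectors would then be fed to a classifier such as logistic regression, trained only on the labelled tweets.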
Imitation learning is used to address the problem of distant supervision for relation extraction. It decomposes the task into named entity classification (NEC) and relation extraction (RE), allowing the models to be trained separately. Through an iterative process, imitation learning is able to learn the dependencies between NEC and RE even when only labels for RE are provided. This overcomes limitations of prior approaches that rely on distantly labeled data. Evaluation shows the approach improves over baselines by leveraging multi-stage modeling to compensate for mistakes at the NEC stage.
Extracting Relations between Non-Standard Entities using Distant Supervision ... – Isabelle Augenstein
Poster for our EMNLP paper on extracting non-standard relations from the Web with distant supervision and imitation learning. Read the full paper here: https://aclweb.org/anthology/D/D15/D15-1086.pdf
Relation Extraction from the Web using Distant Supervision – Isabelle Augenstein
This paper proposes using distant supervision to extract relations from web text to populate knowledge bases without requiring manual effort. It does this by using an existing knowledge base to automatically label sentences with entity relations, training a classifier on this distant supervision data. The paper describes using statistical methods to select better training data and discard noisy examples, and shows this improves precision. It also introduces methods for integrating information across sentences which improves both precision and recall of extracted relations.
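The core distant supervision step — using an existing knowledge base to automatically label sentences — can be illustrated with a minimal sketch (a toy example with hypothetical facts, not the paper's pipeline): any sentence mentioning both entities of a known fact is labelled with that fact's relation, yielding noisy training data without manual annotation.

```python
# Minimal distant supervision sketch: label sentences that mention
# both entities of a KB fact with that fact's relation.
KB = [
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
]

def distant_label(sentences):
    labelled = []
    for sent in sentences:
        for e1, rel, e2 in KB:
            if e1 in sent and e2 in sent:
                labelled.append((sent, e1, e2, rel))
    return labelled

data = distant_label([
    "Paris is the capital of France.",
    "Paris hosted the 2024 Olympics.",   # only one KB entity: not labelled
])
print(data)
```

The statistical filtering described above would then discard noisy matches — sentences that mention both entities without actually expressing the relation.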
Seed Selection for Distantly Supervised Web-Based Relation Extraction – Isabelle Augenstein
Slides of my presentation on "Seed Selection for Distantly Supervised Web-Based Relation Extraction" at the Semantic Web and Information Extraction workshop (SWAIE) and COLING 2014
Download link for the paper: http://staffwww.dcs.shef.ac.uk/people/I.Augenstein/SWAIE2014-Seed.pdf
The document presents a method for mapping keywords to linked data resources for automatic query expansion. It aims to address challenges like spelling mistakes, synonyms, and lexical variations. The method learns an expanded set of keywords and ranks them by identifying concepts through labeling properties within the dataset. It was evaluated on DBpedia and showed an improvement over state-of-the-art methods, achieving a 17% increase in mean reciprocal rank. Future work is discussed to integrate multiple strategies and fine-tune the approach.
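The mean reciprocal rank (MRR) metric used in that evaluation can be computed as follows (a standard definition, with hypothetical DBpedia-style resources as example data): for each query, take 1/rank of the first correct resource in the ranked list, then average over queries.

```python
# Mean reciprocal rank: average of 1/rank of the first correct answer.
def mean_reciprocal_rank(ranked_lists, gold):
    total = 0.0
    for ranking, correct in zip(ranked_lists, gold):
        rr = 0.0
        for i, item in enumerate(ranking, start=1):
            if item == correct:
                rr = 1.0 / i   # reciprocal rank of first correct hit
                break
        total += rr            # queries with no hit contribute 0
    return total / len(ranked_lists)

mrr = mean_reciprocal_rank(
    [["dbo:Town", "dbo:City"], ["dbo:City", "dbo:Country"]],
    ["dbo:City", "dbo:City"])
print(mrr)  # (1/2 + 1) / 2 = 0.75
```

A "17% increase in MRR" thus means the correct resource appears, on average, markedly higher in the ranked expansion lists.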
Automatic fact checking is one of the more involved NLP tasks currently researched: not only does it require sentence understanding, but also an understanding of how claims relate to evidence documents and world knowledge. Moreover, there is still no common understanding in the automatic fact checking community of how the subtasks of fact checking — claim check-worthiness detection, evidence retrieval, veracity prediction — should be framed. This is partly owing to the complexity of the task, despite efforts to formalise the task of fact checking through the development of benchmark datasets.
The first part of the talk will be on automatically generating textual explanations for fact checking, thereby exposing some of the reasoning processes these models follow. The second part of the talk will be on re-examining how claim check-worthiness is defined, and how check-worthy claims can be detected; followed by how to automatically generate claims which are hard to fact-check automatically.
Talk on 'Tracking False Information Online' at W-NUT workshop at EMNLP 2019.
=========
Digital media enables fast sharing of information and discussions among users. While this comes with many benefits to today’s society, such as broadening information access, the manner in which information is disseminated also has obvious downsides. Since fast access to information is expected by many users and news outlets are often under financial pressure, speedy access often comes at the expense of accuracy, which leads to misinformation. Moreover, digital media can be misused by campaigns to intentionally spread false information, i.e. disinformation, about events, individuals or governments. In this talk, I will present on different ways false information is spread online, including misinformation and disinformation. I will then report findings from our recent and ongoing work on automatic fact checking, stance detection and framing attitudes.
What can typological knowledge bases and language representations tell us abo...Isabelle Augenstein
One of the core challenges in typology is to record properties of languages in a structured way. As a result of manual efforts, typological knowledge bases have emerged, which contains information about languages’ phonological, morphological and syntactic properties; as well as information about language families. Ideally, such typological knowledge bases would provide useful information for multilingual NLP models to learn how to selectively share parameters.
A related area of research suggests a different way of encoding properties of languages, namely to learn language representation vectors directly from text documents.
In this talk, I will analyse and contrast these two ways of encoding linguistic properties, as well as present research on how the two can benefit one another.
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...Isabelle Augenstein
Paper presented at NAACL 2018. Link: https://arxiv.org/abs/1802.09913
Abstract:
============
We combine multi-task learning and semi-supervised learning by inducing a joint embedding space between disparate label spaces and learning transfer functions between label embeddings, enabling us to jointly leverage unlabelled data and auxiliary, annotated datasets. We evaluate our approach on a variety of sequence classification tasks with disparate label spaces. We outperform strong single and multi-task baselines and achieve a new state-of-the-art for topic-based sentiment analysis.
Learning with limited labelled data in NLP: multi-task learning and beyondIsabelle Augenstein
When labelled training data for certain NLP tasks or languages is not readily available, different approaches exist to leverage other resources for the training of machine learning models. Those are commonly either instances from a related task or unlabelled data.
An approach that has been found to work particularly well when only limited training data is available is multi-task learning.
There, a model learns from examples of multiple related tasks at the same time by sharing hidden layers between tasks, and can therefore benefit from a larger overall number of training instances and extend the models' generalisation performance. In the related paradigm of semi-supervised learning, unlabelled data as well as labelled data for related tasks can be easily utilised by transferring labels from labelled instances to unlabelled ones in order to essentially extend the training dataset.
In this talk, I will present my recent and ongoing work in the space of learning with limited labelled data in NLP, including our NAACL 2018 papers 'Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate Label Spaces [1] and 'From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings’ [2].
[1] https://t.co/A5jHhFWrdw
[2] https://arxiv.org/abs/1802.09375
==========
Bio from my website http://isabelleaugenstein.github.io/index.html:
I am a tenure-track assistant professor at the University of Copenhagen, Department of Computer Science since July 2017, affiliated with the CoAStAL NLP group and work in the general areas of Statistical Natural Language Processing and Machine Learning. My main research interests are weakly supervised and low-resource learning with applications including information extraction, machine reading and fact checking.
Before starting a faculty position, I was a postdoctoral research associate in Sebastian Riedel's UCL Machine Reading group, mainly investigating machine reading from scientific articles. Prior to that, I was a Research Associate in the Sheffield NLP group, a PhD Student in the University of Sheffield Computer Science department, a Research Assistant at AIFB, Karlsruhe Institute of Technology and a Computational Linguistics undergraduate student at the Department of Computational Linguistics, Heidelberg University.
Spreading of mis- and disinformation is growing and is having a big impact on interpersonal communications, politics and even science.
Traditional methods, e.g. manual fact-checking by reporters cannot keep up with the growth of information. On the other hand, there has been much progress in natural language processing recently, partly due to the resurgence of neural methods.
How can natural language processing methods fill this gap and help to automatically check facts?
This talk will explore different ways to frame fact checking and detail our ongoing work on learning to encode documents for automated fact checking, as well as describe future challenges.
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...Isabelle Augenstein
Shared task summary for SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Scientific Publications
Paper: https://arxiv.org/abs/1704.02853
Abstract:
We describe the SemEval task of extracting keyphrases and relations between them from scientific documents, which is crucial for understanding which publications describe which processes, tasks and materials. Although this was a new task, we had a total of 26 submissions across 3 evaluation scenarios. We expect the task and the findings reported in this paper to be relevant for researchers working on understanding scientific content, as well as the broader knowledge base population and information extraction communities.
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...Isabelle Augenstein
The document summarizes the history and goals of the WiNLP workshop, which aims to promote and support women and underrepresented groups in natural language processing. It discusses the growth of WiNLP from its inception in 2016 to the 2017 workshop with over 130 participants. It outlines WiNLP's mission to increase awareness of work by underrepresented groups and build community. It also notes challenges such as underrepresentation, bias, and lack of resources that WiNLP addresses through mentoring, funding, and community building.
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...Isabelle Augenstein
The document discusses machine reading using neural machines. It presents goals of fact checking claims and understanding scientific publications. It outlines challenges in tasks like stance detection on tweets and summarizing scientific papers. These include interpreting statements based on the target or headline, handling unseen targets, and the small size of benchmark datasets which makes neural machine reading computationally costly.
Presentation of work that will be published at EMNLP 2016.
Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, Sebastian Riedel. emoji2vec: Learning Emoji Representations from their Description. SocialNLP at EMNLP 2016. https://arxiv.org/abs/1609.08359
Georgios Spithourakis, Isabelle Augenstein, Sebastian Riedel. Numerically Grounded Language Models for Semantic Error Correction. EMNLP 2016. https://arxiv.org/abs/1608.04147
Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos, Kalina Bontcheva. Stance Detection with Bidirectional Conditional Encoding. EMNLP 2016. https://arxiv.org/abs/1606.05464
USFD at SemEval-2016 - Stance Detection on Twitter with AutoencodersIsabelle Augenstein
This paper describes the University of Sheffield's submission to the SemEval 2016 Twitter Stance Detection weakly supervised task (SemEval 2016 Task 6, Subtask B). In stance detection, the goal is to classify the stance of a tweet towards a target as "favor", "against", or "none". In Subtask B, the targets in the test data are different from the targets in the training data, thus rendering the task more challenging but also more realistic.
To address the lack of target-specific training data, we use a large set of unlabelled tweets containing all targets and train a bag-of-words autoencoder to learn how to produce feature representations of tweets. These feature representations are then used to train a logistic regression classifier on labelled tweets, with additional features such as an indicator of whether the target is contained in the tweet. Our submitted run on the test data achieved an F1 of 0.3270.
Paper: http://isabelleaugenstein.github.io/papers/SemEval2016-Stance.pdf
Imitation learning is used to address the problem of distant supervision for relation extraction. It decomposes the task into named entity classification (NEC) and relation extraction (RE), allowing the models to be trained separately. Through an iterative process, imitation learning is able to learn the dependencies between NEC and RE even when only labels for RE are provided. This overcomes limitations of prior approaches that rely on distantly labeled data. Evaluation shows the approach improves over baselines by leveraging multi-stage modeling to compensate for mistakes at the NEC stage.
Extracting Relations between Non-Standard Entities using Distant Supervision ..., by Isabelle Augenstein
Poster for our EMNLP paper on extracting non-standard relations from the Web with distant supervision and imitation learning. Read the full paper here: https://aclweb.org/anthology/D/D15/D15-1086.pdf
Relation Extraction from the Web using Distant Supervision, by Isabelle Augenstein
This paper proposes using distant supervision to extract relations from web text to populate knowledge bases without requiring manual effort. It does this by using an existing knowledge base to automatically label sentences with entity relations, training a classifier on this distant supervision data. The paper describes using statistical methods to select better training data and discard noisy examples, and shows this improves precision. It also introduces methods for integrating information across sentences which improves both precision and recall of extracted relations.
Seed Selection for Distantly Supervised Web-Based Relation Extraction, by Isabelle Augenstein
Slides of my presentation on "Seed Selection for Distantly Supervised Web-Based Relation Extraction" at the Semantic Web and Information Extraction workshop (SWAIE) and COLING 2014
Download link for the paper: http://staffwww.dcs.shef.ac.uk/people/I.Augenstein/SWAIE2014-Seed.pdf
The document presents a method for mapping keywords to linked data resources for automatic query expansion. It aims to address challenges like spelling mistakes, synonyms, and lexical variations. The method learns an expanded set of keywords and ranks them by identifying concepts through labeling properties within the dataset. It was evaluated on DBpedia and showed an improvement over state-of-the-art methods, achieving a 17% increase in mean reciprocal rank. Future work is discussed to integrate multiple strategies and fine-tune the approach.
Information Extraction with Linked Data
1. Information Extraction with
Linked Data
Isabelle Augenstein
Department of Computer Science, University of Sheffield, UK
i.augenstein@sheffield.ac.uk
2 September 2015
Information Extraction with Linked Data Tutorial, ESWC Summer School 2015
7. 7
Information Extraction
Isabelle Augenstein
The Arctic Monkeys almost exclusively played songs from their
new album AM at Summerfest 2014 at Miller Lite Oasis in
Milwaukee on 25 June 2014.
Named Entity Recognition
8. 8
Information Extraction
The Arctic Monkeys almost exclusively played songs from their
new album AM at Summerfest 2014 at Miller Lite Oasis in
Milwaukee on 25 June 2014.
Named Entity Recognition
Named Entity Classification (NEC):
Arctic Monkeys: mo:MusicArtist
AM: mo:SignalGroup
Summerfest 2014: mo:Festival
Miller Lite Oasis: geo:SpatialThing
Milwaukee: geo:SpatialThing
9. 9
Information Extraction
The Arctic Monkeys almost exclusively played songs from their
new album AM at Summerfest 2014 at Miller Lite Oasis in
Milwaukee on 25 June 2014.
Named Entity Recognition
Named Entity Classification (NEC): Named Entity Linking (NEL):
Arctic Monkeys: mo:MusicArtist Arctic Monkeys: mo:artist/ada7a83 ...
AM: mo:SignalGroup AM: mo:release-group/a348ba2f-f8b3 …
Summerfest 2014: mo:Festival Summerfest 2014: mo:event/3fc3 …
Miller Lite Oasis: geo:SpatialThing Miller Lite Oasis: mo:place/3f26acf …
Milwaukee: geo:SpatialThing Milwaukee: mo:area/4dc3fa97-cf9b- …
10. 10
Named Entities: Definition
Named Entities: Proper nouns, which refer to real-life entities
Named Entity Recognition: Detecting boundaries of named
entities (NEs)
Named Entity Classification: Assigning classes to NEs, such as
PERSON, LOCATION, ORGANISATION, MISC or fine-grained
classes such as SIGNAL GROUP
Named Entity Linking / Disambiguation: Linking NEs to
concrete entries in knowledge base, example:
Milwaukee -> LOC: largest city in the U.S. state of Wisconsin
-> LOC: Milwaukee, Oregon, named after the city in Wisconsin
-> LOC: Milwaukee County, Wisconsin
-> ORG: Milwaukee Tool Corp, a manufacturer of electric power tools
-> MISC: early codename for what was to become the Macintosh II
-> …
11. 11
Relations
The Arctic Monkeys almost exclusively played songs from their
new album AM at Summerfest 2014 at Miller Lite Oasis in
Milwaukee on 25 June 2014.
Named Entity Recognition
Relation Extraction
foaf:made
gn:parentFeature
mo:Festival
12. 12
Relations
The Arctic Monkeys almost exclusively played songs from their
new album AM at Summerfest 2014 at Miller Lite Oasis in
Milwaukee on 25 June 2014.
Named Entity Recognition
Relation Extraction
Temporal Extraction
foaf:made
gn:parentFeature
mo:Festival
2014-06-25
13. 13
Relations
The Arctic Monkeys almost exclusively played songs from their
new album AM at Summerfest 2014 at Miller Lite Oasis in
Milwaukee on 25 June 2014.
Named Entity Recognition
Relation Extraction
Temporal Extraction
foaf:made
gn:parentFeature
mo:Festival
2014-06-25
Event Extraction
Event: mo:Festival: Summerfest 2014
foaf:Agent: Arctic Monkeys
time:TemporalEntity: 2014-06-25
geo:SpatialThing: Miller Lite Oasis
14. 14
Relations, Time Expressions
and Events: Definition
Relations: Two or more entities which relate to one another in
real life
Relation Extraction: Detecting relations between entities and
assigning relation types to them, such as LOCATED-IN
Temporal Extraction: Recognising and normalising time
expressions: times (e.g. “3 in the afternoon”), dates (“tomorrow”),
durations (“since yesterday”), and sets (e.g. “twice a month”)
Events: Real-life events that happened at some point in space
and time, e.g. music festival, album release
Event Extraction: Extracting events consisting of the name and
type of event, agent, time and location
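The normalisation step in temporal extraction can be sketched as a small recogniser for explicit day-month-year expressions, mapping them to ISO 8601 as in the slide's "25 June 2014" to "2014-06-25" example. This is a minimal sketch under the assumption of English month names; the pattern and function name are illustrative, not part of the tutorial's tooling.

```python
# Minimal sketch: recognise explicit dates like "25 June 2014" and
# normalise them to ISO 8601 (illustrative, not the tutorial's code).
import re
from datetime import datetime

DATE_PATTERN = re.compile(
    r"\b(\d{1,2}) (January|February|March|April|May|June|July|"
    r"August|September|October|November|December) (\d{4})\b")

def normalise_dates(text):
    """Return (surface form, ISO date) pairs for explicit dates in text."""
    results = []
    for match in DATE_PATTERN.finditer(text):
        day, month, year = match.groups()
        iso = datetime.strptime(f"{day} {month} {year}", "%d %B %Y").date().isoformat()
        results.append((match.group(0), iso))
    return results

print(normalise_dates("The Arctic Monkeys played in Milwaukee on 25 June 2014."))
# → [('25 June 2014', '2014-06-25')]
```

Relative expressions ("tomorrow", "since yesterday") and sets ("twice a month") need an anchor time and far richer grammars, which is what dedicated temporal taggers provide.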
15. 15
Summary: Introduction
• Information extraction (IE) methods such as named entity
recognition (NER), named entity classification (NEC), named
entity linking, relation extraction (RE), temporal extraction, and
event extraction can help to add markup to Web pages
• Information extraction approaches can serve two purposes:
• Annotating every single mention of an entity, relation or event,
e.g. to add markup to Web pages
• Aggregating those mentions to populate knowledge bases, e.g.
based on confidence values and majority voting
Milwaukee LOC 0.9
Milwaukee LOC 0.8
Milwaukee ORG 0.4
→ Milwaukee LOC
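The aggregation in the Milwaukee example (LOC 0.9, LOC 0.8, ORG 0.4 giving LOC) can be sketched as confidence-weighted voting over mention-level predictions. The slide does not prescribe a specific scheme, so the function below is one illustrative choice.

```python
# Sketch: combine per-mention NEC predictions into a single
# knowledge-base label by summing confidences per label.
from collections import defaultdict

def aggregate_label(mentions):
    """mentions: list of (label, confidence) pairs for one entity."""
    scores = defaultdict(float)
    for label, confidence in mentions:
        scores[label] += confidence
    # return the label with the highest total confidence
    return max(scores, key=scores.get)

# The slide's example: three mentions of "Milwaukee".
print(aggregate_label([("LOC", 0.9), ("LOC", 0.8), ("ORG", 0.4)]))  # → LOC
```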
16. 16
NERC: Methods
• Possible methodologies
• Rule-based approaches: write manual extraction rules
• Machine learning based approaches
• Supervised learning: manually annotate text, train machine
learning model
• Unsupervised learning: extract language patterns, cluster similar
ones
• Semi-supervised learning: start with a small number of language
patterns, iteratively learn more (bootstrapping)
• Gazetteer-based method: use existing list of named entities
• Combination of the above
34. 34
Information Extraction
Language is ambiguous..
Can we still build named entity extractors that extract all
entities from unseen text correctly?
36. 36
Information Extraction
Language is ambiguous..
Can we still build named entity extractors that extract all
entities from unseen text correctly?
However, we can try to extract most of them correctly
using linguistic cues and background knowledge!
37. 37
NERC: Features
What can help to recognise and/or classify named entities?
• Words:
• Words in window before and after mention
• Sequences
• Bags of words
Summerfest 2014 took place at Miller Lite Oasis in Milwaukee on 25
June 2014.
w: Milwaukee w-1: in w-2: Oasis w+1: on w+2: 25
seq[-]: Oasis in seq[+]: on 25
bow: Milwaukee bow[-]: in bow[-]: Oasis bow[+]: on bow[+]: 25
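The window, sequence, and bag-of-words features above can be sketched as a small feature extractor over a tokenised sentence. This is a hypothetical helper, not the tutorial's own code; it reproduces the slide's Milwaukee example.

```python
# Sketch: extract word-window and sequence context features for token i.
def window_features(tokens, i, size=2):
    feats = {"w": tokens[i]}  # the mention token itself
    for offset in range(1, size + 1):
        if i - offset >= 0:
            feats[f"w-{offset}"] = tokens[i - offset]
        if i + offset < len(tokens):
            feats[f"w+{offset}"] = tokens[i + offset]
    # sequence features: left and right context as ordered strings
    feats["seq[-]"] = " ".join(tokens[max(0, i - size):i])
    feats["seq[+]"] = " ".join(tokens[i + 1:i + 1 + size])
    return feats

tokens = ("Summerfest 2014 took place at Miller Lite Oasis "
          "in Milwaukee on 25 June 2014 .").split()
print(window_features(tokens, tokens.index("Milwaukee")))
# w: Milwaukee, w-1: in, w-2: Oasis, w+1: on, w+2: 25, seq[-]: Oasis in, seq[+]: on 25
```

Bag-of-words features are the same context tokens with the order (and exact position) discarded.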
38. 38
NERC: Features
What can help to recognise and/or classify named entities?
• Morphology:
• Capitalisation: is upper case (China), all upper case (IBM), mixed case
(eBay)
• Symbols: contains $, £, €, roman symbols (IV), ..
• Contains period (google.com), apostrophe (Mandy’s), hyphen (speed-o-
meter), ampersand (Fisher & Sons)
• Stem or Lemma (cats -> cat), prefix (disadvantages -> dis),
suffix (cats -> s), interfix (speed-o-meter -> o)
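The capitalisation and symbol indicators listed above can be sketched as a small boolean feature function; the feature names are illustrative, not a fixed scheme from the tutorial.

```python
# Sketch: morphological surface features for a single token.
def morph_features(token):
    return {
        "is_upper_initial": token[:1].isupper(),            # China
        "all_upper": token.isupper(),                       # IBM
        # mixed case: neither all upper, all lower, nor title case (eBay)
        "mixed_case": (not token.isupper() and not token.islower()
                       and not token.istitle()),
        "has_period": "." in token,                         # google.com
        "has_apostrophe": "'" in token or "\u2019" in token,  # Mandy's
        "has_hyphen": "-" in token,                         # speed-o-meter
        "has_ampersand": "&" in token,                      # Fisher & Sons
    }

for tok in ["China", "IBM", "eBay", "google.com", "speed-o-meter"]:
    print(tok, morph_features(tok))
```

Stems, lemmas, and affixes would come from a morphological analyser rather than string tests like these.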
39. 39
NERC: Features
What can help to recognise and/or classify named entities?
• POS (part of speech) tags
• Most named entities are nouns
• Prokofyev (2014)
4.1 Part-of-Speech Tags
Part-Of-Speech (POS) tags have often been considered as an important discriminative feature for term identification. Many works on key term identification apply either fixed or regular expression POS tag patterns to improve their effectiveness. Nonetheless, POS tags alone cannot produce high-quality results. As can be seen from the overall POS tag distribution graph extracted from one of our collections (see Figure 3), many of the most frequent tag patterns (e.g., JJ NN, tagging adjectives and nouns) are far from yielding perfect results.
[Figure 3: Top 6 most frequent part-of-speech tag patterns. Tables 1 and 2: contingency tables for punctuation appearing immediately before and after an n-gram; the second column of this excerpt was truncated in extraction.]
41. 41 Morphology: Penn Treebank POS tags
[Penn Treebank tag table: noun tags all start with N, verb tags all start with V, adjective tags all start with J.]
43. 43
NERC: Features
What can help to recognise and/or classify named entities?
• Gazetteers
• Retrieved from HTML lists or tables [1]
• Using regular expression patterns and search engines (e.g.
“Popular artists such as * ”)
• Retrieved from knowledge bases
[1] https://en.wikipedia.org/wiki/Billboard_200
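Gazetteer-based recognition boils down to looking sentence spans up in a list of known entity names, preferring the longest match. The sketch below assumes a tiny hand-made gazetteer; the entries and type labels are illustrative, echoing the deck's music examples.

```python
# Sketch: longest-match gazetteer lookup over a token sequence.
GAZETTEER = {
    "arctic monkeys": "mo:MusicArtist",
    "miller lite oasis": "geo:SpatialThing",
    "milwaukee": "geo:SpatialThing",
}

def gazetteer_match(tokens, max_len=4):
    matches = []
    i = 0
    while i < len(tokens):
        # prefer the longest gazetteer entry starting at position i
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in GAZETTEER:
                matches.append((" ".join(tokens[i:i + n]), GAZETTEER[candidate]))
                i += n
                break
        else:
            i += 1  # no entry starts here; move on
    return matches

print(gazetteer_match("The Arctic Monkeys played in Milwaukee".split()))
# → [('Arctic Monkeys', 'mo:MusicArtist'), ('Milwaukee', 'geo:SpatialThing')]
```

Pure lookup cannot disambiguate (Milwaukee the city vs. the company), which is why gazetteers are usually combined with the contextual features from the previous slides.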
45. 45
NERC: Training Models
• Unfortunately, there isn’t enough time to explain machine learning
algorithms in detail
• CRFs (conditional random fields) are one of the most widely used
algorithms for NERC
• Graphical models, view NERC as a sequence labelling task
• Named entities consist of a beginning token (B), inside tokens (I),
and outside tokens (O)
took(O) place(O) at(O) Miller(B-LOC) Lite(I-LOC) Oasis(I-LOC) in(O)
• For now, we will use rule- and gazetteer-based NERC
• It is fairly easy to write manual extraction rules for NEs, which can
achieve a high performance when combined with gazetteers
• This can be done with the GATE software (General Architecture for
Text Engineering) and JAPE rules
-> Hands-on session
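The B/I/O scheme above can be made concrete with a small decoder that turns per-token labels back into typed entity spans, the step a CRF tagger's output needs before populating a knowledge base. This is a sketch, not GATE or CRFSuite code; the function name is illustrative.

```python
# Sketch: decode B/I/O tags into (entity text, type) spans.
def bio_to_spans(tokens, tags):
    spans, current, ctype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity begins
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)         # entity continues
        else:                             # "O", or a stray I- without a B-
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:                           # flush an entity ending the sentence
        spans.append((" ".join(current), ctype))
    return spans

tokens = "took place at Miller Lite Oasis in".split()
tags = ["O", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O"]
print(bio_to_spans(tokens, tags))  # → [('Miller Lite Oasis', 'LOC')]
```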
46. 46
NLP & ML Software
Natural Language Processing:
- GATE (general purpose architecture, includes other NLP and ML
software as plugins)
- Stanford NLP (Java)
- OpenNLP (Java)
- NLTK (Python)
Machine Learning:
- scikit-learn (Python, rich documentation, highly recommended!)
- Mallet (Java)
- WEKA (Java)
- Alchemy (graphical models, Java)
- FACTORIE, wolfe (graphical models, Scala)
- CRFSuite (efficient implementation of CRFs, Python)
47. 47
NLP & ML Software
Ready to use NERC software:
- ANNIE (rule-based, part of GATE)
- Wikifier (based on Wikipedia)
- FIGER (based on Wikipedia, fine-grained Freebase NE classes)
Almost ready to use NERC software:
- CRFSuite (already includes Python implementation for feature extraction,
you just need to feed it with training data, which you can also download)
Ready to use RE software:
- ReVerb (Open IE, extracts patterns for any kind of relation)
- MultiR (Distant supervision, relation extractor trained on Freebase)
Web Content Extraction software:
- Boilerpipe (extract main text content from Web pages)
- Jsoup (traverse elements of Web pages individually; also allows extracting text)
48. 48
Application: Opinion Mining
• Extracting opinions or sentiments in text
• It’s about finding out what people think
49. 49
Application: Opinion Mining
• Opinion Mining is big business
• Someone just bought an album by a
music artist
• Writes a review about it
• Someone else wants to buy an album
• Looks up reviews by fans and music
critics
• Music artist and music producer
• Get feedback from fans
• Improve their product
• Improve their marketing strategy
50. 50
Application: Opinion Mining
• “Miley Cyrus's attempts to shock would be
more effective if she had songs to back up
the posturing.”
– The Guardian
• “Bangerz is an Amazing album with great
lyrics and we can see the Miley Cyrus'
musical evolution. Would love to buy it and I
already did. ALBUM OF THE YEAR. Peace”
– Rodolfoalmeida3
51. 51
Application: Opinion Mining
Why is opinion mining and sentiment analysis challenging?
• Relatively easy to find sentiment words in sentences, difficult to identify
which topic they are about
52. 52
Application: Opinion Mining
Why is opinion mining and sentiment analysis challenging?
• Relatively easy to find sentiment words in sentences, difficult to identify
which topic they are about
• “The album comes with a free bonus CD but I don't like the cover art much.”
Does this refer to the cover art of the bonus CD or the album?
54. 54
Application: Opinion Mining
Why is opinion mining and sentiment analysis challenging?
• Relatively easy to find sentiment words in sentences, difficult to identify
which topic they are about
• Whitney Houston was quite unpopular…
55. 55
Application: Opinion Mining
Why is opinion mining and sentiment analysis challenging?
• Relatively easy to find sentiment words in sentences, difficult to identify
which topic they are about
• Whitney Houston was quite unpopular… or was she?
• Death confuses opinion mining tools
56. 56
Application: Opinion Mining
Why is opinion mining and sentiment analysis challenging?
• It’s not just about finding sentiment words, context is important too
• “It's a great movie if you have the taste and sensibilities of a 5-
year-old boy.”
• “It's terrible Candidate X did so well in the debate last night.”
• “I'd have liked the track a lot more if it had been a bit shorter.”
• If sentiment words are neutral, negative or positive depends on domain
• “a long track” vs “a long walk” vs “a long battery life”
57. 57
Application: Opinion Mining
Why is opinion mining and sentiment analysis challenging?
• How much should every single opinion be worth?
• experts vs non-experts
• relationship trust
• reputation trust
• spammers
• frequent vs infrequent posters
• “experts” in one area may not be expert in another
• how frequently do other people agree?
58. 58
Application: Opinion Mining
Subtopics
• Opinion extraction: extract the piece of text which represents the
opinion
• Cyrus has made a 23-song, purposely strange psych-rock record. Make no
mistake, some of this album is unlistenable. But Cyrus is also too skilled of
an artist to not place some beauty inside this madness, and Miley Cyrus and
Her Dead Petz swerves into thoughtful territory when it’s least expected.
• Sentiment classification/orientation: extract the polarity of the
opinion (e.g. positive, negative, neutral, or classify on a numerical
scale)
• negative: purposely strange, some is unlistenable
• positive: skilled artist, beauty inside madness, thoughtful
• Opinion summarisation: summarise the overall opinion about
something
• Strange, some unlistenable: negative; skilled artist, beauty, thoughtful: positive. Overall 6/10
59. 59
Application: Opinion Mining
Subtopics
• Feature-opinion association: given a text with target features and
opinions extracted, decide which opinions comment on which features.
• “The tracks are good but not so keen on the cover art”
• Target identification: which thing is the opinion referring to?
• Source identification: who is holding the opinion?
60. 60
Application: Opinion Mining
Opinion Mining Resources
Bing Liu’s English Sentiment Lexicon
• 2,006 positive words, 4,783 negative words
• Useful properties: includes misspellings, morphological variants, slang
• Available from: http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
The MPQA Subjectivity Lexicon
• Polarities: positive, negative, both, neutral
• Subjectivity: strongsubj or weaksubj
• Download from: http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/
61. 61
Application: Opinion Mining
Opinion Mining Resources
WordNet Affect
• Extension of WordNet with affect words
• Useful properties: includes POS categories
• Available from: http://wndomains.fbk.eu/wnaffect.html
Hands-on session: Applying standard opinion mining lexicons with GATE
• Spoiler: general-purpose lexicons do not always perform well; for better performance, domain- or context-specific lexicons are necessary
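The lexicon-based baseline that the hands-on session applies with GATE can be sketched as simple word counting against positive and negative word lists. The tiny lexicon below is illustrative, not Bing Liu's or MPQA's actual entries.

```python
# Sketch: lexicon-based sentiment scoring (illustrative word lists).
POSITIVE = {"great", "amazing", "love", "skilled", "beauty"}
NEGATIVE = {"terrible", "unlistenable", "strange", "bad"}

def lexicon_sentiment(text):
    words = [w.strip(".,!?'\"").lower() for w in text.split()]
    # net count of positive minus negative lexicon hits
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(lexicon_sentiment("Bangerz is an Amazing album with great lyrics"))
# → positive
```

As the preceding slides warn, this ignores the target of the opinion, negation, sarcasm, and domain ("a long track" vs. "a long battery life"), which is exactly why general-purpose lexicons disappoint in practice.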
62. 62 Information Extraction with Linked Data
Thank you for your attention!
(And thank you to Diana Maynard for allowing me to adapt and reuse her Opinion Mining slides!)
Questions?
Isabelle Augenstein