The document discusses argument extraction from news, blogs, and social media. It describes argument extraction as identifying claims and premises from text. This is a difficult task, even for humans. Existing methods for argument extraction are reviewed, including rule-based and machine learning approaches. A new two-step method is proposed using supervised classification and conditional random fields to first identify argumentative sentences and then extract claims and premises. Evaluation results show the proposed approach outperforms a simple baseline method.
Argument extraction from news, blogs and social media (Shubhangi Tandon)
This presentation explains the pioneering Argument Extraction paper by Theodosis Goudas, Christos Louizos, Georgios Petasis, and Vangelis Karkaletsis on Argument Extraction from News, Blogs, and Social Media.
(Published by Springer International Publishing, 2014.)
Argument extraction from news, blog and social media.
1. Argument Extraction from News, Blogs, and Social Media.
Theodosis Goudas, Christos Louizos, Georgios Petasis, Vangelis Karkaletsis.
Presented by: Sharath T.S and Shubhangi Tandon
2. What is Argument Extraction?
An argument can usually be decomposed into a claim and one or more premises justifying it.
Argument extraction is the task of identifying arguments, along with their components, in text.
It is difficult even for humans to decide whether a part of a sentence contains an argument element or not.
3. Why social media?
● The most widely used and accessible platform for seeking advice or expressing opinions.
● A storehouse of both meaningful and meaningless information about recent trends and topics.
● Almost no prior research in this field; only one publication, related to product reviews on Amazon.
4. Why is this difficult?
● Almost no prior research in this field; only one publication related to product reviews on Amazon!
● Text from social media may not always contain arguments.
● Texts are expressed in an informal form and do not follow any formal guidelines or specific rules.
● Absence of widely used corpora on which to comparably evaluate approaches for argument extraction.
● Traditional research in the area concentrates mainly on law documents and scientific publications.
5. Existing methods
● Palau et al. [4,7]
○ Classification at the sentence level to identify possible argumentative sentences, using Naive Bayes, SVM, and maximum entropy.
○ Identify groups of sentences that refer to the same argument, using a semantic distance based on the relatedness of the words they contain.
○ Detect clauses of sentences through a parsing tool; these are classified as argumentative or not with a maximum entropy classifier.
○ Argumentative clauses are then classified into premises and claims with support vector machines.
○ Evaluated on the Araucaria corpus and the ECHR corpus [11], achieving accuracies of 73% and 80% respectively.
● A rule-based system: takes as input an argumentation scheme and an ontology concerning an object, for example a camera and its characteristic features. Argumentation schemes are populated with discourse indicators and domain-specific features, and rules are constructed.
6. Proposed Method
The proposed automatic argument extraction is a two-step process:
Step A: Identification of argumentative sentences (supervised classification using standard classifiers: Logistic Regression, Random Forest, Support Vector Machines, Naive Bayes).
Step B: Extraction of claims and premises from the output of Step A, using Conditional Random Fields.
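To make the two steps concrete, here is a minimal, hypothetical sketch of how such a pipeline could be wired together with scikit-learn and sklearn-crfsuite; the feature vectors, tokens, and labels below are toy placeholders, not the paper's actual feature set or data.

```python
# Toy sketch of the two-step pipeline (not the authors' code):
# Step A classifies sentences as argumentative or not with a standard classifier,
# Step B labels the tokens of accepted sentences with a CRF.
from sklearn.linear_model import LogisticRegression
import sklearn_crfsuite

# --- Step A: sentence-level classification on hand-crafted feature vectors ---
# Each row is a made-up feature vector, e.g. [relative position, comma count, word count].
X_sentences = [[0.1, 2, 25], [0.5, 0, 8], [0.9, 3, 30]]
y_sentences = [1, 0, 1]                      # 1 = argumentative, 0 = not
step_a = LogisticRegression().fit(X_sentences, y_sentences)

# --- Step B: token-level BIO labelling of the sentences Step A accepted ---
def token_features(tokens, i):
    # Minimal per-token features: the word itself plus a little local context.
    feats = {"word": tokens[i].lower(), "is_title": tokens[i].istitle()}
    if i > 0:
        feats["prev_word"] = tokens[i - 1].lower()
    return feats

train_tokens = [["Wind", "turbines", "generate", "noise"]]
train_labels = [["B", "I", "I", "I"]]        # toy premise annotation
X_crf = [[token_features(s, i) for i in range(len(s))] for s in train_tokens]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_crf, train_labels)

# Apply both steps to a new sentence (feature vector and tokens kept in sync by hand here).
new_vec, new_tokens = [0.3, 1, 7], ["Solar", "panels", "are", "expensive"]
if step_a.predict([new_vec])[0] == 1:
    print(crf.predict([[token_features(new_tokens, i) for i in range(len(new_tokens))]]))
```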
7. Feature Selection for Corpus
State-of-the-art features:
Position, comma token number, connective number, verb number, word number, cue words, number of verbs in passive voice, domain entities number, adverb number, word mean length.
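As an illustration (not the authors' implementation), a few of these surface features could be computed for a single sentence roughly as follows; the cue-word and connective lists are small samples, and POS-based counts (verbs, adverbs, passive voice) are omitted because they require a tagger.

```python
# Illustrative computation of a few surface features for one sentence.
CUE_WORDS = {"therefore", "hence", "thus", "moreover", "furthermore", "finally"}
CONNECTIVES = {"and", "but", "because", "although", "however"}

def surface_features(sentence, index_in_doc, doc_length):
    tokens = sentence.lower().replace(",", " , ").split()
    words = [t for t in tokens if t != ","]
    return {
        "position": index_in_doc / doc_length,       # relative position in the text
        "comma_count": tokens.count(","),
        "word_count": len(words),
        "word_mean_length": sum(len(w) for w in words) / max(1, len(words)),
        "cue_word_count": sum(w in CUE_WORDS for w in words),
        "connective_count": sum(w in CONNECTIVES for w in words),
    }

print(surface_features("Therefore, wind turbines generate noise and lower property values", 3, 10))
```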
8. Feature Selection for Corpus (contd.)
New domain-specific features:
Adjective number: the number of adjectives in a sentence. Usually in argumentation, opinions are expressed towards an entity/claim through adjectives.
Entities in previous sentences: the number of entities in the nth previous sentence. With a history of n = 5 sentences we obtain five features. Correlates with the probability that the current sentence contains an argument element.
Cumulative number of entities in previous sentences: the total number of entities from the previous n sentences. Considering a history of n = 5 we obtain four features.
Ratio of distributions: two language models are created, one from sentences that contain argument elements and one from sentences that do not. The ratio between these two distributions, based on unigrams, bigrams and trigrams of words, can be described as P(X | sentence contains an argument element) / P(X | sentence does not contain an argument element), where X ∈ {unigrams, bigrams, trigrams}.
Distributions over unigrams, bigrams, trigrams of part-of-speech (POS) tags: identical to [4], except that the unigrams, bigrams and trigrams are extracted from the part-of-speech tags instead of the words.
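A rough sketch of the ratio-of-distributions idea, using add-one-smoothed unigram models only (the paper also uses bigrams and trigrams); the training sentences here are invented.

```python
# Two n-gram language models, one from argumentative sentences and one from
# non-argumentative sentences; the feature is the ratio of the probabilities
# they assign to a new sentence (unigrams with add-one smoothing, for illustration).
from collections import Counter

def unigram_model(sentences):
    counts = Counter(w for s in sentences for w in s.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1
    return lambda w: (counts[w] + 1) / (total + vocab)   # smoothed P(w)

arg_lm = unigram_model(["wind turbines generate noise", "therefore costs are high"])
non_arg_lm = unigram_model(["the meeting is on monday", "see the attached report"])

def ratio_feature(sentence):
    p_arg = p_non = 1.0
    for w in sentence.lower().split():
        p_arg *= arg_lm(w)
        p_non *= non_arg_lm(w)
    return p_arg / p_non          # > 1 suggests the argumentative model fits better

print(ratio_feature("turbines generate noise"))
```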
9. Step B: Argument Extraction with CRF
Why Conditional Random Fields?
A structured prediction algorithm.
Can take local context into consideration (helps preserve linguistic aspects such as the word ordering in the sentence).
Features:
The words in these sentences.
Gazetteer lists of known entities for the thematic domain related to the arguments we want to extract.
Gazetteer lists of cue words and indicator phrases.
Lexica of verbs and adjectives automatically acquired using Term Frequency - Inverse Document Frequency (TF-IDF) between two "documents" (with and without argumentative text, from Step A).
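The TF-IDF-based lexicon acquisition could look roughly like this sketch, which treats all argumentative text as one pseudo-document and everything else as another (assuming scikit-learn's TfidfVectorizer; the POS filtering that keeps only verbs and adjectives is left out, and the texts are invented).

```python
# Acquire a small lexicon from the terms that score highest by TF-IDF in the
# argumentative pseudo-document relative to the non-argumentative one.
from sklearn.feature_extraction.text import TfidfVectorizer

argumentative_doc = "turbines generate noise and ruin the landscape therefore costs increase"
other_doc = "the meeting starts at nine please find the attached schedule"

vec = TfidfVectorizer()
tfidf = vec.fit_transform([argumentative_doc, other_doc])
terms = vec.get_feature_names_out()
arg_scores = tfidf.toarray()[0]
# Keep the top-scoring terms of the argumentative pseudo-document.
lexicon = [terms[i] for i in arg_scores.argsort()[::-1][:5]]
print(lexicon)
```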
10. Corpus Preparation
● 204 documents (in Greek) collected from social media.
● Thematic domain of Renewable Energy Sources.
● Selected documents were manually annotated with domain entities and text segments that correspond to argument premises.
● Claims are not represented in the documents as segments, but are implied by the author as positive or negative views.
(Pipeline diagram: 16,000 sentences from the 204 documents are fed to Step A, which yields 760 sentences annotated as containing arguments; Step B then produces the final output. Annotation was performed with Ellogon.)
11. Evaluation: Base Case
Simple base-case classifier:
1. Manually annotated segments (argument components) are used to form a gazetteer.
2. The gazetteer is applied to the corpus in order to detect all exact matches of these segments.
a. All segments identified are marked as argumentative segments.
b. All sentences that contain at least one argumentative segment identified by the gazetteer are characterised as argumentative sentences.
3. Argumentative segments/sentences are compared to their "gold" counterparts, manually annotated by humans.
a. Sentences that contain these recognised fragments are marked as argumentative for the first-step base case.
b. Segments marked as argumentative are evaluated for the second-step base case.
4. Results are obtained through 10-fold cross-validation on the whole corpus (all 16k sentences).
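A minimal sketch of this gazetteer baseline; the segment list and sentences are invented for illustration.

```python
# Every exact match of an annotated argument segment marks that span as
# argumentative, and any sentence containing at least one matched span
# counts as an argumentative sentence.
gazetteer = ["generate noise", "ruin the landscape"]   # hypothetical annotated segments

def baseline(sentences):
    predictions = []
    for sent in sentences:
        matches = [seg for seg in gazetteer if seg in sent.lower()]
        predictions.append({"sentence": sent,
                            "argumentative": bool(matches),
                            "segments": matches})
    return predictions

for p in baseline(["Wind turbines generate noise in the summer.",
                   "The meeting is on Monday."]):
    print(p)
```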
12. Evaluation: Step A
Each sentence is represented as a fixed-size vector using the features described (including the class label, for supervised learning).
Tested with classifiers such as Support Vector Machines, Naive Bayes, Random Forest and Logistic Regression.
The initial data set is heavily skewed towards non-argumentative sentences. Therefore, data sampling and testing were done in two different ways:
Precision, recall, F1-measure and accuracy are used for evaluation.
Logistic Regression and Naive Bayes performed the best.
Way #1
Sampling: randomly ignore negative examples, so the resulting set contains an equal number of instances from both classes.
Evaluation: 10-fold cross-validation, achieving high accuracy.
Way #2
Sampling: split the initial data set in the ratio 70:30 for testing and training.
Evaluation: achieved 49% accuracy; discarded.
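The first sampling strategy could be sketched as follows, with made-up feature vectors standing in for the real ones: undersample the majority class until the classes are balanced, then score with 10-fold cross-validation using scikit-learn.

```python
# Balance a skewed data set by undersampling negatives, then run 10-fold CV.
import random
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy skewed data: few argumentative (1) sentences, many non-argumentative (0).
X = [[i % 7, i % 3, i % 11] for i in range(200)]
y = [1 if i < 20 else 0 for i in range(200)]

pos = [i for i, label in enumerate(y) if label == 1]
neg = [i for i, label in enumerate(y) if label == 0]
random.seed(0)
keep = pos + random.sample(neg, len(pos))      # undersample the majority class
X_bal = [X[i] for i in keep]
y_bal = [y[i] for i in keep]

scores = cross_val_score(LogisticRegression(), X_bal, y_bal, cv=10)
print(scores.mean())
```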
13. Evaluation: Step B
To use the CRF, the sentences need BIO tagging:
B for the token starting a text segment (premise),
I for a token in a premise other than the first, and
O for all other tokens (outside the premise segment).
Example for "Wind turbines generate noise in the summer":
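One plausible BIO tagging for this example, assuming the annotated premise span is "Wind turbines generate noise"; the slide's own annotation is shown as an image and is not reproduced here.

```python
# Hypothetical BIO tags for the example sentence; the gold annotation may differ.
tokens = ["Wind", "turbines", "generate", "noise", "in", "the", "summer"]
tags   = ["B",    "I",        "I",        "I",     "O",  "O",   "O"]
for token, tag in zip(tokens, tags):
    print(f"{tag}\t{token}")
```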
(Tables in the slides: final result after the CRF, and baseline results.)
19. Features continued
○ Verb number: the number of verbs inside a sentence, which indicates the number of periods inside a sentence.
○ Number of verbs in passive voice.
○ Cue words: this feature indicates the existence and the number of cue words (also known as discourse indicators). Cue words are identified through a predefined, manually constructed lexicon. The cue words in the lexicon are structural words which indicate the connection between periods or subordinate clauses. Examples: anyway, by the way, furthermore, first, second, then, now, thus, moreover, therefore, hence, lastly, finally, in summary, and on the other hand.
○ Domain entities number: this feature indicates the existence and the number of entity mentions of named entities relevant to our domain, in the context of a sentence.
○ Adverb number: this feature indicates the number of adverbs in the context of a sentence.
○ Word number: the number of words in the context of a sentence. This feature is based on the
20. Proposed method
● Identification of argumentative sentences is a supervised classification task. Logistic Regression, Random Forest, Support Vector Machines and Naive Bayes are the classifiers used.
● The sentences classified as argumentative are then classified further into premises and claims using a CRF.
● Generic features used:
○ Position: this feature indicates the position of the sentence inside the text.
○ Comma token number: the number of commas inside a sentence. This feature represents the number of subordinate clauses inside a sentence, based on the idea that sentences containing argument elements may have a large number of clauses.
○ Connective number: the number of connectives in the sentence, as connectives usually connect subordinate clauses. This feature is also selected based on the hypothesis that sentences containing argument elements may have a large number of clauses.
21. Additional features to aid argument detection in social media
● Adjective number: the number of adjectives in a sentence may characterise a sentence as argumentative or not. We considered the fact that usually in argumentation opinions are expressed towards an entity/claim, and these opinions are usually expressed through adjectives.
● Entities in previous sentences: this feature represents the number of entities in the nth previous sentence. Considering a history of n = 5 sentences, we obtain five features, with each one containing the number of entities in the respective sentence. These features correlate with the probability that the current sentence contains an argument element.
● Cumulative number of entities in previous sentences: this feature contains the total number of entities from the previous n sentences. Considering a history of n = 5 we obtain four features, with each one containing the cumulative number of entities from all the previous sentences.
● Ratio of distributions: we created a language model from sentences that contain argument elements and one from sentences that do not contain an argument element. The ratio between these two distributions was used as a feature. We have created three ratios of language models, based on unigrams, bigrams and trigrams of words. The ratio can be described as P(X | sentence contains an argument element) / P(X | sentence does not contain an argument element), where X ∈ {unigrams, bigrams, trigrams}.
22. Related Work
Argumentation is a branch of philosophy that studies the act or process of forming reasons and of drawing conclusions in the context of a discussion, dialogue, or conversation.
The reasons are called premises and the conclusion is called the claim.
23. Existing methods
● Palau et al. [4,7]
○ Classification at the sentence level by trying to identify possible argumentative sentences, using NB, SVM, and maximum entropy.
○ Identify groups of sentences that refer to the same argument, using a semantic distance based on the relatedness of the words they contain.
○ Detect clauses of sentences through a parsing tool; these are classified as argumentative or not with a maximum entropy classifier.
○ Argumentative clauses are then classified into premises and claims through support vector machines.
○ Evaluated on the Araucaria corpus and the ECHR corpus [11], achieving accuracies of 73% and 80%.
● A rule-based system: the system is given as input an argumentation scheme and an ontology concerning an object, for example a camera and its characteristic features. Argumentation schemes are populated and, along with discourse indicators and other domain-specific features, the rules are constructed. An interesting aspect of this work is the fact that they applied
25. Existing methods - Continued
● Lawrence et al., 2014: proposed a machine learning approach to extract propositions from philosophical text, with a topic model to determine argument structure, without considering whether a piece of text is part of an argument. Hence, the machine learning algorithm was used to define the boundaries and afterwards classify each word as the beginning or end of a proposition.
● In another approach, the first step was to identify sentences containing context-dependent claims (CDCs) in each article. Afterwards a classifier was used to find the exact boundaries of the detected CDCs. As a final step, each CDC was ranked in order to isolate the CDCs most relevant to the corresponding topic. That said, the goal is to automatically pinpoint CDCs within topic-related documents.
Editor's Notes
The corpus was constructed by manually filtering a larger corpus, automatically collected by performing queries on popular search engines (such as Bing), Google Plus, Twitter, and by crawling sites from a list of sources relevant to the domain of renewable energy.