Argument Extraction from News, Blogs, and Social Media
Theodosis Goudas, Christos Louizos, Georgios Petasis, Vangelis Karkaletsis.
Presented by :
Sharath T.S
Shubhangi Tandon
What is Argument Extraction?
An argument can usually be decomposed into a claim and one or more premises justifying it.
Argument extraction is the task of identifying arguments, along with their components, in text.
Even humans find it difficult to decide whether a part of a sentence contains an argument element.
Why social media?
● The most widely used and accessible platform for seeking advice or expressing opinions
● A storehouse of both meaningful and meaningless information about recent trends and topics
● Almost no prior research in this field; only one publication related to product reviews on Amazon.
Why is this difficult ?
● Almost no prior research in this field; only one publication related to product reviews on Amazon!
● Text from social media may not always contain arguments
● Arguments are expressed informally and do not follow any formal guidelines or specific rules
● No widely used corpora exist against which approaches to argument extraction can be comparably evaluated
● Traditional research in the area concentrates mainly on legal documents and scientific publications
Existing methods
● Palau et al. [4,7]
○ Classification at the sentence level by trying to identify possible argumentative sentences. Using NB, SVM, maximum
entropy.
○ Identify groups of sentences that refer to the same argument, using semantic distance based on the relatedness of
words contained
○ Detect clauses of sentences through a parsing tool, which are classified as argumentative or not with a maximum
entropy classifier
○ Argumentative clauses are classified into premises and claims through support vector machines
○ Evaluated on the Araucaria corpus and the ECHR corpus [11], achieving accuracies of 73% and 80%, respectively
● A rule-based system: takes as input an argumentation scheme and an ontology concerning an object, for example, a camera and its
characteristic features. The argumentation schemes are populated, and rules are constructed from them together with discourse
indicators and domain-specific features.
Proposed Method
The proposed automatic argument extraction is a two-step process:
Step A: Identification of argumentative sentences (supervised classification using standard classifiers: Logistic Regression, Random Forest, Support Vector Machines, Naive Bayes)
Step B: Extraction of claims and premises (from the output of Step A, using Conditional Random Fields)
Feature Selection for Corpus
State of the art features:
● Position
● Comma token number
● Connective number
● Verb number
● Word number
● Cue words
● Number of verbs in passive voice
● Domain entities number
● Adverb number
● Word mean length
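As a rough illustration of how some of the count-based features above could be computed, here is a minimal sketch. The cue-word set is a tiny hypothetical stand-in for the manually built lexicon, and the function and key names are assumptions. Features that need a part-of-speech tagger (verbs, adverbs, connectives) are left out.

```python
import re

# Hypothetical cue-word lexicon; the real one is manually constructed.
CUE_WORDS = {"therefore", "thus", "moreover", "furthermore", "hence", "finally"}

def surface_features(sentence: str, position: int) -> dict:
    """Compute a few count-based features for one sentence."""
    # Runs of letters only, so punctuation and digits are not counted as words.
    words = re.findall(r"[^\W\d_]+", sentence.lower())
    return {
        "position": position,  # index of the sentence inside the text
        "comma_number": sentence.count(","),
        "word_number": len(words),
        "word_mean_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "cue_word_number": sum(1 for w in words if w in CUE_WORDS),
    }

features = surface_features("Therefore, wind farms are noisy, but cheap.", position=3)
```

Each sentence then becomes one fixed-size vector by reading the dictionary values in a fixed key order.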
Feature Selection for Corpus (contd.)
New Domain Specific Features:
Adjective number: the number of adjectives in a sentence. In argumentation, opinions towards an entity or claim are usually expressed
through adjectives.
Entities in previous sentences: the number of entities in the nth previous sentence. With a history of n = 5 sentences this yields five
features, which correlate with the probability that the current sentence contains an argument element.
Cumulative number of entities in previous sentences: the total number of entities from the previous n sentences. With a history
of n = 5 this yields four features.
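Ratio of distributions: two language models are created, one from sentences that contain argument elements and one from sentences that do
not. The ratio between these two distributions is used as a feature, computed separately for unigrams, bigrams and trigrams of
words. It can be described as:
P(X | sentence contains an argument element) / P(X | sentence does not contain an argument element), where X ∈ {unigrams, bigrams, trigrams}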
Distributions over unigrams, bigrams, trigrams of part of speech tags (POS tags): Identical to [4] with the exception that unigrams,
bigrams and trigrams are extracted from the part of speech tags instead of words.
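The two entity-history features can be sketched as follows. This is only one plausible reading: the slides do not say how the four cumulative features are formed, so the sketch assumes they are running totals over the previous 2, 3, 4 and 5 sentences. It also assumes missing sentences at the start of a document count as zero. All names are hypothetical.

```python
def entity_history_features(entity_counts, index, n=5):
    """entity_counts[i] is the number of entity mentions in sentence i."""
    # One feature per previous sentence (1st..nth back); 0 if there is no such sentence.
    previous = [
        entity_counts[index - k] if index - k >= 0 else 0
        for k in range(1, n + 1)
    ]
    # Assumed reading: running totals over the previous 2..n sentences (n - 1 features).
    cumulative = [sum(previous[:k]) for k in range(2, n + 1)]
    return previous, cumulative

counts = [2, 0, 1, 3, 0, 4, 1]
previous, cumulative = entity_history_features(counts, index=6)
```

With n = 5 this gives five per-sentence features and four cumulative ones, matching the feature counts stated above.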
Step B: Argument Extraction with CRF
Why Conditional Random Fields?
Structured prediction algorithm
Can take local context into consideration (helps preserve linguistic aspects such as the word order in the sentence)
Features:
The words in these sentences
Gazetteer lists of known entities for the thematic domain related to the arguments we want to extract,
Gazetteer lists of cue words and indicator phrases
Lexica of verbs and adjectives, automatically acquired using Term Frequency - Inverse Document Frequency (TF-IDF) between
two “documents” (one with and one without the argumentative text from Step A)
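One way the TF-IDF lexicon could be built is sketched below, assuming the tokens have already been filtered to verbs and adjectives. The smoothed IDF is an assumption and not taken from the slides: with only two documents, plain log(N/df) would score every shared term as zero. Function and variable names are hypothetical.

```python
import math
from collections import Counter

def tfidf_lexicon(argumentative_tokens, other_tokens, top_k=2):
    """Rank terms of the argumentative 'document' by TF-IDF over the two documents."""
    docs = [Counter(argumentative_tokens), Counter(other_tokens)]
    scores = {}
    for term, count in docs[0].items():
        tf = count / len(argumentative_tokens)
        df = sum(1 for doc in docs if term in doc)
        # Smoothed IDF (assumption): shared terms keep a small non-zero weight.
        idf = math.log((1 + len(docs)) / (1 + df)) + 1
        scores[term] = tf * idf
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

argumentative = ["noisy", "pollute", "noisy", "generate"]
other = ["generate", "install", "generate"]
lexicon = tfidf_lexicon(argumentative, other)
```

Terms that are frequent in the argumentative text but absent from the other text rise to the top, which is the behaviour one would want from such a lexicon.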
Corpus Preparation
● 204 documents (in Greek) collected from social media
● Thematic domain of Renewable Energy Sources
● Selected documents were manually annotated with domain entities and text segments that correspond to argument premises.
● Claims are not represented in the documents as segments; they are implied by the author as positive or negative views
● 16,000 sentences from 204 documents; 760 sentences annotated as containing arguments
[Figure: pipeline diagram with the labels Ellogon, Step A, Step B and Final Output]
Evaluation : Base Case
Simple base case classifier:
1. Manually annotated segments (argument components) used to form a gazetteer.
2. Applied on the corpus in order to detect all exact matches of all these segments.
a. All segments identified are marked as argumentative segments
b. All sentences that contain at least one argumentative segment identified by the gazetteer, are characterised as an
argumentative sentence.
3. Argumentative segments/sentences are compared to “gold” counterparts, manually annotated by humans.
a. Sentences that contain these recognized fragments are marked as argumentative for the first step base case.
b. Segments marked as argumentative are evaluated for the second step base case.
4. Results are taken through 10-fold cross validation on the whole corpus (all 16k sentences)
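The exact-match base case can be sketched in a few lines. The gazetteer entries and sentences below are invented English stand-ins (the corpus itself is Greek), and the function name is hypothetical.

```python
def gazetteer_baseline(sentences, gazetteer):
    """Return, for each sentence, the gazetteer segments it contains verbatim."""
    return [[seg for seg in gazetteer if seg in sentence] for sentence in sentences]

# Invented example data; in the deck the gazetteer comes from annotated segments.
gazetteer = ["generate noise", "reduce emissions"]
sentences = [
    "Wind turbines generate noise in the summer.",
    "The park opened last year.",
]
matches = gazetteer_baseline(sentences, gazetteer)
# A sentence is argumentative if it contains at least one matched segment.
is_argumentative = [bool(m) for m in matches]
```

Under cross validation, the gazetteer would presumably be built from the training folds only and applied to the held-out fold; the slides do not spell this out.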
Evaluation : Step A
Each sentence is represented as a fixed-size feature vector using the features described, together with its class label (supervised learning).
Classifiers tested: Support Vector Machines, Naive Bayes, Random Forest and Logistic Regression.
Precision, recall, F1-measure and accuracy are used for evaluation.
Logistic Regression and Naive Bayes performed best.
The initial data set is heavily skewed towards non-argumentative instances. Therefore, data sampling and testing were done in two
different ways:
Way #1
Sampling: Randomly ignore negative examples, so that the result set contains an equal number of instances from both classes.
Evaluation: 10-fold cross validation; achieved high accuracy.
Way #2
Sampling: Split the initial data set in the ratio 70:30 for testing and training.
Evaluation: Achieved 49% accuracy; discarded.
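The sampling in Way #1 amounts to random undersampling of the negative class. A minimal sketch follows, assuming labels are 1 for argumentative and 0 for non-argumentative; the fixed seed and the function name are choices made here for reproducibility, not something the slides specify.

```python
import random

def undersample(examples, labels, seed=0):
    """Randomly drop negative examples until both classes have equal size."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    kept_negatives = rng.sample(negatives, k=min(len(positives), len(negatives)))
    kept = sorted(positives + kept_negatives)  # keep the original sentence order
    return [examples[i] for i in kept], [labels[i] for i in kept]

# Toy data with the same kind of skew: few positives, many negatives.
labels = [1] * 3 + [0] * 20
examples = [f"sentence {i}" for i in range(len(labels))]
balanced_x, balanced_y = undersample(examples, labels)
```

Accuracy measured on such a balanced set is not comparable to accuracy on the original skewed data, which may be worth keeping in mind when reading the two results.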
Evaluation : Step B
To use a CRF, each sentence needs BIO tags:
B for starting a text segment (premise),
I for a token in a premise other than the first, and
O for all other tokens (outside of the premise segment)
Example for “Wind turbines generate noise in the summer”
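The tagging scheme can be illustrated with a short sketch. The premise span used here ("generate noise") is an assumption chosen purely for illustration; the annotated span on the original slide is not recoverable from this transcript.

```python
def bio_tags(tokens, premise_spans):
    """premise_spans: list of (start, end) token indices, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in premise_spans:
        tags[start] = "B"  # first token of the premise
        for i in range(start + 1, end):
            tags[i] = "I"  # remaining tokens of the premise
    return tags

tokens = "Wind turbines generate noise in the summer".split()
# Assumed premise span for illustration only: tokens 2 and 3 ("generate noise").
tags = bio_tags(tokens, [(2, 4)])
```

The CRF is then trained to predict one such tag per token, so a predicted premise is any B followed by its run of I tags.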
Final Result after CRF
Baseline Results
What did we think ?
Questions/ Observations/Inputs?
Appendix
Evaluation Results: Step A
Evaluation Results: Step A (contd)
Features continued
○ Verb number : is the number of the verbs inside a sentence, which indicates the number of periods
inside a sentence.
○ Number of verbs in passive voice
○ Cue words: this feature indicates the existence and the number of cue words (also known as discourse
indicators). Cue words are identified through a predefined, manually constructed lexicon. The cue
words in the lexicon are structural words that indicate the connection between periods or
subordinate clauses. Examples: anyway, by the way, furthermore, first, second, then, now, thus,
moreover, therefore, hence, lastly, finally, in summary, and on the other hand.
○ Domain entities number : this feature indicates the existence and the number of entity mentions of
named-entities relevant to our domain, in the context of a sentence.
○ Adverb number : this feature indicates the number of adverbs in the context of a sentence.
○ Word number : the number of words in the context of a sentence. This feature is based in the
Proposed method
● Identification of argumentative sentences is a supervised classification task. Logistic regression, Random
Forest, Support Vector Machines, Naive Bayes are the classifiers used.
● From the sentences classified as argumentative, classify them further into premises and claims using CRF.
● Generic features used -
○ Position: this feature indicates the position of the sentence inside the text.
○ Comma token number: the number of commas inside a sentence. This feature represents the number of
subordinate clauses inside a sentence, based on the idea that sentences containing argument elements
may have a large number of clauses.
○ Connective number : is the number of connectives in the sentence, as connectives usually connect
subordinate clauses. This feature is also selected based on the hypothesis that sentences containing
argument elements may have a large number of clauses.
Additional features to aid argument detection in social media
● Adjective number: the number of adjectives in a sentence may characterize a sentence as argumentative or
not. In argumentation, opinions towards an entity or claim are usually expressed through
adjectives.
● Entities in previous sentences: this feature represents the number of entities in the nth previous sentence.
Considering a history of n = 5 sentences, we obtain five features, with each one containing the number of
entities in the respective sentence. These features correlate with the probability that the current sentence
contains an argument element.
● Cumulative number of entities in previous sentences: this feature contains the total number of entities from
the previous n sentences. Considering a history of n = 5 we obtain four features, with each one containing
the cumulative number of entities from all the previous sentences.
● Ratio of distributions: we created one language model from sentences that contain argument elements and one
from sentences that do not contain an argument element. The ratio between these two distributions was
used as a feature. We created three ratios of language models, based on unigrams, bigrams and
trigrams of words. The ratio can be described as
P(X | sentence contains an argument element) / P(X | sentence does not contain an argument element), where X ∈ {unigrams, bigrams, trigrams}
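A minimal sketch of the ratio feature follows, under two assumptions not stated in the slides: add-one smoothing for unseen n-grams, and summing log-ratios over the n-grams of a sentence instead of multiplying raw ratios. All function names and the toy sentences are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train_counts(sentences, n):
    """Count n-grams over a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(ngrams(tokens, n))
    return counts

def log_ratio(tokens, arg_counts, non_counts, n, vocab_size):
    """Sum of log P(g | argumentative) - log P(g | non-argumentative), add-one smoothed."""
    arg_total = sum(arg_counts.values())
    non_total = sum(non_counts.values())
    score = 0.0
    for g in ngrams(tokens, n):
        p_arg = (arg_counts[g] + 1) / (arg_total + vocab_size)
        p_non = (non_counts[g] + 1) / (non_total + vocab_size)
        score += math.log(p_arg / p_non)
    return score

arg_sents = [["turbines", "are", "noisy"], ["solar", "is", "cheap"]]
non_sents = [["the", "park", "opened"], ["it", "is", "monday"]]
arg_counts = train_counts(arg_sents, 1)
non_counts = train_counts(non_sents, 1)
vocab_size = len(set(arg_counts) | set(non_counts))
score = log_ratio(["turbines", "are", "cheap"], arg_counts, non_counts, 1, vocab_size)
```

A positive score means the sentence looks more like the argumentative model. Calling the same functions with n = 2 and n = 3 would give the bigram and trigram ratios.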
Related Work
Argumentation is a branch of philosophy that studies the act or process of forming reasons and of drawing
conclusions in the context of a discussion, dialogue, or conversation.
The reasons are called premises and the conclusion is called the claim.
Existing methods
● Palau et al. [4,7]
○ Classification at the sentence level by trying to identify possible argumentative sentences. Using NB,
SVM, maximum entropy.
○ Identify groups of sentences that refer to the same argument, using semantic distance based on the
relatedness of words contained
○ Detect clauses of sentences through a parsing tool, which are classified as argumentative or not with a
maximum entropy classifier
○ Argumentative clauses are then classified into premises and claims through support vector machines
○ Evaluated on the Araucaria corpus and the ECHR corpus [11], achieving accuracies of 73% and 80%, respectively
● A rule based system - The system is given as input an argumentation scheme and an ontology concerning an object, for
example, a camera and its characteristic features. Argumentation schemes are populated and along with discourse indicators
and other domain specific features, the rules are constructed. An interesting aspect of this work is the fact that they applied
Existing methods - Continued
● Lawrence et al., 2014 - Proposed a machine learning approach to extract propositions from philosophical text,
with a topic model to determine argument structure, without considering whether a piece of text is part of an
argument. Hence, the machine learning algorithm was used in order to define the boundaries and afterwards
classify each word as the beginning or end of a proposition.
● The first step of their task was to identify sentences containing context dependent claims (CDCs) in each article.
Afterwards they used a classifier in order to find the exact boundaries of the CDCs detected. As a final step, they
ranked each CDC in order to isolate the CDCs most relevant to the corresponding topic. That said, their goal is to
automatically pinpoint CDCs within topic related documents.

More Related Content

What's hot

Lexical1
Lexical1Lexical1
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
Nik Spirin
 
A Role of Lexical Analyzer
A Role of Lexical AnalyzerA Role of Lexical Analyzer
A Role of Lexical Analyzer
Archana Gopinath
 
Syntax analyzer
Syntax analyzerSyntax analyzer
Syntax analyzer
ahmed51236
 
Relationship Among Token, Lexeme & Pattern
Relationship Among Token, Lexeme & PatternRelationship Among Token, Lexeme & Pattern
Relationship Among Token, Lexeme & Pattern
Bharat Rathore
 
Token, Pattern and Lexeme
Token, Pattern and LexemeToken, Pattern and Lexeme
Token, Pattern and Lexeme
A. S. M. Shafi
 
Mood classification of songs based on lyrics
Mood classification of songs based on lyricsMood classification of songs based on lyrics
Mood classification of songs based on lyrics
Francesco Cucari
 
Error Detection and Feedback with OT-LFG for Computer-assisted Language Learning
Error Detection and Feedback with OT-LFG for Computer-assisted Language LearningError Detection and Feedback with OT-LFG for Computer-assisted Language Learning
Error Detection and Feedback with OT-LFG for Computer-assisted Language Learning
CITE
 
Syntax analysis
Syntax analysisSyntax analysis
Compier Design_Unit I_SRM.ppt
Compier Design_Unit I_SRM.pptCompier Design_Unit I_SRM.ppt
Compier Design_Unit I_SRM.ppt
Apoorv Diwan
 
Chain indexing
Chain indexingChain indexing
Chain indexingsilambu111
 
Text Summarization
Text SummarizationText Summarization
Text Summarization
Prabhakar Bikkaneti
 
Shallow parser for hindi language with an input from a transliterator
Shallow parser for hindi language with an input from a transliteratorShallow parser for hindi language with an input from a transliterator
Shallow parser for hindi language with an input from a transliteratorShashank Shisodia
 
Syntax analysis
Syntax analysisSyntax analysis
Syntax analysis
Binsent Ribera
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information RetrievalDustin Smith
 
tokens patterns and lexemes
tokens patterns and lexemestokens patterns and lexemes
tokens patterns and lexemes
Saqib Javed
 
Arabic MT Project
Arabic MT ProjectArabic MT Project
Arabic MT Project
Hind Abdulkhaleq
 
Syntax Analysis in Compiler Design
Syntax Analysis in Compiler Design Syntax Analysis in Compiler Design
Syntax Analysis in Compiler Design
MAHASREEM
 

What's hot (20)

Lexical1
Lexical1Lexical1
Lexical1
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
 
A Role of Lexical Analyzer
A Role of Lexical AnalyzerA Role of Lexical Analyzer
A Role of Lexical Analyzer
 
Syntax analyzer
Syntax analyzerSyntax analyzer
Syntax analyzer
 
Relationship Among Token, Lexeme & Pattern
Relationship Among Token, Lexeme & PatternRelationship Among Token, Lexeme & Pattern
Relationship Among Token, Lexeme & Pattern
 
Token, Pattern and Lexeme
Token, Pattern and LexemeToken, Pattern and Lexeme
Token, Pattern and Lexeme
 
Mood classification of songs based on lyrics
Mood classification of songs based on lyricsMood classification of songs based on lyrics
Mood classification of songs based on lyrics
 
Error Detection and Feedback with OT-LFG for Computer-assisted Language Learning
Error Detection and Feedback with OT-LFG for Computer-assisted Language LearningError Detection and Feedback with OT-LFG for Computer-assisted Language Learning
Error Detection and Feedback with OT-LFG for Computer-assisted Language Learning
 
Syntax analysis
Syntax analysisSyntax analysis
Syntax analysis
 
Compier Design_Unit I_SRM.ppt
Compier Design_Unit I_SRM.pptCompier Design_Unit I_SRM.ppt
Compier Design_Unit I_SRM.ppt
 
Chain indexing
Chain indexingChain indexing
Chain indexing
 
Precis
PrecisPrecis
Precis
 
Text Summarization
Text SummarizationText Summarization
Text Summarization
 
Shallow parser for hindi language with an input from a transliterator
Shallow parser for hindi language with an input from a transliteratorShallow parser for hindi language with an input from a transliterator
Shallow parser for hindi language with an input from a transliterator
 
Syntax analysis
Syntax analysisSyntax analysis
Syntax analysis
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
 
tokens patterns and lexemes
tokens patterns and lexemestokens patterns and lexemes
tokens patterns and lexemes
 
Arabic MT Project
Arabic MT ProjectArabic MT Project
Arabic MT Project
 
POPSI
POPSIPOPSI
POPSI
 
Syntax Analysis in Compiler Design
Syntax Analysis in Compiler Design Syntax Analysis in Compiler Design
Syntax Analysis in Compiler Design
 

Similar to Argument extraction from news, blog and social media.

A combination of reduction and expansion approaches to handle with long natur...
A combination of reduction and expansion approaches to handle with long natur...A combination of reduction and expansion approaches to handle with long natur...
A combination of reduction and expansion approaches to handle with long natur...
Patrice Bellot - Aix-Marseille Université / CNRS (LIS, INS2I)
 
Cc35451454
Cc35451454Cc35451454
Cc35451454
IJERA Editor
 
Aspect mining and sentiment association
Aspect mining and sentiment associationAspect mining and sentiment association
Aspect mining and sentiment association
Koushik Ramachandra
 
VOC real world enterprise needs
VOC real world enterprise needsVOC real world enterprise needs
VOC real world enterprise needs
Ivan Berlocher
 
Aq35241246
Aq35241246Aq35241246
Aq35241246
IJERA Editor
 
Sentiment Analysis Using Hybrid Approach: A Survey
Sentiment Analysis Using Hybrid Approach: A SurveySentiment Analysis Using Hybrid Approach: A Survey
Sentiment Analysis Using Hybrid Approach: A Survey
IJERA Editor
 
AI: Logic in AI
AI: Logic in AIAI: Logic in AI
AI: Logic in AI
DataminingTools Inc
 
AI: Logic in AI
AI: Logic in AIAI: Logic in AI
AI: Logic in AI
Datamining Tools
 
G04124041046
G04124041046G04124041046
G04124041046
IOSR-JEN
 
Extraction Based automatic summarization
Extraction Based automatic summarizationExtraction Based automatic summarization
Extraction Based automatic summarization
Abdelaziz Al-Rihawi
 
Multiple Methods and Techniques in Analyzing Computer-Supported Collaborative...
Multiple Methods and Techniques in Analyzing Computer-Supported Collaborative...Multiple Methods and Techniques in Analyzing Computer-Supported Collaborative...
Multiple Methods and Techniques in Analyzing Computer-Supported Collaborative...
CITE
 
Doc format.
Doc format.Doc format.
Doc format.butest
 
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCEDETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
AbdurrahimDerric
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
Jarrar.lecture notes.aai.2011s.ontology part4_methodologies
Jarrar.lecture notes.aai.2011s.ontology part4_methodologiesJarrar.lecture notes.aai.2011s.ontology part4_methodologies
Jarrar.lecture notes.aai.2011s.ontology part4_methodologiesPalGov
 
Interface for Finding Close Matches from Translation Memory
Interface for Finding Close Matches from Translation MemoryInterface for Finding Close Matches from Translation Memory
Interface for Finding Close Matches from Translation Memory
Priyatham Bollimpalli
 
Ontology learning
Ontology learningOntology learning
Ontology learning
Ehsan Asgarian
 
A Gentle Introduction to Text Analysis I
A Gentle Introduction to Text Analysis IA Gentle Introduction to Text Analysis I
A Gentle Introduction to Text Analysis I
UNCResearchHub
 
Hc3612711275
Hc3612711275Hc3612711275
Hc3612711275
IJERA Editor
 
Reference Scope Identification of Citances Using Convolutional Neural Network
Reference Scope Identification of Citances Using Convolutional Neural NetworkReference Scope Identification of Citances Using Convolutional Neural Network
Reference Scope Identification of Citances Using Convolutional Neural Network
Saurav Jha
 

Similar to Argument extraction from news, blog and social media. (20)

A combination of reduction and expansion approaches to handle with long natur...
A combination of reduction and expansion approaches to handle with long natur...A combination of reduction and expansion approaches to handle with long natur...
A combination of reduction and expansion approaches to handle with long natur...
 
Cc35451454
Cc35451454Cc35451454
Cc35451454
 
Aspect mining and sentiment association
Aspect mining and sentiment associationAspect mining and sentiment association
Aspect mining and sentiment association
 
VOC real world enterprise needs
VOC real world enterprise needsVOC real world enterprise needs
VOC real world enterprise needs
 
Aq35241246
Aq35241246Aq35241246
Aq35241246
 
Sentiment Analysis Using Hybrid Approach: A Survey
Sentiment Analysis Using Hybrid Approach: A SurveySentiment Analysis Using Hybrid Approach: A Survey
Sentiment Analysis Using Hybrid Approach: A Survey
 
AI: Logic in AI
AI: Logic in AIAI: Logic in AI
AI: Logic in AI
 
AI: Logic in AI
AI: Logic in AIAI: Logic in AI
AI: Logic in AI
 
G04124041046
G04124041046G04124041046
G04124041046
 
Extraction Based automatic summarization
Extraction Based automatic summarizationExtraction Based automatic summarization
Extraction Based automatic summarization
 
Multiple Methods and Techniques in Analyzing Computer-Supported Collaborative...
Multiple Methods and Techniques in Analyzing Computer-Supported Collaborative...Multiple Methods and Techniques in Analyzing Computer-Supported Collaborative...
Multiple Methods and Techniques in Analyzing Computer-Supported Collaborative...
 
Doc format.
Doc format.Doc format.
Doc format.
 
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCEDETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Jarrar.lecture notes.aai.2011s.ontology part4_methodologies
Jarrar.lecture notes.aai.2011s.ontology part4_methodologiesJarrar.lecture notes.aai.2011s.ontology part4_methodologies
Jarrar.lecture notes.aai.2011s.ontology part4_methodologies
 
Interface for Finding Close Matches from Translation Memory
Interface for Finding Close Matches from Translation MemoryInterface for Finding Close Matches from Translation Memory
Interface for Finding Close Matches from Translation Memory
 
Ontology learning
Ontology learningOntology learning
Ontology learning
 
A Gentle Introduction to Text Analysis I
A Gentle Introduction to Text Analysis IA Gentle Introduction to Text Analysis I
A Gentle Introduction to Text Analysis I
 
Hc3612711275
Hc3612711275Hc3612711275
Hc3612711275
 
Reference Scope Identification of Citances Using Convolutional Neural Network
Reference Scope Identification of Citances Using Convolutional Neural NetworkReference Scope Identification of Citances Using Convolutional Neural Network
Reference Scope Identification of Citances Using Convolutional Neural Network
 

Recently uploaded

一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfUnleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Enterprise Wired
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
eddie19851
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
GetInData
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 

Recently uploaded (20)

一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfUnleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
 
Argument extraction from news, blog and social media.

  • 1. Argument Extraction from News, Blogs, and Social Media. Theodosis Goudas, Christos Louizos, Georgios Petasis, Vangelis Karkaletsis. Presented by: Sharath T.S, Shubhangi Tandon
  • 2. What is Argument Extraction? An argument can usually be decomposed into a claim and one or more premises justifying it. Argument extraction is the task of identifying arguments, along with their components, in text. Even humans find it difficult to decide whether a part of a sentence contains an argument element or not.
  • 3. Why social media? ● The most widely used and accessible platform available for seeking advice or expressing opinions ● A storehouse of both meaningful and meaningless information about recent trends and topics ● Almost no prior research in this field; only one publication, related to product reviews on Amazon.
  • 4. Why is this difficult? ● Almost no prior research in this field; only one publication, related to product reviews on Amazon ● Text from social media does not always contain arguments ● Arguments are expressed informally and do not follow any formal guidelines or specific rules ● No widely used corpora exist for comparably evaluating approaches to argument extraction ● Traditional research in the area concentrates mainly on law documents and scientific publications.
  • 5. Existing methods ● Palau et al. [4,7] ○ Classification at the sentence level to identify possible argumentative sentences, using NB, SVM and maximum entropy. ○ Identify groups of sentences that refer to the same argument, using semantic distance based on the relatedness of the words they contain ○ Detect clauses of sentences through a parsing tool; the clauses are classified as argumentative or not with a maximum entropy classifier ○ Argumentative clauses are classified into premises and claims through support vector machines ○ Evaluated on the Araucaria corpus and the ECHR corpus [11], achieving an accuracy of 73% and 80% ● A rule based system: the input is an argumentation scheme and an ontology concerning an object, for example, a camera and its characteristic features. Argumentation schemes are populated and, together with discourse indicators and domain specific features, used to construct rules.
  • 6. Proposed Method The proposed Automatic Argument Extraction is a two step process: ● Step A: Identification of Argumentative Sentences (supervised classification using standard classifiers: Logistic Regression, Random Forest, Support Vector Machines, Naive Bayes) ● Step B: Extraction of Claims and Premises (from the output of Step A, using Conditional Random Fields)
  • 7. Feature Selection for Corpus State of the art features: ● Position ● Comma Token Number ● Connective Number ● Verb Number ● Word Number ● Cue words ● # verbs in passive voice ● Domain Entities Number ● Adverb Number ● Word Mean Length
  • 8. Feature Selection for Corpus (contd.) New Domain Specific Features: ● Adjective number: Number of adjectives in a sentence. In argumentation, opinions towards an entity/claim are usually expressed through adjectives. ● Entities in previous sentences: Number of entities in the nth previous sentence. With a history of n = 5 sentences we obtain five features, which correlate with the probability that the current sentence contains an argument element. ● Cumulative number of entities in previous sentences: Total number of entities from the previous n sentences. Considering a history of n = 5 we obtain four features. ● Ratio of distributions: Two language models are created, one from sentences that contain argument elements and one from sentences that do not contain an argument element. The ratio between these two distributions is based on unigrams, bigrams and trigrams of words, and can be described as: P(X | sentence contains an argument element) / P(X | sentence does not contain an argument element), where X ∈ {unigrams, bigrams, trigrams}. ● Distributions over unigrams, bigrams, trigrams of part of speech tags (POS tags): Identical to [4] with the exception that unigrams, bigrams and trigrams are extracted from the part of speech tags instead of words.
  • 9. Step B: Argument Extraction with CRF Why Conditional Random Fields? ● Structured prediction algorithm ● Can take local context into consideration (helps maintain linguistic aspects such as the word ordering in the sentence) Features: ● The words in these sentences ● Gazetteer lists of known entities for the thematic domain related to the arguments we want to extract ● Gazetteer lists of cue words and indicator phrases ● Lexica of verbs and adjectives automatically acquired using Term Frequency - Inverse Document Frequency (TF-IDF) between two “documents” (with and without argumentative text from Step A)
  • 10. Corpus Preparation ● 204 documents (in Greek) collected from social media ● Thematic domain of Renewable Energy Sources ● Selected documents were manually annotated with domain entities and text segments that correspond to argument premises. ● Claims are not represented in the documents as segments, but implied by the author as positive or negative views ● 16000 sentences from 204 documents ● 760 sentences annotated as containing arguments [Diagram labels: Ellogon, Step A, Step B, Final Output]
  • 11. Evaluation : Base Case Simple base case classifier: 1. Manually annotated segments (argument components) are used to form a gazetteer. 2. The gazetteer is applied on the corpus in order to detect all exact matches of these segments. a. All segments identified are marked as argumentative segments. b. All sentences that contain at least one argumentative segment identified by the gazetteer are characterised as argumentative sentences. 3. Argumentative segments/sentences are compared to their “gold” counterparts, manually annotated by humans. a. Sentences that contain these recognized fragments are marked as argumentative for the first step base case. b. Segments marked as argumentative are evaluated for the second step base case. 4. Results are obtained through 10-fold cross validation on the whole corpus (all 16k sentences).
  • 12. Evaluation : Step A Each sentence is represented as a fixed-size vector using the features described (including the class, for supervised learning). Tested against classifiers such as: Support Vector Machines, Naive Bayes, Random Forest and Logistic Regression. The initial data set is heavily skewed towards non-argumentative documents. Therefore, data sampling and testing was done in two different ways. Precision, Recall, F-1 Measure and Accuracy are used for evaluation. Logistic Regression and Naive Bayes performed the best. ● Way #1: Sampling: randomly ignore negative examples, so that the result set contains an equal number of instances from both classes. Evaluation: 10-fold cross validation, achieved high accuracy. ● Way #2: Sampling: split the initial data set in the ratio 70:30 for testing and training. Evaluation: achieved 49% accuracy, discarded.
  • 13. Evaluation : Step B To use CRF, sentences need BIO tagging: B for the token starting a text segment (premise), I for a token in a premise other than the first, and O for all other tokens (outside of the premise segment). [Figure: BIO tagging example for “Wind turbines generate noise in the summer”] [Tables: Final Result after CRF; Baseline Results]
  • 14. What did we think ?
  • 18. Evaluation Results : Step A (contd.)
  • 19. Features continued ○ Verb number : is the number of the verbs inside a sentence, which indicates the number of periods inside a sentence. ○ Number of verbs in passive voice ○ Cue words: this feature indicates the existence and the number of cue words(also known as discourse indicators). Cue words are identified through a predefined, manually constructed, lexicon. The cue words in the lexicon are structural words which indicate the connection between periods or subordinate clauses. Examples - anyway, by the way, furthermore, first, second, then, now, thus, moreover, therefore, hence, lastly, finally, in summary, and on the other hand. ○ Domain entities number : this feature indicates the existence and the number of entity mentions of named-entities relevant to our domain, in the context of a sentence. ○ Adverb number : this feature indicates the number of adverbs in the context of a sentence. ○ Word number : the number of words in the context of a sentence. This feature is based in the
  • 20. Proposed method ● Identification of argumentative sentences is a supervised classification task. Logistic regression, Random Forest, Support Vector Machines, Naive Bayes are the classifiers used. ● From the sentences classified as argumentative, classify them further into premises and claims using CRF. ● Generic features used - ○ Position: this feature indicates the position of the sentence inside the text. ○ Comma token number: the number of commas inside a sentence. This feature represents the number of subordinate clauses inside a sentence, based on the idea that sentences containing argument elements may have a large number of clauses. ○ Connective number: the number of connectives in the sentence, as connectives usually connect subordinate clauses. This feature is also selected based on the hypothesis that sentences containing argument elements may have a large number of clauses.
  • 21. Additional features to aid argument detection in social media ● Adjective number: the number of adjectives in a sentence may characterize a sentence as argumentative or not. We considered the fact that in argumentation opinions are usually expressed towards an entity/claim, and that these opinions are usually expressed through adjectives. ● Entities in previous sentences: this feature represents the number of entities in the nth previous sentence. Considering a history of n = 5 sentences, we obtain five features, with each one containing the number of entities in the respective sentence. These features correlate to the probability that the current sentence contains an argument element. ● Cumulative number of entities in previous sentences: this feature contains the total number of entities from the previous n sentences. Considering a history of n = 5 we obtain four features, with each one containing the cumulative number of entities from all the previous sentences. ● Ratio of distributions: we created a language model from sentences that contain argument elements and one from sentences that do not contain an argument element. The ratio between these two distributions was used as a feature. We have created three ratios of language models based on unigrams, bigrams and trigrams of words. The ratio can be described as P(X | sentence contains an argument element) / P(X | sentence does not contain an argument element), where X ∈ {unigrams, bigrams, trigrams}
  • 22. Related Work Argumentation is a branch of philosophy that studies the act or process of forming reasons and of drawing conclusions in the context of a discussion, dialogue, or conversation. The reasons are called premises and the conclusion is called the claim.
  • 23. Existing methods ● Palau et al. [4,7] ○ Classification at the sentence level by trying to identify possible argumentative sentences. Using NB, SVM, maximum entropy. ○ identify groups of sentences that refer to the same argument, using semantic distance based on the relatedness of words contained ○ detect clauses of sentences through a parsing tool, which are classified as argumentative or not with a maximum entropy classifier ○ Then argumentative clauses are classified into premises and claims through support vector machines ○ Araucaria corpus and ECHR corpus [11], achieving an accuracy of 73% and 80% ● A rule based system - The system is given as input an argumentation scheme and an ontology concerning an object, for example, a camera and its characteristic features. Argumentation schemes are populated and along with discourse indicators and other domain specific features, the rules are constructed. An interesting aspect of this work is the fact that they applied
  • 24. Existing methods - Continued ●
  • 25. Existing methods - Continued ● Lawrence et al., 2014 - Proposed a machine learning approach to extract propositions from philosophical text, with a topic model to determine argument structure, without considering whether a piece of text is part of an argument. Hence, the machine learning algorithm was used in order to define the boundaries and afterwards classify each word as the beginning or end of a proposition. ● The first step of their task was to identify sentences containing context dependent claims (CDCs) in each article. Afterwards they used a classifier in order to find the exact boundaries of the CDCs detected. As a final step, they ranked each CDC in order to isolate the CDCs most relevant to the corresponding topic. That said, their goal is to automatically pinpoint CDCs within topic related documents.
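
The BIO scheme described on slide 13 (B for the first token of a premise, I for the remaining premise tokens, O for everything else) can be illustrated with a minimal Python sketch. The function name `bio_tags`, the whitespace tokenization, and the premise span chosen for the example sentence are assumptions made for illustration; the slide's own tagged example was an image and is not reproduced here.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], premise_spans: List[Tuple[int, int]]) -> List[str]:
    """Return one B/I/O tag per token.

    premise_spans holds (start, end) token indices, end exclusive,
    for each annotated premise segment (hypothetical input format).
    """
    tags = ["O"] * len(tokens)  # every token starts outside any premise
    for start, end in premise_spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"invalid span ({start}, {end})")
        tags[start] = "B"  # first token of the premise
        for i in range(start + 1, end):
            tags[i] = "I"  # remaining tokens inside the premise
    return tags


# Sentence from slide 13; the span (0, 4) is an assumed annotation,
# not the one shown in the original slide.
tokens = "Wind turbines generate noise in the summer".split()
tags = bio_tags(tokens, [(0, 4)])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A sequence labeller such as a CRF would then be trained to predict these per-token tags from the features listed on slide 9.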

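The "ratio of distributions" feature from slides 8 and 21 compares n-gram probabilities under two language models, one built from argumentative sentences and one from non-argumentative sentences. The sketch below is one possible reading, not the authors' implementation: add-one smoothing, summing log-ratios over the n-grams of a sentence, and all function names are assumptions, since the slides do not say how unseen n-grams or whole sentences are handled.

```python
import math
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(sentences: List[List[str]], n: int) -> Counter:
    """Count n-grams over a list of tokenized sentences."""
    counts: Counter = Counter()
    for tokens in sentences:
        counts.update(ngrams(tokens, n))
    return counts


def log_ratio(tokens: List[str], arg_counts: Counter, non_counts: Counter, n: int) -> float:
    """Sum of log(P(g | argumentative) / P(g | non-argumentative)) over n-grams g.

    Probabilities use add-one smoothing (an assumption) so unseen n-grams
    do not produce a division by zero.
    """
    vocab = len(set(arg_counts) | set(non_counts)) + 1  # +1 slot for unseen n-grams
    arg_total = sum(arg_counts.values())
    non_total = sum(non_counts.values())
    score = 0.0
    for g in ngrams(tokens, n):
        p_arg = (arg_counts[g] + 1) / (arg_total + vocab)
        p_non = (non_counts[g] + 1) / (non_total + vocab)
        score += math.log(p_arg / p_non)
    return score


# Toy training data, invented for illustration only.
argumentative = [["wind", "turbines", "generate", "noise"]]
non_argumentative = [["the", "weather", "is", "nice"]]
arg_uni = ngram_counts(argumentative, 1)
non_uni = ngram_counts(non_argumentative, 1)
print(log_ratio(["turbines", "generate", "noise"], arg_uni, non_uni, 1))
```

Calling the same functions with `n=2` and `n=3` would give the bigram and trigram ratios, which yields the three ratio features the slides mention.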
Editor's Notes

  1. The corpus was constructed by manually filtering a larger corpus, automatically collected by performing queries on popular search engines (such as Bing), Google Plus, and Twitter, and by crawling sites from a list of sources relevant to the domain of renewable energy.