SlideShare a Scribd company logo
1 of 32
Download to read offline
News clustering approach
based on discourse text structure
Tatyana Makhalova, Dmitry Ilvovsky, Boris Galitsky
National Research University Higher School of Economics, Moscow, Russia
Knowledge Trail Incorporated, San Jose, USA
t.makhalova@gmail.com, dilvovsky@hse.ru, bgalitsky@hotmail.com
Table of contents
1 Text clustering problem
2 Parse thickets as a text representation model
3 The clustering approach
4 Experiments
5 Conclusion
2 / 32
Main Clustering Aspects
Text preprocessing and representation
Clustering methods
Similarity measures
3 / 32
Text Representation Models
Model Authors
Data
structure
Words
order
preserving
Embedded
semantics
VSM Salton et al, 1975 matrix - -
GVSM Wong et al,1985 matrix - +
TVSM Becker and Kuropka, 2003 matrix - +
eTVSM Polyvyanyy and Kuropka, 2007 matrix - +
DIG Hammouda and Kamel, 2004 graph + -
”Suffix Tree” Zamir and Etzioni, 1998 tree + -
N-Grams Schenker et al, 2007 graph + -
Parse Thickets Galitsky, 2013 trees (forest) + +
4 / 32
Parse Thickets: basic characteristics
Preserving a linguistic structure of a text paragraph
Constructing of parse trees for each sentence within a
paragraph
Adding inter-sentence relations between parse tree nodes
5 / 32
Parse Thickets: types of discourse relations
Coreferences (Lee et al., 2012)
Anaphora
Same entity
Hyponym/hyperonym
Rhetoric structure theory (RST) (Mann et al., 1992)
Communicative Actions (Searle, 1969)
6 / 32
Coreferences: example
7 / 32
Relations based on Rhetoric Structure Theory
RST characterizes structure of text in terms of relations that
hold between parts of text
RST describes relations between clauses in text which might
not be syntactically linked
RST helps to discover text patterns such as nucleus/satellite
structure with relation such as evidence, justify, antithesis,
concession and so on.
8 / 32
Parse Thickets: an example
“Iran refuses to accept the UN proposal to end the dispute over work on nuclear
weapons”
“UN nuclear watchdog passes a resolution condemning Iran for developing a second
uranium enrichment site in secret”,
“A recent IAEA report presented diagrams that suggested Iran was secretly working on
nuclear weapons”,
“Iran envoy says its nuclear development is for peaceful purpose, and the material
evidence against it has been fabricated by the US”
“UN passes a resolution condemning the work of Iran on nuclear weapons, in spite of
Iran claims that its nuclear research is for peaceful purpose”,
“Envoy of Iran to IAEA proceeds with the dispute over its nuclear program and
develops an enrichment site in secret”,
“Iran confirms that the evidence of its nuclear weapons program is fabricated by the
US and proceeds with the second uranium enrichment site”
9 / 32
Parse Thickets: discourse relations
“Iran confirms that the evidence
of its nuclear weapons program
is fabricated by the US and
proceeds with the second
uranium enrichment site”
“Iran envoy says its nuclear
development is for peaceful
purpose, and the material
evidence against it has been
fabricated by the US”
10 / 32
Parse Thickets: discourse relations
“UN nuclear watchdog passes a
resolution condemning Iran for
developing a second Uranium
enrichment site in secret”,
“A recent IAEA report presented
diagrams that suggested Iran was
secretly working on nuclear weapons”,
“UN passes a resolution condemning
the work of Iran on nuclear weapons,
in spite of Iran claims that its nuclear
research is for peaceful purpose”,
“Envoy of Iran to IAEA proceeds with
the dispute over its nuclear program
and develops an enrichment site in
secret”
11 / 32
Parse Thickets: an example
12 / 32
Clustering of Parse Thickets: the main idea
Similarity of parse thickets based on sub-trees matching
labeled discourse arcs
unlabeled syntactic arcs
nodes with part of speech and stem of a word
13 / 32
Clustering of paragraphs:
generalisation of syntactic trees
[NN-work IN-* IN-on JJ-nuclear NNS-weapons ],
[DT-the NN-dispute IN-over JJ-nuclear NNS-* ],
[VBZ-passes DT-a - NN-resolution],
[VBG-condemning NNP-iran IN-*],
[VBG-developing DT-* NN-enrichment NN-site IN-in NNsecret],
[DT-* JJ-second NN-uranium NN-enrichment NN-site],
[VBZ-is IN-for JJ-peaceful NN-purpose],
[DT-the NN-evidence IN-* PRP-it],
[VBN-* VBN-fabricated - IN-by DT-the NNP-us]
14 / 32
Clustering of paragraphs:
generalisation of parse thickets
[NN-Iran VBG-developing DT-* NN-enrichment NN-site IN-in
NN-secret ]
[NN-generalization-<UN/nuclear watchdog> * VB-pass NN-resolution
VBG-condemning NN- Iran]
[NN-generalization- <Iran/envoy of Iran> Communicative action
DT-the NN-dispute IN-over JJ-nuclear NNS-*]
[Communicative action NN-work IN-of NN-Iran IN-on JJ-nuclear
NNS-weapons]
[NN-generalization <Iran/envoy to UN> Communicative action
NN-Iran NN-nuclear NN-* VBZ-is IN-for JJ-peaceful NN-purpose ]
[Communicative action NN-generalization <work/develop> IN-of
NN-Iran IN-on JJ-nuclear NNS-weapons]
[NN-generalization <Iran/envoy to UN> Communicative action
NN-evidence IN-against NN-Iran NN-nuclear VBN-fabricated IN-by
DT-the NNP-us ]
[NN-Iran JJ-nuclear NN-weapon NN-* RST-evidence VBN-fabricated
IN-by DT-the NNP-US condemnproceed [enrichment site] <leads to>
suggestcondemn [ work Iran nuclear weapon ]
15 / 32
Clustering of Parse Thickets: what do we want?
Adequately represent groups of texts with overlapping content
Get text clusters with different refinement
Goal: (multi-level) hierarchical structure
Solution: Construction of pattern structures on parse thickets
16 / 32
Clustering of Parse Thickets: the mathematical foundation
Pattern Structure
A triple (G, (D, ) , δ), where G is a set of objects, (D, ) is a
complete meet-semilattice of descriptions and δ : G → D is a
mapping an object to a description.
Pattern concept
A pair (A, d) for which A = d and d = A, where A and d are
the Galois connections, defined as follows:
A := g∈Aδ (g) for A ⊆ G
d := {g ∈ G|d δ (g)} for d ∈ D
17 / 32
Pattern Structures on Parse Thickets
an original paragraph of text → an object a ∈ A
parse thickets constructed
from paragraphs
→
a set of its maximal
generalized sub-trees d
a pattern concept → a cluster
Drawback: the exponential growth of the number of clusters by
increasing the number of texts (objects)
18 / 32
Reduced pattern structures:
meaningfulness estimates of a pattern concept
Average and Maximal Pattern Score
Maximum score among all sub-trees in the cluster
Scoremax
A, d := max
chunk∈d
Score (chunk)
Average score of sub-trees in the cluster
Scoreavg
A, d :=
1
|d|
chunk∈d
Score (chunk)
where Score (chunk) = node∈chunk wnode
19 / 32
Reduced pattern structures:
loss estimates of a cluster with respect to original texts
Average and Minimal Pattern Loss Score
Estimates minimal lost meaning of cluster content w.r.t. original
texts in the cluster
ScoreLossmin
A, d := 1 −
Scoremax A, d
ming∈A Scoremax g, dg
Estimates lost meaning of cluster content on average
ScoreLossavg
A, d := 1 −
Scoreavg A, d
1
|d| g∈A Scoremax g, dg
20 / 32
Reduced pattern structures: generalization
Controlling the loss of meaning w.r.t. the original texts
ScoreLoss∗
A1 ∪ A2 , d1 ∩ d2 ≤ θ
θ = 0, 75, µ1 = 0, 1, µ2 = 0, 9 θ = 0, 5, µ1 = 0, 1, µ2 = 0, 9
21 / 32
Reduced pattern structures: clusters distinguishability
Controlling the loss of meaning w.r.t. the nearest more
meaningfulness neighbors in the cluster hierarchy
Score∗
A1 ∪ A2 , d1∩d2 ≥ µ1min {Score∗
A1 , d1 , Score∗
A2 , d2 }
Controlling the distinguishability w.r.t. the nearest neighbors in the
hierarchy of clusters
Score∗
A1 ∪ A2 , d1∩d2 ≤ µ2max {Score∗
A1 , d1 , Score∗
A2 , d2 }
µ1 = 0, 1, µ2 = 0, 9,
θ = 0, 75
µ1 = 0, 5, µ2 = 0, 9,
θ = 0, 75
µ1 = 0, 1, µ2 = 0, 8,
θ = 0, 75
22 / 32
Reduced pattern structures: constraints
ScoreLoss∗
A1 ∪ A2 , d1 ∩ d2 ≤ θ
Score∗
A1 ∪ A2 , d1 ∩ d2 ≥ µ1min {Score∗
A1 , d1 , Score∗
A2 , d2 }
Score∗
A1 ∪ A2 , d1 ∩ d2 ≤ µ2max {Score∗
A1 , d1 , Score∗
A2 , d2 }
pattern structure
without reduction
reduced pattern structure
with θ = 0, 75, µ1 = 0, 1 and µ2 = 0, 9
23 / 32
Implementation
The Apache OpenNLP library (the most common NLP tasks)
Bing search API (to obtain news snippets)
Pattern structure builder: modified by authors version of
AddIntent algorithm (van der Merwe et al., 2004)
24 / 32
News Clustering: motivation
A long list of search results
Many groups of pages with a similar content
An overlapping content
25 / 32
User Study: non-overlapping partition
web snippets on world’s most pressing news: “F1 winners”,
“fighting Ebola with nanoparticles”, “2015 ACM awards
winners”, “read facial expressions through webcam”, “turning
brown eyes blue”
inconsistency of human-labeled partitions: low values of a
pairwise Adjusted Mutual Information score of human-labeled
partitions 0, 03 ≤ MIadj ≤ 0, 51
26 / 32
Example: The Ebola News Set
Text ID # words # symbols # sentences
quoted
speech
reported
speech
1 42 210 3
2 42 253 3 +
3 54 287 3 +
4 75 399 3 + +
5 31 167 2 +
6 44 209 2 + +
7 49 247 2 +
8 61 340 3 +
9 50 242 2 +
10 62 295 4 +
11 90 526 4 + +
12 75 370 4
27 / 32
Accuracy of non-overlapping clustering methods
Accuracy of conventional clustering methods in the case of
overlapping texts groups
low (in most cases)
greatly depends on taken as ground truth a human-labeled
partition
Method Linkage Distance
A human-labeled partition
1 2 3 4
HAC
average cityblock 0,42 0,42 0,33 0,08
complete cityblock 0,42 0,33 0,17 0,17
average
euclidean
cosine
0,58 0,5 0,33 0,17
complete
euclidean
cosine
0,33 0,92 0,42 0,17
k-means euclidean 0,08 0,08 0,17 0,25
28 / 32
Accuracy of non-overlapping clustering methods
Accuracy of conventional clustering methods
for 4 human-labeled partitions
29 / 32
An example of pattern structures clustering:
clusters with maximal score
reduced pattern structure
with θ = 0, 75, µ1 = 0, 1 and µ2 = 0, 9
30 / 32
An example of pattern structures clustering:
clusters with maximal score
31 / 32
Conclusion
Short text clustering problem
A failure of the traditional clustering methods
Parse Thickets as a text model
Texts similarity based on pattern structures
Reduced pattern structures with constraints
Score and ScoreLoss to improve efficiency and to remove
redundant clusters
Improvement of browsing and navigation through texts set for
users
32 / 32

More Related Content

What's hot

What's hot (12)

Spacey random walks from Householder Symposium XX 2017
Spacey random walks from Householder Symposium XX 2017Spacey random walks from Householder Symposium XX 2017
Spacey random walks from Householder Symposium XX 2017
 
Higher-order clustering coefficients
Higher-order clustering coefficientsHigher-order clustering coefficients
Higher-order clustering coefficients
 
Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network Analysis
 
Spectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structuresSpectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structures
 
C0211822
C0211822C0211822
C0211822
 
Using Alpha-cuts and Constraint Exploration Approach on Quadratic Programming...
Using Alpha-cuts and Constraint Exploration Approach on Quadratic Programming...Using Alpha-cuts and Constraint Exploration Approach on Quadratic Programming...
Using Alpha-cuts and Constraint Exploration Approach on Quadratic Programming...
 
Digital Watermarking through Embedding of Encrypted and Arithmetically Compre...
Digital Watermarking through Embedding of Encrypted and Arithmetically Compre...Digital Watermarking through Embedding of Encrypted and Arithmetically Compre...
Digital Watermarking through Embedding of Encrypted and Arithmetically Compre...
 
Minimizing cost in distributed multiquery processing applications
Minimizing cost in distributed multiquery processing applicationsMinimizing cost in distributed multiquery processing applications
Minimizing cost in distributed multiquery processing applications
 
Enhancing Privacy of Confidential Data using K Anonymization
Enhancing Privacy of Confidential Data using K AnonymizationEnhancing Privacy of Confidential Data using K Anonymization
Enhancing Privacy of Confidential Data using K Anonymization
 
An Introduction to RSA Public-Key Cryptography
An Introduction to RSA Public-Key CryptographyAn Introduction to RSA Public-Key Cryptography
An Introduction to RSA Public-Key Cryptography
 
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...
 
Pert 05 aplikasi clustering
Pert 05 aplikasi clusteringPert 05 aplikasi clustering
Pert 05 aplikasi clustering
 

Similar to Tatyana Makhalova - 2015 - News clustering approach based on discourse text structure

FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS

FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS
FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS

FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS

Maxim Kazantsev
 
slides_nuclear_norm_regularization_david_mateos
slides_nuclear_norm_regularization_david_mateosslides_nuclear_norm_regularization_david_mateos
slides_nuclear_norm_regularization_david_mateos
David Mateos
 
Grds international conference on pure and applied science (9)
Grds international conference on pure and applied science (9)Grds international conference on pure and applied science (9)
Grds international conference on pure and applied science (9)
Global R & D Services
 
2012 mdsp pr08 nonparametric approach
2012 mdsp pr08 nonparametric approach2012 mdsp pr08 nonparametric approach
2012 mdsp pr08 nonparametric approach
nozomuhamada
 

Similar to Tatyana Makhalova - 2015 - News clustering approach based on discourse text structure (20)

Higher-order organization of complex networks
Higher-order organization of complex networksHigher-order organization of complex networks
Higher-order organization of complex networks
 
Deep Learning, Keras, and TensorFlow
Deep Learning, Keras, and TensorFlowDeep Learning, Keras, and TensorFlow
Deep Learning, Keras, and TensorFlow
 
话题模型2
话题模型2话题模型2
话题模型2
 
Namesake
NamesakeNamesake
Namesake
 
Ch03 Mining Massive Data Sets stanford
Ch03 Mining Massive Data Sets  stanfordCh03 Mining Massive Data Sets  stanford
Ch03 Mining Massive Data Sets stanford
 
True global optimality of the pressure vessel design problem: a benchmark for...
True global optimality of the pressure vessel design problem: a benchmark for...True global optimality of the pressure vessel design problem: a benchmark for...
True global optimality of the pressure vessel design problem: a benchmark for...
 
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
 
Who Cares about Others' Privacy: Personalized Anonymization of Moving Object ...
Who Cares about Others' Privacy: Personalized Anonymization of Moving Object ...Who Cares about Others' Privacy: Personalized Anonymization of Moving Object ...
Who Cares about Others' Privacy: Personalized Anonymization of Moving Object ...
 
Deep Learning: R with Keras and TensorFlow
Deep Learning: R with Keras and TensorFlowDeep Learning: R with Keras and TensorFlow
Deep Learning: R with Keras and TensorFlow
 
Bytewise Approximate Match: Theory, Algorithms and Applications
Bytewise Approximate Match:  Theory, Algorithms and ApplicationsBytewise Approximate Match:  Theory, Algorithms and Applications
Bytewise Approximate Match: Theory, Algorithms and Applications
 
Clique and sting
Clique and stingClique and sting
Clique and sting
 
Diversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesDiversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News Stories
 
kcde
kcdekcde
kcde
 
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
 
FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS

FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS
FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS

FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS

 
slides_nuclear_norm_regularization_david_mateos
slides_nuclear_norm_regularization_david_mateosslides_nuclear_norm_regularization_david_mateos
slides_nuclear_norm_regularization_david_mateos
 
Grds international conference on pure and applied science (9)
Grds international conference on pure and applied science (9)Grds international conference on pure and applied science (9)
Grds international conference on pure and applied science (9)
 
2012 mdsp pr08 nonparametric approach
2012 mdsp pr08 nonparametric approach2012 mdsp pr08 nonparametric approach
2012 mdsp pr08 nonparametric approach
 
Investigative Compression Of Lossy Images By Enactment Of Lattice Vector Quan...
Investigative Compression Of Lossy Images By Enactment Of Lattice Vector Quan...Investigative Compression Of Lossy Images By Enactment Of Lattice Vector Quan...
Investigative Compression Of Lossy Images By Enactment Of Lattice Vector Quan...
 
Higher-order clustering coefficients
Higher-order clustering coefficientsHigher-order clustering coefficients
Higher-order clustering coefficients
 

More from Association for Computational Linguistics

More from Association for Computational Linguistics (20)

Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal TextMuis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
 
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
 
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour AnalysisCastro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
 
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
 
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
 
Elior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
Elior Sulem - 2018 - Semantic Structural Evaluation for Text SimplificationElior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
Elior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
 
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
 
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
 
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
 
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
 
Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
 
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
 
Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015
 
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
 

Recently uploaded

Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
kauryashika82
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
fonyou31
 

Recently uploaded (20)

Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 

Tatyana Makhalova - 2015 - News clustering approach based on discourse text structure

  • 1. News clustering approach based on discourse text structure Tatyana Makhalova, Dmitry Ilvovsky, Boris Galitsky National Research University Higher School of Economics, Moscow, Russia Knowledge Trail Incorporated, San Jose, USA t.makhalova@gmail.com, dilvovsky@hse.ru, bgalitsky@hotmail.com
  • 2. Table of contents 1 Text clustering problem 2 Parse thickets as a text representation model 3 The clustering approach 4 Experiments 5 Conclusion 2 / 32
  • 3. Main Clustering Aspects Text preprocessing and representation Clustering methods Similarity measures 3 / 32
  • 4. Text Representation Models Model Authors Data structure Words order preserving Embedded semantics VSM Salton et al, 1975 matrix - - GVSM Wong et al,1985 matrix - + TVSM Becker and Kuropka, 2003 matrix - + eTVSM Polyvyanyy and Kuropka, 2007 matrix - + DIG Hammouda and Kamel, 2004 graph + - ”Suffix Tree” Zamir and Etzioni, 1998 tree + - N-Grams Schenker et al, 2007 graph + - Parse Thickets Galitsky, 2013 trees (forest) + + 4 / 32
  • 5. Parse Thickets: basic characteristics Preserving a linguistic structure of a text paragraph Constructing of parse trees for each sentence within a paragraph Adding inter-sentence relations between parse tree nodes 5 / 32
  • 6. Parse Thickets: types of discourse relations Coreferences (Lee et al., 2012) Anaphora Same entity Hyponym/hyperonym Rhetoric structure theory (RST) (Mann et al., 1992) Communicative Actions (Searle, 1969) 6 / 32
  • 8. Relations based on Rhetoric Structure Theory RST characterizes structure of text in terms of relations that hold between parts of text RST describes relations between clauses in text which might not be syntactically linked RST helps to discover text patterns such as nucleus/satellite structure with relation such as evidence, justify, antithesis, concession and so on. 8 / 32
  • 9. Parse Thickets: an example “Iran refuses to accept the UN proposal to end the dispute over work on nuclear weapons” “UN nuclear watchdog passes a resolution condemning Iran for developing a second uranium enrichment site in secret”, “A recent IAEA report presented diagrams that suggested Iran was secretly working on nuclear weapons”, “Iran envoy says its nuclear development is for peaceful purpose, and the material evidence against it has been fabricated by the US” “UN passes a resolution condemning the work of Iran on nuclear weapons, in spite of Iran claims that its nuclear research is for peaceful purpose”, “Envoy of Iran to IAEA proceeds with the dispute over its nuclear program and develops an enrichment site in secret”, “Iran confirms that the evidence of its nuclear weapons program is fabricated by the US and proceeds with the second uranium enrichment site” 9 / 32
  • 10. Parse Thickets: discourse relations “Iran confirms that the evidence of its nuclear weapons program is fabricated by the US and proceeds with the second uranium enrichment site” “Iran envoy says its nuclear development is for peaceful purpose, and the material evidence against it has been fabricated by the US” 10 / 32
  • 11. Parse Thickets: discourse relations “UN nuclear watchdog passes a resolution condemning Iran for developing a second Uranium enrichment site in secret”, “A recent IAEA report presented diagrams that suggested Iran was secretly working on nuclear weapons”, “UN passes a resolution condemning the work of Iran on nuclear weapons, in spite of Iran claims that its nuclear research is for peaceful purpose”, “Envoy of Iran to IAEA proceeds with the dispute over its nuclear program and develops an enrichment site in secret” 11 / 32
  • 12. Parse Thickets: an example 12 / 32
  • 13. Clustering of Parse Thickets: the main idea Similarity of parse thickets based on sub-trees matching labeled discourse arcs unlabeled syntactic arcs nodes with part of speech and stem of a word 13 / 32
  • 14. Clustering of paragraphs: generalisation of syntactic trees [NN-work IN-* IN-on JJ-nuclear NNS-weapons ], [DT-the NN-dispute IN-over JJ-nuclear NNS-* ], [VBZ-passes DT-a - NN-resolution], [VBG-condemning NNP-iran IN-*], [VBG-developing DT-* NN-enrichment NN-site IN-in NNsecret], [DT-* JJ-second NN-uranium NN-enrichment NN-site], [VBZ-is IN-for JJ-peaceful NN-purpose], [DT-the NN-evidence IN-* PRP-it], [VBN-* VBN-fabricated - IN-by DT-the NNP-us] 14 / 32
  • 15. Clustering of paragraphs: generalisation of parse thickets [NN-Iran VBG-developing DT-* NN-enrichment NN-site IN-in NN-secret ] [NN-generalization-<UN/nuclear watchdog> * VB-pass NN-resolution VBG-condemning NN- Iran] [NN-generalization- <Iran/envoy of Iran> Communicative action DT-the NN-dispute IN-over JJ-nuclear NNS-*] [Communicative action NN-work IN-of NN-Iran IN-on JJ-nuclear NNS-weapons] [NN-generalization <Iran/envoy to UN> Communicative action NN-Iran NN-nuclear NN-* VBZ-is IN-for JJ-peaceful NN-purpose ] [Communicative action NN-generalization <work/develop> IN-of NN-Iran IN-on JJ-nuclear NNS-weapons] [NN-generalization <Iran/envoy to UN> Communicative action NN-evidence IN-against NN-Iran NN-nuclear VBN-fabricated IN-by DT-the NNP-us ] [NN-Iran JJ-nuclear NN-weapon NN-* RST-evidence VBN-fabricated IN-by DT-the NNP-US condemnproceed [enrichment site] <leads to> suggestcondemn [ work Iran nuclear weapon ] 15 / 32
  • 16. Clustering of Parse Thickets: what do we want? Adequately represent groups of texts with overlapping content Get text clusters with different refinement Goal: (multi-level) hierarchical structure Solution: Construction of pattern structures on parse thickets 16 / 32
  • 17. Clustering of Parse Thickets: the mathematical foundation Pattern Structure A triple (G, (D, ) , δ), where G is a set of objects, (D, ) is a complete meet-semilattice of descriptions and δ : G → D is a mapping an object to a description. Pattern concept A pair (A, d) for which A = d and d = A, where A and d are the Galois connections, defined as follows: A := g∈Aδ (g) for A ⊆ G d := {g ∈ G|d δ (g)} for d ∈ D 17 / 32
  • 18. Pattern Structures on Parse Thickets an original paragraph of text → an object a ∈ A parse thickets constructed from paragraphs → a set of its maximal generalized sub-trees d a pattern concept → a cluster Drawback: the exponential growth of the number of clusters by increasing the number of texts (objects) 18 / 32
  • 19. Reduced pattern structures: meaningfulness estimates of a pattern concept Average and Maximal Pattern Score Maximum score among all sub-trees in the cluster Scoremax A, d := max chunk∈d Score (chunk) Average score of sub-trees in the cluster Scoreavg A, d := 1 |d| chunk∈d Score (chunk) where Score (chunk) = node∈chunk wnode 19 / 32
  • 20. Reduced pattern structures: loss estimates of a cluster with respect to original texts Average and Minimal Pattern Loss Score Estimates minimal lost meaning of cluster content w.r.t. original texts in the cluster ScoreLossmin A, d := 1 − Scoremax A, d ming∈A Scoremax g, dg Estimates lost meaning of cluster content on average ScoreLossavg A, d := 1 − Scoreavg A, d 1 |d| g∈A Scoremax g, dg 20 / 32
  • 21. Reduced pattern structures: generalization Controlling the loss of meaning w.r.t. the original texts ScoreLoss∗ A1 ∪ A2 , d1 ∩ d2 ≤ θ θ = 0, 75, µ1 = 0, 1, µ2 = 0, 9 θ = 0, 5, µ1 = 0, 1, µ2 = 0, 9 21 / 32
  • 22. Reduced pattern structures: clusters distinguishability Controlling the loss of meaning w.r.t. the nearest more meaningfulness neighbors in the cluster hierarchy Score∗ A1 ∪ A2 , d1∩d2 ≥ µ1min {Score∗ A1 , d1 , Score∗ A2 , d2 } Controlling the distinguishability w.r.t. the nearest neighbors in the hierarchy of clusters Score∗ A1 ∪ A2 , d1∩d2 ≤ µ2max {Score∗ A1 , d1 , Score∗ A2 , d2 } µ1 = 0, 1, µ2 = 0, 9, θ = 0, 75 µ1 = 0, 5, µ2 = 0, 9, θ = 0, 75 µ1 = 0, 1, µ2 = 0, 8, θ = 0, 75 22 / 32
  • 23. Reduced pattern structures: constraints ScoreLoss∗ A1 ∪ A2 , d1 ∩ d2 ≤ θ Score∗ A1 ∪ A2 , d1 ∩ d2 ≥ µ1min {Score∗ A1 , d1 , Score∗ A2 , d2 } Score∗ A1 ∪ A2 , d1 ∩ d2 ≤ µ2max {Score∗ A1 , d1 , Score∗ A2 , d2 } pattern structure without reduction reduced pattern structure with θ = 0, 75, µ1 = 0, 1 and µ2 = 0, 9 23 / 32
  • 24. Implementation The Apache OpenNLP library (the most common NLP tasks) Bing search API (to obtain news snippets) Pattern structure builder: modified by authors version of AddIntent algorithm (van der Merwe et al., 2004) 24 / 32
  • 25. News Clustering: motivation A long list of search results Many groups of pages with a similar content An overlapping content 25 / 32
  • 26. User Study: non-overlapping partition web snippets on world’s most pressing news: “F1 winners”, “fighting Ebola with nanoparticles”, “2015 ACM awards winners”, “read facial expressions through webcam”, “turning brown eyes blue” inconsistency of human-labeled partitions: low values of a pairwise Adjusted Mutual Information score of human-labeled partitions 0, 03 ≤ MIadj ≤ 0, 51 26 / 32
  • 27. Example: The Ebola News Set Text ID # words # symbols # sentences quoted speech reported speech 1 42 210 3 2 42 253 3 + 3 54 287 3 + 4 75 399 3 + + 5 31 167 2 + 6 44 209 2 + + 7 49 247 2 + 8 61 340 3 + 9 50 242 2 + 10 62 295 4 + 11 90 526 4 + + 12 75 370 4 27 / 32
  • 28. Accuracy of non-overlapping clustering methods Accuracy of conventional clustering methods in the case of overlapping texts groups low (in most cases) greatly depends on taken as ground truth a human-labeled partition Method Linkage Distance A human-labeled partition 1 2 3 4 HAC average cityblock 0,42 0,42 0,33 0,08 complete cityblock 0,42 0,33 0,17 0,17 average euclidean cosine 0,58 0,5 0,33 0,17 complete euclidean cosine 0,33 0,92 0,42 0,17 k-means euclidean 0,08 0,08 0,17 0,25 28 / 32
  • 29. Accuracy of non-overlapping clustering methods Accuracy of conventional clustering methods for 4 human-labeled partitions 29 / 32
  • 30. An example of pattern structures clustering: clusters with maximal score reduced pattern structure with θ = 0, 75, µ1 = 0, 1 and µ2 = 0, 9 30 / 32
  • 31. An example of pattern structures clustering: clusters with maximal score 31 / 32
  • 32. Conclusion Short text clustering problem A failure of the traditional clustering methods Parse Thickets as a text model Texts similarity based on pattern structures Reduced pattern structures with constraints Score and ScoreLoss to improve efficiency and to remove redundant clusters Improvement of browsing and navigation through texts set for users 32 / 32