1
Grouping business news stories
based on
salience of named entities
Llorenç Escoter, Lidia Pivovarova, Mian Du, Anisia Katinskaya
and Roman Yangarber
EACL 2017, Valencia, Spain
2
Task definition
● PULS – an online news monitoring system for the
business domain (puls.cs.helsinki.fi)
● 4000-6000 news articles daily
● Some stories appear multiple times
● Our task: cluster articles into a set of stories:
– to minimize redundancy
– to identify trending stories
3
4
Event grouping task is different from
topical text clustering:
- fine-grained
- named entities are crucial
- group size distribution is skewed
5
Dataset
● Popular clustering data sets target much coarser
categorization tasks
● The dataset and the interface are publicly available
● Dataset based on PULS business
corpus
– “Typical” day, ~4000 documents
– Manually annotated using a
command-line interface, which
displays documents pairwise
– Initialization: if two documents
do not mention the same name, they
cannot be grouped together
– Decision propagation: only one
member of a group needs to be
shown against another group
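The two annotation shortcuts above can be sketched as follows. This is only an illustration under assumptions, not the authors' annotation tool: the input format (`doc_names`), the `same_story` callback standing in for the human annotator, and the union-find bookkeeping are all hypothetical.

```python
from itertools import combinations

def find(parent, x):
    # Follow parent links to the group representative, compressing the path.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def annotate(doc_names, same_story):
    """Group documents, asking the annotator only when necessary.

    doc_names:  dict doc_id -> set of names mentioned (hypothetical format).
    same_story: callable(a, b) -> bool, stands in for the human decision.
    Returns (groups, number_of_questions_asked).
    """
    parent = {d: d for d in doc_names}
    rejected = set()  # pairs of group representatives judged "different"
    asked = 0
    for a, b in combinations(sorted(doc_names), 2):
        # Initialization: no shared name means the pair cannot be grouped.
        if not doc_names[a] & doc_names[b]:
            continue
        ra, rb = find(parent, a), find(parent, b)
        # Decision propagation: one decision per pair of groups is enough.
        if ra == rb or frozenset((ra, rb)) in rejected:
            continue
        asked += 1
        if same_story(a, b):
            parent[rb] = ra
            # Carry earlier "different" decisions over to the merged group.
            for pair in list(rejected):
                if rb in pair:
                    (other,) = pair - {rb}
                    rejected.add(frozenset((ra, other)))
        else:
            rejected.add(frozenset((ra, rb)))
    groups = {}
    for d in doc_names:
        groups.setdefault(find(parent, d), []).append(d)
    return sorted(sorted(g) for g in groups.values()), asked

# Toy data (invented for illustration only).
docs = {1: {"Nokia"}, 2: {"Nokia", "Microsoft"}, 3: {"Microsoft"},
        4: {"Nokia"}, 5: {"Apple"}}
truth = {1: "A", 2: "A", 3: "B", 4: "A", 5: "C"}
groups, asked = annotate(docs, lambda a, b: truth[a] == truth[b])
print(groups, asked)
```

In this toy run, four document pairs share a name, but the pair (2, 4) is never shown because both documents already belong to the same group.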
6
Dataset: pairwise annotation example
[Figure, built up over slides 6–9: 9 × 9 pairwise annotation matrix over documents 1–9, with cells marked “+” or “-”.]
10
11
Named Entity Recognition
● Part of the PULS news monitoring system:
– extracts named entities
– assigns type: company, person, location, etc.
– computes salience
12
Salience
● Our definition of salience relies on the general
nature of news articles:
– Authors typically mention the main event in the title;
then, the main information is elaborated in the first few
sentences, followed by further detail and background
13
Salience
● The most important NEs are mentioned early in
the text and then repeated.
● Less important NEs are mentioned in the later
paragraphs and are less frequent.
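The slides do not give the salience formula, so the following is only a rough sketch of the idea that early and repeated mentions should count more. The position weight `1 - offset / doc_length` and the normalization by the top score are assumptions made for illustration, not the PULS definition.

```python
def salience(mentions, doc_length):
    """Score entities by how early and how often they are mentioned.

    mentions:   dict entity -> list of token offsets of its mentions
                (hypothetical input format).
    doc_length: number of tokens in the document.
    Returns dict entity -> score in [0, 1].
    """
    raw = {}
    for entity, offsets in mentions.items():
        # Each mention contributes more the earlier it appears (assumption).
        raw[entity] = sum(1.0 - off / doc_length for off in offsets)
    top = max(raw.values(), default=0.0)
    return {e: (s / top if top else 0.0) for e, s in raw.items()}

# Toy document of 100 tokens (invented for illustration only).
scores = salience({"Nokia": [0, 12, 40], "Microsoft": [5, 60], "Espoo": [90]}, 100)
print(scores)
```

Here "Nokia" is mentioned first and most often, so it receives the top score, while "Espoo", mentioned once near the end, scores lowest.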
14
Salience
15
Clustering method
● Features:
– Word-based
– NE-based
● Hierarchical clustering of document vectors using
cosine similarity
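A minimal sketch of threshold-based agglomerative clustering over cosine distance is given below. The slides do not state the linkage criterion; average linkage is assumed here, and the function names and toy vectors are hypothetical.

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity; an all-zero vector is treated as maximally distant.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 1.0
    return 1.0 - dot / (nu * nv)

def cluster(vectors, theta):
    """Merge the closest pair of clusters until the distance exceeds theta."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Average linkage (assumption): mean pairwise distance between members.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > theta:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

# Toy vectors (invented for illustration only).
vectors = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.1], [0, 0, 1]]
result = cluster(vectors, theta=0.2)
print(result)
```

A low threshold keeps most documents as singletons, which matches the skewed group-size distribution described earlier.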
16
Clustering method
● Features:
– Word-based
I. TF-IDF
II. “standard” CBOW embeddings, built on Google News
(Mikolov et al. 2013)
III. CBOW embeddings, built on our (4.5 M documents)
corpus of business news
● Hierarchical clustering of document vectors
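As an illustration of the first word-based representation, here is a small TF-IDF sketch. The exact weighting variant used in the work is not stated on the slides; raw term frequency multiplied by `log(N / df)` is assumed, and documents are represented as sparse dictionaries.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse dict vector per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

# Toy corpus (invented for illustration only).
corpus = [["nokia", "sells", "unit"],
          ["nokia", "buys", "startup"],
          ["apple", "sells", "phones"]]
vecs = tfidf_vectors(corpus)
print(vecs[0])
```

Terms that occur in fewer documents, such as "unit", receive a higher weight than terms shared across documents, such as "nokia".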
18
Clustering method
● Hierarchical clustering of document vectors
● Features:
– Word-based
– NE-based
I. Counts
II. TF-IDF
III. Salience
19
Clustering method
● Features:
– Word-based
– NE-based
● Combining features:
● Hierarchical clustering of document vectors
20
Clustering method
● Features:
– Word-based
– NE-based
● Combining features:
I. Concatenation
- a document vector consists of both
word-based and NE-based features
● Hierarchical clustering of document vectors
21
Clustering method
● Features:
– Word-based
– NE-based
● Combining features:
I. Concatenation
II. Combination using
AND function
- both the word distance
and the NE distance
must be sufficiently small
● Hierarchical clustering of document vectors
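The difference between the two combination strategies can be sketched as below. The slides only say that both distances should be sufficiently close; the use of two separate thresholds (`theta_word`, `theta_ne`) and the helper names are assumptions for illustration.

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity; an all-zero vector is treated as maximally distant.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 1.0
    return 1.0 - dot / (nu * nv)

def concat_distance(word_a, ne_a, word_b, ne_b):
    # Strategy I: one distance over the concatenated feature vectors.
    return cosine_distance(list(word_a) + list(ne_a), list(word_b) + list(ne_b))

def and_can_merge(word_dist, ne_dist, theta_word, theta_ne):
    # Strategy II: both distances must pass their thresholds (assumed form).
    return word_dist <= theta_word and ne_dist <= theta_ne

# Toy pair (invented): identical wording, but no named entity in common.
word_a, ne_a = [3, 3], [1, 0]
word_b, ne_b = [3, 3], [0, 1]
word_d = cosine_distance(word_a, word_b)
ne_d = cosine_distance(ne_a, ne_b)
concat_d = concat_distance(word_a, ne_a, word_b, ne_b)
merged_by_and = and_can_merge(word_d, ne_d, 0.3, 0.3)
print(concat_d, merged_by_and)
```

With a threshold of 0.3, the concatenated distance is small enough to merge the pair, whereas the AND rule keeps the documents apart because their named entities do not overlap.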
22
Evaluation
● Measures
– V-measure (combination of completeness and
homogeneity)
– Rand Index (ratio of correctly classified pairs)
● Adjustment against naïve strategy (doing nothing):
– naïve strategy scores: V-measure 0.96, RI 0.99
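To see why the naïve strategy scores so high, the two measures can be computed on a toy example where almost every document is a singleton. The Rand Index and V-measure below follow their standard definitions; the `adjusted` rescaling is only one plausible form of the adjustment and is an assumption, since the slides do not give the formula.

```python
import math
from collections import Counter
from itertools import combinations

def rand_index(gold, pred):
    # Share of document pairs on which the two clusterings agree.
    pairs = list(combinations(range(len(gold)), 2))
    agree = sum((gold[i] == gold[j]) == (pred[i] == pred[j]) for i, j in pairs)
    return agree / len(pairs)

def _entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def _cond_entropy(target, given):
    # H(target | given)
    n = len(target)
    joint = Counter(zip(given, target))
    marg = Counter(given)
    return -sum(c / n * math.log(c / marg[g]) for (g, _), c in joint.items())

def v_measure(gold, pred):
    h_gold, h_pred = _entropy(gold), _entropy(pred)
    homogeneity = 1.0 if h_gold == 0 else 1.0 - _cond_entropy(gold, pred) / h_gold
    completeness = 1.0 if h_pred == 0 else 1.0 - _cond_entropy(pred, gold) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

def adjusted(score, naive_score):
    # Hypothetical rescaling: 0 for the naive strategy, 1 for a perfect result.
    return (score - naive_score) / (1.0 - naive_score)

# Toy gold standard (invented): ten documents, only two of them grouped.
gold = [0, 0, 1, 2, 3, 4, 5, 6, 7, 8]
naive = list(range(len(gold)))  # "doing nothing": every document on its own
naive_ri = rand_index(gold, naive)
naive_v = v_measure(gold, naive)
print(naive_ri, naive_v)
```

Because most gold groups are singletons, leaving every document alone already gets nearly all pairs right, so raw scores leave little room to show differences between methods.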
23
Individual features
Rand Index and V-measure adjusted for naïve strategy.
θ – cosine distance threshold for hierarchical clustering
24
Vector concatenation
25
Combination using AND function
Rand Index adjusted for naïve strategy
(improvement from 0.25 to 0.4)
V-measure adjusted for naïve strategy
(improvement from 0.37 to 0.49)
26
Conclusions
● automatically extracted named entities are better
features than keywords for event-based clustering
● salience of NEs, a weighting scheme that
combines the frequency and prominence of an NE, is
the best document representation
● corpus-specific word embeddings alone give lower
performance than embeddings built on a bigger
corpus
● combining NE salience with domain-specific
embeddings yields the best performance
● AND-function combination strategy works better
than concatenation
27
Thanks for your attention!
data and code: puls.cs.helsinki.fi/grouping
contacts: first_name.last_name@cs.helsinki.fi
