Faculty of Engineering, School of Computer Science and Engineering
Extractive Summarisation Based on Keyword Profile and Language Model
Han Xu, Eric Martin and Ashesh Mahidadia
Research Overview
Motivation:
• Research benefits from and expands on the work of others
• The interdependent nature of knowledge creates a need to explore new fields
• The sheer volume of literature makes this a challenging task
Objectives: design a tool that facilitates identification of the key contributions of papers
1. First, by identifying keywords that capture the most important contributions of a paper
2. Then, by creating an extractive summary of information-rich sentences that cover those keywords
Theoretical Background
Abstract
– Retrospective perception by the authors
– At the time of writing
– Low information redundancy
Citation summary
– Extrospective judgment by the community
– Over a period of time
– High information redundancy
Example citing sentences (here citing McDonald et al., 2005):
o "McDonald et al. (2005) use the Chu-Liu-Edmonds (CLE) algorithm to solve the maximum spanning tree problem."
o "To learn these structures we used online large-margin learning (McDonald et al., 2005) that empirically provides state-of-the-art performance for Czech."
• Source of contributions: citation summaries
• Case study: Qazvinian's single-paper summarisation corpus
  – 25 highly cited papers in the ACL Anthology Network, from 5 different domains
  – Two files provided for each paper:
    1. A citation summary
    2. A manually constructed key contribution list
Qualifying Key Contributions
• Statistical characteristics of key contributions (a worked example follows the figure):
  – Over-representedness: keywords that are frequently used when citing a paper
  – Exclusiveness: keywords that are used only when citing that paper
[Figure: citation summaries of three papers in a small domain, drawn as circles. Citation sentences containing W1 (triangles) are spread evenly across all three summaries; 8 of the 10 sentences containing W2 (dots) fall inside the target paper's summary.]
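To make the figure's intuition concrete, here is a minimal sketch in Python (the counts are read off the illustrative figure, not taken from the paper's data, and the 60-sentence domain size is an assumption): under a hypergeometric null in which the 10 sentences containing W2 land at random among the domain's citing sentences, seeing 8 or more of them in one paper's summary is extremely unlikely, and that statistical surprise is what qualifies W2 as a key contribution.

# Hypothetical counts from the illustrative figure, not real corpus data
from scipy.stats import hypergeom

N, n, K, k = 60, 20, 10, 8  # domain sentences, target-summary sentences, sentences with W2, W2 hits in target
p_value = hypergeom.sf(k - 1, N, K, n)  # P(X >= 8) under the random-scatter null
print(f"P(X >= {k}) = {p_value:.6f}")   # tiny p-value: W2 is over-represented and exclusive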
Multi-stage Extractive Summarisation
A four-stage pipeline, with each later stage consuming the output of earlier stages (details on the following slides).
Stage 1: Keyword Profiling
• Input:
  1. Citation summary of the target paper
  2. Citation summaries of all papers in the target paper's domain
• Method: one-tailed Fisher's exact test (see the sketch below)
• Output: keyword profile
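A minimal sketch of Stage 1, assuming citing sentences are already tokenised into word lists; the function name and corpus handling are illustrative, not the paper's implementation:

from scipy.stats import fisher_exact

def keyword_profile(target_sents, domain_sents):
    """Rank words by a one-tailed Fisher's exact test on their distribution
    over citing sentences (smaller p-value = more salient keyword).
    domain_sents holds the citing sentences of ALL papers in the target
    paper's domain, including the target's own."""
    n_target = len(target_sents)
    n_rest = len(domain_sents) - n_target
    profile = {}
    for w in {w for s in target_sents for w in s}:
        k = sum(w in s for s in target_sents)           # target sentences with w
        k_rest = sum(w in s for s in domain_sents) - k  # other sentences with w
        table = [[k, n_target - k], [k_rest, n_rest - k_rest]]
        _, p = fisher_exact(table, alternative="greater")
        profile[w] = p
    return sorted(profile.items(), key=lambda kv: kv[1])  # most salient first

The one-tailed p-value rewards over-representedness and exclusiveness at once: a word only scores well if it is frequent in the target paper's citation summary and comparatively rare elsewhere.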
Stage 2: Keyword Profile Language Modelling
• Input: paper keyword profile
• Method: negative log transformation (see the sketch below)
• Output: keyword profile language model (KPLM)
  1. Directly encodes words' salience as pseudo generative probabilities
  2. More discriminative than a traditional language model
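A hedged sketch of the negative log transformation, following the talk notes: each p-value becomes the pseudo count -log(p), and the pseudo counts are normalised into a probability distribution. A word with p = 1, the lowest possible salience, gets pseudo count 0 and, as the notes observe for W5, drops out of the model automatically:

import math

def build_kplm(profile):
    """Turn a keyword profile [(word, p_value), ...] into a unigram
    keyword profile language model (KPLM)."""
    pseudo = {w: -math.log(p) for w, p in profile}  # salience as pseudo count
    total = sum(pseudo.values())
    return {w: c / total for w, c in pseudo.items() if c > 0}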
Stage 3: Summarisation as Model Divergence IR
• Input: keyword profile language model
• Method: negative cross entropy retrieval model (see the sketch below)
• Output: top-k sentences whose maximum likelihood estimates (MLEs) have the smallest divergence from the KPLM
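A minimal sketch of the negative cross entropy retrieval step; the additive smoothing of the sentence MLEs, which keeps the logarithms finite, is my assumption rather than the paper's stated choice:

import math

def rank_sentences(kplm, sentences, k=10, eps=1e-6):
    """Score each tokenised citing sentence by the negative cross entropy
    between the KPLM and the sentence's smoothed unigram MLE:
    score(s) = sum_w P_KPLM(w) * log P_s(w); higher = smaller divergence."""
    def score(sent):
        n = len(sent)
        return sum(p * math.log((sent.count(w) + eps) / (n + eps * len(kplm)))
                   for w, p in kplm.items())
    return sorted(sentences, key=score, reverse=True)[:k]

Up to a constant that does not depend on the sentence, ranking by negative cross entropy is equivalent to ranking by negative KL divergence from the KPLM, which is why the output can be described as the top-k sentences of smallest model divergence.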
Stage 4: Novelty-driven Re-ranking
• Input: top-k ranked sentence pool
• Method: Top Sentence Re-ranking (TSR) (see the sketch below)
• Output: 5-sentence extractive summary
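A hedged sketch of Top Sentence Re-ranking as the talk notes describe it: keep the top-ranked sentence, then repeatedly append the pooled sentence that diverges most from the summary built so far. Using smoothed KL divergence as the divergence measure is my assumption:

import math
from collections import Counter

def tsr(pool, limit=5, eps=1e-6):
    """pool: tokenised sentences already ordered by Stage 3 relevance."""
    summary, remaining = [pool[0]], list(pool[1:])
    while remaining and len(summary) < limit:
        counts = Counter(w for s in summary for w in s)
        total = sum(counts.values())
        def novelty(sent):  # KL(MLE(sent) || LM(summary)), crudely smoothed to avoid log 0
            n = len(sent)
            return sum((f / n) * math.log((f / n) / ((counts[w] + eps) / (total + eps)))
                       for w, f in Counter(sent).items())
        best = max(remaining, key=novelty)
        summary.append(best)
        remaining.remove(best)
    return summary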
Results
• Evaluation method: Pyramid score (a scoring sketch follows)
• Performance comparison: KPLM outperforms the state-of-the-art C-LexRank system
• Resilience to more stringent summary size limits: mean Pyramid scores hold up well as the limit decreases from 5 sentences to 1
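A minimal sketch of the Pyramid score as described in the talk notes: the weighted sum of human-picked keywords a summary covers, divided by the weight achieved by an optimal summary. The brute-force search for the optimal summary is illustrative only and assumes a small candidate pool:

from itertools import combinations

def pyramid_score(summary, candidates, weights, limit=5):
    """summary, candidates: lists of tokenised sentences;
    weights: {keyword: weight} from human annotation."""
    def covered(sents):
        return sum(w for kw, w in weights.items() if any(kw in s for s in sents))
    best = max(covered(combo) for r in range(1, limit + 1)
               for combo in combinations(candidates, r))
    return covered(summary) / best if best else 0.0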
Conclusion
• KPLM: a multi-stage statistical summarisation framework
  1. Keyword profiling
  2. Keyword profile language modelling
  3. Summarisation as model-divergence-based IR
  4. Novelty-driven re-ranking
• State-of-the-art performance in summarising scientific papers
• Good resilience to more stringent summary length limits
• Future work:
  – Higher-order n-grams
  – Multiple-paper summarisation
THANK YOU!
QUESTIONS?

NAACL2015 presentation


Editor's Notes

• #3 Firstly, I would like to introduce the motivation behind this work. Science is not an isolated endeavour: it benefits from and expands on the work of others, with more or less cross-fertilisation between disciplines. Researchers thus constantly need to explore the scientific literature beyond the core of their own research, which calls for utilities that make this task less daunting. Our aim in this work is to develop such a tool, one that facilitates identification of the key contributions of a paper. First, we automatically identify the most important and unique contributions of a paper. Then, we automatically extract information-rich sentences that cover those contributions, forming an extractive summary that gives better context for how the contributions are discussed.
• #4 Well, how can we do this automatically? A first question is: from what textual sources could we extract the contributions of a paper? One source that comes to mind immediately is the abstract. Arguably, though, the citation summary can be regarded as a form of crowd-sourced review of a paper's main contributions, and prior work has confirmed that citation summaries offer more focused coverage of a paper's contributions. We therefore chose citation summaries as the corpus from which to mine contributions. As a case study, we use Qazvinian's … in this paper … Having decided on the source from which to mine a paper's contributions and identified the statistical characteristics of its main contributions, we introduce the data used in our experiments.
• #5 Having decided on the information source, a second question we need to answer is: how can we qualify the key contributions of a paper, or what are their characteristics? Imagine a tiny domain containing 3 papers whose citation summaries are represented by circles, and suppose we aim to find the main contributions of the green one. Denote citation sentences containing a word W1 with blue triangles: W1 is evenly distributed across the 3 citation summaries, so one would heuristically conclude that W1 is unlikely to be a keyword of the green paper. In contrast, mark citing sentences containing a word W2 with red dots: their distribution is highly concentrated in the green circle. Intuitively, W2 is much more likely to be a main contribution of the green paper, as 8 out of the 10 total mentions of W2 belong to its citation summary. This intuitive observation translates into statistical characteristics of a paper's key contributions that can be measured along 2 dimensions.
• #6 After consolidating the theoretical background, we present our approach: a multi-stage extractive summarisation framework consisting of 4 stages pipelined together, with each later stage consuming the output of earlier ones. In the first stage, we take the citation summary of the target paper, together with those of the papers belonging to the same domain, and generate a keyword profile of the target paper capturing words' over-representedness and exclusiveness in the target paper's citation summary. In the second stage, the keyword profile created in the first stage is processed into a keyword profile language model that incorporates words' salience in reflecting the paper's main contributions into pseudo generative probabilities … In the third stage, we cast the task of extractive summarisation as model-divergence-based IR and select the top-k ranked sentences best conforming to the paper's KPLM … It is a common problem that top-ranked sentences tend to cover the most important contributions repetitively while falling short in diversity of coverage. In the final stage, we therefore use a novelty-driven sentence re-ranking method to diversify contribution coverage and produce the extractive summary of a paper … The following slides dive into the details of each phase of our pipelined approach.
• #7 In the first stage of our framework, we build a keyword profile for a target paper; our method is informed by the statistical characteristics of key contributions discussed earlier. We use Fisher's exact test to measure each word's salience in characterising the target paper's main contributions. More specifically, we use the following hypergeometric distribution (reconstructed below) to model words' distribution over citing sentences … This is the probability of observing exactly k citing sentences containing word W in the target paper's citation summary. We then calculate the salience of each word using the one-tailed Fisher's exact test; the salience scores are simply the p-values. The output is the target paper's keyword profile: a list of words ranked by their p-values. It statistically and objectively captures words' salience in characterising the target paper's main contributions in terms of statistical surprise. Compared to the human-generated key contribution list on the right, our method proves highly accurate at identifying the paper's main contributions, closely mirroring those picked by human experts.
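The formulas the note refers to appeared on the slide rather than in the transcript; the reconstruction below uses the standard hypergeometric form, where N is the number of citing sentences in the domain, n of them belong to the target paper's citation summary, K contain word W, and k of the target's sentences contain W (this variable mapping is my reading of the description):

  P(X = k) = \frac{\binom{K}{k}\,\binom{N-K}{n-k}}{\binom{N}{n}}

  \text{salience}(W) = p\text{-value} = \sum_{i \geq k} P(X = i)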
• #8 We have now produced a keyword profile for a target paper, and we know that keywords differ in importance: a good summariser should favour sentences covering the most important ones. Intuitively, the keyword profile of a paper, which contains valuable information on words' salience in characterising its main contributions, should be used to drive such a discriminative sentence selection process. In the second stage, we therefore build a discriminative unigram language model from the paper's keyword profile to incorporate this information. We achieve this by converting words' salience into pseudo counts via a negative log transformation and normalising them into a probability distribution. The output of stage 2 is the paper's keyword profile language model. Here is an illustrative KPLM built for an imaginary document consisting of only 5 distinct words, W1 to W5 … We can see that … W5 is a non-keyword with the lowest salience possible, and it is automatically eliminated from the resulting KPLM. The keyword profile language model can function as a true language model, but it has the following advantages over a traditional one. Firstly, the KPLM is constructed not from actual word frequencies but from pseudo frequencies that directly encode words' salience in characterising the target paper's main contributions. Secondly, the pseudo document corresponding to the target paper's keyword profile is essentially a bag of keywords repeating themselves according to their importance; it thus has more discriminative power than a traditional language model.
• #9 The KPLM of a paper is a discriminative generative model that encodes, as generative probabilities, words' salience in characterising the paper's main contributions. It thus represents an effective language model from which a model citing sentence covering the paper's main contributions could be sampled. In the third stage, we therefore cast the task of extractive summarisation as model-divergence-based IR. More specifically, we adopt the negative cross entropy retrieval model to select the citation sentences that best conform to the paper's KPLM. We first build a maximum likelihood estimation LM for each citing sentence. Then we measure the divergence between the MLE of each citing sentence and the paper's KPLM: the smaller the divergence, the better the citing sentence captures the target paper's main contributions.
• #10 We now have a sentence pool containing the top-k citing sentences that best conform to the target paper's KPLM. However, they are likely to cover the most important contributions repetitively while falling short in the diversity of contributions covered. The 4th and final stage of our framework therefore performs novelty-driven re-ranking to diversify contribution coverage. We call our method Top Sentence Re-ranking, or TSR; it implements a very simple heuristic. We first select the top-ranked sentence of the pool into our extractive summary. Then, in subsequent iterations, we select the citing sentence left in the pool with the largest divergence from the current extractive summary and append it to the end of the summary, until the summary length limit, conventionally 5, is reached. We then output the extractive summary.
• #11 For evaluation, we use the pyramid method, a popular evaluation metric that scores summaries by how well they cover human-picked keywords. More specifically, we first calculate the total weight of the keywords a summary covers. The pyramid score of a summary is then the ratio between the weighted sum of the keywords it covers and that of an optimal summary with the highest total weight. Comparing our results with a state-of-the-art system, C-LexRank, it is clear that KPLM performs better. Finally, an artificially imposed limitation in the evaluation is the summary length limit, which may be adjusted to suit a specific application context; the summarisation task becomes increasingly challenging as this limit is tightened. To evaluate KPLM's performance under more stringent summary length limits, we gather its mean pyramid scores under limits decreasing from 5 to 1 and visualise the results.
• #12 We present KPLM, a multi-stage statistical summarisation framework consisting of 4 stages … In the future, we plan to extend our method to higher-order n-grams, to see whether larger information units further boost summarisation performance. We also plan to extend our approach to multiple-paper summarisation, and more specifically to automatically generating surveys of scientific domains.