TL;DR
CAPSTONE PROJECT FOR BABI – SEPTEMBER 2017
Presented by – Narayanamurthy T, Pavan Vasantham, Sriharsha Manne, Pavan M & Madan P
Table of Contents
• Motivation
• Task definition & Basic Approach
• Data
• Feature Engineering
• Exploratory Analysis
• Classification
• Summarizing Factors
• Schematic Summary Processing Model
• Approach
• Extractive – Unsupervised
• Abstractive – Supervised
• Model Output
• Evaluation
Motivation
Text summarization, computer-based production of
condensed versions of documents, is an important
technology for the information society.
• Without summaries it would be practically
impossible for human beings to access the
ever-growing mass of information available
online.
• Research in text summarization is over fifty
years old.
• Some efforts are still needed given the
insufficient quality of automatic summaries and
the number of interesting summarization topics
being proposed in different contexts by end
users ("domain-specific summaries", "opinion-
oriented summaries", "update summaries", etc.)
Task Definition and Basic Approach
Extractive - Select relevant phrases of
the input document and concatenate
them to form a summary (like "copy-
and-paste")
• Pros: They are quite robust since
they use existing natural-language
phrases that are taken straight from
the input.
• Cons: They lack flexibility since they
cannot use novel words or connectors,
and they cannot paraphrase the way
people sometimes do.
Abstractive - Generate a summary
that keeps the original intent, much
as humans do.
• Pros: They can use words that were
not in the original input, which
enables more fluent and natural
summaries.
• Cons: It is a much harder problem,
since we now require the model to
generate coherent phrases and
connectors.
We can regard summarization as a function whose input is a document
and whose output is a summary; the input and output types help us
categorize the various summarization tasks.
Data
For most of us, it’s impractical to download all the data from
the web; we must first identify the data sources we want to
target. Data, of course, covers a very wide range of quality,
volume, applicability, and accessibility.
We obtained the BBC / Daily Mail dataset, which pairs each news
article with a summary written by a real person.
It consists of 2,225 documents from the BBC news website,
corresponding to stories in five topical areas from 2004 to
2005.
Feature Engineering
Statistical count features from headline text
• Word Count - Total number of words in the
headline
• Character Count - Total number of characters in
the headline excluding spaces
• Word Density - Average length of the words used
in the headline
• Punctuation Count - Total number of
punctuation marks used in the headline
• Upper-Case to Lower-Case Words ratio - ratio of
upper case words used and lower case words
used in the text
Headline Text Features
• Sentiment: Polarity - sentiment value of the
headline computed using the TextBlob package
• Part of Speech: Nouns to Verbs Ratio - ratio of
nouns and verbs used in the text
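The statistical count features above can be sketched with a small stdlib-only helper (the function name and exact tokenization are illustrative; the sentiment and part-of-speech features would additionally need a package such as TextBlob):

```python
import string

def headline_features(headline: str) -> dict:
    """Compute the statistical count features listed above.

    A minimal stdlib-only sketch: words are split on whitespace, and a
    word counts as upper-case when its first letter is capitalized.
    """
    words = headline.split()
    chars = sum(len(w) for w in words)  # characters excluding spaces
    punct = sum(1 for ch in headline if ch in string.punctuation)
    upper = sum(1 for w in words if w[0].isupper())
    lower = len(words) - upper
    return {
        "word_count": len(words),
        "char_count": chars,
        "word_density": chars / max(len(words), 1),  # average word length
        "punct_count": punct,
        "upper_lower_ratio": upper / max(lower, 1),
    }

feats = headline_features("Hainan to Curb Spread of Diseases")
```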
Exploratory Analysis
Word Count Distribution, Inference
From the graph, it can be seen that most
journalists prefer writing headlines that
contain about 10 words. Very few headlines
(about 3,000) have extremely high word counts.
Character Count of Headlines, Inference
Similar to the word count graph, the character
count graph also shows a normal distribution,
with a mean character count of 65; headlines in
the range of 40-80 characters are generally
preferred by readers.
Exploratory Analysis
Punctuations used, Inference
It is general practice among authors to
write headlines that have about 1-4
punctuation marks.
Keywords in a news article, Inference
Close to 7-8 keywords are identified for
every published article.
Classification
Evaluating each classifier's ability to select the appropriate category
given an article’s title and a brief article description
Models implemented
• Multinomial Naive Bayes
• Support Vector Machines
• Neural Network with Softmax Layer
Metrics used to evaluate
• Accuracy
• Recall
• F1 Score
Classification
Accuracy - The proportion of articles the model
classifies correctly, e.g. predicting whether a
news article is Politics or Sport. Ex: if 60% of
our news articles are Politics and 40% are Sport,
a model that always predicts Politics is still
60% accurate, so accuracy alone can mislead.
Recall - Gives us information about a classifier’s
performance with respect to false negatives
(how many did we miss).
F1 Score - A single score that represents both
Precision (P) and Recall (R), making results
easier to compare.
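The three metrics can be computed from scratch in a few lines (a minimal sketch; the labels and the choice of "politics" as the positive class are arbitrary examples):

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall, and F1 for one positive class.

    A from-scratch sketch of the metrics described above; libraries
    such as scikit-learn provide the same computations.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

y_true = ["politics", "politics", "sport", "politics", "sport"]
y_pred = ["politics", "sport",    "sport", "politics", "politics"]
acc, p, r, f1 = classification_metrics(y_true, y_pred, positive="politics")
```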
Summarizing factors
• Input
• subject type: domain
• genre: newspaper articles, editorials, letters,
reports...
• form: regular text structure; free-form
• source size: single doc; multiple docs (few; many)
• Purpose
• situation: embedded in larger system (MT, IR) or not?
• audience: focused or general
• usage: IR, sorting, skimming...
• Output
• completeness: include all aspects, or focus on some?
• format: paragraph, table, etc.
• style: informative, indicative, aggregative, critical...
• Extracting entities
• such as companies, people, dollar amounts, key
initiatives, etc.
• Categorizing content
• positive or negative (e.g. sentiment analysis), by
function, intention or purpose, or by industry or other
categories for analytics and trending
• Clustering content
• identify main topics of discourse and/or to discover
new topics
• Fact extraction
• fill databases with structured information for analysis,
visualization, trending, or alerts
• Relationship extraction
• fill out graph databases to explore real-world
relationships
Schematic summary processing model
Source text → Interpretation → Source representation → Transformation → Summary representation → Generation → Summary text
Three approaches to performing extraction
Top Down
Determine Part of Speech,
understand and diagram
sentence into clauses,
nouns, verbs, object and
subject, modifying
adjectives and adverbs,
etc., then traverse this
structure to identify
structures of interest
•Advantages – can handle
complex, never-seen-before
structures and patterns
•Disadvantages – hard to
construct rules, brittle, often fails
with variant input, may still
require substantial pattern
matching even after parsing
Bottom Up
Create lots of patterns,
match patterns to text and
extract necessary facts,
patterns may be manually
entered or may be
computed using text
mining
•Advantages – easy to create
patterns, can be done by
business users, does not require
programming, easy to debug and
fix, runs fast, matches directly to
desired outputs
•Disadvantages – requires on-
going pattern maintenance,
cannot match on newly invented
constructs
Statistical
Similar to bottom-up, but
matches patterns against a
statistically weighted
database of patterns
generated from tagged
training data
•Advantages – patterns are
created automatically, built-in
statistical trade-offs
•Disadvantages – requires
generating extensive training
data (1000’s of examples), will
need to be periodically retrained
for best accuracy, cannot match
on newly invented constructs,
harder to debug
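As a hypothetical illustration of the bottom-up approach, a couple of hand-written regex patterns matched directly against text (the pattern names and fact types are invented for this sketch):

```python
import re

# Hand-written extraction patterns matched directly against the text:
# the bottom-up approach, where patterns map straight to desired outputs.
PATTERNS = {
    "money":   re.compile(r"\$\s?\d[\d,]*(?:\.\d+)?\s?(?:million|billion)?"),
    "quarter": re.compile(r"(?:first|second|third|fourth)-quarter"),
}

def extract_facts(text: str) -> dict:
    """Return every match of every pattern found in the text."""
    return {name: pat.findall(text) for name, pat in PATTERNS.items()}

facts = extract_facts("MGM reported a third-quarter net loss of $16 million.")
```

Easy to create and debug, as the slide notes, but the patterns need ongoing maintenance and cannot match constructs they were never written for.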
Content selection: Unsupervised (Extractive)
Choose sentences that have distinguished or informative words.
Two approaches to define distinguished words:
• tf-idf: weigh each word wi in document j by its tf-idf score
• Topic signature: choose a smaller set of distinguished words using the log-likelihood ratio (LLR)
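The tf-idf approach can be sketched as follows (a minimal stdlib-only sketch; real systems would add proper tokenization, stemming, and stop-word removal):

```python
import math
from collections import Counter

def tokenize(text):
    # Crude tokenization: lowercase, treat periods as whitespace
    return text.lower().replace(".", " ").split()

def tfidf_summary(documents, doc_index, n_sentences=1):
    """Score each sentence by the average tf-idf weight of its words
    and return the top-scoring sentences as an extractive summary.
    """
    # Document frequency of each word across the corpus
    df = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))
    n_docs = len(documents)

    # Term frequency within the target document
    tf = Counter(tokenize(documents[doc_index]))
    sentences = [s.strip() for s in documents[doc_index].split(".") if s.strip()]

    def score(sentence):
        words = tokenize(sentence)
        return sum(tf[w] * math.log(n_docs / df[w]) for w in words) / len(words)

    return sorted(sentences, key=score, reverse=True)[:n_sentences]

top = tfidf_summary(
    ["the cat chased the mouse. the weather was sunny",
     "the weather was rainy. the weather was cold"],
    doc_index=0,
)
```

Words shared across the corpus get idf near zero, so sentences built from document-specific words float to the top.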
Content selection: Supervised (Abstractive)
Given: a labeled training set of good summaries for each document.
Align: the sentences in the document with sentences in the summary.
Abstractive features:
• Sequence to Sequence
• GloVe embedding
• Recurrent Neural Network
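The alignment step can be approximated by maximum word overlap (a hypothetical sketch; stronger systems would use embedding similarity, e.g. GloVe vectors, rather than raw word sets):

```python
def align_sentences(doc_sentences, summary_sentences):
    """Pair each summary sentence with the document sentence that
    shares the most words with it.

    A hypothetical sketch of the alignment step; ties go to the
    earliest document sentence.
    """
    def overlap(a, b):
        return len(set(a.lower().split()) & set(b.lower().split()))

    return [
        max(doc_sentences, key=lambda d: overlap(d, s))
        for s in summary_sentences
    ]

doc = [
    "the hainan province will implement strict market access control",
    "officials hope to prevent the spread of epidemic diseases",
]
summary = ["hainan to curb spread of diseases"]
aligned = align_sentences(doc, summary)
```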
Model output
Input: article’s first sentence | Output: ideal model-written headline
metro-goldwyn-mayer reported a third-
quarter net loss of dlrs 16 million due mainly
to the effect of accounting rules adopted this
year
mgm reports 16 million net loss on higher
revenue
starting from july 1, the island province of
hainan in southern china will implement
strict market access control on all incoming
livestock and animal products to prevent the
possible spread of epidemic diseases
hainan to curb spread of diseases
For shorter texts, summarization can be learned
end-to-end with a deep learning technique
called sequence-to-sequence learning; we are
able to train such models to produce very good
headlines for news articles.
We observed that, due to the nature of news
headlines, the model can generate good
headlines from reading just a few sentences
from the beginning of the article.
The first column shows the first sentence of a news article, which
is the model input, and the second column shows the headline
the model has written.
EVALUATING SUMMARIES: ROUGE
• ROUGE: “Recall-Oriented Understudy for Gisting
Evaluation”
• Given a document D, and an automatic summary X:
• Have N humans produce a set of reference
summaries of D
• Run the system, giving automatic summary X
• What percentage of the bigrams from the
reference summaries appear in X?
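The bigram-overlap computation can be sketched as follows (a single-reference sketch; ROUGE as defined aggregates over the N reference summaries):

```python
from collections import Counter

def rouge_2_recall(reference: str, candidate: str) -> float:
    """ROUGE-2 recall: the fraction of reference bigrams that also
    appear in the candidate summary, with clipped counts.
    """
    def bigrams(text):
        words = text.lower().split()
        return Counter(zip(words, words[1:]))

    ref, cand = bigrams(reference), bigrams(candidate)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[bg]) for bg, count in ref.items())
    return overlap / sum(ref.values())

score = rouge_2_recall(
    "hainan to curb spread of diseases",       # reference summary
    "hainan curbs the spread of diseases",     # automatic summary X
)
```

Being recall-oriented, the score rewards covering the reference; a summary can score well here while still containing extra, unrewarded bigrams.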
