This document summarizes a research presentation that classified CNN news articles using term frequency–inverse document frequency (TF*IDF) metrics. It discusses the TF*IDF family of metrics used to weight word frequencies and describes the preprocessing steps and algorithms used for classification. The dataset contained 3,000 news articles from 12 CNN categories, split into training and test sets. Keywords were extracted and weighted using various TF*IDF variants, and articles were classified by comparing their words against the training-set categories.
Classification of CNN.com Articles using a TF*IDF Metric

1. Classification of CNN.com Articles using a TF*IDF Metric
Marie Vans and Steven Simske, HP Labs; Fort Collins, Colorado
April 20, 2016
2. Agenda
• TF*IDF Family of Metrics
• Word Frequencies
• Data Set & Preprocessing
• Algorithms for word frequencies and classification
• An example
• Results
• Future Directions
• Conclusions
6. TF*IDF – Family – Inverse Document Frequency

IDF Name and IDF Equation (reconstructed from the slide's notation):

13 NormPowerOfSums: $\left( \frac{\sum_{j=1}^{N-1} k_j}{w_{i,n}} \right)^{Power}$

14 NormPowersOfSums: $\frac{\sum_{j=1}^{N-1} k_j^{DocPower}}{w_{i,n}^{WordPower}}$

where
i = current word
j = current document
k_j = total words in document j
n = total words in documents other than the current document
N = total number of documents in the corpus
w_{i,j} = number of occurrences of word i in document j
w_{i,n} = number of occurrences of word i in all other documents
n_i = number of documents in which word i occurs
LogRatio = ratio of the log for an individual word to the log for the document length
MinLogRatio = user-settable minimum for LogRatio
WordPower and DocPower = adjustable exponents
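A minimal sketch of how these two IDF variants might be computed, assuming NormPowerOfSums raises the ratio of summed other-document lengths to the word's outside-occurrence count to an adjustable Power, and NormPowersOfSums applies separate DocPower and WordPower exponents. Function names and signatures are illustrative, not the authors' implementation.

```python
def norm_power_of_sums(other_doc_lengths, w_in, power=1.0):
    """(sum of other-document lengths k_j / occurrences w_{i,n} of word i
    in other documents) ** Power. All names are illustrative."""
    return (sum(other_doc_lengths) / w_in) ** power

def norm_powers_of_sums(other_doc_lengths, w_in, doc_power=1.0, word_power=1.0):
    """Sum of k_j ** DocPower divided by w_{i,n} ** WordPower."""
    return sum(k ** doc_power for k in other_doc_lengths) / (w_in ** word_power)

# Toy corpus: three other documents of lengths 100, 200, and 300 words;
# word i occurs 6 times outside the current document.
lengths = [100, 200, 300]
print(norm_power_of_sums(lengths, w_in=6, power=2.0))                  # (600/6)**2 = 10000.0
print(norm_powers_of_sums(lengths, w_in=6, doc_power=1.0, word_power=2.0))  # 600/36
```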
9. CNN Data Set

Class Name   TTL Number of Files   Training Set Files   Test Set Files
Business     161                   81                   80
Health       290                   145                  145
Justice      224                   112                  112
Living       98                    49                   49
Opinion      192                   96                   96
Politics     195                   98                   97
Showbiz      241                   121                  120
Sport        148                   74                   74
Tech         132                   66                   66
Travel       171                   86                   85
US           160                   80                   80
World        988                   494                  494

• 12 classes
• 3,000 total files
• Each class split into 2 sets: a training set and a test set
• File classes ground-truthed by CNN

Rafael Dueire Lins, Steven J. Simske, Luciano de Souza Cabral, Gabriel de Silva, Rinaldo Lima, Rafael F. Mello, and Luciano Favaro. A multi-tool scheme for summarizing textual documents. In Proceedings of the 11th IADIS International Conference WWW/INTERNET 2012, pages 1–8, July 2012.
10. CNN Data Set

Class Name   TTL Train Set Unique Words   TTL Test Set Unique Words   Total Words Processed
Business     8278                         7851                        16129
Health       12246                        12036                       24282
Justice      9133                         9032                        18165
Living       7936                         7030                        14966
Opinion      11382                        10886                       22268
Politics     9268                         9039                        18307
Showbiz      8997                         9949                        18946
Sport        7445                         7191                        14636
Tech         7971                         7548                        15519
Travel       14931                        12612                       27543
US           8488                         8707                        17195
World        22936                        23441                       46377

• 12 classes
• Total words: 254,333
• Training set: 129,011
• Test set: 125,322
11. Preprocessing
• Remove “stop words”
• Remove punctuation (hyphenation excepted)
• No lemmatization
• SharpNLP – Open Source Natural Language Processing
(https://sharpnlp.codeplex.com/)
• sentence splitter
• tokenizer
• part-of-speech tagger
• chunker
• parser
• name finder
• coreference tool
• interface to the WordNet lexical database
• File parsed with each word tagged with part of speech
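SharpNLP is a C# toolkit; as a rough stand-in, the first two preprocessing steps (stop-word removal, punctuation removal with hyphens kept, no lemmatization) might be sketched in Python as follows. The stop-word list here is a tiny illustrative sample, not the one the paper used.

```python
import re

# Small illustrative stop-word list; the actual list used is not given in the slides.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for"}

def preprocess(text):
    """Lowercase, strip punctuation while keeping intra-word hyphens,
    and drop stop words. No lemmatization is applied, matching the slide."""
    # Keep word characters, whitespace, and hyphens; replace other punctuation.
    cleaned = re.sub(r"[^\w\s-]", " ", text.lower())
    tokens = [t.strip("-") for t in cleaned.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]

print(preprocess("The switch-over to renewable energy, in Germany."))
# ['switch-over', 'renewable', 'energy', 'germany']
```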
12. Program Classes (Not CNN Classes)

• Word Class
  • m_Spelling
  • m_Count (frequency of word in file)
  • m_Weight (assigned by different TF*IDF measures)
  • m_HasHyphen (hyphenated words count as a single word)
  • m_PennTags (part-of-speech tag)
  • m_Tags (number of tags associated with word)
• TermFrequencies Class
  • m_TermName
  • m_TermFreq
• Classify Class
  • m_businessWords
  • m_healthWords
  • m_justiceWords
  • m_livingWords
  • m_opinionWords
  • m_politicsWords
  • m_showbizWords
  • m_sportWords
  • m_techWords
  • m_travelWords
  • m_usWords
  • m_worldWords
  • m_confusionMatrx
13. Algorithm

A. Using the Training Set files in each class (i.e., do this 12 times):
1.0 For each file in the set: create a word object for every unique word in the file
2.0 Count the total number of occurrences of each unique word across the entire set of documents
3.0 Calculate the weight of each word: total occurrences of word_i in all files / total occurrences of all words in all files

$T_{word_i} = \frac{\sum_{j=1}^{nfiles} f(w_{i,j})}{\sum_{i=1}^{nwords} \sum_{j=1}^{nfiles} f(w_{i,j})}$
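Step 3.0 above can be sketched directly: sum each word's occurrences over all files in the class and divide by the total word count. The helper name is illustrative.

```python
from collections import Counter

def class_word_weights(files):
    """Weight of each unique word in a training class:
    occurrences of the word across all files in the class /
    total occurrences of all words in the class.
    `files` is a list of token lists, one per preprocessed file."""
    counts = Counter()
    for tokens in files:
        counts.update(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Toy class of two "files":
weights = class_word_weights([["nuclear", "energy", "nuclear"],
                              ["germany", "energy"]])
print(weights["nuclear"])  # 2 occurrences / 5 total words = 0.4
```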
14. Algorithm

B. Using the Test Set files in a specific class (e.g., Business):
1.0 For each file in the set: create a word object for every unique word in the file
2.0 Count the total number of occurrences of each unique word across the entire set of documents
3.0 Calculate the weight of each word:
T_{word_{i,j}} = total occurrences of word_i in the file / total occurrences of all words in the file
T_{word_i} = total occurrences of word_i in all files / total occurrences of all words in all files

$T_{word_{i,j}} = \frac{f(w_{i,j})}{\sum_{i=1}^{nwords} f(w_{i,j})}$

$T_{word_i} = \frac{\sum_{j=1}^{nfiles} f(w_{i,j})}{\sum_{i=1}^{nwords} \sum_{j=1}^{nfiles} f(w_{i,j})}$
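The per-file weight T_{word_{i,j}} is the same normalization applied to a single file rather than a whole class; a minimal sketch (illustrative name):

```python
from collections import Counter

def file_word_weights(tokens):
    """Per-file weight: occurrences of word i in the file /
    total occurrences of all words in the file (T_{word_{i,j}})."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}

w = file_word_weights(["nuclear", "energy", "nuclear", "germany"])
print(w["nuclear"])  # 2 / 4 = 0.5
```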
15. Algorithm

C. Classify each word_i in one test file by comparing it to the same word in all training classes:
e.g., Test_word_i = word_i in a Business test file

C_business = Test_word_i × Business_Train_word_i
C_health = Test_word_i × Health_Train_word_i
C_justice = Test_word_i × Justice_Train_word_i
⋮
C_world = Test_word_i × World_Train_word_i

Class = Max(C_business, C_health, …, C_world)
16. Algorithm

D. Classify each word_i in the entire test class by comparing it to the same word in all training classes:
e.g., Test_word_i = word_i in the Business test class

C_business = Test_word_i × Business_Train_word_i
C_health = Test_word_i × Health_Train_word_i
C_justice = Test_word_i × Justice_Train_word_i
⋮
C_world = Test_word_i × World_Train_word_i

Class = Max(C_business, C_health, …, C_world)
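The classification step above can be sketched as a sum of products over the test words, with the maximum-scoring class winning. The weight values here are made up for illustration; only the structure follows the slides.

```python
def classify(test_weights, train_class_weights):
    """Score each training class by summing test-weight x train-weight
    products over the test words, and pick the class with the max score.

    test_weights:        {word: weight} for one test file (or test class).
    train_class_weights: {class_name: {word: weight}} for the training classes.
    """
    scores = {}
    for name, train_weights in train_class_weights.items():
        scores[name] = sum(w * train_weights.get(word, 0.0)
                           for word, w in test_weights.items())
    return max(scores, key=scores.get), scores

# Illustrative weights only:
test = {"nuclear": 0.5, "germany": 0.25}
train = {"business": {"nuclear": 0.002, "germany": 0.001},
         "travel":   {"nuclear": 0.0001, "germany": 0.0008}}
label, scores = classify(test, train)
print(label)  # 'business' (0.00125 vs 0.00025)
```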
17. CNN Data Set – Example Article – Business Class

Could Germany's nuclear gamble backfire?
As Germany's switchover from nuclear power to renewable energy gathers pace, concerns are mounting over
the cost to the country's prosperity and its already squeezed consumers.
Politicians in Europe's largest economy want renewable power to contribute 35% of the country's electricity
consumption by 2020 and 80% by 2050 as part of its clean energy drive.
The country's 'energiewende' -- translated as energy transformation -- is part of the government's plan to move
away from nuclear power and fossil fuels to renewable energy sources, following Japan's disaster
in 2011.
Michael Limburg, vice-president of the European Institute for Climate and Energy, told CNN that the
government's energy targets are 'completely unfeasible.'
'Of course, it's possible to erect tens of thousands of windmills but only at an extreme cost and waste of natural
space,' he said.
'And still it would not be able to deliver electricity when it is needed.'
The government is investing heavily in onshore and offshore wind farms and solar technology in an effort to
reduce 40% of greenhouse gas emissions by 2020.
Last year Chancellor Angela Merkel, who this week won her third term as Germany's leader, proposed to
construct offshore wind farms in the North Sea, a plan that would cost 200 billion euros ($270 billion), according
to the DIW economic institute in Berlin.
As part of the energy drive, Merkel also pledged to permanently shut down the country's 17 nuclear reactors,
which fuel 18% of the country's power needs.
Under Germany's Atomic Energy Act, the last nuclear power plant will be disconnected by 2022.
18. CNN Data Set – Example Frequencies – Training Set

Single file:
  fukushima   m_TermFreq = 6
  germany     m_TermFreq = 1
  nuclear     m_TermFreq = 12

All files in class:
  fukushima   m_TermFreq = 9
  germany     m_TermFreq = 26
  nuclear     m_TermFreq = 33

% occurrence in file:
  fukushima   0.0102739726027397
  germany     0.00156739811912226
  nuclear     0.0205479452054795

% occurrence in class:
  fukushima   0.000307188203972967
  germany     0.000887432589255239
  nuclear     0.00112635674790088
19. CNN Data Set – Example Frequencies – Test Set

Single file:
  fukushima   m_TermFreq = 2
  germany     m_TermFreq = 9
  nuclear     m_TermFreq = 5

All files in class:
  fukushima   m_TermFreq = 2
  germany     m_TermFreq = 21
  nuclear     m_TermFreq = 6

% occurrence in file:
  fukushima   0.0056022408963585
  germany     0.0252100840336134
  nuclear     0.0140056022408964

% occurrence in class:
  fukushima   0.000073773515308004
  germany     0.000774621910734047
  nuclear     0.000221320545924013
20. Classify All Words in a Single Business Test File

Business  0.00069364  ← MAX class value
Health    0.00030063
Justice   0.00025000
Living    0.00026707
Opinion   0.00033446
Politics  0.00034694
Showbiz   0.00025372
Sport     0.00029984
Tech      0.00033337
Travel    0.00023201
US        0.00031539
World     0.00040208
21. Classify All Words in All Business Test Files

Business  0.00059513  ← MAX class value
Health    0.00035854
Justice   0.00027830
Living    0.00038269
Opinion   0.00039295
Politics  0.00036828
Showbiz   0.00029162
Sport     0.00036698
Tech      0.00040147
Travel    0.00032406
US        0.00032592
World     0.00037747
22. Confusion Matrix

• Each column contains samples of classifier output
• Each row contains samples in the true class
• Each row sums to 1.0
• The diagonal shows the percent classified correctly
  • Mean of diagonal = 89%
• Off-diagonal entries show the types of errors that occur
  • A is misclassified as B – 3%
  • A is misclassified as C – 3%

Normalized confusion matrix (rows = true class of the samples (input), columns = classifier output (computed classification / prediction)):

       A     B     C
A    0.94  0.03  0.03
B    0.08  0.85  0.07
C    0.08  0.04  0.88
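The "mean of diagonal" figure quoted above follows directly from the example matrix:

```python
# Normalized confusion matrix from the slide: rows are true classes,
# columns are predicted classes, and each row sums to 1.0.
matrix = [
    [0.94, 0.03, 0.03],  # true class A
    [0.08, 0.85, 0.07],  # true class B
    [0.08, 0.04, 0.88],  # true class C
]

# Mean of the diagonal = average per-class accuracy.
diag_mean = sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)
print(round(diag_mean, 2))  # 0.89
```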
23. Results – Classification

Rows = true class; columns = predicted class:

           business  health  justice  living  opinion  politics  showbiz  sport   tech    travel  us      world
business   0.75      0       0        0.0875  0        0         0        0.025   0.1125  0.025   0       0
health     0         0.7724  0.0207   0.1793  0        0.0138    0.0069   0       0.0069  0       0       0
justice    0         0       0.9018   0.0179  0        0.0446    0.0089   0       0       0       0.0268  0
living     0.0204    0.0408  0        0.8163  0        0.0204    0        0.0612  0.0408  0       0       0
opinion    0.2083    0.0729  0.0208   0.2708  0.0313   0.1667    0        0.0417  0.0417  0       0.0104  0.1354
politics   0.0103    0.0103  0.0515   0.0412  0        0.8557    0        0       0       0       0       0.0309
showbiz    0.0083    0.0083  0.1583   0.1417  0        0.0083    0.6417   0.025   0       0       0.0083  0
sport      0.027     0.0135  0.0405   0.0541  0        0.027     0        0.8108  0       0.027   0       0
tech       0.0303    0.0303  0.0152   0.2121  0        0         0.0152   0       0.6818  0.0152  0       0
travel     0.1412    0.0118  0.0235   0.1412  0        0.0824    0        0.0353  0.0588  0.4353  0.0471  0.0235
us         0.025     0.05    0.3125   0.175   0        0.1125    0.025    0.0625  0.0375  0.0125  0.1875  0
world      0.0769    0.0142  0.1316   0.0789  0.002    0.1255    0.0061   0.0142  0.0202  0.002   0.0142  0.5142

Note that the diagonal entries are the correct classifications.
The rows sum to 1.0, since each row represents the actual class from which the documents are taken.
The column sums have a mean of 1.0, with some variance depending on whether the class in the column is an attractor class (sum > 1.0) or a repulsor class (sum < 1.0).
24. Example of Incorrectly Classified File

Results for a file from the Opinion test class:

Business  0.00033924
Health    0.00025056
Justice   0.00027728
Living    0.00027807
Opinion   0.00041936  ← 3rd MAX class value
Politics  0.00046704  ← MAX class value
Showbiz   0.00023136
Sport     0.00028422
Tech      0.00025991
Travel    0.00021793
US        0.00032973
World     0.00043251  ← 2nd MAX class value

It takes 3 tries to get it right.
25. Classification Attempts to Success

Measures the average number of attempts until the correct class is chosen:

(1 × P1 + 2 × P2 + 3 × P3 + … + 12 × P12) / nfiles

where
P1 = number of files correctly classified on the first try
P2 = number of files correctly classified on the second try
⋮
P12 = number of files correctly classified on the last try
nfiles = number of files in the test class

Example – worst class, Opinion (96 files):
1×3 + 2×35 + 3×31 + 4×14 + 5×7 + 6×3 + 7×2 + 8×1 = 297
297 / 96 = 3.09
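The Opinion-class example above can be checked mechanically; the dictionary maps each attempt number to how many files were first classified correctly on that attempt:

```python
# Attempts-to-success counts for the Opinion test class, from the slide:
# attempt number -> number of files first classified correctly on that attempt.
attempts = {1: 3, 2: 35, 3: 31, 4: 14, 5: 7, 6: 3, 7: 2, 8: 1}

nfiles = sum(attempts.values())                    # 96 files in the class
score = sum(k * p for k, p in attempts.items()) / nfiles  # weighted mean attempt
print(nfiles, round(score, 2))  # 96 3.09
```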
26. Results – Classification Attempts to Success

• Measures the average number of attempts until the correct class is chosen
• Ideal is 1.0 – we get it right on the first try
• Best class – Justice
  • Correctly classified: 0.9018
  • Mean classification attempts: 1.19
  • Delta from ideal = 0.19
• Worst class – Opinion
  • Correctly classified: 0.0313
  • Mean classification attempts: 3.09
  • Delta from ideal = 2.09
• The best class's classification-attempts delta is 11 times smaller than the worst class's
• All other classes fall between the best and worst
27. Results – General

• The confusion matrix shows good classification results
  • Average classification rate for all classes = 0.61655883
• Classification errors:
  • Attractor classes (column sum > 1.0): Business, Health, Justice, Living, Politics, Sport, Tech
  • Repulsor classes (column sum < 1.0): Opinion, Showbiz, Travel, U.S., World
• Normalized by total occurrences of all words in the file, for classification of a single file
• Normalized by total occurrences of all words in the class, for classification of multiple files

Column sums of the confusion matrix:

business  health  justice  living  opinion  politics  showbiz  sport   tech    travel  us      world
1.2978    1.0245  1.6765   2.216   0.0333   1.4569    0.7037   1.0757  1.0003  0.517   0.2943  0.704
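The attractor/repulsor split above follows directly from the column sums; a minimal sketch using the values reported on this slide:

```python
# Confusion-matrix column sums from the slide (predicted-class totals).
column_sums = {
    "business": 1.2978, "health": 1.0245, "justice": 1.6765, "living": 2.216,
    "opinion": 0.0333, "politics": 1.4569, "showbiz": 0.7037, "sport": 1.0757,
    "tech": 1.0003, "travel": 0.517, "us": 0.2943, "world": 0.704,
}

# Attractor classes pull in documents (sum > 1.0); repulsors push them away.
attractors = [c for c, s in column_sums.items() if s > 1.0]
repulsors = [c for c, s in column_sums.items() if s < 1.0]
print(attractors)  # business, health, justice, living, politics, sport, tech
print(repulsors)   # opinion, showbiz, travel, us, world
```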
29. Future Directions

• Automatic summarization based on word frequencies in sentences
  • The data from Brazil also contained Gold Standard sentences for summarization
  • Each file contains sentences pulled out of the full article by at least 3 students
  • The Gold Standard sentences for each file act as ground truth for automatic summarization
• New York Times Annotated Corpus (https://catalog.ldc.upenn.edu/LDC2008T19)
  • Articles written and published by the New York Times between January 1, 1987 and June 19, 2007
  • Metadata provided by the New York Times Newsroom, the New York Times Indexing Service, and the online production staff at nytimes.com:
    • Over 1.8 million articles
    • Over 650,000 article summaries written by library scientists
    • Over 1,500,000 articles manually tagged by library scientists with tags drawn from a normalized indexing vocabulary of people, organizations, locations, and topic descriptors
    • Over 275,000 algorithmically-tagged articles that have been hand verified by the online production staff at nytimes.com
30. Conclusions

• A family of TF*IDF metrics for summarization and classification
• A simple TF*IDF metric and classification scheme that works well on a set of 3,000 CNN articles separated into 12 classes
• Classification attempts to success is a measure that tells us how hard a class is to classify
• Attractor and repulsor classes may help in identifying imbalances in the data
• The simple TF*IDF metric can be used for benchmarking the rest of the 112 TF*IDF metrics