Classification of CNN.com Articles using a TF*IDF Metric
Marie Vans and Steven Simske, HP Labs, Fort Collins, Colorado
April 20, 2016
Agenda
• TF*IDF Family of Metrics
• Word Frequencies
• Data Set & Preprocessing
• Algorithms for word frequencies and classification
• An example
• Results
• Future Directions
• Conclusions
TF*IDF – Family – Term Frequency
TF  Name        Equation
1   Power       $w_{i,j}^{Power}$
2   Mean        $w_{i,j}$
3   NormLog     $\frac{1 + \log_2 w_{i,j}}{\log_2 k}$
4   Log         $1 + \log_2 w_{i,j}$
5   NormLogs    $\frac{1 + \log_2 w_{i,j}}{\log_2^{LogRatio} k}$ if $LogRatio \ge MinLogRatio$
                $\frac{1 + \log_2 w_{i,j}}{\log_2 k}$ if $LogRatio < MinLogRatio$
6   NormMean    $\frac{w_{i,j}}{k}$
7   NormPower   $\frac{w_{i,j}^{Power}}{k^{Power}}$
8   NormPowers  $\frac{w_{i,j}^{WordPower}}{k^{DocPower}}$
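As a rough illustration of the term-frequency variants above, here is a minimal Python sketch of the six variants that do not involve LogRatio. The function name `tf_variants` and the default `power=0.5` are assumptions made for this example, not values from the slides.

```python
import math


def tf_variants(w_ij: int, k: int, power: float = 0.5) -> dict:
    """Term-frequency variants for a word that occurs w_ij times
    in a document of k words (requires w_ij >= 1 and k >= 2).

    The default power is an illustrative assumption.
    """
    log_tf = 1 + math.log2(w_ij)
    return {
        "Power": w_ij ** power,
        "Mean": float(w_ij),
        "Log": log_tf,
        "NormLog": log_tf / math.log2(k),
        "NormMean": w_ij / k,
        "NormPower": w_ij ** power / k ** power,
    }


# A word that occurs 4 times in a 16-word document.
tf = tf_variants(w_ij=4, k=16)
```

With these inputs the normalized variants divide by the document length (or its logarithm or power), so longer documents do not automatically receive larger weights.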
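TF*IDF – Family – Inverse Document Frequency
IDF  Name            IDF Equation
1    NormLogsOfSums  $\log_2^{LogRatio}\left(\frac{\sum_{j=1}^{N-1} k_j}{1 + \log_2 w_{i,n}}\right)$ if $LogRatio \ge MinLogRatio$
                     $\log_2\left(\frac{\sum_{j=1}^{N-1} k_j}{1 + \log_2 w_{i,n}}\right)$ if $LogRatio < MinLogRatio$
2    NormSumsOfLogs  $\log_2^{LogRatio}\left(\frac{\sum_{j=1}^{N-1} k_j}{\sum_{n=1}^{N-1} (1 + \log_2 w_{i,n})}\right)$ if $LogRatio \ge MinLogRatio$
                     $\log_2\left(\frac{\sum_{j=1}^{N-1} k_j}{\sum_{n=1}^{N-1} (1 + \log_2 w_{i,n})}\right)$ if $LogRatio < MinLogRatio$
3    SumOfPowers     $\frac{N-1}{\sum_{n=1}^{N-1} w_{i,n}^{Power}}$
4    PowerOfSums     $\frac{N-1}{w_{i,n}^{Power}}$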
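TF*IDF – Family – Inverse Document Frequency
IDF  Name              IDF Equation
5    Mean              $\frac{N-1}{w_{i,n}}$
6    NormSumOfLogs     $\frac{\sum_{j=1}^{N-1} k_j}{\sum_{n=1}^{N-1} (1 + \log_2 w_{i,n})}$
7    NormLogOfSums     $\frac{\sum_{j=1}^{N-1} k_j}{1 + \log_2 w_{i,n}}$
8    NormSumOfPowers   $\frac{\sum_{j=1}^{N-1} k_j}{\sum_{n=1}^{N-1} w_{i,n}^{Power}}$
9    NormSumsOfPowers  $\frac{\sum_{j=1}^{N-1} k_j^{DocPower}}{\sum_{n=1}^{N-1} w_{i,n}^{WordPower}}$
10   SumOfLogs         $\frac{N-1}{\sum_{n=1}^{N-1} (1 + \log_2 w_{i,n})}$
11   LogOfSums         $\frac{N-1}{1 + \log_2 w_{i,n}}$
12   NormMean          $\frac{\sum_{j=1}^{N-1} k_j}{w_{i,n}}$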
TF*IDF – Family – Inverse Document Frequency
IDF  Name              IDF Equation
13   NormPowerOfSums   $\frac{\sum_{j=1}^{N-1} k_j}{w_{i,n}^{Power}}$
14   NormPowersOfSums  $\frac{\left(\sum_{j=1}^{N-1} k_j\right)^{DocPower}}{w_{i,n}^{WordPower}}$
i = current word
j = current document
k = total words in document j
n = total words in documents other than the current document
N = total number of documents in the corpus
$w_{i,j}$ = number of occurrences of word i in document j
$w_{i,n}$ = number of occurrences of word i in the other documents
$n_i$ = number of documents in which word i occurs
LogRatio = ratio of the log for the individual word to the log for the document length
MinLogRatio = user-settable minimum for LogRatio
WordPower and DocPower = adjustable values
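TF*IDF – Family – Putting it together
TF_Power*IDF_NormLogsOfSums
  $w_{i,j}^{Power} \cdot \log_2^{LogRatio}\left(\frac{\sum_{j=1}^{N-1} k_j}{1 + \log_2 w_{i,n}}\right)$ if $LogRatio \ge MinLogRatio$
  $w_{i,j}^{Power} \cdot \log_2\left(\frac{\sum_{j=1}^{N-1} k_j}{1 + \log_2 w_{i,n}}\right)$ if $LogRatio < MinLogRatio$
TF_Power*IDF_NormSumsOfLogs
  $w_{i,j}^{Power} \cdot \log_2^{LogRatio}\left(\frac{\sum_{j=1}^{N-1} k_j}{\sum_{n=1}^{N-1} (1 + \log_2 w_{i,n})}\right)$ if $LogRatio \ge MinLogRatio$
  $w_{i,j}^{Power} \cdot \log_2\left(\frac{\sum_{j=1}^{N-1} k_j}{\sum_{n=1}^{N-1} (1 + \log_2 w_{i,n})}\right)$ if $LogRatio < MinLogRatio$
.
.
.
TF_NormPowers*IDF_NormPowersOfSums
  $\frac{w_{i,j}^{WordPower}}{k^{DocPower}} \cdot \frac{\left(\sum_{j=1}^{N-1} k_j\right)^{DocPower}}{w_{i,n}^{WordPower}}$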
112 TF*IDF Equations
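The count of 112 follows from pairing each of the 8 TF definitions with each of the 14 IDF definitions. A small Python sketch that enumerates the combinations by name (the `TF_x*IDF_y` naming pattern is taken from the slide; the list variables are just for illustration):

```python
from itertools import product

TF_NAMES = [
    "Power", "Mean", "NormLog", "Log",
    "NormLogs", "NormMean", "NormPower", "NormPowers",
]
IDF_NAMES = [
    "NormLogsOfSums", "NormSumsOfLogs", "SumOfPowers", "PowerOfSums",
    "Mean", "NormSumOfLogs", "NormLogOfSums", "NormSumOfPowers",
    "NormSumsOfPowers", "SumOfLogs", "LogOfSums", "NormMean",
    "NormPowerOfSums", "NormPowersOfSums",
]

# Every TF definition paired with every IDF definition: 8 * 14 = 112.
combos = [f"TF_{tf}*IDF_{idf}" for tf, idf in product(TF_NAMES, IDF_NAMES)]
```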
Word Frequencies
$f(w_{i,j})$: the frequency of word i in file j
$\sum_{j=1}^{nfiles} f(w_{i,j})$: the frequency of word i across all nfiles files
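The two frequencies can be sketched in Python with `collections.Counter`. The file names and tokens below are made up for illustration.

```python
from collections import Counter

# Hypothetical, already tokenized files.
files = {
    "a.txt": ["nuclear", "power", "nuclear"],
    "b.txt": ["power", "germany"],
}

# f(w_ij): frequency of word i in file j.
per_file = {name: Counter(tokens) for name, tokens in files.items()}

# Sum over j of f(w_ij): frequency of word i across all files.
across_files = sum(per_file.values(), Counter())
```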
CNN Data Set
Class Name   Total Files   Training Set Files   Test Set Files
Business     161           81                   80
Health       290           145                  145
Justice      224           112                  112
Living       98            49                   49
Opinion      192           96                   96
Politics     195           98                   97
Showbiz      241           121                  120
Sport        148           74                   74
Tech         132           66                   66
Travel       171           86                   85
US           160           80                   80
World        988           494                  494
• 12 classes
• 3,000 total files
• Each class split into 2 sets:
  • Training set
  • Test set
• File classes ground-truthed by CNN
Rafael Dueire Lins, Steven J. Simske, Luciano de Souza Cabral, Gabriel de Silva, Rinaldo Lima, Rafael F. Mello, and Luciano Favaro. A multi-tool scheme for summarizing textual documents. In Proceedings of the 11th IADIS International Conference WWW/INTERNET 2012, pages 1–8, July 2012.
CNN Data Set
Class Name   Train Set Unique Words   Test Set Unique Words   Total Words Processed
Business     8278                     7851                    16129
Health       12246                    12036                   24282
Justice      9133                     9032                    18165
Living       7936                     7030                    14966
Opinion      11382                    10886                   22268
Politics     9268                     9039                    18307
Showbiz      8997                     9949                    18946
Sport        7445                     7191                    14636
Tech         7971                     7548                    15519
Travel       14931                    12612                   27543
US           8488                     8707                    17195
World        22936                    23441                   46377
• 12 classes
• Total words: 254,333
  • Training set: 129,011
  • Test set: 125,322
Preprocessing
• Remove “stop words”
• Remove punctuation (hyphenation excepted)
• No lemmatization
• SharpNLP – Open Source Natural Language Processing (https://sharpnlp.codeplex.com/)
  • sentence splitter
  • tokenizer
  • part-of-speech tagger
  • chunker
  • parser
  • name finder
  • coreference tool
  • interface to the WordNet lexical database
• File parsed with each word tagged with part of speech
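The first three preprocessing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a regular expression instead of SharpNLP, the `STOP_WORDS` set is a tiny made-up sample, and lowercasing is an added simplification not mentioned on the slide.

```python
import re

# Illustrative stop-word sample; the list actually used is not given.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "as", "from"}


def preprocess(text: str) -> list:
    """Tokenize, drop punctuation except internal hyphens, and drop
    stop words. No lemmatization is applied."""
    # A token is a run of letters/digits, optionally joined by hyphens,
    # so "clean-energy" stays a single word.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The switchover to clean-energy, as planned!")
```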
Program Classes (Not CNN Classes)
• Word Class
  • m_Spelling
  • m_Count (frequency of word in file)
  • m_Weight (assigned by different TF*IDF measures)
  • m_HasHyphen (hyphenated words count as a single word)
  • m_PennTags (part-of-speech tags)
  • m_Tags (number of tags associated with word)
• TermFrequencies Class
  • m_TermName;
  • int m_TermFreq;
• Classify Class
  • m_businessWords;
  • m_healthWords;
  • m_justiceWords;
  • m_livingWords;
  • m_opinionWords;
  • m_politicsWords;
  • m_showbizWords;
  • m_sportWords;
  • m_techWords;
  • m_travelWords;
  • m_usWords;
  • m_worldWords;
  • m_confusionMatrx
Algorithm
A. Using the Training Set files in each class (i.e., do this 12 times):
1.0 For each file in the set:
    Create a word object for every unique word in the file
2.0 Count the total number of occurrences of each unique word for the entire set of documents
3.0 Calculate the weight of each word:
    total occurrences of word_i in all files / total occurrences of all words in all files
$T_{word_i} = \frac{\sum_{j=1}^{nfiles} f(w_{i,j})}{\sum_{j=1}^{nfiles} \sum_{i=1}^{nwords} f(w_{i,j})}$
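Steps 1.0 to 3.0 for one training class can be sketched as follows. The function name `class_weights` and the toy token lists are assumptions for illustration; the original program uses Word objects rather than a plain dictionary.

```python
from collections import Counter


def class_weights(files: list) -> dict:
    """Weight of each word in a class: its total occurrences in all
    files divided by the total occurrences of all words in all files."""
    totals = Counter()
    for tokens in files:          # steps 1.0 and 2.0: count per class
        totals.update(tokens)
    grand_total = sum(totals.values())
    # step 3.0: normalize by the class-wide word total
    return {word: count / grand_total for word, count in totals.items()}


# Two hypothetical, already preprocessed training files of one class.
training_files = [
    ["nuclear", "power", "nuclear"],
    ["power", "germany", "nuclear"],
]
weights = class_weights(training_files)
```

Running the same function once per class gives the 12 per-class weight tables that the later classification step compares against.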
Algorithm
B. Using the Testing Set files in a specific class (e.g., business):
1.0 For each file in the set:
    Create a word object for every unique word in the file
2.0 Count the total number of occurrences of each unique word for the entire set of documents
3.0 Calculate the weight of each word:
    $T_{word_{i,j}}$ = total occurrences of word_i in file / total occurrences of all words in file
    $T_{word_i}$ = total occurrences of word_i in all files / total occurrences of all words in all files
$T_{word_{i,j}} = \frac{f(w_{i,j})}{\sum_{i=1}^{nwords} f(w_{i,j})}$
$T_{word_i} = \frac{\sum_{j=1}^{nfiles} f(w_{i,j})}{\sum_{j=1}^{nfiles} \sum_{i=1}^{nwords} f(w_{i,j})}$
Algorithm
D. Classify each word_i in one test file by comparing to the same word in all training classes:
e.g., $Test_{word_i}$ = word_i in Business test class
$C_{business} = Test_{word_i} \times Business\_Train_{word_i}$
$C_{health} = Test_{word_i} \times Health\_Train_{word_i}$
$C_{justice} = Test_{word_i} \times Justice\_Train_{word_i}$
.
.
.
$C_{world} = Test_{word_i} \times World\_Train_{word_i}$
$Class = \max(C_{business}, C_{health}, C_{justice}, \ldots, C_{world})$
Algorithm
C. Classify each word_i in the entire test class by comparing to the same word in all training classes:
e.g., $Test_{word_i}$ = word_i in Business test class
$C_{business} = Test_{word_i} \times Business\_Train_{word_i}$
$C_{health} = Test_{word_i} \times Health\_Train_{word_i}$
$C_{justice} = Test_{word_i} \times Justice\_Train_{word_i}$
.
.
.
$C_{world} = Test_{word_i} \times World\_Train_{word_i}$
$Class = \max(C_{business}, C_{health}, C_{justice}, \ldots, C_{world})$
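A minimal Python sketch of this comparison step. It assumes that the per-word products are summed over all words of the test file (or test class) to get one score per training class; the slides show only the per-word product, so the summation is an assumption. The function names and the toy weights are hypothetical.

```python
def class_scores(test_weights: dict, train_weights_by_class: dict) -> dict:
    """Score each training class as the sum over test words of
    test weight * training weight (assumed aggregation).
    Words missing from a training class contribute 0."""
    return {
        name: sum(w * train.get(word, 0.0) for word, w in test_weights.items())
        for name, train in train_weights_by_class.items()
    }


def classify(test_weights: dict, train_weights_by_class: dict) -> str:
    """Return the class with the maximum score."""
    scores = class_scores(test_weights, train_weights_by_class)
    return max(scores, key=scores.get)


# Toy weights for two classes and one test file.
train = {
    "business": {"market": 0.02, "nuclear": 0.001},
    "health": {"market": 0.001, "doctor": 0.03},
}
test = {"market": 0.05, "nuclear": 0.02}

scores = class_scores(test, train)
predicted = classify(test, train)
```

Sorting `scores` in descending order instead of taking only the maximum gives the ranked list of classes needed to count how many attempts it takes to reach the true class.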
CNN Data Set – Example Article – Business Class
After : Could Germany's nuclear gamble backfire?
As Germany's switchover from nuclear power to renewable energy gathers pace, concerns are mounting over
the cost to the country's prosperity and its already squeezed consumers.
Politicians in Europe's largest economy want renewable power to contribute 35% of the country's electricity
consumption by 2020 and 80% by 2050 as part of its clean energy drive.
The country's 'energiewende' -- translated as energy transformation -- is part of the government's plan to move
away from nuclear power and fossil fuels to renewable energy sources, following Japan's disaster
in 2011.
Michael Limburg, vice-president of the European Institute for Climate and Energy, told CNN that the
government's energy targets are 'completely unfeasible.'
'Of course, it's possible to erect tens of thousands of windmills but only at an extreme cost and waste of natural
space,' he said.
'And still it would not be able to deliver electricity when it is needed.'
The government is investing heavily in onshore and offshore wind farms and solar technology in an effort to
reduce 40% of greenhouse gas emissions by 2020.
Last year Chancellor Angela Merkel, who this week won her third term as Germany's leader, proposed to
construct offshore wind farms in the North Sea, a plan that would cost 200 billion euros ($270 billion), according
to the DIW economic institute in Berlin.
As part of the energy drive, Merkel also pledged to permanently shut down the country's 17 nuclear reactors,
which fuel 18% of the country's power needs.
Under Germany's Atomic Energy Act, the last nuclear power plant will be disconnected by 2022.
CNN Data Set – Example Frequencies – Training Set
Term        m_TermFreq (Single File)   m_TermFreq (All Files in Class)   % Occurrence in File   % Occurrence in Class
fukushima   6                          9                                 0.0102739726027397     0.000307188203972967
germany     1                          26                                0.00156739811912226    0.000887432589255239
nuclear     12                         33                                0.0205479452054795     0.00112635674790088
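The occurrence values are the term counts divided by a word total. The totals are not stated on the slide; the values 584 (single file) and 29,298 (all files in class) used below are inferred from the reported ratios and should be read as assumptions. The single-file germany value implies a different total (1 / 638), so it is left out of this check.

```python
import math

# Counts from the slide.
file_counts = {"fukushima": 6, "nuclear": 12}
class_counts = {"fukushima": 9, "germany": 26, "nuclear": 33}

# Inferred totals (assumptions, not stated on the slide).
FILE_TOTAL = 584
CLASS_TOTAL = 29298

in_file = {word: n / FILE_TOTAL for word, n in file_counts.items()}
in_class = {word: n / CLASS_TOTAL for word, n in class_counts.items()}

# Values reported on the slide, for comparison.
reported_in_file = {
    "fukushima": 0.0102739726027397,
    "nuclear": 0.0205479452054795,
}
reported_in_class = {
    "fukushima": 0.000307188203972967,
    "germany": 0.000887432589255239,
    "nuclear": 0.00112635674790088,
}

matches = all(
    math.isclose(in_file[w], v, rel_tol=1e-9) for w, v in reported_in_file.items()
) and all(
    math.isclose(in_class[w], v, rel_tol=1e-9) for w, v in reported_in_class.items()
)
```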
CNN Data Set – Example Frequencies – Test Set
Term        m_TermFreq (Single File)   m_TermFreq (All Files in Class)   % Occurrence in File   % Occurrence in Class
fukushima   2                          2                                 0.0056022408963585     0.000073773515308004
germany     9                          21                                0.0252100840336134     0.000774621910734047
nuclear     5                          6                                 0.0140056022408964     0.000221320545924013
Classify All Words in Single Business Test File
Business  0.00069364  (MAX class value)
Health    0.00030063
Justice   0.00025000
Living    0.00026707
Opinion   0.00033446
Politics  0.00034694
Showbiz   0.00025372
Sport     0.00029984
Tech      0.00033337
Travel    0.00023201
US        0.00031539
World     0.00040208
Classify All Words in All Business Test Files
Business  0.00059513  (MAX class value)
Health    0.00035854
Justice   0.00027830
Living    0.00038269
Opinion   0.00039295
Politics  0.00036828
Showbiz   0.00029162
Sport     0.00036698
Tech      0.00040147
Travel    0.00032406
US        0.00032592
World     0.00037747
Confusion Matrix
• Each column contains samples of classifier output
• Each row contains samples in the true class
• Each row sums to 1.0
• Diagonal shows percent classified correctly
  • Mean of diagonal = 89%
• Off-diagonal shows the types of errors that occur
  • A is misclassified as B – 3%
  • A is misclassified as C – 3%
Normalized Confusion Matrix
                                      Classifier Output (Computed Classification)
True Class of the Samples (Input)     A      B      C
A                                     0.94   0.03   0.03
B                                     0.08   0.85   0.07
C                                     0.08   0.04   0.88
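A short Python sketch of how such a normalized matrix and its diagonal mean can be computed. The raw counts are hypothetical (100 samples per class), chosen so that row normalization reproduces the matrix above.

```python
# Hypothetical raw counts: rows = true class, columns = classifier output.
counts = [
    [94, 3, 3],   # true class A
    [8, 85, 7],   # true class B
    [8, 4, 88],   # true class C
]

# Divide each row by its total so that every row sums to 1.0.
normalized = [[c / sum(row) for c in row] for row in counts]

# Mean of the diagonal = average fraction classified correctly.
diagonal_mean = sum(normalized[i][i] for i in range(len(normalized))) / len(normalized)
```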
Results - Classification
           business  health  justice  living  opinion  politics  showbiz  sport   tech    travel  us      world
business   0.75      0       0        0.0875  0        0         0        0.025   0.1125  0.025   0       0
health     0         0.7724  0.0207   0.1793  0        0.0138    0.0069   0       0.0069  0       0       0
justice    0         0       0.9018   0.0179  0        0.0446    0.0089   0       0       0       0.0268  0
living     0.0204    0.0408  0        0.8163  0        0.0204    0        0.0612  0.0408  0       0       0
opinion    0.2083    0.0729  0.0208   0.2708  0.0313   0.1667    0        0.0417  0.0417  0       0.0104  0.1354
politics   0.0103    0.0103  0.0515   0.0412  0        0.8557    0        0       0       0       0       0.0309
showbiz    0.0083    0.0083  0.1583   0.1417  0        0.0083    0.6417   0.025   0       0       0.0083  0
sport      0.027     0.0135  0.0405   0.0541  0        0.027     0        0.8108  0       0.027   0       0
tech       0.0303    0.0303  0.0152   0.2121  0        0         0.0152   0       0.6818  0.0152  0       0
travel     0.1412    0.0118  0.0235   0.1412  0        0.0824    0        0.0353  0.0588  0.4353  0.0471  0.0235
us         0.025     0.05    0.3125   0.175   0        0.1125    0.025    0.0625  0.0375  0.0125  0.1875  0
world      0.0769    0.0142  0.1316   0.0789  0.002    0.1255    0.0061   0.0142  0.0202  0.002   0.0142  0.5142
Note that the diagonal entries are the correct classifications.
The rows sum to 1.0, since the left column represents the actual class from which the document is taken.
The column sums have a mean of 1.0, with some variance depending on whether the class in the column is an attractor class (sum > 1.0) or a repulsor class (sum < 1.0).
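The attractor/repulsor distinction can be sketched in Python by summing each column of a row-normalized confusion matrix. For brevity this uses the three-class A/B/C matrix from the Confusion Matrix slide rather than the 12-class results.

```python
labels = ["A", "B", "C"]
matrix = [
    [0.94, 0.03, 0.03],  # true class A
    [0.08, 0.85, 0.07],  # true class B
    [0.08, 0.04, 0.88],  # true class C
]

# Column sum = how much classifier output a class attracts overall.
column_sums = {
    label: sum(row[j] for row in matrix) for j, label in enumerate(labels)
}

attractors = [label for label, s in column_sums.items() if s > 1.0]
repulsors = [label for label, s in column_sums.items() if s < 1.0]
```

Here class A receives more than its share of predictions (column sum about 1.10), while B and C receive less (about 0.92 and 0.98).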
Example of Incorrectly Classified File
Results of File from Opinion Test Class:
Business  0.00033924
Health    0.00025056
Justice   0.00027728
Living    0.00027807
Opinion   0.00041936  (3rd MAX class value)
Politics  0.00046704  (MAX class value)
Showbiz   0.00023136
Sport     0.00028422
Tech      0.00025991
Travel    0.00021793
US        0.00032973
World     0.00043251  (2nd MAX class value)
It takes 3 tries to get it right
Classification Attempts to Success
Measures the average number of attempts until the correct class is chosen:
$\frac{1 \times P_1 + 2 \times P_2 + 3 \times P_3 + \ldots + 12 \times P_{12}}{nfiles}$
where
P1 = number correctly classified on the first try
P2 = number correctly classified after two tries
.
.
.
P12 = number correctly classified on the last try
nfiles = number of files in the testing class
Example: Worst Class – Opinion
P1 = 3   (× 1 = 3)
P2 = 35  (× 2 = 70)
P3 = 31  (× 3 = 93)
P4 = 14  (× 4 = 56)
P5 = 7   (× 5 = 35)
P6 = 3   (× 6 = 18)
P7 = 2   (× 7 = 14)
P8 = 1   (× 8 = 8)
Σ = 297; 297 / 96 = 3.09
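The Opinion example can be checked with a few lines of Python. The dictionary maps the attempt number to the number of files classified correctly on that attempt, using the values from the slide; the variable names are illustrative.

```python
# Attempt number -> number of Opinion test files classified
# correctly on that attempt (values from the slide).
attempt_counts = {1: 3, 2: 35, 3: 31, 4: 14, 5: 7, 6: 3, 7: 2, 8: 1}

nfiles = sum(attempt_counts.values())
weighted_attempts = sum(rank * count for rank, count in attempt_counts.items())
mean_attempts = weighted_attempts / nfiles
```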
Results - Classification Attempts to Success
• Measures the average number of attempts until the correct class is chosen
• Ideal is 1.0 – we get it right on the first try
• Best Class – Justice
  • Correctly classified: 0.9018
  • Mean classification attempts: 1.19
  • Delta from ideal = 0.19
• Worst Class – Opinion
  • Correctly classified: 0.0313
  • Mean classification attempts: 3.09
  • Delta from ideal = 2.09
• The best class's delta from ideal is 11 times smaller than the worst class's (0.19 vs. 2.09)
• All other classes fall between best and worst
Results - General
• Confusion matrix shows good classification results
• Average classification rate for all classes = 0.61655883
• Classification Errors:
  • Attractor classes: Business, Health, Justice, Living, Politics, Sport, Tech
  • Repulsor classes: Opinion, Showbiz, Travel, U.S., World
• Normalized by total occurrences of all words in file
  • For classification of single file
• Normalized by total occurrences of all words in the class
  • For classification of multiple files
business  health  justice  living  opinion  politics  showbiz  sport   tech    travel  us      world
1.2978    1.0245  1.6765   2.216   0.0333   1.4569    0.7037   1.0757  1.0003  0.517   0.2943  0.704
Discussion
Opinion, Justice, World, U.S.
1. Documents more varied?
2. Generic words?
3. Topics overlap?
4. Word clusters are broader?
Future Directions
• Automatic summarization based on word frequencies in sentences
• Data from Brazil also contained Gold Standard sentences for summarization
• Each file contains sentences pulled out of the full article by at least 3 students
• Gold Standard sentences for each file act as ground truth for automatic
summarization
• New York Times Annotated Corpus: (https://catalog.ldc.upenn.edu/LDC2008T19)
• Written and published by the New York Times between January 1, 1987 and June 19,
2007
• Metadata provided by the New York Times Newsroom, the New York Times Indexing
Service and the online production staff at nytimes.com:
• Over 1.8 million articles
• Over 650,000 article summaries written by library scientists
• Over 1,500,000 articles manually tagged by library scientists with tags drawn from
a normalized indexing vocabulary of people, organizations, locations and topic
descriptors
• Over 275,000 algorithmically-tagged articles that have been hand verified by the
Conclusions
• A family of TF*IDF metrics for summarization and classification
• A simple TF*IDF metric
• Classification scheme that works well on a set of 3,000 CNN articles
separated into 12 classes
• Classification attempts to success is a measure that tells us how
hard it is to classify
• Attractor and repulsor classes may help for identifying imbalances in
the data
• Simple TF*IDF metric can be used for benchmarking the rest of the 112 TF*IDF
Thank You for Your Kind Attention
NLP@DATEV: Setting up a domain specific language model, Dr. Jonas Rende & Tho...
 
BDACA - Lecture5
BDACA - Lecture5BDACA - Lecture5
BDACA - Lecture5
 
Resume_Clasification.pptx
Resume_Clasification.pptxResume_Clasification.pptx
Resume_Clasification.pptx
 
Odp
OdpOdp
Odp
 
Data Presentations Cassandra Sigmod
Data  Presentations  Cassandra SigmodData  Presentations  Cassandra Sigmod
Data Presentations Cassandra Sigmod
 
OAXAL
OAXALOAXAL
OAXAL
 
Ir 09
Ir   09Ir   09
Ir 09
 
Testing the Numerical Precisions Required to Execute Real World Programs
Testing the Numerical Precisions Required to Execute Real World Programs Testing the Numerical Precisions Required to Execute Real World Programs
Testing the Numerical Precisions Required to Execute Real World Programs
 

Classification of CNN.com Articles using a TF*IDF Metric

  • 1. Classification of CNN.com Articles using a TF*IDF Metric
      Marie Vans and Steven Simske
      HP Labs; Fort Collins, Colorado
      April 20, 2016
  • 2. Agenda
      • TF*IDF Family of Metrics
      • Word Frequencies
      • Data Set & Preprocessing
      • Algorithms for word frequencies and classification
      • An example
      • Results
      • Future Directions
      • Conclusions
  • 3. TF*IDF – Family – Term Frequency

      TF  Name        Equation
      1   Power       $w_{i,j}^{\mathit{Power}}$
      2   Mean        $w_{i,j}$
      3   NormLog     $\frac{1 + \log_2 w_{i,j}}{\log_2 k}$
      4   Log         $1 + \log_2 w_{i,j}$
      5   NormLogs    $\frac{1 + \log_2 w_{i,j}}{\log_2^{\mathit{LogRatio}} k}$  if LogRatio ≥ MinLogRatio
                      $\frac{1 + \log_2 w_{i,j}}{\log_2 k}$  if LogRatio < MinLogRatio
      6   NormMean    $\frac{w_{i,j}}{k}$
      7   NormPower   $\frac{w_{i,j}^{\mathit{Power}}}{k^{\mathit{Power}}}$
      8   NormPowers  $\frac{w_{i,j}^{\mathit{WordPower}}}{k^{\mathit{DocPower}}}$
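
The term-frequency variants above can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, the fraction layout is inferred from the flattened slide text, and NormLogs is left out because the placement of LogRatio in its denominator is not recoverable with confidence. `w` stands for $w_{i,j}$ and `k` for the number of words in document j.

```python
import math

# Hypothetical helper names; w is the count of word i in document j,
# k is the total number of words in document j.

def tf_power(w, power):
    return w ** power

def tf_mean(w):
    return float(w)

def tf_log(w):
    return 1 + math.log2(w)

def tf_norm_log(w, k):
    return (1 + math.log2(w)) / math.log2(k)

def tf_norm_mean(w, k):
    return w / k

def tf_norm_power(w, k, power):
    return (w ** power) / (k ** power)

def tf_norm_powers(w, k, word_power, doc_power):
    # Separate exponents for the word count and the document length.
    return (w ** word_power) / (k ** doc_power)
```

With `power = 1`, `tf_norm_power` reduces to `tf_norm_mean`, which is one way to see the variants as a single family.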
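  • 4. TF*IDF – Family – Inverse Document Frequency

      IDF  Name              IDF Equation
      1    NormLogsOfSums    $\frac{\log_2^{\mathit{LogRatio}}\left(\sum_{j=1}^{N-1} k_j\right)}{1 + \log_2 w_{i,n}}$  if LogRatio ≥ MinLogRatio
                             $\frac{\log_2\left(\sum_{j=1}^{N-1} k_j\right)}{1 + \log_2 w_{i,n}}$  if LogRatio < MinLogRatio
      2    NormSumsOfLogs    $\frac{\log_2^{\mathit{LogRatio}}\left(\sum_{j=1}^{N-1} k_j\right)}{\sum_{n=1}^{N-1} \left(1 + \log_2 w_{i,n}\right)}$  if LogRatio ≥ MinLogRatio
                             $\frac{\log_2\left(\sum_{j=1}^{N-1} k_j\right)}{\sum_{n=1}^{N-1} \left(1 + \log_2 w_{i,n}\right)}$  if LogRatio < MinLogRatio
      3    SumOfPowers       $\frac{N - 1}{\sum_{n=1}^{N-1} w_{i,n}^{\mathit{Power}}}$
      4    PowerOfSums       $\frac{N - 1}{w_{i,n}^{\mathit{Power}}}$
  • 5. TF*IDF – Family – Inverse Document Frequency

      IDF  Name              IDF Equation
      5    Mean              $\frac{N - 1}{w_{i,n}}$
      6    NormSumOfLogs     $\frac{\sum_{j=1}^{N-1} k_j}{\sum_{n=1}^{N-1} \left(1 + \log_2 w_{i,n}\right)}$
      7    NormLogOfSums     $\frac{\sum_{j=1}^{N-1} k_j}{1 + \log_2 w_{i,n}}$
      8    NormSumOfPowers   $\frac{\sum_{j=1}^{N-1} k_j}{\sum_{n=1}^{N-1} w_{i,n}^{\mathit{Power}}}$
      9    NormSumsOfPowers  $\frac{\sum_{j=1}^{N-1} k_j^{\mathit{DocPower}}}{\sum_{n=1}^{N-1} w_{i,n}^{\mathit{WordPower}}}$
      10   SumOfLogs         $\frac{N - 1}{\sum_{n=1}^{N-1} \left(1 + \log_2 w_{i,n}\right)}$
      11   LogOfSums         $\frac{N - 1}{1 + \log_2 w_{i,n}}$
      12   NormMean          $\frac{\sum_{j=1}^{N-1} k_j}{w_{i,n}}$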
  • 6. TF*IDF – Family – Inverse Document Frequency

      IDF  Name              IDF Equation
      13   NormPowerOfSums   $\frac{\sum_{j=1}^{N-1} k_j}{w_{i,n}^{\mathit{Power}}}$
      14   NormPowersOfSums  $\frac{\sum_{j=1}^{N-1} k_j^{\mathit{DocPower}}}{w_{i,n}^{\mathit{WordPower}}}$

      i = current word
      j = current document
      k = total words in document j
      n = total words in other than current document
      N = total number of documents in the corpus
      w_{i,j} = number of occurrences of word i in document j
      w_{i,n} = word occurrences of word i in other documents
      n_i = number of documents in which i occurs
      LogRatio = ratio of log for individual word to log for document length
      MinLogRatio = user settable minimum for LogRatio
      WordPower & DocPower = adjustable value
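
A few of the inverse-document-frequency variants can be sketched the same way. The sketch below is an assumption-laden illustration, not the authors' code: the function names are invented, `other_counts` holds the counts of word i in each of the N-1 other documents, `other_lengths` holds the word totals $k_j$ of those documents, and every count is assumed to be at least 1 so that the logarithms are defined (how zero counts were handled is not stated on the slides).

```python
import math

# other_counts:  occurrences of word i in each of the N-1 other documents
# other_lengths: total words k_j in each of those documents

def idf_sum_of_powers(other_counts, power):
    return len(other_counts) / sum(w ** power for w in other_counts)

def idf_sum_of_logs(other_counts):
    return len(other_counts) / sum(1 + math.log2(w) for w in other_counts)

def idf_norm_sum_of_logs(other_counts, other_lengths):
    return sum(other_lengths) / sum(1 + math.log2(w) for w in other_counts)

def idf_norm_sum_of_powers(other_counts, other_lengths, power):
    return sum(other_lengths) / sum(w ** power for w in other_counts)
```

Pairing any TF function with any IDF function by multiplication gives one member of the family; 8 TF variants times 14 IDF variants yields the 112 combinations the slides refer to.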
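  • 7. TF*IDF – Family – Putting it together

      TF_Power*IDF_NormLogsOfSums
        $w_{i,j}^{\mathit{Power}} \cdot \frac{\log_2^{\mathit{LogRatio}}\left(\sum_{j=1}^{N-1} k_j\right)}{1 + \log_2 w_{i,n}}$  if LogRatio ≥ MinLogRatio
        $w_{i,j}^{\mathit{Power}} \cdot \frac{\log_2\left(\sum_{j=1}^{N-1} k_j\right)}{1 + \log_2 w_{i,n}}$  if LogRatio < MinLogRatio

      TF_Power*IDF_NormSumsOfLogs
        $w_{i,j}^{\mathit{Power}} \cdot \frac{\log_2^{\mathit{LogRatio}}\left(\sum_{j=1}^{N-1} k_j\right)}{\sum_{n=1}^{N-1} \left(1 + \log_2 w_{i,n}\right)}$  if LogRatio ≥ MinLogRatio
        $w_{i,j}^{\mathit{Power}} \cdot \frac{\log_2\left(\sum_{j=1}^{N-1} k_j\right)}{\sum_{n=1}^{N-1} \left(1 + \log_2 w_{i,n}\right)}$  if LogRatio < MinLogRatio

      . . .

      TF_NormPowers*IDF_NormPowersOfSums
        $\frac{w_{i,j}^{\mathit{WordPower}}}{k^{\mathit{DocPower}}} \cdot \frac{\sum_{j=1}^{N-1} k_j^{\mathit{DocPower}}}{w_{i,n}^{\mathit{WordPower}}}$

      112 TF*IDF Equations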
  • 9. CNN Data Set

      Class Name   Total Files   Training Set Files   Test Set Files
      Business     161           81                   80
      Health       290           145                  145
      Justice      224           112                  112
      Living       98            49                   49
      Opinion      192           96                   96
      Politics     195           98                   97
      Showbiz      241           121                  120
      Sport        148           74                   74
      Tech         132           66                   66
      Travel       171           86                   85
      US           160           80                   80
      World        988           494                  494

      • 12 Classes
      • 3,000 Total Files
      • Each Class split into 2 sets:
          • Training Set
          • Test Set
      • File classes ground-truthed by CNN

      Rafael Dueire Lins, Steven J. Simske, Luciano de Souza Cabral, Gabriel de Silva, Rinaldo Lima, Rafael F. Mello, and Luciano Favaro. A multi-tool scheme for summarizing textual documents. In Proceedings of 11th IADIS International Conference WWW/INTERNET 2012, pages 1–8, July 2012
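
The training and test sizes in the table are consistent with a half split in which the odd file, if any, goes to the training set. The sketch below reproduces the table's numbers under that assumption; how individual files were assigned to each half is not stated on the slide, and `split_sizes` is a name chosen here for illustration.

```python
import math

# Total files per class, copied from the table above.
class_totals = {
    "Business": 161, "Health": 290, "Justice": 224, "Living": 98,
    "Opinion": 192, "Politics": 195, "Showbiz": 241, "Sport": 148,
    "Tech": 132, "Travel": 171, "US": 160, "World": 988,
}

def split_sizes(total):
    """Return (training, test) sizes, giving the odd file to training."""
    train = math.ceil(total / 2)
    return train, total - train

sizes = {name: split_sizes(total) for name, total in class_totals.items()}
train_total = sum(train for train, _ in sizes.values())
test_total = sum(test for _, test in sizes.values())
print(train_total, test_total, train_total + test_total)
```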
  • 10. CNN Data Set

      Class Name   Total Train Set Unique Words   Total Test Set Unique Words   Total Number of Words Processed
      Business     8278                           7851                          16129
      Health       12246                          12036                         24282
      Justice      9133                           9032                          18165
      Living       7936                           7030                          14966
      Opinion      11382                          10886                         22268
      Politics     9268                           9039                          18307
      Showbiz      8997                           9949                          18946
      Sport        7445                           7191                          14636
      Tech         7971                           7548                          15519
      Travel       14931                          12612                         27543
      US           8488                           8707                          17195
      World        22936                          23441                         46377

      • 12 Classes
      • Total words 254,333
          • Training Set 129,011
          • Test Set 125,322
  • 11. Preprocessing
      • Remove “stop words”
      • Remove punctuation (hyphenation excepted)
      • No lemmatization
      • SharpNLP – Open Source Natural Language Processing (https://sharpnlp.codeplex.com/)
          • sentence splitter
          • tokenizer
          • part-of-speech tagger
          • chunker
          • parser
          • name finder
          • coreference tool
          • interface to the WordNet lexical database
      • File parsed with each word tagged with part of speech
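
A minimal Python sketch of the first three preprocessing steps follows. It does not use SharpNLP; the tiny stop-word list, the lower-casing, and the handling of possessive "'s" are all assumptions made for illustration, and part-of-speech tagging is omitted.

```python
import re

# Illustrative stop-word list only; the list used in the talk is not given.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "it", "as"}

def preprocess(text):
    """Lower-case, drop punctuation except hyphens inside words, drop stop words."""
    text = re.sub(r"'s\b", "", text.lower())          # strip possessive 's (assumption)
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text)  # keeps "off-line" whole
    return [t for t in tokens if t not in STOP_WORDS]  # no lemmatization

print(preprocess("The country's 17 nuclear reactors are off-line."))
```

Keeping hyphenated words as single tokens matches the slide's note that hyphenation is excepted from punctuation removal.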
  • 12. Program Classes (Not CNN Classes)
      • Word Class
          • m_Spelling
          • m_Count (frequency of word in file)
          • m_Weight (assigned by different TF*IDF measures)
          • m_HasHyphen (hyphenated words count as a single word)
          • m_PennTags (part-of-speech tags)
          • m_Tags (number of tags associated with word)
      • TermFrequencies Class
          • m_TermName;
          • int m_TermFreq;
      • Classify Class
          • m_businessWords;
          • m_healthWords;
          • m_justiceWords;
          • m_livingWords;
          • m_opinionWords;
          • m_politicsWords;
          • m_showbizWords;
          • m_sportWords;
          • m_techWords;
          • m_travelWords;
          • m_usWords;
          • m_worldWords;
          • m_confusionMatrx
  • 13. Algorithm
      A. Using Training Set files in each class (i.e. do this 12 times):
        1.0 For each file in the set: create a word object for every unique word in the file
        2.0 Count the total number of occurrences of each unique word for the entire set of documents
        3.0 Calculate the weight of each word:
            total occurrences of word_i in all files / total occurrences of all words in all files

            $T_{word_i} = \frac{\sum_{j=1}^{nfiles} f(w_{i,j})}{\sum_{j=1}^{nfiles} \sum_{i=1}^{nwords} f(w_{i,j})}$
  • 14. Algorithm
      B. Using the Testing Set files in a specific class (e.g. business):
        1.0 For each file in the set: create a word object for every unique word in the file
        2.0 Count the total number of occurrences of each unique word for the entire set of documents
        3.0 Calculate the weight of each word:
            $T_{word_{i,j}}$ = total occurrences of word_i in file / total occurrences of all words in file
            $T_{word_i}$ = total occurrences of word_i in all files / total occurrences of all words in all files

            $T_{word_{i,j}} = \frac{f(w_{i,j})}{\sum_{i=1}^{nwords} f(w_{i,j})}$

            $T_{word_i} = \frac{\sum_{j=1}^{nfiles} f(w_{i,j})}{\sum_{j=1}^{nfiles} \sum_{i=1}^{nwords} f(w_{i,j})}$
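
The two weights in steps A and B are plain relative frequencies, which a short Python sketch makes concrete. The function names are hypothetical, and each file is assumed to be already preprocessed into a list of tokens.

```python
from collections import Counter

def file_weights(tokens):
    """Per-file weight: occurrences of a word in the file / all words in the file."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def class_weights(files):
    """Per-class weight: occurrences of a word in all files / all words in all files."""
    counts = Counter()
    for tokens in files:
        counts.update(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

files = [["nuclear", "germany"], ["nuclear", "nuclear"]]
print(file_weights(files[0]))
print(class_weights(files))
```

In both cases the weights of all words sum to 1, so a file or class is represented as a distribution over its vocabulary.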
  • 15. Algorithm
      D. Classify each word_i in one test file by comparing to the same word in all training classes:
         e.g. $Test_{word_i}$ = word_i in Business test class

         $C_{business} = Test_{word_i} \times Business\_Train_{word_i}$
         $C_{health} = Test_{word_i} \times Health\_Train_{word_i}$
         $C_{justice} = Test_{word_i} \times Justice\_Train_{word_i}$
         . . .
         $C_{world} = Test_{word_i} \times World\_Train_{word_i}$

         $\mathit{Class} = \max(C_{business}, C_{health}, \ldots, C_{world})$
  • 16. Algorithm
      C. Classify each word_i in the entire test class by comparing to the same word in all training classes:
         e.g. $Test_{word_i}$ = word_i in Business test class

         $C_{business} = Test_{word_i} \times Business\_Train_{word_i}$
         $C_{health} = Test_{word_i} \times Health\_Train_{word_i}$
         $C_{justice} = Test_{word_i} \times Justice\_Train_{word_i}$
         . . .
         $C_{world} = Test_{word_i} \times World\_Train_{word_i}$

         $\mathit{Class} = \max(C_{business}, C_{health}, \ldots, C_{world})$
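
The classification step can be sketched as follows. One assumption is made explicit: the per-word products are summed over all words of the test file (or test class) to get a single score per training class, which is consistent with the single value per class reported in the example slides but is not spelled out in the formulas. The names `class_scores` and `classify` and the toy weights are illustrative.

```python
def class_scores(test_weights, train_weights_by_class):
    """Score each class as the sum over words of test weight x training weight."""
    return {
        cls: sum(w * train.get(word, 0.0) for word, w in test_weights.items())
        for cls, train in train_weights_by_class.items()
    }

def classify(test_weights, train_weights_by_class):
    """Pick the class with the maximum score."""
    scores = class_scores(test_weights, train_weights_by_class)
    return max(scores, key=scores.get)

# Toy weights, not taken from the CNN data.
train = {
    "business": {"market": 0.5, "bank": 0.5},
    "sport": {"goal": 0.9, "market": 0.1},
}
test = {"market": 0.6, "bank": 0.4}
print(classify(test, train))
```

Words absent from a training class contribute zero, so a class is rewarded only for vocabulary it shares with the test text.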
  • 17. CNN Data Set – Example Article – Business Class

      After Fukushima: Could Germany's nuclear gamble backfire?

      As Germany's switchover from nuclear power to renewable energy gathers pace, concerns are mounting over the cost to the country's prosperity and its already squeezed consumers. Politicians in Europe's largest economy want renewable power to contribute 35% of the country's electricity consumption by 2020 and 80% by 2050 as part of its clean energy drive.

      The country's 'energiewende' -- translated as energy transformation -- is part of the government's plan to move away from nuclear power and fossil fuels to renewable energy sources, following Japan's Fukushima disaster in 2011.

      Michael Limburg, vice-president of the European Institute for Climate and Energy, told CNN that the government's energy targets are 'completely unfeasible.' 'Of course, it's possible to erect tens of thousands of windmills but only at an extreme cost and waste of natural space,' he said. 'And still it would not be able to deliver electricity when it is needed.'

      The government is investing heavily in onshore and offshore wind farms and solar technology in an effort to reduce 40% of greenhouse gas emissions by 2020. Last year Chancellor Angela Merkel, who this week won her third term as Germany's leader, proposed to construct offshore wind farms in the North Sea, a plan that would cost 200 billion euros ($270 billion), according to the DIW economic institute in Berlin.

      As part of the energy drive, Merkel also pledged to permanently shut down the country's 17 nuclear reactors, which fuel 18% of the country's power needs. Under Germany's Atomic Energy Act, the last nuclear power plant will be disconnected by 2022.
  • 18. CNN Data Set – Example Frequencies – Training Set

      Single File
        m_TermName (string)   m_TermFreq (int)   % Occurrence in File
        "fukushima"           6                  0.0102739726027397
        "germany"             1                  0.00156739811912226
        "nuclear"             12                 0.0205479452054795

      All Files in Class
        m_TermName (string)   m_TermFreq (int)   % Occurrence in Class
        "fukushima"           9                  0.000307188203972967
        "germany"             26                 0.000887432589255239
        "nuclear"             33                 0.00112635674790088
  • 19. CNN Data Set – Example Frequencies – Test Set

      Single File
        m_TermName (string)   m_TermFreq (int)   % Occurrence in File
        "fukushima"           2                  0.0056022408963585
        "germany"             9                  0.0252100840336134
        "nuclear"             5                  0.0140056022408964

      All Files in Class
        m_TermName (string)   m_TermFreq (int)   % Occurrence in Class
        "fukushima"           2                  0.000073773515308004
        "germany"             21                 0.000774621910734047
        "nuclear"             6                  0.000221320545924013
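
The test-set values on slide 19 can be checked by dividing each count by a word total. The totals used below (357 words in the file, 27,110 words in the class) do not appear on the slides; they are inferred here by dividing the listed counts by the listed values, so treat them as assumptions. Note also that the values are fractions rather than percentages, despite the column label.

```python
# Counts copied from slide 19 (test set).
file_counts = {"fukushima": 2, "germany": 9, "nuclear": 5}
class_counts = {"fukushima": 2, "germany": 21, "nuclear": 6}

FILE_TOTAL = 357      # inferred total words in the single test file
CLASS_TOTAL = 27110   # inferred total words in all Business test files

def fractions(counts, total):
    """Relative frequency of each term: count / total words."""
    return {term: count / total for term, count in counts.items()}

in_file = fractions(file_counts, FILE_TOTAL)
in_class = fractions(class_counts, CLASS_TOTAL)
for term in file_counts:
    print(term, in_file[term], in_class[term])
```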
  • 20. Classify All Words in Single Business Test File

      Class      Value
      Business   0.00069364   (MAX Class Value)
      Health     0.00030063
      Justice    0.00025000
      Living     0.00026707
      Opinion    0.00033446
      Politics   0.00034694
      Showbiz    0.00025372
      Sport      0.00029984
      Tech       0.00033337
      Travel     0.00023201
      US         0.00031539
      World      0.00040208
  • 21. Classify All Words in All Business Test Files

      Class      Value
      Business   0.00059513   (MAX Class Value)
      Health     0.00035854
      Justice    0.00027830
      Living     0.00038269
      Opinion    0.00039295
      Politics   0.00036828
      Showbiz    0.00029162
      Sport      0.00036698
      Tech       0.00040147
      Travel     0.00032406
      US         0.00032592
      World      0.00037747
  • 22. Confusion Matrix
      • Each column contains samples of classifier output
      • Each row contains samples in true class
      • Each row sums to 1.0
      • Diagonal shows the fraction classified correctly
          • Mean of diagonal = 89%
      • Off-diagonal shows the types of errors that occur
          • A is misclassified as B – 3%
          • A is misclassified as C – 3%

      Normalized Confusion Matrix
                                               Classifier Output (Computed Classification)
                                               A      B      C
      True Class of the Samples (Input)   A    0.94   0.03   0.03
                                          B    0.08   0.85   0.07
                                          C    0.08   0.04   0.88
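
Row normalization is simple enough to show in a few lines. The raw counts below are hypothetical, chosen only so that dividing each row by its sum reproduces the three-class matrix on the slide; the function name is also illustrative.

```python
# Hypothetical raw counts: rows are true classes A, B, C; columns are predictions.
counts = [
    [94, 3, 3],
    [8, 85, 7],
    [8, 4, 88],
]

def normalize_rows(matrix):
    """Divide every entry by its row total so each row sums to 1.0."""
    return [[value / sum(row) for value in row] for row in matrix]

normalized = normalize_rows(counts)
diagonal = [normalized[i][i] for i in range(len(normalized))]
mean_accuracy = sum(diagonal) / len(diagonal)
print(normalized)
print(round(mean_accuracy, 2))
```

Because each row is scaled by its own total, the diagonal mean weights every class equally regardless of how many samples it has.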
  • 23. Results - Classification

                 business  health  justice  living  opinion  politics  showbiz  sport   tech    travel  us      world
      business   0.75      0       0        0.0875  0        0         0        0.025   0.1125  0.025   0       0
      health     0         0.7724  0.0207   0.1793  0        0.0138    0.0069   0       0.0069  0       0       0
      justice    0         0       0.9018   0.0179  0        0.0446    0.0089   0       0       0       0.0268  0
      living     0.0204    0.0408  0        0.8163  0        0.0204    0        0.0612  0.0408  0       0       0
      opinion    0.2083    0.0729  0.0208   0.2708  0.0313   0.1667    0        0.0417  0.0417  0       0.0104  0.1354
      politics   0.0103    0.0103  0.0515   0.0412  0        0.8557    0        0       0       0       0       0.0309
      showbiz    0.0083    0.0083  0.1583   0.1417  0        0.0083    0.6417   0.025   0       0       0.0083  0
      sport      0.027     0.0135  0.0405   0.0541  0        0.027     0        0.8108  0       0.027   0       0
      tech       0.0303    0.0303  0.0152   0.2121  0        0         0.0152   0       0.6818  0.0152  0       0
      travel     0.1412    0.0118  0.0235   0.1412  0        0.0824    0        0.0353  0.0588  0.4353  0.0471  0.0235
      us         0.025     0.05    0.3125   0.175   0        0.1125    0.025    0.0625  0.0375  0.0125  0.1875  0
      world      0.0769    0.0142  0.1316   0.0789  0.002    0.1255    0.0061   0.0142  0.0202  0.002   0.0142  0.5142

      • Note that the diagonal entries (in bold on the slide) are the correct classifications
      • The rows sum to 1.0 since the left column represents the actual class from which the document is taken
      • The column sums have a mean of 1.0, with some variance depending on whether the class in the column is an attractor class (> 1.0) or a repulsor class (< 1.0)
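
The summary figures quoted for this matrix can be recomputed from its entries. The sketch below copies the 12 x 12 values from the slide (rounded to four places there, so results agree with the reported numbers only to about three decimals) and computes the row sums, the diagonal mean, and the column sums.

```python
LABELS = ["business", "health", "justice", "living", "opinion", "politics",
          "showbiz", "sport", "tech", "travel", "us", "world"]

# Rows are true classes, columns are classifier outputs, in LABELS order.
M = [
    [0.75, 0, 0, 0.0875, 0, 0, 0, 0.025, 0.1125, 0.025, 0, 0],
    [0, 0.7724, 0.0207, 0.1793, 0, 0.0138, 0.0069, 0, 0.0069, 0, 0, 0],
    [0, 0, 0.9018, 0.0179, 0, 0.0446, 0.0089, 0, 0, 0, 0.0268, 0],
    [0.0204, 0.0408, 0, 0.8163, 0, 0.0204, 0, 0.0612, 0.0408, 0, 0, 0],
    [0.2083, 0.0729, 0.0208, 0.2708, 0.0313, 0.1667, 0, 0.0417, 0.0417, 0, 0.0104, 0.1354],
    [0.0103, 0.0103, 0.0515, 0.0412, 0, 0.8557, 0, 0, 0, 0, 0, 0.0309],
    [0.0083, 0.0083, 0.1583, 0.1417, 0, 0.0083, 0.6417, 0.025, 0, 0, 0.0083, 0],
    [0.027, 0.0135, 0.0405, 0.0541, 0, 0.027, 0, 0.8108, 0, 0.027, 0, 0],
    [0.0303, 0.0303, 0.0152, 0.2121, 0, 0, 0.0152, 0, 0.6818, 0.0152, 0, 0],
    [0.1412, 0.0118, 0.0235, 0.1412, 0, 0.0824, 0, 0.0353, 0.0588, 0.4353, 0.0471, 0.0235],
    [0.025, 0.05, 0.3125, 0.175, 0, 0.1125, 0.025, 0.0625, 0.0375, 0.0125, 0.1875, 0],
    [0.0769, 0.0142, 0.1316, 0.0789, 0.002, 0.1255, 0.0061, 0.0142, 0.0202, 0.002, 0.0142, 0.5142],
]

row_sums = [sum(row) for row in M]
diag_mean = sum(M[i][i] for i in range(len(M))) / len(M)
col_sums = {LABELS[j]: sum(row[j] for row in M) for j in range(len(M))}

print(round(diag_mean, 4))
print({name: round(total, 4) for name, total in col_sums.items()})
```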
  • 24. Example of Incorrectly Classified File

      Results of File from Opinion Test Class:

      Class      Value
      Business   0.00033924
      Health     0.00025056
      Justice    0.00027728
      Living     0.00027807
      Opinion    0.00041936   (3rd MAX Class Value)
      Politics   0.00046704   (MAX Class Value)
      Showbiz    0.00023136
      Sport      0.00028422
      Tech       0.00025991
      Travel     0.00021793
      US         0.00032973
      World      0.00043251   (2nd MAX Class Value)

      It takes 3 tries to get it right
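
The number of tries for this file is the rank of the true class when the class values are sorted from highest to lowest. A small sketch, using the values from the slide and a hypothetical helper name:

```python
# Class values for one Opinion test file, copied from slide 24.
scores = {
    "Business": 0.00033924, "Health": 0.00025056, "Justice": 0.00027728,
    "Living": 0.00027807, "Opinion": 0.00041936, "Politics": 0.00046704,
    "Showbiz": 0.00023136, "Sport": 0.00028422, "Tech": 0.00025991,
    "Travel": 0.00021793, "US": 0.00032973, "World": 0.00043251,
}

def attempts_to_success(scores, true_class):
    """1-based rank of the true class when classes are tried best score first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(true_class) + 1

print(attempts_to_success(scores, "Opinion"))
```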
  • 25. Classification Attempts to Success
      Measures the average number of attempts until the correct class is chosen:

      $\frac{1 \times P_1 + 2 \times P_2 + 3 \times P_3 + \ldots + 12 \times P_{12}}{nfiles}$

      where
        P_1 = number correctly classified on the first try
        P_2 = number correctly classified after two tries
        . . .
        P_12 = number correctly classified on the last try
        nfiles = number of files in the testing class

      Example: Worst Class – Opinion
        P_1 = 3
        P_2 = 35 (× 2)
        P_3 = 31 (× 3)
        P_4 = 14 (× 4)
        P_5 = 7 (× 5)
        P_6 = 3 (× 6)
        P_7 = 2 (× 7)
        P_8 = 1 (× 8)
        Σ = 297; 297 / 96 = 3.09
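
The Opinion example works out as below. The counts per attempt are the ones listed on the slide; the function name is an illustrative choice.

```python
# Files from the Opinion test class by the attempt on which they were
# classified correctly (attempt number -> number of files).
opinion_counts = {1: 3, 2: 35, 3: 31, 4: 14, 5: 7, 6: 3, 7: 2, 8: 1}

def mean_attempts(counts_by_attempt):
    """Weighted mean of the attempt number, weighted by the number of files."""
    n_files = sum(counts_by_attempt.values())
    weighted = sum(attempt * n for attempt, n in counts_by_attempt.items())
    return weighted / n_files

print(sum(opinion_counts.values()))            # files in the test class
print(round(mean_attempts(opinion_counts), 2))
```

A class whose files are always classified correctly on the first try would score exactly 1.0, the ideal value.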
  • 26. Results - Classification Attempts to Success
      • Measures the average number of attempts until the correct class is chosen
      • Ideal is 1.0 – we get it right on the first try
      • Best Class – Justice
          • Correctly classified: 0.9018
          • Mean classification attempts: 1.19
          • Delta from ideal = 0.19
      • Worst Class – Opinion
          • Correctly classified: 0.0313
          • Mean classification attempts: 3.09
          • Delta from ideal = 2.09
      • By delta from ideal, the best class is 11 times better than the worst class (2.09 / 0.19)
      • All other classes fall between best and worst
  • 27. Results - General
      • Confusion matrix shows good classification results
      • Average classification rate for all classes = 0.61655883
      • Classification Errors:
          • Attractor classes: Business, Health, Justice, Living, Politics, Sport, Tech
          • Repulsor classes: Opinion, Showbiz, Travel, U.S., World
      • Normalized by total occurrences of all words in file
          • For classification of single file
      • Normalized by total occurrences of all words in the class
          • For classification of multiple files

      Column sums of the confusion matrix:
        business  health  justice  living  opinion  politics  showbiz  sport   tech    travel  us      world
        1.2978    1.0245  1.6765   2.216   0.0333   1.4569    0.7037   1.0757  1.0003  0.517   0.2943  0.704
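
The attractor and repulsor lists follow directly from the column sums: a class whose column sums to more than 1.0 receives more documents than it should, and one below 1.0 receives fewer. A short sketch using the values from the slide (the variable names are illustrative):

```python
# Column sums of the normalized confusion matrix, copied from the slide.
column_sums = {
    "business": 1.2978, "health": 1.0245, "justice": 1.6765, "living": 2.216,
    "opinion": 0.0333, "politics": 1.4569, "showbiz": 0.7037, "sport": 1.0757,
    "tech": 1.0003, "travel": 0.517, "us": 0.2943, "world": 0.704,
}

attractors = [name for name, total in column_sums.items() if total > 1.0]
repulsors = [name for name, total in column_sums.items() if total < 1.0]

print("Attractor classes:", attractors)
print("Repulsor classes:", repulsors)
```

Tech sits almost exactly at 1.0 (1.0003), so its classification as an attractor is marginal.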
  • 28. Discussion
      Opinion, Justice, World, U.S.
      1. Documents more varied?
      2. Generic words?
      3. Topics overlap?
      4. Word clusters are broader?
  • 29. Future Directions
      • Automatic summarization based on word frequencies in sentences
          • Data from Brazil also contained Gold Standard sentences for summarization
          • Each file contains sentences pulled out of the full article by at least 3 students
          • Gold Standard sentences for each file act as ground truth for automatic summarization
      • New York Times Annotated Corpus: (https://catalog.ldc.upenn.edu/LDC2008T19)
          • Written and published by the New York Times between January 1, 1987 and June 19, 2007
          • Metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com:
              • Over 1.8 million articles
              • Over 650,000 article summaries written by library scientists
              • Over 1,500,000 articles manually tagged by library scientists with tags drawn from a normalized indexing vocabulary of people, organizations, locations and topic descriptors
              • Over 275,000 algorithmically-tagged articles that have been hand verified by the
  • 30. Conclusions
      • A family of TF*IDF metrics for summarization and classification
      • A simple TF*IDF metric
      • Classification scheme that works well on a set of 3,000 CNN articles separated into 12 classes
      • Classification attempts to success is a measure that tells us how hard it is to classify
      • Attractor and repulsor classes may help for identifying imbalances in the data
      • Simple TF*IDF metric can be used for benchmarking the rest of the 112 TF*IDF
  • 31. Thank You for Your Kind Attention