This document summarizes an application that classifies news articles into four categories (Tech, My Study, Biz, Life) using a naive Bayes classifier trained on notes from the user's Evernote account. The application crawls the user's Evernote for training data, estimates word probabilities from the notes, collects news articles from various sources, classifies the articles using the trained model, and displays the top results for each category on a web interface. The document discusses the software architecture and implementation details, and raises some questions about the classifier's performance and scoring method.
2. Background
• Evernote
• One of my favorite applications
• I can clip interesting news and technical articles using Evernote
I will show a web-based application utilizing Evernote
3. Brief Overview
• News Classifier
• Divides news articles into 4 categories
• Tech
• General technical news
• My Study
• IT news relevant to me, such as programming, development, and computer architectures
• Biz
• Business news
• Life
• Other news, such as politics, economics, sports, and music
• Displays high-score articles
• Training data
• Notes clipped into my Evernote
• The notes in my Evernote are basically divided into the above 4 folders
4. Naïve Bayes Classifier
• It divides documents into categories:
  argmax_cat P(cat|doc)
• My application also returns max_cat P(cat|doc) for scoring
• The classifier calculates P(cat|doc) using Bayes' theorem
• Given doc, it returns
  P(cat|doc) = P(cat) P(doc|cat) / Σ_cat' P(cat') P(doc|cat')
• P(cat) and P(doc|cat) are calculated using the multinomial model
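The classification rule above can be sketched in a few lines of Ruby. The priors and per-category document likelihoods below are made-up illustrative numbers, not values from the presentation:

```ruby
# Hypothetical P(cat) and P(doc|cat) values for one document.
priors      = { "Tech" => 0.4, "Biz" => 0.3, "Life" => 0.3 }
likelihoods = { "Tech" => 1e-9, "Biz" => 4e-9, "Life" => 2e-9 }

# Bayes' theorem: P(cat|doc) = P(cat) P(doc|cat) / Σ_cat' P(cat') P(doc|cat')
joint     = priors.keys.map { |c| [c, priors[c] * likelihoods[c]] }.to_h
evidence  = joint.values.sum
posterior = joint.transform_values { |v| v / evidence }

# best_cat is the argmax; score is max P(cat|doc), used for ranking.
best_cat, score = posterior.max_by { |_, p| p }
```

The denominator (the evidence) is the same for every category, so the argmax could skip it; it is kept here because the application also uses the normalized posterior as the article's score.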
5. (Simplified) Multinomial Model
• A document is defined as a sequence of n words:
  doc := (w1, w2, …, wn)
• Suppose documents are generated by repeatedly picking up one word at a time:
  P(doc|cat) = P(w1|cat) P(w2|cat) ⋯ P(wn|cat)
[Figure: two bags of words (w1, w2, w3) illustrating the per-category word distributions of Category 1 and Category 2]
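The product of word probabilities above is easiest to compute in log space, since multiplying many small probabilities underflows for long documents. A minimal sketch, with made-up word probabilities:

```ruby
# Hypothetical P(w|cat) values for one category.
p_word = { "iphone" => 0.03, "price" => 0.01, "game" => 0.005 }

doc = ["iphone", "price", "iphone"]   # doc := (w1, w2, ..., wn)

# log P(doc|cat) = log P(w1|cat) + log P(w2|cat) + ... + log P(wn|cat)
log_likelihood = doc.sum { |w| Math.log(p_word[w]) }
```

Exponentiating `log_likelihood` recovers the product P(w1|cat) P(w2|cat) ⋯ P(wn|cat) exactly.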
6. Estimating model parameters P(cat), P(w|cat)
• Simple method:
  P(cat) = N_c / N,  P(w|cat) = n_{w,c} / n_c
  • N_c: number of documents in category c
  • N: total number of documents
  • n_{w,c}: count of word w in documents of category c
  • n_c: total word count in documents of category c
• Disadvantage: if n_{w,c} = 0, then P(w|cat) = 0, so
  P(doc|cat) = P(w1|cat) P(w2|cat) ⋯ P(wn|cat) = 0
• Improved method: smoothing
  • Assume a Dirichlet prior (α = 2) for P(cat) and P(w|cat) ⇒
  P(cat) = (N_c + 1) / (N + |cat|),  P(w|cat) = (n_{w,c} + 1) / (n_c + |W|)
  • |cat|: number of categories
  • |W|: vocabulary size
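The smoothed estimates can be computed directly from the counts. The counts below are hypothetical; note that the zero-count word no longer gets zero probability:

```ruby
docs_per_cat = { "Tech" => 30, "Biz" => 20, "Life" => 50 }  # N_c
n_total      = docs_per_cat.values.sum                       # N
n_cats       = docs_per_cat.size                             # |cat|

word_counts = { "iphone" => 12, "price" => 3, "game" => 0 }  # n_{w,c} for one category
n_c         = word_counts.values.sum                         # total words in the category
vocab_size  = word_counts.size                               # |W| (toy vocabulary)

# P(cat) = (N_c + 1) / (N + |cat|)
p_cat = docs_per_cat.transform_values { |nc| (nc + 1.0) / (n_total + n_cats) }

# P(w|cat) = (n_{w,c} + 1) / (n_c + |W|); unseen words get a small nonzero mass
p_word = word_counts.transform_values { |nw| (nw + 1.0) / (n_c + vocab_size) }
```

Both distributions still sum to one over their support, which is what makes the +1 counts a valid smoothing rather than an ad-hoc fix.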
8. Crawling Evernote
• I used the Evernote Ruby API to crawl my Evernote
• Each note is converted with to_json: the notebook name becomes the category, and the note contributes its title and its XML content

{
  "cat": "Tech",
  "title": "\"90% of smartphones are iPhones\" \"I don't buy paid stickers\": asking a high-school girl entrepreneur about smartphone habits",
  "content": "・・・"
}

I used the "title" and the "content" as the training document
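The note-to-JSON step can be sketched as follows. The `note` hash here stands in for a note fetched with the Evernote Ruby API (the actual API calls are not shown, and its real note objects differ):

```ruby
require 'json'

# Stand-in for a note fetched via the Evernote Ruby API.
note = {
  notebook: "Tech",                            # notebook name is used as the category
  title:    "Sample article title",
  content:  "<en-note>Sample body</en-note>"   # Evernote stores note content as XML
}

# Build the training record in the shape shown on the slide.
record = { "cat" => note[:notebook], "title" => note[:title], "content" => note[:content] }
json   = JSON.generate(record)
```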
9. Training
• Outline
1. Parse XML content using the Nokogiri library
2. Divide sentences into words using the MeCab library
3. Calculate the model parameters
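The three steps above can be sketched in pure Ruby. For illustration, a regex stands in for Nokogiri's XML parsing and a whitespace split stands in for MeCab's morphological analysis (real Japanese text has no spaces between words and needs MeCab or a similar analyzer):

```ruby
# 1. Strip XML tags from the note content (Nokogiri does this robustly).
xml  = "<en-note>naive bayes classifier for naive users</en-note>"
text = xml.gsub(/<[^>]+>/, " ").strip

# 2. Tokenize; MeCab would segment Japanese sentences into words here.
words = text.split

# 3. Accumulate word counts n_{w,c} for the model parameters.
counts = Hash.new(0)
words.each { |w| counts[w] += 1 }
```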
10. Crawling the web
• News sites
• IT Media
• Yahoo! News
• Impress Watch
• TechCrunch
• Gizmodo
• Nikkan Sports
• Nikkei BP
• I downloaded RSS files using Ruby and saved them as JSON files

{
  "title": "Unpatched vulnerability in IE 10: with exploit attacks underway, update to IE 11",
  "link": "http://rss.rssad.jp/rss/artclk/VlF3IIxoZHoi/・・・",
  "desc": "・・・"
},

I used the "title" and the "desc" (description) for classification
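The RSS-to-JSON step can be sketched with Ruby's bundled `rss` library. The feed below is a minimal made-up example; the real feeds come from the news sites listed above:

```ruby
require 'rss'
require 'json'

# Minimal made-up RSS 2.0 feed standing in for a downloaded news feed.
feed_xml = <<~XML
  <?xml version="1.0"?>
  <rss version="2.0">
    <channel>
      <title>Example News</title>
      <link>http://example.com/</link>
      <description>demo feed</description>
      <item>
        <title>Sample headline</title>
        <link>http://example.com/article1</link>
        <description>Sample summary</description>
      </item>
    </channel>
  </rss>
XML

feed  = RSS::Parser.parse(feed_xml)
items = feed.items.map do |item|
  { "title" => item.title, "link" => item.link, "desc" => item.description }
end
json = JSON.generate(items)
```

Each item becomes one record in the saved JSON file, in the title/link/desc shape shown on the slide.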
11. Application
• Web-based
• Ruby + Sinatra + jQuery mobile
• Articles are sorted by P(cat|doc)
• The application displays the top 5 articles for each category
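The ranking step above amounts to grouping classified articles by category and keeping the highest-scoring ones. A minimal sketch with illustrative scores:

```ruby
TOP_N = 5   # the slides display the top 5 articles per category

# Hypothetical classified articles with their P(cat|doc) scores.
articles = [
  { cat: "Tech", title: "A", score: 0.91 },
  { cat: "Tech", title: "B", score: 0.97 },
  { cat: "Biz",  title: "C", score: 0.80 }
]

# Group by category, sort each group by score (descending), keep TOP_N.
top_by_cat = articles.group_by { |a| a[:cat] }.transform_values do |group|
  group.sort_by { |a| -a[:score] }.first(TOP_N)
end
```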
12. Discussion
• Most of the displayed articles matched my interests
• Some articles were assigned to the wrong category
• Possible reasons
• No stemming
• Few stop words
• Is it appropriate to use P(cat|doc) as the score?