This document summarizes an application that classifies news articles into four categories (Tech, My Study, Biz, Life) using a naive Bayes classifier trained on notes from the user's Evernote account. The application crawls the user's Evernote for training data, estimates word probabilities from the notes, collects news articles from various sources, classifies the articles using the trained model, and displays the top results for each category on a web interface. The document discusses the software architecture and implementation details, and raises some questions about the classifier's performance and scoring method.
2. Background
• Evernote
• One of my favorite applications
• I can clip interesting news and technical articles using Evernote
I will show a web-based application utilizing Evernote
3. Brief Overview
• News Classifier
• Divides news articles into 4 categories
• Tech
• General technical news
• My Study
• IT news relevant to me, such as programming, development, and computer architectures
• Biz
• Business news
• Life
• Other news, such as politics, economics, sports, and music
• Displays high-score articles
• Training data
• Notes clipped into my Evernote
• The notes in my Evernote are basically divided into the above 4 folders
4. Naïve Bayes Classifier
• It divides documents into categories:
  argmax_cat P(cat|doc)
• My application also returns max_cat P(cat|doc) for scoring
• The classifier calculates P(cat|doc) using Bayes' theorem
• Given doc, it returns
  P(cat|doc) = P(cat) P(doc|cat) / Σ_cat' P(cat') P(doc|cat')
• P(cat) and P(doc|cat) are calculated using the multinomial model
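The classification rule above can be sketched in a few lines of Ruby. The priors and per-category document likelihoods below are made-up illustrative numbers, not values from the presentation:

```ruby
# Hypothetical P(cat) and P(doc|cat) values for one document.
priors      = { "Tech" => 0.4, "Biz" => 0.3, "Life" => 0.3 }
likelihoods = { "Tech" => 1e-9, "Biz" => 4e-9, "Life" => 2e-9 }

# Bayes' theorem: P(cat|doc) = P(cat) P(doc|cat) / Σ_cat' P(cat') P(doc|cat')
joint     = priors.keys.map { |c| [c, priors[c] * likelihoods[c]] }.to_h
evidence  = joint.values.sum
posterior = joint.transform_values { |v| v / evidence }

# best_cat is the argmax; score is max P(cat|doc), used for ranking.
best_cat, score = posterior.max_by { |_, p| p }
```

The denominator (the evidence) is the same for every category, so the argmax could skip it; it is kept here because the application also uses the normalized posterior as the article's score.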
5. (Simplified) Multinomial Model
• A document is defined as a sequence of n words:
  doc := (w1, w2, …, wn)
• Suppose documents are generated by repeatedly picking up one word at a time:
  P(doc|cat) = P(w1|cat) P(w2|cat) ⋯ P(wn|cat)
[Figure: two bags of words (w1, w2, w3) illustrating the per-category word distributions of Category 1 and Category 2]
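The product of word probabilities above is easiest to compute in log space, since multiplying many small probabilities underflows for long documents. A minimal sketch, with made-up word probabilities:

```ruby
# Hypothetical P(w|cat) values for one category.
p_word = { "iphone" => 0.03, "price" => 0.01, "game" => 0.005 }

doc = ["iphone", "price", "iphone"]   # doc := (w1, w2, ..., wn)

# log P(doc|cat) = log P(w1|cat) + log P(w2|cat) + ... + log P(wn|cat)
log_likelihood = doc.sum { |w| Math.log(p_word[w]) }
```

Exponentiating `log_likelihood` recovers the product P(w1|cat) P(w2|cat) ⋯ P(wn|cat) exactly.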
6. Estimating model parameters P(cat), P(w|cat)
• Simple method:
  P(cat) = N_c / N,  P(w|cat) = n_{w,c} / n_c
  • N_c: number of documents in category c
  • N: total number of documents
  • n_{w,c}: count of word w in documents of category c
  • n_c: total word count in documents of category c
• Disadvantage: if n_{w,c} = 0, then P(w|cat) = 0, so
  P(doc|cat) = P(w1|cat) P(w2|cat) ⋯ P(wn|cat) = 0
• Improved method: smoothing
  • Assume a Dirichlet prior (α = 2) for P(cat) and P(w|cat) ⇒
  P(cat) = (N_c + 1) / (N + |cat|),  P(w|cat) = (n_{w,c} + 1) / (n_c + |W|)
  • |cat|: number of categories
  • |W|: vocabulary size
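The smoothed estimates can be computed directly from the counts. The counts below are hypothetical; note that the zero-count word no longer gets zero probability:

```ruby
docs_per_cat = { "Tech" => 30, "Biz" => 20, "Life" => 50 }  # N_c
n_total      = docs_per_cat.values.sum                       # N
n_cats       = docs_per_cat.size                             # |cat|

word_counts = { "iphone" => 12, "price" => 3, "game" => 0 }  # n_{w,c} for one category
n_c         = word_counts.values.sum                         # total words in the category
vocab_size  = word_counts.size                               # |W| (toy vocabulary)

# P(cat) = (N_c + 1) / (N + |cat|)
p_cat = docs_per_cat.transform_values { |nc| (nc + 1.0) / (n_total + n_cats) }

# P(w|cat) = (n_{w,c} + 1) / (n_c + |W|); unseen words get a small nonzero mass
p_word = word_counts.transform_values { |nw| (nw + 1.0) / (n_c + vocab_size) }
```

Both distributions still sum to one over their support, which is what makes the +1 counts a valid smoothing rather than an ad-hoc fix.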
8. Crawling Evernote
• I used the Evernote Ruby API to crawl my Evernote
• Each note is converted with to_json: the notebook name becomes the category, and the note contributes its title and its XML content

{
  "cat": "Tech",
  "title": "\"90% of smartphones are iPhones\" \"I don't buy paid stickers\": asking a high-school girl entrepreneur about smartphone habits",
  "content": "・・・"
}

I used the "title" and the "content" as the training document
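The note-to-JSON step can be sketched as follows. The `note` hash here stands in for a note fetched with the Evernote Ruby API (the actual API calls are not shown, and its real note objects differ):

```ruby
require 'json'

# Stand-in for a note fetched via the Evernote Ruby API.
note = {
  notebook: "Tech",                            # notebook name is used as the category
  title:    "Sample article title",
  content:  "<en-note>Sample body</en-note>"   # Evernote stores note content as XML
}

# Build the training record in the shape shown on the slide.
record = { "cat" => note[:notebook], "title" => note[:title], "content" => note[:content] }
json   = JSON.generate(record)
```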
9. Training
• Outline
1. Parse XML content using the Nokogiri library
2. Divide sentences into words using the MeCab library
3. Calculate the model parameters
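The three steps above can be sketched in pure Ruby. For illustration, a regex stands in for Nokogiri's XML parsing and a whitespace split stands in for MeCab's morphological analysis (real Japanese text has no spaces between words and needs MeCab or a similar analyzer):

```ruby
# 1. Strip XML tags from the note content (Nokogiri does this robustly).
xml  = "<en-note>naive bayes classifier for naive users</en-note>"
text = xml.gsub(/<[^>]+>/, " ").strip

# 2. Tokenize; MeCab would segment Japanese sentences into words here.
words = text.split

# 3. Accumulate word counts n_{w,c} for the model parameters.
counts = Hash.new(0)
words.each { |w| counts[w] += 1 }
```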
10. Crawling the web
• News sites
• IT Media
• Yahoo! News
• Impress Watch
• TechCrunch
• Gizmodo
• Nikkan Sports
• Nikkei BP
• I downloaded RSS files using Ruby and saved them as JSON files

{
  "title": "Unpatched vulnerability in IE 10: with exploit attacks underway, update to IE 11",
  "link": "http://rss.rssad.jp/rss/artclk/VlF3IIxoZHoi/・・・",
  "desc": "・・・"
},

I used the "title" and the "desc" (description) for classification
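The RSS-to-JSON step can be sketched with Ruby's bundled `rss` library. The feed below is a minimal made-up example; the real feeds come from the news sites listed above:

```ruby
require 'rss'
require 'json'

# Minimal made-up RSS 2.0 feed standing in for a downloaded news feed.
feed_xml = <<~XML
  <?xml version="1.0"?>
  <rss version="2.0">
    <channel>
      <title>Example News</title>
      <link>http://example.com/</link>
      <description>demo feed</description>
      <item>
        <title>Sample headline</title>
        <link>http://example.com/article1</link>
        <description>Sample summary</description>
      </item>
    </channel>
  </rss>
XML

feed  = RSS::Parser.parse(feed_xml)
items = feed.items.map do |item|
  { "title" => item.title, "link" => item.link, "desc" => item.description }
end
json = JSON.generate(items)
```

Each item becomes one record in the saved JSON file, in the title/link/desc shape shown on the slide.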
11. Application
• Web-based
• Ruby + Sinatra + jQuery mobile
• Articles are sorted by P(cat|doc)
• The application displays the top 5 articles for each category
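The ranking step above amounts to grouping classified articles by category and keeping the highest-scoring ones. A minimal sketch with illustrative scores:

```ruby
TOP_N = 5   # the slides display the top 5 articles per category

# Hypothetical classified articles with their P(cat|doc) scores.
articles = [
  { cat: "Tech", title: "A", score: 0.91 },
  { cat: "Tech", title: "B", score: 0.97 },
  { cat: "Biz",  title: "C", score: 0.80 }
]

# Group by category, sort each group by score (descending), keep TOP_N.
top_by_cat = articles.group_by { |a| a[:cat] }.transform_values do |group|
  group.sort_by { |a| -a[:score] }.first(TOP_N)
end
```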
12. Discussion
• Most of the displayed articles matched my interests
• Some articles were assigned to the wrong category
• Possible reasons
• No stemming
• Few stop words
• Is it appropriate to use P(cat|doc) as the score?